feat(judge): judge.yaml, golden datasets, and validate-judge CLI (PR#9) #9

Merged
clearclown merged 12 commits into main from feat/judge-yaml-and-golden on Apr 24, 2026

@clearclown clearclown commented Apr 24, 2026

Summary

Externalizes the hallucination-prevention constraints into a config file and adds a CI validation tool. Can proceed in parallel with PR#8.

Changes

prompts/judge.yaml

  • evidence: require_urls, url_must_resolve, skip_domains (EDINET API domain)
  • source_priority: edinet > chuusho_kaiji > corporate_site > gmaps_cluster
    • gmaps_cluster: standalone_allowed: false, standalone_max: 0.0
    • unknown_source: "error" (unknown sources must not pass silently)
  • confidence_rules + confidence_penalty (combine: multiply, floor: 0.0)
  • hallucination_guards: refuse_phrases (Japanese in both noun-ending (taigen-dome) and polite forms, plus English)
  • output_schema: required/optional field definitions
  • validation: declarative rules (empty source_url → reject, count_source not in priority → error, etc.)
  • ci_checks list
  • hard_rules: gmaps_cluster_requires_l3, unknown_source_rejected
  • brand_hq_review: additional constraints when is_brand_hq=true
  • judge_panel: min_agreement: 0.6
  • system_prompt / hallucination_instruction
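
One way to read `confidence_penalty` (combine: multiply, floor: 0.0) is sketched below: every applicable penalty factor is multiplied into the base confidence and the result is clamped at the floor. The helper name `combinePenalties`, its signature, and the sample factors are assumptions made for illustration and are not taken from judge.yaml or the validator.

```go
package main

import "fmt"

// combinePenalties multiplies base by every penalty factor, then clamps the
// result to the range [floor, 1.0].
// Hypothetical helper: the name and signature are not from the PR.
func combinePenalties(base float64, penalties []float64, floor float64) float64 {
	c := base
	for _, p := range penalties {
		c *= p
	}
	if c < floor {
		c = floor
	}
	if c > 1.0 {
		c = 1.0
	}
	return c
}

func main() {
	// Illustrative factors only; real values would come from judge.yaml.
	fmt.Printf("%.2f\n", combinePenalties(0.9, []float64{0.8, 0.5}, 0.0))
}
```

With multiplication, penalties compound instead of only the largest one applying, and the floor keeps the result from going negative if a factor is misconfigured.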

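test/fixtures/judgement-golden.csv (10 cases)

| # | Case | Expected result |
|---|------|-----------------|
| 1-3 | True positive (TODO_REAL_URL) | confidence≥0.6, is_mega=true |
| 4 | True negative (fewer than 10 stores) | is_mega=false |
| 5 | True negative (no store count stated) | reject |
| 6 | False positive (brand HQ name) | confidence<0.5 |
| 7-8 | source_must_contain_claim violation | reject |
| 9 | "About 20 stores" (ambiguous) | confidence<0.5 |
| 10 | evidence_urls empty | reject |

⚠️ The source_url for #1-3 is TODO_REAL_URL. Replace it with real URLs before merging.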

test/fixtures/cluster-golden.csv (8 cases)

Covers the ClusterThreshold=4 boundary cases from both above and below.
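
A boundary check of this kind is often written as a table-driven test around the threshold. The sketch below assumes that a group counts as a cluster when its size is greater than or equal to the threshold; the predicate `isCluster` and the `>=` comparison are assumptions, since the PR text does not state which side of the boundary is inclusive.

```go
package main

import "fmt"

// clusterThreshold mirrors the ClusterThreshold=4 value named in the PR.
const clusterThreshold = 4

// isCluster is a hypothetical predicate. The inclusive ">=" comparison is an
// assumption, not something the PR confirms.
func isCluster(size int) bool {
	return size >= clusterThreshold
}

func main() {
	// One case just below, one at, and one just above the threshold.
	cases := []struct {
		size int
		want bool
	}{
		{clusterThreshold - 1, false},
		{clusterThreshold, true},
		{clusterThreshold + 1, true},
	}
	for _, c := range cases {
		got := isCluster(c.size)
		fmt.Printf("size=%d cluster=%v ok=%v\n", c.size, got, got == c.want)
	}
}
```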

cmd/validate-judge/main.go

  • ValidationRule struct + applyRules()
  • SourcePriorityNames() reads the allowed sources dynamically from judge.yaml (no hardcoding)
  • Separate validation for source_has_claim (true/false/partial)
  • Single-output mode: --output output.json
  • golden.csv batch mode: --golden test/fixtures/judgement-golden.csv
  • --skip-url-resolve: skips URL resolution for fixture URLs in CI
  • URLs with the TODO_ prefix are skipped automatically
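
The rule-based shape described above can be sketched roughly as follows. This is a simplified stand-in, not the code from main.go: the trimmed `Output` struct, the first-match behavior of `applyRules`, the `isPlaceholder` helper, and passing the allowed sources as a plain slice (instead of reading them from judge.yaml) are all assumptions for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// Output is a trimmed-down stand-in for the validator's output struct.
type Output struct {
	SourceURL    string
	CountSource  string
	Reject       bool
	RejectReason string
}

// ValidationRule flags an output when Check returns true.
type ValidationRule struct {
	Name  string
	Check func(o Output) bool
}

// applyRules rejects the output with the name of the first rule that fires
// (first-match behavior is an assumption of this sketch).
func applyRules(o *Output, rules []ValidationRule) {
	for _, r := range rules {
		if r.Check(*o) {
			o.Reject = true
			o.RejectReason = r.Name
			return
		}
	}
}

// buildRules receives the allowed source names as data, so nothing is
// hardcoded inside the rules themselves.
func buildRules(allowed []string) []ValidationRule {
	set := make(map[string]bool, len(allowed))
	for _, s := range allowed {
		set[s] = true
	}
	return []ValidationRule{
		{Name: "source_url_empty", Check: func(o Output) bool { return o.SourceURL == "" }},
		{Name: "unknown_source_rejected", Check: func(o Output) bool { return !set[o.CountSource] }},
	}
}

// isPlaceholder reports whether a fixture URL is still a TODO_ placeholder.
func isPlaceholder(url string) bool {
	return strings.HasPrefix(url, "TODO_")
}

func main() {
	rules := buildRules([]string{"edinet", "chuusho_kaiji", "corporate_site", "gmaps_cluster"})
	o := Output{SourceURL: "https://example.com/ir", CountSource: "blog"}
	applyRules(&o, rules)
	fmt.Println(o.Reject, o.RejectReason)
	fmt.Println(isPlaceholder("TODO_REAL_URL"))
}
```

Keeping each rule as a named predicate makes it straightforward to report which constraint rejected an output.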

Merge conditions

# All TODO placeholders must have been replaced
grep -c "TODO" test/fixtures/judgement-golden.csv
# → 0

# Confirm that CI passes
go run ./cmd/validate-judge --config prompts/judge.yaml \
  --golden test/fixtures/judgement-golden.csv --skip-url-resolve
# → All tests passed

ablaze will verify the real URLs in a browser; merge after they have been swapped in.

Copilot AI review requested due to automatic review settings April 24, 2026 15:49
Copilot AI left a comment

Pull request overview

This PR introduces a declarative judging rule configuration (prompts/judge.yaml), adds golden fixture datasets, and adds a Go CLI (cmd/validate-judge) intended to validate LLM judge outputs against those constraints in CI.

Changes:

  • Add prompts/judge.yaml with evidence/source/confidence/hallucination guard rules and a list of CI checks.
  • Replace test/fixtures/judgement-golden.csv with a new “judge-by-evidence” style golden schema; add test/fixtures/cluster-golden.csv.
  • Add cmd/validate-judge CLI to validate single JSON outputs and run batch validation against the golden CSV.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 13 comments.

| File | Description |
|------|-------------|
| test/fixtures/judgement-golden.csv | Replaces the existing golden dataset with a new schema for judge/evidence validation cases. |
| test/fixtures/cluster-golden.csv | Adds a new golden dataset for cluster threshold boundary cases. |
| prompts/judge.yaml | Adds a declarative rules/config file describing validation constraints and CI checks. |
| cmd/validate-judge/main.go | Adds a Go CLI that loads judge.yaml and applies a subset of validation rules in single/batch modes. |

Comment thread test/fixtures/judgement-golden.csv Outdated
Comment on lines +1 to +11
id,case_type,description,source_url,source_has_claim,store_count_claim,store_count_source_text,expected_confidence_min,expected_confidence_max,expected_is_mega,expected_reject,expected_source_text_contains
1,true_positive,EDINET明記・確実なメガジー,TODO_REAL_URL,true,28,全国28店舗を展開,0.85,1.0,true,false,店舗
2,true_positive,企業サイト明記・メガジー,TODO_REAL_URL,true,22,現在22店舗を運営しております,0.65,0.85,true,false,運営
3,true_positive,gmaps+企業サイト複数ソース,TODO_REAL_URL,true,21,グループ全体で21店舗,0.60,0.80,true,false,店舗
4,true_negative,10店未満・明記あり,TODO_REAL_URL,true,8,現在8店舗を展開中,0.70,0.90,false,false,店舗
5,true_negative,店舗数記載なし,TODO_REAL_URL,false,0,,0.0,0.0,false,true,
6,false_positive_hq,本部名混入・直営疑い,TODO_REAL_URL,true,450,全国450店舗,0.0,0.5,false,false,
7,source_claim_violation,URL有・主張未記載,TODO_REAL_URL,false,25,,0.0,0.0,false,true,
8,source_claim_violation,URL有・別ページ推論,TODO_REAL_URL,false,20,グループ企業を含む店舗数,0.0,0.0,false,true,グループ
9,ambiguous_expression,約20店の曖昧表現,TODO_REAL_URL,partial,20,約20店舗を運営,0.0,0.5,false,false,約
10,no_url,evidence_urls空,,false,0,,0.0,0.0,false,true,
Comment on lines +38 to +46
type Output struct {
	SourceURL               string  `json:"source_url"`
	CorporateNameFromSource string  `json:"corporate_name_from_source"`
	StoreCountClaim         int     `json:"store_count_claim"`
	StoreCountSourceText    string  `json:"store_count_source_text"`
	CountSource             string  `json:"count_source"`
	CountUnit               string  `json:"count_unit"`
	Confidence              float64 `json:"confidence"`
	IsMegaFranchisee        bool    `json:"is_mega_franchisee"`
Comment on lines +101 to +106
client := &http.Client{Timeout: 5 * time.Second}
resp, err := client.Head(o.SourceURL)
if err != nil || resp.StatusCode >= 400 {
	return true // failure → flag
}
return false
Comment on lines +186 to +206
var rows []GoldenRow
for i, rec := range records {
	if i == 0 {
		continue // skip the header
	}
	if len(rec) < 12 {
		continue
	}
	cMin, _ := strconv.ParseFloat(rec[7], 64)
	cMax, _ := strconv.ParseFloat(rec[8], 64)
	isMega := rec[9] == "true"
	reject := rec[10] == "true"
	rows = append(rows, GoldenRow{
		ID: rec[0], CaseType: rec[1], Description: rec[2],
		SourceURL: rec[3], SourceHasClaim: rec[4],
		StoreCountClaim: rec[5], StoreCountSourceText: rec[6],
		ExpectedConfidenceMin: cMin, ExpectedConfidenceMax: cMax,
		ExpectedIsMega: isMega, ExpectedReject: reject,
		ExpectedSourceTextContains: rec[11],
	})
}
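
The loop quoted here skips short rows and discards the `strconv` errors, so a malformed fixture would be read as zeros or dropped without notice. A stricter variant might look like the sketch below. It is not the PR's code: `parseGolden`, the trimmed `GoldenRow`, and the choice to fail on the first bad row are assumptions; `csv.Reader.FieldsPerRecord`, `strconv.ParseFloat`, and `strconv.ParseBool` are standard library.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strconv"
	"strings"
)

// goldenHeader is the 12-column header quoted in the review.
const goldenHeader = "id,case_type,description,source_url,source_has_claim,store_count_claim,store_count_source_text,expected_confidence_min,expected_confidence_max,expected_is_mega,expected_reject,expected_source_text_contains"

// GoldenRow keeps only the columns this sketch parses.
type GoldenRow struct {
	ID             string
	ConfidenceMin  float64
	ConfidenceMax  float64
	ExpectedReject bool
}

// parseGolden returns an error instead of silently skipping short rows or
// treating unparsable values as zero/false.
func parseGolden(data string) ([]GoldenRow, error) {
	r := csv.NewReader(strings.NewReader(data))
	r.FieldsPerRecord = 12 // rows with another column count fail in ReadAll
	records, err := r.ReadAll()
	if err != nil {
		return nil, fmt.Errorf("read golden csv: %w", err)
	}
	var rows []GoldenRow
	for i, rec := range records {
		if i == 0 {
			continue // skip the header
		}
		cMin, err := strconv.ParseFloat(rec[7], 64)
		if err != nil {
			return nil, fmt.Errorf("row %s: expected_confidence_min: %w", rec[0], err)
		}
		cMax, err := strconv.ParseFloat(rec[8], 64)
		if err != nil {
			return nil, fmt.Errorf("row %s: expected_confidence_max: %w", rec[0], err)
		}
		reject, err := strconv.ParseBool(rec[10])
		if err != nil {
			return nil, fmt.Errorf("row %s: expected_reject: %w", rec[0], err)
		}
		rows = append(rows, GoldenRow{ID: rec[0], ConfidenceMin: cMin, ConfidenceMax: cMax, ExpectedReject: reject})
	}
	return rows, nil
}

func main() {
	data := goldenHeader + "\n10,no_url,evidence_urls empty,,false,0,,0.0,0.0,false,true,\n"
	rows, err := parseGolden(data)
	fmt.Println(len(rows), err)
}
```
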
Comment on lines +59 to +66
// buildRules returns the list of rules corresponding to the constraints in judge.yaml.
func buildRules(cfg JudgeConfig, skipURLResolve bool) []ValidationRule {
	return []ValidationRule{
		{
			Name: "source_url_empty",
			Check: func(o Output) bool {
				return o.SourceURL == "" || o.SourceURL == "null"
			},
Comment on lines +262 to +294
failed := 0
for _, row := range rows {
	if strings.HasPrefix(row.SourceURL, "TODO") {
		fmt.Printf("SKIP #%s (%s): TODO_REAL_URL\n", row.ID, row.Description)
		continue
	}
	fmt.Printf("CHECK #%s (%s)... ", row.ID, row.Description)
	// Build an Output from each golden.csv row and apply the rules
	cnt, _ := strconv.Atoi(row.StoreCountClaim)
	output := Output{
		SourceURL:               row.SourceURL,
		CorporateNameFromSource: "テスト社名",
		StoreCountClaim:         cnt,
		StoreCountSourceText:    row.StoreCountSourceText,
		Confidence:              1.0,
	}
	applyRules(&output, rules)

	ok := true
	if row.ExpectedReject && !output.Reject {
		fmt.Printf("FAIL (expected reject but not rejected)\n")
		ok = false
	}
	if !row.ExpectedReject && output.Reject {
		fmt.Printf("FAIL (unexpected reject: %s)\n", output.RejectReason)
		ok = false
	}
	if ok {
		fmt.Println("PASS")
	} else {
		failed++
	}
}
Comment thread prompts/judge.yaml
Comment on lines +15 to +17
- gmaps_cluster:
standalone_allowed: false # when standalone, force confidence=0.0 and reject
unknown_source: "error" # sources outside the list are an error (no silent pass-through)
Comment thread prompts/judge.yaml

judge_panel:
  enabled: true
  min_agreement: 0.6 # adopt on panel majority agreement (with 3 models, at least 2/3)
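
The arithmetic behind `min_agreement: 0.6` can be checked with a small sketch: with 3 judges, 2 agreeing votes give 2/3 ≈ 0.67, which clears 0.6, while 1 of 3 does not. The function `panelAgrees` and the treatment of an empty panel as "no agreement" are assumptions for illustration, not the panel logic from the PR.

```go
package main

import "fmt"

// panelAgrees reports whether the share of agreeing judges reaches
// minAgreement. An empty panel never agrees (assumption of this sketch).
func panelAgrees(votes []bool, minAgreement float64) bool {
	if len(votes) == 0 {
		return false
	}
	agree := 0
	for _, v := range votes {
		if v {
			agree++
		}
	}
	return float64(agree)/float64(len(votes)) >= minAgreement
}

func main() {
	fmt.Println(panelAgrees([]bool{true, true, false}, 0.6))  // 2 of 3 agree
	fmt.Println(panelAgrees([]bool{true, false, false}, 0.6)) // 1 of 3 agrees
}
```

Because the threshold is a ratio and not a vote count, what it demands depends on panel size: a 2-of-4 split (0.5) would fall short of 0.6 even though half the panel agrees.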
Comment thread prompts/judge.yaml
Comment on lines +104 to +105
- if: "count_source == 'gmaps_cluster' and no_other_source"
  then: { confidence_multiplier: 0.8, requires_l3_verification: true }
Comment thread test/fixtures/judgement-golden.csv Outdated
Comment on lines +2 to +4
1,true_positive,EDINET明記・確実なメガジー,TODO_REAL_URL,true,28,全国28店舗を展開,0.85,1.0,true,false,店舗
2,true_positive,企業サイト明記・メガジー,TODO_REAL_URL,true,22,現在22店舗を運営しております,0.65,0.85,true,false,運営
3,true_positive,gmaps+企業サイト複数ソース,TODO_REAL_URL,true,21,グループ全体で21店舗,0.60,0.80,true,false,店舗
aegis-ceo and others added 12 commits April 25, 2026 01:44
…(PR#9)

prompts/judge.yaml:
  - Full constraint spec: evidence, source_priority, confidence_rules/penalty
  - hallucination_guards with refuse_phrases (JA + EN)
  - output_schema with required/optional fields
  - validation rules (source_url, source_text, corporate_name)
  - ci_checks list, brand_hq_review, judge_panel
  - url_must_resolve: true (CI can override with --skip-url-resolve)

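test/fixtures/judgement-golden.csv (10 cases):
  - #1-3: true positives (TODO_REAL_URL; replace with real URLs before merging)
  - #4: true negative (fewer than 10 stores)
  - #5: true negative (no store count stated) → reject
  - #6: false-positive candidate (brand HQ name mixed in) → confidence<0.5
  - #7-8: source_must_contain_claim violation → reject
  - #9: ambiguous expression ("about 20 stores") → confidence<0.5
  - #10: evidence_urls empty → reject

test/fixtures/cluster-golden.csv (8 cases):
  - Covers the ClusterThreshold=4 boundary cases from above and below

cmd/validate-judge/main.go:
  - ValidationRule struct + applyRules()
  - Single-output mode (--output)
  - golden.csv batch mode (--golden)
  - --skip-url-resolve flag (skips URL resolution for fixture URLs in CI)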
@clearclown clearclown force-pushed the feat/judge-yaml-and-golden branch from 089720a to b88840a Compare April 24, 2026 16:44
@clearclown clearclown merged commit d138d53 into main Apr 24, 2026
2 of 7 checks passed