feat(judge): judge.yaml, golden datasets, and validate-judge CLI (PR#9) #9
Merged
clearclown merged 12 commits into main, Apr 24, 2026
Pull request overview
This PR introduces a declarative judging rule configuration (prompts/judge.yaml), adds golden fixture datasets, and adds a Go CLI (cmd/validate-judge) intended to validate LLM judge outputs against those constraints in CI.
Changes:
- Add `prompts/judge.yaml` with evidence/source/confidence/hallucination guard rules and a list of CI checks.
- Replace `test/fixtures/judgement-golden.csv` with a new “judge-by-evidence” style golden schema; add `test/fixtures/cluster-golden.csv`.
- Add `cmd/validate-judge` CLI to validate single JSON outputs and run batch validation against the golden CSV.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 13 comments.
| File | Description |
|---|---|
| test/fixtures/judgement-golden.csv | Replaces the existing golden dataset with a new schema for judge/evidence validation cases. |
| test/fixtures/cluster-golden.csv | Adds a new golden dataset for cluster threshold boundary cases. |
| prompts/judge.yaml | Adds a declarative rules/config file describing validation constraints and CI checks. |
| cmd/validate-judge/main.go | Adds a Go CLI that loads judge.yaml and applies a subset of validation rules in single/batch modes. |
Comment on lines +1 to +11
```csv
id,case_type,description,source_url,source_has_claim,store_count_claim,store_count_source_text,expected_confidence_min,expected_confidence_max,expected_is_mega,expected_reject,expected_source_text_contains
1,true_positive,EDINET明記・確実なメガジー,TODO_REAL_URL,true,28,全国28店舗を展開,0.85,1.0,true,false,店舗
2,true_positive,企業サイト明記・メガジー,TODO_REAL_URL,true,22,現在22店舗を運営しております,0.65,0.85,true,false,運営
3,true_positive,gmaps+企業サイト複数ソース,TODO_REAL_URL,true,21,グループ全体で21店舗,0.60,0.80,true,false,店舗
4,true_negative,10店未満・明記あり,TODO_REAL_URL,true,8,現在8店舗を展開中,0.70,0.90,false,false,店舗
5,true_negative,店舗数記載なし,TODO_REAL_URL,false,0,,0.0,0.0,false,true,
6,false_positive_hq,本部名混入・直営疑い,TODO_REAL_URL,true,450,全国450店舗,0.0,0.5,false,false,
7,source_claim_violation,URL有・主張未記載,TODO_REAL_URL,false,25,,0.0,0.0,false,true,
8,source_claim_violation,URL有・別ページ推論,TODO_REAL_URL,false,20,グループ企業を含む店舗数,0.0,0.0,false,true,グループ
9,ambiguous_expression,約20店の曖昧表現,TODO_REAL_URL,partial,20,約20店舗を運営,0.0,0.5,false,false,約
10,no_url,evidence_urls空,,false,0,,0.0,0.0,false,true,
```
Comment on lines +38 to +46
```go
type Output struct {
	SourceURL               string  `json:"source_url"`
	CorporateNameFromSource string  `json:"corporate_name_from_source"`
	StoreCountClaim         int     `json:"store_count_claim"`
	StoreCountSourceText    string  `json:"store_count_source_text"`
	CountSource             string  `json:"count_source"`
	CountUnit               string  `json:"count_unit"`
	Confidence              float64 `json:"confidence"`
	IsMegaFranchisee        bool    `json:"is_mega_franchisee"`
```
Comment on lines +101 to +106
```go
client := &http.Client{Timeout: 5 * time.Second}
resp, err := client.Head(o.SourceURL)
if err != nil || resp.StatusCode >= 400 {
	return true // failure → flag
}
return false
```
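The excerpt above never closes `resp.Body` when the request succeeds, which can leak connections over a long batch run. A minimal sketch of one way to tighten it follows, assuming the check is pulled into a standalone helper. The names `urlFailsToResolve`, `stubTransport`, and `flaggedForStatus` are hypothetical and do not appear in the PR; the stub transport exists only so the example runs without network access.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// stubTransport answers every request with a fixed status code.
// It is a test double for this sketch, not part of the PR.
type stubTransport struct{ status int }

func (s stubTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	return &http.Response{
		StatusCode: s.status,
		Header:     make(http.Header),
		Body:       http.NoBody,
		Request:    req,
	}, nil
}

// urlFailsToResolve reports true when the URL should be flagged.
// Hypothetical helper name; the PR inlines this logic in a rule.
func urlFailsToResolve(client *http.Client, url string) bool {
	resp, err := client.Head(url)
	if err != nil {
		return true // network error, timeout, or malformed URL
	}
	defer resp.Body.Close() // release the connection even for HEAD
	return resp.StatusCode >= 400
}

// flaggedForStatus runs the check against a stubbed server response.
func flaggedForStatus(status int) bool {
	client := &http.Client{
		Timeout:   5 * time.Second,
		Transport: stubTransport{status: status},
	}
	return urlFailsToResolve(client, "https://example.invalid/")
}

func main() {
	fmt.Println(flaggedForStatus(http.StatusOK))       // false
	fmt.Println(flaggedForStatus(http.StatusNotFound)) // true
}
```

Some servers reject HEAD with 405 while serving GET normally, so a fallback GET may be worth considering; that is left out here to keep the sketch short.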
Comment on lines +186 to +206
```go
var rows []GoldenRow
for i, rec := range records {
	if i == 0 {
		continue // skip header
	}
	if len(rec) < 12 {
		continue
	}
	cMin, _ := strconv.ParseFloat(rec[7], 64)
	cMax, _ := strconv.ParseFloat(rec[8], 64)
	isMega := rec[9] == "true"
	reject := rec[10] == "true"
	rows = append(rows, GoldenRow{
		ID: rec[0], CaseType: rec[1], Description: rec[2],
		SourceURL: rec[3], SourceHasClaim: rec[4],
		StoreCountClaim: rec[5], StoreCountSourceText: rec[6],
		ExpectedConfidenceMin: cMin, ExpectedConfidenceMax: cMax,
		ExpectedIsMega: isMega, ExpectedReject: reject,
		ExpectedSourceTextContains: rec[11],
	})
}
```
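The loader above silently skips short rows and discards `strconv.ParseFloat` errors, so a malformed golden file could quietly turn into zero-valued expectations. The following is a minimal sketch of a stricter parse, assuming the same 12-column layout. The type `goldenExpectation` and the functions `parseGoldenRecord` and `loadGolden` are hypothetical names covering only the numeric and boolean columns; the "bad" row in `main` is constructed for illustration.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strconv"
	"strings"
)

const goldenColumns = 12

// header mirrors the column list of judgement-golden.csv.
const header = "id,case_type,description,source_url,source_has_claim,store_count_claim,store_count_source_text,expected_confidence_min,expected_confidence_max,expected_is_mega,expected_reject,expected_source_text_contains\n"

// goldenExpectation is a reduced, hypothetical stand-in for GoldenRow.
type goldenExpectation struct {
	ID               string
	ConfMin, ConfMax float64
	IsMega, Reject   bool
}

// parseGoldenRecord converts one CSV record and reports every parse problem.
func parseGoldenRecord(rec []string) (goldenExpectation, error) {
	if len(rec) < goldenColumns {
		return goldenExpectation{}, fmt.Errorf("got %d columns, want %d", len(rec), goldenColumns)
	}
	row := goldenExpectation{ID: rec[0]}
	var err error
	if row.ConfMin, err = strconv.ParseFloat(rec[7], 64); err != nil {
		return row, fmt.Errorf("expected_confidence_min: %w", err)
	}
	if row.ConfMax, err = strconv.ParseFloat(rec[8], 64); err != nil {
		return row, fmt.Errorf("expected_confidence_max: %w", err)
	}
	if row.ConfMin > row.ConfMax {
		return row, fmt.Errorf("confidence range inverted: %v > %v", row.ConfMin, row.ConfMax)
	}
	if row.IsMega, err = strconv.ParseBool(rec[9]); err != nil {
		return row, fmt.Errorf("expected_is_mega: %w", err)
	}
	if row.Reject, err = strconv.ParseBool(rec[10]); err != nil {
		return row, fmt.Errorf("expected_reject: %w", err)
	}
	return row, nil
}

// loadGolden parses CSV text and fails on the first bad row.
func loadGolden(data string) ([]goldenExpectation, error) {
	records, err := csv.NewReader(strings.NewReader(data)).ReadAll()
	if err != nil {
		return nil, err
	}
	var rows []goldenExpectation
	for i, rec := range records {
		if i == 0 {
			continue // skip header
		}
		row, err := parseGoldenRecord(rec)
		if err != nil {
			return nil, fmt.Errorf("golden row %d: %w", i, err)
		}
		rows = append(rows, row)
	}
	return rows, nil
}

func main() {
	good := header + "5,true_negative,店舗数記載なし,TODO_REAL_URL,false,0,,0.0,0.0,false,true,\n"
	rows, err := loadGolden(good)
	fmt.Println(rows, err)

	// Constructed example: a non-numeric expected_confidence_min.
	bad := header + "5,true_negative,店舗数記載なし,TODO_REAL_URL,false,0,,abc,0.0,false,true,\n"
	_, err = loadGolden(bad)
	fmt.Println(err)
}
```

Failing fast keeps a typo in the fixture from being reported later as a judge regression.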
Comment on lines +59 to +66
```go
// buildRules returns the list of rules corresponding to the constraints in judge.yaml.
func buildRules(cfg JudgeConfig, skipURLResolve bool) []ValidationRule {
	return []ValidationRule{
		{
			Name: "source_url_empty",
			Check: func(o Output) bool {
				return o.SourceURL == "" || o.SourceURL == "null"
			},
```
Comment on lines +262 to +294
```go
failed := 0
for _, row := range rows {
	if strings.HasPrefix(row.SourceURL, "TODO") {
		fmt.Printf("SKIP #%s (%s): TODO_REAL_URL\n", row.ID, row.Description)
		continue
	}
	fmt.Printf("CHECK #%s (%s)... ", row.ID, row.Description)
	// Build an Output from each golden.csv row and apply the rules.
	cnt, _ := strconv.Atoi(row.StoreCountClaim)
	output := Output{
		SourceURL:               row.SourceURL,
		CorporateNameFromSource: "テスト社名",
		StoreCountClaim:         cnt,
		StoreCountSourceText:    row.StoreCountSourceText,
		Confidence:              1.0,
	}
	applyRules(&output, rules)

	ok := true
	if row.ExpectedReject && !output.Reject {
		fmt.Printf("FAIL (expected reject but not rejected)\n")
		ok = false
	}
	if !row.ExpectedReject && output.Reject {
		fmt.Printf("FAIL (unexpected reject: %s)\n", output.RejectReason)
		ok = false
	}
	if ok {
		fmt.Println("PASS")
	} else {
		failed++
	}
}
```
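Because the loop skips every row whose URL starts with `TODO`, and rows 1 through 9 of the golden file currently carry `TODO_REAL_URL`, a batch run can report success while checking almost nothing. A minimal sketch of skip accounting follows. The type `batchSummary` and its methods are hypothetical, and treating a run with zero checked rows as a failure is an assumption about the desired CI behavior rather than something the PR states.

```go
package main

import (
	"fmt"
	"strings"
)

// batchSummary tallies how golden rows were handled (hypothetical type).
type batchSummary struct {
	Checked, Skipped, Failed int
}

// record tallies one golden row; rejected is what the rules produced.
func (s *batchSummary) record(sourceURL string, expectedReject, rejected bool) {
	if strings.HasPrefix(sourceURL, "TODO") {
		s.Skipped++ // placeholder URL: nothing was actually validated
		return
	}
	s.Checked++
	if expectedReject != rejected {
		s.Failed++
	}
}

// exitCode returns 1 on any failure, and also when nothing was checked.
// The second condition is an assumption, not the PR's stated behavior.
func (s batchSummary) exitCode() int {
	if s.Failed > 0 || s.Checked == 0 {
		return 1
	}
	return 0
}

func main() {
	var s batchSummary
	s.record("TODO_REAL_URL", false, false) // shaped like rows 1-9
	s.record("", true, true)                // shaped like row 10 (no URL, rejected)
	fmt.Printf("%+v exit=%d\n", s, s.exitCode())
}
```

Printing the skipped count alongside PASS/FAIL makes it visible in CI logs how much of the golden set is still waiting on real URLs.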
Comment on lines +15 to +17
```yaml
- gmaps_cluster:
    standalone_allowed: false  # when it is the only source: confidence=0.0, forced reject
unknown_source: "error"  # sources not on the list are an error (no silent pass-through)
```
```yaml
judge_panel:
  enabled: true
  min_agreement: 0.6  # adopt on panel majority agreement (2/3 or more with 3 models)
```
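The comment on `min_agreement` describes the threshold as "2/3 or more", while the configured value is 0.6. For three models the two readings coincide, since 2 of 3 passes and 1 of 3 does not, but they diverge for other panel sizes: 3 of 5 meets 0.6 and would not meet 2/3. A minimal sketch of the arithmetic, assuming agreement is computed as votes divided by panel size (the function `panelAgrees` is hypothetical and not taken from the PR):

```go
package main

import "fmt"

// panelAgrees reports whether votes/total reaches minAgreement.
// It compares votes against minAgreement*total with a small tolerance
// so that exact ratios such as 3/5 vs 0.6 are not lost to rounding.
func panelAgrees(votes, total int, minAgreement float64) bool {
	if total <= 0 || votes < 0 || votes > total {
		return false
	}
	const eps = 1e-9
	return float64(votes)+eps >= minAgreement*float64(total)
}

func main() {
	fmt.Println(panelAgrees(2, 3, 0.6)) // true
	fmt.Println(panelAgrees(1, 3, 0.6)) // false
	fmt.Println(panelAgrees(3, 5, 0.6)) // true
	fmt.Println(panelAgrees(2, 4, 0.6)) // false
}
```

If the intent really is a two-thirds supermajority for every panel size, the configured value and the comment would need to agree.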
Comment on lines +104 to +105
```yaml
- if: "count_source == 'gmaps_cluster' and no_other_source"
  then: { confidence_multiplier: 0.8, requires_l3_verification: true }
```
Comment on lines +2 to +4
```csv
1,true_positive,EDINET明記・確実なメガジー,TODO_REAL_URL,true,28,全国28店舗を展開,0.85,1.0,true,false,店舗
2,true_positive,企業サイト明記・メガジー,TODO_REAL_URL,true,22,現在22店舗を運営しております,0.65,0.85,true,false,運営
3,true_positive,gmaps+企業サイト複数ソース,TODO_REAL_URL,true,21,グループ全体で21店舗,0.60,0.80,true,false,店舗
```
…(PR#9)

prompts/judge.yaml:
- Full constraint spec: evidence, source_priority, confidence_rules/penalty
- hallucination_guards with refuse_phrases (JA + EN)
- output_schema with required/optional fields
- validation rules (source_url, source_text, corporate_name)
- ci_checks list, brand_hq_review, judge_panel
- url_must_resolve: true (CI can override with --skip-url-resolve)

test/fixtures/judgement-golden.csv (10 cases):
- #1-3: true positives (TODO_REAL_URL, to be replaced with real URLs before merge)
- #4: true negative (fewer than 10 stores)
- #5: true negative (no store count stated) → reject
- #6: false-positive candidate (franchisor HQ name mixed in) → confidence<0.5
- #7-8: source_must_contain_claim violations → reject
- #9: ambiguous expression ("約20店", about 20 stores) → confidence<0.5
- #10: empty evidence_urls → reject

test/fixtures/cluster-golden.csv (8 cases):
- Covers the ClusterThreshold=4 boundary cases from both above and below

cmd/validate-judge/main.go:
- ValidationRule struct + applyRules()
- Single-output mode (--output)
- golden.csv batch mode (--golden)
- --skip-url-resolve flag (for skipping fixture URLs in CI)
089720a to b88840a
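Overview
Moves the hallucination-prevention constraints out into a config file and adds a CI validation tool. Can proceed in parallel with PR#8.
Changes
prompts/judge.yaml
test/fixtures/judgement-golden.csv (10 cases)
test/fixtures/cluster-golden.csv (8 cases)
Covers the ClusterThreshold=4 boundary cases from both above and below.
cmd/validate-judge/main.go
- `--output output.json`
- `--golden test/fixtures/judgement-golden.csv`
- `--skip-url-resolve`: for skipping fixture URLs in CI
Merge conditions
ablaze will verify the real URLs in a browser; merge after they have been swapped in.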