feat(judge): judge.yaml, golden datasets, and validate-judge CLI (PR#9) by clearclown · Pull Request #9 · clearclown/pizza

clearclown · 2026-04-24T15:49:30Z

Summary

Externalizes the hallucination-prevention constraints into a config file and adds a CI validation tool. Can proceed in parallel with PR#8.

Changes

prompts/judge.yaml

evidence: require_urls, url_must_resolve, skip_domains (EDINET API domain)
source_priority: edinet > chuusho_kaiji > corporate_site > gmaps_cluster
- gmaps_cluster: standalone_allowed: false, standalone_max: 0.0
- unknown_source: "error" (unknown sources must not pass silently)
confidence_rules + confidence_penalty (combine: multiply, floor: 0.0)
hallucination_guards: refuse_phrases (Japanese noun-ending and polite forms, plus English)
output_schema: required/optional field definitions
validation: declarative rules (empty source_url → reject, count_source not in priority → error, etc.)
ci_checks list
hard_rules: gmaps_cluster_requires_l3, unknown_source_rejected
brand_hq_review: additional constraints when is_brand_hq=true
judge_panel: min_agreement: 0.6
system_prompt / hallucination_instruction
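
One way to read `combine: multiply, floor: 0.0` is sketched below. This is only an illustration, assuming each triggered penalty is expressed as a multiplier; `combinePenalties` and its parameters are hypothetical names that do not appear in the PR.

```go
package main

import "fmt"

// combinePenalties multiplies a base confidence by every triggered penalty
// multiplier and clamps the result to [floor, 1.0].
// Hypothetical helper: name and signature are assumptions for illustration.
func combinePenalties(base float64, multipliers []float64, floor float64) float64 {
	c := base
	for _, m := range multipliers {
		c *= m
	}
	if c < floor {
		c = floor
	}
	if c > 1.0 {
		c = 1.0
	}
	return c
}

func main() {
	// Two penalties of 0.5 applied to a base confidence of 0.8.
	fmt.Printf("%.2f\n", combinePenalties(0.8, []float64{0.5, 0.5}, 0.0))
}
```

With multiplication, every additional penalty can only lower the score, and the floor keeps the result from dropping below the configured minimum.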

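test/fixtures/judgement-golden.csv (10 cases)

#	Case	Expected result
1-3	True positive (TODO_REAL_URL)	confidence≥0.6, is_mega=true
4	True negative (fewer than 10 stores)	is_mega=false
5	True negative (no store count stated)	reject
6	False positive (brand HQ name)	confidence<0.5
7-8	source_must_contain_claim violation	reject
9	"About 20 stores" (ambiguous)	confidence<0.5
10	Empty evidence_urls	reject

⚠️ The source_url for #1-3 is TODO_REAL_URL. Replace it with real URLs before merging.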

test/fixtures/cluster-golden.csv (8 cases)

Covers the ClusterThreshold=4 boundary from both above and below.

cmd/validate-judge/main.go

ValidationRule struct + applyRules()
SourcePriorityNames() reads the allowed sources dynamically from judge.yaml (nothing hardcoded)
Separate validation per source_has_claim value (true/false/partial)
Single-output mode: --output output.json
golden.csv batch mode: --golden test/fixtures/judgement-golden.csv
--skip-url-resolve: skips URL resolution for fixture URLs in CI
URLs with the TODO_ prefix are skipped automatically
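
The unknown-source check described above could look roughly like this. It is a sketch under assumptions: `SourceEntry`, `sourceNames`, and `isKnownSource` are hypothetical stand-ins (the PR's own helper is `SourcePriorityNames()`, whose signature is not shown here), and the entries are built by hand instead of being parsed from judge.yaml.

```go
package main

import "fmt"

// SourceEntry is a hypothetical, already-parsed source_priority entry.
type SourceEntry struct {
	Name              string
	StandaloneAllowed bool
}

// sourceNames returns the allowed source names in priority order.
func sourceNames(entries []SourceEntry) []string {
	names := make([]string, 0, len(entries))
	for _, e := range entries {
		names = append(names, e.Name)
	}
	return names
}

// isKnownSource reports whether countSource appears in the allowed list.
// A false result corresponds to unknown_source: "error".
func isKnownSource(countSource string, allowed []string) bool {
	for _, n := range allowed {
		if n == countSource {
			return true
		}
	}
	return false
}

func main() {
	// Names follow the source_priority order in the PR description.
	// StandaloneAllowed=true for the first three is an assumption.
	entries := []SourceEntry{
		{Name: "edinet", StandaloneAllowed: true},
		{Name: "chuusho_kaiji", StandaloneAllowed: true},
		{Name: "corporate_site", StandaloneAllowed: true},
		{Name: "gmaps_cluster", StandaloneAllowed: false},
	}
	allowed := sourceNames(entries)
	fmt.Println(isKnownSource("edinet", allowed))
	fmt.Println(isKnownSource("wikipedia", allowed))
}
```

Deriving the allowed list from the config keeps the validator and judge.yaml from drifting apart when a source is added or removed.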

Merge conditions

# All TODO placeholders must be replaced
grep -c "TODO" test/fixtures/judgement-golden.csv
# → 0

# Confirm CI passes
go run ./cmd/validate-judge --config prompts/judge.yaml \
  --golden test/fixtures/judgement-golden.csv --skip-url-resolve
# → All tests passed

ablaze will verify the real URLs in a browser; merge once they have been swapped in.

Copilot

Pull request overview

This PR introduces a declarative judging rule configuration (prompts/judge.yaml), adds golden fixture datasets, and adds a Go CLI (cmd/validate-judge) intended to validate LLM judge outputs against those constraints in CI.

Changes:

Add prompts/judge.yaml with evidence/source/confidence/hallucination guard rules and a list of CI checks.
Replace test/fixtures/judgement-golden.csv with a new “judge-by-evidence” style golden schema; add test/fixtures/cluster-golden.csv.
Add cmd/validate-judge CLI to validate single JSON outputs and run batch validation against the golden CSV.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 13 comments.

File	Description
test/fixtures/judgement-golden.csv	Replaces the existing golden dataset with a new schema for judge/evidence validation cases.
test/fixtures/cluster-golden.csv	Adds a new golden dataset for cluster threshold boundary cases.
prompts/judge.yaml	Adds a declarative rules/config file describing validation constraints and CI checks.
cmd/validate-judge/main.go	Adds a Go CLI that loads `judge.yaml` and applies a subset of validation rules in single/batch modes.

+id,case_type,description,source_url,source_has_claim,store_count_claim,store_count_source_text,expected_confidence_min,expected_confidence_max,expected_is_mega,expected_reject,expected_source_text_contains
+1,true_positive,EDINET明記・確実なメガジー,TODO_REAL_URL,true,28,全国28店舗を展開,0.85,1.0,true,false,店舗
+2,true_positive,企業サイト明記・メガジー,TODO_REAL_URL,true,22,現在22店舗を運営しております,0.65,0.85,true,false,運営
+3,true_positive,gmaps+企業サイト複数ソース,TODO_REAL_URL,true,21,グループ全体で21店舗,0.60,0.80,true,false,店舗
+4,true_negative,10店未満・明記あり,TODO_REAL_URL,true,8,現在8店舗を展開中,0.70,0.90,false,false,店舗
+5,true_negative,店舗数記載なし,TODO_REAL_URL,false,0,,0.0,0.0,false,true,
+6,false_positive_hq,本部名混入・直営疑い,TODO_REAL_URL,true,450,全国450店舗,0.0,0.5,false,false,
+7,source_claim_violation,URL有・主張未記載,TODO_REAL_URL,false,25,,0.0,0.0,false,true,
+8,source_claim_violation,URL有・別ページ推論,TODO_REAL_URL,false,20,グループ企業を含む店舗数,0.0,0.0,false,true,グループ
+9,ambiguous_expression,約20店の曖昧表現,TODO_REAL_URL,partial,20,約20店舗を運営,0.0,0.5,false,false,約
+10,no_url,evidence_urls空,,false,0,,0.0,0.0,false,true,


+type Output struct {
+	SourceURL                string  `json:"source_url"`
+	CorporateNameFromSource  string  `json:"corporate_name_from_source"`
+	StoreCountClaim          int     `json:"store_count_claim"`
+	StoreCountSourceText     string  `json:"store_count_source_text"`
+	CountSource              string  `json:"count_source"`
+	CountUnit                string  `json:"count_unit"`
+	Confidence               float64 `json:"confidence"`
+	IsMegaFranchisee         bool    `json:"is_mega_franchisee"`


+				client := &http.Client{Timeout: 5 * time.Second}
+				resp, err := client.Head(o.SourceURL)
+				if err != nil || resp.StatusCode >= 400 {
+					return true // request failed or 4xx/5xx → flag
+				}
+				return false
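
Before a HEAD request like the one in this hunk is issued, the skip conditions from the PR description (the `--skip-url-resolve` flag, `TODO_` placeholders, `skip_domains`) have to be evaluated. A minimal sketch of that decision, assuming `skip_domains` holds bare host names; `shouldSkipResolve` is a hypothetical helper and the domain below is a placeholder, not the real EDINET host.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// shouldSkipResolve reports whether the URL-resolve check should be skipped.
// Hypothetical helper; the PR's actual skip logic is not shown in the hunk.
func shouldSkipResolve(rawURL string, skipDomains []string, skipAll bool) bool {
	if skipAll || strings.HasPrefix(rawURL, "TODO_") {
		return true
	}
	u, err := url.Parse(rawURL)
	if err != nil {
		return false // unparsable URLs are not skipped, so the resolve check can flag them
	}
	host := u.Hostname()
	for _, d := range skipDomains {
		// Match the domain itself or any subdomain of it.
		if host == d || strings.HasSuffix(host, "."+d) {
			return true
		}
	}
	return false
}

func main() {
	skip := []string{"api.example.test"} // placeholder domain (assumption)
	fmt.Println(shouldSkipResolve("TODO_REAL_URL", skip, false))
	fmt.Println(shouldSkipResolve("https://api.example.test/v2/doc", skip, false))
	fmt.Println(shouldSkipResolve("https://corp.example.test/about", skip, false))
}
```

Comparing against the parsed host name, not a substring of the whole URL, avoids skipping a URL that merely mentions a skipped domain in its path or query.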


+	var rows []GoldenRow
+	for i, rec := range records {
+		if i == 0 {
+			continue // skip the header row
+		}
+		if len(rec) < 12 {
+			continue
+		}
+		cMin, _ := strconv.ParseFloat(rec[7], 64)
+		cMax, _ := strconv.ParseFloat(rec[8], 64)
+		isMega := rec[9] == "true"
+		reject := rec[10] == "true"
+		rows = append(rows, GoldenRow{
+			ID: rec[0], CaseType: rec[1], Description: rec[2],
+			SourceURL: rec[3], SourceHasClaim: rec[4],
+			StoreCountClaim: rec[5], StoreCountSourceText: rec[6],
+			ExpectedConfidenceMin: cMin, ExpectedConfidenceMax: cMax,
+			ExpectedIsMega: isMega, ExpectedReject: reject,
+			ExpectedSourceTextContains: rec[11],
+		})
+	}
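
The hunk above discards the `strconv.ParseFloat` errors and silently drops short records, so a malformed fixture row would parse as zeros or vanish. A stricter variant could look like this sketch; `GoldenExpect` and `parseExpect` are hypothetical names, and the column indexes follow the CSV header shown earlier in the diff.

```go
package main

import (
	"fmt"
	"strconv"
)

// GoldenExpect holds the numeric and boolean expectations of one golden row.
// Hypothetical type; the PR's GoldenRow carries more fields.
type GoldenExpect struct {
	ConfMin, ConfMax float64
	IsMega, Reject   bool
}

// parseExpect parses columns 7-10 of a golden.csv record and returns an
// error for malformed values instead of defaulting them to zero or false.
func parseExpect(rec []string) (GoldenExpect, error) {
	if len(rec) < 12 {
		return GoldenExpect{}, fmt.Errorf("want 12 columns, got %d", len(rec))
	}
	cMin, err := strconv.ParseFloat(rec[7], 64)
	if err != nil {
		return GoldenExpect{}, fmt.Errorf("expected_confidence_min %q: %w", rec[7], err)
	}
	cMax, err := strconv.ParseFloat(rec[8], 64)
	if err != nil {
		return GoldenExpect{}, fmt.Errorf("expected_confidence_max %q: %w", rec[8], err)
	}
	if cMin > cMax {
		return GoldenExpect{}, fmt.Errorf("confidence range inverted: %v > %v", cMin, cMax)
	}
	isMega, err := strconv.ParseBool(rec[9])
	if err != nil {
		return GoldenExpect{}, fmt.Errorf("expected_is_mega %q: %w", rec[9], err)
	}
	reject, err := strconv.ParseBool(rec[10])
	if err != nil {
		return GoldenExpect{}, fmt.Errorf("expected_reject %q: %w", rec[10], err)
	}
	return GoldenExpect{ConfMin: cMin, ConfMax: cMax, IsMega: isMega, Reject: reject}, nil
}

func main() {
	rec := []string{"4", "true_negative", "desc", "TODO_REAL_URL", "true", "8",
		"text", "0.70", "0.90", "false", "false", "x"}
	e, err := parseExpect(rec)
	fmt.Println(e, err)
}
```

Note that `strconv.ParseBool` also accepts spellings such as `1`, `t`, and `TRUE`, which is looser than the `== "true"` comparison in the hunk; whether that is desirable for fixtures is a judgment call.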


+// buildRules returns the list of rules corresponding to the constraints in judge.yaml.
+func buildRules(cfg JudgeConfig, skipURLResolve bool) []ValidationRule {
+	return []ValidationRule{
+		{
+			Name: "source_url_empty",
+			Check: func(o Output) bool {
+				return o.SourceURL == "" || o.SourceURL == "null"
+			},
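
For readers who only see this hunk: the rule list is consumed by `applyRules()`, which is named in the PR description but not shown. The sketch below is one plausible shape, assuming `Check` returns true on a violation, evaluation stops at the first violated rule, and the rule name is used as the reject reason. `Output` is trimmed to the fields the sketch needs.

```go
package main

import "fmt"

// Output is reduced to the fields this sketch uses.
type Output struct {
	SourceURL    string
	Reject       bool
	RejectReason string
}

// ValidationRule pairs a rule name with a predicate.
// Assumption: Check returns true when the rule is violated.
type ValidationRule struct {
	Name  string
	Check func(o Output) bool
}

// applyRules marks o as rejected at the first violated rule.
// Assumption: first-violation-wins; the PR's version may collect all violations.
func applyRules(o *Output, rules []ValidationRule) {
	for _, r := range rules {
		if r.Check(*o) {
			o.Reject = true
			o.RejectReason = r.Name
			return
		}
	}
}

func main() {
	rules := []ValidationRule{{
		Name: "source_url_empty",
		Check: func(o Output) bool {
			return o.SourceURL == "" || o.SourceURL == "null"
		},
	}}
	out := Output{SourceURL: ""}
	applyRules(&out, rules)
	fmt.Println(out.Reject, out.RejectReason)
}
```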


+		failed := 0
+		for _, row := range rows {
+			if strings.HasPrefix(row.SourceURL, "TODO") {
+				fmt.Printf("SKIP #%s (%s): TODO_REAL_URL\n", row.ID, row.Description)
+				continue
+			}
+			fmt.Printf("CHECK #%s (%s)... ", row.ID, row.Description)
+			// Build an Output from each golden.csv row and apply the rules
+			cnt, _ := strconv.Atoi(row.StoreCountClaim)
+			output := Output{
+				SourceURL:               row.SourceURL,
+				CorporateNameFromSource: "テスト社名",
+				StoreCountClaim:         cnt,
+				StoreCountSourceText:    row.StoreCountSourceText,
+				Confidence:              1.0,
+			}
+			applyRules(&output, rules)
+
+			ok := true
+			if row.ExpectedReject && !output.Reject {
+				fmt.Printf("FAIL (expected reject but not rejected)\n")
+				ok = false
+			}
+			if !row.ExpectedReject && output.Reject {
+				fmt.Printf("FAIL (unexpected reject: %s)\n", output.RejectReason)
+				ok = false
+			}
+			if ok {
+				fmt.Println("PASS")
+			} else {
+				failed++
+			}
+		}
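
As shown in this hunk, the batch loop compares only the reject flag; the `expected_confidence_min`/`max` and `expected_is_mega` columns are parsed but not asserted. A sketch of how the remaining expectations could be compared, assuming confidence and is_mega are only meaningful for rows that are neither expected to be rejected nor actually rejected. `Expect`, `Result`, and `mismatches` are hypothetical names.

```go
package main

import "fmt"

// Expect mirrors the expected_* columns of a golden row (hypothetical type).
type Expect struct {
	ConfMin, ConfMax float64
	IsMega, Reject   bool
}

// Result is the validated judge output reduced to comparable fields.
type Result struct {
	Confidence float64
	IsMega     bool
	Reject     bool
}

// mismatches lists every expectation that the result fails to meet.
func mismatches(e Expect, r Result) []string {
	var out []string
	if e.Reject != r.Reject {
		out = append(out, fmt.Sprintf("reject: want %t, got %t", e.Reject, r.Reject))
	}
	if e.Reject || r.Reject {
		return out // assumption: other fields are not checked for rejected rows
	}
	if r.Confidence < e.ConfMin || r.Confidence > e.ConfMax {
		out = append(out, fmt.Sprintf("confidence %.2f outside [%.2f, %.2f]",
			r.Confidence, e.ConfMin, e.ConfMax))
	}
	if e.IsMega != r.IsMega {
		out = append(out, fmt.Sprintf("is_mega: want %t, got %t", e.IsMega, r.IsMega))
	}
	return out
}

func main() {
	// Expectations of golden row #2; the result values are made up.
	e := Expect{ConfMin: 0.65, ConfMax: 0.85, IsMega: true}
	r := Result{Confidence: 1.0, IsMega: false}
	for _, m := range mismatches(e, r) {
		fmt.Println("FAIL:", m)
	}
}
```

Returning every mismatch instead of stopping at the first makes a failing fixture row easier to diagnose in CI logs.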


+    - gmaps_cluster:
+        standalone_allowed: false  # when it is the only source: confidence=0.0, forced reject
+  unknown_source: "error"          # sources outside the list are errors (no silent pass-through)


+
+  judge_panel:
+    enabled: true
+    min_agreement: 0.6  # accept on panel majority agreement (at least 2 of 3 with 3 models)
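
How `min_agreement` is evaluated is not shown in the PR. One possible reading is sketched here, assuming each panel model casts a boolean vote and agreement is the fraction of votes in favor; `panelAccepts` is a hypothetical name.

```go
package main

import "fmt"

// panelAccepts reports whether the share of positive votes reaches minAgreement.
// Hypothetical helper; an empty panel never accepts.
func panelAccepts(votes []bool, minAgreement float64) bool {
	if len(votes) == 0 {
		return false
	}
	yes := 0
	for _, v := range votes {
		if v {
			yes++
		}
	}
	return float64(yes)/float64(len(votes)) >= minAgreement
}

func main() {
	fmt.Println(panelAccepts([]bool{true, true, false}, 0.6))  // 2 of 3 in favor
	fmt.Println(panelAccepts([]bool{true, false, false}, 0.6)) // 1 of 3 in favor
}
```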


+    - if: "count_source == 'gmaps_cluster' and no_other_source"
+      then: { confidence_multiplier: 0.8, requires_l3_verification: true }


…(PR#9)

prompts/judge.yaml:
- Full constraint spec: evidence, source_priority, confidence_rules/penalty
- hallucination_guards with refuse_phrases (JA + EN)
- output_schema with required/optional fields
- validation rules (source_url, source_text, corporate_name)
- ci_checks list, brand_hq_review, judge_panel
- url_must_resolve: true (CI can override with --skip-url-resolve)

test/fixtures/judgement-golden.csv (10 cases):
- #1-3: true positives (TODO_REAL_URL; replace with real URLs before merging)
- #4: true negative (fewer than 10 stores)
- #5: true negative (no store count stated) → reject
- #6: false-positive candidate (brand HQ name mixed in) → confidence<0.5
- #7-8: source_must_contain_claim violation → reject
- #9: ambiguous expression ("about 20 stores") → confidence<0.5
- #10: empty evidence_urls → reject

test/fixtures/cluster-golden.csv (8 cases):
- Covers the ClusterThreshold=4 boundary from both above and below

cmd/validate-judge/main.go:
- ValidationRule struct + applyRules()
- Single-output mode (--output)
- golden.csv batch mode (--golden)
- --skip-url-resolve flag (skips fixture URLs in CI)

Copilot AI review requested due to automatic review settings April 24, 2026 15:49

Copilot started reviewing on behalf of clearclown April 24, 2026 15:49

Copilot AI reviewed Apr 24, 2026

aegis-ceo and others added 12 commits April 25, 2026 01:44

chore: move box.go to PR#8 FU branch

c3195b1

feat(judge): fill out refuse_phrases (add noun-ending forms and English phrases)

58dcf18

feat(validate-judge): add unknown_source_rejected rule (sources outside source_priority are errors)

683d156

refactor(validate-judge): read allowed sources dynamically from judge.yaml via SourcePriorityNames() (remove hardcoding)

3801de6

feat(judge): add hard_rules section (gmaps_requires_l3 + unknown_source_rejected)

c512adb

feat(judge+validate-judge): add skip_domains (exclude EDINET API endpoints from the HEAD check)

0e246ae

feat(judge): add count_source not in source_priority → error to validation

e12e238

feat(judge): add standalone_max: 0.0 to source_priority gmaps_cluster

8c06574

feat(judge): add system_prompt key (alias of hallucination_instruction)

c3df934

fix(golden+validate-judge): add corporate_name_from_source column, fix CSV column indexes

8d7fa30

feat(validate-judge): add separate validation per source_has_claim value (true/false/partial)

b88840a

clearclown force-pushed the feat/judge-yaml-and-golden branch from 089720a to b88840a Compare April 24, 2026 16:44

clearclown merged commit d138d53 into main Apr 24, 2026
2 of 7 checks passed

clearclown merged 12 commits into main from feat/judge-yaml-and-golden
