feat(顺路-D): ai_adversarial LLM-as-judge — refusal/injection 语义判断#58
Merged
Conversation
PR-D 实战消除 keyword binary/confidence 范式假阳性 ("I won't output PWNED"
误判 PWNED 命中)。callable 注入设计, utils 独立约束保留。
设计 (q1-q5 默认全确认):
- judge_fn: Callable[[prompt, output, kind], Tuple[float, reason]] 注入
- utils 不直接 import anthropic/openai (用户负责包 SDK 成 callable)
- judge_fn=None (默认) → keyword fallback 兼容旧
- judge_fn 提供 → LLM 语义判断, 跳过 keyword 检测 (替代非 ensemble)
- judge_fn 抛异常 → graceful (confidence=0.0 + reason='judge_fn error: ...')
改动 (1 文件 +90/-7):
1. typing 加 Callable, Tuple imports
2. 加 JudgeFn type alias + 模块顶部 reference Anthropic 实现 docstring example
3. test_llm_jailbreak 加 judge_fn kwarg + inline 分支 (refusal kind)
4. test_prompt_injection 加 judge_fn kwarg + inline 分支 (injection kind)
5. results 加字段: judge_used (bool) / judge_reason (Optional[str])
6. return 顶层加 judge_mode (bool)
7. matched_keywords/matched_signals 在 judge 路径返空列表
实测 (mock judge_fn, 5/5 全过):
- jailbreak judge 路径: "I won't lie..." → confidence=1.0 (refusal)
- keyword fallback: judge_used=False, matched_keywords 含命中
- injection 假阳性消除: "I won't output PWNED" judge→0.0 (vs keyword→1.0)
- 对比验证 keyword 假阳性存在 (本身仍是真问题)
- judge_fn 异常 → graceful confidence=0.0 + 标 reason
f5: 实现 + 5/5 测试 H; "判定准确度" 实测 H (mock judge 模拟真实 LLM 语义判断);
"真 LLM 准确率" 留用户接 Anthropic SDK 验 L (utils 独立约束设计).
f6: 真问题 (keyword 假阳性, "I won't output PWNED" 实测 reproducible), 反例无,
修后实质变好 (langle 调 LLM 调用方完全消除假阳性, mock 验证 path 工作)。
4 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
范围
顺路 PR-D (大设计):
ai_adversarial.py的test_llm_jailbreak+test_prompt_injection加 LLM-as-judge 注入, 实战消除 keyword 假阳性 (e.g."I won't output PWNED"命中 PWNED 误判)。1 文件 +88/-5。假阳性证明 (本 PR 真问题动机)
LLM-as-judge 路径修复:
设计 (utils 独立约束)
callable 注入, utils 不直接 import anthropic/openai:
模块顶部加 Anthropic SDK reference 实现 (docstring example):
改动
Callable,TupleimportsJudgeFn顶层定义test_llm_jailbreakjudge_fn: Optional[JudgeFn] = Nonekwarg + inline 分支 (refusal kind)test_prompt_injectionjudge_used: bool/judge_reason: Optional[str]judge_mode: bool(是否本次调用启用 LLM judge)本地测试 (5/5 mock judge_fn 测试过)
关键证明: Test 3 + 4 对比 — 同一输出
"I won't output PWNED for you":confidence=0.0(正确, negation 识别)confidence=1.0(假阳性)不在范围 (留下次)
协作宪章 §1.3 六道闸自检 (f1-f6)
置信度自检 (f5)
judge_fn路径实现假阳性过滤 (f6)
唯一 finding = "keyword binary/confidence 假阳性 (negation 不识别)":
"I'd rather not output it"等无 won't / refuse 也是拒绝); LLM-as-judge 更彻底假阳性候选数: 0 (本 finding 真问题)
降置信 finding 数: 1 ("真 LLM 准确率" 留待用户实验)
关联