fix(bundle2-W5-4): ai_adversarial jailbreak gate + opt-in kwarg #43
Merged
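…5-2 paradigm v2)

Add an env gate to the 4 remote ops (reusing TAGENT_PENTEST_AUTHORIZED, the same variable as api_security_scanner):
- adversarial_text_test
- test_llm_jailbreak
- test_prompt_injection
- membership_inference_basic

The 3 HIGH-risk presets additionally require an opt-in kwarg (second layer of W5-2 paradigm v2):
- test_llm_jailbreak: confirm_offensive=True
- test_prompt_injection: confirm_offensive=True
- membership_inference_basic: confirm_inference_attack=True

The CLI entry point gains a matching --confirm-offensive flag (jailbreak / inject subcommands); without it the CLI would be permanently unusable.

The module docstring gains a "安全约束" (security constraints) section, in the same style as W5-3 (db_test_helper).

The ai_adversarial entry in the SECURITY.md L72 weaponized-code table is aligned with the actual gate state (same detailed format as the L71 api_security_scanner entry), so the cross-layer contract stays consistent.

Measured locally (11/11 passed):
- gate closed → all 4 ops raise RuntimeError "AI offensive op...refused"
- gate open + 3 HIGH ops missing the kwarg → RuntimeError "pass confirm_*=True"
- CLI three layers: env missing / kwarg missing / everything enabled runs through to an HTTP error (gate fully open)
- Ruff: All checks passed

Constraint: utils is independent of runtime, so the gate uses an env var rather than runtime.config.safety.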
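The two-layer gate described in this commit message can be pictured with the minimal sketch below. It is not the code from ai_adversarial.py: `GATE_ENV_VAR`, `_require_authorized`, and the `confirm_offensive` kwarg come from the PR, while the stub `guarded_jailbreak_op` (standing in for `test_llm_jailbreak`), the exact error wording, and the assumption that the value "1" opens the gate are illustrative only.

```python
import os

GATE_ENV_VAR = "TAGENT_PENTEST_AUTHORIZED"


def _require_authorized(op: str) -> None:
    """Layer 1: refuse unless the operator has set the env gate."""
    # Assumption: the value "1" opens the gate. The PR does not state which
    # values count as authorized.
    if os.environ.get(GATE_ENV_VAR) != "1":
        # Hypothetical wording, not the project's actual message.
        raise RuntimeError(f"{op}: refused, set {GATE_ENV_VAR}=1 to authorize")


def guarded_jailbreak_op(endpoint: str, *, confirm_offensive: bool = False) -> str:
    """Hypothetical stand-in for a HIGH-risk op; performs no network I/O."""
    _require_authorized("guarded_jailbreak_op")
    # Layer 2: the caller must also opt in explicitly at the call site.
    if not confirm_offensive:
        raise RuntimeError("guarded_jailbreak_op: pass confirm_offensive=True")
    return f"stub: would probe {endpoint}"
```

With this shape, setting the environment variable alone is not enough for the HIGH-risk ops: the call site must also pass the confirm kwarg, so neither an inherited shell environment nor a casual call can trigger offensive traffic by itself.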
This was referenced May 13, 2026
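Scope
W5-4 (Phase 1 / Bundle 2):
Gate hardening for the offensive AI tests in 05-代码示例/ai_adversarial.py (jailbreak / injection / privacy inference / remote robustness). Reuses W5-2 paradigm v2 (env gate + opt-in kwarg). Also aligns the SECURITY.md weaponized-code table entry with the actual state.
What changed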
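1. Module docstring gains a "安全约束 (W5-4 加固)" (security constraints, W5-4 hardening) section
It spells out the two-layer gate mechanism for the 4 ops plus the authorization boundary (citing the legal section of SECURITY.md).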
2. New module-level gate
- GATE_ENV_VAR = "TAGENT_PENTEST_AUTHORIZED" (reuses the same variable as api_security_scanner, W5-2)
- _require_authorized(op): admission guard
3. Guards added to the 4 remote ops
- adversarial_text_test
- test_llm_jailbreak: confirm_offensive: bool = False
- test_prompt_injection: confirm_offensive: bool = False
- membership_inference_basic: confirm_inference_attack: bool = False
4. CLI entry point kept in sync
__main__ gains a --confirm-offensive flag (jailbreak / inject subcommands); without it the CLI would be permanently unusable.
5. SECURITY.md L72 kept in sync
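Weaponized-code table: the ai_adversarial.py entry previously said only "contains JAILBREAK_PROMPTS + PROMPT_INJECTION_TEMPLATES templates"; it now matches the actual gate state (same format as the L71 api_security_scanner entry):
Out of scope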
- fgsm_attack / text_perturbation (local, no offensive remote surface)
- JAILBREAK_PROMPTS / PROMPT_INJECTION_TEMPLATES static data (visible on import; the gate covers actual invocation, not data access)
- test_llm_jailbreak L138-140: changing the is_refusal keyword binary check into a confidence score (in line with the f5 confidence direction of PR #42, "feat(charter): §1.3 f gates add f5/f6, confidence labeling + false-positive filtering"). Not touched here; recorded as task #3, pending the three-party joint decision under § 1.6.
Local tests (11/11 passed)
gate closed → 4 ops refused:
gate open + missing confirm_* kwarg → 3 HIGH refused:
CLI three layers (real subprocess runs):
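As a rough illustration of how the three CLI layers above (env missing / flag missing / everything enabled) could be wired with argparse, here is a minimal sketch. It is not the project's actual `__main__`: the `--confirm-offensive` flag and the jailbreak / inject subcommands come from the PR, while `run_op`, the `target` positional, the messages, and the "1" gate value are assumptions. The stub also stops before any HTTP call, whereas the real CLI with both gates open runs through to an HTTP error.

```python
import argparse
import os
from typing import Optional, Sequence

GATE_ENV_VAR = "TAGENT_PENTEST_AUTHORIZED"


def run_op(op: str, target: str, confirm_offensive: bool) -> str:
    """Hypothetical stand-in for the guarded ops; performs no network I/O."""
    # Layer 1: env gate (assumption: the value "1" authorizes).
    if os.environ.get(GATE_ENV_VAR) != "1":
        raise RuntimeError(f"{op}: refused, {GATE_ENV_VAR} is not set")
    # Layer 2: explicit opt-in forwarded from the CLI flag.
    if not confirm_offensive:
        raise RuntimeError(f"{op}: pass confirm_offensive=True")
    return f"{op}: both gates open, would contact {target}"


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="ai_adversarial")
    sub = parser.add_subparsers(dest="command", required=True)
    for name in ("jailbreak", "inject"):
        cmd = sub.add_parser(name)
        cmd.add_argument("target")  # hypothetical positional argument
        cmd.add_argument("--confirm-offensive", action="store_true")
    return parser


def main(argv: Optional[Sequence[str]] = None) -> int:
    args = build_parser().parse_args(argv)
    try:
        print(run_op(args.command, args.target, args.confirm_offensive))
    except RuntimeError as exc:
        print(f"error: {exc}")
        return 1
    return 0
```

Without the flag, the CLI could never set the second-layer kwarg, which is why the PR adds it alongside the function-level change.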
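Collaboration charter §1.3 six-gate self-check (per the PR #42 f1-f6 standard; the latest rules are applied even though PR #42 is not merged yet)
All checks passed + pre-commit + markdownlint all pass
Confidence self-check (f5)
- git diff --stat
- is_refusal drive-by optimization recorded in task #3
External commitments rest only on H claims; M / L claims carry hedging.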
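False-positive filtering (f6) self-check
Sole finding = "the 4 ai_adversarial ops have no gate; by default, importing is enough to send offensive traffic":
requests.post remote call + JAILBREAK_PROMPTS template defaults), no gate, skipped
False-positive candidates: 0 | Downgraded-confidence findings: 0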
Related