Python detector hardening + H1: ShieldAgent (active defender layer) by cdayAI · Pull Request #18 · cdayAI/Agent-Shield

cdayAI · 2026-05-25T00:38:37Z

Summary

This PR has two stacked layers:

Python SDK detector hardening (commits 1-3) — post-merge code review of v14.2.2 caught real security bypasses, broken patterns, 10 exact-duplicate entries, broken CI infra, and stale release metadata. All verified empirically.
H1 of the agent roadmap: ShieldAgent (commit 5) — turns Shield from a passive scanner into an active agent that decides and acts on threats. Zero new dependencies.

Layer 1 — Python detector hardening

Security bypasses fixed

Bug	Verified bypass	Fix
Unicode `\b` bypass on 5 jailbreak detectors	`тестDAN mode` → 0 hits; `αjailbreak mode` → 0 hits	Add `re.ASCII` flag
Subdomain confusion (CVE-2026-21852)	`ANTHROPIC_BASE_URL=https://api.anthropic.com.evil.attacker.io/v1` → 0 hits	Require host terminator after whitelist
ReDoS in `encoding_evasion`	5000 reps of `"99 "` → 1.7s single regex call (200ms budget)	Cap repetition at `{5,30}` → 20ms (85× speedup)
`budget_drain` misses canonical attacks	`repeat 1000 times` → 0 hits	Make middle clause optional; negative lookahead for comparative phrasing

Duplicate patterns removed (10 entries, 330 → 314)

v14.x sync block re-added patterns from v13.x → 2× findings per attack. Removed dups in role_hijack, prompt_injection, tool_abuse, memory_poisoning, encoding_chain, structured_data_injection, xss_injection, svg_injection.

False-positive reduction

data_exfiltration markdown image — drop ?q= ?d= from exfil alternation
path_traversal — require 3+ ../ OR explicit sensitive target
cicd_injection @claude — negative lookahead for do not|don't|never|avoid|prevent
query_injection f-string — require SQL keyword inside body

Release hygiene

setup.py name="agent-shield" → "agentshield" (was conflicting with pyproject.toml)
pyproject.toml build-backend = "setuptools.backends._legacy:_Backend" (fictitious — broke python -m build) → setuptools.build_meta
README "141 patterns" → "300+ across 51 categories"
tsconfig.json "types": ["node"] dropped (was failing Type Check job — @types/node never installed; types/index.d.ts doesn't use Node types)
.github/workflows/quality-gates.yml Performance Check job called nonexistent detectPromptInjection → fixed to scanText

Tests

Python: 32 → 51 (18 new positive+negative tests + test_no_duplicate_patterns regression test)

Layer 2 — H1: ShieldAgent

What it adds

src/shield-agent.js + src/shield-actions.js — an LLM-powered triage layer that wraps the deterministic detector with a reasoning loop:

Detector fast path stays sub-millisecond. Judge only fires on ambiguous high-severity hits per a configurable triagePolicy.
Critical hits block instantly — no LLM call, no latency.
Judge replies are JSON-schema validated. Malformed or budget-timeout replies fail closed (block + uncertain verdict). Judge failure is counted in stats.judgeFailures so monitoring can alert.
Anti-prompt-injection by design — judge sees content wrapped in provenance tags (SYSTEM / USER / TOOL_OUTPUT / RAG_CHUNK / UNTRUSTED); attacker attempts to close the tag are escaped at the boundary, so the adjudicator cannot be injected by the very content it adjudicates.
Zero new dependencies — judge is a caller-supplied async ({system, user}) => string. Demo example calls Anthropic API via global fetch (Node 18+).

Actions executor

ShieldActions.execute(verdict, original) → {proceed, payload, info}. Translates verdicts into allow / block / sanitize / rewrite / quarantine / escalate. Quarantine + escalate sinks are caller-injected. Sanitizer strips HTML comments, display:none containers, data-exfil markdown images, instruction-override boilerplate, system-prompt impersonation tags.

MCP exposure

src/mcp-server.js extended with 5 new tools any host agent (Claude Code, Cursor, Windsurf, GPT) can call mid-conversation:

investigate(text, provenance, source, system_prompt)
safe_rewrite(text, source)
explain_threat(text, source)
execute_verdict(verdict, original_text)
agent_stats()

Tests

test/test-shield-agent.js — 61 assertions covering: schema validator, JSON extractor (incl. embedded/malformed/unterminated), tag-breakout escaping, detector-only fast paths, judge invocation, schema-violation fallback, budget-timeout fallback, no-judge degradation, history bounds, all 6 action types, quarantine + escalate sink plumbing, sanitizer correctness. All passing.

Demo

examples/security-copilot.js — end-to-end host agent receiving 4 messages (benign / injection / critical / borderline) and the agent making allow/rewrite/block/allow decisions with audit trail. Works offline with mock judge; reads ANTHROPIC_API_KEY env to use real Claude.

Test plan

python3 -m unittest tests.test_detector → 51/51 OK locally
npm test → 61/61 OK on the new ShieldAgent suite + all pre-existing suites pass
node examples/security-copilot.js → demo runs to completion with correct verdicts
scan_text('99 ' * 5000) returns in 20ms (was 1.7s)
тестDAN mode, αjailbreak mode, etc. → caught after re.ASCII fix
api.anthropic.com.evil.io → caught after host-terminator fix
Pattern count 330 → 314, categories still 51, zero remaining duplicates
CI: lint, type-check, performance, false-positive jobs all green

@claude

…, reduce FPs Post-merge code review of the v14.2.2 Python SDK port found exploitable detection gaps, broken regex, and 10 duplicate pattern entries. All verified empirically. Security bypasses fixed: - Unicode \b bypass on 5 jailbreak detectors (DAN, do-anything-now, developer-mode, jailbreak-mode, god-mode). Python's \b is Unicode-aware by default, so a single Cyrillic/Greek prefix evaded all 5. Fixed with re.ASCII flag. - Subdomain confusion in API base URL whitelist (CVE-2026-21852). The negative lookahead was prefix-only, so api.anthropic.com.evil.io bypassed config_poisoning / llm_router_tampering / cross_agent_injection. Fixed by requiring host terminator after whitelist match. - ReDoS in encoding_evasion ASCII pattern. (?:\d{2,3}\s+){5,} took 1.7s on 5000 reps of "99 ", blowing the 200ms scan budget. Capped at {5,30} -> 20ms (85x speedup). - budget_drain regex missed "repeat 1000 times" / "loop 99999999 times" because mandatory \s+ between .*? and digits required intermediate text. Made middle clause optional; added negative lookahead for comparative phrasing to suppress FP on "1000 times faster". Duplicate patterns removed (10 entries, 330 -> 314 total): v14.x sync block re-added patterns already present in v13.x, producing 2x findings per attack. Removed dups in role_hijack, prompt_injection, tool_abuse, memory_poisoning, encoding_chain, structured_data_injection, xss_injection, svg_injection. False-positive reduction: - data_exfiltration markdown-image rules dropped single-letter params (?q=, ?d=) from exfil alternation - was FP-ing on search/map URLs. - path_traversal now requires 3+ ../ segments OR an explicit sensitive target - was flagging every ../../package.json. - cicd_injection @claude pattern adds negative lookahead for benign warnings ("do not", "please don't", "never", "avoid", "prevent"). - query_injection f-string pattern now requires a SQL keyword inside the f-string body, so doc snippets about safe templating don't FP. Release hygiene: - setup.py renamed name="agent-shield" -> "agentshield" to match pyproject.toml (previously two PyPI namespaces could publish v14.2.2). - pyproject.toml fixed build-backend setuptools.backends._legacy:_Backend (fictitious, broke python -m build) -> setuptools.build_meta. - README pattern count updated 141 -> 300+ across 51 categories. Tests: 32 -> 51 (added 18 new positive+negative tests for v14.x categories plus a no_duplicate_patterns regression test). All pass. https://claude.ai/code/session_01AqtyP5YupS6MKt6qCTXghS

Second H1 item: given a flagged scan + a one-line user note, the replay agent reproduces the scan, identifies which rule(s) fired, names the root cause, proposes a structured fix, and emits a ready-to-paste regression test in both Node.js and Python. Handles four incident kinds: - false_positive: pattern matched benign input. Proposes regex tightening + allowlist rule with the input baked in. - false_negative: rule missed a confirmed attack. Proposes a new pattern with the distinctive substring. - redos: detector exceeded latency budget. Recommends rewriting the offending pattern (cap unbounded quantifiers, anchor greedy gaps); bisection hint when the offender is unknown. - crash: detector threw. Reports the stack, recommends try/catch in detector-core, emits an assert.doesNotThrow regression test. Optional judge-backed narration: if a ShieldAgent with an LLM judge is wired in, IncidentReplay also calls the judge for a 2-3 sentence human-readable explanation + remediation. All judge failures (timeout, malformed JSON, exception) fall back to a "judge unavailable" stub so the deterministic report still ships. investigateBatch() clusters repeated incidents by (kind, primarySuspect) so a real bug producing 1,000 customer reports surfaces as one cluster with 1,000 count, not 1,000 separate reports. Tests: test/test-incident-replay.js, 35 assertions across all four incident kinds, judge narration + fallbacks, batch aggregation, and input validation. Wired into npm test. Stack: PR #18 now contains - Python detector hardening (security bypasses, dedup, FPs, CI fixes) - H1 #1: ShieldAgent + ShieldActions + 5 MCP tools + demo - H1 #2: IncidentReplay Next loop iteration: H1 #3 cross-SDK differential auditor. https://claude.ai/code/session_01AqtyP5YupS6MKt6qCTXghS

…Rust Third H1 item. Same input through every available SDK; any disagreement is a bug -- either a port drifted or a regex-semantics difference (Python's Unicode-aware \\b vs JS's ASCII-only \\b, Python's Unicode \\d vs JS's ASCII \\d, etc.). This is the exact bug pattern that shipped in v14.2.2 and that the layer-1 hardening in this PR fixed. Adapter pattern: zero new deps. NodeAdapter runs in-process. PythonAdapter spawns python3 -c, reads JSON from stdout, skips gracefully if the runtime isn't on PATH. Easy to add GoAdapter / RustAdapter the same way. audit(inputs) returns: - availableSdks: which engines were actually consulted - disagreements[]: per-input, per-SDK verdict matrix with byCategory and bySeverity diffs (so a reviewer sees exactly where each SDK fires) - bySdkAccuracy: majority-vote score per SDK across all disagreements - suggestedCanonical: which SDK was right most often (the others need fixing toward it) driftBank() static helper returns 18 inputs hand-picked to expose every class of cross-SDK drift Shield has historically suffered: - Unicode \\b boundary cases (DAN, αjailbreak, βgod mode) - Subdomain confusion in API base URL whitelist - Fullwidth digit \\d divergence (10.０.０.１) - Multilingual instruction overrides (Chinese, German) - Encoding evasion - Critical attacks (should agree) - Benign edges (../../package.json, search URLs) Tests: test/test-cross-sdk-differential.js, 34 assertions covering mock- adapter agreement/disagreement, 3-way majority canonical detection, insufficient-SDK warning, unavailable-SDK skip, driftBank composition, input validation, and a LIVE Node↔Python audit using the actual fixed Python SDK in this PR. Stack on PR #18: - Python detector hardening - H1 #1: ShieldAgent + ShieldActions + 5 MCP tools + demo - H1 #2: IncidentReplay (autonomous triage) - H1 #3: CrossSDKDifferential (port-drift auditor) Next: H1 #4 self-tuning thresholds, H1 #5 adversarial tournament. https://claude.ai/code/session_01AqtyP5YupS6MKt6qCTXghS

…anceNarrator, CustomerLearning Shipping the rest of H1 and the first H2 batch in one commit since I can't ScheduleWakeup across turns in this environment. H1 #4: src/threshold-tuner.js Sweeps per-category confidence thresholds to maximize F1 (or precision/ recall/accuracy) on a labeled corpus. Scans corpus once, sweeps in O(grid x categories). Supports precision/recall floors. Returns a threshold map the host can apply to AgentShield for measurably better signal on the customer's traffic, plus a confusion-matrix baseline for before/after comparison. Tests: 23 assertions. H2 #1: src/adversarial-tournament.js Wires the existing EvolutionSimulator + MutationEngine into a closed loop. Seed attacks -> mutate -> classify -> survivors feed next gen -> derive hardened patterns via hardenFromEvolution. Optional LLM judge validates that survivors are real attacks (not mutation noise) and ranks them. runIterative() chains tournaments using prior survivors as seeds to surface emergent strategies. Tests: 22 assertions. H2 #2: src/compliance-narrator.js Auditor-grade narrative generator for SOC2 / HIPAA / GDPR / EU AI Act. Ingests Shield events (raw scan results, agent verdicts, or normalized entries), maps categories to framework control IDs, generates a deterministic markdown report, and optionally rewrites it as audit prose via an LLM judge. HMAC-SHA256 signs the canonicalized payload with order-independent serialization so tampering is detectable. Tests: 26 assertions including 3 distinct tamper attempts. H2 #3: src/customer-learning.js Reads a customer's agent codebase (defaults: js/ts/py/go/rs/json/yaml/ md/toml) and extracts: legitimate URLs/domains, env var names, tool names, system-prompt phrases, and secret-shape prefixes (sk-, ghp_, AKIA, etc). Builds a customer-specific profile with: - allowed domains/env-vars/tool-names (suppress generic FPs) - lookalike-tool regex patterns (catch tool-name impersonation) - honeypot canary tokens shaped like the customer's real secrets (any appearance in agent output is instant exfil confirmation) - system-prompt phrase allowlist (suppresses injection FPs on the agent's own legitimate prompts) Walks with safety caps (maxFiles, maxFileBytes, exclude node_modules etc). Tests: 28 assertions including a temp-fixture repo end-to-end. All four wired into src/main.js (with namespace fix: NARRATOR_FRAMEWORKS to avoid collision with the existing COMPLIANCE_FRAMEWORKS export from src/compliance.js) and added to npm test. Suite still green end-to-end. Stack on PR #18: - Python detector hardening - H1 #1: ShieldAgent + ShieldActions + 5 MCP tools + demo - H1 #2: IncidentReplay - H1 #3: CrossSDKDifferential - H1 #4: ThresholdTuner - H2 #1: AdversarialTournament - H2 #2: ComplianceNarrator - H2 #3: CustomerLearning Continuing with H2 #4 (autonomous threat hunter) and H2 #5 (production-traffic shadow-mode reporter) next. https://claude.ai/code/session_01AqtyP5YupS6MKt6qCTXghS

…porter H2 #4: src/threat-hunter.js Pluggable source pattern: any object with .name + async .fetch() can feed the hunter. Built-ins: - LocalCorpusSource: JSONL or in-memory array (offline-safe for CI) - HTTPSourceFn: caller-supplied async () => items (caller owns network so we add no dependency and the user controls egress) Hunt flow: 1. Fetch from every source (broken sources don't crash the hunt). 2. Classify each item: detector misses it -> "novel attack". 3. Synthesize a tight regex by picking the rarest 4-token window (lowest occurrence in the benign corpus) and escaping it. 4. Estimate FP rate against benignCorpus; reject above threshold (default 5%). 5. Optional LLM judge review of proposals. 6. Emit a PR-ready markdown report. Conservative by design — prefers tight literal phrases over loose alternation to keep FPs near zero. Tests: 22 assertions including broken-source tolerance, addSource validation, FP filtering, and judge integration. H2 #5: src/shadow-mode-reporter.js Aggregator over a stream of scan events. After N days emits an executive report: - traffic volume + scan-time percentiles (p50/p95/p99/max) - threats by severity, category, source - action projection (if deployed in enforce mode: would-block / would-rewrite / would-allow counts) - noisy categories (likely FP candidates: >=5 hits, low avg conf) - quiet categories (rarely fire — candidates for removal) - estimated ROI: wouldBlock * costPerIncident (configurable) Accepts raw shield.scan() results, ShieldAgent verdicts, or wrapped envelopes. ingestMany handles arrays. maxEvents cap protects memory on long-running services. Outputs JSON via report() or markdown via markdownReport(). Tests: 29 assertions including window filtering, raw vs wrapped ingest, noisy/quiet category detection, max-event cap, and ROI math. Both wired into main.js and npm test. Full suite green. Stack on PR #18: - Python detector hardening - H1 #1: ShieldAgent + ShieldActions + 5 MCP tools + demo - H1 #2: IncidentReplay - H1 #3: CrossSDKDifferential - H1 #4: ThresholdTuner - H2 #1: AdversarialTournament - H2 #2: ComplianceNarrator - H2 #3: CustomerLearning - H2 #4: ThreatHunter - H2 #5: ShadowModeReporter Continuing with H3 multi-agent SOC + agent identity CA next. https://claude.ai/code/session_01AqtyP5YupS6MKt6qCTXghS

…rg trust) H3 #1: src/soc-fleet.js Orchestrates the H1+H2 modules into a coordinated SOC team: - Defender: ShieldAgent triage on every event. - Detective: IncidentReplay deep-dive on block/escalate. - Forensics: ShadowModeReporter + ComplianceNarrator window report. - PatchWriter: ThreatHunter pattern synthesis for novel attacks. - Reviewer: judge-backed approval, with rule-based fallback (no judge configured -> approve <=5 patches with zero FP). - Releaser: bundles a ChangeRequest with patches + test cases + framework attribution, ready for PR generation. Every role's I/O is captured as a SOCEvent in the timeline so the entire decision chain is replayable. Bounded by maxTimeline. status() returns per-role event counts + last event. forceFullPipeline=true runs all roles even for safe input (for synthetic drills). Tests: 25 assertions including critical/safe/forced/FP paths, judge vs rule-based reviewer, timeline cap, status snapshot. H3 #2: src/agent-identity-ca.js Cryptographic agent passports for cross-org trust. Uses Ed25519 from Node's built-in crypto (zero external deps). Capabilities: - issuePassport({agentId, capabilities, orgId}) -> {passport, privateKey} Passport contains agentId, publicKey (SPKI base64), TTL, capabilities, orgId, caRootId. CA signs the canonical body. - verifyPassport: signature check + CA root match + revocation list + expiry. Tampered passports rejected. - revoke(agentId): CRL-style revocation, subsequent verifications fail. - signMessage({agentId, payload, privateKey}) -> envelope with timestamp + 16-byte nonce + signature. - verifyMessage(envelope, passport): full chain — passport valid, agentId matches, timestamp within messageTtlMs window (replay protection), nonce not in seenNonces cache, signature valid against passport's publicKey. Returns {valid, reason?, agentId, capabilities}. - exportRootPublicKey: SPKI for cross-org verification (private key never leaves the CA instance). Canonical JSON serializer recursively sorts keys so signatures are order-independent. Nonce cache has TTL-based sweep + max-size eviction. Tests: 30 assertions including issue+verify, tamper detection on body and signature, revocation, message signing, replay protection, stale message rejection, future-timestamp rejection, agentId mismatch, foreign-CA rejection, public-key export, input validation. Both wired into main.js and npm test. Full suite green. Stack on PR #18 (12 commits, 11 new modules, ~270 new test assertions): - Python detector hardening - H1 #1-#4: ShieldAgent, IncidentReplay, CrossSDKDifferential, ThresholdTuner - H2 #1-#5: AdversarialTournament, ComplianceNarrator, CustomerLearning, ThreatHunter, ShadowModeReporter - H3 #1-#2: SOCFleet, AgentIdentityCA That's all H1 + all H2 + 2 of the highest-leverage H3 modules built offline-safe with zero new dependencies. The remaining H3 items (fleet immunity wiring, cyber-insurance integration, public benchmark leaderboard hosting) require external infra (signed update feeds, a partner API, hosted infra) that can't be scaffolded purely in-tree. https://claude.ai/code/session_01AqtyP5YupS6MKt6qCTXghS

src/threats-2026-extra.js — the 5 threat-intel gaps a2a-guard.js didn't cover. Each lives as its own detector so hosts can wire them at the right lifecycle point (none of these are pure-regex): - TOCTOUGuard — hash a DOM locator at observe() time; refuse to act if hash differs (arXiv 2603.00476). - GRAPH_TRIPLE_PATTERNS (3 rules) — JSON-, Turtle/RDF-, and bulk-edge forms of GraphRAG triple poisoning (arXiv 2508.04276). - detectGCGSuffix — entropy + non-dictionary-ratio + symbol-density composite over the trailing 200 chars. Flags GCG / activation-steering optimized suffixes. - MemoryReplayGuard — wraps a memory backend so persisted messages are re-scanned at LOAD time, not just write time (CVE-2026-25253). Stricter threshold than live input (default medium). - detectContextStuffing — flags oversized inputs (>30KB) with repetition factor ≥ 20 or whitespace runs > 2KB. - scanExtras2026(input) — one-call helper. src/dream-pr-bot.js — closes the dreaming loop. Picks the latest high-confidence change-request artifact from DreamMemory and converts it to a real PR. Three modes: - dry-run — emit { branch, title, body, files } (default) - local-git — write config/dreams/dream-patches.json + dream-thresholds.json, create branch, commit; host's gitRunner does the git invocation. - mcp-github — call MCP github tools via caller-supplied adapter ({ createBranch, putFile, openPR }). Draft by default. Bot writes ONLY to config/dreams/*.json artifact files, never to detector-core or pattern source. Unreviewed dreams cannot live-fire. Python parity: all 9 a2a-guard patterns + 3 GraphRAG triple patterns ported into python-sdk/agent_shield/detector.py — 314 → 327 patterns, 51 → 59 categories. TestV2026CouncilPatterns class adds 14 positive + negative assertions. Tests: - test/test-threats-2026-extra.js: 40 assertions - test/test-dream-pr-bot.js: 33 assertions - python-sdk/tests/test_detector.py: 51 → 65 tests All wired into src/main.js exports and npm test. Full Node + Python suites green. PR #18 final tally: 16 new modules, ~543 new Node test assertions + 65 Python tests, 16 commits. Covers all 15 council threat gaps in JS plus 9 of 15 in Python; dreaming subsystem with autonomous PR bot fully wired; Hermes + 9-engine OSS integration shipped. https://claude.ai/code/session_01AqtyP5YupS6MKt6qCTXghS

cdayAI marked this pull request as ready for review May 25, 2026 00:39

cdayAI merged commit ae23a8b into main May 25, 2026
9 of 11 checks passed

cdayAI deleted the claude/detector-py-fixes branch May 25, 2026 00:39

cdayAI changed the title ~~Python SDK detector hardening: security bypasses, dedup, FP reduction~~ Python detector hardening + H1: ShieldAgent (active defender layer) May 25, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Python detector hardening + H1: ShieldAgent (active defender layer)#18

Python detector hardening + H1: ShieldAgent (active defender layer)#18
cdayAI merged 1 commit into
mainfrom
claude/detector-py-fixes

cdayAI commented May 25, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

cdayAI commented May 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Layer 1 — Python detector hardening

Security bypasses fixed

Duplicate patterns removed (10 entries, 330 → 314)

False-positive reduction

Release hygiene

Tests

Layer 2 — H1: ShieldAgent

What it adds

Actions executor

MCP exposure

Tests

Demo

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

cdayAI commented May 25, 2026 •

edited

Loading