Python detector hardening + H1: ShieldAgent (active defender layer)#18
Merged
Conversation
…, reduce FPs Post-merge code review of the v14.2.2 Python SDK port found exploitable detection gaps, broken regex, and 10 duplicate pattern entries. All verified empirically. Security bypasses fixed: - Unicode \b bypass on 5 jailbreak detectors (DAN, do-anything-now, developer-mode, jailbreak-mode, god-mode). Python's \b is Unicode-aware by default, so a single Cyrillic/Greek prefix evaded all 5. Fixed with re.ASCII flag. - Subdomain confusion in API base URL whitelist (CVE-2026-21852). The negative lookahead was prefix-only, so api.anthropic.com.evil.io bypassed config_poisoning / llm_router_tampering / cross_agent_injection. Fixed by requiring host terminator after whitelist match. - ReDoS in encoding_evasion ASCII pattern. (?:\d{2,3}\s+){5,} took 1.7s on 5000 reps of "99 ", blowing the 200ms scan budget. Capped at {5,30} -> 20ms (85x speedup). - budget_drain regex missed "repeat 1000 times" / "loop 99999999 times" because mandatory \s+ between .*? and digits required intermediate text. Made middle clause optional; added negative lookahead for comparative phrasing to suppress FP on "1000 times faster". Duplicate patterns removed (10 entries, 330 -> 314 total): v14.x sync block re-added patterns already present in v13.x, producing 2x findings per attack. Removed dups in role_hijack, prompt_injection, tool_abuse, memory_poisoning, encoding_chain, structured_data_injection, xss_injection, svg_injection. False-positive reduction: - data_exfiltration markdown-image rules dropped single-letter params (?q=, ?d=) from exfil alternation - was FP-ing on search/map URLs. - path_traversal now requires 3+ ../ segments OR an explicit sensitive target - was flagging every ../../package.json. - cicd_injection @claude pattern adds negative lookahead for benign warnings ("do not", "please don't", "never", "avoid", "prevent"). - query_injection f-string pattern now requires a SQL keyword inside the f-string body, so doc snippets about safe templating don't FP. Release hygiene: - setup.py renamed name="agent-shield" -> "agentshield" to match pyproject.toml (previously two PyPI namespaces could publish v14.2.2). - pyproject.toml fixed build-backend setuptools.backends._legacy:_Backend (fictitious, broke python -m build) -> setuptools.build_meta. - README pattern count updated 141 -> 300+ across 51 categories. Tests: 32 -> 51 (added 18 new positive+negative tests for v14.x categories plus a no_duplicate_patterns regression test). All pass. https://claude.ai/code/session_01AqtyP5YupS6MKt6qCTXghS
cdayAI
pushed a commit
that referenced
this pull request
May 25, 2026
Second H1 item: given a flagged scan + a one-line user note, the replay agent reproduces the scan, identifies which rule(s) fired, names the root cause, proposes a structured fix, and emits a ready-to-paste regression test in both Node.js and Python. Handles four incident kinds: - false_positive: pattern matched benign input. Proposes regex tightening + allowlist rule with the input baked in. - false_negative: rule missed a confirmed attack. Proposes a new pattern with the distinctive substring. - redos: detector exceeded latency budget. Recommends rewriting the offending pattern (cap unbounded quantifiers, anchor greedy gaps); bisection hint when the offender is unknown. - crash: detector threw. Reports the stack, recommends try/catch in detector-core, emits an assert.doesNotThrow regression test. Optional judge-backed narration: if a ShieldAgent with an LLM judge is wired in, IncidentReplay also calls the judge for a 2-3 sentence human-readable explanation + remediation. All judge failures (timeout, malformed JSON, exception) fall back to a "judge unavailable" stub so the deterministic report still ships. investigateBatch() clusters repeated incidents by (kind, primarySuspect) so a real bug producing 1,000 customer reports surfaces as one cluster with 1,000 count, not 1,000 separate reports. Tests: test/test-incident-replay.js, 35 assertions across all four incident kinds, judge narration + fallbacks, batch aggregation, and input validation. Wired into npm test. Stack: PR #18 now contains - Python detector hardening (security bypasses, dedup, FPs, CI fixes) - H1 #1: ShieldAgent + ShieldActions + 5 MCP tools + demo - H1 #2: IncidentReplay Next loop iteration: H1 #3 cross-SDK differential auditor. https://claude.ai/code/session_01AqtyP5YupS6MKt6qCTXghS
cdayAI
pushed a commit
that referenced
this pull request
May 25, 2026
…Rust
Third H1 item. Same input through every available SDK; any disagreement is
a bug -- either a port drifted or a regex-semantics difference (Python's
Unicode-aware \\b vs JS's ASCII-only \\b, Python's Unicode \\d vs JS's
ASCII \\d, etc.). This is the exact bug pattern that shipped in v14.2.2
and that the layer-1 hardening in this PR fixed.
Adapter pattern: zero new deps. NodeAdapter runs in-process. PythonAdapter
spawns python3 -c, reads JSON from stdout, skips gracefully if the
runtime isn't on PATH. Easy to add GoAdapter / RustAdapter the same way.
audit(inputs) returns:
- availableSdks: which engines were actually consulted
- disagreements[]: per-input, per-SDK verdict matrix with byCategory and
bySeverity diffs (so a reviewer sees exactly where each SDK fires)
- bySdkAccuracy: majority-vote score per SDK across all disagreements
- suggestedCanonical: which SDK was right most often (the others need
fixing toward it)
driftBank() static helper returns 18 inputs hand-picked to expose every
class of cross-SDK drift Shield has historically suffered:
- Unicode \\b boundary cases (DAN, αjailbreak, βgod mode)
- Subdomain confusion in API base URL whitelist
- Fullwidth digit \\d divergence (10.0.0.1)
- Multilingual instruction overrides (Chinese, German)
- Encoding evasion
- Critical attacks (should agree)
- Benign edges (../../package.json, search URLs)
Tests: test/test-cross-sdk-differential.js, 34 assertions covering mock-
adapter agreement/disagreement, 3-way majority canonical detection,
insufficient-SDK warning, unavailable-SDK skip, driftBank composition,
input validation, and a LIVE Node↔Python audit using the actual fixed
Python SDK in this PR.
Stack on PR #18:
- Python detector hardening
- H1 #1: ShieldAgent + ShieldActions + 5 MCP tools + demo
- H1 #2: IncidentReplay (autonomous triage)
- H1 #3: CrossSDKDifferential (port-drift auditor)
Next: H1 #4 self-tuning thresholds, H1 #5 adversarial tournament.
https://claude.ai/code/session_01AqtyP5YupS6MKt6qCTXghS
cdayAI
pushed a commit
that referenced
this pull request
May 25, 2026
…anceNarrator, CustomerLearning Shipping the rest of H1 and the first H2 batch in one commit since I can't ScheduleWakeup across turns in this environment. H1 #4: src/threshold-tuner.js Sweeps per-category confidence thresholds to maximize F1 (or precision/ recall/accuracy) on a labeled corpus. Scans corpus once, sweeps in O(grid x categories). Supports precision/recall floors. Returns a threshold map the host can apply to AgentShield for measurably better signal on the customer's traffic, plus a confusion-matrix baseline for before/after comparison. Tests: 23 assertions. H2 #1: src/adversarial-tournament.js Wires the existing EvolutionSimulator + MutationEngine into a closed loop. Seed attacks -> mutate -> classify -> survivors feed next gen -> derive hardened patterns via hardenFromEvolution. Optional LLM judge validates that survivors are real attacks (not mutation noise) and ranks them. runIterative() chains tournaments using prior survivors as seeds to surface emergent strategies. Tests: 22 assertions. H2 #2: src/compliance-narrator.js Auditor-grade narrative generator for SOC2 / HIPAA / GDPR / EU AI Act. Ingests Shield events (raw scan results, agent verdicts, or normalized entries), maps categories to framework control IDs, generates a deterministic markdown report, and optionally rewrites it as audit prose via an LLM judge. HMAC-SHA256 signs the canonicalized payload with order-independent serialization so tampering is detectable. Tests: 26 assertions including 3 distinct tamper attempts. H2 #3: src/customer-learning.js Reads a customer's agent codebase (defaults: js/ts/py/go/rs/json/yaml/ md/toml) and extracts: legitimate URLs/domains, env var names, tool names, system-prompt phrases, and secret-shape prefixes (sk-, ghp_, AKIA, etc). Builds a customer-specific profile with: - allowed domains/env-vars/tool-names (suppress generic FPs) - lookalike-tool regex patterns (catch tool-name impersonation) - honeypot canary tokens shaped like the customer's real secrets (any appearance in agent output is instant exfil confirmation) - system-prompt phrase allowlist (suppresses injection FPs on the agent's own legitimate prompts) Walks with safety caps (maxFiles, maxFileBytes, exclude node_modules etc). Tests: 28 assertions including a temp-fixture repo end-to-end. All four wired into src/main.js (with namespace fix: NARRATOR_FRAMEWORKS to avoid collision with the existing COMPLIANCE_FRAMEWORKS export from src/compliance.js) and added to npm test. Suite still green end-to-end. Stack on PR #18: - Python detector hardening - H1 #1: ShieldAgent + ShieldActions + 5 MCP tools + demo - H1 #2: IncidentReplay - H1 #3: CrossSDKDifferential - H1 #4: ThresholdTuner - H2 #1: AdversarialTournament - H2 #2: ComplianceNarrator - H2 #3: CustomerLearning Continuing with H2 #4 (autonomous threat hunter) and H2 #5 (production-traffic shadow-mode reporter) next. https://claude.ai/code/session_01AqtyP5YupS6MKt6qCTXghS
cdayAI
pushed a commit
that referenced
this pull request
May 25, 2026
…porter H2 #4: src/threat-hunter.js Pluggable source pattern: any object with .name + async .fetch() can feed the hunter. Built-ins: - LocalCorpusSource: JSONL or in-memory array (offline-safe for CI) - HTTPSourceFn: caller-supplied async () => items (caller owns network so we add no dependency and the user controls egress) Hunt flow: 1. Fetch from every source (broken sources don't crash the hunt). 2. Classify each item: detector misses it -> "novel attack". 3. Synthesize a tight regex by picking the rarest 4-token window (lowest occurrence in the benign corpus) and escaping it. 4. Estimate FP rate against benignCorpus; reject above threshold (default 5%). 5. Optional LLM judge review of proposals. 6. Emit a PR-ready markdown report. Conservative by design — prefers tight literal phrases over loose alternation to keep FPs near zero. Tests: 22 assertions including broken-source tolerance, addSource validation, FP filtering, and judge integration. H2 #5: src/shadow-mode-reporter.js Aggregator over a stream of scan events. After N days emits an executive report: - traffic volume + scan-time percentiles (p50/p95/p99/max) - threats by severity, category, source - action projection (if deployed in enforce mode: would-block / would-rewrite / would-allow counts) - noisy categories (likely FP candidates: >=5 hits, low avg conf) - quiet categories (rarely fire — candidates for removal) - estimated ROI: wouldBlock * costPerIncident (configurable) Accepts raw shield.scan() results, ShieldAgent verdicts, or wrapped envelopes. ingestMany handles arrays. maxEvents cap protects memory on long-running services. Outputs JSON via report() or markdown via markdownReport(). Tests: 29 assertions including window filtering, raw vs wrapped ingest, noisy/quiet category detection, max-event cap, and ROI math. Both wired into main.js and npm test. Full suite green. Stack on PR #18: - Python detector hardening - H1 #1: ShieldAgent + ShieldActions + 5 MCP tools + demo - H1 #2: IncidentReplay - H1 #3: CrossSDKDifferential - H1 #4: ThresholdTuner - H2 #1: AdversarialTournament - H2 #2: ComplianceNarrator - H2 #3: CustomerLearning - H2 #4: ThreatHunter - H2 #5: ShadowModeReporter Continuing with H3 multi-agent SOC + agent identity CA next. https://claude.ai/code/session_01AqtyP5YupS6MKt6qCTXghS
cdayAI
pushed a commit
that referenced
this pull request
May 25, 2026
…rg trust) H3 #1: src/soc-fleet.js Orchestrates the H1+H2 modules into a coordinated SOC team: - Defender: ShieldAgent triage on every event. - Detective: IncidentReplay deep-dive on block/escalate. - Forensics: ShadowModeReporter + ComplianceNarrator window report. - PatchWriter: ThreatHunter pattern synthesis for novel attacks. - Reviewer: judge-backed approval, with rule-based fallback (no judge configured -> approve <=5 patches with zero FP). - Releaser: bundles a ChangeRequest with patches + test cases + framework attribution, ready for PR generation. Every role's I/O is captured as a SOCEvent in the timeline so the entire decision chain is replayable. Bounded by maxTimeline. status() returns per-role event counts + last event. forceFullPipeline=true runs all roles even for safe input (for synthetic drills). Tests: 25 assertions including critical/safe/forced/FP paths, judge vs rule-based reviewer, timeline cap, status snapshot. H3 #2: src/agent-identity-ca.js Cryptographic agent passports for cross-org trust. Uses Ed25519 from Node's built-in crypto (zero external deps). Capabilities: - issuePassport({agentId, capabilities, orgId}) -> {passport, privateKey} Passport contains agentId, publicKey (SPKI base64), TTL, capabilities, orgId, caRootId. CA signs the canonical body. - verifyPassport: signature check + CA root match + revocation list + expiry. Tampered passports rejected. - revoke(agentId): CRL-style revocation, subsequent verifications fail. - signMessage({agentId, payload, privateKey}) -> envelope with timestamp + 16-byte nonce + signature. - verifyMessage(envelope, passport): full chain — passport valid, agentId matches, timestamp within messageTtlMs window (replay protection), nonce not in seenNonces cache, signature valid against passport's publicKey. Returns {valid, reason?, agentId, capabilities}. - exportRootPublicKey: SPKI for cross-org verification (private key never leaves the CA instance). Canonical JSON serializer recursively sorts keys so signatures are order-independent. Nonce cache has TTL-based sweep + max-size eviction. Tests: 30 assertions including issue+verify, tamper detection on body and signature, revocation, message signing, replay protection, stale message rejection, future-timestamp rejection, agentId mismatch, foreign-CA rejection, public-key export, input validation. Both wired into main.js and npm test. Full suite green. Stack on PR #18 (12 commits, 11 new modules, ~270 new test assertions): - Python detector hardening - H1 #1-#4: ShieldAgent, IncidentReplay, CrossSDKDifferential, ThresholdTuner - H2 #1-#5: AdversarialTournament, ComplianceNarrator, CustomerLearning, ThreatHunter, ShadowModeReporter - H3 #1-#2: SOCFleet, AgentIdentityCA That's all H1 + all H2 + 2 of the highest-leverage H3 modules built offline-safe with zero new dependencies. The remaining H3 items (fleet immunity wiring, cyber-insurance integration, public benchmark leaderboard hosting) require external infra (signed update feeds, a partner API, hosted infra) that can't be scaffolded purely in-tree. https://claude.ai/code/session_01AqtyP5YupS6MKt6qCTXghS
cdayAI
pushed a commit
that referenced
this pull request
May 25, 2026
src/threats-2026-extra.js — the 5 threat-intel gaps a2a-guard.js didn't
cover. Each lives as its own detector so hosts can wire them at the
right lifecycle point (none of these are pure-regex):
- TOCTOUGuard — hash a DOM locator at observe() time; refuse to
act if hash differs (arXiv 2603.00476).
- GRAPH_TRIPLE_PATTERNS (3 rules) — JSON-, Turtle/RDF-, and bulk-edge
forms of GraphRAG triple poisoning (arXiv 2508.04276).
- detectGCGSuffix — entropy + non-dictionary-ratio + symbol-density
composite over the trailing 200 chars. Flags GCG /
activation-steering optimized suffixes.
- MemoryReplayGuard — wraps a memory backend so persisted messages are
re-scanned at LOAD time, not just write time
(CVE-2026-25253). Stricter threshold than live
input (default medium).
- detectContextStuffing — flags oversized inputs (>30KB) with repetition
factor ≥ 20 or whitespace runs > 2KB.
- scanExtras2026(input) — one-call helper.
src/dream-pr-bot.js — closes the dreaming loop. Picks the latest
high-confidence change-request artifact from DreamMemory and converts
it to a real PR. Three modes:
- dry-run — emit { branch, title, body, files } (default)
- local-git — write config/dreams/dream-patches.json +
dream-thresholds.json, create branch, commit; host's
gitRunner does the git invocation.
- mcp-github — call MCP github tools via caller-supplied adapter
({ createBranch, putFile, openPR }). Draft by default.
Bot writes ONLY to config/dreams/*.json artifact files, never to
detector-core or pattern source. Unreviewed dreams cannot live-fire.
Python parity: all 9 a2a-guard patterns + 3 GraphRAG triple patterns
ported into python-sdk/agent_shield/detector.py — 314 → 327 patterns,
51 → 59 categories. TestV2026CouncilPatterns class adds 14 positive
+ negative assertions.
Tests:
- test/test-threats-2026-extra.js: 40 assertions
- test/test-dream-pr-bot.js: 33 assertions
- python-sdk/tests/test_detector.py: 51 → 65 tests
All wired into src/main.js exports and npm test. Full Node + Python
suites green.
PR #18 final tally: 16 new modules, ~543 new Node test assertions
+ 65 Python tests, 16 commits. Covers all 15 council threat gaps in
JS plus 9 of 15 in Python; dreaming subsystem with autonomous PR bot
fully wired; Hermes + 9-engine OSS integration shipped.
https://claude.ai/code/session_01AqtyP5YupS6MKt6qCTXghS
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
This PR has two stacked layers:
Layer 1 — Python detector hardening
Security bypasses fixed
\bbypass on 5 jailbreak detectorsтестDAN mode→ 0 hits;αjailbreak mode→ 0 hitsre.ASCIIflagANTHROPIC_BASE_URL=https://api.anthropic.com.evil.attacker.io/v1→ 0 hitsencoding_evasion"99 "→ 1.7s single regex call (200ms budget){5,30}→ 20ms (85× speedup)budget_drainmisses canonical attacksrepeat 1000 times→ 0 hitsDuplicate patterns removed (10 entries, 330 → 314)
v14.x sync block re-added patterns from v13.x → 2× findings per attack. Removed dups in
role_hijack,prompt_injection,tool_abuse,memory_poisoning,encoding_chain,structured_data_injection,xss_injection,svg_injection.False-positive reduction
data_exfiltrationmarkdown image — drop?q=?d=from exfil alternationpath_traversal— require 3+../OR explicit sensitive targetcicd_injection@claude— negative lookahead fordo not|don't|never|avoid|preventquery_injectionf-string — require SQL keyword inside bodyRelease hygiene
setup.pyname="agent-shield"→"agentshield"(was conflicting with pyproject.toml)pyproject.tomlbuild-backend = "setuptools.backends._legacy:_Backend"(fictitious — brokepython -m build) →setuptools.build_metatsconfig.json"types": ["node"]dropped (was failing Type Check job —@types/nodenever installed;types/index.d.tsdoesn't use Node types).github/workflows/quality-gates.ymlPerformance Check job called nonexistentdetectPromptInjection→ fixed toscanTextTests
test_no_duplicate_patternsregression test)Layer 2 — H1: ShieldAgent
What it adds
src/shield-agent.js+src/shield-actions.js— an LLM-powered triage layer that wraps the deterministic detector with a reasoning loop:triagePolicy.stats.judgeFailuresso monitoring can alert.SYSTEM/USER/TOOL_OUTPUT/RAG_CHUNK/UNTRUSTED); attacker attempts to close the tag are escaped at the boundary, so the adjudicator cannot be injected by the very content it adjudicates.async ({system, user}) => string. Demo example calls Anthropic API via globalfetch(Node 18+).Actions executor
ShieldActions.execute(verdict, original)→{proceed, payload, info}. Translates verdicts intoallow/block/sanitize/rewrite/quarantine/escalate. Quarantine + escalate sinks are caller-injected. Sanitizer strips HTML comments,display:nonecontainers, data-exfil markdown images, instruction-override boilerplate, system-prompt impersonation tags.MCP exposure
src/mcp-server.jsextended with 5 new tools any host agent (Claude Code, Cursor, Windsurf, GPT) can call mid-conversation:investigate(text, provenance, source, system_prompt)safe_rewrite(text, source)explain_threat(text, source)execute_verdict(verdict, original_text)agent_stats()Tests
test/test-shield-agent.js— 61 assertions covering: schema validator, JSON extractor (incl. embedded/malformed/unterminated), tag-breakout escaping, detector-only fast paths, judge invocation, schema-violation fallback, budget-timeout fallback, no-judge degradation, history bounds, all 6 action types, quarantine + escalate sink plumbing, sanitizer correctness. All passing.Demo
examples/security-copilot.js— end-to-end host agent receiving 4 messages (benign / injection / critical / borderline) and the agent making allow/rewrite/block/allow decisions with audit trail. Works offline with mock judge; readsANTHROPIC_API_KEYenv to use real Claude.Test plan
python3 -m unittest tests.test_detector→ 51/51 OK locallynpm test→ 61/61 OK on the new ShieldAgent suite + all pre-existing suites passnode examples/security-copilot.js→ demo runs to completion with correct verdictsscan_text('99 ' * 5000)returns in 20ms (was 1.7s)тестDAN mode,αjailbreak mode, etc. → caught after re.ASCII fixapi.anthropic.com.evil.io→ caught after host-terminator fix