Red-team Codex and Claude Code agents for prompt injection, MCP poisoning, memory poisoning, and concealed side effects.
HackYourAgent is a manual-use skill bundle for coding agents. It teaches an agent to map an authorized AI system, generate paired control and attack trials, inspect outputs one by one, and leave behind evidence, regressions, and hardening actions a builder can actually commit.
Most AI security tooling still looks like one of these:
- prompt scanners that never touch your actual agent workflow
- eval frameworks that are powerful but too heavy for everyday repo use
- research benchmarks that do not leave commit-ready regressions behind
HackYourAgent is the narrow wedge for builders using coding agents. It is designed to run inside Codex and Claude Code workflows, inspect repo-local trust boundaries, and tell you where prompt injection, tool poisoning, memory poisoning, approval confusion, or concealment still work.
- Native install for Codex and Claude Code
- A forensic red-team workflow with paired control and attack trials
- Evidence-first output under
redteam/ - Seeded vulnerable example targets you can test immediately
- Launch-ready docs and examples you can extend for your own targets
Install the skill:
python3 scripts/install_skill.py bothPick a seeded example:
examples/vulnerable-rag-agentexamples/vulnerable-mcp-agentexamples/vulnerable-concealment-agent
Invoke the skill:
Use $hack-your-agent on examples/vulnerable-rag-agent.
Write only to redteam/ artifacts. Build a paired control/attack trial matrix,
inspect outputs one by one, and leave minimal repros and regressions.
Expected outcome:
redteam/trials/trial-matrix.csv- one dossier per trial in
redteam/trials/ - raw evidence folders in
redteam/evidence/ - ranked findings in
redteam/findings/ - a hardening plan in
redteam/hardening-plan.md
- Shared research-backed references in
core/references/ - Reusable reporting templates in
core/templates/ - A helper bootstrap script in
scripts/init_redteam_run.py - A Codex wrapper in
platforms/codex/hack-your-agent/ - A Claude Code wrapper in
platforms/claude/hack-your-agent/ - Seeded vulnerable example targets in
examples/
- Manual invocation only. This skill has side effects and should not be auto-loaded.
- Authorized targets only. Default to local repos, dev stacks, and staging endpoints.
- Action over text. Treat compromise as untrusted content changing downstream behavior.
- Trust boundaries over prompt packs. Map architecture before probing.
- Regressions by default. Every high-confidence issue should leave a replayable artifact.
This repo is grounded in current primary sources as of March 28, 2026. The short version is:
- Official guidance from OpenAI and Anthropic treats prompt injection, tools, MCP, and approvals as first-class trust-boundary problems.
- Recent agent research shows static prompt lists are not enough; dynamic, surface-aware evaluation is required.
- Frontier work now emphasizes tool poisoning, memory poisoning, and concealed compromise rather than only direct prompt overrides.
See core/references/frontier-research.md for dated links and distilled takeaways.
Start with examples/README.md.
examples/vulnerable-rag-agent: raw retrieved documents are merged into a high-authority promptexamples/vulnerable-mcp-agent: tool metadata and tool results are trusted as instructionsexamples/vulnerable-concealment-agent: summaries are generated from planner state instead of trace-backed actions
These examples exist to help you demo the skill, test onboarding, and show prospective users a fast path to value.
The repo source is shared across platforms. Use the installer so the final installed skill is self-contained and native to the target agent.
Install to ${CODEX_HOME:-~/.codex}/skills/hack-your-agent:
python3 scripts/install_skill.py codexInstall as a personal skill to ~/.claude/skills/hack-your-agent:
python3 scripts/install_skill.py claudeInstall as a project skill to .claude/skills/hack-your-agent inside a repo:
python3 scripts/install_skill.py claude --scope project --project-dir /path/to/repoInstall both native variants at once:
python3 scripts/install_skill.py bothClaude Code skill behavior is aligned to the official skills docs: bundled files, disable-model-invocation: true, context: fork, and ${CLAUDE_SKILL_DIR} for script access. The Codex install target follows the local standard skill path used by Codex-compatible environments in this workspace.
If your current agent session does not see a newly installed skill, start a new session after installing it.
- Codex:
Use $hack-your-agent on this repo. Build a paired control/attack trial matrix and save raw evidence under redteam/. - Claude Code:
/hack-your-agent this repo and its staging endpoint
HackYourAgent is supposed to behave like a forensic operator, not a prompt list.
- Scope the target and forbidden actions.
- Map prompts, tools, MCP, retrieval, memory, approvals, and sinks.
- Select only the attack families that exist in the target.
- Generate a
redteam/trials/trial-matrix.csvwith paired control and attack runs. - Execute each trial one by one.
- Save raw inputs, raw outputs, traces, side effects, and per-trial verdicts under
redteam/evidence/. - Compare each attack row against its paired control before writing a finding.
- Produce findings, regressions, and a hardening plan only after the evidence is complete.
If the target lacks a runnable harness, traces, or staging surface, the skill can still map architecture and design probes, but the forensic result will be correspondingly weaker.
Use $hack-your-agent on this authorized repo and staging endpoint.
Write only to redteam/ artifacts. Focus on indirect prompt injection, MCP/tool poisoning,
memory poisoning, approval confusion, and concealment. Build a paired control/attack trial matrix,
run each row individually, save raw evidence, and leave minimal repros and regressions.
HackYourAgent is a defensive skill. It is meant for systems the user owns or is authorized to test. It is not a mass scanner, public exploit pack, credential brute-force tool, or live-offense workflow.