Skip to content

aparnaa19/AgentForensics

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

10 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

AgentForensics

Real-time prompt injection detection and forensics for LLM agents.

CI Python License: MIT ARPIbench

AgentForensics is an open-source security framework that monitors LLM agent sessions in real time, detects prompt injection attacks across multiple sources, fingerprints attack campaigns, and presents forensic evidence in a dashboard - all with zero changes to your existing agent code.

📖 Full User Guide - installation, all use cases, dashboard walkthrough, troubleshooting


Why AgentForensics?

As LLM agents increasingly read emails, browse websites, process documents, and execute code, they are exposed to a new attack surface: indirect prompt injection - malicious instructions embedded in external content that hijack the agent's behaviour.

Existing tools focus on input filtering at the user layer. AgentForensics monitors the entire agent session including tool outputs, retrieved documents, and API responses - where the real attacks happen.


Benchmark Results

Evaluated against ARPIbench (alexcbecker/ARPIbench), an academic benchmark of real-world indirect prompt injection payloads across web, email, and document scenarios:

Metric Result
Detection rate 100% (2000/2000 payloads)
False positive rate 0% (7 benign samples)
Attack types covered naive, completion, ignore, urgent_request, helpful_assistant, multi-turn
Scenarios web, email, local document
python benchtest.py --ml   # reproduce results

Features

  • Five-stage detection pipeline - heuristics, ML classifier, instruction boundary analysis, semantic drift, and sliding-window multi-turn detection
  • Live forensic dashboard - session timeline, per-turn evidence, real-time SSE alerts
  • Attack fingerprinting - clusters repeated injection patterns across sessions into named campaigns
  • False alarm management - mark safe content once, suppress forever
  • VS Code extension - inline squiggles, Problems panel, auto-scan on open/paste
  • Claude Desktop MCP - plug directly into Claude Desktop, zero config
  • Code wrapper - one import for OpenAI / Anthropic SDK apps
  • LangChain & AutoGen - native callback/tracer integrations

How It Works

flowchart TD
    A([External Content\nweb page · email · document · tool result]) --> B

    B[Stage 1 - Heuristics · less than 1ms\n8 regex rule groups · H01 to H08]
    B --> C[Stage 2 - ML Classifier · ~50ms\nFine-tuned DistilBERT on injection patterns]
    C --> D[Stage 3 - Instruction Boundary · less than 1ms\n10 boundary pattern groups · IB01 to IB10]
    D --> E[Stage 4 - Semantic Drift · ~30ms\nLLM response topic vs original query]
    E --> F[Stage 5 - Sliding Window · less than 1ms\nScore aggregation across last N turns]

    F --> G{Verdict}

    G -->|score below 0.25| H([CLEAN])
    G -->|score 0.25 to 0.75| I([SUSPICIOUS])
    G -->|score above 0.75| J([COMPROMISED])

    I --> K[Store · Fingerprint · Alert · Dashboard]
    J --> K

    style H fill:#2d6a2d,color:#ffffff,stroke:#2d6a2d
    style I fill:#a67c00,color:#ffffff,stroke:#a67c00
    style J fill:#8b1a1a,color:#ffffff,stroke:#8b1a1a
Loading

Detection Rules

Rule Pattern Weight
H01 ignore previous/all/prior instructions 0.35
H02 you are now / your new role/persona 0.30
H03 do not tell/inform the user 0.25
H04 Token injection markers [[SYSTEM]], [INST], <|im_start|> 0.15
H05 Sudden language/script switch (evasion) 0.20
H06 Imperative commands in tool output (send, forward, upload, exfiltrate) 0.20
H07 Reference to system prompt / context window 0.25
H08 Exfiltration setup: send/forward near secret/key/password 0.40
IB01–IB10 Instruction boundary patterns (context wipe, persona hijack, task redefinition, etc.) 0.20–0.35

Installation

One-command install (recommended)

git clone https://github.com/YOUR_USERNAME/agentforensics
cd agentforensics
python install.py

This installs Python dependencies, builds the VS Code extension, creates a desktop app (Windows), and configures Claude Desktop MCP automatically.

Manual

pip install agentforensics[dashboard]

# Launch dashboard
python -m agentforensics.cli dashboard
# → http://localhost:7890

Requirements: Python 3.10+, Node.js 18+ (for VS Code extension build)


Usage

1. Claude Desktop (MCP)

Run python install.py once. AgentForensics is auto-configured as an MCP server. Restart Claude Desktop - monitoring starts immediately, no code changes required.

2. Code Wrapper - OpenAI / Anthropic

Drop-in replacement. One import, no other changes:

from agentforensics import trace
from openai import OpenAI

client = trace(OpenAI(api_key="sk-..."))

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarise this webpage: ..."}],
)
# Injection signals captured automatically → visible in dashboard

3. VS Code Extension

Install the .vsix from agentforensics-vscode/:

Extensions → ⋯ → Install from VSIX

Every file you open, edit, or paste is automatically scanned. Injections appear as red squiggles and in the Problems panel (Ctrl+Shift+M).

4. LangChain

from agentforensics.integrations.langchain_handler import AgentForensicsHandler

agent_executor.invoke(
    {"input": user_input},
    config={"callbacks": [AgentForensicsHandler(session_id="my-session")]},
)

5. AutoGen

from agentforensics.integrations.autogen_tracer import patch_autogen
patch_autogen()
# All subsequent AutoGen conversations are monitored

Dashboard

Launch with the desktop app and click go to dashboard or python -m agentforensics.cli dashboard

View Description
Sessions All monitored agent sessions with verdict (Clean / Suspicious / Compromised)
Live Monitor Real-time injection alerts via Server-Sent Events
Campaigns Fingerprinted attack patterns grouped by similarity across sessions
False Alarms Approved safe content - permanently excluded from future alerts

Benchmarking

ARPIbench (recommended)

Tests against 7,560 real-world indirect injection payloads:

pip install datasets
python benchtest.py          # first 2000 rows
python benchtest.py --all    # full dataset
python benchtest.py --ml     # with ML stage enabled

Built-in benchmark

Tests per-rule recall and false positive rate with no internet required:

python benchmark.py
python benchmark.py --verbose

Environment Variables

Variable Default Description
AF_INJECT_THRESHOLD 0.25 Minimum combined score to flag a signal
AF_FP_THRESHOLD 0.70 Cosine similarity threshold for fingerprint grouping
AF_DISABLE_ML false Set true to skip ML stage (faster startup, no GPU needed)
AF_MODEL_PATH agentforensics/model/ Path to fine-tuned DistilBERT weights
AF_FP_DB ~/.agentforensics/fingerprints.db SQLite fingerprint database path
AF_WINDOW_SIZE 3 Number of turns for sliding-window multi-turn detection

Project Structure

agentforensics/
├── classifier.py       # Five-stage detection pipeline
├── fingerprinter.py    # Attack fingerprinting & campaign clustering
├── alerting.py         # Alert routing and SSE broadcast
├── reporter.py         # HTML forensic report generation
├── semantic.py         # Instruction boundary + semantic similarity
├── tracer.py           # OpenAI/Anthropic SDK wrapper (trace())
├── store.py            # SQLite session/signal storage
├── mcp_server.py       # Claude Desktop MCP integration
├── model/              # Fine-tuned DistilBERT weights (gitignored)
├── dashboard/
│   ├── server.py       # FastAPI backend + SSE
│   └── ui.html         # Single-file dashboard UI
└── integrations/
    ├── langchain_handler.py
    └── autogen_tracer.py

agentforensics-vscode/  # VS Code extension (TypeScript)
benchmark.py            # Built-in per-rule benchmark (no internet)
benchtest.py            # ARPIbench external benchmark
install.py              # One-command installer
launcher.py             # Desktop app launcher (tkinter + pywebview)

Contributing

Contributions are welcome. Areas of particular interest:

  • Additional heuristic rules for emerging injection techniques
  • New framework integrations (CrewAI, LlamaIndex, Haystack)
  • Improved ML model training data
  • Additional benchmark evaluations

Citation

If you use AgentForensics in your research, please cite:

@software{agentforensics2026,
  author    = Aparnaa,
  title     = {AgentForensics: Real-Time Prompt Injection Detection and Forensics for LLM Agents},
  year      = {2026},
  url       = {https://github.com/aparnaa19/agentforensics},
  license   = {MIT}
}

License

MIT - see LICENSE for details.

About

AgentForensics is an open-source security framework that monitors complete LLM agent sessions in real time, detecting prompt injection attacks across tool outputs, web pages, documents, and API responses. It uses heuristic rules, a DistilBERT ML classifier, instruction boundary detection, semantic drift, and sliding-window multi-turn detection.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors