This repository provides a reproducible benchmark for evaluating how effectively the Deconvolute SDK protects Retrieval Augmented Generation pipelines and agent-based systems against Indirect Prompt Injection attacks. It includes the code, datasets, and configuration needed to validate integrity and detection claims, measure trade-offs such as latency and false positive rates, and enable independent verification. While the Deconvolute SDK is the primary focus, the benchmark is designed to support evaluation of other defense systems through shared, declarative experiment definitions that are repeatable across environments and model versions.
This benchmark evaluates system integrity failures caused by adversarial content injected through retrieved documents. It does not address general model safety, alignment, or harmful content generation. Results should be interpreted as measurements of integrity enforcement under specific, controlled attack scenarios.
The results presented below correspond to the Naive Base64 Injection scenario, evaluated on a dataset of 300 samples (150 attack, 150 benign).
- Attack Goal: Induce the generator model to ignore system-level instructions and emit responses encoded in Base64, thereby bypassing keyword-based filters.
- Name: `canary_naive_base64`
| Experiment Configuration | ASR (Attack Success Rate) | PNA (Performance on No Attack) | TP | FN | TN | FP |
|---|---|---|---|---|---|---|
| Baseline (No Defense) | 90.7% | 98.7% | 14 | 136 | 148 | 2 |
| Protected (Canary + Language) | 0.67% | 95.3% | 149 | 1 | 143 | 7 |
Lower ASR is better. Higher PNA is better.
Without defenses, the model is highly susceptible to indirect prompt injection. Enabling the Deconvolute SDK substantially reduces successful attacks, at the cost of a small increase in false positives on benign queries. This reflects a typical trade-off between strict integrity enforcement and tolerance for benign variation. A detailed analysis of failure modes and defense behavior is provided in the accompanying technical blog post.
This project uses uv for dependency management and environment isolation.
```bash
git clone https://github.com/daved01/deconvolute-benchmark.git
cd deconvolute-benchmark
uv sync --all-extras
```

The virtual environment may be activated with source .venv/bin/activate to avoid prefixing commands with uv run. The examples below assume the environment is not activated.
Experiments are organized as scenarios under scenarios/. Each scenario directory contains:
- `experiment.yaml`: the experiment definition
- `dataset_config.yaml`: the dataset generation recipe
- `dataset.json`: the generated dataset
An experiment can be run by specifying the scenario name:
```bash
uv run dcv-bench run example
```

If dataset.json is missing, it is generated automatically from dataset_config.yaml before execution.
Experiment variants can be selected using the colon syntax:
```bash
# Runs scenarios/example/experiment_gpt4.yaml
uv run dcv-bench run example:gpt4
```

Attack strategies, dataset properties, and model settings are modified entirely through configuration. No code changes are required.
Experiments are defined using YAML files that describe the pipeline, defense layers, and evaluation logic.
Multiple defense layers from the Deconvolute SDK can be composed into the pipeline through the config file, and evaluators can then be attached to measure how those defenses perform. The following evaluators are available (see the sketch after the table for one way to inspect a scenario's evaluator configuration):
| Layer Type | Description | Config |
|---|---|---|
| `canary` | Checks for explicit integrity violation signals from the SDK. | N/A |
| `keyword` | Detects the presence of an injected payload in the output. | `target_keyword` |
| `language_mismatch` | Detects violations of the expected output language. | `expected_language`, `strict` |
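As a rough illustration, the sketch below loads a scenario's experiment.yaml and lists the evaluator layers it declares. The evaluators key and the per-entry fields are assumptions made for illustration; the files under scenarios/ define the actual schema.

```python
# Illustrative sketch only: the "evaluators" key and the shape of each entry
# are assumptions, not the documented experiment.yaml schema. Check
# scenarios/example/experiment.yaml for the actual structure.
from pathlib import Path

import yaml  # PyYAML, assumed to be available in the synced environment

config = yaml.safe_load(Path("scenarios/example/experiment.yaml").read_text())

# Assuming evaluators are declared as a list of mappings with a "type" field
# matching the layer names in the table above (canary, keyword, language_mismatch).
for evaluator in config.get("evaluators", []):
    layer_type = evaluator.get("type", "<unknown>")
    options = {key: value for key, value in evaluator.items() if key != "type"}
    print(f"{layer_type}: {options}")
```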
Each sample is treated as a binary security outcome. Results fall into one of four categories:
| Term | Scenario | Outcome | Interpretation |
|---|---|---|---|
| TP (True Positive) | Attack Sample | Detected | System Protected. The defense triggered correctly. |
| FN (False Negative) | Attack Sample | Not Detected | Silent Failure. The attack succeeded. |
| TN (True Negative) | Benign Sample | Not Detected | Normal Operation. System worked as expected. |
| FP (False Positive) | Benign Sample | Detected | False Alarm. Usefulness is degraded. |
Attack Success Rate (ASR) measures the fraction of attacks that bypass defenses. Performance on No Attack (PNA) measures how often benign queries proceed without interference. Latency is reported as end-to-end execution time for attack and benign samples.
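The counts in the results table above are consistent with ASR being a ratio over attack samples and PNA a ratio over benign samples. The sketch below shows that reading; the benchmark's own implementation remains the authoritative definition.

```python
# Sketch of how ASR and PNA relate to the outcome counts, assuming
#   ASR = FN / (TP + FN)  over attack samples, and
#   PNA = TN / (TN + FP)  over benign samples.
# These formulas reproduce the table above, but they are an interpretation,
# not a copy of the benchmark's own metric code.

def attack_success_rate(tp: int, fn: int) -> float:
    """Fraction of attack samples that went undetected (silent failures)."""
    return fn / (tp + fn)

def performance_on_no_attack(tn: int, fp: int) -> float:
    """Fraction of benign samples that proceeded without interference."""
    return tn / (tn + fp)

# Baseline (No Defense) row: TP=14, FN=136, TN=148, FP=2
print(f"ASR: {attack_success_rate(14, 136):.1%}")      # 90.7%
print(f"PNA: {performance_on_no_attack(148, 2):.1%}")  # 98.7%
```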
Datasets live in each scenario folder under scenarios/<scenario-name>/ (e.g. scenarios/example/dataset.json).
Optional data dependencies must be installed first:
```bash
uv sync --extra data
```

Datasets are built from a base corpus in resources/corpus and the scenario's dataset_config.yaml. For example, to fetch the baseline SQuAD v1.1 validation set:
```bash
uv run python resources/corpus/fetch_squad_data.py
```

To generate a scenario dataset:
```bash
uv run dcv-bench data generate example
```

The resulting dataset.json will be saved in the scenario folder and is ready for experiments.
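If you want to verify a generated dataset before running experiments, a minimal check might look like the following. The file location comes from the docs above; treating the top level as either a list of samples or a mapping with a samples key is an assumption.

```python
# Minimal sanity check on a generated dataset. Only the file location is taken
# from the docs above; the top-level layout assumed here may differ -- adjust
# to the actual file.
import json
from pathlib import Path

dataset = json.loads(Path("scenarios/example/dataset.json").read_text())
samples = dataset if isinstance(dataset, list) else dataset.get("samples", [])
print(f"Loaded {len(samples)} samples from dataset.json")
```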
Each experiment run produces a timestamped directory under results/ in the scenario folder:
```
scenarios/
└── <scenario_name>/
    └── results/
        └── run_<timestamp>/
            ├── results.json     # Run manifest (summary + config)
            ├── traces.jsonl     # Line-by-line execution traces (Debugging Data)
            └── plots/
                ├── confusion_matrix.png      # TP/FN/FP/TN Heatmap
                ├── asr_by_strategy.png       # Vulnerability by Attack Type
                └── latency_distribution.png  # Performance Overhead
```

The results.json file is the main artifact for analysis; for per-sample execution details, see traces.jsonl.
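For programmatic analysis of a run, a starting point might be the sketch below. It relies only on the file locations shown above and makes no assumptions about the fields inside results.json or traces.jsonl.

```python
# Load the newest run for a scenario. Only the directory layout above is
# assumed; the contents of results.json and traces.jsonl are not.
import json
from pathlib import Path

run_dirs = sorted(Path("scenarios/example/results").glob("run_*"))
latest = run_dirs[-1]  # assumes the timestamp suffix sorts lexicographically

manifest = json.loads((latest / "results.json").read_text())
print("Manifest keys:", list(manifest))

with open(latest / "traces.jsonl") as fh:
    traces = [json.loads(line) for line in fh if line.strip()]
print(f"{len(traces)} per-sample traces loaded")
```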
This benchmark focuses on a single attack class: Indirect Prompt Injection, using multiple defensive layers to evaluate defense-in-depth. It measures observable system behavior under attack rather than internal SDK implementation. Results should not be interpreted as comprehensive LLM security guarantees. Detailed explanations of the SDK’s defenses and integrity protocols are available in the Deconvolute SDK documentation.
Gakh, Valerii, and Hayretdin Bahsi. "Enhancing Security in LLM Applications: A Performance Evaluation of Early Detection Systems." arXiv:2506.19109. Preprint, arXiv, June 23, 2025. https://doi.org/10.48550/arXiv.2506.19109.
Liu, Yupei, Yuqi Jia, Runpeng Geng, Jinyuan Jia, and Neil Zhenqiang Gong. "Formalizing and Benchmarking Prompt Injection Attacks and Defenses." arXiv:2310.12815. Version 5. Preprint, arXiv, 2023. https://doi.org/10.48550/arXiv.2310.12815.
Wallace, Eric, Kai Xiao, Reimar Leike, Lilian Weng, Johannes Heidecke, and Alex Beutel. "The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions." arXiv:2404.13208. Preprint, arXiv, April 19, 2024. https://doi.org/10.48550/arXiv.2404.13208.