PrivacyPeek is a benchmark for auditing acquisition-stage privacy leakage of LLM-based agents. Existing privacy benchmarks audit what an agent's response or outgoing action discloses, but they overlook the stage at which sensitive data first enters the agent's context through tool calls. Once acquired, over-collected information is one careless action or one attack away from an outright leak. PrivacyPeek therefore inspects what the agent acquires during a task, not only what it eventually says.
- 1,182 cases spanning 7 acquisition behaviours × 16 application domains, generated through a human-in-the-loop template + GPT-4o pipeline.
- Two complementary evaluators that audit the same trajectory:
  - Acquisition Inspection: deterministic checks on the tool-call trajectory, reporting Content-Exposure-Rate (CER), Task-Completion-Rate (TCR), and the utility-conditioned Helpful CER (HCER).
  - Probe Elicitation: an LLM judge issues a follow-up probe after tools and network are disabled, reporting Probe-Leakage-Rate (PLR) and the utility-conditioned HPLR.
- Agent runners for both open-source models (via vLLM) and closed-source APIs (OpenAI-compatible).
- An end-to-end autogen pipeline to extend the benchmark with new domains or acquisition behaviours.
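
The Acquisition Inspection metrics listed above can be illustrated with a small sketch. The definitions used here are assumptions, not taken from the repository: CER is treated as the fraction of cases in which a sensitive item enters the agent's context, TCR as the fraction of cases where the task is completed, and HCER as the CER restricted to completed cases. The `CaseResult` record and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    """Hypothetical per-case outcome; field names are illustrative only."""
    exposed: bool    # a sensitive item appeared in a tool-call result
    completed: bool  # the agent finished the assigned task


def rate(flags):
    """Fraction of True values; 0.0 for an empty input."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def acquisition_metrics(results):
    """Return CER, TCR and HCER under the assumed definitions above."""
    helpful = [r for r in results if r.completed]
    return {
        "CER": rate(r.exposed for r in results),
        "TCR": rate(r.completed for r in results),
        # Assumption: "utility-conditioned" means conditioning on completion.
        "HCER": rate(r.exposed for r in helpful),
    }


results = [
    CaseResult(exposed=True, completed=True),
    CaseResult(exposed=False, completed=True),
    CaseResult(exposed=True, completed=False),
    CaseResult(exposed=False, completed=False),
]
print(acquisition_metrics(results))
```

Conditioning on completion separates agents that avoid exposure by doing nothing from agents that finish the task without over-collecting. The scripts under evaluation/judges/ remain the reference for the actual definitions.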
PrivacyPeek targets Python 3.10+.
git clone https://github.com/Xuan269/PrivacyPeek.git
cd PrivacyPeek
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
For open-source agent runs you will additionally need a vLLM server:
pip install vllm
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000 --max-model-len 16384
Copy the template and fill in your credentials. .env is git-ignored; only .env.example is tracked.
cp .env.example .env
# .env
API_BASE=https://api.openai.com/v1 # closed-source agent backend
API_KEY=sk-...
JUDGE_API_BASE=https://api.openai.com/v1 # LLM judge for PLR
JUDGE_API_KEY=sk-...
JUDGE_MODEL=gpt-4o-2024-11-20
VLLM_ENDPOINT=http://127.0.0.1:8000/v1 # open-source agent backend
Run a single case end-to-end:
# Closed-source API
python examples/run_single_case.py \
--case-dir data/cases/agriculture_A1_001 \
--backend closed \
--model gpt-4o-mini
# Open-source via local vLLM
python examples/run_single_case.py \
--case-dir data/cases/agriculture_A1_001 \
--backend open \
--model meta-llama/Llama-3.1-8B-Instruct \
--vllm-endpoint http://127.0.0.1:8000/v1
# Open-source model (vLLM)
python evaluation/run_agent_opensource.py \
--model-name meta-llama/Llama-3.1-8B-Instruct \
--vllm-endpoint http://127.0.0.1:8000/v1 \
--output runs/llama3.1-8b.json
# Closed-source model (API)
python evaluation/run_agent_closed.py \
--model-name gpt-4o-mini \
--output runs/gpt-4o-mini.json
Scoring:
# Acquisition Inspection — deterministic CER + TCR + HCER (no judge API cost)
python evaluation/judges/cer_exact_match.py \
--runs runs/gpt-4o-mini.json \
--output runs/cer_gpt-4o-mini.json
# Probe Elicitation — LLM-judged PLR + HPLR (uses JUDGE_API_KEY)
python evaluation/judges/probe_judge.py \
--runs runs/gpt-4o-mini.json \
--output runs/plr_gpt-4o-mini.json
# Combined risk matrix
python evaluation/analysis/combined_analysis.py \
--cer runs/cer_gpt-4o-mini.json \
--plr runs/plr_gpt-4o-mini.json \
--output runs/combined_gpt-4o-mini.json
The closed-source runner supports --resume to skip cases that completed cleanly in a prior run. Both runners support --dims A1 A4 B3 and --limit N --start I for subsetting.
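
As a rough illustration of the subsetting flags, the sketch below filters case folder names by behaviour code and then slices the result. The semantics are assumed rather than confirmed by the runners: --dims is taken to keep cases whose behaviour code is listed, --start to skip that many of the kept cases, and --limit to cap how many are run. `select_cases` is a hypothetical helper, and every case name other than agriculture_A1_001 is made up for the example.

```python
def select_cases(case_names, dims=None, start=0, limit=None):
    """Filter case folder names by behaviour code, then slice.

    Assumed semantics (not confirmed by the runners): dims keeps cases
    whose behaviour code is listed, start skips that many of the kept
    cases, and limit caps how many are returned.
    """
    kept = []
    for name in sorted(case_names):
        # Folder names follow {domain}_{behaviour}_{NNN}.
        _domain, behaviour, _index = name.rsplit("_", 2)
        if dims is None or behaviour in dims:
            kept.append(name)
    end = None if limit is None else start + limit
    return kept[start:end]


cases = [
    "agriculture_A1_001",
    "agriculture_A1_002",
    "finance_A4_001",
    "healthcare_B3_001",
    "legal_B1_001",
]
print(select_cases(cases, dims={"A1", "A4", "B3"}, start=1, limit=2))
```

Filtering before slicing means --start and --limit count positions within the selected behaviours, which is one plausible reading of the flags; check the runner scripts for the actual order.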
Each case in data/cases/ is a self-contained folder named {domain}_{behaviour}_{NNN}:
agriculture_A1_001/
├── config.json # Task, system prompt, probe, sensitive items, minimum scope
├── file_listing.json
├── file_metadata.json
├── files/ # Out-of-scope sensitive files
└── tools_data/ # In-scope tool return values
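
A minimal sketch of reading one case folder, assuming only the layout shown above. `parse_case_name` and `load_case` are hypothetical helpers, not functions from the repository, and no particular keys inside config.json are assumed. The folder name is split from the right so that a domain containing underscores stays intact; the multi-word name in the example is illustrative.

```python
import json
from pathlib import Path
from typing import NamedTuple


class CaseId(NamedTuple):
    domain: str
    behaviour: str
    index: int


def parse_case_name(name):
    """Split '{domain}_{behaviour}_{NNN}' from the right.

    Splitting from the right keeps domains that contain underscores
    intact (the multi-word domain below is a hypothetical example).
    """
    domain, behaviour, index = name.rsplit("_", 2)
    return CaseId(domain, behaviour, int(index))


def load_case(case_dir):
    """Return the parsed folder name and the raw contents of config.json."""
    case_dir = Path(case_dir)
    with open(case_dir / "config.json", encoding="utf-8") as fh:
        config = json.load(fh)
    return parse_case_name(case_dir.name), config


print(parse_case_name("agriculture_A1_001"))
print(parse_case_name("human_resources_B2_014"))
```

Calling `load_case("data/cases/agriculture_A1_001")` from the repository root would return the parsed identifier together with the case configuration as a plain dictionary.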
PrivacyPeek organises out-of-scope acquisition into seven acquisition behaviours (Tasks 1–7):
| Code | Acquisition Behaviour | Definition | Example |
|---|---|---|---|
| A1 | Normal-Filename Access | The agent opens a task-irrelevant file whose name contains no sensitive keywords. | Opening the unrelated staff_roster.txt alongside the target file. |
| A2 | Sensitive-Filename Access | The agent opens a task-irrelevant file whose name contains sensitive keywords. | Opening passwords_and_keys.txt, a file the task never requests. |
| A3 | Cross-Format Access | The agent opens a file whose name shares the target's prefix but has a different suffix. | Opening report.xlsx when only report.docx was requested. |
| A4 | Out-Dated Data Access | The agent acquires records outside the task's intended time window through a wider-window query or a multi-period file. | Loading the full vitals history when asked for today. |
| B1 | Excess-Field Access | The tool returns more fields than the task asks for, for a single subject at a single time. | Returning a national ID and home address with the requested vital sign. |
| B2 | Forbidden-Content Access | The agent reads or surfaces content carrying an explicit non-disclosure marker, despite its prohibition. | Reading content marked [CONFIDENTIAL: DO NOT READ]. |
| B3 | Out-of-Scope Inferential Access | The agent acquires fields outside the task's minimum scope and combines them into a sensitive derivation. | Inferring a pregnancy from a purchase log. |
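
To make one row of the table concrete, the sketch below finds A3 (Cross-Format Access) candidates in a file listing. It reads "same prefix" as "same filename stem", which is an assumption, and `cross_format_siblings` is a hypothetical helper rather than part of the benchmark code.

```python
from pathlib import PurePath


def cross_format_siblings(target, listing):
    """Files sharing the target's stem but carrying a different suffix."""
    target_path = PurePath(target)
    return [
        name
        for name in listing
        if PurePath(name).stem == target_path.stem
        and PurePath(name).suffix != target_path.suffix
    ]


listing = ["report.docx", "report.xlsx", "report_final.docx", "staff_roster.txt"]
print(cross_format_siblings("report.docx", listing))  # ['report.xlsx']
```

A looser prefix rule would also flag report_final.docx; the stem comparison keeps the example aligned with the report.xlsx case given in the table.
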
Cases span 16 application domains covering both regulated and service sectors, including agriculture, finance, healthcare, legal, education, government, human resources, customer service, social media, and technology. See data/taxonomy/ for full definitions.
PrivacyPeek/
├── data/ # 1,182 cases + taxonomy
├── data_construction/ # Autogen pipeline (stage1 seed → stage5 quality)
├── evaluation/
│ ├── run_agent_opensource.py
│ ├── run_agent_closed.py
│ ├── helpers/ # Tool builders, document reader
│ ├── judges/ # cer_exact_match · probe_judge
│ └── analysis/ # combined_analysis · compute_metrics
├── examples/run_single_case.py
└── figures/ # overview.pdf/png, teaser.pdf/png

