PrivacyPeek: Auditing What LLM-Based Agents Acquire, Not Just What They Say

PrivacyPeek is a benchmark for auditing acquisition-stage privacy leakage in LLM-based agents. Existing privacy benchmarks audit what an agent's response or outgoing action discloses, but they overlook the stage at which sensitive data first enters the agent's context through tool calls. Once acquired, over-collected information is one careless action or one attack away from an outright leak. PrivacyPeek therefore inspects what the agent acquires during a task, not only what it eventually says.

What's Included

- 1,182 cases spanning 7 acquisition behaviours × 16 application domains, generated through a human-in-the-loop pipeline that combines templates with GPT-4o.
- Two complementary evaluators that audit the same trajectory:
  - Acquisition Inspection: deterministic checks on the tool-call trajectory, reporting Content-Exposure-Rate (CER), Task-Completion-Rate (TCR), and the utility-conditioned Helpful CER (HCER).
  - Probe Elicitation: an LLM judge issues a follow-up probe after tools and network are disabled, reporting Probe-Leakage-Rate (PLR) and the utility-conditioned HPLR.
- Agent runners for both open-source models (via vLLM) and closed-source APIs (OpenAI-compatible).
- An end-to-end autogen pipeline to extend the benchmark with new domains or acquisition behaviours.
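
How the raw and utility-conditioned metrics relate can be sketched in a few lines. This is an illustration only, not the repository's scoring code: the `CaseResult` record and its fields are hypothetical, and the sketch assumes that "utility-conditioned" means computing the exposure rate over only those cases where the task was completed. The authoritative definitions are in `evaluation/judges/`.

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    """Hypothetical per-case outcome; not the repository's actual schema."""
    exposed: bool    # out-of-scope sensitive content entered the agent's context
    completed: bool  # the agent finished the assigned task


def rate(flags: list) -> float:
    """Fraction of True values; 0.0 for an empty list."""
    return sum(flags) / len(flags) if flags else 0.0


def summarize(results: list) -> dict:
    # ASSUMPTION: HCER restricts the exposure rate to completed (helpful) cases.
    helpful = [r for r in results if r.completed]
    return {
        "CER": rate([r.exposed for r in results]),
        "TCR": rate([r.completed for r in results]),
        "HCER": rate([r.exposed for r in helpful]),
    }


results = [
    CaseResult(exposed=True, completed=True),
    CaseResult(exposed=False, completed=True),
    CaseResult(exposed=False, completed=True),
    CaseResult(exposed=True, completed=False),
]
summary = summarize(results)
```

Under this assumption, an agent that avoids exposure only by failing the task does not improve its HCER, which is the point of conditioning on utility.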

Installation

PrivacyPeek targets Python 3.10+.

git clone https://github.com/Xuan269/PrivacyPeek.git
cd PrivacyPeek

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

For open-source agent runs you will additionally need a vLLM server:

pip install vllm
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000 --max-model-len 16384
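
Before launching a run it can help to confirm that the server is reachable. The sketch below queries the model-listing route of the OpenAI-compatible API that vLLM serves. The helper names are hypothetical and not part of PrivacyPeek, and the sketch assumes the server was started without an API key; a server that requires one would also need an `Authorization` header.

```python
import json
import urllib.error
import urllib.request
from typing import Optional


def models_url(endpoint: str) -> str:
    """Build the model-listing URL from an OpenAI-compatible base such as .../v1."""
    return endpoint.rstrip("/") + "/models"


def list_served_models(endpoint: str, timeout: float = 5.0) -> Optional[list]:
    """Return the served model IDs, or None if the server cannot be reached."""
    try:
        with urllib.request.urlopen(models_url(endpoint), timeout=timeout) as resp:
            payload = json.load(resp)
    except (urllib.error.URLError, OSError):
        return None
    return [entry["id"] for entry in payload.get("data", [])]
```

Calling `list_served_models("http://127.0.0.1:8000/v1")` should return a list containing the model name passed to `vllm serve` once the server has finished loading, and `None` while it is still down.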

Configuration (`.env`)

Copy the template and fill in your credentials. `.env` is git-ignored; only `.env.example` is tracked.

cp .env.example .env

# .env
API_BASE=https://api.openai.com/v1            # closed-source agent backend
API_KEY=sk-...
JUDGE_API_BASE=https://api.openai.com/v1      # LLM judge for PLR
JUDGE_API_KEY=sk-...
JUDGE_MODEL=gpt-4o-2024-11-20
VLLM_ENDPOINT=http://127.0.0.1:8000/v1        # open-source agent backend
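
A quick way to sanity-check the file before a run is to parse it and look for unset values. The sketch below is a standard-library illustration and not how the repository itself loads configuration. The `REQUIRED` tuple is an assumption about which keys a closed-source run with probe judging needs, and the parser handles only simple `KEY=VALUE` lines with optional trailing `#` comments.

```python
def parse_env(text: str) -> dict:
    """Parse KEY=VALUE lines, ignoring blank lines and '#' comment lines."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop a trailing inline comment introduced by a space followed by '#'.
        value = value.split(" #", 1)[0].strip()
        values[key.strip()] = value
    return values


# ASSUMPTION: keys needed for a closed-source run plus probe judging.
REQUIRED = ("API_BASE", "API_KEY", "JUDGE_API_BASE", "JUDGE_API_KEY", "JUDGE_MODEL")


def missing_keys(values: dict) -> list:
    """Names of required settings that are absent or empty."""
    return [key for key in REQUIRED if not values.get(key)]


sample = """
# .env
API_BASE=https://api.openai.com/v1            # closed-source agent backend
API_KEY=sk-...
JUDGE_MODEL=gpt-4o-2024-11-20
"""
config = parse_env(sample)
```

For a real file, `missing_keys(parse_env(open(".env").read()))` would list whatever still needs filling in.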

Quick Start

Run a single case end-to-end:

# Closed-source API
python examples/run_single_case.py \
  --case-dir data/cases/agriculture_A1_001 \
  --backend closed \
  --model gpt-4o-mini

# Open-source via local vLLM
python examples/run_single_case.py \
  --case-dir data/cases/agriculture_A1_001 \
  --backend open \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --vllm-endpoint http://127.0.0.1:8000/v1

Running the Benchmark

# Open-source model (vLLM)
python evaluation/run_agent_opensource.py \
  --model-name meta-llama/Llama-3.1-8B-Instruct \
  --vllm-endpoint http://127.0.0.1:8000/v1 \
  --output runs/llama3.1-8b.json

# Closed-source model (API)
python evaluation/run_agent_closed.py \
  --model-name gpt-4o-mini \
  --output runs/gpt-4o-mini.json

Scoring:

# Acquisition Inspection — deterministic CER + TCR + HCER (no judge API cost)
python evaluation/judges/cer_exact_match.py \
  --runs runs/gpt-4o-mini.json \
  --output runs/cer_gpt-4o-mini.json

# Probe Elicitation — LLM-judged PLR + HPLR (uses JUDGE_API_KEY)
python evaluation/judges/probe_judge.py \
  --runs runs/gpt-4o-mini.json \
  --output runs/plr_gpt-4o-mini.json

# Combined risk matrix
python evaluation/analysis/combined_analysis.py \
  --cer runs/cer_gpt-4o-mini.json \
  --plr runs/plr_gpt-4o-mini.json \
  --output runs/combined_gpt-4o-mini.json

The closed-source runner supports --resume to skip cases that completed cleanly in a prior run. Both runners support --dims A1 A4 B3 and --limit N --start I for subsetting.
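
The effect of the subsetting flags can be pictured with a small sketch. It is not the runners' actual implementation: `select_cases` is a hypothetical helper, and it assumes that `--dims` filters by behaviour code first and that `--start` is a zero-based offset into the filtered, sorted list. The runners may order these steps differently.

```python
def select_cases(case_ids, dims=None, start=0, limit=None):
    """Filter case IDs by behaviour code, then take a window of the result.

    ASSUMPTION: the window (--start / --limit) is applied after the --dims filter.
    """
    selected = sorted(case_ids)
    if dims:
        wanted = set(dims)
        # In '{domain}_{behaviour}_{NNN}' the code is the second-to-last token.
        selected = [c for c in selected if c.rsplit("_", 2)[1] in wanted]
    end = None if limit is None else start + limit
    return selected[start:end]


# Illustrative IDs following the documented naming pattern.
cases = [
    "agriculture_A1_001",
    "agriculture_A4_002",
    "finance_A2_001",
    "finance_B3_001",
    "human_resources_A1_003",
]
subset = select_cases(cases, dims=["A1", "B3"], start=1, limit=2)
```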

Dataset

Each case in data/cases/ is a self-contained folder named {domain}_{behaviour}_{NNN}:

agriculture_A1_001/
├── config.json        # Task, system prompt, probe, sensitive items, minimum scope
├── file_listing.json
├── file_metadata.json
├── files/             # Out-of-scope sensitive files
└── tools_data/        # In-scope tool return values
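
When adding or moving cases, the naming pattern and folder layout above can be checked mechanically. The following sketch is not part of the repository and its helper names are hypothetical. It assumes a three-digit index and behaviour codes of the form `A1` or `B3`, as in the example folder.

```python
import re
from pathlib import Path

# ASSUMPTION: behaviour codes look like A1..B3 and the index has three digits.
CASE_NAME = re.compile(r"^(?P<domain>.+)_(?P<behaviour>[AB]\d)_(?P<index>\d{3})$")
EXPECTED = ("config.json", "file_listing.json", "file_metadata.json", "files", "tools_data")


def parse_case_name(name: str) -> dict:
    """Split '{domain}_{behaviour}_{NNN}'; the domain itself may contain underscores."""
    match = CASE_NAME.match(name)
    if match is None:
        raise ValueError(f"unexpected case folder name: {name!r}")
    return match.groupdict()


def missing_entries(case_dir: Path) -> list:
    """List expected files or folders that are absent from a case directory."""
    return [entry for entry in EXPECTED if not (case_dir / entry).exists()]
```

For example, `missing_entries(Path("data/cases/agriculture_A1_001"))` should return an empty list for a complete case.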

PrivacyPeek organises out-of-scope acquisition into seven acquisition behaviours (Tasks 1–7), identified by the codes A1 to A4 and B1 to B3:

| Code | Acquisition Behaviour | Definition | Example |
|------|-----------------------|------------|---------|
| A1 | Normal-Filename Access | The agent opens a file that is irrelevant to the task and whose name contains no sensitive keywords. | Opening the unrelated `staff_roster.txt` alongside the target file. |
| A2 | Sensitive-Filename Access | The agent opens a file that is irrelevant to the task and whose name contains sensitive keywords. | Opening `passwords_and_keys.txt`, a file the task never requests. |
| A3 | Cross-Format Access | The agent opens a file whose name shares the target's prefix but has a different suffix. | Opening `report.xlsx` when only `report.docx` was requested. |
| A4 | Out-Dated Data Access | The agent acquires records outside the task's intended time window through a wider-window query or a multi-period file. | Loading the full vitals history when only today's readings were requested. |
| B1 | Excess-Field Access | The tool returns more fields than the task asks for, for a single subject at a single point in time. | Returning a national ID and home address with the requested vital sign. |
| B2 | Forbidden-Content Access | The agent reads or surfaces content carrying an explicit non-disclosure marker, despite the prohibition. | Reading content marked `[CONFIDENTIAL: DO NOT READ]`. |
| B3 | Out-of-Scope Inferential Access | The agent acquires fields outside the task's minimum scope and combines them into a sensitive derivation. | Inferring a pregnancy from a purchase log. |
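
Two of these behaviours lend themselves to simple deterministic checks, sketched here for illustration. These functions are hypothetical and are not the benchmark's evaluator: the A3 check assumes "same prefix" means an identical file stem, and the B1 check assumes the minimum scope is available as a set of field names.

```python
from pathlib import PurePosixPath


def is_cross_format(accessed: str, target: str) -> bool:
    """A3-style check: same file stem as the target but a different suffix."""
    a, t = PurePosixPath(accessed), PurePosixPath(target)
    return a.stem == t.stem and a.suffix.lower() != t.suffix.lower()


def excess_fields(returned: dict, minimum_scope: set) -> set:
    """B1-style check: fields a tool returned beyond what the task needs."""
    return set(returned) - minimum_scope


# Illustrative tool output; the field names are made up for this example.
record = {"heart_rate": 72, "national_id": "X", "home_address": "Y"}
extra = excess_fields(record, {"heart_rate"})
```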

Cases span 16 application domains covering both regulated and service sectors, including agriculture, finance, healthcare, legal, education, government, human resources, customer service, social media, and technology. See data/taxonomy/ for full definitions.

Repository Structure

PrivacyPeek/
├── data/                       # 1,182 cases + taxonomy
├── data_construction/          # Autogen pipeline (stage1 seed → stage5 quality)
├── evaluation/
│   ├── run_agent_opensource.py
│   ├── run_agent_closed.py
│   ├── helpers/                # Tool builders, document reader
│   ├── judges/                 # cer_exact_match · probe_judge
│   └── analysis/               # combined_analysis · compute_metrics
├── examples/run_single_case.py
└── figures/                    # overview.pdf/png, teaser.pdf/png
