Status: public v0.7 evaluation artifact for ctxgov/agent-context-evals.
This repository is an evaluation artifact for AI-agent context health. It is not a public benchmark claim, security evaluation, provider compatibility matrix, hosted demo, or package release.
This artifact asks whether an evaluator can detect unhealthy AI-facing context before agent execution, using labeled cases, evidence spans, adapters, and review workflows.
Finding families:
- stale_claim
- conflicting_policy
- unsupported_release_claim
- unsafe_action_guidance
- hidden_terminal_failure
- missing_source_coverage
- missing_rollback
- unbounded_consequence
- missing_model_state_surface
- clean controls with no expected finding
agent-context-evals/
  README.md
  data/
    cases.jsonl
    labels.jsonl
    v0.2/
      trace_pattern_cases.jsonl
      trace_pattern_labels.jsonl
      hidden_holdout_cases.jsonl
      hidden_holdout_manifest.json
      benchmark_families.json
    v0.3/
      review_intake_cases.jsonl
      review_intake_manifest.json
    v0.4/
      hard_negative_cases.jsonl
      hard_negative_labels.jsonl
    v0.5/
      mutation_cases.jsonl
      mutation_labels.jsonl
      mutation_manifest.json
    v0.6/
      adversarial_hard_negative_cases.jsonl
      adversarial_hard_negative_labels.jsonl
      adversarial_hard_negative_manifest.json
    v0.7/
      trace_shaped_cases.jsonl
      trace_shaped_labels.jsonl
      trace_shaped_manifest.json
  adapters/
    offline_context_adapters.py
    v07_trace_adapters.py
  baselines/
    regex_baseline.py
    llm_judge_baseline.py
  ctxgov_adapter/
    run_ctxgov.py
  scoring/
    score_findings.py
    metrics.py
    score_multilabel.py
    span_diagnostics.py
    error_analysis.py
  review/
    independent-review-packet.md
    blinded-label-sheet.csv
    reviewer-rubric.md
    label-adjudication-plan.md
  demo/
    60-second-demo.gif
    60-second-demo-script.md
    index.html
    fixtures/bad_context_repo/
    reports/demo-context-health-report.md
    reports/v0.7-live-report-fixture.md
  reports/
    v0.1-results.md
    technical-report.md
    v0.2-results.md
    v0.3-readiness.md
    v0.4-results.md
    v0.5-results.md
    v0.6-results.md
    v0.7-results.md
  examples/
    clean_repo/
    stale_claim/
    conflicting_policy/
    unsupported_release_claim/
    hidden_terminal_failure/
python3 baselines/regex_baseline.py --cases data/cases.jsonl --output reports/regex-baseline-results.jsonl
python3 ctxgov_adapter/run_ctxgov.py --cases data/cases.jsonl --output reports/ctxgov-adapter-results.jsonl
python3 scoring/score_findings.py --labels data/labels.jsonl --predictions reports/regex-baseline-results.jsonl
python3 scoring/score_findings.py --labels data/labels.jsonl --predictions reports/ctxgov-adapter-results.jsonl
For v0.2 trace-pattern data:
python3 scripts/generate_v02_data.py
python3 baselines/regex_baseline.py --cases data/v0.2/trace_pattern_cases.jsonl --output reports/v0.2-regex-baseline-results.jsonl
python3 ctxgov_adapter/run_ctxgov.py --cases data/v0.2/trace_pattern_cases.jsonl --output reports/v0.2-ctxgov-heuristic-results.jsonl --mode heuristic
python3 scoring/score_findings.py --labels data/v0.2/trace_pattern_labels.jsonl --predictions reports/v0.2-regex-baseline-results.jsonl
python3 scoring/score_findings.py --labels data/v0.2/trace_pattern_labels.jsonl --predictions reports/v0.2-ctxgov-heuristic-results.jsonl
For a real CtxGov doctor invocation, pass a local checkout of https://github.com/ctxgov/ctxgov:
python3 ctxgov_adapter/run_ctxgov.py \
--cases data/v0.2/trace_pattern_cases.jsonl \
--output reports/v0.2-ctxgov-doctor-results.jsonl \
--mode doctor \
--ctxgov-root /path/to/ctxgov
The v0.2 heuristic mode does not read labels, but it is still a transparent
pattern adapter over public trace-pattern data. Treat it as scaffold validation,
not as a research benchmark result. The doctor mode shells out to CtxGov's
local ctxgov.cli doctor command with no provider/model call.
For v0.3 review and demo materials:
python3 baselines/llm_judge_baseline.py \
--cases data/v0.3/review_intake_cases.jsonl \
--output reports/v0.3-llm-judge-baseline-results.jsonl \
--manifest reports/v0.3-llm-judge-baseline-manifest.json \
--prompt-output reports/v0.3-llm-judge-prompts.jsonl
python3 scripts/build_demo_fixture.py \
--fixture demo/fixtures/bad_context_repo \
--output-dir demo/reports
bash scripts/render_demo_gif.sh demo/60-second-demo.gif
The LLM-judge harness is offline by default. It writes prompts and a manifest,
and can ingest offline reviewer/model decisions with --review-decisions, but
it does not call a provider or model.
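The offline behavior described above can be pictured with a minimal sketch. This is not the code in baselines/llm_judge_baseline.py: the prompt wording, the function name write_offline_prompts, and the manifest fields are all hypothetical, and the sketch assumes each case row carries case_id and ai_context.

```python
import json

# Hypothetical prompt wording; the repository's actual judge prompt is not shown here.
PROMPT_TEMPLATE = (
    "Review the AI-facing context below and name any context-health finding, "
    "or answer 'none'.\n\n{ai_context}"
)


def write_offline_prompts(cases, prompt_path, manifest_path):
    """Write one prompt per case plus a small manifest. No provider or model is called."""
    with open(prompt_path, "w", encoding="utf-8") as prompts:
        for case in cases:
            record = {
                "case_id": case["case_id"],
                "prompt": PROMPT_TEMPLATE.format(ai_context=case["ai_context"]),
            }
            prompts.write(json.dumps(record) + "\n")
    # Hypothetical manifest fields, chosen only for illustration.
    manifest = {"prompt_count": len(cases), "provider_called": False}
    with open(manifest_path, "w", encoding="utf-8") as out:
        json.dump(manifest, out, indent=2)
    return manifest
```

Because the sketch only writes files, its output can be handed to a human reviewer or an offline model run and the decisions ingested later.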
For v0.4 hard negatives and native CtxGov doctor coverage:
python3 baselines/regex_baseline.py \
--cases data/v0.4/hard_negative_cases.jsonl \
--output reports/v0.4-hard-negative-regex-results.jsonl
python3 ctxgov_adapter/run_ctxgov.py \
--cases data/v0.2/trace_pattern_cases.jsonl \
--output reports/v0.4-ctxgov-doctor-results.jsonl \
--mode doctor \
--ctxgov-root /path/to/ctxgov
python3 scoring/score_findings.py \
--labels data/v0.2/trace_pattern_labels.jsonl \
--predictions reports/v0.4-ctxgov-doctor-results.jsonl
For v0.5 deterministic mutation and multi-label scoring:
python3 scripts/generate_v05_mutation_data.py
python3 baselines/regex_baseline.py \
--cases data/v0.5/mutation_cases.jsonl \
--output reports/v0.5-regex-baseline-results.jsonl \
--multi-label
python3 ctxgov_adapter/run_ctxgov.py \
--cases data/v0.5/mutation_cases.jsonl \
--output reports/v0.5-ctxgov-doctor-results.jsonl \
--mode doctor \
--projection none \
--ctxgov-root /path/to/ctxgov
python3 scoring/score_multilabel.py \
--labels data/v0.5/mutation_labels.jsonl \
--predictions reports/v0.5-regex-baseline-results.jsonl
python3 scoring/score_multilabel.py \
--labels data/v0.5/mutation_labels.jsonl \
--predictions reports/v0.5-ctxgov-doctor-results.jsonl
For v0.6 adversarial hard negatives and span diagnostics:
python3 scripts/generate_v06_adversarial_hard_negatives.py
python3 baselines/regex_baseline.py \
--cases data/v0.6/adversarial_hard_negative_cases.jsonl \
--output reports/v0.6-regex-hard-negative-results.jsonl \
--multi-label
python3 ctxgov_adapter/run_ctxgov.py \
--cases data/v0.6/adversarial_hard_negative_cases.jsonl \
--output reports/v0.6-ctxgov-doctor-hard-negative-results.jsonl \
--mode doctor \
--projection none \
--ctxgov-root /path/to/ctxgov
python3 scoring/span_diagnostics.py \
--labels data/v0.5/mutation_labels.jsonl \
--predictions reports/v0.5-ctxgov-doctor-results.jsonl \
--output reports/v0.5-ctxgov-doctor-span-diagnostics.json
For v0.7 trace-shaped local evaluation:
python3 scripts/generate_v07_trace_suite.py
python3 baselines/regex_baseline.py \
--cases data/v0.7/trace_shaped_cases.jsonl \
--output reports/v0.7-regex-baseline-results.jsonl \
--multi-label
python3 ctxgov_adapter/run_ctxgov.py \
--cases data/v0.7/trace_shaped_cases.jsonl \
--output reports/v0.7-ctxgov-doctor-results.jsonl \
--mode doctor \
--projection none \
--ctxgov-root /path/to/ctxgov
python3 scoring/error_analysis.py \
--labels data/v0.7/trace_shaped_labels.jsonl \
--predictions reports/v0.7-ctxgov-doctor-results.jsonl \
--hard-negative-labels data/v0.7/trace_shaped_labels.jsonl \
--output reports/v0.7-ctxgov-doctor-error-analysis.json
The v0.7 suite contains 96 trace-shaped local cases across terminal logs, handoff summaries, AGENTS/Cursor/CLAUDE-style rules, release notes, GitHub issue/PR snippets, package registry manifests, local transcripts, and memory traces. It is a local reproducibility artifact, not a public benchmark claim.
Each data/cases.jsonl row contains:
- case_id
- split
- source
- ai_context
- expected_finding_type
- expected_evidence_span
- severity
- notes
Each data/labels.jsonl row contains:
- case_id
- finding_type
- evidence_span
- start_char
- end_char
- must_flag
- rationale
Clean controls use finding_type: "none" and must_flag: false.
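A label row can be checked against these rules with a few lines of Python. The following is a minimal sketch, not the repository's scorer: the function name check_label_row is hypothetical, and it assumes that start_char and end_char are offsets into the case's ai_context with an exclusive end, and that evidence_span holds the exact substring at those offsets.

```python
# Field names and finding types are taken from the lists in this README.
LABEL_FIELDS = {
    "case_id", "finding_type", "evidence_span",
    "start_char", "end_char", "must_flag", "rationale",
}
FINDING_TYPES = {
    "stale_claim", "conflicting_policy", "unsupported_release_claim",
    "unsafe_action_guidance", "hidden_terminal_failure",
    "missing_source_coverage", "missing_rollback",
    "unbounded_consequence", "missing_model_state_surface",
    "none",  # clean controls
}


def check_label_row(row, ai_context=None):
    """Return a list of problems found in one label row (empty if none)."""
    missing = LABEL_FIELDS - row.keys()
    if missing:
        return [f"missing fields: {sorted(missing)}"]
    problems = []
    if row["finding_type"] not in FINDING_TYPES:
        problems.append(f"unknown finding_type: {row['finding_type']}")
    is_clean = row["finding_type"] == "none"
    if is_clean and row["must_flag"] is not False:
        problems.append("clean control must have must_flag: false")
    # Assumption: offsets index into ai_context and end_char is exclusive.
    if not is_clean and ai_context is not None:
        if ai_context[row["start_char"]:row["end_char"]] != row["evidence_span"]:
            problems.append("evidence_span does not match ai_context[start_char:end_char]")
    return problems
```

Running a check like this over every row of a labels file before scoring would catch schema drift early; if the real offsets are inclusive or measured differently, the slice comparison needs adjusting.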
The v0.1 and v0.2 data are synthetic or sanitized trace-pattern data. The v0.3 review intake cases are public trace-derived material prepared for independent review, but the review is still pending. The v0.4 hard negatives are synthetic controls that reduce obvious regex false positives but do not replace independent review. The v0.5 mutation split is deterministic local scaffold data with multi-label cases and clean controls; it is a strong regression gate, not a public benchmark claim. The v0.6 adversarial hard negatives add local false-positive pressure with hazardous vocabulary in repaired, scoped, or negated context. These artifacts are useful for schema, scorer, adapter, workflow, and demo validation. They do not prove security coverage, agent safety, model reliability, provider compatibility, or real-world prevalence.
Before any public benchmark claim, this artifact needs real trace-derived cases with reviewer approval, hard-negative controls, independent reviewer labels, and a documented data construction process.
Ready for public project surface:
- v0.2 scorer, regex baseline, CtxGov heuristic adapter, and real CtxGov doctor adapter results
- v0.3 offline LLM-judge interface with no provider/model call by default
- v0.3 independent review packet with labels withheld
- v0.3 reproducible demo fixture and 60-second GIF
- v0.4 hard-negative controls and tightened regex baseline
- v0.4 native CtxGov doctor adapter run for release integrity, Memory X-Ray L1, and Task Shard coverage
- v0.5 deterministic mutation data with 160 cases, 206 labels, 40 clean controls, multi-label scoring, and native CtxGov doctor adapter run
- v0.6 adversarial hard negatives with 60 clean controls and span diagnostics
Not ready for benchmark claims:
- independent review of trace-derived cases
- adjudicated reviewer labels
- hidden holdout administration outside this public repo
- public false positive and false negative analysis on reviewed labels
The key v0.6 result is that the artifact now has both positive deterministic mutation coverage and adversarial clean controls. The 1.0000 v0.5 doctor score and 0-FP v0.6 hard-negative result validate readiness of this artifact and adapter path, not general benchmark performance.
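To make the two quantities in that statement concrete, here is a minimal sketch of a multi-label score together with a false-positive count on clean controls. It is not the code in scoring/score_multilabel.py: the function name score_multilabel_sketch is hypothetical, and it assumes prediction rows carry case_id and finding_type like label rows, with "none" meaning no finding.

```python
def score_multilabel_sketch(labels, predictions):
    """Micro precision/recall/F1 over (case_id, finding_type) pairs.

    Rows with finding_type "none" are clean controls or empty predictions
    and are excluded from the pair sets.
    """
    gold = {(r["case_id"], r["finding_type"]) for r in labels
            if r["finding_type"] != "none"}
    pred = {(r["case_id"], r["finding_type"]) for r in predictions
            if r["finding_type"] != "none"}
    clean_cases = {r["case_id"] for r in labels if r["finding_type"] == "none"}

    tp = len(gold & pred)
    fp = len(pred - gold)
    fn = len(gold - pred)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # Any finding predicted on a clean control counts as a hard-negative false positive.
    clean_fp = sum(1 for case_id, _ in pred if case_id in clean_cases)
    return {"precision": precision, "recall": recall, "f1": f1,
            "clean_false_positives": clean_fp}
```

Under this sketch, a score of 1.0 means every labeled pair was predicted and nothing else was, and a zero clean-control count means no finding was raised on any control. Both are statements about the labeled cases at hand, which is why they speak to artifact readiness and not to general performance.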
- CtxGov main repo: https://github.com/ctxgov/ctxgov
- CtxGov project page: https://ctxgov.github.io/ctxgov/
- Latest companion release: https://github.com/ctxgov/agent-context-evals/releases/tag/v0.6.0