ctxgov/agent-context-evals

Agent Context Health Eval

Status: public v0.7 evaluation artifact for ctxgov/agent-context-evals.

This repository is an evaluation artifact for AI-agent context health. It is not a public benchmark claim, security evaluation, provider compatibility matrix, hosted demo, or package release.

Purpose

This artifact asks whether an evaluator can detect unhealthy AI-facing context before agent execution, using labeled cases, evidence spans, adapters, and review workflows.

Finding families:

  • stale_claim
  • conflicting_policy
  • unsupported_release_claim
  • unsafe_action_guidance
  • hidden_terminal_failure
  • missing_source_coverage
  • missing_rollback
  • unbounded_consequence
  • missing_model_state_surface
  • clean controls with no expected finding

Structure

agent-context-evals/
  README.md
  data/
    cases.jsonl
    labels.jsonl
    v0.2/
      trace_pattern_cases.jsonl
      trace_pattern_labels.jsonl
      hidden_holdout_cases.jsonl
      hidden_holdout_manifest.json
      benchmark_families.json
    v0.3/
      review_intake_cases.jsonl
      review_intake_manifest.json
    v0.4/
      hard_negative_cases.jsonl
      hard_negative_labels.jsonl
    v0.5/
      mutation_cases.jsonl
      mutation_labels.jsonl
      mutation_manifest.json
    v0.6/
      adversarial_hard_negative_cases.jsonl
      adversarial_hard_negative_labels.jsonl
      adversarial_hard_negative_manifest.json
    v0.7/
      trace_shaped_cases.jsonl
      trace_shaped_labels.jsonl
      trace_shaped_manifest.json
  adapters/
    offline_context_adapters.py
    v07_trace_adapters.py
  baselines/
    regex_baseline.py
    llm_judge_baseline.py
  ctxgov_adapter/
    run_ctxgov.py
  scoring/
    score_findings.py
    metrics.py
    score_multilabel.py
    span_diagnostics.py
    error_analysis.py
  scripts/
    generate_v02_data.py
    build_demo_fixture.py
    render_demo_gif.sh
    generate_v05_mutation_data.py
    generate_v06_adversarial_hard_negatives.py
    generate_v07_trace_suite.py
  review/
    independent-review-packet.md
    blinded-label-sheet.csv
    reviewer-rubric.md
    label-adjudication-plan.md
  demo/
    60-second-demo.gif
    60-second-demo-script.md
    index.html
    fixtures/bad_context_repo/
    reports/demo-context-health-report.md
    reports/v0.7-live-report-fixture.md
  reports/
    v0.1-results.md
    technical-report.md
    v0.2-results.md
    v0.3-readiness.md
    v0.4-results.md
    v0.5-results.md
    v0.6-results.md
    v0.7-results.md
  examples/
    clean_repo/
    stale_claim/
    conflicting_policy/
    unsupported_release_claim/
    hidden_terminal_failure/

Quick Run

python3 baselines/regex_baseline.py --cases data/cases.jsonl --output reports/regex-baseline-results.jsonl
python3 ctxgov_adapter/run_ctxgov.py --cases data/cases.jsonl --output reports/ctxgov-adapter-results.jsonl
python3 scoring/score_findings.py --labels data/labels.jsonl --predictions reports/regex-baseline-results.jsonl
python3 scoring/score_findings.py --labels data/labels.jsonl --predictions reports/ctxgov-adapter-results.jsonl
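
For orientation, the sketch below shows roughly what a regex baseline over case rows could look like. It is not the code in baselines/regex_baseline.py. The PATTERNS table, the predict function, and the shape of the prediction row are assumptions made for illustration; only the field names case_id, ai_context, finding_type, evidence_span, start_char, and end_char come from the Case Schema section.

```python
import json
import re

# Hypothetical patterns, one per finding family; the real baseline may differ.
PATTERNS = {
    "missing_rollback": re.compile(r"no rollback (plan|path)", re.IGNORECASE),
    "hidden_terminal_failure": re.compile(r"exit code [1-9]\d*", re.IGNORECASE),
}


def predict(case: dict) -> dict:
    """Return one prediction row for a case, or a 'none' row if nothing matches."""
    text = case["ai_context"]
    for finding_type, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match:
            return {
                "case_id": case["case_id"],
                "finding_type": finding_type,
                "evidence_span": match.group(0),
                "start_char": match.start(),
                "end_char": match.end(),  # exclusive, like a Python slice
            }
    # Assumed convention: an unflagged case is reported with finding_type "none".
    return {
        "case_id": case["case_id"],
        "finding_type": "none",
        "evidence_span": "",
        "start_char": 0,
        "end_char": 0,
    }


def predict_jsonl(lines):
    """Map an iterable of JSONL strings to serialized prediction rows."""
    return [json.dumps(predict(json.loads(line))) for line in lines if line.strip()]
```

A baseline of this kind is easy to audit because every prediction points back to the exact characters that triggered it.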

For v0.2 trace-pattern data:

python3 scripts/generate_v02_data.py
python3 baselines/regex_baseline.py --cases data/v0.2/trace_pattern_cases.jsonl --output reports/v0.2-regex-baseline-results.jsonl
python3 ctxgov_adapter/run_ctxgov.py --cases data/v0.2/trace_pattern_cases.jsonl --output reports/v0.2-ctxgov-heuristic-results.jsonl --mode heuristic
python3 scoring/score_findings.py --labels data/v0.2/trace_pattern_labels.jsonl --predictions reports/v0.2-regex-baseline-results.jsonl
python3 scoring/score_findings.py --labels data/v0.2/trace_pattern_labels.jsonl --predictions reports/v0.2-ctxgov-heuristic-results.jsonl

For a real CtxGov doctor invocation, pass a local checkout of https://github.com/ctxgov/ctxgov:

python3 ctxgov_adapter/run_ctxgov.py \
  --cases data/v0.2/trace_pattern_cases.jsonl \
  --output reports/v0.2-ctxgov-doctor-results.jsonl \
  --mode doctor \
  --ctxgov-root /path/to/ctxgov

The v0.2 heuristic mode does not read labels, but it is still a transparent pattern adapter over public trace-pattern data. Treat it as scaffold validation, not as a research benchmark result. The doctor mode shells out to CtxGov's local ctxgov.cli doctor command with no provider/model call.
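
As a rough illustration of what "shelling out" means here, the sketch below runs a local command from a checkout directory and captures its output without any network or provider call. It is not the code in ctxgov_adapter/run_ctxgov.py. The module form `python -m ctxgov.cli doctor` is an assumption, and the sketch deliberately invents no flags: any arguments must be supplied by the caller.

```python
import subprocess
import sys
from pathlib import Path


def build_doctor_command(extra_args=()):
    """Build the argv list for a local doctor run.

    Assumption: the CLI is runnable as a module via `python -m ctxgov.cli doctor`.
    No flags are assumed; callers pass whatever the real CLI accepts.
    """
    return [sys.executable, "-m", "ctxgov.cli", "doctor", *extra_args]


def run_doctor(ctxgov_root, extra_args=(), timeout=60):
    """Run the doctor command from the checkout root and capture its output."""
    root = Path(ctxgov_root)
    if not root.is_dir():
        raise FileNotFoundError(f"ctxgov checkout not found: {root}")
    return subprocess.run(
        build_doctor_command(extra_args),
        cwd=root,             # resolve the package from the local checkout
        capture_output=True,  # keep stdout/stderr for later parsing
        text=True,
        timeout=timeout,
        check=False,          # a non-zero exit is data here, not an exception
    )
```

Passing check=False is a deliberate choice in this sketch: a health checker may well exit non-zero when it finds problems, and the adapter would want to read that result.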

For v0.3 review and demo materials:

python3 baselines/llm_judge_baseline.py \
  --cases data/v0.3/review_intake_cases.jsonl \
  --output reports/v0.3-llm-judge-baseline-results.jsonl \
  --manifest reports/v0.3-llm-judge-baseline-manifest.json \
  --prompt-output reports/v0.3-llm-judge-prompts.jsonl

python3 scripts/build_demo_fixture.py \
  --fixture demo/fixtures/bad_context_repo \
  --output-dir demo/reports

bash scripts/render_demo_gif.sh demo/60-second-demo.gif

The LLM-judge harness is offline by default. It writes prompts and a manifest, and can ingest offline reviewer/model decisions with --review-decisions, but it does not call a provider or model.
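
The offline flow described above (render prompts, write a manifest, optionally ingest decisions made elsewhere) can be sketched as follows. This is a minimal illustration, not the code in baselines/llm_judge_baseline.py. The prompt template, the function names, the status values, and the manifest keys are all hypothetical.

```python
PROMPT_TEMPLATE = (
    "You are reviewing AI-facing context for health findings.\n"
    "Context:\n{ai_context}\n"
    "Answer with one finding type or 'none'."
)


def build_prompts(cases):
    """Render one prompt record per case; no provider or model is called."""
    return [
        {
            "case_id": case["case_id"],
            "prompt": PROMPT_TEMPLATE.format(ai_context=case["ai_context"]),
        }
        for case in cases
    ]


def ingest_decisions(prompts, decisions):
    """Join offline decisions to prompts by case_id; undecided cases stay pending."""
    by_id = {d["case_id"]: d["finding_type"] for d in decisions}
    return [
        {
            "case_id": p["case_id"],
            "finding_type": by_id.get(p["case_id"]),
            "status": "decided" if p["case_id"] in by_id else "pending",
        }
        for p in prompts
    ]


def build_manifest(prompts, results):
    """Summarize the run; provider_calls is always 0 in this offline sketch."""
    return {
        "prompt_count": len(prompts),
        "decided_count": sum(r["status"] == "decided" for r in results),
        "provider_calls": 0,
    }
```

Keeping prompt rendering separate from decision ingestion means the same prompts file can be handed to a human reviewer or to a model run outside this repository.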

For v0.4 hard negatives and native CtxGov doctor coverage (the doctor run and scoring commands below use the v0.2 trace-pattern data):

python3 baselines/regex_baseline.py \
  --cases data/v0.4/hard_negative_cases.jsonl \
  --output reports/v0.4-hard-negative-regex-results.jsonl

python3 ctxgov_adapter/run_ctxgov.py \
  --cases data/v0.2/trace_pattern_cases.jsonl \
  --output reports/v0.4-ctxgov-doctor-results.jsonl \
  --mode doctor \
  --ctxgov-root /path/to/ctxgov

python3 scoring/score_findings.py \
  --labels data/v0.2/trace_pattern_labels.jsonl \
  --predictions reports/v0.4-ctxgov-doctor-results.jsonl

For v0.5 deterministic mutation and multi-label scoring:

python3 scripts/generate_v05_mutation_data.py

python3 baselines/regex_baseline.py \
  --cases data/v0.5/mutation_cases.jsonl \
  --output reports/v0.5-regex-baseline-results.jsonl \
  --multi-label

python3 ctxgov_adapter/run_ctxgov.py \
  --cases data/v0.5/mutation_cases.jsonl \
  --output reports/v0.5-ctxgov-doctor-results.jsonl \
  --mode doctor \
  --projection none \
  --ctxgov-root /path/to/ctxgov

python3 scoring/score_multilabel.py \
  --labels data/v0.5/mutation_labels.jsonl \
  --predictions reports/v0.5-regex-baseline-results.jsonl

python3 scoring/score_multilabel.py \
  --labels data/v0.5/mutation_labels.jsonl \
  --predictions reports/v0.5-ctxgov-doctor-results.jsonl
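
This README does not spell out the metric definitions used by scoring/score_multilabel.py. As one plausible reading, the sketch below computes micro-averaged precision, recall, and F1 over per-case sets of finding types. The function names and the convention of scoring an empty denominator as 1.0 are assumptions, and rows with finding_type "none" are treated as clean controls with an empty label set.

```python
from collections import defaultdict


def group_types(rows):
    """Collect the set of non-'none' finding types per case_id."""
    grouped = defaultdict(set)
    for row in rows:
        grouped[row["case_id"]]  # touch the key so clean controls get an empty set
        if row["finding_type"] != "none":
            grouped[row["case_id"]].add(row["finding_type"])
    return grouped


def micro_prf(labels, predictions):
    """Micro-averaged precision/recall/F1 over (case_id, finding_type) pairs."""
    gold, pred = group_types(labels), group_types(predictions)
    tp = fp = fn = 0
    for case_id in set(gold) | set(pred):
        g = gold.get(case_id, set())
        p = pred.get(case_id, set())
        tp += len(g & p)
        fp += len(p - g)
        fn += len(g - p)
    # Assumed convention: an empty denominator counts as a perfect score.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}
```

Under this reading, a prediction on a clean control counts as a false positive, which is why clean controls matter for multi-label scoring.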

For v0.6 adversarial hard negatives and span diagnostics (the span diagnostics command below reads the v0.5 mutation labels and doctor predictions):

python3 scripts/generate_v06_adversarial_hard_negatives.py

python3 baselines/regex_baseline.py \
  --cases data/v0.6/adversarial_hard_negative_cases.jsonl \
  --output reports/v0.6-regex-hard-negative-results.jsonl \
  --multi-label

python3 ctxgov_adapter/run_ctxgov.py \
  --cases data/v0.6/adversarial_hard_negative_cases.jsonl \
  --output reports/v0.6-ctxgov-doctor-hard-negative-results.jsonl \
  --mode doctor \
  --projection none \
  --ctxgov-root /path/to/ctxgov

python3 scoring/span_diagnostics.py \
  --labels data/v0.5/mutation_labels.jsonl \
  --predictions reports/v0.5-ctxgov-doctor-results.jsonl \
  --output reports/v0.5-ctxgov-doctor-span-diagnostics.json
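
Span diagnostics compare where a prediction points with where the label points. The sketch below shows one common way to do that with the start_char and end_char fields: character overlap and intersection-over-union. It is an illustration, not the logic in scoring/span_diagnostics.py. It assumes end_char is exclusive, as in a Python slice, and the "exact", "partial", and "miss" categories are hypothetical names.

```python
def span_overlap(gold: dict, pred: dict) -> int:
    """Number of characters shared by two half-open [start_char, end_char) spans."""
    start = max(gold["start_char"], pred["start_char"])
    end = min(gold["end_char"], pred["end_char"])
    return max(0, end - start)


def span_iou(gold: dict, pred: dict) -> float:
    """Intersection over union of two character spans; 0.0 if both are empty."""
    inter = span_overlap(gold, pred)
    gold_len = gold["end_char"] - gold["start_char"]
    pred_len = pred["end_char"] - pred["start_char"]
    union = gold_len + pred_len - inter
    return inter / union if union else 0.0


def classify_span(gold: dict, pred: dict) -> str:
    """Bucket a predicted span as 'exact', 'partial', or 'miss'."""
    same_bounds = (gold["start_char"], gold["end_char"]) == (
        pred["start_char"], pred["end_char"])
    if same_bounds:
        return "exact"
    return "partial" if span_overlap(gold, pred) > 0 else "miss"
```

For a gold span of 10 to 20 and a predicted span of 15 to 25, the overlap is 5 characters, the union is 15, and the result is classified as "partial".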

For v0.7 trace-shaped local evaluation:

python3 scripts/generate_v07_trace_suite.py

python3 baselines/regex_baseline.py \
  --cases data/v0.7/trace_shaped_cases.jsonl \
  --output reports/v0.7-regex-baseline-results.jsonl \
  --multi-label

python3 ctxgov_adapter/run_ctxgov.py \
  --cases data/v0.7/trace_shaped_cases.jsonl \
  --output reports/v0.7-ctxgov-doctor-results.jsonl \
  --mode doctor \
  --projection none \
  --ctxgov-root /path/to/ctxgov

python3 scoring/error_analysis.py \
  --labels data/v0.7/trace_shaped_labels.jsonl \
  --predictions reports/v0.7-ctxgov-doctor-results.jsonl \
  --hard-negative-labels data/v0.7/trace_shaped_labels.jsonl \
  --output reports/v0.7-ctxgov-doctor-error-analysis.json

The v0.7 suite contains 96 trace-shaped local cases across terminal logs, handoff summaries, AGENTS/Cursor/CLAUDE-style rules, release notes, GitHub issue/PR snippets, package registry manifests, local transcripts, and memory traces. It is a local reproducibility artifact, not a public benchmark claim.
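
An error analysis over a suite like this typically breaks mistakes down by finding family and reports how often clean controls were flagged. The sketch below shows that breakdown under assumed conventions. It is not the code in scoring/error_analysis.py, and the function name and output keys are hypothetical.

```python
from collections import Counter


def _types_by_case(rows):
    """Map case_id to its set of non-'none' finding types."""
    grouped = {}
    for row in rows:
        types = grouped.setdefault(row["case_id"], set())
        if row["finding_type"] != "none":
            types.add(row["finding_type"])
    return grouped


def error_breakdown(labels, predictions):
    """Count false positives and false negatives per family, plus flagged clean controls."""
    gold = _types_by_case(labels)
    pred = _types_by_case(predictions)
    false_pos, false_neg = Counter(), Counter()
    clean_flagged = 0
    for case_id, gold_types in gold.items():
        pred_types = pred.get(case_id, set())
        false_pos.update(pred_types - gold_types)
        false_neg.update(gold_types - pred_types)
        if not gold_types and pred_types:
            clean_flagged += 1  # any finding on a clean control is a false alarm
    clean_total = sum(1 for types in gold.values() if not types)
    return {
        "false_positives": dict(false_pos),
        "false_negatives": dict(false_neg),
        "clean_controls": clean_total,
        "clean_controls_flagged": clean_flagged,
    }
```

Reporting flagged clean controls separately keeps a low overall false-positive count from hiding a systematic problem on controls.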

Case Schema

Each data/cases.jsonl row contains:

  • case_id
  • split
  • source
  • ai_context
  • expected_finding_type
  • expected_evidence_span
  • severity
  • notes

Each data/labels.jsonl row contains:

  • case_id
  • finding_type
  • evidence_span
  • start_char
  • end_char
  • must_flag
  • rationale

Clean controls use finding_type: "none" and must_flag: false.
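
The two schemas above can be checked mechanically. The sketch below validates one case row against its label row: required fields, matching case_id, the clean-control convention, and agreement between evidence_span and the character offsets. The validate_pair function is hypothetical, and the check assumes that start_char and end_char index into ai_context with an exclusive end, which this README does not state.

```python
CASE_FIELDS = {
    "case_id", "split", "source", "ai_context", "expected_finding_type",
    "expected_evidence_span", "severity", "notes",
}
LABEL_FIELDS = {
    "case_id", "finding_type", "evidence_span", "start_char",
    "end_char", "must_flag", "rationale",
}


def validate_pair(case: dict, label: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the pair is consistent."""
    problems = []
    for name, row, required in (("case", case, CASE_FIELDS),
                                ("label", label, LABEL_FIELDS)):
        missing = sorted(required - set(row))
        if missing:
            problems.append(f"{name} is missing fields: {', '.join(missing)}")
    if problems:
        return problems  # later checks need every field to be present
    if case["case_id"] != label["case_id"]:
        problems.append("case_id mismatch")
    if label["finding_type"] == "none":
        if label["must_flag"]:
            problems.append("clean control has must_flag set")
    else:
        # Assumption: offsets index into ai_context, end exclusive.
        span = case["ai_context"][label["start_char"]:label["end_char"]]
        if span != label["evidence_span"]:
            problems.append("evidence_span does not match character offsets")
    return problems
```

Running a check like this over every row before scoring catches offset drift early, for example after a case text has been edited without updating its label.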

Limitations

Each data version has its own provenance and limits:

  • v0.1 and v0.2: synthetic or sanitized trace-pattern data.
  • v0.3: the review intake cases are public trace-derived material prepared for independent review, but the review is still pending.
  • v0.4: the hard negatives are synthetic controls that reduce obvious regex false positives but do not replace independent review.
  • v0.5: the mutation split is deterministic local scaffold data with multi-label cases and clean controls; it is a strong regression gate, not a public benchmark claim.
  • v0.6: the adversarial hard negatives add local false-positive pressure with hazardous vocabulary in repaired, scoped, or negated context.

These artifacts are useful for schema, scorer, adapter, workflow, and demo validation. They do not prove security coverage, agent safety, model reliability, provider compatibility, or real-world prevalence.
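
To make the idea of a hard negative concrete, the toy sketch below contrasts a naive keyword check with one that looks for a negation cue just before the hazardous phrase. It is only an illustration of why negated context creates false-positive pressure. The patterns, the 40-character window, and both function names are hypothetical, and this is not how the regex baseline or CtxGov handles such cases.

```python
import re

# Hypothetical hazard and negation vocabularies, chosen only for this example.
HAZARD = re.compile(r"rm -rf|force[- ]push|drop table", re.IGNORECASE)
NEGATION = re.compile(r"\b(do not|don't|never|must not|no longer)\b", re.IGNORECASE)


def naive_flag(text: str) -> bool:
    """Flag any text that mentions a hazardous phrase, regardless of context."""
    return HAZARD.search(text) is not None


def context_aware_flag(text: str, window: int = 40) -> bool:
    """Flag a hazardous phrase only if no negation cue appears shortly before it."""
    for match in HAZARD.finditer(text):
        preceding = text[max(0, match.start() - window):match.start()]
        if not NEGATION.search(preceding):
            return True
    return False
```

On "Never run rm -rf on the repo root." the naive check fires and the context-aware check does not, while both fire on "Run rm -rf build/ before release." A fixed window like this is fragile, which is part of why such cases need labeled controls instead of ad hoc rules.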

Before any public benchmark claim, this artifact needs real trace-derived cases with reviewer approval, hard-negative controls, independent reviewer labels, and a documented data construction process.

Readiness

Ready for the public project surface:

  • v0.2 scorer, regex baseline, CtxGov heuristic adapter, and real CtxGov doctor adapter results
  • v0.3 offline LLM-judge interface with no provider/model call by default
  • v0.3 independent review packet with labels withheld
  • v0.3 reproducible demo fixture and 60-second GIF
  • v0.4 hard-negative controls and tightened regex baseline
  • v0.4 native CtxGov doctor adapter run for release integrity, Memory X-Ray L1, and Task Shard coverage
  • v0.5 deterministic mutation data with 160 cases, 206 labels, 40 clean controls, multi-label scoring, and native CtxGov doctor adapter run
  • v0.6 adversarial hard negatives with 60 clean controls and span diagnostics

Not ready for benchmark claims:

  • independent review of trace-derived cases
  • adjudicated reviewer labels
  • hidden holdout administration outside this public repo
  • public false positive and false negative analysis on reviewed labels

The key v0.6 result is that the artifact now has both positive deterministic mutation coverage and adversarial clean controls. The 1.0000 v0.5 doctor score and the zero-false-positive v0.6 hard-negative result validate the readiness of this artifact and adapter path, not general benchmark performance.

Related Project

  • CtxGov main repo: https://github.com/ctxgov/ctxgov
  • CtxGov project page: https://ctxgov.github.io/ctxgov/
  • Latest companion release: https://github.com/ctxgov/agent-context-evals/releases/tag/v0.6.0