
# EvasionBench Project Recap: Work, Results, and Decisions

This notebook rebuilds a full project recap from repository evidence.

It answers three questions:
1. What did we build and run?
2. What results did we get?
3. Which decisions did we make, and why?

Every value is read at run time from files under `artifacts/` and `docs/` or from git history, so rerunning the notebook against the same commit reproduces the recap.



## Outline

1. Repository and workflow snapshot
2. Phase 3-4 analysis results
3. Phase 5-6 modeling results
4. Explainability and label quality
5. Phase 7 reporting pipeline status
6. Phase 8 optimization and serving decision
7. Decision log (what we chose and why)
8. Final recap


In [1]:

from __future__ import annotations

import csv
import json
import statistics
import subprocess
from datetime import datetime, timezone
from pathlib import Path
from typing import Any


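def find_repo_root(start: Path) -> Path:
    """Return the nearest ancestor of `start` containing AGENTS.md and artifacts/."""
    for candidate in [start, *start.parents]:
        if (candidate / "AGENTS.md").exists() and (candidate / "artifacts").exists():
            return candidate
    # No marker found: fall back to `start`; later reads will then report [missing].
    return start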


ROOT = find_repo_root(Path.cwd())
print(f"Repo root: {ROOT}")
print(f"Notebook run at (UTC): {datetime.now(timezone.utc).isoformat()}")


def read_json(rel_path: str, default: Any = None) -> Any:
    path = ROOT / rel_path
    if not path.exists():
        print(f"[missing] {rel_path}")
        return {} if default is None else default
    with path.open("r", encoding="utf-8") as f:
        return json.load(f)


def read_csv(rel_path: str) -> list[dict[str, str]]:
    path = ROOT / rel_path
    if not path.exists():
        print(f"[missing] {rel_path}")
        return []
    with path.open("r", encoding="utf-8", newline="") as f:
        return list(csv.DictReader(f))


def fmt(x: Any, digits: int = 4) -> str:
    if isinstance(x, float):
        return f"{x:.{digits}f}"
    return str(x)


def print_table(headers: list[str], rows: list[list[Any]], max_rows: int | None = None) -> None:
    if max_rows is not None:
        rows = rows[:max_rows]
    widths = [len(h) for h in headers]
    for row in rows:
        for i, value in enumerate(row):
            widths[i] = max(widths[i], len(fmt(value)))

    def render_row(values: list[Any]) -> str:
        return " | ".join(fmt(v).ljust(widths[i]) for i, v in enumerate(values))

    print(render_row(headers))
    print("-+-".join("-" * w for w in widths))
    for row in rows:
        print(render_row(row))


Repo root: /Users/dustinober/Projects/EvasionBench
Notebook run at (UTC): 2026-02-13T17:07:17.526618+00:00
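
`find_repo_root` falls back to the starting directory when no marker is found, so a notebook launched from outside the repository prints a plausible-looking root and every later read reports `[missing]`. A stricter variant, sketched here on the assumption that failing fast is preferable, raises instead. The name `find_repo_root_strict` is hypothetical and not part of the repository.

```python
from pathlib import Path


def find_repo_root_strict(start: Path) -> Path:
    """Return the first ancestor holding AGENTS.md and artifacts/, or raise."""
    for candidate in [start, *start.parents]:
        if (candidate / "AGENTS.md").exists() and (candidate / "artifacts").exists():
            return candidate
    # Hypothetical behavior: stop here rather than continue with a wrong root.
    raise FileNotFoundError(
        f"No directory containing AGENTS.md and artifacts/ found at or above {start}"
    )
```

Swapping this in for `find_repo_root` would turn a wrong working directory into one clear error instead of a page of `[missing]` lines.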



## 1) Repository and Workflow Snapshot


In [2]:

scripts_count = len(list((ROOT / "scripts").glob("*.py")))
src_modules = len(list((ROOT / "src").rglob("*.py")))
test_files = len(list((ROOT / "tests").glob("test_*.py")))

phase_dirs = [
    "artifacts/analysis/phase3",
    "artifacts/analysis/phase4",
    "artifacts/models/phase5",
    "artifacts/models/phase6",
    "artifacts/explainability/phase6",
    "artifacts/diagnostics/phase6",
    "artifacts/reports/phase7",
    "artifacts/models/phase8",
]

print("Script-first implementation footprint")
print(f"- script entrypoints (.py): {scripts_count}")
print(f"- src modules (.py): {src_modules}")
print(f"- test files: {test_files}")
print()

rows = []
for rel in phase_dirs:
    p = ROOT / rel
    file_count = sum(1 for item in p.rglob("*") if item.is_file()) if p.exists() else 0
    rows.append([rel, p.exists(), file_count])

print("Artifact footprint by phase")
print_table(["path", "exists", "file_count"], rows)

print()
print("Recent git timeline (most recent first)")
try:
    log_out = subprocess.check_output(
        ["git", "log", "--pretty=format:%h %ad %s", "--date=short", "-n", "12"],
        cwd=ROOT,
        text=True,
    )
    print(log_out)
except (subprocess.CalledProcessError, OSError) as exc:
    print(f"Could not read git log: {exc}")


Script-first implementation footprint
- script entrypoints (.py): 30
- src modules (.py): 18
- test files: 27

Artifact footprint by phase
path                            | exists | file_count
--------------------------------+--------+-----------
artifacts/analysis/phase3       | True   | 38        
artifacts/analysis/phase4       | True   | 21        
artifacts/models/phase5         | True   | 27        
artifacts/models/phase6         | True   | 7         
artifacts/explainability/phase6 | True   | 14        
artifacts/diagnostics/phase6    | True   | 5         
artifacts/reports/phase7        | True   | 8         
artifacts/models/phase8         | True   | 9         

Recent git timeline (most recent first)
bb3b843 2026-02-13 feat: Generate new model optimization artifacts for phase8_3 and phase8_nonlogreg_tight.
e15e693 2026-02-13 feat: Generate phase8_2b SVM model optimization artifacts and update the optimization script.
fdedf9b 2026-02-13 feat: Add trained SVM model and associat


## 2) Phase 3-4 Analysis Results


In [3]:

class_dist = read_json("artifacts/analysis/phase3/core_stats/class_distribution.json", default=[])
length_tests = read_json("artifacts/analysis/phase3/core_stats/length_tests.json", default={})
semantic_h = read_json("artifacts/analysis/phase4/semantic_similarity/hypothesis_summary.json", default={})
question_behavior = read_json("artifacts/analysis/phase4/question_behavior/question_behavior_summary.json", default={})

print("Phase 3: class distribution")
total = sum(item.get("count", 0) for item in class_dist) or 1
rows = []
for item in class_dist:
    label = item.get("label", "unknown")
    count = item.get("count", 0)
    pct = 100.0 * count / total
    rows.append([label, count, f"{pct:.2f}%"])
print_table(["label", "count", "share"], rows)

print()
print("Phase 3: length hypothesis tests")
for metric_name in ["answer_length", "question_length"]:
    block = length_tests.get(metric_name, {})
    kruskal = block.get("kruskal", {})
    p_value = kruskal.get("p_value")
    interpretation = block.get("interpretation", "")
    print(f"- {metric_name}: p_value={fmt(p_value, 6)} | {interpretation}")

print()
print("Phase 4: semantic hypothesis summary")
for item in semantic_h.get("hypotheses", []):
    print(f"- {item.get('id')}: {item.get('name')} -> {item.get('finding')}")

print()
print("Phase 4: question behavior refusal spread (top 5)")
spreads = question_behavior.get("refusal_spread_by_question_type", [])
spreads_sorted = sorted(spreads, key=lambda x: x.get("refusal_rate_spread", 0), reverse=True)
rows = []
for item in spreads_sorted[:5]:
    rows.append([
        item.get("question_type"),
        item.get("min_refusal_rate"),
        item.get("max_refusal_rate"),
        item.get("refusal_rate_spread"),
    ])
print_table(["question_type", "min_refusal", "max_refusal", "spread"], rows)


Phase 3: class distribution
label         | count | share 
--------------+-------+-------
direct        | 8749  | 52.31%
intermediate  | 7359  | 44.00%
fully_evasive | 618   | 3.69% 

Phase 3: length hypothesis tests
- answer_length: p_value=0.000000 | Statistically significant differences detected across labels.
- question_length: p_value=0.000000 | Statistically significant differences detected across labels.

Phase 4: semantic hypothesis summary
- H1: relevance_deflection -> Highest alignment label: direct (0.0800); lowest alignment label: fully_evasive (0.0526); delta=0.0274.
- H2: vague_response_behavior -> Similarity standard deviations are reported per label for interpretation.

Phase 4: question behavior refusal spread (top 5)
question_type | min_refusal | max_refusal | spread
--------------+-------------+-------------+-------
yes_no        | 0.0615      | 0.2593      | 0.1978
factual       | 0.0643      | 0.2407      | 0.1764
opinion       | 0.0799      | 0.2432      | 0.1633
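
Both length tests print `p_value=0.000000` because `fmt` uses fixed-point formatting, which rounds any value below 5e-7 to zero at six digits. The tests are significant, but the printed value hides how small the p-value is. A formatter that switches to scientific notation for small values keeps the magnitude visible. This is a sketch only; `fmt_p` and its `floor` threshold are illustrative names, not part of the notebook's helpers.

```python
from __future__ import annotations


def fmt_p(p: float | None, digits: int = 3, floor: float = 1e-4) -> str:
    """Format a p-value, using scientific notation below `floor`."""
    if p is None:
        return "n/a"
    if p < floor:
        # Scientific notation preserves the order of magnitude.
        return f"{p:.{digits}e}"
    return f"{p:.{digits + 1}f}"


print(fmt_p(3.2e-41))  # illustrative input, not a value from the artifacts
print(fmt_p(0.0312))
```

Replacing `fmt(p_value, 6)` with a call like this in the length-test loop would be enough to report the actual magnitudes.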



## 3) Phase 5-6 Modeling Results
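
The class counts reported in section 2 (8749 `direct`, 7359 `intermediate`, 618 `fully_evasive`) give a reference point for the metrics in this section. A classifier that always predicts `direct` would reach about 0.52 accuracy and about 0.23 macro-F1, assuming the evaluation split follows the same label distribution, which the notebook does not confirm. The helper name below is illustrative.

```python
def majority_baseline(counts: dict[str, int]) -> dict[str, float]:
    """Accuracy and macro-F1 of always predicting the most frequent label."""
    total = sum(counts.values())
    majority = max(counts, key=counts.get)
    precision = counts[majority] / total  # every prediction is the majority label
    recall = 1.0                          # every majority example is recovered
    f1_majority = 2 * precision * recall / (precision + recall)
    # Labels that are never predicted contribute an F1 of 0 to the macro average.
    return {
        "accuracy": precision,
        "f1_macro": f1_majority / len(counts),
    }


baseline = majority_baseline({"direct": 8749, "intermediate": 7359, "fully_evasive": 618})
print({k: round(v, 4) for k, v in baseline.items()})
```

Against this floor, macro-F1 is the more informative column below, since accuracy alone rewards ignoring the 3.69% `fully_evasive` class.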


In [4]:

phase5_summary = read_json("artifacts/models/phase5/run_summary.json", default={})
phase5_metrics_files = {
    "logreg": read_json("artifacts/models/phase5/logreg/metrics.json", default={}),
    "tree": read_json("artifacts/models/phase5/tree/metrics.json", default={}),
    "boosting": read_json("artifacts/models/phase5/boosting/metrics.json", default={}),
    "tree_boosting": read_json("artifacts/models/phase5/tree_boosting/metrics.json", default={}),
}
phase6_transformer = read_json("artifacts/models/phase6/transformer/metrics.json", default={})

print("Phase 5 canonical summary (run_summary.json)")
rows = []
for family, metrics in sorted(phase5_summary.items()):
    rows.append([
        family,
        metrics.get("accuracy"),
        metrics.get("f1_macro"),
        metrics.get("precision_macro"),
        metrics.get("recall_macro"),
    ])
print_table(["family", "accuracy", "f1_macro", "precision_macro", "recall_macro"], rows)

if rows:
    best = max(rows, key=lambda r: r[2] if isinstance(r[2], float) else -1)
    print()
    print(f"Best classical family by macro-F1 (from summary): {best[0]} ({fmt(best[2])})")

print()
print("Phase 5 direct metrics files (for sanity check)")
rows = []
for family, metrics in phase5_metrics_files.items():
    rows.append([
        family,
        metrics.get("accuracy"),
        metrics.get("f1_macro"),
        metrics.get("precision_macro"),
        metrics.get("recall_macro"),
    ])
print_table(["family", "accuracy", "f1_macro", "precision_macro", "recall_macro"], rows)

print()
print("Phase 6 transformer metrics")
print(phase6_transformer)


Phase 5 canonical summary (run_summary.json)
family   | accuracy | f1_macro | precision_macro | recall_macro
---------+----------+----------+-----------------+-------------
boosting | 0.6088   | 0.4496   | 0.6259          | 0.4389      
logreg   | 0.6432   | 0.5343   | 0.6566          | 0.5050      
tree     | 0.5870   | 0.4355   | 0.6929          | 0.4223      

Best classical family by macro-F1 (from summary): logreg (0.5343)

Phase 5 direct metrics files (for sanity check)
family        | accuracy | f1_macro | precision_macro | recall_macro
--------------+----------+----------+-----------------+-------------
logreg        | 0.5000   | 0.3333   | 0.2500          | 0.5000      
tree          | 0.5870   | 0.4355   | 0.6929          | 0.4223      
boosting      | 0.6088   | 0.4496   | 0.6259          | 0.4389      
tree_boosting | 0.5882   | 0.4356   | 0.6621          | 0.4229      

Phase 6 transformer metrics
{'accuracy': 0.5, 'f1_macro': 0.3333333333333333, 'precision_macro': 0.25, '
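
The two Phase 5 tables disagree for `logreg`: `run_summary.json` reports accuracy 0.6432 and macro-F1 0.5343, while `logreg/metrics.json` reports 0.5000 and 0.3333. The `tree` and `boosting` rows match, and the Phase 6 transformer output shows the same 0.5 / 0.3333 / 0.25 pattern as the `logreg` file. The notebook does not establish the cause. A sketch of an automated comparison follows; `compare_metrics` and its tolerance are assumptions for illustration.

```python
from __future__ import annotations

from typing import Any


def compare_metrics(
    summary: dict[str, dict[str, Any]],
    direct: dict[str, dict[str, Any]],
    keys: tuple[str, ...] = ("accuracy", "f1_macro"),
    tol: float = 1e-4,
) -> list[str]:
    """List metrics whose summary and per-family values differ by more than tol."""
    problems = []
    for family in sorted(set(summary) & set(direct)):
        for key in keys:
            a, b = summary[family].get(key), direct[family].get(key)
            if not isinstance(a, (int, float)) or not isinstance(b, (int, float)):
                problems.append(f"{family}.{key}: missing value")
            elif abs(a - b) > tol:
                problems.append(f"{family}.{key}: summary={a:.4f} direct={b:.4f}")
    return problems


# Values copied from the two tables above, rounded to four decimals.
summary = {"logreg": {"accuracy": 0.6432, "f1_macro": 0.5343},
           "tree": {"accuracy": 0.5870, "f1_macro": 0.4355}}
direct = {"logreg": {"accuracy": 0.5000, "f1_macro": 0.3333},
          "tree": {"accuracy": 0.5870, "f1_macro": 0.4355}}
mismatches = compare_metrics(summary, direct)
for line in mismatches:
    print(line)
```

Calling this with `phase5_summary` and `phase5_metrics_files` would make the mismatch an explicit finding rather than something the reader has to spot.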


## 4) Explainability and Label Quality


In [5]:

xai_summary = read_json("artifacts/explainability/phase6/xai_summary.json", default={})
label_diag = read_json("artifacts/diagnostics/phase6/label_diagnostics_summary.json", default={})
near_dupes = read_csv("artifacts/diagnostics/phase6/near_duplicate_pairs.csv")
suspects = read_csv("artifacts/diagnostics/phase6/suspect_examples.csv")
outliers = read_csv("artifacts/diagnostics/phase6/outlier_examples.csv")

print("Explainability families and explainers")
rows = []
for family, payload in sorted(xai_summary.items()):
    rows.append([
        family,
        payload.get("explainer_type"),
        payload.get("n_features"),
        payload.get("n_samples"),
    ])
print_table(["family", "explainer", "n_features", "n_samples"], rows)

print()
print("Label diagnostics summary")
for k in [
    "training_size",
    "quality_score",
    "label_issues",
    "near_duplicate_issues",
    "outlier_issues",
    "random_state",
    "git_sha",
]:
    print(f"- {k}: {label_diag.get(k)}")

print()
print("Raw diagnostics file row counts")
print(f"- near_duplicate_pairs.csv rows: {len(near_dupes)}")
print(f"- suspect_examples.csv rows: {len(suspects)}")
print(f"- outlier_examples.csv rows: {len(outliers)}")


Explainability families and explainers
family   | explainer       | n_features | n_samples
---------+-----------------+------------+----------
boosting | TreeExplainer   | 80         | 10       
logreg   | LinearExplainer | 268        | 10       
tree     | TreeExplainer   | 268        | 10       

Label diagnostics summary
- training_size: 80
- quality_score: 100.0
- label_issues: 0
- near_duplicate_issues: 80
- outlier_issues: 0
- random_state: 42
- git_sha: 2be7541

Raw diagnostics file row counts
- near_duplicate_pairs.csv rows: 80
- suspect_examples.csv rows: 0
- outlier_examples.csv rows: 0
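
Two details in this output limit how far the `quality_score` of 100.0 can be read. The diagnostics report `training_size: 80`, while the class distribution in section 2 totals 16726 rows, and `near_duplicate_issues` equals `training_size`. Whether that field counts pairs or examples is not stated in the summary. The sketch below turns both observations into explicit warnings; `diagnostics_warnings` is a hypothetical helper and the 5% coverage threshold is an arbitrary assumption.

```python
from __future__ import annotations

from typing import Any


def diagnostics_warnings(
    summary: dict[str, Any], dataset_rows: int, min_coverage: float = 0.05
) -> list[str]:
    """Return human-readable warnings about a label-diagnostics summary."""
    warnings = []
    size = summary.get("training_size") or 0
    if dataset_rows > 0 and size / dataset_rows < min_coverage:
        pct = 100 * size / dataset_rows
        warnings.append(f"diagnostics covered {size} of {dataset_rows} rows ({pct:.2f}%)")
    dupes = summary.get("near_duplicate_issues") or 0
    if size and dupes >= size:
        warnings.append(f"near_duplicate_issues ({dupes}) is not below training_size ({size})")
    return warnings


found = diagnostics_warnings({"training_size": 80, "near_duplicate_issues": 80}, 16726)
for line in found:
    print(line)
```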



## 5) Phase 7 Reporting Pipeline Status


In [6]:

report_run = read_json("artifacts/reports/phase7/run_summary.json", default={})
manifest = read_json("artifacts/reports/phase7/provenance_manifest.json", default={})
traceability = read_json("artifacts/reports/phase7/report_traceability.json", default={})

print("Phase 7 run summary")
for k in ["pipeline", "status", "started_at", "completed_at", "git_sha"]:
    print(f"- {k}: {report_run.get(k)}")

failure = report_run.get("failure") or {}
if failure:
    print()
    print("Failure detail (if any)")
    for k in ["stage", "status", "exit_code", "hint", "log"]:
        print(f"- {k}: {failure.get(k)}")

sections = manifest.get("sections", {})
analysis_count = len(sections.get("analyses", []))
model_count = len(sections.get("models", []))
explainability_count = len(sections.get("explainability", []))
diagnostics_count = len(sections.get("diagnostics", []))

print()
print("Manifest coverage")
print(f"- analyses tracked: {analysis_count}")
print(f"- models tracked: {model_count}")
print(f"- explainability tracked: {explainability_count}")
print(f"- diagnostics tracked: {diagnostics_count}")

trace_items = traceability.get("items", {})
print(f"- traceability item count: {len(trace_items)}")


Phase 7 run summary
- pipeline: phase7_research_reporting
- status: failed
- started_at: 2026-02-11T00:32:24Z
- completed_at: 2026-02-11T00:32:49Z
- git_sha: 678faa5

Failure detail (if any)
- stage: phase3_analysis
- status: failed
- exit_code: 1
- hint: Run missing NLP prerequisites and retry phase3 analysis.
- log: artifacts/reports/phase7/logs/01_phase3_analysis.log

Manifest coverage
- analyses tracked: 43
- models tracked: 28
- explainability tracked: 18
- diagnostics tracked: 5
- traceability item count: 84
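
The artifacts read so far record different commits: the Phase 7 run summary was produced at `678faa5`, the label diagnostics at `2be7541`, and the git timeline in section 1 starts at `bb3b843`. The `failed` status above may therefore describe an older tree than the one checked out. A minimal sketch for flagging this, with `stale_artifacts` as an illustrative name:

```python
from __future__ import annotations


def stale_artifacts(recorded: dict[str, str | None], head_sha: str) -> list[str]:
    """Return artifact names whose recorded git SHA does not match HEAD.

    SHAs are compared by prefix so short and full hashes can be mixed.
    """
    stale = []
    for name, sha in recorded.items():
        if not sha:
            stale.append(name)  # no provenance recorded at all
        elif not (head_sha.startswith(sha) or sha.startswith(head_sha)):
            stale.append(name)
    return stale


# SHAs copied from the notebook outputs; the dictionary keys are made up here.
recorded = {"phase7_run_summary": "678faa5", "phase6_label_diagnostics": "2be7541"}
stale = stale_artifacts(recorded, head_sha="bb3b843")
print(stale)
```

In the notebook, `head_sha` could be obtained with `git rev-parse --short HEAD` through the same `subprocess.check_output` pattern used for the git log.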



## 6) Phase 8 Optimization and Serving Decision


In [7]:

phase8_selected = sorted((ROOT / "artifacts" / "models").glob("phase8*/selected_model.json"))
rows = []
for path in phase8_selected:
    rel = path.relative_to(ROOT).as_posix()
    payload = read_json(rel, default={})
    metrics = payload.get("metrics", {})
    rows.append([
        path.parent.name,
        payload.get("best_model_family"),
        payload.get("winner_trial_id"),
        payload.get("winner_rule"),
        payload.get("accuracy_floor"),
        metrics.get("accuracy"),
        metrics.get("f1_macro"),
        payload.get("random_state"),
    ])

print("Optimization runs with selected_model.json")
print_table(
    [
        "run",
        "family",
        "winner_trial_id",
        "winner_rule",
        "acc_floor",
        "accuracy",
        "f1_macro",
        "seed",
    ],
    rows,
)

canonical = read_json("artifacts/models/phase8/selected_model.json", default={})
canon_metrics = canonical.get("metrics", {})
print()
print("Canonical serving selection (artifacts/models/phase8/selected_model.json)")
print(f"- best_model_family: {canonical.get('best_model_family')}")
print(f"- winner_trial_id: {canonical.get('winner_trial_id')}")
print(f"- winner_rule: {canonical.get('winner_rule')}")
print(f"- accuracy_floor: {canonical.get('accuracy_floor')}")
print(f"- accuracy: {canon_metrics.get('accuracy')}")
print(f"- f1_macro: {canon_metrics.get('f1_macro')}")

seed_rows = [r for r in rows if r[0].startswith("phase8_seed") and isinstance(r[6], float)]
if seed_rows:
    f1_values = [r[6] for r in seed_rows]
    print()
    print(f"Seed-run macro-F1 mean: {statistics.mean(f1_values):.4f}")
    print(f"Seed-run macro-F1 population stdev: {statistics.pstdev(f1_values):.4f}")


Optimization runs with selected_model.json
run                    | family | winner_trial_id | winner_rule             | acc_floor | accuracy | f1_macro | seed
-----------------------+--------+-----------------+-------------------------+-----------+----------+----------+-----
phase8                 | logreg | logreg_0059     | accuracy_floor_enforced | 0.6300    | 0.6309   | 0.5522   | 84  
phase8_1               | logreg | logreg_0049     | accuracy_floor_enforced | 0.6400    | 0.6411   | 0.5291   | 42  
phase8_2               | svm    | svm_0001        | accuracy_floor_relaxed  | 0.6400    | 0.6360   | 0.4924   | 42  
phase8_2b              | svm    | svm_0007        | accuracy_floor_enforced | 0.6400    | 0.6458   | 0.5039   | 42  
phase8_3               | sgd    | sgd_0020        | accuracy_floor_relaxed  | 0.6400    | 0.6369   | 0.5535   | 42  
phase8_nonlogreg_tight | sgd    | sgd_0020        | accuracy_floor_relaxed  | 0.6400    | 0.6369   | 0.5535   | 42  
phase8_pruned2       
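
The `winner_rule` column takes two values, and the rows marked `accuracy_floor_relaxed` sit below their floor (for example `phase8_2`, accuracy 0.6360 against a floor of 0.6400). One selection policy consistent with that output is sketched below: take the best macro-F1 among trials that meet the floor, and ignore the floor only when no trial meets it. This is an assumption about the policy, not the code in the optimization script, and the fallback ordering in particular is a guess.

```python
from __future__ import annotations

from typing import Any


def select_winner(trials: list[dict[str, Any]], accuracy_floor: float) -> dict[str, Any]:
    """Pick a trial by macro-F1 subject to an accuracy floor (assumed policy)."""
    if not trials:
        raise ValueError("no trials to select from")
    eligible = [t for t in trials if t["accuracy"] >= accuracy_floor]
    if eligible:
        rule, pool = "accuracy_floor_enforced", eligible
    else:
        # Assumed fallback: drop the floor rather than return nothing.
        rule, pool = "accuracy_floor_relaxed", trials
    # Ties on macro-F1 are broken by accuracy.
    winner = max(pool, key=lambda t: (t["f1_macro"], t["accuracy"]))
    return {**winner, "winner_rule": rule}


# Hypothetical trials for illustration only; not taken from the artifacts.
trials = [
    {"trial_id": "a", "accuracy": 0.645, "f1_macro": 0.50},
    {"trial_id": "b", "accuracy": 0.630, "f1_macro": 0.56},
]
enforced = select_winner(trials, accuracy_floor=0.64)
relaxed = select_winner(trials, accuracy_floor=0.65)
print(enforced["trial_id"], enforced["winner_rule"])
print(relaxed["trial_id"], relaxed["winner_rule"])
```

Under this policy a relaxed winner signals that the search space never reached the floor, which is worth checking before comparing it with an enforced winner from another run.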


## 7) Decision Log (What We Chose and Why)


In [8]:

def read_doc(name: str) -> str:
    """Return the text of docs/<name>, or an empty string if the file is absent."""
    path = ROOT / "docs" / name
    return path.read_text(encoding="utf-8") if path.exists() else ""


case_study_text = read_doc("case_study.md")
brief_text = read_doc("hiring_manager_brief.md")
opt_text = read_doc("model_optimization_guide.md")

canonical = read_json("artifacts/models/phase8/selected_model.json", default={})
canon_metrics = canonical.get("metrics", {})

decisions = [
    (
        "Script-first workflow",
        "All executable logic moved to scripts/src and phase outputs are contract-tested.",
        "Evidence: AGENTS.md + docs/analysis_workflow.md",
    ),
    (
        "Use question [SEP] answer representation",
        "Feature strategy ties question-answer context into one text stream for TF-IDF pipelines.",
        "Evidence: AGENTS.md and docs/architecture.md",
    ),
    (
        "Classical baseline emphasis first",
        "Interpretable, fast baselines established stable benchmark behavior before advanced modeling.",
        "Evidence: docs/case_study.md + docs/hiring_manager_brief.md",
    ),
    (
        "Optimize with macro-F1 + accuracy floor",
        "Selection policy protects minority-class quality while preventing unacceptable accuracy regression.",
        f"Evidence: docs/model_optimization_guide.md + phase8 selected_model.json (floor={canonical.get('accuracy_floor')}, f1={canon_metrics.get('f1_macro')})",
    ),
    (
        "Serve via canonical selected_model.json",
        "Runtime resolution uses a deterministic selected-model file for reproducible deployment.",
        "Evidence: docs/model_optimization_guide.md serving default section",
    ),
]

print("Decisions and rationale")
for i, (decision, rationale, evidence) in enumerate(decisions, start=1):
    print(f"{i}. {decision}")
    print(f"   - rationale: {rationale}")
    print(f"   - {evidence}")

print()
print("Cross-check: exact phrase presence in docs")
checks = [
    ("case_study_has_logreg_why", "Why Logistic Regression won" in case_study_text),
    ("brief_has_tradeoffs", "Engineering Decisions and Tradeoffs" in brief_text),
    ("opt_has_accuracy_floor", "Accuracy floor" in opt_text),
]
for key, passed in checks:
    print(f"- {key}: {passed}")


Decisions and rationale
1. Script-first workflow
   - rationale: All executable logic moved to scripts/src and phase outputs are contract-tested.
   - Evidence: AGENTS.md + docs/analysis_workflow.md
2. Use question [SEP] answer representation
   - rationale: Feature strategy ties question-answer context into one text stream for TF-IDF pipelines.
   - Evidence: AGENTS.md and docs/architecture.md
3. Classical baseline emphasis first
   - rationale: Interpretable, fast baselines established stable benchmark behavior before advanced modeling.
   - Evidence: docs/case_study.md + docs/hiring_manager_brief.md
4. Optimize with macro-F1 + accuracy floor
   - rationale: Selection policy protects minority-class quality while preventing unacceptable accuracy regression.
   - Evidence: docs/model_optimization_guide.md + phase8 selected_model.json (floor=0.63, f1=0.5521623974373844)
5. Serve via canonical selected_model.json
   - rationale: Runtime resolution uses a deterministic selected-model file
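
Serving from a single `selected_model.json` is only deterministic if that file is complete and internally consistent. The sketch below checks the keys this notebook reads from the file. `validate_selected_model` is a hypothetical helper, and whether the serving code performs a similar check is not shown in the notebook.

```python
from __future__ import annotations

from typing import Any

# Keys this notebook reads from selected_model.json.
REQUIRED_KEYS = ("best_model_family", "winner_trial_id", "winner_rule", "accuracy_floor", "metrics")
REQUIRED_METRICS = ("accuracy", "f1_macro")


def validate_selected_model(payload: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the payload looks usable."""
    problems = [f"missing key: {k}" for k in REQUIRED_KEYS if k not in payload]
    metrics = payload.get("metrics") or {}
    problems += [f"missing metric: {m}" for m in REQUIRED_METRICS if m not in metrics]
    floor, acc = payload.get("accuracy_floor"), metrics.get("accuracy")
    if (
        payload.get("winner_rule") == "accuracy_floor_enforced"
        and isinstance(floor, (int, float))
        and isinstance(acc, (int, float))
        and acc < floor
    ):
        problems.append(f"accuracy {acc} is below enforced floor {floor}")
    return problems


# Canonical values as printed in sections 6 and 8.
canonical_example = {
    "best_model_family": "logreg",
    "winner_trial_id": "logreg_0059",
    "winner_rule": "accuracy_floor_enforced",
    "accuracy_floor": 0.63,
    "metrics": {"accuracy": 0.6309025702331141, "f1_macro": 0.5521623974373844},
}
print(validate_selected_model(canonical_example))
```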


## 8) Final Recap


In [9]:

phase5_summary = read_json("artifacts/models/phase5/run_summary.json", default={})
canonical = read_json("artifacts/models/phase8/selected_model.json", default={})
report_run = read_json("artifacts/reports/phase7/run_summary.json", default={})
label_diag = read_json("artifacts/diagnostics/phase6/label_diagnostics_summary.json", default={})

phase5_best = None
if phase5_summary:
    phase5_best = max(
        phase5_summary.items(),
        key=lambda kv: kv[1].get("f1_macro", float("-inf")),
    )[0]

recap = {
    "project": "EvasionBench",
    "workflow": "script-first pipeline with artifact contracts",
    "phase5_best_classical_family": phase5_best,
    "phase8_canonical_family": canonical.get("best_model_family"),
    "phase8_canonical_trial": canonical.get("winner_trial_id"),
    "phase8_canonical_accuracy": canonical.get("metrics", {}).get("accuracy"),
    "phase8_canonical_f1_macro": canonical.get("metrics", {}).get("f1_macro"),
    "label_quality_score": label_diag.get("quality_score"),
    "phase7_report_status": report_run.get("status"),
}

print("Project recap summary")
for k, v in recap.items():
    print(f"- {k}: {v}")

print()
print("Notebook complete: this run reconstructed work, outcomes, and decisions from repository evidence.")


Project recap summary
- project: EvasionBench
- workflow: script-first pipeline with artifact contracts
- phase5_best_classical_family: logreg
- phase8_canonical_family: logreg
- phase8_canonical_trial: logreg_0059
- phase8_canonical_accuracy: 0.6309025702331141
- phase8_canonical_f1_macro: 0.5521623974373844
- label_quality_score: 100.0
- phase7_report_status: failed

Notebook complete: this run reconstructed work, outcomes, and decisions from repository evidence.
