v1.1.0 — Benchmark Suite & English Localization
This release ships the chapter-5 benchmark suite for the TFG thesis and finishes the full English localization of the strategy notebooks, scripts and evaluation artefacts. No model retraining, no breaking changes to runtime APIs — every benchmark consumes existing serialised artefacts under data/models/.
Highlights
Benchmark scripts (scripts/bench_*.py + scripts/bench/_common.py)
Four standalone .py benchmarks that load the production artefacts, run a single evaluation and emit both a Markdown table and a CSV ready for thesis tables / df.to_latex. All four use a shared Rich panel layout, a BenchResult dataclass and identical export_csv / export_markdown helpers.
- bench_pace_baselines.py: persistence vs team x circuit median vs production XGBoost delta on the 2025 holdout. XGB MAE matches the 0.4104 s anchor reported in MEMORY.md within +/-0.001 s.
- bench_whisper.py: Whisper turbo per-clip latency (P50 / P95 / mean). WER is intentionally out of scope and the notes column documents why (no paired audio<->text ground truth yet).
- bench_subagent_latency.py: six sub-agents (pace, tire, race_situation, pit_strategy, radio, rag) timed in isolation against a Suzuka 2025 NOR lap 21 fixture (Bahrain 2025 NOR lap 18 fallback). LLM-calling agents pick up the configured provider through .env autoload.
- bench_nlp_pipeline_cpu.py: sentiment + intent + NER pipeline on CPU and GPU (8 messages x N runs). Replicates N24's loaders inline so the bench is self-contained.
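As a rough illustration of the shared pattern described above (time a workload, summarise it as P50 / P95 / mean, then emit CSV and Markdown), here is a minimal sketch. The field names of BenchResult, the time_callable helper and the signatures of export_csv / export_markdown are assumptions made for this example; the real definitions in scripts/bench/_common.py may differ.

```python
import csv
import io
import statistics
import time
from dataclasses import asdict, dataclass, fields


@dataclass
class BenchResult:
    # Hypothetical fields; the real dataclass may carry different columns.
    name: str
    runs: int
    p50_ms: float
    p95_ms: float
    mean_ms: float
    notes: str = ""


def percentile(sorted_values, q):
    """Linear-interpolation percentile (q in [0, 100]) over a pre-sorted list."""
    if len(sorted_values) == 1:
        return sorted_values[0]
    pos = (len(sorted_values) - 1) * q / 100.0
    low = int(pos)
    high = min(low + 1, len(sorted_values) - 1)
    return sorted_values[low] + (sorted_values[high] - sorted_values[low]) * (pos - low)


def time_callable(name, fn, runs=20, notes=""):
    """Call fn() `runs` times and summarise wall-clock latency in milliseconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    return BenchResult(
        name=name,
        runs=runs,
        p50_ms=percentile(samples, 50),
        p95_ms=percentile(samples, 95),
        mean_ms=statistics.fmean(samples),
        notes=notes,
    )


def export_csv(results):
    """Serialise results to CSV text with one column per dataclass field."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=[f.name for f in fields(BenchResult)])
    writer.writeheader()
    for result in results:
        writer.writerow(asdict(result))
    return buf.getvalue()


def _cell(value):
    # Floats are rounded to three decimals for readability in the table.
    return f"{value:.3f}" if isinstance(value, float) else str(value)


def export_markdown(results):
    """Render results as a pipe-delimited Markdown table."""
    header = [f.name for f in fields(BenchResult)]
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for result in results:
        lines.append("| " + " | ".join(_cell(v) for v in asdict(result).values()) + " |")
    return "\n".join(lines)


# Stand-in workload; a real bench would call an agent or a model here.
result = time_callable("dummy_agent", lambda: sum(range(10_000)), runs=10, notes="stand-in workload")
print(export_markdown([result]))
```

Keeping the summary in one dataclass means the CSV and the Markdown table are derived from the same record, so the two exports cannot drift apart.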
Threshold sweeps + MC Dropout calibration (notebooks/agents/N33_thresholds_and_calibration.ipynb)
New sister notebook (no edits to N09 / N10 / N12 / N14 / N16) that produces every threshold and calibration figure referenced in chapter 5:
- Precision-recall sweeps for overtake (N12), safety car (N14) and undercut (N16) with the production threshold marked on the parametric curve.
- MC Dropout empirical coverage on the 20 284 tire-degradation sequences, reporting both the raw [P10, P90] dropout coverage and the residual-sigma calibrated coverage so the gap between epistemic and aleatoric uncertainty is visible.
- All four figures saved to documents/images/05_results/ at 300 DPI; tables exported as CSV (dot decimal) and Markdown (comma decimal).
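To make the coverage comparison above concrete, the sketch below computes empirical coverage for two interval types on synthetic data: the raw [P10, P90] band taken from the dropout passes, and a band built as mean +/- z * sigma, where sigma is the standard deviation of residuals. The synthetic data, the helper names and this particular reading of "residual-sigma calibrated" are assumptions for illustration, not the procedure used in N33.

```python
import random
import statistics


def empirical_coverage(y_true, lower, upper):
    """Fraction of targets that fall inside their [lower, upper] interval."""
    hits = sum(1 for y, lo, hi in zip(y_true, lower, upper) if lo <= y <= hi)
    return hits / len(y_true)


def p10_p90(samples):
    """10th and 90th percentiles of one sequence's stochastic forward passes."""
    deciles = statistics.quantiles(samples, n=10)  # 9 cut points: P10 ... P90
    return deciles[0], deciles[-1]


rng = random.Random(0)
n_sequences, n_passes = 2000, 50
Z_80 = 1.2816  # standard-normal quantile for a central 80 % interval

# Synthetic stand-in: a narrow dropout spread (0.05) around each prediction
# and a much larger irreducible noise (0.5) on the observed target.
predictions = [rng.gauss(0.0, 1.0) for _ in range(n_sequences)]
y_true = [p + rng.gauss(0.0, 0.5) for p in predictions]
mc_passes = [[p + rng.gauss(0.0, 0.05) for _ in range(n_passes)] for p in predictions]
mc_mean = [statistics.fmean(passes) for passes in mc_passes]

# Raw coverage: the interval comes directly from the dropout samples.
bounds = [p10_p90(passes) for passes in mc_passes]
raw_coverage = empirical_coverage(y_true, [b[0] for b in bounds], [b[1] for b in bounds])

# Calibrated coverage: residual sigma is fitted on the first half of the data
# and the interval mean +/- z * sigma is scored on the held-out second half.
half = n_sequences // 2
residuals = [y - m for y, m in zip(y_true[:half], mc_mean[:half])]
sigma = statistics.stdev(residuals)
lower = [m - Z_80 * sigma for m in mc_mean[half:]]
upper = [m + Z_80 * sigma for m in mc_mean[half:]]
calibrated_coverage = empirical_coverage(y_true[half:], lower, upper)

print(f"raw [P10, P90] coverage: {raw_coverage:.3f}")
print(f"residual-sigma coverage: {calibrated_coverage:.3f}")
```

In this toy setup the dropout band only reflects model (epistemic) spread, so it covers far fewer targets than its nominal 80 %, while the residual-based band also absorbs the observation (aleatoric) noise.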
Quantitative RAG benchmark (notebooks/agents/N30B_rag_benchmark.ipynb)
New quantitative companion to N30 (qualitative notebook untouched):
- 15-query ground-truth set in data/rag_eval/queries_v1.json covering tyre allocation, pit stops, safety car, flags / penalties and DRS, distributed across the 2023, 2024 and 2025 FIA Sporting Regulations PDFs.
- Three retriever configurations evaluated head-to-head: BGE-M3 1024d chunk 512 (production), MiniLM-L6-v2 384d chunk 512, BGE-M3 1024d chunk 256.
- data/rag_eval/results_v1.md carries the comparative table (Precision@1 / 3 / 5, Content P@5, MRR, P50, P95 latency) plus a transparency-first discussion that surfaces a known limitation of the production article-tagging regex.
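For readers unfamiliar with the ranking metrics in that table, the following sketch shows the standard definitions of Precision@k and MRR over a ranked list of retrieved chunk identifiers. The identifiers and the two-query example are hypothetical, and Content P@5 is left out because its exact definition lives in the notebook.

```python
def precision_at_k(retrieved, relevant, k):
    """Share of the top-k retrieved ids that are in the relevant set."""
    top_k = retrieved[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant) / k


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant id, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (retrieved, relevant) pairs, one per query."""
    return sum(reciprocal_rank(retrieved, relevant) for retrieved, relevant in runs) / len(runs)


# Hypothetical ranked results for two queries.
query_a = (["chunk_a", "chunk_b", "chunk_c", "chunk_d", "chunk_e"], {"chunk_b", "chunk_e"})
query_b = (["chunk_f", "chunk_g", "chunk_h", "chunk_i", "chunk_j"], {"chunk_f"})

for k in (1, 3, 5):
    print(f"P@{k} for query A:", precision_at_k(query_a[0], query_a[1], k))
print("MRR over both queries:", mean_reciprocal_rank([query_a, query_b]))
```

Precision@k rewards having many relevant chunks near the top, while MRR only looks at where the first relevant chunk appears, which is why the two can rank retriever configurations differently.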
Full English localization
Every benchmark surface that ships output is now English-only:
- N30B and N33 markdown narratives, code-cell strings, plot labels, axis titles, table headers and exported Markdown titles all in English.
- queries_v1.json user-facing query strings translated; ground-truth keywords and rationales stay as the verified PDF substrings.
- Hardcoded Spanish strings purged from scripts/bench_whisper.py (notes column), scripts/bench_subagent_latency.py (RAG question), scripts/bench/_common.py (docstrings) and scripts/download_fia_pdfs.py (CLI comments).
Layout changes
- Figures moved from imagenes/05_resultados/ (Spanish) to documents/images/05_results/ (English, under the thesis assets tree alongside documents/banner/).
- New data/eval/ directory with the 18 benchmark output files (CSV + MD).
- New data/rag_eval/ directory with the query JSON and the comparative report.
Tooling
- Added jiwer>=3.0.0 to pyproject.toml as a forward-looking dependency for a future WER follow-up; not imported by any current script.
- All bench scripts pass ruff check . and ruff format --check . (lint gate green on CI).
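For context on what the planned WER follow-up would measure, here is a dependency-free sketch of word error rate as word-level edit distance divided by the reference length. It is illustrative only: a real follow-up would presumably call jiwer's wer(reference, hypothesis) instead, and the radio phrases below are made up.

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref_words)


# Made-up transcript pair: one reference word is missing from the hypothesis.
print(word_error_rate("box this lap box", "box this lap"))
```

Computing this requires a reference transcript for every audio clip, which is the paired ground truth the Whisper benchmark notes as not yet available.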
Install
pip install https://github.com/VforVitorio/F1-StratLab/releases/download/v1.1.0/f1_strat_manager-1.1.0-py3-none-any.whl
or with uv:
uv pip install https://github.com/VforVitorio/F1-StratLab/releases/download/v1.1.0/f1_strat_manager-1.1.0-py3-none-any.whl
Console entry points (f1-strat, f1-sim, f1-arcade, f1-streamlit) and the four new bench_* developer scripts are wired identically to v1.0.0.
Compatibility
- Python 3.10 to 3.12 (CPython, declared in pyproject.toml).
- PyTorch wheels routed through the cu128 index on Windows / Linux and the CPU index on macOS (unchanged from v1.0.0).
- Existing serialised artefacts under data/models/ are consumed read-only; no migration step required.
What did not change
- No retraining of any production model.
- No edits to the strategy orchestrator (src/agents/strategy_orchestrator.py) or to N31.
- No edits to the development notebooks (N06, N09, N10, N12, N12B, N14, N15, N16, N18, N20-N29, N30, N31, N32, N34).
- Public agent APIs (run_*_from_state) unchanged.
Full changelog
See v1.0.0...v1.1.0 for the per-commit log (50 commits).