v1.1.0 — Benchmark Suite & English Localization

@VforVitorio released this 11 May 08:15
· 387 commits to main since this release

This release ships the chapter-5 benchmark suite for the TFG (final-year degree thesis) and completes the English localization of the strategy notebooks, scripts and evaluation artefacts. It involves no model retraining and no breaking changes to runtime APIs: every benchmark consumes existing serialised artefacts under data/models/.

Highlights

Benchmark scripts (scripts/bench_*.py + scripts/bench/_common.py)

Four standalone .py benchmarks that load the production artefacts, run a single evaluation and emit both a Markdown table and a CSV ready for thesis tables / df.to_latex. All four use a shared Rich panel layout, a BenchResult dataclass and identical export_csv / export_markdown helpers.

  • bench_pace_baselines.py — persistence vs team x circuit median vs production XGBoost delta on the 2025 holdout. XGB MAE matches the 0.4104 s anchor reported in MEMORY.md within +/-0.001 s.
  • bench_whisper.py — Whisper turbo per-clip latency (P50 / P95 / mean). WER is intentionally out of scope and the notes column documents why (no paired audio<->text ground truth yet).
  • bench_subagent_latency.py — six sub-agents (pace, tire, race_situation, pit_strategy, radio, rag) timed in isolation against a Suzuka 2025 NOR lap 21 fixture (Bahrain 2025 NOR lap 18 fallback). LLM-calling agents pick up the configured provider through .env autoload.
  • bench_nlp_pipeline_cpu.py — sentiment + intent + NER pipeline on CPU and GPU (8 messages x N runs). Replicates N24's loaders inline so the bench is self-contained.
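
The shared helpers themselves are not reproduced in these notes. As a rough sketch only, the block below shows how a latency benchmark in this style could be put together. The BenchResult field layout (name, metric, value, unit, notes), the nearest-rank percentile and the time_calls / summarise helpers are all assumptions made for illustration; the real code in scripts/bench/_common.py may look different.

```python
import csv
import io
import math
import time
from dataclasses import dataclass, fields


@dataclass
class BenchResult:
    # Hypothetical fields; the real dataclass in scripts/bench/_common.py may differ.
    name: str
    metric: str
    value: float
    unit: str
    notes: str = ""


def percentile(samples, q):
    """Nearest-rank percentile of a non-empty list, q in (0, 100]."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def time_calls(fn, n_runs):
    """Wall-clock latency of fn() in seconds, one sample per run."""
    samples = []
    for _ in range(n_runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return samples


def summarise(name, samples):
    """Turn raw latencies into P50 / P95 / mean rows."""
    stats = {
        "p50": percentile(samples, 50),
        "p95": percentile(samples, 95),
        "mean": sum(samples) / len(samples),
    }
    return [BenchResult(name, metric, value, "s") for metric, value in stats.items()]


def _row(result):
    """Format one result as a list of strings shared by both exporters."""
    return [result.name, result.metric, f"{result.value:.4f}", result.unit, result.notes]


def export_csv(results):
    """Render results as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow([f.name for f in fields(BenchResult)])
    for result in results:
        writer.writerow(_row(result))
    return buf.getvalue()


def export_markdown(results):
    """Render results as a Markdown pipe table."""
    header = [f.name for f in fields(BenchResult)]
    lines = ["| " + " | ".join(header) + " |", "|" + " --- |" * len(header)]
    for result in results:
        lines.append("| " + " | ".join(_row(result)) + " |")
    return "\n".join(lines)


if __name__ == "__main__":
    # Stand-in workload; a real bench would call a model or an agent here.
    rows = summarise("noop", time_calls(lambda: sum(range(1000)), n_runs=20))
    print(export_markdown(rows))
```

Keeping one row formatter behind both exporters is one way to guarantee that the CSV and the Markdown table never disagree on rounding.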

Threshold sweeps + MC Dropout calibration (notebooks/agents/N33_thresholds_and_calibration.ipynb)

New sister notebook (no edits to N09 / N10 / N12 / N14 / N16) that produces every threshold and calibration figure referenced in chapter 5:

  • Precision-recall sweeps for overtake (N12), safety car (N14) and undercut (N16) with the production threshold marked on the parametric curve.
  • MC Dropout empirical coverage on the 20 284 tire-degradation sequences, reporting both the raw [P10, P90] dropout coverage and the residual-sigma calibrated coverage so the gap between epistemic and aleatoric uncertainty is visible.
  • All four figures saved to documents/images/05_results/ at 300 DPI; tables exported as CSV (dot decimal) and Markdown (comma decimal).
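
To make the coverage comparison concrete: empirical coverage is the fraction of true targets that fall inside the predicted interval. The sketch below is not the notebook's code. It assumes each sequence comes with a list of stochastic forward-pass predictions (the MC Dropout samples) and a scalar target, and it reads "residual-sigma calibrated" as mean prediction plus or minus z times the standard deviation of held-out residuals under a Gaussian assumption, which is only one plausible interpretation. Function names, the constant Z_80 and the toy numbers are illustrative.

```python
import statistics

Z_80 = 1.2816  # standard normal quantile for a central 80 % interval (P10 to P90)


def dropout_interval(samples):
    """[P10, P90] of the MC Dropout samples for one sequence."""
    deciles = statistics.quantiles(samples, n=10, method="inclusive")
    return deciles[0], deciles[-1]


def sigma_interval(samples, residual_sigma):
    """Mean prediction +/- z * residual sigma (assumes Gaussian residuals)."""
    centre = statistics.fmean(samples)
    half_width = Z_80 * residual_sigma
    return centre - half_width, centre + half_width


def coverage(intervals, targets):
    """Fraction of targets that fall inside their interval."""
    hits = sum(lo <= y <= hi for (lo, hi), y in zip(intervals, targets))
    return hits / len(targets)


if __name__ == "__main__":
    # Toy data: four sequences, five dropout passes each.
    mc_samples = [
        [1.0, 1.1, 1.2, 1.3, 1.4],
        [2.0, 2.1, 2.2, 2.3, 2.4],
        [3.0, 3.1, 3.2, 3.3, 3.4],
        [4.0, 4.1, 4.2, 4.3, 4.4],
    ]
    targets = [1.2, 2.9, 3.3, 3.5]
    residual_sigma = 0.6  # assumed to come from a separate held-out split

    raw = coverage([dropout_interval(s) for s in mc_samples], targets)
    calibrated = coverage([sigma_interval(s, residual_sigma) for s in mc_samples], targets)
    print(f"raw dropout coverage: {raw:.2f}, sigma-calibrated coverage: {calibrated:.2f}")
```

In this toy case the dropout spread is much narrower than the actual errors, so the raw interval under-covers while the residual-based interval does not; that is the kind of gap the two coverage figures are meant to expose.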

Quantitative RAG benchmark (notebooks/agents/N30B_rag_benchmark.ipynb)

New quantitative companion to N30 (qualitative notebook untouched):

  • 15-query ground-truth set in data/rag_eval/queries_v1.json covering tyre allocation, pit stops, safety car, flags / penalties and DRS, distributed across the 2023, 2024 and 2025 FIA Sporting Regulations PDFs.
  • Three retriever configurations evaluated head-to-head: BGE-M3 1024d chunk 512 (production), MiniLM-L6-v2 384d chunk 512, BGE-M3 1024d chunk 256.
  • data/rag_eval/results_v1.md carries the comparative table (Precision@1 / 3 / 5, Content P@5, MRR, P50, P95 latency) plus a transparency-first discussion that surfaces a known limitation of the production article-tagging regex.
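
For readers less familiar with the ranking metrics in that table, a minimal sketch follows. It assumes each query has a set of relevant chunk IDs and that the retriever returns a ranked list of IDs. The function names and IDs are made up for illustration, and the notebook's own relevance rule (for example, keyword matching against PDF substrings) may differ.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Share of the top-k retrieved items that are relevant."""
    top_k = ranked_ids[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / k


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (ranked_ids, relevant_ids) pairs."""
    return sum(reciprocal_rank(ranked, rel) for ranked, rel in runs) / len(runs)


if __name__ == "__main__":
    # Two toy queries with hypothetical chunk IDs.
    runs = [
        (["a", "b", "c", "d", "e"], {"b", "e"}),
        (["f", "g", "h", "i", "j"], {"f"}),
    ]
    for ranked, relevant in runs:
        print([precision_at_k(ranked, relevant, k) for k in (1, 3, 5)])
    print(mean_reciprocal_rank(runs))
```

Precision@k rewards how many relevant chunks appear near the top, while MRR only looks at the position of the first relevant one, so the two can rank retriever configurations differently.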

Full English localization

Every benchmark surface that ships output is now English-only:

  • N30B and N33 markdown narratives, code-cell strings, plot labels, axis titles, table headers and exported Markdown titles are all in English.
  • queries_v1.json user-facing query strings translated; ground-truth keywords and rationales stay as the verified PDF substrings.
  • Hardcoded Spanish strings purged from scripts/bench_whisper.py (notes column), scripts/bench_subagent_latency.py (RAG question), scripts/bench/_common.py (docstrings) and scripts/download_fia_pdfs.py (CLI comments).

Layout changes

  • Figures moved from imagenes/05_resultados/ (Spanish) to documents/images/05_results/ (English, under the thesis assets tree alongside documents/banner/).
  • New data/eval/ directory with the 18 benchmark output files (CSV + MD).
  • New data/rag_eval/ directory with the query JSON and the comparative report.

Tooling

  • Added jiwer>=3.0.0 to pyproject.toml as a forward-looking dependency for a future WER follow-up; not imported by any current script.
  • All bench scripts pass ruff check . and ruff format --check . (lint gate green on CI).

Install

pip install https://github.com/VforVitorio/F1-StratLab/releases/download/v1.1.0/f1_strat_manager-1.1.0-py3-none-any.whl

or with uv:

uv pip install https://github.com/VforVitorio/F1-StratLab/releases/download/v1.1.0/f1_strat_manager-1.1.0-py3-none-any.whl

Console entry points (f1-strat, f1-sim, f1-arcade, f1-streamlit) are wired identically to v1.0.0. The four new bench_* files are developer scripts under scripts/.

Compatibility

  • Python 3.10 to 3.12 (CPython, declared in pyproject.toml).
  • PyTorch wheels routed through the cu128 index on Windows / Linux and the CPU index on macOS (unchanged from v1.0.0).
  • Existing serialised artefacts under data/models/ are consumed read-only; no migration step required.

What did not change

  • No retraining of any production model.
  • No edits to the strategy orchestrator (src/agents/strategy_orchestrator.py) or to N31.
  • No edits to the development notebooks (N06, N09, N10, N12, N12B, N14, N15, N16, N18, N20-N29, N30, N31, N32, N34).
  • Public agent APIs (run_*_from_state) unchanged.

Full changelog

See v1.0.0...v1.1.0 for the per-commit log (50 commits).