v1.1.0 — Benchmark Suite & English Localization

This release ships the chapter-5 benchmark suite for the TFG thesis and finishes the full English localization of the strategy notebooks, scripts and evaluation artefacts. No model retraining, no breaking changes to runtime APIs — every benchmark consumes existing serialised artefacts under data/models/.

Highlights

Benchmark scripts (`scripts/bench_*.py` + `scripts/bench/_common.py`)

Four standalone .py benchmarks that load the production artefacts, run a single evaluation and emit both a Markdown table and a CSV ready for thesis tables / df.to_latex. All four use a shared Rich panel layout, a BenchResult dataclass and identical export_csv / export_markdown helpers.

bench_pace_baselines.py — persistence vs team x circuit median vs production XGBoost delta on the 2025 holdout. XGB MAE matches the 0.4104 s anchor reported in MEMORY.md within +/-0.001 s.
bench_whisper.py — Whisper turbo per-clip latency (P50 / P95 / mean). WER is intentionally out of scope and the notes column documents why (no paired audio<->text ground truth yet).
bench_subagent_latency.py — six sub-agents (pace, tire, race_situation, pit_strategy, radio, rag) timed in isolation against a Suzuka 2025 NOR lap 21 fixture (Bahrain 2025 NOR lap 18 fallback). LLM-calling agents pick up the configured provider through .env autoload.
bench_nlp_pipeline_cpu.py — sentiment + intent + NER pipeline on CPU and GPU (8 messages x N runs). Replicates N24's loaders inline so the bench is self-contained.
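
As a rough illustration of the shared plumbing described above, the sketch below times a callable, reduces the samples to P50 / P95 / mean, and exports the rows as CSV and Markdown. It is not the code in scripts/bench/_common.py: the BenchResult fields, the function signatures and the time_calls / summarize helpers are assumptions made for this example, and the Rich panel layout is left out.

```python
import csv
import io
import statistics
import time
from dataclasses import asdict, dataclass, fields


@dataclass
class BenchResult:
    # Hypothetical fields; the real dataclass in scripts/bench/_common.py may differ.
    name: str
    metric: str
    value: float
    unit: str
    notes: str = ""


def time_calls(fn, n_runs=20):
    """Call fn n_runs times and return per-call wall-clock latencies in ms."""
    latencies = []
    for _ in range(n_runs):
        start = time.perf_counter()
        fn()
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def summarize(name, latencies_ms, notes=""):
    """Reduce latencies (at least two samples) to P50 / P95 / mean rows."""
    # quantiles(n=100) yields 99 cut points: index 49 is P50, index 94 is P95.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    stats = {"p50": cuts[49], "p95": cuts[94], "mean": statistics.fmean(latencies_ms)}
    return [BenchResult(name, metric, round(value, 3), "ms", notes)
            for metric, value in stats.items()]


def export_csv(results):
    """Render results as CSV text with a header row."""
    buf = io.StringIO()
    columns = [f.name for f in fields(BenchResult)]
    writer = csv.DictWriter(buf, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    for result in results:
        writer.writerow(asdict(result))
    return buf.getvalue()


def export_markdown(results):
    """Render results as a pipe-delimited Markdown table."""
    columns = [f.name for f in fields(BenchResult)]
    lines = ["| " + " | ".join(columns) + " |",
             "| " + " | ".join("---" for _ in columns) + " |"]
    for result in results:
        row = asdict(result)
        lines.append("| " + " | ".join(str(row[c]) for c in columns) + " |")
    return "\n".join(lines) + "\n"
```

A call such as `export_markdown(summarize("demo", time_calls(lambda: sum(range(1000)))))` yields a three-row table. Keeping the reduction separate from the exporters means both output formats are built from the same rows.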

Threshold sweeps + MC Dropout calibration (`notebooks/agents/N33_thresholds_and_calibration.ipynb`)

New sister notebook (no edits to N09 / N10 / N12 / N14 / N16) that produces every threshold and calibration figure referenced in chapter 5:

Precision-recall sweeps for overtake (N12), safety car (N14) and undercut (N16) with the production threshold marked on the parametric curve.
MC Dropout empirical coverage on the 20 284 tire-degradation sequences, reporting both the raw [P10, P90] dropout coverage and the residual-sigma calibrated coverage so the gap between epistemic and aleatoric uncertainty is visible.
All four figures saved to documents/images/05_results/ at 300 DPI; tables exported as CSV (dot decimal) and Markdown (comma decimal).
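
The raw [P10, P90] coverage check mentioned above can be sketched with the standard library alone. This is a minimal illustration, not the code in N33: the function names and the assumption that each sequence comes with a flat list of MC Dropout samples are hypothetical.

```python
import statistics


def percentile_interval(samples, lo=10, hi=90):
    """Return the (P_lo, P_hi) band of one sequence's MC Dropout samples."""
    # quantiles(n=100) yields 99 cut points; cut point k sits at index k - 1.
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return cuts[lo - 1], cuts[hi - 1]


def empirical_coverage(y_true, intervals):
    """Fraction of targets that fall inside their predicted interval."""
    hits = sum(1 for y, (low, high) in zip(y_true, intervals) if low <= y <= high)
    return hits / len(y_true)
```

A well-calibrated [P10, P90] band would cover roughly 80% of the targets, so a raw coverage well below that suggests the dropout spread alone is too narrow. The same `empirical_coverage` function can score any other interval, for example one widened using the residual standard deviation.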

Quantitative RAG benchmark (`notebooks/agents/N30B_rag_benchmark.ipynb`)

New quantitative companion to N30 (qualitative notebook untouched):

15-query ground-truth set in data/rag_eval/queries_v1.json covering tyre allocation, pit stops, safety car, flags / penalties and DRS, distributed across the 2023, 2024 and 2025 FIA Sporting Regulations PDFs.
Three retriever configurations evaluated head-to-head: BGE-M3 1024d chunk 512 (production), MiniLM-L6-v2 384d chunk 512, BGE-M3 1024d chunk 256.
data/rag_eval/results_v1.md carries the comparative table (Precision@1 / 3 / 5, Content P@5, MRR, P50, P95 latency) plus a transparency-first discussion that surfaces a known limitation of the production article-tagging regex.
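
For readers unfamiliar with the ranking metrics in that table, Precision@k and MRR can be sketched as below. The function names and the representation of ground truth as a set of relevant chunk ids are assumptions made for this example. N30B may define relevance differently (for instance through keyword matches), and Content P@5 is not reproduced here.

```python
def precision_at_k(retrieved, relevant, k):
    """Share of the top-k retrieved ids that appear in the relevant set."""
    top_k = retrieved[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant) / k


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant id, or 0.0 when none is retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (retrieved, relevant) pairs, one per query."""
    return sum(reciprocal_rank(ret, rel) for ret, rel in runs) / len(runs)
```

With a ranked list `["a", "b", "c", "d", "e"]` and relevant set `{"b", "e"}`, Precision@5 is 2 / 5 and the reciprocal rank is 1 / 2, because the first relevant hit sits at rank 2.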

Full English localization

Every benchmark surface that ships output is now English-only:

N30B and N33 markdown narratives, code-cell strings, plot labels, axis titles, table headers and exported Markdown titles all in English.
queries_v1.json user-facing query strings translated; ground-truth keywords and rationales stay as the verified PDF substrings.
Hardcoded Spanish strings purged from scripts/bench_whisper.py (notes column), scripts/bench_subagent_latency.py (RAG question), scripts/bench/_common.py (docstrings) and scripts/download_fia_pdfs.py (CLI comments).

Layout changes

Figures moved from imagenes/05_resultados/ (Spanish) to documents/images/05_results/ (English, under the thesis assets tree alongside documents/banner/).
New data/eval/ directory with the 18 benchmark output files (CSV + MD).
New data/rag_eval/ directory with the query JSON and the comparative report.

Tooling

Added jiwer>=3.0.0 to pyproject.toml as a forward-looking dependency for a future WER follow-up; not imported by any current script.
All bench scripts pass ruff check . and ruff format --check . (lint gate green on CI).

Install

pip install https://github.com/VforVitorio/F1-StratLab/releases/download/v1.1.0/f1_strat_manager-1.1.0-py3-none-any.whl

or with uv:

uv pip install https://github.com/VforVitorio/F1-StratLab/releases/download/v1.1.0/f1_strat_manager-1.1.0-py3-none-any.whl

Console entry points (f1-strat, f1-sim, f1-arcade, f1-streamlit) and the four new bench_* developer scripts are wired identically to v1.0.0.

Compatibility

Python 3.10 to 3.12 (CPython, declared in pyproject.toml).
PyTorch wheels routed through the cu128 index on Windows / Linux and the CPU index on macOS (unchanged from v1.0.0).
Existing serialised artefacts under data/models/ are consumed read-only; no migration step required.

What did not change

No retraining of any production model.
No edits to the strategy orchestrator (src/agents/strategy_orchestrator.py) or to N31.
No edits to the development notebooks (N06, N09, N10, N12, N12B, N14, N15, N16, N18, N20-N29, N30, N31, N32, N34).
Public agent APIs (run_*_from_state) unchanged.

Full changelog

See v1.0.0...v1.1.0 for the per-commit log (50 commits).
