Summary

Closes out the full data engineering phase: from raw FastF1 telemetry to a
clean, feature-rich dataset ready to feed into the ML models.

Repo restructure

Previous notebooks and code moved to legacy/ to preserve the original work.
New structure built from scratch around the TFG architecture:
notebooks/data_engineering/, notebooks/strategy/, src/strategy/,
src/agents/, src/telemetry/, etc.

N01 — Download pipeline

Extended to support the 2025 season alongside 2023-2024. Fixed a FastF1 naming
inconsistency where Miami appears as Miami_Gardens in 2025; it is now aliased so
the canonical name stays consistent across all seasons. The same fix was applied
to Barcelona, which appeared as Spain in some race weekends.
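
A minimal sketch of how such an alias table could look. The names EVENT_ALIASES and canonical_event_name are hypothetical, not the actual N01 code, and "Monza" is only there to show that unknown names pass through unchanged:

```python
# Hypothetical alias table: maps names as they appear in some seasons or race
# weekends to the canonical name used across the dataset.
EVENT_ALIASES = {
    "Miami_Gardens": "Miami",
    "Spain": "Barcelona",
}


def canonical_event_name(raw_name: str) -> str:
    """Return the canonical event name, leaving unknown names unchanged."""
    return EVENT_ALIASES.get(raw_name, raw_name)


raw_names = ["Miami_Gardens", "Spain", "Monza"]
canonical = [canonical_event_name(name) for name in raw_names]
print(canonical)
```

Keeping the mapping in one place means every season goes through the same lookup, so a new inconsistency only needs one extra entry.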

N03 — Circuit clustering

K-Means with k=4 fitted on 2023-2024 data and serialized with joblib. Added a
2025 inference step that runs kmeans.predict() on the saved model without
refitting. Las Vegas had missing speed trap data in 2025, so those values were
imputed with the training means stored in the scaler rather than zeroed out,
which would have distorted the cluster assignment. The resulting clusters are
surprisingly accurate given the limited data available.
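
To illustrate why mean imputation is the safer choice, here is a small NumPy-only sketch. All numbers, the two feature columns and the two centroids are made up for the example (the real model uses k=4 and the saved scaler and K-Means objects); nearest_centroid only mimics the nearest-centroid assignment that K-Means prediction performs:

```python
import numpy as np

# Hypothetical training statistics, standing in for the mean_ and scale_
# attributes of a StandardScaler fitted on 2023-2024 circuit features.
# Assumed columns: [avg_speed_kmh, speed_trap_kmh]
train_mean = np.array([210.0, 320.0])
train_scale = np.array([15.0, 12.0])

# Hypothetical centroids in scaled feature space (two instead of four, for brevity).
centroids = np.array([[-1.0, -1.0], [1.0, 1.0]])


def standardize(row, mean, scale):
    """Apply the training-time standardization to one feature row."""
    return (row - mean) / scale


def nearest_centroid(scaled_row, centroids):
    """Index of the closest centroid by Euclidean distance."""
    return int(np.argmin(np.linalg.norm(centroids - scaled_row, axis=1)))


# A 2025 circuit with a missing speed trap reading.
row = np.array([225.0, np.nan])
missing = np.isnan(row)

mean_imputed = standardize(np.where(missing, train_mean, row), train_mean, train_scale)
zero_filled = standardize(np.where(missing, 0.0, row), train_mean, train_scale)

print(mean_imputed)  # the missing feature sits at 0 after scaling (neutral)
print(zero_filled)   # the missing feature becomes a large negative outlier
print(nearest_centroid(mean_imputed, centroids), nearest_centroid(zero_filled, centroids))
```

With the training mean, the missing feature contributes nothing to the distance and the assignment is driven by the features that were actually measured. With a zero, the outlier dominates the distance and the circuit lands in a different cluster.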

N04 — Feature engineering

Main pipeline. Takes the raw lap data and outputs a 48-column dataset across
~45k clean racing laps:

Fuel-corrected degradation (0.055 s/lap constant, from Pirelli literature)
Sequential lap features: previous lap time, deltas, trend (NaN on the first lap
  of each stint is intentionally left as-is for XGBoost's missing value branch)
Rolling 3-lap degradation rate via polyfit, clipped to ±2 s/lap
Race context: phase, laps remaining, track status
Circuit cluster merge from N03
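
The first three features above can be sketched roughly as follows. This is not the N04 code: the column names (driver, stint, lap_number, lap_time) are hypothetical, and both the direction of the fuel correction (adding back the time gained from fuel burn relative to lap 1) and the choice to fit the rolling slope on fuel-corrected times are assumptions:

```python
import numpy as np
import pandas as pd

FUEL_EFFECT_S_PER_LAP = 0.055  # constant quoted in the notes
DEG_CLIP_S_PER_LAP = 2.0       # clipping bound quoted in the notes
STINT_KEYS = ["driver", "stint"]  # hypothetical column names


def slope(window: np.ndarray) -> float:
    """Least-squares slope (s/lap) of the lap times in one rolling window."""
    x = np.arange(len(window))
    return float(np.polyfit(x, window, 1)[0])


def add_features(laps: pd.DataFrame) -> pd.DataFrame:
    out = laps.sort_values(STINT_KEYS + ["lap_number"]).reset_index(drop=True)

    # Assumed convention: add back the time gained from burning fuel so that
    # every lap is comparable to the fuel load at lap 1.
    out["fuel_corrected"] = out["lap_time"] + FUEL_EFFECT_S_PER_LAP * (out["lap_number"] - 1)

    # Previous lap and delta: NaN on the first lap of each stint, left as-is.
    out["prev_lap_time"] = out.groupby(STINT_KEYS)["fuel_corrected"].shift(1)
    out["lap_delta"] = out["fuel_corrected"] - out["prev_lap_time"]

    # Rolling 3-lap slope within each stint, clipped to the stated bound.
    rolling_slope = out.groupby(STINT_KEYS)["fuel_corrected"].transform(
        lambda s: s.rolling(3).apply(slope, raw=True)
    )
    out["deg_rate"] = rolling_slope.clip(-DEG_CLIP_S_PER_LAP, DEG_CLIP_S_PER_LAP)
    return out


# Made-up laps: one driver, two stints.
laps = pd.DataFrame({
    "driver": ["AAA"] * 6,
    "stint": [1, 1, 1, 1, 2, 2],
    "lap_number": [1, 2, 3, 4, 5, 6],
    "lap_time": [90.0, 90.045, 90.09, 90.135, 89.0, 89.2],
})
feats = add_features(laps)
print(feats[["stint", "lap_number", "fuel_corrected", "prev_lap_time", "deg_rate"]])
```

Grouping by stint before shifting and rolling keeps values from leaking across pit stops, which is what produces the NaN on the first lap of each stint.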

2025 runs through the same pipeline separately and is saved as a held-out test
set. It never touches the training data at any point.
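
A guard along these lines could make the separation explicit before training. The function name check_holdout and the season column are hypothetical; this is a sketch of the idea rather than something that exists in the repo:

```python
import pandas as pd


def check_holdout(train: pd.DataFrame, test: pd.DataFrame) -> None:
    """Raise if any season appears in both the training and the held-out set."""
    train_seasons = set(train["season"].tolist())
    test_seasons = set(test["season"].tolist())
    overlap = train_seasons & test_seasons
    if overlap:
        raise ValueError(f"Seasons present in both sets: {sorted(overlap)}")


# Made-up frames standing in for the real training and held-out datasets.
train = pd.DataFrame({"season": [2023, 2023, 2024], "lap_time": [90.1, 90.3, 89.8]})
test = pd.DataFrame({"season": [2025, 2025], "lap_time": [89.5, 89.9]})

check_holdout(train, test)  # passes silently when the seasons are disjoint
```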

Data storage

Dataset published to Hugging Face Hub (VforVitorio/f1-strategy-dataset).
Clone the repo and run scripts/download_data.py to pull everything locally.

Next: I can start developing the ML models, beginning with the XGBoost lap time predictor. Once it's done, a new release will be made.

v0.6 — Data Engineering Pipeline