v0.6 — Data Engineering Pipeline
Summary
Closes out the full data engineering phase. From raw FastF1 telemetry to a
clean, feature-rich dataset ready to feed into the ML models.
Repo restructure
Previous notebooks and code moved to legacy/ to preserve the original work.
New structure built from scratch around the TFG architecture:
notebooks/data_engineering/, notebooks/strategy/, src/strategy/,
src/agents/, src/telemetry/, etc.
N01 — Download pipeline
Extended to support the 2025 season alongside 2023-2024. Fixed a FastF1 naming
inconsistency where Miami appears as Miami_Gardens in 2025; it is now aliased so
the canonical name stays consistent across all seasons. The same fix was applied
to Barcelona, which appeared as Spain in some race weekends.
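A minimal sketch of the aliasing idea, assuming a plain lookup table applied to the event name before it is used in paths or joins. The names EVENT_ALIASES and canonical_event_name are hypothetical and not taken from the notebook:

```python
# Hypothetical alias table covering the two renames described above.
EVENT_ALIASES = {
    "Miami_Gardens": "Miami",   # 2025 naming of the Miami round
    "Spain": "Barcelona",       # seen on some race weekends
}


def canonical_event_name(raw_name: str) -> str:
    """Return the canonical circuit name; unknown names pass through unchanged."""
    return EVENT_ALIASES.get(raw_name, raw_name)


print(canonical_event_name("Miami_Gardens"))
print(canonical_event_name("Monza"))
```

Keeping the mapping in one place means any later FastF1 rename only needs a new dictionary entry.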
N03 — Circuit clustering
K-Means with k=4 fitted on 2023-2024 data and serialized with joblib. Added a
2025 inference step that runs kmeans.predict() on the saved model without
refitting. Las Vegas had missing speed trap data in 2025 — imputed with training
means from the scaler rather than zeroing out, which would've distorted the
cluster assignment. The resulting clusters look surprisingly sensible given the data I have.
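The inference step can be sketched roughly as follows. This assumes the scaler is a scikit-learn StandardScaler (so the training means are available as scaler.mean_) and uses synthetic features with made-up columns; in the notebook the fitted objects would come from joblib.load rather than being fitted inline:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 2023-2024 circuit features (columns are hypothetical).
rng = np.random.default_rng(0)
train_features = rng.normal(
    loc=[300.0, 0.6, 15.0], scale=[15.0, 0.1, 4.0], size=(40, 3)
)

# Fit once on training seasons only; these are the objects that would be serialized.
scaler = StandardScaler().fit(train_features)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(
    scaler.transform(train_features)
)


def predict_cluster(features_new: np.ndarray) -> np.ndarray:
    """Assign clusters to unseen circuits without refitting scaler or model."""
    filled = features_new.copy()
    # Replace each NaN with the training mean of its column. After scaling this
    # becomes 0, a neutral value, whereas a raw 0 would sit far from the data.
    nan_rows, nan_cols = np.where(np.isnan(filled))
    filled[nan_rows, nan_cols] = scaler.mean_[nan_cols]
    return kmeans.predict(scaler.transform(filled))


# One circuit with a missing first feature (e.g. a speed trap reading).
las_vegas_2025 = np.array([[np.nan, 0.7, 12.0]])
cluster = predict_cluster(las_vegas_2025)
```

Because only predict() and transform() are called, the 2025 data has no influence on the cluster centres or the scaling statistics.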
N04 — Feature engineering
Main pipeline. Takes the raw lap data and outputs a 48-column dataset across
~45k clean racing laps:
- Fuel-corrected degradation (0.055 s/lap constant, from Pirelli literature)
- Sequential lap features: previous lap time, deltas, trend. NaN on the first lap
of each stint is intentionally left as-is for XGBoost's missing value branch
- Rolling 3-lap degradation rate via polyfit, clipped to ±2 s/lap
- Race context: phase, laps remaining, track status
- Circuit cluster merge from N03
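The stint-level features in the list above could look something like this in pandas. It is a sketch under assumptions: the column names, the helper names, and the sign convention of the fuel correction (adding back 0.055 s for every lap already completed) are illustrative and may differ from the notebook:

```python
import numpy as np
import pandas as pd

FUEL_EFFECT_S_PER_LAP = 0.055  # constant quoted in the notes above


def slope(window: np.ndarray) -> float:
    """Least-squares slope (s/lap) of lap time across the window."""
    return np.polyfit(np.arange(len(window)), window, 1)[0]


def add_stint_features(laps: pd.DataFrame) -> pd.DataFrame:
    out = laps.sort_values(["Driver", "Stint", "LapNumber"]).reset_index(drop=True)
    # Assumed convention: add back the time gained from fuel burn-off so that
    # tyre degradation is not masked by the car getting lighter.
    out["FuelCorrectedTime"] = (
        out["LapTime"] + FUEL_EFFECT_S_PER_LAP * (out["LapNumber"] - 1)
    )
    grouped = out.groupby(["Driver", "Stint"])["FuelCorrectedTime"]
    # shift(1) within each stint leaves NaN on the first lap; it is not filled.
    out["PrevLapTime"] = grouped.shift(1)
    out["LapDelta"] = out["FuelCorrectedTime"] - out["PrevLapTime"]
    # Rolling 3-lap linear fit per stint, clipped to +/- 2 s/lap.
    out["DegRate3"] = grouped.transform(
        lambda s: s.rolling(3).apply(slope, raw=True)
    ).clip(-2.0, 2.0)
    return out


demo = pd.DataFrame({
    "Driver": ["VER"] * 7,
    "Stint": [1, 1, 1, 1, 2, 2, 2],
    "LapNumber": [1, 2, 3, 4, 5, 6, 7],
    "LapTime": [90.0, 90.1, 90.2, 90.3, 89.0, 95.0, 101.0],
})
features = add_stint_features(demo)
print(features[["Stint", "LapNumber", "PrevLapTime", "DegRate3"]])
```

Grouping by driver and stint keeps the lag and the rolling window from leaking across pit stops, which is why the first lap of each stint has no previous lap time and the first two laps have no degradation rate.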
2025 runs through the same pipeline separately and is saved as a held-out test
set. It never touches the training data at any point.
Data storage
Dataset published to Hugging Face Hub (VforVitorio/f1-strategy-dataset).
Clone the repo and run scripts/download_data.py to pull everything locally.
Next: I can start developing the ML models, beginning with the XGBoost lap time predictor. Once it's done, a new release will be made.