
v0.6 — Data Engineering Pipeline


@VforVitorio released this 12 Feb 17:08
· 998 commits to main since this release

Summary

Closes out the full data engineering phase, from raw FastF1 telemetry to a
clean, feature-rich dataset ready to feed into the ML models.

Repo restructure

Previous notebooks and code moved to legacy/ to preserve the original work.
New structure built from scratch around the TFG architecture:
notebooks/data_engineering/, notebooks/strategy/, src/strategy/,
src/agents/, src/telemetry/, etc.


N01 — Download pipeline

Extended to support the 2025 season alongside 2023-2024. Fixed a FastF1 naming
inconsistency where Miami appears as Miami_Gardens in 2025: the name is now aliased
so the canonical name stays consistent across all seasons. Same fix applied to
Barcelona, which appeared as Spain in some race weekends.
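
As a rough illustration of the aliasing described above, the sketch below maps the inconsistent event names back to one canonical name. The `EVENT_ALIASES` table and the `canonical_event_name` helper are hypothetical names chosen for this example, not the notebook's actual code; only the two alias pairs come from the notes above.

```python
# Hypothetical alias table: raw event name -> canonical name.
# Only these two pairs are mentioned in the release notes.
EVENT_ALIASES = {
    "Miami_Gardens": "Miami",
    "Spain": "Barcelona",
}


def canonical_event_name(raw_name: str) -> str:
    """Return the canonical event name, leaving unknown names untouched."""
    return EVENT_ALIASES.get(raw_name, raw_name)


raw_names = ["Miami_Gardens", "Spain", "Monza"]
canonical = [canonical_event_name(name) for name in raw_names]
print(canonical)
```

Keeping the mapping in one table means a new inconsistency in a later season only needs one extra entry.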

N03 — Circuit clustering

K-Means with k=4 fitted on 2023-2024 data and serialized with joblib. Added a
2025 inference step that runs kmeans.predict() on the saved model without
refitting. Las Vegas had missing speed trap data in 2025, so it was imputed with
the training means from the scaler rather than zeroed out, which would have
distorted the cluster assignment. The resulting clusters look surprisingly
sensible given the data I have.
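
The fit, save, and predict flow above could look roughly like the sketch below. It is a minimal example on synthetic data, not the notebook's code: the three feature columns, the use of a `StandardScaler`, the file name, and storing both objects in one dict are all assumptions made for illustration.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for per-circuit training features (hypothetical columns:
# mean speed, speed trap, share of the lap at full throttle).
rng = np.random.default_rng(0)
train_features = rng.normal(
    loc=[210.0, 320.0, 0.6], scale=[15.0, 12.0, 0.08], size=(24, 3)
)

# Fit once on the training seasons.
scaler = StandardScaler().fit(train_features)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(
    scaler.transform(train_features)
)

# Serialize and reload, as an inference notebook would.
with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "circuit_clusters.joblib"  # hypothetical file name
    joblib.dump({"scaler": scaler, "kmeans": kmeans}, model_path)
    saved = joblib.load(model_path)

# A new-season circuit with a missing speed trap value.
new_circuit = np.array([[205.0, np.nan, 0.55]])

# Fill the gap with the training mean stored on the scaler, not with zero.
filled = np.where(np.isnan(new_circuit), saved["scaler"].mean_, new_circuit)
scaled = saved["scaler"].transform(filled)

# predict() only assigns the nearest existing centroid; nothing is refitted.
cluster = int(saved["kmeans"].predict(scaled)[0])
print(cluster)
```

After standardization a mean-imputed feature sits at zero, so it does not pull the circuit toward any centroid along that axis. A raw zero would instead land many standard deviations away from every training circuit.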

N04 — Feature engineering

Main pipeline. Takes the raw lap data and outputs a 48-column dataset across
~45k clean racing laps:

  • Fuel-corrected degradation (0.055 s/lap constant, from Pirelli literature)
  • Sequential lap features: previous lap time, deltas, trend — NaN on first lap
    of each stint intentionally left as-is for XGBoost's missing value branch
  • Rolling 3-lap degradation rate via polyfit, clipped to ±2 s/lap
  • Race context: phase, laps remaining, track status
  • Circuit cluster merge from N03
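
The fuel correction, sequential features, and rolling degradation rate from the list above can be sketched as follows. This is an illustration under stated assumptions rather than the N04 implementation: the column names are hypothetical, and the correction formula (adding back 0.055 s for every lap of fuel burned since lap 1) is one plausible reading of "fuel-corrected" that the notes do not spell out.

```python
import numpy as np
import pandas as pd

FUEL_EFFECT_S_PER_LAP = 0.055  # constant quoted in the release notes
DEG_CLIP_S_PER_LAP = 2.0       # clip bound quoted in the release notes
STINT_KEYS = ["driver", "stint"]  # hypothetical column names


def _slope(window: np.ndarray) -> float:
    """Slope (s/lap) of a degree-1 polyfit over one rolling window."""
    return float(np.polyfit(np.arange(len(window)), window, 1)[0])


def add_lap_features(laps: pd.DataFrame) -> pd.DataFrame:
    out = laps.sort_values(STINT_KEYS + ["lap_number"]).copy()

    # ASSUMED formula: add back the time gained from fuel burned since lap 1,
    # so the remaining trend mostly reflects tyre degradation.
    out["fuel_corrected_s"] = out["lap_time_s"] + FUEL_EFFECT_S_PER_LAP * (
        out["lap_number"] - 1
    )

    by_stint = out.groupby(STINT_KEYS)["fuel_corrected_s"]

    # Shifting within a stint leaves NaN on the first lap of each stint.
    out["prev_lap_s"] = by_stint.shift(1)
    out["delta_prev_s"] = out["fuel_corrected_s"] - out["prev_lap_s"]

    # Rolling 3-lap slope per stint, clipped to +/- 2 s/lap.
    out["deg_rate_3lap"] = by_stint.transform(
        lambda s: s.rolling(3).apply(_slope, raw=True)
    ).clip(-DEG_CLIP_S_PER_LAP, DEG_CLIP_S_PER_LAP)
    return out


laps = pd.DataFrame(
    {
        "driver": ["VER"] * 7,
        "stint": [1, 1, 1, 1, 2, 2, 2],
        "lap_number": [1, 2, 3, 4, 5, 6, 7],
        "lap_time_s": [90.0, 90.1, 90.2, 90.3, 89.0, 92.0, 95.0],
    }
)
features = add_lap_features(laps)
print(features[["stint", "lap_number", "prev_lap_s", "deg_rate_3lap"]])
```

Grouping by stint before shifting and rolling keeps a tyre change from leaking the previous stint's lap times into the new one. The NaN values are left in place on purpose, matching the note about XGBoost's missing value handling.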

2025 runs through the same pipeline separately and is saved as a held-out test
set. It never touches the training data at any point.
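
A cheap guard can make the held-out rule explicit before any model sees the data. The sketch below is an assumed helper, not part of the published pipeline; `check_season_split` and its arguments are hypothetical names.

```python
from typing import Iterable

TRAIN_SEASONS = {2023, 2024}
HELD_OUT_SEASON = 2025


def check_season_split(
    train_seasons: Iterable[int], test_seasons: Iterable[int]
) -> None:
    """Raise ValueError if the train/test season split is violated."""
    leaked = set(train_seasons) - TRAIN_SEASONS
    if leaked:
        raise ValueError(f"unexpected seasons in training data: {sorted(leaked)}")
    stray = set(test_seasons) - {HELD_OUT_SEASON}
    if stray:
        raise ValueError(f"unexpected seasons in held-out data: {sorted(stray)}")


# Passes silently: training rows are 2023-2024 only, test rows are 2025 only.
check_season_split([2023, 2024, 2024], [2025, 2025])
```

With a pandas DataFrame, the season column of each split can be passed in directly.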

Data storage

Dataset published to Hugging Face Hub (VforVitorio/f1-strategy-dataset).
Clone the repo and run scripts/download_data.py to pull everything locally.


Next: I can start developing the ML models, starting with the XGBoost lap time predictor. A new release will follow once it is done.