***
***
# F1 WINNER PREDICTIONS USING THE FASTF1 API

***
***
<br>

- ### This Notebook is intended to be a more verbose version of the .py files included in this repository, serving explanatory and educational purposes.

    > ##### _Please take a look at the README file before running it locally, to make sure you have all the libraries installed_.
***

- #### __Disclaimer:__ The vast majority of Python functionality in this Notebook is abstracted into and loaded from individual .py files that contain functions for downloading data from fastf1, manipulating dataframes, and making predictions.
***

- #### Make sure that all fastf1 loaded data stays cached in your local environment

In [1]:
import fastf1
my_path = r'C:\Users\apost\miniconda3\envs\fastF1_cache'
fastf1.Cache.enable_cache(my_path)
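
- ##### A minimal sketch, assuming you want a cache folder that works on any machine instead of a hard-coded Windows path. fastf1 expects the cache directory to exist already, so the folder is created first. The helper name `ensure_cache_dir` and the folder name `fastf1_cache` are hypothetical and not part of this repository.

```python
from pathlib import Path


def ensure_cache_dir(path):
    """Create the cache folder (and any parents) if missing, then return it."""
    cache_dir = Path(path).expanduser()
    cache_dir.mkdir(parents=True, exist_ok=True)
    return cache_dir


# Hypothetical location: a folder next to the notebook.
cache_dir = ensure_cache_dir("fastf1_cache")

try:
    import fastf1
    fastf1.Cache.enable_cache(str(cache_dir))
except ImportError:
    # fastf1 is not installed here; the folder is still ready for later use.
    pass
```

Creating the folder up front avoids an error on a fresh clone, where the cache directory has not been made yet.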

***
## STEP 1: GETTING THE DATA
***
<br>

- #### Download the data from the past three seasons (2022-2024), plus the current one (2025).
- #### __Disclaimer:__ _The .py files only support making predictions for the 2025 season for now. Functionality for adding future seasons will be added later._

In [2]:
# Import the local function that is tasked with downloading data for whole seasons:
from f1_downloader import get_season
# Download seasons 2022-2025
stats_2022 = get_season(2022)
stats_2023 = get_season(2023)
stats_2024 = get_season(2024)
stats_2025 = get_season(2025)

core           INFO 	Loading data for Bahrain Grand Prix - Race [v3.6.1]
req            INFO 	Using cached data for session_info
req            INFO 	Using cached data for driver_info
req            INFO 	Using cached data for session_status_data
req            INFO 	Using cached data for lap_count
req            INFO 	Using cached data for track_status_data
req            INFO 	Using cached data for _extended_timing_data
req            INFO 	Using cached data for timing_app_data
core           INFO 	Processing timing data...
core           INFO 	Finished loading data for 20 drivers: ['16', '55', '44', '63', '20', '77', '31', '22', '14', '24', '47', '18', '23', '3', '4', '6', '27', '11', '1', '10']
core           INFO 	Loading data for Saudi Arabian Grand Prix - Race [v3.6.1]
req            INFO 	Using cached data for session_info
req            INFO 	Using cached data for driver_info
req            INFO 	Using cached data for session_status_data
req            INFO 	Using cached data 

***
## STEP 2: DATA PREPARATION
***
<br>

- #### Next, we'll use Pandas to concatenate our data into one big DataFrame that we'll work with

In [3]:
import pandas as pd
all_stats = pd.concat(
    [stats_2022, stats_2023, stats_2024, stats_2025],
    ignore_index = True
)
all_stats.head(3)

|   | Driver | lapsCompleted | Team | CircuitName | avgLapTime_s | stdLapTime_s | GridPosition | Position | isDNF | raceID | Year |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ALB | 57.0 | Williams | Bahrain Grand Prix | 103.640632 | 8.808555 | 14.0 | 13.0 | 0 | 1 | 2022 |
| 1 | ALO | 57.0 | Alpine | Bahrain Grand Prix | 103.087263 | 8.36173 | 8.0 | 9.0 | 0 | 1 | 2022 |
| 2 | BOT | 57.0 | Alfa Romeo | Bahrain Grand Prix | 102.977246 | 9.828384 | 6.0 | 6.0 | 0 | 1 | 2022 |


In [4]:
all_stats[-3:]

|   | Driver | lapsCompleted | Team | CircuitName | avgLapTime_s | stdLapTime_s | GridPosition | Position | isDNF | raceID | Year |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1710 | STR | 61.0 | Aston Martin | Singapore Grand Prix | 99.128803 | 3.131568 | 15.0 | 13.0 | 0 | 18 | 2025 |
| 1711 | TSU | 61.0 | Red Bull Racing | Singapore Grand Prix | 99.043213 | 3.473563 | 13.0 | 12.0 | 0 | 18 | 2025 |
| 1712 | VER | 62.0 | Red Bull Racing | Singapore Grand Prix | 97.222532 | 2.553562 | 2.0 | 2.0 | 0 | 18 | 2025 |


- ##### Our get_season function does not download everything from fastf1. The links below document everything that fastf1's native get_session call returns:
  ```
  https://docs.fastf1.dev/fastf1.html#fastf1.get_session
  https://docs.fastf1.dev/core.html#fastf1.core.Laps
  https://docs.fastf1.dev/core.html#fastf1.core.SessionResults
  ```

- ##### To calculate avgLapTime_s and stdLapTime_s, lap times are converted to seconds, because the original fastf1 column holds Pandas Timedelta objects:

  ```
  LapTime (pandas.Timedelta): Recorded lap time.
  ```
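
- ##### A minimal sketch of that conversion, assuming a laps table with 'Driver' and 'LapTime' columns. The lap times below are made up, and the real get_season function may compute these columns differently.

```python
import pandas as pd

# Toy laps table; the values are invented for illustration.
laps = pd.DataFrame({
    "Driver": ["AAA", "AAA", "AAA", "BBB", "BBB", "BBB"],
    "LapTime": pd.to_timedelta(
        ["00:01:37", "00:01:38", "00:01:39", "00:01:40", "00:01:40.5", "00:01:41"]
    ),
})

# Timedelta -> float seconds, so mean and std become plain numbers.
laps["LapTime_s"] = laps["LapTime"].dt.total_seconds()

# One row per driver with the average and the (sample) standard deviation.
lap_stats = (
    laps.groupby("Driver")["LapTime_s"]
    .agg(avgLapTime_s="mean", stdLapTime_s="std")
    .reset_index()
)
print(lap_stats)
```

Working in float seconds keeps the feature columns numeric, which is what the model needs later on.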

- #### We sort our data chronologically and reset the DataFrame index

In [5]:
all_stats = all_stats.sort_values(by = ['Year', 'raceID']).reset_index(drop = True)

- #### Our Model's goal will be to predict the winner of the next Grand Prix. We'll use the 'Position' column to create a binary variable that answers one question: did the driver in this row win this race, or not?

In [6]:
all_stats['Winner'] = (all_stats['Position'] == 1).astype(int)
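
- ##### A quick sanity check for the new column, sketched on made-up data: every race should end up with exactly one winner. The driver codes and positions below are placeholders, not real results.

```python
import pandas as pd

# Two toy races with three placeholder drivers each.
toy = pd.DataFrame({
    "Year": [2025] * 6,
    "raceID": [1, 1, 1, 2, 2, 2],
    "Driver": ["AAA", "BBB", "CCC", "AAA", "BBB", "CCC"],
    "Position": [1.0, 2.0, 3.0, 2.0, 1.0, 3.0],
})

toy["Winner"] = (toy["Position"] == 1).astype(int)

# Count the winners in each race; anything other than 1 signals a data problem.
winners_per_race = toy.groupby(["Year", "raceID"])["Winner"].sum()
assert (winners_per_race == 1).all()
```

The same groupby can be run on all_stats to spot races where the 'Position' column is missing or duplicated.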

- #### Next, we'll need to create some historical features that capture the pace at which a Driver or a Team improves (or not). We'll use Pandas' expanding and rolling averages to calculate the overall (expanding) or recent (rolling) average of features like the lap times, grid position, final position, and the tendency to DNF, grouped by driver and/or team.

In [7]:
from f1_train_data import collect_historical_data
collect_historical_data(all_stats) # Modifies all_stats in place and returns None, so don't assign the result to a variable
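
- ##### A minimal sketch of expanding and rolling averages per driver, on made-up data. The column names `exp_avg_position` and `roll3_avg_position`, the window of 3, and the one-race shift are assumptions for illustration; collect_historical_data may build its features differently.

```python
import pandas as pd

# Five toy races for two placeholder drivers.
races = pd.DataFrame({
    "Driver": ["AAA"] * 5 + ["BBB"] * 5,
    "Year": [2025] * 10,
    "raceID": [1, 2, 3, 4, 5] * 2,
    "Position": [1.0, 2.0, 1.0, 4.0, 3.0, 3.0, 1.0, 2.0, 2.0, 5.0],
}).sort_values(["Year", "raceID"]).reset_index(drop=True)

# Shift by one race per driver so each row only sees earlier results.
prior = races.groupby("Driver")["Position"].shift(1)

# Expanding mean: average over all previous races of that driver.
races["exp_avg_position"] = prior.groupby(races["Driver"]).transform(
    lambda s: s.expanding().mean()
)

# Rolling mean: average over the last (up to) three previous races.
races["roll3_avg_position"] = prior.groupby(races["Driver"]).transform(
    lambda s: s.rolling(window=3, min_periods=1).mean()
)
print(races[races["Driver"] == "AAA"])
```

Shifting before averaging means the current race's result never feeds its own features, which is the reason the sketch adds it. A driver's first race has no history, so its features are NaN.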

***
## STEP 3: CREATE PREDICTION/FUTURE DATA
***
<br>

In [8]:
from f1_future_data import get_next_race
next_race = get_next_race(all_stats)

- ##### The get_next_race function extracts the most recent race from our DataFrame and refills the appropriate fields with Null or zero values. We then need to pass the full stats (all_stats + next_race) through collect_historical_data again, so that the next_race rows get populated with historical features. It's a redundant step, but it helps me automate the creation of the next_race object.

In [9]:
full_df = pd.concat([all_stats, next_race], ignore_index = True)
collect_historical_data(full_df)

# Remove any Null values
from f1_train_data import drop_na
drop_na(full_df)
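
- ##### A rough sketch of the placeholder idea, assuming the last race is copied, its outcome fields are blanked, and the rows are flagged with 'isPredictionData'. The helper name `make_placeholder_race` and the choice of blanked columns are hypothetical; get_next_race may work differently.

```python
import numpy as np
import pandas as pd


def make_placeholder_race(history, outcome_cols=("Position", "lapsCompleted")):
    """Copy the latest race and blank out everything that is not known yet."""
    last_year = history["Year"].max()
    last_race = history.loc[history["Year"] == last_year, "raceID"].max()
    is_last = (history["Year"] == last_year) & (history["raceID"] == last_race)
    template = history[is_last].copy()
    template[list(outcome_cols)] = np.nan   # results are unknown before the race
    template["raceID"] = last_race + 1      # the race that comes next
    template["isPredictionData"] = 1        # flag the rows to predict
    return template.reset_index(drop=True)


# Toy history with placeholder drivers and made-up results.
history = pd.DataFrame({
    "Driver": ["AAA", "BBB"] * 3,
    "Year": [2024, 2024, 2025, 2025, 2025, 2025],
    "raceID": [1, 1, 1, 1, 2, 2],
    "Position": [1.0, 2.0, 2.0, 1.0, 1.0, 2.0],
    "lapsCompleted": [57.0] * 6,
})
placeholder = make_placeholder_race(history)
print(placeholder)
```

Copying the last race keeps the current driver and team line-up, so only the outcome columns need to be cleared.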

***
## STEP 4: SPLIT THE DATA
***
<br>

In [10]:
# Define the columns that are going to be used for training and predictions
from f1_future_data import pred_cols
X_cols = pred_cols() # Custom function that collects the appropriate columns for efficient training
X_future = full_df[full_df['isPredictionData'] == 1][X_cols]
X_train = full_df[full_df['isPredictionData'] != 1][X_cols]
y_train = full_df[full_df['isPredictionData'] != 1]['Winner']

- ##### Since columns like the driver name, team name, and raceID are of no use for training, they're dropped from our data. But we'll keep them separate for identification purposes (we don't just want to know the row of the winning driver, but also their name!)

In [11]:
ID_cols = ['Driver', 'Year', 'raceID']
ids = full_df[full_df['isPredictionData'] == 1][ID_cols] # future data/race ids
train_ids = full_df[full_df['isPredictionData'] != 1][ID_cols] # historical data ids
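
- ##### A small sketch of why the separate ID frames still line up with the feature frames: both are cut with the same boolean mask, so they share the same index. The toy frame and the single 'feat' column are made up for illustration.

```python
import pandas as pd

toy_full = pd.DataFrame({
    "Driver": ["AAA", "BBB", "AAA", "BBB"],
    "Year": [2025] * 4,
    "raceID": [1, 1, 2, 2],
    "feat": [0.1, 0.2, 0.3, 0.4],
    "isPredictionData": [0, 0, 1, 1],
})

# Build the mask once and reuse it for features and IDs.
is_future = toy_full["isPredictionData"] == 1

X_future_demo = toy_full.loc[is_future, ["feat"]]
ids_demo = toy_full.loc[is_future, ["Driver", "Year", "raceID"]]

# Same mask, same index: row i of the features belongs to row i of the IDs.
assert X_future_demo.index.equals(ids_demo.index)
```

Computing the mask once also avoids repeating the same filter expression for every frame.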

***
## STEP 5: LOAD THE MODEL. MAKE PREDICTIONS. EVALUATE.
***
<br>

In [12]:
from f1_predictor import predict_winner
results = predict_winner(X_train, y_train, X_future, ids)
print(results)

   Driver Probability to win
0     ANT              8.49%
1     VER              7.51%
2     RUS              6.76%
3     PIA              2.78%
4     LEC              0.82%
5     SAI              0.54%
6     HAM              0.43%
7     TSU              0.35%
8     HAD              0.32%
9     NOR               0.3%
10    ALB              0.18%
11    LAW              0.17%
12    GAS              0.15%
13    ALO              0.15%
14    BEA              0.14%
15    COL              0.14%
16    HUL              0.14%
17    BOR              0.14%
18    OCO              0.14%
19    STR              0.14%
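
- ##### A sketch of how per-driver probabilities could be turned into a ranked table like the one above. The helper name `rank_drivers` and the input probabilities are made up; predict_winner may format its output differently.

```python
import numpy as np
import pandas as pd


def rank_drivers(ids, win_proba):
    """Pair each driver with a win probability and sort from most to least likely."""
    table = pd.DataFrame({
        "Driver": ids["Driver"].to_numpy(),
        "proba": np.asarray(win_proba, dtype=float),
    })
    table = table.sort_values("proba", ascending=False).reset_index(drop=True)
    # Show the probability as a percentage string with two decimals at most.
    table["Probability to win"] = (table["proba"] * 100).round(2).astype(str) + "%"
    return table[["Driver", "Probability to win"]]


# Placeholder drivers with invented probabilities.
demo_ids = pd.DataFrame({"Driver": ["AAA", "BBB", "CCC"]})
ranking = rank_drivers(demo_ids, [0.003, 0.0849, 0.0751])
print(ranking)
```

Each probability is scored independently per driver, so the column is not expected to sum to 100%.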


- ##### Our model loads an XGB Classifier with a few predefined parameters, which are the result of testing, trial, and error. These may need to change later.
- __These parameters are in plain text:__
  ```
  HARD_CODED_PARAMS = {'gamma': np.float64(0.05),
                       'learning_rate': np.float64(0.027882270922132885),
                       'max_depth': 8,
                       'min_child_weight': 5,
                       'n_estimators': 294,
                       'scale_pos_weight': np.float64(16.78048780487805)
                       }
  ```

- ##### These parameters may require additional tuning using Sklearn's RandomizedSearchCV, Scipy's stats module, and Sklearn's classification_report

   > _(See 'STEP 6')_

In [13]:
# Custom parameters must be passed through the 'params' argument, e.g. params=custom_params
import inspect
inspect.signature(predict_winner).parameters

mappingproxy({'history_data': <Parameter "history_data">,
              'history_results': <Parameter "history_results">,
              'next_race_data': <Parameter "next_race_data">,
              'ids': <Parameter "ids">,
              'params': <Parameter "params={'gamma': np.float64(0.05), 'learning_rate': np.float64(0.027882270922132885), 'max_depth': 8, 'min_child_weight': 5, 'n_estimators': 294, 'scale_pos_weight': np.float64(16.78048780487805)}">})
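
- ##### A minimal sketch of building a custom parameter set by overriding only a few of the defaults shown in the signature. The helper name `with_overrides` and the override values are hypothetical; the default values are copied from the signature above.

```python
# Defaults copied from the predict_winner signature shown above.
DEFAULT_PARAMS = {
    "gamma": 0.05,
    "learning_rate": 0.027882270922132885,
    "max_depth": 8,
    "min_child_weight": 5,
    "n_estimators": 294,
    "scale_pos_weight": 16.78048780487805,
}


def with_overrides(defaults, **overrides):
    """Return a new dict with the given keys replaced; reject unknown keys."""
    unknown = set(overrides) - set(defaults)
    if unknown:
        raise KeyError(f"Unknown parameter(s): {sorted(unknown)}")
    return {**defaults, **overrides}


# Arbitrary example values, chosen only to show the mechanism.
custom = with_overrides(DEFAULT_PARAMS, max_depth=6, n_estimators=400)
print(custom)
```

The result can then be passed as `params = custom`. Building a new dictionary leaves the defaults untouched, and the key check catches typos in parameter names early.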

- #### Evaluate our model. We will use all 2025 (current season) results as test data

In [14]:
from f1_predictor import class_report
# Do NOT use Sklearn's classification_report. This class_report function is customized to drop the next race from the data
# and evaluate using seasons 2022-2024 as the training set and season 2025 (so far) as the test set
report = class_report(X_train, y_train, train_ids)
print(report)

              precision    recall  f1-score   support

           0       0.95      0.97      0.96       241
           1       0.42      0.28      0.33        18

    accuracy                           0.92       259
   macro avg       0.68      0.62      0.65       259
weighted avg       0.91      0.92      0.92       259
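
- ##### A sketch of a season-based split like the one described above: earlier seasons for training, the current season for testing. The helper name `split_by_season` and the toy data are assumptions; class_report and get_eval_sets may implement the split differently.

```python
import pandas as pd


def split_by_season(X, y, ids, test_year):
    """Hold out one season as the test set and train on everything else."""
    is_test = ids["Year"] == test_year
    return (X[~is_test], y[~is_test]), (X[is_test], y[is_test])


# Toy data: one made-up feature, four seasons, two rows per season.
ids_toy = pd.DataFrame({"Year": [2022, 2022, 2023, 2023, 2024, 2024, 2025, 2025]})
X_toy = pd.DataFrame({"feat": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]})
y_toy = pd.Series([1, 0, 0, 1, 1, 0, 0, 1])

(X_tr, y_tr), (X_te, y_te) = split_by_season(X_toy, y_toy, ids_toy, test_year=2025)
print(len(X_tr), len(X_te))
```

Splitting by season rather than at random keeps future races out of the training data, which matters for features built from race history.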



***
## STEP 6: DANGER ZONE!!!
- #### _Tweak our XGB Classifier. Make new classification reports. Do NOT use vanilla classification_report, use the custom function above._
***
<br>

In [15]:
from f1_predictor import get_eval_sets
from sklearn.model_selection import RandomizedSearchCV
import scipy.stats as stats
import xgboost as xgb

(X_train_eval, y_train_eval), (X_test_eval, y_test_eval) = get_eval_sets(X_train, y_train, train_ids)

neg_count = sum(y_train_eval == 0)
pos_count = sum(y_train_eval == 1)
scale_weight = neg_count / pos_count

param_dist = {
    'learning_rate': stats.uniform(loc = 0.01, scale = 0.1),
    'n_estimators': stats.randint(100, 1000),
    'scale_pos_weight': [scale_weight, scale_weight * 1.1, scale_weight * 0.9],
    'max_depth': stats.randint(1, 10),
    'min_child_weight': stats.randint(3, 9),
    'gamma': stats.uniform(loc = 0, scale = 0.1)
}

random_search = RandomizedSearchCV(
    estimator = xgb.XGBClassifier(objective='binary:logistic', eval_metric = 'logloss', seed = 42),
    n_iter = 100,
    param_distributions = param_dist,
    cv = 5,
    scoring = 'roc_auc',
    verbose = 1,
    random_state = 42
)

random_search.fit(X_train_eval, y_train_eval)
custom_params = random_search.best_params_
custom_params

Fitting 5 folds for each of 100 candidates, totalling 500 fits


{'gamma': np.float64(0.017336465350777208),
 'learning_rate': np.float64(0.049106060757324085),
 'max_depth': 2,
 'min_child_weight': 6,
 'n_estimators': 101,
 'scale_pos_weight': 18.316279069767443}
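
- ##### A short sketch of what the search distributions above actually cover. In Scipy, `uniform(loc, scale)` samples from `loc` to `loc + scale`, and `randint(low, high)` excludes `high`. The sample size and seed are arbitrary.

```python
import scipy.stats as stats

# Same distributions as in the search space above.
lr_dist = stats.uniform(loc=0.01, scale=0.1)   # 0.01 up to 0.11
n_est_dist = stats.randint(100, 1000)          # integers 100 to 999

lr_samples = lr_dist.rvs(size=1000, random_state=0)
n_est_samples = n_est_dist.rvs(size=1000, random_state=0)

print(lr_samples.min(), lr_samples.max())
print(n_est_samples.min(), n_est_samples.max())
```

Checking the ranges this way helps when widening or narrowing the search, since the second argument of `uniform` is a width and not an upper bound.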

- ##### Here we adjust the values by hand to improve precision, recall and f1-score. This works only for the current version of the dataset and is not guaranteed to be optimal for future predictions.

In [16]:
custom_params['gamma'] = 0.05
custom_params['learning_rate'] = 0.02
custom_params['max_depth'] = 9
custom_params['min_child_weight'] = 8
custom_params['n_estimators'] = 500
custom_params['scale_pos_weight'] = 18.0

report_eval = class_report(X_train, y_train, train_ids, params = custom_params)
print(report_eval)

              precision    recall  f1-score   support

           0       0.96      0.96      0.96       241
           1       0.47      0.44      0.46        18

    accuracy                           0.93       259
   macro avg       0.71      0.70      0.71       259
weighted avg       0.92      0.93      0.93       259
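
- ##### A worked example of how the class 1 (winner) row of a report can be read. The integer counts below are not printed by the notebook; they are inferred as the counts consistent with the rounded figures and a support of 18 winners, for the default and the hand-tuned parameters.

```python
def prf(tp, fp, fn):
    """Precision, recall and f1-score from raw counts, rounded to two decimals."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return round(precision, 2), round(recall, 2), round(f1, 2)


# Inferred counts: tp + fn must equal the support of 18 actual winners.
default_counts = {"tp": 5, "fp": 7, "fn": 13}
tuned_counts = {"tp": 8, "fp": 9, "fn": 10}

default_scores = prf(**default_counts)
tuned_scores = prf(**tuned_counts)
print(default_scores)
print(tuned_scores)
```

Read this way, the tuned parameters find 8 of the 18 winners instead of 5, at the cost of two more false alarms.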



- #### Make a new prediction, using the tuned parameters

In [17]:
results = predict_winner(X_train, y_train, X_future, ids, params = custom_params)
results.index = results.index + 1
print(results)

   Driver Probability to win
1     VER             10.68%
2     RUS              7.76%
3     PIA              7.73%
4     ANT              2.02%
5     LEC              0.85%
6     HAM              0.64%
7     NOR              0.47%
8     SAI              0.36%
9     HAD              0.27%
10    TSU              0.23%
11    ALB              0.17%
12    ALO              0.17%
13    GAS              0.16%
14    LAW              0.15%
15    COL              0.14%
16    HUL              0.14%
17    BEA              0.14%
18    OCO              0.14%
19    STR              0.14%
20    BOR              0.14%


***
## EXTRA: USE QUALIFICATION RESULTS IN OUR PREDICTIONS
***
<br>

- #### Requires the qualifying results to be available when the model runs. The model must be re-trained on historical data that includes a raw 'GridPosition' column. Quali results need to be hardcoded in the .py file (f1_future_data) for now.

> __Current Grand Prix:__ _United States Grand Prix 2025_

In [18]:
X_cols_grid = pred_cols(grid = True)
X_future_grid = full_df[full_df['isPredictionData'] == 1][X_cols_grid]
X_train_grid = full_df[full_df['isPredictionData'] != 1][X_cols_grid]
y_train_grid = full_df[full_df['isPredictionData'] != 1]['Winner']
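
- ##### A sketch of how hardcoded quali results could be attached to the next-race rows, assuming a plain dictionary that maps driver codes to grid slots. The name `QUALI_GRID`, the driver codes, and the grid order are all made up; the real f1_future_data file may store them differently.

```python
import pandas as pd

# Made-up grid order for placeholder driver codes (not real results).
QUALI_GRID = {"AAA": 1, "BBB": 2, "CCC": 3}

next_race_demo = pd.DataFrame({
    "Driver": ["BBB", "CCC", "AAA", "DDD"],
    "isPredictionData": [1, 1, 1, 1],
})

# Look up each driver's grid slot; drivers missing from the dict become NaN.
next_race_demo["GridPosition"] = next_race_demo["Driver"].map(QUALI_GRID)

# List drivers without a grid slot so a forgotten entry is noticed early.
missing = next_race_demo.loc[next_race_demo["GridPosition"].isna(), "Driver"].tolist()
print(missing)
```

Checking for missing drivers before predicting guards against a typo or an incomplete hardcoded list.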

In [19]:
from tabulate import tabulate

results = predict_winner(X_train_grid, y_train_grid, X_future_grid, ids, params = custom_params)
results.index = results.index + 1
print(tabulate(results, headers = ["Driver", "Winning %"], tablefmt = "double_outline"))

╔════╦══════════╦═════════════╗
║    ║ Driver   ║ Winning %   ║
╠════╬══════════╬═════════════╣
║  1 ║ NOR      ║ 1.66%       ║
║  2 ║ VER      ║ 0.98%       ║
║  3 ║ ANT      ║ 0.56%       ║
║  4 ║ RUS      ║ 0.39%       ║
║  5 ║ LEC      ║ 0.33%       ║
║  6 ║ TSU      ║ 0.25%       ║
║  7 ║ PIA      ║ 0.24%       ║
║  8 ║ HAD      ║ 0.2%        ║
║  9 ║ ALB      ║ 0.18%       ║
║ 10 ║ ALO      ║ 0.18%       ║
║ 11 ║ HUL      ║ 0.17%       ║
║ 12 ║ COL      ║ 0.17%       ║
║ 13 ║ GAS      ║ 0.14%       ║
║ 14 ║ BOR      ║ 0.13%       ║
║ 15 ║ OCO      ║ 0.13%       ║
║ 16 ║ LAW      ║ 0.13%       ║
║ 17 ║ HAM      ║ 0.13%       ║
║ 18 ║ STR      ║ 0.1%        ║
║ 19 ║ BEA      ║ 0.1%        ║
║ 20 ║ SAI      ║ 0.08%       ║
╚════╩══════════╩═════════════╝
