<h1 style="text-align:center;">F1 Safety Car Predictor</h1>
<img src="Images/safety_car.jpg" align="left" alt="Alt text" />

## Authors: Petar Stamenković, Aleksa Mitrovčan

## Intro

Hello everyone, we are passionate Formula 1 fans, and we’ve always been fascinated by the strategic impact of Safety Car deployments during a race. To explore this further, we built an F1 Safety Car Predictor, a machine learning model that predicts the likelihood of a Safety Car being deployed in an F1 race.

Using Python and Jupyter Notebook, we developed a Random Forest model trained on a custom-built dataset (CSV file) of historical race data, track characteristics, weather, and past Safety Car occurrences. Our goal was to analyze patterns and create a tool that gives teams, analysts, or F1 enthusiasts a prediction before the race.

## Importing the race and circuit tables

Here we import two datasets using the pandas function read_csv.  
safety_car_predictor.csv is our fact table, while circuit_data.csv is our dimension table.  
The table below shows circuit_data, which has the following columns: name, circuit_id, circuit_type and phy_diff.

- name : circuit name
- circuit_id : unique number corresponding to a track
- circuit_type : traditional or street 
- phy_diff : level of difficulty for a track (1 - 4)

This file contains basic information about the project.    
For more details, consult the documentation provided in the GitHub repository *("F1_Safety_Car_Predictor.pdf")*

In [1]:
import numpy as np
import pandas as pd

safety_car_predictor = pd.read_csv("Dataset/safety_car_predictor.csv")
circuit_data = pd.read_csv("Dataset/circuit_data.csv")
circuit_data

Unnamed: 0,name,circuit_id,circuit_type,phy_diff
0,Australian Grand Prix,1,street,2
1,Malaysian Grand Prix,2,traditional,4
2,Bahrain Grand Prix,3,traditional,3
3,Spanish Grand Prix,4,traditional,2
4,Turkish Grand Prix,5,traditional,2
5,Monaco Grand Prix,6,street,3
6,Canadian Grand Prix,7,traditional,2
7,French Grand Prix,8,traditional,2
8,British Grand Prix,9,traditional,1
9,German Grand Prix,10,traditional,2


## Merging two tables

In this step we merge our fact and dimension tables on their common column 'name'.
Keeping the circuit properties in a separate table makes it easier to support future races, since the circuit parameters don't need to be entered manually for every race.
The **Virtual Safety Car** (num_of_vsc) was introduced in 2015, so for all races before that year the value is set to **NaN** (Not a Number). The merged table has these additional columns:
- weather : weather conditions during each race.
- num_dnfs : total number of DNFs (did not finish) in each race, including mechanical failures.
- num_of_sc : number of safety car deployments in each race.
- num_of_vsc : number of virtual safety car deployments in each race.
- total : sum of num_of_sc and num_of_vsc.
- safety_car : whether a safety car was deployed (True) or not (False).

In [2]:
safety_car_predictor = pd.merge(safety_car_predictor,circuit_data, on='name')
safety_car_predictor

Unnamed: 0,year,name,date,weather,num_dnfs,safety_car,num_of_sc,num_of_vsc,total,circuit_id,circuit_type,phy_diff
0,1993,Brazilian Grand Prix,03/28/93,light_rain,9,True,1,,1,18,traditional,2
1,1994,Brazilian Grand Prix,3/27/1994,dry,14,False,0,,0,18,traditional,2
2,1995,Brazilian Grand Prix,3/26/1995,dry,15,False,0,,0,18,traditional,2
3,1996,Brazilian Grand Prix,3/31/1996,heavy_rain,11,False,0,,0,18,traditional,2
4,1997,Brazilian Grand Prix,3/30/1997,dry,4,False,0,,0,18,traditional,2
...,...,...,...,...,...,...,...,...,...,...,...,...
582,2022,Miami Grand Prix,5/8/2022,dry,4,True,1,1.0,2,34,street,3
583,2023,Miami Grand Prix,5/7/2023,dry,0,False,0,0.0,0,34,street,3
584,2024,Miami Grand Prix,5/5/2024,dry,1,True,1,1.0,2,34,street,3
585,2023,Las Vegas Grand Prix,11/19/2023,dry,3,True,2,0.0,2,35,street,3
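
Because pd.merge uses an inner join by default, a race whose name has no match in the circuit table would be dropped without any message. The sketch below shows one way this could be checked. It uses two small made-up tables in place of the real CSV files, and the '_merge' check is an illustration, not part of our notebook.

```python
import pandas as pd

# Toy stand-ins for the fact and dimension tables (made-up rows)
races = pd.DataFrame({
    "year": [2022, 2023, 2023],
    "name": ["Miami Grand Prix", "Miami Grand Prix", "Unknown Grand Prix"],
})
circuits = pd.DataFrame({
    "name": ["Miami Grand Prix", "Las Vegas Grand Prix"],
    "circuit_id": [34, 35],
})

# A left join keeps every race; indicator=True adds a '_merge' column,
# and validate raises an error if a circuit name appears twice in 'circuits'
merged = pd.merge(races, circuits, on="name", how="left",
                  indicator=True, validate="many_to_one")

# Races that found no matching circuit
unmatched = merged.loc[merged["_merge"] == "left_only", "name"].unique()
print(list(unmatched))
```

An empty list would mean that every race found its circuit.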


## Unique list of all circuits

In [3]:
safety_car_predictor.name.unique() # List out all circuits in order to create a new DataFrame for circuit ID

array(['Brazilian Grand Prix', 'Emilia Romagna Grand Prix',
       'Spanish Grand Prix', 'Monaco Grand Prix', 'Canadian Grand Prix',
       'French Grand Prix', 'British Grand Prix', 'German Grand Prix',
       'Hungarian Grand Prix', 'Belgian Grand Prix', 'Italian Grand Prix',
       'Portuguese Grand Prix', 'Japanese Grand Prix',
       'Australian Grand Prix', 'Pacific Grand Prix',
       'European Grand Prix', 'Argentina Grand Prix',
       'Austrian Grand Prix', 'Malaysian Grand Prix',
       'Indianapolis Grand Prix', 'Bahrain Grand Prix',
       'Chinese Grand Prix', 'Turkish Grand Prix', 'Singapore Grand Prix',
       'Abu Dhabi Grand Prix', 'Korean Grand Prix',
       'United States Grand Prix', 'Russian Grand Prix',
       'Mexican Grand Prix', 'Azerbaijan Grand Prix', 'Tuscan Grand Prix',
       'Eifel Grand Prix', 'Qatar Grand Prix', 'Dutch Grand Prix',
       'Saudi Arabian Grand Prix', 'Miami Grand Prix',
       'Las Vegas Grand Prix'], dtype=object)

## Calculating feature values

Since the number of DNFs, safety car and virtual safety car deployments is unknown for a future race, we calculate the per-circuit average of each of those columns and store it in a new column that serves as a feature in our model.  
We rounded all averages to 2 decimal places and rearranged the columns for additional clarity.

In [4]:
# Average DNFs per circuit
safety_car_predictor['avg_dnfs_per_circuit'] = np.round(safety_car_predictor.groupby('name')['num_dnfs'].transform('mean'), 2)

# Make sure num_of_vsc is numeric; missing values stay NaN and are skipped when averaging
safety_car_predictor['num_of_vsc'] = np.round(pd.to_numeric(safety_car_predictor['num_of_vsc'], errors='coerce'), 2)
safety_car_predictor['avg_sc_per_circuit'] = np.round(safety_car_predictor.groupby('name')['num_of_sc'].transform('mean'),2)
safety_car_predictor['avg_vsc_per_circuit'] = np.round(safety_car_predictor.groupby('name')['num_of_vsc'].transform(lambda x:x.mean(skipna=True)), 2)

# Rearrange columns
columns = ['year', 'name', 'circuit_id', 'circuit_type', 'phy_diff', 'date', 'weather', 'num_dnfs', 'avg_dnfs_per_circuit', 'num_of_sc', 'num_of_vsc', 'total', 'avg_sc_per_circuit', 'avg_vsc_per_circuit', 
           'safety_car']
safety_car_predictor = safety_car_predictor[columns]
safety_car_predictor


Unnamed: 0,year,name,circuit_id,circuit_type,phy_diff,date,weather,num_dnfs,avg_dnfs_per_circuit,num_of_sc,num_of_vsc,total,avg_sc_per_circuit,avg_vsc_per_circuit,safety_car
0,1993,Brazilian Grand Prix,18,traditional,2,03/28/93,light_rain,9,6.32,1,,1,0.84,0.44,True
1,1994,Brazilian Grand Prix,18,traditional,2,3/27/1994,dry,14,6.32,0,,0,0.84,0.44,False
2,1995,Brazilian Grand Prix,18,traditional,2,3/26/1995,dry,15,6.32,0,,0,0.84,0.44,False
3,1996,Brazilian Grand Prix,18,traditional,2,3/31/1996,heavy_rain,11,6.32,0,,0,0.84,0.44,False
4,1997,Brazilian Grand Prix,18,traditional,2,3/30/1997,dry,4,6.32,0,,0,0.84,0.44,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
582,2022,Miami Grand Prix,34,street,3,5/8/2022,dry,4,1.67,1,1.0,2,0.67,0.67,True
583,2023,Miami Grand Prix,34,street,3,5/7/2023,dry,0,1.67,0,0.0,0,0.67,0.67,False
584,2024,Miami Grand Prix,34,street,3,5/5/2024,dry,1,1.67,1,1.0,2,0.67,0.67,True
585,2023,Las Vegas Grand Prix,35,street,3,11/19/2023,dry,3,2.50,2,0.0,2,1.00,0.00,True
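
How transform('mean') treats the missing pre-2015 VSC values can be seen on a small made-up table. This sketch uses toy numbers, not our dataset. It shows that the group mean is repeated on every row of the group, that NaN entries are skipped by default, and that a group with no values at all gets NaN as its average.

```python
import numpy as np
import pandas as pd

# Toy table: circuit "A" has two known VSC counts, circuit "B" has none
toy = pd.DataFrame({
    "name": ["A", "A", "A", "B", "B"],
    "num_of_vsc": [np.nan, 1.0, 0.0, np.nan, np.nan],
})

# Group mean broadcast back to every row; NaN values are ignored
toy["avg_vsc_per_circuit"] = (
    toy.groupby("name")["num_of_vsc"].transform("mean").round(2)
)
print(toy)
```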


## Data modification for model training 

Since the model needs numeric input and can't work with strings, we map each string value to a corresponding integer. For this step we used dictionaries to map the values.

In [5]:
# Replace string values in the 'weather' column
safety_car_predictor['weather'] = safety_car_predictor['weather'].replace({
    "dry": 0,
    "light_rain": 1,
    "heavy_rain": 2
})

# Replace string values in the 'circuit type' column
safety_car_predictor['circuit_type'] = safety_car_predictor['circuit_type'].replace({
    "traditional" : 0,
    "street" : 1
})

safety_car_predictor


Unnamed: 0,year,name,circuit_id,circuit_type,phy_diff,date,weather,num_dnfs,avg_dnfs_per_circuit,num_of_sc,num_of_vsc,total,avg_sc_per_circuit,avg_vsc_per_circuit,safety_car
0,1993,Brazilian Grand Prix,18,0,2,03/28/93,1,9,6.32,1,,1,0.84,0.44,True
1,1994,Brazilian Grand Prix,18,0,2,3/27/1994,0,14,6.32,0,,0,0.84,0.44,False
2,1995,Brazilian Grand Prix,18,0,2,3/26/1995,0,15,6.32,0,,0,0.84,0.44,False
3,1996,Brazilian Grand Prix,18,0,2,3/31/1996,2,11,6.32,0,,0,0.84,0.44,False
4,1997,Brazilian Grand Prix,18,0,2,3/30/1997,0,4,6.32,0,,0,0.84,0.44,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
582,2022,Miami Grand Prix,34,1,3,5/8/2022,0,4,1.67,1,1.0,2,0.67,0.67,True
583,2023,Miami Grand Prix,34,1,3,5/7/2023,0,0,1.67,0,0.0,0,0.67,0.67,False
584,2024,Miami Grand Prix,34,1,3,5/5/2024,0,1,1.67,1,1.0,2,0.67,0.67,True
585,2023,Las Vegas Grand Prix,35,1,3,11/19/2023,0,3,2.50,2,0.0,2,1.00,0.00,True
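
One caveat of dictionary-based encoding is that a label missing from the dictionary passes through replace unchanged and stays a string. A possible alternative, sketched here as an assumption rather than as our notebook's method, is Series.map, which turns unmapped labels into NaN so they are easy to detect. The 'fog' label is made up for illustration.

```python
import pandas as pd

weather_codes = {"dry": 0, "light_rain": 1, "heavy_rain": 2}

# Made-up column containing one label that the dictionary does not cover
weather = pd.Series(["dry", "heavy_rain", "fog", "light_rain"])

# map() returns NaN for any label that is not a key of the dictionary
encoded = weather.map(weather_codes)

# Labels that still need a code
unknown = weather[encoded.isna()].unique()
print(encoded.tolist())
print(list(unknown))
```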


## Random forest

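For this project we use a Random Forest model, because averaging many decision trees usually generalizes better than a single Decision Tree. Since we fit a RandomForestRegressor on the True/False safety_car column, its output is a value between 0 and 1 that we read as the chance of a safety car. In the code below we select the features for the model and assign our target variable. We then split both features and target into training and validation data, so that the model can be checked on races it was not trained on.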

In [6]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

features = ['circuit_id', 'weather', 'circuit_type', 
            'phy_diff','avg_dnfs_per_circuit', 'avg_sc_per_circuit','avg_vsc_per_circuit']

X = safety_car_predictor[features]
y = safety_car_predictor.safety_car

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

forest_model = RandomForestRegressor(random_state=1)
forest_model.fit(train_X, train_y)
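
The held-out validation data can be used to check how close the predicted chances are to what actually happened. The sketch below is not our notebook's code. It runs on synthetic data and uses scikit-learn's mean_absolute_error only to illustrate the idea, assuming a 0/1 target like safety_car.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 200 rows, 3 numeric features, 0/1 target
rng = np.random.default_rng(0)
X = rng.random((200, 3))
y = (X[:, 0] + 0.2 * rng.random(200) > 0.6).astype(int)

# By default 25% of the rows are held out for validation
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(random_state=1)
model.fit(train_X, train_y)

# Regressing on a 0/1 target gives values between 0 and 1,
# which can be read as an estimated chance
val_pred = model.predict(val_X)
print("Validation MAE:", round(mean_absolute_error(val_y, val_pred), 3))
```

A lower error means the predicted chances sit closer to the real outcomes of the held-out rows.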

## Accepting User Input

This step requires the user to enter the race to predict and the weather forecast for Sunday. The list circuit_names contains all accepted race names, while the forecast is chosen with a number (0, 1 or 2).

In [7]:
# Build the list of accepted names from the merged data, so that every
# accepted race is guaranteed to have rows for the prediction step
circuit_names = sorted(safety_car_predictor['name'].unique().tolist())

# Enter circuit name (ex. Australian Grand Prix) for the prediction
# Manually edit weather forecast (0 - dry, 1 - light rain, 2 - heavy rain)

while True:
    circuit_name = input("Enter circuit name (ex. Australian Grand Prix) for the prediction : ").strip()
    if circuit_name in circuit_names:
        print(f"Valid circuit: {circuit_name}")
        break
    else:
        print("Invalid circuit name! Please enter a valid Grand Prix name.")
        
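valid_weather = [0, 1, 2]
while True:
    try:
        weather = int(input("Enter weather forecast (0 - dry, 1 - light rain, 2 - heavy rain)"))
    except ValueError:
        # int() fails on non-numeric input, so ask again instead of crashing
        print("Invalid weather! Please enter a valid value.")
        continue
    if weather in valid_weather:
        print(f"Valid weather input: {weather}")
        break
    else:
        print("Invalid weather! Please enter a valid value.")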



Enter circuit name (ex. Australian Grand Prix) for the prediction : Chinese Grand Prix
Valid circuit: Chinese Grand Prix
Enter weather forecast (0 - dry, 1 - light rain, 2 - heavy rain)0
Valid weather input: 0
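
The name check is case-sensitive, so an input such as "chinese grand prix" is rejected. A possible refinement is sketched below with a hypothetical helper, match_circuit, and a shortened name list. Neither is part of our notebook.

```python
def match_circuit(user_text, valid_names):
    """Return the canonical name matching user_text, or None if there is none."""
    # Collapse repeated whitespace and compare case-insensitively
    wanted = " ".join(user_text.split()).casefold()
    for name in valid_names:
        if name.casefold() == wanted:
            return name
    return None


# Shortened list, for illustration only
names = ["Chinese Grand Prix", "Monaco Grand Prix"]
print(match_circuit("  chinese grand  prix ", names))
print(match_circuit("Moon Grand Prix", names))
```

Returning the canonical spelling keeps the later lookup by name in the DataFrame working unchanged.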


## Sample creation and prediction

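Finally, we predict the chance of a safety car in the given race. The warning shown when running this cell can be ignored: scikit-learn reports that the input has no feature names because the sample is passed as a plain list rather than a DataFrame, which does not change the prediction.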
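
The warning mentioned above can be avoided by passing the sample as a one-row DataFrame with the same column names that were used for training. The sketch below is a minimal illustration on a made-up training table with a shortened feature list. It is an assumption about how this could be done, not the code of our notebook.

```python
import warnings

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Made-up stand-in for the training table
features = ["circuit_id", "weather", "avg_sc_per_circuit"]
train = pd.DataFrame({
    "circuit_id": [1, 1, 2, 2, 3, 3],
    "weather": [0, 2, 0, 1, 0, 2],
    "avg_sc_per_circuit": [1.1, 1.1, 0.4, 0.4, 0.8, 0.8],
    "safety_car": [1, 1, 0, 0, 0, 1],
})

model = RandomForestRegressor(random_state=1)
model.fit(train[features], train["safety_car"])

# One-row DataFrame with the same column names and order as in training
sample = pd.DataFrame([[2, 2, 0.4]], columns=features)

# Record warnings raised during prediction to confirm none is about names
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    chance = model.predict(sample)[0] * 100

name_warnings = [w for w in caught if "feature names" in str(w.message)]
print(f"Prediction : {chance:.2f}%")
print("Feature-name warnings:", len(name_warnings))
```

Taking element [0] of the result also prints the chance as a plain number instead of an array.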

In [8]:
# Prediction sample has to be a 2-dimensional array!

prediction_sample = safety_car_predictor[safety_car_predictor['name'] == 
circuit_name][['circuit_id', 'circuit_type', 'phy_diff','avg_dnfs_per_circuit',
               'avg_sc_per_circuit', 'avg_vsc_per_circuit']]

prediction_sample = prediction_sample.iloc[0].tolist()
prediction_sample.insert(1, weather)

#prediction_sample = [[1,1,1,2,6.761905, 1.095238, 0.625000]] # For instance this predicts a safety car chance for Australian Grand Prix with light rain!
safety_car_deployment_prediction = forest_model.predict([prediction_sample]) * 100
safety_car_deployment_prediction_rounded = np.round(safety_car_deployment_prediction,2)

print(f"Average DNF's on {circuit_name} :", prediction_sample[4])
print(f"Average safety cars on {circuit_name} :", prediction_sample[5])
print(f"Average virtual safety cars on {circuit_name} :", prediction_sample[6])
print('\033[92m' + '\033[1m' + f"Prediction : {safety_car_deployment_prediction_rounded}%")


Average DNF's on Chinese Grand Prix : 3.41
Average safety cars on Chinese Grand Prix : 0.71
Average virtual safety cars on Chinese Grand Prix : 0.5
Prediction : [60.97]%




## Additional information about our model

The output below shows the share of races in which a safety car was deployed for each weather category (0 - dry, 1 - light rain, 2 - heavy rain), calculated from our manually created dataset.  

In [9]:
print(safety_car_predictor.groupby('weather')['safety_car'].mean()) # Share of races with a safety car for each weather category


weather
0    0.438878
1    0.540000
2    0.684211
Name: safety_car, dtype: float64
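
Because safety_car is a True/False column, its mean per group is the share of races with a safety car. The sketch below, on made-up rows rather than our dataset, also reports how many races each share is based on, since a category with only a few races gives a less reliable share.

```python
import pandas as pd

# Made-up races: weather code and whether a safety car was deployed
toy = pd.DataFrame({
    "weather": [0, 0, 0, 0, 1, 1, 2],
    "safety_car": [True, False, False, True, True, False, True],
})

# share = fraction of races with a safety car, races = number of races
summary = toy.groupby("weather")["safety_car"].agg(share="mean", races="count")
print(summary)
```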


This code segment shows the importance of each feature from our dataset in the final prediction.  
The largest impact on the prediction comes from the avg_sc_per_circuit feature, which was expected. Weather comes second, as rain can drastically increase the chance of an accident that brings out the safety car.  
The least important factor is circuit_type. One possible explanation is that drivers are used to both street and traditional circuits, having raced on both for most of their lives. Another is that circuit_id and the per-circuit averages may already carry most of the circuit-specific information.

In [10]:
for name, importance in zip(features, forest_model.feature_importances_):
    print(name, np.round(importance, 2) * 100, '%') # Importance of each feature for our model, in percent


circuit_id 10.0 %
weather 16.0 %
circuit_type 1.0 %
phy_diff 4.0 %
avg_dnfs_per_circuit 10.0 %
avg_sc_per_circuit 48.0 %
avg_vsc_per_circuit 11.0 %
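
The importances can be easier to compare when sorted. The sketch below runs on synthetic data with hypothetical feature names and pairs each name with its importance in a pandas Series. The values of feature_importances_ sum to 1, so they can be read as shares.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: only 'signal' drives the target, the 'noise_*' columns are random
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.random((300, 3)), columns=["signal", "noise_a", "noise_b"])
y = (X["signal"] > 0.5).astype(int)

model = RandomForestRegressor(random_state=1).fit(X, y)

# Label each importance with its feature name and sort, largest first
importances = pd.Series(model.feature_importances_, index=X.columns)
print((importances.sort_values(ascending=False) * 100).round(1))
```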


## Final words

Overall we are happy with the results, especially because this is our first ML project, and hopefully not our last.
Thank you for taking the time to review this project. For any further information, advice or criticism, you can contact us via LinkedIn.  
We don't support using this model for any type of gambling/betting. Save your money and go to an actual race. Cheers!