# F1 Winner **Prediction**

**Author**: Esma Yildirim

**Date**: 05.04.2025

**Description**: Predicts F1 Grand Prix winners using historical data (1950–2024)

**Links**:  
- [GitHub Repo](https://github.com/frauvate/f1-2025-winner-prediction/) | [Kaggle Dataset](https://www.kaggle.com/datasets/rohanrao/formula-1-world-championship-1950-2020)  

Importing Libraries

In [24]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
import joblib
from IPython.display import display, Markdown

# Data Loading and Preprocessing

Load Datasets

In [2]:
results = pd.read_csv("results.csv")
qualifying = pd.read_csv("qualifying.csv")
driver_standings = pd.read_csv("driver_standings.csv")
constructor_standings = pd.read_csv("constructor_standings.csv")
races = pd.read_csv("races.csv")
drivers = pd.read_csv("drivers.csv")

Merge Datasets

In [3]:
merged_df = results.merge(qualifying, on=["raceId", "driverId", "constructorId"], how="left")
merged_df = merged_df.merge(driver_standings, on=["raceId", "driverId"], how="left")
merged_df = merged_df.merge(constructor_standings, on=["raceId", "constructorId"], how="left", suffixes=('_results', '_constructor'))
merged_df = merged_df.merge(races, on="raceId", how="left")

# print(merged_df.columns) / Used to select from column names

Select Relevant Features

In [4]:
merged_df = merged_df[[
    "year", "driverId", "constructorId", "grid", "positionOrder", "points_x", "q1", "q2", "q3", "wins_results"
]]

Renaming Columns

In [5]:
merged_df.rename(columns={
    "points_results": "race_points",
    "wins_results": "driver_wins"
}, inplace=True)

Handle Missing Values

In [6]:
merged_df.fillna(-1, inplace=True)

In [7]:
# Convert qualifying times (e.g., "1:23.456") to numeric seconds
for col in ["q1", "q2", "q3"]:
    merged_df[col] = pd.to_numeric(merged_df[col].str.replace(':', '', regex=True), errors='coerce')

Define Features and Target Variable

In [8]:
target = (merged_df["positionOrder"] == 1).astype(int)  # 1 if the driver won, else 0
features = merged_df.drop(columns=["positionOrder", "year"])

Scale Numerical Features

In [9]:
scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)

# Model Training

Split into Train and Test Sets

In [10]:
X_train, X_test, y_train, y_test = train_test_split(features_scaled, target, test_size=0.2, random_state=42)

In [11]:
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

Train the Random Forest Classifier

In [12]:
# Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train_scaled, y_train)

In [13]:
# Save model and scaler
joblib.dump(model, "f1_winner_model.pkl")
joblib.dump(scaler, "scaler.pkl")

['scaler.pkl']

Make Predictions

In [14]:
y_pred = model.predict(X_test)

Evaluate Model

In [15]:
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.4f}")
print("Classification Report:")
print(classification_report(y_test, y_pred))

Model Accuracy: 0.9981
Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      5124
           1       1.00      0.96      0.98       228

    accuracy                           1.00      5352
   macro avg       1.00      0.98      0.99      5352
weighted avg       1.00      1.00      1.00      5352



# Simulate and Predict

Simulate 2025 Race Data

In [16]:
prediction_data = merged_df[merged_df["year"] == 2024].copy()
prediction_data["year"] = 2025

In [17]:
# Assign new raceId values for 2025
prediction_data["raceId"] = prediction_data.groupby(["year", "constructorId", "driverId"]).ngroup() + 1

In [18]:
# Ensure prediction features match training features
prediction_features = prediction_data[features.columns]
prediction_features_scaled = scaler.transform(prediction_features)



In [19]:
# Predict probabilities instead of direct classifications
prediction_probs = model.predict_proba(prediction_features_scaled)

Extract Probability of Winning

In [20]:
winning_probs = prediction_probs[:, 1]
prediction_data["win_probability"] = winning_probs

Select the Driver with the Highest Probability

In [21]:
predicted_winners = prediction_data.loc[prediction_data.groupby("raceId")["win_probability"].idxmax()]

Calculate Average Win Probability Per Driver

In [22]:
average_win_probs = prediction_data.groupby("driverId")["win_probability"].mean()
champion_prediction = average_win_probs.idxmax()
champion_probability = average_win_probs.max()

Get Top 5 Drivers

In [27]:
top_5 = average_win_probs.nlargest(5).reset_index()
top_5.columns = ["Driver ID", "Win Probability"]
top_5 = top_5.merge(
    drivers[["driverId", "surname"]],
    left_on="Driver ID",
    right_on="driverId"
).drop(columns="driverId")

# Display
display(Markdown("### 🏆 Top 5 Predicted 2025 Winners"))
display(Markdown(top_5[["surname", "Win Probability"]].rename(
    columns={"surname": "Driver"}).to_markdown(index=False, floatfmt=".1%")))

### 🏆 Top 5 Predicted 2025 Winners (Historical Model)

| Driver     |   Win Probability |
|:-----------|------------------:|
| Verstappen |             65.5% |
| Sainz      |             56.4% |
| Norris     |             51.9% |
| Leclerc    |             46.8% |
| Russell    |             41.9% |

Final Winner Prediction

In [26]:
driver_name = drivers.loc[drivers["driverId"] == champion_prediction, "surname"].values[0]
print(f"\n🏆 Predicted 2025 Winner (With the average probability of winning): {driver_name}")
print(f"🔮 Average Probability of Winning: {champion_probability:.2%}")


🏆 Predicted 2025 Winner (With the average probability of winning): Verstappen
🔮 Average Probability of Winning: 65.46%
