# SWE3050-41 Term Project: F1 Champion Prediction

Sungkyunkwan University (SKKU)

Group 11: Nguyen Andy, UZMA NABEEHA BINTI SUFFIAN, 앙가락, 이나현



## Phase 1: Data Collection
Load the raw Kaggle CSV files (results, races, drivers, and constructors), merge them into a single dataset of F1 results from 1950 to 2024, and save it. Then clean the data, sort it by year and race name, and keep only the modern-era seasons from 2014 to 2023.

Work by Andy

In [38]:
import shutil

# delete old f1_cache folder
shutil.rmtree('/content/f1_cache', ignore_errors=True)
print("✅ Deleted f1_cache folder completely.")

✅ Deleted f1_cache folder completely.


In [39]:
# install fastf1 if not yet installed
!pip install fastf1

import os, fastf1

# re-initialise new fastf1 cache
os.makedirs('/content/f1_cache', exist_ok=True)
fastf1.Cache.enable_cache('/content/f1_cache')
print("🆕 New FastF1 cache initialized.")

🆕 New FastF1 cache initialized.


In [40]:
import pandas as pd

# 🏎️ STEP 1: LOAD RAW CSV FILES
print("📂 Loading Kaggle F1 dataset files...")

# upload the following files to Google Colab first (into a top-level 'kaggle' folder)
kaggle_results = pd.read_csv('/kaggle/results.csv')
races = pd.read_csv('/kaggle/races.csv')[['raceId', 'year', 'name']]
drivers = pd.read_csv('/kaggle/drivers.csv')[['driverId', 'driverRef', 'surname']]
constructors = pd.read_csv('/kaggle/constructors.csv')[['constructorId', 'name']]

print("✅ All source files loaded successfully.")

# 🧩 STEP 2: MERGE INTO ONE COMPLETE DATAFRAME
merged = (
    kaggle_results
    .merge(races, on='raceId')
    .merge(drivers, on='driverId')
    .merge(constructors, on='constructorId')
)

# rename columns for clarity
# ('name_x' is the race name and 'name_y' the constructor name, from pandas' merge suffixes)
merged = merged.rename(columns={
    'year': 'Year',
    'name_x': 'Race',
    'surname': 'Driver',
    'name_y': 'Constructor',
    'positionOrder': 'Position',
    'points': 'Points'
})

# save merged dataset
merged.to_csv('/content/formula1_results_1950_2024.csv', index=False)
print("💾 Merged dataset saved as formula1_results_1950_2024.csv")

# 🧹 STEP 3: CLEAN AND SORT DATA
print("🧹 Cleaning and sorting dataset...")

df = pd.read_csv('/content/formula1_results_1950_2024.csv')

# select and reorder relevant columns
df = df[['Year', 'Race', 'Driver', 'Constructor', 'Position', 'Points']]

# drop missing / invalid rows (the Kaggle files mark missing values with the literal string '\N')
df = df.dropna(subset=['Year', 'Race', 'Driver', 'Constructor', 'Points'])
df = df[df['Position'].astype(str) != r'\N']

# ensure numeric columns have proper data types
df['Year'] = df['Year'].astype(int)
df['Position'] = df['Position'].astype(int)
df['Points'] = df['Points'].astype(float)

# sort by year, then by race name (alphabetical within each season, not calendar order)
df = df.sort_values(by=['Year', 'Race']).reset_index(drop=True)

print(f"✅ Cleaned dataset shape: {df.shape}")

# save the clean version
df.to_csv('/content/formula1_results_cleaned.csv', index=False)
print("💾 Saved clean dataset as formula1_results_cleaned.csv")

# 🎯 STEP 4: FILTER MODERN ERA (2014–2023)
# skip 2024 due to partial/incomplete results
df_filtered = df[df['Year'].between(2014, 2023)]
df_filtered.to_csv('/content/f1_results_2014_2023.csv', index=False)
print(f"🏁 Saved modern-era filtered dataset (2014–2023) → {len(df_filtered)} rows")

📂 Loading Kaggle F1 dataset files...
✅ All source files loaded successfully.
💾 Merged dataset saved as formula1_results_1950_2024.csv
🧹 Cleaning and sorting dataset...
✅ Cleaned dataset shape: (26759, 6)
💾 Saved clean dataset as formula1_results_cleaned.csv
🏁 Saved modern-era filtered dataset (2014–2023) → 4147 rows
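
The `name_x` and `name_y` columns renamed in Step 2 come from pandas' default merge suffixes, because both `races` and `constructors` carry a `name` column. The following is a minimal sketch on made-up toy rows (not the Kaggle data) of one way to avoid relying on those defaults: rename the clashing columns before merging. The `validate='many_to_one'` argument is an optional guard, added here as an assumption, that raises if a lookup table contains duplicate keys.

```python
import pandas as pd

# toy stand-ins for the Kaggle tables (values are illustrative only)
results = pd.DataFrame({
    'raceId': [1, 1, 2],
    'constructorId': [10, 20, 10],
    'points': [25.0, 18.0, 25.0],
})
races = pd.DataFrame({
    'raceId': [1, 2],
    'year': [2014, 2014],
    'name': ['Australian Grand Prix', 'Malaysian Grand Prix'],
})
constructors = pd.DataFrame({
    'constructorId': [10, 20],
    'name': ['Mercedes', 'McLaren'],
})

# rename the clashing 'name' columns up front so no _x/_y suffixes appear
toy_merged = (
    results
    .merge(races.rename(columns={'name': 'Race'}),
           on='raceId', validate='many_to_one')
    .merge(constructors.rename(columns={'name': 'Constructor'}),
           on='constructorId', validate='many_to_one')
)
print(toy_merged[['year', 'Race', 'Constructor', 'points']])
```

With explicit names, a later change to the merge order cannot silently swap which column ends up as `Race` and which as `Constructor`.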


## Phase 2: Feature Engineering

Transform the clean race-level data into season-level driver statistics, ready for training models that predict each season's champion.

Work by Andy

In [41]:
# 🧮 FEATURE ENGINEERING: CREATE DRIVER-SEASON LEVEL DATA

import pandas as pd

# load cleaned race-level dataset
# NOTE: make sure to upload "f1_results_2014_2023.csv" to Google Colab so this script can read the file
df = pd.read_csv('/content/f1_results_2014_2023.csv')

# ensure correct datatypes
df['Points'] = df['Points'].astype(float)
df['Position'] = df['Position'].astype(int)

# aggregate key performance features per driver per season
features = (
    df.groupby(['Year', 'Driver'])
      .agg(
          Total_Points=('Points', 'sum'),
          Wins=('Position', lambda x: (x == 1).sum()),
          Podiums=('Position', lambda x: (x <= 3).sum()),
          Top10s=('Position', lambda x: (x <= 10).sum()),
          Races_Entered=('Race', 'count'),
          Avg_Position=('Position', 'mean'),
          # consistency feature (lower = better); NaN when a driver has only one race
          Position_STD=('Position', 'std')
      )
      .reset_index()
)

# find the champion for each season (label = 1 if champion, else 0)
champions = (
    features.loc[features.groupby('Year')['Total_Points'].idxmax(), ['Year', 'Driver']]
    .assign(Champion=1)
)

# merge back to create binary label column
features = features.merge(champions, on=['Year', 'Driver'], how='left')
features['Champion'] = features['Champion'].fillna(0).astype(int)

# sort and save
features = features.sort_values(['Year', 'Total_Points'], ascending=[True, False])
features.to_csv('/content/f1_driver_features_2014_2023.csv', index=False)

print("✅ Feature engineering complete!")
print("📄 Saved: f1_driver_features_2014_2023.csv")
print(f"Rows: {len(features)}, Columns: {features.shape[1]}")
print("\nSample preview:")
print(features.head(10))

✅ Feature engineering complete!
📄 Saved: f1_driver_features_2014_2023.csv
Rows: 223, Columns: 10

Sample preview:
    Year      Driver  Total_Points  Wins  Podiums  Top10s  Races_Entered  \
8   2014    Hamilton         384.0    11       16      16             19   
18  2014     Rosberg         317.0     5       15      16             19   
17  2014   Ricciardo         238.0     3        8      16             19   
2   2014      Bottas         186.0     0        6      17             19   
23  2014      Vettel         167.0     0        4      16             19   
0   2014      Alonso         161.0     0        2      17             19   
15  2014       Massa         134.0     0        3      11             19   
3   2014      Button         126.0     0        1      13             19   
9   2014  Hülkenberg          96.0     0        0      15             19   
16  2014       Pérez          59.0     0        1      12             19   

    Avg_Position  Position_STD  Champion  
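
`Position_STD` is a sample standard deviation, so it is undefined (NaN) for a driver who entered only one race in a season, which is why Phase 3 fills missing values before training. A small sketch on toy rows illustrates this; the driver labels and positions are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    'Year': [2014, 2014, 2014, 2014],
    'Driver': ['A', 'A', 'A', 'B'],   # hypothetical drivers; 'B' has a single race
    'Position': [1, 2, 3, 12],
})

toy_features = (
    toy.groupby(['Year', 'Driver'])
       .agg(
           Races_Entered=('Position', 'count'),
           Avg_Position=('Position', 'mean'),
           Position_STD=('Position', 'std'),   # sample std (ddof=1)
       )
       .reset_index()
)
print(toy_features)
```

Driver `A` gets a finite spread from three results, while driver `B` has a missing value because one observation has no sample standard deviation.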

## Phase 3: Model Training & F1 Champion Prediction Results

Work by Uzma, 이나현

#### 1. Data Preparation & Time-Based Split

In [42]:
import pandas as pd

# load the driver-season feature dataset
df = pd.read_csv('/content/f1_driver_features_2014_2023.csv')

# fill missing std (drivers with a single race in a season) with the column mean
# NOTE: the mean is taken over all years, so it also sees validation/test seasons
df['Position_STD'] = df['Position_STD'].fillna(df['Position_STD'].mean())

# define feature columns
feature_cols = [
    'Total_Points', 'Wins', 'Podiums', 'Top10s',
    'Races_Entered', 'Avg_Position', 'Position_STD'
]

# time-based split: no shuffling, so no later season ends up in the training rows
train_years = list(range(2014, 2020))  # 2014–2019
val_years   = [2020, 2021]
test_years  = [2022, 2023]

train_df = df[df['Year'].isin(train_years)].copy()
val_df   = df[df['Year'].isin(val_years)].copy()
test_df  = df[df['Year'].isin(test_years)].copy()

X_train = train_df[feature_cols]
y_train = train_df['Champion']

X_val   = val_df[feature_cols]
y_val   = val_df['Champion']

X_test  = test_df[feature_cols]
y_test  = test_df['Champion']

# .tolist() converts NumPy integers to plain ints so the printed lists stay readable
print("train years:", sorted(train_df['Year'].unique().tolist()), "shape:", X_train.shape)
print("val years:  ", sorted(val_df['Year'].unique().tolist()), "shape:", X_val.shape)
print("test years: ", sorted(test_df['Year'].unique().tolist()), "shape:", X_test.shape)

train years: [2014, 2015, 2016, 2017, 2018, 2019] shape: (135, 7)
val years:   [2020, 2021] shape: (44, 7)
test years:  [2022, 2023] shape: (44, 7)
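
If the imputation of `Position_STD` should not see validation or test seasons, one option is to compute the fill value from the training years only and reuse it for the other splits. The following is a sketch on toy data under that assumption, not the notebook's actual pipeline; all values are made up.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Year': [2018, 2018, 2019, 2022, 2022],
    'Position_STD': [2.0, 4.0, np.nan, 10.0, np.nan],
})
train_mask = toy['Year'] <= 2019

# mean computed from training rows only (NaN values are skipped by default)
train_mean = toy.loc[train_mask, 'Position_STD'].mean()

# the same training-derived value fills gaps in every split
toy['Position_STD'] = toy['Position_STD'].fillna(train_mean)
print(train_mean)
print(toy)
```

The held-out 2022 value of 10.0 never influences the fill value, so the later seasons stay unseen until evaluation.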


#### 2. Define Models (Logistic Regression, Decision Tree, Random Forest, SVM) with Scaling & Class Weights

In [43]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

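# scale features for Logistic Regression and SVM (tree models do not need it);
# class_weight='balanced' compensates for champions being a tiny minority class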
models = {
    "Logistic Regression": Pipeline([
        ('scale', StandardScaler()),
        ('clf', LogisticRegression(max_iter=1000, class_weight='balanced', random_state=42))
    ]),
    "Decision Tree": DecisionTreeClassifier(class_weight='balanced', random_state=42),
    "Random Forest": RandomForestClassifier(class_weight='balanced', random_state=42),
    "Support Vector Machine (SVM)": Pipeline([
        ('scale', StandardScaler()),
        ('clf', SVC(kernel='rbf', probability=True, class_weight='balanced', random_state=42))
    ])
}
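
With one champion per season, the 2014 to 2019 training split has 6 positive rows out of 135, so an unweighted model could score well by never predicting a champion. scikit-learn documents `class_weight='balanced'` as weighting each class by `n_samples / (n_classes * count)`. As an illustration, the arithmetic for this split can be written out in plain Python:

```python
# class counts in the training split: 135 driver-seasons, 6 of them champions
n_samples = 135
counts = {0: 129, 1: 6}
n_classes = len(counts)

# 'balanced' heuristic: n_samples / (n_classes * count_of_class)
weights = {
    label: n_samples / (n_classes * count)
    for label, count in counts.items()
}

for label, weight in weights.items():
    print(f"class {label}: weight {weight:.4f}")
```

Each champion row therefore counts roughly 21 times as much as a non-champion row during fitting.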


#### 3. Train & Choose Best Model

In [44]:
from sklearn.metrics import classification_report

best_models = {}
val_scores = {}

for name, model in models.items():
    print(f"\n=== Training {name} ===")
    model.fit(X_train, y_train)

    y_val_pred = model.predict(X_val)
    report = classification_report(y_val, y_val_pred, output_dict=True)
    f1_champion = report['1']['f1-score']
    val_scores[name] = f1_champion
    best_models[name] = model

    print(classification_report(y_val, y_val_pred, digits=4))
    print(f"F1 score for champion class (1): {f1_champion:.4f}")

# select best model based on F1 for champion class
best_name = max(val_scores, key=val_scores.get)
best_model = best_models[best_name]

print(f"Best model selected: {best_name}")



=== Training Logistic Regression ===
              precision    recall  f1-score   support

           0     1.0000    0.9762    0.9880        42
           1     0.6667    1.0000    0.8000         2

    accuracy                         0.9773        44
   macro avg     0.8333    0.9881    0.8940        44
weighted avg     0.9848    0.9773    0.9794        44

F1 score for champion class (1): 0.8000

=== Training Decision Tree ===
              precision    recall  f1-score   support

           0     1.0000    0.9762    0.9880        42
           1     0.6667    1.0000    0.8000         2

    accuracy                         0.9773        44
   macro avg     0.8333    0.9881    0.8940        44
weighted avg     0.9848    0.9773    0.9794        44

F1 score for champion class (1): 0.8000

=== Training Random Forest ===
              precision    recall  f1-score   support

           0     1.0000    0.9762    0.9880        42
           1     0.6667    1.0000    0.8000         2
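
The champion-class scores in the reports above follow directly from the counts: a precision of 0.6667 with a recall of 1.0 on 2 true champions means 3 rows were flagged as champion, of which 2 were correct. As a worked check:

```python
# counts implied by the validation report for class 1 (champion)
true_positives = 2    # both real champions were flagged
false_positives = 1   # one non-champion was also flagged
false_negatives = 0   # no champion was missed

precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

Because the validation set holds only two champions, a single extra false positive is enough to move the champion F1 from 1.0 to 0.8, which is why several models can tie on this metric.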



#### 4. Final Classification Evaluation on Test Set

In [45]:
from sklearn.metrics import accuracy_score, classification_report

y_test_pred = best_model.predict(X_test)

print("=== Test Set Classification ===")
print("Accuracy:", accuracy_score(y_test, y_test_pred))
print(classification_report(y_test, y_test_pred, digits=4))

=== Test Set Classification ===
Accuracy: 1.0
              precision    recall  f1-score   support

           0     1.0000    1.0000    1.0000        42
           1     1.0000    1.0000    1.0000         2

    accuracy                         1.0000        44
   macro avg     1.0000    1.0000    1.0000        44
weighted avg     1.0000    1.0000    1.0000        44



#### 5. Ranking Evaluation & Champion Prediction

In [46]:
# add predicted probabilities for test set
test_df = test_df.copy()
test_df['pred_proba'] = best_model.predict_proba(X_test)[:, 1]

def ranking_metrics(df_season, top_k=3):
    ranked = df_season.sort_values('pred_proba', ascending=False)
    champion_idx = ranked['Champion'].values.argmax()  # index of first '1'
    hit1 = int(champion_idx == 0)
    hitk = int(champion_idx < top_k)
    rank = champion_idx + 1
    return rank, hit1, hitk, ranked

results = []
for year, grp in test_df.groupby('Year'):
    rank, hit1, hit3, ranked = ranking_metrics(grp, top_k=3)
    results.append({
        'Year': year,
        'Champion_Rank': rank,
        'Hit@1': hit1,
        'Hit@3': hit3
    })

ranking_df = pd.DataFrame(results).sort_values('Year')
print("=== Champion Rank per Test Season ===")
print(ranking_df)
print("\nOverall Hit@1:", ranking_df['Hit@1'].mean())
print("Overall Hit@3:", ranking_df['Hit@3'].mean())

=== Champion Rank per Test Season ===
   Year  Champion_Rank  Hit@1  Hit@3
0  2022              1      1      1
1  2023              1      1      1

Overall Hit@1: 1.0
Overall Hit@3: 1.0
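
Ranking by probability within a season differs from classifying each row independently. Row-wise classification can flag several drivers in one season or none at all (the validation reports show 3 flagged rows for 2 champions), whereas taking the top-ranked driver always yields exactly one predicted champion per season. A sketch with made-up probabilities and hypothetical driver labels, using a 0.5 cut-off as a stand-in for row-wise classification:

```python
# hypothetical predicted champion probabilities for two toy seasons
season_probs = {
    2020: {'Driver A': 0.91, 'Driver B': 0.62, 'Driver C': 0.05},
    2021: {'Driver A': 0.41, 'Driver B': 0.38, 'Driver C': 0.02},
}

flagged_by_year = {}
top_by_year = {}
for year, probs in season_probs.items():
    # row-wise thresholding: any number of drivers can cross the cut-off
    flagged_by_year[year] = [d for d, p in probs.items() if p >= 0.5]
    # per-season ranking: exactly one top pick
    top_by_year[year] = max(probs, key=probs.get)
    print(year, "flagged:", flagged_by_year[year], "top pick:", top_by_year[year])
```

In the first toy season two drivers cross the cut-off and in the second none do, yet the ranking view names a single champion both times.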


#### 6. Predicted vs. True Champion

In [47]:
predicted_champions = []

for year, grp in test_df.groupby('Year'):
    ranked = grp.sort_values('pred_proba', ascending=False)
    predicted_driver = ranked.iloc[0]['Driver']
    true_driver = ranked.loc[ranked['Champion'] == 1, 'Driver'].iloc[0]
    predicted_champions.append({
        'Year': year,
        'Predicted Champion': predicted_driver,
        'True Champion': true_driver,
        'Correct?': predicted_driver == true_driver
    })

predicted_df = pd.DataFrame(predicted_champions).sort_values('Year')
print("=== Predicted vs True Champions (Test Years) ===")
print(predicted_df)

=== Predicted vs True Champions (Test Years) ===
   Year Predicted Champion True Champion  Correct?
0  2022         Verstappen    Verstappen      True
1  2023         Verstappen    Verstappen      True


#### 7. 🥉 Top 3 Ranked Drivers for Each Season

In [48]:
top3_list = []

for year, grp in test_df.groupby('Year'):
    ranked = grp.sort_values('pred_proba', ascending=False).head(3)
    top3_list.append({
        'Year': year,
        '1st': ranked.iloc[0]['Driver'],
        '2nd': ranked.iloc[1]['Driver'],
        '3rd': ranked.iloc[2]['Driver']
    })

top3_df = pd.DataFrame(top3_list).sort_values('Year')
top3_df

   Year         1st      2nd       3rd
0  2022  Verstappen  Leclerc     Pérez
1  2023  Verstappen    Pérez  Hamilton
