# F1 Race Position Prediction with Deep Learning

## Project Overview

This notebook documents a complete deep learning project for predicting Formula 1 race finishing positions based on practice and qualifying performance.

**Model Type:** Deep Feedforward Neural Network (Regression)

**Performance:**
- Mean Absolute Error: 2.53 positions
- Within ±1 position: 34.30%
- Within ±2 positions: 57.97%
- Within ±3 positions: 75.85%

## 1. Setup and Imports

We start by importing the necessary libraries:
- **pandas & numpy:** For data manipulation and numerical operations.
- **matplotlib & seaborn:** For creating data visualizations.
- **torch (PyTorch):** The deep learning framework used to build and run the neural network.
- **sklearn:** Utilities for splitting data and scaling features.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Set style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print("✓ All imports successful")

## 2. Load and Explore Training Dataset

The dataset `training_dataset.csv` was created by aggregating raw F1 session data. Each row represents a single driver's participation in a specific race, containing features derived from Practice and Qualifying sessions, along with the target variable: `finishing_position`.

In [None]:
# Load training dataset
df = pd.read_csv('training_dataset.csv')

print(f"Dataset shape: {df.shape}")
print(f"\nNumber of races: {df['meeting_key'].nunique()}")
print(f"Number of drivers: {df['driver_name'].nunique()}")
print(f"Total examples: {len(df)}")

# Show first few rows
df.head()

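Below are the features available for the model. We exclude the metadata columns `meeting_key` and `driver_name`, the target `finishing_position`, and `points`, which is awarded based on the race result and would leak the target into the inputs.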

In [None]:
# Feature overview
feature_cols = [col for col in df.columns if col not in ['meeting_key', 'driver_name', 'finishing_position', 'points']]

print(f"Total features: {len(feature_cols)}")
print("\nFeatures:")
for i, col in enumerate(feature_cols, 1):
    print(f"{i:2d}. {col}")

## 3. Data Visualization

Visualizing the data helps us understand the underlying patterns and relationships before modeling. It also helps identify any data quality issues or biases.

### 3.1 Distribution of Positions

We examine the distribution of finishing positions. In a perfectly balanced dataset without DNFs (Did Not Finish), this would be uniform (equal counts for positions 1-20). The Grid Position distribution shows where drivers started.

In [None]:
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
df['finishing_position'].hist(bins=20, edgecolor='black')
plt.xlabel('Finishing Position')
plt.ylabel('Frequency')
plt.title('Distribution of Finishing Positions')

plt.subplot(1, 2, 2)
df['grid_position'].hist(bins=20, edgecolor='black', color='orange')
plt.xlabel('Grid Position')
plt.ylabel('Frequency')
plt.title('Distribution of Starting Positions')

plt.tight_layout()
plt.show()

### 3.2 Grid Position vs Finishing Position Correlation

Historically, starting position is the strongest predictor of race results. The scatter plot below visualizes this relationship. Points along the red dashed diagonal line represent races where the driver finished exactly where they started. Points above the line indicate lost positions, while points below indicate gained positions.

In [None]:
plt.figure(figsize=(10, 6))
plt.scatter(df['grid_position'], df['finishing_position'], alpha=0.5)
plt.plot([1, 20], [1, 20], 'r--', label='Finish = grid')
plt.xlabel('Grid Position (Starting)')
plt.ylabel('Finishing Position')
plt.title('Grid Position vs Finishing Position')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

correlation = df[['grid_position', 'finishing_position']].corr().iloc[0, 1]
print(f"Correlation: {correlation:.3f}")
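
The value printed above is the Pearson correlation coefficient, which is what `DataFrame.corr()` computes by default. As a minimal sketch of the formula, the snippet below evaluates it directly and checks the result against `np.corrcoef`. The two arrays are made-up values for illustration, not rows from `training_dataset.csv`.

```python
import numpy as np

# Illustrative values only; not rows from the real dataset.
grid = np.array([1, 2, 3, 4, 5, 6], dtype=float)
finish = np.array([2, 1, 3, 6, 4, 5], dtype=float)

def pearson(x, y):
    """Pearson r: covariance divided by the product of the standard deviations."""
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

r_manual = pearson(grid, finish)
r_numpy = float(np.corrcoef(grid, finish)[0, 1])
print(f"manual: {r_manual:.3f}, numpy: {r_numpy:.3f}")
```

Because finishing positions are ranks, a rank-based coefficient such as Spearman's (`df.corr(method='spearman')`) may be worth comparing as well.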

### 3.3 Feature Correlation Heatmap

This heatmap displays the correlation coefficients between key features. 
- **Red/Warm colors** indicate positive correlation (as one increases, the other increases).
- **Blue/Cool colors** indicate negative correlation.

We look for features strongly correlated with `finishing_position`.

In [None]:
# Select key features for visualization
key_features = ['practice_best_lap', 'quali_best_lap', 'grid_position', 
                'practice_avg_i1_speed', 'practice_avg_i2_speed', 'finishing_position']

plt.figure(figsize=(10, 8))
sns.heatmap(df[key_features].corr(), annot=True, fmt='.2f', cmap='coolwarm', center=0)
plt.title('Correlation Matrix of Key Features')
plt.tight_layout()
plt.show()

### 3.4 Practice vs Qualifying Lap Times

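We compare the best lap times from Practice sessions against Qualifying. The dashed line marks equal times; points below it are drivers who went faster in qualifying than in practice. A strong linear relationship suggests that practice performance is a reliable indicator of qualifying potential, which ultimately determines grid position. Note that the dataset pools races from different circuits, so part of the trend reflects differences in lap length between tracks rather than differences between drivers.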

In [None]:
plt.figure(figsize=(10, 6))
plt.scatter(df['practice_best_lap'], df['quali_best_lap'], alpha=0.5)
plt.xlabel('Practice Best Lap (seconds)')
plt.ylabel('Qualifying Best Lap (seconds)')
plt.title('Practice vs Qualifying Performance')
plt.plot([df['practice_best_lap'].min(), df['practice_best_lap'].max()],
         [df['practice_best_lap'].min(), df['practice_best_lap'].max()],
         'r--', alpha=0.5)
plt.grid(True, alpha=0.3)
plt.show()
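
Absolute lap times differ from circuit to circuit, which can inflate the pooled relationship between practice and qualifying times. One possible way to separate driver pace from track length, sketched here on a toy frame, is to express each lap as a gap to the fastest lap of the same race. The column names match the notebook, but the values and the `_gap` columns are assumptions made for illustration.

```python
import pandas as pd

# Toy frame using the notebook's column names; the values are made up.
toy = pd.DataFrame({
    "meeting_key": [1, 1, 1, 2, 2, 2],
    "practice_best_lap": [80.0, 80.4, 81.0, 95.0, 95.3, 96.2],
    "quali_best_lap": [78.5, 78.6, 79.4, 93.1, 93.9, 94.0],
})

# Subtract each race's fastest lap so circuits of different length are comparable.
for col in ["practice_best_lap", "quali_best_lap"]:
    fastest = toy.groupby("meeting_key")[col].transform("min")
    toy[col + "_gap"] = toy[col] - fastest

raw_r = toy["practice_best_lap"].corr(toy["quali_best_lap"])
gap_r = toy["practice_best_lap_gap"].corr(toy["quali_best_lap_gap"])
print(toy[["meeting_key", "practice_best_lap_gap", "quali_best_lap_gap"]])
print(f"raw r = {raw_r:.3f}, gap r = {gap_r:.3f}")
```

In this toy data the raw correlation is close to 1 mostly because the two races have very different lap lengths, while the gap-based correlation is lower and reflects only within-race differences.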

## 4. Neural Network Architecture

We employ a **Deep Feedforward Neural Network** (Multilayer Perceptron) designed for regression.

### Architecture Details:
- **Input Layer:** 22 neurons (corresponding to the 22 input features).
- **Hidden Layers:** 
  - Layer 1: 128 neurons + Batch Normalization + ReLU + Dropout (30%)
  - Layer 2: 64 neurons + Batch Normalization + ReLU + Dropout (30%)
  - Layer 3: 32 neurons + Batch Normalization + ReLU + Dropout (20%)
- **Output Layer:** 1 neuron (predicts continuous position score).

**Why this structure?** The tapering size (128 -> 64 -> 32) encourages the network to compress the inputs into progressively more compact representations. Batch Normalization stabilizes training, and Dropout reduces overfitting by randomly zeroing activations during training (it is disabled by `model.eval()`).

In [None]:
class F1DeepPredictor(nn.Module):
    def __init__(self, input_features=22):
        super(F1DeepPredictor, self).__init__()
        
        # Layer 1: Input -> 128
        self.fc1 = nn.Linear(input_features, 128)
        self.bn1 = nn.BatchNorm1d(128)
        self.dropout1 = nn.Dropout(0.3)
        
        # Layer 2: 128 -> 64
        self.fc2 = nn.Linear(128, 64)
        self.bn2 = nn.BatchNorm1d(64)
        self.dropout2 = nn.Dropout(0.3)
        
        # Layer 3: 64 -> 32
        self.fc3 = nn.Linear(64, 32)
        self.bn3 = nn.BatchNorm1d(32)
        self.dropout3 = nn.Dropout(0.2)
        
        # Output: 32 -> 1 (regression)
        self.fc4 = nn.Linear(32, 1)
    
    def forward(self, x):
        x = F.relu(self.bn1(self.fc1(x)))
        x = self.dropout1(x)
        
        x = F.relu(self.bn2(self.fc2(x)))
        x = self.dropout2(x)
        
        x = F.relu(self.bn3(self.fc3(x)))
        x = self.dropout3(x)
        
        x = self.fc4(x)
        return x

# Create model and show architecture
model = F1DeepPredictor(input_features=22)
total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total_params:,}")
print("\nModel Architecture:")
print(model)
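
The parameter count printed above can be cross-checked by hand. The sketch below assumes the PyTorch defaults used in the class (`nn.Linear` with a bias term, `nn.BatchNorm1d` with learnable scale and shift) and 22 input features. The helper functions are written for this illustration only.

```python
def linear_params(n_in, n_out):
    # Weight matrix plus one bias per output unit.
    return n_in * n_out + n_out

def batchnorm_params(n_features):
    # Learnable scale (gamma) and shift (beta); running statistics are
    # buffers and are not counted by model.parameters().
    return 2 * n_features

sizes = [22, 128, 64, 32]  # input width followed by the three hidden widths
total = 0
for n_in, n_out in zip(sizes[:-1], sizes[1:]):
    total += linear_params(n_in, n_out) + batchnorm_params(n_out)
total += linear_params(sizes[-1], 1)  # output layer, no batch norm

print(f"Expected trainable parameters: {total:,}")  # 13,761
```

Dropout layers contribute no parameters, so they do not appear in the sum.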

## 5. Load Trained Model and Visualize Performance

We load the saved model state (`f1_model.pth`), which includes the trained weights and the `StandardScaler` used during training. It's critical to use the exact same scaler for new data to ensure features are normalized correctly.

In [None]:
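# Load trained model.
# weights_only=False is needed because the checkpoint also stores a pickled
# StandardScaler; only load checkpoint files from a trusted source.
checkpoint = torch.load('f1_model.pth', weights_only=False)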
model = F1DeepPredictor(input_features=22)
model.load_state_dict(checkpoint['model_state'])
model.eval()

scaler = checkpoint['scaler']
feature_cols = checkpoint['features']

print("✓ Model loaded successfully")
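
To illustrate why the scaler saved with the model has to be reused, the sketch below compares transforming new data with a scaler fitted on training data against refitting a scaler on the new data. The numbers are synthetic stand-ins for lap times, not real session data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-ins for lap times in seconds; not real session data.
train = rng.normal(loc=90.0, scale=2.0, size=(200, 1))
new_race = np.array([[88.0], [90.0], [92.0]])

train_scaler = StandardScaler().fit(train)

# Correct: reuse the mean and standard deviation learned from the training data.
reused = train_scaler.transform(new_race)

# Incorrect: refitting on the new race discards the training reference frame.
refit = StandardScaler().fit_transform(new_race)

print("reused:", reused.ravel().round(2))
print("refit: ", refit.ravel().round(2))
```

The refitted values are always centered on zero regardless of how fast the new laps are, so the network would receive inputs on a different scale from the one it was trained on.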

### 5.1 Make Predictions on Test Data

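We run the model on a held-out 20% split of the dataset. The split is recreated here with `random_state=42`, so it matches the rows held out during training only if the training script split the same data in the same way. The model outputs a continuous value (e.g., 3.4), which we round to the nearest integer and clip to the range 1-20 to obtain the predicted finishing position.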

In [None]:
# Prepare data
X = df[feature_cols].to_numpy(dtype=float, copy=True)  # writable float copy, safe to impute in place
y = df['finishing_position'].values.astype(float)

# Handle NaN
nan_mask = np.isnan(X)
if nan_mask.any():
    col_means = np.nanmean(X, axis=0)
    for i in range(X.shape[1]):
        X[nan_mask[:, i], i] = col_means[i]

# Scale
X_scaled = scaler.transform(X)
X_scaled = np.clip(X_scaled, -5, 5)

# Split
_, X_test, _, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# Predict
with torch.no_grad():
    X_test_tensor = torch.FloatTensor(X_test)
    predictions_raw = model(X_test_tensor).squeeze(1).numpy()
    predictions = np.clip(np.round(predictions_raw), 1, 20)

print(f"Test set size: {len(X_test)} examples")
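
The rounding and clipping step can be seen in isolation on a few hypothetical raw outputs. The values below are invented to show the edge cases and do not come from the model.

```python
import numpy as np

# Hypothetical raw network outputs for five drivers.
raw = np.array([0.3, 2.5, 3.4, 3.49, 21.7])

rounded = np.round(raw)            # half values round to the nearest even integer
clipped = np.clip(rounded, 1, 20)  # force results into the valid 1-20 range

print(rounded)  # [ 0.  2.  3.  3. 22.]
print(clipped)  # [ 1.  2.  3.  3. 20.]
```

Two drivers end up with the same predicted position here, so rounding on its own does not produce a valid finishing order. Note also that `np.round` rounds exact halves to the nearest even integer, so 2.5 becomes 2 rather than 3.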

### 5.2 Prediction vs Actual Scatter Plot

This plot compares the model's predictions (Y-axis) against the actual race results (X-axis). 
- **Ideal Scenario:** All points lie on the red dashed diagonal line.
- **Interpretation:** The tightness of the cluster around the line indicates the model's precision. Outliers represent races where the result was significantly different from what performance metrics suggested (e.g., due to accidents or mechanical failures).

In [None]:
plt.figure(figsize=(10, 6))
plt.scatter(y_test, predictions, alpha=0.5)
plt.plot([1, 20], [1, 20], 'r--', label='Perfect prediction')
plt.xlabel('Actual Position')
plt.ylabel('Predicted Position')
plt.title('Model Predictions vs Actual Results')
plt.legend()
plt.grid(True, alpha=0.3)
plt.xlim(0, 21)
plt.ylim(0, 21)
plt.show()

### 5.3 Error Distribution

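We analyze the errors (Predicted - Actual). A positive error means the model predicted a worse (numerically higher) finishing position than the driver achieved.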
- **Left Plot (Error):** Shows the direction of errors. A peak at 0 is desired.
- **Right Plot (Absolute Error):** Shows the magnitude of errors. We want the majority of the mass to be near 0.

In [None]:
errors = predictions - y_test

plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.hist(errors, bins=30, edgecolor='black')
plt.xlabel('Prediction Error (positions)')
plt.ylabel('Frequency')
plt.title('Distribution of Prediction Errors')
plt.axvline(x=0, color='r', linestyle='--', label='Perfect prediction')
plt.legend()

plt.subplot(1, 2, 2)
abs_errors = np.abs(errors)
plt.hist(abs_errors, bins=20, edgecolor='black', color='orange')
plt.xlabel('Absolute Error (positions)')
plt.ylabel('Frequency')
plt.title('Distribution of Absolute Errors')
plt.axvline(x=abs_errors.mean(), color='r', linestyle='--', label=f'Mean: {abs_errors.mean():.2f}')
plt.legend()

plt.tight_layout()
plt.show()

### 5.4 Performance Metrics

We quantify performance using standard regression metrics:
- **MAE (Mean Absolute Error):** The average number of positions the model is off by.
- **Within ±N:** The percentage of predictions that are within N positions of the actual result. This is often easier to interpret than MAE, because it states directly how often a prediction is close to the real outcome.

In [None]:
# Calculate metrics
exact_accuracy = (predictions == y_test).sum() / len(y_test) * 100
mae = np.abs(errors).mean()
within_1 = (np.abs(errors) <= 1).sum() / len(errors) * 100
within_2 = (np.abs(errors) <= 2).sum() / len(errors) * 100
within_3 = (np.abs(errors) <= 3).sum() / len(errors) * 100

print("MODEL PERFORMANCE METRICS")
print("=" * 50)
print(f"Exact Accuracy:          {exact_accuracy:.2f}%")
print(f"Mean Absolute Error:     {mae:.2f} positions")
print(f"Within ±1 position:      {within_1:.2f}%")
print(f"Within ±2 positions:     {within_2:.2f}%")
print(f"Within ±3 positions:     {within_3:.2f}%")
print("=" * 50)
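
These numbers are easier to judge next to a naive baseline. One natural candidate, sketched below, predicts that every driver finishes where they started. The function `position_metrics` and the sample arrays are assumptions for illustration; in the notebook the inputs would be the grid positions and actual results of the test rows, which requires carrying the row indices through the train/test split.

```python
import numpy as np

def position_metrics(pred, actual):
    """Return MAE and the share of predictions within 1, 2 and 3 positions."""
    abs_err = np.abs(np.asarray(pred, dtype=float) - np.asarray(actual, dtype=float))
    out = {"mae": float(abs_err.mean())}
    for n in (1, 2, 3):
        out[f"within_{n}"] = float((abs_err <= n).mean() * 100)
    return out

# Made-up values for illustration only.
grid = np.array([1, 2, 3, 4, 5, 6, 7, 8])
actual = np.array([1, 4, 2, 3, 9, 6, 5, 8])

baseline = position_metrics(grid, actual)
print(baseline)
```

If the network does not clearly beat this baseline on the same test rows, the additional features are contributing little beyond grid position.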

### 5.5 Accuracy by Position Range

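Races are dynamic, and prediction difficulty may vary across the field. A common expectation is that the front-runners (Top 5) and the backmarkers are easier to predict than the highly competitive midfield. This chart breaks down the model's accuracy (within ±2 positions) across different segments of the grid, grouped by actual finishing position, so that expectation can be checked against the results.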

In [None]:
# Group by position ranges
position_ranges = [(1, 5, 'Top 5'), (6, 10, 'Mid 6-10'), (11, 15, 'Mid 11-15'), (16, 20, 'Bottom 16-20')]

accuracies = []
labels = []

for start, end, label in position_ranges:
    mask = (y_test >= start) & (y_test <= end)
    if mask.sum() > 0:
        within_2_pct = (np.abs(errors[mask]) <= 2).sum() / mask.sum() * 100
        accuracies.append(within_2_pct)
        labels.append(label)

plt.figure(figsize=(10, 6))
# 'bronze' is not a named matplotlib color, so a hex value is used instead
plt.bar(labels, accuracies, color=['gold', 'silver', '#cd7f32', 'gray'][:len(labels)], edgecolor='black')
plt.xlabel('Position Range')
plt.ylabel('Accuracy within ±2 positions (%)')
plt.title('Model Accuracy by Position Range')
plt.ylim(0, 100)
for i, v in enumerate(accuracies):
    plt.text(i, v + 2, f'{v:.1f}%', ha='center', fontweight='bold')
plt.show()

## 6. Feature Importance Analysis

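Understanding *why* the model makes a prediction is crucial. As a simple, model-agnostic proxy for feature importance, we calculate the correlation between each input feature and the target variable. Features with higher absolute correlation have a stronger linear relationship with finishing position; this does not measure how much the trained network actually relies on them, and it misses non-linear effects and feature interactions.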

In [None]:
# Calculate correlation with finishing position
correlations = []
for col in feature_cols:
    corr = df[[col, 'finishing_position']].corr().iloc[0, 1]
    correlations.append((col, abs(corr)))

correlations.sort(key=lambda x: x[1], reverse=True)

# Plot top 10 features
top_features = correlations[:10]
feature_names = [f[0] for f in top_features]
feature_corrs = [f[1] for f in top_features]

plt.figure(figsize=(12, 6))
plt.barh(feature_names, feature_corrs, color='steelblue', edgecolor='black')
plt.xlabel('Absolute Correlation with Finishing Position')
plt.title('Top 10 Features by Absolute Correlation')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

print("\nTop 10 Features:")
for i, (name, corr) in enumerate(top_features, 1):
    print(f"{i:2d}. {name:<35} {corr:.3f}")
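
Correlation with the target describes the data, not the trained network. A different technique that probes the model itself is permutation importance: shuffle one feature column at a time and measure how much the error grows. The sketch below is not part of the original project; `permutation_importance` and `toy_predict` are hypothetical names, and a simple linear function stands in for the network so the example runs on its own.

```python
import numpy as np

def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean increase in MAE when each column of X is shuffled in turn."""
    rng = np.random.default_rng(seed)
    base_mae = np.abs(predict(X) - y).mean()
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        increases = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            increases.append(np.abs(predict(X_perm) - y).mean() - base_mae)
        importances[j] = np.mean(increases)
    return importances

# Stand-in "model": depends strongly on column 0, weakly on column 1,
# and not at all on column 2.
def toy_predict(X):
    return 3.0 * X[:, 0] + 0.5 * X[:, 1]

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(500, 3))
y_demo = toy_predict(X_demo)

scores = permutation_importance(toy_predict, X_demo, y_demo)
print(scores.round(3))
```

To apply this to the notebook's network, `predict` could wrap the model call under `torch.no_grad()` and return a NumPy array, with `X` being the scaled test features and `y` the actual finishing positions.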

## 7. Real Race Prediction Example

The ultimate test is predicting a new race. The script `test_sao_paulo_gp.py` implements the full pipeline: fetching live data from the OpenF1 API, processing it into features, and generating predictions. Below is an example of the output format.

In [None]:
# This would fetch from OpenF1 API - simplified here
print("To make predictions for a new race:")
print("\n1. Use test_sao_paulo_gp.py script")
print("2. Or provide CSV with 22 required features")
print("3. Model will rank drivers and assign unique positions 1-20")
print("\nExample output:")
print("-" * 50)
print("Driver                   Grid    Predicted")
print("-" * 50)
print("#12 ANT (Antonelli)     P2      P1")
print("#81 PIA (Piastri)       P5      P2")
print("#87 BEA (Bearman)       P3      P3")
print("...")
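
The printed steps mention ranking drivers into unique positions, but the ranking logic itself lives in `test_sao_paulo_gp.py` and is not shown here. One possible way to do it, sketched under that caveat, is to sort the raw regression outputs within a race and number them from 1. The function name `scores_to_positions` and the sample scores are hypothetical.

```python
import numpy as np

def scores_to_positions(raw_scores):
    """Rank raw scores so the lowest score gets P1 and every position is unique."""
    raw_scores = np.asarray(raw_scores, dtype=float)
    order = np.argsort(raw_scores, kind="stable")  # driver indices, best first
    positions = np.empty(len(raw_scores), dtype=int)
    positions[order] = np.arange(1, len(raw_scores) + 1)
    return positions

# Hypothetical raw outputs for four drivers in one race.
raw = [3.4, 1.2, 3.1, 7.8]
print(scores_to_positions(raw))  # [3 1 2 4]
```

Unlike rounding, this always yields a valid finishing order. With the stable sort, drivers with identical scores are ordered by their position in the input.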

## 8. Conclusions

### Key Findings:

1. **Grid position is the strongest predictor** - Correlation ~0.7 with finishing position
2. **Qualifying performance matters more than practice** - quali_best_lap has higher correlation
3. **Model performs best for mid-field positions** - Top and bottom positions have more variability
4. **75.85% accuracy within ±3 positions** - Good for predicting general race outcome

### Model Architecture:
- **Type:** Deep Feedforward Neural Network (Regression)
- **Layers:** 4 (3 hidden + 1 output)
- **Parameters:** 13,761 trainable parameters (with 22 input features)
- **Regularization:** Batch Normalization + Dropout

### Future Improvements:
1. Add weather data (rain significantly affects results)
2. Include driver/team historical performance
3. Add track-specific features
4. Ensemble multiple models
5. More training data (2018-2022 seasons)

---

## Project Files

- `f1_model.pth` - Trained model
- `f1_deep_neural_network.py` - Model architecture
- `02_feature_engineering.py` - Feature engineering pipeline
- `03_train_model.py` - Training script
- `test_sao_paulo_gp.py` - Prediction script
- `training_dataset.csv` - Processed training data

**End of Documentation**