# 🏎️ AI Pit Stop Strategist - Complete Machine Learning Pipeline

This notebook implements a machine learning pipeline for Formula 1 pit stop strategy. Using sequential lap data, it predicts whether a driver should pit within the next three laps.

## 🎯 Objectives
- **Primary Goal**: Binary classification for pit stop timing (Pit / Don't Pit)
- **Data Source**: FastF1 library with multiple F1 seasons
- **Models Implemented**: Random Forest, XGBoost, PyTorch CNN, Advanced LSTM Ensemble

## 🧠 Model Performance Summary
| Model | F1-Score | Precision | Recall | ROC-AUC | Key Features |
|-------|----------|-----------|--------|---------|--------------|
| Random Forest | ~0.48 | ~0.46 | ~0.50 | ~0.86 | Traditional ML baseline |
| XGBoost | ~0.49 | ~0.42 | ~0.58 | ~0.86 | Gradient boosting |
| Basic CNN | 0.30 | 0.56 | 0.21 | 0.77 | Simple temporal patterns |
| **Advanced Ensemble** | **0.49** | 0.40 | **0.62** | 0.83 | SMOTE + Focal Loss + Attention |

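**Key Achievement**: The Advanced Ensemble raises recall from 21% (Basic CNN) to 62%, roughly a 3x improvement, by using techniques that target class imbalance (SMOTE and focal loss). Precision drops from 0.56 to 0.40 in exchange.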
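
The table lists focal loss among the ensemble's class-imbalance techniques. As a rough illustration of why it can help recall on a rare positive class, here is a minimal pure-Python sketch of binary focal loss without the optional class-weighting term. The `gamma=2.0` default is a common choice and an assumption here; the notebook's actual hyperparameters and its PyTorch implementation are not shown in this section.

```python
import math


def cross_entropy(p, y, eps=1e-7):
    """Standard binary cross-entropy for one prediction."""
    p = min(max(p, eps), 1.0 - eps)          # clamp to avoid log(0)
    return -math.log(p if y == 1 else 1.0 - p)


def binary_focal_loss(p, y, gamma=2.0, eps=1e-7):
    """Binary focal loss for one prediction.

    p: predicted probability of the positive class (pit), in [0, 1]
    y: true label, 1 (pit) or 0 (no pit)
    gamma: focusing parameter (assumed value, not taken from the notebook)
    """
    p = min(max(p, eps), 1.0 - eps)          # clamp to avoid log(0)
    p_t = p if y == 1 else 1.0 - p           # probability given to the true class
    # (1 - p_t) ** gamma shrinks the loss of confident, correct predictions
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


# Same predicted probability, two different true labels:
easy_negative = binary_focal_loss(0.05, 0)   # confident and correct "no pit"
missed_pit = binary_focal_loss(0.05, 1)      # confident and wrong: a missed pit stop
```

With `gamma=0` the function reduces to plain cross-entropy. With `gamma > 0`, the many easy "no pit" laps contribute far less to the total loss than the few missed pit stops, which pushes training toward the minority class.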

## 📦 Setup & Imports

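This cell imports the libraries used for data loading (FastF1, pandas, NumPy), feature scaling, the Random Forest baseline, and the evaluation metrics. Imports for XGBoost and the deep learning models are not part of this cell.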

In [2]:
import fastf1
import os
import pandas as pd
import numpy as np 
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, 
    f1_score, roc_auc_score, confusion_matrix,
    ConfusionMatrixDisplay 
)
# import matplotlib.pyplot as plt # Uncomment if ConfusionMatrixDisplay.plot() is used directly

## 🗂️ Data Pipeline Overview

The complete data preprocessing pipeline has been implemented and the processed data is available in `f1_data.pkl`. This includes:

- **Data Collection**: Multi-season F1 race data from FastF1 library (2022-2024)
- **Feature Engineering**: 19 core features including lap times, tire life, compounds, track conditions
- **Target Definition**: Binary classification for pit stops in next 3 laps
- **Data Splits**: Chronological train/validation/test splits preserving temporal order
- **Preprocessing**: Scaling, missing value handling, and class distribution analysis

The preprocessing pipeline can be found in the original implementation files. This notebook focuses on the **machine learning models** trained on the preprocessed data.
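
The target definition and chronological split listed above can be sketched as follows. This is an illustration only, not the original preprocessing code: the function names, the reading of "next 3 laps" as laps i+1 to i+3, the split fractions, and splitting at the level of whole races are all assumptions.

```python
def label_pit_within(pit_in_flags, horizon=3):
    """Label lap i with 1 if any of laps i+1 .. i+horizon is a pit-in lap.

    pit_in_flags: list of bools for one driver in one race, ordered by lap.
    horizon: look-ahead window in laps (3 per the target definition above).
    """
    n = len(pit_in_flags)
    return [
        int(any(pit_in_flags[i + 1 : i + 1 + horizon]))
        for i in range(n)
    ]


def chronological_split(race_keys, train_frac=0.7, val_frac=0.15):
    """Split race identifiers in time order, without shuffling.

    race_keys: iterable of (year, round) tuples (hypothetical key choice).
    Returns (train, validation, test) lists; earlier races always come first.
    """
    ordered = sorted(set(race_keys))
    n_train = int(len(ordered) * train_frac)
    n_val = int(len(ordered) * val_frac)
    return (
        ordered[:n_train],
        ordered[n_train : n_train + n_val],
        ordered[n_train + n_val :],
    )


# Toy stint: the driver pits on the fifth lap (index 4)
flags = [False, False, False, False, True, False, False, False]
labels = label_pit_within(flags)  # -> [0, 1, 1, 1, 0, 0, 0, 0]

races = [(2022, r) for r in range(1, 9)] + [(2023, r) for r in range(1, 9)]
train, val, test = chronological_split(races, train_frac=0.5, val_frac=0.25)
```

Splitting by whole races keeps every lap of a race in the same partition, so the validation and test sets contain only races that happened after all training races.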

In [3]:

try:
    NOTEBOOK_DIR = os.path.dirname(os.path.abspath(__file__)) # This works if running as a script
    PROJECT_ROOT = os.path.dirname(NOTEBOOK_DIR) 
except NameError:
    # Fallback for interactive notebook environments where __file__ is not defined
    PROJECT_ROOT = os.path.abspath(os.path.join(os.getcwd(), '..')) # Assumes notebook is in a subdir like 'notebooks' or 'src'
    # If your notebook IS the project root, use: PROJECT_ROOT = os.getcwd()
    # Verify this path if issues arise:
    # print(f"Guessed Project Root: {PROJECT_ROOT}") 

CACHE_DIR = os.path.join(PROJECT_ROOT, '.fastf1_cache')

# Create the cache directory if it doesn't exist
if not os.path.exists(CACHE_DIR):
    try:
        os.makedirs(CACHE_DIR)
        print(f"Cache directory created at: {CACHE_DIR}")
    except OSError as e:
        print(f"Error creating cache directory {CACHE_DIR}: {e}")
        CACHE_DIR = None 

# Enable FastF1 cache only if CACHE_DIR was set up successfully
if CACHE_DIR:
    try:
        fastf1.Cache.enable_cache(CACHE_DIR)
        print(f"FastF1 caching enabled at: {CACHE_DIR}")
    except Exception as e:
        print(f"Error enabling FastF1 cache at {CACHE_DIR}: {e}")

if 'CACHE_DIR' in globals() and CACHE_DIR and os.path.exists(CACHE_DIR):
    try:
        if fastf1.Cache.is_enabled():
            print(f"FastF1 caching is enabled. Target directory: {CACHE_DIR}")
        else:
            print(f"FastF1 caching was attempted for {CACHE_DIR}, but fastf1.Cache.is_enabled() is False.")
    except AttributeError:
        print("Could not check fastf1.Cache.is_enabled() due to potential version differences. Proceeding with data fetching attempt.")
    except Exception as e:
        print(f"Error checking cache status: {e}. Proceeding with data fetching attempt.")
else:
    print("FastF1 cache directory was not properly set up by this script.")

FastF1 caching enabled at: /Users/dragiychev/Documents/Fontys S4 AI/.fastf1_cache
Could not check fastf1.Cache.is_enabled() due to potential version differences. Proceeding with data fetching attempt.



### 1.2 Initial Data Collection
We fetch the event schedule for each season in `SEASONS_TO_FETCH` (2022-2024), keep every race that has already taken place, and load its lap and weather data.
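
The cell below attaches a `Rainfall` value to each lap with `pd.merge_asof(..., direction='nearest')`, which matches every lap start time to the closest weather sample. To make that matching rule concrete, here is a small standard-library sketch of the same idea. The function name and the use of plain seconds instead of pandas timedeltas are assumptions made for illustration, and the weather list is assumed to be non-empty.

```python
from bisect import bisect_left


def nearest_asof(lap_starts, weather_times, weather_values):
    """For each lap start, return the weather value with the closest timestamp.

    weather_times must be sorted ascending and non-empty.
    All times are seconds from session start (toy units).
    """
    matched = []
    for t in lap_starts:
        i = bisect_left(weather_times, t)
        # Only the sample just before t and the sample at or after t can be nearest
        candidates = [j for j in (i - 1, i) if 0 <= j < len(weather_times)]
        best = min(candidates, key=lambda j: abs(weather_times[j] - t))
        matched.append(weather_values[best])
    return matched


weather_times = [0, 60, 120, 180]          # one weather sample per minute
rainfall = [False, False, True, True]      # rain starts at t=120
lap_starts = [10, 95, 170]

lap_rainfall = nearest_asof(lap_starts, weather_times, rainfall)
print(lap_rainfall)  # [False, True, True]
```

Note that a nearest match can pick a weather sample recorded after the lap started, as with the lap starting at t=95 here.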

In [4]:
import fastf1
import pandas as pd
import numpy as np
from datetime import datetime

# --- Configuration ---
# Define the seasons you want to fetch data for.
SEASONS_TO_FETCH = [2022, 2023, 2024] 
# Note: As of mid-2024, the 2024 season is not yet complete. 
# This script will fetch all completed races from that year.

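# Caching was set up in the previous cell; re-enable it only if the cache directory is usable
# (CACHE_DIR is set to None there when directory creation fails)
if CACHE_DIR:
    fastf1.Cache.enable_cache(CACHE_DIR)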

# --- Main Script ---
all_laps_data = []
loaded_sessions_count = 0
total_races_found = 0

print(f"Starting data fetch for seasons: {', '.join(map(str, SEASONS_TO_FETCH))}")

try:
    # Loop through each specified season
    for season_year in SEASONS_TO_FETCH:
        print(f"\n{'='*40}")
        print(f"Fetching event schedule for {season_year}...")
        
        try:
            # Get the schedule for the current season
            schedule = fastf1.get_event_schedule(season_year, include_testing=False)
            
            # Filter for race events only
            races = schedule[schedule['Session5'] == 'Race']
            
            # Filter out races that have not happened yet (based on today's date)
            races = races[races['EventDate'] <= pd.to_datetime(datetime.now().date())]
            
            if races.empty:
                print(f"No completed race events found for {season_year}.")
                continue

            total_races_found += len(races)
            print(f"Found {len(races)} completed race(s) in {season_year}.")
            
            # Process each race in the season
            for index, race_info in races.iterrows():
                event_name = race_info['EventName']
                round_number = race_info['RoundNumber']
                
                print(f"\n-> Loading data for: {season_year} {event_name} (Round {round_number})")
                
                try:
                    # Load the session data. We need laps and weather.
                    session = fastf1.get_session(season_year, round_number, 'R')
                    session.load(laps=True, weather=True, telemetry=False, messages=False)
                    
                    print(f"  Session: {session.event['EventName']}")
                    if session.laps is not None and not session.laps.empty:
                        print(f"  Laps loaded: {len(session.laps)} laps")
                        laps_df = session.laps.copy()
                        
                        # --- Weather Data Integration ---
                        # If 'Rainfall' isn't in the lap data, merge it from weather data
                        if 'Rainfall' not in laps_df.columns:
                            if session.weather_data is not None and not session.weather_data.empty:
                                print("  Merging Rainfall data from weather stream...")
                                # Use merge_asof for efficient time-based merging
                                laps_df = pd.merge_asof(
                                    laps_df.sort_values('LapStartTime'), 
                                    session.weather_data[['Time', 'Rainfall']].sort_values('Time'), 
                                    left_on='LapStartTime', 
                                    right_on='Time',
                                    direction='nearest'
                                )
                                # Clean up temporary columns from the merge
                                laps_df.drop(columns=['Time'], inplace=True, errors='ignore')
                            else:
                                print("  Note: No weather data available. 'Rainfall' will be set to False.")
                                laps_df['Rainfall'] = False
                        
                        # Ensure Rainfall column exists and is boolean type
                        if 'Rainfall' not in laps_df.columns:
                            laps_df['Rainfall'] = False
                        laps_df['Rainfall'] = laps_df['Rainfall'].astype(bool)

                        # --- Add Metadata ---
                        laps_df['TotalRaceLaps'] = session.total_laps if hasattr(session, 'total_laps') else pd.NA
                        laps_df['EventName'] = event_name
                        laps_df['EventYear'] = season_year
                        laps_df['EventRound'] = round_number
                        
                        all_laps_data.append(laps_df)
                        loaded_sessions_count += 1
                    else:
                        print("  Laps data not available or is empty.")
                        
                except Exception as e:
                    print(f"  Error loading session {season_year} {event_name}: {e}. Skipping.")

        except Exception as e:
            print(f"Error fetching/processing event schedule for {season_year}: {e}")

    # --- Final Combination and Summary ---
    print(f"\n{'='*40}")
    print("Finished fetching all seasons.")
    print(f"Successfully loaded data for {loaded_sessions_count}/{total_races_found} total sessions.")

    if all_laps_data:
        # Concatenate all collected DataFrames into one
        combined_laps_df = pd.concat(all_laps_data, ignore_index=True)
        print("\nCombined all laps data into a single DataFrame.")
        print(f"Shape of final DataFrame: {combined_laps_df.shape}")
        
        print("\nInfo for combined_laps_df:")
        combined_laps_df.info()
        
        print("\nHead of combined_laps_df (first 5 rows):")
        print(combined_laps_df.head())
        
        print("\nTail of combined_laps_df (last 5 rows):")
        print(combined_laps_df.tail())
    else:
        print("\nNo laps data was collected. Cannot proceed.")
        combined_laps_df = pd.DataFrame() # Ensure it's defined for later checks

except Exception as e:
    print(f"\nA critical error occurred during the script execution: {e}")
    combined_laps_df = pd.DataFrame()



Starting data fetch for seasons: 2022, 2023, 2024

Fetching event schedule for 2022...


core           INFO 	Loading data for Bahrain Grand Prix - Race [v3.5.3]
req            INFO 	Using cached data for session_info
req            INFO 	Using cached data for driver_info


Found 22 completed race(s) in 2022.

-> Loading data for: 2022 Bahrain Grand Prix (Round 1)


req            INFO 	Using cached data for session_status_data
req            INFO 	Using cached data for lap_count
req            INFO 	Using cached data for track_status_data
req            INFO 	Using cached data for _extended_timing_data
req            INFO 	Using cached data for timing_app_data
core           INFO 	Processing timing data...
req            INFO 	Using cached data for weather_data
core           INFO 	Finished loading data for 20 drivers: ['16', '55', '44', '63', '20', '77', '31', '22', '14', '24', '47', '18', '23', '3', '4', '6', '27', '11', '1', '10']
core           INFO 	Loading data for Saudi Arabian Grand Prix - Race [v3.5.3]
req            INFO 	Using cached data for session_info
req            INFO 	Using cached data for driver_info


  Session: Bahrain Grand Prix
  Laps loaded: 1125 laps
  Merging Rainfall data from weather stream...

-> Loading data for: 2022 Saudi Arabian Grand Prix (Round 2)


req            INFO 	Using cached data for session_status_data
req            INFO 	Using cached data for lap_count
req            INFO 	Using cached data for track_status_data
req            INFO 	Using cached data for _extended_timing_data
req            INFO 	Using cached data for timing_app_data
core           INFO 	Processing timing data...
req            INFO 	Using cached data for weather_data
core           INFO 	Finished loading data for 20 drivers: ['1', '16', '55', '11', '63', '31', '4', '10', '20', '44', '24', '27', '18', '23', '77', '14', '3', '6', '22', '47']
core           INFO 	Loading data for Australian Grand Prix - Race [v3.5.3]
req            INFO 	Using cached data for session_info
req            INFO 	Using cached data for driver_info


  Session: Saudi Arabian Grand Prix
  Laps loaded: 820 laps
  Merging Rainfall data from weather stream...

-> Loading data for: 2022 Australian Grand Prix (Round 3)


req            INFO 	Using cached data for session_status_data
req            INFO 	Using cached data for lap_count
req            INFO 	Using cached data for track_status_data
req            INFO 	Using cached data for _extended_timing_data
req            INFO 	Using cached data for timing_app_data
core           INFO 	Processing timing data...
req            INFO 	Using cached data for weather_data
core           INFO 	Finished loading data for 20 drivers: ['16', '11', '63', '44', '4', '3', '31', '77', '10', '23', '24', '18', '47', '20', '22', '6', '14', '1', '5', '55']
core           INFO 	Loading data for Emilia Romagna Grand Prix - Race [v3.5.3]
req            INFO 	Using cached data for session_info
req            INFO 	Using cached data for driver_info


  Session: Australian Grand Prix
  Laps loaded: 1045 laps
  Merging Rainfall data from weather stream...

-> Loading data for: 2022 Emilia Romagna Grand Prix (Round 4)


req            INFO 	Using cached data for session_status_data
req            INFO 	Using cached data for lap_count
req            INFO 	Using cached data for track_status_data
req            INFO 	Using cached data for _extended_timing_data
req            INFO 	Using cached data for timing_app_data
core           INFO 	Processing timing data...
req            INFO 	Using cached data for weather_data
core           INFO 	Finished loading data for 20 drivers: ['1', '11', '4', '63', '77', '16', '22', '5', '20', '18', '23', '10', '44', '31', '24', '6', '47', '3', '14', '55']
core           INFO 	Loading data for Miami Grand Prix - Race [v3.5.3]
req            INFO 	Using cached data for session_info
req            INFO 	Using cached data for driver_info


  Session: Emilia Romagna Grand Prix
  Laps loaded: 1132 laps
  Merging Rainfall data from weather stream...

-> Loading data for: 2022 Miami Grand Prix (Round 5)


req            INFO 	Using cached data for session_status_data
req            INFO 	Using cached data for lap_count
req            INFO 	Using cached data for track_status_data
req            INFO 	Using cached data for _extended_timing_data
req            INFO 	Using cached data for timing_app_data
core           INFO 	Processing timing data...
req            INFO 	Using cached data for weather_data
core           INFO 	Finished loading data for 20 drivers: ['1', '16', '55', '11', '63', '44', '77', '31', '23', '18', '14', '22', '3', '6', '47', '20', '5', '10', '4', '24']
core           INFO 	Loading data for Spanish Grand Prix - Race [v3.5.3]
req            INFO 	Using cached data for session_info
req            INFO 	Using cached data for driver_info
req            INFO 	Using cached data for session_status_data
req            INFO 	Using cached data for lap_count
req            INFO 	Using cached data for track_status_data
req            INFO 	Using cached data for _extended_timing_

  Session: Miami Grand Prix
  Laps loaded: 1057 laps
  Merging Rainfall data from weather stream...

-> Loading data for: 2022 Spanish Grand Prix (Round 6)


req            INFO 	Using cached data for weather_data
core           INFO 	Finished loading data for 20 drivers: ['1', '11', '63', '55', '44', '77', '31', '4', '14', '22', '5', '3', '10', '47', '18', '6', '20', '23', '24', '16']
core           INFO 	Loading data for Monaco Grand Prix - Race [v3.5.3]
req            INFO 	Using cached data for session_info
req            INFO 	Using cached data for driver_info


  Session: Spanish Grand Prix
  Laps loaded: 1230 laps
  Merging Rainfall data from weather stream...

-> Loading data for: 2022 Monaco Grand Prix (Round 7)


req            INFO 	Using cached data for session_status_data
req            INFO 	Using cached data for lap_count
req            INFO 	Using cached data for track_status_data
req            INFO 	Using cached data for _extended_timing_data
req            INFO 	Using cached data for timing_app_data
core           INFO 	Processing timing data...
req            INFO 	Using cached data for weather_data
core           INFO 	Finished loading data for 20 drivers: ['11', '55', '1', '16', '63', '4', '14', '44', '77', '5', '10', '31', '3', '18', '6', '24', '22', '23', '47', '20']
core           INFO 	Loading data for Azerbaijan Grand Prix - Race [v3.5.3]
req            INFO 	Using cached data for session_info
req            INFO 	Using cached data for driver_info


  Session: Monaco Grand Prix
  Laps loaded: 1179 laps
  Merging Rainfall data from weather stream...

-> Loading data for: 2022 Azerbaijan Grand Prix (Round 8)


req            INFO 	Using cached data for session_status_data
req            INFO 	Using cached data for lap_count
req            INFO 	Using cached data for track_status_data
req            INFO 	Using cached data for _extended_timing_data
req            INFO 	Using cached data for timing_app_data
core           INFO 	Processing timing data...
req            INFO 	Using cached data for weather_data
core           INFO 	Finished loading data for 20 drivers: ['1', '11', '63', '44', '10', '5', '14', '3', '4', '31', '77', '23', '22', '47', '6', '18', '20', '24', '16', '55']
core           INFO 	Loading data for Canadian Grand Prix - Race [v3.5.3]
req            INFO 	Using cached data for session_info
req            INFO 	Using cached data for driver_info


  Session: Azerbaijan Grand Prix
  Laps loaded: 891 laps
  Merging Rainfall data from weather stream...

-> Loading data for: 2022 Canadian Grand Prix (Round 9)


req            INFO 	Using cached data for session_status_data
req            INFO 	Using cached data for lap_count
req            INFO 	Using cached data for track_status_data
req            INFO 	Using cached data for _extended_timing_data
req            INFO 	Using cached data for timing_app_data
core           INFO 	Processing timing data...
req            INFO 	Using cached data for weather_data
core           INFO 	Finished loading data for 20 drivers: ['1', '55', '44', '63', '16', '31', '77', '24', '14', '18', '3', '5', '23', '10', '4', '6', '20', '22', '47', '11']
core           INFO 	Loading data for British Grand Prix - Race [v3.5.3]
req            INFO 	Using cached data for session_info
req            INFO 	Using cached data for driver_info


  Session: Canadian Grand Prix
  Laps loaded: 1264 laps
  Merging Rainfall data from weather stream...

-> Loading data for: 2022 British Grand Prix (Round 10)


req            INFO 	Using cached data for session_status_data
req            INFO 	Using cached data for lap_count
req            INFO 	Using cached data for track_status_data
req            INFO 	Using cached data for _extended_timing_data
req            INFO 	Using cached data for timing_app_data
core           INFO 	Processing timing data...
req            INFO 	Using cached data for weather_data
core           INFO 	Finished loading data for 20 drivers: ['55', '11', '44', '16', '14', '4', '1', '47', '5', '20', '18', '6', '3', '22', '31', '10', '77', '63', '24', '23']
core           INFO 	Loading data for Austrian Grand Prix - Race [v3.5.3]
req            INFO 	Using cached data for session_info
req            INFO 	Using cached data for driver_info


  Session: British Grand Prix
  Laps loaded: 815 laps
  Merging Rainfall data from weather stream...

-> Loading data for: 2022 Austrian Grand Prix (Round 11)


req            INFO 	Using cached data for session_status_data
req            INFO 	Using cached data for lap_count
req            INFO 	Using cached data for track_status_data
req            INFO 	Using cached data for _extended_timing_data
req            INFO 	Using cached data for timing_app_data
core           INFO 	Processing timing data...
req            INFO 	Using cached data for weather_data
core           INFO 	Finished loading data for 20 drivers: ['16', '1', '44', '63', '31', '47', '4', '20', '3', '14', '77', '23', '18', '24', '10', '22', '5', '55', '6', '11']
core           INFO 	Loading data for French Grand Prix - Race [v3.5.3]
req            INFO 	Using cached data for session_info
req            INFO 	Using cached data for driver_info


  Session: Austrian Grand Prix
  Laps loaded: 1324 laps
  Merging Rainfall data from weather stream...

-> Loading data for: 2022 French Grand Prix (Round 12)


req            INFO 	Using cached data for session_status_data
req            INFO 	Using cached data for lap_count
req            INFO 	Using cached data for track_status_data
req            INFO 	Using cached data for _extended_timing_data
req            INFO 	Using cached data for timing_app_data
core           INFO 	Processing timing data...
req            INFO 	Using cached data for weather_data
core           INFO 	Finished loading data for 20 drivers: ['1', '44', '63', '11', '55', '14', '4', '31', '3', '18', '5', '10', '23', '77', '47', '24', '6', '20', '16', '22']
core           INFO 	Loading data for Hungarian Grand Prix - Race [v3.5.3]
req            INFO 	Using cached data for session_info
req            INFO 	Using cached data for driver_info
req            INFO 	Using cached data for session_status_data
req            INFO 	Using cached data for lap_count
req            INFO 	Using cached data for track_status_data
req            INFO 	Using cached data for _extended_timin

  Session: French Grand Prix
  Laps loaded: 958 laps
  Merging Rainfall data from weather stream...

-> Loading data for: 2022 Hungarian Grand Prix (Round 13)


req            INFO 	Using cached data for weather_data
core           INFO 	Finished loading data for 20 drivers: ['1', '44', '63', '55', '11', '16', '4', '14', '31', '5', '18', '10', '24', '47', '3', '20', '23', '6', '22', '77']
core           INFO 	Loading data for Belgian Grand Prix - Race [v3.5.3]
req            INFO 	Using cached data for session_info
req            INFO 	Using cached data for driver_info
req            INFO 	Using cached data for session_status_data
req            INFO 	Using cached data for lap_count
req            INFO 	Using cached data for track_status_data
req            INFO 	Using cached data for _extended_timing_data
req            INFO 	Using cached data for timing_app_data
core           INFO 	Processing timing data...


  Session: Hungarian Grand Prix
  Laps loaded: 1383 laps
  Merging Rainfall data from weather stream...

-> Loading data for: 2022 Belgian Grand Prix (Round 14)


req            INFO 	Using cached data for weather_data
core           INFO 	Finished loading data for 20 drivers: ['1', '11', '55', '63', '14', '16', '31', '5', '10', '23', '18', '4', '22', '24', '3', '20', '47', '6', '77', '44']
core           INFO 	Loading data for Dutch Grand Prix - Race [v3.5.3]
req            INFO 	Using cached data for session_info
req            INFO 	Using cached data for driver_info
req            INFO 	Using cached data for session_status_data
req            INFO 	Using cached data for lap_count
req            INFO 	Using cached data for track_status_data
req            INFO 	Using cached data for _extended_timing_data
req            INFO 	Using cached data for timing_app_data
core           INFO 	Processing timing data...


  Session: Belgian Grand Prix
  Laps loaded: 792 laps
  Merging Rainfall data from weather stream...

-> Loading data for: 2022 Dutch Grand Prix (Round 15)


req            INFO 	Using cached data for weather_data
core           INFO 	Finished loading data for 20 drivers: ['1', '63', '16', '44', '11', '14', '4', '55', '31', '18', '10', '23', '47', '5', '20', '24', '3', '6', '77', '22']
core           INFO 	Loading data for Italian Grand Prix - Race [v3.5.3]
req            INFO 	Using cached data for session_info
req            INFO 	Using cached data for driver_info
req            INFO 	Using cached data for session_status_data
req            INFO 	Using cached data for lap_count
req            INFO 	Using cached data for track_status_data
req            INFO 	Using cached data for _extended_timing_data
req            INFO 	Using cached data for timing_app_data
core           INFO 	Processing timing data...


  Session: Dutch Grand Prix
  Laps loaded: 1392 laps
  Merging Rainfall data from weather stream...

-> Loading data for: 2022 Italian Grand Prix (Round 16)


req            INFO 	Using cached data for weather_data
core           INFO 	Finished loading data for 20 drivers: ['1', '16', '63', '55', '44', '11', '4', '10', '45', '24', '31', '47', '77', '22', '6', '20', '3', '18', '14', '5']
core           INFO 	Loading data for Singapore Grand Prix - Race [v3.5.3]
req            INFO 	Using cached data for session_info
req            INFO 	Using cached data for driver_info
req            INFO 	Using cached data for session_status_data
req            INFO 	Using cached data for lap_count
req            INFO 	Using cached data for track_status_data
req            INFO 	Using cached data for _extended_timing_data
req            INFO 	Using cached data for timing_app_data
core           INFO 	Processing timing data...


  Session: Italian Grand Prix
  Laps loaded: 971 laps
  Merging Rainfall data from weather stream...

-> Loading data for: 2022 Singapore Grand Prix (Round 17)


req            INFO 	Using cached data for weather_data
core           INFO 	Finished loading data for 20 drivers: ['11', '16', '55', '4', '3', '18', '1', '5', '44', '10', '77', '20', '47', '63', '22', '31', '23', '14', '6', '24']
core           INFO 	Loading data for Japanese Grand Prix - Race [v3.5.3]
req            INFO 	Using cached data for session_info
req            INFO 	Using cached data for driver_info
req            INFO 	Using cached data for session_status_data
req            INFO 	Using cached data for lap_count
req            INFO 	Using cached data for track_status_data
req            INFO 	Using cached data for _extended_timing_data
req            INFO 	Using cached data for timing_app_data
core           INFO 	Processing timing data...


  Session: Singapore Grand Prix
  Laps loaded: 945 laps
  Merging Rainfall data from weather stream...

-> Loading data for: 2022 Japanese Grand Prix (Round 18)


req            INFO 	Using cached data for weather_data
core           INFO 	Finished loading data for 20 drivers: ['1', '11', '16', '31', '44', '5', '14', '63', '6', '4', '3', '18', '22', '20', '77', '24', '47', '10', '55', '23']
core           INFO 	Loading data for United States Grand Prix - Race [v3.5.3]
req            INFO 	Using cached data for session_info
req            INFO 	Using cached data for driver_info
req            INFO 	Using cached data for session_status_data
req            INFO 	Using cached data for lap_count
req            INFO 	Using cached data for track_status_data
req            INFO 	Using cached data for _extended_timing_data
req            INFO 	Using cached data for timing_app_data
core           INFO 	Processing timing data...


  Session: Japanese Grand Prix
  Laps loaded: 507 laps
  Merging Rainfall data from weather stream...

-> Loading data for: 2022 United States Grand Prix (Round 19)


req            INFO 	Using cached data for weather_data
core           INFO 	Finished loading data for 20 drivers: ['1', '44', '16', '11', '63', '4', '14', '5', '20', '22', '31', '24', '23', '10', '47', '3', '6', '18', '77', '55']
core           INFO 	Loading data for Mexico City Grand Prix - Race [v3.5.3]
req            INFO 	Using cached data for session_info
req            INFO 	Using cached data for driver_info
req            INFO 	Using cached data for session_status_data
req            INFO 	Using cached data for lap_count
req            INFO 	Using cached data for track_status_data
req            INFO 	Using cached data for _extended_timing_data
req            INFO 	Using cached data for timing_app_data
core           INFO 	Processing timing data...


  Session: United States Grand Prix
  Laps loaded: 992 laps
  Merging Rainfall data from weather stream...

-> Loading data for: 2022 Mexico City Grand Prix (Round 20)


req            INFO 	Using cached data for weather_data
core           INFO 	Finished loading data for 20 drivers: ['1', '44', '11', '63', '55', '16', '3', '31', '4', '77', '10', '23', '24', '5', '18', '47', '20', '6', '14', '22']
core           INFO 	Loading data for São Paulo Grand Prix - Race [v3.5.3]
req            INFO 	Using cached data for session_info
req            INFO 	Using cached data for driver_info
req            INFO 	Using cached data for session_status_data
req            INFO 	Using cached data for lap_count
req            INFO 	Using cached data for track_status_data
req            INFO 	Using cached data for _extended_timing_data
req            INFO 	Using cached data for timing_app_data
core           INFO 	Processing timing data...


  Session: Mexico City Grand Prix
  Laps loaded: 1379 laps
  Merging Rainfall data from weather stream...

-> Loading data for: 2022 São Paulo Grand Prix (Round 21)


req            INFO 	Using cached data for weather_data
core           INFO 	Finished loading data for 20 drivers: ['63', '44', '55', '16', '14', '1', '11', '31', '77', '18', '5', '24', '47', '10', '23', '6', '22', '4', '20', '3']
core           INFO 	Loading data for Abu Dhabi Grand Prix - Race [v3.5.3]
req            INFO 	Using cached data for session_info
req            INFO 	Using cached data for driver_info


  Session: São Paulo Grand Prix
  Laps loaded: 1259 laps
  Merging Rainfall data from weather stream...

-> Loading data for: 2022 Abu Dhabi Grand Prix (Round 22)


req            INFO 	Using cached data for session_status_data
req            INFO 	Using cached data for lap_count
req            INFO 	Using cached data for track_status_data
req            INFO 	Using cached data for _extended_timing_data
req            INFO 	Using cached data for timing_app_data
core           INFO 	Processing timing data...
req            INFO 	Using cached data for weather_data
core           INFO 	Finished loading data for 20 drivers: ['1', '16', '11', '55', '63', '4', '31', '18', '3', '5', '22', '24', '23', '10', '77', '47', '20', '44', '6', '14']


  Session: Abu Dhabi Grand Prix
  Laps loaded: 1117 laps
  Merging Rainfall data from weather stream...

Fetching event schedule for 2023...


core           INFO 	Loading data for Bahrain Grand Prix - Race [v3.5.3]
req            INFO 	Using cached data for session_info
req            INFO 	Using cached data for driver_info
req            INFO 	Using cached data for session_status_data
req            INFO 	Using cached data for lap_count
req            INFO 	Using cached data for track_status_data
req            INFO 	Using cached data for _extended_timing_data
req            INFO 	Using cached data for timing_app_data
core           INFO 	Processing timing data...


Found 22 completed race(s) in 2023.

-> Loading data for: 2023 Bahrain Grand Prix (Round 1)


req            INFO 	Using cached data for weather_data
core           INFO 	Finished loading data for 20 drivers: ['1', '11', '14', '55', '44', '18', '63', '77', '10', '23', '22', '2', '20', '21', '27', '24', '4', '31', '16', '81']
core           INFO 	Loading data for Saudi Arabian Grand Prix - Race [v3.5.3]
req            INFO 	Using cached data for session_info
req            INFO 	Using cached data for driver_info


  Session: Bahrain Grand Prix
  Laps loaded: 1056 laps
  Merging Rainfall data from weather stream...

-> Loading data for: 2023 Saudi Arabian Grand Prix (Round 2)


req            INFO 	Using cached data for session_status_data
req            INFO 	Using cached data for lap_count
req            INFO 	Using cached data for track_status_data
req            INFO 	Using cached data for _extended_timing_data
req            INFO 	Using cached data for timing_app_data
core           INFO 	Processing timing data...
req            INFO 	Using cached data for weather_data
core           INFO 	Finished loading data for 20 drivers: ['11', '1', '14', '63', '44', '55', '16', '31', '10', '20', '22', '27', '24', '21', '81', '2', '4', '77', '23', '18']
core           INFO 	Loading data for Australian Grand Prix - Race [v3.5.3]
req            INFO 	Using cached data for session_info
req            INFO 	Using cached data for driver_info


  Session: Saudi Arabian Grand Prix
  Laps loaded: 943 laps
  Merging Rainfall data from weather stream...

-> Loading data for: 2023 Australian Grand Prix (Round 3)


req            INFO 	Using cached data for session_status_data
req            INFO 	Using cached data for lap_count
req            INFO 	Using cached data for track_status_data
req            INFO 	Using cached data for _extended_timing_data
req            INFO 	Using cached data for timing_app_data
core           INFO 	Processing timing data...
req            INFO 	Using cached data for weather_data
core           INFO 	Finished loading data for 20 drivers: ['1', '44', '14', '18', '11', '4', '27', '81', '24', '22', '77', '55', '10', '31', '21', '2', '20', '63', '23', '16']
core           INFO 	Loading data for Azerbaijan Grand Prix - Race [v3.5.3]
req            INFO 	Using cached data for session_info
req            INFO 	Using cached data for driver_info
req            INFO 	Using cached data for session_status_data
req            INFO 	Using cached data for lap_count
req            INFO 	Using cached data for track_status_data
req            INFO 	Using cached data for _extended_ti

  Session: Australian Grand Prix
  Laps loaded: 1003 laps
  Merging Rainfall data from weather stream...

-> Loading data for: 2023 Azerbaijan Grand Prix (Round 4)


req            INFO 	Using cached data for weather_data
core           INFO 	Finished loading data for 20 drivers: ['11', '1', '16', '14', '55', '44', '18', '63', '4', '22', '81', '23', '20', '10', '31', '2', '27', '77', '24', '21']
core           INFO 	Loading data for Miami Grand Prix - Race [v3.5.3]
req            INFO 	Using cached data for session_info
req            INFO 	Using cached data for driver_info


  Session: Azerbaijan Grand Prix
  Laps loaded: 962 laps
  Merging Rainfall data from weather stream...

-> Loading data for: 2023 Miami Grand Prix (Round 5)


req            INFO 	Using cached data for session_status_data
req            INFO 	Using cached data for lap_count
req            INFO 	Using cached data for track_status_data
req            INFO 	Using cached data for _extended_timing_data
req            INFO 	Using cached data for timing_app_data
core           INFO 	Processing timing data...
req            INFO 	Using cached data for weather_data
core           INFO 	Finished loading data for 20 drivers: ['1', '11', '14', '63', '55', '44', '16', '10', '31', '20', '22', '18', '77', '23', '27', '24', '4', '21', '81', '2']
core           INFO 	Loading data for Monaco Grand Prix - Race [v3.5.3]
req            INFO 	Using cached data for session_info
req            INFO 	Using cached data for driver_info


  Session: Miami Grand Prix
  Laps loaded: 1138 laps
  Merging Rainfall data from weather stream...

-> Loading data for: 2023 Monaco Grand Prix (Round 6)


req            INFO 	Using cached data for session_status_data
req            INFO 	Using cached data for lap_count
req            INFO 	Using cached data for track_status_data
req            INFO 	Using cached data for _extended_timing_data
req            INFO 	Using cached data for timing_app_data
core           INFO 	Processing timing data...
req            INFO 	Using cached data for weather_data
core           INFO 	Finished loading data for 20 drivers: ['1', '14', '31', '44', '63', '16', '10', '55', '4', '81', '77', '21', '24', '23', '22', '11', '27', '2', '20', '18']
core           INFO 	Loading data for Spanish Grand Prix - Race [v3.5.3]
req            INFO 	Using cached data for session_info
req            INFO 	Using cached data for driver_info


  Session: Monaco Grand Prix
  Laps loaded: 1515 laps
  Merging Rainfall data from weather stream...

-> Loading data for: 2023 Spanish Grand Prix (Round 7)


req            INFO 	Using cached data for session_status_data
req            INFO 	Using cached data for lap_count
req            INFO 	Using cached data for track_status_data
req            INFO 	Using cached data for _extended_timing_data
req            INFO 	Using cached data for timing_app_data
core           INFO 	Processing timing data...
req            INFO 	Using cached data for weather_data
core           INFO 	Finished loading data for 20 drivers: ['1', '44', '63', '11', '55', '18', '14', '31', '24', '10', '16', '22', '81', '21', '27', '23', '4', '20', '77', '2']
core           INFO 	Loading data for Canadian Grand Prix - Race [v3.5.3]
req            INFO 	Using cached data for session_info
req            INFO 	Using cached data for driver_info


  Session: Spanish Grand Prix
  Laps loaded: 1312 laps
  Merging Rainfall data from weather stream...

-> Loading data for: 2023 Canadian Grand Prix (Round 8)


req            INFO 	Using cached data for session_status_data
req            INFO 	Using cached data for lap_count
req            INFO 	Using cached data for track_status_data
req            INFO 	Using cached data for _extended_timing_data
req            INFO 	Using cached data for timing_app_data
core           INFO 	Processing timing data...
req            INFO 	Using cached data for weather_data
core           INFO 	Finished loading data for 20 drivers: ['1', '14', '44', '16', '55', '11', '23', '31', '18', '77', '81', '10', '4', '22', '27', '24', '20', '21', '63', '2']
core           INFO 	Loading data for Austrian Grand Prix - Race [v3.5.3]
req            INFO 	Using cached data for session_info
req            INFO 	Using cached data for driver_info


  Session: Canadian Grand Prix
  Laps loaded: 1317 laps
  Merging Rainfall data from weather stream...

-> Loading data for: 2023 Austrian Grand Prix (Round 9)


req            INFO 	Using cached data for session_status_data
req            INFO 	Using cached data for lap_count
req            INFO 	Using cached data for track_status_data
req            INFO 	Using cached data for _extended_timing_data
req            INFO 	Using cached data for timing_app_data
core           INFO 	Processing timing data...
req            INFO 	Using cached data for weather_data
core           INFO 	Finished loading data for 20 drivers: ['1', '16', '11', '4', '14', '55', '63', '44', '18', '10', '23', '24', '2', '31', '77', '81', '21', '20', '22', '27']
core           INFO 	Loading data for British Grand Prix - Race [v3.5.3]
req            INFO 	Using cached data for session_info
req            INFO 	Using cached data for driver_info


  Session: Austrian Grand Prix
  Laps loaded: 1354 laps
  Merging Rainfall data from weather stream...

-> Loading data for: 2023 British Grand Prix (Round 10)


req            INFO 	Using cached data for session_status_data
req            INFO 	Using cached data for lap_count
req            INFO 	Using cached data for track_status_data
req            INFO 	Using cached data for _extended_timing_data
req            INFO 	Using cached data for timing_app_data
core           INFO 	Processing timing data...
req            INFO 	Using cached data for weather_data
core           INFO 	Finished loading data for 20 drivers: ['1', '4', '44', '81', '63', '11', '14', '23', '16', '55', '2', '77', '27', '18', '24', '22', '21', '10', '20', '31']
core           INFO 	Loading data for Hungarian Grand Prix - Race [v3.5.3]
req            INFO 	Using cached data for session_info
req            INFO 	Using cached data for driver_info


  Session: British Grand Prix
  Laps loaded: 971 laps
  Merging Rainfall data from weather stream...

-> Loading data for: 2023 Hungarian Grand Prix (Round 11)


req            INFO 	Using cached data for session_status_data
req            INFO 	Using cached data for lap_count
req            INFO 	Using cached data for track_status_data
req            INFO 	Using cached data for _extended_timing_data
req            INFO 	Using cached data for timing_app_data
core           INFO 	Processing timing data...
req            INFO 	Using cached data for weather_data
core           INFO 	Finished loading data for 20 drivers: ['1', '4', '11', '44', '81', '63', '16', '55', '14', '18', '23', '77', '3', '27', '22', '24', '20', '2', '31', '10']
core           INFO 	Loading data for Belgian Grand Prix - Race [v3.5.3]
req            INFO 	Using cached data for session_info
req            INFO 	Using cached data for driver_info


  Session: Hungarian Grand Prix
  Laps loaded: 1252 laps
  Merging Rainfall data from weather stream...

-> Loading data for: 2023 Belgian Grand Prix (Round 12)


req            INFO 	Using cached data for session_status_data
req            INFO 	Using cached data for lap_count
req            INFO 	Using cached data for track_status_data
req            INFO 	Using cached data for _extended_timing_data
req            INFO 	Using cached data for timing_app_data
core           INFO 	Processing timing data...
req            INFO 	Using cached data for weather_data
core           INFO 	Finished loading data for 20 drivers: ['1', '11', '16', '44', '14', '63', '4', '31', '18', '22', '10', '77', '24', '23', '20', '3', '2', '27', '55', '81']
core           INFO 	Loading data for Dutch Grand Prix - Race [v3.5.3]
req            INFO 	Using cached data for session_info
req            INFO 	Using cached data for driver_info


  Session: Belgian Grand Prix
  Laps loaded: 816 laps
  Merging Rainfall data from weather stream...

-> Loading data for: 2023 Dutch Grand Prix (Round 13)


req            INFO 	Using cached data for session_status_data
req            INFO 	Using cached data for lap_count
req            INFO 	Using cached data for track_status_data
req            INFO 	Using cached data for _extended_timing_data
req            INFO 	Using cached data for timing_app_data
core           INFO 	Processing timing data...
req            INFO 	Using cached data for weather_data
core           INFO 	Finished loading data for 20 drivers: ['1', '14', '10', '11', '55', '44', '4', '23', '81', '31', '18', '27', '40', '77', '22', '20', '63', '24', '16', '2']
core           INFO 	Loading data for Italian Grand Prix - Race [v3.5.3]
req            INFO 	Using cached data for session_info
req            INFO 	Using cached data for driver_info
Request for URL https://api.jolpi.ca/ergast/f1/2023/14/results.json failed; using cached response

  Session: Dutch Grand Prix
  Laps loaded: 1343 laps
  Merging Rainfall data from weather stream...

-> Loading data for: 2023 Italian Grand Prix (Round 14)


req            INFO 	Using cached data for session_status_data
req            INFO 	Using cached data for lap_count
req            INFO 	Using cached data for track_status_data
req            INFO 	Using cached data for _extended_timing_data
req            INFO 	Using cached data for timing_app_data
core           INFO 	Processing timing data...
req            INFO 	Using cached data for weather_data
core           INFO 	Finished loading data for 20 drivers: ['1', '11', '55', '16', '63', '44', '23', '4', '14', '77', '40', '81', '2', '24', '10', '18', '27', '20', '31', '22']
core           INFO 	Loading data for Singapore Grand Prix - Race [v3.5.3]
req            INFO 	Using cached data for session_info
req            INFO 	Using cached data for driver_info
Request for URL https://api.jolpi.ca/ergast/f1/2023/15/results.json failed; using cached response

  Session: Italian Grand Prix
  Laps loaded: 958 laps
  Merging Rainfall data from weather stream...

-> Loading data for: 2023 Singapore Grand Prix (Round 15)


Request for URL https://api.jolpi.ca/ergast/f1/2023/15/laps/1.json failed; using cached response
Traceback (most recent call last):
  File "/Users/dragiychev/Documents/Fontys S4 AI/FastF1/.venv/lib/python3.13/site-packages/requests_cache/session.py", line 291, in _resend
    response.raise_for_status()
    ~~~~~~~~~~~~~~~~~~~~~~~~~^^
  File "/Users/dragiychev/Documents/Fontys S4 AI/FastF1/.venv/lib/python3.13/site-packages/requests/models.py", line 1024, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://api.jolpi.ca/ergast/f1/2023/15/laps/1.json
req            INFO 	Using cached data for weather_data
core           INFO 	Finished loading data for 20 drivers: ['55', '4', '44', '16', '1', '10', '81', '11', '40', '20', '23', '24', '27', '2', '14', '63', '77', '31', '22', '18']
core           INFO 	Loading data for Japanese Grand Prix - Race [v3.5.3]
req            INFO 	Using cached da

  Session: Singapore Grand Prix
  Laps loaded: 1088 laps
  Merging Rainfall data from weather stream...

-> Loading data for: 2023 Japanese Grand Prix (Round 16)


Request for URL https://api.jolpi.ca/ergast/f1/2023/16/laps/1.json failed; using cached response
requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://api.jolpi.ca/ergast/f1/2023/16/laps/1.json
req            INFO 	Using cached data for weather_data
core           INFO 	Finished loading data for 20 drivers: ['1', '4', '81', '16', '44', '55', '63', '14', '31', '10', '40', '22', '24', '27', '20', '23', '2', '18', '11', '77']
core           INFO 	Loading data for Qatar Grand Prix - Race [v3.5.3]
req            INFO 	Using cached data 

  Session: Japanese Grand Prix
  Laps loaded: 880 laps
  Merging Rainfall data from weather stream...

-> Loading data for: 2023 Qatar Grand Prix (Round 17)


Request for URL https://api.jolpi.ca/ergast/f1/2023/17/results.json failed; using cached response
requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://api.jolpi.ca/ergast/f1/2023/17/results.json
req            INFO 	Using cached data for session_status_data
req            INFO 	Using cached data for lap_count
req            INFO 	Using cached data for track_status_data
req            INFO 	Using cached data for _extended_timing_data
req            INFO 	Using cached data for timing_app_data
core           INFO 	Processing timing d

  Session: Qatar Grand Prix
  Laps loaded: 1006 laps
  Merging Rainfall data from weather stream...

-> Loading data for: 2023 United States Grand Prix (Round 18)


Request for URL https://api.jolpi.ca/ergast/f1/2023/18/results.json failed; using cached response
requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://api.jolpi.ca/ergast/f1/2023/18/results.json
req            INFO 	Using cached data for session_status_data
req            INFO 	Using cached data for lap_count
req            INFO 	Using cached data for track_status_data
req            INFO 	Using cached data for _extended_timing_data
req            INFO 	Using cached data for timing_app_data
core           INFO 	Processing timing d

  Session: United States Grand Prix
  Laps loaded: 1014 laps
  Merging Rainfall data from weather stream...

-> Loading data for: 2023 Mexico City Grand Prix (Round 19)


Request for URL https://api.jolpi.ca/ergast/f1/2023/19/results.json failed; using cached response
requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://api.jolpi.ca/ergast/f1/2023/19/results.json
req            INFO 	Using cached data for session_status_data
req            INFO 	Using cached data for lap_count
req            INFO 	Using cached data for track_status_data
req            INFO 	Using cached data for _extended_timing_data
req            INFO 	Using cached data for timing_app_data
core           INFO 	Processing timing d

  Session: Mexico City Grand Prix
  Laps loaded: 1282 laps
  Merging Rainfall data from weather stream...

-> Loading data for: 2023 São Paulo Grand Prix (Round 20)


Request for URL https://api.jolpi.ca/ergast/f1/2023/20/results.json failed; using cached response
requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://api.jolpi.ca/ergast/f1/2023/20/results.json
req            INFO 	Using cached data for session_status_data
req            INFO 	Using cached data for lap_count
req            INFO 	Using cached data for track_status_data
req            INFO 	Using cached data for _extended_timing_data
req            INFO 	Using cached data for timing_app_data
core           INFO 	Processing timing d

  Session: São Paulo Grand Prix
  Laps loaded: 1109 laps
  Merging Rainfall data from weather stream...

-> Loading data for: 2023 Las Vegas Grand Prix (Round 21)


Request for URL https://api.jolpi.ca/ergast/f1/2023/21/results.json failed; using cached response
requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://api.jolpi.ca/ergast/f1/2023/21/results.json
req            INFO 	Using cached data for session_status_data
req            INFO 	Using cached data for lap_count
req            INFO 	Using cached data for track_status_data
req            INFO 	Using cached data for _extended_timing_data
req            INFO 	Using cached data for timing_app_data
core           INFO 	Processing timing d

  Session: Las Vegas Grand Prix
  Laps loaded: 946 laps
  Merging Rainfall data from weather stream...

-> Loading data for: 2023 Abu Dhabi Grand Prix (Round 22)


Request for URL https://api.jolpi.ca/ergast/f1/2023/22/laps/1.json failed; using cached response
requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://api.jolpi.ca/ergast/f1/2023/22/laps/1.json
req            INFO 	Using cached data for weather_data
core           INFO 	Finished loading data for 20 drivers: ['1', '16', '63', '11', '4', '81', '14', '22', '44', '18', '3', '31', '10', '23', '27', '2', '24', '55', '77', '20']


  Session: Abu Dhabi Grand Prix
  Laps loaded: 1157 laps
  Merging Rainfall data from weather stream...

Fetching event schedule for 2024...


core           INFO 	Loading data for Bahrain Grand Prix - Race [v3.5.3]
req            INFO 	Using cached data for session_info
req            INFO 	Using cached data for driver_info
req            INFO 	Using cached data for session_status_data
req            INFO 	Using cached data for lap_count
req            INFO 	Using cached data for track_status_data
req            INFO 	Using cached data for _extended_timing_data
req            INFO 	Using cached data for timing_app_data
core           INFO 	Processing timing data...


Found 24 completed race(s) in 2024.

-> Loading data for: 2024 Bahrain Grand Prix (Round 1)


Request for URL https://api.jolpi.ca/ergast/f1/2024/1/laps/1.json failed; using cached response
requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://api.jolpi.ca/ergast/f1/2024/1/laps/1.json
req            INFO 	Using cached data for weather_data
core           INFO 	Finished loading data for 20 drivers: ['1', '16', '63', '55', '11', '14', '4', '81', '44', '27', '22', '18', '23', '3', '20', '77', '24', '2', '31', '10']
core           INFO 	Loading data for Saudi Arabian Grand Prix - Race [v3.5.3]
req            INFO 	Using cached 

  Session: Bahrain Grand Prix
  Laps loaded: 1129 laps
  Merging Rainfall data from weather stream...

-> Loading data for: 2024 Saudi Arabian Grand Prix (Round 2)


Request for URL https://api.jolpi.ca/ergast/f1/2024/2/results.json failed; using cached response
requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://api.jolpi.ca/ergast/f1/2024/2/results.json
req            INFO 	Using cached data for session_status_data
req            INFO 	Using cached data for lap_count
req            INFO 	Using cached data for track_status_data
req            INFO 	Using cached data for _extended_timing_data
req            INFO 	Using cached data for timing_app_data
core           INFO 	Processing timing dat

  Session: Saudi Arabian Grand Prix
  Laps loaded: 901 laps
  Merging Rainfall data from weather stream...

-> Loading data for: 2024 Australian Grand Prix (Round 3)


Request for URL https://api.jolpi.ca/ergast/f1/2024/3/results.json failed; using cached response
requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://api.jolpi.ca/ergast/f1/2024/3/results.json
req            INFO 	Using cached data for session_status_data
req            INFO 	Using cached data for lap_count
req            INFO 	Using cached data for track_status_data
req            INFO 	Using cached data for _extended_timing_data
req            INFO 	Using cached data for timing_app_data
core           INFO 	Processing timing dat

  Session: Australian Grand Prix
  Laps loaded: 998 laps
  Merging Rainfall data from weather stream...

-> Loading data for: 2024 Japanese Grand Prix (Round 4)


Request for URL https://api.jolpi.ca/ergast/f1/2024/4/laps/1.json failed; using cached response
requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://api.jolpi.ca/ergast/f1/2024/4/laps/1.json
req            INFO 	Using cached data for weather_data
core           INFO 	Finished loading data for 20 drivers: ['1', '11', '55', '16', '4', '14', '63', '81', '44', '22', '27', '18', '20', '77', '31', '10', '2', '24', '3', '23']
core           INFO 	Loading data for Chinese Grand Prix - Race [v3.5.3]
req            INFO 	Using cached data f

  Session: Japanese Grand Prix
  Laps loaded: 907 laps
  Merging Rainfall data from weather stream...

-> Loading data for: 2024 Chinese Grand Prix (Round 5)


Request for URL https://api.jolpi.ca/ergast/f1/2024/5/laps/1.json failed; using cached response
requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://api.jolpi.ca/ergast/f1/2024/5/laps/1.json
req            INFO 	Using cached data for weather_data
core           INFO 	Finished loading data for 20 drivers: ['1', '4', '11', '16', '55', '63', '14', '81', '44', '27', '31', '23', '10', '24', '18', '20', '2', '3', '22', '77']
core           INFO 	Loading data for Miami Grand Prix - Race [v3.5.3]
req            INFO 	Using cached data for

  Session: Chinese Grand Prix
  Laps loaded: 1032 laps
  Merging Rainfall data from weather stream...

-> Loading data for: 2024 Miami Grand Prix (Round 6)


req            INFO 	Using cached data for session_status_data
req            INFO 	Using cached data for lap_count
req            INFO 	Using cached data for track_status_data
req            INFO 	Using cached data for _extended_timing_data
req            INFO 	Using cached data for timing_app_data
core           INFO 	Processing timing data...
Request for URL https://api.jolpi.ca/ergast/f1/2024/6/laps/1.json failed; using cached response
requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://api.jolpi.ca/ergast/f1/2024/6/laps/1.js

  Session: Miami Grand Prix
  Laps loaded: 1111 laps
  Merging Rainfall data from weather stream...

-> Loading data for: 2024 Emilia Romagna Grand Prix (Round 7)


req            INFO 	Using cached data for session_status_data
req            INFO 	Using cached data for lap_count
req            INFO 	Using cached data for track_status_data
req            INFO 	Using cached data for _extended_timing_data
req            INFO 	Using cached data for timing_app_data
core           INFO 	Processing timing data...
Request for URL https://api.jolpi.ca/ergast/f1/2024/7/laps/1.json failed; using cached response
requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://api.jolpi.ca/ergast/f1/2024/7/laps/1.js

  Session: Emilia Romagna Grand Prix
  Laps loaded: 1238 laps
  Merging Rainfall data from weather stream...

-> Loading data for: 2024 Monaco Grand Prix (Round 8)


Request for URL https://api.jolpi.ca/ergast/f1/2024/8/results.json failed; using cached response
requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://api.jolpi.ca/ergast/f1/2024/8/results.json
req            INFO 	Using cached data for session_status_data
req            INFO 	Using cached data for lap_count
req            INFO 	Using cached data for track_status_data
req            INFO 	Using cached data for _extended_timing_data
req            INFO 	Using cached data for timing_app_data
core           INFO 	Processing timing dat

  Session: Monaco Grand Prix
  Laps loaded: 1237 laps
  Merging Rainfall data from weather stream...

-> Loading data for: 2024 Canadian Grand Prix (Round 9)


Request for URL https://api.jolpi.ca/ergast/f1/2024/9/results.json failed; using cached response
requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://api.jolpi.ca/ergast/f1/2024/9/results.json
req            INFO 	Using cached data for session_status_data
req            INFO 	Using cached data for lap_count
req            INFO 	Using cached data for track_status_data
req            INFO 	Using cached data for _extended_timing_data
req            INFO 	Using cached data for timing_app_data
core           INFO 	Processing timing dat

  Session: Canadian Grand Prix
  Laps loaded: 1272 laps
  Merging Rainfall data from weather stream...

-> Loading data for: 2024 Spanish Grand Prix (Round 10)


Request for URL https://api.jolpi.ca/ergast/f1/2024/10/results.json failed; using cached response
Traceback (most recent call last):
  File "/Users/dragiychev/Documents/Fontys S4 AI/FastF1/.venv/lib/python3.13/site-packages/requests_cache/session.py", line 291, in _resend
    response.raise_for_status()
    ~~~~~~~~~~~~~~~~~~~~~~~~~^^
  File "/Users/dragiychev/Documents/Fontys S4 AI/FastF1/.venv/lib/python3.13/site-packages/requests/models.py", line 1024, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://api.jolpi.ca/ergast/f1/2024/10/results.json
req            INFO 	Using cached data for session_status_data
req            INFO 	Using cached data for lap_count
req            INFO 	Using cached data for track_status_data
req            INFO 	Using cached data for _extended_timing_data
req            INFO 	Using cached data for timing_app_data
core           INFO 	Processing timing d

  Session: Spanish Grand Prix
  Laps loaded: 1310 laps
  Merging Rainfall data from weather stream...

-> Loading data for: 2024 Austrian Grand Prix (Round 11)


Request for URL https://api.jolpi.ca/ergast/f1/2024/11/laps/1.json failed; using cached response
Traceback (most recent call last):
  File "/Users/dragiychev/Documents/Fontys S4 AI/FastF1/.venv/lib/python3.13/site-packages/requests_cache/session.py", line 291, in _resend
    response.raise_for_status()
    ~~~~~~~~~~~~~~~~~~~~~~~~~^^
  File "/Users/dragiychev/Documents/Fontys S4 AI/FastF1/.venv/lib/python3.13/site-packages/requests/models.py", line 1024, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://api.jolpi.ca/ergast/f1/2024/11/laps/1.json
req            INFO 	Using cached data for weather_data
core           INFO 	Finished loading data for 20 drivers: ['63', '81', '55', '44', '1', '27', '11', '20', '3', '10', '16', '31', '18', '22', '23', '77', '24', '14', '2', '4']
core           INFO 	Loading data for British Grand Prix - Race [v3.5.3]
req            INFO 	Using cached data

  Session: Austrian Grand Prix
  Laps loaded: 1405 laps
  Merging Rainfall data from weather stream...

-> Loading data for: 2024 British Grand Prix (Round 12)


Request for URL https://api.jolpi.ca/ergast/f1/2024/12/results.json failed; using cached response
Traceback (most recent call last):
  File "/Users/dragiychev/Documents/Fontys S4 AI/FastF1/.venv/lib/python3.13/site-packages/requests_cache/session.py", line 291, in _resend
    response.raise_for_status()
    ~~~~~~~~~~~~~~~~~~~~~~~~~^^
  File "/Users/dragiychev/Documents/Fontys S4 AI/FastF1/.venv/lib/python3.13/site-packages/requests/models.py", line 1024, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://api.jolpi.ca/ergast/f1/2024/12/results.json
req            INFO 	Using cached data for session_status_data
req            INFO 	Using cached data for lap_count
req            INFO 	Using cached data for track_status_data
req            INFO 	Using cached data for _extended_timing_data
req            INFO 	Using cached data for timing_app_data
core           INFO 	Processing timing d

  Session: British Grand Prix
  Laps loaded: 961 laps
  Merging Rainfall data from weather stream...

-> Loading data for: 2024 Hungarian Grand Prix (Round 13)


Request for URL https://api.jolpi.ca/ergast/f1/2024/13/results.json failed; using cached response
Traceback (most recent call last):
  File "/Users/dragiychev/Documents/Fontys S4 AI/FastF1/.venv/lib/python3.13/site-packages/requests_cache/session.py", line 291, in _resend
    response.raise_for_status()
    ~~~~~~~~~~~~~~~~~~~~~~~~~^^
  File "/Users/dragiychev/Documents/Fontys S4 AI/FastF1/.venv/lib/python3.13/site-packages/requests/models.py", line 1024, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://api.jolpi.ca/ergast/f1/2024/13/results.json
req            INFO 	Using cached data for session_status_data
req            INFO 	Using cached data for lap_count
req            INFO 	Using cached data for track_status_data
req            INFO 	Using cached data for _extended_timing_data
req            INFO 	Using cached data for timing_app_data
core           INFO 	Processing timing d

  Session: Hungarian Grand Prix
  Laps loaded: 1355 laps
  Merging Rainfall data from weather stream...

-> Loading data for: 2024 Belgian Grand Prix (Round 14)


Request for URL https://api.jolpi.ca/ergast/f1/2024/14/laps/1.json failed; using cached response
Traceback (most recent call last):
  File "/Users/dragiychev/Documents/Fontys S4 AI/FastF1/.venv/lib/python3.13/site-packages/requests_cache/session.py", line 291, in _resend
    response.raise_for_status()
    ~~~~~~~~~~~~~~~~~~~~~~~~~^^
  File "/Users/dragiychev/Documents/Fontys S4 AI/FastF1/.venv/lib/python3.13/site-packages/requests/models.py", line 1024, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://api.jolpi.ca/ergast/f1/2024/14/laps/1.json
req            INFO 	Using cached data for weather_data
core           INFO 	Finished loading data for 20 drivers: ['44', '81', '16', '1', '4', '55', '11', '14', '31', '3', '18', '23', '10', '20', '77', '22', '2', '27', '24', '63']
core           INFO 	Loading data for Dutch Grand Prix - Race [v3.5.3]
req            INFO 	Using cached data f

  Session: Belgian Grand Prix
  Laps loaded: 841 laps
  Merging Rainfall data from weather stream...

-> Loading data for: 2024 Dutch Grand Prix (Round 15)


req            INFO 	Using cached data for session_status_data
req            INFO 	Using cached data for lap_count
req            INFO 	Using cached data for track_status_data
req            INFO 	Using cached data for _extended_timing_data
req            INFO 	Using cached data for timing_app_data
core           INFO 	Processing timing data...
Request for URL https://api.jolpi.ca/ergast/f1/2024/15/laps/1.json failed; using cached response
Traceback (most recent call last):
  File "/Users/dragiychev/Documents/Fontys S4 AI/FastF1/.venv/lib/python3.13/site-packages/requests_cache/session.py", line 291, in _resend
    response.raise_for_status()
    ~~~~~~~~~~~~~~~~~~~~~~~~~^^
  File "/Users/dragiychev/Documents/Fontys S4 AI/FastF1/.venv/lib/python3.13/site-packages/requests/models.py", line 1024, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://api.jolpi.ca/ergast/f1/2024/15/laps/1.

  Session: Dutch Grand Prix
  Laps loaded: 1426 laps
  Merging Rainfall data from weather stream...

-> Loading data for: 2024 Italian Grand Prix (Round 16)


Request for URL https://api.jolpi.ca/ergast/f1/2024/16/results.json failed; using cached response
Traceback (most recent call last):
  File "/Users/dragiychev/Documents/Fontys S4 AI/FastF1/.venv/lib/python3.13/site-packages/requests_cache/session.py", line 291, in _resend
    response.raise_for_status()
    ~~~~~~~~~~~~~~~~~~~~~~~~~^^
  File "/Users/dragiychev/Documents/Fontys S4 AI/FastF1/.venv/lib/python3.13/site-packages/requests/models.py", line 1024, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://api.jolpi.ca/ergast/f1/2024/16/results.json
req            INFO 	Using cached data for session_status_data
req            INFO 	Using cached data for lap_count
req            INFO 	Using cached data for track_status_data
req            INFO 	Using cached data for _extended_timing_data
req            INFO 	Using cached data for timing_app_data
core           INFO 	Processing timing d

  Session: Italian Grand Prix
  Laps loaded: 1008 laps
  Merging Rainfall data from weather stream...

-> Loading data for: 2024 Azerbaijan Grand Prix (Round 17)


Request for URL https://api.jolpi.ca/ergast/f1/2024/17/results.json failed; using cached response
Traceback (most recent call last):
  File "/Users/dragiychev/Documents/Fontys S4 AI/FastF1/.venv/lib/python3.13/site-packages/requests_cache/session.py", line 291, in _resend
    response.raise_for_status()
    ~~~~~~~~~~~~~~~~~~~~~~~~~^^
  File "/Users/dragiychev/Documents/Fontys S4 AI/FastF1/.venv/lib/python3.13/site-packages/requests/models.py", line 1024, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://api.jolpi.ca/ergast/f1/2024/17/results.json
req            INFO 	Using cached data for session_status_data
req            INFO 	Using cached data for lap_count
req            INFO 	Using cached data for track_status_data
req            INFO 	Using cached data for _extended_timing_data
req            INFO 	Using cached data for timing_app_data
core           INFO 	Processing timing d

  Session: Azerbaijan Grand Prix
  Laps loaded: 973 laps
  Merging Rainfall data from weather stream...

-> Loading data for: 2024 Singapore Grand Prix (Round 18)


Request for URL https://api.jolpi.ca/ergast/f1/2024/18/laps/1.json failed; using cached response
Traceback (most recent call last):
  File "/Users/dragiychev/Documents/Fontys S4 AI/FastF1/.venv/lib/python3.13/site-packages/requests_cache/session.py", line 291, in _resend
    response.raise_for_status()
    ~~~~~~~~~~~~~~~~~~~~~~~~~^^
  File "/Users/dragiychev/Documents/Fontys S4 AI/FastF1/.venv/lib/python3.13/site-packages/requests/models.py", line 1024, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://api.jolpi.ca/ergast/f1/2024/18/laps/1.json
req            INFO 	Using cached data for weather_data
core           INFO 	Finished loading data for 20 drivers: ['4', '1', '81', '63', '16', '44', '55', '14', '27', '11', '43', '22', '31', '18', '24', '77', '10', '3', '20', '23']
core           INFO 	Loading data for United States Grand Prix - Race [v3.5.3]
req            INFO 	Using cach

  Session: Singapore Grand Prix
  Laps loaded: 1177 laps
  Merging Rainfall data from weather stream...

-> Loading data for: 2024 United States Grand Prix (Round 19)


Request for URL https://api.jolpi.ca/ergast/f1/2024/19/results.json failed; using cached response
Traceback (most recent call last):
  File "/Users/dragiychev/Documents/Fontys S4 AI/FastF1/.venv/lib/python3.13/site-packages/requests_cache/session.py", line 291, in _resend
    response.raise_for_status()
    ~~~~~~~~~~~~~~~~~~~~~~~~~^^
  File "/Users/dragiychev/Documents/Fontys S4 AI/FastF1/.venv/lib/python3.13/site-packages/requests/models.py", line 1024, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://api.jolpi.ca/ergast/f1/2024/19/results.json
req            INFO 	Using cached data for session_status_data
req            INFO 	Using cached data for lap_count
req            INFO 	Using cached data for track_status_data
req            INFO 	Using cached data for _extended_timing_data
req            INFO 	Using cached data for timing_app_data
core           INFO 	Processing timing d

  Session: United States Grand Prix
  Laps loaded: 1059 laps
  Merging Rainfall data from weather stream...

-> Loading data for: 2024 Mexico City Grand Prix (Round 20)


Request for URL https://api.jolpi.ca/ergast/f1/2024/20/results.json failed; using cached response
Traceback (most recent call last):
  File "/Users/dragiychev/Documents/Fontys S4 AI/FastF1/.venv/lib/python3.13/site-packages/requests_cache/session.py", line 291, in _resend
    response.raise_for_status()
    ~~~~~~~~~~~~~~~~~~~~~~~~~^^
  File "/Users/dragiychev/Documents/Fontys S4 AI/FastF1/.venv/lib/python3.13/site-packages/requests/models.py", line 1024, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://api.jolpi.ca/ergast/f1/2024/20/results.json
req            INFO 	Using cached data for session_status_data
req            INFO 	Using cached data for lap_count
req            INFO 	Using cached data for track_status_data
req            INFO 	Using cached data for _extended_timing_data
req            INFO 	Using cached data for timing_app_data
core           INFO 	Processing timing d

  Session: Mexico City Grand Prix
  Laps loaded: 1215 laps
  Merging Rainfall data from weather stream...

-> Loading data for: 2024 São Paulo Grand Prix (Round 21)


req            INFO 	Using cached data for session_status_data
req            INFO 	Using cached data for lap_count
req            INFO 	Using cached data for track_status_data
req            INFO 	Using cached data for _extended_timing_data
req            INFO 	Using cached data for timing_app_data
core           INFO 	Processing timing data...
Request for URL https://api.jolpi.ca/ergast/f1/2024/21/laps/1.json failed; using cached response
Traceback (most recent call last):
  File "/Users/dragiychev/Documents/Fontys S4 AI/FastF1/.venv/lib/python3.13/site-packages/requests_cache/session.py", line 291, in _resend
    response.raise_for_status()
    ~~~~~~~~~~~~~~~~~~~~~~~~~^^
  File "/Users/dragiychev/Documents/Fontys S4 AI/FastF1/.venv/lib/python3.13/site-packages/requests/models.py", line 1024, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://api.jolpi.ca/ergast/f1/2024/21/laps/1.

  Session: São Paulo Grand Prix
  Laps loaded: 1135 laps
  Merging Rainfall data from weather stream...

-> Loading data for: 2024 Las Vegas Grand Prix (Round 22)


Request for URL https://api.jolpi.ca/ergast/f1/2024/22/results.json failed; using cached response
Traceback (most recent call last):
  File "/Users/dragiychev/Documents/Fontys S4 AI/FastF1/.venv/lib/python3.13/site-packages/requests_cache/session.py", line 291, in _resend
    response.raise_for_status()
    ~~~~~~~~~~~~~~~~~~~~~~~~~^^
  File "/Users/dragiychev/Documents/Fontys S4 AI/FastF1/.venv/lib/python3.13/site-packages/requests/models.py", line 1024, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://api.jolpi.ca/ergast/f1/2024/22/results.json
req            INFO 	Using cached data for session_status_data
req            INFO 	Using cached data for lap_count
req            INFO 	Using cached data for track_status_data
req            INFO 	Using cached data for _extended_timing_data
req            INFO 	Using cached data for timing_app_data
core           INFO 	Processing timing d

  Session: Las Vegas Grand Prix
  Laps loaded: 938 laps
  Merging Rainfall data from weather stream...

-> Loading data for: 2024 Qatar Grand Prix (Round 23)


req            INFO 	Using cached data for session_status_data
req            INFO 	Using cached data for lap_count
req            INFO 	Using cached data for track_status_data
req            INFO 	Using cached data for _extended_timing_data
req            INFO 	Using cached data for timing_app_data
core           INFO 	Processing timing data...
Request for URL https://api.jolpi.ca/ergast/f1/2024/23/laps/1.json failed; using cached response
Traceback (most recent call last):
  File "/Users/dragiychev/Documents/Fontys S4 AI/FastF1/.venv/lib/python3.13/site-packages/requests_cache/session.py", line 291, in _resend
    response.raise_for_status()
    ~~~~~~~~~~~~~~~~~~~~~~~~~^^
  File "/Users/dragiychev/Documents/Fontys S4 AI/FastF1/.venv/lib/python3.13/site-packages/requests/models.py", line 1024, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://api.jolpi.ca/ergast/f1/2024/23/laps/1.

  Session: Qatar Grand Prix
  Laps loaded: 943 laps
  Merging Rainfall data from weather stream...

-> Loading data for: 2024 Abu Dhabi Grand Prix (Round 24)


req            INFO 	Using cached data for session_status_data
req            INFO 	Using cached data for lap_count
req            INFO 	Using cached data for track_status_data
req            INFO 	Using cached data for _extended_timing_data
req            INFO 	Using cached data for timing_app_data
core           INFO 	Processing timing data...
req            INFO 	Using cached data for weather_data
core           INFO 	Finished loading data for 20 drivers: ['4', '55', '16', '44', '63', '1', '10', '27', '14', '81', '23', '22', '24', '18', '61', '20', '30', '77', '43', '11']


  Session: Abu Dhabi Grand Prix
  Laps loaded: 1035 laps
  Merging Rainfall data from weather stream...

Finished fetching all seasons.
Successfully loaded data for 68/68 total sessions.

Combined all laps data into a single DataFrame.
Shape of final DataFrame: (74605, 37)

Info for combined_laps_df:
<class 'fastf1.core.Laps'>
RangeIndex: 74605 entries, 0 to 74604
Data columns (total 37 columns):
 #   Column              Non-Null Count  Dtype          
---  ------              --------------  -----          
 0   Time_x              74605 non-null  timedelta64[ns]
 1   Driver              74605 non-null  object         
 2   DriverNumber        74605 non-null  object         
 3   LapTime             73336 non-null  timedelta64[ns]
 4   LapNumber           74605 non-null  float64        
 5   Stint               74605 non-null  float64        
 6   PitOutTime          2604 non-null   timedelta64[ns]
 7   PitInTime           2628 non-null   timedelta64[ns]
 8   Sector1Time         73022

**Reflection on Data Collection:**
The code loaded all 68 requested sessions. The HTTP 429 (Too Many Requests) errors from `api.jolpi.ca` did not abort the run, because each failed request fell back to a cached response. Merging rainfall data requires careful handling of timestamps and of potentially missing weather data. The `TotalRaceLaps` attribute is also added per session. The shape `(12851, 36)` mentioned in the task list was recorded after some cleaning; `combined_laps_df.shape` here is `(74605, 37)` because it reflects the raw combined data before those steps are applied.
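
To illustrate the timestamp handling mentioned above, here is a minimal sketch of one way to attach the most recent weather sample to each lap. It is not necessarily the merge used in this notebook: it assumes both tables carry a session-relative `Time` column of type timedelta, the toy values are invented, and `pd.merge_asof` is simply one convenient option.

```python
import pandas as pd

# Toy data (hypothetical values): lap end times and periodic weather samples.
laps = pd.DataFrame({
    "LapNumber": [1, 2, 3],
    "Time": pd.to_timedelta([90, 180, 270], unit="s"),
})
weather = pd.DataFrame({
    "Time": pd.to_timedelta([0, 60, 120, 240], unit="s"),
    "Rainfall": [False, False, True, False],
})

# merge_asof needs both sides sorted on the key. direction="backward" picks,
# for each lap, the latest weather sample at or before the lap's timestamp.
merged = pd.merge_asof(
    laps.sort_values("Time"),
    weather.sort_values("Time"),
    on="Time",
    direction="backward",
)
print(merged)
```

If weather samples can be missing for long stretches, the `tolerance` argument of `pd.merge_asof` limits how old a matched sample may be; unmatched laps then get a missing `Rainfall` value that has to be imputed explicitly.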

## 2. Problem Definition & Scope

- **Primary Goal:** Predict Pit/Don't Pit in the next 1-3 laps (binary classification).
- **Secondary Goal (Optional):** Predict optimal tire compound (multi-class classification).
- **Scope:** Focus on in-race pit stops, initially under green flag conditions. More complex scenarios such as Safety Car (SC) and Virtual Safety Car (VSC) periods will be considered for inclusion.

## 3. Data Exploration & Initial Preprocessing (MVP Focus)

Now we'll perform initial cleaning and type conversions, focusing on the core features identified for the Minimum Viable Product (MVP).

In [5]:
if not combined_laps_df.empty:
    print("--- Initial Data Cleaning & Type Conversions (Focus on Core MVP Features) ---")
    
    # Convert LapTime to total seconds
    if 'LapTime' in combined_laps_df.columns:
        combined_laps_df['LapTimeSeconds'] = combined_laps_df['LapTime'].dt.total_seconds()
        print("Converted 'LapTime' to 'LapTimeSeconds' (float).")

    print("\nApplying type conversions for LapNumber, Stint, Position, TyreLife:")
    for col in ['LapNumber', 'Stint']:
        if col in combined_laps_df.columns:
            # Attempt to convert to int if no NaNs, else Int64 for nullable int
            if combined_laps_df[col].isnull().sum() == 0:
                combined_laps_df[col] = combined_laps_df[col].astype(int)
                print(f"Converted column '{col}' to int.")
            else:
                try:
                    combined_laps_df[col] = combined_laps_df[col].astype('Int64')
                    print(f"Converted column '{col}' to nullable Int64 due to NaNs ({combined_laps_df[col].isnull().sum()}).")
                except Exception as e:
                     print(f"Could not convert '{col}' to Int64: {e}. Leaving as float/object.")
    
    if 'Position' in combined_laps_df.columns:
        try:
            combined_laps_df['Position'] = combined_laps_df['Position'].astype('Int64')
            print("Converted column 'Position' to nullable Int64.")
        except Exception as e:
            print(f"Could not convert 'Position' to Int64: {e}. Leaving as float.")
            
    if 'TyreLife' in combined_laps_df.columns:
        try:
            combined_laps_df['TyreLife'] = combined_laps_df['TyreLife'].astype('Int64')
            print("Converted column 'TyreLife' to nullable Int64.")
        except Exception as e:
            print(f"Could not convert 'TyreLife' to Int64: {e}. Leaving as float.")

    core_mvp_features_check = ['LapNumber', 'TyreLife', 'Compound', 'Stint', 'Rainfall', 'TrackStatus', 'PitInTime', 'LapTimeSeconds']
    print("\nChecking Core MVP Features after initial conversions:")
    for feature in core_mvp_features_check:
        if feature in combined_laps_df.columns:
            nan_count = combined_laps_df[feature].isnull().sum()
            dtype = combined_laps_df[feature].dtype
            print(f"  - '{feature}': Dtype={dtype}, NaNs={nan_count} ({ (nan_count/len(combined_laps_df)*100):.2f}% )")
        else:
            print(f"  - '{feature}': Not found in DataFrame.")

    print("\nInfo after initial cleaning steps:")
    combined_laps_df.info()
    print("\nHead after initial cleaning steps:")
    print(combined_laps_df.head())
else:
    print("combined_laps_df is empty. Skipping initial preprocessing.")

--- Initial Data Cleaning & Type Conversions (Focus on Core MVP Features) ---
Converted 'LapTime' to 'LapTimeSeconds' (float).

Applying type conversions for LapNumber, Stint, Position, TyreLife:
Converted column 'LapNumber' to int.
Converted column 'Stint' to int.
Converted column 'Position' to nullable Int64.
Converted column 'TyreLife' to nullable Int64.

Checking Core MVP Features after initial conversions:
  - 'LapNumber': Dtype=int64, NaNs=0 (0.00% )
  - 'TyreLife': Dtype=Int64, NaNs=0 (0.00% )
  - 'Compound': Dtype=object, NaNs=0 (0.00% )
  - 'Stint': Dtype=int64, NaNs=0 (0.00% )
  - 'Rainfall': Dtype=bool, NaNs=0 (0.00% )
  - 'TrackStatus': Dtype=object, NaNs=0 (0.00% )
  - 'PitInTime': Dtype=timedelta64[ns], NaNs=71977 (96.48% )
  - 'LapTimeSeconds': Dtype=float64, NaNs=1269 (1.70% )

Info after initial cleaning steps:
<class 'fastf1.core.Laps'>
RangeIndex: 74605 entries, 0 to 74604
Data columns (total 38 columns):
 #   Column              Non-Null Count  Dtype          
---  

**Reflection on Initial Preprocessing:**
`LapTime` is converted to numerical seconds (`LapTimeSeconds`). `LapNumber` and `Stint` contain no NaNs and become plain `int`, while `Position` and `TyreLife` are converted to nullable `Int64` so that missing values cannot break the cast. The check on the core MVP features shows that only `PitInTime` (96.48% missing, which is expected because it is set only on laps that end in the pit lane) and `LapTimeSeconds` (1.70% missing) contain NaNs. These counts will inform the subsequent imputation strategy.
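
The difference between the plain `int` cast and the nullable `Int64` cast can be seen on a small toy frame. This is only a sketch with invented values, not the notebook's data:

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one missing lap time and one missing position.
laps = pd.DataFrame({
    "LapTime": pd.to_timedelta(["00:01:32.500", None, "00:01:30.250"]),
    "Position": [1.0, np.nan, 2.0],
})

# Timedelta -> float seconds; a missing lap time becomes NaN.
laps["LapTimeSeconds"] = laps["LapTime"].dt.total_seconds()

# A plain int cast fails as soon as a NaN is present...
try:
    laps["Position"].astype(int)
    plain_int_ok = True
except ValueError:
    plain_int_ok = False

# ...whereas the nullable Int64 dtype keeps the gap as <NA>.
laps["Position"] = laps["Position"].astype("Int64")
print(plain_int_ok, laps.dtypes, sep="\n")
```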

## 4. Feature Engineering

This section creates the following features for the model:
- `NumberOfPitStopsMade`
- `IsSafetyCar` / `IsVSC` flags
- One-hot encoding for `Compound`
- `RaceFractionCompleted`
- `PreviousLapTimeSeconds1` and `PreviousLapTimeSeconds2`
- `LapTimeDegradation`
- `AverageLapTimeOnStint`

In [6]:
if not combined_laps_df.empty:
    print("--- Feature Engineering (Core MVP & Additional Features) ---")

    # 1. NumberOfPitStopsMade (Stint is 1-indexed)
    if 'Stint' in combined_laps_df.columns and pd.api.types.is_numeric_dtype(combined_laps_df['Stint']):
        combined_laps_df['NumberOfPitStopsMade'] = combined_laps_df['Stint'] - 1
        print("Engineered 'NumberOfPitStopsMade' from 'Stint'.")
    else:
        print("Warning: 'Stint' column not found or not numeric. Cannot engineer 'NumberOfPitStopsMade'.")

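    # 2. IsSafetyCar / IsVSC from TrackStatus
    # FastF1 stores TrackStatus as a string of every status code seen during the lap (e.g. '124'),
    # so test whether the code occurs in the string instead of requiring an exact match.
    if 'TrackStatus' in combined_laps_df.columns:
        track_status_str = combined_laps_df['TrackStatus'].astype(str)
        combined_laps_df['IsSafetyCar'] = track_status_str.str.contains('4', regex=False)
        combined_laps_df['IsVSC'] = track_status_str.str.contains('6|7', regex=True)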
        print("Engineered 'IsSafetyCar' and 'IsVSC' from 'TrackStatus'.")
    else:
        print("Warning: 'TrackStatus' column not found. Cannot engineer SC/VSC flags.")

    # 3. One-hot encode Compound
    if 'Compound' in combined_laps_df.columns:
        try:
            compound_dummies = pd.get_dummies(combined_laps_df['Compound'], prefix='Compound', dtype=bool)
            combined_laps_df = pd.concat([combined_laps_df, compound_dummies], axis=1)
            print(f"One-hot encoded 'Compound'. New columns: {list(compound_dummies.columns)}")
        except Exception as e:
            print(f"Error one-hot encoding 'Compound': {e}")
    else:
        print("Warning: 'Compound' column not found. Cannot one-hot encode.")

    # 4. RaceFractionCompleted
    if 'LapNumber' in combined_laps_df.columns and 'TotalRaceLaps' in combined_laps_df.columns:
        valid_total_laps = combined_laps_df['TotalRaceLaps'].notna() & (combined_laps_df['TotalRaceLaps'] > 0)
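        # Initialise with np.nan (not pd.NA) so the column is float from the start
        # and the astype('float64') below cannot fail on rows that stay missing.
        combined_laps_df['RaceFractionCompleted'] = np.nan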
        combined_laps_df.loc[valid_total_laps, 'RaceFractionCompleted'] = combined_laps_df.loc[valid_total_laps, 'LapNumber'] / combined_laps_df.loc[valid_total_laps, 'TotalRaceLaps']
        combined_laps_df['RaceFractionCompleted'] = combined_laps_df['RaceFractionCompleted'].astype('float64')
        print("Engineered 'RaceFractionCompleted'.")
        if combined_laps_df['RaceFractionCompleted'].isnull().any():
             print(f"  Warning: 'RaceFractionCompleted' contains {combined_laps_df['RaceFractionCompleted'].isnull().sum()} NaN values.")
    else:
        print("Warning: 'LapNumber' or 'TotalRaceLaps' not found. Cannot engineer 'RaceFractionCompleted'.")

    # 5. Previous 1-2 Lap Times
    if 'LapTimeSeconds' in combined_laps_df.columns:
        grouping_cols = ['EventYear', 'EventRound', 'Driver']
        # shift() follows the row order inside each group, so sort by LapNumber first.
        # Assigning the shifted Series back aligns on the index, so combined_laps_df keeps its row order.
        print("Engineering 'PreviousLapTimeSeconds1' and 'PreviousLapTimeSeconds2'...")
        df_sorted_temp_prev = combined_laps_df.sort_values(by=grouping_cols + ['LapNumber'])
        lap_times_by_driver = df_sorted_temp_prev.groupby(grouping_cols)['LapTimeSeconds']
        combined_laps_df['PreviousLapTimeSeconds1'] = lap_times_by_driver.shift(1)
        combined_laps_df['PreviousLapTimeSeconds2'] = lap_times_by_driver.shift(2)
        print("Engineered 'PreviousLapTimeSeconds1' & 'PreviousLapTimeSeconds2'.")
        print(f"  NaNs in PreviousLapTimeSeconds1: {combined_laps_df['PreviousLapTimeSeconds1'].isnull().sum()}")
        print(f"  NaNs in PreviousLapTimeSeconds2: {combined_laps_df['PreviousLapTimeSeconds2'].isnull().sum()}")
    else:
        print("Warning: 'LapTimeSeconds' not found. Cannot engineer previous lap times.")

    # 6. LapTimeDegradation
    required_cols_lt_deg = ['LapTimeSeconds', 'EventYear', 'EventRound', 'Driver', 'Stint', 'LapNumber']
    if all(col in combined_laps_df.columns for col in required_cols_lt_deg):
        print("Engineering 'LapTimeDegradation'...")
        df_sorted_temp_deg = combined_laps_df.sort_values(by=['EventYear', 'EventRound', 'Driver', 'Stint', 'LapNumber'])
        df_sorted_temp_deg['FirstLapTimeOfStint'] = df_sorted_temp_deg.groupby(['EventYear', 'EventRound', 'Driver', 'Stint'])['LapTimeSeconds'].transform('first')
        df_sorted_temp_deg['LapTimeDegradation'] = df_sorted_temp_deg['LapTimeSeconds'] - df_sorted_temp_deg['FirstLapTimeOfStint']
        combined_laps_df['LapTimeDegradation'] = df_sorted_temp_deg['LapTimeDegradation']
        print(f"  Engineered 'LapTimeDegradation'. NaNs: {combined_laps_df['LapTimeDegradation'].isnull().sum()}")
    else:
        print(f"Warning: Missing required columns for 'LapTimeDegradation': { [col for col in required_cols_lt_deg if col not in combined_laps_df.columns] }")

    # 7. AverageLapTimeOnStint
    required_cols_avg_stint = ['LapTimeSeconds', 'EventYear', 'EventRound', 'Driver', 'Stint', 'LapNumber']
    if all(col in combined_laps_df.columns for col in required_cols_avg_stint):
        print("Engineering 'AverageLapTimeOnStint'...")
        df_sorted_temp_avg = combined_laps_df.sort_values(by=['EventYear', 'EventRound', 'Driver', 'Stint', 'LapNumber'])
        df_sorted_temp_avg['AverageLapTimeOnStint'] = df_sorted_temp_avg.groupby(['EventYear', 'EventRound', 'Driver', 'Stint'])['LapTimeSeconds'].expanding().mean().reset_index(level=[0,1,2,3], drop=True)
        combined_laps_df['AverageLapTimeOnStint'] = df_sorted_temp_avg['AverageLapTimeOnStint']
        print(f"  Engineered 'AverageLapTimeOnStint'. NaNs: {combined_laps_df['AverageLapTimeOnStint'].isnull().sum()}")
    else:
        print(f"Warning: Missing required columns for 'AverageLapTimeOnStint': { [col for col in required_cols_avg_stint if col not in combined_laps_df.columns] }")

    print("\nInfo after all Feature Engineering:")
    combined_laps_df.info()
    print("\nHead after all Feature Engineering:")
    print(combined_laps_df.head())
else:
    print("combined_laps_df is empty. Skipping feature engineering.")

--- Feature Engineering (Core MVP & Additional Features) ---
Engineered 'NumberOfPitStopsMade' from 'Stint'.
Engineered 'IsSafetyCar' and 'IsVSC' from 'TrackStatus'.
One-hot encoded 'Compound'. New columns: ['Compound_HARD', 'Compound_INTERMEDIATE', 'Compound_MEDIUM', 'Compound_SOFT', 'Compound_UNKNOWN', 'Compound_WET']
Engineered 'RaceFractionCompleted'.
Engineering 'PreviousLapTimeSeconds1' and 'PreviousLapTimeSeconds2'...
Engineered 'PreviousLapTimeSeconds1' & 'PreviousLapTimeSeconds2'.
  NaNs in PreviousLapTimeSeconds1: 2465
  NaNs in PreviousLapTimeSeconds2: 3763
Engineering 'LapTimeDegradation'...
  Engineered 'LapTimeDegradation'. NaNs: 1269
Engineering 'AverageLapTimeOnStint'...
  Engineered 'AverageLapTimeOnStint'. NaNs: 617

Info after all Feature Engineering:
<class 'fastf1.core.Laps'>
RangeIndex: 74605 entries, 0 to 74604
Data columns (total 52 columns):
 #   Column                   Non-Null Count  Dtype          
---  ------                   --------------  -----        

**Reflection on Feature Engineering:**
Several new features have been created. `NumberOfPitStopsMade`, `IsSafetyCar`, `IsVSC`, and one-hot encoded `Compound` features provide crucial categorical/event-based context. `RaceFractionCompleted` gives a sense of race progression. Shifted lap times (`PreviousLapTimeSeconds1/2`) offer recent performance trends. `LapTimeDegradation` and `AverageLapTimeOnStint` quantify performance changes over the current tire stint. The creation of these features, especially those involving `groupby().transform()` or `groupby().expanding()`, requires careful sorting to ensure correctness. NaN values are expected for shifted features at the beginning of sequences (e.g., first lap of a race for a driver, or first lap of a stint).
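
The sorting concern can be made concrete with a toy example. The sketch below uses invented lap times with deliberately shuffled rows; it sorts a temporary view, computes the lagged and per-stint features, and relies on index alignment when writing the results back:

```python
import pandas as pd

# Toy laps for one driver (hypothetical values), rows deliberately out of order.
laps = pd.DataFrame({
    "Driver": ["VER"] * 6,
    "LapNumber": [3, 1, 2, 6, 4, 5],
    "Stint": [1, 1, 1, 2, 2, 2],
    "LapTimeSeconds": [91.0, 92.0, 90.0, 90.0, 95.0, 89.0],
})

# Sort a temporary view so "previous" and "first" follow lap order.
ordered = laps.sort_values(["Driver", "LapNumber"])
by_driver = ordered.groupby("Driver")["LapTimeSeconds"]
by_stint = ordered.groupby(["Driver", "Stint"])["LapTimeSeconds"]

# Assignments align on the index, so the original row order of `laps` is kept.
laps["PreviousLapTimeSeconds1"] = by_driver.shift(1)
laps["LapTimeDegradation"] = ordered["LapTimeSeconds"] - by_stint.transform("first")
laps["AverageLapTimeOnStint"] = by_stint.transform(lambda s: s.expanding().mean())

result = laps.sort_values("LapNumber")
print(result)
```

Note that the expanding mean includes the current lap, and that the first lap of a stint always has a degradation of zero by construction.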

## 5. Target Variable Definition (`y`)

We define the target variable `PittedInNextNRows`. For each lap and driver, it is `1` if the driver pits within a window of `N` laps starting at the current lap (here N=3, so the current lap and the two that follow), and `0` otherwise. A lap counts as a pit lap when its `PitInTime` is not null.

In [7]:
if not combined_laps_df.empty:
    print("--- Target Variable Definition (y) ---")
    N_LAP_WINDOW = 3
    combined_laps_df['PittedInNextNRows'] = 0 

    grouped = combined_laps_df.groupby(['EventYear', 'EventRound', 'Driver'])
    processed_indices = []

    for group_keys, group_df in grouped:
        sorted_group = group_df.sort_values('LapNumber')
        indices = sorted_group.index
        
        for i in range(len(sorted_group)):
            current_lap_index = indices[i]
            window_end_index_exclusive = min(i + N_LAP_WINDOW, len(sorted_group))
            laps_in_window = sorted_group.iloc[i:window_end_index_exclusive]
            if laps_in_window['PitInTime'].notna().any():
                processed_indices.append(current_lap_index)
    
    if processed_indices:
         combined_laps_df.loc[processed_indices, 'PittedInNextNRows'] = 1
    
    print(f"Engineered target variable 'PittedInNextNRows' with N={N_LAP_WINDOW}.")
    print("Value counts for 'PittedInNextNRows':")
    print(combined_laps_df['PittedInNextNRows'].value_counts(normalize=True) * 100)
    print(combined_laps_df['PittedInNextNRows'].value_counts())
    # The task list cited ~11.7% Pit / 88.3% No Pit; this run yields ~9.9% / 90.1%.

    print("\nInfo after Target Variable Definition:")
    combined_laps_df.info()
else:
    print("combined_laps_df is empty. Skipping target variable definition.")

--- Target Variable Definition (y) ---
Engineered target variable 'PittedInNextNRows' with N=3.
Value counts for 'PittedInNextNRows':
PittedInNextNRows
0    90.118625
1     9.881375
Name: proportion, dtype: float64
PittedInNextNRows
0    67233
1     7372
Name: count, dtype: int64

Info after Target Variable Definition:
<class 'fastf1.core.Laps'>
RangeIndex: 74605 entries, 0 to 74604
Data columns (total 53 columns):
 #   Column                   Non-Null Count  Dtype          
---  ------                   --------------  -----          
 0   Time_x                   74605 non-null  timedelta64[ns]
 1   Driver                   74605 non-null  object         
 2   DriverNumber             74605 non-null  object         
 3   LapTime                  73336 non-null  timedelta64[ns]
 4   LapNumber                74605 non-null  int64          
 5   Stint                    74605 non-null  int64          
 6   PitOutTime               2604 non-null   timedelta64[ns]
 7   PitInTime         

**Reflection on Target Variable:**
The target variable `PittedInNextNRows` has been created. The value counts show an imbalanced dataset, with significantly more 'No Pit' instances than 'Pit' instances. This is expected in F1 data. The task list noted a distribution of approximately 11.7% Pit / 88.3% No Pit; this run is somewhat more imbalanced, at about 9.9% Pit (7,372 laps) / 90.1% No Pit (67,233 laps). This imbalance should be considered during model training and evaluation (e.g., using metrics like F1-score, precision, recall, or techniques like class weighting).
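
The row-by-row loop above can be slow on tens of thousands of laps. A vectorized sketch of the same windowing logic is shown below, assuming a unique index and the same grouping keys; `label_pit_window` and the toy frame are hypothetical names introduced here for illustration.

```python
import numpy as np
import pandas as pd

def label_pit_window(laps: pd.DataFrame, n_window: int = 3) -> pd.Series:
    """Return 1 where a pit-in occurs on the current lap or the next n_window-1 laps."""
    keys = ["EventYear", "EventRound", "Driver"]
    ordered = laps.sort_values(keys + ["LapNumber"])
    pitted = ordered["PitInTime"].notna().astype(int)
    grouped = pitted.groupby([ordered[k] for k in keys])

    target = pitted.copy()
    for step in range(1, n_window):
        # Look `step` laps ahead within the same driver and race only.
        ahead = grouped.shift(-step).fillna(0)
        target = np.maximum(target, ahead)
    # Restore the caller's original row order.
    return target.astype(int).reindex(laps.index)

# Toy data: driver AAA pits on lap 4, driver BBB pits on lap 1.
toy = pd.DataFrame({
    "EventYear": 2023, "EventRound": 1,
    "Driver": ["AAA"] * 6 + ["BBB"] * 3,
    "LapNumber": [1, 2, 3, 4, 5, 6, 1, 2, 3],
    "PitInTime": [pd.NaT] * 3 + [pd.Timedelta(minutes=10)] + [pd.NaT] * 2
                 + [pd.Timedelta(minutes=2)] + [pd.NaT] * 2,
})
toy["PittedInNextNRows"] = label_pit_window(toy)
print(toy[["Driver", "LapNumber", "PittedInNextNRows"]])
```

Grouping before shifting keeps the last laps of one driver from seeing the first laps of the next driver in the frame.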

## 6. Data Preprocessing for Modeling

This involves:
- Handling NaN values in the selected numerical features (using median imputation).
- Scaling numerical features (using `StandardScaler`).

In [8]:
if not combined_laps_df.empty:
    print("--- Data Preprocessing for Modeling: Handle NaNs ---")
    numerical_mvp_features_for_model = [
        'LapNumber', 'TyreLife', 'LapTimeSeconds', 'Stint',
        'RaceFractionCompleted', 'NumberOfPitStopsMade',
        'PreviousLapTimeSeconds1', 'PreviousLapTimeSeconds2',
        'Position', 
        'LapTimeDegradation', 
        'AverageLapTimeOnStint'
    ]
    
    print("Checking NaNs before imputation for selected numerical features for modeling:")
    for col in numerical_mvp_features_for_model:
        if col in combined_laps_df.columns:
            print(f"  NaNs in {col}: {combined_laps_df[col].isnull().sum()}")
        else:
            print(f"  Warning: Column {col} not found for NaN check.")

    print("\nImputing NaNs with median...")
    for col in numerical_mvp_features_for_model:
        if col in combined_laps_df.columns and combined_laps_df[col].isnull().any():
            # Ensure column is numeric before median calculation
            if pd.api.types.is_numeric_dtype(combined_laps_df[col]):
                median_val = combined_laps_df[col].median()
                # Assign back instead of chained inplace fillna, which stops working in pandas 3.0.
                combined_laps_df[col] = combined_laps_df[col].fillna(median_val)
                print(f"  Filled NaNs in {col} with median: {median_val:.2f}")
            else:
                print(f"  Warning: Column {col} is not numeric, cannot impute with median. Dtype: {combined_laps_df[col].dtype}")
        elif col in combined_laps_df.columns:
            print(f"  No NaNs to fill in {col}.")
            
    print("\nChecking NaNs after imputation:")
    for col in numerical_mvp_features_for_model:
        if col in combined_laps_df.columns:
            print(f"  NaNs in {col}: {combined_laps_df[col].isnull().sum()}")
            
    print("\n--- Data Preprocessing for Modeling: Scale Numerical Features ---")
    # Check if all features to scale are indeed present and have no NaNs
    ready_to_scale = True
    for col in numerical_mvp_features_for_model:
        if col not in combined_laps_df.columns:
            print(f"Error: Column {col} for scaling not found in DataFrame.")
            ready_to_scale = False
            break
        if combined_laps_df[col].isnull().any():
            print(f"Error: Column {col} for scaling still contains NaNs.")
            ready_to_scale = False
            break
        if not pd.api.types.is_numeric_dtype(combined_laps_df[col]):
             print(f"Error: Column {col} for scaling is not numeric. Dtype: {combined_laps_df[col].dtype}")
             ready_to_scale = False
             break


--- Data Preprocessing for Modeling: Handle NaNs ---
Checking NaNs before imputation for selected numerical features for modeling:
  NaNs in LapNumber: 0
  NaNs in TyreLife: 0
  NaNs in LapTimeSeconds: 1269
  NaNs in Stint: 0
  NaNs in RaceFractionCompleted: 0
  NaNs in NumberOfPitStopsMade: 0
  NaNs in PreviousLapTimeSeconds1: 2465
  NaNs in PreviousLapTimeSeconds2: 3763
  NaNs in Position: 109
  NaNs in LapTimeDegradation: 1269
  NaNs in AverageLapTimeOnStint: 617

Imputing NaNs with median...
  No NaNs to fill in LapNumber.
  No NaNs to fill in TyreLife.
  Filled NaNs in LapTimeSeconds with median: 89.88
  No NaNs to fill in Stint.
  No NaNs to fill in RaceFractionCompleted.
  No NaNs to fill in NumberOfPitStopsMade.
  Filled NaNs in PreviousLapTimeSeconds1 with median: 89.85
  Filled NaNs in PreviousLapTimeSeconds2 with median: 89.84
  Filled NaNs in Position with median: 10.00
  Filled NaNs in LapTimeDegradation with median: -17.39
  Filled NaNs in AverageLapTimeOnStint with media

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  combined_laps_df[col].fillna(median_val, inplace=True)

In [9]:
if not combined_laps_df.empty and ready_to_scale:
    print(f"Scaling the following numerical features: {numerical_mvp_features_for_model}")
    scaler = StandardScaler()
    scaled_features_values = scaler.fit_transform(combined_laps_df[numerical_mvp_features_for_model])
    
    scaled_features_df = pd.DataFrame(scaled_features_values, index=combined_laps_df.index, columns=numerical_mvp_features_for_model)
    
    for col in numerical_mvp_features_for_model:
        combined_laps_df[col] = scaled_features_df[col]
    
    print("\nHead of scaled numerical features in combined_laps_df:")
    print(combined_laps_df[numerical_mvp_features_for_model].head())
else:
    if combined_laps_df.empty:
        print("combined_laps_df is empty. Skipping scaling.")
    elif not ready_to_scale:
        print("Scaling not performed due to missing columns, NaNs, or non-numeric data in features intended for scaling.")

Scaling the following numerical features: ['LapNumber', 'TyreLife', 'LapTimeSeconds', 'Stint', 'RaceFractionCompleted', 'NumberOfPitStopsMade', 'PreviousLapTimeSeconds1', 'PreviousLapTimeSeconds2', 'Position', 'LapTimeDegradation', 'AverageLapTimeOnStint']

Head of scaled numerical features in combined_laps_df:
   LapNumber  TyreLife  LapTimeSeconds     Stint  RaceFractionCompleted  \
0  -1.604237 -0.986031        0.206299 -1.151648              -1.660467   
1  -1.604237 -0.986031        0.226665 -1.151648              -1.660467   
2  -1.604237 -1.270625        0.433500 -1.151648              -1.660467   
3  -1.604237 -0.986031        0.312996 -1.151648              -1.660467   
4  -1.604237 -1.080896        0.241186 -1.151648              -1.660467   

   NumberOfPitStopsMade  PreviousLapTimeSeconds1  PreviousLapTimeSeconds2  \
0             -1.151648                -0.067481                -0.067185   
1             -1.151648                -0.067481                -0.067185   
2    

**Reflection on Preprocessing for Modeling:**
NaN values in the core numerical features (especially those resulting from `shift()` operations like previous lap times, or incomplete stints for degradation/average calculations) have been imputed using the median. This is a common strategy to handle missing data before feeding it to a model. Subsequently, these numerical features were scaled using `StandardScaler`. Tree-based models such as Random Forest are essentially unaffected by this kind of monotonic rescaling, so the step matters mainly for the scale-sensitive models trained later (for example the neural networks). Note that both the medians and the scaler statistics are computed here on the full dataset, before the chronological split, so validation and test laps influence them; fitting them on the training races only would avoid that leakage.
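
As a hedged sketch of leakage-free preprocessing, the imputer and scaler can be wrapped in a scikit-learn `Pipeline` that is fit on training rows only and then applied to later rows. The toy frames below are invented; this is not the notebook's actual preprocessing order.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Invented training and test rows; column names mirror two of the model features.
train = pd.DataFrame({"LapTimeSeconds": [88.0, 90.0, np.nan, 92.0],
                      "TyreLife": [1.0, 2.0, 3.0, 4.0]})
test = pd.DataFrame({"LapTimeSeconds": [np.nan, 95.0],
                     "TyreLife": [5.0, 6.0]})

prep = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # medians learned from train only
    ("scale", StandardScaler()),                   # mean/std learned from train only
])

train_scaled = prep.fit_transform(train)  # fit statistics on training rows
test_scaled = prep.transform(test)        # reuse them unchanged on test rows

print("Train medians:", prep.named_steps["impute"].statistics_)
print("Scaled test rows:\n", test_scaled)
```

The missing test lap time is filled with the training median, so nothing about the test distribution feeds back into the fitted statistics.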

## 7. Model Development & Training

This section covers:
- Selecting the final set of features (`X`) and the target variable (`y`).
- Splitting the data into training, validation, and test sets chronologically by race.
- Training a Random Forest Classifier model.

### 7.1 Feature Selection for X and y

In [10]:
X = pd.DataFrame() # Initialize to avoid NameError if combined_laps_df is empty
y = pd.Series(dtype='int') # Initialize

if not combined_laps_df.empty:
    print("--- Model Development: Select Final Features for X and y ---")
    target_column = 'PittedInNextNRows'
    if target_column not in combined_laps_df.columns:
        print(f"FATAL: Target column '{target_column}' not found. Cannot proceed with model training.")
    else:
        y = combined_laps_df[target_column]
        print(f"Target variable 'y' selected: {target_column}")

        # Scaled numerical features are already in numerical_mvp_features_for_model
        # Boolean features
        boolean_features = ['Rainfall', 'IsSafetyCar', 'IsVSC']
        actual_boolean_features = [col for col in boolean_features if col in combined_laps_df.columns]
        if len(actual_boolean_features) != len(boolean_features):
            print(f"Warning: Missing boolean features: {set(boolean_features) - set(actual_boolean_features)}")

        # One-hot encoded compound features
        ohe_compound_features = [col for col in combined_laps_df.columns if col.startswith('Compound_') and combined_laps_df[col].dtype == bool]
        if not ohe_compound_features:
            print("Warning: No one-hot encoded compound features found (expected prefix 'Compound_').")

        final_feature_columns_for_X = numerical_mvp_features_for_model + actual_boolean_features + ohe_compound_features
        
        # Ensure all selected feature columns actually exist before trying to create X
        missing_final_features = [col for col in final_feature_columns_for_X if col not in combined_laps_df.columns]
        if missing_final_features:
            print(f"FATAL: The following features selected for X are missing from the DataFrame: {missing_final_features}")
        else:
            X = combined_laps_df[final_feature_columns_for_X].copy()
            print(f"Features 'X' selected. Number of features: {len(X.columns)}")
            print("Selected features for X:")
            for col_name in X.columns: print(f"  - {col_name}")
            print("\nShape of X:", X.shape)
            print("Shape of y:", y.shape)
else:
    print("combined_laps_df is empty. Skipping feature selection for X and y.")

--- Model Development: Select Final Features for X and y ---
Target variable 'y' selected: PittedInNextNRows
Features 'X' selected. Number of features: 20
Selected features for X:
  - LapNumber
  - TyreLife
  - LapTimeSeconds
  - Stint
  - RaceFractionCompleted
  - NumberOfPitStopsMade
  - PreviousLapTimeSeconds1
  - PreviousLapTimeSeconds2
  - Position
  - LapTimeDegradation
  - AverageLapTimeOnStint
  - Rainfall
  - IsSafetyCar
  - IsVSC
  - Compound_HARD
  - Compound_INTERMEDIATE
  - Compound_MEDIUM
  - Compound_SOFT
  - Compound_UNKNOWN
  - Compound_WET

Shape of X: (74605, 20)
Shape of y: (74605,)


### 7.2 Chronological Data Splitting
To prevent data leakage and get a more realistic performance estimate, we split the data chronologically. Earlier races are used for training, subsequent ones for validation, and the latest ones for testing.

In [11]:
X_train, X_val, X_test = pd.DataFrame(), pd.DataFrame(), pd.DataFrame()
y_train, y_val, y_test = pd.Series(dtype='int'), pd.Series(dtype='int'), pd.Series(dtype='int')

if not X.empty and not y.empty:
    print("--- Model Training: Data Splitting (Chronological) ---")
    if ('EventYear' in combined_laps_df.columns and 'EventRound' in combined_laps_df.columns):
        unique_races = combined_laps_df[['EventYear', 'EventRound']].drop_duplicates()
        sorted_races = unique_races.sort_values(by=['EventYear', 'EventRound'], ascending=[True, True])
        
        num_races = len(sorted_races)
        if num_races < 3:
            print(f"Warning: Only {num_races} unique races. Chronological split might not be ideal.")
        
        train_frac, val_frac = 0.7, 0.15
        train_races_count = int(np.floor(train_frac * num_races))
        val_races_count = int(np.floor(val_frac * num_races))
        test_races_count = num_races - train_races_count - val_races_count

        print(f"Total unique races: {num_races}. Splitting into: {train_races_count} train, {val_races_count} validation, {test_races_count} test races.")

        train_race_ids = sorted_races.head(train_races_count)
        val_race_ids = sorted_races.iloc[train_races_count : train_races_count + val_races_count]
        test_race_ids = sorted_races.tail(test_races_count)

        def get_indices_for_races(race_ids_df):
            if race_ids_df.empty:
                return pd.Index([])
            # Select rows of combined_laps_df whose (EventYear, EventRound) pair is in race_ids_df,
            # using a boolean mask so the original index labels are preserved.
            conditions = combined_laps_df[['EventYear', 'EventRound']].apply(tuple, axis=1).isin(race_ids_df.apply(tuple, axis=1))
            return combined_laps_df[conditions].index

        train_indices = get_indices_for_races(train_race_ids)
        val_indices = get_indices_for_races(val_race_ids)
        test_indices = get_indices_for_races(test_race_ids)

        X_train, y_train = X.loc[train_indices], y.loc[train_indices]
        X_val, y_val = X.loc[val_indices], y.loc[val_indices]
        X_test, y_test = X.loc[test_indices], y.loc[test_indices]
        
        print(f"\nData split shapes:")
        print(f"X_train: {X_train.shape}, y_train: {y_train.shape}")
        print(f"X_val: {X_val.shape}, y_val: {y_val.shape}")
        print(f"X_test: {X_test.shape}, y_test: {y_test.shape}")

        if (not train_indices.intersection(val_indices).empty or 
            not train_indices.intersection(test_indices).empty or 
            not val_indices.intersection(test_indices).empty):
            print("Error: Overlap detected between train/val/test sets!")
        else:
            print("No overlap detected between train/val/test sets based on race indices.")
    else:
        print("Error: 'EventYear' or 'EventRound' not found for chronological split. Splitting randomly.")
        from sklearn.model_selection import train_test_split
        X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.15, random_state=42, stratify=y)
        X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.1765, random_state=42, stratify=y_temp) # 0.15 / (1-0.15)
        print("Performed random split due to missing chronological keys.")
        print(f"X_train: {X_train.shape}, X_val: {X_val.shape}, X_test: {X_test.shape}")
else:
    print("X or y DataFrame is empty. Skipping data splitting.")

--- Model Training: Data Splitting (Chronological) ---
Total unique races: 68. Splitting into: 47 train, 10 validation, 11 test races.

Data split shapes:
X_train: (51027, 20), y_train: (51027,)
X_val: (11828, 20), y_val: (11828,)
X_test: (11750, 20), y_test: (11750,)
No overlap detected between train/val/test sets based on race indices.


**Reflection on Data Splitting:**
The data is split into training, validation, and test sets based on races to simulate a real-world scenario where the model predicts for future, unseen races. The sizes of these splits depend on the number of unique races in `combined_laps_df`: with the 68 races loaded here, the 70/15/15 split yields 47 training, 10 validation, and 11 test races. If fewer than 12 races were loaded, these sets might be small. Ensuring no overlap between the sets is critical.
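
The race-level split can also be expressed with a merge instead of the row-wise `apply(tuple, axis=1)`. The sketch below assumes an unnamed index (so `reset_index()` produces a column called `index`); `chronological_race_split` and the toy frame are hypothetical.

```python
import numpy as np
import pandas as pd

def chronological_race_split(laps, train_frac=0.7, val_frac=0.15):
    """Return row labels for train/val/test, split by race in chronological order."""
    keys = ["EventYear", "EventRound"]
    races = laps[keys].drop_duplicates().sort_values(keys)
    n_train = int(np.floor(train_frac * len(races)))
    n_val = int(np.floor(val_frac * len(races)))
    parts = {
        "train": races.iloc[:n_train],
        "val": races.iloc[n_train:n_train + n_val],
        "test": races.iloc[n_train + n_val:],
    }
    # reset_index() carries the original row labels through the inner merge.
    with_labels = laps.reset_index()
    return {name: pd.Index(with_labels.merge(part, on=keys)["index"])
            for name, part in parts.items()}

# Toy data: 2 seasons x 5 rounds x 4 laps each.
toy = pd.DataFrame(
    [(year, rnd, lap) for year in (2022, 2023) for rnd in range(1, 6) for lap in range(1, 5)],
    columns=["EventYear", "EventRound", "LapNumber"],
)
split = chronological_race_split(toy)
print({name: len(idx) for name, idx in split.items()})
```

Because whole races are assigned to a single split, laps from the same race never end up on both sides of a boundary.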

### 7.3 Model Training: Random Forest Classifier
We'll use a Random Forest model. Given the class imbalance noted earlier, `class_weight='balanced'` is used.

In [12]:
rf_model = None # Initialize
if not X_train.empty and not y_train.empty:
    print("--- Model Training: Random Forest Classifier ---")
    # Basic parameters; task list mentions hyperparameter tuning was completed.
    # For this script, we'll assume these are reasonable defaults or post-tuning parameters for an initial run.
    # class_weight='balanced' is good for imbalanced datasets.
    rf_model = RandomForestClassifier(n_estimators=100, random_state=42, class_weight='balanced') 
    
    print(f"Training Random Forest Classifier on X_train ({X_train.shape}) and y_train ({y_train.shape})...")
    try:
        rf_model.fit(X_train, y_train)
        print("Random Forest Classifier training complete.")
    except Exception as e:
        print(f"Error during Random Forest training: {e}")
        rf_model = None # Ensure model is None if training failed
else:
    print("Skipping Random Forest training as X_train or y_train is empty.")

--- Model Training: Random Forest Classifier ---
Training Random Forest Classifier on X_train ((51027, 20)) and y_train ((51027,))...
Random Forest Classifier training complete.


**Reflection on Model Training:**
The Random Forest model is trained on the training data. The `class_weight='balanced'` parameter helps the model pay more attention to the minority class (Pit instances). The task list indicates hyperparameter tuning (e.g., `n_estimators`, `max_depth`) was a completed task; for this notebook, we're proceeding with a standard initialization. In a full workflow, the parameters used here would ideally be the result of that tuning process.
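
To make the effect of `class_weight='balanced'` concrete, the sketch below computes the weights scikit-learn derives for a toy 90/10 label vector, using `n_samples / (n_classes * count_per_class)`. The toy labels are an assumption; the real weights depend on the training split's class counts.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels with a 90/10 imbalance, roughly like the pit / no-pit ratio.
y_toy = np.array([0] * 90 + [1] * 10)
classes = np.array([0, 1])

# Weights as scikit-learn computes them for class_weight='balanced'.
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)

# The same formula written out by hand.
manual = len(y_toy) / (len(classes) * np.bincount(y_toy))

print(dict(zip(classes.tolist(), weights.tolist())))
```

Each minority-class sample therefore counts several times more than a majority-class sample when the trees evaluate candidate splits.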

## 8. Validation Strategy & Model Evaluation

We'll evaluate the trained model using standard classification metrics on the validation set first, and then on the test set.

### 8.1 Evaluation on Validation Set

In [13]:
if rf_model and not X_val.empty and not y_val.empty:
    print("--- Model Evaluation (Validation Set) ---")
    y_pred_val = rf_model.predict(X_val)
    y_proba_val = rf_model.predict_proba(X_val)[:, 1]

    print(f"Validation Accuracy: {accuracy_score(y_val, y_pred_val):.4f}")
    print(f"Validation Precision (PittedInNextNRows=1): {precision_score(y_val, y_pred_val, zero_division=0):.4f}")
    print(f"Validation Recall (PittedInNextNRows=1): {recall_score(y_val, y_pred_val, zero_division=0):.4f}")
    print(f"Validation F1-Score (PittedInNextNRows=1): {f1_score(y_val, y_pred_val, zero_division=0):.4f}")
    try:
        roc_auc_val = roc_auc_score(y_val, y_proba_val)
        print(f"Validation ROC AUC: {roc_auc_val:.4f}")
    except ValueError as ve:
        print(f"Could not calculate ROC AUC on validation set: {ve}")
    
    print("\nValidation Confusion Matrix:")
    cm_val = confusion_matrix(y_val, y_pred_val)
    print(cm_val)
    # Optional: Display Confusion Matrix (requires matplotlib)
    # if 'plt' in locals() or 'plt' in globals(): # Check if matplotlib.pyplot was imported
    #    disp = ConfusionMatrixDisplay(confusion_matrix=cm_val, display_labels=rf_model.classes_)
    #    disp.plot()
    #    plt.title("Validation Set Confusion Matrix")
    #    plt.show()
else:
    print("Skipping validation set evaluation: Model not trained or validation data is empty.")

--- Model Evaluation (Validation Set) ---
Validation Accuracy: 0.9093
Validation Precision (PittedInNextNRows=1): 0.5277
Validation Recall (PittedInNextNRows=1): 0.2927
Validation F1-Score (PittedInNextNRows=1): 0.3765
Validation ROC AUC: 0.8296

Validation Confusion Matrix:
[[10431   290]
 [  783   324]]


**Reflection on Validation Performance:**
The model is good at identifying "Don't Pit" scenarios but struggles with the "Pit" class. Recall is low (0.29): 783 of the 1,107 actual "Pit" laps are missed as false negatives. Precision is just over 50%, and the F1-score of 0.38 reflects this weak recall far better than the 0.91 accuracy, which is dominated by the majority class. The `class_weight='balanced'` setting in Random Forest was an attempt to address the imbalance, but further tuning, feature engineering, or different models/sampling techniques might be needed to improve detection of the minority "Pit" class. The 783 false negatives are a key area to investigate.
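
The reported validation metrics can be re-derived directly from the confusion matrix printed above, which is a useful sanity check when reading these numbers:

```python
import numpy as np

# Validation confusion matrix from the output above: rows = actual, columns = predicted.
cm_val = np.array([[10431, 290],
                   [783, 324]])
tn, fp, fn, tp = cm_val.ravel()

precision = tp / (tp + fp)   # share of predicted "Pit" laps that were correct
recall = tp / (tp + fn)      # share of actual "Pit" laps that were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / cm_val.sum()

print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f} accuracy={accuracy:.4f}")
print(f"false negatives={fn}, false positives={fp}")
```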

### 8.2 Final Model Evaluation on Test Set


In [14]:
if rf_model and not X_test.empty and not y_test.empty:
    print("--- Final Model Evaluation (Test Set) ---")
    y_pred_test = rf_model.predict(X_test) 
    y_proba_test = rf_model.predict_proba(X_test)[:, 1]

    print(f"Test Accuracy: {accuracy_score(y_test, y_pred_test):.4f}")
    print(f"Test Precision (PittedInNextNRows=1): {precision_score(y_test, y_pred_test, zero_division=0):.4f}")
    print(f"Test Recall (PittedInNextNRows=1): {recall_score(y_test, y_pred_test, zero_division=0):.4f}")
    print(f"Test F1-Score (PittedInNextNRows=1): {f1_score(y_test, y_pred_test, zero_division=0):.4f}")
    try:
        roc_auc_test = roc_auc_score(y_test, y_proba_test)
        print(f"Test ROC AUC: {roc_auc_test:.4f}")
    except ValueError as ve:
        print(f"Could not calculate ROC AUC on test set: {ve}")
    
    print("\nTest Confusion Matrix:")
    cm_test = confusion_matrix(y_test, y_pred_test)
    print(cm_test)
    # Optional: Display Confusion Matrix
    # if 'plt' in locals() or 'plt' in globals():
    #    disp_test = ConfusionMatrixDisplay(confusion_matrix=cm_test, display_labels=rf_model.classes_)
    #    disp_test.plot()
    #    plt.title("Test Set Confusion Matrix")
    #    plt.show()
else:
    print("Skipping test set evaluation: Model not trained or test data is empty.")

--- Final Model Evaluation (Test Set) ---
Test Accuracy: 0.8902
Test Precision (PittedInNextNRows=1): 0.3500
Test Recall (PittedInNextNRows=1): 0.3708
Test F1-Score (PittedInNextNRows=1): 0.3601
Test ROC AUC: 0.7927

Test Confusion Matrix:
[[10097   674]
 [  616   363]]


**Reflection on Test Set Performance:**
Compared with the validation set, test-set precision for the "Pit" class drops sharply (from 0.53 to 0.35) while recall rises (from 0.29 to 0.37), so the F1-score is only slightly lower (0.377 versus 0.360). ROC AUC also falls, from 0.830 to 0.793, meaning the model ranks test laps less well overall. The model flags far more laps as "Pit" on the test set (1,037 versus 614 on validation), which suggests the default 0.5 classification threshold does not suit both splits equally well; choosing the threshold on validation data instead of relying on the default may give a better precision/recall balance.
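
One way to act on the threshold observation is to pick the cut-off that maximizes F1 on validation probabilities and then apply it unchanged to the test set. The sketch below uses invented scores, and `best_f1_threshold` is a hypothetical helper, not part of the notebook.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def best_f1_threshold(y_true, y_proba):
    """Return (threshold, f1) for the probability cut-off with the highest F1."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_proba)
    # The final precision/recall pair has no matching threshold, so drop it.
    p, r = precision[:-1], recall[:-1]
    denom = p + r
    f1 = np.divide(2 * p * r, denom, out=np.zeros_like(denom), where=denom > 0)
    best = int(np.argmax(f1))
    return float(thresholds[best]), float(f1[best])

# Invented validation labels and scores: every positive scores below 0.5,
# so the default threshold would predict no pit stops at all.
y_val_toy = np.array([0, 0, 0, 0, 1, 1])
proba_val_toy = np.array([0.10, 0.20, 0.30, 0.35, 0.40, 0.45])

threshold, f1_at_threshold = best_f1_threshold(y_val_toy, proba_val_toy)
pred_tuned = (proba_val_toy >= threshold).astype(int)
print(f"chosen threshold={threshold:.2f}, F1={f1_at_threshold:.2f}")
```

The threshold should be selected on the validation set only; tuning it on the test set would make the final test metrics optimistic.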

In [15]:
from sklearn.ensemble import RandomForestClassifier

# Potentially improved hyperparameters
# These are suggestions and ideally should be tuned using GridSearchCV or RandomizedSearchCV
rf_model = RandomForestClassifier(
    n_estimators=200,        # Increased number of trees; often more is better up to a point.
    max_depth=15,            # Limits the maximum depth of each tree. Helps prevent overfitting.
                             # None means nodes expand until all leaves are pure or min_samples_split is met.
    min_samples_split=10,    # The minimum number of samples required to split an internal node.
                             # Higher values can prevent overfitting by making the model more general.
    min_samples_leaf=5,      # The minimum number of samples required to be at a leaf node.
                             # Similar to min_samples_split, helps in smoothing the model.
    max_features='sqrt',     # The number of features to consider when looking for the best split.
                             # 'sqrt' (sqrt(n_features)) is a common choice for classification.
                             # 'log2' (log2(n_features)) is another option.
    class_weight='balanced', # Retained from original; good for imbalanced datasets.
                             # Could also try 'balanced_subsample' or a custom dict.
    random_state=42,         # Ensures reproducibility.
    oob_score=True,          # Out-of-bag score. Uses trees not trained on a sample to estimate
                             # its generalization accuracy. Useful as a quick cross-validation metric.
    n_jobs=-1                # Uses all available processor cores for training, can speed up training significantly for large datasets/many trees.
)

if not X_train.empty and not y_train.empty:
     print("--- Model Training: Random Forest Classifier with tuned parameters ---")
     print(f"Training Random Forest Classifier on X_train ({X_train.shape}) and y_train ({y_train.shape})...")
     try:
         rf_model.fit(X_train, y_train)
         print("Random Forest Classifier training complete.")
         if hasattr(rf_model, 'oob_score_') and rf_model.oob_score_: # Check if oob_score was calculated
             print(f"Out-of-Bag Score: {rf_model.oob_score_:.4f}")
     except Exception as e:
         print(f"Error during Random Forest training: {e}")
         rf_model = None
else:
     print("Skipping Random Forest training as X_train or y_train is empty.")


--- Model Training: Random Forest Classifier with tuned parameters ---
Training Random Forest Classifier on X_train ((51027, 20)) and y_train ((51027,))...
Random Forest Classifier training complete.
Out-of-Bag Score: 0.9070


In [17]:
from sklearn.metrics import (accuracy_score, precision_score, recall_score, 
                             f1_score, roc_auc_score, confusion_matrix,
                             ConfusionMatrixDisplay) # Ensure ConfusionMatrixDisplay is imported if you use it

# Assuming 'rf_model' is your RandomForestClassifier trained with the new hyperparameters,
# and X_test, y_test are your prepared test datasets.

if rf_model and not X_test.empty and not y_test.empty:
    print("--- Final Model Evaluation (Test Set) ---")
    
    # Generate predictions on the test set
    y_pred_test = rf_model.predict(X_test) 
    
    # Generate probability estimates for the positive class (for ROC AUC)
    # Ensure your model was trained and can provide probabilities (usually the case for RandomForestClassifier)
    try:
        y_proba_test = rf_model.predict_proba(X_test)[:, 1]
    except AttributeError:
        print("Warning: predict_proba not available for this model. ROC AUC cannot be calculated.")
        y_proba_test = None # Or handle as appropriate

    # Calculate and print metrics
    print(f"Test Accuracy: {accuracy_score(y_test, y_pred_test):.4f}")
    # Specify positive label if necessary, e.g. pos_label=1, if your classes are not 0 and 1 or if 1 is not the default positive.
    # For PittedInNextNRows, 1 (pitted) is likely the positive class.
    print(f"Test Precision (PittedInNextNRows=1): {precision_score(y_test, y_pred_test, zero_division=0):.4f}")
    print(f"Test Recall (PittedInNextNRows=1): {recall_score(y_test, y_pred_test, zero_division=0):.4f}")
    print(f"Test F1-Score (PittedInNextNRows=1): {f1_score(y_test, y_pred_test, zero_division=0):.4f}")
    
    if y_proba_test is not None:
        try:
            roc_auc_test = roc_auc_score(y_test, y_proba_test)
            print(f"Test ROC AUC: {roc_auc_test:.4f}")
        except ValueError as ve:
            # This can happen if y_test contains only one class after splitting,
            # or if y_proba_test is not valid.
            print(f"Could not calculate ROC AUC on test set: {ve}")
    
    print("\nTest Confusion Matrix:")
    # Ensure rf_model.classes_ is available; it should be after fitting.
    # If not, you might need to specify labels=[0, 1] or similar.
    cm_test = confusion_matrix(y_test, y_pred_test, labels=rf_model.classes_ if hasattr(rf_model, 'classes_') else None)
    print(cm_test)
    
    # Optional: Display Confusion Matrix using matplotlib
    # Ensure you have imported matplotlib.pyplot as plt
    # For example:
    # import matplotlib.pyplot as plt
    # from sklearn.metrics import ConfusionMatrixDisplay
    #
    # if 'plt' in locals() or 'plt' in globals():
    #    if hasattr(rf_model, 'classes_'):
    #        disp_test = ConfusionMatrixDisplay(confusion_matrix=cm_test, display_labels=rf_model.classes_)
    #        disp_test.plot(cmap=plt.cm.Blues) # Using a colormap
    #        plt.title("Test Set Confusion Matrix")
    #        plt.show()
    #    else:
    #        print("Cannot display confusion matrix: model classes not found.")

else:
    if not rf_model:
        print("Skipping test set evaluation: Model (rf_model) is not trained or not available.")
    if X_test.empty:
        print("Skipping test set evaluation: X_test is empty.")
    if y_test.empty:
        print("Skipping test set evaluation: y_test is empty.")


--- Final Model Evaluation (Test Set) ---
Test Accuracy: 0.8171
Test Precision (PittedInNextNRows=1): 0.2550
Test Recall (PittedInNextNRows=1): 0.6221
Test F1-Score (PittedInNextNRows=1): 0.3617
Test ROC AUC: 0.8088

Test Confusion Matrix:
[[8992 1779]
 [ 370  609]]


In [16]:
# Export data for the standalone script
import pickle

# Prepare data dictionary
data_export = {
    'X_train': X_train,
    'y_train': y_train,
    'X_val': X_val,
    'y_val': y_val,
    'X_test': X_test,
    'y_test': y_test
}

# Save to pickle file
with open('f1_data.pkl', 'wb') as f:
    pickle.dump(data_export, f)

print("✅ Data exported to f1_data.pkl")
print(f"Data shapes: Train{X_train.shape}, Val{X_val.shape}, Test{X_test.shape}")

✅ Data exported to f1_data.pkl
Data shapes: Train(51027, 20), Val(11828, 20), Test(11750, 20)
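
For the standalone script, the exported dictionary can be read back with `pickle.load`. The round trip below is a minimal sketch on toy frames written to a temporary directory; the file name `f1_data_toy.pkl` is an assumption used only here.

```python
import pickle
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-ins for the real splits.
toy_export = {
    "X_train": pd.DataFrame({"TyreLife": [1.0, 2.0]}),
    "y_train": pd.Series([0, 1], name="PittedInNextNRows"),
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "f1_data_toy.pkl"
    with open(path, "wb") as f:
        pickle.dump(toy_export, f)
    with open(path, "rb") as f:
        loaded = pickle.load(f)

print(sorted(loaded.keys()), loaded["X_train"].shape)
```

Pickle files should only be loaded from trusted sources, and the loading environment needs a compatible pandas version.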


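The adjusted hyperparameters did not improve the model's accuracy, which fell from 0.8902 to 0.8171 on the test set. Instead, they traded precision (0.35 down to 0.26) for recall (0.37 up to 0.62), leaving the F1-score essentially unchanged (0.3601 versus 0.3617) and raising ROC AUC slightly (0.7927 to 0.8088). The next step will be to train a deep learning algorithm (CNN or RL) to compare the results.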

In [18]:
# -----------------------------------------------
# XGBoost baseline  (fixed 600 trees, no early stopping)
# -----------------------------------------------
import xgboost as xgb
from xgboost import XGBClassifier
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix)

xgb_model = XGBClassifier(
    n_estimators     = 600,
    learning_rate    = 0.05,
    max_depth        = 6,
    subsample        = 0.8,
    colsample_bytree = 0.8,
    objective        = 'binary:logistic',
    eval_metric      = 'logloss',
    n_jobs           = -1,
    random_state     = 42,
    scale_pos_weight = (y_train.shape[0] / y_train.sum())   # handle imbalance
)

print("Training XGBoost …")
xgb_model.fit(X_train, y_train)

def eval_split(name, Xs, ys):
    prob = xgb_model.predict_proba(Xs)[:, 1]
    pred = (prob >= 0.5).astype(int)
    print(f"\n{name} results:")
    print(f" accuracy : {accuracy_score(ys, pred):.4f}")
    print(f" precision: {precision_score(ys, pred, zero_division=0):.4f}")
    print(f" recall   : {recall_score(ys, pred, zero_division=0):.4f}")
    print(f" F1-score : {f1_score(ys, pred, zero_division=0):.4f}")
    print(f" ROC-AUC  : {roc_auc_score(ys, prob):.4f}")
    print(" confusion:\n", confusion_matrix(ys, pred))

eval_split("Validation", X_val, y_val)
eval_split("Test",        X_test, y_test)

Training XGBoost …

Validation results:
 accuracy : 0.8206
 precision: 0.3040
 recall   : 0.7109
 F1-score : 0.4259
 ROC-AUC  : 0.8501
 confusion:
 [[8919 1802]
 [ 320  787]]

Test results:
 accuracy : 0.7618
 precision: 0.2135
 recall   : 0.6925
 F1-score : 0.3264
 ROC-AUC  : 0.8072
 confusion:
 [[8273 2498]
 [ 301  678]]
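
On `scale_pos_weight`: the XGBoost documentation suggests `sum(negative instances) / sum(positive instances)` as a typical value, while the cell above uses total rows divided by positives, which is always exactly one larger. A minimal sketch of the two ratios, using a hypothetical helper and toy labels:

```python
import numpy as np

def pos_weight_ratios(y):
    """Return (negatives/positives, total/positives) for a 0/1 label array."""
    y = np.asarray(y)
    n_pos = int(y.sum())
    n_neg = int(y.size - n_pos)
    return n_neg / n_pos, y.size / n_pos

# Hypothetical labels: 1 positive in every 10 rows
y_demo = np.array([1] + [0] * 9)
neg_over_pos, total_over_pos = pos_weight_ratios(y_demo)
print(neg_over_pos, total_over_pos)  # 9.0 10.0
```

With roughly 10% positives the difference is small, but it tilts the model slightly further toward recall than the conventional ratio would.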


In [45]:
# --- Cell 1 ▸ Randomised hyper-parameter search (fixed) ----------------------
import numpy as np, xgboost as xgb
from xgboost import XGBClassifier
from sklearn.model_selection import RandomizedSearchCV, GroupKFold

groups = combined_laps_df.loc[X_train.index,
                              ['EventYear','EventRound']].apply(tuple, axis=1).values

param_dist = {
    "n_estimators"     : np.arange(400, 2001, 200),
    "learning_rate"    : np.linspace(0.01, 0.15, 15),
    "max_depth"        : np.arange(3, 9),
    "min_child_weight" : np.arange(1, 9),
    "subsample"        : np.linspace(0.6, 1.0, 5),
    "colsample_bytree" : np.linspace(0.6, 1.0, 5),
    "gamma"            : [0, 0.5, 1, 2],
    "reg_lambda"       : [1, 2, 3],
    "reg_alpha"        : [0, 0.5, 1]
}

base = XGBClassifier(
    objective        = 'binary:logistic',
    eval_metric      = 'logloss',          # internal metric
    n_jobs           = -1,
    random_state     = 42,
    scale_pos_weight = (y_train.shape[0]/y_train.sum())
)

cv = GroupKFold(n_splits=5)
rs = RandomizedSearchCV(
    estimator  = base,
    param_distributions = param_dist,
    n_iter     = 80,
    cv         = cv.split(X_train, y_train, groups),
    scoring    = "average_precision",      # <─ built-in AP scorer
    verbose    = 2,
    n_jobs     = -1,
    refit      = True
)

rs.fit(X_train, y_train, groups=groups)
best_xgb = rs.best_estimator_
print("Best params:", rs.best_params_)

Fitting 5 folds for each of 80 candidates, totalling 400 fits
[CV] END colsample_bytree=0.8, gamma=2, learning_rate=0.019999999999999997, max_depth=7, min_child_weight=5, n_estimators=1000, reg_alpha=1, reg_lambda=3, subsample=0.7; total time=   3.5s
[CV] END colsample_bytree=0.8, gamma=2, learning_rate=0.019999999999999997, max_depth=7, min_child_weight=5, n_estimators=1000, reg_alpha=1, reg_lambda=3, subsample=0.7; total time=   3.5s
[CV] END colsample_bytree=0.8, gamma=2, learning_rate=0.019999999999999997, max_depth=7, min_child_weight=5, n_estimators=1000, reg_alpha=1, reg_lambda=3, subsample=0.7; total time=   3.5s
[CV] END colsample_bytree=0.8, gamma=2, learning_rate=0.019999999999999997, max_depth=7, min_child_weight=5, n_estimators=1000, reg_alpha=1, reg_lambda=3, subsample=0.7; total time=   3.5s
[CV] END colsample_bytree=0.8, gamma=2, learning_rate=0.019999999999999997, max_depth=7, min_child_weight=5, n_estimators=1000, reg_alpha=1, reg_lambda=3, subsample=0.7; total time= 
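
The search above groups folds by `(EventYear, EventRound)` so that laps from one race never land on both sides of a split. The sketch below shows that property on toy data. The string encoding of the race key and all names are assumptions of this example, not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical toy data: 12 laps from 4 races, keyed by (year, round)
X_demo = np.arange(24).reshape(12, 2)
y_demo = np.array([0, 0, 1] * 4)
race_keys = [(2022, 1)] * 3 + [(2022, 2)] * 3 + [(2023, 1)] * 3 + [(2023, 2)] * 3

# One 1-D group label per row; each tuple is encoded as a string here
groups_demo = np.array([f"{year}-{rnd}" for year, rnd in race_keys])

fold_overlaps = []
for train_idx, val_idx in GroupKFold(n_splits=4).split(X_demo, y_demo, groups_demo):
    shared = set(groups_demo[train_idx]) & set(groups_demo[val_idx])
    fold_overlaps.append(len(shared))
    print(sorted(set(groups_demo[val_idx])), "held out; shared races:", len(shared))
```

Every fold should report zero shared races, which is what keeps race-specific conditions from leaking into the validation score.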

In [46]:
# %% [cell 3] -------------- optimise probability threshold
from sklearn.metrics import precision_recall_curve, f1_score, accuracy_score, precision_score, recall_score, roc_auc_score, confusion_matrix

val_prob = best_xgb.predict_proba(X_val)[:,1]
prec, rec, thr = precision_recall_curve(y_val, val_prob)
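# prec/rec have one more entry than thr; drop the final (precision=1, recall=0) point
f1 = 2*prec[:-1]*rec[:-1]/(prec[:-1]+rec[:-1]+1e-9)
best_thr = thr[np.argmax(f1)]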

print("Best threshold:", best_thr)

def evaluate(proba, y, name):
    pred = (proba>=best_thr).astype(int)
    print(f"\n{name}")
    print("accuracy :", accuracy_score(y, pred))
    print("precision:", precision_score(y, pred, zero_division=0))
    print("recall   :", recall_score(y, pred, zero_division=0))
    print("F1       :", f1_score(y, pred, zero_division=0))
    print("ROC-AUC  :", roc_auc_score(y, proba))
    print("confusion:\n", confusion_matrix(y, pred))

evaluate(val_prob, y_val, "Val")
evaluate(best_xgb.predict_proba(X_test)[:,1], y_test, "Test")

Best threshold: 0.6606423

Val
accuracy : 0.8334329250074693
precision: 0.34282807731434384
recall   : 0.41811414392059554
F1       : 0.3767467859139184
ROC-AUC  : 0.70768067617866
confusion:
 [[5242  646]
 [ 469  337]]

Test
accuracy : 0.868073567554822
precision: 0.42845911949685533
recall   : 0.5816435432230523
F1       : 0.4934359438660027
ROC-AUC  : 0.8593946106357231
confusion:
 [[6818  727]
 [ 392  545]]
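
The threshold search can be tried in isolation. This is a minimal sketch on synthetic scores (the data, seed, and names are made up for illustration); it picks the threshold that maximises F1 along the precision-recall curve and compares it with the default cut-off of 0.5.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_recall_curve

rng = np.random.default_rng(0)

# Hypothetical validation scores: ~10% positives, positives score higher on average
y_demo = (rng.random(2000) < 0.10).astype(int)
scores = np.clip(rng.normal(0.30 + 0.25 * y_demo, 0.15), 0.0, 1.0)

prec, rec, thr = precision_recall_curve(y_demo, scores)
# prec and rec have one more entry than thr; drop the final (precision=1, recall=0) point
f1_curve = 2 * prec[:-1] * rec[:-1] / (prec[:-1] + rec[:-1] + 1e-9)
best_thr_demo = float(thr[np.argmax(f1_curve)])

f1_at_best = f1_score(y_demo, (scores >= best_thr_demo).astype(int))
f1_at_half = f1_score(y_demo, (scores >= 0.5).astype(int))
print(f"best threshold={best_thr_demo:.3f}  F1={f1_at_best:.3f}  (F1 at 0.5: {f1_at_half:.3f})")
```

Because the threshold is tuned on the same split it is scored on, the F1 at the chosen threshold is optimistic for that split; only the test-set figure is an unbiased estimate.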


In [19]:
# --- SMOTE + XGBoost  (using imblearn Pipeline) ------------------------------
from imblearn.over_sampling import SMOTE
from imblearn.pipeline      import Pipeline        # ← imblearn pipeline
from xgboost                import XGBClassifier
from sklearn.metrics        import (accuracy_score, precision_score, recall_score,
                                    f1_score, roc_auc_score, confusion_matrix)

sm  = SMOTE(random_state=42, k_neighbors=5)
# Named xgb_clf so it does not shadow the `xgb` module imported in other cells
xgb_clf = XGBClassifier(
        n_estimators     = 800,
        learning_rate    = 0.05,
        max_depth        = 6,
        subsample        = 0.8,
        colsample_bytree = 0.8,
        objective        = 'binary:logistic',
        eval_metric      = 'logloss',
        n_jobs           = -1,
        random_state     = 42,
        scale_pos_weight = 1        # disable because we oversample
     )

pipe = Pipeline([('smote', sm), ('model', xgb_clf)])
pipe.fit(X_train, y_train)

def report(title, prob, true):
    pred = (prob >= 0.5).astype(int)
    print(f"\n{title}:  acc={accuracy_score(true,pred):.4f}  "
          f"prec={precision_score(true,pred,zero_division=0):.4f}  "
          f"rec={recall_score(true,pred,zero_division=0):.4f}  "
          f"F1={f1_score(true,pred,zero_division=0):.4f}  "
          f"AUC={roc_auc_score(true,prob):.4f}")
    print("confusion:\n", confusion_matrix(true,pred))

val_prob  = pipe.predict_proba(X_val)[:,1]
test_prob = pipe.predict_proba(X_test)[:,1]
report("SMOTE-XGB Validation", val_prob, y_val)
report("SMOTE-XGB Test",       test_prob, y_test)


SMOTE-XGB Validation:  acc=0.8846  prec=0.4159  rec=0.5763  F1=0.4832  AUC=0.8440
confusion:
 [[9825  896]
 [ 469  638]]

SMOTE-XGB Test:  acc=0.8378  prec=0.2686  rec=0.5495  F1=0.3608  AUC=0.8030
confusion:
 [[9306 1465]
 [ 441  538]]
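
SMOTE creates synthetic minority rows by interpolating between a minority sample and one of its nearest minority neighbours; the imblearn `Pipeline` applies this resampling only during `fit`, so validation and test rows are left untouched. The NumPy sketch below illustrates the interpolation step only. It is not the imblearn implementation, and the helper name and toy data are hypothetical.

```python
import numpy as np

def smote_like_samples(X_min, n_new, k=5, seed=42):
    """Create n_new synthetic rows by interpolating between minority-class neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    k = min(k, len(X_min) - 1)

    # Pairwise Euclidean distances within the minority class
    diff = X_min[:, None, :] - X_min[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)               # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]

    base = rng.integers(0, len(X_min), size=n_new)            # seed row per synthetic sample
    pick = neighbours[base, rng.integers(0, k, size=n_new)]   # one of its k neighbours
    gap = rng.random((n_new, 1))                              # position along the segment
    return X_min[base] + gap * (X_min[pick] - X_min[base])

# Hypothetical minority-class rows with two features
X_minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
                       [1.1, 1.3], [0.9, 0.7], [1.3, 1.0]])
X_synth = smote_like_samples(X_minority, n_new=10)
print(X_synth.shape)  # (10, 2)
```

Every synthetic row lies on a segment between two real minority rows, which is why SMOTE can blur the class boundary when minority points sit close to the majority class.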


In [49]:
# %% [cell 5] -------------- LightGBM baseline
import lightgbm as lgb
lgbm = lgb.LGBMClassifier(
    n_estimators     = 1200,
    learning_rate    = 0.05,
    num_leaves       = 64,
    subsample        = 0.8,
    colsample_bytree = 0.8,
    objective        = 'binary',
    n_jobs           = -1,
    class_weight     = {0:1, 1:(y_train.shape[0]/y_train.sum())}
)
lgbm.fit(X_train, y_train,
         eval_set=[(X_val, y_val)],
         eval_metric='average_precision',   # LightGBM's PR-AUC metric ('aucpr' is the XGBoost name)
         callbacks=[lgb.early_stopping(60, verbose=False)])

[LightGBM] [Info] Number of positive: 3247, number of negative: 29576
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000781 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1728
[LightGBM] [Info] Number of data points in the train set: 32823, number of used features: 19
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.526018 -> initscore=0.104166
[LightGBM] [Info] Start training from score 0.104166
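
The `pavg=0.526018` line in the log can be reproduced by hand. Assuming LightGBM's starting score is the log-odds of the weighted positive share, the class weight of total/positives makes the positives slightly outweigh the negatives. The counts below are taken from the log; the variable names are illustrative.

```python
import math

n_pos, n_neg = 3247, 29576          # counts from the LightGBM log above
n_total = n_pos + n_neg             # 32823 rows

w_pos = n_total / n_pos             # weight given to class 1 via class_weight
pavg = (n_pos * w_pos) / (n_pos * w_pos + n_neg * 1.0)   # weighted positive share
initscore = math.log(pavg / (1.0 - pavg))                # log-odds of that share

print(f"pavg={pavg:.6f} initscore={initscore:.6f}")

# With negatives/positives as the weight instead, both classes carry equal total weight
w_alt = n_neg / n_pos
pavg_alt = (n_pos * w_alt) / (n_pos * w_alt + n_neg)
print(f"pavg with neg/pos weighting={pavg_alt:.3f}")
```

Both values agree with the log to the printed precision, and the alternative weighting lands on a weighted share of 0.5.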


In [51]:
# SAVE the final XGB model and decision threshold ----------------------------
import joblib, json, pathlib, datetime as dt

OUT_DIR = pathlib.Path("artifacts");  OUT_DIR.mkdir(exist_ok=True)

joblib.dump(best_xgb, OUT_DIR / "xgb_pitstop_model.joblib")
with open(OUT_DIR / "xgb_threshold.json", "w") as fp:
    json.dump({"threshold": float(best_thr),   # tuned value; float() because NumPy float32 is not JSON-serialisable
               "generated": dt.datetime.now().isoformat()}, fp)

print("Model and threshold saved to", OUT_DIR.resolve())

Model and threshold saved to /Users/dragiychev/Documents/Fontys S4 AI/FastF1/artifacts
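
At inference time the stored threshold has to be read back and applied to the positive-class probabilities. The round trip below is a minimal sketch: the helper names are hypothetical, and it writes to a temporary directory rather than `artifacts/`.

```python
import json
import pathlib
import tempfile

import numpy as np

def save_threshold(path, threshold):
    """Write the decision threshold as JSON (hypothetical helper)."""
    pathlib.Path(path).write_text(json.dumps({"threshold": float(threshold)}))

def predict_with_threshold(proba, threshold_path):
    """Turn positive-class probabilities into 0/1 labels using the stored threshold."""
    threshold = json.loads(pathlib.Path(threshold_path).read_text())["threshold"]
    return (np.asarray(proba) >= threshold).astype(int)

# Round trip in a temporary directory so nothing in artifacts/ is touched
with tempfile.TemporaryDirectory() as tmp:
    thr_path = pathlib.Path(tmp) / "xgb_threshold.json"
    save_threshold(thr_path, np.float32(0.661))
    labels = predict_with_threshold([0.10, 0.66, 0.70, 0.95], thr_path)

print(labels)  # [0 0 1 1]
```

The explicit `float(...)` matters because the `json` module cannot serialise a NumPy `float32` scalar directly.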


In [20]:
# -----------------------------------------------
# CNN Model Implementation for Sequential F1 Data
# -----------------------------------------------

print("=== CNN Model Implementation ===")
print("Preparing data for CNN sequential input...")

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Conv1D, MaxPooling1D, GlobalMaxPooling1D, 
                                   Dense, Dropout, BatchNormalization, Flatten)
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
from sklearn.utils import class_weight  # compute_class_weight lives in scikit-learn, not Keras
import numpy as np

# Set random seeds for reproducibility
tf.random.set_seed(42)
np.random.seed(42)

# -----------------------------------------------
# Data Preparation for CNN (Sequential Windows)
# -----------------------------------------------

def create_sequential_data(X, y, race_info, sequence_length=10):
    """
    Create sequential windows of lap data for CNN input.
    Each sample will be a sequence of 'sequence_length' consecutive laps.
    race_info must be row-aligned with X and y.
    """
    X_seq = []
    y_seq = []
    X_values = X.to_numpy()
    y_values = y.to_numpy()
    
    # Group by race and driver to maintain proper sequences.
    # .indices maps each (RaceId, DriverNumber) key to the positional row numbers of that group.
    race_groups = race_info.groupby(['RaceId', 'DriverNumber']).indices
    
    for (race_id, driver_num), positions in race_groups.items():
        if len(positions) < sequence_length:
            continue  # Skip if not enough laps for this driver in this race
            
        # Keep rows in their original order (assumed to be lap order)
        positions = np.sort(positions)
        
        # Create sliding windows
        for i in range(len(positions) - sequence_length + 1):
            window = positions[i:i + sequence_length]
            
            X_seq.append(X_values[window])
            y_seq.append(y_values[window[-1]])  # Label of the last lap in the sequence
    
    return np.array(X_seq), np.array(y_seq)

# Prepare race information for grouping
# Note: This assumes you have race and driver information. 
# If not available, we'll create a simpler version based on data structure.

print("Creating sequential data windows...")

# Check if we have race/driver info in the original data
if 'combined_laps_df' in locals() and not combined_laps_df.empty:
    # Try to extract race and driver info from combined_laps_df
    if 'RaceId' in combined_laps_df.columns and 'DriverNumber' in combined_laps_df.columns:
        race_info = combined_laps_df[['RaceId', 'DriverNumber']].copy()
        race_info.index = combined_laps_df.index
    else:
        # Create mock race/driver info if not available
        print("Warning: RaceId/DriverNumber not found. Creating sequential data based on row order.")
        # This is a simplified approach - in real scenario you'd want proper race/driver grouping
        n_drivers_per_race = 20  # Typical F1 grid
        n_laps_estimate = 60     # Typical race length
        
        race_info = pd.DataFrame()
        race_info['RaceId'] = np.repeat(range(len(X_train) // (n_drivers_per_race * n_laps_estimate) + 1), 
                                      n_drivers_per_race * n_laps_estimate)[:len(X_train)]
        race_info['DriverNumber'] = np.tile(np.repeat(range(n_drivers_per_race), n_laps_estimate), 
                                          len(X_train) // (n_drivers_per_race * n_laps_estimate) + 1)[:len(X_train)]
        race_info.index = X_train.index

# Create sequential training data
sequence_length = 8  # Use 8 consecutive laps to predict pit decision
print(f"Using sequence length: {sequence_length} laps")

if len(X_train) > 0:
    # For training data
    try:
        X_train_seq, y_train_seq = create_sequential_data(
            X_train, y_train, 
            race_info.loc[X_train.index], 
            sequence_length
        )
        print(f"Training sequences created: {X_train_seq.shape}")
        
        # For validation data  
        X_val_seq, y_val_seq = create_sequential_data(
            X_val, y_val,
            race_info.loc[X_val.index],
            sequence_length
        )
        print(f"Validation sequences created: {X_val_seq.shape}")
        
        # For test data
        X_test_seq, y_test_seq = create_sequential_data(
            X_test, y_test,
            race_info.loc[X_test.index], 
            sequence_length
        )
        print(f"Test sequences created: {X_test_seq.shape}")
        
    except Exception as e:
        print(f"Error creating sequential data: {e}")
        print("Falling back to simple sequential windowing...")
        
        # Fallback: Simple sliding window approach
        def simple_sequence_data(X, y, seq_len):
            X_seq, y_seq = [], []
            for i in range(len(X) - seq_len + 1):
                X_seq.append(X.iloc[i:i+seq_len].values)
                y_seq.append(y.iloc[i+seq_len-1])
            return np.array(X_seq), np.array(y_seq)
        
        # Apply to each split
        X_train_seq, y_train_seq = simple_sequence_data(X_train, y_train, sequence_length)
        X_val_seq, y_val_seq = simple_sequence_data(X_val, y_val, sequence_length)
        X_test_seq, y_test_seq = simple_sequence_data(X_test, y_test, sequence_length)
        
        print(f"Fallback sequences - Train: {X_train_seq.shape}, Val: {X_val_seq.shape}, Test: {X_test_seq.shape}")

else:
    print("No training data available for CNN")
    X_train_seq = y_train_seq = None

# -----------------------------------------------
# CNN Model Architecture
# -----------------------------------------------

def create_cnn_model(input_shape, num_classes=1):
    """
    Create a 1D CNN model for F1 lap sequence classification.
    
    Architecture:
    - Multiple 1D convolutional layers to detect temporal patterns
    - Pooling layers to reduce dimensionality
    - Dropout for regularization
    - Dense layers for final classification
    """
    model = Sequential([
        # First convolutional block
        # padding='same' keeps the sequence length; unpadded kernels would shrink
        # an 8-lap window to nothing before the third block
        Conv1D(filters=64, kernel_size=3, padding='same', activation='relu', input_shape=input_shape),
        BatchNormalization(),
        Conv1D(filters=64, kernel_size=3, padding='same', activation='relu'),
        MaxPooling1D(pool_size=2),
        Dropout(0.25),
        
        # Second convolutional block
        Conv1D(filters=128, kernel_size=3, padding='same', activation='relu'),
        BatchNormalization(),
        Conv1D(filters=128, kernel_size=3, padding='same', activation='relu'),
        MaxPooling1D(pool_size=2),
        Dropout(0.25),
        
        # Third convolutional block
        Conv1D(filters=256, kernel_size=3, padding='same', activation='relu'),
        BatchNormalization(),
        GlobalMaxPooling1D(),  # Global pooling instead of flattening
        
        # Dense layers
        Dense(512, activation='relu'),
        BatchNormalization(),
        Dropout(0.5),
        Dense(256, activation='relu'),
        Dropout(0.3),
        Dense(128, activation='relu'),
        Dropout(0.2),
        
        # Output layer
        Dense(num_classes, activation='sigmoid')  # Binary classification
    ])
    
    return model

# -----------------------------------------------
# Model Training and Evaluation
# -----------------------------------------------

if X_train_seq is not None and len(X_train_seq) > 0:
    print("\n=== CNN Model Training ===")
    
    # Define model architecture
    input_shape = (X_train_seq.shape[1], X_train_seq.shape[2])  # (sequence_length, features)
    print(f"CNN Input shape: {input_shape}")
    
    cnn_model = create_cnn_model(input_shape)
    
    # Compile model
    cnn_model.compile(
        optimizer=Adam(learning_rate=0.001),
        loss='binary_crossentropy',
        metrics=['accuracy',
                 tf.keras.metrics.Precision(name='precision'),
                 tf.keras.metrics.Recall(name='recall')]
    )
    
    # Model summary
    print("\nCNN Model Architecture:")
    cnn_model.summary()
    
    # Calculate class weights for imbalanced data
    class_weights = class_weight.compute_class_weight(
        'balanced',
        classes=np.unique(y_train_seq),
        y=y_train_seq
    )
    class_weight_dict = {i: class_weights[i] for i in range(len(class_weights))}
    print(f"Class weights: {class_weight_dict}")
    
    # Define callbacks
    callbacks = [
        EarlyStopping(
            monitor='val_loss',
            patience=15,
            restore_best_weights=True,
            verbose=1
        ),
        ReduceLROnPlateau(
            monitor='val_loss',
            factor=0.5,
            patience=8,
            min_lr=1e-7,
            verbose=1
        )
    ]
    
    # Train the model
    print("\nStarting CNN training...")
    history = cnn_model.fit(
        X_train_seq, y_train_seq,
        validation_data=(X_val_seq, y_val_seq),
        epochs=100,
        batch_size=32,
        class_weight=class_weight_dict,
        callbacks=callbacks,
        verbose=1
    )
    
    print("CNN training completed!")
    
    # -----------------------------------------------
    # Model Evaluation
    # -----------------------------------------------
    
    print("\n=== CNN Model Evaluation ===")
    
    # Evaluate on validation set
    print("\nValidation Set Evaluation:")
    val_loss, val_acc, val_prec, val_rec = cnn_model.evaluate(X_val_seq, y_val_seq, verbose=0)
    val_predictions = cnn_model.predict(X_val_seq, verbose=0)
    val_pred_binary = (val_predictions > 0.5).astype(int).flatten()
    
    val_f1 = f1_score(y_val_seq, val_pred_binary, zero_division=0)
    val_roc_auc = roc_auc_score(y_val_seq, val_predictions.flatten())
    
    print(f"Val Loss: {val_loss:.4f}")
    print(f"Val Accuracy: {val_acc:.4f}")
    print(f"Val Precision: {val_prec:.4f}")
    print(f"Val Recall: {val_rec:.4f}")
    print(f"Val F1-Score: {val_f1:.4f}")
    print(f"Val ROC-AUC: {val_roc_auc:.4f}")
    
    print("\nValidation Confusion Matrix:")
    val_cm = confusion_matrix(y_val_seq, val_pred_binary)
    print(val_cm)
    
    # Evaluate on test set
    print("\nTest Set Evaluation:")
    test_loss, test_acc, test_prec, test_rec = cnn_model.evaluate(X_test_seq, y_test_seq, verbose=0)
    test_predictions = cnn_model.predict(X_test_seq, verbose=0)
    test_pred_binary = (test_predictions > 0.5).astype(int).flatten()
    
    test_f1 = f1_score(y_test_seq, test_pred_binary, zero_division=0)
    test_roc_auc = roc_auc_score(y_test_seq, test_predictions.flatten())
    
    print(f"Test Loss: {test_loss:.4f}")
    print(f"Test Accuracy: {test_acc:.4f}")
    print(f"Test Precision: {test_prec:.4f}")
    print(f"Test Recall: {test_rec:.4f}")
    print(f"Test F1-Score: {test_f1:.4f}")
    print(f"Test ROC-AUC: {test_roc_auc:.4f}")
    
    print("\nTest Confusion Matrix:")
    test_cm = confusion_matrix(y_test_seq, test_pred_binary)
    print(test_cm)
    
    # -----------------------------------------------
    # Model Comparison Summary
    # -----------------------------------------------
    
    print("\n" + "="*60)
    print("MODEL COMPARISON SUMMARY")
    print("="*60)
    
    print(f"{'Model':<15} {'Test Acc':<10} {'Test Prec':<11} {'Test Rec':<10} {'Test F1':<10} {'ROC-AUC':<10}")
    print("-" * 70)
    
    # Assume you have previous model results stored
    if 'rf_model' in locals() and rf_model:
        # Get RF predictions for comparison
        rf_test_pred = rf_model.predict(X_test)
        rf_test_proba = rf_model.predict_proba(X_test)[:, 1]
        
        rf_acc = accuracy_score(y_test, rf_test_pred)
        rf_prec = precision_score(y_test, rf_test_pred, zero_division=0)
        rf_rec = recall_score(y_test, rf_test_pred, zero_division=0)
        rf_f1 = f1_score(y_test, rf_test_pred, zero_division=0)
        rf_auc = roc_auc_score(y_test, rf_test_proba)
        
        print(f"{'Random Forest':<15} {rf_acc:<10.4f} {rf_prec:<11.4f} {rf_rec:<10.4f} {rf_f1:<10.4f} {rf_auc:<10.4f}")
    
    if 'xgb_model' in locals():
        # Get XGB predictions for comparison  
        xgb_test_pred = (xgb_model.predict_proba(X_test)[:, 1] > 0.5).astype(int)
        xgb_test_proba = xgb_model.predict_proba(X_test)[:, 1]
        
        xgb_acc = accuracy_score(y_test, xgb_test_pred)
        xgb_prec = precision_score(y_test, xgb_test_pred, zero_division=0)
        xgb_rec = recall_score(y_test, xgb_test_pred, zero_division=0)
        xgb_f1 = f1_score(y_test, xgb_test_pred, zero_division=0)
        xgb_auc = roc_auc_score(y_test, xgb_test_proba)
        
        print(f"{'XGBoost':<15} {xgb_acc:<10.4f} {xgb_prec:<11.4f} {xgb_rec:<10.4f} {xgb_f1:<10.4f} {xgb_auc:<10.4f}")
    
    print(f"{'CNN':<15} {test_acc:<10.4f} {test_prec:<11.4f} {test_rec:<10.4f} {test_f1:<10.4f} {test_roc_auc:<10.4f}")
    
    print("\n" + "="*60)
    print("ANALYSIS:")
    print("- CNN captures temporal patterns in lap sequences")
    print("- Sequential modeling may improve pit stop prediction accuracy")
    print("- Compare F1-scores as the dataset is imbalanced")
    print("- ROC-AUC shows model's ability to distinguish between classes")
    print("="*60)
    
else:
    print("Cannot train CNN: No sequential training data available")

# -----------------------------------------------
# Additional CNN Analysis (Optional)
# -----------------------------------------------

if 'cnn_model' in locals() and X_train_seq is not None:
    print("\n=== CNN Training History Analysis ===")
    
    # Plot training history if matplotlib is available
    try:
        import matplotlib.pyplot as plt
        
        # Create subplots for metrics
        fig, axes = plt.subplots(2, 2, figsize=(15, 10))
        
        # Accuracy plot
        axes[0, 0].plot(history.history['accuracy'], label='Training Accuracy')
        axes[0, 0].plot(history.history['val_accuracy'], label='Validation Accuracy')
        axes[0, 0].set_title('Model Accuracy')
        axes[0, 0].set_xlabel('Epoch')
        axes[0, 0].set_ylabel('Accuracy')
        axes[0, 0].legend()
        
        # Loss plot
        axes[0, 1].plot(history.history['loss'], label='Training Loss')
        axes[0, 1].plot(history.history['val_loss'], label='Validation Loss')
        axes[0, 1].set_title('Model Loss')
        axes[0, 1].set_xlabel('Epoch')
        axes[0, 1].set_ylabel('Loss')
        axes[0, 1].legend()
        
        # Precision plot
        axes[1, 0].plot(history.history['precision'], label='Training Precision')
        axes[1, 0].plot(history.history['val_precision'], label='Validation Precision')
        axes[1, 0].set_title('Model Precision')
        axes[1, 0].set_xlabel('Epoch')
        axes[1, 0].set_ylabel('Precision')
        axes[1, 0].legend()
        
        # Recall plot
        axes[1, 1].plot(history.history['recall'], label='Training Recall')
        axes[1, 1].plot(history.history['val_recall'], label='Validation Recall')
        axes[1, 1].set_title('Model Recall')
        axes[1, 1].set_xlabel('Epoch')
        axes[1, 1].set_ylabel('Recall')
        axes[1, 1].legend()
        
        plt.tight_layout()
        plt.show()
        
        print("Training history plots displayed above.")
        
    except ImportError:
        print("Matplotlib not available for plotting training history.")
        print("Training completed successfully without visualization.")

print("\n🏁 CNN Implementation Complete! 🏁")
print("Next steps could include:")
print("- Hyperparameter tuning (sequence length, architecture)")
print("- Advanced architectures (LSTM, GRU, Transformer)")
print("- Feature engineering for better temporal patterns")
print("- Ensemble methods combining CNN with tree-based models")

=== CNN Model Implementation ===
Preparing data for CNN sequential input...


ModuleNotFoundError: No module named 'tensorflow'
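
Separately from the missing TensorFlow install, it is worth checking that a stack of kernel-size-3 convolutions and size-2 pools fits an 8-lap window. The sketch below traces the sequence length through the three convolutional blocks under Keras conventions: `'valid'` padding removes `kernel_size - 1` steps, `'same'` keeps the length, and pooling halves it. The helper name is hypothetical.

```python
def trace_lengths(seq_len, layers, padding):
    """Return the sequence length after each layer, stopping at the first length < 1."""
    lengths = [seq_len]
    for kind, size in layers:
        if kind == "conv":
            # 'valid' trims kernel_size - 1 steps; 'same' keeps the length (stride 1)
            seq_len = seq_len - size + 1 if padding == "valid" else seq_len
        elif kind == "pool":
            seq_len = seq_len // size      # non-overlapping max pooling
        lengths.append(seq_len)
        if seq_len < 1:
            break
    return lengths

# Layer order of the three convolutional blocks (kernel size 3, pool size 2)
stack = [("conv", 3), ("conv", 3), ("pool", 2),
         ("conv", 3), ("conv", 3), ("pool", 2),
         ("conv", 3)]

print("valid:", trace_lengths(8, stack, "valid"))
print("same: ", trace_lengths(8, stack, "same"))
```

With unpadded convolutions the length reaches zero at the first layer of the second block, so an 8-lap window needs `'same'` padding (or a longer sequence) for this layout.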

In [21]:
# Save preprocessed data for the PyTorch CNN script
import pickle

print("💾 Saving preprocessed data for PyTorch CNN...")

# Check if we have the required data
if 'X_train' in locals() and 'y_train' in locals():
    data = {
        'X_train': X_train, 'y_train': y_train,
        'X_val': X_val, 'y_val': y_val, 
        'X_test': X_test, 'y_test': y_test
    }
    
    with open('f1_data.pkl', 'wb') as f:
        pickle.dump(data, f)
    
    print(f"✅ Data saved to f1_data.pkl")
    print(f"   Train: {X_train.shape[0]} samples, {X_train.shape[1]} features")
    print(f"   Val:   {X_val.shape[0]} samples")
    print(f"   Test:  {X_test.shape[0]} samples")
    print(f"\n🚀 Now run: python pytorch_cnn_f1.py")
    
else:
    print("❌ Training data not found. Make sure you've run the previous cells first.")
    print("Required variables: X_train, y_train, X_val, y_val, X_test, y_test")

💾 Saving preprocessed data for PyTorch CNN...
✅ Data saved to f1_data.pkl
   Train: 51027 samples, 20 features
   Val:   11828 samples
   Test:  11750 samples

🚀 Now run: python pytorch_cnn_f1.py
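
Since the pickle holds only the feature and target frames, any windowing done downstream has no race or driver key to group on. If such a key were exported as well, windows could be kept inside a single stint. The following is a minimal NumPy sketch of that idea; `group_ids`, the helper name, and the toy data are assumptions and not part of `f1_data.pkl`.

```python
import numpy as np

def grouped_windows(X, y, group_ids, seq_len):
    """Sliding windows of seq_len rows that never cross a group boundary."""
    X, y, group_ids = np.asarray(X), np.asarray(y), np.asarray(group_ids)
    X_seq, y_seq = [], []
    for g in np.unique(group_ids):
        rows = np.flatnonzero(group_ids == g)      # row positions of this group, in order
        for start in range(len(rows) - seq_len + 1):
            window = rows[start:start + seq_len]
            X_seq.append(X[window])
            y_seq.append(y[window[-1]])            # label of the last lap in the window
    return np.array(X_seq), np.array(y_seq)

# Hypothetical data: two stints of 5 and 4 laps, 3 features per lap
X_demo = np.arange(27, dtype=np.float32).reshape(9, 3)
y_demo = np.array([0, 0, 0, 0, 1, 0, 0, 0, 1])
stint = np.array(["A"] * 5 + ["B"] * 4)

X_win, y_win = grouped_windows(X_demo, y_demo, stint, seq_len=4)
print(X_win.shape, y_win)  # (3, 4, 3) [0 1 1]
```

A plain sliding window over the same nine rows would yield six windows, three of which would mix laps from both stints.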


In [None]:
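import pickle

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.utils import clip_grad_norm_
from torch.utils.data import DataLoader, TensorDataset
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def check_device():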
    """Check and return the best available device."""
    if torch.backends.mps.is_available():
        device = torch.device("mps")
        print(f"🚀 Using Metal Performance Shaders (MPS) on Mac Silicon: {device}")
    elif torch.cuda.is_available():
        device = torch.device("cuda")
        print(f"🚀 Using CUDA: {device}")
    else:
        device = torch.device("cpu")
        print(f"💻 Using CPU: {device}")
    return device

# Check available device
device = check_device()

# Load preprocessed data
def load_data():
    """Load the preprocessed F1 data."""
    try:
        with open('f1_data.pkl', 'rb') as f:
            data = pickle.load(f)
        print("✅ Data loaded successfully!")
        print(f"📊 Training samples: {len(data['X_train'])}")
        print(f"📊 Validation samples: {len(data['X_val'])}")
        print(f"📊 Test samples: {len(data['X_test'])}")
        print(f"📊 Features: {data['X_train'].shape[1]}")
        print(f"📊 Target distribution: {data['y_train'].value_counts().to_dict()}")
        return data
    except FileNotFoundError:
        print("❌ f1_data.pkl not found. Please run the data preprocessing sections first.")
        return None

# Load the data
data = load_data()


In [None]:
class F1PitStopCNN(nn.Module):
    """
    1D CNN model for F1 lap sequence classification.
    
    Architecture:
    - Multiple 1D convolutional layers to detect temporal patterns
    - Pooling layers to reduce dimensionality
    - Dropout for regularization
    - Dense layers for final classification
    """
    def __init__(self, input_features, sequence_length):
        super(F1PitStopCNN, self).__init__()
        
        # First convolutional block
        self.conv1 = nn.Conv1d(in_channels=input_features, out_channels=64, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm1d(64)
        self.conv2 = nn.Conv1d(in_channels=64, out_channels=64, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm1d(64)
        self.pool1 = nn.MaxPool1d(kernel_size=2)
        self.dropout1 = nn.Dropout(0.25)
        
        # Second convolutional block
        self.conv3 = nn.Conv1d(in_channels=64, out_channels=128, kernel_size=3, padding=1)
        self.bn3 = nn.BatchNorm1d(128)
        self.conv4 = nn.Conv1d(in_channels=128, out_channels=128, kernel_size=3, padding=1)
        self.bn4 = nn.BatchNorm1d(128)
        self.pool2 = nn.MaxPool1d(kernel_size=2)
        self.dropout2 = nn.Dropout(0.25)
        
        # Global max pooling
        self.global_pool = nn.AdaptiveMaxPool1d(1)
        
        # Dense layers
        self.fc1 = nn.Linear(128, 256)
        self.bn5 = nn.BatchNorm1d(256)
        self.dropout3 = nn.Dropout(0.5)
        self.fc2 = nn.Linear(256, 128)
        self.dropout4 = nn.Dropout(0.3)
        self.fc3 = nn.Linear(128, 64)
        self.dropout5 = nn.Dropout(0.2)
        
        # Output layer
        self.fc_out = nn.Linear(64, 1)
        
    def forward(self, x):
        # Input shape: (batch_size, sequence_length, features)
        # Conv1d expects: (batch_size, features, sequence_length)
        x = x.transpose(1, 2)
        
        # First conv block
        x = F.relu(self.bn1(self.conv1(x)))
        x = F.relu(self.bn2(self.conv2(x)))
        x = self.pool1(x)
        x = self.dropout1(x)
        
        # Second conv block
        x = F.relu(self.bn3(self.conv3(x)))
        x = F.relu(self.bn4(self.conv4(x)))
        x = self.pool2(x)
        x = self.dropout2(x)
        
        # Global pooling
        x = self.global_pool(x)  # (batch_size, 128, 1)
        x = x.squeeze(-1)        # (batch_size, 128)
        
        # Dense layers
        x = F.relu(self.bn5(self.fc1(x)))
        x = self.dropout3(x)
        x = F.relu(self.fc2(x))
        x = self.dropout4(x)
        x = F.relu(self.fc3(x))
        x = self.dropout5(x)
        
        # Output
        x = torch.sigmoid(self.fc_out(x))
        return x.squeeze(-1)  # Remove last dimension for binary classification

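def create_sequential_data(X, y, sequence_length=8):
    """
    Create sequential windows of lap data for CNN input.
    Each sample will be a sequence of 'sequence_length' consecutive laps.
    Note: windows slide over the raw row order, so a window can span two
    drivers or races unless X is already grouped and sorted accordingly.
    """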
    X_seq = []
    y_seq = []
    
    # Simple sliding window approach
    for i in range(len(X) - sequence_length + 1):
        # Extract sequence window and ensure it's numeric
        X_window = X.iloc[i:i + sequence_length].values.astype(np.float32)
        y_window = float(y.iloc[i + sequence_length - 1])  # Predict based on the last lap in sequence
        
        X_seq.append(X_window)
        y_seq.append(y_window)
    
    return np.array(X_seq, dtype=np.float32), np.array(y_seq, dtype=np.float32)

def train_epoch(model, train_loader, criterion, optimizer, device):
    """Train the model for one epoch."""
    model.train()
    total_loss = 0
    correct = 0
    total = 0
    
    for batch_X, batch_y in train_loader:
        batch_X, batch_y = batch_X.to(device), batch_y.to(device)
        
        optimizer.zero_grad()
        outputs = model(batch_X)
        loss = criterion(outputs, batch_y)
        loss.backward()
        
        # Gradient clipping
        clip_grad_norm_(model.parameters(), max_norm=1.0)
        
        optimizer.step()
        
        total_loss += loss.item()
        predicted = (outputs > 0.5).float()
        total += batch_y.size(0)
        correct += (predicted == batch_y).sum().item()
    
    return total_loss / len(train_loader), correct / total

def validate_epoch(model, val_loader, criterion, device):
    """Validate the model for one epoch."""
    model.eval()
    total_loss = 0
    all_predictions = []
    all_probabilities = []
    all_targets = []
    
    with torch.no_grad():
        for batch_X, batch_y in val_loader:
            batch_X, batch_y = batch_X.to(device), batch_y.to(device)
            
            outputs = model(batch_X)
            loss = criterion(outputs, batch_y)
            
            total_loss += loss.item()
            
            # Store predictions and targets
            all_probabilities.extend(outputs.cpu().numpy())
            all_predictions.extend((outputs > 0.5).cpu().numpy())
            all_targets.extend(batch_y.cpu().numpy())
    
    # Calculate metrics
    accuracy = accuracy_score(all_targets, all_predictions)
    precision = precision_score(all_targets, all_predictions, zero_division=0)
    recall = recall_score(all_targets, all_predictions, zero_division=0)
    f1 = f1_score(all_targets, all_predictions, zero_division=0)
    
    return total_loss / len(val_loader), accuracy, precision, recall, f1, all_targets, all_probabilities

print("✅ CNN model architecture defined!")

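The sliding-window helper above turns a lap-by-lap table into overlapping windows and labels each window with the target of its last lap. Below is a minimal vectorized sketch of the same idea on toy data, assuming NumPy 1.20+ for `sliding_window_view`. The helper name `make_windows` and the toy arrays are illustrative and not part of the notebook.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def make_windows(features, labels, sequence_length):
    """Return (windows, window_labels) for a 2-D feature array.

    windows has shape (n_laps - sequence_length + 1, sequence_length, n_features);
    each window is labelled with the target of its last lap.
    """
    features = np.asarray(features, dtype=np.float32)
    labels = np.asarray(labels, dtype=np.float32)
    if len(features) < sequence_length:
        # Not enough laps for a single window
        return (np.empty((0, sequence_length, features.shape[1]), dtype=np.float32),
                np.empty((0,), dtype=np.float32))
    # sliding_window_view appends the window axis last: (n_windows, n_features, seq_len)
    windows = sliding_window_view(features, sequence_length, axis=0)
    windows = windows.transpose(0, 2, 1).copy()  # -> (n_windows, seq_len, n_features)
    return windows, labels[sequence_length - 1:]

# Toy data: 10 laps, 3 features, a single "pit" label on lap index 6
toy_X = np.arange(30, dtype=np.float32).reshape(10, 3)
toy_y = np.zeros(10, dtype=np.float32)
toy_y[6] = 1.0

X_seq, y_seq = make_windows(toy_X, toy_y, sequence_length=4)
print(X_seq.shape, y_seq.shape)
```

If rows from several drivers or races are stacked in one frame, windows built this way will span those boundaries; grouping the rows before windowing is one way to avoid that.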

In [None]:
def train_cnn_model(data, sequence_length=8, epochs=50, batch_size=32):
    """Train the CNN model on F1 data."""
    if data is None:
        print("❌ No data available for training.")
        return None
    
    print("🔄 Preparing sequential data for CNN...")
    
    # Create sequential data
    X_train_seq, y_train_seq = create_sequential_data(data['X_train'], data['y_train'], sequence_length)
    X_val_seq, y_val_seq = create_sequential_data(data['X_val'], data['y_val'], sequence_length)
    X_test_seq, y_test_seq = create_sequential_data(data['X_test'], data['y_test'], sequence_length)
    
    print(f"📊 Sequential training samples: {len(X_train_seq)}")
    print(f"📊 Sequential validation samples: {len(X_val_seq)}")
    print(f"📊 Sequential test samples: {len(X_test_seq)}")
    print(f"📊 Sequence shape: {X_train_seq.shape}")
    
    # Convert to PyTorch tensors and create data loaders
    train_dataset = TensorDataset(
        torch.tensor(X_train_seq, dtype=torch.float32),
        torch.tensor(y_train_seq, dtype=torch.float32)
    )
    val_dataset = TensorDataset(
        torch.tensor(X_val_seq, dtype=torch.float32),
        torch.tensor(y_val_seq, dtype=torch.float32)
    )
    test_dataset = TensorDataset(
        torch.tensor(X_test_seq, dtype=torch.float32),
        torch.tensor(y_test_seq, dtype=torch.float32)
    )
    
    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)
    test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)
    
    # Initialize model
    input_features = X_train_seq.shape[2]
    model = F1PitStopCNN(input_features, sequence_length).to(device)
    
    # Model summary
    total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"🧠 CNN Model Parameters: {total_params:,}")
    
    # Define loss function and optimizer
    criterion = nn.BCELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-5)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode='min', factor=0.5, patience=5  # 'verbose' is deprecated in recent PyTorch releases
    )
    
    print("🚀 Starting CNN training...")
    start_time = time.time()
    
    best_val_f1 = 0
    best_model_state = None
    
    for epoch in range(epochs):
        # Training
        train_loss, train_acc = train_epoch(model, train_loader, criterion, optimizer, device)
        
        # Validation
        val_loss, val_acc, val_prec, val_rec, val_f1, _, _ = validate_epoch(
            model, val_loader, criterion, device
        )
        
        # Learning rate scheduling
        scheduler.step(val_loss)
        
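        # Save best model. Clone the tensors: state_dict().copy() is a shallow copy
        # whose tensors keep changing as training continues.
        if val_f1 > best_val_f1:
            best_val_f1 = val_f1
            best_model_state = {k: v.detach().clone() for k, v in model.state_dict().items()}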
        
        # Print progress
        if (epoch + 1) % 5 == 0 or epoch == 0:
            print(f"Epoch {epoch+1:2d}/{epochs} | "
                  f"Train Loss: {train_loss:.4f} | "
                  f"Val Loss: {val_loss:.4f} | "
                  f"Val Acc: {val_acc:.4f} | "
                  f"Val F1: {val_f1:.4f}")
    
    # Load best model
    if best_model_state:
        model.load_state_dict(best_model_state)
    
    training_time = time.time() - start_time
    print(f"⏱️ Training completed in {training_time:.1f} seconds")
    
    # Final evaluation
    print("\n🧪 Final Model Evaluation:")
    
    # Test set evaluation
    test_loss, test_acc, test_prec, test_rec, test_f1, y_true, y_prob = validate_epoch(
        model, test_loader, criterion, device
    )
    
    # Calculate ROC-AUC
    test_roc_auc = roc_auc_score(y_true, y_prob)
    
    print(f"Test Accuracy: {test_acc:.4f}")
    print(f"Test Precision: {test_prec:.4f}")
    print(f"Test Recall: {test_rec:.4f}")
    print(f"Test F1-Score: {test_f1:.4f}")
    print(f"Test ROC-AUC: {test_roc_auc:.4f}")
    
    # Confusion matrix
    y_pred = (np.array(y_prob) > 0.5).astype(int)
    cm = confusion_matrix(y_true, y_pred)
    print("\nConfusion Matrix:")
    print(cm)
    
    # Class distribution
    print("\nClass Distribution in Test Set:")
    unique, counts = np.unique(y_true, return_counts=True)
    for cls, count in zip(unique, counts):
        print(f"Class {int(cls)}: {count} samples ({count/len(y_true)*100:.1f}%)")
    
    results = {
        'model': model,
        'test_accuracy': test_acc,
        'test_precision': test_prec,
        'test_recall': test_rec,
        'test_f1': test_f1,
        'test_roc_auc': test_roc_auc,
        'confusion_matrix': cm,
        'training_time': training_time
    }
    
    return results

# Train the CNN model if data is available
if 'data' in locals() and data is not None:
    cnn_results = train_cnn_model(data)
else:
    print("❌ Data not loaded. Cannot train CNN model.")

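The training loop keeps the best-scoring weights in memory and restores them once training ends. When snapshotting weights this way, it matters that `state_dict()` returns references to the live parameter tensors, so a plain `dict.copy()` still changes as training continues. A small sketch of the difference, using a throwaway `nn.Linear` layer as a stand-in rather than the notebook's CNN:

```python
import torch
from torch import nn

torch.manual_seed(0)
layer = nn.Linear(2, 1)  # stand-in model, for illustration only

shallow = layer.state_dict().copy()  # new dict, same underlying tensors
cloned = {k: v.detach().clone() for k, v in layer.state_dict().items()}
original_weight = layer.weight.detach().clone()

# Simulate an optimizer step: parameters are updated in place
with torch.no_grad():
    layer.weight.add_(1.0)

shallow_tracks_model = torch.equal(shallow["weight"], layer.weight)
clone_is_frozen = torch.equal(cloned["weight"], original_weight)
print(shallow_tracks_model, clone_is_frozen)
```

Both flags come out `True`: the shallow copy follows the updated weights, while the cloned snapshot keeps the values from the moment it was taken. `copy.deepcopy(model.state_dict())` is an equivalent alternative to the dict comprehension.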

In [None]:
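class FocalLoss(nn.Module):
    """
    Focal Loss for addressing class imbalance.
    Focuses learning on hard examples by down-weighting easy ones with the
    modulating factor (1 - pt) ** gamma. Expects probabilities (post-sigmoid)
    as inputs. Note that alpha is applied uniformly here, so it scales the
    loss rather than weighting one class against the other.
    """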
    def __init__(self, alpha=1, gamma=2, reduction='mean'):
        super(FocalLoss, self).__init__()
        self.alpha = alpha
        self.gamma = gamma
        self.reduction = reduction

    def forward(self, inputs, targets):
        bce_loss = F.binary_cross_entropy(inputs, targets, reduction='none')
        pt = torch.exp(-bce_loss)
        focal_loss = self.alpha * (1 - pt) ** self.gamma * bce_loss
        
        if self.reduction == 'mean':
            return focal_loss.mean()
        elif self.reduction == 'sum':
            return focal_loss.sum()
        else:
            return focal_loss

class AdvancedF1LSTM(nn.Module):
    """
    Advanced LSTM model for F1 pit stop prediction with attention mechanism.
    """
    def __init__(self, input_features, sequence_length, hidden_size=128, num_layers=2):
        super(AdvancedF1LSTM, self).__init__()
        
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        
        # LSTM layers
        self.lstm = nn.LSTM(input_features, hidden_size, num_layers, 
                           batch_first=True, dropout=0.3, bidirectional=True)
        
        # Attention mechanism
        self.attention = nn.Linear(hidden_size * 2, 1)
        
        # Dense layers
        self.fc1 = nn.Linear(hidden_size * 2, 256)
        self.bn1 = nn.BatchNorm1d(256)
        self.dropout1 = nn.Dropout(0.5)
        
        self.fc2 = nn.Linear(256, 128)
        self.bn2 = nn.BatchNorm1d(128)
        self.dropout2 = nn.Dropout(0.3)
        
        self.fc3 = nn.Linear(128, 64)
        self.dropout3 = nn.Dropout(0.2)
        
        # Output layer
        self.fc_out = nn.Linear(64, 1)
        
    def forward(self, x):
        # LSTM forward pass
        lstm_out, (hidden, cell) = self.lstm(x)
        
        # Attention mechanism
        attention_weights = F.softmax(self.attention(lstm_out), dim=1)
        context_vector = torch.sum(attention_weights * lstm_out, dim=1)
        
        # Dense layers
        x = F.relu(self.bn1(self.fc1(context_vector)))
        x = self.dropout1(x)
        x = F.relu(self.bn2(self.fc2(x)))
        x = self.dropout2(x)
        x = F.relu(self.fc3(x))
        x = self.dropout3(x)
        
        # Output
        x = torch.sigmoid(self.fc_out(x))
        return x.squeeze(-1)

class F1EnsemblePredictor:
    """
    Advanced ensemble predictor combining LSTM and XGBoost.
    """
    def __init__(self, sequence_length=10, device=None):
        self.sequence_length = sequence_length
        self.device = device or self.get_device()
        self.lstm_model = None
        self.xgb_model = None
        self.ensemble_weights = None
        self.optimal_threshold = 0.5
        self.scaler = None
        
    def get_device(self):
        """Get the best available device."""
        if torch.backends.mps.is_available():
            return torch.device("mps")
        elif torch.cuda.is_available():
            return torch.device("cuda")
        else:
            return torch.device("cpu")
    
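    def create_advanced_features(self, X, y=None):
        """
        Create advanced temporal and statistical features.

        Assumes column 0 holds the lap time and column 1 the tyre life; the
        feature names below rely on that ordering. `y` is accepted for
        interface symmetry but is not used.
        """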
        print("🔧 Engineering advanced features...")
        
        # Convert to DataFrame if needed
        if isinstance(X, np.ndarray):
            X = pd.DataFrame(X)
        
        X_enhanced = X.copy()
        
        # Rolling window features
        for window in [3, 5, 8]:
            X_enhanced[f'LapTime_roll_mean_{window}'] = X.iloc[:, 0].rolling(window, min_periods=1).mean()
            X_enhanced[f'LapTime_roll_std_{window}'] = X.iloc[:, 0].rolling(window, min_periods=1).std().fillna(0)
            X_enhanced[f'TyreLife_roll_max_{window}'] = X.iloc[:, 1].rolling(window, min_periods=1).max() if X.shape[1] > 1 else 0
        
        # Lag features
        for lag in [1, 2, 3]:
            X_enhanced[f'LapTime_lag_{lag}'] = X.iloc[:, 0].shift(lag).fillna(X.iloc[:, 0].mean())
            if X.shape[1] > 1:
                X_enhanced[f'TyreLife_lag_{lag}'] = X.iloc[:, 1].shift(lag).fillna(X.iloc[:, 1].mean())
        
        # Trend features
        X_enhanced['LapTime_trend'] = X.iloc[:, 0].diff().fillna(0)
        X_enhanced['LapTime_acceleration'] = X_enhanced['LapTime_trend'].diff().fillna(0)
        
        # Statistical features
        X_enhanced['LapTime_zscore'] = (X.iloc[:, 0] - X.iloc[:, 0].mean()) / (X.iloc[:, 0].std() + 1e-8)
        
        return X_enhanced
    
    def create_sequences(self, X, y):
        """Create sequences with advanced features."""
        # Add advanced features
        X_enhanced = self.create_advanced_features(X)
        
        X_seq, y_seq = [], []
        for i in range(len(X_enhanced) - self.sequence_length + 1):
            X_window = X_enhanced.iloc[i:i + self.sequence_length].values.astype(np.float32)
            y_window = float(y.iloc[i + self.sequence_length - 1])
            
            X_seq.append(X_window)
            y_seq.append(y_window)
        
        return np.array(X_seq, dtype=np.float32), np.array(y_seq, dtype=np.float32)
    
    def balance_data(self, X, y, strategy='smote'):
        """Apply data balancing techniques."""
        print(f"🔄 Applying {strategy} data balancing...")
        
        if strategy == 'smote':
            # Flatten sequences for SMOTE
            n_samples, seq_len, n_features = X.shape
            X_flat = X.reshape(n_samples, seq_len * n_features)
            
            smote = SMOTE(random_state=42, k_neighbors=3)
            X_balanced, y_balanced = smote.fit_resample(X_flat, y)
            
            # Reshape back to sequences
            X_balanced = X_balanced.reshape(-1, seq_len, n_features)
            
            print(f"📊 Original samples: {len(X)}")
            print(f"📊 Balanced samples: {len(X_balanced)}")
            print(f"📊 Class distribution after SMOTE: {np.bincount(y_balanced.astype(int))}")
            
            return X_balanced, y_balanced
        
        return X, y

print("✅ Advanced ensemble architecture defined!")

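To make the effect of the focal modulating factor concrete, here is a NumPy re-computation of the same formula, `alpha * (1 - pt) ** gamma * bce`, on four hand-picked probabilities. The function name `focal_terms` and the example values are illustrative only.

```python
import numpy as np

def focal_terms(probs, targets, alpha=1.0, gamma=2.0):
    """Per-sample BCE and focal loss for predicted probabilities."""
    probs = np.clip(np.asarray(probs, dtype=np.float64), 1e-12, 1 - 1e-12)
    targets = np.asarray(targets, dtype=np.float64)
    bce = -(targets * np.log(probs) + (1 - targets) * np.log(1 - probs))
    pt = np.exp(-bce)  # probability the model assigned to the true class
    return bce, alpha * (1 - pt) ** gamma * bce

# Order: easy negative, hard negative, easy positive, hard positive
probs = [0.1, 0.9, 0.9, 0.1]
targets = [0, 0, 1, 1]

bce, focal = focal_terms(probs, targets, alpha=1.0, gamma=2.0)
scale = focal / bce  # how much each sample's BCE is scaled by the focal factor
print(np.round(scale, 4))
```

With `gamma=2`, confidently correct samples keep about 1% of their cross-entropy while confidently wrong ones keep about 81%, regardless of class. Raising `gamma` widens that gap; `alpha` only rescales all four values by the same amount.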

In [None]:
def train_advanced_ensemble(data, sequence_length=12, epochs=50, batch_size=32):
    """
    Train the advanced ensemble model combining LSTM and XGBoost.
    """
    if data is None:
        print("❌ No data available for training.")
        return None
    
    print("🚀 Starting Advanced Ensemble Training...")
    start_time = time.time()
    
    # Initialize ensemble predictor
    ensemble = F1EnsemblePredictor(sequence_length=sequence_length, device=device)
    
    # Stage 1: Create enhanced sequences with advanced features
    print("\n📊 Stage 1: Advanced Feature Engineering & Sequence Creation")
    X_train_seq, y_train_seq = ensemble.create_sequences(data['X_train'], data['y_train'])
    X_val_seq, y_val_seq = ensemble.create_sequences(data['X_val'], data['y_val'])
    X_test_seq, y_test_seq = ensemble.create_sequences(data['X_test'], data['y_test'])
    
    print(f"✅ Enhanced sequences created with {X_train_seq.shape[2]} features")
    print(f"📊 Training: {len(X_train_seq)}, Validation: {len(X_val_seq)}, Test: {len(X_test_seq)}")
    
    # Stage 2: Data balancing with SMOTE
    print("\n⚖️ Stage 2: Data Balancing with SMOTE")
    X_train_balanced, y_train_balanced = ensemble.balance_data(X_train_seq, y_train_seq, strategy='smote')
    
    # Stage 3: Train LSTM with Focal Loss
    print("\n🧠 Stage 3: Training LSTM with Attention & Focal Loss")
    
    # Create data loaders
    train_dataset = TensorDataset(
        torch.tensor(X_train_balanced, dtype=torch.float32),
        torch.tensor(y_train_balanced, dtype=torch.float32)
    )
    val_dataset = TensorDataset(
        torch.tensor(X_val_seq, dtype=torch.float32),
        torch.tensor(y_val_seq, dtype=torch.float32)
    )
    
    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)
    
    # Initialize LSTM model
    input_features = X_train_balanced.shape[2]
    lstm_model = AdvancedF1LSTM(input_features, sequence_length).to(device)
    
    # Focal Loss for class imbalance
    focal_loss = FocalLoss(alpha=2, gamma=3)
    optimizer = torch.optim.Adam(lstm_model.parameters(), lr=0.001, weight_decay=1e-5)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=8, factor=0.5)
    
    print("🔥 Using Focal Loss (α=2, γ=3) for class imbalance")
    print(f"🧠 LSTM Parameters: {sum(p.numel() for p in lstm_model.parameters() if p.requires_grad):,}")
    
    # Train LSTM
    best_val_f1 = 0
    best_lstm_state = None
    
    for epoch in range(epochs):
        # Training
        lstm_model.train()
        total_loss = 0
        for batch_X, batch_y in train_loader:
            batch_X, batch_y = batch_X.to(device), batch_y.to(device)
            
            optimizer.zero_grad()
            outputs = lstm_model(batch_X)
            loss = focal_loss(outputs, batch_y)
            loss.backward()
            clip_grad_norm_(lstm_model.parameters(), max_norm=1.0)
            optimizer.step()
            
            total_loss += loss.item()
        
        # Validation
        lstm_model.eval()
        val_preds = []
        val_targets = []
        with torch.no_grad():
            for batch_X, batch_y in val_loader:
                batch_X, batch_y = batch_X.to(device), batch_y.to(device)
                outputs = lstm_model(batch_X)
                val_preds.extend(outputs.cpu().numpy())
                val_targets.extend(batch_y.cpu().numpy())
        
        val_pred_binary = (np.array(val_preds) > 0.5).astype(int)
        val_f1 = f1_score(val_targets, val_pred_binary, zero_division=0)
        
        scheduler.step(total_loss)
        
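        if val_f1 > best_val_f1:
            best_val_f1 = val_f1
            # Clone the tensors: state_dict().copy() is shallow and would track later updates
            best_lstm_state = {k: v.detach().clone() for k, v in lstm_model.state_dict().items()}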
        
        if (epoch + 1) % 10 == 0 or epoch == 0:
            print(f"LSTM Epoch {epoch+1:2d}/{epochs} | Loss: {total_loss/len(train_loader):.4f} | Val F1: {val_f1:.4f}")
    
    # Load best LSTM model
    if best_lstm_state:
        lstm_model.load_state_dict(best_lstm_state)
    
    ensemble.lstm_model = lstm_model
    
    # Stage 4: Train XGBoost on flattened features
    print("\n🌳 Stage 4: Training XGBoost on Enhanced Features")
    
    # Flatten sequences for XGBoost
    X_train_flat = X_train_balanced.reshape(len(X_train_balanced), -1)
    X_val_flat = X_val_seq.reshape(len(X_val_seq), -1)
    X_test_flat = X_test_seq.reshape(len(X_test_seq), -1)
    
    # Train XGBoost with optimized parameters.
    # early_stopping_rounds is passed to the constructor (XGBoost >= 1.6);
    # XGBoost 2.0 removed it from fit().
    xgb_model = xgb.XGBClassifier(
        n_estimators=300,
        max_depth=8,
        learning_rate=0.1,
        subsample=0.8,
        colsample_bytree=0.8,
        reg_alpha=0.1,
        reg_lambda=1.0,
        # After SMOTE the classes are balanced, so this ratio is close to 1
        scale_pos_weight=len(y_train_balanced[y_train_balanced == 0]) / len(y_train_balanced[y_train_balanced == 1]),
        early_stopping_rounds=20,
        random_state=42,
        n_jobs=-1
    )
    
    xgb_model.fit(
        X_train_flat, y_train_balanced,
        eval_set=[(X_val_flat, y_val_seq)],
        verbose=False
    )
    
    ensemble.xgb_model = xgb_model
    
    # Stage 5: Optimize ensemble weights and threshold
    print("\n⚖️ Stage 5: Ensemble Optimization")
    
    # Get predictions from both models
    lstm_val_preds = []
    lstm_model.eval()
    with torch.no_grad():
        for batch_X, _ in val_loader:
            batch_X = batch_X.to(device)
            outputs = lstm_model(batch_X)
            lstm_val_preds.extend(outputs.cpu().numpy())
    
    xgb_val_preds = xgb_model.predict_proba(X_val_flat)[:, 1]
    
    # Optimize weights
    best_f1 = 0
    best_weights = (0.6, 0.4)  # Default: 60% LSTM, 40% XGBoost
    best_threshold = 0.5
    
    for lstm_weight in np.arange(0.3, 0.8, 0.1):
        xgb_weight = 1 - lstm_weight
        ensemble_preds = lstm_weight * np.array(lstm_val_preds) + xgb_weight * xgb_val_preds
        
        # Find optimal threshold
        precisions, recalls, thresholds = precision_recall_curve(y_val_seq, ensemble_preds)
        f1_scores = 2 * (precisions * recalls) / (precisions + recalls + 1e-8)
        best_thresh_idx = np.argmax(f1_scores)
        optimal_threshold = thresholds[best_thresh_idx] if best_thresh_idx < len(thresholds) else 0.5
        
        # Calculate F1 with optimal threshold
        # (>= matches precision_recall_curve, which counts scores >= threshold as positive)
        pred_binary = (ensemble_preds >= optimal_threshold).astype(int)
        f1 = f1_score(y_val_seq, pred_binary, zero_division=0)
        
        if f1 > best_f1:
            best_f1 = f1
            best_weights = (lstm_weight, xgb_weight)
            best_threshold = optimal_threshold
    
    ensemble.ensemble_weights = best_weights
    ensemble.optimal_threshold = best_threshold
    
    print(f"🎯 Optimal weights: LSTM {best_weights[0]:.1f}, XGBoost {best_weights[1]:.1f}")
    print(f"🎯 Optimal threshold: {best_threshold:.4f}")
    
    training_time = time.time() - start_time
    print(f"⏱️ Total training time: {training_time:.1f} seconds")
    
    # Final evaluation
    print("\n🧪 Final Ensemble Evaluation:")
    
    # Test predictions
    lstm_test_preds = []
    lstm_model.eval()
    test_loader = DataLoader(
        TensorDataset(torch.tensor(X_test_seq, dtype=torch.float32), torch.tensor(y_test_seq, dtype=torch.float32)),
        batch_size=batch_size, shuffle=False
    )
    
    with torch.no_grad():
        for batch_X, _ in test_loader:
            batch_X = batch_X.to(device)
            outputs = lstm_model(batch_X)
            lstm_test_preds.extend(outputs.cpu().numpy())
    
    xgb_test_preds = xgb_model.predict_proba(X_test_flat)[:, 1]
    
    # Ensemble predictions
    ensemble_test_preds = (best_weights[0] * np.array(lstm_test_preds) + 
                          best_weights[1] * xgb_test_preds)
    # >= matches precision_recall_curve, which counts scores >= threshold as positive
    ensemble_test_binary = (ensemble_test_preds >= best_threshold).astype(int)
    
    # Calculate metrics
    test_accuracy = accuracy_score(y_test_seq, ensemble_test_binary)
    test_precision = precision_score(y_test_seq, ensemble_test_binary, zero_division=0)
    test_recall = recall_score(y_test_seq, ensemble_test_binary, zero_division=0)
    test_f1 = f1_score(y_test_seq, ensemble_test_binary, zero_division=0)
    test_roc_auc = roc_auc_score(y_test_seq, ensemble_test_preds)
    
    print(f"Test Accuracy: {test_accuracy:.4f}")
    print(f"Test Precision: {test_precision:.4f}")
    print(f"Test Recall: {test_recall:.4f}")
    print(f"Test F1-Score: {test_f1:.4f}")
    print(f"Test ROC-AUC: {test_roc_auc:.4f}")
    
    # Confusion matrix
    cm = confusion_matrix(y_test_seq, ensemble_test_binary)
    print("\nConfusion Matrix:")
    print(cm)
    
    # Individual model performance
    print("\n📊 Individual Model Performance:")
    
    # LSTM only
    lstm_binary = (np.array(lstm_test_preds) > 0.5).astype(int)
    lstm_f1 = f1_score(y_test_seq, lstm_binary, zero_division=0)
    lstm_auc = roc_auc_score(y_test_seq, lstm_test_preds)
    print(f"LSTM (Advanced): F1-Score {lstm_f1:.4f}, ROC-AUC {lstm_auc:.4f}")
    
    # XGBoost only
    xgb_binary = (xgb_test_preds > 0.5).astype(int)
    xgb_f1 = f1_score(y_test_seq, xgb_binary, zero_division=0)
    xgb_auc = roc_auc_score(y_test_seq, xgb_test_preds)
    print(f"XGBoost (Tuned): F1-Score {xgb_f1:.4f}, ROC-AUC {xgb_auc:.4f}")
    
    # Ensemble
    print(f"Ensemble: F1-Score {test_f1:.4f}, ROC-AUC {test_roc_auc:.4f}")
    
    results = {
        'ensemble': ensemble,
        'test_accuracy': test_accuracy,
        'test_precision': test_precision,
        'test_recall': test_recall,
        'test_f1': test_f1,
        'test_roc_auc': test_roc_auc,
        'confusion_matrix': cm,
        'training_time': training_time,
        'individual_scores': {
            'lstm_f1': lstm_f1,
            'lstm_auc': lstm_auc,
            'xgb_f1': xgb_f1,
            'xgb_auc': xgb_auc
        }
    }
    
    return results

# Train the advanced ensemble if data is available
if 'data' in locals() and data is not None:
    ensemble_results = train_advanced_ensemble(data)
else:
    print("❌ Data not loaded. Cannot train ensemble model.")

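Stage 5 amounts to a small grid search: blend the two probability vectors with a weight, then pick the decision threshold that maximises F1 on the validation set. Below is a self-contained sketch of that search on synthetic scores. The helper names, the synthetic data, and the fixed threshold grid (used here in place of `precision_recall_curve`) are assumptions for illustration, not the notebook's implementation.

```python
import numpy as np

def f1_at(y_true, scores, threshold):
    """F1 of the positive class when scores >= threshold are predicted positive."""
    pred = scores >= threshold
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    fn = np.sum(~pred & (y_true == 1))
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom else 0.0

def search_blend(y_true, p_a, p_b, weights, thresholds):
    """Return (best_f1, weight_for_p_a, threshold) over the given grids."""
    best = (0.0, None, None)
    for w in weights:
        blended = w * p_a + (1 - w) * p_b
        for t in thresholds:
            score = f1_at(y_true, blended, t)
            if score > best[0]:
                best = (score, float(w), float(t))
    return best

# Synthetic validation set with roughly 10% positives (made-up scores)
rng = np.random.default_rng(0)
y_val = (rng.random(500) < 0.1).astype(int)
p_lstm = np.clip(0.2 + 0.4 * y_val + rng.normal(0, 0.15, 500), 0, 1)
p_xgb = np.clip(0.1 + 0.5 * y_val + rng.normal(0, 0.20, 500), 0, 1)

weights = np.linspace(0.3, 0.7, 5)        # weight given to the first model
thresholds = np.linspace(0.05, 0.95, 19)  # candidate decision thresholds
best_f1, best_w, best_t = search_blend(y_val, p_lstm, p_xgb, weights, thresholds)
print(f"F1={best_f1:.3f} at weight={best_w:.1f}, threshold={best_t:.2f}")
```

Because both the weight and the threshold are tuned on the validation split, the F1 found here is optimistic; the held-out test split remains the number to report.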

In [None]:
def compare_all_models():
    """
    Compare all trained models and provide comprehensive analysis.
    """
    print("🏁 COMPREHENSIVE MODEL COMPARISON 🏁")
    print("="*60)
    
    results_summary = []
    
    # Check if CNN results exist (use globals(): inside a function,
    # locals() only contains this function's own variables)
    if 'cnn_results' in globals() and cnn_results:
        results_summary.append({
            'Model': 'PyTorch CNN',
            'F1-Score': cnn_results['test_f1'],
            'Precision': cnn_results['test_precision'],
            'Recall': cnn_results['test_recall'],
            'ROC-AUC': cnn_results['test_roc_auc'],
            'Accuracy': cnn_results['test_accuracy'],
            'Training Time (s)': cnn_results['training_time'],
            'Key Features': 'Temporal patterns, 1D convolutions'
        })
    
    # Check if ensemble results exist (use globals(): inside a function,
    # locals() only contains this function's own variables)
    if 'ensemble_results' in globals() and ensemble_results:
        results_summary.append({
            'Model': 'Advanced Ensemble',
            'F1-Score': ensemble_results['test_f1'],
            'Precision': ensemble_results['test_precision'],
            'Recall': ensemble_results['test_recall'],
            'ROC-AUC': ensemble_results['test_roc_auc'],
            'Accuracy': ensemble_results['test_accuracy'],
            'Training Time (s)': ensemble_results['training_time'],
            'Key Features': 'LSTM+Attention, XGBoost, SMOTE, Focal Loss'
        })
        
        # Add individual components
        lstm_scores = ensemble_results['individual_scores']
        results_summary.append({
            'Model': '├─ LSTM Component',
            'F1-Score': lstm_scores['lstm_f1'],
            'Precision': '-',
            'Recall': '-',
            'ROC-AUC': lstm_scores['lstm_auc'],
            'Accuracy': '-',
            'Training Time (s)': '-',
            'Key Features': 'Bidirectional LSTM, Attention'
        })
        
        results_summary.append({
            'Model': '└─ XGBoost Component',
            'F1-Score': lstm_scores['xgb_f1'],
            'Precision': '-',
            'Recall': '-',
            'ROC-AUC': lstm_scores['xgb_auc'],
            'Accuracy': '-',
            'Training Time (s)': '-',
            'Key Features': 'Gradient boosting, Enhanced features'
        })
    
    # Display results table
    if results_summary:
        print(f"\n{'Model':<20} {'F1-Score':<10} {'Precision':<11} {'Recall':<10} {'ROC-AUC':<10} {'Accuracy':<10}")
        print("-" * 85)
        
        for result in results_summary:
            model = result['Model']
            f1 = f"{result['F1-Score']:.4f}" if isinstance(result['F1-Score'], float) else result['F1-Score']
            prec = f"{result['Precision']:.4f}" if isinstance(result['Precision'], float) else result['Precision']
            rec = f"{result['Recall']:.4f}" if isinstance(result['Recall'], float) else result['Recall']
            auc = f"{result['ROC-AUC']:.4f}" if isinstance(result['ROC-AUC'], float) else result['ROC-AUC']
            acc = f"{result['Accuracy']:.4f}" if isinstance(result['Accuracy'], float) else result['Accuracy']
            
            print(f"{model:<20} {f1:<10} {prec:<11} {rec:<10} {auc:<10} {acc:<10}")
        
        print("\n" + "="*85)
        
        # Key insights
        print("\n🔍 KEY INSIGHTS:")
        
        # Find best F1 score
        best_f1 = max([r for r in results_summary if isinstance(r['F1-Score'], float)], key=lambda x: x['F1-Score'])
        print(f"🥇 Best F1-Score: {best_f1['Model']} ({best_f1['F1-Score']:.4f})")
        
        # Find best recall
        best_recall = max([r for r in results_summary if isinstance(r['Recall'], float)], key=lambda x: x['Recall'])
        print(f"🎯 Best Recall: {best_recall['Model']} ({best_recall['Recall']:.4f})")
        
        # Find best ROC-AUC
        best_auc = max([r for r in results_summary if isinstance(r['ROC-AUC'], float)], key=lambda x: x['ROC-AUC'])
        print(f"📈 Best ROC-AUC: {best_auc['Model']} ({best_auc['ROC-AUC']:.4f})")
        
        print("\n💡 ANALYSIS:")
        print("• Class imbalance is the main challenge (pit stops are rare, ~10% of samples)")
        print("• Recall is crucial for pit stop prediction (don't miss actual pit opportunities)")
        print("• Techniques aimed at imbalance (SMOTE, Focal Loss) are used to raise recall")
        print("• The LSTM is intended to capture temporal dependencies better than the simple CNN")
        print("• Ensemble methods combine strengths of different approaches")
        
        # Practical implications
        print("\n🏎️ PRACTICAL IMPLICATIONS FOR F1 TEAMS:")
        print("• High recall ensures teams don't miss pit opportunities")
        print("• Acceptable precision trade-off for comprehensive strategy coverage")
        print("• Real-time predictions can inform split-second pit decisions")
        print("• Model uncertainty can be quantified for risk management")
        
    else:
        print("❌ No model results available for comparison.")
        print("Please run the model training cells first.")

# Run the comparison
compare_all_models()

# Additional analysis if models exist
if 'ensemble_results' in locals() and ensemble_results:
    print("\n📋 CONFUSION MATRIX ANALYSIS:")
    cm = ensemble_results['confusion_matrix']
    print(f"True Negatives (Correct No-Pit): {cm[0,0]}")
    print(f"False Positives (Incorrect Pit): {cm[0,1]}")
    print(f"False Negatives (Missed Pit): {cm[1,0]}")
    print(f"True Positives (Correct Pit): {cm[1,1]}")
    
    total_predictions = cm.sum()
    actual_pits = cm[1,:].sum()
    predicted_pits = cm[:,1].sum()
    
    print(f"\nTotal Predictions: {total_predictions}")
    print(f"Actual Pit Stops: {actual_pits} ({actual_pits/total_predictions*100:.1f}%)")
    print(f"Predicted Pit Stops: {predicted_pits} ({predicted_pits/total_predictions*100:.1f}%)")
    
    print(f"\n🎯 The model correctly identifies {cm[1,1]/actual_pits*100:.1f}% of actual pit stops!")
    print(f"🎯 Only {cm[1,0]/actual_pits*100:.1f}% of pit opportunities are missed!")

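The percentages printed above follow directly from the four cells of the confusion matrix. A short sketch of that arithmetic, with a guard for splits that contain no pit laps; the function name and the counts are made up for illustration and are not results from this notebook:

```python
import numpy as np

def summarize_confusion(cm):
    """Derive precision, recall, F1 and miss rate from a 2x2 matrix [[TN, FP], [FN, TP]]."""
    tn, fp, fn, tp = np.asarray(cm).ravel()
    actual_pos = tp + fn
    predicted_pos = tp + fp
    recall = tp / actual_pos if actual_pos else 0.0
    precision = tp / predicted_pos if predicted_pos else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "precision": float(precision),
        "recall": float(recall),
        "f1": float(f1),
        "miss_rate": float(fn / actual_pos) if actual_pos else 0.0,
    }

# Hypothetical counts: 960 no-pit laps, 40 pit laps
example_cm = [[900, 60],
              [15, 25]]
summary = summarize_confusion(example_cm)
print(summary)
```

Here recall is 25/40 and the miss rate is 15/40, so the two always sum to one whenever at least one actual pit stop is present.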