# Tire Degradation Prediction Model - Week 5

## Overview
This notebook implements advanced models to predict tire degradation in Formula 1 races. Since tire degradation is not strictly linear and depends on multiple factors (compound type, track temperature, driving style, etc.), we'll use sequence models like LSTM to capture these complex patterns.

## Approach
1. **Data Exploration**
   - Analyze the relationship between lap times and tire age
   - Visualize performance degradation patterns by compound
   - Determine how to quantify "degradation" (lap time delta or derived metric)

2. **Feature Engineering**
   - Create a derived tire degradation metric
   - Organize data into sequential format for LSTM
   - Normalize features appropriately
   - Create sliding windows of previous laps to predict future performance

3. **Model Development**
   - **Primary Model**: LSTM network to predict degradation trajectory
   - **Alternative Model**: XGBoost with quantile regression for uncertainty estimation

4. **Evaluation & Visualization**
   - Compare predicted vs. actual degradation curves
   - Analyze performance across different compounds and race conditions
   - Create interactive Plotly visualizations of degradation patterns

5. **Implementation Details**
   - Sequence length: 5 laps (input) → predict next 3-5 laps
   - Features: Tire age, compound, lap time trends, fuel load
   - Target: Derived degradation metric or direct lap time prediction

## Expected Outcomes
- Trained model to predict tire performance over extended stints
- Uncertainty bounds for degradation predictions (10th, 50th, 90th percentiles)
- Interactive visualization of degradation curves by compound

---

## 1. Importing Necessary Libraries and Creating New Directories

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
import os



In [None]:
# Import utility functions from our module
from ML_utils.lap_prediction import compound_colors, compound_names

# Set plot style
plt.style.use('ggplot')
sns.set(style="whitegrid")

# Create output directories if they don't exist
os.makedirs('../../outputs/week5', exist_ok=True)
os.makedirs('../../models/week5', exist_ok=True)

---

## 2. Loading Dataframes

In [None]:
data = pd.read_csv("../../outputs/week3/lap_prediction_data.csv")
print("\nRegular data sample:")
display(data.head())

In [None]:
seq_data = pd.read_csv("../../outputs/week3/sequential_lap_prediction_data.csv")
print("\nSequential data sample:")
display(seq_data.head())

In [None]:
# Display basic information
print("Basic dataset information:")
print(f"Regular data shape: {data.shape}")
print(f"Sequential data shape: {seq_data.shape}")

---

## 3. Locating Tire Related Columns


In [None]:
# Check for tire-related columns
tire_columns = ['CompoundID', 'TyreAge']
print(f"\nTire-related columns: {tire_columns}")

In [None]:
# Summary statistics for tire-related columns
print("\nTire-related statistics:")
display(data[tire_columns].describe())

---

## 4. Compound Mappings

In [None]:
# Print compound mappings for reference
print("\nCompound mappings:")
print(f"Compound names: {compound_names}")
print(f"Compound colors: {compound_colors}")

---

## 5. Analyzing Relationship between Tire Age and Lap Time by compound

In [None]:
# Visualize the relationship between tire age and lap time by compound
plt.figure(figsize=(12, 6))

# Group by compound and tire age
for compound_id in data['CompoundID'].unique():
    subset = data[data['CompoundID'] == compound_id]
    
    # Aggregate by tire age
    agg_data = subset.groupby('TyreAge')['LapTime'].agg(['mean', 'std', 'count']).reset_index()
    
    # Only plot if we have enough data points
    if len(agg_data) > 1:
        color = compound_colors.get(compound_id, 'black')
        compound_name = compound_names.get(compound_id, f'Unknown ({compound_id})')
        
        plt.plot(agg_data['TyreAge'], agg_data['mean'], 'o-', 
                 color=color, label=f'{compound_name} Tire')
        
        # Add error bands (standard deviation)
        if 'std' in agg_data.columns:
            plt.fill_between(agg_data['TyreAge'], 
                            agg_data['mean'] - agg_data['std'], 
                            agg_data['mean'] + agg_data['std'],
                            color=color, alpha=0.2)

plt.xlabel('Tire Age (laps)')
plt.ylabel('Lap Time (s)')
plt.title('Tire Degradation: Effect on Lap Time')
plt.legend()
plt.grid(True, alpha=0.3)
plt.savefig('../../outputs/week5/tire_deg_curve.png')
plt.show()

---

## 6. Exploring Lap Time Deltas by Tire Age

In [None]:
# Explore lap time deltas as tire ages
if 'LapTime_Delta' in seq_data.columns:
    plt.figure(figsize=(12, 6))
    
    for compound_id in seq_data['CompoundID'].unique():
        subset = seq_data[seq_data['CompoundID'] == compound_id]
        
        # Aggregate by tire age
        agg_data = subset.groupby('TyreAge')['LapTime_Delta'].mean().reset_index()
        
        # Only plot if we have enough data points
        if len(agg_data) > 1:
            color = compound_colors.get(compound_id, 'black')
            compound_name = compound_names.get(compound_id, f'Unknown ({compound_id})')
            
            plt.plot(agg_data['TyreAge'], agg_data['LapTime_Delta'], 'o-', 
                     color=color, label=f'{compound_name} Tire')
    
    plt.axhline(y=0, color='black', linestyle='--', alpha=0.5)
    plt.xlabel('Tire Age (laps)')
    plt.ylabel('Lap Time Delta (s) - Positive means getting slower')
    plt.title('Lap Time Degradation Rate by Tire Age')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.savefig('../../outputs/week5/tire_deg_rate.png')
    plt.show()

---

## 7. Exploring if Tire Age affects Speed in different Sectors

In [None]:
# Look at how tire age affects speed in different sectors
speed_columns = ['SpeedI1', 'SpeedI2', 'SpeedFL']

plt.figure(figsize=(14, 8))
for speed_col in speed_columns:
    # Focus on compound ID 2 (Medium) since that's what we have in the data
    subset = data[data['CompoundID'] == 2]
    
    # Aggregate by tire age
    agg_data = subset.groupby('TyreAge')[speed_col].mean().reset_index()
    
    if len(agg_data) > 1:
        plt.plot(agg_data['TyreAge'], agg_data[speed_col], 'o-', 
                 label=f'{speed_col}')

plt.xlabel('Tire Age (laps)')
plt.ylabel('Speed (kph)')
plt.title(f'Effect of Tire Age on Speed - {compound_names.get(2)} Tires')
plt.legend()
plt.grid(True, alpha=0.3)
plt.savefig('../../outputs/week5/tire_age_speed_effect.png')
plt.show()

---

## 8. Creating Tire Degradation Metrics

To predict tire degradation more effectively, it helps to derive additional variables from the data we already have. I'll add the variables described below.

## Disclaimer: Importance of Fuel Load

After running the cells and inspecting the data, some tires appear to show negative degradation: lap times fall instead of rising as the tires age. This happens because the car burns fuel and gets lighter over the race. Therefore, I need to **create an adjusted lap time** that accounts for this fuel effect before building the prediction models.

I will first create this variable and then update the plots and metric calculations to use it.

*NOTE*: fuel burn and its impact on lap time are estimated from these articles:

- [BBC Sport Weight Reduction](https://www.bbc.com/sport/articles/cv2g715dkk1o#:~:text=A%201.5kg%20reduction%20in,so%20over%20a%20race%20distance.)

- [Fuel Correction Analysis, Medium](https://medium.com/@umakschually/fuel-correction-29ccd98ae62b#:~:text=Rule%20of%20thumb%20is%20that,tyre%20age%20or%20anything%20else.)
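
As a rough illustration of the correction, the sketch below adds an assumed per-lap fuel benefit back onto raw lap times. The 0.055 s/lap figure is the value used later in this notebook (midpoint of 0.05 to 0.06 s); the `fuel_adjust` helper and the sample lap times are hypothetical and only serve the example.

```python
# Assumed fuel benefit: each lap of burned fuel makes the car ~0.055 s faster.
LAP_TIME_IMPROVEMENT_PER_LAP = 0.055  # seconds per lap


def fuel_adjust(lap_times, laps_from_baseline, gain_per_lap=LAP_TIME_IMPROVEMENT_PER_LAP):
    """Add the estimated fuel benefit back so that only tire effects remain."""
    return [t + n * gain_per_lap for t, n in zip(lap_times, laps_from_baseline)]


# Hypothetical stint: raw lap times look flat, which hides the degradation.
raw = [90.00, 90.00, 90.00, 90.00]
laps = [0, 1, 2, 3]
adjusted = fuel_adjust(raw, laps)
print([round(t, 3) for t in adjusted])
```

With flat raw lap times, the adjusted series rises by the assumed fuel gain each lap, which is the degradation the lighter car was hiding.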

#### 1. Absolute Tire Degradation (TireDegAbsolute)

Its objective is to **measure how many seconds slower the current lap time is compared with the baseline** (new tires, or the least-worn tires available, for instance only 2 laps old).

**Positive values** imply degradation (the car is getting slower). It is measured in **seconds**.

- *Utility*:
    - Shows the direct impact on lap time.
    - Helps determine the *crossover point* at which a pit stop becomes an advantage.
    - Fundamental for strategy calculations, as teams work with absolute times.
    - Helps answer the question: **How many seconds are we losing per lap to degradation?**

#### 2. Tire Degradation Percentage 

It expresses degradation as a **percentage increase** over the baseline time. For instance, 2% means the car is 2% slower than on new tires.

- *Utility*:
    - Allows more intuitive comparisons between different conditions.
    - Normalizes the data for clearer comparisons between tires.
    - Helps answer the question: **Which compound best maintains its relative performance?**

#### 3. Degradation Rate

It measures how much the lap time increases with each additional lap. It represents the first derivative of the degradation curve.

- *Utility:*
    - Shows whether degradation is linear, progressive, or stabilizing.
    - Crucial for estimating the optimal pit stop window during races.
    - Allows anticipating a compound's future behavior.
    - Helps answer the question: **Is degradation getting worse or stabilizing?**
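
The three metrics above can be sketched on a toy stint as follows. The lap times and the short column names (`DegAbsolute`, `DegPercent`, `DegRate`) are hypothetical; the notebook's own cells compute the equivalent columns per compound on the real data.

```python
import pandas as pd

# Hypothetical fuel-adjusted lap times (s) for one compound, by tire age.
stint = pd.DataFrame({
    "TyreAge": [1, 2, 3, 4, 5],
    "FuelAdjustedLapTime": [90.0, 90.2, 90.5, 90.9, 91.8],
})

# Baseline: mean lap time on the freshest tires available (TyreAge == 1 here).
is_baseline = stint["TyreAge"] == stint["TyreAge"].min()
baseline = stint.loc[is_baseline, "FuelAdjustedLapTime"].mean()

# 1. Absolute degradation: seconds lost relative to the baseline.
stint["DegAbsolute"] = stint["FuelAdjustedLapTime"] - baseline

# 2. Percentage degradation: the same loss relative to the baseline time.
stint["DegPercent"] = (stint["FuelAdjustedLapTime"] / baseline - 1) * 100

# 3. Degradation rate: lap-to-lap change (a discrete first derivative).
stint["DegRate"] = stint["FuelAdjustedLapTime"].diff()

print(stint.round(3))
```

Note that `diff()` leaves the first row as NaN because there is no previous lap to compare against.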



---

### Calculating Lap Time Improvement Per Lap

In [None]:
# Create fuel-adjusted degradation metrics using the per-lap time improvement directly
# Time gained each lap due to the decreasing fuel load
LAP_TIME_IMPROVEMENT_PER_LAP = 0.055  # seconds per lap (midpoint of 0.05-0.06 s)

# Create a DataFrame to store all results with fuel adjustment
tire_deg_data = pd.DataFrame()

# Process each compound separately
for compound_id in data['CompoundID'].unique():
    compound_name = compound_names.get(compound_id, f"Unknown ({compound_id})")
    print(f"Processing {compound_name} tires (ID: {compound_id})...")
    
    # Filter for this compound
    compound_data = data[data['CompoundID'] == compound_id].copy()
    
    # Sort by TyreAge to see the degradation trend
    compound_data = compound_data.sort_values('TyreAge')
    
    # Check if we have enough data
    if len(compound_data) < 5:
        print(f"  Not enough data for {compound_name} tires, skipping")
        continue
    
    # Find baseline information
    if 1 in compound_data['TyreAge'].values:
        # Get baseline data (TyreAge=1)
        baseline_data = compound_data[compound_data['TyreAge'] == 1]
        baseline_lap_time = baseline_data['LapTime'].mean()
        baseline_tire_age = 1
    else:
        # If no 'new tire' laps, use the minimum TyreAge available
        min_age = compound_data['TyreAge'].min()
        baseline_data = compound_data[compound_data['TyreAge'] == min_age]
        baseline_lap_time = baseline_data['LapTime'].mean()
        baseline_tire_age = min_age
        print(f"  No laps with new tires for {compound_name}, using TyreAge={min_age} as baseline")
    
    # Calculate fuel adjustment directly based on laps from baseline
    compound_data['LapsFromBaseline'] = compound_data['TyreAge'] - baseline_tire_age
    compound_data['FuelEffect'] = compound_data['LapsFromBaseline'] * LAP_TIME_IMPROVEMENT_PER_LAP
    
    # Calculate fuel-adjusted lap time
    compound_data['FuelAdjustedLapTime'] = compound_data['LapTime'] + compound_data['FuelEffect']
    
    # Calculate traditional degradation metrics
    compound_data['TireDegAbsolute'] = compound_data['LapTime'] - baseline_lap_time
    compound_data['TireDegPercent'] = (compound_data['LapTime'] / baseline_lap_time - 1) * 100
    
    # Calculate fuel-adjusted degradation metrics
    baseline_adjusted_lap_time = baseline_lap_time  # For new tires, no adjustment needed
    compound_data['FuelAdjustedDegAbsolute'] = compound_data['FuelAdjustedLapTime'] - baseline_adjusted_lap_time
    compound_data['FuelAdjustedDegPercent'] = (compound_data['FuelAdjustedLapTime'] / baseline_adjusted_lap_time - 1) * 100
    
    # Add compound info for later aggregation
    compound_data['CompoundName'] = compound_name
    
    # Add to the combined DataFrame
    tire_deg_data = pd.concat([tire_deg_data, compound_data])
    
    # Calculate maximum laps and total fuel effect
    max_laps = compound_data['TyreAge'].max() - baseline_tire_age
    total_fuel_effect = max_laps * LAP_TIME_IMPROVEMENT_PER_LAP
    
    print(f"  Baseline lap time for {compound_name}: {baseline_lap_time:.3f}s")
    print(f"  Maximum laps from baseline: {max_laps:.0f}")
    print(f"  Estimated total fuel benefit: ~{total_fuel_effect:.2f}s")
    print(f"  Processed {len(compound_data)} laps with {compound_name} tires")

# Display comparison between regular and fuel-adjusted metrics
print("\nComparison of regular vs. fuel-adjusted metrics (sample):")
sample_comparison = tire_deg_data.groupby(['CompoundName', 'TyreAge'])[
    ['TireDegAbsolute', 'FuelAdjustedDegAbsolute', 'FuelEffect']
].mean().reset_index()
display(sample_comparison.head(10))


---

### Plotting the Difference between Regular and Fuel-Adjusted Degradation

In [None]:
# Create a comparison of regular vs fuel-adjusted degradation
plt.figure(figsize=(16, 12))
compound_ids = tire_deg_data['CompoundID'].unique()
# Loop through the compounds to create comparison plots
for i, compound_id in enumerate(compound_ids):
    compound_subset = tire_deg_data[tire_deg_data['CompoundID'] == compound_id]
    color = compound_colors.get(compound_id, 'black')
    compound_name = compound_names.get(compound_id, f'Unknown ({compound_id})')
    
    # Calculate means for regular and adjusted degradation
    reg_agg = compound_subset.groupby('TyreAge')['TireDegAbsolute'].mean()
    adj_agg = compound_subset.groupby('TyreAge')['FuelAdjustedDegAbsolute'].mean()
    
    # Create subplot
    plt.subplot(len(compound_ids), 1, i+1)
    
    # Plot regular degradation
    plt.plot(reg_agg.index, reg_agg.values, 'o--', 
             color=color, alpha=0.5, label=f'{compound_name} (Regular)')
    
    # Plot fuel-adjusted degradation
    plt.plot(adj_agg.index, adj_agg.values, 'o-', 
             color=color, linewidth=2, label=f'{compound_name} (Fuel Adjusted)')
    
    plt.axhline(y=0, color='gray', linestyle='--', alpha=0.7)
    plt.ylabel('Degradation (s)')
    plt.title(f'{compound_name} Tire Degradation: Regular vs. Fuel-Adjusted')
    plt.legend()
    plt.grid(True, alpha=0.3)
    
    # Annotate the estimated total fuel effect
    min_lap = reg_agg.index.min()
    max_lap = reg_agg.index.max()
    total_laps = max_lap - min_lap
    total_fuel_effect = total_laps * LAP_TIME_IMPROVEMENT_PER_LAP
    plt.annotate(f"Est. total fuel effect: ~{total_fuel_effect:.2f}s", 
                 xy=(0.02, 0.05), xycoords='axes fraction',
                 bbox=dict(boxstyle="round,pad=0.3", fc="white", alpha=0.8))
    
    if i == len(compound_ids)-1:  # Only add the x label to the bottom subplot
        plt.xlabel('Tire Age (laps)')

plt.tight_layout()
plt.savefig('../../outputs/week5/regular_vs_adjusted_comparison.png')
plt.show()


---

### 8.1 Absolute Tire Degradation

In [None]:
# Visualize the fuel-adjusted absolute degradation
plt.figure(figsize=(14, 7))
compound_ids = tire_deg_data['CompoundID'].unique()

for compound_id in compound_ids:
    compound_subset = tire_deg_data[tire_deg_data['CompoundID'] == compound_id]
    color = compound_colors.get(compound_id, 'black')
    compound_name = compound_names.get(compound_id, f'Unknown ({compound_id})')
    
    # Aggregate data for line plot
    agg_data = compound_subset.groupby('TyreAge')['FuelAdjustedDegAbsolute'].agg(['mean', 'std']).reset_index()
    
    # Plot mean line
    plt.plot(agg_data['TyreAge'], agg_data['mean'], 'o-', 
             color=color, linewidth=2, label=f'{compound_name}')
    
    # Add error bands if we have standard deviation
    if 'std' in agg_data.columns and not agg_data['std'].isnull().all():
        plt.fill_between(agg_data['TyreAge'], 
                        agg_data['mean'] - agg_data['std'], 
                        agg_data['mean'] + agg_data['std'],
                        color=color, alpha=0.2)

plt.axhline(y=0, color='gray', linestyle='--', alpha=0.7)
plt.xlabel('Tire Age (laps)')
plt.ylabel('Fuel-Adjusted Absolute Degradation (s)')
plt.title('Tire Degradation by Compound and Age (Fuel Effect Removed)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.savefig('../../outputs/week5/fuel_adjusted_deg_by_compound.png')
plt.show()

---

### 8.2 Tire Degradation Percentage

In [None]:
# Visualize the fuel-adjusted percentage degradation
plt.figure(figsize=(14, 7))

for compound_id in compound_ids:
    compound_subset = tire_deg_data[tire_deg_data['CompoundID'] == compound_id]
    color = compound_colors.get(compound_id, 'black')
    compound_name = compound_names.get(compound_id, f'Unknown ({compound_id})')
    
    # Aggregate data for line plot
    agg_data = compound_subset.groupby('TyreAge')['FuelAdjustedDegPercent'].agg(['mean', 'std']).reset_index()
    
    # Plot mean line
    plt.plot(agg_data['TyreAge'], agg_data['mean'], 'o-', 
             color=color, linewidth=2, label=f'{compound_name}')
    
    # Add error bands if we have standard deviation
    if 'std' in agg_data.columns and not agg_data['std'].isnull().all():
        plt.fill_between(agg_data['TyreAge'], 
                        agg_data['mean'] - agg_data['std'], 
                        agg_data['mean'] + agg_data['std'],
                        color=color, alpha=0.2)

plt.axhline(y=0, color='gray', linestyle='--', alpha=0.7)
plt.xlabel('Tire Age (laps)')
plt.ylabel('Fuel-Adjusted Percentage Degradation (%)')
plt.title('Percentage Tire Degradation by Compound and Age (Fuel Effect Removed)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.savefig('../../outputs/week5/fuel_adjusted_deg_percent_by_compound.png')
plt.show()

---

### 8.3 Tire Degradation Rate

In [None]:
# Compute the degradation rate per compound and add it to the dataframe
for compound_id in compound_ids:
    # Rate = lap-to-lap change of the mean fuel-adjusted lap time at each tire age
    compound_subset = tire_deg_data[tire_deg_data['CompoundID'] == compound_id]
    avg_laptimes = compound_subset.groupby('TyreAge')['FuelAdjustedLapTime'].mean()
    deg_rates = avg_laptimes.diff()
    
    # Assign values to the dataframe
    for age, rate in zip(deg_rates.index, deg_rates.values):
        mask = (tire_deg_data['CompoundID'] == compound_id) & (tire_deg_data['TyreAge'] == age)
        tire_deg_data.loc[mask, 'DegradationRate'] = rate

# Verify that it has been added correctly
print("\nFirst rows with DegradationRate:")
display(tire_deg_data[['CompoundID', 'TyreAge', 'FuelAdjustedLapTime', 'DegradationRate']].head(10))


In [None]:
# Plot a line chart showing Tire Degradation Rate by compound with error bands
plt.figure(figsize=(14, 7))

for compound_id in compound_ids:
    # Filter the data for the current compound
    compound_subset = tire_deg_data[tire_deg_data['CompoundID'] == compound_id]
    
    # Calculate the average and standard deviation of degradation rate per tire age
    deg_stats = compound_subset.groupby('TyreAge')['DegradationRate'].agg(['mean', 'std']).reset_index()
    
    # Get color and compound name for the plot
    color = compound_colors.get(compound_id, 'black')
    compound_name = compound_names.get(compound_id, f'Unknown ({compound_id})')
    
    # Plot the line for this compound
    plt.plot(deg_stats['TyreAge'], deg_stats['mean'], marker='o', linestyle='-',
             color=color, linewidth=2, label=compound_name)
    
    # Add error bands (standard deviation)
    # Check if we have valid standard deviation values
    if 'std' in deg_stats.columns and not deg_stats['std'].isnull().all():
        plt.fill_between(deg_stats['TyreAge'], 
                        deg_stats['mean'] - deg_stats['std'], 
                        deg_stats['mean'] + deg_stats['std'],
                        color=color, alpha=0.2)

plt.axhline(y=0, color='gray', linestyle='--', alpha=0.7)
plt.xlabel('Tire Age (laps)')
plt.ylabel('Fuel-Adjusted Degradation Rate (s/lap)')
plt.title('Tire Degradation Rate by Compound (Fuel Effect Removed)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

In [None]:
# Fill NaNs. They come from pandas' diff(), which has no previous value at the first tire age of each compound.
# Assume 0 there, since new tires have not degraded yet.

num_nans = tire_deg_data['DegradationRate'].isna().sum()
print(f"Number of NaN before substitution: {num_nans}")

tire_deg_data['DegradationRate'] = tire_deg_data['DegradationRate'].fillna(0)
print(f"Number of NaN after substitution: {tire_deg_data['DegradationRate'].isna().sum()}")

## Key Findings

#### Medium Tire (Yellow)

- The fuel effect was masking significant degradation. With the adjustment, more degradation is visible.

- They offer the best initial advantage (-4 seconds), which stabilizes at about -3 seconds until approximately lap 30.

- The total fuel effect makes the car about 1.73 seconds faster by the end of the stint.

- They represent the best balance between performance and durability.

#### Hard Tire (Gray)

- The fuel effect created an illusion of stabilization, and even of improvement. The adjustment reveals a constant, progressive degradation that reaches +2 seconds.

- The fuel effect is largest on these tires, at about 2.64 seconds faster.

- Degradation rate is more stable and predictable, making them ideal for long stints.

#### Soft Tire (Red)

- They are more volatile than they appeared before the fuel adjustment.
- Erratic behaviour and large fluctuations after lap 20.
- Highly unpredictable, especially after lap 20, with extreme degradation peaks of up to 2 seconds slower per lap.

### Detected Turning Points

- *SOFT*: shows a degradation cliff after about 20 laps.
- *MEDIUM*: change in pattern near lap 30.
- *HARD*: shows increased degradation after lap 40.


### Useful Conclusions for the Predictive Model

1. Fuel adjustment was essential for identifying true degradation.
2. Each compound shows a distinct pattern that can be useful for the model:
    - Soft tires are highly volatile, with critical points of sudden degradation.
    - Medium tires show a fast drop followed by stabilization.
    - Hard tires show slow but continuous degradation.

3. The identified turning points will be crucial inputs for the LSTM or XGBoost model, as they mark critical moments for pit stop strategy.
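
One possible way to turn visually identified turning points into a model feature is to flag the tire ages where the degradation rate exceeds a threshold. The sketch below is only an assumption about how that could be done: the 0.5 s/lap threshold, the `find_cliff_laps` helper, and the sample rates are hypothetical and not derived from the results above.

```python
import pandas as pd


def find_cliff_laps(deg_rate, threshold=0.5):
    """Return the tire ages where the lap-to-lap degradation rate exceeds threshold.

    deg_rate: Series of degradation rates (s/lap) indexed by tire age.
    threshold: hypothetical cut-off in s/lap; it would need tuning per compound.
    """
    return deg_rate.index[deg_rate > threshold].tolist()


# Hypothetical degradation rates for a soft-tire stint, indexed by tire age.
rates = pd.Series(
    [0.05, 0.08, 0.10, 0.70, 0.20, 1.10],
    index=[17, 18, 19, 20, 21, 22],
)
print(find_cliff_laps(rates))
```

A fixed threshold is the simplest choice; a rolling statistic or a change-point method might be more robust on noisy stints.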

---

## 9. Correlation Analysis: Tire-Related Factors with Lap Time

In [None]:
# Invert the dictionary to map compound names back to their numeric IDs
compound_names_inv = {value: key for key, value in compound_names.items()}
# Replace the names with their corresponding IDs
tire_deg_data["CompoundName"] = tire_deg_data["CompoundName"].replace(compound_names_inv)

# Drop this column, as it does not carry any information
tire_deg_data = tire_deg_data.drop('Unnamed: 0', axis=1)


In [None]:
# Build the correlation matrix (numeric columns only)
correlation_matrix = tire_deg_data.corr(numeric_only=True)

# Create a heatmap
plt.figure(figsize=(24,12))
sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Correlation Matrix")
plt.show()

---

## 10. Conclusions and Variable Cleaning

**Variables to Keep**  
- **FuelAdjustedLapTime** (remove `LapTime`, correlation 0.95)  
  - Main lap time metric, with the misleading fuel effect removed  
- **FuelAdjustedDegPercent** (remove other degradation metrics)  
  - Best metric for comparing compounds  
  - The four degradation metrics are very highly correlated with each other (0.98–1.00). The other three were only created to make the data analysis easier to interpret.
- **DegradationRate**  
  - Captures the changing dynamics of degradation  
  - Its low correlation with other variables confirms it adds unique information  
  - Crucial for detecting inflection points and sudden changes  
- **TyreAge**  
  - A fundamental variable for the model  
  - Tire age is the main predictor of its condition  
- **CompoundID** (remove `CompoundName`)  
  - Needed to distinguish between different compounds  

- **All remaining variables**

**Optional Variables (if they improve the model)**  
- **SpeedI1**, **SpeedI2**, **SpeedFL**: to capture sector effects  
- **FuelLoad**: as a control variable  

**Variables to Remove**  
- **LapTime** (use only `FuelAdjustedLapTime`)  
- **TireDegAbsolute**, **TireDegPercent**, **FuelAdjustedDegAbsolute**  
- **CompoundName** (redundant with `CompoundID`)  
- **LapsFromBaseline** and **FuelEffect** (perfect correlation)  
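
Redundant pairs like these can also be listed programmatically instead of being read off the heatmap. The following is a minimal sketch, assuming a 0.95 cut-off on the absolute Pearson correlation; the `highly_correlated_pairs` helper and the toy frame are hypothetical.

```python
import numpy as np
import pandas as pd


def highly_correlated_pairs(df, threshold=0.95):
    """List column pairs whose absolute Pearson correlation is >= threshold."""
    corr = df.corr(numeric_only=True).abs()
    # Keep only the upper triangle so each pair is reported once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return pairs[pairs >= threshold].sort_values(ascending=False)


# Hypothetical frame: B is a linear function of A, C is unrelated noise.
rng = np.random.default_rng(0)
toy = pd.DataFrame({"A": np.arange(50, dtype=float)})
toy["B"] = 2 * toy["A"] + 1
toy["C"] = rng.normal(size=50)

result = highly_correlated_pairs(toy)
print(result)
```

Which column of each pair to drop is still a modelling decision; the sketch only surfaces the candidates.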

In [None]:
# Remove only the specified redundant variables
columns_to_remove = [
    'LapTime',                 # Use only FuelAdjustedLapTime
    'TireDegAbsolute',         # Redundant with FuelAdjustedDegPercent
    'TireDegPercent',          # Redundant with FuelAdjustedDegPercent
    'FuelAdjustedDegAbsolute', # Redundant with FuelAdjustedDegPercent
    'CompoundName',            # Redundant with CompoundID
    'LapsFromBaseline',        # Perfect correlation
    'FuelEffect'               # Perfect correlation
]

# Drop columns from the dataframe
tire_deg_data = tire_deg_data.drop(columns=columns_to_remove)

# Show how many variables were removed and how many columns remain
print(f"Removed {len(columns_to_remove)} redundant variables.")
print(f"The dataframe now has {tire_deg_data.shape[1]} columns.")




---

## 11. New Correlation Matrix to see interesting Variables

In [None]:
# Display the new correlation matrix (numeric columns only)
updated_correlation_matrix = tire_deg_data.corr(numeric_only=True)
plt.figure(figsize=(16,8))
sns.heatmap(updated_correlation_matrix, annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Updated Correlation Matrix after Removing Redundant Variables")
plt.show()

After processing the data and applying fuel adjustment, the correlation matrix reveals critical relationships that will inform our LSTM model. Here are the main points:

- **Key Correlations:**
  - **FuelAdjustedLapTime & FuelAdjustedDegPercent:** Moderately strong correlation (0.59) indicates that both describe similar degradation dynamics from different perspectives.
  - **TyreAge & LapsSincePitStop:** Very strong correlation (0.90), as both track nearly the same information.
  - **CompoundID & FuelAdjustedDegPercent:** Moderate correlation (0.40) shows that tire compound significantly influences degradation.
  - **FuelLoad & Performance Metrics:** Strong negative correlation with Stint (-0.86) and moderate correlation with adjusted lap time (0.42), emphasizing the importance of fuel weight.
  - **DegradationRate:** Low correlation with most variables, including FuelAdjustedDegPercent (0.19), suggesting it captures unique degradation dynamics.

- **LSTM Model Input Selection:**
  - **Primary Variables:**
    - **FuelAdjustedLapTime:** Core performance metric without fuel effects.
    - **FuelAdjustedDegPercent:** Best metric for comparing compound performance.
    - **DegradationRate:** Captures lap-to-lap changes and critical inflection points.
    - **TyreAge:** Fundamental predictor of tire condition.
    - **CompoundID:** Essential for distinguishing between different tire compounds.
  - **Supporting Variables:**
    - **Speed Metrics (SpeedI1, SpeedI2, SpeedFL):** Capture sector-specific effects.
    - **FuelLoad:** Used as a control variable.
    - **Position:** Provides contextual race information.

---

## 12. Saving my final Dataframe as a CSV

In [None]:

# Save the processed DataFrame with the fuel-adjusted degradation metrics
output_path = "../../outputs/week5/tire_degradation_fuel_adjusted.csv"

# Save the DataFrame
tire_deg_data.to_csv(output_path, index=False)

---


## 13. Creating Sequential Data

In [None]:
df = pd.read_csv("../../outputs/week5/tire_degradation_fuel_adjusted.csv")

## Sequencing Process for LSTM

### Creating Temporal Sequences:
- We need to transform our tabular data into chronologically ordered sequences.
- Each sequence should contain a sliding window of N consecutive laps (typically 5 laps).

### Data Format for LSTM:
**Input:** [lap_t-5, lap_t-4, lap_t-3, lap_t-2, lap_t-1]  

**Output:** [lap_t, lap_t+1, lap_t+2]

### Variables to Include in Each Sequence Element:
- FuelAdjustedLapTime
- FuelAdjustedDegPercent
- DegradationRate
- TyreAge
- CompoundID
- Contextual variables (FuelLoad, position, sector speeds)

It is crucial to ensure that the sequences maintain temporal integrity and do not mix data from different stints or pit stops.
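
One simple check for temporal integrity is to confirm that tire age increases by exactly one lap at every step within a window. The sketch below assumes that `TyreAge` advances by one per lap inside a stint; the `is_contiguous` helper is hypothetical and not part of the notebook's utilities.

```python
import numpy as np


def is_contiguous(tyre_ages):
    """True if the tire ages increase by exactly one lap at every step."""
    ages = np.asarray(tyre_ages)
    return bool(np.all(np.diff(ages) == 1))


consecutive = is_contiguous([3, 4, 5, 6, 7])     # laps from a single stint
with_gap = is_contiguous([3, 4, 6, 7, 8])        # lap 5 missing, e.g. filtered out
across_stop = is_contiguous([14, 15, 1, 2, 3])   # tire age resets after a pit stop
print(consecutive, with_gap, across_stop)
```

Windows that fail this check could be discarded before training so that no sequence spans a gap or a pit stop.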

In [None]:
def create_sequences(df, input_length=5, prediction_horizon=3, target_column='FuelAdjustedDegPercent'):
    """
    Create sequences for LSTM model from the tire degradation data.
    Groups by driver, stint and compound to ensure proper sequencing.
    
    Args:
        df: DataFrame with tire degradation data
        input_length: Number of consecutive laps to include in input sequence
        prediction_horizon: Number of future laps to predict
        target_column: Column to predict
        
    Returns:
        sequences: List of DataFrame sequences
        targets: List of target arrays
    """
    sequences = []
    targets = []
    
    # Group by DriverNumber, Stint, and CompoundID
    groupby_columns = ['DriverNumber', 'Stint', 'CompoundID']
    
    # Process each driver-stint-compound group separately
    for name, group in df.groupby(groupby_columns):
        # Sort by TyreAge to ensure chronological order
        sorted_group = group.sort_values('TyreAge').reset_index(drop=True)
        
        # Skip if we don't have enough laps for a sequence
        if len(sorted_group) < input_length + prediction_horizon:
            continue
        
        # Create sliding window sequences
        for i in range(len(sorted_group) - input_length - prediction_horizon + 1):
            # Get input sequence (all features)
            seq = sorted_group.iloc[i:i+input_length]
            
            # Get target values (future values to predict)
            target = sorted_group.iloc[i+input_length:i+input_length+prediction_horizon][target_column].values
            
            sequences.append(seq)
            targets.append(target)
    
    print(f"Created {len(sequences)} sequences of {input_length} laps each")
    return sequences, targets



In [None]:
sequences, targets = create_sequences(df)

---

### Explanation of Sequences and Targets
#### What does the `create_sequences` function do?
The function creates data in a sequential format, necessary for training LSTM models. Specifically:

**Sequences:**
- They are "sliding windows" of consecutive data from the same set of tires.
- Each sequence contains data from 5 consecutive laps (all DataFrame columns).
- They represent the "recent history" that the model will use to make predictions.

**Targets:**
- These are the values we want to predict in the future.
- Each target contains the degradation values for the next 3 laps after the sequence.
- It only includes the column we want to predict (`FuelAdjustedDegPercent`).

#### Concrete Example
Imagine we have data from 10 laps with the same tire:

- **Sequence 1:** Laps 1-5
  - **Target 1:** Degradation in laps 6-8
- **Sequence 2:** Laps 2-6
  - **Target 2:** Degradation in laps 7-9
- **Sequence 3:** Laps 3-7
  - **Target 3:** Degradation in laps 8-10
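
The example above can be reproduced with a few lines of NumPy. This is a sketch that uses lap numbers in place of real features; it mirrors the window arithmetic of `create_sequences` and then stacks the windows into the 3D shape (samples, timesteps, features) that LSTM layers typically expect.

```python
import numpy as np

input_length, prediction_horizon = 5, 3
laps = np.arange(1, 11)  # laps 1..10 on the same set of tires

# Same window arithmetic as create_sequences: one window per valid start index.
n_windows = len(laps) - input_length - prediction_horizon + 1
windows = [
    (
        laps[i:i + input_length],
        laps[i + input_length:i + input_length + prediction_horizon],
    )
    for i in range(n_windows)
]

for seq, target in windows:
    print(f"input laps {seq.tolist()} -> target laps {target.tolist()}")

# Stack into arrays. There is a single feature here (the lap number),
# so the last axis of X has size 1.
X = np.stack([seq for seq, _ in windows])[..., np.newaxis]
y = np.stack([target for _, target in windows])
print(X.shape, y.shape)
```

With real data, the last axis of `X` would hold one entry per selected feature column instead of the lap number.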




---

## 14. Verifying the Sequential Data

In [None]:
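def verify_sequences_with_targets(sequences, targets, num_to_check=3):
    """Print pairs of consecutive sequences from the same driver-stint-compound group
    so the sliding-window pattern and the targets can be inspected."""
    print("COMPLETE SEQUENCE VERIFICATION (WITH TARGETS):")
    print("=============================================")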
    
    # Check a few consecutive sequences from the same group
    driver_stint_compounds = []
    
    for i, seq in enumerate(sequences):
        # Get identifier for this sequence
        identifier = (seq['DriverNumber'].iloc[0], seq['Stint'].iloc[0], seq['CompoundID'].iloc[0])
        driver_stint_compounds.append(identifier)
    
    # Find groups with consecutive sequences
    for i in range(len(sequences)-1):
        # Check if consecutive sequences are from same driver-stint-compound
        if driver_stint_compounds[i] == driver_stint_compounds[i+1]:
            seq1 = sequences[i]
            seq2 = sequences[i+1]
            
            # Get tire ages and targets
            ages1 = seq1['TyreAge'].values
            ages2 = seq2['TyreAge'].values
            target1 = targets[i]
            target2 = targets[i+1]
            
            # Calculate what the next tire ages should be (for targets)
            expected_target_ages1 = np.array([ages1[-1] + j + 1 for j in range(len(target1))])
            expected_target_ages2 = np.array([ages2[-1] + j + 1 for j in range(len(target2))])
            
            # Check if sliding window pattern is correct
            sliding_window_correct = np.array_equal(ages1[1:], ages2[:-1])
            
            # Print results
            print(f"\nSequences {i} and {i+1}:")
            print(f"Driver: {seq1['DriverNumber'].iloc[0]}, Stint: {seq1['Stint'].iloc[0]}, Compound: {seq1['CompoundID'].iloc[0]}")
            print(f"Tire ages seq {i}: {ages1}")
            print(f"TARGET values seq {i}: {target1}")
            print(f"Expected target ages seq {i}: {expected_target_ages1}")
            print(f"Tire ages seq {i+1}: {ages2}")
            print(f"TARGET values seq {i+1}: {target2}")
            print(f"Expected target ages seq {i+1}: {expected_target_ages2}")
            print(f"Sliding window pattern: {sliding_window_correct}")
            
            # Verify that target1 corresponds to the next values after seq1
            # We'd need the original dataframe to check this precisely
            
            # Only check a limited number
            num_to_check -= 1
            if num_to_check <= 0:
                break
    
    print("\nVERIFICATION SUMMARY:")
    print("1. Each sequence should advance by one lap (sliding window pattern)")
    print("2. Targets should contain the FuelAdjustedDegPercent values for the next 3 laps")
    print("3. Each target should start exactly where its sequence ends")

In [None]:
# Verify sequences with their targets
verify_sequences_with_targets(sequences, targets)
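The function above prints the expected target ages but does not compare the target values against the original DataFrame. A minimal sketch of such a check follows. It assumes that `TyreAge` is unique and increasing within each driver-stint-compound group; `target_matches_source` and the toy data are hypothetical and only illustrate the idea.

```python
import numpy as np
import pandas as pd

def target_matches_source(df, seq, target, target_column="FuelAdjustedDegPercent"):
    """Check that `target` equals the laps that follow `seq` in the source data.

    Hypothetical helper; assumes TyreAge is unique within each
    (DriverNumber, Stint, CompoundID) group.
    """
    mask = np.ones(len(df), dtype=bool)
    for key in ["DriverNumber", "Stint", "CompoundID"]:
        mask &= (df[key] == seq[key].iloc[0]).to_numpy()
    group = df[mask].sort_values("TyreAge")

    # Laps strictly after the last lap of the sequence
    last_age = seq["TyreAge"].iloc[-1]
    following = group[group["TyreAge"] > last_age].head(len(target))
    if len(following) != len(target):
        return False
    return bool(np.allclose(following[target_column].to_numpy(), target))

# Toy stint of 8 laps (made-up values, not real race data)
toy = pd.DataFrame({
    "DriverNumber": [1] * 8,
    "Stint": [1.0] * 8,
    "CompoundID": [2] * 8,
    "TyreAge": np.arange(1, 9),
    "FuelAdjustedDegPercent": np.linspace(0.0, 0.7, 8),
})
seq = toy.iloc[:5]
target = toy.iloc[5:8]["FuelAdjustedDegPercent"].to_numpy()
print(target_matches_source(toy, seq, target))
```

On the real data, the same check could be run over `zip(sequences, targets)` with `df` as the source.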

### Validation of Data Sequencing
#### Sliding Window Pattern:
- Each sequence advances exactly one lap relative to the previous one.
- Example: `[1,2,3,4,5] → [2,3,4,5,6] → [3,4,5,6,7] → [4,5,6,7,8]`

#### Consistency in Targets:
- Targets always represent the next 3 laps after each sequence.
- Example: Sequence 0 ends at lap 5, targets are laps 6, 7, 8.

#### Coherence Between Sequences and Targets:
- The target values also "slide" accordingly:
  - **Target of Sequence 0:** `[-3.88, -3.73, -4.36]`
  - **Target of Sequence 1:** `[-3.73, -4.36, -3.83]`
- The first two values of Target 1 match the last two of Target 0.
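This overlap can also be checked programmatically instead of by eye. The sketch below uses the two target vectors quoted above; `targets_overlap` is a hypothetical helper, and the check is only meaningful for consecutive sequences from the same stint.

```python
import numpy as np

def targets_overlap(target_a, target_b):
    """Return True if target_b is target_a shifted forward by one lap."""
    target_a = np.asarray(target_a, dtype=float)
    target_b = np.asarray(target_b, dtype=float)
    # All but the first value of A should equal all but the last value of B
    return bool(np.allclose(target_a[1:], target_b[:-1]))

target_0 = [-3.88, -3.73, -4.36]
target_1 = [-3.73, -4.36, -3.83]
print(targets_overlap(target_0, target_1))  # True
```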

#### Maintaining Structure by Driver, Stint, and Compound:
- All shown sequences belong to the same driver (1), same stint (1.0), and same compound (2).
- This ensures we are analyzing the degradation of a single set of tires.

### Implications for Our LSTM Model
- The model can learn degradation patterns based on consecutive lap sequences.
- It can predict degradation for the next 3 laps using the previous 5 laps.
- The data structure captures both the absolute degradation level and its rate of change.

With this verification, we confirm that the data is correctly prepared for training the LSTM model. We selected `FuelAdjustedDegPercent` as the target because it represents degradation adjusted for fuel effects, which is precisely what we aim to predict.

---

## 15. LSTM: Data Preparation

In [None]:
def prepare_for_lstm(sequences, targets):
    """
    Convert the list of DataFrames and targets into numpy arrays suitable for LSTM training
    """
    # Get the number of features (columns) in the sequence DataFrames
    n_features = len(sequences[0].columns)
    sequence_length = len(sequences[0])
    prediction_horizon = len(targets[0])
    
    # Initialize arrays
    X = np.zeros((len(sequences), sequence_length, n_features))
    y = np.zeros((len(sequences), prediction_horizon))
    
    # Fill the arrays
    for i, (seq, target) in enumerate(zip(sequences, targets)):
        X[i] = seq.values
        y[i] = target
    
    print(f"Prepared data for LSTM with shape: X: {X.shape}, y: {y.shape}")
    return X, y


In [None]:
# Prepare data for LSTM (convert to numpy arrays)
X, y = prepare_for_lstm(sequences, targets)

### 15.1 **Prepared Data:**
- **X**: (763, 5, 16) → 763 sequences, each containing 5 laps and 16 features per lap
- **y**: (763, 3) → For each sequence, we predict 3 future laps of degradation
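The `(samples, timesteps, features)` layout can be illustrated on toy data. The sketch below uses `np.stack` as an alternative to filling pre-allocated arrays; the random values and the count of 4 sequences are placeholders. It assumes every sequence has the same shape and only numeric columns, which `prepare_for_lstm` also needs in order to fill a float array.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# 4 toy sequences of 5 laps x 16 features, each with a 3-lap target
toy_sequences = [pd.DataFrame(rng.normal(size=(5, 16))) for _ in range(4)]
toy_targets = [rng.normal(size=3) for _ in range(4)]

# Stack along a new leading axis: (samples, timesteps, features)
X_toy = np.stack([seq.to_numpy(dtype=float) for seq in toy_sequences])
y_toy = np.stack(toy_targets)

print(X_toy.shape, y_toy.shape)  # (4, 5, 16) (4, 3)
```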

---

### 15.2 **Dataset Split:**

In [None]:
# Split data into train, validation and test sets (70-15-15)
# First split: separate test set (15%)
from sklearn.model_selection import train_test_split


X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42
)

# Second split: divide remaining data into train (70%) and validation (15%)
# The validation should be 17.65% of the temporary set (0.15/0.85)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.15/0.85, random_state=42
)

---

In [None]:
# Print the shapes of the resulting datasets
print("Data split:")
print(f"X_train shape: {X_train.shape} ({len(X_train)/len(X):.1%})")
print(f"X_val shape: {X_val.shape} ({len(X_val)/len(X):.1%})")
print(f"X_test shape: {X_test.shape} ({len(X_test)/len(X):.1%})")
print(f"y_train shape: {y_train.shape}")
print(f"y_val shape: {y_val.shape}")
print(f"y_test shape: {y_test.shape}")

With 763 sequences in total, the split yields:

- **Train**: 533 sequences (69.9%), close to the 70% target
- **Validation**: 115 sequences (15.1%), close to the 15% target
- **Test**: 115 sequences (15.1%), close to the 15% target
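One caveat about a purely random split: consecutive windows from the same stint share four of their five laps, so near-duplicate windows can end up on both sides of the split and make test metrics look optimistic. A possible alternative, sketched below on toy arrays, is to split by group with scikit-learn's `GroupShuffleSplit`. The `groups` array here is synthetic; in the notebook it would have to be built from each sequence's driver, stint, and compound, which is an assumption and not something the cells above do.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(42)

# Toy stand-in: 20 stints with 10 windows each (placeholder values)
n_groups, windows_per_group = 20, 10
X_toy = rng.normal(size=(n_groups * windows_per_group, 5, 16))
y_toy = rng.normal(size=(n_groups * windows_per_group, 3))
groups = np.repeat(np.arange(n_groups), windows_per_group)

# Hold out roughly 15% of the stints (not 15% of the windows) for testing
splitter = GroupShuffleSplit(n_splits=1, test_size=0.15, random_state=42)
train_val_idx, test_idx = next(splitter.split(X_toy, y_toy, groups))

X_train_val, X_test_toy = X_toy[train_val_idx], X_toy[test_idx]
y_train_val, y_test_toy = y_toy[train_val_idx], y_toy[test_idx]

# No stint should appear on both sides of the split
shared = set(groups[train_val_idx]) & set(groups[test_idx])
print(f"test windows: {len(test_idx)}, stints shared with train: {len(shared)}")
```

The same splitter can be applied a second time to `train_val_idx` to carve out a validation set. Because whole stints are held out, the resulting proportions only approximate 70-15-15.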