# Notebook 02: Data Preprocessing & Feature Engineering



**Objective:** clean the base dataset and engineer features for analysis and machine learning.

---


| Step | Description | Output |
|------|-------------|--------|
| 1 | Load base dataset from Notebook 01 | Raw merged data |
| 2 | Clean qualifying grid positions | `grid_clean` column |
| 3 | Create race-level features | `position_gain`, `is_podium` |
| 4 | Add finish status flags | `is_finished`, `is_dnf`, `is_dns` |
| 5 | Engineer historical features | Driver & constructor past performance |
| 6 | Final cleaning & validation | Handle remaining missing values |

### Output

Processed dataset saved to: `../data/processed/processed_f1_2018_2024.csv`

---

### Table of Contents

1. [Setup & Imports](#1-setup)
2. [Load Base Data](#2-load)
3. [Initial Data Quality](#3-quality)
4. [Grid Cleaning](#4-grid)
5. [Race Features](#5-race-features)
6. [Status & DNF Flags](#6-status)
7. [Historical Features](#7-historical)
8. [Final Cleaning](#8-final-clean)
9. [Validation & Verification](#9-validation)
10. [Save & Summary](#10-save)

---

## 1. Setup & Imports <a id='1-setup'></a>

In [None]:

import sys
from pathlib import Path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


PROJECT_ROOT = Path.cwd().resolve()
if not (PROJECT_ROOT / "src").exists() and (PROJECT_ROOT.parent / "src").exists():
    PROJECT_ROOT = PROJECT_ROOT.parent

sys.path.insert(0, str(PROJECT_ROOT))

#import custom preprocessing functions
from src.data_processing import (
    clean_grid,
    add_race_features,
    attach_status_text,
    add_dnf_dns_flags,
    add_time_aware_aggregates,
    final_clean,
)


pd.set_option('display.max_columns', 50)
pd.set_option('display.width', 200)

print("Setup complete!")
print(f"Project root: {PROJECT_ROOT}")

In [None]:

DATA_RAW = PROJECT_ROOT / 'data' / 'raw'
DATA_PROCESSED = PROJECT_ROOT / 'data' / 'processed'
DATA_PROCESSED.mkdir(parents=True, exist_ok=True)

BASE_PATH = DATA_PROCESSED / 'f1_base_2018_2024.csv'
FINAL_PATH = DATA_PROCESSED / 'processed_f1_2018_2024.csv'

print(f"Input:  {BASE_PATH}")
print(f"Output: {FINAL_PATH}")
print(f"Input exists: {BASE_PATH.exists()}")

---

## 2. Load Base Data <a id='2-load'></a>

Load the base dataset created in Notebook 01.

In [None]:
# loading base dataset
df = pd.read_csv(BASE_PATH, parse_dates=['date'])

print("Base dataset loaded!")
print(f"Shape: {df.shape[0]:,} rows x {df.shape[1]} columns")
print(f"Years: {df['year'].min()} - {df['year'].max()}")
print(f"Memory: {df.memory_usage(deep=True).sum() / 1024:.1f} KB")

In [None]:
#preview
df.head()

In [None]:
#current columns
print("Current columns:")
print(df.columns.tolist())

---

## 3. Initial Data Quality <a id='3-quality'></a>

Assess data quality before any transformations.

In [None]:
#missing values BEFORE cleaning
print("MISSING VALUES (Before Cleaning)")
print("=" * 50)

missing_before = df.isna().sum()
missing_pct = (missing_before / len(df) * 100).round(2)

missing_df = pd.DataFrame({
    'Missing Count': missing_before,
    'Missing %': missing_pct
}).sort_values('Missing Count', ascending=False)

display(missing_df[missing_df['Missing Count'] > 0])

if missing_df['Missing Count'].sum() == 0:
    print("No missing values in the dataset!")

In [None]:
#Key statistics BEFORE cleaning
print("KEY STATISTICS (Before Cleaning)")
print("=" * 50)

display(df[['grid', 'positionOrder', 'points']].describe())

In [None]:
# Check grid=0 values (special case in Ergast data)
grid_zero_count = (df['grid'] == 0).sum()
print(f"Grid position = 0 (pit lane start / unknown): {grid_zero_count} rows")
print(f"This represents {grid_zero_count / len(df) * 100:.2f}% of data")
print()
print("These will be treated as missing (NaN) in grid_clean column.")

---

## 4. Grid Cleaning <a id='4-grid'></a>

### Problem
In F1 data, `grid = 0` doesn't mean pole position. It indicates:
- Pit lane start
- Unknown qualifying position
- Disqualification from qualifying

### Solution
Create `grid_clean` column where `grid = 0` is replaced with `NaN`.
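
The project's `clean_grid` lives in `src.data_processing` and is not shown in this notebook. The following is a minimal sketch of the replacement described above, assuming only a numeric `grid` column. The name `clean_grid_sketch` and the toy data are hypothetical.

```python
import pandas as pd


def clean_grid_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with a grid_clean column where grid == 0 becomes NaN."""
    out = df.copy()
    # Cast to float first so NaN can be stored, then keep only non-zero slots.
    # Series.where keeps values where the condition is True and fills NaN elsewhere.
    out["grid_clean"] = out["grid"].astype(float).where(out["grid"] != 0)
    return out


# Hypothetical toy data: the second row has the placeholder grid value 0.
toy = pd.DataFrame({"grid": [1, 0, 5, 20]})
cleaned = clean_grid_sketch(toy)
print(cleaned)
```

Keeping the original `grid` column next to `grid_clean` makes it possible to audit afterwards which rows were affected.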

In [None]:
#Apply grid cleaning
df = clean_grid(df)

print("Grid cleaning applied!")
print()
print("Comparison:")
print(f"   grid values:       min={df['grid'].min()}, max={df['grid'].max()}")
print(f"   grid_clean values: min={df['grid_clean'].min()}, max={df['grid_clean'].max()}")
print(f"   grid_clean NaN:    {df['grid_clean'].isna().sum()} rows")

In [None]:
#show rows where grid=0 became NaN
print("Sample rows where grid=0 (now grid_clean=NaN):")
df.loc[df['grid'] == 0, ['driverName', 'name', 'year', 'grid', 'grid_clean', 'positionOrder', 'points']].head()

---

## 5. Race Features <a id='5-race-features'></a>

Create derived features for each race result:

| Feature | Formula | Description |
|---------|---------|-------------|
| `position_gain` | grid_clean - positionOrder | Positive = gained positions |
| `is_podium` | positionOrder <= 3 | Binary: finished top 3 |
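
The two formulas in the table can be sketched in a few lines of pandas. This is an illustration only, not the project's `add_race_features`. The function name `add_race_features_sketch` and the toy rows are hypothetical, and storing `is_podium` as 0/1 integers is an assumption.

```python
import numpy as np
import pandas as pd


def add_race_features_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with position_gain and is_podium added."""
    out = df.copy()
    # Positive values mean the driver finished ahead of the starting slot.
    out["position_gain"] = out["grid_clean"] - out["positionOrder"]
    # Store the flag as 0/1 so sum() gives a count and mean() gives a rate.
    out["is_podium"] = (out["positionOrder"] <= 3).astype(int)
    return out


# Hypothetical toy data: the last row has no usable grid position.
toy = pd.DataFrame({
    "grid_clean": [5.0, 1.0, np.nan],
    "positionOrder": [2, 4, 10],
})
features = add_race_features_sketch(toy)
print(features)
```

Rows with a missing `grid_clean` produce a missing `position_gain`, because the subtraction propagates `NaN`.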

In [None]:
#Add race features
df = add_race_features(df)

print("Race features added!")
print()
print("New columns: position_gain, is_podium")

In [None]:
#verify race features
print("RACE FEATURES VERIFICATION")
print("=" * 50)

df[['driverName', 'name', 'grid_clean', 'positionOrder', 'position_gain', 'points', 'is_podium']].head(10)

In [None]:
#position gain statistics
print("POSITION GAIN STATISTICS")
print("=" * 50)
print(f"Mean position gain:  {df['position_gain'].mean():.2f}")
print(f"Max positions gained: {df['position_gain'].max():.0f}")
print(f"Max positions lost:   {df['position_gain'].min():.0f}")
print()
print(f"Podium finishes: {df['is_podium'].sum()} ({df['is_podium'].mean()*100:.1f}% of race entries)")

---

## 6. Status & DNF Flags <a id='6-status'></a>

Create flags to identify race finish status:

| Flag | Description |
|------|-------------|
| `is_finished` | Driver completed the race |
| `is_dnf` | Did Not Finish (mechanical, crash, etc.) |
| `is_dns` | Did Not Start |
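
One way such flags could be derived from a status text column is sketched below. The notebook does not show how `add_dnf_dns_flags` classifies rows, so everything here is an assumption: the function name `add_flags_sketch`, the label set `DNS_LABELS`, and the rule that lapped classifications such as `+1 Lap` count as finished.

```python
import pandas as pd

# Assumed labels for "did not start"; the real status.csv may word these differently.
DNS_LABELS = {"Did not start", "Withdrew"}


def add_flags_sketch(df: pd.DataFrame, dns_labels=DNS_LABELS) -> pd.DataFrame:
    """Return a copy with mutually exclusive is_finished / is_dnf / is_dns flags."""
    out = df.copy()
    # Missing status text becomes "", which ends up classified as DNF below.
    status = out["status_text"].fillna("")
    # Treat "Finished" and lapped classifications such as "+1 Lap" as finished.
    finished = status.eq("Finished") | status.str.match(r"^\+\d+ Laps?$")
    dns = status.isin(dns_labels)
    out["is_finished"] = finished.astype(int)
    out["is_dns"] = dns.astype(int)
    # Everything that neither finished nor failed to start counts as a DNF.
    out["is_dnf"] = (~finished & ~dns).astype(int)
    return out


# Hypothetical toy data covering each category.
toy = pd.DataFrame({
    "status_text": ["Finished", "+1 Lap", "Engine", "Did not start", "+2 Laps"]
})
flags = add_flags_sketch(toy)
print(flags)
```

Defining `is_dnf` as the remainder guarantees that every row carries exactly one of the three flags, which makes the percentages in a status summary add up to 100.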

In [None]:
#attach status text if available
status_path = DATA_RAW / 'status.csv'

if status_path.exists() and 'statusId' in df.columns:
    status_df = pd.read_csv(status_path)
    df = attach_status_text(df, status_df)
    print(f"Status text attached from {status_path.name}")
else:
    print("Status file not found or statusId not in dataframe.")

#add DNF/DNS flags
df = add_dnf_dns_flags(df)
print("DNF/DNS flags added!")

In [None]:
# Verify status flags
print("STATUS FLAGS VERIFICATION")
print("=" * 50)

status_cols = [c for c in ['statusId', 'status_text', 'is_finished', 'is_dnf', 'is_dns'] if c in df.columns]
if status_cols:
    display(df[['driverName', 'name'] + status_cols].head(10))

In [None]:
#status summary
print("STATUS SUMMARY")
print("=" * 50)
print(f"Finished:       {df['is_finished'].sum():,} ({df['is_finished'].mean()*100:.1f}%)")
print(f"DNF:            {df['is_dnf'].sum():,} ({df['is_dnf'].mean()*100:.1f}%)")
print(f"DNS:            {df['is_dns'].sum():,} ({df['is_dns'].mean()*100:.1f}%)")

---

## 7. Historical Features <a id='7-historical'></a>

Engineer time-aware features using **only past data** to prevent data leakage.

### Driver Features

| Feature | Description |
|---------|-------------|
| `driver_races_past` | Number of races before current race |
| `driver_avg_points_past` | Average points in previous races |
| `driver_consistency_past` | Std dev of finish positions (lower = more consistent) |

### Constructor Features

| Feature | Description |
|---------|-------------|
| `constructor_races_past` | Number of races before current race |
| `constructor_strength_past` | Team's average points in previous races |
| `constructor_avg_finish_past` | Team's average finish position |

### Key Technique: `expanding().mean().shift(1)`

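Shifting the expanding mean by one row ensures each race sees only **past data**, never its own result. This holds only when rows are sorted chronologically and the shift is applied within each driver or constructor group, so that values never carry over from one group to the next.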
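
A minimal sketch of the idea on toy data follows. It is not the project's `add_time_aware_aggregates`. The data is hypothetical, and the use of `groupby(...).transform` is an assumption about how the per-group shift could be done.

```python
import pandas as pd

# Hypothetical toy data: two drivers, a few races each.
toy = pd.DataFrame({
    "driverId": [1, 1, 1, 2, 2],
    "date": pd.to_datetime(
        ["2018-03-25", "2018-04-08", "2018-04-15", "2018-03-25", "2018-04-08"]
    ),
    "points": [25.0, 18.0, 0.0, 10.0, 12.0],
})

# Sort chronologically so "past" really means earlier races.
toy = toy.sort_values(["driverId", "date"]).reset_index(drop=True)

grouped = toy.groupby("driverId")["points"]

# Number of earlier rows for the same driver (0 for a driver's first race).
toy["driver_races_past"] = grouped.cumcount()

# Shift inside each group, then take the expanding mean, so the current
# race never contributes and values never cross driver boundaries.
toy["driver_avg_points_past"] = grouped.transform(
    lambda s: s.shift(1).expanding().mean()
)
print(toy)
```

Each driver's first race has no history, so its average is `NaN`; those gaps are what the final cleaning step later fills. Within a single driver's series, `s.shift(1).expanding().mean()` and `s.expanding().mean().shift(1)` should give the same values; what matters is that the shift happens per group.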

In [None]:
#add time-aware historical features
print("Adding historical features...")
print("(This may take a moment)")

df = add_time_aware_aggregates(df)

print("\nHistorical features added!")
print("New columns:")
print("   - driver_races_past")
print("   - driver_avg_points_past")
print("   - driver_consistency_past")
print("   - constructor_races_past")
print("   - constructor_strength_past")
print("   - constructor_avg_finish_past")

In [None]:
#verify historical features
print("HISTORICAL FEATURES VERIFICATION")
print("=" * 50)

hist_cols = [
    'driverName', 'constructorName', 'date',
    'driver_races_past', 'driver_avg_points_past', 'driver_consistency_past',
    'constructor_races_past', 'constructor_strength_past', 'constructor_avg_finish_past'
]

df[hist_cols].head(12)

In [None]:
# Example: Max Verstappen's historical features over time
print("EXAMPLE: Max Verstappen's Historical Features (first 10 races in dataset)")
print("=" * 70)

verstappen = df[df['driverName'] == 'Max Verstappen'].sort_values('date').head(10)
verstappen[['date', 'name', 'points', 'driver_races_past', 'driver_avg_points_past', 'constructor_strength_past']]

---

## 8. Final Cleaning <a id='8-final-clean'></a>

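Fill the remaining missing values: historical averages default to 0 for drivers and teams with no prior races, and consistency features fall back to the median.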
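
The cell below prints the defaults that `final_clean` applies: 0 for the past-average features and the median for consistency features. A rough sketch of that kind of fill is shown here. The function name `final_clean_sketch`, the exact column list, and the toy data are assumptions, not the project's implementation.

```python
import numpy as np
import pandas as pd


def final_clean_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with historical-feature gaps filled by simple defaults."""
    out = df.copy()
    # No history yet: treat the prior average as zero points.
    for col in ["driver_avg_points_past", "constructor_strength_past"]:
        if col in out.columns:
            out[col] = out[col].fillna(0)
    # A std dev needs at least two past races; fall back to the column median.
    if "driver_consistency_past" in out.columns:
        median = out["driver_consistency_past"].median()
        out["driver_consistency_past"] = out["driver_consistency_past"].fillna(median)
    return out


# Hypothetical toy data: the first row is a driver's debut with no history.
toy = pd.DataFrame({
    "driver_avg_points_past": [np.nan, 10.0, 4.0],
    "constructor_strength_past": [np.nan, 8.0, 6.0],
    "driver_consistency_past": [np.nan, 2.0, 4.0],
})
filled = final_clean_sketch(toy)
print(filled)
```

A median computed over the whole dataset includes later races, so for strict leakage control it may be preferable to compute it on the training split only.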

In [None]:
# Apply final cleaning
df = final_clean(df)

print("Final cleaning applied!")
print()
print("Missing value handling:")
print("   - driver_avg_points_past: NaN -> 0 (new drivers)")
print("   - constructor_strength_past: NaN -> 0 (new teams)")
print("   - consistency features: NaN -> median")

In [None]:
#Check missing values AFTER cleaning
print("MISSING VALUES (After Cleaning)")
print("=" * 50)

# grid_clean keeps NaN by design, so exclude it from the check
missing_after = df.drop(columns=['grid_clean'], errors='ignore').isna().sum()
missing_after_df = missing_after[missing_after > 0]

if len(missing_after_df) > 0:
    print(missing_after_df)
else:
    print("No missing values remaining (grid_clean excluded by design).")

print(f"\ngrid_clean NaN count: {df['grid_clean'].isna().sum()} (expected - pit lane starts)")

---

## 9. Validation & Verification <a id='9-validation'></a>

Verify all transformations were applied correctly.
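
Alongside the visual checks in this section, the same expectations can be written as programmatic checks. The sketch below is illustrative only. The helper `validate_processed_sketch` is hypothetical, and it assumes the column names and formulas documented earlier in this notebook.

```python
import numpy as np
import pandas as pd


def validate_processed_sketch(df: pd.DataFrame) -> list[str]:
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    # grid_clean should never contain the placeholder value 0.
    if (df["grid_clean"] == 0).any():
        problems.append("grid_clean still contains 0")
    # is_podium must agree with the finishing order.
    expected_podium = (df["positionOrder"] <= 3).astype(int)
    if not (df["is_podium"].astype(int) == expected_podium).all():
        problems.append("is_podium disagrees with positionOrder")
    # position_gain must match its formula wherever grid_clean is known.
    known = df["grid_clean"].notna()
    diff = df.loc[known, "grid_clean"] - df.loc[known, "positionOrder"]
    if not np.allclose(df.loc[known, "position_gain"], diff):
        problems.append("position_gain disagrees with its formula")
    # Counts of past races cannot be negative.
    if (df["driver_races_past"] < 0).any():
        problems.append("driver_races_past has negative values")
    return problems


# Hypothetical toy data: one consistent frame and one with a wrong podium flag.
good = pd.DataFrame({
    "grid_clean": [1.0, np.nan, 5.0],
    "positionOrder": [2, 3, 4],
    "position_gain": [-1.0, np.nan, 1.0],
    "is_podium": [1, 1, 0],
    "driver_races_past": [0, 1, 2],
})
bad = good.assign(is_podium=[0, 1, 0])

print(validate_processed_sketch(good))
print(validate_processed_sketch(bad))
```

Returning a list of messages instead of raising on the first failure lets a single run report every failed check.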

In [None]:
# All columns in processed dataset
print("ALL COLUMNS IN PROCESSED DATASET")
print("=" * 50)
for i, col in enumerate(df.columns, 1):
    print(f"{i:2}. {col}")

In [None]:
#Final statistics
print("FINAL DATASET STATISTICS")
print("=" * 50)

numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
display(df[numeric_cols].describe().round(2).T)

In [None]:
# Visualize key engineered features
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# 1. Position gain distribution
axes[0, 0].hist(df['position_gain'].dropna(), bins=30, color='#3498db', edgecolor='black', alpha=0.7)
axes[0, 0].axvline(x=0, color='red', linestyle='--', linewidth=2)
axes[0, 0].set_title('Position Gain Distribution', fontsize=12, fontweight='bold')
axes[0, 0].set_xlabel('Position Gain (positive = gained positions)')
axes[0, 0].set_ylabel('Frequency')

# 2. Driver avg points past
axes[0, 1].hist(df['driver_avg_points_past'].dropna(), bins=30, color='#2ecc71', edgecolor='black', alpha=0.7)
axes[0, 1].set_title('Driver Historical Avg Points', fontsize=12, fontweight='bold')
axes[0, 1].set_xlabel('Average Points (past races)')
axes[0, 1].set_ylabel('Frequency')

# 3. Constructor strength
axes[1, 0].hist(df['constructor_strength_past'].dropna(), bins=30, color='#e74c3c', edgecolor='black', alpha=0.7)
axes[1, 0].set_title('Constructor Strength Distribution', fontsize=12, fontweight='bold')
axes[1, 0].set_xlabel('Team Avg Points (past races)')
axes[1, 0].set_ylabel('Frequency')

# 4. Driver consistency
axes[1, 1].hist(df['driver_consistency_past'].dropna(), bins=30, color='#9b59b6', edgecolor='black', alpha=0.7)
axes[1, 1].set_title('Driver Consistency Distribution', fontsize=12, fontweight='bold')
axes[1, 1].set_xlabel('Finish Position Std Dev (lower = more consistent)')
axes[1, 1].set_ylabel('Frequency')

plt.tight_layout()
plt.show()

---

## 10. Save & Summary <a id='10-save'></a>

In [None]:
#select final columns in order
final_cols = [
    # Identifiers
    'raceId', 'year', 'round', 'name', 'date',
    # Driver info
    'driverId', 'driverName', 'nationality',
    # Constructor info
    'constructorId', 'constructorName',
    # Original results
    'grid', 'grid_clean', 'positionOrder', 'points',
    # Derived features
    'position_gain', 'is_podium',
    # Status flags
    'is_finished', 'is_dnf', 'is_dns',
    # Historical features
    'driver_races_past', 'driver_avg_points_past', 'driver_consistency_past',
    'constructor_races_past', 'constructor_strength_past', 'constructor_avg_finish_past',
]

#keep only columns that exist
final_cols = [c for c in final_cols if c in df.columns]
df_final = df[final_cols].copy()

print("Final dataset prepared!")
print(f"Shape: {df_final.shape[0]:,} rows x {df_final.shape[1]} columns")

In [None]:
#Save to CSV
df_final.to_csv(FINAL_PATH, index=False)

print(f"Saved to: {FINAL_PATH}")
print(f"File size: {FINAL_PATH.stat().st_size / 1024:.1f} KB")

In [None]:
#Final summary
print("=" * 70)
print("NOTEBOOK 02 COMPLETE - DATA PREPROCESSING SUMMARY")
print("=" * 70)
print()
print("INPUT:")
print(f"   File: {BASE_PATH.name}")
print(f"   Rows: {len(pd.read_csv(BASE_PATH)):,}")
print()
print("TRANSFORMATIONS APPLIED:")
print("   1. Grid cleaning (grid=0 -> NaN)")
print("   2. Race features: position_gain, is_podium")
print("   3. Status flags: is_finished, is_dnf, is_dns")
print("   4. Driver historical: races_past, avg_points_past, consistency_past")
print("   5. Constructor historical: races_past, strength_past, avg_finish_past")
print("   6. Final cleaning: handle remaining NaN values")
print()
print("OUTPUT:")
print(f"   File: {FINAL_PATH.name}")
print(f"   Rows: {df_final.shape[0]:,}")
print(f"   Columns: {df_final.shape[1]}")
print()
print("NEW FEATURES CREATED:")
new_features = [
    'grid_clean', 'position_gain', 'is_podium',
    'is_finished', 'is_dnf', 'is_dns',
    'driver_races_past', 'driver_avg_points_past', 'driver_consistency_past',
    'constructor_races_past', 'constructor_strength_past', 'constructor_avg_finish_past'
]
for f in new_features:
    if f in df_final.columns:
        print(f"   - {f}")
print()
print("NEXT STEP:")
print("   -> Notebook 03: Exploratory Data Analysis (EDA)")
print()
print("=" * 70)