# Notebook 03: Exploratory Data Analysis (EDA)


Explore the processed dataset through visualizations to understand patterns, relationships, and insights that will inform our machine learning models.

---

### Research Questions

| # | Question | Visualization |
|---|----------|---------------|
| 1 | How are race points distributed? | Histogram |
| 2 | Have points changed over the years? | Line plot |
| 3 | How does qualifying position affect race results? | Scatter + Boxplot |
| 4 | Does constructor (team) strength predict performance? | Scatter + Bar |
| 5 | Which drivers and teams dominate? | Bar charts |
| 6 | What features correlate most with points? | Heatmap + Bar |
| 7 | Does track type affect qualifying importance? | Bar chart |

### Output

All visualizations are saved to: `reports/figures/`

---

### Table of Contents

1. [Setup & Imports](#1-setup)
2. [Load Data](#2-load)
3. [Data Overview](#3-overview)
4. [Points Distribution](#4-points-dist)
5. [Points Trend Over Time](#5-trend)
6. [Qualifying Position Analysis](#6-qualifying)
7. [Constructor & Driver Analysis](#7-constructor-driver)
8. [Correlation Analysis](#8-correlation)
9. [Track-Specific Analysis](#9-track)
10. [Key Findings & Conclusions](#10-conclusions)

---

## 1. Setup & Imports <a id='1-setup'></a>

In [None]:

import sys
from pathlib import Path
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display  # explicit import so display() also works outside Jupyter


PROJECT_ROOT = Path.cwd().resolve()
if not (PROJECT_ROOT / "src").exists() and (PROJECT_ROOT.parent / "src").exists():
    PROJECT_ROOT = PROJECT_ROOT.parent

sys.path.insert(0, str(PROJECT_ROOT))

# Import visualization functions
from src.visualization import (
    ensure_fig_dir,
    plot_points_distribution,
    plot_avg_points_by_year,
    plot_qualifying_vs_points_scatter,
    plot_points_by_grid_bucket_boxplot,
    plot_constructor_strength_vs_points,
    plot_driver_consistency_vs_points,
    plot_top_constructors_avg_points,
    plot_top_drivers_avg_points,
    plot_corr_heatmap_top_features,
    plot_corr_with_points_bar,
    plot_pole_win_rate_selected_gps,
)

# Create figures directory
fig_dir = ensure_fig_dir(PROJECT_ROOT)

print("Setup complete!")
print(f"Project root: {PROJECT_ROOT}")
print(f"Figures directory: {fig_dir}")

---

## 2. Load Data <a id='2-load'></a>

Load the processed dataset created in Notebook 02.

In [None]:
# Load processed data
DATA_PATH = PROJECT_ROOT / "data" / "processed" / "processed_f1_2018_2024.csv"
df = pd.read_csv(DATA_PATH, parse_dates=['date'])

print("Data loaded!")
print(f"Shape: {df.shape[0]:,} rows x {df.shape[1]} columns")
print(f"Years: {df['year'].min()} - {df['year'].max()}")

In [None]:
# Preview
df.head()

---

## 3. Data Overview <a id='3-overview'></a>

Quick summary statistics before diving into visualizations.

In [None]:
# Key statistics
print("DATASET OVERVIEW")
print("=" * 50)
print(f"Total race entries:    {len(df):,}")
print(f"Unique races:          {df['raceId'].nunique()}")
print(f"Unique drivers:        {df['driverId'].nunique()}")
print(f"Unique constructors:   {df['constructorId'].nunique()}")
print(f"Year range:            {df['year'].min()} - {df['year'].max()}")
print()
print("TARGET VARIABLE (points):")
print(f"   Mean:   {df['points'].mean():.2f}")
print(f"   Median: {df['points'].median():.2f}")
print(f"   Std:    {df['points'].std():.2f}")
print(f"   Range:  {df['points'].min():.0f} - {df['points'].max():.0f}")

In [None]:
# Points breakdown
print("POINTS BREAKDOWN")
print("=" * 50)
zero_points = (df['points'] == 0).sum()
podium_count = df['is_podium'].sum()
win_count = (df['positionOrder'] == 1).sum()

print(f"Zero points finishes:  {zero_points:,} ({zero_points/len(df)*100:.1f}%)")
print(f"Podium finishes (P1-3): {podium_count:,} ({podium_count/len(df)*100:.1f}%)")
print(f"Race wins (P1):         {win_count:,} ({win_count/len(df)*100:.1f}%)")

---

## 4. Points Distribution <a id='4-points-dist'></a>

**Question:** How are race points distributed across all driver-race entries?

In [None]:
# Plot points distribution
saved_path = plot_points_distribution(df, fig_dir)
print(f"Saved: {saved_path}")

### Interpretation: Points Distribution

**Key Observations:**
- The distribution is **heavily right-skewed** with a large spike at 0 points
- Only the **top 10 finishers** score race points in F1 (P1: 25, P2: 18, P3: 15, ... P10: 1)
- About half of all entries (~50%) finish outside the points in any given race
- The discrete bars correspond to the F1 points system values: 1, 2, 4, 6, 8, 10, 12, 15, 18, 25, and 26 (a win plus the fastest-lap bonus point)

**Implication for ML:** The target variable is not normally distributed, which may affect model choice.
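
The skew can be quantified rather than read off the histogram. The sketch below uses a small made-up `points` Series (one idealized 20-car race, not the project data) to show the two numbers worth checking: the share of zero-point entries and the sample skewness.

```python
import pandas as pd

# Toy data for illustration only: one 20-car race on the 25-18-15-... scale
points = pd.Series([25, 18, 15, 12, 10, 8, 6, 4, 2, 1] + [0] * 10, name="points")

zero_share = (points == 0).mean()  # fraction of entries that score nothing
skewness = points.skew()           # a positive value indicates a right skew

print(f"Zero-point share: {zero_share:.0%}")
print(f"Skewness: {skewness:.2f}")
```

On the real data, the same two lines can be run with `df['points']` in place of the toy Series.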

---

## 5. Points Trend Over Time <a id='5-trend'></a>

**Question:** Has the average points per race entry changed across the 2018-2024 seasons?

In [None]:
# Plot average points by year
saved_path = plot_avg_points_by_year(df, fig_dir)
print(f"Saved: {saved_path}")

In [None]:
# Yearly statistics
yearly_stats = df.groupby('year').agg({
    'points': ['mean', 'sum'],
    'raceId': 'nunique'
}).round(2)
yearly_stats.columns = ['Avg Points', 'Total Points', 'Races']
print("Points by Year:")
display(yearly_stats)

### Interpretation: Points Trend Over Time

**Key Observations:**
- Average points per race entry (one driver in one race) is relatively **stable** at around 5.0-5.2 points
- Small dip in 2020-2021 may be due to COVID-affected seasons (fewer races, different conditions)
- 2022-2024 show slight increase, possibly due to Red Bull dominance concentrating points

**Implication:** The race points scale remained consistent across these seasons, apart from the fastest-lap bonus point introduced in 2019, so year-over-year comparisons are valid.
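
The 5.0-5.2 range follows almost directly from the points scale. As a quick sanity check, assuming a full 20-car field and ignoring sprint points or shortened races, the arithmetic looks like this:

```python
# Race points for P1..P10, the scale in use throughout 2018-2024
RACE_POINTS = [25, 18, 15, 12, 10, 8, 6, 4, 2, 1]
GRID_SIZE = 20  # assumption: every race has a full 20-car field

per_race_total = sum(RACE_POINTS)                        # points awarded per race
avg_per_entry = per_race_total / GRID_SIZE               # mean over all entries
avg_with_fastest_lap = (per_race_total + 1) / GRID_SIZE  # with the bonus point (2019 onward)

print(per_race_total, avg_per_entry, avg_with_fastest_lap)
```

Because the total awarded per race is nearly fixed, the yearly average per entry can only move when the field size or the bonus rules change, not when one team dominates.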

---

## 6. Qualifying Position Analysis <a id='6-qualifying'></a>

**Question:** How strongly does qualifying position (grid) predict race results?

In [None]:
# Scatter plot: grid vs points
saved_path = plot_qualifying_vs_points_scatter(df, fig_dir)
print(f"Saved: {saved_path}")

In [None]:
# Boxplot by grid bucket
saved_path = plot_points_by_grid_bucket_boxplot(df, fig_dir)
print(f"Saved: {saved_path}")

In [None]:
# Statistics by grid position bucket
def grid_bucket(g):
    """Map a grid position to a coarse starting-position bucket."""
    if pd.isna(g):
        return 'Unknown'
    if g <= 3:
        return 'P1-3'
    if g <= 10:
        return 'P4-10'
    return 'P11-20'

df['grid_bucket'] = df['grid_clean'].apply(grid_bucket)

bucket_stats = df.groupby('grid_bucket').agg({
    'points': ['mean', 'median', 'std', 'count']
}).round(2)
bucket_stats.columns = ['Mean', 'Median', 'Std', 'Count']

print("Points by Grid Position Bucket:")
display(bucket_stats.loc[['P1-3', 'P4-10', 'P11-20']])

In [None]:
# Correlation between grid and points
grid_corr = df['grid_clean'].corr(df['points'])
print(f"Correlation between grid position and points: {grid_corr:.3f}")
print("(Negative because lower grid = better position = more points)")

### Interpretation: Qualifying Position Analysis

**Key Observations:**
- **Strong negative correlation (-0.66)** between grid position and points
- Top-three starters (P1-3): **median 18 points** (almost guaranteed podium/points)
- Midfield starters (P4-10): **median 4-6 points**
- Back of grid (P11-20): **median 0 points**

**Key Insight:** Qualifying is CRITICAL in F1. Starting P1-3 almost guarantees points, while starting P11+ makes scoring very difficult.
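
The bucketing used for these statistics can also be expressed with `pd.cut`, which avoids a row-by-row `apply`. This is a sketch on made-up grid values, not a drop-in replacement: unlike `grid_bucket`, it leaves missing values as `NaN` instead of labelling them `'Unknown'`, and a value outside the bins (such as 0) also becomes `NaN`.

```python
import numpy as np
import pandas as pd

# Toy grid positions, including one missing value
grid = pd.Series([1, 3, 4, 10, 11, 20, np.nan], name="grid_clean")

# Right-closed bins: (0, 3], (3, 10], (10, inf]
buckets = pd.cut(
    grid,
    bins=[0, 3, 10, np.inf],
    labels=["P1-3", "P4-10", "P11-20"],
)

print(buckets.tolist())
```

The result is an ordered categorical, so sorting follows grid order (P1-3, P4-10, P11-20) rather than alphabetical order.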



---

## 7. Constructor & Driver Analysis <a id='7-constructor-driver'></a>

**Question:** How do team strength and driver quality affect performance?

In [None]:
# Constructor strength vs points
saved_path = plot_constructor_strength_vs_points(df, fig_dir)
print(f"Saved: {saved_path}")

In [None]:
# Driver consistency vs points
saved_path = plot_driver_consistency_vs_points(df, fig_dir)
print(f"Saved: {saved_path}")

In [None]:
# Top 10 constructors by average points
saved_path = plot_top_constructors_avg_points(df, fig_dir, top_n=10)
print(f"Saved: {saved_path}")

In [None]:
# Top 15 drivers by average points
saved_path = plot_top_drivers_avg_points(df, fig_dir, top_n=15)
print(f"Saved: {saved_path}")

In [None]:
# Constructor statistics (one row per driver-race entry, so 'count' gives entries, not races)
constructor_stats = df.groupby('constructorName').agg({
    'points': ['mean', 'sum'],
    'is_podium': 'sum',
    'raceId': 'count'
}).round(2)
constructor_stats.columns = ['Avg Points', 'Total Points', 'Podiums', 'Entries']
constructor_stats = constructor_stats.sort_values('Avg Points', ascending=False)

print("Top 10 Constructors (2018-2024):")
display(constructor_stats.head(10))

In [None]:
# Constructor strength correlation
const_corr = df['constructor_strength_past'].corr(df['points'])
print(f"Correlation between constructor strength and points: {const_corr:.3f}")

### Interpretation: Constructor & Driver Analysis

**Key Observations:**

**Constructors:**
- **Mercedes, Red Bull, Ferrari** form a clear top tier (10-15 avg points)
- **McLaren** moved into this tier in 2023-2024
- Midfield teams (Alpine, Aston Martin): 2-4 avg points
- Back markers (Williams, Haas): <2 avg points
- **Constructor strength correlation: 0.65** - Team matters almost as much as qualifying!

**Drivers:**
- **Max Verstappen** leads with ~17 avg points (Red Bull dominance 2022-2024)
- **Lewis Hamilton** second with ~15 avg points
- Clear gap between top 4-5 drivers and the rest

**Key Insight:** The car (constructor) is nearly as important as qualifying position. A driver in a top-3 team has a massive advantage regardless of qualifying.
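
The `_past` suffix on `constructor_strength_past` suggests the feature only uses races before the one being predicted. How Notebook 02 actually builds it is not shown here. The following is a minimal sketch of one leakage-free construction, an expanding mean shifted by one race, on made-up data with hypothetical column names (`constructor`, `race_no`, `strength_past`).

```python
import pandas as pd

# Toy data: one row per constructor per race (hypothetical columns and values)
team_races = pd.DataFrame({
    "constructor": ["A", "B", "A", "B", "A", "B"],
    "race_no":     [1, 1, 2, 2, 3, 3],
    "points":      [10, 0, 20, 4, 30, 2],
})
team_races = team_races.sort_values(["constructor", "race_no"])

# shift(1) drops the current race, so each row only sees earlier results
team_races["strength_past"] = (
    team_races.groupby("constructor")["points"]
    .transform(lambda s: s.shift(1).expanding().mean())
)

print(team_races)
```

The first race of each constructor has no history and therefore gets `NaN`, which a real pipeline would need to impute or drop.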

---

## 8. Correlation Analysis <a id='8-correlation'></a>

**Question:** Which features correlate most strongly with race points?

In [None]:
# Correlation heatmap
saved_path = plot_corr_heatmap_top_features(df, fig_dir, top_k=9)
print(f"Saved: {saved_path}")

In [None]:
# Correlation bar chart
saved_path = plot_corr_with_points_bar(df, fig_dir)
print(f"Saved: {saved_path}")

In [None]:
# Correlation table
numeric_cols = df.select_dtypes(include=[np.number]).columns
correlations = df[numeric_cols].corr()['points'].drop('points').sort_values(key=abs, ascending=False)

print("CORRELATION WITH POINTS")
print("=" * 50)
for col, corr in correlations.head(12).items():
    direction = "positive" if corr > 0 else "negative"
    strength = "Strong" if abs(corr) > 0.5 else "Moderate" if abs(corr) > 0.3 else "Weak"
    print(f"{col:30} {corr:+.3f}  ({strength} {direction})")

### Interpretation: Correlation Analysis

**Top Correlations with Points:**

| Feature | Correlation | Interpretation |
|---------|-------------|----------------|
| `positionOrder` | -0.85 | Expected: position determines points |
| `is_podium` | +0.84 | Top 3 = high points |
| `grid_clean` | -0.66 | **Qualifying is crucial** |
| `constructor_strength_past` | +0.65 | **Team matters a lot** |
| `driver_avg_points_past` | +0.62 | Good drivers score more |
| `position_gain` | +0.35 | Gaining positions helps |

**Key Insight:** `positionOrder` and `is_podium` are only known after the race and directly determine points, so they cannot serve as predictors. Setting them aside, the two strongest predictors are:
1. **Qualifying position (-0.66)**
2. **Constructor strength (+0.65)**
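
One way to keep post-race columns out of the ranking is to drop them before computing correlations. The sketch below does this on a toy frame with made-up values. The column names follow the table above, and the `POST_RACE` list is an assumption about which columns are unavailable before the race.

```python
import pandas as pd

# Toy frame with made-up values; column names follow the correlation table
toy = pd.DataFrame({
    "points":        [25, 18, 0, 0, 10, 1],
    "positionOrder": [1, 2, 15, 18, 5, 10],
    "is_podium":     [1, 1, 0, 0, 0, 0],
    "grid_clean":    [1, 3, 12, 17, 6, 9],
})

# Assumption: these columns are only known once the race has finished
POST_RACE = ["positionOrder", "is_podium"]

pre_race = toy.drop(columns=POST_RACE + ["points"])
corr_pre_race = pre_race.corrwith(toy["points"]).sort_values(key=abs, ascending=False)

print(corr_pre_race)
```

Keeping the exclusion list in one named constant makes it easy to reuse when selecting model features later.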



---

## 9. Track-Specific Analysis <a id='9-track'></a>

**Question:** Does track type affect how important qualifying is?

We compare three circuits with different characteristics:
- **Monaco**: Street circuit, very hard to overtake
- **Spa (Belgium)**: Long circuit with overtaking opportunities
- **Monza (Italy)**: High-speed with slipstreaming, easier to overtake

In [None]:
# Pole-to-win conversion rate by circuit
saved_path = plot_pole_win_rate_selected_gps(df, fig_dir)
print(f"Saved: {saved_path}")

In [None]:
# Calculate pole-to-win stats
tracks = ['Monaco Grand Prix', 'Italian Grand Prix', 'Belgian Grand Prix']

print("POLE-TO-WIN CONVERSION RATE")
print("=" * 50)

for track in tracks:
    track_df = df[df['name'] == track]
    pole_starts = track_df[track_df['grid_clean'] == 1]
    pole_wins = pole_starts[pole_starts['positionOrder'] == 1]
    
    total_poles = len(pole_starts)
    wins_from_pole = len(pole_wins)
    rate = wins_from_pole / total_poles * 100 if total_poles > 0 else 0
    
    print(f"{track:25} {wins_from_pole}/{total_poles} ({rate:.1f}%)")

### Interpretation: Track-Specific Analysis

**Key Observations:**

| Circuit | Pole-to-Win | Track Type | Overtaking |
|---------|-------------|------------|------------|
| Monaco | ~67% | Street | Very Hard |
| Spa | ~43% | Traditional | Moderate |
| Monza | ~14% | High-speed | Easy (slipstream) |

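**Key Insight:** Track characteristics appear to affect qualifying importance substantially, though each rate rests on at most seven races per circuit (2018-2024), so the percentages are indicative only.
- At **Monaco**, the pole sitter won about two out of three races (67%)
- At **Monza**, the pole sitter won only 14% of the time, plausibly because slipstreaming makes overtaking easier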

**Implication:** Teams should prioritize qualifying differently based on circuit type. A model could potentially include circuit features to improve predictions.
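
Because the sample per circuit is small, it helps to show the raw counts next to each rate. The pole-to-win calculation can be written as a single `groupby` that returns both. This is a sketch on made-up outcomes, reusing the `name`, `grid_clean`, and `positionOrder` column names from this notebook.

```python
import pandas as pd

# Toy results: one row per pole sitter, with made-up finishing positions
poles = pd.DataFrame({
    "name": ["Monaco Grand Prix"] * 3 + ["Italian Grand Prix"] * 4,
    "grid_clean":    [1, 1, 1, 1, 1, 1, 1],
    "positionOrder": [1, 1, 4, 2, 1, 3, 5],
})

summary = (
    poles[poles["grid_clean"] == 1]
    .assign(won=lambda d: d["positionOrder"] == 1)  # True when pole became a win
    .groupby("name")["won"]
    .agg(wins="sum", poles="count", rate="mean")
)

print(summary)
```

Reporting `wins` and `poles` alongside `rate` makes it obvious when a percentage is based on only a handful of races.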

---

## 10. Key Findings & Conclusions <a id='10-conclusions'></a>

In [None]:
# List all generated figures
print("GENERATED VISUALIZATIONS")
print("=" * 50)
eda_figures = sorted(fig_dir.glob("eda_*.png"))
for i, fig_file in enumerate(eda_figures, 1):
    print(f"{i:2}. {fig_file.name}")

print(f"\nTotal: {len(eda_figures)} EDA figures saved to {fig_dir}")

### Summary of Key Findings

#### 1. Points Distribution
- Heavily right-skewed: ~50% of race entries score 0 points
- Only top 10 finishers receive points

#### 2. Qualifying is Critical
- **Correlation: -0.66** (one of the strongest predictors)
- P1-3 starters: median 18 points
- P11-20 starters: median 0 points

#### 3. Constructor (Team) Matters Almost as Much
- **Correlation: +0.65**
- Mercedes, Red Bull, Ferrari dominate with 10-15 avg points
- Back markers average <2 points regardless of driver skill

#### 4. Clear Driver Hierarchy
- Verstappen leads with ~17 avg points (Red Bull dominance)
- Hamilton second with ~15 avg points
- Top 4-5 drivers clearly separate from the field

#### 5. Track Type Affects Strategy
- Monaco: 67% pole-to-win (qualifying crucial)
- Monza: 14% pole-to-win (race pace matters more)

---

### Features for Machine Learning

Based on EDA, the most predictive features for our ML model should be:

| Rank | Feature | Correlation | Reason |
|------|---------|-------------|--------|
| 1 | `grid_clean` | -0.66 | Qualifying performance |
| 2 | `constructor_strength_past` | +0.65 | Team quality |
| 3 | `driver_avg_points_past` | +0.62 | Driver quality |
| 4 | `driver_consistency_past` | varies | Driver reliability |
| 5 | `constructorName` | categorical | Team-specific effects |
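
Since `constructorName` is categorical, it has to be encoded before most models can use it. The following is a minimal sketch of assembling a feature matrix with one-hot encoding via `pd.get_dummies`, on made-up rows. The `team` prefix and the choice of one-hot encoding are assumptions, not necessarily what Notebook 04 does.

```python
import pandas as pd

# Toy rows with made-up values; column names follow the feature table
toy = pd.DataFrame({
    "grid_clean": [1, 8, 15],
    "constructor_strength_past": [14.2, 3.1, 0.8],
    "constructorName": ["Red Bull", "Alpine", "Haas F1 Team"],
    "points": [25, 4, 0],
})

numeric_features = ["grid_clean", "constructor_strength_past"]

# One-hot encode the team name; numeric columns pass through unchanged
X = pd.get_dummies(
    toy[numeric_features + ["constructorName"]],
    columns=["constructorName"],
    prefix="team",
)
y = toy["points"]

print(X.columns.tolist())
```

Tree-based models could alternatively use an integer or target encoding, which keeps the matrix narrower when many teams are present.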

---

### Next Step

**Notebook 04: Machine Learning** - Use these insights to build predictive models for race points.

In [None]:
# Final summary
print("=" * 70)
print("NOTEBOOK 03 COMPLETE - EDA SUMMARY")
print("=" * 70)
print()
print("VISUALIZATIONS CREATED: 11")
print("   - 1 Histogram (points distribution)")
print("   - 1 Line plot (points trend)")
print("   - 3 Scatter plots (grid, constructor, consistency vs points)")
print("   - 1 Boxplot (points by grid bucket)")
print("   - 3 Bar charts (top constructors, drivers, pole win rate)")
print("   - 2 Correlation plots (heatmap, bar chart)")
print()
print("KEY INSIGHTS:")
print("   1. Qualifying position is crucial (corr: -0.66)")
print("   2. Constructor strength nearly as important (corr: +0.65)")
print("   3. Track type affects qualifying importance (Monaco 67% vs Monza 14%)")
print("   4. Clear team hierarchy: Mercedes/Red Bull/Ferrari >> rest")
print()
print("OUTPUT:")
print(f"   Figures saved to: {fig_dir}")
print()
print("=" * 70)