# 02 - Exploratory Data Analysis (EDA)

> **Objective:** To explore the cleaned public transit delay dataset through visualizations and summary statistics, identify patterns and relationships, and summarize key findings to guide further analysis or modeling.

This notebook has three stages:
1. [**Load processed data**](#load-processed-data): import the cleaned dataset  
2. [**Exploratory data analysis**](#exploratory-data-analysis): visualizations and insights  
3. [**Key findings**](#key-findings): summary of main takeaways  

> **Note:** Run `01_data_cleaning.ipynb` first to generate `data/processed/transit_delays_cleaned.csv`.

---
### 🧠 Project Context

EDA helps us understand delay distributions, temporal patterns, and relationships between variables. All plots are well-labeled and accompanied by short written insights to keep the narrative clear and portfolio-ready.

---
### 🧰 Imports <a id="imports"></a>

- **pandas**: data loading and manipulation  
- **numpy**: numerical utilities  
- **matplotlib.pyplot**: plotting  
- **seaborn**: statistical visualizations and styling  
- **pathlib.Path**: file paths for the processed dataset  

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

# set_theme is seaborn's preferred name for this call; sns.set is an alias of it
sns.set_theme(style="whitegrid")

---
### 📥 Load Processed Data <a id="load-processed-data"></a>

Load the cleaned dataset produced by `01_data_cleaning.ipynb`.

In [None]:
# df = pd.read_csv(Path("../data/processed/transit_delays_cleaned.csv"))
# df.head()
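
Before plotting, it can help to confirm that the loaded frame has the columns the later cells rely on. The sketch below is a minimal check, not part of the cleaning pipeline: the column names are taken from the plotting cells in this notebook and may differ in the real file, and `missing_columns` and the sample frame are hypothetical helpers for illustration.

```python
import pandas as pd

# Column names used by the plotting cells in this notebook.
# Assumption: the real cleaned file may name these differently.
EXPECTED_COLUMNS = ["delay_minutes", "hour", "day_of_week", "route"]


def missing_columns(df, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from df."""
    return [col for col in expected if col not in df.columns]


# Tiny made-up frame standing in for the cleaned dataset.
sample = pd.DataFrame(
    {"delay_minutes": [0.0, 3.5], "hour": [8, 17], "route": ["A", "B"]}
)

missing = missing_columns(sample)
print("Missing columns:", missing)
```

An empty list means every column the plots need is present; otherwise the names point at what to rename or derive first.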

---
### 📊 Exploratory Data Analysis <a id="exploratory-data-analysis"></a>

The following sections contain well-labeled plots with short written insights under each. Replace placeholders with actual visualizations and findings once the dataset is loaded and analyzed.

#### Plot 1 - Delay distribution (e.g. histogram or KDE)

*(Describe what this plot shows: distribution of delay minutes, skewness, typical range.)*

In [None]:
# fig, ax = plt.subplots(figsize=(8, 4))
# sns.histplot(df['delay_minutes'], kde=True, ax=ax)
# ax.set_title('Distribution of Delay (minutes)')
# ax.set_xlabel('Delay (minutes)')
# plt.tight_layout()
# plt.show()

**Insight:** *(e.g. Delays are right-skewed; most trips are on time or slightly delayed, with a long tail of severe delays.)*
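
A few summary statistics can back up a skewness claim with numbers. The sketch below uses a made-up series of delays, not the real dataset, and `summarize_delays` is a hypothetical helper; a mean well above the median and a positive skew value are the usual signs of a right-skewed distribution.

```python
import pandas as pd


def summarize_delays(delays: pd.Series) -> dict:
    """Collect a few statistics that describe the shape of the distribution."""
    return {
        "median": delays.median(),
        "mean": delays.mean(),
        "p90": delays.quantile(0.90),
        "skew": delays.skew(),  # positive values indicate a right tail
    }


# Made-up delays in minutes: mostly small, with one severe delay.
sample_delays = pd.Series([0, 0, 1, 1, 2, 2, 3, 4, 6, 45], dtype=float)

summary = summarize_delays(sample_delays)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```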

#### Plot 2 - Delays by time of day (e.g. line or bar)

*(Describe: how delay varies by hour or time period.)*

In [None]:
# e.g. hourly aggregation and line/bar plot
# hourly = df.groupby('hour')['delay_minutes'].mean()
# hourly.plot(kind='bar', title='Average delay by hour', xlabel='Hour', ylabel='Average delay (minutes)')

**Insight:** *(e.g. Peak hours show higher average delays; morning and evening rush align with worse performance.)*
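
If the cleaned data stores a timestamp rather than an `hour` column, the hour can be derived and grouped into periods before aggregating. This is a sketch under assumptions: the `scheduled_time` column, the sample rows, and the peak-hour boundaries (07:00-10:00 and 16:00-19:00) are all hypothetical and should be adjusted to the actual dataset and service pattern.

```python
import pandas as pd

# Made-up trips; 'scheduled_time' is an assumed column name.
trips = pd.DataFrame(
    {
        "scheduled_time": pd.to_datetime(
            [
                "2024-01-08 07:45",
                "2024-01-08 08:10",
                "2024-01-08 13:00",
                "2024-01-08 17:30",
                "2024-01-08 22:15",
            ]
        ),
        "delay_minutes": [6.0, 8.0, 1.0, 7.0, 0.0],
    }
)

# Extract the hour of day from the timestamp.
trips["hour"] = trips["scheduled_time"].dt.hour


def label_period(hour: int) -> str:
    """Map an hour to a coarse period; the boundaries are illustrative."""
    if 7 <= hour < 10:
        return "am_peak"
    if 16 <= hour < 19:
        return "pm_peak"
    return "off_peak"


trips["period"] = trips["hour"].map(label_period)
period_means = trips.groupby("period")["delay_minutes"].mean()
print(period_means)
```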

#### Plot 3 - Delays by day of week

*(Describe: weekday vs weekend or variation across days.)*

In [None]:
# e.g. boxplot or bar by day_of_week
# sns.boxplot(data=df, x='day_of_week', y='delay_minutes')
# plt.title('Delay by day of week')

**Insight:** *(e.g. Weekdays show higher median delay than weekends; Monday and Friday may show distinct patterns.)*
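
Day names sort alphabetically by default, which puts Friday before Monday on the axis. One way to keep calendar order, assuming `day_of_week` holds full English day names (an assumption about the cleaned data), is an ordered categorical. The sample rows below are made up.

```python
import pandas as pd

DAY_ORDER = [
    "Monday", "Tuesday", "Wednesday", "Thursday",
    "Friday", "Saturday", "Sunday",
]

# Made-up trips standing in for the cleaned dataset.
trips = pd.DataFrame(
    {
        "day_of_week": ["Sunday", "Monday", "Friday", "Monday", "Saturday"],
        "delay_minutes": [1.0, 5.0, 6.0, 3.0, 2.0],
    }
)

# An ordered categorical makes groupby and plots follow calendar order.
trips["day_of_week"] = pd.Categorical(
    trips["day_of_week"], categories=DAY_ORDER, ordered=True
)

# observed=True keeps only the days that actually occur in the data.
median_by_day = trips.groupby("day_of_week", observed=True)["delay_minutes"].median()
print(median_by_day)
```

The same ordering can be passed to seaborn through the `order=DAY_ORDER` argument of `sns.boxplot` if the column is left as plain strings.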

#### Plot 4 - Delays by route or line (e.g. top N routes)

*(Describe: which routes or lines have the highest delays or most variability.)*

In [None]:
# e.g. top 10 routes by mean delay
# top_routes = df.groupby('route')['delay_minutes'].mean().nlargest(10)
# top_routes.plot(kind='barh', title='Top 10 routes by average delay')

**Insight:** *(e.g. A few routes concentrate high average delays; these may be candidates for operational focus.)*
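
A ranking by mean delay can be dominated by routes with very few trips. The sketch below filters on a minimum trip count before ranking; `top_routes_by_mean_delay`, the `min_trips` threshold, and the sample rows are hypothetical choices for illustration, not values from the dataset.

```python
import pandas as pd


def top_routes_by_mean_delay(df, n=10, min_trips=2):
    """Rank routes by mean delay, ignoring routes with fewer than min_trips trips."""
    stats = df.groupby("route")["delay_minutes"].agg(["mean", "count"])
    stats = stats[stats["count"] >= min_trips]
    return stats.sort_values("mean", ascending=False).head(n)


# Made-up trips: route C has one extreme trip and nothing else.
trips = pd.DataFrame(
    {
        "route": ["A", "A", "A", "B", "B", "C"],
        "delay_minutes": [2.0, 4.0, 6.0, 1.0, 1.0, 30.0],
    }
)

ranked = top_routes_by_mean_delay(trips, n=2, min_trips=2)
print(ranked)
```

Here route C is excluded despite its 30-minute delay, because a single trip says little about typical performance.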

#### Plot 5 - Correlation heatmap (numeric features)

*(Describe: correlations between delay and other numeric variables.)*

In [None]:
# numeric = df.select_dtypes(include=[np.number])
# sns.heatmap(numeric.corr(), annot=True, fmt='.2f', cmap='coolwarm', center=0)
# plt.title('Correlation matrix')
# plt.tight_layout()
# plt.show()

**Insight:** *(e.g. Delay correlates weakly with [X]; stronger relationships can guide feature selection for modeling.)*
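
Alongside the heatmap, a ranked list of correlations with the delay column is often easier to read. In the sketch below, `delay_correlations` is a hypothetical helper, and the `passengers` and `temperature` columns are invented for illustration; the real dataset may have different numeric features.

```python
import numpy as np
import pandas as pd


def delay_correlations(df, target="delay_minutes"):
    """Pearson correlation of each numeric column with target, strongest first."""
    numeric = df.select_dtypes(include=[np.number])
    corr = numeric.corr()[target].drop(target)
    # Order by absolute strength but keep the sign in the result.
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Made-up data; 'passengers' and 'temperature' are assumed column names.
trips = pd.DataFrame(
    {
        "delay_minutes": [1.0, 2.0, 3.0, 4.0, 5.0],
        "passengers": [10, 20, 30, 40, 50],
        "temperature": [5.0, 3.0, 6.0, 2.0, 4.0],
        "route": ["A", "B", "A", "B", "A"],  # non-numeric, ignored
    }
)

ranked_corr = delay_correlations(trips)
print(ranked_corr)
```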

#### Plot 6 - On-time vs delayed share (e.g. pie or bar)

*(Describe: proportion of trips on time vs delayed, or delay severity breakdown.)*

In [None]:
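# e.g. df['on_time'] = df['delay_minutes'] <= 0
# value_counts() orders by frequency, so hard-coded labels can end up on the wrong wedge;
# map the booleans to labels first and let the plot take its labels from the index
# df['on_time'].map({True: 'On time', False: 'Delayed'}).value_counts().plot(kind='pie', autopct='%1.1f%%')
# plt.title('Share of trips: on time vs delayed')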

**Insight:** *(e.g. Roughly X% of trips are on time; the remainder are delayed to varying degrees.)*
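
For a severity breakdown rather than a two-way split, delays can be binned with `pd.cut`. The thresholds below (0, 5, and 15 minutes) and the sample values are hypothetical; an agency's own on-time definition should replace them if one exists.

```python
import pandas as pd

# Illustrative bin edges in minutes; intervals are closed on the right.
BINS = [-float("inf"), 0, 5, 15, float("inf")]
LABELS = ["on time", "minor (0-5]", "moderate (5-15]", "severe (>15)"]

# Made-up delays standing in for df['delay_minutes'].
delays = pd.Series([-1.0, 0.0, 2.0, 5.0, 9.0, 20.0, 0.0, 3.0])

severity = pd.cut(delays, bins=BINS, labels=LABELS)

# Share of trips per category, shown in severity order rather than by frequency.
share = severity.value_counts(normalize=True).reindex(LABELS)
print(share)
```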

---
### 🎯 Key Findings <a id="key-findings"></a>

A concise summary of the main takeaways from this exploratory analysis:

1. **Delay distribution:** *(e.g. Most delays are small; distribution is right-skewed with a long tail.)*  
2. **Temporal patterns:** *(e.g. Peak hours and weekdays show higher delays.)*  
3. **Route/location:** *(e.g. Certain routes or lines consistently show worse performance.)*  
4. **Relationships:** *(e.g. Key correlations or lack thereof with delay.)*  
5. **On-time performance:** *(e.g. Overall on-time rate and implications.)*  

*(Replace with concrete findings once the analysis is run.)*