# 01 - Data Cleaning & Preparation

> **Objective:** To load the raw public transit delay dataset, assess data quality, perform cleaning and feature engineering, and save a processed dataset for downstream exploratory analysis and modeling.

This notebook outlines the following stages:
1. [**Dataset overview**](#dataset-overview): loading raw data and inspecting structure  
2. [**Missing values analysis**](#missing-values-analysis): assessing completeness and handling nulls  
3. [**Data cleaning steps**](#data-cleaning-steps): addressing inconsistencies, types, and outliers  
4. [**Feature engineering**](#feature-engineering): creating derived features for analysis  
5. [**Save cleaned dataset**](#save-cleaned-dataset): exporting to `data/processed/`  

> **Note:** Section links work in Jupyter or nbviewer; they may not render in static GitHub previews.

---
### 🧠 Project Context

This notebook is the first step in the **Public Transit Delay EDA** project. Clean, well-structured data is essential for reliable exploratory analysis and any subsequent modeling. All transformations applied here are documented so that the pipeline is reproducible.

---
### 🧰 Imports <a id="imports"></a>

Core libraries for data loading, manipulation, and cleaning:

- **pandas**: data loading, tabular manipulation, and export  
- **numpy**: numerical operations where needed  
- **pathlib**: path handling for reading and writing files  

In [None]:
import pandas as pd
import numpy as np
from pathlib import Path

---
### 📥 Dataset Overview <a id="dataset-overview"></a>

Load the raw dataset from `data/raw/` and inspect its structure: shape, column names, dtypes, and a sample of rows.  
This confirms that the import completed successfully and provides a first look at the variables available for analysis.

In [None]:
raw_path = Path("../data/raw/public_transport_delays.csv")
df = pd.read_csv(raw_path)
print("Shape:", df.shape)
df.head()

| Column | Description |
|--------|-------------|
| `trip_id` | Unique trip identifier |
| `date` | Trip date |
| `time` | Trip start time |
| `transport_type` | Bus, Tram, Metro, or Train |
| `route_id` | Route identifier (e.g. Route_1, Route_2) |
| `origin_station`, `destination_station` | Start and end station IDs |
| `scheduled_departure`, `scheduled_arrival` | Planned departure/arrival times |
| `actual_departure_delay_min`, `actual_arrival_delay_min` | Delay in minutes (negative = early) |
| `weather_condition` | Clear, Rain, Snow, Storm, Fog, Cloudy |
| `temperature_C`, `humidity_percent`, `wind_speed_kmh`, `precipitation_mm` | Weather variables |
| `event_type` | None, Sports, Concert, Parade, Protest, Festival |
| `event_attendance_est` | Estimated event attendance |
| `traffic_congestion_index` | Congestion level (0–100) |
| `holiday` | 1 if holiday, 0 otherwise |
| `peak_hour` | 1 if peak, 0 otherwise |
| `weekday` | Day of week (0–6) in raw data |
| `season` | Winter, Spring, Summer, Autumn |
| `delayed` | 1 if trip was delayed (arrival delay > 0), 0 otherwise |
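
As a quick sanity check, the loaded columns can be compared against the table above. The following is a minimal sketch, assuming the column names in the file match the table exactly. `check_columns` is a hypothetical helper, and the small empty frame stands in for the real `df`.

```python
import pandas as pd

# Column names copied from the table above
EXPECTED_COLUMNS = [
    "trip_id", "date", "time", "transport_type", "route_id",
    "origin_station", "destination_station",
    "scheduled_departure", "scheduled_arrival",
    "actual_departure_delay_min", "actual_arrival_delay_min",
    "weather_condition", "temperature_C", "humidity_percent",
    "wind_speed_kmh", "precipitation_mm",
    "event_type", "event_attendance_est", "traffic_congestion_index",
    "holiday", "peak_hour", "weekday", "season", "delayed",
]


def check_columns(frame, expected):
    """Return (missing, unexpected) column names relative to `expected` (hypothetical helper)."""
    expected_set = set(expected)
    actual_set = set(frame.columns)
    missing = [c for c in expected if c not in actual_set]
    unexpected = [c for c in frame.columns if c not in expected_set]
    return missing, unexpected


# Stand-in frame with one column removed and one extra column, for illustration only
demo = pd.DataFrame(columns=[c for c in EXPECTED_COLUMNS if c != "season"] + ["extra_col"])
missing, unexpected = check_columns(demo, EXPECTED_COLUMNS)
print("Missing:", missing)
print("Unexpected:", unexpected)
```

On the real data, both lists should come back empty if the file matches the documented schema.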

---
### 🧾 Missing Values Analysis <a id="missing-values-analysis"></a>

Summarize the dataset structure with `df.info()` and count nulls per column.  
Identifying missing values is essential before cleaning so that imputation or removal strategies can be applied consistently.

In [None]:
df.info()

In [None]:
df.isnull().sum()

#### 🔎 *Summary*

**Only `event_type` has missing values** (1,173 of 2,000 rows). No event was recorded for those trips. We will **fill these with the string `"None"`** so that EDA and modeling can treat "no event" as a distinct category. All other columns are complete.
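
Expressing null counts as a share of rows makes them easier to compare across columns (1,173 of 2,000 is 58.65%). The following is a minimal sketch on toy data. `null_summary` is a hypothetical helper and is not part of the notebook.

```python
import numpy as np
import pandas as pd


def null_summary(frame):
    """Count and percentage of nulls, for columns that have any (hypothetical helper)."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "n_missing": counts,
        "pct_missing": (counts / len(frame) * 100).round(2),
    })
    return summary[summary["n_missing"] > 0].sort_values("n_missing", ascending=False)


# Toy data standing in for the real dataset
toy = pd.DataFrame({
    "trip_id": ["T1", "T2", "T3", "T4"],
    "event_type": ["Concert", np.nan, np.nan, np.nan],
})
print(null_summary(toy))

# Filling with the string "None" keeps "no event" as its own category
toy["event_type"] = toy["event_type"].fillna("None")
print(toy["event_type"].value_counts())
```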

---
### 🧹 Data Cleaning Steps <a id="data-cleaning-steps"></a>

The cell below applies the following cleaning steps:
- Parse `date` and combine it with `time` into a single `datetime` column  
- Fill missing `event_type` values with the string `"None"`  
- Drop duplicate rows by `trip_id`, keeping the first occurrence  
- Cast the two delay columns to integer  

No outlier treatment is applied at this stage.
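
The cleaning cell in this notebook does not act on outliers. One possible way to flag them without removing rows is the 1.5 × IQR rule, sketched below on toy delays. The helper name `flag_iqr_outliers` and the threshold `k=1.5` are assumptions for illustration, not choices made by the notebook.

```python
import pandas as pd


def flag_iqr_outliers(values, k=1.5):
    """Boolean mask for values outside [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper)."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Toy arrival delays in minutes; 120 is an obvious outlier
delays = pd.Series([-2, 0, 1, 3, 4, 5, 7, 9, 120], name="actual_arrival_delay_min")
is_outlier = flag_iqr_outliers(delays)
print(delays[is_outlier])
```

Flagging keeps the decision about dropping or capping extreme delays visible to the downstream EDA instead of baking it into the cleaned file.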

In [None]:
# Parse date and build datetime for time-based features
df["date"] = pd.to_datetime(df["date"], errors="coerce")
df["datetime"] = pd.to_datetime(df["date"].astype(str) + " " + df["time"], errors="coerce")

# Fill missing event_type with "None" (no event)
df["event_type"] = df["event_type"].fillna("None")

# Drop duplicate rows if any (by trip_id)
n_before = len(df)
df = df.drop_duplicates(subset=["trip_id"], keep="first")
print(f"Dropped {n_before - len(df)} duplicate trip(s). Rows: {len(df)}")

# Cast delay columns to int (a no-op if they were already read as integers)
df[["actual_departure_delay_min", "actual_arrival_delay_min"]] = df[
    ["actual_departure_delay_min", "actual_arrival_delay_min"]
].astype(int)
df.head(3)
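
Because `errors="coerce"` turns unparseable values into `NaT` without raising, it is worth counting them after parsing. The following is a minimal sketch on toy data, assuming `date` is stored as `YYYY-MM-DD` and `time` as `HH:MM:SS` strings. The raw format is not shown in this notebook, so both are assumptions.

```python
import pandas as pd

# Toy rows: one bad date and one bad time, for illustration only
toy = pd.DataFrame({
    "date": ["2023-01-01", "2023-01-02", "not a date"],
    "time": ["08:15:00", "25:99:00", "09:00:00"],
})

# Same approach as the cleaning cell: parse date, then combine with time
toy["date"] = pd.to_datetime(toy["date"], errors="coerce")
toy["datetime"] = pd.to_datetime(
    toy["date"].astype(str) + " " + toy["time"], errors="coerce"
)

# Count values that were silently coerced to NaT
n_bad_date = toy["date"].isna().sum()
n_bad_datetime = toy["datetime"].isna().sum()
print(f"Unparseable dates: {n_bad_date}, unparseable datetimes: {n_bad_datetime}")
```

A non-zero count on the real data would mean that `hour` and `day_of_week` end up missing for those rows.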

---
### ⚙️ Feature Engineering <a id="feature-engineering"></a>

The cell below derives features for EDA and modeling:
- Time-based: hour of day and day of week, taken from the combined `datetime` column  
- Delay-related: a single `delay_minutes` column (arrival delay) and a `delay_category` bin  

The raw data already provides `peak_hour` and `delayed`, so these are not recomputed.

In [None]:
# Hour of day (0–23) and day of week (0=Monday, 6=Sunday)
df["hour"] = df["datetime"].dt.hour
df["day_of_week"] = df["datetime"].dt.dayofweek

# Primary delay for analysis: use arrival delay (passenger-facing)
df["delay_minutes"] = df["actual_arrival_delay_min"].copy()

# Delay category for interpretation
def delay_category(minutes):
    if minutes <= 0:
        return "On time"
    if minutes <= 5:
        return "Slight (1–5 min)"
    if minutes <= 15:
        return "Moderate (6–15 min)"
    return "Severe (>15 min)"

df["delay_category"] = df["delay_minutes"].apply(delay_category)

# Preview engineered columns
df[["datetime", "hour", "day_of_week", "delay_minutes", "delay_category"]].head(5)
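
The same binning can also be done without a row-wise `apply`, using `pd.cut`. The following is a sketch with the same thresholds as `delay_category`. The labels are shortened here for illustration, and the toy series stands in for `df["delay_minutes"]`.

```python
import numpy as np
import pandas as pd

# Bin edges are right-inclusive: (-inf, 0], (0, 5], (5, 15], (15, inf)
bins = [-np.inf, 0, 5, 15, np.inf]
labels = ["On time", "Slight", "Moderate", "Severe"]  # shortened, illustrative labels

# Toy delays covering each boundary
delays = pd.Series([-3, 0, 1, 5, 6, 15, 16, 40], name="delay_minutes")
category = pd.cut(delays, bins=bins, labels=labels)
print(pd.DataFrame({"delay_minutes": delays, "delay_category": category}))
```

`pd.cut` returns an ordered categorical, so sorting and plotting would follow severity rather than alphabetical order.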

---
### 💾 Save Cleaned Dataset <a id="save-cleaned-dataset"></a>

Export the cleaned and engineered dataset to `data/processed/` so that downstream notebooks (e.g. EDA) can load it without re-running cleaning steps.

In [None]:
out_path = Path("../data/processed/transit_delays_cleaned.csv")
out_path.parent.mkdir(parents=True, exist_ok=True)
df.to_csv(out_path, index=False)
print(f"Saved {len(df)} rows to {out_path}")
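
Two things can be lost when a downstream notebook reloads the CSV. First, dtypes are not stored, so `datetime` comes back as text unless it is parsed again. Second, depending on the pandas version, `read_csv` may treat the literal string `None` as a missing value by default, which would undo the `event_type` fill. The sketch below shows one way to reload safely. It uses an in-memory buffer and toy rows rather than the real file.

```python
import io

import pandas as pd

# Toy rows standing in for the cleaned dataset
cleaned = pd.DataFrame({
    "trip_id": ["T1", "T2"],
    "datetime": pd.to_datetime(["2023-01-01 08:15:00", "2023-01-01 09:30:00"]),
    "event_type": ["None", "Concert"],
})

# Write to an in-memory buffer instead of disk, for illustration only
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
buffer.seek(0)

# keep_default_na=False stops the string "None" from being read as missing;
# na_values=[""] still treats empty fields as missing
reloaded = pd.read_csv(
    buffer,
    parse_dates=["datetime"],
    keep_default_na=False,
    na_values=[""],
)
print(reloaded.dtypes)
print(reloaded["event_type"].tolist())
```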