# 01 - Data Cleaning & Preparation

> **Objective:** To load the raw public transit delay dataset, assess data quality, perform cleaning and feature engineering, and save a processed dataset for downstream exploratory analysis and modeling.

This notebook outlines the following stages:
1. [**Dataset overview**](#dataset-overview): loading raw data and inspecting structure  
2. [**Missing values analysis**](#missing-values-analysis): assessing completeness and handling nulls  
3. [**Data cleaning steps**](#data-cleaning-steps): addressing inconsistencies, types, and outliers  
4. [**Feature engineering**](#feature-engineering): creating derived features for analysis  
5. [**Save cleaned dataset**](#save-cleaned-dataset): exporting to `data/processed/`  

> **Note:** Section links work in Jupyter or nbviewer; they may not render in static GitHub previews.

---
### 🧠 Project Context

This notebook is the first step in the **Public Transit Delay EDA** project. Clean, well-structured data is essential for reliable exploratory analysis and any subsequent modeling. All transformations applied here are documented so that the pipeline is reproducible.

---
### 🧰 Imports <a id="imports"></a>

Core libraries for data loading, manipulation, and cleaning:

- **pandas**: data loading, tabular manipulation, and export  
- **numpy**: numerical operations where needed  
- **pathlib**: path handling for reading and writing files  

In [None]:
import pandas as pd
import numpy as np
from pathlib import Path

---
### 📥 Dataset Overview <a id="dataset-overview"></a>

Load the raw dataset from `data/raw/` and inspect its structure: shape, column names, dtypes, and a sample of rows.  
This confirms that the import completed successfully and provides a first look at the variables available for analysis.

In [None]:
# Load raw data (adjust filename as needed)
# df = pd.read_csv(Path("../data/raw/<your_raw_file>.csv"))
# df.shape
# df.head()
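
As a minimal sketch of the inspection described above, the block below reads a tiny in-memory CSV instead of the real raw file. The sample rows and the column names `route_id`, `scheduled_time`, and `delay_minutes` are hypothetical stand-ins, since the actual dataset schema is not yet defined.

```python
import io

import pandas as pd

# Hypothetical sample standing in for the raw file; the column names
# and values are assumptions made for illustration only.
raw_csv = io.StringIO(
    "route_id,scheduled_time,delay_minutes\n"
    "A1,2024-01-05 08:15,4.0\n"
    "B2,2024-01-05 17:40,12.5\n"
    "A1,2024-01-06 09:00,\n"
)
sample_df = pd.read_csv(raw_csv)

print(sample_df.shape)             # (number of rows, number of columns)
print(sample_df.columns.tolist())  # column names in file order
print(sample_df.dtypes)            # inferred dtype per column
print(sample_df.head())            # first few rows
```

Note that `scheduled_time` is read as a plain string column here; converting it to a datetime type belongs to the cleaning stage.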

| Column | Description |
|--------|-------------|
| *(placeholder)* | *(Add column descriptions once dataset is defined)* |

---
### 🧾 Missing Values Analysis <a id="missing-values-analysis"></a>

Summarize the dataset structure with `df.info()` and count nulls per column.  
Identifying missing values is essential before cleaning so that imputation or removal strategies can be applied consistently.

In [None]:
# df.info()
# df.isnull().sum()
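
One possible way to turn the null counts into a compact table is sketched below. The toy DataFrame and its columns are hypothetical; with the real data, the same summary would be built from `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names are assumptions for illustration.
sample_df = pd.DataFrame({
    "route_id": ["A1", "B2", None, "A1"],
    "delay_minutes": [4.0, np.nan, 7.5, np.nan],
    "stop_name": ["Main St", "Oak Ave", "Main St", "Pine Rd"],
})

# Count and percentage of missing values per column, worst columns first.
missing_summary = pd.DataFrame({
    "n_missing": sample_df.isnull().sum(),
    "pct_missing": (sample_df.isnull().mean() * 100).round(1),
}).sort_values("n_missing", ascending=False)

print(missing_summary)
```

Seeing the percentage next to the raw count can help when deciding between imputing a column and dropping it.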

#### 🔎 *Summary*

*(Add a short summary of which columns have missing values and the intended handling strategy.)*

---
### 🧹 Data Cleaning Steps <a id="data-cleaning-steps"></a>

Apply cleaning steps such as:
- Correcting data types (dates, categories, numeric)  
- Handling or imputing missing values  
- Removing or flagging duplicates  
- Addressing obvious outliers or invalid values  

*(Replace the placeholder below with concrete cleaning code and brief comments.)*

In [None]:
# Example: ensure datetime column
# df['date'] = pd.to_datetime(df['date'], errors='coerce')
# Example: drop duplicates
# df = df.drop_duplicates()
# (add steps as needed)
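
The cleaning steps listed above might look roughly like the following on a toy DataFrame. The column names are hypothetical, and treating negative delays as invalid is an assumption that would not hold if the real dataset records early arrivals as negative values.

```python
import pandas as pd

# Hypothetical toy data with typical problems: an unparseable date,
# numbers stored as strings, a duplicate row, and a negative delay.
sample_df = pd.DataFrame({
    "date": ["2024-01-05", "2024-01-05", "not a date", "2024-01-07"],
    "route_id": ["A1", "A1", "B2", "C3"],
    "delay_minutes": ["4", "4", "12.5", "-3"],
})

cleaned = sample_df.copy()

# Correct data types; values that cannot be parsed become NaT / NaN.
cleaned["date"] = pd.to_datetime(cleaned["date"], errors="coerce")
cleaned["delay_minutes"] = pd.to_numeric(cleaned["delay_minutes"], errors="coerce")
cleaned["route_id"] = cleaned["route_id"].astype("category")

# Remove exact duplicate rows and renumber the index.
cleaned = cleaned.drop_duplicates().reset_index(drop=True)

# Flag suspicious values rather than dropping them (assumption: a
# negative delay is invalid in this dataset).
cleaned["invalid_delay"] = cleaned["delay_minutes"] < 0

print(cleaned)
```

Flagging instead of deleting keeps the decision reversible and lets the EDA notebook inspect the flagged rows.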

---
### ⚙️ Feature Engineering <a id="feature-engineering"></a>

Create derived features that may be useful for EDA and modeling, for example:
- Time-based: hour of day, day of week, month, peak vs off-peak  
- Delay-related: delay bins, on-time vs delayed flag  
- Route or line aggregates  

*(Replace the placeholder below with actual feature engineering code.)*

In [None]:
# Example: extract hour from datetime
# df['hour'] = df['datetime_column'].dt.hour
# Example: delay category
# df['delay_category'] = pd.cut(df['delay_minutes'], bins=[...], labels=[...])
# (add features as needed)
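
To make the feature ideas above concrete, here is a small sketch on toy data. Everything specific in it is an assumption: the column names, the peak windows (07:00-09:59 and 16:00-18:59 on weekdays), the 5-minute on-time threshold, and the delay bin edges should all be replaced with values that suit the real dataset.

```python
import pandas as pd

# Hypothetical toy data; column names are assumptions for illustration.
sample_df = pd.DataFrame({
    "route_id": ["A1", "A1", "B2"],
    "scheduled_time": pd.to_datetime([
        "2024-01-05 08:15",  # a Friday
        "2024-01-05 13:00",
        "2024-01-06 17:40",  # a Saturday
    ]),
    "delay_minutes": [0.0, 4.0, 22.0],
})

features = sample_df.copy()

# Time-based features.
features["hour"] = features["scheduled_time"].dt.hour
features["day_of_week"] = features["scheduled_time"].dt.dayofweek  # Monday = 0
features["month"] = features["scheduled_time"].dt.month

# Peak flag (assumed peak hours, weekdays only).
is_weekday = features["day_of_week"] < 5
in_peak_hours = features["hour"].isin([7, 8, 9, 16, 17, 18])
features["is_peak"] = is_weekday & in_peak_hours

# Delay-related features (assumed thresholds).
features["is_delayed"] = features["delay_minutes"] > 5
features["delay_category"] = pd.cut(
    features["delay_minutes"],
    bins=[-float("inf"), 5, 15, float("inf")],  # right-inclusive intervals
    labels=["on_time", "minor", "major"],
)

# Route-level aggregate broadcast back to each row.
features["route_mean_delay"] = (
    features.groupby("route_id")["delay_minutes"].transform("mean")
)

print(features)
```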

---
### 💾 Save Cleaned Dataset <a id="save-cleaned-dataset"></a>

Export the cleaned and engineered dataset to `data/processed/` so that downstream notebooks (e.g. EDA) can load it without re-running cleaning steps.

In [None]:
out_path = Path("../data/processed/transit_delays_cleaned.csv")
out_path.parent.mkdir(parents=True, exist_ok=True)
# df.to_csv(out_path, index=False)
# print(f"Saved to {out_path}")
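
Because CSV does not store dtypes, a downstream notebook has to re-parse datetime columns when loading the file. The sketch below shows a save and reload round trip; it writes to a temporary directory rather than `data/processed/`, and the toy DataFrame and its column names are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical toy data; column names are assumptions for illustration.
sample_df = pd.DataFrame({
    "route_id": ["A1", "B2"],
    "scheduled_time": pd.to_datetime(["2024-01-05 08:15", "2024-01-05 17:40"]),
    "delay_minutes": [4.0, 12.5],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "processed" / "transit_delays_cleaned.csv"
    out_path.parent.mkdir(parents=True, exist_ok=True)

    # index=False avoids writing the row index as an extra column.
    sample_df.to_csv(out_path, index=False)

    # Datetime columns come back as strings unless parsed explicitly.
    reloaded = pd.read_csv(out_path, parse_dates=["scheduled_time"])

print(reloaded.dtypes)
```

If preserving dtypes exactly becomes important, a binary format such as Parquet is a common alternative, though it requires an additional dependency.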