# NYC Taxi Trip Duration EDA

This notebook explores basic distributions and feature relationships for the NYC TLC Yellow Taxi dataset.
Target: `trip_duration` (seconds).


In [None]:
from pathlib import Path

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

DATA_PATH = Path("../data/processed/processed_data.csv")

if not DATA_PATH.exists():
    print(f"Processed data not found at {DATA_PATH}. Run process_data.py first.")
    df = None
else:
    df = pd.read_csv(DATA_PATH)
    print(df.shape)

In [None]:
if df is not None:
    display(df.head())
    display(df.describe(include="all").T)

In [None]:
if df is not None:
    plt.figure(figsize=(8, 4))
    sns.histplot(df["trip_duration"], bins=50, kde=True)
    plt.title("Trip Duration Distribution")
    plt.xlabel("Trip duration (seconds)")
    plt.tight_layout()
    plt.show()

In [None]:
if df is not None:
    plt.figure(figsize=(8, 6))
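    # Pearson correlation over numeric columns only. ID-like columns are
    # categorical codes, so read their coefficients with caution.
    corr = df.corr(numeric_only=True)
    sns.heatmap(corr, cmap="coolwarm", center=0)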
    plt.title("Feature Correlations")
    plt.tight_layout()
    plt.show()

## Key takeaways

- Trip duration is right-skewed, so RMSE will be dominated by the longest trips; report a more robust metric such as MAE alongside it and consider a log transform of the target (e.g. `log1p`).
- Distance and duration are positively correlated; location IDs and time of day add signal beyond distance.
- Engineer temporal features (rush-hour and weekend flags) to capture time-of-day and day-of-week effects.
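
To illustrate the first takeaway, here is a minimal sketch of how a log transform tames a right-skewed target and how MAE and RMSE diverge on such data. The durations are synthetic (drawn from a log-normal distribution as an assumption, not taken from the TLC data), and `skewness` is a hypothetical helper written for this example.

```python
import numpy as np

# Synthetic right-skewed durations in seconds (illustrative only, not TLC data).
rng = np.random.default_rng(0)
durations = rng.lognormal(mean=6.5, sigma=0.8, size=10_000)

# log1p compresses the long right tail; expm1 inverts it.
log_durations = np.log1p(durations)
restored = np.expm1(log_durations)


def skewness(x):
    """Sample skewness: the third standardized moment."""
    centered = x - x.mean()
    return float((centered ** 3).mean() / x.std() ** 3)


# Compare error metrics for a constant prediction at the median.
baseline = np.full_like(durations, np.median(durations))
mae = float(np.mean(np.abs(durations - baseline)))
rmse = float(np.sqrt(np.mean((durations - baseline) ** 2)))

print(f"skew raw={skewness(durations):.2f}, log1p={skewness(log_durations):.2f}")
print(f"MAE={mae:.1f}s, RMSE={rmse:.1f}s")
```

If a model is trained on `log1p(trip_duration)`, predictions need to be mapped back with `expm1` before computing errors in seconds.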

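The temporal flags mentioned in the takeaways could be derived roughly as follows. This is a sketch under assumptions: the column name `pickup_datetime`, the helper `add_temporal_flags`, and the rush-hour windows (07:00-09:59 and 16:00-18:59 on weekdays) are illustrative choices, not something the processed dataset is known to define.

```python
import pandas as pd

# Hypothetical pickup timestamps; the column name is an assumption.
trips = pd.DataFrame({
    "pickup_datetime": pd.to_datetime([
        "2024-01-08 08:15:00",  # Monday morning
        "2024-01-08 13:00:00",  # Monday midday
        "2024-01-12 17:45:00",  # Friday evening
        "2024-01-13 08:15:00",  # Saturday morning
    ])
})


def add_temporal_flags(frame, ts_col="pickup_datetime"):
    """Return a copy of frame with is_weekend and is_rush_hour columns."""
    out = frame.copy()
    hour = out[ts_col].dt.hour
    # dayofweek: Monday=0 ... Sunday=6
    out["is_weekend"] = out[ts_col].dt.dayofweek >= 5
    # Assumed rush windows (inclusive hours 7-9 and 16-18), weekdays only.
    in_window = hour.between(7, 9) | hour.between(16, 18)
    out["is_rush_hour"] = in_window & ~out["is_weekend"]
    return out


flagged = add_temporal_flags(trips)
print(flagged)
```

The window boundaries are worth tuning against the hourly duration profile rather than fixing them up front.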