# Feature Engineering

This notebook transforms the cleaned NYC taxi dataset into modeling-ready features.

It engineers time-based features (hour, weekday, weekend), derives analytical keys (zone × hour), and aggregates trip-level data into demand tables suitable for statistical analysis and machine learning.


In [2]:
import pandas as pd

CLEAN_PATH = "../../data/processed/nyc_clean_2019_q1.parquet"
FEATURES_PATH = "../../data/processed/nyc_features_2019_q1.parquet"
AGG_PATH = "../../data/processed/nyc_demand_zone_hour_2019_q1.parquet"

# Load clean snapshot
df = pd.read_parquet(CLEAN_PATH)
print("Loaded clean dataset:", df.shape)

Loaded clean dataset: (21903044, 20)


In [5]:
# Time feature engineering

# Hour of day
df["hour"] = df["tpep_pickup_datetime"].dt.hour

# Day of the week
df["day_of_week"] = df["tpep_pickup_datetime"].dt.dayofweek

# Flag pickups that fall on a weekend (Saturday = 5, Sunday = 6)
df["is_weekend"] = df["day_of_week"].isin([5, 6]).astype(int)

# Day of the month
df["day"] = df["tpep_pickup_datetime"].dt.day

# Month
df["month"] = df["tpep_pickup_datetime"].dt.month

## Time-Based Features

Derive interpretable time features from the pickup timestamp:

- `hour`: Hour of day (0–23)
- `day_of_week`: Day of week (0 = Monday)
- `is_weekend`: Binary indicator for Saturday/Sunday
- `day`: Day of month
- `month`: Month index (1–3 for this dataset slice)

These features capture daily, weekly, and monthly demand cycles.
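The same five features are derived twice in this notebook, once on the trip-level frame and once on the aggregated table. A minimal sketch of how that derivation could be wrapped in a single helper is shown below. The name `add_time_features` and the toy timestamps are assumptions for illustration and are not part of the original notebook.

```python
import pandas as pd


def add_time_features(frame: pd.DataFrame, ts_col: str) -> pd.DataFrame:
    """Return a copy of `frame` with hour, day_of_week, is_weekend, day, month.

    Hypothetical helper, not part of the original notebook.
    """
    out = frame.copy()
    ts = out[ts_col]
    out["hour"] = ts.dt.hour
    out["day_of_week"] = ts.dt.dayofweek  # 0 = Monday ... 6 = Sunday
    out["is_weekend"] = out["day_of_week"].isin([5, 6]).astype(int)
    out["day"] = ts.dt.day
    out["month"] = ts.dt.month
    return out


# Toy input: a Tuesday morning and a Saturday night in Q1 2019.
toy = pd.DataFrame(
    {
        "tpep_pickup_datetime": pd.to_datetime(
            ["2019-01-01 10:15:00", "2019-03-02 23:45:00"]
        )
    }
)
toy_features = add_time_features(toy, "tpep_pickup_datetime")
print(toy_features)
```

Under this sketch, the same call would work on the aggregated table by passing `"pickup_hour_ts"` as `ts_col`.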

In [6]:
# Zone features

# Rename the pickup location column to zone_id
df = df.rename(columns={"pulocationid": "zone_id"})

print(df["zone_id"].nunique())
print(df["zone_id"].min(), df["zone_id"].max())

263
1 265


## Zone Features

Pickup location ID (`PULocationID`, stored as `pulocationid` in the cleaned dataset) is used as the spatial key for modeling demand. It is renamed to `zone_id` for clarity.

All demand modeling is performed at the zone × hour grain.
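The cell output reports 263 distinct zone IDs spanning 1 to 265, so two IDs in that range have no pickups in this slice. A rough sketch of how such gaps could be listed is shown below. The toy `zone_ids` series stands in for `df["zone_id"]` and is an assumption for illustration.

```python
import pandas as pd

# Toy stand-in for df["zone_id"]; the real column has 263 distinct values in 1..265.
zone_ids = pd.Series([1, 2, 2, 4, 7, 7, 7], name="zone_id")

# Every integer ID between the observed minimum and maximum.
expected = set(range(int(zone_ids.min()), int(zone_ids.max()) + 1))
observed = set(zone_ids.unique().tolist())

# IDs inside the range that never appear as a pickup zone.
missing_ids = sorted(expected - observed)
print(f"{len(observed)} zones observed, missing IDs: {missing_ids}")
```

Knowing which zones never appear helps when a downstream model expects one series per zone.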

In [7]:
# Create trip count indicator
df["demand"] = 1

## Target Variable: Demand

Each trip is treated as one unit of demand. At aggregated levels, demand is defined as the number of completed rides per zone per hour.
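As a small illustration of this definition, the sketch below shows on toy data that summing a constant `demand = 1` column per group gives the same counts as the group size, and that the total demand equals the number of trips. The `trips` frame is made up for the example.

```python
import pandas as pd

# Toy trips: three pickups in zone 1 (two in the same hour) and one in zone 2.
trips = pd.DataFrame(
    {
        "zone_id": [1, 1, 1, 2],
        "pickup_hour_ts": pd.to_datetime(
            [
                "2019-01-01 10:00",
                "2019-01-01 10:00",
                "2019-01-01 12:00",
                "2019-01-01 10:00",
            ]
        ),
    }
)
trips["demand"] = 1  # one unit of demand per trip

keys = ["zone_id", "pickup_hour_ts"]
by_sum = trips.groupby(keys)["demand"].sum()   # sum of the indicator column
by_size = trips.groupby(keys).size()           # plain row count per group

same_counts = bool((by_sum == by_size).all())
total_preserved = bool(by_sum.sum() == len(trips))
print(by_sum)
print("sum equals size:", same_counts, "| total preserved:", total_preserved)
```

The total-preserved check is a cheap sanity test that could be run after any aggregation to confirm no trips were dropped.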

In [9]:
# Create hourly timestamp
df["pickup_hour_ts"] = df["tpep_pickup_datetime"].dt.floor("h")

In [20]:
# Group and aggregate
df_zone_hour = (
    df
    .groupby([
        "zone_id",
        "pickup_hour_ts"
    ])
    .agg(
        demand=("demand", "sum"),
        avg_fare=("fare_amount", "mean"),
        avg_distance=("trip_distance", "mean")
    )
    .reset_index()
)

In [21]:
df_zone_hour.head(10)

| | zone_id | pickup_hour_ts | demand | avg_fare | avg_distance |
|---|---|---|---|---|---|
| 0 | 1 | 2019-01-01 10:00:00 | 2 | 61.25 | 16.9 |
| 1 | 1 | 2019-01-01 12:00:00 | 1 | 135.0 | 19.3 |
| 2 | 1 | 2019-01-01 15:00:00 | 1 | 106.0 | 41.28 |
| 3 | 1 | 2019-01-02 02:00:00 | 1 | 30.0 | 1.27 |
| 4 | 1 | 2019-01-02 03:00:00 | 1 | 15.0 | 12.65 |
| 5 | 1 | 2019-01-02 13:00:00 | 1 | 70.5 | 18.73 |
| 6 | 1 | 2019-01-02 14:00:00 | 2 | 38.25 | 4.035 |
| 7 | 1 | 2019-01-02 17:00:00 | 1 | 40.0 | 0.01 |
| 8 | 1 | 2019-01-02 18:00:00 | 2 | 87.5 | 2.85 |
| 9 | 1 | 2019-01-03 13:00:00 | 2 | 90.0 | 0.685 |


In [22]:
# Add time features to aggregated table
df_zone_hour["hour"] = df_zone_hour["pickup_hour_ts"].dt.hour
df_zone_hour["day_of_week"] = df_zone_hour["pickup_hour_ts"].dt.dayofweek
df_zone_hour["is_weekend"] = df_zone_hour["day_of_week"].isin([5, 6]).astype(int)
df_zone_hour["day"] = df_zone_hour["pickup_hour_ts"].dt.day
df_zone_hour["month"] = df_zone_hour["pickup_hour_ts"].dt.month

In [23]:
df_zone_hour.head(10)

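| | zone_id | pickup_hour_ts | demand | avg_fare | avg_distance | hour | day_of_week | is_weekend | day | month |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2019-01-01 10:00:00 | 2 | 61.25 | 16.9 | 10 | 1 | 0 | 1 | 1 |
| 1 | 1 | 2019-01-01 12:00:00 | 1 | 135.0 | 19.3 | 12 | 1 | 0 | 1 | 1 |
| 2 | 1 | 2019-01-01 15:00:00 | 1 | 106.0 | 41.28 | 15 | 1 | 0 | 1 | 1 |
| 3 | 1 | 2019-01-02 02:00:00 | 1 | 30.0 | 1.27 | 2 | 2 | 0 | 2 | 1 |
| 4 | 1 | 2019-01-02 03:00:00 | 1 | 15.0 | 12.65 | 3 | 2 | 0 | 2 | 1 |
| 5 | 1 | 2019-01-02 13:00:00 | 1 | 70.5 | 18.73 | 13 | 2 | 0 | 2 | 1 |
| 6 | 1 | 2019-01-02 14:00:00 | 2 | 38.25 | 4.035 | 14 | 2 | 0 | 2 | 1 |
| 7 | 1 | 2019-01-02 17:00:00 | 1 | 40.0 | 0.01 | 17 | 2 | 0 | 2 | 1 |
| 8 | 1 | 2019-01-02 18:00:00 | 2 | 87.5 | 2.85 | 18 | 2 | 0 | 2 | 1 |
| 9 | 1 | 2019-01-03 13:00:00 | 2 | 90.0 | 0.685 | 13 | 3 | 0 | 3 | 1 |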


## Aggregation to Zone × Hour

Aggregate trip-level data into a zone × hour demand table.

For each zone and pickup hour, compute:

- `demand`: Number of completed rides
- `avg_fare`: Mean fare amount
- `avg_distance`: Mean trip distance

This aggregated dataset serves as the primary modeling table for statistical analysis and forecasting.
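A `groupby` only produces rows for zone-hour pairs with at least one trip, which is why the preview for zone 1 jumps from 10:00 to 12:00 to 15:00. If a forecasting step needs an explicit zero for empty hours, one possible approach is to reindex onto the full zone × hour grid. The sketch below illustrates this on toy data and is not part of the original pipeline; the `observed`, `full_index`, and `dense` names are assumptions.

```python
import pandas as pd

# Toy aggregated table with gaps: zone 1 has no 11:00 row, zone 2 has only 11:00.
observed = pd.DataFrame(
    {
        "zone_id": [1, 1, 2],
        "pickup_hour_ts": pd.to_datetime(
            ["2019-01-01 10:00", "2019-01-01 12:00", "2019-01-01 11:00"]
        ),
        "demand": [2, 1, 3],
    }
)

# Full grid: every observed zone crossed with every hour in the observed span.
hours = pd.date_range(
    observed["pickup_hour_ts"].min(), observed["pickup_hour_ts"].max(), freq="h"
)
zones = sorted(observed["zone_id"].unique())
full_index = pd.MultiIndex.from_product(
    [zones, hours], names=["zone_id", "pickup_hour_ts"]
)

# Reindex onto the grid and fill hours without trips with zero demand.
dense = (
    observed.set_index(["zone_id", "pickup_hour_ts"])
    .reindex(full_index)
    .fillna({"demand": 0})
    .astype({"demand": "int64"})
    .reset_index()
)
print(dense)
```

Only `demand` is filled here. Averages such as `avg_fare` and `avg_distance` have no defined value for an hour without trips, so under this approach they would stay missing for the added rows.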

In [24]:
# Save snapshots
df.to_parquet(FEATURES_PATH, index=False, engine="fastparquet")
df_zone_hour.to_parquet(AGG_PATH, index=False, engine="fastparquet")