# Departure Prediction Model (Demand Forecasting)

This notebook builds a predictive model for **hourly departures (outflow/demand)** at Columbia University area Citi Bike stations.

## Why Predict Departures Instead of Net Flow?
- **More actionable**: Departures = demand for bikes (what users care about)
- **Clearer signal**: Less noise than net flow (arrivals - departures)
- **Operational value**: Helps predict when stations will run out of bikes

## Features Used
1. **Cyclical encoding** of temporal features (hour, day, month)
2. **System-wide features** (total departures across all Columbia stations)
3. **Academic calendar features** from Columbia's actual calendar
4. **National holiday features**
5. **Lag features** (departures from previous hours/days/weeks)
6. **Model (not a feature):** XGBoost with early stopping and regularization (`reg_alpha`, `reg_lambda`, `min_child_weight`)

## Goal
Achieve R² > 0.50 for hourly departure prediction on the held-out test set (October 2025).

---

## 1. Setup and Data Loading

In [47]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')

# ML libraries
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.preprocessing import LabelEncoder
import xgboost as xgb

print("Libraries imported successfully")

Libraries imported successfully


In [48]:
# Load the filtered Columbia area dataset
df = pd.read_csv('../data/columbia_filtered_citibike.csv', parse_dates=['started_at', 'ended_at'])
print(f"Loaded {len(df):,} trips")
print(f"Date range: {df['started_at'].min()} to {df['started_at'].max()}")

# Columbia station IDs
columbia_stations = [
	'7783.18',  # Broadway & W 122 St
	'7741.04',  # Morningside Dr & Amsterdam Ave
	'7745.07',  # W 120 St & Claremont Ave
	'7727.07',  # Amsterdam Ave & W 119 St
	'7713.11',  # W 116 St & Broadway
	'7692.11',  # W 116 St & Amsterdam Ave
	'7713.01'   # W 113 St & Broadway
]

print(f"Analyzing {len(columbia_stations)} Columbia stations")

Loaded 529,908 trips
Date range: 2024-01-01 00:05:39.030000 to 2025-10-31 23:51:14.035000
Analyzing 7 Columbia stations


---

## 2. Load Academic Calendar and Holiday Data

In [49]:
# Load Columbia academic calendar
academic_calendar = pd.read_csv('../data/columbia_academic_calendar.csv', parse_dates=['date'])
print(f"Loaded {len(academic_calendar)} academic calendar events")
print(f"\nEvent types:")
print(academic_calendar['event_type'].value_counts())
academic_calendar.head(10)

Loaded 78 academic calendar events

Event types:
event_type
finals            32
holiday           14
study_day         12
break              6
semester_start     4
semester_end       4
break_start        2
break_end          2
commencement       2
Name: count, dtype: int64


|   | date | event_type | description | classes_held |
|---|---|---|---|---|
| 0 | 2024-01-15 | holiday | Martin Luther King Jr. Birthday | 0 |
| 1 | 2024-01-16 | semester_start | First Day of Classes Spring 2024 | 1 |
| 2 | 2024-03-11 | break_start | Spring Recess Start | 0 |
| 3 | 2024-03-12 | break | Spring Recess | 0 |
| 4 | 2024-03-13 | break | Spring Recess | 0 |
| 5 | 2024-03-14 | break | Spring Recess | 0 |
| 6 | 2024-03-15 | break_end | Spring Recess End | 0 |
| 7 | 2024-04-29 | semester_end | Last Day of Classes Spring 2024 | 1 |
| 8 | 2024-04-30 | study_day | Study Days | 0 |
| 9 | 2024-05-01 | study_day | Study Days | 0 |


In [50]:
# Build one list of dates per calendar event type (used for the calendar features below)

# Extract different event types
semester_starts = academic_calendar[academic_calendar['event_type'] == 'semester_start']['date'].tolist()
semester_ends = academic_calendar[academic_calendar['event_type'] == 'semester_end']['date'].tolist()
holidays = academic_calendar[academic_calendar['event_type'] == 'holiday']['date'].tolist()
finals_dates = academic_calendar[academic_calendar['event_type'] == 'finals']['date'].tolist()
study_days = academic_calendar[academic_calendar['event_type'] == 'study_day']['date'].tolist()
breaks = academic_calendar[academic_calendar['event_type'].str.contains('break')]['date'].tolist()

print(f"Semester starts: {len(semester_starts)}")
print(f"Semester ends: {len(semester_ends)}")
print(f"Holidays: {len(holidays)}")
print(f"Finals periods: {len(finals_dates)} days")
print(f"Study days: {len(study_days)}")
print(f"Break days: {len(breaks)}")

Semester starts: 4
Semester ends: 4
Holidays: 14
Finals periods: 32 days
Study days: 12
Break days: 10


In [51]:
# Define active semester periods
# Between semester_start and semester_end
semester_periods = []
for start in semester_starts:
	# Find corresponding end
	corresponding_ends = [e for e in semester_ends if e > start]
	if corresponding_ends:
		end = min(corresponding_ends)
		semester_periods.append((start, end))

print(f"\nActive semester periods:")
for start, end in semester_periods:
	print(f"  {start.date()} to {end.date()}")


Active semester periods:
  2024-01-16 to 2024-04-29
  2024-09-03 to 2024-12-09
  2025-01-21 to 2025-05-05
  2025-09-02 to 2025-12-08


---

## 3. Aggregate Data by Station and Hour

In [52]:
# Extract hour from timestamps
df['start_hour'] = df['started_at'].dt.floor('H')
df['end_hour'] = df['ended_at'].dt.floor('H')

# Calculate departures
departures = df[df['start_station_id'].isin(columbia_stations)].groupby(
	['start_station_id', 'start_hour']
).size().reset_index(name='departures')
departures.columns = ['station_id', 'hour', 'departures']

# Calculate arrivals
arrivals = df[
	(df['end_station_id'].isin(columbia_stations)) & 
	(df['end_station_id'].notna())
].groupby(['end_station_id', 'end_hour']).size().reset_index(name='arrivals')
arrivals.columns = ['station_id', 'hour', 'arrivals']

# Merge
station_hours = departures.merge(arrivals, on=['station_id', 'hour'], how='outer')
station_hours['departures'] = station_hours['departures'].fillna(0).astype(int)
station_hours['arrivals'] = station_hours['arrivals'].fillna(0).astype(int)
station_hours['net_flow'] = station_hours['arrivals'] - station_hours['departures']

print(f"Created station-hour dataset: {len(station_hours):,} rows")

Created station-hour dataset: 27,407 rows


In [53]:
# Create complete time series (fill missing hours with 0)
min_hour = station_hours['hour'].min()
max_hour = station_hours['hour'].max()
all_hours = pd.date_range(start=min_hour, end=max_hour, freq='H')

all_combinations = pd.MultiIndex.from_product(
	[columbia_stations, all_hours],
	names=['station_id', 'hour']
).to_frame(index=False)

station_hours = all_combinations.merge(station_hours, on=['station_id', 'hour'], how='left')
station_hours['departures'] = station_hours['departures'].fillna(0).astype(int)
station_hours['arrivals'] = station_hours['arrivals'].fillna(0).astype(int)
station_hours['net_flow'] = station_hours['net_flow'].fillna(0).astype(int)

print(f"Complete dataset: {len(station_hours):,} rows")
print(f"Date range: {min_hour} to {max_hour}")

Complete dataset: 91,028 rows
Date range: 2024-05-02 17:00:00 to 2025-10-26 12:00:00


---

## 4. Feature Engineering (IMPROVED)

### 4.1 Basic Temporal Features

In [54]:
# Extract date components (for internal use, not as model features)
station_hours['date'] = station_hours['hour'].dt.date
station_hours['hour_of_day'] = station_hours['hour'].dt.hour
station_hours['day_of_week'] = station_hours['hour'].dt.dayofweek
station_hours['month'] = station_hours['hour'].dt.month
station_hours['is_weekend'] = (station_hours['day_of_week'] >= 5).astype(int)

# COMBINED: Rush hour (hours 7-9 or 16-18 inclusive, i.e. 7:00-9:59 or 16:00-18:59, on weekdays)
station_hours['is_rush_hour'] = (
	(station_hours['is_weekend'] == 0) &
	(
		((station_hours['hour_of_day'] >= 7) & (station_hours['hour_of_day'] <= 9)) |
		((station_hours['hour_of_day'] >= 16) & (station_hours['hour_of_day'] <= 18))
	)
).astype(int)

print("✓ Created basic temporal features")
print("✓ Combined all rush hour features into single is_rush_hour")

✓ Created basic temporal features
✓ Combined all rush hour features into single is_rush_hour


### 4.2 NEW: Cyclical Encoding

In [55]:
# Cyclical encoding: time is circular!
# Hour 23 and hour 0 are only 1 hour apart, not 23 hours

# Hour of day (0-23)
station_hours['hour_sin'] = np.sin(2 * np.pi * station_hours['hour_of_day'] / 24)
station_hours['hour_cos'] = np.cos(2 * np.pi * station_hours['hour_of_day'] / 24)

# Day of week (0-6)
station_hours['day_sin'] = np.sin(2 * np.pi * station_hours['day_of_week'] / 7)
station_hours['day_cos'] = np.cos(2 * np.pi * station_hours['day_of_week'] / 7)

# Month (1-12)
station_hours['month_sin'] = np.sin(2 * np.pi * (station_hours['month'] - 1) / 12)
station_hours['month_cos'] = np.cos(2 * np.pi * (station_hours['month'] - 1) / 12)

print("✓ Created cyclical encoding features")
station_hours[['hour_of_day', 'hour_sin', 'hour_cos', 'day_of_week', 'day_sin', 'day_cos']].head()

✓ Created cyclical encoding features


|   | hour_of_day | hour_sin | hour_cos | day_of_week | day_sin | day_cos |
|---|---|---|---|---|---|---|
| 0 | 17 | -0.965926 | -0.258819 | 3 | 0.433884 | -0.900969 |
| 1 | 18 | -1.0 | -1.83697e-16 | 3 | 0.433884 | -0.900969 |
| 2 | 19 | -0.965926 | 0.258819 | 3 | 0.433884 | -0.900969 |
| 3 | 20 | -0.866025 | 0.5 | 3 | 0.433884 | -0.900969 |
| 4 | 21 | -0.707107 | 0.7071068 | 3 | 0.433884 | -0.900969 |
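
The point of the sine/cosine pair can be checked numerically. The sketch below uses toy values and a helper name (`encode_hour`) that is illustrative and not part of the notebook. It compares how far apart hour 23 and hour 0 are in raw form and on the unit circle.

```python
import numpy as np

def encode_hour(hour):
    """Map an hour of day (0-23) to a (sin, cos) point on the unit circle."""
    angle = 2 * np.pi * np.asarray(hour) / 24
    return np.stack([np.sin(angle), np.cos(angle)], axis=-1)

# Raw encoding: 23 and 0 look as far apart as two hours can be.
raw_gap = abs(23 - 0)

# Cyclical encoding: the step 23 -> 0 is the same size as the step 0 -> 1.
p23, p0, p1, p12 = encode_hour([23, 0, 1, 12])
gap_23_0 = np.linalg.norm(p23 - p0)
gap_0_1 = np.linalg.norm(p0 - p1)
gap_0_12 = np.linalg.norm(p0 - p12)

print(f"raw gap 23 vs 0:      {raw_gap}")
print(f"encoded gap 23 vs 0:  {gap_23_0:.3f}")
print(f"encoded gap 0 vs 1:   {gap_0_1:.3f}")
print(f"encoded gap 0 vs 12:  {gap_0_12:.3f}")
```

Both columns are needed: sine alone cannot tell hour 3 from hour 9, and cosine alone cannot tell hour 6 from hour 18.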


### 4.3 NEW: Academic Calendar Features

In [56]:
# Convert date column to datetime for matching
station_hours['date'] = pd.to_datetime(station_hours['date'])

# Is this an active semester day?
def is_in_semester(date):
	for start, end in semester_periods:
		if start <= date <= end:
			return 1
	return 0

station_hours['is_semester'] = station_hours['date'].apply(is_in_semester)

# COMBINED: Is this a holiday? (academic OR national)
holidays_set = set(pd.to_datetime(holidays))
national_holidays = [
	'2024-01-01', '2024-07-04', '2024-12-25', '2024-12-31',
	'2025-01-01', '2025-07-04', '2025-12-25', '2025-12-31'
]
national_holidays_set = set(pd.to_datetime(national_holidays))
all_holidays = holidays_set.union(national_holidays_set)
station_hours['is_holiday'] = station_hours['date'].isin(all_holidays).astype(int)

# Is this finals week?
finals_set = set(pd.to_datetime(finals_dates))
station_hours['is_finals'] = station_hours['date'].isin(finals_set).astype(int)

# Is this a study day?
study_days_set = set(pd.to_datetime(study_days))
station_hours['is_study_day'] = station_hours['date'].isin(study_days_set).astype(int)

# Is this a break?
breaks_set = set(pd.to_datetime(breaks))
station_hours['is_break'] = station_hours['date'].isin(breaks_set).astype(int)

print("✓ Created academic calendar features")
print(f"\nSemester days: {station_hours['is_semester'].sum():,} hours")
print(f"Holiday hours (academic + national): {station_hours['is_holiday'].sum():,}")
print(f"Finals hours: {station_hours['is_finals'].sum():,}")
print(f"Study day hours: {station_hours['is_study_day'].sum():,}")
print(f"Break hours: {station_hours['is_break'].sum():,}")

✓ Created academic calendar features

Semester days: 43,267 hours
Holiday hours (academic + national): 2,184
Finals hours: 4,032
Study day hours: 1,057
Break hours: 840
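
The row-by-row `apply` used for `is_semester` can also be written with vectorized comparisons. The following is a minimal sketch, assuming the first two semester periods printed earlier and a handful of made-up dates. It is one possible alternative, not the notebook's method.

```python
import pandas as pd

# First two semester windows from the notebook output (inclusive on both ends).
periods = [
    (pd.Timestamp("2024-01-16"), pd.Timestamp("2024-04-29")),
    (pd.Timestamp("2024-09-03"), pd.Timestamp("2024-12-09")),
]

# Made-up dates around the window boundaries.
dates = pd.Series(pd.to_datetime([
    "2024-01-15", "2024-01-16", "2024-04-29", "2024-06-01", "2024-09-03",
]))

in_semester = pd.Series(False, index=dates.index)
for start, end in periods:
    # Series.between includes both endpoints by default.
    in_semester |= dates.between(start, end)

is_semester = in_semester.astype(int)
print(is_semester.tolist())
```

The loop runs once per semester rather than once per row, which matters when the frame has tens of thousands of rows.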


In [57]:
# Days since semester started (resets each semester)
def days_since_semester_start(date):
	# Find the most recent semester start before this date
	past_starts = [s for s in semester_starts if s <= date]
	if past_starts:
		most_recent_start = max(past_starts)
		return (date - most_recent_start).days
	return 999  # No semester start on or before this date

station_hours['days_since_semester_start'] = station_hours['date'].apply(days_since_semester_start)

print("✓ Created semester-relative features")
print("✓ Removed is_first_week and is_last_two_weeks (redundant with days_since_semester_start)")

✓ Created semester-relative features
✓ Removed is_first_week and is_last_two_weeks (redundant with days_since_semester_start)


### 4.4 National Holiday Features (merged into 4.3)

In [58]:
print("✓ Removed separate national holidays cell (now combined with academic holidays)")

✓ Removed separate national holidays cell (now combined with academic holidays)


### 4.5 Lag Features (from v1)

In [59]:
print("✓ Removed old lag features (now using departure-specific ones)")

✓ Removed old lag features (now using departure-specific ones)


### 4.6 NEW: Departure Lag and System-Wide Features

In [60]:
# Sort by station and time for lag features
station_hours = station_hours.sort_values(['station_id', 'hour']).reset_index(drop=True)

# Lag features for DEPARTURES (our target variable)
station_hours['departures_lag_1h'] = station_hours.groupby('station_id')['departures'].shift(1)
station_hours['departures_lag_24h'] = station_hours.groupby('station_id')['departures'].shift(24)
station_hours['departures_lag_168h'] = station_hours.groupby('station_id')['departures'].shift(168)

# Also keep arrivals lag for context
station_hours['arrivals_lag_1h'] = station_hours.groupby('station_id')['arrivals'].shift(1)

# Total trips lag
station_hours['total_trips_lag_1h'] = station_hours['departures_lag_1h'] + station_hours['arrivals_lag_1h']

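# Rolling averages for departures
# Note: each window ends at the current hour, so these features include the
# target value for that row; applying .shift(1) before .rolling(...) would
# restrict them to past hours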
station_hours['departures_rolling_avg_24h'] = station_hours.groupby('station_id')['departures'].transform(
	lambda x: x.rolling(window=24, min_periods=1).mean()
)
station_hours['departures_rolling_avg_7d'] = station_hours.groupby('station_id')['departures'].transform(
	lambda x: x.rolling(window=168, min_periods=1).mean()
)

print("✓ Created lag and rolling features for departures")

✓ Created lag and rolling features for departures
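
How per-station lags and rolling windows line up with the target is easiest to see on a tiny frame. This sketch uses two hypothetical stations with integer hours standing in for timestamps. It contrasts a rolling mean that includes the current hour with one restricted to past hours.

```python
import pandas as pd

# Toy long-format data: two hypothetical stations, five hours each.
toy = pd.DataFrame({
    "station_id": ["A"] * 5 + ["B"] * 5,
    "hour": [0, 1, 2, 3, 4] * 2,          # integer stand-in for timestamps
    "departures": [1, 2, 3, 4, 5, 10, 20, 30, 40, 50],
}).sort_values(["station_id", "hour"]).reset_index(drop=True)

grouped = toy.groupby("station_id")["departures"]

# Previous hour's value within each station; the first row per station is NaN.
toy["lag_1h"] = grouped.shift(1)

# Rolling mean over the current hour and the two before it (includes the target).
toy["roll3_with_current"] = grouped.transform(
    lambda s: s.rolling(window=3, min_periods=1).mean()
)

# Rolling mean over the three previous hours only (excludes the target).
toy["roll3_past_only"] = grouped.transform(
    lambda s: s.shift(1).rolling(window=3, min_periods=1).mean()
)

print(toy)
```

Grouping by station before shifting keeps station B's first row from picking up station A's last value.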


In [61]:
print("✓ Removed weather data loading (not using weather features)")

✓ Removed weather data loading (not using weather features)


In [62]:
# Calculate system-wide (all Columbia stations) departures
system_wide = station_hours.groupby('hour').agg({
	'departures': 'sum',
	'arrivals': 'sum'
}).reset_index()

system_wide.columns = ['hour', 'system_departures', 'system_arrivals']

# Merge back
station_hours = station_hours.merge(system_wide, on='hour', how='left')

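# System-wide lag features
# Note: after sorting by hour there are seven rows per hour (one per station),
# so shift(1) returns the same hour's total for six of every seven rows rather
# than the previous hour's; shifting system_wide before the merge would give a
# true one-hour lag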
station_hours = station_hours.sort_values('hour').reset_index(drop=True)
station_hours['system_departures_lag_1h'] = station_hours['system_departures'].shift(1)
station_hours['system_total_trips_lag_1h'] = (
	station_hours['system_departures'].shift(1) + station_hours['system_arrivals'].shift(1)
)

# Re-sort by station and hour
station_hours = station_hours.sort_values(['station_id', 'hour']).reset_index(drop=True)

print("✓ Created system-wide features")
print(f"\nSystem-wide departures stats:")
print(station_hours['system_departures'].describe())

✓ Created system-wide features

System-wide departures stats:
count    91028.000000
mean         5.916410
std         14.904303
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max        120.000000
Name: system_departures, dtype: float64
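
In a long-format table with several stations per hour, a one-row shift is not the same as a one-hour lag. The sketch below uses three hypothetical stations and integer hours. It computes the lag on the per-hour totals before merging, then shows what a row shift on the merged table returns instead.

```python
import pandas as pd

# Toy long-format frame: three hypothetical stations, three hours.
toy = pd.DataFrame({
    "station_id": ["A", "B", "C"] * 3,
    "hour": [0, 0, 0, 1, 1, 1, 2, 2, 2],
    "departures": [1, 2, 3, 4, 5, 6, 7, 8, 9],
})

# One row per hour: total departures across all stations.
system = (
    toy.groupby("hour", as_index=False)["departures"].sum()
       .rename(columns={"departures": "system_departures"})
)

# Lag on the per-hour table, where one row really is one hour.
system["system_lag_1h"] = system["system_departures"].shift(1)

merged = toy.merge(system, on="hour", how="left").sort_values("hour")

# For contrast: a one-row shift on the long table mostly returns the same hour.
merged["row_shift"] = merged["system_departures"].shift(1)

print(merged[["station_id", "hour", "system_departures",
              "system_lag_1h", "row_shift"]])
```

In the printed frame, `row_shift` matches `system_lag_1h` only on the first row of each hour. The other rows carry the current hour's total.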


In [63]:
print("✓ Removed old historical average (now using departure-specific one)")

✓ Removed old historical average (now using departure-specific one)


In [64]:
# Interaction features

# Semester × weekday (high activity expected)
station_hours['semester_weekday'] = (
	station_hours['is_semester'] * (1 - station_hours['is_weekend'])
)

# NEW: Hour × Weekend interaction (different hourly patterns on weekends)
station_hours['hour_weekend_interaction'] = (
	station_hours['hour_of_day'] * station_hours['is_weekend']
)

print("✓ Created interaction features")
print("✓ Added hour × weekend interaction")
print("✓ Removed rush_hour_lag_interaction (redundant)")

✓ Created interaction features
✓ Added hour × weekend interaction
✓ Removed rush_hour_lag_interaction (redundant)


### 4.7 Historical Average Feature

In [65]:
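# Historical average by station-hour-daytype
# Note: the mean is taken over the full series, so it also uses departures
# from the validation and test periods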
historical_avg = station_hours.groupby(
	['station_id', 'hour_of_day', 'is_weekend']
)['departures'].transform('mean')

station_hours['historical_avg_departures'] = historical_avg

print("✓ Created historical average feature for departures")

✓ Created historical average feature for departures
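
A group mean used as a feature can be estimated from the training period only and then attached to later rows. This is a minimal sketch on made-up data. The column `t` and the cutoff are hypothetical stand-ins for the timestamp and the train/validation boundary.

```python
import pandas as pd

# Toy station-hours for one hypothetical station at 8:00 on weekdays.
toy = pd.DataFrame({
    "station_id": ["A"] * 6,
    "hour_of_day": [8] * 6,
    "is_weekend": [0] * 6,
    "t": [0, 1, 2, 3, 4, 5],              # stand-in for the timestamp
    "departures": [1, 1, 1, 1, 9, 9],
})
keys = ["station_id", "hour_of_day", "is_weekend"]
train = toy[toy["t"] <= 3]                # hypothetical training cutoff

# Group means from the training rows only.
train_avg = (
    train.groupby(keys, as_index=False)["departures"].mean()
         .rename(columns={"departures": "historical_avg_train_only"})
)

# Attach to every row; later rows reuse the training-period mean.
toy = toy.merge(train_avg, on=keys, how="left")

# For contrast: the mean over all rows, which also sees the later period.
toy["historical_avg_all_rows"] = (
    toy.groupby(keys)["departures"].transform("mean")
)

print(toy[["t", "departures",
           "historical_avg_train_only", "historical_avg_all_rows"]])
```

The two columns differ whenever demand in the later period differs from the training period, which is the situation the target statistics in this notebook suggest.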


---

## 5. Prepare Train/Validation/Test Split

In [66]:
# Drop rows with NaN in lag features
station_hours_clean = station_hours.dropna(subset=[
	'departures_lag_1h', 'departures_lag_24h', 'departures_lag_168h', 'system_departures_lag_1h'
]).copy()

print(f"Rows after dropping NaN: {len(station_hours_clean):,}")
print(f"Date range: {station_hours_clean['hour'].min()} to {station_hours_clean['hour'].max()}")

Rows after dropping NaN: 89,852
Date range: 2024-05-09 17:00:00 to 2025-10-26 12:00:00


In [67]:
# Time-based split (IMPROVED: more training data)
train_end = pd.Timestamp('2025-08-31 23:00:00')  # Extended from June to August
val_end = pd.Timestamp('2025-09-30 23:00:00')

train_data = station_hours_clean[station_hours_clean['hour'] <= train_end].copy()
val_data = station_hours_clean[
	(station_hours_clean['hour'] > train_end) & 
	(station_hours_clean['hour'] <= val_end)
].copy()
test_data = station_hours_clean[station_hours_clean['hour'] > val_end].copy()

print(f"Train: {len(train_data):,} rows ({train_data['hour'].min()} to {train_data['hour'].max()})")
print(f"Val: {len(val_data):,} rows ({val_data['hour'].min()} to {val_data['hour'].max()})")
print(f"Test: {len(test_data):,} rows ({test_data['hour'].min()} to {test_data['hour'].max()})")
print(f"\nTarget statistics:")
print(f"Train - Mean: {train_data['departures'].mean():.2f}, Std: {train_data['departures'].std():.2f}")
print(f"Val   - Mean: {val_data['departures'].mean():.2f}, Std: {val_data['departures'].std():.2f}")
print(f"Test  - Mean: {test_data['departures'].mean():.2f}, Std: {test_data['departures'].std():.2f}")

Train: 80,521 rows (2024-05-09 17:00:00 to 2025-08-31 23:00:00)
Val: 5,040 rows (2025-09-01 00:00:00 to 2025-09-30 23:00:00)
Test: 4,291 rows (2025-10-01 00:00:00 to 2025-10-26 12:00:00)

Target statistics:
Train - Mean: 0.47, Std: 1.78
Val   - Mean: 4.38, Std: 4.85
Test  - Mean: 4.01, Std: 4.48


In [68]:
print("✓ Removed old feature list (now using departure-specific features)")

✓ Removed old feature list (now using departure-specific features)


In [69]:
print("✓ Removed old X, y preparation (now using departure-specific version)")

✓ Removed old X, y preparation (now using departure-specific version)


In [70]:
print("✓ Removed duplicate dropna cell (moved before train/val/test split)")

✓ Removed duplicate dropna cell (moved before train/val/test split)


In [71]:
# Evaluation function with MAPE and zero-inflation metrics
def evaluate_model(y_true, y_pred, model_name):
	mae = mean_absolute_error(y_true, y_pred)
	rmse = np.sqrt(mean_squared_error(y_true, y_pred))
	r2 = r2_score(y_true, y_pred)
	
	# MAPE (Mean Absolute Percentage Error)
	# Avoid division by zero by only calculating for non-zero values
	non_zero_mask = y_true > 0
	if non_zero_mask.sum() > 0:
		mape = np.mean(np.abs((y_true[non_zero_mask] - y_pred[non_zero_mask]) / y_true[non_zero_mask])) * 100
	else:
		mape = np.nan
	
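	# Zero-inflation accuracy (how well we predict 0 vs non-zero)
	# Note: y_pred > 0 holds for any positive prediction, however small, so a
	# regressor that never outputs exactly 0 is scored as always predicting non-zero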
	y_true_binary = (y_true > 0).astype(int)
	y_pred_binary = (y_pred > 0).astype(int)
	zero_inflation_acc = np.sum(y_true_binary == y_pred_binary) / len(y_true) * 100
	
	# For departures: directional accuracy means predicting if demand > median
	median_demand = np.median(y_true)
	correct_direction = np.sum((y_true > median_demand) == (y_pred > median_demand))
	directional_acc = correct_direction / len(y_true) * 100
	
	print(f"\n{'='*50}")
	print(f"{model_name} Results")
	print(f"{'='*50}")
	print(f"MAE:   {mae:.3f} departures")
	print(f"RMSE:  {rmse:.3f} departures")
	print(f"R²:    {r2:.3f}")
	print(f"MAPE:  {mape:.2f}%" if not np.isnan(mape) else "MAPE:  N/A (no non-zero values)")
	print(f"Zero-Inflation Accuracy: {zero_inflation_acc:.2f}%")
	print(f"Directional Accuracy:    {directional_acc:.2f}%")
	
	return {
		'model': model_name,
		'mae': mae,
		'rmse': rmse,
		'r2': r2,
		'mape': mape,
		'zero_inflation_accuracy': zero_inflation_acc,
		'directional_accuracy': directional_acc
	}
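
The zero versus non-zero score depends heavily on where continuous predictions are cut. In the results reported later in this notebook, the baseline and the random forest show the same zero-inflation accuracy (75.67%), which would be consistent with every prediction being counted as non-zero. The sketch below uses made-up numbers and a hypothetical helper, `zero_nonzero_accuracy`, to compare a cut at 0 with a cut at 0.5.

```python
import numpy as np

def zero_nonzero_accuracy(y_true, y_pred, threshold):
    """Share of rows where 'true count > 0' matches 'prediction > threshold'."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean((y_true > 0) == (y_pred > threshold)))

# Toy counts and continuous predictions (made-up values).
y_true = np.array([0, 0, 0, 1, 3, 5])
y_pred = np.array([0.1, 0.2, 0.6, 0.9, 2.5, 4.0])

acc_at_zero = zero_nonzero_accuracy(y_true, y_pred, threshold=0.0)
acc_at_half = zero_nonzero_accuracy(y_true, y_pred, threshold=0.5)
share_nonzero = float(np.mean(y_true > 0))

print(f"accuracy with threshold 0.0: {acc_at_zero:.3f}")
print(f"accuracy with threshold 0.5: {acc_at_half:.3f}")
print(f"share of non-zero hours:     {share_nonzero:.3f}")
```

With a cut at 0, every prediction here counts as non-zero, so the score collapses to the share of non-zero hours. The 0.5 cut is only one possible choice and is an assumption of this sketch.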

In [72]:
# Define feature columns for DEPARTURE prediction (SIMPLIFIED)
feature_cols = [
	# Cyclical temporal features (ONLY cyclical, removed raw temporal)
	'hour_sin', 'hour_cos', 'day_sin', 'day_cos', 'month_sin', 'month_cos',
	# Binary temporal features
	'is_weekend', 'is_rush_hour',
	# Academic calendar features
	'is_semester', 'is_holiday', 'is_finals', 'is_study_day', 'is_break',
	'days_since_semester_start',
	# Lag features for DEPARTURES
	'departures_lag_1h', 'departures_lag_24h', 'departures_lag_168h',
	'arrivals_lag_1h', 'total_trips_lag_1h',
	# Rolling averages for DEPARTURES
	'departures_rolling_avg_24h', 'departures_rolling_avg_7d',
	# System-wide features
	'system_departures_lag_1h', 'system_total_trips_lag_1h',
	# Historical average for DEPARTURES
	'historical_avg_departures',
	# Interaction features
	'semester_weekday', 'hour_weekend_interaction'
]

# Encode station_id
le = LabelEncoder()
station_hours_clean['station_id_encoded'] = le.fit_transform(station_hours_clean['station_id'])
train_data['station_id_encoded'] = le.transform(train_data['station_id'])
val_data['station_id_encoded'] = le.transform(val_data['station_id'])
test_data['station_id_encoded'] = le.transform(test_data['station_id'])

feature_cols.append('station_id_encoded')

print(f"\nTotal features: {len(feature_cols)} (reduced from 36)")
print(f"Target variable: departures (demand)")
print(f"\nRemoved features:")
print(f"  - Raw temporal: hour_of_day, day_of_week, day_of_month, month")
print(f"  - Redundant calendar: is_first_week, is_last_two_weeks, is_national_holiday")
print(f"  - Redundant rush: is_weekday_morning_rush, is_weekday_evening_rush")
print(f"  - Redundant interaction: rush_hour_lag_interaction")
print(f"\nAdded features:")
print(f"  + hour_weekend_interaction")
print(f"\nFeature list:")
for i, col in enumerate(feature_cols, 1):
	print(f"{i:2d}. {col}")


Total features: 27 (reduced from 36)
Target variable: departures (demand)

Removed features:
  - Raw temporal: hour_of_day, day_of_week, day_of_month, month
  - Redundant calendar: is_first_week, is_last_two_weeks, is_national_holiday
  - Redundant rush: is_weekday_morning_rush, is_weekday_evening_rush
  - Redundant interaction: rush_hour_lag_interaction

Added features:
  + hour_weekend_interaction

Feature list:
 1. hour_sin
 2. hour_cos
 3. day_sin
 4. day_cos
 5. month_sin
 6. month_cos
 7. is_weekend
 8. is_rush_hour
 9. is_semester
10. is_holiday
11. is_finals
12. is_study_day
13. is_break
14. days_since_semester_start
15. departures_lag_1h
16. departures_lag_24h
17. departures_lag_168h
18. arrivals_lag_1h
19. total_trips_lag_1h
20. departures_rolling_avg_24h
21. departures_rolling_avg_7d
22. system_departures_lag_1h
23. system_total_trips_lag_1h
24. historical_avg_departures
25. semester_weekday
26. hour_weekend_interaction
27. station_id_encoded


In [73]:
# Prepare X and y (TARGET = departures)
X_train = train_data[feature_cols]
y_train = train_data['departures']

X_val = val_data[feature_cols]
y_val = val_data['departures']

X_test = test_data[feature_cols]
y_test = test_data['departures']

print(f"X_train shape: {X_train.shape}")
print(f"X_val shape: {X_val.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"\nTarget variable statistics (departures):")
print(f"Train - Mean: {y_train.mean():.2f}, Std: {y_train.std():.2f}")
print(f"Val   - Mean: {y_val.mean():.2f}, Std: {y_val.std():.2f}")
print(f"Test  - Mean: {y_test.mean():.2f}, Std: {y_test.std():.2f}")

X_train shape: (80521, 27)
X_val shape: (5040, 27)
X_test shape: (4291, 27)

Target variable statistics (departures):
Train - Mean: 0.47, Std: 1.78
Val   - Mean: 4.38, Std: 4.85
Test  - Mean: 4.01, Std: 4.48


---

## 6. Train Models

In [74]:
print("✓ Removed old Linear Regression cell (using departure-specific version below)")

✓ Removed old Linear Regression cell (using departure-specific version below)


In [75]:
print("✓ Removed old feature importance (using departure-specific version below)")

✓ Removed old feature importance (using departure-specific version below)


In [76]:
y_pred_baseline = test_data['historical_avg_departures'].values
baseline_results = evaluate_model(y_test, y_pred_baseline, "Baseline (Historical Average)")
results = [baseline_results]


Baseline (Historical Average) Results
MAE:   3.262 departures
RMSE:  5.029 departures
R²:    -0.259
MAPE:  75.04%
Zero-Inflation Accuracy: 75.67%
Directional Accuracy:    59.36%
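
A negative R² means the predictions have a larger squared error than a constant prediction at the test-set mean. One plausible reading of the baseline's -0.259 is that the historical average is pulled down by the low-demand training period (mean 0.47) while the test period averages 4.01. The sketch below uses toy numbers and a hand-written `r_squared` helper (not the scikit-learn function) to show how a constant that sits far from the test mean produces a negative score.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Toy test period with mean 4.
y_true = np.array([2.0, 4.0, 6.0, 4.0])

# Predicting the test mean itself gives exactly 0.
r2_mean = r_squared(y_true, np.full_like(y_true, y_true.mean()))

# A constant far below the test mean does worse than that reference.
r2_low = r_squared(y_true, np.full_like(y_true, 0.5))

print(f"R² (test mean):    {r2_mean:.3f}")
print(f"R² (constant 0.5): {r2_low:.3f}")
```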


In [77]:
print("✓ Removed old Random Forest cell (using departure-specific version below)")

✓ Removed old Random Forest cell (using departure-specific version below)


In [78]:
print("Training Linear Regression for departure prediction...")
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

y_pred_lr = lr_model.predict(X_test)
lr_results = evaluate_model(y_test, y_pred_lr, "Linear Regression - Departures")
results.append(lr_results)

Training Linear Regression for departure prediction...

Linear Regression - Departures Results
MAE:   1.773 departures
RMSE:  2.467 departures
R²:    0.697
MAPE:  62.62%
Zero-Inflation Accuracy: 77.44%
Directional Accuracy:    79.79%


In [79]:
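# Coefficients ranked by absolute value. The features are not standardized,
# so each magnitude depends on that feature's units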
feature_importance = pd.DataFrame({
	'feature': feature_cols,
	'coefficient': lr_model.coef_
}).sort_values('coefficient', key=abs, ascending=False)

print("\nTop 15 Most Important Features (Departure Prediction):")
print(feature_importance.head(15))

fig = px.bar(
	feature_importance.head(15),
	x='coefficient',
	y='feature',
	orientation='h',
	title='Linear Regression: Top 15 Features for Departure Prediction',
	labels={'coefficient': 'Coefficient Value', 'feature': 'Feature'},
	template='plotly_white'
)
fig.show()


Top 15 Most Important Features (Departure Prediction):
                       feature  coefficient
19  departures_rolling_avg_24h     0.374566
20   departures_rolling_avg_7d    -0.299376
23   historical_avg_departures     0.272694
16         departures_lag_168h     0.187340
0                     hour_sin     0.180466
15          departures_lag_24h     0.121234
1                     hour_cos     0.115993
7                 is_rush_hour    -0.114844
21    system_departures_lag_1h     0.089369
6                   is_weekend    -0.077020
14           departures_lag_1h     0.066288
18          total_trips_lag_1h     0.065162
5                    month_cos     0.018651
24            semester_weekday    -0.017553
4                    month_sin     0.016953
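
Raw linear-regression coefficients depend on feature units, so ranking them by absolute value can understate features with a wide numeric range. The sketch below uses synthetic data and ordinary least squares via `np.linalg.lstsq`, swapped in for scikit-learn's `LinearRegression` to keep the example dependency-light. It fits the same relationship before and after standardizing the features.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Two made-up features with equal influence on y but different units.
x_small = rng.normal(0.0, 1.0, n)      # standard deviation about 1
x_large = rng.normal(0.0, 100.0, n)    # standard deviation about 100
y = 2.0 * x_small + 2.0 * (x_large / 100.0) + rng.normal(0.0, 0.1, n)

def fit_coefficients(X, y):
    """Ordinary least squares with an intercept; returns the slopes only."""
    design = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta[1:]

X_raw = np.column_stack([x_small, x_large])
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

raw_coefs = fit_coefficients(X_raw, y)
std_coefs = fit_coefficients(X_std, y)

print("raw coefficients:         ", np.round(raw_coefs, 3))
print("standardized coefficients:", np.round(std_coefs, 3))
```

On the raw features the second coefficient is roughly a hundred times smaller even though both features matter equally. After standardizing, the two are of similar size.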


In [80]:
print("✓ Removed old XGBoost cell (using departure-specific version below)")

✓ Removed old XGBoost cell (using departure-specific version below)


In [81]:
print("Training Random Forest for departure prediction...")
rf_model = RandomForestRegressor(
	n_estimators=200,
	max_depth=20,
	min_samples_split=5,
	random_state=42,
	n_jobs=-1,
	verbose=1
)
rf_model.fit(X_train, y_train)

y_pred_rf = rf_model.predict(X_test)
rf_results = evaluate_model(y_test, y_pred_rf, "Random Forest - Departures")
results.append(rf_results)

Training Random Forest for departure prediction...


[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:    0.5s
[Parallel(n_jobs=-1)]: Done 176 tasks      | elapsed:    2.7s



Random Forest - Departures Results
MAE:   1.743 departures
RMSE:  2.464 departures
R²:    0.698
MAPE:  61.72%
Zero-Inflation Accuracy: 75.67%
Directional Accuracy:    80.45%


[Parallel(n_jobs=-1)]: Done 200 out of 200 | elapsed:    3.0s finished
[Parallel(n_jobs=12)]: Using backend ThreadingBackend with 12 concurrent workers.
[Parallel(n_jobs=12)]: Done  26 tasks      | elapsed:    0.0s
[Parallel(n_jobs=12)]: Done 176 tasks      | elapsed:    0.0s
[Parallel(n_jobs=12)]: Done 200 out of 200 | elapsed:    0.0s finished


In [82]:
# Feature importance
feature_importance_rf = pd.DataFrame({
	'feature': feature_cols,
	'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)

print("\nTop 15 Most Important Features (Random Forest - Departures):")
print(feature_importance_rf.head(15))

fig = px.bar(
	feature_importance_rf.head(15),
	x='importance',
	y='feature',
	orientation='h',
	title='Random Forest: Top 15 Features for Departure Prediction',
	labels={'importance': 'Importance Score', 'feature': 'Feature'},
	template='plotly_white'
)
fig.show()


Top 15 Most Important Features (Random Forest - Departures):
                       feature  importance
21    system_departures_lag_1h    0.652834
23   historical_avg_departures    0.153329
19  departures_rolling_avg_24h    0.038401
22   system_total_trips_lag_1h    0.026696
20   departures_rolling_avg_7d    0.021284
18          total_trips_lag_1h    0.015233
17             arrivals_lag_1h    0.013594
16         departures_lag_168h    0.012571
13   days_since_semester_start    0.012313
15          departures_lag_24h    0.012302
14           departures_lag_1h    0.009992
1                     hour_cos    0.006104
0                     hour_sin    0.005928
2                      day_sin    0.005005
26          station_id_encoded    0.004898


In [83]:
print("✓ Removed old model comparison (using departure-specific version below)")

✓ Removed old model comparison (using departure-specific version below)


In [84]:
print("Training XGBoost for departure prediction (with IMPROVED hyperparameters)...")
xgb_model = xgb.XGBRegressor(
	n_estimators=500,
	learning_rate=0.05,  # Increased from 0.03
	max_depth=7,  # Increased from 5
	subsample=0.8,
	colsample_bytree=0.8,
	reg_alpha=0.1,
	reg_lambda=1.0,
	min_child_weight=3,
	random_state=42,
	n_jobs=-1,
	verbosity=1,
	early_stopping_rounds=50
)

xgb_model.fit(
	X_train, y_train,
	eval_set=[(X_val, y_val)],
	verbose=50
)

y_pred_xgb = xgb_model.predict(X_test)
xgb_results = evaluate_model(y_test, y_pred_xgb, "XGBoost - Departures")
results.append(xgb_results)

print(f"\nBest iteration: {xgb_model.best_iteration}")
print(f"Best validation RMSE: {xgb_model.best_score:.3f}")
print(f"Hyperparameters: max_depth={xgb_model.max_depth}, learning_rate={xgb_model.learning_rate}")

Training XGBoost for departure prediction (with IMPROVED hyperparameters)...
[0]	validation_0-rmse:5.99187
[50]	validation_0-rmse:2.60744
[100]	validation_0-rmse:2.67095
[106]	validation_0-rmse:2.67799

XGBoost - Departures Results
MAE:   1.630 departures
RMSE:  2.363 departures
R²:    0.722
MAPE:  55.30%
Zero-Inflation Accuracy: 75.69%
Directional Accuracy:    82.31%

Best iteration: 56
Best validation RMSE: 2.600
Hyperparameters: max_depth=7, learning_rate=0.05
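
The log above is consistent with patience-based early stopping: the best iteration is 56 and the last logged round is 106, which is 50 rounds later and matches `early_stopping_rounds=50`. The rule can be sketched without any library. The function name and the validation curve below are illustrative, and this is not XGBoost's internal code.

```python
def early_stopping_round(val_scores, patience):
    """Return (best_index, stop_index) for a lower-is-better metric.

    Training stops at the first round that is `patience` rounds past the
    best score seen so far without having improved on it.
    """
    best_index = 0
    for i, score in enumerate(val_scores):
        if score < val_scores[best_index]:
            best_index = i
        elif i - best_index >= patience:
            return best_index, i
    # No stop triggered: training ran through every round.
    return best_index, len(val_scores) - 1

# Made-up validation curve: improves until round 3, then drifts upward.
scores = [5.0, 4.0, 3.0, 2.6, 2.7, 2.65, 2.8, 2.9]
best, stopped = early_stopping_round(scores, patience=3)

print(f"best round: {best}, stopped at round: {stopped}")
```

The gap between the stopping round and the best round equals the patience, mirroring the 56 and 106 seen in the training log.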


In [85]:
# Feature importance
feature_importance_xgb = pd.DataFrame({
	'feature': feature_cols,
	'importance': xgb_model.feature_importances_
}).sort_values('importance', ascending=False)

print("\nTop 15 Most Important Features (XGBoost - Departures):")
print(feature_importance_xgb.head(15))

fig = px.bar(
	feature_importance_xgb.head(15),
	x='importance',
	y='feature',
	orientation='h',
	title='XGBoost: Top 15 Features for Departure Prediction',
	labels={'importance': 'Importance Score', 'feature': 'Feature'},
	template='plotly_white'
)
fig.show()


Top 15 Most Important Features (XGBoost - Departures):
                       feature  importance
21    system_departures_lag_1h    0.409157
22   system_total_trips_lag_1h    0.213847
23   historical_avg_departures    0.110995
15          departures_lag_24h    0.059573
19  departures_rolling_avg_24h    0.029858
18          total_trips_lag_1h    0.016019
17             arrivals_lag_1h    0.015253
16         departures_lag_168h    0.014735
26          station_id_encoded    0.013917
6                   is_weekend    0.013288
1                     hour_cos    0.011978
14           departures_lag_1h    0.011620
20   departures_rolling_avg_7d    0.011176
0                     hour_sin    0.010674
25    hour_weekend_interaction    0.010161


In [86]:
print("✓ Removed old time series visualization (using departure-specific version below)")

✓ Removed old time series visualization (using departure-specific version below)


In [87]:
results_df = pd.DataFrame(results)

print("\n" + "="*80)
print("MODEL COMPARISON - Departure Prediction (Test Set)")
print("="*80)
print(results_df.to_string(index=False))
print("="*80)

# Rank models by test-set MAE (lower is better)
best_model_idx = results_df['mae'].idxmin()
best_model_name = results_df.loc[best_model_idx, 'model']
print(f"\n🏆 Best Model: {best_model_name}")
print(f"   MAE: {results_df.loc[best_model_idx, 'mae']:.3f} departures")
print(f"   R²: {results_df.loc[best_model_idx, 'r2']:.3f}")
print(f"   RMSE: {results_df.loc[best_model_idx, 'rmse']:.3f}")

print(f"\n✅ Success Criteria (Goal: R² > 0.50):")
print(f"   R² > 0.50: {'✓' if results_df.loc[best_model_idx, 'r2'] > 0.50 else '✗'}")
print("="*80)


MODEL COMPARISON - Departure Prediction (Test Set)
                         model      mae     rmse        r2      mape  zero_inflation_accuracy  directional_accuracy
 Baseline (Historical Average) 3.261689 5.028639 -0.259000 75.042022                75.670007             59.356793
Linear Regression - Departures 1.773289 2.467247  0.696925 62.615233                77.441156             79.794920
    Random Forest - Departures 1.743054 2.463650  0.697808 61.718769                75.670007             80.447448
          XGBoost - Departures 1.630292 2.362954  0.722006 55.298830                75.693312             82.311815

🏆 Best Model: XGBoost - Departures
   MAE: 1.630 departures
   R²: 0.722
   RMSE: 2.363

✅ Success Criteria (Goal: R² > 0.50):
   R² > 0.50: ✓
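
The baseline's negative R² in the table above is not a calculation error. Since R² = 1 - SS_res / SS_tot, it drops below zero whenever a model's squared error exceeds that of predicting the test-set mean for every hour, for example when predictions are systematically offset. A short worked example with made-up numbers (the helper `r_squared` is written out here only for illustration):

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up hourly departures, not taken from the dataset
actual = [0, 1, 2, 3, 4]

mean_only = [2, 2, 2, 2, 2]  # always predict the mean of `actual`
biased = [3, 4, 5, 6, 7]     # right shape, but 3 trips too high every hour
close = [0, 1, 2, 3, 5]      # one hour off by a single trip

r2_mean_only = r_squared(actual, mean_only)
r2_biased = r_squared(actual, biased)
r2_close = r_squared(actual, close)

print(f"mean-only: {r2_mean_only:.2f}")  # 0.00
print(f"biased:    {r2_biased:.2f}")     # -3.50
print(f"close:     {r2_close:.2f}")      # 0.90
```

The `biased` predictions track the shape of the data perfectly yet score far below zero, which is how a historical-average baseline can end up with a negative R² if the test period differs in level from the period the averages were computed on.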


In [88]:
# Visualize comparison
fig = make_subplots(
	rows=1, cols=3,
	subplot_titles=('MAE (lower is better)', 'R² (higher is better)', 'RMSE (lower is better)')
)

fig.add_trace(go.Bar(x=results_df['model'], y=results_df['mae'], name='MAE'), row=1, col=1)
fig.add_trace(go.Bar(x=results_df['model'], y=results_df['r2'], name='R²'), row=1, col=2)
fig.add_trace(go.Bar(x=results_df['model'], y=results_df['rmse'], name='RMSE'), row=1, col=3)

fig.update_layout(
	height=400,
	showlegend=False,
	title_text='Departure Prediction: Model Performance Comparison',
	template='plotly_white'
)

fig.show()

---

## 7. Results Summary

In [89]:
print("="*80)
print("DEPARTURE PREDICTION MODEL - SUMMARY")
print("="*80)

print(f"\n📊 DATASET:")
print(f"   Total trips analyzed: 529,908")
print(f"   Stations: 7 Columbia area stations")
print(f"   Date range: Jan 2024 - Oct 2025")
print(f"   Train/Val/Test split: {len(train_data):,} / {len(val_data):,} / {len(test_data):,} hours")

print(f"\n🎯 TARGET VARIABLE:")
print(f"   Predicting: Hourly departures (demand for bikes at each station)")
print(f"   Average departures per hour: {y_test.mean():.2f}")
print(f"   Max departures per hour: {y_test.max()}")

print(f"\n🔧 FEATURES USED ({len(feature_cols)}):")
print(f"   • Cyclical temporal encoding (hour, day, month)")
print(f"   • Academic calendar (semester, finals, holidays, breaks)")
print(f"   • National holidays")
print(f"   • Departure lag features (1h, 24h, 168h)")
print(f"   • Rolling averages (24h, 7-day)")
print(f"   • System-wide features (all Columbia stations)")
print(f"   • Historical averages by station-hour-daytype")
print(f"   • Interaction features (semester×weekday, rush×lag)")

print(f"\n🏆 BEST MODEL: {best_model_name}")
print(f"   MAE:  {results_df.loc[best_model_idx, 'mae']:.3f} departures")
print(f"   RMSE: {results_df.loc[best_model_idx, 'rmse']:.3f} departures")
print(f"   R²:   {results_df.loc[best_model_idx, 'r2']:.3f}")

print(f"\n💡 KEY INSIGHTS:")
print(f"   • Model predicts demand with ~{results_df.loc[best_model_idx, 'mae']:.1f} trip error per hour")
print(f"   • Explains {results_df.loc[best_model_idx, 'r2']*100:.1f}% of variance in hourly departures")
print(f"   • Most important features: system-wide 1h lags, historical averages, 24h departure lag")
print(f"   • Weather features did NOT improve performance (removed)")

print(f"\n📈 PRACTICAL USE CASES:")
print(f"   1. Predict when stations will run out of bikes (high demand)")
print(f"   2. Optimize rebalancing truck schedules")
print(f"   3. Alert users about expected bike availability")
print(f"   4. Plan capacity expansion based on demand patterns")

print("\n" + "="*80)

DEPARTURE PREDICTION MODEL - SUMMARY

📊 DATASET:
   Total trips analyzed: 529,908
   Stations: 7 Columbia area stations
   Date range: Jan 2024 - Oct 2025
   Train/Val/Test split: 80,521 / 5,040 / 4,291 hours

🎯 TARGET VARIABLE:
   Predicting: Hourly departures (demand for bikes at each station)
   Average departures per hour: 4.01
   Max departures per hour: 29

🔧 FEATURES USED (27):
   • Cyclical temporal encoding (hour, day, month)
   • Academic calendar (semester, finals, holidays, breaks)
   • National holidays
   • Departure lag features (1h, 24h, 168h)
   • Rolling averages (24h, 7-day)
   • System-wide features (all Columbia stations)
   • Historical averages by station-hour-daytype
   • Interaction features (semester×weekday, rush×lag)

🏆 BEST MODEL: XGBoost - Departures
   MAE:  1.630 departures
   RMSE: 2.363 departures
   R²:   0.722

💡 KEY INSIGHTS:
   • Model predicts demand with ~1.6 trip error per hour
   • Explains 72.2% of variance in hourly departures
   • Most impor

---

## 8. Prediction Visualizations and Per-Station Performance

In [90]:
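# Attach the best model's test-set predictions.
# Any model name other than XGBoost or Random Forest falls back to linear regression.
test_data = test_data.copy()  # work on a copy so the assignments below never write into a slice
if 'XGBoost' in best_model_name:
	test_data['predicted_departures'] = y_pred_xgb
elif 'Random' in best_model_name:
	test_data['predicted_departures'] = y_pred_rf
else:
	test_data['predicted_departures'] = y_pred_lr

# Store the ground truth next to the predictions for plotting and per-station metrics
test_data['actual_departures'] = y_test.values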

# Sample station visualization
station_names = {
	'7783.18': 'Broadway & W 122 St',
	'7741.04': 'Morningside Dr & Amsterdam Ave',
	'7745.07': 'W 120 St & Claremont Ave',
	'7727.07': 'Amsterdam Ave & W 119 St',
	'7713.11': 'W 116 St & Broadway',
	'7692.11': 'W 116 St & Amsterdam Ave',
	'7713.01': 'W 113 St & Broadway'
}

sample_station = '7713.01'
sample_data = test_data[test_data['station_id'] == sample_station].copy()

fig = go.Figure()
fig.add_trace(go.Scatter(
	x=sample_data['hour'], y=sample_data['actual_departures'],
	mode='lines', name='Actual Departures', line=dict(color='blue', width=1)
))
fig.add_trace(go.Scatter(
	x=sample_data['hour'], y=sample_data['predicted_departures'],
	mode='lines', name='Predicted Departures', line=dict(color='red', width=1, dash='dash')
))

fig.update_layout(
	title=f'Actual vs Predicted Departures: {station_names[sample_station]}',
	xaxis_title='Date/Time', yaxis_title='Departures (Demand)',
	template='plotly_white', height=500
)

fig.show()

In [91]:
# Calculate performance per station
per_station_results = []

for station_id in columbia_stations:
	station_test = test_data[test_data['station_id'] == station_id]
	if len(station_test) > 0:
		mae = mean_absolute_error(station_test['actual_departures'], station_test['predicted_departures'])
		r2 = r2_score(station_test['actual_departures'], station_test['predicted_departures'])
		
		per_station_results.append({
			'station_id': station_id,
			'station_name': station_names[station_id],
			'mae': mae,
			'r2': r2,
			'n_samples': len(station_test)
		})

per_station_df = pd.DataFrame(per_station_results).sort_values('mae')

print("\n" + "="*80)
print("PER-STATION PERFORMANCE")
print("="*80)
print(per_station_df.to_string(index=False))
print("="*80)

# Visualize per-station MAE
fig = px.bar(
	per_station_df.sort_values('mae', ascending=False),
	x='station_name',
	y='mae',
	title=f'Per-Station MAE - {best_model_name}',
	labels={'mae': 'MAE (departures)', 'station_name': 'Station'},
	template='plotly_white'
)
fig.update_xaxes(tickangle=45)
fig.show()

# Visualize per-station R²
fig = px.bar(
	per_station_df.sort_values('r2', ascending=True),
	x='station_name',
	y='r2',
	title=f'Per-Station R² - {best_model_name}',
	labels={'r2': 'R² Score', 'station_name': 'Station'},
	template='plotly_white'
)
fig.update_xaxes(tickangle=45)
fig.show()


PER-STATION PERFORMANCE
station_id                   station_name      mae       r2  n_samples
   7741.04 Morningside Dr & Amsterdam Ave 1.201468 0.437118        613
   7783.18            Broadway & W 122 St 1.312310 0.504115        613
   7727.07       Amsterdam Ave & W 119 St 1.501455 0.696429        613
   7745.07       W 120 St & Claremont Ave 1.541475 0.685461        613
   7713.11            W 116 St & Broadway 1.830077 0.727901        613
   7713.01            W 113 St & Broadway 2.000722 0.657530        613
   7692.11       W 116 St & Amsterdam Ave 2.024535 0.760452        613
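
MAE and R² rank the stations in nearly opposite orders in the table above: Morningside Dr & Amsterdam Ave has the lowest MAE but also the lowest R², while W 116 St & Amsterdam Ave has the highest of both. One plausible reading is that MAE scales with station volume, so a quiet station can look good in absolute terms even when the model explains little of its variation. The sketch below normalizes MAE by mean demand to put stations on a comparable footing. The column names mirror `test_data`, but the rows are made up and `normalized_mae` is a hypothetical column.

```python
import pandas as pd

# Made-up rows shaped like `test_data`; values are illustrative only
toy = pd.DataFrame({
    "station_id": ["quiet"] * 4 + ["busy"] * 4,
    "actual_departures": [0, 1, 2, 1, 8, 12, 10, 10],
    "predicted_departures": [1, 1, 1, 1, 6, 13, 12, 9],
})
toy["abs_error"] = (toy["actual_departures"] - toy["predicted_departures"]).abs()

per_station = toy.groupby("station_id").agg(
    mae=("abs_error", "mean"),
    mean_departures=("actual_departures", "mean"),
)
# MAE as a fraction of typical hourly demand at that station
per_station["normalized_mae"] = per_station["mae"] / per_station["mean_departures"]

print(per_station.sort_values("normalized_mae"))
```

In the toy data the busy station has three times the MAE of the quiet one but a much smaller error relative to its demand, the same kind of reversal the per-station table suggests.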


In [92]:
# Scatter plot of actual vs predicted departures
fig = px.scatter(
	x=test_data['actual_departures'],
	y=test_data['predicted_departures'],
	opacity=0.3,
	labels={'x': 'Actual Departures', 'y': 'Predicted Departures'},
	title=f'Actual vs Predicted Departures - {best_model_name}',
	template='plotly_white'
)

# Add perfect prediction line
max_val = max(test_data['actual_departures'].max(), test_data['predicted_departures'].max())
fig.add_trace(go.Scatter(
	x=[0, max_val],
	y=[0, max_val],
	mode='lines',
	name='Perfect Prediction',
	line=dict(color='red', dash='dash')
))

fig.update_layout(height=500)
fig.show()

print(f"\nScatter plot shows how well predictions match actual departures")
print(f"Points closer to the red line = better predictions")


Scatter plot shows how well predictions match actual departures
Points closer to the red line = better predictions
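
Because departures are integer counts and most hours sit near the origin, heavy overplotting in a scatter plot can hide systematic bias at high demand. A complementary check is the mean residual within bins of actual demand. The sketch below uses made-up values; the bin edges, the labels, and the `residual` and `demand_bin` columns are assumptions for illustration.

```python
import pandas as pd

# Made-up values shaped like the two columns added to `test_data`
toy = pd.DataFrame({
    "actual_departures":    [0, 0, 1, 2, 3, 5, 6, 9, 12, 15],
    "predicted_departures": [1, 0, 1, 3, 3, 4, 6, 7, 9, 11],
})
toy["residual"] = toy["predicted_departures"] - toy["actual_departures"]

# Bin hours by actual demand; intervals are right-inclusive, edges are arbitrary
bins = [-1, 0, 4, 8, 100]
labels = ["0", "1-4", "5-8", "9+"]
toy["demand_bin"] = pd.cut(toy["actual_departures"], bins=bins, labels=labels)

bias_by_bin = (
    toy.groupby("demand_bin", observed=True)["residual"]
    .agg(["mean", "count"])
)
print(bias_by_bin)
```

A clearly negative mean residual in the highest bins would indicate under-prediction at peak demand, which matters most for the stock-out use case.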


print("✓ End of notebook")