# DA5401 A8: Ensemble Learning on Bike Sharing
## Introduction

This notebook builds and compares ensemble regressors to forecast hourly bike rentals (cnt) using the UCI Bike Sharing dataset. We load hour.csv (with day.csv for context), prepare features (drop instant, dteday, casual, registered; encode time and weather categories), and evaluate models using RMSE on a held-out test set.

Goals
- Establish simple baselines: Decision Tree (max_depth=6) and Linear Regression.
- Reduce variance with Bagging (trees, n_estimators ≥ 50).
- Reduce bias with Gradient Boosting.
- Combine diverse learners via Stacking (KNN, Bagging, Gradient Boosting) with a Ridge meta-learner.

Deliverables
- Clean, reproducible code and brief plots.
- A results table comparing RMSE across all models.
- A short conclusion explaining which model performs best and why, referencing bias–variance and model diversity.



In [None]:
import numpy as np
import pandas as pd

from sklearn.compose import ColumnTransformer
from sklearn.ensemble import BaggingRegressor, GradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import root_mean_squared_error
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor

We begin by loading the data and dropping irrelevant or leaky columns.

In [8]:
# Data Loading and basic feature setup

# 1) Load data
df = pd.read_csv("hour.csv")

# 2) Target
y = df["cnt"].copy()

# 3) Drop irrelevant/leaky columns
drop_cols = ["instant", "dteday", "casual", "registered"]
X = df.drop(columns=drop_cols + ["cnt"])

# 4) Define feature groups for preprocessing
categorical_cols = ["season", "yr", "mnth", "hr", "holiday", "weekday", "workingday", "weathersit"]
numeric_cols = ["temp", "atemp", "hum", "windspeed"]

# Quick sanity checks
print("Shape:", df.shape)
print("Features kept:", X.columns.tolist())
print("Categorical:", categorical_cols)
print("Numeric:", numeric_cols)


Shape: (17379, 17)
Features kept: ['season', 'yr', 'mnth', 'hr', 'holiday', 'weekday', 'workingday', 'weathersit', 'temp', 'atemp', 'hum', 'windspeed']
Categorical: ['season', 'yr', 'mnth', 'hr', 'holiday', 'weekday', 'workingday', 'weathersit']
Numeric: ['temp', 'atemp', 'hum', 'windspeed']


Now, let's construct a clean, reusable preprocessing step that one-hot encodes the categorical columns and leaves the numeric columns untouched.

In [None]:


# Categorical and numeric columns defined earlier
categorical_cols = ["season", "yr", "mnth", "hr", "holiday", "weekday", "workingday", "weathersit"]
numeric_cols = ["temp", "atemp", "hum", "windspeed"]

# One-Hot Encoder with safe handling of unseen categories
ohe = OneHotEncoder(handle_unknown="ignore", sparse_output=False)

# Column-wise transformer: OHE for categoricals, passthrough for numerics
preprocess = ColumnTransformer(
    transformers=[
        ("cat", ohe, categorical_cols),
        ("num", "passthrough", numeric_cols),
    ],
    remainder="drop",
)


We then use a chronological split to respect time order and avoid leakage from the future into the past. This is the correct choice for hourly demand forecasting.

In [10]:
# Chronological 80/20 split
n = len(X)
cut = int(0.8 * n)

X_train, X_test = X.iloc[:cut], X.iloc[cut:]
y_train, y_test = y.iloc[:cut], y.iloc[cut:]
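
The same "train on the past, evaluate on the future" rule can be extended from one cut to several validation folds. The sketch below uses scikit-learn's `TimeSeriesSplit` on a stand-in index of 100 rows (a hypothetical size, not the real dataset) and checks that training rows always precede validation rows. It is an illustration of the idea, not a step this notebook performs.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Stand-in for time-ordered rows (hypothetical: 100 hourly observations).
n_rows = 100
row_index = np.arange(n_rows).reshape(-1, 1)

# Single chronological cut, same rule as the 80/20 split used in this notebook.
cut = int(0.8 * n_rows)
train_idx, test_idx = np.arange(cut), np.arange(cut, n_rows)
assert train_idx.max() < test_idx.min()  # no future rows leak into training

# Expanding-window folds: each fold trains on the past and validates on the next block.
tscv = TimeSeriesSplit(n_splits=4)
folds = list(tscv.split(row_index))
for fold, (tr, va) in enumerate(folds):
    print(f"fold {fold}: train rows 0..{tr.max()}, validation rows {va.min()}..{va.max()}")
```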


Next, we fit two baseline regressors and keep the better one as the baseline.

In [18]:
# 1) Encode train/test explicitly (no pipelines)
X_train_enc = preprocess.fit_transform(X_train)   # preprocess is the ColumnTransformer defined earlier
X_test_enc  = preprocess.transform(X_test)

# 2) Decision Tree (depth=6) with light regularization
from sklearn.tree import DecisionTreeRegressor
# root_mean_squared_error (scikit-learn >= 1.4) replaces the deprecated
# mean_squared_error(..., squared=False)
from sklearn.metrics import root_mean_squared_error

dt = DecisionTreeRegressor(
    max_depth=6,
    min_samples_leaf=10,    # small regularization for stability
    random_state=42
)
dt.fit(X_train_enc, y_train)
y_pred_dt = dt.predict(X_test_enc)
rmse_dt = root_mean_squared_error(y_test, y_pred_dt)
print(f"Decision Tree (depth=6, min_leaf=10) RMSE: {rmse_dt:.3f}")

# 3) Linear Regression (no normalization; keep intercept)
from sklearn.linear_model import LinearRegression

lr = LinearRegression(fit_intercept=True)
lr.fit(X_train_enc, y_train)
y_pred_lr = lr.predict(X_test_enc)
rmse_lr = root_mean_squared_error(y_test, y_pred_lr)
print(f"Linear Regression RMSE: {rmse_lr:.3f}")

# 4) Choose baseline
baseline_name = "Decision Tree (depth=6)" if rmse_dt <= rmse_lr else "Linear Regression"
baseline_rmse = min(rmse_dt, rmse_lr)
print(f"Baseline: {baseline_name} | RMSE = {baseline_rmse:.3f}")




Decision Tree (depth=6, min_leaf=10) RMSE: 158.723
Linear Regression RMSE: 133.347
Baseline: Linear Regression | RMSE = 133.347




Linear Regression (RMSE 133.347) beats the Decision Tree with depth=6 and min_samples_leaf=10 (RMSE 158.723), so the linear model generalizes better on the held-out chronological test set. A depth-6 tree has at most 64 leaves, so it most likely underfits the interactions between hour, day type, and weather; its splits may also be tuned to conditions in the training period that do not carry over to the later test period. The linear model, by contrast, benefits from one-hot encoded time and weather signals whose effects are fairly additive.

Below is a bagging implementation that uses the Decision Tree (depth=6) as the base estimator, with 100 estimators and standard bagging settings, followed by RMSE evaluation and a short discussion.

In [None]:


# Base learner: match the baseline tree spec
base_tree = DecisionTreeRegressor(
    max_depth=6,
    min_samples_leaf=10,
    random_state=42
)

# Bagging: variance reduction via bootstrap aggregation
bag = BaggingRegressor(
    estimator=base_tree,
    n_estimators=100,       # >= 50 as required
    max_samples=1.0,        # each bootstrap sample has as many rows as the training set
    max_features=1.0,       # use all features per base estimator
    bootstrap=True,         # sample rows with replacement
    bootstrap_features=False,
    n_jobs=-1,
    random_state=42
)

# Fit on encoded training data, evaluate on encoded test data
bag.fit(X_train_enc, y_train)
y_pred_bag = bag.predict(X_test_enc)
rmse_bag = root_mean_squared_error(y_test, y_pred_bag)
print(f"Bagging (100x DT depth=6, min_leaf=10) RMSE: {rmse_bag:.3f}")


Bagging (100x DT depth=6, min_leaf=10) RMSE: 155.736


Bagging improved over the single Decision Tree (158.723 → 155.736 RMSE), but the gain is modest, indicating some variance reduction from averaging bootstrap samples of the same shallow base learner.

The limited improvement is consistent with bias constraints: a depth-6 tree with min_samples_leaf=10 is already strongly regularized, so bagging can only reduce variance on top of a relatively high-bias model.

In practice, bagging tends to help more when base trees are higher-variance (e.g., deeper or less constrained), whereas with shallow trees the ensemble mainly smooths noise without addressing structural underfitting.

We now proceed with gradient boosting:


In [22]:


gbr = GradientBoostingRegressor(
    n_estimators=300,     # enough stages to reduce bias
    learning_rate=0.05,  # smaller rate + more estimators = smoother fit
    max_depth=3,         # shallow trees as weak learners (common default)
    subsample=1.0,       # pure boosting (set <1.0 for stochastic GB)
    random_state=42
)

gbr.fit(X_train_enc, y_train)
y_pred_gbr = gbr.predict(X_test_enc)
rmse_gbr = root_mean_squared_error(y_test, y_pred_gbr)
print(f"Gradient Boosting RMSE: {rmse_gbr:.3f}")


Gradient Boosting RMSE: 110.447


Gradient Boosting clearly outperforms every earlier model on the chronological test split: RMSE 110.447 versus 133.347 for Linear Regression (the baseline) and 155.736 for Bagging.

This aligns with the hypothesis that boosting primarily reduces bias: by sequentially fitting shallow trees to residuals, the model captures nonlinear interactions between time and weather effects that linear regression misses and that bagging with shallow trees cannot correct.

The gap from 133.347 to 110.447 is sizable, suggesting meaningful structure beyond additive linear terms; the improvement over bagging also shows that averaging similar shallow trees wasn’t enough, whereas boosting’s stage-wise corrections addressed systematic underfitting.
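
Stage-wise correction can be observed directly with `staged_predict`, which yields the ensemble's prediction after each boosting stage. The sketch below is illustrative only: it runs on synthetic `make_friedman1` data with a random split, and the `rmse` helper, the 100 stages, and the printed checkpoints are assumptions chosen for the example.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split


def rmse(y_true, y_pred):
    """Root mean squared error computed with NumPy (helper for this sketch)."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))


# Synthetic regression problem (assumption: stands in for real data).
X_syn, y_syn = make_friedman1(n_samples=600, noise=1.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.25, random_state=0)

gbr_demo = GradientBoostingRegressor(
    n_estimators=100, learning_rate=0.05, max_depth=3, random_state=0
)
gbr_demo.fit(X_tr, y_tr)

# One RMSE value per boosting stage, on training and on held-out rows.
train_curve = [rmse(y_tr, pred) for pred in gbr_demo.staged_predict(X_tr)]
test_curve = [rmse(y_te, pred) for pred in gbr_demo.staged_predict(X_te)]

for stage in (1, 10, 50, 100):
    print(
        f"stage {stage:3d}: train RMSE {train_curve[stage - 1]:.3f} | "
        f"test RMSE {test_curve[stage - 1]:.3f}"
    )
```

Plotting `test_curve` against the stage number is a common way to decide whether more stages are still reducing error or have started to overfit.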

# Stacking
Stacking (stacked generalization) is a two-layer ensemble where a meta-learner is trained to combine the predictions of several diverse base learners into a single, stronger prediction.

How it works
- Level-0 (base learners): Train multiple different models on the same training data (e.g., KNN, bagging trees, gradient boosting). Each model produces its own prediction for every sample. The key is diversity—models should make different kinds of errors so their predictions carry complementary information.
- Out-of-fold predictions: To train the meta-learner without leaking target information, generate base-learner predictions on held-out folds of the training set (out-of-fold). Concatenate these predictions into a new feature matrix where each column is a base model’s prediction; the target remains the original y. This ensures the meta-learner sees predictions for each training sample that were made without training on that sample.
- Level-1 (meta-learner): Fit a simple, regularized model (e.g., Ridge regression) on the out-of-fold prediction matrix to learn how to weight and combine base predictions. At test time, get base-model predictions on the test set, stack them into the same column order, and feed them to the meta-learner for the final prediction.
- Why it helps: The meta-learner learns how much to trust each base model from their out-of-fold predictions and gives more weight to models that generalize better. With a linear meta-learner such as Ridge, this is one weight per model rather than a weight that varies across the feature space. Blending nonlinear learners can reduce bias, and averaging models with different errors can reduce variance. Unlike bagging (averaging similar models) or boosting (sequentially correcting residuals), stacking learns an explicit mapping from base predictions to the target.



We'll use the following as Level-0 (base) learners:

- K-Nearest Neighbors Regressor (KNeighborsRegressor)
- Bagging Regressor (DecisionTreeRegressor base, from earlier)
- Gradient Boosting Regressor (from earlier)

Our Level-1 (meta-learner) will be Ridge Regression, which can learn optimal weights for each base learner’s predictions and helps avoid overfitting through regularization.

Implementing StackingRegressor in scikit-learn

StackingRegressor handles these steps automatically. During fit, it trains the meta-learner (Ridge) on cross-validated, out-of-fold predictions of the base estimators and refits each base estimator on the full training data. At prediction time, it combines the base predictions according to the learned coefficients.

In [None]:


# Define base learners (Level-0)
knn = KNeighborsRegressor(n_neighbors=10)  
bag = BaggingRegressor(
    estimator=DecisionTreeRegressor(max_depth=6, min_samples_leaf=10, random_state=42),
    n_estimators=100,
    bootstrap=True,
    n_jobs=-1,
    random_state=42
)
gbr = GradientBoostingRegressor(
    n_estimators=300,
    learning_rate=0.05,
    max_depth=3,
    subsample=1.0,
    random_state=42
)

# Meta-learner (Level-1)
ridge = Ridge(alpha=1.0)

# Combine all in a StackingRegressor
stack = StackingRegressor(
    estimators=[
        ('knn', knn),
        ('bag', bag),
        ('gbr', gbr)
    ],
    final_estimator=ridge,
    cv=5,             # 5-fold cross validation for out-of-fold base predictions
    n_jobs=-1
)

# Fit on training data and predict/test
stack.fit(X_train_enc, y_train)
y_pred_stack = stack.predict(X_test_enc)

from sklearn.metrics import root_mean_squared_error
rmse_stack = root_mean_squared_error(y_test, y_pred_stack)
print(f"Stacking Regressor RMSE: {rmse_stack:.3f}")


Stacking Regressor RMSE: 103.803


# Stacking Regressor Performance and Interpretation

**Stacking Regressor RMSE:** 103.803

The stacking ensemble achieved the lowest RMSE of all models tested (linear regression, decision tree, bagging, and gradient boosting). This suggests that stacking combined the strengths of its base learners (KNN, Bagging, and Gradient Boosting), with the Ridge meta-learner finding an effective weighting of their outputs. A lower RMSE means the predictions are, on average, closer to the true hourly rental counts.

**Why does stacking help?**
- **Model diversity:** Each base model learns different aspects and patterns in the data; KNN captures locality, bagging reduces variance, and boosting reduces bias.
- **Bias-variance tradeoff:** By combining diverse models, stacking can reduce both bias and variance more effectively than bagging or boosting alone.
- **Optimal blending:** The Ridge meta-learner learns one weight per base model from held-out (out-of-fold) predictions, so models that generalize better contribute more to the final prediction.
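
How the meta-learner blends its inputs can be read off a fitted `StackingRegressor` through `final_estimator_.coef_`. The sketch below is a hedged illustration on synthetic `make_friedman1` data with smaller models than the notebook uses; `stack_demo` and `meta_weights` are names invented for the example, and the resulting weights are not those of the bike-sharing model.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem (assumption: stands in for real data).
X_syn, y_syn = make_friedman1(n_samples=400, noise=1.0, random_state=0)

stack_demo = StackingRegressor(
    estimators=[
        ("knn", KNeighborsRegressor(n_neighbors=10)),
        ("tree", DecisionTreeRegressor(max_depth=6, random_state=0)),
        ("gbr", GradientBoostingRegressor(n_estimators=50, random_state=0)),
    ],
    final_estimator=Ridge(alpha=1.0),
    cv=5,
)
stack_demo.fit(X_syn, y_syn)

# One Ridge coefficient per base model, in the order the estimators were given.
names = [name for name, _ in stack_demo.estimators]
meta_weights = dict(zip(names, stack_demo.final_estimator_.coef_))

for name, weight in meta_weights.items():
    print(f"{name}: weight {weight:+.3f}")
print(f"intercept: {stack_demo.final_estimator_.intercept_:+.3f}")
```

The coefficients are ordinary regression weights: they are not constrained to be non-negative or to sum to one, so they are best read as relative emphasis rather than as mixing proportions.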



## RMSE Comparison Table

Below is a table summarizing the Root Mean Squared Error (RMSE) on the test set for each ensemble and baseline model:

| Model                        | Test RMSE  |
|------------------------------|------------|
| **Linear Regression (Baseline)**  | 133.347    |
| **Decision Tree Regressor**       | 158.723    |
| **Bagging Regressor**             | 155.736    |
| **Gradient Boosting Regressor**   | 110.447    |
| **Stacking Regressor**            | 103.803    |

**Interpretation:**  
- The stacking ensemble achieved the lowest RMSE, indicating the most accurate predictions on the test data.
- Gradient boosting and stacking both substantially outperformed the single-model baselines, confirming the power of bias reduction and diverse model combination.
- Bagging (with shallow trees) improved only slightly on the single tree, which is expected: averaging reduces variance, but the dominant error of a depth-limited tree is bias.

> *RMSE values are rounded to three decimal places for clarity. Lower RMSE indicates better predictive accuracy for this regression task.*
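
A table like this can be assembled programmatically. The sketch below hard-codes the test RMSE values reported in this notebook so that it runs on its own; in the notebook itself the dictionary would presumably be filled from `rmse_lr`, `rmse_dt`, `rmse_bag`, `rmse_gbr`, and `rmse_stack`. The "Change vs. baseline" column is an addition for illustration.

```python
import pandas as pd

# Test RMSE values as reported in this notebook's outputs.
rmse_by_model = {
    "Linear Regression (Baseline)": 133.347,
    "Decision Tree Regressor": 158.723,
    "Bagging Regressor": 155.736,
    "Gradient Boosting Regressor": 110.447,
    "Stacking Regressor": 103.803,
}

# Sort from best (lowest RMSE) to worst.
results = pd.Series(rmse_by_model, name="Test RMSE").sort_values().to_frame()

# Relative change against the chosen baseline; negative means lower error.
baseline = rmse_by_model["Linear Regression (Baseline)"]
results["Change vs. baseline (%)"] = (
    100 * (results["Test RMSE"] - baseline) / baseline
).round(1)

print(results)
```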



## Conclusion: Best Model & Why Stacking Wins

**Best-performing model:** The Stacking Regressor achieved the lowest test RMSE (103.803), outperforming the baseline Linear Regression (133.347), Decision Tree (158.723), Bagging (155.736), and Gradient Boosting (110.447).

**Why Stacking Regressor outperformed single models:**

- **Model Diversity:** Stacking combines very different types of base learners: K-Nearest Neighbors (captures local patterns), Bagging of Decision Trees (reduces variance), and Gradient Boosting (reduces bias). Since each of these models has unique strengths and weaknesses, their individual prediction errors are less correlated. This diversity allows the meta-learner (Ridge Regression) to learn how to weight and blend the base models' predictions to minimize errors in a way that any one model alone cannot.

- **Bias-Variance Trade-off:** Linear regression is low-variance but high-bias and cannot capture nonlinearities. A single decision tree can capture some complexity but may have high variance, especially if deep. Bagging shrinks variance by averaging many trees but does not reduce their bias. Boosting reduces bias through sequential error correction. Stacking can go beyond either: the meta-learner learns a weighted combination that can lower both bias and variance, provided the base learners are sufficiently different.

- **Optimal Blending:** The Ridge meta-learner is trained on out-of-fold predictions from each base model, which limits overfitting to the base models' training-set fit and lets the weights reflect how well each model generalizes. This data-driven combination gave more accurate predictions here than any single approach.
