# DA5401 A8: Ensemble Learning for Complex Regression Modeling on Bike Share Data

## Objective

Apply and compare **single-model** and **ensemble learning** techniques for predicting hourly bike rentals (`cnt`) on the **UCI Bike Sharing Dataset**, and show how these methods address model variance and bias and how a diverse stack of models can outperform any single model.


## Part A: Data Preprocessing and Baseline 

### A.1 Data Loading and Feature Engineering

In [12]:
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import BaggingRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.ensemble import StackingRegressor

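import os

# Cap the CPU count seen by joblib/loky (used when n_jobs=-1)
os.environ["LOKY_MAX_CPU_COUNT"] = "4"

# Single seed reused by the stochastic estimators below
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)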

df = pd.read_csv("hour.csv")
print("Shape:", df.shape)
df.head()


Shape: (17379, 17)


,instant,dteday,season,yr,mnth,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
0,1,2011-01-01,1,0,1,0,0,6,0,1,0.24,0.2879,0.81,0.0,3,13,16
1,2,2011-01-01,1,0,1,1,0,6,0,1,0.22,0.2727,0.8,0.0,8,32,40
2,3,2011-01-01,1,0,1,2,0,6,0,1,0.22,0.2727,0.8,0.0,5,27,32
3,4,2011-01-01,1,0,1,3,0,6,0,1,0.24,0.2879,0.75,0.0,3,10,13
4,5,2011-01-01,1,0,1,4,0,6,0,1,0.24,0.2879,0.75,0.0,0,1,1


In [3]:
print("\nData types:\n", df.dtypes)
print("\nMissing values per column:\n", df.isna().sum())
print("\nContinuous columns:", ["temp","atemp","hum","windspeed"])
print("Categorical columns:", ["season","yr","mnth","hr","weekday","weathersit"])
print("Binary columns:", ["holiday","workingday"])



Data types:
 instant         int64
dteday         object
season          int64
yr              int64
mnth            int64
hr              int64
holiday         int64
weekday         int64
workingday      int64
weathersit      int64
temp          float64
atemp         float64
hum           float64
windspeed     float64
casual          int64
registered      int64
cnt             int64
dtype: object

Missing values per column:
 instant       0
dteday        0
season        0
yr            0
mnth          0
hr            0
holiday       0
weekday       0
workingday    0
weathersit    0
temp          0
atemp         0
hum           0
windspeed     0
casual        0
registered    0
cnt           0
dtype: int64

Continuous columns: ['temp', 'atemp', 'hum', 'windspeed']
Categorical columns: ['season', 'yr', 'mnth', 'hr', 'weekday', 'weathersit']
Binary columns: ['holiday', 'workingday']


In [4]:
drop_cols = ["instant", "dteday", "casual", "registered"]
df_model = df.drop(columns=drop_cols)

In [5]:
target = "cnt"
y = df_model[target].values
X = df_model.drop(columns=[target])

categorical_cols = ["season", "yr", "mnth", "hr", "weekday", "weathersit"]
numeric_cols = ["holiday", "workingday", "temp", "atemp", "hum", "windspeed"]

preprocessor = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore", sparse_output=False), categorical_cols),
    ("num", "passthrough", numeric_cols)
])
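
The `handle_unknown="ignore"` setting matters with a chronological split, because a category level could in principle appear only in the later (test) rows. The sketch below uses a made-up toy column, not `hour.csv`, to show what the encoder does in that case: the unseen level is encoded as an all-zero row instead of raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy frames (illustrative values only)
train = pd.DataFrame({"weathersit": [1, 2, 3]})
test = pd.DataFrame({"weathersit": [2, 4]})  # level 4 never appears in train

ohe = OneHotEncoder(handle_unknown="ignore").fit(train)

# Default output is sparse; convert to a dense array for display
encoded = ohe.transform(test).toarray()
print(encoded)
```

The second test row comes out as all zeros, so a downstream model treats it as belonging to none of the known levels.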

### A.2 Train/Test Split

In [6]:
n = len(df_model)
split_idx = int(0.8 * n)
X_train, X_test = X.iloc[:split_idx], X.iloc[split_idx:]
y_train, y_test = y[:split_idx], y[split_idx:]

print("Train:", X_train.shape, " Test:", X_test.shape)


Train: (13903, 12)  Test: (3476, 12)
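
The split above is positional rather than random: the earliest 80% of rows train the models and the latest 20% test them, so no future observation leaks into training. A minimal sketch of the same idea on a toy frame follows; `chronological_split` is a hypothetical helper and the counts are made up.

```python
import numpy as np
import pandas as pd

def chronological_split(frame, train_frac=0.8):
    """Split a time-ordered frame by position, without shuffling (hypothetical helper)."""
    split_idx = int(train_frac * len(frame))
    return frame.iloc[:split_idx], frame.iloc[split_idx:]

# Toy stand-in for hour.csv: 10 rows already sorted by time
toy = pd.DataFrame({
    "instant": np.arange(1, 11),
    "cnt": [16, 40, 32, 13, 1, 1, 2, 3, 8, 14],
})
train, test = chronological_split(toy)

print(len(train), len(test))
# Every training row precedes every test row
print(train["instant"].max() < test["instant"].min())
```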


### A.3 Baseline Model (Single Regressor)

In [7]:
tree_model = Pipeline([
    ("prep", preprocessor),
    ("reg", DecisionTreeRegressor(max_depth=6, random_state=RANDOM_STATE))
])

lin_model = Pipeline([
    ("prep", preprocessor),
    ("reg", LinearRegression())
])

tree_model.fit(X_train, y_train)
lin_model.fit(X_train, y_train)

pred_tree = tree_model.predict(X_test)
pred_lin  = lin_model.predict(X_test)

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return np.sqrt(mean_squared_error(y_true, y_pred))
rmse_tree = rmse(y_test, pred_tree)
rmse_lin  = rmse(y_test, pred_lin)

print(f"Decision Tree (max_depth=6) RMSE: {rmse_tree:.3f}")
print(f"Linear Regression RMSE: {rmse_lin:.3f}")

baseline = "Decision Tree" if rmse_tree < rmse_lin else "Linear Regression"
print(f"\n✅ Baseline model: {baseline}")


Decision Tree (max_depth=6) RMSE: 158.692
Linear Regression RMSE: 133.835

✅ Baseline model: Linear Regression


### 🧾 **Conclusion: Part A**

- After preprocessing and chronological splitting, two baseline models were trained:  
  - **Decision Tree (max_depth = 6)** → RMSE = **158.69**  
  - **Linear Regression** → RMSE = **133.84**  
- Since Linear Regression achieved the lower RMSE, it is selected as the **baseline model** for further ensemble comparisons.  
- This suggests that, with one-hot encoded hour and calendar features, an additive linear model captures more of the signal than a tree capped at depth 6, which has at most 64 leaves and cannot separate every combination of hour, weekday, and weather. It does not by itself show that the relationship between the predictors and `cnt` is linear.
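
For reference, the metric used throughout can be written out as below. The notation ($n$, $y_i$, $\hat y_i$) is introduced here for illustration; $n = 3476$ is the number of test rows reported in A.2.

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat y_i\right)^2},
\qquad n = 3476 \text{ test rows}
```

Because the errors are squared before averaging, a few badly mispredicted peak hours can dominate the score.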


## Part B: Ensemble Techniques for Bias and Variance Reduction 

### B.1 Bagging (variance reduction)

**Hypothesis:**  
Bagging aims to **reduce variance** by averaging predictions from multiple high-variance models (like decision trees) trained on different bootstrap samples of the data.
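
A standard way to make this precise, stated here as background rather than something derived in this notebook, is the variance of an average of $B$ identically distributed predictors, each with variance $\sigma^2$ and pairwise correlation $\rho$ (symbols introduced for illustration):

```latex
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} \hat f_b(x)\right)
  = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
```

The second term vanishes as $B$ grows, but the first does not, so the benefit of bagging is limited when the bootstrapped trees are highly correlated with each other.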

**Implementation :**  
- Use **DecisionTreeRegressor(max_depth = 6)** as the base estimator.  
- Train a **BaggingRegressor** with at least **50 estimators** (we'll use 100).  
- Evaluate its **test RMSE** and compare it against the single Decision Tree baseline.  

In [8]:
bagging_pipe = Pipeline([
    ("prep", preprocessor),
    ("bag", BaggingRegressor(
        estimator=DecisionTreeRegressor(max_depth=6, random_state=RANDOM_STATE),
        n_estimators=100,            # >= 50 as required
        random_state=RANDOM_STATE,
        n_jobs=-1
    ))
])

# fit and evaluate
bagging_pipe.fit(X_train, y_train)
pred_bag = bagging_pipe.predict(X_test)
rmse_bag = float(np.sqrt(mean_squared_error(y_test, pred_bag)))

tree_pipe = Pipeline([("prep", preprocessor),
                      ("reg", DecisionTreeRegressor(max_depth=6, random_state=RANDOM_STATE))])
tree_pipe.fit(X_train, y_train)
pred_tree = tree_pipe.predict(X_test)
rmse_tree = float(np.sqrt(mean_squared_error(y_test, pred_tree)))

print(f"Decision Tree (single) RMSE: {rmse_tree:.3f}")
print(f"Bagging (100 DT) RMSE:        {rmse_bag:.3f}")


Decision Tree (single) RMSE: 158.692
Bagging (100 DT) RMSE:        155.270


### Discussion: Bagging (Variance Reduction)

- The **single Decision Tree (max_depth = 6)** achieved an RMSE of **158.69**,  
  while the **Bagging ensemble (100 trees)** achieved a lower RMSE of **155.27**.  
- This modest decrease (about 2%, measured on a single train/test split) is consistent with **bagging reducing variance** by averaging predictions from multiple bootstrapped trees.  
- Each tree individually is a high-variance learner, but by aggregating many such models, random fluctuations in individual trees tend to cancel out.  
- Hence, bagging produced a **more stable and slightly more accurate** predictor than a single Decision Tree, in line with the hypothesis that bagging primarily targets **variance reduction**. The gain is likely small because depth-6 trees are already fairly constrained; their remaining error is mostly bias, which bagging does not address.
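
The variance claim can be checked directly on synthetic data, where fresh training sets can be drawn at will. The sketch below is illustrative only: it uses a made-up 1-D problem, not the bike data, and `make_data` and `prediction_variance` are hypothetical helpers. It measures how much predictions at fixed query points change from one training set to the next.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def make_data(n):
    """Synthetic noisy sine problem (not the bike data)."""
    X = rng.uniform(0, 6, size=(n, 1))
    y = np.sin(X[:, 0]) + rng.normal(scale=0.5, size=n)
    return X, y

X_eval = np.linspace(0, 6, 50).reshape(-1, 1)  # fixed query points

def prediction_variance(make_model, n_repeats=30):
    """Mean variance of predictions across independently drawn training sets."""
    preds = []
    for seed in range(n_repeats):
        X, y = make_data(200)
        preds.append(make_model(seed).fit(X, y).predict(X_eval))
    return float(np.var(np.array(preds), axis=0).mean())

var_tree = prediction_variance(
    lambda s: DecisionTreeRegressor(max_depth=6, random_state=s))
var_bag = prediction_variance(
    lambda s: BaggingRegressor(DecisionTreeRegressor(max_depth=6),
                               n_estimators=50, random_state=s))

print(f"single tree variance:  {var_tree:.4f}")
print(f"bagged trees variance: {var_bag:.4f}")
```

On this toy problem the bagged predictor varies less across training sets than the single tree, which is the effect the RMSE comparison above can only hint at.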


### B.2 Boosting (bias reduction) using GradientBoostingRegressor

**Hypothesis:**  
Boosting primarily targets **bias reduction** by sequentially improving weak learners.  
Each new tree in the sequence is trained to correct the residual (error) of the previous ensemble.
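
For squared-error loss, "correcting the residual" can be written as a short loop. The following is a minimal hand-rolled sketch on synthetic data, not the internals of `GradientBoostingRegressor`; the learning rate, tree depth, and number of stages are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=300)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())  # stage 0: constant model
train_mse = [float(np.mean((y - prediction) ** 2))]

for _ in range(50):
    residual = y - prediction  # negative gradient of squared loss
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    prediction += learning_rate * tree.predict(X)  # shrunken correction
    train_mse.append(float(np.mean((y - prediction) ** 2)))

print(f"train MSE after 0 stages:  {train_mse[0]:.4f}")
print(f"train MSE after 50 stages: {train_mse[-1]:.4f}")
```

Each shallow tree is a weak learner on its own; the training error falls because every stage fits what the previous stages left unexplained.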

**Implementation plan:**  
- Use **GradientBoostingRegressor**, a standard boosting algorithm.  
- Keep default hyperparameters to observe natural bias reduction.  
- Evaluate test **RMSE** and compare it with both the **single regressors** and **Bagging ensemble** results.


In [9]:
gbr_pipe = Pipeline([
    ("prep", preprocessor),
    ("gbr", GradientBoostingRegressor(random_state=20))
])

gbr_pipe.fit(X_train, y_train)
pred_gbr = gbr_pipe.predict(X_test)
rmse_gbr = float(np.sqrt(mean_squared_error(y_test, pred_gbr)))

print(f"Gradient Boosting RMSE: {rmse_gbr:.3f}")
print("\nComparison summary:")
print(f" - Single Decision Tree RMSE: {rmse_tree:.3f}")
print(f" - Linear Regression RMSE: {rmse_lin:.3f}")
print(f" - Bagging (100 DT) RMSE:     {rmse_bag:.3f}")
print(f" - Gradient Boosting RMSE:    {rmse_gbr:.3f}")


Gradient Boosting RMSE: 123.182

Comparison summary:
 - Single Decision Tree RMSE: 158.692
 - Linear Regression RMSE: 133.835
 - Bagging (100 DT) RMSE:     155.270
 - Gradient Boosting RMSE:    123.182


### Discussion: Boosting (Bias Reduction)

- The **Gradient Boosting Regressor** achieved an RMSE of **123.18**, outperforming  
  both the **single Decision Tree (158.69)** and the **Bagging ensemble (155.27)**,  
  as well as the **Linear Regression baseline (133.84)**.  
- This large improvement is consistent with the hypothesis that **boosting reduces bias** by sequentially learning from the residual errors of prior models.  
- Unlike bagging, which trains models independently for variance reduction, boosting builds models **sequentially**, allowing later trees to focus on previously mispredicted cases.  
- The resulting ensemble combines weak learners into a strong, low-bias model, as the large RMSE drop suggests, indicating that **boosting achieved superior accuracy through bias correction**.
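
One way to watch the sequential effect on held-out data is `staged_predict`, which yields the ensemble's prediction after each boosting stage. The sketch below runs on synthetic data with default hyperparameters, so the numbers say nothing about the bike dataset.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(600, 2))
y = np.sin(X[:, 0]) * X[:, 1] + rng.normal(scale=0.3, size=600)
X_tr, X_te, y_tr, y_te = X[:480], X[480:], y[:480], y[480:]

gbr = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)

# Held-out RMSE after 1, 2, ..., n_estimators stages
stage_rmse = [float(np.sqrt(mean_squared_error(y_te, p)))
              for p in gbr.staged_predict(X_te)]

print(f"RMSE after 1 stage: {stage_rmse[0]:.3f}")
print(f"RMSE after {len(stage_rmse)} stages: {stage_rmse[-1]:.3f}")
```

Plotting `stage_rmse` for the real pipeline would show whether the default 100 stages are enough or whether the test error is still falling.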


## Part C: Stacking for Optimal Performance

### C.1 Stacking principle 

Stacking trains several **diverse base learners** (level-0) and then trains a **meta-learner** (level-1) on the base learners' *out-of-sample* predictions.  
Formally, base learners produce predictions $\{\hat y^{(k)}\}$; the meta-learner $f_{\text{meta}}$ learns:
$$
\hat y = f_{\text{meta}}\big(\hat y^{(1)}, \hat y^{(2)}, \dots, \hat y^{(K)}\big).
$$
Because base learners have different inductive biases (instance-based KNN, bagged trees, gradient-boosted trees), stacking leverages their complementary strengths. The meta-learner learns an optimal combination (often regularized) to reduce generalization error.
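
The two levels can be spelled out by hand with `cross_val_predict`, which returns out-of-fold predictions. This is a simplified sketch on synthetic data with two base learners, meant to illustrate the principle rather than reproduce `StackingRegressor` exactly.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(400, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=400)

base_learners = [
    KNeighborsRegressor(n_neighbors=10),
    GradientBoostingRegressor(random_state=0),
]

# Level 0: one column of out-of-fold predictions per base learner
Z = np.column_stack([cross_val_predict(m, X, y, cv=5) for m in base_learners])

# Level 1: the meta-learner sees only the base predictions, never X itself
meta = Ridge().fit(Z, y)
print("meta-learner weights:", meta.coef_)
```

Using out-of-fold predictions is the key design choice: if the meta-learner were trained on in-sample predictions, it would favor whichever base learner overfits the most.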


In [10]:
try:
    preprocessor 
except NameError:
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.compose import ColumnTransformer
    categorical_cols = ["season", "yr", "mnth", "hr", "weekday", "weathersit"]
    numeric_cols = ["holiday", "workingday", "temp", "atemp", "hum", "windspeed"]
    try:
        ohe = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
    except TypeError:
        ohe = OneHotEncoder(handle_unknown="ignore", sparse=False)
    preprocessor = ColumnTransformer([
        ("cat", ohe, categorical_cols),
        ("num", "passthrough", numeric_cols)
    ])

# KNN 
knn_pipe = Pipeline([
    ("prep", preprocessor),
    ("scale", StandardScaler()),   
    ("knn", KNeighborsRegressor(n_neighbors=10))
])

# Bagging
try:
    bag_base = BaggingRegressor(
        estimator=DecisionTreeRegressor(max_depth=6, random_state=RANDOM_STATE),
        n_estimators=100, random_state=RANDOM_STATE, n_jobs=-1
    )
except TypeError:
    bag_base = BaggingRegressor(
        base_estimator=DecisionTreeRegressor(max_depth=6, random_state=RANDOM_STATE),
        n_estimators=100, random_state=RANDOM_STATE, n_jobs=-1
    )

bag_pipe = Pipeline([
    ("prep", preprocessor),
    ("bag", bag_base)
])

# Gradient Boosting 
gbr_pipe = Pipeline([
    ("prep", preprocessor),
    ("gbr", GradientBoostingRegressor(random_state=RANDOM_STATE))
])

# Stacking regressor
estimators = [
    ("knn", knn_pipe),
    ("bag", bag_pipe),
    ("gbr", gbr_pipe)
]

stack = StackingRegressor(
    estimators=estimators,
    final_estimator=Ridge(),   
    passthrough=False,
    n_jobs=-1
)


stack.fit(X_train, y_train)
pred_stack = stack.predict(X_test)
rmse_stack = float(np.sqrt(mean_squared_error(y_test, pred_stack)))

print(f"Stacking Regressor RMSE: {rmse_stack:.3f}")


Stacking Regressor RMSE: 116.479


### Stacking Regressor: Test Set Result

The **Stacking Regressor** (combining KNN, Bagging, and Gradient Boosting as base learners with a Ridge meta-learner) achieved a **test RMSE of 116.48**.  
This is lower than the RMSE of every individual and ensemble model above, indicating that a learned blend of diverse learners predicts better than any one of them on this test set.
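
To see how the blend is weighted, the fitted Ridge coefficients can be read from `final_estimator_`. The sketch below fits a small stack on synthetic data (`toy_stack` is a hypothetical name); the same two attribute lookups should apply to the `stack` object fitted above.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(400, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=400)

toy_stack = StackingRegressor(
    estimators=[
        ("knn", KNeighborsRegressor(n_neighbors=10)),
        ("gbr", GradientBoostingRegressor(random_state=0)),
    ],
    final_estimator=Ridge(),
).fit(X, y)

# One Ridge coefficient per base learner, in the order of `estimators`
weights = dict(zip(toy_stack.named_estimators_, toy_stack.final_estimator_.coef_))
print(weights)
```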


## Part D: Final Analysis 

### D.1: Comparative table of RMSEs for all models 

In [11]:
# Reuse the RMSEs computed in the cells above instead of hard-coding them
results = pd.DataFrame([
    {"Model": "Baseline (Linear)", "RMSE": rmse_lin},   # Linear Regression selected as baseline
    {"Model": "Decision Tree (single, max_depth=6)", "RMSE": rmse_tree},
    {"Model": "Bagging (100 DTs)", "RMSE": rmse_bag},
    {"Model": "Gradient Boosting", "RMSE": rmse_gbr},
    {"Model": "Stacking (KNN + Bag + GBR -> Ridge)", "RMSE": rmse_stack}
])

# Sort by RMSE ascending for clarity
results_sorted = results.sort_values("RMSE").reset_index(drop=True)
results_sorted.style.format({"RMSE": "{:.3f}"})


,Model,RMSE
0,Stacking (KNN + Bag + GBR -> Ridge),116.479
1,Gradient Boosting,123.182
2,Baseline (Linear),133.835
3,Bagging (100 DTs),155.270
4,"Decision Tree (single, max_depth=6)",158.692


### 🔎 Conclusion: Best-performing model and why

- **Best-performing model:** **Stacking Regressor** (RMSE = **116.48**).  
  - It outperformed the baseline (Linear Regression, RMSE = 133.84) and all other ensembles (Bagging, Gradient Boosting).

- **Why stacking (and the best ensemble) outperformed the single-model baseline:**
  1. **Model diversity:** Stacking combined models with different inductive biases:  
     KNN (instance-based), Bagged trees (variance-stable, non-linear), and Gradient Boosting (sequential bias correction). These models make different errors on different examples, so combining them reduces overall error.
  2. **Bias–variance trade-off:**  
     - The baseline Linear Regression had relatively low variance but higher bias (unable to capture some non-linear patterns).  
     - Bagging reduced variance of trees but did not sufficiently reduce bias relative to boosting.  
     - Gradient Boosting reduced bias substantially (RMSE dropped), but stacking further improved results by leveraging complementary strengths.  
     - The meta-learner (Ridge) learns a regularized combination of base predictions, **reducing variance** of the combination while **preserving bias reductions** from strong base learners.
  3. **Regularized combination:** Ridge as the meta-learner limits overfitting to base predictions by penalizing large weights, stabilizing the final ensemble.
  4. **Practical outcome:** The stacking ensemble achieved the lowest RMSE, indicating the best empirical balance of bias and variance on the held-out chronological test set.

- **Short takeaway:** Ensembles that combine *diverse* learners and use a *regularized meta-learner* (stacking) can outperform both single models and simpler ensembles by reducing bias and variance together.
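
To put the comparison on a common scale, the test RMSEs printed by the cells above can be expressed as a percentage change relative to the Linear Regression baseline. This is plain arithmetic on those printed values; `improvement` is an illustrative name.

```python
# Test RMSEs as printed by the cells above
rmse_values = {
    "Linear Regression (baseline)": 133.835,
    "Decision Tree": 158.692,
    "Bagging": 155.270,
    "Gradient Boosting": 123.182,
    "Stacking": 116.479,
}
baseline_rmse = rmse_values["Linear Regression (baseline)"]

# Positive values mean a lower RMSE than the baseline
improvement = {
    name: 100 * (baseline_rmse - value) / baseline_rmse
    for name, value in rmse_values.items()
}

for name, pct in improvement.items():
    print(f"{name:<30s} {pct:+6.1f}%")
```

Only Gradient Boosting and Stacking come out positive; both tree-only models at depth 6 are worse than the linear baseline.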
