
# Stacking and Blending — An Intuitive Guide

This notebook explains **Stacking** and **Blending**, two powerful ensemble learning techniques widely used in machine learning competitions and production systems.

The notebook is structured for:
- Conceptual clarity
- Interview preparation
- GitHub / portfolio readiness


## Stacking (Stacked Generalization)

**Stacking** is an ensemble technique where:
1. Multiple base models are trained
2. Their predictions are used as new features
3. A meta-model learns how to combine those predictions

The key idea is that different models capture different patterns in the data.


### Why K-Fold Cross Validation Is Required

If a base model predicts on the same rows it was trained on, its predictions look more accurate than they will be on unseen data, and the meta-model learns to over-trust that model.

To avoid **data leakage**, stacking uses **K-Fold Cross Validation** to generate **out-of-fold (OOF) predictions**.

This ensures:
- Every training sample gets a prediction
- That prediction comes from a model that has NOT seen that sample
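The gap between in-sample and out-of-fold predictions can be made concrete with a deliberately extreme toy case. The sketch below uses only the standard library; `MemorizingModel` and `kfold_indices` are hypothetical helpers invented for this illustration, and the labels are pure noise, so there is nothing real to learn.

```python
import random


def kfold_indices(n_samples, n_folds, seed=0):
    """Shuffle sample indices and deal them into n_folds disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


class MemorizingModel:
    """Hypothetical toy model: memorizes labels by sample id and falls
    back to the majority training label for ids it has never seen."""

    def fit(self, ids, labels):
        self.table = dict(zip(ids, labels))
        self.default = 1 if 2 * sum(labels) >= len(labels) else 0
        return self

    def predict(self, ids):
        return [self.table.get(i, self.default) for i in ids]


def accuracy(preds, truth):
    return sum(p == t for p, t in zip(preds, truth)) / len(truth)


rng = random.Random(1)
n = 200
ids = list(range(n))
labels = [rng.randint(0, 1) for _ in ids]  # pure noise: nothing to learn

# In-sample: predict the same rows the model was trained on.
in_sample_acc = accuracy(MemorizingModel().fit(ids, labels).predict(ids), labels)

# Out-of-fold: every row is predicted by a model that never saw it.
oof = [None] * n
for fold in kfold_indices(n, n_folds=5):
    held_out = set(fold)
    train_ids = [i for i in ids if i not in held_out]
    model = MemorizingModel().fit(train_ids, [labels[i] for i in train_ids])
    for i, pred in zip(fold, model.predict(fold)):
        oof[i] = pred
oof_acc = accuracy(oof, labels)

print(f"in-sample accuracy:   {in_sample_acc:.2f}")
print(f"out-of-fold accuracy: {oof_acc:.2f}")
```

The in-sample predictions are perfect because the model simply replays what it memorized, while the out-of-fold predictions stay near chance. A meta-model trained on the in-sample column would conclude that this base model is always right.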


### Stacking Workflow (Step-by-Step)

1. Split training data into K folds
2. For each base model:
   - Train on K−1 folds
   - Predict on the remaining fold
3. Collect predictions for all folds (OOF predictions)
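4. Retrain base models on the full training data
5. Predict on test data with the retrained base models (the custom function in this notebook instead averages the test predictions of the K fold models)
6. Train a meta-model on the OOF predictions from step 3, using the true training labels as targets
7. Meta-model produces final predictions from the base models' test predictions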
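For comparison, scikit-learn provides `StackingClassifier`, which follows essentially this workflow: the final estimator is trained on cross-validated predictions, and the base estimators are refit on the full training data. The sketch below is a minimal usage example rather than a replacement for the manual implementation in this notebook; the choice of `cv=5` and the default `LogisticRegression` settings are assumptions made for brevity.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Same synthetic dataset settings as the rest of this notebook.
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=10, n_redundant=5, random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1
)

stack = StackingClassifier(
    estimators=[
        ("dt", DecisionTreeClassifier(random_state=1)),
        ("knn", KNeighborsClassifier()),
    ],
    final_estimator=LogisticRegression(),
    cv=5,  # folds used to build the out-of-fold meta-features
)
stack.fit(X_train, y_train)

stack_acc = stack.score(X_test, y_test)
print(f"StackingClassifier accuracy: {stack_acc:.3f}")
```

By default the classifier feeds the meta-model class probabilities where the base model supports them, so its meta-features differ from the hard labels used by the manual function later in this notebook.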


### Intuition

- Decision Trees may capture nonlinear splits
- KNN may capture local neighborhood patterns
- SVM may capture large-margin separation

The meta-model learns:
- Which model to trust
- Under what conditions
- How to weight each prediction
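A small simulation can show what "learning which model to trust" looks like in practice. In the sketch below the two base models are not real models: `noisy_copy` is a hypothetical helper that fakes a model's predictions by flipping a fixed share of the true labels, giving one "strong" column (about 90% correct) and one "weak" column (about 60% correct).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000
y = rng.integers(0, 2, size=n)


def noisy_copy(labels, accuracy):
    """Simulate a base model: flip roughly (1 - accuracy) of the labels."""
    flip = rng.random(labels.shape[0]) >= accuracy
    return np.where(flip, 1 - labels, labels)


# Column 0: simulated strong model, column 1: simulated weak model.
meta_features = np.column_stack([noisy_copy(y, 0.90), noisy_copy(y, 0.60)])

meta_model = LogisticRegression().fit(meta_features, y)
strong_weight, weak_weight = meta_model.coef_[0]

print(f"weight on strong model: {strong_weight:.2f}")
print(f"weight on weak model:   {weak_weight:.2f}")
```

The logistic regression assigns the larger coefficient to the more reliable column, which is the weighting behavior described above.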


## Step 1: Import Required Libraries


In [14]:
import numpy as np
import pandas as pd

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


## Step 2: Create Dataset
We use a binary classification dataset for demonstration.


In [15]:
X, y = make_classification(
    n_samples=1000,
    n_features=20,
    n_informative=10,
    n_redundant=5,
    random_state=1
)

X = pd.DataFrame(X)
y = pd.Series(y)


## Step 3: Train-Test Split


In [16]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1
)


## Step 4: Define Stacking Function

This function:
- Uses StratifiedKFold
- Generates out-of-fold predictions
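- Averages each fold model's hard-label test predictions, so test meta-features are fractions between 0 and 1 while training meta-features are exactly 0 or 1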
- Prevents data leakage


In [17]:
def Stacking(model, train, y, test, n_fold=10):
    """
    Custom stacking function (classification).
    Returns:
    - test_pred_mean : averaged test predictions
    - train_pred     : out-of-fold training predictions
    """

    skf = StratifiedKFold(n_splits=n_fold, shuffle=True, random_state=1)

    train_pred = np.zeros((train.shape[0], 1))
    test_pred = np.zeros((test.shape[0], n_fold))

    for fold, (train_idx, val_idx) in enumerate(skf.split(train, y)):

        X_tr = train.iloc[train_idx]
        y_tr = y.iloc[train_idx]

        X_val = train.iloc[val_idx]

        # Train base model
        model.fit(X_tr, y_tr)

        # Out-of-fold prediction
        train_pred[val_idx, 0] = model.predict(X_val)

        # Test prediction
        test_pred[:, fold] = model.predict(test)

    # Average test predictions
    test_pred_mean = test_pred.mean(axis=1).reshape(-1, 1)

    return test_pred_mean, train_pred
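One possible variation, sketched here under assumptions rather than taken from the notebook, is to pass class probabilities to the meta-model instead of hard labels. This keeps the training and test meta-features on the same continuous scale. The function name `stacking_proba` is hypothetical, the sketch assumes a binary problem and a base model that implements `predict_proba`, and it clones the model for each fold so the caller's estimator is left untouched.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier


def stacking_proba(model, X_train, y_train, X_test, n_fold=5, seed=1):
    """Hypothetical variant: out-of-fold and fold-averaged test
    probabilities of the positive class (binary problems only)."""
    X_train, y_train, X_test = map(np.asarray, (X_train, y_train, X_test))
    skf = StratifiedKFold(n_splits=n_fold, shuffle=True, random_state=seed)

    oof = np.zeros(X_train.shape[0])
    test_folds = np.zeros((X_test.shape[0], n_fold))

    for fold, (tr_idx, val_idx) in enumerate(skf.split(X_train, y_train)):
        # Fit a fresh copy so no state carries over between folds.
        fold_model = clone(model).fit(X_train[tr_idx], y_train[tr_idx])
        oof[val_idx] = fold_model.predict_proba(X_train[val_idx])[:, 1]
        test_folds[:, fold] = fold_model.predict_proba(X_test)[:, 1]

    return test_folds.mean(axis=1).reshape(-1, 1), oof.reshape(-1, 1)


# Small demo on a throwaway dataset (not the notebook's dataset).
X, y = make_classification(n_samples=300, n_features=10, random_state=1)
test_meta, train_meta = stacking_proba(
    KNeighborsClassifier(), X[:240], y[:240], X[240:]
)
print(train_meta.shape, test_meta.shape)  # (240, 1) (60, 1)
```

The return order matches the `Stacking` function above (test meta-features first, then training meta-features), so it could be swapped in with the same unpacking.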


## Step 5: Initialize Base Models


In [18]:
model_1 = DecisionTreeClassifier(random_state=1)
model_2 = KNeighborsClassifier()


## Step 6: Generate Meta-Features Using Stacking


In [19]:
test_pred_1, train_pred_1 = Stacking(
    model=model_1,
    train=X_train,
    y=y_train,
    test=X_test,
    n_fold=10
)

test_pred_2, train_pred_2 = Stacking(
    model=model_2,
    train=X_train,
    y=y_train,
    test=X_test,
    n_fold=10
)


## Step 7: Convert Predictions to DataFrames



In [20]:
train_pred_1 = pd.DataFrame(train_pred_1, columns=["DT_Pred"])
train_pred_2 = pd.DataFrame(train_pred_2, columns=["KNN_Pred"])

test_pred_1 = pd.DataFrame(test_pred_1, columns=["DT_Pred"])
test_pred_2 = pd.DataFrame(test_pred_2, columns=["KNN_Pred"])

df_train = pd.concat([train_pred_1, train_pred_2], axis=1)
df_test = pd.concat([test_pred_1, test_pred_2], axis=1)


## Step 8: Train Meta-Model (Logistic Regression)


In [21]:
meta_model = LogisticRegression(random_state=1)
meta_model.fit(df_train, y_train)


## Step 9: Evaluate Stacking Model


In [22]:
final_predictions = meta_model.predict(df_test)
accuracy = accuracy_score(y_test, final_predictions)

accuracy


0.915

## How to Explain This in an Interview

- Base models are trained using Stratified K-Fold CV
- Validation predictions form out-of-fold features
- Test predictions are averaged across folds
- Meta-model learns optimal combination of base models
- Because the meta-model only sees out-of-fold predictions, leakage is avoided and its weights reflect how the base models behave on unseen data


## Blending

**Blending** is a simplified version of stacking.

Instead of K-Fold Cross Validation, blending:
- Reserves a small **holdout set**
- Trains base models on remaining data
- Trains the meta-model only on the base models' predictions for the holdout set


### Blending Workflow

1. Split training data into:
   - Base training set (e.g., 90%)
   - Holdout set (e.g., 10%)
2. Train base models on base training set
3. Generate predictions on:
   - Holdout set
   - Test data
4. Train meta-model on holdout predictions
5. Meta-model predicts final test outputs
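The five steps above can be condensed into one function. This is a minimal sketch, not the implementation used later in this notebook: the helper name `blend` is hypothetical, it uses hard-label predictions as the only meta-features, and it uses the 90/10 split from the workflow.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier


def blend(base_models, meta_model, X_base, y_base, X_hold, y_hold, X_test):
    """Hypothetical helper: fit base models on the base set, fit the
    meta-model on their holdout predictions, return test predictions."""
    fitted = [m.fit(X_base, y_base) for m in base_models]              # step 2
    hold_meta = np.column_stack([m.predict(X_hold) for m in fitted])   # step 3
    test_meta = np.column_stack([m.predict(X_test) for m in fitted])   # step 3
    meta_model.fit(hold_meta, y_hold)                                  # step 4
    return meta_model.predict(test_meta)                               # step 5


X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=10, n_redundant=5, random_state=1
)
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1
)
# Step 1: 90% base training set, 10% holdout set.
X_base, X_hold, y_base, y_hold = train_test_split(
    X_rest, y_rest, test_size=0.1, stratify=y_rest, random_state=1
)

preds = blend(
    [DecisionTreeClassifier(random_state=1), KNeighborsClassifier()],
    LogisticRegression(),
    X_base, y_base, X_hold, y_hold, X_test,
)
blend_acc = float(np.mean(preds == y_test))
print(f"blending accuracy: {blend_acc:.3f}")
```

With a 10% holdout the meta-model sees only 80 rows here, which illustrates the trade-off: a smaller holdout leaves more data for the base models but less for the meta-model.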


## Blending — Step-by-Step Working Implementation

Blending is an ensemble method where:
- Base models are trained on a base training set
- Predictions are generated on a **holdout set**
- A meta-model is trained using holdout predictions


## Step 1: Import Required Libraries


In [23]:
import numpy as np
import pandas as pd

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


## Step 2: Create Dataset


In [24]:
X, y = make_classification(
    n_samples=1000,
    n_features=20,
    n_informative=10,
    n_redundant=5,
    random_state=1
)

X = pd.DataFrame(X)
y = pd.Series(y)


## Step 3: Train–Holdout–Test Split

We split data into:
- Base training set
- Holdout set (for meta-model training)
- Test set (final evaluation)


In [25]:
# First split: train+holdout and test
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1
)

# Second split: base train and holdout
# (0.25 of the remaining 80% gives a 60/20/20 train/holdout/test split overall)
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X_temp, y_temp, test_size=0.25, stratify=y_temp, random_state=1
)
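The two chained splits are easy to misread, because the second `test_size` applies to what is left after the first split, not to the full dataset. The arithmetic can be checked with a few lines of standard-library Python. `split_sizes` is a hypothetical helper, and rounding the test share up is an assumption about how `train_test_split` handles fractional sizes; for these numbers no rounding occurs either way.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test share is rounded up and the rest goes to train.
    """
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


n_temp, n_test = split_sizes(1000, 0.2)         # first split
n_train, n_holdout = split_sizes(n_temp, 0.25)  # second split on the remainder

print(n_train, n_holdout, n_test)  # 600 200 200
```

So the base models train on 600 rows, the meta-model on 200, and the final evaluation uses the remaining 200.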


## Step 4: Train Base Models on Base Training Set


In [27]:
model_1 = DecisionTreeClassifier(random_state=1)
model_1.fit(X_train, y_train)

holdout_pred_1 = model_1.predict(X_holdout)
test_pred_1 = model_1.predict(X_test)

holdout_pred_1 = pd.DataFrame(holdout_pred_1, columns=["DT_Pred"])
test_pred_1 = pd.DataFrame(test_pred_1, columns=["DT_Pred"])


In [28]:
model_2 = KNeighborsClassifier()
model_2.fit(X_train, y_train)

holdout_pred_2 = model_2.predict(X_holdout)
test_pred_2 = model_2.predict(X_test)

holdout_pred_2 = pd.DataFrame(holdout_pred_2, columns=["KNN_Pred"])
test_pred_2 = pd.DataFrame(test_pred_2, columns=["KNN_Pred"])


## Step 6: Create Meta-Feature Sets

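Unlike the stacking example above, which passed only the base model predictions to the meta-model, this blending example also passes the original features, so the two reported accuracies are not a like-for-like comparison. We combine: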
- Original features
- Base model predictions


In [32]:
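# reset_index aligns the rows with the 0..n-1 index of the prediction frames;
# column names are cast to str so every meta-feature column name has the same type
df_holdout = pd.concat([X_holdout.reset_index(drop=True).set_axis(X_holdout.columns.astype(str), axis=1),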
                        holdout_pred_1,
                        holdout_pred_2], axis=1)

df_test = pd.concat([X_test.reset_index(drop=True).set_axis(X_test.columns.astype(str), axis=1),
                     test_pred_1,
                     test_pred_2], axis=1)

## Step 7: Train Meta-Model (Logistic Regression)


In [33]:
meta_model = LogisticRegression(random_state=1, max_iter=1000)
meta_model.fit(df_holdout, y_holdout)

## Step 8: Evaluate Blending Model


In [35]:
final_predictions_blending = meta_model.predict(df_test)
accuracy_blending = accuracy_score(y_test, final_predictions_blending)

print(f"Blending Model Accuracy: {accuracy_blending}")

Blending Model Accuracy: 0.925


## How to Explain Blending in an Interview

- Base models are trained on base training data
- Predictions are generated on a holdout set
- Meta-model learns from holdout predictions (plus the original holdout features in this notebook's implementation)
- Test set is used only once for final evaluation


## Stacking vs Blending

| Aspect | Stacking | Blending |
|------|---------|---------|
| Data usage | K-Fold CV | Holdout set |
| Data efficiency | High | Lower |
| Leakage risk | Low, provided OOF predictions are generated correctly | Low, since base models and meta-model train on disjoint rows |
| Implementation | Complex | Simple |
| Competition usage | Very common | Common |


## When to Use What?

### Use Stacking When:
- Dataset is small or medium
- Model performance is critical
- You want robust generalization

### Use Blending When:
- Dataset is large
- Faster experimentation is needed
- Slight data wastage is acceptable


## Interview Q&A

**Q1. Why is K-Fold Cross Validation used in stacking?**  
To generate out-of-fold predictions and prevent data leakage.

**Q2. What happens if you skip OOF predictions?**  
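The meta-model overfits: base model predictions on their own training rows look more accurate than they really are, so the meta-model learns to over-trust the models that memorize the most.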

**Q3. Why are linear models commonly used as meta-models?**  
They are simple and interpretable, and with only a handful of parameters they are less likely to overfit the small set of meta-features.

**Q4. Can stacking outperform boosting?**  
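It can, particularly when the base models are diverse and make different kinds of errors, but the gain is not guaranteed and should be verified on held-out data.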

**Q5. Is stacking used in production?**  
Yes, but with careful monitoring due to higher complexity.


## Key Takeaways

- Stacking and Blending are powerful ensemble techniques
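- Stacking is more robust but computationally expensive, because each base model is fit once per fold
- Blending is simpler but uses data less efficiently: base models never train on the holdout rows, and the meta-model trains only on them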
- Always prevent data leakage
- Start with simple meta-models before adding complexity
