# **AI TECH INSTITUTE** · *Intermediate AI & Data Science*
### Stacking Ensemble - Combining Models for Maximum Power
**Instructor:** Amir Charkhi | **Dataset:** Synthetic (`make_classification`), modeled on Bank Marketing (UCI)

---

## 🎯 The Big Idea: Model Democracy

**Single Model:** One expert makes the decision
```
Random Forest → Prediction ✅
```

**Simple Voting:** Multiple experts vote (majority wins)
```
Logistic Regression → Vote: Yes
Random Forest       → Vote: No
SVM                 → Vote: Yes
                    ↓
Final: Yes (2 vs 1)
```

**Stacking:** A meta-expert learns from other experts!
```
Logistic Regression → Probability: 0.7
Random Forest       → Probability: 0.3
SVM                 → Probability: 0.8
                    ↓
Meta-Model (trained on these predictions)
  "I've learned that when LR and SVM agree high,
   but RF says low, trust LR and SVM more..."
                    ↓
Final Smart Prediction: 0.75 → Yes
```

**Key Insight:** The meta-model learns WHEN to trust each base model!
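
To make the contrast concrete, here is a minimal sketch that applies both ideas to the three probabilities from the diagram. The meta-model weights and bias are hand-picked, hypothetical values used only for illustration; a real meta-model would learn them from out-of-fold predictions, and its output would not necessarily match the 0.75 in the diagram.

```python
import math

# Base-model probabilities taken from the diagram above
probs = {"LR": 0.7, "RF": 0.3, "SVM": 0.8}

# Simple (hard) voting: each model votes "yes" if its probability is >= 0.5
votes = {name: int(p >= 0.5) for name, p in probs.items()}
majority = int(sum(votes.values()) > len(votes) / 2)

# Stacking-style combination with a logistic meta-model.
# HYPOTHETICAL weights and bias, chosen by hand for illustration only.
weights = {"LR": 2.0, "RF": 0.5, "SVM": 2.5}
bias = -2.0
z = bias + sum(weights[name] * probs[name] for name in probs)
meta_prob = 1.0 / (1.0 + math.exp(-z))  # sigmoid turns the score into a probability

print(f"Votes: {votes} -> majority says {'Yes' if majority else 'No'}")
print(f"Meta-model probability: {meta_prob:.2f}")
```

Voting treats every model equally, while the weighted sum lets some models count for more than others.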

---

## 1. Setup

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_classification

# Base models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

# Stacking
from sklearn.ensemble import StackingClassifier

# Metrics
from sklearn.metrics import accuracy_score, roc_auc_score, classification_report

# Plotly
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.express as px

import warnings
warnings.filterwarnings('ignore')

---
## 2. Load Data (Quick Prep)

**Classification Dataset:**
- **Goal:** Binary classification task
- **Features:** 20 numeric features
- **5,000 samples** with imbalanced classes (85/15 split)
- Mimics real-world scenarios like customer churn, fraud detection

In [None]:
# Generate data with sklearn's synthetic generator (no download needed)
from sklearn.datasets import make_classification

# Synthetic classification dataset (imbalanced, to mimic bank marketing data)
X, y = make_classification(
    n_samples=5000,
    n_features=20,
    n_informative=15,
    n_redundant=5,
    n_classes=2,
    weights=[0.85, 0.15],  # Imbalanced like real bank data
    flip_y=0.02,
    random_state=42
)

# Create DataFrame
feature_names = [f'feature_{i}' for i in range(20)]
X = pd.DataFrame(X, columns=feature_names)
y = pd.Series(y, name='target')

# Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Scale
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"✅ Data ready: {len(X_train):,} train, {len(X_test):,} test")
print(f"📊 Class balance: {y.mean():.1%} positive class")

---
## 3. Train Individual Base Models

### 🎯 Strategy: Choose DIVERSE Models

**Why diversity matters:**
```
If all models make the SAME errors:
  Model A: Wrong on cases [1, 2, 3]
  Model B: Wrong on cases [1, 2, 3]  ← No help!
  Model C: Wrong on cases [1, 2, 3]

If models make DIFFERENT errors:
  Model A: Wrong on cases [1, 2]
  Model B: Wrong on cases [3, 4]     ← Can combine!
  Model C: Wrong on cases [5, 6]

Meta-model learns: "Use A for cases like 3,4,5,6"
```
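
The two scenarios above can be checked with a few lines of plain Python. This is a sketch under one assumption: the task is binary, so a majority vote is wrong on a case exactly when more than half of the models are wrong on it. The helper name `majority_vote_errors` is made up for this illustration.

```python
def majority_vote_errors(error_sets, n_cases):
    """Return the cases a majority vote gets wrong, given each model's error set.

    Assumes binary classification: the vote is wrong on a case exactly when
    more than half of the models are wrong on it.
    """
    wrong = set()
    for case in range(1, n_cases + 1):
        n_wrong = sum(case in errors for errors in error_sets)
        if n_wrong > len(error_sets) / 2:
            wrong.add(case)
    return wrong


same_errors = [{1, 2, 3}, {1, 2, 3}, {1, 2, 3}]
different_errors = [{1, 2}, {3, 4}, {5, 6}]

print(majority_vote_errors(same_errors, n_cases=6))       # {1, 2, 3}
print(majority_vote_errors(different_errors, n_cases=6))  # set()
```

With identical errors the ensemble inherits every mistake; with disjoint errors, each mistake is outvoted by the two models that got the case right.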

**Our diverse base models:**
- **Logistic Regression** - Linear, fast, interpretable
- **Random Forest** - Non-linear, robust
- **SVM** - Maximum margin, kernel trick
- **KNN** - Instance-based, local patterns
- **Gradient Boosting** - Sequential, error correction

In [None]:
# Define base models
base_models = [
    ('lr', LogisticRegression(max_iter=1000, random_state=42)),
    ('rf', RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)),
    ('svm', SVC(probability=True, random_state=42)),
    ('knn', KNeighborsClassifier(n_neighbors=5)),
    ('gb', GradientBoostingClassifier(n_estimators=100, random_state=42))
]

# Train and evaluate each
base_results = []

for name, model in base_models:
    # Train
    if name in ['lr', 'svm', 'knn']:
        model.fit(X_train_scaled, y_train)
        y_pred = model.predict(X_test_scaled)
        y_prob = model.predict_proba(X_test_scaled)[:, 1]
    else:
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        y_prob = model.predict_proba(X_test)[:, 1]
    
    # Metrics
    accuracy = accuracy_score(y_test, y_pred)
    roc_auc = roc_auc_score(y_test, y_prob)
    
    base_results.append({
        'Model': name.upper(),
        'Accuracy': accuracy,
        'ROC-AUC': roc_auc
    })

base_df = pd.DataFrame(base_results)

### 📊 Base Model Performance

In [None]:
# Visualize base models
fig = go.Figure()

fig.add_trace(go.Bar(
    name='Accuracy',
    x=base_df['Model'],
    y=base_df['Accuracy'],
    text=[f"{x:.3f}" for x in base_df['Accuracy']],
    textposition='auto',
    marker_color='lightblue'
))

fig.add_trace(go.Bar(
    name='ROC-AUC',
    x=base_df['Model'],
    y=base_df['ROC-AUC'],
    text=[f"{x:.3f}" for x in base_df['ROC-AUC']],
    textposition='auto',
    marker_color='lightcoral'
))

fig.update_layout(
    title='Individual Base Model Performance',
    xaxis_title='Model',
    yaxis_title='Score',
    barmode='group',
    template='plotly_white',
    height=450
)

fig.show()

print(f"\n📊 Best single model: {base_df.loc[base_df['ROC-AUC'].idxmax(), 'Model']}")
print(f"   ROC-AUC: {base_df['ROC-AUC'].max():.4f}")

---
## 4. Understanding Model Diversity

### 🔍 Do Models Make Different Errors?

**Key question:** Are predictions correlated?
- **High correlation:** Models agree (less benefit from stacking)
- **Low correlation:** Models disagree (great for stacking!)

In [None]:
# Get predictions from each model
predictions = {}

for name, model in base_models:
    if name in ['lr', 'svm', 'knn']:
        pred = model.predict_proba(X_test_scaled)[:, 1]
    else:
        pred = model.predict_proba(X_test)[:, 1]
    predictions[name.upper()] = pred

pred_df = pd.DataFrame(predictions)

# Calculate correlation
correlation = pred_df.corr()

In [None]:
# Visualize correlation heatmap
fig = go.Figure(data=go.Heatmap(
    z=correlation.values,
    x=correlation.columns,
    y=correlation.columns,
    text=np.round(correlation.values, 2),
    texttemplate='%{text}',
    textfont={"size": 12},
    colorscale='RdBu_r',
    zmid=0,
    colorbar=dict(title='Correlation')
))

fig.update_layout(
    title='Base Model Prediction Correlation<br>Lower = More Diverse = Better for Stacking!',
    template='plotly_white',
    height=500,
    xaxis=dict(side='bottom')
)

fig.show()

# Calculate average correlation (excluding diagonal)
mask = np.ones_like(correlation, dtype=bool)
np.fill_diagonal(mask, False)
avg_corr = correlation.values[mask].mean()

print(f"\n📊 Average correlation: {avg_corr:.3f}")
if avg_corr < 0.7:
    print("✅ Good diversity! Models disagree enough for effective stacking.")
else:
    print("⚠️ High correlation - models are similar. Stacking may have limited benefit.")
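
Correlated probabilities are only a proxy for the section's real question, which is whether the models make the same *errors*. One way to measure that directly is the overlap between two models' error sets. The sketch below uses a Jaccard overlap on toy arrays; the function name `error_overlap` and the toy data are assumptions for illustration, not part of the notebook.

```python
import numpy as np


def error_overlap(y_true, pred_a, pred_b):
    """Jaccard overlap of two models' error sets (0 = disjoint, 1 = identical)."""
    y_true = np.asarray(y_true)
    err_a = np.asarray(pred_a) != y_true
    err_b = np.asarray(pred_b) != y_true
    union = np.logical_or(err_a, err_b).sum()
    if union == 0:
        return 0.0  # neither model makes any error
    return float(np.logical_and(err_a, err_b).sum() / union)


# Toy labels and predictions (hypothetical)
y_toy = [0, 1, 1, 0, 1, 0]
pred_a = [0, 1, 0, 0, 1, 1]  # wrong at positions 2 and 5
pred_b = [1, 1, 0, 0, 1, 0]  # wrong at positions 0 and 2

print(error_overlap(y_toy, pred_a, pred_b))  # 1 shared error out of 3 distinct
print(error_overlap(y_toy, pred_a, pred_a))  # identical models
```

The same function could be applied to the notebook's binary predictions and `y_test`; a low overlap means the models fail on different cases.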

---
## 5. Build Stacking Ensemble

### 🏗️ Stacking Architecture

```
LAYER 0: Original Features
         [age, job, balance, ...]
                  ↓
LAYER 1: Base Models (trained on original features)
    ┌──────────┬──────────┬──────────┬──────────┬──────────┐
    │    LR    │    RF    │   SVM    │   KNN    │    GB    │
    └──────────┴──────────┴──────────┴──────────┴──────────┘
         │          │          │          │          │
       P=0.7      P=0.3      P=0.8      P=0.4      P=0.9
                  ↓
LAYER 2: Meta-Model (trained on base model predictions)
         [0.7, 0.3, 0.8, 0.4, 0.9] → Meta-Learner
                  ↓
         Final Prediction: 0.75
```

**Important:** Meta-model is trained on **out-of-fold predictions** to prevent overfitting!
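
To show what "out-of-fold" means in practice, here is a hand-rolled sketch of the idea using `cross_val_predict`. It uses a small synthetic dataset and two stand-in base models (the `*_demo` names are assumptions for this example); it illustrates the principle and is not a description of `StackingClassifier`'s exact internals.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.tree import DecisionTreeClassifier

# Small synthetic dataset, for the demo only
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)

demo_models = {
    "lr": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(max_depth=3, random_state=0),
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Each row's probability comes from a model that never saw that row in training
oof = np.column_stack([
    cross_val_predict(model, X_demo, y_demo, cv=cv, method="predict_proba")[:, 1]
    for model in demo_models.values()
])

# The meta-model is trained on these out-of-fold probabilities
meta_demo = LogisticRegression(max_iter=1000).fit(oof, y_demo)
print(dict(zip(demo_models, meta_demo.coef_[0].round(3))))
```

If the meta-model were instead trained on predictions the base models made for their own training rows, it would see overconfident inputs and learn to over-trust the most overfit base model.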

### 🎯 Choosing the Meta-Model

**Options:**
- **Logistic Regression** ← Most common (simple, interpretable)
- **Random Forest** (can capture non-linear combinations)
- **Gradient Boosting** (powerful, but risk of overfitting)
- **Neural Network** (maximum flexibility)

**We'll use Logistic Regression** - learns linear weights for each base model.

In [None]:
# Create stacking classifier
stacking_clf = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,  # 5-fold CV for meta-features
    n_jobs=-1
)

# Note: StackingClassifier automatically handles:
# 1. Training (clones of) the base models on the training data
# 2. Generating out-of-fold predictions with 5-fold CV
# 3. Training the meta-model on those predictions

# LR, SVM and KNN need scaled inputs; the tree-based models do not,
# but scaling does not hurt them, so for simplicity we pass scaled data to all
stacking_clf.fit(X_train_scaled, y_train)

# Predictions
y_pred_stack = stacking_clf.predict(X_test_scaled)
y_prob_stack = stacking_clf.predict_proba(X_test_scaled)[:, 1]

# Metrics
accuracy_stack = accuracy_score(y_test, y_pred_stack)
roc_auc_stack = roc_auc_score(y_test, y_prob_stack)

print("\n🏆 Stacking Ensemble Performance:")
print(f"   Accuracy: {accuracy_stack:.4f}")
print(f"   ROC-AUC:  {roc_auc_stack:.4f}")
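
An alternative to scaling everything up front is to wrap only the scale-sensitive models in a `Pipeline`. The scaler is then refit inside every CV fold, and the tree models see raw features. The sketch below assumes a small synthetic dataset and a reduced set of base models (all `*_demo` and `piped_*` names are made up for this example).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic dataset, for the demo only
X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

piped_models = [
    # Scale-sensitive models carry their own scaler
    ("lr", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
    ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))),
    # Tree ensembles work on the raw features
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
]

piped_stack = StackingClassifier(
    estimators=piped_models,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
piped_stack.fit(Xtr, ytr)  # unscaled data goes in; pipelines handle the rest
demo_probs = piped_stack.predict_proba(Xte)[:, 1]
print(f"Predicted {len(demo_probs)} probabilities")
```

This keeps preprocessing attached to the model that needs it, which also simplifies deployment: the fitted stack accepts raw features directly.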

---
## 6. Compare: Base Models vs Stacking

### 📊 The Moment of Truth!

In [None]:
# Add stacking to results
comparison = base_df.copy()
comparison = pd.concat([comparison, pd.DataFrame([{
    'Model': 'STACKING',
    'Accuracy': accuracy_stack,
    'ROC-AUC': roc_auc_stack
}])], ignore_index=True)

# Sort by ROC-AUC
comparison = comparison.sort_values('ROC-AUC', ascending=True)

In [None]:
# Create comparison visualization
fig = go.Figure()

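# Highlight the stacking bar by name: rows were re-sorted by ROC-AUC above,
# so STACKING is not guaranteed to be the last row
is_stack = comparison['Model'] == 'STACKING'
colors = ['gold' if s else 'lightblue' for s in is_stack]

fig.add_trace(go.Bar(
    x=comparison['ROC-AUC'],
    y=comparison['Model'],
    orientation='h',
    text=[f"{x:.4f}" for x in comparison['ROC-AUC']],
    textposition='auto',
    marker_color=colors,
    marker_line_color='black',
    marker_line_width=[3 if s else 1 for s in is_stack]
))

fig.update_layout(
    title='🏆 Stacking vs Individual Models (ROC-AUC)',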
    xaxis_title='ROC-AUC Score',
    yaxis_title='Model',
    template='plotly_white',
    height=450,
    showlegend=False
)

fig.show()

# Calculate improvement
best_base = comparison[comparison['Model'] != 'STACKING']['ROC-AUC'].max()
improvement = roc_auc_stack - best_base
pct_improvement = (improvement / best_base) * 100

print(f"\n📈 Performance Gain:")
print(f"   Best base model:  {best_base:.4f}")
print(f"   Stacking:         {roc_auc_stack:.4f}")
print(f"   Improvement:      {improvement:+.4f} ({pct_improvement:+.2f}%)")

if improvement > 0:
    print("\n✅ Stacking wins! Meta-model successfully combines base models.")
else:
    print("\n⚠️ No improvement. Possible reasons: high correlation or overfitting.")

---
## 7. Understanding the Meta-Model

### 🔍 What Did the Meta-Model Learn?

The meta-model (Logistic Regression) assigns **weights** to each base model.

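**Interpretation:**
- **Positive weight:** A higher probability from this model raises the final prediction
- **Larger weight:** The meta-model leans on this model more
- **Negative weight:** Inverse relationship (rare; often a side effect of highly correlated base models)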

In [None]:
# Extract meta-model coefficients
meta_model = stacking_clf.final_estimator_
coefficients = meta_model.coef_[0]
model_names = [name.upper() for name, _ in base_models]

coef_df = pd.DataFrame({
    'Model': model_names,
    'Weight': coefficients,
    'Abs_Weight': np.abs(coefficients)
}).sort_values('Weight', ascending=True)

In [None]:
# Visualize meta-model weights
fig = go.Figure()

colors = ['red' if x < 0 else 'green' for x in coef_df['Weight']]

fig.add_trace(go.Bar(
    x=coef_df['Weight'],
    y=coef_df['Model'],
    orientation='h',
    text=[f"{x:.3f}" for x in coef_df['Weight']],
    textposition='auto',
    marker_color=colors
))

fig.add_vline(x=0, line_dash="dash", line_color="gray")

fig.update_layout(
    title='Meta-Model Weights: How Much to Trust Each Base Model',
    xaxis_title='Weight (Higher = More Influence)',
    yaxis_title='Base Model',
    template='plotly_white',
    height=400
)

fig.show()

print("\n📊 Meta-Model Insights:")
most_trusted = coef_df.loc[coef_df['Abs_Weight'].idxmax()]
print(f"   Most trusted model: {most_trusted['Model']} (weight: {most_trusted['Weight']:.3f})")
print(f"\n💡 The meta-model learned how much to trust each base model overall!")
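
To see how these weights turn base-model probabilities into a final prediction, the sketch below reproduces a stacked prediction by hand: a sigmoid applied to the intercept plus the weighted sum of base-model probabilities. It assumes a binary task, a logistic-regression meta-model, and the default `passthrough=False`; the small dataset and the `demo_*` names are made up for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Small synthetic dataset, for the demo only
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

demo_stack = StackingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
).fit(X_demo, y_demo)

# transform() returns the meta-features: one positive-class probability per base model
meta_features = demo_stack.transform(X_demo)
meta = demo_stack.final_estimator_

# Reproduce the final probability by hand: sigmoid(intercept + weights . probabilities)
z = meta.intercept_[0] + meta_features @ meta.coef_[0]
manual_prob = 1.0 / (1.0 + np.exp(-z))

print(np.allclose(manual_prob, demo_stack.predict_proba(X_demo)[:, 1]))
```

Because the weights act on the log-odds scale and are the same for every row, a linear meta-model expresses overall trust in each base model rather than case-by-case rules.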

---
## 8. Visualize Prediction Agreement

### 🎯 When Do Models Agree vs Disagree?

Stacking shines when base models disagree!

In [None]:
# Get binary predictions from each model
binary_preds = {}
for name, model in base_models:
    if name in ['lr', 'svm', 'knn']:
        binary_preds[name.upper()] = model.predict(X_test_scaled)
    else:
        binary_preds[name.upper()] = model.predict(X_test)

binary_df = pd.DataFrame(binary_preds)

# Count "yes" votes per test case (0 or 5 = unanimous, 2 or 3 = split)
agreement = binary_df.sum(axis=1)

# Add stacking prediction
binary_df['STACKING'] = y_pred_stack
binary_df['Agreement'] = agreement
binary_df['True_Label'] = y_test.values

In [None]:
# Visualize agreement distribution
agreement_counts = agreement.value_counts().sort_index()

fig = go.Figure()

fig.add_trace(go.Bar(
    x=agreement_counts.index,
    y=agreement_counts.values,
    text=agreement_counts.values,
    textposition='auto',
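    # Index the palette by vote count so colors stay stable if a count is missing
    marker_color=[['#E74C3C', '#E67E22', '#F39C12', '#52BE80', '#3498DB', '#9B59B6'][int(v)]
                  for v in agreement_counts.index]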
))

fig.update_layout(
    title='Model Agreement Distribution<br>How Many Base Models Predict Positive Class?',
    xaxis_title='Number of Models Voting "Yes"',
    yaxis_title='Number of Test Cases',
    template='plotly_white',
    height=400
)

fig.show()

print(f"\n📊 Agreement Analysis:")
print(f"   Cases with full agreement (0 or 5 votes): {((agreement == 0) | (agreement == 5)).sum()}")
print(f"   Cases with split decisions (2-3 votes): {((agreement == 2) | (agreement == 3)).sum()}")
print(f"\n💡 Split decisions are where stacking adds the most value!")
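
The claim about split decisions can be tested rather than assumed by breaking accuracy down by vote count. The helper below is a sketch on toy arrays; `accuracy_by_votes` and the toy data are hypothetical, and the same call could be made with `agreement`, `y_pred_stack`, and `y_test` from the cells above.

```python
import numpy as np


def accuracy_by_votes(votes, y_pred, y_true):
    """Map each 'yes'-vote count to the accuracy of y_pred on those cases."""
    votes, y_pred, y_true = (np.asarray(a) for a in (votes, y_pred, y_true))
    return {
        int(v): float((y_pred[votes == v] == y_true[votes == v]).mean())
        for v in np.unique(votes)
    }


# Toy arrays (hypothetical) standing in for agreement, y_pred_stack and y_test
toy_votes = [0, 0, 5, 5, 2, 3, 2, 3]
toy_pred = [0, 0, 1, 1, 0, 1, 1, 1]
toy_true = [0, 0, 1, 1, 0, 1, 0, 0]

print(accuracy_by_votes(toy_votes, toy_pred, toy_true))
# {0: 1.0, 2: 0.5, 3: 0.5, 5: 1.0}
```

Running the same breakdown for the best single model and comparing the 2-vote and 3-vote rows shows whether stacking actually helps on the contested cases.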

---
## 9. When to Use Stacking

### ✅ Use Stacking When:

**1. You Have Diverse Base Models**
```
✅ Different algorithms (linear, tree, instance-based)
✅ Different feature subsets
✅ Different hyperparameters
✅ Low correlation between predictions
```

**2. You Need Maximum Performance**
```
✅ Competitions (Kaggle)
✅ Critical applications
✅ Small performance gains matter
✅ Have computational resources
```

**3. You Have Enough Data**
```
✅ Large dataset (>10K samples)
✅ Can afford train/validation split
✅ Avoid overfitting
```

---

### ❌ Don't Use Stacking When:

**1. Base Models Are Too Similar**
```
❌ All tree-based models
❌ Same algorithm, slightly different hyperparameters
❌ High correlation (>0.9)
→ Simple voting ensemble is enough
```

**2. Limited Resources**
```
❌ Need fast predictions
❌ Limited memory
❌ Limited training time
→ Use best single model
```

**3. Small Dataset**
```
❌ <5K samples
❌ Risk of overfitting
❌ Not enough data for meta-model
→ Stick with single model + CV
```

**4. Need Interpretability**
```
❌ Must explain predictions
❌ Regulatory requirements
❌ Medical/legal applications
→ Use interpretable single model
```

---

### 🎯 Stacking vs Other Ensembles:

| Method | How It Works | Pros | Cons |
|--------|--------------|------|------|
| **Voting** | Average/majority vote | Simple, fast | No learning |
| **Bagging** | Bootstrap + average (e.g., Random Forest) | Reduces variance | Same algorithm |
| **Boosting** | Sequential error correction (e.g., XGBoost) | Powerful | Risk overfitting |
| **Stacking** | Meta-model learns combination | Maximum performance | Complex, slow |
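
The Voting and Stacking rows of the table differ by only a few lines of scikit-learn code. The sketch below fits both on a small synthetic dataset with two stand-in base models (the dataset, the helper `demo_estimators`, and the other `demo_*` names are assumptions for this example). Which one scores higher varies by dataset, so no outcome is implied.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Small synthetic dataset, for the demo only
X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)


def demo_estimators():
    """Fresh, unfitted base models so the two ensembles do not share state."""
    return [
        ("lr", LogisticRegression(max_iter=1000)),
        ("tree", DecisionTreeClassifier(max_depth=4, random_state=0)),
    ]


# Soft voting: average the base models' probabilities with equal weights
voting = VotingClassifier(estimators=demo_estimators(), voting="soft").fit(Xtr, ytr)

# Stacking: a meta-model learns how to weight those probabilities
stacking = StackingClassifier(
    estimators=demo_estimators(),
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
).fit(Xtr, ytr)

demo_auc = {
    name: roc_auc_score(yte, model.predict_proba(Xte)[:, 1])
    for name, model in [("voting", voting), ("stacking", stacking)]
}
for name, auc in demo_auc.items():
    print(f"{name:9s} ROC-AUC: {auc:.4f}")
```

Soft voting is a useful baseline: if stacking cannot beat an equal-weight average, the extra training cost is hard to justify.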

---

## 10. Key Takeaways

### 🎯 Core Concepts:

**1. Stacking = Two-Level Learning**
- Level 1: Base models learn from data
- Level 2: Meta-model learns from base models
- **Key:** Uses out-of-fold predictions to avoid overfitting

**2. Diversity Is Everything**
- Choose different algorithm families
- Low correlation = high diversity = better stacking
- If models always agree, stacking won't help much

**3. Meta-Model Choices**
- **Logistic Regression:** Simple, linear combination (most common)
- **Random Forest:** Can learn non-linear combinations
- **Neural Network:** Maximum flexibility (for complex patterns)

**4. Trade-offs**
- **Pro:** Often beats the best single model when base models are diverse
- **Con:** Slower, more complex, harder to interpret
- **Use:** When that extra 1-2% matters!

---

### 💡 Practical Tips:

1. **Start with 3-5 diverse base models** (more isn't always better)
2. **Ensure base models are well-tuned** (garbage in, garbage out)
3. **Use cross-validation** for meta-features (prevents leakage)
4. **Monitor for overfitting** (meta-model can overfit to base predictions)
5. **Compare to best single model** (is complexity worth it?)
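
Tips 4 and 5 can be combined into one check: score the stack and each base model with the same outer cross-validation, so the comparison does not hinge on a single train/test split. This is a minimal sketch on a small synthetic dataset with two stand-in base models; all names are assumptions for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Small synthetic dataset, for the demo only
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)
outer_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

single_models = {
    "lr": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(max_depth=4, random_state=0),
}
candidates = dict(single_models)
candidates["stacking"] = StackingClassifier(
    estimators=list(single_models.items()),
    final_estimator=LogisticRegression(max_iter=1000),
    cv=3,  # inner CV that produces the meta-features
)

# Same outer folds for every candidate, so the scores are directly comparable
cv_auc = {
    name: cross_val_score(model, X_demo, y_demo, cv=outer_cv, scoring="roc_auc")
    for name, model in candidates.items()
}
for name, scores in cv_auc.items():
    print(f"{name:9s} ROC-AUC: {scores.mean():.3f} +/- {scores.std():.3f}")
```

If the stack's mean gain is smaller than the fold-to-fold spread, the simpler single model is probably the better choice.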

---

### 🏆 Our Results:

- Combined 5 diverse models (LR, RF, SVM, KNN, GB)
- Stacking ROC-AUC: see `roc_auc_stack`, printed in Section 5
- Gain over the best single model: see `pct_improvement`, printed in Section 6
- Meta-model learned a weight for each base model (Section 7)

**Whether stacking beat the best single model depends on the run; check the Section 6 output before drawing conclusions.**

---

**AI Tech Institute** | *Building Tomorrow's AI Engineers Today*