
# 📊 Stacking Ensemble Learning – Implementation in Machine Learning

## 📌 Objective
The goal of this notebook is to implement **Stacking Ensemble Learning**, a powerful technique
that combines multiple base models and a meta-model to improve prediction performance.

## 🧠 Why Stacking?
- Individual models have limitations
- Different models capture different patterns
- Stacking leverages strengths of multiple models

This notebook demonstrates stacking with **scikit-learn** models combined through the `StackingClassifier` from **mlxtend**, using a clean ML pipeline.


## 🔍 What is Stacking Ensemble?

Stacking is an ensemble learning technique where:
- Multiple **base models** are trained on the same dataset
- Their predictions are used as input features for a **meta-model**
- The meta-model learns how to best combine base model predictions

### 📐 Architecture:
1. Train base learners (e.g., Logistic Regression, Decision Tree, KNN)
2. Generate predictions from base learners
3. Train a meta-learner on these predictions
4. Final prediction comes from the meta-learner
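
The four steps above can be sketched by hand. This is a minimal illustration, not the code used later in this notebook: it runs on synthetic data from `make_classification` (an assumption, standing in for a real dataset), uses KNN and a decision tree as base learners, and builds the meta-features from out-of-fold predictions via `cross_val_predict`, which is one common way to keep the meta-learner from seeing predictions made on data the base learners were trained on.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the dataset used in this notebook)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

base_models = [
    KNeighborsClassifier(),
    DecisionTreeClassifier(max_depth=3, random_state=0),
]

# Steps 1-2: out-of-fold probability of class 1 from each base learner
meta_train = np.column_stack([
    cross_val_predict(m, X_tr, y_tr, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])

# Step 3: train the meta-learner on the base learners' predictions
meta_learner = LogisticRegression().fit(meta_train, y_tr)

# Step 4: refit each base learner on the full training set, build the
# test-time meta-features, and let the meta-learner make the final call
meta_test = np.column_stack([
    m.fit(X_tr, y_tr).predict_proba(X_te)[:, 1] for m in base_models
])
final_pred = meta_learner.predict(meta_test)
print(meta_train.shape, final_pred.shape)
```

Each column of `meta_train` corresponds to one base learner, so the meta-learner effectively learns how much weight to give each of them.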


# Implementation of Stacking

Step 1: Importing the Required Libraries



In [15]:
# =========================
# Data Handling & Analysis
# =========================
import pandas as pd
# Used for loading, cleaning, and manipulating structured datasets (DataFrames)

# =========================
# Data Visualization
# =========================
import matplotlib.pyplot as plt
# Used for plotting graphs and visualizing data distributions and results

# =========================
# Ensemble Learning
# =========================
from mlxtend.classifier import StackingClassifier
# Implements stacking ensemble technique by combining multiple base models and a meta-model

# =========================
# Data Preprocessing & Splitting
# =========================
from sklearn.model_selection import train_test_split
# Splits dataset into training and testing sets for model evaluation

from sklearn.preprocessing import StandardScaler
# Standardizes features by removing the mean and scaling to unit variance
# Important for distance-based algorithms like KNN

# =========================
# Machine Learning Models
# =========================
from sklearn.linear_model import LogisticRegression
# Linear classification algorithm, also used as a meta-learner in stacking

from sklearn.neighbors import KNeighborsClassifier
# Instance-based learning algorithm that classifies based on nearest neighbors

from sklearn.naive_bayes import GaussianNB
# Probabilistic classifier based on Bayes' theorem with Gaussian distribution assumption

# =========================
# Model Evaluation
# =========================
from sklearn.metrics import accuracy_score
# Measures the percentage of correct predictions made by the model


Step 2: Loading the Dataset

In [2]:
df = pd.read_csv('/content/heart.csv')

In [3]:
df.head()

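| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |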


Step 3: Separating Features from the Target Variable

In [4]:
X = df.drop('target', axis = 1)
y = df['target']

# drop(): Removes the target column from features.
# df['target']: Selects the target column for prediction.

In [5]:
X

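| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 298 | 57 | 0 | 0 | 140 | 241 | 0 | 1 | 123 | 1 | 0.2 | 1 | 0 | 3 |
| 299 | 45 | 1 | 3 | 110 | 264 | 0 | 1 | 132 | 0 | 1.2 | 1 | 0 | 3 |
| 300 | 68 | 1 | 0 | 144 | 193 | 1 | 1 | 141 | 0 | 3.4 | 1 | 2 | 3 |
| 301 | 57 | 1 | 0 | 130 | 131 | 0 | 1 | 115 | 1 | 1.2 | 1 | 1 | 3 |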


In [6]:
y

| | target |
|---|---|
| 0 | 1 |
| 1 | 1 |
| 2 | 1 |
| 3 | 1 |
| 4 | 1 |
| ... | ... |
| 298 | 0 |
| 299 | 0 |
| 300 | 0 |
| 301 | 0 |


Step 4: Splitting the Data into Training and Testing Sets

- 80% data is used for training
- 20% data is used for testing
- This helps evaluate model generalization on unseen data


In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

# train_test_split(): Splits data into train and test sets.
# test_size = 0.2: Specifies that 20% of the data should be used for testing, leaving 80% for training.
# random_state = 42: Ensures reproducibility by setting a fixed seed for random number generation.
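
To make the effect of `test_size` and `random_state` concrete, here is a small sketch on a hypothetical array of 300 rows (the row count and the `*_demo` names are assumptions for illustration, not the heart dataset). It checks the resulting split sizes and confirms that two calls with the same seed give the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 300 rows, 2 features, alternating binary labels
X_demo = np.arange(300 * 2).reshape(300, 2)
y_demo = np.arange(300) % 2

# First split with a fixed seed
Xa_train, Xa_test, ya_train, ya_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
# Second split with the same seed: should be identical to the first
Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

print(len(Xa_train), len(Xa_test))       # 240 60
print(np.array_equal(Xa_test, Xb_test))  # True
```

Changing `random_state` produces a different but equally valid split, which is why scores reported on a single split can shift by a few points.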


Step 5: Standardizing the Data

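We will standardize the continuous numerical features so that each has a mean of 0 and a standard deviation of 1 on the training data. This matters most for distance-based models such as KNN, where features measured on larger scales would otherwise dominate the distance computation.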

- StandardScaler(): Standardizes features.

- fit_transform(): Learns scaling parameters from training data and applies them.

- transform(): Applies learned scaling to test data.

- var_transform: Specifies the list of feature columns that need to be standardized.

- X_train[var_transform]: Applies the fit_transform method to standardize the selected columns in the training data.

- X_test[var_transform]: Applies the transform method to standardize the corresponding columns in the test data using the scaling parameters from the training data.

In [8]:
sc = StandardScaler()

var_transform = ['thalach', 'age', 'trestbps', 'oldpeak', 'chol']
X_train[var_transform] = sc.fit_transform(X_train[var_transform])
X_test[var_transform] = sc.transform(X_test[var_transform])

X_train.head()

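| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 132 | -1.356798 | 1 | 1 | -0.616856 | 0.914034 | 0 | 1 | 0.532781 | 0 | -0.920864 | 2 | 0 | 2 |
| 202 | 0.385086 | 1 | 0 | 1.169491 | 0.439527 | 0 | 0 | -1.753582 | 1 | -0.193787 | 2 | 0 | 3 |
| 196 | -0.921327 | 1 | 2 | 1.169491 | -0.300704 | 0 | 1 | -0.139679 | 0 | 2.350982 | 1 | 0 | 2 |
| 75 | 0.058483 | 0 | 1 | 0.276318 | 0.059921 | 0 | 0 | 0.48795 | 0 | 0.351521 | 1 | 0 | 2 |
| 176 | 0.602822 | 1 | 0 | -0.79549 | -0.319684 | 1 | 1 | 0.443119 | 1 | 0.351521 | 2 | 2 | 3 |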
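
The reason for calling `fit_transform()` on the training data and only `transform()` on the test data can be checked on a toy column. The sketch below uses randomly generated values (an assumption, loosely shaped like a blood-pressure column) and shows that the test set is scaled with the training mean and standard deviation, so its own mean and standard deviation are close to, but not exactly, 0 and 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature data, generated for illustration only
rng = np.random.default_rng(0)
train_col = rng.normal(loc=130.0, scale=15.0, size=(200, 1))
test_col = rng.normal(loc=130.0, scale=15.0, size=(50, 1))

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_col)  # learns mean/std from train
test_scaled = scaler.transform(test_col)        # reuses the train mean/std

# The transform is (x - train_mean) / train_std for both sets
manual_test = (test_col - train_col.mean()) / train_col.std()

print(train_scaled.mean(), train_scaled.std())  # approximately 0 and 1
print(np.allclose(test_scaled, manual_test))    # True
```

Fitting the scaler on the test data as well would leak information about the test distribution into preprocessing.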


Step 6: Building First Layer Estimators

We will create the base models that form the first layer of our stacking model. For this example, we'll use a K-Nearest Neighbors classifier and a Naive Bayes classifier.

- KNeighborsClassifier(): A model based on nearest neighbors.

- GaussianNB(): A Naive Bayes classifier assuming Gaussian distribution.

In [9]:
KNC = KNeighborsClassifier()
NB = GaussianNB()

Step 7: Training and Evaluating KNeighborsClassifier

- fit(): Trains the model.

- predict(): Makes predictions on test data.

- accuracy_score(): Calculates the fraction of test predictions that match the true labels.

In [11]:
model_kNeighborsClassifier = KNC.fit(X_train, y_train)
pred_knc = model_kNeighborsClassifier.predict(X_test)

acc_knc = accuracy_score(y_test, pred_knc)
print('Accuracy Score of KNeighbors Classifier:', acc_knc * 100)

Accuracy Score of KNeighbors Classifier: 86.88524590163934


Step 8: Training and Evaluating Naive Bayes Classifier

In [12]:
model_NaiveBayes = NB.fit(X_train, y_train)
pred_nb = model_NaiveBayes.predict(X_test)

acc_nb = accuracy_score(y_test, pred_nb)
print('Accuracy of Naive Bayes Classifier:', acc_nb * 100)

Accuracy of Naive Bayes Classifier: 86.88524590163934
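
The printed score is consistent with 53 correct predictions out of 61 test samples. Both numbers are inferred from the score and the 20% split rather than stated in the notebook, so treat them as assumptions. The sketch below rebuilds that ratio with made-up label arrays to show what `accuracy_score()` computes.

```python
import numpy as np
from sklearn.metrics import accuracy_score

n_test = 61     # inferred test-set size (assumption, see text above)
n_correct = 53  # inferred number of correct predictions (assumption)

# Made-up labels: start from perfect predictions, then flip 8 of them
y_true_demo = np.ones(n_test, dtype=int)
y_pred_demo = y_true_demo.copy()
y_pred_demo[: n_test - n_correct] = 0

acc_demo = accuracy_score(y_true_demo, y_pred_demo)
print(acc_demo * 100)  # about 86.885, i.e. 53 / 61
```

Two models can share the same accuracy while being wrong on different samples, which is the situation stacking tries to exploit.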


Step 9: Implementing the Stacking Classifier

Now, we will combine the base models using a Stacking Classifier. The meta-model is a logistic regression model that takes the predicted class probabilities of KNN and Naive Bayes as input.

- StackingClassifier(): Combines base models and a meta-model.

- classifiers: List of base learners.

- meta_classifier: Model that learns from base learners' predictions.

- use_probas=True: Passes probability outputs to the meta-model instead of class labels.
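
To see what passing probabilities instead of class labels means for the meta-model's input, the sketch below builds both kinds of meta-features by hand. It uses synthetic data and plain scikit-learn calls (assumptions for illustration) and is not mlxtend's internal code; it only shows the shape and content of the two alternatives.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary-classification data (assumption, not the heart dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=1)

fitted = [
    KNeighborsClassifier().fit(X_demo, y_demo),
    GaussianNB().fit(X_demo, y_demo),
]

# Class labels: one column per base model, values are 0 or 1
label_features = np.column_stack([m.predict(X_demo) for m in fitted])

# Probabilities: one column per class per base model, values in [0, 1]
proba_features = np.hstack([m.predict_proba(X_demo) for m in fitted])

print(label_features.shape)  # (200, 2)
print(proba_features.shape)  # (200, 4)
```

Probabilities carry each base model's confidence, so the meta-model can tell a 0.51 vote from a 0.99 vote, which hard labels would hide.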

In [13]:
base_learners = [
    KNeighborsClassifier(),
    GaussianNB()
]
meta_model = LogisticRegression()

stacking_model = StackingClassifier(classifiers=base_learners, meta_classifier=meta_model, use_probas=True)

Step 10: Training and Evaluating the Stacking Classifier


In [14]:
model_stack = stacking_model.fit(X_train, y_train)
pred_stack = model_stack.predict(X_test)

acc_stack = accuracy_score(y_test, pred_stack)
print('Accuracy Score of Stacked Model:', acc_stack * 100)

Accuracy Score of Stacked Model: 88.52459016393442
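
For comparison, scikit-learn ships its own `StackingClassifier` in `sklearn.ensemble`, which trains the final estimator on cross-validated predictions of the base models. The sketch below is an alternative under stated assumptions, not the method used in this notebook: it runs on synthetic data, and the class is imported under the alias `SkStackingClassifier` only to avoid clashing with the mlxtend import above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier as SkStackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption: not the dataset used in this notebook)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

sk_stack = SkStackingClassifier(
    estimators=[("knn", KNeighborsClassifier()), ("nb", GaussianNB())],
    final_estimator=LogisticRegression(),
    stack_method="predict_proba",  # feed probabilities to the meta-model
    cv=5,                          # out-of-fold predictions for the meta-model
)
sk_stack.fit(X_tr, y_tr)

sk_acc = accuracy_score(y_te, sk_stack.predict(X_te))
print("Accuracy of scikit-learn stacked model:", sk_acc * 100)
```

The `cv` setting means the meta-model never sees predictions a base model made on its own training rows, which tends to give a more honest picture of how the stack will generalize.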


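# Conclusion
Both individual models (KNN and Naive Bayes) achieved an accuracy of about 86.89%, while the stacked model reached about 88.52%. On this train/test split, combining the two models through stacking therefore gave a slight improvement over either model alone. With 61 test samples (20% of the data), that gap corresponds to one additional correct prediction (54 versus 53), so it is a modest gain on a single split rather than strong evidence of a general improvement.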