# Stacking Classifier for Diabetes Prediction

This notebook demonstrates how to build and evaluate a Stacking Classifier using the Pima Indians Diabetes Dataset. Stacking is an ensemble learning technique that feeds the predictions of multiple models (called base learners) as input features to a final model (called a meta-model).

In [1]:
# Import necessary libraries
import pandas as pd  # For data manipulation and reading CSV files
import numpy as np   # For numerical operations (not used directly in this notebook)

# Import train_test_split to divide the dataset into training and testing subsets
from sklearn.model_selection import train_test_split

# Import the individual machine learning models that will be used as base learners
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier # Multi-layer Perceptron (a simple neural network)

# Import the StackingClassifier, which allows us to combine the base learners
from sklearn.ensemble import StackingClassifier

# Import accuracy_score to evaluate the model's performance
from sklearn.metrics import accuracy_score

## 1. Load and Prepare the Data

We'll start by loading the dataset from a CSV file and preparing it for modeling.

In [2]:
# Load the dataset from the 'diabetes.csv' file into a pandas DataFrame
df = pd.read_csv('diabetes.csv')

# Display the first 5 rows of the DataFrame to get a quick overview of the data
print("First 5 rows of the dataset:")
print(df.head())

# Separate the features (independent variables) from the target (dependent variable)
# 'X' contains all columns except for 'Outcome'
X = df.drop('Outcome', axis='columns') 
# 'y' contains only the 'Outcome' column, which is what we want to predict
y = df['Outcome']

# Split the entire dataset into training and testing sets
# test_size=0.3 means 30% of the data will be used for testing, and 70% for training
# random_state=42 ensures that the split is the same every time the code is run, for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

print(f"\nTraining data shape: {X_train.shape}")
print(f"Testing data shape: {X_test.shape}")

First 5 rows of the dataset:
   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0            6      148             72             35        0  33.6   
1            1       85             66             29        0  26.6   
2            8      183             64              0        0  23.3   
3            1       89             66             23       94  28.1   
4            0      137             40             35      168  43.1   

   DiabetesPedigreeFunction  Age  Outcome  
0                     0.627   50        1  
1                     0.351   31        0  
2                     0.672   32        1  
3                     0.167   21        0  
4                     2.288   33        1  

Training data shape: (537, 8)
Testing data shape: (231, 8)
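
The split above does not pass `stratify`, so the share of positive cases can differ between the training and test sets. The sketch below shows what `stratify=y` would do. It uses hypothetical stand-in arrays (`X_demo`, `y_demo`) rather than the diabetes data, and it is a suggestion, not part of the original workflow.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 samples, 40% of them positive
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([1] * 40 + [0] * 60)

# stratify=y_demo keeps the class ratio the same in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

print(f"Positive share in train: {y_tr.mean():.2f}")
print(f"Positive share in test:  {y_te.mean():.2f}")
```

Applying the same option to the real data would change which rows land in the test set, so the accuracy reported later in this notebook would likely change as well.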


## 2. Define Base Learners and Meta-Model

Stacking involves two levels of models:
1.  **Base Learners**: A set of diverse models that are trained on the original dataset.
2.  **Meta-Model**: A final model that is trained on the *predictions* made by the base learners.
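
To make the two levels concrete, here is a minimal hand-rolled sketch of the idea, assuming synthetic data from `make_classification` in place of the diabetes dataset and only two base learners. The names `X_demo`, `y_demo`, `level0`, and `meta_features` are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the diabetes dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

level0 = [DecisionTreeClassifier(random_state=0), GaussianNB()]

# Level 0: each base learner contributes one column, its out-of-fold
# predicted probability of class 1 for every sample
meta_features = np.column_stack([
    cross_val_predict(model, X_demo, y_demo, cv=5, method="predict_proba")[:, 1]
    for model in level0
])

# Level 1: the meta-model is fit on the base learners' predictions only
meta = LogisticRegression().fit(meta_features, y_demo)

print(meta_features.shape)  # one row per sample, one column per base learner
```

To predict on new data, each base learner would also need to be refit on the full training set. `StackingClassifier` takes care of that step.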



In [3]:
# Define the list of base learners (level-0 models)
# Each model is a tuple containing a name and the model instance
base_learners = [
    ('dt', DecisionTreeClassifier(random_state=42)),
    ('ann', MLPClassifier(hidden_layer_sizes=(50,), max_iter=1000, random_state=42)),
    ('svm', SVC(kernel='linear', probability=True, random_state=42)), # probability=True exposes predict_proba, so the meta-model receives class probabilities rather than decision_function scores
    ('lr', LogisticRegression(max_iter=1000, random_state=42)),
    ('knn', KNeighborsClassifier(n_neighbors=5)),
    ('nb', GaussianNB())
]

# Define the meta-model (level-1 model)
# This model will learn from the outputs of the base learners
meta_model = LogisticRegression()
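
The base learners above receive the raw features, which are on very different scales (compare `Insulin` with `DiabetesPedigreeFunction`). Distance-based and gradient-based models such as KNN, logistic regression, and the MLP may benefit from standardization. One possible variant, shown as a sketch rather than as the notebook's method, wraps those learners in a pipeline. The name `scaled_learners` is hypothetical.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Each scale-sensitive learner gets its own scaler, fit only on the
# training folds it sees inside the stacking cross-validation
scaled_learners = [
    ('ann', make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(50,), max_iter=1000, random_state=42),
    )),
    ('lr', make_pipeline(
        StandardScaler(),
        LogisticRegression(max_iter=1000, random_state=42),
    )),
    ('knn', make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))),
]

for name, estimator in scaled_learners:
    print(name, [step_name for step_name, _ in estimator.steps])
```

A list like this can be passed as `estimators` in the same way as `base_learners`. Whether it improves accuracy on this dataset would have to be checked empirically.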

## 3. Create and Train the Stacking Ensemble

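Now we'll use scikit-learn's `StackingClassifier` to combine our base learners and the meta-model. Calling `fit` trains every base learner on the full training set and trains the meta-model on cross-validated (out-of-fold) predictions from those base learners, so the meta-model never learns from predictions a base learner made on its own training rows.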

In [4]:
# Create an instance of the Stacking Classifier
# `estimators`: The list of base learners
# `final_estimator`: The meta-model
# `cv=5`: The meta-model is trained on 5-fold out-of-fold predictions from the base learners
stacking_model = StackingClassifier(estimators=base_learners, final_estimator=meta_model, cv=5)

# Train the stacking model on the training data
# This fits every base learner on the full training set and fits the meta-model on their cross-validated predictions
print("Training the Stacking Classifier...")
stacking_model.fit(X_train, y_train)
print("Training complete.")

Training the Stacking Classifier...
Training complete.
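
After fitting, it can be useful to look at what the meta-model actually received. The sketch below does this on a small synthetic problem with two base learners. `demo_stack`, `X_demo`, and `y_demo` are illustrative names, and the data is an assumption rather than the diabetes dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the diabetes dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

demo_stack = StackingClassifier(
    estimators=[
        ('dt', DecisionTreeClassifier(random_state=0)),
        ('nb', GaussianNB()),
    ],
    final_estimator=LogisticRegression(),
    cv=5,
)
demo_stack.fit(X_demo, y_demo)

# Which method each base learner used to produce its meta-feature
print(demo_stack.stack_method_)

# transform() returns the meta-features the final estimator is fed;
# for a binary problem this is one probability column per base learner
meta_inputs = demo_stack.transform(X_demo)
print(meta_inputs.shape)

# One coefficient per base learner in the fitted meta-model
print(demo_stack.final_estimator_.coef_)
```

The same attributes exist on `stacking_model` once it has been fit, so the coefficients give a rough sense of how much weight the meta-model places on each base learner.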


## 4. Evaluate the Model's Performance

We'll use the trained stacking model to make predictions on the test set and then calculate its accuracy.

In [5]:
# Use the trained stacking model to make predictions on the test set
y_pred = stacking_model.predict(X_test)

# Calculate the accuracy by comparing the predicted values (y_pred) with the actual values (y_test)
accuracy = accuracy_score(y_test, y_pred)

# Print the final accuracy, formatted as a percentage with two decimal places
print(f'\nAccuracy of the Stacking Classifier: {accuracy * 100:.2f}%')


Accuracy of the Stacking Classifier: 75.32%
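
Accuracy alone can hide how the errors are distributed between the two classes. The sketch below shows how a confusion matrix, precision, and recall could be computed alongside it. The labels are a small hand-made example (`y_true_demo`, `y_pred_demo`), not the notebook's actual predictions.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Hand-made labels for illustration only
y_true_demo = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_demo = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()
acc = accuracy_score(y_true_demo, y_pred_demo)
prec = precision_score(y_true_demo, y_pred_demo)
rec = recall_score(y_true_demo, y_pred_demo)

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"Accuracy={acc:.2f} Precision={prec:.2f} Recall={rec:.2f}")
```

In this toy case the accuracy is 0.70 while recall on the positive class is only 0.50, which is the kind of gap worth checking by running the same calls on `y_test` and `y_pred`.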
