# Stacking Exercise

In this exercise, you will explore stacking applied to classification. Stacking (stacked generalization) is an ensemble learning method that combines multiple classification models via a meta-classifier. The base-level models are trained on the complete training set, and a meta-model is then trained using the outputs of the base-level models as its features.
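To make the two levels concrete, here is a minimal hand-rolled sketch of the idea. It is an illustration under assumptions, not the solution to the exercise: the choice of two base models, the use of out-of-fold probabilities from `cross_val_predict` as meta-features, and all variable names are mine.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X_raw, y_raw = load_wine(return_X_y=True)
X_fit, X_hold, y_fit, y_hold = train_test_split(
    X_raw, y_raw, test_size=0.2, random_state=42
)

# Two illustrative base models (an arbitrary choice for this sketch)
base_models = [DecisionTreeClassifier(random_state=0), KNeighborsClassifier()]

# Level 0: out-of-fold class probabilities become the meta-features,
# so the meta-model never sees predictions made on a model's own training rows
meta_train = np.hstack([
    cross_val_predict(model, X_fit, y_fit, cv=5, method="predict_proba")
    for model in base_models
])

# Level 1: the meta-model learns how to combine the base predictions
meta_model = LogisticRegression(max_iter=1000).fit(meta_train, y_fit)

# For prediction, refit each base model on the full training set
meta_test = np.hstack([
    model.fit(X_fit, y_fit).predict_proba(X_hold) for model in base_models
])
manual_accuracy = meta_model.score(meta_test, y_hold)
print(f"Hand-rolled stacking accuracy: {manual_accuracy:.2f}")
```

Each base model contributes one probability column per class, so the meta-model here sees six features. `StackingClassifier`, used later in this notebook, packages the same kind of workflow behind a single estimator.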

## Dataset
We will use the Wine dataset for this exercise. This dataset consists of chemical analyses of wines grown in the same region in Italy but derived from three different cultivars. **Feel free to use another dataset!!**

## Task
Your task is to:
1. Load the dataset.
2. Preprocess the data (if necessary).
3. Implement a stacking model using various classifiers as base learners and one as a meta-classifier.
4. Evaluate the model performance.

Please fill in the following code blocks to complete the exercise.

In [119]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.datasets import load_wine

### Load the dataset

In [120]:
data = load_wine()

# Convert to DataFrame
df = pd.DataFrame(data=data.data, columns=data.feature_names)

# A forward slash (/) in a column name can cause problems when the column
# is referenced by name (for example in df.query), so we rename it
df = df.rename(columns={
    "od280/od315_of_diluted_wines": "od280_od315_of_diluted_wines"
})

# Include the target as well
df['target'] = data.target


### Preprocess the data (if necessary)

In [121]:
df.sample(10)

,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280_od315_of_diluted_wines,proline,target
136,12.25,4.72,2.54,21.0,89.0,1.38,0.47,0.53,0.8,3.85,0.75,1.27,720.0,2
18,14.19,1.59,2.48,16.5,108.0,3.3,3.93,0.32,1.86,8.7,1.23,2.82,1680.0,0
67,12.37,1.17,1.92,19.6,78.0,2.11,2.0,0.27,1.04,4.68,1.12,3.48,510.0,1
80,12.0,0.92,2.0,19.0,86.0,2.42,2.26,0.3,1.43,2.5,1.38,3.12,278.0,1
140,12.93,2.81,2.7,21.0,96.0,1.54,0.5,0.53,0.75,4.6,0.77,2.31,600.0,2
78,12.33,0.99,1.95,14.8,136.0,1.9,1.85,0.35,2.76,3.4,1.06,2.31,750.0,1
52,13.82,1.75,2.42,14.0,111.0,3.88,3.74,0.32,1.87,7.05,1.01,3.26,1190.0,0
3,14.37,1.95,2.5,16.8,113.0,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480.0,0
159,13.48,1.67,2.64,22.5,89.0,2.6,1.1,0.52,2.29,11.75,0.57,1.78,620.0,2
12,13.75,1.73,2.41,16.0,89.0,2.6,2.76,0.29,1.81,5.6,1.15,2.9,1320.0,0


In [122]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 178 entries, 0 to 177
Data columns (total 14 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   alcohol                       178 non-null    float64
 1   malic_acid                    178 non-null    float64
 2   ash                           178 non-null    float64
 3   alcalinity_of_ash             178 non-null    float64
 4   magnesium                     178 non-null    float64
 5   total_phenols                 178 non-null    float64
 6   flavanoids                    178 non-null    float64
 7   nonflavanoid_phenols          178 non-null    float64
 8   proanthocyanins               178 non-null    float64
 9   color_intensity               178 non-null    float64
 10  hue                           178 non-null    float64
 11  od280_od315_of_diluted_wines  178 non-null    float64
 12  proline                       178 non-null    float64
 13  targe

In [123]:
# Check whether any column contains 0 as a value
# (for "target", 0 is a valid class label, so it is expected to appear)
for col in df.columns:
    if df.query(f"{col} == 0").shape[0] > 0:
        print(col)

target
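The same check can be written without a loop or `query`. The sketch below is an alternative, not a correction; it rebuilds the DataFrame so it runs on its own, and the names `wine_df`, `has_zero`, and `zero_columns` are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_wine

wine = load_wine()
wine_df = pd.DataFrame(wine.data, columns=wine.feature_names)
wine_df["target"] = wine.target

# Boolean Series indexed by column name: True if at least one cell equals 0
has_zero = (wine_df == 0).any()

# Keep only the column names where the flag is True
zero_columns = has_zero[has_zero].index.tolist()
print(zero_columns)
```

Because this compares values directly instead of building a query string, it also works with column names that contain characters such as `/`.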


In [125]:
X = df.drop("target", axis=1)
y = df["target"]


### Split the data

In [126]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scaling: fit the scaler on the training set only, then apply it to the test set
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)

X_test = scaler.transform(X_test)
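An alternative to scaling by hand is to bundle the scaler and the model in a pipeline, so the scaler is always fitted on whatever data the model is fitted on. The following is a minimal sketch using a single `SVC` for brevity; the variable names are illustrative and it does not replace the cell above.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_demo, y_demo = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# fit() scales with training statistics; score() reuses them on the test set
scaled_svc = make_pipeline(StandardScaler(), SVC())
scaled_svc.fit(X_tr, y_tr)
pipeline_accuracy = scaled_svc.score(X_te, y_te)

# The fitted scaler's means should match the training-set column means
scaler_step = scaled_svc.named_steps["standardscaler"]
max_mean_diff = np.abs(scaler_step.mean_ - X_tr.mean(axis=0)).max()
print(f"Accuracy: {pipeline_accuracy:.2f}, max mean difference: {max_mean_diff:.2e}")
```

The pipeline form matters most with cross-validation, where each fold then gets its own scaler fit and no test-fold statistics leak into training.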

### Implement a stacking model

#### Initialize base learners and meta learner

In [127]:
# Base learners:
base_learners = [
    ("decision_tree", DecisionTreeClassifier()),
    ("svc", SVC()),
    ("knn", KNeighborsClassifier()),
    ("random_forest", RandomForestClassifier())
]

# Meta learner
meta_learner = LogisticRegression()

#### Stacking model

In [128]:
stacking_model = StackingClassifier(estimators=base_learners, final_estimator=meta_learner, cv=5)

stacking_model.fit(X_train, y_train)
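Once fitted, the stack can be inspected to see what the meta-learner actually receives. This sketch refits an equivalent model so it runs on its own; the fixed `random_state` values and the names `stack`, `meta_features`, and `base_scores` are my additions for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
scaler_demo = StandardScaler().fit(X_tr)
X_tr, X_te = scaler_demo.transform(X_tr), scaler_demo.transform(X_te)

stack = StackingClassifier(
    estimators=[
        ("decision_tree", DecisionTreeClassifier(random_state=0)),
        ("svc", SVC()),
        ("knn", KNeighborsClassifier()),
        ("random_forest", RandomForestClassifier(random_state=0)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,
).fit(X_tr, y_tr)

# Which prediction method each base learner contributes to the meta-features
print(dict(zip(stack.named_estimators_, stack.stack_method_)))

# Meta-features passed to the final estimator: one block of columns per base learner
meta_features = stack.transform(X_te)
print(meta_features.shape)

# Test accuracy of each refitted base learner, next to the stack itself
base_scores = {
    name: est.score(X_te, y_te) for name, est in stack.named_estimators_.items()
}
base_scores["stack"] = stack.score(X_te, y_te)
print(base_scores)
```

Comparing the base learners' scores with the stack's score shows whether the meta-learner adds anything over the best single model on this split.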

### Evaluate the model performance

In [129]:
y_pred = stacking_model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

print(f"Stacking model accuracy {accuracy * 100:.2f}%")

print("Confusion matrix:")
print(cm)

print("Classification report:")
print(class_report)

Stacking model accuracy 100.00%
Confusion matrix:
[[14  0  0]
 [ 0 14  0]
 [ 0  0  8]]
Classification report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        14
           1       1.00      1.00      1.00        14
           2       1.00      1.00      1.00         8

    accuracy                           1.00        36
   macro avg       1.00      1.00      1.00        36
weighted avg       1.00      1.00      1.00        36



### Conclusion
---
Using the Wine dataset from scikit-learn (`load_wine`), I implemented a stacking ensemble with four base classifiers: **DecisionTreeClassifier**, **SVC**, **KNeighborsClassifier**, and **RandomForestClassifier**, with **LogisticRegression** as the meta-learner. The model achieved **100%** accuracy on the 36-sample test set.
However, **100%** accuracy might also indicate that the dataset is relatively simple.
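A single 36-sample test split gives a noisy estimate, so one way to probe the result is to cross-validate the whole workflow. The sketch below is one possible check under my own assumptions (5 stratified outer folds, fixed seeds, scaling inside a pipeline); it is not part of the original exercise.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X_all, y_all = load_wine(return_X_y=True)

# Scaling lives inside the pipeline so each outer fold fits its own scaler
model = make_pipeline(
    StandardScaler(),
    StackingClassifier(
        estimators=[
            ("decision_tree", DecisionTreeClassifier(random_state=0)),
            ("svc", SVC()),
            ("knn", KNeighborsClassifier()),
            ("random_forest", RandomForestClassifier(random_state=0)),
        ],
        final_estimator=LogisticRegression(),
        cv=5,
    ),
)

# Stratified folds keep the three cultivars balanced in every split
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(model, X_all, y_all, cv=outer_cv, scoring="accuracy")

print(f"Accuracy per fold: {cv_scores.round(3)}")
print(f"Mean: {cv_scores.mean():.3f}, std: {cv_scores.std():.3f}")
```

If the per-fold scores stay close to the single-split result, that supports the reading that the dataset is easy for these models rather than the split being lucky.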