### **Ensemble Model**: `Various Type`

We will demonstrate voting and stacking ensembles. First, import the Python libraries we need.

In [9]:
# Import required libraries
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold
import numpy as np
import pandas as pd

Create a synthetic binary classification problem with 1000 examples and 10 input features using make_classification().
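The notebook does not show the cell that builds this dataset, so the following is only a minimal sketch of what it might look like. The sample and feature counts come from the sentence above; `random_state=1` is an assumption made for reproducibility.

```python
import numpy as np
from sklearn.datasets import make_classification

# 1000 examples and 10 input features, as described above.
# random_state is an assumed value, not taken from the notebook.
X, y = make_classification(n_samples=1000, n_features=10, random_state=1)

# Inspect the shapes and the class balance of the binary target
print(X.shape, y.shape)
print(np.bincount(y))
```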

#### **1. Voting Ensembles**

**Definition**: A voting ensemble is an ensemble machine learning model that combines the predictions from multiple other models. It can be used for classification or regression. For regression, it averages the predictions of the models. For classification, the votes for each label are summed and the label with the most votes is predicted.

There are two approaches to the majority vote prediction for classification.

- Hard Voting. Each model votes with its predicted class label, and the label with the most votes wins.
- Soft Voting. The predicted class probabilities are summed (or averaged) across the models, and the class with the highest total probability wins.
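The difference between the two approaches can be sketched with plain NumPy for a binary problem. The probabilities below are hypothetical values chosen so that the two rules disagree on the first sample; `hard_vote` and `soft_vote` are illustrative helper names, not scikit-learn functions.

```python
import numpy as np

def hard_vote(labels):
    """Majority vote over 0/1 class labels (rows: models, columns: samples)."""
    labels = np.asarray(labels)
    # Class 1 wins when more than half of the models predict it
    return (labels.sum(axis=0) > labels.shape[0] / 2).astype(int)

def soft_vote(probas):
    """Average the class-1 probabilities, then threshold the mean at 0.5."""
    probas = np.asarray(probas)
    return (probas.mean(axis=0) > 0.5).astype(int)

# Hypothetical class-1 probabilities from three models for two samples
probas = np.array([
    [0.45, 0.90],
    [0.40, 0.80],
    [0.95, 0.70],
])
labels = (probas > 0.5).astype(int)

hard = hard_vote(labels)
soft = soft_vote(probas)
print(hard, soft)
```

For the first sample, two of the three models predict class 0, so hard voting returns 0, while the one confident model pulls the mean probability to 0.6 and soft voting returns 1.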

Use a voting ensemble if:

**It results in better performance than any model used in the ensemble.**

The voting ensemble is not guaranteed to provide better performance than any single model used in the ensemble. If any given model used in the ensemble performs better than the voting ensemble, that model should probably be used instead of the voting ensemble.

**It results in a lower variance than any model used in the ensemble.**

This is not always the case, but a voting ensemble can make predictions with lower variance than the individual models. The lower variance may come with a slightly lower mean performance, which can be an acceptable trade-off for a more stable model.

**Notes:** A limitation of the voting ensemble is that it treats all models the same, meaning all models contribute equally to the prediction. This is a problem if some models are good in some situations and poor in others.

**Practice:**

First, we can create a function named get_voting() that creates each model and combines the models into a hard voting ensemble.
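The cell for this function is not shown, so here is one possible sketch. It assumes the same three base models that the stacking example later in this section uses; the short names `'lr'`, `'knn'`, and `'dt'` are likewise borrowed from that example.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

def get_voting():
    # Base models as (name, estimator) tuples; this choice is an assumption
    models = [
        ('lr', LogisticRegression()),
        ('knn', KNeighborsClassifier()),
        ('dt', DecisionTreeClassifier()),
    ]
    # voting='hard' predicts the majority class label
    return VotingClassifier(estimators=models, voting='hard')

ensemble = get_voting()
```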



A get_models() function creates the list of models for us to evaluate. This lets us directly compare each standalone model with the ensemble in terms of the distribution of classification accuracy scores.

An evaluate_model() function takes a model instance and returns a list of scores from three repeats of stratified 10-fold cross-validation.

Finally, we compare the voting ensemble against each standalone model.
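The three steps above can be sketched end to end as follows. This is an illustration under assumptions rather than the notebook's original cells: the synthetic dataset, the `random_state` values, and the choice of base models are all assumed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Assumed synthetic dataset: 1000 examples, 10 features
X, y = make_classification(n_samples=1000, n_features=10, random_state=1)

def get_models():
    # Standalone models (an assumed selection)
    base = {
        'lr': LogisticRegression(),
        'knn': KNeighborsClassifier(),
        'dt': DecisionTreeClassifier(random_state=1),
    }
    models = dict(base)
    # Hard voting ensemble over the same base models
    models['hard_voting'] = VotingClassifier(
        estimators=list(base.items()), voting='hard'
    )
    return models

def evaluate_model(model, X, y):
    # Three repeats of stratified 10-fold cross-validation
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    return cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=1)

results = {name: evaluate_model(model, X, y)
           for name, model in get_models().items()}

for name, scores in results.items():
    print(f'{name}: mean accuracy {scores.mean():.3f}, std {scores.std():.3f}')
```

Each entry in `results` holds 30 accuracy scores (10 folds times 3 repeats), so both the mean and the spread of the ensemble can be compared with those of the standalone models.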

#### **2. Stacking Ensembles**

**Definition**: A stacking ensemble is an ensemble machine learning model that uses a meta-learning algorithm to learn how best to combine the predictions from two or more base machine learning algorithms.

Stacking is appropriate when multiple different machine learning models have skill on a dataset, but have skill in different ways. Another way to say this is that the predictions made by the models or the errors in predictions made by the models are uncorrelated or have a low correlation.

The architecture of a stacking model involves two or more base models, often referred to as level-0 models, and a meta-model that combines the predictions of the base models, referred to as a level-1 model.

- **Level-0 Models (Base-Models)**: Models fit on the training data and whose predictions are compiled. Use a diverse range of models that make different assumptions about the prediction task.

- **Level-1 Model (Meta-Model)**: Model that learns how to best combine the predictions of the base models. The meta-model is often simple, providing a smooth interpretation of the predictions made by the base models. As such, linear regression or logistic regression is often used as the meta-model.
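The two levels can be sketched by hand to show what the meta-model actually sees. This is a simplified illustration, not the internals of any library: the dataset, the two base models, and all variable names are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Assumed synthetic data, split 80:20
X, y = make_classification(n_samples=500, n_features=10, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)

# Level-0: base models that make different assumptions
base_models = [
    KNeighborsClassifier(),
    DecisionTreeClassifier(max_depth=5, random_state=1),
]

# Out-of-fold class-1 probabilities become the meta-features, so the
# meta-model never trains on predictions made for rows a base model has seen
meta_train = np.column_stack([
    cross_val_predict(m, X_tr, y_tr, cv=5, method='predict_proba')[:, 1]
    for m in base_models
])

# Refit each base model on the full training set for test-time meta-features
meta_test = np.column_stack([
    m.fit(X_tr, y_tr).predict_proba(X_te)[:, 1] for m in base_models
])

# Level-1: a simple meta-model that combines the base-model outputs
meta_model = LogisticRegression()
meta_model.fit(meta_train, y_tr)
y_pred = meta_model.predict(meta_test)

print(meta_train.shape, meta_test.shape)
```

The meta-model is trained on one column per base model, which is why a simple linear model is usually enough at level 1.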

**Notes**
- Stacking is designed to improve modeling performance, although it is not guaranteed to result in an improvement in all cases.
- If a base-model performs as well as or better than the stacking ensemble, the base model should be used instead, given its lower complexity (e.g. it’s simpler to describe, train and maintain).

**Practice:**

First, we can evaluate a suite of different machine learning models on the dataset. Each algorithm is evaluated using its default hyperparameters. The get_stacking() function below defines the StackingClassifier model by first building a list of tuples for the three base models, then defining the logistic regression meta-model that combines the predictions from the base models using 5-fold cross-validation.



In [10]:
from sklearn.ensemble import StackingClassifier

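def get_stacking():
    # Level-0 (base) models as (name, estimator) tuples
    base = [
        ('lr', LogisticRegression()),
        ('knn', KNeighborsClassifier()),
        ('dt', DecisionTreeClassifier()),
    ]
    # Level-1 (meta) model that combines the base-model predictions
    meta = LogisticRegression()

    # cv=5: the meta-model is trained on 5-fold cross-validated predictions
    ensemble = StackingClassifier(estimators=base, final_estimator=meta, cv=5)
    return ensemble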

In [11]:
get_stacking()

We can include the stacking ensemble in the list of models to evaluate, along with the standalone models.

In [12]:
def get_models():
    models = {}
    models['LogisticRegression'] = LogisticRegression()
    models['KNeighborsClassifier'] = KNeighborsClassifier()
    models['DecisionTreeClassifier'] = DecisionTreeClassifier()
    models['Stacking'] = get_stacking()
    return models

The evaluate_model() function below takes a model instance and returns the F1 scores from three repeats of stratified 10-fold cross-validation.

In [13]:
def evaluate_model(model, X, y):
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    scores = cross_val_score(
        model, X, y, cv=cv, scoring='f1', n_jobs=1
    )
    return scores

Finally, we compare the stacking ensemble against each standalone model.

In [14]:
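from sklearn.datasets import make_classification

# Synthetic dataset described at the start of this section (random_state is an assumption)
X, y = make_classification(n_samples=1000, n_features=10, random_state=1)
models = get_models()

for name, model in models.items():
    scores = evaluate_model(model, X, y)
    print(f'{name}, mean f1: {np.mean(scores):.2f}, std f1: {np.std(scores):.2f}')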

### `Application`

`Load Dataset`


In [15]:
data = pd.read_csv('white_wine.csv')
data.sample(5, random_state=0)

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 422 | 7.0 | 0.21 | 0.28 | 8.6 | 0.045 | 37.0 | 221.0 | 0.9954 | 3.25 | 0.54 | 10.4 | 6.0 |
| 107 | 7.1 | 0.23 | 0.35 | 16.5 | 0.04 | 60.0 | 171.0 | 0.999 | 3.16 | 0.59 | 9.1 | 6.0 |
| 253 | 5.8 | 0.24 | 0.44 | 3.5 | 0.029 | 5.0 | 109.0 | 0.9913 | 3.53 | 0.43 | 11.7 | 3.0 |
| 235 | 7.2 | 0.23 | 0.38 | 14.3 | 0.058 | 55.0 | 194.0 | 0.9979 | 3.09 | 0.44 | 9.0 | 6.0 |
| 311 | 5.0 | 0.55 | 0.14 | 8.3 | 0.032 | 35.0 | 164.0 | 0.9968 | 3.53 | 0.51 | 12.5 | 8.0 |


`Data Cleaning`

**Duplicated Values**: detect and quantify duplicated rows.

In [16]:
print(f'Number of duplicated rows: {data.duplicated().sum()}')
print(f'Percentage of duplicated rows: {data.duplicated().sum()/len(data)*100:.2f}%')

Number of duplicated rows: 84
Percentage of duplicated rows: 16.15%


About 16.15% of the rows are duplicates, so we keep one copy of each and drop the rest.

Handle the duplicated rows:

In [17]:
data.drop_duplicates(keep='first', inplace=True, ignore_index=True)

**Missing Values**: detect and quantify missing values.

In [18]:
data.isna().sum()

fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      1
sulphates               1
alcohol                 1
quality                 1
dtype: int64

There are missing values in the pH, sulphates, alcohol, and quality columns. We simply drop the affected rows.

In [19]:
data.dropna(inplace=True)

**Change Target**: convert the target into a binary category (quality above 6 becomes 1, everything else 0).

In [20]:
data['quality'] = np.where(data['quality'] > 6, 1, 0)
data.head()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.0 | 0.27 | 0.36 | 20.7 | 0.045 | 45.0 | 170.0 | 1.001 | 3.0 | 0.45 | 8.8 | 0 |
| 1 | 6.3 | 0.3 | 0.34 | 1.6 | 0.049 | 14.0 | 132.0 | 0.994 | 3.3 | 0.49 | 9.5 | 0 |
| 2 | 8.1 | 0.28 | 0.4 | 6.9 | 0.05 | 30.0 | 97.0 | 0.9951 | 3.26 | 0.44 | 10.1 | 0 |
| 3 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.4 | 9.9 | 0 |
| 4 | 6.2 | 0.32 | 0.16 | 7.0 | 0.045 | 30.0 | 136.0 | 0.9949 | 3.18 | 0.47 | 9.6 | 0 |
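As a small worked example of the thresholding rule, the sketch below applies the same `np.where` call to the five quality scores from the sampled rows shown in the Load Dataset step. The standalone `quality` Series is a stand-in for the real column.

```python
import numpy as np
import pandas as pd

# Quality scores of the five sampled rows (6, 6, 3, 6, 8)
quality = pd.Series([6.0, 6.0, 3.0, 6.0, 8.0])

# Scores strictly above 6 map to 1, everything else to 0
binary = np.where(quality > 6, 1, 0)
print(binary)
```

Only the wine with quality 8 ends up in the positive class; a score of exactly 6 stays in class 0 because the comparison is strict.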


`Data Splitting`

Split the dataset into train and test sets with an 80:20 ratio.

In [22]:
from sklearn.model_selection import train_test_split

X = data.drop(columns='quality')
y = data['quality']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)

X_train.shape, X_test.shape

((348, 11), (87, 11))

In [23]:
X_train

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 394 | 5.5 | 0.16 | 0.22 | 4.50 | 0.030 | 30.0 | 102.0 | 0.9938 | 3.24 | 0.36 | 9.4 |
| 13 | 6.2 | 0.66 | 0.48 | 1.20 | 0.029 | 29.0 | 75.0 | 0.9942 | 3.33 | 0.39 | 12.8 |
| 309 | 6.8 | 0.15 | 0.30 | 5.30 | 0.050 | 40.0 | 127.0 | 0.9942 | 3.40 | 0.39 | 9.7 |
| 163 | 6.6 | 0.28 | 0.28 | 8.50 | 0.052 | 55.0 | 211.0 | 0.9962 | 3.09 | 0.55 | 8.9 |
| 220 | 6.9 | 0.29 | 0.40 | 19.45 | 0.043 | 36.0 | 156.0 | 0.9996 | 2.93 | 0.47 | 8.9 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 207 | 6.8 | 0.21 | 0.27 | 2.10 | 0.030 | 26.0 | 139.0 | 0.9950 | 3.16 | 0.61 | 12.6 |
| 100 | 5.9 | 0.36 | 0.04 | 5.70 | 0.046 | 21.0 | 87.0 | 0.9934 | 3.22 | 0.51 | 10.2 |
| 281 | 6.8 | 0.24 | 0.34 | 4.60 | 0.032 | 37.0 | 135.0 | 0.9927 | 3.20 | 0.39 | 10.6 |
| 325 | 5.0 | 0.17 | 0.56 | 1.50 | 0.026 | 24.0 | 115.0 | 0.9956 | 3.48 | 0.39 | 10.8 |


### `Model Experiment`

Set up the basic components used to initialize the experiment.

In [25]:
from sklearn.preprocessing import RobustScaler
from imblearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import VotingClassifier

scaler = RobustScaler()

In [27]:
# define base model
logreg = LogisticRegression(max_iter=1000)
knn = KNeighborsClassifier()
dt = DecisionTreeClassifier(max_depth=5, random_state=0)

# base model pipeline
logreg_pipeline = Pipeline([
    ('scaling', scaler),
    ('modeling', logreg)
])

knn_pipeline = Pipeline([
    ('scaling', scaler),
    ('modeling', knn)
])

dt_pipeline = Pipeline([
    ('modeling', dt)
])

# define meta learner
meta_learner = LogisticRegression(max_iter=1000)

# voting classifier hard
voting_clf_hard = VotingClassifier(
    estimators=[
        ('clf1',logreg_pipeline),
        ('clf2',knn_pipeline),
        ('clf3',dt_pipeline),
    ], voting='hard'
)

# voting classifier soft
voting_clf_soft = VotingClassifier(
    estimators=[
        ('clf1', logreg_pipeline),
        ('clf2',knn_pipeline),
        ('clf3',dt_pipeline),
    ], voting='soft'
)

# stacking classifier
stacking_clf = StackingClassifier(
    estimators=[
        ('clf1', logreg_pipeline),
        ('clf2',knn_pipeline),
        ('clf3',dt_pipeline),
    ], final_estimator=meta_learner
)

**Benchmark Model**

In [31]:
from sklearn.model_selection import cross_val_score, StratifiedKFold

# define the models
models = [logreg_pipeline, knn_pipeline, dt_pipeline, voting_clf_hard, voting_clf_soft, stacking_clf]
name = ['Logistic Regression','KNeighborsClassifier','DecisionTreeClassifier','VotingClassifier(Hard)','VotingClassifier(Soft)','StackingClassifier']

f1_mean = []
f1_std = []
all_f1 = []

for model in models:
    skfold = StratifiedKFold(n_splits = 5)
    model_cv = cross_val_score(
        model,
        X_train,
        y_train,
        cv = skfold,
        scoring = 'f1',
        n_jobs=1
    )

    f1_mean.append(model_cv.mean())
    f1_std.append(model_cv.std())
    all_f1.append(model_cv.round(3))

In [33]:
pd.DataFrame({
    'model'     : name,
    'f1 mean score'   : f1_mean,
    'f1 std score'    : f1_std,
    'f1 score'  : all_f1,
}).sort_values('f1 mean score', ascending=False)

| | model | f1 mean score | f1 std score | f1 score |
|---|---|---|---|---|
| 0 | Logistic Regression | 0.992 | 0.016 | [1.0, 0.96, 1.0, 1.0, 1.0] |
| 5 | StackingClassifier | 0.975304 | 0.020204 | [1.0, 0.96, 0.96, 1.0, 0.957] |
| 3 | VotingClassifier(Hard) | 0.923789 | 0.049178 | [0.96, 0.833, 0.96, 0.957, 0.909] |
| 4 | VotingClassifier(Soft) | 0.902577 | 0.089754 | [0.96, 0.727, 0.96, 0.957, 0.909] |
| 2 | DecisionTreeClassifier | 0.85913 | 0.101001 | [0.96, 0.696, 0.96, 0.88, 0.8] |
| 1 | KNeighborsClassifier | 0.806192 | 0.094593 | [0.833, 0.636, 0.783, 0.909, 0.87] |


**Evaluate Benchmark Models on the Test Set**

This measures the final performance of each model.

In [49]:
from sklearn.metrics import f1_score

models = [logreg_pipeline, knn_pipeline, dt_pipeline, voting_clf_hard, voting_clf_soft, stacking_clf]
model_names = ['LogisticRegression','KNeighborsClassifier','DecisionTreeClassifier','VotingClassifier(Hard)','VotingClassifier(Soft)','StackingClassifier']

list_f1 = []
dict_pred = {}
dict_proba = {}

for model,name in zip(models,model_names):
    model.fit(X_train,y_train)
    
    y_pred = model.predict(X_test)
    dict_pred[name] = y_pred

    if model in [logreg_pipeline, knn_pipeline, dt_pipeline, voting_clf_soft]:
        y_proba = model.predict_proba(X_test)
        dict_proba[name] = y_proba[:,1].round(3)

    score = f1_score(y_test,y_pred)
    list_f1.append(score)

In [36]:
pd.DataFrame({
    'model'         : model_names,
    'f1 score'      : list_f1,
}).sort_values('f1 score', ascending=False)

| | model | f1 score |
|---|---|---|
| 5 | StackingClassifier | 0.967742 |
| 2 | DecisionTreeClassifier | 0.941176 |
| 4 | VotingClassifier(Soft) | 0.903226 |
| 0 | Logistic Regression | 0.896552 |
| 3 | VotingClassifier(Hard) | 0.896552 |
| 1 | KNeighborsClassifier | 0.785714 |


**Illustration**

`Hard Voting`

In [51]:
pd.DataFrame({
    'Pred Logreg' : dict_pred['LogisticRegression'],
    'Pred KNN' : dict_pred['KNeighborsClassifier'],
    'Pred DT' : dict_pred['DecisionTreeClassifier'],
    'Pred Voting Hard' : dict_pred['VotingClassifier(Hard)'],
})
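# Majority vote

| | Pred Logreg | Pred KNN | Pred DT | Pred Voting Hard |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 1 | 0 |
| 3 | 0 | 0 | 0 | 0 |
| 4 | 1 | 0 | 1 | 1 |
| ... | ... | ... | ... | ... |
| 82 | 1 | 1 | 1 | 1 |
| 83 | 0 | 0 | 0 | 0 |
| 84 | 0 | 0 | 0 | 0 |
| 85 | 0 | 0 | 0 | 0 |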


`Soft Voting`

In [53]:
pd.DataFrame({
    'Pred Logreg' : dict_proba['LogisticRegression'],
    'Pred KNN' : dict_proba['KNeighborsClassifier'],
    'Pred DT' : dict_proba['DecisionTreeClassifier'],
    'Pred Voting Soft' : dict_proba['VotingClassifier(Soft)'],
})
# Uses the average probability: > 0.5 -> 1, < 0.5 -> 0

| | Pred Logreg | Pred KNN | Pred DT | Pred Voting Soft |
|---|---|---|---|---|
| 0 | 0.003 | 0.0 | 0.000 | 0.001 |
| 1 | 0.016 | 0.0 | 0.000 | 0.005 |
| 2 | 0.463 | 0.4 | 1.000 | 0.621 |
| 3 | 0.000 | 0.0 | 0.000 | 0.000 |
| 4 | 0.943 | 0.4 | 1.000 | 0.781 |
| ... | ... | ... | ... | ... |
| 82 | 0.843 | 0.8 | 1.000 | 0.881 |
| 83 | 0.012 | 0.0 | 0.077 | 0.030 |
| 84 | 0.000 | 0.0 | 0.000 | 0.000 |
| 85 | 0.004 | 0.0 | 0.000 | 0.001 |
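The averaging rule can be checked by hand against the table above. The sketch below copies the class-1 probabilities from rows 2, 4, 82, and 83 and recomputes both voting rules; the variable names are illustrative, and thresholding each probability at 0.5 is assumed to reproduce each base model's predicted label.

```python
import numpy as np

# Class-1 probabilities copied from rows 2, 4, 82 and 83 of the table above
logreg = np.array([0.463, 0.943, 0.843, 0.012])
knn = np.array([0.400, 0.400, 0.800, 0.000])
dt = np.array([1.000, 1.000, 1.000, 0.077])
reported_soft = np.array([0.621, 0.781, 0.881, 0.030])

# Soft voting: unweighted mean of the base-model probabilities
avg = np.mean([logreg, knn, dt], axis=0)
soft_label = (avg > 0.5).astype(int)

# Hard voting on the same rows: majority of the thresholded labels
votes = np.array([logreg, knn, dt]) > 0.5
hard_label = (votes.sum(axis=0) >= 2).astype(int)

print(avg.round(3), soft_label, hard_label)
```

Row 2 is the interesting case: only the decision tree votes for class 1, so hard voting predicts 0, while the tree's probability of 1.0 lifts the mean to 0.621 and soft voting predicts 1.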


`Stacking`

In [40]:
logreg_pipeline.fit(X_train,y_train)
knn_pipeline.fit(X_train,y_train)
dt_pipeline.fit(X_train,y_train)

y_train_meta_logreg = logreg_pipeline.predict(X_train)
y_train_meta_knn    = knn_pipeline.predict(X_train)
y_train_meta_dt     = dt_pipeline.predict(X_train)


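**Data the meta-learner learns from**

The in-sample predictions shown here are a simplified illustration; `StackingClassifier` itself trains the meta-learner on cross-validated predictions of the base models.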

In [41]:
pd.DataFrame({
    'Pred Logreg' : y_train_meta_logreg,
    'Pred KNN' : y_train_meta_knn,
    'Pred DT' : y_train_meta_dt,
    'Y Train' : y_train,
})

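| | Pred Logreg | Pred KNN | Pred DT | Y Train |
|---|---|---|---|---|
| 394 | 0 | 0 | 0 | 0 |
| 13 | 1 | 1 | 1 | 1 |
| 309 | 0 | 0 | 0 | 0 |
| 163 | 0 | 0 | 0 | 0 |
| 220 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... |
| 207 | 1 | 1 | 1 | 1 |
| 100 | 0 | 0 | 0 | 0 |
| 281 | 0 | 0 | 0 | 0 |
| 325 | 1 | 1 | 1 | 1 |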


**Data the meta-learner will predict on**

In [54]:
pd.DataFrame({
    'Pred Logreg' : dict_pred['LogisticRegression'],
    'Pred KNN' : dict_pred['KNeighborsClassifier'],
    'Pred DT' : dict_pred['DecisionTreeClassifier'],
})

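| | Pred Logreg | Pred KNN | Pred DT |
|---|---|---|---|
| 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 |
| 2 | 0 | 0 | 1 |
| 3 | 0 | 0 | 0 |
| 4 | 1 | 0 | 1 |
| ... | ... | ... | ... |
| 82 | 1 | 1 | 1 |
| 83 | 0 | 0 | 0 |
| 84 | 0 | 0 | 0 |
| 85 | 0 | 0 | 0 |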


**Meta-learner prediction results**

In [56]:
pd.DataFrame({
    'Pred Logreg' : dict_pred['LogisticRegression'],
    'Pred KNN' : dict_pred['KNeighborsClassifier'],
    'Pred DT' : dict_pred['DecisionTreeClassifier'],
    'Pred Stacking' : dict_pred['StackingClassifier'],
})
