# Comparison of Advanced Ensemble Techniques

Scenario: You have run the benchmark model on the credit card dataset and obtained baseline metrics. Having learned some advanced ensemble techniques, you now want to determine which one to use for the credit card approval dataset.

In this activity, you will apply all three advanced techniques (bagging, boosting, and stacking) and compare the results before selecting your final technique.

In [22]:
import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, AdaBoostClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from mlxtend.classifier import StackingClassifier

In [7]:
# Load the data using pandas; "?" marks missing values
credData = pd.read_csv('../Dataset/crx.data', sep=",", header=None, na_values="?")

# Recode the class labels: '+' becomes 1 and '-' becomes 0
credData.loc[credData[15] == '+', 15] = 1
credData.loc[credData[15] == '-', 15] = 0
# Drop all rows with missing values
newcred = credData.dropna(axis=0)
# Separate the categorical variables and convert them to dummy variables
credCat = pd.get_dummies(newcred[[0, 3, 4, 5, 6, 8, 9, 11, 12]])
# Separate the numerical variables
credNum = newcred[[1, 2, 7, 10, 13, 14]]
# Build X by concatenating the categorical and numerical data
X = pd.concat([credCat, credNum], axis=1)
# Separate the label as the y variable
y = newcred[15].astype('int')

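# Scale every feature to the [0, 1] range.
# Note: as written in this activity, the scaler is fit on the full data set before the split;
# fitting it on X_train only would keep test-set information out of the preprocessing.
minmaxScaler = preprocessing.MinMaxScaler()
X_tran = pd.DataFrame(minmaxScaler.fit_transform(X))

# Split the data set into train and test sets (70% / 30%)
X_train, X_test, y_train, y_test = train_test_split(X_tran, y, test_size=0.3, random_state=123)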

## Implement the bagging technique with the base learner as the logistic regression model. 

In the bagging classifier, set n_estimators = 15, max_samples = 0.7, and max_features = 0.8.

Fit the model on the training set, generate the predictions, and print the confusion matrix and the classification report.
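To make max_samples and max_features concrete, the sketch below draws the row and column subsets that each bagged estimator would be trained on. It uses only the standard library and is not scikit-learn's implementation. It assumes the scikit-learn defaults of sampling rows with replacement and features without replacement. The function name `draw_bagging_subsets` and the toy sizes (100 rows, 10 columns) are hypothetical placeholders, not the dimensions of the credit card data.

```python
import random


def draw_bagging_subsets(n_rows, n_cols, max_samples, max_features, rng):
    """Return (row_indices, col_indices) for one bagged estimator."""
    # Fractions are converted to counts by truncation.
    n_draw_rows = max(1, int(max_samples * n_rows))
    n_draw_cols = max(1, int(max_features * n_cols))
    # Rows: sampled with replacement (bootstrap), so duplicates are expected.
    rows = [rng.randrange(n_rows) for _ in range(n_draw_rows)]
    # Columns: sampled without replacement, so each feature appears at most once.
    cols = sorted(rng.sample(range(n_cols), n_draw_cols))
    return rows, cols


rng = random.Random(123)
# One (rows, cols) pair per estimator, mirroring n_estimators = 15.
subsets = [draw_bagging_subsets(100, 10, 0.7, 0.8, rng) for _ in range(15)]

rows, cols = subsets[0]
print(len(rows), "rows drawn,", len(set(rows)), "distinct")
print("feature columns used:", cols)
```

Each estimator sees a different 70% bootstrap sample of the rows and a different 80% subset of the columns, which is what makes the 15 logistic regression models differ from one another before their predictions are combined.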

In [10]:
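# Note: scikit-learn 1.2 renamed the `base_estimator` keyword to `estimator`;
# this cell keeps the older keyword that the notebook was run with.
base_estimator = LogisticRegression()
bagging_ensemble_model = BaggingClassifier(
    base_estimator=base_estimator,
    n_estimators=15,
    max_samples=0.7,
    max_features=0.8,
    random_state=123,
    verbose=2
)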

In [11]:
bagging_ensemble_model.fit(X_train, y_train)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
Building estimator 1 of 15 for this parallel run (total 15)...
Building estimator 2 of 15 for this parallel run (total 15)...
Building estimator 3 of 15 for this parallel run (total 15)...
Building estimator 4 of 15 for this parallel run (total 15)...
Building estimator 5 of 15 for this parallel run (total 15)...
Building estimator 6 of 15 for this parallel run (total 15)...
Building estimator 7 of 15 for this parallel run (total 15)...
Building estimator 8 of 15 for this parallel run (total 15)...
Building estimator 9 of 15 for this parallel run (total 15)...
Building estimator 10 of 15 for this parallel run (total 15)...
Building estimator 11 of 15 for this parallel run (total 15)...
Building estimator 12 of 15 for this parallel run (total 15)...
Building estimator 13 of 15 for this parallel run (total 15)...
Building estimator 14 of 15 for this parallel run (total 15)...
Building estimator 15 of 15 for 

BaggingClassifier(base_estimator=LogisticRegression(), max_features=0.8,
                  max_samples=0.7, n_estimators=15, random_state=123,
                  verbose=2)

In [12]:
y_pred = bagging_ensemble_model.predict(X_test)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s finished


In [15]:
print(f'Accuracy: {accuracy_score(y_test, y_pred):.3f}\n')
print(confusion_matrix(y_test, y_pred))
print('\n')
print(classification_report(y_test, y_pred))

Accuracy: 0.898

[[94 13]
 [ 7 82]]


              precision    recall  f1-score   support

           0       0.93      0.88      0.90       107
           1       0.86      0.92      0.89        89

    accuracy                           0.90       196
   macro avg       0.90      0.90      0.90       196
weighted avg       0.90      0.90      0.90       196
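
The figures in the classification report can be checked by hand from the confusion matrix. The sketch below does this for the bagging result printed above, assuming scikit-learn's convention that rows are true classes and columns are predicted classes. The helper name `report_from_confusion` is hypothetical.

```python
def report_from_confusion(cm):
    """Derive accuracy and per-class precision/recall from a 2x2 confusion matrix."""
    # Rows are true classes, columns are predicted classes.
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "precision_0": tn / (tn + fn),
        "recall_0": tn / (tn + fp),
        "precision_1": tp / (tp + fp),
        "recall_1": tp / (tp + fn),
    }


bagging_cm = [[94, 13], [7, 82]]
bagging_metrics = report_from_confusion(bagging_cm)
for name, value in bagging_metrics.items():
    print(f"{name}: {value:.3f}")
```

For example, the accuracy is (94 + 82) / 196, which rounds to 0.898, and the recall for class 1 is 82 / 89, which rounds to 0.92.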



## Implement boosting with random forest as the base learner. 

In the AdaBoostClassifier, set n_estimators = 300.

Fit the model on the training set, generate the predictions, and print the confusion matrix and classification report.
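
As background for this step, the sketch below shows one round of the textbook discrete AdaBoost reweighting for a two-class problem. It is an illustration under stated assumptions, not a copy of scikit-learn's code. The function name `adaboost_reweight` and the toy weights are hypothetical.

```python
import math


def adaboost_reweight(weights, is_wrong, learning_rate=1.0):
    """One round of the discrete AdaBoost update for a two-class problem."""
    total = sum(weights)
    # Weighted error of the current base learner.
    err = sum(w for w, wrong in zip(weights, is_wrong) if wrong) / total
    if not 0.0 < err < 0.5:
        raise ValueError("update needs a weighted error strictly between 0 and 0.5")
    # Estimator weight: larger when the base learner makes fewer mistakes.
    alpha = learning_rate * math.log((1.0 - err) / err)
    # Increase the weight of misclassified samples, leave the rest unchanged.
    raised = [w * math.exp(alpha) if wrong else w for w, wrong in zip(weights, is_wrong)]
    norm = sum(raised)
    return alpha, [w / norm for w in raised]


# Toy example: 10 equally weighted samples, the first 2 misclassified.
weights = [0.1] * 10
is_wrong = [True, True] + [False] * 8
alpha, new_weights = adaboost_reweight(weights, is_wrong)
print(f"estimator weight: {alpha:.3f}")
print(f"weight of a misclassified sample: {new_weights[0]:.4f}")
print(f"weight of a correct sample: {new_weights[-1]:.4f}")
```

After the update, the two misclassified samples together carry half of the total weight, so the next base learner is pushed toward them. A strong base learner such as a random forest can reach a training error close to zero, in which case there is little left to reweight; whether that happened in this run cannot be read from the output shown here.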

In [18]:
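# Note: scikit-learn 1.2 renamed the `base_estimator` keyword to `estimator`;
# this cell keeps the older keyword that the notebook was run with.
base_estimator = RandomForestClassifier(random_state=123)
boosting_ensemble_model = AdaBoostClassifier(
    base_estimator=base_estimator,
    n_estimators=300,
    random_state=123
)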

In [19]:
boosting_ensemble_model.fit(X_train, y_train)

AdaBoostClassifier(base_estimator=RandomForestClassifier(random_state=123),
                   n_estimators=300, random_state=123)

In [20]:
y_pred = boosting_ensemble_model.predict(X_test)

In [21]:
print(f'Accuracy: {accuracy_score(y_test, y_pred):.3f}\n')
print(confusion_matrix(y_test, y_pred))
print('\n')
print(classification_report(y_test, y_pred))

Accuracy: 0.898

[[95 12]
 [ 8 81]]


              precision    recall  f1-score   support

           0       0.92      0.89      0.90       107
           1       0.87      0.91      0.89        89

    accuracy                           0.90       196
   macro avg       0.90      0.90      0.90       196
weighted avg       0.90      0.90      0.90       196



## Implement the stacking technique. 

Use the KNN and logistic regression models as base learners and a random forest as the meta learner.

Fit the model on the training set, generate the predictions, and print the confusion matrix and classification report.
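
The idea behind stacking can be sketched without any library: the base learners' predictions become the input features of the meta learner. Everything in the block below is a toy stand-in. The two threshold rules `base_a` and `base_b`, the lookup-table meta learner, and the data are hypothetical and only illustrate the data flow, not the KNN, logistic regression, or random forest models used in this activity.

```python
from collections import Counter, defaultdict


# Two hypothetical base learners: fixed threshold rules on a 2-feature input.
def base_a(x):
    return int(x[0] > 0.5)


def base_b(x):
    return int(x[1] > 0.5)


def meta_features(X, base_learners):
    """Each sample becomes the tuple of base-learner predictions for it."""
    return [tuple(clf(x) for clf in base_learners) for x in X]


def fit_meta(Z, y):
    """Toy meta learner: majority label for each pattern of base predictions."""
    votes = defaultdict(Counter)
    for z, label in zip(Z, y):
        votes[z][label] += 1
    return {z: counts.most_common(1)[0][0] for z, counts in votes.items()}


X_toy = [(0.9, 0.8), (0.8, 0.2), (0.1, 0.9), (0.2, 0.1), (0.7, 0.9), (0.3, 0.6)]
y_toy = [1, 0, 0, 0, 1, 0]  # label is 1 only when both rules fire

base_learners = [base_a, base_b]
Z = meta_features(X_toy, base_learners)  # level-one features
meta = fit_meta(Z, y_toy)                # meta learner trained on those features

new_sample = (0.95, 0.4)
pred = meta[meta_features([new_sample], base_learners)[0]]
print("meta features:", Z)
print("prediction for new sample:", pred)
```

Here the meta learner picks up that a positive label requires both base learners to agree, something neither base learner expresses on its own.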

In [23]:
knn = KNeighborsClassifier(n_neighbors=5)
lr = LogisticRegression(random_state=123)
rf = RandomForestClassifier(random_state=123)

stacking_ensemble_model = StackingClassifier(
    classifiers=[knn, lr],
    meta_classifier=rf,
    verbose=2
)

In [24]:
stacking_ensemble_model.fit(X_train, y_train)

Fitting 2 classifiers...
Fitting classifier1: kneighborsclassifier (1/2)
KNeighborsClassifier()
Fitting classifier2: logisticregression (2/2)
LogisticRegression(random_state=123)


StackingClassifier(classifiers=[KNeighborsClassifier(),
                                LogisticRegression(random_state=123)],
                   meta_classifier=RandomForestClassifier(random_state=123),
                   verbose=2)

In [25]:
y_pred = stacking_ensemble_model.predict(X_test)

In [26]:
print(f'Accuracy: {accuracy_score(y_test, y_pred):.3f}\n')
print(confusion_matrix(y_test, y_pred))
print('\n')
print(classification_report(y_test, y_pred))

Accuracy: 0.867

[[99  8]
 [18 71]]


              precision    recall  f1-score   support

           0       0.85      0.93      0.88       107
           1       0.90      0.80      0.85        89

    accuracy                           0.87       196
   macro avg       0.87      0.86      0.86       196
weighted avg       0.87      0.87      0.87       196
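
To compare the three techniques side by side, the confusion matrices printed in this activity can be summarised in one place. The sketch below copies those matrices as literals, again assuming rows are true classes and columns are predicted classes. The helper name `summarise` is hypothetical.

```python
# Confusion matrices copied from the outputs above.
results = {
    "bagging": [[94, 13], [7, 82]],
    "boosting": [[95, 12], [8, 81]],
    "stacking": [[99, 8], [18, 71]],
}


def summarise(cm):
    """Return accuracy, false positives, false negatives, and class-1 recall."""
    (tn, fp), (fn, tp) = cm
    return {
        "accuracy": (tn + tp) / (tn + fp + fn + tp),
        "false_positives": fp,
        "false_negatives": fn,
        "recall_1": tp / (tp + fn),
    }


summary = {name: summarise(cm) for name, cm in results.items()}
for name, s in summary.items():
    print(
        f"{name:9s} accuracy={s['accuracy']:.3f} "
        f"FP={s['false_positives']:2d} FN={s['false_negatives']:2d} "
        f"recall(1)={s['recall_1']:.2f}"
    )
```

On this split, bagging and boosting tie at an accuracy of 0.898, while stacking reaches 0.867. Stacking produces the fewest false positives (8) but the most false negatives (18), so the choice also depends on which kind of error matters more for credit card approval.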

