## Ensembles

***

For these exercises we will build several machine learning models for the mnist_27 dataset and then combine them into an ensemble. Each exercise in this comprehension check builds on the last.

Use the training set to fit several models. The original exercise draws them from the R caret package and tests 10 of the most common machine learning models; this notebook uses PyCaret's classification module instead.

### Import Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn

# import shap
# import statsmodels.api as sm
# import datetime
# from datetime import datetime, timedelta
# import scipy.stats
# import pandas_profiling
# from pandas_profiling import ProfileReport
# import graphviz

# import xgboost as xgb
# from xgboost import XGBClassifier, XGBRegressor
# from xgboost import to_graphviz, plot_importance

#from sklearn.experimental import enable_hist_gradient_boosting
#from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, LogisticRegression, Ridge
#from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor, ExtraTreesClassifier, ExtraTreesRegressor
#from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor, HistGradientBoostingClassifier, HistGradientBoostingRegressor


%matplotlib inline
#sets the default autosave frequency in seconds
%autosave 60 
# sns.set() resets the style to its default, so pass the style in the same call
sns.set(style='dark', font_scale=1.2)

plt.rc('axes', labelsize=14)
plt.rc('xtick', labelsize=12)
plt.rc('ytick', labelsize=12)


#from sklearn.pipeline import Pipeline
#from sklearn.model_selection import RepeatedStratifiedKFold
#from sklearn.feature_selection import RFE, RFECV, SelectKBest, f_classif, f_regression, chi2

from sklearn.inspection import permutation_importance
from sklearn.model_selection import cross_val_score, train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.preprocessing import LabelEncoder, StandardScaler, MinMaxScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.tree import export_graphviz, plot_tree
from sklearn.metrics import confusion_matrix, classification_report, mean_absolute_error, mean_squared_error, r2_score
# The plot_* helpers were removed in scikit-learn 1.2; the Display classes replace them
from sklearn.metrics import ConfusionMatrixDisplay, PrecisionRecallDisplay, RocCurveDisplay, accuracy_score
from sklearn.metrics import auc, f1_score, precision_score, recall_score, roc_auc_score


#from tpot import TPOTClassifier, TPOTRegressor
#from imblearn.under_sampling import RandomUnderSampler
#from imblearn.over_sampling import RandomOverSampler
#from imblearn.over_sampling import SMOTE

import warnings
warnings.filterwarnings('ignore')

import pickle
from pickle import dump, load

# Use Folium library to plot values on a map.
#import folium

# Use Feature-Engine library

#import feature_engine.missing_data_imputers as mdi
#from feature_engine.outlier_removers import Winsorizer
#from feature_engine import categorical_encoders as ce


np.random.seed(0)

from pycaret.classification import *
#from pycaret.clustering import *
#from pycaret.regression import *

pd.set_option('display.max_columns',100)
#pd.set_option('display.max_rows',100)
pd.set_option('display.width', 1000)
pd.set_option('display.float_format','{:.2f}'.format)
np.set_printoptions(suppress=True)

Autosaving every 60 seconds


## Exploratory Data Analysis

In [2]:
df = pd.read_csv("mnist27train.csv")

In [3]:
df

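|     | y   | x_1  | x_2  |
| --- | --- | ---- | ---- |
| 0   | 2   | 0.04 | 0.18 |
| 1   | 7   | 0.16 | 0.09 |
| 2   | 2   | 0.02 | 0.28 |
| 3   | 2   | 0.14 | 0.22 |
| 4   | 7   | 0.39 | 0.37 |
| ... | ... | ...  | ...  |
| 795 | 7   | 0.15 | 0.15 |
| 796 | 2   | 0.13 | 0.22 |
| 797 | 2   | 0.18 | 0.37 |
| 798 | 2   | 0.00 | 0.27 |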


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800 entries, 0 to 799
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   y       800 non-null    int64  
 1   x_1     800 non-null    float64
 2   x_2     800 non-null    float64
dtypes: float64(2), int64(1)
memory usage: 18.9 KB


In [5]:
df.describe(include='all')

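|       | y     | x_1   | x_2   |
| ----- | ----- | ----- | ----- |
| count | 800.0 | 800.0 | 800.0 |
| mean  | 4.63  | 0.18  | 0.29  |
| std   | 2.5   | 0.09  | 0.09  |
| min   | 2.0   | 0.0   | 0.09  |
| 25%   | 2.0   | 0.12  | 0.22  |
| 50%   | 7.0   | 0.18  | 0.28  |
| 75%   | 7.0   | 0.24  | 0.34  |
| max   | 7.0   | 0.47  | 0.58  |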


In [6]:
df.shape

(800, 3)

In [7]:
df.columns

Index(['y', 'x_1', 'x_2'], dtype='object')

## Model Training

### Using PyCaret

In [8]:
exp = setup(data=df, target='y', session_id=0, normalize=True)

|     | Description               | Value      |
| --- | ------------------------- | ---------- |
| 0   | session_id                | 0          |
| 1   | Target                    | y          |
| 2   | Target Type               | Binary     |
| 3   | Label Encoded             | 2: 0, 7: 1 |
| 4   | Original Data             | (800, 3)   |
| 5   | Missing Values            | False      |
| 6   | Numeric Features          | 2          |
| 7   | Categorical Features      | 0          |
| 8   | Ordinal Features          | False      |
| 9   | High Cardinality Features | False      |


In [9]:
models()

| ID     | Name                        | Reference                                          | Turbo |
| ------ | --------------------------- | -------------------------------------------------- | ----- |
| lr     | Logistic Regression         | sklearn.linear_model._logistic.LogisticRegression  | True  |
| knn    | K Neighbors Classifier      | sklearn.neighbors._classification.KNeighborsCl...  | True  |
| nb     | Naive Bayes                 | sklearn.naive_bayes.GaussianNB                     | True  |
| dt     | Decision Tree Classifier    | sklearn.tree._classes.DecisionTreeClassifier       | True  |
| svm    | SVM - Linear Kernel         | sklearn.linear_model._stochastic_gradient.SGDC...  | True  |
| rbfsvm | SVM - Radial Kernel         | sklearn.svm._classes.SVC                           | False |
| gpc    | Gaussian Process Classifier | sklearn.gaussian_process._gpc.GaussianProcessC...  | False |
| mlp    | MLP Classifier              | pycaret.internal.tunable.TunableMLPClassifier      | False |
| ridge  | Ridge Classifier            | sklearn.linear_model._ridge.RidgeClassifier        | True  |
| rf     | Random Forest Classifier    | sklearn.ensemble._forest.RandomForestClassifier    | True  |


In [10]:
compare_models(exclude=['catboost', 'lightgbm', 'dt', 'rbfsvm', 'gpc', 'mlp', 'ridge', 'gbc', 'et', 'xgboost'], fold=5)  # For Classifier

|     | Model                           | Accuracy | AUC    | Recall | Prec.  | F1     | Kappa  | MCC    | TT (Sec) |
| --- | ------------------------------- | -------- | ------ | ------ | ------ | ------ | ------ | ------ | -------- |
| qda | Quadratic Discriminant Analysis | 0.8229   | 0.922  | 0.861  | 0.8163 | 0.8361 | 0.6435 | 0.6483 | 0.016    |
| knn | K Neighbors Classifier          | 0.8122   | 0.885  | 0.8237 | 0.823  | 0.8211 | 0.6233 | 0.6268 | 0.024    |
| ada | Ada Boost Classifier            | 0.805    | 0.9007 | 0.8169 | 0.8159 | 0.8136 | 0.609  | 0.6135 | 0.064    |
| nb  | Naive Bayes                     | 0.7942   | 0.8979 | 0.8136 | 0.8027 | 0.8063 | 0.5865 | 0.5892 | 0.014    |
| lr  | Logistic Regression             | 0.7925   | 0.8785 | 0.8136 | 0.7985 | 0.804  | 0.5834 | 0.5866 | 1.154    |
| lda | Linear Discriminant Analysis    | 0.7925   | 0.8781 | 0.8136 | 0.7984 | 0.8041 | 0.5834 | 0.5862 | 0.014    |
| rf  | Random Forest Classifier        | 0.7871   | 0.8853 | 0.7966 | 0.7999 | 0.7968 | 0.5732 | 0.5754 | 0.156    |
| svm | SVM - Linear Kernel             | 0.7586   | 0.0    | 0.7695 | 0.773  | 0.7673 | 0.5162 | 0.5223 | 0.016    |


QuadraticDiscriminantAnalysis(priors=None, reg_param=0.0,
                              store_covariance=False, tol=0.0001)

Now that you have all the trained models in a list, use `sapply()` or `map()` to create a matrix of predictions for the test set. You should end up with a matrix with `length(mnist_27$test$y)` rows and `length(models)` columns.

In [11]:
df2 = pd.read_csv("mnist27test.csv")

In [12]:
df2.shape

(200, 3)
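
The prediction matrix described in the exercise could be built in Python along the following lines. This is a minimal sketch that uses plain scikit-learn rather than the PyCaret session above. The synthetic two-feature data stands in for the train and test files, and the three-model dictionary is an illustrative assumption, not the notebook's model list.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the mnist_27 train/test data (assumption):
# two numeric features and labels 2 or 7.
rng = np.random.default_rng(0)


def make_data(n):
    y = rng.choice([2, 7], size=n)
    # Shift class 7 so the classes overlap but remain separable.
    X = rng.normal(size=(n, 2)) + (y == 7)[:, None] * 1.5
    return X, y


X_train, y_train = make_data(800)
X_test, y_test = make_data(200)

# Hypothetical model list; any classifier with fit() and predict() works.
models = {
    "lr": LogisticRegression(),
    "knn": KNeighborsClassifier(),
    "nb": GaussianNB(),
}

# One column of test-set predictions per model.
pred_matrix = np.column_stack(
    [model.fit(X_train, y_train).predict(X_test) for model in models.values()]
)
print(pred_matrix.shape)  # (number of test rows, number of models)
```

The list comprehension plays the role of `sapply()` or `map()` in the R version: it applies the same fit-and-predict step to every model and collects the results.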

Now compute accuracy for each model on the test set. Report the mean accuracy across all models.
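
A small sketch of the per-model accuracy calculation, assuming the predictions are stored with one column per model. The labels and predictions below are toy values invented for illustration, not results from this notebook.

```python
import numpy as np

# Toy values (assumption): 6 test labels and predictions from 3 models,
# one column per model.
y_test = np.array([2, 7, 7, 2, 7, 2])
pred_matrix = np.array([
    [2, 2, 7],
    [7, 7, 7],
    [7, 2, 7],
    [2, 2, 7],
    [7, 7, 2],
    [7, 2, 2],
])

# Compare every column with the true labels, then average over the rows.
accuracies = (pred_matrix == y_test[:, None]).mean(axis=0)

# Mean accuracy across all models.
mean_accuracy = accuracies.mean()
print(accuracies, mean_accuracy)
```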

Next, build an ensemble prediction by majority vote and compute the accuracy of the ensemble. Vote 7 if more than 50% of the models are predicting a 7, and 2 otherwise.
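
The voting rule above can be sketched with NumPy as follows. The toy labels and predictions are assumptions made for illustration only.

```python
import numpy as np

# Toy values (assumption): 6 test labels and predictions from 3 models.
y_test = np.array([2, 7, 7, 2, 7, 2])
pred_matrix = np.array([
    [2, 2, 7],
    [7, 7, 7],
    [7, 2, 7],
    [2, 2, 7],
    [7, 7, 2],
    [7, 2, 2],
])

# Share of models voting 7 for each test row.
share_of_sevens = (pred_matrix == 7).mean(axis=1)

# Strictly more than 50% of the votes gives 7, anything else gives 2.
ensemble_pred = np.where(share_of_sevens > 0.5, 7, 2)

ensemble_accuracy = (ensemble_pred == y_test).mean()
print(ensemble_pred, ensemble_accuracy)
```

In this toy case every model makes at least one mistake, yet the vote is right on every row, because the models err on different rows.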

In Q3, we computed the accuracy of each method on the test set and noticed that the individual accuracies varied.
How many of the individual methods do better than the ensemble?

Which individual methods perform better than the ensemble?
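
One way to answer both questions is a simple filter over the per-model accuracies. The model names and numbers below are hypothetical placeholders, not results from this notebook.

```python
# Hypothetical test-set accuracies (assumption), keyed by model name.
accuracies = {
    "model_a": 0.75,
    "model_b": 0.84,
    "model_c": 0.79,
    "model_d": 0.82,
}
ensemble_accuracy = 0.81  # hypothetical ensemble accuracy

# Keep only the models that strictly beat the ensemble.
better = [name for name, acc in accuracies.items() if acc > ensemble_accuracy]

print(len(better), better)
```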

It is tempting to remove the methods that do not perform well and re-do the ensemble. The problem with this approach is that we are using the test data to make a decision. However, we could use the minimum accuracy estimates obtained from cross-validation with the training data for each model from `fit$results$Accuracy`. Obtain these estimates and save them in an object. Report the mean of these training set accuracy estimates.
What is the mean of these training set accuracy estimates?
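
In caret, `fit$results$Accuracy` holds one cross-validated accuracy per tuning-parameter value. A rough scikit-learn analogue, offered here as an assumption rather than an exact equivalent, is the `mean_test_score` array in `GridSearchCV.cv_results_`. The sketch below uses synthetic data and hypothetical tuning grids.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the training data (assumption).
rng = np.random.default_rng(0)
y_train = rng.choice([2, 7], size=400)
X_train = rng.normal(size=(400, 2)) + (y_train == 7)[:, None] * 1.5

# Hypothetical models and tuning grids.
candidates = {
    "knn": (KNeighborsClassifier(), {"n_neighbors": [3, 5, 7]}),
    "nb": (GaussianNB(), {"var_smoothing": [1e-9, 1e-6]}),
}

min_cv_accuracy = {}
for name, (estimator, grid) in candidates.items():
    search = GridSearchCV(estimator, grid, scoring="accuracy", cv=5)
    search.fit(X_train, y_train)
    # One mean cross-validated accuracy per grid point; keep the smallest.
    min_cv_accuracy[name] = search.cv_results_["mean_test_score"].min()

# Mean of the training-set accuracy estimates.
mean_of_estimates = np.mean(list(min_cv_accuracy.values()))
print(min_cv_accuracy, mean_of_estimates)
```

Only the training data is touched here, so the estimates can be used to select models without leaking information from the test set.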

Now let's only consider the methods with a minimum accuracy estimate of greater than or equal to 0.8 when constructing the ensemble. Vote 7 if 50% or more of those models are predicting a 7, and 2 otherwise.
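
The filtered ensemble could be sketched as below. The model names, cross-validation estimates, labels, and predictions are all toy values assumed for illustration.

```python
import numpy as np

# Hypothetical minimum cross-validation accuracies, one per model (assumption).
model_names = ["model_a", "model_b", "model_c", "model_d"]
min_cv_accuracy = np.array([0.82, 0.78, 0.80, 0.85])

# Toy test labels and predictions, one column per model (assumption).
y_test = np.array([2, 7, 7, 2, 7, 2])
pred_matrix = np.array([
    [2, 7, 2, 7],
    [7, 2, 7, 7],
    [7, 7, 2, 2],
    [2, 2, 2, 7],
    [7, 7, 7, 2],
    [2, 7, 2, 2],
])

# Keep the models whose estimate is at least 0.8.
keep = min_cv_accuracy >= 0.8
kept_names = [name for name, flag in zip(model_names, keep) if flag]
kept_preds = pred_matrix[:, keep]

# Vote 7 when 50% or more of the kept models predict 7, otherwise 2.
share_of_sevens = (kept_preds == 7).mean(axis=1)
ensemble_pred = np.where(share_of_sevens >= 0.5, 7, 2)

ensemble_accuracy = (ensemble_pred == y_test).mean()
print(kept_names, ensemble_pred, ensemble_accuracy)
```

The `>=` in the vote only changes the outcome when an even number of models is kept, since that is the only case where an exact 50% split can occur.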

#### Python code done by Dennis Lam