# A1: Ensemble Models
Alex Dien



# Task 1 - Prepare your data

### A. Data Import

In [393]:
# load packages
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_validate
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import VotingClassifier
from pandas import DataFrame, Series



In [394]:
df = pd.read_csv('https://raw.githubusercontent.com/matthewpecsok/4482_fall_2022/main/data/CD_additional_modified.csv')

### B. Show overall structure and summary of input data

In [395]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4117 entries, 0 to 4116
Data columns (total 21 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             4117 non-null   int64  
 1   job             4117 non-null   object 
 2   marital         4117 non-null   object 
 3   education       4117 non-null   object 
 4   default         4117 non-null   object 
 5   housing         4117 non-null   object 
 6   loan            4117 non-null   object 
 7   contact         4117 non-null   object 
 8   month           4117 non-null   object 
 9   day_of_week     4117 non-null   object 
 10  duration        4117 non-null   int64  
 11  campaign        4117 non-null   int64  
 12  pdays           4117 non-null   int64  
 13  previous        4117 non-null   int64  
 14  poutcome        4117 non-null   object 
 15  emp_var_rate    4117 non-null   float64
 16  cons_price_idx  4117 non-null   float64
 17  cons_conf_idx   4117 non-null   f

In [396]:
# Summary statistics for the numeric columns
df.describe()

Unnamed: 0,age,duration,campaign,pdays,previous,emp_var_rate,cons_price_idx,cons_conf_idx,euribor3m,nr_employed
count,4117.0,4117.0,4117.0,4117.0,4117.0,4117.0,4117.0,4117.0,4117.0,4117.0
mean,40.115375,256.850376,2.537042,960.403449,0.190187,0.085742,93.580131,-40.500947,3.621904,5166.496502
std,10.314847,254.749615,2.568668,191.967524,0.541765,1.562799,0.579061,4.593445,1.733448,73.670942
min,18.0,0.0,1.0,0.0,0.0,-3.4,92.201,-50.8,0.635,4963.6
25%,32.0,103.0,1.0,999.0,0.0,-1.8,93.075,-42.7,1.334,5099.1
50%,38.0,181.0,2.0,999.0,0.0,1.1,93.749,-41.8,4.857,5191.0
75%,47.0,317.0,3.0,999.0,0.0,1.4,93.994,-36.4,4.961,5228.1
max,88.0,3643.0,35.0,999.0,6.0,1.4,94.767,-26.9,5.045,5228.1


In [397]:
# First five rows of df
df.head()

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp_var_rate,cons_price_idx,cons_conf_idx,euribor3m,nr_employed,y
0,30,blue-collar,married,basic.9y,no,yes,no,cellular,may,fri,...,2,999,0,nonexistent,-1.8,92.893,-46.2,1.313,5099.1,no
1,39,services,single,high.school,no,no,no,telephone,may,fri,...,4,999,0,nonexistent,1.1,93.994,-36.4,4.855,5191.0,no
2,25,services,married,high.school,no,yes,no,telephone,jun,wed,...,1,999,0,nonexistent,1.4,94.465,-41.8,4.962,5228.1,no
3,38,services,married,basic.9y,no,unknown,unknown,telephone,jun,fri,...,3,999,0,nonexistent,1.4,94.465,-41.8,4.959,5228.1,no
4,47,admin.,married,university.degree,no,yes,no,cellular,nov,mon,...,1,999,0,nonexistent,-0.1,93.2,-42.0,4.191,5195.8,no


### C. pop() the target variable (y) into a new variable, i.e. y_target

In [398]:
y_target = df.pop('y')

In [399]:
y_target = pd.get_dummies(y_target,drop_first=True)
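A small sketch of what this encoding step produces, using a made-up Series in place of the real target (the values and the names `y_demo`, `y_demo_enc`, and `y_flat` are illustrative, not taken from the dataset). `pd.get_dummies(..., drop_first=True)` returns a one-column DataFrame rather than a Series, which is presumably why later cells call `y_target.values.ravel()` before fitting.

```python
import pandas as pd

# Toy stand-in for the popped target column (hypothetical values).
y_demo = pd.Series(['no', 'yes', 'no', 'no', 'yes'], name='y')

# drop_first=True drops the 'no' column and keeps a single 'yes' indicator.
y_demo_enc = pd.get_dummies(y_demo, drop_first=True)

print(type(y_demo_enc).__name__)   # a DataFrame, not a Series
print(list(y_demo_enc.columns))    # the remaining indicator column

# Flatten to a 1-D integer array, the shape scikit-learn expects for y.
y_flat = y_demo_enc.values.ravel().astype(int)
print(y_flat)
```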

### D. Show the overall structure and summary of the data frame using info() describe() head() 

In [400]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4117 entries, 0 to 4116
Data columns (total 20 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             4117 non-null   int64  
 1   job             4117 non-null   object 
 2   marital         4117 non-null   object 
 3   education       4117 non-null   object 
 4   default         4117 non-null   object 
 5   housing         4117 non-null   object 
 6   loan            4117 non-null   object 
 7   contact         4117 non-null   object 
 8   month           4117 non-null   object 
 9   day_of_week     4117 non-null   object 
 10  duration        4117 non-null   int64  
 11  campaign        4117 non-null   int64  
 12  pdays           4117 non-null   int64  
 13  previous        4117 non-null   int64  
 14  poutcome        4117 non-null   object 
 15  emp_var_rate    4117 non-null   float64
 16  cons_price_idx  4117 non-null   float64
 17  cons_conf_idx   4117 non-null   f

In [401]:
df.describe()

Unnamed: 0,age,duration,campaign,pdays,previous,emp_var_rate,cons_price_idx,cons_conf_idx,euribor3m,nr_employed
count,4117.0,4117.0,4117.0,4117.0,4117.0,4117.0,4117.0,4117.0,4117.0,4117.0
mean,40.115375,256.850376,2.537042,960.403449,0.190187,0.085742,93.580131,-40.500947,3.621904,5166.496502
std,10.314847,254.749615,2.568668,191.967524,0.541765,1.562799,0.579061,4.593445,1.733448,73.670942
min,18.0,0.0,1.0,0.0,0.0,-3.4,92.201,-50.8,0.635,4963.6
25%,32.0,103.0,1.0,999.0,0.0,-1.8,93.075,-42.7,1.334,5099.1
50%,38.0,181.0,2.0,999.0,0.0,1.1,93.749,-41.8,4.857,5191.0
75%,47.0,317.0,3.0,999.0,0.0,1.4,93.994,-36.4,4.961,5228.1
max,88.0,3643.0,35.0,999.0,6.0,1.4,94.767,-26.9,5.045,5228.1


In [402]:
df.head()

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,duration,campaign,pdays,previous,poutcome,emp_var_rate,cons_price_idx,cons_conf_idx,euribor3m,nr_employed
0,30,blue-collar,married,basic.9y,no,yes,no,cellular,may,fri,487,2,999,0,nonexistent,-1.8,92.893,-46.2,1.313,5099.1
1,39,services,single,high.school,no,no,no,telephone,may,fri,346,4,999,0,nonexistent,1.1,93.994,-36.4,4.855,5191.0
2,25,services,married,high.school,no,yes,no,telephone,jun,wed,227,1,999,0,nonexistent,1.4,94.465,-41.8,4.962,5228.1
3,38,services,married,basic.9y,no,unknown,unknown,telephone,jun,fri,17,3,999,0,nonexistent,1.4,94.465,-41.8,4.959,5228.1
4,47,admin.,married,university.degree,no,yes,no,cellular,nov,mon,58,1,999,0,nonexistent,-0.1,93.2,-42.0,4.191,5195.8


### E. In a text block explain why you do, or do not need to dummy encode the dataset



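Dummy (one-hot) encoding converts categorical variables into numeric indicator columns. Ten of the twenty remaining columns (job, marital, education, default, housing, loan, contact, month, day_of_week, and poutcome) still have dtype object, and scikit-learn estimators require numeric input, so the dataset must be dummy encoded before modeling.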
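As a minimal illustration of the idea, here is `pd.get_dummies()` applied to a tiny hypothetical frame (the `toy` data is made up for this sketch). Object columns are expanded into one indicator column per category, while numeric columns pass through unchanged.

```python
import pandas as pd

# Toy frame with one numeric and one categorical column (hypothetical values).
toy = pd.DataFrame({
    'age': [30, 39, 25],
    'marital': ['married', 'single', 'married'],
})

# Each category of 'marital' becomes its own indicator column.
toy_enc = pd.get_dummies(toy)
print(toy_enc.columns.tolist())
print(toy_enc)
```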

In [403]:
df_enc = pd.get_dummies(df)
df_enc

Unnamed: 0,age,duration,campaign,pdays,previous,emp_var_rate,cons_price_idx,cons_conf_idx,euribor3m,nr_employed,...,month_oct,month_sep,day_of_week_fri,day_of_week_mon,day_of_week_thu,day_of_week_tue,day_of_week_wed,poutcome_failure,poutcome_nonexistent,poutcome_success
0,30,487,2,999,0,-1.8,92.893,-46.2,1.313,5099.1,...,0,0,1,0,0,0,0,0,1,0
1,39,346,4,999,0,1.1,93.994,-36.4,4.855,5191.0,...,0,0,1,0,0,0,0,0,1,0
2,25,227,1,999,0,1.4,94.465,-41.8,4.962,5228.1,...,0,0,0,0,0,0,1,0,1,0
3,38,17,3,999,0,1.4,94.465,-41.8,4.959,5228.1,...,0,0,1,0,0,0,0,0,1,0
4,47,58,1,999,0,-0.1,93.200,-42.0,4.191,5195.8,...,0,0,0,1,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4112,30,53,1,999,0,1.4,93.918,-42.7,4.958,5228.1,...,0,0,0,0,1,0,0,0,1,0
4113,39,219,1,999,0,1.4,93.918,-42.7,4.959,5228.1,...,0,0,1,0,0,0,0,0,1,0
4114,27,64,2,999,1,-1.8,92.893,-46.2,1.354,5099.1,...,0,0,0,1,0,0,0,1,0,0
4115,58,528,1,999,0,1.4,93.444,-36.1,4.966,5228.1,...,0,0,1,0,0,0,0,0,1,0


# Task 2 - Simple Model


## 1. Use cross_validate to evaluate a simple DecisionTree classifier model on the encoded dataset. Show the f1 score.


In [404]:
clf = DecisionTreeClassifier().fit(df_enc, y_target)

In [405]:
results = pd.DataFrame(cross_validate(clf, df_enc, y_target, scoring=['f1']))
results

Unnamed: 0,fit_time,score_time,test_f1
0,0.032775,0.004001,0.454545
1,0.03078,0.003826,0.455556
2,0.030263,0.003733,0.529101
3,0.032201,0.005583,0.452381
4,0.031127,0.003736,0.507937


## 2. Turn the cross_validate results into a dataframe and agg the dataframe to compute the mean and standard deviation.



In [406]:
# Turn the results into a dataframe
df_dtc = pd.DataFrame(results)

In [407]:
scores = df_dtc.agg(['mean', 'std'])

In [408]:
print(scores)

      fit_time  score_time   test_f1
mean  0.031429    0.004176  0.479904
std   0.001034    0.000794  0.036054


# Task 3

### 1. Create 3 models using GridSearchCV. The goal here is to have 3 good models whose best hyperparameters were discovered using gridsearch.


In [409]:
# Model 1
rfc = RandomForestClassifier(n_jobs=-1)

# Params
param_grid = {'n_estimators': [10, 50, 100, 200], 'max_depth': list(range(1,20,2))}

rfc_grid = GridSearchCV(rfc, param_grid, scoring=['f1'], refit=False)

rfc_grid = rfc_grid.fit(df_enc, y_target.values.ravel())

results1 = rfc_grid.cv_results_

In [410]:
# Model 2
svc_sigmoid = SVC(kernel="sigmoid",random_state=1)

# SVC params
param_grid = {"C": [0.01, 0.1, 1, 10, 100]}

# Use GridSearchCV to search for the best hyperparameters using the f1 scoring function
svc_sigmoid_grid = GridSearchCV(svc_sigmoid, param_grid, scoring=['f1'], refit=False)

# Fit the SVC model on the dataset
svc_sigmoid_grid = svc_sigmoid_grid.fit(df_enc, y_target.values.ravel())

results2 = svc_sigmoid_grid.cv_results_

In [411]:
# Model 3
svc_rbf = SVC(kernel="rbf")

# SVC params
param_grid = {"C":list(range(1,3,1))}

svc_rbf_grid = GridSearchCV(svc_rbf, param_grid, scoring=['f1'],refit=False)

# Fit the SVC model on the dataset
svc_rbf_grid.fit(df_enc, y_target.values.ravel())

results3 = svc_rbf_grid.cv_results_

### 2. Use f1 score for scoring 

### 3. Show the mean, standard deviation, and params for each of the best models (ideally filter the results dataframe for rank 1)

In [412]:
results1.keys()

dict_keys(['mean_fit_time', 'std_fit_time', 'mean_score_time', 'std_score_time', 'param_max_depth', 'param_n_estimators', 'params', 'split0_test_f1', 'split1_test_f1', 'split2_test_f1', 'split3_test_f1', 'split4_test_f1', 'mean_test_f1', 'std_test_f1', 'rank_test_f1'])

In [413]:
# Results for RandomForestClassifier model
pd.DataFrame(results1)[['mean_fit_time','mean_test_f1', 'mean_score_time', 'std_fit_time', 'std_score_time', 'std_test_f1', 'params']].sort_values('mean_test_f1',ascending=False).head(10)

Unnamed: 0,mean_fit_time,mean_test_f1,mean_score_time,std_fit_time,std_score_time,std_test_f1,params
33,0.240865,0.458725,0.024685,0.006772,0.001728,0.027358,"{'max_depth': 17, 'n_estimators': 50}"
38,0.443328,0.455465,0.041783,0.013527,0.00101,0.027374,"{'max_depth': 19, 'n_estimators': 100}"
39,0.80391,0.447914,0.081592,0.016573,0.001209,0.036468,"{'max_depth': 19, 'n_estimators': 200}"
34,0.447278,0.444186,0.039959,0.047103,0.001289,0.05247,"{'max_depth': 17, 'n_estimators': 100}"
31,0.773893,0.43873,0.078182,0.007321,0.003899,0.030716,"{'max_depth': 15, 'n_estimators': 200}"
35,0.792377,0.43754,0.078122,0.004873,0.004119,0.045799,"{'max_depth': 17, 'n_estimators': 200}"
26,0.414031,0.433902,0.037607,0.025183,0.001365,0.022993,"{'max_depth': 13, 'n_estimators': 100}"
24,0.071348,0.433206,0.014122,0.00728,0.006341,0.041766,"{'max_depth': 13, 'n_estimators': 10}"
37,0.250945,0.428207,0.025069,0.012408,0.001457,0.03172,"{'max_depth': 19, 'n_estimators': 50}"
27,0.747204,0.425785,0.077401,0.009501,0.001892,0.051484,"{'max_depth': 13, 'n_estimators': 200}"


In [414]:
# Results for SVC sigmoid model
pd.DataFrame(results2)[['mean_fit_time','mean_test_f1', 'mean_score_time', 'std_score_time', 'std_fit_time', 'std_test_f1', 'rank_test_f1', 'params']].sort_values('mean_test_f1',ascending=False).head(10)

Unnamed: 0,mean_fit_time,mean_test_f1,mean_score_time,std_score_time,std_fit_time,std_test_f1,rank_test_f1,params
4,0.269331,0.385482,0.060956,0.00898,0.003779,0.054527,1,{'C': 100}
3,0.289185,0.322945,0.063265,0.00154,0.004789,0.04221,2,{'C': 10}
0,0.272997,0.0,0.067507,0.00391,0.005781,0.0,3,{'C': 0.01}
1,0.300027,0.0,0.065885,0.002394,0.004783,0.0,3,{'C': 0.1}
2,0.295043,0.0,0.064732,0.003281,0.005769,0.0,3,{'C': 1}


In [415]:
# Results for SVC rbf model
pd.DataFrame(results3)[['mean_fit_time','mean_test_f1', 'mean_score_time', 'std_score_time', 'std_fit_time', 'std_test_f1', 'rank_test_f1', 'params']].sort_values('mean_test_f1',ascending=False).head(10)

Unnamed: 0,mean_fit_time,mean_test_f1,mean_score_time,std_score_time,std_fit_time,std_test_f1,rank_test_f1,params
1,0.238229,0.331276,0.072711,0.001878,0.005405,0.057791,1,{'C': 2}
0,0.246031,0.328872,0.075335,0.004685,0.005497,0.060131,2,{'C': 1}
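The tables above are sorted by mean F1, so the best model is the first row. To filter for rank 1 directly, as the task suggests, one possible helper is sketched below. The function name `best_rows` and the `demo_results` dict are hypothetical; the dict only mimics the shape of a `cv_results_` dictionary produced with `scoring=['f1']`.

```python
import pandas as pd

def best_rows(cv_results, metric='f1'):
    """Return the rank-1 row(s) of a cv_results_-style dict for one metric."""
    res = pd.DataFrame(cv_results)
    cols = [f'mean_test_{metric}', f'std_test_{metric}', 'params']
    return res.loc[res[f'rank_test_{metric}'] == 1, cols]

# Hypothetical cv_results_-shaped dict, only to show the call.
demo_results = {
    'mean_test_f1': [0.10, 0.30, 0.20],
    'std_test_f1': [0.01, 0.02, 0.03],
    'rank_test_f1': [3, 1, 2],
    'params': [{'C': 1}, {'C': 10}, {'C': 5}],
}
print(best_rows(demo_results))
```

In this notebook the same call would be `best_rows(results1)`, `best_rows(results2)`, and `best_rows(results3)`. Ties for rank 1 would return more than one row.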


# Task 4 - Voting Ensemble Classifier

### 1. Use the best parameters found in the GridSearch for the 3 models to instantiate new models

In [416]:
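# RFC model. Note: the grid search above ranked
# {'max_depth': 17, 'n_estimators': 50} first; this is the second-ranked combination.
rfc_best = RandomForestClassifier(n_estimators=100, max_depth=19)

In [417]:
# SVC with sigmoid kernel. Note: C=0.01 scored a mean F1 of 0 in the grid search above;
# the rank 1 value was C=100.
svc_sig_best = SVC(kernel='sigmoid', C=.01)

In [418]:
# SVC with radial basis kernel. Note: the rank 1 value in the grid search above was C=2.
rbf_best = SVC(kernel='rbf', C=1)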

### 2. Include these 3 new models in a VotingClassifier object

In [419]:
eclf = VotingClassifier(
    estimators=[('svc_sig', svc_sig_best), ('rf', rfc_best), ('svc_rbf', rbf_best)],
    voting='hard')

In [420]:
eclf = eclf.fit(df_enc, y_target.values.ravel())

### 3. Using cross_validation evaluate the VotingClassifier for f1 score. Agg for mean

In [421]:
pd.DataFrame(cross_validate(eclf, df_enc, y_target.values.ravel(), scoring=['f1'])).agg('mean')

fit_time      0.951853
score_time    0.185172
test_f1       0.283105
dtype: float64

### 4. In a text block describe the voting ensemble method.  (see the sklearn documentation)

A voting ensemble combines the predictions of several different models that are each trained on the same dataset. In scikit-learn's `VotingClassifier`, the `voting` parameter selects the strategy. With "hard" voting, each model predicts a class label and the final prediction is the label chosen by the majority. With "soft" voting, the predicted class probabilities of the individual models are averaged and the class with the highest average probability is chosen, so every estimator must be able to output probabilities. Either strategy can be weighted through the optional `weights` parameter, which gives some models more influence than others. Voting helps when the individual models make different kinds of errors, because those errors can partly cancel out, but it does not guarantee an improvement. Here the hard-voting ensemble reached a mean F1 of about 0.28, which is below the single decision tree from Task 2 (about 0.48).
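The difference between hard and soft voting can be seen in a small worked example. The probabilities below are invented for illustration, and the arithmetic is done by hand with NumPy rather than through `VotingClassifier`, so this is a sketch of the idea for the binary case, not of the library's internals.

```python
import numpy as np

# Hypothetical class-1 probabilities from three classifiers for four samples.
proba_class1 = np.array([
    [0.90, 0.40, 0.45],   # sample 0
    [0.20, 0.30, 0.10],   # sample 1
    [0.55, 0.60, 0.10],   # sample 2
    [0.45, 0.45, 0.95],   # sample 3
])

# Hard voting: each model casts a 0/1 vote and the majority label wins.
votes = (proba_class1 >= 0.5).astype(int)
hard = (votes.sum(axis=1) >= 2).astype(int)

# Soft voting: average the probabilities, then threshold the mean.
soft = (proba_class1.mean(axis=1) >= 0.5).astype(int)

print(hard)   # [0 0 1 0]
print(soft)   # [1 0 0 1]
```

Samples 0, 2, and 3 show the two strategies disagreeing: one very confident model can outweigh two mildly opposed ones under soft voting, but not under hard voting. Note that `SVC` only exposes `predict_proba` when constructed with `probability=True`, so the ensemble above would need that change before `voting='soft'` could be used.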

# Task 5 - Boosting Model 


### 1. Using grid search create and evaluate a GradientBoostingClassifier for f1 score.

In [422]:
parameters = {'max_depth':list(range(1,20,2)),
              'n_estimators':list(range(1,20,2))
              }

In [None]:
clf = GridSearchCV(GradientBoostingClassifier(), parameters,scoring=['f1'],refit=False)
clf = clf.fit(df_enc, y_target.values.ravel())
results4 = clf.cv_results_

In [None]:
results4.keys()

### 2. Filter the grid search result dataframe to show the rank 1 result of grid search. 

In [None]:
# Sort ascending by rank so that the rank 1 result is the first row
pd.DataFrame(results4)[['mean_fit_time', 'std_fit_time', 'mean_score_time', 'std_score_time', 'param_max_depth', 'param_n_estimators', 'params', 'split0_test_f1', 'split1_test_f1', 'split2_test_f1', 'split3_test_f1', 'split4_test_f1', 'mean_test_f1', 'std_test_f1', 'rank_test_f1']].sort_values('rank_test_f1', ascending=True).head()

### 3. In a text block describe how a boosting model works. (see the sklearn documentation)

Boosting is an ensemble method that combines many weak learners into a more accurate model. The weak learners are trained sequentially, and each new learner tries to correct the mistakes of the ensemble built so far. In gradient boosting, as implemented by `GradientBoostingClassifier`, each new learner is a regression tree fitted to the negative gradient of the loss function with respect to the current predictions, and its output is scaled by the learning rate before being added to the ensemble. The combined model can therefore make more accurate predictions than any of the individual weak learners could on their own. The weak learners are typically shallow decision trees, although other model types can also be used. Boosting can be used for both regression and classification tasks.
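The sequential correction described above can be sketched by hand. The example below is a simplification under stated assumptions: it uses a synthetic regression target with squared-error loss (where the negative gradient is simply the residual) instead of the classification loss used by `GradientBoostingClassifier`, and the learning rate and number of stages are arbitrary choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression data (not the notebook's dataset).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0])

learning_rate = 0.5
pred = np.full_like(y, y.mean())        # stage 0: predict the mean
errors = [np.mean((y - pred) ** 2)]     # training MSE after each stage

for _ in range(20):
    residual = y - pred                 # what the ensemble still gets wrong
    stump = DecisionTreeRegressor(max_depth=1).fit(X, residual)
    pred += learning_rate * stump.predict(X)   # add a damped correction
    errors.append(np.mean((y - pred) ** 2))

print(errors[0], errors[-1])
```

Each depth-1 tree is a weak learner on its own, yet the training error shrinks stage by stage because every tree targets the residual left by its predecessors.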

# Task 6 - Stacked Ensemble Classifier

### 1. Use the 3 models created in Task 3 in a StackingClassifier object


In [None]:
estimators = [('svc_sig', svc_sig_best),
              ('rf', rfc_best),
              ('svc_rbf', rbf_best)]

### 2. For the final_estimator use a GradientBoostingClassifier 


In [None]:

final_estimator = GradientBoostingClassifier(
    n_estimators=25, subsample=0.5, min_samples_leaf=25, max_features=1,
    random_state=42)



stacked_ens = StackingClassifier(
    estimators=estimators,
    final_estimator=final_estimator)

### 3. Using cross_validation evaluate the StackingClassifier for f1 score. Agg for mean


In [None]:
pd.DataFrame(cross_validate(stacked_ens, df_enc, y_target.values.ravel(), scoring=['f1'])).agg('mean')

### 4. In text block describe the stacked ensemble method (see the sklearn documentation)


The stacked ensemble method combines the predictions of multiple models, called base models, by training a second model, called the meta-model (the `final_estimator` in scikit-learn), on the outputs of the base models. In `StackingClassifier`, the base models are fitted on the full training data, while the meta-model is trained on cross-validated predictions of the base models, so it does not learn from outputs the base models produced on rows they had already seen. The meta-model learns how much to trust each base model, which can improve predictive performance over any single base model, although an improvement is not guaranteed.
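The two-level structure can be made explicit with a hand-rolled sketch. This is an approximation of the idea rather than the internals of `StackingClassifier`: it uses synthetic data from `make_classification`, two base models instead of three, and a `LogisticRegression` meta-model instead of the `GradientBoostingClassifier` used above. All of these are assumptions made to keep the example short.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC

# Synthetic binary classification data (not the notebook's dataset).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

rf = RandomForestClassifier(n_estimators=25, random_state=0)
svc = SVC(kernel='rbf')

# Out-of-fold outputs: every row is scored by a model that did not train on it.
rf_oof = cross_val_predict(rf, X, y, cv=5, method='predict_proba')[:, 1]
svc_oof = cross_val_predict(svc, X, y, cv=5, method='decision_function')

# The base-model outputs become the meta-model's input features.
meta_features = np.column_stack([rf_oof, svc_oof])
meta_model = LogisticRegression().fit(meta_features, y)

print(meta_features.shape)
```

Using out-of-fold predictions matters: if the meta-model were trained on predictions the base models made for their own training rows, it would see overly optimistic inputs and learn to trust them too much.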
