# Ensemble Classification

The data set describes high school students. Each row represents a single deidentified student, and the columns hold that student's characteristics. The task is binary classification: predict whether a student drinks alcohol (the **Alc** column: 1 = Yes, 0 = No). Such predictions can help detect underage drinking and target interventions.

## Goal

Use the **alcohol.csv** data set and build a model to predict **Alc**.

# Read and Prepare the Data:

In [1]:
import numpy as np
import pandas as pd

np.random.seed(42)

In [2]:
alcohol = pd.read_csv("alcohol.csv")
alcohol.head()

Unnamed: 0,sex,age,Medu,Fedu,traveltime,studytime,failures,famrel,freetime,goout,health,absences,Alc
0,F,18,4,4,2,2,0,4,3,4,3,6,0
1,F,17,1,1,1,2,0,5,3,3,3,4,0
2,F,15,1,1,1,2,3,4,3,2,3,10,1
3,F,15,4,2,1,3,0,3,2,2,5,2,0
4,F,16,3,3,1,2,0,4,3,2,5,4,1


In [3]:
#Check for missing values
alcohol.isna().sum()

sex           0
age           0
Medu          0
Fedu          0
traveltime    0
studytime     0
failures      0
famrel        0
freetime      0
goout         0
health        0
absences      0
Alc           0
dtype: int64

### Split into Training and Test

In [4]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(alcohol, test_size=0.3)

### Separate Target

In [5]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder

train_y = train[['Alc']]
test_y = test[['Alc']]

train_inputs = train.drop(['Alc'], axis=1)
test_inputs = test.drop(['Alc'], axis=1)

### Identify Numeric/Categorical columns

In [6]:
train_inputs.dtypes

sex           object
age            int64
Medu           int64
Fedu           int64
traveltime     int64
studytime      int64
failures       int64
famrel         int64
freetime       int64
goout          int64
health         int64
absences       int64
dtype: object

In [7]:
# Identify the numerical columns
numeric_columns = train_inputs.select_dtypes(include=[np.number]).columns
numeric_columns

Index(['age', 'Medu', 'Fedu', 'traveltime', 'studytime', 'failures', 'famrel',
       'freetime', 'goout', 'health', 'absences'],
      dtype='object')

In [8]:
# Identify the categorical columns
categorical_columns = train_inputs.select_dtypes('object').columns
categorical_columns

Index(['sex'], dtype='object')

### Pipeline

In [9]:
numeric_transformer = Pipeline(steps=[
                #('imputer', SimpleImputer(strategy='median')),
                ('scaler', StandardScaler())])

In [10]:
categorical_transformer = Pipeline(steps=[
    #('imputer', SimpleImputer(strategy='constant', fill_value='unknown')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

In [11]:
preprocessor = ColumnTransformer([
        ('num', numeric_transformer, numeric_columns),
        ('cat', categorical_transformer, categorical_columns)])

### Transform Train and Test

In [12]:
#Fit and transform the train data
train_x = preprocessor.fit_transform(train_inputs)

train_x.shape

(276, 13)

In [13]:
# Transform the test data
test_x = preprocessor.transform(test_inputs)

test_x.shape

(119, 13)

# Determine Baseline Accuracy

In [14]:
# Find the majority class:
train_y['Alc'].value_counts()

1    172
0    104
Name: Alc, dtype: int64

In [15]:
#Find the percentage of the majority class:
train_y['Alc'].value_counts()/len(train_y)

1    0.623188
0    0.376812
Name: Alc, dtype: float64

Baseline accuracy is 62.3%: the accuracy obtained by always predicting the majority class (1) on the training set.
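
The same baseline can be reproduced with scikit-learn's `DummyClassifier`. The sketch below is only an illustration: it rebuilds the labels from the class counts printed above instead of using the real `train_y`, and the names `X`, `y`, and `baseline` are placeholders.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Labels rebuilt from the class counts reported above (172 ones, 104 zeros).
y = np.array([1] * 172 + [0] * 104)
# DummyClassifier ignores the features, so a placeholder column is enough.
X = np.zeros((len(y), 1))

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
baseline_acc = baseline.score(X, y)
print(baseline_acc)
```

Scoring the fitted dummy model on the test split instead would give the majority-class rate that the test accuracies below should be compared against.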

# Hard voting classifier (should include at least two models)

In [16]:
from sklearn.tree import DecisionTreeClassifier 
from sklearn.linear_model import SGDClassifier 
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

dtree_clf = DecisionTreeClassifier(max_depth=5)
log_clf = LogisticRegression(multi_class='multinomial', solver = 'lbfgs', C=11, max_iter=1000)
sgd_clf = SGDClassifier(max_iter=8000, tol=1e-3)

Hvoting_clf = VotingClassifier(
            estimators=[('dt', dtree_clf), 
                        ('lr', log_clf), 
                        ('sgd', sgd_clf)],
            voting='hard')

# ravel() passes the labels as a 1-D array, which avoids the column-vector warning
Hvoting_clf.fit(train_x, train_y.values.ravel())


VotingClassifier(estimators=[('dt',
                              DecisionTreeClassifier(class_weight=None,
                                                     criterion='gini',
                                                     max_depth=5,
                                                     max_features=None,
                                                     max_leaf_nodes=None,
                                                     min_impurity_decrease=0.0,
                                                     min_impurity_split=None,
                                                     min_samples_leaf=1,
                                                     min_samples_split=2,
                                                     min_weight_fraction_leaf=0.0,
                                                     presort=False,
                                                     random_state=None,
                                                     splitter='best')),
         

In [17]:
#Train accuracy
train_y_pred = Hvoting_clf.predict(train_x)

train_acc = accuracy_score(train_y, train_y_pred)

print('Train acc: {}' .format(train_acc))

Train acc: 0.7391304347826086


In [18]:
#Test accuracy
test_y_pred = Hvoting_clf.predict(test_x)

test_acc = accuracy_score(test_y, test_y_pred)

print('Test acc: {}' .format(test_acc))

Test acc: 0.6134453781512605


### Confusion Matrix

In [19]:
confusion_matrix(test_y, test_y_pred)

array([[19, 28],
       [18, 54]], dtype=int64)
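
Accuracy alone hides how the errors are distributed. As a worked example, the metrics below are derived directly from the matrix printed above, where rows are the true classes and columns are the predicted classes. The variable names are illustrative.

```python
# Confusion matrix copied from the output above.
# Rows: true class (0, 1). Columns: predicted class (0, 1).
cm = [[19, 28],
      [18, 54]]

tn, fp = cm[0]
fn, tp = cm[1]
total = tn + fp + fn + tp

accuracy = (tn + tp) / total      # share of all students classified correctly
recall = tp / (tp + fn)           # share of drinkers that were detected
precision = tp / (tp + fp)        # share of predicted drinkers that really drink
specificity = tn / (tn + fp)      # share of non-drinkers that were detected

print(f"accuracy={accuracy:.3f} recall={recall:.3f} "
      f"precision={precision:.3f} specificity={specificity:.3f}")
```

The model finds three quarters of the drinkers but recognizes well under half of the non-drinkers, so most of its errors are false positives.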

### Individual classifier accuracies

In [20]:
for clf in (dtree_clf, log_clf, sgd_clf, Hvoting_clf):
    clf.fit(train_x, train_y.values.ravel())
    test_y_pred = clf.predict(test_x)
    print(clf.__class__.__name__, 'Test acc=', accuracy_score(test_y, test_y_pred))

DecisionTreeClassifier Test acc= 0.5966386554621849
LogisticRegression Test acc= 0.5714285714285714
SGDClassifier Test acc= 0.5042016806722689
VotingClassifier Test acc= 0.5378151260504201


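The hard voting classifier reaches a test accuracy of 61.3%, below the 62.3% baseline, and overfits noticeably (73.9% train accuracy). When the models are refit in the loop above, none of the individual classifiers exceeds 60% test accuracy; the Decision Tree is the best at 59.7%, and the refit voting classifier scores only 53.8%, lower than simply choosing the Decision Tree. The voting classifier's score differs between the two runs (61.3% vs. 53.8%), most likely because the estimators are refit without a fixed `random_state`, so results vary from fit to fit.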
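
The voting classifier scored differently in cell 18 and in the refit loop. One way to make repeated fits comparable is to pass a fixed `random_state` to the estimators that use randomness. The sketch below uses synthetic data from `make_classification` as a stand-in for `train_x` and `train_y` (an assumption, not the notebook's data), and the helper `make_voter` is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed training data (13 features).
X, y = make_classification(n_samples=300, n_features=13, random_state=0)

def make_voter(seed):
    """Build a hard voting ensemble whose randomized members share a seed."""
    return VotingClassifier(
        estimators=[
            ("dt", DecisionTreeClassifier(max_depth=5, random_state=seed)),
            ("lr", LogisticRegression(C=11, max_iter=1000)),
            ("sgd", SGDClassifier(max_iter=8000, tol=1e-3, random_state=seed)),
        ],
        voting="hard",
    )

# Two independent fits with the same seed should give identical predictions.
pred_a = make_voter(42).fit(X, y).predict(X)
pred_b = make_voter(42).fit(X, y).predict(X)
same = bool(np.array_equal(pred_a, pred_b))
print(same)
```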

# Soft voting classifier (should include at least two models)

In [21]:
#The same DT and LogReg models from above are repeated, but using soft voting

Svoting_clf = VotingClassifier(
            estimators=[('dt', dtree_clf), 
                        ('lr', log_clf)],
            voting='soft')

# ravel() passes the labels as a 1-D array, which avoids the column-vector warning
Svoting_clf.fit(train_x, train_y.values.ravel())


VotingClassifier(estimators=[('dt',
                              DecisionTreeClassifier(class_weight=None,
                                                     criterion='gini',
                                                     max_depth=5,
                                                     max_features=None,
                                                     max_leaf_nodes=None,
                                                     min_impurity_decrease=0.0,
                                                     min_impurity_split=None,
                                                     min_samples_leaf=1,
                                                     min_samples_split=2,
                                                     min_weight_fraction_leaf=0.0,
                                                     presort=False,
                                                     random_state=None,
                                                     splitter='best')),
         

In [22]:
#Train accuracy
train_y_pred = Svoting_clf.predict(train_x)

train_acc = accuracy_score(train_y, train_y_pred)

print('Train acc: {}' .format(train_acc))

Train acc: 0.7898550724637681


In [23]:
#Test accuracy
test_y_pred = Svoting_clf.predict(test_x)

test_acc = accuracy_score(test_y, test_y_pred)

print('Test acc: {}' .format(test_acc))

Test acc: 0.5882352941176471


### Confusion Matrix

In [24]:
confusion_matrix(test_y, test_y_pred)

array([[19, 28],
       [21, 51]], dtype=int64)

### Individual classifier accuracies

In [25]:
for clf in (dtree_clf, log_clf, Svoting_clf):
    clf.fit(train_x, train_y.values.ravel())
    test_y_pred = clf.predict(test_x)
    print(clf.__class__.__name__, 'Test acc=', accuracy_score(test_y, test_y_pred))

DecisionTreeClassifier Test acc= 0.5798319327731093
LogisticRegression Test acc= 0.5714285714285714
VotingClassifier Test acc= 0.5966386554621849


The soft voting classifier reaches a test accuracy of 58.8%, again below the baseline, and overfits more than the hard voting classifier (79.0% train accuracy). In the refit loop, neither individual classifier exceeds 60% test accuracy: the Decision Tree scores 58.0% and Logistic Regression 57.1%, while the refit voting classifier edges ahead of both at 59.7%.
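
Hard and soft voting can disagree on the same inputs, because hard voting counts class votes while soft voting averages predicted probabilities. The toy example below uses made-up probabilities from three hypothetical classifiers for a single student; it is not derived from the models above.

```python
import numpy as np

# Hypothetical P(Alc = 1) from three classifiers for one student.
proba_class1 = np.array([0.45, 0.45, 0.90])

# Hard voting: each classifier votes for its most likely class,
# and the majority wins.
votes = (proba_class1 >= 0.5).astype(int)
hard_pred = int(votes.sum() > len(votes) / 2)

# Soft voting: average the probabilities, then pick the likelier class.
soft_pred = int(proba_class1.mean() >= 0.5)

print("hard:", hard_pred, "soft:", soft_pred)
```

Two lukewarm votes for class 0 outvote one confident vote for class 1 under hard voting, while the confident classifier tips the average under soft voting.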

# Bagging classifier

In [68]:
from sklearn.ensemble import BaggingClassifier 

#we will use the Decision Tree classifier because it performed best in the voting models
bag_clf = BaggingClassifier( 
            DecisionTreeClassifier(max_depth=4), n_estimators=30, 
            max_samples=250, bootstrap=True, n_jobs=-1) 

bag_clf.fit(train_x, train_y.values.ravel())


BaggingClassifier(base_estimator=DecisionTreeClassifier(class_weight=None,
                                                        criterion='gini',
                                                        max_depth=4,
                                                        max_features=None,
                                                        max_leaf_nodes=None,
                                                        min_impurity_decrease=0.0,
                                                        min_impurity_split=None,
                                                        min_samples_leaf=1,
                                                        min_samples_split=2,
                                                        min_weight_fraction_leaf=0.0,
                                                        presort=False,
                                                        random_state=None,
                                                        splitter='best'),
       

In [69]:
#Train accuracy
train_y_pred = bag_clf.predict(train_x)

train_acc = accuracy_score(train_y, train_y_pred)

print('Train acc: {}' .format(train_acc))

Train acc: 0.8043478260869565


In [70]:
#Test accuracy
test_y_pred = bag_clf.predict(test_x)

test_acc = accuracy_score(test_y, test_y_pred)

print('Test acc: {}' .format(test_acc))

Test acc: 0.5882352941176471


### Out of Bag evaluation

In [67]:
#Set the oob_score parameter to get an out-of-bag accuracy estimate:
bag_clf = BaggingClassifier( 
            DecisionTreeClassifier(max_depth=4), n_estimators=30,  
            max_samples=250, bootstrap=True, n_jobs=-1, oob_score=True) 

bag_clf.fit(train_x, train_y.values.ravel())
bag_clf.oob_score_


0.6557971014492754

The DT bagging classifier with 30 estimators and max_samples=250 achieved a test accuracy of 58.8%, no better than the voting classifiers. The out-of-bag estimate (65.6%) is noticeably more optimistic than the test accuracy.
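
The out-of-bag score is computed from the rows each tree did not see. As a rough back-of-the-envelope check, the expected share of such rows per tree follows from the sampling settings alone. The calculation below assumes rows are drawn uniformly with replacement, as `bootstrap=True` implies.

```python
n_train = 276      # training rows (see train_x.shape)
max_samples = 250  # rows drawn with replacement for each tree

# Probability that a given row is never picked in max_samples draws.
oob_fraction = (1 - 1 / n_train) ** max_samples
expected_oob_rows = oob_fraction * n_train

print(f"{oob_fraction:.3f} of the rows, about {expected_oob_rows:.0f} per tree")
```

Roughly 40% of the training rows are left out of each tree, so every row is scored by a sizeable subset of the 30 trees.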

### Random Patches

In [72]:
#Random Patches: sample both training rows (max_samples) and features (max_features)
RPbag_clf = BaggingClassifier( 
            DecisionTreeClassifier(max_depth=4), n_estimators=30, max_features=8, #8 of 13 features
            max_samples=250, bootstrap=True, n_jobs=-1) 

RPbag_clf.fit(train_x, train_y.values.ravel())


BaggingClassifier(base_estimator=DecisionTreeClassifier(class_weight=None,
                                                        criterion='gini',
                                                        max_depth=4,
                                                        max_features=None,
                                                        max_leaf_nodes=None,
                                                        min_impurity_decrease=0.0,
                                                        min_impurity_split=None,
                                                        min_samples_leaf=1,
                                                        min_samples_split=2,
                                                        min_weight_fraction_leaf=0.0,
                                                        presort=False,
                                                        random_state=None,
                                                        splitter='best'),
       

In [73]:
#Train accuracy
train_y_pred = RPbag_clf.predict(train_x)

train_acc = accuracy_score(train_y, train_y_pred)

print('Train acc: {}' .format(train_acc))

Train acc: 0.8043478260869565


In [74]:
#Test accuracy
test_y_pred = RPbag_clf.predict(test_x)

test_acc = accuracy_score(test_y, test_y_pred)

print('Test acc: {}' .format(test_acc))

Test acc: 0.5714285714285714


Reducing the max features from 13 to 8 results in a slightly lower test accuracy.

### Random Subspaces

In [75]:
#Random Subspaces: sample features (max_features) while drawing as many rows as the training set
RSbag_clf = BaggingClassifier( 
            DecisionTreeClassifier(max_depth=4), n_estimators=30, max_features=8,
            max_samples=len(train_x), bootstrap=True, n_jobs=-1) #with bootstrap=True the rows are still drawn with replacement

RSbag_clf.fit(train_x, train_y.values.ravel())


BaggingClassifier(base_estimator=DecisionTreeClassifier(class_weight=None,
                                                        criterion='gini',
                                                        max_depth=4,
                                                        max_features=None,
                                                        max_leaf_nodes=None,
                                                        min_impurity_decrease=0.0,
                                                        min_impurity_split=None,
                                                        min_samples_leaf=1,
                                                        min_samples_split=2,
                                                        min_weight_fraction_leaf=0.0,
                                                        presort=False,
                                                        random_state=None,
                                                        splitter='best'),
       

In [76]:
#Train accuracy
train_y_pred = RSbag_clf.predict(train_x)

train_acc = accuracy_score(train_y, train_y_pred)

print('Train acc: {}' .format(train_acc))

Train acc: 0.7862318840579711


In [77]:
#Test accuracy
test_y_pred = RSbag_clf.predict(test_x)

test_acc = accuracy_score(test_y, test_y_pred)

print('Test acc: {}' .format(test_acc))

Test acc: 0.5630252100840336


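Drawing as many rows as the training set for each DT resulted in an even lower test accuracy. Because bootstrap=True, each tree still sees a bootstrap sample rather than every row, so this setup is closer to Random Patches with larger samples; a strict Random Subspaces setup would use bootstrap=False with max_samples=1.0.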
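
For comparison, a strict Random Subspaces configuration keeps every training row for every tree and samples only the features. The sketch below is a minimal illustration on synthetic data from `make_classification` (a stand-in for `train_x` and `train_y`, not the notebook's data); `subspace_clf` is a hypothetical name.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with the same shape as the training data.
X, y = make_classification(n_samples=276, n_features=13, random_state=0)

subspace_clf = BaggingClassifier(
    DecisionTreeClassifier(max_depth=4),
    n_estimators=30,
    max_samples=1.0, bootstrap=False,          # every tree sees each row once
    max_features=8, bootstrap_features=False,  # 8 distinct features per tree
    random_state=0,
)
subspace_clf.fit(X, y)

# Count the distinct rows and features actually used by each tree.
rows_per_tree = {len(set(idx)) for idx in subspace_clf.estimators_samples_}
feats_per_tree = {len(set(f)) for f in subspace_clf.estimators_features_}
print(rows_per_tree, feats_per_tree)
```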

# Random forest classifier

In [95]:
from sklearn.ensemble import RandomForestClassifier 

#similar to the DT bagging but with different parameters

rnd_clf = RandomForestClassifier(n_estimators=50, n_jobs=-1) 

rnd_clf.fit(train_x, train_y.values.ravel())


RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=50, n_jobs=-1,
                       oob_score=False, random_state=None, verbose=0,
                       warm_start=False)

In [96]:
#Train accuracy
train_y_pred = rnd_clf.predict(train_x)

train_acc = accuracy_score(train_y, train_y_pred)

print('Train acc: {}' .format(train_acc))

Train acc: 1.0


In [97]:
#Test accuracy
test_y_pred = rnd_clf.predict(test_x)

test_acc = accuracy_score(test_y, test_y_pred)

print('Test acc: {}' .format(test_acc))

Test acc: 0.5882352941176471


### Feature Importance

In [90]:
rnd_clf.feature_importances_

array([0.09433709, 0.08028426, 0.07205131, 0.0491125 , 0.08574393,
       0.03863618, 0.06988233, 0.09033418, 0.13373167, 0.08957674,
       0.14028151, 0.02852916, 0.02749914])

In [94]:
#List the raw column names (note: train_x is ordered as the numeric columns first, then the one-hot sex columns)
alcohol.columns

Index(['sex', 'age', 'Medu', 'Fedu', 'traveltime', 'studytime', 'failures',
       'famrel', 'freetime', 'goout', 'health', 'absences', 'Alc'],
      dtype='object')

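The importances follow the column order of train_x (the 11 numeric columns, then the two one-hot encoded sex columns), not the order of alcohol.columns. Read in that order, "absences" and "goout" have the highest importance at 14.0% and 13.4% respectively.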
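
Pairing each importance with its feature name avoids counting positions by hand. The importances follow the column order of `train_x`, which the `ColumnTransformer` builds as the numeric columns followed by the one-hot encoded categorical columns. The sketch below copies the values from the outputs above; the labels for the two one-hot columns are placeholders, since the encoder's category names are not shown in the notebook.

```python
# Importances and numeric column names copied from the outputs above.
importances = [0.09433709, 0.08028426, 0.07205131, 0.0491125, 0.08574393,
               0.03863618, 0.06988233, 0.09033418, 0.13373167, 0.08957674,
               0.14028151, 0.02852916, 0.02749914]
numeric_columns = ['age', 'Medu', 'Fedu', 'traveltime', 'studytime', 'failures',
                   'famrel', 'freetime', 'goout', 'health', 'absences']
# Placeholder labels for the two one-hot columns created from 'sex'.
onehot_columns = ['sex (one-hot 1)', 'sex (one-hot 2)']

# Same order as the preprocessor output: numeric first, then categorical.
feature_names = numeric_columns + onehot_columns

ranked = sorted(zip(feature_names, importances),
                key=lambda pair: pair[1], reverse=True)
for name, score in ranked[:3]:
    print(f"{name}: {score:.3f}")
```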

The random forest classifier has a 58.8% test accuracy but a 100% train accuracy, indicating severe overfitting: the fully grown trees (max_depth=None) memorize the training set.

# AdaBoost Classifier

In [128]:
from sklearn.ensemble import AdaBoostClassifier 

#Create Adaptive Boosting with Decision Stumps (depth=1)
ada_clf = AdaBoostClassifier( 
            DecisionTreeClassifier(max_depth=1), n_estimators=1000, 
            algorithm="SAMME.R", learning_rate=0.3) 

ada_clf.fit(train_x, train_y.values.ravel())


AdaBoostClassifier(algorithm='SAMME.R',
                   base_estimator=DecisionTreeClassifier(class_weight=None,
                                                         criterion='gini',
                                                         max_depth=1,
                                                         max_features=None,
                                                         max_leaf_nodes=None,
                                                         min_impurity_decrease=0.0,
                                                         min_impurity_split=None,
                                                         min_samples_leaf=1,
                                                         min_samples_split=2,
                                                         min_weight_fraction_leaf=0.0,
                                                         presort=False,
                                                         random_state=None,
                             

In [129]:
#Train accuracy
train_y_pred = ada_clf.predict(train_x)

train_acc = accuracy_score(train_y, train_y_pred)

print('Train acc: {}' .format(train_acc))

Train acc: 0.75


In [130]:
#Test accuracy
test_y_pred = ada_clf.predict(test_x)

test_acc = accuracy_score(test_y, test_y_pred)

print('Test acc: {}' .format(test_acc))

Test acc: 0.6386554621848739


An AdaBoost model with decision stumps, 1000 estimators, and a learning rate of 0.3 performs better than any previous model (and the baseline) with 63.9% test accuracy, and it overfits less than any previous model (75.0% train accuracy).
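
Whether 1000 estimators are needed can be checked by scoring the ensemble after each boosting round with `staged_predict`. The sketch below is an illustration on synthetic data with a separate validation split (both are assumptions, not the notebook's data), and it uses 200 rounds to keep the run short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, split into training and validation parts.
X, y = make_classification(n_samples=400, n_features=13, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

ada = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),  # decision stumps
    n_estimators=200, learning_rate=0.3, random_state=0,
)
ada.fit(X_tr, y_tr)

# staged_predict yields the ensemble's predictions after each boosting round.
staged_acc = [accuracy_score(y_val, pred) for pred in ada.staged_predict(X_val)]
best_n = staged_acc.index(max(staged_acc)) + 1
print("best number of rounds on the validation split:", best_n)
```

Picking the round count on a validation split, rather than on the test set, keeps the final test accuracy an honest estimate.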

# Stochastic Gradient Boosting Classifier

In [255]:
from sklearn.ensemble import GradientBoostingClassifier

#Stochastic gradient boosting: each tree is fit on a random 75% of the training rows
gbclf = GradientBoostingClassifier(max_depth=2, n_estimators=60, 
                                   learning_rate=0.1, subsample=0.75,
                                   tol=0.2, n_iter_no_change=4, validation_fraction=0.15) 

gbclf.fit(train_x, train_y.values.ravel())


GradientBoostingClassifier(criterion='friedman_mse', init=None,
                           learning_rate=0.1, loss='deviance', max_depth=2,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=60,
                           n_iter_no_change=4, presort='auto',
                           random_state=None, subsample=0.75, tol=0.2,
                           validation_fraction=0.15, verbose=0,
                           warm_start=False)

In [256]:
#Train accuracy
train_y_pred = gbclf.predict(train_x)

train_acc = accuracy_score(train_y, train_y_pred)

print('Train acc: {}' .format(train_acc))

Train acc: 0.6594202898550725


In [257]:
#Test accuracy
test_y_pred = gbclf.predict(test_x)

test_acc = accuracy_score(test_y, test_y_pred)

print('Test acc: {}' .format(test_acc))

Test acc: 0.6470588235294118


The stochastic gradient boosting model reaches 64.7% test accuracy, slightly ahead of the AdaBoost model, and shows far less overfitting: train accuracy (65.9%) is only about 1 percentage point above test accuracy.
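
Because `n_iter_no_change` is set, training may stop well before the 60 requested stages, and a tolerance as large as `tol=0.2` makes an early stop more likely. That could be one reason the train and test accuracies are so close. The fitted attribute `n_estimators_` reports how many stages were actually built. The sketch below uses synthetic stand-in data, so the count it prints says nothing about the notebook's own model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in with the same shape as the training data.
X, y = make_classification(n_samples=276, n_features=13, random_state=0)

gb = GradientBoostingClassifier(
    max_depth=2, n_estimators=60, learning_rate=0.1, subsample=0.75,
    tol=0.2, n_iter_no_change=4, validation_fraction=0.15, random_state=0,
)
gb.fit(X, y)

# n_estimators_ is the number of stages kept after early stopping.
stages_fit = gb.n_estimators_
print(stages_fit, "of", gb.n_estimators, "stages were fit")
```

Running `gbclf.n_estimators_` on the model above would show whether early stopping was triggered there.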

None of the models above achieved a major improvement in accuracy over the 62.3% baseline. The stochastic gradient boosting model had the highest test accuracy (64.7%) with little to no overfitting, followed by the AdaBoost model (63.9%) with considerably more overfitting. These results suggest that boosting, a sequential ensemble approach, may suit this dataset better than voting or bagging, although the margins are small: with 119 test rows, each prediction is worth about 0.8 percentage points, so the gap between the two boosting models is a single student.
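
A rough uncertainty estimate helps put these differences in context. The calculation below applies the normal approximation for a binomial proportion to the best test accuracy; it is a back-of-the-envelope sketch, not a formal significance test, and the variable names are illustrative.

```python
import math

n_test = 119            # test rows (see test_x.shape)
acc = 77 / n_test       # best test accuracy above (0.647...)
baseline = 0.623188     # majority-class rate on the training set

# Standard error of an accuracy estimated from n_test independent predictions.
se = math.sqrt(acc * (1 - acc) / n_test)
lower, upper = acc - 1.96 * se, acc + 1.96 * se

print(f"standard error: {se:.3f}")
print(f"approximate 95% interval: {lower:.3f} to {upper:.3f}")
print("baseline inside interval:", lower < baseline < upper)
```

The interval spans several percentage points on either side of the estimate, so the baseline falls well within it; repeated cross-validation would give a more reliable ranking of the models.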