# Applying different tree-based models

In this case study, we are going to predict whether someone responds to a marketing campaign by subscribing to a term deposit. To this end, we will train various tree-based models. First, we load the dataset and the libraries needed for data manipulation:

In [1]:
import pandas as pd
import numpy as np

np.random.seed(10)
df = pd.read_csv('bank-full.csv')
df.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,subscribed
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,-1,0,unknown,no
1,44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,-1,0,unknown,no
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown,no
3,47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown,no
4,33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown,no


We can see a variety of variables, including whether someone subscribed (yes/no) and mostly demographic and financial information. There are also variables describing previous interactions: pdays (days since the last contact, with -1 for no previous contact), previous (number of contacts before this campaign), and poutcome (the outcome of the previous campaign).

## Data pre-processing

We are going to separate the dependent variable, subscribed, first:

In [2]:
y = df['subscribed']
X = df.drop(['subscribed'],axis=1)

In [3]:
y.value_counts()

no     39922
yes     5289
Name: subscribed, dtype: int64
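
To put these counts in perspective, the following minimal sketch computes the share of subscribers and the accuracy of a trivial model that always predicts the majority class. The counts are copied from the output above; the variable names are only illustrative.

```python
# Class counts copied from the value_counts() output above
counts = {"no": 39922, "yes": 5289}

total = sum(counts.values())
positive_rate = counts["yes"] / total
# Accuracy of a trivial model that always predicts the majority class
baseline_accuracy = max(counts.values()) / total

print(f"Share of subscribers: {positive_rate:.3f}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

Any accuracy reported later is best read against this baseline rather than against 50%.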

Only about 12% of the people (5,289 of 45,211) subscribed, so the classes are imbalanced. Now, let's look at the independent variables. Given that we need numeric input data for the decision tree implementations, we might have to convert some of them:

In [4]:
X.dtypes

age           int64
job          object
marital      object
education    object
default      object
balance       int64
housing      object
loan         object
contact      object
day           int64
month        object
duration      int64
campaign      int64
pdays         int64
previous      int64
poutcome     object
dtype: object

Nine of the columns are categorical (dtype object) and need to be converted into dummy variables:

In [5]:
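for column in X.columns:
    # np.object was removed in NumPy 1.24; the built-in object works in all versions
    if X[column].dtype == object:
        print('Converting ', column)
        # One-hot encode the column, dropping the first level to avoid a redundant dummy
        dummies = pd.get_dummies(X[column], prefix=column, drop_first=True)
        X = pd.concat([X, dummies], axis=1).drop([column], axis=1)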

Converting  job
Converting  marital
Converting  education
Converting  default
Converting  housing
Converting  loan
Converting  contact
Converting  month
Converting  poutcome
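
The effect of `drop_first=True` is easier to see on a small example. The sketch below uses a hypothetical toy frame (the values are made up, not taken from bank-full.csv) and the `columns` argument of `pd.get_dummies`, which encodes the listed columns in a single call instead of a loop.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from bank-full.csv)
toy = pd.DataFrame({
    "age": [58, 44, 33],
    "marital": ["married", "single", "divorced"],
})

# drop_first=True drops the first (alphabetically smallest) level, here
# "divorced", which becomes the implicit reference category: a row with
# all marital dummies equal to 0 is a divorced person
encoded = pd.get_dummies(toy, columns=["marital"], drop_first=True)

# Cast to int so the output looks the same across pandas versions
print(encoded.astype(int))
```

With three levels, two dummy columns carry all the information, which is why the encoded data has one column fewer per variable than it has categories.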


Finally, our data looks like this:

In [6]:
X.head()

Unnamed: 0,age,balance,day,duration,campaign,pdays,previous,job_blue-collar,job_entrepreneur,job_housemaid,...,month_jul,month_jun,month_mar,month_may,month_nov,month_oct,month_sep,poutcome_other,poutcome_success,poutcome_unknown
0,58,2143,5,261,1,-1,0,0,0,0,...,0,0,0,1,0,0,0,0,0,1
1,44,29,5,151,1,-1,0,0,0,0,...,0,0,0,1,0,0,0,0,0,1
2,33,2,5,76,1,-1,0,0,1,0,...,0,0,0,1,0,0,0,0,0,1
3,47,1506,5,92,1,-1,0,1,0,0,...,0,0,0,1,0,0,0,0,0,1
4,33,1,5,198,1,-1,0,0,0,0,...,0,0,0,1,0,0,0,0,0,1


We have a lot more columns now. We also convert the dependent variable to a 0/1 indicator so that the AUC can be calculated:

In [7]:
y = pd.get_dummies(y, prefix='subscribed', drop_first=True)
y.head()

Unnamed: 0,subscribed_yes
0,0
1,0
2,0
3,0
4,0


## Modelling

Now it's up to you to create the various models. For each of decision trees, random forests, and AdaBoost, create a cross-validated (10-fold) grid search that tests at least two parameters.

In [8]:
from sklearn.tree import DecisionTreeClassifier as DT
from sklearn.ensemble import RandomForestClassifier as RF
from sklearn.ensemble import AdaBoostClassifier as AdaBoost

from sklearn.model_selection import cross_validate, cross_val_predict
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, precision_score, f1_score, roc_auc_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)

# The classifier input below is a pair of the name and the classifier instance
classifiers = [('DecisionTree',DT()),('RandomForest',RF()),('AdaBoost',AdaBoost())]
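
The split above does not stratify and has no `random_state` of its own, so it draws from the global NumPy seed set in the first cell. With an imbalanced target, one option is to pass `stratify` and an explicit `random_state`. The sketch below illustrates this on synthetic labels with 12% positives; the data and names (`X_toy`, `y_toy`) are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with 12% positives (illustrative, not the bank data)
y_toy = np.array([1] * 120 + [0] * 880)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# stratify keeps the class proportions (almost) identical in both parts;
# random_state makes the split reproducible on its own
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=10
)

print("Positive share (train):", y_tr.mean())
print("Positive share (test): ", y_te.mean())
```

Applying this to the case study would change the split and therefore the numbers reported below.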

The function should return the grid search instance, the best estimator (classifier), and the accuracy and AUC of that estimator on the training set. Be careful with the parameter grid: AdaBoost might take a while to run.

In [9]:
def create_tree_classifier(classifier,X_train,y_train):

    grid_search = None
    best_classifier = None
    accuracy = 0
    auc = 0
    
    print('Treating ', classifier[0])
    
    ### BEGIN SOLUTION    

    if classifier[0] == 'DecisionTree':
        parameters = {'min_samples_leaf':[1,5,10],'max_depth':[None,10,20]}
    elif classifier[0] == 'RandomForest':
        parameters = {'min_samples_leaf':[5,10],'n_estimators':[10,30,50]}
    elif classifier[0] == 'AdaBoost':
        parameters = {'learning_rate':[0.5,1],'n_estimators':[10,30,50]}
    
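    # 10-fold cross-validated grid search; without a scoring argument,
    # candidates are ranked by the classifier's default score (accuracy)
    grid_search = GridSearchCV(classifier[1], parameters, cv=10)
    # Flatten the one-column DataFrame y_train into a 1-D array
    grid_search.fit(X_train, y_train.values.reshape(-1,))
    
    # With the default refit=True, predict uses the best estimator refit on all of X_train
    pred = grid_search.predict(X_train)
    
    accuracy = accuracy_score(y_train, pred)
    # Note: the AUC is computed from hard 0/1 predictions, not from predicted probabilities
    auc = roc_auc_score(y_train, pred)
    best_classifier = grid_search.best_estimator_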
        
    ### END SOLUTION
    
    return grid_search, best_classifier, accuracy, auc
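
The grid search above selects parameters by cross-validated accuracy. On imbalanced data it can be worth selecting by AUC instead, which `GridSearchCV` supports through its `scoring` argument. The following is a minimal sketch on synthetic data, not the solution to the exercise: `X_demo`, `y_demo`, the grid, and `cv=3` are assumptions chosen to keep it fast.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data (about 90% negatives), standing in for X_train / y_train
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.9], random_state=0
)

param_grid = {"min_samples_leaf": [5, 10], "max_depth": [3, 5]}

# scoring="roc_auc" ranks candidates by cross-validated AUC instead of accuracy;
# cv=3 keeps this sketch fast (the exercise asks for cv=10)
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0), param_grid, cv=3, scoring="roc_auc"
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best cross-validated AUC:", round(search.best_score_, 3))
```

`best_score_` is the mean score over the validation folds for the best candidate, so it is a less optimistic estimate than a score computed on the training set itself.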

Decision tree:

In [12]:
gsr, bc, acc, auc = create_tree_classifier(('DecisionTree',DT()),X_train,y_train) 
print('Best classifier:',bc)
pred = bc.predict(X_test)
acc_test = accuracy_score(y_test, pred)
auc_test = roc_auc_score(y_test, pred)
print("Accuracy (test set):",acc_test)
print("AUC (test set):",auc_test)

Treating  DecisionTree
Best classifier: DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=10,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')
Accuracy (test set): 0.9025361250368623
AUC (test set): 0.6933143398347895


Random forests:

In [13]:
gsr, bc, acc, auc = create_tree_classifier(('RandomForest',RF()),X_train,y_train) 
print('Best classifier:',bc)
pred = bc.predict(X_test)
acc_test = accuracy_score(y_test, pred)
auc_test = roc_auc_score(y_test, pred)
print("Accuracy (test set):",acc_test)
print("AUC (test set):",auc_test)

Treating  RandomForest
Best classifier: RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=5, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=30,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)
Accuracy (test set): 0.9071070480684164
AUC (test set): 0.6593109093720149


AdaBoost:

In [14]:
gsr, bc, acc, auc = create_tree_classifier(('AdaBoost',AdaBoost()),X_train,y_train) 
print('Best classifier:',bc)
pred = bc.predict(X_test)
acc_test = accuracy_score(y_test, pred)
auc_test = roc_auc_score(y_test, pred)
print("Accuracy (test set):",acc_test)
print("AUC (test set):",auc_test)

Treating  AdaBoost
Best classifier: AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None, learning_rate=1,
                   n_estimators=50, random_state=None)
Accuracy (test set): 0.9024624004718372
AUC (test set): 0.6770745457395393
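
The AUC values reported above are computed from the hard 0/1 output of `predict`, which reduces the ROC curve to a single operating point. AUC is more commonly computed from a continuous score such as the predicted probability of the positive class. The sketch below contrasts the two on synthetic data; the data set and the names `auc_from_labels` and `auc_from_scores` are illustrative assumptions, and the resulting numbers are not comparable to those above.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data (illustrative, not the bank data)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.9], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

clf = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_tr, y_tr)

# AUC from hard class labels: only one threshold is evaluated
auc_from_labels = roc_auc_score(y_te, clf.predict(X_te))
# AUC from the predicted probability of class 1: all thresholds are evaluated
auc_from_scores = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

print("AUC from predict():      ", round(auc_from_labels, 3))
print("AUC from predict_proba():", round(auc_from_scores, 3))
```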


## Feature importance

Now, create a function that returns the five features with the highest variable importance for a given classifier fitted on the test set. Add the feature names as strings to the set provided:

In [15]:
def return_most_important_features(estimator,X_test,y_test):

    important_features = set()
    
    estimator.fit(X_test,y_test.values.reshape(-1,))
    
    ### BEGIN SOLUTION 
    # Indices of the five largest importances (argsort sorts in ascending order)
    top_indices = np.argsort(estimator.feature_importances_)[-5:]
    important_features.update(X_test.columns[top_indices])
    ### END SOLUTION
    
    return important_features

Example input:

In [16]:
# Same settings as the best decision tree found above; all other parameters
# keep their defaults. min_impurity_split and presort are omitted because
# recent scikit-learn releases no longer accept them.
classifier = DT(max_depth=10)

return_most_important_features(classifier,X_test,y_test)

{'age', 'balance', 'day', 'duration', 'poutcome_success'}
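
A set hides both the ranking and the size of the importances. As a complement, the sketch below pairs each importance with its feature name in a pandas Series and lists the five largest values in descending order. It runs on synthetic data with made-up column names (`feature_0`, ...), so the output says nothing about the bank data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with hypothetical column names (illustrative only)
X_arr, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=3, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(8)])

tree = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_demo, y_demo)

# Pair each importance with its column name, then keep the five largest
importances = pd.Series(tree.feature_importances_, index=X_demo.columns)
top5 = importances.nlargest(5)
print(top5)
```

The importances of a fitted tree sum to 1, so each value can be read as a share of the total impurity reduction.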