# Data Science

## Notebook 5 (Ensemble methods)

### Ensemble methods on the bank data set

We repeat the steps from Notebook02 and Homework02: importing and preprocessing the data.

In [1]:
import pandas as pd
bank_data = pd.read_csv("../Data/bank.csv", delimiter = " ",
                        names = ['age', 'sex', 'region', 'income', 'married', 'children',
                                 'car','save_acct', 'current_acct', 'mortgage', 'pep'])

In [2]:
numeric_data = bank_data.replace(['NO', 'YES', 'MALE', 'FEMALE'],[0,1,0,1])

In [3]:
import numpy as np
numeric_data = numeric_data.values

In [4]:
features = np.zeros((len(numeric_data),4))
numeric_data = np.append(numeric_data,features,1)
j = 2  # index of the 'region' column
for i in range(len(numeric_data)):
    # one-hot encode the region into the four new columns 11..14
    if numeric_data[i][j] == 'INNER_CITY':
        numeric_data[i][11:15] = [1,0,0,0]
    elif numeric_data[i][j] == 'TOWN':
        numeric_data[i][11:15] = [0,1,0,0]
    elif numeric_data[i][j] == 'RURAL':
        numeric_data[i][11:15] = [0,0,1,0]
    elif numeric_data[i][j] == 'SUBURBAN':
        numeric_data[i][11:15] = [0,0,0,1]
# remove the original 'region' column, which is now redundant
numeric_data = np.delete(numeric_data, 2, 1)
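
The loop above is a hand-written one-hot encoding of the region column. As a minimal sketch of the same idea with pandas, assuming a small made-up table (the `toy` frame and its values are illustrative and not taken from bank.csv):

```python
import pandas as pd

# Toy stand-in for the bank data; the values are made up for illustration.
toy = pd.DataFrame({
    "age": [48, 40, 51, 23],
    "region": ["INNER_CITY", "TOWN", "RURAL", "SUBURBAN"],
})

# Fix the category order so the indicator columns come out in the same
# order as in the loop above: INNER_CITY, TOWN, RURAL, SUBURBAN.
regions = ["INNER_CITY", "TOWN", "RURAL", "SUBURBAN"]
toy["region"] = pd.Categorical(toy["region"], categories=regions)

# get_dummies replaces the region column with one 0/1 column per category.
encoded = pd.get_dummies(toy, columns=["region"], dtype=int)
print(encoded)
```

Each row ends up with exactly one 1 among the four region columns, which is what the loop writes into columns 11 to 14.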

In [5]:
bank_labels = numeric_data[:, 9].astype(int)

In [6]:
bank_attrs  = np.delete(numeric_data, 9, 1)

In [7]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
from sklearn.model_selection import train_test_split

In [8]:
bank_features_train, bank_features_test, bank_labels_train, bank_labels_test = train_test_split(bank_attrs, bank_labels, test_size=0.33, random_state=42)

##### kNN

In [9]:
neigh = KNeighborsClassifier(n_neighbors=11,metric="euclidean")
neigh.fit(bank_features_train,bank_labels_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='euclidean',
                     metric_params=None, n_jobs=None, n_neighbors=11, p=2,
                     weights='uniform')

In [10]:
predictions_test = neigh.predict(bank_features_test)

In [11]:
knn_prob = neigh.predict_proba(bank_features_test)

In [12]:
metrics.roc_auc_score(bank_labels_test,knn_prob[:,1], average='macro', sample_weight=None)

0.5680102040816326

As we have already seen, kNN performs quite poorly here: an AUC of about 0.57 is only slightly better than random guessing (0.5). In HW2 you did the same with a decision tree classifier!

##### Decision tree

In [13]:
from sklearn.tree import DecisionTreeClassifier
dectree = DecisionTreeClassifier( max_depth = 5 )
dectree = dectree.fit(bank_features_train,bank_labels_train)
predictions = dectree.predict(bank_features_test)
dectree_prob = dectree.predict_proba(bank_features_test)

In [14]:
metrics.roc_auc_score(bank_labels_test, dectree_prob[:,1], average='macro', sample_weight=None)

0.8466836734693878

Much better (an AUC of about 0.85), but still not great!

##### Let us combine them!

We have a poorly performing kNN classifier and a fairly well performing decision tree classifier. Let us combine them!

In [17]:
metrics.roc_auc_score(bank_labels_test,0.9*dectree_prob[:,1]+0.2*knn_prob[:,1], average='macro', sample_weight=None)

0.8506632653061226

Using a very simple ensemble method (a linear combination of the two models' confidence scores), we obtained a slightly better classifier: the AUC rises from about 0.847 for the decision tree alone to about 0.851.
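
The blending step can be sketched in isolation as follows. This is a minimal sketch on synthetic data from `make_classification`, not on the bank data, so the numbers it produces will differ from the ones above; the names `p_knn`, `p_tree` and `combined` are illustrative. It also shows that the weights 0.9 and 0.2 do not need to sum to 1 for this purpose: the AUC depends only on the ordering of the scores, so dividing the combined score by 1.1 gives the same ranking.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the bank data set).
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

knn = KNeighborsClassifier(n_neighbors=11).fit(X_train, y_train)
tree = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_train, y_train)

# Probability of the positive class from each model.
p_knn = knn.predict_proba(X_test)[:, 1]
p_tree = tree.predict_proba(X_test)[:, 1]

# Linear combination of the two scores, with the weights used above.
combined = 0.9 * p_tree + 0.2 * p_knn

auc_combined = roc_auc_score(y_test, combined)
# Rescaling by a positive constant keeps the ordering of the scores.
auc_rescaled = roc_auc_score(y_test, combined / 1.1)

print("kNN     :", roc_auc_score(y_test, p_knn))
print("tree    :", roc_auc_score(y_test, p_tree))
print("combined:", auc_combined)
print("rescaled:", auc_rescaled)
```

Whether the blend beats the better of the two models depends on the data and on the weights, so the weights are best chosen on a validation set rather than on the test set.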

##### The prediction

In [19]:
metrics.confusion_matrix(bank_labels_test,0.9*dectree_prob[:,1]+0.2*knn_prob[:,1]>0.65)

array([[89,  9],
       [30, 70]])
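
To read this matrix: in scikit-learn's convention the rows are the true classes and the columns are the predicted classes, so the entries are true negatives and false positives in the first row, and false negatives and true positives in the second. A short sketch that derives the usual summary numbers from the matrix reported above:

```python
import numpy as np

# Confusion matrix from the cell above (rows: true class, columns: predicted).
cm = np.array([[89, 9],
               [30, 70]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all test cases classified correctly
precision = tp / (tp + fp)        # share of predicted positives that are correct
recall = tp / (tp + fn)           # share of true positives that are found

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
# prints: accuracy=0.803 precision=0.886 recall=0.700
```

With the threshold of 0.65 the combined classifier is therefore fairly precise but misses 30 of the 100 positive cases; a lower threshold would trade precision for recall.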

##### Bagging

In [20]:
from sklearn import ensemble
# Note: in scikit-learn 1.2 and later, base_estimator is named estimator.
bagging_dt = ensemble.BaggingClassifier(
    base_estimator=DecisionTreeClassifier(random_state=0, max_depth=10, min_samples_leaf=5),
    n_estimators=10)
bagging_dt.fit(bank_features_train,bank_labels_train)
bagging_dt_prob_test = bagging_dt.predict_proba(bank_features_test)
metrics.roc_auc_score(bank_labels_test,bagging_dt_prob_test[:,1], average='macro', sample_weight=None)

0.8975510204081633

Even better: bagging ten decision trees raises the AUC to about 0.90! :)
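
To make clear what bagging does, here is a minimal hand-written sketch of the idea: draw bootstrap samples of the training set, fit one tree per sample, and average the predicted probabilities. It runs on synthetic data from `make_classification` (an assumption, not the bank data), and it is a simplified illustration rather than a reimplementation of `BaggingClassifier`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the bank data set).
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

rng = np.random.default_rng(0)
n_estimators = 10
probs = []
for _ in range(n_estimators):
    # Bootstrap sample: draw training rows with replacement.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeClassifier(max_depth=10, min_samples_leaf=5, random_state=0)
    tree.fit(X_train[idx], y_train[idx])
    # Keep the predicted probability of the positive class.
    probs.append(tree.predict_proba(X_test)[:, 1])

# The bagged score is the average over all trees.
bagged_prob = np.mean(probs, axis=0)
auc = roc_auc_score(y_test, bagged_prob)
print("bagged AUC:", auc)
```

Because each tree sees a slightly different sample, their individual errors partly cancel when averaged, which is why the bagged score tends to be more stable than that of a single deep tree.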