# Decision Tree, Random Forest and AdaBoost

This exercise uses the breast-cancer dataset to show the differences between these methods. It also gives some background on each algorithm, to help understand and improve them.

We first prepare the data, then start with the decision tree.

### Data pre-processing

In [28]:
from sklearn import datasets, tree, ensemble
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split, GridSearchCV
import numpy as np
import pandas as pd

In [29]:
df = pd.read_csv("breast-cancer.data", sep=",", header=None)
df.head()

|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | no-recurrence-events | 30-39 | premeno | 30-34 | 0-2 | no | 3 | left | left_low | no |
| 1 | no-recurrence-events | 40-49 | premeno | 20-24 | 0-2 | no | 2 | right | right_up | no |
| 2 | no-recurrence-events | 40-49 | premeno | 20-24 | 0-2 | no | 2 | left | left_low | no |
| 3 | no-recurrence-events | 60-69 | ge40 | 15-19 | 0-2 | no | 2 | right | left_up | no |
| 4 | no-recurrence-events | 40-49 | premeno | 0-4 | 0-2 | no | 2 | right | right_low | no |


Attribute Information:
   1. Class: no-recurrence-events, recurrence-events
   2. age: 10-19, 20-29, 30-39, 40-49, 50-59, 60-69, 70-79, 80-89, 90-99.
   3. menopause: lt40, ge40, premeno.
   4. tumor-size: 0-4, 5-9, 10-14, 15-19, 20-24, 25-29, 30-34, 35-39, 40-44,
                  45-49, 50-54, 55-59.
   5. inv-nodes: 0-2, 3-5, 6-8, 9-11, 12-14, 15-17, 18-20, 21-23, 24-26,
                 27-29, 30-32, 33-35, 36-39.
   6. node-caps: yes, no.
   7. deg-malig: 1, 2, 3.
   8. breast: left, right.
   9. breast-quad: left-up, left-low, right-up, right-low, central.
  10. irradiat: yes, no.
  
As we can see, most of the attributes are categorical, so we need to convert them to numbers before the model can use them. First, we remove the rows with missing values, which are marked with '?' in this dataset.

In [30]:
df.count()

0    286
1    286
2    286
3    286
4    286
5    286
6    286
7    286
8    286
9    286
dtype: int64

In [31]:
# Keep only the rows without a missing value ('?') in columns 8 and 5.
# .copy() leaves the original df untouched.
df_nnv = df[(df[8] != '?') & (df[5] != '?')].copy()
df_nnv.count()

0    277
1    277
2    277
3    277
4    277
5    277
6    277
7    277
8    277
9    277
dtype: int64
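
The same filtering can also be done while loading the file. The sketch below is a minimal example on a few inline rows (not the real file), assuming that '?' is the only missing-value marker: `na_values` turns it into NaN and `dropna` removes the affected rows. The names `sample`, `toy_df` and `toy_clean` are illustrative.

```python
import io
import pandas as pd

# Small inline sample standing in for breast-cancer.data (illustrative rows only)
sample = io.StringIO(
    "no-recurrence-events,30-39,premeno,30-34,0-2,no,3,left,left_low,no\n"
    "recurrence-events,50-59,ge40,25-29,0-2,?,2,right,left_up,yes\n"
    "no-recurrence-events,40-49,premeno,20-24,0-2,no,2,left,?,no\n"
    "recurrence-events,60-69,ge40,15-19,3-5,yes,2,right,central,yes\n"
)

# Treat '?' as missing while parsing, then drop every row with a missing value
toy_df = pd.read_csv(sample, sep=",", header=None, na_values="?")
toy_clean = toy_df.dropna().copy()

print(len(toy_df), "rows read,", len(toy_clean), "rows kept")
```

Unlike the two explicit filters on columns 5 and 8, this version would also catch a '?' appearing in any other column.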

In [32]:
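# For each column, map every category to an integer code.
# Codes follow the frequency rank (0 = most frequent value), not any natural order.
association = []
for column in df_nnv.columns:
    categories = df_nnv[column].value_counts().index
    association.append({category: code for code, category in enumerate(categories)})
print(association)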

[{'no-recurrence-events': 0, 'recurrence-events': 1}, {'50-59': 0, '40-49': 1, '60-69': 2, '30-39': 3, '70-79': 4, '20-29': 5}, {'premeno': 0, 'ge40': 1, 'lt40': 2}, {'30-34': 0, '25-29': 1, '20-24': 2, '15-19': 3, '10-14': 4, '40-44': 5, '35-39': 6, '0-4': 7, '50-54': 8, '5-9': 9, '45-49': 10}, {'0-2': 0, '3-5': 1, '6-8': 2, '9-11': 3, '15-17': 4, '12-14': 5, '24-26': 6}, {'no': 0, 'yes': 1}, {2: 0, 3: 1, 1: 2}, {'left': 0, 'right': 1}, {'left_low': 0, 'left_up': 1, 'right_up': 2, 'right_low': 3, 'central': 4}, {'no': 0, 'yes': 1}]
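
The mapping above ranks categories by frequency, so for interval-valued attributes such as age or tumor-size the integer codes do not follow the order of the intervals. If that order matters, one option is to sort the intervals by their lower bound before numbering them. A minimal sketch, where `ordinal_codes` is a hypothetical helper that is not part of the notebook:

```python
def ordinal_codes(intervals):
    """Map 'low-high' interval labels to integers, ordered by lower bound."""
    ordered = sorted(set(intervals), key=lambda label: int(label.split("-")[0]))
    return {label: code for code, label in enumerate(ordered)}

# Tumor-size categories observed in the cleaned data (taken from the mapping above)
tumor_sizes = ["30-34", "25-29", "20-24", "15-19", "10-14", "40-44",
               "35-39", "0-4", "50-54", "5-9", "45-49"]
codes = ordinal_codes(tumor_sizes)
print(codes)
```

With this encoding a tree split such as "code <= 3" corresponds to a real threshold on tumor size, which is not the case with frequency-ranked codes.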


In [33]:
# Build the numeric dataframe: each value is replaced by its integer code
final_df = pd.DataFrame()
for i, column in enumerate(df_nnv.columns):
    final_df[i] = [association[i][value] for value in df_nnv[column]]
final_df.head()

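|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 0 | 1 | 0 | 2 | 0 | 0 | 0 | 1 | 2 | 0 |
| 2 | 0 | 1 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 2 | 1 | 3 | 0 | 0 | 0 | 1 | 1 | 0 |
| 4 | 0 | 1 | 0 | 7 | 0 | 0 | 0 | 1 | 3 | 0 |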


In [34]:
# Target: column 9; features: columns 0 to 8.
# Note: in the attribute list above, column 9 is irradiat and column 0 is Class.
Y = final_df[[9]]
X = final_df.iloc[:, :-1]
# Split the dataset: 70% for training, 30% for testing
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3)
print("Training set count X:{}/Y:{}, Test set count X:{}/Y:{}".format(
    X_train.shape[0], Y_train.shape[0], X_test.shape[0], Y_test.shape[0]))

Training set count X:193/Y:193, Test set count X:84/Y:84
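
Two details of this split are worth keeping in mind. According to the attribute list, the recurrence label (Class) is stored in column 0, whereas the split above uses column 9 as target; and without `stratify`, the class proportions can differ between the training and test sets. The sketch below shows, on synthetic data, a split that takes column 0 as target and preserves the class proportions. The toy array, the `random_state` value and the variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic encoded table: column 0 is the label (70 zeros, 30 ones),
# columns 1 to 9 are integer-coded features
labels = np.array([0] * 70 + [1] * 30)
features = rng.integers(0, 5, size=(100, 9))
toy = np.column_stack([labels, features])

y_toy = toy[:, 0]    # target: column 0
X_toy = toy[:, 1:]   # features: every other column

# stratify keeps the 70/30 class ratio in both subsets;
# random_state makes the split reproducible
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=0)

print("share of class 1 in train:", y_tr.mean(), "in test:", y_te.mean())
```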


## I - Decision Tree

We will now try to predict the class of each patient with a basic decision tree using Sklearn, tuning its hyperparameters with a grid search.

In [35]:
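dt_class = tree.DecisionTreeClassifier()
# Hyperparameters to explore. For classifiers, 'auto' is an alias of 'sqrt' in the
# scikit-learn version used here; recent releases no longer accept it.
params = {'criterion':["gini", "entropy"], 'max_depth':range(1,20), 'max_features':["auto", "log2", "sqrt"]}
# Exhaustive search with 3-fold cross-validation on the training set
gs = GridSearchCV(dt_class, params, cv=3)
gs.fit(X_train, Y_train)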



GridSearchCV(cv=3, error_score='raise-deprecating',
             estimator=DecisionTreeClassifier(class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features=None,
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              presort=False, random_state=None,
                                              splitter='best'),
             iid='warn', n_jobs=None,
             param_grid={'criterion': ['gini', 'entropy'],
                         'max_depth': range(1, 20),
                         

In [36]:
print(gs.best_params_)

{'criterion': 'gini', 'max_depth': 6, 'max_features': 'log2'}


In [37]:
def result(gs, X_test, Y_test):
    """Print per-class and global accuracy of the best estimator, return the confusion matrix."""
    Y_pred = gs.best_estimator_.predict(X_test)
    # Rows are true classes, columns are predicted classes (0 = safe, 1 = ill)
    res = confusion_matrix(Y_test, Y_pred)
    print("Safe Patient well predicted : {}".format(res[0][0]))
    print("Safe Patient predicted Ill : {}".format(res[0][1]))
    print("Ill Patient predicted Safe : {}".format(res[1][0]))
    print("Ill Patient well predicted : {}".format(res[1][1]))
    # Per-class accuracy, i.e. the recall of each class
    print("Safe Patient accuracy : {}".format(res[0][0] / (res[0][1] + res[0][0])))
    print("Ill Patient accuracy : {}".format(res[1][1] / (res[1][1] + res[1][0])))
    print("Global accuracy : {}".format((res[1][1] + res[0][0]) / res.sum()))
    return res

In [38]:
res_dt_class = result(gs, X_test, Y_test)

Safe Patient well predicted : 59
Safe Patient predicted Ill : 5
Ill Patient predicted Safe : 17
Ill Patient well predicted : 3
Safe Patient accuracy : 0.921875
Ill Patient accuracy : 0.15
Global accuracy : 0.7380952380952381


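As we can see, we obtain a first score of about 0.74: 62 of the 84 test samples are well classified.
But the per-class results are much less encouraging: we only detected 15% of the ill patients (3 of 20), and we predicted about 8% of the safe patients (5 of 64) to be ill.

This highlights the poor quality of our predictions: always answering "safe" would reach 64/84 ≈ 0.76 on this test set, which is higher than our score.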
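
The figures discussed above can be recomputed directly from the confusion matrix. The sketch below hard-codes the four counts printed by `result` and derives the recall and precision of the ill class, the balanced accuracy, and the accuracy of a trivial model that always predicts "safe". The variable names are illustrative.

```python
# Counts printed above: rows are true classes, columns are predictions
tn, fp = 59, 5    # safe patients: well predicted / predicted ill
fn, tp = 17, 3    # ill patients: predicted safe / well predicted

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
recall_ill = tp / (tp + fn)          # share of ill patients that are detected
precision_ill = tp / (tp + fp)       # share of "ill" predictions that are right
recall_safe = tn / (tn + fp)
balanced_accuracy = (recall_ill + recall_safe) / 2
baseline = (tn + fp) / total         # always predicting "safe"

print(f"accuracy          : {accuracy:.3f}")
print(f"recall (ill)      : {recall_ill:.3f}")
print(f"precision (ill)   : {precision_ill:.3f}")
print(f"balanced accuracy : {balanced_accuracy:.3f}")
print(f"always-safe model : {baseline:.3f}")
```

The balanced accuracy, which gives the same weight to both classes, is close to 0.5 here, so it reflects the weakness on ill patients much better than the global accuracy does.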

## II - Random Forest

We will now try to predict the same classes with a Random Forest, an ensemble of decision trees, using Sklearn.

In [39]:
rf_class = ensemble.RandomForestClassifier()
# Same grid as for the single tree, plus the number of trees in the forest
params = {'criterion':["gini", "entropy"], 'max_depth':range(2,15), 'max_features':["auto", "log2", "sqrt"], 
          'n_estimators':[25,35,45,50,65,75]}
gs = GridSearchCV(rf_class, params, cv=3)
# The target is passed as a flat list of labels
gs.fit(X_train, list(Y_train[9]))



GridSearchCV(cv=3, error_score='raise-deprecating',
             estimator=RandomForestClassifier(bootstrap=True, class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators='warn', n_jobs=None,
                                              oob_score=False,
                                              random_state=None, verbose=0,
                                              warm_start=False),
             iid

In [40]:
print(gs.best_params_)

{'criterion': 'entropy', 'max_depth': 7, 'max_features': 'log2', 'n_estimators': 65}


Note that `n_estimators` is searched in [25, 35, 45, 50, 65, 75] because we tried several wider intervals before, and the best value was always over 20 and under 100.

In [42]:
res_rf_class = result(gs, X_test, Y_test)

Safe Patient well predicted : 61
Safe Patient predicted Ill : 3
Ill Patient predicted Safe : 17
Ill Patient well predicted : 3
Safe Patient accuracy : 0.953125
Ill Patient accuracy : 0.15
Global accuracy : 0.7619047619047619


As we can see here, we slightly increase the number of well-predicted patients: the global accuracy goes from 0.74 to 0.76.
We reduced the number of safe patients predicted to be ill (from 5 to 3), which is a good point, but the number of ill patients detected stays at 3 of 20.

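## III - AdaBoost with a decision tree

We are now going to run the AdaBoost algorithm with a decision tree as base estimator. `AdaBoostClassifier()` is created without a base estimator here, so Sklearn uses its default one, a decision tree of depth 1 (a decision stump), rather than the tree tuned in section I.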

In [47]:
ada_class = ensemble.AdaBoostClassifier()
params = {'n_estimators':[25,35,45,50,65,75], "learning_rate":[0.01,0.05,0.1,0.2,0.5,0.7,1,1.5,2], 
          'algorithm':['SAMME', 'SAMME.R']}
gs = GridSearchCV(ada_class, params, cv=3)
gs.fit(X_train, list(Y_train[9]))



GridSearchCV(cv=3, error_score='raise-deprecating',
             estimator=AdaBoostClassifier(algorithm='SAMME.R',
                                          base_estimator=None,
                                          learning_rate=1.0, n_estimators=50,
                                          random_state=None),
             iid='warn', n_jobs=None,
             param_grid={'algorithm': ['SAMME', 'SAMME.R'],
                         'learning_rate': [0.01, 0.05, 0.1, 0.2, 0.5, 0.7, 1,
                                           1.5, 2],
                         'n_estimators': [25, 35, 45, 50, 65, 75]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)

In [48]:
print(gs.best_params_)

{'algorithm': 'SAMME', 'learning_rate': 0.2, 'n_estimators': 50}
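
To boost a tuned tree instead of the default base estimator, the tree can be passed as the first argument of `AdaBoostClassifier`. The sketch below is a minimal illustration on synthetic data; the data, the `random_state` values and the AdaBoost settings are assumptions, and the tree parameters are the ones reported by the grid search of section I.

```python
from sklearn import ensemble, tree
from sklearn.datasets import make_classification

# Synthetic stand-in for the encoded dataset (9 features, 2 classes)
X_syn, y_syn = make_classification(n_samples=200, n_features=9, n_informative=4,
                                   random_state=0)

# Tree configured with the best parameters found for the single decision tree
best_tree = tree.DecisionTreeClassifier(criterion="gini", max_depth=6,
                                        max_features="log2", random_state=0)

# The base estimator is passed positionally because its keyword name
# differs between scikit-learn versions
boosted = ensemble.AdaBoostClassifier(best_tree, n_estimators=50,
                                      learning_rate=0.2, random_state=0)
boosted.fit(X_syn, y_syn)

print("max_depth of the boosted trees:",
      boosted.estimators_[0].get_params()["max_depth"])
```

Deeper base trees make each boosting round stronger but also more prone to overfitting, so the depth is worth including in the grid rather than fixing it in advance.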


In [50]:
res_ada_class_1 = result(gs, X_test, Y_test)

Safe Patient well predicted : 64
Safe Patient predicted Ill : 0
Ill Patient predicted Safe : 19
Ill Patient well predicted : 1
Safe Patient accuracy : 1.0
Ill Patient accuracy : 0.05
Global accuracy : 0.7738095238095238


One thing to mention here is that no safe patient is predicted to be ill any more, which is a real improvement. But there is still a huge problem: only 1 of the 20 ill patients is detected, so the model answers "safe" almost all the time.
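
Since plain accuracy rewards a model that almost always answers "safe", one option is to ask the grid search to optimise a metric that weights both classes equally. The sketch below is only an illustration on synthetic, imbalanced data generated with `make_classification`; the reduced grid, the `random_state` values and the choice of `balanced_accuracy` are assumptions, not what this notebook ran.

```python
from sklearn import ensemble
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the data: about 75% of class 0 and 25% of class 1
X_syn, y_syn = make_classification(n_samples=200, n_features=9, n_informative=4,
                                   weights=[0.75, 0.25], random_state=0)

ada = ensemble.AdaBoostClassifier(random_state=0)
small_grid = {"n_estimators": [25, 50], "learning_rate": [0.1, 0.5, 1]}

# scoring="balanced_accuracy" averages the recall of both classes,
# so ignoring the minority class is no longer rewarded
gs_bal = GridSearchCV(ada, small_grid, cv=3, scoring="balanced_accuracy")
gs_bal.fit(X_syn, y_syn)

print(gs_bal.best_params_, round(gs_bal.best_score_, 3))
```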

## IV - AdaBoost with a Random Forest

We are going to run the AdaBoost algorithm again, this time with a Random Forest of 50 trees as base estimator.

In [53]:
ada_class = ensemble.AdaBoostClassifier(ensemble.RandomForestClassifier(n_estimators=50))
params = {'n_estimators':[25,35,45,50,65,75], "learning_rate":[0.01,0.05,0.1,0.2,0.5,0.7,1,1.5,2], 
          'algorithm':['SAMME', 'SAMME.R']}
gs = GridSearchCV(ada_class, params, cv=3)
gs.fit(X_train, list(Y_train[9]))



GridSearchCV(cv=3, error_score='raise-deprecating',
             estimator=AdaBoostClassifier(algorithm='SAMME.R',
                                          base_estimator=RandomForestClassifier(bootstrap=True,
                                                                                class_weight=None,
                                                                                criterion='gini',
                                                                                max_depth=None,
                                                                                max_features='auto',
                                                                                max_leaf_nodes=None,
                                                                                min_impurity_decrease=0.0,
                                                                                min_impurity_split=None,
                                                                                mi

In [54]:
print(gs.best_params_)

{'algorithm': 'SAMME', 'learning_rate': 0.7, 'n_estimators': 35}


In [55]:
res_ada_class_2 = result(gs, X_test, Y_test)

Safe Patient well predicted : 62
Safe Patient predicted Ill : 2
Ill Patient predicted Safe : 17
Ill Patient well predicted : 3
Safe Patient accuracy : 0.96875
Ill Patient accuracy : 0.15
Global accuracy : 0.7738095238095238
