## Artificial Intelligence
## L2 International, Univ. Bordeaux
### Lab #4, Supervised Learning (2)

### Decision Trees
A decision tree is a tree structure in which each internal node represents a feature, each branch represents a decision rule, and each leaf node represents an outcome. The topmost node is known as the root node. The tree learns to partition the data on the basis of attribute values, splitting it recursively in a process called recursive partitioning. The resulting structure resembles a flowchart and mirrors the way people reason through a sequence of questions, which is why decision trees are easy to understand and interpret.
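
To make recursive partitioning more concrete, here is a minimal sketch of a single partitioning step: searching one feature for the threshold that minimizes the weighted Gini impurity of the two resulting groups. The helper names `gini` and `best_split` and the toy values are illustrative assumptions, not part of scikit-learn, whose implementation is more elaborate.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a 1-D array of class labels (0 means a pure group)."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(x, y):
    """Return (threshold, weighted_gini) for the best rule 'x <= threshold'."""
    best_thr, best_score = None, float('inf')
    values = np.unique(x)
    # Candidate thresholds: midpoints between consecutive distinct values
    for thr in (values[:-1] + values[1:]) / 2:
        left, right = y[x <= thr], y[x > thr]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if score < best_score:
            best_thr, best_score = thr, score
    return best_thr, best_score

# Toy data (hypothetical): glucose-like values with a binary label
x = np.array([85, 89, 110, 137, 148, 183])
y = np.array([0, 0, 0, 1, 1, 1])
thr, score = best_split(x, y)
print(thr, score)
```

On this toy data the threshold 123.5 separates the two classes perfectly, so the weighted impurity is 0. A full tree would apply the same search to every feature, keep the best one, and repeat on each of the two groups.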

For this lab, we will use the Pima Indians Diabetes dataset, which is described at the following address:

https://www.kaggle.com/kumargh/pimaindiansdiabetescsv

As usual, we need to import the necessary Python modules:

In [19]:
import pandas as pd
import numpy as np

Two additional modules for data visualization:

In [20]:
from matplotlib import pyplot as plt
import seaborn as sns
sns.set()

And then, load the dataset:

In [21]:
col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bml', 'pedigree', 'age', 'label']
pima = pd.read_csv('https://www.labri.fr/~zemmari/datasets/pima-indians-diabetes.csv',
                  header=None, names=col_names)
pima.head()

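| | pregnant | glucose | bp | skin | insulin | bml | pedigree | age | label |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |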


In [22]:
# Features used for training ('skin' is left out); 'label' is the target
feature_cols = ['pregnant', 'insulin', 'bml', 'age', 'glucose', 'bp', 'pedigree']
X = pima[feature_cols]
y = pima.label

In [23]:
X.head()

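| | pregnant | insulin | bml | age | glucose | bp | pedigree |
|---|---|---|---|---|---|---|---|
| 0 | 6 | 0 | 33.6 | 50 | 148 | 72 | 0.627 |
| 1 | 1 | 0 | 26.6 | 31 | 85 | 66 | 0.351 |
| 2 | 8 | 0 | 23.3 | 32 | 183 | 64 | 0.672 |
| 3 | 1 | 94 | 28.1 | 21 | 89 | 66 | 0.167 |
| 4 | 0 | 168 | 43.1 | 33 | 137 | 40 | 2.288 |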


In [42]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   pregnant  768 non-null    int64  
 1   insulin   768 non-null    int64  
 2   bml       768 non-null    float64
 3   age       768 non-null    int64  
 4   glucose   768 non-null    int64  
 5   bp        768 non-null    int64  
 6   pedigree  768 non-null    float64
dtypes: float64(2), int64(5)
memory usage: 42.1 KB


In [24]:
y.head()

0    1
1    0
2    1
3    0
4    1
Name: label, dtype: int64

Next, we split the dataset into two subsets, one for training and one for testing. For this, we import the necessary function:

In [43]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=109)
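
To see what `test_size=.3` does on a dataset of this size, the sketch below splits placeholder data with 768 rows, the same number of rows as the Pima dataset. The arrays `X_demo` and `y_demo` are stand-ins, and the `stratify` argument is an optional extra that the lab does not use: it keeps the class proportions the same in both subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the Pima dataset (768)
X_demo = np.arange(768 * 2).reshape(768, 2)
y_demo = np.array([0, 1] * 384)

# stratify=y_demo is an assumption added here for illustration
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=109, stratify=y_demo)

print(X_tr.shape, X_te.shape)      # rows in the training and test subsets
print(y_tr.mean(), y_te.mean())    # share of class 1 in each subset
```

With 768 rows, 30% rounds up to 231 test rows, leaving 537 rows for training.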

Now, we can train our supervised learning model using a decision tree classifier:

In [26]:
import warnings
warnings.filterwarnings('ignore')

In [44]:
from sklearn import tree
from sklearn import metrics
# All hyperparameters keep their default values; they are spelled out for reference.
# min_impurity_split and presort are omitted: recent scikit-learn releases no longer accept them.
dt = tree.DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0,
                       random_state=None, splitter='best')
dt.fit(X_train, y_train)

DecisionTreeClassifier()

In [45]:
y_pred = dt.predict(X_test)
print(y_pred.shape)
print(y_test.shape)

(231,)
(231,)


In [46]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)

[[123  26]
 [ 44  38]]


In [47]:
scores = metrics.accuracy_score(y_test, y_pred)
print('Accuracy: ','{:2.2%}'.format(scores))

Accuracy:  69.70%
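
Accuracy alone hides how the errors are distributed between the two classes. As a small worked example, the sketch below recomputes the accuracy from the confusion matrix printed above and derives precision and recall for class 1. It relies on scikit-learn's convention that rows are true classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix printed above: rows are true classes, columns are predictions
cm = np.array([[123, 26],
               [44, 38]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # among predicted positives, the fraction that is correct
recall = tp / (tp + fn)      # among actual positives, the fraction that is found

print('Accuracy:  {:.2%}'.format(accuracy))
print('Precision: {:.2%}'.format(precision))
print('Recall:    {:.2%}'.format(recall))
```

The accuracy matches the value reported by `metrics.accuracy_score`, while the recall shows that the tree identifies only 38 of the 82 positive cases in the test set.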


In [35]:
from sklearn.tree import export_graphviz
from io import StringIO
from IPython.display import Image
# pydotplus (and the Graphviz binaries it calls) must be installed for this cell to run
import pydotplus

# Export the fitted tree to DOT format in an in-memory buffer
dot_data = StringIO()
export_graphviz(dt, out_file=dot_data,
                filled=True, rounded=True,
                special_characters=True, feature_names=feature_cols,
                class_names=['0', '1'])
# Render the DOT description to a PNG file and display it inline
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png('diabetes.png')
Image(graph.create_png())

ModuleNotFoundError: No module named 'pydotplus'
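
If pydotplus is not available, as the `ModuleNotFoundError` above indicates, one possible alternative is scikit-learn's own `plot_tree`, which only needs matplotlib. The sketch below is self-contained and therefore trains a small tree on stand-in data from `make_classification`. The names `X_demo`, `y_demo`, `dt_demo`, the output file name, and `max_depth=3` (chosen only to keep the figure readable) are assumptions. In the lab, the fitted `dt` could be passed instead.

```python
from matplotlib import pyplot as plt
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, plot_tree

feature_cols = ['pregnant', 'insulin', 'bml', 'age', 'glucose', 'bp', 'pedigree']

# Stand-in data so the sketch runs on its own; in the lab, use X_train and y_train
X_demo, y_demo = make_classification(n_samples=200, n_features=7, random_state=0)
dt_demo = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_demo, y_demo)

# Draw the tree on a matplotlib axis and save it to a PNG file
fig, ax = plt.subplots(figsize=(12, 6))
nodes = plot_tree(dt_demo, feature_names=feature_cols, class_names=['0', '1'],
                  filled=True, rounded=True, ax=ax)
fig.savefig('diabetes_tree.png')
```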

### Random Forests
Random forest is a supervised learning algorithm that is flexible and easy to use. A forest is made up of trees, and adding more trees generally makes the forest more robust. A random forest builds decision trees on randomly selected samples of the data, gets a prediction from each tree, and selects the final answer by voting. It also provides a useful indicator of feature importance.
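
The idea of building trees on random samples and voting can be sketched by hand. The example below is a simplified illustration on stand-in data from `make_classification`, not the actual scikit-learn implementation: the names `X_demo`, `y_demo`, `trees`, and the choice of 25 trees are assumptions. Note also that scikit-learn's `RandomForestClassifier` averages the trees' predicted class probabilities rather than counting hard votes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in binary classification data with 7 features
X_demo, y_demo = make_classification(n_samples=400, n_features=7, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

rng = np.random.default_rng(0)
trees = []
for _ in range(25):  # an odd number of trees avoids tied votes
    # Bootstrap sample: draw len(X_tr) row indices with replacement
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    # max_features='sqrt' makes each split consider a random subset of features
    t = DecisionTreeClassifier(max_features='sqrt',
                               random_state=int(rng.integers(1_000_000)))
    trees.append(t.fit(X_tr[idx], y_tr[idx]))

# Each tree votes; the majority class wins (labels are 0/1 here)
votes = np.stack([t.predict(X_te) for t in trees])   # shape: (n_trees, n_test)
y_vote = (votes.mean(axis=0) > 0.5).astype(int)

print('Ensemble accuracy: {:.2%}'.format((y_vote == y_te).mean()))
```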

In the previous section, we used a decision tree classifier and got an accuracy of 69.70%.

Let's try to do better by using a random forest classifier:

In [38]:
from sklearn.ensemble import RandomForestClassifier
# Only n_estimators differs from the defaults. max_features='auto' and
# min_impurity_split are omitted: recent scikit-learn releases no longer accept them
# (for classifiers, 'auto' meant 'sqrt').
rf = RandomForestClassifier(n_estimators=10)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
scores = metrics.accuracy_score(y_test, y_pred)
print('Accuracy: ','{:2.2%}'.format(scores))

Accuracy:  66.67%


In [39]:
rf

RandomForestClassifier(n_estimators=10)

In [40]:
rf = RandomForestClassifier(bootstrap=True, max_depth=None)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
scores = metrics.accuracy_score(y_test, y_pred)
print('Accuracy: ','{:2.2%}'.format(scores))

Accuracy:  66.67%


In [41]:
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start=200, stop=2000, num=10)]
# Number of features to consider at every split
# ('auto' is no longer accepted by recent scikit-learn; for classifiers it meant 'sqrt')
max_features = ['sqrt', 'log2']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num=11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
print(random_grid)

{'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000], 'max_features': ['sqrt', 'log2'], 'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, None], 'min_samples_split': [2, 5, 10], 'min_samples_leaf': [1, 2, 4], 'bootstrap': [True, False]}
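
A grid like this is typically passed to `RandomizedSearchCV`, which samples a fixed number of combinations and scores each one by cross-validation. The sketch below shows one way this could look. It is self-contained, so it uses stand-in data from `make_classification` and a reduced grid, `small_grid`, with far fewer trees so that it runs quickly. Those names and the values of `n_iter` and `cv` are assumptions. In the lab, a grid with the same keys and `X_train`, `y_train` would take their place.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Stand-in data; in the lab, fit on X_train and y_train instead
X_demo, y_demo = make_classification(n_samples=300, n_features=7, random_state=0)

# Reduced grid (assumption) so that the sketch runs in a few seconds
small_grid = {'n_estimators': [20, 50],
              'max_features': ['sqrt', 'log2'],
              'max_depth': [5, 10, None],
              'min_samples_split': [2, 5, 10],
              'min_samples_leaf': [1, 2, 4],
              'bootstrap': [True, False]}

search = RandomizedSearchCV(estimator=RandomForestClassifier(random_state=0),
                            param_distributions=small_grid,
                            n_iter=5,            # number of random combinations tried
                            cv=3,                # 3-fold cross-validation
                            scoring='accuracy',
                            random_state=0)
search.fit(X_demo, y_demo)

print(search.best_params_)
print('Best CV accuracy: {:.2%}'.format(search.best_score_))
```

After fitting, `search.best_estimator_` holds a forest refit on all the data passed to `fit` with the best parameters found, and it can be evaluated on the test set in the same way as `rf`.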
