# Decision Trees

**Introduction:**

Decision trees are a class of algorithms built from "if" and "else" conditions. A prediction is made by following these conditions from the root of the tree down to a leaf. The algorithm learns the conditions from the training data, so the number of conditions, the features they test, and the outcomes they lead to differ from dataset to dataset. Below we cover the decision tree implementations available in scikit-learn for classification and regression tasks.
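To make the "if/else" structure concrete, here is a minimal sketch that fits a shallow tree on the iris data and prints its learned rules as text. The depth limit of 2 and the variable names are illustrative choices, not part of the original tutorial.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree (illustrative depth limit) keeps the printed rule list short.
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

# export_text renders the learned splits as nested threshold tests.
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```

Each indented line is one condition on a feature value, and each "class" line is the prediction made when all conditions above it hold.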

Below we have highlighted some characteristics of decision trees:

- Fast to train and easy to understand and interpret.
- Binary splitting on questions about feature values is the essence of decision tree models.
- Require little preprocessing of data.
- Can work with variables of different types (continuous and discrete).
- Invariant to feature scaling.
- Models are called "nonparametric" because the tree structure is not fixed in advance but grows with the data. They still have hyper-parameters (such as max_depth) that can be tuned.
- Given more data, the tree can grow deeper and the model therefore becomes more flexible.
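The claim about feature scaling can be checked with a small experiment. The sketch below, which assumes a simple unit change (centimetres to millimetres) as the rescaling, fits the same tree on both versions of the iris data and compares the features chosen at each node and the resulting predictions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_mm = X * 10.0  # same measurements expressed in millimetres instead of centimetres

# Identical settings and seed, only the scale of the inputs differs.
tree_cm = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
tree_mm = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_mm, y)

# tree_.feature holds the feature index tested at every node (-2 marks a leaf).
same_features = np.array_equal(tree_cm.tree_.feature, tree_mm.tree_.feature)
same_predictions = np.array_equal(tree_cm.predict(X), tree_mm.predict(X_mm))

print("Same split features :", same_features)
print("Same predictions    :", same_predictions)
```

Only the thresholds change with the units; the order of the samples along each feature, which is all a split depends on, stays the same.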

We'll start by importing the modules needed for our tutorial. We'll need the pydotplus library installed, as it'll be used to plot decision trees trained by scikit-learn. pydotplus in turn relies on the Graphviz binaries being available on the system.

In [None]:
!git clone https://github.com/hussain0048/Machine-Learning.git

## 1 - Importing necessary libraries ##

In [None]:
!pip install pydotplus

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

import sklearn

import warnings
import sys

print("Python Version : ",sys.version)
print("Scikit-Learn Version : ",sklearn.__version__)

warnings.filterwarnings("ignore") ## Silence warnings (including future warnings) for the rest of the notebook.
np.set_printoptions(precision=3)

## The magic function below renders plots inside the current notebook.
## The alternative (%matplotlib notebook) renders interactive plots inside the notebook.
%matplotlib inline

**DecisionTreeClassifier**

Below we are loading the classic IRIS classification dataset provided by scikit-learn, which has 150 samples across 3 categories of flowers with 50 samples per category (iris-setosa, iris-virginica, iris-versicolor). We'll use the DecisionTreeClassifier provided by scikit-learn for the classification task.

## 2 - Loading Data ##
Below we are loading the IRIS dataset, which ships with the sklearn package. load_iris() returns a Bunch object, which behaves almost like a dictionary.


In [None]:
from sklearn import datasets

iris = datasets.load_iris()
X, Y = iris.data, iris.target

print('Dataset features names : '+str(iris.feature_names))
print('Dataset features size : '+str(iris.data.shape))
print('Dataset target names : '+str(iris.target_names))
print('Dataset target size : '+str(iris.target.shape))

## 3 - Splitting Dataset into Train & Test sets ##
We'll split the dataset into two parts:

- Training data, which will be used to train the model.
- Test data, against which the accuracy of the trained model will be checked.

The train_test_split function of the model_selection module of sklearn will help us split the data into two sets, with 75% for training and 25% for testing. We also pass stratify=Y so that both sets keep the same class proportions, and a seed (random_state=123) so that we always get the same split and can reproduce results in the future.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=0.75, test_size=0.25, stratify=Y, random_state=123)
print('Train/Test Sizes : ',X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)
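As a quick check of what stratification does, the following sketch repeats the split with the same arguments and counts the samples per class in each part. The variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, train_size=0.75, test_size=0.25, stratify=y, random_state=123
)

# bincount gives the number of samples per class label (0, 1, 2).
train_counts = np.bincount(y_tr)
test_counts = np.bincount(y_te)

print("Train class counts :", train_counts)
print("Test class counts  :", test_counts)
```

With stratify set, the per-class counts within each part differ by at most one sample, so neither set is dominated by a single flower type.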

## 4 - Fitting Model To Train Data ##

In [None]:
from sklearn.tree import DecisionTreeClassifier
tree_classifier = DecisionTreeClassifier(random_state=1)
tree_classifier.fit(X_train, Y_train)

## 5 - Evaluating Trained Model On Test Data ##
Almost all models in the Scikit-Learn API provide a predict() method, which can be used to predict the target variable for the test set passed to it. We'll use score(), which returns the accuracy of a classification model, to check model accuracy on the test data.

In [None]:
Y_preds = tree_classifier.predict(X_test)
print(Y_preds)
print(Y_test)
print('Test Accuracy : %.3f'%(Y_preds == Y_test).mean() )
print('Test Accuracy : %.3f'%tree_classifier.score(X_test, Y_test)) ## Score method also evaluates accuracy for classification models.
print('Training Accuracy : %.3f'%tree_classifier.score(X_train, Y_train))
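Accuracy is a single number; a confusion matrix shows which classes are mixed up. The sketch below is a self-contained variant of the evaluation above that uses sklearn.metrics.confusion_matrix and recovers accuracy from the matrix diagonal. It rebuilds the split and model itself, so its variable names are separate from the notebook's.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=123
)

clf = DecisionTreeClassifier(random_state=1).fit(X_tr, y_tr)
y_pred = clf.predict(X_te)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, y_pred)
print(cm)

# Correct predictions sit on the diagonal, so accuracy = trace / total.
accuracy_from_cm = np.trace(cm) / cm.sum()
print("Accuracy from confusion matrix : %.3f" % accuracy_from_cm)
print("Accuracy from accuracy_score   : %.3f" % accuracy_score(y_te, y_pred))
```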

A DecisionTreeClassifier instance provides the predict_proba() method, which returns the probability predicted by the model for each class. Below we print the probabilities predicted by the model for the first few test samples.

In [None]:
tree_classifier.predict_proba(X_test)[:10]
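For a decision tree, the predicted probability of a class is the fraction of training samples of that class in the leaf a sample falls into, so a fully grown tree mostly returns 0s and 1s. The sketch below uses a deliberately shallow tree (max_depth=2, an illustrative choice) so that impure leaves and fractional probabilities show up.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# A depth-2 tree has at most 4 leaves, some of which mix classes.
shallow = DecisionTreeClassifier(max_depth=2, random_state=1).fit(X, y)
proba = shallow.predict_proba(X)

# Every sample in the same leaf gets the same probability row.
print(np.unique(proba.round(3), axis=0))

# predict() returns the class with the highest probability.
predicted = shallow.classes_[proba.argmax(axis=1)]
print("argmax matches predict :", np.array_equal(predicted, shallow.predict(X)))
```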

## 6 - Finetuning Model By Doing Grid Search On Various Hyperparameters ##

Below is a list of common hyper-parameters that need tuning to get the best fit for our data. We'll try various hyper-parameter settings on various splits of the training data to find the best fit, one that has almost the same accuracy on both the train and test datasets or only a small difference between them.
- criterion - It accepts a string argument specifying which function to use to measure the quality of a split.
    - gini - Gini Impurity. This is the default value.
    - entropy - Information Gain.
- max_depth - It defines how finely the tree can separate samples (the length of the list of "if-else" questions asked to decide the target variable). As we increase max_depth the model tends to over-fit, and a small value of max_depth results in under-fit. We need to find the best value. If no value is provided, the default None is used, which lets the tree grow until all leaves are pure or contain fewer than min_samples_split samples.
- max_features - Number of features to consider when looking for a split. It accepts int(1-n_features), float(0.0-1.0], string(sqrt, log2, auto) or None as value. A float is treated as a fraction of n_features.
    - None - n_features are used as value if None is provided.
    - sqrt - sqrt(n_features) features are used for split.
    - auto - sqrt(n_features) features are used for split. This option was deprecated in scikit-learn 1.1 and removed in 1.3.
    - log2 - log2(n_features) features are used for split.
- min_samples_split - Minimum number of samples required to split an internal node. It accepts int(2-n_samples), float(0.0-1.0] values. A float is converted to ceil(min_samples_split * n_samples) samples.
- min_samples_leaf - Minimum number of samples required to be at a leaf node. It accepts an int (at least 1) or a float fraction of n_samples. A float is converted to ceil(min_samples_leaf * n_samples) samples.
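The effect of max_depth described above can be seen directly by fitting trees of increasing depth and comparing train and test accuracy. This is a small sketch on the iris data; the depth values and the split are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=123
)

scores = {}
for depth in [1, 2, 3, None]:  # None lets the tree grow without a depth limit
    clf = DecisionTreeClassifier(max_depth=depth, random_state=1).fit(X_tr, y_tr)
    scores[depth] = (clf.score(X_tr, y_tr), clf.score(X_te, y_te))
    print("max_depth=%-4s train=%.3f test=%.3f" % (depth, *scores[depth]))
```

A depth-1 tree can only separate two groups and under-fits three classes, while the unrestricted tree fits the training set much more closely. The gap between the train and test columns is the quantity the grid search below tries to keep small.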

**GridSearchCV**

It's a wrapper class provided by sklearn which loops through all parameter combinations provided in the param_grid parameter, using the number of cross-validation folds provided as the cv parameter. It evaluates model performance on every combination and stores all results in the cv_results_ attribute. It also stores the model with the best mean cross-validation score, refit on the full training data, in the best_estimator_ attribute and that best score in the best_score_ attribute.
Below we'll try various values for the above-mentioned hyper-parameters to find the best estimator for our dataset using 3-fold cross-validation.




In [None]:
from sklearn.model_selection import GridSearchCV

n_features = X.shape[1]
n_samples = X.shape[0]

grid = GridSearchCV(DecisionTreeClassifier(random_state=1), cv=3, n_jobs=-1, verbose=5,
                    param_grid ={
                    'criterion': ['gini', 'entropy'],
                    'max_depth': [None,1,2,3,4,5,6,7],
                    'max_features': [None, 'sqrt', 'auto', 'log2', 0.3,0.5,0.7, n_features//2, n_features//3, ],
                    'min_samples_split': [2,0.3,0.5, n_samples//2, n_samples//3, n_samples//5],
                    'min_samples_leaf':[1, 0.3,0.5, n_samples//2, n_samples//3, n_samples//5]},
                    )

grid.fit(X_train, Y_train)

print('Train Accuracy : %.3f'%grid.best_estimator_.score(X_train, Y_train))
print('Test Accuracy : %.3f'%grid.best_estimator_.score(X_test, Y_test))
print('Best Score Through Grid Search : %.3f'%grid.best_score_)
print('Best Parameters : ',grid.best_params_)
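To show what GridSearchCV does internally, the sketch below reproduces a small search by hand with ParameterGrid and cross_val_score and compares the result with GridSearchCV. The reduced grid (small_grid) is an assumption made to keep the example fast; it is not the grid used above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, ParameterGrid, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
small_grid = {"criterion": ["gini", "entropy"], "max_depth": [1, 2, 3]}

# Manual loop: score every parameter combination with 3-fold cross-validation.
manual_best_score, manual_best_params = -np.inf, None
for params in ParameterGrid(small_grid):
    model = DecisionTreeClassifier(random_state=1, **params)
    mean_score = cross_val_score(model, X, y, cv=3).mean()
    if mean_score > manual_best_score:
        manual_best_score, manual_best_params = mean_score, params

# The same search through GridSearchCV.
search = GridSearchCV(DecisionTreeClassifier(random_state=1), small_grid, cv=3)
search.fit(X, y)

print("Manual best score : %.3f with %s" % (manual_best_score, manual_best_params))
print("Grid best score   : %.3f with %s" % (search.best_score_, search.best_params_))
```

Both approaches use the same folds and the same scoring (accuracy), so the best mean score agrees; GridSearchCV additionally refits the winning model and records per-fold details.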

## 7 - Printing First Few Cross-Validation Results ## 
GridSearchCV maintains results for all parameter combinations tried with all cross-validation splits. We can access these results as a dictionary through its cv_results_ attribute. We are converting it to a pandas DataFrame so that it is easier to read.



In [None]:
cross_val_results = pd.DataFrame(grid.cv_results_)
print('Number of Various Combinations of Parameters Tried : %d'%len(cross_val_results))

cross_val_results.head() ## Printing first few results.
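The full cv_results_ table has many columns, and head() shows rows in grid order rather than by quality. A possible way to focus on the best candidates is to keep a few columns and sort by rank, as in this sketch with a small illustrative grid.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
search = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    {"max_depth": [1, 2, 3, None], "min_samples_leaf": [1, 5, 10]},
    cv=3,
).fit(X, y)

results = pd.DataFrame(search.cv_results_)

# Keep the columns that summarize each candidate and put the best first.
columns = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
top = results[columns].sort_values("rank_test_score").head()
print(top)
```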

## 8 - Plotting Feature Importance ##
We can access the importance of each feature in the decision tree through the feature_importances_ attribute. We have plotted it as well for better understanding.

In [None]:
with plt.style.context(('seaborn', 'ggplot')):
    plt.figure(figsize=(10,4))
    plt.imshow(grid.best_estimator_.feature_importances_.reshape(1,-1), cmap=plt.cm.Blues, interpolation='nearest')
    plt.xticks(range(4), iris.feature_names)
    plt.yticks([])
    plt.grid(None)
    plt.colorbar();
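The heatmap shows relative importance by color only. As a complement, the sketch below pairs each importance with its feature name in a pandas Series, which makes the values easy to read and sort. The tree settings are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=3, random_state=1).fit(iris.data, iris.target)

# Label each importance with its feature name and list the largest first.
importances = pd.Series(
    clf.feature_importances_, index=iris.feature_names
).sort_values(ascending=False)
print(importances)
```

The importances are normalized, so they sum to 1 and a feature that is never used for a split gets 0.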

## 9 - Visualizing Decision Tree Using GraphViz & PyDotPlus ##
We can visualize the decision tree by using graphviz. Scikit-learn provides the export_graphviz() function, which converts a trained tree to the graphviz (DOT) format. We can then generate a graph from it with the pydotplus library, using its graph_from_dot_data method.

With the plotted tree we can ask questions about the flower features and reach a flower type by following the True or False answer to each question.

In [None]:
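from io import StringIO ## sklearn.externals.six is no longer shipped with recent scikit-learn releases; the standard library provides StringIO.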
from sklearn.tree import export_graphviz
from IPython.display import Image
import pydotplus

dot_data = StringIO()

export_graphviz(grid.best_estimator_, out_file=dot_data,
                filled=True, rounded=True,
                special_characters=True,
                class_names=iris.target_names,
                feature_names=iris.feature_names)

graph = pydotplus.graph_from_dot_data(dot_data.getvalue())

Image(graph.create_png())
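If Graphviz or pydotplus is not available, scikit-learn (0.21 and later) also offers sklearn.tree.plot_tree, which draws the tree with matplotlib only. The following is a minimal sketch of that alternative, not the method used in this tutorial; the depth limit and figure size are illustrative.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=3, random_state=1).fit(iris.data, iris.target)

fig, ax = plt.subplots(figsize=(12, 6))

# plot_tree draws one box per node and returns the matplotlib annotations.
annotations = plot_tree(
    clf,
    feature_names=list(iris.feature_names),
    class_names=list(iris.target_names),
    filled=True,
    rounded=True,
    ax=ax,
)
# In a notebook with %matplotlib inline the figure renders automatically;
# in a script, call plt.show() or fig.savefig(...) afterwards.
```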


## 10 - ExtraTreeClassifier ##
ExtraTreeClassifier is commonly referred to as an extremely randomized decision tree. When deciding how to split samples into 2 groups, a random split is drawn for each of the randomly selected features and the best of these random splits is selected.
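One way to see the difference from a regular decision tree is to inspect the default parameters of both classes and the root split each one picks. The sketch below does this for a few seeds; the seeds and the depth-1 limit are illustrative, and the printed values depend on the installed scikit-learn version.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier

# The two classes differ in how candidate splits are generated by default.
for cls in (DecisionTreeClassifier, ExtraTreeClassifier):
    params = cls().get_params()
    print(cls.__name__, "| splitter =", params["splitter"],
          "| max_features =", params["max_features"])

X, y = load_iris(return_X_y=True)

# Compare the root split (feature index and threshold) across seeds.
for seed in range(3):
    best = DecisionTreeClassifier(max_depth=1, random_state=seed).fit(X, y)
    rand = ExtraTreeClassifier(max_depth=1, random_state=seed).fit(X, y)
    print("seed=%d  best: feature %d @ %.3f   random: feature %d @ %.3f" % (
        seed,
        best.tree_.feature[0], best.tree_.threshold[0],
        rand.tree_.feature[0], rand.tree_.threshold[0],
    ))
```

The "best" splitter searches all thresholds of each candidate feature, while the "random" splitter draws thresholds at random, which usually makes single extra trees less accurate but more varied from seed to seed.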


### 10.1 - Fitting Model To Train Data ###

In [None]:
from sklearn.tree import ExtraTreeClassifier
extra_tree_classifier = ExtraTreeClassifier(random_state=1)
extra_tree_classifier.fit(X_train, Y_train)

### 10.2 - Evaluating Trained Model On Test Data ###
Almost all models in the Scikit-Learn API provide a predict() method, which can be used to predict the target variable for the test set passed to it.

In [None]:
Y_preds = extra_tree_classifier.predict(X_test)

print(Y_preds)
print(Y_test)

print('Test Accuracy : %.3f'%(Y_preds == Y_test).mean() )
print('Test Accuracy : %.3f'%extra_tree_classifier.score(X_test, Y_test)) ## Score method also evaluates accuracy for classification models.
print('Training Accuracy : %.3f'%extra_tree_classifier.score(X_train, Y_train))

In [None]:
extra_tree_classifier.predict_proba(X_test)[:10]


### 10.3 - Finetuning Model By Doing Grid Search On Various Hyperparameters ###
ExtraTreeClassifier has the same hyperparameters as DecisionTreeClassifier.

In [None]:
n_features = X.shape[1]
n_samples = X.shape[0]

grid = GridSearchCV(ExtraTreeClassifier(random_state=1), cv=3, n_jobs=-1, verbose=5,
                    param_grid ={
                    'criterion': ['gini', 'entropy'],
                    'max_depth': [None,1,2,3,4,5,6,7],
                    'max_features': [None, 'sqrt', 'auto', 'log2', 0.3,0.5,0.7, n_features//2, n_features//3, ],
                    'min_samples_split': [2,0.3,0.5, n_samples//2, n_samples//3, n_samples//5],
                    'min_samples_leaf':[1, 0.3,0.5, n_samples//2, n_samples//3, n_samples//5]},
                    )

grid.fit(X_train, Y_train)

print('Train Accuracy : %.3f'%grid.best_estimator_.score(X_train, Y_train))
print('Test Accuracy : %.3f'%grid.best_estimator_.score(X_test, Y_test))
print('Best Score Through Grid Search : %.3f'%grid.best_score_)
print('Best Parameters : ',grid.best_params_)

Fitting 3 folds for each of 5184 candidates, totalling 15552 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  14 tasks      | elapsed:    0.9s
[Parallel(n_jobs=-1)]: Done 2312 tasks      | elapsed:    3.9s
[Parallel(n_jobs=-1)]: Done 8072 tasks      | elapsed:   11.8s


Train Accuracy : 0.982
Test Accuracy : 0.974
Best Score Through Grid Search : 0.955
Best Parameters :  {'criterion': 'gini', 'max_depth': 5, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 2}


[Parallel(n_jobs=-1)]: Done 15454 tasks      | elapsed:   21.6s
[Parallel(n_jobs=-1)]: Done 15552 out of 15552 | elapsed:   21.7s finished


### 10.4 - Printing First Few Cross-Validation Results ###

In [None]:
cross_val_results = pd.DataFrame(grid.cv_results_)
print('Number of Various Combinations of Parameters Tried : %d'%len(cross_val_results))

cross_val_results.head() ## Printing first few results.

### 10.5 - Plotting Feature Importance ###

In [None]:
print("Feature Importance : %s"%str(grid.best_estimator_.feature_importances_))


In [None]:
with plt.style.context(('seaborn', 'ggplot')):
    plt.figure(figsize=(10,4))
    plt.imshow(grid.best_estimator_.feature_importances_.reshape(1,-1), cmap=plt.cm.Blues, interpolation='nearest')
    plt.xticks(range(4), iris.feature_names)
    plt.yticks([])
    plt.grid(None)
    plt.colorbar();

## 11 - DecisionTreeRegressor ##
We'll now load the Boston housing dataset provided by sklearn and try DecisionTreeRegressor on it. We'll also tune the depth of the tree and compare performance on the train and test sets. Note that load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so this section needs an older scikit-learn release to run as written.

### 11.1 - Loading Data ###

In [None]:
boston = datasets.load_boston()
X, Y  = boston.data, boston.target
print('Dataset features names : '+str(boston.feature_names))
print('Dataset features size : '+str(boston.data.shape))
print('Dataset target size : '+str(boston.target.shape))

### 11.2 - Splitting Dataset into Train & Test sets ###
Below we are splitting the Boston dataset into the train set (75%) and test set (25%). We are also using a seed (random_state=1) so that we always get the same split and can reproduce results in the future as well.

In [None]:
X_train, X_test,Y_train, Y_test = train_test_split(X, Y, train_size=0.75, test_size=0.25, random_state=1)
print('Train/Test Set Sizes : ', X_train.shape, Y_train.shape, X_test.shape, Y_test.shape)

Train/Test Set Sizes :  (379, 13) (379,) (127, 13) (127,)


### 11.3 - Fitting Model To Train Data ###

In [None]:
from sklearn.tree import DecisionTreeRegressor

tree_regressor = DecisionTreeRegressor(random_state=1)
tree_regressor.fit(X_train, Y_train)

### 11.4 - Evaluating Trained Model On Test Data ###
Almost all models in the Scikit-Learn API provide a predict() method, which can be used to predict the target variable for the test set passed to it. For regression models, score() returns the coefficient of determination (R^2).

In [None]:
Y_preds = tree_regressor.predict(X_test)

print(Y_preds[:10])
print(Y_test[:10])

print('Training R^2 Score : %.3f'%tree_regressor.score(X_train, Y_train))
print('Test R^2 Score : %.3f'%tree_regressor.score(X_test, Y_test))
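The R^2 value returned by score() can be reproduced by hand, which helps when interpreting it. Because load_boston is not available in recent scikit-learn releases, the sketch below uses load_diabetes as a stand-in dataset; that substitution, the depth limit, and the variable names are assumptions made for illustration.

```python
from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Stand-in regression dataset (not the Boston data used in this section).
X, y = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=1)

reg = DecisionTreeRegressor(max_depth=3, random_state=1).fit(X_tr, y_tr)
y_pred = reg.predict(X_te)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = ((y_te - y_pred) ** 2).sum()
ss_tot = ((y_te - y_te.mean()) ** 2).sum()
manual_r2 = 1 - ss_res / ss_tot

r2 = r2_score(y_te, y_pred)
mse = mean_squared_error(y_te, y_pred)

print("R^2 by hand      : %.3f" % manual_r2)
print("R^2 via r2_score : %.3f" % r2)
print("R^2 via score()  : %.3f" % reg.score(X_te, y_te))
print("Test MSE         : %.3f" % mse)
```

An R^2 of 1 means perfect predictions, 0 means no better than always predicting the mean of the test targets, and negative values mean worse than that.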

### 11.4 - Finetuning Model By Doing Grid Search On Various Hyperparameters ###
DecisionTreeRegressor has the same hyperparameters as DecisionTreeClassifier, apart from criterion, whose options are regression-specific and which is left out of the grid here. Below we'll try various values for the above-mentioned hyperparameters to find the best estimator for our dataset using 3-fold cross-validation.

In [None]:
n_features = X.shape[1]
n_samples = X.shape[0]

grid = GridSearchCV(DecisionTreeRegressor(random_state=1), cv=3, n_jobs=-1, verbose=5,
                    param_grid ={
                    'max_depth': [None,1,2,3,4,5,6,7],
                    'max_features': [None, 'sqrt', 'auto', 'log2', 0.3,0.5,0.7, n_features//2, n_features//3, ],
                    'min_samples_split': [2,0.3,0.5, n_samples//2, n_samples//3, n_samples//5],
                    'min_samples_leaf':[1, 0.3,0.5, n_samples//2, n_samples//3, n_samples//5]},
                    )

grid.fit(X_train, Y_train)
print('Train R^2 Score : %.3f'%grid.best_estimator_.score(X_train, Y_train))
print('Test R^2 Score : %.3f'%grid.best_estimator_.score(X_test, Y_test))
print('Best R^2 Score Through Grid Search : %.3f'%grid.best_score_)
print('Best Parameters : ',grid.best_params_)

Fitting 3 folds for each of 2592 candidates, totalling 7776 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  14 tasks      | elapsed:    1.0s
[Parallel(n_jobs=-1)]: Done 2312 tasks      | elapsed:    4.9s


Train R^2 Score : 0.909
Test R^2 Score : 0.768
Best R^2 Score Through Grid Search : 0.780
Best Parameters :  {'max_depth': 5, 'max_features': 0.5, 'min_samples_leaf': 1, 'min_samples_split': 2}


[Parallel(n_jobs=-1)]: Done 7712 tasks      | elapsed:   13.2s
[Parallel(n_jobs=-1)]: Done 7776 out of 7776 | elapsed:   13.3s finished
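To look at the effect of tree depth on its own, the sketch below fits regressors of increasing max_depth and tabulates train and test R^2. As load_boston is unavailable in recent scikit-learn releases, it uses load_diabetes as a stand-in; the dataset, depth range, and names are illustrative assumptions, so the numbers will not match the Boston results above.

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Stand-in regression dataset (not the Boston data used in this section).
X, y = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=1)

rows = []
for depth in range(1, 9):
    reg = DecisionTreeRegressor(max_depth=depth, random_state=1).fit(X_tr, y_tr)
    rows.append((depth, reg.score(X_tr, y_tr), reg.score(X_te, y_te)))

print("depth  train R^2  test R^2")
for depth, train_r2, test_r2 in rows:
    print("%5d  %9.3f  %8.3f" % (depth, train_r2, test_r2))
```

Train R^2 keeps rising as the tree gets deeper, while test R^2 typically peaks at a moderate depth and then falls, which is the over-fitting pattern the grid search is meant to avoid.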


### 11.5 - Printing First Few Cross-Validation Results ###

In [None]:
cross_val_results = pd.DataFrame(grid.cv_results_)
print('Number of Various Combinations of Parameters Tried : %d'%len(cross_val_results))

cross_val_results.head() ## Printing first few results.

### 11.6 - Plotting Feature Importance ###

In [None]:
with plt.style.context(('seaborn', 'ggplot')):
    plt.figure(figsize=(12,8))
    plt.imshow(grid.best_estimator_.feature_importances_.reshape(1,-1), cmap=plt.cm.Blues, interpolation='nearest')
    plt.xticks(range(13), boston.feature_names)
    plt.yticks([])
    plt.grid(None)
    plt.colorbar();


### 11.7 - Visualizing Decision Tree Using GraphViz & PyDotPlus ###

In [None]:
dot_data = StringIO()
export_graphviz(grid.best_estimator_, out_file=dot_data,
                filled=True, rounded=True,
                special_characters=True,
                feature_names=boston.feature_names,)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())


## 12 - ExtraTreeRegressor ##
ExtraTreeRegressor, like ExtraTreeClassifier, is an extremely randomized decision tree, this time for regression problems. We'll follow the same process as in the previous examples to explain its usage.

### 12.1 - Fitting Model To Train Data ###


In [None]:
from sklearn.tree import ExtraTreeRegressor

extra_tree_regressor = ExtraTreeRegressor(random_state=1)
extra_tree_regressor.fit(X_train, Y_train)

### 12.2 - Evaluating Trained Model On Test Data ###

In [None]:
Y_preds = extra_tree_regressor.predict(X_test)

print(Y_preds[:10])
print(Y_test[:10])

print('Training R^2 Score : %.3f'%extra_tree_regressor.score(X_train, Y_train))
print('Test R^2 Score : %.3f'%extra_tree_regressor.score(X_test, Y_test))

### 12.3 - Finetuning Model By Doing Grid Search On Various Hyperparameters ###
ExtraTreeRegressor has the same hyperparameters as DecisionTreeRegressor. Below we'll try various values for the above-mentioned hyperparameters to find the best estimator for our dataset using 3-fold cross-validation.

In [None]:
n_features = X.shape[1]
n_samples = X.shape[0]

grid = GridSearchCV(ExtraTreeRegressor(random_state=1), cv=3, n_jobs=-1, verbose=5,
                    param_grid ={
                    'max_depth': [None,1,2,3,4,5,6,7],
                    'max_features': [None, 'sqrt', 'auto', 'log2', 0.3,0.5,0.7, n_features//2, n_features//3, ],
                    'min_samples_split': [2,0.3,0.5, n_samples//2, n_samples//3, n_samples//5],
                    'min_samples_leaf':[1, 0.3,0.5, n_samples//2, n_samples//3, n_samples//5]},
                    )

grid.fit(X_train, Y_train)
print('Train R^2 Score : %.3f'%grid.best_estimator_.score(X_train, Y_train))
print('Test R^2 Score : %.3f'%grid.best_estimator_.score(X_test, Y_test))
print('Best R^2 Score Through Grid Search : %.3f'%grid.best_score_)
print('Best Parameters : ',grid.best_params_)

Fitting 3 folds for each of 2592 candidates, totalling 7776 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  14 tasks      | elapsed:    1.0s
[Parallel(n_jobs=-1)]: Done 4358 tasks      | elapsed:    7.8s


Train R^2 Score : 0.907
Test R^2 Score : 0.780
Best R^2 Score Through Grid Search : 0.707
Best Parameters :  {'max_depth': 7, 'max_features': 0.7, 'min_samples_leaf': 1, 'min_samples_split': 2}


[Parallel(n_jobs=-1)]: Done 7776 out of 7776 | elapsed:   12.7s finished


### 12.4 - Printing First Few Cross-Validation Results ###

In [None]:
cross_val_results = pd.DataFrame(grid.cv_results_)
print('Number of Various Combinations of Parameters Tried : %d'%len(cross_val_results))

cross_val_results.head() ## Printing first few results.

### 12.5 - Plotting Feature Importance ###

In [None]:
print("Feature Importance : %s"%str(grid.best_estimator_.feature_importances_))


In [None]:
with plt.style.context(('seaborn', 'ggplot')):
    plt.figure(figsize=(12,8))
    plt.imshow(grid.best_estimator_.feature_importances_.reshape(1,-1), cmap=plt.cm.Blues, interpolation='nearest')
    plt.xticks(range(13), boston.feature_names)
    plt.yticks([])
    plt.grid(None)
    plt.colorbar();


References:
- Decision Trees: https://coderzcolumn.com/tutorials/machine-learning/scikit-learn-sklearn-decision-trees