# Decision Trees

In [14]:
import numpy as np
np.random.seed(1)

import warnings
warnings.filterwarnings("ignore")

## Classification with DTs

First we'll load the famous *iris* dataset, dealing with plant classification:

In [15]:
from sklearn.datasets import load_iris
iris = load_iris()

Let's look inside of it to see what datatypes scikit-learn wants, and how their sample dataset is formatted, so that we can prepare our own datasets later:

In [16]:
iris.keys()

dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names'])

So the dataset is a dictionary-like object, and we can access the data, labels, and description by key or by attribute:

In [17]:
print(iris.DESCR)

Iris Plants Database

Notes
-----
Data Set Characteristics:
    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
    :Summary Statistics:

                    Min  Max   Mean    SD   Class Correlation
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20  0.76     0.9565  (high!)

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :Date: July, 1988

This is a copy of UCI ML iris d

So what are the features, and what are we predicting?

In [18]:
print(iris.feature_names)
print(len(iris.feature_names))
print()
print(iris.target_names)
print(len(iris.target_names))

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
4

['setosa' 'versicolor' 'virginica']
3


So we have 4 features per observation, and we are trying to classify each observation into one of three categories using only those 4 features. How are these input features formatted?

In [19]:
print(len(iris.data))
print(type(iris.data))


150
<class 'numpy.ndarray'>


We have a numpy array of length 150, one row per observation, and each row is itself a numpy array of length 4, one value per feature. Each row *must* line up with the order of the variables, and every row must use the same order. **ORDER MATTERS**.
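The layout described above can be checked directly. The following is a minimal sketch that reloads the dataset so it runs on its own; `first_row` is a name introduced here purely for illustration.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# iris.data is a 2-D array: one row per observation, one column per feature.
print(iris.data.shape)

# Pair each value in the first row with its feature name. This only works
# because the column order matches the order of iris.feature_names.
first_row = dict(zip(iris.feature_names, iris.data[0]))
for name, value in first_row.items():
    print(name, value)
```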

What about the prediction?

In [20]:
print(len(iris.target))
print(type(iris.target))
iris.target

150
<class 'numpy.ndarray'>


array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

Again, we have 150 observations, but *no* subarrays. The target data is one-dimensional. Order matters here as well: each label should correspond to the observation at the same index in the data array.
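Because the labels are plain integers, they can be used to index `target_names` and recover the species names. A small sketch of that lookup, assuming nothing beyond the dataset itself (`species` and `class_counts` are illustrative names):

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# The target is a flat array with one integer label per row of iris.data.
print(iris.target.shape)

# Integer labels index into target_names, so fancy indexing recovers the names.
species = iris.target_names[iris.target]
print(species[0], species[50], species[100])

# Number of observations in each class.
class_counts = np.bincount(iris.target)
print(dict(zip(iris.target_names, class_counts)))
```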

Now we split the data into training and testing:

In [21]:
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed

X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target,
                                                    train_size=0.75, test_size=0.25)
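The split above is random, so which rows land in each set depends on NumPy's global random state. A minimal sketch of a split that is reproducible on its own follows; `random_state=1` is an arbitrary choice, and `stratify` is an addition that the cell above does not use.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target,
    test_size=0.25,
    random_state=1,        # arbitrary seed, chosen only for reproducibility
    stratify=iris.target,  # keep class proportions similar in both sets
)

print(X_train.shape, X_test.shape)
print(np.bincount(y_train), np.bincount(y_test))
```

Stratifying matters more for small or imbalanced datasets, where a purely random split can leave a class under-represented in the test set.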

After the train/test split, scikit-learn makes the rest of the process easy. We just have to decide on our parameters:

In [22]:
from sklearn import tree

dt_classifier = tree.DecisionTreeClassifier(criterion='gini',  # or 'entropy' for information gain
                       splitter='best',  # or 'random' to pick the best of randomly drawn splits
                       max_depth=None,  # maximum depth of the tree (None = no limit)
                       min_samples_split=2,  # samples needed to split an internal node
                       min_samples_leaf=1,  # samples needed at a leaf
                       min_weight_fraction_leaf=0.0,  # fraction of total sample weight needed at a leaf
                       max_features=None,  # number of features to consider when splitting
                       max_leaf_nodes=None,  # maximum number of leaf nodes
                       min_impurity_decrease=0.0)  # early stopping (replaces the deprecated min_impurity_split)

model = dt_classifier.fit(X_train, y_train)
print(model.score(X_test, y_test))

0.9736842105263158
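For a classifier, `score` reports mean accuracy, and the value above is consistent with 37 of 38 test rows being classified correctly. The sketch below recomputes accuracy by hand to make that concrete. It rebuilds its own split with an arbitrary `random_state=1`, so its numbers need not match the output above.

```python
import numpy as np
from sklearn import tree
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=1)  # arbitrary seed

clf = tree.DecisionTreeClassifier(random_state=1).fit(X_train, y_train)

# score() on a classifier is the fraction of test rows predicted correctly.
predictions = clf.predict(X_test)
manual_accuracy = np.mean(predictions == y_test)

print(manual_accuracy, clf.score(X_test, y_test))
```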


### Grid Search

In [24]:
from sklearn.model_selection import GridSearchCV

param_grid = {'min_samples_split': range(2,10),
              'min_samples_leaf': range(1,10)}

model_c = GridSearchCV(tree.DecisionTreeClassifier(), param_grid)
model_c.fit(X_train, y_train)

best_index = np.argmax(model_c.cv_results_["mean_test_score"])

print(model_c.cv_results_["params"][best_index])
print(max(model_c.cv_results_["mean_test_score"]))
print(model_c.score(X_test, y_test))

{'min_samples_leaf': 1, 'min_samples_split': 3}
0.9375
0.9736842105263158
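The `argmax` lookup on `cv_results_` can also be read from attributes that `GridSearchCV` sets after fitting. A minimal sketch, assuming 5-fold cross-validation and an arbitrary `random_state=1` (neither is set in the cell above):

```python
import numpy as np
from sklearn import tree
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=1)  # arbitrary seed

param_grid = {'min_samples_split': range(2, 10),
              'min_samples_leaf': range(1, 10)}

grid = GridSearchCV(tree.DecisionTreeClassifier(random_state=1),
                    param_grid, cv=5)  # cv=5 is an assumed fold count
grid.fit(X_train, y_train)

# Same information as the argmax lookup on cv_results_.
print(grid.best_params_)
print(grid.best_score_)

# By default the best setting is refit on all of X_train, and score()
# delegates to that refit estimator.
print(grid.best_estimator_.score(X_test, y_test))
```

Note that `best_score_` is a cross-validated mean over the training data, while the last line scores held-out test data, so the two numbers are not expected to match.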


## Regression with DTs

### Dataset and prep

For demonstration, we will use the Boston housing dataset, which comes with scikit-learn (`load_boston` was removed in version 1.2, so this requires an older release):

In [25]:
from sklearn.datasets import load_boston

boston = load_boston()

If you are going to follow along in other tutorials in the scikit-learn documentation, you will need to know the data structures used as inputs to the models. Let's see what's in the Boston dataset:

In [26]:
boston.keys()

dict_keys(['data', 'target', 'feature_names', 'DESCR'])

The description will tell us more about the dataset:

In [27]:
boston.DESCR



So we are working on predicting the median value of a home from 506 observations and 13 covariates, including crime rate, lot size, industry/commercial proportion, presence of the Charles River, nitric oxide concentration, rooms per dwelling, units built before 1940, distance to employment centers, access to highways, tax rate, school proxy, black population, and status. To get the variable names we can ask for them in the dictionary:

In [28]:
print(boston.feature_names)
print()
print(type(boston.feature_names))
print()
print(len(boston.feature_names))

['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
 'B' 'LSTAT']

<class 'numpy.ndarray'>

13


We see the input is a numpy array of strings for the variable labels. To get the variable data, we ask the dictionary for the data:

In [29]:
print(boston.data)
print()
print(type(boston.data))
print()
print(len(boston.data))

[[6.3200e-03 1.8000e+01 2.3100e+00 ... 1.5300e+01 3.9690e+02 4.9800e+00]
 [2.7310e-02 0.0000e+00 7.0700e+00 ... 1.7800e+01 3.9690e+02 9.1400e+00]
 [2.7290e-02 0.0000e+00 7.0700e+00 ... 1.7800e+01 3.9283e+02 4.0300e+00]
 ...
 [6.0760e-02 0.0000e+00 1.1930e+01 ... 2.1000e+01 3.9690e+02 5.6400e+00]
 [1.0959e-01 0.0000e+00 1.1930e+01 ... 2.1000e+01 3.9345e+02 6.4800e+00]
 [4.7410e-02 0.0000e+00 1.1930e+01 ... 2.1000e+01 3.9690e+02 7.8800e+00]]

<class 'numpy.ndarray'>

506


The data is a numpy array, inside of which there is a separate array for each observation (506 arrays, one per house, *not* 13 arrays, one per variable). Each inner array *must* line up with the order of the variables, and every array must use the same order. **ORDER MATTERS**.

The target, or *y*, is accessed in the dictionary as well:

In [30]:
print(boston.target)
print()
print(type(boston.target))
print()
print(len(boston.target))

[24.  21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 15.  18.9 21.7 20.4
 18.2 19.9 23.1 17.5 20.2 18.2 13.6 19.6 15.2 14.5 15.6 13.9 16.6 14.8
 18.4 21.  12.7 14.5 13.2 13.1 13.5 18.9 20.  21.  24.7 30.8 34.9 26.6
 25.3 24.7 21.2 19.3 20.  16.6 14.4 19.4 19.7 20.5 25.  23.4 18.9 35.4
 24.7 31.6 23.3 19.6 18.7 16.  22.2 25.  33.  23.5 19.4 22.  17.4 20.9
 24.2 21.7 22.8 23.4 24.1 21.4 20.  20.8 21.2 20.3 28.  23.9 24.8 22.9
 23.9 26.6 22.5 22.2 23.6 28.7 22.6 22.  22.9 25.  20.6 28.4 21.4 38.7
 43.8 33.2 27.5 26.5 18.6 19.3 20.1 19.5 19.5 20.4 19.8 19.4 21.7 22.8
 18.8 18.7 18.5 18.3 21.2 19.2 20.4 19.3 22.  20.3 20.5 17.3 18.8 21.4
 15.7 16.2 18.  14.3 19.2 19.6 23.  18.4 15.6 18.1 17.4 17.1 13.3 17.8
 14.  14.4 13.4 15.6 11.8 13.8 15.6 14.6 17.8 15.4 21.5 19.6 15.3 19.4
 17.  15.6 13.1 41.3 24.3 23.3 27.  50.  50.  50.  22.7 25.  50.  23.8
 23.8 22.3 17.4 19.1 23.1 23.6 22.6 29.4 23.2 24.6 29.9 37.2 39.8 36.2
 37.9 32.5 26.4 29.6 50.  32.  29.8 34.9 37.  30.5 36.4 31.1 29.1 50.
 33.3 3

The target array has only one dimension, lined up in order with the observations in the data array.
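Since rows and targets are matched purely by position, any reordering has to be applied to both arrays together. The sketch below uses a hypothetical four-row toy dataset (not the housing data) in which each target is the sum of its row, so the alignment is easy to verify.

```python
import numpy as np

# Hypothetical toy data: 4 observations, 2 features, one target per row.
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0],
              [4.0, 40.0]])
y = np.array([11.0, 22.0, 33.0, 44.0])  # here y is the sum of each row

rng = np.random.RandomState(1)   # arbitrary seed
order = rng.permutation(len(X))  # one shared permutation

# Index both arrays with the same permutation so row i still matches target i.
X_shuffled, y_shuffled = X[order], y[order]

print(np.allclose(X_shuffled.sum(axis=1), y_shuffled))
```

Shuffling `X` and `y` with two separate permutations would silently break the pairing, which is why `train_test_split` takes both arrays in a single call.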

Now that we're familiar with the input data, we need to split it up for training and testing:

In [31]:
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed

X_train, X_test, y_train, y_test = train_test_split(boston.data, boston.target,
                                                    train_size=0.75, test_size=0.25)

Now we have 75% of the data as training data, and 25% of the data as testing data:

In [32]:
print(len(X_train), len(y_train))
print()
print(len(X_test), len(y_test))

379 379

127 127
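Neither 506 × 0.75 = 379.5 nor 506 × 0.25 = 126.5 is a whole number, so the split has to round. The sizes printed above are consistent with the training size being rounded down and the test size being rounded up. A small sketch that reproduces the sizes on a stand-in array of 506 row indices (`rows` is illustrative, and the seed is arbitrary):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the 506 rows; only the row count matters for the split sizes.
rows = np.arange(506)

train_rows, test_rows = train_test_split(
    rows, train_size=0.75, test_size=0.25, random_state=1)  # arbitrary seed

print(len(train_rows), len(test_rows))

# No row lands in both sets, and none is dropped.
print(len(set(train_rows) | set(test_rows)))
```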


In scikit-learn, as soon as you have `X_train`, `X_test`, `y_train`, and `y_test`, everything else is a matter of choosing a model and its parameters. This should not be trivialized: selecting a model and that model's parameters is *very* important. While we will not cover model selection here, you should always choose the model and parameters best suited to your data.

### Decision Tree Regression

In [33]:
from sklearn import tree

dt_reg = tree.DecisionTreeRegressor(criterion='mse',  # split quality: mean squared error ('squared_error' in scikit-learn >= 1.0)
                                    splitter='best',  # or 'random' to pick the best of randomly drawn splits
                                    max_depth=None,  # maximum depth of the tree (None = no limit)
                                    min_samples_split=2,  # samples needed to split an internal node
                                    min_samples_leaf=1,  # samples needed at a leaf
                                    min_weight_fraction_leaf=0.0,  # fraction of total sample weight needed at a leaf
                                    max_features=None,  # number of features to consider when splitting
                                    max_leaf_nodes=None,  # maximum number of leaf nodes
                                    min_impurity_decrease=0.0)  # early stopping (replaces the deprecated min_impurity_split)

model = dt_reg.fit(X_train, y_train)
print(model.score(X_test, y_test))

0.857589194231606
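For a regressor, `score` returns the coefficient of determination, R², rather than accuracy. The sketch below recomputes it by hand. It uses synthetic data from `make_regression` as a stand-in, which is an assumption made so the example runs on its own, so its value will differ from the one above.

```python
import numpy as np
from sklearn import tree
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the housing data used above).
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1)

reg = tree.DecisionTreeRegressor(random_state=1).fit(X_train, y_train)
predictions = reg.predict(X_test)

# R^2 = 1 - (residual sum of squares) / (total sum of squares).
ss_res = np.sum((y_test - predictions) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
manual_r2 = 1.0 - ss_res / ss_tot

print(manual_r2, reg.score(X_test, y_test))
```

An R² of 1 means perfect predictions, 0 means no better than always predicting the mean of `y_test`, and negative values are possible for a model that does worse than that.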


In [34]:
print(model.decision_path(X_train))

  (0, 0)	1
  (0, 634)	1
  (0, 635)	1
  (0, 636)	1
  (0, 637)	1
  (0, 659)	1
  (0, 660)	1
  (0, 661)	1
  (0, 662)	1
  (0, 663)	1
  (0, 664)	1
  (0, 665)	1
  (1, 0)	1
  (1, 1)	1
  (1, 2)	1
  (1, 40)	1
  (1, 42)	1
  (1, 332)	1
  (1, 340)	1
  (1, 341)	1
  (1, 351)	1
  (1, 352)	1
  (1, 353)	1
  (1, 354)	1
  (1, 355)	1
  :	:
  (377, 1)	1
  (377, 387)	1
  (377, 487)	1
  (377, 488)	1
  (377, 489)	1
  (377, 491)	1
  (377, 505)	1
  (377, 525)	1
  (377, 526)	1
  (377, 542)	1
  (377, 543)	1
  (377, 544)	1
  (378, 0)	1
  (378, 1)	1
  (378, 2)	1
  (378, 40)	1
  (378, 42)	1
  (378, 43)	1
  (378, 44)	1
  (378, 46)	1
  (378, 47)	1
  (378, 48)	1
  (378, 49)	1
  (378, 53)	1
  (378, 54)	1
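The printout above is a sparse indicator matrix: each `(i, j)` pair with value 1 means that training sample `i` passes through tree node `j`. A minimal sketch of reading such a matrix follows, using a shallow tree on synthetic stand-in data so the paths stay short. The data, `max_depth=3`, and the seed are all assumptions made for illustration.

```python
import numpy as np
from sklearn import tree
from sklearn.datasets import make_regression

# Synthetic stand-in data (an assumption; not the housing data used above).
X, y = make_regression(n_samples=50, n_features=3, noise=5.0, random_state=1)
reg = tree.DecisionTreeRegressor(max_depth=3, random_state=1).fit(X, y)

# Sparse indicator matrix: entry (i, j) is 1 if sample i passes through node j.
indicator = reg.decision_path(X)
print(indicator.shape, reg.tree_.node_count)

# Node ids visited by the first sample, from the root (node 0) to its leaf.
first_path = indicator[0].indices
print(first_path)

# Number of nodes on each sample's path.
path_lengths = np.asarray(indicator.sum(axis=1)).ravel()
print(path_lengths.max())
```

The last node on each path is the leaf that produces the prediction, which is the same node id that `reg.apply(X)` returns for that sample.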


### Grid Search

In [35]:
param_grid = {'min_samples_split': range(2,10),
              'min_samples_leaf': range(1,10)}

model_r = GridSearchCV(tree.DecisionTreeRegressor(), param_grid)
model_r.fit(X_train, y_train)

best_index = np.argmax(model_r.cv_results_["mean_test_score"])

print(model_r.cv_results_["params"][best_index])
print(max(model_r.cv_results_["mean_test_score"]))
print(model_r.score(X_test, y_test))

{'min_samples_leaf': 5, 'min_samples_split': 4}
0.7514875373713642
0.8121224475342722
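Looking only at the single best setting hides how close the runners-up are. The sketch below lists the three best-ranked settings with their cross-validated mean and standard deviation. It runs on synthetic stand-in data with an assumed `cv=5` and an arbitrary seed, so its numbers are not comparable to the output above.

```python
import numpy as np
from sklearn import tree
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption; not the housing data used above).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=1)

param_grid = {'min_samples_split': range(2, 10),
              'min_samples_leaf': range(1, 10)}
grid = GridSearchCV(tree.DecisionTreeRegressor(random_state=1),
                    param_grid, cv=5)  # cv=5 is an assumed fold count
grid.fit(X, y)

results = grid.cv_results_
top = np.argsort(results["rank_test_score"])[:3]  # three best-ranked settings

for i in top:
    print(results["params"][i],
          round(results["mean_test_score"][i], 4),
          round(results["std_test_score"][i], 4))
```

Settings that differ only in `min_samples_split` can tie, because scikit-learn raises the effective split threshold to at least twice `min_samples_leaf`.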
