# Decision Trees

## Classification with RFs

First we'll load the famous *iris* dataset, dealing with plant classification:

In [None]:
from sklearn.datasets import load_iris
iris = load_iris()

Let's look inside it to see what data types scikit-learn expects and how its sample dataset is formatted, so that we can prepare our own datasets later:

In [None]:
iris.keys()

So the dataset is a dictionary-like object, and we can access the data, the labels, and a description of the dataset by key:

In [None]:
print(iris.DESCR)

So what are the features, and what are we predicting?

In [None]:
print(iris.feature_names)
print(len(iris.feature_names))
print()
print(iris.target_names)
print(len(iris.target_names))

So we are using 4 features for each observation, trying to classify each observation into one of three categories. How are these input features formatted?

In [None]:
print(len(iris.data))
print(type(iris.data))
iris.data

We have a numpy array of length 150, one entry per observation, and each observation is itself an array of length 4, one value per feature. The values in each inner array *must* line up with the order of the feature names *and* with every other inner array. **ORDER MATTERS**.
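
To prepare our own data in the same layout, a minimal sketch might look like the following. The measurements and the `records` name are made up for illustration; the point is only that every row reads the features in one fixed order:

```python
import numpy as np

# Hypothetical measurements; the column order is fixed once and reused for every row.
feature_names = ["sepal length (cm)", "sepal width (cm)",
                 "petal length (cm)", "petal width (cm)"]
records = [
    {"sepal length (cm)": 5.1, "sepal width (cm)": 3.5,
     "petal length (cm)": 1.4, "petal width (cm)": 0.2},
    {"petal width (cm)": 1.3, "petal length (cm)": 4.0,  # keys in a different order
     "sepal width (cm)": 2.8, "sepal length (cm)": 6.1},
]

# Build one row per observation, always reading the features in the same order.
X = np.array([[rec[name] for name in feature_names] for rec in records])

print(X.shape)  # (n_observations, n_features)
print(X[1])     # second observation, columns ordered as in feature_names
```

Looking each value up by name, rather than trusting the order in which it was recorded, is one way to keep the columns aligned.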

What about the prediction?

In [None]:
print(len(iris.target))
print(type(iris.target))
iris.target

Again, we have 150 observations, but *no* sub-arrays: the target data is one-dimensional. Order matters here as well, since each label must sit at the same index as its observation in the data array.
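
As a quick illustration of that alignment, the sketch below pairs a few rows of `iris.data` with the label at the same index in `iris.target`. The choice of indices is arbitrary:

```python
from sklearn.datasets import load_iris

iris = load_iris()

# Row i of iris.data is labeled by element i of iris.target.
for i in (0, 50, 100):
    features = iris.data[i]
    label = iris.target_names[iris.target[i]]  # integer code -> class name
    print(i, features, label)

# Both arrays must describe the same number of observations.
n_rows = iris.data.shape[0]
n_labels = iris.target.shape[0]
print(n_rows, n_labels)
```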

Now we split the data into training and testing:

In [None]:
from sklearn.model_selection import train_test_split  # sklearn.cross_validation in very old releases

X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target,
                                                    train_size=0.75, test_size=0.25)
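
The call above shuffles differently on every run. If a repeatable split is wanted, one option (a sketch, not the only reasonable setup) is to pass `random_state`, and optionally `stratify` so that the class proportions stay similar in both parts. The variable names here are shortened only to avoid clobbering the ones above:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

# random_state fixes the shuffle; stratify keeps class proportions similar in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0, stratify=iris.target)

print(X_tr.shape, X_te.shape)
print(len(y_tr), len(y_te))
```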

After the train/test split, scikit-learn makes the rest of the process easy. We just have to decide on our parameters:

In [None]:
from sklearn import ensemble, tree

rf_classifier = ensemble.RandomForestClassifier(n_estimators=10,  # number of trees
                                                criterion='gini',  # or 'entropy' for information gain
                                                max_depth=None,  # maximum depth of each tree (None = unlimited)
                                                min_samples_split=2,  # samples needed to split a node
                                                min_samples_leaf=1,  # samples needed at a leaf
                                                min_weight_fraction_leaf=0.0,  # weighted fraction of samples needed at a leaf
                                                max_features='sqrt',  # features considered per split ('auto' in older releases)
                                                max_leaf_nodes=None,  # maximum number of leaf nodes
                                                min_impurity_decrease=0.0,  # impurity decrease needed to split (replaces min_impurity_split)
                                                n_jobs=1,  # CPUs to use
                                                random_state=10,  # random seed
                                                class_weight="balanced")  # weights inversely proportional to class frequency; also "balanced_subsample" or None

model = rf_classifier.fit(X_train, y_train)
print(model.score(X_test, y_test))
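
For a classifier, `score` reports mean accuracy: the fraction of test observations whose predicted label matches the true one. The sketch below recomputes that by hand on a fresh split, with a smaller set of parameters than above, to show the two numbers agree:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0)

clf = RandomForestClassifier(n_estimators=10, random_state=10).fit(X_tr, y_tr)

predictions = clf.predict(X_te)
manual_accuracy = np.mean(predictions == y_te)  # fraction predicted correctly
library_accuracy = clf.score(X_te, y_te)        # same quantity, computed by scikit-learn

print(manual_accuracy)
print(library_accuracy)
```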

In [None]:
print(model.decision_path(X_test)[0])
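
For a forest, `decision_path` returns two things: a sparse indicator matrix with one row per sample and one column per node across all trees, and an array of offsets marking where each tree's columns start. The following sketch, fitted on the full iris data purely for illustration, pulls out the nodes that the first sample visits in the first tree:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
forest = RandomForestClassifier(n_estimators=10, random_state=10)
forest.fit(iris.data, iris.target)

# indicator: sparse matrix, one row per sample, one column per node across all trees
# n_nodes_ptr: tree t owns columns n_nodes_ptr[t]:n_nodes_ptr[t + 1]
indicator, n_nodes_ptr = forest.decision_path(iris.data[:5])

print(indicator.shape)
print(n_nodes_ptr)

# Restrict to the columns of the first tree, then list the nodes visited by sample 0.
first_tree = indicator[:, n_nodes_ptr[0]:n_nodes_ptr[1]]
rows, cols = first_tree.nonzero()
visited = cols[rows == 0]
print(visited)
```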

## Regression with RFs

### Dataset and prep

For demonstration, we will use the Boston housing dataset, which shipped with scikit-learn until `load_boston` was removed in version 1.2, so this section needs an earlier release:

In [None]:
from sklearn.datasets import load_boston

boston = load_boston()

If you are going to follow along with other tutorials in the scikit-learn documentation, you will need to know the data structures used as inputs to the models. Let's see what's in the Boston dataset:

In [None]:
boston.keys()

The description will tell us more about the dataset:

In [None]:
print(boston.DESCR)

So we are predicting the median value of a home from 506 observations and 13 covariates: crime rate, lot size, industry/commercial proportion, presence of the Charles River, nitric oxide concentration, rooms per dwelling, units built before 1940, distance to employment centers, access to highways, tax rate, school proxy, black population, and status. To get the variable names we can ask for them in the dictionary:

In [None]:
print(boston.feature_names)
print()
print(type(boston.feature_names))
print()
print(len(boston.feature_names))

We see the input is a numpy array of strings for the variable labels. To get the variable data, we ask the dictionary for the data:

In [None]:
print(boston.data)
print()
print(type(boston.data))
print()
print(len(boston.data))

The data is a numpy array, inside of which there is a separate array for each observation (506 inner arrays, one per house, *not* 13 arrays, one per variable). The values in each inner array *must* line up with the order of the variables *and* with every other inner array. **ORDER MATTERS**.
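
Because the columns follow the order of the feature names, a single variable can be pulled out by looking up its position first. The sketch below uses a tiny stand-in array with invented values and only three of the variable names, not the real dataset:

```python
import numpy as np

# Toy stand-in for the layout described above; the values are invented.
feature_names = np.array(["CRIM", "RM", "AGE"])
data = np.array([[0.02, 6.5, 65.2],
                 [0.03, 6.4, 78.9],
                 [0.07, 7.2, 61.1]])

# Find which column holds "RM", then slice that column for every observation.
rm_index = int(np.where(feature_names == "RM")[0][0])
rooms = data[:, rm_index]

print(rm_index)
print(rooms)
```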

The target, or *y* is accessed in the dictionary as well:

In [None]:
print(boston.target)
print()
print(type(boston.target))
print()
print(len(boston.target))

The target array has only one dimension, lined up in order with the observations in the data array.

Now that we're familiar with the input data, we need to split it up for training and testing:

In [None]:
from sklearn.model_selection import train_test_split  # sklearn.cross_validation in very old releases

X_train, X_test, y_train, y_test = train_test_split(boston.data, boston.target,
                                                    train_size=0.75, test_size=0.25)

Now we have 75% of the data as training data, and 25% of the data as testing data:

In [None]:
print(len(X_train), len(y_train))
print()
print(len(X_test), len(y_test))
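
Since 506 does not divide evenly into quarters, the two parts cannot be exactly 75% and 25%. The arithmetic below assumes that scikit-learn rounds the test count up and the training count down, which is our understanding of its behavior rather than something shown in this notebook:

```python
import math

n_samples = 506
train_fraction, test_fraction = 0.75, 0.25

# Assumed rounding rule: the test count is rounded up, the training count down.
n_test = math.ceil(test_fraction * n_samples)
n_train = math.floor(train_fraction * n_samples)

print(n_train, n_test, n_train + n_test)
```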

In scikit-learn, as soon as you have `X_train`, `X_test`, `y_train`, and `y_test`, everything else is a matter of choosing parameters for whichever model you pick. This should not be trivialized: selecting a model and its parameters is *very* important. While we will not cover it here, you should always select the model and parameters best suited to your data.

### Random Forest Regression

In [None]:
from sklearn import ensemble

rf_reg = ensemble.RandomForestRegressor(n_estimators=10,  # number of trees
                                        criterion='mse',  # how to measure fit (named 'squared_error' from scikit-learn 1.0)
                                        max_depth=None,  # maximum depth of each tree (None = unlimited)
                                        min_samples_split=2,  # samples needed to split a node
                                        min_samples_leaf=1,  # samples needed at a leaf
                                        min_weight_fraction_leaf=0.0,  # weighted fraction of samples needed at a leaf
                                        max_features=1.0,  # fraction of features considered per split (1.0 = all, the old 'auto')
                                        max_leaf_nodes=None,  # maximum number of leaf nodes
                                        random_state=10,  # random seed
                                        n_jobs=1)  # how many jobs to run in parallel

model = rf_reg.fit(X_train, y_train)
print(model.score(X_test, y_test))
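
For a regressor, `score` reports the coefficient of determination, R², rather than accuracy. A minimal sketch of that quantity follows. The prices are invented and `r_squared` is a helper written for this illustration, not a scikit-learn function:

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # error left after prediction
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of always predicting the mean
    return 1.0 - ss_res / ss_tot

# Invented prices (in $1000s), only to exercise the function.
y_true = [24.0, 21.6, 34.7, 33.4]
y_pred = [25.0, 20.0, 33.0, 35.0]

print(r_squared(y_true, y_pred))
print(r_squared(y_true, y_true))                 # perfect predictions
print(r_squared(y_true, [np.mean(y_true)] * 4))  # always predicting the mean
```

A value of 1 means perfect predictions, a value near 0 means the model does no better than predicting the mean, and negative values are possible for models that do worse than that.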

In [None]:
print(model.decision_path(X_test)[0])