# Basics of scikit-learn with Decision Trees

All Rights Reserved © <a href="http://www.louisdorard.com" style="color: #6D00FF;">Louis Dorard</a>

<img src="http://s3.louisdorard.com.s3.amazonaws.com/ML_icon.png">


In this notebook we present some basics of scikit-learn: preparing training data, fitting a model, making predictions with it, and exporting it.

We illustrate this with Decision Tree classification and also show some specific features related to tree models.

## 1. Loading data

### Reading from CSV

We use the Pandas library to easily load CSV files:

In [None]:
from pandas import read_csv
path = "https://oml-data.s3.amazonaws.com/" # load data from http location; you can also load from local path
data = read_csv(path + "iris.csv")
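
`read_csv` accepts a URL, a local path, or any file-like object. Here is a minimal sketch using a small made-up CSV held in memory; the column names and rows are placeholders, not the actual contents of `iris.csv`:

```python
from io import StringIO
from pandas import read_csv

# Hypothetical CSV content, standing in for a file on disk or at a URL.
csv_text = """SepalLength,SepalWidth,PetalLength,PetalWidth,Name
5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
"""

# read_csv treats the first line as the header row by default.
sample = read_csv(StringIO(csv_text))
print(sample.shape)
```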

### Inspecting data

I recommend inspecting the data in a spreadsheet program and in a data visualization tool. Pandas can also be used to some extent. Here's a quick way to make sure the data was read correctly:

In [None]:
data.head()
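
Beyond `head()`, a few other Pandas calls give a quick sanity check. The sketch below runs them on a small made-up frame (`sample` is a hypothetical stand-in for `data`, with invented values):

```python
import pandas as pd

# Hypothetical stand-in for the `data` frame loaded above.
sample = pd.DataFrame({
    "PetalLength": [1.4, 4.7, 6.0, 5.1],
    "PetalWidth": [0.2, 1.4, 2.5, 1.9],
    "Name": ["Iris-setosa", "Iris-versicolor",
             "Iris-virginica", "Iris-virginica"],
})

print(sample.shape)                   # (number of rows, number of columns)
print(sample.dtypes)                  # column types: numeric vs. text
print(sample.isna().sum())            # missing values per column
print(sample["Name"].value_counts())  # number of examples per class
```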

### Define inputs and outputs

Pandas uses its own data structures called "data frames". We need to...
* select what constitutes our inputs and outputs in the data frame
* convert these to NumPy arrays (which scikit can work with) via the `.values` attribute.

Let's start with outputs. The usual convention is to store them in a variable named `y`.

In [None]:
target_column = 'Name'
outputs = data[target_column]
y = outputs.values
print(y)

And now the inputs, which we call `X`.

In [None]:
features = data.drop(target_column, axis=1)
X = features.values
print(X)
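
It can help to check the type and shape of what you get: `X` should be a 2-D array with one row per example, and `y` a 1-D array of the same length. A small sketch on a made-up frame (the names `sample`, `X_sample` and `y_sample` are only for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame with two input columns and one target column.
sample = pd.DataFrame({
    "PetalLength": [1.4, 4.7, 6.0],
    "PetalWidth": [0.2, 1.4, 2.5],
    "Name": ["Iris-setosa", "Iris-versicolor", "Iris-virginica"],
})

target_column = "Name"
y_sample = sample[target_column].values               # 1-D array of labels
X_sample = sample.drop(target_column, axis=1).values  # 2-D array of inputs

print(type(X_sample), X_sample.shape)
print(type(y_sample), y_sample.shape)
```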

## 2. Initializing an estimator

* Implementations of learning algorithms reside in _estimator_ objects in scikit.
* An estimator is an object that can "fit" a model to data. More info [here](http://scikit-learn.org/stable/developers/contributing.html#apis-of-scikit-learn-objects).
* The estimator's (hyper)parameters are set upon initialization — explicitly, or implicitly to default values.
* We usually store estimators in a variable named `model`, but the actual model is "empty" at first, since no data has been seen yet.

Here is how to initialize a DecisionTreeClassifier estimator:

In [None]:
from sklearn import tree
model = tree.DecisionTreeClassifier(max_depth=None)

* Check out the other arguments that this constructor can take, from the inline documentation (via Shift + Tab) or the [online documentation](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier)
* Scikit's online documentation on [trees](http://scikit-learn.org/stable/modules/tree.html) also contains practical tips and theoretical explanations
* Also try using a kNN model, with `neighbors.KNeighborsClassifier`
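
As a rough sketch of the points above: both estimators expose their hyperparameters through `get_params()` and share the same `fit`/`predict` interface, so swapping one for the other only changes the initialization line. The toy data and the choice `n_neighbors=3` are assumptions made for this illustration:

```python
from sklearn import neighbors, tree

tree_model = tree.DecisionTreeClassifier(max_depth=None)
knn_model = neighbors.KNeighborsClassifier(n_neighbors=3)

# get_params() lists every hyperparameter, including those left at defaults.
print(tree_model.get_params()["max_depth"])
print(knn_model.get_params()["n_neighbors"])

# Tiny made-up dataset: one feature, two well-separated classes.
X_toy = [[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]]
y_toy = ["a", "a", "a", "b", "b", "b"]

# Same interface for both estimators.
for estimator in (tree_model, knn_model):
    estimator.fit(X_toy, y_toy)
    print(type(estimator).__name__, estimator.predict([[0.05], [1.15]]))
```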

## 3. Learning a model and predicting

It's time to actually train the model from the training inputs and outputs:

In [None]:
model = model.fit(X, y)

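Let's make two predictions with our new model. Each row of `new_x` is one input, with its four feature values in the same order as the columns of `X`: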

In [None]:
new_x = [ [1.2,  3.0,  5.4,  4.2], [1.2,  3.0,  5.4,  4.2] ]
model.predict(new_x)
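
Classifiers in scikit can also return class probabilities with `predict_proba`, whose columns follow the order of `classes_`. A minimal sketch on made-up two-feature data (the values and the name `clf` are assumptions for this example, not the iris model above):

```python
from sklearn import tree

# Made-up training data: two features, three classes.
X_toy = [[1.0, 0.2], [1.5, 0.3], [4.5, 1.4],
         [4.8, 1.5], [6.0, 2.3], [5.8, 2.1]]
y_toy = ["setosa", "setosa", "versicolor",
         "versicolor", "virginica", "virginica"]

clf = tree.DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)

new_x = [[1.2, 0.25], [5.9, 2.2]]
print(clf.classes_)              # column order used by predict_proba
print(clf.predict(new_x))        # one label per input row
print(clf.predict_proba(new_x))  # one row of class probabilities per input
```

For a fully grown tree, each leaf usually contains a single class, so the probabilities tend to be 0 or 1; limiting `max_depth` typically gives less extreme values.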

## 4. Exporting a model

The standard thing to do to persist the `model` object is to save it to a file using the pickle library:

In [None]:
import pickle
# The "with" block closes the file even if dumping fails
with open('model.pkl', 'wb') as f:
    pickle.dump(model, f)

You can then load it back:

In [None]:
with open('model.pkl', 'rb') as f:
    model = pickle.load(f)

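(To use that model object, you'll need a version of scikit compatible with the one that saved it, ideally the same one. Also, only unpickle files you trust: loading a pickle can execute arbitrary code.)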
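
One way to make version mismatches easier to spot is to store the scikit version next to the model. This is only a sketch: the dictionary layout (`sklearn_version`, `model`) and the temporary file location are conventions invented for this example, not a scikit feature.

```python
import os
import pickle
import tempfile

import sklearn
from sklearn import tree

# Tiny made-up model, just to have something to save.
clf = tree.DecisionTreeClassifier(random_state=0).fit([[0.0], [1.0]], ["a", "b"])

# Hypothetical bundle layout: the model plus the version that produced it.
bundle = {"sklearn_version": sklearn.__version__, "model": clf}

path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(path, "wb") as f:
    pickle.dump(bundle, f)

with open(path, "rb") as f:
    loaded = pickle.load(f)

# Compare versions before trusting the restored model.
if loaded["sklearn_version"] != sklearn.__version__:
    print("Warning: model was saved with scikit-learn", loaded["sklearn_version"])

restored = loaded["model"]
print(restored.predict([[0.9]]))
```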

### Exporting a tree

The structure of a scikit Decision Tree is completely "open". We can navigate through it and generate another representation of it in our language of choice.

Here we export a description of the tree as a nested dictionary which, once serialized to JSON, can be read by the popular D3.js visualization library in JavaScript (see [source](http://planspace.org/20151129-see_sklearn_trees_with_d3/) and this [JSFiddle playground](https://jsfiddle.net/MetalMonkey/JnNwu/)).

In [None]:
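def rules(clf, features, labels, node_index=0):
    """Recursively convert the fitted tree into a nested dict of named nodes."""
    node = {}
    if clf.tree_.children_left[node_index] == -1:  # -1 indicates a leaf
        # Rescale to sample counts: depending on the scikit-learn version,
        # tree_.value holds either per-class counts or per-class fractions.
        values = clf.tree_.value[node_index, 0]
        n_samples = clf.tree_.weighted_n_node_samples[node_index]
        counts = values / values.sum() * n_samples
        node['name'] = ', '.join('{} of {}'.format(int(round(count)), label)
                                 for count, label in zip(counts, labels))
    else:
        feature = features[clf.tree_.feature[node_index]]
        threshold = clf.tree_.threshold[node_index]
        node['name'] = '{} > {}'.format(feature, threshold)
        left_index = clf.tree_.children_left[node_index]
        right_index = clf.tree_.children_right[node_index]
        # Samples with feature <= threshold go to the left child, so the right
        # child (where the test named above is true) is listed first.
        node['children'] = [rules(clf, features, labels, right_index),
                            rules(clf, features, labels, left_index)]
    return node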

Apply that function to our tree model:

In [None]:
r = rules(model, ['sepal L', 'sepal W', 'petal L', 'petal W'], ['setosa', 'versicolor', 'virginica'])
print(r)
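
Note that `print` shows the Python representation of the dictionary (with single quotes), which is not valid JSON. To hand the result to JavaScript, it can be serialized with the standard `json` module. A minimal sketch, using a hand-written dictionary in the same shape as the output of `rules` (the node names and counts are made up):

```python
import json

# Hypothetical dictionary shaped like the output of rules(); values are invented.
r_example = {
    "name": "petal W > 0.8",
    "children": [
        {"name": "0 of setosa, 50 of versicolor, 50 of virginica"},
        {"name": "50 of setosa, 0 of versicolor, 0 of virginica"},
    ],
}

# json.dumps produces a string with double quotes that JavaScript can parse.
as_json = json.dumps(r_example, indent=2)
print(as_json)
```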

## 5. Creating a simpler tree

First, look at feature importances:

In [None]:
model.feature_importances_
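
The importances come back as a bare array, in the same order as the columns of the training inputs, and they sum to 1. Pairing them with feature names makes them easier to read. A sketch on made-up data where only the third column varies between classes (the data and the name `clf` are assumptions for this example):

```python
from sklearn import tree

feature_names = ["sepal L", "sepal W", "petal L", "petal W"]

# Made-up data: only the third column ("petal L") separates the two classes.
X_toy = [[5.0, 3.0, 1.0, 0.5],
         [5.0, 3.0, 1.2, 0.5],
         [5.0, 3.0, 4.0, 0.5],
         [5.0, 3.0, 4.5, 0.5]]
y_toy = ["a", "a", "b", "b"]

clf = tree.DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)

# Sort (name, importance) pairs from most to least important.
ranked = sorted(zip(feature_names, clf.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, importance in ranked:
    print("{}: {:.2f}".format(name, importance))
```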

Keep only the two most important features. For the iris data these are the last two columns, petal length and petal width (column indices 2 and 3):

In [None]:
model = tree.DecisionTreeClassifier(max_depth=3)
model = model.fit(X[:, [2, 3]], y)

Export tree as rule-set:

In [None]:
r = rules(model, ['petal L', 'petal W'], ['setosa', 'versicolor', 'virginica'])
print(r)
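
For a quick look at the rules without any custom code, recent versions of scikit also provide `tree.export_text`, which prints the tree as indented text. A minimal sketch on made-up two-feature data (the values and the name `clf` are assumptions, not the iris model above):

```python
from sklearn import tree

# Made-up data with two features standing in for petal length and width.
X_toy = [[1.4, 0.2], [1.3, 0.2], [4.7, 1.4],
         [4.5, 1.5], [6.0, 2.5], [5.9, 2.1]]
y_toy = ["setosa", "setosa", "versicolor",
         "versicolor", "virginica", "virginica"]

clf = tree.DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_toy, y_toy)

# export_text returns a plain string with one line per node.
text = tree.export_text(clf, feature_names=["petal L", "petal W"])
print(text)
```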