Decision Tree and Random Forest
=====

**Objective**: Build and tune a decision tree for a classification problem.  Build and evaluate a random forest for a classification problem.

In [6]:
MY_NAME = "Chris Phillips" # <-- Your name here

1) Bring in some data over ground cover types.  A full description of the dataset can be found at [UCI Machine Learning Dataset](https://archive.ics.uci.edu/ml/datasets/covertype).  It contains 54 attributes that characterize a location in a forest and the target is the type of tree that is found at that location.  

In [7]:
import sklearn.datasets

covetype = sklearn.datasets.fetch_covtype()

print covetype.DESCR
print covetype.data
print covetype.target

.. _covtype_dataset:

Forest covertypes
-----------------

The samples in this dataset correspond to 30×30m patches of forest in the US,
collected for the task of predicting each patch's cover type,
i.e. the dominant species of tree.
There are seven covertypes, making this a multiclass classification problem.
Each sample has 54 features, described on the
`dataset's homepage <http://archive.ics.uci.edu/ml/datasets/Covertype>`__.
Some of the features are boolean indicators,
while others are discrete or continuous measurements.

**Data Set Characteristics:**

    Classes                        7
    Samples total             581012
    Dimensionality                54
    Features                     int

:func:`sklearn.datasets.fetch_covtype` will load the covertype dataset;
it returns a dictionary-like object
with the feature matrix in the ``data`` member
and the target values in ``target``.
The dataset will be downloaded from the web if necessary.

[[2.596e+03 5.100e+01 3.000e+00 ... 0

2) Shuffle the data and split into three seperate datasets, using a 60-20-20 split.  Check the distribution of the classes across the three datasets to see if they are balanced.

In [27]:
import random
import collections

X = covetype.data
y = covetype.target

idx = random.shuffle(range(X.shape[0]))
X[idx,:] = X
y[idx] = y

N = int(0.8 * X.shape[0])

tv_X = X[:N,:]
tv_y = y[:N]
test_X = X[N:,:]
test_y = y[N:]

N = int(0.75 * tv_X.shape[0])
train_X = tv_X[:N,:]
train_y = tv_y[:N]
val_X = tv_X[N:,:]
val_y = tv_y[N:]

print train_X.shape, train_y.shape, val_X.shape, val_y.shape, test_X.shape, test_y.shape

# Check if classes are balanced across the datasets
train_class_counts = collections.Counter(train_y)
val_class_counts = collections.Counter(val_y)
test_class_counts = collections.Counter(test_y)

print train_class_counts
print val_class_counts
print test_class_counts

(348606, 54) (348606,) (116203, 54) (116203,) (116203, 54) (116203,)
Counter({2: 193927, 1: 103044, 3: 23260, 6: 12403, 5: 7497, 7: 5728, 4: 2747})
Counter({1: 56740, 2: 42480, 7: 7978, 3: 4798, 6: 2448, 5: 1759})
Counter({1: 52056, 2: 46894, 3: 7696, 7: 6804, 6: 2516, 5: 237})


3) Now we will build a decision tree for this problem. 

In [21]:
from sklearn.tree import DecisionTreeClassifier
# Build and fit tree
clf = DecisionTreeClassifier(random_state=0)
clf.fit(train_X, train_y)

print "Accuracy on training data ...", clf.score(train_X, train_y)
print "Accuracy on validation data ...", clf.score(val_X, val_y)

Accuracy on training data ... 1.0
Accuracy on validation data ... 0.687615638150478


4) There is clearly some overfitting taking place.  Let's try to minimize that and also tune the tree at the same time.

In [22]:
import itertools
from sklearn.tree import DecisionTreeClassifier


for (criterion, max_depth, splitter) in itertools.product(["gini", "entropy"], [5, 7, 9, 11], ["best", "random"]): 
    
    clf = DecisionTreeClassifier(criterion=criterion, max_depth=max_depth, splitter=splitter, random_state=0)
    clf.fit(train_X, train_y)

    print criterion, max_depth, splitter
    print "Accuracy on training data ...", clf.score(train_X, train_y)
    print "Accuracy on validation data ...", clf.score(val_X, val_y)
    

gini 5 best
Accuracy on training data ... 0.7506353878017016
Accuracy on validation data ... 0.6160770375979966
gini 5 random
Accuracy on training data ... 0.6507260345490324
Accuracy on validation data ... 0.4049809385299863
gini 7 best
Accuracy on training data ... 0.7719115563128575
Accuracy on validation data ... 0.6584167362288409
gini 7 random
Accuracy on training data ... 0.7404318915910799
Accuracy on validation data ... 0.6166105866457837
gini 9 best
Accuracy on training data ... 0.8005398644888498
Accuracy on validation data ... 0.6680464359784171
gini 9 random
Accuracy on training data ... 0.7561200897288056
Accuracy on validation data ... 0.5939691746340456
gini 11 best
Accuracy on training data ... 0.8339902353946863
Accuracy on validation data ... 0.6472896568935398
gini 11 random
Accuracy on training data ... 0.7516422551533823
Accuracy on validation data ... 0.6756796296136933
entropy 5 best
Accuracy on training data ... 0.7514357182607299
Accuracy on validation data ..

5) Select the best set of hyperparameters, recreate the tree and evaluate the performance by class.  

In [25]:
from sklearn.metrics import confusion_matrix, classification_report

criterion = 'entropy'
max_depth = 9
splitter = 'random'

clf = DecisionTreeClassifier(criterion=criterion, max_depth=max_depth, splitter=splitter, random_state=0)
clf.fit(train_X, train_y)

print "Accuracy on training data ...", clf.score(train_X, train_y)
print "Accuracy on validation data ...", clf.score(val_X, val_y)

pred_y = clf.predict(val_X)

print confusion_matrix(val_y, pred_y)
print classification_report(val_y, pred_y)

Accuracy on training data ... 0.7691835481890731
Accuracy on validation data ... 0.6790014027176579
[[35521 16893    28    15     0  4283]
 [ 8593 32996   346     2   269   274]
 [    0  1391  2990    41   376     0]
 [    0  1566   128    38    27     0]
 [    0  1251   291     7   899     0]
 [ 1443    77     0     0     0  6458]]
              precision    recall  f1-score   support

           1       0.78      0.63      0.69     56740
           2       0.61      0.78      0.68     42480
           3       0.79      0.62      0.70      4798
           5       0.37      0.02      0.04      1759
           6       0.57      0.37      0.45      2448
           7       0.59      0.81      0.68      7978

   micro avg       0.68      0.68      0.68    116203
   macro avg       0.62      0.54      0.54    116203
weighted avg       0.69      0.68      0.67    116203



6) Finally, let's build a random forest built from a number of decision trees.

In [28]:
from sklearn.ensemble import RandomForestClassifier

# Build and fit forest here
clf = RandomForestClassifier(n_estimators=30, random_state=0, max_depth=30)
clf.fit(train_X, train_y)

print "Accuracy on training data ...", clf.score(train_X, train_y)
print "Accuracy on validation data ...", clf.score(val_X, val_y)


Accuracy on training data ... 0.9977883341078467
Accuracy on validation data ... 0.7621059697253943


In [31]:
print "Accuracy on test data ...", clf.score(test_X, test_y)

pred_y = clf.predict(test_X)

print confusion_matrix(test_y, pred_y)
print classification_report(test_y, pred_y)

Accuracy on test data ... 0.7046289682710429
[[37926 10717    73   111   169  3060]
 [13482 28247  1707   823  1983   652]
 [    5   442  6664     5   580     0]
 [   17    88     4   128     0     0]
 [    1    99   178     2  2236     0]
 [  120     5     0     0     0  6679]]
              precision    recall  f1-score   support

           1       0.74      0.73      0.73     52056
           2       0.71      0.60      0.65     46894
           3       0.77      0.87      0.82      7696
           5       0.12      0.54      0.20       237
           6       0.45      0.89      0.60      2516
           7       0.64      0.98      0.78      6804

   micro avg       0.70      0.70      0.70    116203
   macro avg       0.57      0.77      0.63    116203
weighted avg       0.72      0.70      0.70    116203

