# Decision Trees

Decision trees are machine learning models that try to find patterns in the features of data points.

Decision trees are supervised machine learning models, which means that they're created from a training set of labeled data. Creating the tree is where the _learning_ in machine learning happens.

Let's create a decision tree built from a dataset about cars. When considering buying a car, what factors go into making that decision?

Each car falls into one of four classes, which represent how satisfied someone would be with purchasing the car: `unacc` (unacceptable), `acc` (acceptable), `good`, and `vgood`.

Each car has 6 features:
* The price of the car which can be `"vhigh"`, `"high"`, `"med"`, or `"low"`
* The cost of maintaining the car which can be `"vhigh"`, `"high"`, `"med"`, or `"low"`
* The number of doors which can be `"2"`, `"3"`, `"4"`, `"5more"`
* The number of people the car can hold which can be `"2"`, `"4"`, or `"more"`
* The size of the trunk which can be `"small"`, `"med"`, or `"big"`
* The safety rating of the car which can be `"low"`, `"med"`, or `"high"`

In [1]:
from tree import tree, classify

car = ['med', 'med', '4', '4', 'big', 'high']

classified = classify(car, tree)
print(classified)

vgood


## Gini Impurity

To see if our decision tree is useful, we want to calculate the __Gini impurity__ of a set of data points.

To find the Gini impurity, start at `1` and subtract the squared proportion of each label in the set. For example, if a data set had three items of class `A` and one item of class `B`, the Gini impurity of the set would be:

$1 - \bigg(\frac{3}{4}\bigg)^2 - \bigg(\frac{1}{4}\bigg)^2 = 0.375$

If a data set has only one class, you'd end up with a Gini impurity of `0`. The lower the impurity, the better the decision tree.
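
Written generally, the rule above can be stated as a single formula. The symbols $G$, $k$, and $p_i$ are notation introduced here for illustration: $k$ is the number of distinct classes in the set and $p_i$ is the proportion of items that belong to class $i$.

$G = 1 - \sum_{i=1}^{k} p_i^2$

With $p_A = \frac{3}{4}$ and $p_B = \frac{1}{4}$ this is the calculation from the example above. With a single class, $p_1 = 1$ and $G = 0$.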

In [2]:
from collections import Counter

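# a mixed set of labels like this one would give a non-zero impurity:
# labels = ['unacc', 'unacc', 'acc', 'acc', 'good', 'good']
# here we use a set with a single class, which should be perfectly pure
labels = ['unacc', 'unacc', 'unacc', 'unacc', 'unacc', 'unacc']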

impurity = 1

# count up how many times every unique label is in the dataset
label_counts = Counter(labels)
print(label_counts)

Counter({'unacc': 6})


In [3]:
# find the probability of each label and subtract its square from the impurity
# (labels, label_counts, and impurity all come from the previous cell)

for label in label_counts:
    probability_of_label = label_counts[label] / len(labels)
    impurity -= (probability_of_label ** 2)

print(impurity)

0.0


## Information Gain

We know that we want to end up with leaves with a low Gini Impurity, but we still need to figure out which features to split on in order to achieve this. 

To answer this, we can calculate the __information gain__ of splitting the data on a certain feature. Information gain measures the difference in the impurity of the data before and after the split.

For example, let’s say you had a dataset with an impurity of `0.5`. After splitting the data based on a feature, you end up with three groups with impurities `0`, `0.375`, and `0`. The information gain of splitting the data in that way is `0.5 - 0 - 0.375 - 0 = 0.125`.

By splitting the data in that way, we've gained some information about how the data is structured—the datasets after the split are purer than they were before the split. The higher the information gain the better—if information gain is `0`, then splitting the data on that feature was useless. 

In [4]:
unsplit_labels = ["unacc", "unacc", "unacc", "unacc", "unacc", "unacc", "good", "good", "good", "good", "vgood", "vgood", "vgood"]

split_labels_1 = [
  ["unacc", "unacc", "unacc", "unacc", "unacc", "unacc", "good", "good", "vgood"], 
  [ "good", "good"], 
  ["vgood", "vgood"]
]

split_labels_2 = [
  ["unacc", "unacc", "unacc", "unacc","unacc", "unacc", "good", "good", "good", "good"], 
  ["vgood", "vgood", "vgood"]
]

def gini(dataset):
    impurity = 1
    label_counts = Counter(dataset)
    for label in label_counts:
        prob_of_label = label_counts[label] / len(dataset)
        impurity -= prob_of_label ** 2
    return impurity

info_gain = gini(unsplit_labels)

for subset in split_labels_1:
    info_gain -= gini(subset)
    
print(info_gain)


0.14522609394404257


In [5]:
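# reset the starting impurity before evaluating the second split;
# otherwise info_gain carries over from the previous cell
info_gain = gini(unsplit_labels)

for subset in split_labels_2:
    info_gain -= gini(subset)

print(round(info_gain, 4))

0.1591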


## Weighted Information Gain

The sizes of the subsets that get created after the split are important too.

Let's modify the formula for information gain to reflect the fact that the size of the set is relevant. Instead of simply subtracting the impurity of each set, we'll subtract the _weighted_ impurity of each of the split sets. If the data before the split contained `20` items and one of the resulting splits contained `2` items, then the weighted impurity of that subset would be `2/20 * impurity`.
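
To see why the weighting matters, here is a small self-contained sketch that compares the unweighted and weighted gain for the two example splits of the 13 labels used earlier. The helper names `unweighted_gain` and `weighted_gain`, and the variable names `split_a` and `split_b`, are made up for this illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared label proportions."""
    counts = Counter(labels)
    return 1 - sum((n / len(labels)) ** 2 for n in counts.values())

def unweighted_gain(parent, subsets):
    # subtract every subset's impurity in full, regardless of its size
    return gini(parent) - sum(gini(s) for s in subsets)

def weighted_gain(parent, subsets):
    # scale each subset's impurity by its share of the parent's items
    return gini(parent) - sum(gini(s) * len(s) / len(parent) for s in subsets)

# the same 13 labels and two candidate splits as in the earlier cell
parent = ["unacc"] * 6 + ["good"] * 4 + ["vgood"] * 3
split_a = [["unacc"] * 6 + ["good"] * 2 + ["vgood"], ["good"] * 2, ["vgood"] * 2]
split_b = [["unacc"] * 6 + ["good"] * 4, ["vgood"] * 3]

for name, subsets in [("split_a", split_a), ("split_b", split_b)]:
    print(name,
          round(unweighted_gain(parent, subsets), 4),
          round(weighted_gain(parent, subsets), 4))
# split_a 0.1452 0.2972
# split_b 0.1591 0.2698
```

In this example the two measures disagree about which split is better: the unweighted gain slightly favors `split_b`, while the weighted gain favors `split_a`, because the one impure subset in `split_a` is smaller than the one in `split_b`.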

Now that we can calculate the information gain using weighted impurity, let's do that for every possible feature. If we do this, we can find the best feature to split the data on.

In [6]:
cars = [['med', 'low', '3', '4', 'med', 'med'], ['med', 'vhigh', '4', 'more', 'small', 'high'], ['high', 'med', '3', '2', 'med', 'low'], ['med', 'low', '4', '4', 'med', 'low'], ['med', 'low', '5more', '2', 'big', 'med'], ['med', 'med', '2', 'more', 'big', 'high'], ['med', 'med', '2', 'more', 'med', 'med'], ['vhigh', 'vhigh', '2', '2', 'med', 'low'], ['high', 'med', '4', '2', 'big', 'low'], ['low', 'low', '2', '4', 'big', 'med']]

car_labels = ['acc', 'acc', 'unacc', 'unacc', 'unacc', 'vgood', 'acc', 'unacc', 'unacc', 'good']

def split(dataset, labels, column):
    data_subsets = []
    label_subsets = []
    counts = list(set([data[column] for data in dataset]))
    counts.sort()
    for k in counts:
        new_data_subset = []
        new_label_subset = []
        for i in range(len(dataset)):
            if dataset[i][column] == k:
                new_data_subset.append(dataset[i])
                new_label_subset.append(labels[i])
        data_subsets.append(new_data_subset)
        label_subsets.append(new_label_subset)
    return data_subsets, label_subsets
    
    
# weighted information gain: each subset's impurity is scaled by its share of the data
def information_gain(starting_labels, split_labels):
    info_gain = gini(starting_labels)
    for subset in split_labels:
        info_gain -= gini(subset) * (len(subset) / len(starting_labels))
    return info_gain
    
split_data, split_labels = split(cars, car_labels, 3)
print(split_data, split_labels)


[[['high', 'med', '3', '2', 'med', 'low'], ['med', 'low', '5more', '2', 'big', 'med'], ['vhigh', 'vhigh', '2', '2', 'med', 'low'], ['high', 'med', '4', '2', 'big', 'low']], [['med', 'low', '3', '4', 'med', 'med'], ['med', 'low', '4', '4', 'med', 'low'], ['low', 'low', '2', '4', 'big', 'med']], [['med', 'vhigh', '4', 'more', 'small', 'high'], ['med', 'med', '2', 'more', 'big', 'high'], ['med', 'med', '2', 'more', 'med', 'med']]] [['unacc', 'unacc', 'unacc', 'unacc'], ['acc', 'unacc', 'good'], ['acc', 'vgood', 'acc']]


In [7]:
print(len(split_data))

3


In [8]:
print(split_data[0])

[['high', 'med', '3', '2', 'med', 'low'], ['med', 'low', '5more', '2', 'big', 'med'], ['vhigh', 'vhigh', '2', '2', 'med', 'low'], ['high', 'med', '4', '2', 'big', 'low']]


In [9]:
print(split_data[1])

[['med', 'low', '3', '4', 'med', 'med'], ['med', 'low', '4', '4', 'med', 'low'], ['low', 'low', '2', '4', 'big', 'med']]


In [10]:
print(information_gain(car_labels, split_labels))

0.30666666666666675


In [11]:
for i in range(0, 6):
    split_data, split_labels = split(cars, car_labels, i)
    print(information_gain(car_labels, split_labels))

0.2733333333333334
0.04000000000000001
0.10666666666666663
0.30666666666666675
0.15000000000000002
0.29000000000000004


## Recursive Tree Building

Now that we can find the best feature to split the dataset, we can repeat this process again and again to create the full tree. This is a recursive algorithm! We start with every data point from the training set, find the best feature to split the data, split the data based on that feature, and then recursively repeat the process again on each subset that was created from the split.

We’ll stop the recursion when we can no longer find a feature that results in any information gain. In other words, we want to create a leaf of the tree when we can’t find a way to split the data that makes purer subsets.

The leaf should keep track of the classes of the data points from the training set that ended up in the leaf. In our implementation, we’ll use a Counter object to keep track of the counts of labels.

We’ll use these counts to make predictions about new data that we give the tree.

In [12]:
from tree_1 import *

car_data = [['med', 'low', '3', '4', 'med', 'med'], ['med', 'vhigh', '4', 'more', 'small', 'high'], ['high', 'med', '3', '2', 'med', 'low'], ['med', 'low', '4', '4', 'med', 'low'], ['med', 'low', '5more', '2', 'big', 'med'], ['med', 'med', '2', 'more', 'big', 'high'], ['med', 'med', '2', 'more', 'med', 'med'], ['vhigh', 'vhigh', '2', '2', 'med', 'low'], ['high', 'med', '4', '2', 'big', 'low'], ['low', 'low', '2', '4', 'big', 'med']]

car_labels = ['acc', 'acc', 'unacc', 'unacc', 'unacc', 'vgood', 'acc', 'unacc', 'unacc', 'good']

def find_best_split(dataset, labels):
    best_gain = 0
    best_feature = 0
    for feature in range(len(dataset[0])):
        data_subsets, label_subsets = split(dataset, labels, feature)
        gain = information_gain(labels, label_subsets)
        if gain > best_gain:
            best_gain, best_feature = gain, feature
    return best_feature, best_gain


In [13]:
def build_tree(data, labels):
    best_feature, best_gain = find_best_split(data, labels)
    if best_gain == 0:
        return Counter(labels)
    data_subsets, label_subsets = split(data, labels, best_feature)
    branches = []
    for i in range(len(data_subsets)):
        branch = build_tree(data_subsets[i], label_subsets[i])
        branches.append(branch)
    return branches

tree = build_tree(car_data, car_labels)
print_tree(tree)

Splitting
--> Branch 0:
  Counter({'unacc': 4})
--> Branch 1:
  Splitting
  --> Branch 0:
    Counter({'good': 1})
  --> Branch 1:
    Counter({'acc': 1})
  --> Branch 2:
    Counter({'unacc': 1})
--> Branch 2:
  Splitting
  --> Branch 0:
    Counter({'vgood': 1})
  --> Branch 1:
    Counter({'acc': 1})
  --> Branch 2:
    Counter({'acc': 1})


## Classifying New Data

We now can use our tree as a classifier. Given a new data point, we start at the top of the tree and follow the path of the tree until we hit a leaf. Once we get to a leaf, we'll use the classes of the points from the training set to make a classification. 

In [14]:
from tree_2 import *
import operator

test_point = ['vhigh', 'low', '3', '4', 'med', 'med']

print_tree(tree)

Splitting on Estimated Saftey
--> Branch high:
  Splitting on Person Capacity
  --> Branch 2:
    Counter({'unacc': 174})
  --> Branch 4:
    Splitting on Buying Price
    --> Branch high:
      Splitting on Price of maintenance
      --> Branch high:
        Counter({'acc': 11})
      --> Branch low:
        Counter({'acc': 12})
      --> Branch med:
        Counter({'acc': 11})
      --> Branch vhigh:
        Counter({'unacc': 12})
    --> Branch low:
      Splitting on Price of maintenance
      --> Branch high:
        Splitting on Size of luggage boot
        --> Branch big:
          Counter({'vgood': 4})
        --> Branch med:
          Splitting on Number of doors
          --> Branch 2:
            Counter({'acc': 1})
          --> Branch 3:
            Counter({'acc': 1})
          --> Branch 4:
            Counter({'vgood': 1})
          --> Branch 5more:
            Counter({'vgood': 1})
        --> Branch small:
          Counter({'acc': 3})
      --> Branch low:
        

In [15]:
def classify(datapoint, tree):
    if isinstance(tree, Leaf):
        return max(tree.labels.items(), key=operator.itemgetter(1))[0]
    value = datapoint[tree.feature]
    for branch in tree.branches:
        if branch.value == value:
            return classify(datapoint, branch)

print(classify(test_point, tree))

unacc
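
The `Leaf` class and the `feature`, `branches`, and `value` attributes used by `classify` come from the `tree_2` module, which isn't shown here. The following is a minimal, self-contained sketch of one structure that would be compatible with that function. The class `InternalNode`, the field layout, and the tiny `toy_tree` are all assumptions made for illustration, not the contents of `tree_2`.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

@dataclass
class Leaf:
    labels: Counter               # counts of training labels that reached this leaf
    value: Optional[str] = None   # feature value on the branch leading here

@dataclass
class InternalNode:
    feature: int                  # index of the feature this node splits on
    branches: list                # child nodes, one per observed feature value
    value: Optional[str] = None   # feature value on the branch leading here

def classify(datapoint, tree):
    # at a leaf, predict the most common training label
    if isinstance(tree, Leaf):
        return tree.labels.most_common(1)[0][0]
    # otherwise follow the branch matching this point's feature value
    value = datapoint[tree.feature]
    for branch in tree.branches:
        if branch.value == value:
            return classify(datapoint, branch)
    return None  # the tree never saw this feature value during training

# hypothetical tree: split on safety (index 5), then on person capacity (index 3)
toy_tree = InternalNode(feature=5, branches=[
    Leaf(Counter({"unacc": 3}), value="low"),
    InternalNode(feature=3, value="high", branches=[
        Leaf(Counter({"unacc": 2}), value="2"),
        Leaf(Counter({"acc": 2, "vgood": 1}), value="4"),
    ]),
])

print(classify(["med", "med", "4", "4", "big", "high"], toy_tree))     # acc
print(classify(["med", "med", "4", "more", "big", "high"], toy_tree))  # None
```

The second call shows a case worth handling explicitly: when a data point has a feature value that no branch covers, the recursion falls through and returns `None`.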


## Decision Trees in scikit-learn

Let's take a look at how the Python library `scikit-learn` implements decision trees.

The `sklearn.tree` module contains the `DecisionTreeClassifier` class. To create a `DecisionTreeClassifier` object, call the constructor:

`classifier = DecisionTreeClassifier()`

Next, we want to create the tree based on our training data. To do this, we'll use the `.fit()` method. 

`.fit()` takes a list of data points followed by a list of the labels associated with that data.

`classifier.fit(training_data, training_labels)`

Once we've made our tree, we can use it to classify new data points. The `.predict()` method takes an array of data points and will return an array of classifications for those data points. 

`predictions = classifier.predict(test_data)`

If you split your data into a test set, you can find the accuracy of the model by calling the `.score()` method using the test data and the test labels as parameters.

`print(classifier.score(test_data, test_labels))`

`.score()` returns the fraction of data points from the test set that it classified correctly, a number between `0` and `1`.
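
For a classifier, the number that `.score()` reports is the mean accuracy on the data it is given. As a rough sketch of what that means, here is a hand-rolled equivalent in plain Python. The helper `accuracy` and the example labels are hypothetical and are not part of scikit-learn.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels (between 0 and 1)."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)

# made-up predictions and true labels, for illustration only
predicted = ["acc", "unacc", "good", "unacc"]
actual = ["acc", "unacc", "vgood", "unacc"]

print(accuracy(predicted, actual))  # 3 of 4 correct -> 0.75
```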

In [16]:
from cars import training_points, training_labels, testing_points, testing_labels
from sklearn.tree import DecisionTreeClassifier

print(training_points[0])
print(training_labels[0])

[4.0, 3.0, 4.0, 2.0, 1.0, 2.0]
acc


In [17]:
classifier = DecisionTreeClassifier()
classifier.fit(training_points, training_labels)
# .predict() takes only the data points; the labels are what it returns
predictions = classifier.predict(testing_points)
print(classifier.score(testing_points, testing_labels))

0.9710982658959537


## Decision Tree Limitations

Our current strategy of creating trees is greedy. We assume that the best way to create a tree is to find the feature that will result in the largest information gain right now and split on that feature. We never consider the ramifications of that split further down the tree. It’s possible that if we split on a suboptimal feature right now, we would find even better splits later on. Unfortunately, finding a globally optimal tree is an extremely difficult task, and finding a tree using our greedy approach is a reasonable substitute.
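
A small sketch can make the greedy problem concrete. The four-point dataset below is invented for illustration (it is not part of the car data): the label depends on whether the two features agree, so neither feature is informative on its own. The helper names are also assumptions.

```python
from collections import Counter

def gini(labels):
    counts = Counter(labels)
    return 1 - sum((n / len(labels)) ** 2 for n in counts.values())

def split_labels(points, labels, column):
    """Group labels by the value each point has in the given column."""
    groups = {}
    for point, label in zip(points, labels):
        groups.setdefault(point[column], []).append(label)
    return list(groups.values())

def weighted_gain(labels, subsets):
    return gini(labels) - sum(gini(s) * len(s) / len(labels) for s in subsets)

# "acc" only when both features agree
points = [["low", "low"], ["low", "high"], ["high", "low"], ["high", "high"]]
labels = ["acc", "unacc", "unacc", "acc"]

# at the root, neither feature yields any information gain
top_level = [weighted_gain(labels, split_labels(points, labels, c)) for c in range(2)]
print(top_level)  # [0.0, 0.0]

# but after splitting on feature 0 anyway, feature 1 separates the labels perfectly
sub_points = [p for p in points if p[0] == "low"]
sub_labels = [l for p, l in zip(points, labels) if p[0] == "low"]
second_level = weighted_gain(sub_labels, split_labels(sub_points, sub_labels, 1))
print(second_level)  # 0.5
```

A builder that stops as soon as the best gain is `0` would turn this whole dataset into a single impure leaf, even though two levels of splitting produce perfectly pure leaves.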

Another problem with our trees is that they potentially overfit the data. This means that the structure of the tree is too dependent on the training data and doesn’t accurately represent what data in the real world looks like. In general, larger trees tend to overfit the data more. As the tree gets bigger, it becomes more tuned to the training data and it loses a more generalized understanding of the real-world data.

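One way to address this problem is to prune the tree. The goal of pruning is to shrink the size of the tree. The cells below use a simple form of this idea: they cap the depth of the tree with the `max_depth` parameter of `DecisionTreeClassifier` and compare the accuracy on the test set. A smaller tree is not automatically better. In these runs the tree capped at depth `7` scores lower on the test set than the tree of depth `12`, so the depth limit needs to be tuned.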

In [18]:
from cars import training_points, training_labels, testing_points, testing_labels
from sklearn.tree import DecisionTreeClassifier

classifier = DecisionTreeClassifier(random_state = 0, max_depth = 12)
classifier.fit(training_points, training_labels)
print(classifier.score(testing_points, testing_labels))

print(classifier.tree_.max_depth)

0.9710982658959537
12


In [19]:
# reduce max_depth parameter
classifier = DecisionTreeClassifier(random_state = 0, max_depth = 7)
classifier.fit(training_points, training_labels)
print(classifier.score(testing_points, testing_labels))

print(classifier.tree_.max_depth)

0.8959537572254336
7


## Review

* Good decision trees have pure leaves. A leaf is pure if all of the data points in that leaf have the same label.
* Decision trees are created using a greedy algorithm that prioritizes finding the feature that results in the largest information gain when splitting the data using that feature.
* Creating an optimal decision tree is difficult. The greedy algorithm doesn’t always find the globally optimal tree.
* Decision trees often suffer from overfitting. Limiting the size of the tree, for example by pruning, can help it generalize to data in the real world, although a tree that is too small loses accuracy.