# Chapter 17. Decision Trees

In [21]:
from __future__ import division
from collections import Counter, defaultdict
from functools import partial
import math, random

DataSciencester's VP of Talent has interviewed a number of job candidates from the site, with varying degrees of success.  
He's collected a data set consisting of several (qualitative) attributes of each candidate, as well as whether that candidate interviewed well or poorly.  
Could you, he asks, use this data to build a model identifying which candidates will interview well, so that he doesn't have to waste time conducting interviews?  
This task seems like a good fit for a [decision tree](https://en.wikipedia.org/wiki/Decision_tree), which is another predictive modeling tool in the data scientist's kit.

## What is a Decision Tree?

A decision tree uses a tree structure to represent a number of possible [decision paths](http://scikit-learn.org/stable/auto_examples/tree/plot_unveil_tree_structure.html) and an outcome for each path.  
If you have ever played the game [Twenty Questions](https://en.wikipedia.org/wiki/Twenty_Questions), then you are familiar with decision trees and how they work.
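
To make "decision paths and an outcome for each path" concrete, here is a minimal sketch of a hand-built (not learned) tree for a tiny guessing game. The nested-tuple representation, the `follow_path` helper, and the `has_feathers` question are all assumptions made up for illustration; they are not necessarily the representation used later in the chapter.

```python
# A hypothetical guessing-game tree, written by hand rather than learned.
# Internal nodes are (question, {answer: subtree}) pairs; leaves are outcomes.
toy_tree = ("more_than_five_legs", {
    True: "grasshopper",
    False: ("has_feathers", {True: "duck", False: "dog"}),
})

def follow_path(tree, answers):
    """Walk from the root to a leaf, using `answers` to choose each branch."""
    while isinstance(tree, tuple):
        question, branches = tree
        tree = branches[answers[question]]
    return tree

print(follow_path(toy_tree, {"more_than_five_legs": False,
                             "has_feathers": True}))   # duck
```

Each root-to-leaf walk is one decision path, and the leaf it ends on is that path's outcome.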

Decision trees have a lot to recommend them.  
They are very easy to understand and interpret, and the process by which they reach a prediction is very transparent.  
Decision trees can easily handle a mix of numeric and categorical attributes as well as classify data for which attributes are missing.  

However, finding an "optimal" decision tree for a set of training data is computationally a very hard problem.  
We will work around this problem by trying to build a good-enough tree rather than an optimal one, although for large data sets this can still be a lot of work.  
More importantly, it is very easy (and very bad) to build decision trees that are [overfitted](https://en.wikipedia.org/wiki/Overfitting) to the training data, and that don't generalize well to unseen data.  
We'll look at ways to address this.

Decision trees are often divided into [classification trees](http://www.stat.wisc.edu/~loh/treeprogs/guide/wires11.pdf), which produce categorical outputs, and [regression trees](http://www.stat.wisc.edu/~loh/treeprogs/guide/wires11.pdf), which produce numeric outputs.  
In this chapter, we will focus on classification trees.  
We'll work through the [ID3 algorithm](https://en.wikipedia.org/wiki/ID3_algorithm) for learning a decision tree from a set of labeled data, which should help us understand how decision trees actually work.  
To make things simple, we'll restrict ourselves to problems with binary outputs like, "Should I hire this candidate?"

## Entropy

In order to build a decision tree, we will need to decide what questions to ask and in what order.  
At each stage of the tree there are some possibilities that we have eliminated and some that we haven't.  
After learning that an animal doesn't have more than five legs, we've eliminated the possibility that it's a grasshopper, but we haven't eliminated the possibility that it's a duck.  
Every possible question partitions the remaining possibilities according to their answers.  
Ideally, we would like to choose questions whose answers give a lot of information about what our tree should predict.  
If there is a single yes/no question for which "yes" answers always correspond to `True` outputs and "no" answers to `False` outputs (or vice versa), this would be an awesome question to pick.
We capture this notion of "how much information" with [entropy](https://en.wikipedia.org/wiki/Maximum_entropy_probability_distribution), a term that we will use to represent the uncertainty associated with data.

Imagine that we have a set $S$ of data, each member of which is labeled as belonging to one of a finite number of classes $C_1,\;\ldots,\;C_n$.  
If all of the data points belong to a single class, then there is no real uncertainty, which means we'd like there to be low entropy.  
If the data points are evenly spread across the classes, then there is a lot of uncertainty, and we'd like there to be high entropy.

In math terms, if $p_i$ is the proportion of data labeled as class $C_i$, we define the entropy as:  

$\Large H(S) = -\;p_1\text{log}_2p_1 - \;\ldots\; - p_n\text{log}_2p_n$  

with the (standard) convention that $0\;\text{log}\;0 = 0$.

Without worrying too much about the details, each term $-p_i\text{log}_2p_i$ is non-negative and is close to zero precisely when $p_i$ is either close to zero or close to one.  
This means that the entropy will be small when every $p_i$ is close to 0 or 1 (like when most of the data is in a single class), and it will be larger when many of the $p_i$'s are not too close to 0 (like when the data is spread across multiple classes).  
This is exactly the behavior that we desire, and it is simple enough to roll all of this into a function:

In [22]:
def entropy(class_probabilities):
    """ given a list of class probabilities, calculate the entropy """
    return sum(-p * math.log(p, 2) for p in class_probabilities if p)  # if p means ignore zero probabilities 
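
As a quick sanity check, the sketch below evaluates the function on a few hand-picked probability lists. The definition is repeated only so that the snippet runs on its own, and the inputs are made-up illustrations rather than data from the chapter.

```python
import math

def entropy(class_probabilities):
    """Same definition as above, repeated so this snippet is self-contained."""
    return sum(-p * math.log(p, 2) for p in class_probabilities if p)

# Results are rounded to hide floating-point noise.
print(round(entropy([1.0]), 3))         # 0.0   (one class: no uncertainty)
print(round(entropy([0.5, 0.5]), 3))    # 1.0   (two equally likely classes)
print(round(entropy([0.25] * 4), 3))    # 2.0   (four equally likely classes)
print(round(entropy([0.9, 0.1]), 3))    # 0.469 (mostly one class)
```

The values behave as described: zero for a single class, largest when the classes are evenly spread, and somewhere in between for a lopsided split.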

Our data will consist of pairs (`input`, `label`), which means that we will need to compute the class probabilities ourselves.  
Observe that we don't actually care which label is associated with each probability, only what the probabilities are:

In [23]:
def class_probabilities(labels):
    total_count = len(labels)
    return [count / total_count for count in Counter(labels).values()]

def data_entropy(labeled_data):
    labels = [label for _, label in labeled_data]
    probabilities = class_probabilities(labels)
    return entropy(probabilities)
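
To see these helpers in action, here is a small sketch on four made-up candidates (hypothetical data, not the data set used later in the chapter). The functions are repeated so that the snippet is self-contained, and `float()` stands in for the `__future__` division import so that it behaves the same without it.

```python
import math
from collections import Counter

def entropy(class_probabilities):
    return sum(-p * math.log(p, 2) for p in class_probabilities if p)

def class_probabilities(labels):
    total_count = float(len(labels))    # float() forces true division
    return [count / total_count for count in Counter(labels).values()]

def data_entropy(labeled_data):
    labels = [label for _, label in labeled_data]
    return entropy(class_probabilities(labels))

# Hypothetical (input, label) pairs: label is "interviewed well?"
candidates = [
    ({"level": "Senior"}, False),
    ({"level": "Senior"}, True),
    ({"level": "Junior"}, True),
    ({"level": "Junior"}, True),
]

labels = [label for _, label in candidates]
print(sorted(class_probabilities(labels)))     # [0.25, 0.75]
print(round(data_entropy(candidates), 3))      # 0.811
```

Three of the four labels agree, so the entropy falls below the 1.0 of an even split but stays well above zero.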

## The Entropy of a Partition

What we have done so far is calculate the entropy ("uncertainty") of a single set of labeled data.  
Each stage of a decision tree involves asking a question whose answer partitions data into one or (hopefully) more subsets.  
As an example, our "does it have more than five legs?" question partitions animals into those that have more than 5 legs and those that don't.  
We would also like some idea of the entropy that results from partitioning a set of data in a certain way.  
We want a partition to have low entropy if it splits the data into subsets that themselves have low entropy (high certainty), and high entropy if it contains subsets that are large and highly uncertain.

Mathematically, if we partition our data $S$ into subsets $S_1,\;\ldots,\;S_m$ containing proportions $q_1,\;\ldots,\;q_m$ of the data, then we can calculate the entropy of the partition as a weighted sum:  

$\Large H = q_1H(S_1) + \;\ldots\;+ q_mH(S_m)$  

which can be implemented as:

In [24]:
def partition_entropy(subsets):
    """ find the entropy from this partition of data into subsets;
        subsets is a list of lists of labeled data """
    total_count = sum(len(subset) for subset in subsets)
    return sum(data_entropy(subset) * len(subset) / total_count for subset in subsets)
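
The following sketch compares two ways of partitioning the same made-up data. The `split_by` helper and the four candidates are assumptions introduced only for this illustration; the other functions repeat the definitions above (with `float()` in place of the `__future__` division import) so that the snippet runs on its own.

```python
import math
from collections import Counter, defaultdict

def entropy(class_probabilities):
    return sum(-p * math.log(p, 2) for p in class_probabilities if p)

def data_entropy(labeled_data):
    labels = [label for _, label in labeled_data]
    total = float(len(labels))
    return entropy(count / total for count in Counter(labels).values())

def partition_entropy(subsets):
    total_count = float(sum(len(subset) for subset in subsets))
    return sum(data_entropy(subset) * len(subset) / total_count
               for subset in subsets)

def split_by(labeled_data, attribute):
    """Hypothetical helper: group (input, label) pairs by input[attribute]."""
    groups = defaultdict(list)
    for item in labeled_data:
        groups[item[0][attribute]].append(item)
    return list(groups.values())

# Hypothetical data in which "level" decides the label and "phd" is irrelevant.
candidates = [
    ({"level": "Senior", "phd": "no"},  False),
    ({"level": "Senior", "phd": "yes"}, False),
    ({"level": "Junior", "phd": "no"},  True),
    ({"level": "Junior", "phd": "yes"}, True),
]

print(round(data_entropy(candidates), 3))                          # 1.0
print(round(partition_entropy(split_by(candidates, "level")), 3))  # 0.0
print(round(partition_entropy(split_by(candidates, "phd")), 3))    # 1.0
```

Splitting on `level` yields two pure subsets and removes all of the uncertainty, while splitting on `phd` leaves each subset as mixed as the original data.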

**Note**  
One problem with this approach is that partitioning by an attribute with many different values will result in a very low entropy, which makes that attribute look attractive and leads to overfitting.  
For example, imagine that you work for a bank and you are trying to build a decision tree to predict which of your customers are likely to default on their mortgages, using some historical data as your training set.  
This example data contains each customer's Social Security Number.  
Partitioning on SSN will produce one-person subsets, each of which has zero entropy.  
However, a model that relies on SSN will not generalize beyond the training set.  
For this reason, try to avoid (or at least bucket, if appropriate) attributes with large numbers of possible values when creating decision trees.
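
The effect is easy to reproduce on toy data. In the sketch below, the six customers, the placeholder `ssn` strings, and the `late_payments` attribute are all invented for illustration, and `split_by` is a hypothetical helper; the entropy functions repeat the definitions above so that the snippet is self-contained.

```python
import math
from collections import Counter, defaultdict

def entropy(class_probabilities):
    return sum(-p * math.log(p, 2) for p in class_probabilities if p)

def data_entropy(labeled_data):
    labels = [label for _, label in labeled_data]
    total = float(len(labels))
    return entropy(count / total for count in Counter(labels).values())

def partition_entropy(subsets):
    total_count = float(sum(len(subset) for subset in subsets))
    return sum(data_entropy(subset) * len(subset) / total_count
               for subset in subsets)

def split_by(labeled_data, attribute):
    """Hypothetical helper: group (input, label) pairs by input[attribute]."""
    groups = defaultdict(list)
    for item in labeled_data:
        groups[item[0][attribute]].append(item)
    return list(groups.values())

# Made-up customers; the label is "defaulted on the mortgage?"
customers = [
    ({"ssn": "ssn-1", "late_payments": "many"}, True),
    ({"ssn": "ssn-2", "late_payments": "many"}, True),
    ({"ssn": "ssn-3", "late_payments": "many"}, False),
    ({"ssn": "ssn-4", "late_payments": "few"},  False),
    ({"ssn": "ssn-5", "late_payments": "few"},  False),
    ({"ssn": "ssn-6", "late_payments": "few"},  True),
]

print(round(partition_entropy(split_by(customers, "ssn")), 3))            # 0.0
print(round(partition_entropy(split_by(customers, "late_payments")), 3))  # 0.918
```

Judged by entropy alone, `ssn` looks like the perfect question, yet a new customer's SSN matches none of the one-person subsets, so the split says nothing about them.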

## Creating a Decision Tree