# Decision Trees and Random Forests

Decision trees and tree-based ensemble models such as random forests and gradient-boosted trees are widely used.  Here, a simple decision tree is coded up to demonstrate key principles such as splitting on features and recursively building branches.  When building a tree, we want to branch at each step on the feature that gives us the most information, which means creating the most homogeneous branches possible.  This is quantified by the difference in entropy (a measure of uncertainty) between the parent node and its children, where the entropy is

$$E = -\sum_{x} p(x)\log_2 p(x)$$

where p(x) is the probability of class x, which we can estimate as the fraction of the data points in the node that belong to that class.
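
As a quick sanity check on the formula, the sketch below evaluates it for a few hand-picked class distributions. The helper name `entropy_bits` is hypothetical, and base 2 logarithms are assumed so that the result is measured in bits.

```python
import math

def entropy_bits(probabilities):
    """Entropy in bits of a list of class probabilities (zeros are skipped)."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

pure = entropy_bits([1.0])            # a single class: no uncertainty
even = entropy_bits([0.5, 0.5])       # 50/50 split: maximal for two classes
skewed = entropy_bits([9/14, 5/14])   # e.g. 9 positives and 5 negatives

print(pure, even, round(skewed, 3))
```

A pure node has an entropy of 0, an even two-class split has an entropy of 1 bit, and the 9-to-5 split comes out at roughly 0.94 bits.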

Decision trees are attractive because they are transparent (you can display them visually), but they are very prone to overfitting.  Because of this, they are often used in ensembles such as random forests.  Ensembles are harder to interpret: in a random forest, each tree is trained on a random sample of the data and a random subset of the features, and the trees then "vote" on the prediction.
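
The sampling-and-voting idea can be sketched in a few lines. This is a minimal illustration rather than a full random forest: `bootstrap_sample`, `random_feature_subset`, and `majority_vote` are hypothetical helper names, and the tree-building step itself is left out.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return [rng.choice(data) for _ in data]

def random_feature_subset(features, k, rng):
    """Pick k distinct features for a tree to consider."""
    return rng.sample(list(features), k)

def majority_vote(predictions):
    """Return the most common prediction among the trees."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)   # seeded so that the sketch is repeatable
data = list(range(10))   # stand-in for ten labeled rows
sample = bootstrap_sample(data, rng)
features = random_feature_subset(['level', 'lang', 'tweets', 'phd'], 2, rng)
vote = majority_vote([True, False, True])   # pretend three trees predicted these
```

Each tree in the ensemble would be built from its own `sample` and `features`, and the final prediction would be the `majority_vote` over all of the trees' outputs.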

In [28]:
import math
from functools import partial
from collections import defaultdict, Counter

In [15]:
def entropy(class_probabilities):
    '''given a list of class probabilities, compute the entropy'''
    H = sum(-p*math.log(p, 2) for p in class_probabilities if p) # ignore zeros
    return H

def class_probabilities(labels):
    total_count = len(labels)
    return [count/total_count for count in Counter(labels).values()]

def data_entropy(labeled_data):
    labels = [label for _, label in labeled_data]
    probs = class_probabilities(labels)
    return entropy(probs)

The entropy of a partition is the weighted sum of the entropies of its subsets, with each subset weighted by the proportion of the data it contains.

In [11]:
def partition_entropy(subsets):
    '''find the entropy from this partition of data into subsets, where
    subsets is a list of lists of labeled data'''
    total_count = sum(len(subset) for subset in subsets) # sum over lists
    return sum(data_entropy(subset)*len(subset)/total_count #weighted sum
              for subset in subsets)

In [4]:
inputs = [
    ({'level':'Senior','lang':'Java','tweets':'no','phd':'no'},   False),
    ({'level':'Senior','lang':'Java','tweets':'no','phd':'yes'},  False),
    ({'level':'Mid','lang':'Python','tweets':'no','phd':'no'},     True),
    ({'level':'Junior','lang':'Python','tweets':'no','phd':'no'},  True),
    ({'level':'Junior','lang':'R','tweets':'yes','phd':'no'},      True),
    ({'level':'Junior','lang':'R','tweets':'yes','phd':'yes'},    False),
    ({'level':'Mid','lang':'R','tweets':'yes','phd':'yes'},        True),
    ({'level':'Senior','lang':'Python','tweets':'no','phd':'no'}, False),
    ({'level':'Senior','lang':'R','tweets':'yes','phd':'no'},      True),
    ({'level':'Junior','lang':'Python','tweets':'yes','phd':'no'}, True),
    ({'level':'Senior','lang':'Python','tweets':'yes','phd':'yes'},True),
    ({'level':'Mid','lang':'Python','tweets':'no','phd':'yes'},    True),
    ({'level':'Mid','lang':'Java','tweets':'yes','phd':'no'},      True),
    ({'level':'Junior','lang':'Python','tweets':'no','phd':'yes'},False)
]

In [5]:
def partition_by(inputs, attribute):
    '''each input is a pair of (attribute_dict, label);
    returns a dict of attribute_value -> inputs'''
    
    groups = defaultdict(list)
    for input in inputs:
        key = input[0][attribute]   # get value of specified attribute
        groups[key].append(input)   # add input to correct list
    return groups

In [6]:
def partition_entropy_by(inputs, attribute):
    '''computes the entropy corresponding to given partition'''
    partitions = partition_by(inputs, attribute)
    return partition_entropy(partitions.values())

We need to find the minimum-entropy partition of the whole dataset, because we want to split on the attribute that yields the lowest entropy.

In [18]:
for key in ['level', 'lang', 'tweets', 'phd']:
    print(key, partition_entropy_by(inputs, key))

level 0.6935361388961919
lang 0.8601317128547441
tweets 0.7884504573082896
phd 0.8921589282623617
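
The entropy difference between the parent node and its children, often called information gain, follows directly from these numbers. The sketch below recomputes it for the `level` split using only the labels in each branch; the names `label_entropy` and `information_gain` are assumptions made for this example.

```python
import math
from collections import Counter

def label_entropy(labels):
    """Entropy in bits of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total)
               for n in Counter(labels).values())

def information_gain(parent_labels, child_label_lists):
    """Parent entropy minus the size-weighted entropy of the children."""
    total = len(parent_labels)
    children = sum(label_entropy(child) * len(child) / total
                   for child in child_label_lists)
    return label_entropy(parent_labels) - children

# labels from the inputs above, grouped by 'level'
senior = [False, False, False, True, True]
mid = [True, True, True, True]
junior = [True, True, False, True, False]

gain = information_gain(senior + mid + junior, [senior, mid, junior])
print(round(gain, 3))
```

The gain for `level` is roughly 0.247 bits. Because the parent entropy is the same for every candidate attribute, choosing the lowest partition entropy is equivalent to choosing the highest information gain.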


In [20]:
def classify(tree, input):
    '''classify the input using the decision tree'''
    
    # if this is a leaf node, return value
    if tree in [True, False]:
        return tree
    
    attribute, subtree_dict = tree # unpack the (attribute, subtree_dict) tuple
    subtree_key = input.get(attribute)
    
    if subtree_key not in subtree_dict:
        subtree_key = None
        
    subtree = subtree_dict[subtree_key]
    
    # recursively call until we get to the end of the tree
    return classify(subtree, input)

In [26]:
def build_tree_id3(inputs, split_candidates=None):
    
    # on first pass all attributes are available to split
    if split_candidates is None:
        split_candidates = inputs[0][0].keys()
    
    # count Trues and Falses in the inputs
    
    num_inputs = len(inputs)
    num_trues = len([label for item, label in inputs if label])
    num_falses = num_inputs - num_trues
    
    if num_trues == 0: return False    # make "False" leaf
    if num_falses == 0: return True    # make "True" leaf
    
    if not split_candidates:
        return num_trues >= num_falses # if no candidates, return majority leaf
    
    best_attribute = min(split_candidates, key=partial(partition_entropy_by, inputs))
    
    partitions = partition_by(inputs, best_attribute)
    new_candidates = [a for a in split_candidates if a != best_attribute]
    
    # recursively build subtrees
    subtrees = {attribute_value : build_tree_id3(subset, new_candidates)
               for attribute_value, subset in partitions.items()}
    
    subtrees[None] = num_trues > num_falses   # default case
    
    return (best_attribute, subtrees)

In [29]:
tree = build_tree_id3(inputs)

In [31]:
print(tree)

('level', {'Senior': ('tweets', {'no': False, 'yes': True, None: False}), 'Mid': True, 'Junior': ('phd', {'no': True, 'yes': False, None: True}), None: True})
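
With the tree built, it can be applied to new inputs. The snippet below is a sketch that copies the printed tree as a literal and repeats a compact version of `classify` so that it runs on its own; the candidate dictionaries are made-up examples.

```python
def classify(tree, input):
    """Walk the tree until a True/False leaf is reached."""
    if isinstance(tree, bool):
        return tree
    attribute, subtree_dict = tree
    key = input.get(attribute)
    if key not in subtree_dict:
        key = None   # unseen value: fall back to the default branch
    return classify(subtree_dict[key], input)

# the tree printed above, written out as a literal
tree = ('level', {
    'Senior': ('tweets', {'no': False, 'yes': True, None: False}),
    'Mid': True,
    'Junior': ('phd', {'no': True, 'yes': False, None: True}),
    None: True,
})

junior_no_phd = classify(tree, {'level': 'Junior', 'lang': 'Java',
                                'tweets': 'yes', 'phd': 'no'})
junior_phd = classify(tree, {'level': 'Junior', 'lang': 'Java',
                             'tweets': 'yes', 'phd': 'yes'})
intern = classify(tree, {'level': 'Intern'})   # level not seen in training
senior = classify(tree, {'level': 'Senior'})   # 'tweets' is missing

print(junior_no_phd, junior_phd, intern, senior)
```

The unseen level `'Intern'` and the missing `'tweets'` value both fall through to the `None` branch, which holds the majority label at that node.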
