## Decision Trees

A decision tree is one of the most commonly used algorithms in machine learning, both because of it's interpretability and because (with some modifications, discussed below) it can have incredibly accuracy. Here's an implementation using a dataset with several categorical variables and a target representing whether the individual will carry out a violent act.

In [61]:
import numpy as np
import pandas as pd
from collections import Counter, defaultdict

df = pd.read_csv("data/aps.csv")

data = df[['PLACE', 'RACE', 'GENDER', 'NEURO', 'EMOT', 'DANGER', 'BEHAV', 'VIOL']]

cutoff = int(.8 * len(data))
data_train = data[:cutoff]; data_test = data[cutoff:]

data_train.head()

Unnamed: 0,PLACE,RACE,GENDER,NEURO,EMOT,DANGER,BEHAV,VIOL
0,1,0,0,3,1,0,0,0
1,3,1,1,0,0,1,7,1
2,0,1,0,0,0,1,4,0
3,0,0,1,0,0,3,6,1
4,0,0,1,2,1,3,7,1


Before we start figuring out how the decision tree works, we need a way to represent that tree. One way is by using two dict of dicts, one to represent nodes and another to represent leaves.

In [62]:
class Node(dict):
    def __init__(self, feature):
        self.feature = feature

    def is_leaf(self):
        return False

class Leaf(dict):
    def __init__(self, target):
        self.target = target
        
    def is_leaf(self):
        return True


To get started building our tree, we'll need an entropy function, something that tells us how varied our data is. Intuitively, we want to make splits in our tree on features that have the widest range of target values, so that we get the most "information" from a new data point at each step along the tree.

In [67]:

class DecisionTree():
    
    def __init__(self):
        '''
        start up the tree
        '''
        self.root = None
        
    def fit(self, data_train):
        '''
        generates feature list and calls create_tree
        we need this method to define the feature list before we begin building 
        '''
        data_train = np.array(data_train)
        feature_list = [i for i in range(len(data_train[0]))]
        self.root = self.create_tree(np.array(data_train), feature_list)
                                                
    def target_entropy(self, values):
        '''
        the entropy for a given feature's label values
        '''
        if len(set(values)) <= 1:
            return 1
        size = float(len(values))
        classes = Counter(values)
        probs = [i / size for i in classes.values()]
        entropy = np.sum([-probs[i]*np.log(probs[i]) for i in range(len(probs))])
        
        return entropy
        
    def data_entropy(self, train_data):
        '''
        the entropy for any chunk of data- calls self.entropy
        '''
        data = np.array(train_data)
        target_col = data[:,-1]

        return self.target_entropy(target_col)

    def best_feature(self, train_data, feature_list):
        '''
        returns the feature to split on, or -1 if we can't improve
        '''
        best_entropy = -100
        base_entropy = self.data_entropy(train_data)
        for f in feature_list:
            
            unique_vals = list(set(train_data.T[f]))
            parts = self.split(train_data, f, unique_vals)
            n = sum([len(p) for p in parts])
           
            props = [float(len(p)) / n for p in parts]
            new_entropy = np.sum([props[i] * self.data_entropy(p) for i, p in enumerate(parts)])
            
            entropy_change = base_entropy - new_entropy

            if entropy_change > best_entropy:
                best_entropy = entropy_change
                split_feature = f
         
        feature_list.pop(feature_list.index(split_feature)) 
        return split_feature
        
    def split(self, train_data, f, unique_vals):
        '''
        split into groups for each value the feature can take on
        '''
        parts = [[row for row in train_data if row[f] == v] for v in unique_vals]
        return parts
    
    def get_most_common(self, train_data):
        '''
        returns the most common value for a chunk of data
        '''
        target_col = train_data[:,-1]
        classes = Counter(target_col)
        return classes.most_common(1)[0][0] # indices are most common, then actual value
    
    def create_tree(self, train_data, feature_list):
        '''
        recursively create the decision tree
        '''
        if len(feature_list) == 0:
            val = self.get_most_common(train_data)
            root = Leaf(val)
        
        else:
            best_feature = self.best_feature(train_data, feature_list)
            print "feature chosen: ", best_feature
            root = Node(best_feature)
            unique_vals = list(set(train_data.T[best_feature]))
            
            for value in unique_vals:
                subdata = np.array([row for row in train_data if row[best_feature] == value])
                child = self.create_tree(subdata, feature_list)
                root[value] = child

        return root
    
    def predict(self, test_data):
        '''
        traverse the tree we created to predict a new data point
        '''
        labels = []

        for t in test_data:
            target = None
            current_node = self.root
            while target is None:
                if current_node.is_leaf():
                    target = current_node.target
                else:
                    key_value = t[current_node.feature]
                    current_node = current_node[key_value]
            labels.append(target)
        return labels
    
    def score(self, test_data):
        '''
        evaluate the accuracy of that prediction
        '''
        preds = np.array(self.predict(np.array(test_data)))
        actual_targets = np.array(test_data)[:, -1]
        accuracy = np.mean(preds == actual_targets)
        
        return accuracy

In [68]:
d = DecisionTree()

d.fit(data_train)
d.score(data_test)

feature chosen:  5
feature chosen:  0
feature chosen:  1
feature chosen:  2
feature chosen:  3
feature chosen:  4
feature chosen:  6
feature chosen:  7


0.84313725490196079

# Ensembles

Once this is done, we can see how bagging, random forests, and boosting are all techniques that involve relatively painless modifications of our original class.

Bagging simply involves making multiple trees from different portions of the dataset and averaging out all the examples. Random forests is similar, except we only use a specific subset of the features. Finally, boosting means that for each tree we make, we take note of the data points which were incorrectly classified and adjust our loss function accordingly.

In each of these cases, we'll want to keep a basic version of our decision tree that has the entropy, splitting, create tree, and score functions, and add methods to deal with our expanded options. I'll be including the BaseDecisionTree code soon, but here is how this would work for a random forest.

In [72]:
class RF(BaseDecisionTree):
    
    def __init__(self, nbags=5, d_portion=0.2, f_portion=0.5):
        '''
        initialize variables
        '''
        
        self.root = [None] * nbags
        self.n = nbags
        self.data_portion = d_portion
        self.feature_portion = f_portion
            
    def get_portion(self, data_train):
        '''
        only select a subset of data
        '''
        
        n_data = int(self.data_portion * len(data_train))
        data_part = np.random.permutation(np.array(data_train))[:n]
        
        n_feat = int(self.feature_portion * (len(data_train.T) - 1))
        total_feature_list = [i for i in range(len(data_train[0]))]
        feature_part = np.random.permutation(np.array(total_feature_list))[:n]
        
        return data_part, feature_part
        
    def fit(self, data_train):
        '''
        only select a subset of features
        '''
        
        data_part, feature_part = get_portion(data_train)
        for i in range(self.n):
            self.root[i] = self.create_tree(data_part, feature_part)
    
    def average_predictions(self, test_data):
        '''
        use the predict method but average out all the results
        '''
        predictions = []
        for t in test_data:
            tree_preds = [predict(t, self.root[i]) for i in range(self.nbags)]
            predictions.append(get_most_common(tree_preds), single_list=True)
    

If you're interested in the details, I highly recommend checking out the book "Ensemble Methods in Data Mining" by Seni and Elder, which has a nice explanation of how these kinds of ensemble methods have developed and what they all have in common.
    