In [3]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import csv
from random import choices
import time

### Questions
* should the bootstrapped datasets be the same size as the original dataset? or smaller?
* do i need an instance var for the datasets or just the OOB samples for each tree?

### To Do
* find new dataset
* function to pretty print tree
* clean up code to make more concise (use list comprehensions instead of loops)

## To Explain

### Decision Tree
* purity/impurity
* entropy vs. Gini index
* using decision tree for prediction

### Random Forest
* bootstrap aggregating
* OOB error estimating
* pros and cons of random forest

### Extra
* compare classification with sci-kit learn functions vs. random forest

In [16]:
# load data and save feature and class vectors
X = []
y = []

#with open("Churn.csv", newline="") as f:
with open("ChurnTestMedium.csv", newline="") as f:
#with open("ChurnTest.csv", newline="") as f:
    reader = csv.reader(f)
    next(reader)
    for line in reader:
        X.append([float(num) for num in line[0:-1]]) # save features to X list
        y.append(int(line[-1])) # save class to y list

#print(X)
#print(y)

#### Summary
A decision tree is a data structure used to classify new data points. You can visualize it as an upside-down tree with one main root node. This is where we begin with a full dataset. Then the data is split on a certain feature and value, resulting in two branch nodes. Each of those nodes is similarly split, and this process continues until the final split. The last split results in terminal or leaf nodes, which contain a classification for the data point in question. This is a very simple explanation of how decision trees work.

The training process involves a training dataset, which is used to determine the best feature and value at which to split the data at each node. Once the tree is built (trained), we can use it to predict classifications for new data points and to evaluate the tree's accuracy or error rate. One decision tree on its own is prone to overfitting the data and is likely to have high variance. It will not perform well with new data, especially if there is no limit set on the number of times the data can be split (this is called the depth of the tree).

In order to reduce the error rate and the impact of overfitting, we can train many decision trees on subsets of the same dataset. This collection of decision trees is known as a random forest. The "random" part of its name refers to two key elements:

1. taking one dataset and randomly sampling it many times to create many datasets, none of which is exactly the same as any other
2. using a random sample of predictors for each tree

The first step, known as bootstrapping, is taking one dataset and creating new datasets from it by randomly sampling the points in the original set with replacement. For example, if we only had one dataset with points A, B, C, and D, and we wanted to create three bootstrapped datasets, we might have one with A, B, B, C, one with A, C, D, D, and one with B, C, C, D. We use the bootstrapped sets to train the decision trees in the random forest.

The second random element in a random forest is the random sample of predicting variables used in each decision tree. Instead of using all predictors in all trees, we randomly select a certain number ($m$) of them. One common value for $m$ is $\sqrt{p}$, where $p$ is the total number of predictors in the original training data.

In each bootstrapped dataset, a certain number of the original datapoints will likely be left out. They are known as the out-of-bag samples, and we use them to get predictions from the trees that did not use them in training. The most common prediction for each unseen data point is the overall prediction from the random forest. This method of aggregating the predictions for each out-of-bag sample is known as bootstrapped aggregating, or bagging.

Additionally, for each tree in the forest, its accuracy can be assessed using these left out data points, because they are unseen data for that tree. The average accuracy achieved by all trees in the forest in this way is known as the out-of-bag accuracy. It is useful because it reduces the variance of the random forest in comparison with an individual decision tree. 

In [23]:
# take a predictors vector and class vector, create bootstrapped datasets

class RandomForest():
    def __init__(self, num_trees=100, depth=10): # default depth?
        self.max_depth = depth
        self.num_trees = num_trees
        #self.datasets = [] # don't need instance variable for datasets, just oob samples
        self.oob_Xs = []
        self.oob_ys = []
        self.forest = []
    
    # return one bootstrapped dataset from given data
    def bootstrap(self, X, y):
        sample_idxs = choices(range(len(X)), k=len(X))
        bsX = [X[idx] for idx in sample_idxs]
        bsY = [y[idx] for idx in sample_idxs]
        # add data not included in boostrapped sample to oob_data list
        oob_xs = []
        oob_ys = []
        for i in range(len(X)):
            if X[i] not in bsX:
                oob_xs.append(X[i])
                oob_ys.append(y[i])
        #self.oob_data.append((oob_xs, oob_ys))
        self.oob_Xs.append(oob_xs)
        self.oob_ys.append(oob_ys)
        return bsX, bsY

    # return list of bootstrapped datasets (1 for each tree in forest)
    def get_datasets(self, X, y):
        bootstrap_sets = []
        for i in range(self.num_trees): # for each tree in the forest
            bootstrap_sets.append(self.bootstrap(X, y)) # get boostrapped sample, add to list
        return bootstrap_sets

    # grow forest
    def grow_forest(self, X, y):
        datasets = self.get_datasets(X, y) # get bootstrapped datasets to use building trees
        #self.datasets = self.get_datasets(X, y) # get bootstrapped datasets to use building trees (maybe don't need instance var for this)
        for i in range(self.num_trees): # for each tree and each dataset
            tree = DecisionTree(self.max_depth)
            #tree.build_tree(X, y)
            tree.build_tree(datasets[i][0], datasets[i][1]) # build tree with next bootstrapped dataset
            #tree.build_tree(self.datasets[i][0], self.datasets[i][1]) # build tree with next bootstrapped dataset
            # is this where i need to get oob predictions?
            self.forest.append(tree) # add tree to forest
    
    def calc_oob_accuracy(self):
        tot_accuracy = 0
        for i in range(self.num_trees): # for each tree in the forest
            preds = []
            for x in self.oob_Xs[i]: # for each row in that tree's oob sample
                preds.append(self.forest[i].predict(x)) # use that tree to get a prediction for that row
            tot_accuracy += calc_accuracy(preds, self.oob_ys[i]) # keep running total of accuracy
        return tot_accuracy / self.num_trees # return average accuracy across all trees

In [24]:
# build forest
myForest = RandomForest(num_trees=100, depth=10)
start_forest = time.perf_counter()
myForest.grow_forest(X, y)
finish_forest = time.perf_counter()
forest_time = finish_forest - start_forest

# assess accuracy
start_oob = time.perf_counter()
myForest.calc_oob_accuracy()
finish_oob = time.perf_counter()
oob_time = finish_oob - start_oob

# print results
print("Forest was grown in ", forest_time, " seconds.")
print("OOB error rate was calculated in ", oob_time, " seconds.")

Forest was grown in  0.11923340000021199  seconds.
OOB error rate was calculated in  0.0009731000000101631  seconds.


In [22]:
forest_time / 60

0.0019030416666737438

In [4]:
# main function
# load data
# instantiate random forest class
# call build forest method

### Utility functions
* calculate entropy
    - entropy is a measure of how heterogeneous the data is
    - we want to split the data in a way that will reduce the entropy of the set
    - in a tree that classifies perfectly, each leaf would have an entropy of 0, meaning that each leaf contains only one class
    - a tree like this would of course be overfit, so we will put a limit on the number of times the data can be split (max depth)
    - Gini index is another metric for assessing the heterogeneity of a set of data points
* calculate information gain
    - information gain is the change in entropy that is achieved by splitting a dataset in a certain way
* determine the best split
    - this function finds the best way to split the data
    - the best way is the way the results in the highest information gain
    - another way of understanding this is to split in a way that reduces the entropy in the data
    - basically we're looking to split the data so the points in each branch are as similar as possible
    - this will result in a better classification for new datapoints
* split data
    - this function takes several parameters:
        * the index of the predictor to split on
        * the value for that predictor that should be used to split the data
        * the predictors vector and class vector to be split
    - for each data point, we look at the given predictor and check whether its value is less than the given value
        * if it's less than the value, it goes into the left vector
        * if not, it goes into the right vector
        * the corresponding class from the class vector is similarly put into either the left or right vector
    - then the function returns both the left and right predictor and class vectors (4 in total)

In [8]:
# utility functions to support building decision tree

# calculate cost function for split (entropy)
def calc_entropy(y_vals):
    ent = 0
    for y_val in set(y_vals):
        prop = len([val for val in y_vals if val==y_val]) / len(y_vals)
        ent += (-1 * prop) * np.log2(prop) # update entropy using formula
    return ent

# calculate information gain
def calc_infogain(parent_yvals, left_yvals, right_yvals):
    H = calc_entropy(parent_yvals) # entropy of parent node
    #print("H: ", H)
    H_left = calc_entropy(left_yvals) # entropy of left child node
    #print("H_left: ", H_left)
    H_right = calc_entropy(right_yvals) # entropy of right child node
    #print("H_right: ", H_right)
    P_left = len(left_yvals) / len(parent_yvals)
    P_right = len(right_yvals) / len(parent_yvals)
    cond_entropy = (H_left * P_left) + (H_right * P_right) # conditional entropy to compare to parent node
    #print("cond_entropy: ", cond_entropy)
    return H - cond_entropy # difference between parent node and child node entropy

# determine best split (or no split)
def best_split(X, y):
    m = int(np.round(np.sqrt(len(X[1])),2)) # set number of predectors to test = sqrt total # predictors
    pred_idxs_to_test = np.random.choice(range(0,len(X[1])),m, replace=False) # select random subset of predictors to test
    pred_vals_to_test = np.mean(X, axis=0)[pred_idxs_to_test] # use mean value for each predictor as split value
    best_idx = 0
    best_val = 0
    max_infogain = 0
    #max_infogain, best_idx, best_val, best_left, best_right = 0, 9999999, 9999999, {}, {}
    X_left, X_right, y_left, y_right = [], [], [], []
    for i in range(len(pred_idxs_to_test)): # for each predictor in random subset
        X_l, X_r, y_l, y_r = split_data(pred_idxs_to_test[i], pred_vals_to_test[i], X, y) # split data on mean value for each predictor
        infogain = calc_infogain(y, y_l, y_r)
        if infogain > max_infogain: # determine if split increases information gain / reduces entropy
            max_infogain = infogain
            best_idx = pred_idxs_to_test[i]
            best_val = pred_vals_to_test[i]
            X_left, y_left = X_l, y_l
            X_right, y_right = X_r, y_r
            #best_left = {"X_left": X_l, "y_left": y_l}
            #best_right = {"X_right": X_r, "y_right": y_r}
    #print("max_infogain", max_infogain)
    return {"pred_idx": best_idx, "pred_val": best_val, "left": {"X_left": X_left, "y_left": y_left}, "right": {"X_right": X_right, "y_right": y_right}}

# split data
def split_data(pred_idx, pred_val, X_vals, y_vals):
        X_left, X_right, y_left, y_right = [], [], [], []
        for i in range(len(X_vals)):
            if X_vals[i][pred_idx] < pred_val:
                X_left.append(X_vals[i])
                y_left.append(y_vals[i])
            else:
                X_right.append(X_vals[i])
                y_right.append(y_vals[i])
        return X_left, X_right, y_left, y_right

### decision tree summary
* a decision tree is a structure that allows new data points to be classified based on the values of their predictors
* it may be helpful to visualize an upside down tree, with the trunk split first into two branches/nodes (left and right)
* each branch is then split into its own left and right branches, until an optimal depth is reached
* for each branch, we determine the best predictor and value at which to split
* the final left and right branches are called leaves or leaf nodes
* the leaf nodes contain the classification for a data point
* this structure can be followed for a new data point, travelling down and following either the left or right branch depending on the value of each predictor in the new data point
* with a single decision tree, we would build it and then prune it back, removing nodes as needed to achieve the lowest possible error rate when predicting on new data
* pruning minimizes the effects of overfitting, which is likely to occur with a single tree
* when building trees as part of a random forest, however, pruning is not needed
* it's acceptable for individual trees to be overfit, since overall the forest will have a lower prediction error rate than the individual trees
* when the decision tree is instantiated, it takes a maximum depth, which is the number of times the original data will be split
* the decision tree class has an instance variable, self.tree, which holds a dictionary that is built using data passed to the build_tree method once the tree is instantiated
* the structure of the tree is as follows:
    - each key/value pair is either a predictor index, predictor value, left branch or right branch
    - branches are dictionaries as well
    - each branch has key/value pairs that mimic the original structure (predictor index, predictor value, left and right branch)
    - leaf nodes have left and right values of either 0 or 1 depending on the class assigned by the tree
* build tree
    - this method takes the predictors and class vectors as well as a parent node
    - it then calls the grow tree function
* grow tree
    - this method takes the data and calls the best_split utility function to find the best initial split for the data
    - the nodes returned by the best_split function become the left and right branch of the decision tree
* split tree
    - this method calls itself in a recursive fashion and builds the decision tree, adding another layer of depth each time it is called
    - it takes a node, a parent node, and the current depth of the tree
    - for each branch (left and right) of the node, it checks whether there is data in the node
    - if further branching is needed, a new node is created with the best_split utility function, the original node's data becomes the parent node, and the function calls itself on the new node and parent node
    - this occurs until one of three posibilities occurs:
        * in the case of an empty node, a leaf node is created that contains the most common classification from the parent node
        * in the case of maximum depth reached, a leaf node is created that contains the most common classification from the original data that was in that node
        * in the case of a single data point in a node, a leaf node is created that contains that data point's class
    - when the decision tree is built and all data split as needed, the function returns and the tree instance variable has been populated
* predict
    - this method takes a new data point and follows the tree to determine which class to predict
    - it returns the class for the new data point

In [9]:
class DecisionTree(object):
    # create new instance of DecisionTree
    def __init__(self, depth):
        self.max_depth = depth
        self.tree = {}
    
    # build decision tree
    def build_tree(self, X, y, parent={}, depth=0): # does this need to take a parent node? or only split_tree needs that?
        
        # grow decision tree
        def grow_tree(X, y):
            self.tree = best_split(X, y) # get root node with best split for full data
            parent = {} # begin with empty parent node
            split_tree(self.tree, parent, 1) # call recursive function to build tree

        # split tree, called recursively
        def split_tree(node, parent_node, d):

            # save data from node to be used in split if needed
            left, right = node["left"], node["right"]
            #print("left: ", left)
            #print("right: ", right)

            # delete data from node so can reassign best classification
            del(node["left"], node["right"])
            #print("node: ", node)

            # check if node contains empty dataset
            if len(left["X_left"])==0 or len(right["X_right"])==0:
            #if not left["X_left"] or not right["X_right"]:
                # assign each branch of the node to the most common class from the parent node
                node["left"] = max(set(parent_node["left"]["y_left"]), key=parent_node["left"]["y_left"].count)
                node["right"] = max(set(parent_node["right"]["y_right"]), key=parent_node["right"]["y_right"].count)
                return

            elif d >= self.max_depth: # check if tree has been split maximum number of times
                # assign each branch of the node to the most common class from this node
                node['left'] = max(set(left['y_left']), key=left['y_left'].count)
                node['right'] = max(set(right['y_right']), key=right['y_right'].count)
                #node["left"] = max(set(parent_node["left"]["y_left"]), key=parent_node["left"]["y_left"].count)
                #node["right"] = max(set(parent_node["right"]["y_right"]), key=parent_node["right"]["y_right"].count)
                return
            else:
                # check left and right datasets to see if need to split more or make terminal node
                # assess left node
                if len(set(left["y_left"]))==1:
                    # assign each branch of the node to the most common class from this node
                    node['left'] = max(set(left['y_left']), key=left['y_left'].count)
                else:
                    # split this branch by calling split_tree function
                    node["left"] = best_split(left["X_left"], left["y_left"])
                    parent = {"left": left, "right": right}
                    split_tree(node["left"], parent, d+1)
                # assess right node
                if len(set(right["y_right"]))==1: 
                    # assign each branch of the node to the most common class from this node
                    node['right'] = max(set(right['y_right']), key=right['y_right'].count)
                    return
                else:
                    # split this branch by calling split_tree function
                    node["right"] = best_split(right["X_right"], right["y_right"])
                    parent = {"left": left, "right": right}
                    split_tree(node["right"], parent, d+1)
        
        # call grow_tree to create decision tree
        grow_tree(X, y)
                
    # predict classification for new datapoint
    def predict(self, x):
        curr_node = self.tree
        while True:
            if x[curr_node['pred_idx']] < curr_node['pred_val']:
                if type(curr_node['left'])==int:
                    return curr_node['left']
                else:
                    curr_node = curr_node['left']
                    continue
            else:
                if type(curr_node['right'])==int:
                    return curr_node['right']
                else:
                    curr_node = curr_node['right']
                    continue

In [7]:
# test decision tree
mytree = DecisionTree(5)
mytree.build_tree(X, y)
print("tree: ", mytree.tree)

tree:  {'pred_idx': 1, 'pred_val': 0.55, 'left': {'pred_idx': 8, 'pred_val': 106449.1488888889, 'left': {'pred_idx': 3, 'pred_val': 2.75, 'left': 1, 'right': 0}, 'right': 1}, 'right': {'pred_idx': 7, 'pred_val': 0.7272727272727273, 'left': {'pred_idx': 6, 'pred_val': 0.6666666666666666, 'left': 0, 'right': 1}, 'right': 0}}


In [8]:
# test predict
preds = []
for i in range(len(X)):
    preds.append(mytree.predict(X[i]))
#print("predictions: ", preds)
#print("actual vals: ", y)

### Assessing Accuracy
* calc_accuracy
    - this function takes a vector of predictions and a vector of classes
    - it checks how many of the predictions match their corresponding classes
    - it returns the percent of correct predictions
    - we can call this function using the original data used to build the tree, but this will give an inflated measure of accuracy, since training data is not new data
    - we can also use this function with the OOB samples, which will give a more accurate idea of how well the tree predicts on new data

In [12]:
# should this function be with utility functions?
# calculate accuracy
def calc_accuracy(y_hat, y):
    #print("len yhat: ", len(y_hat))
    #print("len y: ", len(y))
    correct = []
    for i in range(len(y_hat)):
        correct.append(y_hat[i]==y[i])
    return sum(correct) / len(y_hat)

# calculate number of accurately predicted rows
def calc_accurate_rows(y_hat, y):
    correct = []
    for i in range(len(y_hat)):
        correct.append(y_hat[i]==y[i])
    return sum(correct)
    
#print("accuracy: ", calc_accuracy(preds, y))

In [11]:
# calculate number of accurately predicted rows
def calc_accurate_rows(y_hat, y):
    correct = []
    for i in range(len(y_hat)):
        correct.append(y_hat[i]==y[i])
    return sum(correct)

yhat = [1, 0, 1, 0, 0, 1]
y = [1, 0, 0, 1, 0, 1]
calc_accurate_rows(yhat, y)

4

In [11]:
from statistics import mean
accuracies = []
for i in range(100):
    #print("building tree #", i+1)
    tree = DecisionTree(8)
    tree.build_tree(X, y)
    preds = []
    for j in range(len(X)):
        preds.append(tree.predict(X[j]))
    accuracies.append(calc_accuracy(preds, y))
    #print(accuracies)
print("avg accuracy: ", mean(accuracies))

IndexError: list index out of range

#### Next Steps
In time there are changes and additions I would like to make to this code so it can do more with a greater variety of data. Here are some examples:

* modify code to accept categorical (vs. only numerical) predicting variables

* use similar code to build regression tree from scratch

#### References
http://www.cs.cmu.edu/afs/cs.cmu.edu/academic/class/15381-s06/www/DTs.pdf

https://towardsdatascience.com/what-is-out-of-bag-oob-score-in-random-forest-a7fa23d710

https://towardsdatascience.com/entropy-how-decision-trees-make-decisions-2946b9c18c8

https://medium.com/analytics-steps/understanding-the-gini-index-and-information-gain-in-decision-trees-ab4720518ba8