# Random forest regression

### Training
* Set number of trees = B
* Set tree depth rule = terminal node partition size >=5
* Take a bootstrap sample from given data
* Build a random forest tree 
    * Set the sample size for the feature sample, start with m = sqrt(p) 
    
    * Take a random sample m of input features from the feature set p  
    
    * For each feature selected define cutpoints to be assessed for partitioning the feature space. Method investigation required.  
    
    * For each feature (p) - cutpoint (s) combination partition the data and calculate the mean of both partitions.  
    
    * Calculate the combined RSS of both partitions  
    
    * Select and store the feature - cutpoint combination with minnimum combined RSS of both partitions. Also store the mean of each partition.  
    
    * Apply the same random forest partition steps for each new partition until the tree depth rule is violated. That is, each new partition is treated the same as the original data and partitioned by the same rules.  
    
    * Store the tree as the list of selected partitions and the out of bag R2 and MSE for the tree  

* Repeat the random forest steps until the  number of random froest trees = B
* Store thje Random forest as a set of random forest trees.
* Once the number of random forest trees = B, for each observation in the training data, predict the outcome using the trees for which the observation is out-of-bag. 
    * To predict the outcome, apply the store partitions for each tree to the feature space of the observation.
    * Take the mean of the terminal node the observation is placed in for each tree
    * Take the mean of the predicted outcome for each tree.
    * This mean of the means is the prediction for that observation.
* Using the OOB prediction of each outcome, calculate the OOB MSE and R2 for the random forest.
* Store the OOB MSE and R2 for the Random forest
* Uing the stored training predictions, calculate the training MSE and R2
* Store the training MSE and R2 for the Random forest

### Testing
* for each observation of the test set:
    * partition the feature space acording to the random forest model provided
    * Calculate the mean of the terminal node for each tree
    * take the mean of the prediction across all trees in the forest
* From the predicted values calculate the test MSE and R2.


## Questions
* How do I store the final tree?
* How do I build the tree iteratively?
* How do I build the tree without side effects?
* Can I use depth to create the new entry at the correct nested depth of the tree? Is this necessary?
* What do I do if multiple features/splits satisfy the select_feature criteria? Take the first? 




In [1]:
from sklearn import datasets
import pandas as pd
import numpy as np
import random as rand
import math
import pprint

In [2]:
y = datasets.load_boston()['target']
X = datasets.load_boston()['data']
columns = datasets.load_boston()['feature_names']


In [5]:
train = pd.DataFrame(X)
train.columns = columns
train.loc[:, 'y'] = y


In [6]:
# tiny small data for tree by hand
np.random.seed(0)
toy_data = pd.DataFrame({'a' : np.random.choice(57, 10), 'c' : np.random.choice(11, 10),
                         'y' : np.random.choice(78, 10)})
y_toy = toy_data.loc[:,'y']
X_toy = toy_data.loc[:, ['a', 'c']]
toy_data


Unnamed: 0,a,c,y
0,44,2,46
1,47,4,37
2,53,7,25
3,0,6,77
4,3,8,72
5,3,8,9
6,39,10,20
7,9,1,69
8,19,6,47
9,21,7,64


In [7]:
# tiny data for testing tree by hand
np.random.seed(1)
toy_data_test = pd.DataFrame({'a' : np.random.choice(89, 3), 'c' : np.random.choice(45, 3),
                         'y' : np.random.choice(23, 3)})
y_toy_test = toy_data_test.loc[:,'y']
X_toy_test = toy_data_test.loc[:, ['a', 'c']]
toy_data_test

Unnamed: 0,a,c,y
0,37,9,15
1,12,11,0
2,72,5,16


In [8]:
# build a simple regression tree

# I want to create n splits in a feature with n entries - a split for each entry

# I need to store the cost function of each split so it can be compared to others in order to select the best

# I also need to store the feature and feature value of each split

# I need a function
    # to calculate the cost function
    # to create create two data sets for each value of a given feature to be assesed by the cost function
    # to apply the above function to every feature in the data set
    # to select the split with the lowest cost

# I need a function to recursively build a tree using the above steps


In [9]:
# feature partitioning

def get_feature_splits(y, f):
   
    '''split the y variable by split s in feature f'''
    
    return ([{'feature' : f.name, 'split' :  s, 'left': y[f < s],
              'right' :  y[f >= s]} for s in f])






In [10]:
def combined_RSS(y1, y2):
    return np.sum((y1 - np.mean(y1))**2) + np.sum((y2 - np.mean(y2))**2)
   
    
    
    
    

In [11]:
# cost function

def cost_function(feature_split): 
    
    '''calculate the total RSS for the partitions defined by splitting feature f at split s'''
    
    return ({'feature' : feature_split['feature'],
              'split' : feature_split['split'],
              'cost' : combined_RSS(feature_split['left'], feature_split['right'])
            }
           )



In [12]:
def flatten_list(l):
    return [item for sublist in l for item in sublist]
    

In [13]:
# partitioning y for each feature split combination

def get_split_costs(df):
    
    '''calulate the cost for each partition'''
    return flatten_list(
        [[cost_function(split) for split in get_feature_splits(
            df.iloc[:,-1], df.iloc[:,:-1].loc[:, feature])] for feature in rand.sample(list(
            df.iloc[:,:-1].columns), int(math.sqrt(len(list(df.iloc[:,:-1].columns)))))])





In [127]:

toy_data.iloc[:, -1]

0    17
1     5
2    11
Name: y, dtype: int64

In [115]:
rand.sample(list(
            toy_data.iloc[:,:-1].columns), int(math.sqrt(len(list(toy_data.iloc[:,:-1].columns)))))

['y']

In [128]:
rand.sample(list(
                 train.columns), int(math.sqrt(len(list(train.iloc[:,:-1].columns)))))

['NOX', 'RAD', 'DIS']

In [14]:
def select_node(split_costs): 
    
    
    '''select the lowest cost node'''
    
    return (rand.choice([{'feature' : split['feature'], 'split' : split['split'],
                          'cost' : split['cost']}
                         for split in split_costs if split['cost'] == np.min(
                             [split['cost'] for split in split_costs])]))
    



In [15]:
def terminal_node(node_y):
    return np.mean(node_y)

### Build a tree with basic functions with a tiny dataset

In [142]:

display(toy_data)
splits = get_split_costs(toy_data)

splits


Unnamed: 0,a,c,y
0,11,10,17
1,2,52,5
2,7,45,11


[{'feature': 'a', 'split': 11, 'cost': 18.0},
 {'feature': 'a', 'split': 2, 'cost': 72.0},
 {'feature': 'a', 'split': 7, 'cost': 18.0}]

In [144]:
display(y_toy)
display(X_toy)

select_node(get_split_costs(toy_data))




0    17
1     5
2    11
Name: y, dtype: int64

Unnamed: 0,a,c
0,11,10
1,2,52
2,7,45


{'feature': 'c', 'split': 52, 'cost': 18.0}

In [16]:
toy_left = toy_data.loc[toy_data.loc[:, 'a'] <11, :]
display(toy_left)
toy_right = toy_data.loc[toy_data.loc[:, 'a'] >= 11, :]
toy_right

Unnamed: 0,y,a,c
1,5,2,52
2,11,7,45


Unnamed: 0,y,a,c
0,17,11,10


In [17]:
display(toy_left)
flatten_list(get_split_costs(toy_left.loc[:,'y'], toy_left.loc[:, ['a', 'c']]))

flatten_list(get_split_costs(toy_right.loc[:,'y'], toy_right.loc[:, ['a', 'c']]))

toy_right_left = toy_right.loc[toy_right.loc[:, 'a'] < 11, :]

list(toy_right_left.loc[:, 'y']) + list(toy_right.loc[:, 'y'])

Unnamed: 0,y,a,c
1,5,2,52
2,11,7,45


[17]

In [18]:
display(toy_left)
# select_feature(partitions)

select_node(flatten_list(get_split_costs(toy_left.loc[:,'y'], toy_left.loc[:, ['a', 'c']])))


Unnamed: 0,y,a,c
1,5,2,52
2,11,7,45


{'feature': 'c', 'split': 52, 'cost': 0.0}

In [19]:
# tree structure idea

# partition :
# {'value' : {'feature' : feature, 'split' : split, 'cost' : cost, 'y_left' : y_left, 'y_right' : y_right}, 
#  'left' : {'value' : {'feature' : feature, 'split' : split, 'cost' : cost, 'y_left' : y_left, 'y_right' : y_right}, 
#  'left' : {}, 'right' : {}}, 'right' : {'value' : {'feature' : feature, 'split' : split, 'cost' : cost, 'y_left' : y_left, 'y_right' : y_right}, 
#  'left' : {}, 'right' : {}}
# }




In [20]:
# my_toy recursive split function

max_depth = 2
depth = 1
min_size = 1

test_partition = select_node(flatten_list(get_split_costs(y_toy,X_toy)))

left = y_toy[X_toy.loc[:, test_partition['feature']] < test_partition['split']]
right = y_toy[X_toy.loc[:, test_partition['feature']] >= test_partition['split']]


if left.empty or right.empty:
    test_partition['left'] = test_partition['right'] = terminal_node(left + right)
#     return

if depth >= max_depth:
    test_partition['left'], test_partition['right'] = terminal_node(left), terminal_node(right)
#     return

if len(left) <= min_size:
    test_partition['left'] = terminal_node(left)

else:
    test_partition['left'] = select_node(flatten_list(
        get_split_costs(y_toy[X_toy.loc[:, test_partition['feature']] < test_partition['split']],
                            X_toy.loc[X_toy.loc[:, test_partition['feature']] < test_partition['split']]
                           )))
    
if len(right) <= min_size:
    test_partition['right'] = terminal_node(right)

else:
    test_partition['right'] = select_node(flatten_list(
        get_split_costs(y_toy[X_toy.loc[:, test_partition['feature']] >= test_partition['split']],
                            X_toy.loc[X_toy.loc[:, test_partition['feature']] >= test_partition['split']]
                           )))
    
    

    
display(test_partition)

{'feature': 'c',
 'split': 45,
 'cost': 18.0,
 'left': 17.0,
 'right': {'feature': 'c', 'split': 52, 'cost': 0.0}}

In [16]:
def recursive_feature_split(y, X, node, max_depth, min_size, depth):
    '''recursively applies select_node to the input data and resulting nodes until user 
       specified limits are reached '''
    
    left = y[X.loc[:, node['feature']] < node['split']]
    right = y[X.loc[:, node['feature']] >= node['split']]

    if left.empty or right.empty:
        node['left'] = node['right'] = terminal_node(left + right)
        return

    if depth >= max_depth:
        node['left'], node['right'] = terminal_node(left), terminal_node(right)
        return

    
    if len(left) <= min_size:
        node['left'] = terminal_node(left)

    else:
        node['left'] = select_node(
            get_split_costs(y[X.loc[:, node['feature']] < node['split']],
                                X.loc[X.loc[:, node['feature']] < node['split']]
                               ))
        recursive_feature_split(y, X, node['left'], max_depth, min_size, depth + 1)

    if len(right) <= min_size:
        node['right'] = terminal_node(right)

    else:
        node['right'] = select_node(
            get_split_costs(y[X.loc[:, node['feature']] >= node['split']],
                                X.loc[X.loc[:, node['feature']] >= node['split']]
                               ))
        recursive_feature_split(y, X, node['right'], max_depth, min_size, depth + 1)
    

In [17]:
def grow_tree(y, X, max_depth, min_size):
    root = select_node(get_split_costs(y,X))
    recursive_feature_split(y, X, root, max_depth, min_size, 1)
    return root

In [369]:
grow_tree(train.loc[:,'y'], train.iloc[:,:-1], 3, 5)

KeyError: 'feature'

In [24]:
def rec(depth, direction='L'):
    if depth >= 3:
        return direction+str(depth)

    return [direction+str(depth), rec(depth+1, 'L'), rec(depth+1, 'R')]

test = rec(3)

print(test)

L3


In [18]:
def grow_tree(df, node, depth, max_depth = 1, min_size = 5 ):
    
    ''' recursively grow a decision tree by applying the function to each node it 
    returns until max depth of min size criteria is met '''
    
    left = df.loc[df.loc[:, node['feature']] < node['split']]
    right = df.loc[df.loc[:, node['feature']] >= node['split']]
    
    if left.empty or right.empty:
        return terminal_node(list(left.iloc[:, -1]) + list(right.iloc[:, -1]))
        
    elif depth >= max_depth:
        return {'node': node, 
                'left': terminal_node(left.iloc[:, -1]), 
                'right': terminal_node(right.iloc[:, -1])}
    
    else:
        return {'node' : node,
                
                'left' : (lambda x: terminal_node(list(x.iloc[:, -1])) 
                          if len(x.iloc[:, -1]) <= min_size
                          else grow_tree(x, select_node(get_split_costs(x)),
                                        depth + 1, max_depth, min_size))(left),
                
                'right' : (lambda x: terminal_node(list(x.iloc[:, -1])) 
                           if len(x.iloc[:, -1]) <= min_size 
                           else grow_tree(x, select_node(get_split_costs(x)),
                                        depth + 1, max_depth, min_size))(right)
               }
    
    

In [371]:
(lambda x: terminal_node(list(x.iloc[:, -1])) if len(x.iloc[:, -1]) <= 1 
                         else grow_tree(x, select_node(get_split_costs(x)),
                                        depth + 1, max_depth, min_size))(toy_data)

# len(toy_data.iloc[:, -1])

{'node': {'feature': 'a', 'split': 53, 'cost': 72.0},
 'left': 15.0,
 'right': 36.0}

In [372]:
grow_tree(toy_data,select_node(get_split_costs(toy_data)), 1, 3, 4)

{'node': {'feature': 'a', 'split': 53, 'cost': 72.0},
 'left': 15.0,
 'right': 36.0}

In [326]:
toy_data

Unnamed: 0,a,c,y
0,44,0,9
1,47,3,21
2,53,3,36


### bootstrap n samples from my training data

In [34]:
no_of_samples = 10
sample_size = len(toy_data.loc[:, 'y'])
toy_bootstrap = np.random.choice(toy_data.loc[:, 'y'], (no_of_samples, sample_size),
                                 replace=True)

### integrate the bootstraped sample into my random forst tree method
* I want to bootstrap 1 sample and grow one tree, bootstrap the next sample, grow the next ree, etc.

In [19]:
def bootstrap(df, random_state):
    return df.sample(len(df), replace = True, random_state = random_state)

In [377]:
bootstrap(toy_data, 1)

Unnamed: 0,a,c,y
1,47,3,21
0,44,0,9
0,44,0,9


In [379]:
grow_tree(bootstrap(train, 1), 
          select_node(get_split_costs(bootstrap(train, 1))), 1, 2, 1)

{'node': {'feature': 'RM', 'split': 6.8, 'cost': 23810.56028210336},
 'left': {'node': {'feature': 'AGE',
   'split': 70.6,
   'cost': 14485.878335251247},
  'left': 22.659493670886075,
  'right': 17.430303030303033},
 'right': {'node': {'feature': 'RM',
   'split': 7.454,
   'cost': 3163.3206699928724},
  'left': 30.46229508196721,
  'right': 44.85217391304347}}

In [20]:
# random forest idea
    # create bootstrap samples
    # create a tree for each sample

def grow_forest(df, max_depth = 1, min_size = 5, no_of_trees = 10):  
    
    ''' grow a forest by growing a random forest tree for each of B bootstrapped samples'''
    
    return [grow_tree(
        bootstrap(df, i), select_node( get_split_costs(bootstrap(df, i))),
        1, max_depth, min_size) for i in range(0, no_of_trees)]



In [21]:
grow_forest(toy_data, 3, 5, 100)

[{'node': {'feature': 'a', 'split': 3, 'cost': 4596.0},
  'left': 77.0,
  'right': {'node': {'feature': 'c', 'split': 2, 'cost': 3745.5},
   'left': 69.0,
   'right': {'node': {'feature': 'a', 'split': 21, 'cost': 3408.0},
    'left': 30.0,
    'right': 45.0}}},
 {'node': {'feature': 'a', 'split': 9, 'cost': 1856.875},
  'left': 9.0,
  'right': {'node': {'feature': 'c', 'split': 10, 'cost': 887.4285714285714},
   'left': {'node': {'feature': 'a', 'split': 44, 'cost': 332.0},
    'left': 61.0,
    'right': 43.0},
   'right': 20.0}},
 {'node': {'feature': 'c', 'split': 4, 'cost': 2884.222222222222},
  'left': 69.0,
  'right': {'node': {'feature': 'a', 'split': 39, 'cost': 2191.95},
   'left': 44.4,
   'right': 26.75}},
 {'node': {'feature': 'a', 'split': 3, 'cost': 2294.0},
  'left': 77.0,
  'right': {'node': {'feature': 'a', 'split': 19, 'cost': 510.8571428571429},
   'left': 9.0,
   'right': {'node': {'feature': 'c', 'split': 7, 'cost': 0.75},
    'left': 46.75,
    'right': 64.0}}},
 

### investigate prediction from trained random forest
* to predict the partition an observation falls into I need to compare each feature, to any splits it is used for in the pathway to the partition.


In [368]:
toy_tree = grow_tree(bootstrap(toy_data, 1), 
          select_node(get_split_costs(bootstrap(toy_data, 1))), 1, 2, 2)
display(toy_tree)
display(bootstrap(toy_data, 1))
display(toy_data_test)

{'node': {'feature': 'c', 'split': 3, 'cost': 0.0}, 'left': 9.0, 'right': 21.0}

Unnamed: 0,a,c,y
1,47,3,21
0,44,0,9
0,44,0,9


Unnamed: 0,a,c,y
0,37,9,15
1,12,11,0
2,72,5,16


In [35]:
toy_tree = grow_tree(bootstrap(toy_data, 1), 
          select_node(get_split_costs(bootstrap(toy_data, 1))), 1, 2, 1)
display(toy_tree)
display(bootstrap(toy_data, 1))
display(toy_data_test)

{'node': {'feature': 'a', 'split': 9, 'cost': 1856.875},
 'left': 9.0,
 'right': {'node': {'feature': 'c', 'split': 10, 'cost': 887.4285714285714},
  'left': 53.285714285714285,
  'right': 20.0}}

Unnamed: 0,a,c,y
5,3,8,9
8,19,6,47
9,21,7,64
5,3,8,9
0,44,2,46
0,44,2,46
1,47,4,37
7,9,1,69
6,39,10,20
9,21,7,64


Unnamed: 0,a,c,y
0,37,9,15
1,12,11,0
2,72,5,16


In [40]:
def tree_predict_row(row, tree):
    
    '''use the trained tree to partition the feature space of the test observation and return 
    the partition prediction value, which for regression is the mean of the partition'''
    
    if row[tree['node']['feature']] <  tree['node']['split']:
        if not isinstance(tree['left'], dict):
            return tree['left']
        else:
            return tree_predict_row(row, tree['left'])
    else:
         if not isinstance(tree['right'], dict):
            return tree['right']
         else:
            return tree_predict_row(row, tree['right'],)
        
       
        

In [41]:
def forest_predict_row(row, forest):
    
    ''' predicts each row of df for each tree than takes the average across 
    all trees for each row '''
    
    return np.mean([tree_predict_row(row, tree) for tree in forest])

In [42]:
def predict(df, forest):

    ''' applies predict row function to a df of test observations '''
    
    return df.apply(forest_predict_row, axis = 1, forest = forest)
    

In [43]:
tree_predict_row(toy_data_test.iloc[2,:], toy_tree)

53.285714285714285

In [439]:
tree_predict(toy_data_test, tree = toy_tree)

0     9.0
1    20.0
2    43.0
dtype: float64

In [44]:
toy_forest = grow_forest(toy_data, 3, 1, 100)

In [45]:
forest_predict_row(toy_data_test.iloc[2,:], toy_forest)

38.282428571428575

In [46]:
predict(toy_data_test, toy_forest)

0    44.007381
1    36.423381
2    38.282429
dtype: float64

In [47]:
# r_squared = 1 - (np.sum((y - predicted_y)**2)/np.sum((y - mean_y)**2))

def r_squared(y, predicted_y):
    
    return 1 - (np.sum((y - predicted_y)**2)/np.sum((y - np.mean(y))**2))

In [48]:
# mse = np.sum((y - predicted_y)**2)

def mse(y, predicted_y):
    
    return np.sum((y - predicted_y)**2)

In [49]:
predicted_y = predict(toy_data_test, toy_forest)

In [52]:
display(r_squared(toy_data.loc[:,'y'], predicted_y))

display(mse(toy_data.loc[:,'y'], predicted_y))

0.9642438412175306

180.7259289501135