# Random forest regression

### Training
* Set number of trees = B
* Set tree depth rule = terminal node partition size >=5
* Take a bootstrap sample from given data
* Build a random forest tree 
    * Set the sample size for the feature sample, start with m = sqrt(p) 
    
    * Take a random sample m of input features from the feature set p  
    
    * For each feature selected define cutpoints to be assessed for partitioning the feature space. Method investigation required.  
    
    * For each feature (p) - cutpoint (s) combination partition the data and calculate the mean of both partitions.  
    
    * Calculate the combined RSS of both partitions  
    
    * Select and store the feature - cutpoint combination with the minimum combined RSS of both partitions. Also store the mean of each partition.  
    
    * Apply the same random forest partition steps for each new partition until the tree depth rule is violated. That is, each new partition is treated the same as the original data and partitioned by the same rules.  
    
    * Store the tree as the list of selected partitions and the out of bag R2 and MSE for the tree  

* Repeat the random forest steps until the number of random forest trees = B
* Store the random forest as a set of random forest trees.
* Once the number of random forest trees = B, for each observation in the training data, predict the outcome using the trees for which the observation is out-of-bag. 
    * To predict the outcome, apply the stored partitions of each tree to the feature values of the observation.
    * For each tree, take the mean of the terminal node the observation is placed in as that tree's prediction.
    * Take the mean of these predictions across the trees.
    * This mean of means is the prediction for that observation.
* Using the OOB prediction of each outcome, calculate the OOB MSE and R2 for the random forest.
* Store the OOB MSE and R2 for the Random forest
* Using the stored training predictions, calculate the training MSE and R2
* Store the training MSE and R2 for the Random forest
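
The out-of-bag step above can be sketched as follows. This is a minimal illustration under stated assumptions, not the notebook's implementation: the toy sizes, the seed, and the stand-in array `tree_preds` are all hypothetical, and a real forest would supply the per-tree predictions.

```python
import numpy as np

rng = np.random.default_rng(0)   # seed chosen only for reproducibility
n_obs, n_trees = 8, 5            # toy sizes, not taken from the notebook

# One row of bootstrap indices per tree (sampling with replacement).
boot_idx = rng.integers(0, n_obs, size=(n_trees, n_obs))

# oob[t, i] is True when observation i was never drawn for tree t.
oob = np.ones((n_trees, n_obs), dtype=bool)
for t in range(n_trees):
    oob[t, boot_idx[t]] = False

# Stand-in per-tree predictions (hypothetical); a real forest would supply these.
tree_preds = rng.normal(size=(n_trees, n_obs))

# Average only over the trees for which each observation is out-of-bag.
oob_counts = oob.sum(axis=0)
oob_sums = np.where(oob, tree_preds, 0.0).sum(axis=0)
oob_pred = np.full(n_obs, np.nan)      # stays NaN if no tree left the row out
has_oob = oob_counts > 0
oob_pred[has_oob] = oob_sums[has_oob] / oob_counts[has_oob]
```

With few trees an observation may be in-bag for every tree, which is why the sketch leaves its prediction as NaN instead of dividing by zero.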

### Testing
* For each observation of the test set:
    * Partition the feature space according to the random forest model provided
    * Take the mean of the terminal node the observation falls in for each tree
    * Take the mean of these predictions across all trees in the forest
* From the predicted values calculate the test MSE and R2.
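
As a rough sketch of the two test metrics, assuming the usual definitions (MSE as the mean of squared errors, R2 as one minus the ratio of residual to total sum of squares). The function names and the toy values are hypothetical and exist only for this illustration.

```python
import numpy as np

def mean_squared_error_sketch(y, y_hat):
    '''mean of the squared differences between actual and predicted values'''
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return np.mean((y - y_hat) ** 2)

def r_squared_sketch(y, y_hat):
    '''1 - RSS / TSS, the share of variation in y accounted for by y_hat'''
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    rss = np.sum((y - y_hat) ** 2)       # residual sum of squares
    tss = np.sum((y - y.mean()) ** 2)    # total sum of squares
    return 1 - rss / tss

y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]

# Squared errors are 1, 0, 1, 0, so RSS = 2 and MSE = 2 / 4.
print(mean_squared_error_sketch(y_true, y_pred))  # 0.5
# The mean of y_true is 6, so TSS = 9 + 1 + 1 + 9 = 20 and R2 = 1 - 2 / 20.
print(r_squared_sketch(y_true, y_pred))           # 0.9
```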


### To do
* R2 function
* MSE function
* Change interface to receive y and X separately at all levels (update bootstrap function)
* Optional out-of-bag error and R2 helper function in grow_forest
* Cross-validation function for growing and assessing random forests
* Prune tree function to find the best subtree when penalised for complexity

In [1]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
import random as rand
import math

In [2]:
# training data
# note: load_boston was removed in scikit-learn 1.2, so this cell needs an older release
boston = datasets.load_boston()
y = boston['target']
X = boston['data']
columns = boston['feature_names']
data = pd.DataFrame(X)
data.columns = columns
data.loc[:, 'y'] = y

train = data.sample(frac =0.7)
test = data.drop(train.index)  # the rows that were not sampled into train

display(train)
display(test)

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,y
440,22.05110,0.0,18.10,0.0,0.7400,5.818,92.4,1.8662,24.0,666.0,20.2,391.45,22.11,10.5
168,2.30040,0.0,19.58,0.0,0.6050,6.319,96.1,2.1000,5.0,403.0,14.7,297.09,11.10,23.8
450,6.71772,0.0,18.10,0.0,0.7130,6.749,92.6,2.3236,24.0,666.0,20.2,0.32,17.44,13.4
434,13.91340,0.0,18.10,0.0,0.7130,6.208,95.0,2.2222,24.0,666.0,20.2,100.63,15.17,11.7
205,0.13642,0.0,10.59,0.0,0.4890,5.891,22.3,3.9454,4.0,277.0,18.6,396.90,10.87,22.6
136,0.32264,0.0,21.89,0.0,0.6240,5.942,93.5,1.9669,4.0,437.0,21.2,378.25,16.90,17.4
69,0.12816,12.5,6.07,0.0,0.4090,5.885,33.0,6.4980,4.0,345.0,18.9,396.90,8.79,20.9
269,0.09065,20.0,6.96,1.0,0.4640,5.920,61.5,3.9175,3.0,223.0,18.6,391.34,13.65,20.7
334,0.03738,0.0,5.19,0.0,0.5150,6.310,38.5,6.4584,5.0,224.0,20.2,389.40,6.75,20.7
220,0.35809,0.0,6.20,1.0,0.5070,6.951,88.5,2.8617,8.0,307.0,17.4,391.70,9.71,26.7


Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,y
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94,33.4
11,0.11747,12.5,7.87,0.0,0.524,6.009,82.9,6.2267,5.0,311.0,15.2,396.90,13.27,18.9
12,0.09378,12.5,7.87,0.0,0.524,5.889,39.0,5.4509,5.0,311.0,15.2,390.50,15.71,21.7
25,0.84054,0.0,8.14,0.0,0.538,5.599,85.7,4.4546,4.0,307.0,21.0,303.42,16.51,13.9
26,0.67191,0.0,8.14,0.0,0.538,5.813,90.3,4.6820,4.0,307.0,21.0,376.88,14.81,16.6
29,1.00245,0.0,8.14,0.0,0.538,6.674,87.3,4.2390,4.0,307.0,21.0,380.23,11.98,21.0
33,1.15172,0.0,8.14,0.0,0.538,5.701,95.0,3.7872,4.0,307.0,21.0,358.77,18.35,13.1
35,0.06417,0.0,5.96,0.0,0.499,5.933,68.2,3.3603,5.0,279.0,19.2,396.90,9.68,18.9
38,0.17505,0.0,5.96,0.0,0.499,5.966,30.2,3.8473,5.0,279.0,19.2,393.43,10.13,24.7
39,0.02763,75.0,2.95,0.0,0.428,6.595,21.8,5.4011,3.0,252.0,18.3,395.63,4.32,30.8


In [3]:
# tiny small data for tree by hand
np.random.seed(0)
toy_data = pd.DataFrame({'a' : np.random.choice(57, 10), 'c' : np.random.choice(11, 10),
                         'y' : np.random.choice(78, 10)})
y_toy = toy_data.loc[:,'y']
X_toy = toy_data.loc[:, ['a', 'c']]
toy_data

Unnamed: 0,a,c,y
0,44,2,46
1,47,4,37
2,53,7,25
3,0,6,77
4,3,8,72
5,3,8,9
6,39,10,20
7,9,1,69
8,19,6,47
9,21,7,64


In [4]:
# tiny data for testing tree by hand
np.random.seed(1)
toy_data_test = pd.DataFrame({'a' : np.random.choice(89, 10), 'c' : np.random.choice(45, 10),
                         'y' : np.random.choice(23, 10)})


In [5]:
# feature partitioning

def get_feature_splits(y, f):
   
    '''split the y variable by split s in feature f'''
    
    return ([{'feature' : f.name, 'split' :  s, 'left': y[f < s],
              'right' :  y[f >= s]} for s in f])
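
The function above uses every observed value of the feature as a candidate split, so the smallest value always yields an empty left partition and repeated values are assessed more than once. One common alternative, sketched here as an assumption and not as the method the notebook settled on, is to use the midpoints between consecutive sorted unique values. The helper name `midpoint_cutpoints` and the toy feature are hypothetical.

```python
import numpy as np
import pandas as pd

def midpoint_cutpoints(f):
    '''hypothetical helper: midpoints between consecutive sorted unique values of f'''
    values = np.unique(f)                    # sorted, duplicates removed
    return (values[:-1] + values[1:]) / 2

f = pd.Series([6, 8, 8, 1, 6, 7], name='c')
cutpoints = midpoint_cutpoints(f)
print(cutpoints)  # [3.5 6.5 7.5]

# Every cutpoint leaves at least one observation on each side.
for s in cutpoints:
    print(s, int((f < s).sum()), int((f >= s).sum()))
```

A feature with a single unique value produces no cutpoints at all, which a caller would need to treat as "no split available".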






In [6]:
def combined_RSS(y1, y2):
    
    '''calculate the combined residual sum of squares for y1 and y2'''
    
    return np.sum((y1 - np.mean(y1))**2) + np.sum((y2 - np.mean(y2))**2)
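
A small worked example may help to check the arithmetic. The function is repeated so the snippet runs on its own, and the four toy values are made up for illustration.

```python
import numpy as np

def combined_RSS(y1, y2):
    '''same calculation as the function above, repeated so this runs standalone'''
    return np.sum((y1 - np.mean(y1))**2) + np.sum((y2 - np.mean(y2))**2)

# Split A: left mean is 2 (deviations 1 + 1), right mean is 12 (deviations 4 + 4).
print(combined_RSS(np.array([1.0, 3.0]), np.array([10.0, 14.0])))   # 10.0

# Split B on the same four values: left RSS is 0, right mean is 9 (36 + 1 + 25).
print(combined_RSS(np.array([1.0]), np.array([3.0, 10.0, 14.0])))   # 62.0
```

Split A groups similar outcomes together, so it has the lower cost and would be the one selected.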
   
    
    
    
    

In [7]:
# cost function

def cost_function(feature_split): 
    
    '''calculate the total RSS for the splits defined by splitting feature f at split s'''
    
    return ({'feature' : feature_split['feature'],
              'split' : feature_split['split'],
              'cost' : combined_RSS(feature_split['left'], feature_split['right'])
            }
           )



In [8]:
def flatten_list(l):
    return [item for sublist in l for item in sublist]
    

In [9]:
# partitioning y for each feature split combination

def get_split_costs(df):
    
    '''calculate the cost for each split'''
    
    return flatten_list(
        [[cost_function(split) for split in get_feature_splits(
            df.iloc[:,-1], df.iloc[:,:-1].loc[:, feature])] for feature in rand.sample(list(
            df.iloc[:,:-1].columns), int(math.sqrt(len(list(df.iloc[:,:-1].columns)))))])





In [10]:
def select_node(split_costs): 
    
    '''select the lowest cost node, choosing at random between ties'''
    
    # compute the minimum once rather than once per candidate split
    min_cost = np.min([split['cost'] for split in split_costs])
    
    return rand.choice([{'feature' : split['feature'], 'split' : split['split'],
                         'cost' : split['cost']}
                        for split in split_costs if split['cost'] == min_cost])
    



In [11]:
def terminal_node(node_y):
    
    '''return the prediction for the terminal node'''
    
    return np.mean(node_y)

In [12]:
def grow_tree(df, node, depth, max_depth = 1, min_size = 5 ):
    
    ''' recursively grow a decision tree by applying the function to each node it 
    returns until the max depth or min size criterion is met '''
    
    left = df.loc[df.loc[:, node['feature']] < node['split']]
    right = df.loc[df.loc[:, node['feature']] >= node['split']]
    
    if left.empty or right.empty:
        return terminal_node(list(left.iloc[:, -1]) + list(right.iloc[:, -1]))
        
    elif depth >= max_depth:
        return {'node': node, 
                'left': terminal_node(left.iloc[:, -1]), 
                'right': terminal_node(right.iloc[:, -1])}
    
    else:
        return {'node' : node,
                
                'left' : (lambda x: terminal_node(list(x.iloc[:, -1])) 
                          if len(x.iloc[:, -1]) <= min_size
                          else grow_tree(x, select_node(get_split_costs(x)),
                                        depth + 1, max_depth, min_size))(left),
                
                'right' : (lambda x: terminal_node(list(x.iloc[:, -1])) 
                           if len(x.iloc[:, -1]) <= min_size 
                           else grow_tree(x, select_node(get_split_costs(x)),
                                        depth + 1, max_depth, min_size))(right)
               }
    
    

In [13]:
def bootstrap(df, random_state):
    return df.sample(len(df), replace = True, random_state = random_state)

In [14]:
def grow_random_forest(df, max_depth = 1, min_size = 5, no_of_trees = 10):  
    return [grow_tree(
        bootstrap(df, i), select_node( get_split_costs(bootstrap(df, i))),
        1, max_depth, min_size) for i in range(0, no_of_trees)]



In [32]:
def tree_predict_row(row, tree):
    
    '''use the trained tree to partition the feature space of the test observation and return 
    the partition prediction value, which for regression is the mean of the partition'''
    
    if row[tree['node']['feature']] <  tree['node']['split']:
        if not isinstance(tree['left'], dict):
            return tree['left']
        else:
            return tree_predict_row(row, tree['left'])
    else:
        if not isinstance(tree['right'], dict):
            return tree['right']
        else:
            return tree_predict_row(row, tree['right'])
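
To see the traversal in action, here is a condensed but equivalent version of the function applied to a small hand-written tree. The tree, its split values and its leaf means are hypothetical, and plain dicts stand in for the DataFrame rows that `predict` would pass.

```python
def tree_predict_row(row, tree):
    '''condensed equivalent of the function above, repeated so this runs standalone'''
    # Go left when the feature value is below the split, otherwise right.
    branch = 'left' if row[tree['node']['feature']] < tree['node']['split'] else 'right'
    child = tree[branch]
    # A dict is a further split; anything else is a terminal node mean.
    return tree_predict_row(row, child) if isinstance(child, dict) else child

# Hand-written two-level tree in the same format grow_tree returns.
toy_tree = {'node': {'feature': 'a', 'split': 20},
            'left': 60.0,
            'right': {'node': {'feature': 'c', 'split': 5},
                      'left': 40.0,
                      'right': 25.0}}

print(tree_predict_row({'a': 10, 'c': 9}, toy_tree))  # 60.0 (a < 20)
print(tree_predict_row({'a': 30, 'c': 2}, toy_tree))  # 40.0 (a >= 20, c < 5)
print(tree_predict_row({'a': 30, 'c': 9}, toy_tree))  # 25.0 (a >= 20, c >= 5)
```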
        
       
        

In [33]:
def forest_predict_row(row, forest):
    
    ''' predicts a single row with each tree, then takes the average across 
    all trees for that row '''
    
    return np.mean([tree_predict_row(row, tree) for tree in forest])

In [34]:
def predict(df, forest):

    ''' applies predict row function to a df of test observations '''
    
    return df.apply(forest_predict_row, axis = 1, forest = forest)
    

In [35]:
def mse(y, predicted_y):
    
    ''' returns the sum of squared differences between actual y and predicted y; despite the name, this is not divided by the number of observations'''
    
    return np.sum((y - predicted_y)**2)

In [36]:
def total_sum_of_squares(y):

    '''returns the total sum of squares: the sum of squared differences between y and the mean of y, i.e. n * var(y)'''
    
    return np.sum((y - np.mean(y))**2)

In [37]:
def r_squared(y, predicted_y):
   
    ''' returns the proportion of variance explained by the model: 1 - the ratio of the residual sum of squares (as returned by mse) to the total sum of squares.'''

    return 1 - (mse(y, predicted_y)/total_sum_of_squares(y))

In [38]:
toy_forest = grow_random_forest(toy_data, 1, 5, 100)

In [39]:
display(toy_forest)

[{'node': {'feature': 'c', 'split': 7, 'cost': 4379.6},
  'left': 69.2,
  'right': 35.8},
 {'node': {'feature': 'a', 'split': 9, 'cost': 1856.875},
  'left': 9.0,
  'right': 49.125},
 {'node': {'feature': 'c', 'split': 4, 'cost': 2884.222222222222},
  'left': 69.0,
  'right': 36.55555555555556},
 {'node': {'feature': 'c', 'split': 8, 'cost': 1323.5555555555557},
  'left': 59.22222222222222,
  'right': 9.0},
 {'node': {'feature': 'c', 'split': 4, 'cost': 1848.8333333333335},
  'left': 69.0,
  'right': 38.166666666666664},
 {'node': {'feature': 'c', 'split': 10, 'cost': 1265.875},
  'left': 58.375,
  'right': 20.0},
 {'node': {'feature': 'c', 'split': 6, 'cost': 217.58333333333331},
  'left': 39.25,
  'right': 68.83333333333333},
 {'node': {'feature': 'c', 'split': 10, 'cost': 652.2222222222223},
  'left': 67.55555555555556,
  'right': 20.0},
 {'node': {'feature': 'c', 'split': 8, 'cost': 4213.714285714285},
  'left': 56.42857142857143,
  'right': 30.0},
 {'node': {'feature': 'c', 'split

In [40]:
predict(toy_data_test, toy_forest)

0    38.810437
1    51.466849
2    45.851623
3    40.487806
4    31.760687
5    39.290556
6    31.760687
7    31.760687
8    40.487806
9    44.057480
dtype: float64

In [None]:
boston_forest = grow_random_forest(train,max_depth = 5, min_size = 5, no_of_trees = 100)

In [45]:
display(boston_forest)

[{'node': {'feature': 'RM', 'split': 6.678, 'cost': 18117.63403073744},
  'left': {'node': {'feature': 'DIS',
    'split': 1.1742,
    'cost': 6796.653598484849},
   'left': 50.0,
   'right': {'node': {'feature': 'INDUS',
     'split': 18.1,
     'cost': 4117.947542293234},
    'left': {'node': {'feature': 'NOX',
      'split': 0.515,
      'cost': 1532.706918767507},
     'left': {'node': {'feature': 'DIS',
       'split': 6.32,
       'cost': 608.5017304075233},
      'left': 23.270909090909093,
      'right': 20.76206896551724},
     'right': {'node': {'feature': 'DIS',
       'split': 4.4986,
       'cost': 756.25568627451},
      'left': 18.880392156862744,
      'right': 20.829411764705885}},
    'right': {'node': {'feature': 'LSTAT',
      'split': 15.17,
      'cost': 1215.7224923747276},
     'left': {'node': {'feature': 'LSTAT',
       'split': 14.43,
       'cost': 81.34666666666666},
      'left': 21.288888888888888,
      'right': 17.311111111111114},
     'right': {'node'

In [46]:
predict(test, boston_forest)

3      29.577782
11     21.161069
12     20.788955
25     17.941047
26     19.200634
29     21.984931
33     17.851606
35     21.994023
38     21.310696
39     26.961916
42     23.139038
48     20.391083
54     20.117659
62     22.671328
65     25.383332
66     21.638625
68     20.332789
70     23.109212
73     22.038041
76     20.698989
77     20.698989
81     24.921201
86     21.315259
87     22.349490
89     30.008972
90     23.905745
93     22.519726
99     29.758701
102    20.054931
116    20.547868
         ...    
389    14.583136
391    17.698038
392    12.409299
400    10.988080
404    10.460665
406    18.555837
416    13.220964
418    11.582451
419    13.172393
423    14.733898
430    15.326083
431    17.866480
435    13.145891
437    11.485451
442    16.656972
447    13.790386
448    18.155753
449    15.694591
455    16.496888
460    19.590317
462    18.421127
464    17.855788
465    19.667771
468    15.906042
473    24.048526
476    18.230561
487    19.713839
489    16.5821

In [47]:
mse(test.iloc[:, -1], predict(test, boston_forest))

2287.227618258193

In [48]:
r_squared(test.iloc[:, -1], predict(test, boston_forest))

0.822594039851509