# Model Building

## Which algorithm to use?
We'll use a random forest classifier (RFC) with bootstrapping and feature bagging because:
- it is easy to implement
- RFCs handle multi-class prediction without additional effort
- it works well with high-dimensional data
- we prefer a random forest over boosted trees because our data is high-dimensional
- the algorithm is robust, so it can very likely be reused with the other datasets for this project
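
As a rough illustration of the two sampling ideas named above, here is a minimal sketch in plain Python. The helper names `bootstrap_sample` and `feature_subset` are hypothetical, and the square-root rule for the subset size is an assumption (a common default, not something this project has settled on). Many random forest implementations redraw the feature subset at every split rather than once per tree.

```python
import math
import random

def bootstrap_sample(n_rows, rng):
    """Draw n_rows row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]

def feature_subset(n_features, rng):
    """Pick about sqrt(n_features) distinct feature indices (assumed rule of thumb)."""
    k = max(1, int(math.sqrt(n_features)))
    return sorted(rng.sample(range(n_features), k))

rng = random.Random(0)  # fixed seed so the sketch is repeatable
rows = bootstrap_sample(n_rows=10, rng=rng)
features = feature_subset(n_features=16, rng=rng)
```

Each tree in the forest would be trained on its own `rows` sample and would only consider the columns in `features` when searching for a split.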

## The Algorithm
We'll use the CART algorithm for splitting since we have continuous data.  
  
[Full example](https://machinelearningmastery.com/classification-and-regression-trees-for-machine-learning/)  
  
Steps:
1. Initialize Tree
2. For each column, calculate the best split across all rows using the Gini impurity score - [explanation](https://en.wikipedia.org/wiki/Decision_tree_learning#Gini_impurity) | [example](https://www.researchgate.net/post/How_to_compute_impurity_using_Gini_Index) | [useful blog](http://dni-institute.in/blogs/cart-algorithm-for-decision-tree/)
3. Split the dataset on the condition with the lowest weighted Gini impurity and add both subsets as children of a tree node. The node represents a decision point: the condition that produced the purest split.
4. Repeat 2 & 3 until only a chosen minimum number of rows is left in a node
5. Prune the tree
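
The steps above can be sketched in plain Python as follows. This is a simplified illustration, not the project's implementation: the names `gini`, `best_split`, `build_tree`, `predict` and the `min_rows` parameter are hypothetical, candidate thresholds are taken directly from the observed values, and pruning is left out.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(rows, labels):
    """Return (weighted_gini, feature_index, threshold) with the lowest impurity, or None."""
    best = None
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            left = [lab for row, lab in zip(rows, labels) if row[feature] < threshold]
            right = [lab for row, lab in zip(rows, labels) if row[feature] >= threshold]
            if not left or not right:
                continue  # skip splits that leave one side empty
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, feature, threshold)
    return best

def build_tree(rows, labels, min_rows=2):
    """Grow a tree recursively; stop on pure nodes, small nodes, or no valid split."""
    can_split = len(rows) >= min_rows and gini(labels) > 0
    split = best_split(rows, labels) if can_split else None
    if split is None:
        # Leaf: predict the most common label among the remaining rows.
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    _, feature, threshold = split
    left = [(r, l) for r, l in zip(rows, labels) if r[feature] < threshold]
    right = [(r, l) for r, l in zip(rows, labels) if r[feature] >= threshold]
    return {
        "feature": feature,
        "threshold": threshold,
        "left": build_tree([r for r, _ in left], [l for _, l in left], min_rows),
        "right": build_tree([r for r, _ in right], [l for _, l in right], min_rows),
    }

def predict(tree, row):
    """Walk from the root to a leaf by following the split conditions."""
    while "leaf" not in tree:
        branch = "left" if row[tree["feature"]] < tree["threshold"] else "right"
        tree = tree[branch]
    return tree["leaf"]

# Toy data: the first feature separates the two classes, the second does not.
rows = [[1.0, 5.0], [2.0, 4.0], [8.0, 5.5], [9.0, 4.5]]
labels = ["a", "a", "b", "b"]
tree = build_tree(rows, labels)
```

On this toy data the root splits on the first feature, since that split leaves both children pure (weighted Gini impurity of 0).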

Idea: instead of using the raw values, categorize each value by the number of standard deviations it lies from the mean.
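
One possible reading of this idea, sketched in plain Python: convert each value to a z-score and keep only its whole-number part. The function name `std_bucket`, the use of the population standard deviation, and truncation toward zero are all assumptions made for illustration.

```python
import math
import statistics

def std_bucket(values):
    """Map each value to its signed whole number of standard deviations from the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation (assumed choice)
    if std == 0:
        return [0 for _ in values]  # constant column: everything sits at the mean
    # Truncate toward zero so values within one std of the mean land in bucket 0.
    return [math.trunc((v - mean) / std) for v in values]

values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # mean 5, std 2
buckets = std_bucket(values)
```

Bucketing this way reduces the number of candidate thresholds per column to a handful, at the cost of losing resolution within each bucket.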

In [2]:
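def calc_gini(labels):
    # Gini impurity: 1 - sum of squared class probabilities (0 for an empty split).
    if len(labels) == 0:
        return 0.0
    return 1.0 - (labels.value_counts(normalize=True) ** 2).sum()

def calc_best_gini_split(df, labels):
    # Assumes df holds numeric feature columns and labels is a DataFrame
    # with a "label" column that shares df's index.
    total_rows = len(df)  # row count; df.size would count every cell
    label_column = labels["label"]
    ginis = {}
    for column in df.columns:
        for threshold in df[column].unique():
            is_left = df[column] < threshold
            left, right = label_column[is_left], label_column[~is_left]
            # Weight each side's impurity by its share of the rows.
            ginis[(column, threshold)] = (
                len(left) * calc_gini(left) + len(right) * calc_gini(right)
            ) / total_rows

    # The best split has the lowest weighted impurity; returns (column, threshold).
    return min(ginis, key=ginis.get)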