In [1]:
%matplotlib inline
%store -r
from time import time
from math import sqrt, floor
import numpy as np
from sklearn.utils import shuffle
from sklearn.metrics import precision_recall_fscore_support as score
from time import time
from sklearn.metrics import log_loss, accuracy_score
import pandas as pd
from multiprocessing.dummy import Pool as ThreadPool 
from IPython.core.debugger import set_trace
import pickle
import os
from rfc import RandomForest
pd.options.display.float_format = '{:.3f}'.format

import matplotlib.pyplot as plt
plt.style.use = "default"

In [2]:
train = pd.read_csv("cleaned_testData1.csv")
labels = pd.read_csv("cleaned_trainLabel1.csv")

In [3]:
train = train.drop(train.columns[0], axis=1)
labels = labels.drop(labels.columns[0], axis=1)

In [4]:
model_file = "model.binary"
train_file = "model-train.csv"
test_file = "model-test.csv"

def save_model(model, train_set, test_set):
    with open(model_file, "wb") as f:
        pickle.dump(model, f)
    train_set.to_csv(train_file)
    test_set.to_csv(test_file)

def read_model():
    if os.path.exists("model.binary"):
        with open("model.binary", "rb") as f:
            model = pickle.load(f)
        
        return pd.read_csv(train_file), pd.read_csv(test_file), model
    else:
        False

In [5]:
def merge(df, labels):
    return labels.merge(df, left_index=True,right_index=True)

In [6]:
def mprint(*args):
    for arg in args:
        print(arg)
        print(" ")

---  

# Model Building

## Which algorithm to use?
We'll use a **random forest classifier** (rfc) with bootstrapping and feature bagging optimizations because:
- ease of implementation
- rfcs handle multi-class predictions well without more additional effort
- works well with high dimensional data
- we'll choose use random forest as opposed to boosted trees since we have highly dimensional data
- with a reasonably high probability, can be used with the other datasets for this project since the algorithm is very robust

## The Algorithm
We'll use the CART algorithm for splitting since we have continuous data.  
  
[Full example](https://machinelearningmastery.com/classification-and-regression-trees-for-machine-learning/)  
  
Steps:
1. Initialize Tree
2. For each column, calc best split across all rows based using gini impurity score - [exmplanation](https://en.wikipedia.org/wiki/Decision_tree_learning#Gini_impurity) | [exmaple](https://www.researchgate.net/post/How_to_compute_impurity_using_Gini_Index) | [useful blog](http://dni-institute.in/blogs/cart-algorithm-for-decision-tree/)
3. Split the dataset based on the split condition with the highest gini score and add both sets as leaves on a tree node. The node represents a decision point, that being the condition with the highest gini score.
3. Repeat 2 & 3 until an arbitrary minimum number of rows are left
4. Prune tree

ideas:
- instead of using the raw values, categorize the numbers as # of stds away from mean
- > Alternatively, the random forest can apply weight concept for considering the impact of result from any decision tree. Tree with high error rate are given low weight value and vise versa. This would increase the decision impact of trees with low error rate - [medium post](https://medium.com/machine-learning-101/chapter-5-random-forest-classifier-56dc7425c3e1)
- [parameters to  tune](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)
https://www.analyticsvidhya.com/blog/2016/04/complete-tutorial-tree-based-modeling-scratch-in-python/
- https://www.analyticsvidhya.com/blog/2015/06/tuning-random-forest-model/
- https://stats.stackexchange.com/questions/260460/optimization-of-a-random-forest-model
- https://followthedata.wordpress.com/2012/06/02/practical-advice-for-machine-learning-bias-variance/
- https://towardsdatascience.com/beyond-accuracy-precision-and-recall-3da06bea9f6c

---  
# Model Training & Tuning
## Context
Now that we have our classifier, let's think about how we're going to train the model. 

We'll also measure performance through [precision](https://en.wikipedia.org/wiki/Precision_and_recall) & [recall](https://towardsdatascience.com/beyond-accuracy-precision-and-recall-3da06bea9f6c) - it tells us, for each class, how well the model identifies all cases of that class (recall) and how well it can correctly classify those cases (precision). From wikipedia:
> Suppose a computer program for recognizing dogs in photographs identifies eight dogs in a picture containing 12 dogs and some cats. Of the eight dogs identified, five actually are dogs (true positives), while the rest are cats (false positives). The program's precision is 5/8 while its recall is 5/12.

![precision & recall formulas](https://cdn-images-1.medium.com/max/2000/1*6NkN_LINs2erxgVJ9rkpUA.png)
We can use the [f1 score](https://en.wikipedia.org/wiki/F1_score) to maximize precision and recall when testing different models.  
![f1 score formula](https://cdn-images-1.medium.com/max/1600/1*UJxVqLnbSj42eRhasKeLOA.png)

Recall and precision seem to be very related to bias and variance of the model, so we can maximize the f1 score by tuning the model to affect these.
#### Minimizing bias
- use new/different features
- increase the size of the trees (increases variance)
- increase the number of trees in the forest

#### Minimizing variance
- decrease the number of features
    + probably want to aim to features that are correlated and/or collapse the overall number of features through PCA
- use more data for each tree  

  
Beware: too much completixy is bad & not enough complexity is also bad  
![bias variance tradeoff](http://scott.fortmann-roe.com/docs/docs/BiasVariance/biasvariance.png)  
  

#### Stability
We need to make sure to train the classifier on as many data points as possible while also leaving enough to test to reliably tell how well the classifier actually performs. We'll use [k-fold cross validation](https://www.analyticsvidhya.com/blog/2015/11/improve-model-performance-cross-validation-in-python-r/):  
  
> 1. Randomly split your entire dataset into k ”folds”.
2. For each k folds in your dataset, build your model on k – 1 folds of the data set. Then, test the model to check the effectiveness for kth fold.
3. Record the error you see on each of the predictions.
4. Repeat this until each of the k folds has served as the test set.

## Procedure
1. Record and save an input configuration for the random forest
1. Separate data into k folds
2. For each fold *k*: 
    1. train the classifier on k-1 folds
    2. predict the k-th fold
    3. measure the: accuracy, [logarithmic](http://wiki.fast.ai/index.php/Log_Loss) [loss](https://towardsdatascience.com/metrics-to-evaluate-your-machine-learning-algorithm-f10ba6e38234#f217), recall, precision, and f1-score
3. Record the performance measures & associate it with the input configuration
3. Evaluate the overall performance difference across all configurations
4. Change **at most** 1 variable from the input configuration that optimizes perfomance & repeat steps 1-5

In [42]:
class Experiment:
    def __init__(self, df, labels, model_class, model_config_names, model_config_init_values, k_folds=10):        
        self.data = df
        self.labels = labels
        self.label_values = labels["label"].unique()
        self.model_class = model_class
        self.model_config_names = model_config_names
        
        self.init_experiments(model_config_names, model_config_init_values)
        
        self.k_folds = k_folds
    
    def tweak(self, parameter, new_value):
        self.current_experiment()[parameter] = new_value
        return self
    
    def run_trial(self):
        exp_num = self.experiments.shape[0]
        mprint("Running trail #{}\n------------------------------------".format(exp_num))

        performance_results = []
        folds, labels = self.split_k_folds()
        
        config = self.current_model_config()
        labels = pd.DataFrame(labels)
        
        for i, test_train in enumerate(folds):
            mprint(
                "*************************",
                "Running fold {} of {}".format(i+1, self.k_folds),
                "*************************"
            )
            
            model = self.model()
            t1 = time()
            model.train(test_train[1], pd.DataFrame(labels), *config)
            mprint("fold {} took {}s".format(i+1, time() - t1))
            performance = self.measure_performance(model, merge(test_train[0], labels), self.label_values)
            
            self.record_trial(i+1, performance, test_train[1], test_train[0], i+1 == len(folds))
        
        return self.trial_results
        
    def trial_results(self, trial_num=None):
        if trial_num == None:
            trial_num = self.experiments["trial_num"].max() - 1
        
        return self.experiments[self.experiments["trial_num"] == trial_num], self.experiment_data[int(trial_num)]    

#<--------  PRIVATE METHODS -------->
    def current_experiment(self):
        return self.experiments[self.experiments["trial_num"] == self.experiments["trial_num"].max()]
    
    def current_model_config(self):
        return self.current_experiment()[self.model_config_names].astype("int32").values[0]
        
    def prev_experiment(self):
        return self.experiments[self.experiments["trial_num"] == (self.experiments["trial_num"].max() - 1)]
    
    def trial_data(self, trial_num=None):
        if trial_num == None:
            trial_num = self.experiments["trial_num"].max()-1
        
        return this.experiment_data[trial_num]
    
    def model(self):
        return self.model_class(verbose=False)
    
    def init_experiments(self, config_names, config_values):
        derived_cols = self.performance_measures() + ["trial_num"]
        all_cols = config_names + derived_cols
        first_row = config_values + [np.nan for i in derived_cols]
        
        self.experiments = pd.DataFrame(columns=all_cols)
        self.experiments.loc[0] = first_row
        self.experiments.loc[0]["trial_num"] = 1
        self.experiments[config_names] = self.experiments[config_names].fillna(-1)
        self.experiment_data = []
    
    def record_trial(self, trial_num, results, train, test, final=False):
        for key in results:
            self.current_experiment()[key] = results[key]
        
        if not final:
            self.experiments.append(self.current_experiment())
            self.experiment_data.append((train, test))
            self.current_experiment()["trial_num"] = trial_num + 1
            
    def performance_measures(self):
        return ["log_loss", "class_accuracy", "precision", "recall", "fscore"]
    
    def split_k_folds(self):
        splitter = int(np.ceil(self.data.shape[0] / self.k_folds))
        df = shuffle(merge(self.data, self.labels))
        labels = df.pop("label")

        folds = []
        for i in range(1, self.k_folds+1):
            train = df.iloc[(i-1) * splitter: i * splitter]
            test = df.iloc[np.r_[0:(i-1) * splitter, i*splitter: df.shape[0]]]
            folds.append((train, test))

        return folds, labels
    
    def measure_performance(self, model, test_set, label_values=None):
        test_labels = test_set.pop("label")
        if label_values is None:
            label_values = test_labels.unique()

        predictions = test_set.apply(lambda row: model.predict(row), axis=1)
        precision, recall, fscore, support = score(test_labels, predictions, average='weighted')

        lvs = [[1 if p == 1 else 0 for l in label_values] for p in predictions]
        return {
            "log_loss" : log_loss(test_labels, lvs, normalize=True, labels=label_values),
            "class_accuracy": accuracy_score(test_labels, predictions, normalize=True),
            "precision": precision,
            "recall": recall,
            "f1-score": fscore
        }

In [43]:
init_config_names = ["num_trees", "num_features", "num_sample_rows", "max_tree_depth", "min_split_samples"]
init_config_values = [   10,          None,           None,              20,                5] # default settings = initial settings

e = Experiment(train, labels, RandomForest, init_config_names, init_config_values)

In [48]:
e.run_trial()

In [45]:
perf, train_test = e.trial_results()

## Trial 1 Notes:

These settings were default. Since we have 5 features, the range of log loss is between 0 and 1.6, meaning our log loss is pretty bad. The accuracy looks good, but that's just because class 1 has a higher chance to appear in general. As we see in iteration 3, accuracy and recall are both _.267_. That iteration probably had more diverse labels than usual, but the model only got it right $\frac{1}{5}^{th}%$ of the time. 

In [11]:
e.experiments

Unnamed: 0,num_trees,num_features,num_sample_rows,max_tree_depth,min_split_samples,log_loss,class_accuracy,precision,recall,fscore,trial_num
0,10.0,-1.0,-1.0,20.0,5.0,1.609,0.733,0.538,0.733,,1.0
1,10.0,-1.0,-1.0,20.0,5.0,1.609,0.733,0.538,0.733,,2.0
2,10.0,-1.0,-1.0,20.0,5.0,1.609,0.667,0.444,0.667,,3.0
3,10.0,-1.0,-1.0,20.0,5.0,1.609,0.6,0.36,0.6,,4.0
4,10.0,-1.0,-1.0,20.0,5.0,1.609,0.667,0.444,0.667,,5.0
5,10.0,-1.0,-1.0,20.0,5.0,1.609,0.733,0.538,0.733,,6.0
6,10.0,-1.0,-1.0,20.0,5.0,1.609,0.6,0.36,0.6,,7.0
7,10.0,-1.0,-1.0,20.0,5.0,1.609,0.867,0.751,0.867,,8.0
8,10.0,-1.0,-1.0,20.0,5.0,1.609,0.8,0.64,0.8,,9.0
9,10.0,-1.0,-1.0,20.0,5.0,1.609,0.786,0.617,0.786,,10.0
