# Course lecture
[2 - Random Forest Deep Dive](http://course18.fast.ai/lessonsml1/lesson2.html)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/bogeholm/fastai-intro-to-ml/blob/master/2-random-forest-deep-dive.ipynb)

# Notes
## Lessons from last time
- Learn the metric (RMSE of log sale price)
- All columns should be _numbers_
  - `datetime` -> numeric and `bool`ean fields (`dayofweek`, `dayofyear`, ...)
  - All string values must be categorized
- Missing values replaced by median
- Additional boolean column `f'{colname}_na'` added to indicate missing values
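
The last two points can be sketched as follows. This is a minimal illustration, not the code from the previous notebook; the helper name `fill_median_with_flag` and the toy column are made up for the example.

```python
import numpy as np
import pandas as pd


def fill_median_with_flag(df, colname):
    """Return a copy of df where NaNs in `colname` are replaced by the
    column median, plus a boolean `<colname>_na` indicator column.
    (Hypothetical helper, for illustration only.)"""
    out = df.copy()
    # Record where values were missing *before* filling them in
    out[f'{colname}_na'] = out[colname].isna()
    out[colname] = out[colname].fillna(out[colname].median())
    return out


# Toy data, invented for illustration
toy = pd.DataFrame({'hours': [10.0, np.nan, 30.0, 50.0]})
filled = fill_median_with_flag(toy, 'hours')
print(filled)
```

The indicator column lets the model learn from the fact that a value was missing, which the median alone would hide.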

## This time
- Discussion of [$R^2$](https://en.wikipedia.org/wiki/Coefficient_of_determination) in the context of [overfitting](https://en.wikipedia.org/wiki/Overfitting) and the [validation set](https://en.wikipedia.org/wiki/Training,_validation,_and_test_sets)
- Training, validation and test sets: the model is fitted on the **training set**, hyperparameters are chosen on the **validation set**, and the **test set** is reserved for a final check once training and tuning are done
- If the test set represents a different time period than the training set, so should the validation set
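
A time-ordered split along the lines of the last point might look like this. It is a sketch under assumptions: `split_by_time` is a hypothetical helper, and the toy dates and prices are invented.

```python
import pandas as pd


def split_by_time(df, date_col, n_valid):
    """Sort by date and hold out the most recent `n_valid` rows for validation.
    (Hypothetical helper, for illustration only.)"""
    ordered = df.sort_values(date_col).reset_index(drop=True)
    n_train = len(ordered) - n_valid
    return ordered.iloc[:n_train], ordered.iloc[n_train:]


# Toy data, invented for illustration
toy = pd.DataFrame({
    'saledate': pd.to_datetime(['2011-03-01', '2009-01-15', '2010-06-30',
                                '2011-11-20', '2008-05-05']),
    'price': [5, 2, 3, 6, 1],
})
train, valid = split_by_time(toy, 'saledate', n_valid=2)
print(train['saledate'].max(), '<', valid['saledate'].min())
```

Every validation row is later than every training row, which mimics predicting future sales from past ones.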

### Speeding things up
- Choose a smaller subset of the data for interactive use
- Build model with single tree (32:00)
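
Both ideas can be tried on synthetic data. The sketch below assumes random toy data instead of the bulldozers set; the subset size and `max_depth=3` are arbitrary choices that keep the tree small enough to inspect.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, invented for illustration: y depends mainly on the first column
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(2000, 3))
y = 2.0 * X[:, 0] + rng.normal(0, 0.5, size=2000)

# A small random subset for fast, interactive experiments
idx = rng.choice(len(X), size=300, replace=False)
X_small, y_small = X[idx], y[idx]

# A "forest" of one shallow tree, with bootstrapping switched off
single_tree = RandomForestRegressor(n_estimators=1, max_depth=3,
                                    bootstrap=False, random_state=0)
single_tree.fit(X_small, y_small)

tree = single_tree.estimators_[0]
print(f'depth: {tree.get_depth()}, leaves: {tree.get_n_leaves()}')
print(f'R^2 on all rows: {single_tree.score(X, y):.3f}')
```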

### Random Forests
- A tree consists of a set of binary decisions
- How do you split the trees?
  - E.g. try every possible split of every possible value, and find the one with the best weighted average of MSE in the two groups
- The tree is finished when there is only one element in each leaf node (default `scikit-learn` behavior)
- A *Forest* is made of *Trees*, combined by bagging (bootstrap aggregating), a method of *ensembling*. See also "Bag of Little Bootstraps".
- Method: Create a largish set of trees that each massively overfit on a random subset of the data. They all have random errors. What is the average of a set of random errors? (Close to zero, as long as the errors are uncorrelated.)
- `scikit-learn` picks random subsets with replacement, i.e. bootstrapping.
- The goal of Random Forests is to come up with 'predictive, but poorly correlated trees' (find link to original 1990's paper)
- Uncorrelated trees are more important than accurate trees - see `scikit-learn`'s `ExtraTreesRegressor`
- When turning off bootstrapping (`bootstrap=False`), every tree is grown on the same rows, so a shallow tree will be contained in a deeper tree
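
The split search described above can be sketched for a single feature. This is a simplified illustration, not `scikit-learn`'s implementation (which, among other things, places thresholds midway between values); `best_split` is a hypothetical helper.

```python
import numpy as np


def best_split(x, y):
    """Try every candidate threshold on one feature and return the
    (threshold, score) pair that minimises the sample-weighted average
    MSE of the two resulting groups. (Hypothetical helper.)"""
    best_thr, best_score = None, np.inf
    # Skip the largest value: splitting there would leave the right group empty
    for thr in np.unique(x)[:-1]:
        left, right = y[x <= thr], y[x > thr]
        # Variance around the group mean is that group's MSE
        score = (len(left) * left.var() + len(right) * right.var()) / len(y)
        if score < best_score:
            best_thr, best_score = thr, score
    return best_thr, best_score


# Toy data, invented for illustration: two clearly separated groups
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([5.0, 5.0, 5.0, 20.0, 20.0, 20.0])
thr, score = best_split(x, y)
print(thr, score)
```

A full tree repeats this search over every feature, takes the best split overall, and then recurses into each of the two groups.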

### Hyperparameters
#### Number of trees
- `scikit-learn` parameter `n_estimators`

#### Out-of-bag score
- For each tree, use the rows that were left out of its bootstrap sample as a validation set (`oob_score=True` in `scikit-learn`) (1:12:00)
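
A minimal sketch of the out-of-bag score on synthetic data (the data is invented for illustration). After fitting with `oob_score=True`, the fitted forest exposes the result as `oob_score_`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, invented for illustration
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 3))
y = 2.0 * X[:, 0] + rng.normal(0, 0.5, size=500)

forest = RandomForestRegressor(n_estimators=40, oob_score=True, random_state=0)
forest.fit(X, y)

# The training score is optimistic; the OOB score only uses, for each row,
# the trees that did not see that row during training
print(f'training R^2: {forest.score(X, y):.3f}')
print(f'OOB R^2:      {forest.oob_score_:.3f}')
```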

#### Subsampling
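- If you pick a different random subset for each tree, training time depends on the subset size rather than on how much data you have, and across many trees the forest can still see all of the data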
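
The lecture does this with `set_rf_samples` from `fast.ai`. As an alternative sketch, `scikit-learn` 0.22 and later offer a `max_samples` parameter on `RandomForestRegressor` that draws a fixed number of rows per tree. The data and the value `200` below are invented for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, invented for illustration
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(5000, 3))
y = 2.0 * X[:, 0] + rng.normal(0, 0.5, size=5000)

# Each tree is trained on 200 rows drawn with replacement from the 5000 available
# (max_samples requires bootstrap=True, which is the default)
forest = RandomForestRegressor(n_estimators=20, max_samples=200, random_state=0)
forest.fit(X, y)
print(f'R^2 on all rows: {forest.score(X, y):.3f}')
```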

#### Growing trees less deeply
- `scikit-learn` parameter `min_samples_leaf` - determines the minimum number of data points in the leaves. `3` suggested.

#### Maximum number of features
- `scikit-learn` parameter `max_features` - takes a different random subset of columns at each split point. The less correlated your trees are, the better. Try `0.5`, `'sqrt'` or `'log2'`
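
The effect of `min_samples_leaf` and `max_features` on tree size can be seen in a small sketch. The data is synthetic and the helper `total_leaves` is made up for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, invented for illustration
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(1000, 4))
y = 2.0 * X[:, 0] + X[:, 1] + rng.normal(0, 0.5, size=1000)

default = RandomForestRegressor(n_estimators=20, random_state=0).fit(X, y)
tuned = RandomForestRegressor(n_estimators=20, min_samples_leaf=3,
                              max_features=0.5, random_state=0).fit(X, y)


def total_leaves(forest):
    """Sum the number of leaves over all trees (hypothetical helper)."""
    return sum(tree.get_n_leaves() for tree in forest.estimators_)


# Requiring at least 3 rows per leaf stops the trees earlier,
# so the tuned forest is considerably smaller
print(f'default: {total_leaves(default)} leaves')
print(f'tuned:   {total_leaves(tuned)} leaves')
```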

## Implementation plan
- [x] Q: Can you fit on a DataFrame with categories added, but columns still in place? A: [Not this way at least](https://github.com/bogeholm/dataworks/blob/master/dev/jupyter/df-categories-and-codes.ipynb)
- [ ] Split data to smaller, 'interactive' set
- [ ] Draw a single tree
- [ ] Run the `RandomForestRegressor`, see predictions of the single trees (`estimators_`) (51:16)
- [ ] Define `print_score()` with `oob_score`
- [ ] Predict with each tree in a Forest, and compare with the mean (1:05)
- [ ] Plot metrics of one tree, then average of two trees, then ... (1:06:55)
- [ ] Run with 20, 30, 40 trees (1:08:10)
- [ ] Look at `set_rf_samples` and `reset_rf_samples` (from `fast.ai`, a "horrible hack") (1:16:00 and 1:19)
...
- [ ] Compare results with `ExtraTreesRegressor`
- [ ] Move `get_basepath()` to separate file 
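
The steps "predict with each tree" and "plot metrics of one tree, then average of two trees" could be approached roughly like this. It is a sketch on synthetic data, not the notebook's eventual implementation.

```python
import numpy as np
from sklearn import metrics
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, invented for illustration
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(1000, 3))
y = 2.0 * X[:, 0] + rng.normal(0, 2.0, size=1000)
X_train, y_train = X[:800], y[:800]
X_valid, y_valid = X[800:], y[800:]

forest = RandomForestRegressor(n_estimators=10, random_state=0)
forest.fit(X_train, y_train)

# One row of predictions per tree: shape (n_trees, n_validation_rows)
tree_preds = np.stack([tree.predict(X_valid) for tree in forest.estimators_])

# The forest's prediction is the mean over its trees
forest_pred = forest.predict(X_valid)
print(np.allclose(tree_preds.mean(axis=0), forest_pred))

# R^2 of the first tree, then the average of the first two trees, and so on
running_r2 = [metrics.r2_score(y_valid, tree_preds[:n].mean(axis=0))
              for n in range(1, len(tree_preds) + 1)]
print([round(score, 3) for score in running_r2])
```

Plotting `running_r2` against the number of trees shows how quickly the gain from adding more trees flattens out.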

# Tips
## Notebook tips
- `?` to view documentation of imported functions
- `??` to view source code of imported functions

## Python tips
- `print(f'{variable}_string')`
- Saving with [`to_feather`](https://github.com/wesm/feather/tree/master/python)

## ML Tips
- "Most people run all of their models, on all of their data, all of the time" (around 1:20). This is pointless. Do most of the modelling on a reasonably large subset

# Setup
## Install [dataworks](https://github.com/bogeholm/dataworks)

In [1]:
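# Install utilities (comment out if already installed).
# GitHub no longer serves the unauthenticated git:// protocol, so use https:
!pip install --upgrade --quiet git+https://github.com/bogeholm/dataworks.git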

## Imports

In [2]:
import numpy as np
import os
import pandas as pd
import sys

from dataworks.df_utils import (add_datefields, 
                                add_nan_columns, 
                                categorize_df,
                                inspect_df, 
                                numeric_nans, 
                                summarize_df, 
                                )

from IPython.display import display
from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics

pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 10)

## Mount Google Drive (for [Colab](https://colab.research.google.com/) integration)

In [18]:
def get_basepath(relative_dir_local='data', relative_dir_google='data'):
    """ Return path to base directory depending on whether the
        notebook is running locally, or in Google Colab. If the notebook
        is running in Colab, Google Drive is mounted and data is loaded from there
    """
    GOOGLE_DRIVE_MOUNT = '/content/drive'
    # Root of the mounted Drive, equivalent to `cd ~` in Google Drive
    GOOGLE_DRIVE_HOME = os.path.join(GOOGLE_DRIVE_MOUNT, 'My Drive')
    # `__file__` is undefined in Jupyter; abspath('') gives the working directory
    # https://stackoverflow.com/questions/39125532/file-does-not-exist-in-jupyter-notebook
    JUPYTER_CWD = os.path.abspath('')

    if 'google.colab' in sys.modules:
        # Notebook is running in Google Colab
        from google.colab import drive
        drive.mount(GOOGLE_DRIVE_MOUNT)
        return os.path.join(GOOGLE_DRIVE_HOME, relative_dir_google)
    else:
        return os.path.join(JUPYTER_CWD, relative_dir_local)

In [19]:
DATAPATH = os.path.join(get_basepath(), 'bulldozers')
print(f'DATAPATH: {DATAPATH}')

C:\Users\tbm\Documents\machinelearning\fastai-intro-to-ml
DATAPATH: C:\Users\tbm\Documents\machinelearning\fastai-intro-to-ml\data\bulldozers


## Notebook utilities

In [20]:
def display_allrows(df):
    """ Override max rows and display them all
    """
    with pd.option_context('display.max_rows', len(df)):
        display(df)

# Start learning ...

In [21]:
filename = os.path.join(DATAPATH, 'Train.zip')
df_raw = pd.read_csv(filename, low_memory=False, parse_dates=['saledate'])

In [22]:
df_raw.head()

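|   | SalesID | SalePrice | MachineID | ModelID | datasource | ... | Backhoe_Mounting | Blade_Type | Travel_Controls | Differential_Type | Steering_Controls |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1139246 | 66000 | 999089 | 3157 | 121 | ... | NaN | NaN | NaN | Standard | Conventional |
| 1 | 1139248 | 57000 | 117657 | 77 | 121 | ... | NaN | NaN | NaN | Standard | Conventional |
| 2 | 1139249 | 10000 | 434808 | 7009 | 121 | ... | NaN | NaN | NaN | NaN | NaN |
| 3 | 1139251 | 38500 | 1026470 | 332 | 121 | ... | NaN | NaN | NaN | NaN | NaN |
| 4 | 1139253 | 11000 | 1057373 | 17311 | 121 | ... | NaN | NaN | NaN | NaN | NaN |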


### Preprocessing
Essentially a summary of steps in [1-intro-to-random-forests](1-intro-to-random-forests.ipynb), but excluding the 'NaN indicator' columns

In [23]:
#### Log sale price

In [24]:
df_proc = df_raw.copy(deep=True)
if 'SalePrice' in df_proc.columns:
    df_proc['LogSalePrice'] = np.log(df_proc['SalePrice'])
    df_proc.drop(columns=['SalePrice'], inplace=True)

In [25]:
#### Saledate

In [26]:
if 'saledate' in df_proc.columns:
    df_proc = add_datefields(df_proc, 'saledate', drop_original=True)
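
For readers without `dataworks` installed, a date expansion of this kind can be approximated with the pandas `.dt` accessor. The sketch below is an assumption about the general idea, not the actual `add_datefields` implementation; `expand_date` and the generated column names are hypothetical.

```python
import pandas as pd


def expand_date(df, col, drop_original=True):
    """Replace a datetime column by numeric and boolean fields.
    (Hypothetical stand-in; `add_datefields` may produce other columns.)"""
    out = df.copy()
    dates = out[col].dt
    out[f'{col}_year'] = dates.year
    out[f'{col}_month'] = dates.month
    out[f'{col}_dayofweek'] = dates.dayofweek  # Monday is 0
    out[f'{col}_dayofyear'] = dates.dayofyear
    out[f'{col}_is_month_end'] = dates.is_month_end
    if drop_original:
        out = out.drop(columns=[col])
    return out


# Toy data, invented for illustration
toy = pd.DataFrame({'saledate': pd.to_datetime(['2011-01-31', '2011-02-15'])})
expanded = expand_date(toy, 'saledate')
print(expanded)
```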

In [27]:
#### Numerical nulls

In [28]:
df_proc['auctioneerID'] = df_proc['auctioneerID'].fillna(
    df_proc['auctioneerID'].max() + 1
)

df_proc['MachineHoursCurrentMeter'] = df_proc['MachineHoursCurrentMeter'].fillna(
    df_proc['MachineHoursCurrentMeter'].median()
)

In [30]:
#### Categorize

In [31]:
(df_cats, catcodes) = categorize_df(df_proc)

In [32]:
summarize_df(df_cats)

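|   | type | ncols | ncols_w_nans | n_nans | n_total | nan_frac |
|---|---|---|---|---|---|---|
| 0 | bool | 4 | 0 | 0 | 1604500 | 0.0 |
| 1 | float64 | 3 | 0 | 0 | 1203375 | 0.0 |
| 2 | int16 | 4 | 0 | 0 | 1604500 | 0.0 |
| 3 | int64 | 9 | 0 | 0 | 3610125 | 0.0 |
| 4 | int8 | 40 | 0 | 0 | 16045000 | 0.0 |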
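
The many small integer columns and the absence of NaNs in the summary are consistent with string columns having been replaced by pandas category codes. The sketch below shows that mechanism; it is an assumption about what `categorize_df` does, and the helper `categorize` is hypothetical.

```python
import pandas as pd


def categorize(df):
    """Replace every non-numeric column by its integer category codes and
    return the mapping from code to label for each converted column.
    (Hypothetical stand-in; `categorize_df` may behave differently.)"""
    out = df.copy()
    codes = {}
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            continue  # numeric and boolean columns are left untouched
        cat = out[col].astype('category')
        codes[col] = dict(enumerate(cat.cat.categories))
        out[col] = cat.cat.codes  # missing values become -1
    return out, codes


# Toy data, invented for illustration
toy = pd.DataFrame({'size': ['Small', 'Large', None, 'Small'],
                    'hours': [1.0, 2.0, 3.0, 4.0]})
toy_cats, toy_codes = categorize(toy)
print(toy_cats)
print(toy_codes)
```

With fewer than 128 distinct labels pandas stores the codes as `int8`, and a missing label becomes the code `-1` rather than a NaN.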
