#  Intro to Machine Learning
This is a free course offered by [fast.ai](http://www.fast.ai/) (currently [unlisted](http://forums.fast.ai/t/another-treat-early-access-to-intro-to-machine-learning-videos/6826)). There's a github [repository](https://github.com/fastai/fastai/tree/master/courses/ml1).

## About this course
Some machine learning courses can leave you confused by the enormous range of techniques shown and can make it difficult to have a practical understanding of how to apply them.

The good news is that modern machine learning can be distilled down to a couple of key techniques that are of very wide applicability. Recent studies have shown that the vast majority of datasets can be best modeled with just two methods:

- **Ensembles of decision trees** (i.e. Random Forests and Gradient Boosting Machines), mainly for **structured data** (such as you might find in a database table at most companies)
- **Multi-layered neural networks learnt with Stochastic Gradient Descent** (SGD) (i.e. shallow and/or deep learning), mainly for **unstructured data** (such as audio, vision, and natural language)

### The lessons
In this course we'll be learning about:
- **Random Forests** 
- **Stochastic Gradient Descent**
- **Gradient Boosting** 
- **Deep Learning**

### The dataset
We will be teaching the course using the [Blue Book for Bulldozers Kaggle Competition](https://www.kaggle.com/c/bluebook-for-bulldozers): 
- "The goal of the contest is to predict the sale price of a particular piece of heavy equipment at auction based on its usage, equipment type, and configuration. The data is sourced from auction result postings and includes information on usage and equipment configurations."

### Note:
These are personal notes. For the original code, check the github repository of the course. Also, I will be importing things as I need them.


# Lecture 1
It is recommended to [watch the video-lecture first](https://youtu.be/CzdWqFTmn0Y), then follow the notebook.

# Introduction to Random Forests
## The data
For [this competition](https://www.kaggle.com/c/bluebook-for-bulldozers/data), you are **predicting the sale price of bulldozers sold at auctions**.

The data for this competition is split into three parts:

- **Train.csv** is the training set, which contains data through the end of 2011.
- **Valid.csv** is the validation set, which contains data from January 1, 2012 to April 30, 2012. You make predictions on this set throughout the majority of the competition. Your score on this set is used to create the public leaderboard.
- **Test.csv** is the test set, which won't be released until the last week of the competition. It contains data from May 1, 2012 - November 2012. Your score on the test set determines your final rank for the competition.

The key fields in train.csv are:

- **SalesID**: the unique identifier of the sale
- **MachineID**: the unique identifier of a machine. A machine can be sold multiple times
- **saleprice**: what the machine sold for at auction (only provided in train.csv)
- **saledate**: the date of the sale

### Exploring the data

In [None]:
%%time
# times the whole cell

# import pandas
import pandas as pd

# set the path to read Train.csv
path = '../input/Train.csv'

# read the data into a pandas DataFrame
df_raw = pd.read_csv(path,
                     low_memory=False, # use as much memory as necessary to figure out dtypes
                     parse_dates=["saledate"]) # read saledate column as datetime dtype

In [None]:
# dimensions (rows, columns)
df_raw.shape

It's a relatively **large dataset**.

In [None]:
# print 5 first rows
df_raw.head()

Pandas switches to a **truncated view** when there are too many rows and columns, but we can change the display options if we want to see all of them.

In [None]:
# create a function that displays up to a thousand rows and columns of a dataframe
# pd.option_context(*args) is a context manager to temporarily set options in the `with` statement context
def display_all(df):
    with pd.option_context("display.max_rows", 1000, "display.max_columns", 1000): 
        display(df)

Now we can display all the rows and columns whenever we need to.
- For readability, we'll use the **.T attribute**, equivalent to the .transpose() method, which causes the rows and columns to exchange places.
- That way, we have to **scroll down** instead of sideways to see all the columns.


In [None]:
# show the tail (last 5 rows)
# view transposed
display_all(df_raw.tail().T)

In [None]:
# generate descriptive statistics of all columns (including object columns)
# view transposed
display_all(df_raw.describe(include='all').T)

### Initial processing
#### SalePrice
- SalePrice is the **dependent variable**, the target.
- Kaggle tells us that the evaluation metric for this competition is the **RMSLE** (root mean squared log error) between the actual and predicted auction prices.
- Therefore we take the **log of the prices**, so that RMSE will give us what we need.

In [None]:
#import numpy
import numpy as np

# replacing SalePrice column values with the calculated natural logarithm of SalePrice values.
df_raw['SalePrice'] = np.log(df_raw.SalePrice)

# show head of column
df_raw.SalePrice.head()
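To see why taking the log first is enough, here is a minimal sketch with made-up prices. The numbers and the `rmsle` helper are illustrative assumptions, not part of the course code, and the sketch assumes the metric uses the plain natural log as in this notebook (some RMSLE definitions use log(1 + x) instead).

```python
import numpy as np

# made-up auction prices, for illustration only
actual = np.array([10000.0, 25000.0, 60000.0])
predicted = np.array([12000.0, 20000.0, 66000.0])

def rmse(pred, known):
    """Root mean squared error."""
    return np.sqrt(((pred - known) ** 2).mean())

def rmsle(pred, known):
    """Root mean squared log error, using the natural log."""
    return np.sqrt(((np.log(pred) - np.log(known)) ** 2).mean())

# computing the RMSLE on raw prices ...
rmsle_on_prices = rmsle(predicted, actual)

# ... gives the same number as computing the RMSE on log prices
rmse_on_logs = rmse(np.log(predicted), np.log(actual))

print(rmsle_on_prices, rmse_on_logs)
```

So once the target is stored as a log, any tool that reports RMSE is effectively reporting the competition metric.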

We need to **split** the **dependent variable** from the dataset.
- Later on, we'll use a method from the **fastai library** , **`proc_df`**,  to do just that and more.
- We show some alternative code below to illustrate the process.

In [None]:
# DataFrame with a column removed
independent = df_raw.drop('SalePrice', axis=1)

# select a Series from the DataFrame
dependent = df_raw.SalePrice

#### SaleDate
We already know that one of the features is a date and we've explicitly told pandas to read it as  **datetime dtype**.

In [None]:
# prints first 5 rows of the column
df_raw.saledate.head()

**What's in a date is one of the more important pieces of feature engineering you can do** (e.g. Was it a holiday? Was it raining?).

- We'll use a method from the **fastai library**, **`add_datepart`**, that extracts particular date fields from a complete datetime for the purpose of constructing categoricals. 
- You should always consider this **feature extraction** step when working with date-time dtypes. Without expanding your date-time into these additional fields, you can't capture any trend/cyclical behavior as a function of time at any of these granularities.
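As a rough idea of what this kind of date expansion looks like, here is a small sketch that uses only the pandas `.dt` accessor on three made-up dates. It is not the fastai implementation; the column names simply imitate the ones `add_datepart` produces.

```python
import pandas as pd

# three made-up sale dates
sales = pd.DataFrame({"saledate": pd.to_datetime(["2011-12-31", "2012-01-01", "2012-04-30"])})

# pull individual fields out of the datetime column with the .dt accessor
sales["saleYear"] = sales["saledate"].dt.year
sales["saleMonth"] = sales["saledate"].dt.month
sales["saleDayofweek"] = sales["saledate"].dt.dayofweek  # Monday=0 ... Sunday=6
sales["saleIs_month_end"] = sales["saledate"].dt.is_month_end
sales["saleIs_year_start"] = sales["saledate"].dt.is_year_start

# view transposed, one row per new feature
print(sales.T)
```

Each new column gives the model a granularity (year, month, weekday, ...) that it could not easily recover from the raw timestamp.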

In [None]:
# import add_datepart
from fastai.structured import add_datepart

add_datepart(df_raw, 'saledate')

We can check the **source code** of a function  by typing its name and **??** (before or after the name).

In [None]:
add_datepart??

So `add_datepart` converts a column of a DataFrame from a **datetime64 to many columns containing the information from the date** (changes occur in place).

In [None]:
# column labels of the DataFrame.
df_raw.columns

There are a bunch of new columns. Let's take a look.

In [None]:
# access the selected [rows, columns]. View transposed.
df_raw.loc[:,['saleYear', 'saleMonth',
       'saleWeek', 'saleDay', 'saleDayofweek', 'saleDayofyear',
       'saleIs_month_end', 'saleIs_month_start', 'saleIs_quarter_end',
       'saleIs_quarter_start', 'saleIs_year_end', 'saleIs_year_start',
       'saleElapsed']].T

#### Categorical Data
As in most cases when dealing with datasets, this dataset contains a mix of **continuous** and **categorical** variables.
- We need all of them to be **numeric** so they can be used by the model.
- We must make the necessary changes (**feature engineering**).

In [None]:
# count dtypes in the dataframe
df_raw.dtypes.value_counts()

If we look at the numeric columns, we can see that **most of them** are not continuous but **categorical**.
- Even though it's not ideal that they stay that way, **random forest** works fine with it so there is **no problem**.

In [None]:
# select numerical columns
df_raw.select_dtypes(['int64','float64']).head()

Most of the **categorical variables are currently stored as strings** (objects).
- Apart from being inefficient, it doesn't provide the **numeric coding** required for our model.

- We'll use a method from the **fastai library** , **`train_cats`**,  to convert strings to pandas categories.

In [None]:
# import train_cats
from fastai.structured import train_cats

train_cats(df_raw)

We can again check the **source code** of the function for more information.

In [None]:
??train_cats

In [None]:
# examine all the data types
display_all(df_raw.dtypes)

Indeed, it has changed all columns that contained **strings to category dtypes** in place.
- It doesn't make the DataFrame look different, but behind the scenes all strings are encoded with **integers mapped to the strings**.
- Let's look at UsageBand, our first category column.

In [None]:
# categories of this categorical
df_raw.UsageBand.cat.categories

In [None]:
# encoded variables
df_raw.UsageBand.cat.codes.head()

We can **specify the order** to use for categorical variables if we wish:

In [None]:
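# set the categories to the specified new categories, in that order
# (the result is assigned back, because the `inplace` argument is not available in recent pandas versions)
df_raw['UsageBand'] = df_raw.UsageBand.cat.set_categories(['High', 'Medium', 'Low'], ordered=True)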

Finally, to fit this data to a random forest, we would need to **replace the text categories with their numeric codes**.
- We'll use a method from the **fastai library** , **`proc_df`**,  to do just that and more.
- We show some alternative code below to illustrate the process.

In [None]:
# make a copy for demonstration purposes
df_codes = df_raw.copy()

# iterate through column name and column values and if column values are categories, replace them with their numeric code.
for name, column in df_codes.items():
  if column.dtype.name == 'category':
    df_codes[name] = column.cat.codes
    
# count dtypes in the dataframe
df_codes.dtypes.value_counts()

This is a **two step process** for being able to **fit categorical** variables to our model: first we convert **strings to categories**, then **categories to codes**/numbers.
- The result is that all variables are recoded in **only one column** using different integers for different categories, which is **useful for** columns with **lots of possible values**. 
- An **alternative** to this would be to **get dummy variables** for the categorical variables, but it would resolve it by recoding all the variables in **different columns** with **zeroes** and **ones**.
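The difference between the two encodings is easier to see on a tiny made-up column. This is only a sketch with plain pandas; the `usage` values are invented.

```python
import pandas as pd

# a made-up categorical column stored as strings
usage = pd.Series(["Low", "High", "Medium", "Low"], name="UsageBand")

# step 1: strings -> categories (sorted alphabetically by default)
as_category = usage.astype("category")

# step 2: categories -> integer codes, all in a single column
codes = as_category.cat.codes

# alternative: one 0/1 column per category (dummy variables)
dummies = pd.get_dummies(usage)

print(list(as_category.cat.categories))
print(codes.tolist())
print(dummies.shape)
```

With three categories the dummy encoding already needs three columns, which is why integer codes are the more compact choice for columns with many possible values.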

#### Missing Data
Handling missing data is important as **many** machine learning **algorithms do not support** data with **missing values** ([source](https://machinelearningmastery.com/handle-missing-data-python/)).

Typically, random forest methods/packages encourage **two ways** of handling missing values ([source](https://medium.com/airbnb-engineering/overcoming-missing-values-in-a-random-forest-classifier-7b1fc1fc03ba)): 
- **Drop data points** with missing values (not recommended)
- **Fill** in missing values with the **median** (for numerical values).

In [None]:
# display all columns with the percentage of missing values (NaN)
display_all(df_raw.isnull().sum().sort_index()/len(df_raw))

Pandas has handled all **missing values** in the **categories** automatically by encoding  them with a **minus one**.

In [None]:
# categories of this categorical
print(df_raw.Hydraulics.cat.categories)

# encoded variables
df_raw.Hydraulics.cat.codes.head(6)
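The same behaviour can be reproduced on a small made-up column, which makes the minus one easier to spot. The values below are invented for illustration.

```python
import numpy as np
import pandas as pd

# a made-up categorical column with two missing entries
hydraulics = pd.Series(["Standard", np.nan, "Auxiliary", np.nan]).astype("category")

# missing values are not a category of their own ...
print(list(hydraulics.cat.categories))

# ... they simply get the code -1
print(hydraulics.cat.codes.tolist())
```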

For the **numerical data**, we need to **replace** the **missing values** with the **median** (also called median imputation).
- We'll use a method from the **fastai library** , **`proc_df`**,  to do just that and more.
- We show some alternative code below to illustrate the process.

In [None]:
# select numerical columns
num_col = df_raw.select_dtypes(include='number')

# display numerical columns with the percentage of missing values (NaN)
num_col.isnull().sum().sort_index()/len(df_raw)

In [None]:
# make a copy for demonstration purposes
df_imputed = df_raw.copy()

# iterate through the numerical columns and replace NA values with the column median
# (select_dtypes is used because comparing a dtype with np.number does not reliably detect numeric columns)
for name in df_imputed.select_dtypes(include='number').columns:
    df_imputed[name] = df_imputed[name].fillna(df_imputed[name].median())

# select numerical columns
num_col = df_imputed.select_dtypes(include='number')

# display numerical columns with the percentage of missing values (NaN)
num_col.isnull().sum().sort_index()/len(df_raw)

## Saving the progress
We will **save the dataset** in its current form so we can **save ourselves** the trouble of **repeating the process** of getting the data ready to be passed to a random forest in **future lessons**.
- On Kaggle, when we **commit** the kernel, everything that is written to the current directory is saved as **output**.

In [None]:
# Save to feather file
df_raw.to_feather('df_raw')

## Pre-processing
We'll use a method from the **fastai library** , **`proc_df`**,  to get the dataset ready for the random forest.
- We'll **replace categories** with their numeric codes.
- **Handle missing** continuous **values**.
- **Split the dependent variable** into a separate variable.

In [None]:
# import proc_df and the functions it depends on
from fastai.structured import numericalize, fix_missing, proc_df

X, y, nas_dict = proc_df(df_raw, 'SalePrice')

For more information, we can again check the **source code**:

In [None]:
??proc_df

`proc_df` has also returned `nas_dict`, a dictionary built by the method **`fix_missing`** whose keys are the names of the columns that had missing data.
- From the function docstring: Fill missing data in a column of df with the median, and add a {name}_na column which specifies if the data was missing (boolean).
- Our model will **not be using it** but we can check it out.

In [None]:
nas_dict

So we know the columns 'MachineHoursCurrentMeter' and 'auctioneerID' had missing values and, according to the docstring, `fix_missing` has added **new columns** which specify whether the data was missing. Let's see.

In [None]:
# access columns with names containing some strings
X.loc[:, X.columns.str.contains('MachineHoursCurrentMeter|auctioneerID')].head()
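The behaviour described in the `fix_missing` docstring can be sketched on a tiny made-up frame. The helper `fill_with_median` below is hypothetical and written only for illustration; it is not the fastai code.

```python
import numpy as np
import pandas as pd

def fill_with_median(df, name):
    """Hypothetical helper: fill NaNs in df[name] with the median
    and record where they were in a boolean {name}_na column."""
    missing = df[name].isnull()
    if missing.any():
        df[name + "_na"] = missing
        df[name] = df[name].fillna(df[name].median())
    return df

# made-up meter readings with one missing value
machines = pd.DataFrame({"MachineHoursCurrentMeter": [100.0, np.nan, 300.0, 1100.0]})
fill_with_median(machines, "MachineHoursCurrentMeter")

print(machines)
```

Keeping the `_na` column means the information "this value was missing" is not lost when the gap is filled.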

Anyhow, now the **dataset is ready** to be passed to a **random forest**.

# Lecture 2
It is recommended to [watch the video-lecture first](https://youtu.be/blyXCk4sgEg), then follow the notebook.

# Random Forests
## Base model
### Validation set
To **check for overfitting**, we'll split the dataset to have two **separate training and validation sets**.
- Since we are trying to predict new prices, we should pick the **latest samples** from our train set.
- We'll use the **same size as** the **Kaggle** validation set, so when we evaluate our model it gives us an estimate of how it will perform on the **public leaderboard**.
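Picking the latest samples only works if the rows are ordered by date. As a minimal sketch on made-up data (the dates, prices and the `n_valid` size are invented), sorting first and then slicing off the tail guarantees that every validation sale happens after every training sale.

```python
import pandas as pd

# made-up sales, deliberately out of order
sales = pd.DataFrame({
    "saledate": pd.to_datetime(["2011-03-01", "2010-07-15", "2011-11-20",
                                "2009-01-05", "2011-06-30"]),
    "SalePrice": [9.8, 10.1, 10.4, 9.5, 10.0],
})

# order by date so the last rows are the most recent sales
sales = sales.sort_values("saledate").reset_index(drop=True)

# hold out the most recent rows as the validation set
n_valid = 2
train, valid = sales.iloc[:-n_valid], sales.iloc[-n_valid:]

print(train["saledate"].max(), valid["saledate"].min())
```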

In [None]:
# set the path to read Valid.csv
path = '../input/Valid.csv'

# dimensions of the dataframe (rows, columns)
pd.read_csv(path).shape

The set Kaggle used for creating the public leaderboard has roughly **12,000 samples**, so that is the size we'll use for our validation set.

In [None]:
# create a function for splitting X and y into train and test sets of customizable sizes
def split_vals(a,n): return a[:n].copy(), a[n:].copy()

# validation set size: 12000.
validation = 12000

# split point: length of dataset minus validation set size.
split_point = len(X)-validation

# split X
X_train, X_valid = split_vals(X, split_point)

# split y
y_train, y_valid = split_vals(y, split_point)

# dimensions (row, columns) of X, y and X_valid
X_train.shape, y_train.shape, X_valid.shape

### Model evaluation
#### RMSE
Remember the evaluation metric that Kaggle is going to use for this competition is the Root Mean Squared Log Error (**RMSLE**) between the actual and predicted auction prices.
- Because we **already took the log** of the prices, we can use the Root Mean Squared Error (RMSE) instead.

The **RMSE** is an evaluation metric that expresses the **average error of the model predictions** by comparing the predicted values with the actual known values.
- The score can range from **0** (best score possible, never achieved in practice) to **∞**, so the **lower the score the better**.

#### R Squared
The coefficient of determination R^2 (**R squared**) is an evaluation metric for regression problems that essentially tells us **how good our model is** compared with **a model that just predicts the average** of the target (SalePrice) all the time.
- The **best** possible score is **1**.
- The score of the **average** model is **0**.
- A **negative** score means the model is even **worse** than simply predicting the average of the target.
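These three cases can be checked by hand. The sketch below writes R squared as one minus the ratio between the model's squared error and the squared error of the average model; the `r_squared` helper and the numbers are illustrative assumptions.

```python
import numpy as np

def r_squared(known, pred):
    """1 - (squared error of the model) / (squared error of always predicting the mean)."""
    ss_res = ((known - pred) ** 2).sum()
    ss_tot = ((known - known.mean()) ** 2).sum()
    return 1 - ss_res / ss_tot

# made-up log prices
known = np.array([9.0, 10.0, 11.0, 12.0])

perfect = r_squared(known, known)                               # predicts every value exactly
mean_model = r_squared(known, np.full_like(known, known.mean()))  # always predicts the average
reversed_model = r_squared(known, known[::-1])                  # worse than the average

print(perfect, mean_model, reversed_model)
```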

In [None]:
# create a function that takes the RMSE
def rmse(pred,known): return np.sqrt(((pred-known)**2).mean())

# create a function that rounds to 5 decimal places (like kaggle leaderboard)
def rounded(value): return np.round(value, 5)

# create a function that prints a list of 4 scores, rounded:
# [RMSE of X_train, RMSE of X_valid, R Squared of X_train, R Squared of X_valid]
def print_scores(model):
    RMSE_train = rmse(model.predict(X_train), y_train)
    RMSE_valid = rmse(model.predict(X_valid), y_valid)
    R2_train = model.score(X_train, y_train)
    R2_valid = model.score(X_valid, y_valid)
    scores = [rounded(RMSE_train), rounded(RMSE_valid), rounded(R2_train), rounded(R2_valid)]
    if hasattr(model, 'oob_score_'): scores.append(model.oob_score_) # appends OOB score (if any) to the list
    print(scores)

In [None]:
# import the class
from sklearn.ensemble import RandomForestRegressor

# instantiate the model 
rf = RandomForestRegressor(n_jobs=-1) # n_jobs is a performance parameter. -1 is to parallelize the computations across all the CPU cores.

# fit the model with data and calculate the running time
%time rf.fit(X_train, y_train)

# print a list of 4 scores:
# [RMSE of X_train, RMSE of X_valid, R Squared of X_train, R Squared of X_valid]
print_scores(rf)

These four scores allow us to evaluate the accuracy of our model.
- An **R Squared** of the validation set in the **high-80's** tells us that our model is significantly **better** than simply **predicting the average** price.
- The large difference between the **RMSLE** of the training set (**0.09**) and of the validation set (**0.25**) tells us that we're **over-fitting badly**, because the average error of the model predictions is much greater when dealing with data it hasn't been trained on.

## Speeding things up
In order to make the model development process more **interactive**, we need to make sure that the model runs in a reasonable time to be able to **make some changes and see the results fast**.
- We'll use the parameter **`subset`** of **`proc_df`**, which takes a random subset of the selected size from the dataframe.
- We'll randomly sample **32,000** from the original 401,125 rows.
- We'll use the **same validation set** as before.
- To make sure our training set doesn't overlap with the dates of the validation set we'll **split again the subset** we've taken and **discard the last 12,000** rows.

In [None]:
# randomly sample 32,000 rows
X_subset, y_subset, nas_dict = proc_df(df_raw, 'SalePrice', subset=32000)

# split the train subset and discard the last 12000 rows: X_train [:20000], _ [20000:].
X_train, _ = split_vals(X_subset, 20000)

# split the target subset and discard the last 12000 rows: y_train [:20000], _ [20000:].
y_train, _ = split_vals(y_subset, 20000)

Now we can **train our model with 20,000 randomly chosen samples** from the total 401,125 rows and we can be sure that we're not cheating because the validation set is a different set entirely.

In [None]:
m = RandomForestRegressor(n_jobs=-1)
%time m.fit(X_train, y_train)
print_scores(m)

Well, that was fast.

## Single tree
To **understand** why we were overfitting, we'll create a **random forest** so simple that we can actually **take a look inside**.
- With the parameter `n_estimators`, we'll create a forest with **only one tree**.
- With the parameter `max_depth`, we'll limit the tree to a **depth of three levels** of splits.
- With the parameter `bootstrap`, we'll **avoid using randomly generated training sets**.

In [None]:
m = RandomForestRegressor(n_estimators=1, max_depth=3, bootstrap=False, n_jobs=-1)
m.fit(X_train, y_train)
print_scores(m)

Now we'll **take a look inside** this random forest simplified to a **single small deterministic tree**.
- We'll use a method from the **fastai library**, **`draw_tree`**, to draw a representation of the tree in IPython.
- This method uses the function **`export_graphviz`**, which generates a GraphViz representation of the decision tree.

In [None]:
# import the export_graphviz exporter 
from sklearn.tree import export_graphviz

# import draw_tree method
from fastai.structured import draw_tree

# draw the single tree of the random forest
draw_tree(m.estimators_[0], X_train, precision=3)

A tree consists of a **sequence of binary decisions** or splits.
- Each step **partitions the data** into two subsets.
- Each split point is a decision to split the data on a **particular variable** (feature) and in a **particular location** (value).
- **All possible split points are evaluated** (all features and values).
- The very **best split point is chosen** each time trying to **minimize the error** (for regression).
- The error is calculated by **comparing the predicted values with the known values** of the subsets of a particular split point.
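The search described above can be sketched for a single feature. This is a simplified illustration, not scikit-learn's implementation: the `best_split` helper and the data are made up, candidate thresholds are the observed values themselves, and the error is the total squared error around each side's mean (minimizing it is equivalent to minimizing the size-weighted MSE of the two subsets).

```python
import numpy as np

def best_split(x, y):
    """Try every candidate threshold on one feature and keep the one
    with the lowest total squared error of the two resulting subsets."""
    best_threshold, best_error = None, np.inf
    for threshold in np.unique(x)[:-1]:  # the largest value would leave the right side empty
        left, right = y[x <= threshold], y[x > threshold]
        # each side predicts its own mean, so the error is measured around that mean
        error = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if error < best_error:
            best_threshold, best_error = threshold, error
    return best_threshold, best_error

# made-up feature and target (log of price)
year_made = np.array([1990, 1995, 2000, 2005, 2010])
log_price = np.array([9.0, 9.1, 9.2, 10.5, 10.6])

threshold, error = best_split(year_made, log_price)
print(threshold, error)
```

A real tree repeats this search over every feature, picks the best feature and value overall, and then does the same again inside each of the two subsets.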

Each box in the graphic that represents a split point consists of:
- A **variable** and a **value** to split on.
- **MSE** quantifies the error, the lower the better.
- **Samples** is the size (rows) of the subset.
- The **average** of the **target** value (log of price) of the subset.

Sources: [1](https://machinelearningmastery.com/classification-and-regression-trees-for-machine-learning/), [2](https://infocenter.informationbuilders.com/wf80/index.jsp?topic=%2Fpubdocs%2FRStat16%2Fsource%2Ftopic47.htm).

Looking at our little tree, we can see that we **start with 20,000 rows** (the root node), split over **three levels** (each split trying to minimize the MSE) and end up with **several subsets of different sizes** (leaf nodes), which are the result of the last level of splits.

If we create the largest (**deepest**) tree possible, it'll **keep splitting until each leaf node has only one sample**. Such a tree would learn all there is to learn about the training data but would be a model that would **not be able to generalize a pattern** from it. And that is exactly what overfitting is: to **learn the noise instead of the signal**. Let's see it in practice.

In [None]:
# remove the max_depth parameter (by default to None) so the nodes expand until they can't do it anymore. 
m = RandomForestRegressor(n_estimators=1, bootstrap=False, n_jobs=-1)

m.fit(X_train, y_train)

# [RMSE of X_train, RMSE of X_valid, R Squared of X_train, R Squared of X_valid]
print_scores(m)

The scores are **great in the training set**: the RMSE is 4.3511679e-17 (that is a 0 followed by a lot of zeroes) and the R Squared is 1 (the best possible score). This is because we can in fact **predict everything** from the training data.

Our interest, of course, is to do **better in the validation set** to get more generalizable results and for that we need to use **multiple trees** and some kind of **model averaging approach**.

## Bagging
### Intro to bagging
Bootstrap aggregating (**Bagging**) is a way of doing repeated statistical analyses on the same data and combining them to form a single result.
- The predictions that are combined are all different thanks to the use of **resampling** (bootstrap samples).
- **Bootstrapping** is a way of generating random samples from a dataset by drawing rows **with replacement**, so every sample is different and each one leaves some rows out.

A **random forest** is simply a way of **bagging trees**.
- It combines the **unique predictions** of different trees.
- Each tree is trained on a **randomly generated training set**.
- It grows **deep trees** that overfit their individual bootstrap samples.
- The prediction **errors** of the trees are largely **random**, so the **average of the predictions overfits much less** than any single tree.

Sources: [1](https://onlinecourses.science.psu.edu/stat857/node/181/), [2](https://machinelearningmastery.com/bagging-and-random-forest-ensemble-algorithms-for-machine-learning/).
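To make the resampling step concrete, here is a minimal sketch of drawing one bootstrap sample with NumPy. The row count and the seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows = 1000
rows = np.arange(n_rows)

# a bootstrap sample: as many draws as there are rows, *with replacement*
sample = rng.choice(rows, size=n_rows, replace=True)

# some rows are drawn several times, others not at all
unique_fraction = len(np.unique(sample)) / n_rows
print(unique_fraction)
```

Because of the replacement, each sample typically contains a bit under two thirds of the distinct rows, so every tree sees a slightly different version of the training set.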

To learn about bagging in random forests, let's look at our **base model** again.

In [None]:
# 10 trees: the default before scikit-learn 0.22 (newer versions default to 100), so we set it explicitly
m = RandomForestRegressor(n_estimators=10, n_jobs=-1)
m.fit(X_train, y_train)
print_scores(m)

We'll grab the **predictions for each individual tree**, and look at one example.

In [None]:
# use a list comprehension to loop through the trees of the random forest and stack the predictions of each individual tree along a new axis
preds = np.stack([t.predict(X_valid) for t in m.estimators_])

# dimensions of the predictions (rows, columns)
preds.shape

There are 10 sets of predictions (trees) with 12,000 values (predictions), which corresponds to the size of the validation set.
- Let's see what the **first prediction of each tree** looks like and how they **compare** with the actual value **when averaged out**.

In [None]:
# print the first prediction for each of the ten trees [all tree rows, first prediction column]
print(preds[:,0])

# print the mean of the ten trees' first predictions
print(np.mean(preds[:,0]))

# the first value of the validation set
y_valid[0]

None of the individual trees have very good predictions but **the mean of them is actually good enough**.
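Why averaging helps can be illustrated with a small simulation. It rests on an idealised assumption: each "tree" is modelled as the true value plus independent random noise. Real trees make partly correlated errors, so the gain in practice is smaller than in this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
true_value = 10.0
n_trials, n_trees = 2000, 10

# each "tree" predicts the true value plus its own independent noise (idealised assumption)
tree_predictions = true_value + rng.normal(scale=1.0, size=(n_trials, n_trees))

# error of a single tree versus error of the average of the ten trees
single_tree_rmse = np.sqrt(((tree_predictions[:, 0] - true_value) ** 2).mean())
averaged_rmse = np.sqrt(((tree_predictions.mean(axis=1) - true_value) ** 2).mean())

print(single_tree_rmse, averaged_rmse)
```

The individual errors partly cancel out when averaged, which is the effect we just saw on the first validation sample.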
### Out-of-bag (OOB) score
Out-of-bag (**OOB**) **error**  is another method of measuring the **prediction performance** of models utilizing **bagging**.
- Out-of-bag samples are **samples not used** during training for **any given tree**.
- For **every tree** in the forest there is **unseen data** available to make **new predictions** on.
- For each **training sample**, it predicts using only the trees that were not trained on that sample, and the **prediction error** is computed from those predictions.
- **Evaluates** the model on the training set **without** needing a separate **validation set**.

Sources: [1](https://en.wikipedia.org/wiki/Out-of-bag_error), [2](http://scikit-learn.org/stable/auto_examples/ensemble/plot_ensemble_oob.html), [3](https://stackoverflow.com/questions/18541923/what-is-out-of-bag-error-in-random-forests).
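The bookkeeping behind the OOB estimate can be sketched with NumPy alone. To keep it short, each "tree" here is a stand-in that just predicts the mean target of its bootstrap sample; the data, the seed and the stand-in model are all assumptions made for illustration, not how scikit-learn implements it.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_trees = 200, 30
y = rng.normal(loc=10.0, scale=1.0, size=n_rows)  # made-up target values

oob_sum = np.zeros(n_rows)    # sum of OOB predictions per row
oob_count = np.zeros(n_rows)  # number of trees for which the row was out of bag

for _ in range(n_trees):
    in_bag = rng.integers(0, n_rows, size=n_rows)         # bootstrap sample (with replacement)
    out_of_bag = np.setdiff1d(np.arange(n_rows), in_bag)  # rows this tree never saw
    tree_prediction = y[in_bag].mean()                    # stand-in "tree": mean of its sample
    oob_sum[out_of_bag] += tree_prediction
    oob_count[out_of_bag] += 1

# average, for every row, only the predictions of trees that did not train on it
has_oob = oob_count > 0
oob_prediction = oob_sum[has_oob] / oob_count[has_oob]
oob_rmse = np.sqrt(((oob_prediction - y[has_oob]) ** 2).mean())

print(has_oob.sum(), oob_rmse)
```

With only a handful of trees, some rows may never be out of bag, which is one reason a minimum number of trees is needed for a reliable estimate.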

This also has the benefit of allowing us to **assess** whether **our model generalizes** even if we **only have** a **small amount of data** so want to avoid separating some out to create a validation set.

We can add one more parameter to our model, **`oob_score`**, to use out-of-bag samples to estimate the **R^2** on unseen data.
- There is a **minimum number of trees** to be used for any given data that are **necessary** to compute a reliable OOB score (otherwise **sklearn warns** us).
- We'll also update the `print_scores` function to **print** the **OOB score** last.

In [None]:
# appends the OOB score to the list of scores if the model has the attribute
def print_scores(m):
    res = [rmse(m.predict(X_train), y_train), rmse(m.predict(X_valid), y_valid),
                m.score(X_train, y_train), m.score(X_valid, y_valid)]
    if hasattr(m, 'oob_score_'): res.append(m.oob_score_)
    print(res)

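# Because we use too few trees (10) we get a UserWarning and an unreliable oob score
# (n_estimators is set explicitly: 10 was the default before scikit-learn 0.22, newer versions default to 100)
m = RandomForestRegressor(n_estimators=10, n_jobs=-1, oob_score=True)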
m.fit(X_train, y_train)

print('10 trees:')
# [RMSE of X_train, RMSE of X_valid, R Squared of X_train, R Squared of X_valid, OOB score]
print_scores(m)

# 30 trees
m = RandomForestRegressor(n_estimators=30, n_jobs=-1, oob_score=True)
m.fit(X_train, y_train)

print('\n30 trees:')
# [RMSE of X_train, RMSE of X_valid, R Squared of X_train, R Squared of X_valid, OOB score]
print_scores(m)

The **OOB** is **better** than the **R Squared** of the validation set because the validation set is a **different time period** entirely, while the OOB is calculated on random samples of the same time period as the training set.
- In this case the **validation set is much harder to predict** and the R Squared will consistently be lower than the OOB.

## Hyperparameter tuning
### Subsampling
Earlier we used **subsampling** to speed up the analysis, which basically consists of **limiting** the total **amount of data** that our **model can access** so it **trains faster**. There is a **better way** to do it that is **also** one of the easiest ways to **avoid overfitting**.
- Rather than use **one random subset** of the data for our **entire model** (and for all its trees), we'll use a **different random subset per tree**.
- **Given enough trees**, and enough random subsets, the model will **eventually** be able to **train on** most of **the available data**.
- Each **individual tree** will be **just as fast** as if we were **using one subset** for all the trees, **like** we did **before**.
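How quickly the forest gets to see the whole dataset can be estimated with a bit of arithmetic. The sketch below assumes each tree draws its rows uniformly at random with replacement; the exact sampling done by the library may differ, and `expected_coverage` is a hypothetical helper.

```python
import math

def expected_coverage(n_rows, sample_size, n_trees):
    """Expected fraction of rows seen by at least one tree, assuming every
    tree draws sample_size rows uniformly at random with replacement."""
    prob_never_drawn = (1 - 1 / n_rows) ** (sample_size * n_trees)
    return 1 - prob_never_drawn

# training rows left after holding out the 12,000 validation rows
n_rows = 401125 - 12000

for n_trees in (1, 10, 40, 100):
    print(n_trees, round(expected_coverage(n_rows, 20000, n_trees), 3))
```

Under this assumption a single tree sees about 5% of the rows, while a forest of 100 trees is expected to have touched nearly all of them.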

We'll use a method from the **fastai library** , **`set_rf_samples`**, to do just that.
- From the documentation: "**changes Scikit learn's random forests** to give each tree a random sample of n random rows".
- It is **not compatible** with **OOB score** right now, so we'll simply do without it.
- To turn off `set_rf_samples` to use random forest in the normal way, it's necessary to call **`reset_rf_samples`**.

First we must **return** to using our **full dataset** so the **subsampling** is done on the **entire data**.

In [None]:
X, y, nas_dict = proc_df(df_raw, 'SalePrice')
X_train, X_valid = split_vals(X, split_point)
y_train, y_valid = split_vals(y, split_point)

We'll use the same subset size as before: **20,000** random samples, but this time it will be a different subset for each tree.

In [None]:
# import set_rf_samples
from fastai.structured import set_rf_samples

set_rf_samples(20000)

In [None]:
# 10 trees
m = RandomForestRegressor(n_estimators=10, n_jobs=-1)
%time m.fit(X_train, y_train)
print_scores(m)

Since each additional tree allows the model to see more data, this approach can make **additional trees more useful without needing additional fine-tuning**.

In [None]:
# 40 trees
m = RandomForestRegressor(n_estimators=40, n_jobs=-1)
%time m.fit(X_train, y_train)
print_scores(m)

### Number of trees
We'll create a graphic to see what happens to the **R Squared** when using **1 tree up to 100 trees**.
- We'll **store** the **predictions** of each tree in a **numpy array** like we did before for our base model.
- We'll **loop** through the **predictions** of each tree to **plot the change** in **R Squared** as we add more trees.

In [None]:
# 100 trees
m = RandomForestRegressor(n_estimators=100, n_jobs=-1)
%time m.fit(X_train, y_train)
print_scores(m)

# use a list comprehension to loop through the trees of the random forest and stack the predictions of each individual tree along a new axis
preds = np.stack([t.predict(X_valid) for t in m.estimators_])

# dimensions of the predictions (rows, columns)
preds.shape

In [None]:
# import matplotlib
import matplotlib.pyplot as plt

# import metrics module
from sklearn import metrics

# plot the calculated r^2 of the true values and the predicted values up to [:i] trees (looping through a range from 1 to 100). 
plt.plot([metrics.r2_score(y_valid, np.mean(preds[:i+1], axis=0)) for i in range(100)]);

The shape of this curve clearly shows that the **model improves** the **more trees** we use, but it also suggests that adding more trees is beneficial **up to a point**, because it **flattens out** the closer it gets to 100 trees.
- There is **no linear relationship** between the number of trees and our evaluation metric score.
- The **number of trees** we use can **vary depending on whether** we are:
  - **developing the model**, in which case we'll probably use fewer trees but do lots of iterations, or
  - **finishing the model**, in which case we'll probably use as many trees as our processor can handle to fit the model with all the available data and the best parameters found.
- **Without** our **subsampling** approach, building **lots of trees** on **lots of data** can be computationally expensive and very **time consuming**.

### Depth of the trees
Another way to reduce over-fitting is to grow our trees **less deeply**.
- We can do this with **`min_samples_leaf`** parameter, which requires some minimum number of rows in every leaf node (by default 1).
- For each tree there will be fewer levels and **fewer decisions** being made, so it will result in simpler models.
- The predictions are made by **averaging more samples** in the leaf node.
- This can make the random forest **generalize better**.
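The effect of `min_samples_leaf` on the size of a tree can be seen on a small synthetic problem. This is only a sketch: the data is made up (a noisy sine curve) and it uses a single `DecisionTreeRegressor` instead of a forest, to keep the comparison simple.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# made-up data: 500 evenly spaced points on a noisy sine curve
X = np.linspace(0, 10, 500).reshape(-1, 1)
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=500)

# fully grown tree: keeps splitting until every leaf holds a single row
deep = DecisionTreeRegressor(random_state=0).fit(X, y)

# constrained tree: every leaf must hold at least 25 rows
shallow = DecisionTreeRegressor(min_samples_leaf=25, random_state=0).fit(X, y)

print(deep.get_n_leaves(), deep.score(X, y))
print(shallow.get_n_leaves(), shallow.score(X, y))
```

The fully grown tree memorises the training data, noise included, while the constrained tree has far fewer leaves and each of its predictions is an average over at least 25 rows.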

Possible good values to try:
- 1, 3, 5, 10, 25, 100

In [None]:
# baseline to compare to

# 50 trees: 1 minimum samples per leaf (default)
m = RandomForestRegressor(n_estimators=50, n_jobs=-1)
%time m.fit(X_train, y_train)
print_scores(m)

In [None]:
# 50 trees: 2 minimum samples per leaf
m = RandomForestRegressor(n_estimators=50, min_samples_leaf=2, n_jobs=-1)
%time m.fit(X_train, y_train)
print_scores(m)

### Variance
Another way to reduce over-fitting is to increase the amount of **variation amongst the trees**.
- We can do this by specifying **`max_features`**, which is the **number of features** to consider at **each split**.
- This way the model not only randomly selects a **sample of rows** for each tree, but **also** randomly selects a **sample of columns** for each split.
- This can be critical for creating variance when some features are much more predictive than others and all the trees end up very similar to each other because they make similar splits.

Possible good values to try:
- None: use all columns (the default behaviour for `RandomForestRegressor`).
- 0.5: use half of the columns.
- 'sqrt': use the square root of the total number of columns.
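To get a feel for what these settings mean in practice, the sketch below turns each of them into a number of columns. `columns_per_split` is a hypothetical helper, the column count of 66 is just an example, and the rounding down to an integer is an assumption about how the value is truncated.

```python
import math

def columns_per_split(max_features, n_columns):
    """Hypothetical helper: number of columns considered at each split."""
    if max_features is None:
        return n_columns                               # all columns
    if max_features == "sqrt":
        return max(1, int(math.sqrt(n_columns)))       # square root of the column count
    if isinstance(max_features, float):
        return max(1, int(max_features * n_columns))   # a fraction of the columns
    raise ValueError(f"unsupported value: {max_features!r}")

n_columns = 66  # example column count
for setting in (None, 0.5, "sqrt"):
    print(setting, columns_per_split(setting, n_columns))
```

The smaller the number, the more the trees are forced to differ from each other, at the cost of each individual split being chosen from fewer options.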

In [None]:
# 50 trees: consider all columns per split
m = RandomForestRegressor(n_estimators=50, min_samples_leaf=2, max_features=None, n_jobs=-1)
m.fit(X_train, y_train)
print_scores(m)

### Experimenting
We'll revert to using a full bootstrap sample in order to experiment with these hyperparameters further and to see their full impact.
- The models will **train much slower**.
- For more information, you can check this post on Cross Validated (Stack Exchange): [Practical questions on tuning Random Forests](https://stats.stackexchange.com/questions/53240/practical-questions-on-tuning-random-forests).

In [None]:
# import reset_rf_samples
from fastai.structured import reset_rf_samples

# use full bootstrap sample
reset_rf_samples()

In [None]:
# 60 trees: 3 minimum samples per leaf
m = RandomForestRegressor(n_estimators=60, min_samples_leaf=3, n_jobs=-1, oob_score=True)
%time m.fit(X_train, y_train)
print_scores(m)

In [None]:
# 60 trees: half columns per split
m = RandomForestRegressor(n_estimators=60, min_samples_leaf=3, max_features=0.5, n_jobs=-1, oob_score=True)
%time m.fit(X_train, y_train)
print_scores(m)