# 1. Setup and Data Wrangling

## Initial Setup

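Before launching Jupyter Notebook, you need to activate the fastai environment in your shell, otherwise package imports will fail. (The `!` cell below only runs in a temporary subshell, so it is a reminder rather than a substitute.)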

In [1]:
! source activate fastai

Standard Jupyter Notebook magics to auto-reload edited modules and render plots inline:

In [2]:
%load_ext autoreload
%autoreload 2

%matplotlib inline

Next we need to import the fastai and scikit-learn packages. For this, I created a symbolic link to the fastai repo in my home directory, using:

`! ln -s <fastai repository> <target directory>`

In [3]:
from fastai.imports import * # fastai.imports pulls in a range of common libraries, e.g. pandas
from fastai.structured import *

from pandas_summary import DataFrameSummary
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier # Random forest classes we'll use
from IPython.display import display

from sklearn import metrics

## Data Acquisition

As with our first model (see Intro to RFs), we'll start by working with data from an old Kaggle competition, where the aim is to predict the price of bulldozers at auction.

Data path:

In [None]:
PATH = '/Users/Alex/data/bulldozers/'

Load data into memory:

In [None]:
df_raw = pd.read_csv(f'{PATH}Train.csv', low_memory = False, parse_dates=["saledate"])

Function Arguments:
- `low_memory=False` : pandas processes the file in one pass rather than in chunks, so each column's dtype is inferred from all of its values (at the cost of more memory)
- `parse_dates` : list of column names to parse as dates
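To make the effect of `parse_dates` concrete, here is a minimal sketch on a tiny in-memory CSV. The three rows are made up and only stand in for `Train.csv`; the column names are borrowed from the dataset.

```python
import io
import pandas as pd

# Hypothetical three-row CSV standing in for Train.csv
csv_text = """SalesID,SalePrice,saledate
1,66000,2006-11-16
2,57000,2004-03-26
3,10000,2004-02-26
"""

plain = pd.read_csv(io.StringIO(csv_text))
parsed = pd.read_csv(io.StringIO(csv_text), low_memory=False, parse_dates=["saledate"])

# Without parse_dates the column is read as text; with it, as datetimes
print(plain["saledate"].dtype, parsed["saledate"].dtype)
print(parsed["saledate"].dt.year.tolist())
```

Once the column has a datetime dtype, the `.dt` accessor becomes available, which is what the date handling later on relies on.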

A first peek at the data, using a custom `display_all` helper:

In [None]:
def display_all(df):
    with pd.option_context("display.max_rows", 1000): 
        with pd.option_context("display.max_columns", 1000): 
            display(df)

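This custom function temporarily raises the pandas display limits to 1000 rows and 1000 columns, so wide frames are shown in full rather than truncated.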
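As a small check of why `option_context` is a convenient choice here, the sketch below confirms that the display limits only change inside the `with` block. It also passes both options in a single call, which pandas allows; apart from the option names and values, everything is illustrative.

```python
import pandas as pd

default_rows = pd.get_option("display.max_rows")

# Several options can be set in a single option_context call
with pd.option_context("display.max_rows", 1000, "display.max_columns", 1000):
    inside_rows = pd.get_option("display.max_rows")

# Outside the block the previous setting is restored
after_rows = pd.get_option("display.max_rows")
print(inside_rows, after_rows == default_rows)
```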

In [None]:
display_all(df_raw.tail().transpose()) # Show the last 5 rows, transposed so each column becomes a row

The Kaggle competition is scored on RMSLE (root mean squared log error), so we work with log(SalePrice):

In [None]:
df_raw.SalePrice = np.log(df_raw.SalePrice)
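One way to see what the log transform buys us: on the log scale, an error depends on the ratio between prediction and actual price, not on the absolute difference. The sketch below uses two made-up prices and assumes both predictions are 10% too high.

```python
import numpy as np

actual = np.array([10_000.0, 100_000.0])   # hypothetical sale prices
predicted = actual * 1.10                  # both predictions are 10% too high

abs_error = predicted - actual                   # grows with the price
log_error = np.log(predicted) - np.log(actual)   # identical for both rows

# RMSE computed on log prices is the RMSLE
rmsle = np.sqrt(np.mean(log_error ** 2))
print(abs_error, log_error, rmsle)
```

So once `SalePrice` has been logged, an ordinary RMSE on the transformed column measures relative error, treating cheap and expensive machines alike.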

## Data Manipulation

To be able to apply models to our data we need to manipulate a number of the columns:

**1. Dealing with Dates:**

We use a custom function, `add_datepart`, to split the date column into a number of other potentially useful columns to feed into the model.

In [None]:
add_datepart

In [None]:
add_datepart(df_raw, 'saledate')
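To give a feel for this kind of date expansion, here is a minimal sketch built on the pandas `.dt` accessor. The helper `split_date` and the column names it produces are hypothetical and much simpler than fastai's `add_datepart`; it is not the library's implementation.

```python
import pandas as pd

def split_date(df, col, prefix="sale"):
    """Hypothetical, simplified stand-in for add_datepart."""
    dates = df[col]
    out = df.drop(columns=[col])          # the raw date column is replaced
    out[prefix + "Year"] = dates.dt.year
    out[prefix + "Month"] = dates.dt.month
    out[prefix + "Day"] = dates.dt.day
    out[prefix + "Dayofweek"] = dates.dt.dayofweek      # Monday = 0
    out[prefix + "Is_month_end"] = dates.dt.is_month_end
    return out

sales = pd.DataFrame({
    "SalesID": [1, 2],
    "saledate": pd.to_datetime(["2006-11-16", "2004-03-31"]),
})
expanded = split_date(sales, "saledate")
print(expanded)
```

Each derived column gives the forest something simple to split on, such as year or day of week, which a raw timestamp does not offer directly.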

**2. Categories:**

We need to convert any columns with string dtypes into categories for the model to be able to process them.

In [None]:
train_cats

In [None]:
train_cats(df_raw)

We can also, if we wish, give the categories a meaningful order:

In [None]:
df_raw.UsageBand = df_raw.UsageBand.cat.set_categories(['High', 'Medium', 'Low'], ordered=True) # assign the result; recent pandas versions no longer accept inplace=True here

It turns out this has minimal effect for an RF model.
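The mechanics of categories and their order can be seen on a toy column. The values below are made up; only the column name and the category labels come from the dataset.

```python
import pandas as pd

toy = pd.DataFrame({"UsageBand": ["Low", "High", None, "Medium", "Low"]})

# Converting strings to a category: categories default to sorted order
toy["UsageBand"] = toy["UsageBand"].astype("category")
default_categories = list(toy["UsageBand"].cat.categories)

# Impose a meaningful order instead
toy["UsageBand"] = toy["UsageBand"].cat.set_categories(
    ["High", "Medium", "Low"], ordered=True
)

# Each value is stored as an integer code; missing values get -1
codes = toy["UsageBand"].cat.codes.tolist()
print(default_categories, codes)
```

The integer codes are what eventually reach the model, so reordering the categories changes which code each label maps to.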

**3. Nulls:**

Let's look at the fraction of NULLs in each column of our dataset:

In [None]:
display_all(df_raw.isnull().sum().sort_index()/len(df_raw))
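On a toy frame the same expression is easy to verify by eye. The data below is invented, with column names borrowed from the dataset.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "MachineHours": [100.0, np.nan, np.nan, 400.0],
    "UsageBand": ["Low", None, "High", "Low"],
    "YearMade": [1998, 2001, 2004, 2006],
})

# Count NULLs per column, sort by column name, divide by the number of rows
null_fraction = toy.isnull().sum().sort_index() / len(toy)
print(null_fraction)
```

An equivalent shortcut is `toy.isnull().mean()`, since the mean of a Boolean column is the fraction of `True` values.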

**4. Saving the Data:**

We'll save this pre-processed data with 'feather' format:

In [None]:
os.makedirs('/Users/Alex/tmp', exist_ok=True) # '~' is not expanded by os.makedirs, so use the full path
df_raw.to_feather('/Users/Alex/tmp/raw')
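A note on paths: Python's file functions treat `~` as an ordinary character, so a path under the home directory has to be expanded explicitly. A minimal sketch using the standard library (the `tmp` folder name mirrors the one used above):

```python
import os

raw_path = "~/tmp"
expanded = os.path.expanduser(raw_path)   # absolute path under the current user's home directory

print(expanded)
# os.makedirs(expanded, exist_ok=True) would then create the folder there
```

This avoids hard-coding the user name into the notebook.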

## RF Model

Read back in our DF:

In [None]:
df_raw = pd.read_feather('/Users/Alex/tmp/raw')

Now we need to convert categorical variables to their numerical representations, so we can push this into our model. For this, we leverage `proc_df`:

In [None]:
df, y = proc_df(df_raw, 'SalePrice')

This function does quite a lot:
- Missing values: for numeric columns, a Boolean indicator column is added and the NAs are replaced with the column median. Categorical columns need no such step, because pandas already encodes a missing category as the code `-1`.
- Categories are replaced by their numeric codes.
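The two steps above can be sketched in plain pandas. The helper `simple_proc` below is hypothetical and deliberately simplified; it is not fastai's `proc_df`, and the `_na` suffix for the indicator column is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

def simple_proc(df, y_col):
    """Hypothetical, simplified stand-in for proc_df (not fastai's code)."""
    df = df.copy()
    y = df.pop(y_col).values                      # split off the target
    for name in list(df.columns):
        col = df[name]
        if isinstance(col.dtype, pd.CategoricalDtype):
            df[name] = col.cat.codes              # missing category -> -1
        elif pd.api.types.is_numeric_dtype(col) and col.isnull().any():
            df[name + "_na"] = col.isnull()       # Boolean indicator column
            df[name] = col.fillna(col.median())   # replace NAs with the median
    return df, y

raw = pd.DataFrame({
    "SalePrice": [9.2, 10.1, 11.0],
    "MachineHours": [100.0, np.nan, 300.0],
    "UsageBand": pd.Categorical(["Low", None, "High"]),
})
X, target = simple_proc(raw, "SalePrice")
print(X)
```

The indicator column lets the model distinguish rows where the value was genuinely at the median from rows where it was filled in.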

In [None]:
df.columns

**Fitting the actual model:**

In [None]:
m = RandomForestRegressor(n_jobs=-1)
m.fit(df,y)
m.score(df,y)

Note that this model immediately gives a very high r-squared value, but the score is computed on the same data the model was trained on, so it may largely reflect **over-fitting** rather than real predictive power.
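To see how misleading a training score can be, the sketch below fits a forest to pure noise, where there is nothing to learn. The data is synthetic and the hyperparameters are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = rng.normal(size=(300, 5))
y = rng.normal(size=300)            # pure noise: unrelated to X

X_trn, X_val = X[:200], X[200:]
y_trn, y_val = y[:200], y[200:]

model = RandomForestRegressor(n_estimators=20, random_state=0)
model.fit(X_trn, y_trn)

train_r2 = model.score(X_trn, y_trn)   # scored on the data it was fit to
valid_r2 = model.score(X_val, y_val)   # scored on held-out rows
print(train_r2, valid_r2)
```

The forest memorises much of the training noise, so the training r-squared looks respectable while the held-out score does not. Only the held-out number says anything about generalisation.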

We diagnose this using training and validation sets:

In [None]:
def split_vals(a,n): return a[:n].copy(), a[n:].copy() # Note here making copy

n_valid = 12000  # same as Kaggle's test set size
n_trn = len(df)-n_valid
raw_train, raw_valid = split_vals(df_raw, n_trn)
X_train, X_valid = split_vals(df, n_trn)
y_train, y_valid = split_vals(y, n_trn)

X_train.shape, y_train.shape, X_valid.shape
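A toy run of `split_vals` makes its behaviour explicit: the validation set is simply the last `n_valid` rows, and both halves are copies. The array below is made up for illustration.

```python
import numpy as np

def split_vals(a, n):
    return a[:n].copy(), a[n:].copy()

rows = np.arange(10)                 # stand-in for 10 rows of data
n_valid = 3
n_trn = len(rows) - n_valid

train, valid = split_vals(rows, n_trn)
train[0] = 99                        # modifying the copy leaves `rows` untouched
print(train, valid, rows[0])
```

Because the split is positional rather than random, how representative the validation set is depends on how the rows are ordered.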

Model Assessment:

In [None]:
def rmse(x,y): return math.sqrt(((x-y)**2).mean()) # Loss Function as specified by Kaggle

def print_score(m):
    res = [rmse(m.predict(X_train), y_train), rmse(m.predict(X_valid), y_valid),
                m.score(X_train, y_train), m.score(X_valid, y_valid)]
    if hasattr(m, 'oob_score_'): res.append(m.oob_score_)
    print(res)
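For reference, both kinds of number that `print_score` reports can be computed by hand. The sketch below uses three made-up values; `r_squared` is a hypothetical helper that follows the usual definition, 1 - SS_res / SS_tot, which is what a scikit-learn regressor's `score` returns.

```python
import math
import numpy as np

def rmse(x, y):
    return math.sqrt(((x - y) ** 2).mean())

def r_squared(pred, actual):
    """Hypothetical helper: 1 - residual sum of squares / total sum of squares."""
    ss_res = ((actual - pred) ** 2).sum()
    ss_tot = ((actual - actual.mean()) ** 2).sum()
    return 1 - ss_res / ss_tot

actual = np.array([1.0, 2.0, 5.0])
pred = np.array([1.0, 2.0, 3.0])

error = rmse(pred, actual)
r2 = r_squared(pred, actual)
print(error, r2)
```

An r-squared of 1 means perfect predictions, 0 means no better than always predicting the mean, and negative values mean worse than that.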

Re-fitting:

In [None]:
RF = RandomForestRegressor(n_jobs = -1) # n_jobs = -1 -> parallelize across CPUs
RF.fit(X_train, y_train)

In [None]:
print_score(RF)

In [None]:
RF.predict(X_valid)

Note the r-squared of 0.89 on the validation set, which is still pretty good. Our RMSLE of 0.25 on the validation set would also rank pretty high on the Kaggle leaderboard.

Note: Random Forests still work with numeric codes even when their order is meaningless, e.g. Model ID.
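The reason is that a tree only ever asks "is the value above or below this threshold?", and with enough such splits it can isolate any individual code. A minimal sketch with a single scikit-learn decision tree on made-up IDs and prices:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical ID codes whose numeric order carries no meaning
model_id = np.array([[0], [1], [2], [3], [4], [5]])
price = np.array([9.0, 11.5, 9.2, 12.0, 9.1, 11.8])   # no trend across the codes

tree = DecisionTreeRegressor(random_state=0).fit(model_id, price)
pred = tree.predict(model_id)
print(pred)
```

A linear model would find almost nothing in this column, whereas the tree recovers each ID's price exactly, at the cost of needing more splits than it would for a well-ordered variable.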

In [None]:
m = RandomForestRegressor(n_estimators=1, max_depth=3, bootstrap=False, n_jobs=-1) # a single shallow tree, trained on all rows (no bootstrapping)
m.fit(X_train, y_train)
print_score(m)