# 1. Setup and Data Wrangling

## Initial Setup

Note that before launching Jupyter Notebook, you need to activate the fastai environment, otherwise the package imports will fail:

In [1]:
! source activate fastai

Standard Jupyter Notebook magic commands to auto-reload edited modules and render plots inline:

In [2]:
%load_ext autoreload
%autoreload 2

%matplotlib inline

Next we need to import the fastai and scikit-learn packages. For this, I created a symbolic link to the fastai repo in my home directory, using:

`! ln -s <fastai repository> <target directory>`

In [3]:
from fastai.imports import * # fastai.imports pulls in a range of common libraries, e.g. pandas
from fastai.structured import *

from pandas_summary import DataFrameSummary
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier # Random forest classes we'll use
from IPython.display import display

from sklearn import metrics

## Data Acquisition

As with our first model (see Intro to RFs), we'll start by working with data from an old Kaggle competition - aiming to predict the price of bulldozers at auction.

Data path:

In [6]:
PATH = '/Users/Alex/data/bulldozers/'

Load data into memory:

In [15]:
df_raw = pd.read_csv(f'{PATH}Train.csv', low_memory = False, parse_dates=["saledate"])

Function Arguments:
- `low_memory = False` : makes pandas read the whole file in one pass rather than in chunks, so that it infers a single consistent dtype for each column
- `parse_dates` : list of columns to parse as dates (datetime dtype)
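
To see what `parse_dates` changes, here is a minimal sketch on a tiny in-memory CSV. The two rows are made up for illustration and are not taken from `Train.csv`:

```python
import io
import pandas as pd

# Hypothetical two-row CSV standing in for the real file
csv_text = "SalesID,saledate\n1,2011-11-02\n2,2011-10-25\n"

as_text = pd.read_csv(io.StringIO(csv_text))
as_dates = pd.read_csv(io.StringIO(csv_text), parse_dates=["saledate"])

print(as_text["saledate"].dtype)   # plain text column, not dates
print(as_dates["saledate"].dtype)  # a datetime64 dtype
print(as_dates["saledate"].dt.year.tolist())  # .dt only works on parsed dates
```

Without `parse_dates` the column stays as text, and the date features extracted later in this notebook would not be available.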

A first peek at the data, using a custom `display_all` helper:

In [16]:
def display_all(df):
    # Temporarily raise pandas' display limits so nothing is truncated
    with pd.option_context("display.max_rows", 1000, "display.max_columns", 1000):
        display(df)

This custom function raises the display limits to 1000 rows and 1000 columns, so the output is not truncated.

In [17]:
display_all(df_raw.tail().transpose()) # Apply display_all function to 'tail' and transposed

| | 401120 | 401121 | 401122 | 401123 | 401124 |
|---|---|---|---|---|---|
| SalesID | 6333336 | 6333337 | 6333338 | 6333341 | 6333342 |
| SalePrice | 10500 | 11000 | 11500 | 9000 | 7750 |
| MachineID | 1840702 | 1830472 | 1887659 | 1903570 | 1926965 |
| ModelID | 21439 | 21439 | 21439 | 21435 | 21435 |
| datasource | 149 | 149 | 149 | 149 | 149 |
| auctioneerID | 1 | 1 | 1 | 2 | 2 |
| YearMade | 2005 | 2005 | 2005 | 2005 | 2005 |
| MachineHoursCurrentMeter | | | | | |
| UsageBand | | | | | |
| saledate | 2011-11-02 00:00:00 | 2011-11-02 00:00:00 | 2011-11-02 00:00:00 | 2011-10-25 00:00:00 | 2011-10-25 00:00:00 |


The Kaggle competition is scored on RMSLE (root mean squared log error), so we work with log(SalePrice) from here on:

In [18]:
df_raw.SalePrice = np.log(df_raw.SalePrice)
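
Working in log space means that a plain RMSE on the transformed column equals the RMSLE of the raw prices. A minimal sketch with made-up prices (the arrays are illustrative, not from the dataset, and RMSLE is taken here to use `np.log` as in the cell above):

```python
import numpy as np

# Made-up actual and predicted prices, for illustration only
actual = np.array([10500.0, 11000.0, 9000.0])
predicted = np.array([10000.0, 12000.0, 9500.0])

# RMSLE written out directly on the raw prices
rmsle = np.sqrt(np.mean((np.log(predicted) - np.log(actual)) ** 2))

# Plain RMSE after log-transforming both columns once
log_actual, log_predicted = np.log(actual), np.log(predicted)
rmse_on_logs = np.sqrt(np.mean((log_predicted - log_actual) ** 2))

print(rmsle, rmse_on_logs)  # the two values are identical
```

A side effect of the log is that errors are judged by ratio rather than by absolute difference, so being 10% off on a cheap machine counts the same as being 10% off on an expensive one.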

## Data Manipulation

To be able to apply models to our data we need to manipulate a number of the columns:

**1. Dealing with Dates:**

We use fastai's `add_datepart` function to split the date column into a number of other potentially useful columns to feed into the model.

In [19]:
add_datepart

<function fastai.structured.add_datepart>

In [20]:
add_datepart(df_raw, 'saledate')
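
As a rough picture of what `add_datepart` does, the sketch below derives a few date features with pandas' `.dt` accessor. `add_datepart_sketch` is a hypothetical helper written for this note, not fastai's implementation, and it covers only a subset of the columns the real function creates:

```python
import pandas as pd

def add_datepart_sketch(df, fldname):
    """Hypothetical, simplified stand-in for fastai's add_datepart (modifies df in place)."""
    col = df[fldname]
    prefix = fldname.replace("date", "")  # 'saledate' -> 'sale'
    for attr in ["year", "month", "day", "dayofweek", "dayofyear", "is_month_end"]:
        # e.g. 'saleYear', 'saleDayofweek', 'saleIs_month_end'
        df[prefix + attr.capitalize()] = getattr(col.dt, attr)
    # Seconds since the Unix epoch, as a single ever-increasing number
    df[prefix + "Elapsed"] = (col - pd.Timestamp("1970-01-01")) // pd.Timedelta(seconds=1)
    # The raw datetime column is dropped once its parts are extracted
    df.drop(columns=[fldname], inplace=True)

# Made-up dates for illustration
toy = pd.DataFrame({"saledate": pd.to_datetime(["2011-11-02", "2011-10-31"])})
add_datepart_sketch(toy, "saledate")
print(toy.transpose())
```

The point of the expansion is that a tree cannot split on "is it a weekend?" or "is it month end?" from a raw timestamp, but it can once those are explicit columns.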

**2. Categories:**

We need to convert any columns with string dtypes into pandas categories for the model to be able to process them.

In [21]:
train_cats

<function fastai.structured.train_cats>

In [23]:
train_cats(df_raw)

If we wish, we can also set up ordinal categories:

In [24]:
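# Assign the result back: the `inplace` argument is no longer available in recent pandas versions
df_raw['UsageBand'] = df_raw.UsageBand.cat.set_categories(['High', 'Medium', 'Low'], ordered=True)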

It turns out this has minimal effect on an RF model.
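
To make the category mechanics concrete, here is a small sketch with a made-up `UsageBand` column (the values are illustrative). It shows the integer codes pandas stores behind a categorical column, how a missing value gets the code -1, and how re-ordering the categories changes the codes:

```python
import numpy as np
import pandas as pd

# Made-up values, including one missing entry
usage = pd.Series(["Low", "High", np.nan, "Medium", "High"]).astype("category")

print(list(usage.cat.categories))  # categories are sorted alphabetically by default
print(usage.cat.codes.tolist())    # one integer per row; the missing value becomes -1

# Impose a meaningful order instead of the alphabetical one
ordered = usage.cat.set_categories(["High", "Medium", "Low"], ordered=True)
print(ordered.cat.codes.tolist())
```

Only the codes change; the labels shown when displaying the column stay the same.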

**3. Nulls:**

Let's look at the fraction of NULLs in each column of our dataset:

In [13]:
display_all(df_raw.isnull().sum().sort_index()/len(df_raw))

Backhoe_Mounting            0.803872
Blade_Extension             0.937129
Blade_Type                  0.800977
Blade_Width                 0.937129
Coupler                     0.466620
Coupler_System              0.891660
Differential_Type           0.826959
Drive_System                0.739829
Enclosure                   0.000810
Enclosure_Type              0.937129
Engine_Horsepower           0.937129
Forks                       0.521154
Grouser_Tracks              0.891899
Grouser_Type                0.752813
Hydraulics                  0.200823
Hydraulics_Flow             0.891899
MachineHoursCurrentMeter    0.644089
MachineID                   0.000000
ModelID                     0.000000
Pad_Type                    0.802720
Pattern_Changer             0.752651
ProductGroup                0.000000
ProductGroupDesc            0.000000
ProductSize                 0.525460
Pushblock                   0.937129
Ride_Control                0.629527
Ripper                      0.740388
S

**4. Saving the Data:**

We'll save this pre-processed data in the 'feather' format:

In [27]:
tmp_dir = os.path.expanduser('~/tmp')  # os.makedirs does not expand '~' by itself
os.makedirs(tmp_dir, exist_ok=True)
df_raw.to_feather(f'{tmp_dir}/raw')

## RF Model

Read back in our DF:

In [33]:
df_raw = pd.read_feather(os.path.expanduser('~/tmp/raw'))

Now we need to convert categorical variables to their numerical representations, so we can push this into our model. For this, we leverage `proc_df`:

In [34]:
df, y = proc_df(df_raw, 'SalePrice')

This function actually has quite a lot going on:
- Missing values: for each numeric column containing NAs, it adds a Boolean indicator column and replaces the NAs with the column median. This only happens for numeric columns; for categories, pandas handles missing values automatically by assigning the code -1.
- Categories are replaced by their numeric codes.
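
The two steps above can be sketched on a toy frame. `proc_df_sketch` below is a hypothetical simplification written for illustration, not fastai's `proc_df`; in particular, the `_na` suffix for the indicator columns and the use of the raw category codes are assumptions of this sketch:

```python
import numpy as np
import pandas as pd

def proc_df_sketch(df, y_fld):
    """Hypothetical, simplified version of the steps described above."""
    df = df.copy()
    y = df.pop(y_fld).values               # split off the target column
    for name in list(df.columns):          # snapshot, since columns are added below
        col = df[name]
        if isinstance(col.dtype, pd.CategoricalDtype):
            df[name] = col.cat.codes       # categories -> integer codes (-1 if missing)
        elif pd.api.types.is_numeric_dtype(col) and col.isnull().any():
            df[name + "_na"] = col.isnull()        # Boolean "was missing" indicator
            df[name] = col.fillna(col.median())    # fill NAs with the column median
    return df, y

# Made-up rows for illustration
toy = pd.DataFrame({
    "SalePrice": [9.2, 9.3, 9.1, 9.4],
    "MachineHoursCurrentMeter": [100.0, np.nan, 300.0, np.nan],
    "UsageBand": pd.Categorical(["Low", None, "High", "Low"]),
})
X, y = proc_df_sketch(toy, "SalePrice")
print(X)
```

The indicator column matters because the fact that a value was missing can itself be predictive, and that information would otherwise be lost once the median is filled in.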

In [31]:
df.columns

Index(['SalesID', 'MachineID', 'ModelID', 'datasource', 'auctioneerID',
       'YearMade', 'MachineHoursCurrentMeter', 'UsageBand', 'fiModelDesc',
       'fiBaseModel', 'fiSecondaryDesc', 'fiModelSeries', 'fiModelDescriptor',
       'ProductSize', 'fiProductClassDesc', 'state', 'ProductGroup',
       'ProductGroupDesc', 'Drive_System', 'Enclosure', 'Forks', 'Pad_Type',
       'Ride_Control', 'Stick', 'Transmission', 'Turbocharged',
       'Blade_Extension', 'Blade_Width', 'Enclosure_Type', 'Engine_Horsepower',
       'Hydraulics', 'Pushblock', 'Ripper', 'Scarifier', 'Tip_Control',
       'Tire_Size', 'Coupler', 'Coupler_System', 'Grouser_Tracks',
       'Hydraulics_Flow', 'Track_Type', 'Undercarriage_Pad_Width',
       'Stick_Length', 'Thumb', 'Pattern_Changer', 'Grouser_Type',
       'Backhoe_Mounting', 'Blade_Type', 'Travel_Controls',
       'Differential_Type', 'Steering_Controls', 'saleYear', 'saleMonth',
       'saleWeek', 'saleDay', 'saleDayofweek', 'saleDayofyear',
       'saleI

**Fitting the actual model:**

In [20]:
m = RandomForestRegressor(n_jobs=-1)
m.fit(df,y)
m.score(df,y)

0.983112732173968

Note that this r-squared is computed on the same data the model was trained on, so such a high value tells us little by itself and may well hide some **over-fitting**.

We diagnose this using training and validation sets:

In [35]:
def split_vals(a,n): return a[:n].copy(), a[n:].copy() # Note here making copy

n_valid = 12000  # same as Kaggle's test set size
n_trn = len(df)-n_valid
raw_train, raw_valid = split_vals(df_raw, n_trn)
X_train, X_valid = split_vals(df, n_trn)
y_train, y_valid = split_vals(y, n_trn)

X_train.shape, y_train.shape, X_valid.shape

((389125, 66), (389125,), (12000, 66))
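
`split_vals` splits by position rather than at random: the first `n` rows become the training set and the remaining rows the validation set. A quick sketch on a toy array (the numbers are arbitrary):

```python
import numpy as np

def split_vals(a, n): return a[:n].copy(), a[n:].copy()

rows = np.arange(10)
trn, val = split_vals(rows, len(rows) - 3)  # hold out the last 3 rows
print(trn, val)

# Because both halves are copies, editing one leaves the original untouched
val[0] = -1
print(rows[7])
```

The copies avoid accidentally modifying `df_raw` or `df` through a slice later in the notebook.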

Model Assessment:

In [36]:
def rmse(x,y): return math.sqrt(((x-y)**2).mean()) # Loss Function as specified by Kaggle

def print_score(m):
    res = [rmse(m.predict(X_train), y_train), rmse(m.predict(X_valid), y_valid),
                m.score(X_train, y_train), m.score(X_valid, y_valid)]
    if hasattr(m, 'oob_score_'): res.append(m.oob_score_)
    print(res)
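
For reference, the two kinds of number that `print_score` reports can be computed by hand. The sketch below uses made-up targets and predictions and the standard definition of r-squared, 1 - SS_res / SS_tot, which is what the `score` method of scikit-learn regressors returns:

```python
import math
import numpy as np

def rmse(x, y): return math.sqrt(((x - y) ** 2).mean())

# Made-up log prices and predictions, for illustration only
y_true = np.array([9.0, 9.5, 10.0, 10.5])
y_pred = np.array([9.1, 9.4, 10.2, 10.3])

ss_res = ((y_true - y_pred) ** 2).sum()         # error left over after the model
ss_tot = ((y_true - y_true.mean()) ** 2).sum()  # error of always predicting the mean
r2 = 1 - ss_res / ss_tot

print(rmse(y_pred, y_true), r2)

# A "model" that always predicts the mean scores an r-squared of 0
baseline = np.full_like(y_true, y_true.mean())
r2_baseline = 1 - ((y_true - baseline) ** 2).sum() / ss_tot
print(r2_baseline)
```

So r-squared says how much better the model is than predicting the mean everywhere, and it can go negative on a validation set if the model does worse than that.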

Re-fitting:

In [37]:
RF = RandomForestRegressor(n_jobs = -1) # n_jobs = -1 -> parallelize across CPUs
RF.fit(X_train, y_train)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=-1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

In [24]:
print_score(RF)

[0.09051099118128908, 0.24649444815720803, 0.98288770791377233, 0.89149187524598328]


In [26]:
RF.predict(X_valid)

array([ 9.17392,  9.28877,  9.2425 , ...,  9.32105,  9.27767,  9.27767])

So note the r-squared on the validation set of 0.89, which is still pretty good, though clearly below the 0.98 on the training set. Also note that our RMSLE of 0.25 on the validation set would rank pretty high on the Kaggle leaderboard.

Note: Random Forests still work with numeric variables even when their order carries no meaning, e.g. ModelID.
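
One way to see why arbitrary IDs still work: a tree only ever asks threshold questions such as `ModelID <= t`, and a few of those in sequence can fence off any individual ID. The sketch below is a hypothetical brute-force search for the best single threshold on made-up data. It is a simplified illustration, not scikit-learn's implementation:

```python
import numpy as np

def best_split(x, y):
    """Return (threshold, weighted_mse) of the best `x <= threshold` split."""
    best_t, best_score = None, np.inf
    for t in np.unique(x)[:-1]:  # the largest value cannot split the data
        left, right = y[x <= t], y[x > t]
        # Size-weighted average of the variance on each side of the split
        score = (left.var() * len(left) + right.var() * len(right)) / len(y)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score

# Made-up IDs with no meaningful order: only ID 3 is expensive
model_id = np.array([1, 1, 2, 2, 3, 3, 4, 4])
price = np.array([9.0, 9.0, 9.0, 9.0, 11.0, 11.0, 9.0, 9.0])

# First split, then split again inside the right-hand branch
t1, score1 = best_split(model_id, price)
mask = model_id > t1
t2, score2 = best_split(model_id[mask], price[mask])
print(t1, score1, t2, score2)
```

On this toy data two successive splits are enough to isolate the one expensive ID, even though the numeric order of the IDs means nothing.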

In [27]:
m = RandomForestRegressor(n_estimators=1, max_depth=3, bootstrap=False, n_jobs=-1) 
m.fit(X_train, y_train)
print_score(m)

[0.5371269683343529, 0.5674541015386864, 0.39703966649954636, 0.42494490873727775]
