# 1. Setup and Data Wrangling

## Setup

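Activate the fastai environment (a `!` command runs in its own subshell, so this does not change the running kernel; activate the environment before launching Jupyter):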

In [1]:
! source activate fastai

Standard Jupyter Notebook commands:

In [2]:
%load_ext autoreload
%autoreload 2

%matplotlib inline

Importing packages:

In [4]:
from fastai.imports import * # fastai.imports pulls in a range of common libraries, e.g. pandas (pd) and numpy (np)
from fastai.structured import *

from pandas_summary import DataFrameSummary
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier # Random Forest classes we'll use
from IPython.display import display

from sklearn import metrics

## Data Collection

For our initial example, we'll be working with an old Kaggle competition, predicting the price of bulldozers at auction from their dataset.

Data path:

In [4]:
PATH = '/Users/Alex/data/bulldozers/'

Load data into memory:

In [5]:
df_raw = pd.read_csv(f'{PATH}Train.csv', low_memory = False, parse_dates=["saledate"])

First peek at the data, utilising custom `display_all`:

In [6]:
def display_all(df):
    with pd.option_context("display.max_rows", 1000): 
        with pd.option_context("display.max_columns", 1000): 
            display(df)

In [7]:
display_all(df_raw.tail().transpose()) # Show the last five rows, transposed so that each column reads as a row

| | 401120 | 401121 | 401122 | 401123 | 401124 |
|---|---|---|---|---|---|
| SalesID | 6333336 | 6333337 | 6333338 | 6333341 | 6333342 |
| SalePrice | 10500 | 11000 | 11500 | 9000 | 7750 |
| MachineID | 1840702 | 1830472 | 1887659 | 1903570 | 1926965 |
| ModelID | 21439 | 21439 | 21439 | 21435 | 21435 |
| datasource | 149 | 149 | 149 | 149 | 149 |
| auctioneerID | 1 | 1 | 1 | 2 | 2 |
| YearMade | 2005 | 2005 | 2005 | 2005 | 2005 |
| MachineHoursCurrentMeter | | | | | |
| UsageBand | | | | | |
| saledate | 2011-11-02 00:00:00 | 2011-11-02 00:00:00 | 2011-11-02 00:00:00 | 2011-10-25 00:00:00 | 2011-10-25 00:00:00 |


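The Kaggle competition is scored on the root mean squared log error (RMSLE) between predicted and actual prices, so we model log(SalePrice) and can then use plain RMSE: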

In [8]:
df_raw.SalePrice = np.log(df_raw.SalePrice)
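
To see why taking the log up front is convenient, here is a minimal sketch with made-up prices. It assumes the metric is RMSLE defined with a plain natural log; the `rmsle` and `rmse` helpers below are illustrative and not part of fastai or the competition code.

```python
import numpy as np

def rmsle(pred, actual):
    """Root mean squared log error, using a plain natural log (assumption)."""
    return np.sqrt(np.mean((np.log(pred) - np.log(actual)) ** 2))

def rmse(pred, actual):
    """Ordinary root mean squared error."""
    return np.sqrt(np.mean((pred - actual) ** 2))

actual = np.array([10500.0, 11000.0, 9000.0])  # made-up sale prices
pred = np.array([10000.0, 12000.0, 9500.0])    # made-up predictions

# RMSLE on raw prices is the same number as RMSE on log prices
score_on_prices = rmsle(pred, actual)
score_on_logs = rmse(np.log(pred), np.log(actual))
```

Once the target is stored as a log, any model that minimises squared error is optimising the competition metric directly.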

## Data Cleaning

To be able to feed our data into the model, we need to first change strings into numerical representations of categories.

First deal with the dates, expanding `saledate` into a number of additional columns (year, month, day of week, and so on):

In [10]:
add_datepart(df_raw, 'saledate')
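
As a rough idea of what this kind of date expansion does, here is a sketch in plain pandas. `add_date_features` is a hypothetical stand-in, not the fastai implementation, and it only produces a subset of the columns; the column names follow the `sale...` names that appear later in `df.columns`.

```python
import pandas as pd

def add_date_features(df, col, prefix):
    """Hypothetical stand-in for add_datepart: expand one datetime column."""
    dates = df[col]
    out = df.drop(columns=[col])  # assumption: the original column is dropped
    out[prefix + "Year"] = dates.dt.year
    out[prefix + "Month"] = dates.dt.month
    out[prefix + "Day"] = dates.dt.day
    out[prefix + "Dayofweek"] = dates.dt.dayofweek  # Monday == 0
    out[prefix + "Dayofyear"] = dates.dt.dayofyear
    out[prefix + "Is_month_end"] = dates.dt.is_month_end
    # Seconds since the Unix epoch, so the model still sees a continuous time axis
    out[prefix + "Elapsed"] = (dates - pd.Timestamp("1970-01-01")) // pd.Timedelta(seconds=1)
    return out

demo = pd.DataFrame({"saledate": pd.to_datetime(["2011-11-02", "2011-10-25"])})
demo = add_date_features(demo, "saledate", prefix="sale")
```

Splitting the date this way lets a tree split on, say, the day of the week or the year directly, which it cannot do with a raw timestamp.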

**Now turn strings into categories:**

In [11]:
train_cats(df_raw)

We can put the strings into ordinal categories if we wish, although it turns out that for RFs this makes minimal difference. Let's apply this to an obvious choice of category, `UsageBand`.

In [12]:
df_raw.UsageBand = df_raw.UsageBand.cat.set_categories(['High', 'Medium', 'Low'], ordered=True) # assign the result back; the inplace argument is gone in pandas 2
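
The effect of the category order on the underlying integer codes can be seen on a toy column. This is a small sketch in plain pandas with made-up values; it does not use `train_cats`.

```python
import pandas as pd

toy = pd.DataFrame({"UsageBand": ["Low", "High", None, "Medium"]})  # made-up values

# Converting strings to a categorical sorts the categories alphabetically by default
toy["UsageBand"] = toy["UsageBand"].astype("category")
default_order = list(toy["UsageBand"].cat.categories)

# Impose a meaningful order instead; the codes follow the order given here
toy["UsageBand"] = toy["UsageBand"].cat.set_categories(
    ["High", "Medium", "Low"], ordered=True
)
codes = toy["UsageBand"].cat.codes  # missing values get the code -1
```

The strings are stored once as categories and each row holds only a small integer code, which is what eventually gets passed to the model.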

Now let's look at the fraction of NULLs in each column:

In [13]:
display_all(df_raw.isnull().sum().sort_index()/len(df_raw))

Backhoe_Mounting            0.803872
Blade_Extension             0.937129
Blade_Type                  0.800977
Blade_Width                 0.937129
Coupler                     0.466620
Coupler_System              0.891660
Differential_Type           0.826959
Drive_System                0.739829
Enclosure                   0.000810
Enclosure_Type              0.937129
Engine_Horsepower           0.937129
Forks                       0.521154
Grouser_Tracks              0.891899
Grouser_Type                0.752813
Hydraulics                  0.200823
Hydraulics_Flow             0.891899
MachineHoursCurrentMeter    0.644089
MachineID                   0.000000
ModelID                     0.000000
Pad_Type                    0.802720
Pattern_Changer             0.752651
ProductGroup                0.000000
ProductGroupDesc            0.000000
ProductSize                 0.525460
Pushblock                   0.937129
Ride_Control                0.629527
Ripper                      0.740388
S

We'll save this pre-processed data in feather format:

In [14]:
tmp_dir = os.path.expanduser('~/tmp') # os.makedirs does not expand '~' by itself
os.makedirs(tmp_dir, exist_ok=True)
df_raw.to_feather(os.path.join(tmp_dir, 'raw'))

## RF Model

Read back in our DF:

In [15]:
df_raw = pd.read_feather(os.path.expanduser('~/tmp/raw'))

Now we need to convert categorical variables to their numerical representations, so we can push this into our model. For this, we leverage `proc_df`:

In [16]:
df, y = proc_df(df_raw, 'SalePrice')

This function actually has quite a lot going on:
- Missing values in numeric columns are replaced with the column median, and a Boolean indicator column is added to record which rows were missing. This applies to numeric columns only; for categorical columns, pandas already represents a missing value with the code -1.
- Categories are replaced by their numeric codes.
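
The two steps listed above can be sketched in a few lines of pandas. `fill_and_encode` is a hypothetical, simplified helper written for illustration; it is not the `proc_df` implementation, and details such as the `_na` suffix are assumptions.

```python
import numpy as np
import pandas as pd

def fill_and_encode(df):
    """Hypothetical, simplified version of the two steps described above."""
    out = df.copy()
    for name in list(out.columns):
        col = out[name]
        if isinstance(col.dtype, pd.CategoricalDtype):
            # Replace categories with integer codes; missing becomes -1
            out[name] = col.cat.codes
        elif pd.api.types.is_numeric_dtype(col) and col.isnull().any():
            # Record where values were missing, then fill with the median
            out[name + "_na"] = col.isnull()
            out[name] = col.fillna(col.median())
    return out

toy = pd.DataFrame({
    "hours": [1.0, np.nan, 3.0],                               # made-up numeric column
    "band": pd.Series(["High", None, "Low"], dtype="category"),  # made-up category
})
encoded = fill_and_encode(toy)
```

Keeping the indicator column means the model can still learn from the fact that a value was missing, even after the gap has been filled.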

In [19]:
df.columns

Index(['SalesID', 'MachineID', 'ModelID', 'datasource', 'auctioneerID',
       'YearMade', 'MachineHoursCurrentMeter', 'UsageBand', 'fiModelDesc',
       'fiBaseModel', 'fiSecondaryDesc', 'fiModelSeries', 'fiModelDescriptor',
       'ProductSize', 'fiProductClassDesc', 'state', 'ProductGroup',
       'ProductGroupDesc', 'Drive_System', 'Enclosure', 'Forks', 'Pad_Type',
       'Ride_Control', 'Stick', 'Transmission', 'Turbocharged',
       'Blade_Extension', 'Blade_Width', 'Enclosure_Type', 'Engine_Horsepower',
       'Hydraulics', 'Pushblock', 'Ripper', 'Scarifier', 'Tip_Control',
       'Tire_Size', 'Coupler', 'Coupler_System', 'Grouser_Tracks',
       'Hydraulics_Flow', 'Track_Type', 'Undercarriage_Pad_Width',
       'Stick_Length', 'Thumb', 'Pattern_Changer', 'Grouser_Type',
       'Backhoe_Mounting', 'Blade_Type', 'Travel_Controls',
       'Differential_Type', 'Steering_Controls', 'saleYear', 'saleMonth',
       'saleWeek', 'saleDay', 'saleDayofweek', 'saleDayofyear',
       'saleI

**Fitting the actual model:**

In [20]:
m = RandomForestRegressor(n_jobs=-1)
m.fit(df,y)
m.score(df,y)

0.983112732173968

Note that this score is computed on the same data the model was trained on. An r-squared this high straight away is a warning sign that we may be **over-fitting**, which the training score alone cannot confirm or rule out.

We diagnose this using training and validation sets:

In [21]:
def split_vals(a,n): return a[:n].copy(), a[n:].copy() # Note here making copy

n_valid = 12000  # same as Kaggle's test set size
n_trn = len(df)-n_valid
raw_train, raw_valid = split_vals(df_raw, n_trn)
X_train, X_valid = split_vals(df, n_trn)
y_train, y_valid = split_vals(y, n_trn)

X_train.shape, y_train.shape, X_valid.shape

((389125, 66), (389125,), (12000, 66))

Model Assessment:

In [22]:
def rmse(x,y): return math.sqrt(((x-y)**2).mean()) # Loss Function as specified by Kaggle

def print_score(m):
    res = [rmse(m.predict(X_train), y_train), rmse(m.predict(X_valid), y_valid),
                m.score(X_train, y_train), m.score(X_valid, y_valid)]
    if hasattr(m, 'oob_score_'): res.append(m.oob_score_)
    print(res)

Re-fitting:

In [23]:
RF = RandomForestRegressor(n_jobs = -1) # n_jobs = -1 -> parallelize across CPUs
RF.fit(X_train, y_train)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=-1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

In [24]:
print_score(RF)

[0.09051099118128908, 0.24649444815720803, 0.98288770791377233, 0.89149187524598328]


In [26]:
RF.predict(X_valid)

array([ 9.17392,  9.28877,  9.2425 , ...,  9.32105,  9.27767,  9.27767])

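Note the r-squared of 0.89 on the validation set, which is still pretty good, though clearly lower than the 0.98 on the training set, which confirms some over-fitting. The RMSLE of 0.25 on the validation data would also rank fairly high on the Kaggle leaderboard.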

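Note: Random Forests still work with numerically coded variables even when the ordering of the codes is meaningless (e.g. `ModelID`), because a tree can isolate any group of values with a few successive splits.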
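
A toy illustration of that note, using a single scikit-learn decision tree rather than a full forest. The codes and prices are made up: only the arbitrary code 2 has a different target value, so there is no meaningful order for the tree to exploit.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Made-up data: five arbitrary category codes, repeated 20 times each
codes = np.tile(np.arange(5), 20).reshape(-1, 1)
prices = np.where(codes.ravel() == 2, 10.0, 0.0)  # only code 2 is "expensive"

tree = DecisionTreeRegressor(random_state=0).fit(codes, prices)
preds = tree.predict(codes)

# Internal (splitting) nodes have a non-negative feature index; leaves do not
n_splits = int((tree.tree_.feature >= 0).sum())
```

The tree recovers the pattern by cutting just below and just above the code of interest, which is why arbitrary integer codes are usually acceptable input for tree-based models.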

In [27]:
m = RandomForestRegressor(n_estimators=1, max_depth=3, bootstrap=False, n_jobs=-1) 
m.fit(X_train, y_train)
print_score(m)

[0.5371269683343529, 0.5674541015386864, 0.39703966649954636, 0.42494490873727775]


# Lesson 3 - Raw Notes

## Aside: $R^2$

$R^2$ represents the amount of variance about the mean that's explained by our model.

$$ R^2 = 1 - \frac{SS_{res}}{SS_{tot}} $$

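$SS_{res} = \sum_i (y_i - \hat{y}_i)^2$ --> sum of squared errors of the model's fitted values  
$SS_{tot} = \sum_i (y_i - \bar{y})^2$ --> sum of squared deviations from the mean value (dividing both by $n$ gives the equivalent ratio of MSEs)

Key Point: $$ R^2 \in (- \infty,1] $$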

Any negative values of $R^2$ mean 'your model is worse than predicting the mean'.
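
The definition can be checked by hand on a few made-up numbers. The `r_squared` helper below is an illustrative sketch of the formula, not a library function.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, computed directly from the definition."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0, 4.0])  # made-up targets

r2_perfect = r_squared(y_true, y_true)                        # exact predictions
r2_mean = r_squared(y_true, np.full(4, y_true.mean()))        # always predict the mean
r2_bad = r_squared(y_true, np.array([4.0, 3.0, 2.0, 1.0]))    # predictions reversed
```

Predicting the mean gives exactly zero, a perfect fit gives one, and the reversed predictions come out negative, matching the statement that a negative value means the model is worse than predicting the mean.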

A high $R^2$ on the training data is not necessarily indicative of high predictive power (we may be over-fitting, i.e. sitting at the high-variance end of the bias-variance trade-off).

Instead, we must utilise validation sets.

Be particularly careful with time series data. In this example there is a time element, and the test set covers a separate time period, so rather than sampling at random we train on the first `n_trn` rows and validate on the last 12,000, keeping the rows in their original order so that the validation set mimics the test set.

In [29]:
def split_vals(a,n): return a[:n].copy(), a[n:].copy() # Note here making copy

n_valid = 12000  # Validation set size
n_trn = len(df)-n_valid # Train set size
raw_train, raw_valid = split_vals(df_raw, n_trn)
X_train, X_valid = split_vals(df, n_trn)
y_train, y_valid = split_vals(y, n_trn)

# Training data: X_train
# Training response: y_train
# Validation data: X_valid
# Validation response: y_valid

X_train.shape, y_train.shape, X_valid.shape

((389125, 66), (389125,), (12000, 66))

Key distinction:
1. Validation set: used to evaluate and tune our model during development
2. Hold-out test set: set aside from the data and only used once, at the final opportunity

This distinction is important so that we don't fit our model specifically for the test set.

## Speeding things up:

In [41]:
df_trn, y_trn = proc_df(df_raw, 'SalePrice')
X_train, _ = split_vals(df_trn, n_trn)
y_train, _ = split_vals(y_trn, n_trn)
set_rf_samples(20000) # train each tree on a random subsample of 20,000 rows instead of the full training set

In [43]:
m = RandomForestRegressor(n_jobs=-1)
m.fit(X_train, y_train)
print_score(m)

[0.2601265578630277, 0.27988411599673496, 0.86745944883692305, 0.86010426379720373]
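
For readers without the old fastai helpers, a similar effect to `set_rf_samples` can be approximated with the `max_samples` argument that scikit-learn's random forests accept (available from version 0.22, as far as I know). This is a sketch on synthetic data, not the notebook's dataset, and it is not claimed to behave identically to `set_rf_samples`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data standing in for the real training set
rng = np.random.RandomState(0)
X = rng.rand(500, 3)
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=500)

# Each tree is fitted on a bootstrap sample of only 100 rows,
# so training is faster while different trees still see different rows
m_sub = RandomForestRegressor(n_estimators=10, max_samples=100, random_state=0)
m_sub.fit(X, y)
score = m_sub.score(X, y)
```

Because every tree draws a different subsample, adding more trees still lets the forest see most of the data overall, which is the trade-off behind the faster fit above.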
