# 1. Setup and Data Wrangling

## Initial Setup

Note that before launching Jupyter Notebook, you need to activate the fastai environment; otherwise package imports will fail:

In [16]:
! source activate fastai

Standard Jupyter Notebook magic commands for auto-reloading edited modules and showing plots inline:

In [17]:
%load_ext autoreload
%autoreload 2

%matplotlib inline

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


Next we need to import the fastai and scikit-learn packages. For this, I created a link to the fastai repo in my home directory, using:

`! ln -s <fastai repository> <target directory>`

In [18]:
from fastai.imports import * # fastai.imports pulls in a range of common libraries, e.g. pandas
from fastai.structured import *

from pandas_summary import DataFrameSummary
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier # Random forest classes we'll use
from IPython.display import display

from sklearn import metrics

Much of the code relies on pre-defined functions from the fastai library; I'll try to highlight these at each stage.

Aside: to select rows and columns of a DataFrame by label, use `df.loc[<range>, [<columns>]]`
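
As a quick illustration of the `.loc` pattern, here is a minimal sketch on a hypothetical toy frame (the values are made up for the example, not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy frame; values are illustrative only.
df = pd.DataFrame(
    {
        "SalePrice": [10500, 11000, 11500, 9000],
        "YearMade": [2005, 2005, 2005, 2004],
        "ModelID": [21439, 21439, 21439, 21435],
    }
)

# .loc is label-based, and a label slice includes BOTH endpoints,
# so 1:2 selects the rows labelled 1 and 2.
subset = df.loc[1:2, ["SalePrice", "YearMade"]]
print(subset)
```

Note that this differs from `.iloc`, which is position-based and excludes the end of a slice.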

## Data Acquisition

As with our first model (see Intro to RFs), we'll start by working with data from an old Kaggle competition - aiming to predict the price of bulldozers at auction.

Data path:

In [19]:
PATH = '/Users/Alex/data/bulldozers/'

Load data into memory:

In [20]:
df_raw = pd.read_csv(f'{PATH}Train.csv', low_memory = False, parse_dates=["saledate"])

Function arguments:
- `low_memory=False`: pandas parses the file in one pass instead of in chunks, so each column's dtype is inferred from all of its values (at the cost of more memory)
- `parse_dates`: list of column names to parse as dates
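
To see what `parse_dates` changes, here is a small sketch that reads a hypothetical in-memory CSV (made-up rows standing in for `Train.csv`) with and without the argument:

```python
import io
import pandas as pd

# Hypothetical CSV content; a stand-in for the real training file.
csv_text = (
    "SalesID,SalePrice,saledate\n"
    "1,10500,2011-11-02\n"
    "2,9000,2011-10-25\n"
)

# Without parse_dates the date column is read as plain text.
plain = pd.read_csv(io.StringIO(csv_text))

# With parse_dates the column becomes a datetime column,
# which gives access to the .dt accessor.
parsed = pd.read_csv(io.StringIO(csv_text), low_memory=False, parse_dates=["saledate"])

print(plain["saledate"].dtype)
print(parsed["saledate"].dtype)
print(parsed["saledate"].dt.year.tolist())
```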

A first peek at the data, using a custom `display_all` helper:

In [21]:
def display_all(df):
    with pd.option_context("display.max_rows", 1000): 
        with pd.option_context("display.max_columns", 1000): 
            display(df)

This custom function temporarily raises the display limits to 1000 rows and 1000 columns, so the whole DataFrame is shown without truncation.
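
The key ingredient is `pd.option_context`, which only changes the options inside the `with` block. A minimal sketch of that behaviour, passing both options in a single call (the variable names are just for illustration):

```python
import pandas as pd

default_rows = pd.get_option("display.max_rows")

# option_context accepts several option/value pairs in one call.
with pd.option_context("display.max_rows", 1000, "display.max_columns", 1000):
    inside_rows = pd.get_option("display.max_rows")

# Once the block exits, the previous setting is restored.
after_rows = pd.get_option("display.max_rows")

print(default_rows, inside_rows, after_rows)
```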

In [22]:
display_all(df_raw.tail().transpose()) # Show the last five rows, transposed so that each column becomes a row

| | 401120 | 401121 | 401122 | 401123 | 401124 |
|---|---|---|---|---|---|
| SalesID | 6333336 | 6333337 | 6333338 | 6333341 | 6333342 |
| SalePrice | 10500 | 11000 | 11500 | 9000 | 7750 |
| MachineID | 1840702 | 1830472 | 1887659 | 1903570 | 1926965 |
| ModelID | 21439 | 21439 | 21439 | 21435 | 21435 |
| datasource | 149 | 149 | 149 | 149 | 149 |
| auctioneerID | 1 | 1 | 1 | 2 | 2 |
| YearMade | 2005 | 2005 | 2005 | 2005 | 2005 |
| MachineHoursCurrentMeter | | | | | |
| UsageBand | | | | | |
| saledate | 2011-11-02 00:00:00 | 2011-11-02 00:00:00 | 2011-11-02 00:00:00 | 2011-10-25 00:00:00 | 2011-10-25 00:00:00 |


The Kaggle competition is scored on the root mean squared log error (RMSLE), so we work with log(SalePrice) from here on:

In [23]:
df_raw.SalePrice = np.log(df_raw.SalePrice)
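
Why take the log up front? Assuming the competition metric is RMSLE computed with a plain natural log, an ordinary RMSE on the logged prices gives the same number, so we can use standard regression tooling. A minimal sketch with hypothetical prices (not taken from the dataset):

```python
import numpy as np

def rmse(pred, actual):
    """Root mean squared error between two arrays."""
    return float(np.sqrt(np.mean((pred - actual) ** 2)))

# Hypothetical actual and predicted prices, for illustration only.
actual_price = np.array([10500.0, 11000.0, 9000.0])
pred_price = np.array([10000.0, 12000.0, 9500.0])

# RMSLE written out directly on the raw prices...
rmsle_direct = float(
    np.sqrt(np.mean((np.log(pred_price) - np.log(actual_price)) ** 2))
)

# ...matches a plain RMSE once both sides have been logged.
rmse_on_logs = rmse(np.log(pred_price), np.log(actual_price))

# np.exp undoes the transform when real prices are needed again.
recovered = np.exp(np.log(actual_price))

print(rmsle_direct, rmse_on_logs)
```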

## Data Manipulation

To be able to apply models to our data we need to manipulate a number of the columns:

**1. Dealing with Dates:**

We use fastai's `add_datepart` function to split the date column into a number of other potentially useful columns to feed into the model.

In [24]:
add_datepart

<function fastai.structured.add_datepart>

In [25]:
add_datepart(df_raw, 'saledate')
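
To give a feel for this kind of date expansion, here is a rough sketch using the pandas `.dt` accessor. This is not fastai's implementation: the function `add_date_features`, its `prefix` parameter, and the output column names are all hypothetical.

```python
import pandas as pd

def add_date_features(df, col, prefix="sale_"):
    """Hypothetical stand-in: expand a datetime column into numeric parts."""
    out = df.copy()
    dates = out[col]
    out[prefix + "year"] = dates.dt.year
    out[prefix + "month"] = dates.dt.month
    out[prefix + "day"] = dates.dt.day
    out[prefix + "dayofweek"] = dates.dt.dayofweek  # Monday == 0
    out[prefix + "is_month_end"] = dates.dt.is_month_end
    # Seconds since the Unix epoch, as a single monotonic time feature.
    epoch = pd.Timestamp("1970-01-01")
    out[prefix + "elapsed"] = (dates - epoch) // pd.Timedelta(seconds=1)
    # Drop the raw datetime column once its parts have been extracted.
    return out.drop(columns=[col])

toy = pd.DataFrame({"saledate": pd.to_datetime(["2011-11-02", "2011-10-25"])})
features = add_date_features(toy, "saledate")
print(features.T)
```

Unlike this sketch, which returns a copy, the `add_datepart` call above is used for its side effect on `df_raw`.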

**2. Categories:**

We need to convert any columns with string dtypes into pandas categories so that the model can process them.

In [26]:
train_cats

<function fastai.structured.train_cats>

In [27]:
train_cats(df_raw)
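
The idea can be sketched in plain pandas as below. This is an illustration under my own assumptions, not the fastai source: `strings_to_categories` is a hypothetical helper and the toy data is made up.

```python
import pandas as pd

def strings_to_categories(df):
    """Hypothetical helper: convert every text column to an ordered categorical."""
    out = df.copy()
    for name in out.columns:
        col = out[name]
        if pd.api.types.is_object_dtype(col) or pd.api.types.is_string_dtype(col):
            out[name] = col.astype("category").cat.as_ordered()
    return out

# Toy data for illustration only.
toy = pd.DataFrame(
    {
        "UsageBand": ["Low", "High", "Medium", "Low"],
        "YearMade": [2005, 2004, 2005, 2003],
    }
)
converted = strings_to_categories(toy)

print(converted.dtypes)
# Each string is now backed by an integer code (categories are sorted alphabetically).
print(list(converted["UsageBand"].cat.categories))
print(converted["UsageBand"].cat.codes.tolist())
```

The integer codes are what a tree-based model ends up splitting on.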

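We can also, if we wish, define an explicit order for ordinal categories:

In [28]:
# set_categories returns a new object (the inplace argument was removed in pandas 2.0), so assign the result back
df_raw.UsageBand = df_raw.UsageBand.cat.set_categories(['High', 'Medium', 'Low'], ordered=True)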

It turns out this has minimal effect on a random forest model.
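
To see what the ordering actually changes, here is a minimal sketch on a hypothetical toy column. It assumes a recent pandas, where `set_categories` returns a new Series rather than modifying the original:

```python
import pandas as pd

# Toy column for illustration; by default the categories are sorted alphabetically.
usage = pd.Series(["Low", "High", "Medium", "Low"], dtype="category")
print(list(usage.cat.categories), usage.cat.codes.tolist())

# Give the categories a meaningful order; the underlying integer codes follow it.
ordered = usage.cat.set_categories(["High", "Medium", "Low"], ordered=True)
print(list(ordered.cat.categories), ordered.cat.codes.tolist())
```

Only the mapping from label to integer code changes; the labels themselves stay the same.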

**3. Nulls:**

Let's look at the fraction of null values in each column of our dataset:

In [29]:
display_all(df_raw.isnull().sum().sort_index()/len(df_raw))

Backhoe_Mounting            0.803872
Blade_Extension             0.937129
Blade_Type                  0.800977
Blade_Width                 0.937129
Coupler                     0.466620
Coupler_System              0.891660
Differential_Type           0.826959
Drive_System                0.739829
Enclosure                   0.000810
Enclosure_Type              0.937129
Engine_Horsepower           0.937129
Forks                       0.521154
Grouser_Tracks              0.891899
Grouser_Type                0.752813
Hydraulics                  0.200823
Hydraulics_Flow             0.891899
MachineHoursCurrentMeter    0.644089
MachineID                   0.000000
ModelID                     0.000000
Pad_Type                    0.802720
Pattern_Changer             0.752651
ProductGroup                0.000000
ProductGroupDesc            0.000000
ProductSize                 0.525460
Pushblock                   0.937129
Ride_Control                0.629527
Ripper                      0.740388
...
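
The same per-column null fraction can be checked on a small example. This is a sketch on a hypothetical toy frame, and it also shows that `isnull().mean()` is an equivalent shorthand:

```python
import numpy as np
import pandas as pd

# Toy frame with some missing values, for illustration only.
toy = pd.DataFrame(
    {
        "Enclosure": ["EROPS", None, "OROPS", "EROPS"],
        "MachineID": [1, 2, 3, 4],
        "UsageBand": [None, None, "Low", None],
    }
)

# Count nulls per column, then divide by the number of rows.
null_fraction = toy.isnull().sum().sort_index() / len(toy)

# Equivalent shorthand: the mean of a boolean mask is the fraction of True values.
null_fraction_short = toy.isnull().mean().sort_index()

print(null_fraction)
```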

**4. Saving the Data:**

We'll save this pre-processed data in the feather format:

In [32]:
os.makedirs('tmp', exist_ok=True)
df_raw.to_feather('tmp/raw')