### Introduction

This notebook contains my notes on the first lecture of the `fastai` course taught by `Jeremy Howard`.

#### Initial set up

In [1]:
# as far as I understand, these magics auto-reload modules: after changing something in a module, you don't have to restart the notebook's kernel for the changes to take effect.
%load_ext autoreload
%autoreload 2

# add the parent directory to the import path; this is necessary in order to load the georggie module.
import sys
sys.path.append('../')

!pwd

/home/georggie/DataAnalysis/The-Mechanics-Of-Machine-Learning/lessons


In [7]:
# georggie module import (import everything (*) from the module).
from ai import *
from sklearn.ensemble import RandomForestRegressor

# path to the dataset
PATH = "../resources"

In [8]:
# load data from CSV to pandas data frame.
data = pd.read_csv(f'{PATH}/bulldozers.csv', low_memory=False, parse_dates=['saledate'])

# display as much as it can be displayed and transpose rows and columns.
display_all(data.head().transpose())

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| SalesID | 1139246 | 1139248 | 1139249 | 1139251 | 1139253 |
| SalePrice | 66000 | 57000 | 10000 | 38500 | 11000 |
| MachineID | 999089 | 117657 | 434808 | 1026470 | 1057373 |
| ModelID | 3157 | 77 | 7009 | 332 | 17311 |
| datasource | 121 | 121 | 121 | 121 | 121 |
| auctioneerID | 3 | 3 | 3 | 3 | 3 |
| YearMade | 2004 | 1996 | 2001 | 2001 | 2007 |
| MachineHoursCurrentMeter | 68 | 4640 | 2838 | 3486 | 722 |
| UsageBand | Low | Low | High | High | Medium |
| saledate | 2006-11-16 00:00:00 | 2004-03-26 00:00:00 | 2004-02-26 00:00:00 | 2011-05-19 00:00:00 | 2009-07-23 00:00:00 |


The project is evaluated on `RMSLE` (root mean squared log error). In other words, we take the difference between the log of our predicted price and the log of the actual price, square those differences, average them, and take the square root.

In [9]:
# take log of prices since we are working with RMSLE.
data.SalePrice = np.log(data.SalePrice)
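
To make the metric concrete, here is a minimal sketch of RMSLE in NumPy. The function names `rmsle` and `rmse` and the predicted prices are made up for illustration, and the sketch uses the plain log described above. It also shows why taking the log of `SalePrice` up front is convenient: ordinary RMSE on the logged values gives the same number.

```python
import numpy as np

def rmse(predicted, actual):
    """Root mean squared error between two arrays."""
    diff = np.asarray(predicted, dtype=float) - np.asarray(actual, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

def rmsle(predicted, actual):
    """RMSE computed on the logs of the predictions and the targets."""
    return rmse(np.log(np.asarray(predicted, dtype=float)),
                np.log(np.asarray(actual, dtype=float)))

actual = np.array([66000.0, 57000.0, 10000.0])     # first three SalePrice values
predicted = np.array([60000.0, 57000.0, 12000.0])  # made-up predictions

print(rmsle(predicted, actual))
# Same value: RMSE on targets and predictions that are already logged.
print(rmse(np.log(predicted), np.log(actual)))
```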

#### Preprocessing

* We should always try to extract new features from a complete datetime field. Without expanding the datetime into additional features (year, month, day of week and so on), the model can't capture trends or cyclical behavior as a function of time at any of these granularities.

* Some features in our data set contain strings, which most ML algorithms do not accept. We need to convert them to categorical variables.
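
The first point can be sketched with the pandas `.dt` accessor. The helper `add_date_parts` below is hypothetical; it only illustrates the idea and is not necessarily what `extract_date_features` does internally.

```python
import pandas as pd

def add_date_parts(df, column):
    """Replace a datetime column with numeric parts (hypothetical helper)."""
    dates = df[column]
    df[column + '_year'] = dates.dt.year
    df[column + '_month'] = dates.dt.month
    df[column + '_day'] = dates.dt.day
    df[column + '_dayofweek'] = dates.dt.dayofweek  # Monday is 0
    df[column + '_dayofyear'] = dates.dt.dayofyear
    df[column + '_is_month_end'] = dates.dt.is_month_end
    # The raw datetime is dropped once its parts have been extracted.
    df.drop(columns=[column], inplace=True)

sales = pd.DataFrame({'saledate': pd.to_datetime(['2006-11-16', '2004-03-26'])})
add_date_parts(sales, 'saledate')
print(sales.T)
```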

In [10]:
extract_date_features(data, 'saledate')
convert_to_categorical(data)
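
A minimal sketch of what converting string columns to categoricals can look like in pandas. The helper `strings_to_categories` is hypothetical and may differ from the `convert_to_categorical` used above.

```python
import pandas as pd

def strings_to_categories(df):
    """Turn every string column into an ordered categorical, in place (hypothetical helper)."""
    for name in list(df.columns):
        if pd.api.types.is_string_dtype(df[name]):
            df[name] = df[name].astype('category').cat.as_ordered()

usage = pd.DataFrame({
    'UsageBand': ['Low', 'Low', 'High', 'High', 'Medium'],
    'YearMade': [2004, 1996, 2001, 2001, 2007],
})
strings_to_categories(usage)

# By default pandas sorts the categories alphabetically.
print(list(usage.UsageBand.cat.categories))
print(usage.dtypes)
```

Numeric columns such as `YearMade` are left untouched; only the string column changes type.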

In [11]:
# when converting the `UsageBand` feature to a categorical variable we got the order High < Low < Medium, which is not natural (and may also cause some inefficiencies), so we put the categories in their natural order.
# the result is assigned back because the `inplace` argument was removed in pandas 2.0.
data.UsageBand = data.UsageBand.cat.set_categories(['High', 'Medium', 'Low'], ordered=True)

# Normally, pandas will continue displaying the text categories, while treating them as numerical data internally. Optionally, we can replace the text categories with numbers, which will make this variable non-categorical.
data.UsageBand = data.UsageBand.cat.codes
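
The effect of the category order on the numeric codes can be checked on a small made-up series (the variable names here are illustrative only):

```python
import pandas as pd

band = pd.Series(['Low', 'Low', 'High', 'High', 'Medium']).astype('category')

# Default (alphabetical) order: High=0, Low=1, Medium=2.
default_codes = band.cat.codes.tolist()
print(default_codes)

# Explicit order: High=0, Medium=1, Low=2.
band = band.cat.set_categories(['High', 'Medium', 'Low'], ordered=True)
ordered_codes = band.cat.codes.tolist()
print(ordered_codes)
```

The codes are simply positions in the category list, so changing the order of the categories changes the numbers the model sees.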

In [7]:
# we also need to be careful about missing values because they can cause problems during model training. The command below shows the percentage of missing values for each feature in our data set.
display_all(data.isnull().sum().sort_index()/len(data) * 100)

Backhoe_Mounting            80.387161
Blade_Extension             93.712932
Blade_Type                  80.097725
Blade_Width                 93.712932
Coupler                     46.662013
Coupler_System              89.165971
Differential_Type           82.695918
Drive_System                73.982923
Enclosure                    0.081022
Enclosure_Type              93.712932
Engine_Horsepower           93.712932
Forks                       52.115425
Grouser_Tracks              89.189903
Grouser_Type                75.281271
Hydraulics                  20.082269
Hydraulics_Flow             89.189903
MachineHoursCurrentMeter    64.408850
MachineID                    0.000000
ModelID                      0.000000
Pad_Type                    80.271985
Pattern_Changer             75.265067
ProductGroup                 0.000000
ProductGroupDesc             0.000000
ProductSize                 52.545964
Pushblock                   93.712932
Ride_Control                62.952696
Ripper      
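
The same calculation on a tiny made-up frame makes the numbers easy to verify by hand. The column names are borrowed from the data set, but the values are invented; `isnull().mean()` is equivalent to dividing the per-column sum by `len(data)`.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Enclosure': ['a', 'b', 'a', None],                          # 1 of 4 missing
    'MachineHoursCurrentMeter': [68.0, np.nan, np.nan, np.nan],  # 3 of 4 missing
    'ModelID': [3157, 77, 7009, 332],                            # nothing missing
})

# Share of missing values per column, as a percentage.
missing_pct = toy.isnull().mean().sort_index() * 100
print(missing_pct)
```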

We'll replace categorical variables with their numeric codes, handle missing continuous values (replacing them with the median), and split the dependent variable off into a separate variable.

In [12]:
data, target, changes = preprocess_dataframe(data, 'SalePrice')
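
A rough sketch of these three steps, under stated assumptions: `preprocess_sketch` is a hypothetical stand-in, not the actual `preprocess_dataframe`. Shifting the codes by one (so that missing categories become 0 instead of -1) is a choice made for this sketch, and it returns the medians it used, which may or may not correspond to what `changes` holds.

```python
import numpy as np
import pandas as pd

def preprocess_sketch(df, target_name):
    """Return (features, target, medians); a hypothetical stand-in for the step above."""
    df = df.copy()
    target = df.pop(target_name).to_numpy()  # split off the dependent variable
    medians = {}
    for name in list(df.columns):
        column = df[name]
        if isinstance(column.dtype, pd.CategoricalDtype):
            # Codes start at 0 and missing values are -1; adding 1 maps missing to 0.
            df[name] = column.cat.codes + 1
        elif pd.api.types.is_numeric_dtype(column) and column.isnull().any():
            medians[name] = float(column.median())
            df[name] = column.fillna(medians[name])
    return df, target, medians

frame = pd.DataFrame({
    'SalePrice': [11.0, 10.9, 9.2, 10.5],  # made-up log prices
    'UsageBand': pd.Categorical(['Low', None, 'High', 'Medium']),
    'MachineHoursCurrentMeter': [68.0, np.nan, 2838.0, 3486.0],
})
features, target, medians = preprocess_sketch(frame, 'SalePrice')
print(features)
print(medians)
```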

At this point we have something we can pass to a random forest (RF). Let's see what we get.

In [13]:
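# n_jobs=-1 uses all available CPU cores.
model = RandomForestRegressor(n_jobs=-1)
model.fit(data, target)
# score returns R^2; here it is computed on the same data the model was trained on.
model.score(data, target)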

0.9879554790323682
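
For scikit-learn regressors, `score` returns the coefficient of determination, R². The value above is measured on the same rows the model was trained on. As a hedged illustration of what the number means, here is R² computed by hand in NumPy; the function name `r_squared` and the sample values are made up.

```python
import numpy as np

def r_squared(actual, predicted):
    """1 - (residual sum of squares / total sum of squares)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ss_res = np.sum((actual - predicted) ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

actual = np.array([1.0, 2.0, 3.0, 4.0])

perfect = r_squared(actual, actual)                             # exact predictions
baseline = r_squared(actual, np.full(4, actual.mean()))         # always predict the mean
reversed_order = r_squared(actual, np.array([4.0, 3.0, 2.0, 1.0]))  # worse than the mean

print(perfect, baseline, reversed_order)
```

A score of 1 means perfect predictions, 0 means no better than always predicting the mean, and negative values mean worse than that baseline.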