## 1. Imports

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

In [2]:
from fastai.imports import *
from fastai.structured import *

from pandas_summary import DataFrameSummary
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from IPython.display import display

from sklearn import metrics

In [3]:
PATH = "../../data/competitions/bluebook-for-bulldozers/"

In [4]:
!ls {PATH}

Data%20Dictionary.xlsx		  Train.7z	     Train.zip
Machine_Appendix.csv		  TrainAndValid.7z   Valid.7z
median_benchmark.csv		  TrainAndValid.csv  Valid.csv
random_forest_benchmark_test.csv  TrainAndValid.zip  ValidSolution.csv
Test.csv			  Train.csv	     Valid.zip


## 2. Data

`low_memory=False` tells Pandas to read the whole file before deciding what the column types are, rather than inferring them chunk by chunk (which can leave a column with mixed types).

`parse_dates=[...]` is used for any columns that contain dates.

In [5]:
df_raw = pd.read_csv(f'{PATH}Train.csv', low_memory=False, parse_dates=['saledate'])
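To see what `parse_dates` changes, here is a minimal sketch on a tiny in-memory CSV. The rows and the `csv_text` variable are made up for illustration and are not taken from `Train.csv`; only the column names mirror the real file.

```python
import io
import pandas as pd

# Toy CSV (hypothetical values); only the column names mirror the dataset.
csv_text = "SalesID,SalePrice,saledate\n1,66000,2006-11-16\n2,57000,2004-03-26\n"

# Without parse_dates the date column is read as plain text.
plain = pd.read_csv(io.StringIO(csv_text))

# With parse_dates the same column becomes a datetime column.
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=['saledate'])

print(plain['saledate'].dtype)
print(parsed['saledate'].dtype)
```

Only the parsed version exposes the `.dt` accessor that the date feature engineering later in this section relies on.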

Entering a DataFrame to display it will truncate it if it's too long.
This function sets the truncation threshold to 1000 rows & cols.

In [6]:
def display_all(df):
    with pd.option_context("display.max_rows", 1000):
        with pd.option_context("display.max_columns", 1000):
            display(df)

`df_raw.tail()` shows the last few rows of the DataFrame. By default, columns run
across the top and rows down the side. There are a lot of columns, so `.transpose()`
flips the table onto its side, which keeps all of them visible.

In [7]:
# display_all(df_raw.tail().transpose())

In [8]:
# display_all(df_raw.describe(include='all').transpose())

In [9]:
# df_raw.head()

[RMSLE](https://www.kaggle.com/c/bluebook-for-bulldozers#evaluation) (root mean squared log error) is the metric used in the Kaggle competition. So by taking the log of all sale prices, we can just use RMSE later to calculate our loss. RMSLE: $\sqrt{\frac{1}{n}\sum_{i}\big(\log(\text{prediction}_i) - \log(\text{actual}_i)\big)^2}$. Because it compares logs, the metric cares about ratios, not absolute differences.
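A small worked example of the "ratios, not absolutes" point, using made-up prices: both predictions below are off by a factor of 2, so each contributes the same log error even though the absolute errors differ by a factor of 5. The `rmse` helper is a hypothetical name defined here, not part of any library.

```python
import numpy as np

def rmse(pred, actual):
    """Root mean squared error between two arrays."""
    return np.sqrt(np.mean((pred - actual) ** 2))

# Hypothetical prices: each prediction is exactly double the actual price.
actual = np.array([10000.0, 50000.0])
pred = np.array([20000.0, 100000.0])

# RMSLE is just RMSE computed on the logged values.
rmsle = rmse(np.log(pred), np.log(actual))

print(rmsle, np.log(2))  # both errors are a ratio of 2, so RMSLE equals log(2)
```

This is why logging `SalePrice` once up front lets plain RMSE stand in for the competition metric afterwards.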

Here we also overwrite a column with a new one:

In [10]:
df_raw.SalePrice = np.log(df_raw.SalePrice)

### 2.2.2 Initial Processing

A Random Forest is something of a universal machine learning technique. It can predict either a category or a continuous variable, and it can work with columns of almost any kind (pixel data, zip codes, etc.). RFs generally don't overfit badly, and overfitting is easy to prevent. RFs don't generally require a separate validation set, and can tell you how well they generalize even with only one dataset. RFs make few, if any, statistical assumptions about your data, and require very little feature engineering.

`model.fit(`__`Independent Variables`__`, `__`Dependent Variables`__`)`

Independent variables are used to predict; the dependent variable is what gets predicted. `pandas.DataFrame.drop(..)` returns a new DataFrame with the listed rows/columns removed. So we use everything but `SalePrice` to predict `SalePrice`.
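A minimal sketch of that split on a toy DataFrame (the values are invented; the names `X` and `y` are just a common convention, not something the notebook defines):

```python
import pandas as pd

# Toy frame with one feature column and the target column.
toy = pd.DataFrame({'YearMade': [2004, 1996], 'SalePrice': [11.10, 10.95]})

X = toy.drop('SalePrice', axis=1)  # independent variables: everything except the target
y = toy['SalePrice']               # dependent variable: the target itself

print(list(X.columns))    # the target is gone from X
print(list(toy.columns))  # the original frame is unchanged
```

Because `drop` returns a new DataFrame, the original still holds `SalePrice` and can be reused.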

In [11]:
model = RandomForestRegressor(n_jobs=-1) # n_jobs: number of cores to use. -1 ==> all
model.fit(df_raw.drop('SalePrice', axis=1), df_raw.SalePrice)

ValueError: could not convert string to float: 'Conventional'

This dataset contains a mix of **continuous** and __categorical__ variables. Most ML models (including RFs) require numbers, so we need to convert all our columns to numbers.


`sklearn.ensemble.RandomForestRegressor`: predict __continuous__ variables

`sklearn.ensemble.RandomForestClassifier`: predict __categorical__ variables

---

One issue is that `saledate` was parsed as a date rather than as a number. If we look at it, its dtype is `datetime64`, which is __not__ a number. So we need to do our first bit of feature engineering.

In [20]:
df_raw.saledate[:5]

0   2006-11-16
1   2004-03-26
2   2004-02-26
3   2011-05-19
4   2009-07-23
Name: saledate, dtype: datetime64[ns]
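The dtype distinction can be checked directly with pandas' type helpers. This is a small sketch on a made-up Series rather than on `df_raw`:

```python
import pandas as pd

# Two example dates, parsed into a datetime64 Series.
dates = pd.Series(pd.to_datetime(['2006-11-16', '2004-03-26']))

is_numeric = pd.api.types.is_numeric_dtype(dates)            # False: not a number
is_datetime = pd.api.types.is_datetime64_any_dtype(dates)    # True: a datetime column
years = dates.dt.year.tolist()                               # numeric parts live under .dt

print(is_numeric, is_datetime, years)
```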

Inside `fastai.structured` is a function called `add_datepart`, which we'll use to fix this.

__Overview of `add_datepart`:__

1. We pass in a DataFrame and a field name (in this case `'saledate'`) to `add_datepart(df, fldname)`. We can't write `df.fldname` because that would look for a column literally named 'fldname'. Instead, `df[fldname]` is how we grab a column when its name is stored in the variable `fldname`. This gives us the field itself (`fld`), a `pd.Series`.

2. `add_datepart` then goes through a list of date attribute strings ('Year', 'Month', 'Dayofyear', etc) and builds new columns by looking them up in `fld`'s datetime attributes (`fld.dt`).

3. It finally drops the original `fldname` column (`'saledate'` here) because it isn't numerical.
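The three steps above can be sketched roughly as follows. This is a simplified stand-in written for these notes, not fastai's actual source: the function name `add_datepart_sketch`, the exact attribute list, and the way the elapsed seconds are computed are all assumptions.

```python
import re
import pandas as pd

def add_datepart_sketch(df, fldname, drop=True):
    """Hypothetical, simplified version of add_datepart; modifies df in place."""
    fld = df[fldname]                          # step 1: the column as a pd.Series
    prefix = re.sub(r'[Dd]ate$', '', fldname)  # 'saledate' -> 'sale'
    attrs = ['Year', 'Month', 'Day', 'Dayofweek', 'Dayofyear',
             'Is_month_end', 'Is_month_start', 'Is_quarter_end',
             'Is_quarter_start', 'Is_year_end', 'Is_year_start']
    for name in attrs:                         # step 2: look each one up on fld.dt
        df[prefix + name] = getattr(fld.dt, name.lower())
    # ISO week number (taken from isocalendar() in this sketch)
    df[prefix + 'Week'] = fld.dt.isocalendar().week.astype(int)
    # Whole seconds since 1970-01-01, so the model still sees a time trend
    df[prefix + 'Elapsed'] = (fld - pd.Timestamp('1970-01-01')) // pd.Timedelta(seconds=1)
    if drop:                                   # step 3: drop the non-numeric original
        df.drop(columns=[fldname], inplace=True)

toy = pd.DataFrame({'saledate': pd.to_datetime(['2006-11-16', '2004-03-26'])})
add_datepart_sketch(toy, 'saledate')
print(toy[['saleYear', 'saleDayofyear', 'saleElapsed']])
```

The two toy dates are the first two `saledate` values shown earlier, so the resulting columns can be compared against the real `df_raw` output.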

---

***NOTE***: `'saledate'` is a date type because we told Pandas to make it such via `parse_dates=["saledate"]`. That's why it has the relevant datetime attributes.

In [22]:
add_datepart(df_raw, 'saledate')
df_raw.saleYear.head()

0    2006
1    2004
2    2004
3    2011
4    2009
Name: saleYear, dtype: int64

The `'saledate'` column has now been replaced by numerical columns such as `saleYear` (`int64`). If we check the columns of the DataFrame we'll see the new ones added by `add_datepart`:

In [23]:
df_raw.columns

Index(['SalesID', 'SalePrice', 'MachineID', 'ModelID', 'datasource',
       'auctioneerID', 'YearMade', 'MachineHoursCurrentMeter', 'UsageBand',
       'fiModelDesc', 'fiBaseModel', 'fiSecondaryDesc', 'fiModelSeries',
       'fiModelDescriptor', 'ProductSize', 'fiProductClassDesc', 'state',
       'ProductGroup', 'ProductGroupDesc', 'Drive_System', 'Enclosure',
       'Forks', 'Pad_Type', 'Ride_Control', 'Stick', 'Transmission',
       'Turbocharged', 'Blade_Extension', 'Blade_Width', 'Enclosure_Type',
       'Engine_Horsepower', 'Hydraulics', 'Pushblock', 'Ripper', 'Scarifier',
       'Tip_Control', 'Tire_Size', 'Coupler', 'Coupler_System',
       'Grouser_Tracks', 'Hydraulics_Flow', 'Track_Type',
       'Undercarriage_Pad_Width', 'Stick_Length', 'Thumb', 'Pattern_Changer',
       'Grouser_Type', 'Backhoe_Mounting', 'Blade_Type', 'Travel_Controls',
       'Differential_Type', 'Steering_Controls', 'saleYear', 'saleMonth',
       'saleWeek', 'saleDay', 'saleDayofweek', 'saleDayofyear',


This isn't enough. One more bit of feature engineering is needed: there are strings in the dataset (`'Low'`, `'High'`, etc.). FastAI has a function, `train_cats`, that automatically creates categorical variables for all string columns. Behind the scenes, each such column stores integers together with a mapping from those integers back to the original strings.

FastAI also has an `apply_cats` function, which reuses the training set's category mappings on the validation & test sets so the integer codes stay consistent.
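As a rough illustration of the idea, the sketch below does the same thing with plain pandas. The `_sketch` functions are hypothetical stand-ins written for these notes, not fastai's implementations, and the toy data is invented.

```python
import pandas as pd

def train_cats_sketch(df):
    """Turn every string column into an ordered pandas category (in place)."""
    for col in df.columns:
        if pd.api.types.is_string_dtype(df[col]):
            df[col] = df[col].astype('category').cat.as_ordered()

def apply_cats_sketch(df, trn):
    """Re-encode df's columns using the categories already learned on trn."""
    for col in df.columns:
        if col in trn.columns and isinstance(trn[col].dtype, pd.CategoricalDtype):
            df[col] = pd.Categorical(df[col], categories=trn[col].cat.categories,
                                     ordered=True)

train = pd.DataFrame({'UsageBand': ['Low', 'High', 'Medium'], 'YearMade': [2004, 1996, 2001]})
valid = pd.DataFrame({'UsageBand': ['Medium', 'Low'], 'YearMade': [2007, 2001]})

train_cats_sketch(train)
apply_cats_sketch(valid, train)

print(list(train['UsageBand'].cat.categories))
print(train['UsageBand'].cat.codes.tolist(), valid['UsageBand'].cat.codes.tolist())
```

Had `valid` been converted on its own, `'Low'` and `'Medium'` would have received different integer codes than in `train`, which is the mismatch that reusing the training mapping avoids.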

In [25]:
df_raw.head()

| | SalesID | SalePrice | MachineID | ModelID | datasource | auctioneerID | YearMade | MachineHoursCurrentMeter | UsageBand | fiModelDesc | ... | saleDay | saleDayofweek | saleDayofyear | saleIs_month_end | saleIs_month_start | saleIs_quarter_end | saleIs_quarter_start | saleIs_year_end | saleIs_year_start | saleElapsed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1139246 | 11.09741 | 999089 | 3157 | 121 | 3.0 | 2004 | 68.0 | Low | 521D | ... | 16 | 3 | 320 | False | False | False | False | False | False | 1163635200 |
| 1 | 1139248 | 10.950807 | 117657 | 77 | 121 | 3.0 | 1996 | 4640.0 | Low | 950FII | ... | 26 | 4 | 86 | False | False | False | False | False | False | 1080259200 |
| 2 | 1139249 | 9.21034 | 434808 | 7009 | 121 | 3.0 | 2001 | 2838.0 | High | 226 | ... | 26 | 3 | 57 | False | False | False | False | False | False | 1077753600 |
| 3 | 1139251 | 10.558414 | 1026470 | 332 | 121 | 3.0 | 2001 | 3486.0 | High | PC120-6E | ... | 19 | 3 | 139 | False | False | False | False | False | False | 1305763200 |
| 4 | 1139253 | 9.305651 | 1057373 | 17311 | 121 | 3.0 | 2007 | 722.0 | Medium | S175 | ... | 23 | 3 | 204 | False | False | False | False | False | False | 1248307200 |


In [28]:
train_cats(df_raw)

Now we can access categorical variables through the `.cat` attribute, just as we could with `.dt` for datetimes:

In [29]:
df_raw.UsageBand.cat.categories

Index(['High', 'Low', 'Medium'], dtype='object')

The ordering here is odd, and will have _some_ effect on our RF; a sensible ordering will give some improvement. The RF's Decision Trees split at single points (e.g. 'High' vs 'Low' and 'Medium').

`ordered=True` preserves the supplied order; `inplace=True` changes the DataFrame in place instead of returning a new one.

In [30]:
df_raw.UsageBand.cat.set_categories(['High','Medium','Low'], ordered=True, inplace=True)
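As far as I know, the `inplace` argument of `set_categories` was deprecated in later pandas releases and is no longer accepted in pandas 2.x, so on a newer install the call above may raise a `TypeError`. A sketch of the assign-back form, shown on a toy Series rather than `df_raw`:

```python
import pandas as pd

# Toy column with a missing value, standing in for df_raw.UsageBand.
usage = pd.Series(['Low', 'High', 'Medium', None], dtype='category')

# Assign the result back instead of relying on inplace=True.
usage = usage.cat.set_categories(['High', 'Medium', 'Low'], ordered=True)

print(list(usage.cat.categories))
print(usage.cat.codes.tolist())  # missing values get the code -1
```

On the real data the equivalent would be assigning the result back to `df_raw.UsageBand`. The integer codes now follow the supplied order, which is what the tree splits operate on.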

In [None]:
df_raw.describe()

# left off at [Lesson 1 -  59:38](https://youtu.be/CzdWqFTmn0Y?t=3578)