# Intro
This notebook adapts the `fastai` and `sklearn` libraries to build machine learning models on the Kaggle 'bulldozers' dataset.

The `fastai` implementation follows the documentation at https://docs.fast.ai/tutorial.tabular.html and https://docs.fast.ai/tabular.core.html

# Prepare

## Imports

In [None]:
%load_ext autoreload
# %autoreload 2
%matplotlib inline
# With `%autoreload 2` enabled (it is commented out above), edited modules are reloaded automatically before code runs.

In [None]:
from fastai.imports import *
from fastai.tabular import *
from fastai.tabular.core import * # does not work on fastai 2.0.17
from fastai.tabular.all import * # does not work on fastai 2.0.17

# from pandas_summary import DataFrameSummary
from sklearn.ensemble import RandomForestRegressor
from IPython.display import display

from sklearn import metrics

### If importing fastai fails, try installing version 2

If `fastai.tabular.core` or `fastai.tabular.all` raise import errors, you have the wrong version of the `fastai` library. Try uninstalling it and installing it again.
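Before reinstalling, it can help to check which major version is actually installed. The snippet below is a minimal sketch using only the standard library; `major_version` and `installed_major` are hypothetical helper names introduced here for illustration, not part of `fastai`.

```python
from importlib.metadata import PackageNotFoundError, version


def major_version(version_string):
    """Return the leading integer of a version string such as '2.0.17'."""
    return int(version_string.split(".")[0])


def installed_major(package):
    """Return the major version of an installed package, or None if it is missing."""
    try:
        return major_version(version(package))
    except PackageNotFoundError:
        return None


fastai_major = installed_major("fastai")
if fastai_major is None:
    print("fastai is not installed")
elif fastai_major < 2:
    print("fastai v1 found; fastai.tabular.core and fastai.tabular.all need v2")
else:
    print("fastai v2 or newer found")
```

In Colab the runtime usually has to be restarted after reinstalling a package so that the new version is the one that gets imported.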

In [None]:
# !pip install --upgrade fastai
!pip uninstall fastai 

Uninstalling fastai-2.0.17:
  Would remove:
    /usr/local/lib/python3.6/dist-packages/fastai-2.0.17.dist-info/*
    /usr/local/lib/python3.6/dist-packages/fastai/*
Proceed (y/n)? y
  Successfully uninstalled fastai-2.0.17


In [None]:
!pip install fastai #==2.0.17

Collecting fastai
  Downloading https://files.pythonhosted.org/packages/ff/53/da994550c0dd2962351fd694694e553afe0c9516c02251586790f830430b/fastai-2.1.8-py3-none-any.whl (189kB)

In [None]:
!ls /usr/local/lib/python3.6/dist-packages/fastai

basics.py	imports.py    losses.py     _pytorch_doc.py  torch_imports.py
callback	__init__.py   medical	    tabular	     vision
collab.py	interpret.py  metrics.py    test_utils.py
data		launch.py     _nbdev.py     text
distributed.py	layers.py     optimizer.py  torch_basics.py
fp16_utils.py	learner.py    __pycache__   torch_core.py


In [None]:
!cat /usr/local/lib/python3.6/dist-packages/fastai/version.py

__all__ = ['__version__']
__version__ = '1.0.61'


In version 2, the version number should appear in the file below.

In [None]:
!cat /usr/local/lib/python3.6/dist-packages/fastai/__init__.py

__version__ = "2.0.17"



In [None]:
!ls /usr/local/lib/python3.6/dist-packages/fastai/tabular/
# !head 20 /usr/local/lib/python3.6/dist-packages/fastai/tabular/

all.py	core.py  data.py  __init__.py  learner.py  model.py  __pycache__


### Mount Google Drive

In [None]:
# google drive if needed
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Load bulldozer data
You can download `Train.csv` from Kaggle and then upload it to Google Drive.

In [None]:
!ls  'drive/My Drive/Colab Notebooks/AIacademy/class-C/fast_ai/'

 data			       functions.py	        syntetic_data.ipynb
 fast-ai-random-forest.ipynb  'mnist fastai-v2.ipynb'   Test.csv
 fastai-v2-buldozers.ipynb    'Neural net.ipynb'        Train.csv


In [None]:
buldozer_path = 'drive/My Drive/Colab Notebooks/AIacademy/class-C/fast_ai/'

In [None]:
!head -2 'drive/My Drive/Colab Notebooks/AIacademy/class-C/fast_ai/Train.csv'

SalesID,SalePrice,MachineID,ModelID,datasource,auctioneerID,YearMade,MachineHoursCurrentMeter,UsageBand,saledate,fiModelDesc,fiBaseModel,fiSecondaryDesc,fiModelSeries,fiModelDescriptor,ProductSize,fiProductClassDesc,state,ProductGroup,ProductGroupDesc,Drive_System,Enclosure,Forks,Pad_Type,Ride_Control,Stick,Transmission,Turbocharged,Blade_Extension,Blade_Width,Enclosure_Type,Engine_Horsepower,Hydraulics,Pushblock,Ripper,Scarifier,Tip_Control,Tire_Size,Coupler,Coupler_System,Grouser_Tracks,Hydraulics_Flow,Track_Type,Undercarriage_Pad_Width,Stick_Length,Thumb,Pattern_Changer,Grouser_Type,Backhoe_Mounting,Blade_Type,Travel_Controls,Differential_Type,Steering_Controls
1139246,66000,999089,3157,121,3,2004,68,Low,11/16/2006 0:00,521D,521,D,,,,Wheel Loader - 110.0 to 120.0 Horsepower,Alabama,WL,Wheel Loader,,EROPS w AC,None or Unspecified,,None or Unspecified,,,,,,,,2 Valve,,,,,None or Unspecified,None or Unspecified,,,,,,,,,,,,,Standard,Conventional


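Load the CSV without optimizing memory usage through explicit column types and sizes; only the date columns are declared. Note that `YearMade` still shows up as `object` in the `dtypes` output below, so only `saledate` was actually parsed as a date.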
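For comparison, this is roughly what declaring column types at load time looks like. It is a sketch on a two-row in-memory sample shaped like the first rows of `Train.csv`; the chosen dtypes (`int32`, `category`) are illustrative assumptions, not tuned for the full file.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for Train.csv (only four of the columns).
sample_csv = io.StringIO(
    "SalesID,SalePrice,UsageBand,saledate\n"
    "1139246,66000,Low,11/16/2006 0:00\n"
    "1139248,57000,Low,3/26/2004 0:00\n"
)

df_small = pd.read_csv(
    sample_csv,
    # Narrower integer types and a categorical column reduce memory use.
    dtype={"SalesID": "int32", "SalePrice": "int32", "UsageBand": "category"},
    # Only genuine date columns are listed here.
    parse_dates=["saledate"],
)
print(df_small.dtypes)
```

On the full dataset the same `dtype` mapping would need an entry for every column that should be narrowed, which is why the notebook loads everything with defaults first and shrinks afterwards.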

In [None]:
%%time
df_raw = pd.read_csv(f'{buldozer_path}Train.csv', 
                     low_memory=False, parse_dates=["saledate","YearMade"])

CPU times: user 3.58 s, sys: 549 ms, total: 4.13 s
Wall time: 4.3 s


## First Exploration

In [None]:
df_raw.shape

(401125, 53)

In [None]:
df_raw.dtypes

SalesID                              int64
SalePrice                            int64
MachineID                            int64
ModelID                              int64
datasource                           int64
auctioneerID                       float64
YearMade                            object
MachineHoursCurrentMeter           float64
UsageBand                           object
saledate                    datetime64[ns]
fiModelDesc                         object
fiBaseModel                         object
fiSecondaryDesc                     object
fiModelSeries                       object
fiModelDescriptor                   object
ProductSize                         object
fiProductClassDesc                  object
state                               object
ProductGroup                        object
ProductGroupDesc                    object
Drive_System                        object
Enclosure                           object
Forks                               object
Pad_Type   

In [None]:
df_raw.dtypes.value_counts()

object            45
int64              5
float64            2
datetime64[ns]     1
dtype: int64

## First transformations

### Log-transform the target `SalePrice`

In [None]:
df_raw.SalePrice.iloc[:3]

0    66000
1    57000
2    10000
Name: SalePrice, dtype: int64

In [None]:
df_raw['SalePrice'] = np.log(df_raw.SalePrice)
df_raw.SalePrice.iloc[:3]

0    11.097410
1    10.950807
2     9.210340
Name: SalePrice, dtype: float64
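The transform is easy to undo when predictions need to be reported in the original price units. The sketch below uses the three prices shown above; a common motivation for the log scale is that errors then correspond to relative (ratio) differences rather than absolute ones.

```python
import numpy as np

# The first three sale prices from the output above.
prices = np.array([66000, 57000, 10000])

# Natural log, as applied to df_raw['SalePrice'].
log_prices = np.log(prices)

# np.exp inverts the transform, e.g. for reporting predictions in original units.
restored = np.exp(log_prices)

print(log_prices)
print(restored)
```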

### Shrink the dataframe with fastai `df_shrink`
Minimize column data types and convert `object` columns into categories (similar to `train_cats`).

In [None]:
# fastai.tabular.core
%time df = df_shrink(df_raw, skip=['YearMade','saledate'])

CPU times: user 964 ms, sys: 399 ms, total: 1.36 s
Wall time: 1.37 s
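To make the idea behind the shrinking step concrete, here is a rough pandas-only sketch: integers and floats are downcast to the smallest type that holds their values, and text columns become categories. The function `shrink` is a hypothetical stand-in written for illustration; it is not fastai's implementation and ignores options such as `skip`.

```python
import pandas as pd


def shrink(df):
    """Downcast numeric columns and turn text columns into categories."""
    out = df.copy()
    for name in out.columns:
        col = out[name]
        if pd.api.types.is_integer_dtype(col):
            out[name] = pd.to_numeric(col, downcast="integer")
        elif pd.api.types.is_float_dtype(col):
            out[name] = pd.to_numeric(col, downcast="float")
        elif pd.api.types.is_object_dtype(col) or pd.api.types.is_string_dtype(col):
            out[name] = col.astype("category")
    return out


demo = pd.DataFrame({
    "datasource": [121, 132, 136],
    "SalesID": [1139246, 1139248, 1139249],
    "MachineHoursCurrentMeter": [68.0, 4640.0, 2838.0],
    "UsageBand": ["Low", "Low", "High"],
})
small = shrink(demo)
print(small.dtypes)
```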


Below is a check of the column types.

In [None]:
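# df_raw.dtypes
# Note: DataFrame.value_counts() counts unique rows (rows containing NaN are dropped), not column types,
# hence the empty result; df_raw.dtypes.value_counts() gives the count of columns per type.
df_raw.value_counts()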

Series([], dtype: int64)

In [None]:
# df.dtypes
# Note: as above, df.dtypes.value_counts() is the call that counts columns per type.
df.value_counts()

Series([], dtype: int64)

### Find categorical ('object') columns for `fastai` processing

In [None]:
cont_names, cat_names = cont_cat_split(df_raw)
print(cont_names)
print(cat_names)

['SalesID', 'SalePrice', 'MachineID', 'ModelID', 'auctioneerID', 'MachineHoursCurrentMeter']
['datasource', 'YearMade', 'UsageBand', 'saledate', 'fiModelDesc', 'fiBaseModel', 'fiSecondaryDesc', 'fiModelSeries', 'fiModelDescriptor', 'ProductSize', 'fiProductClassDesc', 'state', 'ProductGroup', 'ProductGroupDesc', 'Drive_System', 'Enclosure', 'Forks', 'Pad_Type', 'Ride_Control', 'Stick', 'Transmission', 'Turbocharged', 'Blade_Extension', 'Blade_Width', 'Enclosure_Type', 'Engine_Horsepower', 'Hydraulics', 'Pushblock', 'Ripper', 'Scarifier', 'Tip_Control', 'Tire_Size', 'Coupler', 'Coupler_System', 'Grouser_Tracks', 'Hydraulics_Flow', 'Track_Type', 'Undercarriage_Pad_Width', 'Stick_Length', 'Thumb', 'Pattern_Changer', 'Grouser_Type', 'Backhoe_Mounting', 'Blade_Type', 'Travel_Controls', 'Differential_Type', 'Steering_Controls']
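Note that `SalePrice` appears among the continuous names above, so the target would end up among the model inputs. As far as I know, fastai's `cont_cat_split` accepts a `dep_var` argument to leave the target out. The sketch below shows the same idea in plain pandas; `split_cont_cat` is a hypothetical helper that only approximates the rule (floats and high-cardinality integers are continuous, everything else categorical).

```python
import pandas as pd


def split_cont_cat(df, dep_var, max_card=20):
    """Split column names into continuous and categorical, skipping the target."""
    cont_names, cat_names = [], []
    for name in df.columns:
        if name == dep_var:
            continue  # never treat the target as a feature
        col = df[name]
        is_big_int = pd.api.types.is_integer_dtype(col) and col.nunique() > max_card
        if is_big_int or pd.api.types.is_float_dtype(col):
            cont_names.append(name)
        else:
            cat_names.append(name)
    return cont_names, cat_names


demo = pd.DataFrame({
    "SalesID": [1, 2, 3, 4],
    "SalePrice": [11.1, 10.9, 9.2, 10.6],
    "datasource": [121, 121, 132, 132],
    "UsageBand": ["Low", "Low", "High", "Medium"],
})
# max_card is lowered only because the demo frame is tiny.
cont_demo, cat_demo = split_cont_cat(demo, dep_var="SalePrice", max_card=2)
print(cont_demo)
print(cat_demo)
```

A target that leaks into the features typically produces validation errors that look too good, so it is worth checking the feature lists before training.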


`Normalize` is left out because we train a random forest rather than a neural net, and tree-based models do not depend on feature scaling.

In [None]:
procs = [Categorify, FillMissing] #, Normalize]
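As a rough picture of what these two processors do, the pandas sketch below encodes a categorical column as integer codes (with 0 reserved for missing values) and fills a numeric column with its median while recording where values were missing. It is an approximation written for illustration, not fastai's code; the column values are made up.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    "UsageBand": ["Low", None, "High", "Low"],
    "auctioneerID": [3.0, np.nan, 1.0, 2.0],
})

# Categorify-like step: categories become integers; missing values map to 0
# because pandas uses code -1 for NaN and we shift all codes by one.
usage_codes = df_demo["UsageBand"].astype("category").cat.codes + 1

# FillMissing-like step: remember which rows were missing, then fill with the median.
auctioneer_na = df_demo["auctioneerID"].isna()
auctioneer_filled = df_demo["auctioneerID"].fillna(df_demo["auctioneerID"].median())

print(usage_codes.tolist())
print(auctioneer_na.tolist())
print(auctioneer_filled.tolist())
```

This matches the `auctioneerID_na` and `MachineHoursCurrentMeter_na` indicator columns that show up in the processed data further down.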

# Data splits

## Manual split: dependent / independent

In [None]:
df_train, df_test = df.iloc[:300000].copy(), df.iloc[300000:].copy()

In [None]:
df_test.SalePrice.iloc[:3]

300000    10.434115
300001     9.798127
300002    10.146434
Name: SalePrice, dtype: float32

In [None]:
Y_train = df_train.SalePrice
x_train = df_train.drop('SalePrice', axis=1)

In [None]:
Y_test = df_test.SalePrice
x_test = df_test.drop('SalePrice', axis=1)

In [None]:
Y_train.head(3)

0    11.097410
1    10.950807
2     9.210340
Name: SalePrice, dtype: float32

## SKlearn split

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
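# TODO need to test
# Random split of the shrunk dataframe; features are all columns except the target.
train_X, val_X, train_y, val_y = train_test_split(
    df.drop('SalePrice', axis=1), df.SalePrice, random_state=0)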

In [None]:
type(df_train)

pandas.core.frame.DataFrame

In [None]:
df_train.head(3)

Unnamed: 0,SalesID,SalePrice,MachineID,ModelID,datasource,auctioneerID,YearMade,MachineHoursCurrentMeter,UsageBand,saledate,fiModelDesc,fiBaseModel,fiSecondaryDesc,fiModelSeries,fiModelDescriptor,ProductSize,fiProductClassDesc,state,ProductGroup,ProductGroupDesc,Drive_System,Enclosure,Forks,Pad_Type,Ride_Control,Stick,Transmission,Turbocharged,Blade_Extension,Blade_Width,Enclosure_Type,Engine_Horsepower,Hydraulics,Pushblock,Ripper,Scarifier,Tip_Control,Tire_Size,Coupler,Coupler_System,Grouser_Tracks,Hydraulics_Flow,Track_Type,Undercarriage_Pad_Width,Stick_Length,Thumb,Pattern_Changer,Grouser_Type,Backhoe_Mounting,Blade_Type,Travel_Controls,Differential_Type,Steering_Controls
0,1139246,11.09741,999089,3157,121,3.0,2004,68.0,Low,2006-11-16,521D,521,D,,,,Wheel Loader - 110.0 to 120.0 Horsepower,Alabama,WL,Wheel Loader,,EROPS w AC,None or Unspecified,,None or Unspecified,,,,,,,,2 Valve,,,,,None or Unspecified,None or Unspecified,,,,,,,,,,,,,Standard,Conventional
1,1139248,10.950807,117657,77,121,3.0,1996,4640.0,Low,2004-03-26,950FII,950,F,II,,Medium,Wheel Loader - 150.0 to 175.0 Horsepower,North Carolina,WL,Wheel Loader,,EROPS w AC,None or Unspecified,,None or Unspecified,,,,,,,,2 Valve,,,,,23.5,None or Unspecified,,,,,,,,,,,,,Standard,Conventional
2,1139249,9.21034,434808,7009,121,3.0,2001,2838.0,High,2004-02-26,226,226,,,,,Skid Steer Loader - 1351.0 to 1601.0 Lb Operating Capacity,New York,SSL,Skid Steer Loaders,,OROPS,None or Unspecified,,,,,,,,,,Auxiliary,,,,,,None or Unspecified,None or Unspecified,None or Unspecified,Standard,,,,,,,,,,,


In [None]:
df_test.drop(['SalePrice'], axis=1, inplace=True)

In [None]:
df_test.head(3)

Unnamed: 0,SalesID,MachineID,ModelID,datasource,auctioneerID,YearMade,MachineHoursCurrentMeter,UsageBand,saledate,fiModelDesc,fiBaseModel,fiSecondaryDesc,fiModelSeries,fiModelDescriptor,ProductSize,fiProductClassDesc,state,ProductGroup,ProductGroupDesc,Drive_System,Enclosure,Forks,Pad_Type,Ride_Control,Stick,Transmission,Turbocharged,Blade_Extension,Blade_Width,Enclosure_Type,Engine_Horsepower,Hydraulics,Pushblock,Ripper,Scarifier,Tip_Control,Tire_Size,Coupler,Coupler_System,Grouser_Tracks,Hydraulics_Flow,Track_Type,Undercarriage_Pad_Width,Stick_Length,Thumb,Pattern_Changer,Grouser_Type,Backhoe_Mounting,Blade_Type,Travel_Controls,Differential_Type,Steering_Controls
300000,2241104,285987,13776,136,,2000,0.0,,2007-05-11,D3CIIIXL,D3,C,III,XL,,"Track Type Tractor, Dozer - 20.0 to 75.0 Horsepower",Virginia,TTT,Track Type Tractors,,OROPS,,,,,Hydrostatic,,,,,,2 Valve,,None or Unspecified,,,,,,,,,,,,,,None or Unspecified,None or Unspecified,None or Unspecified,,
300001,2241111,115917,13776,136,1.0,2001,0.0,,2010-06-30,D3CIIIXL,D3,C,III,XL,,"Track Type Tractor, Dozer - 20.0 to 75.0 Horsepower",North Carolina,TTT,Track Type Tractors,,OROPS,,,,,Standard,,,,,,2 Valve,,None or Unspecified,,,,,,,,,,,,,,None or Unspecified,PAT,None or Unspecified,,
300002,2241112,1714094,13776,136,,1999,0.0,,2007-07-14,D3CIIIXL,D3,C,III,XL,,"Track Type Tractor, Dozer - 20.0 to 75.0 Horsepower",New York,TTT,Track Type Tractors,,OROPS,,,,,Standard,,,,,,2 Valve,,None or Unspecified,,,,,,,,,,,,,,None or Unspecified,PAT,None or Unspecified,,


## Fast AI

In [None]:
# splits = RandomSplitter()(range_of(df_train))
splits = RandomSplitter(valid_pct=0.2)(range_of(df_raw))

NameError: ignored

In [None]:
type(splits)

tuple

In [None]:
splits[:3]

((#320900) [26011,75166,37736,142749,33174,389369,393453,167697,102512,30346...],
 (#80225) [197738,278728,196100,62413,157552,309314,164957,92725,310410,393560...])

In [None]:
df_train.dtypes.value_counts()

object            45
int64              5
float64            2
datetime64[ns]     1
dtype: int64

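### `TabularPandas` object (`to`)

fastai provides several splitters. Here I use `TrainTestSplitter`, but I am not sure that it puts the most recent rows (sorted by date) into the validation set; the shuffled indices in the output below suggest a random split.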

In [None]:
splits = TrainTestSplitter(test_size=0.3)(range_of(df))

In [None]:
splits[0]

(#280787) [150144,163729,392636,91666,258321,361691,307627,377446,237386,89548...]
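If the validation set should contain the most recent sales, one option is to build the split by hand from `saledate`. The sketch below uses a hypothetical helper, `split_by_date`, that returns two lists of positional indices; my understanding is that a tuple of index lists like this can be passed as `splits` to `TabularPandas`, but that is an assumption to verify. The dates are the first five `saledate` values shown in this notebook.

```python
import numpy as np
import pandas as pd


def split_by_date(df, date_col, valid_frac=0.3):
    """Return (train_idx, valid_idx); the newest rows go to validation."""
    # Positional indices of the rows, ordered from oldest to newest.
    order = np.argsort(df[date_col].to_numpy(), kind="stable")
    n_valid = int(round(len(df) * valid_frac))
    cut = len(df) - n_valid
    return order[:cut].tolist(), order[cut:].tolist()


demo = pd.DataFrame({
    "saledate": pd.to_datetime(
        ["2006-11-16", "2004-03-26", "2004-02-26", "2011-05-19", "2009-07-23"]
    ),
})
train_idx, valid_idx = split_by_date(demo, "saledate", valid_frac=0.4)
print(train_idx, valid_idx)
```

A date-based split mimics predicting future auctions from past ones, which a random split does not.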

In [None]:
%%time
to = TabularPandas(df, procs, cat_names, cont_names, y_names="SalePrice", splits=splits)

CPU times: user 1.59 s, sys: 90 ms, total: 1.68 s
Wall time: 1.7 s


In [None]:
to.xs.iloc[:2]

Unnamed: 0,datasource,YearMade,UsageBand,saledate,fiModelDesc,fiBaseModel,fiSecondaryDesc,fiModelSeries,fiModelDescriptor,ProductSize,fiProductClassDesc,state,ProductGroup,ProductGroupDesc,Drive_System,Enclosure,Forks,Pad_Type,Ride_Control,Stick,Transmission,Turbocharged,Blade_Extension,Blade_Width,Enclosure_Type,Engine_Horsepower,Hydraulics,Pushblock,Ripper,Scarifier,Tip_Control,Tire_Size,Coupler,Coupler_System,Grouser_Tracks,Hydraulics_Flow,Track_Type,Undercarriage_Pad_Width,Stick_Length,Thumb,Pattern_Changer,Grouser_Type,Backhoe_Mounting,Blade_Type,Travel_Controls,Differential_Type,Steering_Controls,auctioneerID_na,MachineHoursCurrentMeter_na,SalesID,SalePrice,MachineID,ModelID,auctioneerID,MachineHoursCurrentMeter
150144,2,53,0,2676,557,186,0,0,64,3,14,23,4,4,0,3,0,0,0,0,0,0,0,0,0,0,4,0,0,0,0,0,2,0,0,0,2,19,27,3,2,1,0,0,0,0,0,1,2,1524231,10.373491,1427198,1177,1.0,0.0
163729,2,40,0,3157,1242,368,15,0,0,0,35,24,2,2,3,1,0,0,0,0,5,0,1,6,3,1,5,1,2,2,2,17,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,2,1586430,9.350102,1527409,4777,8.0,0.0


In [None]:
to.train

        SalesID  SalePrice  ...  auctioneerID_na  MachineHoursCurrentMeter_na
150144  1524231  10.373491  ...                1                            2
163729  1586430   9.350102  ...                1                            2
392636  6257981  10.985292  ...                1                            2
91666   1401514  11.170435  ...                1                            2
258321  1790589   9.104980  ...                1                            2
...         ...        ...  ...              ...                          ...
190089  1626612  10.518673  ...                1                            2
387807  4353671  11.434964  ...                1                            1
397366  6286309  10.146434  ...                1                            2
21988   1215306  10.596635  ...                1                            1
139433  1500926   9.546813  ...                1                            2

[280787 rows x 55 columns]

In [None]:
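# Note: without parentheses this shows the bound method; to.ys.describe() returns the summary statistics.
to.ys.describe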

<bound method NDFrame.describe of         SalePrice
368515  -0.851633
232517  -0.358788
355574  -1.180124
180215   0.147972
288734  -0.799185
...           ...
27327   -1.518793
364904  -0.436759
307504   1.358389
103274   0.947356
225984  -0.652403

[401125 rows x 1 columns]>

In [None]:
dls = to.dataloaders()
dls.valid.show_batch()
# type(to)

Unnamed: 0,datasource,YearMade,UsageBand,saledate,fiModelDesc,fiBaseModel,fiSecondaryDesc,fiModelSeries,fiModelDescriptor,ProductSize,fiProductClassDesc,state,ProductGroup,ProductGroupDesc,Drive_System,Enclosure,Forks,Pad_Type,Ride_Control,Stick,Transmission,Turbocharged,Blade_Extension,Blade_Width,Enclosure_Type,Engine_Horsepower,Hydraulics,Pushblock,Ripper,Scarifier,Tip_Control,Tire_Size,Coupler,Coupler_System,Grouser_Tracks,Hydraulics_Flow,Track_Type,Undercarriage_Pad_Width,Stick_Length,Thumb,Pattern_Changer,Grouser_Type,Backhoe_Mounting,Blade_Type,Travel_Controls,Differential_Type,Steering_Controls,auctioneerID_na,MachineHoursCurrentMeter_na,SalesID,SalePrice,SalePrice.1,MachineID,ModelID,auctioneerID,MachineHoursCurrentMeter,SalePrice.2,SalePrice.3
0,132,2003,#na#,2008-06-19,210LE,210,LE,#na#,#na#,Compact,Wheel Loader - 60.0 to 80.0 Horsepower,Arizona,WL,Wheel Loader,#na#,OROPS,None or Unspecified,#na#,None or Unspecified,#na#,#na#,#na#,#na#,#na#,#na#,#na#,2 Valve,#na#,#na#,#na#,#na#,None or Unspecified,None or Unspecified,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,Standard,Conventional,False,True,1636008.0,16000.0,16000.0,1465800.0,4579.0,1.0,0.0,16000.0,16000.0
1,132,1000,#na#,1999-05-04,C14B,C14,B,#na#,#na#,Small,"Hydraulic Excavator, Track - 16.0 to 19.0 Metric Tons",Wisconsin,TEX,Track Excavators,#na#,EROPS,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,Standard,#na#,#na#,#na#,#na#,#na#,None or Unspecified,#na#,#na#,#na#,Steel,30 inch,None or Unspecified,None or Unspecified,None or Unspecified,Double,#na#,#na#,#na#,#na#,#na#,False,True,1844791.0,9000.0,9000.0,1479368.0,5847.0,6.0,0.0,9000.0,9000.0
2,132,1999,#na#,2006-08-30,653G,653,G,#na#,#na#,Large / Medium,"Hydraulic Excavator, Track - 19.0 to 21.0 Metric Tons",Alabama,TEX,Track Excavators,#na#,EROPS,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,Standard,#na#,#na#,#na#,#na#,#na#,None or Unspecified,#na#,#na#,#na#,Steel,None or Unspecified,None or Unspecified,None or Unspecified,None or Unspecified,Double,#na#,#na#,#na#,#na#,#na#,False,True,1634006.0,33000.0,33000.0,1423637.0,4768.0,7.0,0.0,33000.0,33000.0
3,132,1983,#na#,1999-02-02,D4E,D4,E,#na#,#na#,#na#,"Track Type Tractor, Dozer - 75.0 to 85.0 Horsepower",Florida,TTT,Track Type Tractors,#na#,OROPS,#na#,#na#,#na#,#na#,Standard,#na#,#na#,#na#,#na#,#na#,2 Valve,#na#,None or Unspecified,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,None or Unspecified,None or Unspecified,None or Unspecified,#na#,#na#,False,True,1328622.0,16000.0,16000.0,1148391.0,4104.0,2.0,0.0,16000.0,16000.0
4,132,2006,#na#,2009-02-05,420D,420,D,#na#,#na#,#na#,Backhoe Loader - 14.0 to 15.0 Ft Standard Digging Depth,Florida,BL,Backhoe Loaders,Four Wheel Drive,OROPS,None or Unspecified,None or Unspecified,No,Standard,Standard,None or Unspecified,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,False,True,1560601.0,40000.0,40000.0,1158903.0,3542.0,2.0,0.0,40000.0,40000.0
5,136,1000,#na#,2010-06-26,D4CIIILGP,D4,C,III,LGP,#na#,"Track Type Tractor, Dozer - 75.0 to 85.0 Horsepower",Alaska,TTT,Track Type Tractors,#na#,OROPS,#na#,#na#,#na#,#na#,Standard,#na#,#na#,#na#,#na#,#na#,2 Valve,#na#,None or Unspecified,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,None or Unspecified,PAT,None or Unspecified,#na#,#na#,False,False,2264019.0,17000.0,17000.0,632283.0,1545.0,1.0,0.0,17000.0,17000.0
6,132,2006,High,2010-12-09,750J,750,J,#na#,#na#,Medium,"Track Type Tractor, Dozer - 130.0 to 160.0 Horsepower",Georgia,TTT,Track Type Tractors,#na#,EROPS w AC,#na#,#na#,#na#,#na#,Standard,#na#,#na#,#na#,#na#,#na#,2 Valve,#na#,None or Unspecified,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,None or Unspecified,PAT,None or Unspecified,#na#,#na#,False,False,1588576.0,85000.0,85000.0,1143493.0,11406.0,1.0,4760.0,85000.0,85000.0
7,132,1986,#na#,2000-04-08,416,416,#na#,#na#,#na#,#na#,Backhoe Loader - 14.0 to 15.0 Ft Standard Digging Depth,New York,BL,Backhoe Loaders,Four Wheel Drive,EROPS,None or Unspecified,None or Unspecified,No,Extended,Standard,None or Unspecified,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,False,True,1403929.0,15500.0,15500.0,1389949.0,7110.0,2.0,0.0,15500.0,15500.0
8,136,1992,#na#,2008-06-11,936F,936,F,#na#,#na#,Medium,Wheel Loader - 135.0 to 150.0 Horsepower,Texas,WL,Wheel Loader,#na#,EROPS,Yes,#na#,None or Unspecified,#na#,#na#,#na#,#na#,#na#,#na#,#na#,3 Valve,#na#,#na#,#na#,#na#,20.5,None or Unspecified,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,Standard,Conventional,False,False,2268888.0,31000.0,31000.0,1557424.0,3804.0,1.0,0.0,31000.0,31000.0
9,149,1985,#na#,2011-10-18,992C,992,C,#na#,#na#,Large,Wheel Loader - 500.0 to 1000.0 Horsepower,Maryland,WL,Wheel Loader,#na#,EROPS w AC,None or Unspecified,#na#,None or Unspecified,#na#,#na#,#na#,#na#,#na#,#na#,#na#,2 Valve,#na#,#na#,#na#,#na#,None or Unspecified,None or Unspecified,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,Standard,Conventional,False,True,6263977.0,53000.0,53000.0,1925412.0,3893.0,1.0,0.0,53000.0,53000.0


In [None]:
to.show()

Unnamed: 0,datasource,UsageBand,saledate,fiModelDesc,fiBaseModel,fiSecondaryDesc,fiModelSeries,fiModelDescriptor,ProductSize,fiProductClassDesc,state,ProductGroup,ProductGroupDesc,Drive_System,Enclosure,Forks,Pad_Type,Ride_Control,Stick,Transmission,Turbocharged,Blade_Extension,Blade_Width,Enclosure_Type,Engine_Horsepower,Hydraulics,Pushblock,Ripper,Scarifier,Tip_Control,Tire_Size,Coupler,Coupler_System,Grouser_Tracks,Hydraulics_Flow,Track_Type,Undercarriage_Pad_Width,Stick_Length,Thumb,Pattern_Changer,Grouser_Type,Backhoe_Mounting,Blade_Type,Travel_Controls,Differential_Type,Steering_Controls,auctioneerID_na,MachineHoursCurrentMeter_na,SalesID,SalePrice,MachineID,ModelID,auctioneerID,YearMade,MachineHoursCurrentMeter,SalePrice.1
0,121,Low,2006-11-16,521D,521,D,#na#,#na#,#na#,Wheel Loader - 110.0 to 120.0 Horsepower,Alabama,WL,Wheel Loader,#na#,EROPS w AC,None or Unspecified,#na#,None or Unspecified,#na#,#na#,#na#,#na#,#na#,#na#,#na#,2 Valve,#na#,#na#,#na#,#na#,None or Unspecified,None or Unspecified,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,Standard,Conventional,False,False,1139246,66000,999089,3157,3.0,2004,68.0,66000
1,121,Low,2004-03-26,950FII,950,F,II,#na#,Medium,Wheel Loader - 150.0 to 175.0 Horsepower,North Carolina,WL,Wheel Loader,#na#,EROPS w AC,None or Unspecified,#na#,None or Unspecified,#na#,#na#,#na#,#na#,#na#,#na#,#na#,2 Valve,#na#,#na#,#na#,#na#,23.5,None or Unspecified,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,Standard,Conventional,False,False,1139248,57000,117657,77,3.0,1996,4640.0,57000
2,121,High,2004-02-26,226,226,#na#,#na#,#na#,#na#,Skid Steer Loader - 1351.0 to 1601.0 Lb Operating Capacity,New York,SSL,Skid Steer Loaders,#na#,OROPS,None or Unspecified,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,Auxiliary,#na#,#na#,#na#,#na#,#na#,None or Unspecified,None or Unspecified,None or Unspecified,Standard,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,False,False,1139249,10000,434808,7009,3.0,2001,2838.0,10000
3,121,High,2011-05-19,PC120-6E,PC120,#na#,-6E,#na#,Small,"Hydraulic Excavator, Track - 12.0 to 14.0 Metric Tons",Texas,TEX,Track Excavators,#na#,EROPS w AC,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,2 Valve,#na#,#na#,#na#,#na#,#na#,None or Unspecified,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,False,False,1139251,38500,1026470,332,3.0,2001,3486.0,38500
4,121,Medium,2009-07-23,S175,S175,#na#,#na#,#na#,#na#,Skid Steer Loader - 1601.0 to 1751.0 Lb Operating Capacity,New York,SSL,Skid Steer Loaders,#na#,EROPS,None or Unspecified,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,Auxiliary,#na#,#na#,#na#,#na#,#na#,None or Unspecified,None or Unspecified,None or Unspecified,Standard,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,False,False,1139253,11000,1057373,17311,3.0,2007,722.0,11000
5,121,Low,2008-12-18,310G,310,G,#na#,#na#,#na#,Backhoe Loader - 14.0 to 15.0 Ft Standard Digging Depth,Arizona,BL,Backhoe Loaders,Four Wheel Drive,OROPS,None or Unspecified,None or Unspecified,No,Extended,Powershuttle,None or Unspecified,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,False,False,1139255,26500,1001274,4605,3.0,2004,508.0,26500
6,121,High,2004-08-26,790ELC,790,E,#na#,LC,Large / Medium,"Hydraulic Excavator, Track - 21.0 to 24.0 Metric Tons",Florida,TEX,Track Excavators,#na#,EROPS,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,Standard,#na#,#na#,#na#,#na#,#na#,None or Unspecified,#na#,#na#,#na#,Steel,None or Unspecified,None or Unspecified,None or Unspecified,None or Unspecified,Double,#na#,#na#,#na#,#na#,#na#,False,False,1139256,21000,772701,1937,3.0,1993,11540.0,21000
7,121,High,2005-11-17,416D,416,D,#na#,#na#,#na#,Backhoe Loader - 14.0 to 15.0 Ft Standard Digging Depth,Illinois,BL,Backhoe Loaders,Four Wheel Drive,OROPS,None or Unspecified,Reversible,No,Standard,Standard,Yes,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,False,False,1139261,27000,902002,3539,3.0,2001,4883.0,27000
8,121,Low,2009-08-27,430HAG,430,HAG,#na#,#na#,Mini,"Hydraulic Excavator, Track - 3.0 to 4.0 Metric Tons",Texas,TEX,Track Excavators,#na#,EROPS,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,Auxiliary,#na#,#na#,#na#,#na#,#na#,Manual,#na#,#na#,#na#,Rubber,None or Unspecified,None or Unspecified,None or Unspecified,None or Unspecified,Double,#na#,#na#,#na#,#na#,#na#,False,False,1139272,21500,1036251,36003,3.0,2008,302.0,21500
9,121,Medium,2007-08-09,988B,988,B,#na#,#na#,Large,Wheel Loader - 350.0 to 500.0 Horsepower,Florida,WL,Wheel Loader,#na#,EROPS w AC,None or Unspecified,#na#,None or Unspecified,#na#,#na#,#na#,#na#,#na#,#na#,#na#,2 Valve,#na#,#na#,#na#,#na#,None or Unspecified,None or Unspecified,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,#na#,Standard,Conventional,False,False,1139275,65000,1016474,3883,3.0,1000,20700.0,65000


In [None]:
# decode categories
row = to.items.iloc[0]
to.decode_row(row)

In [None]:
# row, clas, probs = learn.predict(df.iloc[0])

NameError: ignored

# Train

## SKlearn `RandomForestRegressor` Model definition



### First model

In [None]:
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(n_estimators=20, max_features=0.3, random_state=0)

### Custom `fit`

In [None]:
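%%time
# This fails with a ValueError, most likely because x_train still holds raw
# categorical and datetime columns that the forest cannot convert to numbers.
model.fit(x_train, Y_train)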

ValueError: ignored
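The error can be reproduced on a toy frame: a forest cannot be fitted on text columns, but it trains once they are turned into integer codes. The data below is made up for illustration, and plain category codes are only one possible encoding.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

X_demo = pd.DataFrame({
    "UsageBand": ["Low", "High", "Low", "Medium"],
    "YearMade": [2004, 1996, 2001, 2007],
})
y_demo = [11.1, 10.9, 9.2, 9.3]

rf = RandomForestRegressor(n_estimators=5, random_state=0)

# Fitting on the raw text column raises a ValueError.
try:
    rf.fit(X_demo, y_demo)
    raw_fit_failed = False
except ValueError:
    raw_fit_failed = True

# After encoding the text column as integer codes the fit succeeds.
X_encoded = X_demo.copy()
X_encoded["UsageBand"] = X_encoded["UsageBand"].astype("category").cat.codes
rf.fit(X_encoded, y_demo)
demo_preds = rf.predict(X_encoded)
print(raw_fit_failed, demo_preds.shape)
```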

In [None]:
# Predict on the held-out features from the manual split (requires the fit above to succeed).
predictions = model.predict(x_test)

### Fast AI `fit`

In [None]:
X_train_f, y_train_f = to.train.xs, to.train.ys.values.ravel()
X_test_f, y_test_f = to.valid.xs, to.valid.ys.values.ravel()

In [None]:
%%time
model.fit(X_train_f, y_train_f)

CPU times: user 28.7 s, sys: 11 ms, total: 28.8 s
Wall time: 28.8 s


RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features=0.3, max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=20, n_jobs=None, oob_score=False,
                      random_state=0, verbose=0, warm_start=False)

# Evaluate

## SKlearn custom

In [None]:
from sklearn import metrics
from sklearn.metrics import mean_absolute_error

In [None]:
# get predicted prices on validation data
predicted = model.predict(X_test_f)
print(mean_absolute_error(y_test_f, predicted))

0.011748369689584143


In [None]:
type(predicted)

numpy.ndarray

In [None]:
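# fastai's rmse metric is probably meant for torch tensors, not numpy arrays, hence the error below.
rmse(predicted, y_test_f)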

AttributeError: ignored
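For numpy arrays such as `predicted` and `y_test_f`, the root mean squared error can be computed directly. The helper names `rmse_np` and `mae_np` are introduced here for illustration; the numbers in the demo are made up.

```python
import numpy as np


def rmse_np(pred, target):
    """Root mean squared error between two array-likes of equal length."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    return float(np.sqrt(np.mean((pred - target) ** 2)))


def mae_np(pred, target):
    """Mean absolute error between two array-likes of equal length."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    return float(np.mean(np.abs(pred - target)))


demo_rmse = rmse_np([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
demo_mae = mae_np([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
print(demo_rmse, demo_mae)
```

Because `SalePrice` was log-transformed earlier, an RMSE computed on these values is an error on the log scale.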

In [None]:
# Classification metrics, kept for reference only; they do not apply to this regression task.
# print("Accuracy: %f" % lgs.score(X_test_f, y_test_f))  #checking the accuracy
# print("Precision: %f" % metrics.precision_score(y_test_f, predicted))
# print("Recall: %f" % metrics.recall_score(y_test_f, predicted))
# print("F1-Score: %f" % metrics.f1_score(y_test_f, predicted))
# print("AUC: %f" % metrics.auc(fpr, tpr))

ValueError: ignored

# Pipeline

## Sklearn Imports

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestRegressor

## Processing

In [None]:
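# Column lists (assumption: selected by dtype from the manually split training frame x_train)
numerical_cols = x_train.select_dtypes(include='number').columns.tolist()
categorical_cols = x_train.select_dtypes(include=['object', 'category']).columns.tolist()

# Preprocessing for numerical data: fill missing values with a constant (0 by default)
numerical_transformer = SimpleImputer(strategy='constant')

# Preprocessing for categorical data: fill with the most frequent value, then one-hot encode
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])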

In [None]:
model = RandomForestRegressor(n_estimators=20, random_state=0)

In [None]:
from sklearn.metrics import mean_absolute_error

# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', model)
                             ])

# Preprocessing of training data, fit model (using the manual split defined earlier)
my_pipeline.fit(x_train, Y_train)

# Preprocessing of validation data, get predictions
preds = my_pipeline.predict(x_test)

# Evaluate the model
score = mean_absolute_error(Y_test, preds)
print('MAE:', score)
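The whole pipeline can be tried end to end on a small made-up frame before running it on the full dataset. The sketch below mirrors the steps above; the column names are borrowed from the bulldozers data, while the values, the 6/2 split and the names `toy_pipeline`, `toy_preds` and `toy_score` are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up data with one numeric and one categorical column, both with a gap.
X_toy = pd.DataFrame({
    "YearMade": [2004, 1996, np.nan, 2007, 2001, 1999, 2003, 1995],
    "UsageBand": pd.Series(
        ["Low", "High", np.nan, "Medium", "Low", "High", "Low", "Medium"],
        dtype=object,
    ),
})
y_toy = np.array([11.1, 10.9, 9.2, 10.6, 9.3, 10.2, 10.9, 9.9])

# First six rows for fitting, last two for validation.
X_fit, y_fit = X_toy.iloc[:6], y_toy[:6]
X_val, y_val = X_toy.iloc[6:], y_toy[6:]

toy_preprocessor = ColumnTransformer(transformers=[
    ("num", SimpleImputer(strategy="constant"), ["YearMade"]),
    ("cat", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["UsageBand"]),
])

toy_pipeline = Pipeline(steps=[
    ("preprocessor", toy_preprocessor),
    ("model", RandomForestRegressor(n_estimators=20, random_state=0)),
])

toy_pipeline.fit(X_fit, y_fit)
toy_preds = toy_pipeline.predict(X_val)
toy_score = mean_absolute_error(y_val, toy_preds)
print("MAE:", toy_score)
```

Keeping the imputers and the encoder inside the pipeline means they are fitted on the training rows only and then applied unchanged to the validation rows.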