# 🚜 Predicting the Sale Price of Bulldozers using Machine Learning 

In this notebook, we're going to go through an example machine learning project with the goal of predicting the sale price of bulldozers.

Since we're trying to predict a number, this kind of problem is known as a **regression problem**.

The data and evaluation metric we'll be using (root mean squared log error, or RMSLE) are from the [Kaggle Bluebook for Bulldozers competition](https://www.kaggle.com/c/bluebook-for-bulldozers/overview).


## 1. Problem Definition

For this dataset, the problem we're trying to solve, or better, the question we're trying to answer is:

> How well can we predict the future sale price of a bulldozer, given its characteristics and previous examples of how much similar bulldozers have been sold for?

## 2. Data

Looking at the [dataset from Kaggle](https://www.kaggle.com/c/bluebook-for-bulldozers/data), you can see it's a time series problem. This means there's a time attribute to the dataset.

In this case, it's historical sales data of bulldozers, including attributes such as model type, size and sale date.

**Bulldozer.csv** - Historical bulldozer sales examples up to 2012 (close to 400,000 examples with 50+ different attributes, including `SalePrice` which is the **target variable**).

## 3. Evaluation

For this problem, [Kaggle has set the evaluation metric to be root mean squared log error (RMSLE)](https://www.kaggle.com/c/bluebook-for-bulldozers/overview/evaluation). As with many regression evaluations, the goal will be to get this value as low as possible.

To see how well our model is doing, we'll calculate the RMSLE and then compare our results to others on the [Kaggle leaderboard](https://www.kaggle.com/c/bluebook-for-bulldozers/leaderboard).
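
To make the metric concrete, here is a minimal NumPy sketch of RMSLE, assuming the usual definition: the square root of the mean squared difference between `log(1 + prediction)` and `log(1 + actual)`. The function name `rmsle_by_hand` and the toy prices are illustrative and not part of the competition data.

```python
import numpy as np

def rmsle_by_hand(y_true, y_pred):
    """Root mean squared log error, using log1p so zero values are safe."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    log_diff = np.log1p(y_pred) - np.log1p(y_true)
    return float(np.sqrt(np.mean(log_diff ** 2)))

# Toy sale prices: a 10% overestimate costs about the same at any price level.
cheap = rmsle_by_hand([10_000], [11_000])
expensive = rmsle_by_hand([100_000], [110_000])
print(cheap, expensive)
```

Because the error is measured on a log scale, RMSLE penalises relative rather than absolute differences, which suits sale prices that span a wide range.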

## 4. Features

Features are different parts of the data. During this step, you'll want to start finding out what you can about the data.

One of the most common ways to do this is to create a **data dictionary**.

For this dataset, Kaggle provides a data dictionary which contains information about what each attribute of the dataset means. You can [download this file directly from the Kaggle competition page](https://www.kaggle.com/c/bluebook-for-bulldozers/download/Bnl6RAHA0enbg0UfAvGA%2Fversions%2FwBG4f35Q8mAbfkzwCeZn%2Ffiles%2FData%20Dictionary.xlsx) (account required) or view it on Google Sheets.
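
When a data dictionary needs to be supplemented, a rough per-column summary can be derived from the data itself. The sketch below is illustrative only: the helper `summarise_columns` and the three-row frame are made up, and the actual meaning of each attribute still has to come from Kaggle's file.

```python
import pandas as pd

def summarise_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Build a minimal per-column summary: dtype, missing count, distinct values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "n_unique": df.nunique(dropna=True),
    })

# Made-up rows using two column names from the dataset.
toy = pd.DataFrame({
    "SalePrice": [66000.0, 57000.0, 10000.0],
    "UsageBand": ["Low", "Low", None],
})
summary = summarise_columns(toy)
print(summary)
```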

With all of this being known, let's get started! 


### Importing the data and preparing it for modelling

In [1]:
import pandas as pd
import numpy as np

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import make_scorer
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error, mean_squared_log_error, root_mean_squared_log_error

In [2]:
data = pd.read_csv("./data/Bulldozer.csv", parse_dates= ["saledate"])
data



Unnamed: 0,SalesID,SalePrice,MachineID,ModelID,datasource,auctioneerID,YearMade,MachineHoursCurrentMeter,UsageBand,saledate,...,Undercarriage_Pad_Width,Stick_Length,Thumb,Pattern_Changer,Grouser_Type,Backhoe_Mounting,Blade_Type,Travel_Controls,Differential_Type,Steering_Controls
0,1139246,66000.0,999089,3157,121,3.0,2004,68.0,Low,2006-11-16,...,,,,,,,,,Standard,Conventional
1,1139248,57000.0,117657,77,121,3.0,1996,4640.0,Low,2004-03-26,...,,,,,,,,,Standard,Conventional
2,1139249,10000.0,434808,7009,121,3.0,2001,2838.0,High,2004-02-26,...,,,,,,,,,,
3,1139251,38500.0,1026470,332,121,3.0,2001,3486.0,High,2011-05-19,...,,,,,,,,,,
4,1139253,11000.0,1057373,17311,121,3.0,2007,722.0,Medium,2009-07-23,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
412693,6333344,10000.0,1919201,21435,149,2.0,2005,,,2012-03-07,...,None or Unspecified,None or Unspecified,None or Unspecified,None or Unspecified,Double,,,,,
412694,6333345,10500.0,1882122,21436,149,2.0,2005,,,2012-01-28,...,None or Unspecified,None or Unspecified,None or Unspecified,None or Unspecified,Double,,,,,
412695,6333347,12500.0,1944213,21435,149,2.0,2005,,,2012-01-28,...,None or Unspecified,None or Unspecified,None or Unspecified,None or Unspecified,Double,,,,,
412696,6333348,10000.0,1794518,21435,149,2.0,2006,,,2012-03-07,...,None or Unspecified,None or Unspecified,None or Unspecified,None or Unspecified,Double,,,,,
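
The `parse_dates` argument is what turns `saledate` into a datetime column rather than plain text. A small sketch with an in-memory CSV (the two rows are made up for illustration) shows the difference:

```python
import io
import pandas as pd

csv_text = "SalesID,saledate\n1,2006-11-16\n2,2004-03-26\n"

# Without parse_dates the column is read as text.
as_text = pd.read_csv(io.StringIO(csv_text))
# With parse_dates the column is converted to datetimes on load.
as_dates = pd.read_csv(io.StringIO(csv_text), parse_dates=["saledate"])

print(as_text["saledate"].dtype)
print(as_dates["saledate"].dtype)
```

Only the datetime version supports the `.dt` accessor used later for feature extraction.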


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 412698 entries, 0 to 412697
Data columns (total 53 columns):
 #   Column                    Non-Null Count   Dtype         
---  ------                    --------------   -----         
 0   SalesID                   412698 non-null  int64         
 1   SalePrice                 412698 non-null  float64       
 2   MachineID                 412698 non-null  int64         
 3   ModelID                   412698 non-null  int64         
 4   datasource                412698 non-null  int64         
 5   auctioneerID              392562 non-null  float64       
 6   YearMade                  412698 non-null  int64         
 7   MachineHoursCurrentMeter  147504 non-null  float64       
 8   UsageBand                 73670 non-null   object        
 9   saledate                  412698 non-null  datetime64[ns]
 10  fiModelDesc               412698 non-null  object        
 11  fiBaseModel               412698 non-null  object        
 12  fi

In [4]:
# Sort chronologically so the earliest sales come first
data.sort_values(by="saledate", inplace=True)
data.head()

Unnamed: 0,SalesID,SalePrice,MachineID,ModelID,datasource,auctioneerID,YearMade,MachineHoursCurrentMeter,UsageBand,saledate,...,Undercarriage_Pad_Width,Stick_Length,Thumb,Pattern_Changer,Grouser_Type,Backhoe_Mounting,Blade_Type,Travel_Controls,Differential_Type,Steering_Controls
205615,1646770,9500.0,1126363,8434,132,18.0,1974,,,1989-01-17,...,,,,,,None or Unspecified,Straight,None or Unspecified,,
274835,1821514,14000.0,1194089,10150,132,99.0,1980,,,1989-01-31,...,,,,,,,,,Standard,Conventional
141296,1505138,50000.0,1473654,4139,132,99.0,1978,,,1989-01-31,...,,,,,,None or Unspecified,Straight,None or Unspecified,,
212552,1671174,16000.0,1327630,8591,132,99.0,1980,,,1989-01-31,...,,,,,,,,,Standard,Conventional
62755,1329056,22000.0,1336053,4089,132,99.0,1984,,,1989-01-31,...,,,,,,None or Unspecified,PAT,Lever,,


In [5]:
data["sale_year"] = data.saledate.dt.year
data["sale_month"] = data.saledate.dt.month
data["sale_day"] = data.saledate.dt.day

data["day_of_year"] = data.saledate.dt.day_of_year
data["day_of_week"] = data.saledate.dt.day_of_week

data.head()

Unnamed: 0,SalesID,SalePrice,MachineID,ModelID,datasource,auctioneerID,YearMade,MachineHoursCurrentMeter,UsageBand,saledate,...,Backhoe_Mounting,Blade_Type,Travel_Controls,Differential_Type,Steering_Controls,sale_year,sale_month,sale_day,day_of_year,day_of_week
205615,1646770,9500.0,1126363,8434,132,18.0,1974,,,1989-01-17,...,None or Unspecified,Straight,None or Unspecified,,,1989,1,17,17,1
274835,1821514,14000.0,1194089,10150,132,99.0,1980,,,1989-01-31,...,,,,Standard,Conventional,1989,1,31,31,1
141296,1505138,50000.0,1473654,4139,132,99.0,1978,,,1989-01-31,...,None or Unspecified,Straight,None or Unspecified,,,1989,1,31,31,1
212552,1671174,16000.0,1327630,8591,132,99.0,1980,,,1989-01-31,...,,,,Standard,Conventional,1989,1,31,31,1
62755,1329056,22000.0,1336053,4089,132,99.0,1984,,,1989-01-31,...,None or Unspecified,PAT,Lever,,,1989,1,31,31,1
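
As a quick check of what each `.dt` accessor returns, here is a sketch on two dates that appear in the outputs above (17 January 1989 and 7 March 2012). The small frame itself is built by hand for illustration.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["1989-01-17", "2012-03-07"]))

features = pd.DataFrame({
    "sale_year": dates.dt.year,
    "sale_month": dates.dt.month,
    "sale_day": dates.dt.day,
    "day_of_year": dates.dt.day_of_year,
    "day_of_week": dates.dt.day_of_week,  # Monday is 0, Sunday is 6
})
print(features)
```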


In [6]:
data.drop("saledate", axis=1, inplace=True)
data.head()

Unnamed: 0,SalesID,SalePrice,MachineID,ModelID,datasource,auctioneerID,YearMade,MachineHoursCurrentMeter,UsageBand,fiModelDesc,...,Backhoe_Mounting,Blade_Type,Travel_Controls,Differential_Type,Steering_Controls,sale_year,sale_month,sale_day,day_of_year,day_of_week
205615,1646770,9500.0,1126363,8434,132,18.0,1974,,,TD20,...,None or Unspecified,Straight,None or Unspecified,,,1989,1,17,17,1
274835,1821514,14000.0,1194089,10150,132,99.0,1980,,,A66,...,,,,Standard,Conventional,1989,1,31,31,1
141296,1505138,50000.0,1473654,4139,132,99.0,1978,,,D7G,...,None or Unspecified,Straight,None or Unspecified,,,1989,1,31,31,1
212552,1671174,16000.0,1327630,8591,132,99.0,1980,,,A62,...,,,,Standard,Conventional,1989,1,31,31,1
62755,1329056,22000.0,1336053,4089,132,99.0,1984,,,D3B,...,None or Unspecified,PAT,Lever,,,1989,1,31,31,1


In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 412698 entries, 205615 to 409203
Data columns (total 57 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   SalesID                   412698 non-null  int64  
 1   SalePrice                 412698 non-null  float64
 2   MachineID                 412698 non-null  int64  
 3   ModelID                   412698 non-null  int64  
 4   datasource                412698 non-null  int64  
 5   auctioneerID              392562 non-null  float64
 6   YearMade                  412698 non-null  int64  
 7   MachineHoursCurrentMeter  147504 non-null  float64
 8   UsageBand                 73670 non-null   object 
 9   fiModelDesc               412698 non-null  object 
 10  fiBaseModel               412698 non-null  object 
 11  fiSecondaryDesc           271971 non-null  object 
 12  fiModelSeries             58667 non-null   object 
 13  fiModelDescriptor         74816 non-null   o

## Cleaning & Organizing Data

In [8]:
pd.api.types.is_object_dtype(data.sale_day)

False

In [9]:
data.sale_day.dtype == "object"

False

In [10]:
for column in data.columns:
    if data[column].dtype == "object":
        data[column] = data[column].astype("category")
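
To see what the conversion in the loop above does to a single column, here is a sketch on a made-up `UsageBand` series: each distinct string becomes a category label, and missing values remain missing.

```python
import pandas as pd

usage = pd.Series(["Low", "High", "Medium", "Low", None], name="UsageBand")
usage_cat = usage.astype("category")

print(list(usage_cat.cat.categories))  # distinct labels, in sorted order
print(usage_cat.isna().sum())          # number of values still missing
```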

In [11]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 412698 entries, 205615 to 409203
Data columns (total 57 columns):
 #   Column                    Non-Null Count   Dtype   
---  ------                    --------------   -----   
 0   SalesID                   412698 non-null  int64   
 1   SalePrice                 412698 non-null  float64 
 2   MachineID                 412698 non-null  int64   
 3   ModelID                   412698 non-null  int64   
 4   datasource                412698 non-null  int64   
 5   auctioneerID              392562 non-null  float64 
 6   YearMade                  412698 non-null  int64   
 7   MachineHoursCurrentMeter  147504 non-null  float64 
 8   UsageBand                 73670 non-null   category
 9   fiModelDesc               412698 non-null  category
 10  fiBaseModel               412698 non-null  category
 11  fiSecondaryDesc           271971 non-null  category
 12  fiModelSeries             58667 non-null   category
 13  fiModelDescriptor         748

In [12]:
data.Differential_Type.cat.codes

205615   -1
274835    3
141296   -1
212552    3
62755    -1
         ..
410879   -1
412476   -1
411927   -1
407124   -1
409203    3
Length: 412698, dtype: int8

In [13]:
data.Differential_Type.cat.categories

Index(['Limited Slip', 'Locking', 'No Spin', 'Standard'], dtype='object')

In [14]:
np.unique(data.Differential_Type.cat.codes, return_counts=True)

(array([-1,  0,  1,  2,  3], dtype=int8),
 array([341134,   1181,      2,    212,  70169]))

In [15]:
data_differential_cat_codes = pd.Series(data.Differential_Type.cat.codes)
data_differential_cat_codes.value_counts()

-1    341134
 3     70169
 0      1181
 2       212
 1         2
Name: count, dtype: int64
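
The `-1` code that dominates the counts above is how pandas encodes a missing category. A sketch on a made-up series (the labels are borrowed from `Differential_Type`) shows the encoding and how codes map back to labels:

```python
import pandas as pd

diff_type = pd.Series(["Standard", None, "Locking", "Standard"]).astype("category")

codes = diff_type.cat.codes
print(codes.tolist())

# Map codes back to labels; -1 has no label, so it becomes None here.
labels = [diff_type.cat.categories[c] if c >= 0 else None for c in codes]
print(labels)
```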

In [16]:
# Replace every categorical column with its integer codes (missing values become -1)
for col_name, col_content in data.items():
    if isinstance(col_content.dtype, pd.CategoricalDtype):
        data[col_name] = col_content.cat.codes

In [17]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 412698 entries, 205615 to 409203
Data columns (total 57 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   SalesID                   412698 non-null  int64  
 1   SalePrice                 412698 non-null  float64
 2   MachineID                 412698 non-null  int64  
 3   ModelID                   412698 non-null  int64  
 4   datasource                412698 non-null  int64  
 5   auctioneerID              392562 non-null  float64
 6   YearMade                  412698 non-null  int64  
 7   MachineHoursCurrentMeter  147504 non-null  float64
 8   UsageBand                 412698 non-null  int8   
 9   fiModelDesc               412698 non-null  int16  
 10  fiBaseModel               412698 non-null  int16  
 11  fiSecondaryDesc           412698 non-null  int16  
 12  fiModelSeries             412698 non-null  int16  
 13  fiModelDescriptor         412698 non-null  i

In [18]:
# Record which rows were missing, then fill the gaps with the column median
data["auctioneerID_imputed"] = data.auctioneerID.isna()
data.loc[data.auctioneerID.isna(), "auctioneerID"] = data.auctioneerID.median()

data["MachineHoursCurrentMeter_imputed"] = data.MachineHoursCurrentMeter.isna()
data.loc[data.MachineHoursCurrentMeter.isna(), "MachineHoursCurrentMeter"] = data.MachineHoursCurrentMeter.median()

data.head()

Unnamed: 0,SalesID,SalePrice,MachineID,ModelID,datasource,auctioneerID,YearMade,MachineHoursCurrentMeter,UsageBand,fiModelDesc,...,Travel_Controls,Differential_Type,Steering_Controls,sale_year,sale_month,sale_day,day_of_year,day_of_week,auctioneerID_imputed,MachineHoursCurrentMeter_imputed
205615,1646770,9500.0,1126363,8434,132,18.0,1974,0.0,-1,4592,...,5,-1,-1,1989,1,17,17,1,False,True
274835,1821514,14000.0,1194089,10150,132,99.0,1980,0.0,-1,1819,...,-1,3,1,1989,1,31,31,1,False,True
141296,1505138,50000.0,1473654,4139,132,99.0,1978,0.0,-1,2347,...,5,-1,-1,1989,1,31,31,1,False,True
212552,1671174,16000.0,1327630,8591,132,99.0,1980,0.0,-1,1818,...,-1,3,1,1989,1,31,31,1,False,True
62755,1329056,22000.0,1336053,4089,132,99.0,1984,0.0,-1,2118,...,4,-1,-1,1989,1,31,31,1,False,True
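
One caveat with the cell above: the medians are computed over all rows, including the 2012 sales that are later used for testing. A more cautious variant, sketched below on made-up numbers, computes the median on the training rows only and reuses it for the test rows. The helper name `impute_with_flag` is hypothetical.

```python
import numpy as np
import pandas as pd

def impute_with_flag(train, test, column):
    """Fill missing values with the training median and record where that happened."""
    train, test = train.copy(), test.copy()
    fill_value = train[column].median()  # learned from training rows only
    for part in (train, test):
        part[f"{column}_imputed"] = part[column].isna()
        part[column] = part[column].fillna(fill_value)
    return train, test

train = pd.DataFrame({"MachineHoursCurrentMeter": [68.0, np.nan, 4640.0, 2838.0]})
test = pd.DataFrame({"MachineHoursCurrentMeter": [np.nan, 722.0]})

train_filled, test_filled = impute_with_flag(train, test, "MachineHoursCurrentMeter")
print(test_filled)
```

Whether this changes the scores noticeably on this dataset is not something the notebook measures; the sketch only shows the mechanics.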


In [19]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 412698 entries, 205615 to 409203
Data columns (total 59 columns):
 #   Column                            Non-Null Count   Dtype  
---  ------                            --------------   -----  
 0   SalesID                           412698 non-null  int64  
 1   SalePrice                         412698 non-null  float64
 2   MachineID                         412698 non-null  int64  
 3   ModelID                           412698 non-null  int64  
 4   datasource                        412698 non-null  int64  
 5   auctioneerID                      412698 non-null  float64
 6   YearMade                          412698 non-null  int64  
 7   MachineHoursCurrentMeter          412698 non-null  float64
 8   UsageBand                         412698 non-null  int8   
 9   fiModelDesc                       412698 non-null  int16  
 10  fiBaseModel                       412698 non-null  int16  
 11  fiSecondaryDesc                   412698 non-null  i

## Build a Prediction Model

In [20]:
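# Time-based split: train on sales before 2012, test on sales from 2012.
# Slicing from "MachineID" onward leaves out SalesID and the target, SalePrice.
X_train = data.loc[data.sale_year < 2012, "MachineID":]
X_test = data.loc[data.sale_year == 2012, "MachineID":]

y_train = data.loc[data.sale_year < 2012, "SalePrice"]
y_test = data.loc[data.sale_year == 2012, "SalePrice"]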

In [21]:
X_train.shape, y_train.shape

((401125, 57), (401125,))
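
Because this is time series data, the split above is by date rather than at random, so the model is always evaluated on sales that happened after everything it was trained on. A sketch of the same idea on a made-up frame:

```python
import pandas as pd

toy = pd.DataFrame({
    "sale_year": [1989, 2004, 2011, 2012, 2012],
    "SalePrice": [9500.0, 57000.0, 38500.0, 10000.0, 12500.0],
})

train = toy[toy["sale_year"] < 2012]
test = toy[toy["sale_year"] == 2012]

# Every training sale predates every test sale.
print(train["sale_year"].max(), test["sale_year"].min())
```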

In [22]:
regressor = RandomForestRegressor(n_jobs= -1)

In [23]:
%%time
regressor.fit(X_train, y_train)

CPU times: user 7min 58s, sys: 2.17 s, total: 8min 1s
Wall time: 1min 3s


In [24]:
def rmsle(y_test, y_pred):
    return np.sqrt(mean_squared_log_error(y_test, y_pred))

In [25]:
def CalculatePerformance(model, X_train, X_test, y_train, y_test):
    train_pred = model.predict(X_train)
    test_pred = model.predict(X_test)

    results = {
                "Train R2:": r2_score(y_train, train_pred),
                "Test R2": r2_score(y_test, test_pred),
        
                "Train MSE:": mean_squared_error(y_train, train_pred),
                "Test MSE": mean_squared_error(y_test, test_pred),
        
                "Train MAE:": mean_absolute_error(y_train, train_pred),
                "Test MAE": mean_absolute_error(y_test, test_pred),
        
                "Train RMSLE:": root_mean_squared_log_error(y_train, train_pred),
                "Test RMSLE": root_mean_squared_log_error(y_test, test_pred)}
    return results

In [26]:
CalculatePerformance(regressor, X_train, X_test, y_train, y_test)

{'Train R2:': 0.9875995768052734,
 'Test R2': 0.8728650944039799,
 'Train MSE:': 6580871.971171227,
 'Test MSE': 87325459.50622551,
 'Train MAE:': 1572.4547776129637,
 'Test MAE': 6116.243857253953,
 'Train RMSLE:': 0.08425034756708057,
 'Test RMSLE': 0.2533911977098474}

In [27]:
regressor = RandomForestRegressor(n_jobs=-1, max_samples= 40000)

In [28]:
%%time
regressor.fit(X_train, y_train)

CPU times: user 1min 11s, sys: 467 ms, total: 1min 12s
Wall time: 9.56 s


In [29]:
CalculatePerformance(regressor, X_train, X_test, y_train, y_test)

{'Train R2:': 0.9076167332677406,
 'Test R2': 0.8645162980085255,
 'Train MSE:': 49027556.648397855,
 'Test MSE': 93060017.4400916,
 'Train MAE:': 4467.370389030851,
 'Test MAE': 6374.550803594574,
 'Train RMSLE:': 0.2138271325749094,
 'Test RMSLE': 0.26298276286149724}
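
The trade-off from `max_samples=40000` can be read off the two runs above. The short calculation below only restates numbers already printed in this notebook (row count, wall times and test RMSLE); the variable names are chosen here for illustration.

```python
n_train_rows = 401_125   # from X_train.shape
max_samples = 40_000     # rows drawn for each tree

wall_full_s = 63.0       # "Wall time: 1min 3s" for the default forest
wall_sub_s = 9.56        # "Wall time: 9.56 s" with max_samples set

rmsle_full = 0.2533911977098474
rmsle_sub = 0.26298276286149724

fraction_per_tree = max_samples / n_train_rows
speedup = wall_full_s / wall_sub_s
rmsle_change = rmsle_sub - rmsle_full

print(f"rows seen per tree: {fraction_per_tree:.1%}")
print(f"speed-up: {speedup:.1f}x, test RMSLE change: {rmsle_change:+.4f}")
```

Each tree draws roughly a tenth of the training rows, fitting is several times faster, and the test RMSLE gets slightly worse, which makes the subsampled forest a reasonable stand-in while experimenting.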

## Parameter Tuning for Random Forest

In [38]:
params = {
    "n_estimators": [300, 200, 100, 50],
    "max_depth": [None, 5, 10, 20],
    "min_samples_split": [2, 10, 20, 30],
    "max_samples": [20000, 30000, 40000],
    # 1.0 means "use all features"; an integer 1 would mean a single feature per split
    "max_features": ["sqrt", 0.5, 0.75, 1.0]
}

In [39]:
search_cv = RandomizedSearchCV(RandomForestRegressor(),
                               param_distributions=params,
                               cv=5,
                               n_iter=2,
                               scoring="neg_root_mean_squared_log_error",
                               n_jobs=-1,
                               verbose=True)
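
For a sense of scale, the grid in `params` contains far more combinations than the two that `n_iter=2` samples. The sketch below counts them with the standard library; `grid_sizes` simply records how many values each hyperparameter has in the cell above.

```python
import math

# Number of candidate values per hyperparameter in `params`.
grid_sizes = {
    "n_estimators": 4,
    "max_depth": 4,
    "min_samples_split": 4,
    "max_samples": 3,
    "max_features": 4,
}

n_combinations = math.prod(grid_sizes.values())
n_iter, cv = 2, 5

print(n_combinations)  # size of the full grid
print(n_iter * cv)     # model fits actually run by the search
```

With so few sampled candidates the search is best read as a smoke test; increasing `n_iter` covers more of the grid at a proportional cost in fitting time.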

In [40]:
print(X_train.isnull().sum().sum())
print(np.isinf(X_train).sum().sum())

0
0


In [41]:
%%time
search_cv.fit(X_train, y_train)

Fitting 5 folds for each of 2 candidates, totalling 10 fits
CPU times: user 24.6 s, sys: 391 ms, total: 25 s
Wall time: 1min 3s


In [42]:
search_cv.best_estimator_

In [43]:
cv_results = pd.DataFrame(search_cv.cv_results_)
cv_results

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_n_estimators,param_min_samples_split,param_max_samples,param_max_features,param_max_depth,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,4.715418,0.193786,0.277729,0.026647,100,30,40000,sqrt,5,"{'n_estimators': 100, 'min_samples_split': 30,...",-0.536059,-0.506447,-0.484284,-0.509202,-0.515022,-0.510203,0.016605,2
1,34.139084,1.250492,0.828296,0.056716,200,20,40000,0.5,10,"{'n_estimators': 200, 'min_samples_split': 20,...",-0.428099,-0.352761,-0.30721,-0.336257,-0.34982,-0.354829,0.040026,1
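
Scores from `neg_root_mean_squared_log_error` are negated so that higher is always better, which is why every value in the table is below zero. The sketch below flips the sign back, using the two `mean_test_score` values shown above. The frame is rebuilt by hand here rather than taken from `search_cv`.

```python
import pandas as pd

cv_summary = pd.DataFrame({
    "param_n_estimators": [100, 200],
    "param_max_depth": [5, 10],
    "mean_test_score": [-0.510203, -0.354829],
})

# Negate the score to recover a plain RMSLE, where lower is better.
cv_summary["mean_rmsle"] = -cv_summary["mean_test_score"]
best = cv_summary.sort_values("mean_rmsle").iloc[0]
print(best)
```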


In [36]:
# A Good Model
good_model = RandomForestRegressor(n_estimators=100, min_samples_leaf=7, min_samples_split=4,
                                   max_features=0.5, n_jobs=-1, max_depth=None, max_samples=None)
good_model.fit(X_train, y_train)

In [37]:
CalculatePerformance(good_model, X_train, X_test, y_train, y_test)

{'Train R2:': 0.9394809625066746,
 'Test R2': 0.8809318406150366,
 'Train MSE:': 32117294.007470332,
 'Test MSE': 81784634.06337653,
 'Train MAE:': 3545.171618103602,
 'Test MAE': 5916.127777189163,
 'Train RMSLE:': 0.17227247705115656,
 'Test RMSLE': 0.24229363340478435}
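
Putting the three test scores from this notebook side by side makes the comparison easier to read. The numbers are copied from the outputs above; the labels are informal names chosen here.

```python
test_rmsle = {
    "default forest": 0.2533911977098474,
    "max_samples=40000": 0.26298276286149724,
    "good_model": 0.24229363340478435,
}

# Print the models from best (lowest RMSLE) to worst.
for name, score in sorted(test_rmsle.items(), key=lambda item: item[1]):
    print(f"{name:<20} {score:.4f}")

best_name = min(test_rmsle, key=test_rmsle.get)
```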