Rusty Bargain used car sales service is developing an app to attract new customers. In that app, you can quickly find out the market value of your car. You have access to historical data: technical specifications, trim versions, and prices. You need to build the model to determine the value. 

Rusty Bargain is interested in:

- the quality of the prediction;
- the speed of the prediction;
- the time required for training

Goal

Build a few regression models, and compare their quality of prediction, speed of prediction, and time required to train. The best model will be used to predict the price of a car.

In [1]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns
import time

from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OrdinalEncoder
from sklearn.metrics import make_scorer

from catboost import CatBoostRegressor
from lightgbm import LGBMRegressor

random_state = 12345  # shared seed for reproducible splits and models

import warnings

warnings.filterwarnings('ignore')

In [2]:
df = pd.read_csv('/datasets/car_data.csv')

In [3]:
df.shape

(354369, 16)

In [4]:
df.head()

        DateCrawled  Price  VehicleType  RegistrationYear  Gearbox  Power  \
0  24/03/2016 11:52    480          NaN              1993   manual      0
1  24/03/2016 10:58  18300        coupe              2011   manual    190
2  14/03/2016 12:52   9800          suv              2004     auto    163
3  17/03/2016 16:54   1500        small              2001   manual     75
4  31/03/2016 17:25   3600        small              2008   manual     69

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   DateCrawled        354369 non-null  object
 1   Price              354369 non-null  int64 
 2   VehicleType        316879 non-null  object
 3   RegistrationYear   354369 non-null  int64 
 4   Gearbox            334536 non-null  object
 5   Power              354369 non-null  int64 
 6   Model              334664 non-null  object
 7   Mileage            354369 non-null  int64 
 8   RegistrationMonth  354369 non-null  int64 
 9   FuelType           321474 non-null  object
 10  Brand              354369 non-null  object
 11  NotRepaired        283215 non-null  object
 12  DateCreated        354369 non-null  object
 13  NumberOfPictures   354369 non-null  int64 
 14  PostalCode         354369 non-null  int64 
 15  LastSeen           354369 non-null  object
dtypes: int64(7), object(9)

In [6]:
df.describe()

| | Price | RegistrationYear | Power | Mileage | RegistrationMonth | NumberOfPictures | PostalCode |
|---|---|---|---|---|---|---|---|
| count | 354369.0 | 354369.0 | 354369.0 | 354369.0 | 354369.0 | 354369.0 | 354369.0 |
| mean | 4416.656776 | 2004.234448 | 110.094337 | 128211.172535 | 5.714645 | 0.0 | 50508.689087 |
| std | 4514.158514 | 90.227958 | 189.850405 | 37905.34153 | 3.726421 | 0.0 | 25783.096248 |
| min | 0.0 | 1000.0 | 0.0 | 5000.0 | 0.0 | 0.0 | 1067.0 |
| 25% | 1050.0 | 1999.0 | 69.0 | 125000.0 | 3.0 | 0.0 | 30165.0 |
| 50% | 2700.0 | 2003.0 | 105.0 | 150000.0 | 6.0 | 0.0 | 49413.0 |
| 75% | 6400.0 | 2008.0 | 143.0 | 150000.0 | 9.0 | 0.0 | 71083.0 |
| max | 20000.0 | 9999.0 | 20000.0 | 150000.0 | 12.0 | 0.0 | 99998.0 |


In [7]:
df.duplicated().sum()

262

Conclusion
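Each observation describes one car listing. There are 354369 listings and 16 columns. The two columns with the most missing values are NotRepaired and VehicleType. There are also 262 duplicated records, which was expected. The summary statistics show implausible values as well: prices and power of 0, registration years from 1000 to 9999, and a NumberOfPictures column that is always 0.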

## Data preparation

Cleaning Process

In [8]:
df = df[df['Price'] != 0]  # drop listings with a zero price
df = df[df['Power'] != 0]  # drop listings with zero horsepower
df = df[df['RegistrationMonth'] != 0]  # drop listings with a zero registration month
# keep registration years from 1886 to 2021, inclusive
df = df[df['RegistrationYear'].between(1886, 2021)]
df = df.drop('NumberOfPictures', axis=1)  # the column is always 0, so it carries no information
# replace missing values in the NotRepaired column with 'unknown'
df['NotRepaired'] = df['NotRepaired'].fillna('unknown')
df = df.dropna()  # drop rows with any other missing value
df = df.drop_duplicates()  # drop duplicate rows
df.info()  # general info after the changes

<class 'pandas.core.frame.DataFrame'>
Int64Index: 253566 entries, 2 to 354368
Data columns (total 15 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   DateCrawled        253566 non-null  object
 1   Price              253566 non-null  int64 
 2   VehicleType        253566 non-null  object
 3   RegistrationYear   253566 non-null  int64 
 4   Gearbox            253566 non-null  object
 5   Power              253566 non-null  int64 
 6   Model              253566 non-null  object
 7   Mileage            253566 non-null  int64 
 8   RegistrationMonth  253566 non-null  int64 
 9   FuelType           253566 non-null  object
 10  Brand              253566 non-null  object
 11  NotRepaired        253566 non-null  object
 12  DateCreated        253566 non-null  object
 13  PostalCode         253566 non-null  int64 
 14  LastSeen           253566 non-null  object
dtypes: int64(6), object(9)
memory usage: 31.0+ MB
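
To see how much each cleaning rule contributes, the filters can be written as named boolean masks. The sketch below is only an illustration: `toy` is a made-up frame whose values are invented, and the `rules` dictionary is a hypothetical helper, not part of the original notebook. Only the column names and thresholds follow the cleaning cell.

```python
import pandas as pd

# Hypothetical toy rows; the values are invented, only the column names match the notebook.
toy = pd.DataFrame({
    'Price': [480, 0, 9800, 1500, 3600],
    'Power': [0, 190, 163, 75, 69],
    'RegistrationMonth': [0, 5, 8, 6, 7],
    'RegistrationYear': [1993, 2011, 2004, 1000, 2008],
})

# One boolean mask per rule: True marks a row that passes the rule.
rules = {
    'price != 0': toy['Price'] != 0,
    'power != 0': toy['Power'] != 0,
    'month != 0': toy['RegistrationMonth'] != 0,
    'year in 1886..2021': toy['RegistrationYear'].between(1886, 2021),
}

# Count how many rows each rule would reject on its own.
rejected = {name: int((~mask).sum()) for name, mask in rules.items()}

# Keep only the rows that pass every rule.
keep = pd.concat(list(rules.values()), axis=1).all(axis=1)
cleaned = toy[keep]

print(rejected)
print(len(cleaned))
```

On the real data, the same per-rule counts would show which filter accounts for most of the drop from 354369 to 253566 rows.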


The cleaning process is finished, so we can prepare the data splits. We will train Linear Regression (as a baseline), Decision Tree Regressor, Random Forest Regressor, CatBoost Regressor, and LightGBM Regressor models. The first three need the categorical features encoded before training. The last two handle categorical features natively, so they need no prior encoding. We will therefore make two copies of the cleaned data: one left unencoded (for CatBoost and LightGBM) and one encoded (for the other models). First, the unencoded copy:

In [9]:
#creating and preparing a copy of data to use with CatBoost and LightGBM which don't need prior encoding
df_gbm=df.copy()#makes a copy of our cleaned data
df_gbm.drop(['DateCrawled', 'DateCreated', 'LastSeen'], axis=1, inplace=True)
#drops the columns that are in time-date format 
df_gbm.reset_index(drop=True, inplace=True)
#resets the index
f_gbm=df_gbm.drop('Price', axis=1)#defines our features
t_gbm=df_gbm['Price']#defines the price column as our target
f_train_gbm, f_test_gbm, t_train_gbm, t_test_gbm=train_test_split(f_gbm, t_gbm, test_size=0.2,
                                                                 random_state=random_state)
#splits our new data into training and test sets for our features and target

#prints the shapes of our splits
print(f_train_gbm.shape)
print(t_train_gbm.shape)
print(f_test_gbm.shape)
print(t_test_gbm.shape)

(202852, 11)
(202852,)
(50714, 11)
(50714,)


In [10]:
cat_feat=['VehicleType', 'RegistrationYear', 'Gearbox', 'Model', 'RegistrationMonth',
         'FuelType', 'Brand', 'NotRepaired', 'PostalCode']

In [11]:
#creating and preparing a copy of data to use with models requiring prior encoding
df_enc=df.copy()#copy of cleaned data
encoder=OrdinalEncoder()#creates an instance of the ordinal encoder
df_enc[cat_feat]=encoder.fit_transform(df_enc[cat_feat])#encodes the columns
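
The cell above fits the encoder on the whole cleaned frame, so every category is known when the test rows are encoded. An alternative, sketched here on invented toy data, is to fit on the training rows only and map unseen categories to a sentinel value. This assumes scikit-learn 0.24 or later, where `OrdinalEncoder` accepts `handle_unknown='use_encoded_value'`; the `train` and `test` frames are hypothetical stand-ins for the real splits.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Invented toy data; 'train' and 'test' are hypothetical stand-ins for the real splits.
train = pd.DataFrame({'Gearbox': ['manual', 'auto', 'manual'],
                      'FuelType': ['petrol', 'gasoline', 'petrol']})
test = pd.DataFrame({'Gearbox': ['auto', 'manual'],
                     'FuelType': ['lpg', 'petrol']})  # 'lpg' never appears in train

# Unseen categories are mapped to -1 instead of raising an error.
encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
train_enc = encoder.fit_transform(train)  # learn the category -> integer mapping
test_enc = encoder.transform(test)        # reuse the same mapping on new rows

print(encoder.categories_)
print(test_enc)
```

Categories are numbered in sorted order within each column, so the integers carry no meaning beyond identity. Tree models can split on them, while a linear model will treat them as magnitudes.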

In [12]:
#preparation and splitting for encoded data
df_enc.drop(['DateCrawled', 'DateCreated', 'LastSeen'], axis=1, inplace=True)
df_enc.reset_index(drop=True, inplace=True)
f_enc=df_enc.drop('Price', axis=1)
t_enc=df_enc['Price']
f_train_enc, f_test_enc, t_train_enc, t_test_enc=train_test_split(f_enc, t_enc, test_size=0.2,
                                                                 random_state=random_state)
print(f_train_enc.shape)
print(t_train_enc.shape)
print(f_test_enc.shape)
print(t_test_enc.shape)

(202852, 11)
(202852,)
(50714, 11)
(50714,)
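
Both copies contain the same rows in the same order and are split with the same `test_size` and `random_state`, so the same listings should end up in both test sets. That is what makes the RMSE values of the encoded and unencoded models comparable. A minimal check of this property on invented toy frames (`raw` and `enc` are hypothetical names):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Two hypothetical frames with the same rows in the same order,
# standing in for the unencoded and encoded copies.
raw = pd.DataFrame({'Brand': list('abcdefghij'), 'Price': np.arange(10) * 100})
enc = raw.assign(Brand=np.arange(10))  # pretend-encoded version of the same rows

raw_train, raw_test = train_test_split(raw, test_size=0.2, random_state=12345)
enc_train, enc_test = train_test_split(enc, test_size=0.2, random_state=12345)

# Same seed and same number of rows give the same row selection in both splits.
same_rows = raw_test.index.equals(enc_test.index)
print(same_rows)
```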


The data is now cleaned and split in two forms: an ordinal-encoded copy for Linear Regression, Decision Tree, and Random Forest, and an unencoded copy for CatBoost and LightGBM.

## Model training

We will first create a function to calculate the RMSE and then make it our evaluation metric (or scorer) for our models. It takes as arguments the prediction and target data.

In [13]:
#create an rmse function and make it our scorer
def rmse(pred, target):  # root mean squared error of predictions against targets
    pred = np.array(pred)  # converts the predictions to a NumPy array
    target = np.array(target)  # converts the targets to a NumPy array
    error = pred - target  # vector of errors
    sq_error = error ** 2  # squared errors
    msq_error = sq_error.mean()  # mean of the squared errors
    score = msq_error ** 0.5  # square root of the mean squared error
    return score
scorer = make_scorer(rmse, greater_is_better=False)
# wraps rmse as a scorer; greater_is_better=False tells scikit-learn that smaller is better,
# so the scores reported by cross-validation are negated RMSE values
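
Because the scorer is built with `greater_is_better=False`, scikit-learn negates the metric, which is why the cross-validation scores in the following sections are negative. The sketch below shows this on synthetic data that is invented for illustration; the `rmse` here is a compact re-implementation, not the notebook's function.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score

def rmse(y_true, y_pred):
    """Root mean squared error; symmetric in its two arguments."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

# Synthetic regression data, invented for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# greater_is_better=False makes scikit-learn negate the metric,
# so that "a higher score is better" still holds internally.
scorer = make_scorer(rmse, greater_is_better=False)
scores = cross_val_score(LinearRegression(), X, y, scoring=scorer, cv=5)

mean_rmse = -scores.mean()  # flip the sign back to read it as an RMSE
print(scores)
print(mean_rmse)
```

Since RMSE is symmetric, it does not matter that scikit-learn passes the true values first and the predictions second.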

## Linear Regression (Baseline)


We will now train a Linear Regression model as our baseline, i.e. its RMSE is the value that the other models should beat. We will feed it the encoded data. Let's get a cross-validation score (it is printed as a negative number because the scorer negates the RMSE):

In [14]:
#Baseline model with Linear Regression
LR=LinearRegression()#creates an instance of Linear Regression model
LR_score=cross_val_score(LR, f_train_enc, t_train_enc, scoring=scorer, cv=5)
#calculates the cross-validation scores for 5 folds of the training data
print(LR_score.mean()) #prints the mean of those scores

-3341.572814146216


Let us train the model and take note of the wall time it takes to do so

In [15]:
#training LR model
LR=LinearRegression()#instance of a linear regression model
%time LR.fit(f_train_enc, t_train_enc)#trains the model with the encoded data and times the process

CPU times: user 56.7 ms, sys: 60.3 ms, total: 117 ms
Wall time: 136 ms


LinearRegression()
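
`%time` is an IPython magic, so it only works inside a notebook or IPython session. A plain-Python equivalent for wall time can be sketched with `time.perf_counter`; the `timed` helper below is a hypothetical name introduced for illustration, and the workload is a stand-in for a call such as `LR.fit(f_train_enc, t_train_enc)`.

```python
import time

def timed(func, *args, **kwargs):
    """Call func(*args, **kwargs) and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Stand-in workload; in the notebook this could be timed(LR.fit, f_train_enc, t_train_enc).
total, seconds = timed(sum, range(1_000_000))
print(total, round(seconds, 4))
```

A single run is a noisy measurement, so small differences (for example 12 ms versus 18 ms of prediction time) should be read with caution.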

We now get predictions:

In [16]:
#LR model prediction
%time lr_pred=LR.predict(f_test_enc)#gets predictions and times the process

CPU times: user 6.68 ms, sys: 44.8 ms, total: 51.4 ms
Wall time: 17.8 ms


RMSE calculation:

In [17]:
#LR model RMSE
lr_rmse=rmse(t_test_enc, lr_pred)
lr_rmse

3319.945019855772

So the RMSE of 3319.95 is the value to beat.

## Decision Tree Regressor


For the decision tree we tune the max_depth hyperparameter and keep the value that gets the best mean cross-validation score.

In [18]:
#Decision Tree Regressor - hyperparameter tuning
for depth in range(1, 16):#loops through values of max_depth from 1 to 15
    DTR=DecisionTreeRegressor(max_depth=depth, random_state=random_state)
    #creates an instance of Decision Tree Regressor model with the max_depth value and random state of 12345
    DTR_score=cross_val_score(DTR, f_train_enc, t_train_enc, scoring=scorer, cv=5)
    #gets cross-validation score
    print('Max_depth', depth, 'score:', DTR_score.mean())
    #prints max_depth value and the mean cross-validation score

Max_depth 1 score: -3597.5212621009646
Max_depth 2 score: -3148.7614173370557
Max_depth 3 score: -2791.439807061578
Max_depth 4 score: -2541.7368659693166
Max_depth 5 score: -2372.5274016078033
Max_depth 6 score: -2263.1348897390744
Max_depth 7 score: -2164.109806596397
Max_depth 8 score: -2081.064381108975
Max_depth 9 score: -2009.1902263310585
Max_depth 10 score: -1951.205002133464
Max_depth 11 score: -1913.3661189015081
Max_depth 12 score: -1889.1937156659264
Max_depth 13 score: -1882.4974472277943
Max_depth 14 score: -1889.0048759830315
Max_depth 15 score: -1911.4468913132223


The best score was obtained with max_depth=13: the mean RMSE falls steadily up to that depth and rises again from 14 onward, which suggests overfitting at larger depths. We will use that value to train the model and get predictions, timing both steps.
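
The selection can also be done programmatically instead of by reading the printed list. A minimal sketch, using the mean scores from the output above rounded to two decimals (`cv_scores` is a hypothetical name):

```python
# Mean cross-validation scores copied from the output above (negated RMSE),
# rounded to two decimals.
cv_scores = {
    1: -3597.52, 2: -3148.76, 3: -2791.44, 4: -2541.74, 5: -2372.53,
    6: -2263.13, 7: -2164.11, 8: -2081.06, 9: -2009.19, 10: -1951.21,
    11: -1913.37, 12: -1889.19, 13: -1882.50, 14: -1889.00, 15: -1911.45,
}

# Scores are negated RMSE, so the best depth has the largest (least negative) score.
best_depth = max(cv_scores, key=cv_scores.get)
best_rmse = -cv_scores[best_depth]
print(best_depth, best_rmse)  # 13 1882.5
```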

In [19]:
#DTR model training
DTR=DecisionTreeRegressor(max_depth=13, random_state=random_state)
%time DTR.fit(f_train_enc, t_train_enc)

CPU times: user 1.08 s, sys: 3.76 ms, total: 1.09 s
Wall time: 1.1 s


DecisionTreeRegressor(max_depth=13, random_state=12345)

In [20]:
#DTR model prediction
%time DTR_pred=DTR.predict(f_test_enc)

CPU times: user 10.4 ms, sys: 3.71 ms, total: 14.1 ms
Wall time: 12.1 ms


RMSE calculation:

In [21]:
#DTR rmse
DTR_rmse=rmse(t_test_enc, DTR_pred)
DTR_rmse

1855.542228750032

The RMSE is 1855.54, which is already much better than our baseline.

## Random Forest Regressor


Here we tune two hyperparameters: max_depth and n_estimators. The best model we found had n_estimators=70 and max_depth=24; the cell below shows the max_depth sweep with n_estimators fixed at 70.

In [22]:
#Random Forest Regressor - hyperparameter tuning
for depth in range(20, 26):
    RFR=RandomForestRegressor(n_estimators=70, max_depth=depth, random_state=random_state)
    RFR_score=cross_val_score(RFR, f_train_enc, t_train_enc, scoring=scorer, cv=5)
    print('Max_depth', depth, 'score:', RFR_score.mean())

Max_depth 20 score: -1551.1765036303507
Max_depth 21 score: -1549.600091494261
Max_depth 22 score: -1548.4409216471856
Max_depth 23 score: -1547.4262950780078
Max_depth 24 score: -1547.294733299153
Max_depth 25 score: -1547.4023074095205
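
The loop above varies only max_depth. Searching both hyperparameters jointly could be sketched with `GridSearchCV`, as below. This is an illustration on a small synthetic dataset with a deliberately tiny, hypothetical grid so that it runs in seconds; it is not the search that produced the values reported in this notebook. It uses the built-in `'neg_root_mean_squared_error'` scorer (available in scikit-learn 0.22 and later) rather than the custom scorer.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Small synthetic dataset, invented so the sketch runs in seconds.
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=12345)

# Hypothetical, deliberately tiny grid; a real search would use larger values.
param_grid = {'n_estimators': [10, 30], 'max_depth': [3, 6]}

search = GridSearchCV(
    RandomForestRegressor(random_state=12345),
    param_grid=param_grid,
    scoring='neg_root_mean_squared_error',  # built-in negated RMSE scorer
    cv=3,
)
search.fit(X, y)

print(search.best_params_)
print(-search.best_score_)  # cross-validated RMSE of the best combination
```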


So let us train and test our model, with the processes being timed

In [23]:
#RFR model training
RFR=RandomForestRegressor(n_estimators=70, max_depth=24, random_state=random_state)
%time RFR.fit(f_train_enc, t_train_enc)

CPU times: user 1min 22s, sys: 196 ms, total: 1min 22s
Wall time: 1min 22s


RandomForestRegressor(max_depth=24, n_estimators=70, random_state=12345)

In [24]:
#RFR model predictions
%time RFR_pred=RFR.predict(f_test_enc)

CPU times: user 1.67 s, sys: 4.01 ms, total: 1.67 s
Wall time: 1.68 s


RMSE calculation:

In [25]:
RFR_rmse=rmse(t_test_enc, RFR_pred)
RFR_rmse

1527.8930336833093

An RMSE of 1527.89, much better than our baseline.

## CatBoost Regressor

For our CatBoost regressor (which uses gradient boosting), we don't need prior encoding. We will be tuning using different hyperparameters this time. We will find the best parameters using GridSearchCV.

In [26]:
#CatBoost - hyperparameter tuning
CBR=CatBoostRegressor()#instance of CatBoost Regressor
parameters={'depth': [6, 8, 10],
           'learning_rate': [0.5, 0.1],
           'l2_leaf_reg': [2, 4],
           'iterations': [10, 50],
           'loss_function': ['RMSE'],
           'random_seed': [random_state]}
#our dictionary of hyperparameters that will be looped through when we feed them to GridSearch
grid=GridSearchCV(estimator=CBR, param_grid=parameters, scoring=scorer, cv=3, n_jobs=-1, verbose=0)
#loops through parameters to help us get the best hyperparameters for model training
grid.fit(f_train_gbm, t_train_gbm, cat_features=cat_feat)
#fits our training unencoded data into our grid instance
best_param=grid.best_params_ #gets the best set of parameters for our model

0:	learn: 3225.8100978	total: 226ms	remaining: 2.04s
1:	learn: 2540.3887764	total: 393ms	remaining: 1.57s
2:	learn: 2227.9518568	total: 553ms	remaining: 1.29s
3:	learn: 2083.5267987	total: 702ms	remaining: 1.05s
4:	learn: 2012.2635808	total: 858ms	remaining: 858ms
5:	learn: 1978.0723763	total: 1.01s	remaining: 676ms
6:	learn: 1943.8383140	total: 1.16s	remaining: 495ms
7:	learn: 1915.5434974	total: 1.3s	remaining: 326ms
8:	learn: 1898.5680531	total: 1.45s	remaining: 161ms
9:	learn: 1873.4810880	total: 1.6s	remaining: 0us
0:	learn: 3227.9585234	total: 175ms	remaining: 1.57s
1:	learn: 2540.4534392	total: 345ms	remaining: 1.38s
2:	learn: 2243.1368262	total: 503ms	remaining: 1.17s
3:	learn: 2087.1054341	total: 657ms	remaining: 986ms
4:	learn: 2009.6934546	total: 811ms	remaining: 811ms
5:	learn: 1968.9292266	total: 961ms	remaining: 640ms
6:	learn: 1939.9819808	total: 1.11s	remaining: 475ms
7:	learn: 1909.8565025	total: 1.26s	remaining: 315ms
8:	learn: 1889.8867469	total: 1.41s	remaining: 156

We can print out the best hyperparameter settings for our model

In [27]:
print('Best score across all searched parameters', grid.best_score_)
print('Best parameters:', best_param)

Best score across all searched parameters -1608.7460454086065
Best parameters: {'depth': 10, 'iterations': 50, 'l2_leaf_reg': 2, 'learning_rate': 0.5, 'loss_function': 'RMSE', 'random_seed': 12345}
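
For a sense of how much work this search does, the number of candidate combinations can be counted with scikit-learn's `ParameterGrid`. The dictionary below repeats the grid from the tuning cell, with the seed written out as 12345.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the CatBoost tuning cell above.
parameters = {
    'depth': [6, 8, 10],
    'learning_rate': [0.5, 0.1],
    'l2_leaf_reg': [2, 4],
    'iterations': [10, 50],
    'loss_function': ['RMSE'],
    'random_seed': [12345],
}

n_candidates = len(ParameterGrid(parameters))  # 3 * 2 * 2 * 2 * 1 * 1
n_fits = n_candidates * 3                      # cv=3 folds per candidate
print(n_candidates, n_fits)  # 24 72
```

With the default `refit=True`, `GridSearchCV` also trains the best candidate once more on the full training set, so 73 models are fitted in total.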


We can now train our CatBoost regressor with the best hyperparameter settings and get predictions, timing both steps.

In [28]:
#CBR model training
CBR=CatBoostRegressor(depth=best_param['depth'],
                     iterations=best_param['iterations'],
                     l2_leaf_reg=best_param['l2_leaf_reg'],
                     learning_rate=best_param['learning_rate'],
                     loss_function='RMSE', random_seed=random_state)
%time CBR.fit(f_train_gbm, t_train_gbm, cat_features=cat_feat, verbose=False, plot=False)

CPU times: user 19.6 s, sys: 80 ms, total: 19.6 s
Wall time: 19.8 s


<catboost.core.CatBoostRegressor at 0x7f8bafade8e0>

In [29]:
#CBR model predictions
%time CBR_pred=CBR.predict(f_test_gbm)

CPU times: user 157 ms, sys: 0 ns, total: 157 ms
Wall time: 158 ms


RMSE calculation:

In [30]:
CBR_rmse=rmse(t_test_gbm, CBR_pred)
CBR_rmse

1593.6185757394744

An RMSE of 1593.62, much better than our baseline.

## LightGBM Regressor

LightGBM also doesn't require prior encoding. We will tune it the same way we did CatBoost, but with different hyperparameters. One thing to note is that categorical features must have the 'category' dtype before they are passed to LightGBM; 'object' columns are not accepted. So we need to convert them in both the training set and the test set.
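
When the two sets are converted independently, each column gets its own category list, so the underlying integer codes can differ between training and test data. One defensive pattern, sketched here on invented toy columns, is to reuse the training column's dtype for the test column. Whether this extra step is needed depends on how the library handles pandas categoricals internally, so treat it as an optional safeguard rather than a requirement.

```python
import pandas as pd

# Invented toy columns standing in for one categorical feature in each split.
train_col = pd.Series(['sedan', 'suv', 'wagon', 'suv'], name='VehicleType')
test_col = pd.Series(['wagon', 'sedan', 'coupe'], name='VehicleType')  # 'coupe' unseen in train

# Convert the training column, then reuse its exact category set for the test column.
train_cat = train_col.astype('category')
test_cat = test_col.astype(train_cat.dtype)

print(train_cat.cat.categories.tolist())  # ['sedan', 'suv', 'wagon']
print(test_cat.cat.codes.tolist())        # [2, 0, -1]
```

A value that was never seen in training becomes missing (code -1) instead of silently receiving a new code.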

In [31]:
#Converting all categorical variables to 'category' type
for c in cat_feat:
    f_train_gbm[c] = f_train_gbm[c].astype('category')

#LightGBM hyperparameter tuning
model=LGBMRegressor()
parameters={'num_leaves': [10, 20, 30],
           'learning_rate': [0.5, 0.1],
           'n_estimators': [10, 20],
           'random_state': [random_state],
           'objective': ['rmse']}
grid=GridSearchCV(estimator=model, param_grid=parameters, scoring=scorer, cv=3, n_jobs=-1)
grid.fit(f_train_gbm, t_train_gbm)
best_param=grid.best_params_

Let us get our best hyperparameter settings:

In [32]:
print('Best score across all searched parameters', grid.best_score_)
print('Best parameters:', best_param)

Best score across all searched parameters -1679.90724334323
Best parameters: {'learning_rate': 0.5, 'n_estimators': 20, 'num_leaves': 30, 'objective': 'rmse', 'random_state': 12345}


We will now train and test our LightGBM model with those hyperparameter settings, timing both steps.

In [33]:
#LightGBM model training
lgbm=LGBMRegressor(learning_rate=0.5,
                  n_estimators=20,
                  num_leaves=30,
                  objective='rmse', random_state=random_state)
%time lgbm.fit(f_train_gbm, t_train_gbm)

CPU times: user 2.59 s, sys: 11.5 ms, total: 2.6 s
Wall time: 2.6 s


LGBMRegressor(learning_rate=0.5, n_estimators=20, num_leaves=30,
              objective='rmse', random_state=12345)

In [34]:
#convert test features to 'category' type
for c in cat_feat:
    f_test_gbm[c] = f_test_gbm[c].astype('category')

#LightGBM model predictions
%time lgbm_pred=lgbm.predict(f_test_gbm)

CPU times: user 190 ms, sys: 0 ns, total: 190 ms
Wall time: 174 ms


RMSE calculation:

In [35]:
lgbm_rmse=rmse(t_test_gbm, lgbm_pred)
lgbm_rmse

1667.6415612054686

An RMSE of 1667.64, much better than our baseline.

## Model analysis

In [39]:
index=['LR', 'DTR', 'RFR', 'CB', 'LGBM']
summary=pd.DataFrame(data={'training_time(s)': [.136, 1.1, 82, 20, 2.6],
                             'prediction_time(s)': [.018, .012, 1.68, .16, .174],
                             'RMSE': [3319, 1855, 1527, 1593, 1667]},
                    index=index)
summary

| | training_time(s) | prediction_time(s) | RMSE |
|---|---|---|---|
| LR | 0.136 | 0.018 | 3319 |
| DTR | 1.1 | 0.012 | 1855 |
| RFR | 82.0 | 1.68 | 1527 |
| CB | 20.0 | 0.16 | 1593 |
| LGBM | 2.6 | 0.174 | 1667 |


Takeaways:

- Linear Regression had the best training time (0.136 s), while the worst goes to Random Forest (82 s).
- Decision Tree had the best prediction time (0.012 s), while the worst goes to Random Forest (1.68 s). To be fair, we did train Random Forest with n_estimators=70 and max_depth=24, but Random Forests generally take more time.
- Random Forest had the best RMSE (1527), while the worst goes to Linear Regression (3319).
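
The trade-off behind these takeaways can be made explicit by comparing every model with the most accurate one. The sketch below copies the figures from the summary table into a plain dictionary (the key names `train_s`, `predict_s`, and `rmse` are hypothetical) and prints the RMSE gap and the relative training time.

```python
# Figures copied from the summary table above.
summary = {
    'LR':   {'train_s': 0.136, 'predict_s': 0.018, 'rmse': 3319},
    'DTR':  {'train_s': 1.1,   'predict_s': 0.012, 'rmse': 1855},
    'RFR':  {'train_s': 82.0,  'predict_s': 1.68,  'rmse': 1527},
    'CB':   {'train_s': 20.0,  'predict_s': 0.16,  'rmse': 1593},
    'LGBM': {'train_s': 2.6,   'predict_s': 0.174, 'rmse': 1667},
}

best = min(summary, key=lambda name: summary[name]['rmse'])  # most accurate model

# Compare every other model with the most accurate one.
for name, row in summary.items():
    if name == best:
        continue
    rmse_gap = row['rmse'] - summary[best]['rmse']
    train_ratio = row['train_s'] / summary[best]['train_s']
    print(f"{name}: RMSE +{rmse_gap}, training time x{train_ratio:.3f} of {best}")
```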

## Conclusion

We have successfully cleaned and prepared the data and used it to train five models. Although the Random Forest model has the lowest RMSE (1527), its cost in training and prediction time is considerable. The CatBoost regressor needs about a quarter of the training time (20 s versus 82 s) and about a tenth of the prediction time (0.16 s versus 1.68 s), and reaches an RMSE of 1593, a difference of just 66 euros. So we recommend the CatBoost model.
