# Determining value of a car using historical data
Rusty Bargain, a used car sales service, is developing an app to attract new customers. In that app, you can quickly find out the market value of your car. You have access to historical data: technical specifications, trim versions, and prices. You need to build a model to determine the value.

Rusty Bargain is interested in:

- the quality of the prediction;
- the speed of the prediction;
- the time required for training.

# Table of Contents
1. [General Information](#step1)
2. [Data Preprocessing](#step2)
3. [Model Training](#step3)
   - [Linear Regression](#step3_1)
   - [Decision Tree](#step3_2)
   - [Random Forest](#step3_3)
   - [CatBoost](#step3_4)
   - [LightGBM](#step3_5)
4. [Model Analysis](#step4)
5. [Conclusion](#step5)

# General Information<a name='step1'></a>

Let us import our necessary libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
from sklearn.preprocessing import OrdinalEncoder
from sklearn.metrics import mean_squared_error
from sklearn.metrics import make_scorer
import time
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor
random_state=12345

We can now read our data

In [2]:
data=pd.read_csv('/datasets/car_data.csv')#reads the csv data as a pandas dataframe
data.head()#1st 5 rows

Unnamed: 0,DateCrawled,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Mileage,RegistrationMonth,FuelType,Brand,NotRepaired,DateCreated,NumberOfPictures,PostalCode,LastSeen
0,24/03/2016 11:52,480,,1993,manual,0,golf,150000,0,petrol,volkswagen,,24/03/2016 00:00,0,70435,07/04/2016 03:16
1,24/03/2016 10:58,18300,coupe,2011,manual,190,,125000,5,gasoline,audi,yes,24/03/2016 00:00,0,66954,07/04/2016 01:46
2,14/03/2016 12:52,9800,suv,2004,auto,163,grand,125000,8,gasoline,jeep,,14/03/2016 00:00,0,90480,05/04/2016 12:47
3,17/03/2016 16:54,1500,small,2001,manual,75,golf,150000,6,petrol,volkswagen,no,17/03/2016 00:00,0,91074,17/03/2016 17:40
4,31/03/2016 17:25,3600,small,2008,manual,69,fabia,90000,7,gasoline,skoda,no,31/03/2016 00:00,0,60437,06/04/2016 10:17


Let us get some general information about the dataframe

In [3]:
data.info()#general information

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   DateCrawled        354369 non-null  object
 1   Price              354369 non-null  int64 
 2   VehicleType        316879 non-null  object
 3   RegistrationYear   354369 non-null  int64 
 4   Gearbox            334536 non-null  object
 5   Power              354369 non-null  int64 
 6   Model              334664 non-null  object
 7   Mileage            354369 non-null  int64 
 8   RegistrationMonth  354369 non-null  int64 
 9   FuelType           321474 non-null  object
 10  Brand              354369 non-null  object
 11  NotRepaired        283215 non-null  object
 12  DateCreated        354369 non-null  object
 13  NumberOfPictures   354369 non-null  int64 
 14  PostalCode         354369 non-null  int64 
 15  LastSeen           354369 non-null  object
dtypes: int64(7), object(9)

Some rows have missing values that we will have to deal with. Let us look at a description of the numerical columns

In [4]:
data.describe()#description of columns with numerical values

Unnamed: 0,Price,RegistrationYear,Power,Mileage,RegistrationMonth,NumberOfPictures,PostalCode
count,354369.0,354369.0,354369.0,354369.0,354369.0,354369.0,354369.0
mean,4416.656776,2004.234448,110.094337,128211.172535,5.714645,0.0,50508.689087
std,4514.158514,90.227958,189.850405,37905.34153,3.726421,0.0,25783.096248
min,0.0,1000.0,0.0,5000.0,0.0,0.0,1067.0
25%,1050.0,1999.0,69.0,125000.0,3.0,0.0,30165.0
50%,2700.0,2003.0,105.0,150000.0,6.0,0.0,49413.0
75%,6400.0,2008.0,143.0,150000.0,9.0,0.0,71083.0
max,20000.0,9999.0,20000.0,150000.0,12.0,0.0,99998.0


A few things we can gather:
1. There are price values of zero. This doesn't make sense, so we need to filter those out.
2. There are registration years that date as far back as the year 1000, and as far into the future as the year 9999. According to history, the automobile was invented in 1886; and last time we checked, we are in the year 2021. Filtering is needed again.
3. There are cars with zero power. This doesn't make sense either. Filtering, again, is needed.
4. Months go from 1 to 12, but we have values of 0 in the RegistrationMonth column. More filtering.
5. The NumberOfPictures column is zero all through. We can drop the whole column since that isn't helpful.

# Data Preprocessing<a name='step2'></a>

Let us check for duplicate rows. If we have any, we also need to drop those

In [5]:
data.duplicated().sum()#gives us the number of duplicate rows

262

We can now start our cleaning process:

In [6]:
data=data[data['Price']!=0]#prices are not zero
data=data[data['Power']!=0]#no zero horsepower
data=data[data['RegistrationMonth']!=0]#no zero registration month
data=data[(1886 <= data['RegistrationYear']) & (data['RegistrationYear'] <= 2021)]
#registration is between and including 1886 and 2021
data=data.drop('NumberOfPictures', axis=1)#drops NumberOfPictures column
data['NotRepaired']=data['NotRepaired'].fillna('unknown')
#replace missing values in NotRepaired column by 'unknown' (assigned back instead of an in-place fill on a column selection)
data.dropna(inplace=True)#drops all other missing values
data.drop_duplicates(inplace=True)#drop duplicate rows
data.info()#general info after changes

<class 'pandas.core.frame.DataFrame'>
Int64Index: 253566 entries, 2 to 354368
Data columns (total 15 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   DateCrawled        253566 non-null  object
 1   Price              253566 non-null  int64 
 2   VehicleType        253566 non-null  object
 3   RegistrationYear   253566 non-null  int64 
 4   Gearbox            253566 non-null  object
 5   Power              253566 non-null  int64 
 6   Model              253566 non-null  object
 7   Mileage            253566 non-null  int64 
 8   RegistrationMonth  253566 non-null  int64 
 9   FuelType           253566 non-null  object
 10  Brand              253566 non-null  object
 11  NotRepaired        253566 non-null  object
 12  DateCreated        253566 non-null  object
 13  PostalCode         253566 non-null  int64 
 14  LastSeen           253566 non-null  object
dtypes: int64(6), object(9)
memory usage: 31.0+ MB


We have finished our cleaning process. Now let us prepare the data splits. We will be training Linear Regression (as a baseline), Decision Tree Regressor, Random Forest Regressor, CatBoost Regressor, and LightGBM Regressor models. The first three models need the categorical features to be encoded before training. The last two handle categorical features natively, so they don't need prior encoding. We will therefore make two copies of our cleaned data: one that isn't encoded (for CatBoost and LightGBM), and one that is encoded (for the other models). First, the data that will remain unencoded:

In [7]:
#creating and preparing a copy of data to use with CatBoost and LightGBM which don't need prior encoding
data_gbm=data.copy()#makes a copy of our cleaned data
data_gbm.drop(['DateCrawled', 'DateCreated', 'LastSeen'], axis=1, inplace=True)
#drops the columns that are in time-date format 
data_gbm.reset_index(drop=True, inplace=True)
#resets the index
f_gbm=data_gbm.drop('Price', axis=1)#defines our features
t_gbm=data_gbm['Price']#defines the price column as our target
f_train_gbm, f_test_gbm, t_train_gbm, t_test_gbm=train_test_split(f_gbm, t_gbm, test_size=0.2,
                                                                 random_state=random_state)
#splits our new data into training and test sets for our features and target

#prints the shapes of our splits
print(f_train_gbm.shape)
print(t_train_gbm.shape)
print(f_test_gbm.shape)
print(t_test_gbm.shape)

(202852, 11)
(202852,)
(50714, 11)
(50714,)


Now for the encoded copy. We will first create a list of the columns we want encoded:

In [8]:
cat_feat=['VehicleType', 'RegistrationYear', 'Gearbox', 'Model', 'RegistrationMonth',
         'FuelType', 'Brand', 'NotRepaired', 'PostalCode']

In [9]:
#creating and preparing a copy of data to use with models requiring prior encoding
data_enc=data.copy()#copy of cleaned data
encoder=OrdinalEncoder()#creates an instance of the ordinal encoder
data_enc[cat_feat]=encoder.fit_transform(data_enc[cat_feat])#encodes the columns
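
As a toy illustration of what the ordinal encoder does (the values are made up; only `OrdinalEncoder` itself comes from the notebook): each column's distinct values are sorted and replaced by their position in that sorted list, so the resulting numbers reflect alphabetical order rather than any real ranking.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# made-up values, only for illustration
toy = pd.DataFrame({
    'Gearbox': ['manual', 'auto', 'manual'],
    'FuelType': ['petrol', 'gasoline', 'petrol'],
})
toy_encoder = OrdinalEncoder()
encoded = toy_encoder.fit_transform(toy)

print(toy_encoder.categories_)  # sorted categories learned per column
print(encoded)                  # each value replaced by its index in that list
```

Tree-based models can split on such codes repeatedly, whereas a linear model treats them as magnitudes. That may be one reason a linear baseline struggles on this kind of encoding, although the notebook does not test this.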

In [10]:
#preparation and splitting for encoded data
data_enc.drop(['DateCrawled', 'DateCreated', 'LastSeen'], axis=1, inplace=True)
data_enc.reset_index(drop=True, inplace=True)
f_enc=data_enc.drop('Price', axis=1)
t_enc=data_enc['Price']
f_train_enc, f_test_enc, t_train_enc, t_test_enc=train_test_split(f_enc, t_enc, test_size=0.2,
                                                                 random_state=random_state)
print(f_train_enc.shape)
print(t_train_enc.shape)
print(f_test_enc.shape)
print(t_test_enc.shape)

(202852, 11)
(202852,)
(50714, 11)
(50714,)


We have successfully cleaned and prepared encoded and unencoded data for model training

# Model training<a name='step3'></a>

We will first create a function to calculate the RMSE and then make it our evaluation metric (or scorer) for our models. It takes as arguments the prediction and target data.

In [11]:
#create an rmse function and make it our scorer
def rmse(pred, target):#creates the rmse function that takes target and prediction values as arguments
    pred=np.array(pred)#turns the prediction into a vector
    target=np.array(target)#turns the target into a vector
    error=pred - target #calculates the vector of errors
    sq_error=error ** 2 #squares the errors
    msq_error= sq_error.mean()#gets the mean of the squared errors
    score = msq_error ** 0.5 #gets the square root
    return score #returns the value
scorer=make_scorer(rmse, greater_is_better=False) 
#makes our rmse function our scorer and specifies that a smaller value is better

## Linear Regression (Baseline)<a name='step3_1'></a>

We will now train a Linear Regression model as our baseline, i.e. the RMSE it achieves is the value the other models should beat. We will feed it the encoded data. Let's get a cross-validation score:

In [12]:
#Baseline model with Linear Regression
LR=LinearRegression()#creates an instance of Linear Regression model
LR_score=cross_val_score(LR, f_train_enc, t_train_enc, scoring=scorer, cv=5)
#calculates the cross-validation scores for 5 folds of the training data
print(LR_score.mean()) #prints the mean of those scores

-3341.572814146216
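
The score is negative because `make_scorer(..., greater_is_better=False)` flips the sign of the metric, so that scikit-learn can always treat a larger score as better; the RMSE itself is the absolute value. A small sketch on made-up numbers, where `rmse_sketch` and the `DummyRegressor` setup are illustrative assumptions:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import make_scorer

def rmse_sketch(y_true, y_pred):
    """Root mean squared error (stand-in for the notebook's rmse function)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

X = np.zeros((4, 1))                 # features are ignored by DummyRegressor
y = np.array([1.0, 2.0, 3.0, 6.0])   # mean is 3.0
model = DummyRegressor(strategy='mean').fit(X, y)

neg_scorer = make_scorer(rmse_sketch, greater_is_better=False)
plain = rmse_sketch(y, model.predict(X))  # ordinary, positive RMSE
signed = neg_scorer(model, X, y)          # same value with the sign flipped
print(plain, signed)
```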


Let us train the model and take note of the wall time it takes to do so

In [13]:
#training LR model
LR=LinearRegression()#instance of a linear regression model
%time LR.fit(f_train_enc, t_train_enc)#trains the model with the encoded data and times the process

CPU times: user 97.4 ms, sys: 59.9 ms, total: 157 ms
Wall time: 130 ms


LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

We now get predictions:

In [14]:
#LR model prediction
%time lr_pred=LR.predict(f_test_enc)#gets predictions and times the process

CPU times: user 8.49 ms, sys: 19.5 ms, total: 28 ms
Wall time: 66.7 ms


RMSE calculation:

In [15]:
#LR model RMSE
lr_rmse=rmse(t_test_enc, lr_pred)
lr_rmse

3319.945019855772

So the RMSE of 3319.95 is the value to beat.

## Decision Tree Regressor<a name='step3_2'></a>

For this we will perform hyperparameter tuning (with the max_depth hyperparameter) to find its best value to set it to when actually training. We will choose the hyperparameter which gets the best cross-validation score

In [16]:
#Decision Tree Regressor - hyperparameter tuning
for depth in range(1, 16):#loops through values of max_depth from 1 to 15
    DTR=DecisionTreeRegressor(max_depth=depth, random_state=random_state)
    #creates an instance of Decision Tree Regressor model with the max_depth value and random state of 12345
    DTR_score=cross_val_score(DTR, f_train_enc, t_train_enc, scoring=scorer, cv=5)
    #gets cross-validation score
    print('Max_depth', depth, 'score:', DTR_score.mean())
    #prints max_depth value and the mean cross-validation score

Max_depth 1 score: -3597.5212621009646
Max_depth 2 score: -3148.7614173370557
Max_depth 3 score: -2791.439807061578
Max_depth 4 score: -2541.7368659693166
Max_depth 5 score: -2372.5274016078033
Max_depth 6 score: -2263.1348897390744
Max_depth 7 score: -2164.109806596397
Max_depth 8 score: -2081.064381108975
Max_depth 9 score: -2009.1902263310585
Max_depth 10 score: -1951.205002133464
Max_depth 11 score: -1913.3661189015081
Max_depth 12 score: -1889.1937156659264
Max_depth 13 score: -1882.4974472277943
Max_depth 14 score: -1889.0048759830315
Max_depth 15 score: -1911.4468913132223
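
Rather than reading the best depth off the printed list, the mean scores could be collected in a dictionary and the maximum picked programmatically. A minimal sketch on synthetic data (the `make_regression` data, the smaller depth range, and `rmse_sketch` are assumptions standing in for the notebook's encoded training set and scorer):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

def rmse_sketch(y_true, y_pred):
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

neg_rmse = make_scorer(rmse_sketch, greater_is_better=False)

# synthetic stand-in for the encoded training data
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=12345)

mean_scores = {}
for depth in range(1, 8):
    tree = DecisionTreeRegressor(max_depth=depth, random_state=12345)
    mean_scores[depth] = cross_val_score(tree, X, y, scoring=neg_rmse, cv=5).mean()

# scores are negated RMSEs, so the largest score is the smallest error
best_depth = max(mean_scores, key=mean_scores.get)
print(best_depth, -mean_scores[best_depth])
```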


So the best score was obtained with max_depth=13. We will use that value to train our model and get predictions, timing both steps.

In [17]:
#DTR model training
DTR=DecisionTreeRegressor(max_depth=13, random_state=random_state)
%time DTR.fit(f_train_enc, t_train_enc)

CPU times: user 1.07 s, sys: 28 µs, total: 1.07 s
Wall time: 1.09 s


DecisionTreeRegressor(criterion='mse', max_depth=13, max_features=None,
                      max_leaf_nodes=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      presort=False, random_state=12345, splitter='best')

In [18]:
#DTR model prediction
%time DTR_pred=DTR.predict(f_test_enc)

CPU times: user 13.7 ms, sys: 0 ns, total: 13.7 ms
Wall time: 12.5 ms


RMSE calculation:

In [19]:
#DTR rmse
DTR_rmse=rmse(t_test_enc, DTR_pred)
DTR_rmse

1855.542228750032

The RMSE is 1855.54 which is already much better than our baseline

## Random Forest Regressor<a name='step3_3'></a>

Here we tune two hyperparameters: max_depth and n_estimators. The best model we found had n_estimators=70 and max_depth=24. The cell below shows the max_depth search at n_estimators=70.

In [20]:
#Random Forest Regressor - hyperparameter tuning
for depth in range(20, 26):
    RFR=RandomForestRegressor(n_estimators=70, max_depth=depth, random_state=random_state)
    RFR_score=cross_val_score(RFR, f_train_enc, t_train_enc, scoring=scorer, cv=5)
    print('Max_depth', depth, 'score:', RFR_score.mean())

Max_depth 20 score: -1551.1765036303507
Max_depth 21 score: -1549.600091494261
Max_depth 22 score: -1548.4409216471856
Max_depth 23 score: -1547.4262950780078
Max_depth 24 score: -1547.294733299153
Max_depth 25 score: -1547.4023074095205
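
Since two hyperparameters are involved, both could also be searched jointly with `GridSearchCV`, which the notebook already imports. The following is only a sketch on synthetic data with deliberately small, illustrative grid values; it is not the search that produced the numbers above.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# synthetic stand-in for the encoded training data
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=12345)

param_grid = {'n_estimators': [10, 20], 'max_depth': [3, 6]}  # illustrative values only
search = GridSearchCV(
    RandomForestRegressor(random_state=12345),
    param_grid=param_grid,
    scoring='neg_mean_squared_error',  # built-in scorer; negated MSE
    cv=3,
)
search.fit(X, y)

# square root of the mean fold MSE; close to, but not identical to, the mean fold RMSE
best_rmse = (-search.best_score_) ** 0.5
print(search.best_params_, best_rmse)
```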


So let us train and test our model, with the processes being timed

In [21]:
#RFR model training
RFR=RandomForestRegressor(n_estimators=70, max_depth=24, random_state=random_state)
%time RFR.fit(f_train_enc, t_train_enc)

CPU times: user 1min 11s, sys: 524 ms, total: 1min 12s
Wall time: 1min 13s


RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=24,
                      max_features='auto', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=70,
                      n_jobs=None, oob_score=False, random_state=12345,
                      verbose=0, warm_start=False)

In [22]:
#RFR model predictions
%time RFR_pred=RFR.predict(f_test_enc)

CPU times: user 1.86 s, sys: 26 µs, total: 1.86 s
Wall time: 1.98 s


RMSE calculation:

In [23]:
RFR_rmse=rmse(t_test_enc, RFR_pred)
RFR_rmse

1527.8930336833093

An RMSE of 1527.89, much better than our baseline.

## CatBoost Regressor<a name='step3_4'></a>

For our CatBoost regressor (which uses gradient boosting), we don't need prior encoding. We will be tuning using different hyperparameters this time. We will find the best parameters using GridSearchCV.

In [24]:
#CatBoost - hyperparameter tuning
CBR=CatBoostRegressor()#instance of CatBoost Regressor
parameters={'depth': [6, 8, 10],
           'learning_rate': [0.5, 0.1],
           'l2_leaf_reg': [2, 4],
           'iterations': [10, 50],
           'loss_function': ['RMSE'],
           'random_seed': [random_state]}
#our dictionary of hyperparameters that will be looped through when we feed them to GridSearch
grid=GridSearchCV(estimator=CBR, param_grid=parameters, scoring=scorer, cv=3, n_jobs=-1, verbose=0)
#loops through parameters to help us get the best hyperparameters for model training
grid.fit(f_train_gbm, t_train_gbm, cat_features=cat_feat)
#fits our training unencoded data into our grid instance
best_param=grid.best_params_ #gets the best set of parameters for our model

0:	learn: 3243.2208729	total: 346ms	remaining: 3.12s
1:	learn: 2550.6756245	total: 646ms	remaining: 2.58s
2:	learn: 2245.5139470	total: 942ms	remaining: 2.2s
3:	learn: 2093.2119517	total: 1.24s	remaining: 1.85s
4:	learn: 1999.0158196	total: 1.44s	remaining: 1.44s
5:	learn: 1952.1274484	total: 1.74s	remaining: 1.16s
6:	learn: 1923.2631983	total: 2.03s	remaining: 871ms
7:	learn: 1900.1347922	total: 2.24s	remaining: 560ms
8:	learn: 1878.0977223	total: 2.53s	remaining: 281ms
9:	learn: 1866.6032357	total: 2.74s	remaining: 0us
0:	learn: 3198.9719744	total: 272ms	remaining: 2.44s
1:	learn: 2545.9847100	total: 572ms	remaining: 2.29s
2:	learn: 2235.5128702	total: 872ms	remaining: 2.03s
3:	learn: 2087.9789974	total: 1.17s	remaining: 1.75s
4:	learn: 2016.7188869	total: 1.38s	remaining: 1.38s
5:	learn: 1963.7703900	total: 1.67s	remaining: 1.11s
6:	learn: 1930.9555137	total: 1.87s	remaining: 803ms
7:	learn: 1907.7143636	total: 2.17s	remaining: 542ms
8:	learn: 1883.7370442	total: 2.37s	remaining: 26
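
To get a feel for how much work the grid search above does, the number of candidate settings can be counted with scikit-learn's `ParameterGrid`. This small sketch reuses the grid from the cell above and does not train anything.

```python
from sklearn.model_selection import ParameterGrid

random_state = 12345
parameters = {'depth': [6, 8, 10],
              'learning_rate': [0.5, 0.1],
              'l2_leaf_reg': [2, 4],
              'iterations': [10, 50],
              'loss_function': ['RMSE'],
              'random_seed': [random_state]}

n_combinations = len(ParameterGrid(parameters))  # 3 * 2 * 2 * 2 * 1 * 1
n_cv_fits = n_combinations * 3                   # one fit per fold with cv=3
print(n_combinations, n_cv_fits)                 # 24 72
```

With the default `refit=True`, `GridSearchCV` also performs one extra fit of the best setting on the full training set.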

We can print out the best hyperparameter settings for our model

In [25]:
print('Best score across all searched parameters', grid.best_score_)
print('Best parameters:', best_param)

Best score across all searched parameters -1609.1884105589986
Best parameters: {'depth': 10, 'iterations': 50, 'l2_leaf_reg': 4, 'learning_rate': 0.5, 'loss_function': 'RMSE', 'random_seed': 12345}


We can now train our CatBoost regressor using the best hyperparameter settings found and get predictions, timing both steps.

In [26]:
#CBR model training
CBR=CatBoostRegressor(depth=best_param['depth'],
                     iterations=best_param['iterations'],
                     l2_leaf_reg=best_param['l2_leaf_reg'],
                     learning_rate=best_param['learning_rate'],
                     loss_function='RMSE', random_seed=random_state)
%time CBR.fit(f_train_gbm, t_train_gbm, cat_features=cat_feat, verbose=False, plot=False)

CPU times: user 26.8 s, sys: 1.64 s, total: 28.5 s
Wall time: 30 s


<catboost.core.CatBoostRegressor at 0x7f9142611310>

In [27]:
#CBR model predictions
%time CBR_pred=CBR.predict(f_test_gbm)

CPU times: user 280 ms, sys: 8.6 ms, total: 289 ms
Wall time: 307 ms


RMSE calculation:

In [28]:
CBR_rmse=rmse(t_test_gbm, CBR_pred)
CBR_rmse

1590.1999494650702

An RMSE of 1590. Much better than our baseline

## LightGBM Regressor<a name='step3_5'></a>

This model also doesn't require prior encoding. We will tune hyperparameters much as we did for CatBoost, but with a different set of hyperparameters. One thing to note about LightGBM is that categorical features have to be of the 'category' dtype before they are passed to it; it will not accept 'object' columns. We therefore need to convert them in both the training set and the test set.
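
A minimal sketch of the conversion on made-up frames (the `train` and `test` frames are illustrative). It also shows one optional way to make the test column reuse the training categories; that part is an assumption about a reasonable convention, not something the notebook does.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# made-up rows, only for illustration
train = pd.DataFrame({'Brand': ['audi', 'jeep', 'skoda'], 'Power': [190, 163, 69]})
test = pd.DataFrame({'Brand': ['skoda', 'volvo'], 'Power': [75, 120]})

for col in ['Brand']:
    # object -> category on the training side
    train[col] = train[col].astype('category')
    # reuse the training categories on the test side
    train_dtype = CategoricalDtype(categories=train[col].cat.categories)
    test[col] = test[col].astype(train_dtype)  # values unseen in training become NaN

print(train['Brand'].dtype)
print(test['Brand'].tolist())
```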

In [29]:
#Converting all categorical variables to 'category' type
f_train_gbm=f_train_gbm.copy()#explicit copy so the assignments below don't raise SettingWithCopyWarning
for c in cat_feat:
    f_train_gbm[c] = f_train_gbm[c].astype('category')

#LightGBM hyperparameter tuning
model=LGBMRegressor()
parameters={'num_leaves': [10, 20, 30],
           'learning_rate': [0.5, 0.1],
           'n_estimators': [10, 20],
           'random_state': [random_state],
           'objective': ['rmse']}
grid=GridSearchCV(estimator=model, param_grid=parameters, scoring=scorer, cv=3, n_jobs=-1)
grid.fit(f_train_gbm, t_train_gbm)
best_param=grid.best_params_


Let us get our best hyperparameter settings:

In [30]:
print('Best score across all searched parameters', grid.best_score_)
print('Best parameters:', best_param)

Best score across all searched parameters -1681.0490934534325
Best parameters: {'learning_rate': 0.5, 'n_estimators': 20, 'num_leaves': 30, 'objective': 'rmse', 'random_state': 12345}


We will now train our LightGBM model using those hyperparameter settings and test it, timing both steps.

In [31]:
#LightGBM model training
lgbm=LGBMRegressor(learning_rate=0.5,
                  n_estimators=20,
                  num_leaves=30,
                  objective='rmse', random_state=random_state)
%time lgbm.fit(f_train_gbm, t_train_gbm)

CPU times: user 4.36 s, sys: 14.3 ms, total: 4.37 s
Wall time: 4.46 s


LGBMRegressor(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
              importance_type='split', learning_rate=0.5, max_depth=-1,
              min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,
              n_estimators=20, n_jobs=-1, num_leaves=30, objective='rmse',
              random_state=12345, reg_alpha=0.0, reg_lambda=0.0, silent=True,
              subsample=1.0, subsample_for_bin=200000, subsample_freq=0)

In [32]:
#convert test features to 'category' type
f_test_gbm=f_test_gbm.copy()#explicit copy so the assignments below don't raise SettingWithCopyWarning
for c in cat_feat:
    f_test_gbm[c] = f_test_gbm[c].astype('category')

#LightGBM model predictions
%time lgbm_pred=lgbm.predict(f_test_gbm)


CPU times: user 225 ms, sys: 0 ns, total: 225 ms
Wall time: 240 ms


RMSE calculation:

In [33]:
lgbm_rmse=rmse(t_test_gbm, lgbm_pred)
lgbm_rmse

1664.1203426677855

RMSE of 1664.12. Much better than our baseline

# Model analysis<a name='step4'></a>

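We can prepare a table showing the different models with their training and prediction times (in milliseconds) and RMSEs. The timings were entered by hand and do not exactly match the %time outputs shown above (for LightGBM training in particular, 20400 ms here versus a wall time of 4.46 s above).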

In [34]:
index=['LR', 'DTR', 'RFR', 'CB', 'LGBM']
summary=pd.DataFrame(data={'training_time(ms)': [102, 1020, 76000, 30600, 20400],
                             'prediction_time(ms)': [6.73, 12.9, 1790, 265, 181],
                             'RMSE': [3319, 1855, 1527, 1590, 1664]},
                    index=index)
summary

Unnamed: 0,training_time(ms),prediction_time(ms),RMSE
LR,102,6.73,3319
DTR,1020,12.9,1855
RFR,76000,1790.0,1527
CB,30600,265.0,1590
LGBM,20400,181.0,1664
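
The timings in the table were typed in by hand. As an alternative, they could be captured programmatically with the `time` module that is already imported at the top of the notebook. A minimal sketch, where `timed_ms` is a hypothetical helper and `sum` stands in for a model's `fit` or `predict`:

```python
import time

def timed_ms(func, *args, **kwargs):
    """Run func and return (result, elapsed wall time in milliseconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000
    return result, elapsed_ms

# stand-in workload; in the notebook this could be e.g. LR.fit or LR.predict
total, elapsed = timed_ms(sum, range(1_000_000))
print(total, round(elapsed, 2))
```

Storing the returned milliseconds in variables would let the summary table be built from the same run that produced the RMSEs.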


Takeaways:
1. Linear Regression had the best training time (102 ms), while the worst goes to Random Forest (76000 ms).
2. Linear Regression had the best prediction time (6.73 ms), while the worst goes to Random Forest (1790 ms). To be fair, we did train Random Forest with n_estimators=70 and max_depth=24, but Random Forests generally take more time.
3. Random Forest had the best RMSE (1527), while the worst goes to Linear Regression (3319).

# Conclusion<a name='step5'></a>

We have successfully cleaned and prepared the data and used it to train models. Even though the Random Forest model has the lowest RMSE (1527), its cost in training and prediction time is considerable. The CatBoost regressor takes less than half that time and gets an RMSE of 1590, a difference of just 63 Euros. So we recommend the CatBoost model.
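
The trade-off can be checked with a few lines of arithmetic on the summary table values. This is only a sketch; the dictionary names are illustrative.

```python
# values taken from the summary table
rfr = {'train_ms': 76000, 'predict_ms': 1790, 'rmse': 1527}
cb = {'train_ms': 30600, 'predict_ms': 265, 'rmse': 1590}

rmse_gap = cb['rmse'] - rfr['rmse']                   # absolute RMSE difference
rmse_gap_pct = 100 * rmse_gap / rfr['rmse']           # relative to Random Forest
train_ratio = cb['train_ms'] / rfr['train_ms']        # CatBoost / Random Forest
predict_ratio = cb['predict_ms'] / rfr['predict_ms']

print(rmse_gap, round(rmse_gap_pct, 1), round(train_ratio, 2), round(predict_ratio, 2))
```

On these numbers CatBoost gives up roughly 4% in RMSE while training in about 40% of the time and predicting in about 15% of the time.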