# ML Model to determine used car value

Rusty Bargain used car sales service is developing an app to attract new customers. In that app, you can quickly find out the market value of your car. You have access to historical data: technical specifications, trim versions, and prices. You need to build the model to determine the value. 

Rusty Bargain is interested in:

- the quality of the prediction;
- the speed of the prediction;
- the time required for training



**Executive Summary**:

**CatBoost regressor** achieved the best quality among the models that were trained and evaluated. Its performance on the unseen test set:
- the quality of the prediction: RMSE 1578.5 Euro  (a bit better than LightGBM)
- the speed of the prediction: 87.8 ms
- the time required for training: 24.7 s (a bit slower than LightGBM)

## Data preparation

### Loading and looking into the data

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import seaborn as sns

import sklearn.linear_model
import sklearn.metrics
import sklearn.neighbors
import sklearn.preprocessing

from sklearn.model_selection import train_test_split

from IPython.display import display

from sklearn.preprocessing import OrdinalEncoder, LabelEncoder
from sklearn.metrics import mean_squared_error
import warnings
warnings.filterwarnings("ignore")

In [2]:
# load the data into a df
df = pd.read_csv('/datasets/car_data.csv')

In [3]:
# looking at a sample of the data
df.sample(10)

Unnamed: 0,DateCrawled,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Mileage,RegistrationMonth,FuelType,Brand,NotRepaired,DateCreated,NumberOfPictures,PostalCode,LastSeen
9403,13/03/2016 13:42,4750,,2017,manual,65,micra,80000,8,petrol,nissan,,13/03/2016 00:00,0,48167,19/03/2016 12:18
215915,31/03/2016 14:56,475,small,1991,manual,45,polo,150000,11,petrol,volkswagen,,31/03/2016 00:00,0,49090,02/04/2016 10:45
332115,02/04/2016 00:57,6900,,2017,manual,0,sharan,150000,4,gasoline,volkswagen,,01/04/2016 00:00,0,42289,06/04/2016 06:15
186623,20/03/2016 13:54,4999,sedan,2007,auto,105,astra,90000,7,petrol,opel,,20/03/2016 00:00,0,58511,06/04/2016 15:45
184042,04/04/2016 10:47,3680,wagon,2004,auto,115,touran,150000,3,,volkswagen,,04/04/2016 00:00,0,33729,06/04/2016 11:46
284220,21/03/2016 15:58,2200,small,2002,manual,37,lupo,150000,2,petrol,volkswagen,no,21/03/2016 00:00,0,23896,21/03/2016 16:43
332930,27/03/2016 23:37,650,sedan,1998,manual,125,vectra,150000,6,,opel,yes,27/03/2016 00:00,0,4626,06/04/2016 06:17
209162,29/03/2016 19:53,2500,sedan,2008,,0,polo,150000,11,gasoline,volkswagen,no,29/03/2016 00:00,0,26131,29/03/2016 19:53
81981,11/03/2016 21:54,2800,wagon,2005,manual,109,3_reihe,150000,9,gasoline,peugeot,no,11/03/2016 00:00,0,52538,04/04/2016 07:49
306918,02/04/2016 11:39,2990,wagon,2004,manual,120,3_reihe,150000,7,gasoline,peugeot,no,02/04/2016 00:00,0,25436,02/04/2016 11:39


In [4]:
# General info of the data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   DateCrawled        354369 non-null  object
 1   Price              354369 non-null  int64 
 2   VehicleType        316879 non-null  object
 3   RegistrationYear   354369 non-null  int64 
 4   Gearbox            334536 non-null  object
 5   Power              354369 non-null  int64 
 6   Model              334664 non-null  object
 7   Mileage            354369 non-null  int64 
 8   RegistrationMonth  354369 non-null  int64 
 9   FuelType           321474 non-null  object
 10  Brand              354369 non-null  object
 11  NotRepaired        283215 non-null  object
 12  DateCreated        354369 non-null  object
 13  NumberOfPictures   354369 non-null  int64 
 14  PostalCode         354369 non-null  int64 
 15  LastSeen           354369 non-null  object
dtypes: int64(7), object(

In [5]:
# Missing values
df.isna().sum()

DateCrawled              0
Price                    0
VehicleType          37490
RegistrationYear         0
Gearbox              19833
Power                    0
Model                19705
Mileage                  0
RegistrationMonth        0
FuelType             32895
Brand                    0
NotRepaired          71154
DateCreated              0
NumberOfPictures         0
PostalCode               0
LastSeen                 0
dtype: int64

In [6]:
# General statistics of the data
df.describe()

Unnamed: 0,Price,RegistrationYear,Power,Mileage,RegistrationMonth,NumberOfPictures,PostalCode
count,354369.0,354369.0,354369.0,354369.0,354369.0,354369.0,354369.0
mean,4416.656776,2004.234448,110.094337,128211.172535,5.714645,0.0,50508.689087
std,4514.158514,90.227958,189.850405,37905.34153,3.726421,0.0,25783.096248
min,0.0,1000.0,0.0,5000.0,0.0,0.0,1067.0
25%,1050.0,1999.0,69.0,125000.0,3.0,0.0,30165.0
50%,2700.0,2003.0,105.0,150000.0,6.0,0.0,49413.0
75%,6400.0,2008.0,143.0,150000.0,9.0,0.0,71083.0
max,20000.0,9999.0,20000.0,150000.0,12.0,0.0,99998.0


After an initial look at the data, these are the steps we will take in the preparation stage:
1. Adding the vehicle's age at the time of publishing
2. Deleting columns that do not seem relevant to the price
3. Dealing with missing values
4. Dealing with errors in the data (such as Price = 0, RegistrationYear = 1000, or Power = 0)
5. Encoding categorical features and standardizing all features

### Adding vehicle's age 

In [7]:
# Adding the vehicle's age when published
df['DateCreated'] = pd.to_datetime(df['DateCreated'], format='%d/%m/%Y %H:%M')
df['year_published'] = pd.DatetimeIndex(df['DateCreated']).year
df['AgeAtPublish'] = df['year_published'] - df['RegistrationYear']
df.head()

Unnamed: 0,DateCrawled,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Mileage,RegistrationMonth,FuelType,Brand,NotRepaired,DateCreated,NumberOfPictures,PostalCode,LastSeen,year_published,AgeAtPublish
0,24/03/2016 11:52,480,,1993,manual,0,golf,150000,0,petrol,volkswagen,,2016-03-24,0,70435,07/04/2016 03:16,2016,23
1,24/03/2016 10:58,18300,coupe,2011,manual,190,,125000,5,gasoline,audi,yes,2016-03-24,0,66954,07/04/2016 01:46,2016,5
2,14/03/2016 12:52,9800,suv,2004,auto,163,grand,125000,8,gasoline,jeep,,2016-03-14,0,90480,05/04/2016 12:47,2016,12
3,17/03/2016 16:54,1500,small,2001,manual,75,golf,150000,6,petrol,volkswagen,no,2016-03-17,0,91074,17/03/2016 17:40,2016,15
4,31/03/2016 17:25,3600,small,2008,manual,69,fabia,90000,7,gasoline,skoda,no,2016-03-31,0,60437,06/04/2016 10:17,2016,8
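
The age feature can be checked on a few made-up rows. The sketch below assumes the same day-first timestamp format as the dataset and uses the `.dt.year` accessor as an alternative to `pd.DatetimeIndex`; the toy values and the name `toy` are illustrative only. It also shows that a registration year later than the listing year (2017 appears in the sample above) produces a negative age.

```python
import pandas as pd

# Toy listings; the column names mirror the notebook, the values are made up.
toy = pd.DataFrame({
    "DateCreated": ["24/03/2016 00:00", "14/03/2016 00:00", "13/03/2016 00:00"],
    "RegistrationYear": [1993, 2004, 2017],
})

# Parse the day-first timestamps, then subtract the registration year
# from the listing year to get the age at publishing.
created = pd.to_datetime(toy["DateCreated"], format="%d/%m/%Y %H:%M")
toy["AgeAtPublish"] = created.dt.year - toy["RegistrationYear"]

# A registration year later than the listing year yields a negative age.
negative_age = toy[toy["AgeAtPublish"] < 0]
print(toy["AgeAtPublish"].tolist())  # [23, 12, -1]
print(len(negative_age))             # 1
```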


### Deleting less-relevant columns 

In [8]:
# Deleting columns that seem not relevant for impact on price:
df = df.drop(columns=['DateCrawled', 'RegistrationMonth', 'DateCreated', 'NumberOfPictures',
                      'PostalCode', 'LastSeen', 'year_published'])
df.head()

Unnamed: 0,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Mileage,FuelType,Brand,NotRepaired,AgeAtPublish
0,480,,1993,manual,0,golf,150000,petrol,volkswagen,,23
1,18300,coupe,2011,manual,190,,125000,gasoline,audi,yes,5
2,9800,suv,2004,auto,163,grand,125000,gasoline,jeep,,12
3,1500,small,2001,manual,75,golf,150000,petrol,volkswagen,no,15
4,3600,small,2008,manual,69,fabia,90000,gasoline,skoda,no,8


### Dealing with missing values

In [9]:
# Check the share of rows left if we drop rows with missing values, and their number:
print(len(df.dropna()) / len(df))
print(len(df.dropna()))

0.6936667710776055
245814


Dropping the rows with missing values leaves ~70% of the data, almost 250K entries with complete data in all relevant columns, so we will drop them.
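
Before dropping rows, it can help to see which columns cause the loss. The following is a minimal sketch on an invented four-row frame (the name `toy` and its values are assumptions, not data from the notebook): it computes the share of missing values per column and the share of rows that survive a plain `dropna()`.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps; column names follow the notebook, values are invented.
toy = pd.DataFrame({
    "Price": [480, 18300, 9800, 1500],
    "VehicleType": [np.nan, "coupe", "suv", "small"],
    "Model": ["golf", np.nan, "grand", "golf"],
    "NotRepaired": [np.nan, "yes", np.nan, "no"],
})

# Share of missing values per column, largest first.
missing_share = toy.isna().mean().sort_values(ascending=False)

# Share of rows that survive a plain dropna().
kept_share = len(toy.dropna()) / len(toy)

print(missing_share)
print(kept_share)  # 0.25
```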

In [10]:
# Dropping the data with missing values
df = df.dropna()
len(df)

245814

### Dealing with errors in data

In [11]:
# Checking entries with Price = zero
df[df['Price'] == 0] 

Unnamed: 0,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Mileage,FuelType,Brand,NotRepaired,AgeAtPublish
7,0,sedan,1980,manual,50,other,40000,petrol,volkswagen,no,36
152,0,bus,2004,manual,101,meriva,150000,lpg,opel,yes,12
579,0,sedan,1996,manual,170,5er,150000,petrol,bmw,no,20
615,0,sedan,1998,manual,75,polo,150000,petrol,volkswagen,yes,18
859,0,wagon,2009,manual,170,a6,150000,gasoline,audi,yes,7
...,...,...,...,...,...,...,...,...,...,...,...
353882,0,small,1996,manual,45,polo,150000,petrol,volkswagen,yes,20
353943,0,sedan,1999,manual,150,golf,125000,petrol,volkswagen,no,17
353995,0,sedan,1991,manual,133,100,150000,petrol,audi,no,25
354124,0,small,2004,manual,200,golf,150000,petrol,volkswagen,no,12


In [12]:
# Keeping only vehicles with Price > 0
df = df[df['Price'] > 0]
len(df)

242428

In [13]:
# Checking for errors left in registration year
print(len(df[df['RegistrationYear'] < 1900]))
print(len(df[df['RegistrationYear'] > 2018]))

0
0


In [14]:
# Checking for errors left in Power
df[df['Power'] == 0]

Unnamed: 0,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Mileage,FuelType,Brand,NotRepaired,AgeAtPublish
55,550,wagon,1999,manual,0,astra,150000,gasoline,opel,yes,17
70,800,small,1993,manual,0,polo,150000,petrol,volkswagen,no,23
98,4290,bus,2008,manual,0,combo,150000,gasoline,opel,no,8
158,800,sedan,1993,manual,0,golf,10000,petrol,volkswagen,yes,23
166,300,wagon,1998,manual,0,v40,150000,petrol,volvo,no,18
...,...,...,...,...,...,...,...,...,...,...,...
354220,2200,convertible,1988,manual,0,golf,150000,petrol,volkswagen,no,28
354263,1800,coupe,2000,manual,0,clk,150000,petrol,mercedes_benz,no,16
354332,7900,bus,2007,manual,0,b_klasse,125000,petrol,mercedes_benz,no,9
354335,390,small,1997,auto,0,corsa,100000,petrol,opel,yes,19


In [15]:
# Leaving only vehicles with Power > 0
df = df[df['Power'] > 0].reset_index(drop=True)
len(df)

233275

We now have 233,275 entries (each with 10 relevant features and one target) that are free of obvious errors and have no missing values. We can move on to encoding the categorical features.
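
The error filters above can also be collected into one helper. This is a sketch under assumptions: the function name `drop_obvious_errors` is hypothetical, the 1900 to 2018 year range is taken from the checks in this section, and the toy rows are made up.

```python
import pandas as pd

def drop_obvious_errors(frame: pd.DataFrame) -> pd.DataFrame:
    """Keep rows with a positive price, positive power and a plausible year."""
    mask = (
        (frame["Price"] > 0)
        & (frame["Power"] > 0)
        & frame["RegistrationYear"].between(1900, 2018)
    )
    return frame[mask].reset_index(drop=True)

# One row per kind of error, plus one clean row.
toy = pd.DataFrame({
    "Price": [0, 550, 1500, 3600],
    "Power": [50, 0, 75, 69],
    "RegistrationYear": [1980, 1999, 2001, 1000],
})
clean = drop_obvious_errors(toy)
print(len(toy), "->", len(clean))  # 4 -> 1
```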

### Encoding categorical features and standardizing all features

In [16]:
# Turning the two binary columns from object type to 0 and 1:
df['Gearbox'] = df['Gearbox'].replace(['auto', 'manual'], [1, 0])
df['NotRepaired'] = df['NotRepaired'].replace(['yes', 'no'], [1, 0])
df.head()

Unnamed: 0,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Mileage,FuelType,Brand,NotRepaired,AgeAtPublish
0,1500,small,2001,0,75,golf,150000,petrol,volkswagen,0,15
1,3600,small,2008,0,69,fabia,90000,gasoline,skoda,0,8
2,650,sedan,1995,0,102,3er,150000,petrol,bmw,1,21
3,2200,convertible,2004,0,109,2_reihe,150000,petrol,peugeot,0,12
4,2000,sedan,2004,0,105,3_reihe,150000,petrol,mazda,0,12


The 'FuelType' values petrol and gasoline refer to the same fuel, so we will merge them before encoding:

In [17]:
# Renaming 'petrol' as 'gasoline':
df['FuelType'] = df['FuelType'].replace(['petrol'], 'gasoline')
df.head()

Unnamed: 0,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Mileage,FuelType,Brand,NotRepaired,AgeAtPublish
0,1500,small,2001,0,75,golf,150000,gasoline,volkswagen,0,15
1,3600,small,2008,0,69,fabia,90000,gasoline,skoda,0,8
2,650,sedan,1995,0,102,3er,150000,gasoline,bmw,1,21
3,2200,convertible,2004,0,109,2_reihe,150000,gasoline,peugeot,0,12
4,2000,sedan,2004,0,105,3_reihe,150000,gasoline,mazda,0,12


In [18]:
# Checking the values of the categorical features
print(df['VehicleType'].value_counts())
print()
print(df['Model'].value_counts())
print()
print(df['FuelType'].value_counts())
print()
print(df['Brand'].value_counts())

sedan          68961
small          55004
wagon          48783
bus            22563
convertible    15681
coupe          11585
suv             9216
other           1482
Name: VehicleType, dtype: int64

golf                  19382
other                 17225
3er                   14263
polo                   8251
corsa                  7675
                      ...  
elefantino                4
serie_3                   3
samara                    2
range_rover_evoque        2
rangerover                1
Name: Model, Length: 249, dtype: int64

gasoline    228894
lpg           3687
cng            424
hybrid         167
other           54
electric        49
Name: FuelType, dtype: int64

volkswagen       49484
bmw              26224
opel             24514
mercedes_benz    22673
audi             20779
ford             16035
renault          10665
peugeot           7440
fiat              5925
seat              4696
skoda             4262
mazda             3771
toyota            3461
citroen 

Some of the categorical features (especially Model and Brand) have a large number of values that might affect the price but have no clear-cut order. Since we are also interested in the speed of prediction and training, we will avoid one-hot encoding (OHE) and ordinal encoding and use CatBoostEncoder instead, regardless of which prediction models we train later. After the encoding all features are numeric, and we will then standardize them.
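
To give an intuition for target-based encoding of this kind, here is a small sketch of an ordered target statistic: each row is encoded from the targets of the earlier rows in the same category, smoothed toward the global mean. It illustrates the general idea only and is not the implementation in `category_encoders`; the function name, the smoothing parameter `a`, and the toy data are assumptions.

```python
import pandas as pd

def ordered_target_encode(categories: pd.Series, target: pd.Series, a: float = 1.0) -> pd.Series:
    """Encode each row from the targets of the EARLIER rows in its category."""
    prior = target.mean()
    grouped = target.groupby(categories)
    # Sum and count of the target over previous rows of the same category.
    prev_sum = grouped.cumsum() - target
    prev_count = grouped.cumcount()
    # Smooth toward the global mean; a row with no history gets the prior.
    return (prev_sum + prior * a) / (prev_count + a)

brand = pd.Series(["vw", "bmw", "vw", "vw"])
price = pd.Series([1000.0, 5000.0, 3000.0, 2000.0])
encoded = ordered_target_encode(brand, price)
print(encoded.tolist())  # [2750.0, 2750.0, 1875.0, 2250.0]
```

Because a row never sees its own target, this style of encoding limits target leakage on the training set while still producing a single numeric column per categorical feature.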

We will also start checking the time of cell execution. 

In [19]:
%%time
# Encoding the categorical features and standardizing all features, after splitting into train-valid-test sets
!pip install category_encoders
from category_encoders import CatBoostEncoder
from sklearn.preprocessing import StandardScaler

encoder = CatBoostEncoder()
X = df.drop('Price', axis=1)
y = df['Price']

# First split: 80% "seen" data (split later into train and validation) and a 20% test set
X_seen, X_test, y_seen, y_test = train_test_split(X, y, train_size=0.8, random_state=12345)

# Second split of the "seen" data: 75% for training (60% of all data)
# and 25% for validation (20% of all data)
X_train, X_valid, y_train, y_valid = train_test_split(X_seen, y_seen, test_size=0.25, random_state=12345)

X_train = encoder.fit_transform(X_train, y_train)
X_valid = encoder.transform(X_valid)
X_test = encoder.transform(X_test)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_valid = scaler.transform(X_valid)
X_test = scaler.transform(X_test)

CPU times: user 734 ms, sys: 75.6 ms, total: 809 ms
Wall time: 2.92 s
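
The two-stage split used in the cell above can be verified on synthetic data. The sketch below uses separate variable names (suffix `_d`, chosen here so that the notebook's own splits are not overwritten) and confirms that the resulting shares are 60% / 20% / 20%.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_d = np.arange(1000).reshape(-1, 1)
y_d = np.arange(1000)

# 80% "seen" / 20% test, then 75% / 25% of the seen part.
X_seen_d, X_test_d, y_seen_d, y_test_d = train_test_split(
    X_d, y_d, train_size=0.8, random_state=12345)
X_train_d, X_valid_d, y_train_d, y_valid_d = train_test_split(
    X_seen_d, y_seen_d, test_size=0.25, random_state=12345)

shares = [len(part) / len(X_d) for part in (X_train_d, X_valid_d, X_test_d)]
print(shares)  # [0.6, 0.2, 0.2]
```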


We are ready now for model training.

## Model training

In [20]:
# Function to calculate RMSE (for models where it is not the default metric)
def rmse(y, pred_y):
    return mean_squared_error(y, pred_y)**0.5
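
As a worked example of the metric, the sketch below computes RMSE from scratch with NumPy on two made-up prices. The function name `rmse_from_scratch` is hypothetical and is kept separate from the notebook's `rmse`.

```python
import numpy as np

def rmse_from_scratch(y_true, y_pred):
    """Square root of the mean squared difference between targets and predictions."""
    errors = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(errors ** 2)))

# Errors are 2 and -4, so the mean squared error is (4 + 16) / 2 = 10.
value = rmse_from_scratch([3000, 5000], [2998, 5004])
print(round(value, 4))  # 3.1623
```

RMSE is expressed in the same unit as the target, which is why the results below are reported in Euro.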

In [21]:
# Function to train, predict, and evaluate a model:
def model_TPE(X_train, y_train, X_eval, y_eval, model):
    model.fit(X_train, y_train)
    pred_eval = model.predict(X_eval)
    print(model, 'RMSE', rmse(y_eval, pred_eval))

### Dummy model as a baseline

In [22]:
# Dummy model (as a baseline) rmse on the test set
pred_mean = np.ones(y_test.shape) * y_test.mean()
print(rmse(y_test, pred_mean))

4737.931311869403
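
The baseline above takes the mean from the test targets themselves. A variant that learns the mean from the training targets only can be sketched with scikit-learn's `DummyRegressor`; the toy targets and variable names below are assumptions for illustration, not values from the dataset.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

y_train_demo = np.array([1000.0, 2000.0, 3000.0])
y_test_demo = np.array([2000.0, 6000.0])

# The dummy model ignores the features, so placeholder zeros are enough.
baseline = DummyRegressor(strategy="mean")
baseline.fit(np.zeros((len(y_train_demo), 1)), y_train_demo)
pred = baseline.predict(np.zeros((len(y_test_demo), 1)))

baseline_rmse = mean_squared_error(y_test_demo, pred) ** 0.5
print(pred.tolist())            # [2000.0, 2000.0]
print(round(baseline_rmse, 1))  # 2828.4
```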


### Linear regression (for a sanity check)

In [23]:
%%time
from sklearn.linear_model import LinearRegression
model = LinearRegression()
print('On validation set:')
model_TPE(X_train, y_train, X_valid, y_valid, model)

On validation set:
LinearRegression() RMSE 2939.3387479349512
CPU times: user 70.9 ms, sys: 82.9 ms, total: 154 ms
Wall time: 137 ms


### Random forest 

In [24]:
%%time
from sklearn.ensemble import RandomForestRegressor
print('On validation set:')
n_estimators_list = [10, 40, 70]
for n in n_estimators_list:
    model = RandomForestRegressor(n_estimators=n)
    model_TPE(X_train, y_train, X_valid, y_valid, model)
    print()

On validation set:
RandomForestRegressor(n_estimators=10) RMSE 1684.1482836348393

RandomForestRegressor(n_estimators=40) RMSE 1625.0498524087536

RandomForestRegressor(n_estimators=70) RMSE 1609.8192526243854

CPU times: user 1min 52s, sys: 1.09 s, total: 1min 53s
Wall time: 1min 53s


Our best quality here is with 70 trees (RMSE 1609.8). On the validation set, the RFR results are much better than both the dummy model and linear regression.

### LightGBM

In [25]:
%%time
import lightgbm as lgbm
print('On validation set:')
num_leaves_list = [30, 50, 100]
max_depth_list = [0, 5, 8]
n_estimators_list = [10, 50, 80]
for n in num_leaves_list:
    for m in max_depth_list:
        for e in n_estimators_list:
            model = lgbm.LGBMRegressor(num_leaves=n, max_depth=m, n_estimators=e)
            model_TPE(X_train, y_train, X_valid, y_valid, model)
            print()

On validation set:
LGBMRegressor(max_depth=0, n_estimators=10, num_leaves=30) RMSE 2599.028628882116

LGBMRegressor(max_depth=0, n_estimators=50, num_leaves=30) RMSE 1727.8445822412777

LGBMRegressor(max_depth=0, n_estimators=80, num_leaves=30) RMSE 1683.8205428652525

LGBMRegressor(max_depth=5, n_estimators=10, num_leaves=30) RMSE 2621.1969644649757

LGBMRegressor(max_depth=5, n_estimators=50, num_leaves=30) RMSE 1771.0211670958586

LGBMRegressor(max_depth=5, n_estimators=80, num_leaves=30) RMSE 1725.6784043832026

LGBMRegressor(max_depth=8, n_estimators=10, num_leaves=30) RMSE 2599.028628882116

LGBMRegressor(max_depth=8, n_estimators=50, num_leaves=30) RMSE 1737.9102080845805

LGBMRegressor(max_depth=8, n_estimators=80, num_leaves=30) RMSE 1692.321718403323

LGBMRegressor(max_depth=0, n_estimators=10, num_leaves=50) RMSE 2531.5464988105423

LGBMRegressor(max_depth=0, n_estimators=50, num_leaves=50) RMSE 1687.478146224637

LGBMRegressor(max_depth=0, n_estimators=80, num_leaves=50) RM

Our best (lowest) RMSE on the validation set here is 1598, a little better than the RFR model above. We will check it on the test set later. The parameters are: LGBMRegressor(max_depth=0, n_estimators=80, num_leaves=100); in LightGBM a max_depth of 0 or less means no depth limit. Since the whole grid took about three and a half minutes, this single configuration does not seem to have a significant time problem (for now).
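
Reading the best configuration out of a long printed log is error-prone. A minimal sketch of the same grid loop that collects the scores in a table and sorts them is shown below. To keep it self-contained it swaps in scikit-learn's `GradientBoostingRegressor` for LightGBM and runs on synthetic data, so the grid values and all names are assumptions.

```python
from itertools import product

import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the car features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 3))
y_demo = 3 * X_demo[:, 0] - 2 * X_demo[:, 1] + rng.normal(scale=0.1, size=400)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

rows = []
for depth, n_trees in product([2, 3], [10, 50]):
    gbr = GradientBoostingRegressor(max_depth=depth, n_estimators=n_trees, random_state=0)
    gbr.fit(X_tr, y_tr)
    score = mean_squared_error(y_va, gbr.predict(X_va)) ** 0.5
    rows.append({"max_depth": depth, "n_estimators": n_trees, "rmse": score})

# Lowest validation RMSE first.
results = pd.DataFrame(rows).sort_values("rmse").reset_index(drop=True)
best = results.iloc[0]
print(results)
```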

### CatBoost

RMSE is the default metric of CatBoostRegressor, so we will not use our function from above and will rely on CatBoost's own evaluation on the validation set (eval_set). First we will look at the progress in steps of 50 iterations; in the next step we will tune hyperparameters.

In [26]:
%%time

from catboost import CatBoostRegressor

model = CatBoostRegressor(iterations=500,
                          metric_period=50, random_state=12)
model.fit(X_train, y_train, 
          eval_set=(X_valid, y_valid))

Learning rate set to 0.16883
0:	learn: 4186.5372282	test: 4186.6472411	best: 4186.6472411 (0)	total: 74.7ms	remaining: 37.3s
50:	learn: 1793.6431062	test: 1786.4421273	best: 1786.4421273 (50)	total: 1.01s	remaining: 8.93s
100:	learn: 1722.8088529	test: 1718.8226401	best: 1718.8226401 (100)	total: 2.03s	remaining: 8.04s
150:	learn: 1682.3614232	test: 1679.6664926	best: 1679.6664926 (150)	total: 3.04s	remaining: 7.03s
200:	learn: 1652.0595726	test: 1656.4813009	best: 1656.4813009 (200)	total: 4s	remaining: 5.96s
250:	learn: 1629.4528608	test: 1643.1101196	best: 1643.1101196 (250)	total: 4.94s	remaining: 4.9s
300:	learn: 1608.4677937	test: 1630.5671552	best: 1630.5671552 (300)	total: 5.88s	remaining: 3.89s
350:	learn: 1592.7342180	test: 1624.0694705	best: 1624.0694705 (350)	total: 6.83s	remaining: 2.9s
400:	learn: 1577.4243718	test: 1616.0387432	best: 1616.0387432 (400)	total: 7.82s	remaining: 1.93s
450:	learn: 1565.2674585	test: 1610.7638409	best: 1610.7638409 (450)	total: 8.76s	remainin

<catboost.core.CatBoostRegressor at 0x7fd9ae6d16a0>

We will now tune a few hyperparameters:

In [27]:
%%time
depth_list = [6, 8, 10]
iterations_list = [10, 100, 500]
for d in depth_list:
    for i in iterations_list:
        print(i, 'iterations,', d, 'tree depth')
        model = CatBoostRegressor(depth=d, iterations=i,
                          metric_period=int(i/10), random_state=12)
        model.fit(X_train, y_train, 
          eval_set=(X_valid, y_valid))

10 iterations, 6 tree depth
Learning rate set to 0.5
0:	learn: 3263.2806709	test: 3262.5034161	best: 3262.5034161 (0)	total: 25.2ms	remaining: 227ms
1:	learn: 2569.8559597	test: 2558.3316693	best: 2558.3316693 (1)	total: 45.9ms	remaining: 184ms
2:	learn: 2267.1193352	test: 2255.5418832	best: 2255.5418832 (2)	total: 65.2ms	remaining: 152ms
3:	learn: 2123.5136523	test: 2113.5759289	best: 2113.5759289 (3)	total: 83.5ms	remaining: 125ms
4:	learn: 2051.6821146	test: 2040.4105399	best: 2040.4105399 (4)	total: 102ms	remaining: 102ms
5:	learn: 2005.2492269	test: 1995.8160999	best: 1995.8160999 (5)	total: 120ms	remaining: 80ms
6:	learn: 1975.0541635	test: 1963.0063505	best: 1963.0063505 (6)	total: 139ms	remaining: 59.4ms
7:	learn: 1947.3790973	test: 1931.8267980	best: 1931.8267980 (7)	total: 157ms	remaining: 39.1ms
8:	learn: 1932.4319453	test: 1916.4701758	best: 1916.4701758 (8)	total: 174ms	remaining: 19.4ms
9:	learn: 1912.6529520	test: 1896.9260001	best: 1896.9260001 (9)	total: 194ms	remainin

Our best RMSE on the validation set is 1554, a bit better than with LightGBM. The chosen parameters are 500 iterations and a tree depth of 10. Training time for this option was less than a minute.

## Model analysis

We will now evaluate the best model of each kind on the test set, for both quality (RMSE) and speed. First we train each chosen model (with its tuned parameters) on the whole "seen" dataset, that is, the train and validation sets together. This adds another 33% to our initial training data and lets us measure training time as well as prediction time. Before that, the encoder and the scaler are refit ('fit_transform') on this whole seen data.

To enable the time measurements, the final training and the prediction of each model are done in separate code cells.
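
As an alternative to timing separate cells, the procedure described above can be sketched as one helper that refits on the merged train and validation data and times fitting and prediction separately. The helper name `fit_and_time`, the synthetic data, and the `_demo` variables are assumptions; only the scaler is refit here, and the encoder step would follow the same pattern.

```python
import time

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler

def fit_and_time(estimator, X_fit, y_fit, X_eval, y_eval):
    """Return training time, prediction time (seconds) and RMSE on the evaluation set."""
    start = time.perf_counter()
    estimator.fit(X_fit, y_fit)
    train_s = time.perf_counter() - start

    start = time.perf_counter()
    pred = estimator.predict(X_eval)
    predict_s = time.perf_counter() - start

    return {"train_s": train_s, "predict_s": predict_s,
            "rmse": mean_squared_error(y_eval, pred) ** 0.5}

# Synthetic train / validation / test parts with an exactly linear target.
rng = np.random.default_rng(0)
X_tr, X_va, X_te = (rng.normal(size=(n, 2)) for n in (600, 200, 200))
y_tr, y_va, y_te = (X @ np.array([2.0, -1.0]) for X in (X_tr, X_va, X_te))

# Merge train and validation into the "seen" set and refit the scaler on it.
X_seen_demo = np.vstack([X_tr, X_va])
y_seen_demo = np.concatenate([y_tr, y_va])
scaler_demo = StandardScaler().fit(X_seen_demo)

report = fit_and_time(LinearRegression(),
                      scaler_demo.transform(X_seen_demo), y_seen_demo,
                      scaler_demo.transform(X_te), y_te)
print(report)
```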

In [28]:
%%time
# final training on train data
model = LinearRegression()
model.fit(X_train, y_train)

CPU times: user 54.7 ms, sys: 55 µs, total: 54.7 ms
Wall time: 33.7 ms


LinearRegression()

In [29]:
%%time
# Prediction (and evaluation) on the unseen test set
y_pred = model.predict(X_test)
rmse(y_test, y_pred)

CPU times: user 12.5 ms, sys: 84 µs, total: 12.6 ms
Wall time: 2.38 ms


2913.800990448321

**Linear regression** performance on the unseen test set: 
- the quality of the prediction: RMSE 2913.8 Euro (much better than the dummy model, see "Dummy model as a baseline" above)
- the speed of the prediction: 2.38 ms
- the time required for training: 33.7 ms

### Random forest 

In [30]:
%%time
# final training on train data
model = RandomForestRegressor(n_estimators=70)
model.fit(X_train, y_train)

CPU times: user 1min 4s, sys: 684 ms, total: 1min 5s
Wall time: 1min 5s


RandomForestRegressor(n_estimators=70)

In [31]:
%%time
# Prediction (and evaluation) on the unseen test set
y_pred = model.predict(X_test)
rmse(y_test, y_pred)

CPU times: user 1.39 s, sys: 7.83 ms, total: 1.4 s
Wall time: 1.39 s


1628.4172446783853

**Random forest regressor** performance on the unseen test set: 
- the quality of the prediction: RMSE 1628.4 Euro (much better than linear regression) 
- the speed of the prediction: 1.39 s 
- the time required for training: 1min 5s (much slower than linear regression)

### LightGBM 

In [32]:
%%time
# final training on train data
model = lgbm.LGBMRegressor(max_depth=0, n_estimators=80, num_leaves=100)
model.fit(X_train, y_train)

CPU times: user 5.3 s, sys: 20.3 ms, total: 5.32 s
Wall time: 5.34 s


LGBMRegressor(max_depth=0, n_estimators=80, num_leaves=100)

In [33]:
%%time
# Prediction (and evaluation) on the unseen test set
y_pred = model.predict(X_test)
rmse(y_test, y_pred)

CPU times: user 384 ms, sys: 7.62 ms, total: 392 ms
Wall time: 361 ms


1617.811593940994

**Light GBM regressor** performance on the unseen test set: 
- the quality of the prediction: RMSE 1617.8 Euro  (a bit better than Random Forest)
- the speed of the prediction: 361 ms
- the time required for training: 5.34 s (much faster than Random Forest)

### CatBoost 

In [34]:
%%time
# final training on train data
model = CatBoostRegressor(depth=10, iterations=500,
                          metric_period=50, random_state=12)
model.fit(X_train, y_train)

Learning rate set to 0.157032
0:	learn: 4182.5025972	total: 50.2ms	remaining: 25s
50:	learn: 1685.7652000	total: 2.45s	remaining: 21.6s
100:	learn: 1596.5844314	total: 4.76s	remaining: 18.8s
150:	learn: 1532.4975332	total: 7.12s	remaining: 16.5s
200:	learn: 1492.7976608	total: 9.55s	remaining: 14.2s
250:	learn: 1452.2930468	total: 12.2s	remaining: 12.1s
300:	learn: 1422.1037104	total: 14.5s	remaining: 9.59s
350:	learn: 1395.0213813	total: 16.9s	remaining: 7.17s
400:	learn: 1369.2508786	total: 19.3s	remaining: 4.78s
450:	learn: 1347.0320761	total: 21.8s	remaining: 2.37s
499:	learn: 1327.3413622	total: 24.1s	remaining: 0us
CPU times: user 24 s, sys: 138 ms, total: 24.2 s
Wall time: 24.7 s


<catboost.core.CatBoostRegressor at 0x7fd9ae75fdc0>

In [35]:
%%time
# Prediction (and evaluation) on the unseen test set
y_pred = model.predict(X_test)
rmse(y_test, y_pred)

CPU times: user 85.6 ms, sys: 4.08 ms, total: 89.6 ms
Wall time: 87.8 ms


1578.5209569917765

**CatBoost regressor** performance on the unseen test set: 
- the quality of the prediction: RMSE 1578.5 Euro  (a bit better than LightGBM)
- the speed of the prediction: 87.8 ms
- the time required for training: 24.7 s (a bit slower than LightGBM)

## Conclusion

Our best quality (RMSE 1578.5 Euro) was achieved with a **CatBoost regressor**, with fast prediction (87.8 ms) and a bit less than half a minute (24.7 s) of training.

**LightGBM regressor** achieved slightly lower quality (RMSE 1617.8 Euro), with faster training (~5 seconds) but slower prediction (361 ms).

**Random Forest regressor** achieved almost the same quality (RMSE 1628.4 Euro), but with much slower training (more than a minute) and slower prediction (1.39 s).
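
For a side-by-side view, the test-set figures reported in the Model analysis section can be gathered into one table. The numbers below are copied from the cell outputs above (wall times converted to seconds); the column names are chosen here for illustration.

```python
import pandas as pd

# Test-set figures copied from the cells above; times are wall times in seconds.
summary = pd.DataFrame(
    [
        ("LinearRegression", 2913.8, 0.0337, 0.00238),
        ("RandomForestRegressor", 1628.4, 65.0, 1.39),
        ("LGBMRegressor", 1617.8, 5.34, 0.361),
        ("CatBoostRegressor", 1578.5, 24.7, 0.0878),
    ],
    columns=["model", "rmse_eur", "train_s", "predict_s"],
).sort_values("rmse_eur").reset_index(drop=True)

best_model = summary.loc[0, "model"]
print(summary)
print(best_model)  # CatBoostRegressor
```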