Rusty Bargain, a used car sales service, is developing an app to attract new customers. The app lets users quickly find out the market value of their car. You have access to historical data: technical specifications, trim versions, and prices. You need to build a model that determines the value.

Rusty Bargain is interested in:

- the quality of the prediction;
- the speed of the prediction;
- the time required for training.

# 1. Data preparation

The dataset is stored in the file /datasets/car_data.csv.

Features:

DateCrawled — date profile was downloaded from the database

VehicleType — vehicle body type

RegistrationYear — vehicle registration year

Gearbox — gearbox type

Power — power (hp)

Model — vehicle model

Mileage — mileage (measured in km due to dataset's regional specifics)

RegistrationMonth — vehicle registration month

FuelType — fuel type

Brand — vehicle brand

NotRepaired — vehicle repaired or not

DateCreated — date of profile creation

NumberOfPictures — number of vehicle pictures

PostalCode — postal code of profile owner (user)

LastSeen — date of the last activity of the user

Target:

Price

In [1]:
import pandas as pd
import numpy as np
from datetime import datetime
import lightgbm
import catboost
import joblib
import time
import timeit

In [2]:
!pip install scikit-learn --upgrade

Defaulting to user installation because normal site-packages is not writeable
Requirement already up-to-date: scikit-learn in /home/jovyan/.local/lib/python3.7/site-packages (0.24.0)


In [3]:
df = pd.read_csv('/datasets/car_data.csv')
df.head()

Unnamed: 0,DateCrawled,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Mileage,RegistrationMonth,FuelType,Brand,NotRepaired,DateCreated,NumberOfPictures,PostalCode,LastSeen
0,24/03/2016 11:52,480,,1993,manual,0,golf,150000,0,petrol,volkswagen,,24/03/2016 00:00,0,70435,07/04/2016 03:16
1,24/03/2016 10:58,18300,coupe,2011,manual,190,,125000,5,gasoline,audi,yes,24/03/2016 00:00,0,66954,07/04/2016 01:46
2,14/03/2016 12:52,9800,suv,2004,auto,163,grand,125000,8,gasoline,jeep,,14/03/2016 00:00,0,90480,05/04/2016 12:47
3,17/03/2016 16:54,1500,small,2001,manual,75,golf,150000,6,petrol,volkswagen,no,17/03/2016 00:00,0,91074,17/03/2016 17:40
4,31/03/2016 17:25,3600,small,2008,manual,69,fabia,90000,7,gasoline,skoda,no,31/03/2016 00:00,0,60437,06/04/2016 10:17


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
DateCrawled          354369 non-null object
Price                354369 non-null int64
VehicleType          316879 non-null object
RegistrationYear     354369 non-null int64
Gearbox              334536 non-null object
Power                354369 non-null int64
Model                334664 non-null object
Mileage              354369 non-null int64
RegistrationMonth    354369 non-null int64
FuelType             321474 non-null object
Brand                354369 non-null object
NotRepaired          283215 non-null object
DateCreated          354369 non-null object
NumberOfPictures     354369 non-null int64
PostalCode           354369 non-null int64
LastSeen             354369 non-null object
dtypes: int64(7), object(9)
memory usage: 43.3+ MB


In [5]:
df.describe()

Unnamed: 0,Price,RegistrationYear,Power,Mileage,RegistrationMonth,NumberOfPictures,PostalCode
count,354369.0,354369.0,354369.0,354369.0,354369.0,354369.0,354369.0
mean,4416.656776,2004.234448,110.094337,128211.172535,5.714645,0.0,50508.689087
std,4514.158514,90.227958,189.850405,37905.34153,3.726421,0.0,25783.096248
min,0.0,1000.0,0.0,5000.0,0.0,0.0,1067.0
25%,1050.0,1999.0,69.0,125000.0,3.0,0.0,30165.0
50%,2700.0,2003.0,105.0,150000.0,6.0,0.0,49413.0
75%,6400.0,2008.0,143.0,150000.0,9.0,0.0,71083.0
max,20000.0,9999.0,20000.0,150000.0,12.0,0.0,99998.0


RegistrationYear has a maximum of 9999 and a minimum of 1000, neither of which can be a real registration year.

Price has a minimum of 0, so some listings have no asking price.

RegistrationMonth has a minimum of 0, which is not a valid month.

NumberOfPictures appears to contain only zeros. If so, it carries no information and can be safely removed.
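The observations above can be turned into a small, repeatable check. The sketch below counts rows that hit each suspicious-value rule on a toy frame. The helper name `count_suspicious` and the lower year bound of 1900 are assumptions made for illustration; the notebook itself only filters years above 2019.

```python
import pandas as pd

def count_suspicious(frame, min_year=1900, max_year=2019):
    """Count rows matching each suspicious-value rule.

    min_year is an assumed lower bound; max_year mirrors the notebook's filter.
    """
    year_ok = frame["RegistrationYear"].between(min_year, max_year)
    return {
        "zero_price": int((frame["Price"] == 0).sum()),
        "zero_month": int((frame["RegistrationMonth"] == 0).sum()),
        "year_out_of_range": int((~year_ok).sum()),
        "zero_power": int((frame["Power"] == 0).sum()),
    }

# Toy rows that mimic the anomalies seen in df.describe()
toy = pd.DataFrame({
    "Price": [480, 0, 9800],
    "RegistrationYear": [1993, 9999, 1000],
    "RegistrationMonth": [0, 5, 8],
    "Power": [0, 190, 163],
})

report = count_suspicious(toy)
print(report)
```

Running the same helper on the full frame before and after cleaning gives a quick view of how many rows each rule affects.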

In [6]:
df.VehicleType.value_counts()

sedan          91457
small          79831
wagon          65166
bus            28775
convertible    20203
coupe          16163
suv            11996
other           3288
Name: VehicleType, dtype: int64

In [7]:
df.Gearbox.value_counts()

manual    268251
auto       66285
Name: Gearbox, dtype: int64

In [8]:
df.Mileage.value_counts()

150000    238209
125000     36454
100000     14882
90000      11567
80000      10047
70000       8593
60000       7444
5000        6397
50000       6232
40000       4911
30000       4436
20000       3975
10000       1222
Name: Mileage, dtype: int64

In [9]:
df.Model.value_counts()

golf                  29232
other                 24421
3er                   19761
polo                  13066
corsa                 12570
                      ...  
i3                        8
serie_3                   4
rangerover                4
serie_1                   2
range_rover_evoque        2
Name: Model, Length: 250, dtype: int64

In [10]:
df.RegistrationMonth.value_counts()

0     37352
3     34373
6     31508
4     29270
5     29153
7     27213
10    26099
12    24289
11    24186
9     23813
1     23219
8     22627
2     21267
Name: RegistrationMonth, dtype: int64

In [11]:
df.FuelType.value_counts()

petrol      216352
gasoline     98720
lpg           5310
cng            565
hybrid         233
other          204
electric        90
Name: FuelType, dtype: int64

In [12]:
print(len(df.Brand.value_counts()))
df.Brand.value_counts()

40


volkswagen        77013
opel              39931
bmw               36914
mercedes_benz     32046
audi              29456
ford              25179
renault           17927
peugeot           10998
fiat               9643
seat               6907
mazda              5615
skoda              5500
smart              5246
citroen            5148
nissan             4941
toyota             4606
hyundai            3587
sonstige_autos     3374
volvo              3210
mini               3202
mitsubishi         3022
honda              2817
kia                2465
suzuki             2323
alfa_romeo         2314
chevrolet          1754
chrysler           1439
dacia               900
daihatsu            806
subaru              762
porsche             758
jeep                677
trabant             589
land_rover          545
daewoo              542
saab                526
jaguar              505
rover               486
lancia              471
lada                225
Name: Brand, dtype: int64

In [13]:
df.NotRepaired.value_counts()

no     247161
yes     36054
Name: NotRepaired, dtype: int64

In [14]:
df.NumberOfPictures.value_counts()

0    354369
Name: NumberOfPictures, dtype: int64

In [15]:
df[df.duplicated()]

Unnamed: 0,DateCrawled,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Mileage,RegistrationMonth,FuelType,Brand,NotRepaired,DateCreated,NumberOfPictures,PostalCode,LastSeen
14266,21/03/2016 19:06,5999,small,2009,manual,80,polo,125000,5,petrol,volkswagen,no,21/03/2016 00:00,0,65529,05/04/2016 20:47
27568,23/03/2016 10:38,12200,bus,2011,manual,125,zafira,40000,10,gasoline,opel,no,23/03/2016 00:00,0,26629,05/04/2016 07:44
31599,03/04/2016 20:41,4950,wagon,2003,auto,170,e_klasse,150000,4,gasoline,mercedes_benz,no,03/04/2016 00:00,0,48432,05/04/2016 21:17
33138,07/03/2016 20:45,10900,convertible,2005,auto,163,clk,125000,5,petrol,mercedes_benz,no,07/03/2016 00:00,0,61200,21/03/2016 03:45
43656,13/03/2016 20:48,4200,sedan,2003,manual,105,golf,150000,10,gasoline,volkswagen,no,13/03/2016 00:00,0,14482,13/03/2016 20:48
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
349709,03/04/2016 20:52,700,small,1999,manual,60,ibiza,150000,12,petrol,seat,yes,03/04/2016 00:00,0,6268,05/04/2016 21:47
351555,26/03/2016 16:54,3150,bus,2003,manual,86,transit,150000,11,gasoline,ford,no,26/03/2016 00:00,0,96148,02/04/2016 07:47
352384,15/03/2016 21:54,5900,wagon,2006,manual,129,3er,150000,12,petrol,bmw,no,15/03/2016 00:00,0,92526,20/03/2016 21:17
353057,05/03/2016 14:16,9500,small,2013,manual,105,ibiza,40000,5,petrol,seat,no,04/03/2016 00:00,0,61381,05/04/2016 19:18


In [16]:
# share of rows that have at least one missing value
sum(df.isna().any(axis=1)*1)/len(df)

0.30633322892239445

Conclusion:

VehicleType, Gearbox, Model, FuelType, and NotRepaired have missing values. About 30% of rows have at least one missing value, so dropping them would discard too much data even with a dataset this large. Instead, we will fill missing entries with a placeholder value and treat it as a separate category.

DateCrawled, DateCreated, and LastSeen can be converted to datetime format. The models cannot process raw dates, so we will replace each date with the number of days between it and the latest date in the dataset.

RegistrationYear has a maximum of 9999, which is not a valid year.

Price contains 0 values, so some listings have no asking price.

RegistrationMonth cannot be 0. The value 0 is probably a placeholder for a missing or invalid month.

NumberOfPictures contains only zeros, so it carries no information and can be safely removed.

Model has 250 distinct values, so one-hot encoding it directly would create about 250 additional columns. We will drop this column.

Brand has 40 unique values, so one-hot encoding it will create at least 39 columns.

There are 262 duplicate records, which we can drop.

The remaining categorical columns (VehicleType, Gearbox, FuelType, Brand, NotRepaired) need to be one-hot encoded.
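As an alternative to discarding a high-cardinality column such as Model, rare levels can be merged into a single bucket before encoding. The following is a minimal sketch of that idea on toy data, not what this notebook does. The helper `collapse_rare` and the `min_count` threshold are hypothetical names chosen for illustration.

```python
import pandas as pd

def collapse_rare(series, min_count):
    """Replace levels seen fewer than min_count times with 'other'.

    Missing values are left untouched so they can be handled separately.
    """
    counts = series.value_counts()          # NaN is excluded from the counts
    rare_levels = counts[counts < min_count].index
    return series.where(~series.isin(rare_levels), "other")

models = pd.Series(["golf", "golf", "golf", "polo", "polo", "i3", None])
collapsed = collapse_rare(models, min_count=2)

print(collapsed.tolist())
print("distinct levels:", collapsed.nunique())
```

The dataset already contains an `other` level in Model, so merging rare levels into it keeps the column's vocabulary consistent. The threshold trades column count against detail and would need tuning.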

<h3>DATA PROCESSING:</h3>

In [17]:
# drop duplicate records
df = df.drop_duplicates()

In [18]:
pd.to_datetime(df['DateCrawled'], format='%d/%m/%Y %H:%M')

0        2016-03-24 11:52:00
1        2016-03-24 10:58:00
2        2016-03-14 12:52:00
3        2016-03-17 16:54:00
4        2016-03-31 17:25:00
                 ...        
354364   2016-03-21 09:50:00
354365   2016-03-14 17:48:00
354366   2016-03-05 19:56:00
354367   2016-03-19 18:57:00
354368   2016-03-20 19:41:00
Name: DateCrawled, Length: 354107, dtype: datetime64[ns]

In [19]:
df['DateCrawled'] = pd.to_datetime(df['DateCrawled'], format='%d/%m/%Y %H:%M')
df['DateCreated'] = pd.to_datetime(df['DateCreated'], format='%d/%m/%Y %H:%M')
df['LastSeen'] = pd.to_datetime(df['LastSeen'], format='%d/%m/%Y %H:%M')
print(df['DateCrawled'].dt.year.value_counts())
print()
print(df['DateCreated'].dt.year.value_counts())
print()
print(df['LastSeen'].dt.year.value_counts())

2016    354107
Name: DateCrawled, dtype: int64

2016    354081
2015        25
2014         1
Name: DateCreated, dtype: int64

2016    354107
Name: LastSeen, dtype: int64


In [20]:
latest_date = max([df['DateCrawled'].max(), df['DateCreated'].max(), df['LastSeen'].max()])

In [21]:
latest_dates = pd.Series(latest_date, index=df.index)

In [22]:
df['DayCrawled']=(latest_dates-df['DateCrawled']).dt.days
df['DayCreated']=(latest_dates-df['DateCreated']).dt.days
df['DayLastSeen']=(latest_dates-df['LastSeen']).dt.days

In [23]:
# i. Drop Columns
p_df = df.drop(columns=['NumberOfPictures', 'Model', 'DateCrawled', 'DateCreated', 'LastSeen'])

In [24]:
p_df.query('RegistrationYear>2019')['RegistrationYear'].value_counts()

9999    26
5000    17
3000     7
6000     5
7000     4
2500     4
4000     3
9000     3
2800     2
4500     2
5555     2
5911     2
8000     2
2222     2
2066     1
9996     1
5900     1
2200     1
4800     1
8455     1
4100     1
7800     1
5300     1
3500     1
9229     1
8888     1
8500     1
7100     1
8200     1
7500     1
2900     1
3800     1
5600     1
6500     1
9450     1
2290     1
3700     1
3200     1
Name: RegistrationYear, dtype: int64

In [25]:
# filter incorrect registration years
p_df = p_df.query('RegistrationYear<=2019')

In [26]:
p_df.head()

Unnamed: 0,Price,VehicleType,RegistrationYear,Gearbox,Power,Mileage,RegistrationMonth,FuelType,Brand,NotRepaired,PostalCode,DayCrawled,DayCreated,DayLastSeen
0,480,,1993,manual,0,150000,0,petrol,volkswagen,,70435,14,14,0
1,18300,coupe,2011,manual,190,125000,5,gasoline,audi,yes,66954,14,14,0
2,9800,suv,2004,auto,163,125000,8,gasoline,jeep,,90480,24,24,2
3,1500,small,2001,manual,75,150000,6,petrol,volkswagen,no,91074,20,21,20
4,3600,small,2008,manual,69,90000,7,gasoline,skoda,no,60437,6,7,1


In [27]:
del df

In [28]:
p_df = p_df.fillna('NaN')

In [29]:
# Number of columns before OHE
len(p_df.columns)

14

In [30]:
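# Missing values were already replaced by the string 'NaN' above, so dummy_na=True
# only adds all-zero *_nan columns. For columns that had missing values, 'NaN' sorts
# first and is the level removed by drop_first=True.
OHE_df = pd.get_dummies(p_df, dummy_na=True, drop_first=True)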

In [31]:
OHE_df.head()

Unnamed: 0,Price,RegistrationYear,Power,Mileage,RegistrationMonth,PostalCode,DayCrawled,DayCreated,DayLastSeen,VehicleType_bus,...,Brand_subaru,Brand_suzuki,Brand_toyota,Brand_trabant,Brand_volkswagen,Brand_volvo,Brand_nan,NotRepaired_no,NotRepaired_yes,NotRepaired_nan
0,480,1993,0,150000,0,70435,14,14,0,0,...,0,0,0,0,1,0,0,0,0,0
1,18300,2011,190,125000,5,66954,14,14,0,0,...,0,0,0,0,0,0,0,0,1,0
2,9800,2004,163,125000,8,90480,24,24,2,0,...,0,0,0,0,0,0,0,0,0,0
3,1500,2001,75,150000,6,91074,20,21,20,0,...,0,0,0,0,1,0,0,1,0,0
4,3600,2008,69,90000,7,60437,6,7,1,0,...,0,0,0,0,0,0,0,1,0,0
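The interaction between `fillna('NaN')` and `dummy_na=True` in the cells above is easy to miss, so here is a small sketch on a toy column. It assumes only the pandas calls already used in this notebook; the toy frame is made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({"NotRepaired": ["yes", None, "no", None]})

# Path used in the notebook: fill first, then encode with dummy_na=True.
filled = toy.fillna("NaN")
encoded_filled = pd.get_dummies(filled, dummy_na=True, drop_first=True)

# Alternative: let get_dummies create the missing-value indicator itself.
encoded_raw = pd.get_dummies(toy, dummy_na=True, drop_first=True)

nan_cols = [c for c in encoded_filled.columns if c.endswith("_nan")]
print(list(encoded_filled.columns))
print("rows flagged as missing (filled first):", int(encoded_filled[nan_cols].to_numpy().sum()))
print("rows flagged as missing (raw NaN):", int(encoded_raw["NotRepaired_nan"].sum()))
```

In the first path the `_nan` column never fires, and a missing value is encoded implicitly as "all other NotRepaired columns are zero". Either encoding is usable by a model, but the second makes the missing-value indicator explicit.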


In [32]:
#Convert Object type data to Categorical Columns
for col in p_df.dtypes[p_df.dtypes == object].index:
    p_df[col] = p_df[col].astype('category')

In [33]:
p_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 354002 entries, 0 to 354368
Data columns (total 14 columns):
Price                354002 non-null int64
VehicleType          354002 non-null category
RegistrationYear     354002 non-null int64
Gearbox              354002 non-null category
Power                354002 non-null int64
Mileage              354002 non-null int64
RegistrationMonth    354002 non-null int64
FuelType             354002 non-null category
Brand                354002 non-null category
NotRepaired          354002 non-null category
PostalCode           354002 non-null int64
DayCrawled           354002 non-null int64
DayCreated           354002 non-null int64
DayLastSeen          354002 non-null int64
dtypes: category(5), int64(9)
memory usage: 28.7 MB


# 2. Model training

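<h3>Step 1. Create Train, Test and Valid set</h3>

The data is split in two stages: 10% is held out as the test set, and 10% of the remainder is held out as a validation set. Hyperparameters are tuned with cross-validation on the remaining training part (about 81% of the data). Because training RandomForestRegressor and the other models is slow, the search grids are kept small. The final models are then refit on the training and validation parts together and evaluated on the test set.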
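The sizes printed by the split cell below follow from simple arithmetic. The sketch assumes that a fractional `test_size` is rounded up to a whole number of rows, which is how `train_test_split` is understood to behave; the helper name `two_stage_split_sizes` is made up for illustration.

```python
import math

def two_stage_split_sizes(n_rows, test_frac=0.10, valid_frac=0.10):
    """Return (train, valid, test) sizes for a test split followed by a valid split."""
    n_test = math.ceil(n_rows * test_frac)    # first split: hold out the test set
    n_temp = n_rows - n_test                  # what remains for train + valid
    n_valid = math.ceil(n_temp * valid_frac)  # second split: hold out the valid set
    n_train = n_temp - n_valid
    return n_train, n_valid, n_test

sizes = two_stage_split_sizes(354002)
print(sizes)                      # (286740, 31861, 35401)
print(round(sizes[0] / 354002, 3))  # share of rows used for tuning
```

So roughly 81% of the rows end up in the training part, 9% in validation, and 10% in the test set.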

In [34]:
OHE_features, OHE_target = OHE_df.drop(columns=['Price']), OHE_df['Price']
features, target = p_df.drop(columns=['Price']), p_df['Price']

In [35]:
from sklearn.model_selection import train_test_split

OHE_X_temp, OHE_X_test, OHE_y_temp, OHE_y_test = train_test_split(OHE_features, OHE_target, test_size=0.10, random_state=12345)
OHE_X_train, OHE_X_valid, OHE_y_train, OHE_y_valid = train_test_split(OHE_X_temp, OHE_y_temp, test_size=0.10, random_state=12345)
print("OHE Train length:",len(OHE_y_train),", Valid length:", len(OHE_y_valid), ", Test Length", len(OHE_y_test))

X_temp, X_test, y_temp, y_test = train_test_split(features, target, test_size=0.10, random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(X_temp, y_temp, test_size=0.10, random_state=12345)
print("Train length:",len(y_train),", Valid length:", len(y_valid), ", Test Length", len(y_test))

OHE Train length: 286740 , Valid length: 31861 , Test Length 35401
Train length: 286740 , Valid length: 31861 , Test Length 35401


<h3>Step 2. Experiment with Model</h3>

In [36]:
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression

In [37]:
def algorithm_pipeline(X_train_data, X_valid_data, y_train_data, y_valid_data, 
                       model, param_grid, cv=5, scoring_fit='neg_root_mean_squared_error',
                       do_probabilities = False, cat_features=None):
    
    gs = GridSearchCV(
        estimator=model,
        param_grid=param_grid, 
        cv=cv, 
        n_jobs=-1, 
        scoring=scoring_fit,
        verbose=2
    )
    
    fitted_model = gs.fit(X_train_data, y_train_data)
    
    if do_probabilities:
        pred = fitted_model.predict_proba(X_valid_data)
    else:
        pred = fitted_model.predict(X_valid_data)
    
    return fitted_model, pred

<h4>Creating Baseline Model</h4>

In [38]:
linear_model = LinearRegression()

In [39]:
%%time
linear_model.fit(OHE_X_temp, OHE_y_temp)

CPU times: user 4.41 s, sys: 1.78 s, total: 6.19 s
Wall time: 6.33 s


LinearRegression()

In [40]:
%%time
test_pred = linear_model.predict(OHE_X_test)

CPU times: user 36.8 ms, sys: 15.6 ms, total: 52.5 ms
Wall time: 110 ms


In [41]:
mean_squared_error(OHE_y_test, test_pred, squared=False)

3235.516464765666

The RMSE for Linear Regression on the test set is 3235.52 (squared=False makes mean_squared_error return the root of the mean squared error).
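For reference, the metric reported here is the root mean squared error. The sketch below computes it by hand in plain Python so the number can be interpreted in the target's units (the toy prices are invented for illustration). Newer scikit-learn releases are understood to deprecate the `squared` argument in favour of a separate `root_mean_squared_error` function, so it is worth checking the documentation for the installed version.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Errors of 100, 100 and 200 give sqrt((10000 + 10000 + 40000) / 3)
toy_rmse = rmse([1000, 2000, 3000], [1100, 1900, 3200])
print(round(toy_rmse, 2))  # 141.42
```

Because large errors are squared before averaging, a few badly mispriced cars can dominate the score.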

<h3>Random Forest</h3>

In [42]:
random_forest = RandomForestRegressor()

In [43]:
param_grid = {
    'n_estimators': [20, 50],
    'max_depth': [20,25],
    'max_leaf_nodes': [50, 70]
}

estimator, pred = algorithm_pipeline(OHE_X_train, OHE_X_valid, OHE_y_train, OHE_y_valid, random_forest, 
                                 param_grid, cv=5)

best_random_forest = estimator.best_estimator_
print(estimator.best_score_)
print(estimator.best_params_)

Fitting 5 folds for each of 8 candidates, totalling 40 fits
[CV] END ...max_depth=20, max_leaf_nodes=50, n_estimators=20; total time=  22.0s
[CV] END ...max_depth=20, max_leaf_nodes=50, n_estimators=20; total time=  22.1s
[CV] END ...max_depth=20, max_leaf_nodes=50, n_estimators=20; total time=  21.9s
[CV] END ...max_depth=20, max_leaf_nodes=50, n_estimators=20; total time=  23.0s
[CV] END ...max_depth=20, max_leaf_nodes=50, n_estimators=20; total time=  22.2s
[CV] END ...max_depth=20, max_leaf_nodes=50, n_estimators=50; total time=  57.0s
[CV] END ...max_depth=20, max_leaf_nodes=50, n_estimators=50; total time=  57.2s
[CV] END ...max_depth=20, max_leaf_nodes=50, n_estimators=50; total time=  55.4s
[CV] END ...max_depth=20, max_leaf_nodes=50, n_estimators=50; total time=  55.1s
[CV] END ...max_depth=20, max_leaf_nodes=50, n_estimators=50; total time=  55.1s
[CV] END ...max_depth=20, max_leaf_nodes=70, n_estimators=20; total time=  25.5s
[CV] END ...max_depth=20, max_leaf_nodes=70, n_es

In [44]:
best_random_forest = RandomForestRegressor(max_depth=20, max_leaf_nodes=70, n_estimators=50)

In [45]:
%%time
best_random_forest.fit(OHE_X_temp, OHE_y_temp)

CPU times: user 1min 33s, sys: 83.8 ms, total: 1min 33s
Wall time: 1min 34s


RandomForestRegressor(max_depth=20, max_leaf_nodes=70, n_estimators=50)

Output from a separate run:

CPU times: user 1min 29s, sys: 132 ms, total: 1min 29s
Wall time: 1min 30s
RandomForestRegressor(max_depth=20, max_leaf_nodes=70, n_estimators=50)

In [46]:
RandomForestRegressor(max_depth=20, max_leaf_nodes=70, n_estimators=50)

RandomForestRegressor(max_depth=20, max_leaf_nodes=70, n_estimators=50)

In [47]:
%%time
test_pred = best_random_forest.predict(OHE_X_test)

CPU times: user 133 ms, sys: 4 ms, total: 137 ms
Wall time: 164 ms


Output from a separate run:

CPU times: user 138 ms, sys: 7 µs, total: 138 ms
Wall time: 145 ms

In [48]:
mean_squared_error(OHE_y_test, test_pred, squared=False)

2325.4533113123207

Output from a separate run: 2324.9809730663346

<h3>LightGBM</h3>

In [49]:
lgbm_regressor = lightgbm.LGBMRegressor(num_leaves=31)

param_grid = {
    'learning_rate': [0.01, 0.1, 1],
    'n_estimators': [20, 40]
}

estimator, pred = algorithm_pipeline(X_train, X_valid, y_train, y_valid, lgbm_regressor, 
                                 param_grid, cv=5)

best_lgbm_regressor = estimator.best_estimator_
print(estimator.best_score_)
print(estimator.best_params_)

Fitting 5 folds for each of 6 candidates, totalling 30 fits
[CV] END ................learning_rate=0.01, n_estimators=20; total time=   8.5s
[CV] END ................learning_rate=0.01, n_estimators=20; total time=  10.7s
[CV] END ................learning_rate=0.01, n_estimators=20; total time=  10.1s
[CV] END ................learning_rate=0.01, n_estimators=20; total time=  10.9s
[CV] END ................learning_rate=0.01, n_estimators=20; total time=  11.2s
[CV] END ................learning_rate=0.01, n_estimators=40; total time=  16.6s
[CV] END ................learning_rate=0.01, n_estimators=40; total time=  18.0s
[CV] END ................learning_rate=0.01, n_estimators=40; total time=  34.6s
[CV] END ................learning_rate=0.01, n_estimators=40; total time=  14.1s
[CV] END ................learning_rate=0.01, n_estimators=40; total time=  18.5s
[CV] END .................learning_rate=0.1, n_estimators=20; total time=   9.1s
[CV] END .................learning_rate=0.1, n_es

Full output from a separate run:

Fitting 5 folds for each of 6 candidates, totalling 30 fits
[CV] END ................learning_rate=0.01, n_estimators=20; total time=   2.9s
[CV] END ................learning_rate=0.01, n_estimators=20; total time=   3.1s
[CV] END ................learning_rate=0.01, n_estimators=20; total time=   3.0s
[CV] END ................learning_rate=0.01, n_estimators=20; total time=   3.0s
[CV] END ................learning_rate=0.01, n_estimators=20; total time=   2.9s
[CV] END ................learning_rate=0.01, n_estimators=40; total time=   5.1s
[CV] END ................learning_rate=0.01, n_estimators=40; total time=   5.0s
[CV] END ................learning_rate=0.01, n_estimators=40; total time=   5.2s
[CV] END ................learning_rate=0.01, n_estimators=40; total time=   5.0s
[CV] END ................learning_rate=0.01, n_estimators=40; total time=   5.5s
[CV] END .................learning_rate=0.1, n_estimators=20; total time=   2.8s
[CV] END .................learning_rate=0.1, n_estimators=20; total time=   3.1s
[CV] END .................learning_rate=0.1, n_estimators=20; total time=   3.1s
[CV] END .................learning_rate=0.1, n_estimators=20; total time=   3.1s
[CV] END .................learning_rate=0.1, n_estimators=20; total time=   3.0s
[CV] END .................learning_rate=0.1, n_estimators=40; total time=   5.4s
[CV] END .................learning_rate=0.1, n_estimators=40; total time=   6.3s
[CV] END .................learning_rate=0.1, n_estimators=40; total time=   5.0s
[CV] END .................learning_rate=0.1, n_estimators=40; total time=   5.2s
[CV] END .................learning_rate=0.1, n_estimators=40; total time=   5.1s
[CV] END ...................learning_rate=1, n_estimators=20; total time=   2.9s
[CV] END ...................learning_rate=1, n_estimators=20; total time=   3.1s
[CV] END ...................learning_rate=1, n_estimators=20; total time=   2.8s
[CV] END ...................learning_rate=1, n_estimators=20; total time=   2.6s
[CV] END ...................learning_rate=1, n_estimators=20; total time=   2.6s
[CV] END ...................learning_rate=1, n_estimators=40; total time=   4.1s
[CV] END ...................learning_rate=1, n_estimators=40; total time=   3.8s
[CV] END ...................learning_rate=1, n_estimators=40; total time=   3.9s
[CV] END ...................learning_rate=1, n_estimators=40; total time=   3.9s
[CV] END ...................learning_rate=1, n_estimators=40; total time=   3.9s
-1912.3528350092918
{'learning_rate': 0.1, 'n_estimators': 40}

In [50]:
best_lgbm_regressor = lightgbm.LGBMRegressor(num_leaves=31,learning_rate=0.1, n_estimators=40, verbose=100)

In [51]:
%%time
best_lgbm_regressor.fit(X_temp, y_temp)

CPU times: user 6.04 s, sys: 71.5 ms, total: 6.12 s
Wall time: 6.18 s


LGBMRegressor(n_estimators=40, verbose=100)

Output from a separate run:

CPU times: user 6.1 s, sys: 61.5 ms, total: 6.17 s
Wall time: 6.19 s
LGBMRegressor(n_estimators=40, verbose=100)

In [52]:
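# NOTE: `%time` is a line magic; alone on its line it times an empty statement,
# so the timing printed for this cell does not cover predict().
# The cell magic `%%time` would time the whole cell instead.
%time
test_pred = best_lgbm_regressor.predict(X_test)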

CPU times: user 0 ns, sys: 4 µs, total: 4 µs
Wall time: 7.15 µs


Output from a separate run:

CPU times: user 4 µs, sys: 0 ns, total: 4 µs
Wall time: 8.34 µs

In [53]:
# specifying squared=False returns the RMSE instead of the MSE
mean_squared_error(y_test, test_pred, squared=False)

1936.182611920237

<h3>CatBoost</h3>

In [54]:
cat_features = [i for i, col in enumerate(X_train.columns) if X_train[col].dtypes.name == 'category']
cat_features

[0, 2, 6, 7, 8]

In [55]:
from catboost import CatBoostRegressor

cat_boost_regressor = CatBoostRegressor(cat_features=cat_features, loss_function='RMSE', verbose=100)

In [56]:
%%time
cat_boost_regressor.fit(X=X_temp, y=y_temp)

0:	learn: 4427.3872631	total: 1.19s	remaining: 19m 44s
100:	learn: 2136.4992082	total: 1m 32s	remaining: 13m 40s
200:	learn: 1978.8250346	total: 3m 3s	remaining: 12m 9s
300:	learn: 1923.1123132	total: 4m 38s	remaining: 10m 46s
400:	learn: 1888.9735092	total: 6m 13s	remaining: 9m 18s
500:	learn: 1862.4235837	total: 7m 49s	remaining: 7m 47s
600:	learn: 1842.6935998	total: 9m 25s	remaining: 6m 15s
700:	learn: 1827.8935062	total: 10m 57s	remaining: 4m 40s
800:	learn: 1816.6014181	total: 12m 30s	remaining: 3m 6s
900:	learn: 1804.8100701	total: 14m 5s	remaining: 1m 32s
999:	learn: 1794.0648401	total: 15m 39s	remaining: 0us
CPU times: user 14min 14s, sys: 1min 22s, total: 15min 36s
Wall time: 15min 44s


<catboost.core.CatBoostRegressor at 0x7ffaacad4850>

Output from a separate run:

0:	learn: 4427.3872631	total: 1.08s	remaining: 17m 57s
100:	learn: 2136.4992082	total: 1m 31s	remaining: 13m 37s
200:	learn: 1978.8250346	total: 3m 2s	remaining: 12m 5s
300:	learn: 1923.1123132	total: 4m 35s	remaining: 10m 40s
400:	learn: 1888.9735092	total: 6m 10s	remaining: 9m 14s
500:	learn: 1862.4235837	total: 7m 46s	remaining: 7m 44s
600:	learn: 1842.6935998	total: 9m 23s	remaining: 6m 13s
700:	learn: 1827.8935062	total: 10m 55s	remaining: 4m 39s
800:	learn: 1816.6014181	total: 12m 28s	remaining: 3m 6s
900:	learn: 1804.8100701	total: 14m 3s	remaining: 1m 32s
999:	learn: 1794.0648401	total: 15m 37s	remaining: 0us
CPU times: user 14min 36s, sys: 1min 1s, total: 15min 38s
Wall time: 15min 41s
<catboost.core.CatBoostRegressor at 0x7f06350be250>

In [57]:
%%time
test_pred = cat_boost_regressor.predict(X_test)

CPU times: user 207 ms, sys: 16.8 ms, total: 224 ms
Wall time: 224 ms


Output from a separate run:

CPU times: user 218 ms, sys: 362 µs, total: 219 ms
Wall time: 194 ms

In [58]:
# specifying squared=False returns the RMSE instead of the MSE
mean_squared_error(y_test, test_pred, squared=False)

1845.0331368865575

# 3. Model analysis

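<table><tr><th>Model Name</th>	<th>RMSE</th>	<th>Training Time</th>	<th>Prediction Time</th></tr>
    <tr><td>Linear Regression</td>	<td>3235.516</td>	<td>6.04 s</td>	<td>107 ms</td></tr>
    <tr><td>Random Forest</td>	<td>2325.356</td>	<td>1min 31s</td>	<td>158 ms</td></tr>
    <tr><td>LightGBM</td>	<td>1936.183</td>	<td>6.19 s</td>	<td>9.06 µs*</td></tr>
    <tr><td>CatBoost</td>	<td>1845.033</td>	<td>15min 41s</td>	<td>194 ms</td></tr></table>

* The LightGBM prediction cell used `%time` on a line by itself, which times an empty statement rather than the `predict` call, so this figure is not comparable with the others. The LightGBM training time is the wall time of the final fit recorded in the notebook.

From the observations above:

CatBoost gives the best RMSE (1845) but has by far the longest training time (over 15 minutes) and the longest measured prediction time (194 ms).
LightGBM is close behind on RMSE (1936), and its final fit took only about 6 seconds, so it offers the best balance of quality and training time. Its prediction time still needs to be measured properly.
All three tree-based models score better than the baseline Linear Regression (RMSE 3236).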
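Prediction times in the table depend on how the call was timed. In IPython, `%%time` at the top of a cell times the whole cell, while `%time` times only the statement on its own line. A sketch of an explicit alternative using `time.perf_counter` is shown here. The helper `time_call` and the stand-in `fake_predict` are hypothetical and only illustrate the pattern; in the notebook the timed call would be a model's `predict` method.

```python
import time

def time_call(func, *args, repeats=5):
    """Call func(*args) several times; return its last result and the best wall time in seconds."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func(*args)
        best = min(best, time.perf_counter() - start)
    return result, best

def fake_predict(rows):
    # Stand-in for model.predict: sums each feature row
    return [sum(row) for row in rows]

preds, seconds = time_call(fake_predict, [[1, 2], [3, 4]])
print(preds, f"{seconds * 1e3:.3f} ms")
```

Taking the best of several repeats reduces noise from other processes, which matters when the call itself takes only milliseconds.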

## Checklist

- [x]  Jupyter Notebook is open
- [x]  Code is error free
- [x]  The cells with the code have been arranged in order of execution
- [x]  The data has been downloaded and prepared
- [x]  The models have been trained
- [x]  The analysis of speed and quality of the models has been performed