<div class="alert alert-success">
<b>Reviewer's comment V2</b>

Thanks for taking the time to improve the project! It is accepted now. Good luck on the next sprint!

</div>

**Review**

Hi, my name is Dmitry and I will be reviewing your project.
  
You can find my comments in colored markdown cells:
  
<div class="alert alert-success">
  If everything is done successfully.
</div>
  
<div class="alert alert-warning">
  If I have some (optional) suggestions, or questions to think about, or general comments.
</div>
  
<div class="alert alert-danger">
  If a section requires some corrections. Work can't be accepted with red comments.
</div>
  
Please don't remove my comments, as it will make further review iterations much harder for me.
  
Feel free to reply to my comments or ask questions using the following template:
  
<div class="alert alert-info">
  For your comments and questions.
</div>
  
First of all, thank you for turning in the project! You did a pretty good job overall, although there are some problems that need to be fixed before the project is accepted. Let me know if you have any questions!

Rusty Bargain, a used car sales service, is developing an app to attract new customers. In that app, you can quickly find out the market value of your car. You have access to historical data: technical specifications, trim versions, and prices. You need to build a model to determine the value.

Rusty Bargain is interested in:

- the quality of the prediction;
- the speed of the prediction;
- the time required for training.

## Data preparation

### Initialization

In [1]:
# Load essential libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OrdinalEncoder
from sklearn.metrics import mean_squared_error
from math import sqrt
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from catboost import CatBoostRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from sklearn.model_selection import cross_val_score
from sklearn.metrics import make_scorer

In [2]:
# Create dataframe
df = pd.read_csv('/datasets/car_data.csv')

In [3]:
# Visualize df
df

index,DateCrawled,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Mileage,RegistrationMonth,FuelType,Brand,NotRepaired,DateCreated,NumberOfPictures,PostalCode,LastSeen
0,24/03/2016 11:52,480,,1993,manual,0,golf,150000,0,petrol,volkswagen,,24/03/2016 00:00,0,70435,07/04/2016 03:16
1,24/03/2016 10:58,18300,coupe,2011,manual,190,,125000,5,gasoline,audi,yes,24/03/2016 00:00,0,66954,07/04/2016 01:46
2,14/03/2016 12:52,9800,suv,2004,auto,163,grand,125000,8,gasoline,jeep,,14/03/2016 00:00,0,90480,05/04/2016 12:47
3,17/03/2016 16:54,1500,small,2001,manual,75,golf,150000,6,petrol,volkswagen,no,17/03/2016 00:00,0,91074,17/03/2016 17:40
4,31/03/2016 17:25,3600,small,2008,manual,69,fabia,90000,7,gasoline,skoda,no,31/03/2016 00:00,0,60437,06/04/2016 10:17
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
354364,21/03/2016 09:50,0,,2005,manual,0,colt,150000,7,petrol,mitsubishi,yes,21/03/2016 00:00,0,2694,21/03/2016 10:42
354365,14/03/2016 17:48,2200,,2005,,0,,20000,1,,sonstige_autos,,14/03/2016 00:00,0,39576,06/04/2016 00:46
354366,05/03/2016 19:56,1199,convertible,2000,auto,101,fortwo,125000,3,petrol,smart,no,05/03/2016 00:00,0,26135,11/03/2016 18:17
354367,19/03/2016 18:57,9200,bus,1996,manual,102,transporter,150000,3,gasoline,volkswagen,no,19/03/2016 00:00,0,87439,07/04/2016 07:15


In [4]:
# Visualize df info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   DateCrawled        354369 non-null  object
 1   Price              354369 non-null  int64 
 2   VehicleType        316879 non-null  object
 3   RegistrationYear   354369 non-null  int64 
 4   Gearbox            334536 non-null  object
 5   Power              354369 non-null  int64 
 6   Model              334664 non-null  object
 7   Mileage            354369 non-null  int64 
 8   RegistrationMonth  354369 non-null  int64 
 9   FuelType           321474 non-null  object
 10  Brand              354369 non-null  object
 11  NotRepaired        283215 non-null  object
 12  DateCreated        354369 non-null  object
 13  NumberOfPictures   354369 non-null  int64 
 14  PostalCode         354369 non-null  int64 
 15  LastSeen           354369 non-null  object
dtypes: int64(7), object(

<div class="alert alert-success">
<b>Reviewer's comment</b>

Ok, the data was loaded and inspected

</div>

### Data Preprocessing

#### Column names

In [5]:
# Standardize column names
df.columns = df.columns.str.lower()
df = df.rename(
    columns={
    'datecrawled': 'date_crawled', 
    'vehicletype':'vehicle_type', 
    'registrationyear':'registration_year',
    'gearbox':'gear_box',
    'registrationmonth' : 'registration_month',
    'fueltype': 'fuel_type',
    'notrepaired': 'not_repaired',
    'datecreated' : 'date_created',
    'numberofpictures' : 'number_of_pictures',
    'postalcode': 'postal_code',
    'lastseen':'last_seen'
})

#### Missing values

In [6]:
# Fill missing values in every column with the 'N/A' placeholder
df.fillna('N/A', inplace=True)

<div class="alert alert-success">
<b>Reviewer's comment</b>

Ok, missing values were replaced by a placeholder

</div>
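
A possible refinement, shown here only as a sketch on a made-up frame: check the share of missing values per column first, then fill the placeholder in the categorical columns only, so that a numeric column is never silently turned into text. The frame `sample` and the list `categorical_cols` are illustrative assumptions, not objects from this notebook.

```python
import numpy as np
import pandas as pd

# Made-up sample with gaps in both categorical and numeric columns.
sample = pd.DataFrame(
    {
        "vehicle_type": ["coupe", np.nan],
        "gear_box": [np.nan, "auto"],
        "power": [190.0, np.nan],
    }
)
categorical_cols = ["vehicle_type", "gear_box"]  # assumed list of text columns

# Share of missing values per column, computed before any filling.
missing_share = sample.isna().mean()
print(missing_share)

# Fill the placeholder in the categorical columns only.
sample[categorical_cols] = sample[categorical_cols].fillna("N/A")
print(sample)
```

In this sketch the numeric `power` column keeps its gap, which leaves the decision about numeric gaps explicit.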

#### Data types

In [7]:
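# Convert data types; the source dates are day-first (e.g. 24/03/2016 11:52),
# so the format is given explicitly to avoid month/day mix-ups
date_format = '%d/%m/%Y %H:%M'
df['date_crawled'] = pd.to_datetime(df['date_crawled'], format=date_format)
df['last_seen'] = pd.to_datetime(df['last_seen'], format=date_format)
df['date_created'] = pd.to_datetime(df['date_created'], format=date_format)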

#### Result of preprocessing

In [8]:
# Check df info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column              Non-Null Count   Dtype         
---  ------              --------------   -----         
 0   date_crawled        354369 non-null  datetime64[ns]
 1   price               354369 non-null  int64         
 2   vehicle_type        354369 non-null  object        
 3   registration_year   354369 non-null  int64         
 4   gear_box            354369 non-null  object        
 5   power               354369 non-null  int64         
 6   model               354369 non-null  object        
 7   mileage             354369 non-null  int64         
 8   registration_month  354369 non-null  int64         
 9   fuel_type           354369 non-null  object        
 10  brand               354369 non-null  object        
 11  not_repaired        354369 non-null  object        
 12  date_created        354369 non-null  datetime64[ns]
 13  number_of_pictures  354369 no

In [9]:
# copy data frame for ordinal encoding
df_2 = df.copy()

<div class="alert alert-warning">
<b>Reviewer's comment</b>

Would be nice to see some EDA, you missed some problems with the data. 

</div>
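
One way to look for such problems is to count implausible values column by column; the preview of the raw data already shows rows with a price, power, or registration month of 0. The sketch below runs on a tiny made-up frame. The helper name `count_suspicious` and the year bounds (1900 to 2016) are assumptions for illustration, not thresholds taken from this project.

```python
import pandas as pd


def count_suspicious(frame, min_year=1900, max_year=2016):
    """Count rows with implausible values in a few numeric columns.

    The year bounds are illustrative assumptions, not project requirements.
    """
    return {
        "zero_price": int((frame["price"] == 0).sum()),
        "zero_power": int((frame["power"] == 0).sum()),
        "zero_month": int((frame["registration_month"] == 0).sum()),
        "bad_year": int(
            (~frame["registration_year"].between(min_year, max_year)).sum()
        ),
        "duplicates": int(frame.duplicated().sum()),
    }


# Tiny made-up sample that mimics the column names used in this notebook.
sample = pd.DataFrame(
    {
        "price": [480, 0, 9800, 9800],
        "power": [0, 190, 163, 163],
        "registration_month": [0, 5, 8, 8],
        "registration_year": [1993, 9999, 2004, 2004],
    }
)
report = count_suspicious(sample)
print(report)
```

On the real data, the same counts could be followed by `describe()` and histograms of `price`, `power`, and `registration_year` before deciding what to filter.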

#### Ordinal Encoding

In [10]:
# Ordinal encoding of the categorical columns of df_2
categorical =['vehicle_type', 'gear_box', 'model', 'fuel_type', 'brand', 'not_repaired']
encoder = OrdinalEncoder()
df_2[categorical] = encoder.fit_transform(df_2[categorical])

<div class="alert alert-success">
<b>Reviewer's comment</b>

Ok, categorical features were encoded using ordinal encoder

</div>

### Split df

In [11]:
# Split into train and test df
train, test = train_test_split(df, test_size = 0.25, random_state=12345)

train_2, test_2 = train_test_split(df_2, test_size = 0.25, random_state=12345)

In [12]:
# Features and targets
train_features = train.drop(['price', 'date_crawled', 'last_seen', 'date_created', 'number_of_pictures', 'postal_code'], axis=1)
train_target = train['price']

tf2 = train_2.drop(['price', 'date_crawled', 'last_seen', 'date_created', 'number_of_pictures', 'postal_code'], axis=1)
tt2 = train_2['price']

test_features = test.drop(['price', 'date_crawled', 'last_seen', 'date_created', 'number_of_pictures', 'postal_code'], axis=1)
test_target = test['price']

tef2 = test_2.drop(['price', 'date_crawled', 'last_seen', 'date_created', 'number_of_pictures', 'postal_code'], axis=1)
tet2 = test_2['price']

<div class="alert alert-success">
<b>Reviewer's comment</b>

The list of columns to drop makes sense. The data was split into train and test

</div>

#### Feature scaling

In [13]:
# Scale numerical columns for df 1
numeric = ['registration_year', 'power', 'mileage', 'registration_month']
scaler = StandardScaler()
scaler.fit(train_features[numeric])
train_features[numeric] = scaler.transform(train_features[numeric])
test_features[numeric] = scaler.transform(test_features[numeric])

In [14]:
# Scale numerical columns for df 2
numeric = ['registration_year', 'power', 'mileage', 'registration_month']
scaler = StandardScaler()
scaler.fit(tf2[numeric])
tf2[numeric] = scaler.transform(tf2[numeric])
tef2[numeric] = scaler.transform(tef2[numeric])

<div class="alert alert-success">
<b>Reviewer's comment</b>

Scaling is applied correctly

</div>
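
The fit-on-train, transform-both pattern above can also be bundled with the model, so that the same preprocessing is re-fitted inside every cross-validation fold. The following is a minimal sketch on made-up rows, assuming scikit-learn's `ColumnTransformer` and `Pipeline`; the column lists and the frames `train_rows` and `new_rows` are illustrative, not the notebook's variables.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ["power", "mileage"]
categorical_cols = ["brand"]
feature_cols = numeric_cols + categorical_cols

# Scale numeric columns and one-hot encode categorical ones in a single step.
# handle_unknown="ignore" encodes categories unseen during fit as all zeros.
preprocess = ColumnTransformer(
    [
        ("num", StandardScaler(), numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ]
)
pipeline = Pipeline([("prep", preprocess), ("model", LinearRegression())])

# Made-up training rows.
train_rows = pd.DataFrame(
    {
        "power": [75, 190, 163, 101],
        "mileage": [150000, 125000, 125000, 90000],
        "brand": ["volkswagen", "audi", "jeep", "smart"],
        "price": [1500, 18300, 9800, 1199],
    }
)
# Made-up new rows; "skoda" does not occur in the training rows.
new_rows = pd.DataFrame(
    {
        "power": [69, 102],
        "mileage": [90000, 150000],
        "brand": ["skoda", "volkswagen"],
    }
)

pipeline.fit(train_rows[feature_cols], train_rows["price"])
predictions = pipeline.predict(new_rows[feature_cols])
print(predictions)
```

Passing such a pipeline to `cross_val_score` would fit the scaler and encoder on the training part of each fold only.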

In [15]:
train_features

index,vehicle_type,registration_year,gear_box,power,model,mileage,registration_month,fuel_type,brand,not_repaired
25945,sedan,-0.189626,manual,0.111709,escort,0.574755,-0.459017,petrol,ford,no
57619,sedan,-0.036692,,-0.564868,c_klasse,0.574755,0.077608,petrol,mercedes_benz,
14516,small,0.010365,manual,-0.180449,twingo,0.574755,0.345920,petrol,renault,no
283907,sedan,0.069186,manual,0.203970,qashqai,-1.536705,1.150857,gasoline,nissan,no
161548,wagon,0.104478,manual,-0.001054,astra,-1.800637,0.614232,gasoline,opel,no
...,...,...,...,...,...,...,...,...,...,...
47873,sedan,-0.083748,manual,-0.103565,6_reihe,0.574755,1.419169,petrol,mazda,no
86398,sedan,-0.095512,manual,-0.047184,other,-0.085076,1.419169,petrol,volkswagen,no
347556,sedan,0.045658,manual,0.060453,1er,-1.800637,-0.190705,petrol,bmw,no
77285,sedan,0.033893,manual,0.050202,cooper,-0.744907,1.150857,petrol,mini,no


In [16]:
tf2

index,vehicle_type,registration_year,gear_box,power,model,mileage,registration_month,fuel_type,brand,not_repaired
25945,5.0,-0.189626,2.0,0.111709,99.0,0.574755,-0.459017,7.0,10.0,1.0
57619,5.0,-0.036692,0.0,-0.564868,60.0,0.574755,0.077608,7.0,20.0,0.0
14516,6.0,0.010365,2.0,-0.180449,228.0,0.574755,0.345920,7.0,27.0,1.0
283907,5.0,0.069186,2.0,0.203970,181.0,-1.536705,1.150857,3.0,23.0,1.0
161548,8.0,0.104478,2.0,-0.001054,43.0,-1.800637,0.614232,3.0,24.0,1.0
...,...,...,...,...,...,...,...,...,...,...
47873,5.0,-0.083748,2.0,-0.103565,17.0,0.574755,1.419169,7.0,19.0,1.0
86398,5.0,-0.095512,2.0,-0.047184,167.0,-0.085076,1.419169,7.0,38.0,1.0
347556,5.0,0.045658,2.0,0.060453,6.0,-1.800637,-0.190705,7.0,2.0,1.0
77285,5.0,0.033893,2.0,0.050202,81.0,-0.744907,1.150857,7.0,21.0,1.0


The data has been cleaned and prepared for analysis. The missing values were all in categorical columns and were filled with placeholders ('N/A') as these values would have substantially affected the dataset if dropped.

<div class="alert alert-success">
<b>Reviewer's comment</b>

Alright!

</div>

#### OHE 

In [17]:
# OHE on categorical columns
lr_df = df.copy()
lr_df = lr_df.drop(['date_crawled', 'last_seen', 'date_created', 'number_of_pictures', 'postal_code'], axis=1)
lr_df = pd.get_dummies(lr_df, columns= categorical)

In [18]:
# Prepare columns for lr model
lr_train, lr_test = train_test_split(lr_df, test_size = 0.25, random_state=12345) #split df


# Create features and targets
lr_train_features = lr_train.drop('price', axis=1)
lr_train_target = lr_train['price']

lr_test_features = lr_test.drop('price', axis=1)
lr_test_target = lr_test['price']


#Scale numerical columns
scaler = StandardScaler()
scaler.fit(lr_train_features[numeric])
lr_train_features[numeric] = scaler.transform(lr_train_features[numeric])
lr_test_features[numeric] = scaler.transform(lr_test_features[numeric])

In [19]:
lr_train_features

index,registration_year,power,mileage,registration_month,vehicle_type_N/A,vehicle_type_bus,vehicle_type_convertible,vehicle_type_coupe,vehicle_type_other,vehicle_type_sedan,...,brand_sonstige_autos,brand_subaru,brand_suzuki,brand_toyota,brand_trabant,brand_volkswagen,brand_volvo,not_repaired_N/A,not_repaired_no,not_repaired_yes
25945,-0.189626,0.111709,0.574755,-0.459017,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,1,0
57619,-0.036692,-0.564868,0.574755,0.077608,0,0,0,0,0,1,...,0,0,0,0,0,0,0,1,0,0
14516,0.010365,-0.180449,0.574755,0.345920,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
283907,0.069186,0.203970,-1.536705,1.150857,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,1,0
161548,0.104478,-0.001054,-1.800637,0.614232,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
47873,-0.083748,-0.103565,0.574755,1.419169,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,1,0
86398,-0.095512,-0.047184,-0.085076,1.419169,0,0,0,0,0,1,...,0,0,0,0,0,1,0,0,1,0
347556,0.045658,0.060453,-1.800637,-0.190705,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,1,0
77285,0.033893,0.050202,-0.744907,1.150857,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,1,0


In [20]:
lr_test_features

index,registration_year,power,mileage,registration_month,vehicle_type_N/A,vehicle_type_bus,vehicle_type_convertible,vehicle_type_coupe,vehicle_type_other,vehicle_type_sedan,...,brand_sonstige_autos,brand_subaru,brand_suzuki,brand_toyota,brand_trabant,brand_volkswagen,brand_volvo,not_repaired_N/A,not_repaired_no,not_repaired_yes
18734,0.069186,0.203970,0.574755,-0.459017,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
141787,0.080950,0.168091,-2.328502,-0.190705,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,1,0
37523,-0.001399,0.075830,0.574755,1.687482,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,1,0
194192,0.033893,0.583263,0.574755,0.882545,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,1,0
110210,-0.119041,-0.216328,0.574755,0.882545,0,0,0,0,1,0,...,0,0,0,0,0,1,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20704,-0.013163,0.552510,0.574755,0.077608,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
234662,0.022129,0.270602,0.574755,-0.190705,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
81975,0.033893,0.060453,0.574755,0.614232,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
175123,-0.095512,-0.282961,-0.085076,1.150857,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


## Model training

### Create Scorer

In [21]:
def rmse(target, pred, **kwargs):
    mse = mean_squared_error(target, pred)
    result = mse**0.5
    return result
scorer = make_scorer(rmse, greater_is_better = False)
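
Because the scorer is created with `greater_is_better=False`, scikit-learn negates the metric so that a larger score is always better. That is why the cross-validated RMSE values reported later in this notebook are negative. A minimal sketch on made-up data, using `DummyRegressor` only as a stand-in model:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import make_scorer, mean_squared_error


def rmse(target, pred):
    """Root mean squared error."""
    return mean_squared_error(target, pred) ** 0.5


neg_rmse_scorer = make_scorer(rmse, greater_is_better=False)

# Made-up data: the dummy model always predicts the mean of y (2.5).
X = np.zeros((4, 1))
y = np.array([1.0, 2.0, 3.0, 4.0])
model = DummyRegressor(strategy="mean").fit(X, y)

score = neg_rmse_scorer(model, X, y)  # negated RMSE
print(score, -score)
```

To report the RMSE itself, flip the sign of the score returned by `cross_val_score`.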

### Linear regression model

In [22]:
%%time
# Create and train model
lr_model = LinearRegression()
lr_model.fit(lr_train_features, lr_train_target)
lr_pred = lr_model.predict(lr_test_features)
lr_res = sqrt(mean_squared_error(lr_test_target, lr_pred))

CPU times: user 21.8 s, sys: 12.6 s, total: 34.4 s
Wall time: 34.4 s


In [23]:
%%time
# Prediction time
lr_pred = lr_model.predict(lr_test_features)

CPU times: user 114 ms, sys: 88.5 ms, total: 203 ms
Wall time: 198 ms


<div class="alert alert-danger">
<s><b>Reviewer's comment</b>

For linear models like linear regression (in contrast to tree-based models) ordinal encoding of categorical features is not appropriate unless there is some natural order on those categories.

</div>

<div class="alert alert-success">
<b>Reviewer's comment V2</b>

Ok, fixed!

</div>
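
The point about ordinal codes and linear models can be seen on a toy example. The sketch below uses made-up prices for three unordered categories and fits ordinary least squares with NumPy, once on the integer codes and once on one-hot columns; the helper `fit_rmse` is hypothetical and exists only for this illustration.

```python
import numpy as np

# Three unordered categories; their mean prices are not monotonic in the code order.
codes = np.array([0, 0, 1, 1, 2, 2])
prices = np.array([1000.0, 1000.0, 5000.0, 5000.0, 2000.0, 2000.0])


def fit_rmse(features, target):
    """Least-squares fit with an intercept; returns the training RMSE."""
    design = np.column_stack([np.ones(len(target)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residuals = target - design @ coef
    return float(np.sqrt(np.mean(residuals ** 2)))


# Ordinal code as a single numeric feature: the model must fit one straight line.
ordinal_rmse = fit_rmse(codes.reshape(-1, 1), prices)

# One-hot columns (first one dropped to avoid collinearity with the intercept).
one_hot = np.eye(3)[codes]
one_hot_rmse = fit_rmse(one_hot[:, 1:], prices)

print(ordinal_rmse, one_hot_rmse)
```

With one-hot columns the linear model can assign each category its own level, while a single ordinal column forces the levels onto a line. Tree-based models split on thresholds, so they are much less sensitive to the arbitrary order.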

### Random Forest Model

In [24]:
%%time
# Create and train model
rf_model = None
estimators = 0
max_depth = 0
RMSE = float('-inf')
for est in range(10,50,10):
    for depth in range(1,10):
        model = RandomForestRegressor(n_estimators=est, max_depth=depth, random_state=12345)
        result = cross_val_score(model, tf2, tt2, cv=5, scoring=scorer).mean()
        if result > RMSE:
            rf_model = model
            RMSE = result
            estimators = est
            max_depth = depth

CPU times: user 12min 29s, sys: 1.19 s, total: 12min 30s
Wall time: 12min 30s


In [25]:
%%time
# Train model with chosen parameters
rf =  RandomForestRegressor(n_estimators=estimators, max_depth=max_depth, random_state=12345)
rf.fit(tf2,tt2)

CPU times: user 15.2 s, sys: 15.7 ms, total: 15.2 s
Wall time: 15.3 s


RandomForestRegressor(max_depth=9, n_estimators=40, random_state=12345)

### Decision Tree Model

In [26]:
%%time
# Create and train model
dt_model = None
dt_depth = 0
RMSE_1 = float('-inf')
for depth in range(1,10):
    model_1 = DecisionTreeRegressor(max_depth=depth, random_state=12345)
    result_1 = cross_val_score(model_1, tf2, tt2, cv=5, scoring=scorer).mean()
    if result_1 > RMSE_1:
        dt_model = model_1
        RMSE_1 = result_1
        dt_depth = depth

CPU times: user 11.8 s, sys: 11.8 ms, total: 11.8 s
Wall time: 11.8 s


In [27]:
%%time
# Train model with chosen parameters
dt = DecisionTreeRegressor(max_depth=dt_depth, random_state=12345)
dt.fit(tf2, tt2)

CPU times: user 508 ms, sys: 14.4 ms, total: 523 ms
Wall time: 529 ms


DecisionTreeRegressor(max_depth=9, random_state=12345)

### Catboost Model

In [28]:
%%time
# Create and train model
cb_model = None
iterations = 0
RMSE_2 = float('-inf')
for l in range(100, 200, 50):
    model_2 = CatBoostRegressor(random_seed=12345, iterations=l, verbose=0, cat_features = categorical)
    result_2 = cross_val_score(model_2, train_features, train_target, cv=3, scoring=scorer).mean()
    if result_2 > RMSE_2:
        cb_model = model_2
        RMSE_2 = result_2
        iterations = l

CPU times: user 49.7 s, sys: 328 ms, total: 50 s
Wall time: 52.8 s


In [29]:
%%time
# Fit model with chosen parameters
cb = CatBoostRegressor(random_seed=12345, iterations=iterations, verbose=0, cat_features = categorical)
cb.fit(train_features, train_target)

CPU times: user 14.4 s, sys: 92.3 ms, total: 14.5 s
Wall time: 14.9 s


<catboost.core.CatBoostRegressor at 0x7f810c255790>

### Light GBM 

In [30]:
%%time
# Create and train model
lg_model = None
lg_depth = 0
RMSE_3 = float('-inf')
for depth in range(10,20,5):
    model_3 = LGBMRegressor(random_seed=12345, max_depth=depth, verbose=-1)
    result_3 = cross_val_score(model_3, tf2, tt2, cv=3, scoring = scorer).mean()
    if result_3 > RMSE_3:
        lg_model = model_3
        RMSE_3 = result_3
        lg_depth = depth

CPU times: user 15min 45s, sys: 2.77 s, total: 15min 47s
Wall time: 15min 55s


In [31]:
%%time
# Create model with chosen parameters
lg = LGBMRegressor(random_seed=12345, max_depth=lg_depth, verbose=-1)
lg.fit(tf2, tt2)

CPU times: user 5.16 s, sys: 28.4 ms, total: 5.19 s
Wall time: 5.2 s


LGBMRegressor(max_depth=15, random_seed=12345, verbose=-1)

<div class="alert alert-warning">
<b>Reviewer's comment</b>

For lightgbm and catboost it would be better to use their own encoding of categorical features ([lightgbm](https://lightgbm.readthedocs.io/en/latest/Advanced-Topics.html#categorical-feature-support) and [catboost](https://catboost.ai/en/docs/concepts/python-reference_catboostregressor#cat_features))

</div>
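
For LightGBM, one common way to follow this suggestion is to cast the text columns to the pandas `category` dtype before fitting; as far as I understand the LightGBM documentation linked above, its scikit-learn wrapper then treats such columns as categorical by default. The sketch below shows only the data preparation on a made-up frame; the helper `to_category_dtype` is hypothetical.

```python
import pandas as pd


def to_category_dtype(frame, columns):
    """Return a copy of `frame` with the given columns cast to pandas 'category'."""
    result = frame.copy()
    for column in columns:
        result[column] = result[column].astype("category")
    return result


# Made-up sample with two text columns and one numeric column.
sample = pd.DataFrame(
    {
        "brand": ["audi", "jeep", "audi"],
        "fuel_type": ["petrol", "gasoline", "N/A"],
        "power": [190, 163, 75],
    }
)
prepared = to_category_dtype(sample, ["brand", "fuel_type"])
print(prepared.dtypes)
```

Casting before the train/test split keeps the category sets identical in both parts. The CatBoost cells in this notebook already pass `cat_features` on the unencoded frame.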

### XGBoost

In [32]:
%%time
# Create and train model
xg_model = None
xg_depth = 0
RMSE_4 = float('-inf')
for depth in range(10,20,5):
    model_4 = XGBRegressor(random_state=12345, max_depth=depth, verbosity=0)
    result_4 = cross_val_score(model_4, tf2, tt2, cv=3, scoring=scorer).mean()
    if result_4 > RMSE_4:
        xg_model = model_4
        RMSE_4 = result_4
        xg_depth = depth

CPU times: user 9min 34s, sys: 1.68 s, total: 9min 36s
Wall time: 9min 42s


In [33]:
%%time
# Create model with chosen parameters
xg = XGBRegressor(random_state=12345, max_depth=xg_depth, verbosity=0)
xg.fit(tf2,tt2)

CPU times: user 1min 36s, sys: 327 ms, total: 1min 36s
Wall time: 1min 37s


XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, enable_categorical=False,
             gamma=0, gpu_id=-1, importance_type=None,
             interaction_constraints='', learning_rate=0.300000012,
             max_delta_step=0, max_depth=10, min_child_weight=1, missing=nan,
             monotone_constraints='()', n_estimators=100, n_jobs=8,
             num_parallel_tree=1, predictor='auto', random_seed=12345,
             random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
             subsample=1, tree_method='exact', validate_parameters=1,
             verbosity=0)

## Model analysis

### Linear regression model

In [34]:
# Check RMSE
lr_res

3172.6499446092876

In [35]:
%%time
# Check prediction time
lr_pred = lr_model.predict(lr_test_features)

CPU times: user 90.3 ms, sys: 0 ns, total: 90.3 ms
Wall time: 56.1 ms


### Random Forest Model

In [36]:
# Check RMSE
RMSE

-2102.3786543604942

In [37]:
%%time
# Check prediction time
rf_pred = rf.predict(tef2)

CPU times: user 246 ms, sys: 48 µs, total: 246 ms
Wall time: 251 ms


### Decision tree model

In [38]:
# Check RMSE
RMSE_1

-2193.696900362912

In [39]:
%%time
# Check prediction time
dt_pred = dt.predict(tef2)

CPU times: user 9.56 ms, sys: 14 µs, total: 9.58 ms
Wall time: 9.09 ms


### Catboost Model

In [40]:
# Check RMSE
RMSE_2

-1815.3862822182266

In [41]:
%%time
# Check prediction time
cb_pred = cb.predict(test_features)

CPU times: user 107 ms, sys: 8 µs, total: 107 ms
Wall time: 116 ms


### Light GBM Model

In [42]:
# Check RMSE
RMSE_3

-1856.8987810320084

In [43]:
%%time
# Check prediction time
lg_pred = lg.predict(tef2)

CPU times: user 644 ms, sys: 0 ns, total: 644 ms
Wall time: 575 ms


### XG Boost Model

In [44]:
# Check RMSE
RMSE_4

-1747.6402813310408

In [45]:
%%time
# Check prediction time
xg_pred = xg.predict(tef2)

CPU times: user 542 ms, sys: 0 ns, total: 542 ms
Wall time: 492 ms


### Best model analysis

In [46]:
# Check RMSE of the best model on the test set (arguments in the order rmse(target, pred))
print(rmse(test_target, cb_pred))

1801.4250092931532


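From the analysis above, CatBoost has the shortest prediction time of the three gradient boosting models (116 ms wall time, versus 492 ms for XGBoost and 575 ms for LightGBM). Its cross-validated RMSE (about 1815) is second to that of XGBoost (about 1748), but its final fit took about 15 s against about 1 min 37 s for XGBoost.
Weighing speed against prediction quality, the CatBoost model is the best choice; its RMSE on the test set is about 1801.
The decision tree is the fastest of all the tuned models, but its RMSE (about 2194) is worse than that of the gradient boosting models.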
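
The figures reported in the cells above can be gathered in one place for a side-by-side comparison. The sketch below copies the rounded cross-validated RMSE, the wall time of the final fit, and the wall time of `predict` on the test set from the outputs above. The structure `results` and the helper `rank_by` are illustrative. Note that the random forest and decision tree used 5 folds and the boosting models 3, so the RMSE values are only roughly comparable.

```python
# Rounded figures copied from the cell outputs above.
results = {
    "Random forest": {"cv_rmse": 2102.38, "fit_s": 15.3, "predict_ms": 251.0},
    "Decision tree": {"cv_rmse": 2193.70, "fit_s": 0.529, "predict_ms": 9.09},
    "CatBoost": {"cv_rmse": 1815.39, "fit_s": 14.9, "predict_ms": 116.0},
    "LightGBM": {"cv_rmse": 1856.90, "fit_s": 5.2, "predict_ms": 575.0},
    "XGBoost": {"cv_rmse": 1747.64, "fit_s": 97.0, "predict_ms": 492.0},
}


def rank_by(metric):
    """Return model names sorted from best (lowest) to worst for a metric."""
    return sorted(results, key=lambda name: results[name][metric])


for metric in ("cv_rmse", "fit_s", "predict_ms"):
    print(metric, rank_by(metric))
```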

<div class="alert alert-danger">
<s><b>Reviewer's comment</b>

Great, you tried a few different models and tuned their hyperparameters. There is a problem however that you used the test set for this. The test set should only be used to evaluate exactly one model: the best model according to cross-validation/validation set performance. Similarly either cross-validation or a separate validation set should be used for hyperparameter tuning.

</div>

<div class="alert alert-success">
<b>Reviewer's comment V2</b>

Very good!

</div>
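
The workflow described in the first comment can be sketched as follows: compare candidates with cross-validation on the training part only, then evaluate the single chosen model on the test set once. The data here is synthetic, and the depth grid is an arbitrary assumption for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data, used only to make the example runnable.
rng = np.random.RandomState(12345)
X = rng.uniform(-3, 3, size=(400, 2))
y = np.sin(X[:, 0]) * 10 + X[:, 1] ** 2 + rng.normal(scale=0.5, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=12345
)

# Step 1: tune on the training part with cross-validation only.
cv_rmse = {}
for depth in range(1, 8):
    model = DecisionTreeRegressor(max_depth=depth, random_state=12345)
    scores = cross_val_score(
        model, X_train, y_train, cv=5, scoring="neg_root_mean_squared_error"
    )
    cv_rmse[depth] = -scores.mean()  # flip the sign back to a positive RMSE

best_depth = min(cv_rmse, key=cv_rmse.get)

# Step 2: refit the chosen configuration and touch the test set exactly once.
final_model = DecisionTreeRegressor(max_depth=best_depth, random_state=12345)
final_model.fit(X_train, y_train)
test_rmse = mean_squared_error(y_test, final_model.predict(X_test)) ** 0.5
print(best_depth, test_rmse)
```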

<div class="alert alert-warning">
<b>Reviewer's comment</b>

It would also be great to measure training and prediction time for the tuned model of each type separately instead of just measuring the time it takes to tune hyperparameters (which is much more dependent on how many combinations of hyperparameters you're willing to try)

</div>

<div class="alert alert-success">
<b>Reviewer's comment V2</b>

It's nice that you measured prediction time!

</div>
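
Outside of the `%%time` cell magic, fit and prediction time can be measured separately with a small helper. This is a dependency-free sketch: `timed` and the stand-in `MeanModel` are hypothetical names, and any estimator with `fit` and `predict` methods could be used in its place.

```python
import time


def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


class MeanModel:
    """Stand-in model used only to keep the example dependency-free."""

    def fit(self, features, target):
        self.mean_ = sum(target) / len(target)
        return self

    def predict(self, features):
        return [self.mean_ for _ in features]


features = [[1], [2], [3], [4]]
target = [10.0, 20.0, 30.0, 40.0]

# Time the final fit and the prediction step separately.
model, fit_seconds = timed(MeanModel().fit, features, target)
predictions, predict_seconds = timed(model.predict, features)
print(f"fit: {fit_seconds:.6f} s, predict: {predict_seconds:.6f} s")
```

Storing the two durations per model makes it easy to tabulate them next to the RMSE.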

# Checklist

Type 'x' to check. Then press Shift+Enter.

- [x]  Jupyter Notebook is open
- [x]  Code is error free
- [x]  The cells with the code have been arranged in order of execution
- [x]  The data has been downloaded and prepared
- [x]  The models have been trained
- [x]  The analysis of speed and quality of the models has been performed