# Project Description

Rusty Bargain, a used car sales service, is developing an app to attract new customers. In the app, you can quickly find out the market value of your car. You have access to historical data: technical specifications, trim versions, and prices. You need to build a model to determine the value.

Rusty Bargain is interested in:

- the quality of the prediction;
- the speed of the prediction;
- the time required for training.

## Project instructions
1.	Download and look at the data.
2.	Train different models with various hyperparameters (You should make at least two different models, but more is better. Remember, various implementations of gradient boosting don't count as different models.) The main point of this step is to compare gradient boosting methods with random forest, decision tree, and linear regression.
3.	Analyze the speed and quality of the models.

**Notes:**
 - Use the RMSE metric to evaluate the models.
 - Linear regression is not very good for hyperparameter tuning, but it is perfect for doing a sanity check of other methods. If gradient boosting performs worse than linear regression, something definitely went wrong.
 - On your own, work with the LightGBM library and use its tools to build gradient boosting models.
 - Ideally, your project should include linear regression for a sanity check, a tree-based algorithm with hyperparameter tuning (preferably, random forest), LightGBM with hyperparameter tuning (try a couple of sets), and CatBoost and XGBoost with hyperparameter tuning (optional).
 - Take note of the encoding of categorical features for simple algorithms. LightGBM and CatBoost have their own implementations, but XGBoost requires OHE.
 - You can use a special command to find the cell code runtime in Jupyter Notebook. Find that command.
 - Since the training of a gradient boosting model can take a long time, change only a few model parameters.
 - If Jupyter Notebook stops working, delete the excessive variables by using the del operator.
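
Since both training time and prediction time matter here, it helps to measure them the same way for every model. Jupyter's `%%time` cell magic reports the runtime of a whole cell; outside of cell magics, a small context manager around `time.perf_counter` does a similar job. This is a minimal sketch: the names `timer` and `results` are illustrative and not part of the project template.

```python
import time
from contextlib import contextmanager

@contextmanager
def timer(label, results):
    """Store the elapsed wall-clock seconds of the with-block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start

results = {}
with timer('sum_of_squares', results):
    # stand-in workload; in the project this would be model.fit or model.predict
    total = sum(i * i for i in range(100_000))

print(f"sum_of_squares took {results['sum_of_squares']:.4f} s")
```

Collecting the timings in one dictionary makes it easy to put training and prediction times side by side in a results table later.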


## Data description
The dataset is stored in file /datasets/car_data.csv.

**Features**
 - DateCrawled — date profile was downloaded from the database
 - VehicleType — vehicle body type
 - RegistrationYear — vehicle registration year
 - Gearbox — gearbox type
 - Power — power (hp)
 - Model — vehicle model
 - Mileage — mileage (measured in km due to dataset's regional specifics)
 - RegistrationMonth — vehicle registration month
 - FuelType — fuel type
 - Brand — vehicle brand
 - NotRepaired — vehicle repaired or not
 - DateCreated — date of profile creation
 - NumberOfPictures — number of vehicle pictures
 - PostalCode — postal code of profile owner (user)
 - LastSeen — date of the last activity of the user

**Target**

Price — price (Euro)


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import time

from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from catboost import CatBoostRegressor, Pool
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import make_scorer
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

import warnings
warnings.filterwarnings('ignore')
import logging
logger = logging.getLogger()
logger.setLevel(logging.CRITICAL)
import timeit
from functools import lru_cache

## Data preparation

In [2]:
df = pd.read_csv('/datasets/car_data.csv')
display(df.head())

Unnamed: 0,DateCrawled,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Mileage,RegistrationMonth,FuelType,Brand,NotRepaired,DateCreated,NumberOfPictures,PostalCode,LastSeen
0,24/03/2016 11:52,480,,1993,manual,0,golf,150000,0,petrol,volkswagen,,24/03/2016 00:00,0,70435,07/04/2016 03:16
1,24/03/2016 10:58,18300,coupe,2011,manual,190,,125000,5,gasoline,audi,yes,24/03/2016 00:00,0,66954,07/04/2016 01:46
2,14/03/2016 12:52,9800,suv,2004,auto,163,grand,125000,8,gasoline,jeep,,14/03/2016 00:00,0,90480,05/04/2016 12:47
3,17/03/2016 16:54,1500,small,2001,manual,75,golf,150000,6,petrol,volkswagen,no,17/03/2016 00:00,0,91074,17/03/2016 17:40
4,31/03/2016 17:25,3600,small,2008,manual,69,fabia,90000,7,gasoline,skoda,no,31/03/2016 00:00,0,60437,06/04/2016 10:17


In [3]:
display(df.shape)

(354369, 16)

In [4]:
# renaming columns to snakecase

df = df.rename(columns={'DateCrawled': 'date_crawled', 'Price': 'price', 
                        'VehicleType': 'vehicle_type', 'RegistrationYear': 'registration_year', 
                        'Gearbox': 'gearbox', 'Power': 'power', 'Model': 'model', 'Mileage': 'mileage', 
                        'RegistrationMonth': 'registration_month', 'FuelType': 'fuel_type', 
                        'Brand': 'brand','NotRepaired': 'not_repaired','DateCreated': 'date_created', 
                        'NumberOfPictures': 'num_of_pics', 'PostalCode': 'postal_code', 'LastSeen': 'last_seen'})

In [5]:
display(df.sample(10))

Unnamed: 0,date_crawled,price,vehicle_type,registration_year,gearbox,power,model,mileage,registration_month,fuel_type,brand,not_repaired,date_created,num_of_pics,postal_code,last_seen
327786,09/03/2016 16:53,2650,small,2005,manual,101,fabia,150000,8,gasoline,skoda,,09/03/2016 00:00,0,47918,23/03/2016 08:16
293132,10/03/2016 20:58,2750,bus,1984,manual,72,other,150000,1,gasoline,mercedes_benz,no,10/03/2016 00:00,0,20359,15/03/2016 18:46
148406,15/03/2016 09:37,9999,sedan,2004,auto,334,a6,150000,4,petrol,audi,no,15/03/2016 00:00,0,12623,05/04/2016 19:18
216871,11/03/2016 18:43,3100,sedan,2002,manual,75,golf,150000,5,petrol,volkswagen,,11/03/2016 00:00,0,55743,12/03/2016 08:45
136986,22/03/2016 20:54,9900,wagon,2005,manual,224,a6,150000,8,gasoline,audi,no,22/03/2016 00:00,0,54516,06/04/2016 15:17
121168,31/03/2016 09:50,6299,wagon,2007,manual,109,touran,125000,7,cng,volkswagen,no,31/03/2016 00:00,0,4316,06/04/2016 02:45
325415,27/03/2016 12:38,3250,bus,2004,auto,125,zafira,125000,1,petrol,opel,no,27/03/2016 00:00,0,42109,30/03/2016 17:15
287409,04/04/2016 22:36,7299,wagon,2008,auto,140,passat,150000,0,,volkswagen,no,04/04/2016 00:00,0,47229,07/04/2016 03:17
285555,05/04/2016 07:51,15550,convertible,2001,auto,306,sl,5000,11,petrol,mercedes_benz,no,05/04/2016 00:00,0,56072,05/04/2016 11:48
49532,12/03/2016 18:55,1450,small,2004,manual,101,fiesta,150000,5,petrol,ford,,12/03/2016 00:00,0,50827,14/03/2016 11:44


In [6]:
display(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column              Non-Null Count   Dtype 
---  ------              --------------   ----- 
 0   date_crawled        354369 non-null  object
 1   price               354369 non-null  int64 
 2   vehicle_type        316879 non-null  object
 3   registration_year   354369 non-null  int64 
 4   gearbox             334536 non-null  object
 5   power               354369 non-null  int64 
 6   model               334664 non-null  object
 7   mileage             354369 non-null  int64 
 8   registration_month  354369 non-null  int64 
 9   fuel_type           321474 non-null  object
 10  brand               354369 non-null  object
 11  not_repaired        283215 non-null  object
 12  date_created        354369 non-null  object
 13  num_of_pics         354369 non-null  int64 
 14  postal_code         354369 non-null  int64 
 15  last_seen           354369 non-null  object
dtypes:

None

In [7]:
# checking for missing values

display(df.isna().sum())
display(df.isnull().mean().sort_values(ascending=False))

date_crawled              0
price                     0
vehicle_type          37490
registration_year         0
gearbox               19833
power                     0
model                 19705
mileage                   0
registration_month        0
fuel_type             32895
brand                     0
not_repaired          71154
date_created              0
num_of_pics               0
postal_code               0
last_seen                 0
dtype: int64

not_repaired          0.200791
vehicle_type          0.105794
fuel_type             0.092827
gearbox               0.055967
model                 0.055606
date_crawled          0.000000
price                 0.000000
registration_year     0.000000
power                 0.000000
mileage               0.000000
registration_month    0.000000
brand                 0.000000
date_created          0.000000
num_of_pics           0.000000
postal_code           0.000000
last_seen             0.000000
dtype: float64

In [8]:
# checking for duplicates

display(df.duplicated().sum())
display(df[df.duplicated()].head())

262

Unnamed: 0,date_crawled,price,vehicle_type,registration_year,gearbox,power,model,mileage,registration_month,fuel_type,brand,not_repaired,date_created,num_of_pics,postal_code,last_seen
14266,21/03/2016 19:06,5999,small,2009,manual,80,polo,125000,5,petrol,volkswagen,no,21/03/2016 00:00,0,65529,05/04/2016 20:47
27568,23/03/2016 10:38,12200,bus,2011,manual,125,zafira,40000,10,gasoline,opel,no,23/03/2016 00:00,0,26629,05/04/2016 07:44
31599,03/04/2016 20:41,4950,wagon,2003,auto,170,e_klasse,150000,4,gasoline,mercedes_benz,no,03/04/2016 00:00,0,48432,05/04/2016 21:17
33138,07/03/2016 20:45,10900,convertible,2005,auto,163,clk,125000,5,petrol,mercedes_benz,no,07/03/2016 00:00,0,61200,21/03/2016 03:45
43656,13/03/2016 20:48,4200,sedan,2003,manual,105,golf,150000,10,gasoline,volkswagen,no,13/03/2016 00:00,0,14482,13/03/2016 20:48


In [9]:
display(df.describe())

Unnamed: 0,price,registration_year,power,mileage,registration_month,num_of_pics,postal_code
count,354369.0,354369.0,354369.0,354369.0,354369.0,354369.0,354369.0
mean,4416.656776,2004.234448,110.094337,128211.172535,5.714645,0.0,50508.689087
std,4514.158514,90.227958,189.850405,37905.34153,3.726421,0.0,25783.096248
min,0.0,1000.0,0.0,5000.0,0.0,0.0,1067.0
25%,1050.0,1999.0,69.0,125000.0,3.0,0.0,30165.0
50%,2700.0,2003.0,105.0,150000.0,6.0,0.0,49413.0
75%,6400.0,2008.0,143.0,150000.0,9.0,0.0,71083.0
max,20000.0,9999.0,20000.0,150000.0,12.0,0.0,99998.0


In [10]:
target = 'price'
# wrap target in a list: set('price') would be the set of its characters
features = list(set(df.columns) - set([target]))

**Data Preparation: First Look Summary**

 - Each observation describes one car listing; there are 354,369 rows and 16 columns (15 features plus the target, `'price'`).
 - The top two features with missing values are: `'not_repaired'` (20%) and `'vehicle_type'` (11%).
 - There are 262 duplicated records. This is about 0.07% of the data.
 - The numerical data shows outliers for the `'price'`, `'registration_year'` and `'power'` columns. It also shows that `'num_of_pics'` is 0 in every row, so the column carries no information.
 - The `'date_crawled'`, `'date_created'`, and `'last_seen'` columns are stored as strings (`object`) and need to be converted to datetimes.
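
The outliers noted above (a price of 0, registration years of 1000 and 9999, power of 20000) could be filtered with simple plausibility bounds. The sketch below uses a toy frame, and the bounds are assumptions chosen for illustration, not values taken from the project; they would need to be checked against the real distributions.

```python
import pandas as pd

# toy rows that mimic the extremes reported by describe(); not real listings
toy = pd.DataFrame({
    'price': [0, 480, 18300, 20000],
    'registration_year': [1000, 1993, 2011, 9999],
    'power': [0, 75, 190, 20000],
})

# assumed plausibility bounds (inclusive); tune them after inspecting the data
bounds = {
    'price': (1, 20000),
    'registration_year': (1900, 2016),
    'power': (1, 1000),
}

# keep a row only if every bounded column falls inside its range
mask = pd.Series(True, index=toy.index)
for col, (low, high) in bounds.items():
    mask &= toy[col].between(low, high)

clean = toy[mask]
print(f'kept {len(clean)} of {len(toy)} rows')
```

Whether to drop such rows or to treat the extreme values as missing is a judgment call; dropping is the simpler option when the affected share of rows is small.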

### Fixing the Data

In [11]:
# getting rid of the duplicates

df.drop_duplicates(inplace=True)

In [12]:
# correcting the datatypes

df['date_crawled'] = pd.to_datetime(df['date_crawled'], format='%d/%m/%Y %H:%M')
df['date_created'] = pd.to_datetime(df['date_created'], format='%d/%m/%Y %H:%M')
df['last_seen'] = pd.to_datetime(df['last_seen'], format='%d/%m/%Y %H:%M')

In [13]:
display(df.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 354107 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column              Non-Null Count   Dtype         
---  ------              --------------   -----         
 0   date_crawled        354107 non-null  datetime64[ns]
 1   price               354107 non-null  int64         
 2   vehicle_type        316623 non-null  object        
 3   registration_year   354107 non-null  int64         
 4   gearbox             334277 non-null  object        
 5   power               354107 non-null  int64         
 6   model               334406 non-null  object        
 7   mileage             354107 non-null  int64         
 8   registration_month  354107 non-null  int64         
 9   fuel_type           321218 non-null  object        
 10  brand               354107 non-null  object        
 11  not_repaired        282962 non-null  object        
 12  date_created        354107 non-null  datetime64[ns]
 13  num_of_pics         354107 no

None

In [14]:
display(df['date_crawled'].min(), df['date_crawled'].max())

Timestamp('2016-03-05 14:06:00')

Timestamp('2016-04-07 14:36:00')

In [15]:
display(df['date_created'].min(), df['date_created'].max())

Timestamp('2014-03-10 00:00:00')

Timestamp('2016-04-07 00:00:00')

In [16]:
display(df['last_seen'].min(), df['last_seen'].max())

Timestamp('2016-03-05 14:15:00')

Timestamp('2016-04-07 14:58:00')

**Date Fields Summary**

 - The listings were crawled over about one month (2016-03-05 to 2016-04-07).
 - The `'date_created'` values span about two years (2014-03-10 to 2016-04-07).

### Creating a New Field and Date Features

In [17]:
df['days_on_site'] = (df['last_seen'] - df['date_created']).dt.days
display(np.any(df['days_on_site'] < 0))

False

In [18]:
df['create_year'] = df['date_created'].dt.year
df['create_month'] = df['date_created'].dt.month
df['create_day'] = df['date_created'].dt.day

### Fixing Missing Values

*Thinking about how to impute `'not_repaired'`: NaN may be a default value that someone forgot to change, so it is hard to know what it really means. I believe that replacing NaN with 'yes' is the more probable meaning of the default value.*

In [19]:
df['not_repaired'] = df['not_repaired'].fillna('yes')
df['not_repaired'] = (df['not_repaired'] == 'yes').astype('int')

In [20]:
# 1 = automatic, 0 = manual; missing gearbox values also become 0 because NaN == 'auto' is False
df['gearbox'] = (df['gearbox'] == 'auto').astype('int')

In [21]:
display(df.isna().sum())

date_crawled              0
price                     0
vehicle_type          37484
registration_year         0
gearbox                   0
power                     0
model                 19701
mileage                   0
registration_month        0
fuel_type             32889
brand                     0
not_repaired              0
date_created              0
num_of_pics               0
postal_code               0
last_seen                 0
days_on_site              0
create_year               0
create_month              0
create_day                0
dtype: int64

In [22]:
def impute_value(in_df, features, target):
    """
    Fill missing values of `target` with predictions from a decision tree
    trained on the rows where `target` is known.
    """
    df = in_df.copy()
    # label-encode the categorical predictors; NaN becomes its own 'None' class
    for col in df[features].select_dtypes('object').columns:
        df.loc[df[col].isna(), col] = 'None'
        df[col] = LabelEncoder().fit_transform(df[col])
    # replace datetime predictors with numeric hour/month/day parts
    for col in df[features].select_dtypes('datetime64').columns:
        df[f"{col}_hour"] = df[col].dt.hour
        df[f"{col}_month"] = df[col].dt.month
        df[f"{col}_day"] = df[col].dt.day
        del df[col]
    # every remaining column except the target (price included) is a predictor
    features = list(set(df.columns) - set([target]))
    train_df = df[~df[target].isna()]
    test_df = df[df[target].isna()]
    y_train = train_df[target].values
    X_train, X_test = train_df[features].values, test_df[features].values
    if len(X_test) == 0:
        return in_df
    model = DecisionTreeClassifier().fit(X_train, y_train)
    df.loc[df[target].isna(), target] = model.predict(X_test)
    in_df[target] = df[target]
    return in_df

In [23]:
for col in ['fuel_type', 'vehicle_type', 'model']:
  df = impute_value(df, features=list(set(df.columns)-set([col])-set([target])), target=col)
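
One way to gauge whether tree-based imputation can be trusted is to hide some known values and check how often the classifier recovers them. This is a minimal sketch on synthetic data: the column names mirror the notebook, but the rows and the rule linking `brand` to `fuel_type` are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1357)
n = 500
brand = rng.integers(0, 5, size=n)
power = rng.integers(50, 200, size=n)
# invented rule: fuel type depends only on the (encoded) brand
fuel_type = np.where(brand < 2, 'petrol', 'gasoline')
toy = pd.DataFrame({'brand': brand, 'power': power, 'fuel_type': fuel_type})

# hide 20% of the known labels and try to recover them
known, hidden = train_test_split(toy, test_size=0.2, random_state=1357)
clf = DecisionTreeClassifier(random_state=1357)
clf.fit(known[['brand', 'power']], known['fuel_type'])
accuracy = clf.score(hidden[['brand', 'power']], hidden['fuel_type'])
print(f'recovered {accuracy:.0%} of the hidden values')
```

On real data the recovery rate will be lower than on this toy rule, and a low rate would suggest keeping the missing values as a separate category instead.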

In [24]:
display(df.isna().sum())

date_crawled          0
price                 0
vehicle_type          0
registration_year     0
gearbox               0
power                 0
model                 0
mileage               0
registration_month    0
fuel_type             0
brand                 0
not_repaired          0
date_created          0
num_of_pics           0
postal_code           0
last_seen             0
days_on_site          0
create_year           0
create_month          0
create_day            0
dtype: int64

### Dropping Duplicates

In [25]:
display(df.duplicated().sum())

12

In [26]:
df.drop_duplicates(inplace = True)

In [27]:
display(df.duplicated().sum())

0

### Encoding the Categorical Features

In [28]:
encode_cols = ['postal_code']
encode_cols += df.select_dtypes(include=['object']).columns.tolist()
oe = OrdinalEncoder()
df[encode_cols] = oe.fit_transform(df[encode_cols])
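
Ordinal codes impose an arbitrary order on the categories. Tree-based models can split around that, but a linear model reads the codes as a numeric scale. For linear regression, one-hot encoding is the usual alternative. Below is a minimal sketch with `pd.get_dummies` on a toy frame; the rows are illustrative, and this is not the encoding the notebook applies.

```python
import pandas as pd

toy = pd.DataFrame({
    'fuel_type': ['petrol', 'gasoline', 'cng', 'petrol'],
    'gearbox': ['manual', 'auto', 'manual', 'manual'],
    'power': [75, 190, 109, 101],
})

# drop_first=True removes one dummy per feature to avoid perfectly collinear columns
ohe = pd.get_dummies(toy, columns=['fuel_type', 'gearbox'], drop_first=True)
print(ohe.columns.tolist())
```

High-cardinality columns such as `model` or `postal_code` would produce a very wide matrix this way, so they may need grouping or a different encoding.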

In [29]:
display(df.duplicated().sum())

0

### Dropping Features Not Used During Modeling

In [30]:
drop_cols = ['date_crawled', 'date_created', 'num_of_pics', 'last_seen']
df.drop(drop_cols, axis=1, inplace=True)

In [31]:
display(df.shape)

(354095, 16)

In [32]:
display(df.head())

Unnamed: 0,price,vehicle_type,registration_year,gearbox,power,model,mileage,registration_month,fuel_type,brand,not_repaired,postal_code,days_on_site,create_year,create_month,create_day
0,480,5.0,1993,0,0,116.0,150000,0,6.0,38.0,1,4898.0,14,2016,3,24
1,18300,2.0,2011,0,190,30.0,125000,5,2.0,1.0,1,4615.0,14,2016,3,24
2,9800,6.0,2004,1,163,117.0,125000,8,2.0,14.0,1,6992.0,22,2016,3,14
3,1500,5.0,2001,0,75,116.0,150000,6,6.0,38.0,0,7032.0,0,2016,3,17
4,3600,5.0,2008,0,69,101.0,90000,7,2.0,31.0,0,4212.0,6,2016,3,31


In [33]:
def convert_dtype(df, field, dtype=np.uint8):
    """
    Cast df[field] to dtype in place; report a failure instead of raising.
    """
    try:
        df[field] = df[field].astype(dtype)
    except Exception as exc:
        print(f'Failed to change dtype for {field}: {exc}')

In [34]:
dtype_cols = {
    'vehicle_type': np.uint8,
    'registration_year': np.uint16,
    'gearbox': np.uint8,
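    # caution: describe() shows power up to 20000, but uint8 only holds 0-255;
    # larger values wrap around on the cast, so np.uint16 is the safer choice
    'power': np.uint8,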
    'model': np.uint8,
    'registration_month': np.uint8,
    'fuel_type': np.uint8,
    'brand': np.uint8,
    'not_repaired': np.uint8,
    'postal_code': np.uint32,
    'days_on_site': np.uint16,
    'create_year': np.uint16,
    'create_month': np.uint8,
    'create_day': np.uint8
}

In [35]:
for col in dtype_cols:
    print(f'Conversion for {col}')
    convert_dtype(df, col, dtype_cols[col])

Conversion for vehicle_type
Conversion for registration_year
Conversion for gearbox
Conversion for power
Conversion for model
Conversion for registration_month
Conversion for fuel_type
Conversion for brand
Conversion for not_repaired
Conversion for postal_code
Conversion for days_on_site
Conversion for create_year
Conversion for create_month
Conversion for create_day
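
Downcasting saves memory, but an integer cast to a dtype that is too narrow wraps around silently instead of failing. A range check before the cast avoids that. This is a minimal sketch: the helper `smallest_uint` is hypothetical, and the toy values reuse the minima and maxima reported by `describe()` earlier.

```python
import numpy as np
import pandas as pd

def smallest_uint(series):
    """Return the narrowest unsigned integer dtype that holds every value."""
    low, high = int(series.min()), int(series.max())
    if low < 0:
        raise ValueError('negative values need a signed dtype')
    for dtype in (np.uint8, np.uint16, np.uint32, np.uint64):
        if high <= np.iinfo(dtype).max:
            return dtype
    raise ValueError('values exceed the uint64 range')

# toy columns built from the extremes shown by describe()
toy = pd.DataFrame({
    'registration_month': [0, 6, 12],
    'power': [0, 105, 20000],
    'postal_code': [1067, 49413, 99998],
})

chosen = {col: smallest_uint(toy[col]) for col in toy.columns}
for col, dtype in chosen.items():
    toy[col] = toy[col].astype(dtype)
    print(col, toy[col].dtype)
```

Deriving the dtype from the data rather than from a hand-written mapping keeps the conversion safe if the value ranges change.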


In [36]:
# checking dtypes

display(df.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 354095 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column              Non-Null Count   Dtype 
---  ------              --------------   ----- 
 0   price               354095 non-null  int64 
 1   vehicle_type        354095 non-null  uint8 
 2   registration_year   354095 non-null  uint16
 3   gearbox             354095 non-null  uint8 
 4   power               354095 non-null  uint8 
 5   model               354095 non-null  uint8 
 6   mileage             354095 non-null  int64 
 7   registration_month  354095 non-null  uint8 
 8   fuel_type           354095 non-null  uint8 
 9   brand               354095 non-null  uint8 
 10  not_repaired        354095 non-null  uint8 
 11  postal_code         354095 non-null  uint32
 12  days_on_site        354095 non-null  uint16
 13  create_year         354095 non-null  uint16
 14  create_month        354095 non-null  uint8 
 15  create_day          354095 non-null  uint8 
dtypes:

None

## Model training

In [37]:
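# first split: 80% train, 20% hold-out
train_X, test_X, train_y, test_y = train_test_split(
    df.drop(['price'], axis=1), df['price'], test_size=0.2, random_state=1357)
# second split: the hold-out is divided 90/10, so validation is 18%
# and test is 2% of all rows
valid_X, test_X, valid_y, test_y = train_test_split(
    test_X, test_y, test_size=0.1, random_state=1357)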
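
Because the second split is applied to the 20% hold-out, the final proportions are not obvious at a glance. The sketch below repeats the same two calls on 1,000 synthetic rows to show the resulting sizes; the arrays are placeholders, not the car data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(-1, 1)
y = np.arange(1000)

# first split: 80% train, 20% hold-out
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.2, random_state=1357)
# second split: the hold-out is divided 90/10 into validation and test
X_valid, X_test, y_valid, y_test = train_test_split(
    X_hold, y_hold, test_size=0.1, random_state=1357)

for name, part in [('train', X_train), ('valid', X_valid), ('test', X_test)]:
    print(f'{name}: {len(part)} rows ({len(part) / len(X):.0%})')
```

A test set of 2% is small; a 50/50 division of the hold-out would give validation and test sets of equal size if a more stable test estimate is wanted.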

In [38]:
def rmse(y_true, y_pred):
    """
    Return the root mean squared error between y_true and y_pred
    """
    return np.sqrt(mean_squared_error(y_true, y_pred))
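
RMSE is expressed in the same unit as the target (EUR here) and weights large errors more heavily than small ones. A small worked example makes the arithmetic concrete; `rmse_numpy` is an illustrative NumPy-only equivalent, named differently so it does not shadow the function above.

```python
import numpy as np

def rmse_numpy(y_true, y_pred):
    """Root mean squared error computed with NumPy only."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# errors are 1 and 7: squares are 1 and 49, their mean is 25, the root is 5
value = rmse_numpy([1000, 2000], [1001, 1993])
print(value)
```

The mean absolute error for the same two predictions would be 4, which shows how the single larger error pulls RMSE upward.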

In [47]:
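def build_model(model, train_X, train_y, valid_X, valid_y,
                hyperparameters=None,
                scoring=None, cv=5, model_str=None, cat_features=None, verbose=False):
    """
    Grid-search the model with cross-validation, refit the best estimator on
    the training set, and return it with its validation RMSE and the elapsed
    time in seconds (grid search plus refit).
    """
    np.random.seed(1357)
    # None defaults avoid sharing one mutable dict/list between calls
    hyperparameters = hyperparameters or {}
    cat_features = cat_features or []
    start_time = time.time()

    gs = GridSearchCV(model, param_grid=hyperparameters, cv=cv, scoring=scoring)
    gs.fit(train_X, train_y)
    if model_str == 'catboost':
        # refit so that CatBoost treats cat_features as categorical
        gs.best_estimator_.fit(train_X, train_y, cat_features=cat_features, verbose=verbose)
    else:
        # GridSearchCV has already refit the best estimator (refit=True by
        # default), so this second fit only adds to the measured time
        gs.best_estimator_.fit(train_X, train_y)
    preds = gs.best_estimator_.predict(valid_X)
    best_rmse = rmse(valid_y, preds)
    end_time = time.time() - start_time
    return gs.best_estimator_, best_rmse, np.round(end_time, 2)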
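
`GridSearchCV` always maximizes its score, which is why the RMSE scorer is built with `greater_is_better=False`: the scorer returns negative RMSE. The sketch below shows the same idea with scikit-learn's built-in `'neg_root_mean_squared_error'` scorer on synthetic data; the data, the estimator, and the `max_depth` grid are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# synthetic regression data, for illustration only
rng = np.random.default_rng(1357)
X = rng.uniform(0, 10, size=(200, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.5, size=200)

gs = GridSearchCV(
    DecisionTreeRegressor(random_state=1357),
    param_grid={'max_depth': [2, 4, 6]},
    scoring='neg_root_mean_squared_error',  # negated so that higher is better
    cv=5,
)
gs.fit(X, y)

cv_rmse = -gs.best_score_  # flip the sign to get the RMSE back
print(gs.best_params_, round(cv_rmse, 3))
```

Reporting `-gs.best_score_` next to the validation RMSE gives a cross-validated estimate that does not depend on a single split.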

In [48]:
lr = LinearRegression()
best_lr_model, best_lr_rmse, best_lr_time = build_model(
    lr, train_X, train_y, valid_X, valid_y,
    hyperparameters={},
    scoring=make_scorer(rmse, greater_is_better=False), cv=5, model_str='lr')

In [50]:
rf = RandomForestRegressor(random_state=1357)
best_rf_model, best_rf_rmse, best_rf_time = build_model(
    rf, train_X, train_y, valid_X, valid_y,
    hyperparameters={'n_estimators': [20, 30, 40, 50]},
    scoring=make_scorer(rmse, greater_is_better=False), cv=5)

In [49]:
cat_features = ['vehicle_type', 'gearbox', 'brand',
                'model', 'fuel_type', 'not_repaired', 'create_year', 'create_month',
                'create_day', 'postal_code']
cat = CatBoostRegressor(learning_rate=0.9, loss_function='RMSE',
                        random_seed=1357)
best_cb_model, best_cb_rmse, best_cb_time = build_model(
    cat, train_X, train_y, valid_X, valid_y,
    hyperparameters={'iterations': [20, 30, 40, 50]},
    scoring=make_scorer(rmse, greater_is_better=False), cv=5, model_str='catboost',
    cat_features=cat_features)

0:	learn: 2935.1716783	total: 102ms	remaining: 1.94s
1:	learn: 2622.0196792	total: 167ms	remaining: 1.5s
2:	learn: 2508.0531054	total: 214ms	remaining: 1.21s
3:	learn: 2414.9667515	total: 285ms	remaining: 1.14s
4:	learn: 2353.3062541	total: 341ms	remaining: 1.02s
5:	learn: 2299.3419647	total: 403ms	remaining: 941ms
6:	learn: 2252.0715898	total: 466ms	remaining: 866ms
7:	learn: 2229.0937277	total: 534ms	remaining: 800ms
8:	learn: 2208.5556937	total: 593ms	remaining: 725ms
9:	learn: 2166.7577044	total: 659ms	remaining: 659ms
10:	learn: 2138.8194440	total: 717ms	remaining: 587ms
11:	learn: 2119.6934609	total: 774ms	remaining: 516ms
12:	learn: 2110.0292754	total: 825ms	remaining: 444ms
13:	learn: 2091.6352953	total: 890ms	remaining: 381ms
14:	learn: 2075.7257912	total: 950ms	remaining: 317ms
15:	learn: 2059.9497759	total: 1s	remaining: 251ms
16:	learn: 2044.2559499	total: 1.07s	remaining: 190ms
17:	learn: 2034.9545492	total: 1.14s	remaining: 126ms
18:	learn: 2025.2927607	total: 1.19s	remai

In [51]:
model_results_df = pd.DataFrame({'model': ['LinearRegression', 'RandomForestRegressor', 'CatBoostRegressor'],
              'best_rmse': [best_lr_rmse, best_rf_rmse, best_cb_rmse],
              'train_time': [best_lr_time, best_rf_time, best_cb_time]})

In [52]:
model_results_df

Unnamed: 0,model,best_rmse,train_time
0,LinearRegression,3409.077419,2.22
1,RandomForestRegressor,1780.134404,1544.81
2,CatBoostRegressor,1913.10894,69.75


In [53]:
# best features from rf model

pd.DataFrame(best_rf_model.feature_importances_, index=df.columns[1:].tolist()).sort_values(0, ascending=False)

Unnamed: 0,0
registration_year,0.525681
power,0.180397
brand,0.049744
model,0.036818
mileage,0.033201
vehicle_type,0.032936
postal_code,0.032415
days_on_site,0.027056
registration_month,0.020496
create_day,0.020078


In [54]:
# best features from cb model

pd.DataFrame(best_cb_model.feature_importances_, index=df.columns[1:].tolist()).sort_values(0, ascending=False)

Unnamed: 0,0
registration_year,42.727394
power,22.90697
brand,12.844049
vehicle_type,5.558732
mileage,4.799621
model,3.610021
not_repaired,2.898646
days_on_site,1.511695
gearbox,1.451955
fuel_type,0.934423


 - The RandomForestRegressor and CatBoostRegressor were trained using GridSearchCV with 5-fold cross-validation. For the sake of time, only one hyperparameter was tuned per model: `n_estimators` for the random forest and `iterations` for CatBoost.
 - The RandomForestRegressor was the slowest to train, taking about 1,545 seconds, whereas LinearRegression was the fastest at 2.22 seconds. On the validation set, however, RandomForestRegressor has the best RMSE, about EUR 1,780.
 - The top 3 important features from RandomForest and CatBoost are: `'registration_year'`, `'power'`, and `'brand'`.

## Model analysis

*Let's check our results.*

In [55]:
def test_prediction(model, test_X, test_y):
    """
    Predict RMSE on the test set
    """
    start_time = time.time()
    preds = model.predict(test_X)
    return np.round(time.time() - start_time, 2), rmse(test_y, preds)

In [56]:
models = [best_lr_model, best_rf_model, best_cb_model]
for model in models:
    pred_time, best_rmse = test_prediction(model, test_X, test_y)
    print(f'The RMSE on the test set for {model} is {best_rmse}, and time to predict {pred_time} seconds')

The RMSE on the test set for LinearRegression() is 3319.9717363810455, and time to predict 0.0 seconds
The RMSE on the test set for RandomForestRegressor(n_estimators=50, random_state=1357) is 1764.6702620752321, and time to predict 0.36 seconds
The RMSE on the test set for <catboost.core.CatBoostRegressor object at 0x7fecdc323040> is 1908.7739149219926, and time to predict 0.02 seconds


 - The results of testing the best models on the test set reveal that RandomForestRegressor has the best RMSE (about EUR 1,765); however, its predictions took 0.36 seconds, compared with 0.02 seconds for CatBoostRegressor.

**Summary**

 - Rusty Bargain will have to make a compromise should it decide between the top 2 models, RandomForestRegressor and CatBoostRegressor: the random forest takes about 22x longer to train (1,544.81 versus 69.75 seconds), while the CatBoost validation RMSE is only about 1.07x higher (1,913 versus 1,780).

 - However, CatBoostRegressor offers several hyperparameters to tune, and increasing the number of iterations and adding regularization may help in achieving a better RMSE. Therefore, in my view, they should pick CatBoostRegressor because it trains and predicts much faster, and it offers several options to improve the evaluation metric.
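
The trade-off between the two models can be recomputed directly from the recorded results table. The sketch below copies the validation RMSE and training time from that table; the dictionary layout is only for illustration.

```python
# validation RMSE and training time (seconds) copied from the results table
results = {
    'RandomForestRegressor': {'rmse': 1780.134404, 'train_time': 1544.81},
    'CatBoostRegressor': {'rmse': 1913.10894, 'train_time': 69.75},
}

rf = results['RandomForestRegressor']
cb = results['CatBoostRegressor']
time_ratio = rf['train_time'] / cb['train_time']
rmse_ratio = cb['rmse'] / rf['rmse']

print(f'random forest trains {time_ratio:.1f}x longer')
print(f'CatBoost RMSE is {rmse_ratio:.3f}x the random forest RMSE')
```

Note that both training times include the grid search over four candidate values, so they describe the cost of tuning, not of a single fit.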

# Checklist

Type 'x' to check. Then press Shift+Enter.

- [x]  Jupyter Notebook is open
- [ ]  Code is error free
- [ ]  The cells with the code have been arranged in order of execution
- [ ]  The data has been downloaded and prepared
- [ ]  The models have been trained
- [ ]  The analysis of speed and quality of the models has been performed