## Project Introduction
The goal is to build a model that estimates the market value of a used car. Rusty Bargain, a used car sales service, is developing an app to attract new customers; in that app, users can quickly find out the market value of their car. We have access to historical data: technical specifications, trim versions, and prices.

Rusty Bargain is interested in:

- the quality of the prediction;
- the speed of the prediction;
- the time required for training.

## Data preparation

In [1]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from joblib import dump

from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import (
    accuracy_score, f1_score, make_scorer, mean_absolute_error,
    mean_squared_error, precision_recall_curve, precision_score,
    r2_score, recall_score, roc_auc_score, roc_curve,
)
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.tree import DecisionTreeRegressor
from sklearn.utils import resample, shuffle

from catboost import CatBoostRegressor
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor

In [2]:
# Load the dataset from the platform path, falling back to a local copy
try:
    data = pd.read_csv('/datasets/car_data.csv')
    print("data on site")
except FileNotFoundError:
    data = pd.read_csv('car_data.csv')
    print("data from local")

data on site


In [3]:
data.head()

Unnamed: 0,DateCrawled,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Mileage,RegistrationMonth,FuelType,Brand,NotRepaired,DateCreated,NumberOfPictures,PostalCode,LastSeen
0,24/03/2016 11:52,480,,1993,manual,0,golf,150000,0,petrol,volkswagen,,24/03/2016 00:00,0,70435,07/04/2016 03:16
1,24/03/2016 10:58,18300,coupe,2011,manual,190,,125000,5,gasoline,audi,yes,24/03/2016 00:00,0,66954,07/04/2016 01:46
2,14/03/2016 12:52,9800,suv,2004,auto,163,grand,125000,8,gasoline,jeep,,14/03/2016 00:00,0,90480,05/04/2016 12:47
3,17/03/2016 16:54,1500,small,2001,manual,75,golf,150000,6,petrol,volkswagen,no,17/03/2016 00:00,0,91074,17/03/2016 17:40
4,31/03/2016 17:25,3600,small,2008,manual,69,fabia,90000,7,gasoline,skoda,no,31/03/2016 00:00,0,60437,06/04/2016 10:17


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   DateCrawled        354369 non-null  object
 1   Price              354369 non-null  int64 
 2   VehicleType        316879 non-null  object
 3   RegistrationYear   354369 non-null  int64 
 4   Gearbox            334536 non-null  object
 5   Power              354369 non-null  int64 
 6   Model              334664 non-null  object
 7   Mileage            354369 non-null  int64 
 8   RegistrationMonth  354369 non-null  int64 
 9   FuelType           321474 non-null  object
 10  Brand              354369 non-null  object
 11  NotRepaired        283215 non-null  object
 12  DateCreated        354369 non-null  object
 13  NumberOfPictures   354369 non-null  int64 
 14  PostalCode         354369 non-null  int64 
 15  LastSeen           354369 non-null  object
dtypes: int64(7), object(

In [5]:
# Make data type corrections
data['DateCrawled'] = pd.to_datetime(data['DateCrawled'])
data['LastSeen'] = pd.to_datetime(data['LastSeen'])
data['DateCreated'] = pd.to_datetime(data['DateCreated'])
data['PostalCode'] = data['PostalCode'].astype('object')
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype         
---  ------             --------------   -----         
 0   DateCrawled        354369 non-null  datetime64[ns]
 1   Price              354369 non-null  int64         
 2   VehicleType        316879 non-null  object        
 3   RegistrationYear   354369 non-null  int64         
 4   Gearbox            334536 non-null  object        
 5   Power              354369 non-null  int64         
 6   Model              334664 non-null  object        
 7   Mileage            354369 non-null  int64         
 8   RegistrationMonth  354369 non-null  int64         
 9   FuelType           321474 non-null  object        
 10  Brand              354369 non-null  object        
 11  NotRepaired        283215 non-null  object        
 12  DateCreated        354369 non-null  datetime64[ns]
 13  NumberOfPictures   354369 non-null  int64   
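
The raw timestamps shown by `data.head()` look day-first (for example `24/03/2016 11:52`), so ambiguous values such as `07/04/2016` can be read month-first when `pd.to_datetime` is called without a format. A minimal sketch of parsing with an explicit format, using a hypothetical two-row sample rather than the full dataset:

```python
import pandas as pd

# Hypothetical sample mirroring the raw 'LastSeen' strings shown by data.head()
raw = pd.Series(['07/04/2016 03:16', '24/03/2016 11:52'])

# An explicit day-first format removes the day/month ambiguity
parsed = pd.to_datetime(raw, format='%d/%m/%Y %H:%M')

print(parsed.dt.day.tolist(), parsed.dt.month.tolist())
```

The date columns are dropped before modeling in this notebook, so this only matters if they are later used as features.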

In [6]:
# Checking missing information
data.isna().sum()

DateCrawled              0
Price                    0
VehicleType          37490
RegistrationYear         0
Gearbox              19833
Power                    0
Model                19705
Mileage                  0
RegistrationMonth        0
FuelType             32895
Brand                    0
NotRepaired          71154
DateCreated              0
NumberOfPictures         0
PostalCode               0
LastSeen                 0
dtype: int64

In [7]:
#Renaming columns
data = data.rename(columns={'DateCrawled': 'date_crawled', 'Price': 'price', 'VehicleType':'vehicle_type', \
                     'RegistrationYear': 'registration_year', 'Gearbox':'gearbox', 'Power':'power', 'Model': 'model', \
                     'Mileage':'mileage', 'RegistrationMonth':'registration_month','FuelType':'fuel_type', 'Brand':'brand', \
                     'NotRepaired':'not_repaired','DateCreated': 'date_created','NumberOfPictures':'number_of_pictures',\
                     'PostalCode':'postal_code','LastSeen':'last_seen'})
data.columns

Index(['date_crawled', 'price', 'vehicle_type', 'registration_year', 'gearbox',
       'power', 'model', 'mileage', 'registration_month', 'fuel_type', 'brand',
       'not_repaired', 'date_created', 'number_of_pictures', 'postal_code',
       'last_seen'],
      dtype='object')

In [8]:
data.describe()

Unnamed: 0,price,registration_year,power,mileage,registration_month,number_of_pictures
count,354369.0,354369.0,354369.0,354369.0,354369.0,354369.0
mean,4416.656776,2004.234448,110.094337,128211.172535,5.714645,0.0
std,4514.158514,90.227958,189.850405,37905.34153,3.726421,0.0
min,0.0,1000.0,0.0,5000.0,0.0,0.0
25%,1050.0,1999.0,69.0,125000.0,3.0,0.0
50%,2700.0,2003.0,105.0,150000.0,6.0,0.0
75%,6400.0,2008.0,143.0,150000.0,9.0,0.0
max,20000.0,9999.0,20000.0,150000.0,12.0,0.0


In [9]:
# Replace NaNs in the vehicle_type and model columns with the string 'missing'.
data['vehicle_type'] = data['vehicle_type'].fillna('missing')
data['model'] = data['model'].fillna('missing')

# Verify
display(data['vehicle_type'].value_counts(dropna=False))
display(data['model'].value_counts(dropna=False))

sedan          91457
small          79831
wagon          65166
missing        37490
bus            28775
convertible    20203
coupe          16163
suv            11996
other           3288
Name: vehicle_type, dtype: int64

golf                  29232
other                 24421
3er                   19761
missing               19705
polo                  13066
                      ...  
serie_2                   8
rangerover                4
serie_3                   4
serie_1                   2
range_rover_evoque        2
Name: model, Length: 251, dtype: int64

In [10]:
data['gearbox'].value_counts(dropna=False)

manual    268251
auto       66285
NaN        19833
Name: gearbox, dtype: int64

In [11]:
data['gearbox'] = data['gearbox'].replace(np.nan, 'missing')
data['gearbox'].value_counts(dropna=False)

manual     268251
auto        66285
missing     19833
Name: gearbox, dtype: int64

In [12]:
# Inspect the registration years
data['registration_year'].unique()

# Keep only plausible registration years (after 1886 and before 2023)
data = data.query('registration_year > 1886 and registration_year < 2023')

In [13]:
data['power'].unique()

array([    0,   190,   163,    75,    69,   102,   109,    50,   125,
         101,   105,   140,   115,   131,    60,   136,   160,   231,
          90,   118,   193,    99,   113,   218,   122,   129,    70,
         306,    95,    61,   177,    80,   170,    55,   143,    64,
         286,   232,   150,   156,    82,   155,    54,   185,    87,
         180,    86,    84,   224,   235,   200,   178,   265,    77,
         110,   144,   120,   116,   184,   126,   204,    88,   194,
         305,   197,   179,   250,    45,   313,    41,   165,    98,
         130,   114,   211,    56,   201,   213,    58,   107,    83,
         174,   100,   220,    85,    73,   192,    68,    66,   299,
          74,    52,   147,    65,   310,    71,    97,   239,   203,
           5,   300,   103,   258,   320,    63,    81,   148,    44,
         145,   230,   280,   260,   104,   188,   333,   186,   117,
         141,    59,   132,   234,   158,    39,    92,    51,   135,
          53,   209,

In [14]:
# Where power is 0, replace it with the mean power of cars with the same fuel type.
# Rows with a missing fuel_type belong to no group, so their power becomes NaN here.
data['power'] = np.where((data['power'] == 0), data.groupby('fuel_type')['power'].transform('mean').round(0), data['power'])
data.power.value_counts(dropna=False)

75.0      24020
106.0     20393
60.0      15897
150.0     14590
101.0     13298
          ...  
409.0         1
2005.0        1
332.0         1
645.0         1
9010.0        1
Name: power, Length: 711, dtype: int64

In [15]:
# Replace implausible horsepower values (above 700 or below 50) with the brand mean
data['power'] = np.where((data['power'] > 700), data.groupby('brand')['power'].transform('mean').round(0), data['power'])
data['power'] = np.where((data['power'] < 50), data.groupby('brand')['power'].transform('mean').round(0), data['power'])
#data['power'] = data['power'].astype('int')
data['power'].unique()

array([106., 190., 163.,  75.,  69., 102., 109.,  50., 125., 101., 105.,
       140., 115., 131.,  60., 136., 160., 231.,  90., 118., 193.,  99.,
       113., 218., 122., 128., 129.,  70., 306.,  95.,  61., 177.,  80.,
       170.,  55., 143.,  nan,  64., 286., 232., 150., 156.,  82., 155.,
        54., 185.,  87., 180.,  86.,  84., 224., 235., 200., 178., 265.,
        77., 110., 144., 120., 116., 184., 126., 204.,  88., 194., 305.,
       197., 179., 250., 104., 313.,  67., 165.,  98., 130., 114., 211.,
        56., 201., 213.,  58., 107.,  83., 174., 100., 220.,  85.,  73.,
       192.,  68.,  66., 299.,  74.,  52., 147.,  65.,  51., 310.,  71.,
        97., 239., 203., 138., 300., 103., 258., 320.,  63.,  81., 148.,
       145., 230., 280., 260., 188., 333., 186., 117., 141.,  59., 132.,
       234., 158.,  92., 135.,  53., 209.,  93., 146., 166., 276., 344.,
        72., 249., 237., 245., 111., 326., 279., 175.,  96., 226.,  43.,
       301., 334., 133., 124., 219., 241., 167.,  9
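
The group means used above include the implausible readings themselves (zeros and extreme values), which pulls the fill value away from typical cars. One possible alternative, sketched here on a hypothetical toy frame, is to mark implausible values as missing first and then fill from the brand median of the remaining values. The 50 to 700 hp range follows the thresholds used in the notebook; the column name `power_clean` is an assumption for illustration.

```python
import pandas as pd

# Hypothetical toy data with the same kinds of problems as the 'power' column
toy = pd.DataFrame({
    'brand': ['audi', 'audi', 'audi', 'skoda', 'skoda', 'jeep'],
    'power': [190, 0, 150, 69, 9010, 0],
})

# Keep values inside the assumed plausible range; everything else becomes NaN
power = toy['power'].where(toy['power'].between(50, 700)).astype(float)

# Brand median computed on plausible values only
brand_median = power.groupby(toy['brand']).transform('median')

# Fill from the brand median, then fall back to the overall median
toy['power_clean'] = power.fillna(brand_median).fillna(power.median())

print(toy)
```

The fallback to the overall median covers brands with no plausible reading at all, such as the single `jeep` row in this toy example.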

In [16]:
data.fuel_type.unique()

array(['petrol', 'gasoline', nan, 'lpg', 'other', 'hybrid', 'cng',
       'electric'], dtype=object)

In [17]:
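# Fill missing fuel_type values from cars with the same power.
# Note: 'max' on strings returns the alphabetically last fuel type in each power group,
# not the most frequent one. Rows that still cannot be filled are dropped.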
data.fuel_type = data.fuel_type.fillna('X')
data['fuel_type'] = np.where((data['fuel_type'] == 'X'), data.groupby('power')['fuel_type'].transform('max'), data['fuel_type'])
data.fuel_type = data.fuel_type.replace('X', np.nan)
data.dropna(subset=['fuel_type'], how='any', inplace=True)
print(data['fuel_type'].unique())
print(data.shape)
print(data['fuel_type'].value_counts())
data['fuel_type'].shape


['petrol' 'gasoline' 'lpg' 'other' 'hybrid' 'cng' 'electric']
(342633, 16)
petrol      237513
gasoline     98717
lpg           5313
cng            564
hybrid         233
other          203
electric        90
Name: fuel_type, dtype: int64


(342633,)
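
Because `transform('max')` on a string column picks the alphabetically last value, it tends to favour `petrol` regardless of how common it is in the group. A minimal sketch of filling with the most frequent fuel type per power value instead, on a hypothetical toy frame (the helper `most_frequent` and the column `fuel_type_filled` are assumptions, not part of the notebook):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two power groups with some missing fuel types
toy = pd.DataFrame({
    'power': [75, 75, 75, 75, 140, 140],
    'fuel_type': ['gasoline', 'gasoline', 'petrol', np.nan, 'petrol', np.nan],
})

def most_frequent(values):
    """Return the most common non-missing value, or NaN if there is none."""
    modes = values.mode()  # mode() ignores NaN by default
    return modes.iloc[0] if not modes.empty else np.nan

# One fill value per power group, then mapped back onto the rows
mode_by_power = toy.groupby('power')['fuel_type'].agg(most_frequent)
toy['fuel_type_filled'] = toy['fuel_type'].fillna(toy['power'].map(mode_by_power))

print(toy)
```

In the 75 hp group the alphabetical maximum would be `petrol`, while the most frequent value is `gasoline`, which is what this sketch fills in.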

In [18]:
data['price'].describe()
print(data[data['price'] == 0].value_counts().sum())

5176


In [19]:
#Dropping abnormal values in price & changing data type
data['price'] = np.where((data['price'] < 100), np.nan, data['price'])
data.dropna(subset=['price'], how='any', inplace=True)
data['price'] = data['price'].astype('int32')
print(data.price.value_counts(dropna=False))
print()
print(data['price'].isna().sum())
print()
print(data[data['price'] == 0].value_counts().sum())

500      5355
1500     5204
1200     4416
1000     4375
2500     4289
         ... 
5920        1
1687        1
9875        1
19870       1
8188        1
Name: price, Length: 3651, dtype: int64

0

0


In [20]:
# Replacing NaN in not_repaired col
data['not_repaired'].unique()
data['not_repaired'] = data['not_repaired'].replace(np.nan, 'unknown')
print(data['not_repaired'].unique())
print(data['not_repaired'].shape)

['unknown' 'yes' 'no']
(332216,)


In [21]:
# No more missing info in dataset
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 332216 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column              Non-Null Count   Dtype         
---  ------              --------------   -----         
 0   date_crawled        332216 non-null  datetime64[ns]
 1   price               332216 non-null  int32         
 2   vehicle_type        332216 non-null  object        
 3   registration_year   332216 non-null  int64         
 4   gearbox             332216 non-null  object        
 5   power               332216 non-null  float64       
 6   model               332216 non-null  object        
 7   mileage             332216 non-null  int64         
 8   registration_month  332216 non-null  int64         
 9   fuel_type           332216 non-null  object        
 10  brand               332216 non-null  object        
 11  not_repaired        332216 non-null  object        
 12  date_created        332216 non-null  datetime64[ns]
 13  number_of_pictures  332216 no

In [22]:
data.corr()

Unnamed: 0,price,registration_year,power,mileage,registration_month,number_of_pictures
price,1.0,0.394548,0.498142,-0.374701,0.080839,
registration_year,0.394548,1.0,0.087289,-0.219629,0.029776,
power,0.498142,0.087289,1.0,0.094712,0.044743,
mileage,-0.374701,-0.219629,0.094712,1.0,-0.012495,
registration_month,0.080839,0.029776,0.044743,-0.012495,1.0,
number_of_pictures,,,,,,


In [23]:
#Resetting indexes in the original dataset.
data= data.reset_index(drop=True)
display(data)

Unnamed: 0,date_crawled,price,vehicle_type,registration_year,gearbox,power,model,mileage,registration_month,fuel_type,brand,not_repaired,date_created,number_of_pictures,postal_code,last_seen
0,2016-03-24 11:52:00,480,missing,1993,manual,106.0,golf,150000,0,petrol,volkswagen,unknown,2016-03-24,0,70435,2016-07-04 03:16:00
1,2016-03-24 10:58:00,18300,coupe,2011,manual,190.0,missing,125000,5,gasoline,audi,yes,2016-03-24,0,66954,2016-07-04 01:46:00
2,2016-03-14 12:52:00,9800,suv,2004,auto,163.0,grand,125000,8,gasoline,jeep,unknown,2016-03-14,0,90480,2016-05-04 12:47:00
3,2016-03-17 16:54:00,1500,small,2001,manual,75.0,golf,150000,6,petrol,volkswagen,no,2016-03-17,0,91074,2016-03-17 17:40:00
4,2016-03-31 17:25:00,3600,small,2008,manual,69.0,fabia,90000,7,gasoline,skoda,no,2016-03-31,0,60437,2016-06-04 10:17:00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
332211,2016-03-19 19:53:00,3200,sedan,2004,manual,225.0,leon,150000,5,petrol,seat,yes,2016-03-19,0,96465,2016-03-19 20:44:00
332212,2016-03-27 20:36:00,1150,bus,2000,manual,106.0,zafira,150000,3,petrol,opel,no,2016-03-27,0,26624,2016-03-29 10:17:00
332213,2016-05-03 19:56:00,1199,convertible,2000,auto,101.0,fortwo,125000,3,petrol,smart,no,2016-05-03,0,26135,2016-11-03 18:17:00
332214,2016-03-19 18:57:00,9200,bus,1996,manual,102.0,transporter,150000,3,gasoline,volkswagen,no,2016-03-19,0,87439,2016-07-04 07:15:00


In [24]:
# Categorical feature prep: label-encode each categorical column on a copy of the data

categorical_columns = ['vehicle_type', 'gearbox', 'model', 'fuel_type', 'brand', 'not_repaired']
data_ordinal = data.copy()
for column in categorical_columns:
    # Fit a fresh encoder per column so the mappings stay independent
    data_ordinal[column] = LabelEncoder().fit_transform(data_ordinal[column])
display(data_ordinal)

Unnamed: 0,date_crawled,price,vehicle_type,registration_year,gearbox,power,model,mileage,registration_month,fuel_type,brand,not_repaired,date_created,number_of_pictures,postal_code,last_seen
0,2016-03-24 11:52:00,480,3,1993,1,106.0,116,150000,0,6,38,1,2016-03-24,0,70435,2016-07-04 03:16:00
1,2016-03-24 10:58:00,18300,2,2011,1,190.0,153,125000,5,2,1,2,2016-03-24,0,66954,2016-07-04 01:46:00
2,2016-03-14 12:52:00,9800,7,2004,0,163.0,117,125000,8,2,14,1,2016-03-14,0,90480,2016-05-04 12:47:00
3,2016-03-17 16:54:00,1500,6,2001,1,75.0,116,150000,6,6,38,0,2016-03-17,0,91074,2016-03-17 17:40:00
4,2016-03-31 17:25:00,3600,6,2008,1,69.0,101,90000,7,2,31,0,2016-03-31,0,60437,2016-06-04 10:17:00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
332211,2016-03-19 19:53:00,3200,5,2004,1,225.0,140,150000,5,6,30,2,2016-03-19,0,96465,2016-03-19 20:44:00
332212,2016-03-27 20:36:00,1150,0,2000,1,106.0,250,150000,3,6,24,0,2016-03-27,0,26624,2016-03-29 10:17:00
332213,2016-05-03 19:56:00,1199,1,2000,0,101.0,106,125000,3,6,32,0,2016-05-03,0,26135,2016-11-03 18:17:00
332214,2016-03-19 18:57:00,9200,0,1996,1,102.0,225,150000,3,2,38,0,2016-03-19,0,87439,2016-07-04 07:15:00
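
The encoding above is fitted on the full dataset before the train/test split. A possible variation, sketched here under the assumption that scikit-learn 0.24 or newer is available, is to fit an `OrdinalEncoder` on the training rows only and map categories unseen at prediction time to a sentinel value. The toy frames and the sentinel `-1` are illustrative choices.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical train/test frames; 'jeep' appears only in the test rows
train = pd.DataFrame({'brand': ['audi', 'skoda', 'audi'],
                      'gearbox': ['manual', 'auto', 'manual']})
test = pd.DataFrame({'brand': ['skoda', 'jeep'],
                     'gearbox': ['auto', 'manual']})

# Unknown categories are encoded as -1 instead of raising an error
encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
train_enc = encoder.fit_transform(train)
test_enc = encoder.transform(test)

print(test_enc)
```

This keeps the app usable when a customer enters a brand or model that was absent from the training data.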


## Model training

In [25]:
#Assigning features and target for the model construction.
features = data_ordinal.drop(['price', 'date_crawled', 'registration_month', 'date_created', \
                              'number_of_pictures', \
                              'postal_code', 'last_seen'], axis=1)
target = data_ordinal['price']

#Train test split.
features_train, features_test, target_train, target_test = train_test_split(features, target, \
                                                                            test_size=0.25, random_state=12345)

In [26]:
#Creating pipelines

pipe_dec = Pipeline([('scaler0', StandardScaler()),
                    ('DecisionTreeRegressor', DecisionTreeRegressor())])

pipe_randomfor = Pipeline([('scaler1', StandardScaler()),
                    ('RandomForestRegressor', RandomForestRegressor(n_estimators=100))])

pipe_linear = Pipeline([('scaler2', StandardScaler()),
                       ('LinearRegression(Dummy)', LinearRegression())])

pipe_cat_boost = Pipeline([('scaler3', StandardScaler()),
                       ('CatBoostRegressor', CatBoostRegressor(verbose=500))])

pipe_lgbm =  Pipeline([('scaler4', StandardScaler()),
                       ('LGBMRegressor', LGBMRegressor())])

pipe_xgb = Pipeline([('scaler5', StandardScaler()),
                       ('XGBRegressor', XGBRegressor())])

In [27]:
# Creating list of pipelines

pipelines = [pipe_dec, pipe_randomfor, pipe_linear, pipe_cat_boost, pipe_lgbm, pipe_xgb]

# Creating dictionary for pipelines

pipe_dict = {pipe_dec: 'DecisionTreeRegressor', pipe_randomfor: 'RandomForestRegressor',
            pipe_linear: 'LinearRegression', pipe_cat_boost: 'CatBoostRegressor', 
            pipe_lgbm: 'LGBMRegressor', pipe_xgb: 'XGBRegressor'}

In [28]:

# Define a function to calculate RMSE
def calculate_rmse(target, predictions): 
    score = mean_squared_error(target, predictions)
    score = score ** 0.5
    return score



In [29]:
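# Loop through pipelines to obtain cross-validation scores.
# Create a scoring function for RMSE. With greater_is_better=False scikit-learn
# negates the metric, so the scores printed below are negative RMSE values.
rmse_scorer = make_scorer(calculate_rmse, greater_is_better=False)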
for pipe in pipelines:
    print(pipe_dict[pipe])
    scores = cross_val_score(pipe, features_train, target_train, scoring=rmse_scorer, cv=5)
    print("Cross-Validation RMSE Scores:", scores)
    print("Mean RMSE:", scores.mean())


DecisionTreeRegressor
Cross-Validation RMSE Scores: [-2016.21546012 -2049.55670748 -2042.78023742 -2050.83239782
 -2030.5062729 ]
Mean RMSE: -2037.9782151492648
RandomForestRegressor
Cross-Validation RMSE Scores: [-1629.02336615 -1647.09318463 -1648.13259129 -1646.98744706
 -1654.11560217]
Mean RMSE: -1645.0704382594279
LinearRegression
Cross-Validation RMSE Scores: [-2972.20310236 -2973.52493574 -2996.00816945 -2988.41493048
 -3017.61000811]
Mean RMSE: -2989.5522292276864
CatBoostRegressor
Learning rate set to 0.094517
0:	learn: 4262.2806753	total: 92.6ms	remaining: 1m 32s
500:	learn: 1648.7170707	total: 20.7s	remaining: 20.6s
999:	learn: 1568.4062773	total: 41.1s	remaining: 0us
Learning rate set to 0.094517
0:	learn: 4256.0324210	total: 39.7ms	remaining: 39.7s
500:	learn: 1646.4870660	total: 20.5s	remaining: 20.5s
999:	learn: 1564.5337096	total: 41s	remaining: 0us
Learning rate set to 0.094517
0:	learn: 4254.0951474	total: 41.6ms	remaining: 41.5s
500:	learn: 1648.8199457	total: 19.3s
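
The negative values in the output above are a scoring convention, not negative errors: scikit-learn maximizes scores, so error metrics are negated. A minimal sketch on synthetic data (not the car dataset) that uses the built-in `neg_root_mean_squared_error` scorer and flips the sign before reporting:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression problem, used only to illustrate the sign convention
rng = np.random.default_rng(12345)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Built-in scorer returns negated RMSE so that "higher is better" holds
neg_rmse = cross_val_score(LinearRegression(), X, y,
                           scoring='neg_root_mean_squared_error', cv=5)

# Flip the sign to report RMSE in its usual, non-negative form
rmse = -neg_rmse
print("Mean RMSE:", rmse.mean())
```

Read this way, a mean score of -1645 in the notebook corresponds to an RMSE of about 1645 in price units.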

In [None]:
# Hyperparameter grid for RandomForestRegressor

parameters = {'n_estimators': (20, 50, 75, 110, 150, 200, 230, 250),
              'max_depth': (10, 15, 20)}

# Grid search with 5-fold cross-validation, scored by negated RMSE.
# The default split criterion is squared error; the 'mse' alias was removed
# in recent scikit-learn releases, so it is not passed explicitly.
randomf_grid = GridSearchCV(RandomForestRegressor(random_state=12345), param_grid=parameters,
                            scoring='neg_root_mean_squared_error', cv=5)
randomf_grid_model = randomf_grid.fit(features_train, target_train)
print(randomf_grid_model.best_estimator_)

In [None]:
randomf_grid_model.best_score_

In [None]:
# Draft (disabled): split on the unencoded data for CatBoost's native categorical handling
# features_train, features_valid, target_train, target_valid = train_test_split(
#     data.drop('price', axis=1), data.price, test_size=0.25, random_state=12345
# )

# Categorical columns passed to CatBoost. 'postal_code' is left out because it was
# dropped from the feature set above.
cat_features = [
    'vehicle_type',
    'gearbox',
    'model',
    'not_repaired',
    'brand',
    'fuel_type'
]

# Draft (disabled): single CatBoost fit evaluated on a validation set
# model = CatBoostRegressor(loss_function="RMSE", iterations=150, random_seed=12345)
# model.fit(features_train, target_train, cat_features=cat_features, verbose=20)
# predictions_valid = model.predict(features_valid)
# rmse = np.sqrt(mean_squared_error(target_valid, predictions_valid))
# print("RMSE:", rmse)

In [None]:
# Hyperparameter grid for CatBoost: learning rate and tree depth

parameters = {'learning_rate': (0.15, 0.35, 0.5, 0.8),
              'depth': (10, 15, 16)}

# Grid search with 5-fold cross-validation, scored by negated RMSE
CatB_grid = GridSearchCV(CatBoostRegressor(random_seed=12345), param_grid=parameters,
                         scoring='neg_root_mean_squared_error', cv=5)
CatB_grid_model = CatB_grid.fit(features_train, target_train, cat_features=cat_features)
print(CatB_grid_model.best_estimator_)

<div class="alert alert-info">
 Label encoding was used to prepare the categorical features, and each model was wrapped in a pipeline so that all of them could be evaluated in a single loop.
    Features and target were taken from the training set, with RMSE as the metric. The models compared were LinearRegression, DecisionTreeRegressor, RandomForestRegressor, CatBoostRegressor, LGBMRegressor and XGBRegressor. Mean cross-validation RMSE (negative because the scorer negates the metric): CatBoostRegressor = -1641.2391696314303; RandomForestRegressor = -1644.9400720393837; XGBRegressor = -1653.8067111183507; LGBMRegressor = -1733.9356894426917; DecisionTreeRegressor = -2035.8740918772455; LinearRegression (Dummy) = -2989.5522292276864. The best-performing models before hyperparameter tuning were CatBoostRegressor (-1641.24) and RandomForestRegressor (-1644.94). LinearRegression does not lend itself to hyperparameter tuning, so it was not tuned. RandomForestRegressor and CatBoostRegressor were tuned.
</div>

## Model analysis

In [None]:
from time import time
run_time = {}
training_time = {}
predictions_time = {}

In [None]:
# Creating pipelines with the tuned hyperparameters set.

pipe_linear = Pipeline([('scaler0', StandardScaler()),
                       ('LinearRegression', LinearRegression())])

pipe_rfr_quality = Pipeline([('scaler3', StandardScaler()),
                    ('RandomForestRegressor_quality', RandomForestRegressor(max_depth=19, n_estimators=250, random_state=12345))])

pipe_cat_boost = Pipeline([('scaler5', StandardScaler()),
                       ('CatBoostRegressor', CatBoostRegressor(verbose=500))])

pipe_cat_boost_quality = Pipeline([('scaler4', StandardScaler()),
                       ('CatBoostRegressor', CatBoostRegressor(depth=16, random_seed=12345, learning_rate=0.35, verbose=100))])

In [None]:
#Saving LinearRegression results on testing dataset.
start = time()
pipe_linear.fit(features_train, target_train)
training_time['LinearRegression'] = np.round(time()-start, 3)

start_predictions = time()
predictions = pipe_linear.predict(features_test)
predictions_time['LinearRegression'] = np.round(time()-start_predictions, 3)

rmse_linear = mean_squared_error(target_test, predictions)**0.5

run_time['LinearRegression'] = np.round(time()-start, 3)

# f-strings avoid the stray braces that {value} set literals add to the printed output
print(f"RMSE: {rmse_linear}\n"
      f"Training Time: {training_time['LinearRegression']} s\n"
      f"Predictions Time: {predictions_time['LinearRegression']} s\n"
      f"Run Time: {run_time['LinearRegression']} s")

In [None]:
#Saving RandomForestRegressor_quality results on testing dataset.
start = time()
pipe_rfr_quality.fit(features_train, target_train)
training_time['RandomForestRegressor_quality'] = np.round(time()-start, 3)

start_predictions = time()
predictions = pipe_rfr_quality.predict(features_test)
predictions_time['RandomForestRegressor_quality'] = np.round(time()-start_predictions, 3)

# Score after the prediction timer stops so the metric is not counted as prediction time
rmse_rfr = mean_squared_error(target_test, predictions)**0.5

run_time['RandomForestRegressor_quality'] = np.round(time()-start, 3)

print(f"RMSE: {rmse_rfr}\n"
      f"Training Time: {training_time['RandomForestRegressor_quality']} s\n"
      f"Predictions Time: {predictions_time['RandomForestRegressor_quality']} s\n"
      f"Run Time: {run_time['RandomForestRegressor_quality']} s")

In [None]:
%%time
#Saving CatBoostRegressor results on testing dataset.
start = time()
pipe_cat_boost_quality.fit(features_train, target_train)
training_time['CatBoostRegressor'] = np.round(time()-start, 3)

start_predictions = time()
predictions = pipe_cat_boost_quality.predict(features_test)
predictions_time['CatBoostRegressor'] = np.round(time()-start_predictions, 3)

# Score after the prediction timer stops so the metric is not counted as prediction time
rmse_cat_boost = mean_squared_error(target_test, predictions)**0.5

run_time['CatBoostRegressor'] = np.round(time()-start, 3)

print(f"RMSE: {rmse_cat_boost}\n"
      f"Training Time: {training_time['CatBoostRegressor']} s\n"
      f"Predictions Time: {predictions_time['CatBoostRegressor']} s\n"
      f"Run Time: {run_time['CatBoostRegressor']} s")
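
The three evaluation cells above repeat the same fit, predict, and score pattern. A minimal sketch of a reusable helper, assuming a hypothetical function name `timed_evaluation` and demonstrated on synthetic data rather than the car dataset:

```python
from time import perf_counter

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error


def timed_evaluation(model, X_train, y_train, X_test, y_test):
    """Fit and score a model, timing training and prediction separately."""
    start = perf_counter()
    model.fit(X_train, y_train)
    training_time = perf_counter() - start

    start = perf_counter()
    predictions = model.predict(X_test)
    prediction_time = perf_counter() - start

    # RMSE is computed outside both timed sections
    rmse = mean_squared_error(y_test, predictions) ** 0.5
    return {'rmse': rmse,
            'training_time_s': round(training_time, 3),
            'prediction_time_s': round(prediction_time, 3)}


# Synthetic data: the target is an exact linear function of the features
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = X.sum(axis=1)

result = timed_evaluation(LinearRegression(), X[:200], y[:200], X[200:], y[200:])
print(result)
```

`perf_counter` is used here instead of `time` because it is intended for measuring short durations; either works for the multi-second training times seen in this notebook.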

In [None]:
# Only the models evaluated on the test set above have an RMSE to report
rmse_dict = {'LinearRegression': rmse_linear,
             'RandomForestRegressor_quality': rmse_rfr,
             'CatBoostRegressor': rmse_cat_boost}

In [None]:
print(training_time)
print()
print(predictions_time)
print()
print(run_time)
print()
print(rmse_dict)
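
Printing the dictionaries one after another makes the quality and speed trade-off hard to compare. A minimal sketch that collects them into a single table, using the LinearRegression and RandomForestRegressor_quality figures reported in the conclusion of this notebook (the variable names `results` and `ranked` are assumptions):

```python
import pandas as pd

# Test-set figures as reported in the notebook's conclusion
results = pd.DataFrame({
    'rmse': {'LinearRegression': 3029.979,
             'RandomForestRegressor_quality': 1653.885},
    'training_time_s': {'LinearRegression': 0.128,
                        'RandomForestRegressor_quality': 175.973},
    'prediction_time_s': {'LinearRegression': 0.03,
                          'RandomForestRegressor_quality': 7.706},
})

# Lowest RMSE first; the time columns show what that quality costs
ranked = results.sort_values('rmse')
print(ranked)
```

The same pattern works with the `rmse_dict`, `training_time`, and `predictions_time` dictionaries once their keys use matching model names.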

<div class="alert alert-info">
  <b> Conclusion </b><br>
    The best-quality models were RandomForestRegressor and CatBoostRegressor. Both outperformed our baseline ("dummy") model, LinearRegression. LinearRegression trained far faster than the RandomForestRegressor model (0.128 s versus 175.973 s), but with a much higher RMSE.

Model analysis on the test set showed these results:

-   LinearRegression: \
    RMSE: 3029.9790047276947 \
    Training Time: 0.128 s \
    Predictions Time: 0.03 s \
    Run Time: 0.16 s

-   RandomForestRegressor_quality: \
    RMSE: 1653.884919717754 \
    Training Time: 175.973 s \
    Predictions Time: 7.706 s \
    Run Time: 183.679 s

</div>