# Used cars value model

Rusty Bargain, a used car sales service, is developing an app to attract new customers. In that app, you can quickly find out the market value of your car. You have access to historical data: technical specifications, trim versions, and prices. You need to build a model that determines the value.

Rusty Bargain is interested in:

- the quality of the prediction;
- the speed of the prediction;
- the time required for training.

## Data preparation

First, we will load the libraries and the data used in this project.

In [1]:
# Loading all required libraries
import pandas as pd
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split 
from sklearn.preprocessing import OrdinalEncoder
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.metrics import make_scorer
import time

In [2]:
# Loading the data files into DataFrame
df=pd.read_csv('/datasets/car_data.csv')

We will display general data info and a sample of the data.

In [3]:
# printing the general/summary information about the DataFrame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   DateCrawled        354369 non-null  object
 1   Price              354369 non-null  int64 
 2   VehicleType        316879 non-null  object
 3   RegistrationYear   354369 non-null  int64 
 4   Gearbox            334536 non-null  object
 5   Power              354369 non-null  int64 
 6   Model              334664 non-null  object
 7   Mileage            354369 non-null  int64 
 8   RegistrationMonth  354369 non-null  int64 
 9   FuelType           321474 non-null  object
 10  Brand              354369 non-null  object
 11  NotRepaired        283215 non-null  object
 12  DateCreated        354369 non-null  object
 13  NumberOfPictures   354369 non-null  int64 
 14  PostalCode         354369 non-null  int64 
 15  LastSeen           354369 non-null  object
dtypes: int64(7), object(9)

In [4]:
# printing a sample of data
df.head(10)

| | DateCrawled | Price | VehicleType | RegistrationYear | Gearbox | Power | Model | Mileage | RegistrationMonth | FuelType | Brand | NotRepaired | DateCreated | NumberOfPictures | PostalCode | LastSeen |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 24/03/2016 11:52 | 480 | NaN | 1993 | manual | 0 | golf | 150000 | 0 | petrol | volkswagen | NaN | 24/03/2016 00:00 | 0 | 70435 | 07/04/2016 03:16 |
| 1 | 24/03/2016 10:58 | 18300 | coupe | 2011 | manual | 190 | NaN | 125000 | 5 | gasoline | audi | yes | 24/03/2016 00:00 | 0 | 66954 | 07/04/2016 01:46 |
| 2 | 14/03/2016 12:52 | 9800 | suv | 2004 | auto | 163 | grand | 125000 | 8 | gasoline | jeep | NaN | 14/03/2016 00:00 | 0 | 90480 | 05/04/2016 12:47 |
| 3 | 17/03/2016 16:54 | 1500 | small | 2001 | manual | 75 | golf | 150000 | 6 | petrol | volkswagen | no | 17/03/2016 00:00 | 0 | 91074 | 17/03/2016 17:40 |
| 4 | 31/03/2016 17:25 | 3600 | small | 2008 | manual | 69 | fabia | 90000 | 7 | gasoline | skoda | no | 31/03/2016 00:00 | 0 | 60437 | 06/04/2016 10:17 |
| 5 | 04/04/2016 17:36 | 650 | sedan | 1995 | manual | 102 | 3er | 150000 | 10 | petrol | bmw | yes | 04/04/2016 00:00 | 0 | 33775 | 06/04/2016 19:17 |
| 6 | 01/04/2016 20:48 | 2200 | convertible | 2004 | manual | 109 | 2_reihe | 150000 | 8 | petrol | peugeot | no | 01/04/2016 00:00 | 0 | 67112 | 05/04/2016 18:18 |
| 7 | 21/03/2016 18:54 | 0 | sedan | 1980 | manual | 50 | other | 40000 | 7 | petrol | volkswagen | no | 21/03/2016 00:00 | 0 | 19348 | 25/03/2016 16:47 |
| 8 | 04/04/2016 23:42 | 14500 | bus | 2014 | manual | 125 | c_max | 30000 | 8 | petrol | ford | NaN | 04/04/2016 00:00 | 0 | 94505 | 04/04/2016 23:42 |
| 9 | 17/03/2016 10:53 | 999 | small | 1998 | manual | 101 | golf | 150000 | 0 | NaN | volkswagen | NaN | 17/03/2016 00:00 | 0 | 27472 | 31/03/2016 17:17 |


As can be seen, there are many missing values. Let's calculate the percentage of them.

In [5]:
#calculating the percentage of missing values in each column
df.isna().mean() * 100

DateCrawled           0.000000
Price                 0.000000
VehicleType          10.579368
RegistrationYear      0.000000
Gearbox               5.596709
Power                 0.000000
Model                 5.560588
Mileage               0.000000
RegistrationMonth     0.000000
FuelType              9.282697
Brand                 0.000000
NotRepaired          20.079070
DateCreated           0.000000
NumberOfPictures      0.000000
PostalCode            0.000000
LastSeen              0.000000
dtype: float64

The VehicleType, Gearbox, Model, and FuelType columns are each missing roughly 5 to 11 percent of their values. The dataset is big enough that we can drop the rows with missing values in these fields. We will fill missing values only for NotRepaired, since it is missing about 20% of its values. Let’s check which values it has.

In [6]:
#Displaying unique values in NotRepaired column
df['NotRepaired'].value_counts()

no     247161
yes     36054
Name: NotRepaired, dtype: int64

There are only yes and no values. We cannot determine what the missing values should be, so we will replace them with the `unknown` category.

In [7]:
#filling missing values in NotRepaired column with unknown category
df['NotRepaired']=df['NotRepaired'].fillna('unknown')

In [8]:
#confirming that the values were filled successfully
df['NotRepaired'].value_counts()

no         247161
unknown     71154
yes         36054
Name: NotRepaired, dtype: int64

The values were filled. Next, we will drop the rows that have null values in any of the other columns.

In [9]:
#dropping null values 
df=df.dropna()

In [10]:
#Confirming that no missing values left
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 284126 entries, 2 to 354368
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   DateCrawled        284126 non-null  object
 1   Price              284126 non-null  int64 
 2   VehicleType        284126 non-null  object
 3   RegistrationYear   284126 non-null  int64 
 4   Gearbox            284126 non-null  object
 5   Power              284126 non-null  int64 
 6   Model              284126 non-null  object
 7   Mileage            284126 non-null  int64 
 8   RegistrationMonth  284126 non-null  int64 
 9   FuelType           284126 non-null  object
 10  Brand              284126 non-null  object
 11  NotRepaired        284126 non-null  object
 12  DateCreated        284126 non-null  object
 13  NumberOfPictures   284126 non-null  int64 
 14  PostalCode         284126 non-null  int64 
 15  LastSeen           284126 non-null  object
dtypes: int64(7), object(9)

After dropping missing values, the dataset has 284126 rows (about 80% of the original 354369), which should be more than enough for our task.

Also, based on the dataset description, 'DateCrawled', 'DateCreated', and 'LastSeen' relate to the user profile rather than to the price of a car, so we will drop these columns.

In [11]:
#dropping columns
df=df.drop(['DateCrawled', 'DateCreated', 'LastSeen'], axis=1)
#resetting index after the drop 
df=df.reset_index(drop=True)

Next, we will prepare the training and test sets of features and targets. LightGBM and CatBoost can work with categorical features directly, so we will create one pair of sets with the original categorical columns for them and a second, ordinal-encoded pair for the other models.

In [12]:
#prepare training and test features and target sets for LightGBM and CatBoost
gb_features=df.drop('Price', axis=1)
gb_target=df['Price']
gb_features_train, gb_features_test, gb_target_train, gb_target_test=train_test_split(gb_features, gb_target, test_size=0.25,
                                                                 random_state=12345)

#defining features to encode for the other models
features_to_encode=['VehicleType', 'RegistrationYear', 'Gearbox', 'Model', 'RegistrationMonth',
         'FuelType', 'Brand', 'NotRepaired', 'PostalCode']

#encoding features using ordinal encoder
encoder=OrdinalEncoder() 
df[features_to_encode]=encoder.fit_transform(df[features_to_encode])

#prepare training and test features and target sets for other models
features=df.drop('Price', axis=1)
target=df['Price']
features_train, features_test, target_train, target_test=train_test_split(features, target, test_size=0.25,
                                                                 random_state=12345)
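
The cell above fits the encoder on the full dataset before splitting. As a minimal sketch of an alternative, assuming scikit-learn 0.24 or later, `OrdinalEncoder` can be fitted on the training rows only, with categories that appear only in the test rows mapped to a sentinel value. The toy frame, the manual split, and the sentinel `-1` are illustrative assumptions, not part of the original project.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy data standing in for the car listings (hypothetical values)
toy = pd.DataFrame({
    'Brand': ['audi', 'bmw', 'audi', 'ford', 'bmw', 'audi', 'jeep', 'bmw'],
    'Gearbox': ['manual', 'auto', 'manual', 'manual', 'auto', 'auto', 'auto', 'manual'],
})
cat_cols = ['Brand', 'Gearbox']

# Fixed split for illustration: the first six rows train, the last two test
train = toy.iloc[:6]
test = toy.iloc[6:]

# Unknown categories (here 'jeep') are encoded as -1 instead of raising an error
encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
train_enc = pd.DataFrame(encoder.fit_transform(train[cat_cols]),
                         columns=cat_cols, index=train.index)
test_enc = pd.DataFrame(encoder.transform(test[cat_cols]),
                        columns=cat_cols, index=test.index)
print(test_enc)
```

Fitting on the training rows only keeps information about the test set out of preprocessing, at the cost of having to decide how unseen categories are represented.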

Now we are ready to build our models.

## Model training

In this step, we will train several models and tune their hyperparameters: Linear Regression, Random Forest, CatBoost, and LightGBM. Linear Regression serves as a baseline for a sanity check of the other models.

### Linear Regression

We will start with Linear Regression. We will train it on the training set, test it on the test set and calculate RMSE.
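
For reference, RMSE here is the standard root mean squared error. The symbols are introduced only for illustration: $y_i$ is the actual price, $\hat{y}_i$ the predicted price, and $n$ the number of rows in the test set.

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}
```

RMSE is expressed in the same units as Price, so lower values mean better predictions.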

In [13]:
#training the LinearRegression model and measuring the time required to train it
start_time=time.time()
model=LinearRegression()
model.fit(features_train, target_train)
lr_tr_time=time.time() - start_time
lr_tr_time

0.08089590072631836

In [14]:
#making predictions with the LinearRegression model and measuring the time required
start_time=time.time()
lr_predictions=model.predict(features_test)
lr_pred_time=time.time() - start_time
lr_pred_time

0.006867170333862305

In [15]:
#calculating RMSE for LinearRegression model
lr_rmse=(mean_squared_error(target_test, lr_predictions))**0.5
lr_rmse

3344.1480561328963
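
The three measurements above (training time, prediction time, RMSE) follow the same pattern for every model in this project, so they could be wrapped in one helper. The sketch below is an illustration only: `evaluate_model` is a hypothetical name, the data is synthetic, and `time.perf_counter` is swapped in for `time.time` because it is intended for measuring short intervals.

```python
import time

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error


def evaluate_model(model, X_train, y_train, X_test, y_test):
    """Return (training_time, prediction_time, rmse) for one model."""
    start = time.perf_counter()
    model.fit(X_train, y_train)
    train_time = time.perf_counter() - start

    start = time.perf_counter()
    predictions = model.predict(X_test)
    pred_time = time.perf_counter() - start

    rmse = mean_squared_error(y_test, predictions) ** 0.5
    return train_time, pred_time, rmse


# Synthetic data: y = 3*x0 - 2*x1 plus a little noise
rng = np.random.default_rng(12345)
X = rng.normal(size=(200, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

train_time, pred_time, rmse = evaluate_model(
    LinearRegression(), X[:150], y[:150], X[150:], y[150:])
print(f'train {train_time:.4f}s, predict {pred_time:.4f}s, RMSE {rmse:.3f}')
```

Any estimator with `fit` and `predict` methods can be passed in, which keeps the timing logic identical across models.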

### Random Forest

Next, we will build and tune the parameters of the Random Forest model. Since it takes too much time to train Random Forest on a dataset of such size, we will tune only the n_estimators parameter. Then we will test it on the test set with the best parameters and calculate RMSE.

In [16]:
#training RandomForest models and tuning parameters
best_n_estimators=0
best_score=float('-inf')
def rmse(target, predictions):
    return (mean_squared_error(target, predictions))**0.5    
#greater_is_better=False negates RMSE, so the scores below are negative and higher is better
scorer=make_scorer(rmse, greater_is_better=False) 
#training models with n_estimators = 3, 8, 13, 18, 23, 28
for i in range(3,30,5):
    model = RandomForestRegressor(random_state=12345, n_estimators=i) 
    score=cross_val_score(model, features_train, target_train, scoring=scorer, cv=3)
    print(f'score {score.mean()} n_estimators={i}')
    #if the score of the model is greater than the previous best score, update the best n_estimators and best score
    if score.mean()>best_score:
        best_n_estimators=i
        best_score=score.mean()
print(f'Best score {best_score} is achieved with n_estimators={best_n_estimators}')

score -1944.7342575859823 n_estimators=3
score -1776.008127416184 n_estimators=8
score -1735.3234235906639 n_estimators=13
score -1718.8204149734947 n_estimators=18
score -1707.8453024888531 n_estimators=23
score -1701.462508402991 n_estimators=28
Best score -1701.462508402991 is achieved with n_estimators=28
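
The manual loop above could also be expressed with `GridSearchCV` and scikit-learn's built-in `'neg_root_mean_squared_error'` scorer (available from scikit-learn 0.22). This is a sketch on synthetic data from `make_regression`, not the project's dataset, so the selected value and score will differ from the results above.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for features_train / target_train
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=12345)

grid = GridSearchCV(
    estimator=RandomForestRegressor(random_state=12345),
    param_grid={'n_estimators': list(range(3, 30, 5))},  # 3, 8, 13, 18, 23, 28
    scoring='neg_root_mean_squared_error',  # negated RMSE: higher is better
    cv=3,
)
grid.fit(X, y)

print(grid.best_params_)
print(-grid.best_score_)  # mean cross-validated RMSE of the best setting
```

The built-in scorer removes the need for a custom `make_scorer` wrapper, and `grid.best_estimator_` is already refitted on all the data passed to `fit`.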


In [17]:
#training the RandomForest model with the best parameters and measuring the time required to train it
model=RandomForestRegressor(random_state=12345, n_estimators=28) 
start_time=time.time()
model.fit(features_train, target_train)
rf_tr_time=time.time() - start_time
rf_tr_time

24.184314727783203

In [18]:
#making predictions with the RandomForest model and measuring the time required
start_time=time.time()
rf_predictions=model.predict(features_test)
rf_pred_time=time.time() - start_time
rf_pred_time

0.9611546993255615

In [19]:
#calculating RMSE for RandomForest model
rf_rmse=(mean_squared_error(target_test, rf_predictions))**0.5
rf_rmse

1642.698529268388

### CatBoost

Next, we will build and tune the CatBoost model. Since training a gradient boosting model can take a long time, we will tune only two parameters, depth and learning_rate, and fix n_estimators at 28, the value chosen for our best Random Forest model. We will use GridSearchCV to find the best parameters, then retrain the model with them, test it on the test set, and calculate RMSE.
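
To gauge how much work this grid search involves, the sketch below counts the model fits with scikit-learn's `ParameterGrid`, using the same grid and `cv=2` as the search in this section. The extra fit assumes `GridSearchCV` keeps its default `refit=True`.

```python
from sklearn.model_selection import ParameterGrid

param = {
    'n_estimators': [28],
    'depth': [6, 8, 10],
    'learning_rate': [0.01, 0.05, 0.1, 0.5],
    'loss_function': ['RMSE'],
    'random_seed': [12345],
}
cv_folds = 2

n_candidates = len(ParameterGrid(param))  # 3 depths x 4 learning rates
n_fits = n_candidates * cv_folds + 1      # plus the final refit on all training rows
print(n_candidates, n_fits)  # 12 25
```

Every value added to a parameter list multiplies the number of candidates, which is why the grid is kept small here.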

In [20]:
#training CatBoost models and tuning parameters
model=CatBoostRegressor()

#defining parameters for grid search
param={ 'n_estimators': [28],
        'depth' : [6,8,10],
        'learning_rate': [0.01, 0.05, 0.1, 0.5],    
        'loss_function': ['RMSE'],
        'random_seed': [12345]}

#finding the best parameters
grid=GridSearchCV(estimator=model, param_grid=param, scoring=scorer, cv=2, n_jobs=-1, verbose=0)
grid.fit(gb_features_train, gb_target_train, cat_features=features_to_encode)
best_param=grid.best_params_

In [21]:
#printing the best parameters
best_param

{'depth': 10,
 'learning_rate': 0.5,
 'loss_function': 'RMSE',
 'n_estimators': 28,
 'random_seed': 12345}

In [22]:
#training the CatBoost model with the best parameters and measuring the time required to train it
model=CatBoostRegressor(n_estimators=28, depth=10, learning_rate=0.5, loss_function='RMSE', random_seed=12345) 
start_time=time.time()
model.fit(gb_features_train, gb_target_train, cat_features=features_to_encode)
cb_tr_time=time.time() - start_time
cb_tr_time

8.023786783218384

In [23]:
#making predictions with the CatBoost model and measuring the time required
start_time=time.time()
cb_predictions=model.predict(gb_features_test)
cb_pred_time=time.time() - start_time
cb_pred_time

0.1301732063293457

In [24]:
#calculating RMSE for CatBoost model
cb_rmse=(mean_squared_error(gb_target_test, cb_predictions))**0.5
cb_rmse

1728.196429621239

### LightGBM

Finally, we will build and tune the LightGBM model. As with CatBoost, we will tune only two parameters, max_depth and learning_rate, and fix n_estimators at 28, the value chosen for our best Random Forest model. We will use GridSearchCV to find the best parameters. For LightGBM to use the categorical features directly, we will also change the data type of these features to category. Then we will retrain the model with the best parameters, test it on the test set, and calculate RMSE.
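
The category conversion described above can be sketched in plain pandas. The toy frames below are hypothetical stand-ins for the training and test features. Reusing the training dtype for the test set is an assumption about a reasonable practice rather than what this notebook does: it guarantees that both frames share the same category codes.

```python
import pandas as pd

# Toy stand-ins for the training and test features (hypothetical values)
train = pd.DataFrame({'Brand': ['audi', 'bmw', 'ford'], 'Power': [190, 102, 125]})
test = pd.DataFrame({'Brand': ['bmw', 'jeep'], 'Power': [109, 163]})
cat_cols = ['Brand']

# astype with a dict returns a new DataFrame, so nothing is assigned into a slice
train = train.astype({col: 'category' for col in cat_cols})

# Reuse the training dtype so both frames share the same categories;
# values unseen in training (here 'jeep') become NaN
test = test.astype({col: train[col].dtype for col in cat_cols})

print(train['Brand'].cat.categories.tolist())
print(test['Brand'].tolist())
```

Because `astype` builds a new frame instead of writing into an existing one, this form does not trigger pandas' `SettingWithCopyWarning`.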

In [25]:
#changing the data type of the categorical features to category
#astype with a dict returns a new DataFrame, which avoids SettingWithCopyWarning
gb_features_train=gb_features_train.astype({cat: 'category' for cat in features_to_encode})

#training LightGBM models and tuning parameters
model=LGBMRegressor()

#defining parameters for grid search
param={ 'n_estimators': [28],
        'max_depth' : [6,8,10],
        'learning_rate': [0.01, 0.05, 0.1, 0.5],    
        'objective': ['RMSE'],
        'random_seed': [12345]}

#finding the best parameters
grid=GridSearchCV(estimator=model, param_grid=param, scoring=scorer, cv=2, n_jobs=-1)
grid.fit(gb_features_train, gb_target_train)
best_param=grid.best_params_

In [26]:
#printing the best parameters
best_param

{'learning_rate': 0.5,
 'max_depth': 10,
 'n_estimators': 28,
 'objective': 'RMSE',
 'random_seed': 12345}

In [27]:
#training the LightGBM model with the best parameters and measuring the time required to train it
model=LGBMRegressor(n_estimators=28, max_depth=10, learning_rate=0.5, objective='RMSE', random_seed=12345) 
start_time=time.time()
model.fit(gb_features_train, gb_target_train)
lg_tr_time=time.time() - start_time
lg_tr_time

4.1870410442352295

In [28]:
#changing the data type of the categorical features in the test set to category
#astype with a dict returns a new DataFrame, which avoids SettingWithCopyWarning
gb_features_test=gb_features_test.astype({cat: 'category' for cat in features_to_encode})

In [29]:
#making predictions with the LightGBM model and measuring the time required
start_time = time.time()
lg_predictions=model.predict(gb_features_test)
lg_pred_time=time.time() - start_time
lg_pred_time

0.2529733180999756

In [30]:
#calculating RMSE for LightGBM model
lg_rmse=(mean_squared_error(gb_target_test, lg_predictions))**0.5
lg_rmse

1745.3624810146844

All models are built and tested. In the next step, we will analyze them.

## Model analysis

In this final step, we will compare the speed and quality of the models. To make the comparison easy, we will combine the models' speed and quality figures into a DataFrame.

In [31]:
#putting the models' speed and quality data into a data frame
index=['LR', 'RF', 'CB', 'LG']
columns=['training_time(sec)', 'prediction_time(sec)', 'RMSE']
data=[[lr_tr_time, lr_pred_time, lr_rmse],
     [rf_tr_time, rf_pred_time, rf_rmse],
     [cb_tr_time, cb_pred_time, cb_rmse],
     [lg_tr_time, lg_pred_time, lg_rmse]]
result=pd.DataFrame(data=data, columns=columns, index=index)

#printing the resulting data frame
result

| | training_time(sec) | prediction_time(sec) | RMSE |
|---|---|---|---|
| LR | 0.080896 | 0.006867 | 3344.148056 |
| RF | 24.184315 | 0.961155 | 1642.698529 |
| CB | 8.023787 | 0.130173 | 1728.19643 |
| LG | 4.187041 | 0.252973 | 1745.362481 |


We used Linear Regression as a baseline for a sanity check of the other methods. Its RMSE (about 3344) is roughly twice that of every other model (about 1643 to 1745), so the results we obtained look plausible.

Random Forest has the lowest RMSE (about 1643), roughly 5 to 6 percent below CatBoost (about 1728) and LightGBM (about 1745). This could be because we tuned only two parameters for the gradient boosting models and limited them to 28 trees.

At the same time, Random Forest is the slowest of the three. It takes about 24 seconds to train, roughly 3 to 6 times longer than CatBoost (about 8 seconds) and LightGBM (about 4 seconds), and about 0.96 seconds to predict, roughly 4 to 7 times longer. It is also worth noting that although CatBoost takes longer to train than LightGBM, it makes predictions faster (about 0.13 versus 0.25 seconds).
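
To make the trade-off easier to read, the figures from the results table can be expressed relative to Random Forest. The sketch below copies the reported numbers into a new frame. Using Random Forest as the reference row is a choice made here for illustration.

```python
import pandas as pd

# Figures copied from the results table above
result = pd.DataFrame(
    {
        'training_time(sec)': [0.080896, 24.184315, 8.023787, 4.187041],
        'prediction_time(sec)': [0.006867, 0.961155, 0.130173, 0.252973],
        'RMSE': [3344.148056, 1642.698529, 1728.196430, 1745.362481],
    },
    index=['LR', 'RF', 'CB', 'LG'],
)

# Express every column relative to Random Forest (RF = 1.0)
relative = (result / result.loc['RF']).round(2)
print(relative)
```

In this view CatBoost and LightGBM have an RMSE about 1.05 and 1.06 times that of Random Forest, while needing about a third and a sixth of its training time.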
