# Determining the cost of cars

The used-car sales service “Not a Bit, Not a Paint” is developing an application to attract new customers, in which users can quickly find out the market value of their car. Historical data is available: technical characteristics, configurations and prices of cars. The task is to build a model that predicts the price.

The following are important to the customer:

- quality of prediction;
- prediction speed;
- training time.
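Training time and prediction speed can be measured separately from prediction quality. A minimal sketch, using only the standard library and a hypothetical `MeanModel` stand-in (not one of the models used in this notebook), of timing the two steps independently:

```python
import time


class MeanModel:
    """Hypothetical stand-in model: always predicts the mean training target."""

    def fit(self, features, target):
        self.mean_ = sum(target) / len(target)
        return self

    def predict(self, features):
        return [self.mean_ for _ in features]


def timed(func, *args):
    """Run func(*args) and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


# toy data: [registration year, power] per car and the matching prices
toy_features = [[2001, 75], [2008, 69], [2011, 190]]
toy_target = [1500, 3600, 18300]

model, fit_seconds = timed(MeanModel().fit, toy_features, toy_target)
toy_prediction, predict_seconds = timed(model.predict, toy_features)

print(f'training time: {fit_seconds:.6f} s')
print(f'prediction time: {predict_seconds:.6f} s')
print('prediction:', toy_prediction)
```

The same `timed` wrapper could be applied to the `fit` and `predict` methods of any estimator, which keeps the two timings comparable across models.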

## Data preparation

In [1]:
#!pip install scikit-learn==1.1.3

In [2]:
#!pip install lightgbm

In [3]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split,GridSearchCV,RandomizedSearchCV,KFold
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.compose import make_column_transformer
from lightgbm import LGBMRegressor

In [4]:
data = pd.read_csv('autos.csv')
data.info()

data.head()
data.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   DateCrawled        354369 non-null  object
 1   Price              354369 non-null  int64 
 2   VehicleType        316879 non-null  object
 3   RegistrationYear   354369 non-null  int64 
 4   Gearbox            334536 non-null  object
 5   Power              354369 non-null  int64 
 6   Model              334664 non-null  object
 7   Kilometer          354369 non-null  int64 
 8   RegistrationMonth  354369 non-null  int64 
 9   FuelType           321474 non-null  object
 10  Brand              354369 non-null  object
 11  Repaired           283215 non-null  object
 12  DateCreated        354369 non-null  object
 13  NumberOfPictures   354369 non-null  int64 
 14  PostalCode         354369 non-null  int64 
 15  LastSeen           354369 non-null  object
dtypes: int64(7), object(9)

Unnamed: 0,Price,RegistrationYear,Power,Kilometer,RegistrationMonth,NumberOfPictures,PostalCode
count,354369.0,354369.0,354369.0,354369.0,354369.0,354369.0,354369.0
mean,4416.656776,2004.234448,110.094337,128211.172535,5.714645,0.0,50508.689087
std,4514.158514,90.227958,189.850405,37905.34153,3.726421,0.0,25783.096248
min,0.0,1000.0,0.0,5000.0,0.0,0.0,1067.0
25%,1050.0,1999.0,69.0,125000.0,3.0,0.0,30165.0
50%,2700.0,2003.0,105.0,150000.0,6.0,0.0,49413.0
75%,6400.0,2008.0,143.0,150000.0,9.0,0.0,71083.0
max,20000.0,9999.0,20000.0,150000.0,12.0,0.0,99998.0


In [5]:
#remove obvious duplicates if there are any
data=data.drop_duplicates()

Some columns that matter for the analysis contain anomalies (prices of 0, registration years such as 1000 and 9999, power values of 0 and 20000). Let's filter them out:

In [6]:
data=data[data['Price']>500]
data=data[data['RegistrationYear']>1950]
data=data[data['RegistrationYear']<2023]
data=data[data['Power']>50]
data=data[data['Power']<1000]
data.describe()

Unnamed: 0,Price,RegistrationYear,Power,Kilometer,RegistrationMonth,NumberOfPictures,PostalCode
count,277311.0,277311.0,277311.0,277311.0,277311.0,277311.0,277311.0
mean,5238.976543,2003.823134,125.335284,127860.380584,6.058847,0.0,51487.97236
std,4595.348538,6.56561,53.710944,36815.130822,3.559625,0.0,25704.989864
min,501.0,1951.0,51.0,5000.0,0.0,0.0,1067.0
25%,1675.0,2000.0,86.0,125000.0,3.0,0.0,31171.0
50%,3600.0,2004.0,116.0,150000.0,6.0,0.0,50735.0
75%,7500.0,2008.0,150.0,150000.0,9.0,0.0,72213.0
max,20000.0,2019.0,999.0,150000.0,12.0,0.0,99998.0
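The same thresholds can be written as a single predicate, which makes it easy to see which rows are rejected and why. A minimal sketch on made-up rows (plain dictionaries instead of the DataFrame; the function name `is_plausible` is an assumption for illustration):

```python
def is_plausible(row):
    """Apply the same bounds as the filtering cell above to one listing."""
    return (
        row['Price'] > 500
        and 1950 < row['RegistrationYear'] < 2023
        and 50 < row['Power'] < 1000
    )


# made-up listings covering the kinds of anomalies seen in describe()
toy_rows = [
    {'Price': 0, 'RegistrationYear': 2003, 'Power': 105},       # zero price
    {'Price': 2700, 'RegistrationYear': 9999, 'Power': 105},    # impossible year
    {'Price': 2700, 'RegistrationYear': 2003, 'Power': 20000},  # impossible power
    {'Price': 2700, 'RegistrationYear': 2003, 'Power': 105},    # plausible
]

kept = [row for row in toy_rows if is_plausible(row)]
print(f'kept {len(kept)} of {len(toy_rows)} rows')
```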


In [7]:
#compare the registration year with the year the listing was crawled and keep only the rows
#where the registration year is not later than the crawl year
data['DateCrawled']=pd.to_datetime(data['DateCrawled'],format='%Y-%m-%d %H:%M:%S')
data=data[data['DateCrawled'].dt.year>=data['RegistrationYear']]
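The check relies on parsing the crawl timestamp and comparing its year with `RegistrationYear`. A minimal standard-library sketch of that comparison for single values (the helper name `registered_before_crawl` is hypothetical):

```python
from datetime import datetime

CRAWL_FORMAT = '%Y-%m-%d %H:%M:%S'  # same format string as in the cell above


def registered_before_crawl(date_crawled, registration_year):
    """True if the car was registered no later than the year of the crawl."""
    crawl_year = datetime.strptime(date_crawled, CRAWL_FORMAT).year
    return registration_year <= crawl_year


print(registered_before_crawl('2016-03-24 10:58:45', 2011))  # row would be kept
print(registered_before_crawl('2016-03-24 10:58:45', 2018))  # row would be dropped
```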

In [8]:
data.head()

Unnamed: 0,DateCrawled,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Kilometer,RegistrationMonth,FuelType,Brand,Repaired,DateCreated,NumberOfPictures,PostalCode,LastSeen
1,2016-03-24 10:58:45,18300,coupe,2011,manual,190,,125000,5,gasoline,audi,yes,2016-03-24 00:00:00,0,66954,2016-04-07 01:46:50
2,2016-03-14 12:52:21,9800,suv,2004,auto,163,grand,125000,8,gasoline,jeep,,2016-03-14 00:00:00,0,90480,2016-04-05 12:47:46
3,2016-03-17 16:54:04,1500,small,2001,manual,75,golf,150000,6,petrol,volkswagen,no,2016-03-17 00:00:00,0,91074,2016-03-17 17:40:17
4,2016-03-31 17:25:20,3600,small,2008,manual,69,fabia,90000,7,gasoline,skoda,no,2016-03-31 00:00:00,0,60437,2016-04-06 10:17:21
5,2016-04-04 17:36:23,650,sedan,1995,manual,102,3er,150000,10,petrol,bmw,yes,2016-04-04 00:00:00,0,33775,2016-04-06 19:17:07


In [9]:
#list the categorical features used for training
categorial = ['VehicleType', 'Gearbox', 'Model', 'FuelType',
              'Brand', 'Repaired']
#the categorical features contain gaps; fill them with the placeholder string 'Nan'
data[categorial] = data[categorial].fillna('Nan')

In [10]:
#let's create a training dataset from which we will remove features that are not essential for prediction
data_train=data.drop(['DateCrawled','RegistrationMonth','DateCreated','NumberOfPictures','PostalCode','LastSeen'],axis=1)

In [11]:
#split the dataset into training and test samples
target = data_train['Price']
features = data_train.drop('Price', axis=1)
features_train, features_test, target_train, target_test = train_test_split(
    features, target, test_size=0.25, random_state=12345)

Let's make two training sets: one with one-hot encoding, the other with ordinal encoding.
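To make the difference concrete, here is a minimal pure-Python sketch of the two encodings on a toy `Gearbox` column. It only illustrates the idea; the notebook itself relies on scikit-learn's `OneHotEncoder` and `OrdinalEncoder`, and the helper names here are hypothetical.

```python
def fit_categories(values):
    """Sorted list of distinct categories, as learned from training data."""
    return sorted(set(values))


def ordinal_encode(values, categories):
    """Replace each category with its integer position in `categories`."""
    position = {category: index for index, category in enumerate(categories)}
    return [position[value] for value in values]


def one_hot_encode(values, categories, drop_first=True):
    """One 0/1 column per category; optionally drop the first, redundant one."""
    kept = categories[1:] if drop_first else categories
    return [[1 if value == category else 0 for category in kept]
            for value in values]


toy_gearbox = ['manual', 'auto', 'Nan', 'manual']
toy_categories = fit_categories(toy_gearbox)

print(toy_categories)
print(ordinal_encode(toy_gearbox, toy_categories))
print(one_hot_encode(toy_gearbox, toy_categories))
```

Ordinal encoding keeps one column per feature, while the one-hot variant adds one column per remaining category, so a high-cardinality feature such as `Model` makes the one-hot table much wider.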

In [12]:
encoder_ohe = OneHotEncoder(drop='first',handle_unknown='ignore')
encoder_ohe.fit(features_train[categorial])
encoder_df = pd.DataFrame(
    encoder_ohe.transform(features_train[categorial]).toarray(),
    index=features_train.index
)
features_train_ohe = features_train.join(encoder_df)
features_train_ohe = features_train_ohe.drop(categorial, axis=1)
features_train_ohe.head()

Unnamed: 0,RegistrationYear,Power,Kilometer,0,1,2,3,4,5,6,...,296,297,298,299,300,301,302,303,304,305
353084,1997,60,150000,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
219527,2013,150,80000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
318748,2008,177,150000,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
92274,2003,75,150000,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
211824,2012,140,90000,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [13]:
# apply ordinal encoding
encoder_ordinal=OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
encoder_ordinal.fit(features_train[categorial])
features_train[categorial] = encoder_ordinal.transform(features_train[categorial])

In [14]:
features_train.head()

Unnamed: 0,VehicleType,RegistrationYear,Gearbox,Power,Model,Kilometer,FuelType,Brand,Repaired
353084,6.0,1997,2.0,60,173.0,150000,7.0,38.0,1.0
219527,8.0,2013,2.0,150,28.0,80000,3.0,1.0,1.0
318748,3.0,2008,2.0,177,6.0,150000,3.0,2.0,1.0
92274,6.0,2003,2.0,75,83.0,150000,7.0,24.0,1.0
211824,1.0,2012,2.0,140,248.0,90000,7.0,24.0,0.0
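The `handle_unknown='use_encoded_value'` and `unknown_value=-1` arguments matter at prediction time: a category that was absent from the training set is mapped to -1 instead of raising an error. A minimal pure-Python sketch of that behaviour (an illustration with a hypothetical helper, not scikit-learn's implementation):

```python
def ordinal_encode_with_unknown(values, categories, unknown_value=-1):
    """Map known categories to their index and unseen ones to `unknown_value`."""
    position = {category: index for index, category in enumerate(categories)}
    return [position.get(value, unknown_value) for value in values]


train_categories = ['Nan', 'auto', 'manual']  # learned from the training set
test_values = ['manual', 'semi-automatic']    # second value never seen in training

print(ordinal_encode_with_unknown(test_values, train_categories))
```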


In [15]:
#scale numerical features in the dataset with one-hot encoding
numeric = ['RegistrationYear', 'Power', 'Kilometer']
scaler = StandardScaler()
scaler.fit(features_train_ohe[numeric])
features_train_ohe[numeric] = scaler.transform(features_train_ohe[numeric])
features_train_ohe=features_train_ohe.fillna(0)
features_train_ohe.head()

Unnamed: 0,RegistrationYear,Power,Kilometer,0,1,2,3,4,5,6,...,296,297,298,299,300,301,302,303,304,305
353084,-1.029479,-1.221948,0.605774,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
219527,1.564115,0.453616,-1.289476,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
318748,0.753617,0.956286,0.605774,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
92274,-0.056881,-0.942688,0.605774,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
211824,1.402015,0.267443,-1.018726,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [16]:
#scale numerical features in a dataset with ordinal encoding
numeric = ['RegistrationYear', 'Power', 'Kilometer']
scaler = StandardScaler()
scaler.fit(features_train[numeric])
features_train[numeric] = scaler.transform(features_train[numeric])
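`StandardScaler` subtracts the training mean and divides by the training standard deviation (the population standard deviation, that is, with `ddof=0`). A minimal standard-library sketch of the same arithmetic on a toy `Power` column; the function names and the numbers are assumptions for illustration:

```python
import statistics


def fit_scaler(values):
    """Return the mean and population standard deviation of the training values."""
    return statistics.fmean(values), statistics.pstdev(values)


def scale(values, mean, std):
    """Standardize values using statistics learned on the training set."""
    return [(value - mean) / std for value in values]


toy_power_train = [60, 75, 140, 150, 177]
toy_power_test = [102, 190]

toy_mean, toy_std = fit_scaler(toy_power_train)
toy_scaled_train = scale(toy_power_train, toy_mean, toy_std)
# the test set reuses the training statistics; the scaler is never refit on it
toy_scaled_test = scale(toy_power_test, toy_mean, toy_std)

print(toy_scaled_train)
print(toy_scaled_test)
```

Reusing the training statistics for the test set is what prevents information about the test data from leaking into preprocessing.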

## Model training

**1. Random forest model**

In [17]:
%%time
# the grid below has 5 * 13 * 3 * 4 = 780 hyperparameter combinations, too many for an exhaustive search,
# so we use RandomizedSearchCV to sample a subset of them
model_rf = RandomForestRegressor(random_state=12345)

param_grid_rf = {
    'n_estimators': range(50, 251, 50),
    'max_depth': range(2, 15),
    'min_samples_split': (2, 3, 4),
    'min_samples_leaf': (1, 2, 3, 4)
}


gs_rf = RandomizedSearchCV(
    model_rf,
    param_distributions=param_grid_rf,
    scoring='neg_root_mean_squared_error',
    n_jobs=-1,
    random_state=12345,
    verbose=True
)
#let's select parameters for the random forest model on data with ordinal encoding
gs_rf.fit(features_train, target_train)

gs_rf_best_score = gs_rf.best_score_ * -1
gs_rf_best_params = gs_rf.best_params_
print(f'best_score: {gs_rf_best_score}')
print(f'best_params: {gs_rf_best_params}')

Fitting 5 folds for each of 10 candidates, totalling 50 fits
best_score: 1706.9201384360458
best_params: {'n_estimators': 150, 'min_samples_split': 4, 'min_samples_leaf': 3, 'max_depth': 13}
CPU times: user 31.9 s, sys: 558 ms, total: 32.4 s
Wall time: 2min 45s
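The choice of RandomizedSearchCV can be motivated by counting the grid: an exhaustive search would evaluate every combination, while the randomized search evaluates only a fixed number of them (10 candidates with 5 folds each, as the log above shows). A minimal standard-library sketch of that counting and sampling; it mimics the idea only and will not reproduce the candidates scikit-learn actually draws:

```python
import itertools
import random

# same values as param_grid_rf in the cell above
rf_grid = {
    'n_estimators': range(50, 251, 50),
    'max_depth': range(2, 15),
    'min_samples_split': (2, 3, 4),
    'min_samples_leaf': (1, 2, 3, 4),
}

# every combination an exhaustive grid search would have to evaluate
names = list(rf_grid)
all_candidates = [
    dict(zip(names, values))
    for values in itertools.product(*rf_grid.values())
]
print('combinations in the full grid:', len(all_candidates))

# a randomized search evaluates only a fixed number of them
rng = random.Random(12345)
sampled = rng.sample(all_candidates, k=10)
n_folds = 5
print('fits for randomized search:', len(sampled) * n_folds)
print('fits for exhaustive search:', len(all_candidates) * n_folds)
```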


In [18]:
%%time
#let's select parameters for the random forest model on data with one-hot encoding
gs_rf.fit(features_train_ohe, target_train)

gs_rf_best_score = gs_rf.best_score_ * -1
gs_rf_best_params = gs_rf.best_params_
print(f'best_score: {gs_rf_best_score}')
print(f'best_params: {gs_rf_best_params}')

Fitting 5 folds for each of 10 candidates, totalling 50 fits
best_score: 1744.3338770326336
best_params: {'n_estimators': 150, 'min_samples_split': 4, 'min_samples_leaf': 3, 'max_depth': 13}
CPU times: user 3min 8s, sys: 2.35 s, total: 3min 11s
Wall time: 15min 23s


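Conclusion: for the random forest, the model trained on ordinal-encoded data gives the better cross-validation result (RMSE of about 1707 versus about 1744 with one-hot encoding), and the hyperparameter search runs much faster (wall time of 2 min 45 s versus 15 min 23 s).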

**2. LightGBM gradient boosting model**

In [19]:
model_gbm=LGBMRegressor(random_state=12345)
cv = KFold(n_splits=3, shuffle=True, random_state=12345)
params = {
    'learning_rate': [0.01,0.1,1],
    'n_estimators': [40, 60],
    'num_leaves': [21, 31, 41],
}
# we will select the parameters using both GridSearchCV and RandomizedSearchCV
grid_gbm = GridSearchCV(model_gbm,
                        params,
                        cv=cv,
                        scoring='neg_root_mean_squared_error')
gs_gbm = RandomizedSearchCV(
    model_gbm,
    param_distributions=params,
    scoring='neg_root_mean_squared_error',
    n_jobs=-1,
    random_state=12345,
    verbose=True
)
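`KFold(n_splits=3, shuffle=True)` shuffles the row indices once and cuts them into three folds; each fold serves as the validation set exactly once. A minimal pure-Python sketch of that splitting logic (hypothetical helper; the fold contents will differ from scikit-learn's because the random generators differ):

```python
import random


def k_fold_indices(n_samples, n_splits, seed):
    """Yield (train_indices, validation_indices) pairs for shuffled k-fold CV."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # the first n_samples % n_splits folds get one extra sample
    fold_sizes = [
        n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
        for i in range(n_splits)
    ]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


for fold_train, fold_validation in k_fold_indices(n_samples=10, n_splits=3, seed=12345):
    print('train:', sorted(fold_train), 'validation:', sorted(fold_validation))
```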



In [20]:
%%time
#let's select parameters for the model using one-hot encoded data

gs_gbm.fit(features_train_ohe, target_train)

gs_gbm_best_score = gs_gbm.best_score_ * -1
gs_gbm_best_params = gs_gbm.best_params_
print(f'best_score: {gs_gbm_best_score}')
print(f'best_params: {gs_gbm_best_params}')

Fitting 5 folds for each of 10 candidates, totalling 50 fits
best_score: 1685.3218479319883
best_params: {'num_leaves': 21, 'n_estimators': 60, 'learning_rate': 1}
CPU times: user 1.84 s, sys: 1.12 s, total: 2.96 s
Wall time: 28.8 s


In [21]:
%%time
#let's select parameters for the model using ordinal encoded data
gs_gbm.fit(features_train, target_train)

gs_gbm_best_score = gs_gbm.best_score_ * -1
gs_gbm_best_params = gs_gbm.best_params_
print(f'best_score: {gs_gbm_best_score}')
print(f'best_params: {gs_gbm_best_params}')

Fitting 5 folds for each of 10 candidates, totalling 50 fits
best_score: 1701.0517971534318
best_params: {'num_leaves': 21, 'n_estimators': 60, 'learning_rate': 1}
CPU times: user 1.62 s, sys: 1.05 s, total: 2.67 s
Wall time: 4.38 s


In [22]:
%%time
#let's compare the result with hyperparameter selection using GridSearchCV on ordinal-encoded data

grid_gbm.fit(features_train, target_train)

grid_gbm_best_score = grid_gbm.best_score_ * -1
grid_gbm_best_params = grid_gbm.best_params_
print(f'best_score: {grid_gbm_best_score}')
print(f'best_params: {grid_gbm_best_params}')

best_score: 1715.4826393322335
best_params: {'learning_rate': 1, 'n_estimators': 60, 'num_leaves': 41}
CPU times: user 1min 11s, sys: 30 s, total: 1min 40s
Wall time: 10.9 s


In [23]:
%%time
#selection of LightGBM parameters on one-hot encoded data using GridSearchCV

grid_gbm.fit(features_train_ohe, target_train)

grid_gbm_best_score = grid_gbm.best_score_ * -1
grid_gbm_best_params = grid_gbm.best_params_
print(f'best_score: {grid_gbm_best_score}')
print(f'best_params: {grid_gbm_best_params}')

best_score: 1681.0559903149526
best_params: {'learning_rate': 1, 'n_estimators': 60, 'num_leaves': 41}
CPU times: user 2min 33s, sys: 1min 39s, total: 4min 13s
Wall time: 1min 7s


*Conclusion:* for LightGBM, the model trained on one-hot encoded data gives the lowest RMSE (about 1681 to 1685 versus about 1701 to 1715 with ordinal encoding). The difference is small, however, while the search on one-hot encoded data takes several times longer (28.8 s versus 4.38 s with RandomizedSearchCV, 1 min 7 s versus 10.9 s with GridSearchCV). Sampling 10 of the 18 combinations of the three hyperparameters (RandomizedSearchCV) gives a result close to that of the exhaustive search (GridSearchCV), which takes more than twice as long. Note that the two searches use different cross-validation schemes (5 folds by default versus a shuffled 3-fold `KFold`), so their scores are not strictly comparable.

## Model analysis

**Model comparison**

On ordinal-encoded data, both models show a similar cross-validation RMSE on the training set (about 1707 for random forest and about 1701 for LightGBM), but LightGBM is tuned and trained far faster (seconds rather than minutes), which makes it the better fit for this task. Let's check the prediction quality on the test sample, transformed with ordinal encoding, using the selected hyperparameters.

In [24]:
#transform the test set with the OrdinalEncoder and StandardScaler fitted on the training set
features_test[categorial] = encoder_ordinal.transform(features_test[categorial])
features_test[numeric] = scaler.transform(features_test[numeric])
#note: n_estimators=40 is used here, although the searches above selected n_estimators=60
model_gbm=LGBMRegressor(num_leaves=41, n_estimators=40, learning_rate=1,random_state=12345)
model_gbm.fit(features_train, target_train)
prediction=model_gbm.predict(features_test)
score=(mean_squared_error(target_test, prediction))**0.5
print('RMSE on test dataset:', score)

RMSE on test dataset: 1699.168889843651
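The metric reported here is the root mean squared error: the square root of the mean squared difference between actual and predicted prices, expressed in the same units as the price. A minimal standard-library sketch with made-up prices and predictions (the function name `rmse` is an assumption for illustration):

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


toy_actual = [1500, 3600, 9800, 650]
toy_predicted = [1800, 3200, 9800, 650]

# errors are -300, 400, 0 and 0, so the mean squared error is 250000 / 4 = 62500
print(rmse(toy_actual, toy_predicted))  # 250.0
```

Because the errors are squared before averaging, a few large misses weigh more heavily than many small ones.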


**Conclusion**

On the test sample, the model shows a similar RMSE (about 1699 versus about 1701 in cross-validation). Combined with its fast training, this makes it suitable for the task at hand.