# Rusty Bargain

Rusty Bargain, a used car sales service, is developing an app to attract new customers. In that app, you can quickly find out the market value of your car. You have access to historical data: technical specifications, trim versions, and prices. You need to build a model that determines the value.

Rusty Bargain is interested in:

- the quality of the prediction
- the speed of the prediction
- the time required for training

## Preparation

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
from math import sqrt

from sklearn.model_selection import train_test_split
from sklearn.impute import KNNImputer
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder
from sklearn.preprocessing import MaxAbsScaler
from sklearn.metrics import mean_squared_error

from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
import lightgbm

In [2]:
df = pd.read_csv('C:/Users/Kayo/Downloads/car_data.csv')

In [3]:
# Checking general info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   DateCrawled        354369 non-null  object
 1   Price              354369 non-null  int64 
 2   VehicleType        316879 non-null  object
 3   RegistrationYear   354369 non-null  int64 
 4   Gearbox            334536 non-null  object
 5   Power              354369 non-null  int64 
 6   Model              334664 non-null  object
 7   Mileage            354369 non-null  int64 
 8   RegistrationMonth  354369 non-null  int64 
 9   FuelType           321474 non-null  object
 10  Brand              354369 non-null  object
 11  NotRepaired        283215 non-null  object
 12  DateCreated        354369 non-null  object
 13  NumberOfPictures   354369 non-null  int64 
 14  PostalCode         354369 non-null  int64 
 15  LastSeen           354369 non-null  object
dtypes: int64(7), object(

In [4]:
df.head()

Unnamed: 0,DateCrawled,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Mileage,RegistrationMonth,FuelType,Brand,NotRepaired,DateCreated,NumberOfPictures,PostalCode,LastSeen
0,24/03/2016 11:52,480,,1993,manual,0,golf,150000,0,petrol,volkswagen,,24/03/2016 00:00,0,70435,07/04/2016 03:16
1,24/03/2016 10:58,18300,coupe,2011,manual,190,,125000,5,gasoline,audi,yes,24/03/2016 00:00,0,66954,07/04/2016 01:46
2,14/03/2016 12:52,9800,suv,2004,auto,163,grand,125000,8,gasoline,jeep,,14/03/2016 00:00,0,90480,05/04/2016 12:47
3,17/03/2016 16:54,1500,small,2001,manual,75,golf,150000,6,petrol,volkswagen,no,17/03/2016 00:00,0,91074,17/03/2016 17:40
4,31/03/2016 17:25,3600,small,2008,manual,69,fabia,90000,7,gasoline,skoda,no,31/03/2016 00:00,0,60437,06/04/2016 10:17


For our purposes, DateCrawled, DateCreated, PostalCode, and LastSeen aren't needed. RegistrationMonth is dropped as well: alongside RegistrationYear it could help predict a car's value, but it is a much less important feature, and even with scaling it may affect the output more than it should.

In [5]:
df.describe()

Unnamed: 0,Price,RegistrationYear,Power,Mileage,RegistrationMonth,NumberOfPictures,PostalCode
count,354369.0,354369.0,354369.0,354369.0,354369.0,354369.0,354369.0
mean,4416.656776,2004.234448,110.094337,128211.172535,5.714645,0.0,50508.689087
std,4514.158514,90.227958,189.850405,37905.34153,3.726421,0.0,25783.096248
min,0.0,1000.0,0.0,5000.0,0.0,0.0,1067.0
25%,1050.0,1999.0,69.0,125000.0,3.0,0.0,30165.0
50%,2700.0,2003.0,105.0,150000.0,6.0,0.0,49413.0
75%,6400.0,2008.0,143.0,150000.0,9.0,0.0,71083.0
max,20000.0,9999.0,20000.0,150000.0,12.0,0.0,99998.0


After reviewing the minimum and maximum rows, we can see some outliers. A listed vehicle should not cost \$0 or have 0 horsepower, and RegistrationYear ranges from 1000 to 9999. While \$0 could indicate that the vehicle is free and 0 horsepower that it doesn't have a motor, our purpose here is to build a model that predicts a car's value, so these rows are still worth removing: they don't help predict the price of a new customer's listing.

In [6]:
# Cut off outliers
df = df[(df['Price'] > 0) & (df['Power'] > 0) & (df['RegistrationYear'] < 2022) & (df['RegistrationYear'] > 1900) & (
    df['Power'] < 2000)].reset_index(drop=True)

It's not clear in what year this dataset was compiled (the DateCrawled values in the sample above are from 2016), so we simply keep RegistrationYear values after 1900 and before the current year, 2022.
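
The range filter above can be wrapped in a small function so that the bounds are named in one place. This is only a minimal sketch: the helper name `filter_plausible_listings` and the toy rows are assumptions for illustration, and the default bounds mirror the ones used in the cell above.

```python
import pandas as pd


def filter_plausible_listings(df, min_year=1900, max_year=2022, max_power=2000):
    """Keep rows with a positive price, a positive power below max_power,
    and a registration year strictly between min_year and max_year.

    Hypothetical helper; the default bounds mirror the notebook's filter.
    """
    mask = (
        (df["Price"] > 0)
        & (df["Power"] > 0)
        & (df["Power"] < max_power)
        & (df["RegistrationYear"] > min_year)
        & (df["RegistrationYear"] < max_year)
    )
    return df[mask].reset_index(drop=True)


# Toy rows: only the third one passes every condition.
toy = pd.DataFrame({
    "Price": [480, 0, 9800, 1500, 3600],
    "Power": [0, 190, 163, 75, 20000],
    "RegistrationYear": [1993, 2011, 2004, 1000, 2008],
})
clean = filter_plausible_listings(toy)
print(clean)
```

Keeping the bounds as keyword arguments makes it easy to try a tighter year limit without editing the mask itself.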

In [7]:
# Drop unnecessary columns
df = df.drop(['DateCrawled', 'RegistrationMonth', 'DateCreated', 'LastSeen', 'PostalCode'], axis=1)

# NumberOfPictures seems to have only 0's, checking it here
print('Unique values in NumberOfPictures:', df['NumberOfPictures'].unique())
# then Dropping NumberOfPictures
df = df.drop(['NumberOfPictures'], axis=1)

Unique values in NumberOfPictures: [0]


While NumberOfPictures could be useful, none of the listings in this dataset have any, so there's no point in keeping it as a feature for the model.
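
The same check can be applied to every column at once instead of inspecting one column by hand. A minimal sketch, assuming a hypothetical helper `constant_columns` and a toy frame:

```python
import pandas as pd


def constant_columns(df):
    """Return the names of columns that hold at most one distinct value."""
    return [col for col in df.columns if df[col].nunique(dropna=False) <= 1]


toy = pd.DataFrame({
    "Price": [480, 18300, 9800],
    "NumberOfPictures": [0, 0, 0],
})
to_drop = constant_columns(toy)
toy = toy.drop(columns=to_drop)
print("Dropped:", to_drop)
```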

In [8]:
# Check for duplicate rows
print('Total duplicate rows:', len(df) - len(df.drop_duplicates()))
# then Dropping duplicates
df = df.drop_duplicates()
print('Total duplicate rows after dropping:', len(df) - len(df.drop_duplicates()))

Total duplicate rows: 39780
Total duplicate rows after dropping: 0
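
An alternative way to count duplicates is `DataFrame.duplicated()`, which marks every row that repeats an earlier one. A small sketch on made-up rows; resetting the index afterwards is optional and is not done in the cell above.

```python
import pandas as pd

toy = pd.DataFrame({
    "Price": [1500, 1500, 3600, 1500],
    "Brand": ["volkswagen", "volkswagen", "skoda", "volkswagen"],
})

# duplicated() is True for each row that is identical to an earlier row.
n_duplicates = int(toy.duplicated().sum())

# Drop the repeats and renumber the remaining rows.
deduplicated = toy.drop_duplicates().reset_index(drop=True)
print("Total duplicate rows:", n_duplicates)
```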


In [9]:
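# KNN imputation over the numeric columns. df.info() above reports no missing values
# in these columns, so in practice this step only casts them to float.
numeric_feature_names = ['Price', 'RegistrationYear', 'Power', 'Mileage']
df[numeric_feature_names] = KNNImputer().fit_transform(df[numeric_feature_names])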
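
Before imputing, it can help to confirm which columns actually contain missing values. A minimal sketch on a toy frame (the rows are made up; the column names follow the dataset):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Price": [480, 18300, 9800],
    "Power": [75, 190, 163],
    "Gearbox": ["manual", np.nan, "auto"],
})

# Missing values per column.
missing_per_column = toy.isna().sum()

# True only if at least one numeric column has a missing value.
numeric_cols = toy.select_dtypes(include="number").columns
needs_numeric_imputation = bool(toy[numeric_cols].isna().any().any())

print(missing_per_column)
print("Numeric imputation needed:", needs_numeric_imputation)
```

In this toy frame, as in the `df.info()` output above, the gaps are in the string columns only.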

In [10]:
# Changing string columns into numbers for our Linear Regression model
df_lr = df.copy()
df_lr = pd.get_dummies(df_lr)

# Changing string columns into numbers for our other models
# LabelEncoder gives missing values (NaN) their own integer code, which is not necessarily 0
df_ml = df.copy()

for column in df_ml[['FuelType', 'VehicleType', 'Brand', 'Model', 'Gearbox', 'NotRepaired']]:
    df_ml[column] = LabelEncoder().fit_transform(df_ml[column])
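
One way to make the handling of missing strings explicit is to name the missing category before encoding. This is a sketch using pandas only, not the approach taken in the cell above; the placeholder label `"unknown"` is an assumption.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"NotRepaired": ["yes", np.nan, "no", "no"]})

# Turn the missing value into an explicit category ("unknown" is an assumed label).
filled = toy["NotRepaired"].fillna("unknown")

# Integer codes follow the sorted category order: no=0, unknown=1, yes=2.
codes = filled.astype("category").cat.codes

# One-hot alternative: dummy_na=True appends an indicator column for missing values.
one_hot = pd.get_dummies(toy["NotRepaired"], prefix="NotRepaired", dummy_na=True)

print(list(codes))
print(one_hot)
```

With plain `pd.get_dummies(df)`, as used for `df_lr`, a missing value instead becomes a row of zeros across that feature's indicator columns.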

In [11]:
df_ml.head()

Unnamed: 0,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Mileage,FuelType,Brand,NotRepaired
0,18300.0,2,2011.0,1,190.0,249,125000.0,2,1,1
1,9800.0,6,2004.0,0,163.0,117,125000.0,2,14,2
2,1500.0,5,2001.0,1,75.0,116,150000.0,6,38,0
3,3600.0,5,2008.0,1,69.0,101,90000.0,2,31,0
4,650.0,4,1995.0,1,102.0,11,150000.0,6,2,1


In [12]:
df_lr.head()

Unnamed: 0,Price,RegistrationYear,Power,Mileage,VehicleType_bus,VehicleType_convertible,VehicleType_coupe,VehicleType_other,VehicleType_sedan,VehicleType_small,...,Brand_smart,Brand_sonstige_autos,Brand_subaru,Brand_suzuki,Brand_toyota,Brand_trabant,Brand_volkswagen,Brand_volvo,NotRepaired_no,NotRepaired_yes
0,18300.0,2011.0,190.0,125000.0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1,9800.0,2004.0,163.0,125000.0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1500.0,2001.0,75.0,150000.0,0,0,0,0,0,1,...,0,0,0,0,0,0,1,0,1,0
3,3600.0,2008.0,69.0,90000.0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,1,0
4,650.0,1995.0,102.0,150000.0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,1


After cleaning up our dataset by dropping outliers, unnecessary columns, and duplicates, and replacing our string columns with numbers, we can move on to the next stage. Scaling is applied there, after the train/test split.

## Model Training

In [13]:
# Creating variables for testing and training for LR
target_train_lr, target_test_lr, features_train_lr, features_test_lr = train_test_split(
    df_lr['Price'], df_lr.drop(['Price'], axis=1), test_size=0.1, random_state=123)

# Creating variables for testing and training for ML
target_train_ml, target_test_ml, features_train_ml, features_test_ml = train_test_split(
    df_ml['Price'], df_ml.drop(['Price'], axis=1), test_size=0.1, random_state=123)

# The shape checks below use only the ML variables. The LR split comes from the same call
# with the same test_size and random_state, so checking both would be redundant
print('Target Train Shape:', target_train_ml.shape)
print('Features Train Shape:', features_train_ml.shape)
print()
print('Target Test Shape:', target_test_ml.shape)
print('Features Test Shape:', features_test_ml.shape)

Target Train Shape: (240573,)
Features Train Shape: (240573, 9)

Target Test Shape: (26731,)
Features Test Shape: (26731, 9)
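
`train_test_split` returns a train/test pair for each array in the order the arrays are passed, which is why the cell above unpacks the target first. A small sketch with toy arrays, passing the features first instead:

```python
import numpy as np
from sklearn.model_selection import train_test_split

features = np.arange(40).reshape(20, 2)  # 20 toy rows, 2 feature columns
target = np.arange(20)                   # row i has target i

# Arrays come back in the order they were passed in, each as (train, test).
features_train, features_test, target_train, target_test = train_test_split(
    features, target, test_size=0.1, random_state=123)

print(features_train.shape, features_test.shape)
print(target_train.shape, target_test.shape)
```

Rows stay aligned across the returned arrays: in this toy data the first feature column is always twice the target, before and after the split.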


In [14]:
# Scaling our LR variables
transformer_mas = MaxAbsScaler().fit(features_train_lr)
features_train_lr = transformer_mas.transform(features_train_lr)
features_test_lr = transformer_mas.transform(features_test_lr)

# Scaling our ML variables
transformer_mas = MaxAbsScaler().fit(features_train_ml)
features_train_ml = transformer_mas.transform(features_train_ml)
features_test_ml = transformer_mas.transform(features_test_ml)
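
`MaxAbsScaler` divides each feature by its largest absolute value, so fitting it on the training rows and reusing it on the test rows keeps test data out of the scaling statistics. A NumPy sketch of that arithmetic on made-up rows; unlike the real scaler, it does not guard against all-zero columns.

```python
import numpy as np

train = np.array([[2011.0, 190.0],
                  [2004.0, 163.0],
                  [2001.0, 75.0]])
test = np.array([[2008.0, 69.0]])

# Per-column maximum absolute value, learned from the training rows only.
max_abs = np.abs(train).max(axis=0)

train_scaled = train / max_abs
test_scaled = test / max_abs  # reuse the training statistics on the test rows

print(max_abs)
print(test_scaled)
```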

### Linear Regression

In [15]:
%%time
# Testing Linear Regression with GridSearchCV for cross-validation
param_grid = {}

lr_model = GridSearchCV(estimator=LinearRegression(), param_grid=param_grid, cv=5)
lr_model.fit(features_train_lr, target_train_lr)

Wall time: 12.1 s


GridSearchCV(cv=5, estimator=LinearRegression(), param_grid={})
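
An empty `param_grid` still yields one candidate, the estimator with its default parameters, so the search above amounts to 5-fold cross-validation of a plain `LinearRegression` followed by a refit. A small sketch using scikit-learn's `ParameterGrid`; the second grid is a made-up example for comparison.

```python
from sklearn.model_selection import ParameterGrid

# An empty grid expands to a single candidate with no overrides.
empty_grid = list(ParameterGrid({}))

# A non-empty grid expands to every combination of the listed values.
small_grid = list(ParameterGrid({"max_depth": [3, 6], "max_features": ["sqrt", "log2"]}))

print(empty_grid)
print(len(small_grid))
```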

In [16]:
%%time
lr_predict = lr_model.predict(features_test_lr)

Wall time: 15 ms


### Decision Tree Regressor

In [17]:
%%time
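# Tuning Decision Tree Regression parameters using GridSearchCV
# Note: 'auto' (use all features) was removed from max_features in scikit-learn 1.3;
# on newer versions, None gives the same behaviour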
param = {
    'max_features': ['auto', 'sqrt', 'log2'],
    'max_depth': [int(x) for x in np.linspace(3, 15, num = 5)],
}

dtr_model = GridSearchCV(estimator=DecisionTreeRegressor(random_state=123), param_grid=param, cv=5)
dtr_model.fit(features_train_ml, target_train_ml)

Wall time: 10.3 s


GridSearchCV(cv=5, estimator=DecisionTreeRegressor(random_state=123),
             param_grid={'max_depth': [3, 6, 9, 12, 15],
                         'max_features': ['auto', 'sqrt', 'log2']})

In [18]:
%%time
dtr_predict = dtr_model.predict(features_test_ml)

Wall time: 3.99 ms


### Random Forest Regressor

In [19]:
%%time
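# Tuning Random Forest Regressor parameters using GridSearchCV
# limited parameter tuning due to training time
# Note: 'auto' (use all features) was removed from max_features in scikit-learn 1.3;
# on newer versions, None or 1.0 gives the same behaviour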
param = {
    'n_estimators': [int(x) for x in np.linspace(start = 100, stop = 200, num = 2)],
    'max_features': ['auto', 'sqrt'],
    'max_depth': [int(x) for x in np.linspace(3, 15, num = 3)],
}

rfr_model = GridSearchCV(estimator=RandomForestRegressor(random_state=123), param_grid=param, cv=3)
rfr_model.fit(features_train_ml, target_train_ml)

Wall time: 8min 33s


GridSearchCV(cv=3, estimator=RandomForestRegressor(random_state=123),
             param_grid={'max_depth': [3, 9, 15],
                         'max_features': ['auto', 'sqrt'],
                         'n_estimators': [100, 200]})

In [20]:
%%time
rfr_predict = rfr_model.predict(features_test_ml)

Wall time: 732 ms


### Light GBM

#### Light GBM 1

In [21]:
%%time
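# Train using LightGBM with default parameters
# Note: 'logloss' is a classification metric, and eval_metric is only evaluated
# on an eval_set; none is passed here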
param = {}

lgb_model_1 = GridSearchCV(estimator=lightgbm.LGBMRegressor(random_state=123), param_grid=param, cv=5)
lgb_model_1.fit(features_train_ml, target_train_ml, eval_metric='logloss')

Wall time: 2.51 s


GridSearchCV(cv=5, estimator=LGBMRegressor(random_state=123), param_grid={})

In [22]:
%%time
lgb_predict_1 = lgb_model_1.predict(features_test_ml)

Wall time: 28.9 ms


#### Light GBM 2

In [23]:
%%time
# Train using LightGBM, tuning the number of boosted trees (n_estimators)
param = {
    'n_estimators' : [int(x) for x in np.linspace(10, 30, 5)]
}

lgb_model_2 = GridSearchCV(estimator=lightgbm.LGBMRegressor(random_state=123), param_grid=param, cv=5)
lgb_model_2.fit(features_train_ml, target_train_ml, eval_metric='logloss')

Wall time: 4.6 s


GridSearchCV(cv=5, estimator=LGBMRegressor(random_state=123),
             param_grid={'n_estimators': [10, 15, 20, 25, 30]})

In [24]:
%%time
lgb_predict_2 = lgb_model_2.predict(features_test_ml)

Wall time: 12 ms


#### Light GBM 3

In [25]:
%%time
# Train using LightGBM, tuning n_estimators and num_leaves
param = {
    'n_estimators' : [int(x) for x in np.linspace(10, 30, 5)],
    'num_leaves' : [int(x) for x in np.linspace(3, 12, 4)]
}

lgb_model_3 = GridSearchCV(estimator=lightgbm.LGBMRegressor(random_state=123), param_grid=param, cv=5)
lgb_model_3.fit(features_train_ml, target_train_ml, eval_metric='logloss')

Wall time: 13.9 s


GridSearchCV(cv=5, estimator=LGBMRegressor(random_state=123),
             param_grid={'n_estimators': [10, 15, 20, 25, 30],
                         'num_leaves': [3, 6, 9, 12]})

In [26]:
%%time
lgb_predict_3 = lgb_model_3.predict(features_test_ml)

Wall time: 8.48 ms


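In the wall times recorded above, the Random Forest search took by far the longest (8min 33s). Every other search finished in under 15 seconds: 2.51 s for the base LightGBM, 4.6 s and 13.9 s for the two tuned LightGBM searches, 10.3 s for the Decision Tree, and 12.1 s for Linear Regression. Prediction took 732 ms for the Random Forest and under 30 ms for each of the remaining models.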
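
Each training time above covers a whole grid search, so it depends on how many candidates and folds were fitted as well as on the model itself. A minimal sketch that counts the fits per search, assuming a hypothetical helper `total_fits`; the grids and `cv` values are copied from the cells above, and the extra fit is the final refit that `GridSearchCV` performs by default.

```python
from math import prod


def total_fits(param_grid, cv, refit=True):
    """Number of model fits an exhaustive grid search performs."""
    n_candidates = prod(len(values) for values in param_grid.values())
    return n_candidates * cv + (1 if refit else 0)


# (param_grid, cv) for each search in this notebook.
searches = {
    "Linear Regression": ({}, 5),
    "Decision Tree": ({"max_features": ["auto", "sqrt", "log2"],
                       "max_depth": [3, 6, 9, 12, 15]}, 5),
    "Random Forest": ({"n_estimators": [100, 200],
                       "max_features": ["auto", "sqrt"],
                       "max_depth": [3, 9, 15]}, 3),
    "LightGBM 1": ({}, 5),
    "LightGBM 2": ({"n_estimators": [10, 15, 20, 25, 30]}, 5),
    "LightGBM 3": ({"n_estimators": [10, 15, 20, 25, 30],
                    "num_leaves": [3, 6, 9, 12]}, 5),
}

fit_counts = {name: total_fits(grid, cv) for name, (grid, cv) in searches.items()}
for name, count in fit_counts.items():
    print(f"{name}: {count} fits")
```

Dividing a search's wall time by its fit count gives a rough per-fit figure, which is a fairer basis for comparing models than the raw totals.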

## Model Analysis

In [27]:
# RMSE for LinearRegression
print('RMSE for Linear Regression, considered a sanity check:', sqrt(mean_squared_error(target_test_lr, lr_predict)))
print()

# RMSE for DecisionTreeRegressor
print('RMSE for Decision Tree Regression:', sqrt(mean_squared_error(target_test_ml, dtr_predict)))
print()

# RMSE for RandomForestRegressor
print('RMSE for Random Forest Regression:', sqrt(mean_squared_error(target_test_ml, rfr_predict)))
print()

# RMSE for LightGBM
print('RMSE for LightGBM:', sqrt(mean_squared_error(target_test_ml, lgb_predict_1)))
print()

# RMSE for LightGBM with tuned n_estimators
print('RMSE for LightGBM with tuned n_estimators:', sqrt(mean_squared_error(target_test_ml, lgb_predict_2)))
print()

# RMSE for LightGBM with tuned n_estimators and num_leaves
print('RMSE for LightGBM with tuned n_estimators and num_leaves:', sqrt(mean_squared_error(target_test_ml, lgb_predict_3)))
print()

RMSE for Linear Regression, considered a sanity check: 2736.073812677318

RMSE for Decision Tree Regression: 1914.796661771209

RMSE for Random Forest Regression: 1663.5666798183497

RMSE for LightGBM: 1725.0605109375585

RMSE for LightGBM with tuned n_estimators: 1945.7446062028644

RMSE for LightGBM with tuned n_estimators and num_leaves: 2118.8878036259757
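
RMSE is the square root of the mean squared difference between the true and predicted prices, so it is expressed in the same units as Price. A plain-Python sketch of the calculation performed by `sqrt(mean_squared_error(...))` above; the function name `rmse` and the example prices are made up for illustration.

```python
from math import sqrt


def rmse(y_true, y_pred):
    """Root mean squared error of two equally long sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sqrt(sum(squared_errors) / len(squared_errors))


# Errors of 100, 100 and 200 give a mean squared error of 20000.
example = rmse([1000, 2000, 3000], [1100, 1900, 3200])
print(example)
```

Because the errors are squared before averaging, a few badly mispriced cars raise the RMSE more than many small misses do.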



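Random Forest has the lowest RMSE (1663.57), but it is by far the slowest model to train (8min 33s) and to predict (732 ms). The base LightGBM model has the second lowest RMSE (1725.06); in the timings recorded above its search took 2.51 s and its prediction 28.9 ms, which is slower at prediction than the Decision Tree (3.99 ms). The two LightGBM searches that restrict n_estimators to between 10 and 30, below the library default of 100, score worse (1945.74 and 2118.89). That leaves those tuned LightGBM models and the Decision Tree to choose between, and the Decision Tree predicts faster and has the lower RMSE (1914.80).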
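
The trade-off is easier to see with the recorded numbers side by side. The sketch below only re-enters the RMSE values (rounded to two decimals) and wall times printed above and sorts them; the short model labels are shorthand for the sections of this notebook.

```python
# (model, test RMSE, search wall time in seconds, prediction wall time in milliseconds)
results = [
    ("Linear Regression", 2736.07, 12.1, 15.0),
    ("Decision Tree", 1914.80, 10.3, 3.99),
    ("Random Forest", 1663.57, 513.0, 732.0),  # 8min 33s = 513 s
    ("LightGBM 1", 1725.06, 2.51, 28.9),
    ("LightGBM 2", 1945.74, 4.6, 12.0),
    ("LightGBM 3", 2118.89, 13.9, 8.48),
]

by_rmse = sorted(results, key=lambda row: row[1])
fastest_prediction = min(results, key=lambda row: row[3])

print(f"{'Model':<20}{'RMSE':>10}{'Search (s)':>12}{'Predict (ms)':>14}")
for name, score, search_s, predict_ms in by_rmse:
    print(f"{name:<20}{score:>10.2f}{search_s:>12.2f}{predict_ms:>14.2f}")
print("Fastest prediction:", fastest_prediction[0])
```

Single `%%time` measurements vary from run to run, so small differences between the fast models should be read with caution.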