## Car Auction Price Prediction

Given *data about cars listed for auction*, let's try to predict the **price** of a given car.

We will train several regression models and compare them by their R^2 score on a held-out test set.

Data source: https://www.kaggle.com/datasets/doaaalsenani/usa-cers-dataset

### Importing Libraries

In [56]:
import numpy as np
import pandas as pd
import re

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.svm import LinearSVR, SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor

In [2]:
data = pd.read_csv('archive/USA_cars_datasets.csv')
data

,Unnamed: 0,price,brand,model,year,title_status,mileage,color,vin,lot,state,country,condition
0,0,6300,toyota,cruiser,2008,clean vehicle,274117.0,black,jtezu11f88k007763,159348797,new jersey,usa,10 days left
1,1,2899,ford,se,2011,clean vehicle,190552.0,silver,2fmdk3gc4bbb02217,166951262,tennessee,usa,6 days left
2,2,5350,dodge,mpv,2018,clean vehicle,39590.0,silver,3c4pdcgg5jt346413,167655728,georgia,usa,2 days left
3,3,25000,ford,door,2014,clean vehicle,64146.0,blue,1ftfw1et4efc23745,167753855,virginia,usa,22 hours left
4,4,27700,chevrolet,1500,2018,clean vehicle,6654.0,red,3gcpcrec2jg473991,167763266,florida,usa,22 hours left
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2494,2494,7800,nissan,versa,2019,clean vehicle,23609.0,red,3n1cn7ap9kl880319,167722715,california,usa,1 days left
2495,2495,9200,nissan,versa,2018,clean vehicle,34553.0,silver,3n1cn7ap5jl884088,167762225,florida,usa,21 hours left
2496,2496,9200,nissan,versa,2018,clean vehicle,31594.0,silver,3n1cn7ap9jl884191,167762226,florida,usa,21 hours left
2497,2497,9200,nissan,versa,2018,clean vehicle,32557.0,black,3n1cn7ap3jl883263,167762227,florida,usa,2 days left


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2499 entries, 0 to 2498
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Unnamed: 0    2499 non-null   int64  
 1   price         2499 non-null   int64  
 2   brand         2499 non-null   object 
 3   model         2499 non-null   object 
 4   year          2499 non-null   int64  
 5   title_status  2499 non-null   object 
 6   mileage       2499 non-null   float64
 7   color         2499 non-null   object 
 8   vin           2499 non-null   object 
 9   lot           2499 non-null   int64  
 10  state         2499 non-null   object 
 11  country       2499 non-null   object 
 12  condition     2499 non-null   object 
dtypes: float64(1), int64(4), object(8)
memory usage: 253.9+ KB


### Preprocessing

In [26]:
df = data.copy()

In [27]:
df = df.drop('Unnamed: 0', axis=1)

In [28]:
def binary_encode(df, columns_with_positive_values):
    df = df.copy()
    for column, positive_value in columns_with_positive_values:
        df[column] = df[column].apply(lambda x: 1 if x == positive_value else 0)
    return df    

def onehot_encode(df, columns_with_prefixes):
    df = df.copy()
    for column, prefix in columns_with_prefixes:
        dummies = pd.get_dummies(df[column], prefix=prefix)
        df = pd.concat([df, dummies], axis=1)
        df = df.drop(column, axis=1)
    return df
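
To make the behavior of the two helpers concrete, here is a minimal sketch that runs them on a tiny, hypothetical frame (the rows are made up for illustration and are not taken from the dataset). The helper definitions are repeated unchanged so the block runs on its own.

```python
import pandas as pd

def binary_encode(df, columns_with_positive_values):
    df = df.copy()
    for column, positive_value in columns_with_positive_values:
        # 1 where the cell equals the positive value, 0 otherwise
        df[column] = df[column].apply(lambda x: 1 if x == positive_value else 0)
    return df

def onehot_encode(df, columns_with_prefixes):
    df = df.copy()
    for column, prefix in columns_with_prefixes:
        # One indicator column per distinct value, named <prefix>_<value>
        dummies = pd.get_dummies(df[column], prefix=prefix)
        df = pd.concat([df, dummies], axis=1)
        df = df.drop(column, axis=1)
    return df

# Hypothetical toy rows (assumption: not real dataset rows)
toy = pd.DataFrame({
    "title_status": ["clean vehicle", "salvage insurance", "clean vehicle"],
    "brand": ["ford", "nissan", "ford"],
})

toy_encoded = binary_encode(toy, [("title_status", "salvage insurance")])
toy_encoded = onehot_encode(toy_encoded, [("brand", "br")])
print(toy_encoded)
```

The binary column collapses to a single 0/1 flag, while the one-hot column is replaced by one indicator per distinct value, which is why the high-cardinality columns later widen the frame so much.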

In [29]:
{column: len(df[column].unique()) for column in df.columns}

{'price': 790,
 'brand': 28,
 'model': 127,
 'year': 30,
 'title_status': 2,
 'mileage': 2439,
 'color': 49,
 'vin': 2495,
 'lot': 2495,
 'state': 44,
 'country': 2,
 'condition': 47}

In [30]:
# Drop identifier columns (vin and lot are nearly unique per row, so they would not generalize)
df = df.drop(['vin', 'lot'], axis=1)
df

,price,brand,model,year,title_status,mileage,color,state,country,condition
0,6300,toyota,cruiser,2008,clean vehicle,274117.0,black,new jersey,usa,10 days left
1,2899,ford,se,2011,clean vehicle,190552.0,silver,tennessee,usa,6 days left
2,5350,dodge,mpv,2018,clean vehicle,39590.0,silver,georgia,usa,2 days left
3,25000,ford,door,2014,clean vehicle,64146.0,blue,virginia,usa,22 hours left
4,27700,chevrolet,1500,2018,clean vehicle,6654.0,red,florida,usa,22 hours left
...,...,...,...,...,...,...,...,...,...,...
2494,7800,nissan,versa,2019,clean vehicle,23609.0,red,california,usa,1 days left
2495,9200,nissan,versa,2018,clean vehicle,34553.0,silver,florida,usa,21 hours left
2496,9200,nissan,versa,2018,clean vehicle,31594.0,silver,florida,usa,21 hours left
2497,9200,nissan,versa,2018,clean vehicle,32557.0,black,florida,usa,2 days left


In [31]:
data['title_status'].value_counts().keys()

Index(['clean vehicle', 'salvage insurance'], dtype='object', name='title_status')

In [32]:
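# Binary encode the title_status and country columns
# Note: the leading space in ' canada' is deliberate; it is the exact value that matches the 7 Canadian rows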
df = binary_encode(df, 
                  columns_with_positive_values = [
                      ('title_status', 'salvage insurance'),
                      ('country', ' canada')
                  ])

In [33]:
df

,price,brand,model,year,title_status,mileage,color,state,country,condition
0,6300,toyota,cruiser,2008,0,274117.0,black,new jersey,0,10 days left
1,2899,ford,se,2011,0,190552.0,silver,tennessee,0,6 days left
2,5350,dodge,mpv,2018,0,39590.0,silver,georgia,0,2 days left
3,25000,ford,door,2014,0,64146.0,blue,virginia,0,22 hours left
4,27700,chevrolet,1500,2018,0,6654.0,red,florida,0,22 hours left
...,...,...,...,...,...,...,...,...,...,...
2494,7800,nissan,versa,2019,0,23609.0,red,california,0,1 days left
2495,9200,nissan,versa,2018,0,34553.0,silver,florida,0,21 hours left
2496,9200,nissan,versa,2018,0,31594.0,silver,florida,0,21 hours left
2497,9200,nissan,versa,2018,0,32557.0,black,florida,0,2 days left


In [36]:
df['country'].value_counts()

country
0    2492
1       7
Name: count, dtype: int64
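
The positive value `' canada'` used above carries a leading space, which suggests the raw text values are padded. A small sketch of how such padding could be detected and made harmless, using a hypothetical series rather than the actual CSV column:

```python
import pandas as pd

# Hypothetical values mimicking a padded text column (assumption, not read from the CSV)
country = pd.Series([" usa", " usa", " canada"])

# Listing the unique values as Python strings makes the padding visible
print(country.unique().tolist())

# An exact match against the unpadded label silently finds nothing
unpadded_matches = int((country == "canada").sum())

# Stripping whitespace first makes the comparison robust to padding
stripped_matches = int((country.str.strip() == "canada").sum())

print(unpadded_matches, stripped_matches)
```

Checking `value_counts()` after encoding, as done here, is a cheap way to confirm that the positive value actually matched some rows.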

In [37]:
df['title_status'].value_counts()

title_status
0    2336
1     163
Name: count, dtype: int64

In [38]:
# One-hot encode the brand, model, color, state and condition columns
df = onehot_encode(df, 
                  columns_with_prefixes=[
                      ('brand', 'br'),
                      ('model', 'md'),
                      ('color', 'cl'),
                      ('state', 'st'),
                      ('condition', 'cd')
                  ])

In [39]:
df

,price,year,title_status,mileage,country,br_acura,br_audi,br_bmw,br_buick,br_cadillac,...,cd_5 hours left,cd_53 minutes,cd_6 days left,cd_6 hours left,cd_7 days left,cd_7 hours left,cd_8 days left,cd_9 days left,cd_9 minutes,cd_Listing Expired
0,6300,2008,0,274117.0,0,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,2899,2011,0,190552.0,0,False,False,False,False,False,...,False,False,True,False,False,False,False,False,False,False
2,5350,2018,0,39590.0,0,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,25000,2014,0,64146.0,0,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,27700,2018,0,6654.0,0,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2494,7800,2019,0,23609.0,0,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2495,9200,2018,0,34553.0,0,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2496,9200,2018,0,31594.0,0,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2497,9200,2018,0,32557.0,0,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [57]:
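# Strip everything except letters, digits and underscores from the column names,
# since LightGBM raises an error on feature names containing certain special characters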
df = df.rename(columns=lambda x: re.sub('[^A-Za-z0-9_]+', '', x))
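
As a quick illustration of what this rename does, the sketch below applies the same regular expression to a few column names from the one-hot output. The last two names are hypothetical and only show a caveat: two distinct names can collapse to the same cleaned name, so it is worth checking for duplicates afterwards.

```python
import re

pattern = '[^A-Za-z0-9_]+'

# Column names as they appear in the one-hot encoded frame
names = ["cd_5 hours left", "cd_Listing Expired", "br_acura"]
cleaned = [re.sub(pattern, '', name) for name in names]
print(cleaned)

# Hypothetical names (assumption) that differ only by a stripped character
risky = ["md_e-class", "md_eclass"]
risky_cleaned = [re.sub(pattern, '', name) for name in risky]
has_collision = len(set(risky_cleaned)) < len(risky_cleaned)
print(has_collision)
```

In this notebook the feature count stays at 299 after the rename, so no such collision appears to occur here.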

In [58]:
# Split df into X and y
y = df['price'].copy()
X = df.drop('price', axis=1).copy()

In [59]:
X

,year,title_status,mileage,country,br_acura,br_audi,br_bmw,br_buick,br_cadillac,br_chevrolet,...,cd_5hoursleft,cd_53minutes,cd_6daysleft,cd_6hoursleft,cd_7daysleft,cd_7hoursleft,cd_8daysleft,cd_9daysleft,cd_9minutes,cd_ListingExpired
0,2008,0,274117.0,0,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,2011,0,190552.0,0,False,False,False,False,False,False,...,False,False,True,False,False,False,False,False,False,False
2,2018,0,39590.0,0,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,2014,0,64146.0,0,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,2018,0,6654.0,0,False,False,False,False,False,True,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2494,2019,0,23609.0,0,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2495,2018,0,34553.0,0,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2496,2018,0,31594.0,0,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2497,2018,0,32557.0,0,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [60]:
y

0        6300
1        2899
2        5350
3       25000
4       27700
        ...  
2494     7800
2495     9200
2496     9200
2497     9200
2498     9200
Name: price, Length: 2499, dtype: int64

In [61]:
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=123)

In [62]:
X_train.shape, X_test.shape

((1749, 299), (750, 299))
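
The shapes above can be cross-checked with a little arithmetic. The sketch below assumes the unique-value counts printed earlier and that every one-hot column survived the rename; the variable names are only illustrative.

```python
# Unique-value counts reported earlier for the one-hot encoded columns
onehot_cardinalities = {
    "brand": 28,
    "model": 127,
    "color": 49,
    "state": 44,
    "condition": 47,
}

# Columns kept as single numeric features
passthrough = ["year", "title_status", "mileage", "country"]

n_features = sum(onehot_cardinalities.values()) + len(passthrough)

# train_size=0.7 on 2499 rows: the train count is rounded down, the rest go to test
n_rows = 2499
n_train = int(0.7 * n_rows)
n_test = n_rows - n_train

print(n_features, n_train, n_test)
```

With 299 features and only 1749 training rows, most of the width comes from sparse indicator columns, which is worth keeping in mind when comparing the models.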

In [63]:
# Scale X with a standard scaler
scaler = StandardScaler()
scaler.fit(X_train)

X_train = pd.DataFrame(scaler.transform(X_train), columns=X.columns, index=X_train.index)
X_test = pd.DataFrame(scaler.transform(X_test), columns=X.columns, index=X_test.index)

In [64]:
X_train.mean()

year                 1.161284e-14
title_status         6.093849e-17
mileage              3.656309e-17
country              0.000000e+00
br_acura             5.078207e-18
                         ...     
cd_7hoursleft        8.125131e-18
cd_8daysleft        -2.031283e-17
cd_9daysleft         3.453181e-17
cd_9minutes          1.117206e-17
cd_ListingExpired   -5.687592e-17
Length: 299, dtype: float64

In [65]:
X_train.var()

year                 1.000572
title_status         1.000572
mileage              1.000572
country              1.000572
br_acura             1.000572
                       ...   
cd_7hoursleft        1.000572
cd_8daysleft         1.000572
cd_9daysleft         1.000572
cd_9minutes          1.000572
cd_ListingExpired    1.000572
Length: 299, dtype: float64
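
The variance is 1.000572 rather than exactly 1 because `StandardScaler` normalizes with the population standard deviation (`ddof=0`), while `DataFrame.var()` defaults to the sample variance (`ddof=1`). The ratio between the two is n / (n - 1) = 1749 / 1748. A minimal sketch, using a hypothetical random feature in place of the real columns:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

n = 1749  # number of training rows reported above

# Hypothetical random feature standing in for a real column (assumption)
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({"feature": rng.normal(size=n)})

scaled = pd.DataFrame(StandardScaler().fit_transform(X_demo), columns=X_demo.columns)

population_var = scaled["feature"].var(ddof=0)  # what StandardScaler normalizes to 1
sample_var = scaled["feature"].var()            # pandas default, ddof=1
expected_ratio = n / (n - 1)

print(round(population_var, 6), round(sample_var, 6), round(expected_ratio, 6))
```

So the scaling worked as intended; the small excess is only a difference in the variance convention.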

In [66]:
X_train

,year,title_status,mileage,country,br_acura,br_audi,br_bmw,br_buick,br_cadillac,br_chevrolet,...,cd_5hoursleft,cd_53minutes,cd_6daysleft,cd_6hoursleft,cd_7daysleft,cd_7hoursleft,cd_8daysleft,cd_9daysleft,cd_9minutes,cd_ListingExpired
1351,-1.288806,-0.260318,1.512784,-0.053544,-0.041451,-0.023918,-0.075832,-0.083117,-0.067787,-0.364379,...,-0.086536,-0.023918,-0.142899,-0.058671,-0.136518,-0.041451,-0.191707,-0.142899,-0.041451,-0.089829
903,0.372505,-0.260318,-0.522124,-0.053544,-0.041451,-0.023918,-0.075832,-0.083117,-0.067787,-0.364379,...,-0.086536,-0.023918,-0.142899,-0.058671,-0.136518,-0.041451,-0.191707,-0.142899,-0.041451,-0.089829
2049,0.649390,-0.260318,-0.394639,-0.053544,-0.041451,-0.023918,-0.075832,-0.083117,-0.067787,-0.364379,...,-0.086536,-0.023918,-0.142899,-0.058671,-0.136518,-0.041451,-0.191707,-0.142899,-0.041451,-0.089829
798,-1.011921,-0.260318,0.592934,-0.053544,-0.041451,-0.023918,-0.075832,-0.083117,-0.067787,2.744396,...,-0.086536,-0.023918,-0.142899,-0.058671,-0.136518,-0.041451,-0.191707,-0.142899,-0.041451,-0.089829
1360,0.649390,-0.260318,-0.625620,-0.053544,-0.041451,-0.023918,-0.075832,-0.083117,-0.067787,-0.364379,...,-0.086536,-0.023918,-0.142899,-0.058671,-0.136518,-0.041451,-0.191707,-0.142899,-0.041451,-0.089829
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1147,0.649390,-0.260318,-0.285273,-0.053544,-0.041451,-0.023918,-0.075832,-0.083117,-0.067787,-0.364379,...,-0.086536,-0.023918,-0.142899,-0.058671,-0.136518,-0.041451,-0.191707,-0.142899,-0.041451,-0.089829
2154,0.649390,-0.260318,-0.218108,-0.053544,-0.041451,-0.023918,-0.075832,-0.083117,-0.067787,-0.364379,...,-0.086536,-0.023918,-0.142899,-0.058671,-0.136518,-0.041451,-0.191707,-0.142899,-0.041451,-0.089829
1766,-0.458151,-0.260318,0.097645,-0.053544,-0.041451,-0.023918,-0.075832,-0.083117,-0.067787,-0.364379,...,-0.086536,-0.023918,-0.142899,-0.058671,-0.136518,-0.041451,-0.191707,-0.142899,-0.041451,-0.089829
1122,0.649390,-0.260318,-0.392006,-0.053544,-0.041451,-0.023918,-0.075832,-0.083117,-0.067787,-0.364379,...,-0.086536,-0.023918,-0.142899,-0.058671,-0.136518,-0.041451,-0.191707,-0.142899,-0.041451,-0.089829


### Training

In [67]:
models = {
    "                     Linear Regression": LinearRegression(),
    "                   K-Nearest Neighbors": KNeighborsRegressor(),
    "                        Neural Network": MLPRegressor(),
    "Support Vector Machine (Linear Kernel)": LinearSVR(),
    "   Support Vector Machine (RBF Kernel)": SVR(),
    "                         Decision Tree": DecisionTreeRegressor(),
    "                         Random Forest": RandomForestRegressor(),
    "                     Gradient Boosting": GradientBoostingRegressor(),
    "                               XGBoost": XGBRegressor(),
    "                              LightGBM": LGBMRegressor(),
    "                              CatBoost": CatBoostRegressor()
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(name + " trained.")

                     Linear Regression trained.
                   K-Nearest Neighbors trained.
                        Neural Network trained.
Support Vector Machine (Linear Kernel) trained.
   Support Vector Machine (RBF Kernel) trained.
                         Decision Tree trained.
                         Random Forest trained.
                     Gradient Boosting trained.
                               XGBoost trained.
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001051 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 495
[LightGBM] [Info] Number of data points in the train set: 1749, number of used features: 74
[LightGBM] [Info] Start training from score 18689.759291
                              LightGBM trained.
Learning rate set to 0.044724
0:	learn: 11773.9471864	total: 48.2ms	remaining: 48.2s
1:	learn: 11644.0791228	total: 49.5ms	remaining: 24.7s
2:	learn: 11479.2418290	total: 50.8ms	remaining: 16.9s
3:	learn: 113

### Results

In [68]:
for name, model in models.items():
    print(name + " R^2 Score: {:.5f}".format(model.score(X_test, y_test)))

                     Linear Regression R^2 Score: 0.60394
                   K-Nearest Neighbors R^2 Score: 0.45878
                        Neural Network R^2 Score: -1.16986
Support Vector Machine (Linear Kernel) R^2 Score: -1.89813
   Support Vector Machine (RBF Kernel) R^2 Score: -0.02643
                         Decision Tree R^2 Score: 0.47375
                         Random Forest R^2 Score: 0.63509
                     Gradient Boosting R^2 Score: 0.57529
                               XGBoost R^2 Score: 0.65481
                              LightGBM R^2 Score: 0.59239
                              CatBoost R^2 Score: 0.67971
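
Several scores above are negative, which can look surprising. R^2 is defined as 1 - SS_res / SS_tot, so a model that predicts worse than simply returning the mean of `y_test` gets a score below zero. The sketch below computes it by hand on hypothetical prices and predictions (made-up numbers, not outputs of the models above) and compares the result with scikit-learn's `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

def r2_manual(y_true, y_pred):
    # Residual sum of squares: error of the model's predictions
    ss_res = np.sum((y_true - y_pred) ** 2)
    # Total sum of squares: error of always predicting the mean
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Hypothetical prices and predictions, for illustration only
y_true = np.array([10000.0, 20000.0, 30000.0])
y_good = np.array([11000.0, 19000.0, 31000.0])   # close to the truth
y_mean = np.full_like(y_true, y_true.mean())     # constant mean baseline
y_bad = np.array([30000.0, 20000.0, 10000.0])    # worse than the baseline

for label, y_pred in [("good", y_good), ("mean", y_mean), ("bad", y_bad)]:
    print(label, r2_manual(y_true, y_pred), r2_score(y_true, y_pred))
```

Read this way, the negative scores for the neural network and the support vector models mean they did worse than a constant prediction with their default settings, while the tree ensembles explain roughly 60 to 70 percent of the variance in test prices.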
