# Introduction

Rusty Bargain used car sales service is developing an app to attract new customers. In that app, you can quickly find out the market value of your car. You have access to historical data: technical specifications, trim versions, and prices. You need to build the model to determine the value. 

Rusty Bargain is interested in:

- the quality of the prediction;
- the speed of the prediction;
- the time required for training.

## Data preparation

In [71]:
# Load libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
import lightgbm as lgb
import catboost as cb
import xgboost as xgb
import time


In [72]:
#load data
cars = pd.read_csv("car_data.csv")

In [73]:
# Examine a few rows
cars.sample(5)

Unnamed: 0,DateCrawled,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Mileage,RegistrationMonth,FuelType,Brand,NotRepaired,DateCreated,NumberOfPictures,PostalCode,LastSeen
107034,20/03/2016 19:57,3999,small,2001,auto,125,a_klasse,150000,12,petrol,mercedes_benz,no,20/03/2016 00:00,0,70374,07/04/2016 04:46
280905,08/03/2016 17:25,650,small,1998,manual,54,corsa,150000,3,petrol,opel,yes,08/03/2016 00:00,0,56479,07/04/2016 01:17
129099,17/03/2016 13:39,1699,sedan,1999,manual,116,beetle,150000,9,petrol,volkswagen,no,17/03/2016 00:00,0,67067,25/03/2016 14:16
128446,03/04/2016 22:58,9200,sedan,2006,auto,200,golf,150000,5,petrol,volkswagen,no,03/04/2016 00:00,0,92237,06/04/2016 02:44
230433,08/03/2016 16:38,1500,bus,2007,manual,75,other,125000,6,lpg,chevrolet,no,08/03/2016 00:00,0,52146,02/04/2016 23:15


In [74]:
# check duplicates before I drop some columns
print(cars.duplicated().sum())

262


In [75]:
# remove duplicates
cars = cars.drop_duplicates()

In [76]:
# confirm changes
print(cars.duplicated().sum())

0


I think several columns are not needed to build the model: `DateCrawled`, `RegistrationMonth`, `DateCreated`, `NumberOfPictures`, `PostalCode`, and `LastSeen`. I will drop these columns.

In [77]:
# Drop irrelevant columns
columns_to_drop = ['DateCrawled', 'RegistrationMonth', 'DateCreated','NumberOfPictures', 'PostalCode', 'LastSeen' ]

cars = cars.drop(columns = columns_to_drop)

In [78]:
# observe changes
cars.head()

Unnamed: 0,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Mileage,FuelType,Brand,NotRepaired
0,480,,1993,manual,0,golf,150000,petrol,volkswagen,
1,18300,coupe,2011,manual,190,,125000,gasoline,audi,yes
2,9800,suv,2004,auto,163,grand,125000,gasoline,jeep,
3,1500,small,2001,manual,75,golf,150000,petrol,volkswagen,no
4,3600,small,2008,manual,69,fabia,90000,gasoline,skoda,no


In [79]:
# Inspect data types
cars.info()

<class 'pandas.core.frame.DataFrame'>
Index: 354107 entries, 0 to 354368
Data columns (total 10 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   Price             354107 non-null  int64 
 1   VehicleType       316623 non-null  object
 2   RegistrationYear  354107 non-null  int64 
 3   Gearbox           334277 non-null  object
 4   Power             354107 non-null  int64 
 5   Model             334406 non-null  object
 6   Mileage           354107 non-null  int64 
 7   FuelType          321218 non-null  object
 8   Brand             354107 non-null  object
 9   NotRepaired       282962 non-null  object
dtypes: int64(4), object(6)
memory usage: 29.7+ MB


In [80]:
# missing values
cars.isnull().sum()

Price                   0
VehicleType         37484
RegistrationYear        0
Gearbox             19830
Power                   0
Model               19701
Mileage                 0
FuelType            32889
Brand                   0
NotRepaired         71145
dtype: int64

In [81]:
# Find unique values from the columns with null values

columns_with_null = ['VehicleType', 'Gearbox', 'Model', 'FuelType','NotRepaired' ]

missing = {one_column: cars[one_column].unique().tolist()
    for one_column in columns_with_null}

for key, value in missing.items():
    print(f"\n\nColumn: {key}")
    print(f"Number of Unique Values: {len(value)}")
    print(f"Unique Values: {value}")
    



Column: VehicleType
Number of Unique Values: 9
Unique Values: [nan, 'coupe', 'suv', 'small', 'sedan', 'convertible', 'bus', 'wagon', 'other']


Column: Gearbox
Number of Unique Values: 3
Unique Values: ['manual', 'auto', nan]


Column: Model
Number of Unique Values: 251
Unique Values: ['golf', nan, 'grand', 'fabia', '3er', '2_reihe', 'other', 'c_max', '3_reihe', 'passat', 'navara', 'ka', 'polo', 'twingo', 'a_klasse', 'scirocco', '5er', 'meriva', 'arosa', 'c4', 'civic', 'transporter', 'punto', 'e_klasse', 'clio', 'kadett', 'kangoo', 'corsa', 'one', 'fortwo', '1er', 'b_klasse', 'signum', 'astra', 'a8', 'jetta', 'fiesta', 'c_klasse', 'micra', 'vito', 'sprinter', '156', 'escort', 'forester', 'xc_reihe', 'scenic', 'a4', 'a1', 'insignia', 'combo', 'focus', 'tt', 'a6', 'jazz', 'omega', 'slk', '7er', '80', '147', '100', 'z_reihe', 'sportage', 'sorento', 'v40', 'ibiza', 'mustang', 'eos', 'touran', 'getz', 'a3', 'almera', 'megane', 'lupo', 'r19', 'zafira', 'caddy', 'mondeo', 'cordoba', 'colt',

In the `FuelType` column, I see both `petrol` and `gasoline`, which are synonyms. The dataset appears to be European, since the target `Price` is in euros, so I will replace `gasoline` with `petrol`.

In [82]:
# Replace gasoline with petrol
cars['FuelType'] = cars['FuelType'].replace('gasoline', 'petrol')

In [83]:
# confirm changes
cars['FuelType'].unique()

array(['petrol', nan, 'lpg', 'other', 'hybrid', 'cng', 'electric'],
      dtype=object)

In [84]:
# Fill missing values

for one_col in columns_with_null :
    cars[one_col] = cars[one_col].fillna('unknown')

In [85]:
# Check missing values

cars.isnull().sum()

Price               0
VehicleType         0
RegistrationYear    0
Gearbox             0
Power               0
Model               0
Mileage             0
FuelType            0
Brand               0
NotRepaired         0
dtype: int64

In [86]:
# Get summary statistics for the numerical columns
cars.describe()

Unnamed: 0,Price,RegistrationYear,Power,Mileage
count,354107.0,354107.0,354107.0,354107.0
mean,4416.433287,2004.235355,110.089651,128211.811684
std,4514.338584,90.261168,189.914972,37906.590101
min,0.0,1000.0,0.0,5000.0
25%,1050.0,1999.0,69.0,125000.0
50%,2700.0,2003.0,105.0,150000.0
75%,6400.0,2008.0,143.0,150000.0
max,20000.0,9999.0,20000.0,150000.0


In [87]:
# Create a min max dataframe to get intuition from the data

numerical_cols = ['RegistrationYear', 'Power', 'Mileage']


# Create min_max_df DataFrame

min_max_df = pd.DataFrame({
    'Column': numerical_cols,
    'Min': [cars[one_col].min() for one_col in numerical_cols],
    'Max': [cars[one_col].max() for one_col in numerical_cols]
})

min_max_df

Unnamed: 0,Column,Min,Max
0,RegistrationYear,1000,9999
1,Power,0,20000
2,Mileage,5000,150000


I will use domain knowledge to select the values. If we look at the Mileage column, the minimum is 5,000 and the maximum is 150,000. This range is realistic. However, if we examine the RegistrationYear column, the minimum year is 1000 and the maximum year is 9999, which are clearly unrealistic. I will filter the data to include only cars registered from 1960 to 2016, since the dataset was generated in 2016. Finally, if we consider the Power column, the minimum value is 0 and the maximum is 20,000, both of which are implausible. I will select cars with Power values between 50 and 1000.

In [88]:
# Filter dataset based on domain knowledge
cars = cars[
    (cars['RegistrationYear'] >= 1960) & (cars['RegistrationYear'] <= 2016) &
    (cars['Power'] >= 50) & (cars['Power'] <= 1000)
]
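Before applying a filter like the one above, it can help to count how many rows each rule would remove. The sketch below uses a small made-up frame (`toy` is not the real `cars` data) and `Series.between`, which is inclusive on both ends by default, to reproduce the same year and power bounds.

```python
import pandas as pd

# Toy frame standing in for `cars`; the values are invented for illustration.
toy = pd.DataFrame({
    "RegistrationYear": [1000, 1995, 2010, 9999, 2016],
    "Power": [0, 75, 20000, 120, 50],
})

# Same bounds as the notebook filter; between() includes both endpoints.
year_ok = toy["RegistrationYear"].between(1960, 2016)
power_ok = toy["Power"].between(50, 1000)

print("rows with implausible year: ", int((~year_ok).sum()))
print("rows with implausible power:", int((~power_ok).sum()))
print("rows kept by both rules:    ", int((year_ok & power_ok).sum()))
```

Running the same three counts on the real frame would show which rule accounts for most of the drop from 354,107 to 296,990 rows.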

In [89]:
# Examine a few rows
cars.sample(5)

Unnamed: 0,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Mileage,FuelType,Brand,NotRepaired
10701,4500,convertible,1989,manual,90,golf,150000,petrol,volkswagen,unknown
242741,400,coupe,1997,manual,115,escort,150000,petrol,ford,yes
85706,14800,convertible,2011,manual,126,mx_reihe,40000,petrol,mazda,no
86654,6800,wagon,2010,manual,109,focus,125000,petrol,ford,no
125778,6500,unknown,2016,manual,230,signum,150000,petrol,opel,no


In [90]:
# Get general information
cars.info()

<class 'pandas.core.frame.DataFrame'>
Index: 296990 entries, 1 to 354368
Data columns (total 10 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   Price             296990 non-null  int64 
 1   VehicleType       296990 non-null  object
 2   RegistrationYear  296990 non-null  int64 
 3   Gearbox           296990 non-null  object
 4   Power             296990 non-null  int64 
 5   Model             296990 non-null  object
 6   Mileage           296990 non-null  int64 
 7   FuelType          296990 non-null  object
 8   Brand             296990 non-null  object
 9   NotRepaired       296990 non-null  object
dtypes: int64(4), object(6)
memory usage: 24.9+ MB


I think it is better to replace `RegistrationYear` with a new `Age` column, computed as 2016 minus the registration year. Age is more intuitive.

In [91]:
#Create column Age and drop RegistrationYear

cars['Age'] = 2016 - cars['RegistrationYear']
cars = cars.drop(columns=['RegistrationYear'])

In [92]:
# Examine changes
cars.head()

Unnamed: 0,Price,VehicleType,Gearbox,Power,Model,Mileage,FuelType,Brand,NotRepaired,Age
1,18300,coupe,manual,190,unknown,125000,petrol,audi,yes,5
2,9800,suv,auto,163,grand,125000,petrol,jeep,unknown,12
3,1500,small,manual,75,golf,150000,petrol,volkswagen,no,15
4,3600,small,manual,69,fabia,90000,petrol,skoda,no,8
5,650,sedan,manual,102,3er,150000,petrol,bmw,yes,21


We need to convert the categorical columns into numerical ones. We can do this with One Hot Encoding (OHE).

In [93]:
# Apply One Hot Encoding (OHE)

columns_for_ohe = ['VehicleType', 'Gearbox', 'Model', 'FuelType', 'Brand', 'NotRepaired']
cars_ohe = pd.get_dummies(cars, columns=columns_for_ohe, drop_first=False, dtype=int)

In [94]:
# Examine changes

cars_ohe.head()

Unnamed: 0,Price,Power,Mileage,Age,VehicleType_bus,VehicleType_convertible,VehicleType_coupe,VehicleType_other,VehicleType_sedan,VehicleType_small,...,Brand_sonstige_autos,Brand_subaru,Brand_suzuki,Brand_toyota,Brand_trabant,Brand_volkswagen,Brand_volvo,NotRepaired_no,NotRepaired_unknown,NotRepaired_yes
1,18300,190,125000,5,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2,9800,163,125000,12,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
3,1500,75,150000,15,0,0,0,0,0,1,...,0,0,0,0,0,1,0,1,0,0
4,3600,69,90000,8,0,0,0,0,0,1,...,0,0,0,0,0,0,0,1,0,0
5,650,102,150000,21,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,1


As we can see, each categorical column has been expanded into one 0/1 column per category, so the table is now much wider.
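To make the widening concrete, here is a minimal sketch on a three-row toy frame (the rows are invented, loosely modelled on the sample above). It uses the same `pd.get_dummies` call as the notebook, with `drop_first=False` and `dtype=int`.

```python
import pandas as pd

# Three invented rows with two categorical columns.
toy = pd.DataFrame({
    "Price": [1500, 3600, 9800],
    "Gearbox": ["manual", "manual", "auto"],
    "Brand": ["volkswagen", "skoda", "jeep"],
})

toy_ohe = pd.get_dummies(toy, columns=["Gearbox", "Brand"], drop_first=False, dtype=int)

# Gearbox has 2 categories and Brand has 3, so 2 columns become 5.
print(toy.shape, "->", toy_ohe.shape)
print(list(toy_ohe.columns))
```

The same arithmetic explains the real data: a column such as `Model`, with roughly 250 distinct values, contributes roughly 250 new columns on its own.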

Now we need to split the data for the models. Linear Regression, Random Forest, and XGBoost use the One Hot Encoded data. LightGBM and CatBoost have their own handling of categorical features, so they use the non-encoded data.

In [95]:
# For LinearRegression, Random Forest, XGBoost

X_ohe = cars_ohe.drop('Price', axis=1) # Features. We are using ohe encoded data
y_ohe = cars_ohe['Price'] # Target



# Split into training (60%) and remaining (40%) 
X_train_ohe, X_remain_ohe, y_train_ohe, y_remain_ohe = train_test_split(X_ohe, y_ohe,test_size=0.4,random_state=100)

# Split remaining into validation (20%) and test (20%)
X_valid_ohe, X_test_ohe, y_valid_ohe, y_test_ohe = train_test_split(X_remain_ohe, y_remain_ohe,test_size=0.5, random_state=100)
  


# For LightGBM, CatBoost
X_non_ohe = cars.drop('Price', axis=1)  # Features. We are using non encoded data 
y_non_ohe = cars['Price']


#Split into training (60%) and remaining (40%) 
X_train_non_ohe, X_remain_non_ohe, y_train_non_ohe, y_remain_non_ohe = train_test_split(X_non_ohe, y_non_ohe, test_size=0.4, random_state=100)

# Split remaining into validation (20%) and test (20%)
X_valid_non_ohe, X_test_non_ohe, y_valid_non_ohe, y_test_non_ohe = train_test_split(X_remain_non_ohe, y_remain_non_ohe,test_size=0.5, random_state=100)


# Display shapes of each dataset

print("OHE Training Set Shape (X_train_ohe, y_train_ohe):", X_train_ohe.shape, y_train_ohe.shape)
print("OHE Validation Set Shape (X_valid_ohe, y_valid_ohe):", X_valid_ohe.shape, y_valid_ohe.shape)
print("OHE Test Set Shape (X_test_ohe, y_test_ohe):", X_test_ohe.shape, y_test_ohe.shape)
print('')
print('-'* 80)
print('')
print("Non-OHE Training Set Shape (X_train_non_ohe, y_train_non_ohe):", X_train_non_ohe.shape, y_train_non_ohe.shape)
print("Non-OHE Validation Set Shape (X_valid_non_ohe, y_valid_non_ohe):", X_valid_non_ohe.shape, y_valid_non_ohe.shape)
print("Non-OHE Test Set Shape (X_test_non_ohe, y_test_non_ohe):", X_test_non_ohe.shape, y_test_non_ohe.shape)


OHE Training Set Shape (X_train_ohe, y_train_ohe): (178194, 315) (178194,)
OHE Validation Set Shape (X_valid_ohe, y_valid_ohe): (59398, 315) (59398,)
OHE Test Set Shape (X_test_ohe, y_test_ohe): (59398, 315) (59398,)

--------------------------------------------------------------------------------

Non-OHE Training Set Shape (X_train_non_ohe, y_train_non_ohe): (178194, 9) (178194,)
Non-OHE Validation Set Shape (X_valid_non_ohe, y_valid_non_ohe): (59398, 9) (59398,)
Non-OHE Test Set Shape (X_test_non_ohe, y_test_non_ohe): (59398, 9) (59398,)
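The encoded and non-encoded frames are split separately but with the same `random_state` and the same number of rows, so both should end up with the same rows in each subset. The sketch below checks that on two small invented frames (`encoded` and `raw` are illustrative names, not notebook variables) and also confirms the 60/20/20 proportions of the two-stage split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Two invented frames that share the same index, like cars_ohe and cars.
encoded = pd.DataFrame({"x": range(10)})
raw = pd.DataFrame({"label": list("abcdefghij")})

# Stage 1: 60% train, 40% remaining.
enc_train, enc_rest = train_test_split(encoded, test_size=0.4, random_state=100)
raw_train, raw_rest = train_test_split(raw, test_size=0.4, random_state=100)

# Stage 2: split the remaining 40% in half (20% validation, 20% test).
enc_valid, enc_test = train_test_split(enc_rest, test_size=0.5, random_state=100)
raw_valid, raw_test = train_test_split(raw_rest, test_size=0.5, random_state=100)

print(len(enc_train), len(enc_valid), len(enc_test))
print("same test rows in both frames:", enc_test.index.equals(raw_test.index))
```

Because the subsets line up, test-set RMSE values for the two families of models are computed on the same cars and can be compared directly.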


The numerical columns are on different scales, so we need to standardize them. The scaler is fit on the training set only and then applied to the training, validation, and test sets.

In [96]:
# Define numerical columns to scale
numeric = ['Age', 'Power', 'Mileage']

# Scale OHE Dataset
scaler_ohe = StandardScaler()
scaler_ohe.fit(X_train_ohe[numeric])

# Convert to float before scaling to avoid warnings
X_train_ohe[numeric] = X_train_ohe[numeric].astype(float)
X_valid_ohe[numeric] = X_valid_ohe[numeric].astype(float)
X_test_ohe[numeric] = X_test_ohe[numeric].astype(float)

# Now scale
X_train_ohe[numeric] = scaler_ohe.transform(X_train_ohe[numeric])
X_valid_ohe[numeric] = scaler_ohe.transform(X_valid_ohe[numeric])
X_test_ohe[numeric] = scaler_ohe.transform(X_test_ohe[numeric])

print("\nOHE Dataset - Numerical Columns After Scaling (Training Set):")
print(X_train_ohe[numeric].describe())


OHE Dataset - Numerical Columns After Scaling (Training Set):
                Age         Power       Mileage
count  1.781940e+05  1.781940e+05  1.781940e+05
mean   9.928793e-17  1.435488e-18 -2.468242e-17
std    1.000003e+00  1.000003e+00  1.000003e+00
min   -2.077507e+00 -1.332595e+00 -3.379294e+00
25%   -6.469903e-01 -7.788460e-01 -9.923793e-02
50%   -1.120511e-02 -1.328053e-01  5.841071e-01
75%    6.245801e-01  5.132353e-01  5.841071e-01
max    6.823486e+00  1.620279e+01  5.841071e-01
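The standard deviation printed above is 1.000003 rather than exactly 1. A likely explanation is that `StandardScaler` divides by the population standard deviation (`ddof=0`), while pandas `describe()` reports the sample standard deviation (`ddof=1`), which differs by a factor of sqrt(n / (n - 1)). The sketch below shows both effects on a tiny invented array, along with how a test value is scaled using training statistics only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Invented one-column data: three training rows and one test row.
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[4.0]])

# Fit on the training data only, then reuse the same mean/scale for the test data.
scaler = StandardScaler().fit(train)
scaled_train = scaler.transform(train)
scaled_test = scaler.transform(test)

print("train mean used for scaling:", scaler.mean_[0])
print("population std of scaled train (ddof=0):", scaled_train.std(ddof=0))
print("sample std of scaled train (ddof=1):    ", scaled_train.std(ddof=1))
print("scaled test value:", scaled_test[0, 0])
```

With only three rows the gap between the two standard deviations is large; with 178,194 training rows it shrinks to about 0.000003, which matches the output above.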


## Model training

### Linear Regression

I will use linear regression as a sanity check. It has no hyperparameters worth tuning here, so it is used with default settings.

In [97]:
# LinearRegression Sanity Check (OHE encoded, Scaled)
# Create linear regression model

lr = LinearRegression()

### Random Forest

In [98]:
# Random Forest with Tuning (OHE, Scaled) 

# Define hyperparameter grid for tuning
rf_param = {
    'n_estimators': [100],
    'max_depth': [5, 10]
}
# Create Random Forest model
rf = RandomForestRegressor(random_state=100)

# Hyperparameter tuning
rf_grid = GridSearchCV(rf, rf_param, scoring='neg_mean_squared_error', cv=2, n_jobs=-1)

# Fit on training data
rf_grid.fit(X_train_ohe, y_train_ohe)

# Get best model 
best_rf = rf_grid.best_estimator_

# Print best hyperparameters
print("Best Parameters:", rf_grid.best_params_)


Best Parameters: {'max_depth': 10, 'n_estimators': 100}
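With `scoring='neg_mean_squared_error'`, `GridSearchCV` stores the negated MSE in `best_score_`, so it has to be sign-flipped and square-rooted to be read as an RMSE in euros. The sketch below shows that conversion on small synthetic data (the data and the reduced grid are assumptions made to keep the example fast, not the notebook's setup).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data, invented for illustration.
rng = np.random.default_rng(100)
X = rng.uniform(0, 10, size=(200, 2))
y = 3 * X[:, 0] + rng.normal(0, 1, size=200)

grid = GridSearchCV(
    RandomForestRegressor(n_estimators=20, random_state=100),
    {"max_depth": [2, 5]},
    scoring="neg_mean_squared_error",
    cv=2,
)
grid.fit(X, y)

# best_score_ is the mean cross-validated *negative* MSE of the best setting.
cv_rmse = np.sqrt(-grid.best_score_)
print("best params:", grid.best_params_)
print("cross-validated RMSE:", round(float(cv_rmse), 3))
```

Note that with the default `refit=True`, `best_estimator_` is already refit on the full training data, so fitting it again later is only needed to measure training time.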


### LightGBM

In [99]:
# Categorical columns for LightGBM/CatBoost
categorical_cols = ['VehicleType', 'Gearbox', 'Model', 'FuelType', 'Brand', 'NotRepaired']

# Convert each categorical column to category dtype
for one_col in categorical_cols:
    X_train_non_ohe[one_col] = X_train_non_ohe[one_col].astype('category')
    X_valid_non_ohe[one_col] = X_valid_non_ohe[one_col].astype('category')
    X_test_non_ohe[one_col] = X_test_non_ohe[one_col].astype('category')

# LightGBM with Tuning
lgb_param = {
    'n_estimators': [100, 200],
    'learning_rate': [0.05, 0.1]
}

# Create LightGBM regressor model
# (named lgb_model so the `lgb` module alias is not overwritten)
lgb_model = lgb.LGBMRegressor(random_state=100)

# GridSearchCV for hyperparameter tuning
lgb_grid = GridSearchCV(lgb_model, lgb_param, scoring='neg_mean_squared_error', cv=3, n_jobs=-1)

# Fit on training data
lgb_grid.fit(X_train_non_ohe, y_train_non_ohe, categorical_feature=categorical_cols)

# Get best model 
best_lgb = lgb_grid.best_estimator_

# Print best hyperparameters
print("Best Parameters:", lgb_grid.best_params_)

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.037183 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 619
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.038624 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 619
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.038525 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 624
[LightGBM] [Info] Number of data points in the train set: 118796, number of used features: 9
[LightGBM] [Info] Number of data points in the train set: 118796, number of used features: 9
[LightGBM] [Info] Number of data poin
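One thing to keep in mind with the conversion above: calling `.astype('category')` separately on each split lets every split infer its own category list. Whether that matters depends on how the library reconciles categories at prediction time, which this notebook does not check. The pandas-only sketch below (invented values, hypothetical variable names) shows the difference between independent conversion and sharing a `CategoricalDtype` built from the training data.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Invented values: "fabia" appears in the test split but never in training.
train = pd.Series(["golf", "corsa", "golf"])
test = pd.Series(["corsa", "fabia"])

# Independent conversion: the test split gets its own category list.
independent = test.astype("category")

# Shared dtype: categories come from the training split only.
shared_dtype = CategoricalDtype(categories=sorted(train.unique()))
aligned = test.astype(shared_dtype)

print(list(independent.cat.categories))
print(list(aligned.cat.categories))
# Values unseen in training become missing under the shared dtype.
print(aligned.isna().tolist())
```

Sharing the dtype keeps the category-to-code mapping identical across splits, at the cost of turning unseen values into missing values.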

### CatBoost

In [100]:
# CatBoost with Tuning (non-encoded data; numerical columns are not scaled)

# Hyperparameter grid for tuning
cb_param = {
    'iterations': [100, 200],
    'learning_rate': [0.05, 0.1]
}

# Create CatBoost regressor model
# (named cb_model so the `cb` module alias is not overwritten and the cell can be re-run)
cb_model = cb.CatBoostRegressor(random_state=100, verbose=0)

# Set up GridSearchCV for hyperparameter tuning
cb_grid = GridSearchCV(cb_model, cb_param, scoring='neg_mean_squared_error', cv=2, n_jobs=-1)

# Fit on training data
cb_grid.fit(X_train_non_ohe, y_train_non_ohe, cat_features=categorical_cols)

# Get best model from GridSearchCV
best_cb = cb_grid.best_estimator_

# Print best hyperparameters
print("Best Parameters:", cb_grid.best_params_)


Best Parameters: {'iterations': 200, 'learning_rate': 0.1}


### XGBoost

In [101]:
# XGBoost with Default Settings (OHE, Scaled) 

# Create XGBoost regressor model with default hyperparameters

xgb = xgb.XGBRegressor(random_state=100)

# Train model on training data
xgb.fit(X_train_ohe, y_train_ohe)

# Assign trained model as best model
best_xgb = xgb

## Model analysis

### Linear Regression

In [102]:
# training time 
lr_start_train = time.time()
lr.fit(X_train_ohe, y_train_ohe)
lr_train_time = time.time() - lr_start_train

# prediction time 
lr_start_pred = time.time()
y_pred_lr = lr.predict(X_test_ohe)
lr_pred_time = time.time() - lr_start_pred

# RMSE 
rmse_lr = np.sqrt(mean_squared_error(y_test_ohe, y_pred_lr))


# Print training time
print(f"Training Time: {lr_train_time:.2f} seconds")

# Print prediction time
print(f"Prediction Time: {lr_pred_time:.2f} seconds")

# Print RMSE 
print(f"RMSE on Test Set: {rmse_lr:.0f}")

Training Time: 4.58 seconds
Prediction Time: 0.20 seconds
RMSE on Test Set: 2657
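RMSE is the square root of the mean squared error, so it is expressed in the same unit as `Price` (euros). As a reminder of what the number means, here is a plain-Python sketch with invented prices; the `rmse` helper is illustrative and not part of the notebook.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error of two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Both predictions are off by 500 euros, so the RMSE is 500.
print(rmse([1000, 2000], [1500, 1500]))

# Errors of 100 and 900 euros have the same mean absolute error (500),
# but the larger miss pushes the RMSE above 500.
print(rmse([0, 0], [100, 900]))
```

An RMSE of 2657 euros therefore means typical misses of a few thousand euros, with large misses weighted more heavily than small ones.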


### Random Forest

In [103]:
# training time 
rb_start_train = time.time()
best_rf.fit(X_train_ohe, y_train_ohe)
rb_train_time = time.time() - rb_start_train

# prediction time
rb_start_pred = time.time()
y_pred_rf = best_rf.predict(X_test_ohe)
rb_pred_time = time.time() - rb_start_pred

# RMSE
rmse_rf = np.sqrt(mean_squared_error(y_test_ohe, y_pred_rf))

# Print training time
print(f"Training Time: {rb_train_time:.2f} seconds")

# Print prediction time
print(f"Prediction Time: {rb_pred_time:.2f} seconds")

# Print RMSE
print(f"RMSE on Test Set: {rmse_rf:.0f}")

Training Time: 188.32 seconds
Prediction Time: 0.64 seconds
RMSE on Test Set: 1959
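The same start-the-clock, call, stop-the-clock pattern is repeated for every model. A small helper could wrap it; the sketch below is one possible shape, where `timed` is a hypothetical name and `time.perf_counter` is used instead of `time.time` because it is intended for measuring short intervals.

```python
import time

def timed(func, *args, **kwargs):
    """Call func(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start

# Stand-in workload; in the notebook this would be a model's fit or predict call.
total, elapsed = timed(sum, range(1_000_000))
print(f"result={total}, took {elapsed:.4f} seconds")
```

In the notebook this could be used as `y_pred_rf, rb_pred_time = timed(best_rf.predict, X_test_ohe)`, assuming the helper is defined in an earlier cell.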


### LightGBM

In [104]:
# training time
lgb_start_train = time.time()
best_lgb.fit(X_train_non_ohe, y_train_non_ohe, categorical_feature=categorical_cols)
lgb_train_time = time.time() - lgb_start_train

# prediction time
lgb_start_pred = time.time()
y_pred_lgb = best_lgb.predict(X_test_non_ohe)
lgb_pred_time = time.time() - lgb_start_pred

# RMSE 
rmse_lgb = np.sqrt(mean_squared_error(y_test_non_ohe, y_pred_lgb))

# Print training time
print(f"Training Time: {lgb_train_time:.2f} seconds")

# Print prediction time
print(f"Prediction Time: {lgb_pred_time:.2f} seconds")

# Print RMSE
print(f"RMSE on Test Set: {rmse_lgb:.0f}")

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.005828 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 627
[LightGBM] [Info] Number of data points in the train set: 178194, number of used features: 9
[LightGBM] [Info] Start training from score 4804.534833
Training Time: 0.65 seconds
Prediction Time: 0.16 seconds
RMSE on Test Set: 1675


### CatBoost

In [105]:
# training time 
cb_start_train = time.time()
best_cb.fit(X_train_non_ohe, y_train_non_ohe, cat_features=categorical_cols)
cb_train_time = time.time() - cb_start_train

# prediction time 
cb_start_pred = time.time()
y_pred_cb = best_cb.predict(X_test_non_ohe)
cb_pred_time = time.time() - cb_start_pred

# RMSE 
rmse_cb = np.sqrt(mean_squared_error(y_test_non_ohe, y_pred_cb))

# Print training time
print(f"Training Time: {cb_train_time:.2f} seconds")

# Print prediction time
print(f"Prediction Time: {cb_pred_time:.2f} seconds")

# Print RMSE
print(f"RMSE on Test Set: {rmse_cb:.0f}")

Training Time: 7.86 seconds
Prediction Time: 0.03 seconds
RMSE on Test Set: 1781


### XGBoost

In [106]:
# training time
xb_start_train = time.time()
# best_xgb already refers to the XGBoost model created above
best_xgb.fit(X_train_ohe, y_train_ohe)
xb_train_time = time.time() - xb_start_train

# prediction time
xb_start_pred = time.time()
y_pred_xgb = best_xgb.predict(X_test_ohe)
xb_pred_time = time.time() - xb_start_pred

# RMSE 
rmse_xgb = np.sqrt(mean_squared_error(y_test_ohe, y_pred_xgb))

# Print training time
print(f"Training Time: {xb_train_time:.2f} seconds")

# Print prediction time
print(f"Prediction Time: {xb_pred_time:.2f} seconds")

# Print RMSE 
print(f"RMSE on Test Set: {rmse_xgb:.0f}")

Training Time: 2.55 seconds
Prediction Time: 0.17 seconds
RMSE on Test Set: 1727


In [114]:
# Create a summary table

summary_table = pd.DataFrame({
    'Model': ['LinearRegression', 'RandomForest', 'LightGBM', 'CatBoost', 'XGBoost'],
    'RMSE (Euros)': [2657, 1959, 1675, 1781, 1727],
    'Training Time (s)': [4.58, 188.32, 0.65, 7.86, 2.55],
    'Prediction Time (s)': [0.2, 0.64, 0.16, 0.03, 0.17]
})


# Convert DataFrame to Markdown table
markdown_table = summary_table.to_markdown(index=False)

# Print the Markdown table
print(markdown_table)

| Model            |   RMSE (Euros) |   Training Time (s) |   Prediction Time (s) |
|:-----------------|---------------:|--------------------:|----------------------:|
| LinearRegression |           2657 |                4.58 |                  0.2  |
| RandomForest     |           1959 |              188.32 |                  0.64 |
| LightGBM         |           1675 |                0.65 |                  0.16 |
| CatBoost         |           1781 |                7.86 |                  0.03 |
| XGBoost          |           1727 |                2.55 |                  0.17 |
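To read the table against Rusty Bargain's three criteria, it helps to find the best model per column. The sketch below rebuilds the table from the figures above and uses `DataFrame.idxmin`, since lower is better for all three metrics; `summary` and `best_per_metric` are illustrative names.

```python
import pandas as pd

# Figures copied from the summary table above.
summary = pd.DataFrame({
    "Model": ["LinearRegression", "RandomForest", "LightGBM", "CatBoost", "XGBoost"],
    "RMSE (Euros)": [2657, 1959, 1675, 1781, 1727],
    "Training Time (s)": [4.58, 188.32, 0.65, 7.86, 2.55],
    "Prediction Time (s)": [0.2, 0.64, 0.16, 0.03, 0.17],
}).set_index("Model")

# Lower is better for every column, so idxmin gives the winner per metric.
best_per_metric = summary.idxmin()
print(best_per_metric)

# Models ranked by prediction quality.
print(summary.sort_values("RMSE (Euros)"))
```

This makes the trade-off explicit: one model wins on RMSE and training time, and a different one wins on prediction time.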


## Conclusion

After exploring and preparing the data, I trained and evaluated five models to predict car prices for Rusty Bargain’s app. All RMSE values are measured on the test set. Linear Regression served as the sanity check: it trained in 4.58 seconds but gave the worst RMSE, 2657 euros, confirming its baseline role. Random Forest improved the RMSE to 1959 euros but took 188.32 seconds to train. LightGBM turned out to be the best model, with an RMSE of 1675 euros and a training time of 0.65 seconds. CatBoost also did well, with an RMSE of 1781 euros and a training time of 7.86 seconds. XGBoost, using default parameters, reached an RMSE of 1727 euros with a training time of 2.55 seconds. The fastest prediction came from CatBoost (0.03 seconds) and the slowest from Random Forest (0.64 seconds).

Overall, LightGBM stands out as the best choice for Rusty Bargain’s app: it has the lowest RMSE and the shortest training time, and its prediction time of 0.16 seconds is second only to CatBoost's.