Rusty Bargain, a used car sales service, is developing an app to attract new customers. In that app, you can quickly find out the market value of your car. You have access to historical data: technical specifications, trim versions, and prices. You need to build a model that determines the value.

Rusty Bargain is interested in:

- the quality of the prediction;
- the speed of the prediction;
- the time required for training.

# Vehicle Price Prediction Using Machine Learning

## Step 1: Import Necessary Libraries and Load Data

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
import lightgbm as lgb
from xgboost import XGBRegressor
from catboost import CatBoostRegressor

# Load data
data = pd.read_csv('/datasets/car_data.csv')

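# Overview of data (info() prints its summary directly and returns None,
# so wrapping it in print() would add a stray "None" to the output)
data.info()
data.head()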


## Step 2: Exploratory Data Analysis and Data Preprocessing

### Data Inspection

In [None]:
# Check for missing values and duplicates
print(data.isnull().sum())
print(f"Number of duplicates: {data.duplicated().sum()}")

# Drop duplicates
data = data.drop_duplicates()

# Inspect target variable
plt.figure(figsize=(10, 5))
sns.histplot(data['Price'], bins=50, kde=True)
plt.title('Distribution of Price')
plt.xlabel('Price')
plt.show()

# Remove extreme outliers in the Price column
data = data[data['Price'].between(500, 100000)]
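
The 500 and 100000 bounds above are fixed by hand, so it can help to check how much data such a filter discards before committing to it. The sketch below uses a small made-up series in place of `data['Price']`; the values are hypothetical and only illustrate that `between` includes both endpoints by default.

```python
import pandas as pd

# Toy prices standing in for data['Price'] (hypothetical values)
prices = pd.Series([0, 1, 499, 500, 1200, 8500, 20000, 100000, 150000])

# between() is inclusive on both ends by default, so 500 and 100000 are kept
mask = prices.between(500, 100000)

removed_share = 1 - mask.mean()
print(f"Rows kept: {mask.sum()} of {len(prices)}")
print(f"Share removed: {removed_share:.1%}")
```

Running the same two lines on the real column would show whether the lower bound of 500 removes a large share of listings.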


### Preprocessing

In [None]:
# Remove unnecessary columns
data = data.drop(['DateCrawled', 'DateCreated', 'NumberOfPictures', 'PostalCode', 'LastSeen'], axis=1)

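# Fill missing values in categorical columns with an explicit 'unknown' label.
# Assigning the result back avoids the chained-assignment problem of calling
# fillna(..., inplace=True) on a single column, which newer pandas versions warn about.
for col in ['VehicleType', 'Gearbox', 'FuelType', 'NotRepaired']:
    data[col] = data[col].fillna('unknown')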

# One-hot encode categorical features
categorical_features = ['VehicleType', 'Gearbox', 'FuelType', 'Brand', 'NotRepaired', 'Model']
numerical_features = ['RegistrationYear', 'Mileage', 'Power', 'RegistrationMonth']

X = data.drop('Price', axis=1)
y = data['Price']

X = pd.get_dummies(X, columns=categorical_features, drop_first=True)
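
The notebook imports `OneHotEncoder`, `ColumnTransformer`, and `Pipeline` but encodes with `pd.get_dummies` on the full dataset. As an alternative, the encoding could be fitted inside a pipeline so that categories unseen during training do not break prediction. This is a minimal sketch on a made-up frame that borrows three column names from the dataset; the values and the choice of `LinearRegression` are assumptions for illustration only.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy frame with a few of the notebook's columns (values are made up)
toy = pd.DataFrame({
    'Brand': ['audi', 'bmw', 'audi', 'opel', 'bmw', 'opel'],
    'Gearbox': ['manual', 'auto', 'manual', 'manual', 'auto', 'unknown'],
    'Power': [100, 190, 120, 75, 150, 60],
    'Price': [5000, 12000, 6500, 2500, 9000, 1500],
})
cat_cols = ['Brand', 'Gearbox']
num_cols = ['Power']

preprocess = ColumnTransformer([
    # handle_unknown='ignore' encodes categories unseen during fit as all zeros
    ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols),
    ('num', 'passthrough', num_cols),
])
pipe = Pipeline([('prep', preprocess), ('model', LinearRegression())])
pipe.fit(toy[cat_cols + num_cols], toy['Price'])

# A brand the encoder never saw during fit does not raise an error
new_car = pd.DataFrame({'Brand': ['tesla'], 'Gearbox': ['auto'], 'Power': [200]})
pred = pipe.predict(new_car)
n_encoded_columns = pipe.named_steps['prep'].transform(new_car).shape[1]
print(pred, n_encoded_columns)
```

With this layout the encoder learns its categories from the training rows only, and the same fitted pipeline can be reused on new listings in the app.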


## Step 3: Train-Test Split

In [None]:
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training set size: {X_train.shape}")
print(f"Test set size: {X_test.shape}")


## Step 4: Model Training and Evaluation

### Linear Regression

In [None]:
# Linear Regression
lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)

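# Evaluate (RMSE as the square root of MSE; the squared=False argument
# is deprecated in recent scikit-learn releases)
rmse_lr = np.sqrt(mean_squared_error(y_test, y_pred_lr))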
print(f"Linear Regression RMSE: {rmse_lr}")


### Random Forest

In [None]:
# Random Forest
rf = RandomForestRegressor(random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)

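# Evaluate (RMSE as the square root of MSE; the squared=False argument
# is deprecated in recent scikit-learn releases)
rmse_rf = np.sqrt(mean_squared_error(y_test, y_pred_rf))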
print(f"Random Forest RMSE: {rmse_rf}")


### Decision Tree

In [None]:
# Decision Tree
dt = DecisionTreeRegressor(random_state=42)
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)

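# Evaluate (RMSE as the square root of MSE; the squared=False argument
# is deprecated in recent scikit-learn releases)
rmse_dt = np.sqrt(mean_squared_error(y_test, y_pred_dt))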
print(f"Decision Tree RMSE: {rmse_dt}")


### LightGBM

In [None]:
# LightGBM
lgb_model = lgb.LGBMRegressor(random_state=42)
lgb_model.fit(X_train, y_train)
y_pred_lgb = lgb_model.predict(X_test)

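# Evaluate (RMSE as the square root of MSE; the squared=False argument
# is deprecated in recent scikit-learn releases)
rmse_lgb = np.sqrt(mean_squared_error(y_test, y_pred_lgb))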
print(f"LightGBM RMSE: {rmse_lgb}")


### XGBoost

In [None]:
# XGBoost
xgb_model = XGBRegressor(random_state=42)
xgb_model.fit(X_train, y_train)
y_pred_xgb = xgb_model.predict(X_test)

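# Evaluate (RMSE as the square root of MSE; the squared=False argument
# is deprecated in recent scikit-learn releases)
rmse_xgb = np.sqrt(mean_squared_error(y_test, y_pred_xgb))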
print(f"XGBoost RMSE: {rmse_xgb}")


### CatBoost

In [None]:
# CatBoost
cat_model = CatBoostRegressor(random_state=42, verbose=0)
cat_model.fit(X_train, y_train)
y_pred_cat = cat_model.predict(X_test)

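# Evaluate (RMSE as the square root of MSE; the squared=False argument
# is deprecated in recent scikit-learn releases)
rmse_cat = np.sqrt(mean_squared_error(y_test, y_pred_cat))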
print(f"CatBoost RMSE: {rmse_cat}")
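
Rusty Bargain also cares about prediction speed and training time, which the cells above do not record. One way to capture all three criteria is a small helper that times `fit` and `predict` separately. The sketch below is an assumption about how this could be done: `time_model` is a hypothetical helper, and the synthetic arrays stand in for the real train and test sets.

```python
import time

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor


def time_model(model, X_fit, y_fit, X_eval, y_eval):
    """Fit and score one model; return RMSE and wall-clock fit/predict times in seconds."""
    start = time.perf_counter()
    model.fit(X_fit, y_fit)
    fit_seconds = time.perf_counter() - start

    start = time.perf_counter()
    y_hat = model.predict(X_eval)
    predict_seconds = time.perf_counter() - start

    rmse = float(np.sqrt(mean_squared_error(y_eval, y_hat)))
    return {'rmse': rmse, 'fit_s': fit_seconds, 'predict_s': predict_seconds}


# Synthetic stand-in data (hypothetical; replace with the notebook's split)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(500, 5))
y_demo = 3 * X_demo[:, 0] + np.sin(X_demo[:, 1]) + rng.normal(scale=0.1, size=500)
X_tr, X_te, y_tr, y_te = X_demo[:400], X_demo[400:], y_demo[:400], y_demo[400:]

candidates = {
    'linear': LinearRegression(),
    'tree': DecisionTreeRegressor(random_state=42),
}
results = {name: time_model(m, X_tr, y_tr, X_te, y_te) for name, m in candidates.items()}

for name, res in results.items():
    print(f"{name}: RMSE={res['rmse']:.3f}, fit={res['fit_s']:.4f}s, predict={res['predict_s']:.4f}s")
```

The same helper could be called with each of the six models and the real split, giving measured timings to set against the RMSE values.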


## Model analysis

1. Linear Regression

Performance: RMSE = 3138.86
Linear Regression produced the highest RMSE of the six models, which suggests it cannot capture the complex relationships in the data. This is expected: linear regression assumes a linear relationship between features and price, and that assumption may not hold for this dataset.

Sanity Check: Despite its high error, Linear Regression is useful as a baseline against which the more sophisticated models can be compared.


2. Decision Tree

Performance: RMSE = 2079.99
Decision Tree improved on Linear Regression (2079.99 vs. 3138.86) by capturing some nonlinear relationships. However, a tree grown with default settings has no depth limit and tends to overfit, which limits its generalization to unseen data.

Speed: A single tree is typically fast to train, although training time was not measured in this notebook. Its accuracy remains limited compared to ensemble methods.
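
The overfitting tendency of a single tree can be checked by comparing training and test RMSE for an unrestricted tree and a depth-limited one. The sketch below uses synthetic data and an arbitrary `max_depth=4`; both are assumptions for illustration, not values tuned for the car dataset.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy sine curve (hypothetical stand-in for the real features)
rng = np.random.default_rng(0)
X_syn = rng.uniform(-3, 3, size=(600, 1))
y_syn = np.sin(X_syn[:, 0]) + rng.normal(scale=0.3, size=600)
X_a, X_b, y_a, y_b = train_test_split(X_syn, y_syn, test_size=0.25, random_state=0)


def rmse(y_true, y_hat):
    """Root mean squared error as a plain float."""
    return float(np.sqrt(mean_squared_error(y_true, y_hat)))


scores = {}
for depth in (None, 4):  # None means the tree grows until its leaves are pure
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_a, y_a)
    scores[depth] = (rmse(y_a, tree.predict(X_a)), rmse(y_b, tree.predict(X_b)))
    print(f"max_depth={depth}: train RMSE={scores[depth][0]:.3f}, test RMSE={scores[depth][1]:.3f}")
```

A large gap between training and test RMSE for the unrestricted tree is the usual sign of overfitting; the depth-limited tree should show a much smaller gap.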


3. Random Forest

Performance: RMSE = 1630.14
Random Forest reduced RMSE substantially compared to the single Decision Tree (1630.14 vs. 2079.99) and has the lowest RMSE of all six models. As an ensemble method, it combines multiple trees to improve accuracy and reduce overfitting.

Strengths:
- Works well on the one-hot encoded features once missing values have been filled in.
- Shows robust performance without much fine-tuning.

Weaknesses: Longer training time compared to a single Decision Tree.



4. LightGBM

Performance: RMSE = 1750.12
LightGBM is a gradient boosting model known for its efficiency. Its RMSE is higher than Random Forest's (1750.12 vs. 1630.14) and the highest of the three boosting models. In exchange it typically trains faster and uses less memory, although neither was measured in this notebook.

Strengths:
- Built-in handling of categorical features, which can reduce preprocessing effort (not used here, since all models were trained on one-hot encoded data).
- Efficient for large datasets due to its leaf-wise tree growth.

Weaknesses: Sensitive to hyperparameters; with default settings it underperformed Random Forest.


5. XGBoost

Performance: RMSE = 1696.93
XGBoost was more accurate than LightGBM (1696.93 vs. 1750.12) and slightly less accurate than CatBoost (1650.31). It is highly customizable and supports regularization, which helps control overfitting.

Strengths:
- Versatile and reliable, with strong support for missing data and hyperparameter tuning.
- Often a go-to model for tabular data.

Weaknesses:
- Traditionally relies on one-hot encoding for categorical features, which increases data size and computational cost (recent versions offer an `enable_categorical` option).
- Training time is typically higher than LightGBM's, although it was not measured here.


6. CatBoost

Performance: RMSE = 1650.31
CatBoost achieved the lowest RMSE among the gradient boosting models and came close to Random Forest (1650.31 vs. 1630.14). Its categorical feature encoding and robustness to overfitting make it a strong choice for structured data like this.

Strengths:
- Can handle categorical data natively through its `cat_features` parameter, which simplifies pipeline development (this notebook passed one-hot encoded data instead).
- Outperformed the other boosting models with default hyperparameters.
- Training speed relative to XGBoost still needs to be confirmed, since training time was not measured here.

Weaknesses:
- Slightly more resource-intensive than LightGBM.



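## Conclusion

Best Model: CatBoost is the recommended model. Its RMSE of 1650.31 is the lowest among the gradient boosting models and about 1.2% above Random Forest's 1630.14, which is the lowest overall. Its ability to handle categorical features and generalize well makes it a strong choice for predicting vehicle prices.

Trade-offs:

- Random Forest gives the best accuracy in this comparison; LightGBM gives up some accuracy for what is usually faster training.
- XGBoost offers strong performance but comes with more preprocessing and computational cost.
- Training and prediction times were not measured, so the speed comparisons above should be confirmed with timings before a final choice is made.

Insights:

- Ensemble methods (Random Forest and the gradient boosting models, RMSE between 1630.14 and 1750.12) significantly outperform the simpler models (Linear Regression at 3138.86, Decision Tree at 2079.99) by capturing complex patterns.
- Hyperparameter tuning, especially for gradient boosting models, can further improve results; all models here were trained with default settings.
- Categorical feature encoding is critical; models like CatBoost and LightGBM can simplify this process compared to a one-hot encoding workflow.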
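
The hyperparameter tuning mentioned above could be done with `GridSearchCV`, which the notebook imports but does not use. The following is a minimal sketch under stated assumptions: it uses scikit-learn's `GradientBoostingRegressor` as a stand-in for the boosting libraries, synthetic data in place of the car dataset, and an illustrative grid that is not tuned for this problem.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (hypothetical stand-in for X_train, y_train)
rng = np.random.default_rng(42)
X_gs = rng.normal(size=(300, 4))
y_gs = 2 * X_gs[:, 0] - X_gs[:, 1] ** 2 + rng.normal(scale=0.2, size=300)

# Illustrative grid only; real ranges depend on the model and the data
param_grid = {
    'learning_rate': [0.05, 0.1],
    'n_estimators': [50, 100],
}

search = GridSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_grid,
    scoring='neg_root_mean_squared_error',  # scikit-learn maximizes scores, hence the sign
    cv=3,
)
search.fit(X_gs, y_gs)

best_rmse = -search.best_score_
print(search.best_params_)
print(f"Cross-validated RMSE: {best_rmse:.3f}")
```

Because the search is cross-validated on the training data, the test set stays untouched until the final evaluation of the selected configuration.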