Rusty Bargain, a used car sales service, is developing an app to attract new customers. In the app, you can quickly find out the market value of your car. You have access to historical data: technical specifications, trim versions, and prices. You need to build a model that determines the value.

Rusty Bargain is interested in:

- the quality of the prediction;
- the speed of the prediction;
- the time required for training.

## Data preparation

In [1]:
import pandas as pd

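# Load the data
df = pd.read_csv('/datasets/car_data.csv')
# Inspect column types and non-null counts
df.info()
# Count missing values per column (printed explicitly, because a notebook
# only displays the last expression of a cell)
print(df.isna().sum())

# Drop features that carry no information about the car itself
df = df.drop(columns=['DateCrawled', 'DateCreated', 'NumberOfPictures', 'PostalCode', 'LastSeen'])
# Drop duplicate rows
df = df.drop_duplicates()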

# Filter out unrealistic values
df = df[(df['Price'] > 100) & (df['Price'] < 200000)]
df = df[(df['RegistrationYear'] >= 1970) & (df['RegistrationYear'] <= 2025)]
df = df[(df['Power'] > 10) & (df['Power'] < 500)]

# Fill missing values
df['VehicleType'] = df['VehicleType'].fillna('unknown')
df['Gearbox'] = df['Gearbox'].fillna('unknown')
df['FuelType'] = df['FuelType'].fillna('unknown')
df['NotRepaired'] = df['NotRepaired'].fillna('unknown')
df['Model'] = df['Model'].fillna('unknown')

# Encode categorical variables
df = pd.get_dummies(df, drop_first=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   DateCrawled        354369 non-null  object
 1   Price              354369 non-null  int64 
 2   VehicleType        316879 non-null  object
 3   RegistrationYear   354369 non-null  int64 
 4   Gearbox            334536 non-null  object
 5   Power              354369 non-null  int64 
 6   Model              334664 non-null  object
 7   Mileage            354369 non-null  int64 
 8   RegistrationMonth  354369 non-null  int64 
 9   FuelType           321474 non-null  object
 10  Brand              354369 non-null  object
 11  NotRepaired        283215 non-null  object
 12  DateCreated        354369 non-null  object
 13  NumberOfPictures   354369 non-null  int64 
 14  PostalCode         354369 non-null  int64 
 15  LastSeen           354369 non-null  object
dtypes: int64(7), object(9)
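
To illustrate the encoding step, here is a minimal sketch on a toy frame (the three rows are made up for illustration and are not taken from the dataset). With `drop_first=True`, `pd.get_dummies` drops the first category, in sorted order, of each categorical column, so a column with k categories yields k-1 indicator columns.

```python
import pandas as pd

# Toy data (hypothetical values, for illustration only)
toy = pd.DataFrame({
    'Price': [1500, 8000, 3200],
    'Gearbox': ['manual', 'auto', 'unknown'],
})

# Numeric columns pass through unchanged; 'Gearbox' becomes indicator columns.
# 'auto' sorts first, so it is the dropped reference category.
encoded = pd.get_dummies(toy, drop_first=True)
print(list(encoded.columns))  # ['Price', 'Gearbox_manual', 'Gearbox_unknown']
```

A row with all indicator columns equal to zero therefore represents the dropped category (`auto` in this toy case).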

## Model training

In [2]:
from sklearn.model_selection import train_test_split
# Split features/target, then split rows into train/valid/test sets (60/20/20)
features = df.drop('Price', axis=1)
target = df['Price']
features_left, features_test, target_left, target_test = train_test_split(features, target, test_size=0.2, random_state=12345)
features_train, features_valid, target_train, target_valid = train_test_split(features_left, target_left, test_size=0.25, random_state=12345)
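
The two-stage split yields a 60/20/20 partition: 20% is held out for testing, and 25% of the remaining 80% (that is, 20% of the total) goes to validation. A minimal sketch, assuming a stand-in list of 1000 row indices instead of the real data:

```python
from sklearn.model_selection import train_test_split

rows = list(range(1000))  # stand-in for the real rows (assumption)

# Stage 1: hold out 20% of all rows for the final test
left, test = train_test_split(rows, test_size=0.2, random_state=12345)
# Stage 2: 25% of the remaining 80% becomes the validation set
train, valid = train_test_split(left, test_size=0.25, random_state=12345)

print(len(train), len(valid), len(test))  # 600 200 200
```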

## Model analysis

In [3]:
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
import lightgbm as lgb
from sklearn.metrics import mean_squared_error
import numpy as np
import time

# 1. Linear Regression
lr = LinearRegression()

# 2. Decision Tree
dt = DecisionTreeRegressor(max_depth=10, random_state=12345)

# 3. Random Forest
rf = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=12345)

# 4. LightGBM
lgbm = lgb.LGBMRegressor(num_leaves=31, learning_rate=0.1, n_estimators=100)


models = {
    'Linear Regression': lr,
    'Decision Tree': dt,
    'Random Forest': rf,
    'LightGBM': lgbm,
    
}

best_rmse = float('inf')
best_model = None
best_model_name = ""

for name, model in models.items():
    
    # Measure training and prediction time for each model
    # Training
    start_train = time.time()
    model.fit(features_train, target_train)
    train_time = time.time() - start_train

    # Prediction
    start_predict = time.time()
    preds = model.predict(features_valid)
    predict_time = time.time() - start_predict

    # RMSE via np.sqrt: the squared=False argument was deprecated in
    # scikit-learn 1.4 and removed in later releases
    rmse = np.sqrt(mean_squared_error(target_valid, preds))
    print(f"{name} RMSE: {rmse:.2f}")
    print(f" Train Time: {train_time:.2f}s AND Predict Time: {predict_time:.2f}s")


    if rmse < best_rmse:
        best_rmse = rmse
        best_model = model
        best_model_name = name

print(f"\n Best model is: {best_model_name} with RMSE = {best_rmse:.2f}")

Linear Regression RMSE: 2501.76
 Train Time: 9.92s AND Predict Time: 0.12s
Decision Tree RMSE: 1969.77
 Train Time: 2.62s AND Predict Time: 0.08s
Random Forest RMSE: 1870.06
 Train Time: 151.97s AND Predict Time: 0.47s
LightGBM RMSE: 1660.86
 Train Time: 14.85s AND Predict Time: 0.29s

 Best model is: LightGBM with RMSE = 1660.86
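
The timings above come from single `time.time()` measurements, so they will vary from run to run. As a hedged alternative, the sketch below wraps any call in a small helper based on `time.perf_counter()`, a monotonic clock intended for measuring short durations. The name `timed` is hypothetical and not part of the notebook.

```python
import time

def timed(func, *args, **kwargs):
    """Call func(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start

# Usage on a stand-in workload; in the notebook this could wrap
# model.fit(...) or model.predict(...) instead.
total, elapsed = timed(sum, range(1_000_000))
print(f"result={total}, elapsed={elapsed:.4f}s")
```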


In [4]:
# Tune the best model with randomized search, then evaluate it on the test set

from sklearn.model_selection import RandomizedSearchCV

# Define parameter grid
param_grid = {
    'num_leaves': [15, 31, 63],
    'max_depth': [-1, 5, 10, 20],
    'learning_rate': [0.01, 0.05, 0.1, 0.2],
    'n_estimators': [50, 100, 200],
}

search = RandomizedSearchCV(
    best_model,
    param_distributions=param_grid,
    n_iter=10,
    scoring='neg_root_mean_squared_error',
    cv=3,
    random_state=12345,
    verbose=1,
    n_jobs=-1
)

search.fit(features_train, target_train)

print(f" Best params: {search.best_params_}")


print(f" Best CV RMSE: {-search.best_score_:.2f}")

best_model2 = search.best_estimator_

start = time.time()
preds_test = best_model2.predict(features_test)
test_predict_time = time.time() - start

# np.sqrt instead of squared=False, which newer scikit-learn releases no longer accept
rmse_test = np.sqrt(mean_squared_error(target_test, preds_test))

print(f" Final Test RMSE: {rmse_test:.2f}")
print(f" Predict Time: {test_predict_time:.2f}s")


Fitting 3 folds for each of 10 candidates, totalling 30 fits
 Best params: {'num_leaves': 31, 'n_estimators': 50, 'max_depth': 10, 'learning_rate': 0.2}
 Best CV RMSE: 1690.43
 Final Test RMSE: 1687.23
 Predict Time: 0.20s
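
The "30 fits" line in the output follows from the search settings: `n_iter=10` sampled candidates, each evaluated with `cv=3` folds. The arithmetic below also shows how small a share of the full grid the random search explores. This is a sketch of the counting only and does not run the search itself.

```python
import math

param_grid = {
    'num_leaves': [15, 31, 63],
    'max_depth': [-1, 5, 10, 20],
    'learning_rate': [0.01, 0.05, 0.1, 0.2],
    'n_estimators': [50, 100, 200],
}

# Size of the full grid: product of the number of options per parameter
n_combinations = math.prod(len(values) for values in param_grid.values())
n_iter, cv = 10, 3

print(n_combinations)                     # 144
print(n_iter * cv)                        # 30 model fits during the search
print(f"{n_iter / n_combinations:.1%}")   # 6.9% of the grid is sampled
```

Because only about 7% of the combinations are tried, the reported best parameters are the best among the sampled candidates, not necessarily the best in the whole grid.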


# Checklist

- [x]  Jupyter Notebook is open
- [x]  Code is error free
- [x]  The cells with the code have been arranged in order of execution
- [x]  The data has been downloaded and prepared
- [x]  The models have been trained
- [x]  The analysis of speed and quality of the models has been performed

### Conclusion

| Model             | Validation RMSE | Training Time (s) | Prediction Time (s) |
| ----------------- | --------------- | ----------------- | ------------------- |
| **LightGBM**      | **1660.86**     | 14.85             | 0.29                |
| Random Forest     | 1870.06         | 151.97            | 0.47                |
| Decision Tree     | 1969.77         | 2.62              | 0.08                |
| Linear Regression | 2501.76         | 9.92              | 0.12                |


In this project, we developed and evaluated several regression models to predict the market prices of used cars based on their technical specifications and historical data. The models tested included Linear Regression, Decision Tree, Random Forest, and LightGBM.

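To ensure reliable performance evaluation, the data was first cleaned: irrelevant columns and duplicate rows were dropped, unrealistic prices, registration years, and power values were filtered out, missing categorical values were filled with `'unknown'`, and categorical features were one-hot encoded. The data was then split into training, validation, and test sets (60/20/20). Model performance was assessed using the Root Mean Squared Error (RMSE), which penalizes large errors more heavily than small ones and so aligns with the business goal of avoiding large mispricings.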
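
To make the point about large errors concrete, the sketch below compares RMSE with mean absolute error (MAE) on two made-up sets of predictions that have the same MAE. The prices are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

actual = [5000, 5000, 5000, 5000]
even_errors = [4000, 6000, 4000, 6000]    # every prediction off by 1000
one_big_error = [5000, 5000, 5000, 9000]  # a single prediction off by 4000

print(mae(actual, even_errors), rmse(actual, even_errors))      # 1000.0 1000.0
print(mae(actual, one_big_error), rmse(actual, one_big_error))  # 1000.0 2000.0
```

Both prediction sets have the same average absolute error, but the single large miss doubles the RMSE.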

We compared the four models on the validation set and selected LightGBM, which had the lowest RMSE (1660.86) while training far faster than Random Forest, the next most accurate model (RMSE 1870.06). All four models produced validation predictions in under half a second. We then tuned LightGBM with a randomized search over 10 parameter combinations using 3-fold cross-validation; the best configuration reached a cross-validated RMSE of 1690.43. The summary table above gives Rusty Bargain’s team a clear basis for comparing RMSE, training time, and prediction time. The test set was used only once, for the final evaluation of the tuned model, so the test RMSE of 1687.23 reflects performance on unseen data.

The tuned LightGBM model balances prediction accuracy with reasonable runtime (0.20 s to predict on the test set), making it suitable for integration into Rusty Bargain’s app for estimating a car's market value.