# Car Price Prediction Project (Part V - Final Model Selection & Preparation for Deployment)

**Building on Previous Modeling Efforts (Part IV):**

In the "Model Implementation & Evaluation" phase, we explored various regression algorithms to better understand how car features affect the price.

**Notebook Objective (Part V):**

This notebook marks the **final stage of our modeling process**, focusing on refining our best-performing model (XGBoost) for practical deployment in a Streamlit application. The primary goals are to:
1.  Implement **strategic feature selection and simplification** to create a more compact, robust, and user-friendly model without significant loss of predictive power.
2.  Train the final XGBoost regressor using this optimized set of features.
3.  Ensure the model is well-tuned, particularly addressing any potential for overfitting.
4.  **Save the finalized model**, making it ready for integration into a Streamlit web application for real-world car price prediction.

**Key Activities in this Final Phase:**

*   **Strategic Feature Selection & Simplification:**
    *   Based on previous analyses, we reduced the overall feature set from 25 down to a more focused selection of 14 key predictors (excluding the target `price` column).
    *   The `make_model` feature was restricted to the 20 most frequent car models. This simplification keeps the input interface of the final Streamlit application manageable and user-friendly.
*   **Final XGBoost Model Training:** The XGBoost regressor was trained using this refined and more compact dataset.
*   **Hyperparameter Tuning & Overfitting Management:** While XGBoost yielded superior results, final tuning was conducted to ensure the model generalizes well and to mitigate any overfitting tendencies.
*   **Model Persistence:** The trained and optimized XGBoost model has been saved, ready to be loaded and used in our Streamlit application.

The outcome of this phase is a lean, powerful, and deployment-ready XGBoost model, optimized for predicting car prices effectively within a user-friendly application.

In [8]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

import warnings 
warnings.filterwarnings('ignore')

sns.set_style("darkgrid")
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 1000)

In [93]:
df = pd.read_csv('car_project_final.csv')

In [95]:
df.head()

Unnamed: 0,price,make_model,body_type,type,doors,mileage,gearbox,fuel_type,paint,seller,seats,gears,co2_emissions,drivetrain,cylinders,empty_weight,upholstery,previous_owner,horsepower,engine_size,fuel_consumption,comfort_package,media_package,safety_package,extras_package,age
0,24400.0,Mercedes-Benz A 180,Compact,Used,5.0,27150.0,Manual,Diesel,Metallic,Dealer,5.0,6.0,120.0,Front,4.0,1330.0,Leather,1.0,116.0,1.5,4.5,Premium,Premium,Deluxe,Premium,2.0
1,29800.0,Mercedes-Benz A 180,Compact,Used,5.0,21734.0,Automatic,Diesel,Metallic,Dealer,5.0,7.0,120.0,Front,4.0,1445.0,Leather,1.0,116.0,1.5,3.9,Standard,Standard,Standard,Premium,2.0
2,21000.0,Mercedes-Benz A 180,Compact,Used,5.0,172700.0,Automatic,Diesel,Uni/basic,Dealer,5.0,7.0,102.5,Front,4.0,1425.0,Leather,2.0,109.0,1.5,3.7,Standard,Standard,Standard,Standard,4.0
3,26800.0,Mercedes-Benz A 180,Compact,Used,5.0,18989.0,Automatic,Diesel,Metallic,Dealer,5.0,7.0,120.0,Front,4.0,1455.0,Leather,1.0,116.0,1.5,3.9,Standard,Standard,Standard,Premium,2.0
4,32900.0,Mercedes-Benz A 180,Compact,Pre-registered,5.0,25.0,Manual,Benzine,Uni/basic,Dealer,5.0,6.0,124.0,Front,4.0,1350.0,Leather,0.0,136.0,1.3,5.3,Standard,Standard,Standard,Standard,1.0


In [97]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14647 entries, 0 to 14646
Data columns (total 26 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   price             14647 non-null  float64
 1   make_model        14647 non-null  object 
 2   body_type         14647 non-null  object 
 3   type              14647 non-null  object 
 4   doors             14647 non-null  float64
 5   mileage           14647 non-null  float64
 6   gearbox           14647 non-null  object 
 7   fuel_type         14647 non-null  object 
 8   paint             14647 non-null  object 
 9   seller            14647 non-null  object 
 10  seats             14647 non-null  float64
 11  gears             14647 non-null  float64
 12  co2_emissions     14647 non-null  float64
 13  drivetrain        14647 non-null  object 
 14  cylinders         14647 non-null  float64
 15  empty_weight      14647 non-null  float64
 16  upholstery        14647 non-null  object

In [99]:
df.nunique()

price               3213
make_model            69
body_type              6
type                   4
doors                  4
mileage             9331
gearbox                3
fuel_type              6
paint                  2
seller                 2
seats                  5
gears                  8
co2_emissions        391
drivetrain             3
cylinders              7
empty_weight        1084
upholstery             4
previous_owner        10
horsepower           234
engine_size           37
fuel_consumption     169
comfort_package        3
media_package          3
safety_package         3
extras_package         3
age                   31
dtype: int64

In [101]:
df0 = df.copy()

### Preprocessing

In [103]:
df.columns

Index(['price', 'make_model', 'body_type', 'type', 'doors', 'mileage',
       'gearbox', 'fuel_type', 'paint', 'seller', 'seats', 'gears',
       'co2_emissions', 'drivetrain', 'cylinders', 'empty_weight',
       'upholstery', 'previous_owner', 'horsepower', 'engine_size',
       'fuel_consumption', 'comfort_package', 'media_package',
       'safety_package', 'extras_package', 'age'],
      dtype='object')

In [105]:
cols = ['make_model', 'body_type', 'type', 'mileage', 'gearbox', 'fuel_type', 
        'paint', 'drivetrain', 'empty_weight', 'upholstery', 'horsepower', 'engine_size',
       'fuel_consumption', 'age', 'price']

In [107]:
df = df[cols]

In [109]:
df.head(2)

Unnamed: 0,make_model,body_type,type,mileage,gearbox,fuel_type,paint,drivetrain,empty_weight,upholstery,horsepower,engine_size,fuel_consumption,age,price
0,Mercedes-Benz A 180,Compact,Used,27150.0,Manual,Diesel,Metallic,Front,1330.0,Leather,116.0,1.5,4.5,2.0,24400.0
1,Mercedes-Benz A 180,Compact,Used,21734.0,Automatic,Diesel,Metallic,Front,1445.0,Leather,116.0,1.5,3.9,2.0,29800.0


In [111]:
df.shape

(14647, 15)

In [151]:
# To keep the application's input form simple, we keep only the 20 most frequent car models.
# Rows for all other models are dropped, so the dataset becomes smaller.

my_cars = df.make_model.value_counts()[:20].index

df = df[df.make_model.isin(my_cars)]
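
The filter above can be illustrated on a tiny example. This is only a sketch: the `toy` frame, its values, and `top_n` are made up for illustration, and it also reports the share of rows that survive the filter, which is worth checking on the real data.

```python
import pandas as pd

# Hypothetical toy data: three models with 3, 2 and 1 listings.
toy = pd.DataFrame({
    "make_model": ["A", "A", "A", "B", "B", "C"],
    "price": [10000, 11000, 12000, 20000, 21000, 30000],
})

top_n = 2  # the notebook uses 20

# value_counts() sorts by frequency (descending), so head(top_n) gives the most common models.
keep = toy["make_model"].value_counts().head(top_n).index
filtered = toy[toy["make_model"].isin(keep)]

kept_share = len(filtered) / len(toy)
print(f"kept {len(filtered)} of {len(toy)} rows ({kept_share:.0%})")
```

Running the same two checks on `df` before and after the filter shows how much data the simplification costs.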

In [153]:
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OrdinalEncoder
from sklearn.model_selection import train_test_split

In [155]:
X = df.drop(["price"], axis = 1)
y = df['price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

In [157]:
cat = X_train.select_dtypes("object").columns

ord_enc = OrdinalEncoder()
column_trans = make_column_transformer((ord_enc, cat), remainder='passthrough', 
                                       verbose_feature_names_out=False).set_output(transform="pandas")

X_train = column_trans.fit_transform(X_train)
X_test = column_trans.transform(X_test)
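
One point to keep in mind for deployment: with its default settings, `OrdinalEncoder` raises an error when it meets a category it did not see during `fit`. The sketch below shows one possible way to handle this, using the `handle_unknown` and `unknown_value` options; it is not what the notebook does, and the toy frames are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical training and inference data for a single categorical column.
train = pd.DataFrame({"gearbox": ["Manual", "Automatic", "Manual"]})
new = pd.DataFrame({"gearbox": ["Automatic", "Semi-automatic"]})  # second value is unseen

# Map unseen categories to -1 instead of raising an error.
enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
enc.fit(train)

encoded = enc.transform(new)
print(encoded)
```

In a Streamlit form that only offers the categories present in the training data, this situation should not arise, so the default behaviour may be acceptable.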

In [159]:
from sklearn.metrics import mean_squared_error, root_mean_squared_error, mean_absolute_error, r2_score, mean_absolute_percentage_error

def train_val(model, X_train, y_train, X_test, y_test):
    """Return a DataFrame of regression metrics for the train and test sets."""

    def metrics(y_true, y_hat):
        # Same five metrics for both splits
        return {"R2": r2_score(y_true, y_hat),
                "mae": mean_absolute_error(y_true, y_hat),
                "mse": mean_squared_error(y_true, y_hat),
                "rmse": root_mean_squared_error(y_true, y_hat),
                "mape": mean_absolute_percentage_error(y_true, y_hat)}

    scores = {"train": metrics(y_train, model.predict(X_train)),
              "test": metrics(y_test, model.predict(X_test))}

    return pd.DataFrame(scores)
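
To make the five metrics concrete, here is a small worked example that computes them by hand with NumPy. The two target values and predictions are made up for illustration. Note that `mape` is a fraction, so a value of 0.14 in the tables below means 14%.

```python
import numpy as np

# Hypothetical true prices and predictions.
y_true = np.array([100.0, 200.0])
y_hat = np.array([110.0, 180.0])

err = y_true - y_hat                      # [-10, 20]

mae = np.mean(np.abs(err))                # mean absolute error
mse = np.mean(err ** 2)                   # mean squared error
rmse = np.sqrt(mse)                       # root mean squared error
mape = np.mean(np.abs(err / y_true))      # mean absolute percentage error (as a fraction)

# R2 = 1 - (residual sum of squares) / (total sum of squares)
r2 = 1 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(mae, mse, round(rmse, 2), mape, r2)
```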

In [161]:
from sklearn.preprocessing import StandardScaler

In [163]:
scaler = StandardScaler().set_output(transform='pandas')
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

### Linear Regression

In [165]:
from sklearn.linear_model import LinearRegression

In [167]:
vanilla_model = LinearRegression()
vanilla_model.fit(X_train, y_train)

In [169]:
train_val(vanilla_model, X_train, y_train, X_test, y_test).map('{:.2f}'.format)

Unnamed: 0,train,test
R2,0.85,0.86
mae,2645.93,2608.56
mse,14947472.7,15296126.78
rmse,3866.2,3911.03
mape,0.14,0.14


Although **Linear Regression** delivers promising results, we will continue with **XGBoost** for the application, since it was our best-performing model in Part IV.

In [171]:
from xgboost import XGBRegressor

In [173]:
xgb_vanilla = XGBRegressor()
xgb_vanilla.fit(X_train, y_train)

In [175]:
train_val(xgb_vanilla, X_train, y_train, X_test, y_test).map('{:.2f}'.format)

Unnamed: 0,train,test
R2,0.98,0.92
mae,920.61,1793.19
mse,1543949.23,8127874.33
rmse,1242.56,2850.94
mape,0.05,0.09


In [177]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'max_depth' : [3, 4, 5, 6],
    'learning_rate': [0.01, 0.03, 0.05],
    'colsample_bytree': [0.6, 0.8, 1],
}

grid = GridSearchCV(estimator = XGBRegressor(random_state=42),
                    param_grid = param_grid,
                    scoring = 'neg_mean_squared_error',
                    cv = 5,
                    n_jobs = -1)

grid.fit(X_train, y_train)

In [179]:
grid.best_params_

{'colsample_bytree': 0.6, 'learning_rate': 0.05, 'max_depth': 6}
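
Because the search is scored with `neg_mean_squared_error`, `best_score_` is a negative MSE rather than an error in price units. The sketch below shows how it can be converted to an RMSE. To keep it small and self-contained it uses synthetic data and a `Ridge` model as a stand-in for XGBoost; `X_demo`, `y_demo` and `demo_grid` are illustrative names, not part of the project.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (stand-in for the car dataset).
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

demo_grid = GridSearchCV(
    estimator=Ridge(),                      # stand-in for XGBRegressor
    param_grid={"alpha": [0.1, 1.0, 10.0]},
    scoring="neg_mean_squared_error",
    cv=5,
)
demo_grid.fit(X_demo, y_demo)

# best_score_ is the mean negative MSE over the folds; flip the sign and take the root.
cv_rmse = np.sqrt(-demo_grid.best_score_)
print(demo_grid.best_params_, round(cv_rmse, 2))
```

The same two attributes are available on `grid`. The value obtained this way is the root of the mean fold MSE, which is not exactly the mean of the per-fold RMSEs.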

In [181]:
train_val(grid, X_train, y_train, X_test, y_test).map('{:.2f}'.format)

Unnamed: 0,train,test
R2,0.95,0.93
mae,1547.74,1849.4
mse,4569411.0,7614816.56
rmse,2137.62,2759.5
mape,0.08,0.1


Since the **grid model** generalizes better (its train and test scores are much closer than those of the vanilla XGBoost model, and its test RMSE is slightly lower), we will continue with the **grid model**.
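
One simple way to quantify this gap is the ratio of test RMSE to train RMSE, computed here from the values reported in the two tables above. The ratio is just an illustrative yardstick, not a metric used elsewhere in this project.

```python
# RMSE values copied from the tables above.
rmse = {
    "vanilla": {"train": 1242.56, "test": 2850.94},
    "grid": {"train": 2137.62, "test": 2759.50},
}

# A ratio close to 1 means train and test errors are similar (less overfitting).
ratio_vanilla = rmse["vanilla"]["test"] / rmse["vanilla"]["train"]
ratio_grid = rmse["grid"]["test"] / rmse["grid"]["train"]

print(f"vanilla: {ratio_vanilla:.2f}, grid: {ratio_grid:.2f}")
```

The vanilla model's test error is more than twice its train error, while the tuned model's is about 1.3 times.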

In [183]:
import pickle

filename = 'xgb_final_model'
with open(filename, 'wb') as f:  # the context manager closes the file after writing
    pickle.dump(grid, f)

In [185]:
filename = 'scaler'
with open(filename, 'wb') as f:
    pickle.dump(scaler, f)

filename = 'encoder'
with open(filename, 'wb') as f:
    pickle.dump(column_trans, f)

In [187]:
df.to_csv('final_df.csv', index=False)