---
title: "Used Car Price Prediction – End-to-End Machine Learning Project"
date: "2025-10-1"
categories: [Regression, Portfolio, ML Deployments]
image: "images/01car.jpg"
image-alt: "image of used cars"
description: "Predicting the price of used cars based on various attributes and evaluating with RMSE"
format:
  html:
    code-copy: true
---


::: {.callout-important}
The project is 100% complete and fully functional, but the detailed case study and walkthrough on this Quarto site are still being finalized and should be available in the next 1–2 days.
:::

## Project Overview
This is a complete end-to-end machine learning project that predicts used car prices using the dataset from a finished Kaggle competition (Used Car Price Dataset, ~188k rows).
The goal of this repository is to showcase a production-ready ML workflow from raw data to a deployed web app:

- Exploratory Data Analysis (EDA)
- Preprocessing & feature engineering
- Model selection & hyperparameter tuning
- Final model training with proper train/validation split
- Simple deployment using Streamlit

### Dataset
- Source: [Kaggle competition](https://www.kaggle.com/competitions/playground-series-s4e9) (the competition has ended)
- Original name: "Used Car Price Prediction"
- Size: ~188,000 rows of used car listings with features such as brand, model, year, mileage, fuel type, and transmission


### Project Structure
GitHub repo: [link]()


```text
├───data
│   ├───processed
│   └───raw
│       ├───test.csv
│       ├───sample_submission.csv
│       ├───train.csv
│       ├───X_train.pkl
│       ├───X_valid.pkl
│       ├───y_train.pkl
│       └───y_valid.pkl
├───models
│   ├───imputer.pkl
│   ├───lgb_rand.pkl
│   ├───OHencoder.pkl
│   └───scaler.pkl
├───notebooks
│   ├───data_preparation.ipynb
│   ├───EDA.ipynb
│   ├───data_preprocessing.ipynb
│   └───modelling.ipynb
├───src
│   ├───modelling.py
│   ├───preprocessing.pkl
│   └───utils.pkl
└───requirements.txt
```


### Key Findings from EDA

- The variables `accident`, `clean_title`, and `fuel_type` have missing values that must be handled.
- The variables `model_year`, `mileage`, and `price` have outliers.
- The `model` variable contains 1897 distinct car models.
- Missing values in `fuel_type` are stored as `-`, `not supported`, and `nan`, and a `hydrogen fuel` category needs to be added.
- The `engine` variable can be split into several more descriptive columns, such as:
  - `horsepower`     : float64     # example: 369.0, 214.0, 482.0
  - `engine_size`    : float64     # example: 3.0, 2.5, 4.4
  - `cylinder`       : int64       # example: 4, 6, 8, 10, 12
  - `is_electric`    : int64       # 0 = no, 1 = yes
  - `is_turbo`       : int64       # 0 = no, 1 = yes
  - `fuel_system`    : category    # example: MPFI, GDI, PDI, DOHC, SOHC, etc.
- Fuel type can be extracted from `engine` when `fuel_type` is `nan`, `-`, blank, or `not supported`.
- The `transmission` variable can be simplified into a small set of categories, including `A/T`, `M/T`, and `Unknown`.
- The `ext_col` and `int_col` variables can be simplified into basic color categories such as blue, red, black, silver, white, gold, orange, purple, and beige, while other colors can be simplified into `other` or `unknown`.
- The `accident` variable has only two categories, with 2452 rows missing.
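
The `engine` extraction and the `fuel_type` fallback described above could be sketched with a few regular expressions. This is a minimal sketch, not the project's actual code: it assumes raw strings shaped like `172.0HP 1.6L 4 Cylinder Engine Gasoline Fuel`, and the helper names `parse_engine` and `fill_fuel_type`, the keyword list, and the output labels are all hypothetical. The `fuel_system` column is left out because its token list is not given here.

```python
import re

# Assumed shape of a raw `engine` value (not confirmed by the project code):
# "172.0HP 1.6L 4 Cylinder Engine Gasoline Fuel"
HP_RE = re.compile(r"(\d+(?:\.\d+)?)\s*HP", re.IGNORECASE)
SIZE_RE = re.compile(r"(\d+(?:\.\d+)?)\s*L(?:iter)?\b", re.IGNORECASE)
CYL_RE = re.compile(r"(\d+)\s*Cylinder|\bV(\d+)\b", re.IGNORECASE)

# Placeholder values treated as missing in `fuel_type` (both dash variants included).
MISSING_FUEL = {"", "-", "–", "nan", "not supported"}

# Checked in order; keywords and labels are illustrative assumptions.
FUEL_KEYWORDS = [
    ("hydrogen", "Hydrogen Fuel"),
    ("plug-in", "Plug-In Hybrid"),
    ("hybrid", "Hybrid"),
    ("diesel", "Diesel"),
    ("flex", "E85 Flex Fuel"),
    ("electric", "Electric"),
    ("gasoline", "Gasoline"),
]


def parse_engine(engine):
    """Split one raw engine description into numeric and flag features."""
    text = str(engine)
    hp = HP_RE.search(text)
    size = SIZE_RE.search(text)
    cyl = CYL_RE.search(text)
    return {
        "horsepower": float(hp.group(1)) if hp else None,
        "engine_size": float(size.group(1)) if size else None,
        "cylinder": int(cyl.group(1) or cyl.group(2)) if cyl else None,
        # Simple keyword flags; a hybrid that mentions "Electric" is flagged too.
        "is_electric": int(bool(re.search(r"electric", text, re.IGNORECASE))),
        "is_turbo": int(bool(re.search(r"turbo", text, re.IGNORECASE))),
    }


def fill_fuel_type(fuel_type, engine):
    """Keep a valid fuel_type; otherwise infer it from the engine text."""
    value = "" if fuel_type is None else str(fuel_type).strip().lower()
    if value not in MISSING_FUEL:
        return fuel_type
    text = str(engine).lower()
    for keyword, label in FUEL_KEYWORDS:
        if keyword in text:
            return label
    return "Unknown"


features = parse_engine("172.0HP 1.6L 4 Cylinder Engine Gasoline Fuel")
fuel = fill_fuel_type("not supported", "Electric Motor Electric Fuel System")
```

With pandas, the same helpers could be applied row by row to the `engine` and `fuel_type` columns; rows that match none of the patterns keep `None` and can be passed on to the imputer.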

### Preprocessing & Feature Engineering

- Proper train/validation split before any scaling/encoding (no data leakage)
- Log transformation of target (price → log1p(price))
- Feature engineering: car_age, odometer_per_year, brand-level price statistics
- Robust pipeline with `ColumnTransformer`:
  - Numerical: log transform + StandardScaler
  - Categorical: frequency encoding + target encoding (with smoothing)
  - High-cardinality: brand/model grouping & target encoding
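
To illustrate how target encoding with smoothing avoids leakage, the sketch below learns the encoding from training rows only and then applies it to validation rows. It is an illustration under assumptions, not the project's pipeline: the function names, the `smoothing` parameter, and the shrinkage formula `(n * category_mean + m * global_mean) / (n + m)` are choices made here, and the toy brands and prices are made up.

```python
import math
from collections import defaultdict


def fit_target_encoder(categories, targets, smoothing=10.0):
    """Learn smoothed per-category means of log1p(target) from training rows only."""
    log_targets = [math.log1p(t) for t in targets]
    global_mean = sum(log_targets) / len(log_targets)

    sums, counts = defaultdict(float), defaultdict(int)
    for cat, y in zip(categories, log_targets):
        sums[cat] += y
        counts[cat] += 1

    # Shrink each category mean toward the global mean; rare categories shrink most.
    mapping = {
        cat: (sums[cat] + smoothing * global_mean) / (counts[cat] + smoothing)
        for cat in counts
    }
    return mapping, global_mean


def transform(categories, mapping, global_mean):
    """Encode categories; unseen ones fall back to the global training mean."""
    return [mapping.get(cat, global_mean) for cat in categories]


# Toy data (hypothetical): fit on the training split, apply to the validation split.
train_brand = ["BMW", "BMW", "Ford", "Ford", "Ford", "Kia"]
train_price = [30000, 34000, 12000, 15000, 13000, 9000]
valid_brand = ["Ford", "Tesla"]

mapping, global_mean = fit_target_encoder(train_brand, train_price, smoothing=2.0)
valid_encoded = transform(valid_brand, mapping, global_mean)
```

Because the mapping is fitted before the validation rows are seen, a brand that appears only in validation (here `Tesla`) receives the global training mean instead of information from its own price.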

### Future Improvements

- Add SHAP explanations in the Streamlit app
- Add a brand-tier / luxury flag based on average price
- Add mileage-per-year and non-linear mileage features
- Handle outliers more aggressively


### Modeling
Several models were evaluated using 5-fold CV and a hold-out validation set:

|Model|RMSE (validation)|
|---|---|
|LGBM_rand|68090|
|CatBoost_rand|68257|
|LGBM|68344|
|XGBoost|68352|
|CatBoost|68356|

The `LGBM_rand` variant of LightGBM achieved the lowest validation RMSE (68090) and was therefore selected as the final model.
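
Because the target is trained as `log1p(price)`, predictions have to be mapped back with `expm1` before an RMSE in price units can be computed. The sketch below shows that calculation under one assumption: that the RMSE values above are reported on the original price scale, which their magnitude suggests but the write-up does not state. The function names are hypothetical.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def rmse_from_log_predictions(y_true_price, y_pred_log):
    """Undo the log1p transform with expm1, then score in price units."""
    y_pred_price = [math.expm1(p) for p in y_pred_log]
    return rmse(y_true_price, y_pred_price)


# Toy check: errors of 1000 and -2000 give sqrt((1e6 + 4e6) / 2), about 1581.14.
prices = [10000.0, 20000.0]
log_preds = [math.log1p(11000.0), math.log1p(18000.0)]
score = rmse_from_log_predictions(prices, log_preds)
```

An RMSE computed directly on the log scale would be a different, much smaller number, so the two should not be compared with each other.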


### Deployment

::: {.iframe-container}
<iframe src="https://usedcarpredapp-bhwgx6h2tvu5cf48ue4pu3.streamlit.app//?embed=true" 
        width="100%" 
        height="700" 
        frameborder="0">
</iframe>

:::
