<a href="https://colab.research.google.com/github/aniilkeles/CarPricePrediction/blob/main/SavedCarPricePrediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

🚗 Used Car Price Prediction – Project Overview

This project focuses on building a machine learning model to accurately predict the selling price of used cars based on vehicle specifications and condition-related features. The objective is to simulate a real-world pricing system where users can input vehicle characteristics and receive an estimated market value instantly.

The dataset contains detailed information about used vehicles, including brand, production year, mileage, engine horsepower, transmission type, fuel type, accident history, title status, and interior/exterior color. The target variable of the project is the vehicle's selling price.

Several preprocessing and feature engineering steps were applied to improve model performance and realism. The price column was cleaned and converted to numeric format, mileage and engine power were extracted and standardized, and model_year was transformed into a more meaningful feature called car_age. Categorical variables such as transmission type and fuel type were simplified into standardized categories to reduce noise. Accident history and title status were converted into binary indicators. Exterior and interior colors were grouped into simplified color categories to avoid high-cardinality issues.
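As a rough illustration of the string cleaning described above, the sketch below parses a few raw values in the format the dataset uses. The helper names (`parse_price`, `parse_mileage`, `parse_horsepower`) are hypothetical and written with the standard library for readability; the notebook does the same work with pandas string methods.

```python
import re

def parse_price(raw):
    """Turn a string such as "$10,300" into 10300.0."""
    return float(raw.replace("$", "").replace(",", ""))

def parse_mileage(raw):
    """Turn a string such as "51,000 mi." into 51000.0."""
    return float(raw.replace(" mi.", "").replace(",", ""))

def parse_horsepower(raw):
    """Return the number in front of "HP", or None when no horsepower is listed."""
    match = re.search(r"(\d+\.?\d*)HP", raw)
    return float(match.group(1)) if match else None

print(parse_price("$10,300"))                               # 10300.0
print(parse_mileage("51,000 mi."))                          # 51000.0
print(parse_horsepower("300.0HP 3.7L V6 Cylinder Engine"))  # 300.0
print(parse_horsepower("3.8L V6 24V GDI DOHC"))             # None
```

Engine descriptions without an explicit horsepower figure come out as missing values, which the modeling pipeline later fills with the median.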

To reduce extreme variance and improve stability, vehicles priced at $300,000 or more were removed as outliers. Because vehicle prices are positively skewed, the target variable was transformed with log1p, that is log(1 + price), to stabilize variance and improve regression performance. Predictions were later converted back to the original dollar scale with the inverse transform, expm1.
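A minimal sketch of the target transform and its inverse, using a handful of illustrative prices. It also shows why the log scale is convenient: a fixed error in log space corresponds to a roughly fixed percentage error in dollars, whatever the price level.

```python
import numpy as np

# Illustrative prices in dollars
prices = np.array([10_300.0, 38_005.0, 54_598.0, 15_500.0, 34_999.0])

log_prices = np.log1p(prices)    # log(1 + price), the scale the model is trained on
restored = np.expm1(log_prices)  # exp(x) - 1, the exact inverse of log1p

print(np.allclose(restored, prices))  # True

# An error of +0.1 in log space inflates every price by about the same factor
inflation = np.expm1(log_prices + 0.1) / prices
print(inflation.round(3))
```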

Multiple regression algorithms were tested, including Linear Regression, Ridge Regression, KNN, Decision Tree, Random Forest, XGBoost, and LightGBM. After comparison and tuning, XGBoost was selected as the final model due to its superior performance and stability.
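The comparison could be organized along the following lines. This is only a sketch under stated assumptions: it uses synthetic data from `make_regression` rather than the car dataset, placeholder hyperparameters, and only the scikit-learn candidates (XGBoost and LightGBM are left out to keep the example dependency-free), so the ranking it prints says nothing about the ranking on the real data.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the prepared feature matrix and log-price target
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=15)

candidates = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "knn": KNeighborsRegressor(n_neighbors=5),
    "tree": DecisionTreeRegressor(random_state=15),
    "forest": RandomForestRegressor(n_estimators=50, random_state=15),
}

# Score every candidate with the same 5-fold split and the same metric
results = {
    name: cross_val_score(model, X_demo, y_demo, cv=5, scoring="r2").mean()
    for name, model in candidates.items()
}

for name, score in sorted(results.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:>7}: {score:.3f}")
```

Scoring all candidates with the same folds and metric keeps the comparison fair; tuning would then be applied to the most promising ones.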

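The final model achieved a 5-fold cross-validation R² score of approximately 0.85 on the log-transformed price (the five folds ranged from 0.84 to 0.88, with a mean of 0.857), which was used as the primary indicator of generalization performance. Averaging over five held-out folds gives a more stable estimate of performance on unseen data than a single train-test split, whose score depends on which rows happen to land in the test set.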
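To make the procedure concrete, the sketch below writes out 5-fold cross-validation by hand and checks that it matches `cross_val_score`. The data is synthetic and `Ridge` stands in for the XGBoost pipeline; both are assumptions made to keep the example small.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_score

X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=15.0, random_state=15)
model = Ridge(alpha=1.0)

fold_scores = []
for train_idx, test_idx in KFold(n_splits=5).split(X_demo):
    fold_model = clone(model)                     # fresh, unfitted copy for every fold
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    preds = fold_model.predict(X_demo[test_idx])  # rows this fold's model never saw
    fold_scores.append(r2_score(y_demo[test_idx], preds))

fold_scores = np.array(fold_scores)
print("per-fold R2:", fold_scores.round(3))
print("mean R2:", round(float(fold_scores.mean()), 3))

# For a regressor, cv=5 uses the same unshuffled 5-fold split as the loop above
library_scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="r2")
print(np.allclose(fold_scores, library_scores))
```

Every row is used for testing exactly once, and never by a model that was trained on it, which is what makes the averaged score a fair estimate.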

The trained model was encapsulated within a Scikit-learn Pipeline, exported using pickle, and integrated into a FastAPI backend. A web interface was developed to allow users to input vehicle details and receive real-time price predictions, completing the full machine learning lifecycle from data preprocessing to deployment.
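The export and serving step can be sketched as a pickle round trip followed by a small prediction helper. Everything here is illustrative: the toy data, the two-feature linear pipeline standing in for the XGBoost pipeline, and the hypothetical `predict_price` helper, which is the kind of function an API endpoint might call. The web framework itself is omitted.

```python
import pickle

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy training data; the real pipeline is trained on the engineered car features
train = pd.DataFrame({
    "milage": [10_000.0, 50_000.0, 90_000.0, 120_000.0],
    "car_age": [1.0, 4.0, 8.0, 12.0],
})
log_price = np.log1p([45_000.0, 30_000.0, 18_000.0, 9_000.0])

pipeline = Pipeline([("scaler", StandardScaler()), ("regressor", LinearRegression())])
pipeline.fit(train, log_price)

payload = pickle.dumps(pipeline)  # the bytes that would be written to the .pkl file
restored = pickle.loads(payload)  # what a serving process would load at startup

def predict_price(model, features):
    """Predict a price in dollars for one car (hypothetical helper)."""
    row = pd.DataFrame([features])                 # one-row frame with named columns
    return float(np.expm1(model.predict(row)[0]))  # undo the log1p target transform

car = {"milage": 60_000.0, "car_age": 5.0}
print(round(predict_price(restored, car), 2))
```

Because preprocessing lives inside the pipeline, the serving code only has to build a one-row DataFrame with the expected column names. Pickle files should only be loaded from trusted sources, ideally with the same library versions that produced them.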

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df = pd.read_csv('/content/drive/MyDrive/Dataset/used_cars.csv')

In [None]:
df.head()

Unnamed: 0,brand,model,model_year,milage,fuel_type,engine,transmission,ext_col,int_col,accident,clean_title,price
0,Ford,Utility Police Interceptor Base,2013,"51,000 mi.",E85 Flex Fuel,300.0HP 3.7L V6 Cylinder Engine Flex Fuel Capa...,6-Speed A/T,Black,Black,At least 1 accident or damage reported,Yes,"$10,300"
1,Hyundai,Palisade SEL,2021,"34,742 mi.",Gasoline,3.8L V6 24V GDI DOHC,8-Speed Automatic,Moonlight Cloud,Gray,At least 1 accident or damage reported,Yes,"$38,005"
2,Lexus,RX 350 RX 350,2022,"22,372 mi.",Gasoline,3.5 Liter DOHC,Automatic,Blue,Black,None reported,,"$54,598"
3,INFINITI,Q50 Hybrid Sport,2015,"88,900 mi.",Hybrid,354.0HP 3.5L V6 Cylinder Engine Gas/Electric H...,7-Speed A/T,Black,Black,None reported,Yes,"$15,500"
4,Audi,Q3 45 S line Premium Plus,2021,"9,835 mi.",Gasoline,2.0L I4 16V GDI DOHC Turbo,8-Speed Automatic,Glacier White Metallic,Black,None reported,,"$34,999"


In [None]:
from datetime import datetime
current_year = datetime.now().year
df["car_age"] = current_year - df["model_year"]

In [None]:
import pandas as pd
import numpy as np
from datetime import datetime


df["price"] = df["price"].str.replace("$", "", regex=False)
df["price"] = df["price"].str.replace(",", "", regex=False).astype(float)

# Drop outliers; .copy() keeps the column assignments below from acting on a view
df = df[df["price"] < 300000].copy()


df["milage"] = (
    df["milage"]
    .str.replace(" mi.", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(float)
)


df["engine"] = (
    df["engine"]
    .str.extract(r"(\d+\.?\d*)HP", expand=False)
    .astype(float)
)


df["has_accident"] = df["accident"].map({
    "None reported": 0,
    "At least 1 accident or damage reported": 1
})
df.drop("accident", axis=1, inplace=True)


df["clean_title"] = df["clean_title"].map({"Yes": 1})
df["clean_title"] = df["clean_title"].fillna(-1)

df["transmission"] = df["transmission"].str.lower().str.strip()

def simplify_transmission(x):
    if pd.isna(x):
        return "other"
    if "manual" in x or "m/t" in x or "mt" in x:
        return "manual"
    if "automatic" in x or "a/t" in x or "cvt" in x or "dual" in x or "tronic" in x:
        return "automatic"
    return "other"

df["transmission_type"] = df["transmission"].apply(simplify_transmission)
df.drop("transmission", axis=1, inplace=True)


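# Treat placeholder values as missing. np.nan is used rather than None because
# some older pandas versions forward-fill when replace() gets a list and value=None.
df["fuel_type"] = df["fuel_type"].replace(["–", "not supported"], np.nan)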

df["fuel_missing"] = df["fuel_type"].isnull().astype(int)
df["is_diesel"] = df["fuel_type"].str.contains("diesel", case=False, na=False).astype(int)
df["is_hybrid"] = df["fuel_type"].str.contains("hybrid", case=False, na=False).astype(int)
df["is_flexfuel"] = df["fuel_type"].str.contains("flex", case=False, na=False).astype(int)

df.drop("fuel_type", axis=1, inplace=True)


df.drop("model", axis=1, inplace=True)


def simplify_color(x):
    if pd.isna(x) or x == "–":
        return "unknown"
    x = x.lower().strip()
    if "black" in x:
        return "black"
    elif "white" in x:
        return "white"
    elif "gray" in x or "grey" in x or "graphite" in x:
        return "gray"
    elif "silver" in x:
        return "silver"
    elif "blue" in x:
        return "blue"
    elif "red" in x:
        return "red"
    elif "brown" in x:
        return "brown"
    elif "beige" in x:
        return "beige"
    elif "green" in x:
        return "green"
    else:
        return "other"

df["ext_color_simple"] = df["ext_col"].apply(simplify_color)
df["int_color_simple"] = df["int_col"].apply(simplify_color)

df.drop(["ext_col", "int_col"], axis=1, inplace=True)


In [None]:
df["mileage_per_year"] = df["milage"] / (df["car_age"] + 1)
df["engine_per_year"] = df["engine"] / (df["car_age"] + 1)
df["engine_per_mileage"] = df["engine"] / (df["milage"] + 1)

In [None]:
X = df.drop("price", axis=1)
y = np.log1p(df["price"])

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=15)

In [None]:
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from xgboost import XGBRegressor

numeric_features = [
    "milage",
    "engine",
    "car_age",
    "mileage_per_year",
    "engine_per_year",
    "engine_per_mileage"
]

categorical_features = [
    "brand",
    "transmission_type",
    "ext_color_simple",
    "int_color_simple"
]

binary_features = [
    "clean_title",
    "has_accident",
    "fuel_missing",
    "is_diesel",
    "is_hybrid",
    "is_flexfuel"
]

numeric_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

categorical_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))
])

binary_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent"))
])

preprocessor = ColumnTransformer([
    ("num", numeric_transformer, numeric_features),
    ("cat", categorical_transformer, categorical_features),
    ("bin", binary_transformer, binary_features)
])

In [None]:
xgb_model = Pipeline([
    ("preprocessing", preprocessor),
    ("regressor", XGBRegressor(
        random_state=15,
        n_estimators=700,
        learning_rate=0.03,
        max_depth=8,
        subsample=1.0,
        colsample_bytree=0.7,
        gamma=0,
        min_child_weight=1,
        objective="reg:squarederror",
        n_jobs=-1
    ))
])


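# Fit on the training split only. Fitting on the full X, y leaks the test rows
# into training; the hold-out scores recorded below come from such a fit and are
# therefore optimistic compared with the 5-fold CV mean.
xgb_model.fit(X_train, y_train)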

In [None]:
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
y_pred_log = xgb_model.predict(X_test)


print("XGB R2 (log space):", r2_score(y_test, y_pred_log))


y_pred = np.expm1(y_pred_log)
y_true = np.expm1(y_test)

print("XGB R2:", r2_score(y_true, y_pred))
print("MAE:", mean_absolute_error(y_true, y_pred))
print("RMSE:", np.sqrt(mean_squared_error(y_true, y_pred)))

XGB R2 (log space): 0.9747009738638469
XGB R2: 0.9591366505943522
MAE: 3573.9260884934715
RMSE: 7217.427419252567


In [None]:

y_train_pred_log = xgb_model.predict(X_train)

print("Train R2 (log):", r2_score(y_train, y_train_pred_log))

y_test_pred_log = xgb_model.predict(X_test)

print("Test R2 (log):", r2_score(y_test, y_test_pred_log))


Train R2 (log): 0.9749159083282699
Test R2 (log): 0.9747009738638469


In [None]:
from sklearn.model_selection import cross_val_score

cv_scores = cross_val_score(
    xgb_model,
    X,
    y,
    cv=5,
    scoring="r2",
    n_jobs=-1
)

print("CV scores:", cv_scores)
print("CV mean:", cv_scores.mean())

CV scores: [0.83743165 0.85029407 0.85158696 0.87020576 0.87679398]
CV mean: 0.857262485032033


In [None]:
xgb_model.fit(X, y)

In [None]:
import pickle

with open("/content/drive/MyDrive/car_price_xgb_pipeline.pkl", "wb") as f:
    pickle.dump(xgb_model, f)

In [None]:
pd.DataFrame(X_test).to_csv(
    "/content/drive/MyDrive/car_price_test_raw.csv",
    index=False
)

🎯 Conclusion

This project demonstrates an end-to-end machine learning workflow, including data cleaning, feature engineering, outlier handling, target transformation, model experimentation, cross-validation, and API deployment. Rather than focusing solely on model accuracy, emphasis was placed on generalization performance and real-world usability.

With a cross-validated R² score of approximately 0.85 and a fully functional deployment pipeline, the system provides reliable and interpretable price predictions for used vehicles. The project reflects practical machine learning engineering skills, combining statistical modeling, data preprocessing strategy, and production-oriented thinking.