# Used Car Price Prediction

[Kaggle Resource](https://www.kaggle.com/datasets/taeefnajib/used-car-price-prediction-dataset)

After countless hours (probably even days) of learning, studying, preparing, examining best practices, and relearning, it's finally time to get down to business.

I've also recently decided that you can only do so much preparation for a field as complicated as ML, so I'll likely have to fill in some gaps along the way. Thankfully, Kaggle has a LOT of publicly posted code to lean on. I'll abstain from referencing it for as long as I can, though, and will make sure to give credit where it's due.

Anyway, I believe this dataset strikes a good balance between challenging and approachable. Let's begin.

# Load and Examine Data

In [1328]:
import pandas as pd
import re
import numpy as np
import sklearn as skl
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.impute import KNNImputer

In [1329]:
cars = pd.read_csv("data/used_cars.csv")
print(len(cars))

4009


Let's just get a quick overview of some basic precleaning stuff.

In [1330]:
cars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4009 entries, 0 to 4008
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   brand         4009 non-null   object
 1   model         4009 non-null   object
 2   model_year    4009 non-null   int64 
 3   milage        4009 non-null   object
 4   fuel_type     3839 non-null   object
 5   engine        4009 non-null   object
 6   transmission  4009 non-null   object
 7   ext_col       4009 non-null   object
 8   int_col       4009 non-null   object
 9   accident      3896 non-null   object
 10  clean_title   3413 non-null   object
 11  price         4009 non-null   object
dtypes: int64(1), object(11)
memory usage: 376.0+ KB


In [1331]:
print(f"Missing Values:\n{cars.isna().sum()}\n\nTotal Duplicates:\n{cars.duplicated().sum()}")

Missing Values:
brand             0
model             0
model_year        0
milage            0
fuel_type       170
engine            0
transmission      0
ext_col           0
int_col           0
accident        113
clean_title     596
price             0
dtype: int64

Total Duplicates:
0


Now that that's out of the way, it's time to actually take a peek at the data.

In [1332]:
cars.head(10)

| | brand | model | model_year | milage | fuel_type | engine | transmission | ext_col | int_col | accident | clean_title | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Ford | Utility Police Interceptor Base | 2013 | 51,000 mi. | E85 Flex Fuel | 300.0HP 3.7L V6 Cylinder Engine Flex Fuel Capa... | 6-Speed A/T | Black | Black | At least 1 accident or damage reported | Yes | $10,300 |
| 1 | Hyundai | Palisade SEL | 2021 | 34,742 mi. | Gasoline | 3.8L V6 24V GDI DOHC | 8-Speed Automatic | Moonlight Cloud | Gray | At least 1 accident or damage reported | Yes | $38,005 |
| 2 | Lexus | RX 350 RX 350 | 2022 | 22,372 mi. | Gasoline | 3.5 Liter DOHC | Automatic | Blue | Black | None reported | NaN | $54,598 |
| 3 | INFINITI | Q50 Hybrid Sport | 2015 | 88,900 mi. | Hybrid | 354.0HP 3.5L V6 Cylinder Engine Gas/Electric H... | 7-Speed A/T | Black | Black | None reported | Yes | $15,500 |
| 4 | Audi | Q3 45 S line Premium Plus | 2021 | 9,835 mi. | Gasoline | 2.0L I4 16V GDI DOHC Turbo | 8-Speed Automatic | Glacier White Metallic | Black | None reported | NaN | $34,999 |
| 5 | Acura | ILX 2.4L | 2016 | 136,397 mi. | Gasoline | 2.4 Liter | F | Silver | Ebony. | None reported | NaN | $14,798 |
| 6 | Audi | S3 2.0T Premium Plus | 2017 | 84,000 mi. | Gasoline | 292.0HP 2.0L 4 Cylinder Engine Gasoline Fuel | 6-Speed A/T | Blue | Black | None reported | Yes | $31,000 |
| 7 | BMW | 740 iL | 2001 | 242,000 mi. | Gasoline | 282.0HP 4.4L 8 Cylinder Engine Gasoline Fuel | A/T | Green | Green | None reported | Yes | $7,300 |
| 8 | Lexus | RC 350 F Sport | 2021 | 23,436 mi. | Gasoline | 311.0HP 3.5L V6 Cylinder Engine Gasoline Fuel | 6-Speed A/T | Black | Black | None reported | Yes | $41,927 |
| 9 | Tesla | Model X Long Range Plus | 2020 | 34,000 mi. | NaN | 534.0HP Electric Motor Electric Fuel System | A/T | Black | Black | None reported | Yes | $69,950 |


# Clean the Data

Now that the preliminary stuff is out of the way, let's start cleaning the data up.

It'd be best to start simple here and worry about the basic numeric conversion first. We'll start by removing unnecessary string values from columns that will be designated as strictly numeric. I also think it'd be interesting to see if there are any differences in the model outputs when using `float32` instead of `float64`. We'll get to that much later though.

For now, let's start with simple data cleaning. Then we can chip away at the more tricky regex stuff and increase the complexity as we get a feel for the data.
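
To get a feel for the regex step before running it on the full table, here is a minimal sketch on three engine strings copied from the `head(10)` preview. The looser `liters` pattern is an assumption of mine and is not what the cleaning cell uses; it also accepts the spelled-out "Liter" form, which the strict pattern misses.

```python
import pandas as pd

# Engine strings copied from the first rows of the dataset preview.
engine = pd.Series([
    "292.0HP 2.0L 4 Cylinder Engine Gasoline Fuel",
    "3.8L V6 24V GDI DOHC",
    "3.5 Liter DOHC",
])

# Strict patterns: the number must sit directly against "HP" or "L".
horsepower = pd.to_numeric(engine.str.extract(r"([\d.]+)HP", expand=False))
liters_strict = pd.to_numeric(engine.str.extract(r"([\d.]+)L", expand=False))

# Assumed looser pattern: also accept "3.5 Liter" (space plus spelled-out unit).
liters_loose = pd.to_numeric(
    engine.str.extract(r"([\d.]+)\s*(?:L\b|Liter)", expand=False)
)

print(pd.DataFrame({
    "horsepower": horsepower,
    "liters_strict": liters_strict,
    "liters_loose": liters_loose,
}))
```

On these three strings the strict pattern leaves `liters` empty for "3.5 Liter DOHC", while the looser one recovers 3.5. Whether the looser pattern is safe on every engine string in the dataset is something I haven't checked.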

**Note: We can definitely make this SIGNIFICANTLY cleaner by consolidating a lot of it into fewer lines of code. We want BEST PRACTICES AND THIS IS NOT THAT.**

In [1333]:
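# Bugatti is a major price outlier, so drop it along with smart
# .copy() keeps the column assignments below from writing to a slice of the original frame
cars = cars[~cars["brand"].isin(["Bugatti", "smart"])].copy()

# Strip everything except digits ("$", ",", " mi.") and convert to integers
cars[["milage", "price"]] = cars[["milage", "price"]].replace(r'[^\d]', '', regex=True).astype(int)

# `engine` column: pull out horsepower and displacement as floats (NaN when absent)
cars["horsepower"] = cars["engine"].str.extract(r'([\d.]+)HP', expand=False).astype(float)
cars["liters"] = cars["engine"].str.extract(r'([\d.]+)L', expand=False).astype(float)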

# `fuel_type` column
cars["gas_diesel_other"] = (
    cars["fuel_type"].str.contains("gasoline", case=False, na=False) |
    cars["fuel_type"].str.contains("diesel", case=False, na=False) |
    cars["fuel_type"].str.contains("fuel", case=False, na=False)
)
cars["hybrid"] = cars["fuel_type"].str.contains("hybrid", case=False, na=False)
# Assumption: a missing `fuel_type` means electric (e.g. the Tesla row in the preview)
cars["electric"] = cars["fuel_type"].isna()
print(f"Conflicting fuel type values? {(cars[['gas_diesel_other', 'hybrid', 'electric']].sum(axis=1) > 1).any()}")

# `transmission` column
cars["automatic_transmission"] = (
    cars["transmission"].str.contains("A/T", case=False) |
    cars["transmission"].str.contains("AT", case=False) |
    cars["transmission"].str.contains("auto", case=False) |
    cars["transmission"].str.contains("automatic", case=False) |
    cars["transmission"].str.contains("cvt", case=False)
)
cars['dual_shift_transmission'] = cars['transmission'].str.contains("dual", case=False)
cars["manual_transmission"] = (
    cars['transmission'].str.contains("manual", case=False) |
    cars['transmission'].str.contains("m/t", case=False) |
    cars['transmission'].str.contains("mt", case=False)
)
cars.loc[cars['dual_shift_transmission'], ['automatic_transmission', 'manual_transmission']] = False
cars.loc[cars['manual_transmission'], ['automatic_transmission', 'dual_shift_transmission']] = False
# Remove rows where none of the three newly created transmission flags is set
transmission_cols = ['automatic_transmission', 'dual_shift_transmission', 'manual_transmission']
mask = cars[transmission_cols].any(axis=1)
cars = cars[mask].copy()
print(f"Conflicting transmission values? {(cars[transmission_cols].sum(axis=1) > 1).any()}")

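# `accident` column: drop rows with no accident info, then flag reported accidents
# (`clean_title` is simply dropped in the deletions below)
cars.dropna(subset=['accident'], inplace=True)
cars["accident_bool"] = cars["accident"] != "None reported"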

# Deletions
cars.drop(["fuel_type", "engine", "transmission", "accident", "clean_title", "int_col", "ext_col"], axis=1, inplace=True)
cars

Conflicting fuel type values? False
Conflicting transmission values? False


| | brand | model | model_year | milage | price | horsepower | liters | gas_diesel_other | hybrid | electric | automatic_transmission | dual_shift_transmission | manual_transmission | accident_bool |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Ford | Utility Police Interceptor Base | 2013 | 51000 | 10300 | 300.0 | 3.7 | True | False | False | True | False | False | True |
| 1 | Hyundai | Palisade SEL | 2021 | 34742 | 38005 | NaN | 3.8 | True | False | False | True | False | False | True |
| 2 | Lexus | RX 350 RX 350 | 2022 | 22372 | 54598 | NaN | NaN | True | False | False | True | False | False | False |
| 3 | INFINITI | Q50 Hybrid Sport | 2015 | 88900 | 15500 | 354.0 | 3.5 | False | True | False | True | False | False | False |
| 4 | Audi | Q3 45 S line Premium Plus | 2021 | 9835 | 34999 | NaN | 2.0 | True | False | False | True | False | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4004 | Bentley | Continental GT Speed | 2023 | 714 | 349950 | NaN | 6.0 | True | False | False | True | False | False | False |
| 4005 | Audi | S4 3.0T Premium Plus | 2022 | 10900 | 53900 | 349.0 | 3.0 | True | False | False | False | True | False | False |
| 4006 | Porsche | Taycan | 2022 | 2116 | 90998 | NaN | NaN | False | False | True | True | False | False | False |
| 4007 | Ford | F-150 Raptor | 2020 | 33000 | 62999 | 450.0 | 3.5 | True | False | False | True | False | False | False |


 I feel like I need to come up with some better data validation practices. That's for another time though. We'll see if I really messed up when we take a look at the MSE values.
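
One possible starting point for that validation, sketched on a tiny made-up frame rather than the real `cars` table: a helper that collects problems instead of relying on eyeballing the output. `validate_cleaned` and its arguments are hypothetical names of mine, and the two checks are just the ones the cleaning step seems to care about (numeric columns really being numeric, and exactly one flag set per group).

```python
import pandas as pd

def validate_cleaned(df, numeric_cols, exclusive_groups):
    """Return a list of human-readable problems found in a cleaned frame."""
    problems = []
    for col in numeric_cols:
        # errors="coerce" turns anything unparseable into NaN, so newly
        # created NaNs reveal values that are not really numeric.
        coerced = pd.to_numeric(df[col], errors="coerce")
        n_bad = int((coerced.isna() & df[col].notna()).sum())
        if n_bad:
            problems.append(f"{col}: {n_bad} non-numeric value(s)")
    for group in exclusive_groups:
        # Exactly one flag per row should be True within each group.
        n_bad = int((df[group].sum(axis=1) != 1).sum())
        if n_bad:
            problems.append(f"{group}: {n_bad} row(s) without exactly one flag")
    return problems

# Made-up rows: one bad mileage string and one row with two flags set.
toy = pd.DataFrame({
    "milage": ["51000", "34742", "n/a"],
    "automatic_transmission": [True, False, True],
    "manual_transmission": [False, True, True],
})
found = validate_cleaned(
    toy,
    numeric_cols=["milage"],
    exclusive_groups=[["automatic_transmission", "manual_transmission"]],
)
print(found)
```

An empty list would mean both checks passed; here the toy frame trips each check once.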

# One-Hot Encode Categorical Data

I'm putting this in its own section because I've done one-hot encoding using `pandas`, but not `sklearn`. So let's see if I can get it done (probably gonna be pretty easy). The only categorical column I've kept is `brand`: the color categories probably don't provide much insight, and there are too many different models in the `model` column to provide useful comparisons (at least I think). However, I might have to put this theory to the test.

How about this? We'll add on one more requirement to this project: predict `price` using high and low dimensional data and compare the results. This will be in addition to the requirement of comparing the various missing value imputation methods we are about to explore as well. We'll explore high dimensional data methods in another notebook much later.

In [1334]:
ohe = OneHotEncoder(handle_unknown="ignore", 
                    sparse_output=False).set_output(transform="pandas")
cars = pd.concat([cars,
                  ohe.fit_transform(cars[["brand"]])],
                  axis=1).drop(columns=["brand", "model"])
cars.shape

(3870, 67)

Alright, looks like the dimensionality of the data is still good with just `brand` encoded. Let's move on.

# Missing Value Imputation

I'm in uncharted territory with this kind of stuff, but I have a feeling it might not be too challenging. Just gonna take a lot of background studying. Let's go over some of the options we have.

Here's a [general overview](https://scikit-learn.org/stable/modules/impute.html) for Sklearn native missing value imputation. We'll do a deep dive into each Sklearn option in the `sklearn_missing_val_imputation_methods.ipynb` notebook. After that, we'll do the same with [LightGBM](https://lightgbm.readthedocs.io/en/stable/) in a separate notebook, then I can finally create my own attempt using personally selected features for a custom linear regression model. Each imputation method will be assigned to its own data frame.

Before any of that, let's do a quick double check to see if any additional columns have `NaN` values:

In [1335]:
cols_with_nans = [x for x in cars if cars[x].isnull().sum() > 0]
cars[cols_with_nans].isnull().sum()

horsepower    760
liters        360
dtype: int64

Beyond the methods I'm about to explore, I can also think of a few other ways we could fill in this data even more reliably, but we're here for practice, not the pursuit of the perfect model. That is for another time. Just leaving this note to keep that train of thought in mind. Anyway, let's do the simple stuff now.

Here's a dictionary that will hold the results of each method for the future comparisons:

In [1336]:
imp_datasets = {}
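
A rough sketch of how that future comparison could look, assuming every frame in the dictionary is fully numeric and shares a `price` column. `compare_datasets` is a hypothetical helper, the plain `LinearRegression` with 5-fold cross-validation is just a placeholder model, and the demo data is randomly generated, not taken from the cars dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def compare_datasets(datasets, target="price"):
    """Cross-validated MSE of a plain linear regression for each frame."""
    scores = {}
    for name, df in datasets.items():
        X = df.drop(columns=[target]).astype(float)
        y = df[target].astype(float)
        neg_mse = cross_val_score(
            LinearRegression(), X, y, cv=5, scoring="neg_mean_squared_error"
        )
        scores[name] = -neg_mse.mean()  # flip the sign back to a plain MSE
    return pd.Series(scores).sort_values()

# Synthetic stand-in: price depends linearly on horsepower plus noise.
rng = np.random.default_rng(0)
hp = rng.uniform(100, 500, size=60)
base = pd.DataFrame({"horsepower": hp, "price": 100 * hp + rng.normal(0, 500, 60)})
demo = {
    "clean": base,
    # Same rows, but with a heavily corrupted horsepower column.
    "noisy": base.assign(horsepower=hp + rng.normal(0, 80, 60)),
}
ranking = compare_datasets(demo)
print(ranking)
```

The frame with the lowest MSE sorts first, so the same call on `imp_datasets` would rank the imputation methods under this one placeholder model.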

## Scikit: Mean, Median, and Most Frequent Value Imputation

Much to my disdain, I think it's important to compare the performance of these more basic missing value imputations against the more advanced methods. It also doubles as a bonus for familiarizing myself with `scikit`'s syntax and various options.

In [1337]:
# simple_imp_target_cols = ['horsepower', 'liters']

# imp_mean = SimpleImputer(strategy='mean')
# cars_mean = cars.copy()
# cars_mean[simple_imp_target_cols] = imp_mean.fit_transform(cars[simple_imp_target_cols])

# imp_median = SimpleImputer(strategy='median')
# cars_median = cars.copy()
# cars_median[simple_imp_target_cols] = imp_median.fit_transform(cars[simple_imp_target_cols])

# imp_most_freq = SimpleImputer(strategy='most_frequent')
# cars_freq = cars.copy()
# cars_freq[simple_imp_target_cols] = imp_most_freq.fit_transform(cars[simple_imp_target_cols])

# imp_datasets["scikit_mean"] = cars_mean
# imp_datasets["scikit_median"] = cars_median
# imp_datasets["scikit_most_frequent"] = cars_freq


I think that could've been written much more neatly. Let's try again.

In [1338]:
def simple_impute(data: pd.DataFrame, strategy: str, target_cols: list[str]) -> pd.DataFrame:
    """Return a copy of `data` with NaNs in `target_cols` filled using `strategy`."""
    data = data.copy()
    imp_simple = SimpleImputer(strategy=strategy)
    data[target_cols] = imp_simple.fit_transform(data[target_cols])
    return data

simple_imp_target_cols = ['horsepower', 'liters']
imp_datasets["scikit_mean"] = simple_impute(cars, "mean", simple_imp_target_cols).round(2) # No need for all those decimals 
imp_datasets["scikit_median"] = simple_impute(cars, "median", simple_imp_target_cols)
imp_datasets["scikit_most_frequent"] = simple_impute(cars, "most_frequent", simple_imp_target_cols)

imp_datasets["scikit_median"]

| | model_year | milage | price | horsepower | liters | gas_diesel_other | hybrid | electric | automatic_transmission | dual_shift_transmission | ... | brand_Rolls-Royce | brand_Saab | brand_Saturn | brand_Scion | brand_Subaru | brand_Suzuki | brand_Tesla | brand_Toyota | brand_Volkswagen | brand_Volvo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013 | 51000 | 10300 | 300.0 | 3.7 | True | False | False | True | False | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 2021 | 34742 | 38005 | 310.0 | 3.8 | True | False | False | True | False | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 2022 | 22372 | 54598 | 310.0 | 3.5 | True | False | False | True | False | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 2015 | 88900 | 15500 | 354.0 | 3.5 | False | True | False | True | False | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 2021 | 9835 | 34999 | 310.0 | 2.0 | True | False | False | True | False | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4004 | 2023 | 714 | 349950 | 310.0 | 6.0 | True | False | False | True | False | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4005 | 2022 | 10900 | 53900 | 349.0 | 3.0 | True | False | False | False | True | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4006 | 2022 | 2116 | 90998 | 310.0 | 3.5 | False | False | True | True | False | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4007 | 2020 | 33000 | 62999 | 450.0 | 3.5 | True | False | False | True | False | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


Beautiful. Makes me so happy.
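
To make the difference between the three strategies concrete, here is a toy column (made-up numbers, not from the dataset) where the learned fill value can be worked out by hand. `statistics_` is the fitted attribute where `SimpleImputer` stores the fill value for each column.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Four observed values and one gap; 300 appears twice.
toy = pd.DataFrame({"horsepower": [200.0, 300.0, 300.0, 500.0, np.nan]})

fills = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(strategy=strategy)
    imputer.fit(toy)
    # statistics_ holds the learned fill value for each column.
    fills[strategy] = float(imputer.statistics_[0])

print(fills)
# {'mean': 325.0, 'median': 300.0, 'most_frequent': 300.0}
```

Every missing cell in a column gets the same value, which is why the median-imputed table shows 310.0 in every row whose `horsepower` was originally missing.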

## Scikit: Iterative Imputer

I'll be referring to [this video](https://www.youtube.com/watch?v=m_qKhnaYZlc) for some help as the Sklearn documentation doesn't give me much confidence.

I'm also noticing I'm using like 3 different names for Sklearn interchangeably. I don't know how that happened.

In [1339]:
imp_it = IterativeImputer()
imp_datasets["scikit_it"] = pd.DataFrame(imp_it.fit_transform(cars), columns=cars.columns, index=cars.index)

## Scikit: KNN Imputer

In [1340]:
imp_knn = KNNImputer(n_neighbors=2)
imp_datasets["scikit_knn"] = pd.DataFrame(imp_knn.fit_transform(cars), columns=cars.columns, index=cars.index)
imp_datasets["scikit_knn"]

| | model_year | milage | price | horsepower | liters | gas_diesel_other | hybrid | electric | automatic_transmission | dual_shift_transmission | ... | brand_Rolls-Royce | brand_Saab | brand_Saturn | brand_Scion | brand_Subaru | brand_Suzuki | brand_Tesla | brand_Toyota | brand_Volkswagen | brand_Volvo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013.0 | 51000.0 | 10300.0 | 300.0 | 3.70 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 2021.0 | 34742.0 | 38005.0 | 350.5 | 3.80 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 2022.0 | 22372.0 | 54598.0 | 390.0 | 4.35 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 2015.0 | 88900.0 | 15500.0 | 354.0 | 3.50 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 2021.0 | 9835.0 | 34999.0 | 265.0 | 2.00 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4004 | 2023.0 | 714.0 | 349950.0 | 636.5 | 6.00 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4005 | 2022.0 | 10900.0 | 53900.0 | 349.0 | 3.00 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4006 | 2022.0 | 2116.0 | 90998.0 | 835.0 | 4.10 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4007 | 2020.0 | 33000.0 | 62999.0 | 450.0 | 3.50 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
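
To see what `n_neighbors=2` actually does, here is a toy frame with made-up values: the missing horsepower is replaced by the average of the two rows that sit closest on the columns that are present.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Two small-engine cars, two big-engine cars, and one small-engine
# car whose horsepower is unknown.
toy = pd.DataFrame({
    "liters":     [2.0, 2.2, 5.0, 5.2, 2.1],
    "horsepower": [200.0, 220.0, 500.0, 520.0, np.nan],
})

imputer = KNNImputer(n_neighbors=2)
filled = pd.DataFrame(
    imputer.fit_transform(toy), columns=toy.columns, index=toy.index
)

# The nearest rows by `liters` are rows 0 and 1, so the fill is their mean.
knn_fill = filled.loc[4, "horsepower"]
print(knn_fill)
# 210.0
```

That would explain values like 350.5 and 636.5 in the table: presumably each is the midpoint of two neighbors' horsepower. Since the distance is Euclidean over unscaled columns, I'd expect `milage` and `price` to dominate the neighbor search on the real frame, so scaling before imputing is probably worth trying.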
