# Used Car Price Prediction

[Kaggle Resource](https://www.kaggle.com/datasets/taeefnajib/used-car-price-prediction-dataset)

**IT'S FINALLY TIME!**

After countless hours, probably even days, of learning, studying, preparing, examining best practices, and relearning, it's finally time for me to apply my studies.

I've also recently decided that you can only prepare so much for a field as complicated as ML, so I'll likely have to fill in some gaps along the way. Thankfully, Kaggle has a LOT of publicly posted code to help. I'll hold off on referencing it for as long as I can, though, and I'll make sure to give credit where it's due.

Anyway, I believe this dataset strikes a good balance between being challenging and being approachable. Let's begin!

(I promise my documentation will get more formal as these examples increase in complexity.)

# Load and Examine Data

In [59]:
import pandas as pd
import numpy as np
import sklearn as skl
import matplotlib.pyplot as plt

In [60]:
cars = pd.read_csv("data/used_cars.csv")

Let's get a quick overview of the basics before cleaning: column types, missing values, and duplicates.

In [61]:
cars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4009 entries, 0 to 4008
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   brand         4009 non-null   object
 1   model         4009 non-null   object
 2   model_year    4009 non-null   int64 
 3   milage        4009 non-null   object
 4   fuel_type     3839 non-null   object
 5   engine        4009 non-null   object
 6   transmission  4009 non-null   object
 7   ext_col       4009 non-null   object
 8   int_col       4009 non-null   object
 9   accident      3896 non-null   object
 10  clean_title   3413 non-null   object
 11  price         4009 non-null   object
dtypes: int64(1), object(11)
memory usage: 376.0+ KB


In [62]:
print(f"Missing Values:\n{cars.isna().sum()}\n")
print(f"Total Duplicates:\n{cars.duplicated().sum()}")

Missing Values:
brand             0
model             0
model_year        0
milage            0
fuel_type       170
engine            0
transmission      0
ext_col           0
int_col           0
accident        113
clean_title     596
price             0
dtype: int64

Total Duplicates:
0
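
To put those missing counts in perspective, here is a minimal sketch that turns them into percentages of the 4,009 rows. The numbers are copied from the outputs above; on the real frame, `cars.isna().mean() * 100` should give the same shares directly.

```python
import pandas as pd

# Counts and row total copied from the info() and isna() outputs above.
total_rows = 4009
missing = pd.Series({"fuel_type": 170, "accident": 113, "clean_title": 596})

# Share of rows missing each column, as a percentage rounded to 2 decimals.
missing_pct = (missing / total_rows * 100).round(2)
print(missing_pct)
```

`clean_title` is by far the biggest gap, so it will drive most of the row loss if rows with missing values are removed.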


For simplicity, we're just going to drop every row that has a missing value. That costs at least 596 of the 4,009 rows (the `clean_title` gaps alone), so we can revisit how to handle missing values once the initial models are finished.

In [63]:
cars.dropna(inplace=True)
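
As a quick sanity check on what `dropna` does, here is a small sketch on a made-up frame (the `toy` data is hypothetical, not taken from the dataset). It counts how many rows are lost and shows that the surviving rows keep their original index labels unless the index is reset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `cars`; the values are made up.
toy = pd.DataFrame({
    "fuel_type": ["Gasoline", np.nan, "Hybrid", "Gasoline"],
    "accident": ["None reported", "None reported", np.nan, "None reported"],
    "clean_title": ["Yes", "Yes", "Yes", np.nan],
})

rows_before = len(toy)
cleaned = toy.dropna()  # returns a new frame; `toy` itself is unchanged
rows_after = len(cleaned)
print(f"Dropped {rows_before - rows_after} of {rows_before} rows")

# Surviving rows keep their old labels; renumber from 0 if gaps are unwanted.
cleaned = cleaned.reset_index(drop=True)
```

Resetting the index is optional. Leaving the gaps in place is harmless for modeling, but positional and label-based lookups will no longer line up.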

Now that that's out of the way, let's take a peek at the data.

In [64]:
cars.head(10)

| | brand | model | model_year | milage | fuel_type | engine | transmission | ext_col | int_col | accident | clean_title | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Ford | Utility Police Interceptor Base | 2013 | 51,000 mi. | E85 Flex Fuel | 300.0HP 3.7L V6 Cylinder Engine Flex Fuel Capa... | 6-Speed A/T | Black | Black | At least 1 accident or damage reported | Yes | $10,300 |
| 1 | Hyundai | Palisade SEL | 2021 | 34,742 mi. | Gasoline | 3.8L V6 24V GDI DOHC | 8-Speed Automatic | Moonlight Cloud | Gray | At least 1 accident or damage reported | Yes | $38,005 |
| 3 | INFINITI | Q50 Hybrid Sport | 2015 | 88,900 mi. | Hybrid | 354.0HP 3.5L V6 Cylinder Engine Gas/Electric H... | 7-Speed A/T | Black | Black | None reported | Yes | $15,500 |
| 6 | Audi | S3 2.0T Premium Plus | 2017 | 84,000 mi. | Gasoline | 292.0HP 2.0L 4 Cylinder Engine Gasoline Fuel | 6-Speed A/T | Blue | Black | None reported | Yes | $31,000 |
| 7 | BMW | 740 iL | 2001 | 242,000 mi. | Gasoline | 282.0HP 4.4L 8 Cylinder Engine Gasoline Fuel | A/T | Green | Green | None reported | Yes | $7,300 |
| 8 | Lexus | RC 350 F Sport | 2021 | 23,436 mi. | Gasoline | 311.0HP 3.5L V6 Cylinder Engine Gasoline Fuel | 6-Speed A/T | Black | Black | None reported | Yes | $41,927 |
| 11 | Aston | Martin DBS Superleggera | 2019 | 22,770 mi. | Gasoline | 715.0HP 5.2L 12 Cylinder Engine Gasoline Fuel | 8-Speed A/T | Silver | Black | None reported | Yes | $184,606 |
| 12 | Toyota | Supra 3.0 Premium | 2021 | 12,500 mi. | Gasoline | 382.0HP 3.0L Straight 6 Cylinder Engine Gasoli... | A/T | Yellow | Black | None reported | Yes | $53,500 |
| 13 | Lincoln | Aviator Reserve AWD | 2022 | 18,196 mi. | Gasoline | 400.0HP 3.0L V6 Cylinder Engine Gasoline Fuel | Transmission w/Dual Shift Mode | Black | Brown | None reported | Yes | $62,000 |
| 15 | Land | Rover LR4 HSE | 2013 | 79,800 mi. | Gasoline | 375.0HP 5.0L 8 Cylinder Engine Gasoline Fuel | A/T | White | Black | None reported | Yes | $29,990 |


# Clean Data

With the preliminary checks done, let's start cleaning up.

It's best to start simple and leave the more complex columns for later. We'll begin by stripping the non-numeric text (things like `$`, thousands separators, and ` mi.`) from the simpler columns, such as `milage` and `price`, so they can be converted to float values.
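
One way this string cleanup could look is sketched below. The sample values are copied from the `head` output above, but the `sample` frame is only a stand-in for `cars`, and this sketch assumes every value follows the same `"51,000 mi."` and `"$10,300"` patterns, which has not been verified for the full dataset.

```python
import pandas as pd

# Stand-in for `cars`, using values copied from the head() output.
sample = pd.DataFrame({
    "milage": ["51,000 mi.", "34,742 mi.", "88,900 mi."],
    "price": ["$10,300", "$38,005", "$15,500"],
})

# Drop the unit suffix and thousands separators, then cast to float.
sample["milage"] = (
    sample["milage"]
    .str.replace(" mi.", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(float)
)

# Drop the currency symbol and thousands separators, then cast to float.
sample["price"] = (
    sample["price"]
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(float)
)

print(sample.dtypes)
```

Passing `regex=False` keeps `$` and `.` from being interpreted as regular-expression metacharacters. If any value in the real data deviates from these patterns, `astype(float)` will raise an error, which is a useful early warning.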