# Predicting Used Car Prices

Suppose we have the following problem:

> Tom wants to sell his car, but doesn't know how much he should sell it for. He wants to sell it for as much as possible, but also have it be reasonably priced so someone would want to purchase it. How can we help Tom determine the best price for the car?

In short: `Can we estimate the price of a used car based on its characteristics?`.

### Data Source

The data can be found at: https://archive.ics.uci.edu/dataset/10/automobile
- We can view the data contents (excludes headers) at: https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data

### Table of Contents:

- [0. Prerequisites](#0.-Prerequisites)
- [1. Reading the Raw Data](#1.-Reading-the-Raw-Data)
- [2. Cleaning the Data](#2.-Cleaning-the-Data)
  - [2.1 Handling Missing Values](#2.1-Handling-Missing-Values)
  - [2.2 Fixing Incorrect Types](#2.2-Fixing-Incorrect-Types)
  - [2.3 Standardizing Data](#2.3-Standardizing-Data)
  - [2.4 Data Normalization](#2.4-Data-Normalization)
  - [2.5 Binning](#2.5-Binning)
  - [2.6 Indicator Variable (Dummy Variable)](#2.6-Indicator-Variable-Dummy-Variable)
- [3. Saving the Cleaned Data](#3.-Saving-the-Cleaned-Data)


<hr />

# 0. Prerequisites

Before you run this notebook, complete the following steps:
- Install Libraries/Packages
- Import Required Modules


### Install Libraries/Packages

This will install all of the libraries/packages used in all of the notebooks for this project.

In [None]:
! pip install pandas numpy scipy matplotlib seaborn scikit-learn

### Import Required Modules

Import and configure the required modules.

In [1]:
import pandas as pd
import numpy as np

# 1. Reading the Raw Data

We start off by reading the raw dataset, displaying the first 5 rows, and then taking a look at the inferred columns and column types.

In [2]:
# Define file location.
# You can alternatively load the file locally by downloading it.
DATA_PATH = "https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data"

# Use pandas to read the data.
# Since no header is provided in the data, we need to specify `header=None`.
raw_data = pd.read_csv(DATA_PATH, header=None)

# Populate the missing headers.
headers = [
  "symboling", "normalized-losses", "make", "fuel-type", "aspiration", "num-of-doors", "body-style",
  "drive-wheels", "engine-location", "wheel-base", "length", "width", "height", "curb-weight",
  "engine-type", "num-of-cylinders", "engine-size", "fuel-system", "bore", "stroke", "compression-ratio",
  "horsepower", "peak-rpm", "city-mpg", "highway-mpg", "price"
]
raw_data.columns = headers

# Display the first 5 rows.
raw_data.head()

,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495
1,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
2,1,?,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500
3,2,164,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950
4,2,164,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450


In [3]:
raw_data.dtypes

symboling              int64
normalized-losses     object
make                  object
fuel-type             object
aspiration            object
num-of-doors          object
body-style            object
drive-wheels          object
engine-location       object
wheel-base           float64
length               float64
width                float64
height               float64
curb-weight            int64
engine-type           object
num-of-cylinders      object
engine-size            int64
fuel-system           object
bore                  object
stroke                object
compression-ratio    float64
horsepower            object
peak-rpm              object
city-mpg               int64
highway-mpg            int64
price                 object
dtype: object

# 2. Cleaning the Data

From the preview of the data and the inferred data types, we can see that some values are missing and some columns have incorrect types. We need to:
- Handle missing values.
- Fix incorrect types.
- Standardize data.

### 2.1 Handling Missing Values

We notice that some fields are populated with a `?`. To mark these consistently as missing, we replace each `?` with `NaN`.

In [4]:
raw_data.replace("?", np.nan, inplace=True)
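
As an alternative sketch, pandas can treat `?` as missing while reading the file, through the `na_values` parameter of `read_csv`. The inline sample and the `sample` name below are made up for illustration and are not taken from the dataset.

```python
import io

import pandas as pd

# Hypothetical three-row sample that uses "?" for missing values.
sample_csv = io.StringIO("164,audi,13950\n?,bmw,16430\n158,audi,?\n")

# `na_values="?"` tells pandas to parse "?" as NaN while reading.
sample = pd.read_csv(
    sample_csv,
    header=None,
    names=["normalized-losses", "make", "price"],
    na_values="?",
)

# Count the missing values per column.
print(sample.isna().sum())
```

With this approach, numeric columns that contain `?` are parsed as `float64` right away instead of `object`.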

After replacing the `?` with `NaN`, we notice that we're missing values from the following variables/columns:
- `normalized-losses`: 41 missing values
- `num-of-doors`: 2 missing values
- `bore`: 4 missing values
- `stroke`: 4 missing values
- `horsepower`: 2 missing values
- `peak-rpm`: 2 missing values
- `price`: 4 missing values

This was discovered from checking the summary of the DataFrame with: `raw_data.info()`.
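
The same counts can also be read off directly with `isna().sum()`. The sketch below uses a small hypothetical frame in place of `raw_data`, so the numbers it prints are not the dataset's.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; in the notebook this would be `raw_data`.
sample = pd.DataFrame({
    "bore": ["3.47", np.nan, "3.19"],
    "price": ["13495", "16500", np.nan],
    "make": ["alfa-romero", "audi", "audi"],
})

# Count missing values per column.
missing_counts = sample.isna().sum()
# Keep only the columns that have at least one missing value.
missing_counts = missing_counts[missing_counts > 0]
print(missing_counts)
```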

The appropriate technique depends on the kind of data a variable holds. Numeric variables can be filled with the average, categorical variables with the most frequent value, and rows that are missing the target variable are dropped.
- `normalized-losses`: Populate with **average** value.
- `num-of-doors`: Populate with most **frequent** value.
- `bore`: Populate with **average** value.
- `stroke`: Populate with **average** value.
- `horsepower`: Populate with **average** value.
- `peak-rpm`: Populate with **average** value.
- `price`: **Drop the entry** as this is the value we want to predict.

In [5]:
# Drop rows without a `price` value.
raw_data.dropna(subset=["price"], inplace=True)
# Fill the remaining missing values with a per-column replacement value.
# These columns are still stored as strings, so cast to float before averaging.
replacement_values = {
  "normalized-losses": raw_data["normalized-losses"].astype("float").mean(),
  "num-of-doors": raw_data["num-of-doors"].value_counts().idxmax(),
  "bore": raw_data["bore"].astype("float").mean(),
  "stroke": raw_data["stroke"].astype("float").mean(),
  "horsepower": raw_data["horsepower"].astype("float").mean(),
  "peak-rpm": raw_data["peak-rpm"].astype("float").mean(),
}

# `fillna` with a dict only touches the listed columns.
raw_data.fillna(replacement_values, inplace=True)

### 2.2 Fixing Incorrect Types

Some fields are stored with the wrong type and need to be fixed. We can find them by comparing the attribute types listed at https://archive.ics.uci.edu/dataset/10/automobile with the output of `raw_data.dtypes`.
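
As a complementary check, one way to spot numeric columns that were read as text is to test whether every non-missing value parses as a number. This is a sketch on a hypothetical frame, and `numeric_candidates` is an illustrative name, not part of the notebook.

```python
import pandas as pd

# Hypothetical sample; in the notebook this would be `raw_data`.
sample = pd.DataFrame({
    "horsepower": ["111", "154", "102"],
    "make": ["alfa-romero", "audi", "bmw"],
})

# A text column is a numeric candidate if coercing it to numbers
# introduces no new missing values.
numeric_candidates = [
    column
    for column in sample.columns
    if not pd.api.types.is_numeric_dtype(sample[column])
    and pd.to_numeric(sample[column], errors="coerce").notna().equals(sample[column].notna())
]
print(numeric_candidates)
```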

In [6]:
int_variables = ["normalized-losses"]
float_variables = ["bore", "stroke", "horsepower", "peak-rpm", "price"]

raw_data[int_variables] = raw_data[int_variables].astype("int")
raw_data[float_variables] = raw_data[float_variables].astype("float")

# Verify our changes.
raw_data.dtypes

symboling              int64
normalized-losses      int64
make                  object
fuel-type             object
aspiration            object
num-of-doors          object
body-style            object
drive-wheels          object
engine-location       object
wheel-base           float64
length               float64
width                float64
height               float64
curb-weight            int64
engine-type           object
num-of-cylinders      object
engine-size            int64
fuel-system           object
bore                 float64
stroke               float64
compression-ratio    float64
horsepower           float64
peak-rpm             float64
city-mpg               int64
highway-mpg            int64
price                float64
dtype: object

### 2.3 Standardizing Data

Sometimes it is best to transform data into a common format, such as converting `mpg` (miles per gallon) to `L/100km`. We can either create a new column to preserve the original data or update the column in place.

```python
# Create a new `city-L/100km` variable/column.
raw_data["city-L/100km"] = 235 / raw_data["city-mpg"]
# Update the variable/column in place.
raw_data["highway-mpg"] = 235 / raw_data["highway-mpg"]
raw_data.rename(columns={ "highway-mpg": "highway-L/100km" }, inplace=True)
```
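
The constant 235 used above can be derived from the unit definitions. The sketch below assumes US gallons (the source does not say which gallon is meant); `mpg_to_l_per_100km` is a hypothetical helper, not part of the notebook.

```python
# Exact definitions: 1 US gallon = 3.785411784 L, 1 mile = 1.609344 km.
LITERS_PER_GALLON = 3.785411784
KM_PER_MILE = 1.609344


def mpg_to_l_per_100km(mpg: float) -> float:
    """Convert miles per US gallon to liters per 100 km."""
    return 100 * LITERS_PER_GALLON / (KM_PER_MILE * mpg)


# Converting 1 mpg yields the conversion factor itself (about 235.21),
# which the snippet above rounds to 235.
conversion_factor = mpg_to_l_per_100km(1)
print(round(conversion_factor, 2))
```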

### 2.4 Data Normalization

Normalization is the process of transforming the values of several variables into a similar range (typically between 0 and 1) so that variables measured on larger scales do not dominate the result.

We'll use **Simple Feature Scaling** to rescale the `length`, `width`, and `height` variables. Each value is divided by the column's maximum, so the results fall between 0 and 1, with the largest value mapping to exactly 1.

In [7]:
raw_data["length"] = raw_data["length"] / raw_data["length"].max()
raw_data["width"] = raw_data["width"] / raw_data["width"].max()
raw_data["height"] = raw_data["height"] / raw_data["height"].max()
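
For comparison, the sketch below applies simple feature scaling next to two other common normalization methods, min-max scaling and z-score standardization. The notebook only uses the first, and the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical lengths; in the notebook this would be `raw_data["length"]`.
sample = pd.Series([141.1, 168.8, 176.6, 208.1], name="length")

# Simple feature scaling: divide by the maximum.
simple_scaled = sample / sample.max()

# Min-max scaling: the minimum maps to 0 and the maximum to 1.
min_max_scaled = (sample - sample.min()) / (sample.max() - sample.min())

# Z-score: center on the mean and scale by the standard deviation.
z_scores = (sample - sample.mean()) / sample.std()

print(pd.DataFrame({"simple": simple_scaled, "min-max": min_max_scaled, "z-score": z_scores}))
```

Note that simple feature scaling does not map the smallest value to 0; only min-max scaling covers the full 0 to 1 range.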

### 2.5 Binning

Binning is the process of transforming continuous numeric variables into discrete categorical "bins" for grouped analysis. For example, if we only care about three ranges of `horsepower`, we could create a new "binned" variable as follows:

```python
# Get bin ranges.
# Four equally spaced edges define three equal-width bins.
horsepower_bins = np.linspace(raw_data["horsepower"].min(), raw_data["horsepower"].max(), 4)
bin_names = ["Low", "Medium", "High"]
# Create the new variable.
raw_data["horsepower-binned"] = pd.cut(raw_data["horsepower"], horsepower_bins, labels=bin_names, include_lowest=True)
# View distribution.
print(raw_data["horsepower-binned"].value_counts())
```
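
To see how the bin edges and labels line up, here is a self-contained sketch of equal-width binning on a few hypothetical horsepower values (not taken from the dataset).

```python
import numpy as np
import pandas as pd

# Hypothetical horsepower values.
horsepower = pd.Series([48.0, 70.0, 111.0, 154.0, 200.0, 262.0])

# Four equally spaced edges between the min and max give three bins.
bin_edges = np.linspace(horsepower.min(), horsepower.max(), 4)
labels = ["Low", "Medium", "High"]

# `include_lowest=True` keeps the minimum value inside the first bin.
binned = pd.cut(horsepower, bin_edges, labels=labels, include_lowest=True)
print(binned.value_counts())
```

Without `include_lowest=True`, the smallest value would fall outside the first (left-open) interval and become `NaN`.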

### 2.6 Indicator Variable (Dummy Variable)

An indicator variable is a numerical variable (0 or 1) used to label the categories of a categorical variable. This is helpful for doing regression analysis on a categorical variable. For example, we might want to convert the `fuel-type` categorical variable into indicator variables as follows:

```python
# Create indicator variable & rename columns.
# `dtype=int` yields 0/1 values; recent pandas versions default to booleans.
dummy_fuelType = pd.get_dummies(raw_data["fuel-type"], dtype=int)
dummy_fuelType.rename(columns={ "gas": "fuel-type-gas", "diesel": "fuel-type-diesel" }, inplace=True)
# Merge data and drop original variable.
raw_data = pd.concat([raw_data, dummy_fuelType], axis=1)
raw_data.drop("fuel-type", axis=1, inplace=True)
```
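
The same result can be sketched in a single call by letting `get_dummies` encode the column in place and build the prefixed names itself. The sample frame is hypothetical.

```python
import pandas as pd

# Hypothetical sample; in the notebook this would be `raw_data`.
sample = pd.DataFrame({
    "fuel-type": ["gas", "diesel", "gas"],
    "price": [13495.0, 7099.0, 16500.0],
})

# `columns` selects what to encode and drops the original column;
# `prefix` and `prefix_sep` produce names like `fuel-type-gas`.
encoded = pd.get_dummies(sample, columns=["fuel-type"], prefix="fuel-type", prefix_sep="-", dtype=int)
print(encoded)
```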

# 3. Saving the Cleaned Data

Finally, we save the cleaned dataset for reuse later on.

In [8]:
raw_data.to_csv("./automobile_cleaned.csv", index=False)
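
A quick way to confirm the file was written as expected is to read it back. The sketch below does this round trip on a hypothetical frame in a temporary directory, so it does not touch the real output file.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the cleaned `raw_data`.
sample = pd.DataFrame({"make": ["audi", "bmw"], "price": [13950.0, 16430.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "automobile_cleaned.csv")
    # `index=False` avoids writing the row index as an extra column.
    sample.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

print(reloaded.dtypes)
```

Keep in mind that CSV stores no type information, so a categorical column such as `horsepower-binned` would come back as plain text.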