# Predicting Car Prices using KNN

In this project, we will use the K-nearest neighbors (KNN) algorithm to predict a car's market price from its attributes. 
In this notebook, we prepare the dataset for later use with the KNN algorithm. The dataset contains information on various cars, and its documentation can be found [here](https://archive.ics.uci.edu/ml/datasets/automobile).

### Introduction
#### Importing libraries

In [1]:
import pandas as pd
import numpy as np

#### Reading and previewing the dataset

In [2]:
cols = ['symboling', 'normalized-losses', 'make', 'fuel-type', 'aspiration', 'num-of-doors', 'body-style', 
        'drive-wheels', 'engine-location', 'wheel-base', 'length', 'width', 'height', 'curb-weight', 'engine-type', 
        'num-of-cylinders', 'engine-size', 'fuel-system', 'bore', 'stroke', 'compression-rate', 'horsepower', 'peak-rpm', 
        'city-mpg', 'highway-mpg', 'price']
cars = pd.read_csv('imports-85.data', names=cols)

cars.head()

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-rate,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495
1,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
2,1,?,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500
3,2,164,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950
4,2,164,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450


### Data Preparation

#### Trimming dataset to only numeric columns
First, we make a trimmed copy of the dataset that contains only the columns with continuous numeric values. These columns can later be used as features of our model, with `price` kept as the target.

In [3]:
continuous_values_cols = ['normalized-losses', 'wheel-base', 'length', 'width', 'height', 'curb-weight', 
                          'engine-size', 'bore', 'stroke', 'compression-rate', 'horsepower', 'peak-rpm', 
                          'city-mpg', 'highway-mpg', 'price']

numeric_cars = cars[continuous_values_cols]

numeric_cars.head()

Unnamed: 0,normalized-losses,wheel-base,length,width,height,curb-weight,engine-size,bore,stroke,compression-rate,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,?,88.6,168.8,64.1,48.8,2548,130,3.47,2.68,9.0,111,5000,21,27,13495
1,?,88.6,168.8,64.1,48.8,2548,130,3.47,2.68,9.0,111,5000,21,27,16500
2,?,94.5,171.2,65.5,52.4,2823,152,2.68,3.47,9.0,154,5000,19,26,16500
3,164,99.8,176.6,66.2,54.3,2337,109,3.19,3.4,10.0,102,5500,24,30,13950
4,164,99.4,176.6,66.4,54.3,2824,136,3.19,3.4,8.0,115,5500,18,22,17450


#### Replacing '?' with NaN values

Predictive models generally cannot handle missing values. The preview from the last step shows that the `normalized-losses` column marks missing values with "?", so we replace every "?" with NaN and then convert all columns to float.

In [4]:
numeric_cars = numeric_cars.replace('?', np.nan)
numeric_cars = numeric_cars.astype(float)
numeric_cars.head()

Unnamed: 0,normalized-losses,wheel-base,length,width,height,curb-weight,engine-size,bore,stroke,compression-rate,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,,88.6,168.8,64.1,48.8,2548.0,130.0,3.47,2.68,9.0,111.0,5000.0,21.0,27.0,13495.0
1,,88.6,168.8,64.1,48.8,2548.0,130.0,3.47,2.68,9.0,111.0,5000.0,21.0,27.0,16500.0
2,,94.5,171.2,65.5,52.4,2823.0,152.0,2.68,3.47,9.0,154.0,5000.0,19.0,26.0,16500.0
3,164.0,99.8,176.6,66.2,54.3,2337.0,109.0,3.19,3.4,10.0,102.0,5500.0,24.0,30.0,13950.0
4,164.0,99.4,176.6,66.4,54.3,2824.0,136.0,3.19,3.4,8.0,115.0,5500.0,18.0,22.0,17450.0
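
As an alternative to replacing "?" after loading, pandas can treat the placeholder as missing while it parses the file, via the `na_values` parameter of `read_csv`. The sketch below is not what this notebook does; it uses a small in-memory sample (made-up rows with only two columns) in place of `imports-85.data`.

```python
import io

import numpy as np
import pandas as pd

# Tiny in-memory stand-in for imports-85.data (hypothetical rows, two columns only)
sample = io.StringIO("?,13495\n164,16500\n?,13950\n")

# na_values='?' makes read_csv parse the placeholder as NaN,
# so the column is numeric straight away and no astype(float) is needed
demo = pd.read_csv(sample, names=['normalized-losses', 'price'], na_values='?')

print(demo.dtypes)
print(demo['normalized-losses'].isnull().sum())
```

With this approach, columns that contain "?" load as floats, while columns without missing values keep their integer type.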


#### Dealing with NaN values

In [5]:
numeric_cars.isnull().sum()

normalized-losses    41
wheel-base            0
length                0
width                 0
height                0
curb-weight           0
engine-size           0
bore                  4
stroke                4
compression-rate      0
horsepower            2
peak-rpm              2
city-mpg              0
highway-mpg           0
price                 4
dtype: int64

The `normalized-losses` column has 41 NaN values, about 20% of the rows in the dataset, so we will drop this column entirely. We will also drop every row with a NaN `price`, since this is the column we want to predict.

In [6]:
numeric_cars = numeric_cars.drop('normalized-losses', axis=1)
numeric_cars = numeric_cars.dropna(subset=['price'])
numeric_cars.isnull().sum()

wheel-base          0
length              0
width               0
height              0
curb-weight         0
engine-size         0
bore                4
stroke              4
compression-rate    0
horsepower          2
peak-rpm            2
city-mpg            0
highway-mpg         0
price               0
dtype: int64
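
The share of missing values per column, which motivated dropping `normalized-losses`, can be computed directly instead of by hand. The sketch below uses a hypothetical toy frame, and the 15% cut-off is an assumption chosen for illustration, not a rule taken from this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: column 'a' has 1 missing value out of 5, 'b' has none
toy = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, 4.0, 5.0],
    'b': [10.0, 20.0, 30.0, 40.0, 50.0],
})

# isnull() yields booleans; their mean is the fraction of missing values per column
missing_share = toy.isnull().mean()

# Drop every column whose missing share exceeds the chosen threshold (an assumption)
threshold = 0.15
to_drop = missing_share[missing_share > threshold].index
trimmed = toy.drop(columns=to_drop)

print(missing_share)
print(list(trimmed.columns))
```

Applied to this notebook, `numeric_cars.isnull().mean()` would report the same fractions for the real columns.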

In [7]:
numeric_cars.isna().sum(axis=1).sort_values(ascending=False).head(10)

55     2
56     2
57     2
58     2
130    2
131    2
204    0
68     0
74     0
73     0
dtype: int64

For the other columns, the maximum number of NaN values is 4, less than 2% of the rows. In addition, no row has more than 2 NaN values, and only 6 rows reach that count. Given these observations, we will replace each NaN value with the mean of its column, since dropping the affected columns or rows would discard relevant data.

In [8]:
numeric_cars = numeric_cars.fillna(numeric_cars.mean())
numeric_cars.isnull().sum()

wheel-base          0
length              0
width               0
height              0
curb-weight         0
engine-size         0
bore                0
stroke              0
compression-rate    0
horsepower          0
peak-rpm            0
city-mpg            0
highway-mpg         0
price               0
dtype: int64
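
To make the mean imputation above concrete, here is a minimal sketch on a hypothetical three-row frame. It relies on two pandas behaviors: `DataFrame.mean()` skips NaN by default, and `fillna` with a Series fills each column with the value stored under that column's label.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one gap per column
toy = pd.DataFrame({
    'bore': [3.0, np.nan, 4.0],
    'stroke': [2.0, 3.0, np.nan],
})

# Column means are computed from the observed values only (NaN is skipped)
col_means = toy.mean()

# Each NaN is replaced by the mean of its own column
filled = toy.fillna(col_means)

print(col_means)
print(filled)
```

Because the means come from the observed values, the imputed entries leave each column's mean unchanged.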

#### Data Randomization and Normalization

We shuffle the rows of the dataset and then apply min-max normalization to all columns except the target column, rescaling each feature to the range from 0 to 1. This prevents any single column from having too much of an impact on the distance.

In [9]:
# Randomization
np.random.seed(1)
shuffled_index = np.random.permutation(numeric_cars.index)
numeric_cars = numeric_cars.reindex(shuffled_index)

# Min-max normalization: save the target first, rescale every column to [0, 1],
# then restore the original (unscaled) prices
price_col = numeric_cars['price']
numeric_cars = (numeric_cars - numeric_cars.min())/(numeric_cars.max() - numeric_cars.min())
numeric_cars['price'] = price_col
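
The cell above rescales with the min-max formula, which maps each column to the range from 0 to 1. The sketch below contrasts it with z-score standardization (mean 0, standard deviation 1) and shows why either kind of rescaling matters for a distance-based method. The toy values and the `euclidean` helper are hypothetical and only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: two features on very different scales
toy = pd.DataFrame({
    'curb-weight': [2000.0, 2500.0, 3000.0],
    'bore': [3.0, 3.5, 4.0],
})

# Min-max scaling, as in the cell above: each column ends up in [0, 1]
min_max = (toy - toy.min()) / (toy.max() - toy.min())

# Z-score standardization, for contrast: each column gets mean 0 and std 1
z_score = (toy - toy.mean()) / toy.std()

def euclidean(frame, i, j):
    """Euclidean distance between rows i and j of a DataFrame (by position)."""
    diff = frame.iloc[i] - frame.iloc[j]
    return float(np.sqrt((diff ** 2).sum()))

raw_dist = euclidean(toy, 0, 2)         # almost entirely determined by curb-weight
scaled_dist = euclidean(min_max, 0, 2)  # both features contribute equally

print(raw_dist, scaled_dist)
```

On the raw values, the difference in `bore` is practically invisible next to the difference in `curb-weight`; after scaling, both features contribute equally to the distance.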

#### Exporting CSV file

We export the prepared dataset to a CSV file for later use with the KNN algorithm.

In [11]:
numeric_cars.to_csv("numeric_cars.csv", index=False)
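
Because the export uses `index=False`, the shuffled index labels are not written to the file, but the shuffled row order is preserved. The following sketch checks this with a round trip through an in-memory buffer, using a hypothetical toy frame rather than the real data.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical frame whose index is out of order, as after the shuffling step
toy = pd.DataFrame(
    {'bore': [0.5, 0.0, 1.0], 'price': [16500.0, 13495.0, 13950.0]},
    index=[2, 0, 1],
)

# Write without the index, then read the text back
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored)
```

The restored frame gets a fresh index starting at 0, while its rows stay in the shuffled order.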