# Predicting Car Prices

## Using the K-Nearest Neighbors (KNN) Algorithm to Predict Car Prices

__In this project, a complete machine learning workflow is used to predict a car's market price from its attributes. The data set contains information on various cars. For each car we have technical attributes such as the engine's displacement, the weight of the car, the miles per gallon, the horsepower, and more. You can read more about the data set at <https://archive.ics.uci.edu/ml/datasets/automobile>.__

In [1]:
#import necessary libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.style as style
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import KFold, cross_val_score

In [2]:
#load and preview the dataset

#The raw file has no header row, so header=None keeps the first record as data
cars = pd.read_csv('imports-85.data', header=None)
pd.options.display.max_columns = 99
cars.head()

FileNotFoundError: File b'imports-85.data' does not exist



The raw file has no header row, so the column names do not match those documented at <https://archive.ics.uci.edu/ml/datasets/automobile>. They are assigned manually below.



### Data Cleaning

In [None]:
col_names = ['symboling', 'normalized-losses', 'make', 'fuel-type', 'aspiration', 'num-of-doors', 
'body-style', 'drive-wheels', 'engine-location', 'wheel-base', 'length', 'width', 'height', 
'curb-weight', 'engine-type', 'num-of-cylinders', 'engine-size', 'fuel-system', 'bore', 
'stroke', 'compression-ratio', 'horsepower', 'peak-rpm',  'city-mpg', 'highway-mpg', 'price']

cars.columns = col_names
cars.columns

In [None]:
cars.head()

__From the first five rows above, we see that some entries contain '?' (for example, in the normalized-losses column). For the machine learning algorithm to work, we need to handle these placeholder values.__

In [None]:
#Replace all '?' with nan values in the dataset

cars.replace('?',np.nan,inplace=True)
cars.head()

In [None]:
#View the summary of the dataset for some insights
cars.info()

__Since some columns have NaN values, we will count how many NaN values each column has.__

In [None]:
cars.isna().sum()

In [None]:
#Since price is what we are trying to predict, we drop the rows where price is missing
cars.dropna(subset=['price'], inplace=True)
cars['price'].isna().sum()

__In order to fill the remaining columns with their mean, we need to convert some columns that are numeric but stored as object type. These columns need to be converted to float__

In [None]:
#Convert specific columns to float type
cars[['bore','stroke','horsepower','peak-rpm','normalized-losses','price']] = cars[['bore','stroke',
                                                        'horsepower','peak-rpm','normalized-losses','price']].astype('float64')
cars.info()

In [None]:
#Fill each numeric column that has NaN values with the mean of that column
#(numeric_only=True skips the text columns, which have no mean)
cars.fillna(cars.mean(numeric_only=True), inplace=True)
cars.isna().sum()

In [None]:
#Create a dataset of numeric values only for the prediction
numeric_cars = cars.copy()
continuous_values_cols = ['normalized-losses', 'wheel-base', 'length', 'width', 'height', 'curb-weight', 'bore', 'stroke', 
                          'compression-ratio', 'horsepower', 'peak-rpm', 'city-mpg', 'highway-mpg', 'price']
numeric_cars = numeric_cars[continuous_values_cols]
numeric_cars.isna().sum()

In [None]:
#Preview the numeric dataset after cleaning
numeric_cars.head(7)

## Normalization of Data

All columns apart from the price column, which we want to predict, should be normalized so that each feature contributes approximately proportionately to the distance calculation and therefore to the final result.

In [None]:
#To normalize, we use standardization: (x - mean) / std
#Create a normalized dataframe

normalized_cars = (numeric_cars - numeric_cars.mean()) / numeric_cars.std()
normalized_cars['price'] = numeric_cars['price']
normalized_cars.head()
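
To illustrate why the features are rescaled before fitting a distance-based model, here is a minimal sketch in plain Python. The three cars and their `curb_weight` and `bore` values are made up for illustration and are not taken from the data set. `statistics.stdev` is used because, like pandas' `.std()`, it computes the sample standard deviation.

```python
import math
import statistics

def standardize(values):
    """Return z-scores: (x - mean) / sample standard deviation."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample std (ddof=1), like pandas' .std()
    return [(x - mean) / std for x in values]

# Hypothetical values for three cars A, B, C (assumed, not from the data set)
curb_weight = [2000.0, 2010.0, 2500.0]
bore = [3.0, 4.0, 3.0]

raw = list(zip(curb_weight, bore))
scaled = list(zip(standardize(curb_weight), standardize(bore)))

# Distances from car A to car B and from car A to car C
raw_ab, raw_ac = math.dist(raw[0], raw[1]), math.dist(raw[0], raw[2])
scaled_ab, scaled_ac = math.dist(scaled[0], scaled[1]), math.dist(scaled[0], scaled[2])

print("raw:   ", round(raw_ab, 2), round(raw_ac, 2))
print("scaled:", round(scaled_ab, 2), round(scaled_ac, 2))
```

On the raw values, car B looks far closer to car A than car C does, simply because curb weight is measured in much larger numbers than bore. After standardization the two distances are of similar size, so a large difference in bore counts about as much as a large difference in weight.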

## K-Nearest Neighbors Models

In [None]:
#univariate k-nearest neighbors model

def knn_train_test(train_col, target_col, df):
    
    np.random.seed(1)
    
    # Randomize order of rows in data frame.
    shuffled_index = np.random.permutation(df.index)
    df = df.reindex(shuffled_index)
    
    #split the dataset into two equal parts, i.e Holdout Validation
    v = int((df.shape[0])/2)
    
    train_df = df[:v]
    test_df = df[v:]
    knn = KNeighborsRegressor()
    knn.fit(train_df[[train_col]], train_df[target_col])
    price = knn.predict(test_df[[train_col]])
    
    #Calculate and return the root mean square value
    rmse = np.sqrt(mean_squared_error(test_df[target_col],price))
    return rmse
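
For intuition about what a k-nearest neighbors regressor computes, the sketch below predicts a value by averaging the targets of the `k` closest training points, using Euclidean distance and equal weights. The function `knn_predict` and the toy data are hypothetical and only illustrate the idea; this is not scikit-learn's implementation.

```python
import math

def knn_predict(train_x, train_y, query, k=5):
    """Predict by averaging the targets of the k closest training points."""
    # Pair every training target with its distance to the query point
    distances = sorted(
        (math.dist(x, query), y) for x, y in zip(train_x, train_y)
    )
    nearest = [y for _, y in distances[:k]]
    return sum(nearest) / len(nearest)

# Toy one-feature data (assumed for illustration)
train_x = [(0.0,), (1.0,), (2.0,), (10.0,)]
train_y = [10.0, 12.0, 14.0, 100.0]

print(knn_predict(train_x, train_y, (1.2,), k=3))  # averages 12, 14 and 10
print(knn_predict(train_x, train_y, (1.2,), k=4))  # the distant point now pulls the average up
```

With `k=3` the outlying point at `x = 10` is ignored; with `k=4` it is included and shifts the prediction, which is one reason the choice of `k` affects the error.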

In [None]:
#Predict the price using each column as a feature and see which one performs best
feature = ['normalized-losses', 'wheel-base', 'length', 'width', 'height',
       'curb-weight', 'bore', 'stroke', 'compression-ratio', 'horsepower',
       'peak-rpm', 'city-mpg', 'highway-mpg']

rmse_vals = {}
for i in feature:
    tr_col = i
    tar_col = 'price'
    result = knn_train_test(tr_col, tar_col, normalized_cars)
    rmse_vals[i] = result
    
rmse_vals
rmse_vals_series = pd.Series(rmse_vals)
rmse_vals_series.sort_values()

In [None]:
#Visualize the result
%matplotlib inline
style.use('fivethirtyeight')
rmse_vals_series.plot(kind='bar',
                      label='Price',
                      figsize=(12,5),
                      rot=45,
                      colormap=plt.cm.BuGn_r)
plt.title('Price Prediction Using Individual Features')
plt.grid()
plt.xlabel('Individual features')
plt.show()

__Based on the visualization above, highway miles per gallon gave the lowest root mean square error in predicting price for the default k value.__
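
For reference, the root mean square error computed by `np.sqrt(mean_squared_error(...))` in the function above can be written as follows, where $y_i$ is the actual price of test car $i$, $\hat{y}_i$ is the predicted price, and $n$ is the number of cars in the test set:

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}
$$

Because the price column is left unnormalized, the RMSE is expressed in the same units as the price itself.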

In [None]:
#Predicting price by varying the k value

def knn_train_test(k_value, train_col, target_col, df):
    
    np.random.seed(1)
    
    # Randomize order of rows in data frame.
    shuffled_index = np.random.permutation(df.index)
    df = df.reindex(shuffled_index)
    
    #split the dataset into two equal parts, i.e Holdout Validation
    v = int((df.shape[0])/2)
    
    train_df = df[:v]
    test_df = df[v:]
    knn = KNeighborsRegressor(n_neighbors=k_value)
    knn.fit(train_df[[train_col]], train_df[target_col])
    price = knn.predict(test_df[[train_col]])
    
    #Calculate and return the root mean square value
    rmse = np.sqrt(mean_squared_error(test_df[target_col],price))
    return rmse

In [None]:
#Predict price per feature for each k value
feature = ['normalized-losses', 'wheel-base', 'length', 'width', 'height',
       'curb-weight', 'bore', 'stroke', 'compression-ratio', 'horsepower',
       'peak-rpm', 'city-mpg', 'highway-mpg']

k = [1,3,5,7,9]

#Collect the RMSE values in order: all features for k=1, then all features for k=3, and so on
per_feature = []

for n in k:
    for i in feature:
        result = knn_train_test(n, i, 'price', normalized_cars)
        per_feature.append(result)

per_feature[:13]

In [None]:
#Create a dataframe of rmse values and feature

#Create a dictionary of values with each key assigned the correct values
rmse_vals = {'k_value_1':per_feature[:13],'k_value_3':per_feature[13:26],
             'k_value_5':per_feature[26:39],'k_value_7':per_feature[39:52],
            'k_value_9':per_feature[52:65]}

#Create a dataframe using the dictionary
rmse_vals_df = pd.DataFrame(rmse_vals)

#Create a column that will be used as the index
rmse_vals_df['features'] = ['normalized-losses', 'wheel-base', 'length', 'width', 'height',
       'curb-weight', 'bore', 'stroke', 'compression-ratio', 'horsepower',
       'peak-rpm', 'city-mpg', 'highway-mpg']

#Set the index
rmse_vals_df.set_index('features',inplace=True)

rmse_vals_df.sort_values('k_value_5')

In [None]:
#Visualize the result
rmse_vals_df.plot(figsize=(12,6))
plt.grid()
plt.xlabel('Individual Features')
plt.title('Varying K-Value Used in Predicting Car Prices')

__From the visualization above, we see that among the k values used, 9 seems to be the one that gives the least root mean square error__

## Selecting the Best Features for Predicting

In [None]:
rmse_vals_df.sort_values('k_value_5')

__From the result above, using the default k value, we see the top 5 features that give the least root mean square error. We will look at these features in batches to see the best__

### Best Two Features

In [None]:
#Multivariate k-nearest neighbors model (train_col is now a list of columns)

#Predicting price by varying the k value

def knn_train_test(k_value, train_col, target_col, df):
    
    np.random.seed(1)
    
    # Randomize order of rows in data frame.
    shuffled_index = np.random.permutation(df.index)
    df = df.reindex(shuffled_index)
    
    #split the dataset into two equal parts, i.e Holdout Validation
    v = int((df.shape[0])/2)
    
    train_df = df[:v]
    test_df = df[v:]
    knn = KNeighborsRegressor(n_neighbors=k_value)
    knn.fit(train_df[train_col], train_df[target_col])
    price = knn.predict(test_df[train_col])
    
    #Calculate and return the root mean square value
    rmse = np.sqrt(mean_squared_error(test_df[target_col],price))
    return rmse

In [None]:
#Using the two best features, horse power and highway-mpg
features = ['highway-mpg','horsepower']
k = [1,3,5,7,9]

rmse_two = {}

for i in k:
    p = i
    tr_col = features
    tar_col = 'price'
    result = knn_train_test(p, tr_col, tar_col, normalized_cars)
    rmse_two[i] = result

rmse_two

In [None]:
#Visualize the result
rmse_ser = pd.Series(rmse_two)
rmse_ser.plot(label='Price',figsize=(12,5))
plt.xlabel('K Values')
plt.xticks(np.arange(1, 10, step=2))
plt.title('Predicting Price by Varying K-Value Based on Horse Power and Highway Mpg')
plt.axhline(rmse_ser[5], color='Black', label = 'Default k-value(5)')
plt.legend()
plt.grid()
plt.show()

### Best Three Features

In [None]:
#Using the same function but selecting the best three features

#Using the three best features: highway mpg, horsepower and city mpg
features = ['highway-mpg','horsepower','city-mpg']
k = [1,3,5,7,9]

rmse_three = {}
per_feature = []

for i in k:
    p = i
    tr_col = features
    tar_col = 'price'
    result = knn_train_test(p, tr_col, tar_col, normalized_cars)
    rmse_three[i] = result

rmse_three

In [None]:
#Visualize the result

rmse_ser = pd.Series(rmse_three)
rmse_ser.plot(label='Price',figsize=(12,5))
plt.xlabel('K Values')
plt.xticks(np.arange(1, 10, step=2))
plt.title('Predicting Price by Varying K-Value Based on The best three features')
plt.axhline(rmse_ser[5], color='orange', label = 'Default k-value(5)')
plt.legend()
plt.grid()
plt.show()

### Best Four Features

In [None]:
#Using the four best features: highway mpg, horsepower, city mpg and curb weight
features = ['highway-mpg','horsepower','city-mpg','curb-weight']
k = [1,3,5,7,9]

rmse_four = {}
per_feature = []

for i in k:
    p = i
    tr_col = features
    tar_col = 'price'
    result = knn_train_test(p, tr_col, tar_col, normalized_cars)
    rmse_four[i] = result

rmse_four

In [None]:
#Visualize the result

rmse_ser = pd.Series(rmse_four)
rmse_ser.plot(label='Price',figsize=(12,5))
plt.xlabel('K Values')
plt.xticks(np.arange(1, 10, step=2))
plt.title('Predicting Price by Varying K-Value Based on The best four features')
plt.axhline(rmse_ser[5], color='Purple', label = 'Default k-value(5)')
plt.legend()
plt.grid()
plt.show()

### Best Five Features

In [None]:
#Using the best five features: highway mpg, horsepower, city mpg, curb weight and width
features = ['highway-mpg','horsepower','city-mpg','curb-weight','width']
k = [1,3,5,7,9]

rmse_five = {}
per_feature = []

for i in k:
    p = i
    tr_col = features
    tar_col = 'price'
    result = knn_train_test(p, tr_col, tar_col, normalized_cars)
    rmse_five[i] = result

rmse_five

In [None]:
#Visualize the result

rmse_ser = pd.Series(rmse_five)
rmse_ser.plot(label='Price',figsize=(12,5))
plt.xlabel('K Values')
plt.xticks(np.arange(1, 10, step=2))
plt.title('Predicting Price by Varying K-Value Based on The best five features')
plt.axhline(rmse_ser[5], color='Green', label = 'Default k-value(5)')
plt.legend()
plt.grid()
plt.show()

In [None]:
#Create a dataframe summarizing the result
by_rank = [rmse_two, rmse_three, rmse_four, rmse_five]
by_rank_df = pd.DataFrame(by_rank)

#transpose the data
by_rank_df = by_rank_df.transpose()

#Change the column names
col_names = ['best_two', 'best_three','best_four','best_five']
by_rank_df.columns = col_names

by_rank_df

In [None]:
#Visualize the result

by_rank_df.plot(figsize=(12,6))
plt.xticks(np.arange(1,10,2))
plt.xlabel('Varying K Values')
#plt.grid()
plt.title('Visualizing the Best Set of Features')
plt.axvline(5, color = 'Black', label = 'Default k value')
#Redraw the legend so the vertical line's label is shown
plt.legend()
plt.show()

## Selecting The Best K Value

__From the visualization above, the best two features produced the lowest root mean square error for all the k values tried. We will now analyse which k value is best for this pair of features.__

In [None]:
features = ['highway-mpg','horsepower']

rmse_two = {}

for i in range(1,25):
    p = i
    tr_col = features
    tar_col = 'price'
    result = knn_train_test(p, tr_col, tar_col, normalized_cars)
    rmse_two[i] = result

rmse_two

In [None]:
#Visualize the result

pd.Series(rmse_two).plot(label='Price',figsize=(12,5))
plt.xlabel('K Values')
#k runs from 1 to 24, so the ticks cover the same range
plt.xticks(np.arange(1, 25, step=1))
plt.title('Varying K-Values Based on Best Two Features')
plt.legend()
plt.grid()
plt.show()

__From the visualization above, it seems the k value of 10 gives the best result__

## K-Fold and Cross Validation Score

__The validation process will now be automated using scikit-learn's built-in `KFold` and `cross_val_score`.__

In [None]:
for fold in range(2,25):
    kf = KFold(fold, shuffle=True, random_state=1)
    model = KNeighborsRegressor()
    
    #Select the best two features
    mses = cross_val_score(model, normalized_cars[["horsepower","highway-mpg"]], normalized_cars["price"],
                           scoring="neg_mean_squared_error", cv=kf)
    rmses = np.sqrt(np.absolute(mses))
    avg_rmse = np.mean(rmses)
    std_rmse = np.std(rmses)
    
    #Print the average rmse and standard rmse
    print(str(fold), "folds: ", "avg RMSE: ", str(avg_rmse), ",", "std RMSE: ", str(std_rmse))
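
As a rough picture of how k-fold cross-validation partitions the rows, the sketch below splits row indices into consecutive folds and uses each fold once as the test set. The helper `kfold_indices` is hypothetical and simplified: it does not shuffle, whereas the cell above passes `shuffle=True` so the rows are permuted before being split.

```python
def kfold_indices(n_samples, n_folds):
    """Yield (train_indices, test_indices) for consecutive, unshuffled folds."""
    indices = list(range(n_samples))
    # When n_samples is not divisible by n_folds, the first folds get one extra row
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        stop = start + base + (1 if fold < extra else 0)
        test = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, test
        start = stop

splits = list(kfold_indices(10, 3))
for train, test in splits:
    print("train:", train, "test:", test)
```

Every row appears in exactly one test fold, so each row is used for validation once and for training in all the other rounds.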

So far, we've been working under the assumption that a lower RMSE always means that a model is more accurate. This isn't the complete picture, unfortunately. A model has two sources of error, bias and variance.

Bias describes error that results from bad assumptions about the learning algorithm. For example, assuming that only one feature, like a car's weight, relates to a car's fuel efficiency will lead you to fit a simple, univariate regression model that will result in high bias. The error rate will be high since a car's fuel efficiency is affected by many other factors besides just its weight.

Variance describes error that occurs because of the variability of a model's predicted values. If we were given a dataset with 1000 features on each car and used every single feature to train an incredibly complicated multivariate regression model, we would have low bias but high variance. In an ideal world, we want low bias and low variance, but in reality there's always a tradeoff.

The standard deviation of the RMSE values across folds can be a proxy for a model's variance, while the average RMSE is a proxy for a model's bias. Bias and variance are the two observable sources of error in a model that we can indirectly control.
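
The two proxies can be computed by hand, as in the minimal sketch below. It uses plain Python and synthetic data (`y = 3x` plus Gaussian noise, an assumption made purely for illustration) rather than the car data set, and the helpers `knn_predict` and `kfold_rmses` are hypothetical. `statistics.pstdev` is used because `np.std` also computes the population standard deviation by default.

```python
import math
import random
import statistics

def knn_predict(train, query_x, k):
    """Average the targets of the k training points closest to query_x."""
    nearest = sorted(train, key=lambda row: abs(row[0] - query_x))[:k]
    return sum(y for _, y in nearest) / len(nearest)

def kfold_rmses(rows, n_folds, k):
    """Return one RMSE per fold for a one-feature KNN regressor."""
    # Rows are already in random order, so strided slices give random folds
    folds = [rows[i::n_folds] for i in range(n_folds)]
    rmses = []
    for i, test in enumerate(folds):
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        squared_errors = [(knn_predict(train, x, k) - y) ** 2 for x, y in test]
        rmses.append(math.sqrt(sum(squared_errors) / len(squared_errors)))
    return rmses

# Synthetic data (assumed): a linear trend with noise
rng = random.Random(1)
rows = []
for _ in range(100):
    x = rng.uniform(0, 10)
    rows.append((x, 3 * x + rng.gauss(0, 2)))

for k in (1, 5, 25):
    rmses = kfold_rmses(rows, n_folds=5, k=k)
    avg_rmse = statistics.mean(rmses)     # proxy for bias
    std_rmse = statistics.pstdev(rmses)   # proxy for variance
    print(k, round(avg_rmse, 2), round(std_rmse, 2))
```

Comparing the two numbers across values of `k` (or across fold counts, as in the cell above) gives a quick, informal view of the tradeoff: a configuration is attractive when both the average and the spread of the fold RMSEs are small.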

# Thank you for Reading