# **Car Prices Prediction**


  ## **Introduction**

  This project trains and compares several regression models that predict the price of a used car from the following characteristics:

  * **Manufacturer**;
  * **Model**;
  * **Year**;
  * **Transmission**;
  * **Mileage**;
  * **Type of Fuel**;
  * **Miles per Gallon**;
  * **Engine Size**.

  The dataset contains about 40,000 examples of used cars sold in the UK and is provided as the CSV file 'car_prices.csv'.

  **N.B.:** The notebook loads the dataset from Google Drive, so to run it you must first upload the dataset to your own Drive.



## **Libraries**

The main libraries used are:
* **tensorflow**: enables the development and training of machine learning models, with a focus on deep neural networks (DNNs);
* **keras**: used to simplify building and training the model;
* **sklearn**: supports supervised and unsupervised learning and provides tools for preprocessing, model selection, model evaluation and many other utilities;
* **pandas**: Python library that provides data structures and functions to manipulate data efficiently;
* **numpy**: Python library that provides data structures and functions to work efficiently on large N-dimensional numerical arrays;
* **seaborn** and **matplotlib**: libraries used to visualize data and build graphs.

In [1]:
import tensorflow as tf
from tensorflow import keras

from keras.models import Sequential
from keras.layers import Dense, Activation, Flatten, Dropout

from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer

import seaborn as sb
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import warnings

#Silence all warnings (DeprecationWarning included) to keep the notebook output readable
warnings.filterwarnings('ignore')

## **Keras Tuner**

**Keras Tuner** is installed to automate the search for the best hyperparameter values of a model, which reduces the manual work needed to explore different hyperparameter configurations.

In [2]:
pip install -q -U keras-tuner


## **Utility Function**

The following function counts the predictions whose absolute error exceeds a given threshold (10,000 by default); these predictions are treated as outliers.

In [3]:
def outliears_count(y_true, y_pred, threshold = 10000):
    """Count the predictions whose absolute error exceeds `threshold`."""

    #Absolute error of each prediction
    residuals = np.abs(y_true - y_pred)

    #Positions of the predictions that are off by more than the threshold
    outlier_indices = np.where(residuals > threshold)[0]

    outliers = len(outlier_indices)

    return outliers
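
As a quick check of what the function counts, the sketch below applies the same thresholding logic to four made-up prices. The helper name `count_large_errors` and all the numbers are illustrative assumptions, not values taken from the dataset.

```python
import numpy as np

def count_large_errors(y_true, y_pred, threshold=10000):
    # Same logic as the notebook's function: count absolute errors above the threshold
    residuals = np.abs(np.asarray(y_true) - np.asarray(y_pred))
    return int(np.sum(residuals > threshold))

# Made-up true and predicted prices (illustrative values only)
y_true = np.array([10000, 25000, 8000, 40000])
y_pred = np.array([12000, 11000, 8500, 52000])

# The absolute errors are 2000, 14000, 500 and 12000
large_errors = count_large_errors(y_true, y_pred)
print(large_errors)
```

Lowering the threshold makes the count stricter, so the value passed as `threshold` should be chosen with the price scale of the dataset in mind.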

## **Loading the Dataset**

Load the dataset from Google Drive.

In [4]:
from google.colab import drive
drive.mount('/content/drive', force_remount = True)
dataset_path = 'drive/MyDrive/projectML/car_prices.csv'
data = pd.read_csv(dataset_path)
#data.head()

Mounted at /content/drive


Once the dataset is loaded, the price column, which is the target to predict, is separated from the features.

In [5]:
#Dataset without price column
dataset = data.drop('price', axis = 1)
#dataset.head()

In [6]:
#The price column to predict
prices = data['price']
#prices.head()

## **Dataset Preprocessing**

Before the dataset can be used, it must be checked for missing (NaN) values so that the models are trained on "clean" data.


In [7]:
null_dataset = dataset.isnull().values.any()
null_prices = prices.isnull().values.any()

print('There are invalid entries inside the dataset.' if null_dataset else 'There are not invalid entries inside the dataset.')
print('There are invalid entries inside prices column.' if null_prices else 'There are not invalid entries inside prices column.')

There are not invalid entries inside the dataset.
There are not invalid entries inside prices column.
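
The check above only reports whether any value is missing. If it did report missing values, a per-column count would show where they are. The following is a minimal sketch on a toy frame; the values are hypothetical and only the column names mirror the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; two entries are deliberately missing
toy = pd.DataFrame({
    'year': [2017, 2019, 2016],
    'mpg': [55.4, np.nan, 60.1],
    'fuelType': ['Petrol', 'Diesel', None],
})

# Number of missing entries in each column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Rows that contain at least one missing entry
rows_with_missing = toy[toy.isnull().any(axis=1)]
print(len(rows_with_missing))
```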


Consider the unique values present within the features of the dataset.

In [8]:
manufacturer_unique_values = dataset['manufacturer'].unique()
model_unique_values = dataset['model'].unique()
transmission_unique_values = dataset['transmission'].unique()
type_of_fuel_unique_values = dataset['fuelType'].unique()

print("/---------------------------------------------------------------------------------------/\n")
print("Manufacturer Values:")
print(manufacturer_unique_values, "\n")
print("/---------------------------------------------------------------------------------------/\n")
print("Model Values:")
print(model_unique_values, "\n")
print("/---------------------------------------------------------------------------------------/\n")
print("Transmission Values:")
print(transmission_unique_values, "\n")
print("/---------------------------------------------------------------------------------------/\n")
print("Type of Fuel Values:")
print(type_of_fuel_unique_values, "\n")
print("/---------------------------------------------------------------------------------------/\n")

/---------------------------------------------------------------------------------------/

Manufacturer Values:
['Audi' 'BMW' 'Toyota' 'Mercedes'] 

/---------------------------------------------------------------------------------------/

Model Values:
[' A1' ' A6' ' A4' ' A3' ' Q3' ' Q5' ' A5' ' S4' ' Q2' ' A7' ' TT' ' Q7'
 ' RS6' ' RS3' ' A8' ' Q8' ' RS4' ' RS5' ' R8' ' SQ5' ' S8' ' SQ7' ' S3'
 ' S5' ' A2' ' RS7' ' 5 Series' ' 6 Series' ' 1 Series' ' 7 Series'
 ' 2 Series' ' 4 Series' ' X3' ' 3 Series' ' X5' ' X4' ' i3' ' X1' ' M4'
 ' X2' ' X6' ' 8 Series' ' Z4' ' X7' ' M5' ' i8' ' M2' ' M3' ' M6' ' Z3'
 ' GT86' ' Corolla' ' RAV4' ' Yaris' ' Auris' ' Aygo' ' C-HR' ' Prius'
 ' Avensis' ' Verso' ' Hilux' ' PROACE VERSO' ' Land Cruiser' ' Supra'
 ' Camry' ' Verso-S' ' IQ' ' Urban Cruiser' ' SLK' ' S Class' ' SL CLASS'
 ' G Class' ' GLE Class' ' GLA Class' ' A Class' ' B Class' ' GLC Class'
 ' C Class' ' E Class' ' GL Class' ' CLS Class' ' CLC Class' ' CLA Class'
 ' V Class' ' M Class' 

The features of the dataset can be divided into two categories:

* **categorical**: values that represent distinct, discrete attributes; in this dataset the categorical features are manufacturer, model, transmission and type of fuel;
* **numerical**: values on which mathematical operations can be performed; in this dataset the numerical features are year, mileage, miles per gallon and engine size.

In [9]:
numerical_features = ['year', 'milage', 'mpg', 'engineSize']
categorical_features = ['manufacturer', 'model', 'transmission', 'fuelType']


Machine learning and deep learning models require all inputs and outputs to be numerical, so the categorical features must be converted to numbers. For this we use **One-Hot Encoding**, which replaces each categorical column with one binary column per category.

In this notebook we use sklearn's OneHotEncoder.

In [10]:
#Dataset with only categorical values
categorical_dataset = dataset[categorical_features]

#Initialize the encoder (scikit-learn >= 1.2 uses sparse_output; older versions use sparse = False)
encoder = OneHotEncoder(handle_unknown = 'ignore', sparse_output = False)

#Transformation of categorical features
dataset_encoded = pd.DataFrame(encoder.fit_transform(categorical_dataset), columns = encoder.get_feature_names_out(categorical_features))

#Update the dataset
dataset = pd.concat([dataset.drop(categorical_features, axis = 1), dataset_encoded], axis = 1)

#dataset.describe()
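
To make the effect of one-hot encoding concrete, here is a small NumPy-only sketch of the idea. It is not sklearn's implementation: the helper `one_hot` and the fuel values are assumptions for illustration. The last input is a category that was not among the known ones, and it comes out as an all-zero row, which is the behaviour that `handle_unknown = 'ignore'` requests from the encoder.

```python
import numpy as np

def one_hot(values, categories):
    # One column per known category; 1.0 where the value matches that category
    values = np.asarray(values)[:, None]
    categories = np.asarray(categories)[None, :]
    return (values == categories).astype(float)

# Categories "learned" from a toy training column (illustrative values)
categories = np.array(['Diesel', 'Hybrid', 'Petrol'])

# 'Electric' is not a known category, so its row contains only zeros
encoded = one_hot(['Petrol', 'Diesel', 'Electric'], categories)
print(encoded)
```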

Now we move on to splitting the dataset. To avoid data leakage and get a reliable evaluation of model performance, we split the data first and only then standardize it: the scaler is fitted on the training set alone and then applied to the training, validation and test sets, so that the training features have zero mean and unit variance.
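
The idea of fitting the scaling statistics on the training portion only can be illustrated with plain NumPy. This is a sketch with made-up year and mileage values, not the notebook's data; it uses the population standard deviation, which is also what `StandardScaler` uses.

```python
import numpy as np

# Toy feature matrices (illustrative values): columns are year and mileage
train = np.array([[2015.0, 30000.0], [2017.0, 20000.0], [2019.0, 10000.0]])
test = np.array([[2018.0, 15000.0]])

# The statistics are computed on the training data only
mean = train.mean(axis=0)
std = train.std(axis=0)

# The same statistics are then applied to every split
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled)
```

Because the test rows never contribute to `mean` and `std`, nothing about the held-out data influences the preprocessing. The scaled test set will generally not have exactly zero mean or unit variance, and that is expected.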

In [11]:
#Split dataset into training set and test set
dataset_train, dataset_test, prices_train, prices_test = train_test_split(dataset, prices, test_size = 0.2, random_state = 0)

#Split training set into training and validation sets
dataset_training, dataset_validation, prices_training, prices_validation = train_test_split(dataset_train, prices_train, test_size = 0.2, random_state = 0)

scaler = StandardScaler()

#Standardize data to zero mean and unit variance
dataset_training_scaled = scaler.fit_transform(dataset_training)
dataset_validation_scaled = scaler.transform(dataset_validation)
dataset_test_scaled = scaler.transform(dataset_test)

print("Train Size: ", len(dataset_training_scaled))
print("Validation Size: ", len(dataset_validation_scaled))
print("Test Size:", len(dataset_test_scaled))

Train Size:  26435
Validation Size:  6609
Test Size: 8262
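
These sizes follow from applying a 20% hold-out twice, which leaves roughly 64% of the rows for training, 16% for validation and 20% for testing. The arithmetic can be reproduced with the sketch below, which assumes that the held-out part is rounded up and that the total is the sum of the three printed sizes; `split_sizes` is a hypothetical helper.

```python
import math

def split_sizes(n_samples, held_out_fraction):
    # Assumption: the held-out part is rounded up, the remainder is kept
    n_held_out = math.ceil(n_samples * held_out_fraction)
    return n_samples - n_held_out, n_held_out

total = 26435 + 6609 + 8262  # sum of the three sizes printed above

# First split: hold out the test set
n_train_val, n_test = split_sizes(total, 0.2)

# Second split: hold out the validation set from what remains
n_training, n_validation = split_sizes(n_train_val, 0.2)

print(n_training, n_validation, n_test)
print(round(n_training / total, 2), round(n_validation / total, 2), round(n_test / total, 2))
```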


## **Neural Network**

The first model we are going to consider is a neural network.

As indicated earlier, hyperparameter tuning is done with Keras Tuner, so that the choice of hyperparameter values is automated.

The hyperparameters to be determined are:
* **Weight Initializer**;
* **Number of Layers**;
* **Number of Units in each Layer**;
* **Learning Rate**.
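
To give an idea of what a tuner automates, the sketch below draws a few random configurations from a search space over these four hyperparameters. It is plain Python and does not use the Keras Tuner API; the candidate values and the names `SEARCH_SPACE` and `sample_configuration` are assumptions chosen for illustration. In a real search, each sampled configuration would be used to build and train a model, and the one with the best validation score would be kept.

```python
import random

# Hypothetical search space: every candidate value is an illustrative assumption
SEARCH_SPACE = {
    'weight_initializer': ['glorot_uniform', 'he_normal', 'he_uniform'],
    'num_layers': [1, 2, 3, 4],
    'units': [32, 64, 128, 256],
    'learning_rate': [1e-2, 1e-3, 1e-4],
}

def sample_configuration(rng):
    # Pick the depth first, then one unit count for each layer
    num_layers = rng.choice(SEARCH_SPACE['num_layers'])
    return {
        'weight_initializer': rng.choice(SEARCH_SPACE['weight_initializer']),
        'num_layers': num_layers,
        'units': [rng.choice(SEARCH_SPACE['units']) for _ in range(num_layers)],
        'learning_rate': rng.choice(SEARCH_SPACE['learning_rate']),
    }

rng = random.Random(0)  # fixed seed so the draws are reproducible
trials = [sample_configuration(rng) for _ in range(5)]

for trial in trials:
    print(trial)
```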