# Car Price Prediction

# 1. Importing the libraries

A Python library is a collection of related modules. It contains bundles of code that can be reused across different programs, which makes programming in Python simpler and more convenient because we don't need to write the same code again and again.

In this notebook, we will be using the following libraries.

In [None]:
### Data Wrangling 

import numpy as np
import pandas as pd
import missingno
from collections import Counter
from collections import OrderedDict

### Data Visualization

import matplotlib.pyplot as plt
import seaborn as sns

### Data Preprocessing

import statsmodels.api as sm
from scipy import stats

### Modelling 

from sklearn.model_selection import train_test_split
from math import sqrt
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import StackingRegressor
import xgboost as xg

### Tabulating the results

from tabulate import tabulate

### Remove unnecessary warnings

import warnings
warnings.filterwarnings('ignore')

# 2. Importing the data

In this section, I will fetch the dataset that is available in the Data section of the Kaggle project description.

The training set should be used to build your machine learning models. For the training set, we provide the outcome (also known as the “ground truth”) for each car. Your model will be based on “features” like Manufacturer, Model, Category, Mileage, Cylinders, etc. You can also use feature engineering to create new features.

The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth. It is your job to predict these outcomes. For each car, our task is to predict the price of the car.

In [None]:
### Fetching the dataset

dataset = pd.read_csv('../input/car-price-prediction-challenge/car_price_prediction.csv')

In [None]:
### Looking at the sample data in the dataset

dataset.head(10)

In [None]:
### Shape of the dataset

dataset.shape

The training dataset consists of 18 columns and 19237 rows.

# 3. Exploratory Data Analysis

Exploratory Data Analysis refers to the critical process of performing initial investigations on data so as to discover patterns, spot anomalies, test hypotheses, and check assumptions with the help of summary statistics and graphical representations.

Here, we will perform EDA on the categorical columns of the dataset - Manufacturer, Category, Leather interior, Fuel type, Gear box type, Drive wheels, Doors, Wheel, Color and the numerical columns of the dataset - Price, Levy, Prod. year, Engine volume, Mileage, Cylinders, Airbags.

# 3.1 Datatypes, Missing Data, and Summary Statistics

In [None]:
### Looking at the datatypes of the dataset

dataset.info()

Here, the columns - Manufacturer, Model, Category, Leather interior, Fuel type, Gear box type, Drive wheels, Doors, Wheel, Color are categorical. Hence, we modify the datatype of these columns to category.

In [None]:
### Modifying the datatypes of the columns to category

dataset.Manufacturer = dataset.Manufacturer.astype('category')
dataset.Model = dataset.Model.astype('category')
dataset.Category = dataset.Category.astype('category')
dataset["Leather interior"] = dataset["Leather interior"].astype('category')
dataset["Fuel type"] = dataset["Fuel type"].astype('category')
dataset["Gear box type"] = dataset["Gear box type"].astype('category')
dataset["Drive wheels"] = dataset["Drive wheels"].astype('category')
dataset.Doors = dataset.Doors.astype('category')
dataset.Wheel = dataset.Wheel.astype('category')
dataset.Color = dataset.Color.astype('category')
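
The same conversion can also be written as one call over a list of column names. The sketch below is only an illustration: `toy` is a hypothetical stand-in for the real dataset and its values are made up.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset (values are made up).
toy = pd.DataFrame({
    "Manufacturer": ["TOYOTA", "HONDA", "TOYOTA"],
    "Fuel type": ["Petrol", "Hybrid", "Petrol"],
    "Price": [12000, 15000, 9000],
})

categorical_columns = ["Manufacturer", "Fuel type"]

# astype accepts a {column: dtype} mapping, so all listed columns convert in one call.
toy = toy.astype({column: "category" for column in categorical_columns})

print(toy.dtypes)
```

Keeping the column names in a list makes it easier to see at a glance which columns are treated as categorical.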

Looking at the modified datatypes of the columns in the dataset.

In [None]:
### Looking at the modified datatypes of the dataset

dataset.info()

From the above output, it appears that there are no missing values in the dataset. However, from the initial look at the data, we can see that missing values are represented by '-' rather than NaN, so info() does not count them.

In [None]:
### Missing values (-) in the dataset

print('Missing values in the dataset:\n')
for each_column in dataset.columns:
    print('Column: {} - {}'.format(each_column, list(dataset[each_column]).count('-')))

From the above output, we can see that there are missing values in the column - Levy.

In [None]:
### Replacing '-' with NaN in the column - Levy

dataset['Levy'] = [np.nan if value == '-' else float(value) for value in dataset['Levy']]
dataset['Levy'].isnull().sum()
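
As an alternative sketch, `pd.to_numeric` with `errors="coerce"` performs the same replacement in one step. The Levy values below are hypothetical and only illustrate the behaviour.

```python
import pandas as pd

# Hypothetical Levy values; '-' marks a missing entry, as in the dataset.
levy = pd.Series(["1399", "-", "862", "-"])

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
levy_numeric = pd.to_numeric(levy, errors="coerce")

print(levy_numeric.isnull().sum())
```

One caveat of this approach: any other malformed value would also silently become NaN, so the explicit comparison with '-' used above is stricter.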

In [None]:
### Visual representation of the missing data in the dataset

missingno.matrix(dataset)

The matrix above confirms that Levy is the only column with missing values.

In [None]:
### Summary statistics of the numerical columns in the dataset

dataset.describe()

# 3.2 Feature Analysis

# 3.2.1 Categorical variable - Category

In [None]:
### Value counts of the column - Category

category_count = dataset['Category'].value_counts(dropna = False)
category_count

In [None]:
### Bar graph showing the value counts of the column - Category

plt.figure(figsize = (12, 6))
sns.barplot(x = category_count.index, y = category_count.values, alpha = 0.8)
plt.title('Bar graph showing the value counts of the column - Category')
plt.ylabel('Number of Occurrences', fontsize = 12)
plt.xlabel('Category', fontsize = 12)
plt.show()

From the above graph, we can see that there is insufficient data for the Categories - Cabriolet, Coupe, Goods wagon, Limousine, Microbus, Minivan, Pickup, Universal.

In [None]:
### Mean price per each Category 

mean_price_category = dataset[['Category', 'Price']].groupby('Category', as_index = False).mean()
mean_price_category

In [None]:
### Mean Price for each Category

plt.figure(figsize = (12, 6))
sns.barplot(x = mean_price_category['Category'], y = mean_price_category['Price'], alpha = 0.8)
plt.title('Mean Sales Price for each Category')
plt.ylabel('Mean Price', fontsize = 12)
plt.xlabel('Category', fontsize = 12)
plt.show()

From the above graph, we can see that the mean price is roughly the same across the categories. However, since several categories have insufficient data, we have to modify the column in order to get better information from it.

# 3.2.2 Categorical variable - Leather interior

In [None]:
### Value counts of the column - Leather interior

interior_count = dataset['Leather interior'].value_counts(dropna = False)
interior_count

In [None]:
### Bar graph showing the value counts of the column - Leather interior

sns.barplot(x = interior_count.index, y = interior_count.values, alpha = 0.8)
plt.title('Bar graph showing the value counts of the column - Leather interior')
plt.ylabel('Number of Occurrences', fontsize = 12)
plt.xlabel('Leather interior', fontsize = 12)
plt.show()

From the above graph, we can see that most of the cars have a leather interior.

In [None]:
### Mean price per each Leather interior 

mean_price_interior = dataset[['Leather interior', 'Price']].groupby('Leather interior', as_index = False).mean()
mean_price_interior

In [None]:
### Mean Price for each Leather interior

sns.barplot(x = mean_price_interior['Leather interior'], y = mean_price_interior['Price'], alpha = 0.8)
plt.title('Mean Sales Price for each Leather interior')
plt.ylabel('Mean Price', fontsize = 12)
plt.xlabel('Leather interior', fontsize = 12)
plt.show()

From the above graph, we can see that the mean price of the car is about the same in both cases.

# 3.2.3 Categorical variable - Fuel type

In [None]:
### Value counts of the column - Fuel type

fuel_count = dataset['Fuel type'].value_counts(dropna = False)
fuel_count

In [None]:
### Bar graph showing the value counts of the column - Fuel type

plt.figure(figsize = (7, 5))
sns.barplot(x = fuel_count.index, y = fuel_count.values, alpha = 0.8)
plt.title('Bar graph showing the value counts of the column - Fuel type')
plt.ylabel('Number of Occurrences', fontsize = 12)
plt.xlabel('Fuel type', fontsize = 12)
plt.show()

From the above graph, we can see that there is insufficient data for the fuel types - Hydrogen and Plug-in Hybrid.

In [None]:
### Mean price per each Fuel type

mean_price_fuel = dataset[['Fuel type', 'Price']].groupby('Fuel type', as_index = False).mean()
mean_price_fuel

In [None]:
### Mean Price for each Fuel type

plt.figure(figsize = (7, 5))
sns.barplot(x = mean_price_fuel['Fuel type'], y = mean_price_fuel['Price'], alpha = 0.8)
plt.title('Mean Sales Price for each Fuel type')
plt.ylabel('Mean Price', fontsize = 12)
plt.xlabel('Fuel type', fontsize = 12)
plt.show()

From the above graph, we can see that the mean price of the cars for every fuel type is different.

# 3.2.4 Categorical variable - Gear box type

In [None]:
### Value counts of the column - Gear box type

gear_count = dataset['Gear box type'].value_counts(dropna = False)
gear_count

In [None]:
### Bar graph showing the value counts of the column - Gear box type

sns.barplot(x = gear_count.index, y = gear_count.values, alpha = 0.8)
plt.title('Bar graph showing the value counts of the column - Gear box type')
plt.ylabel('Number of Occurrences', fontsize = 12)
plt.xlabel('Gear box type', fontsize = 12)
plt.show()

From the above graph, we can see that most of the cars have an automatic gear box type.

In [None]:
### Mean price per each Gear box type

mean_price_gear = dataset[['Gear box type', 'Price']].groupby('Gear box type', as_index = False).mean()
mean_price_gear

In [None]:
### Mean Price for each Gear box type

sns.barplot(x = mean_price_gear['Gear box type'], y = mean_price_gear['Price'], alpha = 0.8)
plt.title('Mean Sales Price for each Gear box type')
plt.ylabel('Mean Price', fontsize = 12)
plt.xlabel('Gear box type', fontsize = 12)
plt.show()

From the above graph, we can see that cars with Automatic and Variator gear boxes have similar mean prices. Similarly, cars with Manual and Tiptronic gear boxes have similar mean prices.

# 3.2.5 Categorical variable - Drive wheels 

In [None]:
### Value counts of the column - Drive wheels

drive_count = dataset['Drive wheels'].value_counts(dropna = False)
drive_count

In [None]:
### Bar graph showing the value counts of the column - Drive wheels

sns.barplot(x = drive_count.index, y = drive_count.values, alpha = 0.8)
plt.title('Bar graph showing the value counts of the column - Drive wheels')
plt.ylabel('Number of Occurrences', fontsize = 12)
plt.xlabel('Drive wheels', fontsize = 12)
plt.show()

From the above graph, we can see that most of the cars have Front drive wheels.

In [None]:
### Mean price per each Drive wheels

mean_price_drive = dataset[['Drive wheels', 'Price']].groupby('Drive wheels', as_index = False).mean()
mean_price_drive

In [None]:
### Mean Price for each Drive wheels

sns.barplot(x = mean_price_drive['Drive wheels'], y = mean_price_drive['Price'], alpha = 0.8)
plt.title('Mean Sales Price for each Drive wheels')
plt.ylabel('Mean Price', fontsize = 12)
plt.xlabel('Drive wheels', fontsize = 12)
plt.show()

From the above graph, we can see that the mean sales price of all the drive wheels is similar.

# 3.2.6 Categorical variable - Doors

In [None]:
### Value counts of the column - Doors

doors_count = dataset['Doors'].value_counts(dropna = False)
doors_count

In [None]:
### Bar graph showing the value counts of the column - Doors

sns.barplot(x = doors_count.index, y = doors_count.values, alpha = 0.8)
plt.title('Bar graph showing the value counts of the column - Doors')
plt.ylabel('Number of Occurrences', fontsize = 12)
plt.xlabel('Doors', fontsize = 12)
plt.show()

Here in the above graph, the value 02-Mar means 2-3 doors and 04-May means 4-5 doors. We will replace these values during the Data preprocessing phase.

In [None]:
### Mean price per each Doors

mean_price_doors = dataset[['Doors', 'Price']].groupby('Doors', as_index = False).mean()
mean_price_doors

In [None]:
### Mean Price for each Doors

sns.barplot(x = mean_price_doors['Doors'], y = mean_price_doors['Price'], alpha = 0.8)
plt.title('Mean Sales Price for each Doors')
plt.ylabel('Mean Price', fontsize = 12)
plt.xlabel('Doors', fontsize = 12)
plt.show()

From the above graph, we can see that cars having 2-3 doors have a higher mean price.

# 3.2.7 Categorical variable - Wheel

In [None]:
### Value counts of the column - Wheel

wheel_count = dataset['Wheel'].value_counts(dropna = False)
wheel_count

In [None]:
### Bar graph showing the value counts of the column - Wheel

sns.barplot(x = wheel_count.index, y = wheel_count.values, alpha = 0.8)
plt.title('Bar graph showing the value counts of the column - Wheel')
plt.ylabel('Number of Occurrences', fontsize = 12)
plt.xlabel('Wheel', fontsize = 12)
plt.show()

From the above graph, it is evident that most of the cars in the dataset have left-hand steering.

In [None]:
### Mean price per each Wheel

mean_price_wheel = dataset[['Wheel', 'Price']].groupby('Wheel', as_index = False).mean()
mean_price_wheel

In [None]:
### Mean Price for each Wheel

sns.barplot(x = mean_price_wheel['Wheel'], y = mean_price_wheel['Price'], alpha = 0.8)
plt.title('Mean Sales Price for each Wheel')
plt.ylabel('Mean Price', fontsize = 12)
plt.xlabel('Wheel', fontsize = 12)
plt.show()

From the above graph, we can see that cars with left-hand steering have a higher mean price than their right-hand counterparts.

# 3.2.8 Categorical variable - Color

In [None]:
### Value counts of the column - Color

color_count = dataset['Color'].value_counts(dropna = False)
color_count

In [None]:
### Bar graph showing the value counts of the column - Color

plt.figure(figsize = (14, 5))
sns.barplot(x = color_count.index, y = color_count.values, alpha = 0.8)
plt.title('Bar graph showing the value counts of the column - Color')
plt.ylabel('Number of Occurrences', fontsize = 12)
plt.xlabel('Color', fontsize = 12)
plt.show()

From the above graph, we can see that most car colors have insufficient data.

In [None]:
### Mean price per each Color

mean_price_color = dataset[['Color', 'Price']].groupby('Color', as_index = False).mean()
mean_price_color

In [None]:
### Mean Price for each Color

plt.figure(figsize = (14, 5))
sns.barplot(x = mean_price_color['Color'], y = mean_price_color['Price'], alpha = 0.8)
plt.title('Mean Sales Price for each Color')
plt.ylabel('Mean Price', fontsize = 12)
plt.xlabel('Color', fontsize = 12)
plt.show()

From the above graph, we can see that most of the colors have a similar mean price.

# 3.2.9 Numerical variable - Price

In [None]:
### Understanding the distribution of the column - Price

sns.distplot(dataset['Price'], label = 'Skewness: %.2f'%(dataset['Price'].skew()))
plt.legend(loc = 'best')
plt.title('Price Distribution')

From the above graph, we can see that the data is highly skewed. We will focus on removing the skewness in the Data preprocessing phase.

In [None]:
### Plotting a boxplot to check if the column has any outliers 

dataset.boxplot(column = ['Price'])

From the above graph, we can see that there are outliers in the column - Price. We will focus on removing them in the Data preprocessing phase.

# 3.2.10 Numerical variable - Levy

In [None]:
### Understanding the distribution of the column - Levy

sns.distplot(dataset['Levy'], label = 'Skewness: %.2f'%(dataset['Levy'].skew()))
plt.legend(loc = 'best')
plt.title('Distribution of the column - Levy')

From the above graph, we can see that the distribution is roughly normal with some right skewness.

In [None]:
### Plotting a boxplot to check if the column has any outliers 

dataset.boxplot(column = ['Levy'])

From the above graph, we can see that there are outliers in the column - Levy. We will focus on removing them in the Data preprocessing phase.

# 3.2.11 Numerical variable - Prod. year

In [None]:
### Understanding the distribution of the column - Prod. year

sns.distplot(dataset['Prod. year'], label = 'Skewness: %.2f'%(dataset['Prod. year'].skew()))
plt.legend(loc = 'best')
plt.title('Distribution of the column - Prod. year')

From the above graph, we can see that the distribution is roughly normal with a long left tail.

In [None]:
### Plotting a boxplot to check if the column has any outliers 

dataset.boxplot(column = ['Prod. year'])

From the above graph, we can see that there are outliers in the column - Prod. year. We will focus on removing them in the Data preprocessing phase.

# 3.2.12 Numerical variable - Mileage

In [None]:
### Modifying the column - Mileage

modified_mileage = [float(value.split(' ')[0]) for value in dataset['Mileage']]
dataset['Mileage'] = modified_mileage
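
The same parsing can be sketched with pandas string methods. The raw values below are hypothetical: the example assumes each entry is a number followed by a space and a unit suffix such as `km`, which is what the `split(' ')[0]` call above implies.

```python
import pandas as pd

# Hypothetical raw values: a number followed by a unit suffix (assumed format).
mileage = pd.Series(["186005 km", "0 km", "200000 km"])

# Split on whitespace, keep the first token, and convert it to float.
mileage_numeric = mileage.str.split().str[0].astype(float)

print(mileage_numeric.tolist())
```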

In [None]:
### Understanding the distribution of the column - Mileage

sns.distplot(dataset['Mileage'], label = 'Skewness: %.2f'%(dataset['Mileage'].skew()))
plt.legend(loc = 'best')
plt.title('Distribution of the column - Mileage')

From the above graph, we can see that the graph has a high degree of skewness.

In [None]:
### Plotting a boxplot to check if the column has any outliers 

dataset.boxplot(column = ['Mileage'])

From the above graph, we can see that there are outliers in the column - Mileage. We will focus on removing them in the Data preprocessing phase.

# 3.2.13 Numerical variable - Cylinders

In [None]:
### Understanding the distribution of the column - Cylinders

sns.distplot(dataset['Cylinders'], label = 'Skewness: %.2f'%(dataset['Cylinders'].skew()))
plt.legend(loc = 'best')
plt.title('Distribution of the column - Cylinders')

From the above graph, we can see that there are 3 different cylinders - 4, 6, and 8. 

# 3.2.14 Numerical variable - Airbags

In [None]:
### Understanding the distribution of the column - Airbags

sns.distplot(dataset['Airbags'], label = 'Skewness: %.2f'%(dataset['Airbags'].skew()))
plt.legend(loc = 'best')
plt.title('Distribution of the column - Airbags')

From the above graph, we can see that the data has only slight skewness.

# 3.2.15 Numerical variable - Engine volume

In [None]:
### Modifying the column - Engine volume

modified_volume = [float(value.split(' ')[0]) for value in dataset['Engine volume']]
dataset['Engine volume'] = modified_volume

In [None]:
### Understanding the distribution of the column - Engine volume

sns.distplot(dataset['Engine volume'], label = 'Skewness: %.2f'%(dataset['Engine volume'].skew()))
plt.legend(loc = 'best')
plt.title('Distribution of the column - Engine volume')

From the above graph, we can see that the distribution is roughly normal with slight right skewness.

In [None]:
### Plotting a boxplot to check if the column has any outliers 

dataset.boxplot(column = ['Engine volume'])

From the above graph, we can see that there are outliers in the column - Engine volume. We will focus on removing them in the Data preprocessing phase.

# 4. Data Preprocessing

Data preprocessing is the process of getting our dataset ready for model training. In this section, we will perform the following preprocessing steps:

1. Detect and remove outliers in numerical variables
2. Drop and fill missing values
3. Feature Engineering
4. Data Transformation
5. Feature Encoding
6. Feature Selection

# 4.1 Detect and remove outliers in numerical variables

Outliers are data points that have extreme values and they do not conform with the majority of the data. It is important to address this because outliers tend to skew our data towards extremes and can cause inaccurate model predictions. I will use the Tukey method to remove these outliers.

Here, we will write a function that will loop through a list of features and detect outliers in each one of those features. In each loop, a data point is deemed an outlier if it is less than the first quartile minus the outlier step or exceeds third quartile plus the outlier step. The outlier step is defined as 1.5 times the interquartile range. Once the outliers have been determined for one feature, their indices will be stored in a list before proceeding to the next feature and the process repeats until the very last feature is completed. Finally, using the list with outlier indices, we will count the frequencies of the index numbers and return them if their frequency exceeds n times.
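
To make the Tukey fences concrete before applying them to the dataset, here is a minimal sketch on a single made-up feature. The numbers are hypothetical and chosen so that one value is clearly extreme.

```python
import numpy as np

# Hypothetical feature values with one obvious extreme point.
values = np.array([10, 12, 11, 13, 12, 11, 14, 100])

q1 = np.percentile(values, 25)
q3 = np.percentile(values, 75)
iqr = q3 - q1
outlier_step = 1.5 * iqr

lower_fence = q1 - outlier_step
upper_fence = q3 + outlier_step

# Points outside the fences are flagged as outliers.
is_outlier = (values < lower_fence) | (values > upper_fence)
print(values[is_outlier])
```

Here only the value 100 falls outside the fences. The function below repeats this check per feature and then keeps the rows flagged in more than n features.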

In [None]:
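def detect_outliers(df, n, features_list):
    """Return the indices of rows that are Tukey outliers in more than n features."""
    outlier_indices = []
    for feature in features_list:
        # nanpercentile ignores NaN; np.percentile would return NaN for Levy,
        # which still has missing values here, and so flag nothing in that column
        Q1 = np.nanpercentile(df[feature], 25)
        Q3 = np.nanpercentile(df[feature], 75)
        IQR = Q3 - Q1
        outlier_step = 1.5 * IQR
        # Rows below Q1 - step or above Q3 + step are outliers for this feature
        outlier_list_col = df[(df[feature] < Q1 - outlier_step) | (df[feature] > Q3 + outlier_step)].index
        outlier_indices.extend(outlier_list_col)
    # Count how many features flagged each row
    outlier_indices = Counter(outlier_indices)
    multiple_outliers = [key for key, value in outlier_indices.items() if value > n]
    return multiple_outliers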

outliers_to_drop = detect_outliers(dataset, 2, ['Price', 'Levy', 'Prod. year', 'Mileage', 'Cylinders', 'Airbags',
                                               'Engine volume'])
print("We will drop these {} indices: ".format(len(outliers_to_drop)), outliers_to_drop)

Now let's look at the data present in the rows.

In [None]:
dataset.loc[outliers_to_drop, :]

We will drop these rows from the dataset.

In [None]:
### Drop outliers and reset index

print("Before: {} rows".format(len(dataset)))
dataset = dataset.drop(outliers_to_drop, axis = 0).reset_index(drop = True)
print("After: {} rows".format(len(dataset)))

In [None]:
### Let's look at the new dataset

dataset

# 4.2 Drop and fill missing values

Here in the dataset, only the column - Levy has missing values. We will focus on replacing those missing values.

# 4.2.1 Handling missing values - Levy

In [None]:
### Replacing the missing values in the column - Levy using median

levy_index = list(~dataset['Levy'].isnull())
median_levy = np.median(dataset['Levy'].loc[levy_index])
median_levy

In [None]:
### Replacing the missing values of the column - Levy in the dataset

dataset['Levy'] = dataset['Levy'].fillna(median_levy)

In [None]:
### Checking if there are any missing values of Levy in the dataset

dataset['Levy'].isnull().sum()
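
The imputation steps above can be condensed, since `Series.median` skips NaN by default. This is a minimal sketch on hypothetical Levy values rather than the real column.

```python
import numpy as np
import pandas as pd

# Hypothetical Levy values with two missing entries.
levy = pd.Series([500.0, np.nan, 700.0, 1200.0, np.nan])

# Series.median skips NaN by default, so no boolean mask is needed.
median_levy = levy.median()

# fillna returns a new Series; assign it back instead of relying on inplace.
levy = levy.fillna(median_levy)

print(levy.tolist())
```

The median is a reasonable choice here because Levy is right-skewed, and the median is less sensitive to extreme values than the mean.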

# 4.2.2 Dropping unnecessary columns

Here, we will drop the columns - ID, Manufacturer, Model, Leather interior, Drive wheels, Color from the dataset.

In [None]:
### Dropping the columns - ID, Manufacturer, Model, Leather interior, Drive wheels, Color

dataset.drop(['ID', 'Manufacturer', 'Model', 'Leather interior', 'Drive wheels', 'Color'], axis = 1, inplace = True)
dataset

# 4.3 Feature Engineering

Feature engineering is arguably the most important art in machine learning. It is the process of creating new features from existing features to better represent the underlying problem to the predictive models resulting in improved model accuracy on unseen data.

Here, we focus on creating new columns for:

1. NewCategory - using the column Category
2. NewFuelType - using the column Fuel type
3. NewGearbox - using the column Gear box type
4. NewDoors - using the column Doors
5. Age - using the column Prod. year

# 4.3.1 NewCategory - using the column Category

Here, we will build the NewCategory values and store them back in the Category column: if the mean price of a category is at most 20000 then it belongs to class 1, else class 2.

In [None]:
### Separating the categories into class 1 and 2

class_1 = []
class_2 = []

for index in range(len(mean_price_category)):
    if mean_price_category.iloc[index, 1] <= 20000:
        class_1.append(mean_price_category.iloc[index, 0])
    else:
        class_2.append(mean_price_category.iloc[index, 0])
        
print('Categories with a mean price of at most 20000: ', class_1)
print('Categories with a mean price of more than 20000: ', class_2)

In [None]:
### Modifying the Category column in the dataset

category_data = dataset['Category']
new_category_data = []

for value in category_data:
    if value in class_1:
        new_category_data.append(1)
    else:
        new_category_data.append(2)
        
dataset['Category'] = new_category_data
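
The two loops above can also be expressed with `groupby`, `isin`, and `map`. The sketch below uses a hypothetical frame with made-up prices and is meant only to show the shape of the vectorized version.

```python
import pandas as pd

# Hypothetical data; prices are made up for illustration.
toy = pd.DataFrame({
    "Category": ["Sedan", "Jeep", "Sedan", "Jeep", "Hatchback"],
    "Price": [10000, 30000, 14000, 26000, 8000],
})

# Mean price per category, then the categories at or below the threshold.
mean_price = toy.groupby("Category")["Price"].mean()
class_1 = mean_price[mean_price <= 20000].index

# isin gives True for class 1 members; map the booleans to the labels 1 and 2.
toy["Category"] = toy["Category"].isin(class_1).map({True: 1, False: 2})

print(toy["Category"].tolist())
```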

In [None]:
### Looking at the modified dataset

dataset

# 4.3.2 NewFuelType - using the column Fuel type

Here, if the fuel type is Hybrid, Hydrogen, or Plug-in Hybrid, we rename it to Other.

In [None]:
### Creating the new fuel type data

fuel_type_data = dataset['Fuel type']
new_fuel_type_data = []

for value in fuel_type_data:
    if value in {'Hybrid', 'Hydrogen', 'Plug-in Hybrid'}:
        new_fuel_type_data.append('Other')
    else:
        new_fuel_type_data.append(value)

set(new_fuel_type_data)

In [None]:
### Modifying the Fuel Type column

dataset['Fuel type'] = new_fuel_type_data

In [None]:
### Looking at the modified dataset

dataset

# 4.3.3 NewGearbox - using the column Gear box type

Here, we will divide the column Gear box type into 2 classes: Automatic and Variator belong to class 1, and all other values belong to class 2.

In [None]:
### Separating the gear box types into class 1 and 2

gear_box_data = dataset['Gear box type']
new_gear_box_data = []

for value in gear_box_data:
    if value in {'Automatic', 'Variator'}:
        new_gear_box_data.append(1)
    else:
        new_gear_box_data.append(2)

set(new_gear_box_data)

In [None]:
### Modifying the Gear box type column

dataset['Gear box type'] = new_gear_box_data

In [None]:
### Looking at the modified dataset

dataset

# 4.3.4 NewDoors - using the column Doors

Here, we will modify the column Doors so that 04-May becomes 4-5 and 02-Mar becomes 2-3; any other value is kept as it is.

In [None]:
### Creating the new Doors data

doors_data = dataset['Doors']
new_doors_data = []

for value in doors_data:
    if value == '04-May':
        new_doors_data.append('4-5')
    elif value == '02-Mar':
        new_doors_data.append('2-3')
    else:
        new_doors_data.append(value)

set(new_doors_data)

In [None]:
### Modifying the Doors column

dataset['Doors'] = new_doors_data

In [None]:
### Looking at the modified dataset

dataset

# 4.3.5 Age - using the column Prod. year

Here, we will use the column Prod. year to create Age using the formula 2022 - value.

In [None]:
### Creating the Age data

year_data = dataset['Prod. year']
age_data = []

for value in year_data:
    age_data.append(2022 - value)
    
len(set(age_data))

In [None]:
### Creating the Age column

dataset['Age'] = age_data

In [None]:
### Removing the Prod. year column

dataset.drop(['Prod. year'], axis = 1, inplace = True)

In [None]:
### Looking at the modified dataset

dataset

# 4.4 Data Transformation

In this section, we will remove the skewness present in the columns - Price, Levy, Age, Mileage, Engine volume by using a Box-Cox transformation on the data. Then, we will normalize all the numerical columns apart from the Target using MinMax Normalization.
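
For reference, the Box-Cox transform of a strictly positive value $x$ with parameter $\lambda$ is commonly defined as below. When no $\lambda$ is passed, `stats.boxcox` estimates it by maximizing the log-likelihood and returns it as the second value (discarded as `_` in the cells that follow). The transform is undefined for $x \le 0$, which is why zeros are replaced with 1 before each call.

$$
x^{(\lambda)} =
\begin{cases}
\dfrac{x^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\
\ln x, & \lambda = 0
\end{cases}
$$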

# 4.4.1 Box Cox transforming the column - Price

In [None]:
### Understanding the distribution of the column - Price

sns.distplot(dataset['Price'], label = 'Skewness: %.2f'%(dataset['Price'].skew()))
plt.legend(loc = 'best')
plt.title('Distribution of the column - Price')

In [None]:
### Understanding the distribution of the data Box_Cox(Price)

price_data = [1 if value == 0 else value for value in dataset['Price']]

modified_price, _ = stats.boxcox(price_data)
dataset['Price'] = modified_price

sns.distplot(dataset['Price'], label = 'Skewness: %.2f'%(dataset['Price'].skew()))
plt.legend(loc = 'best')
plt.title('Distribution of the column - Price')

# 4.4.2 Box Cox transforming the column - Levy

In [None]:
### Understanding the distribution of the column - Levy

sns.distplot(dataset['Levy'], label = 'Skewness: %.2f'%(dataset['Levy'].skew()))
plt.legend(loc = 'best')
plt.title('Distribution of the column - Levy')

In [None]:
### Understanding the distribution of the data Box_Cox(Levy)

levy_data = [1 if value == 0 else value for value in dataset['Levy']]

modified_levy, _ = stats.boxcox(levy_data)
dataset['Levy'] = modified_levy

sns.distplot(dataset['Levy'], label = 'Skewness: %.2f'%(dataset['Levy'].skew()))
plt.legend(loc = 'best')
plt.title('Distribution of the column - Levy')

# 4.4.3 Box Cox transforming the column - Age

In [None]:
### Understanding the distribution of the column - Age

sns.distplot(dataset['Age'], label = 'Skewness: %.2f'%(dataset['Age'].skew()))
plt.legend(loc = 'best')
plt.title('Distribution of the column - Age')

In [None]:
### Understanding the distribution of the data Box_Cox(Age)

age_data = [1 if value == 0 else value for value in dataset['Age']]

modified_age, _ = stats.boxcox(age_data)
dataset['Age'] = modified_age

sns.distplot(dataset['Age'], label = 'Skewness: %.2f'%(dataset['Age'].skew()))
plt.legend(loc = 'best')
plt.title('Distribution of the column - Age')

# 4.4.4 Box Cox transforming the column - Mileage

In [None]:
### Understanding the distribution of the column - Mileage

sns.distplot(dataset['Mileage'], label = 'Skewness: %.2f'%(dataset['Mileage'].skew()))
plt.legend(loc = 'best')
plt.title('Distribution of the column - Mileage')

In [None]:
### Understanding the distribution of the data Box_Cox(Mileage)

mileage_data = [1 if value == 0 else value for value in dataset['Mileage']]

modified_mileage, _ = stats.boxcox(mileage_data)
dataset['Mileage'] = modified_mileage

sns.distplot(dataset['Mileage'], label = 'Skewness: %.2f'%(dataset['Mileage'].skew()))
plt.legend(loc = 'best')
plt.title('Distribution of the column - Mileage')

# 4.4.5 Box Cox transforming the column - Engine volume

In [None]:
### Understanding the distribution of the column - Engine volume

sns.distplot(dataset['Engine volume'], label = 'Skewness: %.2f'%(dataset['Engine volume'].skew()))
plt.legend(loc = 'best')
plt.title('Distribution of the column - Engine volume')

In [None]:
### Understanding the distribution of the data Box_Cox(Engine volume)

engine_data = [1 if value == 0 else value for value in dataset['Engine volume']]

modified_engine, _ = stats.boxcox(engine_data)
dataset['Engine volume'] = modified_engine

sns.distplot(dataset['Engine volume'], label = 'Skewness: %.2f'%(dataset['Engine volume'].skew()))
plt.legend(loc = 'best')
plt.title('Distribution of the column - Engine volume')

# 4.4.6 Normalizing the numerical columns

In [None]:
### A function to normalize numerical columns

def normalize_columns(dataframe, column):
    data = dataframe[column]
    mini = min(data)
    maxi = max(data)
    
    new_data = []
    for value in data:
        new_data.append((value - mini)/(maxi - mini))
    
    dataframe[column] = new_data

numerical_columns = ['Levy', 'Engine volume', 'Mileage', 'Cylinders', 'Airbags', 'Age']
for each_column in numerical_columns:
    normalize_columns(dataset, each_column)
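
Min-max normalization maps each value to (value - min) / (max - min), so every column ends up in the range [0, 1]. A vectorized sketch of the same idea is shown below; `normalize_columns_vectorized` and the `toy` frame are hypothetical names introduced for illustration.

```python
import pandas as pd

# Hypothetical data with two numerical columns.
toy = pd.DataFrame({"Mileage": [0.0, 50.0, 100.0], "Airbags": [2.0, 4.0, 12.0]})

def normalize_columns_vectorized(dataframe, columns):
    """Scale each listed column to the [0, 1] range in place (hypothetical helper)."""
    for column in columns:
        data = dataframe[column]
        # Column-wise arithmetic replaces the explicit Python loop over values.
        dataframe[column] = (data - data.min()) / (data.max() - data.min())

normalize_columns_vectorized(toy, ["Mileage", "Airbags"])
print(toy)
```

Note that both versions divide by zero when a column is constant, so such columns would need to be handled separately.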

In [None]:
### Looking at the sample records of the dataset

dataset

# 4.5 Feature Encoding

Feature encoding is the process of turning categorical data in a dataset into numerical data. It is essential that we perform feature encoding because most machine learning models can only interpret numerical data and not data in text form.

Here, we will use One Hot Encoding for the columns - Category, Fuel type, Gear box type, Doors, Wheel

In [None]:
### One Hot Encoding the columns - Category, Fuel type, Gear box type, Doors, Wheel of the dataset

encoded_dataset = pd.get_dummies(data = dataset, columns = ['Category', 'Fuel type', 'Gear box type', 'Doors', 'Wheel'])
encoded_dataset
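
To see what `pd.get_dummies` produces, here is a minimal sketch on a hypothetical frame. The Wheel labels are assumed for illustration and may not match the dataset's exact spelling.

```python
import pandas as pd

# Hypothetical data; the Wheel labels are assumptions, not taken from the dataset.
toy = pd.DataFrame({
    "Wheel": ["Left wheel", "Right-hand drive", "Left wheel"],
    "Age": [5, 12, 3],
})

# Each distinct value of Wheel becomes its own indicator column.
encoded = pd.get_dummies(data = toy, columns = ["Wheel"])

print(encoded.columns.tolist())
```

The untouched columns come first, followed by one indicator column per category, named `<column>_<value>`.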

In [None]:
### Create the column - Target using Price

target_data = encoded_dataset['Price']
encoded_dataset['Target'] = target_data

### Dropping the column - Price

encoded_dataset.drop(['Price'], axis = 1, inplace = True)
encoded_dataset

# 4.6 Feature Selection

Feature selection is the process of reducing the number of input variables when developing a predictive model. It is desirable to reduce the number of input variables to both reduce the computational cost of modeling and, in some cases, to improve the performance of the model.

# 4.6.1 Plotting the correlation matrix for the numerical columns

In [None]:
### Creating a filter_dataset

filter_dataset = encoded_dataset[['Levy', 'Engine volume', 'Mileage', 'Cylinders', 'Airbags', 'Age']]
filter_dataset

In [None]:
### Plotting the correlation between various columns of the filter_dataset

plt.figure(figsize = (6, 6))
heatmap = sns.heatmap(filter_dataset.corr(), vmin = -1, vmax = 1, annot = True)
heatmap.set_title('Correlation Heatmap', fontdict = {'fontsize' : 12}, pad = 12)

# 4.6.2 Removing the columns that cause multicollinearity using VIF

In [None]:
### Detecting the columns that cause multicollinearity using VIF

names = ['Levy', 'Engine volume', 'Mileage', 'Cylinders', 'Airbags', 'Age']

for i in range(len(names)):
    y = filter_dataset.iloc[:, filter_dataset.columns == names[i]].values
    x = filter_dataset.iloc[:, filter_dataset.columns != names[i]].values
    x = sm.add_constant(x)
    model = sm.OLS(y, x)
    results = model.fit()
    
    rsq = results.rsquared
    vif = round(1 / (1 - rsq), 2)
    print(
        "R Square value of {} column is {} keeping all other columns as features".format(
            names[i], (round(rsq, 2))
        )
    )
    print(
        "Variance Inflation Factor of {} column is {} \n".format(
            names[i], vif)
        )

Since no column has a VIF greater than 10, we will keep all the columns. Our dataset is now ready for modelling.
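
The VIF of a column is 1 / (1 - R²), where R² comes from regressing that column on all the other columns. The sketch below repeats that calculation with NumPy alone on synthetic data, so that the effect of a nearly duplicated column is visible. The helper name `vif_scores` and the synthetic columns are assumptions for illustration.

```python
import numpy as np

def vif_scores(matrix):
    """Return the VIF of each column of a 2-D array (hypothetical helper)."""
    X = np.asarray(matrix, dtype=float)
    scores = []
    for j in range(X.shape[1]):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        # Add an intercept column, as sm.add_constant does in the cell above
        design = np.column_stack([np.ones(len(y)), others])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        residual = y - design @ coef
        r_squared = 1 - residual.var() / y.var()
        scores.append(1 / (1 - r_squared))
    return scores

# Synthetic data: column c is nearly a copy of column a, column b is independent
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.1 * rng.normal(size=200)
demo_vif = vif_scores(np.column_stack([a, b, c]))
print([round(v, 2) for v in demo_vif])
```

In this sketch the two nearly identical columns receive a large VIF, while the independent column stays close to 1.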

# 5. Modelling

Scikit-learn is one of the most popular libraries for machine learning in Python and that is what we will use in the modelling part of this project.

Since Car Price Prediction is a regression problem, we need regression models, also known as regressors, which we train on our data to make predictions. I highly recommend checking out the scikit-learn documentation for more information on the different machine learning models available in the library. I have chosen the following regression models for the job:

1. Multi Linear Regression
2. Lasso Regression
3. Ridge Regression
4. Support Vector Regression
5. Decision Tree Regression
6. Random Forest Regression
7. Stacking Regression
8. XGBoost Regression 

In this section of the notebook, I will fit the models outlined above to the training set and evaluate their predictions using the Root Mean Squared Error (RMSE) and R-squared. Then, we will select the best model based on those values.
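
For reference, the two metrics can be written out directly. The sketch below is a plain-Python version of what `mean_squared_error` (followed by a square root) and `r2_score` compute; the function names and the four sample values are assumptions for illustration.

```python
from math import sqrt

def rmse_score(y_true, y_pred):
    """Root of the mean squared difference between actual and predicted values."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sqrt(sum(squared_errors) / len(squared_errors))

def r_squared(y_true, y_pred):
    """1 minus the ratio of residual to total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up values, only to exercise the two functions
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 9.0]
print(rmse_score(actual, predicted))
print(r_squared(actual, predicted))
```

A lower RMSE and a higher R-squared both indicate a better fit, which is why the two are read together when comparing models.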

# 5.1 Splitting the data to Training and Test sets

Here, we will split the training data into X_train, X_test, Y_train, and Y_test so that they can be fed to the machine learning models that are used in the next section. Then the model with the best performance will be used to predict the result on the given test dataset.

In [None]:
### Splitting the dataset to the matrices X and Y

X = encoded_dataset.iloc[:, : -1].values
Y = encoded_dataset.iloc[:, -1].values

In [None]:
### Looking at the feature matrix - X

X

In [None]:
### Looking at the target vector - Y

Y

In [None]:
### Dividing the dataset into train and test in the ratio of 80 : 20

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 27, shuffle = True)

In [None]:
X_train

In [None]:
X_test

In [None]:
Y_train

In [None]:
Y_test

Now, we apply regressors using the above data.

# 5.2 Fit the model

In this section, we use various machine learning models to predict the results for our test data (X_test). We will store each model's Root Mean Squared Error and R-squared so that we can tabulate them later and choose the best model.
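
The evaluation cells below call `r2_score`, which returns the plain R-squared. If an adjusted value is wanted, it can be derived from R-squared, the number of rows n, and the number of features p as 1 - (1 - R²)(n - 1) / (n - p - 1). A minimal sketch, with a hypothetical helper name and made-up inputs:

```python
def adjusted_r_squared(r2, n_samples, n_features):
    """Adjusted R-squared from plain R-squared (hypothetical helper)."""
    if n_samples - n_features - 1 <= 0:
        raise ValueError('need more samples than features + 1')
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Made-up inputs, only to show the call
print(adjusted_r_squared(0.9, n_samples = 101, n_features = 10))
```

In this notebook, the call would take the form `adjusted_r_squared(r2_value, X_test.shape[0], X_test.shape[1])`.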

In [None]:
### Dictionary to store model and its rmse

model_rmse = OrderedDict()

In [None]:
### Dictionary to store model and its r-squared

model_r2 = OrderedDict()

# 5.2.1 Applying Multi Linear Regression

In [None]:
### Training the Multi Linear Regression model on the Training set

linear_regressor = LinearRegression()
linear_regressor.fit(X_train, Y_train)

In [None]:
### Predicting the Test set results

Y_pred = linear_regressor.predict(X_test)

In [None]:
### Calculating RMSE and R-squared for the model

mse = round(mean_squared_error(Y_test, Y_pred), 3)
rmse = round(sqrt(mse), 3)

r2_value = round(r2_score(Y_test, Y_pred), 3)

model_rmse['Multi Linear Regression'] = rmse
model_r2['Multi Linear Regression'] = r2_value

print('Root Mean Squared Error of the model is : {}'.format(rmse))
print('R-squared value of the model is : {}'.format(r2_value))
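
The fit, predict, and score steps above recur for every regressor in this section. One way to avoid repeating them is a small helper that accepts any object with `fit` and `predict` methods. The names `fit_and_score` and `MeanBaseline` are hypothetical; the baseline stands in for a scikit-learn regressor so that the sketch runs without the dataset.

```python
import numpy as np

def fit_and_score(name, model, X_train, Y_train, X_test, Y_test, rmse_store, r2_store):
    """Fit the model, predict on the test split, and record RMSE and R-squared."""
    model.fit(X_train, Y_train)
    y_true = np.asarray(Y_test, dtype=float)
    y_pred = np.asarray(model.predict(X_test), dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    rmse_store[name] = round(float(np.sqrt(ss_res / len(y_true))), 3)
    r2_store[name] = round(float(1 - ss_res / ss_tot), 3)
    return rmse_store[name], r2_store[name]

class MeanBaseline:
    """Stand-in estimator that always predicts the training mean."""
    def fit(self, X, y):
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, X):
        return np.full(len(X), self.mean_)

# Made-up data, only to exercise the helper
demo_rmse, demo_r2 = {}, {}
fit_and_score('Mean baseline', MeanBaseline(),
              [[0], [1], [2]], [1.0, 2.0, 3.0],
              [[3], [4]], [1.0, 3.0],
              demo_rmse, demo_r2)
print(demo_rmse, demo_r2)
```

In the notebook, a call would take the form `fit_and_score('Multi Linear Regression', LinearRegression(), X_train, Y_train, X_test, Y_test, model_rmse, model_r2)`. Unlike the cells in this section, the helper rounds only the final values, so the third decimal of the RMSE may differ slightly.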

# 5.2.2 Applying Lasso Regression

In [None]:
### Training the Lasso Regression model on the Training set

lasso = Lasso()
parameters = {'alpha': [1e-15, 1e-10, 1e-8, 1e-3, 1e-2, 1, 5, 10, 20, 30, 35, 40, 45, 50, 55, 100]}
lasso_regressor = GridSearchCV(lasso, parameters, scoring = 'neg_mean_squared_error', cv = 5)
lasso_regressor.fit(X_train, Y_train)

In [None]:
### Finding out the best alpha and its negative mean squared error in Lasso Regression

print(lasso_regressor.best_params_)
print(lasso_regressor.best_score_)
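
Because the grid search uses `scoring = 'neg_mean_squared_error'`, `best_score_` is the negated mean squared error averaged over the 5 folds, so it is never positive and values closer to zero are better. To compare it with the RMSE reported for the other models, negate it and take the square root. A minimal sketch, with a hypothetical helper name and a made-up score:

```python
from math import sqrt

def cv_rmse_from_neg_mse(best_score):
    """Convert a 'neg_mean_squared_error' score to an RMSE (hypothetical helper)."""
    if best_score > 0:
        raise ValueError('neg_mean_squared_error scores are never positive')
    return sqrt(-best_score)

# Made-up score, only to show the conversion
print(cv_rmse_from_neg_mse(-144.0))
```

In the notebook, the call would be `cv_rmse_from_neg_mse(lasso_regressor.best_score_)`. The result is the root of the mean cross-validated MSE, which is not the same as the mean of the per-fold RMSE values.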

In [None]:
### Predicting the Test set results

Y_pred = lasso_regressor.predict(X_test)

In [None]:
### Calculating RMSE and R-squared for the model

mse = round(mean_squared_error(Y_test, Y_pred), 3)
rmse = round(sqrt(mse), 3)

r2_value = round(r2_score(Y_test, Y_pred), 3)

model_rmse['Lasso Regression'] = rmse
model_r2['Lasso Regression'] = r2_value

print('Root Mean Squared Error of the model is : {}'.format(rmse))
print('R-squared value of the model is : {}'.format(r2_value))

# 5.2.3 Applying Ridge Regression

In [None]:
### Training the Ridge Regression model on the Training set

ridge = Ridge()
parameters = {'alpha': [1e-15, 1e-10, 1e-8, 1e-3, 1e-2, 1, 5, 10, 20, 30, 35, 40, 45, 50, 55, 100]}
ridge_regressor = GridSearchCV(ridge, parameters, scoring = 'neg_mean_squared_error', cv = 5)
ridge_regressor.fit(X_train, Y_train)

In [None]:
### Finding out the best alpha and its negative mean squared error in Ridge Regression

print(ridge_regressor.best_params_)
print(ridge_regressor.best_score_)

In [None]:
### Predicting the Test set results

Y_pred = ridge_regressor.predict(X_test)

In [None]:
### Calculating RMSE and R-squared for the model

mse = round(mean_squared_error(Y_test, Y_pred), 3)
rmse = round(sqrt(mse), 3)

r2_value = round(r2_score(Y_test, Y_pred), 3)

model_rmse['Ridge Regression'] = rmse
model_r2['Ridge Regression'] = r2_value

print('Root Mean Squared Error of the model is : {}'.format(rmse))
print('R-squared value of the model is : {}'.format(r2_value))

# 5.2.4 Applying Support Vector Regression

In [None]:
### Training the Support Vector Regression model on the Training set

support_vector_regressor = SVR(kernel = 'rbf')
support_vector_regressor.fit(X_train, Y_train)

In [None]:
### Predicting the Test set results

Y_pred = support_vector_regressor.predict(X_test)

In [None]:
### Calculating RMSE and R-squared for the model

mse = round(mean_squared_error(Y_test, Y_pred), 3)
rmse = round(sqrt(mse), 3)

r2_value = round(r2_score(Y_test, Y_pred), 3)

model_rmse['Support Vector Regression'] = rmse
model_r2['Support Vector Regression'] = r2_value

print('Root Mean Squared Error of the model is : {}'.format(rmse))
print('R-squared value of the model is : {}'.format(r2_value))

# 5.2.5 Applying Decision Tree Regression

In [None]:
### Training the Decision Tree Regression model on the Training set

decision_tree_regressor = DecisionTreeRegressor()
decision_tree_regressor.fit(X_train, Y_train)

In [None]:
### Predicting the Test set results

Y_pred = decision_tree_regressor.predict(X_test)

In [None]:
### Calculating RMSE and R-squared for the model

mse = round(mean_squared_error(Y_test, Y_pred), 3)
rmse = round(sqrt(mse), 3)

r2_value = round(r2_score(Y_test, Y_pred), 3)

model_rmse['Decision Tree Regression'] = rmse
model_r2['Decision Tree Regression'] = r2_value

print('Root Mean Squared Error of the model is : {}'.format(rmse))
print('R-squared value of the model is : {}'.format(r2_value))

# 5.2.6 Applying Random Forest Regression (10 trees)

In [None]:
### Training the Random Forest Regression model on the Training set

random_forest_regressor = RandomForestRegressor(n_estimators = 10, random_state = 27)
random_forest_regressor.fit(X_train, Y_train)

In [None]:
### Predicting the Test set results

Y_pred = random_forest_regressor.predict(X_test)

In [None]:
### Calculating RMSE and R-squared for the model

mse = round(mean_squared_error(Y_test, Y_pred), 3)
rmse = round(sqrt(mse), 3)

r2_value = round(r2_score(Y_test, Y_pred), 3)

model_rmse['Random Forest Regression (10 trees)'] = rmse
model_r2['Random Forest Regression (10 trees)'] = r2_value

print('Root Mean Squared Error of the model is : {}'.format(rmse))
print('R-squared value of the model is : {}'.format(r2_value))

# 5.2.7 Applying Random Forest Regression (25 trees)

In [None]:
### Training the Random Forest Regression model on the Training set

random_forest_regressor = RandomForestRegressor(n_estimators = 25, random_state = 27)
random_forest_regressor.fit(X_train, Y_train)

In [None]:
### Predicting the Test set results

Y_pred = random_forest_regressor.predict(X_test)

In [None]:
### Calculating RMSE and R-squared for the model

mse = round(mean_squared_error(Y_test, Y_pred), 3)
rmse = round(sqrt(mse), 3)

r2_value = round(r2_score(Y_test, Y_pred), 3)

model_rmse['Random Forest Regression (25 trees)'] = rmse
model_r2['Random Forest Regression (25 trees)'] = r2_value

print('Root Mean Squared Error of the model is : {}'.format(rmse))
print('R-squared value of the model is : {}'.format(r2_value))

# 5.2.8 Applying Random Forest Regression (50 trees)

In [None]:
### Training the Random Forest Regression model on the Training set

random_forest_regressor = RandomForestRegressor(n_estimators = 50, random_state = 27)
random_forest_regressor.fit(X_train, Y_train)

In [None]:
### Predicting the Test set results

Y_pred = random_forest_regressor.predict(X_test)

In [None]:
### Calculating RMSE and R-squared for the model

mse = round(mean_squared_error(Y_test, Y_pred), 3)
rmse = round(sqrt(mse), 3)

r2_value = round(r2_score(Y_test, Y_pred), 3)

model_rmse['Random Forest Regression (50 trees)'] = rmse
model_r2['Random Forest Regression (50 trees)'] = r2_value

print('Root Mean Squared Error of the model is : {}'.format(rmse))
print('R-squared value of the model is : {}'.format(r2_value))

# 5.2.9 Applying Random Forest Regression (100 trees)

In [None]:
### Training the Random Forest Regression model on the Training set

random_forest_regressor = RandomForestRegressor(n_estimators = 100, random_state = 27)
random_forest_regressor.fit(X_train, Y_train)

In [None]:
### Predicting the Test set results

Y_pred = random_forest_regressor.predict(X_test)

In [None]:
### Calculating RMSE and R-squared for the model

mse = round(mean_squared_error(Y_test, Y_pred), 3)
rmse = round(sqrt(mse), 3)

r2_value = round(r2_score(Y_test, Y_pred), 3)

model_rmse['Random Forest Regression (100 trees)'] = rmse
model_r2['Random Forest Regression (100 trees)'] = r2_value

print('Root Mean Squared Error of the model is : {}'.format(rmse))
print('R-squared value of the model is : {}'.format(r2_value))

# 5.2.10 Applying Random Forest Regression (1000 trees)

In [None]:
### Training the Random Forest Regression model on the Training set

random_forest_regressor = RandomForestRegressor(n_estimators = 1000, random_state = 27)
random_forest_regressor.fit(X_train, Y_train)

In [None]:
### Predicting the Test set results

Y_pred = random_forest_regressor.predict(X_test)

In [None]:
### Calculating RMSE and R-squared for the model

mse = round(mean_squared_error(Y_test, Y_pred), 3)
rmse = round(sqrt(mse), 3)

r2_value = round(r2_score(Y_test, Y_pred), 3)

model_rmse['Random Forest Regression (1000 trees)'] = rmse
model_r2['Random Forest Regression (1000 trees)'] = r2_value

print('Root Mean Squared Error of the model is : {}'.format(rmse))
print('R-squared value of the model is : {}'.format(r2_value))

# 5.2.11 Applying Stacking Regression

In [None]:
### Preparing the Stacking Regressor

### Define the base models

base_models = list()

base_models.append(('decision_tree', decision_tree_regressor))
base_models.append(('support_vector', support_vector_regressor))

### Define the meta model
### (when the cells are run in order, random_forest_regressor is the 1000-tree forest from 5.2.10)

meta_model = random_forest_regressor

In [None]:
### Training the Stacking Regression model on the Training set

stacking_regressor = StackingRegressor(estimators = base_models, final_estimator = meta_model)
stacking_regressor.fit(X_train, Y_train)

In [None]:
### Predicting the Test set results

Y_pred = stacking_regressor.predict(X_test)

In [None]:
### Calculating RMSE and R-squared for the model

mse = round(mean_squared_error(Y_test, Y_pred), 3)
rmse = round(sqrt(mse), 3)

r2_value = round(r2_score(Y_test, Y_pred), 3)

model_rmse['Stacking Regression'] = rmse
model_r2['Stacking Regression'] = r2_value

print('Root Mean Squared Error of the model is : {}'.format(rmse))
print('R-squared value of the model is : {}'.format(r2_value))

# 5.2.12 Applying XGBoost Regression

In [None]:
### Training the XGBoost Regression model on the Training set

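### 'reg:squarederror' replaces the deprecated 'reg:linear' objective; random_state replaces seed

xgboost_regressor = xg.XGBRegressor(objective = 'reg:squarederror', n_estimators = 100, random_state = 27)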
xgboost_regressor.fit(X_train, Y_train)

In [None]:
### Predicting the Test set results

Y_pred = xgboost_regressor.predict(X_test)

In [None]:
### Calculating RMSE and R-squared for the model

mse = round(mean_squared_error(Y_test, Y_pred), 3)
rmse = round(sqrt(mse), 3)

r2_value = round(r2_score(Y_test, Y_pred), 3)

model_rmse['XGBoost Regression'] = rmse
model_r2['XGBoost Regression'] = r2_value

print('Root Mean Squared Error of the model is : {}'.format(rmse))
print('R-squared value of the model is : {}'.format(r2_value))

# 5.3 Model evaluation

Model evaluation is the process of using different evaluation metrics to understand a machine learning model's performance, as well as its strengths and weaknesses.

# 5.3.1 RMSE, R-squared of the models

Now we will tabulate all the models along with their RMSE and R-squared values. These values are stored in the model_rmse and model_r2 dictionaries. We will use the tabulate package for tabulating the results.

In [None]:
### Looking at the model rmse dictionary

model_rmse

In [None]:
### Looking at the model r-squared dictionary

model_r2

In [None]:
### Tabulating the results

table = []
table.append(['S.No.', 'Regression Model', 'Root Mean Squared Error', 'R-squared'])
count = 1

for model in model_rmse:
    row = [count, model, model_rmse[model], model_r2[model]]
    table.append(row)
    count += 1
    
print(tabulate(table, headers = 'firstrow', tablefmt = 'fancy_grid'))

From the above table, we can see that the model Random Forest Regression (1000 trees) has the lowest Root Mean Squared Error (10.043) and the highest R-squared value (0.661).
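
The same selection can be made programmatically instead of by reading the table. The sketch below uses placeholder dictionaries with made-up numbers, not results from this notebook; with the real data, `model_rmse` and `model_r2` would take their place.

```python
from collections import OrderedDict

# Placeholder numbers for illustration only, not results from this notebook
demo_rmse = OrderedDict([('Model A', 12.5), ('Model B', 10.1), ('Model C', 11.3)])
demo_r2 = OrderedDict([('Model A', 0.48), ('Model B', 0.66), ('Model C', 0.57)])

# Lowest RMSE and highest R-squared; the two usually agree but need not
best_by_rmse = min(demo_rmse, key = demo_rmse.get)
best_by_r2 = max(demo_r2, key = demo_r2.get)
print(best_by_rmse, best_by_r2)
```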

# 6. Conclusion

Hence, for this problem, we will use the Random Forest regressor with 1000 trees to predict the car price.