# Car Sales Price Prediction

# 1. Importing the libraries

A Python library is a collection of related modules: bundles of reusable code that can be shared across different programs. Libraries make Python programming simpler and more convenient, because we do not need to write the same code again for each program.

In this notebook, we will be using the following libraries.

In [None]:
### Data Wrangling 

import numpy as np
import pandas as pd
import missingno
from collections import Counter
from collections import OrderedDict

### Data Visualization

import matplotlib.pyplot as plt
import seaborn as sns

### Data Preprocessing

import statsmodels.api as sm
from scipy import stats

### Modelling 

from sklearn.model_selection import train_test_split
from math import sqrt
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import StackingRegressor
import xgboost as xg

### Tabulating the results

from tabulate import tabulate

### Remove unnecessary warnings

import warnings
warnings.filterwarnings('ignore')

# 2. Importing the data

In this section, I will load the dataset that is provided in the Data section of the Kaggle project.

The dataset is a single CSV file in which each row describes one car model. The outcome (also known as the “ground truth”) is the sales price, Price_in_thousands, and the model will be based on “features” like Manufacturer, Vehicle Type, Horsepower etc. Feature engineering can also be used to create new features.

Because there is no separate test file, part of the data is held out later (Section 5.1) to see how well each model performs on unseen data. For each car, our task is to predict the sales price of the car.

In [None]:
### Fetching the dataset

dataset = pd.read_csv('../input/car-sales/Car_sales.csv')

In [None]:
### Looking at the sample data in the dataset

dataset.head(10)

In [None]:
### Shape of the dataset

dataset.shape

The dataset consists of 16 columns and 157 rows.

# 3. Exploratory Data Analysis

Exploratory Data Analysis (EDA) refers to the critical process of performing initial investigations on data so as to discover patterns, spot anomalies, test hypotheses, and check assumptions with the help of summary statistics and graphical representations.

Here, we will perform EDA on the categorical columns of the dataset - Manufacturer, Vehicle_type and the numerical columns of the dataset - Sales_in_thousands, __year_resale_value, Price_in_thousands, Engine_size, Horsepower, Wheelbase, Width, Length, Curb_weight, Fuel_capacity, Fuel_efficiency, Power_perf_factor.

# 3.1 Datatypes, Missing Data, and Summary Statistics

In [None]:
### Looking at the datatypes of the dataset

dataset.info()

Here, the columns - Manufacturer, Model, Vehicle_type are categorical. Hence, we modify the datatype of these columns to category.

In [None]:
### Modifying the datatypes of the columns to category

dataset.Manufacturer = dataset.Manufacturer.astype('category')
dataset.Model = dataset.Model.astype('category')
dataset.Vehicle_type = dataset.Vehicle_type.astype('category')

Looking at the modified datatypes of the columns in the dataset.

In [None]:
### Looking at the modified datatypes of the dataset

dataset.info()

From the above data it is evident that there are missing values in the dataset.

In [None]:
### Visual representation of the missing data in the dataset

missingno.matrix(dataset)

From the above plot, we can see that there are missing values in the columns - __year_resale_value, Price_in_thousands, Engine_size, Horsepower, Wheelbase, Width, Length, Curb_weight, Fuel_capacity, Fuel_efficiency, Power_perf_factor.
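A numeric summary can complement the matrix plot. The sketch below uses a small made-up frame (`toy` and its values are hypothetical, not the car sales data) to show how the count and percentage of missing values per column could be tabulated.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `dataset`
toy = pd.DataFrame({
    'Price_in_thousands': [21.5, np.nan, 33.4, 28.0],
    'Horsepower': [140.0, 225.0, np.nan, np.nan],
    'Sales_in_thousands': [16.9, 39.4, 14.1, 8.6],
})

# Count and share of missing values per column, largest first
missing_summary = pd.DataFrame({
    'missing': toy.isna().sum(),
    'percent': (toy.isna().mean() * 100).round(1),
}).sort_values('missing', ascending=False)

print(missing_summary)
```

The same two expressions can be applied to `dataset` directly to see which columns need imputation.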

In [None]:
### Summary statistics of the numerical columns in the dataset

dataset.describe()

# 3.2 Feature Analysis

# 3.2.1 Categorical variable - Manufacturer

In [None]:
### Value counts of the column - Manufacturer

manufacturer_count = dataset['Manufacturer'].value_counts(dropna = False)
manufacturer_count

In [None]:
### Bar graph showing the value counts of the column - Manufacturer

plt.figure(figsize = (20, 6))
sns.barplot(x = manufacturer_count.index, y = manufacturer_count.values, alpha = 0.8)
plt.title('Bar graph showing the value counts of the column - Manufacturer')
plt.ylabel('Number of Occurrences', fontsize = 12)
plt.xlabel('Manufacturer', fontsize = 12)
plt.show()

From the above graph, we can see that the number of occurrences is not uniform across the car manufacturers.

In [None]:
### Mean price per each Manufacturer 

mean_price_manufacturer = dataset[['Manufacturer', 'Price_in_thousands']].groupby('Manufacturer', as_index = False).mean()
mean_price_manufacturer

In [None]:
### Mean Price for each Manufacturer

plt.figure(figsize = (20, 6))
sns.barplot(x = mean_price_manufacturer['Manufacturer'], y = mean_price_manufacturer['Price_in_thousands'], alpha = 0.8)
plt.title('Mean Sales Price for each Manufacturer')
plt.ylabel('Mean Price', fontsize = 12)
plt.xlabel('Manufacturer', fontsize = 12)
plt.show()

# 3.2.2 Categorical variable - Vehicle_type

In [None]:
### Value counts of the column - Vehicle_type

vehicle_count = dataset['Vehicle_type'].value_counts(dropna = False)
vehicle_count

In [None]:
### Bar graph showing the value counts of the column - Vehicle_type

sns.barplot(x = vehicle_count.index, y = vehicle_count.values, alpha = 0.8)
plt.title('Bar graph showing the value counts of the column - Vehicle type')
plt.ylabel('Number of Occurrences', fontsize = 12)
plt.xlabel('Vehicle type', fontsize = 12)
plt.show()

From the above graph, we can see that most of the values in the column are Passenger.

In [None]:
### Mean price per each Vehicle type

mean_price_vehicle = dataset[['Vehicle_type', 'Price_in_thousands']].groupby('Vehicle_type', as_index = False).mean()
mean_price_vehicle

In [None]:
### Mean Price for each Vehicle_type

sns.barplot(x = mean_price_vehicle['Vehicle_type'], y = mean_price_vehicle['Price_in_thousands'], alpha = 0.8)
plt.title('Mean Sales Price for each Vehicle type')
plt.ylabel('Mean Price', fontsize = 12)
plt.xlabel('Vehicle type', fontsize = 12)
plt.show()

From the above graph, we can see that the mean sales price is similar for both the vehicle types.

# 3.2.3 Numerical variable - Sales_in_thousands

In [None]:
### Understanding the distribution of the column - Sales_in_thousands

sns.distplot(dataset['Sales_in_thousands'], label = 'Skewness: %.2f'%(dataset['Sales_in_thousands'].skew()))
plt.legend(loc = 'best')
plt.title('Distribution of the column - Sales in thousands')

From the above graph, we can see that the data is slightly skewed. We will remove this skewness during the Data Preprocessing phase.
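The skewness shown in the plot legends comes from `Series.skew()`, which computes the bias-corrected (adjusted Fisher-Pearson) sample skewness. As a minimal sketch, the function below reproduces that statistic with NumPy; `sample_skewness` and `toy_sales` are illustrative names, not part of the notebook.

```python
import numpy as np
import pandas as pd

def sample_skewness(values):
    """Adjusted Fisher-Pearson skewness, the bias-corrected estimator."""
    x = np.asarray(values, dtype=float)
    x = x[~np.isnan(x)]              # ignore missing values, as pandas does
    n = x.size
    deviations = x - x.mean()
    m2 = np.mean(deviations ** 2)    # second central moment
    m3 = np.mean(deviations ** 3)    # third central moment
    g1 = m3 / m2 ** 1.5              # biased skewness
    return np.sqrt(n * (n - 1)) / (n - 2) * g1

# Hypothetical right-skewed sample: one large value stretches the right tail
toy_sales = pd.Series([1.0, 2.0, 2.5, 3.0, 4.0, 20.0])
print(sample_skewness(toy_sales), toy_sales.skew())
```

A positive value indicates a longer right tail and a negative value a longer left tail; values near zero suggest a roughly symmetric distribution.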

# 3.2.4 Numerical variable - __year_resale_value

In [None]:
### Understanding the distribution of the column - __year_resale_value

sns.distplot(dataset['__year_resale_value'], label = 'Skewness: %.2f'%(dataset['__year_resale_value'].skew()))
plt.legend(loc = 'best')
plt.title('Distribution of the column - __year_resale_value')

From the above graph, we can see that the data is slightly skewed. We will remove this skewness during the Data Preprocessing phase.

# 3.2.5 Numerical variable - Price_in_thousands

In [None]:
### Understanding the distribution of the column - Price_in_thousands

sns.distplot(dataset['Price_in_thousands'], label = 'Skewness: %.2f'%(dataset['Price_in_thousands'].skew()))
plt.legend(loc = 'best')
plt.title('Distribution of the column - Price_in_thousands')

From the above graph, we can see that the data is slightly skewed. We will remove this skewness during the Data Preprocessing phase.

# 3.2.6 Numerical variable - Engine_size

In [None]:
### Understanding the distribution of the column - Engine_size

sns.distplot(dataset['Engine_size'], label = 'Skewness: %.2f'%(dataset['Engine_size'].skew()))
plt.legend(loc = 'best')
plt.title('Distribution of the column - Engine_size')

From the above graph, we can see that the data is slightly skewed. We will remove this skewness during the Data Preprocessing phase.

# 3.2.7 Numerical variable - Horsepower

In [None]:
### Understanding the distribution of the column - Horsepower

sns.distplot(dataset['Horsepower'], label = 'Skewness: %.2f'%(dataset['Horsepower'].skew()))
plt.legend(loc = 'best')
plt.title('Distribution of the column - Horsepower')

From the above graph, we can see that the data is slightly skewed. We will remove this skewness during the Data Preprocessing phase.

# 3.2.8 Numerical variable - Wheelbase

In [None]:
### Understanding the distribution of the column - Wheelbase

sns.distplot(dataset['Wheelbase'], label = 'Skewness: %.2f'%(dataset['Wheelbase'].skew()))
plt.legend(loc = 'best')
plt.title('Distribution of the column - Wheelbase')

From the above graph, we can see that the data is slightly skewed. We will remove this skewness during the Data Preprocessing phase.

# 3.2.9 Numerical variable - Width

In [None]:
### Understanding the distribution of the column - Width

sns.distplot(dataset['Width'], label = 'Skewness: %.2f'%(dataset['Width'].skew()))
plt.legend(loc = 'best')
plt.title('Distribution of the column - Width')

From the above graph, we can see that the data is normally distributed.

# 3.2.10 Numerical variable - Length

In [None]:
### Understanding the distribution of the column - Length

sns.distplot(dataset['Length'], label = 'Skewness: %.2f'%(dataset['Length'].skew()))
plt.legend(loc = 'best')
plt.title('Distribution of the column - Length')

From the above graph, we can see that the data is normally distributed.

# 3.2.11 Numerical variable - Curb_weight

In [None]:
### Understanding the distribution of the column - Curb_weight

sns.distplot(dataset['Curb_weight'], label = 'Skewness: %.2f'%(dataset['Curb_weight'].skew()))
plt.legend(loc = 'best')
plt.title('Distribution of the column - Curb_weight')

From the above graph, we can see that the data is normally distributed.

# 3.2.12 Numerical variable - Fuel_capacity

In [None]:
### Understanding the distribution of the column - Fuel_capacity

sns.distplot(dataset['Fuel_capacity'], label = 'Skewness: %.2f'%(dataset['Fuel_capacity'].skew()))
plt.legend(loc = 'best')
plt.title('Distribution of the column - Fuel_capacity')

From the above graph, we can see that the data is slightly skewed. We will remove this skewness during the Data Preprocessing phase.

# 3.2.13 Numerical variable - Fuel_efficiency

In [None]:
### Understanding the distribution of the column - Fuel_efficiency

sns.distplot(dataset['Fuel_efficiency'], label = 'Skewness: %.2f'%(dataset['Fuel_efficiency'].skew()))
plt.legend(loc = 'best')
plt.title('Distribution of the column - Fuel_efficiency')

From the above graph, we can see that the data is normally distributed.

# 3.2.14 Numerical variable - Power_perf_factor

In [None]:
### Understanding the distribution of the column - Power_perf_factor

sns.distplot(dataset['Power_perf_factor'], label = 'Skewness: %.2f'%(dataset['Power_perf_factor'].skew()))
plt.legend(loc = 'best')
plt.title('Distribution of the column - Power_perf_factor')

From the above graph, we can see that the data is slightly skewed. We will remove this skewness during the Data Preprocessing phase.

# 4. Data Preprocessing

Data preprocessing is the process of getting our dataset ready for model training. In this section, we will perform the following preprocessing steps:

1. Detect and remove outliers in numerical variables
2. Drop and fill missing values
3. Feature Engineering
4. Data Transformation
5. Feature Encoding
6. Feature Selection

# 4.1 Detect and remove outliers in numerical variables

Outliers are data points that have extreme values and they do not conform with the majority of the data. It is important to address this because outliers tend to skew our data towards extremes and can cause inaccurate model predictions. I will use the Tukey method to remove these outliers.

Here, we will write a function that loops through a list of features and detects outliers in each one. A data point is deemed an outlier if it is less than the first quartile minus the outlier step or greater than the third quartile plus the outlier step, where the outlier step is 1.5 times the interquartile range. Once the outliers have been determined for one feature, their indices are stored in a list before proceeding to the next feature, and the process repeats until the last feature is completed. Finally, we count how often each index appears in that list and return the indices that were flagged in more than n features.

In [None]:
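def detect_outliers(df, n, features_list):
    """Return the indices of rows flagged as Tukey outliers in more than n features."""
    outlier_indices = []
    for feature in features_list:
        # Note: np.percentile returns NaN when the column contains missing values;
        # the comparisons below are then all False and nothing is flagged for that column
        Q1 = np.percentile(df[feature], 25)
        Q3 = np.percentile(df[feature], 75)
        IQR = Q3 - Q1
        outlier_step = 1.5 * IQR
        outlier_list_col = df[(df[feature] < Q1 - outlier_step) | (df[feature] > Q3 + outlier_step)].index
        outlier_indices.extend(outlier_list_col)
    # Count how many features flagged each row
    outlier_indices = Counter(outlier_indices)
    multiple_outliers = [key for key, value in outlier_indices.items() if value > n]
    return multiple_outliers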

outliers_to_drop = detect_outliers(dataset, 2, ['Sales_in_thousands', '__year_resale_value', 'Price_in_thousands', 
                                               'Engine_size', 'Horsepower', 'Wheelbase', 'Width', 'Length', 'Curb_weight',
                                               'Fuel_capacity', 'Fuel_efficiency', 'Power_perf_factor'])
print("We will drop these {} indices: ".format(len(outliers_to_drop)), outliers_to_drop)

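The cell above returns an empty list, so no rows are flagged as outliers in more than two features. Keep in mind that a column containing missing values yields NaN quartiles with np.percentile, so such a column cannot flag any row at this stage.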
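Because most numerical columns still contain missing values at this point, a NaN-aware variant of the Tukey check may be worth comparing against. The following is a minimal sketch rather than the notebook's method: `tukey_outlier_mask` and the `toy` series are hypothetical, and `Series.quantile` is used because it skips missing values.

```python
import numpy as np
import pandas as pd

def tukey_outlier_mask(series, k=1.5):
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]; NaNs are never flagged."""
    q1 = series.quantile(0.25)       # Series.quantile skips NaN
    q3 = series.quantile(0.75)
    step = k * (q3 - q1)
    return (series < q1 - step) | (series > q3 + step)

# Hypothetical column with one missing value and one extreme value
toy = pd.Series([10.0, 12.0, 11.0, 13.0, np.nan, 12.5, 95.0])

print(np.percentile(toy, 25))        # NaN, because of the missing value
print(toy[tukey_outlier_mask(toy)])  # only the extreme value is flagged
```

Applying such a mask per feature and counting flags per row would mirror the logic of `detect_outliers` while still checking columns that have gaps.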

# 4.2 Drop and fill missing values

We will first remove the records that have missing Price_in_thousands.

In [None]:
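### Filtering the rows that have a value in the column - Price_in_thousands
### (.copy() gives an independent DataFrame, so later modifications do not act on a view of dataset)

modified_dataset = dataset[dataset['Price_in_thousands'].notna()].copy()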
modified_dataset

In [None]:
### Looking at the missing values in the dataset

modified_dataset.isnull().sum().sort_values(ascending = False)

From the modified dataset, we can see that there are missing values in the columns - __year_resale_value, Fuel_efficiency, Curb_weight.

# 4.2.1 Handling missing values - __year_resale_value

In [None]:
### Replacing the missing values in the column - __year_resale_value using median

year_index = list(~modified_dataset['__year_resale_value'].isnull())
median_year = np.median(modified_dataset['__year_resale_value'].loc[year_index])
median_year

In [None]:
### Replacing the missing values of the column - __year_resale_value in the dataset

modified_dataset['__year_resale_value'] = modified_dataset['__year_resale_value'].fillna(median_year)

In [None]:
### Checking if there are any missing values of __year_resale_value in the dataset

modified_dataset['__year_resale_value'].isnull().sum()

# 4.2.2 Handling missing values - Fuel_efficiency

In [None]:
### Replacing the missing values in the column - Fuel_efficiency using median

fuel_index = list(~modified_dataset['Fuel_efficiency'].isnull())
median_fuel = np.median(modified_dataset['Fuel_efficiency'].loc[fuel_index])
median_fuel

In [None]:
### Replacing the missing values of the column - Fuel_efficiency in the dataset

modified_dataset['Fuel_efficiency'] = modified_dataset['Fuel_efficiency'].fillna(median_fuel)

In [None]:
### Checking if there are any missing values of Fuel_efficiency in the dataset

modified_dataset['Fuel_efficiency'].isnull().sum()

# 4.2.3 Handling missing values - Curb_weight

In [None]:
### Replacing the missing values in the column - Curb_weight using median

curb_index = list(~modified_dataset['Curb_weight'].isnull())
median_curb = np.median(modified_dataset['Curb_weight'].loc[curb_index])
median_curb

In [None]:
### Replacing the missing values of the column - Curb_weight in the dataset

modified_dataset['Curb_weight'] = modified_dataset['Curb_weight'].fillna(median_curb)

In [None]:
### Checking if there are any missing values of Curb_weight in the dataset

modified_dataset['Curb_weight'].isnull().sum()
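The three imputation steps above follow the same pattern, so they could also be written as a loop. The sketch below runs on a small hypothetical frame (`toy` and its values are made up); each column is filled with its own median, and `Series.median` ignores missing values by default.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one gap per column
toy = pd.DataFrame({
    'Fuel_efficiency': [24.0, np.nan, 27.0, 22.0],
    'Curb_weight': [3.2, 2.9, np.nan, 3.5],
})

# Fill each column with its own median
for column in ['Fuel_efficiency', 'Curb_weight']:
    toy[column] = toy[column].fillna(toy[column].median())

print(toy)
```

Looking the median up by column name inside the loop avoids accidentally filling one column with another column's median.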

# 4.2.4 Dropping unnecessary columns

Here, we will drop the column - Model from the dataset.

In [None]:
### Dropping the column - Model

modified_dataset.drop(['Model'], axis = 1, inplace = True)
modified_dataset

# 4.3 Feature Engineering

Feature engineering is arguably the most important art in machine learning. It is the process of creating new features from existing features to better represent the underlying problem to the predictive models resulting in improved model accuracy on unseen data.

Here, we focus on creating new columns for:

1. NewManufacturer - using the column Manufacturer
2. Age - using the column Latest_Launch

# 4.3.1 NewManufacturer - using the column Manufacturer

Here, we recode the Manufacturer column in place (this is the NewManufacturer feature): a manufacturer belongs to class 1 if its mean price is at most 30 (in thousands), and to class 2 otherwise.

In [None]:
### Separating the Manufacturers into class 1 and 2

class_1 = []
class_2 = []

for index in range(len(mean_price_manufacturer)):
    if mean_price_manufacturer.iloc[index, 1] <= 30:
        class_1.append(mean_price_manufacturer.iloc[index, 0])
    else:
        class_2.append(mean_price_manufacturer.iloc[index, 0])
        
print('Manufacturers with a mean price of at most 30: ', class_1)
print('Manufacturers with a mean price above 30: ', class_2)

In [None]:
### Modifying the Manufacturer column in the dataset

manufacturer_data = modified_dataset['Manufacturer']
new_manufacturer_data = []

for value in manufacturer_data:
    if value in class_1:
        new_manufacturer_data.append(1)
    else:
        new_manufacturer_data.append(2)
        
modified_dataset['Manufacturer'] = new_manufacturer_data
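The same recoding can be expressed without an explicit loop. This is a minimal sketch on made-up data: `toy`, `toy_class_1`, and the `Manufacturer_class` column are hypothetical names, and the class membership shown is for illustration only.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'Manufacturer': ['Acura', 'Ford', 'Porsche', 'Ford']})
toy_class_1 = ['Ford']  # hypothetical list of manufacturers in class 1

# 1 where the manufacturer is in the class 1 list, otherwise 2
toy['Manufacturer_class'] = np.where(toy['Manufacturer'].isin(toy_class_1), 1, 2)
print(toy)
```

Writing the result to a new column, as here, keeps the original manufacturer names available for later inspection.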

In [None]:
### Looking at the modified dataset

modified_dataset

# 4.3.2 Age - using the column Latest_Launch

Here, we will create the Age column using the formula 2022 - year value.

In [None]:
### Creating the Age data

age_data = []
launch_data = modified_dataset['Latest_Launch']

for value in launch_data:
    year = int(value.split('/')[-1])
    age_data.append(2022 - year)

In [None]:
### Adding the Age column

modified_dataset['Age'] = age_data
modified_dataset
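As an alternative to splitting the string by hand, the year can be taken from a parsed date. The sketch below assumes the launch dates are month/day/year strings, as the `split('/')` logic suggests; `toy_launch` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical month/day/year strings standing in for Latest_Launch
toy_launch = pd.Series(['2/2/2012', '6/3/2011', '11/27/2012'])
reference_year = 2022  # same reference year as in the text

# Parse the strings as dates, then take the year component
launch_year = pd.to_datetime(toy_launch, format='%m/%d/%Y').dt.year
toy_age = reference_year - launch_year
print(toy_age.tolist())
```

Parsing with an explicit format raises an error on malformed dates instead of silently producing a wrong year.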

In [None]:
### Understanding the distribution of the column - Age

sns.distplot(modified_dataset['Age'], label = 'Skewness: %.2f'%(modified_dataset['Age'].skew()))
plt.legend(loc = 'best')
plt.title('Distribution of the column - Age')

From the above graph, we can see that there are only 3 main values for this column.

In [None]:
### Dropping the column - Latest_Launch

modified_dataset.drop(['Latest_Launch'], axis = 1, inplace = True)

In [None]:
### Looking at the modified dataset

modified_dataset

# 4.4 Data Transformation

In this section, we will remove the skewness present in the columns - Sales_in_thousands, __year_resale_value, Engine_size, Horsepower, Fuel_capacity, Power_perf_factor by using a Box-Cox transformation on the data. Then, we will normalize all the numerical columns apart from the Target using MinMax Normalization.
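For reference, the Box-Cox transform maps a strictly positive value x to (x^λ - 1)/λ, or to log(x) when λ = 0. When `stats.boxcox` is called without a fixed λ, it estimates λ by maximum likelihood and returns it as the second value, which the cells below discard as `_`. The sketch here applies the formula with a fixed λ to show the mapping and its inverse; `box_cox`, `inverse_box_cox`, and `toy` are illustrative names.

```python
import numpy as np

def box_cox(x, lam):
    """Box-Cox transform of strictly positive values for a given lambda."""
    x = np.asarray(x, dtype=float)
    if np.any(x <= 0):
        raise ValueError('Box-Cox requires strictly positive values')
    if lam == 0:
        return np.log(x)
    return (x ** lam - 1.0) / lam

def inverse_box_cox(y, lam):
    """Map transformed values back to the original scale."""
    y = np.asarray(y, dtype=float)
    if lam == 0:
        return np.exp(y)
    return (lam * y + 1.0) ** (1.0 / lam)

toy = np.array([1.0, 2.0, 5.0, 40.0])  # hypothetical right-skewed values
transformed = box_cox(toy, 0.5)
print(transformed)
print(inverse_box_cox(transformed, 0.5))
```

The positivity requirement is why the cells below replace zeros with 1 before transforming. Keeping the estimated λ instead of discarding it would allow transformed values to be mapped back later.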

# 4.4.1 Box Cox transforming the column - Sales_in_thousands

In [None]:
### Understanding the distribution of the column - Sales_in_thousands

sns.distplot(modified_dataset['Sales_in_thousands'], label = 'Skewness: %.2f'%(modified_dataset['Sales_in_thousands'].skew()))
plt.legend(loc = 'best')
plt.title('Distribution of the column - Sales_in_thousands')

In [None]:
### Understanding the distribution of the data Box_Cox(Sales_in_thousands)

sales_data = [1 if value == 0 else value for value in modified_dataset['Sales_in_thousands']]

modified_sales, _ = stats.boxcox(sales_data)
modified_dataset['Sales_in_thousands'] = modified_sales

sns.distplot(modified_dataset['Sales_in_thousands'], label = 'Skewness: %.2f'%(modified_dataset['Sales_in_thousands'].skew()))
plt.legend(loc = 'best')
plt.title('Distribution of the column - Sales_in_thousands')

From the above graph, we can see that most of the skewness is removed.

# 4.4.2 Box Cox transforming the column - __year_resale_value

In [None]:
### Understanding the distribution of the column - __year_resale_value

sns.distplot(modified_dataset['__year_resale_value'], label = 'Skewness: %.2f'%(modified_dataset['__year_resale_value'].skew()))
plt.legend(loc = 'best')
plt.title('Distribution of the column - __year_resale_value')

In [None]:
### Understanding the distribution of the data Box_Cox(__year_resale_value)

year_data = [1 if value == 0 else value for value in modified_dataset['__year_resale_value']]

modified_year, _ = stats.boxcox(year_data)
modified_dataset['__year_resale_value'] = modified_year

sns.distplot(modified_dataset['__year_resale_value'], label = 'Skewness: %.2f'%(modified_dataset['__year_resale_value'].skew()))
plt.legend(loc = 'best')
plt.title('Distribution of the column - __year_resale_value')

From the above graph, we can see that most of the skewness is removed.

# 4.4.3 Box Cox transforming the column - Engine_size

In [None]:
### Understanding the distribution of the column - Engine_size

sns.distplot(modified_dataset['Engine_size'], label = 'Skewness: %.2f'%(modified_dataset['Engine_size'].skew()))
plt.legend(loc = 'best')
plt.title('Distribution of the column - Engine_size')

In [None]:
### Understanding the distribution of the data Box_Cox(Engine_size)

engine_data = [1 if value == 0 else value for value in modified_dataset['Engine_size']]

modified_engine, _ = stats.boxcox(engine_data)
modified_dataset['Engine_size'] = modified_engine

sns.distplot(modified_dataset['Engine_size'], label = 'Skewness: %.2f'%(modified_dataset['Engine_size'].skew()))
plt.legend(loc = 'best')
plt.title('Distribution of the column - Engine_size')

From the above graph, we can see that most of the skewness is removed.

# 4.4.4 Box Cox transforming the column - Horsepower

In [None]:
### Understanding the distribution of the column - Horsepower

sns.distplot(modified_dataset['Horsepower'], label = 'Skewness: %.2f'%(modified_dataset['Horsepower'].skew()))
plt.legend(loc = 'best')
plt.title('Distribution of the column - Horsepower')

In [None]:
### Understanding the distribution of the data Box_Cox(Horsepower)

horsepower_data = [1 if value == 0 else value for value in modified_dataset['Horsepower']]

modified_horsepower, _ = stats.boxcox(horsepower_data)
modified_dataset['Horsepower'] = modified_horsepower

sns.distplot(modified_dataset['Horsepower'], label = 'Skewness: %.2f'%(modified_dataset['Horsepower'].skew()))
plt.legend(loc = 'best')
plt.title('Distribution of the column - Horsepower')

From the above graph, we can see that most of the skewness is removed.

# 4.4.5 Box Cox transforming the column - Fuel_capacity

In [None]:
### Understanding the distribution of the column - Fuel_capacity

sns.distplot(modified_dataset['Fuel_capacity'], label = 'Skewness: %.2f'%(modified_dataset['Fuel_capacity'].skew()))
plt.legend(loc = 'best')
plt.title('Distribution of the column - Fuel_capacity')

In [None]:
### Understanding the distribution of the data Box_Cox(Fuel_capacity)

fuel_data = [1 if value == 0 else value for value in modified_dataset['Fuel_capacity']]

modified_fuel, _ = stats.boxcox(fuel_data)
modified_dataset['Fuel_capacity'] = modified_fuel

sns.distplot(modified_dataset['Fuel_capacity'], label = 'Skewness: %.2f'%(modified_dataset['Fuel_capacity'].skew()))
plt.legend(loc = 'best')
plt.title('Distribution of the column - Fuel_capacity')

From the above graph, we can see that most of the skewness is removed.

# 4.4.6 Box Cox transforming the column - Power_perf_factor

In [None]:
### Understanding the distribution of the column - Power_perf_factor

sns.distplot(modified_dataset['Power_perf_factor'], label = 'Skewness: %.2f'%(modified_dataset['Power_perf_factor'].skew()))
plt.legend(loc = 'best')
plt.title('Distribution of the column - Power_perf_factor')

In [None]:
### Understanding the distribution of the data Box_Cox(Power_perf_factor)

power_data = [1 if value == 0 else value for value in modified_dataset['Power_perf_factor']]

modified_power, _ = stats.boxcox(power_data)
modified_dataset['Power_perf_factor'] = modified_power

sns.distplot(modified_dataset['Power_perf_factor'], label = 'Skewness: %.2f'%(modified_dataset['Power_perf_factor'].skew()))
plt.legend(loc = 'best')
plt.title('Distribution of the column - Power_perf_factor')

From the above graph, we can see that most of the skewness is removed.

# 4.4.7 Normalizing the numerical columns

In [None]:
### A function to normalize numerical columns

def normalize_columns(dataframe, column):
    data = dataframe[column]
    mini = min(data)
    maxi = max(data)
    
    new_data = []
    for value in data:
        new_data.append((value - mini)/(maxi - mini))
    
    dataframe[column] = new_data

numerical_columns = ['Sales_in_thousands', '__year_resale_value', 'Engine_size', 'Horsepower', 'Wheelbase', 'Width',
                    'Length', 'Curb_weight', 'Fuel_capacity', 'Fuel_efficiency', 'Power_perf_factor', 'Age']
for each_column in numerical_columns:
    normalize_columns(modified_dataset, each_column)
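The min-max formula (value - min) / (max - min) can also be applied to several columns at once. The following is a minimal vectorized sketch; `min_max_scale` and the `toy` frame are hypothetical and it returns a scaled copy rather than modifying the input.

```python
import pandas as pd

def min_max_scale(frame, columns):
    """Return a copy of `frame` with the given columns rescaled to [0, 1]."""
    scaled = frame.copy()
    col_min = scaled[columns].min()
    col_range = scaled[columns].max() - col_min
    scaled[columns] = (scaled[columns] - col_min) / col_range
    return scaled

toy = pd.DataFrame({'Width': [66.0, 70.0, 78.0], 'Length': [170.0, 190.0, 200.0]})
toy_scaled = min_max_scale(toy, ['Width', 'Length'])
print(toy_scaled)
```

Note that a constant column has a range of zero, so both this sketch and the loop above would divide by zero for it.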

In [None]:
### Looking at the sample records of the modified dataset

modified_dataset

# 4.5 Feature Encoding

Feature encoding is the process of turning categorical data in a dataset into numerical data. It is essential that we perform feature encoding because most machine learning models can only interpret numerical data and not data in text form.

Here, we will use One Hot Encoding for the columns - Manufacturer, Vehicle_type.

In [None]:
### One Hot Encoding the columns - Manufacturer, Vehicle_type of the modified dataset

encoded_dataset = pd.get_dummies(data = modified_dataset, columns = ['Manufacturer', 'Vehicle_type'])
encoded_dataset
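One design choice worth knowing about: with full one-hot encoding, the indicator columns of a feature always sum to 1, which is perfectly collinear with a model's intercept. For linear models, a common option is to drop one level per feature with `drop_first = True`. The sketch below compares both encodings on a hypothetical frame (`toy` and its values are made up).

```python
import pandas as pd

toy = pd.DataFrame({
    'Vehicle_type': ['Passenger', 'Car', 'Passenger'],
    'Horsepower': [140, 225, 210],
})

# Full encoding: one indicator column per category
full = pd.get_dummies(toy, columns=['Vehicle_type'])
# Reduced encoding: the first category becomes the implicit baseline
reduced = pd.get_dummies(toy, columns=['Vehicle_type'], drop_first=True)

print(full.columns.tolist())
print(reduced.columns.tolist())
```

Tree-based models are generally not affected by this redundancy, so the full encoding used in this notebook remains a reasonable default for them.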

In [None]:
### Create the column - Target using Price_in_thousands

target_data = encoded_dataset['Price_in_thousands']
encoded_dataset['Target'] = target_data

### Dropping the column - Price_in_thousands

encoded_dataset.drop(['Price_in_thousands'], axis = 1, inplace = True)
encoded_dataset

# 4.6 Feature Selection

Feature selection is the process of reducing the number of input variables when developing a predictive model. It is desirable to reduce the number of input variables to both reduce the computational cost of modeling and, in some cases, to improve the performance of the model.

# 4.6.1 Plotting the correlation matrix for the numerical columns

In [None]:
### Creating a filter_dataset

filter_dataset = encoded_dataset[['Sales_in_thousands', '__year_resale_value', 'Engine_size', 'Horsepower', 'Wheelbase', 
                                  'Width', 'Length', 'Curb_weight', 'Fuel_capacity', 'Fuel_efficiency', 'Power_perf_factor',
                                  'Age']]
filter_dataset

In [None]:
### Plotting the correlation between various columns of the filter_dataset

plt.figure(figsize = (10, 10))
heatmap = sns.heatmap(filter_dataset.corr(), vmin = -1, vmax = 1, annot = True)
heatmap.set_title('Correlation Heatmap', fontdict = {'fontsize' : 12}, pad = 12)

From the above correlation matrix, we can see that there are a few strong correlations between the columns. We will use the Variance Inflation Factor (VIF) to detect and remove the multicollinearity.

# 4.6.2 Removing the columns that cause multicollinearity using VIF

In [None]:
### Detecting the columns that cause multicollinearity using VIF

column_names = list(filter_dataset.columns)

### Iterate over a snapshot of the names, because removing items from a list
### while looping over it would skip the column after each removal
for name in list(column_names):
    if len(column_names) >= 2:
        Y = filter_dataset.loc[:, filter_dataset.columns == name]
        X = filter_dataset.loc[:, filter_dataset.columns != name]
        X = sm.add_constant(X)
        linear_model = sm.OLS(Y, X)
        results = linear_model.fit()
        r_squared = results.rsquared
        vif_value = round(1/(1 - r_squared), 2)
        print("Column: {} and VIF: {}".format(name, vif_value))
        
        ### A VIF above 10 is treated as a sign of strong multicollinearity
        if vif_value > 10:
            filter_dataset = filter_dataset.drop([name], axis = 1)
            column_names.remove(name)

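From the above output, we can see that the columns - Engine_size, Horsepower, Curb_weight, Fuel_capacity, Power_perf_factor cause multicollinearity. Note that the models in Section 5 are trained on encoded_dataset, so dropping columns from filter_dataset does not change the features those models see.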
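The VIF of a column is 1 / (1 - R²), where R² comes from regressing that column on all the other columns. As a hedged sketch, the function below computes the VIF of every column in one pass with NumPy least squares, without dropping anything; `vif_table` and the synthetic `toy` frame are hypothetical.

```python
import numpy as np
import pandas as pd

def vif_table(frame):
    """VIF of every column, from the R^2 of regressing it on the others (with intercept)."""
    values = {}
    for name in frame.columns:
        y = frame[name].to_numpy(dtype=float)
        others = frame.drop(columns=name).to_numpy(dtype=float)
        design = np.column_stack([np.ones(len(frame)), others])
        coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
        residuals = y - design @ coefs
        r_squared = 1.0 - residuals.var() / y.var()
        values[name] = np.inf if np.isclose(r_squared, 1.0) else 1.0 / (1.0 - r_squared)
    return pd.Series(values).sort_values(ascending=False)

# Synthetic data: 'a_plus_noise' is almost a copy of 'a', 'b' is independent
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
toy = pd.DataFrame({'a': a, 'b': b, 'a_plus_noise': a + 0.05 * rng.normal(size=200)})
print(vif_table(toy))
```

Inspecting all VIF values first, and then removing one column at a time and recomputing, avoids dropping several columns that were collinear only with each other.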

# 5. Modelling

Scikit-learn is one of the most popular libraries for machine learning in Python and that is what we will use in the modelling part of this project.

Since car price prediction is a regression problem, we need regression models, also known as regressors, which we train on our data to make predictions. I highly recommend checking out the scikit-learn documentation for more information on the different machine learning models available in the library. I have chosen the following regression models for the job:

1. Multi Linear Regression
2. Lasso Regression
3. Ridge Regression
4. Support Vector Regression
5. Decision Tree regression
6. Random Forest Regression
7. Stacking Regression
8. XGBoost Regression

In this section of the notebook, I will fit the models listed above to the training set and evaluate their Root Mean Squared Error (RMSE) and R-squared on the test set. Then, we will select the best model based on those values.

# 5.1 Splitting the data to Training and Test sets

Here, we will split the data into X_train, X_test, Y_train, and Y_test so that they can be fed to the machine learning models used in the next section. The held-out test split is then used to compare the models and identify the one with the best performance.

In [None]:
### Splitting the dataset to the matrices X and Y

X = encoded_dataset.iloc[:, : -1].values
Y = encoded_dataset.iloc[:, -1].values

In [None]:
### Looking at the feature matrix - X

X

In [None]:
### Looking at the target vector - Y

Y

In [None]:
### Dividing the dataset into train and test in the ratio of 80 : 20

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 27, shuffle = True)

Now, we apply regressors using the above data.

# 5.2 Fit the model

In this section, we use various machine learning models to predict the results for our test data (X_test). We will store each model and its corresponding Root Mean Squared Error and R-squared so that we can tabulate them later for choosing the best model.

In [None]:
### Dictionary to store model and its rmse

model_rmse = OrderedDict()

In [None]:
### Dictionary to store model and its r-squared

model_r2 = OrderedDict()

# 5.2.1 Applying Multi Linear Regression

In [None]:
### Training the Multi Linear Regression model on the Training set

linear_regressor = LinearRegression()
linear_regressor.fit(X_train, Y_train)

In [None]:
### Predicting the Test set results

Y_pred = linear_regressor.predict(X_test)

In [None]:
### Calculating RMSE and R-squared for the model

mse = round(mean_squared_error(Y_test, Y_pred), 3)
rmse = round(sqrt(mse), 3)

r2_value = round(r2_score(Y_test, Y_pred), 3)

model_rmse['Multi Linear Regression'] = rmse
model_r2['Multi Linear Regression'] = r2_value

print('Root Mean Squared Error of the model is : {}'.format(rmse))
print('R-squared value of the model is : {}'.format(r2_value))
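The metrics above can be collected in one helper. The sketch below also includes adjusted R-squared, 1 - (1 - R²)(n - 1)/(n - p - 1) for n samples and p features, which `r2_score` does not compute; `regression_metrics` and the toy values are hypothetical.

```python
import numpy as np

def regression_metrics(y_true, y_pred, n_features):
    """RMSE, R-squared and adjusted R-squared for one set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = y_true.size
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    rmse = np.sqrt(ss_res / n)
    r2 = 1.0 - ss_res / ss_tot
    adjusted_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)
    return {'rmse': rmse, 'r2': r2, 'adjusted_r2': adjusted_r2}

# Hypothetical prices and predictions; every prediction is off by exactly 1
toy_true = [20.0, 25.0, 30.0, 35.0, 40.0, 45.0]
toy_pred = [21.0, 24.0, 31.0, 34.0, 41.0, 44.0]
print(regression_metrics(toy_true, toy_pred, n_features=2))
```

Adjusted R-squared requires n > p + 1, which is worth checking when the test set is small relative to the number of features.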

# 5.2.2 Applying Lasso Regression

In [None]:
### Training the Lasso Regression model on the Training set

lasso = Lasso()
parameters = {'alpha': [1e-15, 1e-10, 1e-8, 1e-3, 1e-2, 1, 5, 10, 20, 30, 35, 40, 45, 50, 55, 100]}
lasso_regressor = GridSearchCV(lasso, parameters, scoring = 'neg_mean_squared_error', cv = 5)
lasso_regressor.fit(X_train, Y_train)

In [None]:
### Finding out negative mean squared error in Lasso Regression

print(lasso_regressor.best_params_)
print(lasso_regressor.best_score_)
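With `scoring = 'neg_mean_squared_error'`, `best_score_` is the mean cross-validated MSE with its sign flipped, so it is zero or negative. A small helper can convert it to an RMSE on the same scale as the test-set metric; `cv_rmse_from_neg_mse` and `toy_best_score` are hypothetical names and values.

```python
from math import sqrt

def cv_rmse_from_neg_mse(best_score):
    """Convert a 'neg_mean_squared_error' score (always <= 0) into an RMSE."""
    if best_score > 0:
        raise ValueError('expected a non-positive negative-MSE score')
    return sqrt(-best_score)

toy_best_score = -16.0  # hypothetical value of lasso_regressor.best_score_
print(cv_rmse_from_neg_mse(toy_best_score))  # 4.0
```

This is the square root of the mean MSE across folds, which is not exactly the same as the mean of the per-fold RMSE values.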

In [None]:
### Predicting the Test set results

Y_pred = lasso_regressor.predict(X_test)

In [None]:
### Calculating RMSE and R-squared for the model

mse = round(mean_squared_error(Y_test, Y_pred), 3)
rmse = round(sqrt(mse), 3)

r2_value = round(r2_score(Y_test, Y_pred), 3)

model_rmse['Lasso Regression'] = rmse
model_r2['Lasso Regression'] = r2_value

print('Root Mean Squared Error of the model is : {}'.format(rmse))
print('R-squared value of the model is : {}'.format(r2_value))

# 5.2.3 Applying Ridge Regression

In [None]:
### Training the Ridge Regression model on the Training set

ridge = Ridge()
parameters = {'alpha': [1e-15, 1e-10, 1e-8, 1e-3, 1e-2, 1, 5, 10, 20, 30, 35, 40, 45, 50, 55, 100]}
ridge_regressor = GridSearchCV(ridge, parameters, scoring = 'neg_mean_squared_error', cv = 5)
ridge_regressor.fit(X_train, Y_train)

In [None]:
### Best alpha and the corresponding negative mean squared error for Ridge Regression

print(ridge_regressor.best_params_)
print(ridge_regressor.best_score_)

In [None]:
### Predicting the Test set results

Y_pred = ridge_regressor.predict(X_test)

In [None]:
### Calculating RMSE and R-squared for the model

mse = round(mean_squared_error(Y_test, Y_pred), 3)
rmse = round(sqrt(mse), 3)

r2_value = round(r2_score(Y_test, Y_pred), 3)

model_rmse['Ridge Regression'] = rmse
model_r2['Ridge Regression'] = r2_value

print('Root Mean Squared Error of the model is : {}'.format(rmse))
print('R-squared value of the model is : {}'.format(r2_value))

# 5.2.4 Applying Support Vector Regression

In [None]:
### Training the Support Vector Regression model on the Training set

support_vector_regressor = SVR(kernel = 'rbf')
support_vector_regressor.fit(X_train, Y_train)

In [None]:
### Predicting the Test set results

Y_pred = support_vector_regressor.predict(X_test)

In [None]:
### Calculating RMSE and R-squared for the model

mse = round(mean_squared_error(Y_test, Y_pred), 3)
rmse = round(sqrt(mse), 3)

r2_value = round(r2_score(Y_test, Y_pred), 3)

model_rmse['Support Vector Regression'] = rmse
model_r2['Support Vector Regression'] = r2_value

print('Root Mean Squared Error of the model is : {}'.format(rmse))
print('R-squared value of the model is : {}'.format(r2_value))

# 5.2.5 Applying Decision Tree Regression

In [None]:
### Training the Decision Tree Regression model on the Training set

decision_tree_regressor = DecisionTreeRegressor(random_state = 27)
decision_tree_regressor.fit(X_train, Y_train)

In [None]:
### Predicting the Test set results

Y_pred = decision_tree_regressor.predict(X_test)

In [None]:
### Calculating RMSE and R-squared for the model

mse = round(mean_squared_error(Y_test, Y_pred), 3)
rmse = round(sqrt(mse), 3)

r2_value = round(r2_score(Y_test, Y_pred), 3)

model_rmse['Decision Tree Regression'] = rmse
model_r2['Decision Tree Regression'] = r2_value

print('Root Mean Squared Error of the model is : {}'.format(rmse))
print('R-squared value of the model is : {}'.format(r2_value))

# 5.2.6 Applying Random Forest Regression (10 trees)

In [None]:
### Training the Random Forest Regression model on the Training set

random_forest_regressor = RandomForestRegressor(n_estimators = 10, random_state = 27)
random_forest_regressor.fit(X_train, Y_train)

In [None]:
### Predicting the Test set results

Y_pred = random_forest_regressor.predict(X_test)

In [None]:
### Calculating RMSE and R-squared for the model

mse = round(mean_squared_error(Y_test, Y_pred), 3)
rmse = round(sqrt(mse), 3)

r2_value = round(r2_score(Y_test, Y_pred), 3)

model_rmse['Random Forest Regression (10 trees)'] = rmse
model_r2['Random Forest Regression (10 trees)'] = r2_value

print('Root Mean Squared Error of the model is : {}'.format(rmse))
print('R-squared value of the model is : {}'.format(r2_value))

# 5.2.7 Applying Random Forest Regression (25 trees)

In [None]:
### Training the Random Forest Regression model on the Training set

random_forest_regressor = RandomForestRegressor(n_estimators = 25, random_state = 27)
random_forest_regressor.fit(X_train, Y_train)

In [None]:
### Predicting the Test set results

Y_pred = random_forest_regressor.predict(X_test)

In [None]:
### Calculating RMSE and R-squared for the model

mse = round(mean_squared_error(Y_test, Y_pred), 3)
rmse = round(sqrt(mse), 3)

r2_value = round(r2_score(Y_test, Y_pred), 3)

model_rmse['Random Forest Regression (25 trees)'] = rmse
model_r2['Random Forest Regression (25 trees)'] = r2_value

print('Root Mean Squared Error of the model is : {}'.format(rmse))
print('R-squared value of the model is : {}'.format(r2_value))

# 5.2.8 Applying Random Forest Regression (50 trees)

In [None]:
### Training the Random Forest Regression model on the Training set

random_forest_regressor = RandomForestRegressor(n_estimators = 50, random_state = 27)
random_forest_regressor.fit(X_train, Y_train)

In [None]:
### Predicting the Test set results

Y_pred = random_forest_regressor.predict(X_test)

In [None]:
### Calculating RMSE and R-squared for the model

mse = round(mean_squared_error(Y_test, Y_pred), 3)
rmse = round(sqrt(mse), 3)

r2_value = round(r2_score(Y_test, Y_pred), 3)

model_rmse['Random Forest Regression (50 trees)'] = rmse
model_r2['Random Forest Regression (50 trees)'] = r2_value

print('Root Mean Squared Error of the model is : {}'.format(rmse))
print('R-squared value of the model is : {}'.format(r2_value))

# 5.2.9 Applying Random Forest Regression (100 trees)

In [None]:
### Training the Random Forest Regression model on the Training set

random_forest_regressor = RandomForestRegressor(n_estimators = 100, random_state = 27)
random_forest_regressor.fit(X_train, Y_train)

In [None]:
### Predicting the Test set results

Y_pred = random_forest_regressor.predict(X_test)

In [None]:
### Calculating RMSE and R-squared for the model

mse = round(mean_squared_error(Y_test, Y_pred), 3)
rmse = round(sqrt(mse), 3)

r2_value = round(r2_score(Y_test, Y_pred), 3)

model_rmse['Random Forest Regression (100 trees)'] = rmse
model_r2['Random Forest Regression (100 trees)'] = r2_value

print('Root Mean Squared Error of the model is : {}'.format(rmse))
print('R-squared value of the model is : {}'.format(r2_value))

# 5.2.10 Applying Random Forest Regression (1000 trees)

In [None]:
### Training the Random Forest Regression model on the Training set

random_forest_regressor = RandomForestRegressor(n_estimators = 1000, random_state = 27)
random_forest_regressor.fit(X_train, Y_train)

In [None]:
### Predicting the Test set results

Y_pred = random_forest_regressor.predict(X_test)

In [None]:
### Calculating RMSE and R-squared for the model

mse = round(mean_squared_error(Y_test, Y_pred), 3)
rmse = round(sqrt(mse), 3)

r2_value = round(r2_score(Y_test, Y_pred), 3)

model_rmse['Random Forest Regression (1000 trees)'] = rmse
model_r2['Random Forest Regression (1000 trees)'] = r2_value

print('Root Mean Squared Error of the model is : {}'.format(rmse))
print('R-squared value of the model is : {}'.format(r2_value))
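The five random forest sections above differ only in `n_estimators`, so they could be condensed into a single loop. The following is a sketch under stated assumptions: synthetic data from `make_regression` stands in for the car sales features, only three tree counts are tried to keep it quick, and the names `X_demo`, `Y_demo`, `rf_rmse` and `rf_r2` are illustrative rather than taken from the notebook.

```python
from collections import OrderedDict
from math import sqrt

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

### Synthetic stand-in data (assumption; the notebook uses the X and Y built earlier)
X_demo, Y_demo = make_regression(n_samples = 200, n_features = 5, noise = 10.0, random_state = 27)
X_tr, X_te, Y_tr, Y_te = train_test_split(X_demo, Y_demo, test_size = 0.2, random_state = 27)

rf_rmse = OrderedDict()
rf_r2 = OrderedDict()

for n_trees in [10, 25, 50]:
    ### Fit one forest per tree count and score it on the held-out split
    forest = RandomForestRegressor(n_estimators = n_trees, random_state = 27)
    forest.fit(X_tr, Y_tr)
    Y_hat = forest.predict(X_te)

    name = 'Random Forest Regression ({} trees)'.format(n_trees)
    rf_rmse[name] = round(sqrt(mean_squared_error(Y_te, Y_hat)), 3)
    rf_r2[name] = round(r2_score(Y_te, Y_hat), 3)

print(rf_rmse)
print(rf_r2)
```

In the notebook itself, the same loop would write into `model_rmse` and `model_r2` and use `X_train`, `X_test`, `Y_train` and `Y_test` instead of the synthetic split.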

# 5.2.11 Applying Stacking Regression

In [None]:
### Preparing the Stacking Regressor

### Define the base models

base_models = list()

base_models.append(('decision_tree', decision_tree_regressor))
base_models.append(('support_vector', support_vector_regressor))

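### Define the meta model
### Note: random_forest_regressor currently refers to the last forest fitted above (1000 trees);
### StackingRegressor fits its own clones of the base models and the meta model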

meta_model = random_forest_regressor

In [None]:
### Training the Stacking Regression model on the Training set

stacking_regressor = StackingRegressor(estimators = base_models, final_estimator = meta_model)
stacking_regressor.fit(X_train, Y_train)

In [None]:
### Predicting the Test set results

Y_pred = stacking_regressor.predict(X_test)

In [None]:
### Calculating RMSE and R-squared for the model

mse = round(mean_squared_error(Y_test, Y_pred), 3)
rmse = round(sqrt(mse), 3)

r2_value = round(r2_score(Y_test, Y_pred), 3)

model_rmse['Stacking Regression'] = rmse
model_r2['Stacking Regression'] = r2_value

print('Root Mean Squared Error of the model is : {}'.format(rmse))
print('R-squared value of the model is : {}'.format(r2_value))

# 5.2.12 Applying XGBoost Regression

In [None]:
### Training the XGBoost Regression model on the Training set

### 'reg:squarederror' replaces the deprecated 'reg:linear' objective
xgboost_regressor = xg.XGBRegressor(objective = 'reg:squarederror', n_estimators = 100, random_state = 27)
xgboost_regressor.fit(X_train, Y_train)

In [None]:
### Predicting the Test set results

Y_pred = xgboost_regressor.predict(X_test)

In [None]:
### Calculating RMSE and R-squared for the model

mse = round(mean_squared_error(Y_test, Y_pred), 3)
rmse = round(sqrt(mse), 3)

r2_value = round(r2_score(Y_test, Y_pred), 3)

model_rmse['XGBoost Regression'] = rmse
model_r2['XGBoost Regression'] = r2_value

print('Root Mean Squared Error of the model is : {}'.format(rmse))
print('R-squared value of the model is : {}'.format(r2_value))

# 5.3 Model evaluation

Model evaluation is the process of using different evaluation metrics to understand a machine learning model's performance, as well as its strengths and weaknesses.

# 5.3.1 RMSE, R-squared of the models

Now we will tabulate all the models along with their RMSE and R-squared values. These are stored in the model_rmse and model_r2 dictionaries. We will use the tabulate package to print the results.

In [None]:
### Looking at the model rmse dictionary

model_rmse

In [None]:
### Looking at the model r-squared dictionary

model_r2

In [None]:
### Tabulating the results

table = []
table.append(['S.No.', 'Regression Model', 'Root Mean Squared Error', 'R-squared'])
count = 1

for model in model_rmse:
    row = [count, model, model_rmse[model], model_r2[model]]
    table.append(row)
    count += 1
    
print(tabulate(table, headers = 'firstrow', tablefmt = 'fancy_grid'))

From the above table, we can see that the Multi Linear Regression model has the lowest Root Mean Squared Error (4.245) and the highest R-squared value (0.926).
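The best model can also be read off the two dictionaries programmatically rather than by eye. A minimal sketch follows, using toy placeholder dictionaries (`demo_rmse`, `demo_r2` and their values are illustrative only and are not the notebook's results); with the real data, `model_rmse` and `model_r2` would take their place.

```python
from collections import OrderedDict

### Toy placeholder scores (illustrative only, not the notebook's results)
demo_rmse = OrderedDict([('model_a', 5.0), ('model_b', 4.2), ('model_c', 6.1)])
demo_r2 = OrderedDict([('model_a', 0.90), ('model_b', 0.93), ('model_c', 0.85)])

### Lower RMSE is better, higher R-squared is better
best_by_rmse = min(demo_rmse, key = demo_rmse.get)
best_by_r2 = max(demo_r2, key = demo_r2.get)

print('Lowest RMSE :', best_by_rmse)
print('Highest R-squared :', best_by_r2)
```

When both metrics are computed on the same test set they will normally agree on the winner; if they point to different models, that usually signals a bookkeeping mistake worth checking.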

# 6. Conclusion

Hence, for this problem, we will use the Multi Linear Regression model to predict the sales price of a car.