<a href="https://colab.research.google.com/github/akande75/Explainability-of-Deep-Learning-Models/blob/main/Vehicle_Price.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Importing the required libraries

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(
    { "figure.figsize": (6, 4) },
    style='ticks',
    color_codes=True,
    font_scale=0.8
)
%config InlineBackend.figure_format = 'retina'
import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

# 1. Data/Domain Understanding / Exploration

## 1.1 Meaning and Types of Features

## Importing the dataset

In [1]:
#loading and reading the dataset
adv_data = pd.read_csv("adverts.csv")
adv_data.head()


In [None]:
adv_data.shape

### Checking data types and nullable values

In [None]:
adv_data.info()

In [None]:
# To specifically check for null value counts, we'll run the code below:
adv_data.isna().sum()

In [None]:
#display columns names
adv_data.columns

1. **public_reference**: A unique integer identifier for each advert.
2. **mileage**: The mileage of the vehicle (float).
3. **reg_code**: Registration code, possibly indicating the region or year of registration (categorical).
4. **standard_colour**: The color of the vehicle (categorical).
5. **standard_make**: The make of the vehicle (categorical).
6. **standard_model**: The model of the vehicle (categorical).
7. **vehicle_condition**: The condition of the vehicle (categorical) (e.g., new, used).
8. **year_of_registration**: The year the vehicle was registered.
9. **price**: The selling price (integer) of the vehicle - our target variable for prediction.
10. **body_type**: The body type of the vehicle (categorical) (e.g., SUV, Saloon).
11. **crossover_car_and_van**: A boolean indicating whether the vehicle is a crossover between a car and a van.
12. **fuel_type**: The type of fuel the vehicle uses (categorical).

### 1.1.1 Summary Statistics

In [None]:
# Summarized distribution of our dataset
adv_data.describe()

### 1.1.2 Analysis of Distributions

The summary above describes the distribution of our numerical features. It also exposes issues with `mileage` (out-of-range values), `price` (an extreme maximum) and `year_of_registration` (an implausible minimum). The minimum year of registration, 999, is not credible, especially alongside a mileage of 0 for a used car: the first car in the UK was registered in 1904.
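
The kind of range check described above can be sketched as follows. The rows and the plausibility bounds are made up for illustration (only the 1904 lower limit for the year comes from the text), so treat the thresholds as assumptions to tune rather than as values used by this notebook.

```python
import pandas as pd

# Illustrative rows only; the values are made up, not taken from adverts.csv
sample = pd.DataFrame({
    "mileage": [0.0, 45000.0, 999999.0, 12000.0],
    "year_of_registration": [999.0, 2015.0, 2012.0, 2019.0],
    "price": [7000, 8500, 3000, 9999999],
})

# Hypothetical plausibility bounds (assumptions, adjust to the domain)
bounds = {
    "mileage": (0, 500_000),
    "year_of_registration": (1904, 2020),
    "price": (100, 5_000_000),
}

# Count non-missing values that fall outside each column's bounds
out_of_range = {
    col: int((sample[col].notna() & ~sample[col].between(low, high)).sum())
    for col, (low, high) in bounds.items()
}
print(out_of_range)
```

Counting suspicious values per column gives a quick sense of how much cleaning each feature needs before any plots are drawn.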

### a. Distribution Analysis  of Categorical Features

### The `Reg_Code` Feature

In [None]:
# Analysis of distribution for the 'reg_code' feature
reg_code_description = adv_data['reg_code'].describe()

# Plotting the frequency of different 'reg_code'
plt.figure(figsize=(8, 4))
adv_data['reg_code'].value_counts().head(10).plot(kind='bar')
plt.title('Top 10 Registration Codes')
plt.xlabel('Registration Code')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()

reg_code_description

There are 72 unique registration codes in the dataset.
The most frequent registration code is '17', which appears 36,738 times.
The distribution is more concentrated amongst a few registration codes, with the top few codes appearing significantly more frequently than others.

### The `Standard_Colour` Feature

In [None]:
# Analysis of distribution for the 'standard_colour' feature
standard_colour_description = adv_data['standard_colour'].describe()

# Plotting the percentage of different 'standard_colour'
plt.figure(figsize=(12, 10))
total_ads = len(adv_data)
percentage_data = (adv_data['standard_colour'].value_counts() / total_ads * 100).sort_values(ascending=True)
ax = percentage_data.plot(kind='barh')
ax.set_title('Percentage of Different Standard Colours')
ax.set_ylabel('Standard Colour')

standard_colour_description

The dataset contains 22 unique standard colors for the vehicles.
Black is the most common color, appearing 86,287 times, followed by other popular colors like white and grey.
The distribution reveals a clear preference for certain colors in vehicles, with a few colors being significantly more common than others.

### The `Standard_Make` Feature

In [None]:
# Analysis of distribution for the 'standard_make' feature
standard_make_description = adv_data['standard_make'].describe()

# Plotting the percentage of different 'standard_make'
plt.figure(figsize=(10, 8))
total_entries = len(adv_data)
percentage_data_make = (adv_data['standard_make'].value_counts() / total_entries * 100).head(10)
ax = percentage_data_make.sort_values(ascending=True).plot(kind='barh')
ax.set_title('Top 10 Standard Makes')
ax.set_ylabel('Standard Make')

standard_make_description

The dataset includes 110 unique car makes.
BMW is the most common make, appearing 37,376 times (9.3% of all vehicles in the dataset).
The distribution reveals that certain car makes are more prevalent in the dataset, suggesting brand popularity or market dominance.
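
As a side note, the percentage shares used in these plots can also be obtained directly from `value_counts`. The sketch below uses a made-up series of makes, not the advert data.

```python
import pandas as pd

# Illustrative values only, not the real advert data
makes = pd.Series(
    ["BMW", "Audi", "BMW", "Ford", "BMW", "Audi", "Ford", "Kia", "BMW", "Audi"]
)

# normalize=True returns proportions, so multiplying by 100 gives percentages
share = makes.value_counts(normalize=True).mul(100).round(1)
print(share.head(3))
```

One difference to keep in mind: `value_counts` ignores missing values by default, so these shares are relative to the non-missing entries, whereas dividing by `len(adv_data)` is relative to all rows.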

### The `Standard_Model` Feature

In [None]:
# Analysis of distribution for the 'standard_model' feature
standard_model_description = adv_data['standard_model'].describe()

# Plotting the percentage of different 'standard_model'
plt.figure(figsize=(10, 8))
total_entries_model = len(adv_data)
percentage_data_model = (adv_data['standard_model'].value_counts() / total_entries_model * 100).head(10)
ax = percentage_data_model.sort_values(ascending=True).plot(kind='barh')
ax.set_title('Top 10 Standard Models')
ax.set_ylabel('Standard Model')

standard_model_description

The dataset comprises 1,168 unique vehicle models.
The Volkswagen Golf is the most represented model, with 11,583 entries.
Similar to the standard_make, this distribution shows a preference for certain models, likely reflecting market trends and popularity.

### The `Vehicle_Condition` Feature

In [None]:
adv_data['vehicle_condition'].value_counts()

In [None]:
# Analysis of distribution for the 'vehicle_condition' feature
vehicle_condition_description = adv_data['vehicle_condition'].describe()

# Plotting a pie chart for the frequency of different 'vehicle_condition'
plt.figure(figsize=(6, 6))
condition_counts = adv_data['vehicle_condition'].value_counts()
colors = ['#66b3ff', '#99ff99']  # Two different colors for 'New' and 'Used'
plt.pie(condition_counts, labels=condition_counts.index, autopct='%1.1f%%', colors=colors, startangle=90)
plt.title('Distribution of Vehicle Conditions')
plt.show()


There are only two categories in this feature: "NEW" and "USED".
"USED" vehicles overwhelmingly dominate the dataset, with 370,756 instances (92.2% of all vehicles).
This distribution is very skewed, reflecting the prevalence of used vehicles in the market or in the dataset.

### The `Fuel_Type` Feature

In [None]:
# Analysis of distribution for the 'fuel_type' feature
fuel_type_description = adv_data['fuel_type'].describe()

# Plotting the percentage of different 'fuel_type'
plt.figure(figsize=(15, 10))
total_ads = len(adv_data)
percentage_data = (adv_data['fuel_type'].value_counts() / total_ads * 100).sort_values(ascending=True)
ax = percentage_data.plot(kind='barh')
ax.set_title('Percentage of Different Fuel Type')
ax.set_ylabel('Fuel Type')


fuel_type_description

### The `Body_Type` Feature

In [None]:
# Analysis of distribution for the 'body_type' feature
body_type_description = adv_data['body_type'].describe()

# Plotting the percentage of different 'body_type'
plt.figure(figsize=(15, 10))
total_ads = len(adv_data)
percentage_data = (adv_data['body_type'].value_counts() / total_ads * 100).sort_values(ascending=True)
ax = percentage_data.plot(kind='barh')
ax.set_title('Percentage of Different Body Type')
ax.set_ylabel('Body Type')

body_type_description

The dataset contains 16 unique vehicle body types.
Hatchback is the most common body type, appearing 167,314 times (41.62% of all vehicles in the dataset), followed by other body types such as SUV and Saloon.
The distribution reveals a clear preference for certain vehicle body types, with a few body types being significantly more common than others.

### The `crossover_car_and_van` Feature

In [None]:
adv_data['crossover_car_and_van'].value_counts()

In [None]:
# Analysis of distribution for the 'crossover_car_and_van' feature
crossover_car_and_van_description = adv_data['crossover_car_and_van'].describe()

# Plotting a pie chart for the frequency of different 'crossover_car_and_van' values
plt.figure(figsize=(6, 6))
condition_counts = adv_data['crossover_car_and_van'].value_counts()
colors = ['#00cc00', '#ff6666']  # Two different colors for 'True' and 'False'
plt.pie(condition_counts, labels=condition_counts.index, autopct='%1.1f%%', colors=colors, startangle=90)
plt.title('Distribution of Crossover Car and Van')
plt.show()

crossover_car_and_van_description

There are only two categories in this feature: "True" and "False".
Vehicles that are not car-and-van crossovers overwhelmingly dominate the dataset, with 400,205 instances (99.6% of all vehicles).
This distribution is very skewed, reflecting the prevalence of non-crossover vehicles in the market or in the dataset.

### The `public_reference` Feature

The `public_reference` feature is a unique identifier for each advert row. Its distribution won't contribute to predicting vehicle prices, but it is still crucial for data management: we'll use it to look for duplicates in the data processing step in Section 1.3 below.

### The `Mileage` Feature

In [None]:
# Analysis of distribution for 'mileage' feature
mileage_description = adv_data['mileage'].describe()

# Plotting the distribution of the 'mileage' feature
plt.figure(figsize=(8, 3))
sns.histplot(adv_data['mileage'].dropna(), kde=True, bins=30)
plt.title('Distribution of Mileage')
plt.xlabel('Mileage')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()

mileage_description

In [None]:
skewness = adv_data['mileage'].skew()
skewness

Mileage is a key factor in determining the value of a used car. Lower mileage generally indicates a higher value.
The mileage feature ranges from 0 to 999,999 miles, with a mean of around 37,744 miles.
The distribution is right-skewed: most of the vehicles in the adverts have lower mileage, while a few have very high mileage. Vehicles with extremely high mileage (close to 1 million miles) could be outliers or data entry errors.
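
To illustrate what right skew means numerically, here is a minimal sketch on synthetic data. It assumes a log-normal shape with a fixed seed, which is a stand-in and not the real mileage column. The `log1p` step is not something this notebook applies; it only shows how a long right tail affects the skewness statistic.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "mileage" values (assumption: log-normal shape, fixed seed)
rng = np.random.default_rng(0)
mileage = pd.Series(rng.lognormal(mean=10, sigma=0.8, size=5_000))

raw_skew = mileage.skew()             # positive for a long right tail
log_skew = np.log1p(mileage).skew()   # close to zero once the tail is compressed

print(f"mean={mileage.mean():.0f}, median={mileage.median():.0f}")
print(f"skew raw={raw_skew:.2f}, skew after log1p={log_skew:.2f}")
```

In a right-skewed distribution the few very large values pull the mean above the median, which is why the mean mileage alone can overstate the mileage of a typical advert.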

### The `Price` Feature

In [None]:
# Analysis for 'Price' feature
price_description = adv_data['price'].describe()

# Plotting the distribution of 'price'
plt.figure(figsize=(10, 6))
sns.histplot(adv_data['price'].dropna(), bins=30, kde=True)
plt.title('Distribution of the Vehicle Prices')
plt.xlabel('Price')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()


price_description

In [None]:
skewness = adv_data['price'].skew()
print(f"Skewness: {skewness}")

The distribution of prices is right-skewed, indicating that the majority of the vehicles are not really expensive. It is essential to note that this conclusion provides an overview of all adverts and may not hold true for all customers. The perception of a vehicle being 'not really expensive' is relative, as what one customer considers expensive may not be considered expensive by another.

### The `Year_of_Registration` Feature

In [None]:
adv_data['year_of_registration'].value_counts()

Applying the `value_counts()` method to the `year_of_registration` feature reveals some irregular values, such as `1010`, `1016`, `1063` and `1015`, which are likely the result of input errors. The years also appear as floats, for example `2017.0`.

In [None]:
# Analysis for year of registration feature
adv_data['year_of_registration'].describe()

To understand the extent of the suspected input errors, we also need to check for missing values in the `year_of_registration` feature and handle them accordingly.

In [None]:
skewness = adv_data['year_of_registration'].skew()
print(f"Skewness: {skewness}")

In [None]:
adv_data['year_of_registration'].isnull().sum()

## 1.2 Analysis of Predictive Power of Features

For this part, we need to analyze how each feature might influence the selling price. We can do this by:

**Calculating the Correlation Coefficients**: For numeric features, we can compute correlation coefficients with the price.

**Visualizing Relationships**: We can create visualizations like box plots for categorical features to see how they relate to the price.

### Numerical Feature

In [None]:
# Calculating the correlation between mileage and price
correlation = adv_data['price'].corr(adv_data['mileage'])

# Printing the correlation coefficient
print(f"Correlation between 'price' and 'mileage': {correlation}")

In [None]:
# Creating a correlation matrix
correlation_matrix = adv_data[['price', 'mileage']].corr()

# Plotting the heatmap
plt.figure(figsize=(6, 4))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Correlation Heatmap - Price vs Mileage')
plt.show()

The mileage seems to have a negative correlation with the price, which is expected as higher mileage typically decreases a car's value.
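
Because both price and mileage are skewed, a rank-based (Spearman) correlation can complement the Pearson coefficient. The sketch below uses made-up values and computes Spearman as the Pearson correlation of the ranks. It illustrates the idea and is not a result from the advert data.

```python
import pandas as pd

# Made-up values: price falls as mileage rises, but not linearly
df = pd.DataFrame({
    "mileage": [1_000, 5_000, 20_000, 60_000, 150_000, 400_000],
    "price":   [30_000, 24_000, 15_000, 9_000, 4_000, 1_500],
})

# Pearson measures the strength of a linear relationship
pearson = df["price"].corr(df["mileage"])

# Spearman correlation equals the Pearson correlation of the ranks
spearman = df["price"].rank().corr(df["mileage"].rank())

print(f"Pearson: {pearson:.2f}, Spearman: {spearman:.2f}")
```

Here the relationship is perfectly monotonic, so the rank correlation is exactly -1, while the Pearson value is weaker because the decline is not a straight line.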

In [None]:
# Plotting box plots for key categorical features against price
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(15, 10))

# Box plot for Standard Make vs Price
sns.boxplot(ax=axes[0, 0], data=adv_data, x='standard_make', y='price')
axes[0, 0].set_title('Standard Make vs Price')
axes[0, 0].tick_params(axis='x', rotation=90)

# Box plot for Vehicle Condition vs Price
sns.boxplot(ax=axes[0, 1], data=adv_data, x='vehicle_condition', y='price')
axes[0, 1].set_title('Vehicle Condition vs Price')

# Box plot for Body Type vs Price
sns.boxplot(ax=axes[1, 0], data=adv_data, x='body_type', y='price')
axes[1, 0].set_title('Body Type vs Price')
axes[1, 0].tick_params(axis='x', rotation=90)

# Box plot for Fuel Type vs Price
sns.boxplot(ax=axes[1, 1], data=adv_data, x='fuel_type', y='price')
axes[1, 1].set_title('Fuel Type vs Price')
axes[1, 1].tick_params(axis='x', rotation=90)

plt.tight_layout()
plt.show()

**Standard Make vs Price**: Different makes (brands) have distinct price distributions, indicating that the make of a car is a significant predictor of its price. Some brands show a higher median and range of prices, suggesting a premium segment.

**Vehicle Condition vs Price**: The condition of the vehicle (e.g., new, used) shows different price distributions, with new vehicles typically priced higher, as expected.

**Body Type vs Price**: Different body types have varying price distributions. This suggests that the body type of a vehicle influences its selling price.

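**Fuel Type vs Price**: Price distributions also vary across fuel types, suggesting that the type of fuel a vehicle uses plays a role in its price.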

In [None]:
# Check the number of unique values for standard_model
unique_models_count = adv_data['standard_model'].nunique()

# Creating plots
fig, axes = plt.subplots(nrows=2, ncols=1, figsize=(6, 7))

# Box plot for Standard Colour vs Price
sns.boxplot(ax=axes[0], data=adv_data, x='standard_colour', y='price')
axes[0].set_title('Standard Colour vs Price')
axes[0].tick_params(axis='x', rotation=90)

# Box plot for Crossover Car and Van vs Price
sns.boxplot(ax=axes[1], data=adv_data, x='crossover_car_and_van', y='price')
axes[1].set_title('Crossover Car and Van vs Price')

plt.tight_layout()
plt.show()

unique_models_count

**Standard Colour**: The box plot indicates that there are some variations in price across different colours. However, the impact of colour on price doesn't appear to be as pronounced as other factors like make or model. This could be due to colour being a more subjective preference.

**Crossover Car and Van**: There are differences in price distributions between vehicles that are and are not crossovers between cars and vans. This suggests that this feature might have some influence on the price.

From the above, car make, condition, body type, and fuel type are all factors that influence a vehicle's selling price. Different makes have distinct price distributions with some belonging to a premium segment. New cars are generally priced higher than used ones, and different body types have varying price distributions. Lastly, the type of fuel a vehicle uses plays a role in its price.
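
A numerical companion to the box plots is a per-category summary of price. The sketch below uses a handful of made-up rows. The same pattern would apply to any of the categorical columns.

```python
import pandas as pd

# Illustrative rows only
df = pd.DataFrame({
    "vehicle_condition": ["NEW", "USED", "USED", "NEW", "USED", "USED"],
    "price": [30_000, 8_000, 12_000, 26_000, 5_000, 10_000],
})

# Count and median price per category, highest median first
summary = (
    df.groupby("vehicle_condition")["price"]
      .agg(["count", "median"])
      .sort_values("median", ascending=False)
)
print(summary)
```

The median is used rather than the mean because it is less sensitive to the extreme prices seen earlier.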

## 1.3 Data Processing for Data Exploration and Visualisation

In [None]:
# Looking out for wrong data types
adv_data.info()

We can observe from the above that 6 columns (Mileage, Reg Code, Standard Colour, Year of Registration, Body Type, and Fuel Type) contain missing values. Year of Registration and Registration Code have substantial numbers of missing values, 33311 and 31857 respectively, while Mileage has the fewest.

#### Checking for duplicates

The `public_reference` column will be used to check for duplicate entries, as it is expected to be unique for every advert.

In [None]:
# Checking for duplicates based on the 'public_reference' column
duplicates = adv_data.duplicated(subset=['public_reference'], keep=False)

# Counting the number of duplicate rows
num_duplicates = duplicates.sum()
num_duplicates

The dataset has now been confirmed to contain no duplicates

#### Preparing 'Year_of_registration' for Data Exploration and Visualisation

To make the exploration and visualisation meaningful, we'll handle the missing values in the `year_of_registration` feature. Every missing value (NaN) is temporarily replaced with -1, and the years are converted to integers so that they appear as `2017` rather than `2017.0`.

In [None]:
# Temporarily replacing missing years with the placeholder -1
adv_data['year_of_registration'] = adv_data['year_of_registration'].fillna(-1)

# Converting the 'year_of_registration' column to integer
adv_data['year_of_registration'] = adv_data['year_of_registration'].astype(int)

In [None]:
adv_data['year_of_registration'].unique()

In [None]:
adv_data[adv_data['year_of_registration'] == 1515]

After examining the untruncated `year_of_registration` values above, 12 different years were identified as input errors: `1007`, `999`, `1009`, `1515`, `1008`, `1006`, `1017`, `1018`, `1010`, `1063`, `1016`, and `1015`.

In [None]:
adv_data[adv_data['year_of_registration'] == 999]

In [None]:
adv_data[adv_data['year_of_registration'] == 1063]

From domain knowledge, most of these input errors can be handled by changing the leading `1` to `2` in years such as `1007`, `1009`, `1010` and `1017`, on the justifiable assumption that the intended years are `2007`, `2009`, `2010` and `2017`. The year `1515` could similarly be read as `2015`, which is a reasonable assumption because the `Audi A4 Avant` was first introduced in `November 1995`. The year `999` might be assumed to be `1999`, but both the `Mazda3` and the `BMW Z4` were first introduced in `2003`, so it will be dropped. The year `1063` might be assumed to be `1963`, but the `Smart fortwo` was first launched in `1998`, so it will also be dropped. The guiding principle is that a vehicle cannot be registered before it was launched.

In the end, only the years `1007`, `1009`, `1008`, `1006`, `1017`, `1018`, `1010`, `1016`, and `1015` will be rectified. The years `999` and `1063` will be dropped, and `1515` will be dropped as well despite its plausible reading as `2015`.
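
The rule described above (swap the leading `1` for a `2`, then reject any year that precedes the model's launch) could be expressed as a small function. This is only a sketch: `correct_year`, `LAUNCH_YEAR`, `LATEST_YEAR` and the model names are hypothetical, and the launch years are placeholders rather than verified data. The notebook itself applies a fixed mapping instead.

```python
import pandas as pd

# Hypothetical launch years for illustration; real values need a trusted source
LAUNCH_YEAR = {"Model A": 2005, "Model B": 2012}
LATEST_YEAR = 2020  # assumed most recent year in the data


def correct_year(year, model):
    """Swap a leading '1' for '2' in years such as 1017; return None if implausible."""
    if 1904 <= year <= LATEST_YEAR:
        return year                      # already plausible, keep as is
    if 1000 <= year < 1100:
        candidate = year + 1000          # e.g. 1017 -> 2017
        launch = LAUNCH_YEAR.get(model)
        if candidate <= LATEST_YEAR and (launch is None or candidate >= launch):
            return candidate
    return None                          # no credible correction: flag for dropping


rows = pd.DataFrame({
    "standard_model": ["Model A", "Model B", "Model A", "Model B"],
    "year_of_registration": [1017, 1010, 999, 2015],
})
rows["corrected"] = [
    correct_year(y, m)
    for y, m in zip(rows["year_of_registration"], rows["standard_model"])
]
print(rows)
```

Rows where `corrected` is missing are the ones with no credible correction, which mirrors the decision to drop years such as `999` and `1063`.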

In [None]:
# Rectifying the specified years
year_mapping = {
    1007: 2007,
    1009: 2009,
    1008: 2008,
    1006: 2006,
    1017: 2017,
    1018: 2018,
    1010: 2010,
    1016: 2016,
    1015: 2015
}

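# Assigning the result back avoids relying on an in-place change to a column selection
adv_data['year_of_registration'] = adv_data['year_of_registration'].replace(year_mapping)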

# Dropping rows with the specified years
years_to_drop = [999, 1515, 1063]
adv_data = adv_data[~adv_data['year_of_registration'].isin(years_to_drop)]

# Resetting the index after dropping rows
adv_data.reset_index(drop=True, inplace=True)

Now that the `year_of_registration` feature has been rectified, we can check its true distribution.

In [None]:
# Analysis for 'year_of_registration' feature, excluding the -1 placeholder for missing values
valid_years = adv_data.loc[adv_data['year_of_registration'] != -1, 'year_of_registration']
year_of_registration_description = valid_years.describe()

# Plotting the distribution of 'year_of_registration'
plt.figure(figsize=(10, 6))
sns.histplot(valid_years, bins=30, kde=True)
plt.title('Distribution of Year of Registration')
plt.xlabel('Year of Registration')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()


year_of_registration_description

In [None]:
# Skewness of the valid years only (placeholder -1 excluded)
skewness = adv_data.loc[adv_data['year_of_registration'] != -1, 'year_of_registration'].skew()
print(f"Skewness: {skewness}")

The distribution is left-skewed, indicating that newer models are more common in the dataset.

# 2. Data Pre-processing

### Dealing with missing values in 'Year_of_Registration'

Recall that in Section 1.3 the missing values in the `year_of_registration` feature were replaced with `-1` to simplify the analysis of the feature's distribution. To handle all missing values in the dataset properly, that change must now be reverted by replacing every `-1` in the feature with `NaN`.

In [None]:
# Replacing the -1 placeholder with NaN in the 'year_of_registration' column
# (np.nan keeps the column numeric; pd.NA can leave it with an object dtype)
adv_data['year_of_registration'] = adv_data['year_of_registration'].replace(-1, np.nan)


#### Checking for missing values

In [None]:
adv_data.isnull().sum()

We have our missing value back in the year of registration column

### Getting the percentage of missing values

In [None]:
# Getting the percentage of missing values
null_counts = adv_data.isnull().sum()
columns_with_nulls = null_counts[null_counts > 0]
(columns_with_nulls / len(adv_data)) * 100

### Filling missing values in year of registration for New Vehicles

In [None]:
# Replacing missing values in 'year_of_registration' with 2020 for 'NEW' vehicles
# First, let's identify rows where 'year_of_registration' is missing and 'vehicle_condition' is 'NEW'
condition_new_mask = (adv_data['vehicle_condition'] == 'NEW') & (adv_data['year_of_registration'].isnull())

# Replacing missing values in 'year_of_registration' with 2020 for these rows
adv_data.loc[condition_new_mask, 'year_of_registration'] = 2020

adv_data.head(2)

`Justification`

1. The year of registration indicates the vehicle's age and is often used to determine its price. Imputing missing values in this feature is difficult because it does not follow a normal distribution, so to preserve the integrity of the dataset, rows with a missing year of registration are dropped for the 'USED' category.
2. Dropping every row with a missing `year_of_registration` initially seemed best. However, that erases the new-vehicle data under vehicle condition, which is bad for modelling. This was discovered in Section 2.2, and the approach had to be revisited: for 'NEW' vehicles the missing values are replaced with a placeholder year, 2020, which is expected to be the most recent year in the dataset.


In [None]:
# Dropping the remaining rows with missing values in the 'year_of_registration' column
adv_data = adv_data.dropna(subset=['year_of_registration'])

In [None]:
adv_data.isnull().sum()
adv_data.head(1)

In [None]:
# checking for any other missing values in the 'year_of_registration' column
adv_data['year_of_registration'].isnull().sum()

In [None]:
adv_data.isnull().sum()

### Filling missing values for categorical features

In [None]:
# Creating grouping for object variables
adv_object = adv_data.select_dtypes(include=['object'])

# Get the object columns
adv_object = adv_object.columns

# I will now get the mode and fill all missing values for categorical features with their mode using the following expression
adv_data[adv_object] = adv_data[adv_object].apply(lambda x: x.fillna(x.value_counts().index[0]))

In [None]:
adv_data.info()

As we can see, we have only mileage left to deal with before we proceed

### Dealing with Mileage

Converting mileage column to integer datatype so we can calculate the mean and fill the missing values with it.

In [None]:
#converting mileage to integer for ease of classification
adv_data['mileage']= adv_data['mileage'].astype("Int64")

In [None]:
adv_data.head(1)

### Filling missing values for mileage with the mean value

Since mileage is now the only integer column with missing values based on our check above, we'll calculate its mean and use it to fill the missing integer values. This will address the missing values for mileage in the dataset.

In [None]:
# Selecting the nullable-integer columns, as previously done for the categorical columns
int_columns = adv_data.select_dtypes(include=['Int64']).columns

# Filling missing values in each integer column with its mean, rounded so the column stays integer
adv_data[int_columns] = adv_data[int_columns].apply(lambda x: x.fillna(int(round(x.mean()))))

### Now, we'll confirm no existence of missing value(s) in our dataset

In [None]:
#checking the missing values are now filled
adv_data.isnull().sum().sort_values()

### Checking for Outliers

Besides the price feature the mileage is the only true numerical feature in the dataset, hence it is essential to check the mileage feature for the presence of outliers.

In [None]:
# Creating a box plot for the 'mileage' column
plt.figure(figsize=(8, 6))
sns.boxplot(y=adv_data['mileage'])
plt.title('Box Plot of the Mileage Feature')
plt.show()

We observed that extreme values in the mileage feature are not necessarily input errors or outliers, as the dataset comprises both new and old models and both used and new vehicles, which can justify their presence; hence they won't be modified. Although the model may treat these values as noise, they will be scaled in the feature engineering section.
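
For reference, the points a box plot draws beyond its whiskers follow the 1.5 × IQR rule by default. The sketch below applies that rule to a few made-up mileage values to show how such points can be counted rather than only viewed.

```python
import pandas as pd

# Made-up mileage values with one extreme entry
mileage = pd.Series([5_000, 12_000, 20_000, 28_000, 35_000, 42_000, 60_000, 999_999])

# Interquartile range and the 1.5 * IQR fences
q1, q3 = mileage.quantile([0.25, 0.75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the fences are the ones drawn as individual points
flagged = mileage[(mileage < lower_fence) | (mileage > upper_fence)]
print(f"fences: [{lower_fence:.0f}, {upper_fence:.0f}], flagged: {flagged.tolist()}")
```

Being flagged by this rule does not make a value wrong. It only marks it as unusual relative to the bulk of the data, which is consistent with the decision above to keep such rows.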

In [None]:
plt.subplots(figsize=(8,6))
sns.boxplot(x=adv_data['year_of_registration']);

The first car in the UK was registered in 1904, yet the dataset contains some cars registered before then. This could be due to data entry errors or other factors. We'll exclude them from our analysis for later investigation. It was also observed that most of the advertised cars were registered in the 2000s or later.

### 2.2 Feature Engineering

### Dropping 'public_reference' and 'reg_code' columns

Since the public reference has no influence on the price of the vehicle, we'll drop the column. We'll also drop the reg_code column, as it is completely filled with erroneous entries that follow no pattern we could use to transform it.

In [None]:
# Dropping the 'public_reference' and 'reg_code' columns
adv_data = adv_data.drop(['public_reference', 'reg_code'], axis=1)

In [None]:
adv_data.head(2)

### Creating Additional Feature 'Age'

Since the year of registration can be used as a proxy for the age of the vehicle, an `Age` feature will be derived from the `year_of_registration` feature. The maximum year in our dataset will be used as the reference year, assuming the vehicles were advertised on Autotrader's website in that same year.

In [None]:
##Checking most recent year
adv_year = adv_data['year_of_registration'].max()
adv_year

In [None]:
#Creating the new feature
adv_data['Age']= adv_year - adv_data.year_of_registration
adv_data.head()

As we have introduced a new feature for the age of the vehicle, we'll now drop the year_of_registration column, since Age now represents it.

In [None]:
# Dropping the 'year_of_registration' column
adv_data = adv_data.drop('year_of_registration', axis=1)

#Confirming the column is dropped
adv_data.head(2)

### Dropping the 'crossover_car_and_van' column

Because the 'crossover_car_and_van' feature consists almost entirely of the value False, it will be dropped to prevent bias on the minority class.

In [None]:
adv_data['crossover_car_and_van'].value_counts()

In [None]:
adv_data = adv_data.drop(['crossover_car_and_van'], axis=1)
adv_data.head(2)

### 2.3 Subsetting

### Some feature selections and row sampling

### Top 5 most expensive cars with their age and mileage

In [None]:
# Top 5 most expensive cars with their age and mileage
most_expensive = adv_data.sort_values(['price', 'Age'], ascending=[False, True]).head(5)
sns.barplot(data=most_expensive, x='Age', y='price', hue='mileage')
most_expensive

## List of New Vehicles with mileage above zero

In [None]:
#Viewing NEW cars with mileage above zero
adv_data.loc[(adv_data['vehicle_condition']=='NEW') & (adv_data['mileage']>0)].sort_values(by=['mileage'] , ascending=False)

The above produces a list of new vehicles that have already been driven. This could be due to testing or transporting the vehicles for sale.

### Generating a unique list of vehicle makes

In [None]:
#counting unique vehicle make
adv_data['standard_make'].nunique()

In [None]:
# A quick view of our vehicle makes
adv_data['standard_make'].unique()

In [None]:
# Counting unique values for standard colour
adv_data['standard_colour'].nunique()

In [None]:
# Viewing unique values for vehicle condition
adv_data['vehicle_condition'].unique()

In [None]:
#taking the count of vehicle condition
adv_data['vehicle_condition'].nunique()

In [None]:
adv_data['body_type'].nunique()

In [None]:
adv_data['body_type'].unique()

In [None]:
adv_data['fuel_type'].unique()

In [None]:
adv_data['Age'].unique()

### Transforming Categorical Variables (Label Encode)

### Importing Required Libraries

In [None]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Columns to be label encoded
columns_to_encode = ['standard_colour', 'vehicle_condition', 'body_type', 'fuel_type', 'standard_model', 'standard_make']

# Initializing the LabelEncoder
label_encoder = LabelEncoder()

# Applying Label Encoding to each column
for column in columns_to_encode:
    adv_data[column] = label_encoder.fit_transform(adv_data[column].astype(str)).astype(object)

In [None]:
adv_data.head(2)

In [None]:
# Columns that were label encoded
encoded_columns = ['standard_colour', 'vehicle_condition', 'body_type', 'fuel_type', 'standard_model', 'standard_make']

# Checking the data types of these columns
encoded_columns_dtypes = adv_data[encoded_columns].dtypes

encoded_columns_dtypes
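
One design point worth illustrating: when a single `LabelEncoder` is refitted for every column, only the last column's mapping survives. A possible variation, sketched below on made-up rows, keeps one fitted encoder per column so that codes can be translated back to labels later. The `encoders` dictionary is an illustrative name, not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Illustrative rows only
df = pd.DataFrame({
    "fuel_type": ["Petrol", "Diesel", "Petrol", "Electric"],
    "body_type": ["SUV", "Hatchback", "Saloon", "SUV"],
})

# One fitted encoder per column, kept so that codes can be decoded later
encoders = {}
for column in ["fuel_type", "body_type"]:
    encoders[column] = LabelEncoder()
    df[column] = encoders[column].fit_transform(df[column].astype(str))

# Recover the original labels from the integer codes
decoded = encoders["fuel_type"].inverse_transform(df["fuel_type"])
print(df)
print(list(decoded))
```

The integer codes follow the sorted order of the labels and carry no inherent meaning, which is worth remembering when a distance-based model such as KNN consumes them.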

## Splitting the dataset

The dataset will be split into training and test sets to prevent data leakage and ensure reliable results. The split is stratified on the vehicle condition feature so that NEW and USED vehicles are proportionally represented in both the training and test sets.

In [None]:
from sklearn.model_selection import train_test_split

# Separating features and target
X = adv_data.drop('price', axis=1)  # features
y = adv_data['price']                # target

# Perform the split with stratification based on 'vehicle_condition'
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=adv_data['vehicle_condition'])
X_train.head(2)
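
To see what stratification does, the sketch below splits a synthetic, imbalanced frame (10% NEW, 90% USED, an assumed ratio chosen for round numbers) and compares the class shares in each part.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic data with an imbalance similar in spirit to the adverts: 10% NEW, 90% USED
df = pd.DataFrame({
    "vehicle_condition": ["NEW"] * 100 + ["USED"] * 900,
    "price": range(1000),
})

train, test = train_test_split(
    df, test_size=0.3, random_state=42, stratify=df["vehicle_condition"]
)

# Share of each condition in the two parts
train_share = train["vehicle_condition"].value_counts(normalize=True)
test_share = test["vehicle_condition"].value_counts(normalize=True)
print(train_share["NEW"], test_share["NEW"])
```

Both parts keep roughly the same NEW share as the full frame, so evaluation on the test set is not distorted by an unlucky draw of the minority class.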

### Feature Scaling

In [None]:
from sklearn.preprocessing import StandardScaler

# Creating a StandardScaler instance
scaler = StandardScaler()

# Defining the numerical columns to be scaled
numerical_cols =['mileage', 'Age']

# Fitting  the scaler on the numerical columns of the training data
X_train[numerical_cols] = scaler.fit_transform(X_train[numerical_cols])

# Transforming the numerical columns of the test set using the parameters learned from the training set
X_test[numerical_cols] = scaler.transform(X_test[numerical_cols])
X_train.head(2)
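
The reason for fitting the scaler on the training data only can be shown on a tiny example with made-up mileage values: the test rows are transformed with the training mean and standard deviation, so no information from the test set leaks into preprocessing.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up mileage values
train = np.array([[10_000.0], [20_000.0], [30_000.0], [40_000.0]])
test = np.array([[25_000.0], [100_000.0]])

scaler = StandardScaler().fit(train)      # learns mean and std from the training rows only
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)      # reuses the training mean and std

print(scaler.mean_, scaler.scale_)
print(test_scaled.ravel())
```

A test value equal to the training mean maps to 0, while an unusually high mileage maps to a large positive number. Scaling therefore changes the units but does not remove extreme values.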

# 3. Model Building

## 3.1 Algorithm Selection, Model Instantiation and Configuration

### K-Nearest Neighbour Regression (KNN)

In [None]:
from sklearn.neighbors import KNeighborsRegressor

# Creating a KNeighborsRegressor instance
knn_regressor = KNeighborsRegressor(n_neighbors=5)

# Training the KNN regressor on the training set
knn_regressor.fit(X_train, y_train)

In [None]:
# Printing the R² score on the test and training sets (score() returns R² for regressors)
knn_regressor.score(X_test, y_test), knn_regressor.score(X_train, y_train)

#### Comparing the test and predicted prices

In [None]:
y_pred = knn_regressor.predict(X_test)
np.set_printoptions(precision=2)

# Converting to numpy arrays and reshape
y_pred_array = y_pred.reshape(len(y_pred), 1) if not isinstance(y_pred, pd.Series) else y_pred.to_numpy().reshape(-1, 1)
y_test_array = y_test.to_numpy().reshape(-1, 1)

# Concatenating and printing predicted price and the actual price for easy comparison
print(np.concatenate((y_pred_array, y_test_array), axis=1))

#### Evaluating the K-NN Regression Model

In [None]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Calculating the evaluation metrics
mse_knn_unopt = mean_squared_error(y_test, y_pred)
mae_knn_unopt = mean_absolute_error(y_test, y_pred)
r2_knn_unopt = r2_score(y_test, y_pred)

# Calculating the adjusted R²
n = len(y_test)  # Number of observations
p = X_test.shape[1]  # Number of independent variables
adj_r2_knn_unopt = 1 - (1 - r2_knn_unopt) * (n - 1) / (n - p - 1)

# Printing the evaluation metrics
print(f"Mean Squared Error: {mse_knn_unopt}")
print(f"Mean Absolute Error (MAE): {mae_knn_unopt}")
print(f"R² Score: {r2_knn_unopt}")
print(f"Adjusted R² Score: {adj_r2_knn_unopt}")
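The adjusted R² calculation above is repeated for every model in this notebook. A small helper keeps the formula, 1 - (1 - R²)(n - 1)/(n - p - 1), in one place. This is a sketch only: the function name `adjusted_r2` and the sample arrays are hypothetical.

```python
import numpy as np
from sklearn.metrics import r2_score


def adjusted_r2(y_true, y_pred, n_features):
    """Return adjusted R² = 1 - (1 - R²)(n - 1)/(n - p - 1)."""
    n = len(y_true)  # Number of observations
    r2 = r2_score(y_true, y_pred)
    return 1 - (1 - r2) * (n - 1) / (n - n_features - 1)


# Hypothetical values, for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0, 11.0, 13.0])
y_hat = np.array([2.5, 5.5, 7.0, 9.5, 10.5, 13.0])
print(adjusted_r2(y_true, y_hat, n_features=2))
```

With the notebook's variables, the call would look like `adjusted_r2(y_test, y_pred, X_test.shape[1])`.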

### Decision Tree Regression

In [None]:
from sklearn.tree import DecisionTreeRegressor

# Creating a decision tree regressor instance
decision_tree_regressor = DecisionTreeRegressor(random_state=42)

# Training the decision tree regressor on the training set
decision_tree_regressor.fit(X_train, y_train)

In [None]:
# Printing the R² score on the test and training sets (score() returns R² for regressors)
decision_tree_regressor.score(X_test, y_test), decision_tree_regressor.score(X_train, y_train)

#### Comparing the test and predicted prices

In [None]:
# Making predictions on the test set
y_pred = decision_tree_regressor.predict(X_test)
np.set_printoptions(precision=2)

# Converting to numpy arrays and reshape
y_pred_array = y_pred.reshape(len(y_pred), 1) if not isinstance(y_pred, pd.Series) else y_pred.to_numpy().reshape(-1, 1)
y_test_array = y_test.to_numpy().reshape(-1, 1)

# Concatenating and printing predicted price and the actual price for easy comparison
print(np.concatenate((y_pred_array, y_test_array), axis=1))

#### Evaluating the Decision Tree Regression Model

In [None]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Calculating the evaluation metrics
mse_dt_unopt = mean_squared_error(y_test, y_pred)
mae_dt_unopt = mean_absolute_error(y_test, y_pred)
r2_dt_unopt = r2_score(y_test, y_pred)

# Calculating the adjusted R²
n = len(y_test)  # Number of observations
p = X_test.shape[1]  # Number of independent variables
adj_r2_dt_unopt = 1 - (1 - r2_dt_unopt) * (n - 1) / (n - p - 1)

# Printing the evaluation metrics
print(f"Mean Squared Error (MSE): {mse_dt_unopt}")
print(f"Mean Absolute Error (MAE): {mae_dt_unopt}")
print(f"R² Score: {r2_dt_unopt}")
print(f"Adjusted R² Score: {adj_r2_dt_unopt}")

### Linear Regression

In [None]:
from sklearn.linear_model import LinearRegression

# Creating a LinearRegression instance
linear_regressor = LinearRegression()

# Training the linear regression model on the training set
linear_regressor.fit(X_train, y_train)

In [None]:
# Printing the R² score on the test and training sets (score() returns R² for regressors)
linear_regressor.score(X_test, y_test), linear_regressor.score(X_train, y_train)

#### Comparing the test and predicted prices

In [None]:
y_pred = linear_regressor.predict(X_test)
np.set_printoptions(precision=2)

# Converting to numpy arrays and reshape
y_pred_array = y_pred.reshape(len(y_pred), 1) if not isinstance(y_pred, pd.Series) else y_pred.to_numpy().reshape(-1, 1)
y_test_array = y_test.to_numpy().reshape(-1, 1)

# Concatenating and printing predicted price and the actual price for easy comparison
print(np.concatenate((y_pred_array, y_test_array), axis=1))

#### Evaluating the Linear Regression Model

In [None]:
# Calculating the evaluation metrics
mse_linear_unopt = mean_squared_error(y_test, y_pred)
mae_linear_unopt = mean_absolute_error(y_test, y_pred)
r2_linear_unopt = r2_score(y_test, y_pred)

# Calculating the adjusted R²
n = len(y_test)  # Number of observations
p = X_test.shape[1]  # Number of independent variables
adj_r2_linear_unopt = 1 - (1 - r2_linear_unopt) * (n - 1) / (n - p - 1)

# Printing the evaluation metrics
print(f"Mean Squared Error: {mse_linear_unopt}")
print(f"Mean Absolute Error (MAE): {mae_linear_unopt}")
print(f"R² Score: {r2_linear_unopt}")
print(f"Adjusted R² Score: {adj_r2_linear_unopt}")

## 3.2 Grid Search, and Model Ranking and Selection

## Grid Search for Hyperparameter Tuning

### Decision Tree Regression

In [None]:
from sklearn.model_selection import GridSearchCV

# Parameter grid to search
param_grid = {
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': [None, 'sqrt', 'log2']  # 'auto' was removed in scikit-learn 1.3; None uses all features
}

# Grid search with cross-validation
grid_search = GridSearchCV(DecisionTreeRegressor(random_state=42), param_grid, cv=5, n_jobs=-1, verbose=1)
grid_search.fit(X_train, y_train)

# Best parameters and best score
print("Best parameters:", grid_search.best_params_)
print("Best score:", grid_search.best_score_)
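Two details of `GridSearchCV` are worth keeping in mind when reading the output above: for a regressor with no `scoring` argument, `best_score_` is the mean cross-validated R², and with the default `refit=True` the search object already holds a model refit on the full training data in `best_estimator_`. The sketch below illustrates both on hypothetical toy data (`X_toy`, `y_toy`, and the small grid are assumptions).

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy regression data
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 1, size=(120, 3))
y_toy = 10 * X_toy[:, 0] + rng.normal(0, 0.1, 120)

search = GridSearchCV(
    DecisionTreeRegressor(random_state=42),
    {"max_depth": [2, 4, None]},
    cv=5,
    scoring="r2",  # Same metric as the regressor's default score()
)
search.fit(X_toy, y_toy)

# With refit=True (the default), best_estimator_ is already fitted on all of X_toy
best_tree = search.best_estimator_
print(search.best_params_, round(search.best_score_, 3))
print(best_tree.predict(X_toy[:2]))
```

Using `best_estimator_` also carries over fixed settings such as `random_state`, which are not part of `best_params_`.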

#### Retraining the Decision Tree Regression Model with the Best Parameters

In [None]:
# Retrieving the best parameters from the grid search
best_params = grid_search.best_params_

# Creating a new Decision Tree Regressor instance with the best parameters
# random_state is passed again because it is not part of best_params_
optimized_decision_tree = DecisionTreeRegressor(**best_params, random_state=42)

# Retraining the model on the entire training set
optimized_decision_tree.fit(X_train, y_train)

#### Evaluating the Optimized Decision Tree Model

In [None]:
# Evaluating the optimized model
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Making predictions on the test dataset
y_pred = optimized_decision_tree.predict(X_test)

# Calculating the evaluation metrics
mse_dt_opt = mean_squared_error(y_test, y_pred)
mae_dt_opt = mean_absolute_error(y_test, y_pred)
r2_dt_opt = r2_score(y_test, y_pred)

# Calculating the adjusted R²
n = len(y_test)  # Number of observations
p = X_test.shape[1]  # Number of independent variables
adj_r2_dt_opt = 1 - (1 - r2_dt_opt) * (n - 1) / (n - p - 1)

# Printing the evaluation metrics
print(f"Mean Squared Error: {mse_dt_opt}")
print(f"Mean Absolute Error (MAE): {mae_dt_opt}")
print(f"R² Score: {r2_dt_opt}")
print(f"Adjusted R² Score: {adj_r2_dt_opt}")

#### Comparing the test and predicted prices of the optimized decision tree model

In [None]:
np.set_printoptions(precision=2)

# Converting to numpy arrays and reshape
y_pred_array = y_pred.reshape(len(y_pred), 1) if not isinstance(y_pred, pd.Series) else y_pred.to_numpy().reshape(-1, 1)
y_test_array = y_test.to_numpy().reshape(-1, 1)

# Concatenating and printing predicted price and the actual price for easy comparison
print(np.concatenate((y_pred_array, y_test_array), axis=1))

### K-Nearest Neighbour Regression

In [None]:
from sklearn.neighbors import KNeighborsRegressor

# Reduced parameter grid to search
param_grid = {
    'n_neighbors': [5, 10],  # Fewer options for number of neighbors
    'weights': ['uniform', 'distance'],
    'algorithm': ['auto'],  # Using 'auto' to let the algorithm decide the best approach
    'p': [1, 2]
}

# Grid search with reduced cross-validation
knn_grid_search = GridSearchCV(KNeighborsRegressor(), param_grid, cv=3, n_jobs=-1, verbose=1)
knn_grid_search.fit(X_train, y_train)

# Best parameters and best score
print("Best parameters for KNN:", knn_grid_search.best_params_)
print("Best score for KNN:", knn_grid_search.best_score_)

#### Retraining the KNN regression Model with the Best Parameters

In [None]:
# Retrieving the best parameters from the grid search
best_params_knn = knn_grid_search.best_params_

# Creating a new KNN regressor instance with the best parameters
optimized_knn_regressor = KNeighborsRegressor(**best_params_knn)

# Retraining the model on the entire training set
optimized_knn_regressor.fit(X_train, y_train)

#### Evaluating the Optimized K-NN Regression Model

In [None]:
# Making predictions on the test dataset
y_pred_optimized = optimized_knn_regressor.predict(X_test)

# Calculating the evaluation metrics
mse_knn_opt = mean_squared_error(y_test, y_pred_optimized)
mae_knn_opt = mean_absolute_error(y_test, y_pred_optimized)
r2_knn_opt = r2_score(y_test, y_pred_optimized)

# Calculating the adjusted R²
n = len(y_test)  # Number of observations
p = X_test.shape[1]  # Number of independent variables
adj_r2_knn_opt = 1 - (1 - r2_knn_opt) * (n - 1) / (n - p - 1)

# Printing the evaluation metrics
print(f"Mean Squared Error (MSE): {mse_knn_opt}")
print(f"Mean Absolute Error (MAE): {mae_knn_opt}")
print(f"R-squared (R²): {r2_knn_opt}")
print(f"Adjusted R² Score: {adj_r2_knn_opt}")

#### Comparing the test and predicted prices of the optimized K-NN model

In [None]:
np.set_printoptions(precision=2)

# Converting the optimized K-NN predictions to a column vector
y_pred_array = y_pred_optimized.reshape(len(y_pred_optimized), 1) if not isinstance(y_pred_optimized, pd.Series) else y_pred_optimized.to_numpy().reshape(-1, 1)
y_test_array = y_test.to_numpy().reshape(-1, 1)

# Concatenating and printing predicted price and the actual price for easy comparison
print(np.concatenate((y_pred_array, y_test_array), axis=1))

### Linear Regression

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline

# Creating a pipeline with PolynomialFeatures and LinearRegression
pipeline = Pipeline([
    ('poly', PolynomialFeatures()),
    ('linear', LinearRegression())
])

# Parameter grid to search
param_grid = {
    'poly__degree': [1, 2, 3],  # Trying out linear, quadratic, and cubic terms
}

# Grid search with cross-validation
grid_search = GridSearchCV(pipeline, param_grid, cv=5, n_jobs=-1, verbose=1)
grid_search.fit(X_train, y_train)

# Best parameters and best score
print("Best parameters:", grid_search.best_params_)
print("Best score:", grid_search.best_score_)

#### Retraining the Linear regression Model with the Best Parameters

In [None]:
# Retrieving the best polynomial degree from the grid search
best_poly_degree = grid_search.best_params_['poly__degree']

# Creating a new pipeline with the best polynomial degree and LinearRegression
optimized_linear_pipeline = Pipeline([
    ('poly', PolynomialFeatures(degree=best_poly_degree)),
    ('linear', LinearRegression())
])

# Retraining the model on the entire training set
optimized_linear_pipeline.fit(X_train, y_train)

#### Evaluating the Optimized Linear Regression Model

In [None]:
# Making predictions on the test dataset using the optimized linear regression model
y_pred_optimized_linear = optimized_linear_pipeline.predict(X_test)

# Calculating the evaluation metrics
mse_linear_opt = mean_squared_error(y_test, y_pred_optimized_linear)
mae_linear_opt = mean_absolute_error(y_test, y_pred_optimized_linear)
r2_linear_opt = r2_score(y_test, y_pred_optimized_linear)

# Calculating the adjusted R²
n = len(y_test)  # Number of observations
p = X_test.shape[1]  # Number of independent variables
adj_r2_linear_opt = 1 - (1 - r2_linear_opt) * (n - 1) / (n - p - 1)

# Printing the evaluation metrics
print(f"Mean Squared Error (MSE) for Optimized Linear Regression: {mse_linear_opt}")
print(f"Mean Absolute Error (MAE) for Optimized Linear Regression: {mae_linear_opt}")
print(f"R-squared (R²) for Optimized Linear Regression: {r2_linear_opt}")
print(f"Adjusted R² Score: {adj_r2_linear_opt}")

### Model Ranking and Selection

In [None]:
metrics = {
    "Model": ["KNN Reg Unoptimized", "KNN Reg Optimized", "Linear Reg Unoptimized", "Linear Reg Optimized",
              "DT Reg Unoptimized", "DT Reg Optimized"],
    "MSE": [mse_knn_unopt, mse_knn_opt, mse_linear_unopt, mse_linear_opt, mse_dt_unopt, mse_dt_opt],
    "MAE": [mae_knn_unopt, mae_knn_opt, mae_linear_unopt, mae_linear_opt, mae_dt_unopt, mae_dt_opt],
    "R2": [r2_knn_unopt, r2_knn_opt, r2_linear_unopt, r2_linear_opt, r2_dt_unopt, r2_dt_opt],
    "Adjusted R²": [adj_r2_knn_unopt, adj_r2_knn_opt, adj_r2_linear_unopt, adj_r2_linear_opt, adj_r2_dt_unopt, adj_r2_dt_opt]
}

model_comparison_df = pd.DataFrame(metrics)

# Rounding the DataFrame to two decimal places
model_comparison_df_rounded = model_comparison_df.round(2)

# Displaying the rounded DataFrame
model_comparison_df_rounded

#### Ranking the models in descending order

In [None]:
# Ranking the models based on Adjusted R² (descending order)
model_comparison_df_ranked = model_comparison_df_rounded.sort_values(by='Adjusted R²', ascending=False)

# Adding the rank column to the dataframe
model_comparison_df_ranked['Rank'] = model_comparison_df_ranked['Adjusted R²'].rank(ascending=False, method='min')

# Resetting the index to have a clean new ranking
model_comparison_df_ranked = model_comparison_df_ranked.reset_index(drop=True)

# Reordering the columns to put 'Rank' first
model_comparison_df_ranked = model_comparison_df_ranked[['Rank', 'Model', 'MSE', 'MAE', 'R2', 'Adjusted R²']]

# Display the ranked DataFrame
model_comparison_df_ranked

# 4. Model Evaluation and Analysis

## 4.1. Coarse-Grained Evaluation/Analysis

In [None]:
# Calculating average scores for each model
# Note: MSE, MAE and R² are on very different scales, so this mean is dominated by MSE
model_comparison_df['Average Score'] = model_comparison_df[['MSE', 'MAE', 'R2', 'Adjusted R²']].mean(axis=1)

# Ranking the models based on the average score (highest average ranked first)
model_comparison_df['Rank'] = model_comparison_df['Average Score'].rank(ascending=False)

# Summary table
summary_table = model_comparison_df[['Model', 'Average Score', 'Rank']]

# Display the summary table
summary_table

**Model Suitability**: The ranking suggests that linear regression models (both optimized and unoptimized) are more suitable for this dataset and problem than decision tree or KNN models.

**Effect of Optimization**: For linear regression, optimization doesn't lead to a significant change in ranking, whereas for KNN, it appears to have a negative impact.

**Need for Further Analysis**: While this coarse-grained analysis provides a high-level overview, it is crucial to delve deeper to understand why certain models performed better or worse. Consider examining specific cases where models performed exceptionally well or poorly.

**Revisit the Evaluation Method**: Given the unusual result with the optimized KNN model, you might want to revisit the optimization process or the way the average score is calculated. Ensure that the metrics are appropriately scaled and combined.
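One possible way to combine the metrics on a common scale is to rank the models on each metric separately (lower is better for MSE and MAE, higher is better for R²) and then average the ranks. The sketch below uses made-up values and hypothetical names (`toy_metrics`, `ranks`, `toy_ranked`). It illustrates one option and is not the method used in this notebook.

```python
import pandas as pd

# Hypothetical metric values, for illustration only
toy_metrics = pd.DataFrame({
    "Model": ["A", "B", "C"],
    "MSE": [4.0e6, 9.0e6, 2.5e6],
    "MAE": [1200.0, 1900.0, 1000.0],
    "R2": [0.90, 0.78, 0.94],
})

# Rank 1 is best: smallest error, largest R²
ranks = pd.DataFrame({
    "MSE": toy_metrics["MSE"].rank(ascending=True),
    "MAE": toy_metrics["MAE"].rank(ascending=True),
    "R2": toy_metrics["R2"].rank(ascending=False),
})

# Averaging ranks avoids mixing metrics with different units
toy_metrics["Mean Rank"] = ranks.mean(axis=1)
toy_ranked = toy_metrics.sort_values("Mean Rank").reset_index(drop=True)
print(toy_ranked)
```

Because every metric contributes a rank between 1 and the number of models, no single metric can dominate the combined score.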

## 4.2. Feature Importance

### Unoptimized Decision Tree Regressor

In [None]:
feature_importances = decision_tree_regressor.feature_importances_

# Creating a DataFrame for easier visualization
feature_importance_df = pd.DataFrame({
    'Feature': X_train.columns,
    'Importance': feature_importances
}).sort_values(by='Importance', ascending=False)

print(feature_importance_df)

In [None]:
plt.figure(figsize=(12, 8))
plt.barh(feature_importance_df['Feature'], feature_importance_df['Importance'])
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.title('Feature Importance of the Unoptimized Decision Tree Regressor')
plt.gca().invert_yaxis()
plt.show()

### Optimized Linear Regression Model

In [None]:
# Accessing the 'linear' step in the pipeline to get the coefficients
coefficients = optimized_linear_pipeline.named_steps['linear'].coef_

# The coefficients for PolynomialFeatures will be in a flattened array.
# Ensuring the length matches the number of generated features
poly_features = optimized_linear_pipeline.named_steps['poly']
expanded_feature_names = poly_features.get_feature_names_out(input_features=X_train.columns)

# Creating a DataFrame for easier visualization
coef_df = pd.DataFrame({
    'Feature': expanded_feature_names,
    'Coefficient': coefficients.flatten()  # Flatten in case of multidimensional array
}).sort_values(by='Coefficient', ascending=False)

print(coef_df)

In [None]:
# Selecting the top 10 features based on the absolute value of coefficients
top_features = coef_df.reindex(coef_df['Coefficient'].abs().sort_values(ascending=False).index).head(10)

plt.figure(figsize=(12, 8))
plt.barh(top_features['Feature'], top_features['Coefficient'])
plt.xlabel('Coefficient')
plt.ylabel('Feature')
plt.title('Top 10 Features in Optimized Linear Regression Model')
plt.gca().invert_yaxis()  # Invert y-axis to have the highest value at the top
plt.show()

### Optimized KNN Regressor

In [None]:
from sklearn.inspection import permutation_importance

# Using a reproducible random subset of the test data (25% of the rows)
subset_index = np.random.default_rng(42).choice(X_test.index, size=int(len(X_test)*0.25), replace=False)
X_test_subset = X_test.loc[subset_index]
y_test_subset = y_test.loc[subset_index]

# Reducing the number of repeats
result = permutation_importance(optimized_knn_regressor, X_test_subset, y_test_subset, n_repeats=5, random_state=42, n_jobs=-1)

# Organizing and plotting the results
sorted_idx = result.importances_mean.argsort()

plt.figure(figsize=(10, 6))
plt.boxplot(result.importances[sorted_idx].T, vert=False, labels=X_test_subset.columns[sorted_idx])
plt.title("Permutation Importance of Features for the optimized KNN Model (Subset)")
plt.xlabel("Decrease in R² Score")
plt.show()
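Alongside the box plot, the permutation results can be summarised in a table of mean and standard deviation per feature. The following sketch assumes hypothetical toy data with one informative column (`signal`) and one uninformative column (`noise`). With no `scoring` argument, `permutation_importance` uses the estimator's own `score`, which is R² for a regressor.

```python
import numpy as np
import pandas as pd
from sklearn.inspection import permutation_importance
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical toy data: the target depends on 'signal' only
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({
    "signal": rng.normal(size=200),
    "noise": rng.normal(size=200),
})
y_toy = 5 * X_toy["signal"] + rng.normal(scale=0.1, size=200)

knn = KNeighborsRegressor(n_neighbors=5).fit(X_toy, y_toy)
perm = permutation_importance(knn, X_toy, y_toy, n_repeats=5, random_state=42)

# Mean and spread of the drop in R² across the repeats, largest first
importance_table = pd.DataFrame({
    "Feature": X_toy.columns,
    "Mean drop in R2": perm.importances_mean,
    "Std": perm.importances_std,
}).sort_values("Mean drop in R2", ascending=False).reset_index(drop=True)
print(importance_table)
```

The same table can be built from `result` and `X_test_subset.columns` in the cell above.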

## 4.3. Fine-Grained Evaluation

### Calculating Instance-Level Errors

In [None]:
# Instance-level errors of the optimized KNN model on the test set
absolute_errors = abs(y_pred_optimized - y_test)  # Absolute error
squared_errors = (y_pred_optimized - y_test)**2  # Squared error

### Analyzing Errors with Respect to Features

In [None]:
# Adding errors to the test DataFrame
X_test_with_errors = X_test.copy()
X_test_with_errors['Absolute_Error'] = absolute_errors
X_test_with_errors['Squared_Error'] = squared_errors

# Analyzing error by feature - example with a specific feature
# Selecting the error columns before aggregating avoids averaging unrelated (non-numeric) columns
feature_analysis = X_test_with_errors.groupby('mileage')[['Absolute_Error', 'Squared_Error']].mean()
feature_analysis
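Since the scaled mileage is continuous, grouping on its exact values tends to produce almost one group per row. Grouping on quantile bins gives a more readable summary. This is a sketch on hypothetical random data (`toy_errors`, the bin labels and the choice of four bins are all assumptions).

```python
import numpy as np
import pandas as pd

# Hypothetical errors and scaled mileage values, for illustration only
rng = np.random.default_rng(0)
toy_errors = pd.DataFrame({
    "mileage": rng.normal(size=200),
    "Absolute_Error": rng.uniform(0, 5000, size=200),
})

# Splitting mileage into four equally sized quantile bins
toy_errors["mileage_bin"] = pd.qcut(
    toy_errors["mileage"], q=4, labels=["Q1", "Q2", "Q3", "Q4"]
)

# Summarising the absolute error within each bin
error_by_bin = (
    toy_errors.groupby("mileage_bin", observed=True)["Absolute_Error"]
    .agg(["mean", "median", "count"])
)
print(error_by_bin)
```

Applied to `X_test_with_errors`, this would show whether the errors grow in particular mileage ranges.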

### Visualizing the Error Distributions

In [None]:
plt.figure(figsize=(12, 8))
plt.scatter(X_test_with_errors['mileage'], X_test_with_errors['Absolute_Error'], alpha=0.5)
plt.title('Absolute Error vs. Mileage')
plt.xlabel('Mileage')
plt.ylabel('Absolute Error')
plt.show()

### Identifying and Examining Outliers

In [None]:
# Finding instances with high errors
mean_error = X_test_with_errors['Absolute_Error'].mean()
std_error = X_test_with_errors['Absolute_Error'].std()
error_threshold = mean_error + 2 * std_error

high_error_instances = X_test_with_errors[X_test_with_errors['Absolute_Error'] > error_threshold]

# Displaying some high error instances
high_error_instances.head()

#### Error Comparison

In [None]:
# Sampling the same number of low-error instances for comparison (fixed seed for reproducibility)
low_error_instances = X_test_with_errors[X_test_with_errors['Absolute_Error'] < error_threshold].sample(n=len(high_error_instances), random_state=42)

# Displaying some low error instances
low_error_instances.head()

#### Visualizing High and Low Error Comparison

In [None]:
# Example: Histogram of a feature for high-error and low-error instances
plt.figure(figsize=(12, 6))
plt.hist(high_error_instances['mileage'], bins=20, alpha=0.5, label='High Error')
plt.hist(low_error_instances['mileage'], bins=20, alpha=0.5, label='Low Error')
plt.xlabel('mileage')
plt.ylabel('Frequency')
plt.title('Feature Distribution in High vs. Low Error Instances')
plt.legend()
plt.show()