# Case Study - Car Price Prediction

This is a **regression problem**: the goal is to predict a car's selling price.

The features are described below.

- **name**: Detailed description of the car's brand and model
- **year**: Release year of the car's model
- **selling_price**: Car's selling price
- **km_driven**: Distance (in kilometres) the car has travelled
- **fuel**: Fuel type: 'Diesel', 'Petrol', 'CNG' or 'LPG'
- **seller_type**: Seller type: 'Individual', 'Dealer' or 'Trustmark Dealer'
- **transmission**: Gearbox type: 'Manual' or 'Automatic'
- **owner**: Owner type: 'First Owner', 'Second Owner', 'Third Owner', 'Fourth & Above Owner' or 'Test Drive Car'
- **mileage**: How many kilometres the vehicle runs per litre of fuel (kmpl)
- **engine**: Engine displacement of the car (in CC)
- **max_power**: Maximum power output of the car
- **torque**: Torque of the car's engine (a measure of its rotational force)
- **seats**: Number of seats in the car

## Importing libraries

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

In [None]:
import matplotlib
np.__version__, pd.__version__, sns.__version__, matplotlib.__version__

## 1. Load data

In [None]:
df = pd.read_csv('/root/source_code/raw_data/Cars.csv')

In [None]:
# Keep a copy of the original dataframe (plain assignment would only alias df)
df_org = df.copy()
from datetime import datetime
df_org['car_age'] = (datetime.now().year) - df_org['year']

In [None]:
# print the first rows of data
df.head()

In [None]:
# print the shape of data
df.shape

In [None]:
# Summary statistics of the numeric columns
df.describe()

In [None]:
# Check Dtypes of input data
df.info()

## 2. Exploratory Data Analysis

### Renaming

Rename the 'name' column to 'brand'.

In [None]:
# Check the column names
df.columns

In [None]:
df.rename(columns = {'name':'brand'}, inplace = True)

In [None]:
# Check the column names after renaming
df.columns

In [None]:
# Keep only the brand (the first word of the name)

df['brand'] = df['brand'].str.split(' ', expand=True)[0]
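
The effect of the split can be checked on a few names. This is a minimal sketch that assumes names follow a "Brand Model Variant" pattern; the three names below are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Made-up names in a "Brand Model Variant" shape (illustrative only).
names = pd.Series([
    "Maruti Swift Dzire VDI",
    "Skoda Rapid 1.5 TDI",
    "Honda City 2017-2020 EXi",
])

# Split once on the first space and keep the leading token as the brand.
brands = names.str.split(" ", n=1).str[0]
print(brands.tolist())
```

One caveat of this approach: a two-word brand such as "Land Rover" would be reduced to "Land", so it may be worth inspecting `df['brand'].unique()` afterwards.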

Remove all rows with CNG and LPG fuel type because CNG and LPG use a different mileage system.

In [None]:
# Check records count in 'fuel'
df.fuel.value_counts()

In [None]:
# Remove 'CNG' and 'LPG' records from the data set
exc_fuel = df[ (df['fuel'] == 'CNG') | (df['fuel'] == 'LPG') ].index
df.drop(exc_fuel , inplace=True)

In [None]:
# Check again fuel records count to make sure there is no record related to 'CNG' or 'LPG'
df.fuel.value_counts()

Remove the 'kmpl' unit from mileage to keep the number only

In [None]:
df['mileage'] = df['mileage'].str.split(' ', expand=True)[0].astype(float)

Remove the 'CC' unit from engine to keep the number only

In [None]:
df['engine'] = df['engine'].str.split(' ', expand=True)[0].astype(float)

Remove the unit from max_power to keep the number only

In [None]:
var_value = df['max_power'].str.split(' ', expand=True)[0]
var_value_2 = [None if isinstance(value, str) and value.isalpha() else float(value) for value in var_value]
df['max_power'] = var_value_2
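
An alternative way to strip units, sketched here on made-up values, is to let `pd.to_numeric` turn anything non-numeric into `NaN`. The sample strings below are assumptions about what the raw column might contain (including an entry that holds only the unit) and are not copied from the data.

```python
import pandas as pd

# Made-up raw values: two valid readings, a unit with no number, and a missing entry.
raw = pd.Series(["74 bhp", "103.52 bhp", "bhp", None])

# Take the first space-separated token, then coerce non-numeric tokens to NaN.
numeric = pd.to_numeric(raw.str.split(" ").str[0], errors="coerce")
print(numeric)
```

With `errors="coerce"`, entries that cannot be parsed become missing values, which can then be handled together with the other nulls during preprocessing.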

Drop the 'torque' feature because its values are not clearly understood

In [None]:
df = df.drop('torque', axis=1)

Delete all records of 'Test Drive Car'

In [None]:
# Check records count in 'Owner'
df.owner.value_counts()

In [None]:
# Checking total count, mean, min, max
df.selling_price.count(),df.selling_price.mean(),df.selling_price.max(),df.selling_price.min()

In [None]:
# Checking total count, mean, min, max of 'Test Drive Car'
var_testcar = df['owner'] == 'Test Drive Car'
df[var_testcar].selling_price.count(),df[var_testcar].selling_price.mean(),df[var_testcar].selling_price.max(),df[var_testcar].selling_price.min()

In [None]:
# Checking total count, mean, min, max of non 'Test Drive Car'
var_no_testcar = df['owner'] != 'Test Drive Car'
df[var_no_testcar].selling_price.count(),df[var_no_testcar].selling_price.mean(),df[var_no_testcar].selling_price.max(),df[var_no_testcar].selling_price.min()

In [None]:
# Remove 'Test Drive Car'
exc_owner = df[df['owner'] == 'Test Drive Car'].index
df.drop(exc_owner , inplace=True)

In [None]:
# Check again owner records count to make sure there is no record related to 'Test Drive Car'
df.owner.value_counts()

In [None]:
# Check number of car by seller_type
df.seller_type.value_counts()

In [None]:
# Check number of car by transmission type
df.transmission.value_counts()

#### Deriving the 'car_age' feature from 'year'

In [None]:
from datetime import datetime
df['car_age'] = (datetime.now().year) - df['year']

In [None]:
df.info()

In [None]:
df.head()

In [None]:
# Statistical information
df.describe()

### 2.1 Univariate analysis

Single-variable exploratory data analysis

#### Countplot

In [None]:
sns.countplot(data = df, x = 'fuel')

In [None]:
sns.countplot(data = df, x = 'seller_type')

In [None]:
sns.countplot(data = df, x = 'transmission')

In [None]:
sns.countplot(data = df, x = 'owner')

In [None]:
sns.countplot(data = df, x = 'seats')

#### Distribution plot

In [None]:
sns.displot(data = df, x = 'brand')

In [None]:
sns.displot(data = df, x = 'car_age')

In [None]:
sns.displot(data = df, x = 'selling_price')

### 2.2 Multivariate analysis

Multiple variable exploratory data analysis

#### Boxplot

In [None]:
sns.boxplot(x = df["transmission"], y = df["selling_price"]);
plt.xlabel("transmission")
plt.ylabel("Selling Price")

In [None]:
sns.boxplot(x = df["fuel"], y = df["selling_price"]);
plt.xlabel("fuel")
plt.ylabel("Selling Price")

In [None]:
sns.boxplot(x = df["owner"], y = df["selling_price"]);
plt.xlabel("Owner")
plt.ylabel("Selling Price")

#### Scatterplot

In [None]:
sns.scatterplot(x = df['car_age'], y = df['selling_price'], hue=df['fuel'])

In [None]:
sns.scatterplot(x = df['car_age'], y = df['selling_price'], hue=df['transmission'])

In [None]:
sns.scatterplot(x = df['km_driven'], y = df['selling_price'], hue=df['transmission'])

In [None]:
sns.scatterplot(x = df['mileage'], y = df['selling_price'], hue=df['transmission'])

In [None]:
sns.scatterplot(x = df['max_power'], y = df['selling_price'], hue=df['transmission'])

In [None]:
sns.scatterplot(x = df['max_power'], y = df['selling_price'], hue=df['car_age'])

In [None]:
sns.scatterplot(x = df['brand'], y = df['selling_price'], hue=df['transmission'])

In [None]:
df.head()

#### Correlation matrix before label encoding the object features

In [None]:
plt.figure(figsize = (15,8))
my_df = df.select_dtypes(exclude = [object])
sns.heatmap(my_df.corr(),annot=True,cmap="coolwarm")

#### Predictive Power Score

In [None]:
import ppscore as pps

In [None]:
# Check features correlation with ppscore
matrix_df = pps.matrix(df)[['x', 'y', 'ppscore']].pivot(columns='x', index='y', values='ppscore')

#plot
plt.figure(figsize = (15,8))
sns.heatmap(matrix_df, vmin=0, vmax=1, cmap="Blues", linewidths=0.5, annot=True)

#### Label encoding for features

In [None]:
# Checking value count for Owner
df.owner.value_counts()

In [None]:
# Label encoding for 'owner'
df['owner'] = df['owner'].map({'First Owner':1,'Second Owner':2,'Third Owner':3,'Fourth & Above Owner':4,'Test Drive Car':5})

In [None]:
# Check label encoding value for 'owner'
df.owner.value_counts()

In [None]:
# Label encoding for 'transmission' (overwrites the column in place)
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df["transmission"] = le.fit_transform(df["transmission"])

df["transmission"].unique()

In [None]:
# Check label encode mapping
le.classes_

In [None]:
# Label encoding for 'fuel' (overwrites the column in place)
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df["fuel"] = le.fit_transform(df["fuel"])

df["fuel"].unique()

In [None]:
# Check label encode mapping
le.classes_
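
To read the encoded values, it helps to turn `classes_` into an explicit label-to-code mapping. The sketch below uses a small made-up list of transmission labels; `LabelEncoder` sorts the classes, so the code of each label is its position in `classes_`.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up sample of labels (illustrative only).
le_demo = LabelEncoder()
codes = le_demo.fit_transform(["Manual", "Automatic", "Manual", "Manual"])

# classes_ is sorted, so the index of each class is its integer code.
mapping = {label: code for code, label in enumerate(le_demo.classes_)}
print(mapping)
print(codes.tolist())
```

Because the same `le` name is reused for each column in the cells above, the mapping has to be captured before the encoder is refitted on the next column.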

In [None]:
# Check data set values after label encoding
df.head()

In [None]:
# Check with heatmap again including label encoded columns
plt.figure(figsize = (15,8))
my_df = df.select_dtypes(exclude = [object])
sns.heatmap(my_df.corr(),annot=True,cmap="coolwarm")

#sns.heatmap(df.corr(), annot=True, cmap="coolwarm")  #don't forget these are not all variables! categorical is not here...

In [None]:
# Check feature correlation again including label encoded features
import ppscore as pps

#this needs some minor preprocessing because seaborn.heatmap unfortunately does not accept tidy data
matrix_df = pps.matrix(df)[['x', 'y', 'ppscore']].pivot(columns='x', index='y', values='ppscore')

#plot
plt.figure(figsize = (15,8))
sns.heatmap(matrix_df, vmin=0, vmax=1, cmap="Blues", linewidths=0.5, annot=True)

#### Log transform for features with large values

In [None]:
# Check original value before log transform
df.head()

In [None]:
# Log transform
df['selling_price'] = np.log(df['selling_price'])
df['km_driven'] = np.log(df['km_driven'])
df['engine'] = np.log(df['engine'])

In [None]:
# Check after log transform
df.head()
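
The log transform compresses large values and is undone with `np.exp`, which is what later converts a predicted log price back into a price. A minimal sketch on made-up prices:

```python
import numpy as np

# Made-up prices spanning two orders of magnitude (illustrative only).
prices = np.array([45_000.0, 450_000.0, 4_500_000.0])

log_prices = np.log(prices)      # compresses the range
recovered = np.exp(log_prices)   # inverse transform back to the price scale

print(log_prices)
print(recovered)
```

Each tenfold increase in price becomes a constant step of `log(10)` on the log scale. Note that `np.log(0)` evaluates to `-inf`, so a column containing zeros would need a different treatment, for example `np.log1p`.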


## 4. Feature selection

In [None]:
# chosen features
# transmission,km_driven,fuel
# max_power,car_age,mileage

# X holds our strongest features
X = df[['max_power', 'car_age', 'mileage']]

# y is the (log-transformed) selling price column
y = df['selling_price']

In [None]:
print(X.shape) #2d (no of samples, no of features)
print(y.shape) #1d (no of samples)

### Train test split

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 72)

In [None]:
print(X_train.shape) #2d (no of samples, no of features)
print(X_test.shape) #2d (no of samples, no of features)
print(y_train.shape) #1d (no of samples)
print(y_test.shape) #1d (no of samples)

## 5. Preprocessing

### Null values checking

In [None]:
#check for null values
X_train[['max_power', 'car_age', 'mileage']].isna().sum()

In [None]:
X_test[['max_power', 'car_age', 'mileage']].isna().sum()

In [None]:
y_train.isna().sum()

In [None]:
y_test.isna().sum()

### Fill Null value with mean or median

In [None]:
sns.displot(data=df, x='max_power')

In [None]:
# Check the mean and median to decide how to fill the null values
df['max_power'].mean(), df['max_power'].median()

In [None]:
# Fill the max_power null values with the median because the distribution looks skewed.
X_train['max_power'] = X_train['max_power'].fillna(X_train['max_power'].median())

In [None]:
# Fill the test data with the training median so no test-set statistics leak in.
X_test['max_power'] = X_test['max_power'].fillna(X_train['max_power'].median())

In [None]:
sns.displot(data=df, x='mileage')

In [None]:
# Check the mean and median to decide how to fill the null values
df['mileage'].mean(), df['mileage'].median()

In [None]:
# Fill the mileage null values with the mean because the distribution looks roughly normal
X_train['mileage'] = X_train['mileage'].fillna(X_train['mileage'].mean())

In [None]:
# Fill the test data with the training mean
X_test['mileage'] = X_test['mileage'].fillna(X_train['mileage'].mean())
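
The same "fit on train, apply to test" idea can be expressed with scikit-learn's `SimpleImputer`. This is a sketch on made-up frames, not the notebook's method; for brevity it uses the median for both columns, whereas the cells above use the median for max_power and the mean for mileage.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up train and test frames with missing values (illustrative only).
train = pd.DataFrame({"max_power": [60.0, 80.0, np.nan, 100.0],
                      "mileage": [18.0, np.nan, 22.0, 20.0]})
test = pd.DataFrame({"max_power": [np.nan, 90.0],
                     "mileage": [21.0, np.nan]})

# Learn the fill values from the training data only.
imputer = SimpleImputer(strategy="median")
train_filled = pd.DataFrame(imputer.fit_transform(train), columns=train.columns)

# Reuse the training medians on the test data.
test_filled = pd.DataFrame(imputer.transform(test), columns=test.columns)
print(test_filled)
```

Keeping the statistics inside a fitted object also makes it easy to reuse the exact same fill values at inference time.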

In [None]:
# check null value in training data
X_train[['max_power', 'car_age', 'mileage']].isna().sum()

In [None]:
# check null value in testing data
X_test[['max_power', 'car_age', 'mileage']].isna().sum()

In [None]:
y_train.isna().sum(), y_test.isna().sum()

### Checking Outliers

In [None]:
# Map each column to its subplot position.
col_dict = {'max_power':1,'car_age':2,'mileage':3}

# Detect outliers in each variable using box plots.
plt.figure(figsize=(20,30))

for variable, i in col_dict.items():
    plt.subplot(5, 4, i)
    plt.boxplot(X_train[variable])
    plt.title(variable)

plt.show()

In [None]:
def outlier_count(col, data = X_train):

    # calculate the 25th and 75th percentiles (first and third quartiles)
    q75, q25 = np.percentile(data[col], [75, 25])

    # calculate the interquartile range
    iqr = q75 - q25

    # lower and upper fences
    min_val = q25 - (iqr*1.5)
    max_val = q75 + (iqr*1.5)

    # count the outliers: values less than min_val or more than max_val
    n_outliers = len(np.where((data[col] > max_val) | (data[col] < min_val))[0])

    # calculate the percentage of outliers
    outlier_percent = round(n_outliers/len(data[col])*100, 2)

    if(n_outliers > 0):
        print("\n"+15*'-' + col + 15*'-'+"\n")
        print('Number of outliers: {}'.format(n_outliers))
        print('Percent of data that is outlier: {}%'.format(outlier_percent))

In [None]:
for col in X_train.columns:
    outlier_count(col)
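
The 1.5 × IQR rule used above can be traced by hand on a tiny array. The values below are made up so that exactly one point falls outside the fences.

```python
import numpy as np

# Made-up values with one obvious outlier (illustrative only).
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

# Quartiles and interquartile range.
q25, q75 = np.percentile(values, [25, 75])
iqr = q75 - q25

# Fences: anything outside [lower, upper] counts as an outlier.
lower = q25 - 1.5 * iqr
upper = q75 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]

print(q25, q75, iqr, lower, upper)
print(outliers)
```

Here the quartiles are 3.25 and 7.75, the IQR is 4.5, and the fences are -3.5 and 14.5, so only the value 100 is flagged.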

### Scaling

In [None]:
from sklearn.preprocessing import StandardScaler

# feature scaling helps models converge faster
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test  = scaler.transform(X_test)
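
What the scaler does can be seen on a small made-up array: the training columns end up with mean 0 and standard deviation 1, while the test rows are shifted and scaled using the training statistics, so they are not necessarily centred.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up training and test rows with two features (illustrative only).
train = np.array([[50.0, 5.0], [70.0, 10.0], [90.0, 15.0]])
test = np.array([[70.0, 20.0]])

demo_scaler = StandardScaler()
train_s = demo_scaler.fit_transform(train)  # learns mean and std from train
test_s = demo_scaler.transform(test)        # reuses the training mean and std

print(train_s.mean(axis=0), train_s.std(axis=0))
print(test_s)
```

Fitting the scaler on the training split only keeps information about the test set out of the preprocessing step.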

In [None]:
# Let's check shapes of all X_train, X_test, y_train, y_test
print("Shape of X_train: ", X_train.shape)
print("Shape of X_test: ", X_test.shape)
print("Shape of y_train: ", y_train.shape)
print("Shape of y_test: ", y_test.shape)

## 6. Modeling

In [None]:
from sklearn.linear_model import LinearRegression  #we are using regression models
from sklearn.metrics import mean_squared_error, r2_score

lr = LinearRegression()
lr.fit(X_train, y_train)
yhat = lr.predict(X_test)

print("MSE: ", mean_squared_error(y_test, yhat))
print("r2: ", r2_score(y_test, yhat))

### Cross validation + Grid search

In [None]:
from sklearn.linear_model import LinearRegression  #we are using regression models
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

# models that we will be using, put them in a list
algorithms = [LinearRegression(), SVR(), KNeighborsRegressor(), DecisionTreeRegressor(random_state = 0),
              RandomForestRegressor(n_estimators = 100, random_state = 0)]

# The names of the models
algorithm_names = ["Linear Regression", "SVR", "KNeighbors Regressor", "Decision-Tree Regressor", "Random-Forest Regressor"]

Let's do some simple cross-validation here....

In [None]:
y_train.isna().sum()

In [None]:
from sklearn.model_selection import KFold, cross_val_score

#lists for keeping mse
train_mse = []
test_mse = []

#defining splits
kfold = KFold(n_splits=5, shuffle=True)

for i, model in enumerate(algorithms):
    scores = cross_val_score(model, X_train, y_train, cv=kfold, scoring='neg_mean_squared_error')
    print(f"{algorithm_names[i]} - Score: {scores}; Mean: {scores.mean()}")
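
The `neg_mean_squared_error` scorer returns the negated MSE so that higher is always better, which means the printed scores are negative. A minimal sketch on synthetic data (the data, coefficients and the `random_state` values are assumptions for illustration) shows how to turn them back into MSE and RMSE:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Fixing random_state makes the folds, and therefore the scores, reproducible.
kfold_demo = KFold(n_splits=5, shuffle=True, random_state=0)
neg_mse = cross_val_score(LinearRegression(), X_demo, y_demo,
                          cv=kfold_demo, scoring="neg_mean_squared_error")

mse = -neg_mse        # flip the sign to get the usual MSE
rmse = np.sqrt(mse)   # same units as the target
print(mse.mean(), rmse.mean())
```

When comparing models, the one whose mean negative MSE is closest to zero has the lowest error.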

Random forest performs better than the other models: its mean score is the closest to zero, meaning it has the lowest MSE.

### Grid Search

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = {'bootstrap': [True], 'max_depth': [5, 10, None],
              'n_estimators': [5, 6, 7, 8, 9, 10, 11, 12, 13, 15]}

rf = RandomForestRegressor(random_state = 1)

grid = GridSearchCV(estimator = rf,
                    param_grid = param_grid,
                    cv = kfold,
                    n_jobs = -1,
                    return_train_score=True,
                    refit=True,
                    scoring='neg_mean_squared_error')

# Fit your grid_search
grid.fit(X_train, y_train);  #fit means start looping all the possible parameters

In [None]:
grid.best_params_

In [None]:
# Find your grid_search's best score
best_mse = grid.best_score_

In [None]:
best_mse  # ignore the minus because it's neg_mean_squared_error

## 7. Testing

Once everything else is done, we can evaluate on the final test set. After this point we should no longer tune or otherwise improve the model: X_test is the final held-out set, and using it to guide further changes would invalidate the result.

In [None]:
yhat = grid.predict(X_test)
mean_squared_error(y_test, yhat)
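
Because the target was log-transformed, the MSE above is measured in log-price units. To report an error in price units, both the true and predicted values can be passed through `np.exp` first. The numbers below are made up for illustration and are not results from this model.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up true and predicted prices, stored on the log scale (illustrative only).
y_true_log = np.log(np.array([300_000.0, 500_000.0, 800_000.0]))
y_pred_log = np.log(np.array([330_000.0, 450_000.0, 800_000.0]))

# Error on the log scale, as computed in the notebook.
mse_log = mean_squared_error(y_true_log, y_pred_log)

# Error on the original price scale, after undoing the log transform.
rmse_price = np.sqrt(mean_squared_error(np.exp(y_true_log), np.exp(y_pred_log)))
print(mse_log, rmse_price)
```

The price-scale RMSE is often easier to communicate, while the log-scale MSE is what the model was actually optimised and compared on.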

## 8. Analysis:  Feature Importance

Understanding *why* the model predicts what it does is **key** to every business, more so than how low an MSE we achieved. Identifying which features matter for prediction helps us interpret the results. There are several ways to do this: algorithm-based importance, permutation importance, and SHAP. Note that these techniques can be applied to most algorithms.

Most of the time, we apply all of them and check that they agree.

#### Algorithm way

Some ML algorithms provide feature importance scores after you fit the model.

In [None]:
#stored in this variable
#note that grid here is random forest
rf = grid.best_estimator_

rf.feature_importances_

In [None]:
#let's plot
plt.barh(X.columns, rf.feature_importances_)

In [None]:
#hmm...let's sort first
sorted_idx = rf.feature_importances_.argsort()
plt.barh(X.columns[sorted_idx], rf.feature_importances_[sorted_idx])
plt.xlabel("Random Forest Feature Importance")

#### Permutation way

This method randomly shuffles each feature and measures the change in the model's performance. The features whose shuffling hurts performance the most are the most important ones.

*Note*: Permutation-based importance is computationally expensive. It can also struggle with highly correlated features and may report them as unimportant.

In [None]:
from sklearn.inspection import permutation_importance

perm_importance = permutation_importance(rf, X_test, y_test)

#let's plot
sorted_idx = perm_importance.importances_mean.argsort()
plt.barh(X.columns[sorted_idx], perm_importance.importances_mean[sorted_idx])
plt.xlabel("Permutation Importance")
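
The behaviour of permutation importance is easiest to see on synthetic data where the answer is known. In this sketch (all data and settings are assumptions for illustration) the target depends only on the first feature, so shuffling it should hurt the score far more than shuffling the second, pure-noise feature.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Synthetic data: y depends on feature 0 only; feature 1 is noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 2))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=300)

# Fit on the first 200 rows, evaluate importance on the remaining 100.
model = RandomForestRegressor(n_estimators=30, random_state=0)
model.fit(X_demo[:200], y_demo[:200])

result = permutation_importance(model, X_demo[200:], y_demo[200:],
                                n_repeats=10, random_state=0)
print(result.importances_mean)
print(result.importances_std)
```

`n_repeats` controls how many times each feature is shuffled; the spread in `importances_std` gives a rough sense of how stable each estimate is.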

#### Shap way

SHAP is model-agnostic and can be used to compute feature importances. It uses Shapley values from game theory to estimate how much each feature contributes to the prediction. It can be installed with <code>pip install shap</code>.

In [None]:
import shap

explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X_test)

In [None]:
#shap provides plot
shap.summary_plot(shap_values, X_test, plot_type="bar", feature_names = X.columns)

## 9. Inference

To serve predictions or deploy the model, it's best to save it for later use.

In [None]:
import pickle

# save the model to disk
filename = 'car_price_prediction.model'
with open(filename, 'wb') as f:
    pickle.dump(grid, f)

In [None]:
# load the model from disk
with open(filename, 'rb') as f:
    loaded_model = pickle.load(f)
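
The saved model expects standardized inputs, so the fitted scaler is needed at inference time as well. One option, sketched here as an assumption rather than what this notebook does, is to bundle the scaler and the model in a scikit-learn pipeline and pickle that single object. The data and the linear model below are synthetic stand-ins.

```python
import pickle

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for (max_power, car_age, mileage) and a log-price target.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=[80.0, 8.0, 20.0], scale=[20.0, 3.0, 3.0], size=(50, 3))
y_demo = 0.01 * X_demo[:, 0] - 0.1 * X_demo[:, 1] + 13.0

# The pipeline scales raw inputs and then predicts, in one object.
pipe = make_pipeline(StandardScaler(), LinearRegression()).fit(X_demo, y_demo)

# Round trip through pickle (in memory here; a file works the same way).
restored = pickle.loads(pickle.dumps(pipe))

sample = np.array([[67.05, 14.00, 21.79]])  # raw, unscaled feature values
print(restored.predict(sample))
```

With this layout, raw feature values can be passed straight to `predict` after loading, and the scaling step cannot be forgotten.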

In [None]:
df_org[['max_power', 'car_age', 'mileage','selling_price']].loc[200]

In [None]:
# let's try to create one silly example
# 'max_power', 'car_age', 'mileage','selling_price'
df[['max_power', 'car_age', 'mileage','selling_price']].loc[200]

In [None]:
#['max_power', 'car_age', 'mileage']
# sample = np.array([[64.10, 12.00, 21.00]]) # 20
sample = np.array([[67.05, 14.00, 21.79]]) # 100
sample = np.array([[67.05, 14.00, 21.79]]) # 200
sample

In [None]:
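# The model was trained on standardized features, so scale the sample with the fitted scaler first
predicted_log_price = loaded_model.predict(scaler.transform(sample))
# The target was log-transformed, so np.exp recovers the price
predicted_log_price, np.exp(predicted_log_price)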
# 13.25976536
# 13.3280505