# Predicting Used Car Price for Sales

## 1. Data Exploration

### 1.1. Reading the Data

In [None]:
import numpy as np
import pandas as pd

df = pd.read_csv("data/toyota.csv")

# Features: every column except "price" (column index 2)
X_1 = df.iloc[:, 3:]
X_2 = df.iloc[:, 0:2]
X = pd.concat([X_1, X_2], axis=1)

# Target: "price"
y = df.iloc[:, 2]

print(X)
print(y)

### 1.2. Observing the Variables

In [None]:
# sorting with respect to "price" to check the min and max value
print(df.sort_values("price"))

print(df.describe())  # to analyze count, mean, std etc. of each variable
df.info()  # prints the data types and non-null counts itself (it returns None)

The dataset contains information about model, year, price, transmission, mileage, fuel type, road tax, mpg and engine size. The output above shows that the dataset contains no missing values, so no null values need to be replaced during preprocessing. It also shows that model, transmission and fuel type are categorical variables. This implies that one hot encoding should be performed to transform the categorical variables into numerical ones.
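
The two checks described above can be illustrated on a small table. This is only a sketch: the `toy` frame and its values are made up for illustration and are not rows from the real dataset.

```python
import pandas as pd

# Hypothetical miniature of the Toyota table (values are invented).
toy = pd.DataFrame({
    "model": [" Yaris", " Auris", " Aygo"],
    "year": [2017, 2015, 2019],
    "price": [9500, 8200, 7900],
    "transmission": ["Manual", "Automatic", "Manual"],
    "fuelType": ["Petrol", "Hybrid", "Petrol"],
})

# Number of missing values per column; all zeros means nothing has to be filled in.
missing_per_column = toy.isna().sum()

# Non-numeric columns are the ones that need one hot encoding.
categorical_columns = [
    col for col in toy.columns if not pd.api.types.is_numeric_dtype(toy[col])
]

print(missing_per_column)
print(categorical_columns)
```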
### 1.3. Data Visualization


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

sns.pairplot(df, x_vars=["price", "year", "mileage"], y_vars=["price"], aspect=1, height=5)

sns.pairplot(df, x_vars=["tax", "mpg", "engineSize"], y_vars=["price"], aspect=1, height=5)


From the graphs above, it can be seen that "price" has a positive correlation with "year" and a negative correlation with "mileage": newer cars sell for more, and cars that have been driven further sell for less. Both variables are related to "price" and have relatively high coefficients. The second graph plots "price" against "tax", "mpg" and "engineSize". It serves mostly as a check on hypotheses formed beforehand, for example that "tax" increases as "engineSize" increases.
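
The direction and strength of such relations can also be read off as Pearson correlation coefficients rather than judged by eye. The sketch below uses a made-up `toy` frame, so the numbers it prints say nothing about the real data; on the notebook's data the same call would be applied to `df`.

```python
import pandas as pd

# Invented values, only to show the mechanics of the calculation.
toy = pd.DataFrame({
    "price":   [4000, 6500, 9000, 12000, 15500],
    "year":    [2010, 2013, 2015, 2017, 2019],
    "mileage": [90000, 70000, 45000, 30000, 8000],
})

# Correlation of every other column with "price"; the sign gives the direction.
corr_with_price = toy.corr()["price"].drop("price")
print(corr_with_price)
```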

After this, "price" should be rescaled. It is a discrete variable spread over a wide range. To perform linear regression on this variable, its density graph should look as close as possible to a normal distribution. That is why some transformations, namely the logarithm, the square root and the cube root, are tried. In the end, the logarithm is selected since it gives the distribution of price that is closest to normal.


In [None]:
# Observing the distribution of the "price" data

%matplotlib inline
sns.displot(y)             # raw price

sns.displot(np.log10(y))   # log10 of price

sns.displot(np.sqrt(y))    # square root of price

sns.displot(np.cbrt(y))    # cube root of price



Of the three transformations, the logarithm brings price closest to a normal distribution. So, price is replaced by its log10 value.

In [None]:
y = np.log10(y)
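
Besides comparing the density plots by eye, the choice of transformation can be backed by a number such as the sample skewness, which is close to zero for a symmetric distribution. The sketch below is an assumption-laden illustration: `prices` is synthetic log-normal data standing in for the real column, so its outcome does not prove anything about the Toyota prices.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed "prices" (log-normal), not the real data.
prices = pd.Series(rng.lognormal(mean=9.2, sigma=0.5, size=5000))

transforms = {
    "raw": prices,
    "log10": np.log10(prices),
    "sqrt": np.sqrt(prices),
    "cbrt": np.cbrt(prices),
}

# Sample skewness of each version; the closer to 0, the more symmetric.
skewness = {name: float(values.skew()) for name, values in transforms.items()}
best = min(skewness, key=lambda name: abs(skewness[name]))

print(skewness)
print("least skewed:", best)
```

On the notebook's data the same dictionary could be built from the untransformed price column instead of `prices`.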

## 2. Data Preprocessing

Here, I performed one hot encoding. Cleaning null values is skipped since the data doesn't contain any, as observed from the earlier .info() call on the data.
### One Hot Encoding

This step is required since some of the variables are categorical and contain string values, which should be changed to numerical values so that the models can be fitted. These variables are 'transmission', 'fuelType' and 'model'. One hot encoding creates one dummy column for each distinct value of each of these variables.

After one hot encoding, the X data set has 31 columns, while it originally had 8 columns.
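
How the column count grows can be seen on a small example. The sketch below uses `pd.get_dummies` instead of the scikit-learn encoder used in this notebook, simply because it is compact; the `toy` frame and its values are made up.

```python
import pandas as pd

# Invented rows: two categorical columns and one numeric column.
toy = pd.DataFrame({
    "transmission": ["Manual", "Automatic", "Manual", "Semi-Auto"],
    "fuelType": ["Petrol", "Hybrid", "Diesel", "Petrol"],
    "year": [2016, 2018, 2015, 2019],
})

# Each categorical column is replaced by one 0/1 column per distinct value.
encoded = pd.get_dummies(toy, columns=["transmission", "fuelType"], dtype=int)

print(encoded.columns.tolist())
print(toy.shape, "->", encoded.shape)
```

Here 3 columns become 7: the numeric column is kept, and each categorical column with three distinct values turns into three dummy columns.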

In [None]:
for col in ["model", "fuelType", "transmission"]:

    unique = list(X[col].unique())
    print(f"{col} has {len(unique)} unique values:" )
    print(unique)

In [None]:
from sklearn.preprocessing import OneHotEncoder

# sparse_output replaces the older `sparse` argument (scikit-learn >= 1.2),
# and get_feature_names_out replaces the removed get_feature_names
encoder = OneHotEncoder(sparse_output=False)

# One Hot Encoding for 'transmission'
x_encoded_1 = pd.DataFrame(encoder.fit_transform(X[['transmission']]))
x_encoded_1.columns = encoder.get_feature_names_out(['transmission'])

# One Hot Encoding for 'fuelType'
x_encoded_2 = pd.DataFrame(encoder.fit_transform(X[['fuelType']]))
x_encoded_2.columns = encoder.get_feature_names_out(['fuelType'])

# One Hot Encoding for 'model'
x_encoded_3 = pd.DataFrame(encoder.fit_transform(X[['model']]))
x_encoded_3.columns = encoder.get_feature_names_out(['model'])

# Dummy columns followed by the numeric columns, in the same order as before
x_encoded = pd.concat([x_encoded_1, x_encoded_2, x_encoded_3,
                       X[['tax', 'mpg', 'engineSize']], X['year'], X['mileage']], axis=1)

df = pd.concat([x_encoded, y], axis=1)

## 3. Feature Selection
### 3.1. Correlation Heat Map

This step checks whether the independent variables are correlated with each other. It is crucial because including correlated independent variables decreases the efficiency of the model. Another aim of the heat map is to check whether the dependent variable is correlated with the independent variables.

In [None]:
cor = df.corr()  # no cast to int: it would truncate the log price and the other float columns
plt.figure(figsize=(40, 20))

sns.heatmap(cor, annot=True)

columns = np.full((cor.shape[0],), True, dtype=bool)

# Selecting of Variables Regarding Correlation Heat Map

for i in range(cor.shape[0]):
    for j in range(i+1, cor.shape[0]):
        # absolute value, so that strong negative correlations are caught too
        if abs(cor.iloc[i, j]) >= 0.75:
            if columns[i]:
                columns[j] = False

print(df.columns[columns])
selected_columns = df.columns[columns]

#df = df[selected_columns]


The correlation heat map suggests that 'fuelType_Hybrid' is highly correlated with 'transmission_Automatic', so the hybrid fuel type variable can be left out of the model. Also, 'transmission_Other', 'model_ Avensis' and 'mpg' are weakly correlated with y, which is why these independent variables can be ignored in the model too. However, I will decide on the final set of independent variables after performing the other feature selection methods.
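
The pairwise filtering idea can be written as a small helper. This is a sketch under assumptions, not the selection actually used for the results in this notebook: `drop_correlated` is a hypothetical name, the 0.75 threshold mirrors the one used above, and the `toy` data is synthetic.

```python
import numpy as np
import pandas as pd

def drop_correlated(features: pd.DataFrame, threshold: float = 0.75) -> list:
    """Return the columns kept after dropping one of each strongly correlated pair."""
    corr = features.corr().abs()
    keep = []
    for col in corr.columns:
        # Keep a column only if it is not strongly correlated with one already kept.
        if all(corr.loc[col, kept] < threshold for kept in keep):
            keep.append(col)
    return keep

rng = np.random.default_rng(1)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "a_negated": -a + rng.normal(scale=0.01, size=200),  # near-perfect negative correlation with "a"
    "b": rng.normal(size=200),                            # generated independently of "a"
})

print(drop_correlated(toy))
```

Taking the absolute value matters here: "a_negated" carries the same information as "a" even though their correlation is negative.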

### 3.2. Backward Feature Elimination

This is a feature elimination step. It starts with the full model and eliminates independent variables one at a time, based on the p-values of the linear model.

In [None]:
import statsmodels.regression.linear_model as sm

selected_columns = x_encoded.columns.values

def backward_elimination(x, y, sl, cols):
    """Refit OLS and drop the feature with the largest p-value until all are <= sl."""
    while x.shape[1] > 0:
        regressor_ols = sm.OLS(y, x).fit()
        p_values = np.asarray(regressor_ols.pvalues, dtype=float)
        worst = int(np.argmax(p_values))  # position of the least significant feature
        if p_values[worst] <= sl:
            break  # every remaining feature is significant
        # remove exactly one column per refit, so indices always match the current fit
        x = np.delete(x, worst, axis=1)
        cols = np.delete(cols, worst)

    print(regressor_ols.summary())
    return x, cols

SL = 0.01

data_modeled, selected_columns_2 = backward_elimination(x_encoded.values, y.values, SL, selected_columns)

print(selected_columns_2)
print(len(selected_columns_2))
print(data_modeled)

Backward elimination with a significance level of 0.01 suggests that the data set can be reduced to the following 28 variables:

'transmission_Automatic' 'transmission_Manual' 'transmission_Other' 'transmission_Semi-Auto' 'fuelType_Diesel' 'fuelType_Hybrid' 'fuelType_Other' 'fuelType_Petrol' 'model_ Auris' 'model_ Avensis' 'model_ Aygo' 'model_ C-HR' 'model_ Camry' 'model_ Corolla' 'model_ GT86' 'model_ Hilux' 'model_ IQ' 'model_ Land Cruiser' 'model_ PROACE VERSO' 'model_ Prius' 'model_ RAV4' 'model_ Supra' 'model_ Urban Cruiser' 'model_ Verso' 'model_ Verso-S' 'model_ Yaris' 'year' 'mileage'

This result supports the results of the correlation heat map.
### 3.3. Lasso Method

The last feature selection method, Lasso, is a regression analysis method that performs both variable selection and regularization in order to enhance the prediction accuracy and interpretability of the resulting statistical model.
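
Why Lasso performs variable selection can be seen in a simplified setting. scikit-learn's `Lasso` minimizes `(1 / (2 * n)) * ||y - Xb||^2 + alpha * ||b||_1`; when the features are standardized and uncorrelated with each other, the solution is the least-squares coefficient shrunk towards zero by `alpha` and set to exactly zero if it is smaller than `alpha` in absolute value. The sketch below shows only this special case, with made-up coefficients; real features are correlated, so the fitted model will differ.

```python
import numpy as np

def soft_threshold(beta_ols, alpha):
    """Lasso solution for standardized, mutually uncorrelated features."""
    return np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - alpha, 0.0)

# Invented least-squares coefficients for four features.
beta_ols = np.array([0.30, -0.05, 0.12, 0.02])

beta_lasso = soft_threshold(beta_ols, alpha=0.1)
print(beta_lasso)
print("features kept:", np.count_nonzero(beta_lasso))
```

The larger `alpha` is relative to the coefficients, the more of them are set to zero, which is worth keeping in mind when choosing the grid of `alpha` values to search.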


In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import Lasso

# data split for lasso
X_train, X_test, y_train, y_test = train_test_split(x_encoded, y, test_size=0.25, random_state=42)

# pipeline for scaling and lasso model
pipeline = Pipeline([('scaler',StandardScaler()),('model',Lasso())])

# applying pipeline to our data
search = GridSearchCV(pipeline,
                      {'model__alpha':np.arange(0.1,10,0.1)},
                      cv = 5, scoring="neg_mean_squared_error",verbose=3
                      )
search.fit(X_train,y_train)

# best alpha and the coefficients of the refitted best model
print(search.best_params_)
coefficients = search.best_estimator_.named_steps['model'].coef_
importance = np.abs(coefficients)

# features that Lasso keeps (non-zero coefficient) and drops (zero coefficient)
print(np.array(x_encoded.columns)[importance > 0])
print(np.array(x_encoded.columns)[importance == 0])

The Lasso model suggests that 'engineSize' and 'year' are the significant variables for price. As a result, the other 29 independent variables are dropped.

## 4. Data Fitting to the Models

After analyzing the data set, I observed that the y variable, which is 'price' in this case, is an integer-valued, linearly distributed discrete variable. So, I decided to perform linear regression with the full model and OLS regression with the feature-selected models.
For linear regression, I used the LinearRegression() model of scikit-learn, with all the variables included. For OLS regression, I used the statsmodels package and specified the features as input to the model. The independent variables here come from the previously mentioned feature selection methods, which are backward elimination and the Lasso method.
### 4.1. Linear Regression with Full Model

In [None]:
# Linear Regression model
from sklearn.linear_model import LinearRegression

# Instantiate a LinearRegression model with default parameter values
regr = LinearRegression()

# Fit regr to the train set
regr.fit(X_train, y_train)

In [None]:
from sklearn.metrics import mean_squared_error, r2_score, max_error

# Use regr to predict instances from the test set and store it
y_pred_LR = regr.predict(X_test)

# For a regressor, score() returns the R^2 on the given data, not an accuracy
print("R^2 of linear regression on the test set: ", regr.score(X_test, y_test))
print(f"Max error of predictions: {max_error(y_test, y_pred_LR)}")


# The coefficients
print("Coefficients: \n", regr.coef_)
print("Mean squared error: %.2f" % mean_squared_error(y_test, y_pred_LR))
print("Coefficient of determination: %.2f" % r2_score(y_test, y_pred_LR))

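Linear regression with the full model results in an R<sup>2</sup> value of about 0.95 to 0.96 on the test set. For a regressor, `regr.score` and `r2_score` both report this same quantity, so there is no separate accuracy figure.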
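
For reference, the coefficient of determination is one minus the ratio of the residual sum of squares to the total sum of squares around the mean. A minimal sketch with made-up numbers (the helper name `r_squared` is hypothetical):

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, with SS_tot taken around the mean of y_true."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Invented log10 prices and predictions.
y_true_toy = [3.9, 4.0, 4.1, 4.2]
y_pred_toy = [3.9, 4.1, 4.0, 4.2]

print(round(r_squared(y_true_toy, y_pred_toy), 4))
```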
### 4.2. OLS Model with Feature Selected Models

In this model, I considered two models based on the different feature selection methods.

In [None]:
import statsmodels.api as sm 
#Columns of Backward Feature Selection Method 

column_bfe = ['transmission_Automatic', 'transmission_Manual', 'transmission_Other', 'transmission_Semi-Auto', 'fuelType_Diesel', 'fuelType_Hybrid', 'fuelType_Other', 'fuelType_Petrol', 'model_ Auris', 'model_ Avensis', 'model_ Aygo', 'model_ C-HR', 'model_ Camry', 'model_ Corolla', 'model_ GT86', 'model_ Hilux', 'model_ IQ', 'model_ Land Cruiser', 'model_ PROACE VERSO', 'model_ Prius', 'model_ RAV4', 'model_ Supra', 'model_ Urban Cruiser', 'model_ Verso', 'model_ Verso-S', 'model_ Yaris', 'year', 'mileage']

x_bfe_train = X_train[column_bfe]
x_bfe_test = X_test[column_bfe]

model_bfe = sm.OLS(y_train, x_bfe_train).fit() 
y_pred_bfe = model_bfe.predict(x_bfe_test) 

print(model_bfe.rsquared)  # R^2 on the training data
print(f"Max error of predictions based on backward feature selection: {max_error(y_test, y_pred_bfe)}")

#Columns of Lasso Method 

column_lasso = ['engineSize', 'year']

x_lasso_train = X_train[column_lasso]
x_lasso_test = X_test[column_lasso]

model_lasso = sm.OLS(y_train, x_lasso_train).fit() 
y_pred_lasso = model_lasso.predict(x_lasso_test) 

print(model_lasso.rsquared)  # R^2 on the training data; uncentered, since this model has no constant column
print(f"Max error of predictions based on lasso selection: {max_error(y_test, y_pred_lasso)}")
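
A note on comparing the printed values: when an OLS model is fitted without a constant term, statsmodels reports the uncentered R<sup>2</sup>, which divides by the sum of squared y values instead of the squared deviations from the mean. For a target such as log10(price), whose values sit far from zero, this can be close to 1 even for a weak model. The sketch below illustrates the effect with NumPy only, on synthetic data; the variable names and numbers are assumptions for illustration, not results from this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
year = rng.integers(2005, 2021, size=n).astype(float)
# Synthetic log10 prices: weak dependence on year plus noise (made-up numbers).
log_price = 4.0 + 0.01 * (year - 2012) + rng.normal(scale=0.2, size=n)

def fit_r2(X, y):
    """Least-squares fit; returns (centered R^2, uncentered R^2) on the same data."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    ss_res = np.sum((y - X @ beta) ** 2)
    centered = 1 - ss_res / np.sum((y - y.mean()) ** 2)
    uncentered = 1 - ss_res / np.sum(y ** 2)
    return centered, uncentered

no_const = year.reshape(-1, 1)                   # like sm.OLS(y, X) without a constant
with_const = np.column_stack([np.ones(n), year])  # like sm.OLS(y, sm.add_constant(X))

_, uncentered_no_const = fit_r2(no_const, log_price)
centered_with_const, _ = fit_r2(with_const, log_price)

print("uncentered R^2, no constant:", round(uncentered_no_const, 4))
print("centered R^2, with constant:", round(centered_with_const, 4))
```

For a like-for-like comparison between models, one option is to compute `r2_score(y_test, y_pred)` on the test set for each of them.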

In [None]:
from sklearn.metrics import mean_absolute_error

# Errors are in log10(price) units because y was log-transformed
print(f"Mean absolute error of predictions of LM: {mean_absolute_error(y_test, y_pred_LR)}")
print(f"Mean absolute error of predictions of backward FS: {mean_absolute_error(y_test, y_pred_bfe)}")
print(f"Mean absolute error of predictions of lasso: {mean_absolute_error(y_test, y_pred_lasso)}")

# Metric to check what percentage of predictions is not within £1500 of the true price.
# Both values are converted back from log10 to pounds before the comparison.
def perc_outside(y_pred, y_true, limit=1500):
    error = np.abs(10 ** np.asarray(y_pred) - 10 ** np.asarray(y_true))
    return 100 * np.mean(error > limit)

perc_LR = perc_outside(y_pred_LR, y_test)
perc_bfe = perc_outside(y_pred_bfe, y_test)
perc_lasso = perc_outside(y_pred_lasso, y_test)

print("%.4f%% of the test values are not within the range of £1500 in linear regression test." % perc_LR)
print("%.4f%% of the test values are not within the range of £1500 in backward fs OLS test." % perc_bfe)
print("%.4f%% of the test values are not within the range of £1500 in lasso selected." % perc_lasso)


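The comparative results suggest that the data set restricted to the Lasso-selected columns gives the highest R<sup>2</sup> on the OLS model. Note that `rsquared` in statsmodels is measured on the training data and is uncentered when the model has no constant term, as is the case for the two-column Lasso model, so this figure is not directly comparable with the R<sup>2</sup> of the other models. With that in mind, even though linear regression with the full model and the backward-feature-selected OLS model give good results, for reasons of complexity and R<sup>2</sup> I would choose only the 'engineSize' and 'year' variables to predict the price of a used car, as the Lasso selection suggested.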

Since the data set is not large enough, deep learning methods were not applicable here. The limited size of the data might also affect accuracy; with a larger dataset we might get better results.
