# Model development for price predictions

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline
%config InlineBackend.figure_format = 'svg'

# Larger scale for plots in notebooks
sns.set_context('notebook')

# Enable multiple cell outputs
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

pd.set_option('display.max_colwidth', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)

# Setting seed for entire notebook
SEED = 42
np.random.seed(SEED)


In [5]:
cars_cleaned = pd.read_pickle('dataset/cars_cleaned.pkl')
cars_cleaned.head()
cars_cleaned.dtypes
cars_cleaned.info()

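| | model | year | price | transmission | mileage | fuelType | tax | mpg | engineSize |
|---|-------|------|-------|--------------|---------|----------|-----|-----|------------|
| 0 | GT86 | 2016 | 16000 | Manual | 24089 | Petrol | 265 | 36.2 | 2.0 |
| 1 | GT86 | 2017 | 15995 | Manual | 18615 | Petrol | 145 | 36.2 | 2.0 |
| 2 | GT86 | 2015 | 13998 | Manual | 27469 | Petrol | 265 | 36.2 | 2.0 |
| 3 | GT86 | 2017 | 18998 | Manual | 14736 | Petrol | 150 | 36.2 | 2.0 |
| 4 | GT86 | 2017 | 17498 | Manual | 36284 | Petrol | 145 | 36.2 | 2.0 |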


model           category
year               int64
price              int64
transmission    category
mileage            int64
fuelType        category
tax                int64
mpg              float64
engineSize      category
dtype: object

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6738 entries, 0 to 6737
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype   
---  ------        --------------  -----   
 0   model         6738 non-null   category
 1   year          6738 non-null   int64   
 2   price         6738 non-null   int64   
 3   transmission  6738 non-null   category
 4   mileage       6738 non-null   int64   
 5   fuelType      6738 non-null   category
 6   tax           6738 non-null   int64   
 7   mpg           6738 non-null   float64 
 8   engineSize    6738 non-null   category
dtypes: category(4), float64(1), int64(4)
memory usage: 290.0 KB


In [20]:
# Data preparation
cars_ml = cars_cleaned.copy()
dummies = pd.get_dummies(cars_ml[['model','transmission','fuelType']], drop_first=True)
cars_ml = pd.concat([cars_ml,dummies],axis=1)
cars_ml = cars_ml.drop(['model','transmission','fuelType'],axis=1)
cars_ml.shape
cars_ml.isna().sum().sum()

(6738, 29)

0

In [25]:
from sklearn.linear_model import Ridge, LinearRegression
from sklearn.model_selection import train_test_split

In [34]:
X = cars_ml.drop('price', axis=1).to_numpy()
y = cars_ml['price'].to_numpy()
X.shape, y.shape


((6738, 28), (6738,))

## Model Fitting
Price prediction is a regression problem. I chose **Ridge Regression** as the baseline model because its L2 **regularization** shrinks the coefficients, which makes the fit more robust and reduces the variance of the estimates when features are correlated.

**Linear Regression** is my comparison model because it is the same linear model as **Ridge Regression** but without any **regularization** applied.
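
To make the difference between the two models concrete, here is a minimal sketch on synthetic data (an assumption, not the car dataset). The names `X_demo`, `y_demo`, `ols` and `ridge_demo` are hypothetical and only used for illustration. With two nearly collinear features, the Ridge penalty should produce a smaller coefficient vector than the unregularized fit.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(42)

# Synthetic data (assumption): two nearly collinear features plus noise
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.01, size=200)
X_demo = np.column_stack([x1, x2])
y_demo = 3.0 * x1 + rng.normal(scale=0.5, size=200)

# Same linear model, fitted without and with an L2 penalty
ols = LinearRegression().fit(X_demo, y_demo)
ridge_demo = Ridge(alpha=1.0).fit(X_demo, y_demo)

# The penalty shrinks the coefficient vector towards zero
ols_norm = np.linalg.norm(ols.coef_)
ridge_norm = np.linalg.norm(ridge_demo.coef_)
print("OLS coefficients:  ", ols.coef_, "norm:", ols_norm)
print("Ridge coefficients:", ridge_demo.coef_, "norm:", ridge_norm)
```

On the car data the two models end up almost identical, which suggests that the default penalty (`alpha=1`) has little effect there.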

In [30]:
# splitting the data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=SEED)



### Model 1: Ridge Regression

In [39]:

# Baseline model | Ridge Regression
ridge = Ridge()
ridge.fit(X_train, y_train)
ridge.score(X_test, y_test)

0.9259013740886726

### Model 2: Linear Regression

In [26]:
# Comparison model | Linear Regression
linreg = LinearRegression()
linreg.fit(X_train, y_train)
linreg.score(X_test,y_test)

0.9258346240443078

### Model Evaluation
I chose **root mean squared error (rmse)** as the primary metric for evaluating the models. **rmse** is in the same unit as the predictions (GBP), which makes it easy to interpret.

The Ridge model obtained a **rmse** of GBP1,743.48.

The Linear model obtained a **rmse** of GBP1,744.27.

The **rmse** values of the two models are almost equal, with a negligible difference of GBP0.79 between them. This indicates that the two models perform equally well.
**R_squared** was used as a secondary metric, and it leads to the same conclusion.



| Metric | Ridge Regression | Linear Regression |
|--------|------------------|-------------------|
| **rmse** | GBP1,743.48 | GBP1,744.27 |
| **r_squared** | 0.9259 | 0.9258 |
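
As a small worked example of the two metrics, the sketch below computes **rmse** and **R_squared** by hand on three made-up prices (toy values, not taken from the dataset). The `_demo` and `_manual` names are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Toy prices in GBP (assumption, for illustration only)
y_true_demo = np.array([10000.0, 12000.0, 15000.0])
y_pred_demo = np.array([11000.0, 11000.0, 17000.0])

residuals = y_true_demo - y_pred_demo

# rmse: square root of the mean squared residual, in the same unit as the prices
rmse_manual = np.sqrt(np.mean(residuals ** 2))

# R_squared: 1 minus the ratio of residual to total sum of squares
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(round(rmse_manual, 2), round(r2_manual, 4))
```

Both values should agree with `np.sqrt(mean_squared_error(...))` and `r2_score(...)` from scikit-learn.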

In [38]:
# Model Evaluation
from sklearn.metrics import mean_squared_error

ridge_y_pred = ridge.predict(X_test)
ridge_rmse = np.sqrt(mean_squared_error(y_true=y_test,y_pred=ridge_y_pred))

lin_y_pred = linreg.predict(X_test)
lin_rmse = np.sqrt(mean_squared_error(y_true=y_test,y_pred=lin_y_pred))

lin_rmse.round(2)
ridge_rmse.round(2)

lin_rmse.round(2)-ridge_rmse.round(2)

1744.27

1743.48

0.7899999999999636

### Hyperparameter Tuning
Since the baseline and comparison models were closely matched, I tuned the models further using **Grid Search** cross validation over their hyperparameters, with **R_squared** as the evaluation metric. A Decision Tree Regressor was added as a third candidate.

In [69]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import KFold, GridSearchCV

def find_best_model(X, y):
    """Grid-search each candidate model and return its best CV score and parameters."""
    # Candidate models and the hyperparameter grid searched for each
    params_gd = {
        'Ridge Regression': {
            'model': Ridge(),
            'params': {
                'alpha': [1, 2, 5, 10],
                'fit_intercept': [True, False]
            }
        },
        'Linear Regression': {
            'model': LinearRegression(),
            'params': {
                'fit_intercept': [True, False],
                'positive': [True, False]
            }
        },
        'DecisionTree Regression': {
            'model': DecisionTreeRegressor(),
            'params': {
                'criterion': ['absolute_error', 'friedman_mse'],
                'splitter': ['best', 'random'],
                'max_depth': [None, 1, 6, 10, 20],
                'min_samples_leaf': [1, 0.8, 0.5, 0.2]
            }
        }
    }

    scores = []
    # Shuffled 5-fold split, seeded so the folds are reproducible
    cv = KFold(n_splits=5, random_state=SEED, shuffle=True)
    for model_name, model_config in params_gd.items():
        # No scoring argument is passed, so the estimator's default score (R_squared) is used
        gs = GridSearchCV(model_config['model'], model_config['params'], cv=cv, return_train_score=False, n_jobs=-1)
        gs.fit(X, y)
        scores.append({
            'model': model_name,
            'best_score': gs.best_score_,
            'best_params': gs.best_params_,
        })
    return pd.DataFrame(scores)


In [70]:
hyp_tuned_model = find_best_model(X,y)
hyp_tuned_model.sort_values('best_score', ascending=False)

| | model | best_score | best_params |
|---|-------|------------|-------------|
| 2 | DecisionTree Regression | 0.945339 | {'criterion': 'friedman_mse', 'max_depth': 10, 'min_samples_leaf': 1, 'splitter': 'best'} |
| 1 | Linear Regression | 0.92646 | {'fit_intercept': True, 'positive': False} |
| 0 | Ridge Regression | 0.925493 | {'alpha': 1, 'fit_intercept': True} |
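
The grid search reports the best parameters but does not by itself give a score on the held-out split. A minimal sketch of refitting a tree with the reported best parameters and scoring it on a test split could look like the following. It runs on synthetic data from `make_regression` (an assumption, standing in for `X` and `y`), and the `_demo` and `tuned_` names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

SEED = 42

# Synthetic stand-in for the car features and prices (assumption)
X_demo, y_demo = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=SEED)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=SEED)

# Best parameters as reported in the grid search table above
best_params = {
    'criterion': 'friedman_mse',
    'max_depth': 10,
    'min_samples_leaf': 1,
    'splitter': 'best',
}

# Refit on the training split only, then evaluate on the untouched test split
tuned_tree = DecisionTreeRegressor(random_state=SEED, **best_params)
tuned_tree.fit(X_tr, y_tr)
tuned_r2 = tuned_tree.score(X_te, y_te)
tuned_rmse = np.sqrt(mean_squared_error(y_te, tuned_tree.predict(X_te)))
print(tuned_r2, tuned_rmse)
```

Fixing `random_state` on the tree is a design choice made here so that repeated runs give the same split decisions.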


In [63]:
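# Decision tree with default hyperparameters, fitted on the training split
dct = DecisionTreeRegressor()
dct.fit(X_train,y_train)
# R_squared on the test split
dct.score(X_test,y_test)
# rmse on the test split
dct_y_pred = dct.predict(X_test)
dct_rmse = np.sqrt(mean_squared_error(y_true=y_test, y_pred=dct_y_pred))
dct_rmse.round(2)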

0.9479050863068267

1461.87

In [68]:
# rmse as a percentage of the mean price
(ridge_rmse / cars_cleaned['price'].mean())*100
(dct_rmse / cars_cleaned['price'].mean())*100

13.922906933110394

11.674075677928382

### Conclusion
The Decision Tree Regressor provides the best fit. On the test set, the tree fitted in the cell above (with default hyperparameters) reaches an R_squared value of 0.948 and a **rmse** of GBP1,461.87. Relative to the mean price, this **rmse** represents a deviation of about (+/-)12%, which is 2 percentage points short of the 10% requirement but better than the current estimates of around **(+/-)30%**.
In the grid search, the Decision Tree Regressor also scored the highest cross-validated R_squared value of 0.945. The Ridge and Linear Regression models scored closely matched values of about 0.925 and 0.926.
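
The comparison with the 10% requirement is simple arithmetic on the percentages printed earlier in the notebook. The sketch below only reuses those reported values. The names `TARGET_PCT`, `reported_pct` and `gaps` are hypothetical.

```python
# Requirement quoted in the conclusion, in percent of the mean price
TARGET_PCT = 10.0

# rmse as a percentage of the mean price, copied from the notebook output
reported_pct = {
    "Ridge Regression": 13.922906933110394,
    "Decision Tree Regression": 11.674075677928382,
}

# Gap to the requirement in percentage points (positive means the target is missed)
gaps = {name: pct - TARGET_PCT for name, pct in reported_pct.items()}

for name, gap in gaps.items():
    status = "meets" if gap <= 0 else "misses"
    print(f"{name}: {status} the target by {abs(gap):.2f} percentage points")
```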