<a id="toc"></a>

# <p style="background-color: #008080; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:5px 5px;">Auto Scout Car Prices Prediction Project: <br> Modeling, Model-Selection, and Feature-Selection </p>

## <p style="background-color: #008080; font-family:newtimeroman; color:#FFF9ED; font-size:150%; text-align:center; border-radius:10px 10px;">Content</p>

* [INTRODUCTION](#0)
* [IMPORTING LIBRARIES NEEDED IN THIS NOTEBOOK](#1)
* [FUNCTIONS](#fn)
* [MODELING](#2A)
* [MODEL-SELECTION](#2B)
* [FEATURE-SELECTION](#2C)
* [END OF MODELING](#3)

<a id="0"></a>

## Introduction

Copy from the file 00_data_cleaning after the project is finished


### Modeling
* I will try several models: linear OLS, polynomial OLS, Lasso, Ridge, Random Forest, SGDRegressor, XGBoost, LightGBM, and CatBoost.
### Model-Selection
* In model selection, I will drop the worst-performing models.
### Feature-Selection
* Finally, I will run feature selection with the remaining models to keep only about 10% of the features in the current dataset.


<a id="1"></a>

## Importing Libraries

In [1]:
from catboost import CatBoostRegressor
import numpy as np
from lightgbm import LGBMRegressor, plot_importance
import matplotlib.pyplot as plt
import pandas as pd
import regex as re
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge, Lasso, SGDRegressor
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler, PolynomialFeatures
from time import perf_counter
from xgboost import XGBRegressor

<a id='fn'></a>
## Functions

In [2]:
def make_results_table(models, X_train_tr, y_train, X_test_tr, y_test):
    '''
    input: dict of models (name: model), X_train_tr, y_train, X_test_tr, y_test
    output: dataframe of models and their performance scores, plus the dict of fitted models.
    '''
    rows = []
    for name, model in models.items():
        print(name)
        t0 = perf_counter()
        model.fit(X_train_tr, y_train)
        y_test_hat = model.predict(X_test_tr)
        row = {
            'model': name,
            'score(r2)': model.score(X_test_tr, y_test),
            # rmse and mae are truncated to integers
            'rmse': int(mean_squared_error(y_test, y_test_hat) ** (1/2)),
            'mae': int(mean_absolute_error(y_test, y_test_hat)),
        }
        # elapsed time covers fitting, prediction, and scoring
        row['time'] = perf_counter() - t0
        rows.append(row)
    # build the table once instead of concatenating onto an empty frame
    results = pd.DataFrame(rows, columns=['model', 'score(r2)', 'rmse', 'mae', 'time'])
    return results, models


<a id="2A"></a>
## Modeling

In [3]:
df = pd.read_json('data_post03.json', lines=True)

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31768 entries, 0 to 31767
Columns: 191 entries, price to saf_led_headlights
dtypes: float64(5), int64(186)
memory usage: 46.3 MB


In [5]:
vat = df['vat_deductible']

In [6]:
df = df.drop(['consumption_city','consumption_country','vat_deductible'], axis=1)

In [7]:
df['vat_deductible'] = vat

In [8]:
df.columns[:15]

Index(['price', 'km', 'hp', 'displacement', 'weight', 'co2_emission',
       'warranty_mo', 'consumption_comb', 'make_model_audi_a1',
       'make_model_audi_a3', 'make_model_opel_astra', 'make_model_opel_corsa',
       'make_model_opel_insignia', 'make_model_renault_clio',
       'make_model_renault_espace'],
      dtype='object')

In [9]:
num_cols = ['km', 'hp', 'displacement', 'weight', 'co2_emission',
       'warranty_mo', 'consumption_comb']

### Linear without polynomial features
#### preliminary work - train and test split

In [10]:
X = df.drop('price',axis=1)
y = df.price

In [11]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

In [12]:
num_transformer = StandardScaler()

In [13]:
preprocessor = ColumnTransformer(
    transformers=[
        ('num', num_transformer, num_cols)
    ], remainder='passthrough')

In [14]:
X_train_tr = preprocessor.fit_transform(X_train)

In [15]:
X_test_tr = preprocessor.transform(X_test)
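
The same fit-on-train, transform-on-test pattern can also be bundled with the estimator in a single `Pipeline` (already imported above). The following is only a minimal sketch on synthetic data: the frame `toy`, its column `is_automatic`, and the price formula are assumptions made for illustration and are not part of the Auto Scout dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the car data (assumption: two numeric columns, one dummy).
rng = np.random.default_rng(42)
n = 200
toy = pd.DataFrame({
    "km": rng.uniform(0, 200_000, n),
    "hp": rng.uniform(50, 300, n),
    "is_automatic": rng.integers(0, 2, n),
})
# Noise-free linear price, so a linear model should fit it almost exactly.
toy["price"] = 5_000 + 80 * toy["hp"] - 0.03 * toy["km"] + 1_500 * toy["is_automatic"]

X_toy = toy.drop("price", axis=1)
y_toy = toy["price"]
Xtr, Xte, ytr, yte = train_test_split(X_toy, y_toy, test_size=0.20, random_state=42)

# The scaler is fitted inside the pipeline, on the training rows only.
toy_pipe = Pipeline(steps=[
    ("prep", ColumnTransformer(
        transformers=[("num", StandardScaler(), ["km", "hp"])],
        remainder="passthrough")),
    ("model", LinearRegression()),
])

toy_pipe.fit(Xtr, ytr)
toy_r2 = toy_pipe.score(Xte, yte)  # test rows are only transformed, never refitted
print(round(toy_r2, 4))
```

With a pipeline, the scaler cannot accidentally be refitted on the test split, and the whole object can be handed to cross-validation helpers.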

#### Linear OLS

In [16]:
ols = LinearRegression()

In [17]:
ols.fit(X_train_tr, y_train)

In [18]:
ols.score(X_test_tr, y_test)

0.9466939450915532

In [19]:
y_test_hat = ols.predict(X_test_tr)
(mean_squared_error(y_test, y_test_hat))**(1/2)

2412.8798533539116

#### Linear Ridge

In [20]:
ridge = Ridge()

In [21]:
ridge.fit(X_train_tr,y_train)

In [22]:
ridge.score(X_test_tr,y_test)

0.9466979681387315

In [23]:
y_test_hat = ridge.predict(X_test_tr)
(mean_squared_error(y_test, y_test_hat))**(1/2)

2412.78880072715

### Adding polynomial features
* polynomial (deg 2) has no observable effect on scores for OLS or Ridge

In [24]:
poly = PolynomialFeatures(degree=2)

In [25]:
preprocessor2 = ColumnTransformer(
    transformers=[
        ('poly', poly, ['warranty_mo']),
        ('num', num_transformer, num_cols)
    ], remainder='passthrough')

# for poly, I also tried 'hp', 'km', 'consumption_comb', etc.,
# for which the objective function does not converge
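
To make concrete what the `poly` step contributes, here is a small sketch of a degree-2 expansion of a single column. The warranty values are made up for illustration. For one input column the default expansion yields three columns: a constant 1, the original value, and its square. Because `warranty_mo` also passes through the `num` scaler in `preprocessor2`, and `LinearRegression` fits its own intercept, the squared term is probably the only genuinely new information, which may explain why the scores barely move.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical warranty lengths in months (illustrative values only).
warranty = np.array([[0.0], [12.0], [24.0]])

# Default settings: columns are [1, x, x**2]; the leading 1 is the bias term.
expanded = PolynomialFeatures(degree=2).fit_transform(warranty)

# include_bias=False drops the constant column: columns are [x, x**2].
no_bias = PolynomialFeatures(degree=2, include_bias=False).fit_transform(warranty)

print(expanded)
print(no_bias)
```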

In [26]:
X_train_tr2 = preprocessor2.fit_transform(X_train)
X_test_tr2 = preprocessor2.transform(X_test)

#### Poly OLS

In [27]:
ols2 = LinearRegression()

In [28]:
ols2.fit(X_train_tr2, y_train)

In [29]:
ols2.score(X_test_tr2, y_test)

0.9466939946117351

In [30]:
y_test_hat = ols2.predict(X_test_tr2)
(mean_squared_error(y_test, y_test_hat))**(1/2)

2412.8787325968315

#### Poly Ridge

In [31]:
ridge2 = Ridge()

In [32]:
ridge2.fit(X_train_tr2,y_train)

In [33]:
ridge2.score(X_test_tr2,y_test)

0.9466966706260205

In [34]:
y_test_hat = ridge2.predict(X_test_tr2)
(mean_squared_error(y_test, y_test_hat))**(1/2)

2412.818167385202

### Preliminary Comparative Performance of the Models

* I have created a function that builds a dataframe of performance metrics for the models.
* The performance metrics I have chosen for comparison in this first-stage elimination are: score(r2), rmse, mae, and run time.
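
For reference, the three error metrics follow the standard definitions below, where $y_i$ is the observed price, $\hat{y}_i$ the predicted price, $\bar{y}$ the mean observed price, and $n$ the number of rows in the test set. The `score` method of scikit-learn regressors returns $R^2$.

$$
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2},
\qquad
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2},
\qquad
\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert
$$

RMSE and MAE are in the same units as the price; RMSE penalizes large misses more heavily than MAE.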

In [35]:
models = {}

In [36]:
#ols = LinearRegression()
#models['OLS'] = ols

In [37]:
lasso = Lasso()
models['Lasso'] = lasso

In [38]:
ridge = Ridge()
models['Ridge'] = ridge

In [39]:
sgd = SGDRegressor()
models['SGD'] = sgd

In [40]:
xgb = XGBRegressor()
models['XGB'] = xgb

In [41]:
rf = RandomForestRegressor()
models['RF'] = rf

In [42]:
lgbm = LGBMRegressor()
models['LGBM'] = lgbm

In [43]:
cb = CatBoostRegressor(verbose=False)
models['CatBoost'] = cb

In [44]:
results, models = make_results_table(models, X_train_tr, y_train, X_test_tr, y_test)

<a id="2B"></a>
## Model-Selection

* I can easily reject OLS, Ridge, and SGD based on their performance scores.
* RF takes far too long to train for a modest gain.
* For feature selection, I will only use XGB, LGBM, and CatBoost.

In [None]:
results
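
A single train/test split gives one score per model, so small differences between models can be split-dependent. As a hedged sketch of a more robust comparison, `cross_val_score` (imported above) can report the mean and spread of R² over several folds. The data below is synthetic and only stands in for `X_train_tr` and `y_train`; the two candidate models are chosen for speed, not because they are the ones retained here.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import cross_val_score

# Synthetic regression data (assumption: stands in for X_train_tr / y_train).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 5))
y_demo = X_demo @ np.array([3.0, -2.0, 0.0, 1.0, 0.5]) + rng.normal(scale=0.1, size=300)

candidates = {"Ridge": Ridge(), "Lasso": Lasso()}

cv_summary = {}
for name, model in candidates.items():
    # Five folds; each fold is held out once and scored with R^2.
    fold_scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="r2")
    cv_summary[name] = (fold_scores.mean(), fold_scores.std())

for name, (mean_r2, std_r2) in cv_summary.items():
    print(f"{name}: R^2 = {mean_r2:.3f} +/- {std_r2:.3f}")
```

On the real data, any scaling step would ideally sit inside a pipeline passed to `cross_val_score`, so that each fold is scaled using its own training rows only.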

<a id="2C"></a>
## Feature Selection

* The trained models are stored in the dictionary `models`.
* I will use them to inspect feature importance.

### XGB

In [None]:
xgb = models['XGB']

In [None]:
feature_importance = xgb.get_booster().get_score(importance_type='weight')
values = list(feature_importance.values())
keys = [X_train.columns[int(re.sub('f','',x))] for x in feature_importance.keys()]
data = pd.DataFrame(data=values, index=keys, columns=["scores"]).sort_values(by = "scores", ascending=True)
data.nlargest(15, columns="scores").plot(kind='barh', figsize = (15,5)) ## plot top 15 features

In [None]:
xgb_cols = set(data.nlargest(15, columns="scores").index)
xgb_cols

### LGBM

In [None]:
lgbm = models['LGBM']

In [None]:
feature_importance = list(lgbm.feature_importances_)
values = list(feature_importance)
keys = list(X_train.columns)
data = pd.DataFrame(data=values, index=keys, columns=["scores"]).sort_values(by = "scores", ascending=True)
data.nlargest(15, columns="scores").plot(kind='barh', figsize = (15,5)) ## plot top 15 features

In [None]:
lgbm_cols = set(data.nlargest(15, columns="scores").index)
lgbm_cols

### CatBoost

In [None]:
cb = models['CatBoost']

In [None]:
feature_importance = list(cb.feature_importances_)
values = list(feature_importance)
keys = list(X_train.columns)
data = pd.DataFrame(data=values, index=keys, columns=["scores"]).sort_values(by = "scores", ascending=True)
data.nlargest(15, columns="scores").plot(kind='barh', figsize = (15,5)) ## plot top 15 features

In [None]:
cb_cols = set(data.nlargest(15, columns="scores").index)
cb_cols

In [None]:
sel_cols = xgb_cols.union(lgbm_cols).union(cb_cols)

In [None]:
sel_cols.add('price')

In [None]:
len(sel_cols), sel_cols
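
The union of three top-15 sets can hold anywhere from 15 to 45 columns, depending on how much the models agree, while the stated target of about 10% of the roughly 190 columns is about 19. The sketch below uses made-up top-3 sets (not actual results from the models) to show the union alongside two stricter alternatives: the intersection, and a simple vote count that keeps features ranked highly by at least two models.

```python
from collections import Counter

# Hypothetical top-3 sets per model (the notebook uses top-15 sets).
top_xgb = {"km", "hp", "weight"}
top_lgbm = {"km", "hp", "displacement"}
top_cb = {"km", "co2_emission", "hp"}

# Union: every feature that any model ranks highly (what sel_cols collects).
union = top_xgb | top_lgbm | top_cb

# Intersection: only the features all three models agree on.
shared = top_xgb & top_lgbm & top_cb

# Vote count: features chosen by at least two of the three models.
votes = Counter(col for s in (top_xgb, top_lgbm, top_cb) for col in s)
at_least_two = sorted(col for col, n_votes in votes.items() if n_votes >= 2)

print(len(union), sorted(union))
print(sorted(shared))
print(at_least_two)
```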

## Model Selection Using Selected Features

In [None]:
num_cols = ['km', 'hp', 'displacement', 'weight', 'co2_emission', 'warranty_mo', 'consumption_comb']
# numeric columns first, then the selected columns not already listed (includes 'price')
all_cols = num_cols + sorted(c for c in sel_cols if c not in num_cols)

In [None]:
df2 = df[all_cols]

In [None]:
X = df2.drop('price',axis=1)
y = df2.price
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
num_transformer = StandardScaler()
preprocessor = ColumnTransformer(
    transformers=[
        ('num', num_transformer, num_cols)
    ], remainder='passthrough')
X_train_tr = preprocessor.fit_transform(X_train)
X_test_tr = preprocessor.transform(X_test)  # reuse the scaler fitted on the training split
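
A small sketch of the difference between `transform` and `fit_transform` on held-out data, using made-up mileage values. Calling `transform` applies the mean and variance learned from the training rows, whereas `fit_transform` relearns them from the test rows, which hides any shift between the two splits and lets test information leak into preprocessing.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values: the test rows are shifted upward relative to the training rows.
train_km = np.array([[0.0], [10.0], [20.0]])   # mean 10
test_km = np.array([[20.0], [30.0], [40.0]])   # mean 30

# Statistics learned from the training split, then applied to the test split.
scaler = StandardScaler().fit(train_km)
test_with_train_stats = scaler.transform(test_km)

# Statistics relearned from the test split itself.
test_refit = StandardScaler().fit_transform(test_km)

print(test_with_train_stats.ravel())  # all positive: the shift is preserved
print(test_refit.ravel())             # centered on zero: the shift is erased
```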

In [None]:
models = {}
#xgb = XGBRegressor()
#models['XGB'] = xgb
lgbm = LGBMRegressor()
models['LGBM'] = lgbm
cb = CatBoostRegressor(verbose=False)
models['CatBoost'] = cb

In [None]:
results, models = make_results_table(models, X_train_tr, y_train, X_test_tr, y_test)

In [None]:
results

In [None]:
lgbm = models['LGBM']
feature_importance = list(lgbm.feature_importances_)
values = list(feature_importance)
keys = list(X_train.columns)
assert len(values) == len(keys)
data = pd.DataFrame(data=values, index=keys, columns=["scores"]).sort_values(by = "scores", ascending=True)
data.nlargest(15, columns="scores").plot(kind='barh', figsize = (15,5)) ## plot top 15 features

## Conclusion
* LGBM is the best model for our purpose: it is the fastest of the models with the best scores.

<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

<a id="3"></a>
## End of Modeling