<a id="toc"></a>

# <p style="background-color: #008080; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:5px 5px;">Auto Scout Car Prices Prediction Project: <br> Modeling, Model-Selection, and Feature-Selection </p>

## <p style="background-color: #008080; font-family:newtimeroman; color:#FFF9ED; font-size:150%; text-align:center; border-radius:10px 10px;">Content</p>

* [INTRODUCTION NOTEBOOK](00_introduction.ipynb)
* [IMPORTING LIBRARIES NEEDED IN THIS NOTEBOOK](#1)
* [FUNCTIONS](#fn)
* [MODELING](#2A)
* * I am going to try many models: linear OLS, polynomial OLS, Lasso, Ridge, Random Forest, SGDRegressor, XGBoost, LightGBM, and CatBoost.
* [MODEL-SELECTION](#2B)
* * I will drop the worst performing ones.
* [FEATURE-SELECTION](#2C)
* * I'll keep only about 10 of the features from the current dataset.
* [THE END OF MODELING](#3)

<a id="1"></a>

## Importing Libraries

In [1]:
from catboost import CatBoostRegressor
import numpy as np
from lightgbm import LGBMRegressor, plot_importance
import matplotlib.pyplot as plt
import pandas as pd
import re  # the standard-library module is enough for the single re.sub call below
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge, Lasso, SGDRegressor
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler, PolynomialFeatures
from time import perf_counter
from xgboost import XGBRegressor

ModuleNotFoundError: No module named 'catboost'

<a id='fn'></a>
## Functions

In [None]:
def make_results_table(models,X_train_tr, y_train, X_test_tr, y_test):
    '''
    input: dict of models (name: model), X_train_tr, y_train, X_test_tr, y_test
    output: dataframe of the models' performance scores (r2, rmse, mae, and the
            fit-and-predict time in seconds) and the dict of fitted models.
    '''
    results = pd.DataFrame(columns = ['model','score(r2)', 'rmse', 'mae', 'time'])
    for name in models.keys():
        print(name)
        t0 = perf_counter()
        model = models[name]
        row = {}
        row['model'] = [name]
        model.fit(X_train_tr, y_train)
        row['score(r2)'] = [model.score(X_test_tr, y_test)]
        y_test_hat = model.predict(X_test_tr)
        row['rmse'] = [int((mean_squared_error(y_test, y_test_hat))**(1/2))]
        row['mae'] = [int(mean_absolute_error(y_test, y_test_hat))]
        t1 = perf_counter()
        row['time'] = t1 - t0
        row = pd.DataFrame(row)
        results = pd.concat([results, row], ignore_index=True)
        models[name] = model
    return results, models

In [None]:
def make_results_table_with_cv(models,df2,rep=10):
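    '''
    input: dict of models (name: model), dataframe df2 that includes the 'price' column,
           rep = number of random 80/20 train/test splits
    output: dataframe of the mean and standard deviation of each model's scores over the splits,
            the dict of fitted models, and the list of feature column names.
    note: relies on the global list num_cols for the columns to scale.
    '''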
    results = pd.DataFrame(columns = ['model','score(r2)', 'rmse', 'mae', 'time'])
    for i in range(rep):
        X = df2.drop('price',axis=1)
        y = df2.price
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
        num_transformer = StandardScaler()
        preprocessor = ColumnTransformer(
                transformers=[
                    ('num', num_transformer, num_cols)
                ], remainder='passthrough')
        X_train_tr = preprocessor.fit_transform(X_train)
        X_test_tr = preprocessor.transform(X_test)
        for name in models.keys():
            print(name)
            t0 = perf_counter()
            model = models[name]
            row = {}
            row['model'] = [name]
            model.fit(X_train_tr, y_train)
            row['score(r2)'] = [model.score(X_test_tr, y_test)]
            y_test_hat = model.predict(X_test_tr)
            row['rmse'] = [int((mean_squared_error(y_test, y_test_hat))**(1/2))]
            row['mae'] = [int(mean_absolute_error(y_test, y_test_hat))]
            t1 = perf_counter()
            row['time'] = t1 - t0
            row = pd.DataFrame(row)
            results = pd.concat([results, row], ignore_index=True)
            models[name] = model
            
    # Aggregate the per-split scores into a mean and a sample standard deviation per model.
    # (The earlier version took the square root of the summed squared deviations
    # without dividing by the number of splits, which is not a standard deviation.)
    metrics = ['score(r2)', 'rmse', 'mae', 'time']
    results[metrics] = results[metrics].astype(float)
    grouped = results.groupby('model', sort=False)
    meta = pd.DataFrame({'model': results.model.unique()}).set_index('model')
    for s in metrics:
        meta[s + '_mean'] = grouped[s].mean()
        meta[s + '_sd'] = grouped[s].std()
    meta = meta.reset_index()
    colnames = list(X_train.columns)
    return meta, models, colnames

<a id="2A"></a>
## Modeling

In [None]:
df = pd.read_json('data_post03.json', lines=True)

In [None]:
df.info()

In [None]:
cat_cols = ['make_model', 'body_type','prev_owner','type', 'body_color',
           'paint_type', 'nr_doors', 'nr_seats', 'gearing_type', 'drive_chain', 'fuel', 
            'country_version', 'upholstery_material', 'upholstery_color', 'emission_class', 'gears', 'cylinders']
df = df.drop(cat_cols, axis=1)

In [None]:
vat = df['vat_deductible']

In [None]:
df = df.drop(['consumption_city','consumption_country','vat_deductible'], axis=1)

In [None]:
df['vat_deductible'] = vat

In [None]:
df.columns[:15]

In [None]:
num_cols = ['km', 'age', 'hp', 'displacement', 'weight', 'co2_emission',
       'warranty_mo', 'consumption_comb']

### Linear without polynomial features
#### preliminary work - train and test split

In [None]:
X = df.drop('price',axis=1)
y = df.price

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

In [None]:
num_transformer = StandardScaler()

In [None]:
preprocessor = ColumnTransformer(
    transformers=[
        ('num', num_transformer, num_cols)
    ], remainder='passthrough')

In [None]:
X_train_tr = preprocessor.fit_transform(X_train)

In [None]:
X_test_tr = preprocessor.transform(X_test)
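
The scaling steps above can also be bundled with an estimator in a single `Pipeline` (imported at the top of this notebook), so that the scaler is always fitted on the training split only. This is a minimal sketch on synthetic data: the columns `km`, `hp` and `flag` and the toy price formula are assumptions made for illustration, not the notebook's dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the car data (hypothetical columns and coefficients).
rng = np.random.default_rng(42)
n = 200
toy = pd.DataFrame({
    'km': rng.uniform(0, 200_000, n),
    'hp': rng.uniform(50, 300, n),
    'flag': rng.integers(0, 2, n),  # an already-encoded binary feature
})
toy['price'] = (5_000 + 60 * toy['hp'] - 0.03 * toy['km']
                + 500 * toy['flag'] + rng.normal(0, 300, n))

X_toy = toy.drop('price', axis=1)
y_toy = toy['price']
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.20, random_state=42)

# Scale only the numeric columns; pass the remaining columns through unchanged.
toy_preprocessor = ColumnTransformer(
    transformers=[('num', StandardScaler(), ['km', 'hp'])],
    remainder='passthrough')

# fit() fits the scaler on the training split, then fits the regressor on the result.
pipe = Pipeline([('prep', toy_preprocessor), ('model', LinearRegression())])
pipe.fit(X_tr, y_tr)
toy_r2 = pipe.score(X_te, y_te)
print(round(toy_r2, 3))
```

With this layout a single `pipe.fit` / `pipe.score` pair replaces the separate `fit_transform` and `transform` calls.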

#### Linear OLS

In [None]:
ols = LinearRegression()

In [None]:
ols.fit(X_train_tr, y_train)

In [None]:
ols.score(X_test_tr, y_test)

In [None]:
y_test_hat = ols.predict(X_test_tr)
(mean_squared_error(y_test, y_test_hat))**(1/2)

#### Linear Ridge

In [None]:
ridge = Ridge()

In [None]:
ridge.fit(X_train_tr,y_train)

In [None]:
ridge.score(X_test_tr,y_test)

In [None]:
y_test_hat = ridge.predict(X_test_tr)
(mean_squared_error(y_test, y_test_hat))**(1/2)

### Adding polynomial features
* Adding degree-2 polynomial features has no observable effect on the scores of OLS or Ridge.

In [None]:
poly = PolynomialFeatures(degree=2)

In [None]:
preprocessor2 = ColumnTransformer(
    transformers=[
        ('poly', poly, ['warranty_mo']),
        ('num', num_transformer, num_cols)
    ], remainder='passthrough')

# For poly, I also tried 'hp', 'km', 'consumption_comb', etc.,
# for which the objective function does not converge.
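
For a single input column, `PolynomialFeatures(degree=2)` produces three output columns: a constant, the original value, and its square. A minimal sketch with made-up warranty lengths (the values are illustrative only):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Toy warranty lengths in months (invented values).
x = np.array([[0.0], [12.0], [24.0]])

poly_demo = PolynomialFeatures(degree=2)
expanded = poly_demo.fit_transform(x)

# Output columns are: bias (all ones), x, x**2.
print(expanded)
```

Passing `include_bias=False` would drop the constant column, which a linear model with its own intercept does not need.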

In [None]:
X_train_tr2 = preprocessor2.fit_transform(X_train)
X_test_tr2 = preprocessor2.transform(X_test)

#### Poly OLS

In [None]:
ols2 = LinearRegression()

In [None]:
ols2.fit(X_train_tr2, y_train)

In [None]:
ols2.score(X_test_tr2, y_test)

In [None]:
y_test_hat = ols2.predict(X_test_tr2)
(mean_squared_error(y_test, y_test_hat))**(1/2)

#### Poly Ridge

In [None]:
ridge2 = Ridge()

In [None]:
ridge2.fit(X_train_tr2,y_train)

In [None]:
ridge2.score(X_test_tr2,y_test)

In [None]:
y_test_hat = ridge2.predict(X_test_tr2)
(mean_squared_error(y_test, y_test_hat))**(1/2)

### Preliminary Comparative Performance of 8 Models

* I have created a function that builds a dataframe of the models' performance metrics.
* The performance metrics I have chosen for comparison at this first elimination stage are: score (r2), rmse, mae, and run time.
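
For reference, the three accuracy metrics can be computed on a small hand-made example as follows. The prices below are invented for illustration and are not taken from the dataset.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Invented true and predicted prices.
y_true = np.array([10_000, 12_000, 15_000, 20_000])
y_pred = np.array([11_000, 12_000, 14_000, 18_000])

# RMSE: square root of the mean squared error, in the same units as price.
rmse = mean_squared_error(y_true, y_pred) ** 0.5
# MAE: mean absolute error, less sensitive to a single large miss than RMSE.
mae = mean_absolute_error(y_true, y_pred)
# R^2: share of the variance in y_true that the predictions explain.
r2 = r2_score(y_true, y_pred)

print(rmse, mae, r2)
```

Here RMSE is larger than MAE because the single 2,000 error is weighted more heavily once squared.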

In [None]:
models = {}

In [None]:
ols = LinearRegression()
models['OLS'] = ols

In [None]:
lasso = Lasso()
models['Lasso'] = lasso

In [None]:
ridge = Ridge()
models['Ridge'] = ridge

In [None]:
sgd = SGDRegressor()
models['SGD'] = sgd

In [None]:
xgb = XGBRegressor()
models['XGB'] = xgb

In [None]:
rf = RandomForestRegressor()
models['RF'] = rf

In [None]:
lgbm = LGBMRegressor()
models['LGBM'] = lgbm

In [None]:
cb = CatBoostRegressor(verbose=False)
models['CatBoost'] = cb

In [None]:
results, models = make_results_table(models,X_train_tr, y_train, X_test_tr, y_test)

<a id="2B"></a>
## Model-Selection

* I can easily reject OLS, Ridge, and SGD based on their performance scores.
* RF takes far too much time for a modest gain.
* For feature selection I will only use XGB, LGBM, and CatBoost.

In [None]:
results

<a id="2C"></a>
## Feature Selection

* The trained models are stored in the dictionary `models`.
* I will use them to inspect the feature importances.

### XGB

In [None]:
xgb = models['XGB']

In [None]:
feature_importance = xgb.get_booster().get_score(importance_type='weight')
values = list(feature_importance.values())
keys = [X_train.columns[int(re.sub('f','',x))] for x in feature_importance.keys()]
data = pd.DataFrame(data=values, index=keys, columns=["scores"]).sort_values(by = "scores", ascending=True)
data.nlargest(10, columns="scores").plot(kind='barh', figsize = (15,5)) ## plot top 10 features

In [None]:
xgb_cols = set(data.nlargest(10, columns="scores").index)
xgb_cols

### LGBM

In [None]:
lgbm = models['LGBM']

In [None]:
feature_importance = list(lgbm.feature_importances_)
values = list(feature_importance)
keys = list(X_train.columns)
data = pd.DataFrame(data=values, index=keys, columns=["scores"]).sort_values(by = "scores", ascending=True)
data.nlargest(10, columns="scores").plot(kind='barh', figsize = (15,5)) ## plot top 10 features

In [None]:
lgbm_cols = set(data.nlargest(10, columns="scores").index)
lgbm_cols

### catBoost

In [None]:
cb = models['CatBoost']

In [None]:
feature_importance = list(cb.feature_importances_)
values = list(feature_importance)
keys = list(X_train.columns)
data = pd.DataFrame(data=values, index=keys, columns=["scores"]).sort_values(by = "scores", ascending=True)
data.nlargest(10, columns="scores").plot(kind='barh', figsize = (15,5)) ## plot top 10 features

In [None]:
cb_cols = set(data.nlargest(10, columns="scores").index)
cb_cols

In [None]:
good_cols = xgb_cols.union(lgbm_cols).union(cb_cols)

In [None]:
good_cols.add('price')

In [None]:
len(good_cols), good_cols

## Model Selection Using Selected Features

In [None]:
num_cols = ['km', 'hp', 'displacement', 'weight', 'co2_emission', 'warranty_mo', 'consumption_comb']
sel_cols = num_cols.copy()
for i in good_cols:
    if i not in set(num_cols):
        sel_cols.append(i)

In [None]:
sel_cols

In [None]:
df.columns

In [None]:
df2 = df[sel_cols]

In [None]:
df2.columns

## Cross Validation (10 reps) and Final Model Selection

In [None]:
models = {}
xgb = XGBRegressor()
models['XGB'] = xgb
lgbm = LGBMRegressor()
models['LGBM'] = lgbm
cb = CatBoostRegressor(verbose=False)
models['CatBoost'] = cb

In [None]:
meta, models, colnames = make_results_table_with_cv(models,df2)

In [None]:
meta
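
The function used above repeats random 80/20 hold-out splits rather than running k-fold cross-validation. As a point of comparison, a k-fold version can be sketched with `cross_val_score`, which is imported at the top of this notebook. The synthetic data and the `Ridge` estimator are assumptions chosen to keep the example self-contained; any of the three boosted regressors could be substituted.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (invented coefficients and noise level).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 5))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5, 0.0, 1.0]) + rng.normal(scale=0.5, size=300)

# Scaling lives inside the pipeline, so each fold fits its own scaler.
demo_model = make_pipeline(StandardScaler(), Ridge())
folds = KFold(n_splits=10, shuffle=True, random_state=0)

# One R^2 score per fold; every row is used for testing exactly once.
fold_scores = cross_val_score(demo_model, X_demo, y_demo, cv=folds, scoring='r2')
print(fold_scores.mean(), fold_scores.std(ddof=1))
```

Unlike repeated random splits, the ten test sets here do not overlap, so no observation is tested more often than another.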

## Conclusion
* XGB is the best model for our purpose: it is faster than the other two and more accurate than LGBM.
* The features I am going to focus on are the following: 
* * ['co2_emission', 'com_hill_holder', 'com_multi-function_steering_wheel', 'consumption_comb', 'displacement', 'hp', 'km', 'saf_xenon_headlights', 'warranty_mo', 'weight']

<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

<a id="3"></a>
## End of Modeling

## Next: [Parameter Selection](05_parameter_selection.ipynb)