<a id="toc"></a>

# <p style="background-color: #008080; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:5px 5px;">Auto Scout Car Prices Prediction Project: <br> Parameter Tuning</p>

## <p style="background-color: #008080; font-family:newtimeroman; color:#FFF9ED; font-size:150%; text-align:center; border-radius:10px 10px;">Content</p>

* [INTRODUCTION NOTEBOOK](00_introduction.ipynb)
* [IMPORTING LIBRARIES NEEDED IN THIS NOTEBOOK](#1)
* [FUNCTIONS](#fn)
* [PRELIMINARIES](#2A)
* [PARAMETER TUNING](#2B)
* [THE END OF PARAMETER TUNING](#3)

<a id="2A"></a>
### Parameter Selection for XGBoost
In the [previous](04_model_selection.ipynb) notebook I selected the XGBoost model and the ten (10) most important features. In this final notebook I finalize the model by choosing its hyper-parameters.

<a id="1"></a>

## Importing Libraries

In [1]:
import pandas as pd
import pickle
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler
from xgboost import XGBRegressor

<a id="2A"></a>
## Preliminaries

In [2]:
df = pd.read_json('data_post03.json', lines=True)

In [3]:
cols = ['price'] + ['age',
 'co2_emission',
 'consumption_comb',
 'displacement',
 'hp',
 'km',
 'warranty_mo',
 'weight']

In [4]:
df = df[cols]

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15884 entries, 0 to 15883
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   price             15884 non-null  int64  
 1   age               15884 non-null  int64  
 2   co2_emission      15884 non-null  int64  
 3   consumption_comb  15884 non-null  float64
 4   displacement      15884 non-null  int64  
 5   hp                15884 non-null  int64  
 6   km                15884 non-null  float64
 7   warranty_mo       15884 non-null  int64  
 8   weight            15884 non-null  float64
dtypes: float64(3), int64(6)
memory usage: 1.1 MB


In [6]:
X = df.drop('price', axis=1)

In [7]:
y = df['price']

## Features Transformer

In [8]:
num_cols = ['km', 'weight', 'hp', 'co2_emission', 'consumption_comb',
            'displacement', 'warranty_mo']

In [9]:
num_transformer = StandardScaler()

In [10]:
# Standardize the numeric columns; 'age' is not in num_cols, so it is passed through unscaled
preprocessor = ColumnTransformer(
    transformers=[
        ('num', num_transformer, num_cols)
    ], remainder='passthrough')
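To make the effect of this transformer concrete, here is a minimal sketch on a made-up three-column frame (the values are invented and are not taken from the Auto Scout data). It assumes the same setup as above: `StandardScaler` on the listed columns and `remainder='passthrough'` for everything else.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "age": [0, 1, 2, 3],
    "km": [10_000.0, 20_000.0, 30_000.0, 40_000.0],
    "hp": [66, 85, 100, 110],
})

toy_preprocessor = ColumnTransformer(
    transformers=[("num", StandardScaler(), ["km", "hp"])],
    remainder="passthrough",
)
transformed = toy_preprocessor.fit_transform(toy)

# Output column order: the scaled 'km' and 'hp' first, then the untouched 'age'.
print(transformed.shape)
print(transformed[:, :2].mean(axis=0))  # scaled columns are centered near 0
print(transformed[:, 2])                # 'age' values are unchanged
```

Note that the passthrough columns are appended after the transformed ones, so the column order seen by the model differs from the order in the input dataframe.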

## XGBoost Model and Parameters

The XGBoost model has many parameters, and I keep the default values for most of them. The most important of these is listed here:
* ``tree_method``: 'hist', which is also the default in recent XGBoost releases.

I am going to focus on choosing the optimum values of the following:
1. ``grow_policy``: how new nodes are added to the tree, either 'depthwise' (the default) or 'lossguide'.
2. ``eta`` (alias ``learning_rate``, default 0.3): step size shrinkage used in each update to prevent overfitting.
3. ``gamma`` (alias ``min_split_loss``, default 0): minimum loss reduction required to make a further partition on a leaf node of the tree.
4. ``lambda`` (alias ``reg_lambda``, default 1): L2 regularization term on weights.
5. ``alpha`` (alias ``reg_alpha``, default 0): L1 regularization term on weights.
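The cost of an exhaustive search over these five parameters grows multiplicatively, which is worth estimating before starting a run. The sketch below counts the fits for a hypothetical grid; the candidate values are assumptions chosen for illustration, since the grid actually searched is not shown in this notebook. The `xgbregressor__` prefix is the step name that `make_pipeline` gives to an `XGBRegressor` step.

```python
from itertools import product

# Hypothetical candidate values (assumption: two per parameter).
param_grid = {
    "xgbregressor__grow_policy": ["depthwise", "lossguide"],
    "xgbregressor__eta": [0.1, 0.3],
    "xgbregressor__gamma": [0, 1],
    "xgbregressor__lambda": [0, 1],
    "xgbregressor__alpha": [0, 1],
}

# Every combination a grid search would evaluate.
combinations = [
    dict(zip(param_grid, values)) for values in product(*param_grid.values())
]

n_folds = 5  # GridSearchCV uses 5-fold cross-validation when cv is not set
n_fits = len(combinations) * n_folds

print(len(combinations))  # 32 parameter combinations
print(n_fits)             # 160 cross-validation fits, plus one final refit
```

Adding a third candidate value to every parameter would raise the count from 32 to 243 combinations, so small grids are a sensible starting point.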

## Grid Search for Parameter Tuning

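### The grid search ran for about 13 minutes and returned the parameters below

Of these, ``alpha``, ``gamma`` and ``grow_policy`` equal the XGBoost defaults, while the documented defaults for ``eta`` and ``lambda`` are 0.3 and 1 rather than 0. The final pipeline keeps the default ``XGBRegressor()`` settings.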

{'xgbregressor__alpha': 0,
 'xgbregressor__eta': 0,
 'xgbregressor__gamma': 0,
 'xgbregressor__grow_policy': 'depthwise',
 'xgbregressor__lambda': 0}


## Preserve the Pipeline

In [11]:
pipe = Pipeline(steps=[('preprocessor', preprocessor),
                       ('model', XGBRegressor())])

In [12]:
pipe.fit(X, y)



Pipeline(steps=[('preprocessor',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('num', StandardScaler(),
                                                  ['km', 'weight', 'hp',
                                                   'co2_emission',
                                                   'consumption_comb',
                                                   'displacement',
                                                   'warranty_mo'])])),
                ('model', XGBRegressor())])

In [13]:
with open('model.pkl', 'wb') as f:
    pickle.dump(pipe, f)

## Test Unpickle

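* Unpickling does not require importing scikit-learn or XGBoost explicitly in the calling code: ``pickle`` imports the modules it needs. Those packages must still be installed in the environment.
* The pipeline is robust to the dataframe having extra columns, as the two predictions below show.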
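The first point can be illustrated with the standard library alone. A pickle stores the module and class names of each object rather than the class code, and `pickle.load` imports those modules on demand. The sketch below uses `fractions.Fraction` as a stand-in for the pipeline object; the same mechanism is why scikit-learn and XGBoost must be installed wherever `model.pkl` is loaded.

```python
import pickle
from fractions import Fraction

payload = pickle.dumps(Fraction(1, 3))

# The byte stream references the defining module by name.
print(b"fractions" in payload)  # True

# pickle resolves and imports 'fractions' itself while loading.
restored = pickle.loads(payload)
print(restored == Fraction(1, 3))  # True
```

Because only names are stored, a pickle loaded with different library versions than those used for saving may fail or behave differently, so recording the package versions alongside `model.pkl` is a reasonable precaution.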

In [14]:
with open('model.pkl', 'rb') as f:
    loaded_model_pickle = pickle.load(f)



In [15]:
df = pd.read_json('data_post03.json', lines=True)
X = df.drop('price', axis=1)
y = df['price']

In [16]:
loaded_model_pickle.predict(X[:5])

array([14509.009, 16318.796, 15882.247, 13855.735, 14894.147],
      dtype=float32)

In [17]:
cols = ['age',
 'co2_emission',
 'consumption_comb',
 'displacement',
 'gearing_type_manual',
 'hp',
 'km',
 'prev_owner_1',
 'warranty_mo',
 'weight']

In [18]:
X2 = X[cols]

In [19]:
loaded_model_pickle.predict(X2[:5])

array([14509.009, 16318.796, 15882.247, 13855.735, 14894.147],
      dtype=float32)
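The two calls above return identical predictions, so the extra columns are ignored here. A more defensive option is to select the training columns explicitly before predicting, so that the result does not depend on how the transformer treats unexpected columns. The helper `select_features` below is hypothetical (it is not part of the original notebook); the column list is copied from the `cols` definition used for fitting, without `price`.

```python
import pandas as pd

# Feature columns used when fitting the pipeline in this notebook.
FEATURE_COLS = [
    "age", "co2_emission", "consumption_comb", "displacement",
    "hp", "km", "warranty_mo", "weight",
]


def select_features(frame: pd.DataFrame, feature_cols=FEATURE_COLS) -> pd.DataFrame:
    """Hypothetical helper: keep only the training columns, in training order."""
    missing = [col for col in feature_cols if col not in frame.columns]
    if missing:
        raise KeyError(f"missing feature columns: {missing}")
    return frame[list(feature_cols)]


# Toy frame with one extra column that the model never saw.
toy = pd.DataFrame({col: [1.0, 2.0] for col in FEATURE_COLS})
toy["make_model"] = ["A", "B"]

selected = select_features(toy)
print(list(selected.columns) == FEATURE_COLS)  # True
```

With the loaded pipeline this would be used as `loaded_model_pickle.predict(select_features(X))`, which also fails early with a clear message if a required column is absent.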

## Summary

<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

<a id="3"></a>
## End of Parameter Tuning