<a id="toc"></a>

# <p style="background-color: #008080; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:5px 5px;">Auto Scout Car Prices Prediction Project: <br> Parameter Tuning</p>

## <p style="background-color: #008080; font-family:newtimeroman; color:#FFF9ED; font-size:150%; text-align:center; border-radius:10px 10px;">Content</p>

* [INTRODUCTION NOTEBOOK](00_introduction.ipynb)
* [IMPORTING LIBRARIES NEEDED IN THIS NOTEBOOK](#1)
* [FUNCTIONS](#fn)
* [PRELIMINARIES](#2A)
* [PARAMETER TUNING](#2B)
* [THE END OF PARAMETER TUNING](#3)

<a id="2A"></a>
### Parameter Selection for XGBoost
In the [previous](04_model_selection.ipynb) notebook, I selected the XGBoost model and the ten (10) most important features. In this final notebook, I finalize the model by choosing the hyperparameters for XGBoost.

<a id="1"></a>

## Importing Libraries

In [1]:
import pandas as pd
import pickle
import sklearn
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
import xgboost
from xgboost import XGBRegressor

In [2]:
import sys
print(sys.version)


In [3]:
print(xgboost.__version__)

2.1.3


In [4]:
print(sklearn.__version__)

1.5.2


In [5]:
print(pd.__version__)

2.2.3


<a id="2A"></a>
## Preliminaries

In [6]:
df = pd.read_json('data_post03.json', lines=True)

In [7]:
cols = ['price'] + ['age',
 'co2_emission',
 'consumption_comb',
 'displacement',
 'hp',
 'km',
 'warranty_mo',
 'weight',
 'gearing_type_manual',
 'prev_owner',
 'make_model'
 ]

In [8]:
df = df[cols]

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15884 entries, 0 to 15883
Data columns (total 12 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   price                15884 non-null  int64  
 1   age                  15884 non-null  int64  
 2   co2_emission         15884 non-null  int64  
 3   consumption_comb     15884 non-null  float64
 4   displacement         15884 non-null  int64  
 5   hp                   15884 non-null  int64  
 6   km                   15884 non-null  float64
 7   warranty_mo          15884 non-null  int64  
 8   weight               15884 non-null  float64
 9   gearing_type_manual  15884 non-null  bool   
 10  prev_owner           15884 non-null  int64  
 11  make_model           15884 non-null  object 
dtypes: bool(1), float64(3), int64(7), object(1)
memory usage: 1.3+ MB


In [10]:
mm = df['make_model'].value_counts().index.tolist()

In [11]:
df['prev_owner'].value_counts()

prev_owner
1    8293
0    6794
2     778
3      17
4       2
Name: count, dtype: int64

In [36]:
mm

['audi_a3',
 'audi_a1',
 'opel_insignia',
 'opel_astra',
 'opel_corsa',
 'renault_clio',
 'renault_espace']

In [10]:
df['prev_owner'].value_counts(dropna=False)

prev_owner
1    8293
0    6794
2     778
3      17
4       2
Name: count, dtype: int64

In [11]:
X = df.drop('price', axis=1)

In [12]:
y = df['price']

## Features Transformer

In [13]:
num_cols = ['km', 'weight', 'hp', 'co2_emission', 'consumption_comb',
            'displacement', 'warranty_mo']

In [14]:
num_transformer = StandardScaler()

In [15]:
cat_cols = ['make_model']

In [16]:
cat_transformer = OneHotEncoder()

In [17]:
preprocessor = ColumnTransformer(
    transformers=[
        ('num', num_transformer, num_cols),
        ('cat', cat_transformer, cat_cols)
    ], remainder='passthrough')

## XGBoost Model and Parameters

XGBoost has many parameters, and I keep the default values for most of them. The most important of these is:
* ``tree_method``: 'hist', which is also the default.

I focus on choosing the best values for the following:
1. ``grow_policy``: how new nodes are added to the tree, either 'depthwise' or 'lossguide'.
2. ``eta`` (alias ``learning_rate``): step size shrinkage used in each update to prevent overfitting.
3. ``gamma`` (alias ``min_split_loss``): minimum loss reduction required to make a further partition on a leaf node of the tree.
4. ``lambda`` (alias ``reg_lambda``): L2 regularization term on weights.
5. ``alpha`` (alias ``reg_alpha``): L1 regularization term on weights.
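
One way to lay out the search space for these five parameters is a dictionary keyed by the pipeline step name, which is the form `GridSearchCV` expects. The candidate values below are assumptions chosen for illustration, since the grid actually searched is not recorded in this notebook. The `xgbregressor__` prefix assumes the pipeline was built with `make_pipeline`. The sketch only counts how many model fits such a grid implies.

```python
from itertools import product

# Hypothetical candidate values; the notebook does not record the real grid.
param_grid = {
    "xgbregressor__grow_policy": ["depthwise", "lossguide"],
    "xgbregressor__eta": [0.1, 0.3, 0.5],
    "xgbregressor__gamma": [0, 1, 10],
    "xgbregressor__lambda": [0, 1, 10],
    "xgbregressor__alpha": [0, 1, 10],
}

# Every combination of candidate values is one configuration to evaluate.
combinations = list(product(*param_grid.values()))
n_folds = 5  # assumed; this is GridSearchCV's default number of CV folds
n_fits = len(combinations) * n_folds

print(len(combinations))  # 162 configurations
print(n_fits)             # 810 model fits
```

Because the count is the product of the list lengths, adding a single extra value to one parameter multiplies the total, which is why even a modest grid can take many minutes.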

## Grid Search for Parameter Tuning

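### The search ran for 13 minutes, only to reveal that the default parameters are the best!

{'xgbregressor__alpha': 0,
 'xgbregressor__eta': 0,
 'xgbregressor__gamma': 0,
 'xgbregressor__grow_policy': 'depthwise',
 'xgbregressor__lambda': 0}

As recorded, ``alpha``, ``gamma``, and ``grow_policy`` match XGBoost's documented defaults. The documented defaults for ``eta`` and ``lambda`` are 0.3 and 1, so the zeros shown for those two are not the library defaults.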


## Preserve the Pipeline

In [18]:
pipe = Pipeline(steps=[('preprocessor', preprocessor),
                       ('model', XGBRegressor())])

In [19]:
pipe.fit(X, y)

The format of the columns of the 'remainder' transformer in ColumnTransformer.transformers_ will change in version 1.7 to match the format of the other transformers.
At the moment the remainder columns are stored as indices (of type int). With the same ColumnTransformer configuration, in the future they will be stored as column names (of type str).



In [20]:
with open('model.pkl', 'wb') as f:
    pickle.dump(pipe, f)

## Test Unpickle

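* Unpickling does not require explicit import statements for scikit-learn or XGBoost in the loading code, because pickle imports the classes it needs. Both packages must still be installed, ideally in the same versions that were used to fit the pipeline.
* The loaded pipeline is robust to the dataframe having extra columns, because the preprocessor selects the columns it was fitted on by name.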

In [21]:
with open('model.pkl', 'rb') as f:
    loaded_model_pickle = pickle.load(f)

In [22]:
loaded_model_pickle

In [23]:
df = pd.read_json('data_post03.json', lines=True)
X = df.drop('price', axis=1)
y = df['price']

In [24]:
loaded_model_pickle.predict(X[:5])

array([16187.608, 16071.156, 14935.071, 15973.295, 16244.474],
      dtype=float32)

In [30]:
xs = ['age',
 'co2_emission',
 'consumption_comb',
 'displacement',
 'hp',
 'km',
 'warranty_mo',
 'weight',
 'gearing_type_manual',
 'prev_owner',
 'make_model'
 ]

In [31]:
X2 = X[xs]

In [32]:
loaded_model_pickle.predict(X2[:5])

array([16187.608, 16071.156, 14935.071, 15973.295, 16244.474],
      dtype=float32)

In [33]:
y[:5]

0    15770
1    14500
2    14640
3    14500
4    16790
Name: price, dtype: int64

## Summary

<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

<a id="3"></a>
## End of Parameter Tuning