<a id="toc"></a>

# <p style="background-color: #008080; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:5px 5px;">Auto Scout Car Prices Prediction Project: <br> Parameter Tuning</p>

## <p style="background-color: #008080; font-family:newtimeroman; color:#FFF9ED; font-size:150%; text-align:center; border-radius:10px 10px;">Content</p>

* [INTRODUCTION NOTEBOOK](00_introduction.ipynb)
* [IMPORTING LIBRARIES NEEDED IN THIS NOTEBOOK](#1)
* [FUNCTIONS](#fn)
* [PRELIMINARIES](#2A)
* [PARAMETER TUNING](#2B)
* [THE END OF PARAMETER TUNING](#3)

### Parameter Selection for XGBoost
In the [previous notebook](04_model_selection.ipynb) I selected the XGBoost model and the ten most important features. In this final notebook I finalize the model by tuning its hyperparameters.

<a id="1"></a>

## Importing Libraries

In [1]:
import pandas as pd
import pickle
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler
from xgboost import XGBRegressor

<a id="2A"></a>
## Preliminaries

In [2]:
df = pd.read_json('data_post03.json', lines=True)

In [3]:
cols = ['price'] + ['co2_emission',
 'consumption_comb',
 'displacement',
 'first_registration_2018',
 'first_registration_2019',
 'gearing_type_manual',
 'hp',
 'km',
 'warranty_mo',
 'weight']

In [4]:
df = df[cols]

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15884 entries, 0 to 15883
Data columns (total 11 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   price                    15884 non-null  int64  
 1   co2_emission             15884 non-null  int64  
 2   consumption_comb         15884 non-null  float64
 3   displacement             15884 non-null  int64  
 4   first_registration_2018  15884 non-null  int64  
 5   first_registration_2019  15884 non-null  int64  
 6   gearing_type_manual      15884 non-null  int64  
 7   hp                       15884 non-null  int64  
 8   km                       15884 non-null  float64
 9   warranty_mo              15884 non-null  int64  
 10  weight                   15884 non-null  float64
dtypes: float64(3), int64(8)
memory usage: 1.3 MB


In [6]:
X = df.drop('price', axis=1)

In [7]:
y = df['price']

## Features Transformer

In [8]:
num_cols = ['km', 'weight', 'hp', 'co2_emission', 'consumption_comb',
            'displacement', 'warranty_mo']

In [9]:
num_transformer = StandardScaler()

In [10]:
preprocessor = ColumnTransformer(
    transformers=[
        ('num', num_transformer, num_cols)
    ], remainder='passthrough')

## XGBoost Model and Parameters

The XGBoost model has many parameters, and I keep the default values for most of them. The most important one left unchanged is:
* ``tree_method``: 'hist', which is the default in XGBoost 2.0 and later.

I am going to focus on choosing the optimum values of the following:
1. ``grow_policy``: how new nodes are added to the tree, either 'depthwise' or 'lossguide'.
2. ``eta`` (alias ``learning_rate``): step-size shrinkage applied to each update to prevent overfitting.
3. ``gamma`` (alias ``min_split_loss``): minimum loss reduction required to make a further partition on a leaf node of the tree.
4. ``lambda`` (alias ``reg_lambda``): L2 regularization term on weights.
5. ``alpha`` (alias ``reg_alpha``): L1 regularization term on weights.
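
To make the roles of these parameters concrete, the regularized objective described in the XGBoost documentation can be sketched as follows. The notation is an assumption introduced here and is not used elsewhere in this notebook: $f_t$ is the tree added at round $t$, $T$ its number of leaves, $w_j$ its leaf weights, and $G$, $H$ the sums of first and second derivatives of the loss over the instances in a node (subscripts $L$ and $R$ for the left and right child). The gain formula is shown for $\alpha = 0$ to keep it short.

```latex
\Omega(f_t) = \gamma T + \tfrac{1}{2}\lambda \sum_{j=1}^{T} w_j^2 + \alpha \sum_{j=1}^{T} |w_j|

\text{Gain} = \tfrac{1}{2}\left[
    \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda}
    - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}
\right] - \gamma

\hat{y}^{(t)} = \hat{y}^{(t-1)} + \eta \, f_t(x)
```

Read this way, ``gamma`` is the minimum gain a split must achieve to be kept, ``lambda`` and ``alpha`` shrink the leaf weights, and ``eta`` scales how much each new tree contributes. ``grow_policy`` does not appear in the objective; it only controls the order in which candidate nodes are split ('depthwise' splits the nodes closest to the root first, 'lossguide' splits the node with the highest loss change first).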

<a id="2B"></a>
## Grid Search for Parameter Tuning

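### Result of an earlier run (about 13 minutes)

An earlier run of this grid search took about 13 minutes and returned the best parameters below. The ``xgbregressor__`` prefix is the step name that ``make_pipeline`` generates automatically, so that run presumably used an unnamed pipeline. The values for ``alpha``, ``gamma`` and ``grow_policy`` match the XGBoost defaults, but ``lambda`` (default 1) and ``eta`` (default 0.3) do not, so the winner is close to the default configuration rather than identical to it. An ``eta`` of 0 would scale every tree's contribution to zero, so that value deserves a second look before it is trusted.

{'xgbregressor__alpha': 0,
 'xgbregressor__eta': 0,
 'xgbregressor__gamma': 0,
 'xgbregressor__grow_policy': 'depthwise',
 'xgbregressor__lambda': 0}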


In [12]:
pipe = Pipeline(steps=[('preprocessor', preprocessor),
                       ('model', XGBRegressor())])

In [18]:
params = {
    'model__grow_policy': ['depthwise', 'lossguide'],
    # learning rates 0.1, 0.3, 0.5, 0.7, 0.9 (eta must lie in [0, 1])
    'model__eta': [i / 100 for i in range(10, 110, 20)],
    # 0, 1, 2, 4, 8 for each of the regularization terms
    'model__gamma': [0] + [2**i for i in range(4)],
    'model__lambda': [0] + [2**i for i in range(4)],
    'model__alpha': [0] + [2**i for i in range(4)]
}

In [19]:
grid = GridSearchCV(pipe, param_grid=params, scoring='r2', cv=2)
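
Before starting a search of this size it helps to count how many models will be trained. The sketch below does the arithmetic for a grid with the same shape as the one above; the `illustrative_grid` name and the ``eta`` values are assumptions, and only the number of candidates per parameter matters here.

```python
from math import prod

# Same shape as the search grid: 2 x 5 x 5 x 5 x 5 candidates.
illustrative_grid = {
    "model__grow_policy": ["depthwise", "lossguide"],
    "model__eta": [0.1, 0.3, 0.5, 0.7, 0.9],  # illustrative values
    "model__gamma": [0] + [2**i for i in range(4)],
    "model__lambda": [0] + [2**i for i in range(4)],
    "model__alpha": [0] + [2**i for i in range(4)],
}
cv_folds = 2

# Every combination is fitted once per fold.
n_candidates = prod(len(values) for values in illustrative_grid.values())
n_cv_fits = n_candidates * cv_folds

# GridSearchCV refits the best candidate on all data by default (refit=True).
n_total_fits = n_cv_fits + 1

print(n_candidates, n_cv_fits, n_total_fits)
```

With 1,250 candidates and 2 folds this comes to 2,500 cross-validation fits plus one final refit, which explains a runtime in the order of minutes. Passing `n_jobs=-1` to `GridSearchCV` is one way to spread the fits across CPU cores.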

In [20]:
grid.estimator.get_params().keys()

dict_keys(['memory', 'steps', 'verbose', 'preprocessor', 'model', 'preprocessor__n_jobs', 'preprocessor__remainder', 'preprocessor__sparse_threshold', 'preprocessor__transformer_weights', 'preprocessor__transformers', 'preprocessor__verbose', 'preprocessor__verbose_feature_names_out', 'preprocessor__num', 'preprocessor__num__copy', 'preprocessor__num__with_mean', 'preprocessor__num__with_std', 'model__base_score', 'model__booster', 'model__colsample_bylevel', 'model__colsample_bynode', 'model__colsample_bytree', 'model__gamma', 'model__importance_type', 'model__learning_rate', 'model__max_delta_step', 'model__max_depth', 'model__min_child_weight', 'model__missing', 'model__n_estimators', 'model__n_jobs', 'model__nthread', 'model__objective', 'model__random_state', 'model__reg_alpha', 'model__reg_lambda', 'model__scale_pos_weight', 'model__seed', 'model__silent', 'model__subsample', 'model__verbosity'])
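
The keys above follow scikit-learn's ``<step>__<parameter>`` naming convention, which is how the grid reaches into the ``model`` step of the pipeline. A minimal sketch of that routing, using `Ridge` as a stand-in estimator so the example needs only scikit-learn (the `toy_pipe` name is hypothetical):

```python
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

toy_pipe = Pipeline(steps=[("scaler", StandardScaler()), ("model", Ridge())])

# All tunable names, including the nested "<step>__<parameter>" ones.
valid_keys = set(toy_pipe.get_params().keys())

# "model__alpha" is routed to the alpha parameter of the "model" step.
toy_pipe.set_params(model__alpha=10.0)
tuned_alpha = toy_pipe.named_steps["model"].alpha

# A plain scikit-learn estimator rejects names it does not know.
try:
    toy_pipe.set_params(model__not_a_param=1)
    rejected = False
except ValueError:
    rejected = True

print(tuned_alpha, rejected)
```

Names such as ``model__eta`` do not appear in the key list above. As far as I know the XGBoost wrapper forwards unrecognized names to the booster as extra keyword parameters instead of raising an error, but that is an assumption worth checking against the installed version.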

In [None]:
grid.fit(X, y)

In [None]:
grid.best_params_
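
Besides ``best_params_``, a fitted search exposes the winning score, the refitted model, and the full ranking of candidates. The sketch below shows these attributes on synthetic data with `Ridge` as a stand-in for XGBoost; the data, the `toy_` names, and the small grid are assumptions made to keep the example fast.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data, for illustration only.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 4))
y_toy = X_toy @ np.array([3.0, -1.0, 0.0, 2.0]) + rng.normal(scale=0.1, size=200)

toy_pipe = Pipeline(steps=[("scaler", StandardScaler()), ("model", Ridge())])
toy_grid = GridSearchCV(
    toy_pipe,
    param_grid={"model__alpha": [0.1, 1.0, 10.0]},
    scoring="r2",
    cv=2,
)
toy_grid.fit(X_toy, y_toy)

best_params = toy_grid.best_params_    # winning parameter combination
best_score = toy_grid.best_score_      # its mean cross-validated R^2
best_model = toy_grid.best_estimator_  # pipeline refit on all data

# Rank every candidate to see how close the runners-up are.
results = toy_grid.cv_results_
ranking = sorted(
    zip(results["rank_test_score"], results["mean_test_score"], results["params"]),
    key=lambda row: row[0],
)
for rank, score, candidate in ranking:
    print(rank, round(float(score), 4), candidate)
```

Looking at the ranking rather than only the winner shows whether the default configuration is within noise of the best candidate, which is the comparison that matters when deciding to keep the defaults.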

## Preserve the Pipeline

In [None]:
pipe = Pipeline(steps=[('preprocessor', preprocessor),
                       ('model', XGBRegressor())])

In [None]:
pipe.fit(X, y)

In [None]:
with open('model.pkl', 'wb') as f:
    pickle.dump(pipe, f)
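
A pickled pipeline is only useful if it can be loaded again and returns the same predictions. Here is a minimal round-trip sketch, assuming a stand-in `LinearRegression` pipeline and a temporary file instead of the real `model.pkl` (all `toy_` names and the synthetic data are hypothetical):

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data and a small stand-in pipeline.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 3))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5]) + 3.0

toy_pipe = Pipeline(steps=[("scaler", StandardScaler()),
                           ("model", LinearRegression())])
toy_pipe.fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "toy_model.pkl"

    # Save the fitted pipeline, preprocessing included.
    with open(path, "wb") as f:
        pickle.dump(toy_pipe, f)

    # Load it back, as a deployment script would.
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored pipeline should reproduce the original predictions.
same_predictions = np.allclose(toy_pipe.predict(X_toy), restored.predict(X_toy))
print(same_predictions)
```

Two practical caveats apply to the real `model.pkl` as well: load pickles only from trusted sources, and load them in an environment with the same library versions that were used to create them.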

## Summary

<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

<a id="3"></a>
## End of Parameter Tuning