<a id="toc"></a>

# <p style="background-color: #008080; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:5px 5px;">Auto Scout Car Prices Prediction Project: <br> Parameters Selection </p>

## <p style="background-color: #008080; font-family:newtimeroman; color:#FFF9ED; font-size:150%; text-align:center; border-radius:10px 10px;">Content</p>

* [INTRODUCTION](#0)
* [IMPORTING LIBRARIES NEEDED IN THIS NOTEBOOK](#1)
* [FUNCTIONS](#fn)
* [PRELIMINARIES](#2A)
* [PARAMETERS-SELECTION](#2B)
* [THE END OF PARAMETER SELECTION](#3)

<a id="0"></a>

## Introduction


Welcome to "***Auto Scout Car Price Prediction Project***". 

The **Auto Scout** data used for this project were scraped from the online car trading company Auto Scout in 2019 and contain many features of 9 different car models. In this project, I will go through all the steps of a data project: data cleaning, modeling, feature selection, and model selection.

In the first part of this project I apply commonly used techniques for data cleaning and exploratory data analysis, using Python libraries such as NumPy, Pandas, Matplotlib, Seaborn, and SciPy.

These are the steps for the first part. 
* **[data cleaning](00_data_cleaning.ipynb)** - dealing with incorrect headers (column names), incorrect formats, and anomalies, and dropping obviously useless columns.
* **[data imputation](01_data_imputation.ipynb)** - handling missing values and reducing the number of classes in features that will be encoded.
* **[handling outliers](02_data_viz_&_outliers.ipynb)** - detecting and handling outliers with visualisation libraries, and extracting some insights along the way.

In the second part of the project I explore several types of models for predicting prices: OLS, Ridge, Lasso, SGD, Random Forest, XGBoost, LightGBM, and CatBoost.

* **[data encoding](03_data_encoding.ipynb)** - preparing the data for modeling: converting multiclass features into dummy columns and making dummy columns from nested features.
* **[modeling](04_modeling.ipynb)** - trying out different models, model selection, feature selection, and cross-validation.
* **[parameter selection](05_parameter_selection.ipynb)** - selecting the best parameters for the chosen model.

<a id="2A"></a>
### Parameter Selection for XGBoost
In the [previous](04_model_selection.ipynb) notebook I selected the XGBoost model and the ten (10) most important features. In this final notebook I finalize the model by choosing its hyper-parameters.

<a id="1"></a>

## Importing Libraries

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import regex as re
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge, Lasso, SGDRegressor
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler, PolynomialFeatures
from time import perf_counter
from xgboost import XGBRegressor

<a id='fn'></a>
## Functions

In [2]:
# 

In [3]:
#

<a id="2A"></a>
## Preliminaries

In [4]:
df = pd.read_json('data_post03.json', lines=True)

In [10]:
cols = ['co2_emission', 'com_hill_holder', 'com_multi-function_steering_wheel', 'consumption_comb', 'displacement', 'hp',
        'km', 'saf_xenon_headlights', 'warranty_mo', 'weight']

In [7]:
df = df[cols]

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31768 entries, 0 to 31767
Data columns (total 10 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   co2_emission                       31768 non-null  int64  
 1   com_hill_holder                    31768 non-null  int64  
 2   com_multi-function_steering_wheel  31768 non-null  int64  
 3   consumption_comb                   31768 non-null  float64
 4   displacement                       31768 non-null  int64  
 5   hp                                 31768 non-null  int64  
 6   km                                 31768 non-null  float64
 7   saf_xenon_headlights               31768 non-null  int64  
 8   warranty_mo                        31768 non-null  int64  
 9   weight                             31768 non-null  float64
dtypes: float64(3), int64(7)
memory usage: 2.4 MB


### Train and test split

In [None]:
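# df was restricted to the ten feature columns above, so 'price' is no longer in it;
# reload the target from the same file (the row order is unchanged).
X = df[cols]
y = pd.read_json('data_post03.json', lines=True)['price']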

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

## Features Transformer

In [None]:
num_cols = ['km', 'weight', 'hp', 'co2_emission', 'consumption_comb',
            'displacement', 'warranty_mo']

In [None]:
num_transformer = StandardScaler()

In [None]:
preprocessor = ColumnTransformer(
    transformers=[
        ('num', num_transformer, num_cols)
    ], remainder='passthrough')

In [None]:
X_train_tr = preprocessor.fit_transform(X_train)

In [None]:
X_test_tr = preprocessor.transform(X_test)

<a id="2B"></a>
## XGB Model and Parameters

The XGBoost model has many parameters; I am going to focus on choosing the optimum values of three of them.


In [None]:
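# The `models` dict from the previous notebook is not defined here,
# so instantiate the regressor directly.
xgb = XGBRegressor()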

## Preserve the Model and Transformers

## Summary

<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

<a id="3"></a>
## End of Parameter Selection