# Set Metric-Establish Baseline-Model Selection-Hyperparameter Tuning

This notebook will perform steps 4 and 5:

4. Set Evaluation Metric & Establish Baseline
5. Model Selection & Tune Hyperparameters of the Model

## Imports

In [1]:
# manipulation libraries
import pandas as pd
import numpy as np

# visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
# to display visuals in the notebook

%config InlineBackend.figure_format='retina'
# to enable high-resolution plots

# randomized search, feature scaling, and error metric
from sklearn.model_selection import RandomizedSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error

# candidate machine learning models
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import ExtraTreesRegressor
import lightgbm as lgb

# to save machine learning models
import pickle

## Auxiliary Functions

In [2]:
# functions to use in the notebook
def fit_evaluate_model(model, X_train, y_train,
                       X_valid, y_valid):
    """Fit `model` on the training set and return the mean squared
    error of its predictions on the validation set."""
    model.fit(X_train, y_train)
    y_predicted = model.predict(X_valid)
    return mean_squared_error(y_valid, y_predicted)

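def convert_features_to_array(features):
    """Convert the feature DataFrame to a 2-D NumPy array
    of shape (num_rows, num_cols)."""
    num_rows = len(features)
    num_cols = len(features.columns)

    # np.array(features) is already 2-D; the reshape states the
    # expected shape explicitly
    features_array = np.array(features).reshape((num_rows, num_cols))

    return features_array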

def convert_target_to_array(target):
    """Flatten the target DataFrame to a 1-D NumPy array,
    the shape scikit-learn regressors expect for a single target."""
    target_array = np.array(target).reshape((-1,))
    return target_array
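
To illustrate the shapes the two conversion helpers produce, here is a minimal sketch on a made-up DataFrame. The column names and values are hypothetical and are not taken from the notebook's data; the two reshape calls mirror the operations inside the helpers.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; names and values are for illustration only
toy_features = pd.DataFrame({"feat_a": [1.0, 2.0, 3.0],
                             "feat_b": [4.0, 5.0, 6.0]})
toy_target = pd.DataFrame({"target": [10.0, 20.0, 30.0]})

# Same operations the helper functions perform
features_array = np.array(toy_features).reshape(
    (len(toy_features), len(toy_features.columns)))
target_array = np.array(toy_target).reshape((-1,))

print(features_array.shape)  # (3, 2)
print(target_array.shape)    # (3,)
```

The features keep their two-dimensional layout, while the one-column target is flattened from shape `(3, 1)` to `(3,)`.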

## Load Data and Convert to Arrays

In [3]:
X_train = pd.read_csv("../transformed/X_train.csv")
y_train = pd.read_csv("../transformed/y_train.csv")

X_valid = pd.read_csv("../transformed/X_valid.csv")
y_valid = pd.read_csv("../transformed/y_valid.csv")

In [4]:
y_train_array = convert_target_to_array(y_train)
y_valid_array = convert_target_to_array(y_valid)

X_train_array = convert_features_to_array(X_train)
X_valid_array = convert_features_to_array(X_valid)

# Set Evaluation Metric & Establish Baseline

In real-world data science and machine learning projects, data scientists choose the evaluation metric in line with what stakeholders expect from the ML model. Because every candidate model is judged by this metric, setting it is an important step.

## Set Metric

***Mean squared error (MSE)*** is the average of the squared residuals, where a ***residual*** is the difference between the actual and predicted value of the target variable. In other words, we evaluate our model by measuring how large its squared errors (residuals) are on average.
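
As a small worked example of this definition, the sketch below computes the MSE by hand and compares it with `mean_squared_error` from scikit-learn. In the notation assumed here, $\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$, where $y_i$ is the actual value and $\hat{y}_i$ the prediction. The numbers are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical actual and predicted values, for illustration only
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 6.0])

residuals = y_true - y_pred            # [1, 0, -2, 1]
mse_manual = np.mean(residuals ** 2)   # (1 + 0 + 4 + 1) / 4
mse_sklearn = mean_squared_error(y_true, y_pred)

print(mse_manual, mse_sklearn)  # 1.5 1.5
```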

Mean squared error is selected as the error metric because it is interpretable, it is analogous to variance, and it aligns with the error-minimization criterion of our selected algorithm.
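
One way to see the analogy with variance: a model that always predicts the mean of the target has an MSE equal to the population variance of the target. The following is a minimal sketch on hypothetical values, not on the notebook's data.

```python
import numpy as np

# Hypothetical target values, for illustration only
y = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# A constant "model" that predicts the mean for every observation
mean_prediction = np.full_like(y, y.mean())

mse_of_mean = np.mean((y - mean_prediction) ** 2)
variance = np.var(y)  # population variance (ddof=0)

print(mse_of_mean, variance)  # 4.0 4.0
```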

On the other hand, this error metric is sensitive to extreme values and outliers. Because it squares the difference between each actual and predicted value, the contribution of a single error grows quadratically with its size, so a few large errors can dominate the score.
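
The sketch below illustrates this sensitivity with hypothetical numbers: making a single prediction ten times further off raises that observation's squared error a hundredfold. The `mse` helper is defined here only for the illustration.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean of the squared differences between actuals and predictions."""
    return np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)

# Hypothetical values, for illustration only
y_true = np.array([10.0, 12.0, 11.0, 13.0, 12.0])

# Every prediction is off by exactly 1
y_pred_clean = y_true + 1.0

# Same predictions, except that the first one is off by 10
y_pred_outlier = y_pred_clean.copy()
y_pred_outlier[0] = y_true[0] + 10.0

mse_clean = mse(y_true, y_pred_clean)      # (1 + 1 + 1 + 1 + 1) / 5
mse_outlier = mse(y_true, y_pred_outlier)  # (100 + 1 + 1 + 1 + 1) / 5

print(mse_clean)    # 1.0
print(mse_outlier)  # 20.8
```

A single extreme error moves the metric from 1.0 to 20.8, even though four of the five predictions are unchanged.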
