#### Model training with an algorithm that requires feature scaling

Let's take a look at the basic process of fitting a model in machine learning. We are going to assume that the data is already clean, 
since the purpose here is not to perform exploratory data analysis but to create a model using the scikit-learn framework. In this example, we will look at a model that requires feature scaling. In addition, we will still use the 3-sets idea: split the data into train, evaluation and test sets.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## MODELS ##
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split, cross_val_score, cross_validate, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.svm import SVC

## METRICS ##
# classification models
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import classification_report, confusion_matrix
# regression models
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# in case you want to silence warnings
#import warnings
#warnings.filterwarnings("ignore") #"default"



### Function Definitions

In [2]:
def print_performance_metrics_aprf(y_true, y_pred, average="binary"):
    """
    Combine the main classification metrics accuracy, precision, recall and f1 score.
    Parameters:
        y_true: the true values from the dataset
        y_pred: the predictions from the model
        average: averaging strategy, used for multi-class classification
    """
    print(f"Accuracy: {accuracy_score(y_true, y_pred)*100:.2f}%")
    print(f"Precision: {precision_score(y_true, y_pred, average=average)*100:.2f}%")
    print(f"Recall: {recall_score(y_true, y_pred, average=average)*100:.2f}%")
    print(f"F1 Score: {f1_score(y_true, y_pred, average=average)*100:.2f}%")

In [3]:
def plot_classification_metrics(y_true, y_pred, average="binary"):
    """
    Create a simple plot with all four classification metrics.
    Parameters:
        y_true: the true values from the dataset
        y_pred: the predictions from the model
        average: averaging strategy, used for multi-class classification
    """
    metrics = [0,0,0,0]
    metrics[0] = accuracy_score(y_true, y_pred)
    metrics[1] = precision_score(y_true, y_pred, average=average)
    metrics[2] = recall_score(y_true, y_pred, average=average)
    metrics[3] = f1_score(y_true, y_pred, average=average)
    print(metrics)
    plt.bar(x=[0,1,2,3], height=metrics)
    plt.xticks([0,1,2,3],['Accuracy','Precision','Recall','F1-score'])
    plt.ylim([.6,1])
    plt.legend(['Metrics'])
    plt.title('Performance metrics')
    plt.show()

In [4]:
def get_metric_names(metric_list: list):
    """
    Receive a list of substrings and return all scorer names that contain at least one of them.
    Parameters:
        metric_list (list): list with substrings of potential metric names.
    Return:
        a set with all scorer names that match the substrings provided
    """
    # SCORERS was removed in scikit-learn 1.3; get_scorer_names() returns the same names
    from sklearn.metrics import get_scorer_names
    
    result = set()
    for metric_substring in metric_list:
        for name in get_scorer_names():
            if metric_substring in name:
                result.add(name)
    
    return result

In [5]:
def split_scaling_features(X, y, scaler, test_size=0.2, random_state=None):
    """
    Split the features into train and test sets and, in addition, perform feature scaling.
    Parameters:
        X: features from the dataset
        y: label from the dataset
        scaler: the feature scaling method to apply. It can be 'min_max' for MinMaxScaler(), 'max_abs' for MaxAbsScaler(),
                or 'std' for StandardScaler(). Any other value falls back to StandardScaler().
        test_size: fraction of the data held out for the test set
        random_state: seed that makes the split reproducible
    Return:
        X_train_scaled, X_test_scaled, y_train, y_test
    """
    from sklearn.preprocessing import MinMaxScaler, MaxAbsScaler, StandardScaler
    from sklearn.model_selection import train_test_split
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)
    
    if (scaler == "min_max"):
        scaler = MinMaxScaler()
    elif (scaler == "max_abs"):
        scaler = MaxAbsScaler()
    elif (scaler == "std"):
        scaler = StandardScaler()
    else:
        scaler = StandardScaler()
    
    # compute the statistics on the train set only
    scaler.fit(X_train)
    
    X_train_scaled = scaler.transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    
    return X_train_scaled, X_test_scaled, y_train, y_test

### Model training

In [6]:
# We use the California housing market dataset, provided by scikit-learn
# https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html

from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()

In [7]:
housing

{'data': array([[   8.3252    ,   41.        ,    6.98412698, ...,    2.55555556,
           37.88      , -122.23      ],
        [   8.3014    ,   21.        ,    6.23813708, ...,    2.10984183,
           37.86      , -122.22      ],
        [   7.2574    ,   52.        ,    8.28813559, ...,    2.80225989,
           37.85      , -122.24      ],
        ...,
        [   1.7       ,   17.        ,    5.20554273, ...,    2.3256351 ,
           39.43      , -121.22      ],
        [   1.8672    ,   18.        ,    5.32951289, ...,    2.12320917,
           39.43      , -121.32      ],
        [   2.3886    ,   16.        ,    5.25471698, ...,    2.61698113,
           39.37      , -121.24      ]]),
 'target': array([4.526, 3.585, 3.521, ..., 0.923, 0.847, 0.894]),
 'frame': None,
 'target_names': ['MedHouseVal'],
 'feature_names': ['MedInc',
  'HouseAge',
  'AveRooms',
  'AveBedrms',
  'Population',
  'AveOccup',
  'Latitude',
  'Longitude'],
 'DESCR': '.. _california_housing_dataset:\n

In [8]:
# Transform the result into a dataframe
house_df = pd.DataFrame(housing["data"], columns=housing["feature_names"])
house_df["target"] = housing["target"]
house_df.head(4)

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | target |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |


In [9]:
house_df.isnull().sum()

MedInc        0
HouseAge      0
AveRooms      0
AveBedrms     0
Population    0
AveOccup      0
Latitude      0
Longitude     0
target        0
dtype: int64

#### Step 1: Separate the features and the label

In [10]:
# This is a regression problem: we predict the median house value (target).
X = house_df.drop(columns="target")
y = house_df["target"]

X.shape, y.shape

((20640, 8), (20640,))

#### Step 2: Split the data into train and test. In addition, perform feature scaling.

Cross-validation is a generic term for methods that split the data into a train set (to create the model) and a test set (to measure the performance of the model). The simplest approach is to split into only two sets: train and test. In that case we sacrifice part of the data for model performance evaluation. A more advanced method is to use 3 sets: train, evaluation (used to fine-tune the hyperparameters) and test (the truly unseen data used to perform the final test).
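
As a minimal sketch of the 3-sets idea, two consecutive calls to train_test_split are enough. The toy arrays, the `_demo` variable names and the 60/20/20 proportions are assumptions made for illustration; they are not part of this notebook's workflow.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples, 3 features (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = rng.normal(size=100)

# First cut: hold out 20% of the data as the final test set
X_rest, X_test_demo, y_rest, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Second cut: 25% of the remaining 80% becomes the evaluation set (20% overall)
X_train_demo, X_eval_demo, y_train_demo, y_eval_demo = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0
)

print(X_train_demo.shape, X_eval_demo.shape, X_test_demo.shape)
```

The evaluation set is used to compare hyperparameters, and the test set is touched only once, at the very end.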

A variation of the train/evaluation/test sets is to use the cross_val_score() or cross_validate() functions. We still split into train and test datasets, but the difference is that we use those functions to train and compute performance across the whole training set, fold by fold. This gives a much better idea of the true performance of the model.

Note that the cross_validate function returns more information and allows us to compute several performance metrics at once. In this notebook we use cross_val_score with a single metric; the commented line in the cross-validation cell shows the cross_validate equivalent.
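
To make the difference concrete, here is a small sketch of cross_validate with two metrics. It runs on synthetic data from make_regression rather than on the housing data, and the `_demo` names, `alpha=1.0` and `cv=5` are arbitrary choices made for the example.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate

# Synthetic regression data (illustrative only)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

cv_results = cross_validate(
    Ridge(alpha=1.0),
    X_demo,
    y_demo,
    scoring=["neg_mean_absolute_error", "neg_root_mean_squared_error"],
    cv=5,
)

# The result is a dict; each metric is stored under the key test_<scorer name>.
# Error scorers are negated, so we flip the sign to read them as errors.
mae_per_fold = -cv_results["test_neg_mean_absolute_error"]
rmse_per_fold = -cv_results["test_neg_root_mean_squared_error"]

print(sorted(cv_results.keys()))
print(mae_per_fold.mean(), rmse_per_fold.mean())
```

Besides the scores, the returned dict also contains the fit and score times for each fold.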

Feature scaling is needed for many algorithms that perform gradient descent, and it also matters for regularized models such as Ridge, whose penalty depends on the scale of the coefficients. It transforms the features into a small, comparable range of values. It is important to note that, as the name implies, ONLY the features are scaled. The label/target does not need to be scaled. The three most common methods to perform feature scaling are:
* maximum absolute scaling
* normalization scaling (min max scaling)
* standardization (also called z-score normalization)

To prevent data leakage we need to fit on the train data only, and only then perform the scaling on the train and test datasets. The fit() method computes the statistics used by the corresponding scaler, and the transform() method applies those statistics to the dataset.
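
The sketch below applies the three scalers to a tiny, made-up single-feature array so the effect of fit() and transform() is easy to follow. The values and the `_demo` names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler, StandardScaler

# Tiny single-feature example (values chosen for illustration)
X_train_demo = np.array([[1.0], [2.0], [3.0], [4.0]])
X_test_demo = np.array([[5.0]])  # lies outside the range seen during training

results = {}
for scaler in (MaxAbsScaler(), MinMaxScaler(), StandardScaler()):
    scaler.fit(X_train_demo)  # statistics come from the training data only
    train_scaled = scaler.transform(X_train_demo)
    test_scaled = scaler.transform(X_test_demo)  # reuses the training statistics
    results[type(scaler).__name__] = (train_scaled, test_scaled)
    print(type(scaler).__name__, train_scaled.ravel(), test_scaled.ravel())
```

Because the statistics come from the training data, a test value can land outside the range of the scaled training data. That is expected and is not a sign of an error.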

In [11]:
# let's use the function split_scaling_features, defined above, to perform the train/test split AND feature scaling
# X_test and y_test will be used only to compute the final performance
X_train, X_test, y_train, y_test = split_scaling_features(X, y, scaler="min_max", test_size=0.3, random_state=7)

In [12]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((14448, 8), (6192, 8), (14448,), (6192,))

#### Step 3: Choose the model

Choose the model based on the type of problem. In this case, this is a regression problem, and we are going to use the Ridge algorithm for this dataset.

In [13]:
# Ridge adds an L2 penalty on the coefficients; the alpha parameter controls its strength.
model = Ridge(alpha=10)

In [14]:
# The cv parameter sets the number of folds. The scoring parameter is the metric you want to measure. You need to pass the correct name. 
# Go to https://scikit-learn.org/stable/modules/model_evaluation.html for a complete list of metrics you can use.
# You can also use the function get_metric_names(), defined above, which returns the metric names that contain a given substring

# I want the metrics that contain "mean_" in their names (regression error metrics). Let's see what it returns:
get_metric_names(["mean_"])

{'neg_mean_absolute_error',
 'neg_mean_absolute_percentage_error',
 'neg_mean_gamma_deviance',
 'neg_mean_poisson_deviance',
 'neg_mean_squared_error',
 'neg_mean_squared_log_error',
 'neg_root_mean_squared_error'}

In [15]:
# this time we will use the cross_val_score function with 10 folds (cv=10). For an example of cross_validate, 
#    please refer to the ml_training_model_baseline notebook.

#scores_cv = cross_validate(model, X_train, y_train, scoring=["neg_mean_absolute_error","neg_root_mean_squared_error"], cv=5)
#OR 
scores_cv = cross_val_score(model, X_train, y_train, scoring="neg_root_mean_squared_error", cv=10)

In [16]:
# All scorer objects follow the convention that higher return values are better across different metrics.
# So, error metrics such as RMSE are returned as negative values
scores_cv

array([-0.7649308 , -0.73764669, -0.69416695, -0.76768818, -0.77553752,
       -0.70542886, -0.75649899, -0.71170538, -0.71871456, -0.73783851])

In [17]:
# Mean RMSE across the 10 folds for alpha=10 (absolute value, since the scores are negated).
# Since alpha can vary from 0 to infinity, we need to experiment with many alpha values.
np.abs(np.mean(scores_cv))

0.7370156439700112

#### Step 4: Experiment/fine-tune the model with different hyperparameters

We need to experiment with a variety of alpha values and compare the results.
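
Done by hand, this experiment amounts to a loop over candidate alpha values, each scored with cross_val_score. The sketch below uses synthetic data and `cv=5` to stay quick; the `_demo` names and the data are assumptions, not the housing dataset.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic regression data (illustrative only)
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=15.0, random_state=0)

alphas = [0.05, 0.1, 0.5, 1, 5, 10]
mean_rmse = {}
for alpha in alphas:
    scores = cross_val_score(
        Ridge(alpha=alpha),
        X_demo,
        y_demo,
        scoring="neg_root_mean_squared_error",
        cv=5,
    )
    mean_rmse[alpha] = -scores.mean()  # flip the sign to get a positive RMSE

# The best alpha is the one with the lowest mean RMSE across folds
best_alpha = min(mean_rmse, key=mean_rmse.get)
print(mean_rmse)
print("best alpha:", best_alpha)
```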

In [18]:
# There is another version of Ridge called RidgeCV. It performs cross-validation for many alpha values
from sklearn.linear_model import RidgeCV

In [19]:
# It accepts a list of alpha values, and the cv parameter sets the number of folds for the cross-validation
# We also set the scoring parameter to choose the metric used to select the best alpha.
model_cv = RidgeCV(alphas=(0.05, 0.1, 0.5, 1, 5, 10), cv=10, scoring="neg_root_mean_squared_error")

In [20]:
model_cv.fit(X_train, y_train)

In [21]:
# it shows the alpha that performs the best
model_cv.alpha_

0.1
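
One caveat: X_train was scaled once before cross-validation, so the scaler statistics include the rows that later serve as validation folds. A common alternative, sketched here on synthetic data, is to put the scaler and the model in a Pipeline and search alpha with GridSearchCV, so the scaler is re-fitted inside each fold. The step names `scaler` and `ridge` and the `_demo` data are assumptions made for this example.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic, unscaled regression data (illustrative only)
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=15.0, random_state=0)

# The pipeline scales the features and then fits Ridge
pipe = Pipeline([("scaler", MinMaxScaler()), ("ridge", Ridge())])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = {"ridge__alpha": [0.05, 0.1, 0.5, 1, 5, 10]}

search = GridSearchCV(pipe, param_grid, scoring="neg_root_mean_squared_error", cv=5)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(-search.best_score_)  # mean cross-validated RMSE of the best alpha
```

After fitting, `search.best_estimator_` is the whole pipeline refitted on all the data passed to fit(), so it can be used directly on unscaled test features.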

In [22]:
# Now that we know the best alpha, we can compute the metric on the test set

In [23]:
y_pred = model_cv.predict(X_test)

In [24]:
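# RidgeCV selected alpha=0.1; its test RMSE is slightly lower than the cross-validated RMSE we got with alpha=10
np.sqrt(mean_squared_error(y_test, y_pred)) # the square root of the MSE is the root mean squared error (squared=False is no longer available in recent scikit-learn releases)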

0.7230868941321775

It is important to note here that the relationship between the features and the target may not be linear. Since Ridge is a linear model, it may not fit this type of data well.
That's the reason we have to experiment with non-linear algorithms like RandomForestRegressor.

Let's do a quick test using the default hyperparameters for RandomForestRegressor.

In [25]:
from sklearn.ensemble import RandomForestRegressor
model_rfr = RandomForestRegressor()
model_rfr.fit(X_train, y_train)
y_pred_rtf = model_rfr.predict(X_test)

In [26]:
# we got a much better result (lower RMSE) than RidgeCV
np.sqrt(mean_squared_error(y_test, y_pred_rtf))

0.5221420976132407

#### Step 5: Save the model

In [27]:
from joblib import dump, load
dump(model_rfr, 'california_housing_model.joblib') 

['california_housing_model.joblib']
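
The saved model was trained on min-max scaled features, so new data has to go through the same fitted scaler before prediction, and split_scaling_features does not return that scaler. One possible way to keep both together, sketched here on synthetic data, is to save a Pipeline that contains the scaler and the model. The file name `demo_pipeline.joblib`, the step names and `n_estimators=20` are hypothetical choices for the example.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic, unscaled regression data (illustrative only)
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

# Scaler and model are fitted together and travel together
pipe = Pipeline([
    ("scaler", MinMaxScaler()),
    ("model", RandomForestRegressor(n_estimators=20, random_state=0)),
])
pipe.fit(X_demo, y_demo)

# Hypothetical file name, written to a temporary directory for this demo
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_pipeline.joblib")
    dump(pipe, path)
    restored = load(path)

# The restored pipeline accepts raw (unscaled) features, just like the original
same_predictions = np.allclose(pipe.predict(X_demo[:5]), restored.predict(X_demo[:5]))
print(same_predictions)
```

Only load joblib files from sources you trust, and use the same scikit-learn version for saving and loading when possible.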