# What are we doing?

## Objectives

+ Construct a cross-validation pipeline.
+ Use cross-validation to evaluate different hyperparameter performance.
+ Perform grid search for systematic evaluation.
+ Store and manage results.

## Procedure

The diagram below, taken from scikit-learn's documentation, shows the procedure that we will follow:

![](./img/grid_search_workflow.png)


+ System requirements:

    - Automation: the system should operate automatically, with minimal supervision.
    - Replicability: changes to code and (arguably) data should be logged and controlled. Randomness should also be controlled (random seeds, etc.).
    - Persistence: persist results for later analysis.
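
The replicability requirement can be illustrated with a minimal sketch in plain Python: fixing the seed makes a shuffled fold assignment repeatable. The helper `kfold_indices` is a hypothetical name used only for illustration; it is not part of scikit-learn, which provides its own splitters.

```python
import random


def kfold_indices(n_samples, n_folds, seed):
    """Shuffle row indices with a fixed seed and deal them into folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # local RNG: global state is untouched
    # Fold i takes every n_folds-th index, starting at position i.
    return [indices[i::n_folds] for i in range(n_folds)]


folds_a = kfold_indices(n_samples=10, n_folds=5, seed=42)
folds_b = kfold_indices(n_samples=10, n_folds=5, seed=42)

# Same seed, same folds: the split can be reproduced exactly.
print(folds_a == folds_b)
```

Logging the seed next to the results is what makes a run repeatable later on.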


## What is a Hyperparameter?

+ Generally speaking, hyperparameters are parameters that control the learning process: regularization weights, learning rate, entropy/gini metrics, etc. 
+ Hyperparameters drive the behaviour and performance of a model. Model selection is closely related to hyperparameter tuning.
+ Selection criteria are based on performance evaluation; to obtain more reliable performance estimates, we use cross-validation.

## Searching the Hyperparameter Grid

+ To address the automation requirement, we can use `GridSearchCV`, a self-contained scikit-learn class that performs a grid search over a hyperparameter space.
+ To search the hyperparameter grid exhaustively means that we consider all possible combinations of hyperparameter values in the search space and evaluate the model under each one. For example, if we explore two parameters, kernel (with values "rbf" and "poly") and C (with values 1.0 and 0.5), then the grid consists of the combinations:

    + (rbf, 1.0)
    + (rbf, 0.5)
    + (poly, 1.0)
    + (poly, 0.5)

+ Under each combination, we perform CV and evaluate the model's performance.
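
As a minimal sketch, the four combinations listed above can be generated with a Cartesian product. The variable names are illustrative, and this only mimics the enumeration step; it is not how scikit-learn implements it internally.

```python
from itertools import product

# The search space from the example above: two hyperparameters, two values each.
search_space = {
    "kernel": ["rbf", "poly"],
    "C": [1.0, 0.5],
}

# The Cartesian product of the value lists yields every combination.
names = list(search_space)
grid = [dict(zip(names, values)) for values in product(*search_space.values())]

for combination in grid:
    print(combination)
```

The number of combinations is the product of the list lengths, so the grid grows quickly as parameters or values are added.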

# Setup

We start with the [Give Me Some Credit](https://www.kaggle.com/c/GiveMeSomeCredit) data that we used in the previous session.

In [1]:
%load_ext dotenv
%dotenv ../src/.env
import sys
sys.path.append("../src")
import pandas as pd
import numpy as np
import os
ft_path = os.getenv("CREDIT_DATA")
df_raw = pd.read_csv(ft_path)


In [2]:
# same dataset as before
df = df_raw.drop(columns = ["Unnamed: 0"]).rename(
    columns = {
        'SeriousDlqin2yrs': 'delinquency',
        'RevolvingUtilizationOfUnsecuredLines': 'revolving_unsecured_line_utilization', 
        'age': 'age',
        'NumberOfTime30-59DaysPastDueNotWorse': 'num_30_59_days_late', 
        'DebtRatio': 'debt_ratio', 
        'MonthlyIncome': 'monthly_income',
        'NumberOfOpenCreditLinesAndLoans': 'num_open_credit_loans', 
        'NumberOfTimes90DaysLate':  'num_90_days_late',
        'NumberRealEstateLoansOrLines': 'num_real_estate_loans', 
        'NumberOfTime60-89DaysPastDueNotWorse': 'num_60_89_days_late',
        'NumberOfDependents': 'num_dependents'
    }
).assign(
    high_debt_ratio = lambda x: (x['debt_ratio'] > 1)*1,
    missing_monthly_income = lambda x: x['monthly_income'].isna()*1,
    missing_num_dependents = lambda x: x['num_dependents'].isna()*1, 
)

Use a simple pipeline composed of:

+ Preprocessing steps.
+ Logistic Regression classifier.

We will explore the hyperparameter space by evaluating different regularization strategies and parameters.

In [3]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
# Note: we use GridSearchCV here instead of cross_validate.

In [4]:
# Same pipeline as before
num_cols = ['revolving_unsecured_line_utilization', 'age',
       'num_30_59_days_late', 'debt_ratio', 'monthly_income',
       'num_open_credit_loans', 'num_90_days_late', 'num_real_estate_loans',
       'num_60_89_days_late', 'num_dependents', 
       # Although expressed as numbers, these columns are boolean:
       # 'high_debt_ratio',
       # 'missing_monthly_income', 
       # 'missing_num_dependents' 
       ]


pipe_num_simple = Pipeline([
    ('imputer', SimpleImputer(strategy = 'median')),
    ('standardizer', StandardScaler())
])

ctransform_simple= ColumnTransformer([
    ('numeric_simple', pipe_num_simple, num_cols),
], remainder='passthrough')

pipe_lr = Pipeline([
    ('preprocess', ctransform_simple),
    ('clf', LogisticRegression())
])
pipe_lr

Obtain the parameters of the pipeline with `.get_params()`.

In [5]:
# get_params() lists every parameter of the pipeline, including nested ones.
# Top-level entries include 'memory' (optional caching of fitted transformers)
# and 'verbose' (False: no progress messages are printed).
# Nested parameters are named <step>__<parameter>. For the preprocessing step:
#  'preprocess__n_jobs': None,  (parallel jobs for the ColumnTransformer; GridSearchCV has its own n_jobs)
#  'preprocess__remainder': 'passthrough',
#  'preprocess__sparse_threshold': 0.3,
#  'preprocess__transformer_weights': None,
#  'preprocess__transformers': [('numeric_simple'  
# Deeper levels follow the same pattern, so we can reach the imputer:
# 'preprocess__numeric_simple__imputer__missing_values': nan,
#  'preprocess__numeric_simple__imputer__strategy': 'median',
# These names are the handles used to set parameters,
# e.g., to change the imputation strategy later on.
# The classifier's parameters use the 'clf__' prefix; C defaults to 1.0:
# 'clf__C': 1.0,
#  'clf__class_weight': None,
#  'clf__dual': False,
#  'clf__fit_intercept': True,
#  'clf__intercept_scaling': 1,
#  'clf__l1_ratio': None,
#  'clf__max_iter': 100,
#  'clf__multi_class': 'auto',
#  'clf__n_jobs': None,
#  'clf__penalty': 'l2',
#  'clf__random_state': None,
#  'clf__solver': 'lbfgs',
#  'clf__tol': 0.0001,
#  'clf__verbose': 0,
#  'clf__warm_start': False}

# get_params() returns a Python dictionary, so .keys() lists all parameter names.

pipe_lr.get_params()

{'memory': None,
 'steps': [('preprocess',
   ColumnTransformer(remainder='passthrough',
                     transformers=[('numeric_simple',
                                    Pipeline(steps=[('imputer',
                                                     SimpleImputer(strategy='median')),
                                                    ('standardizer',
                                                     StandardScaler())]),
                                    ['revolving_unsecured_line_utilization', 'age',
                                     'num_30_59_days_late', 'debt_ratio',
                                     'monthly_income', 'num_open_credit_loans',
                                     'num_90_days_late', 'num_real_estate_loans',
                                     'num_60_89_days_late', 'num_dependents'])])),
  ('clf', LogisticRegression())],
 'verbose': False,
 'preprocess': ColumnTransformer(remainder='passthrough',
                   transformers=[('numeric_simpl

In [7]:
hyperparams = pipe_lr.get_params()
hyperparams.keys()

dict_keys(['memory', 'steps', 'verbose', 'preprocess', 'clf', 'preprocess__n_jobs', 'preprocess__remainder', 'preprocess__sparse_threshold', 'preprocess__transformer_weights', 'preprocess__transformers', 'preprocess__verbose', 'preprocess__verbose_feature_names_out', 'preprocess__numeric_simple', 'preprocess__numeric_simple__memory', 'preprocess__numeric_simple__steps', 'preprocess__numeric_simple__verbose', 'preprocess__numeric_simple__imputer', 'preprocess__numeric_simple__standardizer', 'preprocess__numeric_simple__imputer__add_indicator', 'preprocess__numeric_simple__imputer__copy', 'preprocess__numeric_simple__imputer__fill_value', 'preprocess__numeric_simple__imputer__keep_empty_features', 'preprocess__numeric_simple__imputer__missing_values', 'preprocess__numeric_simple__imputer__strategy', 'preprocess__numeric_simple__standardizer__copy', 'preprocess__numeric_simple__standardizer__with_mean', 'preprocess__numeric_simple__standardizer__with_std', 'clf__C', 'clf__class_weight', '
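
The double-underscore names listed above address nested parameters. Here is a minimal sketch on a small stand-in pipeline; the step names `scaler` and `clf` are chosen for illustration and differ from the credit pipeline in this notebook.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# A small stand-in pipeline; step names are illustrative.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression()),
])

# <step>__<parameter> reaches into the step with that name.
pipe.set_params(clf__C=0.5, scaler__with_mean=False)

print(pipe.get_params()["clf__C"])
print(pipe.named_steps["scaler"].with_mean)
```

The same naming scheme is what a parameter grid uses as its keys.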

## Setup the Splitting Strategy

In [8]:
X = df.drop(columns = 'delinquency')
Y = df['delinquency']

scoring = ['neg_log_loss', 'roc_auc', 'f1', 'accuracy', 'precision', 'recall']

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 42)



To perform the Grid Search we need to define a parameter grid:

- A parameter grid defines all of the combinations of parameters that we need to explore.
- `GridSearchCV` performs an exhaustive search over the parameter combinations.
- The parameter grid is defined as a dictionary of lists:

    * Each entry's key is the name of the parameter.
    * Each entry's value is the list of values that we would like to explore.

In [9]:
# The parameter grid is a dictionary of {name: list of values} pairs.
param_grid = {
    # Names are not arbitrary: they come from get_params().
    # Values of C to test. A broader set is possible, but the search is
    # exhaustive (brute force), so it will take longer.
    'clf__C': [0.01, 0.5, 1.0],
    # Penalty term: 'l1' penalizes the absolute values of the coefficients, 'l2' their squares.
    'clf__penalty': ['l1', 'l2'],
    # A single-value list fixes the parameter: always use the liblinear solver.
    'clf__solver': ['liblinear'],
    }

Some key inputs to [`GridSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) are:

+ `estimator`: the pipeline or classifier that we are tuning.
+ `param_grid`: the parameter grid defined as a dictionary of lists described above.
+ `n_jobs`: settings for parallel computation.
+ `refit`: options for refitting the model using the best-performing configuration.

In [10]:
# Run the parameter grid through the logistic regression pipeline.
grid_cv = GridSearchCV(
    # The estimator is the entire pipeline; it must implement the fit method.
    estimator=pipe_lr, 
    param_grid=param_grid, 
    scoring = scoring, 
    cv = 5,
    # With several scoring metrics, refit names the one used to select the best
    # configuration. Here the optimization criterion is negative log loss: after
    # the search, the pipeline is retrained on all training data with the best parameters.
    refit = "neg_log_loss")
# Fit on the training data prepared above (X_train and Y_train).
grid_cv.fit(X_train, Y_train)
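
To get a sense of the cost of an exhaustive search, the sketch below counts the model fits implied by the grid and the 5-fold setting used above. It is plain arithmetic and does not call scikit-learn; the variable names are illustrative.

```python
from math import prod

# Same grid as in the cell above.
param_grid = {
    "clf__C": [0.01, 0.5, 1.0],
    "clf__penalty": ["l1", "l2"],
    "clf__solver": ["liblinear"],
}
n_folds = 5  # cv = 5

n_combinations = prod(len(values) for values in param_grid.values())
n_cv_fits = n_combinations * n_folds
n_total_fits = n_cv_fits + 1  # plus one final refit of the best configuration

print(n_combinations, n_cv_fits, n_total_fits)
```

Six configurations evaluated over five folds give 30 fits, plus one refit on the full training set.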

In [14]:
grid_cv.cv_results_

{'mean_fit_time': array([4.88426013, 0.69663277, 7.43054314, 0.6964282 , 7.49893417,
        0.71018353]),
 'std_fit_time': array([0.53584907, 0.0518807 , 0.35994998, 0.01969303, 0.31161969,
        0.01136668]),
 'mean_score_time': array([0.09793634, 0.08664403, 0.08780599, 0.08833084, 0.09571519,
        0.08999553]),
 'std_score_time': array([0.01284919, 0.00373911, 0.00633295, 0.00628991, 0.01511616,
        0.00734157]),
 'param_clf__C': masked_array(data=[0.01, 0.01, 0.5, 0.5, 1.0, 1.0],
              mask=[False, False, False, False, False, False],
        fill_value='?',
             dtype=object),
 'param_clf__penalty': masked_array(data=['l1', 'l2', 'l1', 'l2', 'l1', 'l2'],
              mask=[False, False, False, False, False, False],
        fill_value='?',
             dtype=object),
 'param_clf__solver': masked_array(data=['liblinear', 'liblinear', 'liblinear', 'liblinear',
                    'liblinear', 'liblinear'],
              mask=[False, False, False, False, Fals

In [15]:
# ** unpacks the dictionary into keyword arguments, e.g., set_params(clf__C = 1.0, ...)
# Note: set_params modifies pipe_lr in place and returns the same object.
params = grid_cv.best_params_
pipe_test = pipe_lr.set_params(**params)
pipe_test.get_params()

{'memory': None,
 'steps': [('preprocess',
   ColumnTransformer(remainder='passthrough',
                     transformers=[('numeric_simple',
                                    Pipeline(steps=[('imputer',
                                                     SimpleImputer(strategy='median')),
                                                    ('standardizer',
                                                     StandardScaler())]),
                                    ['revolving_unsecured_line_utilization', 'age',
                                     'num_30_59_days_late', 'debt_ratio',
                                     'monthly_income', 'num_open_credit_loans',
                                     'num_90_days_late', 'num_real_estate_loans',
                                     'num_60_89_days_late', 'num_dependents'])])),
  ('clf', LogisticRegression(penalty='l1', solver='liblinear'))],
 'verbose': False,
 'preprocess': ColumnTransformer(remainder='passthrough',
                

Access the cross-validation results using the property `.cv_results_`:

In [13]:
# cv_results_ holds the scoring results for every configuration; save them to CSV
pd.DataFrame(grid_cv.cv_results_).to_csv('temp.csv')

In [11]:
# The columns param_clf__C, param_clf__penalty, and param_clf__solver identify each configuration.
# The columns mean_test_neg_log_loss, std_test_neg_log_loss, and rank_test_neg_log_loss
# rank the configurations by their performance on negative log loss.

res = grid_cv.cv_results_
res = pd.DataFrame(res)
res.columns

res[['mean_fit_time', 'std_fit_time', 'mean_score_time', 'std_score_time',
       'param_clf__C', 'param_clf__penalty', 'param_clf__solver', 'params',
       'mean_test_neg_log_loss',
       'std_test_neg_log_loss', 'rank_test_neg_log_loss']].sort_values('rank_test_neg_log_loss')

| | `mean_fit_time` | `std_fit_time` | `mean_score_time` | `std_score_time` | `param_clf__C` | `param_clf__penalty` | `param_clf__solver` | `params` | `mean_test_neg_log_loss` | `std_test_neg_log_loss` | `rank_test_neg_log_loss` |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 4 | 7.498934 | 0.31162 | 0.095715 | 0.015116 | 1.0 | l1 | liblinear | `{'clf__C': 1.0, 'clf__penalty': 'l1', 'clf__so...` | -0.225373 | 0.000492 | 1 |
| 5 | 0.710184 | 0.011367 | 0.089996 | 0.007342 | 1.0 | l2 | liblinear | `{'clf__C': 1.0, 'clf__penalty': 'l2', 'clf__so...` | -0.225374 | 0.000488 | 2 |
| 2 | 7.430543 | 0.35995 | 0.087806 | 0.006333 | 0.5 | l1 | liblinear | `{'clf__C': 0.5, 'clf__penalty': 'l1', 'clf__so...` | -0.225374 | 0.000488 | 3 |
| 3 | 0.696428 | 0.019693 | 0.088331 | 0.00629 | 0.5 | l2 | liblinear | `{'clf__C': 0.5, 'clf__penalty': 'l2', 'clf__so...` | -0.225377 | 0.000481 | 4 |
| 0 | 4.88426 | 0.535849 | 0.097936 | 0.012849 | 0.01 | l1 | liblinear | `{'clf__C': 0.01, 'clf__penalty': 'l1', 'clf__s...` | -0.227513 | 0.000446 | 5 |
| 1 | 0.696633 | 0.051881 | 0.086644 | 0.003739 | 0.01 | l2 | liblinear | `{'clf__C': 0.01, 'clf__penalty': 'l2', 'clf__s...` | -0.228725 | 0.000628 | 6 |
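
The ranking can be reproduced by hand. This sketch copies the mean scores from the results above into a plain list and sorts them; because the scores are negated losses, the best configuration has the highest value (closest to zero).

```python
# Mean test scores copied from the results above (neg_log_loss).
results = [
    {"C": 0.01, "penalty": "l1", "mean_test_neg_log_loss": -0.227513},
    {"C": 0.01, "penalty": "l2", "mean_test_neg_log_loss": -0.228725},
    {"C": 0.5, "penalty": "l1", "mean_test_neg_log_loss": -0.225374},
    {"C": 0.5, "penalty": "l2", "mean_test_neg_log_loss": -0.225377},
    {"C": 1.0, "penalty": "l1", "mean_test_neg_log_loss": -0.225373},
    {"C": 1.0, "penalty": "l2", "mean_test_neg_log_loss": -0.225374},
]

# Scores are negated losses, so higher (closer to zero) is better.
ranked = sorted(results, key=lambda row: row["mean_test_neg_log_loss"], reverse=True)
best = ranked[0]

print(best["C"], best["penalty"])  # the rank-1 configuration
print(-best["mean_test_neg_log_loss"])  # the corresponding log loss
```

Note that the top four configurations differ by far less than the reported standard deviation (about 0.0005), so the ranking among them should be read with caution.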


Access the best-performing configuration:

In [25]:
grid_cv.best_params_

{'clf__C': 1.0, 'clf__penalty': 'l1', 'clf__solver': 'liblinear'}

In [26]:
# 'est' is a pipeline
est = grid_cv.best_estimator_
est

The property `.best_estimator_` holds the best-performing classifier (pipeline), refit on the complete training set.
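
Because the best estimator is refit on the whole training set, it can be scored directly on held-out data. Below is a self-contained sketch on synthetic data generated with `make_classification`, not the credit data; the grid varies only `C` to keep the example short.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data; the notebook uses the credit data instead.
X, y = make_classification(n_samples=500, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

grid = GridSearchCV(
    estimator=LogisticRegression(),
    param_grid={"C": [0.01, 0.5, 1.0]},
    scoring="neg_log_loss",
    cv=5,
)
grid.fit(X_train, y_train)

# best_estimator_ was refit on all of X_train; score it on unseen data.
best = grid.best_estimator_
test_loss = log_loss(y_test, best.predict_proba(X_test))
print(grid.best_params_, round(test_loss, 4))
```

The test-set loss is an estimate of out-of-sample performance, since the test rows played no part in the search.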

# Tracking GridSearchCV Experiments

+ We can expand our infrastructure for hyperparameter tuning across various models.
+ The plan:

    - Create a model ingredient to obtain the classifier object.
    - Create experiment parameter grids in JSON files to keep them organized.
    - Schedule the experiments.
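
As a rough sketch of the second step, a parameter grid stored as JSON can be loaded with the standard library. The file layout and the keys `model` and `param_grid` are assumptions made for illustration; the project's actual JSON files may be organized differently.

```python
import json

# Hypothetical content of a grid file (layout and key names are assumptions).
grid_json = """
{
    "model": "LogisticRegression",
    "param_grid": {
        "clf__C": [0.01, 0.5, 1.0],
        "clf__penalty": ["l1", "l2"],
        "clf__solver": ["liblinear"]
    }
}
"""

experiment = json.loads(grid_json)
param_grid = experiment["param_grid"]

# JSON arrays become Python lists, the shape GridSearchCV expects.
print(experiment["model"], sorted(param_grid))
```

Keeping grids in files separates the experiment definition from the code that runs it.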


## The Design

<div>
<img src=./img/experiment_setup.png width="75%">
</div>

Explore the code in `./src/credit_experiment.py` and `./src/credit_model_ingredient.py`:

+ `credit_model_ingredient.py` implements a function that returns a model given a string. This way, we can parametrize models in the experiment.
+ `credit_experiment.py` is a modularized version of our previous file, `credit_experiment_nb.py`, which only worked with the Naive Bayes classifier.
+ The experiment is now further *modularized*: there are ingredients for most components and it can be broken down even more depending on the evolution of the model.
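
The idea of returning a model given a string can be sketched as a simple lookup. This is not the code in `credit_model_ingredient.py`; the names `MODEL_REGISTRY` and `get_model` are hypothetical, and the Sacred ingredient wiring is left out.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

# Hypothetical registry: maps a name to a classifier class.
MODEL_REGISTRY = {
    "LogisticRegression": LogisticRegression,
    "GaussianNB": GaussianNB,
}


def get_model(name, **kwargs):
    """Return an unfitted classifier for the given name."""
    try:
        model_class = MODEL_REGISTRY[name]
    except KeyError:
        raise ValueError(f"Unknown model: {name!r}") from None
    return model_class(**kwargs)


clf = get_model("LogisticRegression", C=0.5)
print(type(clf).__name__, clf.C)
```

With a lookup like this, the model becomes an ordinary configuration value that an experiment can vary.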

## Running Experiments from the Command Line

Access the experiment through the [Command Line Interface](https://sacred.readthedocs.io/en/stable/command_line.html).

```
cd src  # if required
python credit_experiment.py
```

We can also change the parameters of the experiment. For instance, using the same code, we can run an experiment with a logistic regression classifier using a basic (not power) preprocessing pipeline:

```
python .\credit_experiment.py with 'preprocessing="basic"' 'model="LogisticRegression"'
```

# A Few Notes About Sacred

+ Sacred is a powerful tool, but it is only the beginning.  
+ Sacred is useful in keeping track of experiments within a limited scope: it is not a project management tool.
+ It works well in SQL environments, but handling hyperparameters can be painful.
+ The natural backend is MongoDB; however, not all workplaces have running instances.


## Experiment Schema

The database schema implemented by Sacred is shown below. The schema is a useful representation of the code and setup of an experiment. The package offers a [metrics API](https://sacred.readthedocs.io/en/stable/collected_information.html#metrics-api), but we have decided to extend the framework with a few ad-hoc tables with performance metrics. 

The database backend is a database like any other: you can query it with Python, R, or PowerBI.

+ The server runs on localhost, port 5432.
+ User and password are in the .env file in `./src/db/`.

<div>
<img src=./img/sacred_sql_schema.png width="40%">
</div>