# Training & Evaluation
* We break training and evaluation into multiple notebooks, one for each algorithm that we train and evaluate.
* In this notebook, we train a logistic regression model, first with default hyperparameters and then tuned with `GridSearchCV`, and compare it against the `stratified` and `most frequent` baseline models.

## Install Libraries

In [1]:
# %pip install scikit-learn

## Import Libraries

In [2]:
import os
import sys
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from pathlib import Path
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, recall_score, precision_score, precision_recall_curve
import seaborn as sns


# Build an absolute path from this notebook's parent directory
module_path = os.path.abspath(os.path.join('..'))

# Add to sys.path if not already present
if module_path not in sys.path:
    sys.path.append(module_path)
    
from src.utils import preprocessing
from src.utils import common
from src.utils.training import refit_strategy

## Initialize Directories

In [3]:
data_root_dir = Path("..", "data/")
models_root_dir = Path("..", "models/")

## Read Data

In [4]:
X_train = pd.read_csv(Path(data_root_dir,"X_train.csv"))
y_train = pd.read_csv(Path(data_root_dir,"y_train.csv"))

In [5]:
# preprocessed_data_df = pd.DataFrame(preprocessing.pipeline.fit_transform(
#     X_train,y_train), columns=preprocessing.pipeline.get_feature_names_out())
# preprocessed_data_df.head()

In [6]:
# preprocessed_data_df.isna().sum()

## Training Default Model

In [7]:
# import sklearn


# sklearn.metrics.get_scorer_names() 

In [8]:
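# StandardScaler lives in sklearn.preprocessing; sklearn.discriminant_analysis only re-imports it internally
from sklearn.preprocessing import StandardScaler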
from sklearn.model_selection import cross_val_score, cross_validate
from sklearn.pipeline import Pipeline


default_logistic_regression_model = LogisticRegression(max_iter=1000)

model_pipeline = Pipeline([
    ("preprocessing", preprocessing.pipeline),
    ("normalizing", StandardScaler()),
    ("prediction", default_logistic_regression_model)
])

scoring = ["recall", "precision", "f1"]

default_logistic_regression_scores = cross_validate(
    estimator=model_pipeline, 
    X=X_train, 
    y=y_train.values.ravel(), 
    cv=3, scoring=scoring,
    n_jobs=-1, verbose=2)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 20 concurrent workers.


[CV] END .................................................... total time=   1.9s
[CV] END .................................................... total time=   1.8s
[CV] END .................................................... total time=   1.9s


[Parallel(n_jobs=-1)]: Done   3 out of   3 | elapsed:    3.2s finished


In [9]:
default_logistic_regression_scores

{'fit_time': array([1.13014841, 1.03965688, 1.01748991]),
 'score_time': array([0.72425652, 0.82147956, 0.79781532]),
 'test_recall': array([0.89370983, 0.88269054, 0.88016529]),
 'test_precision': array([0.85522847, 0.85349612, 0.86176669]),
 'test_f1': array([0.8740458 , 0.86784787, 0.87086882])}

In [10]:
cv_scores = default_logistic_regression_scores


In [11]:
mean_recall, mean_precision, mean_f1 = common.calculate_mean_from_cv(
    default_logistic_regression_scores)

Mean Recall: 0.8855218855218855, Mean Precision: 0.8855218855218855,Mean F1: 0.8709208329196105
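
The helper `common.calculate_mean_from_cv` is not shown in this notebook. As a rough sketch, assuming it simply averages each `test_*` array returned by `cross_validate`, the calculation could look like the following. The function name `mean_cv_scores` and the `avg_*` variables are hypothetical; the fold scores are copied from the output of cell 9.

```python
import numpy as np

# Fold scores copied from the cross_validate output above.
cv_fold_scores = {
    "test_recall": np.array([0.89370983, 0.88269054, 0.88016529]),
    "test_precision": np.array([0.85522847, 0.85349612, 0.86176669]),
    "test_f1": np.array([0.8740458, 0.86784787, 0.87086882]),
}


def mean_cv_scores(scores, metrics=("recall", "precision", "f1")):
    """Hypothetical helper: average each metric over the CV folds."""
    return tuple(float(np.mean(scores[f"test_{m}"])) for m in metrics)


avg_recall, avg_precision, avg_f1 = mean_cv_scores(cv_fold_scores)
print(f"Mean Recall: {avg_recall:.4f}, "
      f"Mean Precision: {avg_precision:.4f}, Mean F1: {avg_f1:.4f}")
```

Averaged this way, precision comes out near `0.857`, whereas the helper output above prints the same value for precision as for recall. It may be worth checking how the helper computes its precision figure.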


In [12]:
# Note: this cell updates the metrics file. Comment it out when re-running to avoid overwriting existing entries.
common.update_models_metrics("Logistic Regression", "v0", mean_recall,mean_precision,mean_f1)

| Unnamed: 0 | model | version | recall | precision | f1 | file |
|---|---|---|---|---|---|---|
| 0 | Logistic Regression | v0 | 0.885522 | 0.885522 | 0.870921 | |


Observations:
* With an average recall of about `0.89` and an average precision of about `0.86` (the means of the `test_recall` and `test_precision` fold scores), this model already outperforms the baseline estimators. Although its recall is lower than that of the most frequent version of the baseline, its average precision and F1 score make it the more promising model.
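
To see why a most frequent baseline can beat a trained model on recall alone, here is a small sketch on synthetic labels. It assumes the positive class is the majority class, which this notebook does not state, so treat the numbers as illustrative only.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import precision_score, recall_score

# Synthetic labels (assumption): 70% positive, so class 1 is the majority.
y_toy = np.array([1] * 70 + [0] * 30)
X_toy = np.zeros((len(y_toy), 1))  # DummyClassifier ignores the features

# The baseline always predicts the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
y_toy_pred = baseline.predict(X_toy)

baseline_recall = recall_score(y_toy, y_toy_pred)
baseline_precision = precision_score(y_toy, y_toy_pred)
print(f"recall={baseline_recall:.2f}, precision={baseline_precision:.2f}")
```

Because this baseline predicts the positive class for every row, it never misses a positive (recall of 1.0), but its precision is only the share of positives in the data. That is why precision and F1 are needed alongside recall when comparing against it.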


## GridSearch CV

In [13]:
## checking params
# preprocessing.pipeline.get_params()

In [14]:
from sklearn.model_selection import GridSearchCV


model_pipeline = Pipeline([
    ("preprocessing", preprocessing.pipeline),
    ("normalizing", StandardScaler()),
    ("prediction", LogisticRegression(max_iter=1000))
])

scoring = ["recall", "precision", "f1"]

param_grid = {
    "prediction__solver": ["liblinear", "saga"],
    "prediction__penalty": ["l1", "l2"],
    "prediction__C": [0.1, 1, 10]
}

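# With several scoring metrics, refit must be a metric name or a callable
# that receives cv_results_ and returns the index of the chosen candidate.
grid_search = GridSearchCV(
    model_pipeline, param_grid, scoring=scoring, cv=3,
    n_jobs=-1, refit=refit_strategy)
grid_search.fit(X_train, y_train.values.ravel())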
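
The `refit_strategy` callable imported from `src.utils.training` is not shown here. As an illustration only, a callable of this kind could pick the candidate with the highest mean recall among those that reach a minimum mean precision. The function name, the threshold, and the toy scores below are all assumptions, not the project's actual implementation.

```python
import numpy as np


def example_refit_strategy(cv_results, precision_threshold=0.85):
    """Hypothetical refit callable: return the index of the candidate with
    the highest mean recall among those meeting a precision threshold."""
    precision = np.asarray(cv_results["mean_test_precision"])
    recall = np.asarray(cv_results["mean_test_recall"])

    eligible = np.flatnonzero(precision >= precision_threshold)
    if eligible.size == 0:
        # Fall back to all candidates if none meets the threshold.
        eligible = np.arange(len(recall))

    return int(eligible[np.argmax(recall[eligible])])


# Toy cv_results_-like dict with made-up scores for three candidates.
toy_results = {
    "mean_test_precision": [0.80, 0.86, 0.87],
    "mean_test_recall": [0.95, 0.88, 0.89],
}
selected_index = example_refit_strategy(toy_results)
print(selected_index)
```

In the toy data the first candidate has the best recall but fails the precision threshold, so the third candidate (index 2) is selected. `GridSearchCV` stores the returned value as `best_index_` and refits the pipeline with the matching parameters.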


In [15]:
print(f"Best Parameters: {grid_search.best_params_}")
# best_index_ is the position of the selected candidate in cv_results_, not a score
print(f"Best Index: {grid_search.best_index_}")
average_recall, average_precision, average_f1 = common.read_best_mean_grid_search_metrics(
    grid_search.cv_results_, grid_search.best_index_)

Best Parameters: {'prediction__C': 10, 'prediction__penalty': 'l2', 'prediction__solver': 'saga'}
Best Index: 11
Mean Recall: 0.8859810223446587, Mean Precision: 0.8570896952019277,Mean F1: 0.8712749087558519
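
The helper `common.read_best_mean_grid_search_metrics` is also not shown. A plausible sketch, assuming it just looks up the `mean_test_*` entries of `cv_results_` at the best index, is below. The function name and the toy scores are hypothetical.

```python
def read_best_mean_metrics(cv_results, best_index,
                           metrics=("recall", "precision", "f1")):
    """Hypothetical helper: read the mean CV score of each metric
    for the candidate at position best_index."""
    return tuple(float(cv_results[f"mean_test_{m}"][best_index]) for m in metrics)


# Toy cv_results_-like dict with made-up scores for two candidates.
toy_cv_results = {
    "mean_test_recall": [0.80, 0.90],
    "mean_test_precision": [0.70, 0.85],
    "mean_test_f1": [0.75, 0.87],
}
best_recall, best_precision, best_f1 = read_best_mean_metrics(toy_cv_results, 1)
print(f"Mean Recall: {best_recall}, Mean Precision: {best_precision}, Mean F1: {best_f1}")
```

With multi-metric scoring, `cv_results_` holds one `mean_test_<metric>` array per scorer, each with one entry per parameter combination, so a single index is enough to read all three values.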


In [16]:
# Note: this cell saves the model and updates the metrics file. Comment it out when re-running to avoid overwriting existing entries.
_, file_name = common.save_model(
    "Logistic Regression", "v1", grid_search.best_estimator_)
common.update_models_metrics("Logistic Regression", "v1", average_recall,
                             average_precision, average_f1, file_name=file_name)
common.update_model_params(
    "LogisticRegression", "v1", grid_search.best_params_)

[{'name': 'LogisticRegression',
  'version': 'v1',
  'params': {'prediction__C': 10,
   'prediction__penalty': 'l2',
   'prediction__solver': 'saga'}}]
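
How `common.save_model` persists the estimator is not shown in this notebook. One common approach for scikit-learn pipelines is `joblib`, sketched below on a tiny synthetic dataset. The file name, the data, and the simplified pipeline are assumptions for illustration, not the project's actual storage format.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Tiny synthetic dataset (assumption) standing in for X_train / y_train.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)

# Simplified pipeline using the tuned C and solver (l2 is the default penalty).
demo_pipeline = Pipeline([
    ("normalizing", StandardScaler()),
    ("prediction", LogisticRegression(C=10, solver="saga", max_iter=1000)),
]).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name; the real one comes from common.save_model.
    model_path = Path(tmp_dir, "logistic_regression_v1.joblib")
    joblib.dump(demo_pipeline, model_path)
    restored_pipeline = joblib.load(model_path)

# The restored pipeline should reproduce the original predictions.
same_predictions = bool(np.array_equal(
    demo_pipeline.predict(X_demo), restored_pipeline.predict(X_demo)))
print(same_predictions)
```

Saving the whole fitted pipeline, rather than only the final classifier, keeps the preprocessing and scaling steps bundled with the model, so new data can be passed to `predict` in its raw form.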