# Logistic Regression

Our first model is a logistic regression classifier. Running this notebook takes around 20 minutes, mainly because of the grid search.

### Importing Libraries

In [14]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_validate, KFold, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, balanced_accuracy_score, make_scorer, classification_report
import pyprojroot
import pickle

We set a random seed for reproducibility.

In [6]:
RANDOM_STATE = 11

## 1. Data Preprocessing

In [7]:
DATA_PATH = pyprojroot.here().joinpath('data', 'fetal_health.csv')
df = pd.read_csv(DATA_PATH)

In [8]:
X = df[df.columns.difference(['fetal_health', 'fetal_health_label'])]
y = df['fetal_health']

In [9]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size = 0.2, stratify = y, random_state=RANDOM_STATE)
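
The cells above fit the scaler on the full dataset before splitting, so the test rows contribute to the means and variances used for scaling. One common way to avoid this is to wrap the scaler and the classifier in a scikit-learn `Pipeline`, so the scaler is refit on the training portion only. The following is a minimal sketch on synthetic data, not the notebook's actual setup: `X_demo`, `y_demo` and the split names are hypothetical stand-ins.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced 3-class data standing in for the fetal health features.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=5, n_classes=3,
    weights=[0.7, 0.2, 0.1], random_state=11,
)

# Split first, so the test rows never influence the scaler.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=11
)

# The pipeline refits the scaler on whatever data it is trained on,
# including each training fold during cross-validation.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv_scores = cross_val_score(pipe, X_tr, y_tr, cv=3, scoring="balanced_accuracy")

pipe.fit(X_tr, y_tr)
test_accuracy = pipe.score(X_te, y_te)
print(cv_scores.mean(), test_accuracy)
```

If such a pipeline were passed to `GridSearchCV`, the parameter names would need the step prefix, for example `logisticregression__C` instead of `C`.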

## Defining our Metrics

In [15]:
metrics = {
    'accuracy': accuracy_score,
    'balanced_accuracy': balanced_accuracy_score, 
    'recall': lambda y_true, y_pred: recall_score(y_true, y_pred, average='macro', zero_division=0),
    'precision': lambda y_true, y_pred: precision_score(y_true, y_pred, average='macro', zero_division=0),
    'f1_score': lambda y_true, y_pred: f1_score(y_true, y_pred, average='macro', zero_division=0)
}

for key in metrics.keys():
    metrics[key] = make_scorer(metrics[key])
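
Two of these metrics coincide by definition: scikit-learn documents balanced accuracy as the average of the recall obtained on each class, which is the macro-averaged recall. The sketch below checks this by hand in plain Python on a small, made-up label vector (`y_true_demo` and `y_pred_demo` are illustrative, not taken from the dataset) and contrasts it with plain accuracy.

```python
from collections import Counter


def macro_recall(y_true, y_pred):
    """Unweighted mean of per-class recall (equals balanced accuracy)."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(hits[c] / support[c] for c in support) / len(support)


def accuracy(y_true, y_pred):
    """Fraction of samples predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up labels: class 1 dominates, classes 2 and 3 are rare.
y_true_demo = [1, 1, 1, 1, 1, 1, 2, 2, 3, 3]
y_pred_demo = [1, 1, 1, 1, 1, 1, 1, 2, 3, 1]

demo_accuracy = accuracy(y_true_demo, y_pred_demo)               # 8 of 10 correct
demo_balanced_accuracy = macro_recall(y_true_demo, y_pred_demo)  # (1 + 0.5 + 0.5) / 3
print(demo_accuracy, demo_balanced_accuracy)
```

Accuracy is dominated by the majority class, while balanced accuracy drops as soon as a rare class is misclassified. That is the reason to optimize balanced accuracy on an imbalanced target.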

## 2. Model Training

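We tune the hyperparameters with GridSearchCV, which runs an exhaustive search over the specified parameter values of an estimator and uses cross-validation to select the combination that maximizes the chosen metric, here balanced accuracy. The grid below contains 4 × 6 × 101 = 2,424 candidates, each fitted on 3 folds, which is why this step dominates the runtime.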

In [21]:
# We define our parameters dictionary
params = {
    "penalty": ["l1", "l2", "elasticnet", None],
    "solver": ["lbfgs", "liblinear", "newton-cg", "newton-cholesky", "sag", "saga"],
    "C": np.logspace(-6, 6, 101),
    "random_state": [RANDOM_STATE],
    "max_iter": [1_000]
}
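
Not every penalty works with every solver, so part of this grid cannot be fitted. With `GridSearchCV`'s default `error_score`, such candidates are scored as NaN rather than stopping the search. The sketch below counts the affected candidates and builds an alternative grid as a list of dictionaries, which `GridSearchCV` also accepts. The compatibility table is an assumption based on the `LogisticRegression` documentation for releases that include `newton-cholesky`, and it may differ in other versions. The helper `build_param_grid` and the `l1_ratio` value of 0.5 are hypothetical.

```python
from itertools import product

PENALTIES = ["l1", "l2", "elasticnet", None]
SOLVERS = ["lbfgs", "liblinear", "newton-cg", "newton-cholesky", "sag", "saga"]

# Assumption: penalties supported per solver, as listed in the scikit-learn docs.
SUPPORTED_PENALTIES = {
    "lbfgs": ["l2", None],
    "liblinear": ["l1", "l2"],
    "newton-cg": ["l2", None],
    "newton-cholesky": ["l2", None],
    "sag": ["l2", None],
    "saga": ["l1", "l2", "elasticnet", None],
}


def build_param_grid(c_values, random_state=11, max_iter=1_000):
    """Return a list of grids containing only supported solver/penalty pairs."""
    shared = {"C": list(c_values), "random_state": [random_state], "max_iter": [max_iter]}
    grid = []
    for solver, penalties in SUPPORTED_PENALTIES.items():
        plain = [p for p in penalties if p != "elasticnet"]
        grid.append({"solver": [solver], "penalty": plain, **shared})
        if "elasticnet" in penalties:
            # Elastic net needs a mixing parameter; 0.5 is an arbitrary choice.
            grid.append({"solver": [solver], "penalty": ["elasticnet"],
                         "l1_ratio": [0.5], **shared})
    return grid


all_pairs = list(product(PENALTIES, SOLVERS))
valid_pairs = [(p, s) for p, s in all_pairs if p in SUPPORTED_PENALTIES[s]]

n_c_values = 101  # length of np.logspace(-6, 6, 101)
total_candidates = len(all_pairs) * n_c_values
valid_candidates = len(valid_pairs) * n_c_values
print(total_candidates, valid_candidates)

# Small demonstration grid; the notebook would pass np.logspace(-6, 6, 101).
demo_grid = build_param_grid([0.1, 1.0, 10.0])
```

Under these assumptions only 14 of the 24 solver/penalty pairs are valid, so restricting the grid would skip roughly 40% of the fits.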

In [None]:
# Execute Grid Search
grid_search = GridSearchCV(
    estimator=LogisticRegression(),
    param_grid=params,
    scoring=metrics["balanced_accuracy"],
    cv=KFold(n_splits=3, shuffle=True, random_state=RANDOM_STATE),
)
grid_search.fit(X_train, y_train)

# Save results in dataframe for visualization
grid_df = pd.DataFrame(grid_search.cv_results_)

In [62]:
# Each row of the results corresponds to one grid search candidate.
# Here we keep only the columns holding that candidate's parameters
# and its mean test score.
sum_cols = [f"param_{param}" for param in params]
sum_cols.append('mean_test_score')
grid_df[sum_cols].sort_values(by="mean_test_score", ascending=False).head(20)

| | param_penalty | param_solver | param_C | param_random_state | param_max_iter | mean_test_score |
|---|---|---|---|---|---|---|
| 1133 | l1 | saga | 0.436516 | 11 | 1000 | 0.787974 |
| 1157 | l1 | saga | 0.57544 | 11 | 1000 | 0.787449 |
| 1181 | l1 | saga | 0.758578 | 11 | 1000 | 0.784489 |
| 1277 | l1 | saga | 2.290868 | 11 | 1000 | 0.783851 |
| 1253 | l1 | saga | 1.737801 | 11 | 1000 | 0.782862 |
| 1229 | l1 | saga | 1.318257 | 11 | 1000 | 0.782192 |
| 1205 | l1 | saga | 1.0 | 11 | 1000 | 0.781786 |
| 1282 | l2 | sag | 2.290868 | 11 | 1000 | 0.781535 |
| 1280 | l2 | newton-cg | 2.290868 | 11 | 1000 | 0.781535 |
| 1278 | l2 | lbfgs | 2.290868 | 11 | 1000 | 0.781535 |


Next, we take the best estimator found by the grid search and evaluate it more rigorously over a larger set of metrics.

In [69]:
log_reg_clf = grid_search.best_estimator_
log_reg_clf

## 3. Evaluation

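We use 10-fold cross-validation so that the reported metrics do not hinge on a single random partition of the data. Note that `cross_validate` refits a clone of the estimator on each training fold of the data it receives, here the test set (426 samples), so the scores below describe the tuned hyperparameters rather than the already fitted `log_reg_clf`.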

In [73]:
cv_result = cross_validate(
    log_reg_clf, X_test, y_test, scoring=metrics,
    cv=KFold(n_splits=10, shuffle=True, random_state=RANDOM_STATE),
    return_estimator=True,
)
cv_result_df = pd.DataFrame(cv_result)



In [74]:
cv_result_df = cv_result_df.drop('estimator', axis=1)
cv_result_df.mean()

fit_time                  0.355753
score_time                0.010324
test_accuracy             0.866445
test_balanced_accuracy    0.689822
test_recall               0.689822
test_precision            0.741956
test_f1_score             0.700151
dtype: float64

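The logistic regression classifier reaches a mean accuracy of about 0.87, while its balanced accuracy (identical to macro recall) is only about 0.69. The gap indicates that the minority classes are predicted less reliably than the majority class.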
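
Because `cross_validate` refits the estimator on each fold, a complementary check is to apply the scorers directly to the already fitted model on the held-out data. The sketch below shows the pattern on synthetic data. `X_demo`, `y_demo`, `clf` and `scorers` are hypothetical stand-ins for the notebook's objects, and the scorers pass keyword arguments to `make_scorer` instead of using lambdas.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score, make_scorer
from sklearn.model_selection import train_test_split

# Synthetic 3-class data standing in for the scaled fetal health features.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=5, n_classes=3, random_state=11
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=11
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# make_scorer forwards extra keyword arguments to the metric function.
scorers = {
    "accuracy": make_scorer(accuracy_score),
    "balanced_accuracy": make_scorer(balanced_accuracy_score),
    "f1_score": make_scorer(f1_score, average="macro", zero_division=0),
}

# A scorer is called as scorer(estimator, X, y); no refitting takes place.
holdout_scores = {name: scorer(clf, X_te, y_te) for name, scorer in scorers.items()}
print(holdout_scores)
```

This scores exactly the model that would be saved, at the cost of relying on a single train/test partition.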

In [75]:
y_pred = log_reg_clf.predict(X_test)

In [77]:
labels = [1, 2, 3]
target_names = ['Normal', 'Suspect', 'Pathological']

In [78]:
print(classification_report(y_test, y_pred, labels=labels, target_names=target_names, zero_division=0))

              precision    recall  f1-score   support

      Normal       0.93      0.93      0.93       332
     Suspect       0.61      0.66      0.63        59
Pathological       0.90      0.77      0.83        35

    accuracy                           0.88       426
   macro avg       0.81      0.79      0.80       426
weighted avg       0.89      0.88      0.88       426



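The report breaks the metrics down per class: Normal is predicted well (F1 of 0.93), Pathological reasonably well (F1 of 0.83), and Suspect is the weakest class (F1 of 0.63).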
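
The two summary rows of the report differ only in how the per-class values are averaged. As a worked example, the snippet below recomputes both averages for recall from the rounded per-class values printed above. Since the inputs are rounded to two decimals, this is an approximate check rather than an exact reproduction.

```python
# Per-class recall and support, copied from the classification report above.
report_rows = {
    "Normal": {"recall": 0.93, "support": 332},
    "Suspect": {"recall": 0.66, "support": 59},
    "Pathological": {"recall": 0.77, "support": 35},
}

total_support = sum(row["support"] for row in report_rows.values())

# Macro average: every class counts equally, regardless of its size.
macro_recall = sum(row["recall"] for row in report_rows.values()) / len(report_rows)

# Weighted average: each class counts in proportion to its support.
weighted_recall = (
    sum(row["recall"] * row["support"] for row in report_rows.values()) / total_support
)

print(f"macro recall:    {macro_recall:.2f}")
print(f"weighted recall: {weighted_recall:.2f}")
```

The results, 0.79 and 0.88, match the `macro avg` and `weighted avg` rows. The weighted figure is pulled up by the large Normal class, while the macro figure exposes the weaker minority classes.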

## 4. Saving the Model

We save the model with the _pickle_ library.

In [79]:
MODEL_PATH = pyprojroot.here().joinpath('models', 'log_reg_clf.pkl')

# Save the model
with open(MODEL_PATH,'wb') as f:
    pickle.dump(grid_search.best_estimator_, f)
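
The classifier was trained on standardized features, so code that loads it later also needs the fitted scaler to transform new inputs. One possible approach, sketched below on toy data, is to pickle both objects together and restore them in one step. The bundle layout, the file name `log_reg_bundle.pkl` and the toy arrays are assumptions for illustration, not part of the notebook.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Toy data: three classes whose features are shifted by the class label.
rng = np.random.default_rng(11)
y_demo = np.repeat([1, 2, 3], 20)
X_demo = rng.normal(size=(60, 4)) + y_demo[:, None]

scaler = StandardScaler().fit(X_demo)
model = LogisticRegression(max_iter=1000).fit(scaler.transform(X_demo), y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    bundle_path = Path(tmp_dir) / "log_reg_bundle.pkl"

    # Save scaler and classifier together so they cannot get out of sync.
    with open(bundle_path, "wb") as f:
        pickle.dump({"scaler": scaler, "model": model}, f)

    # Load the bundle back, as a later notebook or script would.
    with open(bundle_path, "rb") as f:
        bundle = pickle.load(f)

# The restored objects should reproduce the original predictions.
original_pred = model.predict(scaler.transform(X_demo))
restored_pred = bundle["model"].predict(bundle["scaler"].transform(X_demo))
print((original_pred == restored_pred).all())
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that wrote them.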