# Logistic Regression

Our first model is a logistic regression classifier. Running the whole notebook takes around 20 minutes, mainly because of the grid search.

### Importing Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_validate, KFold, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, balanced_accuracy_score, make_scorer, classification_report
import pyprojroot
import pickle

We set a random seed for reproducibility.

In [2]:
RANDOM_STATE = 11

## 1. Data Preprocessing

In [3]:
DATA_PATH = pyprojroot.here().joinpath('data', 'fetal_health.csv')
df = pd.read_csv(DATA_PATH)

In [4]:
X = df[df.columns.difference(['fetal_health', 'fetal_health_label'])]
y = df['fetal_health']

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, stratify = y, random_state=RANDOM_STATE)

In [6]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
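# Reuse the statistics fitted on the training set; refitting on the test set would scale the two splits inconsistently
X_test_scaled = scaler.transform(X_test)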
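
As a rough illustration of why the scaler statistics should come from the training split only, the sketch below reproduces the standardization formula z = (x - mean) / std in plain Python (`StandardScaler` uses the population standard deviation). The toy values and the helper names `fit_stats` and `standardize` are hypothetical and not part of the notebook.

```python
from statistics import fmean, pstdev

def fit_stats(values):
    """Return the (mean, population std) pair that standardization needs."""
    return fmean(values), pstdev(values)

def standardize(values, mean, std):
    """Apply z = (x - mean) / std with statistics computed elsewhere."""
    return [(v - mean) / std for v in values]

# Toy single-feature splits (made-up numbers, for illustration only)
train = [120.0, 130.0, 140.0, 150.0]
test = [150.0, 160.0]

train_mean, train_std = fit_stats(train)

# Consistent: test values are expressed in the training set's units
test_with_train_stats = standardize(test, train_mean, train_std)

# Inconsistent: refitting on the test split re-centres it around zero,
# so the same raw value maps to a different scaled value than in training
test_mean, test_std = fit_stats(test)
test_with_test_stats = standardize(test, test_mean, test_std)

print(test_with_train_stats)
print(test_with_test_stats)
```

In this toy case the raw value 150 appears in both splits, yet it only receives the same scaled value in both when the training statistics are reused.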

## Defining our Metrics

In [7]:
metrics = {
    'accuracy': accuracy_score,
    'balanced_accuracy': balanced_accuracy_score, 
    'precision': lambda y_true, y_pred: precision_score(y_true, y_pred, average='macro', zero_division=0),
    'f1_score': lambda y_true, y_pred: f1_score(y_true, y_pred, average='macro', zero_division=0)
}

for key in metrics.keys():
    metrics[key] = make_scorer(metrics[key])
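
Balanced accuracy is the unweighted mean of the per-class recalls, so a model cannot score well on it by favouring the majority class. The following plain-Python sketch computes it by hand on made-up labels; `per_class_recall` and `balanced_accuracy` are illustrative helpers, not scikit-learn functions.

```python
from collections import Counter

def per_class_recall(y_true, y_pred):
    """Recall for each class: correctly predicted samples / samples of that class."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / support[label] for label in sorted(support)}

def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of the per-class recalls."""
    recalls = per_class_recall(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)

# Made-up labels using a 1/2/3 class coding (for illustration only)
y_true = [1, 1, 1, 1, 1, 1, 2, 2, 3, 3]
y_pred = [1, 1, 1, 1, 1, 1, 1, 2, 3, 1]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)
print(per_class_recall(y_true, y_pred))
print(balanced_accuracy(y_true, y_pred))
```

On these toy labels plain accuracy is noticeably higher than balanced accuracy, because the two minority classes are each recalled only half of the time.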

## 2. Model Training

In [8]:
# We define our parameters dictionary
params = {
    "penalty": ["l1", "l2", "elasticnet", None],
    "solver": ["lbfgs", "liblinear", "newton-cg", "newton-cholesky", "sag", "saga"],
    "C": np.logspace(-6, 6, 101),
    "random_state": [RANDOM_STATE],
    "max_iter": [1_000]
}
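
Many of the penalty/solver pairs in this grid are not supported by `LogisticRegression`, so their fits fail and score as NaN. As a sketch, the grid can be restricted to supported pairs by passing `GridSearchCV` a list of dictionaries instead of a single one. The compatibility table below is transcribed from the scikit-learn documentation as an assumption and should be checked against the installed version; `elasticnet` is left out because it additionally needs an `l1_ratio`, which the grid above does not set. `SUPPORTED_PENALTIES`, `C_VALUES` and `param_grid` are illustrative names.

```python
# Assumed solver -> penalty support, per the scikit-learn docs
SUPPORTED_PENALTIES = {
    "lbfgs": ["l2", None],
    "liblinear": ["l1", "l2"],
    "newton-cg": ["l2", None],
    "newton-cholesky": ["l2", None],
    "sag": ["l2", None],
    "saga": ["l1", "l2", None],  # elasticnet omitted: it requires l1_ratio
}

# Same 101 log-spaced values as np.logspace(-6, 6, 101), without NumPy
C_VALUES = [10 ** (-6 + 12 * i / 100) for i in range(101)]

# One sub-grid per solver; GridSearchCV accepts a list of such dicts
param_grid = [
    {
        "solver": [solver],
        "penalty": penalties,
        "C": C_VALUES,
        "random_state": [11],
        "max_iter": [1_000],
    }
    for solver, penalties in SUPPORTED_PENALTIES.items()
]

# Compare the number of candidates in the full and the restricted grid
full_grid_size = 4 * 6 * len(C_VALUES)
valid_grid_size = sum(len(g["penalty"]) * len(g["C"]) for g in param_grid)
print(full_grid_size, valid_grid_size)
```

Under the assumed table, 13 of the 24 penalty/solver pairs remain, which roughly halves the number of candidates and removes the need to silence fit failures.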

In [22]:
import warnings

# Suppress all warnings, including those raised when a penalty/solver combination is incompatible
warnings.filterwarnings("ignore")

In [23]:
# Execute Grid Search
grid_search = GridSearchCV(
    estimator=LogisticRegression(),
    param_grid=params,
    scoring=metrics["balanced_accuracy"],
    cv=KFold(n_splits=3, shuffle=True, random_state=RANDOM_STATE),
    return_train_score=True,
)
grid_search.fit(X_train_scaled, y_train)

# Save results in dataframe for visualization
grid_df = pd.DataFrame(grid_search.cv_results_)

In [24]:
# Each row of the results corresponds to one parameter combination tried by the grid search.
# Here we keep only the columns holding the parameters of that combination
# and the mean test and train scores.
sum_cols = [f"param_{param}" for param in list(params.keys())]
sum_cols.append('mean_test_score')
sum_cols.append('mean_train_score')
grid_df[sum_cols].sort_values(by="mean_test_score", ascending=False).head(20)

| | param_penalty | param_solver | param_C | param_random_state | param_max_iter | mean_test_score | mean_train_score |
|---|---|---|---|---|---|---|---|
| 1133 | l1 | saga | 0.436516 | 11 | 1000 | 0.787974 | 0.81776 |
| 1157 | l1 | saga | 0.57544 | 11 | 1000 | 0.787449 | 0.821449 |
| 1181 | l1 | saga | 0.758578 | 11 | 1000 | 0.784489 | 0.8236 |
| 1277 | l1 | saga | 2.290868 | 11 | 1000 | 0.783851 | 0.833394 |
| 1253 | l1 | saga | 1.737801 | 11 | 1000 | 0.782862 | 0.830192 |
| 1229 | l1 | saga | 1.318257 | 11 | 1000 | 0.782192 | 0.828371 |
| 1205 | l1 | saga | 1.0 | 11 | 1000 | 0.781786 | 0.823449 |
| 1280 | l2 | newton-cg | 2.290868 | 11 | 1000 | 0.781535 | 0.82822 |
| 1278 | l2 | lbfgs | 2.290868 | 11 | 1000 | 0.781535 | 0.828094 |
| 1282 | l2 | sag | 2.290868 | 11 | 1000 | 0.781535 | 0.828346 |


Next, we take the best estimator found by the grid search and evaluate it more rigorously over a larger set of metrics.

In [25]:
log_reg_clf = grid_search.best_estimator_
log_reg_clf

## 3. Evaluation

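We use 10-fold cross-validation so that the reported metrics do not depend on a single random partition of the data. Note that `cross_validate` clones the estimator and refits it on the training folds of whatever data it receives (here the scaled test set), so the averages below describe models with the tuned hyperparameters trained on test-set folds, not the already fitted `log_reg_clf`.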

In [26]:
cv_result = cross_validate(
    log_reg_clf,
    X_test_scaled,
    y_test,
    scoring=metrics,
    cv=KFold(n_splits=10, shuffle=True, random_state=RANDOM_STATE),
    return_estimator=True,
)
cv_result_df = pd.DataFrame(cv_result)

In [27]:
cv_result_df = cv_result_df.drop('estimator', axis=1)
cv_result_df.mean()

fit_time                  0.416907
score_time                0.010742
test_accuracy             0.866390
test_balanced_accuracy    0.685986
test_precision            0.742103
test_f1_score             0.697508
dtype: float64
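
Another way to obtain estimates that do not depend on a single split is to wrap scaling and the classifier in a `Pipeline`, so that the scaler is refit on each training fold. The sketch below uses a synthetic dataset from `make_classification` rather than the fetal health data, and default hyperparameters apart from `max_iter`; it illustrates the pattern, not the notebook's results. `StratifiedKFold` is used instead of `KFold` to keep class proportions similar across folds, which is a choice of this sketch.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class problem standing in for the real data (assumption)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5,
    n_classes=3, random_state=11,
)

# Scaling lives inside the pipeline, so each fold fits its own scaler
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1_000)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=11)
scores = cross_validate(
    pipe, X_demo, y_demo, cv=cv,
    scoring=["accuracy", "balanced_accuracy"],
)
print(scores["test_balanced_accuracy"].mean())
```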

In [28]:
y_pred = log_reg_clf.predict(X_test_scaled)

In [30]:
labels = [1, 2, 3]
target_names = ['Normal', 'Suspect', 'Pathological']

In [31]:
print(classification_report(y_test, y_pred, labels=labels, target_names=target_names, zero_division=0))

              precision    recall  f1-score   support

      Normal       0.93      0.93      0.93       332
     Suspect       0.60      0.63      0.61        59
Pathological       0.82      0.80      0.81        35

    accuracy                           0.88       426
   macro avg       0.78      0.78      0.78       426
weighted avg       0.88      0.88      0.88       426



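The report breaks the metrics down per class: Normal is classified most reliably (F1 0.93), Pathological follows (F1 0.81), and Suspect is the weakest class (F1 0.61).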

## 4. Saving the Model

We're using the _pickle_ library for saving our models.

In [32]:
MODEL_PATH = pyprojroot.here().joinpath('models', 'log_reg_clf.pkl')

# Save the model
with open(MODEL_PATH,'wb') as f:
    pickle.dump(grid_search.best_estimator_, f)
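
Because the classifier was trained on standardized features, anything that loads it later also needs the fitted scaler. One possible approach, sketched below with plain-Python placeholders instead of the real `scaler` and `log_reg_clf` objects, is to pickle both in a single dictionary. The bundle keys, the placeholder values and the temporary file path are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path

# Placeholders standing in for the fitted scaler and classifier (assumption);
# in the notebook these would be `scaler` and `grid_search.best_estimator_`
bundle = {
    "scaler": {"mean": [135.0], "scale": [11.18]},
    "model": {"name": "log_reg_clf"},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    bundle_path = Path(tmp_dir) / "log_reg_bundle.pkl"

    # Save both objects together so they cannot drift apart
    with open(bundle_path, "wb") as f:
        pickle.dump(bundle, f)

    # Load them back; only unpickle files from a trusted source
    with open(bundle_path, "rb") as f:
        restored = pickle.load(f)

print(restored.keys())
```

At prediction time the loaded scaler would be applied to new inputs first, and the loaded model would then predict on the scaled values.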