**<font size="5">Applied Statistics</font>**

<font size="3">MSc in High Performance Computing Engineering, Computer Science and Engineering, Physics Engineering - A.Y. 2024-2025</font>

Prof. Mario Beraha - Dott. Vittorio Torri

---

<font size="4">**Lab 7 - Logistic Regression**</font>

# Libraries

In [None]:
import pandas as pd
pd.options.display.float_format = '{:.2f}'.format

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
import numpy as np

In [None]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split

In [None]:
import statsmodels.api as sm

In [None]:
np.random.seed(1234)

In [None]:
import scipy.stats as stats

# Load Dataset

In [None]:
df = pd.read_csv('../DatasetsLabs/heart_failure_clinical_records_dataset_smhd.csv')  # Adjust the path as necessary

In [None]:
cat_vars = ['anaemia', 'diabetes', 'high_blood_pressure',  'sex',  'smoking',  'DEATH_EVENT']
num_vars = ['age', 'creatinine_phosphokinase', 'ejection_fraction', 'platelets', 'serum_creatinine', 'serum_sodium', 'bmi', 'time']

# Binary Logistic Regression

We want to build a model to classify patients as dead or survived during follow-up: binary classification.

We use a logistic regression model, which models the probability of the positive class (death) in the following way, where $x_1, ..., x_p$ are the input variables

$$
P(y = 1 | x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p)}}
$$
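
The formula above can be sketched numerically as follows. This is a minimal illustration, not part of the lab pipeline: the helper name `logistic_probability` and the coefficient values are assumptions chosen for the example, not estimates from the heart-failure data.

```python
import numpy as np

def logistic_probability(x, beta0, beta):
    """Return P(y = 1 | x) for a logistic model with intercept beta0 and slopes beta."""
    linear_predictor = beta0 + np.dot(beta, x)
    return 1.0 / (1.0 + np.exp(-linear_predictor))

# Hypothetical coefficients for two inputs (illustration only)
beta0_demo = -1.0
beta_demo = np.array([0.5, 2.0])
x_demo = np.array([2.0, 0.0])   # linear predictor = -1 + 0.5*2 + 2*0 = 0

p_demo = logistic_probability(x_demo, beta0_demo, beta_demo)
print(p_demo)  # a linear predictor of 0 corresponds to a probability of 0.5
```

Larger values of the linear predictor push the probability towards 1, smaller values towards 0.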


## Model

In [None]:
X = df[['age', 'bmi', 'anaemia', 'creatinine_phosphokinase', 'diabetes', 'high_blood_pressure', 'platelets', 'serum_creatinine', 'serum_sodium', 'sex', 'smoking', 'ejection_fraction', 'time']]
y = df['DEATH_EVENT']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=1234)

train_index = X_train.index
test_index = X_test.index

We start by using all the numerical variables.

In [None]:
X_train_1 = X_train[num_vars]
X_test_1 = X_test[num_vars]

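# Fit the logistic regression model
# Note: sm.Logit does not add an intercept automatically, so this fit has no constant term
# (wrap the design matrix with sm.add_constant to include beta_0)
logit_model = sm.Logit(y_train, X_train_1).fit()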

print(logit_model.summary())

By default statsmodels uses the Newton-Raphson iterative optimization method to maximize the likelihood, but other methods can be specified
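
As a rough sketch of what a Newton-Raphson fit does, the snippet below maximizes the logistic log-likelihood on synthetic data with plain NumPy. The function name `fit_logit_newton`, the simulated data, and the stopping rule are assumptions made for illustration; this is not the statsmodels implementation.

```python
import numpy as np

def fit_logit_newton(X, y, n_iter=25, tol=1e-10):
    """Maximize the logistic log-likelihood with Newton-Raphson updates."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))             # current fitted probabilities
        gradient = X.T @ (y - p)                        # first derivatives (score)
        hessian = -(X * (p * (1 - p))[:, None]).T @ X   # second derivatives
        step = np.linalg.solve(hessian, gradient)
        beta = beta - step                              # Newton-Raphson update
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Synthetic data (illustration only): intercept column plus one predictor
rng = np.random.default_rng(0)
x_sim = rng.normal(size=200)
X_sim = np.column_stack([np.ones_like(x_sim), x_sim])
true_beta = np.array([-0.5, 1.0])
y_sim = rng.binomial(1, 1.0 / (1.0 + np.exp(-X_sim @ true_beta)))

beta_hat = fit_logit_newton(X_sim, y_sim)
# At the maximum of the likelihood the score vector should be (numerically) zero
score_at_optimum = X_sim.T @ (y_sim - 1.0 / (1.0 + np.exp(-X_sim @ beta_hat)))
print(beta_hat)
```

In statsmodels, a different optimizer can be selected through the `method` argument of `fit`, for example `method='bfgs'`.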

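statsmodels computes the pseudo-$R^2$ statistic using McFadden's definition:

$$\text{pseudo-}R^2 = 1 - \frac{LL}{LL_{Null}}$$

where $LL$ is the log-likelihood of the fitted model and $LL_{Null}$ is the log-likelihood of the intercept-only model. It takes values between 0 and 1 and indicates the goodness of fit of the model, but it is not the primary metric used to evaluate logistic regression models.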
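
McFadden's pseudo-$R^2$ can be reproduced by hand from two log-likelihoods. The sketch below uses toy outcomes and hypothetical fitted probabilities (not the output of the model above) and assumes that the null model predicts the observed event rate for every sample.

```python
import numpy as np

def bernoulli_log_likelihood(y, p):
    """Log-likelihood of binary outcomes y under predicted probabilities p."""
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Toy outcomes and hypothetical fitted probabilities (illustration only)
y_toy = np.array([0, 0, 0, 1, 1, 0, 1, 0])
p_model = np.array([0.1, 0.2, 0.3, 0.8, 0.7, 0.4, 0.6, 0.2])

# The null model has only an intercept, so it predicts the observed event rate
p_null = np.full(y_toy.shape, y_toy.mean())

ll_model = bernoulli_log_likelihood(y_toy, p_model)
ll_null = bernoulli_log_likelihood(y_toy, p_null)
pseudo_r2 = 1 - ll_model / ll_null
print(f"McFadden pseudo R^2: {pseudo_r2:.3f}")
```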

## Odds Ratios

The odds are $P/(1-P)$, i.e. the probability that the event occurs divided by the probability that it does not occur. The odds ratio for a variable is $e^{\text{coefficient}}$, so:
- odds ratio = 1: the variable has no effect (coefficient = 0)
- odds ratio > 1: the variable increases the probability of the event
- odds ratio < 1: the variable decreases the probability of the event

In a logistic model, a coefficient is interpreted differently than in a linear regression model. A k-unit increase in $x_j$ multiplies the odds by a factor of $exp(k \cdot \hat{\beta_j})$. Odds ratios are defined as:

$$
OR_j = exp(\hat{\beta_j})
$$
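
A short numerical check of this interpretation: with hypothetical coefficients (assumed for illustration, not fitted values), the ratio of the odds after and before a k-unit increase equals $exp(k \cdot \beta)$.

```python
import numpy as np

def odds(p):
    """Convert a probability into odds p / (1 - p)."""
    return p / (1 - p)

def predicted_probability(x, beta0, beta1):
    """Logistic model with a single predictor."""
    return 1.0 / (1.0 + np.exp(-(beta0 + beta1 * x)))

beta0_demo, beta1_demo = -3.0, 0.05   # hypothetical coefficients
k = 10                                # size of the increase in x

odds_before = odds(predicted_probability(50, beta0_demo, beta1_demo))
odds_after = odds(predicted_probability(50 + k, beta0_demo, beta1_demo))

print(odds_after / odds_before)   # ratio of the odds after and before the increase
print(np.exp(k * beta1_demo))     # exp(k * beta), which matches the ratio above
```

The multiplicative effect applies to the odds, not to the probability itself, whose change depends on the starting value of $x$.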

In [None]:
coef = logit_model.params
odds_ratios = np.exp(coef)

conf = logit_model.conf_int()
conf.columns = ['2.5%', '97.5%']
conf = np.exp(conf)  # Exponentiate to get ORs' CIs

or_summary = pd.DataFrame({
    "Coefficient": coef,
    "Odds Ratio": odds_ratios,
    "2.5% CI OR": conf['2.5%'],
    "97.5% CI OR": conf['97.5%']
})

print(or_summary)

In [None]:
# Sort by odds ratio for better visualization
or_summary = or_summary.sort_values(by="Odds Ratio", ascending=False)

fig, ax = plt.subplots(figsize=(8, len(or_summary) * 0.6))

# Plot the OR as points
ax.errorbar(or_summary['Odds Ratio'], or_summary.index,
            xerr=[or_summary['Odds Ratio'] - or_summary['2.5% CI OR'], or_summary['97.5% CI OR'] - or_summary['Odds Ratio']],
            fmt='o', color='darkblue', ecolor='lightgray', elinewidth=3, capsize=4)

# Add a vertical line at OR = 1 (no effect)
ax.axvline(1, color='red', linestyle='--')

# Set labels
ax.set_xlabel("Odds Ratio (log scale)")
ax.set_title("Forest Plot of Odds Ratios with 95% CI")
ax.set_xscale("log")  # Log scale for better visualization

plt.show()

Indeed, as the plot shows, an increase in age is always associated with an increase in the probability of death.

## Confusion Matrix

In [None]:
from sklearn.metrics import confusion_matrix

In [None]:
y_test_pred_prob = logit_model.predict(X_test_1)

In [None]:
y_test_pred_prob

In [None]:
# Convert probabilities to binary outcomes based on a threshold
threshold = 0.5
y_test_pred_class = (y_test_pred_prob > threshold).astype(int)

In [None]:
print("Confusion Matrix on Test Set:")
print(confusion_matrix(y_test, y_test_pred_class))

With a nicer visualization:

In [None]:
cm = confusion_matrix(y_test, y_test_pred_class)

class_names = ['Survived', 'Dead']

plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", xticklabels=class_names, yticklabels=class_names)
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.title("Confusion Matrix")
plt.show()

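With rows as true labels and columns as predicted labels, the matrix is laid out as:

$$
\begin{bmatrix} TN & FP \\ FN & TP \end{bmatrix}
$$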
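
The four cells can also be counted by hand. The sketch below uses toy labels (illustrative values only) and arranges the counts with true labels on the rows and predicted labels on the columns.

```python
import numpy as np

y_true_toy = np.array([0, 0, 0, 1, 1, 1, 0, 1])   # toy true labels (illustration only)
y_pred_toy = np.array([0, 1, 0, 1, 0, 1, 0, 1])   # toy predicted labels

tn_toy = np.sum((y_true_toy == 0) & (y_pred_toy == 0))   # true negatives
fp_toy = np.sum((y_true_toy == 0) & (y_pred_toy == 1))   # false positives
fn_toy = np.sum((y_true_toy == 1) & (y_pred_toy == 0))   # false negatives
tp_toy = np.sum((y_true_toy == 1) & (y_pred_toy == 1))   # true positives

# Rows are true labels, columns are predicted labels
cm_toy = np.array([[tn_toy, fp_toy],
                   [fn_toy, tp_toy]])
print(cm_toy)
```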

## AUC

The Receiver Operating Characteristic curve (ROC curve) plots the TPR against the FPR as the classification threshold varies from 0 to 1, where

$$TPR = \frac{TP}{TP+FN} (= Sensitivity = Recall)$$

$$FPR = \frac{FP}{FP+TN}$$

The ideal point is (0,1), which maximizes the TPR and minimizes the FPR

The Area Under the Curve (AUC) is a measure of the goodness of the model that is not influenced by the choice of a classification threshold.
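
To make the construction concrete, the sketch below builds the ROC points by sweeping the threshold over a few toy scores and integrates the curve with the trapezoidal rule. The helper `roc_points` and the toy data are assumptions for illustration, and a sample is predicted positive when its score is greater than or equal to the threshold.

```python
import numpy as np

def roc_points(y_true, scores):
    """Return FPR and TPR arrays obtained by sweeping the threshold over the scores."""
    # Start above every score so that the first point is (0, 0)
    thresholds = np.concatenate([[np.inf], np.sort(np.unique(scores))[::-1]])
    n_pos = np.sum(y_true == 1)
    n_neg = np.sum(y_true == 0)
    fpr_list, tpr_list = [], []
    for t in thresholds:
        y_pred = scores >= t
        tpr_list.append(np.sum(y_pred & (y_true == 1)) / n_pos)
        fpr_list.append(np.sum(y_pred & (y_true == 0)) / n_neg)
    return np.array(fpr_list), np.array(tpr_list)

y_true_toy = np.array([0, 0, 1, 1])            # toy labels (illustration only)
scores_toy = np.array([0.1, 0.4, 0.35, 0.8])   # toy predicted probabilities

fpr_toy, tpr_toy = roc_points(y_true_toy, scores_toy)

# Area under the piecewise-linear curve (trapezoidal rule)
auc_toy = np.sum(np.diff(fpr_toy) * (tpr_toy[1:] + tpr_toy[:-1]) / 2)
print(auc_toy)
```

A model that ranks every positive above every negative reaches an AUC of 1, while random scores give about 0.5.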

In [None]:
from sklearn.metrics import roc_curve

fpr, tpr, _ = roc_curve(y_test, y_test_pred_prob)

# Plot the ROC Curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='blue', label=f"ROC Curve")

# Plot the diagonal line representing random chance
plt.plot([0, 1], [0, 1], color='gray', linestyle='--', label="Chance")

# Labels and legend
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve for Binary Classification")
plt.legend(loc="lower right")
plt.show()


In [None]:
from sklearn.metrics import roc_auc_score

auc_score = roc_auc_score(y_test, y_test_pred_prob)
print(f"ROC-AUC Score on Test Set: {auc_score}")

## Accuracy, Precision, Recall/Sensitivity, F1-Score, Specificity

$$Accuracy = \frac{TP+TN}{TP+FP+TN+FN}$$
$$Precision = \frac{TP}{TP+FP}$$
$$Recall = Sensitivity = \frac{TP}{TP+FN}$$
$$\text{F1-Score} = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}$$
$$Specificity = \frac{TN}{TN+FP}$$
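
The formulas above translate directly into code. The following sketch computes them from confusion-matrix counts; the function name `binary_metrics` and the counts are hypothetical and only serve as an example.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute the metrics defined above from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    specificity = tn / (tn + fp)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "specificity": specificity,
    }

# Hypothetical counts (illustration only)
metrics_demo = binary_metrics(tp=30, fp=10, tn=50, fn=10)
for name, value in metrics_demo.items():
    print(f"{name}: {value:.3f}")
```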

In [None]:
from sklearn.metrics import classification_report

print("\nClassification Report on Test Set:")
print(classification_report(y_test, y_test_pred_class, target_names=['Survived', 'Dead']))

In [None]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_test_pred_class)
cm

In [None]:
## TODO: try to compute the specificity from the confusion matrix
tn, fp, fn, tp = cm.ravel()
specificity = tn / (tn + fp)  # fraction of actual negatives that are correctly classified

print(f"Specificity on Test Set: {specificity}")  # It is the recall of the class that we defined as negative (Survived)

## Choice of the threshold

Using 0.5 as the threshold is the most common choice, but it might not be the optimal one, especially in the case of class imbalance.

One suggested value is the proportion of negative samples.
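
The effect of moving the threshold can be sketched on toy data before applying it to the model. The labels, probabilities, and the helper `tpr_fpr_at` below are illustrative assumptions; the second threshold is the proportion of negative samples in the toy labels.

```python
import numpy as np

def tpr_fpr_at(y_true, probs, threshold):
    """Return (TPR, FPR) when samples with probability above the threshold are predicted positive."""
    y_pred = probs > threshold
    tpr = np.sum(y_pred & (y_true == 1)) / np.sum(y_true == 1)
    fpr = np.sum(y_pred & (y_true == 0)) / np.sum(y_true == 0)
    return tpr, fpr

# Toy imbalanced data: 3 positives out of 10 samples (illustration only)
y_true_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 0])
probs_toy = np.array([0.05, 0.1, 0.2, 0.3, 0.45, 0.55, 0.4, 0.6, 0.9, 0.75])

threshold_neg_share = 1 - y_true_toy.mean()   # proportion of negative samples

tpr_half, fpr_half = tpr_fpr_at(y_true_toy, probs_toy, 0.5)
tpr_neg, fpr_neg = tpr_fpr_at(y_true_toy, probs_toy, threshold_neg_share)

print(f"threshold 0.50: TPR = {tpr_half:.2f}, FPR = {fpr_half:.2f}")
print(f"threshold {threshold_neg_share:.2f}: TPR = {tpr_neg:.2f}, FPR = {fpr_neg:.2f}")
```

Raising the threshold can only lower (or leave unchanged) both the TPR and the FPR, so each threshold corresponds to a different point on the same ROC curve.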

In [None]:
optimal_threshold_1 = 1 - df['DEATH_EVENT'].mean() # 1 - % of positive samples

In [None]:
optimal_threshold_1

In [None]:
y_test_pred_class = (y_test_pred_prob > optimal_threshold_1).astype(int)

print("Confusion Matrix on Test Set:")
print(confusion_matrix(y_test, y_test_pred_class))

In [None]:
tn, fp, fn, tp = confusion_matrix(y_test, y_test_pred_class).ravel()

tpr_1 = tp / (tp + fn)  # Sensitivity or Recall or TPR
fpr_1 = fp / (fp + tn)  # False Positive Rate (FPR)

In [None]:
print("\nClassification Report on Test Set:")
print(classification_report(y_test, y_test_pred_class, target_names=['Survived', 'Dead']))

In [None]:
# Plot ROC Curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='blue', label=f"ROC Curve")

# Plot the optimal threshold point
plt.scatter(fpr_1, tpr_1, color='orange', s=100, label=f"Threshold with Proportion of Negative Samples (= {optimal_threshold_1:.2f})")


# Plot the diagonal line representing random chance
plt.plot([0, 1], [0, 1], color='gray', linestyle='--', label="Chance")

# Labels and legend
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve with Threshold = Proportion of negative samples")
plt.legend(loc="lower right")
plt.show()

In [None]:
fpr, tpr, thresholds= roc_curve(y_test, y_test_pred_prob)
thresholds

In [None]:
# Compute the point on the ROC curve for a given threshold value
threshold = 0.5
probabilities = logit_model.predict(X_test_1)
y_pred = [1 if p > threshold else 0 for p in probabilities]
cm = confusion_matrix(y_test, y_pred)
tn, fp, fn, tp = cm.ravel()

# fpr = fp / (fp + tn)
# tpr = tp / (tp + fn)
# or, for all thresholds at once:

fpr, tpr, thresholds = roc_curve(y_test, probabilities)
thresholds

# Multiclass logistic regression

We want to develop a model to classify heart failure (HF) patients into three ejection fraction classes.

In [None]:
df['ef_cat'] = pd.cut(df['ejection_fraction'], bins=[0,40,50,100], labels=['reduced', 'mildly reduced', 'preserved'])

In [None]:
df[['ejection_fraction', 'ef_cat']]

## Model

A multinomial logistic regression model assumes a baseline class (assume the *K*-th without loss of generality) and computes a set of coefficients for each other class *k*:

$$
P(y = k | x) = \frac{e^{\beta_{k0} + \beta_{k1} x_1 + \beta_{k2} x_2 + \dots + \beta_{kp} x_p}}{1 + \sum_{j=1}^{K-1} e^{\beta_{j0} + \beta_{j1} x_1 + \beta_{j2} x_2 + \dots + \beta_{jp} x_p}} \quad\text{for }k < K \\
P(y = K | x) = \frac{1}{1 + \sum_{j=1}^{K-1} e^{\beta_{j0} + \beta_{j1} x_1 + \beta_{j2} x_2 + \dots + \beta_{jp} x_p}}
$$
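
The two expressions above can be evaluated together as in the sketch below, which returns one probability per class with the last class as baseline. The function name `multinomial_probabilities` and all coefficient values are hypothetical, chosen only to illustrate the formula.

```python
import numpy as np

def multinomial_probabilities(x, intercepts, coefs):
    """Class probabilities for K classes, with the K-th class as baseline.

    intercepts has shape (K-1,), coefs has shape (K-1, p), x has shape (p,).
    """
    exp_scores = np.exp(intercepts + coefs @ x)   # one term per non-baseline class
    denominator = 1.0 + exp_scores.sum()
    # Non-baseline classes first, baseline class last
    return np.append(exp_scores / denominator, 1.0 / denominator)

# Hypothetical coefficients for K = 3 classes and p = 2 inputs
intercepts_demo = np.array([0.2, -0.4])
coefs_demo = np.array([[0.5, -1.0],
                       [0.1, 0.3]])
x_demo = np.array([1.0, 2.0])

probs_demo = multinomial_probabilities(x_demo, intercepts_demo, coefs_demo)
print(probs_demo, probs_demo.sum())   # the probabilities sum to 1
```

When all coefficients are zero, every class receives the same probability $1/K$.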

The multinomial logistic regression model implemented in the *statsmodels* library requires categorical input variables to be *one-hot encoded*

In [None]:
df_encoded = pd.get_dummies(df, columns=cat_vars, drop_first=True, dtype=int) # another way to apply one-hot encoding

In [None]:

X = df_encoded[['age', 'bmi', 'serum_sodium', 'serum_creatinine', 'diabetes_1', 'sex_Male', 'smoking_1', 'high_blood_pressure_1', 'anaemia_1']]
y = df_encoded['ef_cat']  # Multiclass target variable

X = sm.add_constant(X)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234, stratify=y)  # stratify the split on the target class

model = sm.MNLogit(y_train, X_train)
result = model.fit()

result.predict(X_test)

In [None]:
np.exp(result.params).rename(columns={0: 'mildly_reduced OR', 1:'preserved OR'})

In [None]:
model._ynames_map

In [None]:
coef = result.params[0]
odds_ratios = np.exp(coef)

conf = result.conf_int().loc['mildly reduced']
conf.columns = ['2.5%', '97.5%']
conf = np.exp(conf)  # Exponentiate to get ORs' CIs

or_summary = pd.DataFrame({
    "Coefficient Mildly Reduced": coef,
    "Odds Ratio Mildly Reduced": odds_ratios,
    "2.5% CI OR Mildly Reduced": conf['2.5%'],
    "97.5% CI OR Mildly Reduced": conf['97.5%']
})

fig, ax = plt.subplots(figsize=(8, len(or_summary) * 0.6))

ax.errorbar(or_summary['Odds Ratio Mildly Reduced'], or_summary.index,
            xerr=[or_summary['Odds Ratio Mildly Reduced'] - or_summary['2.5% CI OR Mildly Reduced'], or_summary['97.5% CI OR Mildly Reduced'] - or_summary['Odds Ratio Mildly Reduced']],
            fmt='o', color='darkblue', ecolor='lightgray', elinewidth=3, capsize=4)

ax.axvline(1, color='red', linestyle='--')

ax.set_xlabel("Odds Ratio (log scale) Mildly Reduced")
ax.set_title("Forest Plot of Odds Ratios for Mildly Reduced class with 95% CI")
ax.set_xscale("log")  # Log scale for better visualization

plt.show()

## Test set evaluation

In [None]:
y_test_pred_prob = result.predict(X_test)  # Returns probabilities for each class
y_test_pred_prob

In [None]:
y_test_pred_prob.rename(model._ynames_map, axis='columns', inplace=True)

y_test_pred_class = y_test_pred_prob.idxmax(axis=1) #for each sample, assign the class with maximum probability

In [None]:
cm = confusion_matrix(y_test, y_test_pred_class, labels=list(model._ynames_map.values()))

class_names = list(model._ynames_map.values())

plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", xticklabels=class_names, yticklabels=class_names)
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.title("Confusion Matrix")
plt.show()

In [None]:
print("\nClassification Report:")
print(classification_report(y_test, y_test_pred_class))

In a multiclass context, the ROC-AUC can be computed per class with a one-vs-all approach

In [None]:
# Compute ROC AUC for each class (one-vs-all)
roc_auc_scores = {}
for i in (y_test_pred_prob.columns):
    roc_auc = roc_auc_score(y_test == i, y_test_pred_prob.loc[:, i])  # Treat class i as positive
    roc_auc_scores[i] = roc_auc

# Output the ROC AUC scores
for class_name, auc in roc_auc_scores.items():
    print(f'ROC AUC for {class_name}: {auc:.2f}')

An aggregate ROC-AUC can be computed in different ways

In [None]:
y_test = y_test.map({'reduced': 0, 'mildly reduced': 1, 'preserved': 2})
y_test_pred_prob.rename(columns={'reduced': 0, 'mildly reduced': 1, 'preserved': 2}, inplace=True)

roc_auc_micro = roc_auc_score(y_test, y_test_pred_prob, average='micro', multi_class='ovr')
print(f'Micro-average ROC AUC: {roc_auc_micro:.2f}')

roc_auc_macro = roc_auc_score(y_test, y_test_pred_prob, average='macro', multi_class='ovr')
print(f'Macro-average ROC AUC: {roc_auc_macro:.2f}')

roc_auc_weighted = roc_auc_score(y_test, y_test_pred_prob, average='weighted', multi_class='ovr')
print(f'Weighted-average ROC AUC: {roc_auc_weighted:.2f}')

$$
\begin{array}{|c|c|c|c|}
\hline
\textbf{Aspect} & \textbf{Macro-Averaging} & \textbf{Micro-Averaging} & \textbf{Weighted Averaging} \\
\hline
\textbf{Calculation} & \text{Average of individual class metrics} & \text{Aggregated metrics across all classes} & \text{Average of class metrics weighted by class size} \\
\hline
\textbf{Class Weighting} & \text{Treats all classes equally} & \text{Larger classes have more influence} & \text{Larger classes influence the average more} \\
\hline
\textbf{Performance Insight} & \text{Insight into performance of each class} & \text{Overall performance across the entire dataset} & \text{Balanced evaluation considering class sizes} \\
\hline
\textbf{Sensitivity to Imbalance} & \text{Sensitive to class imbalance} & \text{More robust to class imbalance} & \text{Moderately sensitive to class imbalance} \\
\hline
\textbf{Use Cases} & \text{Class-specific performance} & \text{Overall model performance} & \text{Balanced performance in imbalanced datasets} \\
\hline
\end{array}
$$


$$
\text{Precision}_{\text{micro}} = \frac{\sum_{i=1}^{C} \text{TP}_i}{\sum_{i=1}^{C} (\text{TP}_i + \text{FP}_i)}
$$

$$
\text{Recall}_{\text{micro}} = \frac{\sum_{i=1}^{C} \text{TP}_i}{\sum_{i=1}^{C} (\text{TP}_i + \text{FN}_i)}
$$

$$
\text{F1}_{\text{micro}} = 2 \times \frac{\text{Precision}_{\text{micro}} \times \text{Recall}_{\text{micro}}}{\text{Precision}_{\text{micro}} + \text{Recall}_{\text{micro}}}
$$
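
As a small illustration of the micro-averaged formulas, the sketch below derives per-class TP, FP and FN from a toy 3-class confusion matrix (made-up counts, rows as true classes and columns as predicted classes) and compares the micro average with the macro average of precision.

```python
import numpy as np

# Toy 3-class confusion matrix: rows = true class, columns = predicted class
cm_toy = np.array([[50,  5,  5],
                   [10, 20, 10],
                   [ 2,  3,  5]])

tp_per_class = np.diag(cm_toy)
fp_per_class = cm_toy.sum(axis=0) - tp_per_class   # predicted as class i but belonging to another class
fn_per_class = cm_toy.sum(axis=1) - tp_per_class   # belonging to class i but predicted as another class

precision_micro = tp_per_class.sum() / (tp_per_class.sum() + fp_per_class.sum())
recall_micro = tp_per_class.sum() / (tp_per_class.sum() + fn_per_class.sum())
f1_micro = 2 * precision_micro * recall_micro / (precision_micro + recall_micro)

# Macro average: unweighted mean of the per-class precisions
precision_macro = np.mean(tp_per_class / (tp_per_class + fp_per_class))

print(f"Micro precision: {precision_micro:.3f}, macro precision: {precision_macro:.3f}")
print(f"Micro recall: {recall_micro:.3f}, micro F1: {f1_micro:.3f}")
```

In a single-label multiclass problem every false positive for one class is a false negative for another, so micro precision, micro recall and micro F1 all coincide with the accuracy.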


Use macro when:

* You care about per-class behavior, and you want to treat all classes equally, regardless of their frequency.

* You want to evaluate model fairness across all classes, including rare or underrepresented ones.

* You're dealing with imbalanced data, and you want to make sure small classes still matter in the evaluation.

* You want to highlight weaknesses in your model’s performance on rare classes.

Use micro when:

* You don't care about classes individually.

* You only care about total prediction quality.

* You are working with highly multilabel settings (many labels per sample).

* You want one simple number for "How many labels did I guess right?".

Use weighted when:

* You do care about per-class behavior.

* You want a summary that adjusts for imbalance but still thinks per-class.

* You have imbalanced classes, but you still want per-class precision/recall insight.

## Change baseline class

Manually encoding the target variable allows us to change the order of the numerical labels and consequently change the class used as the baseline.

In [None]:
df_encoded = pd.get_dummies(df, columns=cat_vars, drop_first=True, dtype=int)
df_encoded['ef_cat'] = pd.Categorical(df['ef_cat'], categories=['mildly reduced', 'preserved', 'reduced'], ordered=True)

In [None]:
X = df_encoded[['age', 'bmi', 'serum_sodium', 'serum_creatinine', 'diabetes_1', 'sex_Male', 'smoking_1', 'high_blood_pressure_1', 'anaemia_1']]
y = df_encoded['ef_cat']  # Multiclass target variable

X = sm.add_constant(X)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234, stratify=y)

In [None]:
model = sm.MNLogit(y_train, X_train)
result = model.fit()

print(result.summary())

In [None]:
np.exp(result.params).rename(columns={0: 'preserved OR', 1:'reduced OR'})