# 🏥 Healthcare ML: Binary Logistic Regression Analysis

## 📊 Key Results Summary

### 🎯 **Model Performance Excellence**
- **Accuracy**: 99.99% (6,742/6,743 correct predictions)
- **Precision**: 1.00 (Perfect positive prediction accuracy)
- **Recall**: 1.00 (Rounded; 1 admission was missed, see Clinical Impact)
- **F1-Score**: 1.00 (Rounded harmonic mean of precision and recall)
- **ROC AUC**: 0.9999 (Exceptional discrimination ability)
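
The headline metrics above all derive from the four confusion-matrix counts. The following is a minimal sketch of that arithmetic. The summary reports 6,743 test predictions with 1 false negative and 0 false positives, but it does not give the split between true positives and true negatives, so the `tp` and `tn` values below are hypothetical and chosen only so that the totals match.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# fn=1 and fp=0 come from the summary; tp and tn are HYPOTHETICAL
# placeholders that simply add up to the 6,743 test rows.
metrics = classification_metrics(tp=3332, fp=0, fn=1, tn=3410)
for name, value in metrics.items():
    print(f"{name}: {value:.4f} (to 2 decimals: {value:.2f})")
```

With a single false negative, recall and F1 fall just short of 1 and only round to 1.00, while precision is exactly 1 because there are no false positives.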

### 🔍 **Top Predictors Analysis**
1. **Payer_Medicare** (Coefficient: +8.66) - Medicare patients most likely to be admitted
2. **Diagnosis_None** (Coefficient: -3.38) - Missing diagnosis reduces admission likelihood
3. **Location_Fringe ≥1M** (Coefficient: +2.86) - Large fringe metro areas show higher admission rates
4. **Sex_Male** (Coefficient: +1.25) - Male patients more likely to be admitted (odds ratio ≈ 3.5 versus the reference category)
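
Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are easier to read. The sketch below applies this to the four coefficients listed above. It assumes each feature is a one-hot (0/1) indicator, which the feature names suggest but the notebook does not state, so each ratio should be read as relative to that feature's reference category with the other features held fixed.

```python
import math

# Coefficients (log-odds) as reported in the list above
coefficients = {
    "Payer_Medicare": 8.66,
    "Diagnosis_None": -3.38,
    "Location_Fringe ≥1M": 2.86,
    "Sex_Male": 1.25,
}

def odds_ratio(coef):
    """Convert a log-odds coefficient into an odds ratio."""
    return math.exp(coef)

odds_ratios = {name: odds_ratio(c) for name, c in coefficients.items()}
for name, ratio in odds_ratios.items():
    direction = "raises" if ratio > 1 else "lowers"
    print(f"{name}: odds ratio {ratio:.3g} ({direction} the odds of admission)")
```

An odds ratio in the thousands, as for `Payer_Medicare`, means the feature almost separates the two classes on its own, which is worth keeping in mind when reading the near-perfect scores.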

### 📈 **Cross-Validation Results**
- **5-Fold CV Accuracy**: 99.99% ± 0.00%
- **5-Fold CV ROC AUC**: 0.9999 ± 0.0000
- **Model Stability**: Excellent consistency across all folds
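
The "mean ± spread" figures above are the mean and standard deviation of the per-fold scores. Here is a small sketch of that summary. The fold scores are hypothetical placeholders, not the notebook's actual per-fold output, and `summarize_folds` is an illustrative helper name.

```python
import statistics

def summarize_folds(scores):
    """Return the mean and population standard deviation of fold scores."""
    mean = statistics.fmean(scores)
    # pstdev divides by n, matching the default of NumPy's ndarray.std()
    std = statistics.pstdev(scores)
    return mean, std

# HYPOTHETICAL fold accuracies used only to illustrate the calculation
fold_accuracy = [0.9999, 1.0000, 0.9998, 0.9999, 0.9999]
mean, std = summarize_folds(fold_accuracy)
print(f"5-fold accuracy: {mean:.2%} ± {std:.2%}")
```

A reported spread of "± 0.00%" therefore means the standard deviation is below the rounding threshold, not necessarily that every fold scored identically.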

### 🏥 **Clinical Impact**
- **False Negatives**: Only 1 missed admission (0.03% error rate)
- **False Positives**: 0 incorrect admission predictions
- **Clinical Safety**: Model minimizes risk of missing critical admissions
- **Resource Efficiency**: No unnecessary admission predictions

---

## 🎯 **Business Impact**
- **Clinical Decision Support**: Reliable tool for ED triage decisions
- **Risk Stratification**: Identifies high-risk patients requiring admission
- **Quality Assurance**: Monitors admission decision consistency
- **Healthcare Equity**: Reveals socioeconomic factors in admission decisions


## Logistic Regression & Model Evaluation

This section focuses on training a **logistic regression model** and evaluating its performance in predicting hospital admissions for patients experiencing hypertensive crises.

### 1. Logistic Regression: Model Training

Logistic regression is used here to model the probability of a binary outcome:

- `1` = Admitted  
- `0` = Not Admitted  

It's a linear model that is **interpretable**, **fast to train**, and commonly used in clinical risk prediction.

Key points:
- Data is split into **training (80%)** and **testing (20%)** sets.
- **Stratified sampling** is used so that the class distribution (admitted vs not admitted) is preserved in both the training and testing sets.
- The model outputs both class labels and **probabilities**, which are useful for downstream metrics like ROC-AUC or binary cross-entropy.
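
To make the last point concrete, here is a hand-rolled sketch of how predicted probabilities feed binary cross-entropy and ROC AUC. It uses only the standard library for illustration; in practice scikit-learn's `log_loss` and `roc_auc_score` compute the same quantities. The labels and probabilities are hypothetical, and the function names are illustrative.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of the true labels under y_prob."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def roc_auc(y_true, y_prob):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5.

    This pairwise form is O(n^2) and meant only to show the definition.
    """
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if pp > pn else 0.5 if pp == pn else 0.0
               for pp in pos for pn in neg)
    return wins / (len(pos) * len(neg))

# HYPOTHETICAL labels (1 = Admitted) and predicted class-1 probabilities
y_true = [1, 0, 1, 1, 0, 0]
y_prob = [0.92, 0.10, 0.65, 0.40, 0.55, 0.05]

print(f"Binary cross-entropy: {binary_cross_entropy(y_true, y_prob):.4f}")
print(f"ROC AUC: {roc_auc(y_true, y_prob):.4f}")
```

Both metrics need the probabilities rather than the hard 0/1 labels: cross-entropy penalizes confident wrong predictions, and ROC AUC depends only on how the probabilities rank admitted against non-admitted patients.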


In [None]:
# Assuming df_final already exists from the previous preprocessing pipeline
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Step 1: Prepare features and label
X = df_final.drop(columns=["Label"])
y = df_final["Label"]

# Step 2: Split into training and testing sets (80% train, 20% test), stratified
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)

# Step 3: Initialize logistic regression model
logreg = LogisticRegression(max_iter=1000)

# Step 4: Train the model
logreg.fit(X_train, y_train)

# Step 5: Predict on test data
y_pred = logreg.predict(X_test)
y_proba = logreg.predict_proba(X_test)[:, 1]  # probabilities for class 1

# Step 6: Print completion message
print("✅ Logistic Regression training complete.")


In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_auc_score

# Step 7: Evaluate model performance on the held-out test set
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
report = classification_report(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_proba)  # uses class-1 probabilities, not hard labels

print(f"📊 Accuracy: {accuracy:.4f}")
print(f"📊 ROC AUC: {roc_auc:.4f}")
print("\n📈 Confusion Matrix:")
print(conf_matrix)
print("\n📋 Classification Report:")
print(report)


In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Confusion Matrix Visualization
plt.figure(figsize=(6, 5))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['Not Admitted', 'Admitted'], 
            yticklabels=['Not Admitted', 'Admitted'])
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix Heatmap - Logistic Regression')
plt.tight_layout()
plt.show()


In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, StratifiedKFold
import pandas as pd
import numpy as np

# Initialize logistic regression model
logreg = LogisticRegression(max_iter=1000)

# Define 5-fold stratified cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Cross-validation scores for accuracy
cv_accuracy_scores = cross_val_score(logreg, X, y, cv=cv, scoring='accuracy')

# Cross-validation scores for ROC AUC
cv_auc_scores = cross_val_score(logreg, X, y, cv=cv, scoring='roc_auc')

# Display fold-wise and mean results
cv_results = pd.DataFrame({
    "Fold": [f"Fold {i+1}" for i in range(5)],
    "Accuracy": cv_accuracy_scores,
    "ROC AUC": cv_auc_scores
})

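# Add rows for the mean and standard deviation across folds
cv_results.loc["Mean"] = ["Mean", cv_accuracy_scores.mean(), cv_auc_scores.mean()]
cv_results.loc["Std"] = ["Std", cv_accuracy_scores.std(), cv_auc_scores.std()]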

# Print results
print("✅ Logistic Regression Cross-Validation Results:\n")
print(cv_results)


In [None]:
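import numpy as np
import pandas as pd

# Re-create the train/test split (same random_state, so the same rows as before)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)

# Refit the model: the cross-validation cell re-initialized `logreg`,
# and cross_val_score only fits clones, so it is unfitted at this point
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)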

# Get feature names
feature_names = X.columns

# Get coefficients
coefs = logreg.coef_[0]

# Create DataFrame
coef_df = pd.DataFrame({
    "Feature": feature_names,
    "Coefficient": coefs,
    "Absolute Value": np.abs(coefs)
}).sort_values(by="Absolute Value", ascending=False)

# Show top predictors
print("🔍 Top 15 Predictors - Logistic Regression:")
coef_df.head(15)


In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Manually input feature names and coefficients for visualization
data = {
    "Feature": [
        "cat__Payer_Medicare", "cat__Diagnosis_None", "cat__Location_Fringe ≥1M",
        "cat__Region_Non-profit", "cat__Payer_Private", "cat__Sex_Male",
        "cat__Teaching_Teaching", "cat__Payer_Self-pay", "cat__Location_Micropolitan",
        "cat__Region_Rural", "cat__Location_None", "cat__Location_Central ≥1M",
        "cat__Teaching_Nonteaching", "cat__Region_Public", "cat__Location_50k–249k"
    ],
    "Coefficient": [
        8.655563, -3.382368, 2.859535,
        1.449258, -1.339574, 1.252930,
        -1.195936, -0.993272, -0.859083,
        -0.807971, -0.788938, 0.666226,
        0.640914, 0.544461, -0.544133
    ]
}

# Create DataFrame and sort by absolute coefficient
df_logit = pd.DataFrame(data)
df_logit["Absolute Value"] = df_logit["Coefficient"].abs()
df_logit_sorted = df_logit.sort_values(by="Absolute Value", ascending=True)

# Plot
plt.figure(figsize=(9, 7))
plt.barh(df_logit_sorted["Feature"], df_logit_sorted["Coefficient"], color='royalblue')
plt.xlabel("Coefficient Value")
plt.title("Top 15 Predictors - Logistic Regression")
plt.axvline(0, color='gray', linewidth=0.8)
plt.tight_layout()
plt.show()
