# 🏥 Case Study Application – Predicting Hospital Readmission Risk (40 points)

---

## **Problem Scope (5 points)**

### **Problem Definition:**
Develop an AI system to predict the likelihood of a patient being readmitted to the hospital within **30 days post-discharge**.

### **Objectives:**
- 🚑 Reduce avoidable readmissions by flagging high-risk patients for targeted interventions.  
- ⚙️ Optimize resource allocation (e.g., follow-up calls, post-discharge care).  
- ❤️ Improve patient outcomes and lower healthcare costs.

### **Stakeholders:**
- **Patients:** Receive proactive care to prevent readmission.  
- **Clinicians:** Use predictions to personalize discharge plans.  
- **Hospital Administrators:** Reduce financial penalties under value-based care models.  
- **Payers (Insurers):** Lower costs from preventable readmissions.  
- **Data/IT Teams:** Maintain data pipelines and model deployment infrastructure.

---

## **Data Strategy (10 points)**

### **Proposed Data Sources:**
- **Electronic Health Records (EHRs):** Diagnoses, medications, lab results, procedures, vital signs.  
- **Demographics:** Age, gender, socioeconomic status, ZIP code (social determinants).  
- **Admission/Discharge Records:** Length of stay, discharge disposition, prior admissions.  
- **Medication Adherence:** Pharmacy records or patient self-reported data.  
- **Comorbidity Indices:** e.g., Charlson Comorbidity Index.

### **Ethical Concerns:**

#### 🔒 Patient Privacy:
- **Risk:** Unauthorized access to sensitive data (e.g., diagnosis codes) violating HIPAA.  
- **Mitigation:** Data de-identification, encryption, and strict access controls.
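
As a rough illustration of the de-identification step, the sketch below drops direct identifiers and replaces the patient ID with a keyed hash. The field names, the `DIRECT_IDENTIFIERS` set, and the `pseudonymize` helper are hypothetical. Keyed hashing on its own is pseudonymization, not full HIPAA de-identification, so it would sit alongside the encryption and access controls listed above.

```python
import hashlib
import hmac

# Hypothetical set of direct identifiers to strip before modeling.
DIRECT_IDENTIFIERS = {"name", "ssn", "phone", "address"}


def pseudonymize(record: dict, secret_key: bytes) -> dict:
    """Return a copy of `record` without direct identifiers and with a keyed-hash ID."""
    cleaned = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    # HMAC-SHA256 gives a stable token for joins; without the key it is hard
    # to link the token back to the original patient ID.
    digest = hmac.new(secret_key, str(record["patient_id"]).encode("utf-8"), hashlib.sha256)
    cleaned["patient_id"] = digest.hexdigest()
    return cleaned


# Illustrative record only; the key would come from a managed secret store.
raw = {"patient_id": 12345, "name": "Jane Doe", "ssn": "000-00-0000", "age": 67}
print(pseudonymize(raw, secret_key=b"replace-with-managed-secret"))
```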

#### ⚖️ Bias in Training Data:
- **Risk:** Model underperforms on marginalized groups due to historical disparities.  
- **Mitigation:** Audit data for representativeness, compare error rates (e.g., recall) across demographic subgroups, and apply fairness constraints or reweighting where gaps appear.
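
One simple way to audit for this kind of bias is to compare recall across subgroups. The sketch below is a minimal version of such a check; the `recall_by_group` helper and the toy labels and group names are assumptions for illustration, not real patient data.

```python
from collections import defaultdict


def recall_by_group(y_true, y_pred, groups):
    """Compute recall, TP / (TP + FN), separately for each subgroup label.

    Groups with no actual readmissions are omitted because recall is undefined.
    """
    tp = defaultdict(int)
    fn = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        if truth == 1:
            if pred == 1:
                tp[group] += 1
            else:
                fn[group] += 1
    return {g: tp[g] / (tp[g] + fn[g]) for g in set(tp) | set(fn)}


# Toy example: group "B" has noticeably lower recall than group "A".
y_true = [1, 1, 1, 1, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1]
groups = ["A", "A", "B", "B", "A", "B"]
print(recall_by_group(y_true, y_pred, groups))
```

A large recall gap between groups would be a signal to revisit the training data or the decision threshold for the affected group.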

### **Preprocessing Pipeline:**

#### 🧹 Data Cleaning:
- Impute missing values: **median** for numerical features, **mode** for categorical features, with imputation statistics computed on the training split only to avoid leakage.  
- Drop columns with >50% missing values.  
- Remove duplicates and standardize data formats.
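
The cleaning rules above could be sketched with pandas roughly as follows. The `clean` function, its `max_missing` parameter, and the column names are hypothetical. The sketch deduplicates and drops sparse columns before imputing, which is one reasonable ordering, and it computes medians and modes on whatever frame it receives, so in practice it would be applied to the training split.

```python
import pandas as pd


def clean(df: pd.DataFrame, max_missing: float = 0.5) -> pd.DataFrame:
    """Deduplicate, drop mostly-empty columns, then impute remaining gaps."""
    # Remove exact duplicate rows.
    df = df.drop_duplicates()
    # Drop columns whose share of missing values exceeds the threshold.
    df = df.loc[:, df.isna().mean() <= max_missing].copy()
    for col in df.columns:
        if df[col].isna().any():
            if pd.api.types.is_numeric_dtype(df[col]):
                # Median imputation for numerical columns.
                df[col] = df[col].fillna(df[col].median())
            else:
                # Mode (most frequent value) imputation for categorical columns.
                df[col] = df[col].fillna(df[col].mode().iloc[0])
    return df


# Toy frame with hypothetical columns; the last row duplicates the first.
toy = pd.DataFrame(
    {
        "age": [70, None, 50, 50, 70],
        "sex": ["F", "F", None, "M", "F"],
        "rare_lab": [None, None, 1.0, None, None],
    }
)
print(clean(toy))
```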

#### ⚙️ Feature Engineering:
- **Temporal Features:** Days since last admission, admission frequency.  
- **Clinical Aggregates:** Comorbidity count, polypharmacy indicator (≥5 medications).  
- **Socioeconomic Proxies:** Area Deprivation Index (from ZIP code).  
- **Discharge Context:** Length of stay, discharge day.
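
To make the engineered features concrete, here is a small sketch that derives several of them for a single admission. The `build_features` function and its argument names are hypothetical; admission frequency is simplified to a count of all prior admissions, and the Area Deprivation Index is left out because it needs an external ZIP-code lookup table.

```python
from datetime import date

POLYPHARMACY_THRESHOLD = 5  # from the text: 5 or more medications


def build_features(admit_date, discharge_date, prior_admit_dates, medications, comorbidities):
    """Derive temporal, clinical, and discharge-context features for one stay."""
    prior = [d for d in prior_admit_dates if d < admit_date]
    # None signals "no prior admission on record".
    days_since_last = (admit_date - max(prior)).days if prior else None
    return {
        "days_since_last_admission": days_since_last,
        "prior_admission_count": len(prior),
        "comorbidity_count": len(set(comorbidities)),
        "polypharmacy": int(len(set(medications)) >= POLYPHARMACY_THRESHOLD),
        "length_of_stay_days": (discharge_date - admit_date).days,
        "discharge_weekday": discharge_date.weekday(),  # Monday=0 ... Sunday=6
    }


# Illustrative values only.
features = build_features(
    admit_date=date(2024, 3, 10),
    discharge_date=date(2024, 3, 15),
    prior_admit_dates=[date(2024, 1, 10), date(2023, 11, 1)],
    medications=["metformin", "lisinopril", "atorvastatin", "furosemide", "warfarin"],
    comorbidities=["diabetes", "heart failure"],
)
print(features)
```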

#### 🔠 Encoding & Scaling:
- **One-hot encode** categorical variables.  
- **Standardize** numerical values.
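
A minimal sketch of the encoding and scaling step using scikit-learn's `ColumnTransformer` is shown below. The column names and toy values are assumptions. Fitting on the training frame and reusing the fitted transformer on new data keeps the scaling statistics and category list consistent between training and scoring.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column names.
numeric_cols = ["age", "length_of_stay_days"]
categorical_cols = ["discharge_disposition"]

preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), numeric_cols),
        # Unseen categories at scoring time become an all-zero one-hot row.
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0.0,  # always return a dense array
)

train = pd.DataFrame(
    {
        "age": [50, 60, 70, 80],
        "length_of_stay_days": [2, 4, 6, 8],
        "discharge_disposition": ["home", "snf", "home", "rehab"],
    }
)
X_train = preprocessor.fit_transform(train)
print(X_train.shape)  # 2 scaled numeric columns + 3 one-hot columns
```

Tree-based models such as XGBoost are insensitive to monotonic rescaling of features, so the standardization step matters mainly if a linear baseline is trained on the same inputs.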

#### ⚖️ Class Balancing:
- Use **SMOTE** or **downsampling** to address class imbalance in readmission labels, applied to the training data only so that validation and test sets keep the true readmission rate.
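
As an illustration of the downsampling option, the sketch below randomly drops majority-class rows until both classes are the same size. The `downsample_majority` helper is hypothetical and uses only the standard library; SMOTE itself is not shown because it relies on a separate package.

```python
import random


def downsample_majority(rows, labels, seed=0):
    """Randomly drop majority-class rows so both classes have equal counts."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    # Keep every minority row and an equally sized random sample of the majority.
    kept = minority + rng.sample(majority, len(minority))
    rng.shuffle(kept)
    return [rows[i] for i in kept], [labels[i] for i in kept]


# Toy training set: 3 readmissions among 13 patients.
rows = list(range(13))
labels = [1, 1, 1] + [0] * 10
balanced_rows, balanced_labels = downsample_majority(rows, labels)
print(len(balanced_rows), sum(balanced_labels))  # 6 3
```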

---

## **Model Development (10 points)**

### **Model Selection:**
- **Chosen Model:** Gradient Boosting Machine (e.g., XGBoost)

### **Justification:**
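- Works well on mixed tabular data (numeric features plus encoded categorical features).  
- Captures nonlinear interactions between variables.  
- Provides **feature importance** scores that support clinical interpretability.  
- Robust to outliers in input features; often outperforms logistic regression on tabular data with nonlinear effects, which should be confirmed against a logistic regression baseline.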
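
The sketch below shows how a gradient boosting model can be fitted and its feature importances ranked. It uses scikit-learn's `GradientBoostingClassifier` as a stand-in for XGBoost, and the data is synthetic with hypothetical feature names attached purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Hypothetical feature names attached to synthetic data for illustration only.
feature_names = ["age", "length_of_stay", "prior_admissions", "comorbidity_count", "polypharmacy"]
X, y = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=0, random_state=0
)

model = GradientBoostingClassifier(
    n_estimators=100, max_depth=3, learning_rate=0.1, random_state=0
)
model.fit(X, y)

# Impurity-based importances sum to 1; higher means the feature was used
# for more effective splits across the ensemble.
ranked = sorted(zip(feature_names, model.feature_importances_), key=lambda p: p[1], reverse=True)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```

These scores give a global ranking of features rather than a per-patient explanation, so they are a starting point for clinical review rather than a complete interpretability story.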

### **Evaluation (Hypothetical):**

| Actual \ Predicted   | Readmitted | Not Readmitted |
|----------------------|------------|----------------|
| **Readmitted**       | 90 (TP)    | 50 (FN)        |
| **Not Readmitted**   | 60 (FP)    | 800 (TN)       |

- **Precision:** 90 / (90 + 60) = **60%**  
  > *60% of patients flagged as high-risk were actually readmitted.*

- **Recall:** 90 / (90 + 50) = **64.3%**  
  > *The model captures 64.3% of actual readmissions.*
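
The arithmetic above can be checked with a few lines of Python. The `confusion_metrics` helper is hypothetical; it also reports specificity, F1, and accuracy from the same hypothetical table, which helps show why accuracy alone looks flattering on an imbalanced dataset.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Derive common classification metrics from confusion-matrix counts."""
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "f1": 2 * tp / (2 * tp + fp + fn),
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
    }


metrics = confusion_metrics(tp=90, fn=50, fp=60, tn=800)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
# precision: 0.600
# recall: 0.643
# specificity: 0.930
# f1: 0.621
# accuracy: 0.890
```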

---

## **Deployment (10 points)**

### **Integration Steps:**
- **API Development:** Expose model via REST API (e.g., FastAPI).  
- **EHR Integration:** Trigger model at discharge; feed real-time data from hospital systems.  
- **Clinician Dashboard:** Display risk scores within Epic or other EHR dashboards.  
- **Alert System:** Notify care teams of high-risk patients through email/EHR alerts.  
- **Model Retraining:** Monthly batch updates with latest patient data.
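
As a sketch of the alerting logic that might sit behind the API, the function below maps a predicted risk score to a tier and decides whether the care team is notified. The `Alert` class, the `triage` function, and both cutoffs are assumptions; real thresholds would be chosen with clinicians based on care-team capacity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    patient_ref: str
    risk_score: float
    tier: str
    notify_care_team: bool


def triage(patient_ref, risk_score, high_cutoff=0.6, medium_cutoff=0.3):
    """Map a readmission probability to a risk tier (cutoffs are hypothetical)."""
    if not 0.0 <= risk_score <= 1.0:
        raise ValueError("risk_score must be a probability in [0, 1]")
    if risk_score >= high_cutoff:
        tier = "high"
    elif risk_score >= medium_cutoff:
        tier = "medium"
    else:
        tier = "low"
    # Only high-risk patients trigger a notification in this sketch.
    return Alert(patient_ref, risk_score, tier, notify_care_team=(tier == "high"))


print(triage("patient-abc", 0.72))
```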

### **Regulatory Compliance (HIPAA):**
- **Data Minimization:** Collect only essential features.  
- **Encryption:** Encrypt all stored and transmitted data.  
- **Access Controls:** Role-based access (e.g., only assigned clinicians).  
- **Audit Logs:** Track access and prediction usage.  
- **BAAs:** Ensure all cloud providers sign HIPAA-compliant Business Associate Agreements.
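
The access-control and audit-log items could be combined along the lines of the sketch below, where every attempt to view a risk score is checked against a role list and recorded. The role names, the `access_prediction` function, and the list used as an audit sink are hypothetical; a production system would write to tamper-resistant storage.

```python
import json
from datetime import datetime, timezone

# Hypothetical roles permitted to view risk scores.
ALLOWED_ROLES = {"attending_physician", "care_coordinator"}


def access_prediction(user_id, role, patient_ref, audit_sink):
    """Check role-based access and record the attempt, allowed or not."""
    allowed = role in ALLOWED_ROLES
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user_id": user_id,
        "role": role,
        "patient_ref": patient_ref,
        "action": "view_risk_score",
        "allowed": allowed,
    }
    # Denied attempts are logged too, so they can be reviewed later.
    audit_sink.append(json.dumps(entry))
    return allowed


audit_log = []
print(access_prediction("u-101", "care_coordinator", "patient-abc", audit_log))
print(access_prediction("u-202", "billing_clerk", "patient-abc", audit_log))
print(len(audit_log))
```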

---

## **Optimization (5 points)**

### **Method to Address Overfitting:**

#### ✅ Technique: **Stratified K-Fold Cross-Validation (5-fold)**
- Maintains class distribution in each fold.  
- Trains on 4 folds and validates on the remaining fold, repeated 5 times so that each fold serves once as the validation set.

### **Implementation:**
1. Split dataset into 5 stratified folds.
2. Train model 5 times (each time on a different train-validation split).
3. Tune hyperparameters (e.g., `max_depth`, `learning_rate`) using validation performance.
4. Select the best configuration based on **average recall** across folds (prioritizing capture of true readmissions), while checking that precision stays at an acceptable level.
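
The four steps above can be sketched with scikit-learn as follows. This is a minimal example under stated assumptions: the data is synthetic and imbalanced, `GradientBoostingClassifier` stands in for XGBoost, and the small parameter grid is illustrative rather than recommended.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic, imbalanced stand-in for the readmission dataset (about 15% positives).
X, y = make_classification(
    n_samples=600, n_features=10, weights=[0.85, 0.15], random_state=0
)

# Step 1: five stratified folds preserve the class ratio in each split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Steps 2-3: every parameter combination is trained and validated on all five splits.
param_grid = {"max_depth": [2, 3], "learning_rate": [0.05, 0.1]}
search = GridSearchCV(
    estimator=GradientBoostingClassifier(n_estimators=50, random_state=0),
    param_grid=param_grid,
    scoring="recall",  # Step 4: rank configurations by mean recall across folds
    cv=cv,
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Because recall alone can be inflated by flagging more patients, it is worth inspecting precision for the winning configuration before adopting it.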

### **Why It Works:**
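- Averaging performance over five validation folds reduces the chance of choosing hyperparameters that overfit a single train-validation split.  
- Gives a more reliable estimate of how well the model generalizes to unseen patient data.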
