# Final Project: Gallstone Risk Prediction

## 1. Problem & Intended User

This project addresses the clinical challenge of **predicting gallstone presence** using physiological and lab-based data. The intended users are:

- **Clinicians and decision support systems** seeking interpretable, data-driven insights.
- **Public health analysts** evaluating risk markers in patient populations.

Accurate prediction can improve screening prioritization, early intervention, and resource allocation, especially in environments where imaging resources are limited.

## 2. Data & Exploratory Analysis

We used a structured dataset of biometric, clinical, and biochemical features from adult patients. Key steps included:

- Handling missing values and categorical encoding.
- Checking class balance and feature distributions.
- Visualizing correlations and multicollinearity.
- Assessing class-conditional feature shifts using boxplots and statistical trends.

## 3. Modeling Strategy

We compared two supervised classifiers:

- **Logistic Regression (baseline)**: for its transparency and interpretability.
- **Random Forest (advanced)**: to capture potential non-linear feature interactions.

We addressed multicollinearity via **Variance Inflation Factor (VIF)** pruning, scaled inputs using `StandardScaler`, and validated with **5-fold cross-validation**.

## 4. Evaluation Results

Test set metrics showed:

| Metric     | Logistic Regression | Random Forest |
|------------|---------------------|----------------|
| Accuracy   | 0.750               | 0.703          |
| Precision  | 0.767               | 0.724          |
| Recall     | 0.719               | 0.656          |
| F1 Score   | 0.742               | 0.689          |
| ROC AUC    | 0.811               | 0.791          |

Logistic Regression outperformed Random Forest on every reported test-set metric while offering better interpretability.

## 5. Interpretability & Feature Insights

- **Top predictors**: `crp_mg_l`, `vitamin_d_ng_ml`, `hyperlipidemia`, and `diabetes`.
- **Logistic coefficients** and **tree-based importance scores (MDI, permutation, SHAP)** aligned with clinical expectations.
- **SHAP values** supported both global patterns and instance-level explainability.
- Overall, explanations were consistent, trustworthy, and user-friendly for clinical integration.

## 6. Conclusion

The Logistic Regression model offers a **strong balance between performance and interpretability**, making it a viable candidate for low-risk clinical decision support.

- Random Forest provides richer modeling capacity but requires **additional interpretability tooling**.
- Transparent feature logic and alignment with medical literature build **trust and adoption potential**.

## Problem & User

We aim to interpret machine learning models for predicting **gallstone disease** using a real-world clinical dataset. Our analysis uses the same dataset described in:

**Esen İ, Arslan H, Aktürk Esen S, Gülşen M, Kültekin N, Özdemir O.**  
*Early prediction of gallstone disease with a machine learning-based method from bioimpedance and laboratory data.*  
Medicine. 2024;103(8):e37258.  
[https://doi.org/10.1097/MD.0000000000037258](https://doi.org/10.1097/MD.0000000000037258)

### Intended Users:
- **Clinicians and Analysts** who want interpretable models for early screening of gallstone disease.
- **Medical Data Scientists** building diagnostic tools with transparent decision-making processes.
- **Public Health Researchers** evaluating population-level predictors of gallstone formation.

Our focus is not only to replicate modeling performance but also to **bridge clinical meaning with statistical reasoning**, supporting adoption of ML in medical workflows where **trust and traceability** are essential.

## Data and Exploratory Data Analysis (EDA)

### Dataset Overview

We utilize the same dataset from Esen et al. (2024), which examines early prediction of gallstone disease using clinical, laboratory, and bioimpedance data. The dataset comprises **319 patients** and **40 columns** (39 predictors plus the binary target), covering demographics, lab values, body composition metrics, and derived outlier flags.

### Data Structure

- **Observations:** 319 patients  
- **Features:** 39 predictors + 1 binary target (`has_gallstones`)
- **Target Classes:** 0 = No Gallstones, 1 = Gallstones
- **Data Types:** Numerical, binary categorical, and derived outlier flags
- **Preprocessing:** Categorical columns restored; no missing values remain

### Target Class Distribution

The target variable is nearly perfectly balanced, which supports fair model training and evaluation without requiring class rebalancing.

<div align="left">
  <img src="../plots/target_distribution.png" width="400">
</div>

- **Gallstones:** 161 (50.5%)  
- **No Gallstones:** 158 (49.5%)

### Distributions of Key Numeric Features

We plotted KDE-enhanced histograms to assess each numeric feature's distribution. This revealed that:

- **Right-skewed distributions** are common for lab values (e.g., `crp_mg_l`, `glucose_mg_dl`, `triglyceride_mg_dl`)
- **Body composition features** are generally near-normal or mildly skewed (e.g., `muscle_mass_kg`, `fat_ratio_percent`)
- Outliers in `obesity_percent` were capped at 70% and flagged using `obesity_outlier_flag`
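
The capping-and-flagging step for `obesity_percent` could look roughly like the sketch below. Only the column names and the 70% cap come from the text; the toy values, the 0/1 integer encoding of the flag, and the choice to treat values strictly above the cap as outliers are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy values; only the column names and the 70% cap come from the text.
df = pd.DataFrame({"obesity_percent": [18.5, 42.0, 70.0, 95.3, 1011.0]})

CAP = 70.0  # cap stated in the text

# Flag first, so the flag records which rows the cap is about to change.
# Assumption: values strictly above the cap count as outliers.
df["obesity_outlier_flag"] = (df["obesity_percent"] > CAP).astype(int)

# Then cap the values themselves.
df["obesity_percent"] = df["obesity_percent"].clip(upper=CAP)

print(df)
```

Keeping the flag as its own column lets a model still see that a value was extreme even after the magnitude has been capped.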

### Selected Visualizations

<div align="left">
  <img src="../plots/hist_bmi.png" width="400">
  <img src="../plots/hist_hdl_mg_dl.png" width="400">
  <img src="../plots/hist_vitamin_d_ng_ml.png" width="400">
</div>

### Class-wise Feature Distributions

We used boxplots to compare numeric feature values by gallstone status. These plots reveal meaningful clinical and statistical separation.

<div align="left">
  <img src="../plots/boxplot_vitamin_d_ng_ml.png" width="400">
  <img src="../plots/boxplot_fat_ratio_percent.png" width="400">
</div>

- **Higher in Gallstone Patients**:  
  - `bmi`, `fat_ratio_percent`, `crp_mg_l`  
- **Lower in Gallstone Patients**:  
  - `vitamin_d_ng_ml`, `hdl_mg_dl`  
- These align with established metabolic and inflammatory risk patterns.

### Correlation Analysis

We computed Pearson correlation coefficients between all numeric features and visualized them using a heatmap:

<div align="left">
  <img src="../plots/correlation_matrix.png" width="800">
</div>

#### Key Insights:
- **Highly correlated clusters**:
  - Muscle mass and water content features (e.g., `tbw_kg`, `ecw_kg`, `icw_kg`, `muscle_mass_kg`)
  - Fat-related metrics (e.g., `fat_mass_kg`, `fat_ratio_percent`, `bmi`)
- **Implications**:
  - These redundancies can cause multicollinearity in linear models and were addressed in later modeling steps using VIF-based feature reduction.
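
As a minimal sketch of what VIF-based reduction can look like, the code below computes each feature's VIF as 1 / (1 - R²), where R² comes from regressing that feature on the others, and repeatedly drops the feature with the highest VIF. The threshold of 10 is a common rule of thumb and an assumption here, not the project's documented setting; the data are synthetic, and the helper names `vif_scores` and `prune_by_vif` are hypothetical.

```python
import numpy as np

def vif_scores(X):
    """VIF per column: 1 / (1 - R^2) from regressing it on the other columns."""
    X = np.asarray(X, dtype=float)
    scores = []
    for j in range(X.shape[1]):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(y)), others])  # intercept + other features
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - float(resid @ resid) / float(((y - y.mean()) ** 2).sum())
        scores.append(np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2))
    return np.array(scores)

def prune_by_vif(X, names, threshold=10.0):
    """Drop the highest-VIF column until every remaining VIF is <= threshold."""
    X = np.asarray(X, dtype=float)
    names = list(names)
    while X.shape[1] > 1:
        scores = vif_scores(X)
        worst = int(np.argmax(scores))
        if scores[worst] <= threshold:
            break
        X = np.delete(X, worst, axis=1)
        names.pop(worst)
    return X, names

# Synthetic stand-in: two strongly related body-composition features, one unrelated lab value.
rng = np.random.default_rng(0)
n = 200
tbw = rng.normal(40.0, 5.0, n)
muscle = 1.3 * tbw + rng.normal(0.0, 0.5, n)
crp = rng.exponential(3.0, n)
X = np.column_stack([tbw, muscle, crp])
names = ["tbw_kg", "muscle_mass_kg", "crp_mg_l"]

print(dict(zip(names, vif_scores(X).round(1))))
X_pruned, kept = prune_by_vif(X, names)
print(kept)
```

In this toy setup one of the two correlated features is removed and the unrelated feature is kept, which is the behavior the pruning step is meant to provide.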

### Summary of EDA Findings

| Area                    | Key Insight |
|-------------------------|-------------|
| **Class Balance**       | 50.5% gallstone-positive, ideal for modeling |
| **Feature Shapes**      | Mostly right-skewed labs; body metrics near-normal |
| **Class Separation**    | `vitamin_d_ng_ml`, `hdl_mg_dl`, `bmi`, `fat_ratio_percent`, and `crp_mg_l` show strong signal |
| **Correlated Features** | Strong clusters suggest redundancy; guide feature pruning |
| **Clinical Plausibility** | Trends align with known risk factors, supporting human-centered interpretability |

These insights provided the foundation for principled model development and explainability strategies used throughout the project.


## Modeling

### Objective

Build predictive models for gallstone presence using features derived from bioimpedance, laboratory tests, and clinical data. Evaluate both performance and interpretability to guide clinical trust and deployment readiness.

### 1. Modeling Strategy

We explored two primary modeling approaches:

- **Logistic Regression** (Linear Model)
  - Prioritized for its simplicity, transparency, and explainability.
  - Applied **VIF-based feature pruning** to reduce multicollinearity before training.

- **Random Forest Classifier** (Non-linear Model)
  - Used to capture complex interactions among features.
  - Evaluated for potential performance gain and compared against logistic regression in both accuracy and explainability.

### 2. Train/Test Setup

- Used an **80/20 stratified split** to preserve class balance during evaluation.
- Applied **standard scaling** for the logistic regression pipeline.
- Feature selection for Logistic Regression was based on multicollinearity reduction; Random Forest used all features.
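
The setup above can be sketched with scikit-learn as follows. The data are a synthetic stand-in with the same number of rows as the dataset, and the `random_state` values and `max_iter` setting are assumptions, not the project's recorded configuration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with 319 rows; the real features are not reproduced here.
X, y = make_classification(n_samples=319, n_features=10, n_informative=4, random_state=0)

# 80/20 split, stratified on the target so both parts keep the class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scaling lives inside the pipeline, so the scaler is fit on training data only.
logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
logreg.fit(X_train, y_train)

print(X_train.shape, X_test.shape)
print(round(y_train.mean(), 3), round(y_test.mean(), 3))
```

Wrapping the scaler and classifier in one pipeline avoids leaking test-set statistics into the scaling step.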

### 3. Evaluation Metrics

Model performance was assessed using:

- **Accuracy** – Proportion of correctly predicted labels.
- **ROC-AUC** – Area under the receiver operating characteristic curve.
- **Precision & Recall** – Class-specific assessment, particularly relevant in healthcare prediction.
- **F1 Score** – Harmonic mean of precision and recall.
- **Confusion Matrix** – Breakdown of true/false positives and negatives.
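
To make these definitions concrete, here is a small worked example that derives the metrics from confusion-matrix counts. The counts are illustrative: they were back-calculated to be consistent with the Logistic Regression results reported in this write-up under the assumption of a 64-patient test set, and were not read from the project's confusion-matrix figure.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

# Illustrative counts (assumption): 64 test patients, 32 per class.
metrics = classification_metrics(tp=23, fp=7, fn=9, tn=25)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Recall depends only on the positive class (true positives and false negatives), which is why it is the metric most sensitive to missed gallstone cases.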

### 4. Results Overview

| Model              | Accuracy | Precision | Recall | F1 Score | ROC AUC |
|-------------------|----------|-----------|--------|----------|---------|
| Logistic Regression | 0.750    | 0.767     | 0.719  | 0.742    | 0.811   |
| Random Forest       | 0.703    | 0.724     | 0.656  | 0.689    | 0.791   |

### 5. Summary of Key Modeling Insights

- **Logistic Regression** outperformed Random Forest across all metrics on the test set.
- It offers **transparent coefficients** and intuitive explanations of feature influence.
- **Random Forest**, while flexible and capable of capturing nonlinear patterns, underperformed slightly in both F1 Score and ROC AUC.
- Logistic Regression is better suited for this clinical context, where **interpretability and recall** are both essential.

## Evaluation

### Objective

Evaluate model performance not just in terms of numerical accuracy, but with a focus on **real-world reliability**, **interpretability**, and **clinical relevance**. This includes comparing Logistic Regression and Random Forest across multiple metrics, visualizing results, and analyzing the cost of different types of errors in the healthcare context.

### 1. Evaluation Approach

- Used a consistent **80/20 stratified split** for fair model comparison.
- Applied the same data transformations used during training:
  - Logistic Regression used **VIF-pruned and standardized features**.
  - Random Forest used the **full original feature set** (no scaling required).
- Predictions and probability scores were generated on the test set.
- **Cross-validation** (5-fold) was used to assess generalization on the training set.

### 2. Performance Metrics (Test Set)

| Model              | Accuracy | Precision | Recall | F1 Score | ROC AUC |
|-------------------|----------|-----------|--------|----------|---------|
| Logistic Regression | 0.750    | 0.767     | 0.719  | 0.742    | 0.811   |
| Random Forest       | 0.703    | 0.724     | 0.656  | 0.689    | 0.791   |

- **Logistic Regression outperformed Random Forest** across all metrics on the test set.
- It also achieved higher **recall**, which is especially critical in healthcare where false negatives are dangerous.
- The **ROC AUC** was also higher (0.811 vs. 0.791), indicating better overall discriminative power.

### 3. Cross-Validation Results (5-Fold on Training Set)

| Model              | Mean ROC AUC (CV) |
|-------------------|-------------------|
| Logistic Regression | 0.814             |
| Random Forest       | 0.841             |

- Random Forest slightly outperformed Logistic Regression in cross-validation.
- However, its **test performance dropped**, suggesting it may have overfit to the training data.
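
A 5-fold cross-validated ROC AUC comparison of this kind might be set up as in the sketch below. It runs on synthetic data, and the hyperparameters (`n_estimators`, `max_iter`, the shuffling seed) are assumptions rather than the project's actual settings, so the printed scores will not match the table above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training portion of the data.
X_train, y_train = make_classification(
    n_samples=255, n_features=10, n_informative=4, random_state=0
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
models = {
    # The scaler sits inside the pipeline, so it is refit within each fold.
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=42),
}

cv_auc = {}
for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="roc_auc")
    cv_auc[name] = scores.mean()
    print(f"{name}: mean ROC AUC = {scores.mean():.3f} (std {scores.std():.3f})")
```

Reporting the fold-to-fold standard deviation alongside the mean helps judge whether a gap between two models is larger than the variation across folds.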

### 4. Visualizations

We used two visual diagnostics to assess and communicate model performance:

#### Confusion Matrices

<div align="left">
  <img src="../plots/confusion_matrices.png" width="700">
</div>

- Logistic Regression produced **fewer false negatives**, a critical benefit in medical screening where missed diagnoses can delay treatment.
- Random Forest had slightly more false positives and also more **false negatives**, making it less suitable for high-stakes use cases like gallstone detection.

#### ROC Curves – Gallstones Detection

<div align="left">
  <img src="../plots/roc_curves_inline_labels.png" width="600">
</div>

- **Logistic Regression** achieved the higher AUC (**0.81** vs. 0.79), indicating better discrimination between gallstone and non-gallstone patients.
- ROC curves were styled for clarity, with **inline model labels** and a clean diagonal reference line.

### 5. Error Trade-Offs in Clinical Context

- **False Negatives** (missed gallstone cases) carry higher clinical risk than false positives.
- This makes **recall a priority**, and Logistic Regression had the higher recall (0.719 vs. 0.656).
- False positives lead to extra imaging, but false negatives could delay treatment and lead to complications.

### 6. Clinical Interpretability

- Top predictors like `crp_mg_l`, `vitamin_d_ng_ml`, and `alt_u_l` are **clinically plausible** and align with known risk factors.
- Logistic Regression’s coefficient transparency supports **decision support** adoption in medical workflows.

### Conclusion

Logistic Regression emerged as the more **trustworthy, interpretable, and clinically appropriate** model despite Random Forest’s slight edge in cross-validation. It provides consistent performance, better recall, and transparent reasoning—making it well-suited for deployment in real-world diagnostic settings.


## Interpretation

This section focuses on how both models—**Logistic Regression** and **Random Forest**—arrive at their predictions. We use interpretable techniques and professional-quality visuals to understand which features most influence predictions and whether the models are aligned with real-world clinical reasoning.

### Logistic Regression Coefficients

The coefficients represent the **direction** and **strength** of influence for each input feature:

- **Positive Coefficient → Increased Gallstone Risk**
- **Negative Coefficient → Decreased Gallstone Risk**

<div align="left">
  <img src="../plots/logreg_feature_importance.png" width="650">
</div>

#### Notable Features:
- `crp_mg_l` (C-reactive protein) ↑ — confirms inflammation is a strong gallstone signal
- `hyperlipidemia` ↑ and `diabetes` ↑ — metabolic conditions associated with higher risk
- `vitamin_d_ng_ml` ↓ — supports recent findings linking deficiency to gallbladder dysfunction
- `ast_u_l` ↓ — mild protective signal, consistent with preserved liver health

These **directional, transparent** coefficients are critical for clinicians who need to **understand and trust model logic**.
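
One way to read such coefficients is to convert them to odds ratios. Because the logistic regression inputs were standardized, `exp(coefficient)` can be read as the multiplicative change in the odds of gallstones for a one-standard-deviation increase in that feature, holding the others fixed. The coefficient values below are hypothetical and chosen only for illustration; they are not the fitted values from this project.

```python
import math

# Hypothetical coefficients for illustration only (not the project's fitted values).
coefficients = {"crp_mg_l": 0.80, "vitamin_d_ng_ml": -0.45}

# exp(beta) is the odds ratio for a +1 SD change in a standardized feature.
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}

for name, ratio in odds_ratios.items():
    direction = "raises" if ratio > 1 else "lowers"
    print(f"+1 SD in {name} {direction} the odds by a factor of {ratio:.2f}")
```

An odds ratio above 1 corresponds to a positive coefficient and one below 1 to a negative coefficient, matching the direction convention listed above.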

### Random Forest Interpretability

To peek inside the Random Forest, we computed:

- **MDI (Mean Decrease in Impurity)** – total impurity reduction from splits on a feature, averaged across trees
- **Permutation Importance** – performance drop when a feature is shuffled
- **SHAP Values** – per-prediction explanations grounded in game theory
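
The first two measures can be obtained directly from scikit-learn, as in the sketch below. It uses synthetic data with placeholder feature names, and the hyperparameters and the choice of ROC AUC as the permutation score are assumptions. SHAP is left out here because it relies on the separate `shap` package.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in; feature names are placeholders, not the project's columns.
X, y = make_classification(
    n_samples=319, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
names = [f"feature_{i}" for i in range(X.shape[1])]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

rf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_train, y_train)

# MDI: impurity-based importances computed from the training-time splits.
mdi = rf.feature_importances_

# Permutation importance: drop in held-out ROC AUC when one feature is shuffled.
perm = permutation_importance(
    rf, X_test, y_test, scoring="roc_auc", n_repeats=20, random_state=42
)

for i in mdi.argsort()[::-1]:
    print(f"{names[i]}: MDI={mdi[i]:.3f}, permutation={perm.importances_mean[i]:.3f}")
```

MDI is computed from the training data and can favor high-cardinality features, so comparing it with permutation importance on held-out data is a useful cross-check.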

<div align="left">
  <img src="../plots/rf_feature_importances_combined.png" width="800">
</div>

#### Key Findings:
- `crp_mg_l` ranked top across all methods
- `vitamin_d_ng_ml`, `ast_u_l`, `ecf_tbw_ratio_index`, and `fat_ratio_percent` were frequent top contributors
- These results **corroborate** the logistic regression findings: both models place `crp_mg_l` and `vitamin_d_ng_ml` among their strongest signals

### Clinical Relevance and Trust

| Feature                 | Clinical Significance                                      | Model Signal |
|-------------------------|------------------------------------------------------------|--------------|
| CRP                    | Inflammatory marker for gallbladder disease                | ↑ Strong     |
| Vitamin D              | Linked to bile function and motility                        | ↓ Moderate   |
| AST / ALT              | Liver function indicators                                   | ↓ Moderate   |
| Hyperlipidemia         | Part of the metabolic syndrome                              | ↑ Moderate   |
| ECF/TBW Ratio          | Marker of fluid distribution, may reflect bile composition  | ↑ Mild       |

Both models surfaced **medically plausible** predictors, increasing confidence in model use.

### Interpretability Summary

| Aspect               | Logistic Regression                | Random Forest                         |
|----------------------|-------------------------------------|----------------------------------------|
| Transparency         | ✅ High (direct coefficients)       | ➖ Medium (SHAP + permutation help)     |
| Clinical Alignment   | ✅ Excellent                        | ✅ Strong                               |
| Trustworthiness      | ✅ Ready for clinical decision aid  | ➕ With visual support                  |

Logistic Regression offers an intuitive explanation path, while Random Forest—despite its complexity—can be made interpretable with proper tools.


## Conclusion

Over the course of this mini-project, we completed an end-to-end machine learning workflow using a real-world clinical dataset from *Esen et al. (2024)* on early prediction of gallstone disease. The goal was to build interpretable models that could support clinicians in risk stratification based on laboratory and bioimpedance features.

### Key Accomplishments
- **Data Cleaning & Preparation**: We addressed missing values, corrected data types, capped and flagged outliers, and engineered clinically relevant features. VIF pruning was applied to reduce multicollinearity for linear modeling.
- **Exploratory Data Analysis (EDA)**: Visual and statistical exploration confirmed a near-even class balance and revealed key feature distributions and correlations that informed downstream modeling choices.
- **Modeling**: We trained both **Logistic Regression** (with standardized, reduced features) and **Random Forest** (on full data), optimizing for generalization and interpretability.
- **Evaluation**: Logistic Regression achieved superior test-set performance (ROC AUC = 0.811) and higher recall, making it the safer model for minimizing false negatives in a clinical context.
- **Interpretability**: Feature attribution using coefficients, SHAP, and permutation importance confirmed that top predictors like `crp_mg_l`, `vitamin_d_ng_ml`, and `hyperlipidemia` were both statistically strong and clinically meaningful.
- **Communication**: Visuals were created using *Storytelling with Data* design principles to clearly convey findings to technical and non-technical stakeholders alike.

### Final Takeaways
- **Logistic Regression** is a trustworthy baseline for clinical deployment: interpretable, generalizable, and well-aligned with domain knowledge.
- **Random Forest**, augmented with SHAP and permutation methods, provides richer modeling capacity but requires additional interpretive support for user trust.
- This project exemplifies responsible machine learning by balancing **performance, interpretability, and usability** in a sensitive healthcare application.

The structured, reproducible approach demonstrated here reflects key skills in human-centered ML engineering: combining rigorous data science with empathy for end users and their real-world decision contexts.