# Class 2 Group Exercise: Building a Diabetes Risk Prediction Model

## Background
The Pima Indians Diabetes dataset contains diagnostic measurements from female patients of Pima Indian heritage. Your task is to build and evaluate a predictive model for diabetes diagnosis.

## Your Task
Working in groups, build a complete predictive modeling pipeline from data exploration to model evaluation.

This exercise combines skills from:
- Understanding distributions (CentralLimitExample)
- Bootstrap confidence intervals (bootstrapping)
- Logistic regression modeling (LogisticRegressionSignificance)
- Model evaluation with ROC/AUC (How-to-Roc)

## Part 1: Data Exploration & Understanding Distributions (15 minutes)

1. Load the Pima Indians dataset and examine its structure
2. Check the distribution of the outcome variable (`type`: Yes/No)
3. For each predictor, create histograms and assess normality
4. Identify any predictors that might need transformation
5. Check for missing values or suspicious zeros (hint: can BMI = 0?)
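
A minimal base-R sketch of steps 2, 4, and 5, assuming the `MASS` versions of the data (`Pima.tr`, `Pima.te`). The names `predictors`, `zero_counts`, `na_counts`, and `skew_hint` are illustrative and not part of the exercise template. Comparing the mean to the median is only a rough skew check, not a formal normality test, so it complements the histograms rather than replacing them.

```r
library(MASS)  # provides Pima.tr and Pima.te

pima_full <- rbind(Pima.tr, Pima.te)
predictors <- c("npreg", "glu", "bp", "skin", "bmi", "ped", "age")

# Outcome distribution as counts and proportions
print(table(pima_full$type))
print(round(prop.table(table(pima_full$type)), 3))

# Exact zeros per predictor: zero is plausible for npreg,
# but not for bmi, glu, bp, or skin
zero_counts <- colSums(pima_full[, predictors] == 0)
print(zero_counts)

# Missing values per column
na_counts <- colSums(is.na(pima_full))
print(na_counts)

# Rough skew hint: a mean well above the median suggests right skew,
# which may motivate a log transformation
skew_hint <- sapply(pima_full[, predictors], function(x) mean(x) - median(x))
print(round(skew_hint, 2))
```

Predictors with a large positive `skew_hint` relative to their scale are natural candidates to inspect more closely in the histograms.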

In [None]:
library(MASS)
library(ggplot2)
library(dplyr)

# Load data
data(Pima.tr, package="MASS")
data(Pima.te, package="MASS")

# Combine for exploration
pima_full <- rbind(Pima.tr, Pima.te)
head(pima_full)
summary(pima_full)

# Check outcome distribution (add proportions with prop.table() if useful)
table(pima_full$type)

# Your code here: Histograms of predictors


# Your code here: Check for suspicious zeros


## Part 2: Train/Test Split & Bootstrap Baseline (10 minutes)

1. Split the data into 70% training and 30% test sets
2. Calculate the baseline diabetes prevalence in your training set
3. Use bootstrapping to estimate the 95% confidence interval for this prevalence
4. This baseline answers the question: if we predicted "No diabetes" for everyone, what would our accuracy be? (It equals 1 minus the prevalence.)
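
One possible shape for the percentile bootstrap, sketched in base R. The helper `boot_prev` and the choice of 2000 replicates are assumptions for illustration; the split mirrors the seed and 70/30 proportion used in this exercise.

```r
library(MASS)  # provides Pima.tr and Pima.te

set.seed(123)
pima_full <- rbind(Pima.tr, Pima.te)
n <- nrow(pima_full)
train_idx <- sample(seq_len(n), floor(0.7 * n))
train_data <- pima_full[train_idx, ]

# Observed prevalence: proportion of "Yes" in the training outcome
prevalence <- mean(train_data$type == "Yes")

# One bootstrap replicate: resample the outcome with replacement
# and recompute the proportion of "Yes"
boot_prev <- function(y) mean(sample(y, replace = TRUE) == "Yes")
boot_dist <- replicate(2000, boot_prev(train_data$type))

# Percentile 95% confidence interval
ci <- quantile(boot_dist, c(0.025, 0.975))

cat("Prevalence:", round(prevalence, 3), "\n")
cat("95% bootstrap CI:", round(ci, 3), "\n")
cat("Accuracy of always predicting 'No':", round(1 - prevalence, 3), "\n")
```

Any model worth using should beat the "always No" accuracy printed on the last line.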

In [None]:
set.seed(123)  # For reproducibility

# Train/test split
train_idx <- sample(seq_len(nrow(pima_full)), floor(0.7 * nrow(pima_full)))
train_data <- pima_full[train_idx, ]
test_data <- pima_full[-train_idx, ]

cat("Training set size:", nrow(train_data), "\n")
cat("Test set size:", nrow(test_data), "\n")

# Your code here: Calculate baseline prevalence


# Your code here: Bootstrap CI for prevalence
# Hint: Create a function that calculates proportion of "Yes" and use replicate()


## Part 3: Build Logistic Regression Models (15 minutes)

Build and compare multiple logistic regression models:

1. **Model 1 (Simple):** Predict diabetes using only `glu` (glucose)
2. **Model 2 (Clinical):** Add `bmi` and `age` to the model
3. **Model 3 (Full):** Include all available predictors

For each model:
- Examine the summary and identify significant predictors
- Note the AIC value for model comparison
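
As a sketch of how the three fits might be compared side by side, assuming the same seed and split as in Part 2: `AIC()` accepts several fitted models at once, and `anova()` with `test = "Chisq"` gives likelihood ratio tests for the nested sequence. The name `aic_table` is illustrative.

```r
library(MASS)  # provides Pima.tr and Pima.te

set.seed(123)
pima_full <- rbind(Pima.tr, Pima.te)
n <- nrow(pima_full)
train_idx <- sample(seq_len(n), floor(0.7 * n))
train_data <- pima_full[train_idx, ]

# Three nested logistic regression models
model1 <- glm(type ~ glu, data = train_data, family = binomial)
model2 <- glm(type ~ glu + bmi + age, data = train_data, family = binomial)
model3 <- glm(type ~ ., data = train_data, family = binomial)  # all predictors

# Lower AIC indicates a better trade-off between fit and complexity
aic_table <- AIC(model1, model2, model3)
print(aic_table)

# Likelihood ratio tests between successive nested models
print(anova(model1, model2, model3, test = "Chisq"))
```

AIC is computed on the training data only, so it is a guide for model comparison and not a substitute for the test-set evaluation in Part 4.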

In [None]:
# Model 1: Simple - glucose only
model1 <- glm(type ~ glu, data = train_data, family = binomial)
summary(model1)

# Your code here: Model 2 - glucose, BMI, age


# Your code here: Model 3 - all predictors
# Predictors available: npreg, glu, bp, skin, bmi, ped, age


# Compare AIC values
cat("Model 1 AIC:", AIC(model1), "\n")
# Add AIC for other models

## Part 4: Model Evaluation - ROC Curves & AUC (15 minutes)

Evaluate your models on the **test set** (not training set!):

1. Generate predicted probabilities for each model on the test set
2. Create ROC curves for all three models
3. Calculate AUC for each model
4. Which model performs best on unseen data?
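
The exercise itself uses `pROC` for ROC curves. As a package-free cross-check, the AUC can also be computed from ranks, because it equals the Mann-Whitney U statistic divided by the number of positive-negative pairs. The helper `auc_rank` below is a hypothetical name written for this sketch, not a `pROC` function.

```r
library(MASS)  # provides Pima.tr and Pima.te

# Rank-based AUC: probability that a random positive case receives a
# higher score than a random negative case (ties get average ranks)
auc_rank <- function(labels, scores, positive = "Yes") {
  pos <- labels == positive
  n_pos <- sum(pos)
  n_neg <- sum(!pos)
  r <- rank(scores)
  (sum(r[pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

set.seed(123)
pima_full <- rbind(Pima.tr, Pima.te)
n <- nrow(pima_full)
train_idx <- sample(seq_len(n), floor(0.7 * n))
train_data <- pima_full[train_idx, ]
test_data <- pima_full[-train_idx, ]

models <- list(
  simple   = glm(type ~ glu, data = train_data, family = binomial),
  clinical = glm(type ~ glu + bmi + age, data = train_data, family = binomial),
  full     = glm(type ~ ., data = train_data, family = binomial)
)

# Test-set AUC for each model
test_auc <- sapply(models, function(m) {
  p <- predict(m, newdata = test_data, type = "response")
  auc_rank(test_data$type, p)
})
print(round(test_auc, 3))
```

These values should agree with `auc()` from `pROC` when "Yes" is treated as the positive class.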

In [None]:
library(pROC)

# Predictions on test set - Model 1
pred1 <- predict(model1, newdata = test_data, type = "response")

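# ROC curve for Model 1
# Fix levels and direction so "Yes" is always treated as the positive class
roc1 <- roc(test_data$type, pred1, levels = c("No", "Yes"), direction = "<")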
plot(roc1, main = "ROC Curves Comparison", col = "red")
cat("Model 1 AUC:", auc(roc1), "\n")

# Your code here: Predictions and ROC for Model 2


# Your code here: Predictions and ROC for Model 3


# Add all ROC curves to the same plot for comparison
# Use: plot(roc2, add=TRUE, col="blue")

## Part 5: Threshold Selection & Confusion Matrix (10 minutes)

For your best model:

1. Using a threshold of 0.5, create predictions and a confusion matrix
2. Calculate sensitivity (true positive rate) and specificity (true negative rate)
3. In a clinical screening context, would you prefer higher sensitivity or specificity? Why?
4. Try a threshold of 0.3 - how does this change sensitivity and specificity?
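
To make the threshold trade-off concrete, here is a small sketch on made-up data. The helper `confusion_stats` and the toy vectors `actual` and `prob` are hypothetical and exist only for illustration; in the exercise you would pass `test_data$type` and your best model's predicted probabilities instead.

```r
# Sensitivity and specificity at a given probability threshold
confusion_stats <- function(actual, prob, threshold) {
  # Fixing factor levels keeps the table 2x2 even if one class is never predicted
  predicted <- factor(ifelse(prob > threshold, "Yes", "No"), levels = c("No", "Yes"))
  actual <- factor(actual, levels = c("No", "Yes"))
  cm <- table(Predicted = predicted, Actual = actual)
  c(
    sensitivity = cm["Yes", "Yes"] / sum(cm[, "Yes"]),  # TP / (TP + FN)
    specificity = cm["No", "No"] / sum(cm[, "No"])      # TN / (TN + FP)
  )
}

# Toy data (not from the Pima dataset)
actual <- c("Yes", "Yes", "Yes", "No", "No", "No", "No")
prob   <- c(0.9, 0.6, 0.4, 0.7, 0.35, 0.2, 0.1)

stats_50 <- confusion_stats(actual, prob, 0.5)
stats_30 <- confusion_stats(actual, prob, 0.3)
print(round(stats_50, 3))
print(round(stats_30, 3))
```

On the toy data, lowering the threshold from 0.5 to 0.3 raises sensitivity from 2/3 to 1 while specificity falls from 3/4 to 1/2, which is the general pattern to expect on the real predictions.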

In [None]:
# Use your best model's predictions
best_pred <- pred1  # Change this to your best model's predictions

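# Threshold = 0.5
# Fixed factor levels keep the table 2x2 even if one class is never predicted
pred_class_50 <- factor(ifelse(best_pred > 0.5, "Yes", "No"), levels = c("No", "Yes"))
table(Predicted = pred_class_50, Actual = test_data$type)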

# Your code here: Calculate sensitivity and specificity for threshold = 0.5
# Sensitivity = TP / (TP + FN)
# Specificity = TN / (TN + FP)


# Your code here: Try threshold = 0.3


## Part 6: Bootstrap Model Performance (Optional Challenge - 10 minutes)

Use bootstrapping to estimate the uncertainty in your model's AUC:

1. Bootstrap the test set 1000 times
2. For each bootstrap sample, calculate the AUC
3. Report the 95% confidence interval for the AUC
4. Is the AUC significantly better than 0.5 (random guessing)? That is, does the 95% interval exclude 0.5?
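
A package-free sketch of the whole procedure, assuming the glucose-only model and the split from Part 2. Here the AUC is computed directly as the proportion of "Yes"/"No" pairs in which the "Yes" case has the higher predicted probability, with ties counted as one half. The helper `auc_pairs` is a hypothetical name; the exercise cell uses `pROC` instead.

```r
library(MASS)  # provides Pima.tr and Pima.te

set.seed(123)
pima_full <- rbind(Pima.tr, Pima.te)
n <- nrow(pima_full)
train_idx <- sample(seq_len(n), floor(0.7 * n))
train_data <- pima_full[train_idx, ]
test_data <- pima_full[-train_idx, ]

model1 <- glm(type ~ glu, data = train_data, family = binomial)
pred1 <- predict(model1, newdata = test_data, type = "response")

# AUC as the fraction of positive-negative pairs ranked correctly
auc_pairs <- function(labels, scores) {
  pos <- scores[labels == "Yes"]
  neg <- scores[labels == "No"]
  mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
}

# Resample test-set rows with replacement and recompute the AUC each time
set.seed(456)
auc_boot <- replicate(1000, {
  idx <- sample(seq_len(nrow(test_data)), replace = TRUE)
  auc_pairs(test_data$type[idx], pred1[idx])
})

# Percentile 95% confidence interval
ci <- quantile(auc_boot, c(0.025, 0.975))
cat("Test AUC:", round(auc_pairs(test_data$type, pred1), 3), "\n")
cat("95% bootstrap CI:", round(ci, 3), "\n")
cat("Interval excludes 0.5:", unname(ci[1] > 0.5), "\n")
```

Because the model is held fixed and only the test rows are resampled, the interval reflects uncertainty from the finite test set, not from refitting the model.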

In [None]:
# Bootstrap AUC estimation
set.seed(456)

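bootstrap_auc <- function(data, predictions) {
  # Resample row indices with replacement
  idx <- sample(seq_len(nrow(data)), replace = TRUE)
  boot_data <- data[idx, ]
  boot_pred <- predictions[idx]
  # Fix levels and direction: automatic direction selection can flip per
  # resample and bias the bootstrap AUC upward
  roc_obj <- roc(boot_data$type, boot_pred,
                 levels = c("No", "Yes"), direction = "<", quiet = TRUE)
  return(as.numeric(auc(roc_obj)))
}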

# Your code here: Run 1000 bootstrap iterations
# auc_bootstrap <- replicate(1000, bootstrap_auc(test_data, best_pred))


# Your code here: Calculate 95% CI


## Group Discussion Questions

1. **Model Selection:** Which model would you recommend for clinical use? Consider:
   - Predictive performance (AUC)
   - Interpretability
   - Practical data availability

2. **Clinical Utility:** If this model were used for diabetes screening:
   - What threshold would you recommend?
   - What are the consequences of false positives vs false negatives?
   - How would you explain the model output to a patient?

3. **Limitations:** What are the limitations of this analysis?
   - Sample characteristics (Pima Indian women only)
   - Missing data handling
   - Temporal validation

4. **Next Steps:** What would you do to improve this model?
   - Feature engineering
   - Different algorithms
   - External validation

In [None]:
# Space for your group's notes and conclusions
