# Introduction

This tutorial shows how to compute sensitivity, specificity and
predictive values in R. It also shows how to obtain ROC curves based
on logistic regression.
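
For reference, the four quantities can be written in terms of the cells of a 2x2 confusion matrix. The symbols below are notation introduced here, not taken from the data: $TP$ and $FN$ count the diseased individuals the test classifies as positive and negative, and $FP$ and $TN$ count the disease-free individuals it classifies as positive and negative.

```latex
\begin{aligned}
\text{sensitivity} &= \frac{TP}{TP + FN}, &
\text{specificity} &= \frac{TN}{TN + FP}, \\
\text{PPV} &= \frac{TP}{TP + FP}, &
\text{NPV} &= \frac{TN}{TN + FN}.
\end{aligned}
```

Sensitivity and specificity condition on the true disease status, whereas the predictive values condition on the test result.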

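We will use the Pima Indians diabetes data set, which ships with the `MASS` package as `Pima.tr` and `Pima.te`. A related version is hosted on Kaggle: https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database?resource=download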

In [None]:
# Load necessary libraries
library(MASS)
library(pROC)

# Load the Pima Indians Diabetes dataset
data(Pima.tr, package="MASS")
data(Pima.te, package="MASS")

# Read about the data set
?Pima.te

# see a sample of the data
head(Pima.te)

Let's get some training data and some test data. Here we show how to split the data at random, even though you could skip this step for this data set, because `MASS` already provides it split into `Pima.tr` and `Pima.te`.

In [2]:
# Combine training and test data for splitting
pima_data <- rbind(Pima.tr, Pima.te)

# Splitting the data into training and testing sets
set.seed(123) # for reproducibility
trainingIndex <- sample(seq_len(nrow(pima_data)), floor(0.7 * nrow(pima_data)))
trainData <- pima_data[trainingIndex, ]
testData <- pima_data[-trainingIndex, ]

Now let's train our model. Here I will use only glucose (`glu`) as the predictor variable.

In [None]:
# Training the Logistic Regression model using only glucose as the predictor
model <- glm(type ~ glu, data = trainData, family = 'binomial')
summary(model)

Now we'll predict whether each individual has diabetes. The model returns a predicted probability for each individual, and we classify an individual as diabetic when that probability exceeds a chosen threshold.

In [None]:
# Making predictions on the test set
predictions <- predict(model, newdata = testData, type = "response")
head(predictions)

# Binarize predictions based on a threshold (e.g., 0.5)
threshold <- 0.5
classPredictions <- ifelse(predictions > threshold, 1, 0)
head(classPredictions)
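
The choice of threshold drives the trade-off between sensitivity and specificity. The sketch below illustrates this on a small made-up data set rather than on the Pima data; the vectors `truth` and `prob` and the helper `sens_spec_at()` are hypothetical names introduced only for this example.

```r
# Toy data (made up for illustration): true classes and predicted probabilities
truth <- factor(c("No", "No", "No", "No", "Yes", "Yes", "Yes", "Yes"),
                levels = c("No", "Yes"))
prob  <- c(0.10, 0.30, 0.45, 0.60, 0.35, 0.55, 0.70, 0.90)

# Hypothetical helper: sensitivity and specificity at a given threshold
sens_spec_at <- function(threshold, truth, prob, positive = "Yes") {
  predicted_pos <- prob > threshold   # same rule as in the cell above
  actual_pos    <- truth == positive
  c(threshold   = threshold,
    sensitivity = sum(predicted_pos & actual_pos) / sum(actual_pos),
    specificity = sum(!predicted_pos & !actual_pos) / sum(!actual_pos))
}

# One row per threshold
thresholds <- c(0.3, 0.5, 0.7)
threshold_table <- t(sapply(thresholds, sens_spec_at, truth = truth, prob = prob))
print(threshold_table)
```

On these toy numbers, lowering the threshold raises sensitivity and lowers specificity, and raising it does the opposite.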

Now we'll compute the ROC curve and the AUC, and plot the curve.

In [None]:
# Calculate the ROC curve and AUC
roc_curve <- roc(testData$type, as.numeric(predictions))
plot(roc_curve, main = "ROC Curve")
auc_value <- auc(roc_curve)
print(paste("AUC:", auc_value))
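
One way to read the AUC is as the probability that a randomly chosen case receives a higher predicted probability than a randomly chosen non-case, with ties counted as one half. The sketch below computes that quantity directly in base R on made-up data. The vectors `truth` and `prob` are hypothetical, and this is an illustration of the interpretation, not a description of how `pROC` computes the value internally.

```r
# Toy data (made up for illustration): 1 = case, 0 = non-case
truth <- c(0, 0, 0, 0, 1, 1, 1, 1)
prob  <- c(0.10, 0.30, 0.45, 0.60, 0.35, 0.55, 0.70, 0.90)

pos <- prob[truth == 1]  # predicted probabilities for the cases
neg <- prob[truth == 0]  # predicted probabilities for the non-cases

# Compare every case with every non-case
wins <- outer(pos, neg, ">")
ties <- outer(pos, neg, "==")

# Proportion of pairs ranked correctly, ties counted as one half
auc_manual <- (sum(wins) + 0.5 * sum(ties)) / (length(pos) * length(neg))
print(auc_manual)
```

Here 13 of the 16 case/non-case pairs are ranked correctly, so the result is 0.8125.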

Finally, we'll show the confusion matrix, from which sensitivity, specificity, PPV and NPV can be computed.

In [None]:
# The basic confusion matrix: rows are actual classes, columns are predicted classes
table(actual = testData$type, predicted = classPredictions)
#install.packages("yardstick")
library(yardstick)
library(dplyr)

# Convert actual and predicted values to factors with matching levels (0 = No, 1 = Yes)
actual <- as.numeric(testData$type)  # factor codes: No = 1, Yes = 2
actual <- actual - 1
actual <- factor(actual, levels = c(0, 1))
predicted <- factor(classPredictions, levels = c(0, 1))

# Create a tibble with the true and predicted values
results <- tibble(
  truth = actual,
  estimate = predicted
)

# Compute and print the confusion matrix
conf_matrix <- conf_mat(results, truth, estimate)
print(conf_matrix)

# Compute and print accuracy
accuracy_result <- accuracy(results, truth, estimate)
print(accuracy_result)

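# Compute and print sensitivity
# yardstick treats the first factor level as the event by default;
# diabetes is coded 1 (the second level), so set event_level = "second"
sensitivity_result <- sens(results, truth, estimate, event_level = "second")
print(sensitivity_result)

# Compute and print specificity
specificity_result <- spec(results, truth, estimate, event_level = "second")
print(specificity_result)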
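
The predictive values can be read off the same confusion matrix. The sketch below does this in base R on made-up data; `truth`, `predicted` and the result names are hypothetical and chosen only for this example. The `yardstick` package also provides `ppv()` and `npv()`, which are not used here so that the example stays dependency-free.

```r
# Toy data (made up for illustration): 1 = disease, 0 = no disease
truth     <- factor(c(0, 0, 0, 0, 0, 0, 1, 1, 1, 1), levels = c(0, 1))
predicted <- factor(c(0, 0, 0, 0, 1, 1, 1, 1, 1, 0), levels = c(0, 1))

# Rows are true classes, columns are predicted classes
cm <- table(truth = truth, predicted = predicted)
print(cm)

tp <- unname(cm["1", "1"])  # true positives
fn <- unname(cm["1", "0"])  # false negatives
fp <- unname(cm["0", "1"])  # false positives
tn <- unname(cm["0", "0"])  # true negatives

ppv_value <- tp / (tp + fp)  # P(disease | predicted positive)
npv_value <- tn / (tn + fn)  # P(no disease | predicted negative)
cat("PPV:", ppv_value, "\n")
cat("NPV:", npv_value, "\n")
```

Unlike sensitivity and specificity, PPV and NPV depend on how common the disease is in the sample, so values computed on a test set apply only to populations with a similar prevalence.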

# Load data, Framingham data set

For the purpose of illustration we use the Framingham data. We work
with the coronary heart disease (CHD) outcome, detected at any future
examination.


In [None]:
# Framingham <- read.table("https://publicifsv.sund.ku.dk/~tag/Teaching/share/data/Framingham.txt",sep=" ")
Framingham <- read.table("https://gist.githubusercontent.com/kkholst/60512439ce0fca7f07a79e2728a6a4d5/raw/95dd3640a55b94c74d03aa1e18bef3d5120d3510/framingham.txt",sep=",", header=TRUE)
head(Framingham)
# Recode the outcome as a two-level factor: "event-free" vs "chd"
Framingham$chd <- factor(Framingham$CHD > 0, levels = c(FALSE, TRUE), labels = c("event-free", "chd"))
head(Framingham)

# More complex Framingham set (frmgham2.csv must be in the working directory):
framinghamLong <- read.csv("frmgham2.csv")
head(framinghamLong)

## In-Class Exercise: Cardiovascular Disease Prediction Model

Build a logistic regression model to predict cardiovascular disease using the Framingham Heart Study data.

### Using the Simple Dataset (`Framingham`)
Available predictors: `SBP` (systolic blood pressure), `DBP` (diastolic blood pressure), `CHOL` (cholesterol), `AGE`, `BMI`, `SMOKE`, `DIABETES`

### Your Task:

1. **Data Preparation:**
   - Explore the Framingham dataset structure
   - Split into 70% training / 30% test sets using `set.seed(456)` for reproducibility

2. **Model Building:**
   - Build a logistic regression model predicting `chd` using at least 3 predictors
   - Examine the model summary - which predictors are significant?

3. **Model Evaluation:**
   - Generate predicted probabilities on the test set
   - Create an ROC curve and calculate the AUC
   - Create a confusion matrix using threshold = 0.5

4. **Analysis Questions:**
   - What is your model's AUC? Is it better than random guessing (AUC = 0.5)?
   - Calculate sensitivity and specificity at threshold = 0.5
   - If you wanted to catch more true cases of CHD (higher sensitivity), how would you adjust the threshold?
   - What trade-off does this create?

### Bonus Challenge:
Try the more complex longitudinal dataset (`framinghamLong`) which has additional variables like `PREVCHD`, `PREVAP`, `PREVMI`, `PREVSTRK`. Can you build a better model?

In [ ]:
# Step 1: Data Preparation
set.seed(456)
train_idx <- sample(seq_len(nrow(Framingham)), floor(0.7 * nrow(Framingham)))
train_fram <- Framingham[train_idx, ]
test_fram <- Framingham[-train_idx, ]

cat("Training size:", nrow(train_fram), "\n")
cat("Test size:", nrow(test_fram), "\n")

# Step 2: Build your model
# Your code here: glm(chd ~ ..., data = train_fram, family = binomial)


# Step 3: Model Evaluation
# Your code here: predictions, ROC curve, confusion matrix


# Step 4: Analysis - write your answers as comments
# 