# Introduction

This tutorial shows how to compute sensitivity, specificity and
predictive values in R. It also shows how to obtain ROC curves based
on logistic regression.

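We will use the Pima Indians diabetes data set (https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database?resource=download). The code below loads the copies that ship with the R package MASS as Pima.tr and Pima.te.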

In [None]:
# Load necessary libraries
library(MASS)
library(pROC)

# Load the Pima Indians Diabetes dataset
data(Pima.tr, package="MASS")
data(Pima.te, package="MASS")

# Read about the data set
?Pima.te

# see a sample of the data
head(Pima.te)

Let's get some training data and some test data. Here we show how to split the data at random, although with this data set you could skip this step, because MASS already provides separate training (Pima.tr) and test (Pima.te) sets.

In [2]:
# Combine training and test data for splitting
pima_data <- rbind(Pima.tr, Pima.te)

# Splitting the data into training and testing sets
set.seed(123) # for reproducibility
trainingIndex <- sample(1:nrow(pima_data), 0.7*nrow(pima_data))
trainData <- pima_data[trainingIndex, ]
testData <- pima_data[-trainingIndex, ]
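
After a random split it can be worth checking the size of each set and the share of diabetes cases in each. The following is a minimal sketch that repeats the split above so that it runs on its own; the names n_train and n_test are illustrative and not part of the original notebook.

```r
library(MASS)

# Rebuild the combined data and repeat the 70/30 split
pima_data <- rbind(Pima.tr, Pima.te)
set.seed(123) # for reproducibility
trainingIndex <- sample(seq_len(nrow(pima_data)), floor(0.7 * nrow(pima_data)))
trainData <- pima_data[trainingIndex, ]
testData <- pima_data[-trainingIndex, ]

# Number of rows in each set
n_train <- nrow(trainData)
n_test <- nrow(testData)
c(train = n_train, test = n_test)

# Proportion of "No" and "Yes" in each set
prop.table(table(trainData$type))
prop.table(table(testData$type))
```

If the two proportions differ a lot, the split was unlucky and a different seed or a stratified split may be preferable.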

Now let's train our model. Here I will use only glucose (the variable glu) as the predictor.

In [None]:
# Training the Logistic Regression model using only glucose as the predictor
model <- glm(type ~ glu, data = trainData, family = 'binomial')
summary(model)
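
The coefficients printed by summary() are on the log-odds scale, which is hard to read directly. As a rough sketch of how they can be interpreted, the code below fits the same kind of model on the packaged training set Pima.tr (an assumption made so that the example runs on its own, not the random split used above) and converts the glucose coefficient to an odds ratio. The glucose value of 150 is an arbitrary illustrative choice.

```r
library(MASS)

# Fit a glucose-only logistic regression on the packaged training set
fit <- glm(type ~ glu, data = Pima.tr, family = "binomial")

# Odds ratio for a 1-unit increase in glucose
or_per_unit <- exp(coef(fit)["glu"])

# Odds ratio for a 10-unit increase in glucose
or_per_10 <- exp(10 * coef(fit)["glu"])

# Predicted probability of diabetes at a hypothetical glucose value of 150
p_150 <- predict(fit, newdata = data.frame(glu = 150), type = "response")

# The same probability by hand: inverse logit of the linear predictor
p_150_manual <- plogis(coef(fit)["(Intercept)"] + coef(fit)["glu"] * 150)

c(or_per_unit = unname(or_per_unit), or_per_10 = unname(or_per_10),
  p_150 = unname(p_150))
```

An odds ratio above 1 means that higher glucose is associated with higher odds of diabetes in the fitted model.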

Now we'll predict whether each individual in the test set has diabetes. The model returns a predicted probability of diabetes, and we classify an individual as diabetic when that probability exceeds a chosen threshold.

In [None]:
# Making predictions on the test set
predictions <- predict(model, newdata = testData, type = "response")
head(predictions)

# Binarize predictions based on a threshold (e.g., 0.5)
threshold <- 0.5
classPredictions <- ifelse(predictions > threshold, 1, 0)
head(classPredictions)
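
The choice of threshold trades sensitivity against specificity. The sketch below illustrates this on a small made-up set of probabilities and labels (not the Pima data); sens_spec is a hypothetical helper written for this example.

```r
# Toy predicted probabilities and true labels (1 = disease, 0 = no disease),
# made up for illustration
probs <- c(0.10, 0.30, 0.35, 0.45, 0.55, 0.60, 0.80, 0.90)
truth <- c(0,    0,    1,    0,    1,    0,    1,    1)

# Hypothetical helper: sensitivity and specificity at a given threshold
sens_spec <- function(probs, truth, threshold) {
  pred <- as.integer(probs > threshold)
  c(threshold   = threshold,
    sensitivity = sum(pred == 1 & truth == 1) / sum(truth == 1),
    specificity = sum(pred == 0 & truth == 0) / sum(truth == 0))
}

# One row per threshold
cutoffs <- c(0.3, 0.5, 0.7)
threshold_table <- t(sapply(cutoffs, function(th) sens_spec(probs, truth, th)))
threshold_table
```

In this toy example, raising the threshold lowers sensitivity and raises specificity. The ROC curve summarises this trade-off over all possible thresholds.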

Now we'll calculate the ROC curve and the AUC, and plot the curve.

In [None]:
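# Calculate the ROC curve and AUC
# levels = c(control, case); direction = "<" means cases are expected
# to have higher predicted values than controls
roc_curve <- roc(response = testData$type, predictor = as.numeric(predictions),
                 levels = c("No", "Yes"), direction = "<")
plot(roc_curve, main = "ROC Curve")
auc_value <- auc(roc_curve)
print(paste("AUC:", round(as.numeric(auc_value), 3)))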
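
The AUC can be read as the probability that a randomly chosen case receives a higher predicted value than a randomly chosen control. The sketch below computes it by hand in base R on made-up scores and labels (not the Pima data), once by comparing all case-control pairs and once with the equivalent rank-based (Mann-Whitney) formula. It is meant to illustrate the definition, not to replace pROC.

```r
# Toy scores and true labels (1 = case, 0 = control), made up for illustration
scores <- c(0.10, 0.30, 0.35, 0.45, 0.55, 0.60, 0.80, 0.90)
labels <- c(0,    0,    1,    0,    1,    0,    1,    1)

pos <- scores[labels == 1]
neg <- scores[labels == 0]

# Compare every case with every control; ties count as one half
comparisons <- outer(pos, neg, function(p, n) (p > n) + 0.5 * (p == n))
auc_pairs <- mean(comparisons)

# Equivalent rank-based (Mann-Whitney) formula
r <- rank(scores)
n_pos <- length(pos)
n_neg <- length(neg)
auc_rank <- (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

c(auc_pairs = auc_pairs, auc_rank = auc_rank)
```

Here 13 of the 16 case-control pairs are ordered correctly, so both calculations give 0.8125.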

Finally, we'll show the confusion matrix, from which sensitivity, specificity, PPV and NPV can be computed.

In [None]:
# The basic confusion matrix table (rows = actual, columns = predicted)
table(actual = testData$type, predicted = classPredictions)
#install.packages("yardstick")
library(yardstick)
library(dplyr)

# Convert actual and predicted values to factors with the same levels
# (0 = No, 1 = Yes), so both levels exist even if one is never predicted
actual <- factor(as.numeric(testData$type) - 1, levels = c(0, 1))
predicted <- factor(classPredictions, levels = c(0, 1))

# Create a tibble with the true and predicted values
results <- tibble(
  truth = actual,
  estimate = predicted
)

# Compute and print the confusion matrix
conf_matrix <- conf_mat(results, truth, estimate)
print(conf_matrix)

# Compute and print accuracy
accuracy_result <- accuracy(results, truth, estimate)
print(accuracy_result)

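# Compute and print sensitivity
# yardstick treats the first factor level as the event by default;
# here the event (diabetes) is the second level, 1
sensitivity_result <- sens(results, truth, estimate, event_level = "second")
print(sensitivity_result)

# Compute and print specificity
specificity_result <- spec(results, truth, estimate, event_level = "second")
print(specificity_result)

# Compute and print the positive and negative predictive values
ppv_result <- ppv(results, truth, estimate, event_level = "second")
print(ppv_result)
npv_result <- npv(results, truth, estimate, event_level = "second")
print(npv_result)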
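
To make the definitions concrete, the four measures can also be computed by hand from the cells of a 2 x 2 confusion table. The sketch below uses made-up labels (not the Pima data) and base R only; the variable names are illustrative.

```r
# Toy true and predicted labels (1 = diabetes, 0 = no diabetes),
# made up for illustration
truth    <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
estimate <- c(1, 1, 1, 0, 1, 0, 0, 0, 0, 0)

# Fix the levels so that the table is always 2 x 2
cm <- table(truth    = factor(truth, levels = c(0, 1)),
            estimate = factor(estimate, levels = c(0, 1)))
cm

# Cells of the confusion table (row = truth, column = estimate)
tp <- cm["1", "1"] # true positives
fn <- cm["1", "0"] # false negatives
fp <- cm["0", "1"] # false positives
tn <- cm["0", "0"] # true negatives

sensitivity <- tp / (tp + fn) # share of true cases that are detected
specificity <- tn / (tn + fp) # share of non-cases that are correctly cleared
ppv <- tp / (tp + fp)         # share of positive predictions that are correct
npv <- tn / (tn + fn)         # share of negative predictions that are correct

c(sensitivity = sensitivity, specificity = specificity, ppv = ppv, npv = npv)
```

Sensitivity and specificity condition on the true status, whereas PPV and NPV condition on the prediction, so the predictive values depend on how common the disease is in the data.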

# Load data, Framingham data set

For the purpose of illustration we use the Framingham data. We work
with the coronary heart disease outcome (detected at any future
examination).


In [None]:
# Framingham <- read.table("https://publicifsv.sund.ku.dk/~tag/Teaching/share/data/Framingham.txt",sep=" ")
Framingham <- read.table("https://gist.githubusercontent.com/kkholst/60512439ce0fca7f07a79e2728a6a4d5/raw/95dd3640a55b94c74d03aa1e18bef3d5120d3510/framingham.txt",sep=",", header=TRUE)
head(Framingham)
Framingham$chd <- factor(Framingham$CHD>0,levels=c(FALSE,TRUE),labels=c("event-free","chd")) 
head(Framingham)

# More complex (longitudinal) Framingham set:
# frmgham2.csv must be in the working directory
framinghamLong <- read.csv("frmgham2.csv")
head(framinghamLong)

# Model training challenge

Use either the simplified data set or the more complex longitudinal data set to train a model that predicts overall cardiovascular disease. Split your data into a training set and a test set, and then plot the ROC curve of your model.