# MATH 3375 Examples Notebook #13

# Logistic Regression

We explore logistic regression to predict a binary response, starting with the **mtcars** data set.


In [None]:
#Look at data set
head(mtcars)

## Binary Feature: am

The **am** column is binary. A value of 0 means the car has an automatic transmission; a value of 1 means it has a manual transmission. According to the summary below, there are 19 cars with automatic transmission, and 13 with manual.

In [None]:
summary(as.factor(mtcars$am))

#### Another Way to Summarize Binary Data

Because the response variable has values that are either 0 or 1, the **_mean_** of the values gives the **_proportion_** of data points where the response variable is 1. In this data set, that tells us the proportion of cars with a manual transmission.

In [None]:
mean(mtcars$am)

In [None]:
tran_model0 <- glm(am~1,family="binomial",data=mtcars)
summary(tran_model0)

### Interpreting the Intercept in a Logistic Model with No Predictors

The prediction equation is as follows, where $p$ represents the _**probability the response variable is 1:**_

$$g(p) = \ln \left(\frac{p}{1-p}\right) = \widehat{\beta}_0 = -0.3795$$

Solving for $p$ gives us:

$$ p = \frac{e^{-0.3795}}{1+e^{-0.3795}} \approx 0.406247 $$

This is the same as the value we computed previously as the proportion of cars in the data set with a manual transmission.
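
The conversion between the intercept and the sample proportion can be checked directly in R. The following is a minimal sketch: it refits the intercept-only model under the name `fit0` (a name chosen here for illustration) and uses the base R functions `qlogis()` and `plogis()`, which compute the log-odds and its inverse.

```r
# Intercept-only model: the fitted intercept is the log-odds of a manual transmission
fit0 <- glm(am ~ 1, family = "binomial", data = mtcars)
b0 <- unname(coef(fit0)[1])

# Sample proportion of manual transmissions (13 of 32 cars)
p_hat <- mean(mtcars$am)

# qlogis() maps a probability to log-odds; plogis() maps log-odds back to a probability
log_odds_from_p <- qlogis(p_hat)
p_from_b0 <- plogis(b0)

print(c(intercept = b0, log_odds_of_proportion = log_odds_from_p))
print(c(p_from_intercept = p_from_b0, sample_proportion = p_hat))
```

Both pairs of values should agree up to the numerical tolerance of the fitting algorithm.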

Because we took no other features into account in this model, the model will simply predict that **_every_** car has a 40.6% chance of having a manual transmission!  We illustrate this by using the model to predict the probability for every row of our original data set.

Notice that we use **type="response"** in the predict statement to have the model compute the probability for us. _If this parameter is omitted, the predict function will only give you the log of the odds (in this case, -0.3795)._

In [None]:
model0_preds <- predict(tran_model0,mtcars,type="response")

data.frame(model0_preds, mtcars$am)
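
To see the effect of **type="response"**, the sketch below compares the two kinds of prediction for the intercept-only model. The names `fit0`, `log_odds`, and `probs` are illustrative and not part of the cells above.

```r
fit0 <- glm(am ~ 1, family = "binomial", data = mtcars)

# Default prediction type is "link": the log of the odds
log_odds <- predict(fit0, mtcars)

# type = "response" applies the inverse logit to return probabilities
probs <- predict(fit0, mtcars, type = "response")

# Converting the log-odds by hand should reproduce the probabilities
same <- isTRUE(all.equal(unname(plogis(log_odds)), unname(probs)))
print(same)
print(head(data.frame(log_odds, probs)))
```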

### More Useful Models

The above model isn't very useful, but it does give us a starting point by illustrating the simplest possible model.

Now we look at models where we use one or more features to predict the transmission type.


In [None]:
tran_model1 <- glm(am ~ disp, family="binomial", data=mtcars)
summary(tran_model1)

#### Interpreting the Model

The prediction equation is as follows, where $p$ represents the _**probability the response variable is 1:**_

$$g(p) = \ln \left(\frac{p}{1-p}\right) = \widehat{\beta}_0 + \widehat{\beta}_1 \times \text{disp} = 2.6308 - 0.0146 \times \text{disp}$$

Solving for $p$ gives us:

$$ p = \frac{e^{2.6308 - 0.0146 \times \text{disp}}}{1+e^{2.6308 - 0.0146 \times \text{disp}}} $$
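
One common way to read the slope is through the odds: each additional cubic inch of displacement multiplies the odds of a manual transmission by $e^{\widehat{\beta}_1}$, which is slightly less than 1 here because the slope is negative. The sketch below computes that factor and one example probability. The name `fit1` and the example value of 160 cubic inches are chosen only for illustration.

```r
fit1 <- glm(am ~ disp, family = "binomial", data = mtcars)
b <- coef(fit1)

# Multiplicative change in the odds for a 1 cubic inch increase in disp
odds_ratio_1 <- unname(exp(b["disp"]))

# The same idea for a 100 cubic inch increase
odds_ratio_100 <- unname(exp(100 * b["disp"]))

# Predicted probability of a manual transmission at disp = 160
p_160 <- unname(plogis(b["(Intercept)"] + b["disp"] * 160))

print(c(odds_ratio_1 = odds_ratio_1, odds_ratio_100 = odds_ratio_100, p_160 = p_160))
```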


#### Visualizing the Model

Since we only have one predictor in this model, we can use the above equation to _visualize_ the probabilities over many different possible values of the predictor (**disp**).

In [None]:
points <- seq(0,500,by=0.5)
logit_points <- 2.6308 - 0.0146*points
prob_points <- exp(logit_points)/(1+exp(logit_points))

plot(am~disp,data=mtcars,xlim=c(0,500))
points(points,prob_points,col="red",cex=0.25)
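
As an alternative to typing the rounded coefficients by hand, the same curve can be produced by asking the model for predictions on a grid of **disp** values. This is a sketch under the assumption that the model is refit as `fit1`; the data frame `grid` is a helper introduced here.

```r
fit1 <- glm(am ~ disp, family = "binomial", data = mtcars)

# Grid of displacement values; the column name must match the predictor name
grid <- data.frame(disp = seq(0, 500, by = 0.5))
grid$prob <- predict(fit1, newdata = grid, type = "response")

# Observed 0/1 outcomes with the fitted probability curve drawn on top
plot(am ~ disp, data = mtcars, xlim = c(0, 500))
lines(grid$disp, grid$prob, col = "red")
```

Because `predict` uses the unrounded coefficients, the curve may differ very slightly from the one drawn with the rounded equation.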

### Using the Model

This model will give different probabilities for the different cars in the data set, depending on the value of the **disp** feature in each row.  Let's look at how the model performs with our data points.

In [None]:
model1_preds <- predict(tran_model1,mtcars,type="response")

data.frame(model1_preds, mtcars$am)

### Using the Model to Predict Classification

To decide whether the model is useful, we can predict the _**classification**_ for each car (manual or automatic). To do this, we need to pick a threshold. 

_**How high should the predicted probability be to predict that the car has a manual transmission?**_ To answer this question, we will try a few different thresholds and see which one gives us the best result. 

It seems natural to try a threshold of 0.5: Let's say that if the probability is greater than 0.5, we will predict the car has a manual transmission. Otherwise, we predict it does not.

In [None]:
tran_class_0.5 <- as.integer(model1_preds > 0.5)
df_tran_class=data.frame(model1_preds,tran_class_0.5,mtcars$am)
df_tran_class
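
With a single predictor, a probability threshold is equivalent to a cutoff on **disp**. A probability above 0.5 means the log of the odds is above 0, and because the slope is negative, that happens exactly when $\text{disp} < -\widehat{\beta}_0 / \widehat{\beta}_1$. The sketch below checks this, using illustrative names (`fit1`, `disp_cutoff`) that are not defined elsewhere in the notebook.

```r
fit1 <- glm(am ~ disp, family = "binomial", data = mtcars)
b <- unname(coef(fit1))

# Displacement at which the predicted probability equals 0.5 (log-odds equal 0)
disp_cutoff <- -b[1] / b[2]

# Classify by probability threshold and by displacement cutoff
probs <- predict(fit1, mtcars, type = "response")
class_by_prob <- as.integer(probs > 0.5)
class_by_disp <- as.integer(mtcars$disp < disp_cutoff)

print(disp_cutoff)
print(all(class_by_prob == class_by_disp))
```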

### Summarizing the Model Performance

We can create a 'confusion matrix' to summarize how many cars were correctly classified using this threshold and how many were not.

**NOTE:** Ideally, we would save a separate test data set to evaluate model performance and compute metrics.  However, **mtcars** is quite small, so we have skipped that step in this example.

In [None]:
Predicted <- tran_class_0.5
Actual <- mtcars$am
table(Actual,Predicted)

### Model Metrics from the Table

The confusion matrix tells us that out of the 19 cars with _automatic_ transmission, 14 were correctly classified (0) and 5 were incorrectly classified. Out of the 13 cars with _manual_ transmission, 11 were correctly classified (1) and 2 were incorrectly classified.

Using the confusion matrix, we compute the following classification metrics. 

Recall that all of these calculations depend on where we choose to set our threshold. For now, we have chosen a threshold of 0.5.

In [None]:
confusion_mtx <- table(Actual,Predicted)

#Row sums: Total actual 'positive' (manual transmission) and actual 'negative' (automatic transmission)
ActualPos <- sum(confusion_mtx[2,])
ActualNeg <- sum(confusion_mtx[1,])

#Column sums: Total PREDICTED positive and negative
PredPos <- sum(confusion_mtx[,2])
PredNeg <- sum(confusion_mtx[,1])

#True Negative and True Positive: Count of CORRECT negative/positive predictions
TN <- confusion_mtx[1,1]
TP <- confusion_mtx[2,2]

#False Positive and False Negative: Count of INCORRECT positive/negative predictions
FP <- confusion_mtx[1,2]
FN <- confusion_mtx[2,1]

Sensitivity <- TP/ActualPos
Specificity <- TN/ActualNeg
FalseNegRate <- FN/ActualPos
FalsePosRate <- FP/ActualNeg

Precision <- TP/PredPos
NegativePredValue <- TN/PredNeg

Accuracy <- (TP+TN)/(ActualPos+ActualNeg)

data.frame(Sensitivity,Specificity,Precision,NegativePredValue,Accuracy)
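
Indexing the table by position assumes that both classes appear among the predictions. At an extreme threshold, one column could be missing. The sketch below wraps the same formulas in a hypothetical helper, `classification_metrics`, that fixes the factor levels so the table is always 2 by 2. From the counts above, we would expect a sensitivity of 11/13, a specificity of 14/19, a precision of 11/16, a negative predictive value of 14/16, and an accuracy of 25/32.

```r
# Hypothetical helper: metrics from 0/1 vectors of actual and predicted classes
classification_metrics <- function(actual, predicted) {
  # Fixing the levels keeps the table 2 x 2 even if a class is never predicted
  tab <- table(Actual = factor(actual, levels = c(0, 1)),
               Predicted = factor(predicted, levels = c(0, 1)))
  TN <- tab["0", "0"]; FP <- tab["0", "1"]
  FN <- tab["1", "0"]; TP <- tab["1", "1"]
  data.frame(
    Sensitivity = TP / (TP + FN),
    Specificity = TN / (TN + FP),
    Precision = TP / (TP + FP),
    NegativePredValue = TN / (TN + FN),
    Accuracy = (TP + TN) / sum(tab)
  )
}

fit1 <- glm(am ~ disp, family = "binomial", data = mtcars)
probs <- predict(fit1, mtcars, type = "response")
metrics_half <- classification_metrics(mtcars$am, as.integer(probs > 0.5))
print(metrics_half)
```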


In [None]:
#install.packages("pROC")
library(pROC)

In [None]:
roc_data=roc(Actual, Predicted, quiet=TRUE) 
plot(roc_data, print.auc=TRUE, main ="ROC curve - Logistic Regression with Threshold 0.5")
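
A note on what this AUC measures: because the predictions passed to `roc` are already 0 or 1, the curve has a single interior point, and the area under it reduces to the average of sensitivity and specificity. An ROC curve is more commonly built from the predicted probabilities themselves, where the AUC is the proportion of (manual, automatic) pairs in which the manual car receives the higher probability. The sketch below computes both quantities by hand in base R, so it does not depend on **pROC**. All names in it are illustrative.

```r
fit1 <- glm(am ~ disp, family = "binomial", data = mtcars)
probs <- predict(fit1, mtcars, type = "response")
actual <- mtcars$am

# AUC of the 0/1 predictions at threshold 0.5: mean of sensitivity and specificity
predicted <- as.integer(probs > 0.5)
sens <- mean(predicted[actual == 1] == 1)
spec <- mean(predicted[actual == 0] == 0)
auc_binary <- (sens + spec) / 2

# AUC of the probabilities: share of (manual, automatic) pairs ranked correctly,
# counting ties as one half
pos <- probs[actual == 1]
neg <- probs[actual == 0]
auc_probs <- mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))

print(c(auc_binary = auc_binary, auc_probs = auc_probs))
```

The second value does not depend on any threshold, which is why it is often used to compare models rather than to choose a cutoff.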

### Adjusting Threshold to Improve the Model

Now we will repeat the above steps with multiple thresholds to find the threshold that gives us the highest AUC.


In [None]:
par(mfrow=c(2,3))  #2 x 3 grid so that all six threshold plots fit in one figure

thresh <- c()
auc <- c()
for (i in 3:8) {
    t <- i/10
    Predicted <- as.integer(model1_preds > t)
    roc_data=roc(Actual, Predicted, quiet=TRUE) 
    plot(roc_data, print.auc=TRUE, main =paste("ROC curve - Threshold",t))
    thresh <- append(thresh,t)
    auc <- append(auc,roc_data$auc)
}




### Comparing Model Performance at Different Thresholds

We can examine the AUC of our model at the different thresholds in table and graph form below.

Of the thresholds tested, 0.4 and 0.5 have the same (highest) performance. For a more thorough analysis, we could try more finely tuned thresholds (such as 0.45, 0.53, etc.). This would be more useful if the data set were larger. 

In [None]:
data.frame(thresh,auc)
plot(thresh,auc,xlab="Threshold",ylab="AUC",type="b")
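
A finer search over thresholds can be sketched as follows. To keep the example self-contained, it does not call **pROC**. Instead it uses a hypothetical helper, `binary_auc`, that returns the average of sensitivity and specificity, which is the area under an ROC curve built from 0/1 predictions. The grid of thresholds from 0.05 to 0.95 is an arbitrary choice.

```r
fit1 <- glm(am ~ disp, family = "binomial", data = mtcars)
probs <- predict(fit1, mtcars, type = "response")
actual <- mtcars$am

# Hypothetical helper: AUC of the 0/1 predictions at a given threshold
binary_auc <- function(threshold) {
  predicted <- probs > threshold
  sens <- mean(predicted[actual == 1])   # share of manual cars predicted manual
  spec <- mean(!predicted[actual == 0])  # share of automatic cars predicted automatic
  (sens + spec) / 2
}

fine_thresh <- seq(0.05, 0.95, by = 0.01)
fine_auc <- sapply(fine_thresh, binary_auc)

# First threshold that attains the highest value on this grid
best_thresh <- fine_thresh[which.max(fine_auc)]
print(c(best_threshold = best_thresh, best_auc = max(fine_auc)))

plot(fine_thresh, fine_auc, xlab = "Threshold", ylab = "AUC", type = "l")
```

With only 32 cars, the curve is a step function with few distinct values, so any threshold chosen this way should be treated as a rough guide.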

### Calibration Plots

Regardless of the threshold selected to predict classification, the original predictions from the model are _**probabilities**_. Another type of evaluation for the model is an analysis of how well those probabilities are calibrated. 

Suppose several points in the data set have a predicted probability near 20%. This means the model is indicating those points have around a 20% chance of being 'positive' or 1 (in this example, 20% chance of being a manual transmission).  If the model is well calibrated, then of all the points with a predicted probability near 20%, the proportion that **_actually_** are positive should be near 20%.  

A **calibration plot** can help us visualize how well calibrated the model's predicted probabilities are.
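
The idea can also be checked by hand: group the cars by predicted probability, then compare the average prediction in each group with the proportion of cars in that group that actually have a manual transmission. The sketch below does this with four equal-width bins. The bin edges and the names `fit1`, `bins`, and `calib` are choices made for this illustration only.

```r
fit1 <- glm(am ~ disp, family = "binomial", data = mtcars)
probs <- predict(fit1, mtcars, type = "response")

# Assign each car to a bin based on its predicted probability
bins <- cut(probs, breaks = c(0, 0.25, 0.5, 0.75, 1), include.lowest = TRUE)

# Per bin: number of cars, average predicted probability, observed proportion manual
calib <- data.frame(
  n = as.vector(table(bins)),
  mean_predicted = as.vector(tapply(probs, bins, mean)),
  observed_prop = as.vector(tapply(mtcars$am, bins, mean)),
  row.names = levels(bins)
)
print(calib)
```

In a well-calibrated model, `mean_predicted` and `observed_prop` are close in every bin. With so few cars per bin, large gaps can occur by chance.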

In [None]:
#install.packages("predtools")
library(predtools)

In [None]:
Probs <- model1_preds
df_calibrate <- data.frame(Actual,Probs)
calibration_plot(data = df_calibrate, obs = "Actual", pred = "Probs", title = "Calibration Plot: Training Data", 
                 x_lim=c(0, 0.9), data_summary=TRUE)


#### Interpreting the Plot

* The diagonal line indicates perfect calibration (observed proportion is the same as predicted probability). 
* Points above the line have higher **_actual_** proportion than predicted; the model's predicted probability was too low.
* Points below the line have lower **_actual_** proportion than predicted; the model's predicted probability was too high.
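* Points are shown with a margin of error (_confidence intervals_). When the interval crosses the diagonal line, the observed proportion is consistent with the predicted probability, so the model calibration may still be reasonable even if the point itself is off the line. Wide intervals indicate that the observed proportion is highly variable.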

#### Important Notes for This Example

* All metrics reported are only for training data. To evaluate a model fully, the same analysis should be conducted with a separate test data set.
* The data set used in this example is very small. Therefore, the model is probably not particularly robust, and the metrics provide only very rough estimates of its performance.
