# Tutorial 04: Binary Responses

## Lecture and Tutorial Learning Goals:
After completing this week's lecture and tutorial work, you will be able to:

1. Describe the logistic regression estimation procedure (binary response variable).
2. Discuss the relationship between linear regression and logistic regression. Discuss the consequences of modelling data more suitable for logistic regression models as a linear regression model.
3. Interpret the estimated coefficients and $p$-values in the logistic regression.
4. Write a computer script to estimate a logistic regression and perform model diagnostics. Interpret and communicate the results from that computer script.

In [None]:
# Run this cell before continuing.
library(tidyverse)
library(broom)
library(ISLR)
library(dplyr)
# Install `arm` only if it is not already available
if (!requireNamespace("arm", quietly = TRUE)) install.packages("arm")
library(arm)

source("tests_tutorial_04.R")

## 1. Logistic Regression

Logistic Regressions are commonly used to model the probability of an event based on a set of observed covariates. As with Linear Regression, Logistic Regression can also be used to:

- estimate and test the true relation between different types of variables and a <u>*binary response*</u>.


- predict the probability of a <u>*binary response*</u> (aka, classifier)


### Recall from lecture:

The **response** variable $Y_i$ flags the successes (S) for a given event of interest: 

$$
Y_i =
\begin{cases}
1 \; \; \; \; \mbox{if the $i$th observation is S},\\
0 \; \; \; \; 	\mbox{otherwise.}
\end{cases}
$$

For example, $Y_i = 1$ if the $i$th passenger of the Titanic survived. 

In statistics, 

$$Y_i \sim \text{Bernoulli}(p_i)$$

where 

$$
E(Y_i) = p_i
$$

#### A regression model

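Since a probability must lie between 0 and 1, a linear regression (LR) is not appropriate for modelling the conditional expectation given a set of covariates, because the linear predictor on the right-hand side can take any real value (for simplicity, only one covariate is shown below):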

$$E(Y_i|X_i) = \beta_0 + \beta_1X_i$$
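To see the problem concretely, the sketch below fits an ordinary `lm()` to a small made-up binary response (the vectors `x` and `y` are illustrative and not part of the tutorial data). Some of the fitted values fall outside the $[0, 1]$ range, so they cannot be read as probabilities.

```r
# Hypothetical toy data: a binary response that switches from 0 to 1 at x = 0
x <- seq(-3, 3, by = 0.1)
y <- as.integer(x > 0)

# Fit a linear regression directly to the 0/1 response
linear_fit <- lm(y ~ x)

# Range of the fitted values: the lower end is below 0 and the upper end is above 1
range(fitted(linear_fit))
```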

Instead, we use the ***logistic curve*** to model the conditional expectation, in this case the conditional probability of success:

$$E(Y_i|X_i) = P(\left.Y_{i} = 1 \right| X_{i}) = p_i = \frac{e^{\beta_0 + \beta_1X_i}}{1+e^{\beta_0 + \beta_1X_i}}$$ 

With some algebra, we can get two other equivalent useful formulations:

\begin{equation*}
\mbox{logit}(p_i)=\log\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 X_i
\end{equation*} 

> Note that mathematically, instead of modeling the probability as a line we model a *function* of the probability as a line

\begin{equation*}
\left(\frac{p_i}{1 - p_i}\right) = e^{\beta_0 + \beta_1 X_i}
\end{equation*} 
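The three formulations can be checked numerically. The sketch below uses base R's `plogis()` (the logistic curve) and `qlogis()` (the logit); the grid of log odds values is an arbitrary illustration.

```r
# Linear predictor values (log odds) on an illustrative grid
log_odds <- c(-2, 0, 1.5)

# Logistic curve written out explicitly: e^eta / (1 + e^eta)
p_manual <- exp(log_odds) / (1 + exp(log_odds))

# Base R provides the same curve as plogis() and its inverse (the logit) as qlogis()
p <- plogis(log_odds)
odds <- p / (1 - p)

all.equal(p, p_manual)          # logistic curve
all.equal(qlogis(p), log_odds)  # logit(p) recovers the linear predictor
all.equal(odds, exp(log_odds))  # odds equal e^(linear predictor)
```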

#### Odds and odds ratios

The quantity $P(\left.Y_{i} = 1 \right| X_{i})/\left[1-P(\left.Y_{i} = 1 \right| X_{i})\right] = p_i/(1-p_i)$ is called the <font color='darkred'>**odds**</font>.

> Note the relation between odds of success and odds of failure.

<font color='darkred'>**Odds ratios**</font> are commonly used to compare odds of different groups:

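For example, suppose $X_i$ is a binary covariate with $X_i = 0$ for group $F$ and $X_i = 1$ for group $M$. Then

$\text{odds}_F = e^{\beta_0}$ and $\text{odds}_M = 
e^{\beta_0 + \beta_1} = e^{\beta_0} e^{\beta_1}$

Then, 
$$\frac{\text{odds}_M}{\text{odds}_F} =
e^{\beta_1}$$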
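As a quick numerical check of this relation, the sketch below simulates a binary covariate (coded 0 for group F and 1 for group M, an assumption made for the illustration) and compares the sample odds with the exponentiated coefficients of a logistic regression. All object names and probabilities are made up.

```r
set.seed(123)  # simulated data, for illustration only

# Hypothetical binary covariate: 0 = group F, 1 = group M
group <- rbinom(400, size = 1, prob = 0.5)
# Binary response with a different success probability in each group
y <- rbinom(400, size = 1, prob = ifelse(group == 1, 0.6, 0.3))

# Sample odds of success in each group
p_F <- mean(y[group == 0])
p_M <- mean(y[group == 1])
odds_F <- p_F / (1 - p_F)
odds_M <- p_M / (1 - p_M)

fit <- glm(y ~ group, family = binomial)

# exp(intercept) matches the odds for F; exp(slope) matches the odds ratio of M vs F
c(odds_F = odds_F, exp_b0 = unname(exp(coef(fit)[1])))
c(odds_ratio = odds_M / odds_F, exp_b1 = unname(exp(coef(fit)[2])))
```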


 

### Dataset

This tutorial will focus on the dataset `Default` from [*An Introduction to Statistical Learning*](https://www.statlearning.com/) (James et al., 2013). This is a dataset of $n = 10{,}000$ observations with the following variables:

- `default`: a binary response indicating whether the customer defaulted on their debt (`Yes` or `No`).
- `student`: a binary input variable indicating whether the customer is a student (`Yes` or `No`).
- `balance`: a continuous input variable indicating the remaining average balance of the customer's credit card (after the monthly payment).
- `income`: a continuous input variable indicating the customer's income.

In [None]:
head(Default)

### Estimation

Let's estimate the logistic regression given the data collected using *Maximum Likelihood Estimation (MLE)*. 
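To make the idea of MLE concrete, here is a minimal sketch on simulated data (not the `Default` data) that maximizes the Bernoulli log-likelihood directly with `optim()` and compares the result with `glm()`. The data-generating values and the helper names `neg_loglik` and `neg_score` are assumptions made for the illustration.

```r
set.seed(1)  # simulated data, for illustration only
n <- 500
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1 * x))
X <- cbind(1, x)  # design matrix with an intercept column

# Negative Bernoulli log-likelihood: -sum(y * eta - log(1 + exp(eta)))
neg_loglik <- function(beta) {
  eta <- drop(X %*% beta)
  -sum(y * eta - log1p(exp(eta)))
}

# Its gradient: -X'(y - p)
neg_score <- function(beta) {
  p <- plogis(drop(X %*% beta))
  -drop(t(X) %*% (y - p))
}

mle <- optim(c(0, 0), neg_loglik, gr = neg_score, method = "BFGS")

# Compare with glm(), which maximizes the same likelihood
glm_fit <- glm(y ~ x, family = binomial)
rbind(optim = mle$par, glm = unname(coef(glm_fit)))
```

The two rows should agree up to numerical tolerance, since both approaches target the same likelihood.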

**Question 1.1:**
<br>{points: 1}

To fit the model, we can use the function `glm()` and its argument `family = binomial` (required to specify the binary nature of the response). 

Now, let's put this into practice. We'll estimate a binary logistic regression utilizing the function `glm()` with `default` as the response and `student` as the input variable. The dataset is `Default`.
    
Store the model in an object named `default_logistic_student`. The `glm()` parameters are analogous to `lm()` (`formula` and `data`) with the addition of `family = binomial` for this specific model. 

In [None]:
# default_logistic_student <- 
#   ...(formula = ...,
#       data = ...,
#       family = ...)


# your code here
fail() # No Answer - remove if you provide an answer

summary(default_logistic_student)

In [None]:
test_1.1()

You can also get the exponentiated coefficients with the argument `exponentiate = TRUE` in `tidy()`.

In [None]:
default_logistic_student %>% 
    tidy(exponentiate = TRUE) 

**Note**: recall that the `std.error` column still refers to the non-exponentiated estimators, e.g., it is the standard error of $\hat{\beta}_0$, not of $e^{\hat{\beta}_0}$.

**Question 1.2**
<br>{points: 1}

Considering the estimated model `default_logistic_student`, what is the correct interpretation of $\hat{\beta}_\textit{student}$?

**A.** The odds of default are expected to be $49.9\%$ higher for non-student customers than for student customers.

**B.** The odds of default are expected to be $49.9\%$ higher for student customers than for non-student customers.

**C.** The odds of default are expected to be $40.5\%$ higher for non-student customers than student customers.

**D.** The odds of default are expected to be $40.5\%$ higher for student customers than non-student customers.

*Assign your answer to the object `answer1.2` (character type surrounded by quotes).*

In [None]:
# answer1.2 <- ...

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.2()

#### Multiple covariates

As we learned in linear regression (LR), a logistic regression can have multiple covariates of different types. The interpretation of their coefficients is analogous, except that in a logistic regression changes in the covariates are associated with changes in log odds (raw results) or in odds (exponentiated results).
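As a reminder of where the odds interpretation comes from, consider an additive model with covariates $X_{i1}, \dots, X_{ik}$ (notation introduced here for illustration only). Increasing $X_{ij}$ by one unit while holding the other covariates fixed gives

$$
\frac{\text{odds}(X_{ij} + 1)}{\text{odds}(X_{ij})} = \frac{e^{\beta_0 + \dots + \beta_j (X_{ij} + 1) + \dots + \beta_k X_{ik}}}{e^{\beta_0 + \dots + \beta_j X_{ij} + \dots + \beta_k X_{ik}}} = e^{\beta_j}
$$

That is, the log odds change by $\beta_j$ and the odds are multiplied by $e^{\beta_j}$, which corresponds to a change of $100\,(e^{\beta_j} - 1)\%$ in the odds.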

Note that in the following questions we use different possible interpretations for the coefficients of a logistic regression. 

**Question 1.3:**
<br>{points: 1}

In order to fit the model, we can use the function `glm()` and its argument `family = binomial` (required to specify the binary nature of the response). 

Let us use the function `glm()` to estimate a logistic regression. Using the `Default` dataset, we will fit an *additive* logistic model with `default` as the response and `student`, `balance`, and `income` as input variables (in that order).
    
Store the model in an object named `default_logistic_multiple`. The `glm()` parameters are analogous to `lm()` (`formula` and `data`) with the addition of `family = binomial` for this specific model. 

In [None]:
# default_logistic_multiple <- 
#   ...(...,
#       ...,
#       ...)

# your code here
fail() # No Answer - remove if you provide an answer

default_logistic_multiple %>%
    tidy() %>% 
    mutate_if(is.numeric, round, 3)

default_logistic_multiple %>%
    tidy(exponentiate = TRUE) %>% 
    mutate_if(is.numeric, round, 3)

In [None]:
test_1.3()

**Question 1.4**
<br>{points: 1}

Considering the estimated `default_logistic_multiple` model, which is (are) the correct interpretation(s) of $\hat{\beta}_\textit{student}$?

**A.** Since $1 / 0.524 = 1.908$, we estimate that the odds of *non-default* are $90.8\%$ higher for non-student customers than student customers while keeping the rest of the input variables constant.

**B.** Since $1 / 0.524 = 1.908$, we estimate that the odds of *non-default* for student customers are $90.8$ while keeping the rest of the input variables constant.

**C.** Since $1 / 0.524 = 1.908$, we estimate that the odds of *non-default* are $90.8\%$ higher for student customers than non-student customers while keeping the rest of the input variables constant.

**D.** Since $1 / 0.524 = 1.908$, we estimate that the odds of *default* are $90.8\%$ higher for student customers than non-student customers, while keeping the rest of the input variables constant.

**E.** Since $0.524 - 1 = - 0.476$, we estimate that the odds of *default* decline $47.6\%$ when customers are non-student, while keeping the rest of the input variables constant.

**F.** Since $0.524 - 1 = - 0.476$, we estimate that the odds of *default* decline $47.6\%$ when customers are student, while keeping the rest of the input variables constant.

*Assign your answer to the object `answer1.4`. Your answers have to be included in a single string indicating the correct options in alphabetical order and surrounded by quotes (e.g., "ABCD" indicates you are selecting the first four options).*

In [None]:
# answer1.4 <- ...

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.4()

**Question 1.5**
<br>{points: 1}

How would you interpret the coefficient of `balance` from the estimated `default_logistic_multiple` model?

> Note: provide an interpretation based on odds, *not* log odds

> *Your answer goes here.*

DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.


### 1.2 Inference

We can use this estimated model to make inferences about the population parameters, i.e., we can determine whether an input variable is statistically associated with the logarithm of the odds through hypothesis testing for the parameters $\beta_j$. 
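The sketch below illustrates, on simulated data, how the Wald test statistics and $p$-values reported for a logistic regression can be reproduced from the estimates and their standard errors, together with a Wald (normal-approximation) 95% confidence interval. The data and object names are made up. Note that other functions may build intervals for `glm` objects in a different way, so their output need not match these Wald intervals exactly.

```r
set.seed(2)  # simulated data, for illustration only
x <- rnorm(300)
y <- rbinom(300, size = 1, prob = plogis(0.3 + 0.8 * x))
fit <- glm(y ~ x, family = binomial)

coef_table <- coef(summary(fit))
estimate <- coef_table[, "Estimate"]
std_error <- coef_table[, "Std. Error"]

# Wald statistic and two-sided p-value for H0: beta_j = 0
z_value <- estimate / std_error
p_value <- 2 * pnorm(-abs(z_value))

# Wald (asymptotic normal) 95% confidence interval
ci_lower <- estimate - qnorm(0.975) * std_error
ci_upper <- estimate + qnorm(0.975) * std_error

cbind(estimate, std_error, z_value, p_value, ci_lower, ci_upper)
```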

**Question 1.6**
<br>{points: 1}

Report the estimated coefficients, their standard errors, and corresponding $p$-values by calling `tidy()` on `default_logistic_multiple`. Include the corresponding asymptotic 95% confidence intervals. 

_Store the results in the variable `default_logistic_multiple_results`._

In [None]:
# default_logistic_multiple_results <- 
#    ... %>%
#    ...(conf.int = TRUE) 

# your code here
fail() # No Answer - remove if you provide an answer

default_logistic_multiple_results

In [None]:
test_1.6()

**Question 1.7**
<br>{points: 1}

Use `tidy()` and the estimated model stored in `default_logistic_multiple` to obtain estimated coefficients and confidence intervals for the association between each variable and the *odds* of being in default. 

_Store the results in the variable `default_logistic_odds_results`._

In [None]:
# default_logistic_odds_results <- 
#    default_logistic_multiple %>%
#    ... %>%
#    mutate_if(is.numeric, round, 6)

# your code here
fail() # No Answer - remove if you provide an answer

default_logistic_odds_results

In [None]:
test_1.7()

**Question 1.8**
<br>{points: 1}

Using a **significance level $\alpha = 0.05$**, which inputs are statistically associated with the probability of default in `default_logistic_odds_results`?

**A.** The categorical input `student`.

**B.** The continuous input `balance`.

**C.** The continuous input `income`.

*Assign your answers to the object `answer1.8`. Your answers must be included in a single string indicating the correct options in alphabetical order and surrounded by quotes (e.g., `"ABC"` indicates you are selecting the three options).*

In [None]:
# answer1.8 <- 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.8()

### 1.3 Prediction

Besides inference, we can use an estimated logistic regression model to predict the probability of success. 

For example, suppose we want to predict the odds of default (relative to not being in default) for a student who has a credit card balance of \\$2200 and an income of \\$35000.

Mathematically, our predicted log odds will be 

\begin{gather*} 
\log \bigg( \frac{\hat{p}_\texttt{default}}{1 - \hat{p}_\texttt{default}}\bigg) = \underbrace{-10.869045}_{\hat{\beta}_0} + \underbrace{(-0.646776)}_{\hat{\beta}_1} \times 1 + \underbrace{0.005737}_{\hat{\beta}_2} \times 2200 + \underbrace{0.000003}_{\hat{\beta}_3} \times 35000 \approx 1.21
\end{gather*}

Next, by taking the exponential on both sides of the equation, we obtain our predicted *odds*: 

$$
\frac{\hat{p}_\texttt{default}}{1 - \hat{p}_\texttt{default}} = e^{1.21} = 3.36.
$$

Finally, solving the above for $\hat{p}_\texttt{default}$, we obtain our predicted probability of default

$$
\hat{p}_\texttt{default} = 3.36/4.36 = 0.7706
$$
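The arithmetic above can be reproduced in a few lines of R. This sketch only uses the rounded coefficient values printed in the equation (it does not refit the model), so small rounding differences from the fitted model's predictions are expected; the object names are illustrative.

```r
# Rounded coefficient estimates copied from the equation above
b0 <- -10.869045
b_student <- -0.646776
b_balance <- 0.005737
b_income <- 0.000003

# Student (indicator = 1) with balance 2200 and income 35000
log_odds <- b0 + b_student * 1 + b_balance * 2200 + b_income * 35000
odds <- exp(log_odds)
prob <- odds / (1 + odds)

# Log odds, odds, and probability, rounded to two decimals
round(c(log_odds = log_odds, odds = odds, prob = prob), 2)
```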

**Question 1.9**
<br>{points: 1}

Using `predict` and `default_logistic_multiple`, obtain the **odds** predicted above.

*Hint: Check the argument `type` when coding this prediction.*

*Assign your answer to the object `answer1.9`. Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# answer1.9 <- 
#     ...(...,
#         tibble(student = ..., balance = ..., income = ...),
#         type = ...) %>%
#     ... # how did you obtain the predicted odds above? 

# your code here
fail() # No Answer - remove if you provide an answer

answer1.9

In [None]:
test_1.9()

**Question 1.10**
<br>{points: 1}

We can also predict *probabilities* for classification purposes, i.e., whether the customer will default. Using the function `predict()` with the object `default_logistic_multiple`, obtain the estimated probability that a given customer will default. This customer is a `student` who has a credit card `balance` of `2200` and an `income` of `35000`.


*Hint: Check the argument `type` when coding this prediction.*

*Assign your answer to the object `answer1.10`. Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# answer1.10 <- 
#     ...(...,
#         tibble(...),
#         type = ...)

# your code here
fail() # No Answer - remove if you provide an answer

answer1.10

In [None]:
test_1.10()

### Fitted Values

Predicted values for in-sample observations are usually called *fitted* values, and they are stored as an element of the estimated model object (check `objects(default_logistic_multiple)`). In the next question, you will explore different ways of obtaining *fitted* values. 

But as noted in the previous questions, we can get different types of fitted values since the same model can be used to predict *probabilities, odds, and log odds* (linear component). 

It is important that you know which values you obtain when using different functions in R.
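To keep the three scales apart, the sketch below computes them by hand for a model fitted to simulated data (not the `Default` data): the linear component is the design matrix times the estimated coefficients, the odds are its exponential, and the probabilities are obtained through the logistic curve. Object names are illustrative.

```r
set.seed(3)  # simulated data, for illustration only
x <- rnorm(200)
y <- rbinom(200, size = 1, prob = plogis(-1 + 1.2 * x))
fit <- glm(y ~ x, family = binomial)

# Linear component computed by hand: design matrix times estimated coefficients
log_odds <- drop(model.matrix(fit) %*% coef(fit))
odds <- exp(log_odds)
prob <- plogis(log_odds)

# The three scales side by side for the first observations
head(cbind(log_odds, odds, prob), 3)

# The probabilities agree with the fitted values stored in the model object
all.equal(unname(prob), unname(fitted(fit)))
```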

**Question 1.11**
<br>{points: 1}

Using the function `augment()` with the object `default_logistic_multiple`, you can obtain fitted values and residuals along with the original data points. The `augment()` function adds many columns that we are not going to use at this time so we are selecting only a subset of the columns. 

Next, we will add to this tibble the predicted probabilities, which are stored in the object `default_logistic_multiple` under the name `fitted`. Call the column `pred_prob`.

Using these predicted probabilities, you can calculate the logit (log odds) for each observation, i.e., $\log(p/(1-p))$. Call the column `pred_line`, since it corresponds to the estimated linear component of the model.

Add the predicted odds to the tibble and call the column `pred_odds`.

*Assign your answer to the object `default_logistic_fitted`. Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# default_logistic_fitted <- default_logistic_multiple %>%
#               ...() %>%
#               dplyr::select(colnames(Default),.fitted) %>%
#               mutate(pred_prob = ...$fitted,
#                        pred_line = ...,
#                        pred_odds = ...(.fitted))

# your code here
fail() # No Answer - remove if you provide an answer

head(default_logistic_fitted,3)

In [None]:
test_1.11()

**Question 1.12**
<br>{points: 1}

Based on the results obtained in `default_logistic_fitted`, which of the following observations are correct?

**A.** The column `.fitted` in the output of `augment()` contains the predicted probabilities of default for each customer in the dataset.

**B.** The column `.fitted` in the output of `augment()` contains the predicted odds of default for each customer in the dataset.

**C.** The element `fitted` in `default_logistic_multiple` contains the predicted probabilities of default for each customer in the dataset.

**D.** The element `fitted` in `default_logistic_multiple` contains the predicted odds of default for each customer in the dataset.

**E.** The predicted odds can be calculated from `default_logistic_multiple$fitted` or `.fitted`.

**F.** The column `.fitted` computes the linear prediction, used in the logistic regression to model the log odds.

*Assign your answers to the object `answer1.12`. Your answers must be included in a single string indicating the correct options in alphabetical order and surrounded by quotes (e.g., `"ABC"` indicates you are selecting the first three options).*

In [None]:
# answer1.12 <- 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.12()

### Residuals

Different residuals have been proposed for GLMs. The **raw residuals** are, as usual, the differences between the observed and the fitted values. 

However, in most GLMs the observations have different variances. Thus, raw residuals are not directly comparable across observations and should be adjusted. Other residuals suggested for GLMs, such as Pearson, deviance, and standardized residuals, make this kind of adjustment.

Definitions of each of these types of residuals can be found [here](https://online.stat.psu.edu/stat501/lesson/15/15.4).
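For quick reference, for a binary response $y_i$ with fitted probability $\hat{p}_i$, the standard definitions of three of these residuals are written out below (the superscript labels are notation introduced here for illustration):

$$
\begin{aligned}
r_i^{\text{raw}} &= y_i - \hat{p}_i, \\
r_i^{\text{Pearson}} &= \frac{y_i - \hat{p}_i}{\sqrt{\hat{p}_i(1 - \hat{p}_i)}}, \\
r_i^{\text{deviance}} &= \text{sign}(y_i - \hat{p}_i)\sqrt{-2\left[y_i \log \hat{p}_i + (1 - y_i)\log(1 - \hat{p}_i)\right]}.
\end{aligned}
$$

The Pearson residual divides the raw residual by the estimated standard deviation of a Bernoulli response, which puts observations with different variances on a comparable scale.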

**Question 1.13: Residuals**
<br>{points: 1}

Fitted values are needed to compute *residuals*, the differences between the observed values and the values predicted by the estimated model. We'll learn more about residuals later in the course. But which type of fitted values do we need to use?

Since the observations are binary values, 0 and 1, the residuals are calculated using the predicted probabilities, i.e., `your_model$fitted`. However, there are also many types of residuals, which account for the difference in variance of each response.

In this question, we'll focus on the *residuals*.

The `augment()` function also computes residuals, but which version? Let's start by getting the residuals computed by `augment()`.

Next, we will add the residuals using the function `residuals()` on the estimated model `default_logistic_multiple`. Note that this function can compute different types of residuals: add the types `response`, `pearson`, and `deviance` to the tibble.

Compute the raw and the Pearson's residuals by hand to check your understanding.

Finally, add the standardized residuals using the function `rstandard` on the estimated model `default_logistic_multiple`.

*Assign your answer to the object `default_logistic_residuals`. Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# default_logistic_residuals <- ... %>%
#               ...() %>%
#               dplyr::select(colnames(Default),.fitted, .resid,.std.resid) %>%
#               mutate(pred_prob = ...,
#                      response = ...(... == "Yes", 1, 0),
#                      resid_raw = residuals(..., type = "response"),
#                      raw_byhand = ... - ...,
#                      resid_deviance = residuals(default_logistic_multiple),
#                      resid_pearson =...(default_logistic_multiple,"..."),
#                      pearson_byhand = .../sqrt(...),
#                      resid_standardized =rstandard(...))

# your code here
fail() # No Answer - remove if you provide an answer

head(default_logistic_residuals,3)

In [None]:
test_1.13()

**Question 1.14**
<br>{points: 1}

Based on the results obtained in `default_logistic_residuals`, which of the following observations are correct?

**A.** The column `.resid` in the output of `augment()` contains the values of the response minus the predicted probabilities of default for each customer in the dataset.

**B.** The function `residuals()` computes as a default the values of the response minus the predicted probabilities of default for each customer in the dataset.

**C.** The column `.resid` in the output of `augment()` contains the values of deviance residuals.

**D.** The function `residuals()` computes as a default the values of the deviance residuals.

**E.** The raw and Pearson's residuals can be computed using function `residuals()` changing the argument `type`.

*Assign your answers to the object `answer1.14`. Your answers must be included in a single string indicating the correct options in alphabetical order and surrounded by quotes (e.g., `"ABC"` indicates you are selecting the first three options).*

In [None]:
# answer1.14 <- 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.14()

**Question 1.15**
<br>{points: 1}

Residual plots are usually hard to interpret for models with discrete responses. Thus, plots with binned residuals have been proposed instead.

Use the function `binnedplot()` in the package `arm` to create a binned plot of the residuals. Call the plot `residual_plot`.

> Note 1: the argument `nclass` regulates the number of bins used. In this question we are using the default value of `nclass`, but you can try changing it to other values, e.g., `nclass = 10` will show fewer points.

> Note 2: it is recommended to use the residuals and the fitted values in the scale of the response for this plot, i.e., raw residuals and predicted probabilities.
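To illustrate the idea behind binning, the following sketch builds a binned residual plot by hand on simulated data, using base R only. It is a simplified version of the idea, not a reimplementation of `binnedplot()`; the number of bins (`n_bins`) and all object names are arbitrary choices for the illustration.

```r
set.seed(4)  # simulated data, for illustration only
x <- rnorm(1000)
y <- rbinom(1000, size = 1, prob = plogis(-0.5 + x))
fit <- glm(y ~ x, family = binomial)

prob <- fitted(fit)                             # predicted probabilities
raw_resid <- residuals(fit, type = "response")  # observed minus predicted

# Split the observations into bins of (nearly) equal size, ordered by fitted value
n_bins <- 20  # illustrative choice
bin <- cut(rank(prob, ties.method = "first"), breaks = n_bins, labels = FALSE)

# Average fitted value and average residual within each bin
binned <- data.frame(
  mean_fitted = as.numeric(tapply(prob, bin, mean)),
  mean_resid = as.numeric(tapply(raw_resid, bin, mean))
)

plot(binned$mean_fitted, binned$mean_resid,
     xlab = "Average predicted probability", ylab = "Average residual")
abline(h = 0, lty = 2)
```

Averaging within bins turns the two bands of 0/1 residuals into a cloud of points that should scatter around zero when the model fits well.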

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# y_resid <- default_logistic_residuals$...
# x_fit <- ...

# residual_plot <- ...(x_fit, y_resid)

# your code here
fail() # No Answer - remove if you provide an answer

residual_plot

In [None]:
test_1.15()

### Overdispersion

The variance of a binary response variable is a function of the mean: $p(1-p)$. This means that the estimate of the mean also provides an estimate of the variance of the response. 

Since the logistic regression is built assuming that the response is *Bernoulli*, the estimated $\hat{p}$ conditions the estimated variance of the response to be $\hat{p}(1-\hat{p})$.  

Unfortunately, in real applications, even in situations where the model seems to be estimating the mean well, the data's variability is not quite compatible with the model's assumed variance. <font style='color: darkred'>This misspecification in the variance affects the SE of the coefficients, not their estimates.</font>

A way around this problem is to estimate a dispersion parameter, usually called $\phi$, to correct the standard errors of our estimators. An easy implementation is to change the `family` argument to `quasibinomial`. Let's see an example.  
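Before fitting the quasi-binomial model to `Default`, here is a minimal sketch on simulated data of what the correction does. It assumes the usual estimator of $\phi$, the sum of squared Pearson residuals divided by the residual degrees of freedom, and checks that the quasi-binomial standard errors are the binomial ones multiplied by $\sqrt{\hat{\phi}}$. All object names are illustrative.

```r
set.seed(5)  # simulated data, for illustration only
x <- rnorm(500)
y <- rbinom(500, size = 1, prob = plogis(0.2 + 0.7 * x))

fit_binomial <- glm(y ~ x, family = binomial)
fit_quasi <- glm(y ~ x, family = quasibinomial)

# Dispersion estimate: sum of squared Pearson residuals over residual degrees of freedom
phi_hat <- sum(residuals(fit_binomial, type = "pearson")^2) / df.residual(fit_binomial)

se_binomial <- coef(summary(fit_binomial))[, "Std. Error"]
se_quasi <- coef(summary(fit_quasi))[, "Std. Error"]

phi_hat
summary(fit_quasi)$dispersion                            # same value, as reported by summary()
cbind(se_quasi, rescaled = se_binomial * sqrt(phi_hat))  # SEs scale by sqrt(phi_hat)
all.equal(coef(fit_binomial), coef(fit_quasi))           # estimates are unchanged
```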

In [None]:
default_logistic_quasi <- summary(glm(
    formula = default ~ student + balance + income,
    data = Default,
    family = quasibinomial))

default_logistic_quasi

In [None]:
# Run this cell to display the previous model for reference
tidy(default_logistic_multiple)

**Question 1.16**
<br>{points: 1}

Compare the estimated coefficients and the standard errors of the regression coefficients of the models `default_logistic_multiple` and `default_logistic_quasi`. Have these changed? 

What is the estimated overdispersion parameter in this case? Would you consider the original model `default_logistic_multiple` appropriate based on the new results?

> *Your answer goes here.*

DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.