# Worksheet 04: Binary Responses

## Lecture and Tutorial Learning Goals:
After completing this week's lecture and tutorial work, you will be able to:

1. Describe the logistic regression estimation procedure (binary response variable).
2. Discuss the relationship between linear regression and logistic regression. Discuss the consequences of modelling data more suitable for logistic regression as a linear regression model.
3. Interpret the coefficients and $p$-values in the estimated logistic regression.
4. Write a computer script to perform logistic regression and perform model diagnostics. Interpret and communicate the results from that computer script.

In [None]:
library(tidyverse)
library(broom)
library(modelr)
library(titanic)

titan  <-
    titanic_train  %>%
    as_tibble() %>%
    rename(passenger_class = Pclass,
           passenger_id = PassengerId,
           ticket_number = Ticket)

colnames(titan) <- str_to_lower(colnames(titan))

options(repr.plot.width = 10, repr.plot.height = 5)

source("tests_worksheet_04.R")

## 0. Intro

So far, you have explored Multiple Linear Regression (MLR) as a way to model the mean of a numeric response variable, $Y$, given a set of covariates $\mathbf{X}$:

$$
E\left[Y\left|\mathbf{X}=\left(X_1,...,X_p\right)\right.\right] = \beta_0 + \beta_1X_1 + \ldots + \beta_pX_p
$$

However, in some situations, the MLR is not suitable. Over this week and the next, we are going to study two such situations that commonly arise in practice:

- **this week**: the case of dichotomous response variables (e.g., yes/no, success/failure, win/lose, sick/not sick)  

- **next week**: the case of response variables representing counts (e.g., number of cases of a rare disease in Vancouver in one year; the number of accidents on the Canada Highway in one month).

## 1. Logistic Regression

Logistic Regressions are commonly used to model the probability of an event based on a set of observed covariates. As with Linear Regression, Logistic Regression can also be used to:

- estimate and test the true relation between different types of variables and a <u>*binary response*</u>.


- predict the probability of a <u>*binary response*</u> (aka, classifier)

<br>

For example, we can use a logistic regression to

1. compare the presence of bacteria between groups taking a new drug and a placebo, respectively.
    - Response: *present* or *not present*

<br>

2. predict whether or not a customer will default on a loan given their income and demographic variables.
    - Response: *default* or *not default*

<br>

3. know how GPA, ACT score, and number of AP classes taken are associated with the probability of getting accepted into a particular university.
    - Response: *accepted* or *not accepted*
    

<br>


### The response variable:

The function in R to fit a logistic regression requires either a numerical response (0 and 1) or a `factor`, with two levels (note that R stores factors as integers). 

Mathematically, we have to construct a binary response $Y_i$ that flags the successes (S) for a given event of interest: 

$$
Y_i =
\begin{cases}
1 \; \; \; \; \mbox{if the $i$th observation is S},\\
0 \; \; \; \; 	\mbox{otherwise.}
\end{cases}
$$

In the examples above, we would set:

1. $Y_i = 1$ if the bacteria is present in the blood sample of the $i$th patient 


2. $Y_i = 1$ if the $i$th customer defaulted on their loan


3. $Y_i = 1$ if the $i$th student was accepted to a particular university


In statistics, we refer to each $Y_i$ as a Bernoulli trial with a $p_i$ probability of success, i.e., 

$$Y_i \sim \text{Bernoulli}(p_i)$$

where 

$$
E(Y_i) = p_i
$$
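
As a quick check of this identity (a one-line derivation from the definition of expectation for a discrete random variable, using the fact that $Y_i$ only takes the values 1 and 0):

$$
E(Y_i) = 1 \times P(Y_i = 1) + 0 \times P(Y_i = 0) = 1 \times p_i + 0 \times (1 - p_i) = p_i
$$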

**Note**: Logistic regression can also be used to model a *Binomial* response, which is defined as the number of successes in $n$ identical, independent trials with constant probability $p$ of success.

This worksheet will focus on the `Titanic` dataset (Dawson, Robert J. MacG. (1995), The ‘Unusual Episode’ Data Revisited. Journal of Statistics Education, 3. [doi:10.1080/10691898.1995.11910499.](https://www.tandfonline.com/doi/full/10.1080/10691898.1995.11910499)). 

The dataset contains survival and demographic information about the passengers.
  
- You can find details about the columns by typing `?titanic_train` (the `titan` tibble is a renamed copy of that data).

- We want to explore if there is any association between the information we have about the passengers and the probability of survival (see the `survived` column).

- Let's start by using the model we know, MLR, and see why it is less than adequate for this job. 

In [None]:
cat("\nThe titanic dataset has information about", nrow(titan ), "passengers.")

titan  %>%
    slice_sample(n = 3)

### Simple Linear Regression for Binary Data

#### A range problem

In this first exercise, we use a simple linear regression (SLR) to estimate the relation between `fare` and the response `survived`.

A SLR will model: 

$$
    \text{survived}_i = \beta_0 + \beta_1\times\text{fare}_i + \varepsilon_i
$$


$$
E\left[\ \text{survived}_i\ |\ \text{fare}_i\ \right] = \beta_0 + \beta_1\times \text{fare}_i
$$

- Remember that the linear regression models the conditional mean $E[\text{survived}\ |\ \text{fare}]$

> **From Probability**: the expected value of a *binary* random variable equals the probability of success

- Thus, in the model above, the line equals the probability of `survived` given `fare`:

$$E[\text{survived}\ |\ \text{fare}] =P(\left.\text{survived} = 1 \right| \text{fare})$$   

- However, probabilities are between 0 and 1 (always!), while a line is unbounded, so the model can give us outputs above 1 (or below 0). 
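
To see the range problem numerically before plotting the Titanic data, here is a minimal sketch on a small made-up dataset (the vectors `x_toy` and `y_toy` are hypothetical and are not part of `titan`). It fits a straight line to a 0/1 response with `lm()` and checks the fitted values at the extremes of `x_toy`.

```r
# Toy data (hypothetical, for illustration only): a 0/1 response that
# switches from failure to success as x grows.
x_toy <- 0:9
y_toy <- c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1)

# Fit a simple linear regression to the binary response.
toy_slr <- lm(y_toy ~ x_toy)

# Fitted "probabilities" at the smallest and largest x values.
toy_fitted <- predict(toy_slr, newdata = data.frame(x_toy = c(0, 9)))
toy_fitted

# A valid probability must lie in [0, 1]; flag fitted values that do not.
toy_fitted < 0 | toy_fitted > 1
```

With these toy numbers, the fitted value falls below 0 at `x_toy = 0` and above 1 at `x_toy = 9`, which is the behaviour a probability model should not have.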

**Question 1.1**
<br>{points: 1}

Create a plot of the data (using `geom_point()`) along with the estimated regression line (using `geom_smooth()` with `method = "lm"`). Include proper axis labels. The `ggplot()` object's name will be `titan_SLR_plot`.

> Always check that the response is a numeric variable. In this case it is; if it were categorical, you would not be able to fit an SLR to it. Recall that you can create numerical variables from categorical variables using `if_else()`.

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# Adjust these numbers so the plot looks good in your desktop.
options(repr.plot.width = 7, repr.plot.height = 5) 

# titan_SLR_plot <- 
#     titan %>%
#     ...
#   ...(aes(..., ...)) +
#   ...(aes(..., ...), method = ..., se = FALSE) +
#   labs(y = ..., x = ...) +
#   ggtitle(...) +
#   ylim(-0.5, 1.5) +
#   theme(text = element_text(size = 16.5)) +
#   scale_x_continuous(breaks = seq(0, 2500, 500))


# your code here
fail() # No Answer - remove if you provide an answer

titan_SLR_plot

In [None]:
test_1.1()

#### Discussion

Do you see any problems with our model? Discuss it with a peer.

### Logistic Regression: an alternative to LR

The problem stems from using the *linear* model to estimate a probability. 

Mathematically, the linear component $\beta_0 + \beta_1X_{i1} + ... +\beta_pX_{ip}$ can take any value, while a probability is **always** between 0 and 1.

> For simplicity, we'll use a model with only one covariate in all mathematical equations. However, the model can have many variables of different types!

A natural way to solve the "range" problem is to use a curve, instead of a line, whose values stay between 0 and 1. One such curve is the logistic curve:

$$E(Y_i|X_i) = P(\left.Y_{i} = 1 \right| X_{i}) = p_i = \frac{e^{\beta_0 + \beta_1X_i}}{1+e^{\beta_0 + \beta_1X_i}} \quad\quad\quad\quad\quad\quad\quad\quad [\text{Eq. 1}]$$ 

Note that <font style='color:darkred'>we are still using a linear component but not to model the conditional expectation directly</font>. 

With some algebra, we can show that:

\begin{equation*}
\log\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 X_i
\end{equation*} 

> Note that mathematically, instead of modeling the probability as a line we model a function of the probability as a line

This function is called the **logit**, and it is the *logarithm of the odds*. Thus, this model is known as **Logistic Regression**.
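
The algebra behind this step is short. A sketch, writing $\eta_i = \beta_0 + \beta_1 X_i$ as shorthand for the linear component (the symbol $\eta_i$ is introduced here only for this derivation) and starting from Eq. 1:

$$
p_i = \frac{e^{\eta_i}}{1 + e^{\eta_i}}
\quad\Rightarrow\quad
1 - p_i = \frac{1}{1 + e^{\eta_i}}
\quad\Rightarrow\quad
\frac{p_i}{1 - p_i} = e^{\eta_i}
\quad\Rightarrow\quad
\log\left(\frac{p_i}{1 - p_i}\right) = \eta_i = \beta_0 + \beta_1 X_i
$$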

### The odds

- Exponentiating both sides of the logit equation gives

$$
\frac{P(\left.Y_{i} = 1 \right| X_{i})}{1-P(\left.Y_{i} = 1 \right| X_{i})} = e^{\beta_0 + \beta_1 X_{i}}
$$

- The quantity $P(\left.Y_{i} = 1 \right| X_{i})/\left[1-P(\left.Y_{i} = 1 \right| X_{i})\right]$ is called <font color='darkred'>**odds**</font>.

<br>

- For the titanic case:
$$
\frac{P(\left.\text{survived}_i = 1 \right| \text{fare}_i)}{1-P(\left.\text{survived}_i = 1 \right| \text{fare}_i)} = e^{\beta_0 + \beta_1 \text{fare}_i}
$$

**Definition**: The ratio of $p_i$ to $1 - p_i$ is known as the **odds**, and it can be estimated by the ratio of the *number of successes* to the *number of failures*.

    
**Numerical Examples:**

For example, among 891 passengers, 342 survived. Thus, the odds of surviving are 342 to 549, or $342/549 \approx 0.62$ to $1$.

- Odds of $0.25 = 1/4$ mean that the passenger has a 1 in 5 chance of surviving.
- Odds of $0.5 = 1/2$ mean that the passenger has a 1 in 3 chance of surviving.
- Odds of $0.75 = 3/4$ mean that the passenger has a 3 in 7 chance of surviving.
- Odds of $1$ mean that the passenger has a 1 in 2 chance of surviving.
- Odds of $2$ mean that the passenger has a 2 in 3 chance of surviving.
- Odds of $3$ mean that the passenger has a 3 in 4 chance of surviving.
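
The conversions in the list above follow from $\text{odds} = p/(1-p)$ and, equivalently, $p = \text{odds}/(1+\text{odds})$. A minimal sketch in base R; the helper names `odds_to_prob()` and `prob_to_odds()` are made up for this illustration and are not part of any package.

```r
# Hypothetical helpers (not from any package): convert between odds and probability.
odds_to_prob <- function(odds) odds / (1 + odds)
prob_to_odds <- function(p) p / (1 - p)

# Odds used in the numerical examples.
example_odds <- c(0.25, 0.5, 0.75, 1, 2, 3)

# Corresponding probabilities: 1/5, 1/3, 3/7, 1/2, 2/3, 3/4.
example_probs <- odds_to_prob(example_odds)
example_probs

# Converting back recovers the original odds (up to floating-point error).
all.equal(prob_to_odds(example_probs), example_odds)
```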

**Odds versus Probability**

Note that *odds* are different from *probability*:

- odds close to 0 -> low chance of survival;
- as the probability of survival increases towards 1, the odds increase without bound.

**Question 1.2: Understanding the odds**
<br>{points: 1}

The Vancouver Canucks are playing against the Calgary Flames in the NHL final. The match will be at Rogers Arena, the Canucks' home. It is expected that out of 18,910 seats in the arena, 13,700 seats will be occupied by Canucks fans. During the match, prizes are randomly distributed among the seats. What are the odds that a Canucks fan wins a given prize? 

Assign your answer to an object named `answer1.2`.

In [None]:
#answer1.2 <- ...

# your code here
fail() # No Answer - remove if you provide an answer

answer1.2

In [None]:
test_1.2()

### The Logistic Curve

Let's take a look at $P(\left.Y_{i} = 1 \right| X_{i1})$ as a function of $X$: 

$$
P(\left.Y_{i} = 1 \right| X_{i1}) = \frac{e^{\beta_0 + \beta_1 X_{i1}}}{1 + e^{\beta_0 + \beta_1 X_{i1}}}\quad\quad\quad\quad\quad\quad\quad\quad [\text{Eq. 1}]
$$

> Note that under this model, the estimated probability is always between 0 and 1.  

**Question 1.3**
<br>{points: 1}

Let's see what the logistic curve looks like. In this exercise, you will plot the logistic curve to see how it behaves.

_Save the plot in an object named `logistic_curve`._

In [None]:
# logistic_curve <-
#     tibble(z = seq(-10,10,0.01),
#            logistic_z = ...) %>% 
#     ggplot(aes(z, ...)) + 
#     geom_line() +
#     geom_hline(yintercept = 1, lty=2) + 
#     geom_hline(yintercept = 0, lty=2) +
#     theme(text = element_text(size = 20)) + 
#     ggtitle("Logistic curve")

# your code here
fail() # No Answer - remove if you provide an answer

logistic_curve

In [None]:
test_1.3()

**Question 1.4:**
<br>{points: 1}

Let us plot the predictions of the binary logistic regression model on top of `titan_SLR_plot`. Use `geom_smooth()` with `method = "glm"` and `method.args = c(family = binomial)`.

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# titan_SLR_plot <- 
#     titan_SLR_plot +
#     ...(aes(..., ...),
#         method = ...,
#         method.args = c(family = ...), 
#         se = FALSE, color = "red") +
#     ggtitle("SLR and Logistic Regression")

# your code here
fail() # No Answer - remove if you provide an answer

titan_SLR_plot

In [None]:
test_1.4()

Much better, isn't it? 

### 1.1 Estimation

The question now is, how do we estimate the coefficients $\beta_j$'s? 

So far, in the case of linear regression, we have been using the Least Squares estimators. However, because the response of the logistic model is binary, this procedure is no longer appropriate.

A common method for estimating a logistic regression (and many other models) is *Maximum Likelihood Estimation (MLE)*. Details of MLE are outside the scope of this course, but we will still implement it using R! 

To fit the model, we can use the function `glm()` and its argument `family = binomial` (required to specify the binary nature of the response). 
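
As a minimal sketch of the call pattern, the code below fits a logistic regression on R's built-in `mtcars` data, used here only as a stand-in dataset (it is not part of this worksheet). The response `am` is a 0/1 variable (transmission type) and `wt` is the car's weight; the object names `toy_logistic` and `toy_probs` are arbitrary.

```r
# Stand-in example on the built-in mtcars data (not the Titanic data):
# am is coded 0/1, so it can be used directly as a binary response.
toy_logistic <- glm(formula = am ~ wt,
                    data = mtcars,
                    family = binomial)

# Estimated coefficients on the log-odds (logit) scale.
coef(toy_logistic)

# Fitted probabilities P(am = 1 | wt); they lie between 0 and 1.
toy_probs <- fitted(toy_logistic)
range(toy_probs)
```

The arguments `formula` and `data` work as in `lm()`; only `family = binomial` is new.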

#### One categorical covariate

Let's fit a logistic regression now using `sex` as covariate. 

$$
\log\left(\frac{P(\left.\text{survived} = 1 \right| \text{male})}{1-P(\left.\text{survived} = 1 \right| \text{male})}\right) = \beta_0 + \beta_1 X_\text{male}
$$

where $X_\text{male} = 1$ if passenger is male and 0 if female. 

> We know that `R` will create a dummy variable $X$ to include in the model in this case.

#### Interpretation

In LR, we were modelling 

$$
E[Y|X] = \beta_0 + \beta_1 X
$$

but in logistic regression, we are modelling the **logit**

$$
f(E[Y|X]) = \log\left(\frac{P(\left.Y = 1 \right| X)}{1-P(\left.Y = 1 \right| X)}\right) = \beta_0 + \beta_1 X
$$

therefore, we need to adjust our interpretation.



- Intercept: $\hat{\beta}_0$ represents the log odds of the reference group (e.g., female)

- Slope: $\hat{\beta}_1$ represents the difference in log odds between the treatment and the reference group (e.g., male vs. female)

**Exponentiated coefficients**

Since log-odds are difficult to interpret, it is common to also interpret the exponentiated version of the coefficients:

$$
\frac{P(\left.\text{survived}_i = 1 \right| \text{male}_i)}{1-P(\left.\text{survived}_i = 1 \right| \text{male}_i)} = \text{odds}_i = e^{\beta_0 + \beta_1 \text{male}_i}
$$

- Intercept: $e^{\hat{\beta}_0}$ represents the odds of the reference group (`female`)

- Slope: $e^{\hat{\beta}_1}$ represents the *odds ratio*, i.e., the ratio between the odds of the treatment (`male`) vs the odds of the reference group (`female`)
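
To see why $e^{\beta_1}$ is an odds ratio, evaluate the odds equation above at $\text{male}_i = 1$ and at $\text{male}_i = 0$ and divide (a short derivation using only the model as written above):

$$
\frac{\text{odds}_\text{male}}{\text{odds}_\text{female}} = \frac{e^{\beta_0 + \beta_1 \times 1}}{e^{\beta_0 + \beta_1 \times 0}} = \frac{e^{\beta_0}e^{\beta_1}}{e^{\beta_0}} = e^{\beta_1}
$$

Equivalently, $\text{odds}_\text{male} = e^{\beta_0} \times e^{\beta_1}$: the odds of the reference group multiplied by the odds ratio.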

*Run the cells below to compute these quantities. Read and follow the calculations.*

In [None]:
titan  %>%
    group_by(sex) %>%
    count(survived)

In [None]:
# Odds of surviving - female
233/81 # This is exp(beta0_hat), the exponentiated intercept

In [None]:
# Odds of surviving - male
109/468

In [None]:
# Odds of surviving - male: exp(beta1_hat) times the odds of surviving for females
233/81 * 0.08097

**Question 1.5:**
<br>{points: 1}

We'll estimate a binary logistic regression utilizing the function `glm()` with `survived` as the response and `sex` as the input variable. The dataset is `titan`.
    
Store the model in an object named `model_titanic_logistic_sex`. The `glm()` parameters are analogous to `lm()` (`formula` and `data`) with the addition of `family = binomial` for this specific model. 

In [None]:
# model_titanic_logistic_sex <- 
#   ...(formula = ...,
#       data = ...,
#       family = ...)


# your code here
fail() # No Answer - remove if you provide an answer

summary(model_titanic_logistic_sex)

In [None]:
test_1.5()

But interpreting the `logit` function itself is hard. So, we usually interpret the exponential of the coefficients, so we can talk in terms of **odds** instead of logits. 

**Coefficients for the odds**

In [None]:
tidy(model_titanic_logistic_sex, exponentiate = TRUE)

**Note**: the `std.error` column still refers to the non-exponentiated estimators, e.g., it refers to $\hat{\beta}_0$, not $e^{\hat{\beta}_0}$.

In [None]:
tidy(model_titanic_logistic_sex)

**Question 1.6**
<br>{points: 1}

Considering the results of the estimated `model_titanic_logistic_sex`, what is the correct interpretation of the "Intercept", $\hat{\beta}_0$?

Note that the following interpretations correspond to its exponential form, i.e., $e^{\hat{\beta}_0}$

**A.** A female passenger had a 1.0566 odds of surviving (i.e., the proportion of survivals relative to the proportion of deaths in the sample)

**B.** A female passenger had a 2.8765 odds of surviving (i.e., the proportion of survivals relative to the proportion of deaths in the sample)

**C.** A male passenger had a 1.0566 odds of surviving (i.e., the proportion of survivals relative to the proportion of deaths in the sample)

**D.** A male passenger had a 2.8765 odds of surviving (i.e., the proportion of survivals relative to the proportion of deaths in the sample)

**E.** A female passenger had a 2.8765 odds of dying (i.e., the proportion of death relative to the proportion of survival in the sample)

**F.** A male passenger had a 2.8765 odds of dying (i.e., the proportion of death relative to the proportion of survival in the sample)

In [None]:
# answer1.6 <- ...

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.6()

**Question 1.7**
<br>{points: 1}

Considering the results of the estimated `model_titanic_logistic_sex`, which of the following is(are) the correct interpretation(s) of $\hat{\beta}_\textit{male}$?

Note that the following interpretations correspond to its exponential form, i.e., $e^{\hat{\beta}_\textit{male}}$

**A.** The odds of surviving for males are $8\%$ higher than the odds of survival for female passengers.

**B.** The odds of surviving for males are about $92\%$ lower than the odds of survival for female passengers, $e^{-2.514} = 0.0809$

**C.** The odds of dying for males are about $92\%$ lower than the odds of survival for female passengers, $e^{-2.514} = 0.0809$

**D.** Being male decreases the log-odds of survival by $2.513710$ compared to females

**E.** Being female decreases the log-odds of survival by $2.513710$ compared to males

*Assign your answer to the object `answer1.7`. Your answers have to be included in a single string indicating the correct options **in alphabetical order** and surrounded by quotes (e.g., `"ABCDE"` indicates you are selecting the five options).*

In [None]:
# answer1.7 <- ...

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.7()

**Discussion**:

What would be the estimated **odds** of **dying** for male passengers?

#### One numeric covariate

Let's see a case where the model has only one continuous covariate:

$$
\log\left(\frac{P(\left.\text{survived} = 1 \right| \text{fare})}{1-P(\left.\text{survived} = 1 \right| \text{fare})}\right) = \beta_0 + \beta_1\times\text{fare}
$$

In [None]:
# Fitting Logistic Regression

model_titanic_logistic <- glm(formula = survived ~ fare, data = titan, 
                              family = 'binomial')
summary(model_titanic_logistic)

In [None]:
tidy(model_titanic_logistic)

**Interpretation:**

- Intercept - $\hat{\beta}_0 = -0.9413$: if the `fare` paid was 0 dollars, the logit is expected to be -0.9413.
- Slope - $\hat{\beta}_1 = 0.0152$: an increase of 1 dollar in the `fare` paid is associated with an increase in the logit of 0.0152.

You can also get the exponentiated coefficients with the argument `exponentiate = TRUE` in `tidy()`.

In [None]:
tidy(model_titanic_logistic, exponentiate = TRUE)

- Intercept: $e^{\hat{\beta}_0} = e^{-0.9413} = 0.3901$, if the `fare` paid was 0 dollars, the **odds** is expected to be 0.3901.
  
- Slope: $e^{\hat{\beta}_1} = e^{0.0152} = 1.0153$, it <font color='darkred'>**multiplies**</font> the **odds** by 1.0153, i.e., an increase of 1 dollar in the fare paid is associated with an increase in the odds of $1.53\%$ of its value.
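
Because the slope acts multiplicatively on the odds, a larger change in `fare` compounds. The sketch below plugs in the rounded estimates reported above ($\hat{\beta}_0 = -0.9413$, $\hat{\beta}_1 = 0.0152$) rather than refitting the model, so results may differ slightly from those of `model_titanic_logistic`. The names `b0`, `b1`, `odds_at()`, and `prob_at_50`, as well as the fare values used, are arbitrary choices for illustration.

```r
# Rounded estimates copied from the worksheet text (not refit here).
b0 <- -0.9413
b1 <- 0.0152

# Odds of surviving at a given fare: exp(b0 + b1 * fare).
odds_at <- function(fare) exp(b0 + b1 * fare)

# A 10-dollar increase multiplies the odds by exp(b1)^10 = exp(10 * b1),
# regardless of the starting fare.
odds_at(10) / odds_at(0)
odds_at(60) / odds_at(50)
exp(b1)^10

# Odds can be mapped back to a probability with odds / (1 + odds);
# plogis() computes the same thing directly from the log-odds.
prob_at_50 <- odds_at(50) / (1 + odds_at(50))
all.equal(prob_at_50, plogis(b0 + b1 * 50))
```

Note that the multiplicative effect on the odds is constant, but the corresponding change in probability is not: it depends on the starting fare.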

#### One categorical and one numeric covariate

Let's fit a logistic regression using `sex` and `fare`:

$$
\log\left(\frac{P(\left.\text{survived} = 1 \right| \text{male}, \text{fare})}{1-P(\left.\text{survived} = 1 \right| \text{male}, \text{fare})}\right) = \beta_0 + \beta_1 X_\text{male} + \beta_2\times\text{fare}
$$

where $X_\text{male} = 1$ if passenger is male and 0 if female. 

**Question 1.8:**
<br>{points: 1}

In order to fit the model, we can use the function `glm()` and its argument `family = binomial` (required to specify the binary nature of the response). 

Let us use the function `glm()` to estimate a binary logistic regression. Using the `titan` dataset, we will fit a binary logistic model with `survived` as the response and `sex` and `fare` as input variables.
    
Store the model in an object named `model_titanic_logistic_multiple`. The `glm()` parameters are analogous to `lm()` (`formula` and `data`) with the addition of `family = binomial` for this specific model. 

In [None]:
# model_titanic_logistic_multiple <- 
#   ...(...,
#       ...,
#       ...)

# your code here
fail() # No Answer - remove if you provide an answer

model_titanic_logistic_multiple  %>%
    tidy() %>% 
    mutate_if(is.numeric, round, 3)

model_titanic_logistic_multiple  %>%
    tidy(exponentiate = TRUE) %>% 
    mutate_if(is.numeric, round, 3)

In [None]:
test_1.8()

**Question 1.9**
<br>{points: 1}

Considering the `model_titanic_logistic_multiple` output, what is the correct interpretation of $\hat{\beta}_\textit{sex}$?

Note that the following interpretations correspond to its exponential form.

**A.** Being male changes the odds of surviving by different factors depending on the fare value.

**B.** For any constant fare value, being female changes the odds of surviving by a factor of $e^{-2.42276} = 0.0887$, or in other words, it decreases the odds of surviving by $91\%$.

**C.** For any constant fare value, being male changes the odds of surviving by a factor of $e^{-2.42276} = 0.0887$, or in other words, it decreases the odds of surviving by $91\%$.

**D.** None of the above.

*Assign your answer to the object `answer1.9` (character type surrounded by quotes).*

In [None]:
# answer1.9 <- ...

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.9()

**Question 1.10**
<br>{points: 1}

What is the correct interpretation of the regression equation's estimated slope for `fare`?

Note that the following interpretations correspond to its exponential form.

**A.** A $\$1$ increase in `fare` is associated with a $1.1\%$ increase in the odds of survival for either male or female passengers. 

**B.** A $\$1$ increase in `fare` is associated with a $1.1\%$ increase in the odds of survival only for male passengers. 

**C.** A $\$1$ increase in `fare` is associated with a $1.1\%$ increase in the odds of dying for either male or female passengers. 

**D.** A $\$1$ increase in `fare` is associated with a $1.1\%$ decrease in the odds of survival for either male or female passengers. 

*Assign your answer to the object `answer1.10` (character type surrounded by quotes).*

In [None]:
# answer1.10 <- ...

# your code here
fail() # No Answer - remove if you provide an answer


In [None]:
test_1.10()

In [None]:
# Plotting 

# Don't worry about this part of the code
data_for_plotting <-  
    tibble(fare = rep(seq(0, 499.99, 0.1), 2), 
           sex = c(rep('female', 5000), rep('male', 5000))) %>%
    add_predictions(model = model_titanic_logistic_multiple, var = 'pred', type = 'response')
##############################################

titan  %>%
    ggplot(aes(fare, survived, color = sex)) + 
    geom_point() + 
    geom_line(aes(y = pred), data = data_for_plotting) + 
    theme(text = element_text(size = 18))

### 1.4 Conclusions

- The (conditional) expectation of a binary response is the probability of success.

- An LR cannot be used to model the conditional expectation of a binary response since the range of the linear component extends beyond the interval $[0,1]$.

- Instead, one can model a function of the conditional probability. A common choice in logistic regression is to use the *logit* function (logarithm of odds)

- The interpretation of the coefficients depends on the type of variables and the form of the model:

The raw coefficients are interpreted as:

- log odds of a reference group
- difference of log odds of a treatment vs a control group
- changes in log odds per unit change in the input
    
The exponentiated coefficients are interpreted as:
- odds of a reference group
- odds ratio of a treatment vs a control group
- multiplicative changes in odds per unit change in the input
    
- The estimated logistic model can be used to make inference using Wald's test (see tutorial_04)

- The estimated logistic model can be used to make predictions (see tutorial_04)
    - the probability of success
    - the odds of success relative to failure