# Lab 5 - logistic regression

STAT 450\
TA: Gian Carlo Diluvi\
February 7, 2025

In [None]:
# preamble
library(tidyverse)
ggplot2::theme_set(theme_classic())
options(repr.plot.width=15, repr.plot.height=7.5)

## Data loading

Let's load the data and plot it.

In [None]:
challenger <- readr::read_delim('o-ring-erosion-or-blowby.data', col_types = 'id') %>% 
    dplyr::mutate(distress = pmin(1L,num_distress),       # convert to binary
                 launch_temp = (launch_temp-32)*5/9) %>%  # convert to celsius
    dplyr::select(distress, launch_temp)
challenger

In [None]:
challenger %>% 
    ggplot(aes(x = launch_temp, y = distress)) +
    geom_point(size=4) +
    theme(text = element_text(size=25)) +
    scale_y_continuous(breaks=0:1) +
    labs(x="Launch temperature (°C)",
        y = "O-ring failure status")

## Fitting a logistic regression to the Challenger data

We use the `glm` function in R.

In [None]:
logit_mod <- glm(distress ~ launch_temp, family = "binomial", data = challenger)
summary(logit_mod)

The summary gives us the coefficient estimates and their standard errors,
and it also reports the number of Fisher scoring iterations the optimizer needed to find the MLE.

Now let's plot the estimated probability of failure as a function of temperature, i.e.,

$$
    \hat{p}_n = \frac{1}{1+\exp(-(\hat{\beta}_0 + \hat{\beta}_1 x_n))}.
$$

In [None]:
challenger %>% 
    ggplot(aes(x = launch_temp, y = distress)) +
    geom_point(size=4) +
    theme(text = element_text(size=25)) +
    scale_y_continuous(breaks=0:1) +
    labs(x="Launch temperature (°C)",
        y = "O-ring failure status") +
    geom_smooth(method = "glm", formula = y~x, method.args = list(family = "binomial"), se = FALSE)
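
To see that the fitted curve is just the formula above, here is a minimal sketch that computes $\hat{p}$ by hand from the fitted coefficients and compares it with `predict`. It runs on simulated data, not the Challenger data, and the names `sim`, `temp`, `fail`, and `fit` are made up for illustration.

```r
# Simulated stand-in data (NOT the Challenger data): failures get rarer as temp rises
set.seed(1)
sim <- data.frame(temp = runif(50, 10, 30))
sim$fail <- rbinom(50, 1, plogis(5 - 0.3 * sim$temp))

fit <- glm(fail ~ temp, family = "binomial", data = sim)
b <- coef(fit)  # b[1] is the intercept, b[2] is the slope

# Inverse-logit of the linear predictor, written out by hand
x_new <- c(12, 18, 24)
p_manual <- 1 / (1 + exp(-(b[1] + b[2] * x_new)))

# Same quantity via predict(); type = "response" returns probabilities
p_predict <- predict(fit, newdata = data.frame(temp = x_new), type = "response")

cbind(x_new, p_manual, p_predict)
```

The two probability columns should agree up to floating-point error. Without `type = "response"`, `predict` returns the linear predictor $\hat{\beta}_0 + \hat{\beta}_1 x$ instead.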

We can also predict the probability of a failure occurring at -0.6°C,
just like we would in a linear regression.

In [None]:
estimated_failure_prb <- function(x) predict.glm(logit_mod, newdata = data.frame(launch_temp = x), type="response")
print(paste0("Estimated probability of failure at -0.6°C: ", estimated_failure_prb(-0.6)))

### Now your turn!

We want to find the temperature $x$ at which the launch would be safe.

For illustration, suppose we want to guarantee that the predicted probability
of O-ring failure $\hat{p}(x)$ is at most 50%.

We can use the `uniroot` function from the `stats` package
to find the value of $x$ such that $\hat{p}(x) = 0.5$.
Since `uniroot` looks for values of $x$ where the function is 0, rather than 0.5,
we have to first modify our failure probability function from above:

In [None]:
myfn <- function(x) estimated_failure_prb(x)-0.5

That way, if `myfn(x) = 0` then $\hat{p}(x) = 0.5$.

Now we find the root of `myfn`:

In [None]:
uniroot(myfn, interval = c(10., 30.))

The value of `$root` is what we are looking for: $x = 18.219$.

Since the predicted probability function is monotone decreasing,
any temperature greater than 18.22°C has a predicted probability of failure
of at most 50%, which is what we wanted!

Check it out by yourself:

In [None]:
estimated_failure_prb(18.22)
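
As a sanity check on `uniroot`, note that for a logistic model the equation $\hat{p}(x) = p^*$ can also be solved by hand: applying the logit to both sides gives $x = (\mathrm{logit}(p^*) - \hat{\beta}_0)/\hat{\beta}_1$. Below is a small sketch comparing the two routes. It uses simulated data, not the Challenger data, and `sim`, `fit`, `p_hat`, and `target` are illustrative names.

```r
# Simulated stand-in data (NOT the Challenger data)
set.seed(2)
sim <- data.frame(temp = runif(200, 10, 30))
sim$fail <- rbinom(200, 1, plogis(6 - 0.35 * sim$temp))
fit <- glm(fail ~ temp, family = "binomial", data = sim)

# Predicted failure probability as a function of temperature
p_hat <- function(x) predict(fit, newdata = data.frame(temp = x), type = "response")

target <- 0.5

# Route 1: numerical root finding, as in the lab
root_numeric <- uniroot(function(x) p_hat(x) - target,
                        interval = c(10, 30), tol = 1e-9)$root

# Route 2: closed form; qlogis(p) is log(p / (1 - p))
b <- coef(fit)
root_closed <- unname((qlogis(target) - b[1]) / b[2])

c(numeric = root_numeric, closed_form = root_closed)
```

The closed form only works because the model has a single covariate and a logit link; `uniroot` is the more general tool.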

### Question 1

Suppose that we want the (predicted) probability of O-ring failure to be at most 5%.
**What is the minimum temperature $x$ that guarantees this?**

Provide your answer on Canvas!

Fill in the `...` below to find the answer.

In [None]:
#### YOUR CODE HERE
myfn <- function(x) estimated_failure_prb(x)-...
uniroot(..., interval = c(10., 30.))

Check that you are correct by plugging the `$root` value that you found above into the cell below:

In [None]:
#### YOUR CODE HERE
estimated_failure_prb(...)

## Nested models

Test your understanding of the likelihood ratio test for nested models.

Below, we generate a new covariate $z_n\sim\mathcal{N}(0,1)$.
Your task is to determine whether it's worth including
this additional covariate in the model.

Specifically, you wish to compare
$$
    \log\frac{p_n}{1-p_n} = \beta_0 + \beta_1 x_n
    \qquad\text{vs}\qquad
    \log\frac{p_n}{1-p_n} = \beta_0 + \beta_1 x_n + \beta_2 z_n,
$$
where $x_n$ is the temperature from the Challenger data.

In [None]:
set.seed(450)
challenger <- challenger %>%
    dplyr::mutate(z = rnorm(n = 23)) %>%
    dplyr::select(distress, launch_temp, z)

Below, we fit the small model, which only includes temperature:

In [None]:
mod_small <- glm(distress ~ launch_temp, family = "binomial", data = challenger)

Now your turn!

Fill in the `...` below to fit the large model to the data.

In [None]:
#### YOUR CODE HERE
mod_large <- glm(distress ~ launch_temp + ..., family = ..., data = challenger)

### Question 2

Print out the summary of each model by filling in the `...` below.

In [None]:
#### YOUR CODE HERE
summary(..._small)

In [None]:
#### YOUR CODE HERE
summary(..._large)

**What is the deviance of the small model? And the deviance of the large model?**

Provide your answer on Canvas.

### Question 3

Fill in the `...` below to find the observed test statistic and critical value.
Use a significance level of $\alpha = 0.05$.

*Hint: recall that $d$,
the degrees of freedom of the asymptotic $\chi_d^2$ distribution of the test statistic,
is the number of parameters of the large model minus the number of parameters of the small model.*

In [None]:
alpha = ...
d = ...

R_obs <- mod_...$deviance - mod_...$deviance
critical_value <- qchisq(p = 1.-alpha, df = d)

print(paste0("The observed test statistic is R = ", R_obs))
print(paste0("The critical value is X_{d,1-alpha}^2 = ", critical_value))

**What are the observed test statistic and the critical value?**

Provide your answer on Canvas.

### Question 4

**Do you have enough evidence to reject the null hypothesis?**

Provide your answer on Canvas. 
You have to state it as a full answer to the hypothesis test we set up.
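
If you want to double-check a hand-computed likelihood ratio test, R's `anova` function can compare two nested `glm` fits directly. The sketch below uses simulated data with a pure-noise covariate; it is not the Challenger data, and the names `sim`, `noise`, `fit_small`, and `fit_large` are made up for illustration.

```r
# Simulated stand-in data: y depends on x only; `noise` is unrelated to y
set.seed(3)
n <- 100
sim <- data.frame(x = rnorm(n), noise = rnorm(n))
sim$y <- rbinom(n, 1, plogis(0.5 + sim$x))

fit_small <- glm(y ~ x, family = "binomial", data = sim)
fit_large <- glm(y ~ x + noise, family = "binomial", data = sim)

# Analysis-of-deviance table with a chi-square test for the extra term
lrt <- anova(fit_small, fit_large, test = "Chisq")
lrt
```

The second row of the table reports the change in deviance, its degrees of freedom, and the corresponding $\chi^2$ p-value, which you can compare against your own test statistic and critical value.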


**You're done!** 
Go enjoy your Friday, 
but if you have time and want to check out the code for diagnostic plots,
see below.

I also have some info on other generalized linear models, 
in case you want to read more.

## Other useful things we learned about

Here is the code to generate the plots I showed in the slides, in case you want it.

In [None]:
challenger <- challenger %>% 
    dplyr::mutate(phat = predict.glm(mod_small, type="response")) %>% 
    dplyr::select(launch_temp, distress, phat)

**Value-prediction plot.**

Useful to make sure predictions are not way off.

On the x-axis you plot your predicted probability and 
on the y-axis you plot the true value of the response variable.

In [None]:
challenger %>% 
    ggplot(aes(x = phat, y = distress)) +
    geom_point(size=4) +
    theme(text = element_text(size=25)) +
    scale_y_continuous(breaks=0:1) +
    labs(x = expression(hat(p)),
        y = "O-ring failure status")

In general,
you want to have many points in the lower-left and upper-right corners,
and few points in the other two corners.
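
One way to put numbers on those four corners is to threshold $\hat{p}$ and cross-tabulate against the observed response. The sketch below does this on simulated data (not the Challenger data); the 0.5 cutoff and the names `sim`, `fit`, `pred_class`, and `conf_tab` are assumptions made for illustration.

```r
# Simulated stand-in data (NOT the Challenger data)
set.seed(4)
sim <- data.frame(x = rnorm(150))
sim$y <- rbinom(150, 1, plogis(-0.5 + 2 * sim$x))

fit <- glm(y ~ x, family = "binomial", data = sim)
sim$phat <- predict(fit, type = "response")

# Classify as 1 when phat exceeds the (arbitrary) 0.5 cutoff
pred_class <- factor(as.integer(sim$phat > 0.5), levels = 0:1)
actual <- factor(sim$y, levels = 0:1)

# Diagonal cells are the "good" corners; off-diagonal cells are the "bad" ones
conf_tab <- table(predicted = pred_class, actual = actual)
conf_tab

# Fraction of observations that land in the good corners
accuracy <- mean(pred_class == actual)
accuracy
```

The table depends on the cutoff you pick, so treat it as a rough summary of the plot rather than a replacement for it.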

**Generating data from predicted values.**

For each $x_n$ we can estimate $\hat{p}_n$ as before.
We can then generate $\tilde{Y}_n \sim \mathrm{Bernoulli}(\hat{p}_n)$
and plot the generated data against the true data.

In [None]:
challenger %>% 
    ggplot(aes(x=launch_temp)) +
    geom_point(aes(y=distress), size=5) +
    geom_point(aes(y=rbinom(23,1,phat)+rnorm(1,0,0.04)), color = "blue", alpha=0.5, size=3) +
    theme(text = element_text(size=25)) +
    scale_y_continuous(breaks=0:1) +
    labs(x="Launch temperature (°C)",
        y = "O-ring failure status")

Now do it multiple times!

In [None]:
plot <- challenger %>% 
    ggplot(aes(x=launch_temp)) +
    geom_point(aes(y=distress), size=5)
for(i in 1:100){
    plot <- plot +
    geom_point(aes(y=rbinom(23,1,phat)+rnorm(1,0,0.03)), color = "blue", alpha=0.1, size=3)
}
plot +
    geom_point(aes(y=distress), size=5) +
    theme(text = element_text(size=25)) +
    scale_y_continuous(breaks=0:1) +
    labs(x="Launch temperature (°C)",
        y = "O-ring failure status")
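
The same idea can be summarized numerically: generate many replicate datasets from the fitted probabilities and compare a summary, such as the total number of failures, with what was observed. This is only a sketch on simulated data (not the Challenger data), and `sim`, `fit`, `n_rep`, and `rep_totals` are illustrative names.

```r
# Simulated stand-in data (NOT the Challenger data)
set.seed(5)
sim <- data.frame(x = runif(60, 10, 30))
sim$y <- rbinom(60, 1, plogis(5 - 0.3 * sim$x))

fit <- glm(y ~ x, family = "binomial", data = sim)
phat <- predict(fit, type = "response")

# Each replicate draws one Bernoulli(phat_n) per observation and counts failures
n_rep <- 1000
rep_totals <- replicate(n_rep, sum(rbinom(length(phat), 1, phat)))

observed_total <- sum(sim$y)
observed_total
quantile(rep_totals, c(0.025, 0.975))  # middle 95% of replicated totals
```

For a logistic regression with an intercept, the fitted probabilities sum to the observed number of failures, so the replicated totals are centered on the observed total by construction. Other summaries, such as failures below a given temperature, are more informative checks.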

## Further reading: Generalized linear models

These are a generalization of linear models (duh) due to [Nelder and Wedderburn (1972)](https://www.jstor.org/stable/2344614?origin=crossref&seq=1#metadata_info_tab_contents)
(but also see
[McCullagh and Nelder (1989)](http://www.utstat.toronto.edu/~brunner/oldclass/2201s11/readings/glmbook.pdf)).
Logistic regression is a GLM, as are many other models.
The definition builds on the random and systematic components we discussed before.


**Random component.**

As before, we assume that the observations $y_1,\dots,y_N$ are realizations of random variables $Y_1,\dots,Y_N$.
The only condition here is that the distribution of each response variable $Y_n$
has to be in the [exponential family of distributions](https://en.wikipedia.org/wiki/Exponential_family).

Without getting into details,
it turns out that many popular distributions are in the exponential family:
- Bernoulli,
- Binomial,
- Gaussian,
- Poisson,
- Exponential,
- Gamma,
- Beta,
- many others!


**Systematic component.**

Here we also have the same basic principle.
If $\mu_n$ is the expectation of $Y_n$ (given the data!) then we try to estimate it
using a linear combination of covariates, $\beta_0+\beta_1 x_n$.

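The mean $\mu_n$ is often constrained (e.g., a probability must lie in $[0,1]$),
while $\beta_0+\beta_1 x_n$ can be any real number, so we cannot equate the two directly.
Instead, we use a *link function* $g$ such that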

$$
    g(\mu_n) = \beta_0+\beta_1 x_n, \qquad n=1,\dots,N.
$$

The link function is not unique,
and indeed different choices lead to different models.
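
In R, a link function and its inverse are stored in the family object passed to `glm`. The short sketch below evaluates the logit and probit links for a Bernoulli response; the variable names are illustrative.

```r
# Two different links for the same Bernoulli/binomial random component
logit_family <- binomial(link = "logit")
probit_family <- binomial(link = "probit")

mu <- c(0.1, 0.5, 0.9)

# linkfun maps a mean in (0, 1) to the real line
eta_logit <- logit_family$linkfun(mu)    # log(mu / (1 - mu))
eta_probit <- probit_family$linkfun(mu)  # standard normal quantile of mu

# linkinv maps the linear predictor back to a mean
mu_back <- logit_family$linkinv(eta_logit)

rbind(mu, eta_logit, eta_probit, mu_back)
```

Swapping `family = binomial(link = "probit")` into `glm` gives probit regression instead of logistic regression, with the same random component.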


**GLMs.**

A GLM can thus be specified as:

$$
    Y_n \sim f, \qquad f \text{ in exponential family with mean } \mu_n, \qquad
    g(\mu_n) = \beta_0 + \beta_1 x_n.
$$

Here are some popular combinations of random components and link functions:


| distribution of $Y_n$       | mean $\mu_n$ | link $g(\mu)$            | name                  |
|-----------------------------|--------------|--------------------------|-----------------------|
| $\mathcal{N}(\mu,\sigma^2)$ | $\mu$        | $\mu$                    | linear regression     |
| Bernoulli$(p)$              | $p$          | $\log \frac{\mu}{1-\mu}$ | logistic regression   |
| Poisson$(\lambda)$          | $\lambda$    | $\log(\mu)$              | Poisson regression    |
| Gamma$(a,b)$                | $a/b$        | $-1/\mu$                 | Gamma regression      |
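
Each row of the table corresponds to a `family` argument in `glm`. Here is a minimal sketch on simulated data (the names `sim`, `y_gauss`, and `y_count` are made up) showing that the Gaussian GLM with identity link reproduces `lm`, and how a Poisson regression is specified.

```r
# Simulated data with one Gaussian response and one count response
set.seed(6)
sim <- data.frame(x = rnorm(80))
sim$y_gauss <- 1 + 2 * sim$x + rnorm(80)
sim$y_count <- rpois(80, exp(0.3 + 0.5 * sim$x))

# Linear regression two ways: lm() and a Gaussian GLM with identity link
fit_lm <- lm(y_gauss ~ x, data = sim)
fit_gauss <- glm(y_gauss ~ x, family = gaussian(link = "identity"), data = sim)

# Poisson regression with log link
fit_pois <- glm(y_count ~ x, family = poisson(link = "log"), data = sim)

rbind(lm = coef(fit_lm), gaussian_glm = coef(fit_gauss))
coef(fit_pois)

# Default link of R's Gamma family
Gamma()$link
```

Note that R's `Gamma()` family defaults to the `inverse` link $1/\mu$, which differs from the $-1/\mu$ in the table only by a sign, so the fitted means are the same and the coefficients just flip sign.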

**Fitting GLMs.**

Other than for Gaussian data,
there are in general no closed-form solutions for the MLEs.
However, off-the-shelf iterative optimization algorithms can be used to find the MLE in each case.
These are already implemented in R (and Python and Julia and...).
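
To give a feel for what such an algorithm does, here is a rough sketch of Newton's method for logistic regression, checked against `glm`. It is an illustration on simulated data, not R's actual implementation (`glm` uses iteratively reweighted least squares with more careful numerics), and all names below are made up.

```r
# Simulated data for a logistic regression with one covariate
set.seed(1)
n <- 200
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + 1.2 * x))
X <- cbind(1, x)  # design matrix with an intercept column

beta <- c(0, 0)   # starting values
for (iter in 1:25) {
  p <- plogis(drop(X %*% beta))    # current fitted probabilities
  grad <- t(X) %*% (y - p)         # gradient of the log-likelihood
  W <- p * (1 - p)                 # Bernoulli variances
  info <- t(X) %*% (X * W)         # Fisher information (negative Hessian)
  step <- drop(solve(info, grad))  # Newton step
  beta <- beta + step
  if (max(abs(step)) < 1e-10) break  # stop once the update is negligible
}

glm_fit <- glm(y ~ x, family = binomial())
rbind(newton = beta, glm = coef(glm_fit))
```

The two rows should match to several decimal places. For the logit link, Newton's method and Fisher scoring coincide, which is why the `glm` summary reports "Fisher Scoring iterations".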