# Generalized linear models: Binary logistic regression

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import arviz as az
import bambi as bmb

import matplotlib.pyplot as plt

plt.rcParams.update(
    {"mathtext.default": "regular", "figure.dpi": 300, "figure.figsize": (8, 8)}
)

In [None]:
# this silences some warnings from bambi/pandas

import warnings

warnings.filterwarnings("ignore", category=FutureWarning)

import logging

logger = logging.getLogger("pymc")
logger.setLevel(logging.ERROR)

In [None]:
rng = np.random.default_rng(seed=42)

*Gries, chapter 5.3 (pp. 293-316)*

Let us revisit a previous exercise on the size ('large' vs 'small') of teenagers' submissions of chat conversations to a sociolinguistic research project. 
Instead of performing a chi-squared test, we will now run a binary logistic regression. Unlike some simpler tests of association, a **binary logistic regression** model:
- includes a direction: response (output) variable versus predictor (input) variables
- can include many types of **predictor** variables, not just categorical ones (**the response variable, however, is always binary!**)
- allows us to inspect the impact of multiple predictors simultaneously
- and to include interactions between them

## Loading and inspecting the data

Load and inspect 'chat.tsv'.

In [None]:
chat = pd.read_csv("../../datasets/chat/chat.tsv", sep="\t")
chat

##### Question: Does the assumption of data independence hold? 
> Are there repeated measurements in the data, i.e. >1 observation for 1 subject? This is important to check, since **data independence is an assumption of regression models without random effects** (as well as of many other statistical tests). 

In [None]:
chat.subject_ID.is_unique

Now create a new binary variable: was the participant's submission to the research project large (>500 tokens of chat data) or small (<=500 tokens)? 

In [None]:
chat["sub_large"] = chat.nr_tokens > 500
chat

## What is a Logit?

You will often see people talk about 'logits' or 'logit link functions'. The logit is the log of the odds, $\log\frac{p}{1-p}$, and its inverse is a special curve (the logistic function) which is often used to describe the probability of **binary outcomes**. In other words, the output *must* be $A$ or $B$, but the relative probabilities of the two outcomes are *continuous*. The logistic function is:

$\Large\frac{1}{ 1 + e^{-x}}$

The curve looks like this:

In [None]:
x = np.linspace(-6, 6, 1000)
y = 1 / (1 + np.exp(-x))
ax = sns.lineplot(x=x, y=y)
ax.axhline(0.5, linewidth=0.5, color="black")
ax.axvline(0, linewidth=0.5, color="black")
plt.show()

The reason logits are important is that it is common (for maths reasons) for models to estimate their parameters on the *log-odds* scale, but as humans we would mostly prefer to read these as *pure probabilities* or *odds ratios*, as we have done earlier in the course. The logit and its inverse let us convert between those domains. This is also, by the way, why it is called 'logistic regression'.
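
To make the conversion concrete, here is a minimal sketch of moving between log-odds, odds and probabilities. The helper names `inv_logit` and `logit` are our own choices for illustration, not functions from Bambi or ArviZ.

```python
import math


def inv_logit(log_odds):
    """Logistic function: maps any real number (log-odds) to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))


def logit(probability):
    """Logit: maps a probability strictly between 0 and 1 to log-odds."""
    return math.log(probability / (1.0 - probability))


# Walk a few log-odds values through all three scales.
for example_log_odds in (-2.0, 0.0, 2.0):
    example_odds = math.exp(example_log_odds)
    example_prob = inv_logit(example_log_odds)
    print(
        f"log-odds {example_log_odds:+.1f} -> odds {example_odds:.2f} "
        f"-> probability {example_prob:.2f}"
    )
```

Note that log-odds of 0 correspond to odds of 1 and a probability of 0.5, which is where the two black lines cross in the plot above.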

As with `ols()`, the model fits a **regression line** that it will use to make predictions. Only two numbers are needed to mathematically define a line: an **intercept** (`Intercept` in the output) and a **slope** for the line (the predictor coefficient in the output, here `gender[male]`). For a binary predictor, the regression line does not run through all the datapoints, but rather connects the estimated output for the predictor levels: so here, the regression line runs through the output (**chance** of a large submission) for female and male participants (which are the two levels of the predictor 'gender').

> MATHS: Recall that a straight line in log space is a curvy exponential line in 'normal' numbers
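
As a rough sketch of what that line means for a binary predictor: the intercept is the log-odds for the reference level, and intercept plus slope is the log-odds for the other level. The coefficient values below are hypothetical, chosen only for illustration; they are not estimates from the chat data.

```python
import math


def inv_logit(log_odds):
    """Convert log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical coefficients on the log-odds scale (NOT fitted values).
demo_intercept = 0.8    # log-odds of a large submission for the reference level (female)
demo_slope_male = -0.9  # change in log-odds when moving from female to male

# The "line" only needs to be evaluated at the two predictor values, 0 and 1.
p_female = inv_logit(demo_intercept + demo_slope_male * 0)
p_male = inv_logit(demo_intercept + demo_slope_male * 1)

print(f"P(large | female) = {p_female:.2f}")
print(f"P(large | male)   = {p_male:.2f}")
```

A negative slope therefore always means a lower probability for the non-reference level, whatever the intercept is.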

## Binary logistic regression with one binary predictor

Let's now move on to a **binary logistic regression model**. You will notice that the syntax looks a lot like what you are already used to for linear modeling with `ols()`. That makes sense, because what we will be fitting now, is just a special ('generalized') variant of linear regression, that is adapted in a way so that it makes sense for a binary (rather than a numeric) response variable.

HOWEVER, like the linear model we just did, this will use Bayesian parameter estimation.

First, we only include gender as a predictor.

### Model fitting

We fit a Bambi model using `family="bernoulli"` and the same R-style formula language we have been learning. (In statsmodels this formula support is provided by the `patsy` package; Bambi uses the closely related `formulae` package.)

Then inspect the model with `summary()`. Again, we have the Bayesian style model diagnostics.

In [None]:
model = bmb.Model(
    "sub_large ~ gender",
    chat,
    family="bernoulli",
)
idata = model.fit(
    target_accept=0.9,
    random_seed=rng,
    idata_kwargs={"log_likelihood": True},
    progressbar=False,
)
az.summary(idata)

### Bayesian Model Diagnostics

Apart from the estimates, the main diagnostics here tell us how well the model explored the parameter space. Usually if anything is really bad, the software will warn you, and a full dive into tuning Bayesian estimation is beyond the scope of this course. However, here are the concepts:
- The `mcse` is the "Monte-Carlo standard error", and it tells us how much uncertainty in an estimate comes from the sampling process itself. Lower numbers are better.
- The `ess` is the "effective sample size", and it tells us how much information the sampling process was able to extract. Higher numbers are better. Some sources recommend an ESS of at least 1000, and less than 100 is almost certainly bad.
- The `r_hat` is a measure of how well the 'chains' have converged (how well they agree on the parameter estimates). The ideal value is 1, or very close to 1; anything else is cause for concern.
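
To give a feel for what `r_hat` measures, here is a simplified sketch of the classic Gelman-Rubin statistic on simulated chains. This is an assumption-laden illustration: ArviZ reports a more robust rank-normalised variant, so the numbers will not match `az.summary()` exactly, and `basic_r_hat` is a name we made up.

```python
import numpy as np


def basic_r_hat(chains):
    """Classic (non-rank-normalised) R-hat for an array of shape (n_chains, n_draws)."""
    chains = np.asarray(chains, dtype=float)
    n_draws = chains.shape[1]
    # Average variance inside each chain.
    within = chains.var(axis=1, ddof=1).mean()
    # Variance between the chain means, scaled up to the size of a single draw.
    between = n_draws * chains.mean(axis=1).var(ddof=1)
    pooled = (n_draws - 1) / n_draws * within + between / n_draws
    return float(np.sqrt(pooled / within))


demo_rng = np.random.default_rng(0)

# Four simulated chains that all explore the same distribution.
agreeing_chains = demo_rng.normal(0.0, 1.0, size=(4, 1000))
# The same chains, but two of them are stuck in a different region.
disagreeing_chains = agreeing_chains + np.array([[0.0], [0.0], [3.0], [3.0]])

print(f"chains agree:    r_hat = {basic_r_hat(agreeing_chains):.2f}")
print(f"chains disagree: r_hat = {basic_r_hat(disagreeing_chains):.2f}")
```

When the chains agree, the between-chain variance adds almost nothing and the ratio stays close to 1; when they disagree, it grows well above 1.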

Again, note that Bayesian modelling does not use $p$-values. We consider an effect to be 'significant' if it is almost certainly non-zero. For example, the 94% Highest Density Interval (HDI) for the effect of `gender[male]` lies entirely below zero. If we wanted, we could calculate a summary statistic for the estimate (we have the mean and standard deviation of an estimated normal distribution, we did this in session 2) but the Bayesian philosophy is simply to report the credible interval for the estimate.
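
The idea of "almost certainly non-zero" can be sketched directly on posterior draws. The draws below are simulated stand-ins (a hypothetical normal distribution), not the real posterior of `gender[male]`, and `hdi_from_draws` is our own helper; in practice ArviZ computes the HDI for you.

```python
import numpy as np


def hdi_from_draws(draws, prob=0.94):
    """Narrowest interval containing roughly `prob` of the draws."""
    sorted_draws = np.sort(np.asarray(draws, dtype=float))
    n = sorted_draws.size
    window = int(np.floor(prob * n))
    # Width of every candidate interval that spans `window` consecutive draws.
    widths = sorted_draws[window:] - sorted_draws[: n - window]
    start = int(np.argmin(widths))
    return float(sorted_draws[start]), float(sorted_draws[start + window])


demo_rng = np.random.default_rng(1)
# Hypothetical posterior draws for a slope on the log-odds scale.
fake_slope_draws = demo_rng.normal(loc=-0.8, scale=0.3, size=4000)

hdi_lower, hdi_upper = hdi_from_draws(fake_slope_draws)
share_below_zero = float((fake_slope_draws < 0).mean())

print(f"94% HDI: [{hdi_lower:.2f}, {hdi_upper:.2f}]")
print(f"share of draws below zero: {share_below_zero:.3f}")
```

If the whole interval sits on one side of zero, we can be fairly confident about the direction of the effect.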

At first glance, we can already derive from the coefficients table that male participants will make *fewer* large donations compared to female participants, as the estimate for `gender[male]` is negative. (So this means: the regression line between `gender[female]` and `gender[male]` goes down, moving from the reference group to the non-reference group, as it has a negative slope). In combination with the HDI, we can already conclude that boys make fewer large donations than girls. In some analyses, this interpretation will suffice. However, sometimes, you may like to get actual probabilities for your predictor levels from the model. You can obtain these in multiple ways, but the functions `interpret.predictions()` and `interpret.plot_predictions()` will automatically convert to probabilities.

In [None]:
tbl = bmb.interpret.predictions(model, idata, average_by="gender")
tbl

Note that because this is the simplest possible model, the predictions look a lot like the empirical probabilities in the data. Here we show a crosstab, and note that the `True` column for `sub_large` (which is what is being modelled) looks almost the same as our model predictions.

In [None]:
pd.crosstab(chat.gender, chat.sub_large, normalize="index")

> MATHS: Out of interest, we can convert the model parameters to probabilities ourselves using the inverse of the logit link function (just to make sure it works)
>
> $\Large\frac{1}{ 1 + e^{-x}}$


In [None]:
# Intercept mean estimate (base category is female)
x = az.summary(idata)["mean"]["Intercept"]
# apply the inverse logit (logistic function) to get a probability
1 / (1 + np.exp(-x))

The method you will need to know, though, is to use the predictions table or plot, as below:

In [None]:
bmb.interpret.plot_predictions(
    model,
    idata,
    ["gender"],
)
plt.tight_layout()
plt.show()

## Binary logistic regression with a categorical predictor

Let us now try to improve the model by adding the three-level categorical variable 'education' as a predictor. We use the formula API as before and simply add education as an independent predictor with `+`:

`sub_large ~ gender + education`

Here we again pass the optional argument `idata_kwargs={"log_likelihood": True}` when fitting, which we need to be able to run model comparisons.

In [None]:
model_ed = bmb.Model(
    "sub_large ~ gender + education",
    chat,
    family="bernoulli",
)
idata_ed = model_ed.fit(
    random_seed=rng,
    idata_kwargs={"log_likelihood": True},
    progressbar=False,
)
az.summary(idata_ed)

The effect of education is almost certainly not zero, so we conclude that it is associated with the outcome. Note that `general` has been chosen automatically as the reference level (just alphabetically), so the parameters are relative to that.

> EXERCISE: What does this mean? Remember what we are modelling, and then complete this sentence: "Based on the parameter estimates, we expect students in technical education to be...."

Now let's look at the predictions.

In [None]:
bmb.interpret.plot_predictions(
    model_ed,
    idata_ed,
    ["education", "gender"],
    fig_kwargs={"sharey": True},
)

plt.tight_layout()
plt.show()

> EXERCISE: Repeat the plot, but group first by gender, then by education

In [None]:
bmb.interpret.plot_predictions(
    model_ed,
    idata_ed,
    ["gender", "education"],
)

plt.tight_layout()
plt.show()

## Evaluating and Selecting Models—Bayesian Edition

Last session, we looked at some tools for model selection with frequentist models. In particular, we looked at AIC and BIC as tools to compare models.

For Bayesian models, the 'confidence' of the model is easily seen from the spread of the estimates, but this is not the same as the model's performance. To assess model performance we recommend the ELPD ('expected log predictive density'). This can be estimated *either* with LOO (leave-one-out cross-validation) or with WAIC (widely applicable information criterion); we recommend LOO.

To explain the intuition behind ELPD, consider the plot from the first sprint:

![plot](./linear_preds.png)

For observations (blue) that are in areas with a lot of predictions (orange), the *predictive density* is high. For observations by themselves (few predictions) the predictive density is low. By leaving out each observation in turn (LOO) and considering the density in that area, we get a feel for how well the model predicts in general.


## ELPD Comparison Plot
There is a built-in comparison that calculates the ELPD of each model using the LOO method as described, and a plot that compares their predictive power. This will automatically rank the models. Higher ELPD is better (and the plot arranges them so the models are ranked from top to bottom). The theory is very complicated, but it is similar 'in spirit' to the AIC/BIC measures. In particular, the ELPD also considers the effective number of parameters to automatically penalise models with more parameters.

## In Summary:
- HIGHER ELPD estimates mean better models
- Like AIC/BIC, ELPD is only useful to compare *different models for the same thing* to see which model has more predictive power
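
To make the leave-one-out idea tangible, here is a toy sketch that computes an exact LOO ELPD by hand. It rests on several assumptions: the data are invented, and the models are simple Beta-Bernoulli stand-ins (one shared probability versus one probability per gender) rather than the logistic regressions above. ArviZ does not refit the model like this; it uses an approximation (PSIS-LOO) based on the stored log-likelihood.

```python
import math


def loo_elpd(outcomes, prior_a=1.0, prior_b=1.0):
    """Exact leave-one-out ELPD for a Beta(prior_a, prior_b)-Bernoulli model."""
    n = len(outcomes)
    n_large = sum(outcomes)
    total = 0.0
    for y in outcomes:
        # Predicted P(y = 1) after seeing every observation except this one.
        p_large = (prior_a + n_large - y) / (prior_a + prior_b + n - 1)
        total += math.log(p_large if y == 1 else 1.0 - p_large)
    return total


# Invented data: 1 = large submission, 0 = small submission.
toy_girls = [1, 1, 1, 1, 1, 1, 0, 1, 1, 0]
toy_boys = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]

elpd_pooled = loo_elpd(toy_girls + toy_boys)                 # ignores gender
elpd_by_gender = loo_elpd(toy_girls) + loo_elpd(toy_boys)    # one rate per gender

print(f"ELPD, no predictor:     {elpd_pooled:.2f}")
print(f"ELPD, gender predictor: {elpd_by_gender:.2f}")
```

In this toy data the two groups behave very differently, so the model that knows about gender predicts the held-out observations better and gets the higher (less negative) ELPD.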

In [None]:
comp = az.compare(
    {
        "gender": idata,
        "gender + education": idata_ed,
    }
)
az.plot_compare(comp)
plt.show()

# Interactions

The last thing to explore for this model is the *interaction* between `gender` and `education`. When we added `+ education`, we were telling the model that we believe these predictors are *conditionally independent*. The [Wikipedia article](https://en.wikipedia.org/wiki/Conditional_independence) (for once) is quite good at the start, but the intuition is even simpler. We told the model that the effect of gender on submission size is the same for all education levels (that the effect of education is *independent*).

Look back at your plots above—you see that the relative 'shapes' of the predictions per column are the same, and the (imaginary) lines connecting the points would all be roughly parallel.
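
The 'parallel lines' follow directly from how an additive model builds its prediction. In the sketch below all coefficients are hypothetical, and the third education level name (`vocational`) is an assumption made for illustration only.

```python
import math


def inv_logit(log_odds):
    """Convert log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical log-odds coefficients, NOT estimates from the chat data.
demo_intercept = 0.9  # female, general education (the reference levels)
demo_gender_male = -0.7
demo_education_shift = {"general": 0.0, "technical": -0.5, "vocational": -1.1}

gender_gaps = []
for level, shift in demo_education_shift.items():
    female_log_odds = demo_intercept + shift
    male_log_odds = demo_intercept + shift + demo_gender_male
    gender_gaps.append(male_log_odds - female_log_odds)
    print(
        f"{level:<10} female p = {inv_logit(female_log_odds):.2f}, "
        f"male p = {inv_logit(male_log_odds):.2f}"
    )

print("gender gap in log-odds per education level:", gender_gaps)
```

On the log-odds scale the gap between male and female is identical at every education level, because the same `gender` coefficient is added each time. After converting to probabilities the gaps are only roughly equal, which is why the lines in the plots look roughly (not exactly) parallel.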

We don't actually know if this is the case. Let's find out.

## Interaction Terms

So, let us include the **interaction** between our two predictors (indicated in the formula with `*`): the goal is to investigate whether the effect of one predictor on the response is affected by the other predictor in some way. E.g. the influence of gender on submission size may differ depending on the participant's educational track. Or the impact of education on submission size may be different for boys versus girls.

Fit the model and look at the summary. In the output table, you will now not only see estimates for the main effects (gender and education), but for their interaction too. An interaction is added by combining predictors with `*` rather than with `+` (which is just an addition of main effects, no interaction). In this formula language, the syntax

> response ~ predictorA * predictorB

is shorthand for

> response ~ predictorA + predictorB + predictorA:predictorB

We note that when reporting models it is considered bad practice to only include the part `predictorA:predictorB`, and thus exclude the actual main effects that create the interaction. So, **when including an interaction, always use the `*` syntax!**
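
Here is a minimal sketch of what the extra `predictorA:predictorB` term does to the prediction. To keep it short we assume only two education levels, the coefficient values are hypothetical, and the dictionary keys are illustrative labels rather than the exact names Bambi prints.

```python
# Hypothetical log-odds coefficients for gender * education (NOT fitted values).
demo_coefs = {
    "intercept": 0.9,        # female, general education
    "male": -0.1,            # main effect of gender
    "technical": -0.5,       # main effect of education
    "male:technical": -1.2,  # interaction: extra shift for males in technical education
}


def demo_log_odds(is_male, is_technical):
    """Linear predictor with main effects plus the interaction term."""
    return (
        demo_coefs["intercept"]
        + demo_coefs["male"] * is_male
        + demo_coefs["technical"] * is_technical
        # The interaction column is the product of the two indicator columns.
        + demo_coefs["male:technical"] * is_male * is_technical
    )


gap_general = demo_log_odds(1, 0) - demo_log_odds(0, 0)
gap_technical = demo_log_odds(1, 1) - demo_log_odds(0, 1)

print(f"gender gap (log-odds) in general education:   {gap_general:.2f}")
print(f"gender gap (log-odds) in technical education: {gap_technical:.2f}")
```

With the interaction term, the effect of gender is no longer one fixed number: it is the main effect in the reference education level, and the main effect plus the interaction coefficient elsewhere.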

In [None]:
model_interact_ed = bmb.Model(
    "sub_large ~ gender*education",
    chat,
    family="bernoulli",
)
idata_interact_ed = model_interact_ed.fit(
    target_accept=0.9,
    random_seed=rng,
    idata_kwargs={"log_likelihood": True},
    progressbar=False,
)
az.summary(idata_interact_ed)

Below, we visualise the estimates for you using a **forest plot**, converting the estimates to odds ratios to make them easier to understand (recall that an odds ratio of 1 means "equally likely").
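
The conversion itself is just exponentiation. As a small sketch on single hypothetical coefficient values (the forest plot applies the same transformation to every posterior draw):

```python
import math

# Hypothetical coefficients on the log-odds scale (NOT fitted values).
demo_log_odds_coefs = {"no effect": 0.0, "negative effect": -0.7, "positive effect": 0.7}

demo_odds_ratios = {name: math.exp(value) for name, value in demo_log_odds_coefs.items()}

for name, odds_ratio in demo_odds_ratios.items():
    print(f"{name:<16} log-odds {demo_log_odds_coefs[name]:+.1f} -> odds ratio {odds_ratio:.2f}")
```

A coefficient of 0 becomes an odds ratio of exactly 1, negative coefficients land between 0 and 1, and positive coefficients land above 1. This is why the reference line in the plot is drawn at 1 rather than at 0.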

When looking at interactions, the rule is that if an interaction effect is significant, we don't analyse the predictors independently any more.

> EXERCISE: Interpret the plot. Do you think that `education` and `gender` are conditionally independent, or does the interaction have an effect?

In [None]:
_, ax = plt.subplots(figsize=(10, 6))

az.plot_forest(
    [
        idata_interact_ed,
    ],
    combined=True,
    colors=["orange", "mediumorchid"],
    transform=np.exp,
    linewidth=2.6,
    markersize=8,
    ax=ax,
)
ax.set_title("94% HDI—Odds Ratio")
ax.axvline(1, linestyle="--", linewidth=0.7, color="black")
plt.show()

Now let's compare our three models.

In [None]:
comp = az.compare(
    {
        "gender": idata,
        "gender + education": idata_ed,
        "gender * education": idata_interact_ed,
    }
)
az.plot_compare(comp)
plt.show()

According to the ELPD difference, it is not certain that the interaction improves the model, but it seems pretty likely. Let's see now what the predictions look like.

In [None]:
bmb.interpret.plot_predictions(
    model_interact_ed,
    idata_interact_ed,
    ["education", "gender"],
)

plt.tight_layout()
plt.show()

> EXERCISE: Interpret the prediction output above. Re-arrange the predictors (by gender then education). Interpret.

In [None]:
bmb.interpret.plot_predictions(
    model_interact_ed,
    idata_interact_ed,
    ["gender", "education"],
)

plt.tight_layout()
plt.show()

Finally, we can use `plot_comparisons()` to directly plot the *effect* of gender, **conditional on education**. The way to read this plot is "for each education level, what does changing from male to female do to the **odds** of donating a large submission?" You won't be asked to reproduce one of these plots, but you may need to interpret them.

> EXERCISE: Based on the plot below, complete the following sentence: "In general education, men and women are roughly equally likely to make a large submission. However ..."

In [None]:
fig, axs = bmb.interpret.plot_comparisons(
    model_interact_ed,
    idata_interact_ed,
    contrast={"gender": ["male", "female"]},
    conditional="education",
    comparison_type="ratio",
)
axs[0].set_title(r"Contrast Male $\rightarrow$ Female, Odds Ratio")
plt.tight_layout()
plt.show()

```
Version History

Current: v1.0.0

22/10/24: 1.0.0: first draft, BN
```