# Generalized linear models: Poisson regression

*Gries, chapter 5.4.3 (pp. 324-327)*

Let us revisit the exercise on teenagers' donation of chat conversations to a sociolinguistic research project. Instead of running a binary logistic regression (response: 'small' vs 'large' submission), we will now try to predict the actual number of words donated. **For modeling (unbounded) counts or frequencies, we use count regression**. **Recall that a 'regular' linear model is not adequate to model counts**: its predictions can range from $-\infty$ to $+\infty$, but a negative count makes no sense, and neither does a fractional one. Count distributions have a 'hard floor' at zero and only take integer values.

The prototypical count distribution is the *Poisson distribution*, named after the French mathematician Siméon Denis Poisson, hence the name 'Poisson regression' for the modeling of counts and frequencies. However, a Poisson distribution forces the variance to equal the mean, so Poisson models do not deal well with overdispersed counts (counts whose variance is larger than their mean), which are very common in practice. The correct choice of model is a matter of some debate. For this course, *in general* we recommend that you model basic count data with Bambi using `family='negativebinomial'`, but you can always test with `family='poisson'` and see if the model improves!

> MATHS: You can read more about modelling with negative binomials [in the Bambi examples](https://bambinos.github.io/bambi/notebooks/negative_binomial.html) but the short version is that the negative binomial distribution has an extra parameter compared to the Poisson, which allows it to fit data with longer tails.
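
To make the "extra parameter" concrete, here is a minimal sketch using NumPy only (not Bambi). It assumes the common mean/dispersion parameterisation, in which a negative binomial with mean `mu` and dispersion `alpha` has variance `mu + mu**2 / alpha`; the values of `mu` and `alpha` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

mu = 50.0      # shared mean (illustrative value)
alpha = 2.0    # negative binomial dispersion (illustrative value)

# Poisson: the variance equals the mean.
poisson_draws = rng.poisson(lam=mu, size=100_000)

# NumPy parameterises the negative binomial by (n, p). With n = alpha and
# p = alpha / (alpha + mu), the mean is mu and the variance is mu + mu**2 / alpha.
negbin_draws = rng.negative_binomial(n=alpha, p=alpha / (alpha + mu), size=100_000)

print(f"Poisson mean={poisson_draws.mean():.1f}, variance={poisson_draws.var():.1f}")
print(f"NegBin  mean={negbin_draws.mean():.1f}, variance={negbin_draws.var():.1f}")
print(f"NegBin theoretical variance: {mu + mu**2 / alpha:.1f}")
```

Both samples share the same mean, but the negative binomial sample is far more spread out. The smaller `alpha` is, the longer the tail; as `alpha` grows, the distribution approaches a Poisson.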

## Importing libraries

In [None]:
import arviz as az
import bambi as bmb
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Turn off logging from PyMC's NUTS sampler
import logging

logger = logging.getLogger("pymc")
logger.setLevel(logging.ERROR)

In [None]:
plt.rcParams.update(
    {"mathtext.default": "regular", "figure.dpi": 300, "figure.figsize": (8, 8)}
)

In [None]:
rng = np.random.default_rng(seed=42)

## Loading and inspecting the data

## Removing outliers

**We stress that it is heavily debated whether or not outliers should be deleted prior to Poisson regression.** While some argue that you should delete outliers because they may have too strong an influence on the model, others argue that outliers in Poisson distributions are really not that bad (compared to other distributions), and that all 'valid' datapoints should just be kept in.

Here, to illustrate the procedure once more, we show you how to delete outliers, but in the end we run all the analyses on the full data. A good exercise would be to rerun them on the cut data and compare the results.

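A widely applied cutoff treats as outliers all values that lie 1.5 times the interquartile range (IQR) or more above the third quartile, or 1.5 times the IQR or more below the first quartile. For right-skewed counts the lower cutoff frequently falls below zero, where no count can lie, so the code below applies only the upper cutoff.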
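
As a small worked example of both cutoffs, the sketch below uses NumPy on invented toy counts (not the chat data). `np.quantile` with its default settings uses the same linear interpolation as the pandas `quantile` method.

```python
import numpy as np

# Toy counts, invented for illustration; the last value is an obvious outlier.
counts = np.array([120, 150, 180, 200, 220, 260, 300, 340, 400, 5000])

q1, q3 = np.quantile(counts, [0.25, 0.75])
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Keep only the values strictly inside both fences.
kept = counts[(counts > lower_fence) & (counts < upper_fence)]
print(f"fences: ({lower_fence}, {upper_fence}); removed {counts.size - kept.size} value(s)")
```

In this toy example the lower fence is negative, so it can never exclude a count; only the upper fence does any work.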

In [None]:
chat = pd.read_csv("../../datasets/chat/chat.tsv", sep="\t")

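# Upper cutoff: third quartile plus 1.5 times the interquartile range
iqr = chat.nr_tokens.quantile(0.75) - chat.nr_tokens.quantile(0.25)
cut = chat.nr_tokens.quantile(0.75) + iqr * 1.5

# Keep only the observations below the upper cutoff
chat_cut = chat[chat.nr_tokens < cut]
print(f"{chat.shape[0] - chat_cut.shape[0]} observations removed.")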
chat_cut

## Run the model

The only change we need is to specify the model family; here it is `family="negativebinomial"`. Note that we fit the model on the full data (`chat`), not on `chat_cut`.

In [None]:
model = bmb.Model(
    "nr_tokens ~ gender",
    chat,
    family="negativebinomial",
)
idata = model.fit(
    target_accept=0.9,
    random_seed=rng,
    idata_kwargs={"log_likelihood": True},
    progressbar=False,
)
az.summary(idata)

## Check chains and posterior estimates

In [None]:
az.plot_trace(idata, compact=False)
plt.tight_layout()
plt.show()

## Check predicted means

In [None]:
bmb.interpret.plot_predictions(
    model,
    idata,
    ["gender"],
    fig_kwargs={"sharey": True},
)
plt.show()

> NOTE: the predicted means come from the mean estimates, which are on a log scale. The intercept (women are the base class) was 7.770. To convert log values back to token counts we use `np.exp`.

In [None]:
# This matches the mean prediction from the plot

np.exp(7.770)

## Check posterior predictive curves

It would be good to know that our model predictions look something like the data. Here we use some complicated code, modified from the [Bambi Examples](https://bambinos.github.io/bambi/notebooks/count_roaches.html). Don't worry about the code too much. The key thing to look at is whether the `observed` curve (our real observations) falls mostly within the posterior predictive estimates, that is, whether the model could plausibly have produced it.

The predictions are occasionally very large, so here we hide all values above an upper quantile of the observed token counts (the 0.8 quantile in the call below) to make the result easier to see. Overall, the predictions don't seem terrible. We are under-predicting for low token counts, but not drastically. Note that we did NOT remove outliers from the data used to fit the model; the "cutting" here only affects the plot, so that we can see the fit at the low end where most of the observations lie.

In [None]:
def plot_posterior_ppc(model, idata, cut=None):
    """Plot a posterior predictive check, optionally hiding values >= `cut`."""
    var_name = "nr_tokens"
    # Draw posterior predictive samples into a copy, leaving `idata` untouched
    idata_pred = model.predict(idata, kind="response", inplace=False)
    if cut is not None:
        # there is probably a better way
        ppd = idata_pred.posterior_predictive[var_name]
        obs = idata_pred.observed_data[var_name]
        # Mask large values for plotting only; the model was fit on all the data
        idata_pred.posterior_predictive[var_name] = ppd.where(ppd < cut, drop=True)
        idata_pred.observed_data[var_name] = obs.where(obs < cut)

    return az.plot_ppc(idata_pred, var_names=[var_name])

In [None]:
# You can increase the quantile value to see the fit on more of the observations

plot_posterior_ppc(model, idata, cut=chat.nr_tokens.quantile(0.8))

## Running a model with education terms

We can add categorical predictors just as easily...

In [None]:
model_education = bmb.Model(
    "nr_tokens ~ gender + education",
    chat,
    family="negativebinomial",
)
idata_education = model_education.fit(
    target_accept=0.9,
    random_seed=rng,
    idata_kwargs={"log_likelihood": True},
    progressbar=False,
)
az.summary(idata_education)

In [None]:
bmb.interpret.plot_predictions(
    model_education,
    idata_education,
    ["gender", "education"],
    fig_kwargs={"sharey": True},
)
plt.show()

## Checking for interaction effects

In [None]:
model_interaction = bmb.Model(
    "nr_tokens ~ gender * education",
    chat,
    family="negativebinomial",
)
idata_interaction = model_interaction.fit(
    target_accept=0.9,
    random_seed=rng,
    idata_kwargs={"log_likelihood": True},
    progressbar=False,
)
az.summary(idata_interaction)

> QUESTION: What do you think? Are the modelled interaction effects significant?
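
As a sketch of what the interaction model estimates: the formula `gender * education` expands to `gender + education + gender:education`, and with a log link the linear predictor for observation $i$ can be written as below. For simplicity this assumes that both predictors are dummy-coded with two levels each (an assumption made for illustration; if `education` has more levels in the data, there is one main-effect term and one interaction term per non-reference level).

$$
\log \mu_i = \beta_0 + \beta_1 g_i + \beta_2 e_i + \beta_3 g_i e_i
\quad\Longrightarrow\quad
\mu_i = e^{\beta_0} \cdot e^{\beta_1 g_i} \cdot e^{\beta_2 e_i} \cdot e^{\beta_3 g_i e_i}
$$

Here $g_i$ and $e_i$ are 0/1 indicators for the non-reference levels of gender and education. If the posterior for $\beta_3$ is concentrated around zero, then $e^{\beta_3} \approx 1$ and the interaction model makes nearly the same predictions as the additive one.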

In [None]:
az.plot_trace(idata_interaction, compact=False)
plt.tight_layout()
plt.show()

In [None]:
bmb.interpret.plot_predictions(
    model_interaction,
    idata_interaction,
    ["gender", "education"],
    fig_kwargs={"sharey": True},
)
plt.show()

## The failure of the Poisson model

> NOTE CAREFULLY that a model can still produce reasonable *mean* estimates, and (correctly) decide that effects are real while still doing a terrible job of modelling. We illustrate that here by using a Poisson model on the (massively overdispersed) data and then looking at the posterior predictive distributions.

> In Bayesian statistics, the output is THE WHOLE DISTRIBUTION, not a point estimate
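
The point can be illustrated by hand before fitting anything. The sketch below uses NumPy only and synthetic overdispersed counts (not the chat data), and it plugs in the sample mean rather than a full posterior, so it is a simplification of a real posterior predictive check. It replicates datasets from a Poisson with the "fitted" mean and compares their spread with the spread of the observations.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in for overdispersed token counts (illustrative values).
mu, alpha = 2000.0, 0.8
observed = rng.negative_binomial(n=alpha, p=alpha / (alpha + mu), size=500)

# A Poisson model without predictors can only match the mean.
fitted_mean = observed.mean()

# Replicate 1,000 datasets from that Poisson and record each one's standard deviation.
replicated = rng.poisson(lam=fitted_mean, size=(1_000, observed.size))
replicated_sd = replicated.std(axis=1)

print(f"observed mean {observed.mean():.0f}, replicated mean {replicated.mean():.0f}")
print(
    f"observed sd {observed.std():.0f}, "
    f"replicated sd between {replicated_sd.min():.0f} and {replicated_sd.max():.0f}"
)
```

The replicated means agree with the observed mean, but none of the replicated datasets comes anywhere near the observed spread. This is the same mismatch that the posterior predictive plot makes visible.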

In [None]:
model_poisson = bmb.Model(
    "nr_tokens ~ gender + education",
    chat,
    family="poisson",
)
idata_poisson = model_poisson.fit(
    target_accept=0.9,
    random_seed=rng,
    idata_kwargs={"log_likelihood": True},
    progressbar=False,
)
az.summary(idata_poisson)

In [None]:
plot_posterior_ppc(model_poisson, idata_poisson, cut=chat.nr_tokens.quantile(0.75))

## Comparing models

Finally, we compare the ELPD (expected log pointwise predictive density; higher is better) of the three negative binomial models. As we saw, the interaction effects are not useful, and so we see that the `gender * education` model is a little weaker than the `gender + education` model. This conflicts with what we saw with the logistic models, by the way, probably because the very long tail makes it hard for the model to accurately detect the education interaction effect. You could try eliminating `nr_tokens` outliers above the third quartile plus 1.5 times the IQR and see what happens...

In [None]:
comp = az.compare(
    {
        "gender": idata,
        "gender + education": idata_education,
        "gender * education": idata_interaction,
    }
)
az.plot_compare(comp)
plt.show()
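
To give some intuition for what is being compared, the sketch below computes a log pointwise predictive density by hand with NumPy. It is a simplification: `az.compare` by default estimates the *out-of-sample* ELPD via leave-one-out cross-validation, whereas this sketch only scores the data already seen. The counts and the "posterior draws" are synthetic and invented for illustration.

```python
import math
import numpy as np

rng = np.random.default_rng(2)

# Synthetic counts and fake "posterior draws" of a Poisson rate (illustration only).
y = rng.poisson(lam=4.0, size=50)
good_draws = rng.normal(loc=4.0, scale=0.2, size=400)   # centred on the true rate
bad_draws = rng.normal(loc=8.0, scale=0.2, size=400)    # centred on a wrong rate

# log(y!) for each observation, needed for the Poisson log-probability.
log_factorial = np.array([math.lgamma(k + 1) for k in y])

def lppd(rate_draws):
    # Pointwise Poisson log-likelihood, shape (draws, observations).
    log_lik = y * np.log(rate_draws[:, None]) - rate_draws[:, None] - log_factorial
    # Average the likelihood over draws (on the log scale), then sum over observations.
    per_obs = np.logaddexp.reduce(log_lik, axis=0) - np.log(rate_draws.size)
    return per_obs.sum()

print(f"lppd, well-centred draws:  {lppd(good_draws):.1f}")
print(f"lppd, badly-centred draws: {lppd(bad_draws):.1f}")
```

The draws that sit near the rate that generated the data receive the higher (less negative) score, which is the sense in which a model with a higher ELPD predicts better.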

```
Version History

Current: v1.0.0

17/11/24: 1.0.0: first draft, BN
```