<img src="../../shared/img/banner.svg"></img>

# Lab 07 - Modeling Categorical Effects

In [None]:
%matplotlib inline

In [None]:
import sys

sys.path.append("../../")

from shared.src import quiet
from shared.src import seed
from shared.src import style

In [None]:
import random

from client.api.notebook import Notebook
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import pymc3 as pm
import scipy
import seaborn as sns

In [None]:
import shared.src.utils.util as shared_util

In [None]:
sns.set_context("notebook", font_scale=1.7)

In [None]:
ok = Notebook("ok/config")

### Learning Objectives

1. Practice writing categorical-numerical models in pyMC3.
1. Become more familiar with running and reporting the results of classical statistical tests with `scipy.stats`.
1. Recognize the benefits of the modeling flexibility of the sampling approach.

## ❤️ React for Bayesian Monte Carlo, 😡 React for Null Hypothesis Significance Testing

You're working as a research scientist studying
social cognition and social processing.
You're using fMRI machines
to search for areas of the brain that respond to emotionally and socially salient stimuli.

Since your grant money comes from Facebook,
you're trying to find areas of the brain
that respond differentially to discovering that
a social media post has received one of the six reactions:
`like`, `love`, `haha`, `wow`, `sad`, and `angry`,
pictured below.

![reactions](https://upload.wikimedia.org/wikipedia/en/thumb/4/49/Facebook_Reactions.png/309px-Facebook_Reactions.png)

From a literature review, you've identified a region that is likely to respond to these stimuli.

You've collected the average activation values for the center of this region
for a number of participants as they observed reactions to their social media posts.
The data is stored in `fmri_data.pkl`. (This data is simulated, not real.)

In [None]:
fmri_df = pd.read_pickle("data/fmri_data.pkl")

fmri_df.sample(5)

In [None]:
fmri_df.reaction.unique()

The area of interest takes up one "pixel" of the fMRI image.
Since fMRI measures brain activity in 3D space,
that is, in a volume,
these are known as _voxels_,
short for "volumetric pixels".
The values you've measured are called _voxel activities_
or _voxel activations_,
shortened to `voxel_act` in the `fmri_df`.

For each participant,
you also measured the average activation of this voxel during the presentation
of non-socially-relevant, non-emotional stimuli,
and subtracted this value off.

Therefore, you're looking to see whether the average activation across subjects (rows of `fmri_df`)
is different from 0.
You suspect that an ANOVA is the right model for this data:
a categorical effects model with a Gaussian likelihood
and equal variances in each group.
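
Written out, one way to express this model is shown below.
Here $y_i$ is the voxel activation in row $i$ and $g[i]$ is that row's reaction group;
the symbols $\mu_k$ (one mean per group) and $\sigma$ (the shared standard deviation)
are notation chosen for this sketch, not taken from the lab materials.

$$
y_i \sim \mathrm{Normal}\left(\mu_{g[i]},\ \sigma\right)
$$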

#### Begin by visualizing the data.

Consider a box-and-whisker plot, or a collection of distplots.

#### Q Does this look like data on which this model can be used? Explain.

Now define an ANOVA model for this data in pyMC.

Define priors for the group means,
making sure there is a mean for each group,
and then use the group labels (`react_idx`)
to set the parameters of a `Normal`ly-distributed variable
for the likelihood.
For that variable, set `observed` equal to the observed voxel activations.
You'll also need a prior for the (pooled) standard deviation.
You are free to choose your prior as you want.

Refer back to the lecture material if you get stuck.
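
The indexing step is the least familiar part of this recipe,
so here is a minimal sketch of it using plain NumPy arrays and made-up numbers.
The names `group_means` and `group_idx` are hypothetical and are not part of the lab data.
A vector-valued pyMC3 random variable can be indexed with an integer array in the same way.

```python
import numpy as np

# Hypothetical values, for illustration only: one mean per group.
group_means = np.array([0.0, 1.5, -0.5])

# Hypothetical integer group labels, one per observation (playing the role of `react_idx`).
group_idx = np.array([0, 2, 1, 1, 0, 2])

# Indexing with the labels expands the per-group means
# into one mean per observation, in the same order as the data rows.
per_observation_mean = group_means[group_idx]
print(per_observation_mean)
```

The result has one entry per observation, which is the shape the likelihood needs for its mean parameter.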

#### Q Explain your choices for the prior component of the model.

#### Using this model, draw samples and visualize the posterior with `pm.plot_posterior`.

#### Q Which of the group means are different from 0? What does that mean for your original hypothesis that this brain area responds to emotionally-salient stimuli?

Now, run a traditional ANOVA on the data using `scipy.stats.f_oneway`.

Save the values of $F$ and $p$ to `reaction_F` and `reaction_p`, respectively.
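
If you haven't used `f_oneway` before, note that it expects one array of observations per group,
passed as separate positional arguments.
The sketch below shows one way to get there from a long-format dataframe.
It runs on synthetic toy data, and `toy_df`, `toy_F`, and `toy_p` are hypothetical names, not the lab's variables.

```python
import numpy as np
import pandas as pd
import scipy.stats

rng = np.random.default_rng(0)

# Synthetic toy data (not the lab data): three groups with equal variance,
# where only group "c" has a nonzero mean.
toy_df = pd.DataFrame({
    "group": np.repeat(["a", "b", "c"], 30),
    "value": np.concatenate([
        rng.normal(0.0, 1.0, 30),
        rng.normal(0.0, 1.0, 30),
        rng.normal(2.0, 1.0, 30),
    ]),
})

# Split the value column into one array per group, then unpack the list
# so that each group is its own positional argument.
samples_by_group = [grp["value"].to_numpy() for _, grp in toy_df.groupby("group")]
toy_F, toy_p = scipy.stats.f_oneway(*samples_by_group)
print(f"F = {toy_F:.2f}, p = {toy_p:.2g}")
```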

In [None]:
ok.grade("q1")

#### Q Was the difference in group means statistically significant? Does this agree or disagree with your results from the pyMC model above?

## Beyond ANOVA: Modeling 👏 Emoji 👏 Use 👏

Not every kind of effect can be detected by an ANOVA,
since not all data is normal and not all changes affect the mean.

Say you're working in a lab studying [Human-Computer Interaction](https://en.wikipedia.org/wiki/Human%E2%80%93computer_interaction).

Your thesis is on emoji use.
What causes people to use emojis?
How can we improve the emoji experience for users?

As an early step in this research, you're collecting some demographic data on emoji use.
You suspect that younger generations (those born after 1985, or with a birthdate `>1985`) use more emojis than do older generations (those born before 1985, or with a birthdate `<1985`).
To test your hypothesis, you collect some data, `emoji_data.pkl`. (This data is simulated, not real.)

In [None]:
emoji_df = pd.read_pickle("data/emoji_data.pkl")

In [None]:
emoji_df.head()

The `emoji_ct` variable counts the number of emojis in the text message,
while the `generation_idx` and `birthdate` variables code the generation of the text message sender,
the former as a number and the latter as a string.

In [None]:
emoji_df.describe()

Visualize the distribution of the `emoji_ct` for each generation.

#### Q Use your visualization to predict whether an ANOVA will return a positive result. Explain your prediction.

#### Q Run an ANOVA and report the results.

This result doesn't accord with your understanding of the data, so you decide to dig deeper.

#### Plot a histogram of the data from each group, using `bins` as defined below.

In [None]:
# One bin per integer count; the last edge is max + 1 so the largest count is not dropped.
bins = np.arange(0, emoji_df["emoji_ct"].max() + 2)

The data within each group is decidedly not normally distributed.

#### Q What's the most salient non-normal aspect of the data? Hint: Look especially at the `<1985` histogram.

Despite the negative ANOVA result, the two distributions look different.

#### Q Describe, in your own words, the differences between the two distributions.

It appears that there are two kinds of text messages:
messages _with_ emojis, which then have a variable number of emojis,
and messages _without_ emojis.

Given that we are working with counts,
the natural distribution for modeling this data
is the `ZeroInflatedPoisson` distribution
(see [the pyMC docs for a plot of the distribution](https://docs.pymc.io/api/distributions/discrete.html#pymc3.distributions.discrete.ZeroInflatedPoisson)).

You can think of it as a random variable with two parts:
its value is either equal to 0, or it is Poisson-distributed.
You could implement it from scratch with `pm.math.switch`.
It has two parameters:
`theta`, which is the rate of the Poisson part,
and `psi`, which is the chance the value is Poisson-distributed.

In our case,
`psi` is the chance that the text message contains any emojis,
while `theta` is the average number of emojis in the texts that contain emojis.
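
To build intuition for the two-part description, here is a minimal NumPy simulation of a zero-inflated Poisson variable.
The function name `sample_zero_inflated_poisson` and the parameter values are hypothetical, chosen only for illustration;
this is a sketch of the generative story, not how pyMC implements the distribution.

```python
import numpy as np

rng = np.random.default_rng(0)


def sample_zero_inflated_poisson(psi, theta, size, rng):
    """Draw counts that are Poisson(theta) with probability psi, and 0 otherwise."""
    is_poisson = rng.random(size) < psi       # which draws come from the Poisson part
    poisson_draws = rng.poisson(theta, size)  # Poisson draw for every slot
    return np.where(is_poisson, poisson_draws, 0)


# Hypothetical parameter values, for illustration only.
psi, theta = 0.3, 4.0
counts = sample_zero_inflated_poisson(psi, theta, size=100_000, rng=rng)

# Compare the simulated mean and zero fraction with their theoretical values.
print(counts.mean(), psi * theta)
print((counts == 0).mean(), (1 - psi) + psi * np.exp(-theta))
```

Note that the Poisson part can itself produce a 0,
so the fraction of zeros is slightly larger than `1 - psi`.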

#### Q Explain, in your own words, why a sampling or Monte Carlo approach, like the one we are taking in this class, allows us to work with zero-inflated Poisson random variables more easily than a traditional analytical approach, based on defined statistical tests like the $F$ test.

Below, define a model for the emoji data that uses the Zero-Inflated Poisson.

Write it by analogy to the ANOVA model:
define priors for the model parameters,
making sure each group gets a random variable
for each of its parameters
(notice that the `ZeroInflatedPoisson`
variable has two parameters, not just one,
and neither is shared, unlike the `sd` in an ANOVA).
Then index into those with the group labels (`generation_idx`)
to set the parameters of the `ZeroInflatedPoisson` variable.
For that variable, set `observed` equal to the observed counts.

Think carefully about the priors for the parameters of the Zero-Inflated Poisson variable.
There are multiple right answers, but also lots of wrong answers.
The variable `psi` must be between `0` and `1`
and is continuous,
while the variable `theta` just has to be above `0` (positive-only).
You might have some other prior information or assumptions you want to include
by choosing the initial distribution for these variables.
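
One way to sanity-check a candidate prior before putting it in a model
is to draw from a matching `scipy.stats` distribution and confirm that the draws respect the constraints above.
The Beta and Exponential distributions below, and their parameter values, are assumptions made for this sketch;
they are one possible choice among many, not the intended answer.

```python
import scipy.stats

# Hypothetical candidate priors, for illustration only.
psi_prior = scipy.stats.beta(a=1, b=1)      # continuous, supported on [0, 1]
theta_prior = scipy.stats.expon(scale=5.0)  # continuous, supported on [0, inf)

psi_draws = psi_prior.rvs(size=10_000, random_state=0)
theta_draws = theta_prior.rvs(size=10_000, random_state=0)

# Check the range of each set of draws against the required support.
print(psi_draws.min(), psi_draws.max())
print(theta_draws.min(), theta_draws.mean())
```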

#### (2 pts) Using this model, draw samples and visualize a posterior over the parameters `psi` and `theta` with `pm.plot_posterior` or something similar.

#### Q Report your findings. Do the groups differ? If so, how?

I found that the two groups differed both in the chance that they sent any emojis and in the number of emojis they sent when they did.
While the `<1985` cohort was less likely than the `>1985` cohort to send any emojis, when they did, they sent more of them.

In [None]:
ok.score()