<img src="../../shared/img/banner.svg"></img>

# Homework 07 - Modeling Categorical-Numerical Relationships

In [None]:
%matplotlib inline

In [None]:
import sys

sys.path.append("../../")

from shared.src import quiet
from shared.src import seed

In [None]:
import random

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import pymc3 as pm
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf

### Learning Objectives

1. Practice writing categorical-numerical models in pyMC3.
1. Become more familiar with using `statsmodels` to run classical statistical tests.
1. Recognize the benefits of the modeling flexibility of the sampling approach.

## ❤️ React for ANOVA, 😡 React for Non-Parametric Tests

You're working as a research scientist studying
social cognition and social processing.
You're using fMRI machines
to search for areas of the brain that respond to emotionally and socially salient stimuli.

Since your grant money comes from Facebook,
you're trying to find areas of the brain
that respond differentially to discovering that
a social media post has received one of the six reactions:
`Like`, `Love`, `Haha`, `Wow`, `Sad`, and `Angry`,
pictured below.

![reactions](https://upload.wikimedia.org/wikipedia/en/thumb/4/49/Facebook_Reactions.png/309px-Facebook_Reactions.png)

From a literature review, you've identified a region that is likely to respond to these stimuli.

You've collected the average activation values for the center of this region
for a number of participants as they observed reactions to their social media posts.
The data is stored in `fmri_data.pkl`.

In [None]:
fmri_df = pd.read_pickle("data/fmri_data.pkl")

fmri_df.sample(5)

In [None]:
fmri_df.reaction.unique()

The center takes up one element of the three-dimensional fMRI image;
by analogy with the pixels of a two-dimensional image,
these elements are known as _voxels_,
short for volume pixels.
The values you've measured are called _voxel activations_,
or `voxel_act` in the `fmri_df`.

For each participant,
you also measured the average activation of this voxel during the presentation
of non-socially-relevant, non-emotional stimuli,
and subtracted this value off.

Therefore, you're looking to see whether the average activation across subjects (rows of `fmri_df`)
is different from 0.
You suspect that an ANOVA is the right model for this data.

#### Q Begin by visualizing the data. Does this look like a promising target for ANOVA? Explain.

#### Q Now define an ANOVA data model for this data.

Define variables for the model parameters,
with shape `N_groups`,
then index into those with the group labels (`target_idx`)
to set the parameters of a `Normal`ly-distributed variable.
For that variable, set `observed` equal to the observed voxel activations.

Refer back to the lecture material if you get stuck.
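
The indexing step can be tried out on its own, without pyMC3. The following is a minimal NumPy sketch on made-up labels and parameter values (`labels`, `group_idx`, and `group_means` are hypothetical names, and nothing here comes from `fmri_df`). It shows how one parameter per group is expanded to one value per observation by indexing with integer group labels. In a pyMC3 model, the same indexing would be applied to the random variable holding the group parameters.

```python
import numpy as np

# Hypothetical reaction labels for six observations (not the real data).
labels = np.array(["Like", "Sad", "Like", "Wow", "Sad", "Wow"])

# np.unique with return_inverse=True yields the sorted group names and,
# for each observation, the integer index of its group.
group_names, group_idx = np.unique(labels, return_inverse=True)
N_groups = len(group_names)

# One made-up parameter per group, in the same order as group_names.
group_means = np.array([0.5, -1.0, 2.0])

# Indexing with the integer labels gives one mean per observation.
per_observation_means = group_means[group_idx]

print(group_names)
print(group_idx)
print(per_observation_means)
```

The result has the same length as the data, so it can be paired element by element with the observed values.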

#### Q Using this model, draw samples, plot the posterior with `pm.plot_posterior`, and report your findings.

#### Q Run a traditional ANOVA on the data using `smf.ols` and report your findings in the APA style.

The relevant variables are `voxel_act`,
which we are modeling as dependent (`~`)
on the `C`ategorical variable `reaction`.

To get just the ANOVA results,
call the function `sm.stats.anova_lm`
on the fitted output of `smf.ols`
(include the keyword argument `typ=2`).
Refer back to the lecture for details.
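
For intuition about the number the ANOVA table reports, the $F$ statistic of a one-way design is the ratio of between-group to within-group variance. The sketch below computes it by hand with NumPy on made-up toy data. `one_way_f` is a hypothetical helper, not part of `statsmodels`, and it is meant to illustrate the calculation rather than replace `sm.stats.anova_lm`.

```python
import numpy as np


def one_way_f(groups):
    """Return the one-way ANOVA F statistic and its two degrees of freedom."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()
    # Between-group sum of squares: spread of group means around the grand mean.
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of observations around their group mean.
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Made-up toy data: three groups of three observations each.
f_stat, df_between, df_within = one_way_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(f"F({df_between}, {df_within}) = {f_stat:.2f}")  # F(2, 6) = 13.00
```

The printed form, with both degrees of freedom in parentheses, is the one an APA-style report uses, followed by the $p$-value.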

You might also be tempted to perform follow-up $t$-tests to determine which groups are different from each other.
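
One reason for caution is the number of comparisons involved. The quick count below assumes, purely for illustration, that every pairwise test is independent and run at a significance level of 0.05; real follow-up tests on shared data are not independent, so the second number is only a rough guide.

```python
from itertools import combinations

reactions = ["Like", "Love", "Haha", "Wow", "Sad", "Angry"]

# Every unordered pair of reactions would need its own t-test.
pairs = list(combinations(reactions, 2))
n_tests = len(pairs)

alpha = 0.05
# Chance of at least one false positive if all tests were independent
# and no true differences existed (an idealized assumption).
familywise_error = 1 - (1 - alpha) ** n_tests

print(n_tests)                     # 15
print(round(familywise_error, 3))  # 0.537
```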

## Beyond ANOVA: Modeling 👏 Emoji 👏 Use 👏

Not every kind of effect can be detected by an ANOVA,
since not all data is normal and not all changes affect the mean.

You're working in a lab studying [Human-Computer Interaction](https://en.wikipedia.org/wiki/Human%E2%80%93computer_interaction).

Your thesis is on emoji use.
What causes people to use emojis?
How can we improve the emoji experience for users?

As an early step in this research, you're collecting some demographic data on emoji use.
You suspect that younger generations (those born after 1985) use more emojis than do older generations (those born before 1985).
To test your hypothesis, you collect some data, `emoji_data.pkl`.

In [None]:
emoji_df = pd.read_pickle("data/emoji_data.pkl")

In [None]:
emoji_df.head()

The `emoji_ct` variable counts the number of emojis in the text message,
while the `generation_idx` and `birthdate` variables code the generation of the text message sender,
the former as a number and the latter as a string.

In [None]:
emoji_df.describe()

#### Q Visualize the data and predict whether an ANOVA will return a positive result. Explain your prediction.

In [None]:
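# Overlay normalized histograms of emoji counts for the two generations.
sns.distplot(emoji_df["emoji_ct"][emoji_df["birthdate"] == "<1985"], kde=False, norm_hist=True, label="<1985")
sns.distplot(emoji_df["emoji_ct"][emoji_df["birthdate"] == ">1985"], kde=False, norm_hist=True, label=">1985")
plt.legend();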

#### Q Run an ANOVA and report, in APA style, the results.

This result doesn't accord with your understanding of the data, so you decide to dig deeper.

#### Q Plot a histogram of the data from each group, using `bins` as defined below.

In [None]:
# One bin per integer count, including the largest observed count.
bins = np.arange(0, emoji_df["emoji_ct"].max() + 2)

The data within each group is decidedly not normally distributed.

#### Q What's the most salient non-normal aspect of the data? Hint: Look especially at the `<1985` histogram.

The two distributions do look different.

#### Q Describe, in your own words, the differences between the two distributions.

It appears that there are two kinds of text messages:
messages _with_ emojis, which then have a variable number of emojis,
and messages _without_ emojis.

Given that we are working with counts,
the natural distribution for modeling this data
is the _Zero-Inflated Poisson_ distribution
(see [the pyMC docs for a plot of the distribution](https://docs.pymc.io/api/distributions/discrete.html#pymc3.distributions.discrete.ZeroInflatedPoisson)).

You can think of it as a random variable with two parts:
its value is either fixed at 0, or it is Poisson-distributed.
It has two parameters:
`theta`, which is the rate of the Poisson part,
and `psi`, which is the chance the value comes from the Poisson part
rather than being fixed at 0.

In our case,
`psi` is approximately the chance that the text message contains any emojis
(approximately, because the Poisson part can also produce a 0),
while `theta` is the average number of emojis in the messages that come from the Poisson part.
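
The two-part description can be written out directly. The sketch below uses NumPy and the standard library rather than pyMC3; `zip_pmf` and `zip_sample` are hypothetical helper names, and the values of `psi` and `theta` are made up, not estimates from `emoji_df`. It follows the usual definition, in which a zero can come either from the always-zero part or from the Poisson part.

```python
import math

import numpy as np


def zip_pmf(k, psi, theta):
    """Probability of the count k under a zero-inflated Poisson."""
    poisson_part = psi * math.exp(-theta) * theta**k / math.factorial(k)
    # Zeros can come from the always-zero part or from the Poisson part.
    return (1 - psi) + poisson_part if k == 0 else poisson_part


def zip_sample(psi, theta, size, rng):
    """Draw counts as a 0/1 'Poisson part' flag times a Poisson draw."""
    from_poisson = rng.random(size) < psi
    return from_poisson * rng.poisson(theta, size)


psi, theta = 0.4, 5.0  # made-up values for illustration
rng = np.random.default_rng(0)
counts = zip_sample(psi, theta, size=100_000, rng=rng)

print(zip_pmf(0, psi, theta))  # chance of zero: (1 - psi) + psi * exp(-theta)
print(counts.mean())           # close to psi * theta = 2.0
print((counts == 0).mean())    # close to zip_pmf(0, psi, theta)
```

Because the mean is `psi * theta`, two groups can share the same mean while differing in both parameters, which is the kind of difference a comparison of means cannot see.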

#### Q Explain, in your own words, why a sampling approach, like the one we are taking in this class, allows us to work with zero-inflated Poisson random variables more easily than a traditional approach, based on defined statistical tests.

#### Q Below, define a model for the emoji data that uses the Zero-Inflated Poisson.

Write it by analogy to the ANOVA model:
define variables for the model parameters,
with shape `N_groups`,
then index into those with the group labels (`generation_idx`)
to set the parameters of the `ZeroInflatedPoisson` variable.
For that variable, set `observed` equal to the observed counts.

Think carefully about the distributions for the parameters of the Zero-Inflated Poisson variable.
There are multiple right answers, but also lots of wrong answers.
The variable `psi` must be between `0` and `1`,
while the variable `theta` just has to be above `0`.
You might have some other prior information or assumptions you want to include
through your choice of prior distribution for these variables.
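
One way to check a candidate distribution is to draw from it and confirm that the samples respect the constraints. The sketch below uses NumPy rather than pyMC3, and the specific choices (a `Beta(1, 1)` for `psi`, an `Exponential` with mean 5 for `theta`) are illustrative assumptions, not the intended answer.

```python
import numpy as np

rng = np.random.default_rng(0)
n_draws = 10_000

# Beta distributions live between 0 and 1, so they are candidates for psi.
# Beta(1, 1) is uniform over that interval.
psi_draws = rng.beta(1.0, 1.0, size=n_draws)

# Exponential distributions live on positive values, so they are candidates
# for theta. The scale (mean) of 5 is an arbitrary illustrative value.
theta_draws = rng.exponential(scale=5.0, size=n_draws)

# By contrast, a Normal distribution puts mass below 0,
# which neither parameter allows.
normal_draws = rng.normal(loc=0.5, scale=1.0, size=n_draws)

print(psi_draws.min(), psi_draws.max())
print(theta_draws.min())
print((normal_draws < 0).mean())  # a sizeable fraction is negative
```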

#### Q Using this model, draw samples, plot the posterior with `pm.plot_posterior`, and report your findings.

I found that the two groups differed in both the chance that they sent any emojis and in the number of emojis they sent when they did.
While the `<1985` cohort was less likely than the `>1985` cohort to send any emojis, when they did, they sent more of them.