<img src="../../shared/img/banner.svg"></img>

# Homework 04 - Parameterized Models and _t_-Tests

In [None]:
%matplotlib inline

In [None]:
import sys

sys.path.append("../../")

from shared.src import quiet
from shared.src import seed
from shared.src import style

In [None]:
import math
from pathlib import Path

from client.api.notebook import Notebook
from IPython.display import Image
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc3 as pm
import seaborn as sns
import scipy.stats

import shared.src.utils.util as shared_util
import utils.bound as bound
import utils.util

In [None]:
ok = Notebook("ok/config")

## Learning Objectives

1. Practice converting descriptions of priors and likelihoods into random variables.
2. Apply the (two-sided, unpaired) $t$-test to data and compute an associated $p$ value with both of the main approaches.
3. Recognize the similarities and differences between the analytical and sampling approaches to determining null distributions.

## Section 1 - Specifying Parameterized Models in pyMC

In this section, you will be asked to relate descriptions
of the behavior of a random variable
to choices for the distribution of that variable.

Information about how to relate statements like the ones below
to specific distributions appears in the Week 03 lecture on random variables
and in the Week 04 lecture on parameters.

### Priors

Priors encode beliefs about the state of the world,
and especially about the values of parameters.

Read each of the statements below,
representing a belief about a parameter.
Determine which `pm.Distribution`
best encodes each belief.

1. "This parameter could be any positive number."
2. "This parameter could be any number between X and Y."
3. "This parameter could be any number."
4. "This parameter could be any integer between X and Y."
5. "This parameter could be any number, but is close to a value `mu`, up to a spread of `sd`."

Create a dictionary, called `priors`,
that contains your answers.

The keys should be integers,
`1` through `5`,
and the values should be functions
called to add random variables to models,
like `pm.Bernoulli`.

For example, for the statement

> 7. "This parameter is a Normal random variable"

the answer is `{7: pm.Normal}`.

In [None]:
ok.grade("q1_01")

### Likelihoods

Likelihoods describe the distribution of data,
given its parameters.

For each description of a data likelihood below,
determine which `pm.Distribution` best fits.

1. "This data has a distribution subject to the Central Limit Theorem."
2. "This data is the number of occurrences of a memoryless process."
3. "This data is the timing between occurrences of a memoryless process."
4. "This data is the sum of `n` `Bernoulli` variables with the same parameter `p`."

Format your answers as above,
in a dictionary called `likelihoods`.

In [None]:
ok.grade("q1_02")

### Priors + Likelihoods

A full model typically combines a prior over the parameters with a likelihood for the data.

Convert the specifications of models below,
given in English and in mathematical notation,
into pyMC models.

Give each variable in the model the name that appears next to the "distributed as" sign, $\sim$.
The variable name you should give each model appears just before the description in `fixedwidth` text.

Once you have specified the model, draw at least 2000 samples from it,
convert the samples into a dataframe with `shared_util.samples_to_dataframe`,
and save them to `{model_name}_samples_df`.

`linear_signal_model`:
> The $S$ignal takes on random values around 0 with a typical spread of 1
and the $M$easurement of that signal experiences additive noise
with magnitude averaging 0 and a typical spread of 0.1.

$$
S \sim \text{Normal}(0, 1) \\
M \sim S + \text{Normal}(0, 0.5)
$$

```python
with pm.Model() as linear_signal_model:
    ?
    
linear_signal_model_samples_df = shared_util.samples_to_dataframe(shared_util.sample_from(
    linear_signal_model, draws=500, chains=4))
```      

In [None]:
ok.grade("q1_03")

`nhst_model`:

> The null hypothesis is true with probability $p$=`0.2`.
The test has a false positive rate, or $\alpha$, of `0.1`
and a power of `0.7`.

$$
\text{null_true} \sim \text{Bernoulli}(0.2) \\
\text{positive_result} \sim \text{Bernoulli}(\texttt{testparameters}[\text{null_true}])
$$

with
```python
testparameters = [0.7, 0.1]
```
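The indexing in `testparameters[null_true]` picks the chance of a positive result according to whether the null is true. The same logic can be sketched outside pyMC with plain NumPy. The names `rng` and `n_draws` and the seed are assumptions made for illustration, and this is not the pyMC model itself.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily, for reproducibility
n_draws = 100_000               # hypothetical number of simulated experiments

# index 0: null is false (power); index 1: null is true (alpha)
testparameters = np.array([0.7, 0.1])

# draw whether the null is true in each simulated experiment
null_true = rng.binomial(n=1, p=0.2, size=n_draws)

# integer indexing selects the matching probability of a positive result
p_positive = testparameters[null_true]
positive_result = rng.binomial(n=1, p=p_positive)

# overall rate of positive results; analytically 0.8 * 0.7 + 0.2 * 0.1 = 0.58
positive_rate = positive_result.mean()
print(positive_rate)
```

The estimated rate should land near the analytical value in the final comment, which mixes the two test parameters according to how often the null is true.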

In [None]:
ok.grade("q1_04")

`neurotransmitter_model`:

[Neurons communicate across synapses](https://www.youtube.com/watch?v=WhowH0kb7n0)
by releasing sacs of neurotransmitter (the "chemistry" in "brain chemistry") called vesicles.
In many cases, the neuron receiving the neurotransmitter reacts by moving ions around,
changing its voltage.

Both the number of vesicles released and the change in voltage in response
are random, and described by the following model:

> Vesicle release is the result of many independent events,
with the $N$umber of vesicles being released averaging `2.25`.
Each individual vesicle causes the $V$oltage to change by an amount
with mean `0.4` and typical spread `0.0625`.
The effects of different vesicles are added together.

$$
N \sim \text{Poisson}(2.25) \\
V_i \sim \text{Normal}(0.4, 0.0625) \\
V \sim V_1 + V_2 + \dots + V_N
$$

Recall from lab03 that simulating events where
some random variables don't exist on some samples
is not possible in pyMC.
You may assume that no more than 10 vesicles are ever released.
Note that `pm.math.sum` can be used to add up a list of RVs
and slicing, like `X[:ii]`,
can be used with random variables in both positions, `X` and/or `ii`.
The template below can get you started.

See [this blog post](https://charlesfrye.github.io/stats/2017/11/03/quantal-release-probabilistic-models.html)
for more on this model.
The discovery of this model was good for a
[Nobel Prize in 1970](https://www.nobelprize.org/nobel_prizes/medicine/laureates/1970/speedread.html).

```python
with pm.Model() as neurotransmitter_model:
    N = pm.?
    voltage_changes_all_possible_vesicles = pm.Normal("_V", mu=?, sd=?, shape=10)
    V = pm.Deterministic("V", pm.math.sum(?[:?]))
```
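The slicing idea can be checked outside pyMC. The sketch below uses NumPy's random number generator in place of pyMC random variables. The names `rng`, `n_samples`, and `max_vesicles` are assumptions for illustration, and the cap of 10 follows the assumption stated above.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
n_samples = 20_000              # hypothetical number of simulated releases
max_vesicles = 10               # assumed cap on the number of vesicles

# number of vesicles released on each sample, capped at the assumed maximum
N = np.minimum(rng.poisson(2.25, size=n_samples), max_vesicles)

# voltage changes for every vesicle that could possibly be released
all_changes = rng.normal(0.4, 0.0625, size=(n_samples, max_vesicles))

# keep only the first N changes in each row and add them up
V = np.array([row[:n].sum() for row, n in zip(all_changes, N)])

print(V.mean())  # should be close to 2.25 * 0.4 = 0.9
```

Drawing all ten possible voltage changes and then discarding the unused ones keeps every sample the same shape, which is the reason for the fixed `shape=10` in the template.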

In [None]:
ok.grade("q1_05")

## Section 2 - $t$-Testing Two Ways

In this section, we will revisit the `attention` dataset from `seaborn`.

In [None]:
atten_df = sns.load_dataset("attention", data_home=Path("..") / ".." / "shared" / "data", index_col=0)

atten_df.head()

In this dataset, participants completed three tasks of varying difficulty,
indexed by the number of `solutions` (`1` being hard, `3` being easy)
and received a `score` quantifying their performance.

In some trials, participants had their `attention` to the task `divided` by a distractor,
while in others they were `focused` on the task.

It is intuitive to expect that `score`s would be higher when `attention` is `focused`,
and perhaps to expect that the degree of this effect is greater for harder tasks.

Below, we will test this expectation by checking whether the mean `score`s differ
depending on the value of `attention` for the easy task, 
`solutions == 3`, and for the hard task, `solutions == 1`,
according to the $t$-test.

### Computing $t$

The $t$-statistic for two equal-sized groups $A$ and $B$ is defined as

$$
t = \frac{\mu_A - \mu_B}{\sigma\sqrt{\frac{2}{n}}}
$$

where $\mu_A$ for a pandas series `A` is `A.mean()`,
and $n$ is `len(A)`, which is presumed equal to `len(B)`.
The value in the denominator, $\sigma$, is
the estimate of the population standard deviation
and is given by

$$
\sigma^2 = \frac{\sigma^2_A + \sigma^2_B}{2}
$$

where $\sigma^2_A$ for a pandas Series `A` is `A.var()`.
Recall that `.var` has the implicit keyword argument `ddof=1`.
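
As a worked example, with values chosen here purely for illustration, take $A = (3, 5)$ and $B = (2, 4)$, so that $n = 2$. Plugging into the definitions above gives

$$
\mu_A = 4, \quad \mu_B = 3, \quad \sigma^2_A = \sigma^2_B = 2 \\
\sigma^2 = \frac{2 + 2}{2} = 2, \quad \sigma = \sqrt{2} \\
t = \frac{4 - 3}{\sqrt{2}\sqrt{\frac{2}{2}}} = \frac{1}{\sqrt{2}} \approx 0.707
$$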

Define a function, `compute_t`, that takes in two `Series` and computes the `t` statistic for those two series.
See the template below for hints on how to proceed.

As part of the implementation of `compute_t`,
write a separate function to compute the pooled estimate of the population standard deviation,
as in the template below.
It is considered good programming practice to break out separate tasks into separate functions as much as possible,
to improve readability and ease the work of bug testing and code updating.

You do not need to worry about what happens when either `a` or `b` is not a `Series` or when they are of length `0` or unequal length.

```python
def compute_t(a, b):
    """Compute the t statistic for two pandas Series of equal length
    """
    # compute means, compute n
    # use means to compute numerator of t
    # compute sd, the "pooled" estimate of standard deviation
    sd = compute_pooled_sd(a, b)
    # compute np.sqrt(2 / n)
    # use them to compute denominator of t
    # compute t from numerator over denominator
    return t

def compute_pooled_sd(a, b):
    pool_sd = np.sqrt((? + ?) / ?)
    return pool_sd
```

In [None]:
ok.grade("q2_01")

Define a function, `compute_t_attention`,
that applies `compute_t` to `atten_df` to compute the $t$ statistic
for the scores of participants with different values of `"attention"`,
but with the same value of `solutions`.

The template below will get you started.

```python
def compute_t_attention(df, num_solutions):
    sub_df = ?  # select only rows with the right number of solutions
    
    # select a, the observations in the group with their attention divided
    divided_scores = sub_df[sub_df["attention"] == "divided"]["score"]
    # select b, the observations in the group with their attention focused
    focused_scores = ?
    
    return compute_t(divided_scores, focused_scores)
```

Use this function to compute $t$ for
the difference in means between the two different attention groups
when working on easy problems (`"solutions" == 3`).
Save this as `easy_t_byhand`.
Then do the same for subjects
working on hard problems (`"solutions" == 1`)
and save it as `hard_t_byhand`.

In [None]:
test_df = pd.DataFrame(
    {"attention": ["divided", "divided", "focused", "focused"],
     "score": [3, 5, 2, 4], "solutions": [1, 1, 1, 1]})

In [None]:
ok.grade("q2_02")

### Computing $p$

The decision to reject or fail to reject the null hypothesis
can be made directly from $t$,
but for a continuous measure of the plausibility of the data under the null,
we use the $p$ value,
which can also be used to choose whether or not to reject the null.

#### Using Analytical Methods

The function `compute_p_from_t` below calculates
a $p$ value corresponding to the $t$ statistic.

The sampling distribution of $t$ under the null hypothesis
has only one parameter:
`df`, for `d`egrees of `f`reedom (not `d`ata`f`rame!).
It can be calculated from the size of the groups
($n$ in the definition above, `n_group` in the function call).

The driving force behind the definition of $t$ and similar test statistics
in the early days of statistical inference was the need to have
only one or a few discrete parameters,
so that statistical tables could be calculated.

In [None]:
def compute_p_from_t(t, n_group):
    # first, we specify the sampling distribution
    # under the null
    df = 2 * n_group - 2
    t_cdf = scipy.stats.t(df=df).cdf
    
    # then, we look at the probability of extreme values of the 
    # test statistic under that distribution
    t_magnitude = np.abs(t)
    right_tail = 1 - t_cdf(t_magnitude)
    left_tail = t_cdf(-t_magnitude)

    # for a two-sided test, we add up the chance of
    # a positive or a negative value as extreme as we observed
    p = left_tail + right_tail
    
    return p

Use this function, along with `compute_t_attention`,
to compute `p` values for the null hypothesis
that attention state (`divided` vs `focused`) has no effect
on the mean of the `score` for the hard problems,
`num_solutions=1`.

Save the result to `hard_p_byhand`.
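
The tail logic in `compute_p_from_t` can be sanity-checked on its own. Because the $t$ distribution is symmetric about 0, the two tails are equal, so the two-sided $p$ value should match twice the survival function, `sf`, of `scipy.stats.t`. The values of `t` and `n_group` below are made up for illustration and are not taken from the dataset.

```python
import math

import scipy.stats

t = 1 / math.sqrt(2)  # example value, not taken from the dataset
n_group = 2           # example group size
df = 2 * n_group - 2

null_dist = scipy.stats.t(df=df)

# same tail computation as in compute_p_from_t
right_tail = 1 - null_dist.cdf(abs(t))
left_tail = null_dist.cdf(-abs(t))
p_two_tails = left_tail + right_tail

# sf is the survival function, 1 - cdf
p_doubled = 2 * null_dist.sf(abs(t))

print(p_two_tails, p_doubled)
```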

When possible, we should use highly-vetted code from open source libraries,
like `scipy`, rather than our own code.

Repeat the computation of `t` and `p` 
for the `hard` problems done above with the function
`scipy.stats.ttest_ind`.
It takes two arguments,
`a` and `b`, which are the same as the arguments of `compute_t`.
It returns two numbers:
the first is the value of the $t$ statistic,
and the second is the corresponding $p$ value.
Save the results to `hard_t_scipy` and `hard_p_scipy`.
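
A minimal usage sketch on toy data (the values of `a` and `b` are made up for illustration, not taken from `atten_df`) shows the call and the two returned numbers:

```python
import scipy.stats

# toy data, chosen for illustration only
a = [3, 5]
b = [2, 4]

# equal_var=True (the default) assumes both groups share one variance,
# matching the pooled estimate used in the definition of t above
t_value, p_value = scipy.stats.ttest_ind(a, b)

print(t_value, p_value)
```

Passing `equal_var=False` would run a different test (Welch's), so the default is the one to compare against `compute_t`.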

Print the values you computed and the values from scipy
and observe that they are roughly equal.
Their approximate equality will be checked by the autograder,
but whether you printed them will not be.

Once you have done this,
compute the same values for the easy problems, `num_solutions=3`,
print them,
and save them to `easy_t_scipy` and `easy_p_scipy`.

In [None]:
# uncomment and run to see documentation
# scipy.stats.ttest_ind??

In [None]:
ok.grade("q2_03")

#### Using pyMC

The operation above relied critically on having
a mathematical form for the null distribution of $t$,
in this case provided by the methods of the `scipy.stats.t` object.

Rather than relying on a mathematical form of the distribution,
we can draw samples from it with pyMC.
Though this is inefficient for cases like the $t$ statistic,
whose sampling distribution under the null is well-known,
this method will generalize to other statistics
for which the null distribution is not known or not conveniently available.
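
The sampling idea can be sketched with NumPy's random number generator standing in for pyMC. This is a swapped-in illustration, not the pyMC workflow described below. The observed value `t_observed` is hypothetical: 2.101 is the tabulated two-sided 5% cutoff for 18 degrees of freedom, so the estimate should come out near 0.05. The shared mean of 0 and standard deviation of 1 are also arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
n_group = 10
n_datasets = 20_000             # hypothetical number of simulated datasets
t_observed = 2.101              # hypothetical observed t, see the text above

# under the null, both groups share one mean and one standard deviation
null_data = rng.normal(loc=0.0, scale=1.0, size=(n_datasets, 2 * n_group))
group_a, group_b = null_data[:, :n_group], null_data[:, n_group:]

# t statistic for each simulated dataset, using the pooled standard deviation
pooled_sd = np.sqrt(
    (group_a.var(axis=1, ddof=1) + group_b.var(axis=1, ddof=1)) / 2)
null_ts = (group_a.mean(axis=1) - group_b.mean(axis=1)) / (
    pooled_sd * np.sqrt(2 / n_group))

# two-sided p value: fraction of null draws at least as extreme as observed
p_estimate = np.mean(np.abs(null_ts) >= abs(t_observed))
print(p_estimate)
```

The estimate is only as precise as the number of simulated datasets allows, which is why a large number of samples is needed.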

When making null models in pyMC,
it's typically easier to create the null distribution of the data,
then compute the value of the statistic outside of pyMC,
rather than compute the statistic inside pyMC.
Most statistics are deterministic transformations of the data,
and so it doesn't make much sense to compute them inside pyMC,
which is designed to handle relationships that have a random component.

Make a function `make_null_model_t`
that takes in a value for the shared mean, standard deviation, and size of both groups
and returns a pyMC model that samples data values according to the null.
Remember that in the null model for the $t$ test,
the data is normally-distributed.

The template code suggests one way of implementing this null model.
You can also look at the lectures from week 05 for alternative inspiration.
You might also look to `pm.math.switch` for a third method,
and I'm sure there are others.

However you implement it,
make sure the variable that contains the scores of the participants
is called `"scores"` and that each sampled value
is an array of _all_ of the observations for a single dataset,
as in the template code below.

```python
def make_null_model_t(mean, sd, group_size):
    
    with pm.Model() as t_model:
        scores = pm.?("scores", ... ? ..., shape=(group_size * 2))
        
    return t_model
```

The autograder test will draw samples from this model and check that they are reasonable.

In [None]:
ok.grade("q2_04")

Now, draw at least `10000` samples from this null model
for the data observed in the case of the easy problem.

That is,
set the mean equal to the mean of all scores for the easy problem
and the standard deviation equal to the pooled standard deviation
of the two groups for the easy problem,
and provide those values, along with a `group_size=10`,
to `make_null_model_t`.

Then, draw at least `10000` samples from that model with `shared_util.sample_from`
and put them into a dataframe with `shared_util.samples_to_dataframe`.
Name that dataframe `null_model_t_easy_samples_df`.

In [None]:
ok.grade("q2_05")

Each row of this dataframe contains an entire dataset's worth of scores,
equivalent to all of the elements of `atten_df["score"]`
that had the same value for `solutions`.
Print `null_model_t_easy_samples_df.head()`
and `null_model_t_easy_samples_df["scores"].iloc[0]`
to take a look at the values.

Therefore, in order to be able to apply our `compute_t_attention` function,
we need to turn each row of the dataframe of pyMC samples
into its own separate dataframe,
complete with `"attention"` and `"solutions"` columns.

The template below will help you get started.
The `sample` argument is presumed to be one row
of the samples dataframe,
e.g. the output of `samples.iloc[0]`.
The first line converts the `"scores"` in it into a `Series` for you.

To check what your function is doing,
apply it to the first row of the samples
(you can pull that row out with `.iloc[0]`)
and print the result.

_Hint_: Do the values for `"attention"`
need to be aligned with the values for `"score"`,
given that the samples come from the null model?

_Hint_: Make sure you can apply `compute_t_attention` to the output of `null_sample_to_dataframe`!
This is assumed in the autograder tests.
The commented cell below allows you to compare your results
to the matching subset of the original dataframe, `atten_df`.
The order of the columns doesn't matter,
but the two should have the same columns,
the same number of observations in each group,
and the same unique values for `attention` and `solutions`.

```python

def null_sample_to_dataframe(sample, solutions):
    scores_series = pd.Series(sample["scores"].flatten())
    n_group = ?
    
    df = pd.DataFrame(
        {"score": ?,
         "attention": ["divided"] * ? + ["focused"] * ?,
         "solutions": ?})
    return df
```

In [None]:
# uncomment to view the output of your solution
# in comparison to atten_df without the subject column

# print(atten_df[atten_df["solutions"] == 3].drop("subject", axis=1))
# print(null_sample_to_dataframe(null_model_t_easy_samples_df.iloc[0], 3))

In [None]:
ok.grade("q2_06")

Now, use that function to make a dataframe for each sample from pyMC,
then use `compute_t_attention` on each dataframe.
Put the results into a list called `pymc_null_ts_easy`.

This list is an estimate of the sampling distribution of the $t$ statistic under the null hypothesis.
Use that estimate to compute the $p$ value of the null hypothesis on this data
by comparing the value of `t` on the real data to it.
Save the result to `easy_p_pymc`.

In [None]:
ok.grade("q2_07")

Now, use the same method to compute the $p$ value for the null hypothesis about the effect of attention state
on the harder task, with `solutions == 1`.
Save the samples from the null distribution of $t$ to `pymc_null_ts_hard`
and the $p$ value to `hard_p_pymc`.

In [None]:
ok.grade("q2_08")

In [None]:
ok.score()