<img src="../../shared/img/banner.svg"></img>

# Homework 02 - Descriptive Statistics and Bootstrapping

**REMEMBER**: if you downloaded this file to datahub with an interact link,
it will be called `hw02_blank.ipynb`.

Before doing any work, go to the dropdown menu, `File > Make a Copy`,
and then rename that copy to `hw02.ipynb`.
Work in the copy, rather than the original.

In [None]:
%matplotlib inline

In [None]:
import sys

sys.path.append("../../")

from shared.src import quiet
from shared.src import seed
from shared.src import style

In [None]:
import inspect
from pathlib import Path
import random
import string

from client.api.notebook import Notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

import shared.src.utils.util as util

In [None]:
sns.set_context("notebook", font_scale=1.7)

In [None]:
ok = Notebook("ok/config")

## Learning Objectives


1. Familiarize yourself with key descriptive statistics, like the mean and standard deviation.
1. Learn to manipulate dataframes with `groupby`.
1. Practice applying bootstrapping estimation of confidence intervals to data from a psychological experiment.

For every function below, you can assume that the input `Series` has at least two values
and that none of the values are null/NaN,
so there's no need to handle an empty `Series` or missing data.

## Section 1 - Descriptive Statistics

### Mean, Variance, and Standard Deviation

For all of the below, your implementation must be in terms of basic Python keywords and operators:
`for` loops, `+`, `/`, `**`.

Do NOT use any methods or functions, like `.mean` or `np.var`,
from outside the standard library.
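
As a rough sketch of the permitted building blocks, the snippet below loops over a toy `Series` and accumulates values with basic operators only. The data and the variable names (`total`, `count`, and so on) are made up for illustration and are not part of the assignment.

```python
import pandas as pd

# Toy data, purely for illustration.
s = pd.Series([2.0, 4.0, 9.0])

# Iterating over a Series yields its values one at a time.
total = 0
count = 0
for value in s:
    total = total + value
    count = count + 1

# Exponentiation uses **; an exponent of 0.5 gives a square root.
squared = 3 ** 2
root = 9 ** 0.5
```

Accumulating inside a `for` loop like this is one way to build up the quantities the functions below need without calling any pandas or numpy methods.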

Implement a function, `mean`, that takes in a pandas `Series` and calculates its mean.


In [None]:
ok.grade("q1")

Implement a function, `var`, that takes in a pandas `Series` and calculates its variance.
In your implementation,
re-use your `mean` function at least once.

In [None]:
ok.grade("q2")

Implement a function, `std`, that takes in a pandas `Series` and calculates its standard deviation.
In your implementation,
re-use your `var` function.

In [None]:
ok.grade("q3")

### Median

Implement a function, `median`, that computes the median of a `Series`.

Hint: use `.sort_values` to get a sorted version of the series.
Once you've got that, all you need to do is
index into the right location(s) with `.iloc`
and handle the two cases of even-sized `Series` and odd-sized `Series`.
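
The two tools from the hint can be tried out on a toy `Series` first. This is only a sketch of how `.sort_values` and `.iloc` behave, with made-up data and names; it deliberately stops short of computing a median.

```python
import pandas as pd

# Toy data with a non-default index, purely for illustration.
s = pd.Series([30, 10, 20, 40], index=["a", "b", "c", "d"])

# sort_values returns a new, sorted Series; the original is left unchanged.
sorted_s = s.sort_values()

# .iloc indexes by position in the sorted order, ignoring the index labels.
smallest = sorted_s.iloc[0]
largest = sorted_s.iloc[-1]

# Integer division and the remainder operator help tell even from odd sizes.
n = len(sorted_s)
is_even = n % 2 == 0
half = n // 2
```

Note that sorting keeps each value's original label, which is why positional indexing with `.iloc` is the safer choice here.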

In [None]:
ok.grade("q4")

## Section 2 - Bootstrapping

Implement a function, `bootstrap_stat`,
that takes as its arguments a series `s`, a function `stat_fun`, and an integer `n_boots`, in that order,
and then computes and returns the value of `stat_fun` on `n_boots` bootstrap samples of the series `s`.

`bootstrap_stat(pd.Series([1,2,3]), mean, 5)` would return a list with 5 items -- each being the mean of a bootstrapped sample (sample of size 3 with replacement) of the series `[1, 2, 3]`.

Note that `Series`, like `DataFrame`s, have a `.sample` method that takes `frac` and `replace` keyword arguments.
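
As a minimal sketch of what those two keyword arguments do (the toy data is made up for illustration): `frac=1` asks for a sample as large as the original, and `replace=True` allows the same value to be drawn more than once.

```python
import pandas as pd

# Toy data, purely for illustration.
s = pd.Series([1, 2, 3, 4, 5])

# One bootstrap sample: same size as s, drawn with replacement.
resampled = s.sample(frac=1, replace=True)

print(resampled.tolist())  # output varies from run to run
```

Because the draw is random, some values will typically appear more than once and others not at all.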

In [None]:
ok.grade("q5")

Implement a function `inside_95CI` that takes in a bootstrap series `s` and a value `p`
and returns `True` if the value is inside the 95% confidence interval
for the bootstrap parameter being estimated
and `False` otherwise.

As an example, your function should do the following:

```python
>>> one_to_100 = pd.Series([1,2,3,(...),99,100])
>>> inside_95CI(one_to_100, 50)
True
>>> inside_95CI(one_to_100, 2)
False
```

To generate your confidence intervals,
you'll likely want to use the function `np.percentile`.
This function takes a list or series of values, `a`, and a list of numbers between 0 and 100, `q`,
and returns the value of `a` at each of the percentiles in `q`.
The result will be a `np.ndarray`,
so use the functions `list` or `pd.Series` to cast it back to a more familiar datatype if you need to.
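
Here is a small sketch of `np.percentile` on its own, using the numbers 1 through 100. The variable names are illustrative, and the choice of the 2.5th and 97.5th percentiles is one common way to bracket the central 95% of a set of values.

```python
import numpy as np
import pandas as pd

# The integers 1 through 100, as a Series.
values = pd.Series(range(1, 101))

# Ask for two percentiles at once; the result is an np.ndarray.
bounds = np.percentile(values, [2.5, 97.5])

# Cast to a list and unpack if plain Python numbers are more convenient.
lower, upper = list(bounds)
print(lower, upper)  # approximately 3.475 and 97.525
```

With the default settings, `np.percentile` interpolates linearly between neighboring values, which is why the results are not whole numbers.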

In [None]:
ok.grade("q6")

## Section 3 - `groupby`

The cell below loads some data from a (possibly fictional) psychology experiment,
in which subjects were scored on their performance on tasks of varying difficulty.
Half of the subjects had their attention divided by a distractor, while half did not.
Scores are stored in the `score` column, while attentional states are stored in the `attention` column.

In [None]:
atten_expt = sns.load_dataset("attention", data_home=Path("..") / ".." / "shared" / "data", index_col=0)

atten_expt.head()

Use `groupby` to generate a `GroupBy` object grouped by `attention` state, named `expt_groupby_atten`,
and then use it to calculate the mean `score`s of each group with `.mean`.
Save the mean scores to a `Series` called `atten_group_means`.
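
If `groupby` is new to you, the following sketch shows the same pattern on a made-up dataframe. The column names `condition` and `value` and the variable names are hypothetical and unrelated to the experiment data.

```python
import pandas as pd

# Toy data, purely for illustration.
toy = pd.DataFrame({
    "condition": ["a", "a", "b", "b"],
    "value": [1.0, 3.0, 10.0, 20.0],
})

# Grouping by a column produces a GroupBy object...
toy_groupby = toy.groupby("condition")

# ...and selecting a column before calling .mean gives a Series,
# indexed by the group labels.
group_means = toy_groupby["value"].mean()
print(group_means)
```

The resulting `Series` has one entry per group, so `group_means.loc["a"]` retrieves the mean for group `"a"`.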

In [None]:
ok.grade("q7")

Participants were also presented with problems of varying difficulty.
Easier problems had multiple `solutions`, up to 3,
while the hardest problems had only 1 solution.
This information is stored in the `solutions` column.

Use `groupby` to generate a `GroupBy` object grouped
by both `attention` state and number of `solutions`, named `expt_groupby_atten_solu`,
and then use it to calculate the mean of the `score` column of each group with `.mean`.
Save the mean scores to a Series called `atten_solu_group_means`.
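
Grouping by two columns works the same way, except that `groupby` receives a list of column names and the result is indexed by pairs of labels. The sketch below uses a made-up dataframe with hypothetical column names.

```python
import pandas as pd

# Toy data, purely for illustration.
toy = pd.DataFrame({
    "condition": ["a", "a", "a", "b", "b", "b"],
    "level": [1, 1, 2, 1, 2, 2],
    "value": [1.0, 3.0, 5.0, 10.0, 20.0, 40.0],
})

# Passing a list of columns groups by every combination that occurs.
means = toy.groupby(["condition", "level"])["value"].mean()

# The result has a two-level index; use a tuple to look up one combination.
a1_mean = means.loc[("a", 1)]
b2_mean = means.loc[("b", 2)]
print(means)
```

Looking up a single combination with a tuple, as in `means.loc[("a", 1)]`, is handy when comparing two specific groups.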

In [None]:
ok.grade("q8")

## Section 4 - Putting It All Together

The cell below uses seaborn to plot the mean scores and bootstrapped confidence intervals
for each combination of attention state and solution count.

In [None]:
f, ax = plt.subplots(figsize=(12, 8))
sns.pointplot(x="solutions", y="score", hue="attention", data=atten_expt,
              ax=ax, palette={"divided": "C0", "focused": "C1"});

It appears that individuals who were focused performed better on the hardest tasks,
those with only one solution,
but that average performance was roughly the same on the easiest tasks, those with three solutions.

As a first pass at determining whether this pattern is real,
we'll combine `groupby` with bootstrapping
to estimate confidence intervals for these two specific effects.

Later in the course, we will learn more systematic approaches to analyzing this type of data.

The cell below defines a function,
`stratified_bootstrap`,
that combines `groupby` and `sample`
to perform bootstrap samples of each of the six combinations
of attention state and solution count simultaneously.

Recall that the purpose of bootstrapping is to simulate
what might happen if we were to repeat our measurements.
If we were to repeat our experiment,
we would always end up with half of our subjects in the focused group
and with an equal number of observations from each difficulty level,
so we have to respect that in our bootstrap.
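
To make "respecting the group structure" concrete, here is a sketch on a made-up dataframe with unequal group sizes. It uses the built-in `sample` method of `GroupBy` objects, which is a slightly different route than the function defined in this notebook, and all names in it are illustrative.

```python
import pandas as pd

# Toy data with unequal group sizes, purely for illustration.
toy = pd.DataFrame({
    "condition": ["a"] * 3 + ["b"] * 5,
    "value": range(8),
})

# Resample with replacement inside each group separately.
resampled = toy.groupby("condition").sample(frac=1, replace=True)

# Group sizes are identical before and after resampling.
original_sizes = toy["condition"].value_counts().to_dict()
resampled_sizes = resampled["condition"].value_counts().to_dict()
print(original_sizes, resampled_sizes)
```

A plain bootstrap of the whole dataframe, by contrast, would let the number of rows in each group vary from sample to sample.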

In [None]:
def stratified_bootstrap(df, groups=["attention", "solutions"]):
    """Performs stratified bootstrap sampling, i.e. bootstrap sampling inside groups,
    where the columns used to group are defined by the groups argument.
    """
    # Group by the requested columns, then resample each group with replacement.
    return df.groupby(groups).apply(pd.DataFrame.sample, frac=1, replace=True)

In [None]:
stratified_bootstrap(atten_expt).head(10)

In [None]:
stratified_bootstrap(atten_expt).tail(10)

Side Note: our bootstrapping strategy isn't quite right,
since it ignores the fact that we actually tested subjects in all three conditions.
We'll deal with modeling subject-wise data later in the course.

Use `stratified_bootstrap` to perform bootstrap sampling on the difference in means
between the focused and divided groups
for both easy problems (`atten_expt["solutions"] == 3`)
and hard problems (`atten_expt["solutions"] == 1`).
For each, 100 bootstrap samples will suffice.

Save the former to a list called `easy_boot_delta_means`
and the latter to `hard_boot_delta_means`.

Then, for both, check whether the number `0`
is in the 95% confidence interval
using `inside_95CI`.

Save the resulting `True`/`False` values to variables called `no_difference_easy` and `no_difference_hard`.

In [None]:
ok.grade("q9")

## Scoring and Submission

In [None]:
ok.score()