<img src="../../shared/img/banner.svg" width=2560></img>

# Lab 06 - Bayesian Inference for Group Means

In [1]:
%matplotlib inline

In [2]:
import sys

sys.path.append("../../")

from shared.src import quiet
from shared.src import seed
from shared.src import style

In [3]:
from pathlib import Path

from client.api.notebook import Notebook
from IPython.display import Image
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc3 as pm
import seaborn as sns

import shared.src.utils.util as shared_util

In [4]:
sns.set_context("notebook", font_scale=1.7)

## Learning Objectives

1. Become comfortable designing models based on visualizations of data.
2. Practice using pyMC to draw from priors and posteriors.
3. Draw inferences from posteriors and communicate them.

In this week's lab,
you'll develop a model that aims to infer
whether two groups differ in their average value of some variable.
You will be given the choice of dataset,
the choice of groups or variables within that dataset,
and the freedom to design an accompanying model.

This lab is significantly more open-ended than previous labs.
For example, there is no autograding portion.
This better represents the kinds of problems
data scientists and research psychologists face in their work:
there are no "tests" to check whether a model is correct,
the definition of success is at least partially under their control,
and the work passes all the way from raw data to insight.

## Loading the Datasets

The `seaborn` library can download a number of "demonstration" datasets,
many of which are classic datasets in statistics.

The sections below load and describe three datasets from this collection.

They are saved as `csv` files in the `content/shared/data` folder of this course
and loaded into the Python workspace as `DataFrame`s.

In [5]:
shared_data_dir = Path(".") / ".." / ".." / "shared" / "data"

#### `iris`

The `iris` dataset has a long history:
it was introduced by Ronald Fisher in the 1930s
to develop early ideas in statistical classification.

In [6]:
iris = sns.load_dataset("iris", data_home=shared_data_dir)
iris.columns

Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width',
       'species'],
      dtype='object')

The dataset contains anatomical measurements of three different `species` of the iris flower:

In [7]:
iris["species"].unique()

array(['setosa', 'versicolor', 'virginica'], dtype=object)

The measurements are of the `length` and `width` of two components of the flower:
the `petal` and `sepal`, pictured below.

In [8]:
Image(url="https://upload.wikimedia.org/wikipedia/commons/7/78/Petal-sepal.jpg", width=250)

Petals are the component that most people associate with flowers.
Sepals are a more leaf-like, typically green component that primarily serves to protect flowers before they bloom.

The question behind the `iris` dataset is whether
these anatomical features can be used to predict the `species`.

For more information about this dataset, see [Kaggle](https://www.kaggle.com/arshid/iris-flower-dataset).

#### `attention`

In previous labs and homeworks, we have considered the `attention` dataset:

In [9]:
attention = sns.load_dataset("attention", data_home=shared_data_dir, index_col=0)
attention.columns

Index(['subject', 'attention', 'solutions', 'score'], dtype='object')

Subjects attempted to complete a task with varying numbers of `solutions`,
between `1` and `3`,
while their `attention` was either
`divided` away from the task by a distractor
or `focused` on the task.

In [10]:
print(attention["attention"].unique(), "\n", attention["solutions"].unique())

['divided' 'focused'] 
 [1 2 3]


#### `exercise`

In this dataset,
healthy human volunteers
on a `low fat` or `no fat` `diet`
had their heart rate, or `pulse`,
measured while they performed exercises of different `kind`s:
either `running`, `walking`, or at `rest`.
Their heart rates were also measured
at different `time`s:
after `1`, `15`, and `30` minutes of each exercise.

In [11]:
exercise = sns.load_dataset("exercise", data_home=shared_data_dir, index_col=0)
exercise.columns

Index(['id', 'diet', 'pulse', 'time', 'kind'], dtype='object')

In [12]:
print(exercise["kind"].unique(), "\n", exercise["diet"].unique(), "\n", exercise["time"].unique())

[rest, walking, running]
Categories (3, object): [rest, walking, running] 
 [low fat, no fat]
Categories (2, object): [low fat, no fat] 
 [1 min, 15 min, 30 min]
Categories (3, object): [1 min, 15 min, 30 min]


## Visualizing the Data

The first thing to do when you embark on a new analysis,
especially with new data,
is to visualize the data.

Select at least one of the suggested visualizations below for the dataset you want to work with
and/or come up with your own.
If you come up with your own visualization,
remember that the end goal is to see whether the mean
is different between two sub-groups in the data.
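
A numeric summary can sit alongside whichever plot you choose. The snippet below is a minimal sketch of a per-group summary with `pandas`; the `toy` DataFrame, its column names, and its values are made up for illustration and stand in for whichever dataset you pick.

```python
import pandas as pd

# Hypothetical toy data standing in for one of the lab's DataFrames.
toy = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "b"],
    "value": [1.0, 2.0, 3.0, 4.0, 6.0, 8.0],
})

# Mean, standard deviation, and number of observations for each group.
group_summary = toy.groupby("group")["value"].agg(["mean", "std", "count"])

# Difference in sample means between the two groups.
mean_difference = group_summary.loc["b", "mean"] - group_summary.loc["a", "mean"]

print(group_summary)
print("difference in sample means:", mean_difference)
```

The sample difference is only a starting point: it says nothing about uncertainty, which is what the model in the next section is for.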

#### Once you've produced one or more visualizations, answer the question(s) below.

### Some Suggested Visualizations:

#### `iris`:
  - `pairplot`, with color given by the species of the flower
  - a single figure with a `distplot` for each species on one of the anatomical measures

#### `attention`:
  - any of the plots from the previous encounters with this dataset
  - `pairplot`, with color given by attentional state and dropping the `subject` column
  - a `boxplot`, showing the distribution of scores and split by either the number of solutions, the attentional state, or both

#### `exercise`:
  - `boxplot`, showing the distribution of the pulse variable, separating out observations by diet, kind of exercise, or both.

#### Q Describe at least one pattern in the data that jumps out at you from your visualization. For example: Does one variable seem to have little or no effect? Does one variable seem to have a large effect? Do you see different patterns looking at pairs of variables than looking at each individually? Does one of the distributions you observe look strange?

## Creating a Model of the Data

Select one of the claims about group averages below
and create a `pyMC` model of your data
whose posterior can be used to perform Bayesian inference
on that claim.

That is,
write down a likelihood for the observed data given its
(unknown) parameters.
Because we are interested in a difference of means,
you'll want to select a likelihood with
a parameter for the mean.

Then, above the definition of the likelihood in your model,
write down prior distributions for the parameters of the likelihood.

Make sure you include the data in the model by placing
`observed=` in the right place!
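
As one illustrative way to write such a model down (a sketch, not the expected answer), a two-group comparison with Normal likelihoods might look like the following. The group labels $A$ and $B$, the shared standard deviation $\sigma$, and the hyperparameters $m$, $s$, and $t$ are all assumptions made for this example; you may prefer separate variances or a different likelihood family altogether.

$$
\begin{aligned}
\mu_A, \mu_B &\sim \text{Normal}(m, s) \\
\sigma &\sim \text{HalfNormal}(t) \\
y_{A,i} &\sim \text{Normal}(\mu_A, \sigma) \\
y_{B,j} &\sim \text{Normal}(\mu_B, \sigma) \\
\delta &= \mu_A - \mu_B
\end{aligned}
$$

The first two lines are the prior, the next two are the likelihood (these are the variables that receive `observed=`), and $\delta$ is the difference in means that the inference targets.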

### Some Suggested Claims

Note that all claims below are assumed to be about _averages_.
Note also that they don't state whether variances are different or not,
nor do they say anything about how parameters like the mean determine
the distribution of the data,
and so those choices are left up to you, as the modeler. 

#### `iris`:
  - Length of sepals differs between two species
  - Length of petals differs between two species
  - Width of petals differs between two species
  - Width of sepals differs between two species

#### `attention`:
  - Score differs across attention state
  - Score differs between hard and easy problems
  - Score differs across attention state for hard problems
  - Score differs across attention state for easy problems

#### `exercise`:
  - Pulse differs between running and resting
  - Pulse differs between low-fat and no-fat diet
  - Pulse differs between low-fat and no-fat diet while running
  - Pulse differs between low-fat and no-fat diet while resting
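
Several of these claims are conditional (for example, a comparison restricted to one kind of exercise), so the observations for each group need to be pulled out before they are passed to the model. The snippet below is a minimal sketch of that selection with boolean masks; the `toy` DataFrame and its values are made up for illustration and only mimic the column names of `exercise`.

```python
import pandas as pd

# Hypothetical toy data that mimics the column names of `exercise`.
toy = pd.DataFrame({
    "diet": ["low fat", "no fat", "low fat", "no fat", "low fat", "no fat"],
    "kind": ["running", "running", "rest", "rest", "running", "running"],
    "pulse": [120, 130, 85, 90, 125, 140],
})

# First restrict to the condition of interest...
running = toy[toy["kind"] == "running"]

# ...then split into the two groups being compared.
low_fat_pulse = running.loc[running["diet"] == "low fat", "pulse"].to_numpy()
no_fat_pulse = running.loc[running["diet"] == "no fat", "pulse"].to_numpy()

print(low_fat_pulse, no_fat_pulse)
```

Each resulting array can then serve as the observed data for one group's likelihood.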

#### Q Identify the "prior" component of your model and explain your choice of distribution for each random variable in it.

#### Q Identify the "likelihood" component of your model and explain your choice of distribution for each random variable in it.

## Drawing an Inference about Group Means

Draw samples from the posterior of your model
and then visualize the posterior over the difference in means
and obtain the highest posterior density interval.

If you calculated the difference of means inside the model,
then you can use `plot_posterior` to visualize the posterior
over the difference in means and the other variables simultaneously.

If you did not calculate the difference of means inside the model,
then you can calculate it directly on the samples by subtracting the values of the mean variables
in each sample.
Then, use `pm.stats.hpd` to compute the interval of highest posterior density.
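
The sample-wise subtraction can be sketched with plain NumPy. In the snippet below, the two arrays of "posterior samples" are simulated stand-ins with made-up values (in the lab they would come from your trace), and `narrowest_interval` is a hypothetical hand-rolled helper that illustrates the idea behind an HPD interval; it is not the pyMC implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for posterior samples of two group means. In the lab these
# would come from your trace; the names and values here are hypothetical.
mu_a_samples = rng.normal(loc=5.0, scale=0.1, size=4000)
mu_b_samples = rng.normal(loc=4.0, scale=0.1, size=4000)

# One difference per posterior sample.
diff_samples = mu_a_samples - mu_b_samples


def narrowest_interval(samples, mass=0.95):
    """Return the narrowest interval containing at least `mass` of the samples."""
    ordered = np.sort(np.asarray(samples))
    n_inside = int(np.floor(mass * len(ordered)))
    # Width of every candidate interval spanning n_inside + 1 sorted samples.
    widths = ordered[n_inside:] - ordered[: len(ordered) - n_inside]
    start = int(np.argmin(widths))
    return ordered[start], ordered[start + n_inside]


lower, upper = narrowest_interval(diff_samples)

# Fraction of posterior samples in which the difference is positive.
prob_positive = np.mean(diff_samples > 0)

print(f"95% interval: ({lower:.2f}, {upper:.2f})")
print(f"fraction of samples with difference > 0: {prob_positive:.3f}")
```

The fraction of samples above zero is a direct estimate of the posterior probability that the difference is positive, which complements the interval when you communicate the result.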

#### Use the visualization of the posterior samples and the HPD to answer the questions below.

#### Q Based on your posterior and the 95% HPD, do you think there is a high probability that the difference in means you checked for is greater than 0? Explain your reasoning.