In [None]:
import pymc3 as pm
import matplotlib.pyplot as plt

%load_ext autoreload
%autoreload 2
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

# Introduction

This notebook is designed to give you more practice with PyMC3 syntax. 

It deliberately provides more guidance on model definition (i.e. which parameters to use), so that the focus is on PyMC3 syntax rather than on the mechanics of model definition. 

If you already feel comfortable with PyMC3 syntax and would rather practice model definition, feel free to move on to notebook 5, where you can play with the Darwin's finches dataset. That notebook is intentionally designed with much more freedom.

## Setup

You will be analyzing experimental data on the effectiveness of six different phone sterilization methods, compared against two control methods. This research was conducted at MIT's Division of Comparative Medicine, and was published this year in the Journal of the American Association for Laboratory Animal Science. If you're interested, you can read the paper [here][jaalas].

[jaalas]: https://www.ncbi.nlm.nih.gov/pubmed/29402348

### Experiment Design

Briefly, the experiments were set up as follows.

1. Pre-sterilization, three sites on the phone were swabbed and the number of colony forming units (CFUs) was determined by letting the swabbed bacteria grow on an agar plate.
1. Post-sterilization, the same three sites were swabbed and the number of CFUs was counted.
1. Sterilization efficacy was determined by taking the ratio of the difference in CFUs pre- and post-sterilization to the number of CFUs pre-sterilization.

In the paper, we used the following formula to compute the percentage reduction:

$$\delta_{method} = \frac{{count}_{pre} - {count}_{post}}{{count}_{pre}}$$

In retrospect, a better definition would have been:

$$x = \frac{|{count}_{pre} - {count}_{post}|}{{count}_{pre}}$$
$$\delta_{method} = \begin{cases}
    0 & \text{if} & x\lt0, \\
    1 & \text{if} & x\gt1, \\
    x & \text{otherwise}
    \end{cases}$$

A few pointers:

- We want the absolute value because the number of colonies post-sterilization is sometimes greater than the number of colonies pre-sterilization, which can happen by chance due to experimental variability.
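
As a minimal sketch, the revised definition can be written in plain Python. The function name `percent_reduction` is hypothetical, and raising an error when the pre-sterilization count is zero is an assumption made here, since the formula is undefined in that case. Note that the absolute value means $x$ can never be negative, so the lower clip acts only as a safeguard; the upper clip takes effect when the post-sterilization count is more than double the pre-sterilization count.

```python
def percent_reduction(count_pre, count_post):
    """Clipped fractional reduction in CFU counts (revised definition).

    Hypothetical helper for illustration only; it is not part of the
    original analysis code.
    """
    if count_pre <= 0:
        # Assumption: treat a zero (or negative) pre-count as an error,
        # because the ratio is undefined.
        raise ValueError("count_pre must be positive")
    x = abs(count_pre - count_post) / count_pre
    # Clip x to the interval [0, 1].
    return min(max(x, 0.0), 1.0)


# 100 colonies before, 20 after: an 80% reduction.
print(percent_reduction(100, 20))
# More colonies after than before, by a wide margin: clipped to 1.
print(percent_reduction(10, 40))
```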

### Data

The data are located in `../data/sterilization.csv`.

In [None]:
import pandas as pd
import pymc3 as pm
import matplotlib.pyplot as plt
import numpy as np
import missingno as msno

%load_ext autoreload
%autoreload 2
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

### Step 1: Define Data Generating Process

Just as in the previous notebook, you may want to spend 5-10 minutes talking through the data generating process before proceeding. The most important thing is to list the distributions that you think are most relevant to the problem.
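
To make the discussion concrete, here is one possible story simulated with NumPy. Everything in it is an assumption for illustration, not the model from the paper or the intended answer: colony counts are taken to be Poisson-distributed, and the names and values `lam_pre`, `efficacy`, and `n_phones` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

n_phones = 1000   # hypothetical number of swabbed sites
lam_pre = 50.0    # hypothetical mean CFU count before sterilization
efficacy = 0.9    # hypothetical true fractional reduction

# Assumption: counts before and after sterilization are Poisson-distributed,
# with the post-sterilization rate scaled down by the efficacy.
count_pre = rng.poisson(lam_pre, size=n_phones)
count_post = rng.poisson(lam_pre * (1 - efficacy), size=n_phones)

# Keep only sites with a non-zero pre-count so the ratio is defined.
valid = count_pre > 0
x = np.abs(count_pre[valid] - count_post[valid]) / count_pre[valid]

# Clip to [0, 1], following the revised definition of the reduction.
delta = np.clip(x, 0, 1)

print(delta.mean())
```

Simulating from a candidate story like this is a quick way to check whether the distributions you listed can produce data that resemble the real measurements.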

### Step 2: Explore the Data

#### Exercise 1

Load the dataset into a pandas DataFrame. It is available at the path `../data/sterilization.csv`. 

In [None]:
df = pd.read_csv('../data/sterilization.csv', na_filter=True, na_values=['#DIV/0!'])
df.sample(5)

To help you visualize what data are available and missing in the dataframe, run the cell below to get a visual matrix (using MissingNo). 

In [None]:
msno.matrix(df)
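
If you prefer numbers to a picture, a per-column count of missing values gives the same information. The sketch below uses a tiny inline CSV rather than the real file; the rows and the column names `colonies_pre` and `colonies_post` are hypothetical, while `treatment` and `perc_reduction colonies` are the names used in this notebook. It also shows how `na_values=['#DIV/0!']` turns spreadsheet division errors into missing values.

```python
import io

import pandas as pd

# Hypothetical rows for illustration only; this is not the real dataset.
csv_text = """treatment,colonies_pre,colonies_post,perc_reduction colonies
A,100,20,80
B,0,0,#DIV/0!
C,50,,
"""

# Treat spreadsheet division errors as missing values, as in the cell above.
toy_df = pd.read_csv(io.StringIO(csv_text), na_values=['#DIV/0!'])

# Count the missing entries in each column.
missing_per_column = toy_df.isna().sum()
print(missing_per_column)
```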

#### Exercise 2

Plot the average percentage reduction in colonies for each treatment.

In [None]:
# Select the column before aggregating so that non-numeric columns are not averaged.
df.groupby('treatment')['perc_reduction colonies'].mean().plot(kind='bar')

In [None]:
# Summary statistics for every numeric column.
df.describe()

### Step 3: Implement and Fit Model

#### Exercise

Write your generative model for the data. 

In [None]:
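with pm.Model() as model:
    # Exercise: define your priors here.

    # Exercise: complete the line below. pm.Deterministic requires a name
    # and a variable, i.e. pm.Deterministic(name, var).
    like = pm.Deterministic()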