In [1]:
import pymc3 as pm
import pandas as pd
import matplotlib 

from sklearn.preprocessing import LabelEncoder


%matplotlib inline

Using cuDNN version 5110 on context None
Mapped name None to device cuda: GeForce GTX 1080 (0000:01:00.0)


## Problem Type

The Bayesian estimation model is widely applicable across a number of scenarios. The classical scenario is an experimental design with a control and a treatment, where we want to know the difference between the two. Here, "estimation" refers to estimating the "true" value for the control and the "true" value for the treatment, and "Bayesian" refers to computing the uncertainty surrounding those estimates.

Bayesian estimation's advantages over the classical t-test were first described by John Kruschke (2013).

In this notebook, I provide a concise implementation suitable for two-sample and multi-sample inference.

## Data structure

To work with this model, the data should be structured as follows:

- Each row is one measurement.
- The columns should indicate, at the minimum:
    - What treatment group the sample belonged to.
    - The measured value.
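
If the measurements start out in a wide layout (one column per treatment group), they can be reshaped into the one-row-per-measurement form described above. The following is a minimal sketch using only the standard library; the group names and values are made up for illustration, and in a pandas workflow `DataFrame.melt` would do the same job.

```python
# Hypothetical wide-format data: one key per treatment group,
# each holding that group's replicate measurements.
wide = {
    "control": [1.9, 2.1, 2.0],
    "treatment_A": [2.8, 3.1, 2.9],
}

# Reshape to long ("tidy") format: one row per measurement, with a
# column for the group label and a column for the measured value.
long_rows = [
    {"group": group, "value": value}
    for group, values in wide.items()
    for value in values
]

for row in long_rows:
    print(row["group"], row["value"])
```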

## Extensions to the model

As of now, the model only samples the posterior distributions of the measured values. It may be extended to compute differences in means (sample vs. control) or effect sizes, complete with the uncertainty around them. Wrap those statistics in `pm.Deterministic(...)` so that their posterior distributions, i.e. their uncertainty, are recorded as well.
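
To make these derived statistics concrete, here is a rough sketch of the arithmetic that a `pm.Deterministic` node would record on every draw. It runs outside PyMC3 on stand-in draws generated with `random`, so every number below is an assumption. The effect-size formula (difference in means divided by the root of the average of the two squared scales) is one common definition, not necessarily the one this notebook would settle on.

```python
import math
import random

random.seed(0)
n_draws = 1000

# Stand-in posterior draws (NOT from the notebook's model): in practice
# these would come from the trace, one value per MCMC draw.
mu_control = [random.gauss(1.7, 0.1) for _ in range(n_draws)]
mu_treat = [random.gauss(2.5, 0.1) for _ in range(n_draws)]
sd_control = [abs(random.gauss(0.5, 0.05)) for _ in range(n_draws)]
sd_treat = [abs(random.gauss(0.6, 0.05)) for _ in range(n_draws)]

# Difference in means, computed draw by draw so that the result is
# itself a posterior distribution rather than a single number.
diff = [t - c for t, c in zip(mu_treat, mu_control)]

# Effect size per draw: the difference scaled by
# sqrt((sd_treat^2 + sd_control^2) / 2).
effect_size = [
    d / math.sqrt((st ** 2 + sc ** 2) / 2)
    for d, st, sc in zip(diff, sd_treat, sd_control)
]

print(sum(diff) / n_draws, sum(effect_size) / n_draws)
```

Because both quantities are computed per draw, their spread across draws is the uncertainty that the text above refers to.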

## Reporting summarized findings

Here are examples of how to summarize the findings.

> Treatment group A was greater than control by x units (95% HPD: [`lower`, `upper`]). 

> Treatment group A was higher than control (effect size 95% HPD: [`lower`, `upper`]). 
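
The `lower` and `upper` bounds in the sentences above can be obtained from posterior draws. Below is a minimal sketch of one way to compute a highest posterior density (HPD) interval and fill in the first template. The helper `hpd_interval` is hypothetical (written for this example, not a PyMC3 function), and the draws are simulated stand-ins rather than results from this notebook.

```python
import math
import random


def hpd_interval(samples, mass=0.95):
    """Return the narrowest interval containing `mass` of the samples."""
    ordered = sorted(samples)
    n = len(ordered)
    k = max(1, math.ceil(mass * n))  # number of draws inside the interval
    # Slide a window of k consecutive sorted draws and keep the narrowest.
    start = min(range(n - k + 1), key=lambda i: ordered[i + k - 1] - ordered[i])
    return ordered[start], ordered[start + k - 1]


random.seed(1)
# Stand-in posterior draws of (treatment A - control); NOT real results.
diff_draws = [random.gauss(0.8, 0.15) for _ in range(2000)]

estimate = sum(diff_draws) / len(diff_draws)
lower, upper = hpd_interval(diff_draws)
print(
    "Treatment group A was greater than control by "
    f"{estimate:.2f} units (95% HPD: [{lower:.2f}, {upper:.2f}])."
)
```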

## Other notes

Here, we make a few modelling choices.

1. We care only about the `normalized_measurement` column, and we model it with a Student's t-distribution, because we don't have a good "mechanistic" model that incorporates the measurement error of `OD600` and `measurement`.
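
One practical consequence of choosing the t-distribution is that it has heavier tails than the normal, so an occasional extreme measurement is penalised far less and pulls the estimates less. A small sketch, using only the standard library and made-up values (the `nu=4.0` below is illustrative, not the value inferred by the model), compares the two log-densities:

```python
import math


def normal_logpdf(x, mu, sd):
    """Log-density of a normal distribution."""
    z = (x - mu) / sd
    return -0.5 * math.log(2 * math.pi) - math.log(sd) - 0.5 * z * z


def student_t_logpdf(x, mu, sd, nu):
    """Log-density of a location-scale Student's t-distribution."""
    z = (x - mu) / sd
    return (
        math.lgamma((nu + 1) / 2)
        - math.lgamma(nu / 2)
        - 0.5 * math.log(nu * math.pi)
        - math.log(sd)
        - (nu + 1) / 2 * math.log1p(z * z / nu)
    )


# Illustrative values only: a point 5 scale units from the centre.
outlier = 5.0
print("normal:   ", normal_logpdf(outlier, mu=0.0, sd=1.0))
print("student-t:", student_t_logpdf(outlier, mu=0.0, sd=1.0, nu=4.0))
```

The t log-density at the outlying point is much higher than the normal one, which is why a single extreme replicate has less influence on the fitted location.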

In [2]:
df = pd.read_csv('datasets/biofilm.csv')
continuous_cols = ['OD600', 'ST', 'replicate', 'measurement', 'normalized_measurement']
for c in continuous_cols:
    df[c] = pm.floatX(df[c])
df.head()

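|   | experiment | isolate | ST | OD600 | measurement | replicate | normalized_measurement |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 4.0 | 0.461 | 0.317 | 1.0 | 0.687636 |
| 1 | 1 | 2 | 55.0 | 0.346 | 0.434 | 1.0 | 1.254335 |
| 2 | 1 | 3 | 55.0 | 0.356 | 0.917 | 1.0 | 2.575843 |
| 3 | 1 | 4 | 4.0 | 0.603 | 1.061 | 1.0 | 1.759536 |
| 4 | 1 | 5 | 330.0 | 0.444 | 3.701 | 1.0 | 8.335586 |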


In [3]:
df.dtypes

experiment                  int64
isolate                    object
ST                        float64
OD600                     float64
measurement               float64
replicate                 float64
normalized_measurement    float64
dtype: object

In [4]:

le = LabelEncoder()
le.fit(df['isolate'])
df['indices'] = le.transform(df['isolate']).astype('int32')

In [5]:
le.classes_

array(['1', '10', '11', '12', '13', '14', '15', '2', '3', '4', '5', '6',
       '7', '8', '9', 'ATCC_29212'], dtype=object)

In [6]:
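with pm.Model() as best:
    # Degrees of freedom of the t-distribution; the +1 shift keeps nu > 1.
    # The float literal avoids integer division under Python 2.
    nu = pm.Exponential('nu_minus_one', lam=1/30.) + 1

    # One location parameter per isolate, with an improper flat prior.
    fold = pm.Flat('fold', shape=len(le.classes_))

    # One scale parameter per isolate. Despite its name, `var` is passed
    # to the likelihood as a standard deviation, not a variance.
    var = pm.HalfCauchy('var', beta=1, shape=len(le.classes_))

    # Map each row to the parameters of its isolate.
    mu = fold[df['indices'].values]
    sd = var[df['indices'].values]

    like = pm.StudentT('like', mu=mu, sd=sd, nu=nu,
                       observed=df['normalized_measurement'])

    # Differences relative to the first encoded class, le.classes_[0].
    diffs = pm.Deterministic('differences', fold - fold[0])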

In [7]:
with best:
    trace = pm.sample(draws=2000)

Auto-assigning NUTS sampler...
Initializing NUTS using ADVI...
Average Loss = 66.017:  12%|█▏        | 23227/200000 [00:21<02:39, 1110.14it/s]
Convergence archived at 23300
Interrupted at 23,300 [11%]: Average Loss = 127.93


ValueError: ('The following error happened while compiling the node', GpuElemwise{Composite{Switch(Identity((GT(Composite{inv(sqr(i0))}(i0), i1) * i2 * GT(i0, i3))), (((i4 + (i5 * log(((i6 * Composite{inv(sqr(i0))}(i0)) / i7)))) - i8) - (i9 * i10 * log1p(((Composite{inv(sqr(i0))}(i0) * sqr((i11 - i12))) / i7)))), i13)}}[(0, 0)]<gpuarray>(GpuAdvancedSubtensor1.0, GpuArrayConstant{[0]}, GpuElemwise{gt,no_inplace}.0, GpuArrayConstant{[0]}, GpuElemwise{Composite{scalar_gammaln((i0 * i1))}}[]<gpuarray>.0, GpuArrayConstant{[ 0.5]}, GpuArrayConstant{[ 0.31830989]}, GpuElemwise{add,no_inplace}.0, GpuElemwise{Composite{scalar_gammaln((i0 * i1))}}[]<gpuarray>.0, GpuArrayConstant{[ 0.5]}, GpuElemwise{Add}[(0, 1)]<gpuarray>.0, GpuArrayConstant{[ 0.68763558  1.25433526  2.5758427   1.75953566  8.33558559  1.69043152
  1.54264973  1.30933852  2.23259762  2.35177866  1.42752294  1.4496124
  1.6546875   3.82629108  1.96014493  1.89095128  0.98514852  1.32915921
  2.04116223  1.32997118  6.39321357  1.68515742  1.51377634  1.30538922
  1.81346154  2.75687104  2.462       1.93927894  3.11494253  2.328125
  2.55503513  0.96802326  0.93926247  1.39884393  1.64325843  1.69983416
  8.20045045  1.7467167   1.34845735  1.35603113  2.14091681  1.80632411
  1.18715596  1.43992248  2.1015625   4.21830986  1.86594203  1.71229698
  0.93729373  1.29338104  1.8377724   1.44956772  6.30339321  1.6071964
  1.39708266  1.44161677  1.83076923  2.32558139  2.028       1.75521822
  2.3467433   2.53645833  2.90398127  1.15116279  0.80043384  1.30057804
  1.60674157  1.55058043  8.03828829  1.64727955  1.30671506  1.42217899
  1.91171477  1.77470356  1.24220183  1.60658915  2.1515625   2.63615024
  2.26992754  2.07888631  1.25082508  1.02862254  1.43341404  1.5778098
  7.1736527   1.35082459  1.35170178  1.62724551  2.08269231  2.26215645
  1.692       1.58633776  2.5210728   3.40104167  2.51288056  1.63081395]}, GpuAdvancedSubtensor1.0, GpuArrayConstant{[-inf]}), '\n', 'Cannot compute test value: input 0 (<float64>) of Op Composite{inv(sqr(i0))}(<float64>) missing default value. ')

In [None]:
pm.forestplot(trace, varnames=['fold'], ylabels=le.classes_)

In [None]:
pm.forestplot(trace, varnames=['differences'], ylabels=le.classes_)