## Problem 6.1: Modeling and parameter estimation for Boolean data

For the purposes of this problem, assume that we can pool the results from the three years to have 13/126 reversals for wild type, 39/124 reversals for ASH, and 91/124 reversals for AVA.

Our goal is to estimate θ, the probability of reversal for each strain. That is to say, we want to compute g(θ∣n,N), where n is the number of reversals in N trials.

**a)** Develop a generative model (that is, specify the joint distribution π(n,θ∣N)=f(n∣θ,N)g(θ)) for the observed reversals. Be sure to do prior predictive checks and justify why you chose the model you did. Biological hint: C. elegans have no mode of sensing light at all. So, a wild type worm without any Channelrhodopsin has no means of detecting light. Modeling hint: The Beta distribution is very useful for modeling probabilities of probabilities, like θ in this problem.

In [1]:
import numpy as np
import pandas as pd
import scipy.special
import scipy.stats as st

import tqdm

import bebi103

import altair as alt
import bokeh.io
import bokeh.plotting
bokeh.io.output_notebook()

We model the number of reversals n in N trials with a binomial likelihood and place a Beta prior on θ. For the wild type, the prior should be peaked near 0 and drop off quickly, since these worms cannot sense light. For ASH, the prior should be peaked somewhere around 0.3, and for AVA it should be peaked close to 0.75.
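
Written out, the model described above takes the following form. This is only a restatement in LaTeX, with α and β denoting the shape parameters of the Beta prior, which are chosen separately for each strain below.

```latex
\begin{align}
\theta &\sim \text{Beta}(\alpha, \beta), \\
n \mid \theta, N &\sim \text{Binom}(N, \theta), \\
\pi(n, \theta \mid N) &= f(n \mid \theta, N)\, g(\theta)
  = \binom{N}{n}\, \theta^{n} (1-\theta)^{N-n}\,
    \frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{B(\alpha, \beta)}.
\end{align}
```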

In [90]:
# WT
alpha, beta = 0.2, 8.
# Start just above 0, where this prior's density diverges (alpha < 1)
theta = np.linspace(1e-3, 1, 200)
p = bokeh.plotting.figure(width=300, height=200, 
                          x_axis_label='theta', 
                          y_axis_label='g(theta)')
p.line(theta, st.beta.pdf(theta, alpha, beta), line_width=2)
bokeh.io.show(p)

In [92]:
# ASH
alpha, beta = 5.5, 18.
theta = np.linspace(0,1)
p = bokeh.plotting.figure(width=300, height=200, 
                          x_axis_label='theta', 
                          y_axis_label='g(theta)')
p.line(theta, st.beta.pdf(theta, alpha, beta), line_width=2)
bokeh.io.show(p)

In [4]:
# AVA
alpha, beta = 8., 3.
theta = np.linspace(0,1)
p = bokeh.plotting.figure(width=300, height=200, 
                          x_axis_label='theta', 
                          y_axis_label='g(theta)')
p.line(theta, st.beta.pdf(theta, alpha, beta), line_width=2)
bokeh.io.show(p)
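
As a numerical complement to the three prior plots, the sketch below summarizes each prior by its mean and central 95% interval. The `priors` and `summary` dictionaries are names introduced here for illustration; the (alpha, beta) values are the ones used in the cells above.

```python
import scipy.stats as st

# Prior shape parameters (alpha, beta) taken from the plotting cells above
priors = {"WT": (0.2, 8.0), "ASH": (5.5, 18.0), "AVA": (8.0, 3.0)}

summary = {}
for strain, (a, b) in priors.items():
    dist = st.beta(a, b)

    # Central 95% interval of the prior on theta
    lo, hi = dist.ppf([0.025, 0.975])

    summary[strain] = {"mean": dist.mean(), "lo": lo, "hi": hi}
    print(f"{strain}: mean = {dist.mean():.3f}, "
          f"95% interval = [{lo:.3f}, {hi:.3f}]")
```

If an interval is much wider or narrower than what we believe before seeing data, the shape parameters can be adjusted before running the prior predictive checks.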

### Wild type

In [52]:
n_ppc_samples = 1000
N = 126

# Draw parameters out of the prior
alpha, beta = 0.2, 8.
theta = np.random.beta(alpha, beta, size=n_ppc_samples)

# Draw data sets out of the likelihood for each set of prior params
ell_wt = np.array([np.random.binomial(N, t, size=500) for t in theta])
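
The list comprehension above can also be written as a single broadcast call. The following is a minimal sketch, assuming NumPy's `Generator` interface (`np.random.default_rng`) in place of the legacy `np.random` functions; the seed and the name `n_draws` are arbitrary choices made here, not part of the original analysis.

```python
import numpy as np

rng = np.random.default_rng(seed=3252)  # arbitrary seed, for reproducibility

n_ppc_samples = 1000   # number of theta values drawn from the prior
n_draws = 500          # number of binomial draws per theta (hypothetical name)
N = 126                # wild type trials
alpha, beta = 0.2, 8.0

# Draw theta from the prior
theta = rng.beta(alpha, beta, size=n_ppc_samples)

# Broadcast theta down the rows: row i holds n_draws counts for theta[i]
ell = rng.binomial(N, theta[:, np.newaxis], size=(n_ppc_samples, n_draws))

print(ell.shape)
```

The resulting array has the same layout as `ell_wt`, so it could be passed to the plotting code unchanged.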

In [53]:
p = bebi103.viz.ecdf(ell_wt[0], 
                     x_axis_label='number of reversals', 
                     alpha=0.01, 
                     line_alpha=0)
for ell_vals in ell_wt[9::10]:
    p = bebi103.viz.ecdf(ell_vals, alpha=0.02, p=p, line_alpha=0)

bokeh.io.show(p)


In [54]:
data = np.hstack((np.expand_dims(theta, 1), ell_wt))
columns = ['theta'] + [f'ell[{i+1}]' for i in range(len(ell_wt[0]))]

# Make data frame to match output of Stan
df_ppc = pd.DataFrame(data=data, columns=columns)
df_ppc['warmup'] = 0
df_ppc['chain'] = 0
df_ppc['chain_idx'] = np.arange(1, n_ppc_samples+1)

bokeh.io.show(
    bebi103.viz.predictive_ecdf(df_ppc, 
                                'ell', 
                                x_axis_label='number of reversals'))

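This prior places most of its mass on very few reversals: its mean is θ ≈ 0.02, or about 3 reversals in 126 trials, which is consistent with wild type worms being unable to sense light. Whether it is too restrictive remains open for discussion.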

### ASH

In [93]:
n_ppc_samples = 1000
N = 124

# Draw parameters out of the prior
alpha, beta = 5.5, 18.
theta = np.random.beta(alpha, beta, size=n_ppc_samples)

# Draw data sets out of the likelihood for each set of prior params
ell_ash = np.array([np.random.binomial(N, t, size=500) for t in theta])

p = bebi103.viz.ecdf(ell_ash[0], 
                     x_axis_label='number of reversals', 
                     alpha=0.01, 
                     line_alpha=0)
for ell_vals in ell_ash[9::10]:
    p = bebi103.viz.ecdf(ell_vals, alpha=0.02, p=p, line_alpha=0)

bokeh.io.show(p)

In [94]:
data = np.hstack((np.expand_dims(theta, 1), ell_ash))
columns = ['theta'] + [f'ell[{i+1}]' for i in range(len(ell_ash[0]))]

# Make data frame to match output of Stan
df_ppc = pd.DataFrame(data=data, columns=columns)
df_ppc['warmup'] = 0
df_ppc['chain'] = 0
df_ppc['chain_idx'] = np.arange(1, n_ppc_samples+1)

bokeh.io.show(
    bebi103.viz.predictive_ecdf(df_ppc, 
                                'ell', 
                                x_axis_label='number of reversals'))

### AVA

In [10]:
n_ppc_samples = 1000
N = 124

# Draw parameters out of the prior
alpha, beta = 8., 3.
theta = np.random.beta(alpha, beta, size=n_ppc_samples)

# Draw data sets out of the likelihood for each set of prior params
ell_ava = np.array([np.random.binomial(N, t, size=500) for t in theta])

p = bebi103.viz.ecdf(ell_ava[0], 
                     x_axis_label='number of reversals', 
                     alpha=0.01, 
                     line_alpha=0)
for ell_vals in ell_ava[9::10]:
    p = bebi103.viz.ecdf(ell_vals, alpha=0.02, p=p, line_alpha=0)

bokeh.io.show(p)

The prior predictive distribution looks somewhat broad, but mostly reasonable.

In [11]:
data = np.hstack((np.expand_dims(theta, 1), ell_ava))
columns = ['theta'] + [f'ell[{i+1}]' for i in range(len(ell_ava[0]))]

# Make data frame to match output of Stan
df_ppc = pd.DataFrame(data=data, columns=columns)
df_ppc['warmup'] = 0
df_ppc['chain'] = 0
df_ppc['chain_idx'] = np.arange(1, n_ppc_samples+1)

bokeh.io.show(
    bebi103.viz.predictive_ecdf(df_ppc, 
                                'ell', 
                                x_axis_label='number of reversals'))

## Plot the posterior

**b)** Plot the posterior probability density function for each of the three strains. What can you conclude from this?

In [78]:
def log_post_indep_size(params, ell):
    """Unnormalized log posterior for theta, given ell reversals.

    params = (alpha, beta, theta, N): Beta prior shape parameters,
    reversal probability, and number of trials.
    """
    # Unpack parameters
    alpha, beta, theta, N = params

    # Make sure parameters are physical
    if (params < 0).any() or theta > 1:
        return -np.inf

    # Log prior for theta
    log_prior = st.beta.logpdf(theta, alpha, beta)

    # Log likelihood
    log_likelihood = np.sum(st.binom.logpmf(ell, N, theta))

    return log_prior + log_likelihood

def post_indep_size(params, ell):
    """Unnormalized posterior for theta, given ell reversals.

    params = (alpha, beta, theta, N), as in log_post_indep_size().
    """
    # Unpack parameters
    alpha, beta, theta, N = params

    # Unphysical parameters have zero probability (not -inf, which
    # is only appropriate on the log scale)
    if (params < 0).any() or theta > 1:
        return 0.0

    # Prior for theta
    prior = st.beta.pdf(theta, alpha, beta)

    # Likelihood
    likelihood = np.prod(st.binom.pmf(ell, N, theta))

    return prior * likelihood
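
One way to turn a log posterior into a plottable density is to evaluate it on a grid that excludes θ = 0 and θ = 1, subtract the maximum before exponentiating, and then normalize numerically. The sketch below illustrates this for the wild type counts (13 of 126) from the problem statement; `log_post_grid` is a helper name introduced here, and the trapezoidal normalization is approximate because the grid stops just short of the endpoints.

```python
import numpy as np
import scipy.stats as st

def log_post_grid(theta, n, N, alpha, beta):
    """Unnormalized log posterior on an array of theta values."""
    return st.beta.logpdf(theta, alpha, beta) + st.binom.logpmf(n, N, theta)

# Interior grid: drop theta = 0 and theta = 1, where the logs are infinite
theta = np.linspace(0.0, 1.0, 402)[1:-1]

# Wild type: 13 reversals in 126 trials, Beta(0.2, 8) prior
log_post = log_post_grid(theta, n=13, N=126, alpha=0.2, beta=8.0)

# Subtract the maximum before exponentiating to avoid underflow
post = np.exp(log_post - log_post.max())

# Normalize with the trapezoidal rule so the curve integrates to about 1
area = np.sum(0.5 * (post[1:] + post[:-1]) * np.diff(theta))
post /= area

print(f"posterior mode near theta = {theta[np.argmax(post)]:.3f}")
```

Working with the whole array at once also removes the need for an explicit loop over θ.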

### WT

Let's first compute the posterior.

In [86]:
# Set up plotting range
theta = np.linspace(0.0, 1.0, 200)

# Set up distribution variables
# Wild type has 13 reversals in 126 trials
alpha, beta, N, n = 0.2, 8., 126., 13.

# Compute posterior
POST = np.empty_like(theta)

for i in tqdm.tqdm(range(len(theta))):
    POST[i] = post_indep_size(np.array([alpha, beta, theta[i], N]), n)


Now let's plot the posterior.

In [87]:
p = bokeh.plotting.figure(width=300, height=200, 
                          x_axis_label='theta', 
                          y_axis_label='posterior')
p.line(theta, POST, line_width=2)
bokeh.io.show(p)

### ASH

In [95]:
# Set up plotting range
theta = np.linspace(0.0, 1.0, 200)

# Set up distribution variables
alpha, beta, N, n = 5.5, 18., 124., 39

# Compute posterior
POST = np.empty_like(theta)

for i in tqdm.tqdm(range(len(theta))):
    POST[i] = post_indep_size(np.array([alpha, beta, theta[i], N]), n)
        
p = bokeh.plotting.figure(width=300, height=200, 
                          x_axis_label='theta', 
                          y_axis_label='posterior')
p.line(theta, POST, line_width=2)
bokeh.io.show(p)


### AVA

In [88]:
# Set up plotting range
theta = np.linspace(0.0, 1.0, 200)

# Set up distribution variables
alpha, beta, N, n = 8., 3., 124., 91

# Compute posterior
POST = np.empty_like(theta)

for i in tqdm.tqdm(range(len(theta))):
    POST[i] = post_indep_size(np.array([alpha, beta, theta[i], N]), n)
        
# Normalize
# Scale so the peak value is 1 (this is not a normalized density)
POST = POST / POST.max()

p = bokeh.plotting.figure(width=300, height=200, 
                          x_axis_label='theta', 
                          y_axis_label='posterior')

p.line(theta, POST, line_width=2)
bokeh.io.show(p)


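Note: the same calculation using the log posterior (`log_post_indep_size`) did not work, and we have not tracked down why. One possible cause is the grid endpoints θ = 0 and θ = 1, where the log prior or the log likelihood is infinite.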
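
Because the Beta prior is conjugate to the binomial likelihood, the posterior is available in closed form as Beta(α + n, β + N − n). The sketch below uses that standard result as a cross-check on the grid calculations above and summarizes each strain with a posterior mean and a central 95% credible interval. The dictionary names are introduced here for illustration, and the wild type entry uses N = 126 from the problem statement.

```python
import scipy.stats as st

# (alpha, beta, n reversals, N trials) for each strain
strains = {
    "WT": (0.2, 8.0, 13, 126),
    "ASH": (5.5, 18.0, 39, 124),
    "AVA": (8.0, 3.0, 91, 124),
}

posteriors = {}
for strain, (a, b, n, N) in strains.items():
    # Conjugate update: Beta prior + binomial likelihood -> Beta posterior
    post = st.beta(a + n, b + N - n)
    posteriors[strain] = post

    lo, hi = post.ppf([0.025, 0.975])
    print(f"{strain}: posterior mean = {post.mean():.3f}, "
          f"95% credible interval = [{lo:.3f}, {hi:.3f}]")
```

Plotting `posteriors[strain].pdf(theta)` on the same axes as the grid-based curves, after normalizing the latter, gives a quick visual check that the two approaches agree.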