## Problem 9.1: Caulobacter growth: exponential or linear

In [2]:
import pandas as pd
import numpy as np

import numba

import bebi103

import altair as alt
import altair_catplot as altcat

import bokeh.io
import bokeh.layouts
import bokeh.plotting
bokeh.io.output_notebook()

color_palette=['#4e79a7', '#f28e2b', '#e15759', '#76b7b2', '#59a14f', '#edc948', '#b07aa1', '#ff9da7', '#9c755f', '#bab0ac']

We would like to determine whether *Caulobacter* growth is better described by an exponential or a linear model. We will use hierarchical models to estimate the parameters of each growth model and then compare the two.

The exponential model is:
\begin{align}
a(t) = a_0 \mathrm{e}^{k t},
\end{align}

where $a(t)$ is the area of the cell in the image as a function of time and $a_0$ is the area of the cell right after a division has been completed, which we mark as $t = 0$.

The linear growth model is:
\begin{align}
a(t) = a_0 + k t.
\end{align}
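To make the two candidate models concrete, here is a minimal NumPy sketch of both growth laws. The function names `area_exponential` and `area_linear` and the parameter values are illustrative assumptions, not part of the analysis.

```python
import numpy as np


def area_exponential(t, a0, k):
    """Exponential growth: a(t) = a0 * exp(k * t)."""
    return a0 * np.exp(k * np.asarray(t, dtype=float))


def area_linear(t, a0, k):
    """Linear growth: a(t) = a0 + k * t."""
    return a0 + k * np.asarray(t, dtype=float)


# Illustrative values only: a0 in square microns, t in minutes
t = np.arange(0, 100)
a_exp = area_exponential(t, a0=1.2, k=0.01)
a_lin = area_linear(t, a0=1.2, k=0.01)
```

Both curves start at $a_0$ when $t = 0$. Note that $k$ carries different units in the two models: inverse minutes in the exponential model and area per minute in the linear model.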

First we will read in the data for the growth and division events for the bacteria.

In [3]:
df = pd.read_csv('../data/hw_4.2_caulobacter_growth_image_processing_results.csv')

df.head()

Unnamed: 0,time (min),area (sq um),growth_event,bacterium
0,1.0,1.300624,0,1
1,2.0,1.314144,0,1
2,3.0,1.295216,0,1
3,4.0,1.314144,0,1
4,5.0,1.341184,0,1


Since the duration of each division cycle differs between growth events, we add a column that restarts the time count from 0 at every division.

In [4]:
time = []
j = 0

for i in df['growth_event'].diff():
    if i == 0:
        j += 1
        time.append(j)
    else:
        j = 0
        time.append(j)

df['t'] = time
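The same per-division clock could also be built without an explicit loop. This is a minimal sketch, assuming one row per frame, rows in time order, and that each (bacterium, growth_event) pair identifies one division cycle. `df_toy` is a made-up stand-in for the real data frame.

```python
import pandas as pd

# Toy frame standing in for the real data (values are made up)
df_toy = pd.DataFrame({
    'bacterium':    [1, 1, 1, 1, 1, 2, 2],
    'growth_event': [0, 0, 0, 1, 1, 0, 0],
})

# Number the frames within each (bacterium, growth_event) group from 0
df_toy['t'] = df_toy.groupby(['bacterium', 'growth_event']).cumcount()
```

Like the loop above, this counts frames rather than minutes, so the two agree only when frames are one minute apart.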

For later convenience in sampling, we will rename the 'area (sq um)' column to 'area'.

In [5]:
# Rename for convenience
df = df.rename(columns={'area (sq um)': 'area'})

# Take a look
df.head()

Unnamed: 0,time (min),area,growth_event,bacterium,t
0,1.0,1.300624,0,1,0
1,2.0,1.314144,0,1,1
2,3.0,1.295216,0,1,2
3,4.0,1.314144,0,1,3
4,5.0,1.341184,0,1,4


Let's start with a subset of the data to check our models before using the whole dataset. We will look at bacterium 1 and plot its growth curves.

In [9]:
df_bacterium1 = df.loc[df['bacterium'] == 1]

p = bokeh.plotting.figure(plot_width=650,
                          plot_height=250,
                          x_axis_label='time (min)',
                          y_axis_label='cell area (sq µm)')

# Specify the glyphs
colors = ['#1f78b4', '#a6cee3']
for i, g in df_bacterium1.groupby('growth_event'):
    p.circle(g['time (min)'], g['area'], size=3, color=colors[i%2])

bokeh.io.show(p)

#### No hierarchy

Let's first model a single growth event with no hierarchy using the exponential model. First we will perform prior predictive checks. We choose as our priors

\begin{gather}
a_0 \sim \mbox{Norm}(1.2,0.4) \\
k \sim \mbox{Norm}(0.01,0.003)\\
\sigma \sim \mbox{HalfNorm}(0.1)
\end{gather}

We estimate that the area of a bacterium just after division is slightly more than 1 µm², so we center $a_0$ at 1.2. $k$ is the growth rate, in units of inverse minutes. **explain how we chose k and sigma**

First we write a function to take in the time values for the bacterium growth curve and sample parameter values according to our prior.

In [11]:
def data_prior_pred(t):
    '''
    Samples parameter values according to the prior and generates
    data y at the values given in t.
    '''
    # Sample parameter values according to priors
    a = np.random.normal(1.2, 0.4)
    k = np.random.normal(0.01, 0.003)
    sigma = np.abs(np.random.normal(0, 0.1))
    
    # Generate random data according to the likelihood
    return np.random.normal(a * np.exp(k * t), sigma)

Let's now use this function to plot simulated data for our prior predictive check.

In [12]:
p = bokeh.plotting.figure(height=300, width=450,
                          x_axis_label='time',
                          y_axis_label='area')

t = df_bacterium1.loc[df_bacterium1['growth_event'] == 0, 't'].values

# Plot simulated data
for i in range(100):
    p.circle(t, data_prior_pred(t), size=3, alpha=0.1)

bokeh.io.show(p)

This looks reasonable. The area may go a bit too high in some draws, but the curves cover a wide range, so this works as a broad prior.
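To put rough numbers on how high the simulated areas go, the prior predictive draws can be summarized by percentiles at each time point. This is a minimal sketch using the priors stated above; the function name `prior_pred_percentiles`, the number of draws, and the seed are arbitrary choices for illustration.

```python
import numpy as np


def prior_pred_percentiles(t, n_draws=1000, percentiles=(2.5, 50, 97.5), seed=0):
    """Percentiles of prior predictive areas at each time point
    for the exponential model with the priors stated above."""
    rng = np.random.default_rng(seed)
    t = np.asarray(t, dtype=float)

    # One parameter set per draw; shape (n_draws, 1) broadcasts against t
    a0 = rng.normal(1.2, 0.4, size=(n_draws, 1))
    k = rng.normal(0.01, 0.003, size=(n_draws, 1))
    sigma = np.abs(rng.normal(0, 0.1, size=(n_draws, 1)))

    # Simulated data sets, shape (n_draws, len(t))
    area = rng.normal(a0 * np.exp(k * t), sigma)

    return np.percentile(area, percentiles, axis=0)


t = np.arange(100)
low, median, high = prior_pred_percentiles(t)
```

Comparing `high` at the final time point with the largest areas in the measured data gives a quick sense of whether the prior is broader than it needs to be.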

Now let's do the same for the linear model. We choose as our priors

\begin{gather}
a_0 \sim \mbox{Norm}(1.2,0.4) \\
b \sim \mbox{Norm}(0.01,0.003)\\
\sigma \sim \mbox{HalfNorm}(0.1)
\end{gather}

As before, we estimate that the area of a bacterium just after division is slightly more than 1 µm², so we center $a_0$ at 1.2. $b$ is the slope of the linear model (written as $k$ in the linear growth equation above), in units of area per minute. **explain how we chose b and sigma**

Let's modify our prior predictive function for the linear model.

In [13]:
def data_prior_pred_linear(t):
    '''
    Samples parameter values according to the prior and generates
    data y at the values given in t.
    '''
    # Sample parameter values according to priors
    a = np.random.normal(1.2, 0.4)
    b = np.random.normal(0.01, 0.003)
    sigma = np.abs(np.random.normal(0, 0.1))
    
    # Generate random data according to the likelihood
    return np.random.normal(a + b * t, sigma)

Now we can plot simulated data according to the linear model.

In [14]:
p = bokeh.plotting.figure(height=300, width=450,
                          x_axis_label='time',
                          y_axis_label='area')

t = df_bacterium1.loc[df_bacterium1['growth_event'] == 0, 't'].values

# Plot simulated data
for i in range(100):
    p.circle(t, data_prior_pred_linear(t), size=3, alpha=0.1)

bokeh.io.show(p)

This is also pretty broad and seems reasonable.

#### One level hierarchical model

Let's now try a one-level hierarchical model.

**insert prior predictive checks**

We will begin with the linear model, which is as follows:

\begin{gather}
a \sim \mbox{Norm}(\mu_a,\sigma_a) \\
k \sim \mbox{Norm}(\mu_k,\sigma_k) \\
\sigma \sim \mbox{HalfNorm}(\sigma_{hyper}) \\
a_1 \sim \mbox{Norm}(a, \tau_a) \\
k_1 \sim \mbox{Norm}(k, \tau_k) \\
\mathrm{area} \sim \mbox{Norm}(a_1 + k_1 t, \sigma)
\end{gather}


We will start with a noncentered model.

In [15]:
model_code_linear_noncentered = """
data {
  // Total number of data points
  int N;
  
  // Number of entries in each level of the hierarchy
  int J_1;

  //Index arrays to keep track of hierarchical structure
  int index_1[N];
  
  // The measurements
  real area[N];
  
  // Time
  vector[N] t;
}

parameters {
  // Hyperparameters level 0
  real a;
  real k;
  real<lower=0> sigma;

  // How hyperparameters vary
  real<lower=0> tau_a;
  real<lower=0> tau_k;

  // Hyperparameters level 1
  vector[J_1] a_1_tilde;
  vector[J_1] k_1_tilde;
}

transformed parameters {
  // Transformations for noncentered
  vector[J_1] a_1 = a + tau_a * a_1_tilde;
  vector[J_1] k_1 = k + tau_k * k_1_tilde;
  vector[N] area_temp;
  
  for (i in 1:N) {
    area_temp[i] = a_1[index_1[i]] + k_1[index_1[i]] * t[i];
  }
}

model {
  a ~ normal(1.4, 0.3);
  k ~ normal(0.01, 0.002);
  sigma ~ normal(0, 0.1);
  tau_a ~ normal(0, 0.1);
  tau_k ~ normal(0, 0.001);

  a_1_tilde ~ normal(0, 1);
  k_1_tilde ~ normal(0, 1);

  area ~ normal(area_temp, sigma);
}

generated quantities {
  vector[N] area_ppc;
  real log_lik[N];
  
  for (i in 1:N) {
    area_ppc[i] = normal_rng(area_temp[i], sigma);
  }
  
  // Compute pointwise log likelihood
  for (i in 1:N) {
    log_lik[i] = normal_lpdf(area[i] | area_temp[i], sigma);
  }
}
"""
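A prior predictive draw for this hierarchical model can be sketched in NumPy by mirroring the noncentered transformation in the `transformed parameters` block. This is only a sketch under the prior values written in the `model` block; the function name `hier_linear_prior_pred` and the toy time and index arrays are assumptions for illustration.

```python
import numpy as np


def hier_linear_prior_pred(t, index_1, n_events, rng):
    """One prior predictive data set for the one-level linear model.
    index_1 holds 1-based growth event labels, as in the Stan code."""
    t = np.asarray(t, dtype=float)
    idx = np.asarray(index_1) - 1  # NumPy is 0-based

    # Hyperparameters, using the prior values from the model block
    a = rng.normal(1.4, 0.3)
    k = rng.normal(0.01, 0.002)
    sigma = np.abs(rng.normal(0, 0.1))
    tau_a = np.abs(rng.normal(0, 0.1))
    tau_k = np.abs(rng.normal(0, 0.001))

    # Noncentered: draw standard normals, then shift and scale
    a_1 = a + tau_a * rng.normal(0, 1, size=n_events)
    k_1 = k + tau_k * rng.normal(0, 1, size=n_events)

    return rng.normal(a_1[idx] + k_1[idx] * t, sigma)


# Toy design: two growth events of 50 and 60 frames
rng = np.random.default_rng(3252)
t = np.concatenate([np.arange(50), np.arange(60)])
index_1 = np.concatenate([np.full(50, 1), np.full(60, 2)])
area_sim = hier_linear_prior_pred(t, index_1, n_events=2, rng=rng)
```

Drawing `a_1_tilde` from a standard normal and then computing `a + tau_a * a_1_tilde` gives the same distribution as drawing `a_1` from a normal centered at `a` with scale `tau_a`; the noncentered form is used because it tends to be easier for the sampler.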

**Change this to stan file later**

In [19]:
sm_noncentered = bebi103.stan.StanModel(model_code=model_code_linear_noncentered)

Using cached StanModel.


Let's start by modeling two growth events from bacterium 1 to make sure our model works and makes sense. 

In [20]:
# Choose a subset of data
df_sub1 = df_bacterium1.loc[df_bacterium1['growth_event'] == 1]
df_sub2 = df_bacterium1.loc[df_bacterium1['growth_event'] == 2]
df_sub = pd.concat([df_sub1, df_sub2])

df_sub.head()

Unnamed: 0,time (min),area,growth_event,bacterium,t
98,99.0,1.403376,1,1,0
99,100.0,1.400672,1,1,1
100,101.0,1.373632,1,1,2
101,102.0,1.40608,1,1,3
102,103.0,1.362816,1,1,4


Let's put this data into the dictionary form that Stan reads for sampling.

In [27]:
data = dict(N=len(df_sub),
            J_1=2,
            index_1=df_sub['growth_event'].values,
            area=df_sub['area'].values,
            t=df_sub['t'].values)
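Stan arrays are 1-indexed, so `index_1` must run from 1 to `J_1`. Passing `growth_event` directly works here only because the chosen events happen to be labeled 1 and 2. For an arbitrary subset, a relabeling along these lines may be safer (the labels below are made up for illustration):

```python
import numpy as np
import pandas as pd

# Made-up growth event labels that are not 1, 2, 3, ...
growth_event = pd.Series([4, 4, 7, 7, 7, 9])

# factorize returns 0-based codes; Stan wants 1-based
codes, uniques = pd.factorize(growth_event, sort=True)
index_1 = codes + 1
J_1 = len(uniques)
```

Keeping `uniques` around makes it possible to map the Stan indices back to the original growth event labels when plotting.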

Now let's do sampling and check diagnostics.

In [25]:
# Sample
samples_linear = sm_noncentered.sampling(data=data, 
                                         seed=2389412, 
                                         control=dict(adapt_delta=0.99, max_treedepth=11))

# Convert to data frame for easy use later
df_linear = bebi103.stan.to_dataframe(samples_linear)

bebi103.stan.check_all_diagnostics(samples_linear)

n_eff / iter looks reasonable for all parameters.
Rhat looks reasonable for all parameters.
0 of 4000 (0.0%) iterations ended with a divergence.
0 of 4000 (0.0%) iterations saturated the maximum tree depth of 11.
E-BFMI indicated no pathological behavior.


0

We will check the corner plots to look at the parameters for a and k.

In [29]:
bokeh.io.show(bebi103.viz.corner(samples_linear, pars=['a', 'k']))


We can also plot the marginalized distributions for both parameters to look at the parameter values.

In [None]:
# Marginalized distributions of each parameter 
plots = [bebi103.viz.ecdf(df_linear[param], x_axis_label=param, plot_height=200, plot_width=250) 
                 for param in ['a', 'k']]
bokeh.io.show(bokeh.layouts.gridplot(plots, ncols=3))

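We'd like to perform posterior predictive checks to make sure our model makes sense. Below is a function, modeled on a predictive ECDF plot, that plots the median and percentile envelopes of the posterior predictive samples at each data point.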

In [29]:
def hw92_predictive(df, x, y=None, namex='index_1', name='F_ppc', perc=[80, 60, 40, 20], 
                    x_axis_label=None, y_axis_label=None, title=None, plot_width=350, plot_height=225, 
                    color='blue', data_color=color_palette[1], diff=False, p=None):
    '''Mimic of a predictive ECDF plot.
    df - tidy data frame of MCMC samples
    x - input variable (x-values at which to plot)
    y - measured data; plotted over the envelopes if given
    namex - name of the column in df indexing the input variable
    name - name of the column in df holding the predictive samples
    perc - list, default [80, 60, 40, 20]
            Percentiles for making colored envelopes for confidence
            intervals for the predictive samples. Maximally four can be
            specified.
    diff - if True, plot differences between successive points
    p - existing figure to add to; a new one is made if None'''
    
    if color not in ['green', 'blue', 'red', 'gray',
                     'purple', 'orange', 'betancourt']:
        raise RuntimeError("Only allowed colors are 'green', 'blue', 'red', 'gray', "
                           "'purple', 'orange', 'betancourt'")
    
    colors = {'blue': ['#9ecae1','#6baed6','#4292c6','#2171b5','#084594'],
              'green': ['#a1d99b','#74c476','#41ab5d','#238b45','#005a32'],
              'red': ['#fc9272','#fb6a4a','#ef3b2c','#cb181d','#99000d'],
              'orange': ['#fdae6b','#fd8d3c','#f16913','#d94801','#8c2d04'],
              'purple': ['#bcbddc','#9e9ac8','#807dba','#6a51a3','#4a1486'],
              'gray': ['#bdbdbd','#969696','#737373','#525252','#252525'],
              'betancourt': ['#DCBCBC', '#C79999', '#B97C7C',
                             '#A25050', '#8F2727', '#7C0000']}
    if p is None:
        p = bokeh.plotting.figure(plot_width=plot_width,
                                  plot_height=plot_height,
                                  x_axis_label=x_axis_label,
                                  y_axis_label=y_axis_label,
                                  title=title)
    
    if diff:
        x = x[1:]
        if y is not None:
            y = np.diff(y)
        Nb = len(x)
        y_ppc = np.empty((len(perc) * 2 + 1, Nb))
        for i in range(Nb):
            temp = df.loc[df[namex]== i+2, name].values - df.loc[df[namex]== i+1, name].values
            y_ppc[-1, i] = np.median(temp)
            for j in range(len(perc)):
                y_ppc[j * 2, i] = np.percentile(temp, 50 - perc[j] / 2)
                y_ppc[j * 2 + 1, i] = np.percentile(temp, 50 + perc[j] / 2)
    else:                
        Nb = len(x)
        y_ppc = np.empty((len(perc) * 2 + 1, Nb))
        for i in range(Nb):
            temp = df.loc[df[namex]== i+1, name].values
            y_ppc[-1, i] = np.median(temp)
            for j in range(len(perc)):
                y_ppc[j * 2, i] = np.percentile(temp, 50 - perc[j] / 2)
                y_ppc[j * 2 + 1, i] = np.percentile(temp, 50 + perc[j] / 2)
    
    for j in range(len(perc)):
        bebi103.viz.fill_between(x, y_ppc[j * 2, :],
                     x, y_ppc[j * 2 + 1,:],
                     p=p,
                     show_line=False,
                     fill_color=colors[color][j])
        
    p.circle(x, y_ppc[-1, :],
           size=4,
           color=colors[color][-1])
    
    if y is not None:
        p.circle(x, y, size=4, color=data_color)
    
    return p
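The core of the function above is the percentile computation. When the posterior predictive samples are available as a plain array rather than a tidy data frame, the same envelopes can be computed in a vectorized way. This is a minimal sketch; the function name `ppc_envelopes` and the fake samples are assumptions for illustration.

```python
import numpy as np


def ppc_envelopes(samples, perc=(80, 60, 40, 20)):
    """Median and central percentile bands for posterior predictive
    samples of shape (n_samples, n_points)."""
    samples = np.asarray(samples)
    median = np.median(samples, axis=0)
    bands = {}
    for p in perc:
        # Central p% interval, e.g. p=80 gives the 10th and 90th percentiles
        lower, upper = np.percentile(samples, [50 - p / 2, 50 + p / 2], axis=0)
        bands[p] = (lower, upper)
    return median, bands


# Fake samples: 4000 draws at 5 data points
rng = np.random.default_rng(0)
fake = rng.normal(loc=[1.0, 1.1, 1.2, 1.3, 1.4], scale=0.05, size=(4000, 5))
median, bands = ppc_envelopes(fake)
```

A measured point that falls outside the widest band at many time points would suggest the model is not capturing the data well.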

Let's plot the original data in orange along with the sampled area values.

In [30]:
time = df_sub['time (min)'].values
val = df_sub['area'].values
df_linear_ppc = bebi103.stan.extract_array(samples_linear, name='area_ppc')

p1 = hw92_predictive(df_linear_ppc, 
                     time, 
                     val, 
                     perc=[99, 75, 50, 25], 
                     name='area_ppc', 
                     plot_width=500, 
                     plot_height=400)

bokeh.io.show(p1)

**comment**

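Now let's try the one-level noncentered exponential model. Note that the Stan code below divides $t$ by 100 in the exponent, so here $k$ is in units of inverse hundreds of minutes and its prior is centered at 1 rather than 0.01. Our model is as follows: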

\begin{gather}
a \sim \mbox{Norm}(\mu_a,\sigma_a) \\
k \sim \mbox{Norm}(\mu_k,\sigma_k) \\
\sigma \sim \mbox{HalfNorm}(\sigma_{hyper}) \\
a_1 \sim \mbox{Norm}(a, \tau_a) \\
k_1 \sim \mbox{Norm}(k, \tau_k) \\
\mathrm{area} \sim \mbox{Norm}(a_1 \mathrm{e}^{k_1 t / 100}, \sigma)
\end{gather}

Our Stan code is below. For now we compile it from this string; we plan to move it to a standalone file.

In [31]:
model_code_exp_noncentered = """
data {
  // Total number of data points
  int N;
  
  // Number of entries in each level of the hierarchy
  int J_1;

  //Index arrays to keep track of hierarchical structure
  int index_1[N];
  
  // The measurements
  real area[N];
  
  // Time
  vector[N] t;
}

parameters {
  // Hyperparameters level 0
  real a;
  real k;
  real<lower=0> sigma;

  // How hyperparameters vary
  real<lower=0> tau_a;
  real<lower=0> tau_k;

  // Hyperparameters level 1
  vector[J_1] a_1_tilde;
  vector[J_1] k_1_tilde;
}

transformed parameters {
  // Transformations for noncentered
  vector[J_1] a_1 = a + tau_a * a_1_tilde;
  vector[J_1] k_1 = k + tau_k * k_1_tilde;
  vector[N] area_temp;
  
  for (i in 1:N) {
    area_temp[i] = a_1[index_1[i]] * exp(k_1[index_1[i]] * t[i] / 100);
  }
}

model {
  a ~ normal(1.4, 0.3);
  k ~ normal(1, 0.2);
  sigma ~ normal(0, 0.1);
  tau_a ~ normal(0, 0.1);
  tau_k ~ normal(0, 0.1);

  a_1_tilde ~ normal(0, 1);
  k_1_tilde ~ normal(0, 1);

  area ~ normal(area_temp, sigma);
}

generated quantities {
  vector[N] area_ppc;
  real log_lik[N];
  
  for (i in 1:N) {
    area_ppc[i] = normal_rng(area_temp[i], sigma);
  }
  
  // Compute pointwise log likelihood
  for (i in 1:N) {
    log_lik[i] = normal_lpdf(area[i] | area_temp[i], sigma);
  }
}
"""

Compile the stan code from standalone file.

**need to make the standalone file and change the code below**

In [32]:
sm_exp = bebi103.stan.StanModel(model_code=model_code_exp_noncentered)

Using cached StanModel.


We will use the same subset of data as before so let's begin sampling, making sure to check diagnostics.

In [33]:
# Sample
samples_exp = sm_exp.sampling(data=data, 
                              seed=2389412, 
                              control=dict(adapt_delta=0.99, max_treedepth=11))

# Convert to data frame for easy use later
df_exp = bebi103.stan.to_dataframe(samples_exp)

bebi103.stan.check_all_diagnostics(samples_exp)



n_eff / iter looks reasonable for all parameters.
Rhat looks reasonable for all parameters.
8 of 4000 (0.2%) iterations ended with a divergence.
  Try running with larger adapt_delta to remove divergences.
56 of 4000 (1.4%) iterations saturated the maximum tree depth of 11.
  Try running again with max_treedepth set to a larger value to avoid saturation.
E-BFMI indicated no pathological behavior.


12

**We probably need to adjust the sampler parameters or the Stan code to address the divergences and the tree depth saturation.**

Let's plot the corner plot to look at the parameters.

In [35]:
bokeh.io.show(bebi103.viz.corner(samples_exp, pars=['a', 'k']))

We can also plot the marginalized distributions for both parameters to look at the parameter values.

In [None]:
# Marginalized distributions of each parameter 
plots = [bebi103.viz.ecdf(df_exp[param], x_axis_label=param, plot_height=200, plot_width=250) 
                 for param in ['a', 'k']]
bokeh.io.show(bokeh.layouts.gridplot(plots, ncols=3))

Now let's perform our posterior predictive check.

In [36]:
time = df_sub['time (min)'].values
val = df_sub['area'].values
df_exp_ppc = bebi103.stan.extract_array(samples_exp, name='area_ppc')

p1 = hw92_predictive(df_exp_ppc, 
                     time, 
                     val, 
                     perc=[99, 75, 50, 25], 
                     name='area_ppc', 
                     plot_width=500, 
                     plot_height=400)

bokeh.io.show(p1)

This looks good. We'd like to compare the two models at this level. We've already calculated the pointwise log likelihood, so let's compute the LOO and check the weights.

In [37]:
bebi103.stan.compare({'linear': samples_linear,
                      'exp': samples_exp},
                     log_likelihood='log_lik',
                     ic='loo')

Unnamed: 0,loo,ploo,dloo,weight,se,dse,warning
exp,-1012.47,5.73056,0.0,1.0,25.8899,0.0,0
linear,-718.811,5.30439,293.661,8.29914e-12,18.2026,14.2837,0


The exponential model has a much greater weight (essentially 1) and a smaller LOO value. In this table a smaller LOO is better, as the zero `dloo` for the top-ranked model indicates, so the exponential model is strongly favored at this level. Let's move on to a two-level hierarchical model, still using a small subset of the data. Once we are sure the model works, we will pass all the data to the sampler.
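The `log_lik` array saved in `generated quantities` is the raw ingredient for LOO. As a rough illustration of what is being compared, the sketch below computes the in-sample log pointwise predictive density from such an array. This is not the LOO estimate itself, which additionally corrects for each point having been used in the fit; the function name `lppd` and the fake array are assumptions for illustration.

```python
import numpy as np
from scipy.special import logsumexp


def lppd(log_lik):
    """Log pointwise predictive density from an array of pointwise
    log likelihoods with shape (n_samples, n_points)."""
    log_lik = np.asarray(log_lik, dtype=float)
    n_samples = log_lik.shape[0]
    # Log of the posterior mean likelihood for each data point
    pointwise = logsumexp(log_lik, axis=0) - np.log(n_samples)
    return pointwise.sum()


# Fake log likelihoods: 4000 samples, 7 data points
fake_log_lik = np.full((4000, 7), -1.5)
total = lppd(fake_log_lik)
```

Working with `logsumexp` rather than exponentiating directly avoids underflow when individual log likelihoods are very negative.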

#### Two level hierarchical model

We will again start by modeling the linear model. We have one layer for each bacterium, and a second layer for each growth event. Our model is as follows:

\begin{gather}
a \sim \mbox{Norm}(\mu_a,\sigma_a) \\
k \sim \mbox{Norm}(\mu_k,\sigma_k) \\
\sigma \sim \mbox{HalfNorm}(\sigma_{hyper}) \\
a_1 \sim \mbox{Norm}(a, \tau_a) \\
k_1 \sim \mbox{Norm}(k, \tau_k) \\
a_2 \sim \mbox{Norm}(a_1, \tau_a) \\
k_2 \sim \mbox{Norm}(k_1, \tau_k) \\
\mathrm{area} \sim \mbox{Norm}(a_2 + k_2 t, \sigma)
\end{gather}

**double check this**

In [40]:
model_code_linear_2 = """
data {
  // Total number of data points
  int N;
  
  // Number of entries in each level of the hierarchy
  int J_1;
  int J_2;
  
  //Index arrays to keep track of hierarchical structure
  int index_1[J_2];
  int index_2[N];
  
  // The measurements
  real area[N];
  
  // Time
  vector[N] t;
}

parameters {
  // Hyperparameters level 0
  real a;
  real k;
  real<lower=0> sigma;

  // How hyperparameters vary
  real<lower=0> tau_a;
  real<lower=0> tau_k;

  // Hyperparameters level 1
  vector[J_1] a_1_tilde;
  vector[J_1] k_1_tilde;
  
  // Hyperparameters level 2
  vector[J_2] a_2_tilde;
  vector[J_2] k_2_tilde;
}

transformed parameters {
  // Transformations for noncentered
  vector[J_1] a_1 = a + tau_a * a_1_tilde;
  vector[J_1] k_1 = k + tau_k * k_1_tilde;
  
  vector[J_2] a_2 = a_1[index_1] + tau_a * a_2_tilde;
  vector[J_2] k_2 = k_1[index_1] + tau_k * k_2_tilde;
  
  vector[N] area_temp;
  
  for (i in 1:N) {
    area_temp[i] = a_2[index_2[i]] + k_2[index_2[i]] * t[i];
  }
}

model {
  a ~ normal(1.4, 0.3);
  k ~ normal(0.01, 0.002);
  sigma ~ normal(0, 0.1);
  tau_a ~ normal(0, 0.1);
  tau_k ~ normal(0, 0.001);

  a_1_tilde ~ normal(0, 1);
  k_1_tilde ~ normal(0, 1);
  
  a_2_tilde ~ normal(0, 1);
  k_2_tilde ~ normal(0, 1);

  area ~ normal(area_temp, sigma);
}

generated quantities {
  vector[N] area_ppc;
  real log_lik[N];
  
  for (i in 1:N) {
    area_ppc[i] = normal_rng(area_temp[i], sigma);
  }
  
  // Compute pointwise log likelihood
  for (i in 1:N) {
    log_lik[i] = normal_lpdf(area[i] | area_temp[i], sigma);
  }
}
"""

Let's compile the stan code from a standalone file.
**change to standalone file**

In [41]:
sm_linear_2 = bebi103.stan.StanModel(model_code=model_code_linear_2)

INFO:pystan:COMPILING THE C++ CODE FOR MODEL anon_model_667b092dd3cab7212e48a056398b6bea NOW.
  tree = Parsing.p_module(s, pxd, full_module_name)


We will try a subset of data with two growth events from bacterium 1 and one growth event from bacterium 2.

In [42]:
df_sub1 = df.loc[(df['growth_event'] == 1) & (df['bacterium'] == 1)]
df_sub2 = df.loc[(df['growth_event'] == 2) & (df['bacterium'] == 1)]
df_sub3 = df.loc[(df['growth_event'] == 3) & (df['bacterium'] == 2)]
df_sub = pd.concat([df_sub1, df_sub2, df_sub3])

df_sub.head()

Unnamed: 0,time (min),area,growth_event,bacterium,t
98,99.0,1.403376,1,1,0
99,100.0,1.400672,1,1,1
100,101.0,1.373632,1,1,2
101,102.0,1.40608,1,1,3
102,103.0,1.362816,1,1,4


Now let's convert the data into the format Stan expects.

In [31]:
data, df_part = bebi103.stan.df_to_datadict_hier(df_sub,
                                                 level_cols=['bacterium', 'growth_event'],
                                                 data_cols=['area', 't'])
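To see what index arrays the two-level Stan model needs, here is a hand-rolled sketch in pandas. It is not the `bebi103` implementation, only an illustration of the structure: `index_2` maps each data point to its (bacterium, growth_event) pair, and `index_1` maps each pair to its bacterium. The toy frame and the name `data_toy` are made up.

```python
import pandas as pd

# Toy frame with made-up rows
df_toy = pd.DataFrame({
    'bacterium':    [1, 1, 1, 2, 2],
    'growth_event': [1, 1, 2, 3, 3],
    'area':         [1.4, 1.5, 1.3, 1.2, 1.3],
    't':            [0, 1, 0, 0, 1],
})

levels = ['bacterium', 'growth_event']

# index_2: which (bacterium, growth_event) pair each data point belongs to
index_2 = df_toy.groupby(levels, sort=True).ngroup().to_numpy() + 1

# One row per pair, in the same sorted order that ngroup uses
pairs = df_toy[levels].drop_duplicates().sort_values(levels)

# index_1: which bacterium each pair belongs to
index_1 = pd.factorize(pairs['bacterium'], sort=True)[0] + 1

data_toy = dict(N=len(df_toy),
                J_1=pairs['bacterium'].nunique(),
                J_2=len(pairs),
                index_1=index_1,
                index_2=index_2,
                area=df_toy['area'].to_numpy(),
                t=df_toy['t'].to_numpy())
```

Note that `index_1` has length `J_2` and `index_2` has length `N`, matching the declarations in the Stan `data` block.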

Now let's sample, making sure to check diagnostics.

In [47]:
# Sample
samples_linear_2 = sm_linear_2.sampling(data=data, 
                              seed=2389412, 
                              control=dict(adapt_delta=0.99, max_treedepth=11))

# Convert to data frame for easy use later
df_linear_2 = bebi103.stan.to_dataframe(samples_linear_2)

bebi103.stan.check_all_diagnostics(samples_linear_2)



n_eff / iter looks reasonable for all parameters.
Rhat looks reasonable for all parameters.
0 of 4000 (0.0%) iterations ended with a divergence.
640 of 4000 (16.0%) iterations saturated the maximum tree depth of 11.
  Try running again with max_treedepth set to a larger value to avoid saturation.
E-BFMI indicated no pathological behavior.


8

In [32]:
#bokeh.io.show(bebi103.viz.trace_plot(samples_linear_2, pars=['a', 'k'], line_width=2))

Let's plot the corner plot.

In [49]:
bokeh.io.show(bebi103.viz.corner(samples_linear_2, pars=['a', 'k']))

We can also plot the marginalized distributions for both parameters to look at the parameter values.

In [None]:
# Marginalized distributions of each parameter 
plots = [bebi103.viz.ecdf(df_linear_2[param], 
                          x_axis_label=param, 
                          plot_height=200, 
                          plot_width=250) 
                 for param in ['a', 'k']]
bokeh.io.show(bokeh.layouts.gridplot(plots, ncols=3))

And let's perform the posterior predictive check.

In [50]:
time = df_sub['time (min)'].values
val = df_sub['area'].values
df_lin2_ppc = bebi103.stan.extract_array(samples_linear_2, name='area_ppc')

p1 = hw92_predictive(df_lin2_ppc, 
                     time, 
                     val, 
                     perc=[99, 75, 50, 25], 
                     name='area_ppc', 
                     plot_width=500, 
                     plot_height=400)

bokeh.io.show(p1)

The posterior predictive check looks good, although 16% of iterations saturated the maximum tree depth. Let's move on to the exponential model.

\begin{gather}
a \sim \mbox{Norm}(\mu_a,\sigma_a) \\
k \sim \mbox{Norm}(\mu_k,\sigma_k) \\
\sigma \sim \mbox{HalfNorm}(\sigma_{hyper}) \\
a_1 \sim \mbox{Norm}(a, \tau_a) \\
k_1 \sim \mbox{Norm}(k, \tau_k) \\
a_2 \sim \mbox{Norm}(a_1, \tau_a) \\
k_2 \sim \mbox{Norm}(k_1, \tau_k) \\
\mathrm{area} \sim \mbox{Norm}(a_2 \mathrm{e}^{k_2 t}, \sigma)
\end{gather}

**double check this**

In [51]:
model_code_exp_2 = """
data {
  // Total number of data points
  int N;
  
  // Number of entries in each level of the hierarchy
  int J_1;
  int J_2;
  
  //Index arrays to keep track of hierarchical structure
  int index_1[J_2];
  int index_2[N];
  
  // The measurements
  real area[N];
  
  // Time
  vector[N] t;
}

parameters {
  // Hyperparameters level 0
  real a;
  real k;
  real<lower=0> sigma;

  // How hyperparameters vary
  real<lower=0> tau_a;
  real<lower=0> tau_k;

  // Hyperparameters level 1
  vector[J_1] a_1_tilde;
  vector[J_1] k_1_tilde;
  
  // Hyperparameters level 2
  vector[J_2] a_2_tilde;
  vector[J_2] k_2_tilde;
}

transformed parameters {
  // Transformations for noncentered
  vector[J_1] a_1 = a + tau_a * a_1_tilde;
  vector[J_1] k_1 = k + tau_k * k_1_tilde;
  
  vector[J_2] a_2 = a_1[index_1] + tau_a * a_2_tilde;
  vector[J_2] k_2 = k_1[index_1] + tau_k * k_2_tilde;
  
  vector[N] area_temp;
  
  for (i in 1:N) {
    area_temp[i] = a_2[index_2[i]] * exp(k_2[index_2[i]] * t[i]);
  }
}

model {
  a ~ normal(1.4, 0.3);
  k ~ normal(0.01, 0.002);
  sigma ~ normal(0, 0.1);
  tau_a ~ normal(0, 0.1);
  tau_k ~ normal(0, 0.001);

  a_1_tilde ~ normal(0, 1);
  k_1_tilde ~ normal(0, 1);
  
  a_2_tilde ~ normal(0, 1);
  k_2_tilde ~ normal(0, 1);

  area ~ normal(area_temp, sigma);
}

generated quantities {
  vector[N] area_ppc;
  real log_lik[N];
  
  for (i in 1:N) {
    area_ppc[i] = normal_rng(area_temp[i], sigma);
  }
  
  // Compute pointwise log likelihood
  for (i in 1:N) {
    log_lik[i] = normal_lpdf(area[i] | area_temp[i], sigma);
  }
}
"""

Compile the stan code **from a standalone file**.

In [52]:
sm_exp2 = bebi103.stan.StanModel(model_code=model_code_exp_2)

INFO:pystan:COMPILING THE C++ CODE FOR MODEL anon_model_3320abe8024cbef8637f96c0c2474507 NOW.
  tree = Parsing.p_module(s, pxd, full_module_name)


We are using the same subset of data so let's sample! We also make sure to check the diagnostics.

In [54]:
# Sample
samples_exp2 = sm_exp2.sampling(data=data, 
                              seed=2389412, 
                              control=dict(adapt_delta=0.99, max_treedepth=11))

# Convert to data frame for easy use later
df_samples_exp2 = bebi103.stan.to_dataframe(samples_exp2)

bebi103.stan.check_all_diagnostics(samples_exp2)



n_eff / iter looks reasonable for all parameters.
Rhat for parameter k_1_tilde[1] is 1.2144612297384512.
  Rhat above 1.1 indicates that the chains very likely have not mixed
0 of 4000 (0.0%) iterations ended with a divergence.
3810 of 4000 (95.25%) iterations saturated the maximum tree depth of 11.
  Try running again with max_treedepth set to a larger value to avoid saturation.
E-BFMI indicated no pathological behavior.


10
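The Rhat warning above can be understood with a small sketch of the classic (non-split) Gelman-Rubin statistic, which compares the spread between chains with the spread within chains. Stan's own computation differs in detail, so this will not reproduce its numbers exactly; the function name `gelman_rubin_rhat` and the synthetic chains are assumptions for illustration.

```python
import numpy as np


def gelman_rubin_rhat(chains):
    """Classic (non-split) Gelman-Rubin Rhat for one parameter.
    chains has shape (n_chains, n_draws)."""
    chains = np.asarray(chains, dtype=float)
    n_draws = chains.shape[1]

    # Within-chain variance, averaged over chains
    W = chains.var(axis=1, ddof=1).mean()
    # Between-chain variance, from the spread of the chain means
    B = n_draws * chains.mean(axis=1).var(ddof=1)

    var_hat = (n_draws - 1) / n_draws * W + B / n_draws
    return np.sqrt(var_hat / W)


rng = np.random.default_rng(1)
mixed = rng.normal(size=(4, 1000))
stuck = mixed + np.array([[0.0], [0.0], [0.0], [3.0]])  # one chain offset

rhat_mixed = gelman_rubin_rhat(mixed)
rhat_stuck = gelman_rubin_rhat(stuck)
```

When all chains explore the same distribution the statistic is close to 1; a chain sitting in a different region inflates the between-chain term and pushes it well above 1.1.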

We can plot the corner plot to look at the values of a and k.

In [56]:
bokeh.io.show(bebi103.viz.corner(samples_exp2, pars=['a', 'k']))

We can also plot the marginalized distributions for both parameters to look at the parameter values.

In [None]:
# Marginalized distributions of each parameter 
plots = [bebi103.viz.ecdf(df_samples_exp2[param], x_axis_label=param, plot_height=200, plot_width=250) 
                 for param in ['a', 'k']]
bokeh.io.show(bokeh.layouts.gridplot(plots, ncols=3))

Now let's perform the posterior predictive check.

In [57]:
time = df_sub['time (min)'].values
val = df_sub['area'].values
df_exp2_ppc = bebi103.stan.extract_array(samples_exp2, name='area_ppc')

p1 = hw92_predictive(df_exp2_ppc, 
                     time, 
                     val, 
                     perc=[99, 75, 50, 25], 
                     name='area_ppc', 
                     plot_width=500, 
                     plot_height=400)

bokeh.io.show(p1)

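The posterior predictive check looks good, but the diagnostics above report an Rhat above 1.1 for `k_1_tilde[1]` and show that 95% of iterations saturated the maximum tree depth, so these samples should be treated with caution. With that caveat, let's compare the two models.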

In [58]:
bebi103.stan.compare({'linear': samples_linear_2,
                      'exp': samples_exp2},
                     log_likelihood='log_lik',
                     ic='loo')

        one or more samples. You should consider using a more robust model, this is because
        importance sampling is less likely to work well if the marginal posterior and LOO posterior
        are very different. This is more likely to happen with a non-robust model and highly
        influential observations.


Unnamed: 0,loo,ploo,dloo,weight,se,dse,warning
exp,-1206.99,15.4135,0.0,0.939028,71.8226,0.0,1
linear,-1053.36,8.67816,153.631,0.0609722,29.3423,52.7889,0


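The exponential model again has a much larger weight (about 0.94 versus 0.06) and a smaller LOO value. However, the exponential model's row carries a warning flag and its sampler diagnostics were poor, so this comparison should be treated with caution.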

Now we sample for the whole dataset.

In [None]:
%load_ext watermark

In [None]:
%watermark -v -p numpy,scipy,bokeh,jupyterlab