<p style="background-color:#D9EDF7">
This is great work and very well explained! 20/20. 
</p>


### Problem 9.2: Outliers in FRET binding curve

Attribution: Zhiyang did this problem; the whole group discussed it together for debugging.

In [1]:
import itertools

import numpy as np
import pandas as pd
import altair as alt
import altair_catplot as altcat

import bebi103

import bokeh.io
import bokeh.plotting
import bokeh.models
import bokeh.layouts
import bokeh.transform
bokeh.io.output_notebook()
color_palette=['#4e79a7', '#f28e2b', '#e15759', '#76b7b2', '#59a14f', '#edc948', '#b07aa1', '#ff9da7', '#9c755f', '#bab0ac']

Features requiring DataShader will not work and you will get exceptions.


To build the model, we want to have a general idea about what the data look like, so we load the data set into data frame first.

In [2]:
# Load the data set
df = pd.read_csv('../data/fret_binding_curve.csv', comment='#')

# Take a look
df

Unnamed: 0,buffer,fluorescence,a conc (nM),b conc (nM)
0,1256.5751,258316.2818,50.0,1500.0
1,1256.5751,267722.6277,50.0,750.0
2,1256.5751,267431.662,50.0,375.0
3,1256.5751,284596.2914,50.0,187.5
4,1256.5751,254903.3958,50.0,93.75
5,1256.5751,333810.6371,50.0,46.875
6,1256.5751,370821.7778,50.0,23.4375
7,1256.5751,408856.1424,50.0,11.71875
8,1256.5751,431000.0,50.0,5.859375
9,1256.5751,437000.0,50.0,0.0


The data frame is tidy, and the fluorescence measurements are on the order of $10^5$. From the definition given in the problem, the dissociation constant $K_d$ is

\begin{equation}\tag{1}
K_d = \frac{c_a c_b}{c_{ab}}
\end{equation}

where $c_i$ is the concentration of species $i$. Since concentrations cannot be negative, $K_d$ is non-negative and can range from 0 to infinity.

Next, consider how the experiment is done. The data consist of the total concentrations of $a$ and $b$ along with the fluorescence readings, which are related by

\begin{equation} \tag{2}
F = \hat{f}_0(c_a^0 - c_{ab}) + \hat{f}_q\, c_{ab}
= \hat{f}_0\,c_a^0 - \frac{2(\hat{f}_0 - \hat{f}_q)c_a^0\,c_b^0}{K_d+c_a^0+c_b^0 + \sqrt{\left(K_d+c_a^0+c_b^0\right)^2 - 4c_a^0\,c_b^0}}.
\end{equation}

where $\hat{f}_0 = f_0 V$ and $\hat{f}_q = f_q V$ are transformed parameters. In Eq. (2), $K_d$, $\hat{f}_0$, and $\hat{f}_q$ are unknown, while the remaining variables, $c^0_a$ and $c^0_b$, are given in the data. The model therefore has three parameters: $K_d$, $\hat{f}_0$, and $\hat{f}_q$.
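For completeness, here is a sketch of how the closed form in Eq. (2) follows from Eq. (1), assuming only conservation of mass for the two species, $c_a^0 = c_a + c_{ab}$ and $c_b^0 = c_b + c_{ab}$:

\begin{align}
K_d\,c_{ab} &= (c_a^0 - c_{ab})(c_b^0 - c_{ab}) \\
0 &= c_{ab}^2 - (K_d + c_a^0 + c_b^0)\,c_{ab} + c_a^0\,c_b^0 \\
c_{ab} &= \frac{K_d + c_a^0 + c_b^0 - \sqrt{\left(K_d + c_a^0 + c_b^0\right)^2 - 4c_a^0\,c_b^0}}{2}
= \frac{2c_a^0\,c_b^0}{K_d + c_a^0 + c_b^0 + \sqrt{\left(K_d + c_a^0 + c_b^0\right)^2 - 4c_a^0\,c_b^0}}.
\end{align}

The smaller root is the physical one because $c_{ab}$ cannot exceed $c_a^0$ or $c_b^0$, and the second form follows from multiplying numerator and denominator by the conjugate. Substituting into $F = \hat{f}_0\,c_a^0 - (\hat{f}_0 - \hat{f}_q)\,c_{ab}$ gives Eq. (2).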

Ignoring outliers for now, we only know that $K_d$ is non-negative, so we use a half-normal prior for $K_d$ with a large $\sigma$ so that the prior is broad enough to cover the plausible values of $K_d$. For $\hat{f}_0$ and $\hat{f}_q$, their products with the concentration of $a$ should be of the same order of magnitude as the fluorescence readings: with no $b$ present, $F = \hat{f}_0 c_a^0$, and with $K_d$ close to zero and $b$ in excess, $F \approx \hat{f}_q c_a^0$. We also know that $\hat{f}_0 > \hat{f}_q$, both because the bound state is called 'quenched' and because, in the data set, $F$ generally decreases as $c^0_b$ increases. Hence, we choose broad normal priors for these two parameters with means around $10^5 / 50$, such that $\hat{f}_0$ is larger than $\hat{f}_q$ in most draws. Finally, the measurements should contain some noise, so we model the readings as normally distributed about the calculated $F$ with standard deviation $\sigma$, which itself has a half-normal prior. The model is summarized below:

\begin{gather}
K_d \sim \mbox{HalfNorm}(\sigma_{K_d}) \\
\hat{f}_0 \sim \mbox{Norm}(\mu_{f_0}, \sigma_{f_0}) \\
\hat{f}_q \sim \mbox{Norm}(\mu_{f_q}, \sigma_{f_q}) \\
\sigma \sim \mbox{HalfNorm}(\sigma_{noise}) \\
F_{temp} = \hat{f}_0(c_a^0 - c_{ab}) + \hat{f}_q\, c_{ab}
= \hat{f}_0\,c_a^0 - \frac{2(\hat{f}_0 - \hat{f}_q)c_a^0\,c_b^0}{K_d+c_a^0+c_b^0 + \sqrt{\left(K_d+c_a^0+c_b^0\right)^2 - 4c_a^0\,c_b^0}} \\
F \sim \mbox{Norm}(F_{temp}, \sigma).
\end{gather}
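As a quick numerical check of the two limits used above to choose the priors, the theoretical curve can be sketched in NumPy. The function name `fluorescence` and the parameter values are illustrative assumptions, not part of the original analysis.

```python
import numpy as np


def fluorescence(cb0, ca0, Kd, f0, fq):
    """Theoretical fluorescence from Eq. (2); all concentrations in nM."""
    cb0 = np.asarray(cb0, dtype=float)
    s = Kd + ca0 + cb0
    # Concentration of bound complex, in the numerically stable form
    c_ab = 2 * ca0 * cb0 / (s + np.sqrt(s**2 - 4 * ca0 * cb0))
    return f0 * ca0 - (f0 - fq) * c_ab


# Illustrative values only (assumed, chosen near the prior means)
ca0, f0, fq = 50.0, 9000.0, 4500.0

# With no b, nothing is quenched: F = f0 * ca0
F_no_b = fluorescence(0.0, ca0, Kd=10.0, f0=f0, fq=fq)

# With very tight binding and b in excess, nearly all a is bound: F ~ fq * ca0
F_saturated = fluorescence(1500.0, ca0, Kd=1e-9, f0=f0, fq=fq)

print(F_no_b, F_saturated)
```

Both limits land on the order of $10^5$ for these values, which is the scale of the measured fluorescence.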

We then code up the prior predictive check.

In [39]:
pri_pred_1 = '''data {
  // Number of data points
  int N;
  // conc of a
  real ca0;
  // conc of b
  real cb0[N];
  // sigma for Kd
  real Kd_sigma;
  // mean of f0
  real f0_mu;
  // sigma of f0
  real f0_sigma;
  // mean of fq
  real fq_mu;
  // sigma for fq
  real fq_sigma;
  // sigma for measurement noise
  real noise_sigma;
}

generated quantities{
  real Kd;
  real f0;
  real fq;
  // Generated readings
  real F[N];
  // Calculated F
  real temp;
  real noise;
  
  Kd = fabs(normal_rng(0, Kd_sigma));
  f0 = normal_rng(f0_mu, f0_sigma);
  fq = normal_rng(fq_mu, fq_sigma);
  noise = fabs(normal_rng(0, noise_sigma));
  
  
  for (i in 1:N) {
  // for every data point, generate the calculated F first
    temp = f0 * ca0 - (2 * (f0 - fq) * ca0 * cb0[i]) / (Kd + ca0 + cb0[i] + sqrt((Kd + ca0 + cb0[i])^2 - 4 * ca0 * cb0[i]));
    F[i] = normal_rng(temp, noise);
  }
}'''

sm = bebi103.stan.StanModel(model_code=pri_pred_1)

INFO:pystan:COMPILING THE C++ CODE FOR MODEL anon_model_66894181ad8124db5e7dd1343c6d3345 NOW.
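For readers without a Stan installation, the same generative process can be sketched directly in NumPy. This is an assumed re-implementation for illustration only: the function name `prior_predictive` and the seed are hypothetical, and `np.random.default_rng` requires NumPy 1.17 or later. It mirrors the `generated quantities` block above, taking absolute values of normal draws to obtain half-normal draws.

```python
import numpy as np


def prior_predictive(cb0, ca0, Kd_sigma, f0_mu, f0_sigma, fq_mu, fq_sigma,
                     noise_sigma, n_draws=100, seed=0):
    """Draw n_draws fluorescence curves from the prior predictive distribution."""
    rng = np.random.default_rng(seed)
    cb0 = np.asarray(cb0, dtype=float)

    # One parameter set per draw; |Normal(0, s)| is a half-normal draw
    Kd = np.abs(rng.normal(0.0, Kd_sigma, size=(n_draws, 1)))
    f0 = rng.normal(f0_mu, f0_sigma, size=(n_draws, 1))
    fq = rng.normal(fq_mu, fq_sigma, size=(n_draws, 1))
    noise = np.abs(rng.normal(0.0, noise_sigma, size=(n_draws, 1)))

    # Theoretical curve, broadcast to shape (n_draws, len(cb0))
    s = Kd + ca0 + cb0
    F_theor = f0 * ca0 - 2 * (f0 - fq) * ca0 * cb0 / (s + np.sqrt(s**2 - 4 * ca0 * cb0))

    # Add measurement noise independently at each concentration
    return rng.normal(F_theor, noise)


# Two-fold dilution series of b from 1500 nM, plus a zero-b point, as in the data
cb0 = np.append(1500.0 / 2.0 ** np.arange(9), 0.0)
F_draws = prior_predictive(cb0, ca0=50, Kd_sigma=150, f0_mu=9000, f0_sigma=2000,
                           fq_mu=4500, fq_sigma=1000, noise_sigma=10000)
print(F_draws.shape)
```

Each row of `F_draws` is one simulated binding curve, ordered like the rows of the data frame.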


Then we slice out the concentrations of $b$ and choose prior parameters that we think are reasonable in order of magnitude and broad enough to cover all plausible data. In particular, we use a narrower prior for $\hat{f}_q$ to limit its overlap with $\hat{f}_0$.

In [40]:
# Slice out the cb0
conc_b = df['b conc (nM)'].values

# Put reasonable parameters for priors
data = dict(N=len(df),
            ca0 = 50,
            cb0 = conc_b,
            Kd_sigma = 150,
            f0_mu = 9000,
            f0_sigma = 2000,
            fq_mu = 4500,
            fq_sigma = 1000,
            noise_sigma = 10000)

# Sample
samples_gen = sm.sampling(data=data,
                          algorithm='Fixed_param',
                          warmup=0,
                          chains=1,
                          iter=100)



Then we take a look at the data in the sampling results to see how we should plot it.

In [41]:
df_samples = bebi103.stan.extract_array(samples_gen, name='F')
# Take a look
df_samples.head()

Unnamed: 0,index_1,F,chain,chain_idx,warmup
0,1,201440.410843,1,1,0
1,1,245182.637384,1,2,0
2,1,167664.555588,1,3,0
3,1,284198.369452,1,4,0
4,1,227776.396976,1,5,0


It looks like each 'chain_idx' has a set of data points corresponding to the values of $c^0_b$, so we plot the curve for every chain_idx to see what the prior predictive check gives.

In [42]:
# Initialize the figure
p= bokeh.plotting.figure(width=500,height=400)

# Plot vs cb0 for each chain_idx
for i in range(100):
    p.line(conc_b, df_samples.loc[df_samples['chain_idx'] == i+1, 'F'].values, alpha=0.2)

bokeh.io.show(p)
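Because the samples come back in long format (one row per draw and concentration index), it can be convenient to pivot them into a 2D array with one row per draw. A minimal sketch, using a small made-up data frame with the same `index_1`, `chain_idx`, and `F` columns shown above (the numbers are placeholders, not real samples):

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_samples: 3 draws at 4 concentrations (placeholder values)
n_draws, n_conc = 3, 4
toy = pd.DataFrame({
    'index_1': np.repeat(np.arange(1, n_conc + 1), n_draws),
    'chain_idx': np.tile(np.arange(1, n_draws + 1), n_conc),
    'F': np.arange(n_draws * n_conc, dtype=float),
})

# Rows are draws (chain_idx), columns are concentration indices (index_1)
F_wide = toy.pivot(index='chain_idx', columns='index_1', values='F').values

# Row i now holds the full curve for draw i + 1
print(F_wide.shape)
```

With the real samples, a loop over the rows of such a wide array would replace the repeated `.loc` lookups.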

It is hard to tell whether there are unphysical results. The main thing that needs attention is that $F$ should generally decrease with increasing $c^0_b$, even with noise, which corresponds to $\hat{f}_0 > \hat{f}_q$. So we plot the differences between adjacent points below to see if they are mostly negative.

In [43]:
p = bokeh.plotting.figure(width=500,height=400)

for i in range(100):
    # Plot vs cb, change the order because of the data format
    p.line(conc_b[-2::-1], np.diff(df_samples.loc[df_samples['chain_idx'] == i+1, 'F'].values[::-1]), alpha=0.2, line_width=2)

bokeh.io.show(p)

Most of them look negative, but it is still hard to tell, so we compute the ECDF of those differences:

In [44]:
dif = []

for i in range(100):
    # Append differences for each curve
    dif = dif + (list(np.diff(df_samples.loc[df_samples['chain_idx'] == i+1, 'F'].values[::-1])))

p = bebi103.viz.ecdf(dif)

bokeh.io.show(p)

About 10% of the differences are still positive, which could be due to the noise, so we sum them up for each curve and plot the ECDF of those sums.

In [45]:
dif = []

for i in range(100):
    # Append the sum of differences for each curve
    dif.append(np.sum(np.diff(df_samples.loc[df_samples['chain_idx'] == i+1, 'F'].values[::-1])))

p = bebi103.viz.ecdf(dif)

bokeh.io.show(p)

Only four of the sums are positive, suggesting that in about 4% of the draws $\hat{f}_0 < \hat{f}_q$ (or the noise masks the decrease), which is acceptable to us. We then proceed to write the code for this model.
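One detail worth making explicit: the sum of adjacent differences telescopes, so it equals the last value minus the first, that is, $F$ at the highest $c^0_b$ minus $F$ at $c^0_b = 0$. A minimal sketch with made-up curves (placeholder numbers) illustrates this and how the fraction of rising curves can be computed:

```python
import numpy as np

# Toy curves ordered from low to high cb0, one row per draw (placeholder values)
curves = np.array([
    [10.0, 8.0, 9.0, 5.0],   # noisy but decreasing overall
    [4.0, 5.0, 3.0, 6.0],    # increasing overall
    [7.0, 6.0, 4.0, 1.0],    # strictly decreasing
])

# Sum of adjacent differences along each curve
diff_sums = np.diff(curves, axis=1).sum(axis=1)

# Telescoping: identical to last value minus first value
end_minus_start = curves[:, -1] - curves[:, 0]

# Fraction of curves that end higher than they start
frac_rising = np.mean(diff_sums > 0)
print(diff_sums, frac_rising)
```

So this check compares only the two endpoints of each curve, noise included.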

In addition, we came up with a way of plotting these samples against the input variable, shown below. For every data point, it computes the median and several percentiles of the samples at that point and plots them against the input variable. Initially, we wanted to use this to plot the prior predictive check, but doing so loses some information about the shapes of the individual curves. We are not entirely sure about this, but we assume this plotting method can still show the general trend of the data against the input variable and is a useful illustration. The shaded regions only indicate that a given percentage of the samples at each point lie inside them; they say nothing about the shapes of individual curves. The medians, however, may still be informative about the shape: all the priors and the likelihood are normal or half-normal, so the pointwise medians should give an idea of the most probable values at those points, which in turn should come from the most probable parameter values.

In [46]:
def hw92_predictive(df, x, y=None, namex='index_1', name='F_ppc', perc=[80, 60, 40, 20], 
                    x_axis_label=None, y_axis_label=None, title=None, plot_width=350, plot_height=225, 
                    color='blue', data_color=color_palette[1], diff=False):
    '''Plot pointwise percentile envelopes of predictive samples, mimicking a predictive ECDF.
    df - data frame of MCMC samples
    x - input variable
    y - measured data (optional)
    namex - name of the column in df indexing the input variable
    name - name of the column in df containing the predictive samples
    perc - list, default [80, 60, 40, 20]
            Percentiles for making colored envelopes for confidence
            intervals for the predictive samples. Maximally four can be 
            specified.
    diff - if True, plot differences between adjacent points instead'''
    
    # Initialize the color
    if color not in ['green', 'blue', 'red', 'gray',
                     'purple', 'orange', 'betancourt']:
        raise RuntimeError("Only allowed colors are 'green', 'blue', 'red', 'gray', 'purple', 'orange', 'betancourt'")
    
    colors = {'blue': ['#9ecae1','#6baed6','#4292c6','#2171b5','#084594'],
              'green': ['#a1d99b','#74c476','#41ab5d','#238b45','#005a32'],
              'red': ['#fc9272','#fb6a4a','#ef3b2c','#cb181d','#99000d'],
              'orange': ['#fdae6b','#fd8d3c','#f16913','#d94801','#8c2d04'],
              'purple': ['#bcbddc','#9e9ac8','#807dba','#6a51a3','#4a1486'],
              'gray': ['#bdbdbd','#969696','#737373','#525252','#252525'],
              'betancourt': ['#DCBCBC', '#C79999', '#B97C7C',
                             '#A25050', '#8F2727', '#7C0000']}
    
    # Initialize the figure
    p = bokeh.plotting.figure(plot_width=plot_width,
                              plot_height=plot_height,
                              x_axis_label=x_axis_label,
                              y_axis_label=y_axis_label,
                              title=title)
    
    # See if take the diff
    if diff:
        x = x[1:]
        if y is not None:
            y = np.diff(y)
        Nb = len(x)
        y_ppc = np.empty((len(perc) * 2 + 1, Nb))
        for i in range(Nb):
            temp = df.loc[df[namex]== i+2, name].values - df.loc[df[namex]== i+1, name].values
            y_ppc[-1, i] = np.median(temp)
            for j in range(len(perc)):
                y_ppc[j * 2, i] = np.percentile(temp, 50 - perc[j] / 2)
                y_ppc[j * 2 + 1, i] = np.percentile(temp, 50 + perc[j] / 2)
    else:                
        Nb = len(x)
        y_ppc = np.empty((len(perc) * 2 + 1, Nb))
        # For each data point, take all the sampling results at this point
        for i in range(Nb):
            temp = df.loc[df[namex]== i+1, name].values
            # Find the median and corresponding percentiles
            y_ppc[-1, i] = np.median(temp)
            for j in range(len(perc)):
                y_ppc[j * 2, i] = np.percentile(temp, 50 - perc[j] / 2)
                y_ppc[j * 2 + 1, i] = np.percentile(temp, 50 + perc[j] / 2)
    
    # Plotting like predictive_ecdf
    for j in range(len(perc)):
        bebi103.viz.fill_between(x, y_ppc[j * 2, :],
                     x, y_ppc[j * 2 + 1,:],
                     p=p,
                     show_line=False,
                     fill_color=colors[color][j])
        
    p.line(x, y_ppc[-1, :],
           line_width=2,
           color=colors[color][-1])
    
    if y is not None:
        p.line(x, y, line_width=2, color=data_color)
    
    return p
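The per-point percentile computation at the heart of this function can also be written in vectorized form. The following is a sketch under the assumption that the samples have been arranged into a 2D array with one row per draw; the helper name `percentile_envelope` and the toy data are hypothetical.

```python
import numpy as np


def percentile_envelope(samples, perc=(80, 60, 40, 20)):
    """Return the pointwise median and central intervals of `samples`.

    samples : array of shape (n_draws, n_points)
    perc : widths (in percent) of the central intervals
    """
    samples = np.asarray(samples, dtype=float)
    median = np.median(samples, axis=0)
    # Map each width to its (lower, upper) pointwise bounds
    bounds = {
        p: (np.percentile(samples, 50 - p / 2, axis=0),
            np.percentile(samples, 50 + p / 2, axis=0))
        for p in perc
    }
    return median, bounds


# Toy example: 1000 draws at 5 points, standard normal noise around a line
rng = np.random.default_rng(0)
toy = np.linspace(0, 4, 5) + rng.normal(size=(1000, 5))
median, bounds = percentile_envelope(toy)
low80, high80 = bounds[80]
```

Each pair in `bounds` corresponds to one shaded band, and `median` to the central line.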

In [47]:
# Plot using the function above
p1 = hw92_predictive(df_samples, conc_b, name='F', perc=[99, 75, 50, 25], diff=False, plot_width=500, plot_height=400)

bokeh.io.show(p1)

This way of plotting clearly shows little about the possible shapes of the curves. For the same prior predictive samples, it loses track of the curves with the wrong trend. However, we still think it is a good way to show where most of the data lie, because we expect that a poor prior giving many unphysical values would still show up. To test that, we change some parameters in the prior predictive check; for instance, we lower the mean of $\hat{f}_0$ to 7000, which should give some overlap with $\hat{f}_q$.

In [48]:
# Slice out the cb0
conc_b = df['b conc (nM)'].values

# Same priors as before, but with the mean of f0 lowered to 7000
data = dict(N=len(df),
            ca0 = 50,
            cb0 = conc_b,
            Kd_sigma = 150,
            f0_mu = 7000,
            f0_sigma = 2000,
            fq_mu = 4500,
            fq_sigma = 1000,
            noise_sigma = 10000)

# Sample
samples_gen = sm.sampling(data=data,
                          algorithm='Fixed_param',
                          warmup=0,
                          chains=1,
                          iter=100)

Then we take a look at the data in the sampling results to see how we should plot it.

In [49]:
df_samples = bebi103.stan.extract_array(samples_gen, name='F')
# Take a look
df_samples.head()

Unnamed: 0,index_1,F,chain,chain_idx,warmup
0,1,249859.427586,1,1,0
1,1,232110.43647,1,2,0
2,1,262042.701162,1,3,0
3,1,324350.016802,1,4,0
4,1,325503.566045,1,5,0


As before, each 'chain_idx' has a set of data points corresponding to the values of $c^0_b$, so we plot the curve for every chain_idx to see what the prior predictive check gives.

In [50]:
# Initialize the figure
p= bokeh.plotting.figure(width=500,height=400)

# Plot vs cb0 for each chain_idx
for i in range(100):
    p.line(conc_b, df_samples.loc[df_samples['chain_idx'] == i+1, 'F'].values, alpha=0.2)
    
# Plot using the function above
p1 = hw92_predictive(df_samples, conc_b, name='F', perc=[99, 75, 50, 25], diff=False, plot_width=500, plot_height=400)

bokeh.io.show(bokeh.layouts.gridplot([[p, p1]]))

In the right figure, the wrong trend does show up in the 99% shaded region, while the other regions indicate that most of the curves have the right shape. Looking at these two plots, we believe this way of plotting predictive checks is reasonably informative and clear, especially when the number of sampling iterations is large. We will keep using this function throughout HW9.

In [15]:
model_code_normal = """
data {
  // Number of datapoints
  int N;
  // Conc of a
  int ca0;
  // Conc of b
  real cb0[N];
  // Measured fluorescence
  real F[N];
}


parameters {
  real<lower=0> Kd;
  real<lower=0> f0;
  real<lower=0> fq;
  real<lower=0> noise;
}

transformed parameters {
  real F_temp[N];
  for (i in 1:N) {
  // Generate calculated F for each point
    F_temp[i] = f0 * ca0 - (2 * (f0 - fq) * ca0 * cb0[i]) / (Kd + ca0 + cb0[i] + sqrt((Kd + ca0 + cb0[i])^2 - 4 * ca0 * cb0[i]));
  }
}


model {
  Kd ~ normal(0, 100);
  f0 ~ normal(9000, 2000);
  fq ~ normal(4500, 1000);
  noise ~ normal(0, 10000);
  
  F ~ normal(F_temp, noise);
}


generated quantities {
  // Posterior predictive check
  real F_ppc[N];
  
  for (i in 1:N) {
    F_ppc[i] = normal_rng(F_temp[i], noise);
  }
}
"""

The Stan code is attached above for reference, but the model is compiled from the standalone file.

In [16]:
# Compile from the standalone file
sm_normal = bebi103.stan.StanModel(file='hw92_normal.Stan')

INFO:pystan:COMPILING THE C++ CODE FOR MODEL anon_model_9ca1951dc0d7006da4e419ced82ffced NOW.


In [17]:
# Make the data
data = dict(N=len(df),
            ca0 = 50,
            cb0 = conc_b,
            F = df['fluorescence'].values)

In [18]:
# Sample out of the model
samples_normal = sm_normal.sampling(data=data)
# Run diagnostics
bebi103.stan.check_all_diagnostics(samples_normal)



n_eff / iter looks reasonable for all parameters.
Rhat looks reasonable for all parameters.
0.0 of 4000 (0.0%) iterations ended with a divergence.
0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.
E-BFMI indicated no pathological behavior.


0

The diagnostics look good, so we plot the corner plots.

In [19]:
bokeh.io.show(bebi103.viz.corner(samples_normal, 
                                 pars=['Kd', 'f0', 'fq','noise'],
                                 plot_width=200,
                                 cmap='gray',
                                 alpha=0.05))

We can also plot the marginal ECDFs of the three parameters $K_d$, $\hat{f}_0$, and $\hat{f}_q$.

In [20]:
df_normal = bebi103.stan.to_dataframe(samples_normal)

plots = [bebi103.viz.ecdf(df_normal[param], x_axis_label=param, plot_height=200, plot_width=250) 
                 for param in ['Kd', 'f0', 'fq']]
                                      
bokeh.io.show(bokeh.layouts.gridplot(plots, ncols=3))

To see how well the model does, we plot the results of the posterior predictive check using the function defined above.

In [21]:
# Extract the posterior predictive results
df_samples_ppc = bebi103.stan.extract_array(samples_normal, name='F_ppc')

In [22]:
p2 = hw92_predictive(df_samples_ppc, conc_b, df['fluorescence'].values, perc=[99, 70, 50, 25], name='F_ppc', plot_width=500, plot_height=400, title='Normal')

bokeh.io.show(p2)

It is not bad. Although there are about two outliers, both lie within the 99% range, and one of them is within the 70% range. For the rest of the data points, the curve of medians fits the measured data well. To account for the outliers, we change the likelihood to a Student-t distribution, so the model becomes:

\begin{gather}
K_d \sim \mbox{HalfNorm}(\sigma_{K_d}) \\
\hat{f}_0 \sim \mbox{Norm}(\mu_{f_0}, \sigma_{f_0}) \\
\hat{f}_q \sim \mbox{Norm}(\mu_{f_q}, \sigma_{f_q}) \\
\sigma \sim \mbox{HalfNorm}(\sigma_{noise}) \\
\nu \sim \mbox{Norm}(1, 100) \mbox{ with } \nu \geq 1 \\
F_{temp} = \hat{f}_0(c_a^0 - c_{ab}) + \hat{f}_q\, c_{ab}
= \hat{f}_0\,c_a^0 - \frac{2(\hat{f}_0 - \hat{f}_q)c_a^0\,c_b^0}{K_d+c_a^0+c_b^0 + \sqrt{\left(K_d+c_a^0+c_b^0\right)^2 - 4c_a^0\,c_b^0}} \\
F \sim \mbox{Student-t}(F_{temp}, \sigma, \nu).
\end{gather}
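To see why the Student-t likelihood is more forgiving of outliers, one can compare log densities far from the mean. The sketch below uses only the standard library; the function names are hypothetical, and $\nu = 3$ is an arbitrary illustrative choice, not a value inferred from the data.

```python
import math


def normal_logpdf(x, mu, sigma):
    """Log density of Norm(mu, sigma) at x."""
    z = (x - mu) / sigma
    return -0.5 * math.log(2 * math.pi) - math.log(sigma) - 0.5 * z**2


def student_t_logpdf(x, nu, mu, sigma):
    """Log density of a Student-t with nu degrees of freedom, location mu, scale sigma."""
    z = (x - mu) / sigma
    return (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
            - 0.5 * math.log(nu * math.pi) - math.log(sigma)
            - (nu + 1) / 2 * math.log1p(z**2 / nu))


# Log density of a point 5 scale units from the mean under each likelihood
lp_normal = normal_logpdf(5.0, 0.0, 1.0)
lp_t = student_t_logpdf(5.0, 3.0, 0.0, 1.0)

# The penalty for an outlier grows quadratically in z for the normal,
# but only logarithmically for the Student-t
print(lp_normal, lp_t)
```

Because a distant point costs far less log probability under the Student-t, it pulls the fitted curve less strongly than under the normal likelihood.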

In [23]:
model_code_t = """
data {
  // Number of datapoints
  int N;
  // Conc of a
  int ca0;
  // Conc of b
  real cb0[N];
  // Measured fluorescence
  real F[N];
}


parameters {
  real<lower=0> Kd;
  real<lower=0> f0;
  real<lower=0> fq;
  real<lower=1> nu;
  real<lower=0> noise;
}

transformed parameters {
  real F_temp[N];
  for (i in 1:N) {
  // Generate calculated F for each point
    F_temp[i] = f0 * ca0 - (2 * (f0 - fq) * ca0 * cb0[i]) / (Kd + ca0 + cb0[i] + sqrt((Kd + ca0 + cb0[i])^2 - 4 * ca0 * cb0[i]));
  }
}


model {
  Kd ~ normal(0, 100);
  f0 ~ normal(9000, 2000);
  fq ~ normal(4500, 1000);
  noise ~ normal(0, 10000);
  nu ~ normal(1,100);
  
  F ~ student_t(nu, F_temp, noise);
}


generated quantities {
  real F_ppc[N];
  
  // Posterior predictive check
  for (i in 1:N) {
    F_ppc[i] = student_t_rng(nu, F_temp[i], noise);
  }
}
"""

The Stan code is attached above for reference, but the model is compiled from the standalone file.

In [24]:
# Compile from the standalone file
sm_t = bebi103.stan.StanModel(file='hw92_student_t.Stan')

INFO:pystan:COMPILING THE C++ CODE FOR MODEL anon_model_015182b6199eb113204177d1b359350e NOW.


In [25]:
# Sample from the same data
samples_t = sm_t.sampling(data=data)
# Run diagnostics
bebi103.stan.check_all_diagnostics(samples_t)

n_eff / iter looks reasonable for all parameters.
Rhat looks reasonable for all parameters.
0.0 of 4000 (0.0%) iterations ended with a divergence.
0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.
E-BFMI indicated no pathological behavior.


0

Everything looks good, so we take a look at the corner plot.

In [26]:
bokeh.io.show(bebi103.viz.corner(samples_t, 
                                 pars=['Kd', 'f0', 'fq','noise','nu'],
                                 plot_width=200,
                                 cmap='gray',
                                 alpha=0.05))

We plot the marginal ECDFs of the three parameters and compare them with those from the normal likelihood.

In [27]:
df_t = bebi103.stan.to_dataframe(samples_t)

plots_t = [bebi103.viz.ecdf(df_t[param], x_axis_label=param, plot_height=200, plot_width=250) 
                 for param in ['Kd', 'f0', 'fq']]
                                      
bokeh.io.show(bokeh.layouts.gridplot(plots_t + plots, ncols=3))

They look very similar, so we check the posterior predictive results and compare them with those from the normal likelihood.

In [28]:
df_samples_ppc_t = bebi103.stan.extract_array(samples_t, name='F_ppc')

p3 = hw92_predictive(df_samples_ppc_t, conc_b, df['fluorescence'].values, perc=[99, 70, 50, 25], name='F_ppc', plot_width=500, plot_height=400, title='Student-t')

bokeh.io.show(bokeh.layouts.gridplot([[p2, p3]]))

We do not see a significant difference in the posterior predictive checks. The 99% range for the Student-t likelihood is slightly larger, without compromising the medians or the smaller percentile ranges, which might be the advantage of accounting for outliers. We move on and try the good-bad data model, shown below. The prior for the weights is a Beta distribution that gives a slight preference for good data, since we think most of the data points should be good.

\begin{gather}
K_d \sim \mbox{HalfNorm}(\sigma_{K_d}) \\
\hat{f}_0 \sim \mbox{Norm}(\mu_{f_0}, \sigma_{f_0}) \\
\hat{f}_q \sim \mbox{Norm}(\mu_{f_q}, \sigma_{f_q}) \\
\sigma \sim \mbox{HalfNorm}(\sigma_{noise}) \\
\sigma_{bad} \sim \mbox{HalfNorm}(\sigma_{noise}) \mbox{ with } \sigma_{bad} > \sigma\\
w_i \sim \mbox{Beta}(3,2) \\
F_{i, temp} = \hat{f}_0(c_a^0 - c_{ab}) + \hat{f}_q\, c_{ab}
= \hat{f}_0\,c_a^0 - \frac{2(\hat{f}_0 - \hat{f}_q)c_a^0\,c_b^0}{K_d+c_a^0+c_b^0 + \sqrt{\left(K_d+c_a^0+c_b^0\right)^2 - 4c_a^0\,c_b^0}} \\
F_i \sim w_i \mbox{Norm}(F_{i, temp}, \sigma) + (1 - w_i) \mbox{Norm}(F_{i, temp}, \sigma_{bad}).
\end{gather}
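The mixture likelihood in the last line is evaluated on the log scale. As an illustration of what Stan's `log_mix` computes for two components, here is a NumPy sketch using `np.logaddexp` for numerical stability; the numbers are placeholders and the helper functions are written here for illustration only. Note also that Beta(3, 2) has mean $3/(3+2) = 0.6$, which is the slight preference for good data.

```python
import numpy as np


def normal_logpdf(x, mu, sigma):
    """Log density of Norm(mu, sigma) at x."""
    return -0.5 * np.log(2 * np.pi) - np.log(sigma) - 0.5 * ((x - mu) / sigma) ** 2


def log_mix(w, lp1, lp2):
    """log(w * exp(lp1) + (1 - w) * exp(lp2)), computed stably."""
    return np.logaddexp(np.log(w) + lp1, np.log1p(-w) + lp2)


# Placeholder values: a measurement 4 'good' standard deviations from the curve
F_obs, F_theor = 140.0, 100.0
sigma_good, sigma_bad = 10.0, 50.0
w = 0.6

lp_good = normal_logpdf(F_obs, F_theor, sigma_good)
lp_bad = normal_logpdf(F_obs, F_theor, sigma_bad)
lp_mix = log_mix(w, lp_good, lp_bad)

# Probability that this point came from the 'good' component, given w
p_good = np.exp(np.log(w) + lp_good - lp_mix)
print(lp_mix, p_good)
```

For a point this far from the curve, the wide 'bad' component dominates the mixture, so the point has little leverage on the fitted parameters.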

In [29]:
model_code_mix = """
data {
  // Number of datapoints
  int N;
  // Conc of a
  int ca0;
  // Conc of b
  real cb0[N];
  // Measured fluorescence
  real F[N];
}


parameters {
  real<lower=0> Kd;
  real<lower=0> f0;
  real<lower=0> fq;
  positive_ordered[2] noise;
  real<lower=0, upper=1> w[N];
}

transformed parameters {
  real F_temp[N];
  for (i in 1:N) {
  // Generate calculated F for each point
    F_temp[i] = f0 * ca0 - (2 * (f0 - fq) * ca0 * cb0[i]) / (Kd + ca0 + cb0[i] + sqrt((Kd + ca0 + cb0[i])^2 - 4 * ca0 * cb0[i]));
  }
}


model {
  Kd ~ normal(0, 100);
  f0 ~ normal(9000, 2000);
  fq ~ normal(4500, 1000);
  noise ~ normal(0, 10000);
  w ~ beta(3,2);
  
  for (i in 1:N) {
    target += log_mix(w[i],
                      normal_lpdf(F[i] | F_temp[i], noise[1]),
                      normal_lpdf(F[i] | F_temp[i], noise[2]));
  }

}

generated quantities {
  real F_ppc[N];
  
  // Posterior predictive check
  for (i in 1:N) {
    if (uniform_rng(0.0, 1.0) < w[i]) {
      F_ppc[i] = normal_rng(F_temp[i], noise[1]);
    }
    else {
      F_ppc[i] = normal_rng(F_temp[i], noise[2]);
    }    
  }
}
"""

The Stan code is attached above for reference, but the model is compiled from the standalone file.

In [30]:
# Compile from the standalone file
sm_mix = bebi103.stan.StanModel(file='hw92_mix.Stan')

INFO:pystan:COMPILING THE C++ CODE FOR MODEL anon_model_106a49bf901827b5cc7c3fa2a7cf294e NOW.


This model gives a few iterations with divergences, so we sample with a larger adapt_delta.

In [31]:
# Sample with the same data
samples_mix = sm_mix.sampling(data=data, control=dict(adapt_delta=0.96))
# Run diagnostics
bebi103.stan.check_all_diagnostics(samples_mix)

n_eff / iter looks reasonable for all parameters.
Rhat looks reasonable for all parameters.
0.0 of 4000 (0.0%) iterations ended with a divergence.
0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.
E-BFMI indicated no pathological behavior.


0

Everything is good, and we plot the corner plots.

In [32]:
bokeh.io.show(bebi103.viz.corner(samples_mix, 
                                 pars=['Kd', 'f0', 'fq','noise[1]','noise[2]'],
                                 plot_width=200,
                                 cmap='gray',
                                 alpha=0.05))

There might be some differences in the values of $K_d$, but no significant ones. To make sure, we plot the ECDFs of $K_d$ from the three models together.

In [33]:
df_mix = bebi103.stan.to_dataframe(samples_mix)

pc = bebi103.viz.ecdf(df_normal['Kd'], x_axis_label='K_d', plot_height=300, plot_width=400, color=color_palette[0])
pc = bebi103.viz.ecdf(df_t['Kd'], color=color_palette[1], p=pc)
pc = bebi103.viz.ecdf(df_mix['Kd'], color=color_palette[2], p=pc)

bokeh.io.show(pc)

They are almost the same. We then plot the posterior predictive results from all three models.

In [34]:
df_samples_ppc_mix = bebi103.stan.extract_array(samples_mix, name='F_ppc')

p4 = hw92_predictive(df_samples_ppc_mix, conc_b, df['fluorescence'].values, perc=[99, 70, 50, 25], name='F_ppc', plot_width=500, plot_height=400, title='Good-bad')

bokeh.io.show(bokeh.layouts.gridplot([[p2, p3, p4]]))

Again, they look much the same. The good-bad data model looks somewhat nicer: the smaller percentile ranges are tighter without compromising the fit, while the 99% range still covers the outliers. The parameter estimates, however, are almost the same. Finally, we put the estimates of $K_d$ together and compare them. We first collect them in a data frame.

In [35]:
# Name of the parameters
pars = ['Kd', 'f0', 'fq']
# Initialize the data frame
df_pars = pd.DataFrame()
# List the range of those parameters
for parm in pars:
    sample_temp = samples_normal.extract(parm)[parm]
    df_pars = df_pars.append(pd.DataFrame({'parameter':[parm],
                             'method':['normal'],
                             'low':[np.percentile(sample_temp, 2.5)],
                             'middle':[np.median(sample_temp)],
                             'high':[np.percentile(sample_temp, 97.5)]}),
                            ignore_index=True)

for parm in pars:
    sample_temp = samples_t.extract(parm)[parm]
    df_pars = df_pars.append(pd.DataFrame({'parameter':[parm],
                             'method':['student'],
                             'low':[np.percentile(sample_temp, 2.5)],
                             'middle':[np.median(sample_temp)],
                             'high':[np.percentile(sample_temp, 97.5)]}),
                            ignore_index=True)
    
for parm in pars:
    sample_temp = samples_mix.extract(parm)[parm]
    df_pars = df_pars.append(pd.DataFrame({'parameter':[parm],
                             'method':['good_bad'],
                             'low':[np.percentile(sample_temp, 2.5)],
                             'middle':[np.median(sample_temp)],
                             'high':[np.percentile(sample_temp, 97.5)]}),
                            ignore_index=True)

df_pars

Unnamed: 0,parameter,method,low,middle,high
0,Kd,normal,1.892086,7.327506,20.763448
1,f0,normal,8469.526148,8862.608261,9208.715199
2,fq,normal,4856.396712,5205.658146,5483.975156
3,Kd,student,1.947378,7.753199,22.884482
4,f0,student,8496.199541,8846.596314,9185.701154
5,fq,student,4873.187513,5215.595121,5476.359565
6,Kd,good_bad,2.136518,9.258146,21.541771
7,f0,good_bad,8560.688223,8829.293131,9089.9984
8,fq,good_bad,4975.549224,5237.278568,5450.360084


Then we borrow the code from the tutorial to plot the estimates of $K_d$ from the three models.

In [36]:
# Ordering of y-axis
order = [(g, m) for g in ['Kd'] for m in ['normal', 'student', 'good_bad']]

# Build data source and color factors for plots
grouped = df_pars.groupby(['parameter', 'method'])
cat_range, factors, color_factors = bebi103.viz._get_cat_range(
    df_pars, grouped, order, 'parameter', True)
source = bebi103.viz._cat_source(df_pars, ['parameter', 'method'], list(df_pars.columns), 'parameter')
color = bokeh.transform.factor_cmap('parameter',
                                     palette=['#f28e2b', '#e15759', '#4e79a7'],
                                     factors=color_factors)

# Make plots
p = bokeh.plotting.figure(y_range=cat_range, plot_height=300)
p.circle(source=source, x='middle', y='cat', color=color)
p.segment(source=source, y0='cat', y1='cat', x0='low', x1='high', color=color)

bokeh.io.show(p)

Thus, the results are almost the same whether or not we try to account for outliers. The good-bad data model may give slightly better results, but for a data set like this one, which has only a few outliers, outlier detection does not improve the model much, and if it requires much more effort, we think it is not worth doing. If the data set is large and one is not sure how many outliers there are, building in some outlier detection may be a good idea.

In [37]:
%load_ext watermark

In [38]:
%watermark -v -p numpy,scipy,bokeh,jupyterlab

CPython 3.7.0
IPython 7.0.1

numpy 1.15.2
scipy 1.1.0
bokeh 0.13.0
jupyterlab 0.35.0
