In [1]:
import arviz as az
import numpy as np
import pymc as pm
from pymc.math import dot

%load_ext watermark
%watermark --iversions

arviz: 0.11.4
numpy: 1.22.3
pymc : 4.0.0b4



# Rats 

I'm just going to do ratsignorable2.odc for now since it is relevant for HW6. Eventually I'll add the other examples.

Adapted from [Codes for Unit 8: ratsignorable2.odc](https://www2.isye.gatech.edu/isye6420/supporting.html).

Associated lecture video: [Unit 8 Lesson 2](https://www.youtube.com/watch?v=T5vkLsIs3f8&list=PLv0FeK5oXK4l-RdT6DWJj0_upJOG2WKNO&index=83).

Data can be found [here](https://raw.githubusercontent.com/areding/6420-pymc/main/data/rats.txt).

We had a previous example about [Dugongs](https://areding.github.io/6420-pymc/Unit6-dugongs.html) that dealt with missing values in the observed response (the y values). This example shows how to deal with missing values in the inputs (x). It's still pretty easy: you can think of it as adding a second, very simple likelihood to the model, where the observed data is x and a single distribution fills in the missing values (see `x_imputed` in the model below).

For now I'm leaving the variable names the same as in the BUGS example. I might go back and make them more descriptive later.

There are some differences with my version:

1. My gamma priors on tau are more informative. PyMC was having some computational issues with the Gamma(.001, .001) priors the professor used. I don't know whether that's because of the sampling algorithm or a problem with the new computational backend. I will look into it more later; for now I just want to get these examples up.

2. I imputed the x values with a more informative prior for similar reasons. That Uniform(0, 500) prior seems kind of crazy to me, and I wanted to rule out more computational issues.

3. I got rid of the separate definition of the intercept as alpha; it is now `beta[0]`.
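
To make the difference between the priors in item 1 concrete, here is a small sketch comparing their moments. It assumes the shape/rate parameterization of the Gamma distribution, and the helper names `gamma_moments` and `precision_to_sd` are made up for illustration. It also converts the `tau=1e-6` precision on `beta.c` into a standard deviation.

```python
import math


def gamma_moments(shape, rate):
    """Return (mean, variance) of a Gamma distribution in shape/rate form."""
    return shape / rate, shape / rate**2


def precision_to_sd(tau):
    """Convert a precision (tau) to the equivalent standard deviation."""
    return 1.0 / math.sqrt(tau)


vague_mean, vague_var = gamma_moments(0.001, 0.001)  # prior from the BUGS example
mine_mean, mine_var = gamma_moments(1, 1)  # prior used in this notebook

print(f"Gamma(.001, .001): mean={vague_mean:.1f}, variance={vague_var:.1f}")
print(f"Gamma(1, 1):       mean={mine_mean:.1f}, variance={mine_var:.1f}")

# the Normal(0, tau=1e-6) prior on beta.c, expressed as a standard deviation
print(f"tau=1e-6 -> sd={precision_to_sd(1e-6):.1f}")
```

Both gamma priors have the same mean, but the variance of the Gamma(.001, .001) prior is a thousand times larger.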

I have not checked this model for correctness or compared the answers to the BUGS version. It may make no sense at all (considering the number of divergences, it probably doesn't)! But I hope it gives you an idea of how to handle the missing data question on HW6, which I have confirmed works well in PyMC.

In [32]:
# note that I prepended 1.0 to x; it multiplies beta[0], which acts as the intercept
x = [1.0, 8.0, 15.0, 22.0, np.nan, 36.0]
y = np.loadtxt("./data/rats.txt")
y.shape

(30, 5)

In [44]:
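# create masked data: replace NaN with a -1 sentinel, then mask the sentinel
# (this assumes -1 never occurs as a real value in x or y)
# np.nan_to_num returns a new array, so the originals are not modified in place
y = np.nan_to_num(y, nan=-1)
y = np.ma.masked_values(y, value=-1)

# x starts out as a plain list; nan_to_num converts it to an array
x = np.nan_to_num(x, nan=-1)
x = np.ma.masked_values(x, value=-1)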
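
The sentinel approach above relies on -1 never being a real value in the data. As an alternative sketch (not what the cell above does), NumPy's `np.ma.masked_invalid` masks NaN entries directly. The array `x_demo` is a copy of the x values from earlier, renamed so it doesn't clobber anything.

```python
import numpy as np

x_demo = np.array([1.0, 8.0, 15.0, 22.0, np.nan, 36.0])

# mask NaN (and inf) entries without going through a sentinel value
x_masked = np.ma.masked_invalid(x_demo)

print(x_masked.mask)  # True marks the missing entry
print(x_masked.count())  # number of non-missing values
```

Either way, the point is to hand PyMC a masked array as `observed`; the masked entries then show up in the sampler output as `x_imputed_missing`.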

In [45]:
with pm.Model() as m:
    tau_c = pm.Gamma("tau.c", 1, 1)
    beta_c = pm.Normal("beta.c", 0, tau=1e-6)
    beta_tau = pm.Gamma("beta.tau", 1, 1)

    beta = pm.Normal("beta", beta_c, tau=beta_tau, shape=6)

    x_imputed = pm.Normal("x_imputed", mu=20, sigma=10, observed=x)

    mu = dot(beta, x_imputed)
    likelihood = pm.Normal("likelihood", mu, tau=tau_c, observed=y)

    trace = pm.sample(
        10000,
        tune=2000,
        cores=4,
        init="jitter+adapt_diag",
    )


Multiprocess sampling (4 chains in 4 jobs)
NUTS: [tau.c, beta.c, beta.tau, beta, x_imputed_missing, likelihood_missing]


  return _boost._beta_ppf(q, a, b)
Sampling 4 chains for 2_000 tune and 10_000 draw iterations (8_000 + 40_000 draws total) took 49 seconds.
There were 1038 divergences after tuning. Increase `target_accept` or reparameterize.
There were 5858 divergences after tuning. Increase `target_accept` or reparameterize.
The acceptance probability does not match the target. It is 0.6115, but should be close to 0.8. Try to increase the number of tuning steps.
There were 2123 divergences after tuning. Increase `target_accept` or reparameterize.
There were 6350 divergences after tuning. Increase `target_accept` or reparameterize.
The acceptance probability does not match the target. It is 0.5889, but should be close to 0.8. Try to increase the number of tuning steps.
The rhat statistic is larger than 1.01 for some parameters. This indicates problems during sampling. See https://arxiv.org/abs/190
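
To put those warnings in perspective, here is the arithmetic on the divergence counts reported above. The numbers are copied from the sampler output (one warning per chain, 10,000 post-tuning draws each); the list order is just the order the warnings were printed, which may not match the chain index.

```python
# divergences from the four warnings in the sampler output above
divergences = [1038, 5858, 2123, 6350]
draws_per_chain = 10_000

total_divergent = sum(divergences)
total_draws = draws_per_chain * len(divergences)
fraction = total_divergent / total_draws

print(f"{total_divergent} of {total_draws} draws diverged ({fraction:.1%})")

# share of divergent draws within each chain, in printed order
for n in divergences:
    print(f"{n} divergences -> {n / draws_per_chain:.1%} of that chain")
```

More than a third of the draws diverged, so the summaries that follow should be read with a lot of suspicion.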

In [46]:
az.summary(trace, hdi_prob=0.95)

| | mean | sd | hdi_2.5% | hdi_97.5% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat |
|---|---|---|---|---|---|---|---|---|---|
| beta.c | 2.087 | 1.349 | 0.254 | 3.792 | 0.063 | 0.050 | 616.0 | 691.0 | 1.01 |
| beta[0] | 2.057 | 2.241 | -2.291 | 6.402 | 0.110 | 0.078 | 451.0 | 293.0 | 1.01 |
| beta[1] | 2.106 | 2.133 | -1.931 | 6.110 | 0.089 | 0.063 | 565.0 | 503.0 | 1.01 |
| beta[2] | 2.033 | 1.972 | -1.388 | 5.815 | 0.077 | 0.059 | 566.0 | 891.0 | 1.01 |
| beta[3] | 2.142 | 1.842 | -1.536 | 5.838 | 0.103 | 0.073 | 365.0 | 197.0 | 1.01 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| likelihood_missing[55] | 215.260 | 58.663 | 99.309 | 329.502 | 1.837 | 1.299 | 1020.0 | 1697.0 | 1.01 |
| likelihood_missing[56] | 214.769 | 56.703 | 104.572 | 323.976 | 1.639 | 1.159 | 1195.0 | 2565.0 | 1.00 |
| likelihood_missing[57] | 215.084 | 58.134 | 98.220 | 327.924 | 1.943 | 1.501 | 948.0 | 1183.0 | 1.00 |
| tau.c | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 670.0 | 1007.0 | 1.01 |
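
One possible reason the imputed y values all look alike: in NumPy, the dot product of two 1-D vectors of equal length is a single scalar. If `pymc.math.dot` follows the same rule for 1-D inputs (an assumption, not verified here), then `mu` in the model above is one number shared by all 30 rats and all 5 time points. The sketch below uses made-up stand-in values, named `beta_demo` and `x_demo`, just to show the shapes.

```python
import numpy as np

beta_demo = np.full(6, 2.0)  # stand-in values, not estimates
# 29.0 is a placeholder for the missing x value, chosen only for illustration
x_demo = np.array([1.0, 8.0, 15.0, 22.0, 29.0, 36.0])
y_shape = (30, 5)  # shape of the rats data loaded earlier

mu_demo = np.dot(beta_demo, x_demo)
print(np.ndim(mu_demo))  # dimensionality of the linear predictor

# a scalar mean broadcasts to every cell of the 30 x 5 response
print(np.broadcast_to(mu_demo, y_shape).shape)
```

If that reading is right, getting a mean that varies by rat and by time point would take a different construction than a single vector dot product.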


In [47]:
trace

Notes:

Open question: is it not possible to impute missing data with `pm.Data(mutable=True)`? Related reading:
https://github.com/pymc-devs/pymc/issues/4441
https://github.com/pymc-devs/pymc/pull/5295
