In [1]:
import arviz as az
import numpy as np
import pymc as pm
from pymc.math import dot

%load_ext watermark
%watermark --iversions

arviz: 0.11.4
numpy: 1.22.3
pymc : 4.0.0b4



# Rats 

I'm just going to do ratsignorable2.odc for now since it is relevant for HW6. Eventually I'll add the other examples.

Adapted from [Codes for Unit 8: ratsignorable2.odc](https://www2.isye.gatech.edu/isye6420/supporting.html).

Associated lecture video: [Unit 8 Lesson 2](https://www.youtube.com/watch?v=T5vkLsIs3f8&list=PLv0FeK5oXK4l-RdT6DWJj0_upJOG2WKNO&index=83).

Data can be found [here](https://raw.githubusercontent.com/areding/6420-pymc/main/data/rats.txt).

We had a previous example about [Dugongs](https://areding.github.io/6420-pymc/Unit6-dugongs.html) that dealt with missing values in the observed data (the y values). This example shows how to handle missing values in the input data (x). It's still pretty easy: you can think of it as adding a second, very simple likelihood to the model, where the observed data is x and a single distribution fills in the missing values (see `x_imputed` in the model below).

For now I'm leaving the variable names the same as the BUGS example. Might go back and make them more descriptive later.

There are some differences with my version:

1. My gamma priors on tau are more informative. PyMC was having computational issues with the Gamma(.001, .001) priors the professor used. I don't know whether that's because of the sampling algorithm or a problem with the new computational backend. I will look into it more later; for now I just want to get these examples up.

2. I imputed the x values with a more informative prior for similar reasons. That Uniform(0, 500) prior seems kind of crazy to me, and I wanted to rule out more computational issues.

3. I got rid of the separate definition of the intercept as alpha; it is now beta[0].
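
To illustrate the third point, here is a minimal NumPy sketch of why prepending a 1 to the inputs lets the intercept live inside the coefficient vector. The numbers and the names `alpha`, `slopes`, and `x_with_one` are made up for illustration and are not taken from the rats data or the BUGS code.

```python
import numpy as np

# Hypothetical values, chosen only to illustrate the algebra.
alpha = 2.0                           # separate intercept
slopes = np.array([0.5, -1.0, 0.25])  # remaining coefficients
x = np.array([8.0, 15.0, 22.0])       # inputs

# Version with a separate intercept term.
separate = alpha + np.dot(slopes, x)

# Version with the intercept folded in as beta[0], paired with a leading 1 in x.
beta = np.concatenate(([alpha], slopes))
x_with_one = np.concatenate(([1.0], x))
folded = np.dot(beta, x_with_one)

print(separate, folded)
```

Both expressions give the same value, which is why the model can use a single `dot(beta, x)` once x carries a leading 1.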

I have not checked this model for any kind of correctness or compared the answers to the BUGS version. It may make no sense at all (considering the number of divergences, it probably doesn't)! But I hope it gives you an idea of how to handle the missing data question on HW6, which I have confirmed works well in PyMC.

In [2]:
# note that I prepended a 1 as the first value of x; it pairs with the intercept beta[0]
x = [1.0, 8.0, 15.0, 22.0, np.nan, 36.0]
y = np.loadtxt("./data/rats.txt")

# thanks Huaxiang, don't remember what this was for anymore ;)
# y = np.concatenate((np.ones((y.shape[0], 1)), y), axis=1)
y.shape

(30, 5)

In [3]:
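# create masked data: PyMC imputes the masked entries of an observed array
# replace NaN with a sentinel (-1), then mask every entry equal to the sentinel
# (this assumes -1 never occurs as a real value in y or x)
y = np.nan_to_num(y, nan=-1)
y = np.ma.masked_values(y, value=-1)

# x starts as a plain list; nan_to_num converts it to an array
x = np.nan_to_num(x, nan=-1)
x = np.ma.masked_values(x, value=-1)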
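
As a side note, NumPy can also build the mask straight from the NaNs with `np.ma.masked_invalid`, which avoids picking a sentinel value. This is only a sketch on a copy of the x values from above; the names `x_raw`, `x_masked`, and `n_missing` are mine, and I have not checked that PyMC handles the result identically to the sentinel approach used in the cell above.

```python
import numpy as np

# Same x values as in the notebook, including the leading 1 for the intercept.
x_raw = np.array([1.0, 8.0, 15.0, 22.0, np.nan, 36.0])

# Mask every NaN (and inf) entry directly, with no sentinel value involved.
x_masked = np.ma.masked_invalid(x_raw)

# Count how many entries are masked, i.e. how many values would be imputed.
n_missing = int(np.ma.count_masked(x_masked))

print(x_masked.mask)
print(n_missing)
```

One thing to keep in mind with this variant is that the underlying array still holds NaN at the masked position, whereas the sentinel approach stores -1 there.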

In [4]:
with pm.Model() as m:
    beta_c = pm.Normal("beta.c", 0, sigma=10)

    beta = pm.Normal("beta", beta_c, sigma=10, shape=x.shape[0])

    x_imputed = pm.Normal("x_imputed", mu=20, sigma=10, observed=x)

    mu = dot(beta, x_imputed)
    likelihood = pm.Normal("likelihood", mu, sigma=10, observed=y, shape=y.shape[0])

    trace = pm.sample(
        10000,
        tune=2000,
        cores=4,
        init="jitter+adapt_diag",
        step=[pm.NUTS(target_accept=.9)]
    )


Multiprocess sampling (4 chains in 4 jobs)
NUTS: [beta.c, beta, x_imputed_missing, likelihood_missing]


Sampling 4 chains for 2_000 tune and 10_000 draw iterations (8_000 + 40_000 draws total) took 938 seconds.
There were 6 divergences after tuning. Increase `target_accept` or reparameterize.
The chain reached the maximum tree depth. Increase max_treedepth, increase target_accept or reparameterize.
There were 18 divergences after tuning. Increase `target_accept` or reparameterize.
The chain reached the maximum tree depth. Increase max_treedepth, increase target_accept or reparameterize.
There were 8 divergences after tuning. Increase `target_accept` or reparameterize.
The chain reached the maximum tree depth. Increase max_treedepth, increase target_accept or reparameterize.
There were 11 divergences after tuning. Increase `target_accept` or reparameterize.
The chain reached the maximum tree depth. Increase max_treedepth, increase target_accept or rep
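
For reference, this is my reading of what the model cell above specifies, written out in math. It is a transcription of the code as written, not a verified specification, and it is not the BUGS model. The indices $j$, $i$, and $k$ are my own notation. Note that every standard deviation is fixed at 10, and that $\mu$ is a single scalar shared by all rats and time points because `dot` is applied to two 1-D vectors.

$$
\begin{aligned}
\beta_c &\sim \mathcal{N}(0,\ 10^2) \\
\beta_j &\sim \mathcal{N}(\beta_c,\ 10^2), \qquad j = 0, \dots, 5 \\
x_j &\sim \mathcal{N}(20,\ 10^2), \qquad j = 0, \dots, 5 \\
\mu &= \sum_{j=0}^{5} \beta_j x_j \\
y_{ik} &\sim \mathcal{N}(\mu,\ 10^2), \qquad i = 1, \dots, 30,\quad k = 1, \dots, 5
\end{aligned}
$$

Observed entries of $x$ and $y$ enter as data, while the masked entries are sampled as the parameters `x_imputed_missing` and `likelihood_missing` listed in the sampler output.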

In [6]:
az.summary(trace, hdi_prob=0.95)

| | mean | sd | hdi_2.5% | hdi_97.5% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat |
|---|---|---|---|---|---|---|---|---|---|
| beta.c | 1.635 | 4.447 | -6.979 | 10.523 | 0.034 | 0.024 | 17192.0 | 24017.0 | 1.0 |
| beta[0] | 1.676 | 10.921 | -20.038 | 22.626 | 0.077 | 0.054 | 20154.0 | 26506.0 | 1.0 |
| beta[1] | 1.693 | 10.315 | -18.382 | 21.913 | 0.068 | 0.048 | 23012.0 | 27194.0 | 1.0 |
| beta[2] | 1.892 | 9.715 | -17.021 | 20.989 | 0.066 | 0.047 | 21663.0 | 26443.0 | 1.0 |
| beta[3] | 2.128 | 8.979 | -15.001 | 20.132 | 0.065 | 0.046 | 19189.0 | 24807.0 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| likelihood_missing[53] | 213.454 | 10.084 | 193.169 | 232.510 | 0.061 | 0.043 | 27579.0 | 27965.0 | 1.0 |
| likelihood_missing[54] | 213.521 | 10.081 | 193.732 | 232.963 | 0.062 | 0.044 | 26500.0 | 27627.0 | 1.0 |
| likelihood_missing[55] | 213.512 | 10.069 | 194.295 | 233.529 | 0.060 | 0.042 | 28524.0 | 28207.0 | 1.0 |
| likelihood_missing[56] | 213.556 | 10.115 | 193.422 | 233.046 | 0.058 | 0.041 | 30936.0 | 29553.0 | 1.0 |


Notes:

Open question: is it not possible to impute missing values when the data is wrapped in `pm.Data(mutable=True)`?

Related reading:

https://github.com/pymc-devs/pymc/issues/4441

https://github.com/pymc-devs/pymc/pull/5295
