In [94]:
import arviz as az
import numpy as np
import pymc as pm
from pymc.math import exp, invlogit

%load_ext lab_black
%load_ext watermark



# Rats

This example goes further into dealing with missing data in PyMC, including in the predictor variables.

Adapted from [unit 8: ratsignorable1.odc](https://raw.githubusercontent.com/areding/6420-pymc/main/original_examples/Codes4Unit8/ratsignorable1.odc), [ratsignorable2.odc](https://raw.githubusercontent.com/areding/6420-pymc/main/original_examples/Codes4Unit8/ratsignorable2.odc), and [ratsinformative.odc](https://raw.githubusercontent.com/areding/6420-pymc/main/original_examples/Codes4Unit8/ratsinformative.odc).

Data can be found [here](https://raw.githubusercontent.com/areding/6420-pymc/main/data/rats.txt).

## Associated lecture videos: Unit 8 Lesson 2

In [1]:
%%html
<iframe width="560" height="315" src="https://www.youtube.com/embed?v=xomK4tcePmc&list=PLv0FeK5oXK4l-RdT6DWJj0_upJOG2WKNO&index=83" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe>

## Problem statement

A previous example, [Dugongs](https://areding.github.io/6420-pymc/Unit6-dugongs.html), dealt with missing values in the observed data (y). This example shows how to deal with missing values in the input data (x) as well. It's still pretty easy: you can think of it as adding another, very simple likelihood to the model, one where the observed data is x and a single distribution fills in the missing values (see ```x_imputed``` in the model below).

Original paper [here.](https://www.jstor.org/stable/pdf/2289594.pdf)

"Gelfand et al 1990 consider the problem of missing data, and delete the last observation of cases 6-10, the last two from 11-20, the last 3 from 21-25 and the last 4 from 26-30. The appropriate data file is obtained by simply replacing data values by NA (see below). The model specification is unchanged, since the distinction between observed and unobserved quantities is made in the data file and not the model specification." (from the BUGS problem statement)
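
The deletion pattern described in the quote can be written down as a boolean mask. This is only a sketch of that pattern in NumPy; the names `deletions` and `missing` are made up for illustration, and the actual data file already contains the missing values, so nothing like this is needed to run the models.

```python
import numpy as np

n_rats, n_times = 30, 5

# (first case, last case, number of trailing observations deleted), 1-indexed
deletions = [(6, 10, 1), (11, 20, 2), (21, 25, 3), (26, 30, 4)]

# True marks a deleted (missing) observation
missing = np.zeros((n_rats, n_times), dtype=bool)
for first, last, n_drop in deletions:
    missing[first - 1 : last, n_times - n_drop :] = True

print(missing.sum(axis=1))  # number of missing observations per rat
print(missing.sum())  # total number of missing observations
```

Rats 1-5 keep all five observations, and the total number of deleted values is 5 + 20 + 15 + 20 = 60.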

This first example only has missing values in the response (y); the predictor x is fully observed.

In [21]:
# adjusting the shape of x for vectorized calculations (the BUGS example is written as a loop)
x = np.array([8.0, 15.0, 22.0, 29.0, 36.0])

In [51]:
# import y data and create mask (missing data is represented as nan in the file)
y = np.loadtxt("../data/rats.txt")
y = np.nan_to_num(y, nan=-1)  # nan to -1
y = np.ma.masked_values(y, value=-1)  # create mask
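
To see what the masking step produces, here is a small sketch on a toy array (two made-up rows standing in for the file, with nan marking unrecorded weights). It also shows ```np.ma.masked_invalid``` as a possible one-step alternative to the sentinel approach; whether that is preferable here is a matter of taste.

```python
import numpy as np

# Toy stand-in for the data file: nan marks a weight that was not recorded
y_toy = np.array(
    [
        [151.0, 199.0, 246.0, 283.0, 320.0],
        [145.0, 199.0, 249.0, np.nan, np.nan],
    ]
)

# Route used above: replace nan with a sentinel value, then mask the sentinel
y_sentinel = np.ma.masked_values(np.nan_to_num(y_toy, nan=-1), value=-1)

# Alternative: mask nan (and inf) entries directly
y_direct = np.ma.masked_invalid(y_toy)

print(y_sentinel.mask)
print(np.array_equal(y_sentinel.mask, y_direct.mask))
```

Both routes give the same mask. The sentinel route only works if the sentinel (-1 here) can never appear as a real data value, which holds for weights.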

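Parameter shapes in the original model:

- scalar: tau_c, alpha_c, alpha_tau, beta_c, beta_tau
- length 30 (one per rat): alpha, beta
- 30 x 5 (rats x measurements): mu, likelihood

A note on [broadcasting](https://numpy.org/doc/stable/user/basics.broadcasting.html): it's the only reason the vectorized model works. alpha and beta are declared with shape (30, 1), so ```alpha + beta * x``` with x of shape (5,) gives a (30, 5) result, one row per rat.

Another note: PyMC doesn't seem to like Gamma(0.001, 0.001) for tau here and gives a tau > 0 warning, maybe because values get too close to 0. The models in this notebook use Gamma(0.01, 0.01) instead.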
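
The broadcasting that the vectorized model relies on can be checked in plain NumPy. This is a sketch with random stand-in values for alpha and beta (they are not posterior draws); it compares the broadcast expression with the explicit double loop that the BUGS code uses.

```python
import numpy as np

rng = np.random.default_rng(0)

x = np.array([8.0, 15.0, 22.0, 29.0, 36.0])  # shape (5,)
alpha = rng.normal(100, 10, size=(30, 1))  # stand-in intercepts, one per rat
beta = rng.normal(6, 0.5, size=(30, 1))  # stand-in slopes, one per rat

# (30, 1) combined with (5,) broadcasts to (30, 5)
mu = alpha + beta * x
print(mu.shape)  # (30, 5)

# Same computation written as the BUGS-style loop over rats i and times j
mu_loop = np.empty((30, 5))
for i in range(30):
    for j in range(5):
        mu_loop[i, j] = alpha[i, 0] + beta[i, 0] * x[j]

print(np.allclose(mu, mu_loop))  # True
```

If alpha and beta had shape (30,) instead, ```alpha + beta * x``` would fail, because shapes (30,) and (5,) cannot be broadcast together.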

In [48]:
with pm.Model() as m:
    alpha_c = pm.Normal("alpha_c", 0, tau=1e-6)
    alpha_tau = pm.Gamma("alpha_tau", 0.01, 0.01)
    beta_c = pm.Normal("beta_c", 0, tau=1e-6)
    beta_tau = pm.Gamma("beta_tau", 0.01, 0.01)

    alpha = pm.Normal("alpha", alpha_c, tau=alpha_tau, shape=(30, 1)) # (30, 1) for broadcasting
    beta = pm.Normal("beta", beta_c, tau=beta_tau, shape=(30, 1))
    lik_tau = pm.Gamma("lik_tau", 0.01, 0.01)
    sigma = pm.Deterministic("sigma", 1 / lik_tau**0.5)

    mu = alpha + beta * x

    pm.Normal("likelihood", mu, tau=lik_tau, observed=y)

    trace = pm.sample(
        5000,
        tune=1000,
        cores=4,
    )

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [alpha_c, alpha_tau, beta_c, beta_tau, alpha, beta, lik_tau, likelihood_missing]


Sampling 4 chains for 1_000 tune and 5_000 draw iterations (4_000 + 20_000 draws total) took 84 seconds.
There were 2 divergences after tuning. Increase `target_accept` or reparameterize.
There were 11 divergences after tuning. Increase `target_accept` or reparameterize.
There was 1 divergence after tuning. Increase `target_accept` or reparameterize.
There were 9 divergences after tuning. Increase `target_accept` or reparameterize.
The number of effective samples is smaller than 25% for some parameters.


In [49]:
az.summary(
    trace,
    hdi_prob=0.95,
    var_names=["alpha_c", "alpha_tau", "beta_c", "beta_tau", "sigma"],
    kind="stats",
)

| | mean | sd | hdi_2.5% | hdi_97.5% |
|---|---|---|---|---|
| alpha_c | 101.223 | 2.297 | 96.64 | 105.714 |
| alpha_tau | 0.025 | 0.074 | 0.004 | 0.049 |
| beta_c | 6.567 | 0.164 | 6.238 | 6.884 |
| beta_tau | 2.865 | 1.208 | 0.953 | 5.181 |
| sigma | 6.009 | 0.652 | 4.768 | 7.307 |
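
The priors and most of the summarized parameters are precisions (tau), while sigma is a standard deviation. A minimal sketch of the conversion, assuming the usual relation tau = 1 / sigma^2 that the model itself uses for ```sigma```. The helper name `precision_to_sd` and the Gamma draws are made up for illustration and are not taken from the trace.

```python
import numpy as np


def precision_to_sd(tau):
    """Convert a precision (tau = 1 / sigma**2) to a standard deviation."""
    return 1.0 / np.sqrt(np.asarray(tau, dtype=float))


# Hypothetical precision draws, NOT samples from the model above
rng = np.random.default_rng(1)
tau_draws = rng.gamma(shape=2.0, scale=0.01, size=10_000)

# Convert draw by draw, then summarize
sd_draws = precision_to_sd(tau_draws)
print(sd_draws.mean())

# Converting the mean precision gives a different (smaller) number
print(precision_to_sd(tau_draws.mean()))
```

Because the transformation is nonlinear, it should be applied to each draw before summarizing rather than to a posterior mean. As a reference point, the tau=1e-6 hyperpriors correspond to a standard deviation of about 1000.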


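Notes:

- It's not clear whether missing data can be imputed when the data is wrapped in pm.Data(mutable=True).

    - Reading: [PyMC issue 4441](https://github.com/pymc-devs/pymc/issues/4441), [PyMC pull request 5295](https://github.com/pymc-devs/pymc/pull/5295)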


## Model 2: Imputing missing x data

In [58]:
# in the second example, the x data also has a missing value.
x = np.array([8.0, 15.0, 22.0, -1, 36.0])
x = np.ma.masked_values(x, value=-1)

In [59]:
x

masked_array(data=[8.0, 15.0, 22.0, --, 36.0],
             mask=[False, False, False,  True, False],
       fill_value=-1.0)

In [72]:
with pm.Model() as m:
    alpha_c = pm.Normal("alpha_c", 0, tau=1e-6)
    alpha_tau = pm.Gamma("alpha_tau", 0.01, 0.01)
    beta_c = pm.Normal("beta_c", 0, tau=1e-6)
    beta_tau = pm.Gamma("beta_tau", 0.01, 0.01)

    alpha = pm.Normal("alpha", alpha_c, tau=alpha_tau, shape=(30, 1))
    beta = pm.Normal("beta", beta_c, tau=beta_tau, shape=(30, 1))
    lik_tau = pm.Gamma("lik_tau", 0.01, 0.01)
    sigma = pm.Deterministic("sigma", 1 / lik_tau**0.5)

    x_imputed = pm.TruncatedNormal("x_imputed", mu=20, sigma=10, lower=0, observed=x)

    mu = alpha + beta * x_imputed

    pm.Normal("likelihood", mu, tau=lik_tau, observed=y)

    trace = pm.sample(5000)

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [alpha_c, alpha_tau, beta_c, beta_tau, alpha, beta, lik_tau, x_imputed_missing, likelihood_missing]


Sampling 4 chains for 1_000 tune and 5_000 draw iterations (4_000 + 20_000 draws total) took 90 seconds.
There were 6 divergences after tuning. Increase `target_accept` or reparameterize.
There were 14 divergences after tuning. Increase `target_accept` or reparameterize.
There was 1 divergence after tuning. Increase `target_accept` or reparameterize.
There were 2 divergences after tuning. Increase `target_accept` or reparameterize.


In [73]:
az.summary(
    trace,
    hdi_prob=0.95,
    var_names=[
        "alpha_c",
        "alpha_tau",
        "beta_c",
        "beta_tau",
        "sigma",
        "x_imputed_missing",
    ],
    kind="stats",
)

| | mean | sd | hdi_2.5% | hdi_97.5% |
|---|---|---|---|---|
| alpha_c | 101.872 | 2.366 | 97.278 | 106.499 |
| alpha_tau | 0.022 | 0.057 | 0.004 | 0.042 |
| beta_c | 6.507 | 0.169 | 6.187 | 6.853 |
| beta_tau | 3.023 | 1.293 | 1.054 | 5.6 |
| sigma | 5.974 | 0.657 | 4.799 | 7.356 |
| x_imputed_missing[0] | 29.502 | 0.389 | 28.751 | 30.274 |


## Model 3: Non-ignorable missingness

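Here the probability that an observation is missing depends on the weight itself. With b = log(1.01) in the logit model for missingness, the odds of an observation being missing increase by 1% for each unit increase in weight.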
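
The 1% figure refers to odds, not probabilities, which a quick numerical check makes concrete. This is a sketch with a hypothetical intercept and hypothetical weights; only b = log(1.01) comes from the model. The local `invlogit` is a plain NumPy version written for this check.

```python
import numpy as np


def invlogit(z):
    """Inverse logit (logistic function) in plain NumPy."""
    return 1.0 / (1.0 + np.exp(-z))


b = np.log(1.01)  # slope used in the model
a = -3.0  # hypothetical intercept, chosen only for illustration
y = np.array([151.0, 152.0, 251.0])  # hypothetical weights

p = invlogit(a + b * y)  # probability of missingness at each weight
odds = p / (1 - p)

print(odds[1] / odds[0])  # odds ratio for a 1-unit increase in weight
print(odds[2] / odds[0])  # odds ratio for a 100-unit increase in weight
```

The first ratio is 1.01 regardless of the intercept, and the second is 1.01 raised to the 100th power, so the effect compounds over larger weight differences.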


I don't think the model can actually be translated to PyMC in this way, because PyMC handles input data differently from BUGS. This version definitely isn't constraining the missing value of y in the same way. If someone wants to figure this out, I would appreciate it!

In [103]:
y = np.array([151.0, 199.0, 246.0, 283.0, -1])  # original value was 320
y = np.ma.masked_values(y, value=-1)  # create masked array
# note: can access mask with y.mask, equivalent to the "miss" array from the Professor's example
miss = y.mask
x = np.array([8.0, 15.0, 22.0, 29.0, 36.0])

In [108]:
with pm.Model() as m:
    a = pm.Logistic("a", mu=0, s=100)
    b = np.log(1.01)
    alpha = pm.Flat("alpha")
    beta = pm.Flat("beta")
    log_sigma = pm.Flat("log_sigma")
    tau = pm.Deterministic("tau", 1 / exp(2 * log_sigma))

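    # note: y here is the masked NumPy array, not the imputed random variable,
    # so p is not tied to the sampled value of the missing observation
    p = pm.Deterministic("p", invlogit(a + b * y))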
    missing = pm.Bernoulli("missing", p, observed=miss)

    mu = alpha + beta * x

    pm.Normal("likelihood", mu, tau=tau, observed=y)

    trace = pm.sample(5000)

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Apply node that caused the error: flat_rv{0, (), floatX, False}(RandomStateSharedVariable(<RandomState(MT19937) at 0x151830F40>), TensorConstant{[]}, TensorConstant{11})
Toposort index: 2
Inputs types: [RandomStateType, TensorType(int64, vector), TensorType(int64, scalar)]
Inputs shapes: ['No shapes', (0,), ()]
Inputs strides: ['No strides', (8,), ()]
Inputs values: [RandomState(MT19937) at 0x151830F40, array([], dtype=int64), array(11)]
Outputs clients: [['output'], [Shape(alpha), InplaceDimShuffle{x}(alpha)]]

Backtrace when the node is created (use Aesara flag traceback__limit=N to make it longer):
  File "/Users/aaron/mambaforge/envs/pymc/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3185, in run_cell_async
    has_raised = await self.run_ast_nodes(code_ast.body, cell_name,
  File "/Users/a

Sampling 4 chains for 1_000 tune and 5_000 draw iterations (4_000 + 20_000 draws total) took 19 seconds.
There were 611 divergences after tuning. Increase `target_accept` or reparameterize.
The acceptance probability does not match the target. It is 0.6379, but should be close to 0.8. Try to increase the number of tuning steps.
There were 3449 divergences after tuning. Increase `target_accept` or reparameterize.
The acceptance probability does not match the target. It is 0.02252, but should be close to 0.8. Try to increase the number of tuning st

In [112]:
az.summary(trace, hdi_prob=0.95)

| | mean | sd | hdi_2.5% | hdi_97.5% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat |
|---|---|---|---|---|---|---|---|---|---|
| a | -3.37 | 1.649 | -6.219 | -0.93 | 0.416 | 0.3 | 14.0 | 34.0 | 1.21 |
| alpha | 102.162 | 12.412 | 76.06 | 121.021 | 0.386 | 0.273 | 449.0 | 693.0 | 1.03 |
| beta | 6.351 | 0.639 | 5.331 | 7.639 | 0.02 | 0.014 | 433.0 | 924.0 | 1.02 |
| log_sigma | 1.761 | 0.582 | 0.728 | 2.964 | 0.076 | 0.054 | 45.0 | 215.0 | 1.12 |
| likelihood_missing[0] | 330.85 | 16.503 | 303.956 | 358.831 | 0.425 | 0.307 | 523.0 | 1164.0 | 1.01 |
| tau | 0.049 | 0.05 | 0.0 | 0.152 | 0.007 | 0.005 | 45.0 | 215.0 | 1.13 |
| p[0] | 0.221 | 0.193 | 0.0 | 0.553 | 0.064 | 0.048 | 14.0 | 34.0 | 1.27 |
| p[1] | 0.289 | 0.231 | 0.001 | 0.666 | 0.075 | 0.059 | 14.0 | 34.0 | 1.27 |
| p[2] | 0.363 | 0.26 | 0.001 | 0.761 | 0.081 | 0.066 | 14.0 | 34.0 | 1.27 |
| p[3] | 0.425 | 0.277 | 0.001 | 0.822 | 0.082 | 0.068 | 14.0 | 34.0 | 1.21 |


In [113]:
%watermark --iversions -v

Python implementation: CPython
Python version       : 3.10.1
IPython version      : 7.31.0

pymc : 4.0.0b1
numpy: 1.22.0
arviz: 0.11.4

