In [1]:
import pymc as pm
import pandas as pd
import numpy as np
import arviz as az

%load_ext lab_black
%load_ext watermark

# Paraguay vaccination status

This example fits a multilevel (hierarchical) logistic regression model and shows how to use PyMC coordinates (`coords` and `dims`).

Adapted from [Unit 7: paraguay.odc](https://raw.githubusercontent.com/areding/6420-pymc/main/original_examples/Codes4Unit7/paraguay.odc) and [paraguaynocluster.odc](https://raw.githubusercontent.com/areding/6420-pymc/main/original_examples/Codes4Unit7/paraguaynocluster.odc)

Data can be found [here](https://raw.githubusercontent.com/areding/6420-pymc/main/data/paraguay.csv).

Associated lecture video: Unit 7 lesson 19

## Problem statement

This example considers factors influencing the vaccination status of 3424 children born to 2552 mothers in 264 clusters in Paraguay. We are specifically interested in mother-level factors related to child immunization, but outcomes also vary randomly from cluster to cluster.

- `ID3`: cluster number
- `VACCODE`: 1 if fully immunized, 0 otherwise
- `LB.TOT`: number of live births
- `MAGE2`: 1 if mother's age is under 20, 0 otherwise
- `UN2`: 1 if consensual union, 0 otherwise
- `TOILET2`: 1 if unsafe toilet in household, 0 otherwise
- `PR.SPOC1`: 1 if spouse is an unskilled laborer, 0 otherwise
- `SPANISH2`: 1 if Spanish is not the household language, 0 otherwise

## Notes

The model needs a random effect for each cluster. PyMC coordinates are a good fit here: they let us name the cluster dimension and define one effect per cluster.
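As a reading aid, the model fitted in the cells below can be written out as follows. The notation is an assumption of this sketch rather than something taken from the original BUGS files: $u_j$ denotes the random effect of cluster $j$, $c(i)$ the cluster of child $i$, and $N(\mu, \sigma^2)$ is parameterized by its variance, so a precision of $10^{-3}$ corresponds to a variance of $10^{3}$.

$$
\begin{aligned}
y_i \mid p_i &\sim \text{Bernoulli}(p_i), \\
\text{logit}(p_i) &= \beta_0 + \sum_{k=1}^{6} \beta_k x_{ik} + u_{c(i)}, \\
u_j \mid \tau &\sim N(0, 1/\tau), \quad j = 1, \dots, 264, \\
\beta_k &\sim N(0, 10^{3}), \quad k = 0, \dots, 6, \\
\tau &\sim \text{Gamma}(0.01, 0.01).
\end{aligned}
$$

The reported `cluster_variance` is $1/\tau$.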

In [2]:
data = pd.read_csv("../data/paraguay.csv")
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3424 entries, 0 to 3423
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   ID3       3424 non-null   int64
 1   VACCODE   3424 non-null   int64
 2   LB.TOT    3424 non-null   int64
 3   MAGE2     3424 non-null   int64
 4   UN2       3424 non-null   int64
 5   TOILET2   3424 non-null   int64
 6   PR.SPOC1  3424 non-null   int64
 7   SPANISH2  3424 non-null   int64
dtypes: int64(8)
memory usage: 214.1 KB


In [3]:
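y = data["VACCODE"].to_numpy()
# cluster labels, one per child (factorized into integer indices in a later cell)
clusters = data["ID3"].to_numpy()
# design matrix of the six predictors, with a column of ones prepended for the intercept
X = data.drop(["VACCODE", "ID3"], axis=1).to_numpy()
X_aug = np.concatenate((np.ones((X.shape[0], 1)), X), axis=1)
y.shape, clusters.shape, X_aug.shape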

((3424,), (3424,), (3424, 7))

In [4]:
cols = X_aug.shape[1]

In [5]:
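# encode ID3 as integer indices 0..n_clusters-1 and keep the unique cluster labels;
# note that this rebinds `clusters` to the unique labels
cluster_idx, clusters = pd.factorize(data.ID3)
# named dimensions: one entry per cluster and one per row (child)
coords = {"cluster": clusters, "id": data.index.to_numpy()}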
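
To make the role of `pd.factorize` concrete, here is a minimal sketch on toy data. The names `toy_ids`, `toy_idx`, `toy_labels`, and `toy_effect` are hypothetical and the values are made up. It shows how the integer codes let a vector with one entry per cluster be expanded to one entry per row, which is the same pattern as `cluster_effect[clust_idx]` in the model.

```python
import numpy as np
import pandas as pd

# Toy cluster labels, one per row (made-up values, not from the Paraguay data)
toy_ids = pd.Series([10, 10, 30, 20, 30])

# toy_idx: integer code of each row's cluster
# toy_labels: unique labels in order of first appearance
toy_idx, toy_labels = pd.factorize(toy_ids)

# One made-up effect per unique cluster, in the same order as toy_labels
toy_effect = np.array([0.5, -1.0, 2.0])

# Integer-array indexing expands per-cluster effects to one value per row
per_row_effect = toy_effect[toy_idx]

print(toy_idx)            # codes per row
print(list(toy_labels))   # unique cluster labels
print(per_row_effect)     # effect assigned to each row
```

Because the codes follow order of first appearance rather than sorted order, the per-cluster vector has to be read against `toy_labels` (here, the `cluster` coordinate), not against the raw `ID3` values.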

In [6]:
# the coords dict is passed to pm.Model so that dims can refer to it
with pm.Model(coords=coords) as m:
    X_data = pm.Data("X_data", X_aug, mutable=True)
    y_data = pm.Data("y_data", y, mutable=True)
    clust_idx = pm.Data("cluster_idx", cluster_idx, dims="id", mutable=True)

    # prior on the precision of the cluster effects; track the variance as well
    cluster_tau = pm.Gamma("cluster_tau", 0.01, 0.01)
    cluster_variance = pm.Deterministic("cluster_variance", 1 / cluster_tau)
    # vague priors on the intercept and the six slopes
    beta = pm.Normal("beta", 0, tau=1e-3, shape=cols)

    # one random effect per cluster, picked out for each row by its cluster index
    cluster_effect = pm.Normal("cluster_effect", 0, tau=cluster_tau, dims="cluster")
    # p is the linear predictor on the logit scale, not a probability
    p = pm.math.dot(X_data, beta) + cluster_effect[clust_idx]

    pm.Bernoulli("likelihood", logit_p=p, observed=y_data)

    trace = pm.sample(3000)

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [cluster_tau, beta, cluster_effect]


Sampling 4 chains for 1_000 tune and 3_000 draw iterations (4_000 + 12_000 draws total) took 142 seconds.


In [7]:
az.summary(
    trace, var_names=["beta", "cluster_variance"], filter_vars="like", kind="stats"
)

| | mean | sd | hdi_3% | hdi_97% |
|---|---|---|---|---|
| beta[0] | 1.444 | 0.131 | 1.187 | 1.68 |
| beta[1] | -0.07 | 0.015 | -0.099 | -0.043 |
| beta[2] | -0.563 | 0.206 | -0.946 | -0.169 |
| beta[3] | -0.196 | 0.096 | -0.367 | -0.007 |
| beta[4] | -0.691 | 0.132 | -0.947 | -0.451 |
| beta[5] | -0.285 | 0.11 | -0.494 | -0.083 |
| beta[6] | -0.621 | 0.1 | -0.815 | -0.44 |
| cluster_variance | 0.531 | 0.095 | 0.361 | 0.716 |
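
The coefficients are on the logit scale, so exponentiating them gives odds ratios. The sketch below maps each `beta` index to its column and exponentiates the posterior means. It assumes the column order shown by `data.info()` after dropping `VACCODE` and `ID3`, with the intercept first. The names `names`, `post_mean`, and `odds_ratio` are introduced here for illustration. Note that the exponential of a posterior mean is only a rough point summary; it is not the posterior mean of the odds ratio, which would be computed from the draws in `trace`.

```python
import numpy as np
import pandas as pd

# Assumed column order: intercept, then the predictors as listed by data.info()
names = ["Intercept", "LB.TOT", "MAGE2", "UN2", "TOILET2", "PR.SPOC1", "SPANISH2"]

# Posterior means of beta[0]..beta[6], copied from the summary of the cluster model
post_mean = np.array([1.444, -0.07, -0.563, -0.196, -0.691, -0.285, -0.621])

# exp(beta) is the multiplicative change in the odds of full immunization
# per unit increase in the predictor (for the intercept, the baseline odds)
odds_ratio = pd.Series(np.exp(post_mean), index=names).round(3)
print(odds_ratio)
```

For example, under this reading the `TOILET2` coefficient of -0.691 corresponds to roughly half the odds of full immunization for children in households with an unsafe toilet, holding the other predictors and the cluster fixed.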


The use of coordinates here is based on [this example](https://oriolabrilpla.cat/python/arviz/pymc3/xarray/2020/09/22/pymc3-arviz.html).

## No clusters

In [8]:
with pm.Model() as m_nc:
    X_data = pm.Data("X_data", X_aug, mutable=True)
    y_data = pm.Data("y_data", y, mutable=True)

    beta = pm.Normal("beta", 0, tau=1e-3, shape=cols)

    p = pm.math.dot(X_data, beta)

    pm.Bernoulli("likelihood", logit_p=p, observed=y_data)

    trace_nc = pm.sample(3000)

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [beta]


Sampling 4 chains for 1_000 tune and 3_000 draw iterations (4_000 + 12_000 draws total) took 67 seconds.


In [9]:
# this model has no cluster_variance, so only beta is summarized
az.summary(trace_nc, var_names=["beta"], filter_vars="like", kind="stats")

| | mean | sd | hdi_3% | hdi_97% |
|---|---|---|---|---|
| beta[0] | 1.429 | 0.107 | 1.208 | 1.615 |
| beta[1] | -0.064 | 0.014 | -0.088 | -0.037 |
| beta[2] | -0.562 | 0.189 | -0.914 | -0.201 |
| beta[3] | -0.189 | 0.087 | -0.349 | -0.022 |
| beta[4] | -0.717 | 0.116 | -0.934 | -0.501 |
| beta[5] | -0.451 | 0.089 | -0.616 | -0.28 |
| beta[6] | -0.606 | 0.084 | -0.759 | -0.446 |
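
One way to compare the two fits is to line up the posterior means and standard deviations from the two summaries. The sketch below copies the reported numbers into a small table. The names `comparison`, `sd_ratio`, and `mean_shift` are made up for this illustration, and the predictor labels assume the column order shown by `data.info()`.

```python
import pandas as pd

# Assumed column order: intercept, then the predictors as listed by data.info()
names = ["Intercept", "LB.TOT", "MAGE2", "UN2", "TOILET2", "PR.SPOC1", "SPANISH2"]

comparison = pd.DataFrame(
    {
        # values copied from the two az.summary outputs
        "mean_cluster": [1.444, -0.07, -0.563, -0.196, -0.691, -0.285, -0.621],
        "mean_no_cluster": [1.429, -0.064, -0.562, -0.189, -0.717, -0.451, -0.606],
        "sd_cluster": [0.131, 0.015, 0.206, 0.096, 0.132, 0.11, 0.1],
        "sd_no_cluster": [0.107, 0.014, 0.189, 0.087, 0.116, 0.089, 0.084],
    },
    index=names,
)

# Ratio above 1 means the cluster model is less certain about that coefficient
comparison["sd_ratio"] = (comparison["sd_cluster"] / comparison["sd_no_cluster"]).round(2)
# Difference in posterior means between the two models
comparison["mean_shift"] = (comparison["mean_cluster"] - comparison["mean_no_cluster"]).round(3)

print(comparison[["sd_ratio", "mean_shift"]])
```

In these results every posterior standard deviation is somewhat larger with the cluster effect included, and the coefficient that moves most is `PR.SPOC1` (from -0.451 to -0.285). A plausible reading, offered as an interpretation rather than a conclusion of the original analysis, is that part of that predictor's apparent effect reflects differences between clusters.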


In [10]:
%watermark -n -u -v -iv -p aesara,aeppl

Last updated: Fri Feb 03 2023

Python implementation: CPython
Python version       : 3.11.0
IPython version      : 8.9.0

aesara: 2.8.10
aeppl : 0.1.1

numpy : 1.24.1
arviz : 0.14.0
pandas: 1.5.3
pymc  : 5.0.1

