## [High-dimensional Bayesian workflow, with applications to SARS-CoV-2 strains](http://pyro.ai/examples/workflow.html#High-dimensional-Bayesian-workflow,-with-applications-to-SARS-CoV-2-strains)

#### The fastest way to find a good model of your data is to quickly discard many bad models, i.e. to iterate. In statistics we call this iterative workflow Box’s loop. 
#### An efficient workflow allows us to discard bad models as quickly as possible. Workflow efficiency demands that code changes to upstream components don’t break previous coding effort on downstream components. 
#### Pyro’s approaches to this challenge include strategies for variational approximations (pyro.infer.autoguide) and strategies for transforming model coordinate systems to improve geometry (pyro.infer.reparam).

1. Clean the data.

2. Create a generative model.

3. Sanity check using MAP or mean-field inference.

4. Create an initialization heuristic.

5. Reparameterize the model, evaluating results under mean field VI.

6. Customize the variational family (autoguides, easyguides, custom guides).

##### The model is a high-dimensional regression model with around 1000 coefficients, a multivariate logistic growth function (using a simple torch.softmax()), and a Multinomial likelihood. While the number of coefficients is relatively small, there are about 500,000 local latent variables to estimate, and plate structure in the model should lead to an approximately block diagonal posterior covariance matrix.
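
To make "multivariate logistic growth" concrete, here is a minimal sketch on toy tensors. It assumes that each (place, strain) pair has an intercept `init` and a growth rate `rate`, that logits grow linearly in time, and that the softmax is taken over the strain dimension. The sizes and variable names are illustrative and are not taken from the notebook's model.

```python
import torch

torch.manual_seed(0)
T, P, S = 4, 2, 3  # toy sizes: time steps, places, strains (hypothetical)

# Per-(place, strain) intercept and growth rate; random toy values.
init = torch.randn(P, S)
rate = 0.1 * torch.randn(P, S)

# Centered time axis, broadcast against the (P, S) parameters.
time = torch.arange(float(T))
time = time - time.mean()
logits = init + rate * time[:, None, None]  # shape (T, P, S)

# Multivariate logistic growth: softmax over the strain dimension
# turns logits into strain proportions at each time and place.
probs = torch.softmax(logits, dim=-1)

# Multinomial log-likelihood of toy counts, up to the combinatorial
# constant, which does not depend on the parameters.
counts = torch.randint(0, 10, (T, P, S)).float()
log_lik = (counts * torch.log_softmax(logits, dim=-1)).sum()

print(probs.sum(dim=-1))  # each entry should be 1 up to rounding
```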

In [1]:
from collections import defaultdict
from pprint import pprint
import functools
import math
import os

In [2]:
import torch
import pyro

In [3]:
import pyro.distributions as dist
import pyro.poutine as poutine
from pyro.distributions import constraints
from pyro.infer import SVI, Trace_ELBO

In [4]:
from pyro.infer.autoguide import (
    AutoDelta,
    AutoNormal,
    AutoMultivariateNormal,
    AutoLowRankMultivariateNormal,
    AutoGuideList,
    init_to_feasible
)

In [5]:
from pyro.infer.reparam import AutoReparam, LocScaleReparam

In [6]:
from pyro.nn.module import PyroParam

In [7]:
from pyro.optim import ClippedAdam

In [8]:
from pyro.ops.special import sparse_multinomial_likelihood

In [9]:
import matplotlib.pyplot as plt

In [11]:
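# Create new tensors on the GPU by default when one is available.
# Note: set_default_tensor_type is deprecated in recent PyTorch releases
# in favor of torch.set_default_device("cuda").
if torch.cuda.is_available():
    torch.set_default_tensor_type("torch.cuda.FloatTensor")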

___

In [12]:
from pyro.contrib.examples.nextstrain import load_nextstrain_counts

In [13]:
data_set = load_nextstrain_counts()

In [20]:
isinstance(data_set, dict)

True

In [25]:
for k, v in data_set.items():
    print(k, type(v))

start_date <class 'datetime.datetime'>
time_step_days <class 'int'>
locations <class 'list'>
lineages <class 'list'>
mutations <class 'list'>
features <class 'torch.Tensor'>
counts <class 'torch.Tensor'>
sparse_counts <class 'dict'>


In [21]:
isinstance(data_set, torch.Tensor)

False

#### The first step to using Pyro is creating a generative model, either a Python function or a pyro.nn.Module. Start simple. Start with a shallow hierarchy and later add latent variables to share statistical strength. Start with a slice of your data, then add a plate over multiple slices. Start with simple distributions like Normal, LogNormal, Poisson and Multinomial, then consider overdispersed versions like StudentT, Gamma, GammaPoisson/NegativeBinomial, and DirichletMultinomial. Keep your model simple and readable so you can share it and get feedback from domain experts. Use weakly informative priors.

#### Note that we scale coef by 1/100 because the quantity we want to model is very small, whereas the automatic parts of Pyro and PyTorch work best for numbers on the order of 1.0. When we later interpret coef in a volcano plot, we’ll need to apply the same scaling factor.

In [26]:
data_set.keys()

dict_keys(['start_date', 'time_step_days', 'locations', 'lineages', 'mutations', 'features', 'counts', 'sparse_counts'])

In [27]:
data_set['features'].shape

torch.Size([1316, 2634])

In [28]:
data_set['counts'].shape

torch.Size([27, 202, 1316])
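
A quick arithmetic check of these shapes against the "about 500,000 local latent variables" mentioned earlier. The sketch assumes the counts axes are (time, place, strain), the features axes are (strain, mutation), and the model has two local latent variables per place-strain pair (an intercept and a growth rate). None of this is stated in the cells above, so treat it as an assumption.

```python
# Shapes copied from the cell outputs above.
features_shape = (1316, 2634)
counts_shape = (27, 202, 1316)

# Assumed axis meanings (not stated in this section).
num_strains, num_mutations = features_shape
num_times, num_places, num_strains_in_counts = counts_shape

# The strain axis must agree between the two tensors.
assert num_strains == num_strains_in_counts

# Assumption: two local latents (intercept, growth rate) per place-strain pair.
num_local = 2 * num_places * num_strains

# Number of observed count cells across all times, places, and strains.
num_cells = num_times * num_places * num_strains

print(num_local)  # 531664
print(num_cells)  # 7177464
```

Under these assumptions the local latent count comes out near 530,000, in line with the rough figure quoted in the model description.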