## [Conditional Independence, Subsampling, and Amortization](http://pyro.ai/examples/svi_part_ii.html#SVI-Part-II:-Conditional-Independence,-Subsampling,-and-Amortization)

The model has observations ${\bf x}$ and latent random variables ${\bf z}$ as well as parameters $\theta$. It has a joint probability density of the form:

$$p_{\theta}({\bf x}, {\bf z}) = p_{\theta}({\bf x}|{\bf z}) p_{\theta}({\bf z})$$



$$\log p_{\theta}({\bf x}) = \log \int\! d{\bf z}\; p_{\theta}({\bf x}, {\bf z})$$

$$\theta_{\rm{max}} = \underset{\theta}{\operatorname{argmax}} \log p_{\theta}({\bf x})$$

$$p_{\theta_{\rm{max}}}({\bf z} | {\bf x}) = \frac{p_{\theta_{\rm{max}}}({\bf x} , {\bf z})}{
\int \! d{\bf z}\; p_{\theta_{\rm{max}}}({\bf x} , {\bf z}) }$$

The basic idea is that we introduce a parameterized distribution $q_{\phi}({\bf z})$, where $\phi$ are known as the variational parameters. This distribution is called the variational distribution in much of the literature, and in the context of Pyro it’s called the **guide** (one syllable instead of nine!).

**Pyro enforces that model() and guide() have the same call signature, i.e. both callables should take the same arguments**.

Learning will be set up as an optimization problem where each iteration of training takes a step in $\theta-\phi$ space that moves the guide closer to the exact posterior. To do this we need to define an appropriate objective function.

The **ELBO**, which is a function of both $\theta$ and $\phi$, is defined as an expectation w.r.t. samples from the guide:

$${\rm ELBO} \equiv \mathbb{E}_{q_{\phi}({\bf z})} \left [
\log p_{\theta}({\bf x}, {\bf z}) - \log q_{\phi}({\bf z})
\right]$$

$$\log p_{\theta}({\bf x}) - {\rm ELBO} =
\rm{KL}\!\left( q_{\phi}({\bf z}) \lVert p_{\theta}({\bf z} | {\bf x}) \right)$$
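The identity above can be checked numerically on a model small enough to enumerate. The following is a minimal sketch, assuming a hypothetical toy model with a single binary latent variable and one fixed observation; the probability values are illustrative and not taken from the tutorial.

```python
import math

# Hypothetical toy model: one binary latent z and a single fixed observation x.
prior = [0.7, 0.3]        # p(z=0), p(z=1)
likelihood = [0.2, 0.9]   # p(x | z=0), p(x | z=1) for the observed x
q = [0.5, 0.5]            # an arbitrary guide q(z)

# Joint p(x, z), the evidence p(x) = sum_z p(x, z), and the exact posterior.
joint = [prior[z] * likelihood[z] for z in (0, 1)]
evidence = sum(joint)
posterior = [joint[z] / evidence for z in (0, 1)]

# ELBO = E_q[log p(x, z) - log q(z)], computed exactly by enumeration.
elbo = sum(q[z] * (math.log(joint[z]) - math.log(q[z])) for z in (0, 1))

# KL(q || p(z | x)) = E_q[log q(z) - log p(z | x)].
kl = sum(q[z] * (math.log(q[z]) - math.log(posterior[z])) for z in (0, 1))

gap = math.log(evidence) - elbo
print(f"log evidence - ELBO = {gap:.6f}, KL = {kl:.6f}")
```

Because the KL divergence is non-negative, the ELBO never exceeds the log evidence, and the gap closes exactly when the guide matches the posterior.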

$$\sum_{i=1}^N \log p({\bf x}_i | {\bf z}) \approx  \frac{N}{M}
\sum_{i\in{\mathcal{I}_M}} \log p({\bf x}_i | {\bf z})$$
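A small simulation can make the subsampling approximation concrete. This is a sketch under stated assumptions: the per-datapoint terms are random numbers standing in for $\log p({\bf x}_i | {\bf z})$ at a fixed ${\bf z}$, and the helper `minibatch_estimate` is a hypothetical name introduced only for illustration.

```python
import random

random.seed(0)

N, M = 1000, 50
# Hypothetical per-datapoint terms standing in for log p(x_i | z) at a fixed z.
log_liks = [random.gauss(-1.0, 0.5) for _ in range(N)]
full_sum = sum(log_liks)

def minibatch_estimate(terms, batch_size):
    """Return (N / M) times the sum over a uniformly drawn minibatch of size M."""
    batch = random.sample(terms, batch_size)
    return len(terms) / batch_size * sum(batch)

# Averaging many independent estimates should approach the full sum.
estimates = [minibatch_estimate(log_liks, M) for _ in range(5000)]
mean_estimate = sum(estimates) / len(estimates)
print(f"full sum: {full_sum:.2f}, mean of estimates: {mean_estimate:.2f}")
```

Each individual estimate is noisy, but the $N/M$ scale factor makes it unbiased for the full sum, which is what allows it to be used inside a stochastic gradient step.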

To subsample like this in Pyro, the user first needs to make sure that the model and guide are written in such a way that Pyro can leverage the relevant conditional independencies. Let’s see how this is done. Pyro provides two language primitives for marking conditional independencies: plate and markov. Let’s start with the simpler of the two.

In [1]:
import os
import sys

import pyro
import torch

In [7]:
import pyro.distributions as dist
import torch.distributions.constraints as constraints
from pyro.distributions.testing.fakes import NonreparameterizedBeta
from pyro.infer import SVI, Trace_ELBO

In [3]:
assert pyro.__version__.startswith('1.8.2')

In [4]:
max_steps = 2

In [5]:
def param_abs_error(name, target):
    return torch.sum(torch.abs(target - pyro.param(name))).item()

In [13]:
from pyro.optim import Adam

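Pyro optimizers also accept a callable that returns the optimizer arguments for each parameter. Such a callable can tell Pyro to use a learning rate of 0.010 for the Pyro parameter my_special_parameter and a learning rate of 0.001 for all other parameters.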

For this model the observations are conditionally independent given the latent random variable latent_fairness. To explicitly mark this in Pyro we basically just need to replace the Python builtin range with the Pyro construct plate:

For example, pyro.plate is **not appropriate for temporal models** where each iteration of a loop depends on the previous iteration; in this case a range or **pyro.markov** should be used instead.

The variational parameters are torch.tensors. The **requires_grad flag is automatically set to True by pyro.param**.

___

$$p({\bf x}, {\bf z}, \beta) = p(\beta)
\prod_{i=1}^N p({\bf x}_i | {\bf z}_i) p({\bf z}_i | \beta)$$

$$q({\bf z}, \beta) = q(\beta) \prod_{i=1}^N q({\bf z}_i | \beta, \lambda_i)$$

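Here $\beta$ is a global latent random variable, the ${\bf z}_i$ are local latent random variables, and the $\lambda_i$ are local variational parameters. This is a slightly more complicated case, where subsampling appears in both the model and guide. To make things simple let’s keep the discussion somewhat abstract and avoid writing a complete model and guide.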

### Amortization

$$q(\beta) \prod_{i=1}^N q({\bf z}_i | f({\bf x}_i))$$

### ELBO Gradient Estimators

$${\rm ELBO} \equiv \mathbb{E}_{q_{\phi}({\bf z})} \left [
\log p_{\theta}({\bf x}, {\bf z}) - \log q_{\phi}({\bf z})
\right]$$

#### Reparameterizable Random Variables

$$\mathbb{E}_{q_{\phi}({\bf z})} \left [f_{\phi}({\bf z}) \right]=\mathbb{E}_{q({\bf \epsilon})} \left [f_{\phi}(g_{\phi}({\bf \epsilon})) \right]$$

$$\nabla_{\phi}\mathbb{E}_{q({\bf \epsilon})} \left [f_{\phi}(g_{\phi}({\bf \epsilon})) \right]=
\mathbb{E}_{q({\bf \epsilon})} \left [\nabla_{\phi}f_{\phi}(g_{\phi}({\bf \epsilon})) \right]$$
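The reparameterized gradient can be sketched directly in PyTorch. The example assumes a Normal guide with parameters `mu` and `log_sigma` and a hypothetical cost function `f(z) = z**2`, chosen because its expectation $\mu^2 + \sigma^2$ gives closed-form gradients to compare against.

```python
import torch

torch.manual_seed(0)

# Variational parameters phi = (mu, log_sigma) of a Normal guide (illustrative).
mu = torch.tensor(1.5, requires_grad=True)
log_sigma = torch.tensor(0.0, requires_grad=True)

def f(z):
    # Hypothetical cost function; E[z^2] = mu^2 + sigma^2 in closed form.
    return z ** 2

# z = g_phi(eps) = mu + sigma * eps, with eps ~ N(0, 1) independent of phi.
eps = torch.randn(100_000)
z = mu + log_sigma.exp() * eps

# The Monte Carlo average is differentiable in phi, so autograd yields
# an estimate of the gradient of the expectation.
f(z).mean().backward()
print(mu.grad.item(), log_sigma.grad.item())
```

The printed values should be close to the exact gradients $2\mu = 3$ and $2\sigma^2 = 2$, up to Monte Carlo noise.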

#### Non-reparameterizable Random Variables
What if we can’t do the above reparameterization? Unfortunately this is the case for many distributions of interest, for example all discrete distributions. In this case our estimator takes a somewhat more complicated form.

$$\nabla_{\phi}\mathbb{E}_{q_{\phi}({\bf z})} \left [
f_{\phi}({\bf z}) \right]=
\nabla_{\phi} \int d{\bf z} \; q_{\phi}({\bf z}) f_{\phi}({\bf z})$$

$$\int d{\bf z} \; \left \{ (\nabla_{\phi}  q_{\phi}({\bf z})) f_{\phi}({\bf z}) + q_{\phi}({\bf z})(\nabla_{\phi} f_{\phi}({\bf z}))\right \}$$

$$\nabla_{\phi}  q_{\phi}({\bf z}) =
q_{\phi}({\bf z})\nabla_{\phi} \log q_{\phi}({\bf z})$$

$$\mathbb{E}_{q_{\phi}({\bf z})} \left [
(\nabla_{\phi} \log q_{\phi}({\bf z})) f_{\phi}({\bf z}) + \nabla_{\phi} f_{\phi}({\bf z})\right]$$
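The resulting estimator can be tried out on a discrete variable without any autodiff machinery. This is a sketch under assumptions: the guide is a Bernoulli with success probability `p`, and the cost function `f` is a hypothetical one with no direct dependence on `p`, so only the first term of the estimator contributes.

```python
import random

random.seed(0)

p = 0.4  # the single variational parameter phi of q(z) = Bernoulli(p)

def f(z):
    # Hypothetical cost function with no direct dependence on p.
    return (z - 0.3) ** 2

def score(z, p):
    # d/dp log q(z): 1/p when z = 1 and -1/(1 - p) when z = 0.
    return 1.0 / p if z == 1 else -1.0 / (1.0 - p)

n_samples = 200_000
total = 0.0
for _ in range(n_samples):
    z = 1 if random.random() < p else 0
    # The gradient of f w.r.t. p is zero here, so the estimator reduces
    # to the score multiplied by the cost.
    total += score(z, p) * f(z)
estimate = total / n_samples

# Exact answer: E[f] = p f(1) + (1 - p) f(0), so the gradient is f(1) - f(0).
exact = f(1) - f(0)
print(f"score-function estimate: {estimate:.3f}, exact: {exact:.3f}")
```

The estimate converges to the exact gradient, but typically with higher variance than a reparameterized estimator would have.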

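$${\rm surrogate \;objective} \equiv
\log q_{\phi}({\bf z}) \overline{f_{\phi}({\bf z})} + f_{\phi}({\bf z})$$

Here the bar indicates that the term is held constant, i.e. it is not differentiated w.r.t. $\phi$.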

$$\nabla_{\phi} {\rm ELBO} = \mathbb{E}_{q_{\phi}({\bf z})} \left [
\nabla_{\phi} ({\rm surrogate \; objective}) \right]$$
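In an autodiff framework the surrogate objective can be built by detaching the cost where it multiplies the log probability. The following is a sketch, assuming a Bernoulli guide with success probability `p` and a hypothetical cost `f(z, p) = (z - p)**2` that depends on `p` directly, so both terms of the surrogate matter.

```python
import torch

torch.manual_seed(0)

# phi is the success probability of a Bernoulli guide (illustrative value).
p = torch.tensor(0.4, requires_grad=True)
q = torch.distributions.Bernoulli(probs=p)

# Bernoulli sampling is not reparameterizable: z carries no gradient to p.
z = q.sample((200_000,))

def f(z, p):
    # Hypothetical cost that also depends on p directly.
    return (z - p) ** 2

cost = f(z, p)
# log q(z) times the cost held constant, plus the cost itself, averaged
# over the samples.
surrogate = (q.log_prob(z) * cost.detach() + cost).mean()
surrogate.backward()

# Closed form: E[(z - p)^2] = p (1 - p), whose derivative is 1 - 2p.
exact = 1.0 - 2.0 * 0.4
print(p.grad.item(), exact)
```

Calling `detach()` plays the role of holding the cost fixed in the first term, so differentiating the surrogate reproduces both terms of the estimator.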

In [20]:
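class BernoulliBeta:
    def __init__(self, max_steps):
        # maximum number of inference steps to take
        self.max_steps = max_steps
        # hyperparameters of the Beta prior
        self.alpha0 = 10.0
        self.beta0 = 10.0
        # the dataset: six ones followed by four zeros
        self.data = torch.zeros(10)
        self.data[0:6] = torch.ones(6)
        self.n_data = self.data.size(0)

        # parameters of the exact Beta posterior, used to measure convergence
        self.alpha_n = self.data.sum() + self.alpha0
        self.beta_n = -self.data.sum() + torch.tensor(self.beta0 + self.n_data)

        # initial values of the two variational parameters
        self.alpha_q_0 = 15.0
        self.beta_q_0 = 15.0

    def model(self, use_decaying_avg_baseline):
        # sample latent_fairness from the Beta prior
        f = pyro.sample("latent_fairness", dist.Beta(self.alpha0, self.beta0))
        # plate marks the observations as conditionally independent given f
        with pyro.plate("data_plate"):
            pyro.sample("obs", dist.Bernoulli(f), obs=self.data)

    def guide(self, use_decaying_avg_baseline):
        # register the two variational parameters, constrained to be positive
        alpha_q = pyro.param("alpha_q", torch.tensor(self.alpha_q_0), constraint=constraints.positive)
        beta_q = pyro.param("beta_q", torch.tensor(self.beta_q_0), constraint=constraints.positive)
        # settings for a decaying-average baseline, attached to the site via infer
        baseline_dict = {
            'use_decaying_avg_baseline': use_decaying_avg_baseline,
            'baseline_beta': 0.90
        }

        # a Beta distribution that does not use reparameterized sampling
        pyro.sample("latent_fairness", NonreparameterizedBeta(alpha_q, beta_q), infer=dict(baseline=baseline_dict))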
    def do_inference(self, use_decaying_avg_baseline, tolerance=0.8):
        # start from a clean parameter store on every run
        pyro.clear_param_store()
        optimizer = Adam({'lr': 0.0005, 'betas': (0.93, 0.999)})
        svi = SVI(self.model, self.guide, optimizer, loss=Trace_ELBO())

        print("Doing inference with use_decaying_avg_baseline=%s" % use_decaying_avg_baseline)

        for k in range(self.max_steps):
            svi.step(use_decaying_avg_baseline)
            if k % 100 == 0:
                print('.', end="")
                sys.stdout.flush()
            # distance of each variational parameter from the exact posterior value
            alpha_error = param_abs_error("alpha_q", self.alpha_n)
            beta_error = param_abs_error("beta_q", self.beta_n)

            # stop early once both parameters are within tolerance
            if alpha_error < tolerance and beta_error < tolerance:
                break

        print("\nDid %d steps of inference." % k)
        print("Final absolute errors for the two variational parameters " +
            "are %.4f & %.4f" % (alpha_error, beta_error))

In [33]:
bbe = BernoulliBeta(max_steps=2000)

In [34]:
bbe.do_inference(use_decaying_avg_baseline=True)

Doing inference with use_decaying_avg_baseline=True
..
Did 146 steps of inference.
Final absolute errors for the two variational parameters are 0.7997 & 0.7949


In [35]:
bbe.do_inference(use_decaying_avg_baseline=False)

Doing inference with use_decaying_avg_baseline=False
....
Did 381 steps of inference.
Final absolute errors for the two variational parameters are 0.7987 & 0.7978
