In [1]:
import torch
import math
import numpy as np
dtype = torch.float
device = torch.device("cpu")


yobs = 8.0
sigma2 = 1.0


First we'll learn how to get the variational posterior for $x | y$ for a single observed value of $y$. The model is
$$y = 4x - x^2/2 + \epsilon, \qquad \epsilon \sim N(0,\sigma^2).$$

## Variational inference

Maximize the ELBO
$$ELBO(q) = E[\log p(y | x)] - KL(q(x) || p(x)).$$

We'll use $q(x) = N(m, s^2)$ as the variational family.
In this case, the ELBO can be computed exactly, but it's a pain to do, so I've approximated $E[\log p(y | x)]$ with a Monte Carlo average.
$\phi=(m, \log s^2)$ are the variational parameters.
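
Written out, the Monte Carlo approximation looks roughly like this. It is the reparameterized estimator that the code below uses; the sample size $N$ and the standard normal draws $z_i$ are notation introduced here for illustration.

$$E_q[\log p(y | x)] \approx \frac{1}{N}\sum_{i=1}^{N} \log p(y | x_i), \qquad x_i = m + s z_i, \quad z_i \sim N(0, 1),$$
$$\log p(y | x) = -\frac{(y - f(x))^2}{2\sigma^2} - \frac{1}{2}\log(2\pi\sigma^2), \qquad f(x) = 4x - x^2/2.$$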


$$KL(q(x) || p(x)) = E[\log q(x)] - E[\log p(x)],$$
where the expectations are with respect to $q$.

Note, if $p(x) \propto 1$, then $E[\log p(x)]$ does not depend on $\phi$, so it can be ignored.
If $q(x) = N(m, s^2)$, then $E[\log q(x)] = -0.5 \log(2 \pi s^2) - 1/2$.
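
As a sanity check on the formula for $E[\log q(x)]$, here is a small sketch that compares it with minus the entropy reported by `torch.distributions.Normal` and with a Monte Carlo estimate. The values of `m` and `log_s2` and the sample size are arbitrary choices for illustration, not values from this notebook's fit.

```python
import math
import torch
from torch.distributions import Normal

# Arbitrary illustrative values for the variational parameters (assumptions)
m = torch.tensor(4.0)
log_s2 = torch.tensor(math.log(0.8))
s = torch.exp(0.5 * log_s2)

# Closed form: E_q[log q(x)] = -0.5*log(2*pi*s^2) - 0.5
closed_form = -0.5 * math.log(2.0 * math.pi) - 0.5 * log_s2 - 0.5

# The same quantity is minus the differential entropy of q
neg_entropy = -Normal(m, s).entropy()

# Monte Carlo estimate: average log q(x) over samples drawn from q
torch.manual_seed(0)
x = Normal(m, s).sample((200_000,))
mc_estimate = Normal(m, s).log_prob(x).mean()

print(closed_form.item(), neg_entropy.item(), mc_estimate.item())
```

The three numbers should agree, the last one only up to Monte Carlo error.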

In [2]:
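def KL(log_var):
    # With a flat prior p(x), KL(q || p) reduces (up to a constant) to E_q[log q(x)],
    # i.e. minus the entropy of q(x) = N(m, exp(log_var)).
    return(-0.5*np.log(2.0*math.pi)-0.5*log_var-0.5)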

def f(x):
    return(4.*x-0.5*torch.pow(x,2.))
# x is a torch tensor and f acts elementwise on it, so x does not need declaring.




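def ZtoX(m, log_s2):
    # Reparameterization trick: x = m + s*z with z ~ N(0,1).
    # Z is read from the global scope; it is drawn in the training loop.
    return(Z*torch.exp(log_s2/2.)+m)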


In [3]:

def log_dnorm(x, mean, var):
    return(-(x-mean).pow(2)/(2.*var)-0.5*np.log(2*math.pi*var))
    # checked - could use the built-in torch.distributions.Normal log_prob instead



def Eloglike(m,log_s2):
    # Monte Carlo estimate of E_q[log p(yobs | x)] using the reparameterized samples
    X = ZtoX(m, log_s2)

    loglikes = log_dnorm(f(X), yobs, sigma2) # observation log-likelihood at each sample
    
    return(loglikes.mean())
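
A quick way to check the hand-written Gaussian log density against the built-in one is sketched below. It re-declares `log_dnorm` so the snippet runs on its own (using `math.log`, which assumes the variance is a plain float), and the grid of `x` values is an arbitrary choice.

```python
import math
import torch
from torch.distributions import Normal

def log_dnorm(x, mean, var):
    # Same formula as the notebook's version; var is assumed to be a Python float
    return -(x - mean).pow(2) / (2.0 * var) - 0.5 * math.log(2.0 * math.pi * var)

yobs, sigma2 = 8.0, 1.0
x = torch.linspace(-3.0, 12.0, 7)  # arbitrary test points

manual = log_dnorm(x, yobs, sigma2)
# Normal takes a standard deviation, not a variance
builtin = Normal(yobs, math.sqrt(sigma2)).log_prob(x)

print(torch.allclose(manual, builtin, atol=1e-4))
```

Note that `Normal` is parameterized by the standard deviation, so the variance has to be square-rooted before it is passed in.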



In [4]:
### initialize the variational parameters
m = torch.full((), 10.,dtype=dtype, requires_grad=True, device=device)
log_s2 = torch.full((),np.log(15.), dtype=dtype, requires_grad=True, device=device)


nsamples = 1000
# Z is redrawn inside the loop at every iteration. To use fixed samples instead,
# uncomment the line below and remove the draw inside the loop.
#Z = torch.randn((nsamples), dtype=dtype, requires_grad=False, device=device)


learning_rate = 1e-2
for t in range(10000):
    Z = torch.randn((nsamples), dtype=dtype, requires_grad=False, device=device)

    negELBO = -Eloglike(m,log_s2)+KL(log_s2)
    if t % 100 == 99:
        print(t, negELBO.item(), 'm=', m.item(), 's2=', log_s2.exp().item())
    
    # Use autograd to compute the backward pass. This call computes the
    # gradient of negELBO with respect to all Tensors with requires_grad=True.
    # After this call m.grad and log_s2.grad hold the gradient of negELBO
    # with respect to m and log_s2 respectively.
    negELBO.backward()

    # Manually update the variational parameters using gradient descent. Wrap in
    # torch.no_grad() because they have requires_grad=True, but we don't need to
    # track the update in autograd.
    with torch.no_grad():
        m -= learning_rate * m.grad
        log_s2 -= learning_rate * log_s2.grad
        
        # Manually zero the gradients after updating the parameters
        m.grad = None
        log_s2.grad = None
        
print(f'Result: q(x) = N({m.item()}, {log_s2.exp().item()}), approximating p(x|y)')


99 0.7689577341079712 m= 4.89589786529541 s2= 0.10756976157426834
199 0.46154969930648804 m= 4.566242694854736 s2= 0.1664574295282364
299 0.23953628540039062 m= 4.369981288909912 s2= 0.25673651695251465
399 0.041483163833618164 m= 4.22088623046875 s2= 0.38433316349983215
499 -0.08927857875823975 m= 4.112062931060791 s2= 0.5364567637443542
599 -0.131186842918396 m= 4.0486979484558105 s2= 0.6696256995201111
699 -0.20196151733398438 m= 4.015183925628662 s2= 0.7517797350883484
799 -0.18288254737854004 m= 3.999823808670044 s2= 0.7941275238990784
899 -0.1509265899658203 m= 3.99350643157959 s2= 0.8094796538352966
999 -0.11520564556121826 m= 3.9968996047973633 s2= 0.8120530843734741
1099 -0.1278231143951416 m= 3.999289035797119 s2= 0.8111606240272522
1199 -0.1050577163696289 m= 4.003118515014648 s2= 0.8121963739395142
1299 -0.11719095706939697 m= 4.006368160247803 s2= 0.8148861527442932
1399 -0.11520218849182129 m= 4.001572132110596 s2= 0.8133485913276672
1499 -0.1510089635848999 m= 4.00154924

## Next steps

- amortized inference
- Can we use GPyTorch as the function?
- write as a class with a fwd and backward method
- use torch.nn class to define params?
- use random Z at each stage? Should the number of samples increase as we converge?
- add prior for x
- change f to avoid bimodal posterior.
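
Two of the items above (writing the model as a class, and using `torch.nn` to hold the parameters) could look roughly like the sketch below. The class name `GaussianVI` and its argument names are placeholders I've made up; the flat prior, `f`, the initial values and the learning rate are copied from the earlier cells. Z is redrawn on every call.

```python
import math
import torch
from torch import nn

class GaussianVI(nn.Module):
    """Hypothetical wrapper: q(x) = N(m, exp(log_s2)) with a flat prior on x."""

    def __init__(self, m_init=10.0, s2_init=15.0):
        super().__init__()
        # nn.Parameter registers the tensors so that optimizers can find them
        self.m = nn.Parameter(torch.tensor(m_init))
        self.log_s2 = nn.Parameter(torch.tensor(math.log(s2_init)))

    def forward(self, yobs, sigma2, nsamples=1000):
        # Reparameterized samples from q, redrawn on every call
        z = torch.randn(nsamples)
        x = self.m + torch.exp(0.5 * self.log_s2) * z
        fx = 4.0 * x - 0.5 * x.pow(2)
        loglike = (-(fx - yobs).pow(2) / (2.0 * sigma2)
                   - 0.5 * math.log(2.0 * math.pi * sigma2))
        # E_q[log q(x)], which plays the role of the KL term under a flat prior
        e_log_q = -0.5 * math.log(2.0 * math.pi) - 0.5 * self.log_s2 - 0.5
        return -loglike.mean() + e_log_q  # negative ELBO

torch.manual_seed(0)
model = GaussianVI()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
for t in range(2000):
    optimizer.zero_grad()
    loss = model(yobs=8.0, sigma2=1.0)
    loss.backward()   # autograd supplies the backward pass
    optimizer.step()

print('m=', model.m.item(), 's2=', model.log_s2.exp().item())
```

With this layout there is no need to write a backward method by hand, and swapping `SGD` for another optimizer is a one-line change.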

## Amortized Variational Inference

Let's assume $q(x|y)$ can be modelled as $N(m(y), s^2(y))$, where $m(y)$ and $s^2(y)$ are both modelled as neural networks.
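
A minimal sketch of what this might look like is below. Everything about the architecture is an assumption: the class name `AmortizedPosterior`, a single small network with two outputs rather than two separate networks, the hidden size, the Adam optimizer, and the training range for $y$. I've restricted $y$ to $[8, 10]$ because for $y \geq 8$ the posterior under a flat prior has a single mode at $x = 4$, which sidesteps the bimodality mentioned in the next steps.

```python
import math
import torch
from torch import nn

class AmortizedPosterior(nn.Module):
    """Hypothetical network mapping y to (m(y), log s^2(y)) of q(x | y)."""

    def __init__(self, hidden=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(1, hidden), nn.Tanh(), nn.Linear(hidden, 2))

    def forward(self, y):
        out = self.net(y.unsqueeze(-1))   # shape (batch, 2)
        return out[:, 0], out[:, 1]       # m(y), log s^2(y)

def neg_elbo(model, y, sigma2=1.0, nsamples=64):
    # Negative ELBO averaged over a batch of y values, flat prior on x
    m, log_s2 = model(y)
    z = torch.randn(nsamples, y.shape[0])
    x = m + torch.exp(0.5 * log_s2) * z   # broadcasts to (nsamples, batch)
    fx = 4.0 * x - 0.5 * x.pow(2)
    loglike = (-(fx - y).pow(2) / (2.0 * sigma2)
               - 0.5 * math.log(2.0 * math.pi * sigma2))
    e_log_q = -0.5 * math.log(2.0 * math.pi) - 0.5 * log_s2 - 0.5
    return (-loglike.mean(dim=0) + e_log_q).mean()

torch.manual_seed(0)
model = AmortizedPosterior()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(200):
    y = 8.0 + 2.0 * torch.rand(128)       # assumed training range for y: [8, 10]
    optimizer.zero_grad()
    loss = neg_elbo(model, y)
    loss.backward()
    optimizer.step()

# After training, one forward pass gives the approximate posterior for any y
m_pred, log_s2_pred = model(torch.tensor([8.0, 9.0]))
print(m_pred, log_s2_pred.exp())
```

The point of amortizing is that the optimization cost is paid once; a new $y$ then only needs a forward pass through the network. The 200 steps here are only enough to show the mechanics, not to reach a good fit.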


