In [1]:
import torch
import math
import numpy as np
dtype = torch.float
device = torch.device("cpu")


yobs = 8.0
sigma2 = 1.0


First we'll learn how to get the variational posterior for $x | y$ for a single value of $y$. We have
$$y = 4x - x^2/2 + \epsilon, \qquad \epsilon \sim N(0,\sigma^2)$$

## Variational inference

Maximize the ELBO
$$ELBO(q) = E[\log p(y | x)] - KL(q(x) \| p(x)).$$

We'll use $q(x) = N(m, s^2)$ as the variational family.
In this case the ELBO can be computed exactly, but it's a pain to do, so I've approximated $E[\log p(y | x)]$ with a Monte Carlo sum.
$\phi=(m, \log s^2)$ are the variational parameters.
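
As a sketch of the Monte Carlo approximation mentioned above (the sample count $N$ and the standard normal draws $z_i$ are notation introduced here for illustration), write $s = \exp(\tfrac{1}{2}\log s^2)$ and use the reparameterization
$$x_i = m + s\, z_i, \quad z_i \sim N(0, 1), \qquad E[\log p(y | x)] \approx \frac{1}{N} \sum_{i=1}^{N} \log p(y | x_i).$$
Because each $x_i$ is a differentiable function of $m$ and $\log s^2$, the estimate can be differentiated with respect to the variational parameters by autograd.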


$$KL(q(x) \| p(x)) = E[\log q(x)] - E[\log p(x)],$$
where expectations are with respect to $q$.

Note: if $p(x) \propto 1$ (a flat prior), then $E[\log p(x)]$ does not depend on $\phi$, so it can be ignored.
If $q(x) = N(m, s^2)$, then $E[\log q(x)] = -0.5 \log(2 \pi s^2) - 1/2$.
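
The closed form for $E[\log q(x)]$ can be checked numerically. The snippet below is a minimal sketch, writing $s^2$ for the variance of $q$; the helper name `expected_log_q` and the values of `m` and `s2` are arbitrary choices made here for illustration. It compares the formula with a Monte Carlo average and with the negative entropy reported by `torch.distributions.Normal`.

```python
import math
import torch

def expected_log_q(log_var):
    """Closed form E_q[log q(x)] = -0.5*log(2*pi*s^2) - 0.5, with log_var = log s^2."""
    return -0.5 * math.log(2.0 * math.pi) - 0.5 * log_var - 0.5

torch.manual_seed(0)
m, s2 = 1.5, 0.7  # arbitrary illustrative values (assumption)
q = torch.distributions.Normal(loc=torch.tensor(m), scale=torch.tensor(math.sqrt(s2)))

x = q.sample((200_000,))
mc_estimate = q.log_prob(x).mean().item()   # Monte Carlo average of log q(x)
closed_form = expected_log_q(math.log(s2))  # formula from the text
neg_entropy = -q.entropy().item()           # E_q[log q] is minus the entropy

print(mc_estimate, closed_form, neg_entropy)
```

The three numbers should agree up to Monte Carlo error, and the result does not depend on `m`.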

In [2]:
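def KL(log_var):
    # With a flat prior, KL(q || p) reduces (up to a constant) to E_q[log q(x)],
    # the negative entropy of N(m, s^2), where log_var = log s^2.
    return(-0.5*np.log(2.0*math.pi)-0.5*log_var-0.5)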

def f(x):
    return(4.*x-0.5*torch.pow(x,2.))
# How do we define functions? x needs declaring?




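def ZtoX(m, log_s2):
    # Reparameterization trick: X = m + s*Z with Z ~ N(0, 1).
    # Note: Z is read from the global scope, so it must be defined before this is called.
    return(Z*torch.exp(log_s2/2.)+m)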


In [3]:

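def log_dnorm(x, mean, var):
    # Log density of N(mean, var) evaluated at x. var must be a plain float here,
    # because np.log is applied to it.
    # Checked - could use torch.distributions.Normal(...).log_prob instead.
    return(-(x-mean).pow(2)/(2.*var)-0.5*np.log(2*math.pi*var))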



def Eloglike(m,log_s2):
    # Monte Carlo estimate of E_q[log p(y | x)] from the reparameterized samples
    X = ZtoX(m, log_s2)
    loglikes = log_dnorm(f(X), yobs, sigma2) # observation likelihood
    return(loglikes.mean())



In [12]:
### initialize the variational parameters
m = torch.full((), 10.,dtype=dtype, requires_grad=True, device=device)
log_s2 = torch.full((), np.log(15.), dtype=dtype, requires_grad=True, device=device)


# Z is redrawn inside the loop at every iteration. To use a fixed set of samples
# instead, uncomment the line below and remove the draw inside the loop.
nsamples = 1000
#Z = torch.randn((nsamples), dtype=dtype, requires_grad=False, device=device)


learning_rate = 1e-2
for t in range(10000):
    Z = torch.randn((nsamples), dtype=dtype, requires_grad=False, device=device)

    negELBO = -Eloglike(m,log_s2)+KL(log_s2)
    if t % 100 == 99:
        print(t, negELBO.item(), 'm=', m.item(), 's2=', log_s2.exp().item())
    
    # Use autograd to compute the backward pass. This call computes the
    # gradient of negELBO with respect to all Tensors with requires_grad=True.
    # After this call, m.grad and log_s2.grad hold the gradient of negELBO
    # with respect to m and log_s2 respectively.
    negELBO.backward()

    # Manually update the variational parameters using gradient descent. Wrap in
    # torch.no_grad() because they have requires_grad=True, but we don't need to
    # track the update itself in autograd.
    with torch.no_grad():
        m -= learning_rate * m.grad
        log_s2 -= learning_rate * log_s2.grad
        
        # Manually zero the gradients after updating weights
        m.grad = None
        log_s2.grad = None
        
print(f'Result: p(x|y) approx q(x) = N({m.item()}, {log_s2.exp().item()}) ')


99 0.9968737363815308 m= 4.919249534606934 s2= 0.06470414251089096
199 0.6894204616546631 m= 4.6123948097229 s2= 0.10261012613773346
299 0.4494328498840332 m= 4.438750267028809 s2= 0.16267293691635132
399 0.23781156539916992 m= 4.3007707595825195 s2= 0.25440698862075806
499 0.04173636436462402 m= 4.181687355041504 s2= 0.38362643122673035
599 -0.08224809169769287 m= 4.09814453125 s2= 0.5195024609565735
699 -0.13930487632751465 m= 4.041082859039307 s2= 0.6588190793991089
799 -0.14893770217895508 m= 4.013210296630859 s2= 0.7463732957839966
899 -0.11489439010620117 m= 4.0027079582214355 s2= 0.7893606424331665
999 -0.11926639080047607 m= 4.000077247619629 s2= 0.8124529123306274
1099 -0.1333543062210083 m= 3.99855899810791 s2= 0.8152623176574707
1199 -0.1518338918685913 m= 3.9989781379699707 s2= 0.8170526027679443
1299 -0.17739498615264893 m= 4.0056538581848145 s2= 0.8173894286155701
1399 -0.17237865924835205 m= 3.998837471008301 s2= 0.8135724663734436
1499 -0.16475892066955566 m= 3.99889111

KeyboardInterrupt: 
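
The values the trace settles on can be cross-checked against the exact ELBO for this particular setup. What follows is a sketch under the notebook's settings (`yobs = 8`, `sigma2 = 1`, flat prior), and the helper name `exact_neg_elbo` is hypothetical. Since $8 - f(x) = (x-4)^2/2$, the log-likelihood is $-(x-4)^4/8 - 0.5\log(2\pi)$. For $x \sim N(m, s^2)$ with $d = m - 4$, $E[(x-4)^4] = d^4 + 6 d^2 s^2 + 3 s^4$, so the negative ELBO is $(d^4 + 6 d^2 s^2 + 3 s^4)/8 - 0.5 \log s^2 - 0.5$. This is minimized at $m = 4$, $s^2 = \sqrt{2/3}$.

```python
import math

def exact_neg_elbo(m, s2):
    """Exact negative ELBO, valid only for yobs = 8, sigma2 = 1 and a flat prior."""
    d = m - 4.0
    fourth_moment = d**4 + 6.0 * d**2 * s2 + 3.0 * s2**2  # E_q[(x - 4)^4]
    # The 0.5*log(2*pi) terms from the likelihood and from E_q[log q] cancel.
    return fourth_moment / 8.0 - 0.5 * math.log(s2) - 0.5

s2_opt = math.sqrt(2.0 / 3.0)  # stationary point of the expression above at m = 4
print(s2_opt, exact_neg_elbo(4.0, s2_opt))
```

This gives $s^2 \approx 0.8165$ and a negative ELBO of about $-0.1486$, in line with the printed iterations once `m` is close to 4, up to Monte Carlo noise.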

## Next steps

- Amortized inference.
- Can we use GPyTorch as the function?
- Write as a class with forward and backward methods.
- Use a torch.nn class to define the params?
- Use random Z at each stage? Should the number of samples increase as we converge?
- Add a prior for x.
- Change f to avoid a bimodal posterior.
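
Several of these items (a class, `torch.nn` parameters, fresh random draws at each step) could be combined along the following lines. This is only a sketch: the class name `GaussianVI`, the initial values, the sample count and the use of `torch.optim.SGD` are all choices made here, not part of the notebook above, and the flat prior is kept.

```python
import math
import torch

class GaussianVI(torch.nn.Module):
    """Hypothetical wrapper: q(x) = N(m, s^2) with learnable m and log s^2."""

    def __init__(self, m_init=5.0, log_s2_init=0.0, yobs=8.0, sigma2=1.0):
        super().__init__()
        self.m = torch.nn.Parameter(torch.tensor(m_init))
        self.log_s2 = torch.nn.Parameter(torch.tensor(log_s2_init))
        self.yobs, self.sigma2 = yobs, sigma2

    def q(self):
        return torch.distributions.Normal(self.m, torch.exp(0.5 * self.log_s2))

    def forward(self, nsamples=500):
        """Monte Carlo estimate of the negative ELBO under a flat prior."""
        q = self.q()
        x = q.rsample((nsamples,))      # reparameterized draws, fresh on every call
        fx = 4.0 * x - 0.5 * x.pow(2)   # same f as in the notebook
        lik = torch.distributions.Normal(fx, math.sqrt(self.sigma2))
        eloglike = lik.log_prob(torch.tensor(self.yobs)).mean()
        return -eloglike - q.entropy()  # -E[log p(y|x)] + E[log q(x)]

torch.manual_seed(0)
model = GaussianVI()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
for step in range(2000):
    optimizer.zero_grad()
    loss = model()
    loss.backward()
    optimizer.step()

print(model.m.item(), model.log_s2.exp().item())
```

Here `rsample` plays the role of `ZtoX`, and `q.entropy()` replaces the hand-written `KL` term with its sign flipped. The optimizer handles the parameter update and gradient reset that the manual loop does inside `torch.no_grad()`.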

## Amortized Variational Inference

Let's assume $q(x|y)$ can be modelled as $N(m(y), s^2(y))$, where $m(y)$ and $s^2(y)$ are both modelled as neural networks.
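
One way this might look in PyTorch is sketched below. The architecture (a single hidden layer of 16 tanh units with two outputs), the names `AmortizedQ` and `amortized_neg_elbo`, and the range of `y` values are all assumptions made for illustration. The flat prior and the same $f$ as above are kept, and the network outputs $\log s^2(y)$ rather than $s^2(y)$ so the variance stays positive.

```python
import torch

class AmortizedQ(torch.nn.Module):
    """Hypothetical encoder: maps y to the parameters (m(y), log s^2(y)) of q(x | y)."""

    def __init__(self, hidden=16):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(1, hidden), torch.nn.Tanh(), torch.nn.Linear(hidden, 2)
        )

    def forward(self, y):
        out = self.net(y.unsqueeze(-1))  # y: (batch,) -> out: (batch, 2)
        return out[:, 0], out[:, 1]      # m(y), log s^2(y)

def amortized_neg_elbo(encoder, y, nsamples=100, sigma2=1.0):
    """Monte Carlo negative ELBO averaged over a batch of y values (flat prior)."""
    m, log_s2 = encoder(y)
    q = torch.distributions.Normal(m, torch.exp(0.5 * log_s2))
    x = q.rsample((nsamples,))           # shape (nsamples, batch)
    fx = 4.0 * x - 0.5 * x.pow(2)
    # y has shape (batch,) and broadcasts across the sample dimension
    loglike = torch.distributions.Normal(fx, sigma2 ** 0.5).log_prob(y)
    return (-loglike.mean(dim=0) - q.entropy()).mean()

torch.manual_seed(0)
encoder = AmortizedQ()
y_batch = torch.linspace(6.0, 8.0, 8)    # illustrative y values (assumption)
loss = amortized_neg_elbo(encoder, y_batch)
loss.backward()                          # gradients now flow to the network weights
```

Training would repeat the last two lines inside an optimizer loop, so a single set of network weights is fitted across all values of $y$ instead of one $(m, \log s^2)$ pair per observation.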


