# HW4
Geoffrey Woollard

My code lives in the repo https://github.com/geoffwoollard/prob_prog

# Acknowledgments
I acknowledge helpful discussions with Masoud Mokhatari, Dylan Green, Kevin Yang, Gaurav Bhatt, Ilias Karimalis, Ali Seyfi, Kim Dinh, Alan Milligan, Yuan Tian, and many other classmates on the control variate term in Eq. 4.42 of the course textbook, and for providing ELBOs for comparison.

I gratefully acknowledge helpful code snippets from Kevin Yang (UniformContinuous proposal from Gamma), and Kim Dinh for advice on using a global optimizer.

# Code snippets

In [1]:
from dill.source import getsource, getsourcelines

At a high level, `graph_bbvi_algo12` parses the graph and initializes the proposal distributions (using the distributions from Beren's starter code for unconstrained optimization) by sampling from the joint *and returning the distribution objects* (not just the sampled values). To do this, I redefined the distribution primitives to use the unconstrained-optimization distributions.

Then I step through the graph with ancestral sampling and evaluate each link function *with a deterministic evaluator*, `eval_algo11_deterministic`. The sample and observe cases are not handled in the evaluator (as they are in algo 11 of the course textbook), but in `evaluate_link_function_algo11`, which is very similar to what `sample_from_joint` did for ancestral sampling in previous homeworks.
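The traversal described above can be sketched on a toy graph. This is only an illustration, not the code in the repo: the `link_functions` dict of Python callables, the `ancestral_sample` helper, and the two-vertex graph are all hypothetical stand-ins for the daphne expressions and the evaluator, and `topsort` here is a re-implementation on top of the standard-library `graphlib`.

```python
import random
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def topsort(vertices, arcs):
    """Order vertices so that every parent precedes its children."""
    parents = {v: set() for v in vertices}
    for parent, children in arcs.items():
        if parent not in parents:  # keys of "A" need not be vertex names
            continue
        for child in children:
            parents[child].add(parent)
    return list(TopologicalSorter(parents).static_order())


def ancestral_sample(vertices, arcs, link_functions, rng):
    """Visit vertices parents-first and return {vertex_name: sampled_value}."""
    env = {}
    for v in topsort(vertices, arcs):
        # Deterministic part: build the sampler from already-sampled parents.
        sampler = link_functions[v](env)
        # Stochastic part: the only place a random draw happens.
        env[v] = sampler(rng)
    return env


# Hypothetical toy graph: mu ~ Normal(0, 1), x ~ Normal(mu, 1).
vertices = ["mu", "x"]
arcs = {"mu": ["x"]}
link_functions = {
    "mu": lambda env: (lambda rng: rng.gauss(0.0, 1.0)),
    "x": lambda env: (lambda rng: rng.gauss(env["mu"], 1.0)),
}
sample = ancestral_sample(vertices, arcs, link_functions, random.Random(0))
```

Keeping the deterministic evaluation separate from the random draw is what makes it possible to get the distribution object back instead of only the sampled value.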

Using `autograd`, I can get the gradient of the log probability of each sample `t,l` w.r.t. the parameters of each proposal distribution. These are collected for a whole minibatch of size L, and then the elbo-gradients function in algorithm 12 uses all the information in the samples and computes the $b_{d,v}$ terms from Eq. 4.42-4.44 / lines 16, 17 in algorithm 12. Note that the textbook is ambiguous about the sums in line 16; this is clarified in Eq. 9 of [Ranganath, Gerrish, & Blei (2014). Black box variational inference](https://arxiv.org/pdf/1401.0118.pdf). All of this is done in `elbo_gradients`.
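As a rough illustration of the control-variate gradient, here is a standalone sketch for a single Normal proposal. It is not the `elbo_gradients` from the repo: it uses one baseline per parameter component, $b_d = \mathrm{Cov}(f_d, g_d) / \mathrm{Var}(g_d)$, which is one reading of the estimator, and the toy model ($x \sim N(0,1)$, $y \mid x \sim N(x,1)$, $y = 2$) and all function names are assumptions made for the example.

```python
import math
import random


def normal_log_prob(x, mu, sigma):
    return -0.5 * math.log(2 * math.pi) - math.log(sigma) - 0.5 * ((x - mu) / sigma) ** 2


def grad_log_normal(x, mu, sigma):
    """Analytical gradient of log N(x | mu, sigma) w.r.t. (mu, sigma)."""
    return [(x - mu) / sigma ** 2, ((x - mu) ** 2 - sigma ** 2) / sigma ** 3]


def mean(xs):
    return sum(xs) / len(xs)


def covariance(xs, ys):
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def elbo_gradient_estimate(log_ws, grads):
    """log_ws[l] = log p(x_l, y) - log q(x_l); grads[l][d] = d log q(x_l) / d lambda_d."""
    estimate = []
    for d in range(len(grads[0])):
        g_d = [g[d] for g in grads]
        f_d = [g * lw for g, lw in zip(g_d, log_ws)]
        var_g = covariance(g_d, g_d)
        # Per-component baseline (control variate coefficient).
        b_d = covariance(f_d, g_d) / var_g if var_g > 0 else 0.0
        estimate.append(mean([f - b_d * g for f, g in zip(f_d, g_d)]))
    return estimate


# Toy check at (mu, sigma) = (0, 1); the exact ELBO gradient there is (2, -1).
random.seed(0)
mu, sigma, y, L = 0.0, 1.0, 2.0, 20000
log_ws, grads = [], []
for _ in range(L):
    x = random.gauss(mu, sigma)
    log_p = normal_log_prob(x, 0.0, 1.0) + normal_log_prob(y, x, 1.0)
    log_ws.append(log_p - normal_log_prob(x, mu, sigma))
    grads.append(grad_log_normal(x, mu, sigma))
grad_estimate = elbo_gradient_estimate(log_ws, grads)
```

Subtracting $b_d\, g_d$ leaves the expectation unchanged because $E_q[\nabla \log q] = 0$; it only reduces the variance of the estimate.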

A step is then taken in the direction of these stochastic, minibatch-averaged gradients (the per-sample gradients of the log prob come from `log_prob.backward()` in `grad_log_prob`). The step is taken in `optimizer_step`, which uses a *global* optimizer that is initialized only once, when the distributions are set up (I call `.make_copy_with_grads()` only once). Note that autograd is not really needed here: I know the analytical form of the proposal distributions, and therefore of the gradients. I could even hook these analytical gradients up to an optimizer and use PyTorch optimizers like `Adam`, `SGD`, etc., with things like weight decay, by setting the `.grad` of each parameter to be optimized.
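A minimal sketch of that last idea, assuming the same hypothetical toy model as a stand-in ($x \sim N(0,1)$, $y \mid x \sim N(x,1)$, $y = 2$, proposal $N(\mu, \sigma)$), for which the ELBO gradient is known in closed form. The function `analytic_elbo_grad` is invented for the example; only `torch.optim.SGD`, `.grad`, `zero_grad()` and `step()` are PyTorch API.

```python
import torch

# Proposal parameters; the optimizer is created once and reused for every step.
mu = torch.tensor(0.0, requires_grad=True)
sigma = torch.tensor(1.0, requires_grad=True)
optimizer = torch.optim.SGD([mu, sigma], lr=0.1)  # torch.optim.Adam works the same way


def analytic_elbo_grad(mu, sigma, y=2.0):
    """Closed-form ELBO gradient for the toy model (no autograd involved)."""
    return y - 2.0 * mu, 1.0 / sigma - 2.0 * sigma


for _ in range(100):
    optimizer.zero_grad()
    g_mu, g_sigma = analytic_elbo_grad(mu.detach(), sigma.detach())
    # Optimizers minimize, so hand them the gradient of the negative ELBO.
    mu.grad = -g_mu
    sigma.grad = -g_sigma
    optimizer.step()
```

For this toy model the parameters settle at the exact posterior, $\mu = 1$ and $\sigma = 1/\sqrt{2}$. In the stochastic setting the same pattern applies with the minibatch gradient estimate in place of `analytic_elbo_grad`.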

The $\log W^{t,l}$ are collected and used to weight the samples, à la importance sampling, when computing functions over the returns in the posterior. In principle, I have not only the return for each sample but the whole sample, because I used a graph-based sampler. I also keep track of the *best* elbo and use the proposals from this iteration $t_{best}$. Note that this is after the gradient step has been taken in the mini-batch.
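The weighting step can be sketched as self-normalized importance sampling on the log weights. The helper names below are hypothetical and not taken from the repo; the maximum is subtracted before exponentiating so that very negative log weights do not underflow to zero.

```python
import math


def normalized_weights(log_ws):
    """Turn log W values into weights that sum to one (log-sum-exp trick)."""
    m = max(log_ws)
    unnormalized = [math.exp(lw - m) for lw in log_ws]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def weighted_mean(values, log_ws):
    """Posterior expectation estimate: sum_l w_l * r_l with normalized w_l."""
    weights = normalized_weights(log_ws)
    return sum(w * v for w, v in zip(weights, values))


# Two returns with unnormalized weights 1 and 3.
posterior_mean = weighted_mean([1.0, 3.0], [math.log(1.0), math.log(3.0)])
```

Other posterior summaries, such as a variance, follow by applying `weighted_mean` to the corresponding function of the returns.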

In [3]:
import bbvi 

for line_number, function_line in enumerate(getsourcelines(bbvi)[0]):
    print(line_number, function_line,end='')

0 import logging
1 
2 import numpy as np
3 import torch
4 from torch import tensor
5 
6 from primitives import primitives_d, distributions_d, number, distribution_types
7 import distributions # for unconstrained optimization
8 from graph_based_sampling import sample_from_joint, score, topsort
9 from distributions import Normal
10 
11 number = (int,float)
12 
13 logging.basicConfig(format='%(levelname)s:%(message)s')
14 logger = logging.getLogger('simple_example')
15 logger.setLevel(logging.DEBUG)
16 
17 logging.basicConfig(format='%(levelname)s:%(message)s')
18 logger_graph = logging.getLogger('simple_example')
19 logger_graph.setLevel(logging.DEBUG)
20 
21 def eval_algo11_deterministic(e,sigma,local_env={},defn_d={},do_log=False,logger_string='',vertex=None):
22     """
23     do not handle sample or observe.
24     done in higher level parser of linker function.
25     just eval distribution object that gets sampled or observed
26     """
27     # remember to return evaluate (recursi

In [4]:
from graph_based_sampling import sample_from_joint, evaluate_link_function

list_of_programs = [sample_from_joint, evaluate_link_function]

for program in list_of_programs:
    for line_number, function_line in enumerate(getsourcelines(program)[0]):
        print(line_number, function_line,end='')
    print()
    

0 def sample_from_joint(graph,sigma=tensor(0.),local_env={'prior_dist':{}},do_log=False,verteces_topsorted=None):
1     """This function does ancestral sampling starting from the prior.
2 
3     graph output from `daphne graph -i sugared.daphne`
4     * list of length 3
5       * first entry is defn dict
6         * {"string-defn-function-name":["fn", ["var_1", ..., "var_n"], e_function_body], ...}
7       * second entry is graph: {V,A,P,Y}
8         * "V","A","P","Y" are keys in dict
9         * "V" : ["string_name_vertex_1", ..., "string_name_vertex_n"] # list of string names of vertices
10         * "A" : {"string_name_vertex_1" : [..., "string_name_vertex_i", ...] # dict of arc pairs (u,v) with u a string key in the dict, and the value a list of string names of the vertices. note that the keys can be things like "uniform" and don't have to be vetex name strings
11         * "P" : "string_name_vertex_i" : ["sample*", e_i] # dict. keys vertex name strings and value a rested list with

In [2]:
from distributions import UniformContinuous

for line_number, function_line in enumerate(getsourcelines(UniformContinuous)[0]):
    print(line_number, function_line,end='')    

0 class UniformContinuous(dist.Gamma):
1     """
2     Gamma to approx a posterior distribution with support on the positive real line (not including zero)
3     """
4     def __init__(self, low, high, copy=False):
5         super().__init__(concentration=low,
6                              rate=high)
7 
8     def Parameters(self):
9         """Return a list of parameters for the distribution"""
10         return [self.concentration, self.rate]
11 
12     def make_copy_with_grads(self):
13         """
14         Return a copy  of the distribution, with parameters that require_grad
15         """
16 
17         ps = [p.clone().detach().requires_grad_() for p in self.Parameters()]
18 
19         return UniformContinuous(*ps, copy=True)
20 
21     def log_prob(self, x):
22 
23         return super().log_prob(x)
