# Comments on spatialVAE code

## Likelihood functions
In general, VAE performance depends strongly on the choice of likelihood function. The 'sum' reduction works better than taking the 'mean' across the data points and the batch. Does this essentially put more __weight__ on the reconstruction loss and, as a consequence, downweight the KL term (latent space prior)? This is common when training VAEs, though; see e.g. [this blog](http://adamlineberry.ai/vae-series/vae-code-experiments).

In [None]:
# Summed BCE over all pixels and batch elements; y_hat must hold probabilities in (0, 1)
F.binary_cross_entropy(input=y_hat, target=y, reduction='sum')
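
To check the weighting intuition, here is a minimal sketch with made-up tensor sizes (the batch size and pixel count are assumptions for illustration). With the KL term left unchanged, switching the reconstruction term from 'mean' to 'sum' multiplies it by the total number of elements, which has the same effect as dividing the KL weight by that factor.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical sizes: 8 flattened images of 28 * 28 pixels
batch_size, num_pixels = 8, 28 * 28
y = torch.rand(batch_size, num_pixels)  # targets in [0, 1)
# Predicted probabilities, kept strictly inside (0, 1)
y_hat = torch.rand(batch_size, num_pixels).clamp(1e-6, 1 - 1e-6)

bce_sum = F.binary_cross_entropy(y_hat, y, reduction='sum')
bce_mean = F.binary_cross_entropy(y_hat, y, reduction='mean')

# 'mean' is the summed loss divided by the total number of elements
scale = bce_sum / bce_mean
print(scale.item())  # approximately batch_size * num_pixels
```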

In __spatialVAE__ they use the BCE-with-logits likelihood function. Not sure why they would want to use this, since the image pixel values are scaled to (0,1). Also, their __decoder__ output is not passed through a __Sigmoid__ layer. I changed the code to pass it through a Sigmoid instead and use the plain BCE loss. After a few epochs BCE seemed better, but this needs further testing.

In [None]:
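# Log-likelihood as the negative summed BCE; y_hat holds raw logits (the sigmoid is applied internally)
-F.binary_cross_entropy_with_logits(input=y_hat, target=y, reduction='sum')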
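
One possible explanation for the missing Sigmoid: `binary_cross_entropy_with_logits` applies the sigmoid internally, in a numerically more stable way, so the two variants should compute the same loss up to floating-point error. The sketch below checks this on random tensors (shapes are arbitrary, chosen only for illustration). If that holds in the real code too, any difference seen over a few epochs may come from random seeds or numerical edge cases rather than from the loss itself.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

logits = torch.randn(4, 10)  # hypothetical raw decoder outputs (no Sigmoid)
y = torch.rand(4, 10)        # targets in [0, 1)

# Variant 1: loss computed directly from logits
loss_logits = F.binary_cross_entropy_with_logits(logits, y, reduction='sum')
# Variant 2: explicit Sigmoid followed by plain BCE
loss_sigmoid = F.binary_cross_entropy(torch.sigmoid(logits), y, reduction='sum')

print(loss_logits.item(), loss_sigmoid.item())
```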

## Decoder

Still cannot fully follow the reasoning behind the Decoder implementation. 

1. In the first layer from __latent z__ to the hidden layer, they explicitly set `bias=False`. What is the reason? Probably because `coord_linear` already has a bias, and the two outputs are summed before the next layer (step 3), so a second bias would be redundant.

In [None]:
self.coord_linear = nn.Linear(2, hidden_dim) # For xy-coordinates
self.latent_linear = nn.Linear(latent_dim, hidden_dim, bias=False) # For z
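
A small sketch of why a second bias would add nothing when the two projections are summed: the two bias vectors can always be folded into one. The layer sizes and the `latent_biased` variant are hypothetical and exist only for this comparison.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
latent_dim, hidden_dim = 3, 5  # hypothetical sizes

coord_linear = nn.Linear(2, hidden_dim)            # has a bias
latent_biased = nn.Linear(latent_dim, hidden_dim)  # hypothetical variant with a second bias

# Equivalent pair with a single bias: fold the latent bias into the coordinate bias
coord_folded = nn.Linear(2, hidden_dim)
latent_nobias = nn.Linear(latent_dim, hidden_dim, bias=False)
with torch.no_grad():
    coord_folded.weight.copy_(coord_linear.weight)
    coord_folded.bias.copy_(coord_linear.bias + latent_biased.bias)
    latent_nobias.weight.copy_(latent_biased.weight)

x = torch.rand(7, 2)
z = torch.randn(7, latent_dim)
h_two_biases = coord_linear(x) + latent_biased(z)
h_one_bias = coord_folded(x) + latent_nobias(z)

print(torch.allclose(h_two_biases, h_one_bias, atol=1e-5))
```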

2. The decoder part for x-coordinates: they feed in the same coordinates across the whole batch.

In [None]:
# Flatten x so dim is now (batch_size * num_coordinates, 2)
x = x.view(batch_size*n, -1)
# Pass x coordinates through a linear layer to obtain their hidden representation
h_x = self.coord_linear(x)

# Pass latent z through its own linear layer (no bias)
h_z = self.latent_linear(z)

3. Combine layers

In [None]:
# For each coordinate we add the unstructured and structured elements
h = h_x + h_z  # (batch_size, num_coords, hidden_dim)

4. Transform dimensions (flatten) and pass to next layers where the final output size is one value.

In [None]:
# Transform dimensions
h = h.view(batch_size * n, -1)

y = self.layers(h) # (batch_size*num_coords, output_dim), where output_dim = 1

5. Finally they transform to the appropriate size(?)

In [None]:
# Reshape the flat output back to the data dimensions; this presumably works
# because num_coords equals the product of the data dimensions (one coordinate per pixel)
y = y.view(batch_size, *self.data_dim)
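
Putting steps 1 to 5 together, the following is a self-contained sketch of how the shapes could line up. It is not the spatialVAE code itself: the class name `SketchDecoder`, the single `Tanh` plus `Linear` stack in `layers`, and the 4x4 grid are assumptions made for illustration. It also assumes one coordinate per pixel, so that `n` equals the product of `data_dim`.

```python
import torch
import torch.nn as nn


class SketchDecoder(nn.Module):
    """Hypothetical decoder mirroring steps 1-5 above."""

    def __init__(self, latent_dim, hidden_dim, data_dim):
        super().__init__()
        self.data_dim = data_dim
        self.coord_linear = nn.Linear(2, hidden_dim)                        # xy-coordinates, with bias
        self.latent_linear = nn.Linear(latent_dim, hidden_dim, bias=False)  # latent z, no bias
        self.layers = nn.Sequential(nn.Tanh(), nn.Linear(hidden_dim, 1))

    def forward(self, x, z):
        # x: (batch_size, n, 2) coordinates, z: (batch_size, latent_dim)
        batch_size, n, _ = x.shape
        h_x = self.coord_linear(x.reshape(batch_size * n, -1)).view(batch_size, n, -1)
        h_z = self.latent_linear(z).unsqueeze(1)     # (batch_size, 1, hidden_dim)
        h = h_x + h_z                                # broadcast over the n coordinates
        y = self.layers(h.view(batch_size * n, -1))  # (batch_size * n, 1)
        return y.view(batch_size, *self.data_dim)    # requires n == prod(data_dim)


# One coordinate per pixel of a 4x4 image, shared by every batch element
side = 4
grid = torch.linspace(-1, 1, side)
coords = torch.cartesian_prod(grid, grid)  # (16, 2)
x = coords.unsqueeze(0).expand(3, -1, -1)  # (3, 16, 2)
z = torch.randn(3, 2)

decoder = SketchDecoder(latent_dim=2, hidden_dim=8, data_dim=(side, side))
y = decoder(x, z)
print(y.shape)  # torch.Size([3, 4, 4])
```

In this sketch the final reshape only succeeds because the number of coordinates matches the number of pixels, which would answer the question in step 5 if the original code makes the same assumption.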

# Comments on Pyro implementation

## Comparison with implementation in blog post

1. In the blog they use a Softplus to constrain the output for the variance parameter to be positive, so that sampling from the Normal distribution does not run into negative variances. Instead, I exponentiate the unconstrained value obtained from the encoder.
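
A quick sketch comparing the two positivity constraints on a few made-up encoder outputs. Both give strictly positive values; the main practical difference is that `exp` grows much faster for large inputs. Note that `Normal` in both `torch.distributions` and Pyro is parameterized by the standard deviation (`scale`), not the variance.

```python
import torch
import torch.nn.functional as F

raw = torch.tensor([-3.0, 0.0, 3.0])  # hypothetical unconstrained encoder outputs

scale_softplus = F.softplus(raw)  # log(1 + exp(raw)): roughly linear for large inputs
scale_exp = torch.exp(raw)        # treats raw as a log-scale: exponential growth

print(scale_softplus)
print(scale_exp)

# Either result can be used as the scale of a Normal distribution
q = torch.distributions.Normal(loc=torch.zeros(3), scale=scale_exp)
sample = q.rsample()  # reparameterized sample, shape (3,)
```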

2. In the manuscript the authors propose a specific prior for the rotation parameter $\theta$. However, the Pyro implementation defines no such specific prior. I assume they use a Normal prior, as for all other latent variables. This ends up giving a different KL divergence and therefore a different loss function to optimize. __Need to write the equations here to be more specific__.
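
As a starting point for those equations, here is a sketch under stated assumptions: the approximate posterior of each latent is a Gaussian $\mathcal{N}(\mu, \sigma^2)$, and the manuscript's prior on $\theta$ is taken to be a zero-mean Gaussian with standard deviation $s_\theta$. The symbol $s_\theta$ is introduced here for illustration only, and the exact form of the prior should be checked against the paper. With a standard Normal prior, the per-dimension KL term is

$$
\mathrm{KL}\big(\mathcal{N}(\mu, \sigma^2)\,\|\,\mathcal{N}(0, 1)\big) = -\log\sigma + \frac{\sigma^2 + \mu^2}{2} - \frac{1}{2},
$$

whereas with the wider or narrower prior on the rotation it becomes

$$
\mathrm{KL}\big(\mathcal{N}(\mu_\theta, \sigma_\theta^2)\,\|\,\mathcal{N}(0, s_\theta^2)\big) = \log\frac{s_\theta}{\sigma_\theta} + \frac{\sigma_\theta^2 + \mu_\theta^2}{2 s_\theta^2} - \frac{1}{2}.
$$

The two expressions coincide when $s_\theta = 1$, so under these assumptions the two losses would differ only through the $\theta$ term.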
