# Aside

In previous iterations of the class we gave a tutorial on coherent vs. incoherent imaging. That knowledge is still valuable but less suited to an interactive lab. You can find the notebook for that tutorial here:
https://deepimaging.github.io/data/Comparing_incoherent_vs_coherent_imaging.ipynb

# Part 1 -- Eager TF

Throughout these labs we have been working with only one side of TensorFlow: the graph-based approach, in which a Keras model is compiled and trained with the `fit` function. While there is nothing you **can't** do in the graph-based version of TensorFlow, there are design advantages to using other modes (eager mode) or other frameworks.

Personally, I've found in my research that PyTorch is a great alternative to TensorFlow, as it follows a much more Pythonic interface and is easier to debug. PyTorch follows a dynamic graph approach, where operations are applied on the fly. This may seem a bit weird at first, but in the end it allows for more flexibility and control.

With TensorFlow's 2.0 update, eager mode became the default way of executing operations. This mode is similar to PyTorch in that it allows for dynamic graphs and a more flexible representation of computation.

# Code Comparison


## Keras (graph based TF)


In [None]:
from tensorflow import keras
# creating a model
def create_model(input_shape, num_classes):
    inputs = keras.Input(shape=input_shape)
    x = inputs
    x = keras.layers.Conv2D(16, 3, activation='relu')(x)
    x = keras.layers.Conv2D(16, 3, activation='relu')(x)
    x = keras.layers.GlobalAveragePooling2D()(x)
    x = keras.layers.Dense(num_classes)(x)
    return keras.Model(inputs=inputs, outputs=x)
    
model = create_model((64, 64, 3), 2)

# training a model
model.fit(...)

## TensorFlow Eager

In [None]:
import tensorflow as tf
from tensorflow import keras

# creating a model
model = keras.Sequential([
  keras.layers.Conv2D(16,[3,3], activation='relu',
                         input_shape=(None, None, 1)),
  keras.layers.Conv2D(16,[3,3], activation='relu'),
  keras.layers.GlobalAveragePooling2D(),
  keras.layers.Dense(10)
])

# training a model
# (assumes `num_epochs` and a `dataset` yielding (images, labels) batches are already defined)
optimizer = keras.optimizers.Adam()
loss_fn = keras.losses.SparseCategoricalCrossentropy(from_logits=True)

for i in range(num_epochs):
    for images, labels in dataset:
        with tf.GradientTape() as tape:
            model_out = model(images)
            loss_value = loss_fn(labels, model_out)
        grads = tape.gradient(loss_value, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))

## So what is going on here?

Before, we would specify our model's graph a priori (the Keras model) and then compile it. This essentially fixes the order of operations and sets up the gradient pathways (we know the function needed to calculate the gradient at any given point).

When using eager execution, we apply the operations and define the gradient pathway simultaneously. This setup consists of a few key elements:
1. The model, or models. We still need to define our model, but now we don't strictly need to define it using a `Model` class. This is because all TensorFlow requires is a definition of the operations being performed and objects that store the parameters for those operations.
2. The loss function. Now we have to use a loss function object explicitly. It calculates the loss between our targets (labels) and our model's output. Before, Keras was doing this internally.
3. The optimizer. This should look familiar, but it is used a bit differently. The optimizer is the same one we used previously, but now we call its `apply_gradients` method ourselves with (gradient, variable) pairs.
4. Finally, the `GradientTape`. This is the most important part. The "tape" tracks the operations as they occur, so that when we want to calculate gradients we can "trace back" the loss from the model's output to each of the parameters.
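
The four elements above can be seen together in a small, self-contained sketch. The toy data here (random 8x8 "images" with random binary labels) and the tiny model are hypothetical stand-ins chosen only so the loop runs quickly; they are not the lab's dataset or architecture.

```python
import numpy as np
import tensorflow as tf
from tensorflow import keras

# Hypothetical toy data: 32 random 8x8 grayscale "images" with labels in {0, 1}.
rng = np.random.default_rng(0)
images = rng.normal(size=(32, 8, 8, 1)).astype("float32")
labels = rng.integers(0, 2, size=(32,))
dataset = tf.data.Dataset.from_tensor_slices((images, labels)).batch(8)

# 1. The model: just a container of operations and their parameters.
model = keras.Sequential([
    keras.layers.Conv2D(4, 3, activation="relu"),
    keras.layers.GlobalAveragePooling2D(),
    keras.layers.Dense(2),
])
# 2. The loss function object (the model outputs logits, not probabilities).
loss_fn = keras.losses.SparseCategoricalCrossentropy(from_logits=True)
# 3. The optimizer.
optimizer = keras.optimizers.Adam(learning_rate=1e-2)

losses = []
for epoch in range(3):
    for batch_images, batch_labels in dataset:
        # 4. The tape records the forward pass so gradients can be traced back.
        with tf.GradientTape() as tape:
            logits = model(batch_images, training=True)
            loss_value = loss_fn(batch_labels, logits)
        grads = tape.gradient(loss_value, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        losses.append(float(loss_value))
```

Because the loss is an ordinary tensor at every step, you can print it, store it (as in `losses`), or inspect `grads` directly while the loop is running.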

## Why do this?

While this approach may make sense, it certainly seems like a lot more work than what we were doing before. There are, however, a few advantages to this approach:
1. It allows us to be more flexible about how data is processed. You don't strictly need to put all your trainable variables into one model object (see HW). It also means the pathway the gradients follow can be dynamic (you can swap out the model partway through training, or easily share components among several models).
2. It can be much easier to debug. Since all the components are separated, you can inspect the intermediate state of your variables and examine how your data is being processed.
3. You are not restricted to layer objects. Since everything is about tracking gradients, all that is required is that the operations are differentiable. This means that if you are implementing a custom layer, you can do so directly without having to encapsulate it in a class.
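
As a rough illustration of the last point, the sketch below differentiates a plain Python function built from TensorFlow ops, with no layer or model class involved. The function `custom_op` and the variables `w` and `b` are made up for this example.

```python
import tensorflow as tf

# Two free-standing trainable parameters (not attached to any layer or model).
w = tf.Variable(2.0)
b = tf.Variable(0.5)

def custom_op(x):
    # Any composition of differentiable TensorFlow ops can be traced by the tape.
    return tf.sin(w * x) + b

x = tf.constant(1.0)
with tf.GradientTape() as tape:
    y = custom_op(x)

# Analytically: dy/dw = x * cos(w * x) and dy/db = 1.
dy_dw, dy_db = tape.gradient(y, [w, b])
```

The same `tape.gradient` call works whether the variables live inside Keras layers or are created by hand, which is what makes mixing the two straightforward.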

## Other considerations

While this approach may seem very different from what you've been using so far, the changes are actually quite minor. You still use the same functions to generate models (minus the `Model` and `Input` declarations), and the same principles that we have been building on all semester still apply.

What you use is up to you; the increased flexibility of eager mode may not be worth the amount of extra coding you need to do. Additionally, eager mode carries a computational overhead compared to static (graph) mode, which is what we were doing before, so for large models you may notice that things train more slowly.
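
One common way to recover some of that speed while keeping a hand-written loop is to wrap the training step in `tf.function`, which traces the Python function into a graph the first time it is called. The sketch below assumes a tiny regression model and random data purely for illustration; how much speedup you see, if any, depends on the model and hardware.

```python
import tensorflow as tf
from tensorflow import keras

tf.random.set_seed(0)

# Hypothetical small regression model and random data, for illustration only.
model = keras.Sequential([
    keras.layers.Dense(4, activation="relu"),
    keras.layers.Dense(1),
])
loss_fn = keras.losses.MeanSquaredError()
optimizer = keras.optimizers.SGD(learning_rate=0.01)

x = tf.random.normal((16, 3))
y = tf.random.normal((16, 1))
model(x)  # call once eagerly so the model's variables exist before tracing

@tf.function  # traces the step into a graph on first call, then reuses it
def train_step(inputs, targets):
    with tf.GradientTape() as tape:
        pred = model(inputs, training=True)
        loss_value = loss_fn(targets, pred)
    grads = tape.gradient(loss_value, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss_value

for _ in range(20):
    last_loss = train_step(x, y)
```

The trade-off is that code inside `train_step` no longer runs line by line in Python, so `print` statements and debuggers behave differently; removing the decorator restores plain eager behavior.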

# Part 2 - Random Distributions and Deep Learning

Random distributions are an incredibly useful tool in machine learning. They allow us to work in a more realistic, less deterministic, space. You should all be familiar with the basic distributions, but here is a quick review of the two most useful:

1. Gaussian -- Continuous random normal (bell curve shape)
    - Parameterized by a mean and standard deviation (can be multi-dimensional)
2. Categorical -- Discrete random, fixed number of outcomes (this includes the two-outcome, or Bernoulli, case)
    - Parameterized by a per-class probability
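
As a quick refresher, both distributions can be sampled with NumPy. The parameter values below (mean 2.0, standard deviation 0.5, and the three class probabilities) are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Gaussian: parameterized by a mean (loc) and a standard deviation (scale).
gaussian_samples = rng.normal(loc=2.0, scale=0.5, size=10_000)

# Categorical: parameterized by one probability per class (must sum to 1).
class_probs = np.array([0.2, 0.5, 0.3])
categorical_samples = rng.choice(len(class_probs), size=10_000, p=class_probs)

# With enough samples, the empirical statistics approach the parameters.
empirical_mean = gaussian_samples.mean()
empirical_std = gaussian_samples.std()
empirical_probs = np.bincount(categorical_samples, minlength=len(class_probs)) / categorical_samples.size
```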

## Probability + DL

Random distributions are useful within deep learning as we often have situations which are not deterministic, or we want to understand the degree of certainty we have within our model. 

A great example of this is the popular *Variational AutoEncoder*, which uses a mean and standard deviation at the encoding level instead of a fixed vector.

We can also use random distributions within the context of physical layers to define how data is sampled. While a bit too cutting edge for this lab, the "Relaxed Categorical" distributions (developed in 2016) allow us to approximate discrete sampling with a differentiable, continuous relaxation of a categorical distribution.
- Aside: why can't we use a normal categorical distribution within our neural network (or as a physical layer)?
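
For the curious, a relaxed categorical sample can be sketched by hand with the Gumbel-softmax construction: add Gumbel noise to the logits and apply a temperature-scaled softmax. The helper `relaxed_categorical_sample` below is a hypothetical hand-rolled function written for this note, not a library API, and the temperature of 0.5 and the downstream `score` are arbitrary choices.

```python
import tensorflow as tf

tf.random.set_seed(0)

def relaxed_categorical_sample(logits, temperature):
    # Gumbel(0, 1) noise, generated from uniform noise kept away from 0 and 1.
    u = tf.random.uniform(tf.shape(logits), minval=1e-6, maxval=1.0 - 1e-6)
    gumbel_noise = -tf.math.log(-tf.math.log(u))
    # Lower temperatures push the output closer to a one-hot vector.
    return tf.nn.softmax((logits + gumbel_noise) / temperature, axis=-1)

# Trainable logits for a single three-class distribution.
logits = tf.Variable(tf.math.log([[0.2, 0.5, 0.3]]))

with tf.GradientTape() as tape:
    soft_sample = relaxed_categorical_sample(logits, temperature=0.5)
    # An arbitrary downstream quantity that depends on the sample.
    score = tf.reduce_sum(soft_sample * tf.constant([1.0, 2.0, 3.0]))

# The sample is a smooth function of the logits, so a gradient is available.
grad = tape.gradient(score, logits)
```

Each `soft_sample` is a vector of positive weights that sums to one rather than a hard class index, which is the price paid for being able to backpropagate through the sampling step.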

## Exercises - VAE


We are going to build an autoencoder similar to Lab 11, but this time we will add a variational element to it.

What does this mean practically? It means that the embedding layer is now a mean + standard deviation.
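
A minimal sketch of what this looks like in code, assuming a made-up encoder output for a batch of four images with a one-dimensional latent space (so each row is `[mean, logvar]`). Sampling is written as `mean + eps * std` with `eps` drawn from a standard normal, so the random draw itself has no parameters and gradients can still reach the mean and log-variance.

```python
import numpy as np
import tensorflow as tf

tf.random.set_seed(0)

# Hypothetical encoder output: 4 images, latent_dim = 1, columns are [mean, logvar].
encoding = tf.Variable([[0.0, 0.0], [1.0, -1.0], [-0.5, 0.5], [2.0, 0.1]])

with tf.GradientTape() as tape:
    mean, logvar = tf.split(encoding, num_or_size_splits=2, axis=1)
    eps = tf.random.normal(shape=mean.shape)   # parameter-free noise
    z = eps * tf.exp(0.5 * logvar) + mean      # sampled embedding
    total = tf.reduce_sum(z)

# Gradients flow through the sample back to both halves of the encoding:
# d z / d mean = 1 and d z / d logvar = 0.5 * eps * exp(0.5 * logvar).
grad = tape.gradient(total, encoding)
```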

In [None]:
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt


(train_images, _), (test_images, _) = tf.keras.datasets.mnist.load_data()
def preprocess_images(images):
  images = images.reshape((images.shape[0], 28, 28, 1)) / 255.
  return np.where(images > .5, 1.0, 0.0).astype('float32')

train_images = preprocess_images(train_images)
test_images = preprocess_images(test_images)

In [None]:
train_size = 60000
batch_size = 128
test_size = 10000

train_dataset = (tf.data.Dataset.from_tensor_slices(train_images)
                 .shuffle(train_size).batch(batch_size))
test_dataset = (tf.data.Dataset.from_tensor_slices(test_images)
                .shuffle(test_size).batch(batch_size))

In [None]:
from tensorflow.keras.layers import Conv2D, MaxPool2D, Input, UpSampling2D, Dense, Reshape, Flatten, InputLayer
import tensorflow as tf

# defining the two portions of our model

latent_dim = 1
encoder = tf.keras.Sequential(
        [
            tf.keras.layers.InputLayer(input_shape=(28, 28, 1)),
            tf.keras.layers.Conv2D(
                filters=32, kernel_size=3, strides=(2, 2), activation='relu'),
            tf.keras.layers.Conv2D(
                filters=64, kernel_size=3, strides=(2, 2), activation='relu'),
            tf.keras.layers.Flatten(),
            # No activation
            tf.keras.layers.Dense(latent_dim + latent_dim),
        ]
    )

decoder = tf.keras.Sequential(
        [
            tf.keras.layers.InputLayer(input_shape=(latent_dim,)),
            tf.keras.layers.Dense(units=7*7*32, activation=tf.nn.relu),
            tf.keras.layers.Reshape(target_shape=(7, 7, 32)),
            tf.keras.layers.Conv2DTranspose(
                filters=64, kernel_size=3, strides=2, padding='same',
                activation='relu'),
            tf.keras.layers.Conv2DTranspose(
                filters=32, kernel_size=3, strides=2, padding='same',
                activation='relu'),
            # No activation
            tf.keras.layers.Conv2DTranspose(
                filters=1, kernel_size=3, strides=1, padding='same'),
        ]
    )

## Exercise - 1A

Define the training loop using TF eager. Much of the code is provided for you; fill in the blanks and try to understand what's going on...

In [None]:
import tensorflow as tf
from tqdm.notebook import tqdm
# choose an optimizer
optimizer = ...

# this is a bit too advanced for this class: these helpers estimate a lower bound (the ELBO) on the marginal log likelihood
# see: https://www.tensorflow.org/tutorials/generative/cvae
def log_normal_pdf(sample, mean, logvar, raxis=1):
  log2pi = tf.math.log(2. * np.pi)
  return tf.reduce_sum(
      -.5 * ((sample - mean) ** 2. * tf.exp(-logvar) + logvar + log2pi),
      axis=raxis)

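def compute_loss(enc_mean, enc_log_variance, sampled, output, labels):
    # reconstruction term: per-pixel log likelihood of the labels given the decoder logits
    cross_entropy = tf.nn.sigmoid_cross_entropy_with_logits(logits=output, labels=labels)
    logpx_z = -tf.reduce_sum(cross_entropy, axis=[1, 2, 3])
    # log density of the sample under the standard normal prior (mean 0, log-variance 0)
    logpz = log_normal_pdf(sampled, 0., 0.)
    # log density of the sample under the encoder's distribution
    logqz_x = log_normal_pdf(sampled, enc_mean, enc_log_variance)
    return -tf.reduce_mean(logpx_z + logpz - logqz_x)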

def sample_encoding(mean, logvar):
    eps = tf.random.normal(shape=mean.shape)
    return eps * tf.exp(logvar * .5) + mean

# the loop below expects (images, labels) pairs; for an autoencoder the labels are the images themselves
dataset = train_dataset.map(lambda batch: (batch, batch))

num_epochs = 10
for epoch_num in range(num_epochs):
    # remember our images and labels are the same, duplicated for convenience
    with tqdm(total=len(dataset)) as pbar:
        avg_loss = .0
        for step, (images, labels) in enumerate(dataset):
            with tf.GradientTape() as tape:
                # get the model encoding
                encoding = ...
                mean, logvar = tf.split(encoding, num_or_size_splits=2, axis=1)
                # use the provided function to sample an encoding using the mean + std

                # decode the sampled encoding

                # use the provided loss function to calculate loss (store it in `loss`)

            # calculate gradients using gradient tape (outside the `with` block)
            # hint you'll need to do this for BOTH model objects (encoder + decoder)
            trainable_variables = ... # hint they are both lists


            # apply the gradients using the optimizer

            # code provided to log the loss
            avg_loss = (avg_loss * step + loss.numpy().mean())/(step + 1)
            pbar.set_description(f"Loss = {avg_loss:.3f}")
            pbar.update()

## Exercise 1B - Sample Decoded Images

The nice thing about VAEs is that we can sample multiple representations for each image reconstruction. This lets us know which features the model is certain of, and which ones it's not.

In [None]:
# step 1. encode a sample image or set of images, it is easy to simply take the most recent set of images from the training set:
sample_input_images = images
sample_image_encoding = ...
# step 2. Use the split code (in previous cell) to get the mean + logvar for these images

# step 3. Use the sampling function to generate five independent samples 

# step 4. Decode the samples using the decoder

# step 5. post-process the decoded samples by applying the sigmoid function
post_processed = tf.sigmoid(...)

## Exercise 1C - Visualize the Decoded Images

Now that you have the decoded images (5 samples per image), you can display them to see how they vary.

## Exercise 1D - Visualize the Variance between Images