# Neural Computation - 2021

# Tutorial - Variational Auto-Encoders (VAEs)

**Aims of this tutorial**:
- Implement and train Variational Auto-Encoders (VAEs) in PyTorch.
- Investigate how the learned latent space looks.
- Investigate whether we can synthesize new data with Variational Auto-Encoders.
- Investigate whether Variational Auto-Encoders trained with unlabelled data are useful to improve training of Supervised Classifiers when labelled data are limited.

It may be long, but it should be easy to complete. The core points investigated here are of **high importance and part of the assessable material for the course**.

**Prerequisites**:
- Familiar with Python, NumPy, and basic PyTorch.
- Familiar with MNIST, Multi-Layer-Perceptrons (MLPs), and AutoEncoders (previous tutorial).


**Notes**:
- Docs for Pytorch's functions you will need:  
https://pytorch.org/docs/stable/tensors.html  
https://pytorch.org/docs/stable/nn.html  
- Some helper functions for loading and plotting data are given in the `./utils` folder. They will be used out of the box below.

## Preliminary: Loading and refreshing MNIST

Loading and inspecting MNIST data. Same as previous tutorial...

In [None]:
# -*- coding: utf-8 -*-
# The below is for auto-reloading external modules after they are changed, such as those in ./utils.
# Issue: http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2

import numpy as np
from utils.data_utils import get_mnist # Helper function. Use it out of the box.

# Constants
DATA_DIR = './data/mnist' # Location we will keep the data.
SEED = 111111

# If datasets are not at specified location, they will be downloaded.
train_imgs, train_lbls = get_mnist(data_dir=DATA_DIR, train=True, download=True)
test_imgs, test_lbls = get_mnist(data_dir=DATA_DIR, train=False, download=True)

print("[train_imgs] Type: ", type(train_imgs), "|| Shape:", train_imgs.shape, "|| Data type: ", train_imgs.dtype )
print("[train_lbls] Type: ", type(train_lbls), "|| Shape:", train_lbls.shape, "|| Data type: ", train_lbls.dtype )
print('Class labels in train = ', np.unique(train_lbls))

print("[test_imgs] Type: ", type(test_imgs), "|| Shape:", test_imgs.shape, " || Data type: ", test_imgs.dtype )
print("[test_lbls] Type: ", type(test_lbls), "|| Shape:", test_lbls.shape, " || Data type: ", test_lbls.dtype )
print('Class labels in test = ', np.unique(test_lbls))

N_tr_imgs = train_imgs.shape[0] # N hereafter. Number of training images in database.
H_height = train_imgs.shape[1] # H hereafter
W_width = train_imgs.shape[2] # W hereafter
C_classes = len(np.unique(train_lbls)) # C hereafter

Above we see that data have been loaded in *numpy arrays*.    
Arrays with images have **shape ( N = number of images, H = height, W = width )**.  
Arrays with labels have **shape ( N = number of images)**, holding one integer per image, the digit's class.

MNIST comprises a **train set of N_tr = 60000 images** and a **test set of N_te = 10000 images**.  
We will use the train set for unsupervised learning. The test set will only be used for evaluating generalisation of classifiers towards the end of the tutorial.

Let's plot a few images in one collage to have a look...

In [None]:
%matplotlib inline
from utils.plotting import plot_grid_of_images # Helper functions, use out of the box.
plot_grid_of_images(train_imgs[0:100], n_imgs_per_row=10)

Notice that the intensities in the images take **values from 0 to 255**.

## Preliminary: Data pre-processing

A first step in almost all pipelines is to pre-process the data, to make them more appropriate for a model.

Below, we will perform 3 steps:  
a) Change the labels from an integer representation to a **one-hot representation** of the **C=10 classes**.\
b) Re-scale the **intensities** in the images, from the range \[0,255\], to be instead in the range \[-1,+1\].\
c) **Vectorise the 2D images into 1D vectors for the MLP**, which only gets vectors as input.
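The three steps can be tried on a tiny toy array before running them on MNIST. This is only a sketch: the array `imgs`, the labels `lbls` and the class count `n_classes` are made up for illustration, and the linear re-scaling in step b) is an assumption about what the helper `normalize_int_whole_database` does in its `"minus_1_to_1"` mode.

```python
import numpy as np

# Toy stand-in for MNIST: 3 "images" of 2x2 pixels with integer intensities in [0, 255].
imgs = np.array([[[0, 255], [128, 64]],
                 [[255, 255], [0, 0]],
                 [[10, 20], [30, 40]]], dtype=np.uint8)
lbls = np.array([2, 0, 1])  # One integer class label per image.
n_classes = 3

# a) One-hot labels: row i holds a single 1, at column lbls[i].
lbls_onehot = np.zeros((lbls.shape[0], n_classes))
lbls_onehot[np.arange(lbls.shape[0]), lbls] = 1

# b) Linear re-scaling from [0, 255] to [-1, +1] (assumed behaviour of the helper).
imgs_scaled = imgs.astype(np.float32) / 255.0 * 2.0 - 1.0

# c) Flatten each image to a vector, preserving the first (sample) dimension.
imgs_flat = imgs_scaled.reshape(imgs_scaled.shape[0], -1)

print(lbls_onehot)
print(imgs_flat.shape, imgs_flat.min(), imgs_flat.max())
```

The cells below apply the same operations to the real train and test sets.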

In [None]:
# a) Change representation of labels to one-hot vectors of length C=10.
train_lbls_onehot = np.zeros(shape=(train_lbls.shape[0], C_classes ) )
train_lbls_onehot[ np.arange(train_lbls_onehot.shape[0]), train_lbls ] = 1
test_lbls_onehot = np.zeros(shape=(test_lbls.shape[0], C_classes ) )
test_lbls_onehot[ np.arange(test_lbls_onehot.shape[0]), test_lbls ] = 1
print("BEFORE: [train_lbls]        Type: ", type(train_lbls), "|| Shape:", train_lbls.shape, " || Data type: ", train_lbls.dtype )
print("AFTER : [train_lbls_onehot] Type: ", type(train_lbls_onehot), "|| Shape:", train_lbls_onehot.shape, " || Data type: ", train_lbls_onehot.dtype )

In [None]:
# b) Re-scale image intensities, from [0,255] to [-1, +1].
# This commonly facilitates learning:
# A zero-centered signal with small magnitude makes it easier to avoid exploding/vanishing gradient problems.
from utils.data_utils import normalize_int_whole_database # Helper function. Use out of the box.
train_imgs = normalize_int_whole_database(train_imgs, norm_type="minus_1_to_1")
test_imgs = normalize_int_whole_database(test_imgs, norm_type="minus_1_to_1")

# Lets plot one image.
from utils.plotting import plot_image, plot_images # Helper function, use out of the box.
index = 0  # Try any, up to 60000
print("Plotting image of index: [", index, "]")
print("Class label for this image is: ", train_lbls[index])
print("One-hot label representation: [", train_lbls_onehot[index], "]")
plot_image(train_imgs[index])
# Notice the magnitude of intensities. Black is now negative and white is positive float.
# Compare with intensities of figure further above.

In [None]:
# c) Flatten the images, from 2D matrices to 1D vectors. MLPs take feature-vectors as input, not 2D images.
train_imgs_flat = train_imgs.reshape([train_imgs.shape[0], -1]) # Preserve 1st dim (N = number of samples), flatten others.
test_imgs_flat = test_imgs.reshape([test_imgs.shape[0], -1])
print("Shape of numpy array holding the training database:")
print("Original : [N, H, W] = [", train_imgs.shape , "]")
print("Flattened: [N, H*W]  = [", train_imgs_flat.shape , "]")

## Task 1: Variational Auto-Encoder

In this task, you are asked to implement the architecture and losses of a Variational Auto-Encoder.
**Fill in the blanks where requested**, to create the below architecture:

![title](./documentation/vae_2d.png)
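The `VAE` class in this task folds each layer's bias into its weight matrix: every parameter tensor has one extra row, and a constant feature equal to 1 is appended to the layer's input. The sketch below, with toy sizes and illustrative names (`W_ext`, `x_ext`) that are not part of the tutorial code, checks that this gives the same result as an ordinary affine layer with a separate bias.

```python
import torch

torch.manual_seed(0)
N, D, H = 4, 3, 2            # Batch size, input features, hidden units (toy sizes).
x = torch.randn(N, D)
W = torch.randn(D, H)        # Ordinary weight matrix.
b = torch.randn(H)           # Ordinary bias vector.

# Stack the bias as an extra last row of the weights: shape (D+1, H).
W_ext = torch.cat((W, b.unsqueeze(0)), dim=0)
# Append a constant feature equal to 1 to every input: shape (N, D+1).
x_ext = torch.cat((x, torch.ones(N, 1)), dim=1)

out_folded = x_ext.mm(W_ext)     # Single matrix product, bias included.
out_explicit = x.mm(W) + b       # Usual affine layer with a separate bias.
print(torch.allclose(out_folded, out_explicit, atol=1e-6))
```

This is why the weight shapes in `__init__` below are written as `(D + 1, D_next)`.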


In [None]:
# -*- coding: utf-8 -*-
import torch
import torch.optim as optim
import torch.nn as nn


class Network():
    
    def backward_pass(self, loss):
        # Performs back propagation and computes gradients
        # With PyTorch, we do not need to compute gradients analytically for parameters where requires_grad=True.
        # Calling loss.backward(), torch's Autograd automatically computes grads of loss wrt each parameter p,...
        # ... and **puts them in p.grad**. Return them in a list.
        loss.backward()
        grads = [param.grad for param in self.params]
        return grads
    
    
class VAE(Network):
    def __init__(self, rng, D_in, D_hid_enc, D_bottleneck, D_hid_dec):
        # Construct and initialize network parameters
        D_in = D_in # Dimension of input feature-vectors. Length of a vectorised image.
        D_hid_1 = D_hid_enc # Dimension of Encoder's hidden layer
        D_hid_2 = D_bottleneck
        D_hid_3 = D_hid_dec  # Dimension of Decoder's hidden layer
        D_out = D_in # Dimension of Output layer.
        
        self.D_bottleneck = D_bottleneck  # Keep track of it, we will need it.
        
        ##### TODO: Initialize the VAE's parameters. Also see forward_pass(...)) #####################
        # Dimensions of parameter tensors are (number of neurons + 1) per layer, to account for +1 bias.
        # -- (Encoder) layer 1
        w1_init = rng.normal(loc=0.0, scale=0.01, size=(D_in+1, D_hid_1))
        # -- (Encoder) layer 2, predicting p(z|x)
        w2_mu_init = rng.normal(loc=0.0, scale=0.01, size=(D_hid_1+1, D_hid_2))  # Weights for predicting means.
        w2_std_init = rng.normal(loc=0.0, scale=0.01, size=(??????+1, D_hid_2))  # <----- weights for predicting std
        # -- (Decoder) layer 3
        w3_init = rng.normal(loc=0.0, scale=0.01, size=(D_hid_2+1, D_hid_3))
        # -- (Decoder) layer 4, the output layer
        w4_init = rng.normal(loc=0.0, scale=0.01, size=(D_hid_3+1, D_out))
        
        # Pytorch tensors, parameters of the model
        # Use the above numpy arrays of random floats as initialization for the PyTorch weights.
        # (Encoder)
        w1 = torch.tensor(w1_init, dtype=torch.float, requires_grad=True)
        # (Encoder) Layer 2, predicting p(z|x)
        w2_mu = torch.tensor(?????????, dtype=torch.float, requires_grad=True)   # <------- ?????
        w2_std = torch.tensor(w2_std_init, dtype=torch.float, requires_grad=True)
        # (Decoder)
        w3 = torch.tensor(w3_init, dtype=torch.float, requires_grad=True)
        w4 = torch.tensor(w4_init, dtype=torch.float, requires_grad=True)
        #########################################################################################
            
        # Keep track of all trainable parameters:
        self.params = [w1, w2_mu, w2_std, w3, w4]

        
    
    def encode(self, batch_imgs):
        # batch_imgs: Numpy array or Pytorch tensor of shape: [number of inputs, dimensionality of x]
        [w1, w2_mu, w2_std, w3, w4] = self.params
        
        batch_imgs_t = torch.tensor(batch_imgs, dtype=torch.float) if type(batch_imgs) is np.ndarray else batch_imgs
        
        unary_feature_for_bias = torch.ones(size=(batch_imgs_t.shape[0], 1)) # [N, 1] column vector.
        x = torch.cat((batch_imgs_t, unary_feature_for_bias), dim=1) # Extra feature=1 for bias.
        
        # ========== TODO: Fill in the gaps with the correct parameters of the VAE ========
        # Encoder's Layer 1
        h1_preact = x.mm(w1)
        h1_act = h1_preact.clamp(min=0)
        # Encoder's Layer 2 (predicting p(z|x) of Z coding):
        h1_ext = torch.cat((h1_act, unary_feature_for_bias), dim=1)
        # ... mu
        h2_mu_preact = ???????.mm(w2_mu)   # <-------------------------------
        h2_mu_act = h2_mu_preact #h2_preact.clamp(min=0)
        # ... log(std). Why do we do this, instead of directly predicting std deviation? See lecture slides.
        h2_logstd_preact = h1_ext.mm(??????????)  # <------------------------
        h2_logstd_act = h2_logstd_preact  # No (linear) activation function in this tutorial, but can use any.
        # ==============================================================================
        
        z_coding = (h2_mu_act, h2_logstd_act)
        
        return z_coding
        
        
    def decode(self, z_codes):
        # z_codes: numpy array or pytorch tensor, shape [N, dimensionality of Z]
        [w1, w2_mu, w2_std, w3, w4] = self.params
        
        z_codes_t = torch.tensor(z_codes, dtype=torch.float) if type(z_codes) is np.ndarray else z_codes
        
        unary_feature_for_bias = torch.ones(size=(z_codes_t.shape[0], 1)) # [N, 1] column vector.
        
        # ========== TODO: Fill in the gaps with the correct parameters of the VAE ========
        # Decoder's 1st layer (Layer 3 of whole VAE):
        h2_ext = torch.cat((z_codes_t, unary_feature_for_bias), dim=1)
        h3_preact = h2_ext.mm(???????)  # <------------------------
        h3_act = h3_preact.clamp(min=0)
        # Decoder's 2nd layer (Layer 4 of whole VAE): The output layer.
        h3_ext = torch.cat((h3_act, unary_feature_for_bias), dim=1)
        h4_preact = h3_ext.mm(w4)
        h4_act = torch.tanh(h4_preact)
        # ==============================================================================
        
        # Output
        x_pred = h4_act
        
        return x_pred
        
        
    def sample_with_reparameterization(self, z_mu, z_logstd):
        # Reparameterization trick to sample from N(mu, var) using N(0,1) as intermediate step.
        # param z_mu: Tensor. Mean of the predicted Gaussian p(z|x). Shape: [Num samples, Dimensionality of Z]
        # param z_logstd: Tensor. Log of standard deviation of predicted Gaussian p(z|x). [Num samples, Dim of Z]
        # return: Tensor. [Num samples, Dim of Z]
        
        N_samples = z_mu.shape[0]
        Z_dims = z_mu.shape[1]

        # ========== TODO: Fill in the gaps to complete the reparameterization trick ========
        z_std = torch.exp(???????)       #   <------------------- compute std from log(std)
        eps = torch.randn(size=[N_samples, Z_dims])  # torch.randn_like(std)
        z_samples = ??????? * z_std + z_mu    #           <---------------- Re-parameterization trick
        # ==============================================================================
        
        return z_samples
        
        
    def forward_pass(self, batch_imgs):
        # Performed at every batch during training.
        # Takes an input batch, encodes it, samples a code from p(z|x) with reparameterization, decodes it.
        # Returns: Reconstruction x_pred, predicted means z_mu, predicted log(std) z_logstd, sampled codes z_samples.
        batch_imgs_t = torch.tensor(batch_imgs, dtype=torch.float)  # Makes numpy array to pytorch tensor.
        
        # ========== TODO: Call the appropriate functions, as you defined them above ========
        # Encoder
        z_mu, z_logstd = self.????????(batch_imgs_t)  # <----------------------- ????????????
        z_samples = self.??????????(z_mu, z_logstd)  # <------------- ????????????
        # Decoder
        x_pred = self.?????????(z_samples)  # <------------- ????????????
        # ===================================================================================
        
        return (x_pred, z_mu, z_logstd, z_samples)
        
        
def reconstruction_loss(x_pred, x_real, eps=1e-7):
    # x_pred: [N, D_out] Prediction returned by forward_pass. Numpy array of shape [N, D_out]
    # x_real: [N, D_in]
    
    # If a numpy array is given, change it to a Torch tensor.
    x_pred = torch.tensor(x_pred, dtype=torch.float) if type(x_pred) is np.ndarray else x_pred
    x_real = torch.tensor(x_real, dtype=torch.float) if type(x_real) is np.ndarray else x_real
    
    ######## TODO: Complete the calculation of Reconstruction loss for each sample ###########
    loss_recon = torch.mean(torch.square(???????? - x_real), dim=1)  # <---------- same as for AEs
    ##########################################################################################
    
    cost = torch.mean(loss_recon, dim=0) # Expectation of loss: Mean over samples (axis=0).
    
    return cost


def regularizer_loss(mu, log_std):
    # mu: Tensor, [number of samples, dimensionality of Z]. Predicted means per z dimension
    # log_std: Tensor, [number of samples, dimensionality of Z]. Predicted log(std.dev.) per z dimension.
    
    ######## TODO: Complete the calculation of the Regularizer for each sample ###########
    std = torch.exp(log_std)  # Compute std.dev. from log(std.dev.)
    reg_loss_per_sample = 0.5 * torch.sum(????**2 + std**2 - 1 - 2 * log_std, dim = 1)  # <------ See lecture slides
    reg_loss = torch.mean(reg_loss_per_sample, dim = 0)  # Mean over samples.
    ##########################################################################################
    
    return reg_loss


def vae_loss(x_real, x_pred, z_mu, z_logstd, lambda_rec=1., lambda_reg=0.005, eps=1e-7):
    
    rec_loss = reconstruction_loss(x_pred, x_real, eps=eps)  # Forward the eps argument instead of hard-coding it.
    reg_loss = regularizer_loss(z_mu, z_logstd)
    
    ################### TODO: compute the total loss: #####################################
    # ...by weighting the reconstruction loss by lambda_rec, and the Regularizer by lambda_reg
    weighted_rec_loss = lambda_rec * ????????
    weighted_reg_loss = lambda_reg * ????????
    total_loss = weighted_rec_loss + weighted_reg_loss
    #######################################################################################
    
    return total_loss, weighted_rec_loss, weighted_reg_loss
    
    

If this task is completed correctly, you should be able to run the cell and get no errors, though no output is produced yet. We will use this code in the next task, and then we will find out if everything went well :-)
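For comparison, PyTorch ships a built-in reparameterized sampler, `torch.distributions.Normal.rsample`. The sketch below is not the hand-written method from this tutorial. It uses toy tensors (`mu`, `log_std`, chosen here only for illustration) to show the property that matters for training: gradients flow from the sampled codes back to the predicted mean and log(std).

```python
import torch
from torch.distributions import Normal

torch.manual_seed(0)
mu = torch.zeros(5, 2, requires_grad=True)        # Toy means, shape [N, Z].
log_std = torch.zeros(5, 2, requires_grad=True)   # Toy log(std), so std = 1.

# rsample() draws noise from N(0, 1) internally and returns a sample that is a
# differentiable function of loc and scale (reparameterized sampling).
z = Normal(loc=mu, scale=torch.exp(log_std)).rsample()

# Any scalar computed from z can now be back-propagated to mu and log_std.
z.sum().backward()
print(z.shape, mu.grad is not None, log_std.grad is not None)
```

A similar check on your own `sample_with_reparameterization` (gradients present on the inputs after `backward()`) is a quick way to catch mistakes before training.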

## Task 2: Unsupervised training of VAE

Below you are given the main training function, which performs gradient descent in unsupervised fashion.

In the code below, a random batch of images is given to the VAE for a forward_pass (encode, sample via the reparameterization trick, decode). The forward pass returns the reconstruction of each sample (x_pred), and the predicted mean and log(standard deviation) of the distribution p(z|x) over codes z for each sample x. It also returns the code z passed to the decoder, which here is a sample from the predicted p(z|x) for each sample.

Then, the total loss of the VAE is calculated via vae_loss(), implemented above, and minimized via Adam.

Fill in the 2 blanks in the code, to simply pass the correct parameters (predicted means (mu) and log(std.dev)) to the loss function (vae_loss()), so that it can get optimized.
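The update step used in the training function (`zero_grad`, backward pass, `step`) can be seen in isolation on a toy problem. This is a minimal sketch under stated assumptions: the parameter `w`, the `target` vector and the quadratic loss are invented for illustration and stand in for the VAE's parameter list and `vae_loss`.

```python
import torch
import torch.optim as optim

torch.manual_seed(0)
# A toy "model": one raw parameter tensor, tracked in a plain Python list,
# in the same way the VAE keeps its weights in self.params.
w = torch.randn(3, requires_grad=True)
params = [w]
target = torch.tensor([1.0, -2.0, 0.5])

optimizer = optim.Adam(params, lr=0.1)
losses = []
for t in range(200):
    loss = torch.mean((w - target) ** 2)  # Toy loss standing in for the VAE loss.
    optimizer.zero_grad()   # Clear gradients left in p.grad by the previous iteration.
    loss.backward()         # Autograd fills p.grad for every parameter.
    optimizer.step()        # Adam updates each parameter in place using p.grad.
    losses.append(loss.item())

print("first loss: {:.4f}, last loss: {:.4f}".format(losses[0], losses[-1]))
```

Calling `zero_grad()` before `backward()` matters because PyTorch accumulates gradients into `p.grad` rather than overwriting them.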

In [None]:
from utils.plotting import plot_train_progress_VAE, plot_grids_of_images  # Use out of the box


def get_random_batch(train_imgs, train_lbls, batch_size, rng):
    # train_imgs: Images. Numpy array of shape [N, H * W]
    # train_lbls: Labels of images. None, or Numpy array of shape [N, C_classes], one hot label for each image.
    # batch_size: integer. Size that the batch should have.
    
    # Sample batch_size random indices (with replacement) from the N available images.
    indices = rng.randint(low=0, high=train_imgs.shape[0], size=batch_size, dtype='int32')
    
    train_imgs_batch = train_imgs[indices]
    if train_lbls is not None:  # Enables function to be used both for supervised and unsupervised learning
        train_lbls_batch = train_lbls[indices]
    else:
        train_lbls_batch = None
    return [train_imgs_batch, train_lbls_batch]


def unsupervised_training_VAE(net,
                             loss_func,
                             lambda_rec,
                             lambda_reg,
                             rng,
                             train_imgs_all,
                             batch_size,
                             learning_rate,
                             total_iters,
                             iters_per_recon_plot=-1):
    # net: Instance of a model. Here, an instance of the VAE class from Task 1.
    # loss_func: Function that computes the loss. Here, vae_loss from Task 1.
    # lambda_rec: weighing of reconstruction loss in total loss. Total = lambda_rec * rec_loss + lambda_reg * reg_loss
    # lambda_reg: same as above, but for regularizer
    # rng: numpy random number generator
    # train_imgs_all: All the training images, flattened. Numpy array, shape [N_tr, H * W]
    # batch_size: Size of the batch that should be processed per SGD iteration by a model.
    # learning_rate: self explanatory.
    # total_iters: how many SGD iterations to perform.
    # iters_per_recon_plot: Integer. Every that many iterations the model predicts training images ...
    #                      ...and we plot their reconstruction. For visual observation of the results.
    loss_total_to_plot = []
    loss_rec_to_plot = []
    loss_reg_to_plot = []
    
    optimizer = optim.Adam(net.params, lr=learning_rate)  # Will use PyTorch's Adam optimizer out of the box
        
    for t in range(total_iters):
        # Sample batch for this SGD iteration
        x_batch, _ = get_random_batch(train_imgs_all, None, batch_size, rng)
        
        ################### TODO: compute the total loss: ################################################
        # Pass parameters of the predicted distribution per x (mean mu and log(std.dev) to the loss function
        
        # Forward pass: Encodes, samples via reparameterization trick, decodes
        x_pred, z_mu, z_logstd, z_codes = net.forward_pass(x_batch)

        # Compute loss:
        total_loss, rec_loss, reg_loss = loss_func(x_batch, x_pred, ??????, ??????, lambda_rec, lambda_reg) # <-------------
        ####################################################################################################
        # Pytorch way
        optimizer.zero_grad()
        _ = net.backward_pass(total_loss)
        optimizer.step()
        
        # ==== Report training loss and accuracy ======
        total_loss_np = total_loss if isinstance(total_loss, float) else total_loss.item()  # PyTorch returns a tensor. Cast to float.
        rec_loss_np = rec_loss if isinstance(rec_loss, float) else rec_loss.item()
        reg_loss_np = reg_loss if isinstance(reg_loss, float) else reg_loss.item()
        if t%10==0:  # Print every 10 iterations
            print("[iter:", t, "]: Total training Loss: {0:.2f}".format(total_loss_np))
        loss_total_to_plot.append(total_loss_np)
        loss_rec_to_plot.append(rec_loss_np)
        loss_reg_to_plot.append(reg_loss_np)
        
        # Every few iterations, show reconstructions
        if t==total_iters-1 or (iters_per_recon_plot > 0 and t%iters_per_recon_plot == 0):
            # Reconstruct all images, to plot reconstructions.
            x_pred_all, z_mu_all, z_logstd_all, z_codes_all = net.forward_pass(train_imgs_all)
            # Cast tensors to numpy arrays
            x_pred_all_np = x_pred_all if type(x_pred_all) is np.ndarray else x_pred_all.detach().numpy()
            
            # Predicted reconstructions have vector shape. Reshape them to original image shape.
            train_imgs_resh = train_imgs_all.reshape([train_imgs_all.shape[0], H_height, W_width])
            x_pred_all_np_resh = x_pred_all_np.reshape([train_imgs_all.shape[0], H_height, W_width])
            
            # Plot a few images, originals and predicted reconstructions.
            plot_grids_of_images([train_imgs_resh[0:100], x_pred_all_np_resh[0:100]],
                                  titles=["Real", "Reconstructions"],
                                  n_imgs_per_row=10,
                                  dynamically=True)
            
    # In the end of the process, plot loss.
    plot_train_progress_VAE(loss_total_to_plot, loss_rec_to_plot, loss_reg_to_plot, iters_per_point=1, y_lims=[1., 1., None])
    

If you completed the above correctly, you should get no error message here. Finally, let's use the above, together with the implementation of the VAE from Task 1, to train a VAE!

Fill in the below gap, to make the VAE shown in figure of Task1 with a 2-dimensional Z representation...

In [None]:
##################### TODO: Fill in the blank ##############################
# Create the network
rng = np.random.RandomState(seed=SEED)
vae = VAE(rng=rng,
          D_in=H_height*W_width,
          D_hid_enc=256,
          D_bottleneck=??????,  # <--- Set to correct value for instantiating VAE shown & implemented in Task 1. Note: We treat D as dimensionality of Z, rather than number of neurons.
          D_hid_dec=256)
########################################################################
# Start training
unsupervised_training_VAE(vae,
                          vae_loss,
                          lambda_rec=1.0,  # <-------- lambda_rec, weight on reconstruction loss.
                          lambda_reg=0.005,  # <------- lambda_reg, weight on regularizer. 0.005 works ok.
                          rng=rng,
                          train_imgs_all=train_imgs_flat,
                          batch_size=40,
                          learning_rate=3e-3,
                          total_iters=1000,
                          iters_per_recon_plot=50)


The above requires you to have completed both Task 1 and Task 2. If everything is completed correctly, you should see the model getting trained and the total training loss printed every few iterations.

At the end of training, after 1000 iterations, you will see 3 curves: one for the TOTAL training loss, one for the *weighted* reconstruction loss, and one for the *weighted* regularization loss (*weighted* = after multiplication with the weights *lambda_rec* and *lambda_reg* respectively, when computing the total loss). If everything is done well, the total and reconstruction losses are expected to decrease to approximately 0.25, and the regularizer to approximately 0.01-0.02.

You should also see printed side by side a set of real images, and their reconstructed version.\
In the end, the reconstructions should start being reasonable.

## Task 3: Encode training data in Z representation and examine

We now have a trained VAE with 2-dimensional representation Z from Task 2. We will here use it to encode training data and obtain the means and standard deviations of the predicted distributions for the codes p(z|x). We will then plot the predicted means for p(z|x) for each sample x, in a 2D plot, to observe how codes are clustered.

Note: Fill in the 1 blank below, run the code, and observe output...


In [None]:
import matplotlib.pyplot as plt

def encode_training_images(net,
                           imgs_flat,
                           lbls,
                           batch_size,
                           total_iterations=None,
                           plot_2d_embedding=True,
                           plot_hist_mu_std_for_dim=0):
    # This function encodes images, plots the first 2 dimensions of the predicted means in a scatter plot, ...
    # ... and plots histograms of the predicted means and std.devs for one chosen dimension of Z.
    # Arguments:
    # imgs_flat: Numpy array of shape [Number of images, H * W]
    # lbls: Numpy array of shape [number of images], with 1 integer per image. The integer is the class (digit).
    # total_iterations: How many batches to encode. We will use this so that we dont encode and plot ...
    # ... the whoooole training database, because the plot will get cluttered with 60000 points.
    
    # If total iterations is None, the function will just iterate over all data, by breaking them into batches.    
    if total_iterations is None:
        total_iterations = (imgs_flat.shape[0] - 1) // batch_size + 1
    
    z_mu_all = []
    z_std_all = []
    lbls_all = []
    for t in range(total_iterations):
        # Take the next batch of images, in order (no random sampling and no training here).
        x_batch = imgs_flat[t*batch_size: (t+1)*batch_size]
        lbls_batch = lbls[t*batch_size: (t+1)*batch_size]  # Just to color the embeddings (z codes) in the plot.
        
        ####### TODO: Fill in the blank ##################################
        # Encode a batch of x inputs:
        z_mu, z_logstd = net.encode(????????)  # <------------------------
        #################################################################
        z_mu_np = z_mu if type(z_mu) is np.ndarray else z_mu.detach().numpy()
        z_logstd_np = z_logstd if type(z_logstd) is np.ndarray else z_logstd.detach().numpy()
        
        z_mu_all.append(z_mu_np)
        z_std_all.append(np.exp(z_logstd_np))
        lbls_all.append(lbls_batch)
        
    z_mu_all = np.concatenate(z_mu_all)  # Make list of arrays in one array by concatenating along dim=0 (image index)
    z_std_all = np.concatenate(z_std_all)
    lbls_all = np.concatenate(lbls_all)
    
    if plot_2d_embedding:
        print("Z-Space and the MEAN of the predicted p(z|x) for each sample (std.devs not shown)")
        # Plot the codes with different color per class in a scatter plot:
        plt.scatter(z_mu_all[:,0], z_mu_all[:,1], c=lbls_all, alpha=0.5)  # Plot the first 2 dimensions.
        plt.show()
    
    print("Histogram of values of the predicted MEANS")
    plt.hist(z_mu_all[:,plot_hist_mu_std_for_dim], bins=20)
    plt.show()
    print("Histogram of values of the predicted STANDARD DEVIATIONS")
    plt.hist(z_std_all[:,plot_hist_mu_std_for_dim], bins=20)
    plt.show()
    
    


# Encode and plot
encode_training_images(vae,
                       train_imgs_flat,
                       train_lbls,
                       batch_size=100,
                       total_iterations=200,
                       plot_2d_embedding=True,
                       plot_hist_mu_std_for_dim=1)



If all went well, you should see 3 plots:
- The top plot shows the 2D space of Z with 1 point per sample x. Only the means are shown, not the corresponding std.devs.
- A histogram of the values of the predicted means, for the dimension of z selected by `plot_hist_mu_std_for_dim`.
- A histogram of the values of the predicted standard deviations, for the same dimension of z.


**Questions:**
- Around what value are the predicted means mostly centered? Why?
- What range of values do the means span? Why?
- What range of values do the standard deviations span? Why do they tend to be smaller than 1?
- How does the form of the top plot compare with the same plot for basic Auto-Encoders? Compare range of values, existence of gaps between classes, and holes in the general space.
- Observe the way the different samples are clustered by this VAE. Is this VAE as good for clustering as the basic AE from previous tutorial? Do you think that VAEs, in general, are better or worse than AEs in clustering datapoints? (Reminder: Clustering = similar points to similar codes, dissimilar points well separated and mapped to dissimilar codes) 

## Task 4: Train VAE from Task 1 and 2 only with Reconstruction loss:

The code below is complete. Just run it and compare results with those of previous Tasks, where the VAE was trained with both the reconstruction loss and the regularizer.

In [None]:
# Create the network
rng = np.random.RandomState(seed=SEED)
vae_2 = VAE(rng=rng,
            D_in=H_height*W_width,
            D_hid_enc=256,
            D_bottleneck=2,
            D_hid_dec=256)
# Start training
unsupervised_training_VAE(vae_2,
                          vae_loss,
                          lambda_rec=1.0,
                          lambda_reg=0.0,  # Essentially not minimizing regularizer. Only reconstruction.
                          rng=rng,
                          train_imgs_all=train_imgs_flat,
                          batch_size=40,
                          learning_rate=3e-3,
                          total_iters=1000,
                          iters_per_recon_plot=50)

**Questions:**
- Observe the Reconstruction loss for the VAE trained only with the reconstruction loss (here) and the VAE trained to minimize both the Reconstruction loss and the Regularizer (Task 2). How do they compare? (if not much difference is visually obvious, which one do you think should be smaller in theory?)
- Which of the two VAEs do you expect should achieve better reconstruction in theory? VAE_2 from this task, trained only with reconstruction loss, or VAE from Task 2, trained also using the regularizer?

In [None]:
# Encode and plot
encode_training_images(vae_2, # The second VAE, trained only with Reconstruction loss.
                       train_imgs_flat,
                       train_lbls,
                       batch_size=100,
                       total_iterations=200,
                       plot_2d_embedding=True,
                       plot_hist_mu_std_for_dim=1)

**Questions:**
- Compare the 2D plots of codings in Z-space between this VAE and the one trained with both the reconstruction loss and the regularizer in Task 3. What do you observe in the way the different samples are encoded and grouped? How do you relate this to the basic AutoEncoder from the previous tutorial?
- Compare predicted means of p(z|x). Compare the values they take with those from Task 3. Does the reconstruction loss encourage them to keep low values? Is there a benefit for reconstruction if means get higher values?
- Observe the standard deviations predicted for p(z|x) in this task (reconstruction loss only). What values do they take? In comparison to the values in Task 3, these values should be significantly smaller. Why does the reconstruction loss encourage standard deviations that are as small as possible?
- How do you relate this model with the basic Auto-Encoder?

## Task 5: Train VAE from Task 1 and 2 to minimize only the Regularizer

The code below is complete. Just run it and compare results with those of previous Tasks 2,3,4.
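
For intuition about what the regularizer alone rewards, here is a minimal sketch. It assumes the regularizer is the usual closed-form KL divergence between a diagonal Gaussian N(mu, std^2) and the prior N(0, I); the `vae_loss` used in this tutorial may scale or average it differently. The function name `kl_to_standard_normal` and the toy values are hypothetical.

```python
import numpy as np

def kl_to_standard_normal(mu, log_std):
    # Assumed closed form of KL( N(mu, std^2) || N(0, I) ), summed over z-dimensions:
    # 0.5 * sum(mu^2 + std^2 - 2*log(std) - 1)
    std = np.exp(log_std)
    return 0.5 * np.sum(mu**2 + std**2 - 2.0 * log_std - 1.0, axis=1)

# Three toy posteriors over a 2-dimensional z (illustrative values only).
mu = np.array([[0.0, 0.0],      # mean 0, std 1: matches the prior
               [2.0, -1.0],     # mean away from 0, std 1
               [0.0, 0.0]])     # mean 0, very small std
log_std = np.array([[0.0, 0.0],
                    [0.0, 0.0],
                    [-3.0, -3.0]])

kl = kl_to_standard_normal(mu, log_std)
print(kl)
```

Under this assumed form, the penalty is zero only when the predicted mean is 0 and the predicted standard deviation is 1, and it grows when either moves away from those values.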

In [None]:
# Create the network
rng = np.random.RandomState(seed=SEED)
vae_3 = VAE(rng=rng,
            D_in=H_height*W_width,
            D_hid_enc=256,
            D_bottleneck=2,
            D_hid_dec=256)
# Start training
unsupervised_training_VAE(vae_3,
                          vae_loss,
                          lambda_rec=0.0,  # <------- No reconstruction loss. Only regularizer
                          lambda_reg=0.005,
                          rng=rng,
                          train_imgs_all=train_imgs_flat,
                          batch_size=40,
                          learning_rate=3e-3,
                          total_iters=1000,
                          iters_per_recon_plot=50)

**Questions:**
- How good are the reconstructions? Why?

In [None]:
# Encode and plot
encode_training_images(vae_3, # The third VAE, trained only with the Regularizer.
                       train_imgs_flat,
                       train_lbls,
                       batch_size=100,
                       total_iterations=200,
                       plot_2d_embedding=True,
                       plot_hist_mu_std_for_dim=1)

**Questions:**
- Observe the predicted means of p(z|x). What values do they take? Why?
- Observe the standard deviations of p(z|x). What values do they take? Why?
- What type of information do you think this encoder has learned to represent about the data in space of Z?

## Task 6: Train a VAE with a larger bottleneck layer

Below, we create a VAE with a bottleneck layer of 32 dimensions (one mu and one std.dev predicted for each) and train it appropriately, with both the reconstruction loss and the regularizer.

The code is complete. Just run it and observe the results.

In [None]:
# Same as in Task 2, but using a bottleneck with 32 dimensions

# Create the network
rng = np.random.RandomState(seed=SEED)
vae_wide = VAE(rng=rng,
          D_in=H_height*W_width,
          D_hid_enc=256,
          D_bottleneck=32,  # <-----------------------------------
          D_hid_dec=256)
# Start training
unsupervised_training_VAE(vae_wide,
                          vae_loss,
                          1.0,  # Weight on the reconstruction loss.
                          0.005,  # Weight on the regularizer. 0.005 works well for synthesis! 0.0005 better for smooth z values for 32n.
                          rng,
                          train_imgs_flat,
                          batch_size=40,
                          learning_rate=3e-3,  # 3e-3
                          total_iters=1000,
                          iters_per_recon_plot=50)


**Questions:**
- Compare the Reconstruction loss at the end of training with the reconstruction loss achieved by the corresponding basic AE in the previous Tutorial, in Task 4. If a VAE and an AE have the same architectures, which one do you expect to achieve lower reconstruction loss on the training data? Why?

## Task 7: Synthesizing (generating) new data with a VAE

Below we will use a VAE to generate new data.

![title](./documentation/vae_synthesis.png)

A trained VAE has learned, via the regularizer, to encode samples in such a way so that the distribution of codes z matches the 'prior' distribution p(z)=N(0,I) (Gaussian with 0 mean and 1 std deviation in all dimensions of space Z).

To synthesize new data:
- We sample a code z from the 'prior' p(z) = N(0,I).
- We decode it with the VAE.

**FILL IN THE BLANKS** in the below code, to enable it to sample from the N(0,I) normal distribution to synthesize data:


In [None]:
def synthesize(enc_dec_net,
               rng,
               n_samples):
    # enc_dec_net: Network with encoder and decoder, pretrained.
    # n_samples: how many samples to produce.
    
    z_dims = enc_dec_net.D_bottleneck  # Dimensionality of z codes (and input to decoder).
    
    ############################## TODO: Fill in the blanks #############################
    # Create samples of z from Gaussian N(0,I), where means are 0 and standard deviations are 1 in all dimensions.
    z_samples = np.random.normal(loc=?????, scale=?????, size=[n_samples, z_dims])
    #####################################################################################
    
    z_samples_t = torch.tensor(z_samples, dtype=torch.float)
    x_samples = enc_dec_net.decode(z_samples_t)
    
    x_samples_np = x_samples if type(x_samples) is np.ndarray else x_samples.detach().numpy()  # torch to numpy
    
    for x_sample in x_samples_np:
        plot_image(x_sample.reshape([H_height, W_width]))
       
    
# Let's finally run the synthesis and see what happens...
rng = np.random.RandomState(seed=SEED)

synthesize(vae_wide,
           rng,
           n_samples=20)

If everything was filled correctly, you should see above images created by the decoder for the randomly sampled z-codes.

**Questions:**

- Compare the above results with those obtained from the basic AE with similar architecture in the previous Tutorial (Task 5). What do you observe? Which property of the VAE makes this possible, with respect to the region of Z-space from which we sample codes?

## Task 8: For a given x, reconstruct random samples from the predicted posterior p(z|x).

Given an input x, the encoder of a VAE predicts the distribution p(z|x), which explains which values of z are the "probable" codes for x. During training, random z samples are sampled via the reparameterization trick from p(z|x), and the decoder is trained to decode them all to reconstruct x. 

If p(z|x) is parameterized as a Gaussian, as commonly done in VAEs (and in this tutorial), **the predicted mean is the most probable code, and codes near it are sampled most often**. The probability of a code being sampled decreases as we move away from the mean, at a rate that depends on the predicted standard deviation of p(z|x). One may therefore wonder how well the decoder learns to reconstruct from z codes drawn from the whole of p(z|x) (not just the mean), and what those reconstructions look like. We explore this here.
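
To make the statement about sampling frequency concrete, here is a small sketch with a toy 1-dimensional Gaussian. The values of `mu` and `std` are illustrative assumptions, not outputs of the tutorial's encoder. It measures how often samples land within 1 or 2 standard deviations of the mean, and how rarely they land beyond 3.

```python
import numpy as np

rng = np.random.RandomState(seed=0)

mu, std = 1.5, 0.2  # Toy posterior parameters (illustrative values only).
z = rng.normal(loc=mu, scale=std, size=100000)

# Distance of each sample from the mean, measured in standard deviations.
dist_in_stds = np.abs(z - mu) / std

frac_within_1 = np.mean(dist_in_stds < 1.0)
frac_within_2 = np.mean(dist_in_stds < 2.0)
frac_beyond_3 = np.mean(dist_in_stds > 3.0)
print(frac_within_1, frac_within_2, frac_beyond_3)
```

Most samples fall close to the mean, so during training the decoder sees codes far from the mean only rarely. How far "far" is depends on the predicted standard deviation.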

In the below:
- We will take a single image x.
- We will encode it with the pre-trained VAE (32D Z) from Task 6, to predict mean and std.dev of p(z|x).
- We will sample codes z from p(z|x) and reconstruct based on them.

**FILL IN THE BLANKS** below, to enable the code to sample from the predicted distribution p(z|x) for the given sample x:


In [None]:
def sample_variations_of_x(enc_dec_net,
                           imgs_flat,
                           idx_img_x,
                           rng,
                           n_samples):
    # enc_dec_net: Network with encoder and decoder, pretrained.
    # imgs_flat: [number of images, H * W]
    # idx_img_x: index of the image to encode: x = imgs_flat[idx_img_x]
    # n_samples: how many samples to produce.
    
    img_x_nparray = imgs_flat[idx_img_x:idx_img_x+1]  # Shape: [num samples = 1, H * W]
    
    # Encode:
    z_mu, z_logstd = enc_dec_net.encode(img_x_nparray)  # Input: [N, H * W]. Returns mu and log(std), each [N, dims_z]
    
    z_dims = z_mu.shape[1]  # Dimensionality of z codes (and input to decoder).
    z_mu = z_mu.detach().numpy()  # Make pytorch tensor a numpy array
    z_logstd = z_logstd.detach().numpy()
    
    ############# TODO: Fill in the blanks ##################################
    # Samples z values from the predicted probability of z for this sample x: p(z|x) = N(mu(x), std^2(x))
    z_std = np.exp(?????????)   # <------ what you need is returned by the encoding above -------------
    z_samples = np.random.normal(loc=???????, scale=z_std, size=[n_samples, z_dims]) # <------------------
    #########################################################################
    
    x_samples = enc_dec_net.decode(z_samples)
    
    x_samples_np = x_samples if type(x_samples) is np.ndarray else x_samples.detach().numpy()  # torch to numpy
    
    print("Real input to encoder:")
    plot_image(img_x_nparray.reshape([H_height, W_width]))   
    print("Reconstructions based on samples from p(z|x=input):")
    plot_grid_of_images(x_samples_np.reshape([n_samples, H_height, W_width]),
                        n_imgs_per_row=10,
                        dynamically=False)
    print("Going to plot all the reconstructed variations one by one, for easier visual investigation:")
    for x_sample in x_samples_np:
        plot_image(x_sample.reshape([H_height, W_width]))
    
    diff = img_x_nparray[0] - x_samples_np[0]
    
# Let's finally run the sampling and see what happens...
rng = np.random.RandomState(seed=SEED)

sample_variations_of_x(vae_wide,  # The VAE with 32 dimensional Z.
                       train_imgs_flat,
                       idx_img_x=1,  # We will encode the image with index 1, and then reconstruct it.
                       rng=rng,
                       n_samples=100)

**Questions:**
- Do the reconstructions look like the original digit? Has the decoder learned to reconstruct from whole p(z|x)?
- Feel free to experiment with other digits.

## Task 9: Interpolate between x_1 and x_2 in space Z.

Here, we want to create variations of an input more "systematically" (not randomly as above). We want to create images that look partly like an input x1 and partly like an input x2, by interpolating between x1 and x2 in the latent space of Z codes.

Steps:

0. We are given a pre-trained VAE.
1. We will encode x1
2. We will encode x2
3. We will create z codes by walking between mu(x1) and mu(x2) for various alpha values:\
    z = mu(x1) + alpha * (mu(x2) - mu(x1))
4. We will decode these z codes to look at how the images look.
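
The interpolation in step 3 can be sketched compactly with NumPy broadcasting. This is an illustrative alternative to the loop used in the cell below, not a replacement for it; the helper name `interpolate_codes` and the toy 3-dimensional codes are assumptions made for the example.

```python
import numpy as np

def interpolate_codes(z_mu1, z_mu2, n_steps=11):
    # Returns an array of shape [n_steps, z_dims].
    # Row 0 equals z_mu1 (alpha = 0) and the last row equals z_mu2 (alpha = 1).
    alphas = np.linspace(0.0, 1.0, n_steps).reshape(-1, 1)  # [n_steps, 1], broadcast over z-dims.
    return z_mu1 + alphas * (z_mu2 - z_mu1)

# Toy codes standing in for mu(x1) and mu(x2).
z_mu1 = np.array([0.0, 2.0, -1.0])
z_mu2 = np.array([1.0, 0.0, 3.0])

zs = interpolate_codes(z_mu1, z_mu2)
print(zs.shape)
```

Each row of `zs` is one point on the straight line between the two codes, and it is these rows that get decoded in step 4.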

The code below is complete. Run it and observe the output.


In [None]:
def interpolate_between_x1_x2(enc_dec_net,
                              imgs_flat,
                              idx_x1,
                              idx_x2,
                              rng):
    # enc_dec_net: Network with encoder and decoder, pretrained.
    # imgs_flat: [number of images, H * W]
    # idx_x1: index of x1: x1 = imgs_flat[idx_x1]
    # idx_x2: index of x2: x2 = imgs_flat[idx_x2]
    # rng: random number generator (not used in this function).
    
    img_x1_nparray = imgs_flat[idx_x1]
    img_x2_nparray = imgs_flat[idx_x2]
    z_mus, z_logstds = enc_dec_net.encode(np.array([img_x1_nparray, img_x2_nparray]))
    z_mus = z_mus.detach().numpy()
    
    z_mu1 = z_mus[0]  # np vector with [z-dims] elements
    z_mu2 = z_mus[1]
    
    z_dims = z_mu1.shape[0]  # Dimensionality of z codes (and input to decoder).
    
    # Reconstruct x1 and x2 based on mu codes:
    x_samples = enc_dec_net.decode(np.array([z_mu1, z_mu2]))
    x_samples = x_samples.detach().numpy()
    x1_rec = x_samples[0]
    x2_rec = x_samples[1]
    
    # Interpolate:
    alphas = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
    
    alphas_np = np.ones([11, z_dims], dtype="float16")  # [number of interpolated samples = 11, z-dimensions]
    for row_idx in range(alphas_np.shape[0]):
        alphas_np[row_idx] = alphas_np[row_idx] * alphas[row_idx]  # now whole 1st row == 0.0, 2nd row == 0.1, ...
    
    # Interpolate new z values
    zs_to_decode = z_mu1 + alphas_np * (z_mu2 - z_mu1)
    
    x_samples= enc_dec_net.decode(zs_to_decode)
    
    x_samples_np = x_samples if type(x_samples) is np.ndarray else x_samples.detach().numpy()  # torch to numpy
    
    print("Inputs to encoder:")
    plot_images([img_x1_nparray.reshape([H_height, W_width]), img_x2_nparray.reshape([H_height, W_width])],
               titles=["Real x1", "Real x2"])
    print("Reconstructions of x1 and x2 based on their most likely predicted z codes (corresponding mus):")
    plot_images([x1_rec.reshape([H_height, W_width]), x2_rec.reshape([H_height, W_width])],
               titles=["Recon of x1", "Recon of x2"])
    print("Decodings based on z samples interpolated between mu(x1) and mu(x2) predicted by encoder:")
    plot_grid_of_images(x_samples_np.reshape([11, H_height, W_width]),
                        n_imgs_per_row=11,
                        dynamically=False)
    print("Going to plot all the reconstructed variations one by one, for easier visual investigation:")
    for x_sample in x_samples_np:
        plot_image(x_sample.reshape([H_height, W_width]))
    
    
# Let's finally run the interpolation and see what happens...
rng = np.random.RandomState(seed=SEED)

interpolate_between_x1_x2(vae_wide,
                          train_imgs_flat,
                          idx_x1=1,
                          idx_x2=3,
                          rng=rng)

**Questions:**
- As the one digit "morphs" into the other, do the intermediate interpolations look like garbage or like digits? Why? How do you explain this based on what you have seen in Task 3?
- How do you think this result would compare with similar results if performed with a standard AE? How would they compare if you would train the VAE only with the Reconstruction loss?

## Task 10: Learning from Unlabelled data with a VAE, to complement a Supervised Classifier when Labelled data are limited: Let's first train a supervised Classifier 'from scratch'

In the previous Tutorial, we saw two approaches for **using a pre-trained AE to improve performance of a Supervised Classifier, when labelled data are limited**. Approach 1: Use weights of AE's encoder as "frozen" feature extractor (with the classifier attached and trained on top). Approach 2: Use weights of AE's encoder to "initialize" the corresponding layers of a Classifier, and then "refine" the whole classifier with the labelled data.

We saw this clearly improved performance when done with a basic Auto-Encoder.

Here, we will attempt exactly the same with a VAE.

In this task, we will create and train a fully-supervised MLP classifier only **on very limited (100) labelled data**. This is to compare the performance of this classifier with what we achieve when complementing it with unlabelled data using a VAE (in later Task).

![title](./documentation/classifier_scratch.png)

**The below code for creating a classifier, cross entropy loss, and training loop is complete.**\
This is **EXACTLY the same as the code of corresponding Task 6 of the previous Tutorial. Just run it.**
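
As a refresher, the softmax and cross-entropy computed in the cell below can be checked on a toy example. This is a NumPy sketch with hypothetical names (`softmax`, `cross_entropy_np`). Note one deliberate difference: it subtracts the row-wise maximum before exponentiating, a common stabilisation that the tutorial's own code does not use.

```python
import numpy as np

def softmax(logits):
    # Subtracting the row-wise max leaves the result unchanged but avoids overflow in exp.
    shifted = logits - np.max(logits, axis=1, keepdims=True)
    exp_shifted = np.exp(shifted)
    return exp_shifted / np.sum(exp_shifted, axis=1, keepdims=True)

def cross_entropy_np(y_pred, y_real, eps=1e-7):
    # Mean over samples of -sum_c y_real[c] * log(y_pred[c] + eps).
    return np.mean(-np.sum(y_real * np.log(y_pred + eps), axis=1))

# Two toy samples, three classes.
logits = np.array([[2.0, 0.0, 0.0],
                   [0.0, 0.0, 0.0]])
y_real = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])  # One-hot labels.

y_pred = softmax(logits)
loss = cross_entropy_np(y_pred, y_real)
print(y_pred.sum(axis=1), loss)
```

Each row of `y_pred` sums to 1, and equal logits give equal class probabilities. The `eps` term guards against `log(0)` when a predicted probability is exactly zero.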

In [None]:
class Classifier_3layers(Network):
    def __init__(self, D_in, D_hid_1, D_hid_2, D_out, rng):
        D_in = D_in
        D_hid_1 = D_hid_1
        D_hid_2 = D_hid_2
        D_out = D_out
        
        # === NOTE: Notice that this is exactly the same architecture as encoder of AE in Task 4 ====
        w_1_init = rng.normal(loc=0.0, scale=0.01, size=(D_in+1, D_hid_1))
        w_2_init = rng.normal(loc=0.0, scale=0.01, size=(D_hid_1+1, D_hid_2))
        w_out_init = rng.normal(loc=0.0, scale=0.01, size=(D_hid_2+1, D_out))
        
        w_1 = torch.tensor(w_1_init, dtype=torch.float, requires_grad=True)
        w_2 = torch.tensor(w_2_init, dtype=torch.float, requires_grad=True)
        w_out = torch.tensor(w_out_init, dtype=torch.float, requires_grad=True)
        
        self.params = [w_1, w_2, w_out]
        
        
    def forward_pass(self, batch_inp):
        # compute predicted y
        [w_1, w_2, w_out] = self.params
        
        # In case input is image, make it a tensor.
        batch_imgs_t = torch.tensor(batch_inp, dtype=torch.float) if type(batch_inp) is np.ndarray else batch_inp
        
        unary_feature_for_bias = torch.ones(size=(batch_imgs_t.shape[0], 1)) # [N, 1] column vector.
        x = torch.cat((batch_imgs_t, unary_feature_for_bias), dim=1) # Extra feature=1 for bias.
        
        # === NOTE: This is the same architecture as encoder of AE in Task 4, with extra classification layer ===
        # Layer 1
        h1_preact = x.mm(w_1)
        h1_act = h1_preact.clamp(min=0)
        # Layer 2 (corresponds to bottleneck of the AE):
        h1_ext = torch.cat((h1_act, unary_feature_for_bias), dim=1)
        h2_preact = h1_ext.mm(w_2)
        h2_act = h2_preact.clamp(min=0)
        # Output classification layer
        h2_ext = torch.cat((h2_act, unary_feature_for_bias), dim=1)
        h_out = h2_ext.mm(w_out)
        
        logits = h_out
        
        # === Output layer activation function ===
        # Softmax activation function.
        exp_logits = torch.exp(logits)
        y_pred = exp_logits / torch.sum(exp_logits, dim=1, keepdim=True) 
        # sum with keepdim=True returns [N,1] array. It would be [N] if keepdim=False.
        # Torch broadcasts [N,1] to [N,D_out] via repetition, to divide elementwise exp_logits (which is [N,D_out]).
        
        return y_pred

    
def cross_entropy(y_pred, y_real, eps=1e-7):
    # y_pred: Predicted class-posterior probabilities, returned by forward_pass. Numpy array of shape [N, D_out]
    # y_real: One-hot representation of real training labels. Same shape as y_pred.
    
    # If numpy array is given, change it to a Torch tensor.
    y_pred = torch.tensor(y_pred, dtype=torch.float) if type(y_pred) is np.ndarray else y_pred
    y_real = torch.tensor(y_real, dtype=torch.float) if type(y_real) is np.ndarray else y_real
    
    x_entr_per_sample = - torch.sum( y_real*torch.log(y_pred+eps), dim=1)  # Sum over classes, axis=1
    
    loss = torch.mean(x_entr_per_sample, dim=0) # Expectation of loss: Mean over samples (axis=0).
    return loss



from utils.plotting import plot_train_progress_2

def train_classifier(classifier,
                     pretrained_VAE,
                     loss_func,
                     rng,
                     train_imgs,
                     train_lbls,
                     test_imgs,
                     test_lbls,
                     batch_size,
                     learning_rate,
                     total_iters,
                     iters_per_test=-1):
    # Arguments:
    # classifier: A classifier network. It will be trained by this function using labelled data.
    #             Its input will be either original data (if pretrained_VAE=None), ...
    #             ... or the output of the feature extractor if one is given.
    # pretrained_VAE: A pretrained VAE that will *not* be trained here.
    #      It will be used to encode input data.
    #      The classifier will take as input the output of this feature extractor.
    #      If pretrained_VAE = None: The classifier will simply receive the actual data as input.
    # train_imgs: Vectorized training images
    # train_lbls: One hot labels
    # test_imgs: Vectorized testing images, to compute generalization accuracy.
    # test_lbls: One hot labels for test data.
    # batch_size: batch size
    # learning_rate: come on...
    # total_iters: how many SGD iterations to perform.
    # iters_per_test: We will 'test' the model on test data every few iterations as specified by this.
    
    values_to_plot = {'loss':[], 'acc_train': [], 'acc_test': []}
    
    optimizer = optim.Adam(classifier.params, lr=learning_rate)
        
    for t in range(total_iters):
        # Sample batch for this SGD iteration
        train_imgs_batch, train_lbls_batch = get_random_batch(train_imgs, train_lbls, batch_size, rng)
        
        # Forward pass
        if pretrained_VAE is None:
            inp_to_classifier = train_imgs_batch
        else:
            ############### TODO FOR TASK-11 #########################################
            # FILL IN THE BLANK, to provide as input to the classifier the predicted MEAN of p(z|x) for each x.
            # Why? Because the mean is the most likely (probable) code z for x!!
            #
            z_codes_mu, z_codes_logstd = pretrained_VAE.encode(train_imgs_batch)  # VAE encodes. Output will be given to Classifier
            inp_to_classifier = ???????????????  # <----------------------------------------
            ############################################################################
            
        y_pred = classifier.forward_pass(inp_to_classifier)
        
        # Compute loss:
        y_real = train_lbls_batch
        loss = loss_func(y_pred, y_real)  # Cross entropy
        
        # Backprop and updates.
        optimizer.zero_grad()
        grads = classifier.backward_pass(loss)
        optimizer.step()
        
        
        # ==== Report training loss and accuracy ======
        # y_pred and loss can be either np.array, or torch.tensor (see later). If tensor, make it np.array.
        y_pred_numpy = y_pred if type(y_pred) is np.ndarray else y_pred.detach().numpy()
        y_pred_lbls = np.argmax(y_pred_numpy, axis=1) # y_pred is soft/probability. Make it a hard one-hot label.
        y_real_lbls = np.argmax(y_real, axis=1)
        
        acc_train = np.mean(y_pred_lbls == y_real_lbls) * 100. # percentage
        
        loss_numpy = loss if isinstance(loss, float) else loss.item()  # Plain Python float, whether loss is a float or a tensor.
        if t%10 == 0:
            print("[iter:", t, "]: Training Loss: {0:.2f}".format(loss_numpy), "\t Accuracy: {0:.2f}".format(acc_train))
        
        # =============== Every few iterations, test accuracy ================#
        if t==total_iters-1 or t%iters_per_test == 0:
            if pretrained_VAE is None:
                inp_to_classifier_test = test_imgs
            else:
                z_codes_test_mu, z_codes_test_logstd = pretrained_VAE.encode(test_imgs)
                inp_to_classifier_test = z_codes_test_mu
                
            y_pred_test = classifier.forward_pass(inp_to_classifier_test)
            
            # ==== Report test accuracy ======
            y_pred_test_numpy = y_pred_test if type(y_pred_test) is np.ndarray else y_pred_test.detach().numpy()
            
            y_pred_lbls_test = np.argmax(y_pred_test_numpy, axis=1)
            y_real_lbls_test = np.argmax(test_lbls, axis=1)
            acc_test = np.mean(y_pred_lbls_test == y_real_lbls_test) * 100.
            print("\t\t\t\t\t\t\t\t Testing Accuracy: {0:.2f}".format(acc_test))
            
            # Keep list of metrics to plot progress.
            values_to_plot['loss'].append(loss_numpy)
            values_to_plot['acc_train'].append(acc_train)
            values_to_plot['acc_test'].append(acc_test)
                
    # In the end of the process, plot loss accuracy on training and testing data.
    plot_train_progress_2(values_to_plot['loss'], values_to_plot['acc_train'], values_to_plot['acc_test'], iters_per_test)
    

Now below, we create an instance of this 3-layered classifier and train it on 100 labeled samples. We evaluate generalization on Test samples.

In [None]:
# Train Classifier from scratch (initialized randomly)

# Create the network
rng = np.random.RandomState(seed=SEED)
net_classifier_from_scratch = Classifier_3layers(D_in=H_height*W_width,
                                                 D_hid_1=256,
                                                 D_hid_2=32,
                                                 D_out=C_classes,
                                                 rng=rng)
# Start training
train_classifier(net_classifier_from_scratch,
                 None,  # No pretrained VAE
                 cross_entropy,
                 rng,
                 train_imgs_flat[:100],
                 train_lbls_onehot[:100],
                 test_imgs_flat,
                 test_lbls_onehot,
                 batch_size=40,
                 learning_rate=3e-3,
                 total_iters=1000,
                 iters_per_test=20)


This is "exactly" the same as the corresponding Task 6 in the previous Tutorial. Simply run it, and note down the final Accuracy on the Test data.

## Task 11: Use pre-trained VAE as 'feature-extractor' for a supervised Classifier when labels are limited.

Approach-1: We take the encoder of the pre-trained VAE and place an untrained, small (1 layer) Classifier on top. The Classifier receives as input, codes that the VAE's encoder predicts when given input x. We then use the limited labelled data for training. Importantly, we only train the small Classifier. The encoder is used as 'frozen' (does not get trained further) feature extractor. See next figure for a visual explanation.

![title](./documentation/vae_refine_1.png)
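
The idea of a frozen feature extractor can be sketched in plain NumPy. Everything here is a toy assumption made for illustration: the "encoder" is a pair of fixed random matrices standing in for a pre-trained VAE encoder, the data are synthetic, and gradients are written by hand rather than computed by PyTorch. Only the 1-layer classifier's weights are updated.

```python
import numpy as np

rng = np.random.RandomState(seed=0)

# Toy stand-in for a pretrained encoder: fixed weights mapping x to (mu, log_std).
w_mu = rng.normal(scale=0.5, size=(4, 2))
w_logstd = rng.normal(scale=0.5, size=(4, 2))
w_mu_before = w_mu.copy()

def encode(x):
    return x.dot(w_mu), x.dot(w_logstd)

def softmax(logits):
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Toy labelled data: 2 classes, chosen to be linearly separable in code space.
x = rng.normal(size=(100, 4))
mu_all, _ = encode(x)
lbls = (mu_all[:, 0] > 0).astype(int)
y = np.eye(2)[lbls]                      # One-hot labels.

w_clf = np.zeros((2 + 1, 2))             # 1-layer classifier, +1 row for the bias.

def loss_and_grad(w):
    feats, _ = encode(x)                 # Deterministic features: the predicted mean only.
    feats_ext = np.hstack([feats, np.ones((feats.shape[0], 1))])
    p = softmax(feats_ext.dot(w))
    loss = -np.mean(np.sum(y * np.log(p + 1e-7), axis=1))
    grad = feats_ext.T.dot(p - y) / x.shape[0]  # Gradient w.r.t. classifier weights only.
    return loss, grad

loss_start, _ = loss_and_grad(w_clf)
for _ in range(200):
    _, grad = loss_and_grad(w_clf)
    w_clf -= 0.5 * grad                  # The encoder weights are never touched.
loss_end, _ = loss_and_grad(w_clf)
print(loss_start, loss_end)
```

The classifier's loss goes down while `w_mu` stays exactly as it was, which is what "frozen" means here. In the tutorial's PyTorch code, the same effect comes from passing only `classifier.params` to the optimizer.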

**TODO:**
This is the same as Task 7 of the previous Tutorial on AEs, **with one important peculiarity**: Since the encoder predicts a whole distribution of codes for each x, p(z|x), which code should we use as output of the "feature extractor" (encoder of the VAE) and as input to the classifier? Note: We want the classifier to be "deterministic", not stochastic, so we won't be sampling. We probably want the most probable code z for each x...

**Go back** to Task 10 and the function **train_classifier(...)** defined therein. Fill in the gap, choosing which code to use as input to the Classifier. **AFTER** you have done that, run the code below...

In [None]:
# Train classifier on top of pre-trained VAE encoder

class Classifier_1layer(Network):
    # Classifier with just 1 layer, the classification layer
    def __init__(self, D_in, D_out, rng):
        # D_in: dimensions of input
        # D_out: dimension of output (number of classes)
        
        w_out_init = rng.normal(loc=0.0, scale=0.01, size=(D_in+1, D_out))
        w_out = torch.tensor(w_out_init, dtype=torch.float, requires_grad=True)
        self.params = [w_out]
        
        
    def forward_pass(self, batch_inp):
        # compute predicted y
        [w_out] = self.params
        
        # In case input is image, make it a tensor.
        batch_inp_t = torch.tensor(batch_inp, dtype=torch.float) if type(batch_inp) is np.ndarray else batch_inp
        
        unary_feature_for_bias = torch.ones(size=(batch_inp_t.shape[0], 1))  # [N, 1] column vector.
        batch_inp_ext = torch.cat((batch_inp_t, unary_feature_for_bias), dim=1)  # Extra feature=1 for bias.
        
        # Output classification layer
        logits = batch_inp_ext.mm(w_out)
        
        # Output layer activation function
        # Softmax activation function.
        exp_logits = torch.exp(logits)
        y_pred = exp_logits / torch.sum(exp_logits, dim=1, keepdim=True) 
        # sum with Keepdim=True returns [N,1] array. It would be [N] if keepdim=False.
        # Torch broadcasts [N,1] to [N,D_out] via repetition, to divide elementwise exp_logits (which is [N,D_out]).
        
        return y_pred
    
    
    
# Create the network
rng = np.random.RandomState(seed=SEED) # Random number generator
# As input, it will be getting z-codes from the VAE with the 32-dimensional bottleneck from Task 6.
classifier_1layer = Classifier_1layer(vae_wide.D_bottleneck,  # Input dimension is dimensions of VAE's Z
                                      C_classes,
                                      rng=rng)

train_classifier(classifier_1layer,
                 vae_wide,  # Pretrained VAE, to use as feature extractor.
                 cross_entropy,
                 rng,
                 train_imgs_flat[:100],
                 train_lbls_onehot[:100],
                 test_imgs_flat,
                 test_lbls_onehot,
                 batch_size=40,
                 learning_rate=3e-3,
                 total_iters=1000,
                 iters_per_test=20)

If you completed the task appropriately, you should see the model getting trained and performance reported at the bottom. The expected TRAINING accuracy is approximately 80%, and the TESTING accuracy is around 53% in the end of training.

**Questions:**
- Compare with test accuracy from Task 10, when training the classifier from Scratch. Did this improve performance?
- Compare with the results obtained with the standard AE in the previous Tutorial (Task 7). What do you observe?
- Is the AE or the VAE better for pretraining a Classifier? How can you justify this theoretically?

## Task 12: Use parameters of VAE's encoder to initialize weights of a supervised Classifier, followed by refinement using limited labels

Approach-2: The second approach is to build a Classifier that has the same architecture as the encoder of the VAE, followed by an extra classification layer. We first train the VAE (already done in Task 6). Then, we **use the pre-trained weights of the VAE's encoder to initialize the corresponding parameters of the Classifier**. The classification layer of the Classifier is initialized randomly. Then, **with the limited labelled data, we refine (train) all the parameters of the classifier**.

![title](./documentation/vae_refine_2.png)

This is the same as Task 8 of the previous Tutorial on AEs, **with one important peculiarity** (related to Task 11 here): Since the Classifier needs to be deterministic, **we do not use the weights that predict the standard deviation** in the VAE's encoder. We **only use the neurons that predict the mean of p(z|x)** (the most likely code) to initialize the corresponding layers of the Supervised Classifier.

**The code below is complete.**\
Read it, understand it, run it, and try to answer the questions below.

In [None]:
# Pre-train a classifier.

# The below classifier has THE SAME architecture as the 3-layer Classifier that we trained...
# ... in a purely supervised manner in Task-10.
# This is done by inheriting the class (Classifier_3layers), therefore uses THE SAME forward_pass() function.
# THE ONLY DIFFERENCE is in the construction __init__.
# This 'pretrained' classifier receives as input a pretrained VAE (pretrained_VAE) from Task 6.
# It then uses the parameters of the VAE's encoder to initialize its own parameters, rather than random initialization.
# The model is then trained all together.
class Classifier_3layers_pretrained(Classifier_3layers):
    def __init__(self, pretrained_VAE, D_in, D_out, rng):
        D_in = D_in
        D_hid_1 = 256
        D_hid_2 = 32
        D_out = D_out

        w_out_init = rng.normal(loc=0.0, scale=0.01, size=(D_hid_2+1, D_out))
        
        [vae_w1, vae_w2_mu, vae_w2_std, vae_w3, vae_w4] = pretrained_VAE.params  # Pre-trained parameters of pre-trained VAE.
        
        w_1 = torch.tensor(vae_w1, dtype=torch.float, requires_grad=True)
        w_2 = torch.tensor(vae_w2_mu, dtype=torch.float, requires_grad=True)
        w_out = torch.tensor(w_out_init, dtype=torch.float, requires_grad=True)
        
        self.params = [w_1, w_2, w_out]
        
# Create the network
rng = np.random.RandomState(seed=SEED) # Random number generator
classifier_3layers_pretrained = Classifier_3layers_pretrained(vae_wide,  # The VAE pre-trained in Task 6.
                                                              train_imgs_flat.shape[1],
                                                              C_classes,
                                                              rng=rng)

# Start training
# NOTE: Only the 3-layer pretrained classifier is used, and will be trained all together.
# No frozen feature extractor.
train_classifier(classifier_3layers_pretrained,  # classifier that will be trained.
                 None,  # No pretrained VAE to act as 'frozen' feature extractor.
                 cross_entropy,
                 rng,
                 train_imgs_flat[:100],
                 train_lbls_onehot[:100],
                 test_imgs_flat,
                 test_lbls_onehot,
                 batch_size=40,
                 learning_rate=3e-3,
                 total_iters=1000,
                 iters_per_test=20)

**Questions:**

- Does this approach improve over results in Tasks 10 and 11?
- How does it compare with the same experiments with the basic AE in the previous tutorial?
- After all these experiments, what do you conclude: Is the basic AE or the VAE a better method for pre-training weights using unlabelled data, and then using them to improve performance of a Supervised Classifier with limited labels? Why?

## This notebook:
Copyright 2021, University of Birmingham  
Tutorial for Neural Computation  
For issues e-mail: k.kamnitsas@bham.ac.uk