# Neural Computation - 2021

# Tutorial - AutoEncoders in Pytorch

**Aims of this tutorial**:
- Implement and train basic Auto-Encoders in Pytorch.
- Investigate how the learned latent space looks.
- Investigate whether we can synthesize new data with basic Auto-Encoders.
- See how we can use Auto-Encoders trained with Unsupervised Learning to improve training of Supervised Classifiers when labelled data are limited.

It may look long, but it should be easy to complete. Understanding the core points of this tutorial is of **high importance and is part of the assessable material for the course**. Invest some time to study and understand it, and don't hesitate to ask if you don't understand something.

**Prerequisites**:
- Familiarity with Python, NumPy, and basic PyTorch.
- Familiarity with MNIST, Multi-Layer-Perceptrons (MLPs), and how to train MLPs (forward/backward pass) in PyTorch.


**Notes**:
- Docs for Pytorch's functions you will need:  
https://pytorch.org/docs/stable/tensors.html  
https://pytorch.org/docs/stable/nn.html  
- Some helper functions for loading and plotting data are given in the `./utils` folder. They will be used out of the box below.

**THE CODE BELOW WILL NOT RUN BY DEFAULT.** \
This is because it includes **blanks** (noted with **???**) that you need to fill appropriately where requested.

## Preliminary: Loading and refreshing MNIST

We will be using MNIST data in this tutorial. Because the images are small, small networks can be trained on them quickly using a CPU. Anything larger afterwards will require GPUs.

An important point to understand is the structure of the loaded data, especially the **shape** of the loaded numpy arrays, because we need to manipulate it carefully when processing it with neural networks.

Let's load and inspect the data...

In [None]:
# -*- coding: utf-8 -*-
# The below is for auto-reloading external modules after they are changed, such as those in ./utils.
# Issue: http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2

import numpy as np
from utils.data_utils import get_mnist # Helper function. Use it out of the box.

# Constants
DATA_DIR = './data/mnist' # Location we will keep the data.
SEED = 111111

# If datasets are not at specified location, they will be downloaded.
train_imgs, train_lbls = get_mnist(data_dir=DATA_DIR, train=True, download=True)
test_imgs, test_lbls = get_mnist(data_dir=DATA_DIR, train=False, download=True)

print("[train_imgs] Type: ", type(train_imgs), "|| Shape:", train_imgs.shape, "|| Data type: ", train_imgs.dtype )
print("[train_lbls] Type: ", type(train_lbls), "|| Shape:", train_lbls.shape, "|| Data type: ", train_lbls.dtype )
print('Class labels in train = ', np.unique(train_lbls))

print("[test_imgs] Type: ", type(test_imgs), "|| Shape:", test_imgs.shape, " || Data type: ", test_imgs.dtype )
print("[test_lbls] Type: ", type(test_lbls), "|| Shape:", test_lbls.shape, " || Data type: ", test_lbls.dtype )
print('Class labels in test = ', np.unique(test_lbls))

N_tr_imgs = train_imgs.shape[0] # N hereafter. Number of training images in database.
H_height = train_imgs.shape[1] # H hereafter
W_width = train_imgs.shape[2] # W hereafter
C_classes = len(np.unique(train_lbls)) # C hereafter

Above we see that data have been loaded in *numpy arrays*.    
Arrays with images have **shape ( N = number of images, H = height, W = width )**.  
Arrays with labels have **shape ( N = number of images)**, holding one integer per image, the digit's class.

MNIST comprises a **train set of N_tr = 60000 images** and a **test set of N_te = 10000 images**.  
We will use the train set for unsupervised learning. The test set will only be used for evaluating generalisation of classifiers towards the end of the tutorial.

Let's plot a few images in one collage to have a look...

In [None]:
%matplotlib inline
from utils.plotting import plot_grid_of_images # Helper functions, use out of the box.
plot_grid_of_images(train_imgs[0:100], n_imgs_per_row=10)

Notice that the intensities in the images take **values from 0 to 255**.

## Preliminary: Data pre-processing

A first step in almost all pipelines is to pre-process the data, to make them more appropriate for a model.

Below, we will perform 3 steps:  
a) Change the labels from an integer representation to a **one-hot representation** of the **C=10 classes**.\
b) Re-scale the **intensities** in the images, from the range \[0,255\], to be instead in the range \[-1,+1\].\
c) **Vectorise the 2D images into 1D vectors for the MLP**, which only gets vectors as input.
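Step (b) relies on a helper from `./utils` whose implementation is not shown in this notebook. As a rough sketch of what such a rescaling might look like, assuming a plain linear map from \[0,255\] to \[-1,+1\] (the function name `rescale_to_minus1_plus1` is hypothetical and is not the helper's actual code):

```python
import numpy as np

def rescale_to_minus1_plus1(imgs, old_min=0.0, old_max=255.0):
    # Hypothetical stand-in for the tutorial's helper; assumes a linear map.
    imgs = imgs.astype(np.float32)
    # First map [old_min, old_max] to [0, 1], then stretch and shift to [-1, +1].
    return 2.0 * (imgs - old_min) / (old_max - old_min) - 1.0

# Toy "database" with a single 2x2 image, shape [N=1, H=2, W=2].
toy_imgs = np.array([[[0, 255], [51, 204]]], dtype=np.uint8)
toy_scaled = rescale_to_minus1_plus1(toy_imgs)
print(toy_scaled)
```

Under this assumption, black (0) maps to -1, white (255) maps to +1, and mid-range intensities end up close to 0.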

In [None]:
# a) Change representation of labels to one-hot vectors of length C=10.
train_lbls_onehot = np.zeros(shape=(train_lbls.shape[0], C_classes ) )
train_lbls_onehot[ np.arange(train_lbls_onehot.shape[0]), train_lbls ] = 1
test_lbls_onehot = np.zeros(shape=(test_lbls.shape[0], C_classes ) )
test_lbls_onehot[ np.arange(test_lbls_onehot.shape[0]), test_lbls ] = 1
print("BEFORE: [train_lbls]        Type: ", type(train_lbls), "|| Shape:", train_lbls.shape, " || Data type: ", train_lbls.dtype )
print("AFTER : [train_lbls_onehot] Type: ", type(train_lbls_onehot), "|| Shape:", train_lbls_onehot.shape, " || Data type: ", train_lbls_onehot.dtype )

In [None]:
# b) Re-scale image intensities, from [0,255] to [-1, +1].
# This commonly facilitates learning:
# A zero-centered signal with small magnitude makes it easier to avoid exploding/vanishing problems.
from utils.data_utils import normalize_int_whole_database # Helper function. Use out of the box.
train_imgs = normalize_int_whole_database(train_imgs, norm_type="minus_1_to_1")
test_imgs = normalize_int_whole_database(test_imgs, norm_type="minus_1_to_1")

# Lets plot one image.
from utils.plotting import plot_image # Helper function, use out of the box.
index = 0  # Try any, up to 60000
print("Plotting image of index: [", index, "]")
print("Class label for this image is: ", train_lbls[index])
print("One-hot label representation: [", train_lbls_onehot[index], "]")
plot_image(train_imgs[index])
# Notice the magnitude of intensities. Black is now negative and white is positive float.
# Compare with intensities of figure further above.

In [None]:
# c) Flatten the images, from 2D matrices to 1D vectors. MLPs take feature-vectors as input, not 2D images.
train_imgs_flat = train_imgs.reshape([train_imgs.shape[0], -1]) # Preserve 1st dim (N = number of samples), flatten others.
test_imgs_flat = test_imgs.reshape([test_imgs.shape[0], -1])
print("Shape of numpy array holding the training database:")
print("Original : [N, H, W] = [", train_imgs.shape , "]")
print("Flattened: [N, H*W]  = [", train_imgs_flat.shape , "]")

## Task 1: Unsupervised training with SGD for Auto-Encoders

Below you are given the main training function, which performs gradient descent in unsupervised fashion. This will be called by all following parts of the tutorial.

The function takes a model and data, and performs `total_iters` iterations of stochastic gradient descent.

In the below, change the code to make gradient descent stochastic, by sampling a **random** batch per iteration instead of constantly the same training samples. (just a warmup, for you to read the training function :-))

In [None]:
from utils.plotting import plot_train_progress_1, plot_grids_of_images  # Use out of the box


def get_random_batch(train_imgs, train_lbls, batch_size, rng):
    # train_imgs: Images. Numpy array of shape [N, H * W]
    # train_lbls: Labels of images. None, or Numpy array of shape [N, C_classes], one hot label for each image.
    # batch_size: integer. Size that the batch should have.
    
    ####### TODO: Sample a random batch of images for training. Fill in the blanks (???) ######### 
    indices = rng.randint(low=0, high=train_imgs.shape[????], size=??????, dtype='int32')
    ##############################################################################################
    
    train_imgs_batch = train_imgs[indices]
    if train_lbls is not None:  # Enables function to be used both for supervised and unsupervised learning
        train_lbls_batch = train_lbls[indices]
    else:
        train_lbls_batch = None
    return [train_imgs_batch, train_lbls_batch]


def unsupervised_training_AE(net,
                             loss_func,
                             rng,
                             train_imgs_all,
                             batch_size,
                             learning_rate,
                             total_iters,
                             iters_per_recon_plot=-1):
    # net: Instance of a model. See classes: Autoencoder, MLPClassifier, etc further below
    # loss_func: Function that computes the loss. See functions: reconstruction_loss or cross_entropy.
    # rng: numpy random number generator
    # train_imgs_all: All the training images. Numpy array, shape [N_tr, H * W]
    # batch_size: Size of the batch that should be processed per SGD iteration by a model.
    # learning_rate: self explanatory.
    # total_iters: how many SGD iterations to perform.
    # iters_per_recon_plot: Integer. Every that many iterations the model predicts training images ...
    #                      ...and we plot their reconstruction. For visual observation of the results.
    loss_values_to_plot = []
    
    optimizer = optim.Adam(net.params, lr=learning_rate)  # Will use PyTorch's Adam optimizer out of the box
        
    for t in range(total_iters):
        # Sample batch for this SGD iteration
        x_imgs, _ = get_random_batch(train_imgs_all, None, batch_size, rng)
        
        # Forward pass
        x_pred, z_codes = net.forward_pass(x_imgs)

        # Compute loss:
        loss = loss_func(x_pred, x_imgs)
        
        # Pytorch way
        optimizer.zero_grad()
        _ = net.backward_pass(loss)
        optimizer.step()
        
        # ==== Report training loss ======
        loss_np = loss if isinstance(loss, float) else loss.item()  # Pytorch returns tensor. Cast to float
        print("[iter:", t, "]: Training Loss: {0:.2f}".format(loss_np))
        loss_values_to_plot.append(loss_np)
        
        # =============== Every few iterations, show reconstructions ================#
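        # A non-positive iters_per_recon_plot disables the periodic plots (only the final one is shown).
        if t == total_iters-1 or (iters_per_recon_plot > 0 and t % iters_per_recon_plot == 0):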
            # Reconstruct all images, to plot reconstructions.
            x_pred_all, z_codes_all = net.forward_pass(train_imgs_all)
            # Cast tensors to numpy arrays
            x_pred_all_np = x_pred_all if type(x_pred_all) is np.ndarray else x_pred_all.detach().numpy()
            
            # Predicted reconstructions have vector shape. Reshape them to original image shape.
            train_imgs_resh = train_imgs_all.reshape([train_imgs_all.shape[0], H_height, W_width])
            x_pred_all_np_resh = x_pred_all_np.reshape([train_imgs_all.shape[0], H_height, W_width])
            
            # Plot a few images, originals and predicted reconstructions.
            plot_grids_of_images([train_imgs_resh[0:100], x_pred_all_np_resh[0:100]],
                                  titles=["Real", "Reconstructions"],
                                  n_imgs_per_row=10,
                                  dynamically=True)
            
    # In the end of the process, plot loss.
    plot_train_progress_1(loss_values_to_plot, iters_per_point=1)
    

Running the above should give no output yet. But we will use this in task 2, so hopefully you completed Task 1 right :)

## Task 2: Auto-Encoder

In this task, you are called to create the architecture of an Auto-Encoder.
Make necessary modifications where requested, to create the below architecture:

![title](./documentation/ae_bottleneck_2.png)
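
The code in this task folds each layer's bias into its weight matrix: a constant feature equal to 1 is appended to the layer's input, and the weight matrix gets one extra row that plays the role of the bias. A minimal numpy sketch of why this is equivalent to the usual `x @ W + b`, using made-up toy sizes that are unrelated to the architecture above:

```python
import numpy as np

rng_demo = np.random.RandomState(0)
n_samples, d_in, d_out = 4, 3, 2  # Toy sizes, chosen only for illustration.

x = rng_demo.normal(size=(n_samples, d_in))
W = rng_demo.normal(size=(d_in, d_out))
b = rng_demo.normal(size=(d_out,))

# Usual formulation: separate weight matrix and bias vector.
explicit = x @ W + b

# Folded formulation: append a column of ones to the input,
# and append the bias as one extra row of the weight matrix.
x_ext = np.concatenate([x, np.ones((n_samples, 1))], axis=1)  # [N, d_in + 1]
W_ext = np.concatenate([W, b[None, :]], axis=0)               # [d_in + 1, d_out]
folded = x_ext @ W_ext

print(np.allclose(explicit, folded))
```

The extra row multiplies the constant 1 in every sample, so it is added to each output unit exactly as a bias would be.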


In [None]:
# -*- coding: utf-8 -*-
import torch
import torch.optim as optim
import torch.nn as nn

class Network():
    
    def backward_pass(self, loss):
        # Performs back propagation and computes gradients
        # With PyTorch, we do not need to compute gradients analytically for parameters where requires_grad=True, 
        # Calling loss.backward(), torch's Autograd automatically computes grads of loss wrt each parameter p,...
        # ... and **puts them in p.grad**. Return them in a list.
        loss.backward()
        grads = [param.grad for param in self.params]
        return grads
    
class Autoencoder(Network):
    def __init__(self, rng, D_in, D_hid_enc, D_bottleneck, D_hid_dec):
        # Construct and initialize network parameters
        D_in = D_in # Dimension of input feature-vectors. Length of a vectorised image.
        D_hid_1 = D_hid_enc # Dimension of Encoder's hidden layer
        D_hid_2 = D_bottleneck
        D_hid_3 = D_hid_dec  # Dimension of Decoder's hidden layer
        D_out = D_in # Dimension of Output layer.
        
        self.D_bottleneck = D_bottleneck  # Keep track of it, we will need it.
        
        ##### TODO: Initialize the Auto-Encoder's parameters. Also see forward_pass(...)) ########
        # Dimensions of parameter tensors are (number of neurons + 1) per layer, to account for +1 bias.
        w1_init = rng.normal(loc=0.0, scale=0.01, size=(D_in+1, ?????))
        w2_init = rng.normal(loc=0.0, scale=0.01, size=(D_hid_1+1, D_hid_2))
        w3_init = rng.normal(loc=0.0, scale=0.01, size=(?????+1, D_hid_3))
        w4_init = rng.normal(loc=0.0, scale=0.01, size=(D_hid_3+1, D_out))
        # Pytorch tensors, parameters of the model
        # Use the above numpy arrays of random floats as initialization for the Pytorch weights.
        w1 = torch.tensor(w1_init, dtype=torch.float, requires_grad=True)
        w2 = torch.tensor(w2_init, dtype=torch.float, requires_grad=True)
        w3 = torch.tensor(???????, dtype=torch.float, requires_grad=True)
        w4 = torch.tensor(w4_init, dtype=torch.float, requires_grad=True)
        # Keep track of all trainable parameters:
        self.params = [w1, w2, w3, w4]
        ###########################################################################
        
        
    def forward_pass(self, batch_imgs):
        # Get parameters
        [w1, w2, w3, w4] = self.params
        
        batch_imgs_t = torch.tensor(batch_imgs, dtype=torch.float)  # Converts numpy array to pytorch tensor.
        
        unary_feature_for_bias = torch.ones(size=(batch_imgs.shape[0], 1)) # [N, 1] column vector.
        x = torch.cat((batch_imgs_t, unary_feature_for_bias), dim=1) # Extra feature=1 for bias.
        
        #### TODO: Implement the operations at each layer #####
        # Layer 1
        h1_preact = x.mm(w1)
        h1_act = h1_preact.clamp(min=0)
        # Layer 2 (bottleneck):
        h1_ext = torch.cat((h1_act, unary_feature_for_bias), dim=1)
        h2_preact = h1_ext.mm(w2)
        h2_act = h2_preact.clamp(min=0)   # <--------- This is the Representation Z
        # Layer 3:
        h2_ext = torch.cat((h2_act, unary_feature_for_bias), dim=1)
        h3_preact = h2_ext.mm(w3)
        h3_act = h3_preact.clamp(min=0)
        # Layer 4 (output):
        h3_ext = torch.cat((h3_act, unary_feature_for_bias), dim=1)
        h4_preact = h3_ext.mm(w4)
        h4_act = torch.tanh(h4_preact)
        # Output layer
        x_pred = h4_act
        #######################################################
        
        ### TODO: Get bottleneck's activations ######
        # Bottleneck activations
        z_code = ?????????
        #############################################
                
        return (x_pred, z_code)
        
        
def reconstruction_loss(x_pred, x_real, eps=1e-7):
    # x_pred: [N, D_out] Prediction returned by forward_pass. Numpy array of shape [N, D_out]
    # x_real: [N, D_in]
    
    # If numpy array is given, change it to a Torch tensor.
    x_pred = torch.tensor(x_pred, dtype=torch.float) if type(x_pred) is np.ndarray else x_pred
    x_real = torch.tensor(x_real, dtype=torch.float) if type(x_real) is np.ndarray else x_real
    
    ######## TODO: Complete the calculation of Reconstruction loss for each sample ###########
    loss_recon = torch.mean(torch.square(????? - x_real), dim=1)
    ##########################################################################################
    
    cost = torch.mean(loss_recon, dim=0) # Expectation of loss: Mean over samples (axis=0).
    return cost


# Create the network
rng = np.random.RandomState(seed=SEED)
autoencoder_thin = Autoencoder(rng=rng,
                               D_in=H_height*W_width,
                               D_hid_enc=256,
                               D_bottleneck=2,
                               D_hid_dec=256)
# Start training
unsupervised_training_AE(autoencoder_thin,
                         reconstruction_loss,
                         rng,
                         train_imgs_flat,
                         batch_size=40,
                         learning_rate=3e-3,
                         total_iters=1000,
                         iters_per_recon_plot=50)


If this task is completed correctly, the AutoEncoder should get trained, and you should see the loss decreasing.\
At the end of training, after 1000 iterations, you will see a curve of the training loss.\
The loss should decrease down to approximately 0.2.

You should also see a set of real images and their reconstructed versions printed side by side.\
By the end, the reconstructions should start looking reasonable.

## Task 3: Encode all training samples in the latent (bottleneck) representation

We now have a trained auto-encoder from the above task. We will use it to encode training data, obtain the codes z, and plot the first 2 dimensions of Z in a 2D plot to observe how the codes are clustered. (Note: The AE from the above task has a bottleneck layer with only 2 neurons, which makes plotting in 2D space easy.)

Note: The code below is fully implemented. Run it, and observe the output. Think about the questions below the results...
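
The function below walks over the data in consecutive, non-overlapping batches. As a small sketch of the arithmetic involved (toy numbers, with a hypothetical helper name `n_batches_to_cover`): ceiling division gives the number of batches needed to cover all samples, and slicing past the end of a numpy array simply yields a shorter last batch.

```python
import numpy as np

def n_batches_to_cover(n_samples, batch_size):
    # Ceiling division using integers only: smallest number of batches that covers all samples.
    return (n_samples - 1) // batch_size + 1

toy_data = np.arange(10)  # 10 toy "samples"
batch_size_demo = 4
n_batches = n_batches_to_cover(len(toy_data), batch_size_demo)

# Consecutive slices; the last one is shorter because numpy clips out-of-range slice ends.
batches = [toy_data[t * batch_size_demo:(t + 1) * batch_size_demo] for t in range(n_batches)]
print(n_batches, [len(b) for b in batches])
```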


In [None]:
import matplotlib.pyplot as plt

def encode_and_get_min_max_z(net,
                             imgs_flat,
                             lbls,
                             batch_size,
                             total_iterations=None,
                             plot_2d_embedding=True):
    # This function encodes images, plots the first 2 dimensions of the codes in a plot, and finally...
    # ... returns the minimum and maximum values of the codes for each dimensions of Z.
    # ... We will use this at a later task.
    # Arguments:
    # imgs_flat: Numpy array of shape [Number of images, H * W]
    # lbls: Numpy array of shape [number of images], with 1 integer per image. The integer is the class (digit).
    # total_iterations: How many batches to encode. We will use this so that we dont encode and plot ...
    # ... the whoooole training database, because the plot will get cluttered with 60000 points.
    # Returns:
    # min_z: numpy array, vector with [dimensions-of-z] elements. Minimum value per dimension of z.
    # max_z: numpy array, vector with [dimensions-of-z] elements. Maximum value per dimension of z.
    
    # If total iterations is None, the function will just iterate over all data, by breaking them into batches.    
    if total_iterations is None:
        total_iterations = (imgs_flat.shape[0] - 1) // batch_size + 1
    
    z_codes_all = []
    lbls_all = []
    for t in range(total_iterations):
        # Take the next consecutive batch of images to encode (no random sampling here)
        x_batch = imgs_flat[t*batch_size: (t+1)*batch_size]
        lbls_batch = lbls[t*batch_size: (t+1)*batch_size]
        
        # Forward pass
        x_pred, z_codes = net.forward_pass(x_batch)

        z_codes_np = z_codes if type(z_codes) is np.ndarray else z_codes.detach().numpy()
        
        z_codes_all.append(z_codes_np)  # List of np.arrays
        lbls_all.append(lbls_batch)
    
    z_codes_all = np.concatenate(z_codes_all)  # Make list of arrays in one array by concatenating along dim=0 (image index)
    lbls_all = np.concatenate(lbls_all)
    
    if plot_2d_embedding:
        # Plot the codes with different color per class in a scatter plot:
        plt.scatter(z_codes_all[:,0], z_codes_all[:,1], c=lbls_all, alpha=0.5)  # Plot the first 2 dimensions.
        plt.show()
    
    # Get the minimum and maximum values of z per dimension (neuron) of Z. We will use this at a later task
    min_z = np.min(z_codes_all, axis=0)  # min and max for each dimension of z, over all samples.
    max_z = np.max(z_codes_all, axis=0)  # Numpy array (vector) of shape [number of z dimensions]
    
    return min_z, max_z


# Encode training samples, and get the min and max values of the z codes (for each dimension)
min_z, max_z = encode_and_get_min_max_z(autoencoder_thin,
                                        train_imgs_flat,
                                        train_lbls,
                                        batch_size=100,
                                        total_iterations=100)
print("Min Z value per dimension of bottleneck:", min_z)
print("Max Z value per dimension of bottleneck:", max_z)


If all went well, you should see a plot where the codes span from 0 to +80 (approx).

**Questions:**
- Why don't we have negative values in these codes? Is this a general property of AutoEncoders, or of the specific implementation?
- What do you observe about how codes of different classes (colors) are grouped? Do similar samples seem to be encoded far or close? Why is the AE learning in this way?
- Is the space of Z full of samples everywhere, or are there holes? Why?

## Task 4: Train an Auto-Encoder with a larger bottleneck layer

The smaller the bottleneck layer (fewer neurons), the less information about the inputs can be encoded.\
We previously constructed and trained an AE with only 2 neurons in the bottleneck layer. Quite a restriction!

We will now train an AE with a larger bottleneck layer to observe the difference in loss and in how well images are reconstructed.

Below, train an AE with a bottleneck layer of 32 neurons. Then think about the questions further below...

In [None]:
# The below is a copy paste from Task 2.
# ========== TODO: Use a bottleneck of 32 neurons and train it =================

# Create the network
rng = np.random.RandomState(seed=SEED)
autoencoder_wide = Autoencoder(rng=rng,
                               D_in=H_height*W_width,
                               D_hid_enc=256,
                               D_bottleneck=????????,  # <--------- Width of Bottleneck
                               D_hid_dec=256)
# Start training
unsupervised_training_AE(autoencoder_wide,
                         reconstruction_loss,
                         rng,
                         train_imgs_flat,
                         batch_size=40,
                         learning_rate=3e-3,
                         total_iters=1000,
                         iters_per_recon_plot=50)

**Questions:**
- Compare the loss at the end of training with the loss of the AE that has a 2-neuron bottleneck. Is it higher or lower? Why?
- How do the reconstructed images compare with those from Task 2?

## Task 5: Is a basic Auto-Encoder appropriate for synthesizing new data?

An Encoder of an AE learns a mapping from x (image space) to z (code).\
A Decoder of an AE learns a mapping from z (code) to x (image space).

We could try to use the Decoder of a pre-trained basic AE to try and synthesize non-existing data.\
How?\
We could sample a random code z, and decode it back to image space with the decoder.\
Will it work? Let's explore...

![title](./documentation/synthesize_data.png)

**We will create a new network, that is a Decoder-only network.**\
We will **NOT train** it.\
Instead, we will use the **pre-trained weights of the decoder of the AE from the previous Task 4**,\
to initialize the weights of this new Decoder.\
This is equivalent to taking the previous pre-trained AE, and throwing away its encoder part.\
We will then use this Decoder to **map randomly sampled codes z back to image space**, generating novel images.\
We will finally judge whether this approach is a solid approach to image synthesis (generation).

Let's go step by step... Let's first create the decoder...


In [None]:
class Decoder():
    def __init__(self, pretrained_ae):
        ############ TODO: Fill in the gaps. The aim is: ... ############
        # ... to use the weights of the pre-trained AE's DECODER,... ####
        # ... to initialize this Decoder.                            ####
        # Reminder: pretrained_ae.params[LAYER] contains the params of the corresponding layer. See Task 2.
        w1 = torch.tensor(pretrained_ae.params[????], dtype=torch.float, requires_grad=False)
        w2 = torch.tensor(pretrained_ae.params[3], dtype=torch.float, requires_grad=False)
        self.params = [w1, w2]
        ###########################################################################
        
        
    def decode(self, z_batch):
        # Reconstruct a batch of images from a batch of z codes.
        # z_batch: Random codes. Numpy array of shape: [batch size, number of z dimensions]
        [w1, w2] = self.params
        
        z_batch_t = torch.tensor(z_batch, dtype=torch.float)  # Making a Pytorch tensor from Numpy array.
        # Adding an activation with value 1, for the bias. Similar to Task 2.
        unary_feature_for_bias = torch.ones(size=(z_batch_t.shape[0], 1)) # [N, 1] column vector.
        
        ##### TODO: Fill in the gaps, to REPLICATE the decoder of the AE from Task 4 #####
        # Hidden Layer of Decoder:
        z_batch_act_ext = torch.cat((z_batch_t, unary_feature_for_bias), dim=1)
        h1_preact = z_batch_act_ext.mm(w1)
        h1_act = h1_preact.???????(min=0) # <--------------- RELU, like the AE's decoder
        # Output Layer:
        h1_ext = torch.cat((h1_act, unary_feature_for_bias), dim=1)
        h2_preact = h1_ext.???????(w2)
        h2_act = torch.tanh(h2_preact)
        ##################################################################################
        # Output
        x_pred = h2_act
        
        return x_pred
        
# Lets instantiate this Decoder, using the pre-trained AE with 32-dims ("wider") bottleneck:
net_decoder_pretrained = Decoder(autoencoder_wide)

If this was implemented correctly, there should be no output (except perhaps a "UserWarning" from pytorch).
If you see any other problem, then you may have done something wrong.

Assuming that the above did not raise any issue (except perhaps a warning), let's continue...

We want to sample random Z and give it to the decoder, to synthesize a new image.

**But, from what range of values should we draw Z codes???**\
We have to find what range of values is covered by the embeddings when training the original AutoEncoder!!!\
We previously saw this for the AE with 2-dim Z. Now let's do it for the AE with 32-dim Z...

In [None]:
# NOTE: This function was implemented in Task 3. We simply call it again, but for a different AE, the wider.

# Encode training samples, and get the min and max values of the z codes (for each dimension)
min_z_wider, max_z_wider = encode_and_get_min_max_z(autoencoder_wide,
                                                    train_imgs_flat,
                                                    train_lbls,
                                                    batch_size=100,
                                                    total_iterations=None,  # So that it runs over all data.
                                                    plot_2d_embedding=False)  # Code is 32-Dims. Can't plot in 2D
print("Min Z value per dimension:", min_z_wider)
print("Max Z value per dimension:", max_z_wider)

**Questions:**

Compare min and max values per dimension. **What do you observe?** Don't be surprised if some dimensions of Z have "collapsed" and are just 0. It happens during training of neural networks that some neurons just "die"...
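
One way to check this observation programmatically is to flag the dimensions whose minimum and maximum coincide. A minimal sketch on made-up values (the arrays below are purely illustrative, not results from the trained AE):

```python
import numpy as np

# Made-up per-dimension minima and maxima for a 4-dimensional code.
min_z_demo = np.array([0.0, 0.0, 0.0, 0.0])
max_z_demo = np.array([12.5, 0.0, 48.1, 0.0])

# A dimension whose min equals its max never varies across samples.
collapsed = np.isclose(min_z_demo, max_z_demo)
print("Collapsed dimensions:", np.where(collapsed)[0])
print("Active dimensions:", int((~collapsed).sum()), "of", collapsed.size)
```

The same check could be applied to the `min_z_wider` and `max_z_wider` arrays printed above.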

**Now let's use this range \[min,max\] of values, to finally try and synthesize some new images...**

In [None]:
def synthesize(net_decoder,
               rng,
               z_min,
               z_max,
               n_samples):
    # net_decoder: Decoder with pre-trained weights.
    # z_min: numpy array (vector) of shape [dimensions-of-z]
    # z_max: numpy array (vector) of shape [dimensions-of-z]
    # n_samples: how many samples to produce.
    
    assert len(z_min.shape) == 1 and len(z_max.shape) == 1
    assert z_min.shape[0] == z_max.shape[0]
    
    z_dims = z_min.shape[0]  # Dimensionality of z codes (and input to decoder).
    
    # Create samples of z uniformly sampled from [z_min, z_max]
    z_samples = rng.random_sample([n_samples, z_dims])  # Samples from uniform([0, 1)), drawn from the given rng so that SEED takes effect
    z_samples = z_samples * (z_max - z_min)  # Scales [0,1] range ==> [0,(max-min)] range
    z_samples = z_samples + z_min  # Puts the [0,(max-min)] range ==> [min, max] range
    
    x_samples = net_decoder.decode(z_samples)
    
    x_samples_np = x_samples if type(x_samples) is np.ndarray else x_samples.detach().numpy()  # torch to numpy
    
    for x_sample in x_samples_np:
        plot_image(x_sample.reshape([H_height, W_width]))
       
    
# Let's finally run the synthesis and see what happens...
rng = np.random.RandomState(seed=SEED)

synthesize(net_decoder_pretrained,
           rng,
           min_z_wider,  # From further above
           max_z_wider,  # From further above
           n_samples=20)

If everything was implemented correctly, you should see above images created by the decoder for the randomly sampled z-codes.

**Questions:**

- Observe the images. How many of them look like realistic, good-quality digits? Many? Half? Few?
- Are they comparable in quality with the reconstructions obtained for the actual training data in Task 4?
- If the quality of this synthesized data is not as good as you'd expect from what you saw in Task 4, why do you think that is? Can you relate it to a characteristic of the plot of the embeddings in Task 3?

## Task 6: Learning from Unlabelled data with AE, to complement Supervised Classifier when Labelled data are limited: Let's first train a supervised Classifier 'from scratch'

We often want to train a **Classifier with labelled** data for a specific task. But labelled data are often **limited**, leading to sub-optimal performance of the Classifiers trained on them.

AEs learn useful representations from unlabelled data, which are easy to collect in large numbers. We would like to use the learned 'knowledge' (parameters) of an Unsupervised AE, to train even stronger Classifiers with limited labelled data.

How? We will see in the following 3 tasks.

First, we will create and train a fully-supervised MLP classifier only **on very limited (100) labelled data**.

The goal is to compare the performance of this classifier with what we achieve when complementing it with unlabelled data using an AutoEncoder (in later Task).

To make such a comparison "fair", **this fully-supervised MLP classifier will have exactly the same architecture as the encoder of the autoencoder_wide we built in Task 4, plus one additional classification layer**. This is exactly the **same architecture** that is used in the later task, where the MLP classifier is pre-trained using the AE weights.

![title](./documentation/classifier_scratch.png)

**The below code for creating a classifier, cross entropy loss, and training loop is complete.**\
Read it, and understand it, because we will use it for the next 2 tasks as well, to compare with performance achieved with classifier pre-trained with AE.
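
The classifier below turns its logits into class probabilities with a softmax that exponentiates the logits directly. This works when the logits are moderate in size, but `exp` overflows for large values. A common variant, shown here as a numpy sketch and not as part of the tutorial's code, subtracts the per-row maximum before exponentiating, which leaves the result mathematically unchanged. The function name `softmax_stable` is made up for illustration.

```python
import numpy as np

def softmax_stable(logits):
    # logits: numpy array of shape [N, C]. Returns probabilities of shape [N, C].
    # Subtracting the row-wise max does not change the softmax, but keeps exp() from overflowing.
    shifted = logits - np.max(logits, axis=1, keepdims=True)
    exp_shifted = np.exp(shifted)
    return exp_shifted / np.sum(exp_shifted, axis=1, keepdims=True)

# Two rows that differ only by a constant offset: their softmax should be identical.
toy_logits = np.array([[1.0, 2.0, 3.0],
                       [1000.0, 1001.0, 1002.0]])
probs = softmax_stable(toy_logits)
print(probs.sum(axis=1))
```

Each row of `probs` sums to 1, and the second row stays finite even though `np.exp(1000.0)` on its own would overflow.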

In [None]:
class Classifier_3layers(Network):
    def __init__(self, D_in, D_hid_1, D_hid_2, D_out, rng):
        D_in = D_in
        D_hid_1 = D_hid_1
        D_hid_2 = D_hid_2
        D_out = D_out
        
        # === NOTE: Notice that this is exactly the same architecture as encoder of AE in Task 4 ====
        w_1_init = rng.normal(loc=0.0, scale=0.01, size=(D_in+1, D_hid_1))
        w_2_init = rng.normal(loc=0.0, scale=0.01, size=(D_hid_1+1, D_hid_2))
        w_out_init = rng.normal(loc=0.0, scale=0.01, size=(D_hid_2+1, D_out))
        
        w_1 = torch.tensor(w_1_init, dtype=torch.float, requires_grad=True)
        w_2 = torch.tensor(w_2_init, dtype=torch.float, requires_grad=True)
        w_out = torch.tensor(w_out_init, dtype=torch.float, requires_grad=True)
        
        self.params = [w_1, w_2, w_out]
        
        
    def forward_pass(self, batch_inp):
        # compute predicted y
        [w_1, w_2, w_out] = self.params
        
        # In case input is image, make it a tensor.
        batch_imgs_t = torch.tensor(batch_inp, dtype=torch.float) if type(batch_inp) is np.ndarray else batch_inp
        
        unary_feature_for_bias = torch.ones(size=(batch_imgs_t.shape[0], 1)) # [N, 1] column vector.
        x = torch.cat((batch_imgs_t, unary_feature_for_bias), dim=1) # Extra feature=1 for bias.
        
        # === NOTE: This is the same architecture as encoder of AE in Task 4, with extra classification layer ===
        # Layer 1
        h1_preact = x.mm(w_1)
        h1_act = h1_preact.clamp(min=0)
        # Layer 2 (corresponds to bottleneck of the AE):
        h1_ext = torch.cat((h1_act, unary_feature_for_bias), dim=1)
        h2_preact = h1_ext.mm(w_2)
        h2_act = h2_preact.clamp(min=0)
        # Output classification layer
        h2_ext = torch.cat((h2_act, unary_feature_for_bias), dim=1)
        h_out = h2_ext.mm(w_out)
        
        logits = h_out
        
        # === Softmax activation function, turning the logits into class probabilities ===
        exp_logits = torch.exp(logits)
        y_pred = exp_logits / torch.sum(exp_logits, dim=1, keepdim=True) 
        # sum with keepdim=True returns [N,1] array. It would be [N] if keepdim=False.
        # Torch broadcasts [N,1] to [N,D_out] via repetition, to divide elementwise exp_logits (which is [N,D_out]).
        
        return y_pred

    
def cross_entropy(y_pred, y_real, eps=1e-7):
    # y_pred: Predicted class-posterior probabilities, returned by forward_pass. Numpy array of shape [N, D_out]
    # y_real: One-hot representation of real training labels. Same shape as y_pred.
    
    # If numpy array is given, change it to a Torch tensor.
    y_pred = torch.tensor(y_pred, dtype=torch.float) if type(y_pred) is np.ndarray else y_pred
    y_real = torch.tensor(y_real, dtype=torch.float) if type(y_real) is np.ndarray else y_real
    
    x_entr_per_sample = - torch.sum( y_real*torch.log(y_pred+eps), dim=1)  # Sum over classes, axis=1
    
    loss = torch.mean(x_entr_per_sample, dim=0) # Expectation of loss: Mean over samples (axis=0).
    return loss



from utils.plotting import plot_train_progress_2

def train_classifier(classifier,
                     pretrained_AE,
                     loss_func,
                     rng,
                     train_imgs,
                     train_lbls,
                     test_imgs,
                     test_lbls,
                     batch_size,
                     learning_rate,
                     total_iters,
                     iters_per_test=-1):
    # Arguments:
    # classifier: A classifier network. It will be trained by this function using labelled data.
    #             Its input will be either original data (if pretrained_AE=None), ...
    #             ... or the output of the feature extractor if one is given.
    # pretrained_AE: A pretrained AutoEncoder that will *not* be trained here.
    #      It will be used to encode input data.
    #      The classifier will take as input the output of this feature extractor.
    #      If pretrained_AE = None: The classifier will simply receive the actual data as input.
    # train_imgs: Vectorized training images
    # train_lbls: One hot labels
    # test_imgs: Vectorized testing images, to compute generalization accuracy.
    # test_lbls: One hot labels for test data.
    # batch_size: batch size
    # learning_rate: come on...
    # total_iters: how many SGD iterations to perform.
    # iters_per_test: We will 'test' the model on test data every few iterations as specified by this.
    
    values_to_plot = {'loss':[], 'acc_train': [], 'acc_test': []}
    
    optimizer = optim.Adam(classifier.params, lr=learning_rate)
        
    for t in range(total_iters):
        # Sample batch for this SGD iteration
        train_imgs_batch, train_lbls_batch = get_random_batch(train_imgs, train_lbls, batch_size, rng)
        
        # Forward pass
        if pretrained_AE is None:
            inp_to_classifier = train_imgs_batch
        else:
            _, z_codes = pretrained_AE.forward_pass(train_imgs_batch)  # AE encodes. Output will be given to Classifier
            inp_to_classifier = z_codes
            
        y_pred = classifier.forward_pass(inp_to_classifier)
        
        # Compute loss:
        y_real = train_lbls_batch
        loss = loss_func(y_pred, y_real)  # Cross entropy
        
        # Backprop and updates.
        optimizer.zero_grad()
        grads = classifier.backward_pass(loss)
        optimizer.step()
        
        
        # ==== Report training loss and accuracy ======
        # y_pred and loss can be either np.array, or torch.tensor (see later). If tensor, make it np.array.
        y_pred_numpy = y_pred if type(y_pred) is np.ndarray else y_pred.detach().numpy()
        y_pred_lbls = np.argmax(y_pred_numpy, axis=1) # y_pred is soft/probability. Make it a hard class index.
        y_real_lbls = np.argmax(y_real, axis=1)
        
        acc_train = np.mean(y_pred_lbls == y_real_lbls) * 100. # percentage
        
        loss_numpy = loss if type(loss) is float else loss.item()  # Python float, for printing and plotting.
        print("[iter:", t, "]: Training Loss: {0:.2f}".format(loss_numpy), "\t Accuracy: {0:.2f}".format(acc_train))
        
        # =============== Every few iterations, evaluate on the test data ================#
        if t==total_iters-1 or t%iters_per_test == 0:
            if pretrained_AE is None:
                inp_to_classifier_test = test_imgs
            else:
                _, z_codes_test = pretrained_AE.forward_pass(test_imgs)
                inp_to_classifier_test = z_codes_test
                
            y_pred_test = classifier.forward_pass(inp_to_classifier_test)
            
            # ==== Report test accuracy ======
            y_pred_test_numpy = y_pred_test if type(y_pred_test) is np.ndarray else y_pred_test.detach().numpy()
            
            y_pred_lbls_test = np.argmax(y_pred_test_numpy, axis=1)
            y_real_lbls_test = np.argmax(test_lbls, axis=1)
            acc_test = np.mean(y_pred_lbls_test == y_real_lbls_test) * 100.
            print("\t\t\t\t\t\t\t\t Testing Accuracy: {0:.2f}".format(acc_test))
            
            # Keep list of metrics to plot progress.
            values_to_plot['loss'].append(loss_numpy)
            values_to_plot['acc_train'].append(acc_train)
            values_to_plot['acc_test'].append(acc_test)
                
    # At the end of the process, plot the loss and the accuracy on training and testing data.
    plot_train_progress_2(values_to_plot['loss'], values_to_plot['acc_train'], values_to_plot['acc_test'], iters_per_test)
    

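The softmax and cross entropy above are written out by hand. As a sanity check, the sketch below compares them on toy logits against PyTorch's built-in `torch.softmax` and `torch.nn.functional.cross_entropy`. The toy logits and labels are made up for illustration. Note that the built-in loss takes raw logits and class indices, rather than probabilities and one-hot labels.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)              # toy logits, [N=4, D_out=10]
labels = torch.tensor([3, 0, 7, 7])      # toy class indices
y_real = F.one_hot(labels, num_classes=10).float()   # one-hot labels, [4, 10]

# Softmax written out, as in forward_pass.
exp_logits = torch.exp(logits)
y_pred = exp_logits / torch.sum(exp_logits, dim=1, keepdim=True)

# Cross entropy written out, as in cross_entropy (eps guards against log(0)).
eps = 1e-7
loss_manual = torch.mean(-torch.sum(y_real * torch.log(y_pred + eps), dim=1))

# Built-in equivalents.
y_pred_builtin = torch.softmax(logits, dim=1)
loss_builtin = F.cross_entropy(logits, labels)

softmax_matches = torch.allclose(y_pred, y_pred_builtin, atol=1e-6)
loss_matches = torch.allclose(loss_manual, loss_builtin, atol=1e-4)
print(softmax_matches, loss_matches)
```

The two losses agree only up to the small `eps` that the hand-written version adds inside the logarithm. Also keep in mind that `torch.exp` can overflow for very large logits, which the built-in functions guard against internally.
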
Now below, let's finally create an instance of this 3-layered classifier.

Fill in the number of neurons in the 2 hidden layers of the MLP Classifier, to be the same as the encoder of the AE (with the wide 32 bottleneck).

In [None]:
# Train Classifier from scratch (initialized randomly)

# Create the network
rng = np.random.RandomState(seed=SEED)
net_classifier_from_scratch = Classifier_3layers(D_in=H_height*W_width,
                                                 D_hid_1=???????, # TODO: Use same as layer 1 of encoder of wide AE (Task 4)
                                                 D_hid_2=???????,  # TODO: Use same as layer 2 of encoder of wide AE (Task 4)
                                                 D_out=C_classes,
                                                 rng=rng)
# Start training
train_classifier(net_classifier_from_scratch,
                 None,  # No pretrained AE
                 cross_entropy,
                 rng,
                 train_imgs_flat[:100],
                 train_lbls_onehot[:100],
                 test_imgs_flat,
                 test_lbls_onehot,
                 batch_size=40,
                 learning_rate=3e-3,
                 total_iters=1000,
                 iters_per_test=20)


If everything went as expected, you should see the training loss of the classifier going down, reaching (close to) 0. At the end, you will find a plot of the loss curve, the training accuracy (approx 100%), and the testing accuracy (approx 57%).

**Questions**:

- How do you explain the "generalization gap" between training and testing accuracy?
- How do you think the amount of labelled data (above =100 by default) affects this gap? Feel free to experiment with other amounts of labelled data.
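
When experimenting with other amounts of labelled data, note that a slice such as `train_lbls_onehot[:100]` simply takes the first samples, so the classes need not be equally represented. One possible alternative is to draw the same number of samples from every class. The sketch below is only an illustration under that assumption: `balanced_subset` is a hypothetical helper that is not part of `./utils`, and the toy labels stand in for `train_lbls_onehot`.

```python
import numpy as np

def balanced_subset(lbls_onehot, n_per_class, rng):
    # Hypothetical helper (not part of ./utils): returns shuffled indices of a
    # subset with n_per_class randomly chosen samples from each class.
    class_ids = np.argmax(lbls_onehot, axis=1)
    chosen = []
    for c in range(lbls_onehot.shape[1]):
        idx_c = np.where(class_ids == c)[0]
        chosen.append(rng.choice(idx_c, size=n_per_class, replace=False))
    return rng.permutation(np.concatenate(chosen))

# Toy one-hot labels standing in for train_lbls_onehot: 500 labels, 50 per class.
rng = np.random.RandomState(seed=0)
toy_class_ids = np.repeat(np.arange(10), 50)
toy_lbls_onehot = np.eye(10)[toy_class_ids]

subset_idx = balanced_subset(toy_lbls_onehot, n_per_class=10, rng=rng)
counts = toy_lbls_onehot[subset_idx].sum(axis=0)   # samples per class in the subset
print(len(subset_idx), counts)
```

The returned indices could then be used to index both the images and the labels, for example `train_imgs_flat[subset_idx]` and `train_lbls_onehot[subset_idx]`.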


## Task 7: Use Unsupervised AE as 'pre-trained feature-extractor' for a supervised Classifier when labels are limited

There are 2 ways in which a pre-trained AE can be used to help train better Classifiers when labelled data are limited.

Approach-1: We take the encoder of the pre-trained AE (already trained with unlabelled data) and place a small, untrained Classifier (often 1 layer) on top. The Classifier receives as input the output of the encoder, i.e. the codes z of data x. We then train with the limited labelled data. Importantly, we only train the small Classifier. The encoder is used as a 'frozen' feature extractor, meaning it does not get trained further. Because only the small Classifier is trained, there are few parameters to fit, so in theory we are less likely to overfit the limited labelled data. See next figure for a visual explanation.

![title](./documentation/encoder_frozen.png)
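
To make the idea of a 'frozen' encoder concrete, here is a minimal, self-contained sketch. It is not the tutorial's implementation: the encoder weight is random rather than pre-trained, biases are omitted, and all names are made up for illustration. In `train_classifier` the encoder stays frozen because only `classifier.params` are given to the optimizer. The sketch does the same and, as an additional choice, calls `detach()` on the codes so that no gradient flows back into the encoder at all.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
N, D_in, D_z, C = 8, 784, 32, 10         # illustrative sizes

# Stand-in for a pre-trained encoder weight (random here, purely for illustration).
w_enc = (torch.randn(D_in, D_z) * 0.01).requires_grad_(True)
# Untrained classification layer placed on top of the codes z.
w_out = (torch.randn(D_z, C) * 0.01).requires_grad_(True)

x = torch.rand(N, D_in)                  # toy batch of inputs
labels = torch.randint(0, C, (N,))       # toy class indices

# Only the classifier's weights are handed to the optimizer.
optimizer = torch.optim.Adam([w_out], lr=3e-3)

w_enc_before = w_enc.detach().clone()
w_out_before = w_out.detach().clone()

z = x.mm(w_enc).clamp(min=0).detach()    # codes are treated as fixed inputs
loss = F.cross_entropy(z.mm(w_out), labels)

optimizer.zero_grad()
loss.backward()
optimizer.step()

encoder_unchanged = torch.equal(w_enc, w_enc_before)
classifier_changed = not torch.equal(w_out, w_out_before)
print(encoder_unchanged, classifier_changed, w_enc.grad is None)
```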

Approach-2 will be explored in the next and final task.

Complete and run the code below.


In [None]:
# Train classifier on top of pre-trained AE encoder

class Classifier_1layer(Network):
    # Classifier with just 1 layer, the classification layer
    def __init__(self, D_in, D_out, rng):
        # D_in: dimensions of input
        # D_out: dimension of output (number of classes)
        
        #### TODO: Fill in the blanks ######################
        w_out_init = rng.normal(loc=0.0, scale=0.01, size=(D_in+1, D_out))
        w_out = torch.tensor(????????, dtype=torch.float, requires_grad=True)
        ####################################################
        self.params = [w_out]
        
        
    def forward_pass(self, batch_inp):
        # compute predicted y
        [w_out] = self.params
        
        # In case input is image, make it a tensor.
        batch_inp_t = torch.tensor(batch_inp, dtype=torch.float) if type(batch_inp) is np.ndarray else batch_inp
        
        unary_feature_for_bias = torch.ones(size=(batch_inp_t.shape[0], 1)) # [N, 1] column vector.
        batch_inp_ext = torch.cat((batch_inp_t, unary_feature_for_bias), dim=1) # Extra feature=1 for bias.
        
        # Output classification layer
        logits = batch_inp_ext.mm(w_out)
        
        # Output layer activation function
        # Softmax activation function.
        exp_logits = torch.exp(logits)
        y_pred = exp_logits / torch.sum(exp_logits, dim=1, keepdim=True) 
        # sum with keepdim=True returns [N,1] array. It would be [N] if keepdim=False.
        # Torch broadcasts [N,1] to [N,D_out] via repetition, to divide elementwise exp_logits (which is [N,D_out]).
        
        return y_pred
    
    
    
# Create the network
rng = np.random.RandomState(seed=SEED) # Random number generator
# As input, it will be getting z-codes from the AE with 32-neurons bottleneck from Task 4.
classifier_1layer = Classifier_1layer(autoencoder_wide.D_bottleneck,  # Input dimension is dimensions of AE's Z
                                      C_classes,
                                      rng=rng)

########### TODO: Fill in the gaps to start training ####################
# Give to the function the 1-layer classifier, as well as the pre-trained AE that will work as feature extractor.
# For the pre-trained AE, give the instance of 'wide' AE that has 32-neurons bottleneck, which you trained in Task 4.
train_classifier(?????????????,
                 autoencoder_wide,  # Pretrained AE, to use as feature extractor.
                 cross_entropy,
                 rng,
                 train_imgs_flat[:100],
                 train_lbls_onehot[:100],
                 test_imgs_flat,
                 test_lbls_onehot,
                 batch_size=40,
                 learning_rate=3e-3,   # 5e-3, is the best for 1-layer classifier and all data.
                 total_iters=1000,
                 iters_per_test=20)

If you completed the task appropriately, you should see the model getting trained and performance reported at the bottom. Training accuracy should approach 100% and test accuracy should be approximately 65%.

**Questions:**
- Compare with the test accuracy from Task 6. How do the 2 differ? How would you justify the difference?
- Try the same experiments (Task 6 and Task 7) using different amounts of labelled data. What do you observe as the number of training data increases?

## Task 8: Use parameters of an Unsupervised AE's encoder to initialize weights of a supervised Classifier, followed by refinement using limited labels

Approach-2: The second approach is to build a Classifier that has the same architecture as the encoder of an AutoEncoder, followed by an extra classification layer. We first train the Autoencoder on unlabelled data (as done in Task 4). Then, we **use the pre-trained weights of the AE's encoder to initialize the corresponding parameters of the Classifier**. The classification layer of the Classifier is initialized randomly. Then, **with the limited labelled data, we refine (train) all the parameters of the classifier**. The advantage of this approach is that the Classifier begins with the "knowledge" (parameters) extracted from unlabelled data, and then all of it is refined with extra "knowledge" from the labelled data. See figure below for visual explanation.

![title](./documentation/refinement.png)
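
As a contrast with the frozen encoder of Task 7, here is a minimal sketch of the idea behind Approach-2 on toy tensors. It is an illustration under stated assumptions, not the tutorial's code: the 'pre-trained' weight is random, there is a single hidden layer without biases, and all names are made up. The weights are copied with `detach().clone()`, which is PyTorch's recommended way to copy a tensor, so that refining the classifier does not modify the AE.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
N, D_in, D_hid, C = 8, 784, 32, 10       # illustrative sizes

# Stand-in for a trained AE encoder weight (random here, for illustration only).
w_ae = (torch.randn(D_in, D_hid) * 0.01).requires_grad_(True)

# Copy its values into a new leaf tensor that belongs to the classifier.
w_1 = w_ae.detach().clone().requires_grad_(True)
# The classification layer is initialized randomly.
w_out = (torch.randn(D_hid, C) * 0.01).requires_grad_(True)
same_at_init = torch.equal(w_1, w_ae)

# All layers of the classifier are given to the optimizer, so all get refined.
optimizer = torch.optim.Adam([w_1, w_out], lr=3e-3)

x = torch.rand(N, D_in)                  # toy batch of inputs
labels = torch.randint(0, C, (N,))       # toy class indices
loss = F.cross_entropy(x.mm(w_1).clamp(min=0).mm(w_out), labels)

optimizer.zero_grad()
loss.backward()
optimizer.step()

classifier_moved = not torch.equal(w_1, w_ae)   # refined away from the AE values
ae_untouched = w_ae.grad is None                # the AE itself received no gradient
print(same_at_init, classifier_moved, ae_untouched)
```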

**The code below is complete.**\
Read it, understand it, run it, and try to answer the questions below.

In [None]:
# Pre-train a classifier.

# The below classifier has THE SAME architecture as the 3-layer Classifier that we trained...
# ... in a purely supervised manner in Task-6.
# This is done by inheriting the class (Classifier_3layers), therefore uses THE SAME forward_pass() function.
# THE ONLY DIFFERENCE is in the construction __init__.
# This 'pretrained' classifier receives as input a pretrained autoencoder (pretrained_AE) from Task 4.
# It then uses the parameters of the AE's encoder to initialize its own parameters, rather than random initialization.
# The model is then trained all together.
class Classifier_3layers_pretrained(Classifier_3layers):
    def __init__(self, pretrained_AE, D_in, D_out, rng):
        D_in = D_in
        D_hid_1 = 256
        D_hid_2 = 32
        D_out = D_out

        w_out_init = rng.normal(loc=0.0, scale=0.01, size=(D_hid_2+1, D_out))
        
        w_1 = torch.tensor(pretrained_AE.params[0], dtype=torch.float, requires_grad=True)
        w_2 = torch.tensor(pretrained_AE.params[1], dtype=torch.float, requires_grad=True)
        w_out = torch.tensor(w_out_init, dtype=torch.float, requires_grad=True)
        
        self.params = [w_1, w_2, w_out]
        
# Create the network
rng = np.random.RandomState(seed=SEED) # Random number generator
classifier_3layers_pretrained = Classifier_3layers_pretrained(autoencoder_wide,  # The AE pre-trained in Task 4.
                                                              train_imgs_flat.shape[1],
                                                              C_classes,
                                                              rng=rng)

# Start training
# NOTE: Only the 3-layer pretrained classifier is used, and will be trained all together.
# No frozen feature extractor.
train_classifier(classifier_3layers_pretrained,  # classifier that will be trained.
                 None,  # No pretrained AE to act as 'frozen' feature extractor.
                 cross_entropy,
                 rng,
                 train_imgs_flat[:100],
                 train_lbls_onehot[:100],
                 test_imgs_flat,
                 test_lbls_onehot,
                 batch_size=40,
                 learning_rate=3e-3,
                 total_iters=1000,
                 iters_per_test=20)

**Questions:**

- How does the generalization performance of this Classifier, which is pretrained via AE and then has all its parameters refined with the limited labelled data, compare with that of the same classifier trained ONLY with labelled data (Task 6)?
- How does it compare with the approach in Task 7? Is the generalization better or worse? Why?
- How do you think this approach (Task 8) compares with the previous approach (Task 7) when more or less labelled data are used? What are the advantages and disadvantages of this approach that can influence whether it's better or worse when we use more or less labelled data?
- Feel free to repeat Tasks 6,7,8 for different amounts of labelled data to investigate.

**Note:** By default, the code of Tasks 6,7,8 above trains the model using only 100 out of the 60000 labelled training data. Feel free to repeat Tasks 6,7,8 for different amounts of labelled data and compare how the performance behaves. This may help you answer the questions above.


## This notebook:
Copyright 2021, University of Birmingham  
Tutorial for Neural Computation  
For issues e-mail: k.kamnitsas@bham.ac.uk