# **Lecture Day 9 Practical** - 09/03/21
# **Contrastive Learning**

Today we will explore contrastive learning, aiming to build good visual representations, useful in downstream tasks, in an unsupervised manner.

More specifically, we will be replicating a recent and relevant research paper! Today we are going to implement the SimCLR network and NT-Xent loss function from the paper:\
[A Simple Framework for Contrastive Learning of Visual Representations](https://arxiv.org/pdf/2002.05709.pdf)

Keep this paper open during the practical so you can refer back to specific sections!

---

Any questions after the practical session just drop me an email:

a.durrant.20@abdn.ac.uk

 \- Aiden

## **Outline and Objectives:**
Today we are going to implement the SimCLR network in Keras! We are not going to replicate it exactly, as that would take too long for this practical session, but feel free to keep refining in your own time!
- [ ] Construct the NT-Xent Loss Function
- [ ] Build the training routine (siamese network & projection MLPs)
- [ ] Build the evaluation method (linear classification)
- [ ] View the natural semantic clusters formed via t-SNE 

**Extra Tasks (not required, but worth a try if you find this interesting)**
- [ ] Implement a MoCo style memory bank.
- [ ] Implement the BYOL method of training.

To check off the tasks update the markdown '- [ ]' -> to '- [x]'

## **Remember!**
Set your runtime type to allow GPU utilization!

`Runtime -> Change runtime type -> GPU`

If you get stuck with Colab, check out the practical from Day one or have a look at these examples:
- https://colab.research.google.com/notebooks/intro.ipynb
- https://jupyter-notebook.readthedocs.io/en/stable/notebook.html

## **Imports**

Next, let us load all the appropriate modules!

I have also provided a small CNN to train, which we download from my GitHub:

In [None]:
!wget https://raw.githubusercontent.com/AidenDurrant/DMV_Practicals/master/SmallCNN.py

--2021-02-27 21:53:53--  https://raw.githubusercontent.com/AidenDurrant/DMV_Practicals/master/SmallCNN.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1622 (1.6K) [text/plain]
Saving to: ‘SmallCNN.py.1’


2021-02-27 21:53:53 (24.0 MB/s) - ‘SmallCNN.py.1’ saved [1622/1622]



In [None]:
import os
import shutil
import random
import math
import numpy as np
import glob as glob # For easy pathname pattern matching
from tqdm import tqdm # Aesthetic progress bar
import sklearn.metrics as metrics # Easier metric definition
from sklearn.manifold import TSNE # Dimensionality reduction for visualisation.
from matplotlib import pyplot as plt # Plotting
import natsort # better sorting
import seaborn as sns

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import regularizers
from tensorflow.keras import backend as K  # use the Keras bundled with TensorFlow, not the standalone package

from SmallCNN import SmallCNN

# **Data Loading / Pre-Processing**

Let us first download the dataset we are going to use today! <br>

This is the [CIFAR10](https://www.cs.toronto.edu/~kriz/cifar.html) dataset.
<br>
This dataset is comprised of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.

The classes:
- airplane
- automobile
- bird
- cat
- deer
- dog
- frog
- horse
- ship
- truck

In [None]:
(trainX, trainY), (testX, testY) = tf.keras.datasets.cifar10.load_data() # Download both train and test and their labels

# Get the pixel intensities into range [0,1]
trainX = trainX / 255.
testX = testX / 255.

## **Data Augmentations**

The main contribution of the SimCLR paper was the replacement of human designed pre-text tasks with "standard" image augmentations. 

Recall, each image $x$ is transformed by two random augmentation procedures ($\tau_{1}, \tau_{2}$) to produce two views of the same image $v_{1}$ and $v_{2}$

So first we are going to define our augmentation procedure!

I have implemented this as a class, `CustomAugment`, that we can call during training to produce augmentations of a batch.

These augmentations are inspired by my PyTorch implementation; if you are interested in the full augmentation set see [Pytorch_simclr](https://github.com/AidenDurrant/SimCLR-Pytorch)

***Paper:  Section 3 - Data Augmentation for Contrastive
Representation Learning***

In [None]:
class CustomAugment(object):
    def __call__(self, sample):
        # As we are doing self-supervised learning we don't need the label
        sample, _ = sample

        # Randomly apply transformation (color distortions) with probability p.
        sample = self._random_apply(self._color_jitter, sample, p=0.8)
        sample = self._random_apply(self._color_drop, sample, p=0.2)

        # Resize Crop
        sample = tf.image.random_crop(sample, [sample.shape[0],28, 28, 3])
        sample = tf.image.resize(sample, [32, 32])

        # Random flips
        sample = self._random_apply(tf.image.flip_left_right, sample, p=0.5)

        # Normalize
        mean=[[0.49139968, 0.48215841, 0.44653091]]
        std=[[0.24703223, 0.24348513, 0.26158784]]

        mean = tf.reshape(tf.repeat(mean, repeats=sample.shape[0] ,axis=0),[sample.shape[0],1, 1, 3]) 
        std = tf.reshape(tf.repeat(std, repeats=sample.shape[0] ,axis=0),[sample.shape[0],1, 1, 3]) 

        sample = (sample - mean) / (std + 1E-12)

        return sample

    def _color_jitter(self, x, s=0.5):
        # one can also shuffle the order of following augmentations
        # each time they are applied.
        x = tf.image.random_brightness(x, max_delta=0.8*s)
        x = tf.image.random_contrast(x, lower=1-0.8*s, upper=1+0.8*s)
        x = tf.image.random_saturation(x, lower=1-0.8*s, upper=1+0.8*s)
        x = tf.image.random_hue(x, max_delta=0.2*s)
        x = tf.clip_by_value(x, 0, 1)
        return x
    
    def _color_drop(self, x):
        x = tf.image.rgb_to_grayscale(x)
        x = tf.tile(x, [1, 1, 1, 3])
        return x
    
    def _random_apply(self, func, x, p):
        return tf.cond(
          tf.less(tf.random.uniform([], minval=0, maxval=1, dtype=tf.float32),
                  tf.cast(p, tf.float32)),
          lambda: func(x),
          lambda: x)

In [None]:
# Define this class as a sequential model!
data_augmentation = keras.Sequential([keras.layers.Lambda(CustomAugment())])
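
To make the two-view idea concrete, here is a minimal NumPy-only sketch. It is not the `CustomAugment` class itself: `toy_augment` is a hypothetical stand-in that only crops and flips. Like the class above, it draws one random crop offset and one flip decision per call, so every image in the batch receives the same transformation within a view, while the two views differ from each other.

```python
import numpy as np

def toy_augment(batch, rng, crop=28):
    """Hypothetical stand-in for CustomAugment: random crop + random horizontal flip."""
    _, height, width, _ = batch.shape
    # One crop offset for the whole batch
    top = rng.integers(0, height - crop + 1)
    left = rng.integers(0, width - crop + 1)
    view = batch[:, top:top + crop, left:left + crop, :]
    # One coin flip for the whole batch
    if rng.random() < 0.5:
        view = view[:, :, ::-1, :]  # reverse the width axis
    return view

rng = np.random.default_rng(0)
images = rng.random((4, 32, 32, 3))  # fake batch with values in [0, 1)

# Two independent calls give two views of the same images
view_1 = toy_augment(images, rng)
view_2 = toy_augment(images, rng)
print(view_1.shape, view_2.shape)
```

The real class additionally resizes the crop back to 32x32, applies colour distortions and normalises the result.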

# **Architecture**

Now that we have created the views via augmentations, we have to construct the network architecture.

To keep things simple I have already created a CNN architecture for you, which can be found [here](). This is already imported, so you only have to call the model `SmallCNN()`.

Your task is to define the projection head $g(\cdot)$ from the paper, a small MLP with one hidden layer and a ReLU non-linearity!

***Paper:  Section 2.1 - The Contrastive Learning Framework***

[Keras Layers Documentation](https://keras.io/api/layers/)

In [None]:
# Model
def get_cnn(hidden_1, hidden_2):
    base_model = SmallCNN(out_dim=10)
    base_model.trainable = True
    inputs = keras.layers.Input((32, 32, 3))
    h, x = base_model(inputs)

    ## TASK: YOUR CODE ##

    projection_1 = keras.layers.Dense(hidden_1)(h)
    projection_1 = keras.layers.ReLU()(projection_1)
    projection_2 = keras.layers.Dense(hidden_2)(projection_1)

    ## END ##

    cnn_simclr = keras.models.Model(inputs, projection_2)

    return cnn_simclr

## **Helper Functions**

Just some functions to help us with the next section!

`mask_maker()` simply creates a binary mask that helps us define what is a positive and what is a negative pair of views.
<br>

`label_maker()` just returns a tensor of zeros to be used in the softmax cross entropy loss! Zero because the first logit will always refer to the positive pair.

In [None]:
def mask_maker(batch_size):
  negative_mask = np.ones((batch_size, batch_size), dtype=bool)
  for i in range(batch_size):
    negative_mask[i, i] = 0
  return tf.constant(negative_mask)

def label_maker(batch_size):
  return tf.zeros(batch_size*2, dtype=tf.int32)
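
To see what the mask does, here is a small NumPy sketch for a batch of 3. `mask_maker_np` is a hypothetical NumPy equivalent of `mask_maker()`, used here only for illustration.

```python
import numpy as np

def mask_maker_np(batch_size):
    # True everywhere except on the leading diagonal
    return ~np.eye(batch_size, dtype=bool)

mask = mask_maker_np(3)
print(mask.astype(int))

# A toy "similarity matrix" so we can see which entries are selected
sim = np.arange(9).reshape(3, 3)
positives = sim[~mask]                 # diagonal entries: [0 4 8]
negatives = sim[mask].reshape(3, -1)   # off-diagonal entries, one row per sample
print(positives)
print(negatives)
```

The inverted mask picks out the diagonal (the positive pairs), and the mask itself picks out everything else (the negatives), which is how the two are used in the loss below.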

# **NT-Xent Loss**

Now for the contrastive loss, your task is to implement the NT-Xent loss function defined in the SimCLR Paper:

$\ell_{i,j}=-\log \frac{\exp(sim(z_i , z_j)/ \tau)}{\sum^{2N}_{k=1} 1_{[k\neq i]}\exp(sim(z_i , z_k)/ \tau)}$

where,

$sim(z_i , z_j) = \frac{z_i^{\top}z_j}{\|z_i\|\|z_j\|}$
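
As a quick worked example of the similarity term, the NumPy sketch below (the helper name `cosine_sim` is my own) shows that cosine similarity depends only on the angle between vectors, and that after $\ell_2$ normalisation it reduces to a plain dot product.

```python
import numpy as np

def cosine_sim(u, v):
    # z_i^T z_j / (||z_i|| ||z_j||)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

z_i = np.array([1.0, 0.0])
z_j = np.array([1.0, 1.0])

print(cosine_sim(z_i, z_j))        # about 0.7071 (vectors 45 degrees apart)
print(cosine_sim(z_i, 3.0 * z_i))  # 1.0: rescaling a vector does not change it

# After l2 normalisation the dot product alone gives the same value
z_i_hat = z_i / np.linalg.norm(z_i)
z_j_hat = z_j / np.linalg.norm(z_j)
print(float(z_i_hat @ z_j_hat))
```

This is why normalising first and then doing a single matrix multiplication gives every pairwise similarity in the batch at once.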

This might look scary to implement but don't worry I will go through this with you all after you've had a go yourself!

If you want some additional help, unhide the following section, which gives you step-by-step pseudo-instructions on how to implement this :)

***Paper:  Section 2.1 - The Contrastive Learning Framework --> Equation (1)***

In [None]:
# Loss function
def nt_xent(zis, zjs, criterion, args):

  ## TASK: YOUR CODE ##
  
  # normalize projection feature vectors
  zis = tf.math.l2_normalize(zis, axis=1)
  zjs = tf.math.l2_normalize(zjs, axis=1)

  aa = K.dot(zis,tf.transpose(zis)) / args['temperature']
  bb = K.dot(zjs,tf.transpose(zjs)) / args['temperature']
  ab = K.dot(zis,tf.transpose(zjs)) / args['temperature']
  ba = K.dot(zjs,tf.transpose(zis)) / args['temperature']
  
  mask = mask_maker(args['batch_size'])
  labels = label_maker(args['batch_size'])

  # Compute Positive Logits
  ab_pos = tf.reshape(tf.boolean_mask(ab, tf.math.logical_not(mask)), (args['batch_size'], 1))
  ba_pos = tf.reshape(tf.boolean_mask(ba, tf.math.logical_not(mask)), (args['batch_size'], 1))

  # Compute Negative Logits
  aa_neg = tf.reshape(tf.boolean_mask(aa, mask), (args['batch_size'], -1))
  bb_neg = tf.reshape(tf.boolean_mask(bb, mask), (args['batch_size'], -1))
  ab_neg = tf.reshape(tf.boolean_mask(ab, mask), (args['batch_size'], -1))
  ba_neg = tf.reshape(tf.boolean_mask(ba, mask), (args['batch_size'], -1))

  # Positive Logits over all samples
  pos = tf.concat([ab_pos, ba_pos], axis=0) 

  # Negative Logits over all samples
  neg_a = tf.concat([aa_neg, ab_neg], axis=1) 
  neg_b = tf.concat([bb_neg, ba_neg], axis=1) 

  neg = tf.concat([neg_a, neg_b], axis=0) 

  # Final Logits
  logits = tf.concat([pos, neg], axis=1) 

  loss = criterion(y_pred=logits, y_true=labels)

  loss = loss / (2 * args['batch_size'])

  ## END ##

  return loss

# **NT-Xent implementation steps**:

1. Normalize the representations by $\ell_2$ norm
2. Compute the cosine similarity between all views. (matrix multiplication `K.dot(a,b)`)

**Note:**

Cosine similarity matrix of all samples in batch:
  a = $z_i$
  b = $z_j$
```
How the similarity matrix will look after a matrix multiplication of the representations.
    ____ ____
  | aa | ab |
  |____|____|
  | ba | bb |
  |____|____|
```

  Positives:
  Leading diagonals of ab and ba `'\'`

  Negatives:
  All values that do not lie on leading diagonals of aa, bb, ab, ba.

3. Divide by our temperature $\tau$
4. Retrieve a mask and our labels from our helper functions
5. Get the similarities previously computed for the **positive** pairs. (Leading diagonals of our similarity matrix, use our mask!)
6. Get the similarities previously computed for the **negative** pairs. (off the leading diagonals of our similarity matrix, use our mask!)
7. Concatenate all our positives together, and concatenate all our negatives together.
8. Concatenate our positives and negatives together, ensuring that the positive is ordered first!
9. Take this concatenation and run it through a cross entropy loss with our labels from the helper function! (We've already passed this as `criterion`)

[Normalise Docs](https://www.tensorflow.org/api_docs/python/tf/math/l2_normalize)

[Transpose Docs](https://www.tensorflow.org/api_docs/python/tf/transpose)

[Boolean Mask Docs](https://www.tensorflow.org/api_docs/python/tf/boolean_mask)

[logical not Docs](https://www.tensorflow.org/api_docs/python/tf/math/logical_not)

[reshape Docs](https://www.tensorflow.org/api_docs/python/tf/reshape)

[concatenate Docs](https://www.tensorflow.org/api_docs/python/tf/concat)
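
The steps above can be sketched in plain NumPy as a reference to check your TensorFlow version against. This is a hedged sketch rather than the notebook's solution: `nt_xent_np` is a hypothetical helper that stacks both views into one $2N \times 2N$ similarity matrix and takes a log-softmax per row, instead of building the positive and negative logits separately. Under that assumption it should give the same mean loss.

```python
import numpy as np

def nt_xent_np(z_a, z_b, temperature):
    """Reference NT-Xent, averaged over all 2N views (NumPy, for clarity)."""
    n = z_a.shape[0]
    z = np.concatenate([z_a, z_b], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # l2 normalise
    sim = z @ z.T / temperature                       # cosine similarity / tau
    np.fill_diagonal(sim, -np.inf)                    # remove the k == i terms

    # The positive for row i is the other view of the same image: i <-> i + N
    pos_idx = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])
    positives = sim[np.arange(2 * n), pos_idx]

    # Numerically stable log of the denominator (log-sum-exp per row)
    row_max = sim.max(axis=1, keepdims=True)
    log_denominator = row_max[:, 0] + np.log(np.exp(sim - row_max).sum(axis=1))

    return float(np.mean(log_denominator - positives))

# Toy check: two orthogonal samples whose two views are identical
eye = np.eye(2)
loss = nt_xent_np(eye, eye, temperature=1.0)
print(loss)  # log(2 + e) - 1, roughly 0.551
```

In the toy check each row sees one positive with similarity 1 and two negatives with similarity 0, so the loss per row is $-\log\frac{e}{e + 2}$.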

# **Training Method**

Given the nature of the SimCLR model we need a little more control over the training loop. Usually we use `model.fit()`, however in this case we run the training loop manually!

Your task today is to complete the training loop! I have missed out a few steps that directly relate to the SimCLR model. Pseudo-code is provided in the paper, which can help you!

**Paper: Section 2 - Method --> Algorithm (1)**

What I have given you is the general training routine for a custom training loop in Keras, which can be applied to any model!

**Tip:** 
<br>
We have just defined lots of functions that make up SimCLR, let's put them to use!

In [None]:
def train_simclr(model, dataset, args):
    epoch_wise_loss = []
    step_wise_loss = []

    # Our cross entropy loss used in the NT-Xent loss we previously implemented!
    criterion = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction=tf.keras.losses.Reduction.SUM)

    # Learning rate decay during training
    decay_steps = 1000
    lr_decayed_fn = tf.keras.experimental.CosineDecay(initial_learning_rate=args['lr'], decay_steps=decay_steps)

    # optimizer
    optimizer = tf.keras.optimizers.SGD(lr_decayed_fn)

    # Shuffle individual samples (reshuffled on every pass over the dataset), then
    # batch, dropping the final batch if it is smaller than `batch_size`
    dataset = dataset.shuffle(buffer_size=1024).batch(args['batch_size'], drop_remainder=True)

    # iterate over epochs
    for epoch in range(args['epochs']):
        # Reset the per-step losses so the epoch mean covers this epoch only
        step_wise_loss = []
        print("\n Epoch: {}\n".format(epoch))

        # Iterate over all batches in the training set!
        for image_batch in dataset:

            ## TASK: YOUR CODE ##

            # Create the views of the images in image_batch
            a = data_augmentation(image_batch)
            b = data_augmentation(image_batch)

            # Record the operations so we can compute the gradient 
            with tf.GradientTape() as tape:

              # Pass to the model
              zis = model(a)
              zjs = model(b)

              # Compute the loss!
              loss = nt_xent(zis, zjs, criterion, args)

              ## END ##

              # keep track of our loss
              step_wise_loss.append(loss)

            # Compute the gradients of the model and step with our optimiser
            gradients = tape.gradient(loss, model.trainable_variables)
            optimizer.apply_gradients(zip(gradients, model.trainable_variables))

        # keep track of our epoch loss
        epoch_wise_loss.append(np.mean(step_wise_loss))
        
        # save the latest model weights
        model.save_weights("./checkpoint/cp.ckpt")

        if epoch % 1 == 0:
            print("epoch: {} loss: {:.3f}".format(epoch + 1, np.mean(step_wise_loss)))

    return epoch_wise_loss, model
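
One thing worth checking in the loop above is how the learning rate evolves. The pure-Python sketch below reproduces the cosine decay formula as I understand it from the Keras `CosineDecay` documentation (with its default `alpha = 0`); the function name `cosine_decay` is my own. With `decay_steps = 1000` and 195 batches per epoch (50000 // 256 with `drop_remainder=True`), the rate reaches zero a little after epoch 5.

```python
import math

def cosine_decay(step, initial_lr, decay_steps, alpha=0.0):
    """Cosine decay: falls from initial_lr to alpha * initial_lr at decay_steps."""
    step = min(step, decay_steps)  # the schedule is flat after decay_steps
    cosine = 0.5 * (1.0 + math.cos(math.pi * step / decay_steps))
    return initial_lr * ((1.0 - alpha) * cosine + alpha)

steps_per_epoch = 50000 // 256  # 195 full batches per epoch

for step in (0, 500, 1000, 5 * steps_per_epoch, 50 * steps_per_epoch):
    lr = cosine_decay(step, initial_lr=1.0, decay_steps=1000)
    print("step {:5d}  lr {:.4f}".format(step, lr))
```

If you want the decay to span the whole run, one option (an assumption on my part, not something the paper or this notebook prescribes) is to set `decay_steps` to `args['epochs'] * steps_per_epoch`.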

# **Run!**

Great, now we have all the components to train the self-supervised SimCLR network!

We will now put them all together: load our model, then pass it and the hyperparameters (`args`) to our training loop along with our dataset!

Tips for training SimCLR are also in the paper!

**Paper: Appendix B.9 - CIFAR10**

In [None]:
tf.get_logger().setLevel('ERROR')

# HyperParams
args = {'epochs': 50, 'batch_size': 256, 'lr': 1.0, 'temperature':0.5}

# Load our encoder
cnn_simclr = get_cnn(512, 128)
cnn_simclr.summary()

# Take the numpy dataset and make a TF dataset
dataset = tf.data.Dataset.from_tensor_slices((trainX, trainY))

# Self-supervised training!
epoch_loss, model = train_simclr(cnn_simclr, dataset, args)

# **Linear Evaluation!**

Now that we have trained our network and saved our latest weights, we have to evaluate them to determine whether we have learnt good image representations! (Refer back to the lecture if this doesn't make sense).

First let's start by simply making a linear classifier (i.e. the last softmax classification layer with 10 classes!)

[Layer Documentation](https://keras.io/api/layers/core_layers/dense/)

In [None]:
def linear_classifier(num_classes, features):
    ## TASK: YOUR CODE ##

    linear_model = keras.models.Sequential([keras.layers.Dense(num_classes, input_shape=(features, ), activation="softmax")])

    ## END ##

    return linear_model
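
For intuition, the single `Dense` layer with a softmax activation computes nothing more than `softmax(features @ W + b)`. The NumPy sketch below illustrates this with random stand-ins: `weights` and `bias` are hypothetical values, not trained parameters, and the 256-dimensional features are fake.

```python
import numpy as np

def softmax(logits):
    # Subtract the row maximum for numerical stability
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
features = rng.normal(size=(5, 256))         # 5 fake encoder representations
weights = rng.normal(size=(256, 10)) * 0.01  # hypothetical Dense kernel
bias = np.zeros(10)                          # hypothetical Dense bias

probs = softmax(features @ weights + bias)   # class probabilities per sample
predictions = probs.argmax(axis=1)           # predicted class index per sample
print(probs.shape, predictions)
```

Because the classifier is linear, its accuracy tells us how linearly separable the classes are in the frozen representation space.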

### **Train the Classifier**

Using what you've seen in the self-supervised training above, I want you to define the encoder, load the saved weights and freeze the encoder weights.

[Trainable Weights Documentation](https://keras.io/api/layers/base_layer/#trainable_weights-property)

Once we've loaded the self-supervised weights we can now compute the representations of the data that we are going to use to train the classifier.

In [None]:
# HyperParams
tf.get_logger().setLevel('ERROR')
args = {'epochs': 50, 'batch_size': 256, 'lr': 0.1}

## TASK: YOUR CODE ##
cnn_simclr = get_cnn(512, 128)
cnn_simclr.load_weights("./checkpoint/cp.ckpt")

cnn_simclr.layers[1].trainable = False
cnn_simclr.summary()

## END ##

# Define the model to output the layer before the MLP projection head g(.)
projection = keras.models.Model(cnn_simclr.input, cnn_simclr.layers[-4].output)


# Load the data from numpy to TF dataset
dataset = tf.data.Dataset.from_tensor_slices((trainX, trainY))

# Produce the representations of the dataset from the frozen self-supervised encoder
train_features, _ = projection.predict(trainX)
test_features, _ = projection.predict(testX)

# Early stopping, you can change this as you see fit!
es = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=2, verbose=2, restore_best_weights=True)

# Initialise our classifier from above!
linear_model = linear_classifier(10, 256)

# Train using the standard keras method only training the classifier from the 
# representations produced from the self-supervised encoder!
linear_model.compile(loss="sparse_categorical_crossentropy", metrics=["accuracy"],
                     optimizer="adam")
history = linear_model.fit(train_features, trainY,
                 validation_data=(test_features, testY),
                 batch_size=64,
                 epochs=35,
                 callbacks=[es])

## **Visualise the Representations**

Now that we have finished training our model, we want to see how our representations are distributed in space in relation to the semantic classes they belong to! Remember these representations were learnt in a self-supervised manner with no semantic labels!

We use the representations we previously computed from the frozen encoder, but as these are high-dimensional we need a method to visualise them appropriately.

To visualise these high-dimensional vectors we will employ [t-SNE](https://jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf), an unsupervised dimensionality reduction method!

Once in a more manageable dimension (2 dimensions (x,y)) we will plot the representations with their corresponding true semantic label.

Your task is simply to perform the t-SNE visualisation, and plot the t-SNE representations with their corresponding true semantic label! You can use a pre-built package.

[t-SNE scikitlearn Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html)

[seaborn scatter plot](https://seaborn.pydata.org/generated/seaborn.scatterplot.html)

In [None]:
## TASK: YOUR CODE ##

# TSNE of representations
v = TSNE(n_components=2).fit_transform(train_features)

# Plot the tsne representations with their corresponding true semantic label!
fig = plt.figure(figsize = (10, 10))
sns.set_style("darkgrid")
sns.scatterplot(x=v[:,0], y=v[:,1], hue=trainY[:,0], legend='full', palette=sns.color_palette("bright", 10))
plt.show()

## END ##
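
Running t-SNE on all 50,000 training representations can be slow in Colab. One hedged option is to embed a random subset instead. The NumPy sketch below shows the idea with fake data; `subsample` is a hypothetical helper, and the subset size is an arbitrary choice.

```python
import numpy as np

def subsample(features, labels, n_samples, seed=0):
    """Pick the same random rows from the features and their labels."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(features.shape[0], size=n_samples, replace=False)
    return features[idx], labels[idx]

# Fake stand-ins with the same layout as train_features / trainY
fake_features = np.random.default_rng(1).normal(size=(1000, 256))
fake_labels = (np.arange(1000) % 10).reshape(-1, 1)

sub_x, sub_y = subsample(fake_features, fake_labels, n_samples=200)
print(sub_x.shape, sub_y.shape)
```

In the notebook you would pass `train_features` and `trainY` instead, then feed the subset to `TSNE` and use the matching labels for the `hue` argument.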

## **That's all**

So today's practical session has asked you to implement the main elements of a relevant and important paper outlining a key method in the field of self-supervised learning! I hope you gained a greater understanding of how these contrastive methods work now you have implemented them!


If you're interested in this research field check out some other works:
- [BYOL](https://arxiv.org/abs/2006.07733)
- [MoCo](https://arxiv.org/abs/1911.05722)
- [SimCLR v2](https://arxiv.org/abs/2006.10029)

And if you still have time or want to work on this in your own time try out the bonus tasks listed at the top of this notebook :)

I hope you enjoyed this type of practical replicating papers!

\- Aiden

## **References**

https://github.com/AidenDurrant/SimCLR-Pytorch/

https://github.com/sayakpaul/SimCLR-in-TensorFlow-2

