# Understanding Contrastive Loss with Numpy
> A brief overview of Siamese networks, then a complete explanation of contrastive loss

- toc: false 
- badges: false
- comments: true
- categories: [deep learning, loss functions, unsupervised learning]
- image: images/twins-siamese.png

Contrastive loss has been used recently in a number of papers investigating unsupervised learning. Momentum Contrast (MoCo), PIRL, and SimCLR all follow a very similar pattern: a Siamese network trained with contrastive loss. When reading these papers I found the general idea very straightforward, but it was difficult to understand how the implementation of the contrastive loss function worked. As is often the case in machine learning papers, the ideas aren’t very difficult once you’ve gotten past the math notation and obfuscating language. I will attempt to clearly explain how contrastive loss works and provide a complete implementation using nothing but Python and NumPy.

## Siamese Networks

<img src="images-2020-02-20/siamese_network.png" style="align:left;" height=900 width=500 />

Before digging into the details, it will be helpful to talk about Siamese networks (also called twin networks, although that term is not widely used). When training a Siamese network, two or more inputs are encoded and the output features are compared. This comparison can be done in a number of ways, including triplet loss, pseudo-labeling with cross-entropy loss, and contrastive loss. See [Understanding Ranking Loss](https://gombru.github.io/2019/04/03/ranking_loss) for more background.


Let’s look at an example where we want to make features out of MNIST digits. We’d want each image of an MNIST digit to encode into a vector that is close to the vectors of other images from the same class. Conversely, different digits should encode into vectors that are far from each other.

<img src="images-2020-02-20/vectorspace.png" height=400 width=400 />

Since we have the class labels for MNIST, we could use a regular network and categorical cross-entropy loss. But what if we didn’t have labels? This is where contrastive loss comes in.

Contrastive loss takes the output of the network for a positive example, calculates its distance to an example of the same class, and contrasts that distance with the distances from the positive example to negative examples. Said another way, the loss is low if positive samples are encoded to similar representations and negative examples are encoded to different representations.

This is accomplished by taking the cosine similarities of the vectors and treating the resulting values as the outputs of a typical categorization network. This is what made the implementation make sense to me. The big idea is that you can treat the similarity of the positive pair and the similarities to the negative examples as the output scores of a categorization network, with the positive pair playing the role of the true category.
In a categorization network, the outputs are typically run through a softmax function, then the negative log of the probability assigned to the true category is taken.

Let’s make this more concrete. Here we will have two vectors that are similar (p1 and p2), and an array of dissimilar ones (neg).
Example setup:


### Code Example

In [38]:
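import numpy as np

def normalize(x):
    # Scale x to unit length so that dot products equal cosine similarities
    return x/np.linalg.norm(x)

feature_size = 3
num_neg = 5
# p1 and p2 form the positive pair: p2 is p1 plus a tiny offset
p1 = np.random.normal(0, size=feature_size)
p2 = p1+0.001
p1 = normalize(p1)
p2 = normalize(p2)
# The negatives are unrelated random vectors, each normalized to unit length
neg = np.random.normal(0, size=(num_neg, feature_size))
for i in range(num_neg):
    neg[i] = normalize(neg[i])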

In [39]:
p1

array([ 0.64988086,  0.50815491, -0.56518445])

In [40]:
p2

array([ 0.6510342 ,  0.50946698, -0.5626703 ])

In [41]:
neg

array([[ 0.58600673, -0.71811791,  0.3753702 ],
       [-0.30291444, -0.15027459,  0.94109531],
       [ 0.16247364,  0.8575065 ,  0.48814436],
       [-0.08889491,  0.98651357,  0.1374361 ],
       [-0.53939327, -0.83836898, -0.07869148]])

### Calculating distance
The cosine similarity (loosely called the cosine distance in this post) is the cosine of the angle between two vectors. It is a useful metric because most points in a high-dimensional space tend to be far apart by a Euclidean measure. Another nice property is that the cosine of 0 degrees is 1, so similar vectors produce bigger numbers. Because our vectors are normalized to unit length, the cosine similarity is simply their dot product. The dot product can be calculated as a matrix multiply of the transpose of one vector with the other.
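As a quick sanity check of the dot-product shortcut, here is a minimal sketch comparing the full cosine formula with the dot product of unit-length vectors. The helper name `cosine_similarity` and the vectors `a` and `b` are made up for illustration and are not part of the example setup above.

```python
import numpy as np

def cosine_similarity(a, b):
    # Full formula: dot product divided by the product of the vector lengths
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Illustrative vectors (not from the post), both of length 5
a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])

# After normalizing to unit length, the plain dot product gives the same number
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

full = cosine_similarity(a, b)   # 24 / 25 = 0.96
shortcut = a_unit @ b_unit

# Identical directions give 1, opposite directions give -1
same = cosine_similarity(a, 2 * a)
opposite = cosine_similarity(a, -a)
print(full, shortcut, same, opposite)
```

This is why normalizing every vector up front is convenient: all later similarity calculations reduce to dot products.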

In [42]:
# p1 and p2 are nearly identical, so their dot product is close to 1.0
pos_dot = p1.dot(p2)
pos_dot

0.9999953136719998

In [43]:
# Most of the negatives are pretty far away, so small or negative
neg_dot = np.zeros(num_neg)
for i in range(num_neg):
    neg_dot[i] = p1.dot(neg[i])
neg_dot

array([-0.19623398, -0.8051135 ,  0.26544305,  0.36585386, -0.73208747])
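The loop above can also be written as a single matrix-vector product. The sketch below uses the rounded values of `p1` and `neg` printed earlier (so results match only to about 8 decimal places) and checks that both approaches agree; the names `neg_dot_loop` and `neg_dot_vec` are introduced here for illustration.

```python
import numpy as np

# Rounded values copied from the printed output above
p1 = np.array([0.64988086, 0.50815491, -0.56518445])
neg = np.array([[ 0.58600673, -0.71811791,  0.3753702 ],
                [-0.30291444, -0.15027459,  0.94109531],
                [ 0.16247364,  0.8575065 ,  0.48814436],
                [-0.08889491,  0.98651357,  0.1374361 ],
                [-0.53939327, -0.83836898, -0.07869148]])

# Loop version, as in the cell above
neg_dot_loop = np.zeros(len(neg))
for i in range(len(neg)):
    neg_dot_loop[i] = p1.dot(neg[i])

# Vectorized version: (5, 3) matrix times (3,) vector gives 5 dot products at once
neg_dot_vec = neg @ p1
print(neg_dot_vec)
```

With only five negatives the difference is cosmetic, but the vectorized form is the one that scales to the large numbers of negatives used in practice.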

### Softmax
<img src="images-2020-02-20/softmax.png" width=600 height=300/>


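When calculating the loss for categorical cross-entropy, the first step is to take the softmax of the values.
Softmax has two nice properties. The first is that it maps every value into the range 0 to 1, with all the values summing to 1, so they can be read as probabilities. The other is that, because of the exponential, the gaps between values are exaggerated: the biggest value takes a larger share and the other values are squished.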

For our example above:

In [44]:
exp = np.exp(np.concatenate(([pos_dot], neg_dot))) 
softmax = exp/np.sum(exp)
softmax

array([0.37681601, 0.11392357, 0.06196987, 0.18076625, 0.19985969,
       0.06666461])

Our positive example now has a much bigger value than the random ones, and all the values are bigger than 0 and less than 1.
To calculate the loss, we take the negative log of the value for the category that the label points to, which here is the positive example at index 0.

In [45]:
-np.log(softmax[0]) 

0.9759982414466235
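The two softmax properties, and the loss value above, can be checked with a small sketch. The `softmax` helper below is an illustrative addition, not the post's code: it subtracts the maximum before exponentiating, a common trick that leaves the result unchanged but avoids overflow for large inputs. The `scores` array holds the rounded dot products printed earlier.

```python
import numpy as np

def softmax(scores):
    # Subtracting the max does not change the result but keeps exp() from overflowing
    shifted = scores - np.max(scores)
    exp = np.exp(shifted)
    return exp / np.sum(exp)

# Positive dot product first, then the five negative dot products (rounded)
scores = np.array([0.9999953, -0.19623398, -0.8051135,
                   0.26544305, 0.36585386, -0.73208747])

probs = softmax(scores)
total = np.sum(probs)                    # probabilities sum to 1
shifted_probs = softmax(scores + 100.0)  # adding a constant changes nothing
loss = -np.log(probs[0])                 # cross-entropy with the label at index 0
print(probs, total, loss)
```

The loss comes out at roughly 0.976, matching the cell above.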

### Contrastive Loss
<img src="images-2020-02-20/contrastiveloss.png" width=600 height=300 />
We finally arrive at the focus of this article, the contrastive loss function.

It looks suspiciously like the softmax function, and that’s because it is, with the addition of the similarity function and a temperature normalization factor. The similarity function is just the cosine similarity that we talked about before. The other difference is that the values in the denominator are similarities to the positive example: in the code here, the positive pair plus the negative samples. The temperature divides every similarity before the softmax, and a small value such as 0.07 sharpens the resulting distribution. That’s it, not much different from the typical cross-entropy loss.
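For reference, one way to write the loss in symbols that match the variable names in this post is shown below. This is a sketch of what the NumPy code computes; the figure may use different notation. Here $\mathrm{sim}$ is the dot product of unit-length vectors, $\tau$ is the temperature `t`, $n_i$ is the $i$-th row of `neg`, and $N$ is `num_neg`.

$$
\mathcal{L} = -\log \frac{\exp\left(\mathrm{sim}(p_1, p_2)/\tau\right)}{\exp\left(\mathrm{sim}(p_1, p_2)/\tau\right) + \sum_{i=1}^{N} \exp\left(\mathrm{sim}(p_1, n_i)/\tau\right)}
$$

Setting $\tau = 1$ recovers the plain softmax cross-entropy from the previous section.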

In [46]:
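# Here is the NumPy implementation of contrastive loss
t = 0.07  # temperature
# Scale the similarities by the temperature; the positive pair is at index 0
logits = np.concatenate(([pos_dot], neg_dot))/t
exp = np.exp(logits)
# Negative log of the softmax probability assigned to the positive pair
loss = - np.log(exp[0]/np.sum(exp))
loss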

0.000144060894240411
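To pull the steps together, here is a minimal sketch that wraps them in one function. The name `contrastive_loss` and its `temperature` parameter are hypothetical choices for this illustration, and the max-subtraction step is an added numerical-stability trick rather than part of the cells above. The inputs are the rounded vectors printed earlier and are assumed to be unit-normalized already.

```python
import numpy as np

def contrastive_loss(query, positive, negatives, temperature=0.07):
    """Hypothetical helper: loss for one query, one positive and N negatives."""
    # Cosine similarities (dot products of unit vectors), positive pair first
    sims = np.concatenate(([query @ positive], negatives @ query))
    logits = sims / temperature
    # Subtract the max before exponentiating so exp() cannot overflow
    shifted = logits - np.max(logits)
    log_softmax = shifted - np.log(np.sum(np.exp(shifted)))
    # The positive pair sits at index 0 and plays the role of the true class
    return -log_softmax[0]

# Rounded values copied from the printed output earlier in the post
p1 = np.array([0.64988086, 0.50815491, -0.56518445])
p2 = np.array([0.6510342, 0.50946698, -0.5626703])
neg = np.array([[ 0.58600673, -0.71811791,  0.3753702 ],
                [-0.30291444, -0.15027459,  0.94109531],
                [ 0.16247364,  0.8575065 ,  0.48814436],
                [-0.08889491,  0.98651357,  0.1374361 ],
                [-0.53939327, -0.83836898, -0.07869148]])

loss_sharp = contrastive_loss(p1, p2, neg, temperature=0.07)
loss_plain = contrastive_loss(p1, p2, neg, temperature=1.0)
print(loss_sharp, loss_plain)
```

With a temperature of 0.07 this gives roughly 0.000144, and with a temperature of 1.0 it gives roughly 0.976, the plain softmax loss from earlier. The only thing separating the two results is the temperature scaling.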