# Facial Recognition Loss Functions

This is an overview of loss functions used in the facial recognition domain. Facial recognition problems are slightly different from object detection problems: instead of assigning each input to a fixed set of classes, the model must produce embeddings that can be compared between faces.

## Softmax Loss + Contrastive Loss

Notable examples:
- [Deep Learning Face Representation by Joint Identification-Verification](https://arxiv.org/abs/1406.4773)
- [Deeply learned face representations are sparse, selective, and robust](https://arxiv.org/abs/1412.1265)
- [DeepID3: Face Recognition with Very Deep Neural Networks](https://arxiv.org/abs/1502.00873)

### Siamese Networks

A Siamese network is a class of neural network architecture that contains two or more identical sub-networks. The sub-networks share the same configuration, parameters, and weights, so in practice a single network is applied to each input before backpropagation. Given the same input, every sub-network produces the same encoding. During training, two or more different inputs are encoded and the resulting output features are compared.
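The weight sharing can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: `make_encoder` and `euclidean_distance` are hypothetical helpers, and a single linear layer with `tanh` stands in for a real CNN.

```python
import math
import random


def make_encoder(in_dim, out_dim, seed=0):
    """Create ONE set of weights and return an encoder function that uses them."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]

    def encode(x):
        # A single linear layer followed by tanh; a stand-in for a real CNN.
        return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in weights]

    return encode


def euclidean_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# One encoder, i.e. one set of weights, is applied to both inputs of a pair.
encoder = make_encoder(in_dim=4, out_dim=3)

x_a = [0.1, 0.9, 0.3, 0.5]
x_b = [0.8, 0.2, 0.7, 0.1]

z_a = encoder(x_a)
z_b = encoder(x_b)

# Identical inputs give identical embeddings because the weights are shared.
same_input_distance = euclidean_distance(encoder(x_a), encoder(x_a))
# Different inputs are compared through the distance between their embeddings.
pair_distance = euclidean_distance(z_a, z_b)
```

In a real training loop the gradient of the loss on `pair_distance` would flow back into the single shared set of weights from both branches.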

Let’s look at an example where we want to learn features from MNIST digits. Each image of a digit should encode into a vector that is close to the vectors of other images of the same class. Conversely, images of different digits should encode into vectors that are far from each other.

![Similar Numbers](./assets/embed-to-vector-space.png)

Since we have the class labels for MNIST digits, we could use a regular CNN and categorical cross-entropy loss to finish the job. However, for facial data, we don't have the labels for every individual in the data set. This is where contrastive loss comes in.

> Contrastive loss takes the output of the network for a positive example, calculates its distance to an example of the same class, and contrasts that with the distance to negative examples.

Minimizing the loss is equivalent to encoding similar samples closer together and different samples farther apart. This is accomplished by taking the **cosine similarities** of the vectors and treating them like the output scores of a typical classification network, so that a softmax turns them into prediction probabilities.
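A minimal sketch of that idea, assuming toy 2-D embeddings chosen by hand: the cosine score between an anchor and each candidate is computed, a softmax turns the scores into probabilities, and the loss is the cross-entropy with the positive candidate as the "correct class". The helper names and vectors are illustrative only.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(scores):
    # Subtract the maximum for numerical stability.
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


anchor = [1.0, 0.0]
candidates = [
    [0.9, 0.1],   # same class as the anchor (positive)
    [0.0, 1.0],   # different class (negative)
    [-1.0, 0.0],  # different class (negative)
]

similarities = [cosine_similarity(anchor, c) for c in candidates]
probabilities = softmax(similarities)

# Cross-entropy with the positive candidate (index 0) as the target.
loss = -math.log(probabilities[0])
```

The loss shrinks as the positive candidate's score grows relative to the negatives, which is what pulls same-class embeddings together.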

## Loss

Here's an example called [Normalized Temperature-scaled Cross Entropy Loss](https://paperswithcode.com/method/nt-xent) (NT-Xent).

Let's define the temperature-scaled similarity as

$$
d(z_i, z_j) = \frac{\text{sim}(z_i, z_j)}{\tau}
$$

where $\text{sim}(u,v)$ is the cosine similarity between two vectors, and $\tau$ is called a temperature parameter.

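The loss for a positive pair $(i, j)$ is then

$$
L(i, j) = -\log \frac{\exp(d(z_i, z_j))}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp(d(z_i, z_k))}
$$

where $\mathbb{1}_{[k \neq i]}$ is 1 when $k \neq i$ and 0 otherwise, so the anchor $z_i$ is never compared with itself.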

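Why $2N$? As explained in [A Simple Framework for Contrastive Learning of Visual Representations](https://arxiv.org/pdf/2002.05709v3.pdf), each of the $N$ samples in a batch is augmented twice, so the batch contains $2N$ views. For a given positive pair, the other $2(N - 1)$ views act as negative examples.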
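The formula can be sketched in plain Python as follows. This is an illustration rather than a reference implementation: `nt_xent` is a hypothetical helper, the embeddings are made up, and the default `tau=0.5` is an arbitrary choice. The sum in the denominator skips the anchor itself ($k \neq i$), which is how the SimCLR paper defines it.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nt_xent(embeddings, i, j, tau=0.5):
    """Loss L(i, j) for the positive pair (i, j) among 2N embeddings."""
    # Temperature-scaled similarity d(z_i, z_k) for every view k.
    d = [cosine_similarity(embeddings[i], z_k) / tau for z_k in embeddings]
    # Sum over every view except the anchor itself (k != i).
    denominator = sum(math.exp(d[k]) for k in range(len(embeddings)) if k != i)
    return -math.log(math.exp(d[j]) / denominator)


# N = 2 samples, each augmented twice, giving 2N = 4 embeddings.
# Views 0 and 1 come from sample A; views 2 and 3 come from sample B.
embeddings = [
    [1.0, 0.1],
    [0.9, 0.2],
    [-0.1, 1.0],
    [0.1, 0.9],
]

# Every view is used once as the anchor, paired with its sibling view.
positive_pairs = [(0, 1), (1, 0), (2, 3), (3, 2)]
batch_loss = sum(nt_xent(embeddings, i, j) for i, j in positive_pairs) / len(positive_pairs)

# Treating views of different samples as a "positive" pair gives a larger loss.
matched_loss = nt_xent(embeddings, 0, 1)
mismatched_loss = nt_xent(embeddings, 0, 2)
```

Averaging over all $2N$ ordered positive pairs gives the batch loss; lowering `tau` sharpens the softmax and penalizes hard negatives more strongly.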

## Triplet Loss

Notable examples:
- [FaceNet: A Unified Embedding for Face Recognition and Clustering](https://arxiv.org/abs/1503.03832)
- [Targeting Ultimate Accuracy: Face Recognition via Deep Embedding](https://arxiv.org/abs/1506.07310)

## Center Loss

Notable examples:
- [A Discriminative Feature Learning Approach for Deep Face Recognition](http://ydwen.github.io/papers/WenECCV16.pdf)

## Large Margin Softmax Loss

Notable examples:
- [Large-Margin Softmax Loss for Convolutional Neural Networks](https://arxiv.org/abs/1612.02295)

## Angular Softmax Loss

Notable examples:
- [SphereFace: Deep Hypersphere Embedding for Face Recognition](https://arxiv.org/abs/1704.08063)

## Feature Normalization Loss

Notable examples:
- [NormFace: L2 Hypersphere Embedding for Face Verification](https://arxiv.org/abs/1704.06369)
- [Learning Deep Features via Congenerous Cosine Loss for Person Recognition](https://arxiv.org/abs/1702.06890)

## Cosine Loss

Notable examples:
- [CosFace: Large Margin Cosine Loss for Deep Face Recognition](https://arxiv.org/abs/1801.09414)