# 1 - Face Recognition

In face recognition, we want to identify a person from a database of $K$ persons, i.e. we want a single input image to map to the ID of one of the $K$ persons in the database (or no output if the person was not recognized). This is different from face verification where we compare the input image only to a single person and verify whether the input image is that of the claimed person.

## 1.1 - One-shot Learning

The CNNs we have seen so far need many pictures per class to be trained. In face recognition, however, we usually have only a few pictures of each person, often just one, so the system must be able to identify a person from a single example. This setting is called **one-shot learning.** Conventional CNN classifiers are not suitable for this kind of task, not only because they require a large amount of training data, but also because the whole network would have to be retrained whenever a new person is added to the database.

When performing face recognition, we apply a similarity function 

$$d \left( x^{(i)}, x^{(j)} \right)$$

that calculates the (dis)similarity between two images $x^{(i)}$ and $x^{(j)}$ as a degree of difference. This value is small for two images of the same person and large for images of different persons. By comparing it with a threshold $\tau$ we can decide whether both images show the same person:

$$d \left( x^{(i)}, x^{(j)} \right) \begin{cases}
\leq \tau & \text{(same person)} \\
\gt \tau & \text{(different persons)}
\end{cases}
$$

## 1.2 - Siamese Networks

One way to implement this similarity function is a siamese network. Such a network encodes an input image as a vector of fixed dimension (e.g. 128 components). The network can be understood as a function $f(x)$ that maps an image $x$ to its encoding, where similar pictures lead to similar encodings. The same network, with the same parameters, is applied to both images that are being compared.

<br>

<div style="text-align">
    <img src="media/siamese-network-function.png" width=600>
</div>

The similarity function can then be implemented as the squared $L_2$ norm of the difference between the two image encodings:

$$d \left( x^{(i)}, x^{(j)} \right) = || f\left( x^{(i)} \right) - f\left( x^{(j)} \right) ||_2^{2}$$
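As a minimal sketch of how this distance could be used, assume the encodings $f(x)$ have already been computed and are available as NumPy arrays. The function names, the `database` dictionary, the toy 3-dimensional encodings and the threshold value are illustrative assumptions, not part of any particular library.

```python
import numpy as np

def squared_distance(enc_i, enc_j):
    """Squared L2 distance between two encodings, i.e. d(x_i, x_j)."""
    diff = np.asarray(enc_i, dtype=float) - np.asarray(enc_j, dtype=float)
    return float(np.dot(diff, diff))

def verify(enc, claimed_enc, tau):
    """Face verification: does the input match the one claimed person?"""
    return squared_distance(enc, claimed_enc) <= tau

def recognize(enc, database, tau):
    """Face recognition: return the ID of the closest person in the
    database, or None if even the closest encoding is further than tau."""
    best_id, best_dist = None, float("inf")
    for person_id, stored_enc in database.items():
        dist = squared_distance(enc, stored_enc)
        if dist < best_dist:
            best_id, best_dist = person_id, dist
    return best_id if best_dist <= tau else None

# Toy encodings; real ones would be produced by the trained network.
database = {
    "alice": np.array([0.1, 0.9, 0.0]),
    "bob": np.array([0.8, 0.1, 0.1]),
}
query = np.array([0.12, 0.88, 0.0])
print(recognize(query, database, tau=0.05))
```

In this sketch, adding a new person only requires storing one more encoding in `database`; the network itself does not have to be retrained, which is what makes the approach suitable for one-shot learning.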

## 1.3 - Triplet loss

A siamese network should calculate similar image vectors for similar images and different vectors for different images. In other words: the distance between image vectors should be small for similar images and large for dissimilar images. We need to train the siamese network to exhibit this property. To do this, we can use the triplet loss function (TLF). When using the TLF, we define the image of one specific person as **anchor image $(A)$** and compare it with another image of the same person **(positive image $P$)** and an image of a different person **(negative image $N$).** Because of the initially formulated condition, the following inequality needs to hold:

$$d(A, P) = || f(A) - f(P) ||_2^{2} \leq || f(A) - f(N) ||_2^{2} = d(A, N)$$

We can rearrange this inequality and get:

$$||f(A) - f(P)||_2^{2} - ||f(A) - f(N)||_2^{2} \leq 0$$

However, there is a catch with this inequality: we could satisfy it by simply "calculating" the zero vector for each image! In other words, if the network learns the trivial zero vector, or any other constant vector, for all images, both distances are zero and the difference between them is always less than or equal to zero.

To prevent this, we add a positive parameter $\alpha$ and get:

$$||f(A) - f(P)||_2^{2} - ||f(A) - f(N)||_2^{2}  + \alpha \leq 0$$

By rearranging it back to the original form we get:

$$||f(A) - f(P)||_2^{2} + \alpha \leq ||f(A) - f(N)||_2^{2}$$

The parameter $\alpha$ is also called **margin.** The effect of this margin is that the distance for pictures of the same person has to differ clearly from the distance for pictures of different persons, i.e. $d(A,P)$ is separated from $d(A,N)$ by at least $\alpha$.
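To see why the margin rules out the trivial solution, consider a network that maps every image to the same vector, together with an assumed margin of $\alpha = 0.2$ (the value is chosen only for illustration). Both distances are then zero and the constraint is violated:

$$\underbrace{||f(A) - f(P)||_2^{2}}_{=\,0} + \underbrace{\alpha}_{=\,0.2} = 0.2 \not\leq 0 = \underbrace{||f(A) - f(N)||_2^{2}}_{=\,0}$$

By contrast, a triplet with $d(A,P) = 0.3$ and $d(A,N) = 0.9$ satisfies the constraint, since $0.3 + 0.2 = 0.5 \leq 0.9$.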


<br>

<div style="text-align">
    <img src="media/tlf-distance-matrix.png" width=600>
</div>

Considering all the points mentioned above we can define the TLF as follows:

$$\mathcal{L}(A, P, N) = \max{\left( 0, ||f(A) - f(P)||_2^{2} - ||f(A) - f(N)||_2^{2} + \alpha \right)}$$

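Taking the maximum with 0 ensures that the loss is never negative: as soon as a triplet satisfies the margin constraint, it contributes a loss of zero and the network gains nothing from pushing the distances even further apart. The total cost can be calculated as usual by summing the losses over all $m$ triplets: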

$$J = \sum_{i=1}^m \mathcal{L}\left( A^{(i)}, P^{(i)}, N^{(i)} \right)$$
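The loss and the cost can be sketched in NumPy as follows. This is a minimal illustration that assumes the encodings have already been computed and are stacked as arrays of shape `(m, n)`; the function names and the default margin of `0.2` are assumptions made for this example.

```python
import numpy as np

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    """Per-triplet loss max(0, ||f(A)-f(P)||^2 - ||f(A)-f(N)||^2 + alpha).

    f_a, f_p, f_n: arrays of shape (m, n) with the encodings of the
    anchor, positive and negative images of m triplets.
    """
    d_ap = np.sum((f_a - f_p) ** 2, axis=1)  # d(A, P) for each triplet
    d_an = np.sum((f_a - f_n) ** 2, axis=1)  # d(A, N) for each triplet
    return np.maximum(0.0, d_ap - d_an + alpha)

def triplet_cost(f_a, f_p, f_n, alpha=0.2):
    """Total cost J: sum of the losses over all triplets."""
    return float(np.sum(triplet_loss(f_a, f_p, f_n, alpha)))

# Two toy triplets with 2-dimensional encodings.
f_a = np.array([[0.0, 0.0], [0.0, 0.0]])
f_p = np.array([[0.1, 0.0], [0.5, 0.0]])
f_n = np.array([[1.0, 0.0], [0.5, 0.1]])

print(triplet_loss(f_a, f_p, f_n))
print(triplet_cost(f_a, f_p, f_n))
```

The first triplet already satisfies the margin constraint and contributes no loss, while the second one is penalized because its negative is almost as close to the anchor as its positive.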

## 1.4 - TLF and Binary Classification

The definition of the TLF implies that, in order to train a siamese network that exhibits the required properties, we need at least two different images of the same person in the training set. To ensure strong discrimination, we should also consider triplets $(A,P,N)$ where $N$ is the image of a person who looks similar to the person in $A$. That way we force the network to also learn to differentiate "hard" cases.

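In other words, if $A$, $P$ and $N$ are chosen randomly during training, then $d(A,P) + \alpha \leq d(A,N)$ is easily satisfied for most triplets and the network learns little from them. We should therefore choose triplets that are "hard" to train on, i.e. triplets where $d(A,P)$ is close to $d(A,N)$.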
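One simple heuristic for finding hard triplets is sketched below: for a given anchor, pick the candidate negative whose encoding is closest to the anchor, and check whether the resulting triplet still violates the margin constraint. The function names, the toy encodings and the margin are assumptions for illustration; other selection strategies are possible.

```python
import numpy as np

def hardest_negative(f_a, negatives):
    """Index of the negative encoding that is closest to the anchor."""
    dists = np.sum((negatives - f_a) ** 2, axis=1)
    return int(np.argmin(dists))

def is_hard(f_a, f_p, f_n, alpha=0.2):
    """True if the triplet still violates d(A,P) + alpha <= d(A,N)."""
    d_ap = np.sum((f_a - f_p) ** 2)
    d_an = np.sum((f_a - f_n) ** 2)
    return bool(d_ap + alpha > d_an)

# Toy 2-dimensional encodings for one anchor/positive pair
# and three candidate negatives.
f_a = np.array([0.0, 0.0])
f_p = np.array([0.3, 0.0])
negatives = np.array([[2.0, 0.0], [0.4, 0.0], [0.0, 1.5]])

idx = hardest_negative(f_a, negatives)
print(idx, is_hard(f_a, f_p, negatives[idx]))
```

A triplet built from a distant negative such as `negatives[0]` already satisfies the constraint and would yield a loss of zero, so it does not help training.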

An alternative approach for face recognition is to treat it as a binary classification problem. We train a CNN which calculates a value close to 1 for pictures of the same person and a value close to 0 for pictures of different persons. At prediction time, the image vectors of known persons can be precomputed and stored in a database, so that only the image vector of the new picture has to be calculated and compared. The calculation of the output value could be as follows:

- Get the feature vectors/embeddings for each person from the CNN
- To compute a binary label of either 1 (same person) or 0 (different persons), we can input the two feature vectors into a binary classifier as:

$$\hat{y} = \sigma \left( \sum_{k=1}^K w_k \cdot \left| f\left(x^{(i)}\right)_k - f\left(x^{(j)}\right)_k \right| + b \right)$$

Here $f(x)_k$ denotes the $k$-th component of the encoding and $K$ is the number of components (e.g. 128), not the number of persons in the database. We first compute the element-wise absolute difference of the two feature vectors, then compute $W \cdot X + b$ where $X$ is that difference, and finally apply the sigmoid function to obtain a value between 0 and 1 that can be thresholded into a binary label.

Alternatively, we could use the $\chi^2$-similarity instead of the element-wise absolute difference:

$$\hat{y} = \sigma \left( \sum_{k=1}^K w_k \cdot \frac{ \left( f\left(x^{(i)}\right)_k - f\left(x^{(j)}\right)_k \right)^2}{f\left(x^{(i)}\right)_k + f\left(x^{(j)}\right)_k } \right)$$
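Both variants of the classifier can be sketched in NumPy. This is only an illustration under several assumptions: the weights `w` and bias `b` are hand-picked here, whereas in practice they would be learned; the small constant `eps` is added to avoid division by zero and is not part of the formula; and the $\chi^2$ variant assumes non-negative encoding components.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def abs_diff_features(enc_i, enc_j):
    """Element-wise absolute difference of two encodings."""
    return np.abs(enc_i - enc_j)

def chi_square_features(enc_i, enc_j, eps=1e-8):
    """Element-wise chi-square term; eps is an addition of this sketch
    that guards against a zero denominator."""
    return (enc_i - enc_j) ** 2 / (enc_i + enc_j + eps)

def predict_same_person(features, w, b=0.0):
    """Logistic unit on top of the pairwise features, returns y_hat."""
    return float(sigmoid(np.dot(w, features) + b))

# Toy encodings: two pictures of one person and one of another person.
enc_1 = np.array([0.1, 0.9, 0.0])
enc_2 = np.array([0.12, 0.88, 0.0])
enc_other = np.array([0.8, 0.1, 0.1])

# Hand-picked parameters: large differences push y_hat towards 0.
w = -10.0 * np.ones(3)
b = 3.0

print(predict_same_person(abs_diff_features(enc_1, enc_2), w, b))
print(predict_same_person(abs_diff_features(enc_1, enc_other), w, b))
print(predict_same_person(chi_square_features(enc_1, enc_other), w, b))
```

With these parameters the first call yields a value above 0.5 (same person) and the other two yield values close to 0 (different persons).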