# Centroids-based Person Re-identification

# Introduction

Person re-identification is essentially a person retrieval problem across non-overlapping cameras: given a query image, find the same person in footage from different cameras captured at different times. It is commonly applied in video surveillance systems. Early research tended to focus on hand-crafted features built from body structure and pose, combined with distance metric learning (e.g. finding the two features with minimal Euclidean distance in feature space). More recently, features are learned as embeddings by deep neural networks.

## Approach to Features

- **Global Feature Representation Learning** extracts a global feature vector for each person image.
- **Local Feature Representation Learning** learns part- or region-aggregated features, e.g. one embedding for the face, one embedding for the upper torso, and so on.
- **Auxiliary Feature Representation Learning** maps a person image to annotated attributes such as gender, age, and clothing color.
- **Video Feature Representation Learning** factors in temporal information, such as how a person moves across frames.

## Approach to Training Objectives

- Identity Loss
- Verification Loss
- Triplet Loss

### Identity Loss

This treats re-identification as a classification problem trained with categorical cross-entropy loss. Each person is considered a distinct class. Given an input image $x_i$ with label $y_i$, the model predicts the probability $p(y_i \mid x_i)$ that $x_i$ is recognized as class $y_i$, computed with a softmax over all identities. The loss is the mean negative log-probability over a batch of $N$ images.

$$
L = -\frac{1}{N} \sum^{N}_{i=1} \log(p(y_i \mid x_i))
$$

However, this approach is hard to scale as the number of identities in the dataset grows. There may be tens of millions of identities, and the final fully connected layer, which holds one weight vector per identity, can become too big to fit on one GPU.
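
The identity loss above can be sketched in a few lines of plain Python. This is a minimal illustration, not a training implementation: the function names `softmax` and `identity_loss` and the example logits are assumptions made for this sketch, and a real model would produce the logits from its final fully connected layer.

```python
import math

def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def identity_loss(batch_logits, labels):
    """Mean negative log-probability of the true identity over a batch."""
    losses = []
    for logits, y in zip(batch_logits, labels):
        p_true = softmax(logits)[y]  # p(y_i | x_i)
        losses.append(-math.log(p_true))
    return sum(losses) / len(losses)

# Hypothetical batch: two images scored against three identities.
batch_logits = [[4.0, 1.0, 0.5], [0.2, 0.1, 3.0]]
labels = [0, 2]
loss = identity_loss(batch_logits, labels)
```

Note that each row of logits has one entry per identity, which is why the size of the last layer grows with the number of identities.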

### Verification Loss

This is sometimes called the _pair-wise loss_. It optimizes the pairwise relationship between samples, using either a contrastive loss or a binary verification loss.

The contrastive loss pulls matching pairs together and pushes non-matching pairs, i.e. two different people, at least a margin apart.

$$
L(i, j) = (1 - \delta_{ij}) \cdot \max(0, \rho - d_{ij})^2 + \delta_{ij} d^2_{ij}
$$

$d_{ij}$ represents the Euclidean distance between the embedding features of two inputs $x_i$ and $x_j$. $\delta_{ij}$ is 1 when two inputs belong to the same identity, and 0 when two inputs belong to different identities. $\rho$ is a margin parameter.

In other words:

> When two inputs belong to the same identity, the loss is their squared distance. The lower the distance, the lower the loss.

> When two inputs belong to different identities, the loss measures how far the distance falls short of the margin. The loss shrinks as the distance between the two features approaches $\rho$ and is zero once the distance reaches or exceeds $\rho$.
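
The two cases can be checked with a small sketch of the contrastive loss. The helper names `euclidean` and `contrastive_loss` and the default margin of 1.0 are assumptions chosen for illustration only.

```python
import math

def euclidean(a, b):
    """Euclidean distance d_ij between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def contrastive_loss(f_i, f_j, same_identity, margin=1.0):
    """Contrastive loss for one pair of embeddings.

    same_identity plays the role of delta_ij, margin the role of rho.
    """
    d = euclidean(f_i, f_j)
    if same_identity:
        # Matching pair: penalize any distance between the embeddings.
        return d ** 2
    # Non-matching pair: penalize only when closer than the margin.
    return max(0.0, margin - d) ** 2

same_pair = contrastive_loss([0.0, 0.0], [3.0, 4.0], same_identity=True)
far_negative = contrastive_loss([0.0, 0.0], [3.0, 4.0], same_identity=False)
near_negative = contrastive_loss([0.0, 0.0], [0.0, 0.5], same_identity=False)
```

With these inputs, the matching pair is penalized for being 5 units apart, the distant non-matching pair contributes no loss, and the non-matching pair that sits inside the margin is penalized.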

The binary verification loss, on the other hand, trains a binary classifier that predicts whether two inputs belong to the same identity, with 0 meaning no and 1 meaning yes. This becomes a binary cross-entropy loss.

$$
L(i, j) = \left[ -\delta_{ij} \log(p(\delta_{ij} \mid f_{ij})) \right] - \left[ (1 - \delta_{ij}) \log(1 - p(\delta_{ij} \mid f_{ij})) \right]
$$

where $f_{ij}$ is called the differential feature, computed from the embeddings $f_i$ and $f_j$ of the two inputs.

$$
f_{ij} = (f_{i} - f_{j})^2
$$

Verification loss is often combined with identity loss to improve performance.
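
A rough sketch of the binary verification loss follows. It assumes the square in the differential feature is taken element-wise, and it uses a tiny logistic classifier with made-up `weights` and `bias` to stand in for whatever network produces $p(\delta_{ij} \mid f_{ij})$; all function names are hypothetical.

```python
import math

def differential_feature(f_i, f_j):
    """Element-wise squared difference of two embeddings (assumed form of f_ij)."""
    return [(a - b) ** 2 for a, b in zip(f_i, f_j)]

def same_identity_probability(f_ij, weights, bias):
    """Toy logistic classifier: probability that the pair shares an identity."""
    score = sum(w * x for w, x in zip(weights, f_ij)) + bias
    return 1.0 / (1.0 + math.exp(-score))

def verification_loss(p_same, delta):
    """Binary cross-entropy for one pair; delta is 1 for same identity, else 0."""
    eps = 1e-12  # clamp to avoid log(0)
    p = min(max(p_same, eps), 1.0 - eps)
    return -(delta * math.log(p) + (1 - delta) * math.log(1.0 - p))

# Hypothetical classifier parameters: larger differences lower the score.
weights, bias = [-1.0, -1.0], 2.0

f_ij = differential_feature([1.0, 2.0], [3.0, 2.0])
p = same_identity_probability(f_ij, weights, bias)
loss_if_different = verification_loss(p, delta=0)
loss_if_same = verification_loss(p, delta=1)
```

Because the two embeddings here are far apart, the toy classifier assigns a low same-identity probability, so the loss is small when the pair is labeled as different and large when it is labeled as the same.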

### Triplet Loss

This treats re-identification as a retrieval ranking problem. The basic idea is that the distance between the positive pair should be smaller than the distance between the negative pair by at least a pre-defined margin. A triplet contains one anchor sample $x_i$, one positive sample $x_j$ with the same identity, and one negative sample $x_k$ with a different identity.

$$
L(i, j, k) = \max(\rho + d_{ij} - d_{ik}, 0)
$$

> The loss is zero when the distance between i and k exceeds the distance between i and j by at least the margin.

However, if the model directly optimizes this loss function, a large portion of easy triplets will dominate the training process, such as triplets whose negative image is vastly different from the anchor. The resulting model has limited discriminability, i.e. it can easily confuse two similar persons. The solution is to hand-pick informative triplets, which can be quite tedious.
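
One common way to automate the selection of informative triplets is batch-hard mining: for every anchor in a batch, take the farthest sample with the same identity and the closest sample with a different identity. The sketch below illustrates that strategy under the assumption that embeddings and labels are plain Python lists; it is a swapped-in example of triplet selection, not necessarily the procedure these notes have in mind, and the function name `batch_hard_triplets` is hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def batch_hard_triplets(embeddings, labels):
    """Return (anchor, positive, negative) index triplets, one per usable anchor."""
    triplets = []
    for a, (f_a, y_a) in enumerate(zip(embeddings, labels)):
        positives = [i for i, y in enumerate(labels) if y == y_a and i != a]
        negatives = [i for i, y in enumerate(labels) if y != y_a]
        if not positives or not negatives:
            continue  # anchor has no valid triplet in this batch
        # Hardest positive: same identity, largest distance.
        hardest_pos = max(positives, key=lambda i: euclidean(f_a, embeddings[i]))
        # Hardest negative: different identity, smallest distance.
        hardest_neg = min(negatives, key=lambda i: euclidean(f_a, embeddings[i]))
        triplets.append((a, hardest_pos, hardest_neg))
    return triplets

# Hypothetical batch: three images of identity 0, two of identity 1.
embeddings = [[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [1.5, 0.0], [10.0, 0.0]]
labels = [0, 0, 0, 1, 1]
triplets = batch_hard_triplets(embeddings, labels)
```

Each anchor requires a scan over the whole batch, which hints at why mining hard examples adds computational cost.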

# Centroids Triplet Loss

In [On the Unreasonable Effectiveness of Centroids in Image Retrieval](https://arxiv.org/abs/2104.13643), the authors propose that triplet loss can be enhanced by comparing the query to centroids of positive and negative samples instead of individual embeddings.

There are a few issues with the vanilla triplet loss.

1. Hard negative sampling may lead to bad local minima, basically overfitting to a few hand-selected examples (see above).
2. Hard negative sampling is computationally expensive.
3. Triplet loss is prone to outliers and noisy labels.

> To alleviate problems stemming from the point to point nature of Triplet Loss, changes to point-to-set/point-to-centroid formulations were proposed, where the distances are measured between a sample and a prototype/centroid representing a class.

$$
L_{triplet} = \max(\lVert f_A - c_P \rVert_2^2 - \lVert f_A - c_N \rVert_2^2 + \rho, 0)
$$

$f_A$ is the embedding of the anchor, $c_P$ is the centroid of the positive class, $c_N$ is the centroid of the negative class, and $\rho$ is the margin, same as above.
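
The centroid formulation can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: each centroid is taken to be the plain mean of the class embeddings passed in, distances are squared Euclidean, the result is clamped at zero in the same way as the vanilla triplet loss, and the function names and default margin are hypothetical.

```python
def centroid(embeddings):
    """Mean embedding of a class (assumed definition of the centroid)."""
    n = len(embeddings)
    return [sum(column) / n for column in zip(*embeddings)]

def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid_triplet_loss(f_a, positives, negatives, margin=0.3):
    """Triplet loss between an anchor and the positive/negative class centroids."""
    c_p = centroid(positives)  # centroid of the anchor's own class
    c_n = centroid(negatives)  # centroid of a different class
    value = squared_distance(f_a, c_p) - squared_distance(f_a, c_n) + margin
    return max(value, 0.0)

# Anchor sits on its own class centroid and far from the negative centroid.
satisfied = centroid_triplet_loss(
    [0.0, 0.0], [[0.0, 1.0], [0.0, -1.0]], [[4.0, 0.0], [6.0, 0.0]], margin=0.5
)
# Anchor is closer to the negative centroid than to its own.
violated = centroid_triplet_loss(
    [0.0, 0.0], [[2.0, 0.0], [4.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]], margin=0.5
)
```

Averaging over a class before measuring distance means a single outlier or mislabeled image shifts the centroid only slightly, which is the intuition behind the robustness argument above.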

![CTL Model](./assets/ctl-model.png)

For each class in the batch, center loss computes the center of that class and then minimizes the distance of each sample of the class from this center. It increases inter-class separability and intra-class compactness. [A Discriminative Feature Learning Approach for Deep Face Recognition](https://kpzhang93.github.io/papers/eccv2016.pdf) is a famous example.
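
A simplified sketch of center loss is shown below. It assumes the centers are recomputed as the per-class mean of the current batch and that the loss is half the mean squared distance to the center; these are simplifications for illustration, and the function names are hypothetical.

```python
from collections import defaultdict

def class_centers(embeddings, labels):
    """Mean embedding per class, computed from the current batch."""
    groups = defaultdict(list)
    for f, y in zip(embeddings, labels):
        groups[y].append(f)
    return {
        y: [sum(column) / len(members) for column in zip(*members)]
        for y, members in groups.items()
    }

def center_loss(embeddings, labels):
    """Half the mean squared distance of each sample to its class center."""
    centers = class_centers(embeddings, labels)
    total = 0.0
    for f, y in zip(embeddings, labels):
        total += sum((a - b) ** 2 for a, b in zip(f, centers[y]))
    return 0.5 * total / len(embeddings)

# Hypothetical batch: two samples of identity 0, one of identity 1.
embeddings = [[0.0, 0.0], [2.0, 0.0], [5.0, 5.0]]
labels = [0, 0, 1]
loss = center_loss(embeddings, labels)
```

The loss only shrinks when samples of the same class move toward each other, so it directly rewards compact classes.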