## **Contrastive Learning**
#### **1. Motivation**

In supervised learning, we learn from **labeled pairs** of input and output, e.g.
image → class label.

**Contrastive learning**, in contrast, is most often used as a **self-supervised** technique: it doesn’t need explicit labels.
Instead, it learns by **comparing pairs of samples** and figuring out which ones should be **similar** (positive pairs) and which ones should be **different** (negative pairs).

The goal:
Learn a representation (embedding) where similar inputs are close together, and dissimilar inputs are far apart in the feature space.

---

#### **2. Contrastive Learning Intuition**

We take an image, apply **two random augmentations**, and feed both into an encoder (e.g., ResNet).
These two augmented views form a **positive pair** because they come from the same image.
Views of *other* images form **negative pairs** with them.

So, during training:

* Pull positive pairs **together** in embedding space.
* Push negative pairs **apart**.

This produces an embedding function
$$
f(x): \mathbb{R}^{H\times W \times 3} \rightarrow \mathbb{R}^d
$$
that maps similar samples close together.

---

# **3. Contrastive Loss**

The core of contrastive learning is the **contrastive loss function**.
There are two popular formulations:

---

## **(a) Classic Contrastive Loss (Siamese Networks)**

Used in **Siamese networks** (e.g., face verification).

For two samples $x_i$ and $x_j$ with label
$y_{ij}=1$ if they are similar (positive pair), else $y_{ij}=0$:

$$
L_{ij} = y_{ij} \, D_{ij}^2 + (1 - y_{ij}) \, \max(0, m - D_{ij})^2
$$

where:

* $D_{ij} = \| f(x_i) - f(x_j) \|_2$ is the distance in feature space
* $m$ is a margin that pushes dissimilar samples to be at least distance $m$ apart

✅ If the pair is positive → minimize their distance
✅ If the pair is negative → make sure they’re far apart (beyond margin)
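
As a minimal sketch of the formula above, the loss for a single pair could be computed as follows. This is plain Python; `euclidean_distance` and `contrastive_loss` are illustrative names rather than library functions, and the default margin of 1.0 is an assumption.

```python
import math

def euclidean_distance(a, b):
    """L2 distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def contrastive_loss(za, zb, y, margin=1.0):
    """Margin-based pairwise loss: y=1 for a positive pair, y=0 for a negative pair."""
    d = euclidean_distance(za, zb)
    return y * d ** 2 + (1 - y) * max(0.0, margin - d) ** 2

# Positive pair at distance 0.5: loss is the squared distance (about 0.25).
print(contrastive_loss([0.0, 0.0], [0.3, 0.4], y=1))
# Negative pair at distance 0.5, inside the margin: penalized by (1.0 - 0.5)^2 (about 0.25).
print(contrastive_loss([0.0, 0.0], [0.3, 0.4], y=0))
# Negative pair at distance 5.0, beyond the margin: no penalty.
print(contrastive_loss([0.0, 0.0], [3.0, 4.0], y=0))
```

Note that the negative term vanishes once the pair is farther apart than the margin, so pairs that are already well separated stop contributing to the loss.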

---

## **(b) InfoNCE Loss (used in SimCLR, MoCo, CLIP, etc.)**

Modern contrastive learning methods use a **softmax-based loss**:

Given an anchor $i$, a positive sample $j$, and $N-1$ negative samples (so $k$ below ranges over all $N$ candidates, the positive included):

$$
L_i = -\log \frac{\exp(\text{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{N} \exp(\text{sim}(z_i, z_k)/\tau)}
$$

where:

* $z_i = f(x_i)$ and $z_j = f(x_j)$ are normalized embeddings
* $\text{sim}(a,b) = a^\top b$ (equal to cosine similarity because the embeddings have unit norm)
* $\tau$ is a temperature hyperparameter controlling sharpness

This is similar to a cross-entropy loss where we classify the correct positive among all negatives.
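
To make the cross-entropy reading concrete, here is a small sketch for a single anchor. It is plain Python; `info_nce` is a hypothetical helper name and the similarity values are made up for illustration. The positive is treated as class 0 among the candidates, and the loop shows how the temperature changes the loss.

```python
import math

def info_nce(sim_pos, sim_negs, tau=0.1):
    """InfoNCE for one anchor, given its similarity to the positive and to each negative."""
    logits = [sim_pos / tau] + [s / tau for s in sim_negs]
    m = max(logits)  # subtract the max before exponentiating, for numerical stability
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]  # equals -log softmax(logits)[0]

sims_neg = [0.2, 0.1, -0.3]  # made-up similarities to three negatives
for tau in (1.0, 0.5, 0.1):
    print(tau, round(info_nce(0.8, sims_neg, tau), 4))
```

When the positive is already the most similar candidate, a smaller $\tau$ sharpens the softmax and the loss shrinks. When all candidates are equally similar, the loss equals the log of the number of candidates.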

---

# **4. How the Learning Works**

1. **Data Augmentation**
   Create two random augmentations $x_i$ and $x_j$ from the same image.

2. **Encoder Network**
   Pass both through an encoder $f(\cdot)$, e.g. ResNet, to get embeddings $z_i$, $z_j$.

3. **Normalization**
   Normalize each embedding to unit length.

4. **Loss Computation**
   Apply the contrastive loss to pull together embeddings of the same image and push apart those of different images.

---

# **5. After Training**

The encoder $f(\cdot)$ learns a **semantic representation**:

* Similar images → nearby in embedding space
* Dissimilar images → far apart

Then you can:

* Freeze $f(\cdot)$ and train a small linear classifier (Linear Evaluation Protocol)
* Fine-tune on downstream tasks (classification, detection, etc.)

---

# **6. Example (Toy Numerical Example)**

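Suppose we have 3 embeddings. A₁ and B have unit norm; A₂ is only approximately normalized (its norm is about 0.906), which keeps the arithmetic simple:

|        Sample       | Embedding (z) |
| :-----------------: | :------------ |
|          A₁         | [1, 0]        |
|   A₂ (same image)   | [0.9, 0.1]    |
| B (different image) | [0, 1]        |

Compute the similarities as dot products: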

$$
\text{sim}(A₁, A₂) = 0.9 \quad \text{(positive pair)}
$$
$$
\text{sim}(A₁, B) = 0.0 \quad \text{(negative pair)}
$$

Then for anchor A₁:

$$
L_{A₁} = -\log \frac{\exp(0.9/\tau)}{\exp(0.9/\tau) + \exp(0.0/\tau)}
$$

If $\tau = 0.1$,

$$
L_{A₁} \approx -\log \frac{e^{9}}{e^{9}+e^{0}} = -\log \frac{8103}{8104} \approx 0.00012
$$

Very small loss ⇒ the model did a good job (positive much higher than negative).
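
The arithmetic above can be checked with a few lines of plain Python. This sketch uses the raw dot product as the similarity, which is what the 0.9 figure corresponds to; `dot` is just a local helper.

```python
import math

a1, a2, b = [1.0, 0.0], [0.9, 0.1], [0.0, 1.0]
tau = 0.1

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))

sim_pos = dot(a1, a2)  # 0.9
sim_neg = dot(a1, b)   # 0.0

numerator = math.exp(sim_pos / tau)
denominator = math.exp(sim_pos / tau) + math.exp(sim_neg / tau)
loss = -math.log(numerator / denominator)
print(f"{loss:.5f}")   # 0.00012
```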

---

# **7. Summary**

| Concept        | Description                                                             |
| :------------- | :---------------------------------------------------------------------- |
| Goal           | Learn embeddings where similar samples are close, dissimilar are far    |
| Label type     | Self-supervised (no manual labels)                                      |
| Positive pairs | Different augmentations of same image                                   |
| Negative pairs | Other images in batch                                                   |
| Typical loss   | InfoNCE                                                                 |
| Examples       | SimCLR, MoCo, CLIP, BYOL (no negatives), DINO (teacher-student variant) |

---



Below are well-known **contrastive learning frameworks** and architectures that use a **contrastive loss** (or a related self-supervised objective).

The list is structured from classical to modern, showing the evolution of **contrastive learning** in deep learning.

---

# **1. Classical Contrastive Networks (Metric Learning)**

| Method              | Year | Key Idea                                                                                          | Typical Architecture        |
| :------------------ | :--: | :------------------------------------------------------------------------------------------------ | :-------------------------- |
| **Siamese Network** | 1993 | Two identical networks (shared weights) compare two inputs; use **contrastive loss** with margin. | Twin CNNs or MLPs           |
| **Triplet Network** | 2015 | Uses triplets: (anchor, positive, negative) with **triplet loss** to learn embedding distances.   | CNN backbone (e.g. FaceNet) |
| **FaceNet**         | 2015 | Learns facial embeddings via triplet loss for face recognition.                                   | Inception-style CNN         |

---

# **2. Modern Contrastive Learning (Self-Supervised Representation Learning)**

| Method                                            |         Year         | Framework                                   | Key Idea                                                                                        |
| :------------------------------------------------ | :------------------: | :------------------------------------------ | :---------------------------------------------------------------------------------------------- |
| **SimCLR**                                        |     2020 (Google)    | Contrastive, simple & direct                | Two augmented views → encoder (ResNet) → projection MLP → InfoNCE loss. Needs large batch size. |
| **MoCo (Momentum Contrast)**                      | 2020 (Facebook/Meta) | Contrastive + memory queue                  | Keeps a momentum encoder and a dictionary of past embeddings (avoids large batch).              |
| **BYOL (Bootstrap Your Own Latent)**              |    2020 (DeepMind)   | Self-distillation, no negatives             | Uses an online & target network; learns from similarity of embeddings, no negative pairs.       |
| **SwAV (Swapping Assignments)**                   |    2020 (Facebook)   | Cluster-based contrastive                   | Learns by predicting cluster assignments of augmentations (no explicit negatives).              |
| **SimSiam**                                       |    2021 (Facebook)   | Simple, no negatives, no momentum           | Like BYOL but simpler; prevents collapse via stop-gradient.                                     |
| **Barlow Twins**                                  |    2021 (Facebook)   | Redundancy reduction                        | Pushes the cross-correlation matrix of the two views' embeddings toward the identity (diagonal → 1, off-diagonal → 0). |
| **DINO**                                          |    2021 (Facebook)   | Self-distillation with teacher–student ViTs | Contrastive-like, uses cross-view prediction between teacher and student ViTs.                  |
| **CLIP (Contrastive Language–Image Pretraining)** |     2021 (OpenAI)    | Cross-modal contrastive                     | Aligns image and text embeddings using contrastive loss across modalities.                      |
| **ALIGN**                                         |     2021 (Google)    | CLIP-like, large scale                      | Same idea as CLIP but trained on massive noisy web data.                                        |
| **SLIP**                                          |         2022         | CLIP + SimCLR                               | Combines language-supervised (CLIP) and self-supervised (SimCLR) contrastive objectives.        |

---

# **3. Specialized or Extended Contrastive Frameworks**

| Method                                  | Year | Domain              | Description                                                     |
| :-------------------------------------- | :--: | :------------------ | :-------------------------------------------------------------- |
| **CPC (Contrastive Predictive Coding)** | 2018 | Audio, vision, NLP  | Predicts future latent representations using contrastive loss.  |
| **CMC (Contrastive Multiview Coding)**  | 2019 | Multi-view          | Learns shared representations from multiple views/modalities.   |
| **TCL (Temporal Contrastive Learning)** | 2020 | Video / time-series | Uses temporal augmentations to learn temporal consistency.      |
| **PixPro**                              | 2021 | Dense prediction    | Contrastive learning for pixel-level features.                  |
| **DetCon**                              | 2021 | Object-level        | Instance-level contrastive learning for segmentation/detection. |

---

# **4. Key Variations by Type**

| Type                                           | Examples            | Description                                           |
| :--------------------------------------------- | :------------------ | :---------------------------------------------------- |
| **Instance-level contrastive**                 | SimCLR, MoCo        | Treats each image as its own class                    |
| **Cluster-based contrastive**                  | SwAV, DeepCluster   | Learns by grouping similar embeddings                 |
| **Cross-modal contrastive**                    | CLIP, ALIGN         | Aligns different modalities (image–text, audio–video) |
| **Teacher–student contrastive (distillation)** | BYOL, DINO, SimSiam | Learns without explicit negatives                     |
| **Predictive contrastive**                     | CPC, PixPro         | Predicts representations across space or time         |

---

# **5. Summary Table of Core Ideas**

| Method            |  Negatives? | Architecture                 | Notes               |
| :---------------- | :---------: | :--------------------------- | :------------------ |
| Siamese / Triplet |      ✅      | Twin CNNs                    | Metric learning     |
| SimCLR            |      ✅      | Shared ResNet + MLP          | Needs large batches |
| MoCo              |      ✅      | Two encoders (momentum)      | Queue of negatives  |
| BYOL              |      ❌      | Online / target              | Self-distillation   |
| SimSiam           |      ❌      | Online / stop-grad           | Simpler BYOL        |
| SwAV              | ⚙️ implicit | Shared encoder + prototypes  | Cluster-based       |
| CLIP              |      ✅      | Image encoder + text encoder | Cross-modal         |
| DINO              | ⚙️ implicit | Student / teacher ViTs       | No labels           |

---





## **1. The setup**

You have a batch of images:
$$
\{A, B, C, D, \dots\}
$$

For each image, you create **two random augmentations**, for example:

| Original | Augmentation 1 | Augmentation 2 |
| :------- | :------------- | :------------- |
| A        | A₁             | A₂             |
| B        | B₁             | B₂             |
| C        | C₁             | C₂             |

So for **N** original images, the batch size after augmentation is **2 × N**.

---

## **2. Pass all of them through the encoder**

You pass **all** (2N) augmented images through the same encoder $f(\cdot)$ (e.g. ResNet).
You get embeddings:

$$
\{z_{A₁}, z_{A₂}, z_{B₁}, z_{B₂}, \dots, z_{N₁}, z_{N₂}\}
$$

These are typically L2-normalized.
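
For reference, L2 normalization simply divides a vector by its Euclidean norm. A minimal plain-Python sketch follows; `l2_normalize` is an illustrative name, and the small `eps` guard against division by zero is an assumption.

```python
import math

def l2_normalize(v, eps=1e-12):
    """Scale v to unit Euclidean length (eps guards against a zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / max(norm, eps) for x in v]

z = l2_normalize([3.0, 4.0])
print(z)  # [0.6, 0.8]
```

After this step, the dot product of two embeddings equals their cosine similarity.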

---

## **3. Define positives and negatives**

* Each pair (A₁, A₂) is a **positive pair**: both views come from the same original image.
* Every other pairing (e.g. A₁ vs B₁, A₁ vs B₂) is a **negative pair**.

So for each anchor (say A₁):

* **Positive:** A₂
* **Negatives:** all other $2N - 2$ embeddings

---

## **4. Compute all similarities**

You compute **cosine similarities** between every pair of embeddings in the batch:

$$
\text{sim}(z_i, z_j) = \frac{z_i^\top z_j}{\|z_i\| \, \|z_j\|}
$$

That gives you a $2N \times 2N$ similarity matrix.

Example for a batch with A and B (N=2):

| Anchor | A₁    | A₂    | B₁    | B₂    |
| :----- | :---- | :---- | :---- | :---- |
| A₁     | self  | **+** | –     | –     |
| A₂     | **+** | self  | –     | –     |
| B₁     | –     | –     | self  | **+** |
| B₂     | –     | –     | **+** | self  |

(“+” marks a positive pair and “–” a negative pair; the “self” entries on the diagonal are excluded from the loss.)
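
The same pattern can be generated for any batch size. The sketch below assumes the 2N views are ordered A₁, A₂, B₁, B₂, ... as in the table; `pair_labels` is a hypothetical helper that prints `.` for the self-comparison on the diagonal, which the loss skips.

```python
def pair_labels(n_images):
    """Label every (anchor, other) cell: '+' positive, '-' negative, '.' self."""
    size = 2 * n_images
    rows = []
    for i in range(size):
        row = []
        for j in range(size):
            if i == j:
                row.append(".")          # a view compared with itself
            elif i // 2 == j // 2:
                row.append("+")          # two views of the same image
            else:
                row.append("-")          # views of different images
        rows.append(row)
    return rows

for row in pair_labels(2):
    print(" ".join(row))
# . + - -
# + . - -
# - - . +
# - - + .
```

Each row contains exactly one positive and $2N - 2$ negatives.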

---

## **5. Apply the InfoNCE loss**

For anchor $i$ (say A₁), the loss is:

$$
L_i = -\log \frac{\exp(\text{sim}(z_i, z_{pos})/\tau)}{\sum_{k \neq i} \exp(\text{sim}(z_i, z_k)/\tau)}
$$

So it tries to make:

* $\text{sim}(z_i, z_{pos})$ large (positive pair close)
* $\text{sim}(z_i, z_{neg})$ small (negative pairs far apart)

---

## **6. Overall training objective**

Average over all anchors:

$$
L = \frac{1}{2N} \sum_{i=1}^{2N} L_i
$$

The network learns an encoder $f(\cdot)$ that produces embeddings where:

* Two augmentations of the same image → high cosine similarity
* Augmentations from different images → low cosine similarity
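
Putting the per-anchor loss and the average together, a dependency-free sketch of the batch objective might look like this. It assumes unit-norm embeddings stored in the order A₁, A₂, B₁, B₂, ..., so the partner of index `i` is `i ^ 1`; `nt_xent` is an illustrative name, not a library function.

```python
import math

def nt_xent(z, tau=0.1):
    """Mean InfoNCE over 2N unit-norm embeddings ordered [a1, a2, b1, b2, ...]."""
    size = len(z)
    total = 0.0
    for i in range(size):
        pos = i ^ 1  # partner view: 0<->1, 2<->3, ...
        # Scaled similarities to every other embedding (k != i)
        logits = {k: sum(x * y for x, y in zip(z[i], z[k])) / tau
                  for k in range(size) if k != i}
        m = max(logits.values())
        log_denom = m + math.log(sum(math.exp(v - m) for v in logits.values()))
        total += log_denom - logits[pos]
    return total / size

# Views of the same image agree: low loss.
aligned = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
# Views of the same image disagree: high loss.
mixed = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
print(nt_xent(aligned) < nt_xent(mixed))  # True
```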

---

## **7. Summary**

| Step | Description                                                        |
| :--- | :----------------------------------------------------------------- |
| 1    | Start with batch of N images                                       |
| 2    | Create two augmented views → total 2N samples                      |
| 3    | Encode all 2N images → embeddings                                  |
| 4    | Compute cosine similarity matrix                                   |
| 5    | For each sample, find its positive and treat the rest as negatives |
| 6    | Compute InfoNCE loss                                               |
| 7    | Backpropagate, update encoder weights                              |

---



## **1. What we learn**

In contrastive learning, we are **not** learning a classifier (like in supervised learning).
We are learning a **representation function** (an encoder) that maps raw data (e.g., images) to a **semantic embedding space**.

Formally:

$$
f_\theta: \mathbb{R}^{H \times W \times 3} \rightarrow \mathbb{R}^{D}
$$

where $D = 2048$ for ResNet-50 (after global average pooling).

The idea is to learn parameters $ \theta $ so that:

* If two images are semantically similar → their embeddings $ f_\theta(x_1) $ and $ f_\theta(x_2) $ have **high cosine similarity**.
* If two images are different → their embeddings have **low cosine similarity**.

So, the encoder learns to **cluster semantically similar inputs together** in feature space — **without any labels**.

---

## **2. Why we remove the classification head**

The classification head (usually a `Linear(2048, num_classes)`) is task-specific.
But in self-supervised contrastive learning, there are **no class labels**.
So we remove the head and only keep the encoder backbone (like ResNet-50 up to global pooling).

Sometimes we add a **projection head**, typically a small MLP (used in SimCLR):

$$
z = g(f_\theta(x))
$$

where $ g(\cdot) $ is a small 2-layer MLP that projects the representation into a latent space where the contrastive loss is applied.

At inference time, we **discard** $g(\cdot)$ and only use $f_\theta(x)$.

---

## **3. What happens inside**

By training with contrastive loss, the encoder $f_\theta$ gradually learns:

* To ignore **augmentations** (e.g., color jitter, crop, rotation)
* To capture **semantic meaning** (what’s actually in the image)

So the embedding becomes *invariant* to data augmentations and *discriminative* between different images.

In effect, the model builds **features similar to those learned with labels**, but purely from self-supervision.

---

## **4. Example: ResNet-50 Encoder**

Let’s say we use ResNet-50 pretrained with SimCLR.

* Input: $x \in \mathbb{R}^{224\times224\times3}$
* Encoder (ResNet-50, no head): outputs a 2048-dimensional vector
* Projection head: maps 2048 → 128 (for contrastive loss)
* Loss: InfoNCE applied on those 128-D vectors

After training, we discard the projection head and use the **2048-D feature vectors** as rich, semantically meaningful representations.

---

## **5. What do these 2048-D embeddings represent?**

Each dimension in the 2048-D embedding does not correspond to an explicit human concept.
But together, they represent the *position* of an image in a semantic space where:

* Images of **cats** cluster near each other
* Images of **cars** form another cluster
* Similar poses, lighting, or styles are close as well

This is called a **representation space** or **feature space**.

---

## **6. Using these learned features**

Once trained, you can use $f_\theta$ for downstream tasks:

| Downstream Task                     | How to Use $f_\theta$                                       |
| :---------------------------------- | :---------------------------------------------------------- |
| **Image Classification**            | Freeze $f_\theta$, train a linear classifier on top         |
| **Object Detection / Segmentation** | Fine-tune $f_\theta$ as the backbone                        |
| **Clustering / Retrieval**          | Use embeddings directly (e.g., cosine similarity search)    |
| **Visualization**                   | Reduce 2048-D to 2-D with t-SNE or UMAP to inspect clusters |
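
As an illustration of the retrieval row, here is a small cosine-similarity search in plain Python. The 3-D vectors and the names `cat_1`, `cat_2`, `car_1` are made-up stand-ins for real 2048-D encoder outputs, and `top_k` is a hypothetical helper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

def top_k(query, gallery, k=2):
    """Return the k gallery keys whose embeddings are most similar to the query."""
    ranked = sorted(gallery, key=lambda name: cosine(query, gallery[name]), reverse=True)
    return ranked[:k]

# Hypothetical embeddings standing in for encoder outputs
gallery = {
    "cat_1": [0.9, 0.1, 0.0],
    "cat_2": [0.8, 0.2, 0.1],
    "car_1": [0.0, 0.1, 0.9],
}
print(top_k([1.0, 0.0, 0.0], gallery))  # ['cat_1', 'cat_2']
```

For large galleries one would normally pre-normalize the embeddings and use an approximate nearest-neighbor index rather than a full sort.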

---

## **7. Key takeaway**

| Concept                    | Meaning                                                                |
| :------------------------- | :--------------------------------------------------------------------- |
| Training objective         | Learn embeddings where similar inputs are close and dissimilar are far |
| What’s learned             | Feature extractor $f_\theta$ (e.g., ResNet without head)               |
| What’s used after training | Encoder output (e.g., 2048-D embedding)                                |
| Why useful                 | Works as a pretrained backbone for many downstream tasks               |

---


During contrastive training, we update the **ResNet (encoder) parameters** (and optionally the small projection head), just like in supervised training.

Let’s break it down step by step:

---

## **1. What parameters exist**

When doing contrastive learning (e.g., SimCLR), we typically have two learnable modules:

1. **Encoder** $f_\theta(\cdot)$:

   * A backbone CNN (like ResNet-50)
   * Parameters $\theta$ are all convolution weights, batch norm parameters, etc.
   * Output: a 2048-dim feature vector

2. **Projection Head** $g_\phi(\cdot)$:

   * Usually a small MLP (e.g. 2048→512→128)
   * Parameters $\phi$ are the learnable weights of that MLP
   * Output: a low-dim embedding $z$ used in the loss

So the full forward path for one image is:

$$
z = g_\phi(f_\theta(x))
$$

---

## **2. What we optimize**

We compute the **contrastive loss (InfoNCE)** for each anchor $i$ with positive $j$, and average it over all anchors:

$$
L_i = -\log \frac{\exp(\text{sim}(z_i, z_j)/\tau)}{\sum_{k \neq i} \exp(\text{sim}(z_i, z_k)/\tau)}
$$

Then we perform **backpropagation**:

$$
\nabla_{\theta,\phi} L
$$

So the gradients flow **through the whole network**: both the projection head $g_\phi$ and the encoder $f_\theta$.

That means **all ResNet layers (convolutions, BatchNorm, etc.) are updated.**
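
Backpropagation starts from the gradient of the loss with respect to the scaled similarities. Writing $\ell_k = \text{sim}(z_i, z_k)/\tau$ (a notation introduced here for convenience), the standard softmax cross-entropy result gives $\partial L_i / \partial \ell_k = p_k - \mathbb{1}[k = \text{pos}]$, where $p$ is the softmax over the logits. The plain-Python sketch below, with illustrative function names and made-up logits, checks this against finite differences.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def info_nce_from_logits(logits, pos=0):
    """InfoNCE for one anchor, given scaled similarities (sim / tau)."""
    return -math.log(softmax(logits)[pos])

def analytic_grad(logits, pos=0):
    """dL/dlogit_k = softmax_k - 1[k == pos]."""
    return [p - (1.0 if k == pos else 0.0) for k, p in enumerate(softmax(logits))]

def numeric_grad(logits, pos=0, h=1e-6):
    """Central finite differences, used only to check the analytic gradient."""
    grads = []
    for k in range(len(logits)):
        up = list(logits)
        down = list(logits)
        up[k] += h
        down[k] -= h
        grads.append((info_nce_from_logits(up, pos) - info_nce_from_logits(down, pos)) / (2 * h))
    return grads

logits = [5.0, 2.0, -1.0]  # index 0 is the positive (made-up values)
print(analytic_grad(logits))
print(numeric_grad(logits))
```

The gradient is negative for the positive (raise its similarity) and positive for each negative (lower theirs), weighted by how much softmax probability each negative currently receives. The chain rule then carries these signals through $g_\phi$ and $f_\theta$.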

---

## **3. What happens at inference time**

After training:

* The **projection head $g_\phi$** is discarded.
* We keep only the **encoder $f_\theta$**.

Why?
Because $f_\theta$ now produces embeddings that are **semantically meaningful**, so the contrastive projection head is no longer needed.

So in downstream use (classification, detection, etc.), we use:

$$
h = f_\theta(x)
$$

where $h \in \mathbb{R}^{2048}$.

---

## **4. Optional frozen variants**

Sometimes people:

* **Freeze** the encoder $f_\theta$ and train only $g_\phi$ (for ablation)
* Or, in methods like MoCo, **maintain a momentum encoder** (a moving average of parameters, not directly updated by gradient descent)

But in standard **SimCLR-style contrastive learning**, **both $f_\theta$ and $g_\phi$ are optimized jointly**.
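
For the momentum-encoder variant, the moving-average update can be sketched as below. Parameters are shown as flat Python lists with made-up values, `momentum_update` is an illustrative name, and the coefficient `m` is set to 0.9 here only so the effect is visible after a few steps (in practice it is chosen close to 1).

```python
def momentum_update(key_params, query_params, m=0.999):
    """Exponential moving average: key <- m * key + (1 - m) * query, element-wise."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]

query = [1.0, -2.0]  # parameters trained by gradient descent (hypothetical values)
key = [0.0, 0.0]     # momentum-encoder parameters, never touched by the optimizer

for _ in range(3):
    key = momentum_update(key, query, m=0.9)

print(key)  # approximately [0.271, -0.542], i.e. (1 - 0.9**3) * query
```

The momentum encoder therefore trails the trained encoder smoothly, which keeps the embeddings it produces consistent across training steps.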

---

## **5. In summary**

| Component                       | Learnable? | Updated during training? | Used after training? |
| :------------------------------ | :--------: | :----------------------: | :------------------: |
| Encoder (ResNet, $f_\theta$)    |      ✅     |             ✅            |           ✅          |
| Projection Head (MLP, $g_\phi$) |      ✅     |             ✅            |     ❌ (discarded)    |
| Classification Head             |      ❌     |             ❌            |     ❌ (not used)     |

---

## **6. Gradient flow (schematic)**

```
x1 ──► fθ ──► gφ ──► z1
x2 ──► fθ ──► gφ ──► z2
             │
             ▼
         Contrastive Loss (InfoNCE)
             │
             ▼
        Backprop → updates θ, φ
```

---

So yes — **the ResNet parameters are updated** to learn a general-purpose visual representation that encodes semantic similarity, even though we never gave it labels.

---



Let’s go through a **minimal, fully working PyTorch example** that shows how **InfoNCE contrastive loss** works step by step — with actual tensor values you can inspect.

---

## **1. Setup**

We’ll simulate a small batch of “images” (just random vectors here for simplicity),
compute two augmented versions of each, and apply the InfoNCE loss.

```python
import torch
import torch.nn.functional as F

# For reproducibility
torch.manual_seed(0)

# === Parameters ===
batch_size = 4       # number of original images
feature_dim = 128     # embedding size (like output of projection head)
temperature = 0.1

# === Simulate embeddings for 2 augmentations of each image ===
# Normally, you'd pass images through encoder f(x) + projection head g(x)
z1 = F.normalize(torch.randn(batch_size, feature_dim), dim=1)
z2 = F.normalize(torch.randn(batch_size, feature_dim), dim=1)

# Concatenate them to form 2N embeddings (each original image has 2 views)
z = torch.cat([z1, z2], dim=0)   # shape [2N, D]
```

---

## **2. Compute cosine similarity matrix**

We compute cosine similarity for all pairs (2N × 2N matrix):

```python
# Cosine similarity between every pair (a dot product, since z is unit-norm),
# scaled by 1 / temperature
sim_matrix = torch.matmul(z, z.T) / temperature

# Mask the diagonal so a sample is never compared with itself
mask = torch.eye(2 * batch_size, dtype=torch.bool)
sim_matrix.masked_fill_(mask, -9e15)
```

---

## **3. Define positive pairs**

In SimCLR:

* For each sample `i`, its positive pair is `i + N` (the augmented view)
* So, for example, z[0] ↔ z[4], z[1] ↔ z[5], etc.

```python
# Positive pairs sit on the diagonals offset by +N and -N:
# sim_matrix[i, i + N] for anchors in z1, sim_matrix[i + N, i] for anchors in z2
pos_sim_1 = torch.diag(sim_matrix, batch_size)
pos_sim_2 = torch.diag(sim_matrix, -batch_size)

# Combine the positive similarities for all 2N anchors
# (same order as the rows of sim_matrix)
positives = torch.cat([pos_sim_1, pos_sim_2], dim=0)
```

---

## **4. Compute InfoNCE loss**

We treat the positive sample as the “correct class” among all negatives.

```python
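# Work in log space for numerical stability.
# Log-numerator: similarity of the positive pair / tau (already in `positives`)
# Log-denominator: log-sum-exp over all other samples in the row
# (the masked diagonal contributes effectively zero)
log_denominator = torch.logsumexp(sim_matrix, dim=1)

# InfoNCE loss per anchor: -log(exp(pos) / sum(exp(all)))
loss = log_denominator - positives
loss = loss.mean()

print(f"Contrastive (InfoNCE) Loss: {loss.item():.4f}")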
```

---

## **5. Output example**

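You’ll get a single line of the form:

```
Contrastive (InfoNCE) Loss: <value>
```

The exact value depends on the random seed. Because these embeddings are random rather than trained, expect something roughly comparable to $\ln(2N - 1) \approx 1.95$ for $N = 4$, which is the loss when every candidate looks equally similar. A lower loss means
the model’s embeddings bring positives closer and push negatives apart.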

---

## **6. Recap**

Here’s what we did conceptually:

| Step | Operation                                  | Purpose                                       |
| :--- | :----------------------------------------- | :-------------------------------------------- |
| 1    | Created two augmented batches (`z1`, `z2`) | Positive pairs                                |
| 2    | Normalized embeddings                      | For cosine similarity                         |
| 3    | Built similarity matrix                    | All pairwise similarities                     |
| 4    | Applied InfoNCE loss                       | Pull positives together, push negatives apart |
| 5    | Averaged loss                              | Backprop to update encoder & projection head  |

---

