## Personal Updates

Hello, and welcome back everybody to the blog! This is my first blog post of 2023 and, as publicly announced on [Twitter](https://twitter.com/amaarora/status/1623082761052635136), I am returning to blogging with a commitment of one post a week, released every Monday at 9am AEST.

Also, in case you missed it, I was recently interviewed by [Radek Osmulski](https://twitter.com/radekosmulski) - **"How to Blog to Advance Your Career and Learn Faster"** (in AI). In the video, we talk about my motivation for writing blogs, blogging to advance your career and learn, how to get started with blogging & more!

I have also updated my personal blog to use [Quarto](https://quarto.org/). The idea is to release all future blog posts as working Jupyter Notebooks.

Now, with personal updates out of the way, let's get started with CLIP. 

## Introduction

So what is CLIP and what are we going to cover in this blog post?

In this blog post we will go through the CLIP research paper - *"Learning Transferable Visual Models From Natural Language Supervision"* (@clip), and also look at the model implementation in PyTorch.

From the official blog ["CLIP: Connecting text and images"](https://openai.com/research/clip), 

> We’re introducing a neural network called CLIP which efficiently learns visual concepts from natural language supervision. CLIP can be applied to any visual classification benchmark by simply providing the names of the visual categories to be recognized, similar to the “zero-shot” capabilities of GPT-2 and GPT-3.

In my head, if I am to summarise **"What is CLIP?"**, the answer would be: 

"CLIP" is a way to effectively pretrain vision models (& more), which is architecture agnostic, by training on huge amounts of data much beyond ImageNet. In CLIP, we are not learning from expensive labels; rather, the model learns visual representations from free text.

Learning good visual and vision-language representations is critical to solving computer vision problems — image retrieval, image classification, video understanding.
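
To make the "zero-shot" idea from the quote above a little more concrete, here is a minimal sketch of how classification could work once image and text embeddings live in the same space. Everything in it is made up for illustration: the embeddings are hand-written vectors rather than outputs of the real encoders, `zero_shot_classify` is a hypothetical helper, and the scale of `100.0` is an assumed temperature.

```python
import torch
import torch.nn.functional as F


def zero_shot_classify(image_embed, text_embeds, class_names, scale=100.0):
    """Pick the class whose text embedding is most similar to the image embedding."""
    # L2-normalise so that the dot product equals the cosine similarity
    image_embed = F.normalize(image_embed, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    logits = scale * image_embed @ text_embeds.T  # one score per class name
    probs = logits.softmax(dim=-1)
    return class_names[probs.argmax().item()], probs


# Hypothetical embeddings: one row per class name (real ones would come from the text encoder)
class_names = ["dog", "cat", "car"]
text_embeds = torch.tensor([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
# Hypothetical image embedding that points mostly in the "cat" direction
image_embed = torch.tensor([0.1, 0.9, 0.2])

name, probs = zero_shot_classify(image_embed, text_embeds, class_names)
print(name, probs)
```

The point of the sketch is that no classifier head is trained: changing the list of class names changes the classifier.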

## Motivation for CLIP

Why was CLIP needed? What problem did it solve?

Please note that CLIP was written in 2021, at a time when text transformer based models like GPT-3 were competitive across many tasks with bespoke models on various benchmark datasets, while requiring little to no dataset-specific training data.

From the paper: 
    
*Task-agnostic objectives such as autoregressive and masked language modeling have scaled across many orders of magnitude in compute, model capacity, and data, steadily improving capabilities. The development of “text-to-text” as a standardized input-output interface (McCann et al., 2018; Radford et al., 2019; Raffel et al., 2019) has enabled task-agnostic architectures to zero-shot transfer to downstream datasets removing the need for specialized output heads or dataset specific customization. Flagship systems like GPT-3 (Brown et al., 2020) are now competitive across many tasks with bespoke models while requiring little to no dataset specific training data.*

But, for vision based models, it was still standard practice to pre-train models on crowd-labeled datasets such as ImageNet. The question then is ***Could scalable pre-training methods which learn directly from web text result in a similar breakthrough in computer vision?***

Before CLIP, a few research papers, namely VirTex (Desai & Johnson, 2020), ICMLM (Bulent Sariyildiz et al., 2020), and ConVIRT (Zhang et al., 2020), had demonstrated the potential of transformer-based language modeling, masked language modeling, and contrastive objectives to learn image representations from text.

Therefore, the idea of learning visual representations from text isn't new with CLIP. However, the zero-shot performance on ImageNet before CLIP was only 11.5% (Li et al., 2017), much lower than the 88.4% accuracy of the state of the art at the time (Xie et al., 2020).

Kolesnikov et al. (2019) and Dosovitskiy et al. (2020) also demonstrated large gains on a broader set of transfer benchmarks by pre-training models to predict the classes of the noisily labeled JFT-300M dataset. However, both works carefully design, and in the process limit, their supervision to 1000 and 18291 classes respectively.

*Natural language is able to express, and therefore supervise, a much wider set of visual concepts through its generality. Both approaches also use static softmax classifiers to perform prediction and lack a mechanism for dynamic outputs. This severely curtails their flexibility and limits their “zero-shot” capabilities.*

As I catch up with deep learning research, one change that I've seen happen in the last year or so is the sheer scale of models. Pretraining on huge amounts of data at scale has led to further advancements in the field of deep learning.

Also, CLIP image and text encoders are a critical part of [Stable Diffusion](https://stability.ai/blog/stable-diffusion-public-release), released last year. 

*We find that CLIP, similar to the GPT family, learns to perform a wide set of tasks during pre-training including OCR, geo-localization, action recognition, and many others. We measure this by benchmarking the zero-shot transfer performance of CLIP on over 30 existing datasets and find it can be competitive with prior task-specific supervised models. We also confirm these findings with linear-probe representation learning analysis and show that CLIP outperforms the best publicly available ImageNet model while also being more computationally efficient. We additionally find that zero-shot CLIP models are much more robust than equivalent accuracy supervised ImageNet models which suggests that zero-shot evaluation of task-agnostic models is much more representative of a model’s capability.*

## Approach

*At the core of our approach is the idea of learning perception from supervision contained in natural language. As discussed in the introduction, this is not at all a new idea, however terminology used to describe work in this space is varied, even seemingly contradictory, and stated motivations are diverse. Zhang et al. (2020), Gomez et al. (2017), Joulin et al. (2016), and Desai & Johnson (2020) all introduce methods which learn visual representations from text paired with images but describe their approaches as unsupervised, self-supervised, weakly supervised, and supervised respectively.*

*Learning from natural language has several potential strengths over other training methods.*

### Summary

In this section I will present a summary of the CLIP architecture from the paper.

![Summary of CLIP approach](../images/clip.png){#fig-clip}

From the paper: 

*Given a batch of $N$ (image, text) pairs, CLIP is trained to predict which of the $N \times N$ possible (image, text) pairings across a batch actually occurred. To do this, CLIP learns a multi-modal embedding space by jointly training an image encoder and text encoder to maximize the cosine similarity of the image and text embeddings of the $N$ real pairs in the batch while minimizing the cosine similarity of the embeddings of the $N^2 - N$ incorrect pairings. We optimize a symmetric cross entropy loss over these similarity scores.*

If the above doesn't make complete sense, that's okay. Let's look at the pseudocode from the paper.

```
# image_encoder - ResNet or Vision Transformer 
# text_encoder - CBOW or Text Transformer 
# I[n, h, w, c] - minibatch of aligned images 
# T[n, l] - minibatch of aligned texts 
# W_i[d_i, d_e] - learned proj of image to embed 
# W_t[d_t, d_e] - learned proj of text to embed 
# t - learned temperature parameter 

# extract feature representations of each modality 
I_f = image_encoder(I) #[n, d_i] 
T_f = text_encoder(T) #[n, d_t] 

# joint multimodal embedding [n, d_e] 
I_e = l2_normalize(np.dot(I_f, W_i), axis=1) 
T_e = l2_normalize(np.dot(T_f, W_t), axis=1) 

# scaled pairwise cosine similarities [n, n] 
logits = np.dot(I_e, T_e.T) * np.exp(t) 

# symmetric loss function 
labels = np.arange(n) 
loss_i = cross_entropy_loss(logits, labels, axis=0) 
loss_t = cross_entropy_loss(logits, labels, axis=1) 
loss = (loss_i + loss_t)/2
```
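
The pseudocode above is NumPy-like and not directly runnable, so here is a minimal PyTorch sketch of the same loss. It is my own translation, not the official implementation: `clip_loss` is a hypothetical function name, the encoders are replaced by random feature tensors, and the dimensions (`n`, `d_i`, `d_t`, `d_e`) are arbitrary small numbers. The temperature is initialised to the equivalent of 0.07, which is the value I recall the paper using.

```python
import torch
import torch.nn.functional as F


def clip_loss(I_f, T_f, W_i, W_t, t):
    """Symmetric contrastive loss over a batch of n aligned (image, text) features."""
    # joint multimodal embedding [n, d_e], L2-normalised along the feature axis
    I_e = F.normalize(I_f @ W_i, dim=1)
    T_e = F.normalize(T_f @ W_t, dim=1)

    # scaled pairwise cosine similarities [n, n]
    logits = (I_e @ T_e.T) * torch.exp(t)

    # the i-th image belongs with the i-th text, so the targets are 0..n-1
    labels = torch.arange(logits.shape[0])
    loss_img_to_txt = F.cross_entropy(logits, labels)    # softmax over each row
    loss_txt_to_img = F.cross_entropy(logits.T, labels)  # softmax over each column
    return (loss_img_to_txt + loss_txt_to_img) / 2


torch.manual_seed(0)
n, d_i, d_t, d_e = 4, 16, 12, 8   # arbitrary sizes for illustration
I_f = torch.randn(n, d_i)         # stand-in for image_encoder(I)
T_f = torch.randn(n, d_t)         # stand-in for text_encoder(T)
W_i = torch.randn(d_i, d_e, requires_grad=True)
W_t = torch.randn(d_t, d_e, requires_grad=True)
t = torch.log(torch.tensor(1 / 0.07)).requires_grad_()  # learned temperature (log scale)

loss = clip_loss(I_f, T_f, W_i, W_t, t)
loss.backward()
print(loss.item())
```

Averaging the row-wise and column-wise cross entropy is what makes the loss symmetric: each image has to pick its text, and each text has to pick its image.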

Let's look at what it all means with the help of Microsoft Excel.

![Contrastive Loss](../images/contrastive_loss.png){#fig-contrastive-loss}

Let's just say we have 4 images: an image of earrings, an image of a tea cup & saucer, an image of furniture and an image of a cake.

Also, let's say we have 4 captions, one for each of the images: cute earrings, tea cup & saucer, dining furniture, orchid wedding cake.

Essentially, with Contrastive Loss, as shown in @fig-contrastive-loss, we want the green diagonal (the matching image-caption pairs) to have high values and everywhere else to have lower values.
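
As a quick sanity check of that intuition, the sketch below compares the symmetric cross entropy for two made-up 4 × 4 similarity matrices: one with a strong diagonal and one where every pairing looks equally likely. The numbers are invented for illustration and are not the values from the spreadsheet, and `symmetric_ce` is a hypothetical helper.

```python
import math

import torch
import torch.nn.functional as F


def symmetric_ce(sim):
    """Average of row-wise and column-wise cross entropy with diagonal targets."""
    labels = torch.arange(sim.shape[0])
    return (F.cross_entropy(sim, labels) + F.cross_entropy(sim.T, labels)) / 2


captions = ["cute earrings", "tea cup & saucer", "dining furniture", "orchid wedding cake"]

# rows = images, columns = captions (made-up scaled similarities)
matched = 10.0 * torch.eye(len(captions))            # high on the diagonal, low elsewhere
uniform = torch.zeros(len(captions), len(captions))  # the model cannot tell pairs apart

loss_matched = symmetric_ce(matched).item()
loss_uniform = symmetric_ce(uniform).item()
print(loss_matched, loss_uniform, math.log(4))
```

With no preference between pairings the loss equals $\ln 4$, the cost of guessing among 4 captions, while a strong diagonal drives it towards zero.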

## Model Architecture

Having looked at CLIP in theory, it is now time to look at the CLIP model architecture in code.

We will first look at the Image Encoder.

### Image Encoder 

From section 2.4 of the CLIP paper: 

*We consider two different architectures for the image encoder. For the first, we use ResNet-50 (@resnet) as the base architecture for the image encoder due to its widespread adoption and proven performance. We make several modifications to the original version using the ResNetD improvements from @bag_of_tricks and the antialiased rect-2 blur pooling from Zhang (2019). We also replace the global average pooling layer with an attention pooling mechanism. The attention pooling is implemented as a single layer of “transformer-style” multi-head QKV attention where the query is conditioned on the global average-pooled representation of the image. For the second architecture, we experiment with the recently introduced Vision Transformer (ViT) (Dosovitskiy et al., 2020). We closely follow their implementation with only the minor modification of adding an additional layer normalization to the combined patch and position embeddings before the transformer and use a slightly different initialization scheme.*
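
One of the modifications mentioned in the quote, attention pooling, can be sketched with `nn.MultiheadAttention`. This is a simplified illustration under my own assumptions and not the official CLIP layer: `AttentionPoolSketch` is a hypothetical name, the positional embedding initialisation is arbitrary, and the output projection to the final embedding size is left out.

```python
import torch
import torch.nn as nn


class AttentionPoolSketch(nn.Module):
    def __init__(self, spatial_size, embed_dim, num_heads):
        super().__init__()
        # one learned position per spatial location, plus one for the mean token
        self.pos_embed = nn.Parameter(
            torch.randn(spatial_size * spatial_size + 1, embed_dim) / embed_dim**0.5
        )
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def forward(self, x):
        # x: [batch, channels, h, w] -> tokens: [batch, h*w, channels]
        tokens = x.flatten(2).transpose(1, 2)
        # global average-pooled representation, used as the query
        mean_token = tokens.mean(dim=1, keepdim=True)
        tokens = torch.cat([mean_token, tokens], dim=1) + self.pos_embed
        query = tokens[:, :1]
        pooled, _ = self.attn(query, tokens, tokens, need_weights=False)
        return pooled.squeeze(1)  # [batch, channels]


pool = AttentionPoolSketch(spatial_size=7, embed_dim=64, num_heads=4)
out = pool(torch.randn(2, 64, 7, 7))
print(out.shape)
```

Compared with plain global average pooling, the pooled vector is a weighted combination of spatial locations, with weights that depend on the image itself.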


So the first change from the original ResNet to the ResNet architecture used in the CLIP Image Encoder is the stem. In the original ResNet architecture (as described in @bag_of_tricks):

*The input stem has a 7 × 7 convolution with an output channel of 64 and a stride of 2, followed by a 3 × 3 max pooling layer also with a stride of 2. The input stem reduces the input width and height by 4 times and increases its channel size to 64.*

```python
self.conv1 = nn.Conv2d(3, self.inplanes, kernel_size=7, stride=2, padding=3, bias=False)
self.bn1 = norm_layer(self.inplanes)
self.relu = nn.ReLU(inplace=True)
self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
```
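
To check the claim that this stem reduces width and height by 4 times, the same layers can be wrapped in `nn.Sequential` and run on a dummy image. This is a small sketch that assumes `self.inplanes` is 64 and `norm_layer` is `nn.BatchNorm2d`; the 224 × 224 input size is just an example.

```python
import torch
import torch.nn as nn

# original ResNet stem, assuming inplanes=64 and BatchNorm2d as the norm layer
stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
).eval()

with torch.no_grad():
    out = stem(torch.randn(1, 3, 224, 224))

print(out.shape)  # torch.Size([1, 64, 56, 56])
```

The convolution halves 224 to 112 and the max pooling halves it again to 56, which gives the overall reduction of 4 times.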

*A 7 × 7 convolution is 5.4 times more expensive than a 3 × 3 convolution. So this tweak replacing the 7 × 7 convolution in the input stem with three conservative 3 × 3 convolutions, with the first and second convolutions have their output channel of 32 and a stride of 2, while the last convolution uses a 64 output channel.*

```python 
# the 3-layer stem
self.conv1 = nn.Conv2d(
    3, width // 2, kernel_size=3, stride=2, padding=1, bias=False
)
self.bn1 = nn.BatchNorm2d(width // 2)
self.relu1 = nn.ReLU(inplace=True)
self.conv2 = nn.Conv2d(
    width // 2, width // 2, kernel_size=3, padding=1, bias=False
)
self.bn2 = nn.BatchNorm2d(width // 2)
self.relu2 = nn.ReLU(inplace=True)
self.conv3 = nn.Conv2d(width // 2, width, kernel_size=3, padding=1, bias=False)
self.bn3 = nn.BatchNorm2d(width)
self.relu3 = nn.ReLU(inplace=True)
self.avgpool = nn.AvgPool2d(2)
```
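
The snippet above only defines the layers, so here is a minimal runnable sketch that wraps them in a module and checks the output shape. `ThreeLayerStem` is a hypothetical name, `width=64` is an assumed value, and the order of operations in `forward` is my assumption of how the layers are chained. Note that in this code only the first convolution has a stride of 2, and the final downsampling is done by average pooling instead of max pooling.

```python
import torch
import torch.nn as nn


class ThreeLayerStem(nn.Module):
    def __init__(self, width=64):
        super().__init__()
        # three 3x3 convolutions replace the single 7x7 convolution
        self.conv1 = nn.Conv2d(3, width // 2, kernel_size=3, stride=2, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(width // 2)
        self.relu1 = nn.ReLU(inplace=True)
        self.conv2 = nn.Conv2d(width // 2, width // 2, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(width // 2)
        self.relu2 = nn.ReLU(inplace=True)
        self.conv3 = nn.Conv2d(width // 2, width, kernel_size=3, padding=1, bias=False)
        self.bn3 = nn.BatchNorm2d(width)
        self.relu3 = nn.ReLU(inplace=True)
        # average pooling takes the place of the max pooling in the original stem
        self.avgpool = nn.AvgPool2d(2)

    def forward(self, x):
        x = self.relu1(self.bn1(self.conv1(x)))  # 224 -> 112, 32 channels
        x = self.relu2(self.bn2(self.conv2(x)))  # 112 -> 112, 32 channels
        x = self.relu3(self.bn3(self.conv3(x)))  # 112 -> 112, 64 channels
        return self.avgpool(x)                   # 112 -> 56


stem = ThreeLayerStem(width=64).eval()
with torch.no_grad():
    out = stem(torch.randn(1, 3, 224, 224))

print(out.shape)  # torch.Size([1, 64, 56, 56])
```

The output shape matches the original stem, so the rest of the network can stay unchanged. The "5.4 times" in the quote is the ratio of kernel areas, 49 / 9 ≈ 5.4.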