> ### PAPER WITH CODE: LEARNING TRANSFERABLE VISUAL MODELS FROM NATURAL LANGUAGE SUPERVISION

> #### NOTE: THIS IS NOT AFFILIATED WITH @openai in any way; rather, we are standing on the shoulders of giants
>> The source code for this writeup is taken from open-source implementations such as:
>> * https://github.com/openai/CLIP. Originally MIT License, Copyright (c) 2021 OpenAI.
>> * https://github.com/mlfoundations/open_clip. Originally MIT License, Copyright (c) 2021 OpenAI.

In [None]:
from IPython.display import Image
# Raw string so the backslashes in the Windows path are not read as escape sequences
Image(filename=r'C:\Users\369 Osu\Desktop\ANNOTATED PAPERS\ANOTATED PAPERS\CLIP')

In [None]:
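from collections import OrderedDict
from dataclasses import dataclass
from typing import Tuple, Union, Callable, Optional

import numpy as np
import torch
import torch.nn.functional as F
from torch import nn

# The source package also pulls in its own helpers through relative imports
# (`from .timm_model import TimmModel`, `from .utils import freeze_batch_norm_2d`).
# Relative imports fail inside a notebook cell and neither helper is used below,
# so they are left out here.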


## ABSTRACT

>> State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on.


### ABOUT CLIP

> CLIP is a neural network that efficiently learns visual concepts from natural language supervision. Trained on text paired with images found on the internet, it is a significant step towards flexible and practical
zero-shot computer vision classifiers.
> Given a batch of N (image, text) pairs, CLIP is trained to
predict which of the N x N possible (image, text) pairings
across a batch actually occurred. To do this, CLIP learns a multi-modal embedding space by jointly training an image
encoder and text encoder to maximize the cosine similarity
of the image and text embeddings of the N real pairs
in the batch while minimizing the cosine similarity of the
embeddings of the N^2 - N incorrect pairings.

> > CLIP jointly trains an image encoder and a text encoder to predict the correct pairings of a batch of (image, text) training
examples.

>> CLIP offers significant benefit for tasks that have relatively
little data given its zero-shot capabilities.
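
> The objective described above can be sketched in a few lines. This is a minimal, dependency-free illustration that assumes the symmetric cross-entropy formulation over the N x N similarity matrix; the function names and the toy embeddings are made up for the example and are not part of the reference code.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy(logits, target):
    """Cross-entropy of one list of logits against the index of the correct entry."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

def clip_contrastive_loss(image_emb, text_emb, scale=1.0):
    """Symmetric cross-entropy over the N x N similarity matrix.

    Row i of image_emb and row i of text_emb are assumed to be a real pair,
    so the correct index for row i (and for column i) is i.
    """
    images = [l2_normalize(v) for v in image_emb]
    texts = [l2_normalize(v) for v in text_emb]
    n = len(images)
    # logits[i][j] = scaled cosine similarity between image i and text j
    logits = [[scale * sum(a * b for a, b in zip(img, txt)) for txt in texts]
              for img in images]
    # Each image must pick its own text (rows) ...
    loss_images = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    # ... and each text must pick its own image (columns).
    columns = [list(col) for col in zip(*logits)]
    loss_texts = sum(cross_entropy(columns[j], j) for j in range(n)) / n
    return (loss_images + loss_texts) / 2

# Toy batch with made-up numbers: matched pairs point in similar directions.
image_emb = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.1, 0.0, 1.0]]
text_emb = [[0.9, 0.0, 0.1], [0.1, 1.1, 0.0], [0.0, 0.2, 0.8]]

aligned = clip_contrastive_loss(image_emb, text_emb, scale=10.0)
# Rotating the texts breaks every pairing, so the loss should go up.
shuffled = clip_contrastive_loss(image_emb, text_emb[1:] + text_emb[:1], scale=10.0)
print(aligned, shuffled)
```

> Minimizing this loss pushes the N diagonal similarities up and the N^2 - N off-diagonal similarities down, which is the behaviour the paragraph above describes.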

### COMPONENT ARCHITECTURE
> ### IMAGE ENCODER:



>> We consider two different architectures for the image encoder.
For the first, we use ResNet-50 (He et al., 2016a)
as the base architecture for the image encoder due to its
widespread adoption and proven performance. We make several
modifications to the original version using the ResNet-D
improvements from He et al. (2019) and the antialiased
rect-2 blur pooling from Zhang (2019). We also replace
the global average pooling layer with an attention pooling
mechanism. The attention pooling is implemented as a single
layer of “transformer-style” multi-head QKV attention
where the query is conditioned on the global average-pooled representation of the image. For the second architecture, we
experiment with the recently introduced Vision Transformer
(ViT) (Dosovitskiy et al., 2020). We closely follow their
implementation with only the minor modification of adding
an additional layer normalization to the combined patch
and position embeddings before the transformer and use a
slightly different initialization scheme.

> FIRST ARCHITECTURE: THE ResNet-50

        class Bottleneck(nn.Module):
            expansion = 4

            def __init__(self, inplanes, planes, stride=1):
                super().__init__()

                # all conv layers have stride 1. an avgpool is performed after the second convolution when stride > 1
                self.conv1 = nn.Conv2d(inplanes, planes, 1, bias=False)
                self.bn1 = nn.BatchNorm2d(planes)

                self.conv2 = nn.Conv2d(planes, planes, 3, padding=1, bias=False)
                self.bn2 = nn.BatchNorm2d(planes)

                self.avgpool = nn.AvgPool2d(stride) if stride > 1 else nn.Identity()

                self.conv3 = nn.Conv2d(planes, planes * self.expansion, 1, bias=False)
                self.bn3 = nn.BatchNorm2d(planes * self.expansion)

                self.relu = nn.ReLU(inplace=True)
                self.downsample = None
                self.stride = stride

                if stride > 1 or inplanes != planes * Bottleneck.expansion:
                    # downsampling layer is prepended with an avgpool, and the subsequent convolution has stride 1
                    self.downsample = nn.Sequential(OrderedDict([
                        ("-1", nn.AvgPool2d(stride)),
                        ("0", nn.Conv2d(inplanes, planes * self.expansion, 1, stride=1, bias=False)),
                        ("1", nn.BatchNorm2d(planes * self.expansion))
                    ]))

            def forward(self, x: torch.Tensor):
                identity = x

                out = self.relu(self.bn1(self.conv1(x)))
                out = self.relu(self.bn2(self.conv2(out)))
                out = self.avgpool(out)
                out = self.bn3(self.conv3(out))

                if self.downsample is not None:
                    identity = self.downsample(x)

                out += identity
                out = self.relu(out)
                return out

> #### THE MODIFIED ATTENTION POOLING MECHANISM

    class AttentionPool2d(nn.Module):
        def __init__(self, spacial_dim: int, embed_dim: int, num_heads: int, output_dim: int = None):
            super().__init__()
            self.positional_embedding = nn.Parameter(torch.randn(spacial_dim ** 2 + 1, embed_dim) / embed_dim ** 0.5)
            self.k_proj = nn.Linear(embed_dim, embed_dim)
            self.q_proj = nn.Linear(embed_dim, embed_dim)
            self.v_proj = nn.Linear(embed_dim, embed_dim)
            self.c_proj = nn.Linear(embed_dim, output_dim or embed_dim)
            self.num_heads = num_heads

        def forward(self, x):
            x = x.reshape(x.shape[0], x.shape[1], x.shape[2] * x.shape[3]).permute(2, 0, 1)  # NCHW -> (HW)NC
            x = torch.cat([x.mean(dim=0, keepdim=True), x], dim=0)  # (HW+1)NC
            x = x + self.positional_embedding[:, None, :].to(x.dtype)  # (HW+1)NC
            x, _ = F.multi_head_attention_forward(
                query=x, key=x, value=x,
                embed_dim_to_check=x.shape[-1],
                num_heads=self.num_heads,
                q_proj_weight=self.q_proj.weight,
                k_proj_weight=self.k_proj.weight,
                v_proj_weight=self.v_proj.weight,
                in_proj_weight=None,
                in_proj_bias=torch.cat([self.q_proj.bias, self.k_proj.bias, self.v_proj.bias]),
                bias_k=None,
                bias_v=None,
                add_zero_attn=False,
                dropout_p=0,
                out_proj_weight=self.c_proj.weight,
                out_proj_bias=self.c_proj.bias,
                use_separate_proj_weight=True,
                training=self.training,
                need_weights=False
            )

            return x[0]

>

    class ModifiedResNet(nn.Module):
        """
        A ResNet class that is similar to torchvision's but contains the following changes:
        - There are now 3 "stem" convolutions as opposed to 1, with an average pool instead of a max pool.
        - Performs anti-aliasing strided convolutions, where an avgpool is prepended to convolutions with stride > 1
        - The final pooling layer is a QKV attention instead of an average pool
        """

        def __init__(self, layers, output_dim, heads, input_resolution=224, width=64):
            super().__init__()
            self.output_dim = output_dim
            self.input_resolution = input_resolution

            # the 3-layer stem
            self.conv1 = nn.Conv2d(3, width // 2, kernel_size=3, stride=2, padding=1, bias=False)
            self.bn1 = nn.BatchNorm2d(width // 2)
            self.conv2 = nn.Conv2d(width // 2, width // 2, kernel_size=3, padding=1, bias=False)
            self.bn2 = nn.BatchNorm2d(width // 2)
            self.conv3 = nn.Conv2d(width // 2, width, kernel_size=3, padding=1, bias=False)
            self.bn3 = nn.BatchNorm2d(width)
            self.avgpool = nn.AvgPool2d(2)
            self.relu = nn.ReLU(inplace=True)

            # residual layers
            self._inplanes = width  # this is a *mutable* variable used during construction
            self.layer1 = self._make_layer(width, layers[0])
            self.layer2 = self._make_layer(width * 2, layers[1], stride=2)
            self.layer3 = self._make_layer(width * 4, layers[2], stride=2)
            self.layer4 = self._make_layer(width * 8, layers[3], stride=2)

            embed_dim = width * 32  # the ResNet feature dimension
            self.attnpool = AttentionPool2d(input_resolution // 32, embed_dim, heads, output_dim)

        def _make_layer(self, planes, blocks, stride=1):
            layers = [Bottleneck(self._inplanes, planes, stride)]

            self._inplanes = planes * Bottleneck.expansion
            for _ in range(1, blocks):
                layers.append(Bottleneck(self._inplanes, planes))

            return nn.Sequential(*layers)

        def forward(self, x):
            def stem(x):
                for conv, bn in [(self.conv1, self.bn1), (self.conv2, self.bn2), (self.conv3, self.bn3)]:
                    x = self.relu(bn(conv(x)))
                x = self.avgpool(x)
                return x

            x = x.type(self.conv1.weight.dtype)
            x = stem(x)
            x = self.layer1(x)
            x = self.layer2(x)
            x = self.layer3(x)
            x = self.layer4(x)
            x = self.attnpool(x)

            return x


    class LayerNorm(nn.LayerNorm):
        """Subclass torch's LayerNorm to handle fp16."""

        def forward(self, x: torch.Tensor):
            orig_type = x.dtype
            ret = super().forward(x.type(torch.float32))
            return ret.type(orig_type)


    class QuickGELU(nn.Module):
        def forward(self, x: torch.Tensor):
            return x * torch.sigmoid(1.702 * x)


    class ResidualAttentionBlock(nn.Module):
        def __init__(self, d_model: int, n_head: int, attn_mask: torch.Tensor = None):
            super().__init__()

            self.attn = nn.MultiheadAttention(d_model, n_head)
            self.ln_1 = LayerNorm(d_model)
            self.mlp = nn.Sequential(OrderedDict([
                ("c_fc", nn.Linear(d_model, d_model * 4)),
                ("gelu", QuickGELU()),
                ("c_proj", nn.Linear(d_model * 4, d_model))
            ]))
            self.ln_2 = LayerNorm(d_model)
            self.attn_mask = attn_mask

        def attention(self, x: torch.Tensor):
            self.attn_mask = self.attn_mask.to(dtype=x.dtype, device=x.device) if self.attn_mask is not None else None
            return self.attn(x, x, x, need_weights=False, attn_mask=self.attn_mask)[0]

        def forward(self, x: torch.Tensor):
            x = x + self.attention(self.ln_1(x))
            x = x + self.mlp(self.ln_2(x))
            return x
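
> `QuickGELU` above replaces the exact GELU with the cheaper `x * sigmoid(1.702 * x)`. The sketch below, written in plain Python rather than torch, compares the two on a grid of inputs to show how close the approximation is. The grid range is an arbitrary choice for illustration.

```python
import math

def quick_gelu(x):
    """Sigmoid-based approximation, the same formula as QuickGELU.forward."""
    return x / (1.0 + math.exp(-1.702 * x))

def exact_gelu(x):
    """GELU defined through the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

# Largest absolute gap between the two on an evenly spaced grid over [-6, 6]
grid = [i / 100.0 for i in range(-600, 601)]
max_gap = max(abs(quick_gelu(x) - exact_gelu(x)) for x in grid)
print(max_gap)
```

> The gap stays small across the grid, which is why the approximation can stand in for GELU inside the MLP of each residual block.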

> SECOND ARCHITECTURE: THE VISION TRANSFORMER

    class VisionTransformer(nn.Module):
        def __init__(self, input_resolution: int, patch_size: int, width: int, layers: int, heads: int, output_dim: int):
            super().__init__()
            self.input_resolution = input_resolution
            self.output_dim = output_dim
            self.conv1 = nn.Conv2d(in_channels=3, out_channels=width, kernel_size=patch_size, stride=patch_size, bias=False)

            scale = width ** -0.5
            self.class_embedding = nn.Parameter(scale * torch.randn(width))
            self.positional_embedding = nn.Parameter(scale * torch.randn((input_resolution // patch_size) ** 2 + 1, width))
            self.ln_pre = LayerNorm(width)
            self.transformer = Transformer(width, layers, heads)
            self.ln_post = LayerNorm(width)
            self.proj = nn.Parameter(scale * torch.randn(width, output_dim))

        def forward(self, x: torch.Tensor):
            x = self.conv1(x)  # shape = [*, width, grid, grid]
            x = x.reshape(x.shape[0], x.shape[1], -1)  # shape = [*, width, grid ** 2]
            x = x.permute(0, 2, 1)  # shape = [*, grid ** 2, width]
            x = torch.cat([self.class_embedding.to(x.dtype) + torch.zeros(x.shape[0], 1, x.shape[-1], dtype=x.dtype, device=x.device), x], dim=1)  # shape = [*, grid ** 2 + 1, width]
            x = x + self.positional_embedding.to(x.dtype)
            x = self.ln_pre(x)

            x = x.permute(1, 0, 2)  # NLD -> LND
            x = self.transformer(x)
            x = x.permute(1, 0, 2)  # LND -> NLD

            x = self.ln_post(x[:, 0, :])

            if self.proj is not None:
                x = x @ self.proj

            return x
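
> The shape comments in the two image encoders can be checked with simple arithmetic. The helper names below are hypothetical, and the resolution and patch sizes are example settings chosen for illustration rather than values stated in this writeup.

```python
def vit_sequence_length(input_resolution, patch_size):
    """Tokens seen by the transformer: one per patch plus the class token."""
    grid = input_resolution // patch_size  # patches per side, set by conv1's stride
    return grid * grid + 1

def attnpool_positions(input_resolution):
    """Rows of AttentionPool2d.positional_embedding in ModifiedResNet."""
    spacial_dim = input_resolution // 32  # the ResNet downsamples by 32 overall
    return spacial_dim ** 2 + 1           # +1 for the prepended mean token

# Example settings (assumed)
print(vit_sequence_length(224, 32))   # 7 * 7 + 1 = 50
print(vit_sequence_length(224, 16))   # 14 * 14 + 1 = 197
print(attnpool_positions(224))        # 7 * 7 + 1 = 50
```

> In both encoders the extra position holds a summary token: the learned class embedding in the ViT, and the average-pooled feature in the ResNet's attention pool.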

> ### TEXT ENCODER

>> The text encoder is a Transformer (Vaswani et al., 2017)
with the architecture modifications described in Radford
et al. (2019). As a base size we use a 63M-parameter 12-layer
512-wide model with 8 attention heads. The transformer
operates on a lower-cased byte pair encoding (BPE)
representation of the text with a 49,152 vocab size (Sennrich
et al., 2015). For computational efficiency, the max
sequence length was capped at 76. The text sequence is
bracketed with [SOS] and [EOS] tokens and the activations
of the highest layer of the transformer at the [EOS]
token are treated as the feature representation of the text
which is layer normalized and then linearly projected into
the multi-modal embedding space. Masked self-attention
was used in the text encoder to preserve the ability to initialize
with a pre-trained language model or add language
modeling as an auxiliary objective, though exploration of
this is left as future work. For the text encoder, we
only scale the width of the model to be proportional to the
calculated increase in width of the ResNet and do not scale
the depth at all, as we found CLIP’s performance to be less
sensitive to the capacity of the text encoder.

> 
    class Transformer(nn.Module):
        def __init__(self, width: int, layers: int, heads: int, attn_mask: torch.Tensor = None):
            super().__init__()
            self.width = width
            self.layers = layers
            self.resblocks = nn.Sequential(*[ResidualAttentionBlock(width, heads, attn_mask) for _ in range(layers)])

        def forward(self, x: torch.Tensor):
            return self.resblocks(x)
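
> The text encoder relies on two details that the `Transformer` class does not show by itself: the masked (causal) self-attention and the choice of the [EOS] position as the text feature. The sketch below illustrates both with plain Python lists. The helper names and token ids are hypothetical, and in practice the mask would be built as a tensor and passed through the `attn_mask` argument.

```python
def build_causal_mask(context_length):
    """Additive attention mask: 0.0 where attention is allowed, -inf where it is not.

    Position i may attend to positions 0..i only, so every entry above the
    diagonal is blocked.
    """
    neg_inf = float("-inf")
    return [[0.0 if j <= i else neg_inf for j in range(context_length)]
            for i in range(context_length)]

def eos_position(token_ids, eos_id):
    """Index of the [EOS] token; the final-layer activation there is the text feature."""
    return token_ids.index(eos_id)

mask = build_causal_mask(4)
for row in mask:
    print(row)

# Hypothetical token ids: 1 = [SOS], 2 = [EOS], 0 = padding
tokens = [1, 17, 42, 2, 0, 0]
print(eos_position(tokens, eos_id=2))  # 3
```

> Because the mask is added to the attention scores before the softmax, the `-inf` entries receive zero attention weight, so each token only sees itself and earlier tokens.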

> ## EXPERIMENTS

> In this work, we close this gap and study the behaviors of image classifiers trained with natural language supervision at large scale. Enabled
by the large amounts of publicly available data of this form
on the internet, we create a new dataset of 400 million (image,
text) pairs and demonstrate that a simplified version of
ConVIRT trained from scratch, which we call CLIP, for Contrastive
Language-Image Pre-training, is an efficient method
of learning from natural language supervision.

> We find that CLIP, similar to the GPT family, learns
to perform a wide set of tasks during pre-training including
OCR, geo-localization, action recognition, and many others.

> We also confirm these findings with linear-probe
representation learning analysis and show that CLIP outperforms
the best publicly available ImageNet model while
also being more computationally efficient. We additionally
find that zero-shot CLIP models are much more robust than
equivalent accuracy supervised ImageNet models which
suggests that zero-shot evaluation of task-agnostic models is
much more representative of a model’s capability.

> At the core of our approach is the idea of learning perception
from supervision contained in natural language.

> Learning from natural language has several potential
strengths over other training methods. It’s much easier
to scale natural language supervision compared to standard
crowd-sourced labeling for image classification since it does
not require annotations to be in a classic “machine learning
compatible format” such as the canonical 1-of-N majority
vote “gold label”. Instead, methods which work on natural
language can learn passively from the supervision contained
in the vast amount of text on the internet.
Learning from natural language also has an important advantage over most
unsupervised or self-supervised learning approaches in that
it doesn’t “just” learn a representation but also connects that
representation to language which enables flexible zero-shot
transfer.

>CLIP is pre-trained to predict if an image and a text snippet
are paired together in its dataset. To perform zero-shot classification,
we reuse this capability. For each dataset, we use
the names of all the classes in the dataset as the set of potential
text pairings and predict the most probable (image, text)
pair according to CLIP.

>In a bit more detail, we first compute
the feature embedding of the image and the feature embedding
of the set of possible texts by their respective encoders.
The cosine similarity of these embeddings is then calculated,
scaled by a temperature parameter τ, and normalized into a
probability distribution via a softmax. Note that this prediction
layer is a multinomial logistic regression classifier with
L2-normalized inputs, L2-normalized weights, no bias, and
temperature scaling.

> When interpreted this way, the image
encoder is the computer vision backbone which computes a
feature representation for the image and the text encoder is a
hypernetwork (Ha et al., 2016) which generates the weights
of a linear classifier based on the text specifying the visual
concepts that the classes represent. Lei Ba et al. (2015) first
introduced a zero-shot image classifier of this form while
the idea of generating a classifier from natural language
dates back to at least Elhoseiny et al. (2013).
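
> The zero-shot prediction steps described above (encode, take cosine similarities, scale by the temperature, softmax) can be sketched as follows. The feature vectors are made-up stand-ins for encoder outputs, and the temperature value is only an example.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def softmax(logits):
    """Turn logits into a probability distribution (max-subtracted for stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_probs(image_feature, class_text_features, temperature):
    """Softmax over temperature-scaled cosine similarities, one entry per class."""
    img = l2_normalize(image_feature)
    logits = []
    for text_feature in class_text_features:
        txt = l2_normalize(text_feature)
        cosine = sum(a * b for a, b in zip(img, txt))
        logits.append(cosine / temperature)
    return softmax(logits)

# Made-up 3-d features standing in for the text and image encoder outputs
class_names = ["cat", "dog", "car"]
text_features = [[1.0, 0.2, 0.0], [0.8, 0.6, 0.1], [0.0, 0.1, 1.0]]
image_feature = [0.9, 0.3, 0.05]

probs = zero_shot_probs(image_feature, text_features, temperature=0.07)
prediction = class_names[probs.index(max(probs))]
print(prediction, probs)
```

> Read as a linear classifier, the normalized text features play the role of the weight rows and there is no bias term, which matches the description of the prediction layer above.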

>> ### PROMPT ENGINEERING AND ENSEMBLING


> Most standard image classification datasets treat the information
naming or describing classes which enables natural
language based zero-shot transfer as an afterthought. The
vast majority of datasets annotate images with just a numeric
id of the label and contain a file mapping these ids back to
their names in English. We've found that using the prompt template
“A photo of a {label}.” is a good default that
helps specify the text is about the content of the image. This
often improves performance over the baseline of using only
the label text.

> We also experimented with ensembling over multiple zero-shot
classifiers as another way of improving performance.
These classifiers are computed by using different context
prompts such as “A photo of a big {label}” and
“A photo of a small {label}”. We construct the
ensemble over the embedding space instead of probability
space. This allows us to cache a single set of averaged text
embeddings so that the compute cost of the ensemble is the
same as using a single classifier when amortized over many
predictions.
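
> Ensembling in embedding space can be sketched as: embed every prompt for a class, normalize, average, and normalize again. The result is a single vector per class that can be cached. The embeddings below are made-up stand-ins for text encoder outputs, and `ensemble_class_embedding` is a hypothetical helper name.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def ensemble_class_embedding(prompt_embeddings):
    """Average the normalized embeddings of several prompts, then renormalize."""
    normalized = [l2_normalize(v) for v in prompt_embeddings]
    dim = len(normalized[0])
    mean = [sum(v[k] for v in normalized) / len(normalized) for k in range(dim)]
    return l2_normalize(mean)

templates = ["A photo of a {}.", "A photo of a big {}.", "A photo of a small {}."]
prompts = [t.format("dog") for t in templates]

# Stand-in for the text encoder: one made-up embedding per prompt
fake_text_embeddings = {
    prompts[0]: [0.9, 0.1, 0.0],
    prompts[1]: [0.8, 0.3, 0.1],
    prompts[2]: [1.0, 0.0, 0.2],
}

# One cached vector for the class "dog", whatever the number of templates
dog_classifier = ensemble_class_embedding([fake_text_embeddings[p] for p in prompts])
print(prompts)
print(dog_classifier)
```

> At prediction time only `dog_classifier` is compared with the image embedding, so adding more templates costs nothing per image once the averaged vector is cached.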

> Due to the large size of our pre-training dataset, over-fitting
is not a major concern and the details of training CLIP are
simplified compared to the implementation of Zhang et al.
(2020). We train CLIP from scratch without initializing the
image encoder with ImageNet weights or the text encoder
with pre-trained weights. We do not use the non-linear
projection between the representation and the contrastive
embedding space, a change which was introduced by Bachman
et al. (2019) and popularized by Chen et al. (2020b).
We instead use only a linear projection to map from each encoder’s
representation to the multi-modal embedding space.
We did not notice a difference in training efficiency between
the two versions and speculate that non-linear projections
may be co-adapted with details of current image only in
self-supervised representation learning methods. We also
remove the text transformation function t_u from Zhang et al.
(2020) which samples a single sentence at uniform from
the text since many of the (image, text) pairs in CLIP’s pre-training
dataset are only a single sentence. We also simplify
the image transformation function t_v. A random square
crop from resized images is the only data augmentation
used during training. Finally, the temperature parameter
which controls the range of the logits in the softmax, τ, is
directly optimized during training as a log-parameterized
multiplicative scalar to avoid tuning it as a hyper-parameter.
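
> A log-parameterized multiplicative scalar means the trainable number is the logarithm of the factor that multiplies the logits. The sketch below shows the idea. The initial temperature of 0.07 and the cap of 100 on the scale come from the training details of the CLIP paper rather than from this writeup, so treat them as assumptions here. The update function is a toy stand-in for an optimizer step.

```python
import math

INITIAL_TEMPERATURE = 0.07  # assumed initial value, as reported in the paper
MAX_LOGIT_SCALE = 100.0     # assumed cap: logits are scaled by at most 100

# The trainable quantity is log(1 / temperature), not the temperature itself.
log_scale = math.log(1.0 / INITIAL_TEMPERATURE)

def logit_scale(log_scale):
    """Multiplicative factor applied to the cosine similarities.

    exp() keeps the factor positive for any real-valued parameter,
    and min() applies the cap.
    """
    return min(math.exp(log_scale), MAX_LOGIT_SCALE)

def toy_update(log_scale, grad, lr=0.1):
    """Plain gradient step on the log-parameter (illustrative only)."""
    return log_scale - lr * grad

print(logit_scale(log_scale))                         # starts at 1 / 0.07
print(logit_scale(toy_update(log_scale, grad=50.0)))  # still positive after a large step
print(logit_scale(10.0))                              # capped at 100.0
```

> Optimizing the logarithm lets gradient descent move the scale freely without ever producing a negative or zero temperature.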

> CLIP has a wide range of capabilities due to its ability to
carry out arbitrary image classification tasks. One can give
it images of cats and dogs and ask it to classify cats, or give
it images taken in a department store and ask it to classify
shoplifters–a task with significant social implications and
for which AI may be unfit. Like any image classification
system, CLIP’s performance and fitness for purpose need to
be evaluated, and its broader impacts analyzed in context.
CLIP also introduces a capability that will magnify and alter
such issues: CLIP makes it possible to easily create your
own classes for categorization (to ‘roll your own classifier’)
without a need for re-training. This capability introduces
challenges similar to those found in characterizing other,
large-scale generative models like GPT-3 (Brown et al.,
2020); models that exhibit non-trivial zero-shot (or few-shot)
generalization can have a vast range of capabilities,
many of which are made clear only after testing for them.

> Our studies of CLIP in a zero-shot setting show that the
model displays significant promise for widely-applicable
tasks like image retrieval or search. For example, it can find
relevant images in a database given text, or relevant text
given an image. Further, the relative ease of steering CLIP
toward bespoke applications with little or no additional data
or training could unlock a variety of novel applications that
are hard for us to envision today, as has occurred with large
language models over the past few years.

However, CLIP does unlock a certain aspect of usability
given how it removes the need for training data.

> CLIP is instead focused
on learning visual models from scratch via natural
language supervision and does not densely connect the two
domains with a joint attention model.

 
> The only interaction
in a CLIP model between the image and text domain is a
single dot product in a learned joint embedding space.
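
> Since images and text only meet through a dot product, image embeddings can be computed once and reused for every text query. The sketch below ranks a tiny, made-up image index against a query embedding. The file names, vectors, and `rank_images` helper are all hypothetical.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def rank_images(text_embedding, image_index):
    """Return image names sorted by cosine similarity to the query, best first."""
    query = l2_normalize(text_embedding)
    scored = []
    for name, emb in image_index.items():
        score = sum(a * b for a, b in zip(query, emb))  # the single dot product
        scored.append((score, name))
    scored.sort(reverse=True)
    return [name for _, name in scored]

# Made-up image embeddings, normalized once when the index is built
raw_index = {
    "beach.jpg": [0.9, 0.1, 0.1],
    "forest.jpg": [0.1, 0.9, 0.2],
    "city.jpg": [0.1, 0.2, 0.9],
}
image_index = {name: l2_normalize(emb) for name, emb in raw_index.items()}

# Stand-in for the text encoder output of a query such as "a photo of trees"
query_embedding = [0.2, 1.0, 0.1]
ranking = rank_images(query_embedding, image_index)
print(ranking)
```

> Swapping the roles of the two modalities (an image query against cached text embeddings) gives text retrieval with the same code.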