# Natural Language Processing

### Multimodal NLP: Contrastive Methods, CLIP

<br><br>
Prof. Iacopo Masi and Prof. Stefano Faralli

In [3]:
import matplotlib.pyplot as plt
import scipy
import random
import numpy as np
import pandas as pd
pd.set_option('display.colheader_justify', 'center')

In [4]:
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
#plt.style.use('seaborn-whitegrid')

font = {'family' : 'Times',
        'weight' : 'bold',
        'size'   : 12}

matplotlib.rc('font', **font)


# Aux functions

def plot_grid(Xs, Ys, axs=None):
    ''' Aux function to plot a grid'''
    t = np.arange(Xs.size) # define progression of int for indexing colormap
    if axs:
        axs.plot(0, 0, marker='*', color='r', linestyle='none') #plot origin
        axs.scatter(Xs,Ys, c=t, cmap='jet', marker='.') # scatter x vs y
        axs.axis('scaled') # axis scaled
    else:
        plt.plot(0, 0, marker='*', color='r', linestyle='none') #plot origin
        plt.scatter(Xs,Ys, c=t, cmap='jet', marker='.') # scatter x vs y
        plt.axis('scaled') # axis scaled
        
def linear_map(A, Xs, Ys):
    '''Map src points with A'''
    # [NxN,NxN] -> NxNx2 # add 3-rd axis, like adding another layer
    src = np.stack((Xs,Ys), axis=Xs.ndim)
    # flatten first two dimension
    # (NN)x2
    src_r = src.reshape(-1,src.shape[-1]) #ask reshape to keep last dimension and adjust the rest
    # 2x2 @ 2x(NN)
    dst = A @ src_r.T # 2xNN
    #(NN)x2 and then reshape as NxNx2
    dst = (dst.T).reshape(src.shape)
    # Access X and Y
    return dst[...,0], dst[...,1]


def plot_points(ax, Xs, Ys, col='red', unit=None, linestyle='solid'):
    '''Plots points'''
    ax.set_aspect('equal')
    ax.grid(True, which='both')
    ax.axhline(y=0, color='gray', linestyle="--")
    ax.axvline(x=0, color='gray',  linestyle="--")
    ax.plot(Xs, Ys, color=col)
    if unit is None:
        plotVectors(ax, [[0,1],[1,0]], ['gray']*2, alpha=1, linestyle=linestyle)
    else:
        plotVectors(ax, unit, [col]*2, alpha=1, linestyle=linestyle)

def plotVectors(ax, vecs, cols, alpha=1, linestyle='solid'):
    '''Plot set of vectors.'''
    for i in range(len(vecs)):
        x = np.concatenate([[0,0], vecs[i]])
        ax.quiver([x[0]],
                   [x[1]],
                   [x[2]],
                   [x[3]],
                   angles='xy', scale_units='xy', scale=1, color=cols[i],
                   alpha=alpha, linestyle=linestyle, linewidth=2)

## My own latex definitions

$$\def\mbf#1{\mathbf{#1}}$$
$$\def\bmf#1{\boldsymbol{#1}}$$
$$\def\bx{\mbf{x}}$$
$$\def\bxt#1{\mbf{x}_{\text{#1}}}$$
$$\def\bv{\mbf{v}}$$
$$\def\bz{\mbf{z}}$$
$$\def\bmu{\bmf{\mu}}$$
$$\def\bsigma{\bmf{\Sigma}}$$
$$\def\Rd#1{\in \mathbb{R}^{#1}}$$
$$\def\chain#1#2{\frac{\partial #1}{\partial #2}}$$
$$\def\loss{\mathcal{L}}$$
$$\def\params{\bmf{\theta}}$$


# This week's lectures
## - Generative vs Discriminative Models
## - Contrastive Methods
## - Multimodal NLP: NLP as supervision for the Visual domain
## - CLIP
## - unCLIP (DALL·E 2)

# This lecture material is taken from

📘 **Mainly from research papers**

### Contrastive methods

- [A Simple Framework for Contrastive Learning of Visual Representations](https://arxiv.org/pdf/2002.05709.pdf) **SimCLR** (vision)
- [Sampling Matters in Deep Embedding Learning](https://openaccess.thecvf.com/content_ICCV_2017/papers/Wu_Sampling_Matters_in_ICCV_2017_paper.pdf) **Contrastive method: geometric/non-probabilistic** (vision)
- [Learning Visual Features from Large Weakly Supervised Data](https://arxiv.org/pdf/1511.02251.pdf) **CLIP precursor (vision/nlp)**
- [Learning Visual Features from Large Weakly Supervised Data (slides)](https://cs.nyu.edu/~fergus/teaching/vision/9_detection_pt2.pdf) (vision/nlp)
- [Contrastive Learning of Medical Visual Representations from Paired Images and Text](https://arxiv.org/pdf/2010.00747.pdf) **The CLIP idea comes from this paper (medical field)** (vision/medical)
- [Learning Transferable Visual Models From Natural Language Supervision](http://proceedings.mlr.press/v139/radford21a/radford21a.pdf) **CLIP paper** (vision/nlp)
- [Hierarchical Text-Conditional Image Generation with CLIP Latents](https://arxiv.org/pdf/2204.06125.pdf) **unCLIP (DALL·E 2)** (vision/nlp)

### Non-contrastive methods
- [Exploring Simple Siamese Representation Learning](https://openaccess.thecvf.com/content/CVPR2021/papers/Chen_Exploring_Simple_Siamese_Representation_Learning_CVPR_2021_paper.pdf) **SimSiam** (vision)
- [Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning](https://arxiv.org/pdf/2006.07733.pdf) **BYOL** (vision)

# Generative vs Discriminative Models in Machine Learning

### Learning from complex distributions is common to both vision and NLP (curse of dimensionality)

# Vision: high-dimensional, continuous data

$$ p(\mbf{x} )$$
<br/>
<div align='center'><img src="figs/noise.png" width='70%' ></div>



# NLP: combinatorial/compositional problem of discrete symbols

$$ p(w_1, \ldots, w_t) $$
```
pot and enough deep secret rat thunder black industry answer death material angle crime probable nation debt organization spade acid soup reward free circle west forward board bone substance parcel south scissors move window hanging needle sticky pipe table old river attack design expert```

# <ins>Unknown</ins> density of the data  

$p_{\text{data}}(\bx)$

<div align='center'><img src="figs/hinton_simple_framework/data_density.png" width='80%' ></div>

<small>Picture from [Yang-Song Blog](https://yang-song.net/blog/2021/score/)</small>

# <ins>Known</ins> data samples dataset

$\{\mbf{x}_i\} \sim p_{\text{data}}(\bx)$
<div align='center'><img src="figs/hinton_simple_framework/samples.png" width='80%' ></div>

<small>Picture from [Yang-Song Blog](https://yang-song.net/blog/2021/score/)</small>

# Generative vs Discriminative Models

The objective of **generative models** is to learn an approximation of the data density given the data samples (dataset):

$$ p_{\text{data}}(\bx) \approx p_{\theta}(\bx)$$

$$ p_{\text{data}}(\bx,y) \approx p_{\theta}(\bx,y)$$

The objective of **discriminative models** is to learn a decision boundary that "separates" data samples in order to perform classification.

$$p(y = \text{class}~k | \bx)$$

# Generative

<div align='center'><img src="figs/hinton_simple_framework/density_jem.png" width='30%' ></div>

# Discriminative

<div align='center'><img src="figs/hinton_simple_framework/class_density_jem.png" width='30%' ></div>

# Generative and Discriminative are Connected

Think of **"classes" as latent factors** in the data: the class label just "reveals" the latent factor. <ins>[There can be, and there are, other latent factors.]</ins>
Think of it as "slicing" the data density using the latent factor ("coloring" the data density).

<div align='center'><img src="figs/hinton_simple_framework/joint_density_jem.png" width='15%' ></div>

$$ p_{\theta}(\bx) = \sum_{y^{\prime}} p_{\theta}(\bx,y^{\prime}) \qquad \text{marginalization}$$

$$ p_{\theta}(\bx,y) = \underbrace{p(\bx|y)}_{\text{class-cond. density}}p(y) = \underbrace{p(y|\bx)}_{\text{discriminative}}p(\bx)  \qquad \text{product rule}$$

# Generative and Discriminative are Connected

$$\underbrace{p(y|\bx)}_{\text{discriminative}}p(\bx) = p_{\theta}(\bx,y) =  \underbrace{p(\bx|y)}_{\text{generative}}p(y)$$

$$\underbrace{p(y|\bx)}_{\text{discriminative}} = \frac{p_{\theta}(\bx,y)}{p(\bx)} =  \frac{p(\bx|y)p(y)}{p(\bx)} = \frac{p(\bx|y)p(y)}{\sum_{y^{\prime}} p(\bx|y^{\prime})p(y^{\prime})}$$

# From Generative to Discriminative
If you only need to classify (and do not need the probabilities themselves):

$$\arg\max_{y^{\prime}} p(y^{\prime}|\bx)$$
Then you can:
1. Model/estimate the class-conditional data density $p(\bx|y)$ _(think of the data density "restricted" to one latent factor, a class)._
2. Estimate the prior $p(y)$, i.e. how probable you think each latent factor (class) is.
3. Classify as:

$$ \arg\max_{y^{\prime}} p(\bx|y^{\prime})p(y^{\prime}) \qquad \text{Given that} \quad p(y|\bx) \propto p(\bx|y)p(y)$$
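The three steps above can be sketched in a few lines of NumPy. This is only an illustrative sketch: the isotropic Gaussian class-conditional densities, the toy means, variances and priors, and the helper names `gaussian_log_density` and `classify_generative` are all assumptions made for the example, not part of the lecture material.

```python
import numpy as np

def gaussian_log_density(x, mean, var):
    """Log of an isotropic Gaussian density N(x; mean, var * I)."""
    d = x.shape[-1]
    sq_dist = np.sum((x - mean) ** 2, axis=-1)
    return -0.5 * (d * np.log(2.0 * np.pi * var) + sq_dist / var)

def classify_generative(x, means, variances, priors):
    """Return argmax_y p(x|y) p(y) together with the posterior p(y|x)."""
    # log p(x|y) + log p(y) for every class y: the unnormalised log posterior.
    log_joint = np.array([
        gaussian_log_density(x, m, v) + np.log(p)
        for m, v, p in zip(means, variances, priors)
    ])
    # Dividing by p(x) = sum_y p(x|y) p(y) does not change the argmax.
    shifted = log_joint - log_joint.max()
    posterior = np.exp(shifted) / np.exp(shifted).sum()
    return int(np.argmax(log_joint)), posterior

# Toy setup (assumed values): two 2-D classes with equal priors.
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
variances = [1.0, 1.0]
priors = [0.5, 0.5]

label, posterior = classify_generative(np.array([2.5, 2.0]), means, variances, priors)
print(label, posterior)
```

The query point is much closer to the second mean, so the second class wins; the posterior is only computed to show that normalising by $p(\bx)$ is optional when all you need is the decision.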

<div align='center'><img src="figs/hinton_simple_framework/joint_density_jem.png" width='20%' ></div>

<div align='center'><img src="figs/hinton_simple_framework/all_densities.png" width='50%' ></div>

<small>Picture from [YOUR CLASSIFIER IS SECRETLY AN ENERGY BASED
MODEL AND YOU SHOULD TREAT IT LIKE ONE](https://openreview.net/pdf?id=Hkxzx0NtDB)</small>

# Where is the connection with NLP?

**LM (language modeling)** is about learning $p(w_1, \ldots, w_T)$ (thus **generative**), but it is implemented as **a discriminative classifier** that predicts a pmf over the next word $w_{t}$ given the prefix $w_1, \ldots, w_{t-1}$.

$$ p(w_1, \ldots, w_T) = \prod_{t=1}^T p(w_{t}|w_1, \ldots, w_{t-1})$$
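To make the factorization concrete, here is a small sketch in which a sequence probability is computed as a product of per-step "classifier" outputs. It assumes a toy first-order (bigram) model with random parameters, which is a simplification: a real language model conditions on the whole prefix. The names `p_first`, `p_next` and `sequence_log_prob` are hypothetical.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
V = 4  # toy vocabulary size (assumption)

# Hypothetical model: p(w_1) and p(w_t | w_{t-1}).
p_first = rng.dirichlet(np.ones(V))          # shape (V,)
p_next = rng.dirichlet(np.ones(V), size=V)   # row i is the pmf p(. | w_{t-1} = i)

def sequence_log_prob(words):
    """log p(w_1..w_T) = log p(w_1) + sum_t log p(w_t | w_{t-1})."""
    log_p = np.log(p_first[words[0]])
    for prev, cur in zip(words[:-1], words[1:]):
        log_p += np.log(p_next[prev, cur])
    return log_p

# Multiplying per-step pmfs yields a valid joint distribution:
# summing over every length-3 sequence gives total mass 1.
total = sum(
    np.exp(sequence_log_prob(seq))
    for seq in itertools.product(range(V), repeat=3)
)
print(total)
```

Each factor is a discriminative prediction over the vocabulary, yet their product is a generative model of whole sequences.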

# Do autoregressive models exist in vision?

<div align='center'><img src="figs/pixelCNN_00.png" width='70%' ></div>

<div align='center'><img src="figs/pixelCNN_01.png" width='70%' ></div>

<div align='center'><img src="figs/pixelCNN_02.png" width='70%' ></div>

# Contrastive Methods

# Can you remember in which part of the course we have seen a Contrastive method?

# Scaling word2vec with....

# Negative Sampling!

# word2vec with Skip-Gram at a glance

... and why it can be seen as a tiny neural net.

<div align='center'><img src="figs/word2vec_layers.png" width='65%' ></div>

# Skip-gram

Instead of doing:

1. **Center word vs ground-truth context embedding** $\longrightarrow \bmf{\theta}_{C}[gt]\cdot\bmf{\theta}_{W}[i]^T$
2. normalize as a distribution: **all context vs center word** $\longrightarrow \sum_{v=1}^{V} \exp \big(\bmf{\theta}_{C}[v]\cdot\bmf{\theta}_{W}[i]^T\big)$

$$
\mathcal{L}(w_{t-1},w_{t};\mbf{\theta}) = \underbrace{-\bmf{\theta}_{C}[gt]\cdot\bmf{\theta}_{W}[i]^T}_{\text{similarity center vs context}} + \underbrace{\log\Big(\sum_{v=1}^{V} \exp \big(\bmf{\theta}_{C}[v]\cdot\bmf{\theta}_{W}[i]^T\big)\Big)}_{\text{make sure it is a probability}}
$$
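A minimal NumPy sketch of this loss for a single (center, context) pair, assuming random toy embeddings and a hypothetical helper `skipgram_softmax_loss`. It is meant to show where the cost comes from: the log-partition term touches every one of the $V$ context embeddings.

```python
import numpy as np

def skipgram_softmax_loss(theta_W, theta_C, center, context):
    """-score(gt) + log sum_v exp(score(v)) for one (center, context) pair."""
    scores = theta_C @ theta_W[center]   # (V,): one score per vocabulary word
    # Log-sum-exp over the WHOLE vocabulary: this is the O(V) bottleneck.
    m = scores.max()
    log_partition = m + np.log(np.sum(np.exp(scores - m)))
    return -scores[context] + log_partition

rng = np.random.default_rng(0)
V, D = 1000, 16                                # assumed toy sizes
theta_W = rng.normal(scale=0.1, size=(V, D))   # center-word embeddings
theta_C = rng.normal(scale=0.1, size=(V, D))   # context-word embeddings

loss = skipgram_softmax_loss(theta_W, theta_C, center=3, context=7)
print(loss)
```

Since the loss is a negative log-probability, `exp(-loss)` summed over all possible context words equals 1, which is exactly what the second term guarantees.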

# Two solutions to approximate the denominator


1. **Negative sampling (Contrastive method)**
2. Hierarchical Softmax (Tree-based solution)

# Scaling word2vec with Negative Sampling

<br>
<div align='center'><img src="figs/positive.png" width='35%' ></div>

<br>
<div align='center'><img src="figs/negative.png" width='55%' ></div>

# Push up positive and push down negatives

$$ \min_{\params} \underbrace{-\log \sigma \left(\params_{C}[gt]^T\params_W[i]\right)}_{\text{push up positive prob.}} - \underbrace{\sum_{k=1}^K \log \big[ \sigma\left(-\params_{C}[k]^T\params_W[i]\right)\big]}_{\text{push down negative prob.}}$$
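The objective above can be sketched as follows. The embeddings are random toy values and the negatives are drawn uniformly for simplicity (word2vec itself draws them from a smoothed unigram distribution); `negative_sampling_loss` and `log_sigmoid` are hypothetical helper names.

```python
import numpy as np

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z)) = -log(1 + exp(-z))."""
    return -np.logaddexp(0.0, -z)

def negative_sampling_loss(theta_W, theta_C, center, context, negatives):
    w = theta_W[center]
    pos = log_sigmoid(theta_C[context] @ w)               # push up the positive pair
    neg = log_sigmoid(-(theta_C[negatives] @ w)).sum()    # push down the K negatives
    return -(pos + neg)

rng = np.random.default_rng(0)
V, D, K = 1000, 16, 5                          # assumed toy sizes
theta_W = rng.normal(scale=0.1, size=(V, D))
theta_C = rng.normal(scale=0.1, size=(V, D))
negatives = rng.integers(0, V, size=K)         # uniform sampling: a simplification

loss = negative_sampling_loss(theta_W, theta_C, center=3, context=7, negatives=negatives)
print(loss)
```

Only $K+1$ context embeddings are touched per update, instead of all $V$ as in the full softmax.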

# Visualization


<div align='center'><img src="figs/negative_params.png" width='65%' ></div>

# Why do we need the negatives?

<div align='center'><img src="figs/why_negatives.png" width='65%' ></div>

# Other forms of Contrastive Methods

1. ~Contrastive loss implemented with logistic function and negative sampling [word2vec]~ (nlp)
2. Contrastive loss implemented with Softmax and large mini-batches (vision)
3. Contrastive loss with geometric interpretation and margin (vision)

# Contrastive loss implemented with Softmax and large mini-batches 
## Method name: SimCLR
## Vision

<div align='center'><img src="figs/hinton_simple_framework/title.png" width='85%' ></div>

# [ImageNet](https://www.image-net.org/index.php): A [computer] vision dataset
<br>

> ImageNet is an image database organized according to the WordNet hierarchy (currently only the nouns), in which each node of the hierarchy is depicted by hundreds and thousands of images. The project has been instrumental in advancing computer vision and deep learning research. The data is available for free to researchers for non-commercial use.


<div align='center'><img src="figs/imagenet_banner.jpeg" width='45%' ></div>


# [ImageNet](https://www.image-net.org/index.php): A [computer] vision dataset
<br>
<div align='center'><img src="figs/imagenet_samples.png" width='65%' ></div>

<small>Source: Ye, Tengqi. 2018. "Visual Object Detection from Lifelogs using Visual Non-lifelog Data." Researchgate, January. Accessed 2019-06-20. </small>

# Performance: Supervised vs Self-Supervised

<div align='center'><img src="figs/hinton_simple_framework/performance.png" width='45%' ></div>

# Take away from the paper
<br>
<div align='center'><img src="figs/hinton_simple_framework/takeaway.png" width='45%' ></div>


# Main Idea

<div align='center'><img src="figs/hinton_simple_framework/idea.png" width='45%' ></div>


# SimCLR Contrastive Loss
<br><br>
<div align='center'><img src="figs/hinton_simple_framework/loss_1.png" width='100%' ></div>

### Normalized Temperature-scaled Cross-Entropy Loss
<br>
<div align='center'><img src="figs/hinton_simple_framework/loss_2.png" width='100%' ></div>
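Since the loss appears only as a figure here, the following is a hedged NumPy sketch of the NT-Xent form described in the SimCLR paper: cosine similarities divided by a temperature, followed by a softmax cross-entropy in which the other view of the same image is the target. The batch layout (rows `2k` and `2k+1` are the two views of image `k`), the default temperature and the name `nt_xent_loss` are assumptions of this example.

```python
import numpy as np

def nt_xent_loss(z, temperature=0.5):
    """NT-Xent over 2N embeddings; rows 2k and 2k+1 are two views of image k."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit norm -> dot product is cosine
    sim = (z @ z.T) / temperature                      # (2N, 2N) scaled similarities
    np.fill_diagonal(sim, -np.inf)                     # a sample is never its own negative
    n = z.shape[0]
    positives = np.arange(n) ^ 1                       # partner index: 0<->1, 2<->3, ...
    m = sim.max(axis=1, keepdims=True)
    log_denominator = m[:, 0] + np.log(np.exp(sim - m).sum(axis=1))
    log_prob = sim[np.arange(n), positives] - log_denominator
    return -log_prob.mean()

rng = np.random.default_rng(0)
base = rng.normal(size=(8, 32))
aligned = np.repeat(base, 2, axis=0)      # the two views of each image coincide
unrelated = rng.normal(size=(16, 32))     # the "views" share nothing

loss_aligned = nt_xent_loss(aligned)
loss_unrelated = nt_xent_loss(unrelated)
print(loss_aligned, loss_unrelated)
```

All other samples in the mini-batch act as negatives through the denominator, which is why the batch size matters for this loss.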

# Data Augmentation
<br>
<div align='center'><img src="figs/hinton_simple_framework/data_aug.png" width='70%' ></div>

# Data Augmentation: A way to perturb the original data
## Create new, meaningful samples that you may find in the <ins>original data distribution</ins>
Red box = actually used for training
<div align='center'><img src="figs/hinton_simple_framework/data_aug_2.png?1" width='60%' ></div>

# Comparison of loss functions
<br>
<div align='center'><img src="figs/hinton_simple_framework/loss_comparison.png" width='40%' ></div>

# Comparison of loss functions
<br>
<div align='center'><img src="figs/hinton_simple_framework/loss_comparison_b.png?1" width='40%' ></div>


# Comparison of loss functions
<br>
<div align='center'><img src="figs/hinton_simple_framework/loss_comparison_c.png?1" width='90%' ></div>

# SimCLR: main algorithm


<br>
<div align='center'><img src="figs/hinton_simple_framework/algo.png?1" width='35%' ></div>

# Results: Linear classifier on top of SimCLR representation

<br>

<div align='center'><img src="figs/hinton_simple_framework/exp_a.png" width='50%' ></div>
<small>$\times 4$ denotes a network width multiplier of 4</small>

# Results: Models trained with few labels

<br>
<div align='center'><img src="figs/hinton_simple_framework/exp_b.png" width='50%' ></div>

# Results: Results across datasets

<br>
<div align='center'><img src="figs/hinton_simple_framework/exp_c.png" width='85%' ></div>

# Other forms of Contrastive Methods

1. ~Contrastive loss implemented with logistic function and negative sampling [word2vec]~ (nlp)
2. ~Contrastive loss implemented with Softmax and large mini-batches (vision)~
3. Contrastive loss with geometric interpretation and margin (vision)

# Contrastive loss with geometric interpretation and margins (vision)

<div align='center'><img src="figs/sampling_matters_smola/title.png" width='75%' ></div>

# Neural Net as an embedding function

Let $f_{\theta}(\bx_i)$ be the embedding of a high-dimensional data point $\bx_i \in \mathbb{R}^N$ (e.g. an image), where $f_{\theta} : \mathbb{R}^N \to \mathbb{R}^D$ is a differentiable deep network with parameters $\theta$.



## Geometric idea

Our goal is to learn an embedding that **keeps similar data points close, while pushing dissimilar datapoints apart**. Note that distance is measured in the embedding space.

$$ D_{ij}= \vert\vert f(\bx_i) - f(\bx_j) \vert\vert_2 $$
<div align='center'><img src="figs/sampling_matters_smola/deep_embedding.png" width='35%' ></div>

For any pair of datapoints:
- if $y_{ij} = 1$ (**positive pair**), this distance should be small
- if $y_{ij} = 0$ (negative pair), it should be large

# Contrastive Loss

The contrastive loss directly optimizes this distance by:
- encouraging **all positive distances to approach 0**
- while **keeping negative distances above a certain threshold** $\alpha$
<br><br>
<div align='center'><img src="figs/sampling_matters_smola/contrastive_loss.png?1" width='95%' ></div>

**where** **$[\cdot]_{+}=\max(0,\cdot)$**

The right-hand term reads as: if a negative pair is already at least $\alpha$ apart, then **do not pay a penalty.**
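A small sketch of this pairwise loss, assuming the form $y_{ij} D_{ij}^2 + (1-y_{ij})[\alpha - D_{ij}]_+^2$ from the "Sampling Matters" paper. The function name `contrastive_loss`, the toy points and the margin value are illustrative assumptions.

```python
import numpy as np

def contrastive_loss(emb_i, emb_j, y, alpha=1.0):
    """Per-pair loss: y * D^2 + (1 - y) * max(0, alpha - D)^2."""
    d = np.linalg.norm(emb_i - emb_j, axis=1)                   # D_ij in embedding space
    positive_term = y * d ** 2                                  # pull positives towards D = 0
    negative_term = (1 - y) * np.maximum(0.0, alpha - d) ** 2   # push negatives beyond alpha
    return positive_term + negative_term

emb_i = np.zeros((3, 2))
emb_j = np.array([[2.0, 0.0],    # negative pair, already farther than alpha
                  [0.5, 0.0],    # negative pair, inside the margin
                  [2.0, 0.0]])   # positive pair, far apart
y = np.array([0, 0, 1])

losses = contrastive_loss(emb_i, emb_j, y, alpha=1.0)
print(losses)
```

The first pair pays nothing because it already respects the margin, the second pays for violating it, and the third pays for being a distant positive.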


<br><div align='center'><img src="figs/word2vec_contrastive.png?1" width='65%' ></div>

# Contrastive loss graph (loss as a function of distance)
The solid blue lines show the loss function for positive pairs, the dotted green lines for negative pairs.
<br><div align='center'><img src="figs/sampling_matters_smola/contrastive_loss_graph.png" width='35%' ></div>

# Contrastive loss graph (in the embedding space)

<br><div align='center'><img src="figs/sampling_matters_smola/contrastive_loss_graph_2.png" width='35%' ></div>

# Contrastive loss limitations

- One drawback of the contrastive loss is that **we have to select a <ins>constant margin $\alpha$</ins> for all pairs of negative samples.** This implies that visually diverse classes are embedded in the same small space as visually similar ones. The embedding space does not allow for distortions.

- It **scales quadratically** with the number of sampled points, since the loss is defined over pairs.
    - this is why `SimCLR` works better with a large batch size $\rightarrow$ you have more negative training samples to constrain the search space
    - `SimCLR` takes negative samples from the large mini-batch (**negative samples appear in the denominator of the SimCLR loss**).

# Triplet Loss

**Idea:** Compare pairs relative to each other, instead of considering each pair in isolation.

- So we need **two pairs (a positive pair and a negative pair).**
- Keep **every positive closer to the anchor than any negative.**

This formulation allows the embedding space to be arbitrarily distorted and does not impose a constant margin $\alpha$.

<div align='center'><img src="figs/sampling_matters_smola/triplet_loss.png?1" width='35%' ></div>

<small>If not wrong, it was "invented" at USC</small>

# Two Pairs

Although we have two pairs, we have **"3 actors"** in total (**3 embeddings**):

$$\bx_p \quad \text{positive} \qquad \bx_a \quad \text{anchor} \qquad  \bx_n \quad \text{negative}$$

- Take an embedding $\bx_p$ and pair it with another embedding $\bx_a$ of the same class.
    - Call $\bx_p$ **the positive** and $\bx_a$ (which is another positive) **the anchor.**
    - The assumption is that the latent factor of $\bx_p$ is equal to the latent factor of $\bx_a$.
- Then form the negative pair by pairing the anchor $\bx_a$ with a point $\bx_n$ from a different class.

# Triplet Loss


<div align='center'><img src="figs/sampling_matters_smola/good_config_triplet.png" width='95%' ></div>

# Triplet Loss


<div align='center'><img src="figs/sampling_matters_smola/bad_config_triplet.png" width='95%' ></div>

# Triplet loss (relative margin)
<br>
<div align='center'><img src="figs/sampling_matters_smola/triplet_loss_margin.png" width='55%' ></div>
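A short sketch of the triplet loss with a relative margin, assuming the form $[D_{ap}^2 - D_{an}^2 + \alpha]_+$ used in the "Sampling Matters" paper. The helper name `triplet_loss`, the toy points and the margin value are illustrative assumptions.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, alpha=0.2):
    """Per-triplet loss: max(0, D_ap^2 - D_an^2 + alpha)."""
    d_ap = np.linalg.norm(anchor - positive, axis=1)   # anchor-positive distance
    d_an = np.linalg.norm(anchor - negative, axis=1)   # anchor-negative distance
    return np.maximum(0.0, d_ap ** 2 - d_an ** 2 + alpha)

anchor = np.zeros((2, 2))
# Row 0: good configuration (positive near the anchor, negative far away).
# Row 1: bad configuration (the roles of the two points are swapped).
positive = np.array([[0.1, 0.0], [2.0, 0.0]])
negative = np.array([[2.0, 0.0], [0.1, 0.0]])

losses = triplet_loss(anchor, positive, negative, alpha=0.2)
print(losses)
```

Only the relative ordering of the two distances matters, so a triplet in a good configuration contributes zero loss regardless of the absolute distances.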

# Triplet loss Limitations

- Now you have to sample **triplets!** **<ins>Cubic</ins> sampling complexity**

- Moreover, once the network converges, most samples contribute only in a minor way, as very few of the negative margins are violated

## Hacks:

For the contrastive loss, hard negative mining usually offers faster convergence. For the triplet loss, it is less obvious, as hard negative mining often leads to collapsed models, i.e. all images have the same embedding.

# Distance distribution in high-dimension

<br>
<div align='center'><img src="figs/sampling_matters_smola/distance_high_dim.png" width='55%' ></div>


<br>
<div align='center'><img src="figs/sampling_matters_smola/distance_dist_1.png" width='90%' ></div>


<br>
<div align='center'><img src="figs/sampling_matters_smola/distance_dist_2.png" width='90%' ></div>


# We do inverse transform sampling on $q^{-1}$

Instead of sampling negatives uniformly at random, which is likely to return points that are about $\sqrt{2}$ away, we sample them uniformly with respect to distance, i.e. with weights proportional to $q(d)^{-1}$.
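The concentration of distances can be checked numerically. The sketch below draws points uniformly on a unit sphere, measures all pairwise distances, and then re-samples pairs with weights proportional to $1/q(d)$. Here $q$ is estimated with a histogram, which is a simplification made for this example (the paper works with the analytic density and also clips the weights); the dimension, number of points and bin count are assumed values.

```python
import numpy as np

rng = np.random.default_rng(0)
n_points, dim = 500, 128                          # assumed toy sizes
x = rng.normal(size=(n_points, dim))
x /= np.linalg.norm(x, axis=1, keepdims=True)     # uniform on the unit sphere

# Pairwise distances from the Gram matrix: ||a - b||^2 = 2 - 2 a.b for unit vectors.
sq = np.clip(2.0 - 2.0 * (x @ x.T), 0.0, None)
i, j = np.triu_indices(n_points, k=1)
dists = np.sqrt(sq[i, j])
mean_dist, std_dist = dists.mean(), dists.std()

# Distance-weighted sampling: weight each pair by 1 / q(d).
density, edges = np.histogram(dists, bins=50, density=True)
bin_idx = np.clip(np.digitize(dists, edges) - 1, 0, len(density) - 1)
weights = 1.0 / density[bin_idx]
weights /= weights.sum()
sampled = rng.choice(dists, size=2000, p=weights)

print(mean_dist, std_dist, sampled.std())
```

Uniform sampling yields distances tightly packed around $\sqrt{2}$, while the re-weighted sample spreads over the whole observed range of distances.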

<br>
<div align='center'><img src="figs/sampling_matters_smola/dist_weight.png" width='45%' ></div>

<br>
<div align='center'><img src="figs/sampling_matters_smola/dist_weight_fig.png" width='95%' ></div>

# Double-margin Contrastive Loss

<br>
<div align='center'><img src="figs/sampling_matters_smola/double_margin.png" width='75%' ></div>

# Margin-based Contrastive loss



<br>
<div align='center'><img src="figs/sampling_matters_smola/margin-based.png" width='45%' ></div>
<small><b>Note:</b> They are learning $\beta$!</small>

# Loss at a glance



<br>
<div align='center'><img src="figs/sampling_matters_smola/final-plot.png" width='95%' ></div>

# Break

# Toward CLIP (Contrastive Language-Image Pre-Training)

<br>
<div align='center'><img src="figs/learn_from_weak_sup_maaten/title.png" width='95%' ></div>

<small>**Note:** this paper predates Transformers (circa 2015).</small>

# Idea: Instead of Self-Supervised..... "Text" Supervised


> In this paper, we explore the potential of leveraging massive, weakly labeled image collections for learning good visual features.


- We train convolutional networks on a dataset of 100 million
Flickr photos and captions, and show that these networks
produce features that **perform well in a range of vision problems**
-  We also show that the networks appropriately capture **word similarity**, and learn correspondences between different languages

# Image Classification
## Note the discriminative model: $p(y|x)$

Given $\bx$, compute a distribution over the labels $p(y|\bx)$.

<br>
<div align='center'><img src="figs/clip/cnn.png?1" width='75%' ></div>

# Problem: Who is going to label this image as "plane"?

# Use Supervision from the Web [Weak Supervision]

Research question: 
> Can we learn high quality visual features from scratch without using any fully
supervised data?
<br>
<div align='center'><img src="figs/clip/sup_web.png" width='30%' ></div>

 **Flickr 100M dataset** contains ~100M photos with associated "captions"

# Weakly Supervised Image classification

<br>
<div align='center'><img src="figs/clip/bow_loss_1.png" width='85%' ></div>

# Weakly Supervised Image classification

- We treat each individual word in a photo's caption as **a target for that photo**
- That is: a **multi-label learning problem with extremely noisy labels**
- We map an image $\bx$ to multiple labels, not a single $y$.

<div align='center'><img src="figs/clip/bow_loss_2.png" width='65%' ></div>

# Text preprocessing
<br>
<div align='center'><img src="figs/clip/text_prep.png?1" width='45%' ></div>

<div align='center'><img src="figs/clip/supervision_1.png" width='65%' ></div>

<br/><div align='center'><img src="figs/clip/supervision_2.png" width='95%' ></div>

# Multi-label Classification problem

**Note:** the word tokens are now **NOT mutually exclusive**, unlike the classes in standard classification,
i.e. if you sum the label vector you do not get unit mass (it is not a probability distribution anymore).
Instead, you get the number of word tokens in the caption.

<div align='center'><img src="figs/clip/supervision_3.png" width='45%' ></div>

# Loss Function

Note: $N$ is the number of samples and $K$ is the number of words in the bag (the vocabulary size).

<br>

<div align='center'><img src="figs/clip/multi_class.png" width='45%' ></div>

# Loss Function

Note: $N$ is the number of samples and $K$ is the number of words in the bag (the vocabulary size).

<br>

<div align='center'><img src="figs/clip/binary_class.png" width='45%' ></div>
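The two losses shown in the figures can be contrasted with a generic sketch: a multi-class (softmax) loss with multi-hot bag-of-words targets, and a one-vs-all loss with an independent binary classifier per word. This is an assumption-laden illustration: any per-class weighting or normalisation used in the paper is omitted, and the names `softmax_bow_loss` and `one_vs_all_loss` are hypothetical.

```python
import numpy as np

def softmax_bow_loss(logits, targets):
    """Multi-class logistic loss with multi-hot targets: -sum_k y_k log softmax_k."""
    m = logits.max(axis=1, keepdims=True)
    log_softmax = logits - m - np.log(np.exp(logits - m).sum(axis=1, keepdims=True))
    return -(targets * log_softmax).sum(axis=1).mean()   # average over the N samples

def one_vs_all_loss(logits, targets):
    """Independent binary logistic loss for each of the K words."""
    log_p = -np.logaddexp(0.0, -logits)      # log sigmoid(logit)
    log_not_p = -np.logaddexp(0.0, logits)   # log (1 - sigmoid(logit))
    return -(targets * log_p + (1 - targets) * log_not_p).sum(axis=1).mean()

# N = 1 image, K = 3 words; the caption contains words 0 and 1.
logits = np.zeros((1, 3))
targets = np.array([[1.0, 1.0, 0.0]])

multi_class = softmax_bow_loss(logits, targets)
binary = one_vs_all_loss(logits, targets)
print(multi_class, binary)
```

With the softmax, the words of a caption compete for the same probability mass; with the one-vs-all loss each word is scored independently, and absent words are penalised explicitly.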

<div align='center'><img src="figs/learn_from_weak_sup_maaten/embed_2015.png" width='85%' ></div>

<br>
<div align='center'><img src="figs/clip/title_1.png" width='75%' ></div>

#  Training pairs
<br>
<div align='center'><img src="figs/clip/ConVIRT_samples.png" width='75%' ></div>

#  ConVIRT: Contrastive Visual Representation Learning from Text
<br>
<div align='center'><img src="figs/clip/ConVIRT_img.png" width='75%' ></div>

#  ConVIRT: Contrastive Visual Representation Learning from Text
<br>
<div align='center'><img src="figs/clip/ConVIRT_loss.png" width='75%' ></div>

#  ConVIRT: Contrastive Visual Representation Learning from Text
<br>
<div align='center'><img src="figs/clip/ConVIRT_loss_2.png" width='75%' ></div>

#  ConVIRT: Realization

### Visual part: ResNet-50 Encoder

Sequential applications of five random transformations: cropping, horizontal flipping, affine transformation, color jittering and Gaussian blur.

### NLP part: BERT encoder

BERT Encoder followed by a max-pooling layer over all output vectors. 
We initialize our encoder with the **ClinicalBERT weights** (Alsentzer et al., 2019) pretrained on the MIMIC clinical notes, which achieved state-of-the-art performance on a suite of clinical NLP tasks.

We apply a simple uniform sampling of a sentence from the input document $\mbf{x}_u$ (i.e., $\tilde{\mbf{x}}_u$ is a randomly sampled sentence from $\mbf{x}_u$ for each minibatch).

# The CLIP paper

<br>
<div align='center'><img src="figs/clip/title_2.png" width='75%' ></div>

# Claim
> We demonstrate that a simplified version of ConVIRT trained from scratch, which we call CLIP, for Contrastive Language-Image Pre-training, is an efficient and scalable method of learning from natural language supervision

# Results

> CLIP learns to perform a wide set of tasks during pre-training including OCR, geo-localization, action recognition, and outperforms the best publicly available ImageNet model while being more computationally efficient. We also find that zero-shot CLIP models are much more robust than equivalent accuracy supervised ImageNet models.

#  Approach

**Learning perception** from the **supervision contained in natural language paired with images.**

# Huge dataset

<br>
<div align='center'><img src="figs/clip/clip_dataset.png" width='40%' ></div>

# Caption Prediction does not work

> Our initial approach, similar to VirTex, jointly trained an
image CNN and text transformer from scratch to predict
the caption of an image

**Note:** BoW refers to the 2015 paper reviewed above.
<br>
<div align='center'><img src="figs/clip/caption_pred_do_not_work.png" width='40%' ></div>

# Details


<div align='center'><img src="figs/clip/details_01.png" width='95%' ></div>


<br>

<div align='center'><img src="figs/clip/details_02.png" width='95%' ></div>

# CLIP Training and Inference

<br>
<div align='center'><img src="figs/clip/clip_idea.png" width='75%' ></div>

# CLIP Classification: Text encoder generates classification "weights"

The text encoder acts as an **on-the-fly generator of classification weights**

$$ p(y|\mbf{x}) \propto \exp\big(\mbf{E}_y\mbf{e}_x\big)$$

The image embedding $\mbf{e}_x$ is given. $\forall$ class $y$, we encode the class name as text and "fill" one row of the matrix $\mbf{E}_y$ with the resulting embedding. Then we classify as:

$$\arg\max_{y} \mbf{E}_y\mbf{e}_x $$

<br>
<div align='center'><img src="figs/clip/clip_idea.png" width='45%' ></div>
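A hedged sketch of this zero-shot classification rule. Random vectors stand in for the outputs of the text and image encoders (no real CLIP model is loaded), and the function name `zero_shot_classify` and the temperature value are assumptions made for the example.

```python
import numpy as np

def zero_shot_classify(image_embedding, class_text_embeddings, temperature=0.01):
    """Classify one image by comparing it with one text embedding per class."""
    e_x = image_embedding / np.linalg.norm(image_embedding)
    E_y = class_text_embeddings / np.linalg.norm(
        class_text_embeddings, axis=1, keepdims=True
    )
    logits = (E_y @ e_x) / temperature    # one cosine-similarity score per class
    logits = logits - logits.max()        # for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(np.argmax(probs)), probs

rng = np.random.default_rng(0)
# Stand-ins for the text-encoder outputs of three class prompts (hypothetical).
E_y = rng.normal(size=(3, 64))
# Stand-in image embedding: close to the embedding of class 2, plus noise.
e_x = E_y[2] + 0.1 * rng.normal(size=64)

pred, probs = zero_shot_classify(e_x, E_y)
print(pred, probs)
```

Adding or removing a class only means adding or removing a row of `E_y`; no classifier has to be retrained.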

# CLIP Training Code

<br>
<div align='center'><img src="figs/clip/clip_training_code.png" width='40%' ></div>
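For readers who prefer runnable code to a figure, here is a minimal NumPy sketch of a symmetric contrastive objective of the kind CLIP is trained with: a cross-entropy over the rows of the image-text similarity matrix (each image picks its text) averaged with one over the columns (each text picks its image). Random arrays stand in for the encoder outputs, and the name `clip_style_loss` and the temperature value are assumptions of this example.

```python
import numpy as np

def log_softmax(logits, axis):
    m = logits.max(axis=axis, keepdims=True)
    return logits - m - np.log(np.exp(logits - m).sum(axis=axis, keepdims=True))

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over an N x N image-text similarity matrix."""
    I = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    T = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = (I @ T.T) / temperature     # entry (i, j) compares image i with text j
    diag = np.arange(logits.shape[0])    # matching pairs sit on the diagonal
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()   # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()   # text -> image
    return 0.5 * (loss_i2t + loss_t2i)

rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 32))
matched = clip_style_loss(emb, emb + 0.01 * rng.normal(size=emb.shape))
mismatched = clip_style_loss(emb, emb[::-1])   # every image paired with a wrong text
print(matched, mismatched)
```

Every other caption in the mini-batch serves as a negative for a given image and vice versa, so no explicit negative mining is needed.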

# CLIP Results

# Zero-Shot Performance
CLIP, with a Transformer that "encodes" the classifier, vs. logistic regression trained on ResNet features.
<br>
<div align='center'><img src="figs/clip/zero-shot.png" width='35%' ></div>

# Zero-Shot Performance
First, CLIP’s zero-shot classifier is **generated via natural language which allows for visual concepts to be directly specified (“communicated”)**. By contrast, “normal” supervised learning must infer concepts **indirectly from training examples**. Context-less example-based learning has the drawback that **many different hypotheses can be consistent with the data**, especially in the one-shot case. A single image often contains many different visual concepts
<br>
<div align='center'><img src="figs/clip/zero-shot-1.png" width='35%' ></div>

# Representation Learning: Distributional Shifts
<br>
<div align='center'><img src="figs/clip/distributional-shifts.png" width='75%' ></div>

# unCLIP (DALL·E 2)

<br>
<div align='center'><img src="figs/dalle2/dalle2.png" width='75%' ></div>

<br>
<div align='center'><img src="figs/dalle2/dalle2_title.png" width='75%' ></div>

# Method
<br>
<div align='center'><img src="figs/dalle2/method_00.png" width='75%' ></div>

# Method
<br>
<div align='center'><img src="figs/dalle2/method_01.png" width='75%' ></div>

<div align='center'><img src="figs/dalle2/method_02.png" width='75%' ></div>

<div align='center'><img src="figs/dalle2/method.png" width='75%' ></div>

# Diffusion Models


<div align='center'><img src="figs/diffusion/Blausen_0315_Diffusion.png" width='75%' ></div>

<small>Based on https://cvpr2022-tutorial-diffusion-models.github.io/</small>

# Map structured data to white noise

<div align='center'><img src="figs/diffusion/forward.png" width='75%' ></div>

# From white noise learn $\mbf{\theta}$ to go back to structured data

<div align='center'><img src="figs/diffusion/backward.png" width='75%' ></div>

# Learning to transform noise to "meaningful data" is generative modeling

# Even GANs map noise to data

<div align='center'><img src="figs/diffusion/GANs.png" width='45%' ></div>

# Data distribution (unknown)

$$ q(\mbf{x}_0) \quad \text{but we have samples from the training set} \quad  \mbf{x}_0 \sim q(\mbf{x}_0)$$

$$ q(\mbf{x}_0) \longrightarrow \mathcal{N}(0,\mbf{I}) $$

# Data perturbation process parametrized by $\beta$

Given what we have $\mbf{x}_0$, we define a distribution over the perturbed output at the next step:

$$ \mbf{x}_1 \sim  q(\mbf{x}_1|\mbf{x}_0) = \mathcal{N}(\sqrt{1-\beta_1}~\mbf{x}_0,\beta_1 \mbf{I})$$

Read it as: generate data centered on $\mbf{x}_0$ scaled by $\sqrt{1-\beta_1}$, with variance $\beta_1$.

# Go recursive

$$ \mbf{x}_t \sim q(\mbf{x}_t|\mbf{x}_{t-1}) =\mathcal{N}(\sqrt{1-\beta_t}~\mbf{x}_{t-1},\beta_t \mbf{I})$$

This means we apply a chain with $T$ time steps: $$\mbf{x}_0 \xrightarrow{\beta_1} \mbf{x}_1 \xrightarrow{\beta_2} \cdots \xrightarrow{\beta_T} \mbf{x}_T,$$ which defines the joint distribution $q(\mbf{x}_0,\ldots,\mbf{x}_T)$.

# Joint Distribution is the marginal times the conditional

$$ q(\mbf{x}_0,\ldots,\mbf{x}_T) = q(\mbf{x}_0)\prod_{t=1}^T q(\mbf{x}_t|\mbf{x}_{t-1})$$

# Forward Diffusion Process - Image Domain

$$ q(\mbf{x}_0,\ldots,\mbf{x}_T) = q(\mbf{x}_0)\prod_{t=1}^T q(\mbf{x}_t|\mbf{x}_{t-1})$$
<br>
<div align='center'><img src="figs/diffusion/fwd_image.png" width='65%' ></div>

# Forward Diffusion Process - Diffusion Kernel

We can "bypass" the intermediate time steps by defining: $$\alpha_t = \prod_{s=1}^t (1-\beta_s).$$
Note $\beta_t \approx \mathtt{1e-2}$
$$ q(\mbf{x}_t|\mbf{x}_0) = \mathcal{N}\big(\sqrt{\alpha_t}~\mbf{x}_0,(1-\alpha_t) \mbf{I}\big) \quad \text{diffusion kernel}$$
<br>
<div align='center'><img src="figs/diffusion/diffusion_kernel.png" width='65%' ></div>

# Forward Diffusion Process - Sampling

$$ q(\mbf{x}_t|\mbf{x}_0) = \mathcal{N}\big(\sqrt{\alpha_t}~\mbf{x}_0,(1-\alpha_t) \mbf{I}\big) \quad \text{diffusion kernel}$$

Sample $\epsilon \sim  \mathcal{N}\big(0,\mbf{I})$ and then $$\mbf{x}_t = \sqrt{\alpha_t}~\mbf{x}_0 +  \sqrt{1-\alpha_t}~\epsilon$$
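A minimal sketch of one-shot sampling with the diffusion kernel, i.e. $\mbf{x}_t = \sqrt{\alpha_t}\,\mbf{x}_0 + \sqrt{1-\alpha_t}\,\epsilon$ with $\alpha_t = \prod_{s \le t}(1-\beta_s)$. The linear $\beta$ schedule, its endpoints, the toy scalar "data" and the helper names `make_alphas` and `sample_xt` are assumptions made for this example.

```python
import numpy as np

def make_alphas(betas):
    """alpha_t = prod_{s<=t} (1 - beta_s), for t = 1..T."""
    return np.cumprod(1.0 - betas)

def sample_xt(x0, t, alphas, rng):
    """Draw x_t ~ q(x_t | x_0) in one shot (t is 1-indexed)."""
    eps = rng.standard_normal(x0.shape)       # epsilon ~ N(0, I)
    a = alphas[t - 1]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 2e-2, 1000)   # linear schedule (assumed values)
alphas = make_alphas(betas)

x0 = np.full(100_000, 2.0)              # toy "data": many copies of the scalar 2
xt = sample_xt(x0, t=200, alphas=alphas, rng=rng)
print(xt.mean(), xt.var())
```

The empirical mean and variance of `xt` should be close to $\sqrt{\alpha_t}\,x_0$ and $1-\alpha_t$, without ever simulating the 199 intermediate steps.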

<br>
<div align='center'><img src="figs/diffusion/diffusion_kernel.png" width='65%' ></div>

# Forward Diffusion Process - $q(\mbf{x}_T|\mbf{x}_0) \approx \mathcal{N}(0,\mbf{I})$

The $\beta_t$ schedule is designed such that $\alpha_T \to 0$, and then $q(\mbf{x}_T|\mbf{x}_0) \approx \mathcal{N}(0,\mbf{I})$
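This can be verified numerically by running the chain step by step with $\mbf{x}_t = \sqrt{1-\beta_t}\,\mbf{x}_{t-1} + \sqrt{\beta_t}\,\epsilon$ and checking that the final samples look like standard Gaussian noise. The linear schedule, its endpoints, the number of steps and the toy starting value are assumed for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 2e-2, T)   # linear schedule (assumed values)
alpha_T = np.prod(1.0 - betas)       # product of (1 - beta_s) over all T steps

# Run the chain one step at a time on toy data that starts far from the origin.
x = np.full(50_000, 5.0)
for beta in betas:
    x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * rng.standard_normal(x.shape)

final_mean, final_var = x.mean(), x.var()
print(alpha_T, final_mean, final_var)
```

With this schedule $\alpha_T$ is tiny, so the starting point is almost entirely forgotten and the samples end up with mean close to 0 and variance close to 1.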

<br>
<div align='center'><img src="figs/diffusion/diffusion_kernel.png" width='65%' ></div>

# What happens to data distribution in the forward process

$$
\underbrace{q\left(\mathbf{x}_t\right)}_{\begin{array}{c}
\text { Diffused } \\
\text { data dist. }
\end{array}}=\int \underbrace{q\left(\mathbf{x}_0, \mathbf{x}_t\right)}_{\begin{array}{c}
\text { Joint } \\
\text { dist. }
\end{array}} d \mathbf{x}_0=\int \underbrace{q\left(\mathbf{x}_0\right)}_{\begin{array}{c}
\text { Input } \\
\text { data dist. }
\end{array}} q \underbrace{\left(\mathbf{x}_t \mid \mathbf{x}_0\right)}_{\begin{array}{c}
\text { Diffusion } \\
\text { kernel }
\end{array}} d \mathbf{x}_0
$$
<br>
Gaussian Smoothing of the unknown data distribution
<div align='center'><img src="figs/diffusion/diffuse_data_dist.png" width='65%' ></div>


# Forward pass is ancestral sampling

Similar to sampling from a GMM (i.e. first sample which Gaussian, then sample from that Gaussian).
Here we aim to sample $\mbf{x}_t \sim q(\mbf{x}_t)$:

1. We sample $\mbf{x}_0 \sim q(\mbf{x}_0)$ (take a data point from the training set)
2. Given $\mbf{x}_0$, we sample $\mbf{x}_t \sim q(\mbf{x}_t|\mbf{x}_0)$

# Generative Learning by Denoising
## Reversing the forward diffusion process

The generative distribution will be trained to describe the same trajectory, but in reverse:

$$
p\left(\mbf{x}_{0}, \ldots, \mbf{x}_{T}\right)=p\left(\mathbf{x}_{T}\right) \prod_{t=1}^T p\left(\mathbf{x}_{t-1} \mid \mathbf{x}_{t}\right)
$$
<br>
<div align='center'><img src="figs/diffusion/diffuse_reverse.png" width='65%' ></div>


# Reverse Generation



- Sample $\mbf{x}_{T} \sim \mathcal{N}(0,\mbf{I})$

- Iteratively sample $\mbf{x}_{t-1} \sim q(\mbf{x}_{t-1}|\mbf{x}_{t})$


The true denoising distribution $q(\mbf{x}_{t-1}|\mbf{x}_{t})$ is intractable in general, but it is well approximated by a Gaussian when $\beta_t$ is very small.

<br>
<div align='center'><img src="figs/diffusion/diffuse_reverse.png" width='95%' ></div>


# We train a model to approximate $q(\mbf{x}_{t-1}|\mbf{x}_{t})$


If we assume $\beta_t \approx 0$ (very small), then the reverse process is also **Gaussian**, where <ins>the mean is parametrized by the weights of a neural network.</ins>

# Reverse Denoising Process

<br>
<div align='center'><img src="figs/diffusion/reverse_den_learnable.png" width='95%' ></div>


