<a href="https://colab.research.google.com/github/M2Lschool/tutorials2025/blob/master/2_vlm/vlm_tutorial_overview.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# M2LS 2025: Vision-Language Models Tutorial
---
- Alexandre Galashov (agalashov@google.com)
- Petra Bevandic (Petra.Bevandic@fer.hr)
<br>

**Abstract:** In this tutorial we'll explore how to use image-text data to build **Vision Language Models** (VLMs) 🚀. We'll start with an introduction to multimodal understanding that describes the main components of a Vision Language Model. Then, in Practical 1, we'll dive into the Vision Transformer (ViT), a popular architecture used as the image encoder in VLMs. Practical 2 covers Contrastive Language-Image Pre-training (CLIP), a model for learning general representations from image-text pairs that can be used for a wide range of downstream tasks, and explores several applications of CLIP. Finally, in Practical 3, we'll take a pretrained VLM and fine-tune it on a given task.


**Tutorial outline:**
- Theory recap
- Practical 1. ViT
- Practical 2. CLIP in different applications -- zero-shot image classification and anomaly detection.
- Practical 3. Using a pre-trained VLM and fine-tuning it for a practical task.

---

## Theory overview
---

### Vision language models (VLMs)

**Vision Language Models** (VLMs) are a class of models designed to jointly process and reason about visual and textual information. They are pre-trained on vast quantities of data containing sequences of images and text, typically with the objective of predicting subsequent text given the preceding multimodal context. This process enables them to learn rich, grounded representations connecting language to visual concepts. The pretraining task for VLMs can be framed as **Visual Question Answering** (VQA): given one or more images and a question (text), the model must provide an answer (the subsequent text).

**Visual Question Answering**. A key insight is that **Visual Question Answering** (VQA) can be viewed as a powerful, general-purpose interface for a wide range of vision tasks. By carefully crafting the input question, many distinct problems can be solved using the same underlying VLM architecture.

Consider the following examples:

**Standard VQA**: The model answers a specific, open-ended question about the image content.

* Q: "What color is the grass?" → A: "Green"

**Special case VQA: Object Recognition/Classification**: The model identifies the primary subject when given a fixed, categorical question.

* Q: "What is the main object in this image?" → A: "Cat"

**Special case VQA: Object Detection (as text)**: The model can even be trained to output bounding box coordinates for a queried object.

* Q: "Where is the cat?" → A: "[120, 80, 450, 320]"

**Image-Text Retrieval**: The model acts as a scorer to judge the alignment between an image and a potential description. This fundamental ability powers retrieval systems. By scoring a single image against a large database of candidate texts (or vice versa), the VLM can effectively retrieve the best matches.

* Q: "What is the probability of 'A black cat on the grass' on this image?" → A: 0.99

**Image Captioning**: The model generates a full description when prompted with an empty or null question.

* Q: "" → A: "A black cat sitting on a patch of green grass."

This demonstrates that VQA is not just a task, but a flexible methodology for interacting with and extracting information from images.
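
To make the "one interface, many tasks" idea concrete, here is a minimal sketch in which every task is routed through the same `(image, question) -> answer` call. The names `TASK_PROMPTS`, `build_question`, `run_task` and `echo_vlm` are hypothetical helpers invented for this illustration; `echo_vlm` is a stand-in that only echoes the question, not a real model.

```python
from typing import Callable

# Hypothetical prompt templates: one question per task, the same model behind all of them.
TASK_PROMPTS = {
    "vqa": "What color is the grass?",
    "classification": "What is the main object in this image?",
    "detection": "Where is the {target}?",
    "retrieval": "What is the probability of '{caption}' on this image?",
    "captioning": "",
}


def build_question(task: str, **fields: str) -> str:
    """Fill in the template for `task`; unknown tasks raise KeyError."""
    return TASK_PROMPTS[task].format(**fields)


def run_task(vlm: Callable[[object, str], str], image: object, task: str, **fields: str) -> str:
    """Route any task through the same (image, question) -> answer interface."""
    return vlm(image, build_question(task, **fields))


def echo_vlm(image: object, question: str) -> str:
    """Stand-in 'model' that echoes the question, so the sketch runs without weights."""
    return f"<answer to: {question!r}>"


print(run_task(echo_vlm, image=None, task="detection", target="cat"))
```

Swapping `echo_vlm` for a real VLM call would leave `run_task` unchanged, which is the point of treating VQA as a general interface.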


<img src="https://drive.google.com/uc?export=view&id=1Rsow69Td2EYz7CAJX7wyi_kLAqJN_-5s">

## Main parts of a **Vision and Language model** (VLM):

---



*   Image feature extractor (Image encoder)
*   Language feature extractor (Text encoder)
*   Multimodal fusion
*   Prediction head

<img src="https://drive.google.com/uc?export=view&id=1vlXbjRmTEirLRpoNMm9iJFhiwwyD-clL">

## Different types of multimodal fusion


<img src="https://drive.google.com/uc?export=view&id=10icJo9IHPJI8jAI8QRkWoJgLtXoYX8dI">

## Pros and Cons of Fusion Strategies in Vision-Language Models:

**Dual Encoder (Late Fusion):**

**Pros:**
* **Fast and efficient:** Separate feature extraction pipelines for image and text allow for independent computation and storage, enabling quick retrieval, especially with large datasets.
* **Scalable:** Suitable for tasks like Image-Text Retrieval where comparing numerous image-text pairs is necessary.

**Cons:**
* **Less accurate:** Interaction between modalities is limited to the final stage, potentially hindering performance in tasks demanding deeper understanding.
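
The efficiency argument for dual encoders can be sketched in a few lines of NumPy: image embeddings are computed once and stored, and each text query then costs a single matrix-vector product. The random arrays below are stand-ins for real encoder outputs, and `l2_normalize` and `retrieve` are hypothetical helper names.

```python
import numpy as np

rng = np.random.default_rng(0)


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale vectors to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


# Stand-in for image encoder outputs: 1000 images, embedding dimension 64.
# In a real system this index is computed once and stored.
image_index = l2_normalize(rng.normal(size=(1000, 64)))


def retrieve(text_embedding: np.ndarray, index: np.ndarray, k: int = 5) -> np.ndarray:
    """Return the indices of the k images most similar to the text query."""
    scores = index @ l2_normalize(text_embedding)  # one matrix-vector product
    return np.argsort(-scores)[:k]


# A query built to lie close to image 42, to check that it is retrieved first.
query = image_index[42] + 0.01 * rng.normal(size=64)
top_k = retrieve(query, image_index)
print(top_k)
```

Note that the two modalities only meet in the final dot product, which is exactly the limited interaction listed under the cons.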

**Joint Encoder (Early Fusion):**

**Pros:**
* **More accurate:**  Deep interaction between modalities through multiple computational layers (e.g., shared transformer) allows for richer understanding and better performance in complex tasks.
* **Suitable for intricate tasks:**  Well-suited for tasks requiring complex reasoning like VQA and object detection.

**Cons:**
* **Slower and resource-intensive:** Joint encoding of image and text demands significant computational resources and time.
* **More sensitive to data noise:**  Noisy data sources affect training more in the joint encoder setting.


**Other Approaches:**

* **Hybrid methods:** Combine aspects of both dual and joint encoders for improved accuracy and efficiency. Examples include:
    * Retrieval using a dual encoder followed by joint-encoder re-ranking*.
    * Dual encoder enhanced with cross-attention modules (e.g., [LXMERT](https://arxiv.org/abs/1908.07490)).
    * Joint encoder with only specific layers fused (e.g., [FIBER](https://proceedings.neurips.cc/paper_files/paper/2022/file/d4b6ccf3acd6ccbc1093e093df345ba2-Paper-Conference.pdf)).



*Some works like [Retrieve fast rerank smart](https://arxiv.org/abs/2103.11920) and [Thinking Fast and Slow](https://arxiv.org/abs/2103.16553) first apply the same model as a dual encoder to capitalize on the speed of dual encoders, and then apply it as a joint encoder to get the most out of its fine-grained understanding capabilities.
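
The retrieve-then-rerank pattern can be sketched as a two-stage function: a cheap dot-product shortlist followed by an expensive scorer that only sees the shortlist. This is a toy illustration and not the method of either cited paper; `retrieve_then_rerank` and `toy_joint_score` are hypothetical names, and the "joint model" is simulated by scoring against noise-free embeddings.

```python
import numpy as np
from typing import Callable


def retrieve_then_rerank(
    query_emb: np.ndarray,
    index: np.ndarray,
    joint_score: Callable[[int], float],
    shortlist_size: int = 10,
) -> list:
    """Stage 1: cheap dot-product shortlist. Stage 2: expensive scorer on the shortlist only."""
    coarse_scores = index @ query_emb
    shortlist = np.argsort(-coarse_scores)[:shortlist_size]
    return sorted(shortlist.tolist(), key=joint_score, reverse=True)


rng = np.random.default_rng(0)
clean = rng.normal(size=(500, 32))                          # what the careful joint model "sees"
noisy_index = clean + 0.5 * rng.normal(size=clean.shape)    # coarse dual-encoder embeddings
query = clean[3]

calls = []  # records every candidate the expensive scorer is asked about


def toy_joint_score(candidate: int) -> float:
    """Stand-in for a joint encoder: scores a candidate against the query."""
    calls.append(candidate)
    return float(clean[candidate] @ query)


ranking = retrieve_then_rerank(query, noisy_index, toy_joint_score)
print(ranking[:3], "joint-model calls:", len(calls))
```

The expensive scorer runs on 10 candidates instead of all 500, which is where the speed-up of the hybrid approach comes from.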


The most popular fusion mechanism at the moment is **Early Fusion**.

## Digging into Image Encoder

Modern VLMs (for example, [Gemma-3](https://deepmind.google/models/gemma/gemma-3/) and [PaliGemma](https://arxiv.org/abs/2407.07726)) use image encoders pretrained with a [CLIP](https://arxiv.org/abs/2103.00020)-style loss (see below).

Here, for example, is the architecture of the [PaliGemma](https://arxiv.org/abs/2407.07726) VLM. It uses a [SigLIP](https://arxiv.org/abs/2303.15343) vision encoder, which is a variant of the Vision Transformer (ViT), together with an **Early Fusion** mechanism.

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/paligemma/paligemma_arch.png"
alt="drawing" width="600"/>
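
As a rough sketch of what early fusion means at the tensor level, the snippet below projects image tokens into the language model's embedding space and concatenates them with the text token embeddings into one sequence. All sizes are illustrative and not taken from the paper, the random arrays stand in for real encoder outputs, and `W_proj` is a hypothetical name for the projection matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only.
num_patches, vision_dim = 16, 32
num_text_tokens, model_dim = 5, 48

image_tokens = rng.normal(size=(num_patches, vision_dim))         # stand-in for vision encoder output
text_embeddings = rng.normal(size=(num_text_tokens, model_dim))   # stand-in for embedded prompt tokens

# Linear projection that maps image tokens into the language model's embedding space.
W_proj = rng.normal(size=(vision_dim, model_dim)) * 0.02
projected_image_tokens = image_tokens @ W_proj

# Early fusion: a single joint sequence, so every transformer layer
# can attend across both modalities.
fused_sequence = np.concatenate([projected_image_tokens, text_embeddings], axis=0)
print(fused_sequence.shape)  # (21, 48)
```

The fused sequence is then processed by a single transformer, in contrast to the dual-encoder setup where the modalities never share a layer.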

## Importance of pre-trained image encoder 🧠

The performance of a Vision-Language Model is heavily dependent on **the quality of its pre-trained image encoder**. This is clearly demonstrated in the [PaliGemma](https://arxiv.org/pdf/2407.07726) paper, which reported a significant performance drop when training an image encoder from scratch compared to using a pre-trained one.

The graph below visualizes this comparison:

**🔵 Pre-trained (Blue)**: The image encoder is a powerful [SigLIP](https://arxiv.org/abs/2303.15343) model, pre-trained on a large-scale image-text dataset.

**🟠 From Scratch (Orange)**: The image encoder has no visual pre-training and learns from raw image patches only while the VLM itself is being trained.

As the results show, visual pre-training is crucial for achieving state-of-the-art performance and sample efficiency.

<img src="https://drive.google.com/uc?export=view&id=1fedemG4EszwhXqUDk1KyAKuwikC5a8ec">

## Vision transformer (ViT) Architecture

ViT is the main architecture used for image encoders in VLMs.

Let's look at the **[Vision Transformer](https://arxiv.org/abs/2010.11929) (ViT)** architecture, illustrated in the diagram. The core idea of ViT is to adapt the successful [Transformer](https://arxiv.org/abs/1706.03762) model, originally from NLP, to process images. This is achieved through a simple, effective process:

* **Image Patchification & Embedding**: The input image is first deconstructed into a sequence of fixed-size, non-overlapping patches. You can think of these patches as the visual equivalent of "words" 🖼️. Each patch is then flattened and linearly projected into a vector. This creates a sequence of "image tokens" that the Transformer can understand.

* **Transformer Encoder**: Next, this sequence of tokens is fed into a standard Transformer Encoder, which processes the relationships between the patches to understand the image as a whole.

* **(Optional) MLP Head**: For a downstream task like image classification, the output representation from the Transformer is passed to a final MLP (Multi-Layer Perceptron) head, which produces the final prediction (e.g., the object class). 🎯
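
The patchification and embedding step can be sketched in NumPy as follows. This is a minimal illustration, assuming a 224×224 RGB image, 16×16 patches, and an arbitrary embedding size; `patchify` and `W_embed` are hypothetical names, the projection weights are random rather than learned, and positional embeddings and the class token from the original ViT are left out for brevity.

```python
import numpy as np


def patchify(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C),
    with patches ordered row by row.
    """
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0, "image must divide evenly into patches"
    gh, gw = h // patch_size, w // patch_size
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)  # (gh, gw, patch, patch, c)
    return patches.reshape(gh * gw, patch_size * patch_size * c)


rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))

patches = patchify(image, patch_size=16)  # (196, 768): a 14 x 14 grid of 16*16*3 values

embed_dim = 192  # illustrative embedding size
W_embed = rng.normal(size=(patches.shape[1], embed_dim)) * 0.02
tokens = patches @ W_embed  # (196, 192): the sequence of "image tokens"

print(patches.shape, tokens.shape)
```

The resulting `tokens` array plays the same role for the Transformer encoder as a sequence of word embeddings does in NLP.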

<img src="https://drive.google.com/uc?export=view&id=1UxoNoTXy39aIL8RPnpMVbX7wzmXCKlq6">



## Contrastive Language-Image Pre-training (CLIP)

The main method to pretrain image encoders!

[Contrastive Language-Image Pre-training (CLIP)](https://arxiv.org/pdf/2103.00020) is a method that uses large-scale image-text datasets to learn a shared embedding space. In this space, the representations of images and their corresponding text descriptions are close together, while unrelated pairs are pushed far apart.

The CLIP model architecture consists of **two encoders**: a **text encoder** and an **image encoder**. These encoders are used to generate representations for text and images, respectively. During training, the model's objective is to predict which image-text pairs within a batch are correctly matched. It achieves this by maximizing the similarity between the embeddings of positive (matching) pairs and minimizing the similarity of negative (mismatching) pairs.
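
The training objective described above can be sketched as a symmetric cross-entropy over the batch similarity matrix. This is a minimal NumPy illustration rather than the reference implementation: the embeddings are random stand-ins for encoder outputs, `clip_loss` is a hypothetical name, and the temperature is fixed at 0.07 here, whereas CLIP learns it during training.

```python
import numpy as np


def log_softmax(x: np.ndarray, axis: int) -> np.ndarray:
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def clip_loss(image_emb: np.ndarray, text_emb: np.ndarray, temperature: float = 0.07) -> float:
    """Symmetric contrastive loss; row i of each array is assumed to be a matched pair."""
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature  # (batch, batch) cosine similarities
    diag = np.arange(len(logits))
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()  # pick the right text for each image
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()  # pick the right image for each text
    return float((loss_i2t + loss_t2i) / 2)


rng = np.random.default_rng(0)
img = rng.normal(size=(8, 32))
aligned = clip_loss(img, img + 0.1 * rng.normal(size=img.shape))  # texts close to their images
unrelated = clip_loss(img, rng.normal(size=img.shape))            # texts unrelated to the images
print(aligned < unrelated)
```

The loss is low when each image is most similar to its own text (the diagonal of `logits`) and high when the pairing carries no signal.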

CLIP's training approach enables it to perform remarkably well on various image-related tasks, particularly in zero-shot settings where it can classify new images without needing to be fine-tuned on a specific dataset.
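
Zero-shot classification with a CLIP-style model can be sketched as follows: each class name is turned into a text prompt, and the image is assigned to the class whose prompt embedding is most similar. The random vectors below are stand-ins for real text and image encoder outputs, `zero_shot_classify` is a hypothetical name, and the temperature value is illustrative.

```python
import numpy as np


def softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()


def zero_shot_classify(image_emb: np.ndarray, prompt_embs: np.ndarray, temperature: float = 0.01) -> np.ndarray:
    """Return class probabilities from cosine similarity to each class prompt."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    prompt_embs = prompt_embs / np.linalg.norm(prompt_embs, axis=1, keepdims=True)
    return softmax(prompt_embs @ image_emb / temperature)


class_names = ["cat", "dog", "car"]
prompts = [f"a photo of a {name}" for name in class_names]

rng = np.random.default_rng(0)
prompt_embs = rng.normal(size=(len(prompts), 32))        # stand-in for text_encoder(prompts)
image_emb = prompt_embs[1] + 0.1 * rng.normal(size=32)   # stand-in for image_encoder(a dog photo)

probs = zero_shot_classify(image_emb, prompt_embs)
print(class_names[int(np.argmax(probs))], probs.round(3))
```

No classifier is trained here: changing the set of classes only requires changing the list of prompts.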

<img src="https://drive.google.com/uc?export=view&id=1bxFxZX7Amwdn4JCyrgZBy87fQDI1yCWM" height="300" width="500">