[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/weaviate/recipes/blob/main/weaviate-features/multi-vector/multi-vector-colipali-rag.ipynb)

# Multimodal RAG over PDFs using ColQwen2, Qwen2.5, and Weaviate

This notebook demonstrates [Multimodal Retrieval-Augmented Generation (RAG)](https://weaviate.io/blog/multimodal-rag) over PDF documents.
We will be performing retrieval against a collection of PDF documents by embedding both the individual pages of the documents and our queries into the same multi-vector space, reducing the problem to approximate nearest-neighbor search on ColBERT-style multi-vector embeddings under the MaxSim similarity measure.

For this purpose, we will use

- **A multimodal [late-interaction model](https://weaviate.io/blog/late-interaction-overview)**, like ColPali and ColQwen2, to generate
embeddings. This tutorial uses the publicly available model
[ColQwen2-v1.0](https://huggingface.co/vidore/colqwen2-v1.0) with a permissive Apache 2.0 license.
- **A Weaviate [vector database](https://weaviate.io/blog/what-is-a-vector-database)**, which has a [multi-vector feature](https://docs.weaviate.io/weaviate/tutorials/multi-vector-embeddings) to effectively index a collection of PDF documents and support textual queries against the contents of the documents, including both text and figures.
- **A vision language model (VLM)**, specifically [Qwen/Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct), to support multimodal Retrieval-Augmented Generation (RAG).

Below, you can see the multimodal RAG system overview:

<img src="https://github.com/weaviate/recipes/blob/main/weaviate-features/multi-vector/figures/multimodal-rag-diagram.png?raw=1" width="700px"/>

First, the ingestion pipeline processes the PDF documents as images with the multimodal late-interaction model and stores the resulting multi-vector embeddings in a vector database.
Then, at query time, the same multimodal late-interaction model embeds the text query to retrieve the relevant pages.
The retrieved pages are then passed, together with the original user query, as visual context to the vision language model, which generates a response based on this information.
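
The MaxSim measure mentioned above can be illustrated with a minimal pure-Python sketch. For every query vector, take its best dot product against all page vectors, then sum those maxima. The function names and the toy 2-dimensional vectors are made up for illustration; real embeddings come from the model and have many more dimensions.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim(query_vectors, page_vectors):
    """Late-interaction score: for each query vector, keep the best match
    among the page vectors, then sum those best matches."""
    return sum(max(dot(q, d) for d in page_vectors) for q in query_vectors)


# Toy embeddings (illustrative values only, not model output).
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
page_b = [[0.1, 0.2], [0.3, 0.1]]

score_a = maxsim(query, page_a)  # best matches: 0.9 and 0.8
score_b = maxsim(query, page_b)  # best matches: 0.3 and 0.2
print(score_a > score_b)  # page_a is the better match for this query
```

The retrieval cell later in this notebook recovers the same kind of score by negating the `distance` that Weaviate returns.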


## Prerequisites

To run this notebook, you will need a machine capable of running neural networks using 5-10 GB of memory.
The demonstration uses two different vision language models that both require several gigabytes of memory.
See the documentation for each individual model and the general PyTorch docs to figure out how to best run the models on your hardware.

For example, you can run it on:

- Google Colab (using the free-tier T4 GPU)
- or locally (tested on an M2 Pro Mac).

Furthermore, you will need an instance of Weaviate version >= `1.29.0`.


## Step 1: Install required libraries

Let's begin by installing and importing the required libraries.

Note that you'll need Python `3.13`.

In [1]:
%%capture
%pip install colpali_engine weaviate-client qwen_vl_utils
%pip install -q -U "colpali-engine[interpretability]>=0.3.2,<0.4.0"
%pip install -U datasets pypdf

In [2]:
import os
import torch
import numpy as np

from google.colab import userdata
from datasets import load_dataset

from transformers.utils.import_utils import is_flash_attn_2_available
from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor

from colpali_engine.models import ColQwen2, ColQwen2Processor
#from colpali_engine.models import ColPali, ColPaliProcessor # uncomment if you prefer to use ColPali models instead of ColQwen2 models

import weaviate
from weaviate.classes.init import Auth
import weaviate.classes.config as wc
from weaviate.classes.config import Configure
from weaviate.classes.query import MetadataQuery

from qwen_vl_utils import process_vision_info
import base64
from io import BytesIO

import matplotlib.pyplot as plt

from colpali_engine.interpretability import (
    get_similarity_maps_from_embeddings,
    plot_all_similarity_maps,
    plot_similarity_map,
)


In [3]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Step 2: Load the PDF dataset

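Let's start with the data.
The cells below read a single PDF (`DADS.pdf`) from the mounted Google Drive, extract the text of each page with `pypdf`, and wrap the pages in a Hugging Face `Dataset`.
A larger alternative corpus is the dataset of the [top-40 most
cited AI papers on arXiv](https://arxiv.org/abs/2412.12121) from the period 2023-01-01 to 2024-09-30, which is available on Hugging Face.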

In [72]:
from pypdf import PdfReader
from datasets import Dataset

# Read the PDF from the mounted Google Drive and extract the text of each page
reader = PdfReader("/content/drive/MyDrive/DADS.pdf")
text_data = [{"page": i, "text": page.extract_text()} for i, page in enumerate(reader.pages)]

# Wrap the per-page records in a Hugging Face Dataset
dataset = Dataset.from_list(text_data)

print(dataset)  # num_rows is the number of pages

Dataset({
    features: ['page', 'text'],
    num_rows: 22
})


Let's take a look at a sample document page from the loaded PDF dataset.

In [75]:
dataset[7]

{'page': 7,
 'text': '8 \n \n \uf0a3 แบบทางไกลผ่านสื่อแพร่ภาพและเสียงเป็นสื่อหลัก \n \uf0a3 แบบทางไกลทางอิเล็กทรอนิกส์เป็นสื่อหลัก (E-learning) \n \uf0a3 แบบทางไกลทางอินเตอร์เน็ต \n \n 2.8 การเทียบโอนหน่วยกิต รายวิชาและการลงทะเบียนเรียนเข้าสถาบันอุดมศึกษา \n หลักเกณฑ์การเทียบโอนหน่วยกิต ให้เป็นไปตามข้อบังคับของสถาบันบัณฑิตพัฒนบริหารศาสตร์ว่าด้วย\nการศึกษา และ/หรือประกาศของคณะสถิติประยุกต์ \n3. หลักสูตรและอาจารย์ผู้สอน \n 3.1  หลักสูตร \n 3.1.1 จำนวนหน่วยกิต \n ตลอดหลักสูตรไม่น้อยกว่า 36 หน่วยกิต \n 3.1.2 โครงสร้างหลักสูตร \n แผน ก2 ทำวิทยานิพนธ์ แผน ข ไม่ทำวิทยานิพนธ์ \nหมวดวิชาเสริมพื้นฐาน ไม่นับหน่วยกิต ไม่นับหน่วยกิต \nหมวดวิชาพื้นฐาน 6 หน่วยกิต 6 หน่วยกิต \nหมวดวิชาหลัก 15 หน่วยกิต 15 หน่วยกิต \nหมวดวิชาเลือก 3 หน่วยกิต 12 หน่วยกิต \nวิชาการค้นคว้าอิสระ - 3 หน่วยกิต \nสอบประมวลความรู้ สอบ สอบ \nสอบปากเปล่า - สอบ \nวิทยานิพนธ์ \n(ผ่านการสอบป้องกันวิทยานิพนธ์) \n12 หน่วยกิต - \nรวมไม่น้อยกว่า 36 หน่วยกิต 36 หน่วยกิต \n3.1.3 รายวิชา \n (1) หมวดวิชาเสริมพื้นฐาน หมายถึงวิชาที่มุ่งปรับคว
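
The sample above contains symbol-font glyphs such as `\uf0a3` and irregular whitespace, which is common in text extracted from PDFs. If you want to normalize the text before embedding it, one option is sketched below. `clean_page_text` is a hypothetical helper, not part of `pypdf` or of this notebook, and collapsing all whitespace is an assumption that may not suit every document.

```python
import re


def clean_page_text(text):
    """Drop Private Use Area glyphs and collapse whitespace in extracted text."""
    if not text:  # the extracted text can be empty, e.g. for image-only pages
        return ""
    # U+E000..U+F8FF is the Unicode Private Use Area, where symbol-font
    # glyphs such as '\uf0a3' end up after extraction.
    text = re.sub(r"[\ue000-\uf8ff]", "", text)
    # Collapse runs of spaces, tabs, and newlines into single spaces.
    return re.sub(r"\s+", " ", text).strip()


sample = "8 \n \n \uf0a3 E-learning \n 3.1  curriculum"
print(clean_page_text(sample))
```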

## Step 3: Load the ColVision (ColPali or ColQwen2) model

The approach to generate embeddings for this tutorial is outlined in the paper [ColPali: Efficient Document Retrieval with Vision Language Models](https://arxiv.org/abs/2407.01449). The paper demonstrates that it is possible to simplify traditional approaches to preprocessing PDF documents for retrieval:

Traditional PDF processing in RAG systems involves using OCR (Optical Character Recognition) and layout detection software, and separate processing of text, tables, figures, and charts. Additionally, after text extraction, text processing also requires a chunking step. Instead, the ColPali method feeds images (screenshots) of entire PDF pages to a Vision Language Model that produces a ColBERT-style multi-vector embedding.

<img src="https://github.com/weaviate/recipes/blob/main/weaviate-features/multi-vector/figures/colipali_pipeline.jpeg?raw=1" width="700px"/>

There are different ColVision models available, such as ColPali and ColQwen2, which differ mainly in the underlying vision language model (Qwen2-VL for ColQwen2 vs. PaliGemma-3B for ColPali). You can read more about the differences between ColPali and ColQwen2 in our [overview of late-interaction models](https://weaviate.io/blog/late-interaction-overview).

Let's load the [ColQwen2-v1.0](https://huggingface.co/vidore/colqwen2-v1.0) model for this tutorial.

In [21]:
# Get rid of process forking deadlock warnings.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

In [22]:
if torch.cuda.is_available(): # If GPU available
    device = "cuda:0"
elif torch.backends.mps.is_available(): # If Apple Silicon available
    device = "mps"
else:
    device = "cpu"

if is_flash_attn_2_available():
    attn_implementation = "flash_attention_2"
else:
    attn_implementation = "eager"

print(f"Using device: {device}")
print(f"Using attention implementation: {attn_implementation}")

Using device: cuda:0
Using attention implementation: eager


This notebook uses the ColQwen2 model because it has a permissive Apache 2.0 license.
Alternatively, you can also use [ColPali](https://huggingface.co/vidore/colpali-v1.2), which has a Gemma license, or check out other available [ColVision models](https://github.com/illuin-tech/colpali). For a detailed comparison, you can also refer to [ViDoRe: The Visual Document Retrieval Benchmark](https://huggingface.co/spaces/vidore/vidore-leaderboard).

If you want to use ColPali instead of ColQwen2, swap in the `ColPali` and `ColPaliProcessor` classes (see the commented-out import in Step 1).

Before we go further, let's familiarize ourselves with the ColQwen2 model: it can create multi-vector embeddings from both images and text queries.
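
Later cells call `colvision_embedder.multi_vectorize_text(...)`, but the cell that loads the model and defines `colvision_embedder` is not shown in this notebook. The following is a minimal sketch of what such a wrapper could look like, not the original cell. The class name, the `load_colqwen2` helper, and the `bfloat16` dtype are assumptions. The heavy imports sit inside the functions so that defining them stays cheap; in the notebook they are already imported in Step 1.

```python
class ColVisionEmbedder:
    """Thin wrapper that turns text or page images into multi-vector embeddings."""

    def __init__(self, model, processor):
        self.model = model
        self.processor = processor

    def _embed(self, batch):
        import torch  # lazy import; already loaded in Step 1 of the notebook

        batch = batch.to(self.model.device)
        with torch.no_grad():  # inference only, no gradients needed
            embeddings = self.model(**batch)
        return embeddings[0]  # one row (vector) per token of the single input

    def multi_vectorize_text(self, text):
        return self._embed(self.processor.process_queries([text]))

    def multi_vectorize_image(self, image):
        return self._embed(self.processor.process_images([image]))


def load_colqwen2(device, attn_implementation, model_name="vidore/colqwen2-v1.0"):
    """Load ColQwen2 and its processor (downloads the weights on first use)."""
    import torch
    from colpali_engine.models import ColQwen2, ColQwen2Processor

    model = ColQwen2.from_pretrained(
        model_name,
        torch_dtype=torch.bfloat16,  # assumption: half precision to save memory
        device_map=device,
        attn_implementation=attn_implementation,
    ).eval()
    processor = ColQwen2Processor.from_pretrained(model_name)
    return ColVisionEmbedder(model, processor)


# In the notebook, after the device-selection cell:
# colvision_embedder = load_colqwen2(device, attn_implementation)
```

With `bfloat16` weights, the returned tensor has to be converted with `.cpu().float()` before calling `.numpy()`, which matches how the later cells use the embedder.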


## Step 4: Connect to a Weaviate vector database instance

Now, you will need to connect to a running Weaviate vector database cluster.

You can choose one of the following options:

1. **Option 1:** You can create a 14-day free sandbox on the managed service [Weaviate Cloud (WCD)](https://console.weaviate.cloud/)
2. **Option 2:** [Embedded Weaviate](https://docs.weaviate.io/deploy/installation-guides/embedded)
3. **Option 3:** [Local deployment](https://docs.weaviate.io/deploy/installation-guides/docker-installation)
4. [Other options](https://docs.weaviate.io/deploy)

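For this tutorial, you will need Weaviate `v1.29.0` or higher.
Once `client` is connected (the connection cell, which uses `weaviate.connect_to_weaviate_cloud`, appears further down in this notebook), let's make sure we have the required version: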

In [77]:
client.get_meta()['version']

'1.35.2'
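
To turn this manual check into an explicit guard, the version string can be compared against the minimum numerically. This is a small sketch; `parse_version` and `meets_minimum` are hypothetical helper names, and the parsing assumes a plain `major.minor.patch` string with an optional suffix.

```python
def parse_version(version):
    """Turn '1.35.2' (or 'v1.29.0-rc.1') into a tuple of ints like (1, 35, 2)."""
    core = version.lstrip("v").split("-")[0].split("+")[0]
    return tuple(int(part) for part in core.split("."))


def meets_minimum(version, minimum=(1, 29, 0)):
    """Return True if `version` is at least `minimum`."""
    return parse_version(version) >= minimum


print(meets_minimum("1.35.2"))
print(meets_minimum("1.28.4"))

# In the notebook:
# assert meets_minimum(client.get_meta()["version"]), "Weaviate >= 1.29.0 required"
```

Comparing tuples of integers avoids the pitfall of string comparison, where `"1.9.0"` would sort after `"1.29.0"`.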

## Step 5: Create a collection

Next, we will create a collection that will hold the embeddings of the images of the PDF document pages.

We will not define a built-in vectorizer but use the [Bring Your Own Vectors (BYOV) approach](https://docs.weaviate.io/weaviate/starter-guides/custom-vectors), where we manually embed queries and PDF pages at ingestion and query time.

The code below also enables the [MUVERA encoding algorithm](https://weaviate.io/blog/muvera) for multi-vector embeddings through the `encoding` argument; comment out that line if you prefer to index the multi-vectors without MUVERA.

In [78]:
collection_name = "PDFDocuments3"

In [79]:
# Delete the collection if it already exists.
# Note: in practice, you shouldn't rerun this, as it deletes the data
# in `collection_name`, which you would then need to re-import.
#if client.collections.exists(collection_name):
#  client.collections.delete(collection_name)

# Create a collection; the properties match what the ingestion step stores
collection = client.collections.create(
    name=collection_name,
    properties=[
        wc.Property(name="page_id", data_type=wc.DataType.INT),
        wc.Property(name="content", data_type=wc.DataType.TEXT),
    ],
    vector_config=[
        Configure.MultiVectors.self_provided(
            name="dads_nida",
            # MUVERA encoding; comment out to index the multi-vectors without it
            encoding=Configure.VectorIndex.MultiVector.Encoding.muvera(),
            vector_index_config=Configure.VectorIndex.hnsw(
                multi_vector=Configure.VectorIndex.MultiVector.multi_vector()
            ),
        )
    ],
)

## Step 6: Uploading the vectors to Weaviate

In this step, we index the vectors into our Weaviate collection.

The text of each page is encoded with the ColQwen2 model, which turns it into a multi-vector embedding.
These embeddings are then converted from tensors into lists of vectors, giving a multi-vector representation for each page.
This setup works well with Weaviate's multi-vector capabilities.

The vectors and their metadata are uploaded through Weaviate's dynamic batching, gradually building up the index.
The loop below embeds one page per forward pass, which keeps GPU memory usage low; with more GPU memory available, you can embed several pages at once.
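
If you want to embed several pages per forward pass, the pages first need to be grouped into fixed-size batches. One way to do that is sketched below. `chunked` and `batch_size=8` are illustrative names and values, not part of this notebook or of any library used here.

```python
def chunked(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


pages = list(range(22))  # stand-in for the 22 pages of the dataset
batches = list(chunked(pages, batch_size=8))
print([len(b) for b in batches])
```

A smaller `batch_size` lowers peak memory use at the cost of more forward passes.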

In [80]:
with collection.batch.dynamic() as batch:
    for i in range(len(dataset)):
        p = dataset[i]
        page_content = p["text"]  # extracted text of this page

        # Embed the page text and convert the tensor into a list of vectors
        page_vectors = (
            colvision_embedder.multi_vectorize_text(page_content)
            .cpu().float().numpy().tolist()
        )

        batch.add_object(
            properties={
                "page_id": p["page"],
                "content": page_content,
            },
            vector={"dads_nida": page_vectors},
        )

        print(f"Added {i+1}/{len(dataset)} Page objects to Weaviate.")

    batch.flush()

# Clean up
del dataset

Added 1/22 Page objects to Weaviate.
Added 2/22 Page objects to Weaviate.
Added 3/22 Page objects to Weaviate.
Added 4/22 Page objects to Weaviate.
Added 5/22 Page objects to Weaviate.
Added 6/22 Page objects to Weaviate.
Added 7/22 Page objects to Weaviate.
Added 8/22 Page objects to Weaviate.
Added 9/22 Page objects to Weaviate.
Added 10/22 Page objects to Weaviate.
Added 11/22 Page objects to Weaviate.
Added 12/22 Page objects to Weaviate.
Added 13/22 Page objects to Weaviate.
Added 14/22 Page objects to Weaviate.
Added 15/22 Page objects to Weaviate.
Added 16/22 Page objects to Weaviate.
Added 17/22 Page objects to Weaviate.
Added 18/22 Page objects to Weaviate.
Added 19/22 Page objects to Weaviate.
Added 20/22 Page objects to Weaviate.
Added 21/22 Page objects to Weaviate.
Added 22/22 Page objects to Weaviate.


In [81]:
len(collection)

22

## Step 7: Multimodal Retrieval Query

As an example of the kind of retrieval we are building, consider the following demo query and its nearest-neighbor PDF page from the arXiv paper collection referenced in Step 2:

- Query: "How does DeepSeek-V2 compare against the LLaMA family of LLMs?"
- Nearest neighbor: "DeepSeek-V2: A Strong Economical and Efficient Mixture-of-Experts Language Model" (arXiv: 2405.04434), Page: 1.

For the PDF indexed in this notebook, we use a Thai query term instead.


In [125]:
query = "เอกรัฐ"

Note: To avoid `OutOfMemoryError` on freely available resources like Google Colab, we will only retrieve a single document. If you have more memory available, you can set the `limit` parameter to a higher value, e.g., `limit=3`, to increase the number of retrieved PDF pages.

In [None]:
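# Embed the query and search the named multi-vector
response = collection.query.near_vector(
    near_vector=colvision_embedder.multi_vectorize_text(query).cpu().float().numpy(),
    target_vector="dads_nida",
    limit=1,
    return_metadata=MetadataQuery(distance=True),  # needed to return the MaxSim score
)

print(f"The most relevant documents for the query \"{query}\" by order of relevance:\n")
for i, o in enumerate(response.objects):
    p = o.properties
    print(
        f"{i+1}) MaxSim: {-o.metadata.distance:.2f}, "
        + f"Page: \"{p['page_id']}\", "
        + f"Text: {p['content']}"
    )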

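The first result is the page with the highest MaxSim score for the query.

When pages are embedded as images, you can additionally plot per-token similarity maps with the `colpali_engine.interpretability` helpers imported in Step 1, for example for the token "MA" in "LLaMA". (Note that similarity maps are created for each token separately.) This run embeds extracted text rather than page images, so no similarity map is shown here.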

## References

- Faysse, M., Sibille, H., Wu, T., Omrani, B., Viaud, G., Hudelot, C., Colombo, P. (2024). ColPali: Efficient Document Retrieval with Vision Language Models. arXiv. https://doi.org/10.48550/arXiv.2407.01449
- [ColPali GitHub repository](https://github.com/illuin-tech/colpali)
- [ColPali Cookbook](https://github.com/tonywu71/colpali-cookbooks)

## Step 8: Retrieval-Augmented Generation

Finally, we connect the retrieval step to a language model. The cells below install the OpenAI client, connect to Weaviate Cloud, and send the retrieved page text together with the question to a chat model hosted on OpenRouter.

In [None]:
%pip install -U weaviate-client openai

In [56]:
WCD_URL = userdata.get("WEAVIATE_URL")
WCD_AUTH_KEY = userdata.get("WEAVIATE_API_KEY")

# Connect to the Weaviate Cloud cluster
client = weaviate.connect_to_weaviate_cloud(
    cluster_url=WCD_URL,
    auth_credentials=Auth.api_key(WCD_AUTH_KEY),
)

print(client.is_ready())

True
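
The `userdata.get(...)` calls above only work inside Google Colab. When running locally, as mentioned in the prerequisites, the same secrets can be read from environment variables. The helper below is a sketch; `get_secret` is a hypothetical name, and the fallback to `os.environ` is an assumption about how you store credentials.

```python
import os


def get_secret(name):
    """Read a secret from Colab user data if available, else from the environment."""
    try:
        from google.colab import userdata  # only importable inside Colab
    except ImportError:
        value = os.environ.get(name)
        if value is None:
            raise KeyError(f"Set the {name} environment variable") from None
        return value
    return userdata.get(name)


os.environ["DEMO_SECRET"] = "not-a-real-key"  # placeholder for this demo only
print(get_secret("DEMO_SECRET"))

# In the notebook:
# WCD_URL = get_secret("WEAVIATE_URL")
# WCD_AUTH_KEY = get_secret("WEAVIATE_API_KEY")
```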


In [58]:
from openai import OpenAI

# Set up the OpenRouter client (OpenAI-compatible API)
openrouter_client = OpenAI(
  base_url="https://openrouter.ai/api/v1",
  api_key="",  # supply your OpenRouter API key; avoid committing it to the notebook
)

In [132]:
def get_rag_answer(question):
    # Step 1: retrieve the most relevant Thai page text from Weaviate
    collection = client.collections.get(collection_name)

    response = collection.query.near_vector(
        near_vector=colvision_embedder.multi_vectorize_text(question).cpu().float().numpy(),
        target_vector="dads_nida",
        limit=2,
        return_metadata=MetadataQuery(distance=True),  # needed to return the MaxSim score
    )

    # Concatenate the text of the retrieved pages into one context string
    context = " ".join(obj.properties["content"] for obj in response.objects)
    print("context = ", context[:10])  # short preview of the retrieved context

    # Step 2: send context and question to a free model on OpenRouter
    completion = openrouter_client.chat.completions.create(
        #model="qwen/qwen-2.5-72b-instruct:free",
        #model="qwen/qwen-2.5-vl-7b-instruct:free",
        model="meta-llama/llama-3.3-70b-instruct:free",
        messages=[
            {
                "role": "system",
                "content": "โปรดตอบคำถามโดยใช้ข้อมูลจากบริบทที่ให้มาเท่านั้น (Answer in Thai using the provided context)."
            },
            {
                "role": "user",
                "content": f"Context: {context}\n\nQuestion: {question}"
            }
        ]
    )

    return completion.choices[0].message.content
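
The function above joins the retrieved pages with spaces, so the model cannot tell where one page ends and the next begins, and long pages may exceed the context window. A possible refinement, sketched below, labels each page and stops adding pages once a character budget is reached. `build_messages`, the `[Page N]` label format, and the 6000-character budget are assumptions for illustration, and the English system prompt and page text are toy values.

```python
def build_messages(pages, question, max_context_chars=6000):
    """Assemble chat messages from (page_id, content) pairs, capped at a budget."""
    parts, used = [], 0
    for page_id, content in pages:
        block = f"[Page {page_id}]\n{content.strip()}"
        if used + len(block) > max_context_chars:
            break  # stop before exceeding the (approximate) character budget
        parts.append(block)
        used += len(block)
    context = "\n\n".join(parts)
    return [
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": f"Context: {context}\n\nQuestion: {question}"},
    ]


# Toy input: one short page that fits and one oversized page that is left out.
messages = build_messages(
    [(7, "Total credits: at least 36."), (8, "x" * 10000)],
    "How many credits?",
)
print(messages[1]["content"])
```

In `get_rag_answer`, the pairs could be built from `response.objects` using the `page_id` and `content` properties.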

In [139]:
# Example usage (the question means "Summarize the overview of the DADS curriculum")
answer = get_rag_answer("สรุปภาพรวมของหลักสูตร DADS")
print(answer)

context =  10 
 
วขวข
หลักสูตร DADS ประกอบด้วย 7 หมวดวิชา ได้แก่

1. หมวดวิชาเสริมพื้นฐาน (9 หน่วยกิต)
2. หมวดวิชาพื้นฐาน (6 หน่วยกิต)
3. หมวดวิชาหลัก (15 หน่วยกิต)
4. หมวดวิชาเลือก (อย่างน้อย 3-12 หน่วยกิต)
5. หมวดวิชาสัมมนาและการศึกษาเฉพาะเรื่อง (3-6 หน่วยกิต)
6. หมวดวิชาการค้นคว้าอิสระ (3 หน่วยกิต)
7. หมวดวิทยานิพนธ์ (12 หน่วยกิต)

โดยหลักสูตรนี้มุ่งให้นักศึกษามีความรู้ความชำนาญเฉพาะด้านการวิเคราะห์ข้อมูลและวิทยาการข้อมูล และสามารถเลือกเรียนวิชาเสรีจากหลักสูตรอื่น ๆ ได้ตามความเหมาะสม
