<h1 style="background: linear-gradient(to right, #ff6b6b, #4ecdc4); 
           color: white; 
           padding: 20px; 
           border-radius: 10px; 
           text-align: center; 
           font-family: Arial, sans-serif; 
           text-shadow: 2px 2px 4px rgba(0,0,0,0.5);">
  Integrating Visual Document Intelligence with Voice Response Using ColPali, Bedrock, and Strands Agents
</h1>

<h2 style="text-align: center;">First Things First</h2>

<div style="text-align: center;">
  <img src="imgs/github.png" width="800"/>
</div>


<h1 style="background: linear-gradient(to right, #ff6b6b, #4ecdc4); 
           color: white; 
           padding: 20px; 
           border-radius: 10px; 
           text-align: center; 
           font-family: Arial, sans-serif; 
           text-shadow: 2px 2px 4px rgba(0,0,0,0.5);">
  Challenges with Conventional RAG
</h1>

Traditional RAG systems face significant hurdles:

1. **Lack of Structural Insight**: These systems often fail to consider the structured presentation of documents, focusing only on text.
2. **Fragmented Information Retrieval**: Retrieval processes can be disjointed, missing crucial connections between document parts.
3. **Poor Multimodal Integration**: They struggle to make effective use of the diverse data formats within a document, such as images, tables, and charts.
4. **Superficial Retrieval Techniques**: Details that are essential for deep understanding are often overlooked.

![Challenges with Conventional RAG](imgs/intro/multimodal-rag1.png)


<h1 style="background: linear-gradient(to right, #ff6b6b, #4ecdc4); 
           color: white; 
           padding: 20px; 
           border-radius: 10px; 
           text-align: center; 
           font-family: Arial, sans-serif; 
           text-shadow: 2px 2px 4px rgba(0,0,0,0.5);">
  Vision-Based Retrieval
</h1>

Welcome to this tutorial on vision-driven RAG systems, built on recent work such as [ColPali](https://arxiv.org/abs/2407.01449), [ColQwen2](https://huggingface.co/vidore/colqwen2-v0.1), and [Amazon Nova](https://www.aboutamazon.com/news/aws/amazon-nova-artificial-intelligence-bedrock-aws). Before diving into the systems, let's frame the problem statement.

Typically, a text-based RAG system works with a corpus of documents to answer queries. 

But what happens when these documents are not just text but also include images, tables, and other formats? In a [previous tutorial](https://github.com/debnsuma/fcc-ai-engineering-aws/blob/main/03-multimodal-rag/01_Multi_modal_RAG_Amazon_Bedrock_Nova.ipynb), we explored how to use external tools to extract the text, images, and tables before passing them to the RAG workflow. While effective, this method isn't always optimal.

![ColPali System Overview](imgs/intro/img1.png)


This is where vision-driven RAG systems come into play. But before we start, let's understand the motivation behind them.


<h1 style="background: linear-gradient(to right, #ff6b6b, #4ecdc4); 
           color: white; 
           padding: 20px; 
           border-radius: 10px; 
           text-align: center; 
           font-family: Arial, sans-serif; 
           text-shadow: 2px 2px 4px rgba(0,0,0,0.5);">
  Motivation
</h1>

Consider the task of answering questions from a document, similar to a reading comprehension test. You're provided with a text and must dig into it to find answers.

For example, given a document (a book) and a query ("What is the universal approximation theorem?"), you need to find the answer in the document.

![Reading Comprehension](imgs/intro/img2.png)


<h1 style="background: linear-gradient(to right, #ff6b6b, #4ecdc4); 
           color: white; 
           padding: 20px; 
           border-radius: 10px; 
           text-align: center; 
           font-family: Arial, sans-serif; 
           text-shadow: 2px 2px 4px rgba(0,0,0,0.5);">
  How to approach this?
</h1>

Here’s a step-by-step approach you might take:

#### **Step 1: Document Comprehension**
- **Grasp the Full Content**: Start by understanding the document's overall structure, including text and visual elements like images, charts and tables, and note the organization of information.

![Document Structure](imgs/intro/img3.png)

#### **Step 2: Analyzing the Query**
- **Dissect the Query**: Break down the query to determine the exact information needed. For instance, if the query asks about the "universal approximation theorem," identify the key sections related to this term.

#### **Step 3: Document Search**
- **Targeted Search**: Look for relevant text, diagrams, tables, charts, or summaries that explain complex concepts in both simplified and detailed ways.

![Search Process](imgs/intro/img4.png)

#### **Step 4: Integrating Information**
- **Link Related Information**: Combine related pieces of information, such as text descriptions with corresponding diagrams, to enhance understanding.

![Information Integration](imgs/intro/img5.png)


#### **Step 5: Crafting a Response**
- **Synthesize the Answer**: Compile the information into a comprehensive response, explaining complex concepts clearly and concisely.


<h1 style="background: linear-gradient(to right, #ff6b6b, #4ecdc4); 
           color: white; 
           padding: 20px; 
           border-radius: 10px; 
           text-align: center; 
           font-family: Arial, sans-serif; 
           text-shadow: 2px 2px 4px rgba(0,0,0,0.5);">
  ColPali: Efficient Document Retrieval with Vision Language Models
</h1>

[ColPali](https://github.com/illuin-tech/colpali) combines vision and language processing to retrieve documents in a way that is closer to how a human reads a page: by looking at the page as a whole, with its layout and visuals included.

### **Innovative Features of ColPali**

ColPali uses vision-language models (VLMs) to enhance document processing, bypassing traditional text extraction steps and directly analyzing documents as they are.

![ColPali Overview](imgs/intro/img6.png)

#### **How ColPali Enhances Efficiency**

1. **Patch Creation**: Documents are divided into manageable image patches, simplifying complex page layouts into smaller, processable units.
   
   ![Patch Creation](imgs/intro/img7.png)

2. **Generating Brain Food**: Each patch is converted into embeddings, rich numerical representations that capture both visual and contextual data.

   ![Embedding Process](imgs/intro/img8.png)

<h1 style="background: linear-gradient(to right, #ff6b6b, #4ecdc4); 
           color: white; 
           padding: 20px; 
           border-radius: 10px; 
           text-align: center; 
           font-family: Arial, sans-serif; 
           text-shadow: 2px 2px 4px rgba(0,0,0,0.5);">
  Vision Language Model (VLM)
</h1>

To fully grasp the embedding process in ColPali, it's essential to understand Vision Language Models (VLMs), which excel at integrating visual data with textual annotations. For an in-depth exploration of VLMs, please refer to the [previous tutorial](https://github.com/debnsuma/fcc-ai-engineering-aws/blob/main/02-multimodal-llm/00_Introduction_MultimodalLLM.ipynb).

At a high level, VLMs consist of the following key components:

- **Image Encoder**: Breaks down images into patches, encoding each into embeddings.
- **Text Encoder**: Simultaneously, text data is encoded into its own set of embeddings.

![VLM Architecture](imgs/intro/img9.png)



### **Image Encoder**
The Image Encoder segment of a VLM breaks down images into smaller patches and processes each patch individually to generate embeddings. These embeddings represent the visual data in a format that the model can understand and utilize.

- **Patch Processing**: Images are divided into patches, which are then individually fed into the encoder. This modular approach allows the model to focus on detailed aspects of each image segment, facilitating a deeper understanding of the overall visual content.

  ![Patch Encoding](imgs/intro/img10.png)

- **Adapter Layer Transformation**: After encoding, the output from the image encoder passes through an adapter layer. This layer converts the visual embeddings into a numerical format optimized for further processing within the model.

  ![Adapter Layer](imgs/intro/img11.png)

### **Text Encoder**
Parallel to the image encoding, the Text Encoder processes textual data. It converts text into a set of embeddings that encapsulate the semantic and syntactic nuances of the language.

- **Text Processing**: Text is input into the encoder, which then produces embeddings. These embeddings capture the textual context and are crucial for the model to understand and generate language-based responses.

  ![Text Encoding](imgs/intro/img12.png)

### **Integration and Output Generation**
The final stage in the VLM involves integrating the outputs from both the image and text encoders. This integration occurs within an LLM, where both sets of embeddings interact through the Transformer's attention mechanism.

- **Contextual Interaction**: The image and text token embeddings are combined and processed through the Transformer model. This interaction allows the model to contextualize the information from both modalities, enhancing its ability to generate accurate and relevant responses based on both text and visual inputs.

  ![Final Integration](imgs/intro/img13.png)
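
As a rough illustration of the flow described above, the following NumPy sketch projects patch embeddings through an adapter matrix, concatenates them with text token embeddings, and runs one simplified self-attention step over the combined sequence. All sizes and the random weights are assumptions made for illustration. A real VLM uses learned weights, separate query/key/value projections, and many stacked layers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy sizes, chosen only to keep the example small.
num_patches, vision_dim = 16, 32
num_text_tokens, llm_dim = 5, 24

# Stand-ins for the outputs of the image encoder and the text embedding layer.
patch_embeddings = rng.standard_normal((num_patches, vision_dim))
text_embeddings = rng.standard_normal((num_text_tokens, llm_dim))

# Hypothetical adapter weights: map vision embeddings to the LLM's width.
adapter = rng.standard_normal((vision_dim, llm_dim)) / np.sqrt(vision_dim)
image_tokens = patch_embeddings @ adapter

# Image tokens and text tokens form one sequence for the Transformer.
sequence = np.concatenate([image_tokens, text_embeddings], axis=0)


def self_attention(x: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Simplified single-head attention without learned Q/K/V projections."""
    scores = x @ x.T / np.sqrt(x.shape[1])
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ x, weights


contextualized, attention = self_attention(sequence)
print(sequence.shape, contextualized.shape)  # (21, 24) (21, 24)
```

Because attention runs over the concatenated sequence, every text token can attend to every image token and vice versa, which is the "contextual interaction" described above.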

This comprehensive approach enables VLMs to perform complex tasks that require an understanding of both visual elements and textual information, making them ideal for tasks like multimodal RAG where nuanced document understanding is critical.

<h1 style="background: linear-gradient(to right, #ff6b6b, #4ecdc4); 
           color: white; 
           padding: 20px; 
           border-radius: 10px; 
           text-align: center; 
           font-family: Arial, sans-serif; 
           text-shadow: 2px 2px 4px rgba(0,0,0,0.5);">
  ColPali Embedding Process
</h1>

Recall that the first step is to divide the document into patches. ColPali treats each page of the document as an image and creates a set of patches from each page (e.g., 32 x 32 = 1024 patches per page).

![Patch Creation](imgs/intro/img7.png)
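
To make the patch grid concrete, here is a minimal NumPy sketch that splits a page image into a flat sequence of square patches. The helper `split_into_patches` is hypothetical, and the 448 x 448 page size and 14 x 14 patch size are assumptions picked so that the grid comes out as 32 x 32. In practice the model's own image processor handles resizing and patching.

```python
import numpy as np


def split_into_patches(page: np.ndarray, patch_size: int) -> np.ndarray:
    """Split an (H, W, C) page image into a flat sequence of square patches."""
    height, width, channels = page.shape
    if height % patch_size or width % patch_size:
        raise ValueError("page dimensions must be multiples of patch_size")
    rows, cols = height // patch_size, width // patch_size
    # (rows, patch, cols, patch, C) -> (rows, cols, patch, patch, C)
    grid = page.reshape(rows, patch_size, cols, patch_size, channels).swapaxes(1, 2)
    # Flatten the grid row by row into one sequence of patches.
    return grid.reshape(rows * cols, patch_size, patch_size, channels)


# Assumed sizes: a 448x448 RGB page with 14x14 patches gives a 32x32 grid.
page = np.zeros((448, 448, 3), dtype=np.uint8)
patches = split_into_patches(page, patch_size=14)
print(patches.shape)  # (1024, 14, 14, 3)
```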

Starting from these image patches, ColPali encodes each one with a vision encoder and then uses a Transformer-based LLM to refine the embeddings. Instead of the usual softmax layer that predicts the next token, a linear projection maps each output to a compact embedding vector, so every patch ends up with its own embedding.

![ColPali Embedding Pipeline](imgs/intro/img15.png)
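
The projection step can be sketched as a single matrix multiplication. The snippet below uses random arrays as stand-ins for the LLM's output states and for the projection weights, so it shows only the shapes involved, not real embeddings. The hidden size of 2048 is an arbitrary assumption, the output size of 128 follows the embedding dimension reported in the ColPali paper, and the L2 normalization is included as a common choice for dot-product retrieval.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes for illustration only.
num_patches, hidden_dim, proj_dim = 1024, 2048, 128

# Stand-in for the LLM's final hidden states, one row per image patch.
hidden_states = rng.standard_normal((num_patches, hidden_dim))

# Hypothetical projection weights; in a trained model these are learned.
projection = rng.standard_normal((hidden_dim, proj_dim)) / np.sqrt(hidden_dim)


def project_and_normalize(states: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Project hidden states to a smaller space and L2-normalize each row."""
    projected = states @ weights
    norms = np.linalg.norm(projected, axis=1, keepdims=True)
    return projected / norms


page_embeddings = project_and_normalize(hidden_states, projection)
print(page_embeddings.shape)  # (1024, 128)
```

The result is one small vector per patch, which is what gets stored in the index for each page.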

<h1 style="background: linear-gradient(to right, #ff6b6b, #4ecdc4); 
           color: white; 
           padding: 20px; 
           border-radius: 10px; 
           text-align: center; 
           font-family: Arial, sans-serif; 
           text-shadow: 2px 2px 4px rgba(0,0,0,0.5);">
 Query Time: Bringing It All Together
</h1>

At query time, the focus shifts to effectively harnessing precomputed embeddings to find the most relevant document pages quickly and accurately. Here's how the process unfolds:

### **Generating and Projecting Tokens**

- **Token Generation:** Initially, tokens and their embeddings are generated for the query. This involves transforming the text of the query into a format that the system can process and match against document embeddings.

- **Projection:** These tokens are then passed through the same transformer model used during the embedding process. This step involves projecting the tokens into the same embedding space as the document patches, ensuring that the subsequent comparisons are meaningful and accurate.

![Query Processing](imgs/intro/img16.png)

### **Computing the ColBERT Scoring Matrix**

At this point, we have two things:

1. Query embeddings
2. Embeddings of all pages (at patch level granularity)

The next critical step involves computing the ColBERT scoring matrix. Here's how it works:

- **Embedding Matchup:** The scoring matrix is essentially a grid where each row corresponds to a query token and each column to a document patch. The entries in the matrix represent the similarity scores, typically calculated as the dot product between the query token embeddings and the document patch embeddings.

- **Score Maximization:** For each query token, the system takes the maximum similarity score across all document patches. This picks out the patch that best matches that token, so each query token contributes only its strongest match to the page score.

- **Summation for Final Score:** The maximum scores for each query token are then summed up to produce a final score for each document page. This cumulative score represents the overall relevance of the page to the query.

![ColBERT Scoring Matrix](imgs/intro/img17.png)
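
The three steps above amount to the late-interaction score $S(q, d) = \sum_i \max_j \, q_i \cdot d_j$, where $q_i$ are query token embeddings and $d_j$ are patch embeddings of one page. Here is a minimal NumPy sketch of that computation. The function name `maxsim_score` and the tiny 2-dimensional embeddings are made up for illustration.

```python
import numpy as np


def maxsim_score(query_emb: np.ndarray, page_emb: np.ndarray) -> float:
    """Late-interaction score: sum over query tokens of the best patch match."""
    # Scoring matrix: rows are query tokens, columns are page patches.
    similarity = query_emb @ page_emb.T
    # Keep the best-matching patch for each query token.
    best_per_token = similarity.max(axis=1)
    # Sum the per-token maxima into one score for the page.
    return float(best_per_token.sum())


# Toy example: 2 query tokens and 3 patches, each with 2-dimensional embeddings.
query = np.array([[1.0, 0.0], [0.0, 1.0]])
page = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])
print(maxsim_score(query, page))  # approximately 1.7 (0.9 + 0.8)
```

In the toy example the first query token matches the first patch best (0.9) and the second query token matches the second patch best (0.8), so the page score is their sum.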

### **Selecting Top-K Pages**

Based on the scores computed:

- **Ranking and Retrieval:** The pages are ranked according to their scores, and the top-scoring pages are selected. This top-K selection narrows the corpus down to the pages most likely to contain the information the query asks for.

- **Response Generation:** These top pages are then fed, along with the query, into a multimodal language model like Amazon Nova. The model uses both the textual and the visual cues from these pages to generate detailed and contextually accurate responses.

![Final Output](imgs/intro/img18.png)
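
The ranking step can be sketched as follows, assuming each page is stored as an array of patch embeddings. The helper `top_k_pages` and the toy data are hypothetical. A production system would typically batch the scoring on a GPU or delegate it to a vector database rather than loop in Python.

```python
import numpy as np


def top_k_pages(query_emb: np.ndarray, pages: list[np.ndarray], k: int) -> list[tuple[int, float]]:
    """Score every page with late interaction and return the k best (index, score) pairs."""
    scores = np.array([(query_emb @ page.T).max(axis=1).sum() for page in pages])
    # Negate so that argsort returns the highest scores first.
    order = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in order]


# Toy corpus: three "pages" with different numbers of 2-dimensional patch embeddings.
query = np.array([[1.0, 0.0], [0.0, 1.0]])
pages = [
    np.array([[1.0, 0.0], [0.0, 1.0]]),  # matches both query tokens
    np.array([[1.0, 0.0], [1.0, 0.0]]),  # matches only the first token
    np.array([[0.6, 0.6]]),              # partial match for both tokens
]
for index, score in top_k_pages(query, pages, k=2):
    print(index, round(score, 2))
```

The selected page indices identify which page images to pass, together with the query, to the multimodal model for answer generation.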

If you want to learn more about ColPali, refer to the [official repository](https://github.com/illuin-tech/colpali). I also recommend the 9-part blog series on RAG on [DailyDoseofDS](https://www.dailydoseofds.com/) by Avi Chawla and Akshay Pachaar.
 


**Ok, enough of theory. Let's see it in action :)**