# ColPali RAG

## Overview of ColPali
- ColPali is an alternative approach to RAG designed specifically for multimodal (vision) documents.
- It skips the OCR, layout-detection, and chunking steps of traditional text-based pipelines, which makes indexing much faster.
- It embeds each page image directly, as a whole.
- Indexing with ColPali is therefore simple and efficient.

The architecture of the ColPali pipeline is outlined below.

## Offline indexing

<p align="center">
  <img src="images/pdf_retriever.png" alt="Offline indexing" width="40%">
</p>


Offline indexing with ColPali is much simpler and faster than with standard retrieval methods.

<p align="center">
  <img src="images/indexing.png" alt="Offline indexing" width="50%">
</p>

## Steps

1) Download the PDF, or use one that is already available locally.
2) Save each page of the PDF as an image.
3) Pass each image to ColPali and store the resulting embeddings in a vector database (FAISS in this example, for simplicity).
4) Pass the query to ColPali to get the query embeddings.
5) Retrieve the image embeddings from the database and compare them with the query embeddings using MaxSim.
6) Select the image, or the top few images, with the highest similarity to the query.
7) Pass the selected image(s) and the question to any vision language model: closed source (GPT-4V, Gemini Flash) or open source (Idefics2).
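
The retrieval part of the list above (score every stored page against the query, then keep the best matches) can be sketched with plain NumPy arrays. This is only an illustration under assumptions: `rank_pages`, the toy 2-dimensional embeddings, and the example paths are hypothetical stand-ins, not part of `colpali_engine`, and real ColPali embeddings have many more tokens and dimensions.

```python
import numpy as np


def rank_pages(query_emb, page_embs, page_paths, top_k=1):
    """Return the top_k (path, score) pairs, best first.

    query_emb: array of shape (n_query_tokens, dim)
    page_embs: list of arrays, each of shape (n_page_tokens, dim)
    page_paths: list of image paths aligned with page_embs
    """
    scores = []
    for page in page_embs:
        # Similarity of every query token to every page token.
        sim = query_emb @ page.T  # (n_query_tokens, n_page_tokens)
        # MaxSim: best page token per query token, summed over query tokens.
        scores.append(float(sim.max(axis=1).sum()))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(page_paths[i], scores[i]) for i in order[:top_k]]


# Toy data (hypothetical): 2 query tokens, 3 pages, embedding dim 2.
query = np.array([[1.0, 0.0], [0.0, 1.0]])
pages = [
    np.array([[1.0, 0.0], [0.5, 0.5]]),
    np.array([[0.0, 1.0], [1.0, 1.0]]),
    np.array([[0.2, 0.1]]),
]
paths = ["images/page_0.jpg", "images/page_1.jpg", "images/page_2.jpg"]

top = rank_pages(query, pages, paths, top_k=2)
print(top)  # best match is images/page_1.jpg, then images/page_0.jpg
```

The selected paths are what would then be opened and handed to the vision language model together with the question.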

### MaxSim Operation
MaxSim scores a document against a query by finding, for each query token, its best-matching document token:

- Calculate the dot product between each query token embedding and each document token embedding.
- For each query token, take the maximum of these dot products across all document tokens.
- Sum these per-token maxima over all query tokens to get the final relevance score for the document.
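
As a minimal worked example of the operation described above, the sketch below computes the dot products, takes the per-query-token maximum, and then sums those maxima into one score, as ColBERT-style late interaction does. The function names and the tiny hand-written vectors are hypothetical and chosen only so the arithmetic can be checked by hand.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_tokens, doc_tokens):
    """Return (total score, per-query-token maxima) for one document."""
    per_token_max = []
    for q in query_tokens:
        # Similarity of this query token to every document token.
        sims = [dot(q, d) for d in doc_tokens]
        # Keep only the best-matching document token.
        per_token_max.append(max(sims))
    return sum(per_token_max), per_token_max


# Two query tokens and three document tokens, embedding dim 2.
query = [[1.0, 0.0], [0.0, 1.0]]
doc = [[0.5, 0.5], [1.0, 0.0], [0.0, 2.0]]

score, maxima = maxsim_score(query, doc)
print(maxima)  # [1.0, 2.0]
print(score)   # 3.0
```

The first query token matches the second document token best (1.0) and the second query token matches the third (2.0), so the document scores 3.0.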


In [None]:
!git clone https://github.com/illuin-tech/colpali.git
%cd colpali 
!pip install -r requirements.txt 
!pip install einops

In [None]:
# Loading the ColPali weights requires a Hugging Face access token
from huggingface_hub import notebook_login
notebook_login()

In [None]:
PDF_NAME = ""

In [None]:
# Download the PDF file if it is not already available locally
import os
import requests

if not os.path.exists(PDF_NAME):
    print("File doesn't exist, downloading...")

    # The URL of the PDF you want to download
    url = "provide-your-pdf-download-link"

    # The local filename to save the downloaded file
    filename = PDF_NAME

    # Send a GET request to the URL
    response = requests.get(url)

    # Check if the request was successful
    if response.status_code == 200:
        # Open a file in binary write mode and save the content to it
        with open(filename, "wb") as file:
            file.write(response.content)
        print(f"The file has been downloaded and saved as {filename}")
    else:
        print(f"Failed to download the file. Status code: {response.status_code}")
else:
    print(f"File {PDF_NAME} exists.")

In [None]:
# Save each page of the PDF as an image.
# Note: pdf2image needs the poppler utilities installed on the system.
import os
from pdf2image import convert_from_path

# Path to the PDF file
pdf_path = 'path_to_pdf'

# Folder to save images
output_folder = 'images'

# Create the folder if it doesn't exist
os.makedirs(output_folder, exist_ok=True)

# Convert PDF pages to images
pages = convert_from_path(pdf_path)

# Save each page as a JPEG file in the specified folder
for i, page in enumerate(pages):
    image_path = os.path.join(output_folder, f'page_{i}.jpg')
    page.save(image_path, 'JPEG')


In [None]:
import torch

# Check if CUDA is available
if torch.cuda.is_available():
    print("CUDA is available. Here are the details of the GPU(s) present:")

    # Loop through all available GPUs
    for i in range(torch.cuda.device_count()):
        print(f"\nGPU {i}:")
        print(f"Name: {torch.cuda.get_device_name(i)}")
        print(f"Memory Allocated: {torch.cuda.memory_allocated(i) / 1024 ** 3:.2f} GB")
        print(f"Total Memory: {torch.cuda.get_device_properties(i).total_memory / 1024 ** 3:.2f} GB")
   
else:
    print("CUDA is not available. Please check your GPU configuration.")


In [None]:
import os
import torch
from torch.utils.data import Dataset, DataLoader
from tqdm import tqdm
from transformers import AutoProcessor
from PIL import Image
import numpy as np
try:
    from colpali_engine.models.paligemma_colbert_architecture import ColPali
    from colpali_engine.trainer.retrieval_evaluator import CustomEvaluator
    from colpali_engine.utils.colpali_processing_utils import process_images, process_queries
    from colpali_engine.interpretability.processor import ColPaliProcessor
except ImportError as e:
    # Fail early: none of the cells below can run without colpali_engine.
    raise ImportError(
        f"{e}. Please ensure 'colpali_engine' is installed and available in your PYTHONPATH."
    ) from e


model_name = "vidore/colpali"
model = ColPali.from_pretrained("google/paligemma-3b-mix-448", torch_dtype=torch.float16, device_map="cuda").eval()
model.load_adapter(model_name)
processor = AutoProcessor.from_pretrained(model_name)

In [None]:
class ImageDataset(Dataset):
    def __init__(self, image_dir):
        self.image_dir = image_dir
        # os.listdir returns entries in arbitrary order; sort for a reproducible index
        self.image_files = sorted(f for f in os.listdir(image_dir) if f.endswith('.jpg'))

    def __len__(self):
        return len(self.image_files)

    def __getitem__(self, idx):
        img_path = os.path.join(self.image_dir, self.image_files[idx])
        image = Image.open(img_path).convert('RGB')
        return image, img_path
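
One detail worth noting about the file listing above: `os.listdir` returns names in arbitrary order, and plain string sorting places `page_10.jpg` before `page_2.jpg`. If the index should follow the page order of the PDF, a numeric sort key is one option. The helper below is a hypothetical sketch that assumes the `page_{i}.jpg` naming used when the pages were saved.

```python
import re


def page_number(filename):
    """Extract the page index from names like 'page_12.jpg' (-1 if no match)."""
    match = re.search(r"page_(\d+)\.jpg$", filename)
    return int(match.group(1)) if match else -1


files = ["page_10.jpg", "page_2.jpg", "page_0.jpg", "page_1.jpg"]

lexicographic = sorted(files)
by_page = sorted(files, key=page_number)

print(lexicographic)  # ['page_0.jpg', 'page_1.jpg', 'page_10.jpg', 'page_2.jpg']
print(by_page)        # ['page_0.jpg', 'page_1.jpg', 'page_2.jpg', 'page_10.jpg']
```

Because the dataset returns each image together with its path, retrieval still works with any ordering; the numeric key only makes the index easier to inspect.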

In [None]:
def indexing_images(image_dir: str, k: int = 5) -> tuple:

    image_dataset = ImageDataset(image_dir)
    dataloader = DataLoader(
        image_dataset,
        batch_size=4,
        shuffle=False,
        collate_fn=lambda x: (process_images(processor, [item[0] for item in x]), [item[1] for item in x])
    )

    ds = []
    img_paths = []
    for batch_images, batch_img_paths in tqdm(dataloader):
        with torch.no_grad():
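            # Move every input tensor to the model's device (names chosen to avoid shadowing the `k` argument)
            batch_images = {name: tensor.to(model.device) for name, tensor in batch_images.items()}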
            embeddings_doc = model(**batch_images)
        ds.extend(list(torch.unbind(embeddings_doc.to("cpu"))))
        img_paths.extend(batch_img_paths)
