### Simple RAG using [VARAG](https://github.com/adithya-s-k/VARAG)


[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/adithya-s-k/VARAG/blob/main/docs/simpleRAG.ipynb)

Requirements to run this notebook: a CPU runtime is sufficient. Use a T4 GPU if you enable OCR and want it to run faster.

In [None]:
!git clone https://github.com/adithya-s-k/VARAG
%cd VARAG
%pwd

In [None]:
!apt-get update && apt-get install -y poppler-utils

In [None]:
%pip install -e .

# Docling is used for OCR
%pip install docling

In [None]:
from sentence_transformers import SentenceTransformer
from varag.rag import SimpleRAG
from varag.llms import OpenAI
from varag.chunking import FixedTokenChunker
import lancedb
import os
from dotenv import load_dotenv

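# Replace "api-key" with your OpenAI API key, or delete this line and define OPENAI_API_KEY in a .env file.
# By default, load_dotenv() does not override variables that are already set.
os.environ["OPENAI_API_KEY"] = "api-key"

load_dotenv()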

In [None]:
embedding_model = SentenceTransformer("all-MiniLM-L6-v2", trust_remote_code=True)
# embedding_model = SentenceTransformer("BAAI/bge-base-en", trust_remote_code=True)
# embedding_model = SentenceTransformer("BAAI/bge-large-en-v1.5", trust_remote_code=True)
# embedding_model = SentenceTransformer("BAAI/bge-small-en-v1.5", trust_remote_code=True)

# Initialize shared database
shared_db = lancedb.connect("~/shared_rag_db")

# Initialize SimpleRAG with the shared database
text_rag = SimpleRAG(
    text_embedding_model=embedding_model,
    db=shared_db,
    table_name="textDemo",
)


# Initialize OpenAI LLM
llm = OpenAI()

In [None]:
text_rag.index(
    "./examples/data",
    recursive=False,
    chunking_strategy=FixedTokenChunker(chunk_size=1000),
    metadata={"source": "gradio_upload"},
    overwrite=True,
    verbose=True,
    ocr=True,
)
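
To make the `chunk_size=1000` argument concrete, here is a minimal sketch of fixed-size chunking. It is an illustration only: `fixed_token_chunks` is a hypothetical helper that treats whitespace-separated words as tokens, whereas VARAG's `FixedTokenChunker` may count tokens with a real tokenizer and may handle overlap differently.

```python
def fixed_token_chunks(text: str, chunk_size: int = 1000) -> list[str]:
    """Split text into chunks of at most `chunk_size` whitespace tokens.

    Assumption: a "token" here is a whitespace-separated word, which is
    only a rough stand-in for model tokens.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    tokens = text.split()
    return [
        " ".join(tokens[start:start + chunk_size])
        for start in range(0, len(tokens), chunk_size)
    ]


# Toy document with 2500 "tokens"
sample = " ".join(f"w{i}" for i in range(2500))
chunks = fixed_token_chunks(sample, chunk_size=1000)
print([len(c.split()) for c in chunks])  # [1000, 1000, 500]
```

Larger chunks keep more surrounding context per retrieved result, while smaller chunks make each embedding more specific, so the value is worth tuning per corpus.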

In [11]:
query = "what is colpali ?"
num_results = 5

search_results = text_rag.search(query, k=num_results)

print("This was the retrieved Context")
for i, r in enumerate(search_results):
    print(f"{'==='*50}")
    print(f"\n\nChunk {i+1}:")
    print(f"Text: {r['text']}")
    print(f"Chunk Index: {r['chunk_index']}")
    print(f"Document Name: {r['document_name']}")
    print(f"\n\n{'==='*50}")

This was the retrieved Context


Chunk 1:
Text: The table uses "W" and "T" markers to denote which system or department serves as the primary source (writer) or storage location (trailer) for each type of document.

## C More similarity maps

In Figure 7, ColPali assigns a high similarity to all patches with the word "Kazakhstan" when given the token <_Kazakhstan> . Moreover, our model seems to exhibit world knowledge capabilities as the patch around the word "Kashagan" - an offshore oil field in Kazakhstan - also shows a high similarity score. On the other hand, in Figure 8, we observe that ColPali is also capable of complex image understanding. Not only are the patches containing the word "formulations" highly similar to the query token _formula , but so is the upper-left molecule structure.

It is also interesting to highlight that both similarity maps showcase a few white patches with high similarity scores. This behavior might first seem surprising as the white patches should not 
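
The retrieval step can be pictured as a nearest-neighbour lookup over the stored chunk embeddings. The sketch below ranks toy rows by cosine similarity in plain Python. The `top_k` helper, the three-dimensional vectors, and the choice of cosine similarity are assumptions for illustration; the notebook does not show which distance metric `SimpleRAG.search` uses inside LanceDB.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, rows, k=5):
    """Return the k rows whose 'vector' is most similar to query_vec."""
    ranked = sorted(
        rows,
        key=lambda row: cosine_similarity(query_vec, row["vector"]),
        reverse=True,
    )
    return ranked[:k]


# Toy index: hand-written vectors stand in for real embeddings
toy_rows = [
    {"text": "ColPali retrieves documents from page images.", "chunk_index": 0, "vector": [0.9, 0.1, 0.0]},
    {"text": "Poppler converts PDF pages to images.", "chunk_index": 1, "vector": [0.1, 0.9, 0.0]},
    {"text": "LanceDB stores vectors on disk.", "chunk_index": 2, "vector": [0.0, 0.2, 0.9]},
]

hits = top_k([1.0, 0.0, 0.0], toy_rows, k=2)
print([h["chunk_index"] for h in hits])  # [0, 1]
```

In the real pipeline the query vector would come from the same embedding model used at indexing time, since vectors from different models are not comparable.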

In [12]:
from IPython.display import display, Markdown

context = "\n".join([r["text"] for r in search_results])
response = llm.query(
    context=context,
    system_prompt="Given the below information answer the questions",
    query=query,
)


display(Markdown(response))

ColPali is a model designed for document retrieval tasks that combines visual retrieval with language processing to enhance performance, particularly in the context of multimodal documents like PDFs, figures, tables, and infographics. It utilizes a late interaction mechanism to compute interactions between text tokens and image patches, improving retrieval capabilities significantly compared to previous models. ColPali can be trained end-to-end, allowing it to adapt to new tasks and specialized domains efficiently. Additionally, it can handle various languages and leverages visual features for query answering, aiming to create systems that function purely from visual information.
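
Joining the chunks with a newline works, but the model then cannot tell where one chunk ends or which document it came from. A possible variation, sketched under assumptions, labels each chunk with its rank and source and stops at a character budget. `build_context` and `max_chars` are hypothetical names, not part of VARAG; the dictionary keys match the fields printed in the search cell above.

```python
def build_context(results, max_chars=4000):
    """Concatenate retrieved chunks with source labels, up to max_chars.

    The first chunk is always included, even if it exceeds the budget,
    so the context is never empty when results exist.
    """
    parts = []
    used = 0
    for rank, r in enumerate(results, start=1):
        block = f"[{rank}] ({r['document_name']}, chunk {r['chunk_index']})\n{r['text']}"
        if parts and used + len(block) > max_chars:
            break
        parts.append(block)
        used += len(block) + 2  # account for the blank-line separator
    return "\n\n".join(parts)


# Toy results shaped like the rows printed in the search cell
toy_results = [
    {"text": "a" * 10, "chunk_index": 0, "document_name": "d.pdf"},
    {"text": "b" * 10, "chunk_index": 1, "document_name": "d.pdf"},
]

print(build_context(toy_results))
```

The returned string could be passed as `context=` in place of the plain newline join, and the character budget is a rough guard against overly long prompts rather than an exact token limit.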

### Run Gradio Demo

In [None]:
%cd examples
!python textDemo.py --share

/content/VARAG/examples
2024-09-28 09:45:43.314833: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-09-28 09:45:43.339060: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-09-28 09:45:43.347860: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
INFO:sentence_transformers.SentenceTransformer:Use pytorch device_name: cuda
INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: all-MiniLM-L6-v2
Using device: cuda
INFO:httpx:HTTP Request: GET https://api.gradio.app/pkg-version "HTTP/1.1 200 OK"
Running on local URL:  http://127.0.0.1:7860
INFO:httpx:HTTP R