# Dense Embeddings vs ColBERTv2

In [3]:
from llama_index.readers.file import PDFReader

loader = PDFReader()

In [4]:
document_path = "/Users/aleixlahoz/Desktop/2409.13598v1.pdf"
documents = loader.load_data(document_path)
list_pdf_documents = [document.text for document in documents]

list_pdf_documents


['Prithvi WxC: Foundation Model for Weather and Climate\nJohannes Schmude1,†,‡,Sujit Roy2,7,†,‡,Will Trojak1,Johannes Jakubik1,\nDaniel Salles Civitarese1,Shraddha Singh1,Julian Kuehnert1,Kumar\nAnkur2,Aman Gupta3,Christopher E Phillips2,Romeo Kienzler1,Daniela\nSzwarcman1,Vishal Gaur2,Rajat Shinde2,Rohit Lal2,Arlindo Da Silva6,\nJorge Luis Guevara Diaz1,Anne Jones1,Simon Pfreundschuh4,Amy Lin2,\nAditi Sheshadri3,Udaysankar Nair2,Valentine Anantharaj5,Hendrik\nHamann1,Campbell Watson1,Manil Maskey7,Tsengdar J Lee8,Juan\nBernabe Moreno1,Rahul Ramachandran7\n†Equal Contribution,\n‡Johannes.Schmude@ibm.com , Sujit.Roy@nasa.gov*\nABSTRACT\nTriggered by the realization that AI emulators can rival the performance of tra-\nditional numerical weather prediction models running on HPC systems, there is\nnow an increasing number of large AI models that address use cases such as fore-\ncasting, downscaling, or nowcasting. While the parallel developments in the AI\nliterature focus on foundation mo
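The extracted text above keeps the PDF's line-break hyphenation (for example `tra-\nditional` and `fore-\ncasting`), so those words reach the indexer split in two. A minimal cleanup sketch, assuming a hyphen directly before a newline always marks a split word; `dehyphenate` is a hypothetical helper, not part of `llama_index` or RAGatouille:

```python
import re


def dehyphenate(text: str) -> str:
    """Rejoin words that were split by a hyphen at a line break."""
    # "tra-\nditional" -> "traditional"; other newlines are left untouched.
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)


sample = "rival the performance of tra-\nditional numerical weather prediction"
print(dehyphenate(sample))
# rival the performance of traditional numerical weather prediction
```

This also removes the hyphen from real compounds that happen to break at a line end (`mid-\nrange` becomes `midrange`), so it is a heuristic rather than a lossless fix. It could be applied as `[dehyphenate(t) for t in list_pdf_documents]` before indexing.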

In [5]:
from ragatouille import RAGPretrainedModel

RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")



[Sep 30, 15:26:58] Loading segmented_maxsim_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...




In [6]:
RAG.index(
    collection=list_pdf_documents,
    index_name="weather_and_climate_paper",
    max_document_length=256,
    split_documents=True,
)

This is a behaviour change from RAGatouille 0.8.0 onwards.
This works fine for most users and smallish datasets, but can be considerably slower than FAISS and could cause worse results in some situations.
If you're confident with FAISS working on your machine, pass use_faiss=True to revert to the FAISS-using behaviour.
--------------------


[Sep 30, 15:28:24] #> Creating directory .ragatouille/colbert/indexes/weather_and_climate_paper 


[Sep 30, 15:28:26] [0] 		 #> Encoding 111 passages..


100%|██████████| 4/4 [00:33<00:00,  8.35s/it]

[Sep 30, 15:28:59] [0] 		 avg_doclen_est = 176.6666717529297 	 len(local_sample) = 111
[Sep 30, 15:28:59] [0] 		 Creating 2,048 partitions.
[Sep 30, 15:28:59] [0] 		 *Estimated* 19,610 embeddings.
[Sep 30, 15:28:59] [0] 		 #> Saving the indexing plan to .ragatouille/colbert/indexes/weather_and_climate_paper/plan.json ..





used 20 iterations (4.9246s) to cluster 18630 items into 2048 clusters
[0.036, 0.034, 0.038, 0.032, 0.033, 0.035, 0.036, 0.033, 0.033, 0.035, 0.034, 0.033, 0.033, 0.036, 0.034, 0.036, 0.031, 0.034, 0.033, 0.034, 0.036, 0.038, 0.031, 0.037, 0.033, 0.034, 0.034, 0.034, 0.036, 0.038, 0.035, 0.037, 0.036, 0.032, 0.033, 0.035, 0.035, 0.034, 0.034, 0.038, 0.035, 0.035, 0.036, 0.034, 0.036, 0.032, 0.034, 0.036, 0.034, 0.033, 0.034, 0.035, 0.034, 0.034, 0.033, 0.035, 0.037, 0.038, 0.037, 0.031, 0.034, 0.035, 0.034, 0.035, 0.037, 0.037, 0.036, 0.035, 0.033, 0.032, 0.036, 0.035, 0.034, 0.034, 0.036, 0.035, 0.037, 0.036, 0.037, 0.039, 0.036, 0.034, 0.036, 0.035, 0.034, 0.033, 0.035, 0.035, 0.034, 0.038, 0.035, 0.038, 0.033, 0.036, 0.035, 0.036, 0.037, 0.035, 0.036, 0.034, 0.037, 0.038, 0.032, 0.035, 0.037, 0.031, 0.034, 0.032, 0.034, 0.033, 0.036, 0.036, 0.036, 0.033, 0.036, 0.032, 0.035, 0.033, 0.034, 0.036, 0.034, 0.034, 0.036, 0.036, 0.035, 0.039, 0.036, 0.034]


0it [00:00, ?it/s]

[Sep 30, 15:29:04] [0] 		 #> Encoding 111 passages..


100%|██████████| 4/4 [00:32<00:00,  8.01s/it]
1it [00:32, 32.21s/it]
100%|██████████| 1/1 [00:00<00:00, 865.70it/s]

[Sep 30, 15:29:36] #> Optimizing IVF to store map from centroids to list of pids..
[Sep 30, 15:29:36] #> Building the emb2pid mapping..
[Sep 30, 15:29:36] len(emb2pid) = 19610



100%|██████████| 2048/2048 [00:00<00:00, 79116.67it/s]

[Sep 30, 15:29:36] #> Saved optimized IVF to .ragatouille/colbert/indexes/weather_and_climate_paper/ivf.pid.pt





Done indexing!


'.ragatouille/colbert/indexes/weather_and_climate_paper'
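The numbers in the indexing log can be checked by hand. The estimated embedding count is the passage count multiplied by the average passage length in tokens. The partition count appears to follow the heuristic "largest power of two not exceeding 16 × sqrt(estimated embeddings)"; that formula is my recollection of the ColBERT indexer and should be read as an assumption, not something this log confirms.

```python
import math

# Values copied from the indexing log above.
num_passages = 111
avg_doclen_est = 176.6666717529297

# One embedding per token, so: passages x average tokens per passage.
num_embeddings_est = num_passages * avg_doclen_est

# Assumed heuristic: round 16 * sqrt(N) down to a power of two.
num_partitions = 2 ** math.floor(math.log2(16 * math.sqrt(num_embeddings_est)))

print(round(num_embeddings_est))  # 19610
print(num_partitions)             # 2048
```

Both values match the `*Estimated* 19,610 embeddings` and `Creating 2,048 partitions` lines in the log.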

In [7]:
results = RAG.search(
    query="Which examples of variables do they use to predict the weather?",
    k=3,
    index_name="weather_and_climate_paper",
)

Loading searcher for index weather_and_climate_paper for the first time... This may take a few seconds




[Sep 30, 15:34:26] #> Loading codec...
[Sep 30, 15:34:26] #> Loading IVF...
[Sep 30, 15:34:26] Loading segmented_lookup_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...




[Sep 30, 15:34:35] #> Loading doclens...


100%|██████████| 1/1 [00:00<00:00, 2319.86it/s]

[Sep 30, 15:34:35] #> Loading codes and residuals...



100%|██████████| 1/1 [00:00<00:00, 353.23it/s]

[Sep 30, 15:34:35] Loading filter_pids_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...





[Sep 30, 15:34:42] Loading decompress_residuals_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
Searcher loaded!

#> QueryTokenizer.tensorize(batch_text[0], batch_background[0], bsize) ==
#> Input: . Which examples of variables do they use to predict the weather?, 		 True, 		 None
#> Output IDs: torch.Size([32]), tensor([  101,     1,  2029,  4973,  1997, 10857,  2079,  2027,  2224,  2000,
        16014,  1996,  4633,  1029,   102,   103,   103,   103,   103,   103,
          103,   103,   103,   103,   103,   103,   103,   103,   103,   103,
          103,   103])
#> Output Mask: torch.Size([32]), tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0])
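The tokenizer output above shows the query padded to 32 positions with token id 103 (`[MASK]` in the BERT vocabulary), so ColBERT always scores a fixed-size matrix of query token embeddings. Scoring is done by late interaction: every query token takes its best match among the document's token embeddings, and those maxima are summed. The sketch below contrasts that with a single-vector comparison. It uses random toy vectors in pure Python, and mean pooling is only a stand-in for a dense embedding model; none of these helper names come from RAGatouille.

```python
import math
import random


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs, doc_vecs):
    """Late interaction: sum over query tokens of the best-matching doc token."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


def mean_pool(vecs):
    """Collapse token vectors into one unit vector (stand-in for a dense model)."""
    dim = len(vecs[0])
    return normalize([sum(v[i] for v in vecs) / len(vecs) for i in range(dim)])


def single_vector_score(query_vecs, doc_vecs):
    """Dense-style scoring: one cosine similarity between pooled vectors."""
    return dot(mean_pool(query_vecs), mean_pool(doc_vecs))


def random_unit_vectors(n, dim, rng):
    return [normalize([rng.gauss(0, 1) for _ in range(dim)]) for _ in range(n)]


rng = random.Random(0)
query_vecs = random_unit_vectors(32, 16, rng)   # 32 query positions, toy dim 16
doc_vecs = random_unit_vectors(180, 16, rng)    # roughly one passage of tokens

print("MaxSim score:       ", maxsim_score(query_vecs, doc_vecs))
print("Single-vector score:", single_vector_score(query_vecs, doc_vecs))
```

With unit-length vectors, the MaxSim score is bounded by the number of query positions (32 here), while the single-vector score is one cosine in [-1, 1]. This is why ColBERT scores and dense cosine scores are not directly comparable.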





In [8]:
results

[{'content': 'Campbell Watson1,Manil Maskey7,Tsengdar J Lee8,Juan\nBernabe Moreno1,Rahul Ramachandran7\n†Equal Contribution,\n‡Johannes.Schmude@ibm.com , Sujit.Roy@nasa.gov*\nABSTRACT\nTriggered by the realization that AI emulators can rival the performance of tra-\nditional numerical weather prediction models running on HPC systems, there is\nnow an increasing number of large AI models that address use cases such as fore-\ncasting, downscaling, or nowcasting. While the parallel developments in the AI\nliterature focus on foundation models – models that can be effectively tuned to\naddress multiple, different use cases – the developments on the weather and cli-\nmate side largely focus on single-use cases with particular emphasis on mid-range\nforecasting. We close this gap by introducing Prithvi WxC, a 2.3 billion parameter\nfoundation model developed using 160 variables from the Modern-Era Retrospec-\ntive Analysis for Research and Applications, Version 2 (MERRA-2).',
  'score': 19.8

In [9]:
for i, doc in enumerate(results):
    print(f"---------------------------------- doc-{i} ------------------------------------")
    print(doc["content"])

---------------------------------- doc-0 ------------------------------------
Campbell Watson1,Manil Maskey7,Tsengdar J Lee8,Juan
Bernabe Moreno1,Rahul Ramachandran7
†Equal Contribution,
‡Johannes.Schmude@ibm.com , Sujit.Roy@nasa.gov*
ABSTRACT
Triggered by the realization that AI emulators can rival the performance of tra-
ditional numerical weather prediction models running on HPC systems, there is
now an increasing number of large AI models that address use cases such as fore-
casting, downscaling, or nowcasting. While the parallel developments in the AI
literature focus on foundation models – models that can be effectively tuned to
address multiple, different use cases – the developments on the weather and cli-
mate side largely focus on single-use cases with particular emphasis on mid-range
forecasting. We close this gap by introducing Prithvi WxC, a 2.3 billion parameter
foundation model developed using 160 variables from the Modern-Era Retrospec-
tive Analysis for Research and Ap
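Printing only `content` hides the relevance score that each result carries. A small formatting sketch, assuming each result is a dict with the `content` and `score` keys visible in the output of `results` above; `format_results` is a hypothetical helper and the example data is a placeholder, not real search output:

```python
def format_results(results, max_chars=200):
    """Return one line per result: rank, score, and a whitespace-collapsed snippet."""
    lines = []
    for rank, doc in enumerate(results, start=1):
        # Collapse the PDF's hard line breaks so the snippet fits on one line.
        snippet = " ".join(doc["content"].split())[:max_chars]
        lines.append(f"{rank}. score={doc['score']:.2f} | {snippet}")
    return "\n".join(lines)


# Placeholder data for illustration only.
placeholder_results = [
    {"content": "first passage\nsplit over lines", "score": 12.5},
    {"content": "second passage", "score": 9.25},
]
print(format_results(placeholder_results))
# 1. score=12.50 | first passage split over lines
# 2. score=9.25 | second passage
```

Calling `print(format_results(results))` on the real search results would give a compact ranked view that is easier to compare against the dense-embedding retriever.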

## With OpenAI Embeddings

In [11]:
import os

from dotenv import load_dotenv
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS
from PyPDF2 import PdfReader

# Load the Azure OpenAI settings from a local .env file.
load_dotenv()

AZURE_OPENAI_ENDPOINT = os.getenv("AZURE_OPENAI_ENDPOINT")
AZURE_OPENAI_API_KEY = os.getenv("AZURE_OPENAI_API_KEY")
AZURE_OPENAI_DEPLOYMENT_NAME = os.getenv("AZURE_OPENAI_DEPLOYMENT_NAME")
AZURE_OPENAI_DEPLOYMENT_NAME_EMBEDDINGS = os.getenv("AZURE_OPENAI_DEPLOYMENT_NAME_EMBEDDINGS")
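`os.getenv` returns `None` for any variable that is not set, and that usually only surfaces later as an opaque authentication or deployment error. A minimal check that could run right after `load_dotenv()`; `missing_env_vars` is a hypothetical helper, and treating empty strings as missing is an assumption:

```python
import os

REQUIRED_VARS = (
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_DEPLOYMENT_NAME",
    "AZURE_OPENAI_DEPLOYMENT_NAME_EMBEDDINGS",
)


def missing_env_vars(names=REQUIRED_VARS, environ=None):
    """Return the names that are unset or empty in the given environment."""
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

In a notebook a printed warning is often enough; in a script it may be preferable to raise an exception instead so the run stops early.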