# RAG From Scratch: Indexing

![Indexing overview](indexing.png)

## Set Environment Vars and API Keys

In [None]:
from dotenv import load_dotenv, find_dotenv
load_dotenv(find_dotenv())

import os
os.environ['LANGCHAIN_TRACING_V2'] = 'true'
os.environ['LANGCHAIN_ENDPOINT'] = 'https://api.smith.langchain.com'
os.environ['LANGCHAIN_PROJECT'] = 'advanced-rag'
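# load_dotenv() has already placed the keys from .env into os.environ; fail early if one is missing.
for key in ("LANGCHAIN_API_KEY", "GROQ_API_KEY"):
    if not os.getenv(key):
        raise EnvironmentError(f"{key} is not set; add it to your .env file")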

## Part 12: Multi-representation Indexing

Flow:

![Multi-representation indexing flow](multiindexing.png)

Docs:

https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector
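
The idea behind the flow above can be sketched without any LangChain machinery: a compact representation (here, a summary) is what gets searched, while a shared ID links it back to the full parent document that is ultimately returned. The following is a minimal, illustrative sketch only. The word-overlap `score` function is a toy stand-in for embedding similarity, and all names (`parents`, `summary_index`, `add_document`, `retrieve`) are hypothetical.

```python
import uuid

# Full "parent" documents keyed by ID (plays the role of the docstore).
parents = {}
# Searchable summaries, each carrying its parent's ID (plays the role of the vector store).
summary_index = []

def add_document(full_text, summary):
    """Store the full text and index its summary under a shared ID."""
    doc_id = str(uuid.uuid4())
    parents[doc_id] = full_text
    summary_index.append({"doc_id": doc_id, "summary": summary})
    return doc_id

def score(query, text):
    # Toy similarity (assumption): number of shared lowercase words.
    return len(set(query.lower().split()) & set(text.lower().split()))

def retrieve(query):
    """Search over summaries, then return the linked full document."""
    best = max(summary_index, key=lambda entry: score(query, entry["summary"]))
    return parents[best["doc_id"]]

rag_id = add_document(
    "Long article text about retrieval-augmented generation ...",
    "RAG grounds language models in external knowledge",
)
agent_id = add_document(
    "Long article text about AI agents ...",
    "AI agents are systems that perform tasks and make decisions",
)

result = retrieve("what are agents")
print(result)
```

The query is matched against the short summaries, but the caller receives the full parent text, which is the same split of responsibilities that the vector store and docstore have in the cells below.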

In [None]:
from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

loader = WebBaseLoader("https://medium.com/@pankaj_pandey/introduction-to-retrieval-augmented-generation-rag-9209bf8a076d")
docs = loader.load()

loader = WebBaseLoader("https://medium.com/humansdotai/an-introduction-to-ai-agents-e8c4afd2ee8f")
docs.extend(loader.load())

In [None]:
import uuid

from langchain_core.documents import Document
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_groq import ChatGroq

chain = (
    {"doc": lambda x: x.page_content}
    | ChatPromptTemplate.from_template("Summarize the following document:\n\n{doc}")
    | ChatGroq()
    | StrOutputParser()
)

summaries = chain.batch(docs, {"max_concurrency": 1})

In [None]:
summaries

['Retrieval-Augmented Generation (RAG) is an AI framework that enhances the accuracy and reliability of Large Language Models (LLMs) by grounding them in external knowledge bases. RAG addresses the inconsistencies and lack of understanding in LLM-generated responses by providing access to up-to-date facts and verifiable sources, increasing user trust. It consists of two phases: retrieval and content generation. In the retrieval phase, relevant information is searched for and retrieved from external knowledge bases, while in the content generation phase, the LLM synthesizes an answer based on both the retrieved information and its internal representation of training data. RAG offers advantages such as access to current and reliable information, reduced opportunities for sensitive data leakage, and lower computational and financial costs in LLM-powered applications. It is implemented in an "open book" manner, allowing LLMs to respond to questions by browsing through external content. RAG

In [None]:
from langchain.storage import InMemoryByteStore
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceBgeEmbeddings

model_name = "BAAI/bge-small-en"
model_kwargs = {"device": "cpu"}
encode_kwargs = {"normalize_embeddings": True}
hf_embeddings = HuggingFaceBgeEmbeddings(
    model_name=model_name, model_kwargs=model_kwargs, encode_kwargs=encode_kwargs
)

# The vectorstore that indexes the summaries (the searchable representations)
vectorstore = Chroma(collection_name="summaries",
                     embedding_function=hf_embeddings)

# The storage layer for the parent documents
store = InMemoryByteStore()
id_key = "doc_id"

# The retriever
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=store,
    id_key=id_key,
)
doc_ids = [str(uuid.uuid4()) for _ in docs]

# Docs linked to summaries
summary_docs = [
    Document(page_content=s, metadata={id_key: doc_ids[i]})
    for i, s in enumerate(summaries)
]

# Add
retriever.vectorstore.add_documents(summary_docs)
retriever.docstore.mset(list(zip(doc_ids, docs)))

In [None]:
doc_ids

['cd92fc06-5ff8-453c-af47-d769f679645f',
 '0fe1d412-c3c2-45c3-9637-57cc9948f0b4']

In [None]:
retriever.docstore.mget(doc_ids)

[Document(page_content='Introduction to Retrieval-Augmented Generation (RAG) | by Pankaj Pandey | MediumOpen in appSign upSign inWriteSign upSign inMastodonIntroduction to Retrieval-Augmented Generation (RAG)Pankaj Pandey·Follow6 min read·Dec 16, 2023--ListenShareRAG systems aim to address the drawbacks of Large Language Models by incorporating factual information during response generation, mitigating issues such as knowledge cutoff and response hallucination.Retrieval Augmented Generation (RAG)The world is advancing rapidly, introducing new technologies and stacks in AI and other areas every day. Large Language Models (LLMs) are a significant innovation in this space. However, LLMs have drawbacks due to their knowledge cutoff and other reasons, leading to confident but inaccurate responses. The RAG systems aim to address this issue by incorporating factual information during response generation to prevent hallucination and retrieve accurate responses.Introduction:RAG is an AI framewo

In [None]:
query = "What is agent"
sub_docs = vectorstore.similarity_search(query, k=1)
sub_docs[0]

Document(page_content='AI Agents are intelligent systems that can perform tasks, make decisions, and interact with their environment like humans do. They are powered by machine learning, natural language processing, and other advanced technologies, allowing them to learn from data, adapt to new information, and execute complex functions autonomously. AI Agents exist in various forms, such as chatbots providing customer service and robots used in healthcare and manufacturing. They are designed to understand, analyze, and respond to human input, constantly evolving to enhance their capabilities.\n\nAI Agents operate independently and can address customer queries, make fast decisions based on real-time information, and simplify business processes. They perceive their surroundings and execute actions through a range of tools, from rule-based systems to machine learning algorithms. AI Agents are the new face of intelligent automation and can handle vast streams of fresh data in uncertain la

In [None]:
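# get_relevant_documents() is deprecated in favor of invoke(); the extra n_results
# argument had no effect (the underlying search still requested the default 4 results).
retrieved_docs = retriever.invoke(query)
retrieved_docs[0].page_content[0:500]

Number of requested results 4 is greater than number of elements in index 2, updating n_results = 2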


'An Introduction to AI Agents. Artificial Intelligence Agents are the… | by Humans.ai | humansdotai | MediumOpen in appSign upSign inWriteSign upSign inAn Introduction to AI AgentsHumans.ai·FollowPublished inhumansdotai·7 min read·Dec 27, 2023--1ListenShareArtificial Intelligence Agents are the digital newcomers revolutionizing our world. These agents, often called AI bots or virtual assistants, are intelligent systems programmed to perform tasks, make decisions, and interact with their environme'

## Part 13: ColBERT

RAGatouille makes it simple to use ColBERT.

ColBERT generates a contextually influenced vector for each token in the passages.

ColBERT similarly generates vectors for each token in the query.

Then, the score of each document is the sum, over all query tokens, of the maximum similarity between that query token's embedding and any of the document's token embeddings.
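
This late-interaction scoring rule (often called MaxSim) can be sketched in a few lines of plain Python. The snippet is illustrative only: the 2-dimensional vectors are made up, they are left unnormalized for readability, and the helper names `dot` and `maxsim_score` are hypothetical rather than part of ColBERT or RAGatouille.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def maxsim_score(query_vecs, doc_vecs):
    """Sum, over query tokens, of the best similarity against any document token."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

# Toy token embeddings (assumed values, one vector per token).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
doc_b = [[0.5, 0.5], [0.5, 0.5]]

score_a = maxsim_score(query, doc_a)  # 1.0 + 1.0 = 2.0
score_b = maxsim_score(query, doc_b)  # 0.5 + 0.5 = 1.0
print(score_a, score_b)
```

Each query token finds an exact match in `doc_a`, so it outranks `doc_b`, whose tokens match every query token only partially.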

See the [HackerNoon article](https://hackernoon.com/how-colbert-helps-developers-overcome-the-limits-of-rag), the [LangChain RAGatouille integration docs](https://python.langchain.com/docs/integrations/retrievers/ragatouille), and [Simon Willison's TIL](https://til.simonwillison.net/llms/colbert-ragatouille).

ColBERT is used to enhance the retrieval component.

In [1]:
! pip install ragatouille

Collecting ragatouille
  Downloading ragatouille-0.0.8.post2-py3-none-any.whl (41 kB)
Collecting colbert-ai==0.2.19 (from ragatouille)
  Downloading colbert-ai-0.2.19.tar.gz (86 kB)
  Preparing metadata (setup.py) ... done
Collecting faiss-cpu<2.0.0,>=1.7.4 (from ragatouille)
  Downloading faiss_cpu-1.8.0.post1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (27.0 MB)
Collecting fast-pytorch-kmeans==0.2.0.1 (from ragatouille)
  Downloading fast_pytorch_kmeans-0.2.0.1-py3-none-any.whl (8.8 kB)
Collecting langchain<0.2.0,>=0.1.0 (from ragatouille)
  Downloading langchain-0.1.20-py3-none-any.

In [2]:
from ragatouille import RAGPretrainedModel
RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'


[Jul 06, 00:05:37] Loading segmented_maxsim_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...




In [3]:
import requests

def get_wikipedia_page(title: str):
    """
    Retrieve the full text content of a Wikipedia page.

    :param title: str - Title of the Wikipedia page.
    :return: str - Full text content of the page as raw string.
    """
    # Wikipedia API endpoint
    URL = "https://en.wikipedia.org/w/api.php"

    # Parameters for the API request
    params = {
        "action": "query",
        "format": "json",
        "titles": title,
        "prop": "extracts",
        "explaintext": True,
    }

    # Custom User-Agent header to comply with Wikipedia's best practices
    headers = {"User-Agent": "RAGatouille_tutorial/0.0.1 (ben@clavie.eu)"}

    response = requests.get(URL, params=params, headers=headers)
    data = response.json()

    # Extracting page content
    page = next(iter(data["query"]["pages"].values()))
    return page["extract"] if "extract" in page else None

full_document = get_wikipedia_page("Document_retrieval")

In [4]:
RAG.index(
    collection=[full_document],
    index_name="Doc-1",
    max_document_length=180,
    split_documents=True,
)

This is a behaviour change from RAGatouille 0.8.0 onwards.
This works fine for most users and smallish datasets, but can be considerably slower than FAISS and could cause worse results in some situations.
If you're confident with FAISS working on your machine, pass use_faiss=True to revert to the FAISS-using behaviour.
--------------------


[Jul 06, 00:07:28] #> Creating directory .ragatouille/colbert/indexes/Doc-1 






[Jul 06, 00:07:30] [0] 		 #> Encoding 7 passages..


100%|██████████| 1/1 [00:04<00:00,  4.65s/it]

[Jul 06, 00:07:35] [0] 		 avg_doclen_est = 116.42857360839844 	 len(local_sample) = 7
[Jul 06, 00:07:35] [0] 		 Creating 256 partitions.
[Jul 06, 00:07:35] [0] 		 *Estimated* 815 embeddings.
[Jul 06, 00:07:35] [0] 		 #> Saving the indexing plan to .ragatouille/colbert/indexes/Doc-1/plan.json ..





used 6 iterations (0.0408s) to cluster 775 items into 256 clusters


0it [00:00, ?it/s]

[Jul 06, 00:07:35] [0] 		 #> Encoding 7 passages..



100%|██████████| 1/1 [00:04<00:00,  4.51s/it]
1it [00:04,  4.59s/it]
100%|██████████| 1/1 [00:00<00:00, 798.15it/s]

[Jul 06, 00:07:39] #> Optimizing IVF to store map from centroids to list of pids..
[Jul 06, 00:07:39] #> Building the emb2pid mapping..
[Jul 06, 00:07:39] len(emb2pid) = 815



100%|██████████| 256/256 [00:00<00:00, 15593.11it/s]

[Jul 06, 00:07:39] #> Saved optimized IVF to .ragatouille/colbert/indexes/Doc-1/ivf.pid.pt





Done indexing!


'.ragatouille/colbert/indexes/Doc-1'
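
With `split_documents=True`, the call above asks RAGatouille to break the document into passages bounded by `max_document_length` before encoding them (the log reports 7 passages). As a rough illustration of that idea only, and not RAGatouille's actual splitter, the sketch below chunks text by whitespace-separated words. The function name `split_into_passages` and the word-based length limit are assumptions made for this example.

```python
def split_into_passages(text, max_words):
    """Split text into consecutive passages of at most max_words words."""
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]

sample = "one two three four five six seven"
passages = split_into_passages(sample, max_words=3)
print(passages)
```

Each passage is then encoded and scored on its own, which is why search results carry a `passage_id` alongside the `document_id`.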

In [5]:
results = RAG.search(query="What is an example for form based indexing?", k=3)
results

Loading searcher for index Doc-1 for the first time... This may take a few seconds
[Jul 06, 00:08:11] #> Loading codec...
[Jul 06, 00:08:11] #> Loading IVF...
[Jul 06, 00:08:11] Loading segmented_lookup_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
[Jul 06, 00:08:48] #> Loading doclens...


100%|██████████| 1/1 [00:00<00:00, 3279.36it/s]

[Jul 06, 00:08:48] #> Loading codes and residuals...



100%|██████████| 1/1 [00:00<00:00, 130.10it/s]

[Jul 06, 00:08:48] Loading filter_pids_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...





[Jul 06, 00:09:20] Loading decompress_residuals_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
Searcher loaded!

#> QueryTokenizer.tensorize(batch_text[0], batch_background[0], bsize) ==
#> Input: . What is an example for form based indexing?, 		 True, 		 None
#> Output IDs: torch.Size([32]), tensor([ 101,    1, 2054, 2003, 2019, 2742, 2005, 2433, 2241, 5950, 2075, 1029,
         102,  103,  103,  103,  103,  103,  103,  103,  103,  103,  103,  103,
         103,  103,  103,  103,  103,  103,  103,  103])
#> Output Mask: torch.Size([32]), tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0])





[{'content': '== Variations ==\nThere are two main classes of indexing schemata for document retrieval systems: form based (or word based), and content based indexing. The document classification scheme (or indexing algorithm) in use determines the nature of the document retrieval system.\n\n\n=== Form based ===\nForm based document retrieval addresses the exact syntactic properties of a text, comparable to substring matching in string searches. The text is generally unstructured and not necessarily in a natural language, the system could for example be used to process large sets of chemical representations in molecular biology. A suffix tree algorithm is an example for form based indexing.',
  'score': 25.978090286254883,
  'rank': 1,
  'document_id': '83decb83-f58e-4d89-b9c1-51f09daadcfa',
  'passage_id': 2},
 {'content': '== Example: PubMed ==\nThe PubMed form interface features the "related articles" search which works through a comparison of words from the documents\' title, abstr

In [6]:
retriever = RAG.as_langchain_retriever(k=3)
retriever.invoke("What is an example for form based indexing?")



[Document(page_content='== Variations ==\nThere are two main classes of indexing schemata for document retrieval systems: form based (or word based), and content based indexing. The document classification scheme (or indexing algorithm) in use determines the nature of the document retrieval system.\n\n\n=== Form based ===\nForm based document retrieval addresses the exact syntactic properties of a text, comparable to substring matching in string searches. The text is generally unstructured and not necessarily in a natural language, the system could for example be used to process large sets of chemical representations in molecular biology. A suffix tree algorithm is an example for form based indexing.'),
 Document(page_content='== Example: PubMed ==\nThe PubMed form interface features the "related articles" search which works through a comparison of words from the documents\' title, abstract, and MeSH terms using a word-weighted algorithm.\n\n\n== See also ==\nCompound term processing\n