# RAG From Scratch: Indexing

![index](./Images/index.png)

## Preface: Chunking

We don't explicitly cover document chunking or splitting here.

For an excellent review of document chunking, see this video from Greg Kamradt:

https://www.youtube.com/watch?v=8OJC21T2SL4
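
For orientation only, here is a minimal sketch of the simplest chunking strategy: fixed-size character windows with overlap. The function name `chunk_text` and its parameters are hypothetical and chosen for illustration. This is not what `RecursiveCharacterTextSplitter` does internally, and it is not taken from the video.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Hypothetical helper for illustration; real splitters usually
    respect separators such as paragraphs and sentences.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, overlap=2))
```

The overlap means text that straddles a window boundary still appears whole in at least one chunk, at the cost of storing some characters twice.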

## Environment

`(1) Packages`

In [1]:
!pip install -qU langchain_community tiktoken langchain-openai langchainhub chromadb langchain youtube-transcript-api pytube langchain-google-genai google-generativeai langchain-text-splitters python-dotenv

`(2) LangSmith`

https://docs.smith.langchain.com/

In [2]:
import os
os.environ['LANGCHAIN_TRACING_V2'] = 'true'
os.environ['LANGCHAIN_ENDPOINT'] = 'https://api.smith.langchain.com'
os.environ['LANGCHAIN_API_KEY'] = "<your-api-key>"  # placeholder: never commit a real key

In [3]:
from dotenv import load_dotenv, find_dotenv
load_dotenv(find_dotenv())

True

In [4]:
import os

import google.generativeai as genai
from dotenv import load_dotenv, find_dotenv

# Load GOOGLE_API_KEY from a local .env file, then configure the Gemini client
load_dotenv(find_dotenv())
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])


In [5]:
from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

loader = WebBaseLoader("https://lilianweng.github.io/posts/2023-06-23-agent/")
docs = loader.load()

loader = WebBaseLoader("https://lilianweng.github.io/posts/2024-02-05-human-data-quality/")
docs.extend(loader.load())

USER_AGENT environment variable not set, consider setting it to identify your requests.


## Part 12: Multi-representation Indexing

Flow:

![Multirepresentation](./Images/Multirepresentation.png)

Docs:

https://blog.langchain.dev/semi-structured-multi-modal-rag/

https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector

Paper:

https://arxiv.org/abs/2312.06648

In [6]:
import uuid

from langchain_core.documents import Document
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_google_genai import ChatGoogleGenerativeAI


In [7]:
chain = (
    {"doc": lambda x: x.page_content}
    | ChatPromptTemplate.from_template("Summarize the following document:\n\n{doc}")
    | ChatGoogleGenerativeAI(model="gemini-pro")
    | StrOutputParser()
)
summaries = chain.batch(docs, {"max_concurrency": 5})

In [8]:
summaries

['**Summary of LLM-Powered Autonomous Agents**\n\n**Overview**\n\nLLM-powered autonomous agents leverage large language models (LLMs) as their core controllers, supplemented by key components:\n\n* **Planning:**\n    * Task decomposition: Agents break down tasks into manageable subgoals.\n    * Self-reflection: Agents learn from mistakes and refine actions.\n* **Memory:**\n    * Short-term memory: In-context learning.\n    * Long-term memory: External vector store for retaining information.\n* **Tool Use:**\n    * Agents learn to call external APIs for missing information.\n\n**Key Concepts**\n\n**Planning**\n\n* **Chain of Thought (CoT):** Step-by-step decomposition of tasks.\n* **Tree of Thoughts (ToT):** Multiple reasoning paths for each step.\n* **LLM+P:** Integration with external classical planners.\n* **Self-Reflection:**\n    * ReAct: Extends action space to include language for reasoning.\n    * Reflexion: Dynamic memory and self-reflection for improved reasoning.\n    * Chain

In [9]:
from langchain.storage import InMemoryByteStore
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.retrievers.multi_vector import MultiVectorRetriever

In [13]:
embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")

In [16]:
# The vectorstore that indexes the summaries (the searchable representation)
vectorstore = Chroma(collection_name="summaries",
                     embedding_function=embeddings)

RuntimeError: Your system has an unsupported version of sqlite3. Chroma requires sqlite3 >= 3.35.0.
Please visit https://docs.trychroma.com/troubleshooting#sqlite to learn how to upgrade.
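
The error above refers to the SQLite library that the Python `sqlite3` module is linked against, not to a Python package version. A quick way to check what this environment provides is sketched below. The helper name `sqlite_is_supported` is hypothetical, and the minimum version is simply copied from the error message.

```python
import sqlite3

# Minimum version quoted in the Chroma error message above
MIN_SQLITE = (3, 35, 0)


def sqlite_is_supported(version_info=None, minimum=MIN_SQLITE) -> bool:
    """Return True if the given (or linked) SQLite version meets the minimum."""
    if version_info is None:
        # Version of the SQLite library that Python is linked against
        version_info = sqlite3.sqlite_version_info
    return tuple(version_info) >= tuple(minimum)


print("Linked SQLite version:", sqlite3.sqlite_version)
print("Meets Chroma's stated minimum:", sqlite_is_supported())
```

If the check prints `False`, the Chroma troubleshooting page linked in the error describes the upgrade options.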

In [12]:
# The storage layer for the parent documents
store = InMemoryByteStore()
id_key = "doc_id"

# The retriever
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=store,
    id_key=id_key,
)
doc_ids = [str(uuid.uuid4()) for _ in docs]

# Docs linked to summaries
summary_docs = [
    Document(page_content=s, metadata={id_key: doc_ids[i]})
    for i, s in enumerate(summaries)
]

# Add
retriever.vectorstore.add_documents(summary_docs)
retriever.docstore.mset(list(zip(doc_ids, docs)))

NameError: name 'vectorstore' is not defined

In [17]:
query = "Memory in agents"
sub_docs = vectorstore.similarity_search(query, k=1)
sub_docs[0]

NameError: name 'vectorstore' is not defined

In [18]:
# invoke() is the current retriever entry point; get_relevant_documents() is deprecated
retrieved_docs = retriever.invoke(query)
retrieved_docs[0].page_content[0:500]

NameError: name 'retriever' is not defined

A related idea is the [parent document retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/parent_document_retriever).
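
Because the Chroma cell fails in this run, the retrieval cells never produce output. As a dependency-free sketch of the multi-representation pattern itself, the example below searches over short summaries but returns the full parent documents, linked through a shared `doc_id`. Everything here is an assumption made for illustration: the bag-of-words `embed` function stands in for a real embedding model, and the documents and summaries are toy strings, not the outputs of the cells above.

```python
import math
import uuid
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical full documents and hand-written summaries
full_docs = [
    "FULL TEXT 1: a long article covering planning, memory and tool use in LLM agents ...",
    "FULL TEXT 2: a long article covering how to collect high quality human data ...",
]
toy_summaries = [
    "agents use planning memory and tools",
    "collecting high quality human annotation data",
]

id_key = "doc_id"
doc_ids = [str(uuid.uuid4()) for _ in full_docs]

# Search index: only the summaries are embedded, each tagged with its parent id
summary_index = [
    {"vector": embed(s), id_key: doc_ids[i]} for i, s in enumerate(toy_summaries)
]

# Document store: maps each id back to the full parent document
docstore = dict(zip(doc_ids, full_docs))


def retrieve(query: str, k: int = 1) -> list[str]:
    """Rank summaries against the query, then return the linked full documents."""
    q = embed(query)
    ranked = sorted(summary_index, key=lambda e: cosine(q, e["vector"]), reverse=True)
    return [docstore[e[id_key]] for e in ranked[:k]]


print(retrieve("memory in agents")[0])
```

The design point is the split between what is searched (compact summaries) and what is returned (complete documents), which is the same role the vectorstore and byte store play in `MultiVectorRetriever`.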

## Part 13: RAPTOR

Flow:

![Screenshot 2024-03-16 at 6.16.21 PM.png](attachment:5ccfe50d-d22e-402b-86f6-b3afb0f06088.png)

Deep dive video:

https://www.youtube.com/watch?v=jbGchdTL7d0

Paper:

https://arxiv.org/pdf/2401.18059.pdf

Full code:

https://github.com/langchain-ai/langchain/blob/master/cookbook/RAPTOR.ipynb

## Part 14: ColBERT

RAGatouille makes it simple to use ColBERT.

ColBERT generates a contextually influenced vector for each token in the passages. 

ColBERT similarly generates vectors for each token in the query.

Then, the score of each document is the sum, over query tokens, of the maximum similarity between that query token's embedding and any of the document's token embeddings.
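
Written out, the score is $S(q, d) = \sum_{i} \max_{j} \, \mathrm{sim}(q_i, d_j)$, where $q_i$ are the query token vectors and $d_j$ the document token vectors. The sketch below computes this late-interaction score with plain Python lists. It assumes the dot product as the similarity function and uses made-up 2-dimensional vectors that are not normalized; the function names are hypothetical and this is not RAGatouille's implementation.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs, doc_vecs):
    """Sum, over query tokens, of the best similarity to any document token."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Hypothetical token embeddings: one vector per token
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
doc_b = [[0.5, 0.5], [0.0, 1.0]]

scores = {"doc_a": maxsim_score(query, doc_a), "doc_b": maxsim_score(query, doc_b)}
print(scores)  # doc_a scores 2.0, doc_b scores 1.5
```

Here `doc_a` ranks higher because each query token finds an exact match among its tokens, while the first query token only partially matches anything in `doc_b`.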

For more detail, see [this HackerNoon article](https://hackernoon.com/how-colbert-helps-developers-overcome-the-limits-of-rag), the [LangChain RAGatouille integration docs](https://python.langchain.com/docs/integrations/retrievers/ragatouille), and [Simon Willison's notes](https://til.simonwillison.net/llms/colbert-ragatouille).

In [19]:
!pip install -U ragatouille

Collecting ragatouille
  Downloading ragatouille-0.0.8.post4-py3-none-any.whl.metadata (15 kB)
Collecting colbert-ai==0.2.19 (from ragatouille)
  Downloading colbert-ai-0.2.19.tar.gz (86 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting fast-pytorch-kmeans==0.2.0.1 (from ragatouille)
  Downloading fast_pytorch_kmeans-0.2.0.1-py3-none-any.whl.metadata (1.1 kB)
Collecting llama-index>=0.7 (from ragatouille)
  Downloading llama_index-0.11.8-py3-none-any.whl.metadata (11 kB)
Collecting onnx<2.0.0,>=1.15.0 (from ragatouille)
  Downloading onnx-1.16.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (16 kB)
Collecting sentence-transformers<3.0.0,>=2.2.2 (from ragatouille)
  Downloading sentence_transformers-2.7.0-py3-none-any.whl.metadata (11 kB)
Collecting srsly==2.4.8 (from ragatouille)
  Downloading srsly-2.4.8-cp312-cp312-manylinux_2_17_x86_

In [20]:
from ragatouille import RAGPretrainedModel
RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

ModuleNotFoundError: No module named 'pydantic.functional_serializers'

In [21]:
import requests

def get_wikipedia_page(title: str):
    """
    Retrieve the full text content of a Wikipedia page.

    :param title: str - Title of the Wikipedia page.
    :return: str - Full text content of the page as raw string.
    """
    # Wikipedia API endpoint
    URL = "https://en.wikipedia.org/w/api.php"

    # Parameters for the API request
    params = {
        "action": "query",
        "format": "json",
        "titles": title,
        "prop": "extracts",
        "explaintext": True,
    }

    # Custom User-Agent header to comply with Wikipedia's best practices
    headers = {"User-Agent": "RAGatouille_tutorial/0.0.1 (ben@clavie.eu)"}

    response = requests.get(URL, params=params, headers=headers)
    data = response.json()

    # Extracting page content
    page = next(iter(data["query"]["pages"].values()))
    return page["extract"] if "extract" in page else None

full_document = get_wikipedia_page("Hayao_Miyazaki")

In [22]:
RAG.index(
    collection=[full_document],
    index_name="Miyazaki-123",
    max_document_length=180,
    split_documents=True,
)

NameError: name 'RAG' is not defined

In [ ]:
results = RAG.search(query="What animation studio did Miyazaki found?", k=3)
results

In [23]:
retriever = RAG.as_langchain_retriever(k=3)
retriever.invoke("What animation studio did Miyazaki found?")

NameError: name 'RAG' is not defined