
# Multi-stage RAG with Sentence Window Retrieval and Cohere Reranking

This notebook demonstrates how to build a production-ready Retrieval Augmented Generation (RAG) pipeline using LlamaIndex and Cohere. The pipeline consists of the following key steps:

1. Indexing documents using LlamaIndex and FastEmbed embeddings
2. Storing the embeddings in a KDB.AI vector database
3. Querying the index to retrieve relevant documents
4. Reranking the retrieved documents using Cohere's reranking model
5. Generating a response using the top reranked documents as context

## Setup

In [None]:
# Install required libraries
!pip install cohere llama-index fastembed kdbai_client

Collecting cohere
  Using cached cohere-5.5.0-py3-none-any.whl (158 kB)
Collecting llama-index
  Using cached llama_index-0.10.37-py3-none-any.whl (6.8 kB)
Collecting fastembed
  Using cached fastembed-0.2.7-py3-none-any.whl (27 kB)
Collecting kdbai_client
  Using cached kdbai_client-1.1.0-py3-none-any.whl (18 kB)
Collecting boto3<2.0.0,>=1.34.0 (from cohere)
  Using cached boto3-1.34.105-py3-none-any.whl (139 kB)
Collecting fastavro<2.0.0,>=1.9.4 (from cohere)
  Using cached fastavro-1.9.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.1 MB)
Collecting httpx>=0.21.2 (from cohere)
  Using cached httpx-0.27.0-py3-none-any.whl (75 kB)
Collecting httpx-sse<0.5.0,>=0.4.0 (from cohere)
  Using cached httpx_sse-0.4.0-py3-none-any.whl (7.8 kB)
Collecting types-requests<3.0.0,>=2.0.0 (from cohere)
  Using cached types_requests-2.31.0.20240406-py3-none-any.whl (15 kB)
Collecting llama-index-agent-openai<0.3.0,>=0.1.4 (from llama-index)
  Using cached llama_index_agent_openai-0.2.

In [None]:
# vector DB
import os
from getpass import getpass
import kdbai_client as kdbai
import time

In [None]:
KDBAI_ENDPOINT = (
    os.environ["KDBAI_ENDPOINT"]
    if "KDBAI_ENDPOINT" in os.environ
    else input("KDB.AI endpoint: ")
)
KDBAI_API_KEY = (
    os.environ["KDBAI_API_KEY"]
    if "KDBAI_API_KEY" in os.environ
    else getpass("KDB.AI API key: ")
)
COHERE_API_KEY = (
    os.environ["COHERE_API_KEY"]
    if "COHERE_API_KEY" in os.environ
    else getpass("Cohere API key: ")
)

KDB.AI endpoint: https://cloud.kdb.ai/instance/wrve8kwshj
KDB.AI API key: ··········
Cohere API key: ··········


In [None]:
from llama_index.core.node_parser import SentenceWindowNodeParser
from llama_index.core import Document
from llama_index.core import SimpleDirectoryReader
from llama_index.core.llama_dataset import LabelledRagDataset
import cohere
from fastembed import TextEmbedding

# Cohere client, used later for reranking
co = cohere.Client(COHERE_API_KEY)

KDBAI_TABLE_NAME = "paul_graham"

# Instantiate the default FastEmbed model (the model files are fetched on first use).
# Named embedding_model so that it does not shadow the fastembed package.
embedding_model = TextEmbedding()

Fetching 5 files:   0%|          | 0/5 [00:00<?, ?it/s]

## Data Preparation
Download the Paul Graham Essay Dataset, which contains essays written by Paul Graham. We will use it as the corpus for the RAG pipeline.

In [None]:
!llamaindex-cli download-llamadataset PaulGrahamEssayDataset --download-dir ./data

100% 1/1 [00:00<00:00,  2.13it/s]
Successfully downloaded PaulGrahamEssayDataset to ./data


## Create a KDB.AI session and table

In [None]:
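session = kdbai.Session(endpoint=KDBAI_ENDPOINT, api_key=KDBAI_API_KEY)

# Drop any existing table with the same name so the notebook can be rerun from scratch
try:
    session.table(KDBAI_TABLE_NAME).drop()
    time.sleep(5)  # wait a few seconds before the table is recreated
except kdbai.KDBAIException:
    # Most likely the table does not exist yet, so there is nothing to drop
    pass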

In [None]:
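schema = dict(
    columns=[
        dict(name="document_id", pytype="bytes"),  # ID of the sentence window
        dict(name="text", pytype="bytes"),  # the target sentence
        dict(
            name="embedding",
            # Flat (exhaustive) index with L2 distance; dims must match the embedding size
            vectorIndex=dict(type="flat", metric="L2", dims=384),
        ),
    ]
)

table = session.create_table(KDBAI_TABLE_NAME, schema)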

## Initialize a Sentence Window Parser and Load Dataset

This creates a parser that splits the text into sentences and, for each target sentence, records a window of three sentences on either side. We will first retrieve on the target sentence, then rerank using the window.
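
To make the windowing concrete, here is a minimal plain-Python sketch of the idea. It is not how `SentenceWindowNodeParser` is implemented: the regex sentence splitter, the function name `sentence_windows`, and the choice to join a window with single spaces are all assumptions made for illustration.

```python
import re


def sentence_windows(text, window_size=3):
    """Return (sentence, window) pairs.

    Each window is the target sentence plus up to `window_size`
    neighbouring sentences on either side.
    """
    # Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    pairs = []
    for i, sentence in enumerate(sentences):
        start = max(0, i - window_size)                 # clamp at the start of the text
        end = min(len(sentences), i + window_size + 1)  # clamp at the end of the text
        pairs.append((sentence, " ".join(sentences[start:end])))
    return pairs


text = (
    "I didn't write essays. I wrote short stories. "
    "My stories were awful. They had hardly any plot."
)
pairs = sentence_windows(text, window_size=1)
for sentence, window in pairs:
    print(repr(sentence), "->", repr(window))
```

Sentences near the start or end of a document get shorter windows, because there are fewer neighbours to include.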

In [None]:
# Initialize the sentence window parser
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)

# Assuming the dataset JSON and source files are correctly placed in the './data' directory
rag_dataset = LabelledRagDataset.from_json("./data/rag_dataset.json")
docs = SimpleDirectoryReader(input_dir="./data/source_files").load_data()

In [None]:
# Split the documents into sentence nodes; each node keeps its window and
# original sentence in its metadata (node_parser was created in the previous cell)
nodes = node_parser.get_nodes_from_documents(docs)
parsed_nodes = [node.to_dict() for node in nodes]

## Extract the sentence window and the target sentences from the parsed nodes
We don't have to do this, since it's possible to rerank directly with LlamaIndex. However, extracting the target sentences and their windows makes the code a bit easier to understand and lets us interact directly with the Cohere and KDB.AI APIs instead of using their LlamaIndex integrations.

In [None]:
import uuid
# Extract parent ID and texts into a dictionary and list of tuples
parentid_parentTexts = {}
sentence_parentId = []

for node in parsed_nodes:
    parent_id = uuid.uuid4()

    # Retrieve the text of the window

    parent_text = node['metadata']['window']  # Using window from metadata as the parent text
    parentid_parentTexts[parent_id] = parent_text

    # Add sentence and parent ID tuple
    sentence_parentId.append((node['text'], parent_id))

print(parentid_parentTexts)
print(sentence_parentId)

{UUID('cc140dac-f92f-4efc-ab57-4648e73d3817'): "\n\nWhat I Worked On\n\nFebruary 2021\n\nBefore college the two main things I worked on, outside of school, were writing and programming.  I didn't write essays.  I wrote what beginning writers were supposed to write then, and probably still are: short stories.  My stories were awful. ", UUID('9e271e4f-0b51-4d29-ae39-77359aabb2db'): "\n\nWhat I Worked On\n\nFebruary 2021\n\nBefore college the two main things I worked on, outside of school, were writing and programming.  I didn't write essays.  I wrote what beginning writers were supposed to write then, and probably still are: short stories.  My stories were awful.  They had hardly any plot, just characters with strong feelings, which I imagined made them deep.\n\n", UUID('fb0ca74a-4307-4f60-b4a6-ac7ba342ac38'): '\n\nWhat I Worked On\n\nFebruary 2021\n\nBefore college the two main things I worked on, outside of school, were writing and programming.  I didn\'t write essays.  I wrote what be

## Insert Data into KDB.AI
We create a DataFrame of the sentences and their embeddings, along with the ID of each sentence's window, and insert it into KDB.AI.

In [None]:
# Create a DataFrame of the sentences, their embeddings, and their window IDs
import pandas as pd
from fastembed import TextEmbedding

embedding_model = TextEmbedding()

# Unzip the (sentence, parent_id) tuples into two parallel lists
sentences = [sentence for sentence, _ in sentence_parentId]
parent_ids = [parent_id for _, parent_id in sentence_parentId]

# embed() returns a generator, so materialize it into a list
embeddings = list(embedding_model.embed(sentences))

records_to_insert_with_embeddings = pd.DataFrame({
    "document_id": parent_ids,
    "text": sentences,
    "embedding": embeddings
})

# Insert the DataFrame into the table
table = session.table(KDBAI_TABLE_NAME)
table.insert(records_to_insert_with_embeddings)

Fetching 5 files:   0%|          | 0/5 [00:00<?, ?it/s]

True

In [None]:
table.query()

| | document_id | text | embedding |
|---|---|---|---|
| 0 | cc140dac-f92f-4efc-ab57-4648e73d3817 | \n\nWhat I Worked On\n\nFebruary 2021\n\nBefor... | [-0.03756723, -0.008762999, 0.07278279, -0.077... |
| 1 | 9e271e4f-0b51-4d29-ae39-77359aabb2db | I didn't write essays. | [-0.013508395, 0.08809418, 0.019084414, -0.004... |
| 2 | fb0ca74a-4307-4f60-b4a6-ac7ba342ac38 | I wrote what beginning writers were supposed t... | [-0.028090093, 0.038167175, 0.06310142, 0.0032... |
| 3 | 47338463-5211-4c0d-9746-f7efc22edb4a | My stories were awful. | [0.051281717, 0.05444018, 0.048323955, 0.03726... |
| 4 | c7ac92cf-86de-4a09-aa3b-4ae133ca3e48 | They had hardly any plot, just characters with... | [-0.026676333, 0.06452145, 0.0027197786, -0.00... |
| ... | ... | ... | ... |
| 752 | bcb7896d-b64d-42a8-a72a-f7a67e4a846b | I believe, though with less certainty, that th... | [-0.048544426, 0.0019825993, -0.051241625, 0.0... |
| 753 | d75089df-1965-478a-8358-3559bd75eb11 | But if so there's no reason to suppose that th... | [-0.027010582, -0.015139203, -0.030919325, -0.... |
| 754 | 4dd059f4-dc3f-4bad-b356-f1ef1db0b5bc | Presumably aliens need numbers and errors and ... | [-0.018729148, -0.05805318, 0.02078389, -0.080... |
| 755 | f9927a7e-5b10-45aa-838e-bf547e0c3d38 | So it seems likely there exists at least one p... | [-0.047578007, -0.023965308, -0.012347723, 0.0... |


## Initial search
Our RAG pipeline has two stages: an initial search stage and a reranking stage.

The initial stage performs a vector search over the individual sentences we inserted into KDB.AI. This stage is extremely fast and scales to hundreds of millions of rows.

The reranking stage reranks the sentence windows of the top 1500 sentences returned by the initial stage. This stage is slower, taking more than 50 ms, and it is limited in the total number of documents it can rerank. Its advantage is that it performs better on longer documents, which is why we rerank the sentence windows rather than the individual sentences.
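
The two stages can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the KDB.AI or Cohere implementation: the 2-d vectors, the toy sentences, and the `word_overlap` scorer (a crude stand-in for a reranking model) are all made up, and `flat_search` only shows what an exhaustive L2 search does conceptually.

```python
import math


def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def flat_search(query_vec, rows, n):
    """Stage 1: exhaustive search that compares the query with every stored vector."""
    ranked = sorted(rows, key=lambda row: l2_distance(query_vec, row["embedding"]))
    return ranked[:n]


def rerank(query, texts, score_fn):
    """Stage 2: score each (query, text) pair and sort best first."""
    return sorted(texts, key=lambda t: score_fn(query, t), reverse=True)


def word_overlap(query, text):
    # Hypothetical stand-in for a reranking model: count shared lowercase words.
    return len(set(query.lower().split()) & set(text.lower().split()))


# Toy data (made up): one row per sentence, one window per sentence ID.
rows = [
    {"document_id": "s1", "text": "Pick work you enjoy.", "embedding": [0.9, 0.1]},
    {"document_id": "s2", "text": "The weather was cold.", "embedding": [0.1, 0.9]},
    {"document_id": "s3", "text": "Decide what to work on.", "embedding": [0.8, 0.3]},
]
windows = {
    "s1": "Choosing is hard. Pick work you enjoy. It keeps you going.",
    "s2": "We went outside. The weather was cold.",
    "s3": "People ask how to decide what to work on. "
          "Decide what to work on by following curiosity.",
}

query = "how to decide what to work on"
query_vec = [1.0, 0.0]  # pretend embedding of the query

hits = flat_search(query_vec, rows, n=2)                 # search on sentences
candidates = [windows[h["document_id"]] for h in hits]   # swap in the windows
reranked = rerank(query, candidates, word_overlap)       # rerank the windows

print([h["document_id"] for h in hits])
print(reranked[0])
```

In this toy run the first stage ranks `s1` ahead of `s3` by vector distance, while the second stage puts the window for `s3` first because it shares more words with the query. The real pipeline follows the same pattern, with the reranking model seeing the longer window text.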

Here we do an initial search by embedding the query and searching our table for the 1500 nearest sentences. Because this corpus holds fewer sentences than that, the search returns every row:

In [None]:
# Embed the query, then search the table for up to 1500 preliminary results
query = "How do you decide what to work on?"

# embed() yields one vector per input; take the first and convert it to a plain list
query_embedding = next(embedding_model.embed(query)).tolist()

search_results = session.table(KDBAI_TABLE_NAME).search([query_embedding], n=1500)

search_results

[                              document_id  \
 0    827f9b63-32f6-44d3-be5f-44de3c789329   
 1    0b72aaf7-748d-4233-a9a7-28b341f47504   
 2    9ef8e7dd-36b6-4d32-ae3c-c3636ce97cca   
 3    8cc86e0f-0f03-47d7-9d8b-05879e16cfed   
 4    ac875f92-ccfa-4da5-b536-aae49a2b45c6   
 ..                                    ...   
 752  24cab3ed-e938-41f6-9561-c5ea542f77a6   
 753  2c50c2ee-6a4a-4a71-85f7-19e9ee3c90ee   
 754  4068a84a-3ce9-491c-a8b3-4aef6a2ea4a2   
 755  1dbdd64c-90b8-4732-b306-786ecad74533   
 756  c6bcd335-e43d-4ba1-9e11-9b35f22d349c   
 
                                                   text  \
 0                     How should I choose what to do?    
 1    Well, how had I chosen what to work on in the ...   
 2    If you can choose what to work on, and you cho...   
 3    Instead of deciding for myself what to work on...   
 4                              What should I do next?    
 ..                                                 ...   
 752  Socially they'd seem more l

## Rerank using Cohere

Now we rerank the results of the previous search. This time, though, we use the entire sentence window. Not only will this give us better results, it will also give our LLM semantically stronger snippets.

In [None]:
pd.set_option('display.width', None)  # Uses maximum possible width to display
pd.set_option('display.max_colwidth', None)  # No truncation for column width

In [None]:
# search() returns a list of result DataFrames; we sent a single query vector
search_results_df = search_results[0]

unique_parent_ids = search_results_df['document_id'].unique()

# Look up the sentence window for each hit
texts_to_rerank = [
    parentid_parentTexts[parent_id]
    for parent_id in unique_parent_ids
    if parent_id in parentid_parentTexts
]

# Rerank the windows against the query using Cohere
reranked = co.rerank(
    model='rerank-english-v3.0',
    query=query,
    documents=texts_to_rerank,
    top_n=len(texts_to_rerank)
)

# Each result carries the index of the document in the input list
reranked_texts = [texts_to_rerank[result.index] for result in reranked.results]

# Display the reranked texts
print("Top Ranked Parent Texts Based on Query:", query)
df = pd.DataFrame(reranked_texts, columns=['Text'])
df.head(15)

Top Ranked Parent Texts Based on Query: How do you decide what to work on?


Unnamed: 0,Text
0,"How should I choose what to do? Well, how had I chosen what to work on in the past? I wrote an essay for myself to answer that question, and I was surprised how long and messy the answer turned out to be. If this surprised me, who'd lived it, then I thought perhaps it would be interesting to other people, and encouraging to those with similarly messy lives. So I wrote a more detailed version for others to read, and this is the last sentence of it.\n\n\n\n\n\n\n\n\n\n Notes\n\n[1] My experience skipped a step in the evolution of computers: time-sharing machines with interactive OSes. I went straight from batch processing to microcomputers, which made microcomputers seem all the more exciting.\n\n"
1,"Well, how had I chosen what to work on in the past? I wrote an essay for myself to answer that question, and I was surprised how long and messy the answer turned out to be. If this surprised me, who'd lived it, then I thought perhaps it would be interesting to other people, and encouraging to those with similarly messy lives. So I wrote a more detailed version for others to read, and this is the last sentence of it.\n\n\n\n\n\n\n\n\n\n Notes\n\n[1] My experience skipped a step in the evolution of computers: time-sharing machines with interactive OSes. I went straight from batch processing to microcomputers, which made microcomputers seem all the more exciting.\n\n [2] Italian words for abstract concepts can nearly always be predicted from their English cognates (except for occasional traps like polluzione)."
2,"If you can choose what to work on, and you choose a project that's not the best one (or at least a good one) for you, then it's getting in the way of another project that is. And at 50 there was some opportunity cost to screwing around.\n\n I started writing essays again, and wrote a bunch of new ones over the next few months. I even wrote a couple that weren't about startups. Then in March 2015 I started working on Lisp again.\n\n The distinctive thing about Lisp is that its core is a language defined by writing an interpreter in itself. It wasn't originally intended as a programming language in the ordinary sense."
3,"So far anyway.\n\n I realize that sounds rather wimpy. But attention is a zero sum game. If you can choose what to work on, and you choose a project that's not the best one (or at least a good one) for you, then it's getting in the way of another project that is. And at 50 there was some opportunity cost to screwing around.\n\n I started writing essays again, and wrote a bunch of new ones over the next few months. I even wrote a couple that weren't about startups."
4,"I realize that sounds rather wimpy. But attention is a zero sum game. If you can choose what to work on, and you choose a project that's not the best one (or at least a good one) for you, then it's getting in the way of another project that is. And at 50 there was some opportunity cost to screwing around.\n\n I started writing essays again, and wrote a bunch of new ones over the next few months. I even wrote a couple that weren't about startups. Then in March 2015 I started working on Lisp again.\n\n"
5,"But attention is a zero sum game. If you can choose what to work on, and you choose a project that's not the best one (or at least a good one) for you, then it's getting in the way of another project that is. And at 50 there was some opportunity cost to screwing around.\n\n I started writing essays again, and wrote a bunch of new ones over the next few months. I even wrote a couple that weren't about startups. Then in March 2015 I started working on Lisp again.\n\n The distinctive thing about Lisp is that its core is a language defined by writing an interpreter in itself."
6,"Now that I could write essays again, I wrote a bunch about topics I'd had stacked up. I kept writing essays through 2020, but I also started to think about other things I could work on. How should I choose what to do? Well, how had I chosen what to work on in the past? I wrote an essay for myself to answer that question, and I was surprised how long and messy the answer turned out to be. If this surprised me, who'd lived it, then I thought perhaps it would be interesting to other people, and encouraging to those with similarly messy lives. So I wrote a more detailed version for others to read, and this is the last sentence of it.\n\n\n\n\n\n\n\n\n\n"
7,"Instead of deciding for myself what to work on, the problems came to me. Every 6 months there was a new batch of startups, and their problems, whatever they were, became our problems. It was very engaging work, because their problems were quite varied, and the good founders were very effective. If you were trying to learn the most you could about startups in the shortest possible time, you couldn't have picked a better way to do it.\n\n There were parts of the job I didn't like. Disputes between cofounders, figuring out when people were lying to us, fighting with people who maltreated the startups, and so on. But I worked hard even at the parts I didn't like."
8,"I kept writing essays through 2020, but I also started to think about other things I could work on. How should I choose what to do? Well, how had I chosen what to work on in the past? I wrote an essay for myself to answer that question, and I was surprised how long and messy the answer turned out to be. If this surprised me, who'd lived it, then I thought perhaps it would be interesting to other people, and encouraging to those with similarly messy lives. So I wrote a more detailed version for others to read, and this is the last sentence of it.\n\n\n\n\n\n\n\n\n\n Notes\n\n[1] My experience skipped a step in the evolution of computers: time-sharing machines with interactive OSes."
9,"Up till that point I'd always been curious to see how the painting I was working on would turn out, but suddenly finishing this one seemed like a chore. So I stopped working on it and cleaned my brushes and haven't painted since. So far anyway.\n\n I realize that sounds rather wimpy. But attention is a zero sum game. If you can choose what to work on, and you choose a project that's not the best one (or at least a good one) for you, then it's getting in the way of another project that is. And at 50 there was some opportunity cost to screwing around.\n\n"
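
The final step listed at the top of the notebook, generating a response from the top reranked windows, is not shown in this section. As a rough sketch, the context for that step could be assembled along these lines. The helper `build_prompt`, the prompt wording, and the `top_k` default are hypothetical, and the LLM call itself is deliberately left out.

```python
def build_prompt(query, ranked_windows, top_k=3):
    """Build an LLM prompt from the best-ranked sentence windows (hypothetical helper)."""
    seen = set()
    snippets = []
    for window in ranked_windows:
        # Collapse runs of whitespace and newlines into single spaces
        cleaned = " ".join(window.split())
        # Skip empty strings and exact duplicates
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            snippets.append(cleaned)
        if len(snippets) == top_k:
            break
    context = "\n\n".join(f"[{i}] {s}" for i, s in enumerate(snippets, start=1))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )


# Toy input (made up); in the notebook this would be `reranked_texts`
toy_windows = [
    "Pick work you enjoy.\n\n It keeps you going.",
    "Pick work you enjoy.\n\n It keeps you going.",
    "The weather was cold.",
]
prompt = build_prompt("How do you decide what to work on?", toy_windows, top_k=2)
print(prompt)
```

Neighbouring windows in the reranked output above overlap heavily without being identical, so the exact-match deduplication in this sketch would not remove them. A real pipeline may want a stricter overlap check before filling the context.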
