# Part 1: Developing the RAG application

- GitHub repository: https://github.com/anyscale/ray-summit-2023-training/tree/main
- Anyscale Endpoints: https://endpoints.anyscale.com/
- Ray documentation: https://docs.ray.io/
- LlamaIndex documentation: https://gpt-index.readthedocs.io/en/stable/

We will start by building our example RAG application: a Q&A app that, given a question about Ray, answers it using the Ray documentation.

In this notebook we will learn how to:
1. 💻 Develop a retrieval augmented generation (RAG) based LLM application.
2. 🚀 Scale the major components (embed, index, serve, etc.) in our application.

We will use both [LlamaIndex](https://gpt-index.readthedocs.io/en/stable/) and [Ray](https://docs.ray.io/) for developing our LLM application, and [Anyscale Endpoints](https://endpoints.anyscale.com/) as the LLM engine. 

<span style="background: yellow; color: red; font-size: 1rem;"><b>DIAGRAM:</b></span> overall application view.

## Setup Credentials

Let's set up our credentials for Anyscale Endpoints, and optionally for OpenAI.

In [None]:
import os

os.environ["ANYSCALE_API_BASE"] = "https://ray-summit-training-jrvwy.cld-kvedzwag2qa8i5bj.s.anyscaleuserdata-staging.com/v1/chat/completions"
os.environ["ANYSCALE_API_KEY"] = "tZLmCV1WtQAtnDx93MM5w3xaNhYV3whhcFRTKoH1GYQ"

os.environ["OPENAI_API_BASE"] = "https://api.openai.com/v1"
# os.environ["OPENAI_API_KEY"] = ...

## Step 1: Loading and parsing the Data

To build our RAG application, we first need to load, parse, and embed the data that we want to use for answering our questions. 

This data processing pipeline has 3 steps:
1. First, we will load the latest documentation for Ray
2. Then we will parse the documentation to extract out chunks of text
3. Finally, we will **embed** each chunk. This creates a vector representation of the provided text snippet. This vector representation allows us to easily determine the similarity between two different text snippets.
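
To make the idea of "similarity between two text snippets" concrete, here is a minimal sketch that compares toy vectors with cosine similarity. The three-dimensional vectors are made-up values for illustration only; real embeddings come from a model and have hundreds of dimensions, and cosine similarity is just one common choice of metric.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" (hypothetical values, not produced by any model).
query_vector = [0.8, 0.2, 0.1]
related_chunk_vector = [0.9, 0.1, 0.0]
unrelated_chunk_vector = [0.0, 0.1, 0.9]

related_score = cosine_similarity(query_vector, related_chunk_vector)
unrelated_score = cosine_similarity(query_vector, unrelated_chunk_vector)

# The related chunk points in nearly the same direction as the query.
print(related_score > unrelated_score)
```

A score close to 1 means the two vectors point in nearly the same direction, which we interpret as the two snippets being semantically close.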

<span style="background: yellow; color: red; font-size: 1rem;"><b>DIAGRAM:</b></span> Example of the loading, parsing, and embedding process.

LlamaIndex provides utilities for loading our data, as well as the abstractions we use to represent our data and its relationships.

Ray, and in particular the Ray Data library, is used to scale out our data processing pipeline, allowing us to process data in parallel, leveraging the cores and GPUs in our Ray cluster. 

### Load data

The Ray documentation has already been downloaded and is stored in a shared storage directory in our Anyscale workspace. We collect the paths of the HTML files in the downloaded documentation and create a Ray Dataset out of them.

In [None]:
from pathlib import Path

RAY_DOCS_DIRECTORY = "/efs/shared_storage/amog/docs.ray.io/en/master/"

In [None]:
import ray

docs_path = Path(RAY_DOCS_DIRECTORY)
ds = ray.data.from_items([{"path": path} for path in docs_path.rglob("*.html") if not path.is_dir()])
print(f"{ds.count()} documents")

Now that we have a dataset of all the paths to the HTML files, we need to extract the text from these files. We want to do this in a generalized manner so that we can perform this extraction across all of our docs pages.

Therefore, we use LlamaIndex's HTMLTagReader to identify the sections in our HTML page and then extract the text in between them. For each section of text, we create a LlamaIndex Document, and also store the source url for that section as part of the metadata for the Document. After extracting all the text, we return a list of LlamaIndex documents.
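
To illustrate what sectionization means, here is a rough standard-library sketch that collects the text inside each top-level `<section>` tag together with its `id`. It is not how `HTMLTagReader` is implemented; the class name `SectionTextExtractor` and the sample HTML are hypothetical, and nested sections are simply merged into their parent for brevity.

```python
from html.parser import HTMLParser

class SectionTextExtractor(HTMLParser):
    """Collects the text inside each top-level <section>, keyed by its id."""

    def __init__(self):
        super().__init__()
        self.depth = 0          # how many <section> tags we are currently inside
        self.current_id = None  # id attribute of the current top-level section
        self.buffer = []        # text fragments seen inside the current section
        self.sections = []      # finished sections

    def handle_starttag(self, tag, attrs):
        if tag == "section":
            if self.depth == 0:
                self.current_id = dict(attrs).get("id")
                self.buffer = []
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "section" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # Join the fragments and normalize whitespace.
                text = " ".join(" ".join(self.buffer).split())
                self.sections.append({"tag_id": self.current_id, "text": text})

    def handle_data(self, data):
        if self.depth > 0:
            self.buffer.append(data)

# Made-up HTML standing in for a documentation page.
sample_html = (
    '<html><body>'
    '<section id="environments"><h1>Environments</h1><p>RLlib works with several types.</p></section>'
    '<section id="gym"><p>Gym envs.</p></section>'
    '</body></html>'
)

parser = SectionTextExtractor()
parser.feed(sample_html)
parser.close()
sections = parser.sections
print(sections)
```

Each extracted `tag_id` plays the role of the URL fragment that is appended to the page URL to form the source of a section.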

<span style="background: yellow; color: red; font-size: 1rem;"><b>DIAGRAM:</b></span> Example of sectionization process.

In [None]:
from llama_index.readers import HTMLTagReader

In [None]:
def path_to_uri(path, scheme="https://", domain="docs.ray.io"):
    # Converts the file path of a Ray documentation page to the original URL for the documentation.
    # Example: /efs/shared_storage/amog/docs.ray.io/en/master/rllib/rllib-env.html -> https://docs.ray.io/en/master/rllib/rllib-env.html
    return scheme + domain + str(path).split(domain)[-1]

def extract_sections(record):
    # Given a HTML file path, extract out text from the section tags, and return a LlamaIndex document from each one. 
    html_file_path = record["path"]
    reader = HTMLTagReader(tag="section")
    documents = reader.load_data(html_file_path)
    
    # For each document, store the source URL as part of the metadata.
    for document in documents:
        document.metadata["source"] = f"{path_to_uri(document.metadata['file_path'])}#{document.metadata['tag_id']}"
    return [{"document": document} for document in documents]

Let's try this out on a single example HTML file.

In [None]:
example_path = Path(RAY_DOCS_DIRECTORY, "rllib/rllib-env.html")
document = extract_sections({"path": example_path})[0]["document"]
print(document)
print("\n")
print("Document source: ", document.metadata["source"])

Now, let's use Ray Data to parallelize this across all of the HTML files. We can stitch together operations on our Ray dataset to map a function over each document. 

Ray Data is lazy by default, so we can first stitch together our entire pipeline, and then trigger execution. This allows Ray Data to fully optimize resource usage for our pipeline.
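
As a loose analogy only (this is plain Python, not Ray Data), a generator expression behaves in a similarly lazy way: defining the pipeline does no work, and the work happens when the results are consumed. The `extract` function and file names below are hypothetical stand-ins.

```python
calls = []  # records which paths have actually been processed

def extract(path):
    """Stand-in for an expensive per-file extraction step."""
    calls.append(path)
    return [f"{path}#a", f"{path}#b"]

paths = ["a.html", "b.html"]

# Defining the pipeline: nothing is extracted yet.
lazy_sections = (section for path in paths for section in extract(path))
calls_before = list(calls)

# Consuming the pipeline triggers the actual work.
sections = list(lazy_sections)
calls_after = list(calls)

print(calls_before, calls_after)
```

In the notebook, operations such as `flat_map` only extend the plan, and a consuming call such as `count()` or `take()` is what triggers execution.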

In [None]:
sections_ds = ds.flat_map(extract_sections)
sections_ds.schema()

### Chunk data

We now have a list of Documents (with the text and source of each section), but we shouldn't use them directly as context for our RAG application just yet. The sections vary widely in length, and many of them are quite large. If we were to use these large sections, we'd be inserting a lot of noisy or unwanted context, and because all LLMs have a maximum context length, we wouldn't be able to fit many relevant contexts.

Therefore, we're going to split the text within each section into smaller chunks. Intuitively, smaller chunks will encapsulate one or a few concepts and will be less noisy than larger chunks. We're going to choose some typical text splitting values (e.g. `chunk_size=300`) to create our chunks for now, but we'll experiment with a range of values later.
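
To show what chunk size and chunk overlap mean, here is a simplified sketch that splits text into overlapping windows of words. This is an assumption-laden illustration: the hypothetical `chunk_words` helper counts whitespace-separated words, whereas LlamaIndex's node parser works on tokens and tries to respect sentence boundaries.

```python
def chunk_words(text, chunk_size, chunk_overlap):
    """Split text into windows of `chunk_size` words that share `chunk_overlap` words."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window reaches the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks

# Ten made-up "words": w0 ... w9.
sample_text = " ".join(f"w{i}" for i in range(10))
chunks = chunk_words(sample_text, chunk_size=4, chunk_overlap=1)
print(chunks)
```

The overlap means that a sentence falling on a chunk boundary still appears with some of its surrounding context in at least one chunk.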

<span style="background: yellow; color: red; font-size: 1rem;"><b>DIAGRAM:</b></span> Sample chunking logic in action.

Once again, we will use LlamaIndex's abstractions to chunk each Document into **Nodes** of the provided chunk size, and we will use Ray Data to parallelize the chunking computation.

In [None]:
from llama_index.node_parser import SimpleNodeParser

In [None]:
chunk_size = 300
chunk_overlap = 50

def chunk_document(document):
    node_parser = SimpleNodeParser.from_defaults(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap
    )
    nodes = node_parser.get_nodes_from_documents([document["document"]])
    return [{"node": node} for node in nodes]

Let's run an example over a single document. The document will be chunked into one or more nodes, each representing one chunk.

In [None]:
sample_document = sections_ds.take(1)[0]

# Nodes
nodes = chunk_document(sample_document)

print("Num chunks: ", len(nodes))
print(f"Example text: {nodes[0]['node'].text}\n")
print(f"Example metadata: {nodes[0]['node'].metadata}\n")

Now let's chunk all of our documents, stitching this operation into our Ray Dataset pipeline.

In [None]:
from ray.util.scheduling_strategies import NodeAffinitySchedulingStrategy

chunks_ds = sections_ds.flat_map(chunk_document, scheduling_strategy=NodeAffinitySchedulingStrategy(node_id=ray.get_runtime_context().get_node_id(), soft=False))
chunks_ds.schema()

### Embed data

Now that we've created small chunks from our dataset, we need a way to identify the most relevant ones for a given query. A very effective and quick method is to embed our data using a pretrained model and use the same model to embed the query. We can then compute the distance between all of the chunk embeddings and our query embedding to determine the top k chunks.

There are many different pretrained models to choose from to embed our data, but the most popular ones can be discovered through [HuggingFace's Massive Text Embedding Benchmark (MTEB)](https://huggingface.co/spaces/mteb/leaderboard) leaderboard. These models were pretrained on very large text corpora through tasks such as next/masked token prediction, which allows them to learn to represent subtokens in N dimensions and capture semantic relationships. We can leverage this to represent our data and make decisions such as which contexts are most relevant for answering a given query.

We're using LangChain's embedding wrappers ([HuggingFaceEmbeddings](https://api.python.langchain.com/en/latest/embeddings/langchain.embeddings.huggingface.HuggingFaceEmbeddings.html) and [OpenAIEmbeddings](https://api.python.langchain.com/en/latest/embeddings/langchain.embeddings.openai.OpenAIEmbeddings.html)) to easily load the models and embed our document chunks.

**Note**: embeddings aren't the only way to determine the most relevant chunks. We could also use an LLM to decide! However, because LLMs are much larger than these embedding models and have maximum context lengths, it's better to use embeddings to retrieve the top k chunks. We could then use an LLM on those k chunks to select the fewer than k chunks to use as the context to answer our query. We could also use reranking (e.g. [Cohere Rerank](https://txt.cohere.com/rerank/)) to further identify the most relevant chunks to use.
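
The two-stage idea in the note can be sketched as follows: a cheap scorer narrows the candidates to k, and a more expensive scorer picks the final few. Everything here is hypothetical: `retrieve_then_rerank` is an illustrative helper, and the word-overlap and Jaccard scorers merely stand in for an embedding model and a reranker or LLM.

```python
def retrieve_then_rerank(query, chunks, cheap_score, expensive_score, k, final_k):
    """Keep the top k chunks by a cheap score, then the top final_k by a costlier one."""
    candidates = sorted(chunks, key=lambda c: cheap_score(query, c), reverse=True)[:k]
    return sorted(candidates, key=lambda c: expensive_score(query, c), reverse=True)[:final_k]

def word_overlap(query, chunk):
    """Cheap stand-in for embedding similarity: number of shared words."""
    return len(set(query.split()) & set(chunk.split()))

def jaccard(query, chunk):
    """Stand-in for a reranker: shared words divided by all distinct words."""
    q, c = set(query.split()), set(chunk.split())
    return len(q & c) / len(q | c)

# Made-up chunks for illustration.
chunks = [
    "the default batch size is set per call",
    "batch size controls memory usage",
    "ray serve deploys models",
    "default values are documented per argument",
]

selected = retrieve_then_rerank(
    "default batch size", chunks, word_overlap, jaccard, k=3, final_k=2
)
print(selected)
```

The expensive scorer only ever sees k candidates, which is what keeps the second stage affordable.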

<span style="background: yellow; color: red; font-size: 1rem;"><b>DIAGRAM:</b></span> Represent a text chunk getting embedded.

In [None]:
import numpy as np
from langchain.embeddings import OpenAIEmbeddings
from langchain.embeddings.huggingface import HuggingFaceEmbeddings

def get_embedding_model(model_name):
    if model_name == "text-embedding-ada-002":
        return OpenAIEmbeddings(
            model=model_name,
            openai_api_base=os.environ["OPENAI_API_BASE"],
            openai_api_key=os.environ["OPENAI_API_KEY"])
    else:
        model_kwargs = {"device": "cuda"}
        encode_kwargs = {"device": "cuda", "batch_size": 100}

        return HuggingFaceEmbeddings(
            model_name=model_name,
            model_kwargs=model_kwargs,
            encode_kwargs=encode_kwargs)

Here, we will use a Python **class** instead of a function to encapsulate the embedding logic. Since loading the embedding model is not cheap, we want to load the model just once and re-use the loaded model when embedding each batch of data.
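
The benefit of a class can be seen in a small sketch that counts how often the (pretend) model is loaded. `ToyEmbedder` and its fake embeddings are hypothetical; the point is only that `__init__` runs once per instance while `__call__` runs once per batch.

```python
class ToyEmbedder:
    load_count = 0  # how many times the "model" has been loaded

    def __init__(self):
        # Stand-in for an expensive model load; runs once per instance.
        ToyEmbedder.load_count += 1
        self.dim = 4

    def __call__(self, batch):
        # Fake embedding: a vector filled with the length of the text.
        return {"embedding": [[float(len(text))] * self.dim for text in batch["text"]]}

batches = [{"text": ["a", "bb"]}, {"text": ["ccc"]}]

embedder = ToyEmbedder()                           # model loaded here, once
results = [embedder(batch) for batch in batches]   # reused for every batch

print(ToyEmbedder.load_count)
```

With a plain function, the load would have to happen inside the function body and would therefore be repeated for every batch.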

In [None]:
class EmbedChunks:
    def __init__(self, model_name):
        self.embedding_model = get_embedding_model(model_name)
    
    def __call__(self, node_batch):
        # Get the batch of text that we want to embed.
        nodes = node_batch["node"]
        text = [node.text for node in nodes]
        
        # Embed the batch of text.
        embeddings = self.embedding_model.embed_documents(text)
        assert len(nodes) == len(embeddings)

        # Store the embedding in the LlamaIndex node.
        for node, embedding in zip(nodes, embeddings):
            node.embedding = embedding
        return {"embedded_nodes": nodes}

In [None]:
# Specify the embedding model to use.
# Specify "text-embedding-ada-002" for OpenAI embeddings.
embedding_model_name = "thenlper/gte-base"

Let's try this out on an example chunk.

In [None]:
example_chunk = chunks_ds.take_batch(1)
embedder = EmbedChunks(model_name=embedding_model_name)
example_node_with_embedding = embedder(example_chunk)

In [None]:
print(example_node_with_embedding["embedded_nodes"][0])
print("\n")
print("Embedding size: ", len(example_node_with_embedding["embedded_nodes"][0].embedding))

We're now able to embed our chunks at scale by using the [map_batches](https://docs.ray.io/en/latest/data/api/doc/ray.data.Dataset.map_batches.html) operation in our Ray Data pipeline.

All we have to do is define the `batch_size` and the compute to use (we're using two workers, each with 1 GPU).
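
To picture what a batch looks like, here is a simplified sketch that groups rows into column-oriented batches, which is why the embedding class reads `node_batch["node"]` as a collection of nodes. The `iter_batches` helper is hypothetical, and the actual container types Ray Data uses for batches may differ (for example NumPy arrays rather than lists).

```python
def iter_batches(rows, batch_size):
    """Yield dictionaries mapping a column name to up to `batch_size` values."""
    for start in range(0, len(rows), batch_size):
        batch_rows = rows[start:start + batch_size]
        yield {"node": [row["node"] for row in batch_rows]}

# Five made-up rows standing in for chunked nodes.
rows = [{"node": f"node-{i}"} for i in range(5)]
batches = list(iter_batches(rows, batch_size=2))
batch_sizes = [len(batch["node"]) for batch in batches]
print(batch_sizes)
```

The last batch can be smaller than `batch_size` when the number of rows is not an exact multiple of it.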

In [None]:
from ray.data import ActorPoolStrategy

embedded_chunks = chunks_ds.map_batches(
    EmbedChunks,
    fn_constructor_kwargs={"model_name": embedding_model_name},
    batch_size=100, 
    num_gpus=1 if embedding_model_name!="text-embedding-ada-002" else 0,
    compute=ActorPoolStrategy(size=2))

### Index data

Now that we have our embedded chunks, we need to index (store) them somewhere so that we can retrieve them quickly for inference. While there are many popular vector database options, we're going to use [Postgres](https://www.postgresql.org/) for its simplicity and performance. We'll create a table (`document`) and write the (`text`, `source`, `embedding`) triplets for each embedded chunk we have. LlamaIndex's `PGVectorStore` prefixes the table name with `data_`, which is why the SQL commands in this section refer to `data_document`.
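
The shape of the stored triplets can be illustrated with an in-memory SQLite table from the standard library. This is only a sketch under stated assumptions: Postgres with pgvector stores embeddings in a native vector column, whereas here they are serialized as JSON text, and the rows and source URLs are made up.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE document ("
    "id INTEGER PRIMARY KEY, text TEXT, source TEXT, embedding TEXT)"
)

# Hypothetical (text, source, embedding) triplets.
rows = [
    ("First chunk of text.", "https://docs.ray.io/en/master/example.html#section-a", json.dumps([0.1, 0.2, 0.3])),
    ("Second chunk of text.", "https://docs.ray.io/en/master/example.html#section-b", json.dumps([0.4, 0.5, 0.6])),
]
conn.executemany(
    "INSERT INTO document (text, source, embedding) VALUES (?, ?, ?)", rows
)
conn.commit()

count = conn.execute("SELECT count(*) FROM document").fetchone()[0]
first_embedding = json.loads(
    conn.execute("SELECT embedding FROM document ORDER BY id LIMIT 1").fetchone()[0]
)
print(count, first_embedding)
```

A dedicated vector column additionally lets the database compute distances and use an index, which plain JSON text cannot do.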

<span style="background: yellow; color: red; font-size: 1rem;"><b>DIAGRAM:</b></span> Show a triplet getting indexed in a vector DB.

Let's set up a Postgres database. Postgres is already installed for you in this workspace.

In [None]:
%%bash
# Set up pgvector
bash ../setup-pgvector.sh

As the final step in our data pipeline, we will store the embeddings in our Postgres database.

In [None]:
%%bash
# Drop existing table if it exists
sudo -u postgres psql -d postgres -c "DROP TABLE IF EXISTS data_document;"

In [None]:
from llama_index.vector_stores import PGVectorStore

# First create the table.
def get_postgres_store():
    return PGVectorStore.from_params(
            database="postgres", 
            user="postgres", 
            password="postgres", 
            host="localhost", 
            table_name="document",
            port="5432",
            embed_dim=768,
        )

store = get_postgres_store()
del store

In [None]:
class StoreResults:
    def __init__(self):
        self.vector_store = get_postgres_store()
    
    def __call__(self, batch):
        embedded_nodes = batch["embedded_nodes"]
        self.vector_store.add(list(embedded_nodes))
        return {}

In [None]:
# Store all the embeddings in Postgres, and trigger execution of the Ray Data pipeline.
from ray.util.scheduling_strategies import NodeAffinitySchedulingStrategy

embedded_chunks.map_batches(
    StoreResults,
    batch_size=128,
    num_cpus=1,
    compute=ActorPoolStrategy(size=8),
    # Since our database is only created on the head node, we need to force the Ray tasks to execute only on the head node.
    scheduling_strategy=NodeAffinitySchedulingStrategy(node_id=ray.get_runtime_context().get_node_id(), soft=False)
).count()

Let's check our table to see how many chunks we have stored.

In [None]:
%%bash
sudo -u postgres psql -c "SELECT count(*) FROM data_document;"

## Retrieval

Now that we have processed, embedded, and stored all of our chunks from the Ray documentation, we can test out the retrieval portion of the application.

In the retrieval portion, we want to pull the relevant context for a given query. We do this by embedding the query using the same embedding model we used to embed the chunks, and then checking the similarity between the embedded query and all the embedded chunks to pull the most relevant context.
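
The retrieval step can be sketched as a brute-force top-k search over stored embeddings. The two-dimensional vectors and source names are made up, the `top_k` helper is hypothetical, and a real vector store would typically use an index instead of scoring every chunk.

```python
import heapq
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(query_embedding, indexed_chunks, k):
    """Return the k chunks whose embeddings are most similar to the query embedding."""
    return heapq.nlargest(
        k,
        indexed_chunks,
        key=lambda chunk: cosine_similarity(query_embedding, chunk["embedding"]),
    )

# Toy index with made-up embeddings.
indexed_chunks = [
    {"source": "a", "embedding": [1.0, 0.0]},
    {"source": "b", "embedding": [0.7, 0.7]},
    {"source": "c", "embedding": [0.0, 1.0]},
]
query_embedding = [0.9, 0.1]

results = top_k(query_embedding, indexed_chunks, k=2)
result_sources = [chunk["source"] for chunk in results]
print(result_sources)
```

Because the query and the chunks are embedded by the same model, their vectors live in the same space and the scores are comparable.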

<span style="background: yellow; color: red; font-size: 1rem;"><b>DIAGRAM:</b></span> Show the query getting embedded and show the retrieval process.

In [None]:
from llama_index import VectorStoreIndex, ServiceContext

In [None]:
# Create a connection to our Postgres vector store
vector_store = get_postgres_store()

In [None]:
# Use the same embedding model that we used to embed our documents.
embedding_model = get_embedding_model(embedding_model_name)

In [None]:
# Create our retriever.
service_context = ServiceContext.from_defaults(embed_model=embedding_model, llm=None)
index = VectorStoreIndex.from_vector_store(vector_store=vector_store, service_context=service_context)

# Fetch the top 5 most relevant chunks.
retriever = index.as_retriever(similarity_top_k=5)

Now, let's try a sample query and pull the most relevant context. The retrieval works well: from an eye test, the chunks all look relevant to the query.

In [None]:
query = "What is the default batch size for map_batches?"
nodes = retriever.retrieve(query)

for node in nodes:
    print(node)
    print("Source: ", node.metadata["source"])

## Response generation

With our retrieval working, we can now build the next portion of our LLM application, which is the actual response generation.

In this step, we pass both the query and the relevant context to an LLM. The LLM synthesizes a response to the query given the context. Without the relevant context that we retrieved, the LLM may not have been able to answer our question accurately. And as our data grows, we can just as easily embed and index any new data and retrieve it to answer questions.
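
A minimal sketch of how the query and the retrieved context might be combined into a single prompt is shown below. The `build_prompt` helper and its wording are hypothetical; LlamaIndex uses its own prompt templates, which differ from this one.

```python
def build_prompt(query, context_chunks):
    """Combine numbered context chunks and the question into one prompt string."""
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(context_chunks, start=1)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

# Made-up context chunks for illustration.
context_chunks = [
    "map_batches applies a function to batches of data.",
    "The batch size is configurable through the batch_size argument.",
]
prompt = build_prompt("What is the default batch size for map_batches?", context_chunks)
print(prompt)
```

Numbering the chunks makes it easier to trace which piece of context a given part of the answer relied on.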

<span style="background: yellow; color: red; font-size: 1rem;"><b>DIAGRAM:</b></span> Show how retrieved context + query texts are fed into API.

Creating an end-to-end query engine becomes very easy with LlamaIndex and Anyscale Endpoints. With Anyscale Endpoints, we can use open source LLMs, like the Llama 2 models, just as easily as OpenAI models, but more cost-effectively.

In [None]:
from llama_index.llms import Anyscale

In [None]:
# Use Anyscale endpoints as the LLM to LlamaIndex.
llm = Anyscale(model="meta-llama/Llama-2-70b-chat-hf", temperature=0.1)

# Use the same embedding model that we used to embed our documents.
embedding_model = get_embedding_model(embedding_model_name)

service_context = ServiceContext.from_defaults(embed_model=embedding_model, llm=llm)

In [None]:
# Create our query engine.
vector_store = get_postgres_store()
index = VectorStoreIndex.from_vector_store(vector_store, service_context=service_context)
query_engine = index.as_query_engine()

In [None]:
# Get a response to our query.

query = "What is the default batch size for map_batches?"
response = query_engine.query(query)

Let's see the response to our query, as well as the retrieved context that we passed to the LLM.

In [None]:
print("Response: ", response.response)
print("\n")
source_nodes = response.source_nodes

for node in source_nodes:
    print("Text: ", node.node.text)
    print("Score: ", node.score)
    print("Source: ", node.node.metadata["source"])
    print("\n")