# Brief 

Having indexed the data, we can now build our RAG system. We will start by building the retriever, which finds the documents most relevant to a given query. We will then build an LLM client to generate the response.

<div class="alert alert-block alert-info">

<b> Here is the roadmap for this notebook:</b>

<ul>
    <li><b>Part 0:</b> RAG Application Overview</li>
    <li><b>Part 1:</b> Building Retriever components</li>
    <li><b>Part 2:</b> Building Response Generation</li>
    <li><b>Part 3:</b> Putting it all together into a QA Engine</li>
</ul>

</div>

## Setup

### Imports

In [None]:
import os
import json
import shutil
from typing import Any, Iterator

import openai
import chromadb
from openai.types.chat import ChatCompletion
from pathlib import Path
from sentence_transformers import SentenceTransformer

### Environment setup

We will fetch some credentials from S3 and set up our environment.

In [None]:
!aws s3 cp s3://anyscale-ray-summit-training-2024/anyscale_service_credentials.json ./credentials.json

with open("credentials.json", "r") as f:
    credentials = json.load(f)
    for key, value in credentials.items():
        os.environ[key] = value

### Constants

In [None]:
if os.environ.get("ANYSCALE_ARTIFACT_STORAGE"):
    DATA_DIR = Path("/mnt/cluster_storage/")
    shutil.copytree(Path("./data/"), DATA_DIR, dirs_exist_ok=True)
else:
    DATA_DIR = Path("./data/")

In [None]:
# Embedding model we used to build the search index on chroma
EMBEDDING_MODEL_NAME = "thenlper/gte-large"
# The chroma search index we built
CHROMA_COLLECTION_NAME = "ray-docs"

ANYSCALE_SERVICE_BASE_URL = os.environ["ANYSCALE_SERVICE_BASE_URL"]
ANYSCALE_API_KEY = os.environ["ANYSCALE_API_KEY"]


## 0. RAG Application Overview

We are building a simple RAG application that can answer questions about [Ray](https://docs.ray.io/). 

As a recap, see the diagram below for a visual representation of the components required for RAG.

<img src="https://anyscale-public-materials.s3.us-west-2.amazonaws.com/ray-summit/rag-app/RAG+App+-+Ray+Summit+-+with_rag_simple_v2.png" alt="With RAG" width="600px"/>


## 1. Building Retriever components
Retrieval is implemented in the following steps:

1. Encode the user query
2. Search the vector store
3. Compose a context from the retrieved documents

### 1. Encode the user query
To encode the query, we will use the same embedding model that we used to encode the documents. 

In [None]:
class QueryEncoder:
    def __init__(self):
        self.embedding_model_name = EMBEDDING_MODEL_NAME
        self.model = SentenceTransformer(self.embedding_model_name)

    def encode(self, query: str) -> list[float]:
        return self.model.encode(query).tolist()

We try out our `QueryEncoder` by encoding a sample query relevant to our domain.

In [None]:
query_encoder = QueryEncoder()
query = "How can I deploy Ray Serve to Kubernetes?"
embeddings_vector = query_encoder.encode(query)

type(embeddings_vector), len(embeddings_vector)

In [None]:
embeddings_vector[:5]
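
Using the same embedding model for queries and documents matters because similarity scores are only meaningful between vectors from the same embedding space. As a minimal sketch of what "closest" can mean, the snippet below computes cosine similarity on made-up 3-dimensional vectors. The vectors and the `cosine_similarity` helper are illustrative assumptions; the metric the Chroma collection actually uses depends on how the index was built.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy "embeddings", made up for illustration only.
toy_query = [1.0, 0.0, 1.0]
toy_doc_close = [0.9, 0.1, 1.1]   # points in nearly the same direction
toy_doc_far = [-1.0, 1.0, 0.0]    # points in a different direction

sim_close = cosine_similarity(toy_query, toy_doc_close)
sim_far = cosine_similarity(toy_query, toy_doc_far)
print(f"close: {sim_close:.3f}, far: {sim_far:.3f}")
```

A higher score means the two vectors point in a more similar direction, which is the intuition behind ranking documents by their proximity to the query.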

### 2. Search the vector store
Next, we will search the vector store to retrieve the closest documents to the query.

We implement a `VectorStore` abstraction that relies on the Chroma client to search the vector store.

In [None]:
class VectorStore:
    def __init__(self):
        chroma_client = chromadb.PersistentClient(
            path="/mnt/cluster_storage/vector_store"
        )
        self._collection = chroma_client.get_collection(CHROMA_COLLECTION_NAME)

    def query(self, query_embedding: list[float], top_k: int) -> dict:
        """Retrieve the most similar chunks to the given query embedding."""
        if top_k == 0:
            # Keep the same shape as the non-empty response below
            return {"documents": []}

        response = self._collection.query(
            query_embeddings=[query_embedding],
            n_results=top_k,
        )

        return {
            "documents": [
                {
                    "text": text,
                    "section_url": metadata["section_url"],
                }
                for text, metadata in zip(
                    response["documents"][0], response["metadatas"][0]
                )
            ],
        }

In [None]:
vector_store = VectorStore()
vector_store_response = vector_store.query(
    query_embedding=embeddings_vector,
    top_k=3,
)

We can inspect the URLs of the documents retrieved for our query:

In [None]:
for doc in vector_store_response["documents"]:
    print(doc["section_url"])
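
The `[0]` indexing inside `VectorStore.query` is there because Chroma returns one result list per query embedding, and we send exactly one. The sketch below reproduces that flattening on a hand-written dictionary that imitates the response shape. The texts, ids, distances, and `example.com` URLs are placeholders, and `flatten_first_query` is a hypothetical helper, not part of Chroma.

```python
# Hand-written stand-in for a Chroma query response to ONE query embedding.
# Each value is a list of lists: the outer list has one entry per query.
fake_response = {
    "ids": [["chunk-1", "chunk-2"]],
    "documents": [["placeholder text A", "placeholder text B"]],
    "metadatas": [[
        {"section_url": "https://example.com/docs#a"},
        {"section_url": "https://example.com/docs#b"},
    ]],
    "distances": [[0.12, 0.34]],
}


def flatten_first_query(response: dict) -> list[dict]:
    """Pair each retrieved text with its source URL for the first query."""
    texts = response["documents"][0]
    metadatas = response["metadatas"][0]
    return [
        {"text": text, "section_url": metadata["section_url"]}
        for text, metadata in zip(texts, metadatas)
    ]


flat_documents = flatten_first_query(fake_response)
for doc in flat_documents:
    print(doc["section_url"], "->", doc["text"])
```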

### 3. Compose a context from the retrieved documents

We put together a `Retriever` that encapsulates the entire retrieval process so far.

It also composes the context from the retrieved documents by simply concatenating the retrieved chunks.

In [None]:
class Retriever:
    def __init__(self, query_encoder, vector_store):
        self.query_encoder = query_encoder
        self.vector_store = vector_store

    def _compose_context(self, contexts: list[str]) -> str:
        sep = 100 * "-"
        return "\n\n".join([f"{sep}\n{context}" for context in contexts])

    def retrieve(self, query: str, top_k: int) -> dict:
        """Retrieve the context and sources for the given query."""
        encoded_query = self.query_encoder.encode(query)
        vector_store_response = self.vector_store.query(
            query_embedding=encoded_query,
            top_k=top_k,
        )
        contexts = [chunk["text"] for chunk in vector_store_response["documents"]]
        sources = [chunk["section_url"] for chunk in vector_store_response["documents"]]
        return {
            "contexts": contexts,
            "composed_context": self._compose_context(contexts),
            "sources": sources,
        }

We run the retriever to check that it works as expected.


In [None]:
retriever = Retriever(query_encoder=query_encoder, vector_store=vector_store)
retrieval_response = retriever.retrieve(
    query=query,
    top_k=3,
)
retrieval_response

We inspect the retrieved context:

In [None]:
print(retrieval_response["composed_context"])
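
To make the composed format easier to see without real documentation chunks, here is a small sketch that applies the same concatenation logic as `Retriever._compose_context` to two made-up strings. The standalone `compose_context` function and the toy chunks are for illustration only.

```python
def compose_context(contexts: list[str]) -> str:
    """Same logic as Retriever._compose_context, as a standalone function."""
    sep = 100 * "-"
    # Each chunk is preceded by a separator line; chunks are split by a blank line.
    return "\n\n".join([f"{sep}\n{context}" for context in contexts])


toy_chunks = ["First chunk.", "Second chunk."]
composed = compose_context(toy_chunks)
print(composed)
```

Each chunk is introduced by a line of 100 dashes, which gives the LLM a visible boundary between retrieved passages.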

## 2. Building Response Generation

We will generate a response using an LLM server that offers an OpenAI-compatible API.

To do so, we implement a simple LLM client class that encapsulates the generation process.

In [None]:
class LLMClient:
    def __init__(self):
        # Initialize a client to perform API requests
        self.client = openai.OpenAI(
            base_url=ANYSCALE_SERVICE_BASE_URL,
            api_key=ANYSCALE_API_KEY,
        )

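    def generate(
        self,
        user_prompt: str,
        model: str = "mistralai/Mistral-7B-Instruct-v0.1",
        temperature: float = 0,
        **kwargs: Any,
    ) -> ChatCompletion:
        """Generate a completion from the given user prompt.

        Passing `stream=True` through kwargs returns a stream of chunks
        instead of a single ChatCompletion.
        """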
        # Call the chat completions endpoint
        chat_completion = self.client.chat.completions.create(
            model=model,
            messages=[
                # Prime the system with a system message - a common best practice
                {"role": "system", "content": "You are a helpful assistant."},
                # Send the user message with the proper "user" role and "content"
                {"role": "user", "content": user_prompt},
            ],
            temperature=temperature,
            **kwargs,
        )

        return chat_completion

Note that we are currently using an already deployed open-source LLM running on Anyscale.

If you want to deploy your own LLM, you can follow the instructions in the [Anyscale documentation](https://docs.anyscale.com/).

In [None]:
llm_client = LLMClient()
response = llm_client.generate("What is the capital of France?")
print(response.choices[0].message.content)

## 3. Putting it all together into a QA Engine
Given a user query, we want our RAG-based QA engine to perform the following steps:

1. Retrieve the closest documents to the query
2. Augment the query with the context
3. Generate a response to the augmented query

We decide on a simple prompt template to augment the user's query with the retrieved context. The template is as follows:

In [None]:
prompt_template_rag = """
Given the following context:
{composed_context}

Answer the following question:
{query}

If you cannot provide an answer based on the context, please say "I don't know."
Do not use the term "context" in your response."""


def augment_prompt(query: str, composed_context: str) -> str:
    """Augment the prompt with the given query and contexts."""
    return prompt_template_rag.format(composed_context=composed_context, query=query)

In [None]:
augmented_prompt = augment_prompt(
    query=query,
    composed_context=retrieval_response["composed_context"],
)
print(augmented_prompt)

<div class="alert alert-block alert-secondary">

**Considerations for building a prompt-template for RAG:**

Prompt engineering techniques need to be purpose-built for the use case and the chosen model. For example, if you want the model to still use its own knowledge in certain cases, you will need a different prompt template than if you want the model to use only the retrieved context.

For comparison, here are the links to popular third-party library prompt templates which are fairly generic in nature:
- [LangChain's default RAG prompt template](https://smith.langchain.com/hub/rlm/rag-prompt)
- [LlamaIndex's RAG prompt template](https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/prompts/default_prompts.py#L99)

</div>
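
As an illustration of the consideration above, here is a sketch of an alternative template that lets the model fall back on its own knowledge. The wording of `prompt_template_open` and the helper `augment_prompt_open` are hypothetical and untested against any particular model; treat them as a starting point rather than a recommended prompt.

```python
# Hypothetical variant: the context is offered as optional help, not as the only source.
prompt_template_open = """
Use the following context if it is relevant:
{composed_context}

Answer the following question:
{query}

If the context does not cover the question, answer from your own knowledge and say that you did so."""


def augment_prompt_open(query: str, composed_context: str) -> str:
    """Fill the permissive template with the query and the composed context."""
    return prompt_template_open.format(composed_context=composed_context, query=query)


open_prompt = augment_prompt_open(
    query="What is Ray?",
    composed_context="(retrieved chunks would go here)",
)
print(open_prompt)
```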

We implement our question answering `QA` class below, which composes all the steps together.

In [None]:
class QA:
    def __init__(self, retriever: Retriever, llm_client: LLMClient):
        self.retriever = retriever
        self.llm_client = llm_client

    def answer(
        self,
        query: str,
        top_k: int,
        include_sources: bool = True,
    ) -> Iterator[str]:
        """Answer the given question and provide sources."""
        retrieval_response = self.retriever.retrieve(
            query=query,
            top_k=top_k,
        )
        prompt = augment_prompt(query, retrieval_response["composed_context"])
        response = self.llm_client.generate(
            user_prompt=prompt,
            stream=True,
        )
        for chunk in response:
            choice = chunk.choices[0]
            if choice.delta.content is None:
                continue
            yield choice.delta.content

        if include_sources:
            yield "\n" * 2
            # De-duplicate sources while preserving retrieval order
            sources_str = "\n".join(dict.fromkeys(retrieval_response["sources"]))
            yield sources_str
            yield "\n"

We now test out our `QA` implementation.

In [None]:
qa_agent = QA(retriever=retriever, llm_client=llm_client)
response = qa_agent.answer(query=query, top_k=3)
for r in response:
    print(r, end="")
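
To see how the streaming loop in `QA.answer` handles chunks without calling a live endpoint, here is a sketch that uses hand-built stand-in objects. `make_chunk` and `iter_text` are hypothetical helpers, and the `SimpleNamespace` objects only mimic the `chunk.choices[0].delta.content` attribute path; they are not real `openai` types.

```python
from types import SimpleNamespace
from typing import Iterable, Iterator, Optional


def make_chunk(content: Optional[str]) -> SimpleNamespace:
    """Build a stand-in object exposing chunk.choices[0].delta.content."""
    delta = SimpleNamespace(content=content)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


def iter_text(chunks: Iterable[SimpleNamespace]) -> Iterator[str]:
    """Yield only the text pieces, skipping chunks whose content is None."""
    for chunk in chunks:
        content = chunk.choices[0].delta.content
        if content is None:
            continue
        yield content


# A made-up stream: some chunks carry no text and must be skipped.
fake_stream = [make_chunk(None), make_chunk("Ray "), make_chunk("Serve"), make_chunk(None)]
streamed_text = "".join(iter_text(fake_stream))
print(streamed_text)
```

Skipping `None` content keeps the generator's output a clean sequence of strings, so callers can print or join the pieces directly.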

<div class="alert alert-block alert-info">

#### Activity: Prompt the QA agent with different top_k values

Prompt the same QA agent with the question "How to deploy Ray Serve on Kubernetes?" with `top_k=0` - is the answer still helpful and correct? 

</div>

In [None]:
# Write your solution here


<div class="alert alert-block alert-info">
<details>
<summary>Click here to see the solution</summary>


If you prompt the QA agent with `top_k=0`, the answer will not be reliable. This is because the RAG application does not retrieve any documents from the search index, so the LLM has no context in which to ground its answer.

```python
qa_agent = QA(retriever=retriever, llm_client=llm_client)
response = qa_agent.answer(query=query, top_k=0)
for r in response:
    print(r, end="")
```

This will now produce a hallucinated answer about using a Helm chart that does not exist.
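
One way to see why: with no retrieved documents, the composed context is an empty string, so the prompt sent to the LLM contains the question but no supporting text. The sketch below re-creates the context composition and uses a shortened copy of the prompt template (shortened for illustration) to show the resulting prompt.

```python
def compose_context(contexts: list[str]) -> str:
    """Same concatenation logic as Retriever._compose_context."""
    sep = 100 * "-"
    return "\n\n".join([f"{sep}\n{context}" for context in contexts])


# Shortened copy of the RAG template, for illustration only.
short_template = """
Given the following context:
{composed_context}

Answer the following question:
{query}
"""

empty_context = compose_context([])  # what top_k=0 leads to
ungrounded_prompt = short_template.format(
    composed_context=empty_context,
    query="How to deploy Ray Serve on Kubernetes?",
)
print(repr(empty_context))
print(ungrounded_prompt)
```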


</details>

</div>
