# Brief 

Having built a basic RAG application, we now need to deploy it. This guide will walk you through deploying the Retriever and Generation models on a server.

<div class="alert alert-block alert-info">

<b> Here is the roadmap for this notebook:</b>

<ul>
    <li><b>Part 1:</b> RAG Backend Overview</li>
    <li><b>Part 2:</b> Deploying the Retriever components</li>
    <li><b>Part 3:</b> Deploying the Response Generation</li>
    <li><b>Part 4:</b> Putting it all together into a QA Engine</li>
    <li><b>Part 5:</b> Key Takeaways</li>
    <li><b>Part 6:</b> Bonus: Adding HTTP Ingress</li>
    <li><b>Part 7:</b> Bonus: Enabling streaming of response</li>
    
</ul>

</div>

## Setup

### Imports

In [None]:
import os
import json
import shutil
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from typing import Any, Iterator

import openai
import requests
import chromadb
from ray import serve
from openai.types.chat import ChatCompletion
from pathlib import Path
from sentence_transformers import SentenceTransformer

### Environment setup

We will fetch some credentials from S3 and set up our environment.

In [None]:
!aws s3 cp s3://anyscale-ray-summit-training-2024/anyscale_service_credentials.json ./credentials.json

with open("credentials.json", "r") as f:
    credentials = json.load(f)
    for key, value in credentials.items():
        os.environ[key] = value

### Constants

In [None]:
if os.environ.get("ANYSCALE_ARTIFACT_STORAGE"):
    DATA_DIR = Path("/mnt/cluster_storage/")
    shutil.copytree(Path("./data/"), DATA_DIR, dirs_exist_ok=True)
else:
    DATA_DIR = Path("./data/")

In [None]:
# Embedding model we used to build the search index on chroma
EMBEDDING_MODEL_NAME = "thenlper/gte-large"
# The chroma search index we built
CHROMA_COLLECTION_NAME = "ray-docs"

ANYSCALE_SERVICE_BASE_URL = os.environ["ANYSCALE_SERVICE_BASE_URL"]
ANYSCALE_API_KEY = os.environ["ANYSCALE_API_KEY"]
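
Reading `os.environ[...]` directly raises a bare `KeyError` when a credential is missing. As a minimal sketch, a small helper can fail early with a clearer message. The helper name `require_env` and the variable `DEMO_SETTING` are hypothetical and not part of the original notebook.

```python
import os


def require_env(*names: str) -> dict[str, str]:
    """Return the named environment variables, failing early if any are missing or empty."""
    missing = [name for name in names if not os.environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: os.environ[name] for name in names}


# Hypothetical variable, set here only so the demo runs on any machine.
os.environ.setdefault("DEMO_SETTING", "demo-value")
print(require_env("DEMO_SETTING"))
```

In this notebook the equivalent call would be `require_env("ANYSCALE_SERVICE_BASE_URL", "ANYSCALE_API_KEY")`.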

## 0. RAG Backend Overview

Here is the same diagram from the previous notebook, but with the services that we will deploy highlighted.

All the services will be deployed as part of a single QA engine application.

<img src="https://anyscale-public-materials.s3.us-west-2.amazonaws.com/ray-summit/rag-app/RAG+App+-+Ray+Summit+-+with_rag_services_v2.png" width="900px"/>

<div class="alert alert-block alert-warning">

Note: deciding which components are built as separate deployments is a design decision. It depends on whether you want to scale them independently.

</div>

## 1. Building Retriever components

As a reminder, Retrieval is implemented in the following steps:

1. Encode the user query
2. Search the vector store
3. Compose a context from the retrieved documents
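
The three steps above can be sketched end to end with the standard library alone. This is a toy illustration, not the notebook's implementation: the corpus is made up, and a bag-of-words count stands in for the neural embedding model and the vector store.

```python
import math
from collections import Counter

# Toy corpus standing in for the indexed documentation chunks (made-up data).
CHUNKS = [
    "Deploy Ray Serve on Kubernetes with the KubeRay operator.",
    "Ray Data loads and transforms datasets in parallel.",
    "Ray Serve deployments scale replicas with traffic.",
]


def encode(text: str) -> Counter:
    """Step 1: turn text into a vector (here: word counts, not a neural embedding)."""
    return Counter(text.lower().replace(".", "").replace("?", "").split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def search(query_vec: Counter, top_k: int) -> list[str]:
    """Step 2: rank chunks by similarity to the query and keep the top_k."""
    ranked = sorted(CHUNKS, key=lambda chunk: cosine(query_vec, encode(chunk)), reverse=True)
    return ranked[:top_k]


def compose(chunks: list[str]) -> str:
    """Step 3: join the retrieved chunks into a single context string."""
    return "\n\n".join(chunks)


results = search(encode("How can I deploy Ray Serve to Kubernetes?"), top_k=2)
context = compose(results)
print(context)
```

The deployments built in the rest of this section follow the same encode, search, compose flow, with each step running as a separately scalable component.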

### 1. Encode the user query

To convert our `QueryEncoder` into a Ray Serve deployment, we simply need to wrap it with the `serve.deployment` decorator.

Each deployment is a collection of replicas that can be scaled up or down based on the traffic.

<img src='https://technical-training-assets.s3.us-west-2.amazonaws.com/Ray_Serve/deployment.png' width=400/>


The `autoscaling_config` parameter specifies the minimum and maximum number of replicas that can be created. 

The `ray_actor_options` parameter specifies the resources allocated to each replica. In this case, we allocate 1/10th (0.1) of a GPU to each replica.
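
To make the arithmetic behind fractional GPUs concrete, here is a minimal sketch of how many replicas can share a GPU under Ray's logical resource accounting. The helper names are hypothetical. The fraction is scheduling bookkeeping only: Ray does not enforce a GPU memory limit, so the replicas sharing a GPU still have to fit in its memory together.

```python
import math


def replicas_per_gpu(num_gpus_per_replica: float) -> int:
    """Number of replicas that can be scheduled onto a single GPU."""
    if not 0 < num_gpus_per_replica <= 1:
        raise ValueError("expected a fraction in (0, 1]")
    # The small epsilon guards against float rounding in the division.
    return math.floor(1 / num_gpus_per_replica + 1e-9)


def gpus_needed(num_replicas: int, num_gpus_per_replica: float) -> int:
    """Whole GPUs required to place num_replicas replicas."""
    return math.ceil(num_replicas / replicas_per_gpu(num_gpus_per_replica))


print(replicas_per_gpu(0.1))  # 10 replicas fit on one GPU
print(gpus_needed(2, 0.1))    # max_replicas=2 needs a single GPU
```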

In [None]:
@serve.deployment(
    ray_actor_options={"num_gpus": 0.1},
    autoscaling_config={"min_replicas": 1, "max_replicas": 2},
)
class QueryEncoder:
    def __init__(self):
        self.embedding_model_name = EMBEDDING_MODEL_NAME
        self.model = SentenceTransformer(self.embedding_model_name, device="cuda")

    def encode(self, query: str) -> list[float]:
        return self.model.encode(query).tolist()


query_encoder = QueryEncoder.bind()

To send a gRPC request to the deployment, we need to:
1. Start the deployment with `serve.run`, which returns a handle to it
2. Send a request through the handle by calling `.remote()` on one of its methods

In [None]:
query_encoder_handle = serve.run(query_encoder, route_prefix="/query-encoder")
query = "How can I deploy Ray Serve to Kubernetes?"
embeddings_vector = await query_encoder_handle.encode.remote(query)

type(embeddings_vector), len(embeddings_vector)

In [None]:
embeddings_vector[:5]

### 2. Search the vector store

Next, we wrap the vector store with a `serve.deployment`.

Note that we resort to a hack to ensure the vector store runs on the head node: the deployment requests a custom `is_head_node` resource, which only the head node is assumed to provide. This is necessary because we are running a local ChromaDB in development mode, which does not allow concurrent access across nodes.

In [None]:
@serve.deployment(
    ray_actor_options={"num_cpus": 0, "resources": {"is_head_node": 1}},
)
class VectorStore:
    def __init__(self):
        chroma_client = chromadb.PersistentClient(
            path="/mnt/cluster_storage/vector_store"
        )
        self._collection = chroma_client.get_collection(CHROMA_COLLECTION_NAME)

    async def query(self, query_embedding: list[float], top_k: int) -> dict:
        """Retrieve the most similar chunks to the given query embedding."""
        if top_k == 0:
            return {"documents": []}

        response = self._collection.query(
            query_embeddings=[query_embedding],
            n_results=top_k,
        )

        return {
            "documents": [
                {
                    "text": text,
                    "section_url": metadata["section_url"],
                }
                for text, metadata in zip(
                    response["documents"][0], response["metadatas"][0]
                )
            ],
        }

vector_store = VectorStore.bind()

In [None]:
vector_store_handle = serve.run(vector_store, route_prefix="/vector-store")
vector_store_response = await vector_store_handle.query.remote(
    query_embedding=embeddings_vector,
    top_k=3,
)

We can inspect the URLs of the documents retrieved for our query.

In [None]:
for doc in vector_store_response["documents"]:
    print(doc["section_url"])
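
The `[0]` indexing in `VectorStore.query` reflects the shape of Chroma's query response: each field holds one inner list per query embedding. The sketch below uses a hand-written dictionary in that shape (IDs, texts, and URLs are made up) to show how the response is flattened into a list of documents.

```python
# Hand-written stand-in for a query response; all values are made up.
response = {
    "ids": [["chunk-1", "chunk-2"]],
    "documents": [["First chunk text.", "Second chunk text."]],
    "metadatas": [[
        {"section_url": "https://example.com/docs#a"},
        {"section_url": "https://example.com/docs#b"},
    ]],
}

# Index 0 selects the results for the first (and only) query embedding.
documents = [
    {"text": text, "section_url": metadata["section_url"]}
    for text, metadata in zip(response["documents"][0], response["metadatas"][0])
]

for doc in documents:
    print(doc["section_url"], "->", doc["text"])
```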

### 3. Compose a context from the retrieved documents

We put together a `Retriever` that encapsulates the entire retrieval process so far.

It also composes the context from the retrieved documents by simply concatenating the retrieved chunks.

In [None]:
@serve.deployment(
    ray_actor_options={"num_cpus": 0.1},
)
class Retriever:
    def __init__(self, query_encoder, vector_store):
        self.query_encoder = query_encoder
        self.vector_store = vector_store

    def _compose_context(self, contexts: list[str]) -> str:
        sep = 100 * "-"
        return "\n\n".join([f"{sep}\n{context}" for context in contexts])

    async def retrieve(self, query: str, top_k: int) -> dict:
        """Retrieve the context and sources for the given query."""
        encoded_query = await self.query_encoder.encode.remote(query)
        vector_store_response = await self.vector_store.query.remote(
            query_embedding=encoded_query,
            top_k=top_k,
        )
        contexts = [chunk["text"] for chunk in vector_store_response["documents"]]
        sources = [chunk["section_url"] for chunk in vector_store_response["documents"]]
        return {
            "contexts": contexts,
            "composed_context": self._compose_context(contexts),
            "sources": sources,
        }


retriever = Retriever.bind(query_encoder=query_encoder, vector_store=vector_store)

We run the retriever to check that it works as expected.

In [None]:
retriever_handle = serve.run(retriever, route_prefix="/retriever")
retrieval_response = await retriever_handle.retrieve.remote(
    query=query,
    top_k=3,
)
retrieval_response

We inspect the retrieved context.

In [None]:
print(retrieval_response["composed_context"])
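
To see the format of the composed context without a running deployment, the joining logic of `Retriever._compose_context` can be run as a free function on two placeholder chunks. The function name `compose_context` and the chunk strings are illustrative only.

```python
def compose_context(contexts: list[str]) -> str:
    """Same joining logic as Retriever._compose_context, as a free function."""
    sep = 100 * "-"
    return "\n\n".join(f"{sep}\n{context}" for context in contexts)


composed = compose_context(["chunk A", "chunk B"])
print(composed)
```

Each chunk is preceded by a separator line of 100 dashes, and chunks are separated from one another by a blank line.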

## 2. Building Response Generation

Next we will wrap the LLM client as its own deployment. Here we showcase that we can also make use of fractional CPUs for this client deployment. 

Note: Separating the client into its own deployment is optional. It could instead have been included in the QA engine deployment.

In [None]:
@serve.deployment(
    ray_actor_options={"num_cpus": 0.1},
)
class LLMClient:
    def __init__(self):
        # Initialize a client to perform API requests
        self.client = openai.OpenAI(
            base_url=ANYSCALE_SERVICE_BASE_URL,
            api_key=ANYSCALE_API_KEY,
        )

    def generate(
        self,
        user_prompt: str,
        model: str = "mistralai/Mistral-7B-Instruct-v0.1",
        temperature: float = 0,
        **kwargs: Any,
    ) -> ChatCompletion:
        """Generate a completion from the given user prompt."""
        # Call the chat completions endpoint
        chat_completion = self.client.chat.completions.create(
            model=model,
            messages=[
                # Prime the system with a system message - a common best practice
                {"role": "system", "content": "You are a helpful assistant."},
                # Send the user message with the proper "user" role and "content"
                {"role": "user", "content": user_prompt},
            ],
            temperature=temperature,
            **kwargs,
        )

        return chat_completion


llm_client = LLMClient.bind()

Note we are currently making use of an already deployed open-source LLM running on Anyscale.

In case you want to deploy your own LLM, you can follow this [ready-built Anyscale Deploy LLMs template](https://console.anyscale.com/v2/template-preview/endpoints_v2).

In [None]:
llm_client_handle = serve.run(llm_client, route_prefix="/llm")

In [None]:
llm_response = await llm_client_handle.generate.remote( 
    user_prompt="What is the capital of France?",
)
llm_response.choices[0].message.content

## 3. Putting it all together

Given a user query, we want our RAG-based QA engine to perform the following steps:

1. Retrieve the closest documents to the query
2. Augment the query with the context
3. Generate a response to the augmented query

We decide on a simple prompt template to augment the user's query with the retrieved context. The template is as follows:

In [None]:
prompt_template_rag = """
Given the following context:
{composed_context}

Answer the following question:
{query}

If you cannot provide an answer based on the context, please say "I don't know."
Do not use the term "context" in your response."""


def augment_prompt(query: str, composed_context: str) -> str:
    """Augment the prompt with the given query and contexts."""
    return prompt_template_rag.format(composed_context=composed_context, query=query)

In [None]:
augmented_prompt = augment_prompt(
    "How can I deploy Ray Serve to Kubernetes?",
    retrieval_response["composed_context"],
)
print(augmented_prompt)
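
Retrieved chunks often contain code or JSON with curly braces. As a small sketch, the example below uses a shortened copy of the template to show that braces inside the substituted values are inserted verbatim by `str.format`. Only literal braces written into the template itself would need to be doubled (`{{` and `}}`).

```python
# Shortened copy of the RAG prompt template, for illustration only.
prompt_template = """
Given the following context:
{composed_context}

Answer the following question:
{query}
"""


def augment_prompt(query: str, composed_context: str) -> str:
    """Fill the template with the query and the composed context."""
    return prompt_template.format(composed_context=composed_context, query=query)


# The context contains braces, which str.format leaves untouched in values.
prompt = augment_prompt("What does this config do?", 'config = {"num_replicas": 2}')
print(prompt)
```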

<div class="alert alert-block alert-secondary">

**Considerations for building a prompt-template for RAG:**

Prompt engineering techniques need to be purpose-built for the use case and the chosen model. For example, if you want the model to still draw on its own knowledge in certain cases, you need a different prompt template than if you want it to rely only on the retrieved context.

For comparison, here are the links to popular third-party library prompt templates which are fairly generic in nature:
- [LangChain's default RAG prompt template](https://smith.langchain.com/hub/rlm/rag-prompt)
- [LlamaIndex's RAG prompt template](https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/prompts/default_prompts.py#L99)

</div>

We follow a similar pattern and wrap the `QA` engine with a `serve.deployment` decorator. We update all calls to the retriever and generator to use the respective `remote` calls.

In [None]:
@serve.deployment(autoscaling_config=dict(min_replicas=1, max_replicas=3))
class QA:
    def __init__(self, retriever, llm_client):
        self.retriever = retriever
        self.llm_client = llm_client

    async def answer(
        self,
        query: str,
        top_k: int,
        include_sources: bool = True,
    ):
        """Answer the given question and provide sources."""
        retrieval_response = await self.retriever.retrieve.remote(
            query=query,
            top_k=top_k,
        )
        prompt = augment_prompt(query, retrieval_response["composed_context"])
        llm_response = await self.llm_client.generate.remote(user_prompt=prompt)
        response = llm_response.choices[0].message.content

        if include_sources:
            response += "\n" * 2
            # De-duplicate the sources while preserving retrieval order
            sources_str = "\n".join(dict.fromkeys(retrieval_response["sources"]))
            response += sources_str
            response += "\n"

        return response


qa_engine = QA.bind(
    retriever=retriever,
    llm_client=llm_client,
)

In [None]:
qa_handle = serve.run(qa_engine)

In [None]:
qa_response = await qa_handle.answer.remote(
    query="How can I deploy Ray Serve to Kubernetes?",
    top_k=3,
    include_sources=True,
)
print(qa_response)

## Key Takeaways

With Ray and Anyscale, we can easily deploy complex applications with multiple components.

Ray Serve is:
* **Flexible:** Unlike serving platforms built only for ML models, Ray Serve is general-purpose and lets you implement complex logic, which production settings almost always require when multiple models need to be composed.
* **Lightweight:** It is much simpler than a microservices setup in which each service has to be containerized. It requires no additional tooling and offers a simple, Python-native approach to deploying apps.
* Offers **intuitive autoscaling** configuration, expressed in terms of replicas and request load rather than proxies such as CPU and network utilization.
* Enables **fractional resource allocation**: each replica can be allocated a fraction of a CPU or GPU, which makes for efficient resource utilization.
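
Ray Serve's autoscaling is driven by the number of in-flight requests per replica. The core idea can be sketched as below. This is a simplified illustration, not Ray Serve's actual policy, which also smooths and delays scaling decisions. The function and the parameter `target_per_replica` are hypothetical stand-ins for Serve's target-ongoing-requests setting.

```python
import math


def desired_replicas(
    ongoing_requests: int,
    target_per_replica: float,
    min_replicas: int,
    max_replicas: int,
) -> int:
    """Replicas needed so each one handles roughly target_per_replica requests."""
    wanted = math.ceil(ongoing_requests / target_per_replica)
    # Clamp to the bounds given in autoscaling_config.
    return max(min_replicas, min(max_replicas, wanted))


for load in (0, 3, 5, 20):
    print(load, "->", desired_replicas(load, target_per_replica=2, min_replicas=1, max_replicas=3))
```

With `min_replicas=1` and `max_replicas=3`, as used for the `QA` deployment, the replica count stays at 1 when idle and is capped at 3 under heavy load.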

The Anyscale Platform allows us to deploy Ray Serve applications with ease. It offers:
* **Canary deployments**: to test new versions of the model
* **Versioned Rollouts/Rollbacks** to manage deployments
* **Replica compaction**: to consolidate replicas onto fewer nodes and reduce resource fragmentation

To learn how to deploy an Anyscale service, you can refer to the [Anyscale Services documentation](https://docs.anyscale.com/platform/services/).

## Bonus: Adding HTTP Ingress

FastAPI is a modern web framework for building APIs.

Ray Serve offers an integration with FastAPI to easily expose Ray Serve deployments as HTTP endpoints and get benefits like request validation, OpenAPI documentation, and more.

In [None]:
app = FastAPI()

@serve.deployment(autoscaling_config=dict(min_replicas=1, max_replicas=3))
@serve.ingress(app)
class QAGateway:
    def __init__(self, qa_engine):
        self.qa_engine = qa_engine

    @app.get("/answer")
    async def answer(
        self,
        query: str,
        top_k: int = 3,
        include_sources: bool = True,
    ):
        return await self.qa_engine.answer.remote(
            query=query,
            top_k=top_k,
            include_sources=include_sources,
        )

gateway = QAGateway.bind(qa_engine=qa_engine)

In [None]:
gateway_handle = serve.run(gateway)

In [None]:
params = dict(
    query="How can I deploy Ray Serve to Kubernetes?",
    top_k=3,
)

response = requests.get("http://localhost:8000/answer", params=params)
print(response.json())
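
For reference, the `params` dictionary ends up in the URL's query string, and FastAPI converts the string values back to the annotated types (for example `top_k: int`). The sketch below builds an equivalent URL with the standard library. It approximates what `requests` sends and is not taken from its implementation.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

params = {"query": "How can I deploy Ray Serve to Kubernetes?", "top_k": 3}

# Spaces become "+" and "?" becomes "%3F" in the encoded query string.
url = "http://localhost:8000/answer?" + urlencode(params)
print(url)

# Parsing the query string recovers the values, all as strings.
parsed = parse_qs(urlsplit(url).query)
print(parsed)
```

Because `include_sources` is omitted from the query string, the endpoint falls back to its default of `True`.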

## Bonus: Streaming Responses

Assuming we want to stream directly from our client, we can use the `StreamingResponse` from FastAPI to stream the response as it is generated.

We first simplify to only deploy the LLM client and then stream the response.

In [None]:
app = FastAPI()

@serve.deployment(
    ray_actor_options={"num_cpus": 0.1},
)
@serve.ingress(app)
class LLMClient:
    def __init__(self):
        # Initialize a client to perform API requests
        self.client = openai.OpenAI(
            base_url=ANYSCALE_SERVICE_BASE_URL,
            api_key=ANYSCALE_API_KEY,
        )
    
    @app.get("/generate")
    async def generate(
        self,
        user_prompt: str,
        model: str = "mistralai/Mistral-7B-Instruct-v0.1",
        temperature: float = 0,
    ) -> StreamingResponse:
        """Stream a completion for the given user prompt."""
        return StreamingResponse(
            self._generate(
                user_prompt=user_prompt, model=model, temperature=temperature
            ),
            media_type="text/event-stream",
        )

    def _generate(
        self,
        user_prompt: str,
        model: str,
        temperature: float,
        **kwargs: Any,
    ) -> Iterator[str]:
        """Generate a completion from the given user prompt."""
        # Call the chat completions endpoint
        response = self.client.chat.completions.create(
            model=model,
            messages=[
                # Prime the system with a system message - a common best practice
                {"role": "system", "content": "You are a helpful assistant."},
                # Send the user message with the proper "user" role and "content"
                {"role": "user", "content": user_prompt},
            ],
            temperature=temperature,
            stream=True,
            **kwargs,
        )

        for chunk in response:
            choice = chunk.choices[0]
            if choice.delta.content is None:
                continue
            yield choice.delta.content

llm_client = LLMClient.bind()

In [None]:
llm_client_handle = serve.run(llm_client, name="streaming-llm", route_prefix="/stream")

In [None]:
params = dict(
    user_prompt="What is the capital of France?",
)

response = requests.get("http://localhost:8000/stream/generate", stream=True, params=params)
for chunk in response.iter_content(chunk_size=None, decode_unicode=True):
    print(chunk, end="")
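
The filtering loop in `_generate` can be tried without calling an LLM. The sketch below builds stand-in objects with the same attribute path (`choices[0].delta.content`) as a streamed chat-completion chunk. The helpers `make_chunk` and `iter_text` are hypothetical, and the assumption that some chunks carry no text (for example the final one) is modeled with `None`.

```python
from types import SimpleNamespace
from typing import Iterable, Iterator, Optional


def make_chunk(content: Optional[str]) -> SimpleNamespace:
    """Build a stand-in object shaped like one streamed chat-completion chunk."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=content))])


def iter_text(chunks: Iterable[SimpleNamespace]) -> Iterator[str]:
    """Yield only the text deltas, skipping chunks that carry no content."""
    for chunk in chunks:
        content = chunk.choices[0].delta.content
        if content is None:
            continue
        yield content


fake_stream = [make_chunk(None), make_chunk("Par"), make_chunk("is"), make_chunk(None)]
print("".join(iter_text(fake_stream)))
```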

Next, we update the QA deployment to use the streaming LLM client.

We start by redefining `LLMClient` without the ingress decorator, so that `generate` itself yields the streamed chunks.

In [None]:
@serve.deployment(
    ray_actor_options={"num_cpus": 0.1},
)
class LLMClient:
    def __init__(self):
        # Initialize a client to perform API requests
        self.client = openai.OpenAI(
            base_url=ANYSCALE_SERVICE_BASE_URL,
            api_key=ANYSCALE_API_KEY,
        )
    
    async def generate(
        self,
        user_prompt: str,
        model: str = "mistralai/Mistral-7B-Instruct-v0.1",
        temperature: float = 0,
    ) -> Iterator[str]:
        """Generate a completion from the given user prompt."""
        # Call the chat completions endpoint
        response = self.client.chat.completions.create(
            model=model,
            messages=[
                # Prime the system with a system message - a common best practice
                {"role": "system", "content": "You are a helpful assistant."},
                # Send the user message with the proper "user" role and "content"
                {"role": "user", "content": user_prompt},
            ],
            temperature=temperature,
            stream=True,
        )

        for chunk in response:
            choice = chunk.choices[0]
            if choice.delta.content is None:
                continue
            yield choice.delta.content

llm_client = LLMClient.bind()

With the streaming `LLMClient` defined, we update the `QA` deployment to consume its stream and expose the answer over HTTP.

In [None]:
# Create a fresh FastAPI app so the earlier /generate route is not carried over
app = FastAPI()

@serve.deployment(autoscaling_config=dict(min_replicas=1, max_replicas=3))
@serve.ingress(app)
class QA:
    def __init__(self, retriever, llm_client):
        self.retriever = retriever
        # Enable streaming on the deployment handle
        self.llm_client = llm_client.options(stream=True)

    @app.get("/answer")
    async def answer(
        self,
        query: str,
        top_k: int,
        include_sources: bool = True,
    ):
        return StreamingResponse(
            self._answer(
                query=query,
                top_k=top_k,
                include_sources=include_sources,
            ),
            media_type="text/event-stream",
        )

    async def _answer(
        self,
        query: str,
        top_k: int,
        include_sources: bool = True,
    ) -> Iterator[str]:
        """Answer the given question and provide sources."""
        retrieval_response = await self.retriever.retrieve.remote(
            query=query,
            top_k=top_k,
        )
        prompt = augment_prompt(query, retrieval_response["composed_context"])

        # async for instead of await
        async for chunk in self.llm_client.generate.remote(user_prompt=prompt):
            yield chunk

        if include_sources:
            yield "\n" * 2
            # De-duplicate the sources while preserving retrieval order
            sources_str = "\n".join(dict.fromkeys(retrieval_response["sources"]))
            yield sources_str
            yield "\n"


qa_client = QA.bind(retriever=retriever, llm_client=llm_client)
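
The `async for` pattern used in `_answer` can be illustrated with plain `asyncio`. In this sketch, `fake_llm_stream` is a hypothetical stand-in for the streaming deployment handle, and the tokens and source URL are made up. The point is only that chunks are forwarded as they arrive and the sources are appended once the stream ends.

```python
import asyncio
from typing import AsyncIterator


async def fake_llm_stream(prompt: str) -> AsyncIterator[str]:
    """Hypothetical stand-in for the streaming LLM call."""
    for token in ["Use ", "KubeRay", "."]:
        await asyncio.sleep(0)  # hand control back to the event loop between tokens
        yield token


async def answer(prompt: str, sources: list[str]) -> AsyncIterator[str]:
    """Forward streamed chunks, then append the sources at the end."""
    async for chunk in fake_llm_stream(prompt):
        yield chunk
    yield "\n\n" + "\n".join(sources) + "\n"


async def collect() -> list[str]:
    """Gather every streamed piece into a list."""
    return [piece async for piece in answer("question", ["https://example.com/a"])]


pieces = asyncio.run(collect())
print("".join(pieces))
```

Inside a notebook, where an event loop is already running, `await collect()` would replace `asyncio.run(collect())`.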

Note, we left out the gateway to reduce the complexity of the example.

In [None]:
# Shut down all running Serve applications before deploying the streaming QA app
serve.shutdown()

In [None]:
qa_client_handle = serve.run(qa_client, name="streaming-qa", route_prefix="/")

Let's query the QA service in streaming mode:

In [None]:
params = dict(
    query=query,
    top_k=3,
)

response = requests.get(
    "http://localhost:8000/answer", stream=True, params=params
)
for chunk in response.iter_content(chunk_size=None, decode_unicode=True):
    print(chunk, end="")

## Cleanup

We shut down the running QA deployment.

In [None]:
serve.shutdown()