# Intro to RAG

<div class="alert alert-block alert-info">
<b> Here is the roadmap for this notebook:</b>
<ul>
    <li><b>Part 1:</b> Prompting an LLM without RAG</li>
    <li><b>Part 2:</b> In-context learning and LLMs</li>
    <li><b>Part 3:</b> Retrieval and semantic search</li>
    <li><b>Part 4:</b> RAG: High-level overview</li>
</ul>
</div>

## Imports

In [None]:
import os
import json

import openai
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

## Environment setup

We will fetch some credentials from S3 and set up our environment.

In [None]:
!aws s3 cp s3://anyscale-ray-summit-training-2024/anyscale_service_credentials.json ./credentials.json

with open("credentials.json", "r") as f:
    credentials = json.load(f)
    for key, value in credentials.items():
        os.environ[key] = value

## Constants

In [None]:
ANYSCALE_SERVICE_BASE_URL = os.environ["ANYSCALE_SERVICE_BASE_URL"]
ANYSCALE_API_KEY = os.environ["ANYSCALE_API_KEY"]

## What is RAG?

Retrieval-augmented generation (RAG) combines large language models (LLMs) with information retrieval systems so that generated responses are grounded in relevant, retrieved context. It was introduced by Lewis et al. in the paper [Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks](https://arxiv.org/abs/2005.11401).

## Prompting an LLM without RAG

Here is our system without RAG. 

<img src="https://anyscale-public-materials.s3.us-west-2.amazonaws.com/ray-summit/rag-app/RAG+App+-+Ray+Summit+-+without_rag+-+v2.png" alt="Without RAG" width="550px"/>

We prompt an LLM and get back a response.

In [None]:
def prompt_llm(user_prompt, model="mistralai/Mistral-7B-Instruct-v0.1", temperature=0, **kwargs):
    # Initialize a client to perform API requests
    client = openai.OpenAI(
        base_url=ANYSCALE_SERVICE_BASE_URL,
        api_key=ANYSCALE_API_KEY,
    )
    
    # Call the chat completions endpoint
    chat_completion = client.chat.completions.create(
        model=model,
        messages=[
            # Prime the system with a system message - a common best practice
            {"role": "system", "content": "You are a helpful assistant."},
            # Send the user message with the proper "user" role and "content"
            {"role": "user", "content": user_prompt},
        ],
        temperature=temperature,
        **kwargs,
    )

    return chat_completion

As an example, we will prompt an LLM about the capital of France.

In [None]:
prompt = "What is the capital of France?"
response = prompt_llm(prompt)
print(response.choices[0].message.content)

Let's consider the case of prompting the LLM about **internal** company documents. 

Think of technical company documents and company policies that are not available on the internet.

Because the LLM has not been trained on these documents, it cannot answer questions about them reliably.

In [None]:
prompt = "Can I rent the company car on weekends?"
response = prompt_llm(prompt)
print(response.choices[0].message.content)

## In-context learning and LLMs

It turns out LLMs excel at in-context learning, meaning they can utilize additional context provided with a user prompt to generate a response that is grounded in the provided context. 

Here is a diagram of the system with in-context learning:

<img src="https://anyscale-public-materials.s3.us-west-2.amazonaws.com/ray-summit/rag-app/RAG+App+-+Ray+Summit+-+in-context-learning+-++v2.png" alt="In-context learning" width="500px"/>

For a more formal treatment, see the paper [In-Context Retrieval-Augmented Language Models](https://arxiv.org/pdf/2302.00083.pdf), which evaluates prepending retrieved documents to an LLM's input without changing the model itself.


Let's consider the case of prompting the LLM about internal company policies. 

In [None]:
context = """
Here are the company policies that you need to know about:

1. You are not allowed to use the company's computers between 9am and 5pm. 
2. You are not allowed to use the company car on weekends.
"""

This time, we provide the LLM with the company's policies as context.

In [None]:
query = "Am I allowed to use the company car on weekends?"

prompt = f"""
Given the following context:
{context}

What is the answer to the following question:
{query}
"""

response = prompt_llm(prompt)
print(response.choices[0].message.content)

This time the response should be grounded in the provided context: you are not allowed to use the company car on weekends.

## Retrieval and semantic search

In a real-world scenario, we can't pass all of a company's data to the LLM as context. It would likely exceed the model's context window, and even where it fits, it would be inefficient from both a cost and a performance perspective.

So we will need a retrieval system to find the most relevant context.

One effective retrieval system is semantic search, which uses embeddings to find the most relevant context.

### What is semantic search?

Semantic search enables us to find documents that share a similar meaning with our queries.

To capture the "meaning" of a query, we use specialized encoders known as "embedding models."

Embedding models encode text into a high-dimensional vector, a numerical representation of language that can be compared and retrieved efficiently.


### How do embedding models work?

Embedding models are trained on a large corpus of text data to learn the relationships between words and phrases.

The model represents each word or phrase as a high-dimensional vector, where similar words are closer together in the vector space.

<img src='https://anyscale-public-materials.s3.us-west-2.amazonaws.com/ray-summit/rag-app/word-embeddings.png' width="600px" alt="Word Embeddings tSNE"/>

The diagram shows word embedding vectors in a 2D space. Semantically similar words end up close to each other in the reduced vector space. 

Note that for semantic search, we use sequence embeddings, which have a much higher dimensionality than this 2D projection and offer much richer representations.
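
To make "similar words are closer together in the vector space" concrete, here is a minimal sketch using hand-picked 2D vectors. The vectors and the `nearest_word` helper are made up for illustration; a real embedding model learns its vectors from data and uses many more dimensions.

```python
import math

# Hand-picked 2D vectors for illustration only (NOT produced by a real model).
toy_word_vectors = {
    "king": (0.9, 0.8),
    "queen": (0.85, 0.9),
    "apple": (-0.7, 0.2),
    "banana": (-0.8, 0.1),
}


def nearest_word(word, vectors):
    """Return the other word whose vector has the smallest Euclidean distance."""
    target = vectors[word]
    others = (w for w in vectors if w != word)
    return min(others, key=lambda w: math.dist(target, vectors[w]))


print(nearest_word("king", toy_word_vectors))   # queen
print(nearest_word("apple", toy_word_vectors))  # banana
```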

### Generating embeddings

Here is how to generate embeddings using the `sentence-transformers` library.

In [None]:
embedding_model = SentenceTransformer("BAAI/bge-small-en-v1.5")

prompt = "Am I allowed to use the company car on weekends?"

document_1 = "You are not allowed to use the company's computers between 9am and 5pm."
document_2 = "You are not allowed to use the company car on weekends."

prompt_embedding_vector = embedding_model.encode(prompt)
document_1_embedding_vector = embedding_model.encode(document_1)
document_2_embedding_vector = embedding_model.encode(document_2)

Now, we can find the similarity between the prompt and document vectors by computing the cosine similarity.

In [None]:
similarities = cosine_similarity([prompt_embedding_vector], [document_1_embedding_vector, document_2_embedding_vector]).flatten()
similarity_between_prompt_and_document_1, similarity_between_prompt_and_document_2 = similarities
print(f"{similarity_between_prompt_and_document_1=}")
print(f"{similarity_between_prompt_and_document_2=}")
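
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths: $\cos(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$. The sketch below spells that out in plain Python. The helper name `cosine_similarity_plain` and the toy vectors are assumptions for illustration; in practice the `sklearn` function used above is the better choice.

```python
import math


def cosine_similarity_plain(a, b):
    """Cosine similarity between two equal-length vectors (sequences of floats)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors (made up for illustration, not real embeddings).
query_vec = [1.0, 2.0, 0.0]
same_direction = [2.0, 4.0, 0.0]  # points the same way as query_vec
orthogonal = [0.0, 0.0, 3.0]      # shares no direction with query_vec

print(cosine_similarity_plain(query_vec, same_direction))  # close to 1.0
print(cosine_similarity_plain(query_vec, orthogonal))      # 0.0
```

Because the score depends only on direction, scaling a vector does not change it, which is why `same_direction` scores close to 1.0 despite being twice as long.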

<div class="alert alert-block alert-info">

### Activity: Find the most similar document

Given the following two documents and prompt:

```python
prompt = "What is the current king of england's name?"

document_1 = "British monarchy head at present moment: Charles III"
document_2 = "The current king of spain's name is Felipe VI"

# Hint: Compute the embedding vector for the prompt and the documents.

# Hint: Use a similarity metric to find the most similar document to the prompt.

```

Find the closest document to the prompt using the `BAAI/bge-small-en-v1.5` model. 
</div>

In [None]:
# Write your solution here


<div class="alert alert-block alert-info">
<details> 

<summary>Click here to see the solution </summary>

```python
prompt = "What is the current king of england's name?"

document_1 = "British monarchy head at present moment: Charles III"
document_2 = "The current king of spain's name is Felipe VI"

# Compute the embedding vector for the prompt and the documents.
prompt_embedding_vector = embedding_model.encode(prompt)
document_1_embedding_vector = embedding_model.encode(document_1)
document_2_embedding_vector = embedding_model.encode(document_2)

# Use a similarity metric to find the most similar document to the prompt.
similarities = cosine_similarity([prompt_embedding_vector], [document_1_embedding_vector, document_2_embedding_vector]).flatten()
similarity_between_prompt_and_document_1, similarity_between_prompt_and_document_2 = similarities
if similarity_between_prompt_and_document_1 > similarity_between_prompt_and_document_2:
    print("Document 1 is more similar to the prompt")
else:
    print("Document 2 is more similar to the prompt")
```

</details>
</div>

<div class="alert alert-block alert-warning">

<b>Note:</b> Even though `document_2` has direct word matches with the provided prompt, such as "the," "current," "king," and "name," its meaning is less similar to the prompt than that of `document_1`, which uses different terms like "British monarchy head." This is an example of how semantic search can be more effective than lexical (keyword-based) search.

</div>
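
To see what a purely lexical comparison would conclude for the same prompt and documents, here is a minimal sketch that counts shared words. The `tokenize` and `shared_words` helpers are hypothetical and deliberately simplistic; real keyword search systems use more elaborate scoring.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of word tokens (apostrophes kept)."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def shared_words(query, document):
    """Return the tokens that appear in both the query and the document."""
    return tokenize(query) & tokenize(document)


prompt = "What is the current king of england's name?"
document_1 = "British monarchy head at present moment: Charles III"
document_2 = "The current king of spain's name is Felipe VI"

overlap_1 = shared_words(prompt, document_1)
overlap_2 = shared_words(prompt, document_2)

print(len(overlap_1), sorted(overlap_1))  # 0 []
print(len(overlap_2), sorted(overlap_2))  # 6 ['current', 'is', 'king', 'name', 'of', 'the']
```

Word overlap alone ranks `document_2` first, even though `document_1` is the one that answers the question.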

## RAG: High-level overview

With RAG, we now have a retrieval system that finds the most relevant context and provides it to the LLM.

<img src="https://anyscale-public-materials.s3.us-west-2.amazonaws.com/ray-summit/rag-app/RAG+App+-+Ray+Summit+-+with_rag_simple_v2.png" alt="With RAG" width="600px"/>


### Why RAG?

RAG systems enhance LLMs by:

- Reducing hallucinations with relevant context.
- Providing clear information attribution.
- Enabling access control to information.

### How can we build a basic RAG system?

A common approach to building a basic RAG system is:

1. Encoding our documents, commonly referred to as generating embeddings of our documents.
2. Storing the generated embeddings in a vector store.
3. Encoding our user query.
4. Retrieving relevant documents from our vector store given the encoded user query.
5. Augmenting the user prompt with the retrieved context.
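
The five steps above can be sketched end to end in a few lines. This is a toy illustration under stated assumptions: `toy_embed` is a word-count stand-in for a real embedding model (it matches on shared words, not meaning), the "vector store" is a plain Python list, and all helper names are hypothetical. A real system would use something like `embedding_model.encode` and a dedicated vector store.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Stand-in for an embedding model: a sparse word-count vector (NOT semantic)."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def toy_cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


documents = [
    "You are not allowed to use the company's computers between 9am and 5pm.",
    "You are not allowed to use the company car on weekends.",
]

# Steps 1 and 2: encode the documents and keep them in an in-memory "vector store".
vector_store = [(doc, toy_embed(doc)) for doc in documents]


def retrieve(query, store, top_k=1):
    """Steps 3 and 4: encode the query, then return the top_k most similar documents."""
    query_vec = toy_embed(query)
    ranked = sorted(store, key=lambda item: toy_cosine(query_vec, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:top_k]]


def augment(query, retrieved_docs):
    """Step 5: build a prompt that places the retrieved context before the question."""
    context = "\n".join(retrieved_docs)
    return (
        f"Given the following context:\n{context}\n\n"
        f"What is the answer to the following question:\n{query}\n"
    )


query = "Am I allowed to use the company car on weekends?"
retrieved = retrieve(query, vector_store)
augmented_prompt = augment(query, retrieved)
print(augmented_prompt)
```

The resulting `augmented_prompt` is what would be sent to the LLM in place of the raw user query.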

<img src="https://anyscale-public-materials.s3.us-west-2.amazonaws.com/ray-summit/rag-app/RAG+App+-+Ray+Summit+-+with_rag_v2.png" alt="With RAG Highlights" width="800px"/>


### Key stages

- **Stage 1: Indexing**
  1. Loading the documents from a source like a website, API, or database.
  2. Processing the documents into "embeddable" document chunks.
  3. Encoding the document chunks into embedding vectors.
  4. Storing the document embedding vectors in a vector store.
- **Stage 2: Retrieval**
  1. Encoding the user query.
  2. Retrieving the most similar documents from the vector store given the encoded user query.
- **Stage 3: Generation**
  1. Augmenting the prompt with the retrieved context.
  2. Generating a response from the augmented prompt.

Stage 1 is a setup step that runs ahead of time; Stages 2 and 3 are operational and run for every user query.
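
As an illustration of the "processing into document chunks" step in Stage 1, here is a minimal sketch that splits text into overlapping word windows. The `chunk_words` helper and its default sizes are assumptions for illustration, not a recommended configuration; chunking strategies vary and are often based on tokens, sentences, or document structure instead.

```python
def chunk_words(text, chunk_size=50, overlap=10):
    """Split text into chunks of at most `chunk_size` words.

    Consecutive chunks share `overlap` words so that content near a chunk
    boundary appears in both neighboring chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


# A 12-word sample document: "w0 w1 ... w11"
sample = " ".join(f"w{i}" for i in range(12))
chunks = chunk_words(sample, chunk_size=5, overlap=2)
for chunk in chunks:
    print(chunk)
# w0 w1 w2 w3 w4
# w3 w4 w5 w6 w7
# w6 w7 w8 w9 w10
# w9 w10 w11
```

Each chunk would then be encoded and stored separately, so retrieval can return the specific passage that matches a query.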


### Next steps: Building a RAG-based QA engine for the Ray documentation

Next, we will build a RAG-based QA engine for the Ray documentation, as an attempt to recreate the "Ask AI" bot on the Ray [documentation website](https://docs.ray.io/en/latest/).