# Indexing Data for RAG using Ray Data

The first stage of RAG is to index the data. This can be done by creating embeddings for the data and storing them in a vector store. 

This notebook will walk you through the process of creating an embedding pipeline and then scaling it with Ray Data.

<div class="alert alert-block alert-info">
<b> Here is the roadmap for this notebook:</b>
<ul>
    <li><b>Part 0:</b> RAG overview recap</li>
    <li><b>Part 1:</b> Embeddings pipeline overview</li>
    <li><b>Part 2:</b> Simplest possible embedding pipeline</li>
    <li><b>Part 3:</b> Simple pipeline for a real use-case</li>
    <li><b>Part 4:</b> Migrating the simple pipeline to Ray Data</li>
    <li><b>Part 5:</b> Building a vector store</li>
    <li><b>Part 6:</b> Key takeaways</li>
</ul>
</div>

## Setup

### Imports

In [None]:
import os
import shutil
from pathlib import Path

import numpy as np
import pandas as pd
import chromadb

import joblib
import psutil
import ray
from cloudpathlib import CloudPath
from bs4 import BeautifulSoup
from langchain_text_splitters import RecursiveCharacterTextSplitter
from sentence_transformers import SentenceTransformer

### Constants

In [None]:
if os.environ.get("ANYSCALE_ARTIFACT_STORAGE"):
    DATA_DIR = Path("/mnt/cluster_storage/")
    shutil.copytree(Path("./data/"), DATA_DIR, dirs_exist_ok=True)
else:
    DATA_DIR = Path("./data/")

## RAG Overview Recap

As a recap, here are the three main phases of implementing RAG:

<img src="https://anyscale-public-materials.s3.us-west-2.amazonaws.com/ray-summit/rag-app/RAG+App+-+Ray+Summit+-+with_rag_v2.png" alt="With RAG Highlights" width="800px"/>


## Embeddings pipeline overview

What are the steps involved in generating embeddings? In the most common case for text data, the steps are as follows:

1. Load documents
2. Process documents into chunks
   1. Process documents into chunks
   2. Optionally persist chunks
3. Generate embeddings from chunks
   1. Generate embeddings from chunks
   2. Optionally persist embeddings
4. Upsert embeddings into a database
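
The four steps can be sketched end to end in plain Python. This is only a toy illustration: `toy_embed` is a hypothetical stand-in for a real embedding model (it hashes words into a small fixed-size vector), the document is hard-coded, and the "database" is just a dict keyed by chunk id.

```python
import hashlib

def load_documents():
    # Step 1: load documents (hard-coded here instead of reading a file).
    return [{"id": "doc-0", "text": "Ray Data scales batch inference. It streams blocks."}]

def chunk_document(doc, max_words=4):
    # Step 2: split each document into chunks of at most `max_words` words.
    words = doc["text"].split()
    return [
        {"id": f"{doc['id']}-{i // max_words}", "text": " ".join(words[i:i + max_words])}
        for i in range(0, len(words), max_words)
    ]

def toy_embed(text, dim=8):
    # Step 3: hypothetical stand-in for a model; hashes words into `dim` buckets.
    vector = [0.0] * dim
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode()).hexdigest(), 16) % dim
        vector[bucket] += 1.0
    return vector

def upsert(store, chunk, embedding):
    # Step 4: insert or overwrite by id, which is what "upsert" means.
    store[chunk["id"]] = {"text": chunk["text"], "embedding": embedding}

vector_store = {}
for doc in load_documents():
    for chunk in chunk_document(doc):
        upsert(vector_store, chunk, toy_embed(chunk["text"]))

print(len(vector_store), "chunks stored")
```

The rest of this notebook replaces each toy function with a real component while keeping the same overall shape.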

## Simple pipeline for a real use-case

Let's now assume we want to "embed the Ray documentation website". 

We will start with a small sample dataset taken from the Ray documentation and return to the full website later.

To visualize our pipeline, see the diagram below:

<img src="https://anyscale-public-materials.s3.us-west-2.amazonaws.com/ray-summit/rag-app/simple_embeddings_pipeline_v2.svg" width="800px">

### 1. Load documents

First step, we load the data using `pandas`.

In [None]:
df = pd.read_json(DATA_DIR / "small_sample" / "sample-input.jsonl", lines=True)

We have a dataset of 4 documents fetched from online content and stored as objects in a JSON Lines file.

Here are some of the notable columns:
- `text` column which contains the text of the document that we want to embed.
- `section_url` column which contains the section under which the document is found.
- `page_url` column which contains the page under which the document is found.

In [None]:
df

<div class="alert alert-block alert-warning">

**Considerations for scaling the pipeline:**
- Memory: We currently load the entire file into memory. This is not a problem for small files, but can be a problem for large files.
- Latency: Reading the file from disk is slow. We can speed this up by using a faster disk, but we can also speed this up by parallelizing the read.

</div>
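
One way to address the memory concern without any new dependency is to stream the JSON Lines file in fixed-size batches rather than loading it whole. The sketch below uses only the standard library; `iter_jsonl_batches` is a hypothetical helper, and the temporary file stands in for `sample-input.jsonl`.

```python
import json
import tempfile
from pathlib import Path

def iter_jsonl_batches(path, batch_size):
    """Yield lists of at most `batch_size` parsed records, one batch at a time."""
    batch = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            batch.append(json.loads(line))
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:
        yield batch  # final, possibly smaller, batch

# Write a tiny stand-in file with 5 records.
tmp_dir = Path(tempfile.mkdtemp())
sample_path = tmp_dir / "sample.jsonl"
sample_path.write_text(
    "\n".join(json.dumps({"text": f"document {i}"}) for i in range(5)),
    encoding="utf-8",
)

batch_sizes = [len(batch) for batch in iter_jsonl_batches(sample_path, batch_size=2)]
print(batch_sizes)
```

Only one batch is held in memory at a time. This addresses the memory point but not the latency point, since the read is still sequential.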

### 2. Process documents into chunks

We will use langchain's `RecursiveCharacterTextSplitter` to split the text into chunks. 

It works by first splitting on paragraphs, then lines, then words, then characters. The algorithm is recursive: a piece is split with the next separator only if it still exceeds the chunk size.

Let's try it out on a sample document.

In [None]:
text = """
This is the first part. Estimate me like 12 words long.

This is the second part. Estimate me like 12 words long.

This is the third part. Estimate me like 12 words long.
"""

splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", " ", ""],  # The default separators used by the splitter
    chunk_size=24,
    chunk_overlap=0,
    length_function=lambda x: len(x.split(" ")),
)
splitter.split_text(text)

If we change the paragraph breaks, the chunk contents change:

In [None]:
text = """
This is the first part. Estimate me like 12 words long.

This is the second part. Estimate me like 12 words long.
This is the third part. Estimate me like 12 words long.
"""

splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", " ", ""],  # The default separators used by the splitter
    chunk_size=24,
    chunk_overlap=0,
    length_function=lambda x: len(x.split(" ")),
)
splitter.split_text(text)
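
To make the recursion concrete, here is a simplified re-implementation of the idea in plain Python. It is a sketch, not langchain's actual algorithm: `recursive_split` is a hypothetical function that ignores chunk overlap and separator retention, omits the final character-level (`""`) separator, and measures length in whitespace-separated words.

```python
def num_words(text):
    return len(text.split())

def recursive_split(text, separators, chunk_size):
    """Split `text` into chunks of at most `chunk_size` words.

    Tries the first separator; pieces that are still too long are split
    again with the remaining separators. Small pieces are merged greedily.
    """
    if num_words(text) <= chunk_size or not separators:
        return [text.strip()] if text.strip() else []

    separator, remaining = separators[0], separators[1:]
    pieces = [piece for piece in text.split(separator) if piece.strip()]

    chunks, current = [], ""
    for piece in pieces:
        if num_words(piece) > chunk_size:
            # Flush what we have, then recurse on the oversized piece.
            if current:
                chunks.append(current.strip())
                current = ""
            chunks.extend(recursive_split(piece, remaining, chunk_size))
            continue
        candidate = current + separator + piece if current else piece
        if num_words(candidate) <= chunk_size:
            current = candidate
        else:
            chunks.append(current.strip())
            current = piece
    if current:
        chunks.append(current.strip())
    return chunks

demo_text = "alpha beta gamma\n\ndelta epsilon\n\nzeta eta theta iota kappa lambda"
demo_chunks = recursive_split(demo_text, separators=["\n\n", "\n", " "], chunk_size=5)
print(demo_chunks)
```

The first two paragraphs fit together within the limit and are merged into one chunk, while the third paragraph is too long and falls through to word-level splitting.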

We now proceed to:

1. Configure the `RecursiveCharacterTextSplitter`
2. Run it over all the documents in the dataset

In [None]:
chunk_size = 128  # Chunk size is usually specified in tokens
words_to_tokens = 1.2  # Heuristic: roughly 1.2 tokens per word
chunk_size_in_words = int(chunk_size // words_to_tokens)


splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size_in_words,
    length_function=lambda x: len(x.split()),
    chunk_overlap=0,
)

chunks = []
for idx, row in df.iterrows():
    for chunk in splitter.split_text(row["text"]):
        chunks.append(
            {
                "text": chunk,
                "section_url": row["section_url"],
                "page_url": row["page_url"],
            }
        )

<div class="alert alert-block alert-secondary">

**Considerations for choosing the chunk size**

  - We want the chunks small enough to:
    - Fit into the context window of our chosen embedding model
    - Be semantically coherent - i.e. concentrate on ideally a single topic
  - We want the chunks large enough to:
    - Contain enough information to be semantically meaningful.
    - Avoid creating too many embeddings which can be expensive to store and query.

</div>
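
The trade-off can be made concrete with some rough arithmetic. The sketch below converts a token budget into a word budget using the same 1.2 tokens-per-word heuristic as above, then estimates how many chunks and how much raw embedding storage a corpus would need. The corpus size and the 1024-dimension float32 embedding are illustrative assumptions, not measurements from this dataset.

```python
import math

def estimate_index_size(total_words, chunk_size_tokens, tokens_per_word=1.2,
                        embedding_dim=1024, bytes_per_value=4):
    """Rough estimate of chunk count and raw embedding storage in MiB."""
    words_per_chunk = int(chunk_size_tokens // tokens_per_word)
    num_chunks = math.ceil(total_words / words_per_chunk)
    storage_mib = num_chunks * embedding_dim * bytes_per_value / (1024 ** 2)
    return words_per_chunk, num_chunks, storage_mib

# Assumed corpus of 1,000,000 words (illustrative only).
for chunk_size_tokens in (128, 512):
    words_per_chunk, num_chunks, storage_mib = estimate_index_size(
        total_words=1_000_000, chunk_size_tokens=chunk_size_tokens
    )
    print(chunk_size_tokens, words_per_chunk, num_chunks, round(storage_mib, 1))
```

Under these assumptions, quadrupling the chunk size cuts the number of embeddings to roughly a quarter, at the cost of each embedding covering more text.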

Let's inspect the chunks produced for the first document.

In [None]:
first_document = df["text"].iloc[0]
print("first document is", len(first_document.split()), "words")

In [None]:
for k, v in chunks[0].items():
    if k == "text":
        print("first chunk of first document is", len(v.split()), "words")
    else:
        print(k, v)

In [None]:
for k, v in chunks[1].items():
    if k == "text":
        print("second chunk of first document is", len(v.split()), "words")
    else:
        print(k, v)

### 3. Generate embeddings from chunks

For our third step, we want to load a good embedding model. 

**Suggested steps to choosing an embedding model:**
1. Visit the [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard) on HuggingFace.
2. Find a model that satisfies the following considerations:
  - Does the model perform well overall and in the task you are interested in?
  - Is the model closed-source or open-source?
    - If it is closed-source:
      - What are the costs, security, and privacy implications?
    - If it is open-source:
      - What are its resource requirements if you want to self-host it?
      - Is it readily available as a service by third-party providers like Anyscale, Fireworks, or Together AI?

We will use the `thenlper/gte-large` model from the [HuggingFace Model Hub](https://huggingface.co/thenlper/gte-large) because it is open-source, is available as a service from Anyscale, and performs relatively well on the MTEB leaderboard.

<div class="alert alert-block alert-warning">

Note: be wary of models that overfit to the MTEB leaderboard. It is important to test the model on your own data.

</div>

In [None]:
svmem = psutil.virtual_memory()

# memory used in GB
memory_used = svmem.total - svmem.available
memory_used_gb_before_model_load = memory_used / (1024**3)
memory_used_gb_before_model_load

In [None]:
%%time
model = SentenceTransformer('thenlper/gte-large', device='cpu')

In [None]:
svmem = psutil.virtual_memory()
memory_used = svmem.total - svmem.available
memory_used_gb_after_model_load = memory_used / (1024**3)
memory_used_gb_after_model_load

In [None]:
model_memory_usage = memory_used_gb_after_model_load - memory_used_gb_before_model_load
model_memory_usage

Loading the embedding model took around 1 GB of memory.

Let's see how slow it is to generate an embedding.

In [None]:
%%time

embeddings = model.encode([chunk["text"] for chunk in chunks])

In [None]:
len(chunks)

It takes on the order of a few seconds to embed 8 chunks on our CPU. We will most definitely need a GPU to speed things up.

#### Save embeddings to disk

Next, we store our generated embeddings as a parquet file.

In [None]:
df_output = pd.DataFrame(chunks)

In [None]:
df_output["embeddings"] = embeddings.tolist()

In [None]:
df_output

In [None]:
df_output.to_parquet(DATA_DIR / "sample-output-pandas.parquet")

### 4. Upsert embeddings to vector store

The final step is to upsert the embeddings into a database. We will skip this step for now.

## Migrating the simple pipeline to Ray Data

We now want to migrate our implementation to use Ray Data to drastically scale our pipeline for larger datasets.

### 1. Load documents

Let's start with a first pass conversion of our data pipeline to use Ray Data. 

Instead of `pandas.read_json`, use `ray.data.read_json` to instantiate a `ray.data.Dataset` that will eventually read our file.

In [None]:
ds_sample_input = ray.data.read_json(DATA_DIR / "small_sample" / "sample-input.jsonl")
type(ds_sample_input)

`ray.data.read_json` returns a `ray.data.Dataset` which is a distributed collection of data. Execution in Ray Data by default is:
- **Lazy**: `Dataset` transformations aren’t executed until you call a consumption operation.
- **Streaming**: `Dataset` transformations are executed in a streaming way, incrementally on the base data, one block at a time.

Accordingly `ray.data.Dataset` will only fetch back some high-level metadata and schema information about the file, but not the actual data.

In [None]:
ds_sample_input
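
Python generators give a rough mental model for this behaviour. The analogy below is not how Ray Data is implemented; it only shows that building a pipeline does no work (lazy) and that consuming it pulls items through one at a time (streaming). The function names are hypothetical.

```python
events = []

def read_rows():
    for i in range(3):
        events.append(f"read {i}")
        yield {"value": i}

def double(rows):
    for row in rows:
        events.append(f"transform {row['value']}")
        yield {"value": row["value"] * 2}

# Building the pipeline executes nothing yet.
pipeline = double(read_rows())
events_before_consumption = list(events)

# Consuming the pipeline triggers execution, interleaving read and transform.
results = [row["value"] for row in pipeline]

print(events_before_consumption)
print(events)
print(results)
```

The event log stays empty until the pipeline is consumed, and reads and transforms then alternate instead of running as two separate full passes.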

### Under the hood

Ray Data uses Ray tasks to read files in parallel. Each read task reads one or more files and produces one or more output blocks.

<img src="https://anyscale-public-materials.s3.us-west-2.amazonaws.com/ray-summit/rag-app/dataset-read-cropped-v2.svg" width="500px">

### 2. Process documents into chunks

Given a `ray.data.Dataset`, we can apply transformations to it. There are two types of transformations:
1. **row-wise transformations**
  - `map`: a 1-to-1 function that is applied to each row in the dataset.
  - `filter`: a 1-to-1 function that is applied to each row in the dataset and filters out rows that don’t satisfy the condition.
  - `flat_map`: a 1-to-many function that is applied to each row in the dataset and then flattens the results into a single dataset.
2. **batch-wise transformations**
  - `map_batches`: a many-to-many function that is applied to each batch in the dataset.
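
The semantics of these four operations can be illustrated on a plain list of rows. These helper functions are hypothetical, single-process analogues written for this explanation; Ray Data's versions run as distributed tasks and, for `map_batches`, pass batches in a columnar format rather than as lists of rows.

```python
rows = [{"text": "a b"}, {"text": "c"}, {"text": "d e f"}]

def map_rows(rows, fn):
    # 1-to-1: one output row per input row.
    return [fn(row) for row in rows]

def filter_rows(rows, predicate):
    # Keeps only the rows for which the predicate is true.
    return [row for row in rows if predicate(row)]

def flat_map_rows(rows, fn):
    # 1-to-many: fn returns a list of rows, and the lists are flattened.
    return [out for row in rows for out in fn(row)]

def map_batches_rows(rows, fn, batch_size):
    # Batch-wise: fn receives a list of rows and returns a list of rows.
    out = []
    for start in range(0, len(rows), batch_size):
        out.extend(fn(rows[start:start + batch_size]))
    return out

with_length = map_rows(rows, lambda r: {**r, "n_words": len(r["text"].split())})
multi_word = filter_rows(with_length, lambda r: r["n_words"] > 1)
words = flat_map_rows(rows, lambda r: [{"word": w} for w in r["text"].split()])
batch_lengths = map_batches_rows(rows, lambda batch: [{"batch_len": len(batch)}], batch_size=2)

print(len(with_length), len(multi_word), len(words), batch_lengths)
```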


We chose to make use of `flat_map` to generate a list of chunk rows. `flat_map` will create `FlatMap` tasks which will be scheduled in parallel to process as many rows as possible at once.

In [None]:
def chunk_row(row):
    chunk_size = 128  # Target chunk size in tokens
    words_to_tokens = 1.2  # Heuristic: roughly 1.2 tokens per word
    chunk_size_in_words = int(chunk_size // words_to_tokens)

    def get_num_words(text):
        return len(text.split())

    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size_in_words,
        keep_separator=True,
        length_function=get_num_words,
        chunk_overlap=0,
    )

    chunks = []
    for chunk in splitter.split_text(row["text"]):
        chunks.append(
            {
                "text": chunk,
                "section_url": row["section_url"],
                "page_url": row["page_url"],
            }
        )
    return chunks

ds_sample_input_chunked = ds_sample_input.flat_map(chunk_row)

To verify our `flat_map` is working, we can consume a limited number of rows from the dataset.

To do so, we can either call:
- `take` to return up to a given number of rows from the dataset.
- `take_batch` to return up to a given number of rows as a single batch.

Here we call `take(2)` to return 2 rows.

In [None]:
ds_sample_input_chunked.take(2)

### 3. Generate embeddings from chunks

For our third step, we apply the embeddings using `map_batches`, which will be implemented using `MapBatches` tasks scheduled in parallel.

In [None]:
def embed_batch(batch):
    assert isinstance(batch, dict)
    for key in batch.keys():
        assert key in ["text", "section_url", "page_url"]
    for val in batch.values():
        assert isinstance(val, np.ndarray), type(val)

    model = SentenceTransformer('thenlper/gte-large')
    text = batch["text"].tolist()
    embeddings = model.encode(text, batch_size=len(text))
    batch["embeddings"] = embeddings.tolist()
    return batch

ds_sample_input_embedded = ds_sample_input_chunked.map_batches(embed_batch)

#### Save embeddings to disk

Next, we write our dataset to parquet using `write_parquet`.

In [None]:
%%time

output_path = DATA_DIR / "small_sample" / "sample-output"
if output_path.exists():
    shutil.rmtree(output_path)

ds_sample_input_embedded.write_parquet(output_path)

We inspect the created parquet output directory. Every write task will create a separate file in the output directory.

In [None]:
!ls -llah {output_path} 

We can read the parquet files back into a pandas DataFrame.

In [None]:
df = ray.data.read_parquet(DATA_DIR / "small_sample" / "sample-output").to_pandas()
df

### 4. Upsert embeddings to vector store

The final step is to upsert the embeddings into a database. We will skip this step for now.

**Recap**

Here is our entire pipeline:

```python
(
    ray.data.read_json(DATA_DIR / "small_sample" / "sample-input.jsonl")
    .flat_map(chunk_row)
    .map_batches(embed_batch)
    .write_parquet(DATA_DIR / "small_sample" / "sample-output")
)
```

<div class="alert alert-block alert-info">

### Activity: Implement the pipeline using a different embedding model

Re-implement the entire data pipeline but this time use a different embedding model `BAAI/bge-large-en-v1.5` which outperforms `thenlper/gte-large` on certain parts of the MTEB leaderboard.

NOTE: make sure to output the results to a different directory.

```python
# Hint: Use the code in the recap section as a template but update the embedding transformation.
```


</div>

In [None]:
# Write your solution here


<div class="alert alert-block alert-info">

<details> 

<summary>Click here to see the solution </summary>

```python
def embed_batch(batch):
    # Load the embedding model
    model = SentenceTransformer("BAAI/bge-large-en-v1.5")
    text = batch["text"].tolist()
    embeddings = model.encode(text, batch_size=len(text))
    batch["embeddings"] = embeddings.tolist()
    return batch

(
    ray.data.read_json(DATA_DIR / "small_sample" / "sample-input.jsonl")
    .flat_map(chunk_row)
    .map_batches(embed_batch)
    .write_parquet(DATA_DIR / "small_sample" / "sample-output-bge")
)

# inspect output
ray.data.read_parquet(DATA_DIR / "small_sample" / "sample-output-bge").to_pandas()
```

</details>

</div>

## Scaling the pipeline with Ray Data

Let's explore how to scale our pipeline to a larger dataset using Ray Data.

<img src="https://anyscale-public-materials.s3.us-west-2.amazonaws.com/ray-summit/rag-app/full_scale_embeddings_pipeline_v2.svg" width="1000px">



### Phase 1: Preparing input files

First, we need to prepare our documents by performing the following steps
1. Fetch all the Ray documentation from the web.
2. Parse the web pages to extract the text.

#### 1. Fetch all the Ray documentation from the web.

We have already fetched the Ray documentation and stored it on S3.

In [None]:
raw_web_pages_dir = CloudPath(
    "s3://anyscale-public-materials/ray-documentation-html-files/unzipped/"
)

In [None]:
raw_web_pages_dir.exists(), raw_web_pages_dir.is_dir()

#### 2. Parse the web pages to extract the text.

We first read all HTML files in the raw web pages directory into a `ray.data.Dataset`.

In [None]:
ds_web_page_paths = ray.data.from_items(
    [{"path": path} for path in raw_web_pages_dir.glob("**/*.html")]
)
ds_web_page_paths

Note that this only includes the latest version of the Ray documentation. The dataset would be several times larger if we included all versions of the documentation.

##### Utilize inherent structure to improve the documents 

Documentation [webpages](https://docs.ray.io/en/latest/rllib/rllib-env.html) are naturally split into sections. We can use this to our advantage by returning our documents as sections. This will facilitate producing semantically coherent chunks. 

<img src="https://anyscale-public-materials.s3.us-west-2.amazonaws.com/ray-summit/rag-app/ray_docs_section_extraction_v2.png" >


Because each HTML file yields multiple documents (one per section), we use the `flat_map` method.

In [None]:
def path_to_uri(
    path: CloudPath, scheme: str = "https://", domain: str = "docs.ray.io"
) -> str:
    return scheme + domain + str(path).split(domain)[-1]

def extract_sections_from_html(record: dict) -> list[dict]:
    documents = []
    # 1. Open the stored HTML file and parse it using BeautifulSoup
    with record["path"].open("r", encoding="utf-8", force_overwrite_from_cloud=True) as html_file:
        soup = BeautifulSoup(html_file, "html.parser")

    page_url = path_to_uri(record["path"])

    # 2. Find all sections
    sections = soup.find_all("section")
    for section in sections:
        # 3. Extract text from the section but not from the subsections
        section_text = "\n".join(
            [child.text for child in section.children if child.name != "section"]
        )
        # 4. Construct the section url
        section_url = page_url + "#" + section["id"]
        # 5. Create a document object with the text, source page, source section uri
        documents.append(
            {
                "text": section_text,
                "section_url": section_url,
                "page_url": page_url,
            }
        )
    return documents


ds_sections = ds_web_page_paths.flat_map(extract_sections_from_html)
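
As a quick check of the URL reconstruction, here is `path_to_uri` applied to a plain string. The example path is hypothetical: it assumes the stored files mirror the site layout under a `docs.ray.io` directory, which is what the `split(domain)` call relies on. The section id is also made up.

```python
def path_to_uri(path, scheme="https://", domain="docs.ray.io"):
    # Keep everything after the domain directory and prepend scheme + domain.
    return scheme + domain + str(path).split(domain)[-1]

# Hypothetical object path, used only for illustration.
example_path = "s3://example-bucket/docs/docs.ray.io/en/master/ray-core/tasks.html"
example_page_url = path_to_uri(example_path)
example_section_url = example_page_url + "#" + "tasks"  # "tasks" is a made-up section id

print(example_page_url)
print(example_section_url)
```

If a path does not contain the domain at all, `split(domain)[-1]` returns the whole path, so the result would be malformed; the function relies on the storage layout described above.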

Finally we store the produced dataset in parquet format.

In [None]:
%%time
if (DATA_DIR / "full_scale" / "02_sections").exists():
    shutil.rmtree(DATA_DIR / "full_scale" / "02_sections")
ds_sections.write_parquet(DATA_DIR / "full_scale" / "02_sections")

In [None]:
!ls -llh {DATA_DIR / "full_scale" / "02_sections"}

Let's count how many documents we will have after processing the sections.

In [None]:
ray.data.read_parquet(DATA_DIR / "full_scale" / "02_sections").count()

<div class="alert alert-block alert-warning">

**Considerations for reading input files into Ray Data:**

Pruning columns and using filter pushdown can optimize parquet file reads:
- Specify only necessary columns when dealing with column-oriented formats to reduce memory usage.
- Apply filter pushdown in `ray.data.read_parquet` to retrieve only rows that meet certain conditions.

However, as our dataset's memory footprint is predominantly due to the 'text' column, these optimizations will have a limited impact on reducing memory load.

</div>



### Phase 2: Generating Embeddings

Now that we have our documents, we can proceed to generate embeddings.

#### 1. Load documents
We begin by reading the documents from the "02_sections" directory.

In [None]:
ds_sections = ray.data.read_parquet(DATA_DIR / "full_scale" / "02_sections")

ds_sections

#### Applying chunking as a transformation

We apply our chunking transformation using `flat_map`, which applies a 1-to-many function to each row in the dataset and then flattens the results into a single dataset.

In [None]:
ds_sections_chunked = ds_sections.flat_map(chunk_row)

We could have used `map_batches` instead to apply a many-to-many function to each batch of rows in the dataset. However, given our chunking transformation is not vectorized, `map_batches` will not be faster.

Let's run the chunking and count our total number of chunks.

In [None]:
ds_sections_chunked.count()

#### Applying embedding as a transformation

We want to load the embedding model once and reuse it across multiple transformation tasks.

To do so, we call `map_batches` with a **stateful transform** instead of a *stateless transform*.

This means we create a pool of processes called actors where the model is already loaded in memory.

Each actor will run a `MapBatches` process where:
  - initial state is handled in `__init__`
  - task is invoked using `__call__` method
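
The difference can be sketched without Ray or a real model. Below, `load_model` is a hypothetical stand-in for an expensive model load; the counter shows that a function-based transform reloads it on every batch, while a class-based transform loads it once per instance and reuses it across calls.

```python
load_count = 0

def load_model():
    # Hypothetical stand-in for an expensive load such as SentenceTransformer(...).
    global load_count
    load_count += 1
    return lambda texts: [[float(len(t))] for t in texts]  # fake 1-d "embedding"

def embed_batch_stateless(batch):
    model = load_model()  # reloaded for every batch
    return {**batch, "embeddings": model(batch["text"])}

class EmbedBatchStateful:
    def __init__(self):
        self.model = load_model()  # loaded once, when the instance is created

    def __call__(self, batch):
        return {**batch, "embeddings": self.model(batch["text"])}

batches = [{"text": ["a", "bb"]}, {"text": ["ccc"]}, {"text": ["dddd", "e"]}]

for batch in batches:
    embed_batch_stateless(batch)
stateless_loads = load_count

load_count = 0
embedder = EmbedBatchStateful()
outputs = [embedder(batch) for batch in batches]
stateful_loads = load_count

print(stateless_loads, stateful_loads)
```

With a pool of actors, the load cost is paid once per actor rather than once per batch, which matters when loading takes seconds and there are many batches.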

In [None]:
num_workers = 2
device = "cuda"

class EmbedBatch:
    def __init__(self):
        self.model = SentenceTransformer("thenlper/gte-large", device=device)

    def __call__(self, batch):
        text = batch["text"].tolist()
        embeddings = self.model.encode(text, batch_size=len(text))
        batch["embeddings"] = embeddings.tolist()
        return batch

ds_sections_embedded = ds_sections_chunked.map_batches(
    EmbedBatch,
    # Number of actors to launch.
    concurrency=num_workers,
    # Size of batch passed to embeddings actor.
    batch_size=200,
    # 1 GPU for each actor.
    num_gpus=1,
)

#### Writing the embeddings to disk

Writing the embeddings to disk is a consumption operation, so it triggers execution of the data pipeline, which streams the data to the GPU nodes for embedding generation.

In [None]:
%%time

if (DATA_DIR / "full_scale" / "03_embeddings").exists():
    shutil.rmtree(DATA_DIR / "full_scale" / "03_embeddings")
(
    ds_sections_embedded.write_parquet(path=DATA_DIR / "full_scale" / "03_embeddings")
)

##### Inspecting the Ray Data dashboard

If we take a look at the metrics tab of the Ray Data dashboard, we can check:

- The GPU utilization
    - Ideally, we would like to see the GPU utilization at 100% for the duration of the embedding process
- The GPU memory (GRAM) percentage
    - We would like to see the GPU memory utilization at 100% for the duration of the embedding process
- The time spent on I/O and network by different tasks

We can then use this information to optimize our pipeline.

##### Inspecting the output

We check to see if the embeddings were written to disk.

In [None]:
!ls -llh {DATA_DIR / "full_scale" / "03_embeddings"}

### Recap of the pipeline

Here is our entire pipeline so far:

```python
(
    ray.data.read_parquet(
        DATA_DIR / "full_scale" / "02_sections",
    )
    .flat_map(chunk_row)
    .map_batches(
        EmbedBatch,
        concurrency=num_workers,
        batch_size=200,
        num_gpus=1,
    )
    .write_parquet(
        path=DATA_DIR / "full_scale" / "03_embeddings_tuning",
    )
)
```


<div class="alert alert-block alert-info">

### Activity: Tuning the pipeline

Proceed to tune your pipeline by changing the batch size on `map_batches` and see what effect it has on the GPU memory (GRAM) percentage.

```python
ds_sections_embedded = ds_sections_chunked.map_batches(
    EmbedBatch,
    concurrency=num_workers,
    batch_size=200,  # Hint: Check how GRAM changes when you change the batch size
    num_gpus=1,
)

ds_sections_embedded.materialize()
```

</div>

In [None]:
# Write your solution here


<div class="alert alert-block alert-info">

<details>

<summary>Click here to see the solution</summary>

```python
ds_sections_embedded = ds_sections_chunked.map_batches(
    EmbedBatch,
    concurrency=num_workers,
    batch_size=350,  # Optimal batch size for GRAM
    num_gpus=1,
)

ds_sections_embedded.materialize()
```

</details>

</div>

### Upserting embeddings to a vector database

We will use [Chroma](https://www.trychroma.com/) to index our document embeddings in a vector store. Chroma is an open-source vector database optimized for similarity search. We chose it because it is easy to use and, being open source, free to run locally, which meets our needs.

#### 1. Create a chroma client 

We create a Chroma client using the `PersistentClient` class, which runs Chroma locally and persists its data to the given directory.

In [None]:
chroma_client = chromadb.PersistentClient(path="/mnt/cluster_storage/vector_store")
chroma_client

In [None]:
chroma_client.list_collections()

#### 2. Create a chroma collection

Next, we create a collection in chroma to store our embeddings. A collection provides a vector store index for our embeddings.

We set `hnsw:space` to `cosine` so that the collection's HNSW ("Hierarchical Navigable Small World") index uses cosine distance for similarity search.

In [None]:
collection = chroma_client.get_or_create_collection(name="ray-docs", metadata={"hnsw:space": "cosine"})
collection

#### 3. Load the embeddings from disk 

We will load the embeddings from disk using `ray.data.read_parquet` to initiate a distributed upsert of the embeddings to chroma.

In [None]:
ds_embeddings = ray.data.read_parquet(DATA_DIR / "full_scale" / "03_embeddings/")
ds_embeddings

#### 4. Transform the embeddings into chroma index format 

We construct an `id` column to uniquely identify each embedding.

In [None]:
def compute_id(row):
    row_hash = joblib.hash(row)
    page_name = row["page_url"].split("/")[-1]
    section_name = row["section_url"].split("#")[-1]
    row["id"] = f"{page_name}#{section_name}#{row_hash}"
    return row

ds_embeddings_with_id = ds_embeddings.map(compute_id)
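
One benefit of deriving the id from the row content is that re-running the pipeline produces the same ids, so an upsert overwrites existing entries instead of duplicating them. A minimal sketch of that property, using `hashlib` rather than `joblib.hash` and a plain dict as a hypothetical stand-in for the collection:

```python
import hashlib

def content_id(row):
    # Stable id: same page, section and text always give the same string.
    digest = hashlib.sha256(row["text"].encode("utf-8")).hexdigest()[:12]
    page_name = row["page_url"].split("/")[-1]
    section_name = row["section_url"].split("#")[-1]
    return f"{page_name}#{section_name}#{digest}"

# Illustrative row; the section id is made up.
row = {
    "text": "Ray Data reads files in parallel.",
    "page_url": "https://docs.ray.io/en/latest/data/data.html",
    "section_url": "https://docs.ray.io/en/latest/data/data.html#ray-data",
}

store = {}  # hypothetical stand-in for a vector store collection
for _ in range(2):  # simulate running the upsert twice
    store[content_id(row)] = row["text"]

print(content_id(row).rsplit("#", 1)[0], len(store))
```

Note that this sketch hashes only the text, whereas the cell above hashes the whole row, including the embedding.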

We fetch back the data as a collection of objects and then upsert them into chroma.

In [None]:
chroma_data = ds_embeddings_with_id.to_pandas().drop_duplicates(subset=["id"]).to_dict(orient="list")

Here is how to upsert documents into a collection in chroma:

In [None]:
collection.upsert(
    ids=chroma_data["id"],
    embeddings=[arr.tolist() for arr in chroma_data["embeddings"]],
    documents=chroma_data["text"],
    metadatas=[
        {
            "section_url": section_url,
            "page_url": page_url,
        }
        for section_url, page_url in zip(chroma_data["section_url"], chroma_data["page_url"])
    ],
)

<div class="alert alert-block alert-warning">

Note we can further parallelize the upsert using a `map_batches` operation. This is left as an exercise for the reader.

</div>

### Querying the chroma collection

Given we have indexed our embeddings, we can now query the index to retrieve the most similar documents to a given query.

In [None]:
query = "What is the default number of maximum replicas for a Ray Serve deployment?"

In [None]:
model = SentenceTransformer('thenlper/gte-large')
query_embedding = model.encode(query).tolist()

In [None]:
result = collection.query(
    query_embeddings=[query_embedding],
    n_results=5,
)

Here is the most relevant text we found:

In [None]:
print(result["documents"][0][0])

It was fetched from this page of the documentation:

In [None]:
result["metadatas"][0][0]["page_url"]

We can additionally retrieve the similarity score in case we want to only retrieve results with a score above a certain threshold.

In [None]:
scores = [1 - distance for distance in result["distances"][0]]
scores
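
For reference, with the `cosine` space the distance Chroma reports is one minus the cosine similarity, which is why the cell above computes `1 - distance`. A small pure-Python sketch of that relationship and of filtering by a threshold follows; the vectors and the 0.8 cutoff are arbitrary illustrations.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    return 1.0 - cosine_similarity(a, b)

query_vector = [1.0, 0.0]
candidates = {"same direction": [2.0, 0.0], "diagonal": [1.0, 1.0], "orthogonal": [0.0, 1.0]}

distances = {name: cosine_distance(query_vector, vec) for name, vec in candidates.items()}
similarity_scores = {name: 1.0 - dist for name, dist in distances.items()}

threshold = 0.8  # arbitrary cutoff for illustration
kept = [name for name, score in similarity_scores.items() if score >= threshold]

print(similarity_scores)
print(kept)
```

Because cosine similarity ignores vector length, the candidate pointing in the same direction scores highest even though its magnitude differs from the query's.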

## Key Takeaways

With Ray and Anyscale we are able to achieve very fast and efficient embeddings generation at scale. See this [blog](https://www.anyscale.com/blog/rag-at-scale-10x-cheaper-embedding-computations-with-anyscale-and-pinecone) showcasing how we were able to achieve 10x cheaper embeddings generation of billions of documents using Ray and Pinecone.

Ray Data's lazy and streaming execution model allows us to:
- Efficiently scale our pipeline to large datasets
- Avoid having to fully materialize the dataset in a store (memory/disk)
- Easily saturate GPUs by scaling preprocessing across CPU nodes
  
Anyscale provides:
- Access to spot instances with fallback to on-demand to run the pipeline in the most cost-efficient manner
- Incremental metadata fetching of very large parquet datasets avoiding long "boot times" and idling instances
