In [None]:
import uuid
from pathlib import Path
import chromadb
import numpy as np
import ray
from sentence_transformers import SentenceTransformer

# Generating, storing, and retrieving embeddings with Ray Data

In [None]:
EMBEDDING_MODEL = "hkunlp/instructor-large"
model = SentenceTransformer(EMBEDDING_MODEL)

In [None]:
items = ["What are some top attractions in Seattle?", "What are some top attractions in Los Angeles?"]

In [None]:
vectors = model.encode(items)

vectors.shape

First, we copy the data to shared storage so that every node in the cluster can read it.

In [None]:
! cp around.txt /mnt/cluster_storage/

In [None]:
paras_ds = ray.data.read_text("/mnt/cluster_storage/around.txt")

In [None]:
paras_ds.count()

In [None]:
paras_ds.take_batch(4)

To generate our embeddings, we'll use two steps:

1. Create a class that performs the embedding operation
    1. We use a class because we want to hold on to a large, valuable piece of state: the embedding model itself
    1. For use with our vector databases, we need a unique ID to go with each document and embedding, so we'll generate UUIDs
    1. The output from the `__call__` method will be similar to the input: a dict with column names as keys and vectorized types (NumPy arrays) as values
1. Call `dataset.map_batches(...)`, where we connect the dataset to the processing class and specify resources such as the number of class instances (actors) and GPUs
    1. Specify an autoscaling actor pool to demo how Ray can autoscale to handle large, uneven workloads
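
Before wiring in the real model, it can help to see the batch contract from step 1 in isolation. The sketch below is not the notebook's embedder: `FakeEmbedder` and its `dim` parameter are hypothetical stand-ins that produce dummy vectors, so the example runs without Ray, a GPU, or a downloaded model. It only illustrates the dict-of-arrays shape that goes in and comes out of `__call__`.

```python
import uuid

import numpy as np


class FakeEmbedder:
    """Hypothetical stand-in for an embedding class: same batch contract, no model."""

    def __init__(self, dim: int = 8):
        # Stand-in for expensive state (such as a loaded model) created once per instance.
        self._dim = dim

    def __call__(self, batch: dict[str, np.ndarray]) -> dict[str, np.ndarray]:
        texts = batch["text"]
        # Dummy "embedding": a constant vector whose value is the text length.
        vecs = np.array(
            [np.full(self._dim, len(t), dtype=np.float32) for t in texts]
        )
        # One unique ID per input row.
        ids = np.array([uuid.uuid4().hex for _ in texts])
        return {"doc": texts, "vec": vecs, "id": ids}


batch = {"text": np.array(["first paragraph", "second one"])}
out = FakeEmbedder()(batch)
print({k: v.shape for k, v in out.items()})
# {'doc': (2,), 'vec': (2, 8), 'id': (2,)}
```

Every output column has the same leading dimension as the input batch, which is what lets the result be treated as a new batch of rows.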

In [None]:
class DocEmbedder:
    def __init__(self):
        # Load the model once per actor so it is reused across batches
        self._model = SentenceTransformer("hkunlp/instructor-large")

    def __call__(self, batch: dict[str, np.ndarray]) -> dict[str, np.ndarray]:
        inputs = batch['text']
        # 'cuda:0' is the first GPU visible to this actor
        embeddings = self._model.encode(inputs, device='cuda:0')
        # One unique ID per document, used later as the vector database key
        ids = np.array([uuid.uuid1().hex for _ in inputs])
        return { 'doc' : inputs, 'vec' : embeddings, 'id' : ids }

In [None]:
vecs = paras_ds.map_batches(
    DocEmbedder,
    compute=ray.data.ActorPoolStrategy(min_size=2, max_size=8),
    num_gpus=0.125,
    batch_size=64,
)

In [None]:
sample_batch = vecs.take_batch(4)

sample_batch

### Vector storage example: ChromaDB

Ray focuses on compute and is orthogonal to data storage, so many data stores can be used.

For a simple example, we'll use ChromaDB, starting with a minimal in-memory demo so that we can see the programming pattern.

In [None]:
chroma_client = chromadb.Client()

collection = chroma_client.get_or_create_collection(name="my_text_chunks")

Insert the vectors, documents, and IDs

> Note that Chroma also accepts an arbitrary metadata dictionary for each document, which you can use in your queries (along with semantic similarity) and see in results. Metadata makes it easy to add powerful features like "search only in chapter 3" or "cite source URLs for data returned".
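
To make the metadata idea concrete, here is a pure-Python sketch of what a metadata-filtered similarity search computes conceptually. It is not Chroma's implementation and does not call Chroma at all: the `records` list, the `filtered_search` helper, and the example URLs are all hypothetical. In Chroma itself, the equivalent would be passing `metadatas=` when inserting and a `where=` filter when querying.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity; smaller means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


# Hypothetical toy records: an ID, a 2-d vector, and a metadata dict each.
records = [
    {"id": "a", "vec": [1.0, 0.0], "meta": {"chapter": 3, "source": "https://example.com/a"}},
    {"id": "b", "vec": [0.9, 0.1], "meta": {"chapter": 1, "source": "https://example.com/b"}},
    {"id": "c", "vec": [0.0, 1.0], "meta": {"chapter": 3, "source": "https://example.com/c"}},
]


def filtered_search(query_vec, where, n_results=3):
    # Keep only records whose metadata matches every key/value in `where`.
    candidates = [
        r for r in records
        if all(r["meta"].get(k) == v for k, v in where.items())
    ]
    # Rank the survivors by distance to the query vector, nearest first.
    ranked = sorted(candidates, key=lambda r: cosine_distance(query_vec, r["vec"]))
    # Return IDs with their source URL, so results can be cited.
    return [(r["id"], r["meta"]["source"]) for r in ranked[:n_results]]


hits = filtered_search([1.0, 0.0], where={"chapter": 3})
print(hits)
# [('a', 'https://example.com/a'), ('c', 'https://example.com/c')]
```

Record `b` is the second-closest vector to the query but is excluded because its chapter does not match, which is the "search only in chapter 3" behavior described above.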

In [None]:
collection.upsert(
    embeddings=sample_batch['vec'].tolist(),
    documents=sample_batch['doc'].tolist(),
    ids=sample_batch['id'].tolist()
)

In [None]:
test_query = model.encode("tell me about money").tolist()

In [None]:
results = collection.query(
    query_embeddings=[test_query],
    n_results=3
)

In [None]:
results

### Scaling queries with Chroma

Now that we have the basics of Chroma down, let's look at scaling to large datasets.

We'll create a Ray Core Actor that provides access to ChromaDB

In [None]:
@ray.remote(concurrency_groups={"write": 4, "read": 16})
class ChromaWrapper:
    def __init__(self):
        self.chroma_client = chromadb.PersistentClient(path="/mnt/cluster_storage/vector_store")
        self.collection = self.chroma_client.get_or_create_collection(name="persistent_text_chunks")

    @ray.method(concurrency_group="write")
    def upsert(self, batch):
        self.collection.upsert(
            embeddings=batch['vec'].tolist(),
            documents=batch['doc'].tolist(),
            ids=batch['id'].tolist()
        )
        return len(batch['id'])

    @ray.method(concurrency_group="read")
    def query(self, q):
        return self.collection.query(query_embeddings=[q], n_results=3)

chroma_server = ChromaWrapper.remote()

We're using `map_batches` with a side effect to write the vectors to the database. Alternative approaches include:
* writing a custom sink so we could use code like `my_dataset_with_vectors.write_cool_vectordb('collection')`
* writing to a standard storage format like parquet and using another scripted workflow step to bulk load the database

In this code example, we also demo using `map_batches` in (stateless) task form, and using a lambda, to access a running actor.

> As an exercise, rewrite this to use `map_batches` with an actor (callable) class, the way we've done before

In [None]:
vecs.map_batches(
    lambda batch: {'batch_count': [ray.get(chroma_server.upsert.remote(batch))]}
).sum('batch_count')

Since our service is running as an actor, we can quickly test out a query

In [None]:
utah_query_vec = model.encode("Describe the body of water in Utah").tolist()

ray.get(chroma_server.query.remote(utah_query_vec))

> Exercise 1: Rewrite the test code here to use Ray Data, `map_batches`, and a dedicated ChromaDB retrieval actor. Hint: for testing purposes, you can create a small Ray Dataset from Python strings with `ray.data.from_items`.
>
> Exercise 2: Define a Ray Serve deployment that queries the data using Chroma. Such a service would be useful for online/low-latency recall.