In [1]:
import uuid
from pathlib import Path
import chromadb
import numpy as np
import ray
from sentence_transformers import SentenceTransformer

# Generating, storing, and retrieving embeddings with Ray Data

In [2]:
EMBEDDING_MODEL = "hkunlp/instructor-large"
model = SentenceTransformer(EMBEDDING_MODEL)

In [3]:
items = ["What are some top attractions in Seattle?", 
         "What are some top attractions in Los Angeles?"]

In [4]:
vectors = model.encode(items)

vectors.shape

(2, 768)

We move the data to shared storage so that every node in the cluster can read it.

In [5]:
! cp around.txt /mnt/cluster_storage/

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [6]:
paras_ds = ray.data.read_text("/mnt/cluster_storage/around.txt")

2026-01-20 20:57:41,919	INFO worker.py:1821 -- Connecting to existing Ray cluster at address: 10.0.142.230:6379...
2026-01-20 20:57:41,932	INFO worker.py:1998 -- Connected to Ray cluster. View the dashboard at https://session-v4klp1kjtnk9yrxwdcz5ah11ub.i.anyscaleuserdata.com
2026-01-20 20:57:41,957	INFO packaging.py:463 -- Pushing file package 'gcs://_ray_pkg_0bd8078b0063d4195929ce96a7cf436461a67169.zip' (9.62MiB) to Ray cluster...
2026-01-20 20:57:41,997	INFO packaging.py:476 -- Successfully pushed file package 'gcs://_ray_pkg_0bd8078b0063d4195929ce96a7cf436461a67169.zip'.


In [7]:
paras_ds.count()

2026-01-20 20:57:42,245	INFO logging.py:397 -- Registered dataset logger for dataset dataset_200_0
2026-01-20 20:57:42,267	INFO streaming_executor.py:178 -- Starting execution of Dataset dataset_200_0. Full logs are in /tmp/ray/session_2026-01-20_18-18-31_241199_2386/logs/ray-data
2026-01-20 20:57:42,267	INFO streaming_executor.py:179 -- Execution plan of Dataset dataset_200_0: InputDataBuffer[Input] -> TaskPoolMapOperator[ListFiles] -> TaskPoolMapOperator[ReadFiles] -> TaskPoolMapOperator[Project] -> AggregateNumRows[AggregateNumRows]
2026-01-20 20:57:42,268	INFO streaming_executor.py:687 -- [dataset]: A new progress UI is available. To enable, set `ray.data.DataContext.get_current().enable_rich_progress_bars = True` and `ray.data.DataContext.get_current().use_ray_tqdm = False`.
2026-01-20 20:57:42,269	INFO progress_bar.py:155 -- Progress bar disabled because stdout is a non-interactive terminal.
2026-01-20 20:57:42,298	INFO progress_bar.py:213 -- === Ray Data Progress {ListFiles} ===

1654

In [8]:
paras_ds.take_batch(4)

2026-01-20 20:57:48,938	INFO logging.py:397 -- Registered dataset logger for dataset dataset_201_0
2026-01-20 20:57:48,945	INFO streaming_executor.py:178 -- Starting execution of Dataset dataset_201_0. Full logs are in /tmp/ray/session_2026-01-20_18-18-31_241199_2386/logs/ray-data
2026-01-20 20:57:48,946	INFO streaming_executor.py:179 -- Execution plan of Dataset dataset_201_0: InputDataBuffer[Input] -> TaskPoolMapOperator[ListFiles] -> TaskPoolMapOperator[ReadFiles] -> LimitOperator[limit=4]
2026-01-20 20:57:48,963	INFO progress_bar.py:213 -- === Ray Data Progress {ListFiles} ===
2026-01-20 20:57:48,964	INFO progress_bar.py:215 -- ListFiles: Tasks: 1; Actors: 0; Queued blocks: 0 (0.0B); Resources: 1.0 CPU, 384.0MiB object store: Progress Completed 0 / ?
2026-01-20 20:57:48,965	INFO progress_bar.py:213 -- === Ray Data Progress {ReadFiles} ===
2026-01-20 20:57:48,966	INFO progress_bar.py:215 -- ReadFiles: Tasks: 0; Actors: 0; Queued blocks: 0 (0.0B); Resources: 0.0 CPU, 0.0B object stor

{'text': array(['Around the World in Eighty Days\r',
        'CHAPTER I. IN WHICH PHILEAS FOGG AND PASSEPARTOUT ACCEPT EACH OTHER, THE ONE AS MASTER, THE OTHER AS MAN\r',
        'Mr. Phileas Fogg lived, in 1872, at No. 7, Saville Row, Burlington Gardens, the house in which Sheridan died in 1814. He was one of the most noticeable members of the Reform Club, though he seemed always to avoid attracting attention; an enigmatical personage, about whom little was known, except that he was a polished man of the world. People said that he resembled Byron—at least that his head was Byronic; but he was a bearded, tranquil Byron, who might live on a thousand years without growing old.\r',
        'Certainly an Englishman, it was more doubtful whether Phileas Fogg was a Londoner. He was never seen on ’Change, nor at the Bank, nor in the counting-rooms of the “City”; no ships ever came into London docks of which he was the owner; he had no public employment; he had never been entered at any of the

To generate our embeddings, we'll use two steps:

1. Create a class that performs the embedding operation
    1. We use a class because we'll want to hold on to a large, valuable piece of state -- the embedding model itself
    1. For use with our vector databases, we'll need unique IDs to go with each document and embedding -- we'll generate UUIDs
    1. The output from the `__call__` method will be similar to the input: a dict with the column names as keys, and vectorized types for values
1. Call `dataset.map_batches(...)` where we connect the dataset to the processing class as well as specify resources like the number of class instances (actors) and GPUs
    1. Specify the actor pool. The cell below uses a fixed pool of two actors (`size=2`), each reserving one eighth of a GPU; passing `min_size` and `max_size` to `ActorPoolStrategy` instead creates an autoscaling pool, which is how Ray can handle large, uneven workloads

In [9]:
class DocEmbedder:
    def __init__(self):
        # Load the model once per actor so it is reused across batches.
        self._model = SentenceTransformer(EMBEDDING_MODEL)

    def __call__(self, batch: dict[str, np.ndarray]) -> dict[str, np.ndarray]:
        inputs = batch['text']
        embeddings = self._model.encode(inputs, device='cuda:0')
        # One fresh UUID per row; these become the document IDs in the vector store.
        ids = np.array([uuid.uuid1().hex for _ in inputs])
        return {'doc': inputs, 'vec': embeddings, 'id': ids}
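
`uuid.uuid1()` returns a new value on every call, so rerunning the pipeline against a persistent collection stores the same paragraph again under a new ID. One possible alternative (not what this notebook does) is to derive the ID from the text itself, so that an upsert of unchanged text overwrites the existing entry. The sketch below uses only the standard library; `DOC_NAMESPACE` and `stable_doc_id` are hypothetical names introduced for illustration.

```python
import uuid

# Hypothetical namespace for this demo; any fixed UUID works as long as it never changes.
DOC_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "around-the-world-demo")


def stable_doc_id(text: str) -> str:
    """Return the same 32-character hex ID every time for the same text."""
    return uuid.uuid5(DOC_NAMESPACE, text).hex


first = stable_doc_id("Around the World in Eighty Days\r")
second = stable_doc_id("Around the World in Eighty Days\r")
other = stable_doc_id("CHAPTER I.")
print(first == second, first == other)
```

The trade-off is that two identical paragraphs in the source collapse into a single entry. If that matters, a line number or file offset could be mixed into the hashed string.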

In [10]:
vecs = paras_ds.map_batches(DocEmbedder, compute=ray.data.ActorPoolStrategy(size=2), num_gpus=0.125, batch_size=64)

In [11]:
sample_batch = vecs.take_batch(4)

sample_batch

2026-01-20 20:57:49,426	INFO logging.py:397 -- Registered dataset logger for dataset dataset_203_0
2026-01-20 20:57:49,428	INFO limit_pushdown.py:140 -- Skipping push down of limit 4 through map MapBatches[MapBatches(DocEmbedder)] because it requires 64 rows to produce stable outputs
2026-01-20 20:57:49,432	INFO streaming_executor.py:178 -- Starting execution of Dataset dataset_203_0. Full logs are in /tmp/ray/session_2026-01-20_18-18-31_241199_2386/logs/ray-data
2026-01-20 20:57:49,433	INFO streaming_executor.py:179 -- Execution plan of Dataset dataset_203_0: InputDataBuffer[Input] -> TaskPoolMapOperator[ListFiles] -> TaskPoolMapOperator[ReadFiles] -> ActorPoolMapOperator[MapBatches(DocEmbedder)] -> LimitOperator[limit=4]
2026-01-20 20:57:49,606	INFO progress_bar.py:213 -- === Ray Data Progress {ListFiles} ===
2026-01-20 20:57:49,607	INFO progress_bar.py:215 -- ListFiles: Tasks: 1; Actors: 0; Queued blocks: 0 (0.0B); Resources: 1.0 CPU, 384.0MiB object store: Progress Completed 0 / ?


{'doc': array(['Around the World in Eighty Days\r',
        'CHAPTER I. IN WHICH PHILEAS FOGG AND PASSEPARTOUT ACCEPT EACH OTHER, THE ONE AS MASTER, THE OTHER AS MAN\r',
        'Mr. Phileas Fogg lived, in 1872, at No. 7, Saville Row, Burlington Gardens, the house in which Sheridan died in 1814. He was one of the most noticeable members of the Reform Club, though he seemed always to avoid attracting attention; an enigmatical personage, about whom little was known, except that he was a polished man of the world. People said that he resembled Byron—at least that his head was Byronic; but he was a bearded, tranquil Byron, who might live on a thousand years without growing old.\r',
        'Certainly an Englishman, it was more doubtful whether Phileas Fogg was a Londoner. He was never seen on ’Change, nor at the Bank, nor in the counting-rooms of the “City”; no ships ever came into London docks of which he was the owner; he had no public employment; he had never been entered at any of the 

### Vector storage example: ChromaDB

Ray focuses on compute and is orthogonal to data storage, so many data stores can be used.

For a simple example, we'll use ChromaDB, starting with a minimal in-memory demo so that we can see the programming pattern.

In [12]:
import chromadb
from chromadb.config import Settings

chroma_client = chromadb.Client(Settings(anonymized_telemetry=False))
collection = chroma_client.get_or_create_collection(name="my_text_chunks")

Insert the vectors, documents, and IDs

> Note that Chroma can also accept arbitrary metadata dictionaries for each document, which you can then use in your queries (along with semantic similarity) and see in results. Metadata allows you to easily add powerful features like "search only in chapter 3" or "cite source URLs for data returned"

In [13]:
collection.upsert(
    embeddings=sample_batch['vec'].tolist(),
    documents=sample_batch['doc'].tolist(),
    ids=sample_batch['id'].tolist()
)

In [14]:
test_query = model.encode("tell me about money").tolist()

In [15]:
results = collection.query(
    query_embeddings=[test_query],
    n_results=3
)

In [16]:
results

{'ids': [['b2705228f64211f087c00ab2c0749841',
   'b270539af64211f087c00ab2c0749841',
   'b2705340f64211f087c00ab2c0749841']],
 'embeddings': None,
 'documents': [['Around the World in Eighty Days\r',
   'Certainly an Englishman, it was more doubtful whether Phileas Fogg was a Londoner. He was never seen on ’Change, nor at the Bank, nor in the counting-rooms of the “City”; no ships ever came into London docks of which he was the owner; he had no public employment; he had never been entered at any of the Inns of Court, either at the Temple, or Lincoln’s Inn, or Gray’s Inn; nor had his voice ever resounded in the Court of Chancery, or in the Exchequer, or the Queen’s Bench, or the Ecclesiastical Courts. He certainly was not a manufacturer; nor was he a merchant or a gentleman farmer. His name was strange to the scientific and learned societies, and he never was known to take part in the sage deliberations of the Royal Institution or the London Institution, the Artisan’s Association, or th
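
As the output above shows, the result is a dict of parallel nested lists, with one inner list per query embedding. A small helper can pair each ID with its document for easier reading. This is a minimal sketch: `rows_for_query` is a hypothetical name, and the `example` dict is a hand-written stand-in that mirrors the shape of the real result.

```python
def rows_for_query(results: dict, query_index: int = 0) -> list[tuple[str, str]]:
    """Pair each returned ID with its document for one query embedding."""
    ids = results["ids"][query_index]
    docs = results["documents"][query_index]
    return list(zip(ids, docs))


# Hand-written stand-in with the same layout as a Chroma query result.
example = {
    "ids": [["id-1", "id-2"]],
    "embeddings": None,
    "documents": [["first passage", "second passage"]],
}

pairs = rows_for_query(example)
for doc_id, doc in pairs:
    print(doc_id, doc)
```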

### Scaling queries with Chroma

Now that we have the basics of Chroma down, let's look at scaling to large datasets.

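We'll create a Ray Core actor that provides access to ChromaDB. The `concurrency_groups` argument lets the actor run up to 4 `upsert` calls and up to 16 `query` calls at the same time, so reads are not queued behind writes.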

In [17]:
@ray.remote(concurrency_groups={"write": 4, "read": 16})
class ChromaWrapper:
    def __init__(self):
        self.chroma_client = chromadb.PersistentClient(path="/mnt/cluster_storage/vector_store")
        self.collection = self.chroma_client.get_or_create_collection(name="persistent_text_chunks")

    @ray.method(concurrency_group="write")
    def upsert(self, batch):
        self.collection.upsert(
            embeddings=batch['vec'].tolist(),
            documents=batch['doc'].tolist(),
            ids=batch['id'].tolist()
        )
        return len(batch['id'])

    @ray.method(concurrency_group="read")
    def query(self, q):
        return self.collection.query(query_embeddings=[q], n_results=3)

chroma_server = ChromaWrapper.remote()

We're using `map_batches` with a side effect to write the vectors to the database. Alternative approaches include
* writing a custom sink so we could use code like `my_dataset_with_vectors.write_cool_vectordb('collection')`
* writing to a standard storage format like parquet and using another scripted workflow step to bulk load the database

In this code example, we also demo using `map_batches` in (stateless) task form, and using a lambda, to access a running actor.

> As an exercise, rewrite this to use `map_batches` with an actor (callable) class, the way we've done before

In [18]:
vecs.map_batches(lambda batch: {'batch_count': [ray.get(chroma_server.upsert.remote(batch))]}).sum('batch_count')

2026-01-20 20:58:10,362	INFO dataset.py:3641 -- Tip: Use `take_batch()` instead of `take() / show()` to return records in pandas or numpy batch format.
2026-01-20 20:58:10,364	INFO logging.py:397 -- Registered dataset logger for dataset dataset_206_0
2026-01-20 20:58:10,371	INFO streaming_executor.py:178 -- Starting execution of Dataset dataset_206_0. Full logs are in /tmp/ray/session_2026-01-20_18-18-31_241199_2386/logs/ray-data
2026-01-20 20:58:10,372	INFO streaming_executor.py:179 -- Execution plan of Dataset dataset_206_0: InputDataBuffer[Input] -> TaskPoolMapOperator[ListFiles] -> TaskPoolMapOperator[ReadFiles] -> ActorPoolMapOperator[MapBatches(DocEmbedder)] -> TaskPoolMapOperator[MapBatches(<lambda>)] -> HashAggregateOperator[HashAggregate(key_columns=(), num_partitions=1)] -> LimitOperator[limit=1]
2026-01-20 20:58:10,547	INFO progress_bar.py:213 -- === Ray Data Progress {ListFiles} ===
2026-01-20 20:58:10,548	INFO progress_bar.py:215 -- ListFiles: Tasks: 1; Actors: 0; Queued b

1654

Since our service is running as an actor, we can quickly test a query against it.

In [19]:
utah_query_vec = model.encode("Describe the body of water in Utah").tolist()

ray.get(chroma_server.query.remote(utah_query_vec))

{'ids': [['63baa6b0f63311f08bd90ab2c0749841',
   '20b7ee1ef64111f092e30a2cfe0740b3',
   '7b91f5a0f64111f0bd500a2cfe0740b3']],
 'embeddings': None,
 'documents': [['During the lecture the train had been making good progress, and towards half-past twelve it reached the northwest border of the Great Salt Lake. Thence the passengers could observe the vast extent of this interior sea, which is also called the Dead Sea, and into which flows an American Jordan. It is a picturesque expanse, framed in lofty crags in large strata, encrusted with white salt—a superb sheet of water, which was formerly of larger extent than now, its shores having encroached with the lapse of time, and thus at once reduced its breadth and increased its depth.\r',
   'During the lecture the train had been making good progress, and towards half-past twelve it reached the northwest border of the Great Salt Lake. Thence the passengers could observe the vast extent of this interior sea, which is also called the Dead Sea,