In [1]:
# ============================================================================
# IMPORTS
# ============================================================================
# uuid: Generate unique identifiers for each document in the vector store
# chromadb: Vector database for storing and querying embeddings
# numpy: Array operations for batch processing
# ray: Distributed computing framework
# sentence_transformers: Pre-trained models for generating text embeddings

import uuid
from pathlib import Path
import chromadb
import numpy as np
import ray
from sentence_transformers import SentenceTransformer

# Module 2: Embeddings Generation and Retrieval with Ray Data

[![Ray](https://img.shields.io/badge/Ray-Data-blue)](https://docs.ray.io/en/latest/data/data.html) [![ChromaDB](https://img.shields.io/badge/ChromaDB-Vector%20Store-green)](https://docs.trychroma.com/)

---

## Overview

This module demonstrates how to generate embeddings at scale using Ray Data and store them in a vector database (ChromaDB) for semantic retrieval. These are core components of any RAG (Retrieval-Augmented Generation) system.

**What you'll learn:**
- Generate embeddings using SentenceTransformers with Ray Data
- Use the callable class pattern for stateful batch processing
- Store and query vectors using ChromaDB
- Create Ray actors for concurrent database access

---

## Architecture Overview

```
┌────────────────────────────────────────────────────────────────────────────┐
│                     Embedding Generation Pipeline                          │
└────────────────────────────────────────────────────────────────────────────┘

    Raw Text Documents                    Vector Database (ChromaDB)
           │                                        ▲
           ▼                                        │
    ┌─────────────────┐                    ┌────────────────┐
    │   Ray Data      │  ─────────────────►│ ChromaWrapper  │
    │   read_text()   │                    │     Actor      │
    └────────┬────────┘                    └────────────────┘
             │                                      ▲
             ▼                                      │
    ┌────────────────────────────────────────────────┐
    │                 map_batches()                  │
    │  ┌──────────────┐    ┌──────────────┐          │
    │  │  DocEmbedder │    │  DocEmbedder │  Actor   │
    │  │    Actor 1   │    │    Actor 2   │  Pool    │
    │  │  (GPU 0.125) │    │  (GPU 0.125) │          │
    │  └──────────────┘    └──────────────┘          │
    └────────────────────────────────────────────────┘
```

In [2]:
# ============================================================================
# EMBEDDING MODEL SETUP
# ============================================================================
# We use 'all-MiniLM-L6-v2' - a lightweight, fast embedding model that produces
# 384-dimensional vectors. This model offers a good balance between quality
# and performance, and is more stable across different CUDA configurations.
#
# Alternative: 'hkunlp/instructor-large' (768-dim) for higher quality but
# may have CUDA compatibility issues on some systems.

EMBEDDING_MODEL = "all-MiniLM-L6-v2"

# Load the model for local testing (later we'll use this in distributed actors)
model = SentenceTransformer(EMBEDDING_MODEL)
print(f"Loaded model: {EMBEDDING_MODEL}")

Loaded model: all-MiniLM-L6-v2


## Step 1: Understanding Embeddings

Embeddings are dense vector representations of text that capture semantic meaning. Similar texts produce similar vectors, enabling semantic search.

```
┌────────────────────────────────────────────────────────────────────────────┐
│                      Text to Embedding Conversion                          │
└────────────────────────────────────────────────────────────────────────────┘

    "The quick brown fox"              "A fast auburn dog"
             │                                 │
             ▼                                 ▼
    ┌─────────────────┐               ┌─────────────────┐
    │SentenceTransform│               │SentenceTransform│
    │(all-MiniLM-L6)  │               │(all-MiniLM-L6)  │
    └────────┬────────┘               └────────┬────────┘
             │                                 │
             ▼                                 ▼
    [0.23, -0.45, 0.12, ...]         [0.21, -0.42, 0.15, ...]
         384 dimensions                   384 dimensions
             │                                 │
             └──────────────┬──────────────────┘
                            │
                     Cosine Similarity
                            │
                            ▼
                  High Similarity (0.89)
                  (Semantically related)
```
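The comparison step at the bottom of the diagram can be made concrete with a short, dependency-free sketch. The three-dimensional vectors below are made-up stand-ins for real 384-dimensional embeddings, and `cosine_similarity` is a hypothetical helper written for illustration, not part of SentenceTransformers.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors (illustrative values only, not real model output)
fox = [0.23, -0.45, 0.12]
dog = [0.21, -0.42, 0.15]
weather = [-0.40, 0.10, 0.30]

print(f"fox vs dog:     {cosine_similarity(fox, dog):.3f}")
print(f"fox vs weather: {cosine_similarity(fox, weather):.3f}")
```

Vectors that point in nearly the same direction score close to 1, while unrelated directions score near or below 0. With real embeddings the same function applies unchanged to the 384-element vectors returned by `model.encode()`.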

In [3]:
# Quick test: encode two sample sentences to verify the model works
# Notice how we can encode multiple items in a single call (batch processing)
items = ["What are some top attractions in Seattle?", 
         "What are some top attractions in Los Angeles?"]

In [4]:
# Generate embeddings - each text becomes a 384-dimensional vector
# Shape: (2, 384) = 2 items x 384 dimensions per embedding
vectors = model.encode(items)

print(f"Shape: {vectors.shape}")
print(f"Each text is now a {vectors.shape[1]}-dimensional vector")

Shape: (2, 384)
Each text is now a 384-dimensional vector


---

## Step 2: Loading Data with Ray Data

For distributed processing, we need data accessible to all workers. We copy our source text to shared cluster storage (`/mnt/cluster_storage/`).

**Data Source:** "Around the World in Eighty Days" by Jules Verne - a classic novel we'll use as our document corpus for the RAG system.

In [5]:
# Copy the source text to shared cluster storage
# This ensures all Ray workers can access the file
! cp around.txt /mnt/cluster_storage/

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [6]:
# ============================================================================
# LOAD DATA WITH RAY DATA
# ============================================================================
# ray.data.read_text() reads a text file where each line becomes a row
# This creates a Ray Dataset - a distributed, lazy data structure

paras_ds = ray.data.read_text("/mnt/cluster_storage/around.txt")

# The dataset is lazy - no data is loaded until we perform an action
print(f"Dataset schema: {paras_ds.schema()}")

2026-01-22 00:58:59,456	INFO worker.py:1821 -- Connecting to existing Ray cluster at address: 10.0.9.248:6379...
2026-01-22 00:58:59,469	INFO worker.py:1998 -- Connected to Ray cluster. View the dashboard at https://session-v4klp1kjtnk9yrxwdcz5ah11ub.i.anyscaleuserdata.com
2026-01-22 00:58:59,499	INFO packaging.py:463 -- Pushing file package 'gcs://_ray_pkg_3dd5c48f1162e1cb703434693228d29e52183733.zip' (10.61MiB) to Ray cluster...
2026-01-22 00:58:59,541	INFO packaging.py:476 -- Successfully pushed file package 'gcs://_ray_pkg_3dd5c48f1162e1cb703434693228d29e52183733.zip'.
2026-01-22 00:58:59,706	INFO logging.py:397 -- Registered dataset logger for dataset dataset_129_0
2026-01-22 00:58:59,728	INFO streaming_executor.py:178 -- Starting execution of Dataset dataset_129_0. Full logs are in /tmp/ray/session_2026-01-21_22-31-19_329458_2360/logs/ray-data
2026-01-22 00:58:59,729	INFO streaming_executor.py:179 -- Execution plan of Dataset dataset_129_0: InputDataBuffer[Inp

Dataset schema: Column  Type
------  ----
text    string


In [7]:
# count() triggers execution and returns the total number of rows
# In this case, each paragraph/line of the book is one row
print(f"Total paragraphs in corpus: {paras_ds.count()}")

2026-01-22 00:59:06,386	INFO logging.py:397 -- Registered dataset logger for dataset dataset_130_0
2026-01-22 00:59:06,390	INFO streaming_executor.py:178 -- Starting execution of Dataset dataset_130_0. Full logs are in /tmp/ray/session_2026-01-21_22-31-19_329458_2360/logs/ray-data
2026-01-22 00:59:06,391	INFO streaming_executor.py:179 -- Execution plan of Dataset dataset_130_0: InputDataBuffer[Input] -> TaskPoolMapOperator[ListFiles] -> TaskPoolMapOperator[ReadFiles] -> TaskPoolMapOperator[Project] -> AggregateNumRows[AggregateNumRows]
2026-01-22 00:59:06,410	INFO progress_bar.py:213 -- === Ray Data Progress {ListFiles} ===
2026-01-22 00:59:06,411	INFO progress_bar.py:215 -- ListFiles: Tasks: 1; Actors: 0; Queued blocks: 0 (0.0B); Resources: 1.0 CPU, 384.0MiB object store: Progress Completed 0 / ?
2026-01-22 00:59:06,412	INFO progress_bar.py:213 -- === Ray Data Progress {ReadFiles} ===
2026-01-22 00:59:06,413	INFO progress_bar.py:215 -- ReadFiles: Tasks: 0; Actors: 0; Queued blocks: 0 

Total paragraphs in corpus: 1654


In [8]:
# Preview the first 4 rows as a batch (dictionary of numpy arrays)
# This is the format that map_batches() will receive
sample = paras_ds.take_batch(4)
print(f"Batch keys: {sample.keys()}")
print("First 4 paragraphs:")
for i, text in enumerate(sample['text'][:4]):
    print(f"  {i+1}. {text[:80]}...")

2026-01-22 00:59:06,570	INFO logging.py:397 -- Registered dataset logger for dataset dataset_131_0
2026-01-22 00:59:06,574	INFO streaming_executor.py:178 -- Starting execution of Dataset dataset_131_0. Full logs are in /tmp/ray/session_2026-01-21_22-31-19_329458_2360/logs/ray-data
2026-01-22 00:59:06,574	INFO streaming_executor.py:179 -- Execution plan of Dataset dataset_131_0: InputDataBuffer[Input] -> TaskPoolMapOperator[ListFiles] -> TaskPoolMapOperator[ReadFiles] -> LimitOperator[limit=4]
2026-01-22 00:59:06,591	INFO progress_bar.py:213 -- === Ray Data Progress {ListFiles} ===
2026-01-22 00:59:06,592	INFO progress_bar.py:215 -- ListFiles: Tasks: 1; Actors: 0; Queued blocks: 0 (0.0B); Resources: 1.0 CPU, 384.0MiB object store: Progress Completed 0 / ?
2026-01-22 00:59:06,593	INFO progress_bar.py:213 -- === Ray Data Progress {ReadFiles} ===
2026-01-22 00:59:06,594	INFO progress_bar.py:215 -- ReadFiles: Tasks: 0; Actors: 0; Queued blocks: 0 (0.0B); Resources: 0.0 CPU, 0.0B object stor

Batch keys: dict_keys(['text'])
First 4 paragraphs:
  1. Around the World in Eighty Days...
  2. CHAPTER I. IN WHICH PHILEAS FOGG AND PASSEPARTOUT ACCEPT EACH OTHER, THE ONE AS ...
  3. Mr. Phileas Fogg lived, in 1872, at No. 7, Saville Row, Burlington Gardens, the ...
  4. Certainly an Englishman, it was more doubtful whether Phileas Fogg was a Londone...


---

## Step 3: Distributed Embedding Generation

To generate embeddings at scale, we use the **Callable Class Pattern** with `map_batches()`. This pattern gives us three things:

1. **Expensive initialization** (loading ML models) happens once per actor
2. **Batch processing** for efficient GPU utilization
3. **Parallel execution** across multiple actors

### The Callable Class Pattern

```
┌────────────────────────────────────────────────────────────────────────────┐
│                        Callable Class Pattern                              │
└────────────────────────────────────────────────────────────────────────────┘

    class DocEmbedder:
        
        def __init__(self):           ◄── Called ONCE when actor starts
            self._model = load()           (expensive model loading)
            
        def __call__(self, batch):    ◄── Called for EACH batch of data
            return process(batch)          (efficient batch processing)

┌──────────────────────────────────────────────────────────────────────────────┐
│  Actor Lifecycle:                                                            │
│                                                                              │
│              [Actor Created] ──► __init__() ──► [Ready to process batches]   │
│                                              │                               │
│                       ┌──────────────────────┴──────────────────────┐        │
│                       │                                             │        │
│                       ▼                                             ▼        │
│                __call__(batch1)                             __call__(batch2) │
│                       │                                             │        │
│                       ▼                                             ▼        │
│                return results                               return results   │
└──────────────────────────────────────────────────────────────────────────────┘
```

**Key Parameters:**
- `compute=ray.data.ActorPoolStrategy(size=N)` - Creates N actor instances
- `num_gpus=0.125` - Each actor reserves 1/8 of a GPU (fractional allocation); only needed for GPU-based models and omitted in the CPU run below
- `batch_size=64` - Number of records processed per `__call__` invocation
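The lifecycle above can be sketched without Ray at all. This is a minimal illustration, assuming a single in-process instance stands in for one actor; `ToyEmbedder` and `make_batches` are hypothetical names, and the "embedding" is just a string length so the example stays dependency-free.

```python
class ToyEmbedder:
    """Stand-in for DocEmbedder: records how often setup and processing run."""

    def __init__(self):
        # Expensive setup (e.g. loading a model) would go here; runs once per instance.
        self.init_calls = 1
        self.batches_seen = 0

    def __call__(self, batch: dict[str, list[str]]) -> dict[str, list]:
        # Per-batch work; runs once for every batch routed to this instance.
        self.batches_seen += 1
        return {"doc": batch["text"], "length": [len(t) for t in batch["text"]]}


def make_batches(rows: list[str], batch_size: int) -> list[dict[str, list[str]]]:
    """Split rows into consecutive batches, mimicking what batch_size controls."""
    return [{"text": rows[i:i + batch_size]} for i in range(0, len(rows), batch_size)]


rows = [f"paragraph {n}" for n in range(10)]
embedder = ToyEmbedder()  # one "actor": __init__ runs once
outputs = [embedder(b) for b in make_batches(rows, batch_size=4)]

print(f"init calls: {embedder.init_calls}, batches processed: {embedder.batches_seen}")
```

In the real pipeline Ray Data creates the instances and routes batches to them; the point of the sketch is only that setup cost is paid once while `__call__` is reused for every batch.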

In [9]:
# ============================================================================
# DOCUMENT EMBEDDER - Callable Class for Distributed Processing
# ============================================================================
# This class follows the callable class pattern required by map_batches()
# when using ActorPoolStrategy for stateful processing.

class DocEmbedder:
    """
    Generates embeddings for document batches using SentenceTransformers.
    
    - __init__: Loads the model ONCE when the actor starts (expensive operation)
    - __call__: Processes batches efficiently (called many times per actor)
    """
    
    def __init__(self):
        # Load the embedding model - happens once per actor instance
        # This is why we use actors: to avoid reloading the model for each batch
        # Using 'all-MiniLM-L6-v2' for stability across CUDA configurations
        self._model = SentenceTransformer("all-MiniLM-L6-v2")

    def __call__(self, batch: dict[str, np.ndarray]) -> dict[str, np.ndarray]:
        # Extract text from the batch
        inputs = batch['text']
        
        # Generate embeddings
        # This lightweight model works efficiently on CPU
        embeddings = self._model.encode(inputs)
        
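        # Generate unique IDs for each document (required by vector databases)
        # Note: these IDs are new on every run, so re-running the pipeline adds rows instead of updating them
        ids = np.array([uuid.uuid1().hex for _ in inputs])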
        
        # Return a new batch with original docs, vectors, and IDs
        return {'doc': inputs, 'vec': embeddings, 'id': ids}

In [10]:
# ============================================================================
# APPLY EMBEDDING TRANSFORMATION WITH ACTOR POOL
# ============================================================================
# map_batches() applies DocEmbedder to all batches in the dataset
#
# Parameters:
#   - DocEmbedder: The callable class to instantiate
#   - compute=ActorPoolStrategy(size=2): Create 2 actor instances
#   - batch_size=64: Process 64 documents per __call__ invocation
#
# Note: Using CPU for 'all-MiniLM-L6-v2' model (lightweight and fast)
# For GPU-based models, add: num_gpus=0.125

vecs = paras_ds.map_batches(
    DocEmbedder, 
    compute=ray.data.ActorPoolStrategy(size=2),  # 2 parallel actors
    batch_size=64                                 # Documents per batch
)

# Note: This is LAZY - no computation happens until we materialize the dataset

In [11]:
# Materialize a sample batch to verify the embedding pipeline works
# This triggers actor creation, model loading, and embedding generation
sample_batch = vecs.take_batch(4)

print(f"Batch keys: {sample_batch.keys()}")
print(f"Document shape: {sample_batch['doc'].shape}")
print(f"Vector shape: {sample_batch['vec'].shape}")
print(f"ID shape: {sample_batch['id'].shape}")
print(f"\nEach document is now a {sample_batch['vec'].shape[1]}-dimensional vector")

2026-01-22 00:59:07,045	INFO logging.py:397 -- Registered dataset logger for dataset dataset_133_0
2026-01-22 00:59:07,046	INFO limit_pushdown.py:140 -- Skipping push down of limit 4 through map MapBatches[MapBatches(DocEmbedder)] because it requires 64 rows to produce stable outputs
2026-01-22 00:59:07,050	INFO streaming_executor.py:178 -- Starting execution of Dataset dataset_133_0. Full logs are in /tmp/ray/session_2026-01-21_22-31-19_329458_2360/logs/ray-data
2026-01-22 00:59:07,051	INFO streaming_executor.py:179 -- Execution plan of Dataset dataset_133_0: InputDataBuffer[Input] -> TaskPoolMapOperator[ListFiles] -> TaskPoolMapOperator[ReadFiles] -> ActorPoolMapOperator[MapBatches(DocEmbedder)] -> LimitOperator[limit=4]
2026-01-22 00:59:07,224	INFO progress_bar.py:213 -- === Ray Data Progress {ListFiles} ===
2026-01-22 00:59:07,225	INFO progress_bar.py:215 -- ListFiles: Tasks: 1; Actors: 0; Queued blocks: 0 (0.0B); Resources: 1.0 CPU, 384.0MiB object store: Progress Completed 0 / ?


Batch keys: dict_keys(['doc', 'vec', 'id'])
Document shape: (4,)
Vector shape: (4, 384)
ID shape: (4,)

Each document is now a 384-dimensional vector


---

## Step 4: Vector Storage with ChromaDB

Vector databases store embeddings and enable fast similarity search. ChromaDB is an open-source option that's easy to use.

```
┌────────────────────────────────────────────────────────────────────────────┐
│                          Vector Database Concept                           │
└────────────────────────────────────────────────────────────────────────────┘

┌────────────────────────────────────────────────────────────────────────────┐
│                         ChromaDB Collection                                │
│                                                                            │
│   ID            Document                           Vector (384-dim)        │
│   ──────────────────────────────────────────────────────────────────────   │
│   abc123        "The quick brown fox..."           [0.23, -0.45, ...]      │
│   def456        "A lazy dog sleeps..."             [0.11, -0.32, ...]      │
│   ghi789        "The weather is nice..."           [0.44, 0.21, ...]       │
│                        ...                                ...              │
└────────────────────────────────────────────────────────────────────────────┘
    
Query Flow:

"Tell me about animals"  ──►  encode()  ──►  [0.19, -0.40, ...]
                                                     │
                                     ┌───────────────┴───────────────┐
                                     │   Cosine Similarity Search    │
                                     │   (find nearest neighbors)    │
                                     └───────────────┬───────────────┘
                                                     │
                                                     ▼
                         Results: ["The quick brown fox...", "A lazy dog..."]
```

### In-Memory ChromaDB Demo

Let's start with a simple in-memory ChromaDB instance to understand the API before scaling up.

In [12]:
# ============================================================================
# CREATE IN-MEMORY CHROMADB COLLECTION
# ============================================================================
# ChromaDB supports two modes:
#   - In-memory (ephemeral): Data lost when process ends
#   - Persistent: Data stored on disk

import chromadb
from chromadb.config import Settings

# Create an in-memory client (good for testing)
chroma_client = chromadb.Client(Settings(anonymized_telemetry=False))

# Create or get a collection (like a table in traditional databases)
collection = chroma_client.get_or_create_collection(name="my_text_chunks")

print(f"Created collection: {collection.name}")

Created collection: my_text_chunks


### Inserting Documents

ChromaDB stores three components for each document:
- **embeddings**: The vector representations
- **documents**: The original text (for retrieval)
- **ids**: Unique identifiers

Optionally, you can also add **metadata** for filtering (e.g., "chapter", "author", "source_url").

In [13]:
# Insert our sample batch into ChromaDB
# upsert() will insert new documents or update existing ones (based on ID)
collection.upsert(
    embeddings=sample_batch['vec'].tolist(),  # Convert numpy arrays to lists
    documents=sample_batch['doc'].tolist(),
    ids=sample_batch['id'].tolist()
)

print(f"Inserted {len(sample_batch['id'])} documents into collection")

Inserted 4 documents into collection


In [14]:
# Create a query embedding to test similarity search
# We encode the query using the same model used for documents
test_query = model.encode("tell me about money").tolist()

print(f"Query vector dimension: {len(test_query)}")

Query vector dimension: 384


In [15]:
# Query the collection for similar documents
# n_results=3 returns the top 3 most similar documents
results = collection.query(
    query_embeddings=[test_query],  # Can query multiple embeddings at once
    n_results=3                      # Return top 3 matches
)

In [16]:
# Display the query results
# Results include ids, documents, distances (lower = closer; these are not similarity scores), and metadatas
print("Query Results:")
print(f"  - IDs: {results['ids']}")
print(f"  - Distances: {results['distances']}")  # Lower = more similar
print(f"  - Documents: {results['documents']}")

Query Results:
  - IDs: [['9348fb98f72d11f0811d023246279789', '9348fcd8f72d11f0811d023246279789', '9348fcbaf72d11f0811d023246279789']]
  - Distances: [[1.686969518661499, 1.757093906402588, 1.9784538745880127]]
  - Documents: [['Around the World in Eighty Days\r', 'Certainly an Englishman, it was more doubtful whether Phileas Fogg was a Londoner. He was never seen on ’Change, nor at the Bank, nor in the counting-rooms of the “City”; no ships ever came into London docks of which he was the owner; he had no public employment; he had never been entered at any of the Inns of Court, either at the Temple, or Lincoln’s Inn, or Gray’s Inn; nor had his voice ever resounded in the Court of Chancery, or in the Exchequer, or the Queen’s Bench, or the Ecclesiastical Courts. He certainly was not a manufacturer; nor was he a merchant or a gentleman farmer. His name was strange to the scientific and learned societies, and he never was known to take part in the sage deliberations of the Royal Instituti
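
The distances above are greater than 1, which can look odd next to the "cosine similarity" picture. As far as I know, a Chroma collection created without extra configuration uses squared Euclidean (L2) distance rather than cosine similarity; treat that as an assumption to verify against your Chroma version. If the embeddings are unit-length (also an assumption worth checking for your model), the two measures are tied by `squared_l2 = 2 - 2 * cosine`, so the ranking is the same either way. The sketch below checks that identity on made-up 2-D vectors; all function names are hypothetical helpers.

```python
import math


def normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def squared_l2(a: list[float], b: list[float]) -> float:
    """Sum of squared coordinate differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine(a: list[float], b: list[float]) -> float:
    """Dot product, which equals cosine similarity for unit-length inputs."""
    return sum(x * y for x, y in zip(a, b))


query = normalize([1.0, 0.2])
docs = {"close": normalize([0.9, 0.3]), "far": normalize([-0.5, 1.0])}

for name, vec in docs.items():
    d2 = squared_l2(query, vec)
    cos = cosine(query, vec)
    print(f"{name}: squared L2 = {d2:.4f}, 2 - 2*cos = {2 - 2 * cos:.4f}")
```

Under those assumptions, a distance of about 1.69 would correspond to a cosine similarity of roughly 0.16, which fits a query about money matched against only four opening paragraphs.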

---

## Step 5: Scaling with Ray Actors

For production workloads, we need:
1. **Persistent storage** - Data survives process restarts
2. **Concurrent access** - Multiple readers/writers
3. **Distributed access** - Any Ray worker can query the database

We wrap ChromaDB in a Ray Actor to achieve this:

```
┌────────────────────────────────────────────────────────────────────────────┐
│                        ChromaWrapper Actor                                 │
└────────────────────────────────────────────────────────────────────────────┘

                        ┌─────────────────────────┐
    Ray Worker 1 ──────►│                         │
                        │    ChromaWrapper        │
    Ray Worker 2 ──────►│       Actor             │──────► Persistent Storage
                        │                         │        /mnt/cluster_storage/
    Ray Worker 3 ──────►│  ┌─────┐   ┌──────┐     │
                        │  │write│   │ read │     │
                        │  │group│   │ group│     │
                        │  │ (4) │   │ (16) │     │
                        │  └─────┘   └──────┘     │
                        └─────────────────────────┘
    
    Concurrency Groups:
    - write: 4 concurrent writes (limited for consistency)
    - read: 16 concurrent reads (higher for scalability)
```
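
To build intuition for what a per-group cap means, here is a plain-threading analogy. It is not how Ray implements concurrency groups; `GroupLimiter` is a hypothetical class that uses one semaphore per group and records the highest number of simultaneous calls it observed.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class GroupLimiter:
    """Caps concurrent calls per named group and records the observed peak."""

    def __init__(self, limits: dict[str, int]):
        self._slots = {name: threading.Semaphore(n) for name, n in limits.items()}
        self._lock = threading.Lock()
        self._active = {name: 0 for name in limits}
        self.peak = {name: 0 for name in limits}

    def run(self, group: str, work_seconds: float = 0.01) -> None:
        with self._slots[group]:  # blocks while the group is at its limit
            with self._lock:
                self._active[group] += 1
                self.peak[group] = max(self.peak[group], self._active[group])
            time.sleep(work_seconds)  # stand-in for a database call
            with self._lock:
                self._active[group] -= 1


limiter = GroupLimiter({"write": 4, "read": 16})
with ThreadPoolExecutor(max_workers=32) as pool:
    futures = [pool.submit(limiter.run, "write") for _ in range(12)]
    futures += [pool.submit(limiter.run, "read") for _ in range(24)]
    for f in futures:
        f.result()  # re-raise any exception from the worker threads

print(f"peak concurrent writes: {limiter.peak['write']}, reads: {limiter.peak['read']}")
```

Even with 32 threads available, the write peak cannot exceed 4 and the read peak cannot exceed 16. The actor defined next expresses the same limits declaratively through `concurrency_groups`.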

In [17]:
# ============================================================================
# CHROMAWRAPPER ACTOR - Distributed Vector Database Access
# ============================================================================
# This Ray actor wraps ChromaDB to provide:
#   - Concurrent read/write access from multiple workers
#   - Persistent storage on shared filesystem
#   - Concurrency groups to control parallelism

@ray.remote(concurrency_groups={"write": 4, "read": 16})
class ChromaWrapper:
    """
    Ray actor providing distributed access to ChromaDB.
    
    Concurrency groups allow:
    - Up to 4 concurrent write operations
    - Up to 16 concurrent read operations
    """
    
    def __init__(self):
        # Use PersistentClient to store data on disk
        # Path must be accessible to all workers (shared storage)
        self.chroma_client = chromadb.PersistentClient(
            path="/mnt/cluster_storage/vector_store"
        )
        self.collection = self.chroma_client.get_or_create_collection(
            name="persistent_text_chunks"
        )

    @ray.method(concurrency_group="write")
    def upsert(self, batch):
        """Insert or update documents in the collection."""
        self.collection.upsert(
            embeddings=batch['vec'].tolist(),
            documents=batch['doc'].tolist(),
            ids=batch['id'].tolist()
        )
        return len(batch['id'])

    @ray.method(concurrency_group="read")
    def query(self, q):
        """Query for similar documents."""
        return self.collection.query(query_embeddings=[q], n_results=3)

# Create a single actor instance that all workers will share
chroma_server = ChromaWrapper.remote()
print("ChromaWrapper actor created and ready")

ChromaWrapper actor created and ready


### Bulk Loading with map_batches

We use `map_batches` with a lambda to send each batch to our ChromaWrapper actor. This provides a simple way to bulk-load data into the vector store.

**Alternative approaches:**
- Write a custom sink for cleaner API: `dataset.write_vectordb('collection')`
- Write to Parquet first, then bulk load in a separate step
- Use actor-based callable class (exercise: try implementing this!)

In [18]:
# ============================================================================
# BULK LOAD ALL EMBEDDINGS INTO CHROMADB
# ============================================================================
# This uses map_batches with a lambda to send each batch to the actor
# The lambda calls the actor's upsert method and returns the batch count

total_docs = vecs.map_batches(
    lambda batch: {'batch_count': [ray.get(chroma_server.upsert.remote(batch))]}
).sum('batch_count')

print(f"Total documents loaded into ChromaDB: {total_docs}")

2026-01-22 01:00:34,667	INFO dataset.py:3641 -- Tip: Use `take_batch()` instead of `take() / show()` to return records in pandas or numpy batch format.
2026-01-22 01:00:34,669	INFO logging.py:397 -- Registered dataset logger for dataset dataset_136_0
2026-01-22 01:00:34,677	INFO streaming_executor.py:178 -- Starting execution of Dataset dataset_136_0. Full logs are in /tmp/ray/session_2026-01-21_22-31-19_329458_2360/logs/ray-data
2026-01-22 01:00:34,678	INFO streaming_executor.py:179 -- Execution plan of Dataset dataset_136_0: InputDataBuffer[Input] -> TaskPoolMapOperator[ListFiles] -> TaskPoolMapOperator[ReadFiles] -> ActorPoolMapOperator[MapBatches(DocEmbedder)] -> TaskPoolMapOperator[MapBatches(<lambda>)] -> HashAggregateOperator[HashAggregate(key_columns=(), num_partitions=1)] -> LimitOperator[limit=1]
2026-01-22 01:00:34,861	INFO progress_bar.py:213 -- === Ray Data Progress {ListFiles} ===
2026-01-22 01:00:34,863	INFO progress_bar.py:215 -- ListFiles: Tasks: 1; Actors: 0; Queued b

Total documents loaded into ChromaDB: 1654


### Testing the Persistent Vector Store

Since our ChromaWrapper is a running actor, we can query it directly without going through a dataset pipeline.

In [19]:
# ============================================================================
# SEMANTIC SEARCH DEMO
# ============================================================================
# Query the persistent vector store for documents about Utah
# The Great Salt Lake is mentioned in "Around the World in 80 Days"

# Encode the query
utah_query_vec = model.encode("Describe the body of water in Utah").tolist()

# Query the actor (returns a future, use ray.get() to retrieve the result)
results = ray.get(chroma_server.query.remote(utah_query_vec))

# Display results
print("Query: 'Describe the body of water in Utah'")
print("\nTop 3 most relevant passages:")
for i, doc in enumerate(results['documents'][0]):
    distance = results['distances'][0][i]
    print(f"\n{i+1}. (distance: {distance:.4f})")
    print(f"   {doc[:200]}...")

Query: 'Describe the body of water in Utah'

Top 3 most relevant passages:

1. (distance: 0.9195)
   The Salt Lake, seventy miles long and thirty-five wide, is situated three miles eight hundred feet above the sea. Quite different from Lake Asphaltite, whose depression is twelve hundred feet below th...

2. (distance: 0.9195)
   The Salt Lake, seventy miles long and thirty-five wide, is situated three miles eight hundred feet above the sea. Quite different from Lake Asphaltite, whose depression is twelve hundred feet below th...

3. (distance: 0.9804)
   During the lecture the train had been making good progress, and towards half-past twelve it reached the northwest border of the Great Salt Lake. Thence the passengers could observe the vast extent of ...
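
Results 1 and 2 above are the same passage at the same distance. One plausible cause, which the output alone does not confirm, is that the persistent collection already held rows from an earlier run: `DocEmbedder` assigns a fresh random ID on every execution, so `upsert()` has no existing ID to match and inserts again. A possible remedy is to derive the ID from the text itself. The sketch below assumes a hypothetical helper `content_id` built on the standard library's `uuid.uuid5`; the choice of `uuid.NAMESPACE_URL` as the namespace is arbitrary.

```python
import uuid


def content_id(text: str) -> str:
    """Derive a stable ID from the text (UUIDv5 is a namespaced SHA-1 hash)."""
    return uuid.uuid5(uuid.NAMESPACE_URL, text).hex


passage = "The Salt Lake, seventy miles long and thirty-five wide"
first_run = content_id(passage)
second_run = content_id(passage)

print(f"same ID on both runs: {first_run == second_run}")
print(f"different text, different ID: {content_id(passage + '!') != first_run}")
```

One caveat with this design: identical paragraphs would share an ID, and a vector store may reject repeated IDs within a single call, so a real pipeline might also mix a row position or source offset into the hashed string.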


---

## Summary

In this module, we covered:

```
┌────────────────────────────────────────────────────────────────────────────┐
│                          Key Concepts Learned                              │
└────────────────────────────────────────────────────────────────────────────┘

1. EMBEDDINGS
   ├── Dense vector representations of text (384 dimensions with MiniLM)
   ├── Similar meanings → Similar vectors
   └── Enable semantic search (beyond keyword matching)

2. CALLABLE CLASS PATTERN
   ├── __init__: Expensive setup (model loading) - runs ONCE
   ├── __call__: Batch processing - runs MANY times
   └── Essential for stateful actors with map_batches()

3. RAY DATA INTEGRATION
   ├── read_text() → Load data
   ├── map_batches() → Distributed processing
   └── ActorPoolStrategy → Parallel execution with state

4. VECTOR DATABASE (ChromaDB)
   ├── upsert() → Store documents with embeddings
   ├── query() → Find similar documents
   └── PersistentClient → Durable storage

5. RAY ACTORS
   ├── Wrap stateful services (databases, models)
   ├── Concurrency groups → Control parallelism
   └── Accessible from any Ray worker
```

**Next Steps:** In Module 3, we'll add LLM inference to generate responses using the retrieved context.