# Translating HyperNodes Pipelines to Daft

This notebook demonstrates how to translate HyperNodes pipelines into Daft pipelines for performance gains.

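We'll progress through increasingly complex examples:
1. **Simple transformations**: Basic text cleaning and tokenizing
2. **Generator functions**: One input row producing many output rows
3. **Stateful processing**: Using classes with expensive initialization
4. **Batch processing**: Vectorized operations with NumPy
5. **Complex pipelines**: Multi-stage document encoding
6. **Nested structures**: Struct unnesting and field access
7. **Heavy performance test**: A larger document processing pipeline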

## Key Differences

**HyperNodes:**
- Function-based nodes with explicit inputs/outputs
- Sequential execution by default
- `.map()` for processing collections
- Pipelines compose nodes into DAGs

**Daft:**
- DataFrame-based operations
- Lazy evaluation with automatic optimization
- Built-in parallel processing
- UDFs for custom logic (`@daft.func`, `@daft.cls`, `@daft.func.batch`)

In [1]:
# Install dependencies if needed
# !pip install daft hypernodes numpy

In [2]:
from __future__ import annotations

import time
from typing import Iterator, List

import daft
import numpy as np
from daft import DataType, Series
from hypernodes import Pipeline, node
from pydantic import BaseModel

## Example 1: Simple Text Processing

Let's start with basic text transformations: cleaning and tokenizing.

### HyperNodes Version

In [3]:
# Define nodes
@node(output_name="cleaned_text")
def clean_text(text: str) -> str:
    return text.strip().lower()


@node(output_name="tokens")
def tokenize(cleaned_text: str) -> List[str]:
    return cleaned_text.split()


@node(output_name="token_count")
def count_tokens(tokens: List[str]) -> int:
    return len(tokens)


# Build pipeline
text_pipeline_hn = Pipeline(
    nodes=[clean_text, tokenize, count_tokens],
    name="text_processing_hypernodes",
)

# Process single text
result = text_pipeline_hn.run(inputs={"text": "  Hello World  "})
print(f"HyperNodes single result: {result}")

HyperNodes single result: {'cleaned_text': 'hello world', 'tokens': ['hello', 'world'], 'token_count': 2}


In [4]:
# Process multiple texts using .map()
texts = [
    "  Hello World  ",
    "  Daft is FAST  ",
    "  HyperNodes are modular  ",
    "  Python for data processing  ",
]

start = time.time()
results_hn = text_pipeline_hn.map(inputs={"text": texts}, map_over="text")
elapsed_hn = time.time() - start

print(f"\nHyperNodes map results:")
for i, tc in enumerate(results_hn["token_count"]):
    print(f"  '{texts[i]}' -> {tc} tokens")
print(f"Time: {elapsed_hn:.4f}s")


HyperNodes map results:
  '  Hello World  ' -> 2 tokens
  '  Daft is FAST  ' -> 3 tokens
  '  HyperNodes are modular  ' -> 3 tokens
  '  Python for data processing  ' -> 4 tokens
Time: 0.0006s


### Daft Version

In [5]:
# Define functions using @daft.func
@daft.func
def clean_text_daft(text: str) -> str:
    return text.strip().lower()


@daft.func
def tokenize_daft(text: str) -> list[str]:
    return text.split()


@daft.func
def count_tokens_daft(tokens: list[str]) -> int:
    return len(tokens)


# Create DataFrame and apply transformations
df_daft = daft.from_pydict({"text": texts})

start = time.time()
df_daft = df_daft.with_column("cleaned_text", clean_text_daft(df_daft["text"]))
df_daft = df_daft.with_column("tokens", tokenize_daft(df_daft["cleaned_text"]))
df_daft = df_daft.with_column("token_count", count_tokens_daft(df_daft["tokens"]))

# Materialize results
results_daft = df_daft.collect()
elapsed_daft = time.time() - start

print(f"\n⏱️  Daft Time: {elapsed_daft:.4f}s")
print(f"📊 Speedup: {elapsed_hn / elapsed_daft:.2f}x")
print("\nDaft results (to_pydict):")
print(results_daft.to_pydict())


⏱️  Daft Time: 0.0056s
📊 Speedup: 0.10x

Daft results (to_pydict):
{'text': ['  Hello World  ', '  Daft is FAST  ', '  HyperNodes are modular  ', '  Python for data processing  '], 'cleaned_text': ['hello world', 'daft is fast', 'hypernodes are modular', 'python for data processing'], 'tokens': [['hello', 'world'], ['daft', 'is', 'fast'], ['hypernodes', 'are', 'modular'], ['python', 'for', 'data', 'processing']], 'token_count': [2, 3, 3, 4]}


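**Note on built-ins**: The built-in cells in this notebook use only `.str.split()` and `.list.length()`, so text cleaning here is done with UDFs. Daft's expression API documents further string methods (for example `.str.lower()`, `.str.lstrip()`, and `.str.rstrip()`); they are not exercised in this notebook, so check the API reference for your installed version before relying on them.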

### Key Observations

1. **HyperNodes**: Explicit `map_over` parameter to process collections
2. **Daft**: DataFrame operations automatically apply to all rows
3. **Daft** uses lazy evaluation - operations are only executed when calling `.collect()` or `.show()`
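To make the third observation concrete, here is a minimal pure-Python sketch of lazy evaluation. `LazyFrame` and its methods are hypothetical names invented for this illustration and say nothing about Daft's internals; the sketch only shows that `with_column` records work and `collect` performs it.

```python
from typing import Callable, Dict, List, Tuple


class LazyFrame:
    """Hypothetical toy frame: records column transforms, runs them on collect()."""

    def __init__(self, data: Dict[str, list]):
        self._data = data
        # Each step is (new column name, source column name, row-wise function).
        self._steps: List[Tuple[str, str, Callable]] = []

    def with_column(self, name: str, source: str, fn: Callable) -> "LazyFrame":
        # Nothing is computed here; the step is only appended to the plan.
        new = LazyFrame(self._data)
        new._steps = self._steps + [(name, source, fn)]
        return new

    def collect(self) -> Dict[str, list]:
        # Execution happens only now, in the order the steps were recorded.
        out = dict(self._data)
        for name, source, fn in self._steps:
            out[name] = [fn(v) for v in out[source]]
        return out


calls = []


def clean(text: str) -> str:
    calls.append(text)  # track when the function actually runs
    return text.strip().lower()


plan = LazyFrame({"text": ["  Hello World  ", "  Daft is FAST  "]})
plan = plan.with_column("cleaned_text", "text", clean)
plan = plan.with_column("token_count", "cleaned_text", lambda s: len(s.split()))

ran_before_collect = len(calls)  # still 0: only the plan exists
result = plan.collect()
print(ran_before_collect, result["cleaned_text"], result["token_count"])
```

Because the whole plan is known before anything runs, a real engine has the opportunity to reorder or fuse steps; this toy version simply replays them in order.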

### Daft Version - Without UDFs (Built-in Operations)

For simple operations, Daft has built-in string and list methods that don't require custom UDFs.

In [6]:
import time

# This cell uses only two built-ins:
#   - .str.split() - split a string into a list
#   - .list.length() - count the elements of a list
# Cleaning (strip/lower) is left to the UDF version above

df_daft_builtin = daft.from_pydict({"text": texts})
start = time.time()

# We can use split and length without UDFs.
# This skips the cleaning step, so splitting the padded inputs on " " yields
# empty-string tokens and the counts differ from the UDF version.
df_daft_builtin = df_daft_builtin.with_column(
    "tokens", df_daft_builtin["text"].str.split(" ")
)

df_daft_builtin = df_daft_builtin.with_column(
    "token_count", df_daft_builtin["tokens"].list.length()
)

result_builtin = df_daft_builtin.select("text", "token_count").collect()
elapsed_builtin = time.time() - start

print(f"⏱️  Daft built-in time: {elapsed_builtin:.4f}s")
print(f"📊 Result: {result_builtin.to_pydict()}")

⏱️  Daft built-in time: 0.0020s
📊 Result: {'text': ['  Hello World  ', '  Daft is FAST  ', '  HyperNodes are modular  ', '  Python for data processing  '], 'token_count': [6, 7, 7, 8]}


**Key insight**: Built-in operations are often faster than UDFs. In this run the built-in version took 0.0020s against 0.0056s for the UDF version, although it skips the cleaning step. Daft provides native implementations of common operations such as:
- `.str.split()` and `.str.contains()` for strings
- `.list.length()`, `.list.get()` for lists
- Arithmetic operations, comparisons

Use UDFs only when you need custom logic not available as built-ins.
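The built-in cell reports token counts of 6, 7, 7, 8 where the UDF version reported 2, 3, 3, 4. The sketch below reproduces both sets of numbers with plain Python string methods, on the assumption that splitting on a literal `" "` in Daft behaves like Python's `str.split(" ")`, which the recorded output is consistent with.

```python
texts = [
    "  Hello World  ",
    "  Daft is FAST  ",
    "  HyperNodes are modular  ",
    "  Python for data processing  ",
]

# Splitting on a literal single space keeps the empty strings produced by
# the leading and trailing padding spaces.
literal_counts = [len(t.split(" ")) for t in texts]

# The UDF pipeline strips and lowercases first, then splits on runs of
# whitespace, so no empty tokens are produced.
whitespace_counts = [len(t.strip().lower().split()) for t in texts]

print(literal_counts)     # [6, 7, 7, 8]
print(whitespace_counts)  # [2, 3, 3, 4]
```

When swapping a UDF for a built-in, comparing outputs on padded or irregular inputs like these is a cheap way to confirm the two versions really agree.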

## Example 2: Generator Functions - Text Tokenization

Let's process text where each input produces multiple outputs (one row per token).

### HyperNodes Version

In [7]:
# In HyperNodes, we return a list and then flatten
@node(output_name="tokens")
def tokenize_to_list(text: str) -> List[str]:
    return text.strip().lower().split()


@node(output_name="token")
def flatten_tokens(tokens: List[str]) -> List[str]:
    # This would need to be handled specially in HyperNodes
    return tokens


# Simple approach: process manually
sentences = ["Hello World", "Daft is fast", "Python rocks"]

all_tokens_hn = []
for sent in sentences:
    result = tokenize_to_list.func(sent)
    all_tokens_hn.extend(result)

print(f"HyperNodes tokens: {all_tokens_hn}")

HyperNodes tokens: ['hello', 'world', 'daft', 'is', 'fast', 'python', 'rocks']


### Daft Version with Generator

In [8]:
@daft.func
def tokenize_generator(text: str) -> Iterator[str]:
    """Generator that yields one token at a time."""
    for token in text.strip().lower().split():
        yield token


df_gen = daft.from_pydict({"sentence": sentences})
df_gen = df_gen.select(
    "sentence", tokenize_generator(df_gen["sentence"]).alias("token")
)
df_gen = df_gen.collect()

print("\nDaft generator results (to_pydict):")
print(df_gen.to_pydict())


Daft generator results (to_pydict):
{'sentence': ['Hello World', 'Hello World', 'Daft is fast', 'Daft is fast', 'Daft is fast', 'Python rocks', 'Python rocks'], 'token': ['hello', 'world', 'daft', 'is', 'fast', 'python', 'rocks']}


### Key Observations

1. **Daft generators** automatically expand rows - the `sentence` column is broadcast
2. **HyperNodes** requires manual flattening or special handling
3. **Daft's approach** is more declarative and handles the complexity internally
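The row expansion described in the first observation can be sketched in plain Python. `expand_rows` is a hypothetical helper written for this illustration, not a Daft function; it repeats the source value once per yielded token, which matches the shape of the recorded output. A row whose generator yields nothing simply disappears in this sketch; check Daft's documentation for how it treats that case.

```python
from typing import Callable, Dict, Iterator, List


def tokenize(text: str) -> Iterator[str]:
    """Yield one lowercase token at a time."""
    for token in text.strip().lower().split():
        yield token


def expand_rows(
    column: List[str], gen: Callable[[str], Iterator[str]]
) -> Dict[str, list]:
    """Emit one output row per yielded value, repeating the input value."""
    sentences, tokens = [], []
    for value in column:
        for item in gen(value):
            sentences.append(value)  # broadcast the source row
            tokens.append(item)
    return {"sentence": sentences, "token": tokens}


expanded = expand_rows(["Hello World", "Daft is fast", "Python rocks"], tokenize)
print(expanded)
```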

### Daft Version - Without UDFs (Built-in Explode)

Daft has a built-in `.explode()` method for expanding lists into multiple rows.

In [9]:
df_gen_builtin = daft.from_pydict({"sentence": sentences})

# Only .str.split() is used here, with no cleaning step,
# so results won't match exactly (tokens are not lowercased)
df_gen_builtin = df_gen_builtin.with_column(
    "tokens", df_gen_builtin["sentence"].str.split(" ")
)

# Use .explode() to expand the list into multiple rows
df_gen_builtin = df_gen_builtin.explode("tokens")

# Rename for clarity
df_gen_builtin = df_gen_builtin.select(
    "sentence", df_gen_builtin["tokens"].alias("token")
)
df_gen_builtin = df_gen_builtin.collect()

print("\nDaft built-in explode results (to_pydict):")
print(df_gen_builtin.to_pydict())


Daft built-in explode results (to_pydict):
{'sentence': ['Hello World', 'Hello World', 'Daft is fast', 'Daft is fast', 'Daft is fast', 'Python rocks', 'Python rocks'], 'token': ['Hello', 'World', 'Daft', 'is', 'fast', 'Python', 'rocks']}


**Key insight**: `.explode()` is the built-in alternative to generator UDFs when you already have a list column. It's more efficient than a generator UDF for this use case.

## Example 3: Stateful Processing with Classes

Now let's use a class with expensive initialization (simulating model loading).

### HyperNodes Version

In [10]:
class SimpleEncoder:
    """Simulates an encoder with expensive initialization."""

    def __init__(self, dim: int, seed: int = 42):
        print(f"  [HN] Initializing encoder with dim={dim}, seed={seed}")
        time.sleep(0.5)  # Simulate expensive initialization
        self.dim = dim
        self.rng = np.random.default_rng(seed)

    def encode(self, text: str) -> np.ndarray:
        # Simulate encoding
        return self.rng.random(self.dim, dtype=np.float32)


# Create encoder instance (expensive!)
encoder_hn = SimpleEncoder(dim=8)


@node(output_name="embedding")
def encode_text_hn(text: str, encoder: SimpleEncoder) -> np.ndarray:
    return encoder.encode(text)


# Create pipeline
encode_pipeline_hn = Pipeline(nodes=[encode_text_hn], name="encode_hn")

# Process multiple texts
texts_encode = ["hello", "world", "daft", "hypernodes"]

start = time.time()
results_encode_hn = encode_pipeline_hn.map(
    inputs={"text": texts_encode, "encoder": encoder_hn}, map_over="text"
)
elapsed_encode_hn = time.time() - start

print(f"\nHyperNodes encoding: {len(results_encode_hn['embedding'])} embeddings")
print(f"Time: {elapsed_encode_hn:.4f}s")
print(f"Sample embedding shape: {results_encode_hn['embedding'][0].shape}")

  [HN] Initializing encoder with dim=8, seed=42

HyperNodes encoding: 4 embeddings
Time: 0.0009s
Sample embedding shape: (8,)



### Daft Version with @daft.cls

In [11]:
@daft.cls
class SimpleEncoderDaft:
    """Daft encoder - initialization happens once per worker."""

    def __init__(self, dim: int, seed: int = 42):
        print(f"  [Daft] Initializing encoder with dim={dim}, seed={seed}")
        time.sleep(0.5)  # Simulate expensive initialization
        self.dim = dim
        self.rng = np.random.default_rng(seed)

    @daft.method(return_dtype=DataType.python())
    def encode(self, text: str) -> np.ndarray:
        return self.rng.random(self.dim, dtype=np.float32)


# Create encoder instance (lazy - doesn't initialize yet!)
encoder_daft = SimpleEncoderDaft(dim=8)

df_encode = daft.from_pydict({"text": texts_encode})

start = time.time()
df_encode = df_encode.with_column("embedding", encoder_daft.encode(df_encode["text"]))
results_encode_daft = df_encode.collect()
elapsed_encode_daft = time.time() - start

print(f"\n⏱️  Daft encoding: {results_encode_daft.count_rows()} embeddings")
print(f"⏱️  Time: {elapsed_encode_daft:.4f}s")
print(
    f"Sample embedding shape: {results_encode_daft.to_pydict()['embedding'][0].shape}"
)

  [Daft] Initializing encoder with dim=8, seed=42

⏱️  Daft encoding: 4 embeddings
⏱️  Time: 0.5121s
Sample embedding shape: (8,)



### Daft Version - Without UDFs?

**Note**: This example demonstrates stateful processing with expensive initialization (simulating loading a model). Daft's built-in operations cannot handle this use case - **UDFs are required** when you need:
- Expensive initialization that should happen once per worker
- Stateful processing with persistent state across rows
- Custom logic that goes beyond simple transformations

For this scenario, the `@daft.cls` UDF approach shown above is the correct solution.

### Key Observations

1. **HyperNodes**: Encoder initialized once upfront, passed to each invocation
2. **Daft**: Encoder initialized lazily per worker during execution
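3. **Daft's lazy init** is powerful for distributed execution where the encoder can't be serialized
4. **Timing caveat**: The 0.5s initialization runs before the HyperNodes timer starts but inside Daft's timed region, so 0.0009s vs 0.5121s mostly reflects where the encoder is constructed, not the per-row cost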
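The difference between eager and lazy construction can be sketched without either library. `LazyInstance` and `SlowEncoder` below are hypothetical stand-ins invented for this illustration; the point is only that a lazily built object pays its initialization cost on first use, inside whatever region is being timed.

```python
import time


class LazyInstance:
    """Hypothetical wrapper: stores constructor args, builds the object on first use."""

    def __init__(self, cls, *args, **kwargs):
        self._cls, self._args, self._kwargs = cls, args, kwargs
        self._instance = None

    def get(self):
        if self._instance is None:
            # First use: pay the initialization cost exactly once.
            self._instance = self._cls(*self._args, **self._kwargs)
        return self._instance


class SlowEncoder:
    init_count = 0

    def __init__(self, dim: int):
        SlowEncoder.init_count += 1
        time.sleep(0.05)  # stand-in for loading a model
        self.dim = dim

    def encode(self, text: str) -> list:
        return [float(len(text))] * self.dim


encoder = LazyInstance(SlowEncoder, dim=4)
inits_after_construction = SlowEncoder.init_count  # 0: nothing built yet

start = time.perf_counter()
embeddings = [encoder.get().encode(t) for t in ["hello", "world", "daft"]]
elapsed = time.perf_counter() - start

# The timed region includes the one-off initialization.
print(inits_after_construction, SlowEncoder.init_count, f"{elapsed:.2f}s")
```

For a like-for-like benchmark, either construct both encoders inside the timed region or warm up the lazy one with a throwaway call before starting the timer.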

## Example 4: Batch Processing with NumPy

Let's leverage vectorized operations for better performance.

### HyperNodes Version (Row-wise)

In [12]:
@node(output_name="normalized")
def normalize_value_hn(value: float, mean: float, std: float) -> float:
    return (value - mean) / std


# Create pipeline
norm_pipeline_hn = Pipeline(nodes=[normalize_value_hn], name="normalize_hn")

# Sample data
values = list(np.linspace(0, 100, 1000))
mean_val = 50.0
std_val = 10.0

start = time.time()
results_norm_hn = norm_pipeline_hn.map(
    inputs={"value": values, "mean": mean_val, "std": std_val}, map_over="value"
)
elapsed_norm_hn = time.time() - start

print(f"HyperNodes normalization: {len(results_norm_hn['normalized'])} values")
print(f"Time: {elapsed_norm_hn:.4f}s")
print(f"Sample: {results_norm_hn['normalized'][:3]}")

HyperNodes normalization: 1000 values
Time: 0.3038s
Sample: [np.float64(-5.0), np.float64(-4.98998998998999), np.float64(-4.97997997997998)]


### Daft Version with Batch UDF

In [13]:
@daft.func.batch(return_dtype=DataType.float64())
def normalize_batch(values: Series, mean: float, std: float) -> Series:
    """Vectorized normalization using NumPy."""
    arr = values.to_arrow().to_numpy()
    normalized = (arr - mean) / std
    return Series.from_numpy(normalized)


df_norm = daft.from_pydict({"value": values})

start = time.time()
df_norm = df_norm.with_column(
    "normalized", normalize_batch(df_norm["value"], mean_val, std_val)
)
results_norm_daft = df_norm.collect()
elapsed_norm_daft = time.time() - start

print(f"\n⏱️  Daft batch normalization: {results_norm_daft.count_rows()} values")
print(f"⏱️  Time: {elapsed_norm_daft:.4f}s")
print(f"📊 Speedup: {elapsed_norm_hn / elapsed_norm_daft:.2f}x")
print(f"Sample (first 3): {results_norm_daft.to_pydict()['normalized'][:3]}")


⏱️  Daft batch normalization: 1000 values
⏱️  Time: 0.1328s
📊 Speedup: 2.29x
Sample (first 3): [-5.0, -4.98998998998999, -4.97997997997998]


### Daft Version - Without UDFs (Batch Built-in)

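For simple arithmetic, Daft's column expressions can replace a batch UDF entirely. An expression such as `(col - mean) / std` is evaluated by Daft's engine over the whole column, with no Python function call involved. The cell below uses a small five-value dataset, so its timing is not comparable with the 1000-value runs above.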

In [14]:
import time
import numpy as np

# Using element-wise operations instead of batch UDF
values = [1.0, 2.0, 3.0, 4.0, 5.0]

start = time.perf_counter()
df_batch_builtin = daft.from_pydict({"value": values})

# Daft can perform vectorized operations across the column
# The operations are applied element-wise but optimized internally
mean_val = 3.0  # We'll hardcode mean for simplicity
std_val = np.std(values)

df_batch_builtin = df_batch_builtin.with_column(
    "normalized", (df_batch_builtin["value"] - mean_val) / std_val
)

result_batch_builtin = df_batch_builtin.collect()
elapsed_builtin = time.perf_counter() - start

print(f"⏱️  Daft built-in time: {elapsed_builtin:.4f}s")
print(f"📊 Result: {result_batch_builtin.to_pydict()}")

⏱️  Daft built-in time: 0.0035s
📊 Result: {'value': [1.0, 2.0, 3.0, 4.0, 5.0], 'normalized': [-1.414213562373095, -0.7071067811865475, 0.0, 0.7071067811865475, 1.414213562373095]}


**Trade-off**: For simple arithmetic, built-in expressions are cleaner and avoid the overhead of calling into Python, so prefer them. Reserve batch UDFs for vectorized logic that has no built-in equivalent, such as calls into NumPy or a model library.
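The normalized values in the output above follow from the z-score formula `(value - mean) / std`. The cell calls `np.std` with default arguments, which computes the population standard deviation (`ddof=0`). A standard-library sketch of the same arithmetic, using `statistics.pstdev` as the population equivalent:

```python
import statistics

values = [1.0, 2.0, 3.0, 4.0, 5.0]

mean = statistics.fmean(values)   # 3.0
std = statistics.pstdev(values)   # population std, sqrt(2) for this data

normalized = [(v - mean) / std for v in values]
print(normalized)

# The sample standard deviation (ddof=1) is larger and would give
# different z-scores.
sample_std = statistics.stdev(values)
print(std, sample_std)
```

If the normalization is meant to match a library that defaults to the sample standard deviation, pass `ddof=1` to `np.std` explicitly.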

### Key Observations

1. **Batch processing** can be significantly faster for vectorizable operations
2. **Daft's `@daft.func.batch`** makes it easy to leverage NumPy/PyArrow
3. **HyperNodes** processes row-by-row by default (though you could manually batch)

## Example 5: Complex Pipeline - Document Encoding

Let's build a more realistic pipeline similar to the retrieval notebook:
1. Load documents
2. Clean text
3. Encode with a model

Building an index from the embeddings is left out of this example.

In [15]:
# Data models
from pydantic import ConfigDict


class Document(BaseModel):
    doc_id: str
    text: str


class EncodedDocument(BaseModel):
    # Pydantic v2 style config; allows the NumPy array field
    model_config = ConfigDict(arbitrary_types_allowed=True)

    doc_id: str
    text: str
    embedding: np.ndarray


# Sample documents
documents = [
    Document(doc_id="d1", text="  Machine learning is amazing  "),
    Document(doc_id="d2", text="  Python is great for data science  "),
    Document(doc_id="d3", text="  Daft provides fast dataframes  "),
    Document(doc_id="d4", text="  HyperNodes enables modular pipelines  "),
]


### HyperNodes Version

In [16]:
@node(output_name="cleaned_doc")
def clean_document_hn(doc: Document) -> Document:
    return Document(doc_id=doc.doc_id, text=doc.text.strip().lower())


@node(output_name="encoded_doc")
def encode_document_hn(
    cleaned_doc: Document, encoder: SimpleEncoder
) -> EncodedDocument:
    embedding = encoder.encode(cleaned_doc.text)
    return EncodedDocument(
        doc_id=cleaned_doc.doc_id, text=cleaned_doc.text, embedding=embedding
    )


# Build pipeline
doc_pipeline_hn = Pipeline(
    nodes=[clean_document_hn, encode_document_hn],
    name="document_encoding_hypernodes",
)

# Process documents
encoder_doc_hn = SimpleEncoder(dim=16, seed=42)

start = time.time()
results_doc_hn = doc_pipeline_hn.map(
    inputs={"doc": documents, "encoder": encoder_doc_hn}, map_over="doc"
)
elapsed_doc_hn = time.time() - start

print(f"HyperNodes document pipeline: {len(results_doc_hn['encoded_doc'])} documents")
print(f"Time: {elapsed_doc_hn:.4f}s")
for enc_doc in results_doc_hn["encoded_doc"][:2]:
    print(
        f"  {enc_doc.doc_id}: {enc_doc.text[:30]}... -> embedding shape {enc_doc.embedding.shape}"
    )

  [HN] Initializing encoder with dim=16, seed=42
HyperNodes document pipeline: 4 documents
Time: 0.0009s
  d1: machine learning is amazing... -> embedding shape (16,)
  d2: python is great for data scien... -> embedding shape (16,)


### Daft Version - Without UDFs?

**Note**: This complex pipeline combines:
1. Text processing (potentially replaceable by built-in string expressions; this notebook uses a UDF)
2. Custom encoding logic (requires UDF - no built-in alternative)
3. Stateful processing with expensive initialization (requires `@daft.cls`)

**Conclusion**: Complex real-world pipelines typically require a mix of built-in operations and UDFs. Use built-ins where possible for performance, but don't avoid UDFs when you need custom logic or stateful processing.

### Daft Version

In [17]:
@daft.func
def clean_text_simple(text: str) -> str:
    return text.strip().lower()


@daft.cls
class DocumentEncoder:
    def __init__(self, dim: int, seed: int = 42):
        print(f"  [Daft] Initializing DocumentEncoder dim={dim}")
        time.sleep(0.5)
        self.dim = dim
        self.rng = np.random.default_rng(seed)

    @daft.method(return_dtype=DataType.python())
    def encode(self, text: str) -> np.ndarray:
        return self.rng.random(self.dim, dtype=np.float32)


# Create DataFrame from documents
df_docs = daft.from_pydict(
    {"doc_id": [d.doc_id for d in documents], "text": [d.text for d in documents]}
)

# Create encoder instance
encoder_daft_doc = DocumentEncoder(dim=16, seed=42)

start = time.time()
df_docs = df_docs.with_column("cleaned_text", clean_text_simple(df_docs["text"]))
df_docs = df_docs.with_column(
    "embedding", encoder_daft_doc.encode(df_docs["cleaned_text"])
)
df_docs = df_docs.select("doc_id", "cleaned_text", "embedding")

results_doc_daft = df_docs.collect()
elapsed_doc_daft = time.time() - start

print(f"\n⏱️  Daft document pipeline: {results_doc_daft.count_rows()} documents")
print(f"⏱️  Time: {elapsed_doc_daft:.4f}s")
print(f"📊 Speedup: {elapsed_doc_hn / elapsed_doc_daft:.2f}x")
print("\nSample results (first 2):")
result_dict = results_doc_daft.to_pydict()
for i in range(min(2, len(result_dict["doc_id"]))):
    print(
        f"  {result_dict['doc_id'][i]}: {result_dict['cleaned_text'][i][:30]}... -> shape {result_dict['embedding'][i].shape}"
    )

  [Daft] Initializing DocumentEncoder dim=16

⏱️  Daft document pipeline: 4 documents
⏱️  Time: 0.5084s
📊 Speedup: 0.00x

Sample results (first 2):
  d1: machine learning is amazing... -> shape (16,)
  d2: python is great for data scien... -> shape (16,)


**Note on the 0.00x speedup**: As in Example 3, the HyperNodes encoder is constructed before its timer starts, while Daft constructs `DocumentEncoder` lazily inside the timed region. The Daft time of 0.5084s is therefore dominated by the simulated 0.5s initialization, and the ratio says little about per-document cost.

## Example 6: Nested Structure Handling

Let's work with the struct unnesting feature from Daft.

In [18]:
@daft.func(
    return_dtype=DataType.struct(
        {"word_count": DataType.int64(), "char_count": DataType.int64()}
    ),
    unnest=True,
)
def analyze_text(text: str) -> dict:
    words = text.split()
    return {"word_count": len(words), "char_count": len(text)}


df_analyze = daft.from_pydict(
    {"text": ["hello world", "daft is fast", "python for data"]}
)

df_analyze = df_analyze.select("text", analyze_text(df_analyze["text"]))
df_analyze = df_analyze.collect()

print("\nDaft struct unnesting (to_pydict):")
print(df_analyze.to_pydict())


Daft struct unnesting (to_pydict):
{'text': ['hello world', 'daft is fast', 'python for data'], 'word_count': [2, 3, 3], 'char_count': [11, 12, 15]}


### HyperNodes Equivalent

In HyperNodes, you would need separate nodes for each field:

In [19]:
@node(output_name="word_count")
def count_words_hn(text: str) -> int:
    return len(text.split())


@node(output_name="char_count")
def count_chars_hn(text: str) -> int:
    return len(text)


analyze_pipeline_hn = Pipeline(
    nodes=[count_words_hn, count_chars_hn], name="analyze_hypernodes"
)

texts_analyze = ["hello world", "daft is fast", "python for data"]
results_analyze_hn = analyze_pipeline_hn.map(
    inputs={"text": texts_analyze}, map_over="text"
)

print("\nHyperNodes analysis:")
for i, text in enumerate(texts_analyze):
    print(
        f"  '{text}' -> words: {results_analyze_hn['word_count'][i]}, chars: {results_analyze_hn['char_count'][i]}"
    )


HyperNodes analysis:
  'hello world' -> words: 2, chars: 11
  'daft is fast' -> words: 3, chars: 12
  'python for data' -> words: 3, chars: 15


### Daft Version - Without UDFs (Struct Built-ins)

Daft has excellent built-in support for struct operations. You can access struct fields directly without UDFs.

In [20]:
import time

# Using struct field access instead of UDFs
# Correct format for from_pydict with nested structs
docs = {
    "doc": [
        {"id": 1, "text": "hello world", "meta": {"lang": "en"}},
        {"id": 2, "text": "bonjour monde", "meta": {"lang": "fr"}},
    ]
}

start = time.perf_counter()
df_struct_builtin = daft.from_pydict(docs)

# Access struct fields directly using bracket indexing
df_struct_builtin = df_struct_builtin.with_column("id", df_struct_builtin["doc"]["id"])
df_struct_builtin = df_struct_builtin.with_column(
    "text", df_struct_builtin["doc"]["text"]
)
df_struct_builtin = df_struct_builtin.with_column(
    "lang", df_struct_builtin["doc"]["meta"]["lang"]
)

# Calculate word count using built-in string operations
df_struct_builtin = df_struct_builtin.with_column(
    "word_count", df_struct_builtin["text"].str.split(" ").list.length()
)

# Select relevant columns
df_struct_builtin = df_struct_builtin.select("id", "text", "lang", "word_count")

result_struct_builtin = df_struct_builtin.collect()
elapsed_builtin = time.perf_counter() - start

print(f"⏱️  Daft built-in time: {elapsed_builtin:.4f}s")
print(f"📊 Result: {result_struct_builtin.to_pydict()}")

⏱️  Daft built-in time: 0.0038s
📊 Result: {'id': [1, 2], 'text': ['hello world', 'bonjour monde'], 'lang': ['en', 'fr'], 'word_count': [2, 2]}


**Winner for structs**: Built-in struct field access is cleaner and more efficient than UDFs. Use `df["struct_col"]["field"]` to access nested fields directly.
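For readers less familiar with struct columns, the same flattening can be pictured on plain Python dictionaries. `get_path` is a hypothetical helper written for this sketch; it plays the role that chained bracket access plays on a struct column.

```python
docs = [
    {"id": 1, "text": "hello world", "meta": {"lang": "en"}},
    {"id": 2, "text": "bonjour monde", "meta": {"lang": "fr"}},
]


def get_path(record: dict, *keys: str):
    """Follow a sequence of keys into a nested dict."""
    for key in keys:
        record = record[key]
    return record


# One flat column per extracted field, as in the DataFrame version.
columns = {
    "id": [get_path(d, "id") for d in docs],
    "text": [get_path(d, "text") for d in docs],
    "lang": [get_path(d, "meta", "lang") for d in docs],
}
columns["word_count"] = [len(t.split(" ")) for t in columns["text"]]
print(columns)
```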

## Example 7: Heavy Performance Test - Document Processing Pipeline

Let's create a realistic, compute-intensive pipeline inspired by the retrieval notebook:
1. Generate 5000 synthetic documents
2. Clean and tokenize text
3. Compute per-document term frequencies
4. Generate embeddings with simulated model
5. Aggregate statistics

This tests real-world performance with significant data volume and computation.
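As a small worked example of the per-document statistics this pipeline computes (term counts, document length, number of unique terms), here is a standard-library sketch on one made-up sentence. The sample text is an assumption for illustration, and `collections.Counter` is used as a compact alternative to a manual counting loop.

```python
from collections import Counter

text = "Machine learning data data Python machine"

tokens = text.strip().lower().split()
term_freq = Counter(tokens)      # counts per distinct token
doc_length = len(tokens)         # total number of tokens
unique_terms = len(term_freq)    # equals len(set(tokens))

print(dict(term_freq), doc_length, unique_terms)
```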

In [21]:
# Generate synthetic documents
np.random.seed(42)

# Create vocabulary
vocab = [
    "machine",
    "learning",
    "data",
    "science",
    "python",
    "algorithm",
    "neural",
    "network",
    "deep",
    "model",
    "train",
    "test",
    "feature",
    "vector",
    "matrix",
    "optimization",
    "gradient",
    "descent",
    "regression",
    "classification",
]

# Increase to 5000 documents for better performance testing
num_docs = 5000
doc_length_range = (10, 50)

synthetic_docs = []
for i in range(num_docs):
    length = np.random.randint(*doc_length_range)
    words = np.random.choice(vocab, size=length, replace=True)
    text = " ".join(words)
    synthetic_docs.append({"doc_id": f"doc_{i:04d}", "text": text})

print(f"Generated {len(synthetic_docs)} synthetic documents")
print(f"Sample: {synthetic_docs[0]}")

Generated 5000 synthetic documents
Sample: {'doc_id': 'doc_0000', 'text': 'classification matrix train network neural regression train train science network data learning test algorithm learning machine test test gradient model optimization matrix matrix regression test classification data python regression neural deep neural descent science vector descent deep learning classification matrix neural test network matrix data vector gradient science'}


### HyperNodes Version - Heavy Pipeline

In [22]:
# Define processing nodes
@node(output_name="cleaned")
def clean_doc_text(text: str) -> str:
    return text.strip().lower()


@node(output_name="tokens")
def tokenize_doc(cleaned: str) -> List[str]:
    return cleaned.split()


@node(output_name="term_freq")
def compute_term_freq(tokens: List[str]) -> dict:
    """Compute term frequency."""
    freq = {}
    for token in tokens:
        freq[token] = freq.get(token, 0) + 1
    return freq


@node(output_name="embedding")
def encode_doc_heavy(text: str, encoder: SimpleEncoder) -> np.ndarray:
    """Encode document with some computational overhead."""
    # Simulate more expensive encoding
    embedding = encoder.encode(text)
    # Add some computation
    embedding = embedding * np.sqrt(np.sum(embedding**2) + 1e-8)
    return embedding


@node(output_name="doc_length")
def compute_doc_length(tokens: List[str]) -> int:
    return len(tokens)


@node(output_name="unique_terms")
def count_unique_terms(tokens: List[str]) -> int:
    return len(set(tokens))


# Build heavy pipeline
heavy_pipeline_hn = Pipeline(
    nodes=[
        clean_doc_text,
        tokenize_doc,
        compute_term_freq,
        encode_doc_heavy,
        compute_doc_length,
        count_unique_terms,
    ],
    name="heavy_document_pipeline_hn",
)

# Extract texts
texts_heavy = [doc["text"] for doc in synthetic_docs]

# Create encoder
encoder_heavy_hn = SimpleEncoder(dim=128, seed=42)

print("Running HyperNodes heavy pipeline...")
start_hn_heavy = time.time()
results_heavy_hn = heavy_pipeline_hn.map(
    inputs={"text": texts_heavy, "encoder": encoder_heavy_hn}, map_over="text"
)
elapsed_hn_heavy = time.time() - start_hn_heavy

print(f"\n⏱️  HyperNodes Heavy Pipeline:")
print(f"   Processed: {len(results_heavy_hn['embedding'])} documents")
print(f"   Time: {elapsed_hn_heavy:.4f}s")
print(f"   Throughput: {len(texts_heavy) / elapsed_hn_heavy:.2f} docs/sec")
print(f"   Avg doc length: {np.mean(results_heavy_hn['doc_length']):.1f} tokens")
print(f"   Avg unique terms: {np.mean(results_heavy_hn['unique_terms']):.1f}")

  [HN] Initializing encoder with dim=128, seed=42
Running HyperNodes heavy pipeline...

⏱️  HyperNodes Heavy Pipeline:
   Processed: 5000 documents
   Time: 2.0589s
   Throughput: 2428.46 docs/sec
   Avg doc length: 29.5 tokens
   Avg unique terms: 14.8



### Daft Version - Without UDFs?

**Note**: This heavy pipeline includes custom encoding logic (`HeavyEncoderDaft.encode_batch`) that has no built-in equivalent. Text cleaning and tokenizing might be expressible with built-in string expressions such as `.str.split()`, but the encoding step requires a UDF.

**Performance insight**: For large-scale data processing with custom transformations, UDFs are necessary, but Daft's parallel execution and batching keep them efficient. In the run recorded below, Daft finished in 0.9313s against 2.0589s for HyperNodes, a 2.21x speedup.
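The speedup figures printed in this notebook are the ratio of HyperNodes time to Daft time. As a quick check of that arithmetic, the sketch below recomputes the heavy-pipeline ratio from the timings recorded in this example's outputs. Because the printed timings are rounded to four decimals, derived throughput values can differ in the last digits from the printed ones.

```python
# Timings copied from the recorded outputs of the heavy pipeline.
hn_seconds = 2.0589
daft_seconds = 0.9313
num_docs = 5000

speedup = hn_seconds / daft_seconds
hn_throughput = num_docs / hn_seconds
daft_throughput = num_docs / daft_seconds

print(f"Speedup: {speedup:.2f}x")
print(f"HyperNodes: {hn_throughput:.0f} docs/sec")
print(f"Daft:       {daft_throughput:.0f} docs/sec")
```

A ratio below 1.0, as in Example 1, means Daft was slower in that run, which is expected when fixed startup overhead dominates a four-row workload.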

### Daft Version - Heavy Pipeline with Batch Optimization

In [23]:
# Define Daft UDFs
@daft.func
def clean_doc_daft(text: str) -> str:
    return text.strip().lower()


@daft.func
def tokenize_doc_daft(text: str) -> list[str]:
    return text.split()


@daft.func
def compute_term_freq_daft(tokens: list[str]) -> dict:
    """Compute term frequency."""
    freq = {}
    for token in tokens:
        freq[token] = freq.get(token, 0) + 1
    return freq


@daft.func
def compute_doc_length_daft(tokens: list[str]) -> int:
    return len(tokens)


@daft.func
def count_unique_terms_daft(tokens: list[str]) -> int:
    return len(set(tokens))


# Heavy encoder with batch support
@daft.cls
class HeavyEncoderDaft:
    def __init__(self, dim: int, seed: int = 42):
        print(f"  [Daft] Initializing HeavyEncoder dim={dim}")
        time.sleep(0.5)
        self.dim = dim
        self.rng = np.random.default_rng(seed)

    @daft.method.batch(return_dtype=DataType.python())
    def encode_batch(self, texts: Series) -> Series:
        """Batch encode with vectorized operations."""
        # Generate embeddings for all texts at once
        n = len(texts)
        embeddings = self.rng.random((n, self.dim), dtype=np.float32)

        # Vectorized scaling by the L2 norm (multiplies, mirroring
        # encode_doc_heavy; this is not a unit normalization)
        norms = np.sqrt(np.sum(embeddings**2, axis=1, keepdims=True) + 1e-8)
        embeddings = embeddings * norms

        return Series.from_pylist(list(embeddings))


# Create DataFrame
df_heavy = daft.from_pydict(
    {
        "doc_id": [d["doc_id"] for d in synthetic_docs],
        "text": [d["text"] for d in synthetic_docs],
    }
)

# Create encoder
encoder_heavy_daft = HeavyEncoderDaft(dim=128, seed=42)

print("\nRunning Daft heavy pipeline...")
start_daft_heavy = time.time()

# Build pipeline
df_heavy = df_heavy.with_column("cleaned", clean_doc_daft(df_heavy["text"]))
df_heavy = df_heavy.with_column("tokens", tokenize_doc_daft(df_heavy["cleaned"]))
df_heavy = df_heavy.with_column("term_freq", compute_term_freq_daft(df_heavy["tokens"]))
df_heavy = df_heavy.with_column(
    "embedding", encoder_heavy_daft.encode_batch(df_heavy["cleaned"])
)
df_heavy = df_heavy.with_column(
    "doc_length", compute_doc_length_daft(df_heavy["tokens"])
)
df_heavy = df_heavy.with_column(
    "unique_terms", count_unique_terms_daft(df_heavy["tokens"])
)

# Materialize
results_heavy_daft = df_heavy.collect()
elapsed_daft_heavy = time.time() - start_daft_heavy

# Get results as dict
heavy_dict = results_heavy_daft.to_pydict()

print(f"\n⏱️  Daft Heavy Pipeline:")
print(f"   Processed: {results_heavy_daft.count_rows()} documents")
print(f"   Time: {elapsed_daft_heavy:.4f}s")
print(f"   Throughput: {len(synthetic_docs) / elapsed_daft_heavy:.2f} docs/sec")
print(f"   Avg doc length: {np.mean(heavy_dict['doc_length']):.1f} tokens")
print(f"   Avg unique terms: {np.mean(heavy_dict['unique_terms']):.1f}")

print(f"\n{'=' * 60}")
print(f"📊 Performance Comparison:")
print(f"{'=' * 60}")
print(
    f"HyperNodes: {elapsed_hn_heavy:.4f}s ({len(texts_heavy) / elapsed_hn_heavy:.2f} docs/sec)"
)
print(
    f"Daft:       {elapsed_daft_heavy:.4f}s ({len(synthetic_docs) / elapsed_daft_heavy:.2f} docs/sec)"
)
print(f"Speedup:    {elapsed_hn_heavy / elapsed_daft_heavy:.2f}x")
print(f"{'=' * 60}")


Running Daft heavy pipeline...
  [Daft] Initializing HeavyEncoder dim=128

⏱️  Daft Heavy Pipeline:
   Processed: 5000 documents
   Time: 0.9313s
   Throughput: 5368.76 docs/sec
   Avg doc length: 29.5 tokens
   Avg unique terms: 14.8

📊 Performance Comparison:
HyperNodes: 2.0589s (2428.46 docs/sec)
Daft:       0.9313s (5368.76 docs/sec)
Speedup:    2.21x


### Key Observations - Heavy Pipeline

**What we're testing:**
- Processing 5000 synthetic documents
- Each document: 10-50 words from a 20-word vocabulary
- Multi-stage pipeline: clean → tokenize → compute stats → encode (128D) → aggregate

**Performance factors:**

1. **HyperNodes strengths**:
   - Very low overhead for simple operations
   - Efficient for CPU-bound row-wise processing
   - No serialization overhead for in-process execution
   - Direct Python execution

2. **Daft strengths**, which become more apparent with:
   - Larger datasets (10K+ documents)
   - More complex aggregations
   - Operations that benefit from columnar processing
   - Distributed execution needs
   - Batch UDFs that use vectorized operations

3. **The real advantage**: Daft's batch processing can be **significantly faster** when:
   - You use batch UDFs (`@daft.func.batch`, `@daft.method.batch`) with NumPy/PyArrow
   - Data doesn't fit in memory (streaming)
   - You need distributed processing
   - Operations are vectorizable (matrix operations, aggregations)
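
To make "vectorizable" concrete, here is a minimal NumPy-only sketch comparing row-wise and batched L2 normalization, the kind of operation the heavy encoder performs. The function names `normalize_rowwise` and `normalize_batch` are illustrative and not part of Daft or HyperNodes; the array shape is an arbitrary assumption.

```python
import numpy as np


def normalize_rowwise(vectors: np.ndarray) -> np.ndarray:
    """Normalize one row at a time, as a per-row UDF would."""
    out = np.empty_like(vectors)
    for i, vec in enumerate(vectors):
        out[i] = vec / np.sqrt(np.sum(vec**2) + 1e-8)
    return out


def normalize_batch(vectors: np.ndarray) -> np.ndarray:
    """Normalize every row in a single vectorized expression."""
    norms = np.sqrt(np.sum(vectors**2, axis=1, keepdims=True) + 1e-8)
    return vectors / norms


rng = np.random.default_rng(42)
embeddings = rng.random((1000, 128), dtype=np.float32)

rowwise = normalize_rowwise(embeddings)
batched = normalize_batch(embeddings)

# Both strategies should agree up to floating-point tolerance
print(np.allclose(rowwise, batched, atol=1e-6))
```

The batched version makes one pass through NumPy's compiled loops instead of 1,000 Python-level iterations, which is the effect a batch UDF is meant to exploit.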

**Try this**: Change `num_docs` to 50,000 or add more complex numpy operations in the encoder to see Daft's advantages grow!

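**Takeaway**: In the run above, Daft finished about 2.2x faster than HyperNodes (0.93s vs. 2.06s for 5,000 documents), helped by the batch encoder. HyperNodes remains a good fit for small-to-medium datasets where simplicity matters more than throughput; for large-scale, vectorizable workloads, Daft's optimization and batch processing shine.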
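
To put numbers on the comparison, throughput and speedup can be recomputed from the times printed in the run above. The helper names `throughput` and `speedup` are illustrative only. Because the printed times are rounded to four decimals, the recomputed throughput may differ from the printed values in the last digits.

```python
def throughput(num_docs: int, elapsed_s: float) -> float:
    """Documents processed per second."""
    return num_docs / elapsed_s


def speedup(baseline_s: float, candidate_s: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_s / candidate_s


# Times copied from the printed output above (rounded to 4 decimals)
hn_time, daft_time = 2.0589, 0.9313

print(f"HyperNodes: {throughput(5000, hn_time):.0f} docs/sec")
print(f"Daft:       {throughput(5000, daft_time):.0f} docs/sec")
print(f"Speedup:    {speedup(hn_time, daft_time):.2f}x")
```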

## Summary: Translation Patterns

### 1. Simple Transformations
**HyperNodes:**
```python
@node(output_name="result")
def transform(x: int) -> int:
    return x * 2
```

**Daft:**
```python
@daft.func
def transform(x: int) -> int:
    return x * 2

df = df.with_column("result", transform(df["x"]))
```

### 2. Map Operations
**HyperNodes:**
```python
pipeline.map(inputs={"items": data}, map_over="items")
```

**Daft:**
```python
df = daft.from_pydict({"items": data})
df = df.with_column("result", func(df["items"]))
```

### 3. Stateful Processing
**HyperNodes:**
```python
encoder = Encoder()  # Initialize once
pipeline.map(inputs={"text": texts, "encoder": encoder}, map_over="text")
```

**Daft:**
```python
@daft.cls
class Encoder:
    def __init__(self): ...
    @daft.method(...)
    def encode(self, text): ...

encoder = Encoder()  # Lazy init
df = df.with_column("encoded", encoder.encode(df["text"]))
```
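
The idea both patterns share is "pay the setup cost once, reuse the instance for every row". A framework-free sketch of that idea follows; `CountingEncoder` is a hypothetical toy class, not part of either library, and its placeholder embedding is an assumption made purely for illustration.

```python
class CountingEncoder:
    """Toy encoder that tracks how many times expensive setup runs."""

    init_calls = 0

    def __init__(self, dim: int = 4):
        # Stand-in for expensive setup such as loading a model
        type(self).init_calls += 1
        self.dim = dim

    def encode(self, text: str) -> list[float]:
        # Deterministic placeholder embedding based on text length
        return [float(len(text))] * self.dim


texts = ["hello world", "daft", "hypernodes"]

encoder = CountingEncoder()  # setup happens once, here
embeddings = [encoder.encode(t) for t in texts]  # instance reused per row

print(CountingEncoder.init_calls)
print(len(embeddings))
```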

### 4. Batch Operations
**HyperNodes:**
```python
# Manual batching or row-wise processing
```

**Daft:**
```python
@daft.func.batch(return_dtype=...)
def process_batch(series: Series) -> Series:
    # Vectorized operations
    return result_series
```
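
For comparison, "manual batching" outside of Daft might look like the following plain-Python sketch. The helper `chunked` and the placeholder `count_tokens_batch` are hypothetical names introduced here; this is one possible approach, not a HyperNodes API.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def chunked(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while batch := list(islice(iterator, batch_size)):
        yield batch


def count_tokens_batch(texts: List[str]) -> List[int]:
    """Placeholder batch function: token count per text."""
    return [len(text.split()) for text in texts]


docs = ["a b c", "d e", "f", "g h i j", "k"]

# Process two documents at a time, then flatten the per-batch results
results = [n for batch in chunked(docs, 2) for n in count_tokens_batch(batch)]
print(results)
```

With a batch UDF, Daft takes over the chunking step and hands the function a whole `Series` at a time.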

### Key Advantages of Daft

1. **Lazy Evaluation**: Daft optimizes the entire pipeline before execution
2. **Automatic Parallelization**: No need to manually configure parallelism
3. **Batch Processing**: Easy to leverage vectorized operations
4. **Generator Support**: Built-in support for one-to-many transformations
5. **Struct Unnesting**: Elegant handling of nested data structures
6. **Scalability**: Designed for distributed execution
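
Lazy evaluation is the advantage that differs most from eager, node-by-node execution, so here is a toy sketch of the idea. `LazyTable` is a hypothetical class written for this illustration; it only defers work until `collect()` and performs none of the plan optimization that Daft does, and its `with_column` signature is an assumption that does not match Daft's.

```python
from typing import Any, Callable, Dict, List


class LazyTable:
    """Toy lazy table: records column steps, runs them on collect()."""

    def __init__(self, columns: Dict[str, List[Any]]):
        self._columns = columns
        self._steps: List[tuple] = []

    def with_column(
        self, name: str, fn: Callable[[Any], Any], source: str
    ) -> "LazyTable":
        # Nothing is computed here; the step is only recorded
        new = LazyTable(self._columns)
        new._steps = self._steps + [(name, fn, source)]
        return new

    def collect(self) -> Dict[str, List[Any]]:
        # Execution happens only now, in the recorded order
        result = dict(self._columns)
        for name, fn, source in self._steps:
            result[name] = [fn(value) for value in result[source]]
        return result


calls: List[str] = []


def clean(text: str) -> str:
    calls.append(text)  # track when the function actually runs
    return text.strip().lower()


table = LazyTable({"text": ["  Hello ", " World"]})
plan = table.with_column("cleaned", clean, "text").with_column(
    "length", len, "cleaned"
)

print(len(calls))  # building the plan has not called clean()
out = plan.collect()
print(out["length"])
```

Because the full list of steps is known before anything runs, a real engine has the chance to reorder, fuse, or parallelize them.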

### When to Use Each

**Use HyperNodes when:**
- You need explicit DAG visualization and control
- You want fine-grained caching at the node level
- Your pipeline has complex branching logic
- You need to inspect intermediate results easily

**Use Daft when:**
- Performance is critical
- You're processing large datasets
- You want automatic optimization
- You need distributed execution
- Your operations can be vectorized

## When to Use Built-in Operations vs UDFs

### ✅ Prefer Built-in Operations When:

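1. **String Operations** (coverage varies by version): `.str.contains()`, `.str.split()`
   - ⚠️ **Note**: This notebook uses UDFs for `.strip()`, `.lower()`, `.upper()`, and `.replace()`. Built-in string coverage varies by Daft version, so check the expression reference for your release before assuming an operation is unavailable.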
2. **List Operations**: `.list.length()`, `.list.get()`, `.explode()`, `.list.join()`
3. **Arithmetic**: `+`, `-`, `*`, `/`, `%` work on columns directly
4. **Struct Access**: `df["struct"]["field"]` for nested field access
5. **Aggregations**: `.sum()`, `.mean()`, `.count()`, etc.

**Why?** Built-ins are optimized, well-tested, and often faster than custom UDFs when available.

### ⚠️ Use UDFs When:

1. **Text Cleaning**: `.strip()`, `.lower()`, `.upper()` etc. need a UDF whenever your Daft version does not offer them as built-ins
2. **Custom Logic**: Business logic that doesn't map to built-ins (e.g., custom encoding, validation)
3. **Expensive Initialization**: Loading models, connecting to databases (`@daft.cls` for stateful processing)
4. **Complex Transformations**: Multi-step logic that's clearer as a function
5. **External Libraries**: Calling specialized libraries (e.g., ML models, scientific computing)
6. **Batch Processing**: Vectorized operations on NumPy arrays (`@daft.func.batch`)

**Why?** Some operations simply can't be expressed with built-ins, and UDFs provide the flexibility needed.

### 🎯 Best Practice: Mix and Match

Real-world pipelines typically combine both:
- Use built-ins for standard transformations (string splitting, list operations, struct access)
- Use UDFs for text cleaning and custom business logic
- Profile to identify bottlenecks

### Performance Hierarchy (fastest to slowest)

1. **Built-in operations** - Optimized native (Rust) implementations
2. **Batch UDFs** (`@daft.func.batch`) - Process multiple rows at once
3. **Class UDFs** (`@daft.cls`) - Stateful with initialization overhead
4. **Simple UDFs** (`@daft.func`) - Per-row Python function calls

Choose the simplest tool that solves your problem!

In [24]:
print("\n" + "=" * 60)
print("Tutorial Complete!")
print("=" * 60)
print("\nKey Takeaways:")
print("1. Daft uses DataFrame operations instead of explicit pipelines")
print("2. @daft.func for simple transformations")
print("3. @daft.cls for stateful operations with initialization")
print("4. @daft.func.batch for high-performance vectorized operations")
print("5. Generators and struct unnesting provide elegant data shaping")
print("6. Lazy evaluation enables automatic optimization")
print("\nBoth frameworks have their place - choose based on your needs!")


Tutorial Complete!

Key Takeaways:
1. Daft uses DataFrame operations instead of explicit pipelines
2. @daft.func for simple transformations
3. @daft.cls for stateful operations with initialization
4. @daft.func.batch for high-performance vectorized operations
5. Generators and struct unnesting provide elegant data shaping
6. Lazy evaluation enables automatic optimization

Both frameworks have their place - choose based on your needs!
