# Workshop Notebook 2 - DocSets and Document Processing

In this notebook, we will scale from one document to two, using sycamore to apply various forms of processing to each of them, in order to write them to a database
and be able to answer questions like:

0. In the Broadcom earnings call, what details did the CEO, Hock Tan, discuss about the VMware acquisition?

## Sycamore basics

By now you have a basic sense of the data model - a Document is made up of Elements which represent logical chunks of the Document, and contain additional metadata about themselves.
The next step is to scale this past one document to many, and this is where Sycamore comes in. Sycamore adds a data structure called a DocSet, which is a set of Documents.
Each Document in the DocSet contains the list of Elements that it comprises, and a bunch of metadata as well (for instance, the name of the file the document came from).

Now you'll likely want to apply a series of transformations to the Documents before you write them to a database. You can imagine writing a big for loop over all the documents and
calling a series of functions on them in order. Maybe you throw `multiprocessing` at it to parallelize it. Maybe you run nested loops to do some sort of batching. You have to do a 
lot of work to optimize it, and you still probably aren't using memory as efficiently as you could be. 

DocSets make processing large amounts of documents easy. DocSet methods are mostly processing steps to be applied to every document in the DocSet - so instead of writing
```python
# without docsets
processed_documents = []
for document in list_of_documents:
    processed_documents.append(foo(document))
```
You can write
```python
# with docsets
processed_docset = docset.map(foo)
```

### Execution modes

Each docset is bound to a Sycamore Context, which is the execution engine that actually executes
the processing steps. We've implemented 2 execution modes, `LOCAL` and `RAY`. `RAY` mode executes 
the DocSet on a [ray](https://www.ray.io/) cluster, creating one locally if it does not find an 
existing ray cluster. This mode scales well, running transforms on Documents in parallel across 
processes (and nodes if you've set it up), but it can be tricky to debug - distributed stack traces 
are notoriously unwieldy. `LOCAL` mode runs in single-threaded python in the process and is generally
better for debugging, but you lose the distributed/parallel aspect. For the beginning of the workshop,
we will run in `LOCAL` mode, and then transition to `RAY` when we have ironed out the DocSet plan.

In [None]:
# This is a patch to allow sycamore to make asynchronous llm calls
# in local mode within a jupyter notebook.
import nest_asyncio
nest_asyncio.apply()

In [None]:
import sycamore
from sycamore import ExecMode

context = sycamore.init(exec_mode = ExecMode.LOCAL)

To create the DocSet, we need to tell sycamore how to read in the initial data.

In [None]:
from pathlib import Path

repo_root = Path.cwd()
pdf_dir = repo_root / "files" / "earnings_calls"
two_pdfs = [str(pdf_dir / "broadcom-avgo-q1-2024-earnings-call-transcript.pdf"), str(pdf_dir / "mongodb-mdb-q1-2024-earnings-call-transcript.pdf")]

pdf_docset = context.read.binary(paths=two_pdfs, binary_format="pdf")

# Let's see what that gave us
pdf_docset.show()

Our docset has two Documents in it, with a 'properties' dict containing some metadata, an 'elements' list containing an empty list of elements, a doc_id, lineage_id, type, and binary_representation, which contains the binary of the original PDF.
To get the elements as before, we'll want to run the `partition` transform.

In [None]:
from sycamore.transforms.partition import ArynPartitioner

# If you did not see the error message about API keys, ignore this comment.
# You might need to add aryn_api_key="<YOUR KEY>" if the environment didn't pick it up correctly. 
partitioned_docset = pdf_docset.partition(ArynPartitioner())

# We'll limit the number of elements to show because otherwise this produces an obnoxiously large output cell
partitioned_docset.show(num_elements=5)

As with aryn_sdk, we can visualize the bounding boxes with sycamore. Note that this will re-partition the documents. This is an intentional design choice within sycamore:
trying to hold an entire docset in memory at once doesn't necessarily scale, so the alternative is 
'lazy execution' - re-executing all the processing jobs. We'll show you how to optimize this in a few
cells.

In [None]:
from sycamore.utils.pdf_utils import show_pages

show_pages(partitioned_docset)

Wait a second.

Running `show_pages` and `show` ran the whole program all over again! This could get really cumbersome to work with, especially as we add additional transforms to our processing
pipeline in development. I have a solution for you: `materialize`. But first, a diversion on lazy execution.

DocSets are evaluated lazily, which means that as you're developing, the only thing held in memory in the DocSet object itself is an execution plan. To get the data in the DocSet,
you have to 'execute' it - i.e. tell sycamore to run all the steps in the execution plan, from reading in the data to each transform. This allows sycamore to apply these sorts of 
parallelization/batch/streaming optimizations without you having to think about them. However, it comes with a drawback - accessing the documents themselves for ad-hoc inspection
can be a little bit difficult. For example, DocSets do not provide random access to data.

I often find it easier to think about a DocSet as a program than as a data structure.
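
To make the "program, not data structure" idea concrete, here is a minimal pure-Python sketch. `ToyDocSet`, `read_docs`, and `shout` are hypothetical names invented for this illustration, and the class is a toy model rather than sycamore's implementation. It only shows that adding a transform records a step, and that every execution re-runs the plan from the source.

```python
from typing import Callable, Iterable, Iterator, List


class ToyDocSet:
    """A toy stand-in for a lazily evaluated DocSet (not sycamore's implementation)."""

    def __init__(self, source: Callable[[], Iterable[dict]], steps: tuple = ()):
        self._source = source  # zero-argument function that reads the raw documents
        self._steps = steps    # the "execution plan": functions applied in order

    def map(self, fn: Callable[[dict], dict]) -> "ToyDocSet":
        # Nothing runs here; we only return a new plan with one more step.
        return ToyDocSet(self._source, self._steps + (fn,))

    def _run(self) -> Iterator[dict]:
        for doc in self._source():
            for fn in self._steps:
                doc = fn(doc)
            yield doc

    def take_all(self) -> List[dict]:
        # Executing re-reads the source and re-applies every step.
        return list(self._run())


read_count = 0  # counts how many times the source is read


def read_docs() -> List[dict]:
    global read_count
    read_count += 1
    return [{"text": "hello"}, {"text": "world"}]


def shout(doc: dict) -> dict:
    return {**doc, "text": doc["text"].upper()}


plan = ToyDocSet(read_docs).map(shout)
reads_after_planning = read_count  # 0: building the plan did no work
first = plan.take_all()
second = plan.take_all()
print(reads_after_planning, read_count)  # each execution re-ran the whole plan
```

The source is read once per `take_all()` call, which is the same behavior we just saw when `show` and `show_pages` each re-partitioned the documents.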

There are a few methods that execute a DocSet:

- `docset.execute()` executes the docset and does nothing with the resulting Documents. Most production pipelines use this to run.
- `docset.take_all()` (and its friend `docset.take(n)`) executes the docset and returns the Documents in a plain python list. This is useful for debugging and development, when datasets are still small.
- `docset.count()` executes the docset and returns the number of Documents in it. This is most useful when debugging filters (map transforms don't change the size of the docset).
- `docset.show()` executes the docset and prints the first couple of Documents. This is good for development.
- `docset.write.<some_target>()` executes the docset and writes the documents out to some target - could be a database like opensearch, or just the filesystem. Most of these writers have an `execute` flag that determines whether to execute the write (and return nothing) or just return a DocSet with the write in the plan.

### Materialize & Memoization

There's a technique for optimizing recursive functions called memoization - essentially, the first time
you call the function with a given set of parameters, compute the result and cache it. Then, in all 
subsequent calls, simply look up the pre-computed result. Sycamore can do a similar thing with 
`docset.materialize()`, using the disk as a cache.

When sycamore compiles a DocSet into an execution plan, it starts from the end and works toward the
beginning. When it sees a `materialize`, it looks in the location where the `materialize` thinks its
cache lives, and if it finds data, it finishes compiling and reads the data in from the cache location, 
essentially truncating the docset program to only the stuff after the `materialize`. However if it does
not find data in its cache, it adds a step to the program to write data _to_ the cache and continues 
compiling.

Code-wise, the `materialize` method takes two parameters: a path to the cache, which can be in the local
filesystem or S3, and a `MaterializeSourceMode`, which is an enum with 2 values: `RECOMPUTE` and 
`USE_STORED`. `RECOMPUTE` tells the materialize not to act as a cache, but to always write out the data.
This is more useful for debugging. `USE_STORED` tells materialize to act as a cache and do the memoize
thing.
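
The cache-or-compute decision described above can be sketched in a few lines of plain Python. This is a toy model under stated assumptions, not sycamore's implementation: `run_with_materialize`, `expensive_partition`, the single JSON cache file, and the `use_stored` flag are all hypothetical stand-ins (the flag plays the role of `USE_STORED` vs. `RECOMPUTE`).

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, List


def run_with_materialize(
    compute: Callable[[], List[dict]], cache_file: Path, use_stored: bool = True
) -> List[dict]:
    """Toy model of a materialize step (hypothetical helper)."""
    if use_stored and cache_file.exists():
        # Cache hit: skip everything upstream and read the stored documents.
        return json.loads(cache_file.read_text())
    # Cache miss (or recompute requested): run upstream, then write the result out.
    docs = compute()
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(docs))
    return docs


calls = 0  # counts how often the expensive upstream step actually runs


def expensive_partition() -> List[dict]:
    global calls
    calls += 1
    return [{"doc_id": "a", "elements": ["p1", "p2"]}]


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "twodocs-partitioned" / "docs.json"
    first = run_with_materialize(expensive_partition, cache)   # miss: computes and stores
    second = run_with_materialize(expensive_partition, cache)  # hit: reads from disk
    calls_with_cache = calls
    third = run_with_materialize(expensive_partition, cache, use_stored=False)  # always recomputes
    calls_after_recompute = calls

print(calls_with_cache, calls_after_recompute)
```

The second call never touches `expensive_partition`, which is the "truncate the program at the materialize" behavior in miniature.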

In [None]:
from sycamore.materialize import MaterializeSourceMode

materialize_dir = repo_root / "materialize"

materialized_ds = partitioned_docset.materialize(path = materialize_dir / "twodocs-partitioned", source_mode = MaterializeSourceMode.USE_STORED)

materialized_ds.execute()
print("Finished executing the first time")

In [None]:
# Note that the second time this is fast
materialized_ds.execute()
print("Finished executing the second time")

## Sycamore UDFs

Since we downloaded our documents for free from the internet, we've ended up with some advertisements
in them. Inspecting the elements and their types shows that we can clean most of them up by throwing out images.
For other workloads this probably doesn't apply, but here it provides a lovely opportunity to demonstrate one of the four most useful docset udf-transforms, `filter_elements`. Here's the list of udf transforms:

- `docset.map(f)`: Applies a function (`Document` -> `Document`) to every Document in the DocSet
- `docset.map_elements(f)`: Applies a function (`Element` -> `Element`) to every Element in every Document in the DocSet
- `docset.filter(f)`: Applies a predicate function (`Document` -> `bool`) to every Document, keeping only those Documents for which f(Document) is True
- `docset.filter_elements(f)`: Applies a predicate function (`Element` -> `bool`) to every Element in every Document, keeping only Elements for which f(Element) is True
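
To pin down what each of the four transforms does, here is a rough sketch of their semantics over plain dicts and lists. The `toy_*` functions and the sample documents are hypothetical; real DocSets operate on `Document` and `Element` objects and run lazily.

```python
from typing import Callable, Dict, List


def toy_map(docs: List[Dict], f: Callable[[Dict], Dict]) -> List[Dict]:
    # Document -> Document, applied to every document
    return [f(d) for d in docs]


def toy_map_elements(docs: List[Dict], f: Callable[[Dict], Dict]) -> List[Dict]:
    # Element -> Element, applied to every element of every document
    return [{**d, "elements": [f(e) for e in d["elements"]]} for d in docs]


def toy_filter(docs: List[Dict], pred: Callable[[Dict], bool]) -> List[Dict]:
    # Keep only documents for which the predicate is True
    return [d for d in docs if pred(d)]


def toy_filter_elements(docs: List[Dict], pred: Callable[[Dict], bool]) -> List[Dict]:
    # Keep every document, but only the elements for which the predicate is True
    return [{**d, "elements": [e for e in d["elements"] if pred(e)]} for d in docs]


docs = [
    {"name": "a.pdf", "elements": [{"type": "Text", "text": "hi"}, {"type": "Image", "text": ""}]},
    {"name": "b.pdf", "elements": [{"type": "Image", "text": ""}]},
]

no_images = toy_filter_elements(docs, lambda e: e["type"] != "Image")
non_empty = toy_filter(no_images, lambda d: len(d["elements"]) > 0)
print([d["name"] for d in non_empty])
```

Note that in this sketch `toy_filter_elements` leaves the number of documents unchanged (`b.pdf` survives with an empty element list); only `toy_filter` changes how many documents there are.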

In [None]:
from sycamore.data import Element

def kill_images(elt: Element) -> bool:
    return elt.type != "Image"

# docset.filter_elements takes a predicate function that maps Elements to bools. 
# For each element in a document, keep the element only if predicate(element) is True.
filtered_ds = materialized_ds.filter_elements(kill_images)

Sometimes, you'll want to redo a step that's been materialized. The simplest option is to remove the directory with all the cached data, e.g. `rm -r materialize/twodocs-partitioned`.

### Debugging

Debugging distributed systems can be tricky, but it's possible with a little bit of creativity. First
off, the execution-forcing methods above are useful - particularly `take` and `take_all`, since they 
give you back the Documents. Materializing a docset persists the documents to disk, which allows
dedicated debugging scripts and even sharing. Printed output can be hard to find, as the log 
streams tend to get fairly polluted, so I will sometimes simply write a function that writes
a piece of data for every Document/Element to a file and apply it with a `map` or `map_elements` like so:

```python
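import json

def debug_doc(doc):
    # Append one record per Document to a local file for later inspection.
    with open("debug.txt", "a") as f:
        f.write(f"Document {doc.doc_id}\n")
        # Dump each Element's underlying data dict; default=str covers values
        # that json can't serialize natively (e.g. bytes).
        f.write(json.dumps([e.data for e in doc.elements], indent=2, default=str))
        f.write("\n" + '-' * 80 + "\n")
    return doc

# map is lazy: debug_doc only runs once the resulting docset is executed
debug_ds = docset.map(debug_doc)
debug_ds.execute()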
```

## Schema Extraction

Ok, let's go back to the question we were trying to answer:  "In the Broadcom earnings call, what details did the CEO, Hock Tan, discuss about the VMware acquisition?" Now notice that we have two documents loaded into our DocSet, a Broadcom earnings call and a MongoDB earnings call. To successfully answer this question we'll have to perform the following steps:

1. Identify the Broadcom document
2. Identify the elements where Hock Tan is speaking
3. Identify the element where he mentions VMware.

Now notice that the first step requires identifying the Broadcom document. A reasonable way to accomplish this is to extract the company from each document. Then we can filter the 
documents like we did the elements.

One of sycamore's biggest benefits is its ability to interact with LLMs in this kind of data-flow-y way. LLMs are good at understanding unstructured data, so for processing unstructured
documents, they're a very useful tool. They make it easy to extract common metadata properties from documents, and with sycamore we can very easily apply this to all documents in a docset.

In [None]:
from sycamore.llms.openai import OpenAI, OpenAIModels
from sycamore.llms.llms import LLMMode
from sycamore.transforms.extract_schema import LLMPropertyExtractor

# You might need to explicitly set an api key here if it's not picked up from the environment variables
# Add parameter: api_key = "<key>"
gpt4o = OpenAI(OpenAIModels.GPT_4O)

schema = {
    "type": "object",
    "properties": {
        "quarter": {
            "type": "string",
            "description": "Quarter of the earnings call, it should be in the format of Q1, Q2, Q3, Q4",
        },
        "date":{"type": "string", "description": "The date of the earnings call"}
    },
}

# Quiz: As is, this property extraction will never run, even if I do something 
#       like `materialized_ds.execute()`. Why?
#       Hint: Compare to how we're adding transforms to the docset in other places.
filtered_ds.extract_properties(LLMPropertyExtractor(llm=gpt4o, schema=schema))

Now see if you can add a `company_name` and `company_ticker` property to this schema and extract properties into a docset named `extracted_ds`:

In [None]:
schema = {
    "type": "object",
    "properties": {
        "quarter": {
            "type": "string",
            "description": "Quarter of the earnings call, it should be in the format of Q1, Q2, Q3, Q4",
        },
        "date":{"type": "string", "description": "The date of the earnings call"},


#        ... # Fill in the rest!

extracted_ds = ...

In [None]:
# Test that the schema is right. We'll reference these properties later.
for doc in extracted_ds.take(1):
    print(doc.properties)
    assert 'entity' in doc.properties
    ec = doc.properties['entity']
    assert 'date' in ec
    assert 'quarter' in ec
    assert 'company_name' in ec
    assert 'company_ticker' in ec

Great! Now is there any optimization you can add to memoize the results of the LLM calls so that future docset executions can skip it?

## Chunking

Now, for our question answering system to be able to detect that this is the element where Hock Tan discusses the VMware acquisition, we'll need a way to associate the "speaker element" that is a few paragraphs above it with this last element. The way to do that is through chunking. Sycamore implements a number of chunking strategies (documentation [here](https://sycamore.readthedocs.io/en/stable/sycamore/APIs/low_level_transforms/merge_elements.html)). 
For this workshop we will use the `MarkedMerger` as it is the most customizable.

So, to be able to answer questions like the one about Hock Tan, we'll chunk such that we get one chunk for each speaker 'block'. In our partitioning we have split the text into paragraphs, but we'd like to squish all those paragraphs together, breaking the blocks wherever there's a new speaker. With a little bit of effort we can detect the lines that introduce speakers with regexes - one for external speakers and one for internal speakers, as the formatting is very consistent (this applies across all the documents in the dataset, don't worry):

```python
external_re = r'([^ ]*[^\S\n\t]){1,4}--[^\S\n\t].*--' # A name (1-4 words long) followed by -- followed by anything followed by --
internal_re = r'([^ ]*[^\S\n\t]){1,4}--.*'            # A name (1-4 words long) followed by -- followed by anything
```

We'll also add a condition that the 'speaker' chunks be one line: occasionally we get a paragraph 
where the speaker kinda stutters at the beginning of their speech, which gets transcribed as a '--' and can
trip up the regex.
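
As a quick sanity check of the two regexes plus the one-line condition, here is a small standalone sketch. The sample lines are made up for illustration (they are not taken from the transcripts), and `classify_speaker_line` is a hypothetical helper, not part of sycamore.

```python
import re
from typing import Optional

EXTERNAL_RE = re.compile(r'([^ ]*[^\S\n\t]){1,4}--[^\S\n\t].*--')
INTERNAL_RE = re.compile(r'([^ ]*[^\S\n\t]){1,4}--.*')


def classify_speaker_line(text: str) -> Optional[str]:
    """Return 'external', 'internal', or None for a chunk of text."""
    # Same one-line condition as in the prose: at most one newline.
    if text.count("\n") > 1:
        return None
    # Check the external pattern first: any external line also matches the internal one.
    if EXTERNAL_RE.match(text):
        return "external"
    if INTERNAL_RE.match(text):
        return "internal"
    return None


samples = [
    "Jane Doe -- Example Bank -- Analyst",    # name -- org -- role
    "John Smith -- Chief Financial Officer",  # name -- role
    "Thanks for the question.",               # ordinary speech
    "I -- I think that is right.\nSecond line.\nThird line.",  # stutter inside a paragraph
]
labels = [classify_speaker_line(s) for s in samples]
print(labels)

# Without the one-line condition, the stuttered paragraph would look like a speaker line:
stutter_matches_regex = INTERNAL_RE.match(samples[3]) is not None
```

The last sample is the case the one-line condition guards against: the regex alone matches it, but the multi-line check rejects it.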

The `MarkedMerger` is set up perfectly to work with this. It will step through the elements, merging them together one by one, unless it sees one of two 'marks' in the data:

- on a "_drop" mark it drops the element and continues merging
- on a "_break" mark it finalizes the merged element and uses this one to start merging up a new element
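
The two marks can be modeled with a short loop over plain dicts. This is only a toy sketch of the behavior described above, not `MarkedMerger`'s actual code: `toy_marked_merge` is a hypothetical function, the element dicts are invented, and the choice to join text with newlines and keep only the first element's properties is an assumption made for the illustration.

```python
from typing import Dict, List, Optional


def toy_marked_merge(elements: List[Dict]) -> List[Dict]:
    merged: List[Dict] = []
    current: Optional[Dict] = None
    for elt in elements:
        if elt.get("_drop"):
            # Drop mark: skip this element and keep merging.
            continue
        if current is None or elt.get("_break"):
            # Break mark: finalize the chunk in progress and start a new one here.
            if current is not None:
                merged.append(current)
            current = {"text": elt["text"], "properties": dict(elt.get("properties", {}))}
        else:
            # No mark: fold this element into the chunk in progress.
            current["text"] += "\n" + elt["text"]
    if current is not None:
        merged.append(current)
    return merged


elements = [
    {"text": "Operator", "_break": True, "properties": {"speaker_name": "Operator"}},
    {"text": "Welcome to the call."},
    {"text": "[advertisement]", "_drop": True},
    {"text": "Jane Doe -- Chief Executive Officer", "_break": True,
     "properties": {"speaker_name": "Jane Doe"}},
    {"text": "Thank you."},
    {"text": "Revenue grew."},
]
chunks = toy_marked_merge(elements)
for chunk in chunks:
    print(chunk["properties"], repr(chunk["text"]))
```

Each speaker line starts a new chunk, and the paragraphs that follow are folded into it until the next break.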

In the following cell, the first case (when the speaker is "Operator") has been left as an exercise.

In [None]:
import re
from sycamore.transforms.merge_elements import MarkedMerger

def mark_speakers(elt: Element) -> Element:
    if not elt.text_representation:
        return elt

    external_speaker = re.match(r'([^ ]*[^\S\n\t]){1,4}--[^\S\n\t].*--', elt.text_representation)
    internal_speaker = re.match(r'([^ ]*[^\S\n\t]){1,4}--.*', elt.text_representation)
    is_one_line = elt.text_representation.count("\n") <= 1
    if elt.text_representation.strip() == 'Operator':
        # The operator is also a speaker! In this case, we should set
        # the 'speaker' property to True and the 'speaker_role' and 
        # 'speaker_name' properties to the string 'Operator'. We should 
        # also tell the MarkedMerger to break.
        raise NotImplementedError("I thought operators were an algebra thing!")
    elif external_speaker and is_one_line:
        parts = [p.strip() for p in elt.text_representation.split("--")]
        elt.properties['speaker_name'] = parts[0]
        elt.properties['speaker_external_org'] = parts[1]
        elt.properties['speaker_role'] = parts[2]
        elt.properties['speaker'] = True
        elt.data["_break"] = True
    elif internal_speaker and is_one_line:
        parts = [p.strip() for p in elt.text_representation.split("--")]
        elt.properties['speaker_name'] = parts[0]
        elt.properties['speaker_role'] = parts[1]
        elt.properties['speaker'] = True
        elt.data["_break"] = True
    return elt

# Also here's a nice way of writing chained pipelines
merged_ds = (
    extracted_ds
    .map_elements(mark_speakers)
    .merge(MarkedMerger())
)

## Initial Question Answering

Now we should be able to get the data requisite to answer our first question, even without a database 
behind it. With just a bunch of filters we are able to narrow down the docset to exactly the one 
document that answers:

0. In the Broadcom earnings call, what details did the CEO, Hock Tan, discuss about the VMware acquisition?

To answer this question we can do the following:

1. Identify the Broadcom document
2. Identify the elements where Hock Tan is speaking
3. Identify the element where he mentions VMware.

We can translate this into a series of sycamore filters like so:

In [None]:
broadcom_qads = (
    merged_ds
    .filter(lambda doc: doc.properties['entity']['company_ticker'] == 'AVGO')
    .filter_elements(lambda elt: elt.properties.get('speaker_name') == 'Hock Tan')
    .filter_elements(lambda elt: "vmware" in elt.text_representation.lower())
)

documents = broadcom_qads.take_all()
# I happen to know that there is only one broadcom document (of the two documents in the docset)
assert len(documents) == 1
doc = documents[0]
print(doc.properties)
for e in doc.elements:
    print(e.properties)
    print(e.text_representation)

We can be confident that these are all the places where Hock Tan brought up VMware in the Broadcom earnings call. If we wanted a more concise answer we would probably just ask ChatGPT to summarize the
results.

Now let's try to answer another question on mongodb in a similar way.

2. What did the MongoDB president mention about their competitor Amazon DynamoDB?

We'll use a similar sort of plan:

1. Filter to the mongodb document (stock ticker = "MDB")
2. Filter to elements where the speaker role contains "President"
3. Filter to an element containing 'DynamoDB'

In [None]:
mongodb_qads = (
    merged_ds
    # Fill it out yo'self
)

documents = mongodb_qads.take_all()
# I happen to know that there is only one MDB document (of the two documents in the docset)
doc = documents[0]
print(doc.properties)
for e in doc.elements:
    print(e.properties)
    print(e.text_representation)

I'll admit that this may look stupid. Why wouldn't we just write all the documents to a database and then do the search that way?
Well, yes, we'll enable that next. But we'll come back to this docset-based strategy for question-answering, as it allows you to 
answer almost arbitrarily complex questions that a search database may not support.

Now let's enable approximate search:

## Embedding

In order to do that, we'll need to write our docset to a database, and embed the text of our elements to use k-nearest-neighbor vector 
search to retrieve relevant chunks for an LLM to summarize.
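
For intuition about what k-nearest-neighbor vector search does with those embeddings, here is a tiny sketch using hand-written 3-dimensional vectors. Everything in it is hypothetical: real embeddings come from an embedding model and have many more dimensions, the chunk names are invented, and a database would use an index rather than this brute-force scan.

```python
import math
from typing import Dict, List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def k_nearest(query: List[float], chunks: Dict[str, List[float]], k: int) -> List[str]:
    # Rank every chunk by similarity to the query and keep the k best.
    ranked = sorted(chunks.items(), key=lambda item: cosine_similarity(query, item[1]), reverse=True)
    return [name for name, _ in ranked[:k]]


# Toy "embeddings": nearby vectors stand in for chunks with similar meaning.
chunks = {
    "vmware_chunk": [0.9, 0.1, 0.0],
    "ai_revenue_chunk": [0.1, 0.9, 0.1],
    "greeting_chunk": [0.0, 0.1, 0.9],
}
query_vector = [0.8, 0.2, 0.0]  # pretend this is the embedded question

top = k_nearest(query_vector, chunks, k=2)
print(top)
```

The retrieved chunks are what would then be handed to an LLM to summarize.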

Embedding data with sycamore is fairly simple, so I'm going to give you all the information you need to do it and let you write it out.
There is a method on DocSets called `embed()`. It takes an `Embedder` as its parameter. We'll use the `OpenAIEmbedder`, which you can import from `sycamore.transforms.embed`. It takes a `model_name` parameter
but we'll use the default. This will embed the text_representation of all elements.

In [None]:
# Your code here
from ... import ...

embedded_ds = merged_ds...

## Ingestion

We'll be writing our data to Aryn (because what kind of workshop would this be if we didn't stand behind our own data warehouse). Sycamore can
also write to a number of other systems, such as OpenSearch, ElasticSearch, Weaviate, etc. 

The unit of storage in Aryn equivalent to an index in OpenSearch or a table in a SQL DB is a 'DocSet.' While a Sycamore DocSet is usually best 
understood as a program, an Aryn DocSet is actually a container. We can create one using aryn_sdk, and then write our (sycamore) docset to it.

First I'll add in a `spread_properties` transform, which copies properties from every Document to each of its Elements, so that all the elements have
the `entity` and `path` metadata associated with the document. This will help my queries work properly: Aryn stores elements in OpenSearch under the hood, so when filtering on a particular property,
the elements all need to have that property in them, so I'll be able to filter by `entity.company_ticker = AVGO` and get the elements back.
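
Here is a rough sketch of what spreading properties means, using plain dicts. `toy_spread_properties` is a hypothetical function and the sample path and property values are invented; it only illustrates why element-level filtering needs the document-level properties copied down.

```python
import copy
from typing import Dict, List


def toy_spread_properties(doc: Dict, keys: List[str]) -> Dict:
    """Copy the named document-level properties onto every element (toy model)."""
    out = copy.deepcopy(doc)
    for elt in out["elements"]:
        for key in keys:
            if key in out["properties"]:
                elt["properties"][key] = copy.deepcopy(out["properties"][key])
    return out


doc = {
    "properties": {
        "path": "files/earnings_calls/example.pdf",  # hypothetical path
        "entity": {"company_ticker": "AVGO", "quarter": "Q1"},
        "other": "not spread",
    },
    "elements": [
        {"text": "First chunk", "properties": {"speaker_name": "Operator"}},
        {"text": "Second chunk", "properties": {}},
    ],
}

spread = toy_spread_properties(doc, ["path", "entity"])

# Elements can now be filtered on their own, without looking at the parent document.
avgo_elements = [
    e for e in spread["elements"]
    if e["properties"].get("entity", {}).get("company_ticker") == "AVGO"
]
print(len(avgo_elements))
```

Before spreading, the same element-level filter would match nothing, since only the document carried the `entity` property.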

In [None]:
spread_ds = embedded_ds.spread_properties(['path', 'entity'])

Now let's create our docset target (give it a name) and write to it.

In [None]:
from aryn_sdk.client.client import Client

# You may need to specify aryn_api_key="<YOUR KEY>" here
aryn_client = Client()

docset_name = "yo"
aryn_docset = aryn_client.create_docset(name = docset_name)

print(aryn_docset.value.docset_id)

In [None]:
# You may need to specify aryn_api_key="<YOUR KEY>" here too.
spread_ds.write.aryn(docset_id=aryn_docset.value.docset_id)

Awesome! Now if you navigate to the [Aryn console](https://app.aryn.ai/storage) you should see your docset and the documents inside it.

## Full ingestion script

Now let's scale our script. For consistency purposes, I've gone and written out a canonical form for 
it, and condensed it a little. I've also partitioned the documents ahead of time, as this can take
a little while (our DoS protection is to limit the concurrent requests in an account) - so this script 
will get them from a materialize. This is one of the things you downloaded with `make downloads`.

In [None]:
# I guess this is sort of a cheat sheet for the rest of the notebook. That's ok.

from pathlib import Path
import re

import sycamore
from sycamore import MaterializeSourceMode
from sycamore.data import Element
from sycamore.llms.openai import OpenAI, OpenAIModels
from sycamore.transforms.partition import ArynPartitioner
from sycamore.transforms.extract_schema import LLMPropertyExtractor
from sycamore.transforms.merge_elements import MarkedMerger
from sycamore.transforms.embed import OpenAIEmbedder
from aryn_sdk.client.client import Client

repo_root = Path.cwd()
pdf_dir = repo_root / "files" / "earnings_calls"
materialize_dir = repo_root / "materialize"

gpt4o = OpenAI(OpenAIModels.GPT_4O)

schema = {
    "type": "object",
    "properties": {
        "quarter": {"type": "string", "description": "Quarter of the earnings call, it should be in the format of Q1, Q2, Q3, Q4"},
        "date":{"type": "string", "description": "The date of the earnings call"},
        "company_name": {"type": "string", "description": "The name of the company in the earnings call"},
        "company_ticker": {"type": "string", "description": "The stock ticker of the company in the earnings call"},
    }
}

def mark_speakers(elt: Element) -> Element:
    if not elt.text_representation:
        return elt

    external_speaker = re.match(r'([^ ]*[^\S\n\t]){1,4}--[^\S\n\t].*--', elt.text_representation)
    internal_speaker = re.match(r'([^ ]*[^\S\n\t]){1,4}--.*', elt.text_representation)
    is_one_line = elt.text_representation.count("\n") <= 1
    if elt.text_representation.strip() == 'Operator':
        elt.properties["speaker_name"] = "Operator"
        elt.properties["speaker_role"] = "Operator"
        elt.properties["speaker"] = True
        elt.data["_break"] = True
    elif external_speaker and is_one_line:
        parts = [p.strip() for p in elt.text_representation.split("--")]
        elt.properties['speaker_name'] = parts[0]
        elt.properties['speaker_external_org'] = parts[1]
        elt.properties['speaker_role'] = parts[2]
        elt.properties['speaker'] = True
        elt.data["_break"] = True
    elif internal_speaker and is_one_line:
        parts = [p.strip() for p in elt.text_representation.split("--")]
        elt.properties['speaker_name'] = parts[0]
        elt.properties['speaker_role'] = parts[1]
        elt.properties['speaker'] = True
        elt.data["_break"] = True
    return elt

aryn_client = Client()
aryn_docset = aryn_client.create_docset(name = "haystack-workshop-all")
docset_id = aryn_docset.value.docset_id

# Now we'll use ray (which is the default exec_mode)
ctx = sycamore.init()

In [None]:
(
    ctx.read.binary(paths = str(pdf_dir), binary_format = "pdf")
    .partition(ArynPartitioner())
    .materialize(path=materialize_dir / "alldocs-partitioned", source_mode=MaterializeSourceMode.USE_STORED)
    .filter_elements(lambda e: e.type != "Image")
    .extract_properties(LLMPropertyExtractor(llm=gpt4o, schema=schema))
    .map_elements(mark_speakers)
    .merge(MarkedMerger())
    .embed(OpenAIEmbedder())
    .spread_properties(['path', 'entity'])
    .write.aryn(docset_id=docset_id)
)

In [None]:
# Write the docset id to a file to pick up in the next notebooks:
with open("docset_id", "w") as f:
    f.write(docset_id)