In this example, we use Sycamore to process PDFs and load them into vector and keyword indexes in OpenSearch. The dataset includes the research papers associated with the Sort Benchmark contest: https://sortbenchmark.org/

The Aryn Partitioner in this job is configured to use the Aryn Partitioning Service to provide performant and high-quality document partitioning. Go to [aryn.ai/sign-up](https://aryn.ai/sign-up) to get a free API key for the service. You can also configure the partitioner to run locally.

In [None]:
import sycamore
from sycamore.transforms.partition import ArynPartitioner
from sycamore.utils.aryn_config import ArynConfig, _DEFAULT_PATH
from sycamore.llms import OpenAIModels, OpenAI
from sycamore.transforms.extract_entity import OpenAIEntityExtractor
from sycamore.transforms.embed import SentenceTransformerEmbedder
from sycamore.transforms.merge_elements import GreedySectionMerger
from sycamore.functions.tokenizer import HuggingFaceTokenizer
import os

# S3 file path to the Sort Benchmark dataset of PDFs
paths = "s3://aryn-public/sort-benchmark/pdf/"

# OpenAI key and model for data extraction transform. Set the key in your environment variables or provide it here.
openai = OpenAI(OpenAIModels.GPT_4O.value, api_key=os.environ.get("OPENAI_API_KEY"))

# Configure chunking (or merging) strategy and the number of tokens for each chunk.
merger = GreedySectionMerger(tokenizer=HuggingFaceTokenizer("sentence-transformers/all-MiniLM-L6-v2"), max_tokens=512)

# Check that the Aryn Partitioning Service API key is available
assert ArynConfig.get_aryn_api_key() != "", f"Unable to find Aryn API key. Looked in {_DEFAULT_PATH}"

If the assertion above fails, either set the environment variable ARYN_API_KEY and restart Jupyter,
or create a YAML file at the path shown in the assertion error that contains:

```
aryn_token: "YOUR-ARYN-API-KEY"
```

You can also set the key directly in this notebook with:
```
import os
os.environ["ARYN_API_KEY"] = "YOUR-ARYN-API-KEY"
```
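
To keep the key out of the saved notebook, one option is to prompt for it at run time instead of hard-coding it. The sketch below is only an illustration: the helper name `ensure_aryn_api_key` and its `env` and `prompt` parameters are hypothetical and not part of Sycamore; only the ARYN_API_KEY variable name comes from the text above.

```python
import os
from getpass import getpass


def ensure_aryn_api_key(env=os.environ, prompt=getpass):
    """Return the Aryn API key, asking for it only if it is not already set.

    Hypothetical helper for illustration; not part of Sycamore.
    """
    key = env.get("ARYN_API_KEY", "")
    if not key:
        # getpass hides the typed input, so the key is not echoed into the cell output.
        key = prompt("Aryn API key: ").strip()
        if not key:
            raise ValueError("No Aryn API key provided")
        env["ARYN_API_KEY"] = key
    return key
```

Calling `ensure_aryn_api_key()` in a cell sets the variable for the current kernel session only, so the key is never written into the notebook file.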

In [None]:
# This is the Sycamore pipeline up to the "load" step. Because Sycamore uses lazy execution, the pipeline does not do any processing
# until you run the "write" step in a subsequent cell.

# Initializing Sycamore
context = sycamore.init()
# Reading PDFs into a DocSet
docset = context.read.binary(paths, binary_format="pdf")
# Partition using the Aryn Partitioning Service into structured elements. Extract tables and images. This will take a few minutes because
# the service is processing many pages across the document set.
partitioned_docset = docset.partition(partitioner=ArynPartitioner(extract_images=True,  extract_table_structure=True))
# Extract the title and author from each paper in the dataset using LLM-powered transforms
extracted_docset = partitioned_docset.extract_entity(entity_extractor=OpenAIEntityExtractor("title", llm=openai))\
                    .extract_entity(entity_extractor=OpenAIEntityExtractor("authors", llm=openai))
# Use the chunking strategy specified earlier to create larger chunks from groups of smaller elements in the DocSet
chunked_docset = extracted_docset.merge(merger=merger)
# Explode the DocSet so that each chunk becomes its own document, then use MiniLM to create vector embeddings locally for each chunk
embedded_docset = chunked_docset.explode().embed(
                    embedder=SentenceTransformerEmbedder(batch_size=10_000, model_name="sentence-transformers/all-MiniLM-L6-v2")
                  )
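
Lazy execution means that each transform call only records a step in a plan; nothing runs until a terminal step asks for results. The toy class below illustrates the idea in plain Python. It is not Sycamore's implementation, and `LazyPipeline`, `map`, and `execute` are hypothetical names chosen for this sketch.

```python
class LazyPipeline:
    """Toy lazy pipeline: transforms are recorded, not run, until execute()."""

    def __init__(self, items, steps=()):
        self._items = items
        self._steps = tuple(steps)

    def map(self, fn):
        # Return a new pipeline with one more recorded step; no work happens here.
        return LazyPipeline(self._items, self._steps + (fn,))

    def execute(self):
        # Terminal step: only now is every recorded function applied to every item.
        results = []
        for item in self._items:
            for fn in self._steps:
                item = fn(item)
            results.append(item)
        return results


calls = []


def shout(text):
    calls.append(text)  # record that real work happened
    return text.upper()


pipeline = LazyPipeline(["sort", "benchmark"]).map(shout).map(len)
assert calls == []  # building the pipeline did no work
result = pipeline.execute()  # the recorded steps run here
```

In the notebook, `partition`, `extract_entity`, `merge`, and `embed` play the role of `map`, and the write to OpenSearch is the terminal step.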

In [None]:
# Set the OpenSearch configuration for the connector. In this example, OpenSearch is running locally.

openSearch_client_args = {
    "hosts": [{"host": "localhost", "port": 9200}],
    "http_compress": True,
    "http_auth": ("admin", "admin"),
    "use_ssl": True,
    "verify_certs": False,
    "ssl_assert_hostname": False,
    "ssl_show_warn": False,
    "timeout": 120,
}

index_settings = {
    "body": {
        "settings": {
            "index.knn": True,
            "number_of_shards": 2,
            "number_of_replicas": 1,
        },
        "mappings": {
            "properties": {
                "embeddings": {
                    "type": "knn_vector",
                    "dimension": 384,
                    "method": {"name": "hnsw", "engine": "faiss"},
                },
                "text": {"type": "text"},
            }
        },
    }
}
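
The `dimension` in the mapping has to match the length of the vectors the embedder produces; `all-MiniLM-L6-v2` outputs 384-dimensional vectors. A small check like the following can catch a mismatch before writing. The helper name `knn_dimension` is hypothetical, and the settings dict is repeated in trimmed form so that the block runs on its own.

```python
def knn_dimension(index_settings, field="embeddings"):
    """Return the configured knn_vector dimension for `field` (hypothetical helper)."""
    properties = index_settings["body"]["mappings"]["properties"]
    mapping = properties[field]
    if mapping.get("type") != "knn_vector":
        raise ValueError(f"Field {field!r} is not a knn_vector field")
    return mapping["dimension"]


# Trimmed copy of the settings above so this sketch is self-contained.
example_settings = {
    "body": {
        "settings": {"index.knn": True},
        "mappings": {
            "properties": {
                "embeddings": {"type": "knn_vector", "dimension": 384},
                "text": {"type": "text"},
            }
        },
    }
}

EXPECTED_DIMENSION = 384  # output size of sentence-transformers/all-MiniLM-L6-v2
configured = knn_dimension(example_settings)
assert configured == EXPECTED_DIMENSION, "embedding size and index mapping disagree"
```

If you swap in a different embedding model, update the `dimension` value in the mapping to that model's output size.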

In [None]:
# Write to OpenSearch. This step triggers execution of the pipeline defined above.

embedded_docset.write.opensearch(
    os_client_args=openSearch_client_args,
    index_name="sort-benchmark",
    index_settings=index_settings,
)

Congrats! You have loaded OpenSearch's vector and keyword indexes with your processed data, and you can now build your RAG or semantic search application.
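
As a starting point for a semantic search application, a k-NN query against the index could look like the sketch below. It only builds the request body and does not contact OpenSearch. The field name `embeddings` is taken from the index mapping in this notebook; the helper name `build_knn_query` and the default `k` are assumptions made for illustration. The query vector would need to come from the same MiniLM model used at ingest time.

```python
def build_knn_query(query_vector, k=5, field="embeddings"):
    """Build an OpenSearch k-NN search body (hypothetical helper for illustration)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return {
        "size": k,
        "query": {
            "knn": {
                field: {
                    "vector": list(query_vector),
                    "k": k,
                }
            }
        },
    }


# Placeholder vector; a real one would be the 384-dimensional embedding of the question.
body = build_knn_query([0.0] * 384, k=3)
```

With the `opensearch-py` client, a body like this would typically be passed as `client.search(index="sort-benchmark", body=body)`.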