# NTSB demo

Sycamore is a **Document Processing System** = **ETL** + **Query Planning/Retrieval**.

https://aryn-public.s3.amazonaws.com/ntsb/59.pdf

**ETL steps**:
- extracts Text + Structure:
    - titles,
    - section headers,
    - text paragraphs,
    - figures,
    - tables and their cells
    - captions
    - page headers and footers
    - footnotes
- it can then merge individual chunks into larger chunks that preserve more of the context
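
The merge step above can be pictured with a small, self-contained sketch. This is not Sycamore's merger: the `Chunk` class, the whitespace token count, and the sample text are all made up for illustration. It assumes a greedy policy that starts a new chunk at every section header or when a token budget would be exceeded.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    kind: str  # element label, e.g. "Section-header" or "Text" (hypothetical)
    text: str


def count_tokens(text: str) -> int:
    # Whitespace splitting stands in for a real tokenizer.
    return len(text.split())


def greedy_merge(chunks, max_tokens=50):
    """Merge consecutive chunks until a section header or the token budget is hit."""
    merged = []
    current = []
    current_tokens = 0
    for chunk in chunks:
        n = count_tokens(chunk.text)
        starts_section = chunk.kind == "Section-header"
        if current and (starts_section or current_tokens + n > max_tokens):
            # Close the chunk we were building and start a new one.
            merged.append("\n".join(c.text for c in current))
            current, current_tokens = [], 0
        current.append(chunk)
        current_tokens += n
    if current:
        merged.append("\n".join(c.text for c in current))
    return merged


# Toy input, invented for this sketch.
chunks = [
    Chunk("Section-header", "Analysis"),
    Chunk("Text", "The pilot reported a loss of engine power."),
    Chunk("Section-header", "Probable Cause"),
    Chunk("Text", "Fuel exhaustion."),
]
merged_chunks = greedy_merge(chunks)
print(merged_chunks)
```

Each merged chunk keeps its header together with the text under it, which is the extra context the embedding later benefits from.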

**Query Planning/Retrieval**:
Having these structural elements and information extracted gives the consumer of these documents much better control over what is embedded and how, which leads to better retrieval in a RAG system.
For example, we can focus the retrieval on certain sections or eliminate them altogether.
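
As a rough illustration of that idea, the sketch below drops page headers and footers before anything is embedded. The elements are plain dictionaries invented for this example (not Sycamore objects), and the helper name `select_for_embedding` is hypothetical; the labels follow the DocLayNet category names.

```python
# Toy elements, invented for this sketch.
elements = [
    {"type": "Page-header", "text": "Page 1 of 4"},
    {"type": "Section-header", "text": "Analysis"},
    {"type": "Text", "text": "The airplane landed hard."},
    {"type": "Page-footer", "text": "Page 1"},
]


def select_for_embedding(elements, exclude=frozenset({"Page-header", "Page-footer"})):
    """Keep only the elements whose type is not in the exclusion set."""
    return [e for e in elements if e["type"] not in exclude]


kept = select_for_embedding(elements)
print([e["type"] for e in kept])
```

The same helper can focus retrieval on one kind of content, for instance by excluding everything except tables.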

**Question answering**:
Sycamore can also extract information from a document. We'll see how it can extract locations and dates from NTSB docs, as well as specific elements like aircraft type.

Once the data is transformed, we can ask questions about it.

## Boilerplate (imports & utilities)

In [None]:
%run demo_utils.ipynb

### Initialize Sycamore

In [None]:
ctx = sycamore.init()

## Load the data

We're loading NTSB incident reports (PDF documents describing aviation incidents).
The documents consist of a combination of tables, text, and figures.

We're loading the data from a public S3 bucket.

Sample document: 
https://aryn-public.s3.amazonaws.com/ntsb/59.pdf

In [None]:
s3_path = "s3://aryn-public/ntsb/59.pdf"
llm = OpenAI(OpenAIModels.GPT_3_5_TURBO.value)
# llm = OpenAI(OpenAIModels.GPT_4O.value)
tokenizer = HuggingFaceTokenizer("thenlper/gte-small")

## Define a pipeline using SycamorePartitioner

### Chunk and extract document structure
Extracts Text + Structure:
- titles,
- section headers,
- text paragraphs,
- figures,
- tables and their cells
- captions
- page headers and footers
- footnotes

In [None]:
# Create a docset from segmented PDFs, extract tables, summarize images

docset = (
    ctx.read.binary(s3_path, binary_format="pdf")

    # Partition and extract tables and images
    .partition(
        partitioner=SycamorePartitioner(
            "Aryn/deformable-detr-DocLayNet",
            extract_table_structure=True,
            use_ocr=False,
            extract_images=True,
        ),
        num_gpus=0.1,
    )

    # Summarize each image element
    .transform(SummarizeImages)
)


In [None]:
sample_pages = show_pages(docset, limit=4)

In [None]:
# Show text representation of images and tables
enumerate_images_and_tables(sample_pages)

In [None]:
# We extract the table entries as metadata for further querying, normalize, and semantically chunk the documents

docset2 = (
    # Extract information into the schema
    docset.map(extract_table_as_properties)
    
    # Normalize values
    .map(standardize_date)
    .map(standardize_location)

    # Chunk into semantically related content (disabled here; uncomment to enable)
    #.merge(merger=GreedySectionMerger())
)
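
`standardize_date` comes from `demo_utils.ipynb` and is not shown here. As a minimal sketch of what date normalization could look like, the hypothetical helper below tries a few assumed input formats and returns an ISO 8601 string, or `None` when nothing matches.

```python
from datetime import datetime

# Assumed input formats; the real reports may use others.
KNOWN_FORMATS = ("%B %d, %Y", "%m/%d/%Y", "%Y-%m-%d")


def to_iso_date(raw: str):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    cleaned = raw.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue
    return None


print(to_iso_date("January 29, 2023"))
print(to_iso_date("01/29/2023"))
```

Normalizing to a single format is what later makes exact-match and range filters on the date property practical.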

In [None]:
sample_pages_2 = show_pages(docset2, limit=4)

In [None]:
sample_pages_2[0].properties['entity']

In [None]:
# Show tables and their conversion to properties

display_page_and_table_properties(sample_pages_2)

In [None]:
%%time

for i in range(0,3):
    extracted = extract_table_as_properties(sample_pages[i])

In [None]:
for e in extracted.elements:
    if "table" in e.type:
        print("Element Type: ", e.type)
        print("Element Properties: ", json.dumps(e.properties, indent=2))
        display(HTML(e.table.to_html()))

#### Save Sycamore partitioned docs to disk

In [None]:
sample_pages_2[0].properties.update({"foo": "bar"})

In [None]:
sample_pages_2[0].properties

In [None]:
# Save docset2 so that the extracted and standardized properties are persisted
write_out_docset(Path("/home/admin/sycamore/data/tmp/"), docset2)

## Load from disk

In [None]:
unpickled_docset = read_in_docset(Path("/home/admin/sycamore/data/tmp/"), ctx)
unpickled_docset.count()
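
`write_out_docset` and `read_in_docset` are defined in `demo_utils.ipynb` and are not shown here. The sketch below only illustrates the general idea of a pickle round trip, one file per document; the helper names `save_docs` and `load_docs`, the file naming scheme, and the sample document are assumptions made for this example.

```python
import pickle
import tempfile
from pathlib import Path


def save_docs(directory: Path, docs: list) -> None:
    """Write each document to its own pickle file."""
    directory.mkdir(parents=True, exist_ok=True)
    for i, doc in enumerate(docs):
        # Zero-padded index keeps the files in order when sorted by name.
        with open(directory / f"doc_{i:05d}.pickle", "wb") as f:
            pickle.dump(doc, f)


def load_docs(directory: Path) -> list:
    """Read the documents back in the order they were written."""
    docs = []
    for path in sorted(directory.glob("doc_*.pickle")):
        with open(path, "rb") as f:
            docs.append(pickle.load(f))
    return docs


with tempfile.TemporaryDirectory() as tmp:
    original = [{"properties": {"entity": {"location": "California"}}}]
    save_docs(Path(tmp), original)
    restored = load_docs(Path(tmp))

print(restored == original)
```

Only unpickle files you wrote yourself, since loading a pickle can execute arbitrary code.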

## Visualize partitioned documents

### Sycamore generated partitions

In [None]:
show_pages(unpickled_docset, limit=4)

## Inspect extracted information

#### Notice that dates and locations have been standardized

In [None]:
sample_docs = unpickled_docset.take()

In [None]:
sample_docs[0].properties['entity']

## Indexing the documents for retrieval

Now that we have extracted the text, partitioned it, labeled the partitions, extracted information, and standardized it, we're ready to index it for retrieval.

In [None]:
if os.path.exists("/.dockerenv"):
    opensearch_host = "opensearch"
    print("Assuming we are in a Sycamore Jupyter container, using opensearch for OpenSearch host")
else:
    opensearch_host = "localhost"
    print("Assuming we are running outside of a container, using localhost for OpenSearch host")

index = "ntsb-bb-2"
os_client_args = {
    "hosts": [{"host": opensearch_host, "port": 9200}],
    "http_compress": True,
    "http_auth": ("admin", "admin"),
    "use_ssl": True,
    "verify_certs": False,
    "ssl_assert_hostname": False,
    "ssl_show_warn": False,
    "timeout": 120,
}

index_settings = {
    "body": {
        "settings": {
            "index.knn": True,
            "number_of_shards": 5,
            "number_of_replicas": 1
        },
        "mappings": {
            "properties": {
                "embedding": {
                  "dimension": 384,
                  "method": {
                    "engine": "faiss",
                    "space_type": "l2",
                    "name": "hnsw",
                    "parameters": {}
                  },
                  "type": "knn_vector"
                }
            }
        }
    }
}
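
The `dimension` in the `knn_vector` mapping has to match the length of the vectors produced by the embedding model (384 for `sentence-transformers/all-MiniLM-L6-v2`). A small sanity check like the one sketched below can catch a mismatch before indexing. The helper names are hypothetical, and `example_settings` is a trimmed-down stand-in for the settings above.

```python
def knn_dimension(index_settings: dict, field: str = "embedding") -> int:
    """Read the configured vector dimension out of the index mapping."""
    return index_settings["body"]["mappings"]["properties"][field]["dimension"]


def check_embedding(vector, index_settings: dict, field: str = "embedding") -> bool:
    """Raise if the vector length does not match the mapping's dimension."""
    expected = knn_dimension(index_settings, field)
    if len(vector) != expected:
        raise ValueError(
            f"embedding has {len(vector)} dimensions, index expects {expected}"
        )
    return True


# Trimmed-down stand-in for the index settings (assumption for this sketch).
example_settings = {
    "body": {
        "mappings": {
            "properties": {"embedding": {"type": "knn_vector", "dimension": 384}}
        }
    }
}

print(check_embedding([0.0] * 384, example_settings))
```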

### Creating embeddings

In [None]:
embedded_docset = (
    unpickled_docset
    .spread_properties(["entity", "path"])
    .map(normalize_table_properties)
    .explode()
    .sketch()
    .embed(embedder=SentenceTransformerEmbedder(batch_size=100, model_name="sentence-transformers/all-MiniLM-L6-v2"), num_gpus=0.1)
)

### Write the OpenSearch index

In [None]:
embedded_docset.map(validate).write.opensearch(
    os_client_args=os_client_args,
    index_name=index,
    index_settings=index_settings,
)

## Answer some questions

### Create a connection to OpenSearch

In [None]:
os_client_args = {
    "hosts": [{"host": opensearch_host, "port": 9200}],
    "http_compress": True,
    "http_auth": ('admin', 'admin'),
    "use_ssl": True,
    "verify_certs": False,
    "ssl_assert_hostname": False,
    "ssl_show_warn": False,
    "timeout": 120
}
client = OpenSearch(**os_client_args)

osq = OpenSearchQueryExecutor(os_client_args)

### Question

In [None]:
question = "Were there any incidents involving red planes?"

### Text query

In [None]:
query = OpenSearchQuery()
query["index"] = index
query["query"] = {
    "_source": {"excludes": ["embedding"]},
    "query": {"match": {"text_representation": question}},
    "size": 20
}
result = osq.query(query)["result"]

In [None]:
result

### RAG query

In [None]:
# RAG config
embedding_model_id = "UpKUx48Bh3XoviWIy7eB"
search_pipeline = "hybrid_rag_pipeline"
llm = "gpt-4o"

rag_query = OpenSearchQuery()
rag_query["index"] = index

# hybrid search
rag_query["query"] = {
    "_source": {"excludes": ["embedding"]},
    "query": {
        "hybrid": {
            "queries": [
                {"match": {"text_representation": question}},
                {
                    "neural": {
                        "embedding": {
                            "query_text": question,
                            "model_id": embedding_model_id,
                            "k": 100,
                        }
                    }
                },
            ]
        }
    },
    "size": 20
}

# RAG params
rag_query["params"] = {"search_pipeline": search_pipeline}
rag_query["query"]["ext"] = {
    "generative_qa_parameters": {
        "llm_question": question,
        "context_size": 10,
        "llm_model": llm,
    }
}
rag_result = osq.query(rag_query)["result"]
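
A hybrid query returns keyword scores and vector scores on different scales, so the search pipeline has to normalize and combine them. The configuration of `hybrid_rag_pipeline` is not shown in this notebook; the sketch below assumes min-max normalization followed by a weighted arithmetic mean, and treats a document missing from one result list as scoring 0 there. The scores are toy values, and OpenSearch's own handling of edge cases may differ.

```python
def min_max(scores: dict) -> dict:
    """Rescale scores to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def combine(keyword_scores: dict, vector_scores: dict, weights=(0.5, 0.5)):
    """Normalize both score sets, then rank by their weighted mean."""
    kw = min_max(keyword_scores)
    vec = min_max(vector_scores)
    combined = {
        doc: weights[0] * kw.get(doc, 0.0) + weights[1] * vec.get(doc, 0.0)
        for doc in set(kw) | set(vec)
    }
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


# Toy scores, invented for this sketch.
keyword_scores = {"doc1": 12.0, "doc2": 8.0, "doc3": 4.0}
vector_scores = {"doc2": 0.9, "doc3": 0.7, "doc4": 0.5}
ranked = combine(keyword_scores, vector_scores)
print(ranked)
```

A document that ranks reasonably well in both lists (`doc2` here) can end up ahead of one that tops only a single list.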

In [None]:
rag_result["ext"]["retrieval_augmented_generation"]["answer"]

In [None]:
rag_result

### Filtered RAG query

In [None]:
filtered_rag_query = OpenSearchQuery()
filtered_rag_query["index"] = index

filter = {
    "bool": {
        "must": [
          {
            "match_phrase": {
              "properties.entity.location": "California"
            }
          }
          
        ]
    }
}

# hybrid search
filtered_rag_query["query"] = {
    "_source": {"excludes": ["embedding"]},
    "query": {
        "hybrid": {
            "queries": [
                {
                  "bool": {
                      "must": [
                          {
                              "match": {
                                  "text_representation": "Were there any incidents in California in 2023?"
                              }
                          }
                      ],
                      "filter": filter
                  }
                },
                {
                    "neural": {
                        "embedding": {
                            "query_text": question,
                            "model_id": embedding_model_id,
                            "k": 100,
                            "filter": filter
                        }
                    }
                },
            ]
        }
    },
    "size": 20
}

# RAG params
filtered_rag_query["params"] = {"search_pipeline": search_pipeline}

filtered_rag_query["query"]["ext"] = {
    "generative_qa_parameters": {
        "llm_question": question,
        "context_size": 10,
        "llm_model": llm,
    }
}
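
The filtered query repeats most of the unfiltered one, with the same filter attached to both the keyword and the neural sub-query. One way to avoid the duplication is to build the body with small helpers, as sketched below. `property_filter` and `hybrid_query` are hypothetical names, and `"my-model-id"` is a placeholder rather than a real model ID.

```python
def property_filter(field: str, value: str) -> dict:
    """Build a bool filter that requires a phrase match on one field."""
    return {"bool": {"must": [{"match_phrase": {field: value}}]}}


def hybrid_query(question: str, model_id: str, query_filter=None, size=20, k=100) -> dict:
    """Build a hybrid (keyword + neural) query body, optionally filtered."""
    keyword = {"match": {"text_representation": question}}
    neural = {"query_text": question, "model_id": model_id, "k": k}
    if query_filter is not None:
        # Apply the same filter to both halves of the hybrid query.
        keyword = {"bool": {"must": [keyword], "filter": query_filter}}
        neural["filter"] = query_filter
    return {
        "_source": {"excludes": ["embedding"]},
        "query": {
            "hybrid": {"queries": [keyword, {"neural": {"embedding": neural}}]}
        },
        "size": size,
    }


california = property_filter("properties.entity.location", "California")
body = hybrid_query("Were there any incidents in California?", "my-model-id", california)
print(body["query"]["hybrid"]["queries"][0]["bool"]["filter"])
```

Building both sub-queries from one `question` argument also keeps the keyword and neural halves searching for the same thing.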

In [None]:
filtered_rag_result = osq.query(filtered_rag_query)["result"]

In [None]:
filtered_rag_result["ext"]["retrieval_augmented_generation"]["answer"]