# Semantic Document Segmentation with Docugami

The LangChain documentation for the [Docugami Loader](https://python.langchain.com/docs/integrations/document_loaders/docugami) covers the basic functionality of using Docugami's XML Knowledge Graph to create better chunks for RAG. The chunks produced by Docugami follow the structural and semantic contours of the document, and are also annotated with additional semantic metadata to boost retrieval accuracy.

Let's dive deeper into Docugami's XML output and see how we can use it for semantic document segmentation. This is helpful for many reasons, including advanced Retrieval Augmented Generation (RAG) using the [Multi-Vector Retriever](https://blog.langchain.dev/semi-structured-multi-modal-rag/) on tabular and non-tabular content.

## Converting Your Documents into Docugami's XML Knowledge Graph

1. Create a [Docugami workspace](http://www.docugami.com) (free trials available)
1. Create an access token via the Developer Playground for your workspace. [Detailed instructions](https://help.docugami.com/home/docugami-api).
1. Add your documents (PDF, DOCX or DOC) to Docugami for processing. There are two ways to do this:
    1. Use the simple Docugami web experience. [Detailed instructions](https://help.docugami.com/home/adding-documents).
    1. Use the [Docugami API](https://api-docs.docugami.com), specifically the [documents](https://api-docs.docugami.com/#tag/documents/operation/upload-document) endpoint. Code samples are available for [Python](../upload_file/) and [JavaScript](../../js/upload-file/).
1. Wait for Docugami to ingest and cluster your uploaded documents into sets of similar documents, e.g. NDAs, Lease Agreements, and Service Agreements. Docugami is not limited to any particular types of documents, and the clusters created depend on your particular documents. You can [change the docset assignments](https://help.docugami.com/home/working-with-the-doc-sets-view) later if you wish. You can monitor file status in the simple Docugami webapp, or use a [webhook](https://api-docs.docugami.com/#tag/webhooks) to be informed when your documents are done processing.
1. Use the [Docugami API](https://api-docs.docugami.com) to get a list of your processed docset IDs, or just the document IDs for a particular docset. 
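The last step can be sketched with nothing but the standard library. This is a minimal sketch, not an official client: the base URL, the `docsets` and `docsets/{id}/documents` paths, and the bearer-token header are assumptions about the API that you should confirm against the API docs, and the helper names are hypothetical. The functions only build the requests; nothing is sent until you pass one to `urllib.request.urlopen`.

```python
import urllib.request

# Assumption: base URL and version segment of the Docugami API.
API_BASE = "https://api.docugami.com/v1preview1"


def build_request(path: str, access_token: str) -> urllib.request.Request:
    """Build (but do not send) an authenticated GET request for an API path."""
    return urllib.request.Request(
        url=f"{API_BASE}/{path.lstrip('/')}",
        # Assumption: the access token is passed as a bearer token.
        headers={"Authorization": f"Bearer {access_token}"},
        method="GET",
    )


def docsets_request(access_token: str) -> urllib.request.Request:
    """Request for the list of docsets in the workspace (assumed path)."""
    return build_request("docsets", access_token)


def documents_request(access_token: str, docset_id: str) -> urllib.request.Request:
    """Request for the documents in one docset (assumed path)."""
    return build_request(f"docsets/{docset_id}/documents", access_token)
```

To actually call the API, open the request with `urllib.request.urlopen(docsets_request(token))` and decode the JSON body of the response.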

At this point, you can use the [Docugami Loader](https://python.langchain.com/docs/integrations/document_loaders/docugami) to very easily get chunks for your documents, including semantic and structural metadata. This is the simpler and recommended approach for most use cases.

In this notebook, let's dive a bit deeper into Docugami's XML output and see how we can use it for advanced RAG. You can download the Docugami XML output for your processed document directly from the API (see sample [here](../download_file_artifacts/)) or you can play around with the sample files, with corresponding Docugami XML output, provided under [../../testdata/](../../testdata/) in this repository.

## Exploring the Docugami XML Knowledge Graph

Let's consider one of the NTSB Aviation Accident Reports under [../../testdata](../../testdata), specifically [../../testdata/NTSB/20071204X01896.pdf](../../testdata/NTSB/20071204X01896.pdf). Open the file and skim it; you will notice that it is a PDF with some tables as well as non-tabular text (headings and paragraphs). There are other test files that you can explore as well, but let's focus on this one.

The Docugami XML output for each file is located in the same directory, with an XML extension. For example, the XML for the file mentioned above is at [../../testdata/NTSB/20071204X01896.xml](../../testdata/NTSB/20071204X01896.xml).

You can, of course, open this XML file in a text editor to explore it. You can also open it with the XML library of your choice to explore how Docugami sectioned up the document. We will use the [dgml-utils](https://github.com/docugami/dgml-utils/tree/main/python) library, which has some nice helper functions to do all this for us.
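If you would rather start with a plain XML library, the sketch below uses Python's built-in `xml.etree.ElementTree` to print an indented outline of element names and their `structure` attributes. The inline snippet is a hypothetical, heavily shortened stand-in for a real DGML file (the `docset` namespace URI in particular is made up), and `outline` is an illustrative helper, not part of `dgml-utils`.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written stand-in for a DGML file (not from the test data).
SAMPLE_DGML = """
<docset:Report xmlns:docset="http://example.com/docset"
               xmlns:dg="http://www.docugami.com/2021/dgml"
               xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <dg:chunk structure="h1">Aviation Accident Final Report</dg:chunk>
  <xhtml:table structure="table">
    <xhtml:tr structure="tr">
      <xhtml:td structure="td">Location:</xhtml:td>
      <xhtml:td structure="td"><docset:Location>McKinnon, TN</docset:Location></xhtml:td>
    </xhtml:tr>
  </xhtml:table>
</docset:Report>
"""


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def outline(xml_string: str) -> list:
    """Return (depth, tag, structure) tuples for every element, in document order."""
    rows = []

    def visit(element, depth):
        rows.append((depth, local_name(element.tag), element.get("structure")))
        for child in element:
            visit(child, depth + 1)

    visit(ET.fromstring(xml_string.strip()), 0)
    return rows


for depth, tag, structure in outline(SAMPLE_DGML):
    print("  " * depth + f"{tag} (structure={structure})")
```

To try it on a real file, read the XML into a string and pass it to `outline` in place of `SAMPLE_DGML`.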

In [2]:
!pip install dgml-utils --upgrade --quiet

### Example 1: Identifying Document Parts (chunks) as Text (with Structure Metadata)

You can pass DGML to the `dgml-utils` library and get `Chunk` instances back. By default, these are text-based representations of text chunks and tables (tables are represented as Markdown).

In [8]:
from dgml_utils.segmentation import get_leaf_structural_chunks_str

# Change to explore other files
XML_FILE_PATH = "../../testdata/NTSB/20071207X01914.xml"

with open(XML_FILE_PATH, 'r', encoding='utf-8') as file:
    xml_string = file.read()
    chunks = get_leaf_structural_chunks_str(dgml=xml_string)

    for chunk in chunks[:4]:
        print(f"TEXT: {chunk.text}")
        print(f"STRUCTURE: {chunk.structure}")
        print(f"TAG: {chunk.tag}")
        print("*****")

TEXT: National Transportation Safety Board Aviation Accident Final Report
STRUCTURE: h1
TAG: chunk
*****
TEXT: +-------------------------+---------------------------------------+-------------------+-------------+
| Location:               | McKinnon, TN                          | Accident Number:  | MIA08CA018  |
+-------------------------+---------------------------------------+-------------------+-------------+
| Date & Time:            | 11/18/2007 , 1315 EST                 | Registration:     | N5803A      |
+-------------------------+---------------------------------------+-------------------+-------------+
| Aircraft:               | Cessna 172                            | Aircraft Damage : | Substantial |
+-------------------------+---------------------------------------+-------------------+-------------+
| Defining Event:         |                                       | Injuries:         | 4 None      |
+-------------------------+---------------------------------------+------

You can open the PDF side by side to compare the output above. You will notice that the headings, tables, and paragraphs are correctly identified.

### Example 2: Sub-Chunking Tables

In the output above, you will notice that all the chunks from the PDF were not only identified, they were also correctly converted to clean text, including tables as Markdown for readability and easy consumption by LLMs. The default behavior is to not sub-chunk tables (to avoid over-chunking) but you can override this easily and sub-chunk all the table cells as follows:

In [11]:
from dgml_utils.segmentation import get_leaf_structural_chunks_str

# Change to explore other files
XML_FILE_PATH = "../../testdata/NTSB/20071207X01914.xml"

with open(XML_FILE_PATH, 'r', encoding='utf-8') as file:
    xml_string = file.read()
    chunks = get_leaf_structural_chunks_str(
        dgml=xml_string,
        sub_chunk_tables=True, # set this to subchunk tables by row and cell
    )

    for chunk in chunks[:20]:
        print(f"TEXT: {chunk.text}")
        print(f"STRUCTURE: {chunk.structure}")
        print(f"TAG: {chunk.tag}")
        print("*****")

TEXT: National Transportation Safety Board Aviation Accident Final Report
STRUCTURE: h1
TAG: chunk
*****
TEXT: Location:
STRUCTURE: div
TAG: chunk
*****
TEXT: McKinnon, TN
STRUCTURE: td
TAG: td
*****
TEXT: Accident Number:
STRUCTURE: div
TAG: chunk
*****
TEXT: MIA08CA018
STRUCTURE: td
TAG: td
*****
TEXT: Date & Time:
STRUCTURE: td
TAG: td
*****
TEXT: 11/18/2007 , 1315 EST
STRUCTURE: td
TAG: td
*****
TEXT: Registration:
STRUCTURE: td
TAG: td
*****
TEXT: N5803A Aircraft:
STRUCTURE: td td
TAG: td
*****
TEXT: Cessna 172
STRUCTURE: td
TAG: td
*****
TEXT: Aircraft Damage :
STRUCTURE: td
TAG: td
*****
TEXT: Substantial
STRUCTURE: td
TAG: td
*****
TEXT: Defining Event:
STRUCTURE: td
TAG: td
*****
TEXT:  Injuries:
STRUCTURE: td td
TAG: td
*****
TEXT: 4 None Flight Conducted Under:
STRUCTURE: td td
TAG: td
*****
TEXT: Part 91 : General Aviation - Personal
STRUCTURE: div
TAG: Part91-cell
*****
TEXT:   Analysis
STRUCTURE: td td h1
TAG: td chunk
*****
TEXT: The pilot stated that the airplane's appr

The layout of the XHTML tables is similar to HTML tables, but Docugami goes further by semantically tagging the cells inside each table with labels gleaned from the table headers. These labels improve over time as the user builds reports or uses other functionality within Docugami. Example 3 below shows some of the table XML from the file referenced above (truncated for readability).

### Example 3: Full XML Markup

In the two examples above, there is a tradeoff between over-chunking (each cell may be missing surrounding context and information from rows/headers) and under-chunking (if you have very large tables, you get very large Markdown output, losing retrieval precision). In fact, Docugami has a complete hierarchical representation of the tables, which you can use as you wish. For example, you can just find the first table chunk and read its `xml` property for all the XML.

In [13]:
from dgml_utils.segmentation import get_leaf_structural_chunks_str

# Change to explore other files
XML_FILE_PATH = "../../testdata/NTSB/20071207X01914.xml"

with open(XML_FILE_PATH, 'r', encoding='utf-8') as file:
    xml_string = file.read()
    chunks = get_leaf_structural_chunks_str(
        dgml=xml_string,
    )

    for chunk in chunks:
        if "table" in chunk.structure:
            print(chunk.xml)
            break # just the first table
        

<xhtml:table xmlns:xhtml="http://www.w3.org/1999/xhtml" xmlns:docset="http://www.docugami.com/2021/dgml/TaqiGenerativeLabels/NationalTransportationSafetyBoard" xmlns:addedChunks="http://www.docugami.com/2021/dgml/TaqiGenerativeLabels/NationalTransportationSafetyBoard/addedChunks" xmlns:dg="http://www.docugami.com/2021/dgml" xmlns:dgc="http://www.docugami.com/2021/dgml/docugami/contracts" xmlns:dgm="http://www.docugami.com/2021/dgml/docugami/medical" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:cp="http://classifyprocess.com/2018/07/" structure="table" style="boundingBox:{left: 201.0; top: 592.6; width: 2146.5; height: 415.8; page: 1;}; ">
                    <xhtml:tbody structure="tbody" style="boundingBox:{left: 201.0; top: 592.6; width: 2146.5; height: 415.8; page: 1;}; ">
                         <xhtml:tr structure="tr" style="boundingBox:{left: 201.0; top: 592.6; width: 2146.5; height: 109.5; page: 1;}; ">
                              <xhtml:td structure="td" styl

You will notice that the table is parsed into a table body, rows and cells just like HTML tables, but has further information to assist downstream models:

1. The visual bounding box and page in the original PDF, useful for downstream processing by multi-modal (image and text) models.
2. Semantic tags on cells based on the table structure, for example `<docset:AccidentNumber>MIA08CA018 </docset:AccidentNumber>` is a semantic tag around the accident number value, based on the corresponding heading in the table.
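Both kinds of information can be read with the standard library. The sketch below assumes the `style` attribute always follows the `boundingBox:{key: value; ...}` pattern seen in the output above, and that semantic tags live in the `docset` namespace shown there. The sample cell is hand-assembled from fragments of that output rather than copied from the file, and both helper functions are illustrative, not part of `dgml-utils`.

```python
import re
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
# Namespace URI copied from the xmlns:docset declaration in the output above.
DOCSET_NS = ("http://www.docugami.com/2021/dgml/TaqiGenerativeLabels/"
             "NationalTransportationSafetyBoard")

# Hand-assembled sample cell (an assumption about what one cell looks like).
SAMPLE_CELL = (
    '<xhtml:td xmlns:xhtml="' + XHTML_NS + '" xmlns:docset="' + DOCSET_NS + '" '
    'structure="td" style="boundingBox:{left: 201.0; top: 592.6; '
    'width: 2146.5; height: 109.5; page: 1;}; ">'
    "<docset:AccidentNumber>MIA08CA018 </docset:AccidentNumber>"
    "</xhtml:td>"
)


def parse_bounding_box(style):
    """Parse 'boundingBox:{left: 1.0; ...}' into a dict of floats, or None."""
    match = re.search(r"boundingBox:\{([^}]*)\}", style or "")
    if not match:
        return None
    box = {}
    for part in match.group(1).split(";"):
        if ":" in part:
            key, value = part.split(":", 1)
            box[key.strip()] = float(value)
    return box


def semantic_values(element):
    """Return (tag, text) pairs for every docset-namespaced descendant."""
    prefix = "{" + DOCSET_NS + "}"
    return [
        (el.tag[len(prefix):], "".join(el.itertext()).strip())
        for el in element.iter()
        if el.tag.startswith(prefix)
    ]


cell = ET.fromstring(SAMPLE_CELL)
print(parse_bounding_box(cell.get("style")))
print(semantic_values(cell))
```

Note that every value, including `page`, is parsed as a float here; convert `page` to an `int` if you use it to index pages.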

### Example 4: Simplified XML Markup

The full XML markup is super rich, but too verbose to feed directly to an LLM for RAG. This is why in Example 1 by default we return text chunks, but those are missing inner semantics that Docugami finds inside these chunks e.g. semantic tags for entities inside the chunk.

You can have the best of both worlds by asking Docugami to include simplified XML tags in its output, suitable for RAG scenarios. This works for both table and text chunks. Note all the rich inner semantics for tables and text chunks!

In [16]:
from dgml_utils.segmentation import get_leaf_structural_chunks_str

# Change to explore other files
XML_FILE_PATH = "../../testdata/NTSB/20071207X01914.xml"

with open(XML_FILE_PATH, 'r', encoding='utf-8') as file:
    xml_string = file.read()
    chunks = get_leaf_structural_chunks_str(
        dgml=xml_string,
        include_xml_tags=True, # set this to get simplified XML markup inside the output chunks
    )

    for chunk in chunks[:4]:
        print(f"SIMPLIFIED_XML: {chunk.text}")
        print(f"STRUCTURE: {chunk.structure}")
        print("*****")

SIMPLIFIED_XML: <NTSBReport>National Transportation Safety Board </NTSBReport>Aviation Accident Final Report
STRUCTURE: h1
*****
SIMPLIFIED_XML: <table> <tbody> <tr> <td> Location: </td> <td> <Location>McKinnon, TN </Location> </td> <td> Accident Number: </td> <td> <AccidentNumber>MIA08CA018 </AccidentNumber> </td> </tr> <tr> <td> <LocationDateTime>Date &amp; Time: </LocationDateTime> </td> <td> <DateTime><Date>11/18/2007</Date>, <Time> 1315 </Time>EST </DateTime> </td> <td> <DateTimeAccidentNumber>Registration: </DateTimeAccidentNumber> </td> <td> <Registration>N5803A </Registration> </td> </tr> <tr> <td> <LocationAircraft>Aircraft: </LocationAircraft> </td> <td> <Aircraft>Cessna <AircraftType>172 </AircraftType></Aircraft> </td> <td> <AircraftAccidentNumber>Aircraft Damage : </AircraftAccidentNumber> </td> <td> <AircraftDamage>Substantial </AircraftDamage> </td> </tr> <tr> <td> <DefiningEvent-cell>Defining Event: </DefiningEvent-cell> </td> <td> <DefiningEventMcKinnonTN/> </td> <td> 
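One way to put these inner semantics to work is to pull tag/value pairs out of a simplified XML chunk, for instance to store them as metadata next to the chunk. The sketch below is an illustration using only the standard library, not something `dgml-utils` provides: it wraps the chunk text in a dummy root so that fragments with several top-level nodes still parse, and it treats `table`, `tbody`, `tr`, `td` and `th` as purely structural (an assumption). The sample string is a shortened copy of the output above, closed by hand.

```python
import xml.etree.ElementTree as ET

# Assumption: these tags carry layout only, so they are skipped.
STRUCTURAL_TAGS = {"table", "tbody", "tr", "td", "th"}

# Shortened, hand-closed version of the simplified XML printed above.
SAMPLE_CHUNK = (
    "<table> <tbody> <tr> <td> Location: </td> "
    "<td> <Location>McKinnon, TN </Location> </td> "
    "<td> Accident Number: </td> "
    "<td> <AccidentNumber>MIA08CA018 </AccidentNumber> </td> </tr> "
    "<tr> <td> <LocationDateTime>Date &amp; Time: </LocationDateTime> </td> "
    "<td> <DateTime><Date>11/18/2007</Date>, <Time> 1315 </Time>EST </DateTime> </td> "
    "</tr> </tbody> </table>"
)


def semantic_fields(simplified_xml: str) -> dict:
    """Map each semantic tag in a simplified XML chunk to its normalized text."""
    # Wrap in a dummy root: chunk text may contain several top-level nodes.
    root = ET.fromstring(f"<root>{simplified_xml}</root>")
    fields = {}
    for element in root.iter():
        if element is root or element.tag in STRUCTURAL_TAGS:
            continue
        # Join all nested text and collapse runs of whitespace.
        value = " ".join("".join(element.itertext()).split())
        if value:
            fields.setdefault(element.tag, value)  # keep the first occurrence
    return fields


for tag, value in semantic_fields(SAMPLE_CHUNK).items():
    print(f"{tag}: {value}")
```

With real chunks you would pass `chunk.text` from a call that sets `include_xml_tags=True`. Keeping only the first occurrence of each tag is a simplification; collect a list per tag if repeated tags matter for your documents.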

### Example 5: Parent Chunk Retrieval

LangChain already supports concepts like the [ParentDocumentRetriever](https://python.langchain.com/docs/modules/data_connection/retrievers/parent_document_retriever) to improve RAG precision by expanding the context around retrieved chunks. Docugami's XML hierarchy is ideally suited to identifying the right context around retrieved chunks by walking the parent chain of each chunk. When including XML tags, you can specify the level of hierarchy you want for parent chunks, which are returned as the `parent` property of all chunks.

In [23]:
from dgml_utils.segmentation import get_leaf_structural_chunks_str

# Change to explore other files
XML_FILE_PATH = "../../testdata/NTSB/20071207X01914.xml"

with open(XML_FILE_PATH, 'r', encoding='utf-8') as file:
    xml_string = file.read()
    chunks = get_leaf_structural_chunks_str(
        dgml=xml_string,
        include_xml_tags=True, # set this to get simplified XML markup inside the output chunks
        xml_hierarchy_levels=3, # change this to control how far up the hierarchy you want to go for parents
        sub_chunk_tables=True
    )

    for chunk in chunks[:20]:
        print(f"CHUNK_SIMPLIFIED_XML: {chunk.text}")
        print(f"CHUNK_STRUCTURE: {chunk.structure}")
        if hasattr(chunk, "parent") and chunk.parent:
            print(f"PARENT_SIMPLIFIED_XML: {chunk.parent.text}")
            print(f"PARENT_STRUCTURE: {chunk.parent.structure}")
        print("*****")

CHUNK_SIMPLIFIED_XML: <NTSBReport>National Transportation Safety Board </NTSBReport>Aviation Accident Final Report
CHUNK_STRUCTURE: h1
PARENT_SIMPLIFIED_XML: <NationalTransportationSafetyBoardAviationAccidentFinalReport-section> <NTSBReport>National Transportation Safety Board </NTSBReport>Aviation Accident Final Report <NationalTransportationSafetyBoardAviationAccidentFinalReport> <AviationAccidentFinalReport> <table> <tbody> <tr> <td> Location: </td> <td> <Location>McKinnon, TN </Location> </td> <td> Accident Number: </td> <td> <AccidentNumber>MIA08CA018 </AccidentNumber> </td> </tr> <tr> <td> <LocationDateTime>Date &amp; Time: </LocationDateTime> </td> <td> <DateTime><Date>11/18/2007</Date>, <Time> 1315 </Time>EST </DateTime> </td> <td> <DateTimeAccidentNumber>Registration: </DateTimeAccidentNumber> </td> <td> <Registration>N5803A </Registration> </td> </tr> <tr> <td> <LocationAircraft>Aircraft: </LocationAircraft> </td> <td> <Aircraft>Cessna <AircraftType>172 </AircraftType></Aircr
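The idea of walking the parent chain can be illustrated on a toy document with the standard library. This is a conceptual sketch only, not how `dgml-utils` implements `xml_hierarchy_levels`: `ElementTree` elements have no parent pointers, so the code first builds a child-to-parent map and then climbs a fixed number of levels, stopping early at the root. The sample XML and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, namespace-free stand-in for a small DGML tree.
SAMPLE = (
    "<Report><Section><h1>Title</h1>"
    "<table><tr><td><AccidentNumber>MIA08CA018</AccidentNumber></td></tr></table>"
    "</Section></Report>"
)


def build_parent_map(root):
    """Map every element to its parent (the root has no entry)."""
    return {child: parent for parent in root.iter() for child in parent}


def ancestor(element, parents, levels):
    """Climb up to `levels` steps toward the root, stopping at the root."""
    current = element
    for _ in range(levels):
        if current not in parents:
            break
        current = parents[current]
    return current


root = ET.fromstring(SAMPLE)
parents = build_parent_map(root)
leaf = root.find(".//AccidentNumber")

# Three levels up from the leaf: td, then tr, then table.
parent_chunk = ancestor(leaf, parents, levels=3)
print(parent_chunk.tag)
print(ET.tostring(parent_chunk, encoding="unicode"))
```

A larger `levels` value gives the retrieved chunk more surrounding context at the cost of a longer parent, which is the same tradeoff `xml_hierarchy_levels` exposes.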

## Conclusion

The rich markup in the Docugami XML Knowledge Graph can be used for advanced document segmentation. These chunks can be used for advanced RAG on top of these documents, for example:

1. As shown in the documentation for the [Docugami Loader](https://python.langchain.com/docs/integrations/document_loaders/docugami), the semantic tags on chunks can be used as vector metadata for improving retrieval precision for RAG.
2. You can also use techniques like the [Multi-Vector Retriever](https://blog.langchain.dev/semi-structured-multi-modal-rag/) for RAG on tabular and non-tabular content, using the segmented output from Docugami. There is a [separate notebook](./multi_vector_retriever.ipynb) that shows a quick example of this use case.