# Semantic Document Segmentation with Docugami

The LangChain documentation for the [Docugami Loader](https://python.langchain.com/docs/integrations/document_loaders/docugami) covers the basic functionality of using Docugami's XML Knowledge Graph to create better chunks for RAG. The chunks produced by Docugami follow the structural and semantic contours of the document, and are also annotated with additional semantic metadata to boost retrieval accuracy.

Let's dive deeper into Docugami's XML output and see how we can use it for semantic document segmentation. This is helpful for many reasons, including advanced Retrieval Augmented Generation (RAG) using the [Multi-Vector Retriever](https://blog.langchain.dev/semi-structured-multi-modal-rag/) on tabular and non-tabular content.

## Converting Your Documents into Docugami's XML Knowledge Graph

1. Create a [Docugami workspace](http://www.docugami.com) (free trials available)
1. Create an access token via the Developer Playground for your workspace. [Detailed instructions](https://help.docugami.com/home/docugami-api).
1. Add your documents (PDF, DOCX or DOC) to Docugami for processing. There are two ways to do this:
    1. Use the simple Docugami web experience. [Detailed instructions](https://help.docugami.com/home/adding-documents).
    1. Use the [Docugami API](https://api-docs.docugami.com), specifically the [documents](https://api-docs.docugami.com/#tag/documents/operation/upload-document) endpoint. Code samples are available for [Python](../upload_file/) and [JavaScript](../../js/upload-file/).
1. Wait for Docugami to ingest and cluster your uploaded documents into sets of similar documents, e.g. NDAs, Lease Agreements, and Service Agreements. Docugami is not limited to any particular types of documents, and the clusters created depend on your particular documents. You can [change the docset assignments](https://help.docugami.com/home/working-with-the-doc-sets-view) later if you wish. You can monitor file status in the simple Docugami webapp, or use a [webhook](https://api-docs.docugami.com/#tag/webhooks) to be informed when your documents are done processing.
1. Use the [Docugami API](https://api-docs.docugami.com) to get a list of your processed docset IDs, or just the document IDs for a particular docset. 
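
The last step can be scripted. The following is a minimal sketch, not an official client. It assumes the API base URL `https://api.docugami.com/v1preview1`, a `GET /docsets/{id}/documents` endpoint that returns JSON with a `documents` list and an optional `next` URL for pagination, and bearer-token authentication; please check all of these against the API documentation linked above. The names `list_document_ids`, `fetch_json`, and the environment variable `DOCUGAMI_API_KEY` are chosen here for illustration.

```python
import json
import os
import urllib.request
from typing import Callable, Dict, List

# Assumed base URL; confirm against the Docugami API documentation.
API_BASE = "https://api.docugami.com/v1preview1"


def fetch_json(url: str) -> Dict:
    """GET a URL with a bearer token and decode the JSON response."""
    token = os.environ["DOCUGAMI_API_KEY"]  # hypothetical variable name
    request = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def list_document_ids(docset_id: str, fetch: Callable[[str], Dict] = fetch_json) -> List[str]:
    """Collect document IDs for one docset, following `next` links if present."""
    url = f"{API_BASE}/docsets/{docset_id}/documents"
    ids: List[str] = []
    while url:
        page = fetch(url)
        ids.extend(doc["id"] for doc in page.get("documents", []))
        url = page.get("next")  # assumed pagination field
    return ids
```

Passing `fetch` as a parameter keeps the paging logic separate from the HTTP call, so you can exercise it with canned responses before pointing it at a real workspace.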

At this point, you can use the [Docugami Loader](https://python.langchain.com/docs/integrations/document_loaders/docugami) to very easily get chunks for your documents, including semantic and structural metadata. This is the simpler and recommended approach for most use cases.

The rest of this notebook looks more closely at Docugami's XML output and how to use it for advanced RAG. You can download the Docugami XML output for your processed documents directly from the API (see sample [here](../download_file_artifacts/)), or you can experiment with the sample files, with corresponding Docugami XML output, provided under [../../testdata/](../../testdata/) in this repository.

## Exploring the Docugami XML Knowledge Graph

Let's consider one of the NTSB Aviation Accident Reports under [../../testdata](../../testdata), specifically [../../testdata/NTSB/20071204X01896.pdf](../../testdata/NTSB/20071204X01896.pdf). Open the file and skim it: you will notice that it is a PDF with some tables as well as non-tabular text (headings and paragraphs). There are other test files that you can explore as well, but let's start with this one.

The Docugami XML output for each file is located in the same directory, with an XML extension. For example, the XML for the file mentioned above is at [../../testdata/NTSB/20071204X01896.xml](../../testdata/NTSB/20071204X01896.xml).

You can, of course, open this file in a text editor to explore it. You can also open it with the XML library of your choice to see how Docugami segmented the document. We will use `lxml` here and first define some helper functions that identify key structural elements in the XML:

```python
import re
from typing import Any

from lxml import etree


def _clean_text(text):
    """Collapse runs of whitespace into single spaces and trim both ends."""
    text = re.sub(r"\s+", " ", text).strip()
    return text


def _extract_node_text(node):
    """Recursively extract text from the node and its children."""
    if node.text:
        yield _clean_text(node.text)
    for child in node:
        yield from _extract_node_text(child)
        if child.tail:
            yield _clean_text(child.tail)


def _structure_value(node):
    """Extract structure value from node."""
    return node.attrib.get("structure")


def _is_heading(node: Any) -> bool:
    """Check if a node is a heading, using the structure attribute e.g. h1."""
    structure = _structure_value(node)
    if structure is not None and structure.lower().startswith("h"):
        return True
    return False
```
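
To see how the text extraction treats mixed content, here is a small self-contained sketch. It assumes a made-up XML fragment and uses the standard library's `xml.etree.ElementTree` as a stand-in for `lxml` (both expose `.text`, `.tail`, and child iteration in the same way), so it runs without extra dependencies. The function names mirror the helpers above without the leading underscore.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical fragment: a heading whose text is split around an inline child.
SAMPLE = '<title structure="h1">Accident <b>Final</b> Report</title>'


def clean_text(text):
    return re.sub(r"\s+", " ", text).strip()


def extract_node_text(node):
    """Yield the node's text, each child's text, and each child's tail, in document order."""
    if node.text:
        yield clean_text(node.text)
    for child in node:
        yield from extract_node_text(child)
        if child.tail:
            yield clean_text(child.tail)


node = ET.fromstring(SAMPLE)
fragments = list(extract_node_text(node))
print(fragments)            # ['Accident', 'Final', 'Report']
print("".join(fragments))   # AccidentFinalReport
print(" ".join(fragments))  # Accident Final Report
```

Because every fragment is stripped, the separator used when joining determines whether adjacent words run together.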

### Example 1: Nested Document XML Structure via Headings

Let's use these helper methods to traverse the XML tree depth first, and print out any node that is identified as a heading.

```python
parser = etree.XMLParser(ns_clean=True, recover=True, encoding='utf-8')

# Change to explore other files
XML_FILE_PATH = "../../testdata/NTSB/20071207X01914.xml"

with open(XML_FILE_PATH, 'r', encoding='utf-8') as file:
    xml_string = file.read()
    root = etree.fromstring(xml_string, parser=parser)

    def _traverse_tree_for_headings(node, level=0, processed_nodes=None):
        """Depth-first traversal of the XML tree."""
        # Create the set per top-level call instead of sharing a mutable default.
        if processed_nodes is None:
            processed_nodes = set()

        if node in processed_nodes:
            return  # Skip already processed nodes

        # Check if the current node is a heading.
        if _is_heading(node):
            indentation = '    ' * level
            node_text = ''.join(_extract_node_text(node))
            print(indentation + node_text)
            processed_nodes.add(node)  # Mark this node as processed

            # If heading, increment the level and check siblings.
            level += 1
            sibling = node.getnext()
            while sibling is not None:
                _traverse_tree_for_headings(sibling, level, processed_nodes)
                sibling = sibling.getnext()
        else:
            # If not heading, just check its children.
            for child in node:
                _traverse_tree_for_headings(child, level, processed_nodes)

    _traverse_tree_for_headings(root)
```

```text
National Transportation Safety BoardAviation Accident Final Report
    Analysis
    Probable Causeand Findings
    Findings
        Findings
        Factual Information
            Pilot Information
            Aircraft and Owner/Operator Information
            Meteorological Information and Flight Plan
            Airport Information
            Wreckage andImpact Information
            Administrative Information
                Additional Participating Persons:
```


You can open the PDF side by side to compare it with the output above. The headings are identified correctly, with a few false positives, and they are nested hierarchically in a semantic tree that reflects their relationships. Some words appear fused (for example, `BoardAviation`) because `_clean_text` strips whitespace from each text fragment and the fragments are then joined with an empty string.
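
Headings are a natural boundary for segmentation. The sketch below is one possible way to group body text under the nearest preceding heading; it is a flat simplification of the nested traversal above, not Docugami's own chunking. It assumes a made-up document that imitates the `structure` attribute, uses the standard library's `xml.etree.ElementTree` instead of `lxml`, and treats any element with text of its own as one body unit. The function names are illustrative.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical document; the `structure` attribute imitates the Docugami markup.
SAMPLE = """
<doc>
  <h structure="h1">First Heading</h>
  <p structure="p">Paragraph one.</p>
  <div structure="div">
    <p structure="p">Paragraph <i>two</i>.</p>
    <h structure="h2">Second Heading</h>
    <p structure="p">Paragraph three.</p>
  </div>
</doc>
"""


def is_heading(node):
    return (node.attrib.get("structure") or "").lower().startswith("h")


def node_text(node):
    """All text under a node, with whitespace normalized."""
    return re.sub(r"\s+", " ", "".join(node.itertext())).strip()


def segment_by_headings(root):
    """Return (heading, body) pairs; text before the first heading gets an empty heading."""
    sections = [["", []]]

    def walk(node):
        if is_heading(node):
            sections.append([node_text(node), []])  # a heading starts a new section
            return
        if node.text and node.text.strip():
            sections[-1][1].append(node_text(node))  # element with its own text: one unit
            return
        for child in node:  # pure container: descend
            walk(child)

    walk(root)
    return [(title, " ".join(body)) for title, body in sections if title or body]


for title, body in segment_by_headings(ET.fromstring(SAMPLE)):
    print(f"{title!r}: {body!r}")
```

Tail text that follows a child inside a pure container is ignored here; handle it as the helper functions above do if your documents need it.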

Next, let's look at the tables in the XML output. These are XHTML tables, so to display them in this notebook we first convert them to HTML and then traverse the tree, rendering each table we encounter.

### Example 2: Semantic Document XML Structure via Tables


```python
import re
from lxml import etree
from IPython.display import display, HTML

# Convert XHTML tags to HTML5 and display in Jupyter notebook
def render_xhtml_in_jupyter(xhtml_str):
    # Convert XHTML tags to HTML5
    html5_str = re.sub(r'<xhtml:(\w+)', r'<\1', xhtml_str)
    html5_str = re.sub(r'</xhtml:(\w+)', r'</\1', html5_str)
    html5_str = re.sub(r'<dg:chunk', r'<span', html5_str)
    html5_str = re.sub(r'</dg:chunk', r'</span', html5_str)
    html5_str = re.sub(r'<docset:(\w+)', r'<span class="\1"', html5_str)
    html5_str = re.sub(r'</docset:(\w+)', r'</span', html5_str)

    # Display the converted HTML
    display(HTML(html5_str))
    display(HTML('<hr/>'))
```
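
To inspect what these substitutions produce outside a notebook, the sketch below applies the same six regular expressions and returns the string instead of displaying it. The function name `xhtml_to_html5` is introduced here for illustration; the sample cell reuses the `docset:Location` markup from the test file.

```python
import re

# Same substitutions as render_xhtml_in_jupyter, without the IPython display calls.
SUBSTITUTIONS = [
    (r"<xhtml:(\w+)", r"<\1"),
    (r"</xhtml:(\w+)", r"</\1"),
    (r"<dg:chunk", r"<span"),
    (r"</dg:chunk", r"</span"),
    (r"<docset:(\w+)", r'<span class="\1"'),
    (r"</docset:(\w+)", r"</span"),
]


def xhtml_to_html5(xhtml_str):
    """Apply each tag substitution in order and return the converted markup."""
    for pattern, replacement in SUBSTITUTIONS:
        xhtml_str = re.sub(pattern, replacement, xhtml_str)
    return xhtml_str


cell = '<xhtml:td structure="td"><docset:Location>McKinnon, TN </docset:Location></xhtml:td>'
print(xhtml_to_html5(cell))
# <td structure="td"><span class="Location">McKinnon, TN </span></td>
```

Attributes such as `structure` pass through unchanged, and each semantic tag name becomes a CSS class on a `span`.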

```python
# Parser setup
parser = etree.XMLParser(ns_clean=True, recover=True, encoding='utf-8')

# Change to explore other files
XML_FILE_PATH = "../../testdata/NTSB/20071207X01914.xml"

with open(XML_FILE_PATH, 'r', encoding='utf-8') as file:
    xml_string = file.read()
    root = etree.fromstring(xml_string, parser=parser)

    def _traverse_tree_for_tables(node):
        """Depth-first traversal of the XML tree for xhtml:table tags."""
        # Check if the current node is a table.
        if node.tag == '{http://www.w3.org/1999/xhtml}table':
            # Convert table to string and render using the provided function
            table_str = etree.tostring(node, encoding="utf-8", method="xml").decode('utf-8')
            render_xhtml_in_jupyter(table_str)
            return  # Once we've processed the table, we don't need to go deeper

        # If not a table, continue with the children.
        for child in node:
            _traverse_tree_for_tables(child)

    _traverse_tree_for_tables(root)
```

| | | | |
|---|---|---|---|
| Location: | McKinnon, TN | Accident Number: | MIA08CA018 |
| Date & Time: | 11/18/2007, 1315 EST | Registration: | N5803A |
| Aircraft: | Cessna 172 | Aircraft Damage: | Substantial |
| Defining Event: | | Injuries: | 4 None |
| Flight Conducted Under: | Part 91: General Aviation - Personal | | |

| | | | |
|---|---|---|---|
| Certificate: | Private | Age: | 63, Male |
| Airplane Rating(s): | Single-engine Land | Seat Occupied: | |
| Other Aircraft Rating(s): | | Restraint Used: | Seatbelt |
| Instrument Rating(s): | | Second Pilot Present: | |
| Instructor Rating(s): | | Toxicology Performed: | |
| Medical Certification: | Class 3 | Last FAA Medical Exam: | 08/01/2007 |
| Occupational Pilot: | | Last Flight Review or Equivalent: | |
| Flight Time: | 800 hours (Total, all aircraft), 10 hours (Total, this make and model), 2 hours (Last 90 days, all aircraft), 2 hours (Last 30 days, all aircraft) | | |

| | | | |
|---|---|---|---|
| Aircraft Make: | Cessna | Registration: | N5803A |
| Model/Series: | 172 | Aircraft Category: | Airplane |
| Year of Manufacture: | | Amateur Built: | No |
| Airworthiness Certificate: | Utility | Serial Number: | 28403 |
| Landing Gear Type: | Tricycle | Seats: | |
| Date/Type of Last Inspection: | | Certified Max Gross Wt.: | |
| Time Since Last Inspection: | | Engines: | 1 Reciprocating |
| Airframe Total Time: | | Engine Manufacturer: | Continental |
| ELT: | | Engine Model/Series: | O-300 |
| Registered Owner: | Sam Goodman | Rated Power: | |

| | | | |
|---|---|---|---|
| Conditions at Accident Site: | Visual Conditions | Condition of Light: | Day |
| Observation Facility, Elevation: | KHOP | Distance from Accident Site: | |
| Observation Time: | 1355 | Direction from Accident Site: | |
| Lowest Cloud Condition: | Clear | Visibility | 10 Miles |
| Lowest Ceiling: | Broken / 20000 ft agl | Visibility (RVR): | |
| Wind Speed/Gusts: | 5 knots / | Turbulence Type Forecast/Actual: | / |
| Wind Direction: | 150° | Turbulence Severity Forecast/Actual: | / |
| Altimeter Setting: | 30.17 inches Hg | Temperature/Dew Point: | 17°C / 7°C |
| Precipitation and Obscuration: | No Obscuration; No Precipitation | | |
| Departure Point: | Mount Pleasant, TN (MRC) | Type of Flight Plan Filed: | |

| | | | |
|---|---|---|---|
| Airport: | Houston County (KM93) | Runway Surface Type: | Asphalt |
| Airport Elevation: | | Runway Surface Condition: | Dry |
| Runway Used: | 8 | IFR Approach: | |
| Runway Length/Width: | 3000 ft / 75 ft | VFR Approach/Landing: | |

| | | | |
|---|---|---|---|
| Crew Injuries: | 1 None | Aircraft Damage: | Substantial |
| Passenger Injuries: | 3 None | Aircraft Fire: | |
| Ground Injuries: | | Aircraft Explosion: | |
| Total Injuries: | 4 None | Latitude, Longitude: | 36.316667, -87.916667 |

| | | | |
|---|---|---|---|
| Investigator In Charge (IIC): | Jose Obregon | Report Date: | 12/20/2007 |
| Additional Participating Persons: | | | |
| Publish Date: | | | |
| Note: Investigation Docket: | This accident report documents the factual circumstances of this accident as described to the NTSB. NTSB accident and incident dockets serve as permanent archival information for the NTSB’s investigations. Dockets released prior to June 1, 2009 are publicly available from the NTSB’s Record Management Division at pubinq@ntsb.gov, or at 800-877-6799. Dockets released after this date are available at http://dms.ntsb.gov/pubdms/. | | |


In the output above, all the tables from the PDF are not only identified but also correctly parsed into tabular structures. The XHTML tables are laid out like HTML tables, but Docugami goes further and tags cells semantically with labels gleaned from any table headers. These labels improve over time as the user builds reports or uses other functionality within Docugami. Here is an example of some table XML from the file referenced above (truncated for readability):

```xml
<xhtml:table structure="table"
style="boundingBox:{left: 201.0; top: 592.6; width: 2146.5; height: 415.8; page: 1;}; ">
<xhtml:tbody structure="tbody"
     style="boundingBox:{left: 201.0; top: 592.6; width: 2146.5; height: 415.8; page: 1;}; ">
     <xhtml:tr structure="tr"
          style="boundingBox:{left: 201.0; top: 592.6; width: 2146.5; height: 109.5; page: 1;}; ">
          <xhtml:td structure="td"
               style="boundingBox:{left: 201.0; top: 592.6; width: 526.0; height: 109.5; page: 1;}; ">
               <dg:chunk structure="div">Location: </dg:chunk>
          </xhtml:td>

          <xhtml:td structure="td"
               style="boundingBox:{left: 727.1; top: 592.6; width: 697.9; height: 109.5; page: 1;}; ">
               <docset:Location>McKinnon, TN </docset:Location>
          </xhtml:td>

          <xhtml:td structure="td"
               style="boundingBox:{left: 1425.0; top: 592.6; width: 441.7; height: 109.5; page: 1;}; ">
               <dg:chunk structure="div">Accident Number: </dg:chunk>
          </xhtml:td>

          <xhtml:td structure="td"
               style="boundingBox:{left: 1866.7; top: 592.6; width: 480.8; height: 109.5; page: 1;}; ">
               <docset:AccidentNumber>MIA08CA018 </docset:AccidentNumber>
          </xhtml:td>
     </xhtml:tr>
     <!-- .... truncated for readability -->
</xhtml:tbody>
</xhtml:table>
```

You will notice that the table is parsed into a table body, rows and cells just like HTML tables, but has further information to assist downstream models:

1. The visual bounding box and page in the original PDF, useful for downstream processing by multi-modal (image and text) models.
2. Semantic tags on cells based on the table structure. For example, `<docset:AccidentNumber>MIA08CA018 </docset:AccidentNumber>` is a semantic tag around the accident number value, based on the corresponding heading in the table.
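
Both kinds of information can be read out of a table with a few lines of code. The sketch below is a minimal illustration under stated assumptions: it uses the standard library's `xml.etree.ElementTree` rather than `lxml`, the `dg` and `docset` namespace URIs in the sample are placeholders (read the real ones from the root element of your XML file), and a cell's semantic tag is taken to be its first child element that is not a `dg:chunk`. The function names are illustrative.

```python
import re
import xml.etree.ElementTree as ET

TD = "{http://www.w3.org/1999/xhtml}td"

# The dg and docset namespace URIs below are placeholders, not Docugami's real ones.
SAMPLE = """
<xhtml:table xmlns:xhtml="http://www.w3.org/1999/xhtml"
             xmlns:dg="urn:example:dg" xmlns:docset="urn:example:docset">
  <xhtml:tr>
    <xhtml:td style="boundingBox:{left: 201.0; top: 592.6; width: 526.0; height: 109.5; page: 1;}; ">
      <dg:chunk structure="div">Location: </dg:chunk>
    </xhtml:td>
    <xhtml:td style="boundingBox:{left: 727.1; top: 592.6; width: 697.9; height: 109.5; page: 1;}; ">
      <docset:Location>McKinnon, TN </docset:Location>
    </xhtml:td>
  </xhtml:tr>
</xhtml:table>
"""


def parse_bounding_box(style):
    """Turn 'boundingBox:{left: 201.0; ...; page: 1;}' into a dict of floats."""
    return {key: float(value) for key, value in re.findall(r"(\w+):\s*([\d.]+)", style)}


def table_cells(table):
    """Yield one dict per cell: normalized text, semantic tag (or None), bounding box."""
    for td in table.iter(TD):
        names = [child.tag.rsplit("}", 1)[-1] for child in td]  # strip namespaces
        tags = [name for name in names if name != "chunk"]
        yield {
            "text": " ".join("".join(td.itertext()).split()),
            "tag": tags[0] if tags else None,
            "box": parse_bounding_box(td.get("style", "")),
        }


cells = list(table_cells(ET.fromstring(SAMPLE)))
for cell in cells:
    print(cell["tag"], "|", cell["text"], "|", cell["box"])
```

The `tag` values are the kind of label that can be attached to a chunk as metadata, and `box` locates the cell on the page for models that also look at the rendered document.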

## Conclusion

The rich markup in the Docugami XML Knowledge Graph can be used for advanced document segmentation. These chunks can be used for advanced RAG on top of these documents, for example:

1. As shown in the documentation for the [Docugami Loader](https://python.langchain.com/docs/integrations/document_loaders/docugami), the semantic tags on chunks can be used as vector metadata for improving retrieval precision for RAG.
2. You can also use techniques like the [Multi-Vector Retriever](https://blog.langchain.dev/semi-structured-multi-modal-rag/) for RAG on tabular and non-tabular content, using the segmented output from Docugami. There is a [separate notebook](./multi_vector_retriever.ipynb) that shows a quick example of this use case.
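
As a rough illustration of the second point, the sketch below splits a document into table chunks and text chunks in document order, which is the kind of input a multi-vector setup can summarize and index separately. It is a simplification under stated assumptions: the sample document is made up (only the XHTML namespace is real), the standard library's `xml.etree.ElementTree` stands in for `lxml`, and the chunk dictionaries with `kind` and `content` keys are a format invented here rather than anything Docugami or LangChain prescribes.

```python
import xml.etree.ElementTree as ET

XHTML = "{http://www.w3.org/1999/xhtml}"

# Hypothetical document mixing narrative text and one XHTML table.
SAMPLE = """
<doc xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <p structure="p">Narrative paragraph.</p>
  <xhtml:table>
    <xhtml:tr><xhtml:td>Aircraft:</xhtml:td><xhtml:td>Cessna 172</xhtml:td></xhtml:tr>
  </xhtml:table>
  <p structure="p">Closing paragraph.</p>
</doc>
"""


def normalized_text(node):
    return " ".join("".join(node.itertext()).split())


def split_chunks(root):
    """Return chunks in document order, each tagged as 'table' or 'text'."""
    chunks = []

    def walk(node):
        if node.tag == XHTML + "table":
            rows = [[normalized_text(cell) for cell in row] for row in node.iter(XHTML + "tr")]
            chunks.append({"kind": "table", "content": rows})
            return  # keep the table whole; do not descend into it
        if node.text and node.text.strip():
            chunks.append({"kind": "text", "content": normalized_text(node)})
            return
        for child in node:
            walk(child)

    walk(root)
    return chunks


for chunk in split_chunks(ET.fromstring(SAMPLE)):
    print(chunk)
```

Keeping each table as a single chunk preserves its row and column relationships, while the surrounding prose can be chunked along heading boundaries.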