# Comparison of ETL document analyzer libraries for Python

The tools included in this analysis are listed below.

* **Fully commercial tools**
    - [Reducto.ai](https://reducto.ai/)
    - [Extend](https://www.extend.app/)
* **Commercial tools with free tier**
    - [Unstructured](https://unstructured.io/)
    - [Datalab](https://www.datalab.to/) 
    - [PyMuPDF](https://pymupdf.readthedocs.io/en/latest/) (Handling Office files requires the commercial `PyMuPDFPro` extension. It does not seem to offer anything new. RAG functionality is available but very standard. No good conversion from DOCX to Markdown.)

* **Fully open-source tools**
    - [MarkItDown](https://github.com/microsoft/markitdown) (Very simple to use and easy conversion to Markdown format. Uses `mammoth` as the DOCX engine. Possible integration with LLMs, but they only mention figure analysis.)
    - [Docling](https://github.com/DS4SD/docling) (Very simple to use and easy conversion to markdown format and `html`. Uses `python-docx` as engine. Integration with LangChain framework.)
    - [Everything2Markdown](https://github.com/wisupai/e2m)


The analysis uses the following criteria:
* Price (free, paid, free tier)
* Conversion latency
* Integration with other tools (LLMs, LangChain)
* Supported file formats
* Accuracy of data extraction using OpenAI (`Figure legends` and `Data availability` sections)
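
The latency and accuracy criteria can be measured with two small helpers. This is only a sketch: the names `timed` and `similarity` are hypothetical, and the character-level ratio from `difflib` is an assumed stand-in for whatever accuracy metric the final comparison adopts.

```python
import time
from difflib import SequenceMatcher


def timed(func, *args, **kwargs):
    """Run func once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


def similarity(extracted: str, expected: str) -> float:
    """Whitespace-insensitive similarity ratio between 0.0 and 1.0."""
    a = " ".join(extracted.split())
    b = " ".join(expected.split())
    return SequenceMatcher(None, a, b).ratio()


# Any converter call can be wrapped the same way as str.upper here.
text, seconds = timed(str.upper, "figure legends")
print(text, f"{seconds:.6f}s")
print(similarity("Data  availability\n", "Data availability"))
```

A single run is noisy, so averaging several runs per file would likely give a fairer latency figure.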


## Notebook preparation

In [26]:
from glob import glob
import os
from dotenv import load_dotenv

load_dotenv()  # Load .env file
import pymupdf.pro
pymupdf.pro.unlock(os.getenv('PYMUPDF_KEY'))


PDF_FILES = glob('./data/pdf/*')
DOCX_FILES = glob('./data/docx/*')
GROUND_TRUTH = glob('./data/ground_truth/*')


# Sort each list so that files with the same name end up at the same index
PDF_FILES.sort()
DOCX_FILES.sort()
GROUND_TRUTH.sort()

PyMuPDFPro: Trial Mode, 59 days 9 hours left.
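
Sorting aligns the three lists only if every folder holds exactly the same set of file names. A minimal sketch of a safer pairing by file stem follows; `pair_by_stem` is a hypothetical helper, and it assumes the PDF, DOCX, and ground-truth files of one manuscript share the same stem.

```python
from pathlib import Path


def pair_by_stem(*file_lists):
    """Return {stem: [path_from_each_list]} for stems present in every list."""
    by_stem = [{Path(p).stem: p for p in files} for files in file_lists]
    common = set(by_stem[0]).intersection(*by_stem[1:])
    return {stem: [mapping[stem] for mapping in by_stem] for stem in sorted(common)}


pdfs = ["./data/pdf/a.pdf", "./data/pdf/b.pdf"]
docx = ["./data/docx/b.docx", "./data/docx/c.docx"]
print(pair_by_stem(pdfs, docx))
# {'b': ['./data/pdf/b.pdf', './data/docx/b.docx']}
```

Files that exist in only one folder are dropped silently, so comparing `len()` of the result with the input lists shows whether anything is missing.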


## 0. Using `pypandoc` as the baseline

This is the library that we currently use in `soda-curation` to convert the documents to `html`.
It gives us a baseline for conversion time. It offers conversion only, with no additional functionality.

* `pypandoc` does not support `.pdf` files as input

In [25]:
import pypandoc

pypandoc_html = pypandoc.convert_file(DOCX_FILES[0], 'html')
pypandoc_html

'<p><em><strong>LINE-1 RNA triggers matrix formation in bone cells via a\nPKR-mediated inflammatory response</strong></em></p>\n<p>Arianna Mangiavacchi,<sup>1@</sup> Gabriele Morelli,<sup>1</sup> Sjur\nReppe,<sup>2,3,4</sup> Alfonso Saera-Vila,<sup>5</sup> Peng\nLiu,<sup>1</sup> Benjamin Eggerschwiler,<sup>6,7</sup> Huoming\nZhang,<sup>8</sup> Dalila Bensaddek,<sup>8</sup> Elisa A.\nCasanova,<sup>5</sup> Carolina Medina Gomez,<sup>9</sup> Vid\nPrijatelj,<sup>9</sup> Francesco Della Valle,<sup>1</sup> Nazerke\nAtinbayeva,<sup>1</sup> Juan Carlos Izpisua Belmonte,<sup>10</sup>\nFernando Rivadeneira,<sup>9</sup> Paolo Cinelli,<sup>5,11</sup> Kaare\nMorten Gautvik,<sup>3*</sup> Valerio Orlando.<sup>1*@</sup></p>\n<p><sup>1</sup> King Abdullah University of Science and Technology\n(KAUST), Biological Environmental Science and Engineering Division,\nThuwal 23500-6900, Kingdom of Saudi Arabia\u200b</p>\n<p><sup>2</sup> Oslo University Hospital, Department of Medical\nBiochemistry, Oslo, Norwa

In [3]:
pypandoc_md = pypandoc.convert_file(DOCX_FILES[0], 'markdown')
pypandoc_md

'***LINE-1 RNA triggers matrix formation in bone cells via a PKR-mediated\ninflammatory response***\n\nArianna Mangiavacchi,^1@^ Gabriele Morelli,^1^ Sjur Reppe,^2,3,4^\nAlfonso Saera-Vila,^5^ Peng Liu,^1^ Benjamin Eggerschwiler,^6,7^ Huoming\nZhang,^8^ Dalila Bensaddek,^8^ Elisa A. Casanova,^5^ Carolina Medina\nGomez,^9^ Vid Prijatelj,^9^ Francesco Della Valle,^1^ Nazerke\nAtinbayeva,^1^ Juan Carlos Izpisua Belmonte,^10^ Fernando\nRivadeneira,^9^ Paolo Cinelli,^5,11^ Kaare Morten Gautvik,^3\\*^ Valerio\nOrlando.^1\\*@^\n\n^1^ King Abdullah University of Science and Technology (KAUST),\nBiological Environmental Science and Engineering Division, Thuwal\n23500-6900, Kingdom of Saudi Arabia\u200b\n\n^2^ Oslo University Hospital, Department of Medical Biochemistry, Oslo,\nNorway.\n\n^3^Lovisenberg Diaconal Hospital, Unger-Vetlesen Institute, Oslo,\nNorway.\n\n^4^Oslo University Hospital, Department of Plastic and Reconstructive\nSurgery, Oslo, Norway.\n\n^5^ Sequentia Biotech, Carrer Comte

## 1. PyMuPDF

In [4]:
import pymupdf


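### Is it possible to use the `TOC` to find the sections in the documents?

It does not seem to be the case, as shown below: the DOCX file returns an empty TOC, and the PDF TOC only bookmarks the manuscript text, the figures, and the tables.

MuPDF, the engine behind PyMuPDF, has its own functionality to convert `docx` files into something readable.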

In [69]:
docx_doc = pymupdf.open(DOCX_FILES[0])
docx_doc.get_toc()

[]

In [6]:
pdf_doc = pymupdf.open(PDF_FILES[0])
pdf_doc.get_toc()

[[1, 'Manuscript Text', 1],
 [1, 'Figure 1', 65],
 [1, 'Figure 2', 66],
 [1, 'Figure 3', 67],
 [1, 'Figure 4', 68],
 [1, 'Figure 5', 69],
 [1, 'Figure 6', 70],
 [1, 'Figure 7', 71],
 [1, 'Figure 8', 72],
 [1, 'Figure EV1', 73],
 [1, 'Figure EV2', 74],
 [1, 'Figure EV3', 75],
 [1, 'Figure EV4', 76],
 [1, 'Figure EV5', 77],
 [1, 'Table EV1', 78],
 [1, 'Table EV2', 79],
 [1, 'Table EV3', 81]]
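
Even though the TOC does not expose manuscript sections, its entries can still bound page ranges, for example to separate the manuscript text from the figure pages. The sketch below assumes a flat TOC of `[level, title, page]` entries with 1-based page numbers, as in the output above; `toc_page_ranges` is a hypothetical helper, not part of PyMuPDF.

```python
def toc_page_ranges(toc, page_count):
    """Map each TOC title to an inclusive (first_page, last_page) range, 1-based."""
    ranges = {}
    for index, (_level, title, first_page) in enumerate(toc):
        if index + 1 < len(toc):
            # The entry ends one page before the next bookmark starts.
            next_start = toc[index + 1][2]
            last_page = max(first_page, next_start - 1)
        else:
            # The last entry runs to the end of the document.
            last_page = page_count
        ranges[title] = (first_page, last_page)
    return ranges


toc = [[1, "Manuscript Text", 1], [1, "Figure 1", 65], [1, "Figure 2", 66]]
print(toc_page_ranges(toc, page_count=66))
# {'Manuscript Text': (1, 64), 'Figure 1': (65, 65), 'Figure 2': (66, 66)}
```

With a real document, `page_count` would come from the opened document and `toc` from `get_toc()`.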

### Converting the files to markdown format

In [7]:
import pymupdf4llm

In [8]:
pymupdf4llm.to_markdown(DOCX_FILES[0])


Processing ./data/docx/EMBOJ-2023-115257.docx...


"LINE-1 RNA triggers matrix formation in bone cells via a PKR-mediated inflammatory response\n\nArianna Mangiavacchi,[1@] Gabriele Morelli,[1] Sjur Reppe,[2,3,4 ]Alfonso Saera-Vila,[5] Peng Liu,[1] Benjamin Eggerschwiler,[6,7] Huoming\n\nZhang,[8] Dalila Bensaddek,[8] Elisa A. Casanova,[5] Carolina Medina Gomez,[9] Vid Prijatelj,[9] Francesco Della Valle,[1 ] Nazerke\n\nAtinbayeva,[1] Juan Carlos Izpisua Belmonte,[10] Fernando Rivadeneira,[9] Paolo Cinelli,[5,11] Kaare Morten Gautvik,[3*] Valerio Orlando.\n\n1*@\n\n1 King Abdullah University of Science and Technology (KAUST), Biological Environmental Science and\n\nEngineering Division, Thuwal 23500-6900, Kingdom of Saudi Arabia\u200b\n\n2 Oslo University Hospital, Department of Medical Biochemistry, Oslo, Norway.\n\n3Lovisenberg Diaconal Hospital, Unger-Vetlesen Institute, Oslo, Norway.\n\n4Oslo University Hospital, Department of Plastic and Reconstructive Surgery, Oslo, Norway.\n\n5 Sequentia Biotech, Carrer Comte D'Urgell 240, Barce

In [9]:
pymupdf4llm.to_markdown(PDF_FILES[0])


Processing ./data/pdf/EMM-2023-18636.pdf...


"1\n\n2\n\n3 **DUSP6 inhibition overcomes Neuregulin/HER3-driven therapy**\n\n4 **tolerance in HER2+ breast cancer**\n\n5\n\n6 Majid Momeny[1,2§], Mari Tienhaara[3,4], Mukund Sharma[1,4], Deepankar Chakroborty[3,4],\n\n7 Roosa Varjus[1], Iina Takala[3,4], Joni Merisaari[1], Artur Padzik[1], Andreas Vogt[5], Ilkka\n\n8 Paatero[1], Klaus Elenius[3,4], Teemu D. Laajala[6], Kari J. Kurppa[3,4], Jukka\n\n9 Westermarck[1,4,§]\n\n10\n\n11 1Turku Bioscience Centre, University of Turku and Åbo Akademi University, Turku,\n\n12 Finland\n\n13 2The Brown Foundation Institute of Molecular Medicine, McGovern Medical School,\n\n14 The University of Texas Health Science Center at Houston, TX, USA.\n\n15 3Medicity Research Laboratories, Faculty of Medicine, University of Turku, Turku,\n\n16 Finland\n\n17 4Institute of Biomedicine, University of Turku, Turku, Finland\n\n18 5University of Pittsburgh Drug Discovery Institute, Department of Computational and\n\n19 Systems Biology, Pittsburgh Technology Cent

### Document loading using LangChain

In [10]:
from langchain_community.document_loaders import PyMuPDFLoader
loader = PyMuPDFLoader(PDF_FILES[0])
data = loader.load()


In [11]:
from langchain_community.document_loaders import PyMuPDFLoader
loader = PyMuPDFLoader(DOCX_FILES[0])
data = loader.load()


### Searching for text in the document

PyMuPDF supports searching for text at the page level. This might be of use for us to limit the context sent to the LLM.

In [12]:
# Loop through the pages in the document and report each page where the text is found
docx_doc = pymupdf.open(DOCX_FILES[0])
for page in docx_doc:
    if page.search_for('Figure legends'):
        print('Text found on page')

Text found on page
Text found on page


In [13]:
# Loop through the pages in the document and print each page where the text is found
docx_doc = pymupdf.open(DOCX_FILES[0])
for page in docx_doc:
    if page.search_for('Data availability'):
        print('Text found on page')
        print(page.get_text())

Text found on page
The collisional energy increased linearly from 20.01 eV at 0.6 (1/K0) to 52.00 eV at 1.35 Vs cm−2 (1/K0). The scan
range for MS and MS/MS spectra was set to 100-1700 m/z. TIMS ramping time and accumulation time were set to
100 milliseconds. The diaPASEF data were analyzed by directDIA approach using Spectronaut software (version 14)
following manufacture instructions. Up or down regulated proteins were determined using DEP (Differential
Enrichment analysis of Proteomics data) R package. Significant results (adjusted p-value < 0.05) were subjected to
Gene Ontology enrichment analysis with clusterProfiler R package.
Western Blot
Total protein extracts were prepared by lysing cells in extraction buffer (HEPES KOH [pH 8.5], NaCl 400 mM,
EDTA 0.1 mM, EGTA 0.1 mM, DTT 1 mM, 1× protease inhibitor, SDS 1%). Proteins were separated by
electrophoresis on BOLT 4%–12% bis-tris polyacrylamide precast gels in MES buffer (Life Technologies) and
transferred to a 0,2 m nitrocellulos

In [14]:
# Loop through the pages in the document and print each page where the text is found
docx_doc = pymupdf.open(DOCX_FILES[0])
for page in docx_doc:
    if page.search_for('Figure Legends'):
        print('Text found on page')
        print(page.get_text())

Text found on page
The collisional energy increased linearly from 20.01 eV at 0.6 (1/K0) to 52.00 eV at 1.35 Vs cm−2 (1/K0). The scan
range for MS and MS/MS spectra was set to 100-1700 m/z. TIMS ramping time and accumulation time were set to
100 milliseconds. The diaPASEF data were analyzed by directDIA approach using Spectronaut software (version 14)
following manufacture instructions. Up or down regulated proteins were determined using DEP (Differential
Enrichment analysis of Proteomics data) R package. Significant results (adjusted p-value < 0.05) were subjected to
Gene Ontology enrichment analysis with clusterProfiler R package.
Western Blot
Total protein extracts were prepared by lysing cells in extraction buffer (HEPES KOH [pH 8.5], NaCl 400 mM,
EDTA 0.1 mM, EGTA 0.1 mM, DTT 1 mM, 1× protease inhibitor, SDS 1%). Proteins were separated by
electrophoresis on BOLT 4%–12% bis-tris polyacrylamide precast gels in MES buffer (Life Technologies) and
transferred to a 0,2 m nitrocellulos

In [15]:
page.get_text()

'expressed protein (adjusted p-value <0.05) in the exosomes of L1 compared to RFP transfected osteoblasts.\n \n'
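
To limit the context sent to the LLM, the page-level search can be turned into a page filter. The sketch below works on plain page strings so it runs without PyMuPDF; with a real document the list could be built as `[page.get_text() for page in doc]`. The helper names are hypothetical, and the match is deliberately case-insensitive so that `Figure legends` and `Figure Legends` behave the same.

```python
def pages_mentioning(page_texts, headings):
    """Return 0-based indices of pages whose text contains any heading, ignoring case."""
    wanted = [heading.casefold() for heading in headings]
    return [
        index
        for index, text in enumerate(page_texts)
        if any(heading in text.casefold() for heading in wanted)
    ]


def build_context(page_texts, headings):
    """Concatenate only the pages that mention one of the headings."""
    return "\n".join(page_texts[i] for i in pages_mentioning(page_texts, headings))


pages = ["Introduction ...", "Figure Legends\nFigure 1 ...", "Methods ...", "DATA AVAILABILITY\n..."]
print(pages_mentioning(pages, ["Figure legends", "Data availability"]))
# [1, 3]
```

A section that spills onto the following page would be cut off, so including the page after each hit may be worth testing.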

### Splitting the text to find sections in the document

To extract specific sections from a document, one option is to convert it to Markdown with `pymupdf4llm` and then apply a LangChain text splitter that follows the document's structure. Markdown-aware splitters, such as `MarkdownHeaderTextSplitter` or the `ExperimentalMarkdownSyntaxTextSplitter` used here, split the text on Markdown headers and so preserve the logical organization of the document.

The cell below applies this approach to a DOCX file:

In [16]:
import pymupdf4llm
from langchain_text_splitters.markdown import ExperimentalMarkdownSyntaxTextSplitter

# Get the MD text
md_text = pymupdf4llm.to_markdown(DOCX_FILES[1])  # get markdown for all pages

# Initialize the Markdown syntax text splitter
splitter = ExperimentalMarkdownSyntaxTextSplitter()

chunks = splitter.split_text(md_text)

for chunk in chunks:
    print(f"****{chunk}****\n")


Processing ./data/docx/EMM-2023-19044.docx...
****page_content='Research Article

Title: Ependymal cell lineage reprogramming as a potential therapeutic intervention for

hydrocephalus

Authors: Konstantina Kaplani[1], Maria - Eleni Lalioti[1], Stella Vassalou[1], Georgia Lokka[1], Evangelia

Parlapani[1], Georgios Kritikos[1], Zoi Lygerou[2], Stavros Taraviras[1,]

Affiliations:

1Department of Physiology, School of Medicine, University of Patras, Patras, Greece

2Department of General Biology, School of Medicine, University of Patras, Patras, Greece

Corresponding author: Stavros Taraviras

Corresponding author’s address: Department of Physiology, School of Medicine, University of Patras,

Asklepiou Street 1, Rio 26504, Patras, Greece

Corresponding author’s phone and fax: +30 2610 997676

Corresponding author’s e-mail address: [taraviras@med.upatras.gr](mailto:taraviras@med.upatras.gr)

Running title: GemC1 and McIdas promote ependymal reprogramming.

Keywords: ependymal cells, repr

Note that the Markdown output is in fact `plain text`: it contains no Markdown headers for the splitter to work with. The reason is that only the commercial version of this library actually supports converting DOCX to Markdown.
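
When the converted text has no Markdown headers, a fallback is to split on known heading names that stand alone on a line. This is a minimal sketch, assuming the headings of interest (for example `Figure legends` and `Data availability`) appear on their own line; `split_on_headings` is a hypothetical helper.

```python
import re


def split_on_headings(text, headings):
    """Return {heading: body} for each known heading found alone on a line."""
    pattern = re.compile(
        r"^[ \t]*(" + "|".join(re.escape(h) for h in headings) + r")[ \t]*:?[ \t]*$",
        re.IGNORECASE | re.MULTILINE,
    )
    matches = list(pattern.finditer(text))
    sections = {}
    for current, following in zip(matches, matches[1:] + [None]):
        # A section body runs until the next known heading or the end of the text.
        end = following.start() if following else len(text)
        sections[current.group(1)] = text[current.end():end].strip()
    return sections


sample = "Abstract\n\nSome text.\n\nFigure legends\n\nFigure 1. A.\n\nData availability\n\nData are at X.\n"
print(split_on_headings(sample, ["Figure legends", "Data availability"]))
# {'Figure legends': 'Figure 1. A.', 'Data availability': 'Data are at X.'}
```

Because only the listed headings act as boundaries, the last matched section extends to the end of the document unless the heading that follows it is added to the list.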

In [17]:
md_text

"Research Article\n\nTitle: Ependymal cell lineage reprogramming as a potential therapeutic intervention for\n\nhydrocephalus\n\nAuthors: Konstantina Kaplani[1], Maria - Eleni Lalioti[1], Stella Vassalou[1], Georgia Lokka[1], Evangelia\n\nParlapani[1], Georgios Kritikos[1], Zoi Lygerou[2], Stavros Taraviras[1,]\n\nAffiliations:\n\n1Department of Physiology, School of Medicine, University of Patras, Patras, Greece\n\n2Department of General Biology, School of Medicine, University of Patras, Patras, Greece\n\nCorresponding author: Stavros Taraviras\n\nCorresponding author’s address: Department of Physiology, School of Medicine, University of Patras,\n\nAsklepiou Street 1, Rio 26504, Patras, Greece\n\nCorresponding author’s phone and fax: +30 2610 997676\n\nCorresponding author’s e-mail address: [taraviras@med.upatras.gr](mailto:taraviras@med.upatras.gr)\n\nRunning title: GemC1 and McIdas promote ependymal reprogramming.\n\nKeywords: ependymal cells, reprogramming, hydrocephalus, McIdas, G

## 2. MarkItDown

It seems to do a very good job at isolating the different sections of the document. It clearly finds the `Figure legends` and `Data availability` sections.

Although it also extracts data from the PDF, that extraction is much noisier than the one from the DOCX file, where the Markdown structure is much better preserved.

In the MarkItDown library, `.docx` files are first converted to HTML with the `mammoth` library, and that HTML is then converted to Markdown.



In [31]:
from markitdown import MarkItDown

In [32]:
md = MarkItDown()
result = md.convert(DOCX_FILES[0])
print(result.text_content)




**Latrophilin-2 mediates fluid shear stress mechanotransduction at endothelial junctions**

Keiichiro Tanaka1,7,\*, Minghao Chen1,7, Andrew Prendergast1, Zhenwu Zhuang1, Ali Nasiri2, Divyesh Joshi1, Jared Hintzen1, Minhwan Chung1, Abhishek Kumar1, Arya Mani1, Anthony Koleske3, Jason Crawford4, Stefania Nicoli1 and Martin A. Schwartz1,5,6,\*

1 Yale Cardiovascular Research Center, Section of Cardiovascu­­lar Medicine, Department of Internal Medicine, School of Medicine, Yale University, New Haven, CT 06511, USA

2 Department of Internal Medicine

3 Department of Molecular Biochemistry and Biophysics

4 Department of Chemistry

5 Department of Cell Biology

6 Department of Biomedical Engineering

7 These authors contributed equally.

Running title: Latrophilins in shear stress sensing

\*Authors for correspondence:

keiichiro.tanaka@yale.edu; martin.schwartz@yale.edu;

**Abstract**

Endothelial cell responses to fluid shear stress from blood flow are crucial for vascular development, f
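
In the DOCX output above, section titles such as `**Abstract**` appear as lines made of a single bold span. A rough sketch for grouping the text under such lines follows. The helper name and the 60-character limit are assumptions, chosen so that a long bold manuscript title is not mistaken for a section heading.

```python
import re

# A line that is exactly one bold span, e.g. "**Abstract**".
BOLD_LINE = re.compile(r"^\*\*([^*]+)\*\*$")


def sections_from_bold_lines(markdown, max_heading_chars=60):
    """Group non-empty lines under the short bold-only line that precedes them."""
    sections = []
    for line in markdown.splitlines():
        stripped = line.strip()
        match = BOLD_LINE.match(stripped)
        if match and len(match.group(1)) <= max_heading_chars:
            sections.append({"header": match.group(1).strip(), "content": []})
        elif stripped and sections:
            sections[-1]["content"].append(stripped)
    return sections


sample = "Authors\n\n**Abstract**\n\nText one.\n\n**Figure legends**\n\nFigure 1."
for section in sections_from_bold_lines(sample):
    print(section["header"], "->", section["content"])
```

Text that precedes the first detected heading is discarded, which is acceptable when only named sections are needed.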

In [33]:
md = MarkItDown()
result = md.convert(PDF_FILES[0])
print(result.text_content)


1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

Latrophilin-2 mediates fluid shear stress mechanotransduction at endothelial

junctions

Keiichiro  Tanaka1,7,*,  Minghao  Chen1,7,  Andrew  Prendergast1,  Zhenwu  Zhuang1,  Ali  Nasiri2,
Divyesh  Joshi1,  Jared  Hintzen1,  Minhwan  Chung1,  Abhishek  Kumar1,  Arya  Mani1,  Anthony
Koleske3, Jason Crawford4, Stefania Nicoli1 and Martin A. Schwartz1,5,6,*

1  Yale  Cardiovascular  Research  Center,  Section  of  Cardiovascular  Medicine,  Department  of

Internal Medicine, School of Medicine, Yale University, New Haven, CT 06511, USA
2 Department of Internal Medicine
3 Department of Molecular Biochemistry and Biophysics
4 Department of Chemistry
5 Department of Cell Biology
6 Department of Biomedical Engineering
7 These authors contributed equally.

Running title: Latrophilins in shear stress sensing

*Authors for correspondence:

keiichiro.tanaka@yale.edu; martin.schwartz@yale.edu;

Abstract
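
Much of the noise in the PDF output comes from the manuscript's line numbers, which end up on lines of their own. A minimal clean-up sketch follows; `drop_numeric_lines` is a hypothetical helper, and it would also remove legitimate lines that contain only a number, such as isolated table cells.

```python
import re


def drop_numeric_lines(text):
    """Remove lines holding only an integer, then collapse the blank runs left behind."""
    kept = [line for line in text.splitlines() if not re.fullmatch(r"\s*\d+\s*", line)]
    cleaned = re.sub(r"\n{3,}", "\n\n", "\n".join(kept))
    return cleaned.strip()


noisy = "1\n\n2\n\n3\n\nTitle line\n\n4\n\nAbstract"
print(drop_numeric_lines(noisy))
```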

## 3. Docling

It exports to several formats, including Markdown and HTML. However, the structure of the DOCX file is not kept: there are no headers or titles, only paragraph objects.

Docling uses the `python-docx` library as the main engine for the conversion.

In [68]:
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
docling_docx = converter.convert(DOCX_FILES[1])
print(docling_docx.document.export_to_markdown())  # Markdown export of the converted document


The Nedd4L ubiquitin ligase is activated by FCHO2-generated membrane curvature

Yasuhisa Sakamoto1, Akiyoshi Uezu1, Koji Kikuchi1, Jangmi Kang2, Eiko Fujii2, Toshiro Moroishi3, Shiro Suetsugu4, and Hiroyuki Nakanishi1,2*

1Department of Molecular Pharmacology, Faculty of Life Sciences, Kumamoto University, 1–1–1 Honjyo, Kumamoto 860-8556, Japan

2Faculty of Clinical Nutrition and Dietetics, Konan Women’s University, 6–2–23 Morikita-machi, Kobe 658-0001, Japan

3Department of Molecular and Medical Pharmacology, Faculty of Life Sciences, Kumamoto University, 1–1–1 Honjyo, Kumamoto 860-8556, Japan

4Division of Biological Science, Graduate School of Science and Technology, Nara Institute of Science and Technology, 8916-5 Takayama-cho, Ikoma 630-0192, Japan

*Corresponding author. Tel: +81-96-373-5074; Fax: +81-96-373-5078; E-mail: hnakanis@gpo.kumamoto-u.ac.jp

Abstract

The C2-WW-HECT domain ubiquitin ligase Nedd4L regulates membrane sorting during endocytosis through the ubiquitination 

In [62]:
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
docling_docx = converter.convert(DOCX_FILES[1])
print(docling_docx.document.export_to_html())  # HTML export of the converted document


<!DOCTYPE html>
<html lang="en">
<head>
    <link rel="icon" type="image/png"
    href="https://ds4sd.github.io/docling/assets/logo.png"/>
    <meta charset="UTF-8">
    <title>
    Powered by Docling
    </title>
    <style>
    html {
    background-color: LightGray;
    }
    body {
    margin: 0 auto;
    width:800px;
    padding: 30px;
    background-color: White;
    font-family: Arial, sans-serif;
    box-shadow: 10px 10px 10px grey;
    }
    figure{
    display: block;
    width: 100%;
    margin: 0px;
    margin-top: 10px;
    margin-bottom: 10px;
    }
    img {
    display: block;
    margin: auto;
    margin-top: 10px;
    margin-bottom: 10px;
    max-width: 640px;
    max-height: 640px;
    }
    table {
    min-width:500px;
    background-color: White;
    border-collapse: collapse;
    cell-padding: 5px;
    margin: auto;
    margin-top: 10px;
    margin-bottom: 10px;
    }
    th, td {
    border: 1px solid black;
    padding: 8px;
    }
    th {
    font-weight: bold;

### Extracting sections and headers

#### Native Docling API

In [63]:
from docling.document_converter import DocumentConverter
from docling_core.types.doc.labels import DocItemLabel

# Initialize the converter
converter = DocumentConverter()

# Convert your DOCX file
docling_docx = converter.convert(DOCX_FILES[1])

# Initialize lists to store headers and sections
section_headers = []
sections = []

# Iterate over the document items
for item, _ in docling_docx.document.iterate_items():
    
    # Check if the item is a section header
    print(item)
    if item.label == DocItemLabel.SECTION_HEADER:
        section_headers.append(item.text)
        sections.append({
            'header': item.text,
            'content': []
        })
    # Check if the item is a paragraph and there's a preceding section
    elif item.label == DocItemLabel.PARAGRAPH and sections:
        sections[-1]['content'].append(item.text)

# Display the extracted section headers
print("Section Headers:")
for header in section_headers:
    print(header)

# Display the sections with their content
print("\nSections:")
for section in sections:
    print(f"Header: {section['header']}")
    print("Content:")
    for paragraph in section['content']:
        print(paragraph)


self_ref='#/texts/0' parent=RefItem(cref='#/body') children=[] label=<DocItemLabel.PARAGRAPH: 'paragraph'> prov=[] orig='The Nedd4L ubiquitin ligase is activated by FCHO2-generated membrane curvature' text='The Nedd4L ubiquitin ligase is activated by FCHO2-generated membrane curvature'
self_ref='#/texts/1' parent=RefItem(cref='#/body') children=[] label=<DocItemLabel.PARAGRAPH: 'paragraph'> prov=[] orig='' text=''
self_ref='#/texts/2' parent=RefItem(cref='#/body') children=[] label=<DocItemLabel.PARAGRAPH: 'paragraph'> prov=[] orig='Yasuhisa Sakamoto1, Akiyoshi Uezu1, Koji Kikuchi1, Jangmi Kang2, Eiko Fujii2, Toshiro Moroishi3, Shiro Suetsugu4, and Hiroyuki Nakanishi1,2*' text='Yasuhisa Sakamoto1, Akiyoshi Uezu1, Koji Kikuchi1, Jangmi Kang2, Eiko Fujii2, Toshiro Moroishi3, Shiro Suetsugu4, and Hiroyuki Nakanishi1,2*'
self_ref='#/texts/3' parent=RefItem(cref='#/body') children=[] label=<DocItemLabel.PARAGRAPH: 'paragraph'> prov=[] orig='' text=''
self_ref='#/texts/4' parent=RefItem(cref

#### Using BeautifulSoup to parse the HTML export

In [65]:
from bs4 import BeautifulSoup

converter = DocumentConverter()
docling_docx = converter.convert(DOCX_FILES[1])

# Example HTML content from docling's `export_to_html()` method
html_content = docling_docx.document.export_to_html()

def extract_sections_from_html(html):
    # Parse the HTML content
    soup = BeautifulSoup(html, "html.parser")
    
    sections = []
    current_section = None

    # Iterate over all HTML elements
    for element in soup.find_all(["h1", "h2", "h3", "h4", "h5", "h6", "p"]):  # Include headers and paragraphs
        if element.name in ["h1", "h2", "h3", "h4", "h5", "h6"]:  # Check if it's a header
            # Save the previous section if one exists
            if current_section:
                sections.append(current_section)
            # Start a new section
            current_section = {"section_header": element.text.strip(), "section_content": ""}
        elif element.name == "p" and current_section:  # Add paragraph content to the current section
            current_section["section_content"] += element.text.strip() + " "

    # Append the last section
    if current_section:
        sections.append(current_section)

    return sections

# Extract sections
sections = extract_sections_from_html(html_content)

# Print the extracted sections
for section in sections:
    print(f"Header: {section['section_header']}")
    print(f"Content: {section['section_content']}\n")
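
If the HTML export contains no `h1` to `h6` tags, the function above returns an empty list, because a section is only opened when a header tag is seen. A quick way to check this is to count the tags in the export. The sketch below uses only the standard library; `count_tags` is a hypothetical helper, and the sample HTML is invented for illustration.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count how many times each opening tag occurs."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def count_tags(html):
    counter = TagCounter()
    counter.feed(html)
    counter.close()
    return counter.counts


counts = count_tags("<html><body><p>Title</p><p>Abstract</p></body></html>")
header_total = sum(counts[name] for name in ("h1", "h2", "h3", "h4", "h5", "h6"))
print(f"headers: {header_total}, paragraphs: {counts['p']}")
# headers: 0, paragraphs: 2
```

Passing `html_content` from the cell above to `count_tags` would show whether any header tags survive the export.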


In [67]:
BeautifulSoup(html_content, "html.parser")

<!DOCTYPE html>

<html lang="en">
<head>
<link href="https://ds4sd.github.io/docling/assets/logo.png" rel="icon" type="image/png"/>
<meta charset="utf-8"/>
<title>
    Powered by Docling
    </title>
<style>
    html {
    background-color: LightGray;
    }
    body {
    margin: 0 auto;
    width:800px;
    padding: 30px;
    background-color: White;
    font-family: Arial, sans-serif;
    box-shadow: 10px 10px 10px grey;
    }
    figure{
    display: block;
    width: 100%;
    margin: 0px;
    margin-top: 10px;
    margin-bottom: 10px;
    }
    img {
    display: block;
    margin: auto;
    margin-top: 10px;
    margin-bottom: 10px;
    max-width: 640px;
    max-height: 640px;
    }
    table {
    min-width:500px;
    background-color: White;
    border-collapse: collapse;
    cell-padding: 5px;
    margin: auto;
    margin-top: 10px;
    margin-bottom: 10px;
    }
    th, td {
    border: 1px solid black;
    padding: 8px;
    }
    th {
    font-weight: bold;
    }
    table t

#### LangChain integration

In [46]:
from langchain_docling import DoclingLoader

loader = DoclingLoader(file_path=DOCX_FILES[1])
docs = loader.load()
docs


Token indices sequence length is longer than the specified maximum sequence length for this model (528 > 512). Running this sequence through the model will result in indexing errors


[Document(metadata={'source': './data/docx/EMBOJ-2023-114687.docx', 'dl_meta': {'schema_name': 'docling_core.transforms.chunker.DocMeta', 'version': '1.0.0', 'doc_items': [{'self_ref': '#/texts/0', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'paragraph', 'prov': []}, {'self_ref': '#/texts/1', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'paragraph', 'prov': []}, {'self_ref': '#/texts/2', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'paragraph', 'prov': []}, {'self_ref': '#/texts/3', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'paragraph', 'prov': []}, {'self_ref': '#/texts/4', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'paragraph', 'prov': []}, {'self_ref': '#/texts/5', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'paragraph', 'prov': []}, {'self_ref': '#/texts/6', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'paragraph', 'prov': []}, {'self_ref': '#/texts/7', 'parent': {'$ref': '#/body'}, 'children': [], 'label'

## 4. Everything2Markdown

It uses the pandoc and XML engines to extract the data from DOCX and DOC files.

It can be used together with an LLM to convert to different formats.

It is not clear how we could use this to directly extract the sections of the document.

### Pandoc engine

In [72]:
from wisup_e2m import DocxParser

parser = DocxParser(engine="pandoc") # docx engines: pandoc, xml
docx_data = parser.parse(DOCX_FILES[5])
print(docx_data.text)


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


**Title: Targeting the transcription factor YY1 is synthetic lethal with
loss of the histone demethylase KDM5C**

**Qian Zheng<sup>1#</sup>, Pengfei Li<sup>2#</sup>, Yulong
Qiang<sup>1#</sup>, Jiachen Fan<sup>1</sup>, Yuzhu Xing<sup>1</sup>,
Ying Zhang<sup>1</sup>, Fan Yang<sup>1</sup>\*, Feng Li<sup>1,3</sup>\*
and Jie Xiong<sup>4</sup>\***

<sup>1</sup>Department of Medical Genetics, TaiKang Medical School
(School of Basic Medical Sciences), Wuhan University, Wuhan 430071,
China

<sup>2</sup>Inner Mongolia Key Laboratory of Molecular Pathology, Inner
Mongolia Medical University, Huhhot, Inner Mongolia, 010059, China

<sup>3</sup>Hubei Provincial Key Laboratory of Allergy and Immunology,
Wuhan University, Wuhan 430071, China

<sup>4</sup>Department of Immunology, TaiKang Medical School (School of
Basic Medical Sciences), Wuhan University, Wuhan 430071, China

<sup>\#</sup>These authors contribute equally to this paper.

Correspondence to:

**Jie Xiong**

E-mail: <jiexiong@whu.edu.cn>，

### XML engine

In [71]:
parser = DocxParser(engine="xml") # docx engines: pandoc, xml
docx_data = parser.parse(DOCX_FILES[0])
print(docx_data.text)


ele.tag='{http://schemas.openxmlformats.org/wordprocessingml/2006/main}p' | ele.text=''
ele.tag='{http://schemas.openxmlformats.org/wordprocessingml/2006/main}p' | ele.text=''
ele.tag='{http://schemas.openxmlformats.org/wordprocessingml/2006/main}p' | ele.text=''
ele.tag='{http://schemas.openxmlformats.org/wordprocessingml/2006/main}p' | ele.text='Latrophilin-2 mediates fluid shear stress mechanotransduction at endothelial junctions'
ele.tag='{http://schemas.openxmlformats.org/wordprocessingml/2006/main}p' | ele.text=''
ele.tag='{http://schemas.openxmlformats.org/wordprocessingml/2006/main}p' | ele.text=''
ele.tag='{http://schemas.openxmlformats.org/wordprocessingml/2006/main}p' | ele.text=''
ele.tag='{http://schemas.openxmlformats.org/wordprocessingml/2006/main}p' | ele.text='Keiichiro Tanaka1,7,*, Minghao Chen1,7, Andrew Prendergast1, Zhenwu Zhuang1, Ali Nasiri2, Divyesh Joshi1, Jared Hintzen1, Minhwan Chung1, Abhishek Kumar1, Arya Mani1, Anthony Koleske3, Jason Crawford4, Stefania N
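
The `ele.tag` values above are WordprocessingML paragraph elements read from `word/document.xml` inside the DOCX archive. The same XML can be read with the standard library, which also exposes the paragraph style that marks headings when the author used Word styles. This is a sketch of the general technique, not E2M's implementation. `docx_paragraphs` is a hypothetical helper, and style ids such as `Heading1` depend on the document template.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(source):
    """Return (style_id, text) for every paragraph in word/document.xml."""
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    rows = []
    for paragraph in root.iter(f"{W}p"):
        style = paragraph.find(f"{W}pPr/{W}pStyle")
        style_id = style.get(f"{W}val") if style is not None else None
        # A paragraph's text is spread over one or more w:t runs.
        text = "".join(node.text or "" for node in paragraph.iter(f"{W}t"))
        rows.append((style_id, text))
    return rows


# Tiny in-memory stand-in for a real .docx file, used only for illustration.
DEMO_XML = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    "<w:body>"
    '<w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr><w:r><w:t>Figure legends</w:t></w:r></w:p>'
    "<w:p><w:r><w:t>Figure 1. </w:t></w:r><w:r><w:t>Caption text</w:t></w:r></w:p>"
    "</w:body></w:document>"
)
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as archive:
    archive.writestr("word/document.xml", DEMO_XML)
demo.seek(0)

for style_id, text in docx_paragraphs(demo):
    print(style_id, "|", text)
```

`docx_paragraphs` also accepts a file path, so it can be pointed at one of the DOCX files to check whether the manuscripts carry heading styles at all.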

## 5. Unstructured

It uses the `python-docx` engine and is actually able to identify proper titles or headings. It is not always perfect, but it is in a better position than the other tools. It also has an integration with LangChain.

In [80]:
from unstructured.partition.docx import partition_docx

import ssl

# WARNING: for local testing only, never in production.
# This disables HTTPS certificate verification for the whole process.
ssl._create_default_https_context = ssl._create_unverified_context


In [86]:
# Partition the DOCX file into elements
elements = partition_docx(filename=DOCX_FILES[5])


In [88]:
from unstructured.documents.elements import NarrativeText, Text, Title, ListItem, EmailAddress, PageBreak
# Process elements based on their type.
# `Text` is the base class of the other element types imported here, so it is
# checked last; otherwise it would shadow the more specific branches.
for element in elements:
    if isinstance(element, Title):
        # Handle titles
        print(f"Title: {element.text}")
    elif isinstance(element, NarrativeText):
        # Handle narrative text (paragraphs)
        print(f"Narrative Text: {element.text}")
    elif isinstance(element, ListItem):
        # Handle list items
        print(f"List Item: {element.text}")
    elif isinstance(element, EmailAddress):
        # Handle email addresses
        print(f"Email Address: {element.text}")
    elif isinstance(element, PageBreak):
        # Handle page breaks
        print("Page Break")
    elif isinstance(element, Text):
        # Handle general (uncategorized) text
        print(f"Text: {element.text}")
    else:
        # Handle unknown types
        print(f"Unknown Element: {type(element)}")


Narrative Text: Title: Targeting the transcription factor YY1 is synthetic lethal with loss of the histone demethylase KDM5C
Text: Qian Zheng1#, Pengfei Li2#, Yulong Qiang1#, Jiachen Fan1, Yuzhu Xing1, Ying Zhang1, Fan Yang1*, Feng Li1,3* and Jie Xiong4* 
Text: 1Department of Medical Genetics, TaiKang Medical School (School of Basic Medical Sciences), Wuhan University, Wuhan 430071, China
Text: 2Inner Mongolia Key Laboratory of Molecular Pathology, Inner Mongolia Medical University, Huhhot, Inner Mongolia, 010059, China
Text: 3Hubei Provincial Key Laboratory of Allergy and Immunology, Wuhan University, Wuhan 430071, China
Text: 4Department of Immunology, TaiKang Medical School (School of Basic Medical Sciences), Wuhan University, Wuhan 430071, China
Narrative Text: #These authors contribute equally to this paper.   
Title: Correspondence to:
Title: Jie Xiong
Title: E-mail: jiexiong@whu.edu.cn，Tel: +86 27 68759222
Text: Department of Immunology, TaiKang Medical School (School of Basic M
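
Since Unstructured labels some elements as `Title`, the elements can be grouped into sections and looked up by name. The sketch below works on plain `(category, text)` pairs so it runs without the library. With real elements the pairs could be built as `[(el.category, el.text) for el in elements]`, assuming the `category` attribute carries names such as `Title` and `NarrativeText`. Both helpers are hypothetical.

```python
def group_by_title(elements):
    """Turn (category, text) pairs into a list of {"header", "content"} sections."""
    sections = []
    for category, text in elements:
        text = text.strip()
        if not text:
            continue
        if category == "Title":
            sections.append({"header": text, "content": []})
        elif sections:
            sections[-1]["content"].append(text)
    return sections


def find_section(sections, name):
    """Return the first section whose header contains name, ignoring case."""
    for section in sections:
        if name.casefold() in section["header"].casefold():
            return section
    return None


pairs = [
    ("Title", "Figure legends"),
    ("NarrativeText", "Figure 1. Example caption."),
    ("Title", "Data availability"),
    ("UncategorizedText", "Data are available on request."),
]
print(find_section(group_by_title(pairs), "data availability"))
# {'header': 'Data availability', 'content': ['Data are available on request.']}
```

False-positive titles, like the correspondence lines in the output above, produce sections with empty content, which makes them easy to filter out afterwards.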

### LangChain integration

In [89]:
from langchain_community.document_loaders.word_document import UnstructuredWordDocumentLoader

# Initialize the loader
docx_loader = UnstructuredWordDocumentLoader(DOCX_FILES[5])

# Load the document into LangChain
documents = docx_loader.load()

# Process each document (LangChain uses the "Document" abstraction)
for document in documents:
    # The content of the document is in the "page_content" attribute
    content = document.page_content

    # Titles, headings, paragraphs can be parsed further using the Unstructured elements
    # For now, print the raw content:
    print(content)

    # Optionally, use metadata if it's available
    metadata = document.metadata
    print(f"Metadata: {metadata}")


Title: Targeting the transcription factor YY1 is synthetic lethal with loss of the histone demethylase KDM5C

Qian Zheng1#, Pengfei Li2#, Yulong Qiang1#, Jiachen Fan1, Yuzhu Xing1, Ying Zhang1, Fan Yang1*, Feng Li1,3* and Jie Xiong4* 

1Department of Medical Genetics, TaiKang Medical School (School of Basic Medical Sciences), Wuhan University, Wuhan 430071, China

2Inner Mongolia Key Laboratory of Molecular Pathology, Inner Mongolia Medical University, Huhhot, Inner Mongolia, 010059, China

3Hubei Provincial Key Laboratory of Allergy and Immunology, Wuhan University, Wuhan 430071, China

4Department of Immunology, TaiKang Medical School (School of Basic Medical Sciences), Wuhan University, Wuhan 430071, China

#These authors contribute equally to this paper.   

Correspondence to:

Jie Xiong

E-mail: jiexiong@whu.edu.cn，Tel: +86 27 68759222

Department of Immunology, TaiKang Medical School (School of Basic Medical Sciences), Wuhan University, Wuhan 430071, China

Feng Li

E-mail: fli22