# Extracting a section from a DOCX file 

We want to extract two sections from a `DOCX` file: the `Figure Legends` and the `Data availability`. The current setup sends the entire document to the API client twice, once per section, which is inefficient.

In this notebook we explore different options for extracting the relevant sections from the document with `LangChain` and the OpenAI client.

We will use the supported document loaders from `LangChain` to load the content and send it to the OpenAI client. 

We will do the following experiments for comparison:

1. As done in the current setup: send the entire document twice, asking for one section each time.

2. Do the same in a single chat, so that we send the document only once (this would not be possible for `gpt-3.5-turbo`).

3. Extract the `figure legends` using the current approach and the `data availability` using RAG.

4. Extract the `figure legends` using the current approach and the `data availability` using RAG, adding examples to the prompt.

The metrics are tokens (cost), time spent, and extraction accuracy measured with BLEU.
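To make the accuracy metric concrete, here is a minimal sketch of a unigram BLEU score (clipped unigram precision times a brevity penalty). It is a simplified illustration written for this notebook, not the NLTK implementation used later; the function name `bleu1` and the sample sentences are assumptions.

```python
import math
from collections import Counter


def bleu1(reference_tokens, candidate_tokens):
    """Simplified BLEU-1: clipped unigram precision times a brevity penalty."""
    if not candidate_tokens:
        return 0.0
    ref_counts = Counter(reference_tokens)
    cand_counts = Counter(candidate_tokens)
    # Each candidate token is credited at most as often as it occurs in the reference.
    overlap = sum(min(count, ref_counts[token]) for token, count in cand_counts.items())
    precision = overlap / len(candidate_tokens)
    # Penalize candidates that are shorter than the reference.
    if len(candidate_tokens) >= len(reference_tokens):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(reference_tokens) / len(candidate_tokens))
    return brevity_penalty * precision


# Illustrative sentences only; they are not taken from the manuscripts.
reference = "data are available in the repository".split()
exact = bleu1(reference, reference)
truncated = bleu1(reference, "data are available".split())
```

An exact copy of the reference scores the maximum, while a truncated extraction keeps perfect precision but is reduced by the brevity penalty. This is the behavior we want when checking that a section was extracted completely.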

In [1]:
import os
import json
import re
import time
from langchain_openai import ChatOpenAI
from langchain.memory import ChatMessageHistory
from langchain_core.messages import HumanMessage, SystemMessage
from langchain_community.document_loaders.chatgpt import ChatGPTLoader
from langchain_community.document_loaders.word_document import Docx2txtLoader, UnstructuredWordDocumentLoader
from langchain_community.document_loaders import TextLoader
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings
from nltk.translate.bleu_score import sentence_bleu
from langchain_community.callbacks import get_openai_callback
from glob import glob
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationChain
import pypandoc
from dotenv import load_dotenv
load_dotenv()


True

Set up the secret keys for the OpenAI client.

In [2]:
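OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
if not OPENAI_API_KEY:  # assigning None to os.environ would raise a TypeError
    raise RuntimeError("OPENAI_API_KEY is not set; add it to the environment or to .env")
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY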


Prepare the list of available `DOCX` files and ground truth.


In [3]:

DOCX_FILES = glob('./data/docx/*')
TRUTH_FILES = glob('./data/ground_truth/*')

# Organize the list of files making sure that the same file names are in the same index
DOCX_FILES.sort()
TRUTH_FILES.sort()

DOCX_FILES

['./data/docx/EMBOJ-2023-114195.docx',
 './data/docx/EMBOJ-2023-114687.docx',
 './data/docx/EMBOJ-2023-115257.docx',
 './data/docx/EMBOJ-2023-115537.docx',
 './data/docx/EMBOR-2023-58706-T.docx',
 './data/docx/EMBOR-2024-58727V1.docx',
 './data/docx/EMBOR-2024-59101V1-T.docx',
 './data/docx/EMM-2023-18636.docx',
 './data/docx/EMM-2023-19044.docx',
 './data/docx/MSB-2023-12087.docx']
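Sorting both lists only aligns them if every document has a ground-truth file with a matching name. The sketch below shows one way to verify this, assuming that a document and its ground truth share the same file stem. That naming convention, the helper `check_alignment`, and the file names in the example are assumptions, not something the data set guarantees.

```python
from pathlib import Path


def check_alignment(docx_files, truth_files):
    """Return (docx, truth) pairs, raising if the sorted names do not line up."""
    if len(docx_files) != len(truth_files):
        raise ValueError("Different number of DOCX and ground-truth files")
    pairs = []
    for docx, truth in zip(sorted(docx_files), sorted(truth_files)):
        # Path.stem drops the directory and the final extension.
        if Path(docx).stem != Path(truth).stem:
            raise ValueError(f"Mismatched pair: {docx} vs {truth}")
        pairs.append((docx, truth))
    return pairs


# Hypothetical file names, used only to illustrate the check.
pairs = check_alignment(
    ["./data/docx/B.docx", "./data/docx/A.docx"],
    ["./data/ground_truth/A.json", "./data/ground_truth/B.json"],
)
```

A check like this fails loudly when a file is missing, instead of silently scoring one manuscript against the ground truth of another.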

In [4]:

INPUT_VALUES = [
    {
        "expected_figure_count": 6,
        "expected_figure_labels": "Figure 1, Figure 2, Figure 3, Figure 4, Figure 5, Figure 6",
    },
    {
        "expected_figure_count": 13,
        "expected_figure_labels": "Figure 1, Figure 2, Figure 3, Figure 4, Figure 5, Figure 6, Figure 7, Figure 8, Figure 9, Figure 10, Figure 11, Figure 12, Figure 13",
    },
    {
        "expected_figure_count": 6,
        "expected_figure_labels": "Figure 1, Figure 2, Figure 3, Figure 4, Figure 5, Figure 6",
    },
    {
        "expected_figure_count": 6,
        "expected_figure_labels": "Figure 1, Figure 2, Figure 3, Figure 4, Figure 5, Figure 6",
    },
    {
        "expected_figure_count": 7,
        "expected_figure_labels": "Figure 1, Figure 2, Figure 3, Figure 4, Figure 5, Figure 6, Figure 7",
    },
    {
        "expected_figure_count": 7,
        "expected_figure_labels": "Figure 1, Figure 2, Figure 3, Figure 4, Figure 5, Figure 6, Figure 7",
    },
    {
        "expected_figure_count": 7,
        "expected_figure_labels": "Figure 1, Figure 2, Figure 3, Figure 4, Figure 5, Figure 6, Figure 7",
    },
    {
        "expected_figure_count": 8,
        "expected_figure_labels": "Figure 1, Figure 2, Figure 3, Figure 4, Figure 5, Figure 6, Figure 7, Figure 8",
    },
    {
        "expected_figure_count": 6,
        "expected_figure_labels": "Figure 1, Figure 2, Figure 3, Figure 4, Figure 5, Figure 6",
    },
    {
        "expected_figure_count": 1,
        "expected_figure_labels": "Figure 1",
    },
    
]
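Every label string above is a run of consecutive labels from `Figure 1` up to the expected count, so the same structure could be generated from the counts alone. This is a minimal sketch under that assumption; the helper name `build_input_values` is hypothetical.

```python
def build_input_values(figure_counts):
    """Build one entry per manuscript with consecutive 'Figure N' labels."""
    values = []
    for count in figure_counts:
        labels = ", ".join(f"Figure {i}" for i in range(1, count + 1))
        values.append(
            {"expected_figure_count": count, "expected_figure_labels": labels}
        )
    return values


# Counts taken from the INPUT_VALUES list above, in the same order.
generated = build_input_values([6, 13, 6, 6, 7, 7, 7, 8, 6, 1])
```

Generating the labels avoids typing mistakes in long lists, but it stops being valid as soon as a manuscript uses non-consecutive or differently named figures.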

# EXPERIMENT 1: Send the entire document as plain text to the API Client

In [5]:
llm = ChatOpenAI(
    model="gpt-4o",
    temperature=0.1,
    max_tokens=None,
    timeout=None,
    max_retries=2,
)


In [6]:
# Prompts
FIGURE_LEGENDS_PROMPT = """
You are a scientific text analyzer focused on finding figure captions in scientific manuscripts. 
Your task is ONLY to find and return the complete figure-related text content from the manuscript.

Key Instructions:
1. Look for figure captions throughout the entire document - they can appear:
   - In a dedicated section marked as "Figure Legends", "Figure Captions", etc.
   - Embedded in the results section
   - At the end of the manuscript
   - In an appendix section
   - Or anywhere else in the document

2. Find ALL figure captions, including:
   - Main figures (Figure 1, Figure 2, etc.)
   - Expanded View figures (EV Figures)
   - Supplementary figures
   - Figure legends
   - Figure descriptions

3. IMPORTANT: Return the COMPLETE TEXT found, preserving:
   - All formatting and special characters
   - Statistical information
   - Scale bars and measurements
   - Panel labels and descriptions
   - Source references

4. DO NOT:
   - Modify or rewrite the text
   - Summarize or shorten descriptions
   - Skip any figure-related content
   - Add any explanatory text of your own

INPUT: Expected figures: {expected_figure_count}

Expected figure labels: {expected_figure_labels}

Manuscript_text: {manuscript_text}

OUTPUT: Return ONLY the found figure-related text, exactly as it appears in the document. 
If you find multiple sections with figure descriptions, concatenate them all.

If you truly cannot find ANY figure captions or descriptions in the document, 
only then return 

```
No figure legends section found.
```
"""


In [7]:
DATA_AVAILABILITY_PROMPT = """
You are an expert at locating and extracting Data Availability sections from scientific manuscripts.
Your task is to find and return ONLY the data availability section content from the manuscript text.

The section may be found:
- As a dedicated section titled "Data Availability" 
- Under "Materials and Methods"
- Near manuscript end
- As "Availability of Data and Materials"
- Within supplementary information

Critical instructions:
1. Return ONLY the content of the section, including possible hyperlinks and **WITHOUT** the section title
2. Maintain exact formatting and text
3. Do not add comments or explanations
4. Do not modify or summarize the text

If no data availability section is found OR if no data deposits are mentioned, return exactly:

```
This study includes no data deposited in external repositories.
```

The message you will receive is:

Manuscript_text: {manuscript_text}
"""


## Loading and Parsing the DOCX
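LangChain's Unstructured-based loaders (here `UnstructuredWordDocumentLoader`) parse the file using Unstructured under the hood. The helper below also supports `Docx2txtLoader` and a `pypandoc` conversion to HTML, so we can extract the text from the DOCX file with any of the three engines.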

In [8]:
text_splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="gradient", # Literal['percentile', 'standard_deviation', 'interquartile', 'gradient']
    breakpoint_threshold_amount=0.25, # float
    number_of_chunks=None, # int
    sentence_split_regex='(?<=[.?!])\\s+', # str
    min_chunk_size=1024 # int
)

def extract_document_text(file, engine='docx2txt'):
    """
    Extracts the text from a document file.
    """
    if engine == 'docx2txt':
        loader = Docx2txtLoader(file)
        raw_documents = loader.load()
    elif engine == 'unstructured':
        loader = UnstructuredWordDocumentLoader(file)
        raw_documents = loader.load()
    elif engine == 'pypandoc':
        file_html = pypandoc.convert_file(file, 'html')
        # Save the html file
        with open('./data/tmp/temp.html', 'w') as f:
            f.write(file_html)
        loader = TextLoader('./data/tmp/temp.html')
        raw_documents = loader.load()
    else:
        raise ValueError('Invalid engine')
    
    return "\n".join([doc.page_content for doc in raw_documents])
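To see what the `sentence_split_regex` passed to `SemanticChunker` does, the sketch below applies the same pattern with `re.split` from the standard library. We assume the chunker splits the text in a comparable way; the sample sentences are made up for illustration.

```python
import re

# Same pattern as the sentence_split_regex passed to SemanticChunker above.
SENTENCE_SPLIT_REGEX = r"(?<=[.?!])\s+"

sample = "Cells were fixed. Were they stained? Yes! Scale bar: 10 um."
sentences = re.split(SENTENCE_SPLIT_REGEX, sample)

# The pattern also splits after abbreviations that end with a period.
abbreviation_split = re.split(SENTENCE_SPLIT_REGEX, "See Fig. 1 for details.")
```

The second call shows a caveat for scientific text: abbreviations such as `Fig.` produce extra split points, which may matter when figure legends are chunked.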


In [9]:
# Extract the text of the first DOCX file as HTML using the pypandoc engine
docs = extract_document_text(DOCX_FILES[0], engine='pypandoc')





In [10]:
print(docs)

<p><strong>Latrophilin-2 mediates fluid shear stress mechanotransduction
at endothelial junctions</strong></p>
<p>Keiichiro Tanaka<sup>1,7,*</sup>, Minghao Chen<sup>1,7</sup>, Andrew
Prendergast<sup>1</sup>, Zhenwu Zhuang<sup>1</sup>, Ali
Nasiri<sup>2</sup>, Divyesh Joshi<sup>1</sup>, Jared
Hintzen<sup>1</sup>, Minhwan Chung<sup>1</sup>, Abhishek
Kumar<sup>1</sup>, Arya Mani<sup>1</sup>, Anthony Koleske<sup>3</sup>,
Jason Crawford<sup>4</sup>, Stefania Nicoli<sup>1</sup> and Martin A.
Schwartz<sup>1,5,6,*</sup></p>
<p><sup>1</sup> Yale Cardiovascular Research Center, Section of
Cardiovascu­­lar Medicine, Department of Internal Medicine, School of
Medicine, Yale University, New Haven, CT 06511, USA</p>
<p><sup>2</sup> Department of Internal Medicine</p>
<p><sup>3</sup> Department of Molecular Biochemistry and Biophysics</p>
<p><sup>4</sup> Department of Chemistry</p>
<p><sup>5</sup> Department of Cell Biology</p>
<p><sup>6</sup> Department of Biomedical Engineering</p>
<p><sup>7</sup> T

In [11]:
from bs4 import BeautifulSoup
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

In [12]:
def preprocess_text(text):
    """
    Preprocess the input text by removing HTML tags and code blocks.
    """
    # Remove code blocks enclosed in triple backticks
    text = re.sub(r'```.*?```', '', text, flags=re.DOTALL)
    # Remove HTML tags
    text = re.sub(r'<.*?>', '', text)
    # Replace newline characters with spaces
    text = text.replace('\n', ' ')
    return text

def calculate_bleu(reference_text, candidate_text):
    """
    Calculate the BLEU score between the reference and candidate texts.
    """
    # Preprocess the texts
    reference_text = preprocess_text(reference_text)
    candidate_text = preprocess_text(candidate_text)
    
    # Tokenize the texts
    reference_tokens = reference_text.split()
    candidate_tokens = candidate_text.split()
    
    # Compute BLEU score (sentence_bleu defaults to uniform weights over 1- to 4-grams)
    score = sentence_bleu([reference_tokens], candidate_tokens)
    return score
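The pypandoc output contains HTML entities such as `&lt;`, which `preprocess_text` leaves untouched. The sketch below is a hypothetical variant that also decodes entities with `html.unescape` from the standard library. It is one possible extension, not the function used in the experiments.

```python
import html
import re


def preprocess_text_unescaped(text):
    """Hypothetical variant of preprocess_text that also decodes HTML entities."""
    text = re.sub(r"```.*?```", "", text, flags=re.DOTALL)
    text = re.sub(r"<.*?>", "", text)
    # Decode entities after tag removal so a decoded "<" is not mistaken for a tag.
    text = html.unescape(text)
    return text.replace("\n", " ")


# Made-up sample in the style of the pypandoc HTML output.
sample = "<p><strong>A</strong>, ****: p&lt;0.0001;\none-way ANOVA</p>"
cleaned = preprocess_text_unescaped(sample)
```

Because the reference and the candidate go through the same preprocessing, decoding entities should only change the score when one side contains entities and the other contains the literal characters.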


In [13]:
def load_truth(file_path, field):
    # Read and load json object
    with open(file_path, 'r') as file:
        data = json.load(file)
    if field == "data_availability":
        return data[field]["section_text"]
    elif field == "figure legends":
        return data["all_captions"]
    else:
        raise ValueError("Invalid field")   
    

In [23]:
from langchain_core.prompts import ChatPromptTemplate

# Initialize the results dictionary
results_baseline = {
    "execution_time": [],
    "total_tokens": [],
    "total_cost_usd": [],
    "bleu1_score_figure_legends": [],
    "bleu1_score_data_availability": [],
    "figure_legends_extracted": [],
    "figure_legends_truth": [],
    "data_availability_extracted": [],
    "data_availability_truth": [],
    # Optionally track tokens for each call separately
    "figure_legends_tokens": [],
    "data_availability_tokens": [],
}

for doc, truth, inputs in zip(DOCX_FILES, TRUTH_FILES, INPUT_VALUES):
    print(doc)
    input_text = extract_document_text(doc, engine='pypandoc')

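    # Define the prompts.
    # NOTE: FIGURE_LEGENDS_PROMPT and DATA_AVAILABILITY_PROMPT already contain the
    # {manuscript_text} placeholder, so the manuscript is rendered into both the
    # system message and the human message of each call.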
    prompt_figure_legends = ChatPromptTemplate.from_messages(
        [
            ("system", FIGURE_LEGENDS_PROMPT),
            (
                "human",
                """
                INPUT: Expected figures: {expected_figure_count}

                Expected figure labels: {expected_figure_labels}

                Manuscript_text: {manuscript_text}
                """,
            ),
        ]
    )
    chain_figure_legends = prompt_figure_legends | llm

    prompt_data_availability = ChatPromptTemplate.from_messages(
        [
            ("system", DATA_AVAILABILITY_PROMPT),
            ("human", "Manuscript_text: {manuscript_text}"),
        ]
    )
    chain_data_availability = prompt_data_availability | llm

    # Capture tokens/cost with the callback context
    with get_openai_callback() as cb:
        start_time = time.time()

        # FIRST call: figure legends
        response_figure_legends = chain_figure_legends.invoke(
            {
                "expected_figure_count": inputs["expected_figure_count"],
                "expected_figure_labels": inputs["expected_figure_labels"],
                "manuscript_text": input_text,
            }
        )

        # SECOND call: data availability
        response_data_availability = chain_data_availability.invoke(
            {"manuscript_text": input_text}
        )

        end_time = time.time()

        # 1) Execution time
        execution_time = end_time - start_time
        results_baseline["execution_time"].append(execution_time)

        # 2) Retrieve total token usage & cost across BOTH calls
        #    (the callback tracks everything within this 'with' block)
        total_tokens = cb.total_tokens  # sum of prompt + completion for both calls
        total_cost_usd = cb.total_cost  # total US dollar cost for both calls

        results_baseline["total_tokens"].append(total_tokens)
        results_baseline["total_cost_usd"].append(total_cost_usd)

        # If you also want to see usage per call:
        tokens_used_figure_legends = response_figure_legends.usage_metadata["total_tokens"]
        tokens_used_data_availability = response_data_availability.usage_metadata[
            "total_tokens"
        ]
        results_baseline["figure_legends_tokens"].append(tokens_used_figure_legends)
        results_baseline["data_availability_tokens"].append(tokens_used_data_availability)

    # 3) Calculate BLEU-1 scores
    truth_figure_legends = load_truth(truth, "figure legends")
    truth_data_availability = load_truth(truth, "data_availability")

    bleu1_score_figure_legends = calculate_bleu(
        truth_figure_legends, response_figure_legends.content
    )
    bleu1_score_data_availability = calculate_bleu(
        truth_data_availability, response_data_availability.content
    )

    results_baseline["bleu1_score_figure_legends"].append(bleu1_score_figure_legends)
    results_baseline["bleu1_score_data_availability"].append(
        bleu1_score_data_availability
    )

    # 4) Store extracted text
    results_baseline["figure_legends_extracted"].append(response_figure_legends.content)
    results_baseline["figure_legends_truth"].append(truth_figure_legends)

    results_baseline["data_availability_extracted"].append(
        response_data_availability.content
    )
    results_baseline["data_availability_truth"].append(truth_data_availability)


./data/docx/EMBOJ-2023-114195.docx





In [24]:
# Write the results_baseline to a JSON file in ./data/results_extraction
with open('./data/results_extraction/results_baseline.json', 'w', encoding="utf-8") as file:
    json.dump(results_baseline, file, ensure_ascii=False, indent=4)

In [25]:
results_baseline

{'execution_time': [100.60918188095093],
 'total_tokens': [103420],
 'total_cost_usd': [0.0169548],
 'bleu1_score_figure_legends': [1.0],
 'bleu1_score_data_availability': [1.0],
 'figure_legends_extracted': ['<p><strong><u>Figure 1. Gα proteins specific for endothelial flow responses</u></strong></p>\n<p><strong>A</strong>, Flow-induced endothelial alignment after Gα knockdown. HUVECs were subjected to fluid shear stress (FSS) at a rate of 12 dynes/cm<sup>2</sup> for 16 hours and nuclear orientation quantified as histograms showing the percentage of cells within each 10° of the flow directions from 0° to 90° (see Methods) ****: p&lt;0.0001; one-way ANOVA with Tukey’s multiple comparisons test. <strong>B</strong>, Src family kinase activation, quantified in <strong>C</strong>. n=5 for control, Gi knockdown, Gq11 knockdown and n=3 for simultaneous knockdown of Gi and Gq11. Values are means ± SEM. ****: p&lt;0.0001, ***: p&lt;0.001; one-way ANOVA with Tukey multiple comparison test. <str
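Once all documents have been processed, each list in the results dictionary holds one value per manuscript. The sketch below averages the numeric metrics and skips the text fields. The helper name `summarize` is hypothetical, and the example reuses the values of the single-document run shown above with a placeholder for the extracted text.

```python
from statistics import mean


def summarize(results):
    """Average every metric whose values are all numeric; skip the text fields."""
    summary = {}
    for key, values in results.items():
        if values and all(isinstance(v, (int, float)) for v in values):
            summary[key] = mean(values)
    return summary


# Numeric values copied from the run above; the text entry is a placeholder.
example = {
    "execution_time": [100.60918188095093],
    "total_tokens": [103420],
    "total_cost_usd": [0.0169548],
    "figure_legends_extracted": ["<p>...</p>"],
}
summary = summarize(example)
```

With a single document the averages equal the raw values; the helper becomes useful once the loop has run over all manuscripts.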

# EXPERIMENT 2: One single call to the chat

In [17]:
COMBINED_SYSTEM_PROMPT = """
You are a scientific text analyzer with two tasks:

Task 1: Figure Legends Extraction
---------------------------------
1. You are ONLY to find and return the complete figure-related text content from the manuscript.
2. Follow these instructions:
   - Look for figure captions in the entire document (including "Figure 1", "Supplementary Figure", etc.).
   - Return the COMPLETE text, preserving formatting and details.
   - Do NOT rewrite or summarize anything.
   - If no figure captions exist, return "No figure legends section found."

Task 2: Data Availability Extraction
------------------------------------
1. You are ONLY to find and return the "data availability" section of the manuscript.
2. That section might appear under "Materials and Methods," near the end, or in "Supplementary."
3. Return ONLY the content of that section (no extra headings or commentary).
4. If none is found, return "This study includes no data deposited in external repositories."

When you have completed both tasks, return a JSON object with **two keys**:
{{
  "figure_legends": "...",
  "data_availability": "..."
}}
"""

COMBINED_USER_PROMPT_TEMPLATE = """
Manuscript text:
{manuscript_text}

Additional figure info for your reference:
- Expected figures: {expected_figure_count}
- Expected figure labels: {expected_figure_labels}
"""
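The doubled braces around the JSON example in the system prompt are an escape: single braces mark template variables, so literal braces must be written as `{{` and `}}`. The sketch below shows the same convention with plain `str.format`; we assume the default template format of `ChatPromptTemplate` treats braces in the same way.

```python
template = """Return JSON like this:
{{
  "figure_legends": "...",
  "data_availability": "..."
}}
Expected figures: {expected_figure_count}"""

# Doubled braces are rendered as literal braces; single braces are filled in.
rendered = template.format(expected_figure_count=6)
```

Without the escaping, the formatter would try to interpret `"figure_legends"` as a variable name and fail with a missing-key error.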



In [18]:
# Initialize the results dictionary
results_sequencial_chat = {
    "execution_time": [],
    "total_tokens": [],
    "bleu1_score_figure_legends": [],
    "bleu1_score_data_availability": [],
    "figure_legends_extracted": [],
    "figure_legends_truth": [],
}


In [None]:
import json
import time

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.messages import SystemMessage, HumanMessage
from langchain_core.output_parsers.json import JsonOutputParser
from langchain_openai import ChatOpenAI
from langchain_community.callbacks import get_openai_callback

##########################################################
# 1. Create system & human prompts as usual
##########################################################
COMBINED_SYSTEM_PROMPT = """
You are a scientific text analyzer with two tasks:

Task 1: Figure Legends Extraction
---------------------------------
1. You are ONLY to find and return the complete figure-related text content from the manuscript.
2. Follow these instructions:
   - Look for figure captions in the entire document (including "Figure 1", "Supplementary Figure", etc.).
   - Return the COMPLETE text, preserving formatting and details.
   - Do NOT rewrite or summarize anything.
   - If no figure captions exist, return "No figure legends section found."

Task 2: Data Availability Extraction
------------------------------------
1. You are ONLY to find and return the "data availability" section of the manuscript.
2. That section might appear under "Materials and Methods," near the end, or in "Supplementary."
3. Return ONLY the content of that section (no extra headings or commentary).
4. If none is found, return "This study includes no data deposited in external repositories."

When you have completed both tasks, return valid JSON exactly like this:

{{
  "figure_legends": "...",
  "data_availability": "..."
}}
"""

COMBINED_USER_PROMPT_TEMPLATE = """
Manuscript text:
{manuscript_text}

Additional figure info for your reference:
- Expected figures: {expected_figure_count}
- Expected figure labels: {expected_figure_labels}
"""

##########################################################
# 2. Build chain with JSON parser
##########################################################
combined_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", COMBINED_SYSTEM_PROMPT),
        ("human", COMBINED_USER_PROMPT_TEMPLATE),
    ]
)

llm = ChatOpenAI(model="gpt-4o-mini")
json_parser = JsonOutputParser()

# Pipe the prompt → LLM → JSON parser
chain = combined_prompt | llm | json_parser

##########################################################
# 3. Initialize results dictionary
##########################################################
results_single_call = {
    "execution_time": [],
    "total_tokens": [],
    "total_cost_usd": [],
    "bleu1_score_figure_legends": [],
    "bleu1_score_data_availability": [],
    "figure_legends_extracted": [],
    "figure_legends_truth": [],
    "data_availability_extracted": [],
    "data_availability_truth": [],
}

##########################################################
# 4. Run your loop
##########################################################
for doc, truth, inputs in zip(DOCX_FILES, TRUTH_FILES, INPUT_VALUES):
    print(doc)
    input_text = extract_document_text(doc, engine='pypandoc')

    chain_input = {
        "manuscript_text": input_text,
        "expected_figure_count": inputs["expected_figure_count"],
        "expected_figure_labels": inputs["expected_figure_labels"],
    }

    # 4a) Time and measure cost of the single LLM call
    with get_openai_callback() as cb:
        start_time = time.time()
        response_dict = chain.invoke(chain_input)  # JSON parser → dict
        end_time = time.time()

        # 4b) Record duration
        execution_time = end_time - start_time
        results_single_call["execution_time"].append(execution_time)

        # 4c) Record usage from callback
        #     e.g. tokens, cost, etc.
        total_tokens = cb.total_tokens
        total_cost_usd = cb.total_cost  # The cost in USD for the entire call

        results_single_call["total_tokens"].append(total_tokens)
        results_single_call["total_cost_usd"].append(total_cost_usd)

    # 4d) Extract the two fields from the parsed dict
    figure_legends_content = response_dict.get("figure_legends", "")
    data_availability_content = response_dict.get("data_availability", "")

    # 4e) BLEU scoring
    truth_figure_legends = load_truth(truth, "figure legends")
    truth_data_availability = load_truth(truth, "data_availability")

    bleu1_score_figure_legends = calculate_bleu(
        truth_figure_legends, figure_legends_content
    )

    bleu1_score_data_availability = calculate_bleu(
        truth_data_availability, data_availability_content
    )

    results_single_call["bleu1_score_figure_legends"].append(bleu1_score_figure_legends)
    results_single_call["bleu1_score_data_availability"].append(bleu1_score_data_availability)

    # 4f) Store extracted text
    results_single_call["figure_legends_extracted"].append(preprocess_text(figure_legends_content))
    results_single_call["figure_legends_truth"].append(preprocess_text(truth_figure_legends))

    results_single_call["data_availability_extracted"].append(preprocess_text(data_availability_content))
    results_single_call["data_availability_truth"].append(preprocess_text(truth_data_availability))
    


./data/docx/EMBOJ-2023-114195.docx
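The loop reads both fields with `.get(..., "")`, so a reply that lacks one of the keys is scored as an empty extraction without any warning. The sketch below is a hypothetical helper that parses a raw JSON reply and reports which expected keys are missing; the names `parse_sections` and `EXPECTED_KEYS` and the sample reply are assumptions.

```python
import json

EXPECTED_KEYS = ("figure_legends", "data_availability")


def parse_sections(raw_output):
    """Parse the model's JSON reply and report which expected keys are missing."""
    parsed = json.loads(raw_output)
    if not isinstance(parsed, dict):
        raise ValueError("Expected a JSON object at the top level")
    missing = [key for key in EXPECTED_KEYS if key not in parsed]
    # Fall back to an empty string, as the loop above does.
    sections = {key: parsed.get(key, "") for key in EXPECTED_KEYS}
    return sections, missing


# Made-up reply that omits the data availability key.
sections, missing = parse_sections('{"figure_legends": "Figure 1. Example caption"}')
```

Recording the missing keys next to the scores would make it possible to tell a genuinely poor extraction apart from a malformed reply.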





In [20]:
results_single_call

{'execution_time': [92.83402895927429],
 'total_tokens': [28396],
 'total_cost_usd': [0.0038291999999999996],
 'bleu1_score_figure_legends': [0.9972548202639892],
 'bleu1_score_data_availability': [1.0],
 'figure_legends_extracted': ['Figure 1. Gα proteins specific for endothelial flow responses A, Flow-induced endothelial alignment after Gα knockdown. HUVECs were subjected to fluid shear stress (FSS) at a rate of 12 dynes/cm2 for 16 hours and nuclear orientation quantified as histograms showing the percentage of cells within each 10° of the flow directions from 0° to 90° (see Methods) ****: p&lt;0.0001; one-way ANOVA with Tukey’s multiple comparisons test. B, Src family kinase activation, quantified in C. n=5 for control, Gi knockdown, Gq11 knockdown and n=3 for simultaneous knockdown of Gi and Gq11. Values are means ± SEM. ****: p&lt;0.0001, ***: p&lt;0.001; one-way ANOVA with Tukey multiple comparison test. D, Rescue of Gq/11 and Gi knockdown by re-expression of siRNA-resistant vers

In [21]:
# Write the results_single_call to a JSON file in ./data/results_extraction
with open('./data/results_extraction/results_single_call.json', 'w', encoding="utf-8") as file:
    json.dump(results_single_call, file, ensure_ascii=False, indent=4)
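As a quick comparison of the two runs shown above, the sketch below divides the baseline metrics by the single-call metrics. The numbers are copied from the printed results for the one processed document. The two cells configure different models (`gpt-4o` and `gpt-4o-mini`), so the ratios are indicative only.

```python
# Values copied from the printed results of the two experiments above.
baseline = {
    "total_tokens": 103420,
    "total_cost_usd": 0.0169548,
    "execution_time": 100.60918188095093,
}
single_call = {
    "total_tokens": 28396,
    "total_cost_usd": 0.0038291999999999996,
    "execution_time": 92.83402895927429,
}

# Ratio > 1 means the baseline used more of that resource than the single call.
ratios = {metric: baseline[metric] / single_call[metric] for metric in baseline}
```

For this document the baseline used roughly 3.6 times the tokens and about 4.4 times the cost of the single call, while the execution time was less than 10% longer.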

# EXPERIMENT 3: Extracting the data using simple RAG, without examples

In [22]:
0.1266775 / 0.00315015

40.21316445248639