# Text Extraction Pipeline

## Introduction

The first step in processing the input file is extracting its text content. This notebook walks through the text extraction process and explains the reasoning behind each step. It closes with a discussion of the benefits and limitations of this approach, along with potential next steps.

### Why PyMuPDF?

There are several Python libraries that can extract text from a PDF, including PyMuPDF, PyPDF2, Camelot, and more. After evaluating a few options and testing them on the provided document, I selected PyMuPDF for the following reasons:

1. **Accurate Layout Preservation**:
    PyMuPDF maintains paragraph structure and follows a natural reading order (top-down, left-to-right). This is especially important for multi-column documents like ours: some of the other libraries I tested either scramble the text order or require hardcoded column settings, which reduces flexibility and generalizability.

2. **Rich Text Metadata**:
    PyMuPDF provides detailed styling information for each span of text, including font family, font size, and formatting (bold, italic, etc.). Currently, the pipeline only uses font size to distinguish different parts of the article (e.g., headers vs. body text), but the additional formatting information opens the door to more advanced processing in future iterations.

3. **Performance**:
    PyMuPDF is known for its fast text extraction, which, while not a critical factor at this prototyping stage, would be important in a production environment where efficiency and scalability matter. See: https://pymupdf.readthedocs.io/en/latest/about.html#performance
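
As an illustration of the formatting metadata mentioned in point 2, each span dictionary carries an integer `flags` field that encodes style bits. The sketch below decodes it into readable names. The bit layout follows the font-flag recipe in the PyMuPDF documentation as I understand it, so treat it as an assumption and verify it against your installed version. `describe_span_flags` is a hypothetical helper, not part of the pipeline.

```python
from typing import List

# Assumed bit layout for span["flags"] (check against the PyMuPDF docs):
# bit 0 = superscript, bit 1 = italic, bit 2 = serifed,
# bit 3 = monospaced, bit 4 = bold
_FLAG_NAMES = ["superscript", "italic", "serifed", "monospaced", "bold"]


def describe_span_flags(flags: int) -> List[str]:
    """Return the names of the style bits that are set in a span's flags."""
    return [name for bit, name in enumerate(_FLAG_NAMES) if flags & (1 << bit)]


print(describe_span_flags(18))  # bits 1 and 4 are set
```

In the pipeline, this could be called with `s["flags"]` for each span `s`, for example to detect bold section headers.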

## Import Libraries and Load Document

In [1]:
import os
import re
from typing import List, Dict
from time import time

import pymupdf

# Make sure you have placed the file in the sources directory
doc = pymupdf.open("../sources/SlamonetalSCIENCE1987.pdf")  # open a document

# Get the second page for this example, as the first page is not part of the article
page = doc[1]

## Raw Text Extraction

First, let's try to naively extract the text using the default "text" extraction option.

In [2]:
txt = page.get_text(option="text")
print(txt)

Human Breast Cancer: Correlation of
Relapse and Survival with Amplification
of the HER-2lneu Oncogene
DENNIS J. SLAMON,* GARY M. CLARK, STEVEN G. WONG, WENDY J. LEVIN,
AxEL ULLRICH, WILLiAM L. McGuIRE
The HER-2/neu oncoFene is a member of the erbB-like
oncogene family, and aS related to, but distinct firom, the
epidermal growth factor receptr. This gene has been
shown to be amplified i human brt cancer cell lines.
In the current study, alterations of the gene in 189
primary human breast cancers were instigated HER-2/
neu was found to be amplified frm 2- to
eater than 20-
fold in 30% ofthe tumors. Correlation ofgene amplifica-
tion with several disease parameters was evaluated Am-
plification of the HER-2/neu gene was a significant pre-
dictor of both overall survival and time to relapse in
patients with breast cancer. It retained its significance
even when adjustments were made for other known
prognostic factors. Moreover, HER-2/neu amplification
had greater prognostic value than most 

As you can see, the text is extracted rather cleanly and in the correct order (no issues with the columns). However, the extractor reads in some extra line breaks due to the page formatting, which sometimes splits sentences in odd places. Additionally, the extractor picks up a lot of text that is not part of the actual article, such as the journal's volume and year in the footer or the "*To whom correspondence should be addressed" note.

## Rich Text Extraction

The text extraction above is good enough for our retrieval and text generation pipelines. However, we can fix some of these issues by using PyMuPDF's dictionary extraction option, which exposes styling information for individual text blocks.

Below, we compute the average font size for each text block (roughly equivalent to a paragraph), rounded to the nearest multiple of 2.5. We perform this rounding because the OCR process that produced the document led to some small variation in font size (e.g., text in the same sentence with sizes 10.1, 10.05, 10.25, etc.).
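
To make the rounding concrete, here is a small standalone sketch of the same arithmetic. `round_to_step` is a hypothetical helper name; the expression is the one used in the cells of this notebook.

```python
def round_to_step(size: float, step: float = 2.5) -> float:
    """Snap a font size to the nearest multiple of `step`."""
    return round(size / step) * step


# Slightly different raw sizes collapse onto the same bucket
for raw in (10.1, 10.05, 10.25, 7.3, 14.6, 19.8):
    print(f"{raw} -> {round_to_step(raw)}")
```

One detail to keep in mind: Python's `round` resolves exact ties to the nearest even integer, so a size that sits exactly halfway between two buckets, such as 11.25, rounds down to 10.0 rather than up to 12.5.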

In [3]:
# Read page text as a dictionary
blocks = page.get_text(option="dict", flags=11)["blocks"]

for b in blocks:  # iterate through the text blocks
    font_sizes = []
    for l in b["lines"]:  # iterate through the text lines
        for s in l["spans"]:  # iterate through the text spans
            font_sizes.append(s["size"])
    if font_sizes:
        block_text = " ".join(s["text"] for l in b["lines"] for s in l["spans"])
        print(f"Block text: {block_text}")

        # Compute average font size of the block rounded to the nearest 2.5
        avg_font_size = round(sum(font_sizes) / len(font_sizes) / 2.5) * 2.5
        print(f"Average font size: {avg_font_size}\n")

Block text: Human Breast Cancer: Correlation of Relapse and Survival with Amplification of the HER-2lneu Oncogene
Average font size: 20.0

Block text: DENNIS J. SLAMON,* GARY M. CLARK, STEVEN G. WONG, WENDY J. LEVIN, AxEL ULLRICH, WILLiAM L. McGuIRE
Average font size: 15.0

Block text: The HER-2/neu oncoFene is a member of the erbB-like oncogene family, and aS related to, but distinct firom, the epidermal growth factor receptr. This gene has been shown to be amplified i human brt cancer cell lines. In the current study, alterations of the gene in 189 primary human breast cancers were instigated HER-2/ neu was found to be amplified frm 2- to eater than 20- fold in 30% of the tumors. Correlation ofgene amplifica- tion with several disease parameters was evaluated Am- plification of the HER-2/neu gene was a significant pre- dictor of both overall survival and time to relapse in patients with breast cancer. It retained its significance even when adjustments were made for other known progno

Now we are really getting somewhere! Note that:
- The article's title is size 20
- The authors are size 15
- The body text is size 10
- The notes, footers, and other extraneous text are size 7.5 or smaller
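
These observations amount to a lookup from rounded font size to a role. A minimal sketch of that mapping follows. `classify_size` is a hypothetical helper that hardcodes the sizes observed in this document; the pipeline itself applies these rules statefully, as described in the next section.

```python
def classify_size(size: float) -> str:
    """Map a rounded font size to the role it plays in this document."""
    if size <= 7.5:
        return "note"  # footers, correspondence notes, references
    if size == 10:
        return "body"
    if size == 15:
        return "authors"
    if size == 20:
        return "title"
    return "irregular"  # anything else, e.g. figure labels


for size in (20.0, 15.0, 10.0, 7.5, 12.5):
    print(size, "->", classify_size(size))
```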

## Complete Pipeline

With that in mind, we can now make a class to process the entire document. Here is a summary of the entire process:
1. Open the file
2. For each page, extract the blocks along with their average font size. During this step, drop any vertical text to remove the "Downloaded from ..." note in the right-hand margin.
3. Iterate through the blocks until you encounter a block of font size 20 (indicating the start of the article).
4. After the start is found, continue iterating and:
    - Add the first block of size 15 to the authors
    - Add any blocks of size 10 to the body
    - Skip any blocks of font size <= 7.5
    - Add any blocks of different size to a list of irregular blocks
5. Stop when the end of the document is reached OR when you encounter another block of size 20 (indicating the start of a new article)
6. Clean up the body text by removing line breaks that interrupt sentences, which are the result of a sentence stretching across columns
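
The vertical-text filter in step 2 is a simple aspect-ratio check on each block's bounding box. Here is a standalone sketch, assuming the `(x0, y0, x1, y1)` bounding-box layout used by the `bbox` field of each block. `is_vertical` and its `ratio` parameter are hypothetical names, and the example coordinates are invented for illustration.

```python
from typing import Tuple


def is_vertical(bbox: Tuple[float, float, float, float], ratio: float = 5.0) -> bool:
    """Treat a block as vertical text if it is more than `ratio` times taller than wide."""
    x0, y0, x1, y1 = bbox
    width = x1 - x0
    height = y1 - y0
    return height > ratio * width


# A tall, narrow strip in the margin (invented coordinates)
print(is_vertical((580, 100, 590, 700)))
# A normal single line of text (invented coordinates)
print(is_vertical((50, 100, 300, 112)))
```

The threshold of 5 is the value used in this notebook; it is a heuristic rather than a property of the PDF format, so other layouts may need a different ratio.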

In [4]:
class Article:
    def __init__(
        self,
        path: os.PathLike,
        title_size: float = 20,
        author_size: float = 15,
        body_size: float = 10,
        note_size: float = 7.5,
    ):
        # Init attributes
        self.path = path
        self.title_size = title_size
        self.author_size = author_size
        self.body_size = body_size
        self.note_size = note_size

        # Init data attributes to None
        self.title = None
        self.authors = None
        self.body = None
        self.irregular_blocks = None

        # Open the PDF document
        self.doc = pymupdf.open(path)
        self.num_pages = len(self.doc)

        # Process the PDF file to extract title, authors, and body text
        self._process_file()

    def _read_page_blocks(self, page_number: int) -> List[Dict]:
        """
        Extract the rich text from a specific page of the document.

        Args:
            page_number (int): The page number to read (0-indexed).

        Returns:
            List[Dict]: A list of dictionaries representing text blocks on the page.
        """

        page = self.doc[page_number]
        blocks = page.get_text(option="dict", flags=11)["blocks"]
        return blocks

    def _parse_blocks(self, blocks: List[Dict]) -> List[Dict]:
        """
        Parse the text blocks to extract text and average font size.

        Args:
            blocks (List[Dict]): The list of text blocks, each containing lines and spans.

        Returns:
            List[Dict]: A list of dictionaries with block text and average font size.
        """
        parsed_blocks = []
        for b in blocks:

            # Skip vertical blocks (height > 5x width)
            x0, y0, x1, y1 = b["bbox"]
            width = x1 - x0
            height = y1 - y0

            if height > 5 * width:
                continue

            font_sizes = []
            for l in b["lines"]:
                for s in l["spans"]:
                    font_sizes.append(s["size"])
            if font_sizes:
                block_text = " ".join(s["text"] for l in b["lines"] for s in l["spans"])
                avg_font_size = round(sum(font_sizes) / len(font_sizes) / 2.5) * 2.5
                parsed_blocks.append({"text": block_text, "size": avg_font_size})
        return parsed_blocks

    def _parse_doc(self) -> List[Dict]:
        """
        Parse the entire document to extract text blocks and their average font sizes.

        Returns:
            List[Dict]: A list of dictionaries with block text and average font size for each page.
        """
        all_blocks = []
        for page_number in range(self.num_pages):
            blocks = self._read_page_blocks(page_number)
            parsed_blocks = self._parse_blocks(blocks)
            all_blocks.extend(parsed_blocks)
        return all_blocks

    def _process_file(self) -> None:
        """
        Process the PDF file to extract title, authors, and body text. Ignore any text before the first headline
        or after the second headline (if applicable).
        """
        blocks = self._parse_doc()
        start = False
        self.body = ""
        self.irregular_blocks = []
        for block in blocks:
            if not start and block["size"] < self.title_size:
                continue
            elif not start and block["size"] >= self.title_size:
                start = True
                self.title = block["text"]
                continue
            else:
                if block["size"] <= self.note_size:
                    continue
                elif block["size"] == self.title_size:
                    break
                elif block["size"] == self.author_size and not self.authors:
                    self.authors = block["text"]
                elif block["size"] == self.body_size:
                    self.body += "\n" + block["text"] + "\n"
                else:
                    self.irregular_blocks.append(block)

        # Remove line breaks unless they are followed by a capital letter, indicating a new sentence
        self.body = re.sub(r'\n(?=[^A-Z])', '', self.body.strip())
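
To see what the final clean-up step does, the sketch below reproduces the same string assembly and regular expression on three invented blocks. `clean_body` mirrors the logic in `_process_file`. `clean_body_spaced` is a hypothetical variant, not part of the pipeline, that inserts a space where a break is removed.

```python
import re
from typing import List


def clean_body(blocks: List[str]) -> str:
    """Assemble blocks as the pipeline does, then drop breaks not followed by a capital."""
    body = "".join("\n" + text + "\n" for text in blocks)
    return re.sub(r"\n(?=[^A-Z])", "", body.strip())


def clean_body_spaced(blocks: List[str]) -> str:
    """Hypothetical variant: replace each removed run of breaks with a single space."""
    body = "".join("\n" + text + "\n" for text in blocks)
    return re.sub(r"\n+(?=[^A-Z\n])", " ", body.strip())


blocks = ["The study covered 189", "primary tumors.", "Results were significant."]
print(repr(clean_body(blocks)))
print(repr(clean_body_spaced(blocks)))
```

Because the original expression deletes the break outright, the first two blocks are joined as `189primary` unless the first block already ends in whitespace. Whether that matters depends on how the extracted blocks end in practice, which is worth checking on the real output. The spaced variant avoids fused words but would leave a space after a hyphenated line ending such as `amplifica-`, so neither version is correct in every case.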

In [5]:

start = time()
article = Article("../sources/SlamonetalSCIENCE1987.pdf")
end = time()
print(f"Processing time: {end - start:.2f} seconds\n")
print(f"Title: {article.title}")
print(f"Authors: {article.authors}")

print(f"Body:\n\n{article.body}")

Processing time: 0.03 seconds

Title: Human Breast Cancer: Correlation of Relapse and Survival with Amplification of the HER-2lneu Oncogene
Authors: DENNIS J. SLAMON,* GARY M. CLARK, STEVEN G. WONG, WENDY J. LEVIN, AxEL ULLRICH, WILLiAM L. McGuIRE
Body:

The HER-2/neu oncoFene is a member of the erbB-like oncogene family, and aS related to, but distinct firom, the epidermal growth factor receptr. This gene has been shown to be amplified i human brt cancer cell lines. In the current study, alterations of the gene in 189 primary human breast cancers were instigated HER-2/ neu was found to be amplified frm 2- to eater than 20- fold in 30% of the tumors. Correlation ofgene amplifica- tion with several disease parameters was evaluated Am- plification of the HER-2/neu gene was a significant pre- dictor of both overall survival and time to relapse in patients with breast cancer. It retained its significance even when adjustments were made for other known prognostic factors. Moreover, HER-2/ne

Success! 

The title and authors were extracted correctly. The body begins after the title and ends at REFERENCES AND NOTES: the references are dropped by the font size <= 7.5 filter, and the loop stops when it reaches the title of the next article on the last page.

Let's see if we encountered any irregular blocks:

In [6]:
for block in article.irregular_blocks:
    print(f"Irregular block (size {block['size']}): {block['text']}")

Irregular block (size 117.5): u_0 -S;0 ,-M; 60S_tJ ~~~~~~~~~~kb tN ~~~~~~~~~~~~~~~~-12
Irregular block (size 12.5): B
Irregular block (size 12.5): A
Irregular block (size 12.5): B


Nothing to be concerned about; these look like figure labels.

## Export (optional)

Finally, let's save the article to a markdown file for human readability.

In [7]:
md_text = f"# {article.title}\n\n"
md_text += f"**Authors:** {article.authors}\n\n"
md_text += f"## Body\n\n{article.body}\n"

with open("article.md", "w", encoding="utf-8") as f:
    f.write(md_text)

## Discussion

Overall, I am satisfied with the text extraction pipeline given the limitations in time and scope. The current process performs very well on the given file and should generalize well to other files with consistent styling. A summary of the benefits, limitations, and potential improvements is given below.

### Benefits

- **Clean Text**: PyMuPDF extracts text cleanly, with minimal missing/misidentified characters.
- **Accurate Layout**: PyMuPDF extracts text in the correct reading order (top-down, left-to-right) without the user hardcoding the document's format (e.g., number of columns or rows).
- **Classify Text from Styling**: we use the text styling (font size and orientation) to classify each block as title, authors, body, or extraneous text.
- **Performant**: processing the article takes approximately 0.03 seconds, or about 0.005 seconds per page.
- **Interpretable & Customizable**: unlike black-box third-party services, we can control each step of the process.

### Limitations

- **Format Fragility**: This approach assumes consistent formatting across articles. While articles in the same issue of a journal likely have the same styling, different journals may use different styles or journals may change their styling over time.
- **Brittle Text Classification**: we only use font size to classify each block of text, ignoring any semantic understanding. This may lead to issues if the formatting is not consistent within the article, for instance if a figure label is also size 10.

### Potential Future Improvements

- **Autodetect Font Sizes to Adapt to Different Styles**: dynamically determine the font size cutoffs for body, authors, and titles.
- **Incorporate Semantic Understanding into Text Classification**: for example, the author block should not only be of the right size but must also contain one or more person names, validated using NER techniques.
- **Track Section Headers or Other Structure**: for long articles with multiple sections (Background, Methods, etc.), capture the sections so we can better contextualize chunks in downstream processes.
- **Parallelization**: the initial processing of each page individually could be parallelized once speed/scalability becomes a concern.
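
As a rough illustration of the first idea, font size cutoffs could be guessed from the parsed blocks themselves. The sketch below is one possible heuristic, not a tested method. It assumes blocks shaped like the output of `_parse_blocks` (`{"text": ..., "size": ...}`), takes the size covering the most characters as the body, and ignores sizes above a hypothetical `max_ratio` times the body size to avoid figure debris like the 117.5 block seen earlier. `infer_size_roles` and `max_ratio` are invented names.

```python
from collections import Counter
from typing import Dict, List, Optional


def infer_size_roles(blocks: List[Dict], max_ratio: float = 4.0) -> Dict[str, Optional[float]]:
    """Guess title/author/body/note sizes from parsed blocks (heuristic sketch)."""
    # Weight each size by how many characters use it; body text should dominate.
    chars_per_size: Counter = Counter()
    for block in blocks:
        chars_per_size[block["size"]] += len(block["text"])
    if not chars_per_size:
        return {"title": None, "author": None, "body": None, "note": None}

    body = chars_per_size.most_common(1)[0][0]

    # Sizes above the body (within the cap) are heading candidates, largest first.
    larger = sorted((s for s in chars_per_size if body < s <= max_ratio * body), reverse=True)
    # Sizes below the body are candidates for notes and footers, largest first.
    smaller = sorted((s for s in chars_per_size if s < body), reverse=True)

    return {
        "title": larger[0] if larger else None,
        "author": larger[1] if len(larger) > 1 else None,
        "body": body,
        "note": smaller[0] if smaller else None,
    }


# Invented sample that mimics the size distribution seen in this notebook
sample = [
    {"text": "T" * 80, "size": 20.0},
    {"text": "A" * 90, "size": 15.0},
    {"text": "b" * 2000, "size": 10.0},
    {"text": "n" * 300, "size": 7.5},
    {"text": "x" * 40, "size": 117.5},
    {"text": "B", "size": 12.5},
]
print(infer_size_roles(sample))
```

The "second-largest size is the author line" rule is the weakest assumption here, and it is exactly the kind of guess that the semantic checks proposed in the second bullet would help validate.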