# Parsing PDFs to extract information

We use PyMuPDF to extract information from research papers. For this exploration, we work with the MapReduce paper and look at what each library can pull out of it.

In [2]:
import pymupdf

In [None]:
pdf_name = '../research_papers/mapreduce.pdf'
doc = pymupdf.open(pdf_name)

In [25]:
from IPython.display import display, HTML
display(HTML(doc[0].get_text()))

In [29]:
page = doc[0]

def get_paragraphs(page):
    """Print the raw text blocks of a page and return their joined text."""
    blocks = page.get_text("dict")["blocks"]

    paragraphs = []
    for block in blocks:
        if block["type"] == 0:  # type 0 indicates a text block
            # Each text block can be treated as a paragraph
            paragraphs.append(block["lines"])

    # Join the span texts of each paragraph into a single string
    paragraph_texts = [
        " ".join(span["text"] for line in para for span in line["spans"])
        for para in paragraphs
    ]

    # Print the raw line/span dictionaries for inspection
    print(paragraphs)
    return paragraph_texts

paragraph_texts = get_paragraphs(page)

[[{'spans': [{'size': 27.658634185791016, 'flags': 4, 'font': 'Christiana-RegularSC', 'color': 2236191, 'ascender': 0.9710000157356262, 'descender': -0.30000001192092896, 'text': 'MapReduce: Simplified Data Processing ', 'origin': (141.493896484375, 81.0), 'bbox': (141.493896484375, 51.869998931884766, 568.6085815429688, 90.0)}], 'wmode': 0, 'dir': (1.0, 0.0), 'bbox': (141.493896484375, 51.869998931884766, 568.6085815429688, 90.0)}], [{'spans': [{'size': 27.658634185791016, 'flags': 4, 'font': 'Christiana-RegularSC', 'color': 2236191, 'ascender': 0.9710000157356262, 'descender': -0.30000001192092896, 'text': 'on Large Clusters', 'origin': (365.6643981933594, 111.0), 'bbox': (365.6643981933594, 81.8699951171875, 561.947998046875, 120.0)}], 'wmode': 0, 'dir': (1.0, 0.0), 'bbox': (365.6643981933594, 81.8699951171875, 561.947998046875, 120.0)}], [{'spans': [{'size': 10.0, 'flags': 20, 'font': 'Christiana-Bold', 'color': 2236191, 'ascender': 0.9810000061988831, 'descender': -0.3079999983310
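
Each span in the dump above carries a `size` field, and the two title spans (size about 27.66) are much larger than the size 10.0 span that follows. A minimal sketch, assuming the largest font on a page marks its title, could use that field to pull the title out of the `blocks` list. The helper name `largest_font_text` is hypothetical and not part of PyMuPDF.

```python
def largest_font_text(blocks):
    """Join the text of all spans that use the largest font size in `blocks`.

    `blocks` is expected to look like page.get_text("dict")["blocks"].
    Assumption: the largest font on the page belongs to the title.
    """
    spans = [
        span
        for block in blocks
        if block.get("type") == 0  # text blocks only; other blocks have no "lines"
        for line in block["lines"]
        for span in line["spans"]
    ]
    if not spans:
        return ""
    max_size = max(span["size"] for span in spans)
    # Compare rounded sizes so tiny floating-point differences do not split a title
    parts = [
        span["text"].strip()
        for span in spans
        if round(span["size"], 1) == round(max_size, 1)
    ]
    return " ".join(part for part in parts if part)
```

Calling `largest_font_text(page.get_text("dict")["blocks"])` on the first page would be expected to join the two large spans shown above, provided nothing else on the page uses the same font size.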

## Exploring pymupdf4llm
`pymupdf4llm` converts a PDF to Markdown, which is easier to parse than the raw block dictionaries.

In [32]:
import pathlib

import pymupdf4llm

md_text = pymupdf4llm.to_markdown(pdf_name)
pathlib.Path("output.md").write_bytes(md_text.encode())

Processing mapreduce.pdf...


36936
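
To illustrate why Markdown is easier to parse, here is a minimal sketch that splits the text into sections. It assumes the generated Markdown marks section titles with `#` headings, and it does not handle `#` lines inside fenced code blocks. The function name `split_markdown_sections` is hypothetical.

```python
import re

def split_markdown_sections(md_text):
    """Split Markdown text into (heading, body) pairs at '#' headings.

    Text that appears before the first heading is returned with heading None.
    """
    sections = []
    heading, body = None, []
    for line in md_text.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)$", line)
        if match:
            # Close the previous section before starting a new one
            if heading is not None or any(l.strip() for l in body):
                sections.append((heading, "\n".join(body).strip()))
            heading, body = match.group(2).strip(), []
        else:
            body.append(line)
    # Flush the final section
    if heading is not None or any(l.strip() for l in body):
        sections.append((heading, "\n".join(body).strip()))
    return sections
```

With the cell above, `split_markdown_sections(md_text)` would give one pair per heading, which can then be filtered by heading text.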

## Exploring partition_pdf and unstructured
`partition_pdf` is part of the `unstructured` library and lets us parse a PDF into typed elements, along with the images it contains.

In [None]:
# !pip install pydantic
# !pip install "unstructured[all-docs]"  # provides partition_pdf

# !brew install poppler

from typing import Any
from pydantic import BaseModel
from unstructured.partition.pdf import partition_pdf

In [None]:
raw_pdf_elements = partition_pdf(
    filename=pdf_name,

    # Use the layout model (YOLOX) to get bounding boxes (for tables) and find titles.
    # Titles are any sub-section of the document.
    infer_table_structure=True,

    # Optional post-processing: aggregate text under each title into chunks.
    # chunking_strategy="by_title",
    # max_characters=4000,              # hard maximum on chunk size
    # new_after_n_chars=3800,           # attempt to start a new chunk after 3800 chars
    # combine_text_under_n_chars=2000,  # attempt to keep chunks above 2000 chars

    # Output directory for embedded image blocks found in the PDF
    image_output_dir_path="static/pdfImages/",
)
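
The commented-out chunking options describe grouping text under each title with a hard size limit. The idea can be sketched in plain Python as below. This is a simplified illustration, not unstructured's implementation: the function name `chunk_by_title_sketch` is hypothetical, it takes `(category, text)` pairs rather than element objects, and it ignores the two soft limits (`new_after_n_chars`, `combine_text_under_n_chars`).

```python
def chunk_by_title_sketch(elements, max_characters=4000):
    """Group (category, text) pairs into chunks that start at each Title.

    A new chunk also starts when adding a text would push the current chunk
    past max_characters. A single text longer than max_characters is kept
    whole rather than split.
    """
    chunks, current = [], []
    current_len = 0  # length of "\n".join(current)
    for category, text in elements:
        added = len(text) + (1 if current else 0)  # +1 for the joining newline
        if current and (category == "Title" or current_len + added > max_characters):
            chunks.append("\n".join(current))
            current, current_len = [], 0
            added = len(text)
        current.append(text)
        current_len += added
    if current:
        chunks.append("\n".join(current))
    return chunks
```

It could be tried on the parsed elements with `chunk_by_title_sketch([(el.category, el.text) for el in raw_pdf_elements])`.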

In [55]:
raw_pdf_elements[9].text

'As a reaction to this complexity, we designed a new abstraction that allows us to express the simple computations we were trying to perform but hides the messy details of parallelization, fault tolerance, data distri- bution and load balancing in a library. Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional lan- guages. We realized that most of our computations involved applying a map operation to each logical record’ in our input in order to'
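
The extracted text keeps line-break hyphenation from the PDF, such as `distri- bution` and `lan- guages`. A minimal cleanup sketch is shown below, assuming that a hyphen followed by a space between two lowercase letters is always a line-break artifact. The helper name `join_hyphenated_breaks` is hypothetical, and the rule would wrongly join genuine constructs such as "pre- and post-processing", so the results are worth reviewing.

```python
import re

def join_hyphenated_breaks(text):
    """Rejoin words split across a line break, e.g. 'distri- bution' -> 'distribution'.

    Only matches a hyphen that sits between a lowercase letter and a space
    followed by another lowercase letter; ordinary hyphenated words such as
    'inter-machine' are left untouched.
    """
    return re.sub(r"(?<=[a-z])- (?=[a-z])", "", text)
```

For example, `join_hyphenated_breaks(raw_pdf_elements[9].text)` would rejoin the two split words in the output above.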

In [108]:
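elements = partition_pdf(
    filename=pdf_name,

    # Partitioning options. Note: table inference, the layout model, and image
    # extraction only apply with strategy="hi_res"; "fast" extracts embedded text only.
    strategy="fast",
    infer_table_structure=True,
    model_name="yolox",
    extract_images_in_pdf=True,
    image_output_dir_path="static/pdfImages/"
)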

### Investigating the result of the extraction

**Type of each element**

In [110]:
{element.category for element in elements}

{'Footer', 'Header', 'ListItem', 'NarrativeText', 'Title', 'UncategorizedText'}

In [99]:
for element in elements:
    print(element.category, element.text)

Image Check for updates.
Title MapReduce: Simplified Data Processing on Large Clusters
Title by Jeffrey Dean and Sanjay Ghemawat
Title Abstract
NarrativeText MapReduce is a programming model and an associated implementation for processing and generating large datasets that is amenable to a broad variety of real-world tasks. Users specify the computation in terms of a map and a reduce function, and the under- lying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make effi- cient use of the network and disks. Programmers find the system easy to use: more than ten thousand distinct MapReduce programs have been implemented internally at Google over the past four years, and an average of one hundred thousand MapReduce jobs are executed on Google’s clusters every day, processing a total of more than twenty petabytes of data per day.
Title 1 Introduction
NarrativeText Prio

### Extracting Images

In [103]:
images = [element for element in elements if element.category == "Image"]
[f"Image Path: {image.metadata.image_path} and location {image.metadata.page_number}" for image in images]

['Image Path: /Users/faizahmed/Documents/SJSU/topics_in_db/Project/explorations/pdf_parsing/figures/figure-1-1.jpg and location 1',
 'Image Path: /Users/faizahmed/Documents/SJSU/topics_in_db/Project/explorations/pdf_parsing/figures/figure-3-2.jpg and location 3',
 'Image Path: /Users/faizahmed/Documents/SJSU/topics_in_db/Project/explorations/pdf_parsing/figures/figure-5-3.jpg and location 5',
 'Image Path: /Users/faizahmed/Documents/SJSU/topics_in_db/Project/explorations/pdf_parsing/figures/figure-5-4.jpg and location 5',
 'Image Path: /Users/faizahmed/Documents/SJSU/topics_in_db/Project/explorations/pdf_parsing/figures/figure-6-5.jpg and location 6']

'/Users/faizahmed/Documents/SJSU/topics_in_db/Project/explorations/pdf_parsing/figures/figure-5-3.jpg'