# PDF Text Extraction and Formatting with LlamaParse

This notebook extracts text from a list of PDF documents with LlamaParse, identifies headings and subheadings, and writes the result as Markdown. It is intended to handle PDFs of varied styles and layouts, producing text in a logical reading order with Markdown-compatible paragraph breaks.
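As a rough illustration of what "Markdown-compatible paragraph breaks" can mean in practice, the sketch below tidies whitespace in already-extracted text. It is not part of the notebook or of LlamaParse; `normalize_markdown_whitespace` is a hypothetical helper, and it assumes that a single blank line between paragraphs is the desired result.

```python
import re


def normalize_markdown_whitespace(text: str) -> str:
    """Hypothetical post-processing step: tidy whitespace in extracted Markdown."""
    # Use Unix line endings throughout.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Drop trailing spaces on every line. Note: this also removes Markdown
    # hard line breaks (two trailing spaces), which is an assumption here.
    text = "\n".join(line.rstrip() for line in text.split("\n"))
    # Collapse runs of blank lines so paragraphs are separated by exactly one.
    text = re.sub(r"\n{3,}", "\n\n", text)
    # End the document with a single newline.
    return text.strip() + "\n"


raw = "# Title  \r\n\r\n\r\n\r\nPara one.   \n\n\n\nPara two.\n\n"
cleaned = normalize_markdown_whitespace(raw)
```

A step like this could be applied to each parsed document's text before it is written to disk, if the parser's output turns out to contain uneven spacing.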

## Features

- **PDF List Reading**: Reads PDF file paths, each paired with custom parsing instructions, from a specified CSV file.
- **Text Extraction**: Uses LlamaParse (the `llama-parse` package) for text extraction, with a focus on preserving the logical reading order.
- **Heading and Subheading Identification**: Identifies headings and subheadings within the text, formatting them with appropriate Markdown syntax.
- **Markdown Formatting**: Formats extracted text with Markdown syntax, retaining paragraph breaks and removing extraneous whitespace.
- **Markdown Output**: For each PDF, the extracted text is saved as a Markdown file (the PDF's base name with a `_llama.md` suffix) in the same directory as the original PDF.
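The input list and output naming described above can be sketched without calling the parsing service at all. The example below is a minimal illustration, assuming a two-column CSV (PDF path, then parsing instructions) and the `_llama.md` output suffix; `read_jobs` and the sample paths `reports/a.pdf` and `reports/b.pdf` are hypothetical and exist only for this sketch.

```python
import csv
import io
import os


def read_jobs(csv_file):
    """Yield (pdf_path, instructions, markdown_path) for each non-empty CSV row."""
    for row in csv.reader(csv_file):
        if not row or not row[0].strip():
            continue  # skip blank lines
        pdf_path = row[0].strip()
        # Treat the second column as optional in this sketch.
        instructions = row[1].strip() if len(row) > 1 else ""
        # The output file sits next to the PDF, with a "_llama.md" suffix.
        markdown_path = os.path.splitext(pdf_path)[0] + "_llama.md"
        yield pdf_path, instructions, markdown_path


# An in-memory stand-in for the CSV file (hypothetical paths).
sample = io.StringIO(
    'reports/a.pdf,"Colored fonts should be treated as bold."\n'
    "\n"
    "reports/b.pdf,\n"
)
jobs = list(read_jobs(sample))
```

Separating the row handling from the parsing call makes it possible to check the CSV and the planned output paths before any file is uploaded.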

## Prerequisites

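Before using this notebook, install the `llama-index` and `llama-parse` packages with pip. You also need a LlamaCloud API key, which the notebook reads from the `LLAMA_CLOUD_API_KEY` environment variable.

```bash
pip install llama-index llama-parse
```
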

In [1]:
!pip install llama-index llama-parse



In [2]:
import os
import csv

# Do not hard-code a real key here. Set LLAMA_CLOUD_API_KEY in your shell,
# or replace the placeholder below with your own key.
os.environ.setdefault("LLAMA_CLOUD_API_KEY", "llx-<your-api-key>")

# llama-parse is async-first, running the sync code in a notebook requires the use of nest_asyncio
import nest_asyncio
nest_asyncio.apply()

from llama_parse import LlamaParse

In [3]:
#parser = LlamaParse(
#    api_key="llx-<your-api-key>",  # can also be set in your env as LLAMA_CLOUD_API_KEY
#    result_type="markdown",  # "markdown" and "text" are available
#    num_workers=4, # if multiple files passed, split in `num_workers` API calls
#    verbose=True,
#    language="en" # Optionally you can define a language, default=en
#)

# sync
#documents = parser.load_data("/Users/willit/Documents/WorldBank/samplefiles/844381468209068874/792930WP0WB0NC0l0Box07929300PUBLIC0.pdf")


In [4]:
def process_pdfs_from_csv(csv_path):
    """Parse each PDF listed in the CSV and save the result as Markdown next to it.

    Each row is expected to hold two columns: the PDF path and the custom
    parsing instructions for that file.
    """
    with open(csv_path, newline='', encoding='utf-8-sig') as csvfile:
        reader = csv.reader(csvfile)
        for row in reader:
            if not row:
                continue  # Skip blank lines
            pdf_path, custom_instructions = row
            pdf_path = pdf_path.strip()  # Remove stray whitespace around the path
            markdown_file_path = os.path.splitext(pdf_path)[0] + '_llama.md'

            print(f"Attempting to open and process {pdf_path}, and process it with custom instructions: {custom_instructions}")

            # load_data returns a list of Document objects, not a single string
            documents = LlamaParse(result_type="markdown", parsing_instruction=custom_instructions, verbose=True).load_data(pdf_path)

            with open(markdown_file_path, 'w', encoding='utf-8') as mdfile:
                for document in documents:
                    # Write the parsed Markdown text, not the Document's string form
                    mdfile.write(f"{document.text}\n")

            print(f"Processed {pdf_path}, output saved to {markdown_file_path}")

# Replace this path with the path to your own CSV file
process_pdfs_from_csv('/Users/willit/Documents/WorldBank/pdfs_to_extract.csv')

Attempting to open and process /Users/willit/Documents/WorldBank/samplefiles/713741468337198922/536490BRI0SPAN10Box345621B01PUBLIC1.pdf, and process it with custom instructions: Colored fonts should be treated as bold.  Tables have headings that are centered just above the table.
Started parsing the file under job_id 418f1813-5a8d-40b9-8bf8-f8fe335dce25


KeyboardInterrupt: 