# PDF text extraction

**Description**: This notebook demonstrates how to extract text from a PDF file with the `pypdfium2` library and how to drop detected header and footer lines from each page.

## Imports

In [1]:
import pypdfium2 as pdfium

from src.pdf_reader.helpers import detect_header_footer

## Load PDF document

In [2]:
pdf_path = "data/pdf_docs/a-practical-guide-to-building-agents.pdf"

In [3]:
pdf = pdfium.PdfDocument(pdf_path)
print(f"Length of PDF: {len(pdf)} pages")

Length of PDF: 34 pages


## Detect header and footer

In [4]:
header_footer_lines = detect_header_footer(document=pdf)
list(header_footer_lines)[:5]

['3 A practical guide to building agents',
 '9 A practical guide to building agents',
 '54',
 '27',
 '6 A practical guide to building agents']
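The implementation of `detect_header_footer` lives in `src.pdf_reader.helpers` and is not shown in this notebook. As a rough illustration of the idea only, here is a minimal sketch, assuming that a header or footer is a line near the top or bottom of a page that recurs across pages once its digits are masked. The function name `detect_repeated_lines` and its parameters `edge_lines` and `min_share` are hypothetical, and the sketch works on plain page strings rather than on a `PdfDocument`; the real helper may use a different heuristic.

```python
import re
from collections import Counter


def detect_repeated_lines(pages, edge_lines=2, min_share=0.5):
    """Hypothetical stand-in for a header/footer detector.

    pages: list of page texts, one string per page.
    Returns the set of raw lines whose digit-masked form appears among
    the first or last `edge_lines` lines of enough pages.
    """

    def normalize(line):
        # Mask digit runs so "3 Some title" and "9 Some title" compare equal
        return re.sub(r"\d+", "#", line.strip())

    edges_per_page = []
    for text in pages:
        lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
        # Only lines near the top or bottom of a page are candidates
        edges_per_page.append(set(lines[:edge_lines] + lines[-edge_lines:]))

    # Count on how many pages each masked candidate occurs
    counts = Counter(
        key
        for edges in edges_per_page
        for key in {normalize(line) for line in edges}
    )
    threshold = max(2, int(min_share * len(pages)))
    repeated = {key for key, n in counts.items() if n >= threshold}

    # Return the raw lines so they can be matched against page text directly
    return {
        line
        for edges in edges_per_page
        for line in edges
        if normalize(line) in repeated
    }
```

Returning the raw lines as a set keeps the later `line.strip() not in ...` membership check cheap. The trade-off of this heuristic is that a footer whose wording differs from the others is not caught, and a body line that happens to repeat at page edges is.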

## Extract text from document pages

In [5]:
# Maps the zero-based page index to the page text without header/footer lines
text_per_page = {}

for page_id in range(len(pdf)):
    print("---------------------------------------------")
    print(f"Page {page_id + 1} of {len(pdf)}")
    # pypdfium2 appears to separate lines with "\r\n"; str.splitlines() below handles that
    page_text = pdf[page_id].get_textpage().get_text_bounded()

    # Split the text into lines and drop any that match a detected header or footer line
    page_lines = [
        line
        for line in page_text.splitlines()
        if line.strip() not in header_footer_lines
    ]

    page_text_without_header_footer = "\n".join(page_lines)
    print(page_text_without_header_footer)

    text_per_page[page_id] = page_text_without_header_footer

---------------------------------------------
Page 1 of 34
A practical 

guide to 

building agents
---------------------------------------------
Page 2 of 34
Contents
What is an agent? 4
When should you build an agent? 5
Agent design foundations 7
Guardrails 24
Conclusion 32
2 Practical guide to building agents
---------------------------------------------
Page 3 of 34
Introduction
Large language models are becoming increasingly capable of handling complex, multi-step tasks. 
Advances in reasoning, multimodality, and tool use have unlocked a new category of LLM-powered 
systems known as agents.

This guide is designed for product and engineering teams exploring how to build their first agents, 
distilling insights from numerous customer deployments into practical and actionable best 
practices. It includes frameworks for identifying promising use cases, clear patterns for designing 
agent logic and orchestration, and best practices to ensure your agents run safely, predictably, 

and 