<a href="https://colab.research.google.com/github/Wea-boo/pdf-loader-notebook/blob/main/06-DocumentLoader/02-PDFLoader.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# PDF Loader

- Author: [Yejin Park](https://github.com/ppakyeah)
- Peer Review : [Yun Eun](https://github.com/yuneun92), [MinJi Kang](https://www.linkedin.com/in/minji-kang-995b32230/)
- Author: [Yejin Park](https://github.com/ppakyeah)
- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/06-DocumentLoader/02-PDFLoader.ipynb)
[![Open in GitHub](https://img.shields.io/badge/Open%20in%20GitHub-181717?style=flat-square&logo=github&logoColor=white)](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/06-DocumentLoader/02-PDFLoader.ipynb)

## Overview
This tutorial covers various PDF processing methods using LangChain and popular PDF libraries.

PDF processing is essential for extracting and analyzing text data from PDF documents.

In this tutorial, we will explore different PDF loaders and their capabilities while working with LangChain's document processing framework.

### Table of Contents

- [Overview](#overview)
- [Environment Setup](#environment-setup)
- [How to load PDFs](#how-to-load-pdfs)
- [PyPDF](#pypdf)
- [PyMuPDF](#pymupdf)
- [Unstructured](#unstructured)
- [PyPDFium2](#pypdfium2)
- [PDFMiner](#pdfminer)
- [PDFPlumber](#pdfplumber)

### References

- [LangChain: How to load PDFs](https://python.langchain.com/docs/how_to/document_loader_pdf/)
----

## Environment Setup

Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.

**[Note]**
- ```langchain-opentutorial``` is a package that provides a set of easy-to-use environment setup, useful functions and utilities for tutorials.
- You can checkout the [```langchain-opentutorial```](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) for more details.

In [1]:
%%capture --no-stderr
%pip install langchain-opentutorial

In [2]:
# Install required packages
from langchain_opentutorial import package

package.install(
    [
        "langchain_community",
        "langchain_text_splitters",
        "pypdf",
        "rapidocr-onnxruntime",
        "pymupdf",
        "unstructured[pdf]"
    ],
    verbose=False,
    upgrade=False,
)

In [3]:
# Set environment variables
from langchain_opentutorial import set_env

set_env(
    {
        "OPENAI_API_KEY": "",
        "LANGCHAIN_API_KEY": "",
        "LANGCHAIN_TRACING_V2": "true",
        "LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
        "LANGCHAIN_PROJECT": "PDFLoader",
    }
)

Environment variables have been set successfully.


## How to load PDFs

[Portable Document Format (PDF)](https://en.wikipedia.org/wiki/PDF), a file format standardized by ISO 32000, was developed by Adobe in 1992 for presenting documents, which include text formatting and images in a way that is independent of application software, hardware, and operating systems.

This guide covers how to load a PDF document into the LangChain [Document](https://python.langchain.com/api_reference/core/documents/langchain_core.documents.base.Document.html#langchain_core.documents.base.Document) format. This format will be used downstream.

LangChain integrates with a variety of PDF parsers. Some are simple and relatively low-level, while others support OCR and image processing or perform advanced document layout analysis.

The right choice depends on your application.


We will demonstrate these approaches on a [sample file](https://github.com/langchain-ai/langchain/blob/master/libs/community/tests/integration_tests/examples/layout-parser-paper.pdf).
Download the sample file and copy it to your data folder.

In [7]:
FILE_PATH = "/content/prog_imperative_repdf.pdf"

In [8]:
def show_metadata(docs):
    if docs:
        print("[metadata]")
        print(list(docs[0].metadata.keys()))
        print("\n[examples]")
        max_key_length = max(len(k) for k in docs[0].metadata.keys())
        for k, v in docs[0].metadata.items():
            print(f"{k:<{max_key_length}} : {v}")

## PyPDF


[PyPDF](https://github.com/py-pdf/pypdf) is one of the most widely used Python libraries for PDF processing.

Here we use PyPDF to load the PDF as an list of Document objects

LangChain's [```PyPDFLoader```](
https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.pdf.PyPDFLoader.html) integrates with PyPDF to parse PDF documents into LangChain Document objects.


In [9]:
from langchain_community.document_loaders import PyPDFLoader

# Initialize the PDF loader
loader = PyPDFLoader(FILE_PATH)

# Load data into Document objects
docs = loader.load()

# Print the contents of the document
print(docs[10].page_content[:300])

11 
12/02/22 09:03 © 2008~2012 par J. Feat 
 
 
 
① REPRÉSENTATIONS SYMBOLIQUES 
L'essentiel de l'activité de programmation avec un langage de haut niveau consiste à manipuler des 
symboles, non pas en tant que noms, mais pour ce qu'ils représentent : leur valeur. 
 
Et pour être utilisables, les sy


In [10]:
# output metadata
show_metadata(docs)

[metadata]
['producer', 'creator', 'creationdate', 'author', 'moddate', 'title', 'source', 'total_pages', 'page', 'page_label']

[examples]
producer     : Microsoft: Print To PDF
creator      : PyPDF
creationdate : 2025-05-09T18:14:45+01:00
author       : 
moddate      : 2025-05-09T18:14:45+01:00
title        : Microsoft Word - conception de programmes _ l'approche impérative
source       : /content/prog_imperative_repdf.pdf
total_pages  : 124
page         : 0
page_label   : 1


The ```load_and_split()``` method allows customizing how documents are chunked by passing a text splitter object, making it more flexible for different use cases.

In [11]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load Documents and split into chunks. Chunks are returned as Documents.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=200)
docs = loader.load_and_split(text_splitter=text_splitter)
print(docs[0].page_content)

conception de programmes 
l'approche impérative 
 
 
 
 
 
UNIVERSITÉ PARIS 8 
 
SYNTHÈSE DE COURS 
EDF2LNia 
2012


### PyPDF(OCR)

Some PDFs contain text images within scanned documents or pictures. You can also use the ```rapidocr-onnxruntime``` package to extract text from images.

In [12]:
# Initialize PDF loader, enable image extraction option
loader = PyPDFLoader(FILE_PATH, extract_images=True)

# load PDF page
docs = loader.load()

# access page content
print(docs[4].page_content[:300])

TypeError: Cannot handle this data type: (1, 1, 1), |u1

In [13]:
show_metadata(docs)

[metadata]
['producer', 'creator', 'creationdate', 'author', 'moddate', 'title', 'source', 'total_pages', 'page', 'page_label']

[examples]
producer     : Microsoft: Print To PDF
creator      : PyPDF
creationdate : 2025-05-09T18:14:45+01:00
author       : 
moddate      : 2025-05-09T18:14:45+01:00
title        : Microsoft Word - conception de programmes _ l'approche impérative
source       : /content/prog_imperative_repdf.pdf
total_pages  : 124
page         : 0
page_label   : 1


### PyPDF Directory

Import all PDF documents from directory.

In [14]:
from langchain_community.document_loaders import PyPDFDirectoryLoader

# directory path
loader = PyPDFDirectoryLoader("./data/")

# load documents
docs = loader.load()

# print the number of documents
docs_len = len(docs)
print(docs_len)

# get document from a directory
document = docs[docs_len - 1]

0


IndexError: list index out of range

In [15]:
# print the contents of the document
print(document.page_content[:300])

NameError: name 'document' is not defined

In [16]:
print(document.metadata)

NameError: name 'document' is not defined

## PyMuPDF

[PyMuPDF](https://github.com/pymupdf/PyMuPDF) is speed optimized and includes detailed metadata about the PDF and its pages. It returns one document per page.

LangChain's [```PyMuPDFLoader```](
https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.pdf.PyMuPDFLoader.html) integrates with PyMuPDF to parse PDF documents into LangChain Document objects.

In [17]:
from langchain_community.document_loaders import PyMuPDFLoader

# create an instance of the PyMuPDF loader
loader = PyMuPDFLoader(FILE_PATH)

# load the document
docs = loader.load()

# print the contents of the document
print(docs[10].page_content[:300])

11
12/02/22 09:03 © 2008~2012 par J. Feat 
 
 
 
① REPRÉSENTATIONS SYMBOLIQUES 
L'essentiel de l'activité de programmation avec un langage de haut niveau consiste à manipuler des 
symboles, non pas en tant que noms, mais pour ce qu'ils représentent : leur valeur. 
 
Et pour être utilisables, les sym


In [18]:
show_metadata(docs)

[metadata]
['producer', 'creator', 'creationdate', 'source', 'file_path', 'total_pages', 'format', 'title', 'author', 'subject', 'keywords', 'moddate', 'trapped', 'modDate', 'creationDate', 'page']

[examples]
producer     : Microsoft: Print To PDF
creator      : 
creationdate : 2025-05-09T18:14:45+01:00
source       : /content/prog_imperative_repdf.pdf
file_path    : /content/prog_imperative_repdf.pdf
total_pages  : 124
format       : PDF 1.7
title        : Microsoft Word - conception de programmes _ l'approche impérative
author       : 
subject      : 
keywords     : 
moddate      : 2025-05-09T18:14:45+01:00
trapped      : 
modDate      : D:20250509181445+01'00'
creationDate : D:20250509181445+01'00'
page         : 0


## Unstructured

[Unstructured](https://docs.unstructured.io/welcome) is a powerful library designed to handle various unstructured and semi-structured document formats. It excels at automatically identifying and categorizing different components within documents.
Currently supports loading text files, PowerPoints, HTML, PDFs, images, and more.

LangChain's [```UnstructuredPDFLoader```](
https://python.langchain.com/api_reference/unstructured/document_loaders/langchain_unstructured.document_loaders.UnstructuredLoader.html) integrates with Unstructured to parse PDF documents into LangChain Document objects.


In [19]:
from langchain_community.document_loaders import UnstructuredPDFLoader

# create an instance of UnstructuredPDFLoader
loader = UnstructuredPDFLoader(FILE_PATH)

# load the data
docs = loader.load()

# print the contents of the document
print(docs[0].page_content[:300])

conception de programmes l'approche impérative

UNIVERSITÉ PARIS 8

SYNTHÈSE DE COURS EDF2LNia 2012

2

conception de programmes l'approche impérative

Jym Feat

5° édition, février 2012

Université Paris 8 département informatique UFR MITSIC institut d'enseignement à distance

© 2008~2012 J. Feat -


In [20]:
show_metadata(docs)

[metadata]
['source']

[examples]
source : /content/prog_imperative_repdf.pdf


Internally, unstructured creates different "**elements**" for each chunk of text. By default, these are combined, but can be easily separated by specifying ```mode="elements"```.

In [21]:
# Create an instance of UnstructuredPDFLoader (mode="elements”)
loader = UnstructuredPDFLoader(FILE_PATH, mode="elements")

# load the data
docs = loader.load()

# print the contents of the document
print(docs[0].page_content)

conception de programmes l'approche impérative


In [27]:
print(docs[55].page_content)

0.0 code source ...........................................................................................6 0.1 exemple de programme c .......................................................................6 0.2 compilation du code source ....................................................................7 0.3 rédaction, compilation, exécution ...........................................................7 0.4 exécution du programme ........................................................................8 0.5 langage formel ..................................................................................... 10


See the full set of element types for this particular article.

In [28]:
set(doc.metadata["category"] for doc in docs) # extract data categories

{'Footer', 'Header', 'ListItem', 'NarrativeText', 'Title', 'UncategorizedText'}

In [29]:
show_metadata(docs)

[metadata]
['source', 'coordinates', 'file_directory', 'filename', 'languages', 'last_modified', 'page_number', 'filetype', 'category', 'element_id']

[examples]
source         : /content/prog_imperative_repdf.pdf
coordinates    : {'points': ((150.113570495055, 220.65261745062128), (150.113570495055, 286.7727753724091), (454.18350242439965, 286.7727753724091), (454.18350242439965, 220.65261745062128)), 'system': 'PixelSpace', 'layout_width': 595.32001, 'layout_height': 841.92004}
file_directory : /content
filename       : prog_imperative_repdf.pdf
languages      : ['eng']
last_modified  : 2025-05-09T21:34:41
page_number    : 1
filetype       : application/pdf
category       : Title
element_id     : 5692ed2757ccfb04bf34cd20b8ad5c3a


## PyPDFium2

LangChain's [```PyPDFium2Loader```](
https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.pdf.PyPDFium2Loader.html) integrates with [PyPDFium2](https://github.com/pypdfium2-team/pypdfium2) to parse PDF documents into LangChain Document objects.

In [30]:
from langchain_community.document_loaders import PyPDFium2Loader

# create an instance of the PyPDFium2 loader
loader = PyPDFium2Loader(FILE_PATH)

# load data
docs = loader.load()

# print the contents of the document
print(docs[10].page_content[:300])



11
12/02/22 09:03 © 2008~2012 par J. Feat
① REPRÉSENTATIONS SYMBOLIQUES 
L'essentiel de l'activité de programmation avec un langage de haut niveau consiste à manipuler des 
symboles, non pas en tant que noms, mais pour ce qu'ils représentent : leur valeur. 
Et pour être utilisables, les symboles doi


**Note**: When using ```PyPDFium2Loader```, you may notice warning messages related to ```get_text_range()```. These warnings are part of the library's internal operations and do not affect the PDF processing
functionality. You can safely proceed with the tutorial despite these warnings, as they are
a normal part of the development environment and do not impact the learning objectives.

In [31]:
show_metadata(docs)

[metadata]
['producer', 'creator', 'creationdate', 'title', 'author', 'subject', 'keywords', 'moddate', 'source', 'total_pages', 'page']

[examples]
producer     : Microsoft: Print To PDF
creator      : 
creationdate : 2025-05-09T18:14:45+01:00
title        : Microsoft Word - conception de programmes _ l'approche impérative
author       : 
subject      : 
keywords     : 
moddate      : 2025-05-09T18:14:45+01:00
source       : /content/prog_imperative_repdf.pdf
total_pages  : 124
page         : 0


## PDFMiner
[PDFMiner](https://github.com/pdfminer/pdfminer.six) is a specialized Python library focused on text extraction and layout analysis from PDF documents.

LangChain's [```PDFMinerLoader```](
https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.pdf.PDFMinerLoader.html) integrates with PDFMiner to parse PDF documents into LangChain Document objects.


In [32]:
from langchain_community.document_loaders import PDFMinerLoader

# Create a PDFMiner loader instance
loader = PDFMinerLoader(FILE_PATH)

# load data
docs = loader.load()

# print the contents of the document
print(docs[0].page_content[:300])

conception de programmes 
l'approche impérative 

UNIVERSITÉ PARIS 8 

SYNTHÈSE DE COURS 
EDF2LNia 
2012
2 

APPROCHE IMPÉRATIVE 

conception de programmes 
l'approche impérative 

Jym Feat 

5° édition, février 2012 

Université Paris 8 
département informatique 
UFR MITSIC 
institut d'enseignemen


In [33]:
show_metadata(docs)

[metadata]
['producer', 'creator', 'creationdate', 'author', 'moddate', 'title', 'total_pages', 'source']

[examples]
producer     : Microsoft: Print To PDF
creator      : PDFMiner
creationdate : 2025-05-09T18:14:45+01:00
author       : 
moddate      : 2025-05-09T18:14:45+01:00
title        : Microsoft Word - conception de programmes _ l'approche impérative
total_pages  : 124
source       : /content/prog_imperative_repdf.pdf


### Using PDFMiner to generate HTML text

This method allows you to parse the output HTML content through [```BeautifulSoup```](https://www.crummy.com/software/BeautifulSoup/) to get more structured and richer information about font size, page numbers, PDF header/footer, etc. which can help you semantically split the text into sections.

In [34]:
from langchain_community.document_loaders import PDFMinerPDFasHTMLLoader

# create an instance of PDFMinerPDFasHTMLLoader
loader = PDFMinerPDFasHTMLLoader(FILE_PATH)

# load the document
docs = loader.load()

# print the contents of the document
print(docs[0].page_content[:300])
# Just for the

<html><head>
<meta http-equiv="Content-Type" content="text/html">
</head><body>
<span style="position:absolute; border: gray 1px solid; left:0px; top:50px; width:595px; height:841px;"></span>
<div style="position:absolute; top:50px;"><a name="1">Page 1</a></div>
<div style="position:absolute; border


AttributeError: 'Document' object has no attribute 'text'

In [35]:
show_metadata(docs)

[metadata]
['source']

[examples]
source : /content/prog_imperative_repdf.pdf


In [36]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(docs[0].page_content, "html.parser") # initialize HTML parser
content = soup.find_all("div") # search for all div tags

In [88]:
# prompt: i'd like to find the super repetitive headers/footers that tend to... pop up in every page

from collections import Counter

def find_repetitive_headers_footers(filepath):
    """
    Finds repetitive headers and footers in a text file.

    Args:
        filepath: Path to the text file.

    Returns:
        A Counter object with the most frequent lines and their counts.
    """

    try:
        with open(filepath, 'r', encoding='utf-8') as file:  # Specify UTF-8 encoding
            lines = file.readlines()
    except FileNotFoundError:
        print(f"Error: File '{filepath}' not found.")
        return None

    line_counts = Counter(line.strip() for line in lines) # Count line occurrences
    return line_counts.most_common(20) # Return the 20 most frequent lines

# Example usage
filepath = "/content/your_file.txt"  # Replace with the actual path
repetitive_lines = find_repetitive_headers_footers(filepath)

if repetitive_lines:
    print("Most frequent lines (potential headers/footers):")
    for line, count in repetitive_lines:
      print(f"'{line}' : {count} times")


[]

In [37]:
import re

cur_fs = None
cur_text = ""
snippets = []  # collect all snippets of the same font size
for c in content:
    sp = c.find("span")
    if not sp:
        continue
    st = sp.get("style")
    if not st:
        continue
    fs = re.findall("font-size:(\d+)px", st)
    if not fs:
        continue
    fs = int(fs[0])
    if not cur_fs:
        cur_fs = fs
    if fs == cur_fs:
        cur_text += c.text
    else:
        snippets.append((cur_text, cur_fs))
        cur_fs = fs
        cur_text = c.text
snippets.append((cur_text, cur_fs))
# Note: Possibility to add a strategy for removing duplicate snippets (since the header/footer of a PDF appears across multiple pages, it can be considered duplicate information when found)

In [65]:
[(x[0], i) for i,x in enumerate(snippets) if x[1] == 11]

[('0.0 CODE SOURCE \n', 21),
 ('0.1 EXEMPLE DE PROGRAMME C \n', 24),
 ('0.2 COMPILATION DU CODE SOURCE \n', 29),
 ('0.3 RÉDACTION, COMPILATION, EXÉCUTION \n', 33),
 ('0.4 EXÉCUTION DU PROGRAMME \n', 38),
 ('0.5 LANGAGE FORMEL \n', 74),
 ('1.1 DÉCLARATION : TYPE NOM \n', 86),
 ('1.2 DÉCLARATION + DÉFINITION : TYPE NOM = VALEUR \n', 94),
 ('1.3 REDÉFINITION : NOM = VALEUR \n', 98),
 ('1.4 VARIABLE DÉNOTANT UNE SÉQUENCE \n', 103),
 ('1.5 CONSTANTE ET VALEUR LITTÉRALE \n', 115),
 ('1.6 VARIABLE GLOBALE \n', 121),
 ('1.7 VARIABLE LOCALE \n', 128),
 ('1.8 WHILE : ITÉRATION CONDITIONNELLE \n', 134),
 ("2.1 ANATOMIE D'UN PROGRAMME C STANDARD UNIX \n", 165),
 ('2.2 ACCÈS INDEXÉ\n', 181),
 ('2.3 ACCÈS PAR RÉFÉRENCE : INDIRECTION \n', 202),
 ('2.4 SÉQUENCES : REPRÉSENTATION INTERNE \n', 222),
 ('2.5 SÉQUENCES NUMÉRIQUES \n', 253),
 ('2.6 DÉFINITION DE FONCTIONS \n', 264),
 ('2.7 PARAMÈTRES DE LA FONCTION MAIN \n', 284),
 ('3.1 CARACTÈRES \n', 310),
 ('3.2 CHIFFRES ET NOMBRES \n', 317),
 ("3.3 SCA

In [87]:
# snippets[20] # Fits the first sub section 0.0 CODE SOURCE
snippets[1011]

("$ ajouter du code pour initialiser le module, c'est appeler la fonction C Py_InitModule() en lui passant le \n",
 11)

In [38]:
from langchain_core.documents import Document

cur_idx = -1
semantic_snippets = []
# Assumption: headings have higher font size than their respective content
for s in snippets:
    # if current snippet's font size > previous section's heading => it is a new heading
    if (
        not semantic_snippets
        or s[1] > semantic_snippets[cur_idx].metadata["heading_font"]
    ):
        metadata = {"heading": s[0], "content_font": 0, "heading_font": s[1]}
        metadata.update(docs[0].metadata)
        semantic_snippets.append(Document(page_content="", metadata=metadata))
        cur_idx += 1
        continue

    # if current snippet's font size <= previous section's content => content belongs to the same section (one can also create
    if (
        not semantic_snippets[cur_idx].metadata["content_font"]
        or s[1] <= semantic_snippets[cur_idx].metadata["content_font"]
    ):
        semantic_snippets[cur_idx].page_content += s[0]
        semantic_snippets[cur_idx].metadata["content_font"] = max(
            s[1], semantic_snippets[cur_idx].metadata["content_font"]
        )
        continue

    # if current snippet's font size > previous section's content but less than previous section's heading than also make a new
    metadata = {"heading": s[0], "content_font": 0, "heading_font": s[1]}
    metadata.update(docs[0].metadata)
    semantic_snippets.append(Document(page_content="", metadata=metadata))
    cur_idx += 1

print(semantic_snippets[4])

page_content='APPROCHE IMPÉRATIVE 
' metadata={'heading': '0.0 CODE SOURCE \n', 'content_font': 8, 'heading_font': 11, 'source': '/content/prog_imperative_repdf.pdf'}


In [52]:
#print(semantic_snippets[20].page_content)


TypeError: object of type 'Document' has no len()

In [59]:
# prompt: i wanna count the number of tokens for a certain string (very important for the embedding model later)

from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def count_tokens(text):
  return len(tokenizer.encode(text))

# Example usage
max
for doc in semantic_snippets:
  if count_tokens(doc.page_content) > max:
    max = count_tokens(doc.page_content)
    max_doc = doc
print(max)
print(max_doc.page_content)

top10_length = []
for doc in semantic_snippets:
  top10_length.append(count_tokens(doc.page_content))
top10_length.sort(reverse=True)
print(top10_length[:10])

# Apply to the document
# docs_token_count = [count_tokens(doc.page_content) for doc in docs]
# print(f"The docs have {docs_token_count} tokens each.")

# total_tokens = sum(docs_token_count)
# print(f"Total tokens in all documents: {total_tokens}")


Token indices sequence length is longer than the specified maximum sequence length for this model (2373 > 1024). Running this sequence through the model will result in indexing errors


8802
Dans l'expérience précédente, l'important n'était pas tant de réaliser le programme idéal que de 
comprendre comment deux cultures peuvent privilégier des approches différentes. 
En plus modeste, voici l'exemple d'un petit programme elisp (i.e. EMACS LISP) qui convertit en binaire un 
nombre en base 10. Sauriez-vous le transcoder en CLISP, en PYTHON et en C ? 
12/02/22 09:03 © 2008~2012 par J. Feat
84 
APPROCHE IMPÉRATIVE 
; conversion base 10 -> base 2 
; elisp 
(defun binaire (int &optional bin) 
(cond 
((zerop int) (apply 'concat bin)) 
((zerop (% int 2)) (binaire (/ int 2) (cons "0" bin))) 
((binaire (/ (1- int) 2) (cons "1" bin))) ) ) 
[px20.0][cx20.1]  coder la conversion décimal ! binaire, d'abord en CLISP, en PYTHON, puis en C, où le 
programme devra faire la somme de 2 nombres binaires passés sur la LdC. 
On avait, quelques pages plus haut, évoqué l'algorithme du QUICK SORT et sa mise œuvre avec 
l'approche fonctionnelle, comme celle de PYTHON ou de LISP. Maintenant, comm

## PDFPlumber
[PDFPlumber](https://github.com/jsvine/pdfplumber) is a PDF parsing library that excels at extracting text and tables from PDFs.

LangChain's [```PDFPlumberLoader```](
https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.pdf.PDFPlumberLoader.html) integrates with PDFPlumber to parse PDF documents into LangChain Document objects.

Like PyMuPDF, the output document contains detailed metadata about the PDF and its pages, and returns one document per page.

In [62]:
from langchain_community.document_loaders import PDFPlumberLoader

# create a PDF document loader instance
loader = PDFPlumberLoader(FILE_PATH)

# load the document
docs = loader.load()

# access the first document data
print(docs[10].page_content[:300])

ImportError: pdfplumber package not found, please install it with `pip install pdfplumber`

In [None]:
show_metadata(docs)

[metadata]
['source', 'file_path', 'page', 'total_pages', 'Author', 'CreationDate', 'Creator', 'Keywords', 'ModDate', 'PTEX.Fullbanner', 'Producer', 'Subject', 'Title', 'Trapped']

[examples]
source          : ./data/layout-parser-paper.pdf
file_path       : ./data/layout-parser-paper.pdf
page            : 0
total_pages     : 16
Author          : 
CreationDate    : D:20210622012710Z
Creator         : LaTeX with hyperref
Keywords        : 
ModDate         : D:20210622012710Z
PTEX.Fullbanner : This is pdfTeX, Version 3.14159265-2.6-1.40.21 (TeX Live 2020) kpathsea version 6.3.2
Producer        : pdfTeX-1.40.21
Subject         : 
Title           : 
Trapped         : False
