# PDF

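[Portable Document Format (PDF)](https://en.wikipedia.org/wiki/PDF), standardized as ISO 32000, is a file format developed by Adobe in 1992 to present documents, including text formatting and images, in a way that is independent of application software, hardware, and operating systems.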

This guide covers how to load `PDF` documents into the LangChain [Document](https://api.python.langchain.com/en/latest/documents/langchain_core.documents.base.Document.html#langchain_core.documents.base.Document) format, which is what downstream components consume.

LangChain integrates with a variety of PDF parsers. Some are simple and relatively low-level; others support OCR and image processing, or perform advanced document layout analysis.

The right choice depends on your application.

**Reference**

- [LangChain documentation](https://python.langchain.com/v0.1/docs/modules/data_connection/document_loaders/pdf/)

## PDF Experiments by the AutoRAG Team

The ranking table below is based on experiments conducted by AutoRAG.

Each number is a rank (lower is better).

| | PDFMiner | PDFPlumber | PyPDFium2 | PyMuPDF | PyPDF2 |
|----------|:---------:|:----------:|:---------:|:-------:|:-----:|
| Medical  | 1         | 2          | 3         | 4       | 5     |
| Law      | 3         | 1          | 1         | 3       | 5     |
| Finance  | 1         | 2          | 2         | 4       | 5     |
| Public   | 1         | 1          | 1         | 4       | 5     |
| Sum      | 5         | 5          | 7         | 15      | 20    |

Source: [AutoRAG blog (velog)](https://velog.io/@autorag/PDF-%ED%95%9C%EA%B8%80-%ED%85%8D%EC%8A%A4%ED%8A%B8-%EC%B6%94%EC%B6%9C-%EC%8B%A4%ED%97%98#%EC%B4%9D%ED%8F%89)

In [32]:
# Manage API keys as environment variables via a .env file
from dotenv import load_dotenv

# Load the API key information
load_dotenv()

True

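## Document Used in This Tutorial

Software Policy & Research Institute (SPRi) - December 2023 issue

- Authors: 유재흥 (Senior Researcher, AI Policy Research Lab), 이지수 (Commissioned Researcher, AI Policy Research Lab)
- Link: https://spri.kr/posts/view/23669
- File name: `SPRI_AI_Brief_2023년12월호_F.pdf`

**Note**: Download the file above into the `data` folder.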

In [33]:
# Path to the PDF file
# FILE_PATH = "./data/SPRI_AI_Brief_2023년12월호_F.pdf"
FILE_PATH = "./data/2103_page1.pdf"


In [34]:
def show_metadata(docs):
    """Print the metadata keys and values of the first document."""
    if docs:
        print("[metadata]")
        print(list(docs[0].metadata.keys()))
        print("\n[examples]")
        max_key_length = max(len(k) for k in docs[0].metadata.keys())
        for k, v in docs[0].metadata.items():
            print(f"{k:<{max_key_length}} : {v}")

## PyPDF

Here we use `pypdf` to load the PDF into an array of documents, where each document holds one page's content plus metadata that includes the `page` number.

In [35]:
# Install
# !pip install -qU pypdf

In [36]:
# Path to the PDF file
# FILE_PATH = "./data/SPRI_AI_Brief_2023년12월호_F.pdf"
FILE_PATH = "./data/2103_page1.pdf"


In [37]:
from langchain_community.document_loaders import PyPDFLoader

# Initialize the PDF loader with the file path
loader = PyPDFLoader(FILE_PATH)

# Load the PDF pages
docs = loader.load()

# Print the document content
# print(docs[10].page_content[:300])  # first 300 characters of the 11th page (index 10)
print(docs[0].page_content[:])

LayoutParser: A Unified Toolkit for Deep 
Learning Based Documen t Image Analysis 
IZOZ 
unr 
It Zejiang Shen1 詞),Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain 
Lee4, Jacob Carlson3, and Weining Li5 
1 Allen Institute for AI 
shannons©allena i.org 
2 Brown University 
ruochen_zhan g©brown.edu 
3 Harvard University 
{melissadell, jacob_carlson}© fas.harvard.edu 
4 University of Washington 
bcgl©cs.wash ington.edu 
5 University of Waterloo 
w422 li©u멀aterloo.ca 
[AU
·s~] 
현
/oo
寸
ESI•EOINA!X1B Abstract . Recent advances in document image analysis (DIA) have been 
primarily driven by the application of neural networks. Ideally, research 
outcomes could be easily deployed in production and extended for further 
investigation. However, various factors like loosely organized codebases 
and sophisticated model configurations complicate the easy reuse of im­
portant innovations by a wide audience. Though there have been on-going 
efforts to improve reusability and simplify deep learn

In [38]:
# Print the metadata
show_metadata(docs)

# The metadata format may differ depending on the LangChain version

[metadata]
['producer', 'creator', 'creationdate', 'author', 'moddate', 'title', 'source', 'total_pages', 'page', 'page_label']

[examples]
producer     : Adobe Acrobat (64-bit) 25 Paper Capture Plug-in
creator      : PyPDF
creationdate : 2025-08-22T12:38:36+09:00
author       : Heejin Park
moddate      : 2025-08-22T12:47:11+09:00
title        : C:\Users\park0\Downloads\2103.15348v2.pdf
source       : ./data/2103_page1.pdf
total_pages  : 1
page         : 0
page_label   : 1
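
In the metadata above, `page` is a zero-based index (`0`) while `page_label` carries the label shown to readers (`1`). The following is a minimal sketch of looking up a page by its 1-based number. It assumes the `page` key is present and zero-based as in this output; `Doc` and `find_page` are hypothetical stand-ins (not LangChain APIs) used so the snippet runs without a PDF.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for a LangChain Document (hypothetical, for illustration only)."""

    page_content: str
    metadata: dict = field(default_factory=dict)


def find_page(docs, page_number):
    """Return the document for a 1-based page number, or None if absent."""
    for doc in docs:
        # `page` is zero-based in the metadata shown above
        if doc.metadata.get("page") == page_number - 1:
            return doc
    return None


docs = [
    Doc("first page text", {"page": 0, "page_label": "1"}),
    Doc("second page text", {"page": 1, "page_label": "2"}),
]
second = find_page(docs, 2)
print(second.page_content)
```

Matching on `page` rather than `page_label` is a deliberate choice here, since page labels in a PDF are not guaranteed to be plain numbers.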


### PyPDF (OCR)

Some PDFs contain text only as images, for example scanned documents or figures. With the `rapidocr-onnxruntime` package installed, text can also be extracted from those images.
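
Whether OCR is worth the extra processing depends on how much text plain extraction already returns. As a rough sketch, a page with almost no extractable text is probably image-only. The helper name `looks_image_only` and the `min_chars` threshold are assumptions for illustration, not part of LangChain or `pypdf`.

```python
def looks_image_only(page_text, min_chars=20):
    """Heuristic: a page with almost no extractable text is probably a scan."""
    return len(page_text.strip()) < min_chars


# Hypothetical page texts: two empty-looking pages and one with real text
pages = [
    "",
    "   \n",
    "Recent advances in document image analysis (DIA) have been",
]
flags = [looks_image_only(text) for text in pages]
print(flags)
```

The threshold would need tuning per corpus; pages that mix a short caption with a large scanned figure would not be caught by this check.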

In [39]:
# Install
# !pip install -qU rapidocr-onnxruntime

In [41]:
# Path to the PDF file
# FILE_PATH = "./data/SPRI_AI_Brief_2023년12월호_F.pdf"
FILE_PATH = "./data/2103_page1.pdf"

In [43]:
# Initialize the PDF loader with image extraction enabled
loader = PyPDFLoader(FILE_PATH, extract_images=True)

# Load the PDF pages
docs = loader.load()

# Access the page content
# print(docs[4].page_content[:300])
print(docs[0].page_content[:])

LayoutParser: A Unified Toolkit for Deep 
Learning Based Documen t Image Analysis 
IZOZ 
unr 
It Zejiang Shen1 詞),Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain 
Lee4, Jacob Carlson3, and Weining Li5 
1 Allen Institute for AI 
shannons©allena i.org 
2 Brown University 
ruochen_zhan g©brown.edu 
3 Harvard University 
{melissadell, jacob_carlson}© fas.harvard.edu 
4 University of Washington 
bcgl©cs.wash ington.edu 
5 University of Waterloo 
w422 li©u멀aterloo.ca 
[AU
·s~] 
현
/oo
寸
ESI•EOINA!X1B Abstract . Recent advances in document image analysis (DIA) have been 
primarily driven by the application of neural networks. Ideally, research 
outcomes could be easily deployed in production and extended for further 
investigation. However, various factors like loosely organized codebases 
and sophisticated model configurations complicate the easy reuse of im­
portant innovations by a wide audience. Though there have been on-going 
efforts to improve reusability and simplify deep learn

In [44]:
show_metadata(docs)

# The metadata format may differ depending on the LangChain version

[metadata]
['producer', 'creator', 'creationdate', 'author', 'moddate', 'title', 'source', 'total_pages', 'page', 'page_label']

[examples]
producer     : Adobe Acrobat (64-bit) 25 Paper Capture Plug-in
creator      : PyPDF
creationdate : 2025-08-22T12:38:36+09:00
author       : Heejin Park
moddate      : 2025-08-22T12:47:11+09:00
title        : C:\Users\park0\Downloads\2103.15348v2.pdf
source       : ./data/2103_page1.pdf
total_pages  : 1
page         : 0
page_label   : 1


## PyMuPDF

**PyMuPDF** is optimized for speed and includes detailed metadata about the PDF and its pages. It returns one document per page:

In [45]:
# Install
# !pip install -qU pymupdf

In [48]:
# Path to the PDF file
# FILE_PATH = "./data/SPRI_AI_Brief_2023년12월호_F.pdf"
FILE_PATH = "./data/2103_page1.pdf"

In [49]:
from langchain_community.document_loaders import PyMuPDFLoader

# Create a PyMuPDF loader instance
loader = PyMuPDFLoader(FILE_PATH)

# Load the document
docs = loader.load()

# Print the document content
print(docs[0].page_content[:])

Layout Parser: A Unified Toolkit for Deep 
Learning Based Document Image Analysis 
IZOZ 
unr 
It 
Zejiang Shen1 詞), Ruochen Zhang2, Melissa Dell3, Benj amin Charles Germain 
Lee4, Jacob Carlson3, and Weining Li 5 
1 Allen Institut e for AI 
shannons©allenai .org 
2 Brown University 
ruochen_zhang©brown.edu 
3 Harvard University 
{meli ssadell, j acob_carlson}©f as.harvard.edu 
4 University of Washington 
bcgl©cs.washi ngton.edu 
5 University of Waterloo 
w422li©u멀at erloo.ca 
[AU
·s~] 
현
/
o
o寸
E
S
I
•
E
O
I
N
A
!
X
1
B
 
Abst ract . Recent advances in document image analysis (DIA) have been 
primarily driven by t he application of neural networks. Ideally, research 
outcomes could be easily deployed in production and extended for further 
investigation. However, various fact ors like loosely organized codebases 
and sophisticated model configurations complicat e t he easy reuse of im-
portant innovations by a wide audience. Though there have been on-going 
effort s t o improve reusabi

In [50]:
show_metadata(docs)

[metadata]
['producer', 'creator', 'creationdate', 'source', 'file_path', 'total_pages', 'format', 'title', 'author', 'subject', 'keywords', 'moddate', 'trapped', 'modDate', 'creationDate', 'page']

[examples]
producer     : Adobe Acrobat (64-bit) 25 Paper Capture Plug-in
creator      : 
creationdate : 2025-08-22T12:38:36+09:00
source       : ./data/2103_page1.pdf
file_path    : ./data/2103_page1.pdf
total_pages  : 1
format       : PDF 1.7
title        : C:\Users\park0\Downloads\2103.15348v2.pdf
author       : Heejin Park
subject      : 
keywords     : 
moddate      : 2025-08-22T12:47:11+09:00
trapped      : 
modDate      : D:20250822124711+09'00'
creationDate : D:20250822123836+09'00'
page         : 0
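
The metadata above carries each timestamp twice: once in ISO 8601 form (`moddate`) and once as a raw PDF date string (`modDate`, in the `D:YYYYMMDDHHmmSS+HH'mm'` form). A minimal sketch of converting the raw form follows. `parse_pdf_date` is a hypothetical helper, and it only handles the full form with an explicit UTC offset as seen in this output; shorter variants that the PDF format allows are deliberately rejected.

```python
import re
from datetime import datetime, timedelta, timezone

# Full PDF date form with an explicit offset, e.g. D:20250822124711+09'00'
PDF_DATE = re.compile(
    r"D:(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})([+-])(\d{2})'(\d{2})'?"
)


def parse_pdf_date(value):
    """Parse a full PDF date string into a timezone-aware datetime."""
    match = PDF_DATE.fullmatch(value)
    if match is None:
        raise ValueError(f"unsupported PDF date: {value!r}")
    year, month, day, hour, minute, second = (int(g) for g in match.groups()[:6])
    sign = 1 if match.group(7) == "+" else -1
    offset = sign * timedelta(hours=int(match.group(8)), minutes=int(match.group(9)))
    return datetime(year, month, day, hour, minute, second, tzinfo=timezone(offset))


print(parse_pdf_date("D:20250822124711+09'00'").isoformat())
```

For the `modDate` value shown above, the parsed result corresponds to the `moddate` entry in the same output.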


## Unstructured

[Unstructured](https://unstructured-io.github.io/unstructured/) provides a common interface for working with unstructured or semi-structured file formats such as Markdown and PDF.

LangChain's [UnstructuredPDFLoader](https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.pdf.UnstructuredPDFLoader.html) integrates with Unstructured to parse PDF documents into LangChain [Document](https://api.python.langchain.com/en/latest/documents/langchain_core.documents.base.Document.html) objects.

In [51]:
# Install
# !pip install -qU unstructured

In [52]:
# Install packages to work around a pdfminer compatibility issue
import subprocess
import sys


def install_compatible_packages():
    """Install mutually compatible package versions."""
    packages = [
        "pdfminer.six==20220319",  # older, compatible version
        "unstructured[pdf]==0.10.30",  # compatible unstructured version
    ]

    for package in packages:
        try:
            subprocess.check_call([sys.executable, "-m", "pip", "install", package])
            print(f"✓ {package} installed")
        except subprocess.CalledProcessError as e:
            print(f"✗ {package} installation failed: {e}")


# Run the compatible-package installation
install_compatible_packages()
print("\nPackage installation finished. Restart the kernel and try again.")

✓ pdfminer.six==20220319 installed
✓ unstructured[pdf]==0.10.30 installed

Package installation finished. Restart the kernel and try again.


In [53]:
# Path to the PDF file
# FILE_PATH = "./data/SPRI_AI_Brief_2023년12월호_F.pdf"
FILE_PATH = "./data/2103_page1.pdf"

In [54]:
from langchain_community.document_loaders import UnstructuredPDFLoader

# Create an UnstructuredPDFLoader instance
loader = UnstructuredPDFLoader(FILE_PATH)

# Load the data
docs = loader.load()

# Print the document content
print(docs[0].page_content[:])

IZOZ

unr

It

[AU

s~]

현 / o 寸 E o S I • E O I N A ! X 1 B

Layout Parser: A Unified Toolkit for Deep Learni n g Based Document Image Analysi s

Zejiang Shen1 詞), Ruochen Zhang2, Melissa Dell3, Benj ami n Charles Germai n Lee4, Jacob Carlson3, and Wei n i n g Li 5

1 Allen Institut e for AI shannons©allenai .org 2 Brown University ruochen_zhang©brown.edu 3 Harvard University {meli ssadell, j acob_carlson}©f as.harvard.edu 4 University of Washi ngton bcgl©cs.washi ngton.edu 5 University of Wat erloo

w422 li ©u멀at erloo.ca

Abst ract . Recent advances in document image analysis (DIA) have been primarily driven by t he application of neural net works. Ideally, research out comes could be easily deployed i n production and ext ended for furt her investigation. However, various fact ors like loosely organized codebases and sophisticat ed model configurations complicat e t he easy reuse of im port ant innovations by a wide audience. Though t here have been on-goi ng effort s t o improve reusability and simplify deep learni ng (DL) model development in disciplines like nat ural language processing and comput er vision, none of t hem are optimized for challenges i n t he domai n of DIA. This represent s a maj or gap i n t he existing t oolkit, as DIA is cent ral t o academic research across a wide range of disciplines i n t he social sciences and humanities. This paper i nt roduces Layout Parser, an open-source library for st reamli ni ng t he usage of DL i n DIA research and applica tions. The core Layout Parser library comes with a set of simple and int uitive int erfaces for applying and cust omizing DL models for layout de t ection, charact er recognition, and many ot her document processing t asks. To promot e ext ensi bility, Layout Parser also i ncorporat es a community platform for shari ng bot h pre-t rained models and full document digiti zation pipelines. We demonst rat e t hat Layout Parser is helpful for bot h light weight and large-scale digitization pipelines i n real-word use cases. The library is publicly available at https : //layout-parser . github . i o.

Key words: Document Image Analysis • Deep Learni ng • Layout Analysis • Charact er Recognition • Open Source library • Toolkit.

1

Int roduction

Deep Learni ng(DL)-based approaches are t he st at e-of-the-art for a wide range of document image analysis (DIA) t asks i ncludi ng document i mage classification [ 11,

In [55]:
show_metadata(docs)

[metadata]
['source']

[examples]
source : ./data/2103_page1.pdf


Under the hood, Unstructured creates a separate "**element**" for each chunk of text. By default these are combined into one document, but you can keep them separate by specifying `mode="elements"`.

In [56]:
# Create an UnstructuredPDFLoader instance (mode="elements")
loader = UnstructuredPDFLoader(FILE_PATH, mode="elements")

# Load the data
docs = loader.load()

# Print the content of the first element
print(docs[0].page_content)

IZOZ


The full set of element types found in this particular document can be listed as follows.

In [57]:
set(doc.metadata["category"] for doc in docs)  # collect the element categories

{'ListItem', 'NarrativeText', 'Title', 'UncategorizedText'}
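
Once elements are separate, the `category` metadata can be used to count or filter them. The sketch below uses `SimpleNamespace` objects with made-up sample text as stand-ins for the Documents returned with `mode="elements"`, so it runs without Unstructured installed; only the `category` key is taken from the output above.

```python
from collections import Counter, defaultdict
from types import SimpleNamespace

# Stand-ins for element Documents (hypothetical sample data)
docs = [
    SimpleNamespace(page_content="Layout Parser", metadata={"category": "Title"}),
    SimpleNamespace(page_content="Recent advances in DIA", metadata={"category": "NarrativeText"}),
    SimpleNamespace(page_content="Introduction", metadata={"category": "Title"}),
    SimpleNamespace(page_content="1", metadata={"category": "UncategorizedText"}),
]

# Count how many elements fall into each category
counts = Counter(doc.metadata["category"] for doc in docs)

# Group the element texts by category, preserving document order
by_category = defaultdict(list)
for doc in docs:
    by_category[doc.metadata["category"]].append(doc.page_content)

print(counts)
print(by_category["Title"])
```

Filtering on categories such as `Title` or `NarrativeText` is one way to drop stray fragments before chunking, though how reliable the labels are depends on the document.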

In [58]:
show_metadata(docs)

[metadata]
['source', 'coordinates', 'filename', 'file_directory', 'last_modified', 'filetype', 'page_number', 'category', 'element_id']

[examples]
source         : ./data/2103_page1.pdf
coordinates    : {'points': ((25.44, 207.87349999999992), (25.44, 218.37349999999992), (62.67233920000001, 218.37349999999992), (62.67233920000001, 207.87349999999992)), 'system': 'PixelSpace', 'layout_width': 595.32, 'layout_height': 841.92}
filename       : 2103_page1.pdf
file_directory : ./data
last_modified  : 2025-08-22T13:08:42
filetype       : application/pdf
page_number    : 1
category       : Title
element_id     : d480e4bbfe8892846ed91c12a6c4fe67
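
The `coordinates` entry stores the four corner points of the element's box together with the page layout size. A minimal sketch of turning those points into a bounding box, using the values from the output above, is shown here. `bounding_box` is a hypothetical helper, and it assumes the box is axis-aligned.

```python
def bounding_box(coordinates):
    """Return (left, top, width, height) from an element's `coordinates` metadata."""
    xs = [x for x, _ in coordinates["points"]]
    ys = [y for _, y in coordinates["points"]]
    left, top = min(xs), min(ys)
    return left, top, max(xs) - left, max(ys) - top


# Values copied from the metadata output above
coords = {
    "points": (
        (25.44, 207.87349999999992),
        (25.44, 218.37349999999992),
        (62.67233920000001, 218.37349999999992),
        (62.67233920000001, 207.87349999999992),
    ),
    "system": "PixelSpace",
    "layout_width": 595.32,
    "layout_height": 841.92,
}

left, top, width, height = bounding_box(coords)
# Fraction of the page width that the element spans
width_ratio = width / coords["layout_width"]
print(round(width, 2), round(height, 2), round(width_ratio, 3))
```

Normalizing by `layout_width` and `layout_height` makes boxes comparable across pages of different sizes.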


## PyPDFium2

In [60]:
from langchain_community.document_loaders import PyPDFium2Loader

# Create a PyPDFium2 loader instance
loader = PyPDFium2Loader(FILE_PATH)

# Load the data
docs = loader.load()

# Print the document content
print(docs[0].page_content[:])

Layout Parser: A Unified Toolkit for Deep 
Learning Based Document Image Analysis 
IZOZ 
unr 
I
t 
Zejiang Shen1 詞), Ruochen Zhang2, Melissa Dell3, Benj amin Charles Germain 
Lee4, Jacob Carlson3, and Wei ning Li 5 
1 Allen Institut e for AI 
shannons©allenai .org 
2 Brown University 
ruochen_zhang©brown.edu 
3 Harvard University 
{meli ssadell, j acob_carlson}©f as.harvard.edu 
4 University of Washington 
bcgl©cs.washi ngton.edu 
5 University of Wat erloo 
w422li©u erloo.ca 
[AU
·
s~] 
Abst ract . Recent advances in document image analysis (DIA) have been 
primarily driven by t he application of neural networks. Ideally, research 
out comes could be easily deployed in production and ext ended for furt her 
investigation. However, various fact ors like loosely organized codebases 
and sophisticat ed model configurations complicat e t he easy reuse of import ant innovations by a wide audience. Though t here have been on-going 
efforts t o improve reusability and simplify deep learning 

In [61]:
show_metadata(docs)

[metadata]
['producer', 'creator', 'creationdate', 'title', 'author', 'subject', 'keywords', 'moddate', 'source', 'total_pages', 'page']

[examples]
producer     : Adobe Acrobat (64-bit) 25 Paper Capture Plug-in
creator      : 
creationdate : 2025-08-22T12:38:36+09:00
title        : C:\Users\park0\Downloads\2103.15348v2.pdf
author       : Heejin Park
subject      : 
keywords     : 
moddate      : 2025-08-22T12:47:11+09:00
source       : ./data/2103_page1.pdf
total_pages  : 1
page         : 0


## PDFMiner

In [62]:
from langchain_community.document_loaders import PDFMinerLoader

# Create a PDFMiner loader instance
loader = PDFMinerLoader(FILE_PATH)

# Load the data
docs = loader.load()

# Print the document content
print(docs[0].page_content[:300])

# Note: for Korean documents, words may be separated by two spaces instead of one

Layout Parser:  A  Unified Toolkit for Deep 
Learni n g Based Document  Image  Analysi s 

Zejiang  Shen1  詞), Ruochen  Zhang2,  Melissa Dell3,  Benj ami n  Charles  Germai n 
Lee4,  Jacob  Carlson3,  and Wei n i n g  Li 5 

1  Allen  Institut e  for  AI 
shannons©allenai .org 
2  Brown University 
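
The extracted text above contains doubled spaces between words. A small post-processing step can collapse them while keeping the line breaks. This is a sketch, not part of `PDFMinerLoader`; `collapse_spaces` is a hypothetical helper, and it does not repair letters split inside a word (such as `Learni n g`).

```python
import re


def collapse_spaces(text):
    """Collapse runs of spaces or tabs into one space, keeping line breaks."""
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines())
    return "\n".join(lines)


sample = (
    "Layout Parser:  A  Unified Toolkit for Deep \n"
    "Learni n g Based Document  Image  Analysi s "
)
cleaned = collapse_spaces(sample)
print(cleaned)
```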



In [63]:
show_metadata(docs)

[metadata]
['producer', 'creator', 'creationdate', 'author', 'moddate', 'title', 'total_pages', 'source']

[examples]
producer     : Adobe Acrobat (64-bit) 25 Paper Capture Plug-in
creator      : PDFMiner
creationdate : 2025-08-22T12:38:36+09:00
author       : Heejin Park
moddate      : 2025-08-22T12:47:11+09:00
title        : C:\Users\park0\Downloads\2103.15348v2.pdf
total_pages  : 1
source       : ./data/2103_page1.pdf


Generating HTML text with **PDFMiner**

Parsing the HTML output with `BeautifulSoup` gives more structured and richer information, such as font sizes, page numbers, and PDF headers/footers, which can help split the text semantically into sections.

In [64]:
from langchain_community.document_loaders import PDFMinerPDFasHTMLLoader

# Create a PDFMinerPDFasHTMLLoader instance
loader = PDFMinerPDFasHTMLLoader(FILE_PATH)

# Load the document
docs = loader.load()

# Print the document content
print(docs[0].page_content[:300])

<html><head>
<meta http-equiv="Content-Type" content="text/html">
</head><body>
<span style="position:absolute; border: gray 1px solid; left:0px; top:50px; width:595px; height:841px;"></span>
<div style="position:absolute; top:50px;"><a name="1">Page 1</a></div>
<div style="position:absolute; border


In [65]:
show_metadata(docs)

[metadata]
['source']

[examples]
source : ./data/2103_page1.pdf


In [66]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(docs[0].page_content, "html.parser")  # initialize the HTML parser
content = soup.find_all("div")  # find all div tags

In [67]:
import re

cur_fs = None
cur_text = ""
snippets = []  # collect runs of consecutive text that share a font size
for c in content:
    sp = c.find("span")
    if not sp:
        continue
    st = sp.get("style")
    if not st:
        continue
    fs = re.findall(r"font-size:(\d+)px", st)
    if not fs:
        continue
    fs = int(fs[0])
    if not cur_fs:
        cur_fs = fs
    if fs == cur_fs:
        cur_text += c.text
    else:
        snippets.append((cur_text, cur_fs))
        cur_fs = fs
        cur_text = c.text
snippets.append((cur_text, cur_fs))
# A deduplication strategy could be added here (PDF headers/footers repeat across pages, so repeated snippets can be treated as redundant)
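
The deduplication idea noted in the last comment above could look roughly like the following. This is one possible strategy, not the tutorial's own method: it keeps the first occurrence of each `(text, font_size)` pair and drops later exact repeats. `drop_repeated_snippets` and the sample data are hypothetical.

```python
def drop_repeated_snippets(snippets):
    """Keep the first occurrence of each (text, font_size) pair, preserving order."""
    seen = set()
    unique = []
    for text, font_size in snippets:
        # Ignore surrounding whitespace when comparing snippet text
        key = (text.strip(), font_size)
        if key in seen:
            continue
        seen.add(key)
        unique.append((text, font_size))
    return unique


# Hypothetical snippets: the header appears on two pages
snippets = [
    ("Journal Header\n", 8),
    ("Introduction\n", 14),
    ("Body text\n", 10),
    ("Journal Header\n", 8),
]
unique_snippets = drop_repeated_snippets(snippets)
print(unique_snippets)
```

Exact matching also removes body text that legitimately repeats, so a stricter rule (for example, only dropping short snippets) may be preferable for some documents.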

In [68]:
from langchain_core.documents import Document

cur_idx = -1
semantic_snippets = []
# Assumption: headings use a larger font size
for s in snippets:
    # New heading: current snippet's font > previous heading's font
    if (
        not semantic_snippets
        or s[1] > semantic_snippets[cur_idx].metadata["heading_font"]
    ):
        metadata = {"heading": s[0], "content_font": 0, "heading_font": s[1]}
        metadata.update(docs[0].metadata)
        semantic_snippets.append(Document(page_content="", metadata=metadata))
        cur_idx += 1
        continue

    # Same section content: current snippet's font <= previous content's font
    if (
        not semantic_snippets[cur_idx].metadata["content_font"]
        or s[1] <= semantic_snippets[cur_idx].metadata["content_font"]
    ):
        semantic_snippets[cur_idx].page_content += s[0]
        semantic_snippets[cur_idx].metadata["content_font"] = max(
            s[1], semantic_snippets[cur_idx].metadata["content_font"]
        )
        continue

    # New section: current snippet's font > previous content's font but no larger than the previous heading's font
    metadata = {"heading": s[0], "content_font": 0, "heading_font": s[1]}
    metadata.update(docs[0].metadata)
    semantic_snippets.append(Document(page_content="", metadata=metadata))
    cur_idx += 1

print(semantic_snippets[4])

page_content='It 
' metadata={'heading': 'unr \n', 'content_font': 7, 'heading_font': 19, 'source': './data/2103_page1.pdf'}


## PyPDF Directory

Load every PDF in a directory.

In [69]:
from langchain_community.document_loaders import PyPDFDirectoryLoader

# Directory path
loader = PyPDFDirectoryLoader("data/")

# Load the documents
docs = loader.load()

# Print the number of documents
print(len(docs))

156


In [75]:
# Print the document content
print(docs[0].page_content[:])

Layout Parser: A Unified Toolkit for Deep
Learning Based Document Image Analysis
Zejiang Shen1 詞), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain
Lee4, Jacob Carlson3, and Weining Li5
1 Allen Institute for AI
IZOZ
shannons©allenai .org
2 Brown University
ruochen_zhang©brown.edu
unr
3 Harvard University
{meli ssadell,jacob_carlson}©fas.harvard.edu
4 University of Washington
bcgl©cs.washington.edu
I
t 5 University of Waterloo
w422li©u멀aterloo.ca
[AU
Abstract . Recent advances in document image analysis (DIA) have been
· s~] primarily driven by the application of neural networks. Ideally, research
outcomes could be easily deployed in production and extended for further
investigation. However, various factors like loosely organized codebases
and sophisticated model configurations complicate the easy reuse of im
현 portant innovations by a wide audience. Though there have been on-going
/ efforts to improve reusability and simplify deep learning (DL) model
development in disciplines 

In [71]:
# Print the metadata
print(docs[50].metadata)

{'producer': 'Adobe PDF Library 17.0', 'creator': 'Adobe InDesign 19.5 (Windows)', 'creationdate': '2024-12-17T10:16:15+09:00', 'moddate': '2024-12-17T10:18:06+09:00', 'trapped': '/False', 'source': 'data\\planner.pdf', 'total_pages': 111, 'page': 28, 'page_label': '29'}
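
Because the directory loader returns one document per page across all files, the `source` metadata is what ties a page back to its PDF. A minimal sketch of counting pages per file is shown below. `pages_per_file` and the sample metadata are hypothetical; `PureWindowsPath` is used because it accepts both `\` and `/` separators, as seen in the outputs in this guide.

```python
from collections import Counter
from pathlib import PureWindowsPath


def pages_per_file(metadatas):
    """Count loaded pages per PDF, keyed by file name."""
    # PureWindowsPath splits on both backslashes and forward slashes
    return Counter(PureWindowsPath(meta["source"]).name for meta in metadatas)


# Hypothetical metadata dicts shaped like the output above
sample = [
    {"source": "data\\planner.pdf", "page": 27},
    {"source": "data\\planner.pdf", "page": 28},
    {"source": "data/2103_page1.pdf", "page": 0},
]
page_counts = pages_per_file(sample)
print(page_counts)
```

With real loader output, the same function could be called as `pages_per_file(doc.metadata for doc in docs)`.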


## PDFPlumber

As with PyMuPDF, the output documents contain detailed metadata about the PDF and its pages, and one document is returned per page.

In [73]:
from langchain_community.document_loaders import PDFPlumberLoader

# Create a PDFPlumber loader instance
loader = PDFPlumberLoader(FILE_PATH)

# Load the document
docs = loader.load()

# Access the first document's content
print(docs[0].page_content[:])

Layout Parser: A Unified Toolkit for Deep
Learning Based Document Image Analysis
Zejiang Shen1 詞), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain
Lee4, Jacob Carlson3, and Weining Li5
1 Allen Institute for AI
IZOZ
shannons©allenai .org
2 Brown University
ruochen_zhang©brown.edu
unr
3 Harvard University
{meli ssadell,jacob_carlson}©fas.harvard.edu
4 University of Washington
bcgl©cs.washington.edu
I
t 5 University of Waterloo
w422li©u멀aterloo.ca
[AU
Abstract . Recent advances in document image analysis (DIA) have been
· s~] primarily driven by the application of neural networks. Ideally, research
outcomes could be easily deployed in production and extended for further
investigation. However, various factors like loosely organized codebases
and sophisticated model configurations complicate the easy reuse of im
현 portant innovations by a wide audience. Though there have been on-going
/ efforts to improve reusability and simplify deep learning (DL) model
development in disciplines 

In [74]:
show_metadata(docs)

[metadata]
['source', 'file_path', 'page', 'total_pages', 'Author', 'CreationDate', 'ModDate', 'Producer', 'Title']

[examples]
source       : ./data/2103_page1.pdf
file_path    : ./data/2103_page1.pdf
page         : 0
total_pages  : 1
Author       : Heejin Park
CreationDate : D:20250822123836+09'00'
ModDate      : D:20250822124711+09'00'
Producer     : Adobe Acrobat (64-bit) 25 Paper Capture Plug-in
Title        : C:\Users\park0\Downloads\2103.15348v2.pdf
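
The loaders in this guide name the same metadata fields differently (`author` from PyPDF versus `Author` here, for example). When comparing loaders side by side, lower-casing the keys is one simple way to align them. `normalize_metadata` is a hypothetical helper, and it only aligns key names; value formats, such as the two date styles seen above, are left untouched.

```python
def normalize_metadata(metadata):
    """Lower-case metadata keys so output from different loaders lines up."""
    normalized = {}
    for key, value in metadata.items():
        # setdefault keeps the first value when two keys differ only by case
        normalized.setdefault(key.lower(), value)
    return normalized


# Subset of the PDFPlumber metadata shown above
pdfplumber_meta = {
    "source": "./data/2103_page1.pdf",
    "page": 0,
    "Author": "Heejin Park",
    "CreationDate": "D:20250822123836+09'00'",
}
normalized = normalize_metadata(pdfplumber_meta)
print(sorted(normalized))
```

Keeping the first value on a collision matters for output like PyMuPDF's, which contains both `moddate` and `modDate`.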
