---
**LangPDFDemo**
- Parse a PDF with LangChain's UnstructuredPDFLoader
- Categorize the elements and print them one by one
- Intended as a personal practice example
---


In [1]:
pdf_path = r"D:\Learning-lab\Test\sample.pdf"


In [3]:
from langchain_community.document_loaders import UnstructuredPDFLoader

pdf_loader = UnstructuredPDFLoader(
    pdf_path,
    mode="elements"   # or "single" / "paged"
)

pdf_docs = pdf_loader.load()




In [8]:
print(f"Loaded {len(pdf_docs)} Documents")



Loaded 359 Documents


In [9]:
# Print each Document in turn
for i, doc in enumerate(pdf_docs, start=1):
    print(f"--- Document {i} ---")
    print(doc.page_content[:500])  # first 500 characters
    print("Metadata:", doc.metadata)
    print(doc.metadata.keys())
    print()


--- Document 1 ---
2510.12323v1 [cs.AI] 14 Oct 2025
Metadata: {'source': 'D:\\Learning-lab\\Test\\sample.pdf', 'coordinates': {'points': ((np.float64(51.0), np.float64(631.0)), (np.float64(51.0), np.float64(1429.0)), (np.float64(99.0), np.float64(1429.0)), (np.float64(99.0), np.float64(631.0))), 'system': 'PixelSpace', 'layout_width': 1700, 'layout_height': 2200}, 'filetype': 'application/pdf', 'languages': ['eng'], 'last_modified': '2025-10-21T15:20:27', 'page_number': 1, 'file_directory': 'D:\\Learning-lab\\Test', 'filename': 'sample.pdf', 'category': 'UncategorizedText', 'element_id': 'dc03f3ec8b6688c2ff97432265348f10'}
dict_keys(['source', 'coordinates', 'filetype', 'languages', 'last_modified', 'page_number', 'file_directory', 'filename', 'category', 'element_id'])

--- Document 2 ---
arXiv
Metadata: {'source': 'D:\\Learning-lab\\Test\\sample.pdf', 'coordinates': {'points': ((np.float64(50.0), np.float64(1447.0)), (np.float64(50.0), np.float64(1571.0)), (np.float64(88.0), np.float

In [10]:
# Print the category, page number, and text of each element
for i, doc in enumerate(pdf_docs, start=1):
    print(f"--- Element {i} ---")
    print("Category:", doc.metadata.get("category"))
    print("Page number:", doc.metadata.get("page_number"))
    print("Content snippet:", doc.page_content[:200])  # first 200 characters
    print()


--- Element 1 ---
Category: UncategorizedText
Page number: 1
Content snippet: 2510.12323v1 [cs.AI] 14 Oct 2025

--- Element 2 ---
Category: Title
Page number: 1
Content snippet: arXiv

--- Element 3 ---
Category: Title
Page number: 1
Content snippet: RAG-ANYTHING: ALL-IN-ONE RAG FRAMEWORK

--- Element 4 ---
Category: Title
Page number: 1
Content snippet: RAG-ANYTHING: ALL-IN-ONE RAG FRAMEWORK

--- Element 5 ---
Category: UncategorizedText
Page number: 1
Content snippet: Zirui Guo, Xubin Ren, Lingrui Xu, Jiahao Zhang, Chao Huang* The University of Hong Kong zrguol01@hku.hk xubinrencs@gmail.com chaohuang75@gmail.com

--- Element 6 ---
Category: Title
Page number: 1
Content snippet: ABSTRACT

--- Element 7 ---
Category: NarrativeText
Page number: 1
Content snippet: Retrieval-Augmented Generation (RAG) has emerged as a fundamental paradigm for expanding Large Language Models beyond their static training limitations. However, a critical misalignment exists between

--- Element 8 ---
Categor

In [18]:
# Print only the blocks of one category
for doc in pdf_docs:
    if doc.metadata.get("category") == "UncategorizedText":
        print("Block:", doc.page_content)


Block: 2510.12323v1 [cs.AI] 14 Oct 2025
Block: Zirui Guo, Xubin Ren, Lingrui Xu, Jiahao Zhang, Chao Huang* The University of Hong Kong zrguol01@hku.hk xubinrencs@gmail.com chaohuang75@gmail.com
Block: ogur panaiyay plug,
Block: .
Block: i ' Documents. Text Encoder | = = § 1 Maitimedal voa t ' '
Block: o4uz jopow 14m Jonyxay,
Block: T =emb(s): 5 € VUEUG,, (5)
Block: Response = VLM(q,P(q), ¥*(2)): 6)
Block: GPT-40-mini 40.3. 46.9 60.3 59.2 61.0 61.0 43.8 49.6 51.2 LightRAG 53.8 56.2 59.5 61.8 65.7 85.0 59.7 46.8 58.4 MMGraphRAG 64.3 52.8 64.9 40.0 61.5 67.6 66.0 60.5 61.0 RAGAnything 61.4 67.0 61.5 60.2 663 85.0 76.3 46.0 63.4
Block: GPT-40-mini 35.5 44.0 246 33.1 29.5 46.8 31.1 33.5 LightRAG 40.8 341 36.2 39.4 41.0 44.4 38.3 38.9 MMGraphRAG 40.8 36.5 35.7 35.8 28.2 46.9 38.5 37.7 RAGAnything 46.6 43.5 38.7 43.9 34.0 45.7 43.6 42.8
Block: Chunk-only 55.8 61.5 60.1 60.7 640 816 66.2 43.5 60.0 w/o Reranker 60.9 63.5 58.8 60.2 68.6 81.7 74.7 45.4 62.4 RAGAnything 61.4 67.0 61.5 60.2 663 85.0 76.3 46.0 63.4
Block: 10
Block: 11
Block

In [12]:
# Group the elements by category
from collections import defaultdict

elements_by_category = defaultdict(list)

for doc in pdf_docs:
    cat = doc.metadata.get("category", "Uncategorized")
    elements_by_category[cat].append(doc.page_content)

# Print the number of elements in each category
for cat, contents in elements_by_category.items():
    print(f"{cat}: {len(contents)} elements")


UncategorizedText: 35 elements
Title: 118 elements
NarrativeText: 193 elements
ListItem: 13 elements
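
The overall counts can also be broken down per page. The following is a minimal sketch, assuming each item carries a `metadata` dict with the `category` and `page_number` keys seen in the printed metadata. `FakeDoc` and `count_categories_per_page` are hypothetical names, introduced only so the example runs without a PDF.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass, field


@dataclass
class FakeDoc:
    """Hypothetical stand-in for a LangChain Document (page_content + metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def count_categories_per_page(docs):
    """Return {page_number: Counter({category: count})}."""
    per_page = defaultdict(Counter)
    for doc in docs:
        page = doc.metadata.get("page_number")
        cat = doc.metadata.get("category", "Uncategorized")
        per_page[page][cat] += 1
    return dict(per_page)


# Illustrative data only; with the real loader, pass pdf_docs instead.
sample_docs = [
    FakeDoc("arXiv", {"category": "Title", "page_number": 1}),
    FakeDoc("ABSTRACT", {"category": "Title", "page_number": 1}),
    FakeDoc("Retrieval-Augmented Generation ...", {"category": "NarrativeText", "page_number": 1}),
    FakeDoc("2 THE RAG-ANYTHING FRAMEWORK", {"category": "Title", "page_number": 3}),
]

per_page = count_categories_per_page(sample_docs)
for page, counts in sorted(per_page.items()):
    print(page, dict(counts))
```

A page with an unusually high `Title` count is often a sign that figure labels were misclassified as headings.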


# Metadata structure
```python
{
    'source': 'D:\\Learning-lab\\Test\\sample.pdf',  # path of the original PDF
    'coordinates': {                                   # position of the element on the page
        # corners in order: top-left, bottom-left, bottom-right, top-right
        'points': ((51.0, 631.0), (51.0, 1429.0), (99.0, 1429.0), (99.0, 631.0)),
        'system': 'PixelSpace',
        'layout_width': 1700,
        'layout_height': 2200
    },
    'filetype': 'application/pdf',                     # file (MIME) type
    'languages': ['eng'],                              # detected languages
    'last_modified': '2025-10-21T15:20:27',           # last-modified time of the file
    'page_number': 1,                                  # page the element belongs to
    'file_directory': 'D:\\Learning-lab\\Test',       # folder containing the PDF
    'filename': 'sample.pdf',                          # file name
    'category': 'UncategorizedText',                  # element category (e.g. Title, NarrativeText, ListItem, Image)
    'element_id': 'dc03f3ec8b6688c2ff97432265348f10' # unique element ID
}
```
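
The `coordinates` entry can be reduced to a bounding box. This sketch assumes only the structure shown above: four corner `points` in pixel space plus `layout_width` and `layout_height`. The helper names `bounding_box` and `relative_box` are hypothetical.

```python
def bounding_box(coordinates):
    """Return (x_min, y_min, width, height) from a 'coordinates' metadata dict."""
    xs = [float(x) for x, _ in coordinates["points"]]
    ys = [float(y) for _, y in coordinates["points"]]
    x_min, y_min = min(xs), min(ys)
    return x_min, y_min, max(xs) - x_min, max(ys) - y_min


def relative_box(coordinates):
    """Scale the box to the 0..1 range using layout_width / layout_height."""
    x, y, w, h = bounding_box(coordinates)
    lw, lh = coordinates["layout_width"], coordinates["layout_height"]
    return x / lw, y / lh, w / lw, h / lh


# The coordinates of Document 1 from the metadata shown above.
coords = {
    "points": ((51.0, 631.0), (51.0, 1429.0), (99.0, 1429.0), (99.0, 631.0)),
    "system": "PixelSpace",
    "layout_width": 1700,
    "layout_height": 2200,
}

print(bounding_box(coords))
print(relative_box(coords))
```

For this element the box is 48 px wide and 798 px tall, so it is a narrow vertical strip near the left edge of the page. Relative boxes make it easier to compare elements across pages of different sizes.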

In [19]:
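# Extract images, load page by page, OCR
# Three loading modes: 'single', 'paged', 'elements' (whole file, per page, per element)
# languages=["eng", "chi_tra"]  # OCR English + Traditional Chinese
"""
Old approach:

pdf_loader = UnstructuredPDFLoader(
    pdf_path,
    mode="paged", 
    unstructured_kwargs={
        "ocr_languages": ["eng"],  
        "extract_images_in_pdf": True
    }
)
"""
# The newer approach is to use mode="elements" and set chunking_strategy="by_page", which allows more flexible splitting.
# UnstructuredPDFLoader forwards extra keyword arguments to unstructured's partition_pdf,
# so the options are passed directly instead of being nested in an unstructured_kwargs dict.
pdf_loader = UnstructuredPDFLoader(
    pdf_path,
    mode="elements",
    strategy="hi_res",            # image extraction only applies to the hi_res strategy
    languages=["eng"],            # OCR languages; requires Tesseract
    extract_images_in_pdf=True,
    chunking_strategy="by_page",  # assumption: accepted by the installed unstructured version;
                                  # the open-source library documents "basic" and "by_title"
)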

pdf_docs = pdf_loader.load()




In [None]:
# Check the grouping by category again
from collections import defaultdict

elements_by_category = defaultdict(list)

for doc in pdf_docs:
    cat = doc.metadata.get("category", "Uncategorized")
    elements_by_category[cat].append(doc.page_content)

# Print the number of elements in each category
for cat, contents in elements_by_category.items():
    print(f"{cat}: {len(contents)} elements")


UncategorizedText: 35 elements
Title: 118 elements
NarrativeText: 193 elements
ListItem: 13 elements


In [20]:
# pdf_docs is the result of loading with mode="elements" (optionally with chunking_strategy="by_page")
from collections import defaultdict

# Number of images found on each page
images_per_page = defaultdict(int)

for doc in pdf_docs:
    page = doc.metadata.get("page_number", 1)
    # The element type is stored under "category"; the metadata has no "element_type" key
    if doc.metadata.get("category") == "Image":
        images_per_page[page] += 1

# Print the result
for page, count in images_per_page.items():
    print(f"Page {page}: {count} image(s)")


In [21]:
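# Collect the extracted images into one folder
import os
import shutil

output_dir = "extracted_images"
os.makedirs(output_dir, exist_ok=True)

for doc in pdf_docs:
    if doc.metadata.get("category") == "Image":
        # Assumption: unstructured writes each extracted image to disk and records
        # its location in metadata["image_path"]; the metadata holds no PIL Image object
        src_path = doc.metadata.get("image_path")
        if src_path and os.path.exists(src_path):
            page = doc.metadata.get("page_number")
            ext = os.path.splitext(src_path)[1]
            image_path = os.path.join(output_dir, f"page{page}_img{doc.metadata['element_id']}{ext}")
            shutil.copy(src_path, image_path)
            print(f"Saved: {image_path}")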


In [25]:
# Collect the titles
outline = []

for doc in pdf_docs:
    cat = doc.metadata.get("category")
    if cat == "Title":
        outline.append({
            "page": doc.metadata.get("page_number"),
            "title": doc.page_content.strip()
        })

# Print the navigation outline
for item in outline:
    print(f"Page {item['page']}: {item['title']}")

Page 1: arXiv
Page 1: RAG-ANYTHING: ALL-IN-ONE RAG FRAMEWORK
Page 1: RAG-ANYTHING: ALL-IN-ONE RAG FRAMEWORK
Page 1: ABSTRACT
Page 1: 1 INTRODUCTION
Page 2: RAG-ANYTHING: ALL-IN-ONE RAG FRAMEWORK
Page 3: RAG-ANYTHING: ALL-IN-ONE RAG FRAMEWORK
Page 3: 2 THE RAG-ANYTHING FRAMEWORK
Page 3: 2.1 PRELIMINARY
Page 3: 2.1.1 MOTIVATING RAG-ANYTHING
Page 3: 2.2 UNIVERSAL REPRESENTATION FOR HETEROGENEOUS KNOWLEDGE
Page 3: {ej = (tj, ey) FE ()
Page 3: Decompose
Page 3: ky
Page 4: RAG-ANYTHING: ALL-IN-ONE RAG FRAMEWORK
Page 4: Gras oege
Page 4: mage Caption & Graph
Page 4: ‘Metadata Extraction
Page 4: e Structural Knowledge Negation
Page 4: Multi-modal Processors
Page 4: : vu
Page 4: tet vDB
Page 4: OQ
Page 4: LaTeX Equation Recognition
Page 4: ofuz vouonb3 —oyuy a6ouz
Page 4: Semantic Similarity Matching
Page 4: V8 over All
Page 4: Table Structure & Content Parsing
Page 4: our aIqmL
Page 4: 2.2.1 DUAL-GRAPH CONSTRUCTION FOR MULTIMODAL KNOWLEDGE
Page 5: RAG-ANYTHING: ALL-IN-ONE RAG FRAMEWORK
Page 5:
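
The outline above repeats the running header "RAG-ANYTHING: ALL-IN-ONE RAG FRAMEWORK" on every page. One possible cleanup, sketched here under the assumption that a title seen on several distinct pages is a header and not a section heading, is to drop such titles. `drop_running_headers` and its `min_pages` threshold are hypothetical, not part of LangChain or unstructured.

```python
from collections import defaultdict


def drop_running_headers(outline, min_pages=3):
    """Remove titles that appear on at least `min_pages` distinct pages."""
    pages_by_title = defaultdict(set)
    for item in outline:
        pages_by_title[item["title"]].add(item["page"])
    repeated = {t for t, pages in pages_by_title.items() if len(pages) >= min_pages}
    return [item for item in outline if item["title"] not in repeated]


# A few entries copied from the outline printed above.
sample_outline = [
    {"page": 1, "title": "RAG-ANYTHING: ALL-IN-ONE RAG FRAMEWORK"},
    {"page": 1, "title": "ABSTRACT"},
    {"page": 1, "title": "1 INTRODUCTION"},
    {"page": 2, "title": "RAG-ANYTHING: ALL-IN-ONE RAG FRAMEWORK"},
    {"page": 3, "title": "RAG-ANYTHING: ALL-IN-ONE RAG FRAMEWORK"},
    {"page": 3, "title": "2 THE RAG-ANYTHING FRAMEWORK"},
]

cleaned = drop_running_headers(sample_outline)
for item in cleaned:
    print(f"Page {item['page']}: {item['title']}")
```

The trade-off is that the real document title on page 1 is removed together with its repeats. If it matters, keep the first occurrence separately before filtering.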

In [26]:
# Rebuild the page layout
from operator import itemgetter

# Assumes pdf_docs is the result of mode="elements"
pages = {}

for doc in pdf_docs:
    page = doc.metadata.get("page_number", 1)
    coords = doc.metadata.get("coordinates") or {}  # guard against a missing or None value
    top_left = coords.get("points", [(0, 0)])[0]    # first point is the top-left corner
    x_left, y_top = top_left[0], top_left[1]

    if page not in pages:
        pages[page] = []
    pages[page].append((y_top, x_left, doc.metadata.get("category"), doc.page_content))

# Sort each page by y_top, then x_left, to rebuild the layout
for page_num, elements in pages.items():
    print(f"--- Page {page_num} ---")
    elements_sorted = sorted(elements, key=itemgetter(0, 1))
    for y, x, cat, content in elements_sorted:
        print(f"[{cat}] {content}")


--- Page 1 ---
[Title] RAG-ANYTHING: ALL-IN-ONE RAG FRAMEWORK
[Title] RAG-ANYTHING: ALL-IN-ONE RAG FRAMEWORK
[UncategorizedText] Zirui Guo, Xubin Ren, Lingrui Xu, Jiahao Zhang, Chao Huang* The University of Hong Kong zrguol01@hku.hk xubinrencs@gmail.com chaohuang75@gmail.com
[Title] ABSTRACT
[NarrativeText] Retrieval-Augmented Generation (RAG) has emerged as a fundamental paradigm for expanding Large Language Models beyond their static training limitations. However, a critical misalignment exists between current RAG capabilities and real-world information environments. Modern knowledge repositories are inher- ently multimodal, containing rich combinations of textual content, visual elements, structured tables, and mathematical expressions. Yet existing RAG frameworks are limited to textual content, creating fundamental gaps when processing multimodal documents. We present RAG-Anything, a unified framework that enables compre- hensive knowledge retrieval across all modalities. Our appro
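
The extracted text keeps line-break hyphenation such as "inher- ently" and "compre- hensive". The sketch below is a simple heuristic, not a feature of the loader: it removes a hyphen followed by whitespace when both neighbours are lowercase letters. `join_hyphenated` is a hypothetical helper name.

```python
import re

# Lowercase letter, hyphen, whitespace, lowercase letter: "inher- ently"
_BROKEN_WORD = re.compile(r"(?<=[a-z])-\s+(?=[a-z])")


def join_hyphenated(text):
    """Join words that were split by a hyphen at a line break."""
    return _BROKEN_WORD.sub("", text)


sample = (
    "repositories are inher- ently multimodal ... "
    "enables compre- hensive knowledge retrieval"
)
print(join_hyphenated(sample))
```

The heuristic cannot tell a line-break hyphen from a real one, so a compound split at its hyphen (for example "top-to- bottom") loses that hyphen and becomes "top-tobottom". A dictionary check would be needed to handle those cases.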

In [27]:
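# Add formatting markers per category to bring the output closer to the original PDF.
# Note: elements_sorted still holds the last page processed in the previous cell.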
for y, x, cat, content in elements_sorted:
    if cat == "Title":
        print(f"\n=== {content} ===\n")
    elif cat == "ListItem":
        print(f"• {content}")
    else:
        print(content)


=== RAG-ANYTHING: ALL-IN-ONE RAG FRAMEWORK ===

detailed diagrams, or exact spatial relationships. Corresponding text typically provides general, conceptual descriptions. This semantic misalignment introduces noise that actively misleads the reasoning process. The system attempts to reconcile incompatible levels of detail and specificity.
Issue 2: Rigid Spatial Processing Patterns. Current visual processing models exhibit fundamental rigidity in spatial interpretation. Most systems default to sequential scanning patterns—top-to- bottom and left-to-right—that mirror natural reading conventions. While effective for simple text documents, this approach creates systematic failures with structurally complex real-world content. Many documents require non-conventional processing strategies. Tables demand column-wise interpretation, technical diagrams follow specific directional flows, and scientific figures embed critical information in unexpectedly positioned annotations. These structural v
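
The formatting cell only renders whichever page was last stored in `elements_sorted`. A minimal sketch for rendering every page follows, assuming the same `(y_top, x_left, category, content)` tuples and the same `pages` dict shape as in the layout cell. `render_page` and `render_document` are hypothetical helpers, and the sample data is made up for illustration.

```python
def render_page(elements):
    """Format already-sorted (y, x, category, content) tuples as marked-up text."""
    lines = []
    for _y, _x, cat, content in elements:
        if cat == "Title":
            lines.append(f"\n=== {content} ===\n")
        elif cat == "ListItem":
            lines.append(f"• {content}")
        else:
            lines.append(content)
    return "\n".join(lines)


def render_document(pages):
    """Render all pages in page order, sorting each page top-to-bottom, left-to-right."""
    parts = []
    for page_num in sorted(pages):
        elements = sorted(pages[page_num], key=lambda e: (e[0], e[1]))
        parts.append(f"--- Page {page_num} ---\n{render_page(elements)}")
    return "\n".join(parts)


# Illustrative data only; with the real notebook, pass the `pages` dict built earlier.
sample_pages = {
    2: [(100.0, 50.0, "NarrativeText", "Second page text.")],
    1: [
        (300.0, 50.0, "ListItem", "First item"),
        (100.0, 50.0, "Title", "ABSTRACT"),
    ],
}

text = render_document(sample_pages)
print(text)
```

Returning a string instead of printing makes it straightforward to write the result to a file or pass it to a later processing step.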