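# 📄 Extracting Invoice Information from Mixed Chinese-English PDFs: LayoutXLM + PaddleOCR
This tutorial shows how to convert an invoice PDF into images, run OCR to get the text boxes and their positions, and then use the LayoutXLM model to extract structured fields such as InvoiceNo, InvoiceDate, Amount, and Currency.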

https://huggingface.co/microsoft/layoutxlm-base

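### LayoutXLM and LayoutLMv3 are both strong models for multimodal document understanding. They differ mainly in multilingual support, pretraining objectives, model architecture, and target tasks.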

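| Feature                  | 🔤 LayoutLMv3                                                    | 🌍 LayoutXLM                                                                  |
|--------------------------|------------------------------------------------------------------|-------------------------------------------------------------------------------|
| 📚 Publisher             | Microsoft                                                        | Microsoft                                                                     |
| 📅 Release               | April 2022                                                       | April 2021                                                                    |
| 🧩 Input modalities      | Text + layout + image                                            | Text + layout + image                                                         |
| 🌐 Multilingual support  | ❌ Mainly English (Chinese works, but results are unstable)      | ✅ Multilingual (including Chinese, English, Korean, and others)              |
| 🎯 Pretraining objectives | MLM + masked image modeling + word-patch alignment              | Multilingual masked visual-language modeling + text-image alignment + text-image matching |
| 🏗 Architecture          | Multimodal Transformer with ViT-style patch embeddings           | LayoutLMv2 architecture: Transformer plus a CNN visual backbone               |
| 🖼 Image embedding       | Linear projection of image patches (no CNN backbone)             | ResNeXt-FPN visual backbone (needs detectron2 in transformers)                |
| 🔧 Typical tasks         | Document classification, QA, NER, structured extraction          | Multilingual NER, QA, information extraction                                  |
| 💬 Datasets              | Pretrained on IIT-CDIP; evaluated on FUNSD, CORD, RVL-CDIP, DocVQA | Pretrained on a large multilingual document corpus; evaluated on the XFUND form benchmark |
| ✅ Hugging Face support  | ✅ Supported (`microsoft/layoutlmv3-base`)                       | ✅ Supported (`microsoft/layoutxlm-base`, loaded through the LayoutLMv2 model classes) |
| 📈 Chinese documents     | ⚠️ Moderate (not specifically multilingual)                      | ✅ Strong (multilingual pretraining covers Chinese and English)               |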

In [1]:
# ✅ Install dependencies (first run only)
!pip install paddleocr
!pip install -U transformers datasets seqeval
!apt install poppler-utils
!pip install pdf2image


In [3]:
# ✅ STEP 1: Convert the PDF to images
from pdf2image import convert_from_path
from PIL import Image
import os

pdf_path = "/Users/xiaotingzhou/Documents/Lectures/AI_OCR/data/測試股份有限公司.pdf"  # 👈 replace with the path to your PDF file
image_dir = "/Users/xiaotingzhou/Documents/Lectures/AI_OCR/data"
os.makedirs(image_dir, exist_ok=True)
images = convert_from_path(pdf_path, dpi=200)
image_paths = []
for i, img in enumerate(images):
    path = f"{image_dir}/page_{i+1}.png"
    img.save(path)
    image_paths.append(path)

image_paths  # show the paths of the converted page images

['/Users/xiaotingzhou/Documents/Lectures/AI_OCR/data/page_1.png']

In [None]:
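# ✅ STEP 2: Run PaddleOCR to get the text and box positions
from paddleocr import PaddleOCR
# lang="ch" loads the Chinese model, which also recognizes English text;
# lang="en" would miss the Chinese characters on a mixed-language invoice.
ocr = PaddleOCR(use_angle_cls=True, lang="ch")

# NOTE: ocr.ocr(..., cls=True) and the [polygon, (text, score)] result layout used in
# STEP 3 follow the PaddleOCR 2.x API. PaddleOCR 3.x changed both, so adapt this cell
# if that is the version you installed.
ocr_results = ocr.ocr(image_paths[0], cls=True)[0]  # first page only
ocr_results[:3]  # preview the first three results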
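
Each OCR line comes with a recognition confidence, and dropping low-confidence lines before tokenization can cut noise. The following is a minimal sketch, assuming the PaddleOCR 2.x result layout `[polygon, (text, score)]`. The helper name `filter_ocr_lines`, the `0.5` threshold, and the sample values are illustrative assumptions, not part of PaddleOCR.

```python
def filter_ocr_lines(lines, min_score=0.5):
    """Keep (polygon, text) pairs whose text is non-empty and confident enough.

    `lines` is assumed to use the PaddleOCR 2.x layout: [polygon, (text, score)].
    """
    kept = []
    for polygon, (text, score) in lines:
        if text.strip() and score >= min_score:
            kept.append((polygon, text))
    return kept


# Made-up sample in the assumed layout (not real OCR output).
sample = [
    [[[10, 10], [90, 10], [90, 30], [10, 30]], ("Invoice No: 12345", 0.98)],
    [[[10, 40], [90, 40], [90, 60], [10, 60]], ("~~", 0.21)],
]
for polygon, text in filter_ocr_lines(sample):
    print(text, polygon)
```

The threshold is a trade-off: a higher value removes more noise but can also discard small or blurry fields such as stamped invoice numbers.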


In [None]:
# ✅ STEP 3: Build the LayoutXLM input
from transformers import LayoutXLMTokenizer
from PIL import ImageDraw
import torch

tokenizer = LayoutXLMTokenizer.from_pretrained("microsoft/layoutxlm-base")
image = Image.open(image_paths[0]).convert("RGB")

words = []
boxes = []
for res in ocr_results:
    text = res[1][0]
    bbox = res[0]  # four corner points of the text polygon
    if text.strip():
        words.append(text)
        x0 = min([p[0] for p in bbox])
        y0 = min([p[1] for p in bbox])
        x1 = max([p[0] for p in bbox])
        y1 = max([p[1] for p in bbox])
        boxes.append([int(x0), int(y0), int(x1), int(y1)])

# Normalize the boxes to the 0-1000 range expected by the model
width, height = image.size
norm_boxes = [[
    int(1000 * x0 / width), int(1000 * y0 / height),
    int(1000 * x1 / width), int(1000 * y1 / height)
] for x0, y0, x1, y1 in boxes]

encoding = tokenizer(
    words,
    boxes=norm_boxes,
    return_tensors="pt",
    padding="max_length",
    truncation=True,
    max_length=512
)
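
The box handling in this step can be folded into one helper that also clamps the result, since detected polygons sometimes extend slightly past the page edge and the model expects coordinates within 0-1000. This is a sketch under that assumption; `polygon_to_norm_box` is a hypothetical name, not a library function.

```python
from typing import List, Sequence


def polygon_to_norm_box(polygon: Sequence[Sequence[float]], width: int, height: int) -> List[int]:
    """Convert a 4-point pixel polygon to [x0, y0, x1, y1] on a 0-1000 scale."""
    xs = [p[0] for p in polygon]
    ys = [p[1] for p in polygon]
    raw = [
        1000 * min(xs) / width, 1000 * min(ys) / height,
        1000 * max(xs) / width, 1000 * max(ys) / height,
    ]
    # Clamp so slightly out-of-page coordinates stay inside the valid range.
    return [max(0, min(1000, int(v))) for v in raw]


print(polygon_to_norm_box([[100, 50], [300, 52], [300, 90], [100, 88]], width=1000, height=500))
# [100, 100, 300, 180]
```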

In [None]:
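# ✅ STEP 4: Run LayoutXLM inference (placeholder model, not fine-tuned)
# transformers has no LayoutXLMForTokenClassification class; LayoutXLM checkpoints
# load through the LayoutLMv2 classes, whose visual backbone needs detectron2 installed.
from transformers import LayoutLMv2ForTokenClassification, LayoutLMv2ImageProcessor

model = LayoutLMv2ForTokenClassification.from_pretrained("microsoft/layoutxlm-base")
model.eval()

# The model also takes the page image; apply_ocr=False because PaddleOCR already
# supplied the words and boxes.
image_processor = LayoutLMv2ImageProcessor(apply_ocr=False)
encoding["image"] = image_processor(image, return_tensors="pt")["pixel_values"]

with torch.no_grad():
    outputs = model(**encoding)
    predictions = outputs.logits.argmax(-1)

# Show the results (demo only: the classification head is untrained, so the ids carry no meaning yet)
tokens = tokenizer.convert_ids_to_tokens(encoding['input_ids'][0])
preds = predictions[0].tolist()
for token, pred in zip(tokens, preds):
    print(f"{token}: {pred}")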
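
The loop above prints raw class ids, including the padding positions. After fine-tuning, the ids are usually mapped to label names and the special tokens skipped. Here is a minimal sketch: `ID2LABEL`, the label names, and the demo tokens are hypothetical, and the special-token set assumes the XLM-R style vocabulary that the LayoutXLM tokenizer uses.

```python
# Hypothetical label map; a real one comes from your fine-tuning label set.
ID2LABEL = {0: "O", 1: "B-INVOICE_NO", 2: "I-INVOICE_NO"}
# Special tokens of the XLM-R style vocabulary (assumed here).
SPECIAL_TOKENS = {"<s>", "</s>", "<pad>"}


def readable_predictions(tokens, preds, id2label=ID2LABEL):
    """Pair each real token with its label name, skipping special and padding tokens."""
    return [
        (token, id2label[pred])
        for token, pred in zip(tokens, preds)
        if token not in SPECIAL_TOKENS
    ]


# Made-up tokens and predictions for illustration only.
demo_tokens = ["<s>", "▁Invoice", "▁No", "▁INV", "-001", "</s>", "<pad>"]
demo_preds = [0, 0, 0, 1, 2, 0, 0]
for token, label in readable_predictions(demo_tokens, demo_preds):
    print(f"{token}\t{label}")
```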

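## Possible extensions

- Annotate data with Label Studio or by hand to produce `label` values, then fine-tune the model
- Build entity extraction with `LayoutLMv2ForTokenClassification` (the transformers class that loads LayoutXLM checkpoints) plus an ID2LABEL mapping
- Assemble the extracted fields into JSON: InvoiceNo / Date / Currency / Amount, etc.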
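
The JSON assembly in the last item can be sketched as follows, assuming a fine-tuned model that emits word-level BIO tags. The function `bio_to_fields`, the label names, and the sample invoice values are all hypothetical; keeping only the first span per field is one simple policy among several.

```python
import json


def bio_to_fields(words, labels):
    """Merge BIO-tagged words into a {field: text} dict, keeping the first span per field."""
    fields = {}
    open_field = None  # field whose span is currently being extended
    for word, label in zip(words, labels):
        if label.startswith("B-"):
            name = label[2:]
            if name in fields:
                open_field = None  # ignore repeated spans of the same field
            else:
                fields[name] = [word]
                open_field = name
        elif label.startswith("I-") and label[2:] == open_field:
            fields[open_field].append(word)
        else:
            open_field = None  # an "O" tag or a stray "I-" tag closes the span
    return {name: " ".join(parts) for name, parts in fields.items()}


# Made-up words and labels for illustration only.
words = ["Invoice", "No", "INV", "001", "Date", "2024-01-15", "Total", "USD", "1,250.00"]
labels = ["O", "O", "B-InvoiceNo", "I-InvoiceNo", "O", "B-InvoiceDate", "O", "B-Currency", "B-Amount"]
print(json.dumps(bio_to_fields(words, labels), ensure_ascii=False, indent=2))
```

Because subword tokens share their word's box, predictions are typically reduced to one label per word (for example, the label of the first subword) before a merge like this.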