# Azure Form Recognizer (Document Intelligence) + Donut: A Practical Walkthrough

https://learn.microsoft.com/zh-tw/azure/ai-services/document-intelligence/prebuilt/invoice?view=doc-intel-4.0.0

https://github.com/Azure-Samples/document-intelligence-code-samples/tree/main/Data/invoice

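## Introduction to Azure Form Recognizer

### 🧾 What is Document Intelligence?

Azure's intelligent document analysis service. It supports document OCR, information extraction, table structure analysis, and custom model training.

### ✅ Supported tasks

| Type             | Model                              | Function                                                   |
|------------------|------------------------------------|------------------------------------------------------------|
| Prebuilt invoice | prebuilt-invoice                   | Automatically recognizes invoice fields                    |
| General document | prebuilt-document                  | Extracts titles, paragraphs, tables, and key phrases       |
| Custom model     | Trained on labeled data you upload | Extracts business-specific fields; supports self-training  |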


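### 📊 Architecture

1. Image/PDF input
2. An internal OCR module recognizes the text and its coordinates
3. A Transformer-based model interprets the layout (presumably a LayoutLM-style architecture)
4. Structured fields are output as JSON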


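## Extracting invoice information with Azure

### ✅ Using Azure Studio (GUI)

- Go to https://aka.ms/formrecognizerstudio
- Upload a PDF or an image
- Select the prebuilt-invoice model
- The extracted fields are returned automatically, for example: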



```
{
  "InvoiceId": "INV-12345",
  "InvoiceDate": "2025-06-30",
  "InvoiceTotal": 300.00,
  "Currency": "USD",
  "VendorName": "Test Co., Ltd",
  "Items": [
    {
      "Description": "請款",
      "Quantity": 100,
      "UnitPrice": 3.00,
      "Amount": 300.00
    }
  ]
}
```





### 🐍 Python SDK usage


In [None]:
from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.core.credentials import AzureKeyCredential

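# Assumes azure-ai-documentintelligence 1.0.0 or later.
client = DocumentIntelligenceClient(endpoint="<your-endpoint>", credential=AzureKeyCredential("<your-key>"))

with open("invoice.pdf", "rb") as f:
    # Pass the file stream positionally as the request body.
    poller = client.begin_analyze_document("prebuilt-invoice", f)
    result = poller.result()

invoice = result.documents[0]
invoice_id = invoice.fields.get("InvoiceId")
if invoice_id:
    # Field values are read through typed accessors such as value_string.
    print("Invoice ID:", invoice_id.value_string)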

## Analyzing the output JSON schema

### ✅ Field structure reference

```
{
  "InvoiceId": "INV-001",
  "InvoiceDate": "2024-06-01",
  "InvoiceTotal": 1234.56,
  "Currency": "USD",
  "VendorName": "Contoso Ltd.",
  "CustomerName": "Fabrikam Inc.",
  "Items": [
    {
      "Description": "Consulting Service",
      "Quantity": 10,
      "UnitPrice": 100,
      "Amount": 1000
    }
  ],
  "TotalTax": 234.56,
  "PaymentTerm": "Net 30"
}
```



Every field carries a confidence score (`confidence`), and tables are detected automatically.
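
Because each extracted field carries a confidence score, one common pattern is to send low-confidence fields to manual review. The sketch below is only an illustration and not part of the SDK: `split_by_confidence` and the 0.8 threshold are hypothetical, and `SimpleNamespace` objects stand in for analyzed fields, of which only the `content` and `confidence` attributes are used.

```python
from types import SimpleNamespace


def split_by_confidence(fields, threshold=0.8):
    """Split a {name: field} mapping into accepted and needs-review dicts.

    Each field only needs `.content` and `.confidence` attributes.
    The 0.8 threshold is an arbitrary illustrative value.
    """
    accepted, review = {}, {}
    for name, field in fields.items():
        confidence = field.confidence
        # Fields without a confidence score are treated as needing review.
        if confidence is not None and confidence >= threshold:
            accepted[name] = field.content
        else:
            review[name] = field.content
    return accepted, review


# Stand-in objects that mimic the attributes of an analyzed field.
demo_fields = {
    "InvoiceId": SimpleNamespace(content="INV-001", confidence=0.97),
    "VendorName": SimpleNamespace(content="Contoso Ltd.", confidence=0.62),
}
accepted, review = split_by_confidence(demo_fields)
print("accepted:", accepted)
print("needs review:", review)
```

With a real result, the same helper could be called on `invoice.fields` from the samples in this notebook; the threshold would need tuning per field.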

In [None]:
"""
This code sample shows Prebuilt Invoice operations with the Azure AI Document Intelligence client library.
The async versions of the samples require Python 3.8 or later.

To learn more, please visit the documentation - Quickstart: Document Intelligence (formerly Form Recognizer) SDKs
https://learn.microsoft.com/azure/ai-services/document-intelligence/quickstarts/get-started-sdks-rest-api?pivots=programming-language-python
"""

from azure.core.credentials import AzureKeyCredential
from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.ai.documentintelligence.models import AnalyzeDocumentRequest

"""
Remember to remove the key from your code when you're done, and never post it publicly. For production, use
secure methods to store and access your credentials. For more information, see
https://docs.microsoft.com/en-us/azure/cognitive-services/cognitive-services-security?tabs=command-line%2Ccsharp#environment-variables-and-application-configuration
"""
endpoint = "YOUR_FORM_RECOGNIZER_ENDPOINT"
key = "YOUR_FORM_RECOGNIZER_KEY"

# sample document
formUrl = "https://raw.githubusercontent.com/Azure-Samples/cognitive-services-REST-api-samples/master/curl/form-recognizer/invoice_sample.jpg"

document_intelligence_client  = DocumentIntelligenceClient(
    endpoint=endpoint, credential=AzureKeyCredential(key)
)

poller = document_intelligence_client.begin_analyze_document(
    "prebuilt-invoice", AnalyzeDocumentRequest(url_source=formUrl)
)
invoices = poller.result()

for idx, invoice in enumerate(invoices.documents):
    print("--------Recognizing invoice #{}--------".format(idx + 1))
    vendor_name = invoice.fields.get("VendorName")
    if vendor_name:
        print(
            "Vendor Name: {} has confidence: {}".format(
                vendor_name.value_string, vendor_name.confidence
            )
        )
    vendor_address = invoice.fields.get("VendorAddress")
    if vendor_address:
        print(
            "Vendor Address: {} has confidence: {}".format(
                vendor_address.value_address, vendor_address.confidence
            )
        )
    vendor_address_recipient = invoice.fields.get("VendorAddressRecipient")
    if vendor_address_recipient:
        print(
            "Vendor Address Recipient: {} has confidence: {}".format(
                vendor_address_recipient.value_string, vendor_address_recipient.confidence
            )
        )
    customer_name = invoice.fields.get("CustomerName")
    if customer_name:
        print(
            "Customer Name: {} has confidence: {}".format(
                customer_name.value_string, customer_name.confidence
            )
        )
    customer_id = invoice.fields.get("CustomerId")
    if customer_id:
        print(
            "Customer Id: {} has confidence: {}".format(
                customer_id.value_string, customer_id.confidence
            )
        )
    customer_address = invoice.fields.get("CustomerAddress")
    if customer_address:
        print(
            "Customer Address: {} has confidence: {}".format(
                customer_address.value_address, customer_address.confidence
            )
        )
    customer_address_recipient = invoice.fields.get("CustomerAddressRecipient")
    if customer_address_recipient:
        print(
            "Customer Address Recipient: {} has confidence: {}".format(
                customer_address_recipient.value_string,
                customer_address_recipient.confidence,
            )
        )
    invoice_id = invoice.fields.get("InvoiceId")
    if invoice_id:
        print(
            "Invoice Id: {} has confidence: {}".format(
                invoice_id.value_string, invoice_id.confidence
            )
        )
    invoice_date = invoice.fields.get("InvoiceDate")
    if invoice_date:
        print(
            "Invoice Date: {} has confidence: {}".format(
                invoice_date.value_date, invoice_date.confidence
            )
        )
    invoice_total = invoice.fields.get("InvoiceTotal")
    if invoice_total:
        print(
            "Invoice Total: {} has confidence: {}".format(
                invoice_total.value_currency.amount, invoice_total.confidence
            )
        )
    due_date = invoice.fields.get("DueDate")
    if due_date:
        print(
            "Due Date: {} has confidence: {}".format(
                due_date.value_date, due_date.confidence
            )
        )
    purchase_order = invoice.fields.get("PurchaseOrder")
    if purchase_order:
        print(
            "Purchase Order: {} has confidence: {}".format(
                purchase_order.value_string, purchase_order.confidence
            )
        )
    billing_address = invoice.fields.get("BillingAddress")
    if billing_address:
        print(
            "Billing Address: {} has confidence: {}".format(
                billing_address.value_address, billing_address.confidence
            )
        )
    billing_address_recipient = invoice.fields.get("BillingAddressRecipient")
    if billing_address_recipient:
        print(
            "Billing Address Recipient: {} has confidence: {}".format(
                billing_address_recipient.value_string,
                billing_address_recipient.confidence,
            )
        )
    shipping_address = invoice.fields.get("ShippingAddress")
    if shipping_address:
        print(
            "Shipping Address: {} has confidence: {}".format(
                shipping_address.value_address, shipping_address.confidence
            )
        )
    shipping_address_recipient = invoice.fields.get("ShippingAddressRecipient")
    if shipping_address_recipient:
        print(
            "Shipping Address Recipient: {} has confidence: {}".format(
                shipping_address_recipient.value_string,
                shipping_address_recipient.confidence,
            )
        )
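    print("Invoice items:")
    items = invoice.fields.get("Items")
    # Guard against invoices without line items; a separate index name avoids shadowing idx.
    for item_idx, item in enumerate(items.value_array if items else []):
        print("...Item #{}".format(item_idx + 1))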
        item_description = item.value_object.get("Description")
        if item_description:
            print(
                "......Description: {} has confidence: {}".format(
                    item_description.value_string, item_description.confidence
                )
            )
        item_quantity = item.value_object.get("Quantity")
        if item_quantity:
            print(
                "......Quantity: {} has confidence: {}".format(
                    item_quantity.value_number, item_quantity.confidence
                )
            )
        unit = item.value_object.get("Unit")
        if unit:
            print(
                "......Unit: {} has confidence: {}".format(
                    unit.value_number, unit.confidence
                )
            )
        unit_price = item.value_object.get("UnitPrice")
        if unit_price:
            print(
                "......Unit Price: {} has confidence: {}".format(
                    unit_price.value_currency.amount, unit_price.confidence
                )
            )
        product_code = item.value_object.get("ProductCode")
        if product_code:
            print(
                "......Product Code: {} has confidence: {}".format(
                    product_code.value_string, product_code.confidence
                )
            )
        item_date = item.value_object.get("Date")
        if item_date:
            print(
                "......Date: {} has confidence: {}".format(
                    item_date.value_date, item_date.confidence
                )
            )
        tax = item.value_object.get("Tax")
        if tax:
            print(
                "......Tax: {} has confidence: {}".format(tax.value_string, tax.confidence)
            )
        amount = item.value_object.get("Amount")
        if amount:
            print(
                "......Amount: {} has confidence: {}".format(
                    amount.value_currency.amount, amount.confidence
                )
            )
    subtotal = invoice.fields.get("SubTotal")
    if subtotal:
        print(
            "Subtotal: {} has confidence: {}".format(
                subtotal.value_currency.amount, subtotal.confidence
            )
        )
    total_tax = invoice.fields.get("TotalTax")
    if total_tax:
        print(
            "Total Tax: {} has confidence: {}".format(
                total_tax.value_currency.amount, total_tax.confidence
            )
        )
    previous_unpaid_balance = invoice.fields.get("PreviousUnpaidBalance")
    if previous_unpaid_balance:
        print(
            "Previous Unpaid Balance: {} has confidence: {}".format(
                previous_unpaid_balance.value_currency.amount, previous_unpaid_balance.confidence
            )
        )
    amount_due = invoice.fields.get("AmountDue")
    if amount_due:
        print(
            "Amount Due: {} has confidence: {}".format(
                amount_due.value_currency.amount, amount_due.confidence
            )
        )
    service_start_date = invoice.fields.get("ServiceStartDate")
    if service_start_date:
        print(
            "Service Start Date: {} has confidence: {}".format(
                service_start_date.value_date, service_start_date.confidence
            )
        )
    service_end_date = invoice.fields.get("ServiceEndDate")
    if service_end_date:
        print(
            "Service End Date: {} has confidence: {}".format(
                service_end_date.value_date, service_end_date.confidence
            )
        )
    service_address = invoice.fields.get("ServiceAddress")
    if service_address:
        print(
            "Service Address: {} has confidence: {}".format(
                service_address.value_address, service_address.confidence
            )
        )
    service_address_recipient = invoice.fields.get("ServiceAddressRecipient")
    if service_address_recipient:
        print(
            "Service Address Recipient: {} has confidence: {}".format(
                service_address_recipient.value_string,
                service_address_recipient.confidence,
            )
        )
    remittance_address = invoice.fields.get("RemittanceAddress")
    if remittance_address:
        print(
            "Remittance Address: {} has confidence: {}".format(
                remittance_address.value_address, remittance_address.confidence
            )
        )
    remittance_address_recipient = invoice.fields.get("RemittanceAddressRecipient")
    if remittance_address_recipient:
        print(
            "Remittance Address Recipient: {} has confidence: {}".format(
                remittance_address_recipient.value_string,
                remittance_address_recipient.confidence,
            )
        )
    print("----------------------------------------")


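## Applying these ideas to Donut and similar models

### 📚 Key similarities and differences between Donut and Azure

| Dimension                  | Azure Document Intelligence       | Donut                                                        |
|----------------------------|-----------------------------------|--------------------------------------------------------------|
| Input                      | Post-OCR coordinates + tokens     | Raw image                                                    |
| Multilingual support       | Strong (50+ languages)            | Requires fine-tuning                                         |
| Table/field support        | Structured fields in JSON         | Outputs structured JSON; custom fields can be defined freely |
| Model architecture         | LayoutLM / LayoutLMv3 (presumed)  | VisionEncoderDecoder (Swin + BART)                           |
| Reasoning logic (e.g. tax) | Partly inferred automatically     | Needs extra post-processing or LLM assistance                |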


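### ✅ Training Donut along the lines of the Azure approach

1. Define your own JSON schema (modeled on the Azure output format)
2. Prepare training data for Donut (image + matching JSON)
3. Fine-tune Donut in prompt mode, for example:

`<s_invoice>{json}</s>`

Here `{json}` stands for the serialized key-value pairs, e.g. `invoice_no: value`, `other: xxx`.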
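
To make the `<s_invoice>{json}</s>` target concrete, here is a minimal sketch of how a JSON annotation could be flattened into a tagged sequence. It is loosely modeled on the `json2token` idea from the Donut reference implementation, but the function names `json_to_token_sequence` and `build_target` are hypothetical, and a real training setup would typically also register each `<s_key>` tag with the tokenizer, which is not shown here.

```python
def json_to_token_sequence(obj):
    """Flatten nested JSON into a Donut-style tag sequence.

    Dict keys become <s_key>...</s_key> pairs, list items are joined with
    <sep/>, and leaf values are emitted as plain text.
    """
    if isinstance(obj, dict):
        return "".join(
            f"<s_{key}>{json_to_token_sequence(value)}</s_{key}>"
            for key, value in obj.items()
        )
    if isinstance(obj, list):
        return "<sep/>".join(json_to_token_sequence(item) for item in obj)
    return str(obj)


def build_target(annotation, task_token="<s_invoice>", eos_token="</s>"):
    """Wrap the flattened annotation in the task start token and the EOS token."""
    return f"{task_token}{json_to_token_sequence(annotation)}{eos_token}"


example = {
    "InvoiceNo": "INV-12345",
    "Items": [{"Description": "Consulting", "Amount": 300.0}],
}
print(build_target(example))
```

Flattening to tags rather than raw JSON text keeps the output free of quotes and braces, so each field boundary is a single, unambiguous marker.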

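To make Donut handle a complex invoice scenario (different vendors, multiple languages, multiple formats) and accurately extract the fine-grained fields you need (invoice number, date, amount, currency, tax, payment details, and so on), the whole pipeline needs careful tuning across data design, understanding of the model architecture, training strategy, and post-processing.


### ✅ 1. Define the target output fields (schema design)

Start by defining a clear, consistent field structure (modeling it on Azure's schema is recommended):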


```
{
  "InvoiceNo": "",
  "InvoiceDate": "",
  "Currency": "",
  "TotalAmount": 0.0,
  "TaxAmount": 0.0,
  "NetAmount": 0.0,
  "VendorName": "",
  "CustomerName": "",
  "PaymentTerm": "",
  "BankInfo": {
    "BankName": "",
    "AccountNumber": "",
    "Contact": ""
  },
  "Items": [
    {
      "Description": "",
      "PONumber": "",
      "Quantity": 0,
      "UnitPrice": 0.0,
      "Amount": 0.0
    }
  ]
}

```



Keeping the fields identical across training and validation data greatly improves the quality of the model's output.
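
One way to enforce that consistency is to check every annotation against a template before training. The following is a minimal sketch under stated assumptions: `TEMPLATE` is a shortened, illustrative version of the schema above, and `find_schema_problems` is a hypothetical helper, not part of any library.

```python
# Shortened template: the value of each key indicates the expected type.
TEMPLATE = {
    "InvoiceNo": "",
    "TotalAmount": 0.0,
    "BankInfo": {"BankName": ""},
    "Items": [{"Description": "", "Quantity": 0}],
}


def find_schema_problems(value, template, path="$"):
    """Return a sorted list of mismatches between an annotation and the template."""
    problems = []
    if isinstance(template, dict):
        if not isinstance(value, dict):
            return [f"{path}: expected object"]
        for key in template.keys() - value.keys():
            problems.append(f"{path}.{key}: missing")
        for key in value.keys() - template.keys():
            problems.append(f"{path}.{key}: unexpected")
        for key in template.keys() & value.keys():
            problems += find_schema_problems(value[key], template[key], f"{path}.{key}")
    elif isinstance(template, list):
        if not isinstance(value, list):
            return [f"{path}: expected array"]
        # Every list item is checked against the single template item.
        for index, item in enumerate(value):
            problems += find_schema_problems(item, template[0], f"{path}[{index}]")
    elif isinstance(template, (int, float)):
        # bool is a subclass of int, so it is rejected explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"{path}: expected number")
    elif not isinstance(value, type(template)):
        problems.append(f"{path}: expected {type(template).__name__}")
    return sorted(problems)


annotation = {
    "InvoiceNo": "INV-1",
    "TotalAmount": "300",  # wrong type: string instead of number
    "Items": [{"Description": "Widget", "Quantity": 2, "Color": "red"}],
}
for problem in find_schema_problems(annotation, TEMPLATE):
    print(problem)
```

Running a check like this over the whole annotation folder before training surfaces missing keys and type drift early, when they are still cheap to fix.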

Field values fall into three kinds of text:

1. Displayed (explicit) text
2. Inferred text
3. Other text

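### 🧾 2. Preparing diverse, high-quality training data (most critical)

| Dimension            | Requirement                                                                                              |
|----------------------|----------------------------------------------------------------------------------------------------------|
| Vendor diversity     | Different countries, different layouts, widely varying company names                                     |
| Multiple languages   | A mix of Chinese/English/Japanese/Korean invoice styles, including Traditional Chinese and English invoices |
| Multiple currencies  | Include RMB, USD, JPY, and EUR so the model does not overfit to one currency                             |
| Multiple formats     | Horizontal and vertical text, plus messy layouts with tables in both header and footer                   |
| Implicit information | Fields that are not stated explicitly and must be inferred (e.g. tax, currency)                          |

Annotation recommendations:

- Use real images (at least 200 recommended)
- Pair each image with one clean JSON file (strictly consistent format)
- Possible tools:
  - Label Studio (for image + JSON annotation)
  - A custom script (PDF to image + JSON pairs)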

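### 3. Prompt design + Donut training tips

Donut is an encoder-decoder model. During training the input is an image and the output is the prompt followed by JSON-formatted text.

Recommended prompt format: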


```
<s_invoice><s_InvoiceNo>...</s_InvoiceNo><s_InvoiceDate>...</s_InvoiceDate><s_Currency>...</s_Currency></s>
```



### Training recommendations:

| Parameter             | Recommended setting                      |
|-----------------------|------------------------------------------|
| batch size            | 2–8 (limited by GPU memory)              |
| learning rate         | 3e-5 or 5e-5                             |
| epochs                | 10–30 (watch for the loss to stabilize)  |
| early stopping        | Enabled                                  |
| tokenizer truncation  | max length = 768–1024                    |
| image size            | 1280 × 960 (keeps invoice text legible)  |
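
The early-stopping row in the table can be sketched independently of any training framework. The class below is a minimal illustration, assuming validation loss is the monitored metric; `EarlyStopping`, `patience`, and `min_delta` are hypothetical names chosen for this sketch, and the loss values are made up.

```python
class EarlyStopping:
    """Signal a stop once validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one validation loss; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


stopper = EarlyStopping(patience=2)
made_up_losses = [1.20, 0.90, 0.85, 0.86, 0.87, 0.80]
for epoch, loss in enumerate(made_up_losses, start=1):
    if stopper.step(loss):
        print(f"stopping after epoch {epoch}, best validation loss {stopper.best}")
        break
```

Note that with `patience=2` the loop stops before reaching the last, better value; a larger patience trades extra compute for tolerance to noisy validation curves.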

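### Augmentation strategies:

- Mixed multilingual training (Chinese and English images together)
- Image augmentation (rotation of ±5°, added noise, color jitter)
- Robustness training with shuffled labels (randomize field order to improve generalization)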
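
The label-shuffling idea from the list above can be sketched in a few lines. This is an illustration only: `shuffle_fields` is a hypothetical helper, it reorders dictionary keys recursively, and it deliberately leaves list order (for example the order of line items) untouched.

```python
import random


def shuffle_fields(label, rng):
    """Return a copy of `label` with dict keys in random order, recursively.

    List order (e.g. invoice line items) is preserved.
    """
    if isinstance(label, dict):
        keys = list(label)
        rng.shuffle(keys)
        return {key: shuffle_fields(label[key], rng) for key in keys}
    if isinstance(label, list):
        return [shuffle_fields(item, rng) for item in label]
    return label


rng = random.Random(0)  # seeded so runs are reproducible
label = {
    "InvoiceNo": "INV-1",
    "Currency": "USD",
    "Items": [{"Description": "A", "Amount": 1.0}],
}
augmented = shuffle_fields(label, rng)
print(list(augmented))
# The content is unchanged; only the key order may differ.
assert augmented == label
```

Applying this each time a sample is drawn means the model sees the same invoice with different field orders across epochs.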

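### 4. Post-processing: an inference-enhancement module (external logic)

Donut is well suited to extracting fields that are visible on the page; the implicit fields below need a post-processing module to derive them:

| Field               | Suggested logic                                                               |
|---------------------|-------------------------------------------------------------------------------|
| Tax (TaxAmount)     | Compute TotalAmount - NetAmount                                               |
| Currency            | Infer from a column name such as "Unit Price USD" or from the currency symbol |
| Total (TotalAmount) | If there is no separate field, sum the line items                             |
| Date normalization  | Convert "30 JUN 2025" → 2025-06-30                                            |

A Python script can run after prediction to fill in and clean up the fields.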
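
A minimal sketch of such a script is shown below, covering the four rules in the table. All names (`postprocess`, `normalize_date`, `infer_currency`, `CURRENCY_HINTS`, `DATE_FORMATS`) are hypothetical, the hint table is deliberately tiny, and treating "$" as USD is an assumption that will not hold for every invoice. Month-name parsing with `%b` also assumes an English locale.

```python
from datetime import datetime

# Illustrative hints only; "$" is assumed to mean USD here.
CURRENCY_HINTS = {"USD": "USD", "$": "USD", "RMB": "CNY", "CNY": "CNY", "EUR": "EUR", "€": "EUR", "JPY": "JPY"}
DATE_FORMATS = ("%Y-%m-%d", "%d %b %Y", "%d/%m/%Y")


def normalize_date(text):
    """Return an ISO date if one of the known formats matches, else the input."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return text


def infer_currency(text):
    """Return the first currency code whose hint appears in the text, else None."""
    upper = text.upper()
    for hint, code in CURRENCY_HINTS.items():
        if hint in upper:
            return code
    return None


def postprocess(pred, context_text=""):
    """Fill in and clean derived fields on a copy of the model prediction."""
    out = dict(pred)
    if out.get("InvoiceDate"):
        out["InvoiceDate"] = normalize_date(out["InvoiceDate"])
    if not out.get("Currency"):
        out["Currency"] = infer_currency(context_text)
    if out.get("TotalAmount") is None and out.get("Items"):
        out["TotalAmount"] = round(sum(item.get("Amount", 0.0) for item in out["Items"]), 2)
    if out.get("TaxAmount") is None and None not in (out.get("TotalAmount"), out.get("NetAmount")):
        out["TaxAmount"] = round(out["TotalAmount"] - out["NetAmount"], 2)
    return out


prediction = {"InvoiceDate": "30 JUN 2025", "TotalAmount": 330.0, "NetAmount": 300.0}
print(postprocess(prediction, context_text="Unit Price USD"))
```

Keeping these rules outside the model makes them easy to audit and to change per vendor without retraining.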

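### 5. Validation and evaluation strategy

| Method                    | Description                                                                 |
|---------------------------|-----------------------------------------------------------------------------|
| Field-by-field comparison | Compare each JSON field individually for a match                            |
| Structured-field accuracy | Compare the Items table row by row (field values × number of rows)          |
| BLEU / edit distance      | Text-generation metrics that score the overall accuracy of the JSON output |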
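
The three rows of the table can be approximated with a few small functions. This is a simplified sketch: the names are hypothetical, field comparison is exact match on top-level keys, rows are compared by position, and only edit distance is implemented (BLEU is omitted).

```python
import json


def field_accuracy(pred, gold):
    """Fraction of top-level gold fields whose predicted value matches exactly."""
    if not gold:
        return 1.0
    hits = sum(1 for key, value in gold.items() if pred.get(key) == value)
    return hits / len(gold)


def row_accuracy(pred_rows, gold_rows):
    """Fraction of gold rows matched exactly by the prediction at the same index."""
    if not gold_rows:
        return 1.0
    hits = sum(1 for p, g in zip(pred_rows, gold_rows) if p == g)
    return hits / len(gold_rows)


def edit_distance(a, b):
    """Levenshtein distance computed with a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def normalized_similarity(pred, gold):
    """1 minus the edit distance between canonical JSON strings, scaled to [0, 1]."""
    a = json.dumps(pred, sort_keys=True, ensure_ascii=False)
    b = json.dumps(gold, sort_keys=True, ensure_ascii=False)
    longest = max(len(a), len(b)) or 1
    return 1.0 - edit_distance(a, b) / longest


gold = {"InvoiceNo": "INV-1", "Currency": "USD", "TotalAmount": 300.0}
pred = {"InvoiceNo": "INV-1", "Currency": "USD", "TotalAmount": 30.0}
print("field accuracy:", field_accuracy(pred, gold))
print("similarity:", normalized_similarity(pred, gold))
```

Serializing with `sort_keys=True` keeps the similarity score independent of field order, which matters if labels are shuffled during training.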

### 6. Suggested file structure (multilingual invoice project)



```
project_root/
├── images/
│   ├── invoice001.jpg
│   └── invoice002.jpg
├── annotations/
│   ├── invoice001.json
│   └── invoice002.json
├── train.json (or train.jsonl)
├── val.json
├── processor_config.json
├── train.py
├── inference.py
├── postprocess.py
└── README.md
```
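
Given the `images/` and `annotations/` folders in the layout above, a manifest such as `train.jsonl` could be generated with a short script. The sketch below is an assumption-laden illustration: `build_manifest` is a hypothetical helper, and the record layout (`file_name` plus `ground_truth`) is one plausible convention, not a format required by any particular training script.

```python
import json
import tempfile
from pathlib import Path


def build_manifest(root, out_name="train.jsonl"):
    """Write one JSON line per annotated image and return the number of records."""
    root = Path(root)
    count = 0
    with (root / out_name).open("w", encoding="utf-8") as out:
        for image in sorted((root / "images").glob("*.jpg")):
            annotation = root / "annotations" / f"{image.stem}.json"
            if not annotation.exists():
                continue  # skip images that have not been annotated yet
            record = {
                "file_name": f"images/{image.name}",
                "ground_truth": json.loads(annotation.read_text(encoding="utf-8")),
            }
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count


# Demo on a throwaway directory with one annotated and one unannotated image.
with tempfile.TemporaryDirectory() as tmp:
    demo_root = Path(tmp)
    (demo_root / "images").mkdir()
    (demo_root / "annotations").mkdir()
    (demo_root / "images" / "invoice001.jpg").write_bytes(b"")
    (demo_root / "images" / "invoice002.jpg").write_bytes(b"")
    (demo_root / "annotations" / "invoice001.json").write_text('{"InvoiceNo": "INV-1"}', encoding="utf-8")
    print("records written:", build_manifest(demo_root))
```

Pairing by file stem keeps the script independent of how the annotations were produced.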



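### 7. Lessons from the design of Azure's invoice model

Azure's design suggests the following:

| Azure model feature                    | How to apply it to Donut training                                                              |
|----------------------------------------|------------------------------------------------------------------------------------------------|
| Explicit output schema                 | Design the field structure before training and keep it consistent                             |
| Multi-language, multi-currency support | Cover invoice templates in different languages in the data                                     |
| Table structure support                | Write the Items structure into the JSON and keep the table fields uniform                      |
| Confidence control                     | Donut provides no field-level confidence scores; add post-processing to detect anomalous values |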
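
As a stand-in for confidence scores, arithmetic consistency checks can flag suspicious predictions. The sketch below is a minimal illustration using the field names from the schema in section 1; `find_anomalies` and the 0.01 tolerance are hypothetical choices.

```python
def find_anomalies(invoice, tolerance=0.01):
    """Return a list of arithmetic inconsistencies found in a predicted invoice."""
    issues = []
    for index, item in enumerate(invoice.get("Items") or []):
        qty, price, amount = item.get("Quantity"), item.get("UnitPrice"), item.get("Amount")
        # Only check rows where all three values were extracted.
        if None not in (qty, price, amount) and abs(qty * price - amount) > tolerance:
            issues.append(f"Items[{index}]: Quantity x UnitPrice != Amount")
    net, tax, total = invoice.get("NetAmount"), invoice.get("TaxAmount"), invoice.get("TotalAmount")
    if None not in (net, tax, total) and abs(net + tax - total) > tolerance:
        issues.append("NetAmount + TaxAmount != TotalAmount")
    return issues


predicted = {
    "NetAmount": 300.0,
    "TaxAmount": 30.0,
    "TotalAmount": 330.0,
    "Items": [
        {"Quantity": 100, "UnitPrice": 3.0, "Amount": 300.0},
        {"Quantity": 2, "UnitPrice": 5.0, "Amount": 100.0},  # misread amount
    ],
}
print(find_anomalies(predicted))
```

Invoices that fail a check can be routed to manual review, playing a role similar to a low confidence score.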

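# Adapting Donut to multilingual, multi-format invoice information extraction

Donut is an end-to-end document understanding model built on a vision Transformer (Swin Transformer) plus a text generator (Seq2Seq). To handle invoices in **multiple languages** such as Chinese and English and in **different layouts**, it needs to be optimized along several dimensions: model architecture, parameter settings, training data, and training strategy.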

---

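## 1. Field definitions and extraction targets

| Field        | Type                 | Notes                                                                                              |
|--------------|----------------------|----------------------------------------------------------------------------------------------------|
| InvoiceNo    | Explicit             | Mostly letters or digits                                                                           |
| InvoiceDate  | Explicit             | Often in mixed-language formats (e.g. dates mixing Chinese and English)                            |
| Currency     | Inferred             | Must be inferred from context such as column names and amount units (e.g. "RMB", "$", "USD")       |
| TotalAmount  | Explicit             | May appear in the summary row, or be the sum of the line amounts                                   |
| TaxAmount    | Inferred             | May be missing; derive it from the difference between amounts or from the tax rate                 |
| NetAmount    | Optional             | If present, helps infer the tax or the total                                                       |
| Items        | Table                | Usually mixed-language content; may include item name, unit price, quantity, and subtotal          |
| BankInfo     | Multi-line paragraph | Can be structured (account name, bank name, account number, etc.) or extracted as one block of text |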

---

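## 2. Donut model parameters explained (vision encoder)

| Parameter | Default | Explanation |
|-----------|---------|-------------|
| `image_size` | 224 | Input image size (raising it to 512 or 768 is recommended to preserve invoice detail) |
| `patch_size` | 4 | Size of each patch; smaller patches capture finer detail |
| `num_channels` | 3 | Number of input image channels (3 for RGB) |
| `embed_dim` | 96 | Dimension of the patch embeddings (raising it to 192+ is recommended) |
| `depths` | [2, 2, 6, 2] | Number of Transformer blocks in each encoder stage (deepening is recommended) |
| `num_heads` | [3, 6, 12, 24] | Number of attention heads in each stage (scales with `embed_dim`) |
| `window_size` | 7 | Size of the local attention window |
| `mlp_ratio` | 4.0 | Ratio of the MLP hidden dimension to the embedding dimension |
| `qkv_bias` | True | Whether to add a learnable bias to Query/Key/Value |
| `hidden_dropout_prob` | 0.0 | Dropout for fully connected layers (can be raised during training) |
| `attention_probs_dropout_prob` | 0.0 | Dropout on attention probabilities |
| `drop_path_rate` | 0.1 | Stochastic depth rate (a moderate increase is recommended) |
| `hidden_act` | "gelu" | Activation function; common choices include "gelu" and "relu" |
| `use_absolute_embeddings` | False | Whether to add absolute position embeddings (relative position bias is used by default) |
| `initializer_range` | 0.02 | Standard deviation for weight initialization |
| `layer_norm_eps` | 1e-5 | Epsilon value for LayerNorm |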

---

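## 3. Optimization suggestions for multilingual, multi-format adaptation

### ✅ Suggested architecture adjustments

| Goal | Recommended adjustment |
|------|------------------------|
| Stronger document parsing | `image_size=512`, `embed_dim=192`, `depths=[2,2,18,2]` |
| Preserve complex layout structure | Use a small `patch_size=4` and a suitable `window_size=7` |
| Better generalization and stability | Increase `drop_path_rate` and the dropout probabilities (`hidden_dropout_prob`, `attention_probs_dropout_prob`) to avoid overfitting |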

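### ✅ Data and tokenizer handling

- Training data must cover multiple languages and formats (mixed Chinese/English text, invoices with different layouts)
- Standardize the ground truth as structured JSON with consistent fields
- Use a tokenizer with multilingual support:
  ```python
  from transformers import AutoTokenizer
  tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
  ```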


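### ✅ Training strategy

- Fine-tune from donut-base or donut-docvqa
- Optionally freeze the vision encoder first and train only the text decoder
- For inferred fields such as Currency / Tax, add post-processing logic or an auxiliary supervised task
- Add data augmentation such as font perturbation, OCR noise, and image compression

### 📦 Recommended configuration example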

In [None]:
donut_config = {
    "image_size": 512,
    "patch_size": 4,
    "embed_dim": 192,
    "depths": [2, 2, 18, 2],
    "num_heads": [3, 6, 12, 24],
    "window_size": 7,
    "mlp_ratio": 4.0,
    "qkv_bias": True,
    "drop_path_rate": 0.2,
    "hidden_dropout_prob": 0.1,
    "attention_probs_dropout_prob": 0.1,
    "initializer_range": 0.02,
    "layer_norm_eps": 1e-5,
}
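
Before training with a modified encoder configuration, it can help to sanity-check that the numbers fit together. The sketch below assumes the usual Swin convention that the channel width doubles at every stage, starting from `embed_dim`; `describe_stages` and `final_grid_size` are hypothetical helpers, and `example_config` simply repeats the relevant values from the configuration above.

```python
example_config = {
    "image_size": 512,
    "patch_size": 4,
    "embed_dim": 192,
    "depths": [2, 2, 18, 2],
    "num_heads": [3, 6, 12, 24],
}


def describe_stages(config):
    """Return (hidden_dim, num_heads, head_dim) per stage, validating divisibility."""
    if len(config["depths"]) != len(config["num_heads"]):
        raise ValueError("depths and num_heads must have one entry per stage")
    stages = []
    for index, heads in enumerate(config["num_heads"]):
        dim = config["embed_dim"] * 2 ** index  # width doubles at each stage
        if dim % heads != 0:
            raise ValueError(f"stage {index}: width {dim} is not divisible by {heads} heads")
        stages.append((dim, heads, dim // heads))
    return stages


def final_grid_size(config):
    """Side length of the last-stage feature map for a square input image."""
    downsample = config["patch_size"] * 2 ** (len(config["depths"]) - 1)
    if config["image_size"] % downsample != 0:
        raise ValueError("image_size should be divisible by the total downsampling factor")
    return config["image_size"] // downsample


for stage, (dim, heads, head_dim) in enumerate(describe_stages(example_config)):
    print(f"stage {stage}: width={dim}, heads={heads}, head_dim={head_dim}")
print("final feature map side:", final_grid_size(example_config))
```

Keep in mind that widths or depths that differ from a pretrained checkpoint generally prevent that checkpoint's encoder weights from being loaded unchanged, so such changes usually imply training the encoder from scratch.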