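## Two approaches to automatic summarization

1. Extractive: selects the most important sentences or phrases directly from the source text and concatenates them, so every word in the summary already appears in the original.
2. Abstractive: first builds an understanding of the whole article, then expresses its main content in short, coherent language. The summary may therefore contain words and sentences that never appear in the original.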

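Summary quality is usually evaluated with ROUGE (Recall-Oriented Understudy for Gisting Evaluation).

ROUGE mainly consists of the following four metrics:
1. ROUGE-N, based on n-gram co-occurrence statistics;
2. ROUGE-L, based on the longest common subsequence (LCS);
3. ROUGE-S, based on skip-bigram statistics (ordered word pairs that may have gaps between them);
4. ROUGE-W, a weighted variant of ROUGE-L that rewards consecutive matches.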
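To make the n-gram co-occurrence idea concrete, here is a minimal sketch of ROUGE-N recall in plain Python. It assumes whitespace tokenization and lowercasing, which real ROUGE toolkits replace with their own preprocessing. The function names `ngrams` and `rouge_n_recall` are chosen for this illustration only.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return a Counter of all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n=1):
    """Clipped n-gram overlap divided by the number of n-grams in the reference."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    if not ref:
        return 0.0
    # Each reference n-gram can be matched at most as often as it occurs in the candidate
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / sum(ref.values())

reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
print(rouge_n_recall(candidate, reference, n=1))  # 5 of 6 reference unigrams matched
print(rouge_n_recall(candidate, reference, n=2))  # 3 of 5 reference bigrams matched
```

A single substituted word costs one unigram but two bigrams, which is why ROUGE-2 is more sensitive to local phrasing than ROUGE-1.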


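## BART: principles and characteristics

BART stands for Bidirectional and Auto-Regressive Transformers. It combines the strengths of BERT and GPT: like BERT it encodes the input bidirectionally, and like GPT it decodes autoregressively. BART consists of an encoder and a decoder. The encoder builds a contextual representation of the input text, and the decoder generates the target text from the encoder's output. Its main characteristics are:
1. Bidirectional encoding: the encoder is a bidirectional Transformer, so every token attends to both its left and right context, which improves understanding of the input.
2. Autoregressive decoding: the decoder generates the target one token at a time, each conditioned on the tokens already produced, which helps keep the output coherent and consistent.
3. Denoising pre-training with several noising schemes: during pre-training the input is corrupted in multiple ways, such as text infilling and sentence permutation, which improves generalization.
4. Flexible input and output formats: BART handles a variety of input/output formats and applies to many NLP tasks, such as text generation, summarization, and translation.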

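The figure below shows the BART model architecture:

![BART model architecture](https://static001.geekbang.org/resource/image/a6/99/a663e08e28803d6059aae93fea1a0699.png?wh=1898x778)

Structurally, BART looks much like a standard Transformer. The main difference lies in pre-training: the original text is first corrupted with several kinds of noise before it is fed to the encoder, and the decoder is then trained to reconstruct the original text.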

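1. Text corruption schemes
- Token Masking: randomly selected tokens are replaced with a mask token, as in BERT's masked-token task.
- Token Deletion: randomly selected tokens are removed from the input, so the model must also work out which positions are missing.
- Text Infilling: contiguous spans are each replaced with a single mask token, and the model fills in the missing text from context, including how many tokens are missing.
- Sentence Permutation: the sentences of the text are shuffled, and the model restores the correct order.
- Document Rotation: a token is picked at random and the document is rotated so that it begins with that token. The model has to identify the true start of the document.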
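The noising functions described in the BART paper can be sketched on a plain token list. This is an illustrative toy, not BART's actual pre-processing code: the helper names are made up for this example, the `<mask>` string stands in for the tokenizer's mask token, and the span in `text_infilling` is passed explicitly rather than sampled (the paper samples span lengths from a Poisson distribution).

```python
import random

MASK = "<mask>"  # placeholder for the tokenizer's mask token

def token_masking(tokens, p, rng):
    """Replace each token with <mask> independently with probability p."""
    return [MASK if rng.random() < p else t for t in tokens]

def token_deletion(tokens, p, rng):
    """Drop each token independently with probability p."""
    return [t for t in tokens if rng.random() >= p]

def text_infilling(tokens, start, length):
    """Replace the span tokens[start:start+length] with a single <mask>."""
    return tokens[:start] + [MASK] + tokens[start + length:]

def sentence_permutation(sentences, rng):
    """Return the sentences in a random order."""
    shuffled = sentences[:]
    rng.shuffle(shuffled)
    return shuffled

def document_rotation(tokens, start):
    """Rotate the document so that it begins at position `start`."""
    return tokens[start:] + tokens[:start]

rng = random.Random(0)
tokens = "she got married again in Westchester County".split()
print(token_masking(tokens, 0.3, rng))
print(token_deletion(tokens, 0.3, rng))
print(text_infilling(tokens, start=2, length=3))
print(sentence_permutation(["First.", "Second.", "Third."], rng))
print(document_rotation(tokens, start=4))
```

In every case the training target is the uncorrupted text, so the same reconstruction loss covers all of the schemes.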

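BART is built and pre-trained as a sequence-to-sequence model, so it is a natural fit for sequence generation tasks such as question answering, text summarization, and machine translation. Besides its gains on generation tasks, it also performs well on a number of text understanding tasks.

BART's pre-training objective and its relation to other tasks:
- Text reconstruction: this is the pre-training objective itself. The model reconstructs the original text from the corrupted input.
- Sequence-to-sequence tasks: the model converts an input sequence into an output sequence, as in translation. Reconstruction is already a task of this form, which is why the pre-trained model transfers well to such tasks during fine-tuning.
- Language modeling: the model predicts the next token or fills in missing tokens. The autoregressive decoder does the former, and the masked inputs require the latter.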



In [20]:
import torch
print(torch.__version__)

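# The summarizer below is used for text summarization. Recent Hugging Face Transformers releases (as of October 2025) require PyTorch >= 2.6.0 because older versions are affected by a security vulnerability (CVE-2025-32434)
# The cu121 builds only go up to 2.5.1, so upgrading to CUDA 12.4 (cu124) is needed to run PyTorch >= 2.6.0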

2.6.0+cu124


In [21]:
from transformers import pipeline

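# Build a summarization pipeline. The pipeline downloads and caches a pre-trained summarization model automatically.
# With no model specified, the default is a distilled BART model trained on the CNN/Daily Mail dataset (see the log output below).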
summarizer = pipeline("summarization", device=3)

ARTICLE = """ New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County, New York.
A year later, she got married again in Westchester County, but to a different man and without divorcing her first husband.
Only 18 days after that marriage, she got hitched yet again. Then, Barrientos declared "I do" five more times, sometimes only within two weeks of each other.
In 2010, she married once more, this time in the Bronx. In an application for a marriage license, she stated it was her "first and only" marriage.
Barrientos, now 39, is facing two criminal counts of "offering a false instrument for filing in the first degree," referring to her false statements on the
2010 marriage license application, according to court documents.
Prosecutors said the marriages were part of an immigration scam.
On Friday, she pleaded not guilty at State Supreme Court in the Bronx, according to her attorney, Christopher Wright, who declined to comment further.
After leaving court, Barrientos was arrested and charged with theft of service and criminal trespass for allegedly sneaking into the New York subway through an emergency exit, said Detective
Annette Markowski, a police spokeswoman. In total, Barrientos has been married 10 times, with nine of her marriages occurring between 1999 and 2002.
All occurred either in Westchester County, Long Island, New Jersey or the Bronx. She is believed to still be married to four men, and at one time, she was married to eight men at once, prosecutors say.
Prosecutors said the immigration scam involved some of her husbands, who filed for permanent residence status shortly after the marriages.
Any divorces happened only after such filings were approved. It was unclear whether any of the men will be prosecuted.
The case was referred to the Bronx District Attorney\'s Office by Immigration and Customs Enforcement and the Department of Homeland Security\'s
Investigation Division. Seven of the men are from so-called "red-flagged" countries, including Egypt, Turkey, Georgia, Pakistan and Mali.
Her eighth husband, Rashid Rajput, was deported in 2006 to his native Pakistan after an investigation by the Joint Terrorism Task Force.
If convicted, Barrientos faces up to four years in prison.  Her next court appearance is scheduled for May 18.
"""

print(summarizer(ARTICLE, max_length=130, min_length=30))


No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cuda:3


[{'summary_text': ' Liana Barrientos, 39, is charged with two counts of "offering a false instrument for filing in the first degree" In total, she has been married 10 times, with nine of her marriages occurring between 1999 and 2002 . At one time, she was married to eight men at once, prosecutors say .'}]


In [22]:
from transformers import BartTokenizer, BartForConditionalGeneration

device = "cuda:3" if torch.cuda.is_available() else "cpu"

# Instantiate a BART model and its tokenizer
model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn').to(device)
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
# https://huggingface.co/facebook/bart-large-cnn

# Tokenize the raw text ARTICLE and get the token ids as a tensor. return_tensors='pt' returns PyTorch tensors.
# Passing truncation=True as well would make the truncation to max_length explicit.
inputs = tokenizer([ARTICLE], max_length=1024, return_tensors='pt').to(device)

# Generate the summary
# Pay attention to the generation length and early-stopping settings
summary_ids = model.generate(inputs['input_ids'], max_length=300, early_stopping=False)
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print(summary)

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


Liana Barrientos, 39, is charged with two counts of offering a false instrument for filing in the first degree. In total, she has been married 10 times, with nine of her marriages occurring between 1999 and 2002. She is believed to still be married to four men, and at one time, she was married to eight men at once. If convicted, she faces up to four years in prison.


# Fine-tuning BART for text summarization

Train the BART model on your own dataset.



In [23]:
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')

## Data preparation

We load the dataset with the Hugging Face `datasets` library (`datasets.load_dataset`), which is what the code below uses.

In [24]:
import datasets
train_dataset = datasets.load_dataset("imdb", split="train")
print(train_dataset.column_names)
# print first 3 samples
for i in range(3):
    print(f"Sample {i}:")
    print("Text:", train_dataset[i]['text'])
    print("Label:", train_dataset[i]['label'])
    print()

['text', 'label']
Sample 0:
Text: I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity s

In [25]:
# Train/validation/test file paths can be specified through the data_files argument of load_dataset
# Use map to preprocess the dataset

def add_prefix(example):
    example['text'] = 'My sentence: ' + example['text']
    return example
# updated_dataset = dataset.map(add_prefix)
# updated_dataset['train']['text'][:5]
'''
Example output:
['My sentence: Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
"My sentence: Yucaipa owned Dominick 's before selling the chain to Safeway in 1998 for $ 2.5 billion .",
'My sentence: They had published an advertisement on the Internet on June 10 , offering the cargo for sale , he added .',
'My sentence: Around 0335 GMT , Tab shares were up 19 cents , or 4.4 % , at A $ 4.56 , having earlier set a record high of A $ 4.57 .',
]
'''

from transformers.models.bart.modeling_bart import shift_tokens_right
from functools import partial

# Build a DatasetDict with text and summary fields from the existing train_dataset (IMDB)
def make_summary(example):
    first_sentence = example['text'].split('. ')[0].strip()
    example['summary'] = first_sentence if first_sentence else example['text'][:200]
    return example

# Take a subset so the example runs faster
_train = train_dataset.select(range(5000)).map(make_summary)
_val = train_dataset.shuffle(seed=42).select(range(1000)).map(make_summary)

dataset = datasets.DatasetDict({
    "train": _train.remove_columns(["label"]),
    "validation": _val.remove_columns(["label"]),
})

def convert_to_features(example_batch):
    input_encodings = tokenizer.batch_encode_plus(
        example_batch['text'],
        padding='max_length',
        max_length=1024,
        truncation=True
    )
    target_encodings = tokenizer.batch_encode_plus(
        example_batch['summary'],
        padding='max_length',
        max_length=128,
        truncation=True
    )
    
    labels = torch.tensor(target_encodings['input_ids'])
    # Build the decoder inputs by shifting the targets one position to the right,
    # starting with the decoder_start_token_id that BART expects
    decoder_input_ids = shift_tokens_right(
        labels,
        model.config.pad_token_id,
        model.config.decoder_start_token_id,
    )
    # Padding positions set to -100 are ignored by the loss
    labels[labels == model.config.pad_token_id] = -100
    
    encodings = {
        'input_ids': torch.tensor(input_encodings['input_ids']),
        'attention_mask': torch.tensor(input_encodings['attention_mask']),
        'decoder_input_ids': decoder_input_ids,
        'labels': labels,
    }
    return encodings

dataset = dataset.map(convert_to_features, batched=True, remove_columns=['text', 'summary'])
columns = ['input_ids', 'labels', 'decoder_input_ids', 'attention_mask']
dataset.set_format(type='torch', columns=columns)


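### Code flow and logic

The main purpose of this code is to **build a pseudo-summarization dataset for fine-tuning BART** and convert it into the tensor format the model accepts.

#### 1. Data construction: IMDB -> (Text, Summary)
IMDB is a sentiment classification dataset with only `text` and `label` fields, so the code uses a rule to manufacture a "summarization" task:
- **Logic (`make_summary` function)**:
  - Take the first sentence of `text` (split on `. `) as the `summary`.
  - If that first segment is empty, fall back to the first 200 characters of `text`. A text that contains no `. ` at all is used as its own summary in full.
- **Operations**:
  - `train_dataset.select(...)`: to keep the demo fast, only the first 5,000 examples are used for training. The 1,000 validation examples are drawn from a shuffled copy of the same train split, so they may overlap with the training subset.
  - `.map(make_summary)`: applies the rule above to every example and adds the `summary` field.
  - `DatasetDict`: wraps the processed subsets back into a standard Hugging Face dataset dictionary.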
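The behavior of the first-sentence rule is easy to check on a few hand-written examples. The sketch below repeats the `make_summary` logic from the cell above on plain dictionaries; the sample texts are invented for illustration.

```python
def make_summary(example):
    """Use the text up to the first '. ' as the summary, with a 200-character fallback."""
    first_sentence = example["text"].split(". ")[0].strip()
    example["summary"] = first_sentence if first_sentence else example["text"][:200]
    return example

samples = [
    {"text": "Great film. I would watch it again."},
    {"text": "No sentence break here"},
    {"text": ". Starts with a separator. Then continues."},
]
for sample in samples:
    print(repr(make_summary(dict(sample))["summary"]))
```

The second sample shows that a text without any `. ` becomes its own summary, and the third shows the only case where the 200-character fallback is used: when the segment before the first `. ` is empty.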

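#### 2. Feature engineering: Text -> Model Inputs
BART is an encoder-decoder architecture, and training needs three kinds of input:
1. `input_ids`: the encoder input (the source text).
2. `decoder_input_ids`: the decoder input (the target summary, shifted right by one position).
3. `labels`: the decoder's prediction targets (the target summary, used to compute the loss).

**The `convert_to_features` function in detail**:
- **Tokenization (`tokenizer.batch_encode_plus`)**:
  - `input_encodings`: processes the source `text`, padded/truncated to length 1024.
  - `target_encodings`: processes the `summary`, padded/truncated to length 128.
- **Building the decoder input (`shift_tokens_right`)**:
  - BART is trained with **teacher forcing**: the decoder input is the target sequence shifted right by one position (i.e. `<start> token1 token2 ...`).
  - `shift_tokens_right(labels, ...)`: performs the shift and puts `decoder_start_token_id` at the front. For BART checkpoints this start token is typically `</s>` rather than `<s>`.
- **Building the labels (`labels`)**:
  - Uses `target_encodings['input_ids']` directly.
  - **Key step**: `labels[labels == pad_token_id] = -100`. Setting the labels at padding positions to -100 makes PyTorch's CrossEntropyLoss skip those positions, so they contribute nothing to the loss.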
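The shift and the `-100` masking are easier to see on a short list of ids. The sketch below re-implements both steps in plain Python. `shift_right` and `mask_padding` are hypothetical helpers written for this illustration that mirror what `shift_tokens_right` and the label masking do on tensors. The special-token ids (`<s>` = 0, `<pad>` = 1, `</s>` = 2) and the choice of `</s>` as start token are assumptions based on the standard BART vocabulary; the content ids 713, 16, 10 are arbitrary placeholders.

```python
BOS, PAD, EOS = 0, 1, 2  # assumed BART ids for <s>, <pad>, </s>
IGNORE = -100            # label value that the loss skips

def shift_right(ids, pad_id, start_id):
    """Prepend start_id, drop the last id, and map any -100 back to pad_id."""
    shifted = [start_id] + ids[:-1]
    return [pad_id if t == IGNORE else t for t in shifted]

def mask_padding(ids, pad_id):
    """Replace padding ids with -100 so the loss ignores those positions."""
    return [IGNORE if t == pad_id else t for t in ids]

# A tokenized target padded to length 7: <s> tok tok tok </s> <pad> <pad>
target = [BOS, 713, 16, 10, EOS, PAD, PAD]

decoder_input_ids = shift_right(target, PAD, start_id=EOS)
labels = mask_padding(target, PAD)

print(decoder_input_ids)  # what the decoder reads at each step
print(labels)             # what the decoder should predict at each step
```

Position by position, the decoder reads `decoder_input_ids[i]` and is scored against `labels[i]`, so each step predicts the token that follows the ones it has seen.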

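#### 3. Format conversion
- `dataset.map(..., remove_columns=...)`: applies the conversion above and removes the raw text columns, keeping only the encoded features.
- `set_format(type='torch')`: makes the dataset return PyTorch tensors, which is convenient for a DataLoader.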

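### Key variables explained

The four entries of the `encodings` dictionary make up the complete input BART needs for training:

1.  **`input_ids` (Encoder Input)**
    -   **Meaning**: the source text, tokenized and mapped to a sequence of integers.
    -   **Role**: fed to BART's **encoder**, which builds an understanding of the article.
    -   **The `ids` abbreviation**: **IDs** = **Identifiers**. Here it means token IDs, the unique index of each token in the vocabulary.

2.  **`attention_mask`**
    -   **Meaning**: a mask sequence of 0s and 1s.
    -   **Role**: tells the model which positions are real tokens (1) and which are padding (0). Positions marked 0 are ignored when attention is computed, so padding does not interfere.

3.  **`decoder_input_ids` (Decoder Input)**
    -   **Meaning**: the target summary after shifting, as a sequence of integers: the decoder start token followed by the summary tokens.
    -   **Role**: fed to BART's **decoder**. During training, the decoder sees the tokens up to the current position (supplied by `decoder_input_ids`) and predicts the next token.
    -   **Teacher forcing**: during training the decoder is not fed its own previous predictions. It is given the correct answer directly, which is called teacher forcing.

4.  **`labels` (Target)**
    -   **Meaning**: the ground-truth target sequence the model has to predict.
    -   **Role**: used to compute the loss. The model's predictions are compared against `labels`.
    -   **Special handling**: the code sets the labels at padding positions to `-100`. This is the default ignore index of PyTorch's `CrossEntropyLoss`, so the model is not asked to predict padding and the loss reflects only the real content.

**Summary of the relationships**:
- The **encoder** reads `input_ids` (the source text).
- The **decoder** reads `decoder_input_ids` (the part of the summary produced so far).
- The **loss** compares the model output with `labels` (predictions versus the reference answer).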
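The effect of the `-100` labels on the loss can be reproduced without a framework. The sketch below is a plain-Python cross-entropy that averages over non-ignored positions only, which is how PyTorch's `CrossEntropyLoss` behaves with its default mean reduction. `masked_cross_entropy` is a name invented for this example, and returning `0.0` when every position is ignored is a simplification made here.

```python
import math

IGNORE = -100  # label value marking positions that do not count toward the loss

def masked_cross_entropy(logits, labels):
    """Mean negative log-likelihood over positions whose label is not IGNORE."""
    total, count = 0.0, 0
    for row, label in zip(logits, labels):
        if label == IGNORE:
            continue  # padding position: skipped entirely
        peak = max(row)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_z - row[label]
        count += 1
    return total / count if count else 0.0

# Three decoder positions over a toy vocabulary of size 3; the last one is padding
logits = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0], [5.0, 5.0, 5.0]]
labels = [0, 1, IGNORE]
print(masked_cross_entropy(logits, labels))

# Changing the logits at the padded position leaves the loss unchanged
logits[2] = [9.0, -9.0, 0.0]
print(masked_cross_entropy(logits, labels))
```

Because the padded position never enters the sum or the count, the two printed values are identical.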

![image17.png](mdfiles/image17.png)

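| Metric | Meaning | What it measures | Computation (simplified) |
|---|---|---|---|
| ROUGE-N | N-gram co-occurrence rate | Information retention (how many key terms are kept) | $\text{ROUGE-N} = \dfrac{\text{number of N-grams shared by the system and reference summaries}}{\text{total number of N-grams in the reference summary}}$ |
| ROUGE-1 | Unigram overlap (N=1) | Core information (how many keywords are captured) | ROUGE-N with N = 1 |
| ROUGE-2 | Bigram overlap (N=2) | Fluency and phrase accuracy | ROUGE-N with N = 2 |
| ROUGE-L | Longest common subsequence (LCS) | Sentence-level structure and word order | Does not require contiguous matches; computes the longest common subsequence of words. |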
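ROUGE-L can be sketched with a standard dynamic-programming LCS. The version below assumes whitespace tokenization and reports precision, recall, and a plain F1, whereas ROUGE implementations usually expose a weighting parameter for the F-score. The function names are chosen for this illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            # Extend the diagonal on a match, otherwise carry the best neighbor
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l(candidate, reference):
    """Return (precision, recall, f1) based on the LCS of the two token sequences."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0, 0.0, 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return precision, recall, 2 * precision * recall / (precision + recall)

reference = "police killed the gunman"
print(rouge_l("police kill the gunman", reference))  # LCS: police, the, gunman
print(rouge_l("the gunman kill police", reference))  # LCS: the, gunman
```

Both candidates share three unigrams with the reference, so ROUGE-1 cannot tell them apart, but the second scores lower on ROUGE-L because its word order differs.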