## BART
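BART is a model introduced by Facebook AI. It is based on the Transformer architecture and is essentially a denoising autoencoder: it is trained by reconstructing text that has been deliberately corrupted.

BART's **encoder is bidirectional**, meaning that when it reads a sentence, each token can draw on context from both directions (left to right and right to left).

BART's **decoder**, however, **is unidirectional**: it reads a sentence only from left to right, so each position sees only the tokens that come before it.

BART therefore pairs a bidirectional encoder (both directions) with an autoregressive decoder (a single direction).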

![](4.png)
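
The difference between the two directions can be pictured as an attention mask. The following is a minimal sketch in plain Python, not BART's actual implementation; the function names `bidirectional_mask` and `causal_mask` are hypothetical and chosen only for this illustration. A `1` at row `i`, column `j` means position `i` is allowed to look at position `j`.

```python
def bidirectional_mask(n):
    """Encoder-style mask: every position may attend to every position."""
    return [[1] * n for _ in range(n)]


def causal_mask(n):
    """Decoder-style mask: position i may attend only to positions 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


# Print the decoder-style mask for a 4-token sequence.
for row in causal_mask(4):
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]
```

The lower-triangular shape is what makes the decoder autoregressive: when predicting a token, it cannot see the tokens that follow it.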

### Constructing the corrupted text
- Token masking
- Token deletion
- Token infilling
- Sentence permutation
- Document rotation
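
The five noising schemes listed above can be sketched on a toy token list. This is an illustrative sketch only: the helper names and the explicit `positions`, `start`, `length`, and `order` arguments are assumptions made to keep the example deterministic, whereas during pretraining the corrupted positions and span lengths are sampled at random.

```python
MASK = "<mask>"  # mask symbol used for this illustration


def token_masking(tokens, positions):
    """Replace the tokens at the given positions with the mask symbol."""
    return [MASK if i in positions else t for i, t in enumerate(tokens)]


def token_deletion(tokens, positions):
    """Drop the tokens at the given positions; nothing marks where they were."""
    return [t for i, t in enumerate(tokens) if i not in positions]


def token_infilling(tokens, start, length):
    """Replace a whole span of `length` tokens with a single mask symbol."""
    return tokens[:start] + [MASK] + tokens[start + length:]


def sentence_permutation(sentences, order):
    """Reorder the sentences of a document according to `order`."""
    return [sentences[i] for i in order]


def document_rotation(tokens, start):
    """Rotate the document so that it begins with the token at `start`."""
    return tokens[start:] + tokens[:start]


tokens = ["I", "like", "deep", "learning", "."]
print(token_masking(tokens, {1}))       # ['I', '<mask>', 'deep', 'learning', '.']
print(token_deletion(tokens, {1}))      # ['I', 'deep', 'learning', '.']
print(token_infilling(tokens, 1, 3))    # ['I', '<mask>', '.']
print(sentence_permutation(["A.", "B.", "C."], [2, 0, 1]))  # ['C.', 'A.', 'B.']
print(document_rotation(tokens, 2))     # ['deep', 'learning', '.', 'I', 'like']
```

In every case the training target is the original, uncorrupted text. Note how deletion and infilling make the task harder than masking: the model must also work out where tokens are missing, or how many a single mask stands for.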

### Using BART

In [1]:
# The original pin (Transformers==3.5.1) failed in this environment while building
# its sentencepiece==0.1.91 dependency, so an unpinned install is used instead.
! pip install transformers

First, import `BartTokenizer` for tokenization and `BartForConditionalGeneration` for the text summarization task from the Transformers library.

In [2]:
from transformers import BartTokenizer, BartForConditionalGeneration

In [3]:
model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')

Set the source text to be summarized.

In [4]:
text = """Machine learning (ML) is the study of computer algorithms that improve automatically through experience.
It is seen as a subset of artificial intelligence.
Machine learning algorithms build a mathematical model based on sample data, known as training data, in order to make predictions or decisions without being explicitly programmed to do so.
Machine learning algorithms are used in a wide variety of applications, such as email filtering and computer vision, where it is difficult or infeasible to develop conventional algorithms to perform the needed tasks."""

Tokenize the text.

In [6]:
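# truncation=True is needed for max_length to actually cap the input at 1024 tokens
inputs = tokenizer([text], max_length=1024, truncation=True, return_tensors='pt')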
print(inputs)

{'input_ids': tensor([[    0, 46100,  2239,    36, 10537,    43,    16,     5,   892,     9,
          3034, 16964,    14,  1477,  6885,   149,   676,     4, 50118,   243,
            16,   450,    25,    10, 37105,     9,  7350,  2316,     4,  1437,
         50118, 46100,  2239, 16964,  1119,    10, 30412,  1421,   716,    15,
          7728,   414,     6,   684,    25,  1058,   414,     6,    11,   645,
             7,   146, 12535,    50,  2390,   396,   145, 16369, 30825,     7,
           109,    98,     4, 50118, 46100,  2239, 16964,    32,   341,    11,
            10,  1810,  3143,     9,  2975,     6,   215,    25,  1047, 35060,
             8,  3034,  3360,     6,   147,    24,    16,  1202,    50,  4047,
         29358,  4748,     7,  2179,  9164, 16964,     7,  3008,     5,   956,
          8558,     4,     2]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1

Get the summary IDs, that is, the token IDs generated by the model.

In [7]:
summary_ids = model.generate(inputs['input_ids'], num_beams=4, max_length=100, early_stopping=True)
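
The `num_beams=4` argument asks `generate` to use beam search, which keeps several candidate sequences alive at each step instead of committing to the single most likely next token. Below is a simplified toy sketch of that idea, not the library's implementation: `toy_log_probs` is a hypothetical stand-in for the model, and details such as length penalties are left out.

```python
import math


def toy_log_probs(prefix):
    """Hypothetical next-token log-probabilities over a 3-symbol vocabulary.

    A real model would compute these from the encoder output and the prefix.
    """
    if not prefix:
        return {"a": math.log(0.5), "b": math.log(0.4), "<eos>": math.log(0.1)}
    if prefix[-1] == "a":
        return {"a": math.log(0.4), "b": math.log(0.3), "<eos>": math.log(0.3)}
    return {"a": math.log(0.05), "b": math.log(0.05), "<eos>": math.log(0.9)}


def beam_search(log_probs_fn, num_beams=2, max_length=4):
    """Return the highest-scoring (tokens, log-probability) pair found."""
    beams = [([], 0.0)]  # unfinished hypotheses: (tokens, cumulative log-prob)
    finished = []        # hypotheses that have produced <eos>
    for _ in range(max_length):
        # Extend every live hypothesis by every possible next token.
        candidates = []
        for tokens, score in beams:
            for token, lp in log_probs_fn(tokens).items():
                candidates.append((tokens + [token], score + lp))
        # Keep only the num_beams best extensions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:num_beams]:
            if tokens[-1] == "<eos>":
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off at max_length
    return max(finished, key=lambda c: c[1])


tokens, score = beam_search(toy_log_probs, num_beams=2)
print(tokens)  # ['b', '<eos>']
greedy_tokens, _ = beam_search(toy_log_probs, num_beams=1)
print(greedy_tokens)  # ['a', 'a', 'a', 'a']
```

With a single beam (greedy decoding) the search commits to `a` at the first step and never recovers, while two beams are enough to find the more probable sequence that starts with `b`.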

Now decode the summary IDs to obtain the corresponding tokens (words).

In [8]:
summary = [tokenizer.decode(ids, skip_special_tokens=True, clean_up_tokenization_spaces=False) for ids in summary_ids]

In [9]:
print(summary)

['Machine learning is the study of computer algorithms that improve automatically through experience. It is seen as a subset of artificial intelligence. Machine learning algorithms are used in a wide variety of applications, such as email filtering and computer vision, where it is difficult or infeasible to develop conventional algorithms to perform the needed tasks.']
