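## Introduction to RoBERTa

RoBERTa is short for Robustly Optimized BERT Pretraining Approach. It differs from BERT in how pretraining is carried out:
- It uses dynamic masking instead of static masking in the masked language modeling (MLM) task.
- It drops the next sentence prediction (NSP) task and trains with MLM only.
- It trains with large batches.
- It uses byte-level byte pair encoding (BBPE) as the subword tokenization algorithm.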

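### Dynamic masking
With BERT's static masking, each sentence is masked once during preprocessing, so the model sees the same mask in every epoch. To avoid this, each sentence is duplicated 10 times and 15% of the tokens in each copy are randomly selected and replaced with `[MASK]`, so the 10 copies carry different masks. Training then runs for 40 epochs, so each masked copy is seen 4 times on average. The RoBERTa paper also evaluates fully dynamic masking, where a new mask is generated every time a sentence is fed to the model.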
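
The masking scheme can be sketched in plain Python. This is a simplified illustration, not the actual RoBERTa preprocessing: the helper names `mask_tokens` and `static_copies` are made up for this example, the sentence is split on whitespace rather than tokenized, and every selected token becomes `[MASK]` (BERT's 80/10/10 replacement rule is left out).

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, rng=random):
    """Return a copy of `tokens` with roughly `mask_prob` of positions masked."""
    n_mask = max(1, round(len(tokens) * mask_prob))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    return [MASK if i in positions else tok for i, tok in enumerate(tokens)]

def static_copies(tokens, n_copies=10, seed=0):
    """Pre-compute `n_copies` differently masked versions of one sentence."""
    rng = random.Random(seed)
    return [mask_tokens(tokens, rng=rng) for _ in range(n_copies)]

tokens = "we arrived at the airport in time for the early flight".split()

# Duplication scheme: 10 masked copies reused over 40 epochs.
copies = static_copies(tokens)
epochs = 40
times_each_copy_is_seen = epochs // len(copies)  # 40 / 10 = 4

# Fully dynamic masking: draw a fresh mask every time the sentence is used.
rng = random.Random(1)
for epoch in range(3):
    print(epoch, " ".join(mask_tokens(tokens, rng=rng)))
```

With 11 tokens and a 15% rate, each copy ends up with 2 masked positions; which positions are masked varies from copy to copy.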

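### Removing the next sentence prediction task
Extensive experiments indicated that pretraining with the next sentence prediction task is not very effective, so RoBERTa drops it.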

### Training on more data
RoBERTa is trained on five datasets: Toronto BookCorpus, Wikipedia, CC-News (Common Crawl-News), OpenWebText, and Stories (a subset of Common Crawl).

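### Training with large batches
Recall that BERT is pretrained for 1 million steps with a batch size of 256. RoBERTa uses a much larger batch: a batch size of 8,000 for 300,000 steps. It can also be pretrained for longer with the same batch size, for example 500,000 steps.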

**Training with larger batches can speed up training and improve model performance.**
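
To put the step counts and batch sizes above side by side, the short calculation below multiplies them to get the total number of training sequences processed. The figures are the ones quoted in this section; the dictionary layout and the helper `sequences_seen` are only for illustration.

```python
# Steps and batch sizes as quoted in the text above.
schedules = {
    "BERT": {"steps": 1_000_000, "batch_size": 256},
    "RoBERTa": {"steps": 300_000, "batch_size": 8_000},
    "RoBERTa (longer)": {"steps": 500_000, "batch_size": 8_000},
}

def sequences_seen(steps, batch_size):
    """Total sequences processed = number of steps x sequences per step."""
    return steps * batch_size

for name, schedule in schedules.items():
    print(f"{name}: {sequences_seen(**schedule):,} sequences")

ratio = sequences_seen(**schedules["RoBERTa"]) / sequences_seen(**schedules["BERT"])
print(f"RoBERTa / BERT: {ratio:.3f}x")
```

Even with fewer steps, the 300,000-step schedule processes roughly nine times as many sequences as BERT's schedule.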

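### Using byte-level byte pair encoding as the subword tokenization algorithm
BERT uses a vocabulary of about 30,000 tokens, while RoBERTa uses a vocabulary of about 50,000 tokens.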
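
"Byte-level" means that the starting symbols are the bytes of the UTF-8 encoded text rather than Unicode characters, so the base alphabet has only 256 entries and any input string can be represented. The snippet below only illustrates that first step; the helper name `to_byte_symbols` is made up for this example, and the merge step of BPE is not shown here.

```python
def to_byte_symbols(text):
    """Return the UTF-8 byte values of `text`, the starting symbols of byte-level BPE."""
    return list(text.encode("utf-8"))

ascii_symbols = to_byte_symbols("day")   # one byte per ASCII character
accent_symbols = to_byte_symbols("é")    # a non-ASCII character spans several bytes

print(ascii_symbols)   # [100, 97, 121]
print(accent_symbols)  # [195, 169]

# Every symbol is a byte, so it always falls inside a 256-entry base alphabet.
assert all(0 <= b < 256 for b in ascii_symbols + accent_symbols)
```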

### The RoBERTa tokenizer

In [1]:
! pip install transformers==4.44.2

Note: pinning `transformers==3.5.1` fails in this environment because its `sentencepiece==0.1.91` dependency does not build. The outputs below come from `transformers` 4.44.2, the version reported in the model configuration.

In [2]:
from transformers import RobertaConfig, RobertaModel, RobertaTokenizer

In [3]:
model = RobertaModel.from_pretrained('roberta-base')

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Inspect the RoBERTa configuration:

In [4]:
model.config

RobertaConfig {
  "_name_or_path": "roberta-base",
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.44.2",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 50265
}

The loaded RoBERTa-base model has 12 encoder layers, 12 attention heads, and a hidden size of 768.
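
A few derived quantities follow directly from the configuration printed above. The sketch below copies four values from that output into a plain dictionary (so it runs without downloading the model) and computes the per-head dimension and the feed-forward expansion ratio; the variable names are chosen for this example.

```python
# Values copied from the RobertaConfig output shown above.
config = {
    "hidden_size": 768,
    "num_attention_heads": 12,
    "num_hidden_layers": 12,
    "intermediate_size": 3072,
}

# The hidden vector is split evenly across the attention heads.
head_dim = config["hidden_size"] // config["num_attention_heads"]

# The feed-forward sublayer widens the hidden vector by this factor.
ffn_ratio = config["intermediate_size"] // config["hidden_size"]

print(f"per-head dimension: {head_dim}")
print(f"feed-forward expansion: {ffn_ratio}x")
```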

Download and load the RoBERTa tokenizer:

In [5]:
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")


In [6]:
tokenizer.tokenize('It was a great day')

['It', 'Ġwas', 'Ġa', 'Ġgreat', 'Ġday']

The `Ġ` character **represents a space**.

Now tokenize the same sentence with a space added before the first word:

In [7]:
tokenizer.tokenize(' It was a great day')

['ĠIt', 'Ġwas', 'Ġa', 'Ġgreat', 'Ġday']
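
Where does `Ġ` come from? Byte-level BPE tokenizers in the GPT-2 family, which RoBERTa's tokenizer follows, map each of the 256 byte values to a printable Unicode character so that tokens can be stored as ordinary strings. The sketch below is modeled on that byte-to-character table; treat it as an illustration rather than the library's exact code. The helper `detokenize` is made up for this example.

```python
def bytes_to_unicode():
    """Map each byte value (0-255) to a printable Unicode character."""
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    printable_set = set(printable)
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for b in range(256):
        if b not in printable_set:
            # Non-printable bytes (including space) are moved to 256, 257, ...
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return dict(zip(byte_values, (chr(cp) for cp in code_points)))

byte_to_char = bytes_to_unicode()
char_to_byte = {c: b for b, c in byte_to_char.items()}

def detokenize(tokens):
    """Turn byte-level tokens back into text by reversing the mapping."""
    raw = bytes(char_to_byte[ch] for tok in tokens for ch in tok)
    return raw.decode("utf-8")

print(repr(byte_to_char[ord(" ")]))
print(repr(detokenize(['It', 'Ġwas', 'Ġa', 'Ġgreat', 'Ġday'])))
print(repr(detokenize(['ĠIt', 'Ġwas', 'Ġa', 'Ġgreat', 'Ġday'])))
```

The space byte (32) is not in the printable ranges, so it is shifted to code point 288, which is `Ġ`. Reversing the mapping recovers the original strings, including the leading space in the second example.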

In [8]:
tokenizer.tokenize('I had a sudden epiphany')

['I', 'Ġhad', 'Ġa', 'Ġsudden', 'Ġep', 'iphany']

Because `epiphany` is not in the vocabulary, it is split into the subwords `ep` and `iphany` (the `Ġ` in `Ġep` marks the preceding space).
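
How a word ends up split into subwords can be illustrated with a toy BPE encoder. The merge list below is hypothetical and tiny; it is not taken from RoBERTa's `merges.txt`, and the function `apply_bpe` is a simplified sketch of the general BPE procedure (repeatedly apply the highest-priority learned merge) rather than the library's implementation.

```python
def apply_bpe(word, merges):
    """Split `word` into subwords by applying learned merges in priority order."""
    ranks = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no learned merge applies any more
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# Hypothetical merges, listed from highest to lowest priority.
toy_merges = [("l", "o"), ("lo", "w"), ("e", "s"), ("es", "t")]

print(apply_bpe("low", toy_merges))     # ['low']
print(apply_bpe("lowest", toy_merges))  # ['low', 'est']
```

A word whose characters can be merged all the way into a single vocabulary entry stays whole; otherwise the encoder stops at the largest pieces the merges can build, which is what happens to `epiphany` above.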