References:  
[Downstream applications of the HuggingFace Transformers series](https://www.jianshu.com/p/cdb13530a8fd)  
[Quickstart in the official documentation](https://huggingface.co/transformers/quickstart.html#)

A complete transformer model consists of three main parts:

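Config: controls the model's name, the form of its final output, the width and depth of the hidden layers, the type of activation function, and so on. A Config object is exported as a JSON file, like this: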
```
{
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "type_vocab_size": 2,
  "vocab_size": 30522
}

```
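Because the exported Config is plain JSON, it can be inspected without loading any model. The following is a minimal sketch, assuming a few of the fields above are available as a string (`CONFIG_JSON` is a name made up for this example). It derives two quantities that follow from the config values: the width of each attention head and the ratio between the feed-forward layer and the hidden layer.

```python
import json

# Hypothetical inline copy of a few fields from the config.json shown above
CONFIG_JSON = """
{
  "hidden_size": 768,
  "intermediate_size": 3072,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "vocab_size": 30522
}
"""

config = json.loads(CONFIG_JSON)

# Each attention head works on an equal slice of the hidden vector
head_dim = config["hidden_size"] // config["num_attention_heads"]
# The feed-forward layer is wider than the hidden layer by this factor
ffn_ratio = config["intermediate_size"] // config["hidden_size"]

print(head_dim, ffn_ratio)
```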
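Tokenizer: converts plain text into encoded ids. **The tokenizer does not turn words into word vectors; it only splits the text into tokens, adds special tokens such as `[MASK]`, `[SEP]`, and `[CLS]`, and maps each token to its vocabulary index.**  
A Tokenizer is exported as three files:  
- vocab.txt: the vocabulary file; each line is a word or a piece of a word
- special_tokens_map.json: defines the special tokens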
```
{
"unk_token": "[UNK]", 
"sep_token": "[SEP]", 
"pad_token": "[PAD]", 
"cls_token": "[CLS]", 
"mask_token": "[MASK]"
}
```
- tokenizer_config.json: the configuration file, which mainly stores tokenizer-specific settings.
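To make the tokenizer's job concrete, here is a simplified sketch of greedy longest-match-first sub-word splitting followed by the lookup in the vocabulary. It is not the library's implementation: the vocabulary `VOCAB` and the helpers `wordpiece` and `encode` are hypothetical, and punctuation handling is left out.

```python
# Toy vocabulary: token -> index (hypothetical, far smaller than a real vocab.txt)
VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "[MASK]": 4,
         "jim": 5, "was": 6, "a": 7, "puppet": 8, "##eer": 9}


def wordpiece(word, vocab):
    """Split one lowercase word into sub-word pieces, longest match first."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matches: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces


def encode(text, vocab):
    """Lowercase, split on whitespace, split into pieces, add [CLS]/[SEP], map to ids."""
    tokens = ["[CLS]"]
    for word in text.lower().split():
        tokens.extend(wordpiece(word, vocab))
    tokens.append("[SEP]")
    return tokens, [vocab[t] for t in tokens]


tokens, ids = encode("Jim was a puppeteer", VOCAB)
print(tokens)
print(ids)
```

The output is a list of integers only; turning those integers into vectors is the job of the model's embedding layer.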

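Model: the models themselves. Besides base models such as BERT and GPT, the library also defines task-specific models for downstream tasks, such as BertForQuestionAnswering. Exporting a model produces config.json and the weight file pytorch_model.bin. The former is the Config file described above; the latter is the same kind of file that torch.save() writes.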
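The size of the weight file follows from the Config values. As a rough sketch, the function below counts the parameters of a BERT encoder with a pooler from the numbers in the config.json shown earlier. The function name is made up for this example, and the count assumes the standard BERT layer layout (query, key, value, and output projections, a two-layer feed-forward block, and a LayerNorm after each) without any task head.

```python
def bert_parameter_count(vocab_size=30522, max_positions=512, type_vocab=2,
                         hidden=768, intermediate=3072, layers=12):
    """Count the weights of a BERT encoder with a pooler."""
    def linear(n_in, n_out):
        return n_in * n_out + n_out  # weight matrix plus bias

    layer_norm = 2 * hidden  # scale and shift vectors

    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    attention = 4 * linear(hidden, hidden) + layer_norm  # query, key, value, output
    feed_forward = linear(hidden, intermediate) + linear(intermediate, hidden) + layer_norm
    pooler = linear(hidden, hidden)
    return embeddings + layers * (attention + feed_forward) + pooler


total = bert_parameter_count()
print(total)                  # number of parameters
print(total * 4 / 1024 ** 2)  # approximate size in MiB at 4 bytes per float32 value
```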

## Loading a model
### Automatic download from the hub (requires an internet connection)

In [2]:
import torch
from transformers import BertTokenizer, BertModel
# PyTorch version
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# TensorFlow version
# from transformers import BertTokenizer, TFBertModel
# tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# model = TFBertModel.from_pretrained("bert-base-uncased")

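### Loading from local files
Find the model you want in the [official HuggingFace model hub](https://huggingface.co/models), for example bert-base-uncased.  
Open [Files and versions](https://huggingface.co/bert-base-uncased/tree/main)
and download the files one by one into a single directory, for example D:/transformr_files/bert-base-uncased/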
```python
from transformers import BertTokenizer, BertConfig, BertModel

MODEL_PATH = r"D:/transformr_files/bert-base-uncased/"
# Load the tokenizer from the directory that holds the downloaded vocab.txt
tokenizer = BertTokenizer.from_pretrained(MODEL_PATH)
model_config = BertConfig.from_pretrained(MODEL_PATH)
# Adjust the configuration
model_config.output_hidden_states = True
model_config.output_attentions = True
# Load the model from the local path with the modified configuration
model = BertModel.from_pretrained(MODEL_PATH, config=model_config)
```
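Loading from a local directory fails if one of the downloaded files is missing, so it can help to check the directory first. A minimal sketch, assuming the three file names mentioned in this article (`missing_files` is a hypothetical helper, and a given model repository may ship additional files):

```python
from pathlib import Path
import tempfile

# File names mentioned in this article; a given model repository may ship others as well
EXPECTED_FILES = ("config.json", "vocab.txt", "pytorch_model.bin")


def missing_files(model_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in model_dir."""
    model_dir = Path(model_dir)
    return [name for name in expected if not (model_dir / name).is_file()]


# Demonstration on a temporary directory that only holds two of the files
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "config.json").write_text("{}")
    (Path(tmp) / "vocab.txt").write_text("[PAD]\n")
    result = missing_files(tmp)

print(result)
```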

In [5]:
text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"
tokenized_text = tokenizer.tokenize(text)

In [6]:
masked_index = 8  # replace the token at index 8 with '[MASK]'
tokenized_text[masked_index] = '[MASK]'
# tokenized_text == ['[CLS]', 'who', 'was', 'jim', 'henson', '?', '[SEP]', 'jim', '[MASK]', 'was', 'a', 'puppet', '##eer', '[SEP]']

In [13]:
# Convert the tokens (str) to vocabulary ids
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
indexed_tokens

[101,
 2040,
 2001,
 3958,
 27227,
 1029,
 102,
 3958,
 103,
 2001,
 1037,
 13997,
 11510,
 102]

In [8]:
segments_ids = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
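The segment ids above are written by hand: 0 for the first sentence up to and including its `[SEP]`, 1 for the second sentence. A small sketch of how they could be derived from the token list instead (`segment_ids_from_tokens` is a hypothetical helper, not a library function):

```python
def segment_ids_from_tokens(tokens, sep_token="[SEP]"):
    """Assign 0 to everything up to and including the first [SEP], 1 afterwards."""
    ids, current = [], 0
    for token in tokens:
        ids.append(current)
        if token == sep_token and current == 0:
            current = 1  # the second sentence starts after the first [SEP]
    return ids


tokens = ['[CLS]', 'who', 'was', 'jim', 'henson', '?', '[SEP]',
          'jim', '[MASK]', 'was', 'a', 'puppet', '##eer', '[SEP]']
segments_ids = segment_ids_from_tokens(tokens)
print(segments_ids)
```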

In [14]:
# Convert the Python lists to PyTorch tensors
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])

## Input format for each model

In [None]:
bert:       [CLS] + tokens + [SEP] + padding

roberta:    [CLS] + prefix_space + tokens + [SEP] + padding

distilbert: [CLS] + tokens + [SEP] + padding

xlm:        [CLS] + tokens + [SEP] + padding

xlnet:      padding + tokens + [SEP] + [CLS]
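The two layouts above differ in where the special tokens and the padding go. The sketch below builds both variants together with the matching attention mask. `build_inputs` is a hypothetical helper, and the token strings are placeholders written as in the table above; each real tokenizer uses its own special-token strings.

```python
def build_inputs(tokens, max_len, pad_left=False,
                 cls="[CLS]", sep="[SEP]", pad="[PAD]"):
    """Add special tokens and pad to max_len; return tokens and attention mask."""
    if pad_left:
        body = tokens + [sep, cls]      # XLNet-style: special tokens at the end
    else:
        body = [cls] + tokens + [sep]   # BERT-style: [CLS] first
    if len(body) > max_len:
        raise ValueError("sequence is longer than max_len")
    padding = [pad] * (max_len - len(body))
    mask_body, mask_pad = [1] * len(body), [0] * len(padding)
    if pad_left:
        return padding + body, mask_pad + mask_body
    return body + padding, mask_body + mask_pad


bert_tokens, bert_mask = build_inputs(["hello", "world"], max_len=6)
xlnet_tokens, xlnet_mask = build_inputs(["hello", "world"], max_len=6, pad_left=True)
print(bert_tokens, bert_mask)
print(xlnet_tokens, xlnet_mask)
```

The attention mask marks real tokens with 1 and padding with 0, so padded positions can be ignored by the model.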

In [17]:
model.eval()

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          

In [19]:
# Move the tensors and the model to CUDA
tokens_tensor = tokens_tensor.to('cuda')
segments_tensors = segments_tensors.to('cuda')
model.to('cuda')


## Model outputs

In [20]:
with torch.no_grad():
    # Transformers models return a tuple (newer versions return a ModelOutput that can be indexed the same way)
    outputs = model(tokens_tensor, token_type_ids=segments_tensors)
    # The first element is the hidden state of the last BERT layer
    encoded_layers = outputs[0]
# BERT turns the input sequence into a FloatTensor of shape [batch_size, sequence_length, model_hidden_dimension]
assert tuple(encoded_layers.shape) == (1, len(indexed_tokens), model.config.hidden_size)

In [26]:
# Shape of the second element of outputs
outputs[1].shape

torch.Size([1, 768])

In [28]:
# Hidden state of the last BERT layer
outputs[0].shape

torch.Size([1, 14, 768])
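The two shapes differ because the second output of BertModel is a pooled vector: the hidden state at the first position (`[CLS]`) passed through a dense layer and a tanh, so the sequence dimension disappears. A toy sketch in plain Python, with made-up numbers and a hypothetical `pool_first_token` helper, shows the idea for a single sequence:

```python
import math


def pool_first_token(hidden_states, weight, bias):
    """Apply tanh(W h + b) to the first token's vector of one sequence."""
    first = hidden_states[0]  # vector at the [CLS] position
    pooled = []
    for row, b in zip(weight, bias):
        pooled.append(math.tanh(sum(w * x for w, x in zip(row, first)) + b))
    return pooled


# Toy sequence: 3 tokens, hidden size 2 (the run above has 14 tokens, hidden size 768)
hidden_states = [[0.5, -1.0], [0.1, 0.2], [0.3, 0.4]]
weight = [[1.0, 0.0], [0.0, 1.0]]  # identity matrix, chosen only for readability
bias = [0.0, 0.0]

pooled = pool_first_token(hidden_states, weight, bias)
print(len(hidden_states), len(hidden_states[0]))  # sequence_length, hidden_size
print(len(pooled))                                # hidden_size only
```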

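## Customizing the model

1) To change how the model itself is built, define a custom configuration.  
Each architecture has its own configuration class (DistilBertConfig for DistilBERT), which lets you set the hidden dimension, dropout rate, and so on. If you make a core change, such as changing the hidden size, the pretrained weights no longer fit and the model has to be trained from scratch. The model can be instantiated directly from the custom configuration.

The example uses DistilBERT's predefined vocabulary (the tokenizer is loaded with from_pretrained())  
and initializes the model from scratch (the model is instantiated from the configuration instead of with from_pretrained()).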
```python
from transformers import DistilBertConfig, DistilBertTokenizer, DistilBertForSequenceClassification
config = DistilBertConfig(n_heads=8, dim=512, hidden_dim=4*512)
model = DistilBertForSequenceClassification(config)
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
```

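2) If you only change the model head (for example, the number of labels), you can still use the pretrained weights for the body. For example, define a 10-class classifier on top of a pretrained body. You could create a configuration that keeps all default values and changes only the number of labels.  
More simply, pass any configuration arguments directly to from_pretrained(), which uses them to update the default configuration.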

```python
from transformers import DistilBertConfig, DistilBertTokenizer, DistilBertForSequenceClassification
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=10)
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
```
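Changing only the number of labels touches a small part of the model. The following back-of-the-envelope sketch counts the head parameters, assuming the head consists of one dim -> dim linear layer followed by one dim -> num_labels linear layer and that the pretrained body uses dim=768; both are assumptions of this example, and the helper names are made up.

```python
def linear_params(n_in, n_out):
    return n_in * n_out + n_out  # weight matrix plus bias


def classification_head_params(dim, num_labels):
    """Parameters of a dim -> dim layer followed by a dim -> num_labels layer."""
    return linear_params(dim, dim) + linear_params(dim, num_labels)


head_2 = classification_head_params(dim=768, num_labels=2)
head_10 = classification_head_params(dim=768, num_labels=10)
print(head_2, head_10, head_10 - head_2)
```

Going from 2 to 10 labels only enlarges the final layer, which is why the pretrained body can be reused unchanged.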

In [None]:
## Building the model:
# A base model generally has a single Tokenizer, so there is no Tokenizer specific to a downstream task.
# Here BertForQuestionAnswering is initialized from the bert-base-uncased weights
# from transformers import BertTokenizer, BertConfig, BertForQuestionAnswering
# import torch

# model = BertForQuestionAnswering.from_pretrained('bert-base-uncased', num_labels=2)
# tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')


## Running the model:
# Switch to evaluation mode
# model.eval()
# question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet"
# # Get the input_ids encoding
# input_ids = tokenizer.encode(question, text)
# # Build token_type_ids by hand (encode_plus can do this instead)
# token_type_ids = [0 if i <= input_ids.index(102) else 1 for i in range(len(input_ids))]
# # Get the scores; index the outputs, because newer versions return a ModelOutput instead of a plain tuple
# outputs = model(torch.tensor([input_ids]), token_type_ids=torch.tensor([token_type_ids]))
# start_scores, end_scores = outputs[0], outputs[1]
# # Map the ids back to the original tokens
# all_tokens = tokenizer.convert_ids_to_tokens(input_ids)


## Turning the model output into the task output:
# Decode the answer span
# answer = ' '.join(all_tokens[torch.argmax(start_scores) : torch.argmax(end_scores)+1])
# print(answer)
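The decoding step above picks the highest-scoring start and end positions and joins the tokens in between. A plain-Python sketch of that step follows; the tokens and scores are made up for illustration and are not real model output, and the helper names are hypothetical. It also glues `##` continuation pieces back onto the preceding token, which a simple `' '.join` does not do.

```python
def argmax(values):
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def merge_wordpieces(tokens):
    """Join tokens with spaces, gluing '##' continuation pieces to the previous token."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]
        else:
            words.append(token)
    return " ".join(words)


def decode_answer(tokens, start_scores, end_scores):
    start, end = argmax(start_scores), argmax(end_scores)
    if end < start:
        return ""  # inconsistent span: treat as no answer
    return merge_wordpieces(tokens[start:end + 1])


# Hypothetical tokens and scores, made up for illustration
tokens = ["[CLS]", "who", "was", "jim", "?", "[SEP]", "a", "puppet", "##eer", "[SEP]"]
start_scores = [0.1, 0.0, 0.0, 0.2, 0.0, 0.0, 2.5, 1.0, 0.3, 0.0]
end_scores = [0.0, 0.0, 0.1, 0.0, 0.0, 0.2, 0.4, 0.9, 3.1, 0.0]

answer = decode_answer(tokens, start_scores, end_scores)
print(answer)
```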

## A quick tour of all the models

In [3]:
# model | tokenizer | pretrained weights
from transformers import *
MODELS = [(BertModel, BertTokenizer, 'bert-base-uncased'),

(OpenAIGPTModel, OpenAIGPTTokenizer, 'openai-gpt'),

(GPT2Model, GPT2Tokenizer, 'gpt2'),

(CTRLModel, CTRLTokenizer, 'ctrl'),

(TransfoXLModel, TransfoXLTokenizer, 'transfo-xl-wt103'),

(XLNetModel, XLNetTokenizer, 'xlnet-base-cased'),

(XLMModel, XLMTokenizer, 'xlm-mlm-enfr-1024'),

(DistilBertModel, DistilBertTokenizer, 'distilbert-base-cased'),

(RobertaModel, RobertaTokenizer, 'roberta-base'),

(XLMRobertaModel, XLMRobertaTokenizer, 'xlm-roberta-base'),

]

In [None]:
import torch

# Encode some text into a sequence of hidden states with each model:
for model_class, tokenizer_class, pretrained_weights in MODELS:
    # Load the pretrained model and tokenizer
    tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
    model = model_class.from_pretrained(pretrained_weights)
    # Encode the text
    input_ids = torch.tensor([tokenizer.encode("Here is some text to encode", add_special_tokens=True)])  # add special tokens
    with torch.no_grad():
        last_hidden_states = model(input_ids)[0]  # model outputs can be indexed like a tuple

# Each architecture provides several classes for fine-tuning on downstream tasks, for example
BERT_MODEL_CLASSES = [BertModel, BertForPreTraining, BertForMaskedLM, BertForNextSentencePrediction,
BertForSequenceClassification, BertForTokenClassification, BertForQuestionAnswering]


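# Every class of an architecture can be initialized from that architecture's pretrained weights
# Note: the extra weights added for fine-tuning are only initialized and still need to be trained on the downstream task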
pretrained_weights = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(pretrained_weights)
for model_class in BERT_MODEL_CLASSES:
    # Load the model
    # model = model_class.from_pretrained(pretrained_weights)
    # The model can return the hidden states and the attention weights of every layer
    model = model_class.from_pretrained(pretrained_weights, output_hidden_states=True, output_attentions=True)
    input_ids = torch.tensor([tokenizer.encode("Let's see all hidden-states and attentions on this text")])
    all_hidden_states, all_attentions = model(input_ids)[-2:]

    

# Saving and reloading the model and the tokenizer
model.save_pretrained('./directory/to/save/')  # save
model = model_class.from_pretrained('./directory/to/save/')  # reload
tokenizer.save_pretrained('./directory/to/save/')  # save
tokenizer = BertTokenizer.from_pretrained('./directory/to/save/')  # reload
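When `output_hidden_states=True` and `output_attentions=True` are set as in the loop above, the last two outputs are tuples with one tensor per layer. A sketch of the shapes to expect, computed from the config values without running a model (`expected_output_shapes` is a hypothetical helper; the sequence length of 14 is an arbitrary example):

```python
def expected_output_shapes(batch_size, seq_len, hidden_size, num_layers, num_heads):
    """Shapes of the per-layer outputs for one forward pass."""
    # One hidden-state tensor for the embedding output plus one per layer
    hidden_states = [(batch_size, seq_len, hidden_size)] * (num_layers + 1)
    # One attention tensor per layer: a seq_len x seq_len map for every head
    attentions = [(batch_size, num_heads, seq_len, seq_len)] * num_layers
    return hidden_states, attentions


# bert-base-uncased values from the config.json shown earlier
hidden_states, attentions = expected_output_shapes(
    batch_size=1, seq_len=14, hidden_size=768, num_layers=12, num_heads=12)
print(len(hidden_states), hidden_states[0])
print(len(attentions), attentions[0])
```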


## Using albert_chinese_small

In [1]:
import torch
from transformers import BertTokenizer,BertModel,BertConfig
import numpy as np
from torch.utils import data
from sklearn.model_selection import train_test_split
import pandas as pd

In [2]:
from transformers import AlbertModel, AlbertConfig

pretrained = 'voidful/albert_chinese_small'  # small version of ALBERT
# This checkpoint ships a BERT-style vocabulary, so the tokenizer is BertTokenizer,
# but the weights are ALBERT weights and have to be loaded with AlbertModel.
# Loading them with BertModel prints "Some weights of the model checkpoint at
# voidful/albert_chinese_small were not used when initializing BertModel:
# ['albert.embeddings.word_embeddings.weight', ...]" and leaves the encoder randomly initialized.
tokenizer = BertTokenizer.from_pretrained(pretrained)
model = AlbertModel.from_pretrained(pretrained)
config = AlbertConfig.from_pretrained(pretrained)

Inspect the output of the ALBERT model

In [3]:
inputtext = "今天心情情很好啊，买了很多东西，我特别喜欢，终于有了自己喜欢的电子产品，这次总算可以好好学习了"
tokenized_text = tokenizer.encode(inputtext)
input_ids = torch.tensor(tokenized_text).view(-1,len(tokenized_text))
outputs = model(input_ids)
outputs[0].shape,outputs[1].shape

(torch.Size([1, 49, 384]), torch.Size([1, 384]))
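The sequence length of 49 can be checked by hand. The sketch below assumes that the tokenizer splits this Chinese sentence into one token per character (including each full-width comma) and adds `[CLS]` and `[SEP]`; that assumption is consistent with the shape printed above.

```python
inputtext = "今天心情情很好啊，买了很多东西，我特别喜欢，终于有了自己喜欢的电子产品，这次总算可以好好学习了"

# Assumption: every character, including each full-width comma, becomes exactly one token
num_characters = len(inputtext)
num_special_tokens = 2  # [CLS] at the start and [SEP] at the end
sequence_length = num_characters + num_special_tokens
print(num_characters, sequence_length)
```

The last dimension, 384, is the hidden size of this small checkpoint, which is why both outputs end in 384 instead of the 768 seen for bert-base-uncased.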

### Adding custom linear layers on top of ALBERT

In [4]:
class AlbertClassfier(torch.nn.Module):
    def __init__(self, bert_model, bert_config, num_class):
        super(AlbertClassfier,self).__init__()
        self.bert_model=bert_model
        self.dropout=torch.nn.Dropout(0.4)
        self.fc1=torch.nn.Linear(bert_config.hidden_size,bert_config.hidden_size)
        self.fc2=torch.nn.Linear(bert_config.hidden_size,num_class)
    def forward(self,token_ids):
        bert_out=self.bert_model(token_ids)[1] # pooled sentence vector [batch_size,hidden_size]
        bert_out=self.dropout(bert_out)
        bert_out=self.fc1(bert_out) 
        bert_out=self.dropout(bert_out)
        bert_out=self.fc2(bert_out) #[batch_size,num_class]
        return bert_out

In [5]:
albertBertClassifier = AlbertClassfier(model,config, 2)
device = torch.device("cuda:0") if torch.cuda.is_available() else 'cpu'
albertBertClassifier = albertBertClassifier.to(device)  # move to the GPU if one is available
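Once the classifier is on the device, fine-tuning comes down to repeated training steps. The sketch below shows a single step under stated assumptions: `StubEncoder` is a hypothetical stand-in for the pretrained encoder so that the example runs without downloading weights, and the batch, labels, optimizer, and learning rate are made up for illustration.

```python
import torch


class StubEncoder(torch.nn.Module):
    """Hypothetical stand-in for the pretrained encoder: returns (sequence_output, pooled_output)."""
    def __init__(self, vocab_size=100, hidden_size=16):
        super().__init__()
        self.embed = torch.nn.Embedding(vocab_size, hidden_size)

    def forward(self, token_ids):
        sequence_output = self.embed(token_ids)            # [batch, seq_len, hidden]
        pooled_output = torch.tanh(sequence_output[:, 0])  # [batch, hidden]
        return sequence_output, pooled_output


class Classifier(torch.nn.Module):
    """Same layout as the classifier above: dropout, linear, dropout, linear."""
    def __init__(self, encoder, hidden_size, num_class):
        super().__init__()
        self.encoder = encoder
        self.dropout = torch.nn.Dropout(0.4)
        self.fc1 = torch.nn.Linear(hidden_size, hidden_size)
        self.fc2 = torch.nn.Linear(hidden_size, num_class)

    def forward(self, token_ids):
        out = self.encoder(token_ids)[1]
        out = self.fc1(self.dropout(out))
        return self.fc2(self.dropout(out))


torch.manual_seed(0)
device = torch.device("cuda:0") if torch.cuda.is_available() else torch.device("cpu")
clf = Classifier(StubEncoder(), hidden_size=16, num_class=2).to(device)
optimizer = torch.optim.AdamW(clf.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()

token_ids = torch.randint(0, 100, (4, 10), device=device)  # made-up batch: 4 sequences of 10 ids
labels = torch.tensor([0, 1, 0, 1], device=device)         # made-up labels

clf.train()                     # enable dropout for training
optimizer.zero_grad()
logits = clf(token_ids)         # [batch_size, num_class]
loss = loss_fn(logits, labels)
loss.backward()
optimizer.step()
print(tuple(logits.shape), loss.item())
```

With the real encoder, the input ids and the model must be on the same device, just as in this sketch.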

## Loading roberta-base

In [8]:
from transformers import RobertaTokenizer, RobertaModel
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaModel.from_pretrained('roberta-base')
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)

In [9]:
output

BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[-0.1146,  0.1103, -0.0149,  ..., -0.0809, -0.0018, -0.0271],
         [-0.0225,  0.1612,  0.0556,  ...,  0.5366,  0.1196,  0.1576],
         [ 0.0532, -0.0020,  0.0370,  ..., -0.4887,  0.1641,  0.2736],
         ...,
         [-0.1586,  0.0837,  0.1302,  ...,  0.3970,  0.1715, -0.0848],
         [-0.1065,  0.1044, -0.0383,  ..., -0.1068, -0.0015, -0.0517],
         [ 0.0059,  0.0758,  0.1228,  ...,  0.1037,  0.0075,  0.0976]]],
       grad_fn=<NativeLayerNormBackward>), pooler_output=tensor([[-2.8347e-03, -1.8850e-01, -2.1461e-01, -1.1530e-01,  1.3189e-01,
          2.3539e-01,  2.7159e-01, -6.0856e-02, -8.3708e-02, -1.9298e-01,
          2.6565e-01, -4.3355e-05, -1.1200e-01,  1.4636e-01, -1.4502e-01,
          4.9390e-01,  2.0317e-01, -5.3051e-01,  7.5674e-02, -3.7784e-02,
         -2.8706e-01,  7.9115e-02,  4.9498e-01,  3.6626e-01,  1.0755e-01,
          4.5707e-02, -1.5804e-01,  1.3338e-02,  1.5470e-01,  2.6411