## Extracting embeddings from ALBERT

ALBERT stands for A Lite BERT, a lightweight version of BERT. BERT itself stands for Bidirectional Encoder Representations from Transformers.

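Compared with BERT, ALBERT has fewer parameters. It reduces the parameter count with two techniques:
- Cross-layer parameter sharing: by default all parameters are shared, so every encoder layer uses the same weights.
- Factorized embedding parameterization: the word embedding matrix is decomposed into the product of two smaller matrices, which reduces the number of parameters.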
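
To make the savings concrete, the sketch below counts parameters under assumed sizes: vocabulary size V = 30,000, hidden size H = 768, embedding size E = 128, and 12 encoder layers. These numbers and all helper names are assumptions for illustration, and the per-layer count assumes a standard Transformer encoder layer with a feed-forward width of 4H.

```python
# Illustrative parameter arithmetic; every size below is an assumed value.
V = 30_000        # vocabulary size (assumption)
H = 768           # hidden size (assumption)
E = 128           # factorized embedding size (assumption)
NUM_LAYERS = 12   # number of encoder layers (assumption)


def embedding_params_unfactorized(vocab_size, hidden_size):
    """One V x H embedding matrix."""
    return vocab_size * hidden_size


def embedding_params_factorized(vocab_size, embed_size, hidden_size):
    """A V x E matrix followed by an E x H projection (weights only)."""
    return vocab_size * embed_size + embed_size * hidden_size


def encoder_layer_params(hidden_size):
    """Weights and biases of one standard encoder layer (hypothetical helper)."""
    # Query, key, value and output projections, each H x H plus a bias.
    attention = 4 * (hidden_size * hidden_size + hidden_size)
    # Two linear layers: H -> 4H and 4H -> H, each with a bias.
    ffn_width = 4 * hidden_size
    feed_forward = (hidden_size * ffn_width + ffn_width) + (
        ffn_width * hidden_size + hidden_size
    )
    # Two layer norms, each with a scale and a shift vector.
    layer_norms = 2 * (2 * hidden_size)
    return attention + feed_forward + layer_norms


unfactorized = embedding_params_unfactorized(V, H)
factorized = embedding_params_factorized(V, E, H)
per_layer = encoder_layer_params(H)
unshared_stack = NUM_LAYERS * per_layer  # every layer owns its weights
shared_stack = per_layer                 # all layers reuse one set of weights

print(f"embedding, unfactorized: {unfactorized:,}")
print(f"embedding, factorized:   {factorized:,}")
print(f"encoder stack, unshared: {unshared_stack:,}")
print(f"encoder stack, shared:   {shared_stack:,}")
```

Under these assumptions the shared stack costs the same as a single layer regardless of depth, and the factorized embedding is several times smaller than the single V x H matrix.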

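In addition, ALBERT's pretraining differs from the original BERT: it keeps the **masked language modeling (MLM) task** but replaces next sentence prediction with the **sentence order prediction (SOP) task**.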

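Like next sentence prediction (NSP), sentence order prediction is a binary classification task. The difference is in the labels: SOP predicts whether two segments appear in the correct order or the wrong (swapped) order, whereas NSP predicts isNext or notNext.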
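
The contrast between the two tasks can be sketched by how their training pairs are built. This is a minimal illustration, not the actual pretraining pipeline: the function names, the example sentences, and the label strings `in_order` and `swapped` are hypothetical. The SOP negative reuses the same two segments in reversed order, while the NSP negative pairs the first segment with one drawn from another document.

```python
import random


def make_sop_examples(first_segment, second_segment):
    """Build one positive and one negative sentence-order example."""
    return [
        (first_segment, second_segment, "in_order"),
        (second_segment, first_segment, "swapped"),
    ]


def make_nsp_examples(first_segment, second_segment, other_document, rng):
    """Build one positive and one negative next-sentence example."""
    random_segment = rng.choice(other_document)
    return [
        (first_segment, second_segment, "isNext"),
        (first_segment, random_segment, "notNext"),
    ]


# Two consecutive segments from one document, plus an unrelated document.
seg_a = "She cooked pasta"
seg_b = "It was delicious"
other_doc = ["Turn the radio on", "Birds fly south in winter"]

sop = make_sop_examples(seg_a, seg_b)
nsp = make_nsp_examples(seg_a, seg_b, other_doc, random.Random(0))

for example in sop + nsp:
    print(example)
```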

In [1]:
! pip install Transformers==3.5.1

Collecting Transformers==3.5.1
  Downloading transformers-3.5.1-py3-none-any.whl.metadata (32 kB)
Collecting tokenizers==0.9.3 (from Transformers==3.5.1)
  Downloading tokenizers-0.9.3.tar.gz (172 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting sentencepiece==0.1.91 (from Transformers==3.5.1)
  Downloading sentencepiece-0.1.91.tar.gz (500 kB)
  error: subprocess-exited-with-error

  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> See above for output.

  note: This error originates from a su

In [2]:
from transformers import AlbertTokenizer, AlbertModel

Download and load the pretrained ALBERT model and tokenizer. In this example we use the ALBERT-base model.

In [3]:
model = AlbertModel.from_pretrained('albert-base-v2')
tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')



Tokenize a sentence with the tokenizer:

In [4]:
sentence = "Paris is a beautiful city"
inputs = tokenizer(sentence, return_tensors="pt")
print(inputs)

{'input_ids': tensor([[   2, 1162,   25,   21, 1632,  136,    3]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]])}


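There are 7 token IDs in total because the tokenizer adds the special tokens `[CLS]` (ID 2) at the start and `[SEP]` (ID 3) at the end of the 5 word tokens.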
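
The way the special tokens produce 7 IDs from 5 can be reproduced by hand. The sketch below is not the tokenizer's implementation: `add_special_tokens` is a hypothetical helper, and the IDs are copied from the printed `input_ids` above.

```python
CLS_ID = 2  # first ID in the printed input_ids
SEP_ID = 3  # last ID in the printed input_ids


def add_special_tokens(piece_ids, cls_id=CLS_ID, sep_id=SEP_ID):
    """Wrap the word-piece IDs of a single sentence with [CLS] and [SEP]."""
    input_ids = [cls_id] + list(piece_ids) + [sep_id]
    return {
        "input_ids": input_ids,
        # A single sentence belongs entirely to segment 0.
        "token_type_ids": [0] * len(input_ids),
        # No padding is present, so every position is attended to.
        "attention_mask": [1] * len(input_ids),
    }


# IDs for "Paris is a beautiful city", taken from the output above.
piece_ids = [1162, 25, 21, 1632, 136]
encoded = add_special_tokens(piece_ids)
print(encoded)
```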

In [7]:
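# Run the model; attribute access below relies on the return_dict output,
# which is the default in transformers 4.x.
output = model(**inputs)
# Contextual embedding of every token: [batch_size, sequence_length, hidden_size]
hidden_rep = output.last_hidden_state
# Pooled representation derived from the [CLS] token: [batch_size, hidden_size]
cls_head = output.pooler_output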

In [8]:
print(hidden_rep.shape)
print(cls_head.shape)

torch.Size([1, 7, 768])
torch.Size([1, 768])


In [9]:
print(hidden_rep[0][0])
print(hidden_rep[0][1])
print(hidden_rep[0][2])

tensor([ 8.1377e-01, -2.1923e-01,  7.0966e-01, -8.3763e-02, -9.1877e-01,
        -4.3319e-01,  6.3536e-01, -7.0293e-01,  2.9164e-01,  2.6814e-01,
         1.1825e+00,  7.5952e-01,  4.8658e-01,  9.7709e-01,  1.3055e+00,
         2.7781e-02,  7.4772e-01,  6.5719e-02, -4.3754e-01, -4.1870e-01,
         2.6135e-01, -3.2317e+00,  5.7344e-01,  4.2397e-01,  2.8537e-01,
        -7.7882e-01,  1.7221e-01,  8.1314e-02,  1.3086e+00, -2.7584e+00,
        -1.4090e+00,  1.5111e+00,  1.4577e+00, -4.6637e-01,  1.1702e-01,
        -2.3313e+00, -2.1564e-01, -1.6803e-01, -1.4514e-01,  1.5809e-01,
         2.0603e-01,  5.6583e-01, -1.4213e+00, -2.0782e-01,  1.0610e+00,
         1.9396e+00, -9.4811e-01,  8.6039e-01,  3.4406e-01, -2.8977e-02,
         7.6713e-01, -1.2607e+00, -3.9024e-01,  1.3233e+00,  6.6874e-01,
         7.9670e-01,  2.8161e-01,  6.5828e-03, -4.1506e-01,  1.1746e+00,
         7.6763e-02, -1.5090e+00, -4.1608e-01, -9.3307e-01,  5.5302e-01,
        -3.3460e-01,  5.4276e-01, -4.7788e-01,  4.2

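- `hidden_rep[0][0]` holds the contextual embedding of the `[CLS]` token.
- `hidden_rep[0][1]` holds the contextual embedding of the token Paris.
- `hidden_rep[0][2]` holds the contextual embedding of the token is.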
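
The same indexing extends to the remaining positions. The sketch below uses a nested list as a stand-in for the `[1, 7, 768]` tensor, so it runs without the model. The `embedding_for` helper is hypothetical, and the token labels are the plain words for readability; the actual SentencePiece pieces may be spelled differently.

```python
HIDDEN_SIZE = 768

# Readable labels for the 7 positions (assumed; not the raw vocabulary pieces).
tokens = ["[CLS]", "Paris", "is", "a", "beautiful", "city", "[SEP]"]

# Stand-in for hidden_rep with shape [1, 7, 768]:
# the row at position i is filled with the value float(i).
hidden_rep = [[[float(i)] * HIDDEN_SIZE for i in range(len(tokens))]]


def embedding_for(token, tokens, hidden_rep, batch_index=0):
    """Return the vector stored at the first position where `token` occurs."""
    position = tokens.index(token)
    return hidden_rep[batch_index][position]


for position, token in enumerate(tokens):
    print(f"hidden_rep[0][{position}] -> {token}")

city_vec = embedding_for("city", tokens, hidden_rep)
print(len(city_vec))
```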