<a href="https://colab.research.google.com/github/appleshiou/HuggingFace2/blob/main/colab04d_%E6%89%BE%E5%88%B0%E6%96%87%E5%AD%97%E7%9A%84embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## 1. Install the `transformers` package

In [None]:
!pip install transformers



## 2. Using the BERT tokenizer

In [None]:
from transformers import BertTokenizer

In [None]:
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")

Turn a sentence into BERT token IDs. Note that the tokenizer accepts either a single sentence or a list of sentences.

In [None]:
tokenizer("測試一下哦")

{'input_ids': [101, 3947, 6275, 671, 678, 1521, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}

In [None]:
tokenizer(["測試一句話"])

{'input_ids': [[101, 3947, 6275, 671, 1368, 6282, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1]]}

We can see that the text is split one character at a time.

In [None]:
tokenizer.tokenize("測試一句話")

['測', '試', '一', '句', '話']

Convert the IDs back into tokens to see what is actually fed into BERT.

In [None]:
tokenizer.convert_ids_to_tokens([101, 3947, 6275, 671, 1368, 6282, 102])

['[CLS]', '測', '試', '一', '句', '話', '[SEP]']

We can see that 101 stands for `[CLS]` and 102 stands for `[SEP]`.



In [None]:
tokenizer.convert_ids_to_tokens([3947, 6275, 671, 1368, 6282])

['測', '試', '一', '句', '話']
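
The round trip above can be imitated with a toy encoder. This is only a sketch of the idea, not how `BertTokenizer` is actually implemented: `TOY_VOCAB`, `toy_encode`, and `toy_decode` are hypothetical names, and the vocabulary holds only the few IDs printed in the outputs above, plus ID 100, which is assumed to be `[UNK]` as in the standard BERT vocabulary.

```python
# Toy vocabulary: only the IDs that appear in the outputs above.
# ID 100 is assumed to be [UNK], as in the standard BERT vocabulary.
TOY_VOCAB = {
    "[UNK]": 100, "[CLS]": 101, "[SEP]": 102,
    "測": 3947, "試": 6275, "一": 671, "句": 1368, "話": 6282,
}
ID_TO_TOKEN = {i: t for t, i in TOY_VOCAB.items()}


def toy_encode(text):
    """Split text into single characters and wrap the IDs in [CLS] ... [SEP]."""
    ids = [TOY_VOCAB.get(ch, TOY_VOCAB["[UNK]"]) for ch in text]
    return [TOY_VOCAB["[CLS]"]] + ids + [TOY_VOCAB["[SEP]"]]


def toy_decode(ids):
    """Map IDs back to tokens; IDs outside the toy vocabulary become [UNK]."""
    return [ID_TO_TOKEN.get(i, "[UNK]") for i in ids]


ids = toy_encode("測試一句話")
print(ids)              # [101, 3947, 6275, 671, 1368, 6282, 102]
print(toy_decode(ids))  # ['[CLS]', '測', '試', '一', '句', '話', '[SEP]']
```

Characters missing from the toy vocabulary fall back to `[UNK]`, which mirrors the 100s that show up in some of the tokenizer outputs in this notebook.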

Now try several sentences at once.

In [None]:
tokenizer(["測試一句話", "中間夾。", "也許這樣吧。", "買iPhone。"])

{'input_ids': [[101, 3947, 6275, 671, 1368, 6282, 102], [101, 704, 7279, 1933, 511, 102], [101, 738, 6258, 6857, 3564, 1416, 511, 102], [101, 6525, 100, 511, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1]]}
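
The four encoded sentences above have different lengths (7, 6, 8, and 5 IDs), so they cannot be stacked into one rectangular array as they are. The tokenizer can pad them itself when called with `padding=True`. The following is only a hand-written NumPy sketch of what padding amounts to, assuming the pad ID is 0 (the `[PAD]` entry in the standard BERT vocabulary). `pad_batch` is a hypothetical helper, not part of `transformers`.

```python
import numpy as np

# input_ids copied from the tokenizer output above
batch = [
    [101, 3947, 6275, 671, 1368, 6282, 102],
    [101, 704, 7279, 1933, 511, 102],
    [101, 738, 6258, 6857, 3564, 1416, 511, 102],
    [101, 6525, 100, 511, 102],
]


def pad_batch(seqs, pad_id=0):
    """Right-pad every sequence to the longest length and build the mask."""
    max_len = max(len(s) for s in seqs)
    input_ids = np.full((len(seqs), max_len), pad_id, dtype=np.int32)
    attention_mask = np.zeros((len(seqs), max_len), dtype=np.int32)
    for row, seq in enumerate(seqs):
        input_ids[row, :len(seq)] = seq      # real tokens
        attention_mask[row, :len(seq)] = 1   # 1 = real token, 0 = padding
    return input_ids, attention_mask


input_ids, attention_mask = pad_batch(batch)
print(input_ids.shape)              # (4, 8)
print(attention_mask.sum(axis=1))   # [7 6 8 5]
```

The attention mask is what lets the model ignore the padded positions, which is why the tokenizer returns it alongside `input_ids`.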

In [None]:
tokenizer(["abc", "David", "蘋果Apple"])

{'input_ids': [[101, 8425, 102], [101, 100, 102], [101, 5981, 3362, 100, 102]], 'token_type_ids': [[0, 0, 0], [0, 0, 0], [0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1], [1, 1, 1], [1, 1, 1, 1, 1]]}

In [None]:
tokenizer.convert_ids_to_tokens([8425,8776,8350])

['abc', 'david', 'apple']
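
The outputs above show that `David` and `Apple` were both encoded as 100, even though the vocabulary does contain lowercase `david` (8776) and `apple` (8350). One plausible reading, which is an assumption here and not something the notebook states, is that this tokenizer does not lowercase its input, so capitalized words miss their lowercase entries and fall back to `[UNK]`. The toy lookup below illustrates that behavior with whole-word matching only. Real WordPiece tokenization also tries sub-word pieces, which this sketch ignores. `toy_lookup` is a hypothetical helper.

```python
# IDs taken from the outputs above; 100 is assumed to be [UNK].
WORD_VOCAB = {"abc": 8425, "david": 8776, "apple": 8350}
UNK_ID = 100


def toy_lookup(word, lowercase=False):
    """Case-sensitive whole-word lookup, optionally lowercasing first."""
    key = word.lower() if lowercase else word
    return WORD_VOCAB.get(key, UNK_ID)


for word in ["abc", "David", "Apple"]:
    # Columns: word, ID as typed, ID after lowercasing
    print(word, toy_lookup(word), toy_lookup(word, lowercase=True))
```

If this reading is right, lowercasing Latin-script text before tokenizing would be one way to avoid the `[UNK]` fallback for such words.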

## 3. Finding the embedding of a text

This time we use the more flexible `AutoTokenizer`, which you will likely need when you work with other models.

In [None]:
from transformers import AutoTokenizer

In [None]:
checkpoint = "bert-base-chinese"

In [None]:
tokenizer2 = AutoTokenizer.from_pretrained(checkpoint)

Note that this is essentially identical to the previous `tokenizer`.

In [None]:
sequence = "我想看看這會變成什麼"

In [None]:
model_inp = tokenizer2(sequence)

In [None]:
model_inp

{'input_ids': [101, 2769, 2682, 4692, 4692, 6857, 3298, 6365, 2768, 784, 7938, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

As you can see, the result has the same structure as what `BertTokenizer` produced.

In [None]:
model_inp["input_ids"]

[101, 2769, 2682, 4692, 4692, 6857, 3298, 6365, 2768, 784, 7938, 102]

Now load BERT itself.

In [None]:
from transformers import TFBertModel

In [None]:
model = TFBertModel.from_pretrained(checkpoint)

Some layers from the model checkpoint at bert-base-chinese were not used when initializing TFBertModel: ['nsp___cls', 'mlm___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at bert-base-chinese.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


Convert `model_inp` into an input that TensorFlow accepts.

In [None]:
import tensorflow as tf

In [None]:
model_inputs = tf.constant([model_inp["input_ids"]])

In [None]:
model_inputs

<tf.Tensor: shape=(1, 12), dtype=int32, numpy=
array([[ 101, 2769, 2682, 4692, 4692, 6857, 3298, 6365, 2768,  784, 7938,
         102]], dtype=int32)>
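
The extra pair of brackets in `[model_inp["input_ids"]]` is what adds the batch dimension, turning 12 IDs into a `(1, 12)` tensor. The same effect can be seen with NumPy alone, which is used here only so that the sketch runs without TensorFlow:

```python
import numpy as np

# input_ids copied from the model_inp output above
ids = [101, 2769, 2682, 4692, 4692, 6857, 3298, 6365, 2768, 784, 7938, 102]

flat = np.array(ids)        # shape (12,): a single sequence, no batch axis
batched = np.array([ids])   # shape (1, 12): a batch holding one sequence

print(flat.shape, batched.shape)  # (12,) (1, 12)
```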

In [None]:
output = model(model_inputs)

Let's see what `output` looks like.

In [None]:
output

TFBaseModelOutputWithPoolingAndCrossAttentions([('last_hidden_state',
                                                 <tf.Tensor: shape=(1, 12, 768), dtype=float32, numpy=
                                                 array([[[-0.299048  ,  0.45682305,  0.5829431 , ...,  0.25641617,
                                                           0.21600437, -0.33521703],
                                                         [ 0.27438423, -0.16981234,  0.7077313 , ..., -0.9895742 ,
                                                          -0.42360958, -0.16366687],
                                                         [-0.21923515, -0.5830589 , -1.0193688 , ..., -0.07433864,
                                                           0.87813926, -0.19414367],
                                                         ...,
                                                         [ 0.17810795, -0.17334643,  0.25140914, ...,  0.3847427 ,
                                                  

The part we want is `last_hidden_state`.

In [None]:
output['last_hidden_state']

<tf.Tensor: shape=(1, 12, 768), dtype=float32, numpy=
array([[[-0.299048  ,  0.45682305,  0.5829431 , ...,  0.25641617,
          0.21600437, -0.33521703],
        [ 0.27438423, -0.16981234,  0.7077313 , ..., -0.9895742 ,
         -0.42360958, -0.16366687],
        [-0.21923515, -0.5830589 , -1.0193688 , ..., -0.07433864,
          0.87813926, -0.19414367],
        ...,
        [ 0.17810795, -0.17334643,  0.25140914, ...,  0.3847427 ,
         -0.09988672, -0.2767766 ],
        [ 0.44689083,  0.48839512,  0.52651215, ..., -0.11707255,
          0.4660459 , -0.14046808],
        [-0.6728222 ,  0.21168761,  0.72463286, ..., -0.75098574,
         -0.01720798, -0.1501956 ]]], dtype=float32)>

The actual values can be pulled out as a NumPy array with `.numpy()`. Let's check its shape.

In [None]:
output['last_hidden_state'].numpy().shape

(1, 12, 768)

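The output holds a batch of 1 sequence, with the final representation vector of each of the 12 input tokens, and each vector is 768-dimensional. What we want is the final 768-dimensional vector for `[CLS]`, which is the first token (index 0).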

In [None]:
h = output['last_hidden_state'].numpy()[0][0]

In [None]:
h

array([-2.99048007e-01,  4.56823051e-01,  5.82943082e-01, -4.11516964e-01,
       -1.71039820e-01, -4.91626292e-01, -6.47809148e-01, -5.26151180e-01,
       -4.55655694e-01,  4.39405113e-01,  3.13840002e-01,  3.97040337e-01,
        2.57390112e-01, -7.44233787e-01,  1.57073975e+00, -6.02495432e-01,
        3.26073140e-01, -1.29350936e+00,  4.72841889e-01, -2.41677999e-01,
       -3.53348821e-01,  3.60392094e-01, -3.66648644e-01, -5.83259344e-01,
        2.78585523e-01,  5.10834813e-01, -1.07827701e-01, -6.50267527e-02,
        4.87270892e-01,  6.90398455e-01,  6.89542368e-02, -2.00402051e-01,
       -7.77645111e-01, -4.56595480e-01, -2.12552428e-01, -5.58262408e-01,
       -3.43980610e-01,  2.74059445e-01,  5.78915477e-02, -4.94238883e-01,
        1.05730891e+00, -9.70214307e-02,  8.59945118e-02,  2.27669859e+00,
        1.93625242e-01, -3.34333181e-01,  3.02552760e-01,  3.14233571e-01,
       -6.46616280e-01,  2.31464982e-01,  3.39613646e-01,  1.01438055e+01,
        2.00304723e+00,  

In [None]:
len(h)

768
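
The indexing `[0][0]` picks the first sequence in the batch and then its first token, which is `[CLS]`. The sketch below repeats that step on a random stand-in array of the same shape `(1, 12, 768)`, so it runs without downloading the model. `fake_hidden` is a hypothetical placeholder, not real BERT output.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for output['last_hidden_state'].numpy(): (batch, tokens, hidden)
fake_hidden = rng.standard_normal((1, 12, 768)).astype(np.float32)

cls_vec = fake_hidden[0][0]       # chained indexing, as in the notebook
cls_vec_alt = fake_hidden[0, 0]   # equivalent single-step indexing

print(cls_vec.shape)                         # (768,)
print(np.array_equal(cls_vec, cls_vec_alt))  # True
```

Both forms return the same 768 values. The single-step form `[0, 0]` is the more common NumPy idiom.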