
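## Example: CKIP pretrained models

Source: https://github.com/ckiplab/ckip-transformers

Goals:

1.   **Use different models on the same task and compare the results**
2.   **Inspect the number of parameters a given model uses**
3.   Explore the CKIP package in more depth: adapting to different downstream tasks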



In [None]:
# Install transformers
!pip install transformers



In [None]:
from transformers import (
   BertTokenizerFast,
   AutoModelForMaskedLM,
   AutoModelForCausalLM,
   AutoModelForTokenClassification,
   AutoTokenizer,
   pipeline,
)
import torch

In [None]:
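tokenizer = BertTokenizerFast.from_pretrained('bert-base-chinese')

# tokenize() does not add special tokens, so [CLS] is written into the text by hand
text1 = "[CLS] 等到潮水 [MASK] 了，就知道誰沒穿褲子。"
text2 = "[CLS] 數大 [MASK] 是美"

text = text1    # switch to text2 to try the other sentence
tokens = tokenizer.tokenize(text)                  # split the text into tokens
ids = tokenizer.convert_tokens_to_ids(tokens)      # map each token to its vocabulary id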

print(text)
print(tokens[:10], '...')
print(ids[:10], '...')

print('-=' * 120)


[CLS] 等到潮水 [MASK] 了，就知道誰沒穿褲子。
['[CLS]', '等', '到', '潮', '水', '[MASK]', '了', '，', '就', '知'] ...
[101, 5023, 1168, 4060, 3717, 103, 749, 8024, 2218, 4761] ...
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=


### Transformer

In [None]:
# Recap: last week's code
from transformers import BertForMaskedLM    # load the pretrained masked language model and predict the token behind [MASK]
# Besides the token ids, BERT also accepts segment ids (token type ids) that mark which sentence each token belongs to
tokens_tensor = torch.tensor([ids])  # (1, seq_len)
segments_tensors = torch.zeros_like(tokens_tensor)  # (1, seq_len); all zeros for a single sentence
baseline_maskedLM_model = BertForMaskedLM.from_pretrained("bert-base-chinese")
#clear_output()

# Use the masked LM to estimate the actual token at the [MASK] position
baseline_maskedLM_model.eval()
with torch.no_grad():
    # NOTE: in transformers, the second positional argument of forward() is
    # attention_mask, not token_type_ids, so this call feeds the all-zero tensor
    # as the attention mask. The recorded outputs come from this call; pass
    # token_type_ids=segments_tensors to supply segment ids as intended.
    outputs = baseline_maskedLM_model(tokens_tensor, segments_tensors)
    baseline_predictions = outputs[0]    # logits of shape (batch, seq_len, vocab_size)
    print('Shape of the predictions:', baseline_predictions.shape)
    #print(baseline_predictions)


Some weights of the model checkpoint at bert-base-chinese were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Shape of the predictions: torch.Size([1, 17, 21128])


In [None]:
# Take the top-k most likely tokens from the probability distribution at the [MASK] position
masked_index = tokens.index('[MASK]')    # position of the masked token: 5 for text1 (等到潮水[退]了), 3 for text2 (數大[便]是美)
k = 5
probs, indices = torch.topk(torch.softmax(baseline_predictions[0, masked_index], -1), k)
baseline_predicted_tokens = tokenizer.convert_ids_to_tokens(indices.tolist())

# Show the top-k candidates. Usually the top-1 candidate is taken as the prediction
print("Input tokens:", tokens[:10], '...')
print('-' * 50)
for i, (t, p) in enumerate(zip(baseline_predicted_tokens, probs), 1):
    filled = tokens.copy()          # work on a copy so `tokens` keeps its [MASK]
    filled[masked_index] = t
    print("Top {} ({:2}%): {}".format(i, int(p.item() * 100), filled[:10]), '...')

Input tokens: ['[CLS]', '等', '到', '潮', '水', '[MASK]', '了', '，', '就', '知'] ...
--------------------------------------------------
Top 1 ( 9%): ['[CLS]', '等', '到', '潮', '水', '等', '了', '，', '就', '知'] ...
Top 2 ( 4%): ['[CLS]', '等', '到', '潮', '水', '不', '了', '，', '就', '知'] ...
Top 3 ( 4%): ['[CLS]', '等', '到', '潮', '水', '來', '了', '，', '就', '知'] ...
Top 4 ( 2%): ['[CLS]', '等', '到', '潮', '水', '了', '了', '，', '就', '知'] ...
Top 5 ( 2%): ['[CLS]', '等', '到', '潮', '水', '要', '了', '，', '就', '知'] ...
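The cell above turns the logits at the [MASK] position into probabilities with a softmax and then keeps the k largest. A minimal pure-Python sketch of that step follows; the five-token vocabulary and the logit values are made up for illustration and are not model output.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(probs, k):
    """Return (index, probability) pairs for the k largest probabilities."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Hypothetical vocabulary and logits (illustration only)
toy_vocab = ['退', '來', '漲', '了', '的']
toy_logits = [3.0, 1.5, 1.0, 0.2, -1.0]

toy_probs = softmax(toy_logits)
for rank, (idx, p) in enumerate(top_k(toy_probs, 3), 1):
    print("Top {} ({:2}%): {}".format(rank, int(p * 100), toy_vocab[idx]))
```

In the notebook, `torch.softmax` and `torch.topk` do the same work over the full 21128-entry vocabulary.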


In [None]:
# Take a look at the model architecture
print(baseline_maskedLM_model)
total_params = sum(
    param.numel() for param in baseline_maskedLM_model.parameters()
)
print('bert-base-chinese parameter count:', total_params)
# End of last week's example
######################################################################################################################################

BertForMaskedLM(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(21128, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwi

### ckiplab

In [None]:
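# This week: use the ckiplab models
# Besides the token ids, the model also accepts segment ids (token type ids)
tokens_tensor = torch.tensor([ids])  # (1, seq_len)
segments_tensors = torch.zeros_like(tokens_tensor)  # (1, seq_len)
# Load a pretrained model from ckiplab
# maskedLM_model = BertForMaskedLM.from_pretrained("ckiplab/albert-base-chinese")
maskedLM_model = AutoModelForMaskedLM.from_pretrained('ckiplab/albert-base-chinese') # or another ckiplab masked-LM checkpoint

# Use the masked LM to estimate the actual token at the [MASK] position
maskedLM_model.eval()
with torch.no_grad():
    # NOTE: in transformers, the second positional argument of forward() is
    # attention_mask, not token_type_ids, so this call feeds the all-zero tensor
    # as the attention mask. The recorded outputs come from this call; pass
    # token_type_ids=segments_tensors to supply segment ids as intended.
    outputs = maskedLM_model(tokens_tensor, segments_tensors)
    predictions = outputs[0]    # logits of shape (batch, seq_len, vocab_size)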


In [None]:
# Take the top-k most likely tokens from the probability distribution at the [MASK] position
tokens = tokenizer.tokenize(text)        # re-tokenize so the display starts from the original [MASK] input
masked_index = tokens.index('[MASK]')    # position of the masked token: 5 for text1 (等到潮水[退]了), 3 for text2 (數大[便]是美)
k = 5
probs, indices = torch.topk(torch.softmax(predictions[0, masked_index], -1), k)
predicted_tokens = tokenizer.convert_ids_to_tokens(indices.tolist())

# Show the top-k candidates. Usually the top-1 candidate is taken as the prediction
print("Input tokens:", tokens[:10], '...')
print('-' * 50)
for i, (t, p) in enumerate(zip(predicted_tokens, probs), 1):
    filled = tokens.copy()          # work on a copy so `tokens` keeps its [MASK]
    filled[masked_index] = t
    print("Top {} ({:2}%): {}".format(i, int(p.item() * 100), filled[:10]), '...')

Input tokens: ['[CLS]', '等', '到', '潮', '水', '[MASK]', '了', '，', '就', '知'] ...
--------------------------------------------------
Top 1 (84%): ['[CLS]', '等', '到', '潮', '水', '的', '了', '，', '就', '知'] ...
Top 2 ( 2%): ['[CLS]', '等', '到', '潮', '水', '家', '了', '，', '就', '知'] ...
Top 3 ( 1%): ['[CLS]', '等', '到', '潮', '水', '人', '了', '，', '就', '知'] ...
Top 4 ( 1%): ['[CLS]', '等', '到', '潮', '水', '，', '了', '，', '就', '知'] ...
Top 5 ( 0%): ['[CLS]', '等', '到', '潮', '水', '和', '了', '，', '就', '知'] ...


In [None]:
# Take a look at the model architecture
print(maskedLM_model)
total_params = sum(
    param.numel() for param in maskedLM_model.parameters()
)
print('ckiplab/albert-base-chinese parameter count:', total_params)

AlbertForMaskedLM(
  (albert): AlbertModel(
    (embeddings): AlbertEmbeddings(
      (word_embeddings): Embedding(21128, 128, padding_idx=0)
      (position_embeddings): Embedding(512, 128)
      (token_type_embeddings): Embedding(2, 128)
      (LayerNorm): LayerNorm((128,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0, inplace=False)
    )
    (encoder): AlbertTransformer(
      (embedding_hidden_mapping_in): Linear(in_features=128, out_features=768, bias=True)
      (albert_layer_groups): ModuleList(
        (0): AlbertLayerGroup(
          (albert_layers): ModuleList(
            (0): AlbertLayer(
              (full_layer_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (attention): AlbertAttention(
                (query): Linear(in_features=768, out_features=768, bias=True)
                (key): Linear(in_features=768, out_features=768, bias=True)
                (value): Linear(in_features=768, out_features=768, bias=True)
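Comparing the two printouts, part of the size difference is already visible in the embedding layer: bert-base-chinese maps each of the 21128 vocabulary entries directly to 768 dimensions, while the ALBERT model uses 128-dimensional embeddings followed by a 128-to-768 projection (`embedding_hidden_mapping_in`). A rough sketch of that arithmetic, using only the shapes printed above; it covers the word embeddings and the projection, not the rest of either model.

```python
# Shapes read from the printed models above
vocab_size = 21128
bert_hidden = 768
albert_embed = 128
albert_hidden = 768

# BERT: one (vocab_size x hidden) word-embedding matrix
bert_word_emb = vocab_size * bert_hidden

# ALBERT: a smaller (vocab_size x embed) matrix plus Linear(embed -> hidden) with bias
albert_word_emb = vocab_size * albert_embed
albert_projection = albert_embed * albert_hidden + albert_hidden

print('BERT word embeddings:  ', bert_word_emb)
print('ALBERT word embeddings:', albert_word_emb)
print('ALBERT projection:     ', albert_projection)
print('ALBERT total:          ', albert_word_emb + albert_projection)
```

ALBERT also shares one layer's weights across the whole encoder stack (the printout lists a single `AlbertLayer`), which this sketch does not count.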
  

### How do we use AutoModelForCausalLM?

In [None]:
import transformers
print(transformers.__version__)

4.41.2


In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
from torch.nn import functional as F

# Kept for reference; a local copy of the helper is defined below instead of importing it
# from transformers import top_k_top_p_filtering
from torch import Tensor
def top_k_top_p_filtering(
    logits: Tensor,
    top_k: int = 0,
    top_p: float = 1.0,
    filter_value: float = -float("Inf"),
    min_tokens_to_keep: int = 1,
) -> Tensor:
    """Filter a distribution of logits using top-k and/or nucleus (top-p) filtering
    Args:
        logits: logits distribution shape (batch size, vocabulary size)
        if top_k > 0: keep only top k tokens with highest probability (top-k filtering).
        if top_p < 1.0: keep the top tokens with cumulative probability >= top_p (nucleus filtering).
            Nucleus filtering is described in Holtzman et al. (http://arxiv.org/abs/1904.09751)
        Make sure we keep at least min_tokens_to_keep per batch example in the output
    From: https://gist.github.com/thomwolf/1a5a29f6962089e871b94cbd09daf317
    """
    if top_k > 0:
        top_k = min(max(top_k, min_tokens_to_keep), logits.size(-1))  # Safety check
        # Remove all tokens with a probability less than the last token of the top-k
        indices_to_remove = logits < torch.topk(logits, top_k)[0][..., -1, None]
        logits[indices_to_remove] = filter_value

    if top_p < 1.0:
        sorted_logits, sorted_indices = torch.sort(logits, descending=True)
        cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)

        # Remove tokens with cumulative probability above the threshold (token with 0 are kept)
        sorted_indices_to_remove = cumulative_probs > top_p
        if min_tokens_to_keep > 1:
            # Keep at least min_tokens_to_keep (set to min_tokens_to_keep-1 because we add the first one below)
            sorted_indices_to_remove[..., :min_tokens_to_keep] = 0
        # Shift the indices to the right to keep also the first token above the threshold
        sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
        sorted_indices_to_remove[..., 0] = 0

        # scatter sorted tensors to original indexing
        indices_to_remove = sorted_indices_to_remove.scatter(1, sorted_indices, sorted_indices_to_remove)
        logits[indices_to_remove] = filter_value
    return logits

# Load GPT-2 tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("gpt2")
gpt2_model = AutoModelForCausalLM.from_pretrained('gpt2')
# The pretrained model is large (a download of roughly 548 MB), so use it with care

# Tokenize input phrase
phrase = "This is an example. You can probably think of a more fun text to use than this one."
inputs = tokenizer.encode(phrase, return_tensors='pt')

# Get the next-token logits (last position of the model output); no gradients are needed for inference
with torch.no_grad():
    last_layer_logits = gpt2_model(inputs).logits[:, -1, :]

# Keep at most the top 100 logits; top_p=1.0 leaves nucleus filtering disabled
top_logits = top_k_top_p_filtering(last_layer_logits, top_k=100, top_p=1.0)

# Softmax the logits into probabilities
probabilities = F.softmax(top_logits, dim=-1)

# Generate next token
generated_next_token = torch.multinomial(probabilities, num_samples=1)
generated = torch.cat([inputs, generated_next_token], dim=-1)

generated_text = tokenizer.decode(generated[0], skip_special_tokens=True)
print(generated_text)


This is an example. You can probably think of a more fun text to use than this one. For
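To see what the `top_p` branch of `top_k_top_p_filtering` does, here is a small pure-Python sketch of the same rule on a made-up distribution. The probabilities are assumptions for illustration, and `nucleus_keep` is a hypothetical helper, not part of transformers. Tokens are sorted by probability and kept until the cumulative probability first exceeds `top_p`, including the token that crosses the threshold.

```python
def nucleus_keep(probs, top_p):
    """Return the indices kept by nucleus (top-p) filtering, most likely first."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)            # keep the token, including the one that crosses top_p
        cumulative += probs[i]
        if cumulative > top_p:
            break                 # everything after this point is filtered out
    return kept

toy_probs = [0.05, 0.5, 0.15, 0.3]   # hypothetical next-token probabilities

print(nucleus_keep(toy_probs, top_p=0.9))   # [1, 3, 2]
print(nucleus_keep(toy_probs, top_p=1.0))   # [1, 3, 2, 0]: nothing is filtered
```

With `top_p=1.0`, as in the cell above, only the `top_k` limit has any effect.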


In [None]:
print(gpt2_model)

# Count every trainable and non-trainable parameter in the model
total_params = sum(
    param.numel() for param in gpt2_model.parameters()
)
print('gpt2 parameter count:', total_params)

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)
gpt2 parameter count: 124439808


### Causal language model (GPT2)

In [None]:
# Causal language model (GPT2)
tokenizer = BertTokenizerFast.from_pretrained('bert-base-chinese')
model = AutoModelForCausalLM.from_pretrained('ckiplab/gpt2-base-chinese') # or another ckiplab causal-LM checkpoint

input_text = "愛如潮水" # Try other inputs too, e.g. "等到潮水都退了", "我的愛如潮水", "數大便是美"

input_ids = tokenizer.encode(input_text, return_tensors="pt")
outputs = model.generate(input_ids, max_length=36)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:102 for open-end generation.


愛 如 潮 水 的 水 果, 讓 人 感 到 不 可 思 議 。 他 說, 這 次 的 水 果 銷 售 量 是 一 百 五 十 萬


In [None]:
# Take a look at the model architecture
print(model)

total_params = sum(
    param.numel() for param in model.parameters()
)
print('ckiplab/gpt2-base-chinese parameter count:', total_params)

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(21128, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=21128, bias=False)
)
ckiplab/gpt2-base-chinese parameter count: 102068736
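The two GPT-2 printouts show the same 12-block, 768-dimensional architecture, so the gap between their parameter counts comes from the vocabulary size alone. A back-of-the-envelope sketch: the embedding sizes are read from the printouts above, while the inner `Conv1D` widths (3 x 768 for `c_attn`, 4 x 768 for the MLP) are not shown there and are assumed from the standard GPT-2 base configuration. `lm_head` is assumed to share its weight with `wte`, so it is not counted a second time.

```python
def gpt2_param_count(vocab_size, hidden=768, n_layers=12, n_positions=1024):
    """Estimate the parameter count of a GPT-2 style model from its shapes."""
    embeddings = vocab_size * hidden + n_positions * hidden   # wte + wpe
    layer_norm = 2 * hidden                                   # weight + bias
    c_attn = hidden * 3 * hidden + 3 * hidden                 # fused query/key/value projection
    attn_proj = hidden * hidden + hidden                      # attention output projection
    c_fc = hidden * 4 * hidden + 4 * hidden                   # MLP expansion
    mlp_proj = 4 * hidden * hidden + hidden                   # MLP contraction
    block = 2 * layer_norm + c_attn + attn_proj + c_fc + mlp_proj
    return embeddings + n_layers * block + layer_norm         # final term is ln_f

print('gpt2:', gpt2_param_count(50257))
print('ckiplab/gpt2-base-chinese:', gpt2_param_count(21128))
print('difference from vocabulary size:', (50257 - 21128) * 768)
```

Under these assumptions the estimates agree with the totals reported by the notebook (124439808 and 102068736).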


### pipeline

In [None]:
# Run a model through pipeline
tokenizer = BertTokenizerFast.from_pretrained('bert-base-chinese')
tc_model = AutoModelForTokenClassification.from_pretrained('ckiplab/albert-tiny-chinese-ws') # word-segmentation (ws) checkpoint; other ckiplab token-classification models can be swapped in

input_text = "等到潮水都退了" # Try other inputs too, e.g. "等到潮水都退了", "我的愛如潮水", "數大便是美"

ws_pipeline = pipeline('token-classification', model=tc_model, tokenizer=tokenizer)
ws_pipeline(input_text)


[{'entity': 'B',
  'score': 0.9999869,
  'index': 1,
  'word': '等',
  'start': 0,
  'end': 1},
 {'entity': 'I',
  'score': 0.98758113,
  'index': 2,
  'word': '到',
  'start': 1,
  'end': 2},
 {'entity': 'B',
  'score': 0.99996626,
  'index': 3,
  'word': '潮',
  'start': 2,
  'end': 3},
 {'entity': 'I',
  'score': 0.99979514,
  'index': 4,
  'word': '水',
  'start': 3,
  'end': 4},
 {'entity': 'B',
  'score': 0.9995327,
  'index': 5,
  'word': '都',
  'start': 4,
  'end': 5},
 {'entity': 'B',
  'score': 0.99974006,
  'index': 6,
  'word': '退',
  'start': 5,
  'end': 6},
 {'entity': 'B',
  'score': 0.9996985,
  'index': 7,
  'word': '了',
  'start': 6,
  'end': 7}]
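The word-segmentation model labels each character `B` (begins a word) or `I` (continues the current word). A minimal sketch of turning those tags back into words, using the `entity` and `word` fields from the output above; `merge_bi_tags` is a hypothetical helper written for this illustration, not part of transformers or ckip-transformers.

```python
def merge_bi_tags(tagged):
    """Group characters into words: 'B' starts a new word, 'I' extends the current one."""
    words = []
    for item in tagged:
        if item['entity'] == 'B' or not words:
            words.append(item['word'])      # start a new word
        else:
            words[-1] += item['word']       # append to the word in progress
    return words

# entity/word pairs copied from the pipeline output above
ws_output = [
    {'entity': 'B', 'word': '等'},
    {'entity': 'I', 'word': '到'},
    {'entity': 'B', 'word': '潮'},
    {'entity': 'I', 'word': '水'},
    {'entity': 'B', 'word': '都'},
    {'entity': 'B', 'word': '退'},
    {'entity': 'B', 'word': '了'},
]

print(merge_bi_tags(ws_output))   # ['等到', '潮水', '都', '退', '了']
```

The same function can be applied directly to the list returned by `ws_pipeline(input_text)`, since each entry carries these two keys.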

In [None]:
# Take a look at the model architecture
print(tc_model)

total_params = sum(
    param.numel() for param in tc_model.parameters()
)
print('ckiplab/albert-tiny-chinese-ws parameter count:', total_params)

AlbertForTokenClassification(
  (albert): AlbertModel(
    (embeddings): AlbertEmbeddings(
      (word_embeddings): Embedding(21128, 128, padding_idx=0)
      (position_embeddings): Embedding(512, 128)
      (token_type_embeddings): Embedding(2, 128)
      (LayerNorm): LayerNorm((128,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.0, inplace=False)
    )
    (encoder): AlbertTransformer(
      (embedding_hidden_mapping_in): Linear(in_features=128, out_features=312, bias=True)
      (albert_layer_groups): ModuleList(
        (0): AlbertLayerGroup(
          (albert_layers): ModuleList(
            (0): AlbertLayer(
              (full_layer_layer_norm): LayerNorm((312,), eps=1e-12, elementwise_affine=True)
              (attention): AlbertAttention(
                (query): Linear(in_features=312, out_features=312, bias=True)
                (key): Linear(in_features=312, out_features=312, bias=True)
                (value): Linear(in_features=312, out_features=312, 

## **Quiz - 1: Each example above follows a somewhat different workflow. Which workflow do you find the simplest and clearest? Why?**

## **Quiz - 2: Which of the models above do you find the most useful? Why? (Discuss in terms of downstream task, model parameters, and execution time.)**