# Chapter 3 Fundamentals of Large Language Models

## 3.2 GPT (Decoder)

### 3.2.4 Usage in Transformers

In [1]:
!pip -q install transformers[ja,sentencepiece,torch] pandas xformers

In [2]:
from transformers import pipeline

In [3]:
# Create a pipeline for predicting subsequent text
generator = pipeline(
    "text-generation", model="abeja/gpt2-large-japanese"
)
# Generate text following "日本で一番高い山は"
# (which translates to "The highest mountain in Japan is")
outputs = generator("日本で一番高い山は")
# Output meaning: "The highest mountain in Japan is Mt. Fuji, with an elevation of
# 2895m (compared to sea level). This is higher than the Imperial Palace, currently
# the tallest structure in Japan, which is higher than Mt. Fuji. So how much ..."
# Note: the generated text is fluent but factually wrong (Mt. Fuji is 3,776 m high,
# and the Imperial Palace is not the tallest structure in Japan), and it breaks off
# mid-sentence.
print(outputs[0]["generated_text"])

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


日本で一番高い山は富士山で、その標高は2895m(海抜の比較)。 これは、現在日本で最も高い建造物である皇居が、富士山より高いのですが、それよりも高いのです。 それでは、どのくらい
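
The `text-generation` pipeline extends the prompt one token at a time: at each step the model scores the candidate next tokens, one is chosen, and it is appended to the input for the following step. The block below is a minimal sketch of that loop with greedy selection. It uses a hand-written score table (`toy_scores`) in place of a real model; the table, its values, and the function name `greedy_generate` are illustrative assumptions and not part of Transformers.

```python
# Hypothetical next-token scores, keyed by the previous token only.
# (A real GPT model conditions on the whole prefix, not just the last token.)
toy_scores = {
    "日本": {"で": 0.7, "の": 0.3},
    "で": {"一番": 0.8, "は": 0.2},
    "一番": {"高い": 0.6, "山": 0.4},
    "高い": {"山": 0.9, "</s>": 0.1},
    "山": {"は": 0.7, "</s>": 0.3},
    "は": {"富士山": 0.8, "</s>": 0.2},
    "富士山": {"</s>": 1.0},
}


def greedy_generate(prompt_tokens, max_new_tokens=10, eos="</s>"):
    """Append the highest-scoring next token until EOS or the length limit."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        candidates = toy_scores.get(tokens[-1])
        if not candidates:
            break  # the toy table has no entry for this token
        next_token = max(candidates, key=candidates.get)
        if next_token == eos:
            break  # the "model" decided the text is finished
        tokens.append(next_token)
    return tokens


generated = greedy_generate(["日本", "で", "一番", "高い", "山", "は"])
print("".join(generated))  # 日本で一番高い山は富士山
```

A real pipeline may sample from the scores instead of always taking the top candidate, and a length limit such as `max_new_tokens` can cut the text off mid-sentence, as in the output above.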


## 3.3 BERT and RoBERTa (Encoder)

### 3.3.4 Usage in Transformers

In [4]:
import pandas as pd

# Create a pipeline to predict masked tokens
fill_mask = pipeline(
    "fill-mask", model="cl-tohoku/bert-base-japanese-v3"
)
# masked_text meaning: The capital of Japan is [MASK]
masked_text = "日本の首都は[MASK]である"
# Predict the [MASK] part
outputs = fill_mask(masked_text)
# Display the top 3 items in a table
display(pd.DataFrame(outputs[:3]))

| | score | token | token_str | sequence |
|---|---|---|---|---|
| 0 | 0.88417 | 12569 | 東京 | 日本 の 首都 は 東京 で ある |
| 1 | 0.02482 | 12759 | 大阪 | 日本 の 首都 は 大阪 で ある |
| 2 | 0.020864 | 13017 | 京都 | 日本 の 首都 は 京都 で ある |


In [5]:
# masked_text meaning: Today's movie was exciting and interesting. This movie is [MASK].
masked_text = "今日の映画は刺激的で面白かった。この映画は[MASK]。"
outputs = fill_mask(masked_text)
display(pd.DataFrame(outputs[:3]))

| | score | token | token_str | sequence |
|---|---|---|---|---|
| 0 | 0.683933 | 23845 | 素晴らしい | 今日 の 映画 は 刺激 的 で 面白かっ た 。 この 映画 は 素晴らしい 。 |
| 1 | 0.101234 | 24683 | 面白い | 今日 の 映画 は 刺激 的 で 面白かっ た 。 この 映画 は 面白い 。 |
| 2 | 0.048003 | 26840 | 楽しい | 今日 の 映画 は 刺激 的 で 面白かっ た 。 この 映画 は 楽しい 。 |
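
Each `fill-mask` result is a dictionary with `score`, `token`, `token_str`, and `sequence` keys, which become the columns of the tables above. As a small sketch, the helper below (the name `top_candidates` is hypothetical) extracts just the candidate strings and rounded scores. To stay self-contained, it runs on a hand-copied excerpt of the first result table rather than on a live pipeline call.

```python
# Excerpt copied from the first fill-mask result shown above.
sample_outputs = [
    {"score": 0.88417, "token": 12569, "token_str": "東京",
     "sequence": "日本 の 首都 は 東京 で ある"},
    {"score": 0.02482, "token": 12759, "token_str": "大阪",
     "sequence": "日本 の 首都 は 大阪 で ある"},
    {"score": 0.020864, "token": 13017, "token_str": "京都",
     "sequence": "日本 の 首都 は 京都 で ある"},
]


def top_candidates(outputs, k=3, digits=3):
    """Return (token_str, rounded score) pairs for the k best candidates."""
    ranked = sorted(outputs, key=lambda item: item["score"], reverse=True)
    return [(item["token_str"], round(item["score"], digits)) for item in ranked[:k]]


print(top_candidates(sample_outputs, k=2))  # [('東京', 0.884), ('大阪', 0.025)]
```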


## 3.4 T5 (Encoder-Decoder)

### 3.4.4 Usage in Transformers

In [6]:
# Create a pipeline for text-to-text generation
t2t_generator = pipeline(
    "text2text-generation", model="retrieva-jp/t5-large-long"
)
# Predict the masked span; generation stops at token ID 32098 (<extra_id_1>)
# masked_text meaning: <extra_id_0> opened the Edo Shogunate.
masked_text = "江戸幕府を開いたのは、<extra_id_0>である"
outputs = t2t_generator(masked_text, eos_token_id=32098)
# Output meaning: Tokugawa Ieyasu
print(outputs[0]["generated_text"])

徳川家康


In [7]:
# Look up the ID of the sentinel token <extra_id_1>
t2t_generator.tokenizer.convert_tokens_to_ids("<extra_id_1>")

32098
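
This ID explains the `eos_token_id=32098` argument passed to `t2t_generator`: generation is told to stop as soon as the model emits `<extra_id_1>`, that is, once the span for `<extra_id_0>` is complete. The following is a minimal sketch of the ID arithmetic. It assumes this tokenizer follows the original T5 convention of a 32,100-entry vocabulary whose last 100 entries are the sentinel tokens in descending order; the function `sentinel_id` and its default arguments are assumptions for illustration and are not read from the model.

```python
def sentinel_id(n, vocab_size=32100, num_sentinels=100):
    """ID of <extra_id_n>, assuming sentinels fill the end of the vocabulary
    in descending order (<extra_id_0> gets the largest ID)."""
    if not 0 <= n < num_sentinels:
        raise ValueError(f"sentinel index out of range: {n}")
    return vocab_size - 1 - n


print(sentinel_id(1))  # 32098, matching the tokenizer output above
```

In practice, looking the ID up with `convert_tokens_to_ids`, as in the cell above, is safer than hard-coding it, because the layout depends on the tokenizer.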

In [8]:
# masked_text meaning: <extra_id_0> issues currency in Japan.
masked_text = "日本で通貨を発行しているのは、<extra_id_0>である"
outputs = t2t_generator(masked_text, eos_token_id=32098)
# Output meaning: Bank of Japan
print(outputs[0]["generated_text"])

日本銀行


In [9]:
# Check if the string below is present in the vocabulary of the tokenizer
# String's meaning: Bank of Japan
"日本銀行" in t2t_generator.tokenizer.vocab

False
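
The result is `False`, yet the model generated 日本銀行, which suggests that the answer was assembled from several smaller vocabulary entries. Unlike the single `[MASK]` token used with BERT in 3.3.4, a T5 sentinel token can stand for a span of several tokens. The sketch below illustrates the idea with a greedy longest-match split over a tiny hypothetical vocabulary (`toy_vocab`). The real tokenizer is a SentencePiece model, so its segmentation algorithm and actual pieces may differ.

```python
# Hypothetical vocabulary: the full word is absent, but its parts are present.
toy_vocab = {"日本", "銀行", "日", "本", "銀", "行"}


def greedy_split(text, vocab):
    """Split text into vocabulary pieces, always taking the longest match."""
    pieces = []
    start = 0
    while start < len(text):
        for end in range(len(text), start, -1):
            if text[start:end] in vocab:
                pieces.append(text[start:end])
                start = end
                break
        else:
            raise ValueError(f"no vocabulary piece covers {text[start]!r}")
    return pieces


print("日本銀行" in toy_vocab)             # False
print(greedy_split("日本銀行", toy_vocab))  # ['日本', '銀行']
```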