<a href="https://colab.research.google.com/github/ailab-nda/ML/blob/main/TOEIC_Part5_BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Natural Language Processing with Transformers

### Setup

Install the required libraries.

In [1]:
!pip install -q transformers
!pip install -q sentencepiece
!pip install -q datasets


Import the required libraries.

In [61]:
import numpy as np
import torch
import textwrap
from transformers import T5Tokenizer, BertTokenizer
from transformers import AutoModelForCausalLM, RobertaForMaskedLM
from transformers import BertForPreTraining

## 1. Filling in a blank in a sentence with a Japanese model





### Downloading the model

In [3]:
# Set up the tokenizer
tokenizer = T5Tokenizer.from_pretrained("rinna/japanese-roberta-base")
tokenizer.do_lower_case = True  # due to some bug of tokenizer config loading
# Set up the model
model = RobertaForMaskedLM.from_pretrained("rinna/japanese-roberta-base")


### Creating the question

In [5]:
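# Original sentence ("The Olympics are held once every four years.")
text = "4年に1度オリンピックは開かれる。"

# Add [CLS] at the start and [SEP] at the end of the sentence
text = "[CLS]" + text + "[SEP]"

# Tokenize
tokens = tokenizer.tokenize(text)
print(tokens)

# Mask one of the tokens
masked_idx = 5
tokens[masked_idx] = tokenizer.mask_token
print(tokens)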

['[CLS]', '▁4', '年に', '1', '度', 'オリンピック', 'は', '開かれる', '。', '[SEP]']
['[CLS]', '▁4', '年に', '1', '度', '[MASK]', 'は', '開かれる', '。', '[SEP]']
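The cell above hard-codes `masked_idx = 5`. As a minimal sketch, the position could instead be looked up from the token text, which avoids counting by hand. The helper name `mask_token_by_text` is hypothetical (it is not part of transformers), and it assumes the target word appears as a single token in the list.

```python
def mask_token_by_text(tokens, target, mask_token="[MASK]"):
    """Return (masked copy of tokens, index of the masked position).

    Hypothetical helper: masks the first token equal to `target`.
    Raises ValueError if `target` is not a single token in `tokens`.
    """
    idx = tokens.index(target)  # first occurrence; ValueError if absent
    masked = list(tokens)       # copy so the caller's list is not modified
    masked[idx] = mask_token
    return masked, idx


# Token list copied from the output above
tokens = ['[CLS]', '▁4', '年に', '1', '度', 'オリンピック', 'は', '開かれる', '。', '[SEP]']
masked, masked_idx = mask_token_by_text(tokens, 'オリンピック')
print(masked_idx)
print(masked)
```

With the real tokenizer, `tokenizer.mask_token` would be passed as `mask_token` rather than relying on the default string.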


### Solving the fill-in-the-blank question

Estimating the word to fill in (IDs)

In [6]:
# Convert the tokens to vocabulary IDs
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(token_ids)

# Convert to a tensor
token_tensor = torch.LongTensor([token_ids])

[4, 1602, 44, 24, 368, 6, 11, 21583, 8, 5]


Displaying the results

In [7]:
# Provide the position IDs
position_ids = list(range(0, token_tensor.size(1)))
position_id_tensor = torch.LongTensor([position_ids])

# Predictions for the masked token (top 10)
with torch.no_grad():
    outputs = model(input_ids=token_tensor, position_ids=position_id_tensor)
    predictions = outputs[0][0, masked_idx].topk(10)

for i, index_t in enumerate(predictions.indices):
    index = index_t.item()
    token = tokenizer.convert_ids_to_tokens([index])[0]
    print(i, token)

0 オールスターゲーム
1 スーパーボウル
2 ワールドカップ
3 アジア競技大会
4 株主総会
5 都市対抗野球
6 オリンピック
7 東京オリンピック
8 世界選手権
9 日本シリーズ
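The list above shows only the ranking, not how confident the model is in each guess. The values the model returns are raw scores (logits); a softmax turns them into probabilities that can be printed next to each token. The following is a minimal pure-Python sketch on made-up toy logits; the function names `softmax` and `top_k` are illustrative, and the numbers are not model output.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    m = max(logits)                           # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(probs, k):
    """Return the k (index, probability) pairs with the highest probability."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in order[:k]]


# Toy logits for a 4-word vocabulary (assumed values, for illustration only)
toy_logits = [2.0, 0.5, 3.0, -1.0]
probs = softmax(toy_logits)
for rank, (idx, p) in enumerate(top_k(probs, 2)):
    print(rank, idx, round(p, 3))
```

On real tensors the same idea can be written as `torch.softmax(outputs[0][0, masked_idx], dim=-1).topk(10)`, whose `values` field then holds the probabilities.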


## 2. Answering TOEIC Part 5 questions with an English model

In [104]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForPreTraining.from_pretrained('bert-base-uncased')

### Creating the question

Question sentence: text; answer choices: candidate

In [105]:
text = "Customer reviews indicate that many modern mobile devices are often unnecessarily [MASK] ."
candidate = ["complication", "complicates", "complicate", "complicated"]

Convert the question into a form BERT understands (text --> tokens)

In [108]:
tokens = tokenizer.tokenize(text)
print(tokens)

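# Position of [MASK] before the special tokens are added;
# prepending [CLS] shifts it by one in the final token list
masked_index = tokens.index("[MASK]")
tokens = ["[CLS]"] + tokens + ["[SEP]"]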

print(tokens)
print(masked_index)

['customer', 'reviews', 'indicate', 'that', 'many', 'modern', 'mobile', 'devices', 'are', 'often', 'un', '##ne', '##ces', '##sari', '##ly', '[MASK]', '.']
['[CLS]', 'customer', 'reviews', 'indicate', 'that', 'many', 'modern', 'mobile', 'devices', 'are', 'often', 'un', '##ne', '##ces', '##sari', '##ly', '[MASK]', '.', '[SEP]']
15
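Note that the printed index (15) is the position of `[MASK]` before `[CLS]` is prepended, which is why `masked_index+1` is used when the predictions are read out. As a small sketch of an alternative, the index can be looked up after the special tokens are added, so no offset is needed. The helper name `add_special_tokens_and_find_mask` is hypothetical, and the example uses a shortened token list.

```python
def add_special_tokens_and_find_mask(tokens, mask_token="[MASK]"):
    """Wrap tokens in [CLS] ... [SEP] and return (full list, mask position).

    The position is measured in the full list, i.e. it can be used
    directly to index the model output without adding 1.
    """
    full = ["[CLS]"] + list(tokens) + ["[SEP]"]
    return full, full.index(mask_token)


# Shortened tail of the tokenized question, for illustration
word_pieces = ['devices', 'are', 'often', 'un', '##ne', '##ces', '##sari', '##ly', '[MASK]', '.']
full_tokens, mask_pos = add_special_tokens_and_find_mask(word_pieces)
print(full_tokens)
print(mask_pos)
```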


### Producing the answer

In [107]:
ids = tokenizer.convert_tokens_to_ids(tokens)
ids = torch.tensor(ids).reshape(1,-1)  # reshape into a batch of size 1
predictions = model(ids)[0][0]
print(predictions)

tensor([[ -6.6169,  -6.6184,  -6.5846,  ...,  -5.8624,  -5.6686,  -3.7055],
        [ -9.9890,  -9.9992, -10.0754,  ...,  -9.5736,  -8.6640,  -7.1757],
        [ -1.2393,  -1.3488,  -1.7734,  ...,  -1.9904,  -3.6062,  -1.6186],
        ...,
        [ -0.8096,  -1.0072,  -0.7167,  ...,   0.0812,  -1.4079,   0.1565],
        [-12.0386, -11.9543, -12.2466,  ...,  -9.7020, -10.8202,  -7.3316],
        [ -9.9204, -10.3186, -10.5684,  ...,  -9.2968,  -9.4447,  -7.2326]],
       grad_fn=<SelectBackward0>)


In [88]:
_, predicted_indexes = torch.topk(predictions[masked_index+1], k=1000)
predicted_tokens = tokenizer.convert_ids_to_tokens(predicted_indexes.tolist())
# -> ['expensive', 'small', 'priced', 'used', ...
print(predicted_tokens)

['expensive', 'small', 'priced', 'used', 'unreliable', 'cheap', 'noisy', 'mobile', 'portable', 'slow', 'costly', 'bulky', 'worn', 'outdated', 'poor', 'modified', 'large', 'upgraded', 'designed', 'damaged', 'popular', 'installed', 'available', 'inexpensive', 'robust', 'fragile', 'rugged', 'difficult', 'rare', 'useful', 'sophisticated', 'long', 'reliable', 'successful', 'new', 'produced', 'oversized', 'lightweight', 'defective', 'dull', 'problematic', 'heavy', 'limited', 'fast', 'unnecessary', 'late', 'redesigned', 'durable', 'old', 'bundled', 'updated', 'disabled', 'obsolete', 'purchased', 'competitive', 'functional', 'loaded', 'powered', 'replaced', 'short', 'packaged', 'ineffective', 'equipped', 'weak', 'modern', 'useless', 'customized', 'clumsy', 'unstable', 'sensitive', 'sized', 'responsive', 'ordered', 'low', 'complicated', 'annoying', 'recommended', 'disconnected', 'ignored', 'shipped', 'faulty', 'profitable', 'scarce', 'complex', 'dangerous', 'sold', 'busy', 'inaccurate', 'dated'

In [89]:
for i, v in enumerate(predicted_tokens):
    if v in candidate:
        print(i, v)
        break

74 complicated
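Scanning the top 1000 predictions works here, but it silently finds nothing if no candidate ranks that high. A possible alternative is to compare the scores of the candidates directly. The sketch below does this on a toy vocabulary with made-up scores; `pick_best_candidate` is a hypothetical helper. It assumes each usable candidate is a single vocabulary token and skips the others, which mirrors the behaviour of the loop above for words that BERT splits into several word pieces.

```python
def pick_best_candidate(mask_scores, vocab, candidates):
    """Return the candidate with the highest score at the [MASK] position.

    mask_scores: sequence of scores indexed by token ID
    vocab:       dict mapping token string -> token ID
    candidates:  list of answer choices
    Candidates missing from `vocab` are skipped; returns None if none remain.
    """
    scored = [(mask_scores[vocab[w]], w) for w in candidates if w in vocab]
    if not scored:
        return None
    return max(scored)[1]


# Toy data (assumed values, not real BERT output)
toy_vocab = {"complicate": 0, "complicated": 1, "complicates": 2, "expensive": 3}
toy_scores = [1.2, 4.5, 0.3, 7.9]
candidate = ["complication", "complicates", "complicate", "complicated"]
print(pick_best_candidate(toy_scores, toy_vocab, candidate))
```

With the objects from this notebook, the mapping could come from `tokenizer.get_vocab()` and the scores from `predictions[masked_index+1].tolist()`.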


### Exercise
Implement the steps so far as a function part5_solver(text, candidate).

The function takes the question sentence (text) and the answer choices (candidate) as input and returns the word that is the answer.

In [90]:
def part5_solver(text, candidate):
    # Write your program here
    return answer

### Checking the function (trying it on Part 5 practice questions)
Official sample questions --> https://www.iibc-global.org/toeic/test/lr/about/format/sample05.html

Questions 2 to 5 (question sentences: text2 to text5; answer choices: candidate2 to candidate5)

In [91]:
text2 = "Jamal Nawzad has received top performance reviews [MASK] he joined the sales department two years ago ."
candidate2 = ["despite", "except", "since", "during"]
text3 = "Gyeon Corporation’s continuing education policy states that [MASK] learning new skills enhances creativity and focus ."
candidate3 = ["regular", "regularity", "regulate", "regularly"]
text4 = "Among [MASK] recognized at the company awards ceremony were senior business analyst Natalie Obi and sales associate Peter Comeau ."
candidate4 = ["who", "whose", "they", "those"]
text5 = "All clothing sold in Develyn’s Boutique is made from natural materials and contains no [MASK] dyes ."
candidate5 = ["immediate", "synthetic", "reasonable", "assumed"]

Producing the answers

In [92]:
print("answer:", part5_solver(text2, candidate2))
print("answer:", part5_solver(text3, candidate3))
print("answer:", part5_solver(text4, candidate4))
print("answer:", part5_solver(text5, candidate5))

answer: since
answer: regularly
answer: those
answer: synthetic
