<a href="https://colab.research.google.com/github/ailab-nda/NLP/blob/main/Transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Natural Language Processing with BERT

### Setup (installing the required libraries)

In [6]:
!pip install -q transformers
!pip install -q sentencepiece
!pip install -q datasets

## Answering TOEIC Part 5 questions with a pretrained model

In [51]:
import torch
from transformers import BertTokenizer, BertForPreTraining
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForPreTraining.from_pretrained('bert-base-uncased')

Some weights of BertForPreTraining were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['cls.predictions.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Creating the question

In [52]:
text = "Customer reviews indicate that many modern mobile devices are often unnecessarily [MASK] ."
candidate = ["complication", "complicates", "complicate", "complicated"]

Convert the text into a form BERT can process (text --> tokens)

In [53]:
tokens = tokenizer.tokenize(text)
print(tokens)
masked_index = tokens.index("[MASK]")
tokens = ["[CLS]"] + tokens + ["[SEP]"]

print(tokens)
print(masked_index)

['customer', 'reviews', 'indicate', 'that', 'many', 'modern', 'mobile', 'devices', 'are', 'often', 'un', '##ne', '##ces', '##sari', '##ly', '[MASK]', '.']
['[CLS]', 'customer', 'reviews', 'indicate', 'that', 'many', 'modern', 'mobile', 'devices', 'are', 'often', 'un', '##ne', '##ces', '##sari', '##ly', '[MASK]', '.', '[SEP]']
15
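
The index printed above (15) was computed before `[CLS]` was prepended, so `[MASK]` sits one position later in the final sequence. This is why the later cells index with `masked_index+1`. A minimal sketch of that offset, using a shortened, hypothetical token list rather than real tokenizer output:

```python
# Hypothetical, shortened token list (not real tokenizer output).
tokens = ["devices", "are", "often", "[MASK]", "."]

# Index of [MASK] before the special tokens are added.
masked_index = tokens.index("[MASK]")

# Prepending [CLS] shifts every token one position to the right.
tokens_with_special = ["[CLS]"] + tokens + ["[SEP]"]

print(masked_index)                          # index before adding [CLS]
print(tokens_with_special.index("[MASK]"))   # index in the final sequence
```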


## Producing the answer

In [54]:
ids = tokenizer.convert_tokens_to_ids(tokens)
ids = torch.tensor(ids).reshape(1,-1)  # reshape to a batch of size 1
predictions = model(ids)[0][0]
print(predictions)

tensor([[ -6.6169,  -6.6184,  -6.5846,  ...,  -5.8624,  -5.6686,  -3.7055],
        [ -9.9890,  -9.9992, -10.0754,  ...,  -9.5736,  -8.6640,  -7.1756],
        [ -1.2393,  -1.3488,  -1.7734,  ...,  -1.9904,  -3.6062,  -1.6186],
        ...,
        [ -0.8096,  -1.0072,  -0.7167,  ...,   0.0812,  -1.4079,   0.1565],
        [-12.0386, -11.9543, -12.2466,  ...,  -9.7020, -10.8202,  -7.3316],
        [ -9.9204, -10.3186, -10.5684,  ...,  -9.2968,  -9.4447,  -7.2326]],
       grad_fn=<SelectBackward>)


In [55]:
_, predicted_indexes = torch.topk(predictions[masked_index+1], k=1000)
predicted_tokens = tokenizer.convert_ids_to_tokens(predicted_indexes.tolist())
# -> ['expensive', 'small', 'priced', 'used', ...
print(predicted_tokens)

['expensive', 'small', 'priced', 'used', 'unreliable', 'cheap', 'noisy', 'mobile', 'portable', 'slow', 'costly', 'bulky', 'worn', 'outdated', 'poor', 'modified', 'large', 'upgraded', 'designed', 'damaged', 'popular', 'installed', 'available', 'inexpensive', 'robust', 'fragile', 'rugged', 'difficult', 'rare', 'useful', 'sophisticated', 'long', 'reliable', 'successful', 'new', 'produced', 'oversized', 'lightweight', 'defective', 'dull', 'problematic', 'heavy', 'limited', 'fast', 'unnecessary', 'late', 'redesigned', 'durable', 'old', 'bundled', 'updated', 'disabled', 'obsolete', 'purchased', 'competitive', 'functional', 'loaded', 'powered', 'replaced', 'short', 'packaged', 'ineffective', 'equipped', 'weak', 'modern', 'useless', 'customized', 'clumsy', 'unstable', 'sensitive', 'sized', 'responsive', 'ordered', 'low', 'complicated', 'annoying', 'recommended', 'disconnected', 'ignored', 'shipped', 'faulty', 'profitable', 'scarce', 'complex', 'dangerous', 'sold', 'busy', 'inaccurate', 'dated'

In [56]:
for i, v in enumerate(predicted_tokens):
    if v in candidate:
        print(i, v)
        break

74 complicated
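
Scanning a top-k list works, but the same decision can be made by looking up each candidate's score directly. The sketch below shows that idea on a toy vocabulary with made-up scores. `best_candidate`, `toy_vocab`, and `toy_logits` are illustrative names, not part of the notebook. With the real model, the scores would presumably come from `predictions[masked_index+1]` and the ids from `tokenizer.convert_tokens_to_ids`. Candidates that are not single vocabulary entries are skipped, which mirrors the fact that they can never appear in the top-k token list.

```python
def best_candidate(logits, vocab, candidates):
    """Return the candidate with the highest score, or None if none is in the vocabulary."""
    best_word, best_score = None, float("-inf")
    for word in candidates:
        token_id = vocab.get(word)
        if token_id is None:
            continue  # not a single vocabulary entry: skip it
        if logits[token_id] > best_score:
            best_word, best_score = word, logits[token_id]
    return best_word

# Toy data: made-up vocabulary ids and scores, for illustration only.
toy_vocab = {"expensive": 0, "complicated": 1, "complicate": 2, "complicates": 3}
toy_logits = [9.1, 6.3, 2.0, 1.5]
candidate = ["complication", "complicates", "complicate", "complicated"]

print(best_candidate(toy_logits, toy_vocab, candidate))
```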


## Wrapping it in a function

In [68]:
def part5_slover(text, candidate):
    """Return the candidate that BERT ranks highest at the [MASK] position."""
    tokens = tokenizer.tokenize(text)
    masked_index = tokens.index("[MASK]")
    tokens = ["[CLS]"] + tokens + ["[SEP]"]

    ids = tokenizer.convert_tokens_to_ids(tokens)
    ids = torch.tensor(ids).reshape(1,-1)  # reshape to a batch of size 1
    predictions = model(ids)[0][0]

    # +1 because [CLS] was prepended after masked_index was computed
    _, predicted_indexes = torch.topk(predictions[masked_index+1], k=10000)
    predicted_tokens = tokenizer.convert_ids_to_tokens(predicted_indexes.tolist())

    # walk the ranking from the top and return the first candidate found
    for i, v in enumerate(predicted_tokens):
        if v in candidate:
            return "answer: " + v
    return "don't know"

In [69]:
text = "The supermarket giant donates food to [MASK] people ."
candidate = ["need", "needed", "needy", "necessary"]
#text = "Although Mrs. Baker has U.S. citizenship, she is [MASK] from New Zealand ."
#candidate = ["originally", "originality", "original", "originated"]
#text = "The Chairperson shed false tears at the shareholders' meeting to [MASK] the dramatic effect ."
#candidate = ["height", "heighten", "high", "highly"]
#text = "Demonstrators gathered in front of the pharmaceutical company to [MASK] against animal testing ."
#candidate = ["prospect", "protect", "protest", "protract"]
#text = "Miss Marting's argument for the new project was so [MASK] that no one could challenge it ."
#candidate = ["persuasive", "persuading", "persuasion", "persuasively"]

In [70]:
part5_slover(text, candidate)

'answer: persuasion'
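
The commented-out questions above can be run in one pass instead of being uncommented one at a time. A minimal sketch, assuming the solver takes `(text, candidate)` and returns a string as `part5_slover` does. `solve_all` and `fake_solver` are hypothetical names, the question texts are shortened versions of the ones above, and the stand-in solver exists only so the example runs without the model:

```python
def solve_all(solver, questions):
    """Apply solver to each (text, candidate) pair and collect the results."""
    return [solver(text, candidate) for text, candidate in questions]

# Stand-in solver for illustration only: always picks the first candidate.
def fake_solver(text, candidate):
    return "answer: " + candidate[0]

# Shortened versions of the questions, paired with their candidate lists.
questions = [
    ("She is [MASK] from New Zealand .",
     ["originally", "originality", "original", "originated"]),
    ("They gathered to [MASK] against animal testing .",
     ["prospect", "protect", "protest", "protract"]),
]

for result in solve_all(fake_solver, questions):
    print(result)
```

In the notebook itself, `solve_all(part5_slover, questions)` would route the same list through the real model.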

## 1. Filling in blanks in a sentence with RoBERTa



### Downloading the model

In [None]:
from transformers import T5Tokenizer, RobertaForMaskedLM

tokenizer = T5Tokenizer.from_pretrained("rinna/japanese-roberta-base")
tokenizer.do_lower_case = True  # due to some bug of tokenizer config loading

model = RobertaForMaskedLM.from_pretrained("rinna/japanese-roberta-base")

### Creating the question

In [None]:
# original text
text = "4年に1度オリンピックは開かれる。"
#text = ""

# prepend [CLS]
text = "[CLS]" + text

# tokenize
tokens = tokenizer.tokenize(text)
print(tokens)

# mask a token
masked_idx = 5
tokens[masked_idx] = tokenizer.mask_token
print(tokens) 
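
The position `5` is hardcoded, so it has to be recounted whenever the sentence changes. One alternative is to look up the token to hide by its surface form. A minimal sketch, assuming a hypothetical token list (the real output of `tokenizer.tokenize` may differ) and the literal string `[MASK]` standing in for `tokenizer.mask_token`. `mask_first` is an illustrative helper, not a library function:

```python
def mask_first(tokens, target, mask_token="[MASK]"):
    """Return (new_tokens, index) with the first occurrence of target masked."""
    idx = tokens.index(target)  # raises ValueError if target is absent
    masked = list(tokens)       # copy so the input list is left unchanged
    masked[idx] = mask_token
    return masked, idx

# Hypothetical tokenization, for illustration only.
tokens = ["[CLS]", "▁4", "年に", "1", "度", "オリンピック", "は", "開かれる", "。"]
masked_tokens, masked_idx = mask_first(tokens, "オリンピック")
print(masked_idx)
print(masked_tokens)
```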

### Solving the fill-in-the-blank question

Estimating the missing word (ids)

In [None]:
# convert to ids
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(token_ids)

# convert to tensor
import torch
token_tensor = torch.LongTensor([token_ids])

Displaying the results

In [None]:
# provide position ids explicitly
position_ids = list(range(0, token_tensor.size(1)))
position_id_tensor = torch.LongTensor([position_ids])

# get the top 10 predictions of the masked token
with torch.no_grad():
    outputs = model(input_ids=token_tensor, position_ids=position_id_tensor)
    predictions = outputs[0][0, masked_idx].topk(10)

for i, index_t in enumerate(predictions.indices):
    index = index_t.item()
    token = tokenizer.convert_ids_to_tokens([index])[0]
    print(i, token)
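
The loop above prints the top tokens without their scores. The raw model outputs are unnormalized, so a softmax is one way to turn them into probabilities that are easier to compare. A minimal pure-Python sketch on made-up scores (`softmax` and `toy_scores` are illustrative names). On the actual tensor, `torch.softmax(outputs[0][0, masked_idx], dim=-1)` should be the equivalent operation:

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(scores)                           # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

toy_scores = [2.0, 1.0, 0.0]  # made-up scores, for illustration only
probs = softmax(toy_scores)
print([round(p, 3) for p in probs])
```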

## 2. Text generation with GPT-2

### (1) Using rinna/japanese-gpt2

### Downloading the model

In [None]:
from transformers import T5Tokenizer, AutoModelForCausalLM

tokenizer = T5Tokenizer.from_pretrained("rinna/japanese-gpt2-medium")
tokenizer.do_lower_case = True  # due to some bug of tokenizer config loading

model = AutoModelForCausalLM.from_pretrained("rinna/japanese-gpt2-medium")

### Text generation example

In [None]:
# named input_ids to avoid shadowing the built-in input()
input_ids = tokenizer.encode("私が防衛大学校に入校してから、", return_tensors="pt")
output = model.generate(input_ids, do_sample=True, max_length=100, num_return_sequences=3)
sentences = tokenizer.batch_decode(output)
for sentence in sentences:
    print(sentence)

At this point, upload train.txt and run_clm.py.
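
The notebook does not show what `train.txt` contains. A minimal sketch of producing such a file, assuming plain UTF-8 text with one sample per line (an assumption, since the source does not specify the layout). `write_training_file` and the sample sentences are hypothetical:

```python
from pathlib import Path

def write_training_file(samples, path="train.txt"):
    """Write one non-empty sample per line as UTF-8 text and return the path."""
    lines = [s.strip() for s in samples if s.strip()]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return Path(path)

# Placeholder sentences, for illustration only.
samples = ["おはよう、お兄ちゃん。", "  今日はいい天気だね。 ", ""]
out = write_training_file(samples)
print(out.read_text(encoding="utf-8"))
```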

In [None]:
%%time
# -f keeps this from failing when output/ does not exist yet
!rm -rf output

# run fine-tuning
!python ./run_clm.py \
    --model_name_or_path=rinna/japanese-gpt2-small \
    --train_file=train.txt \
    --validation_file=train.txt \
    --do_train \
    --do_eval \
    --num_train_epochs=10 \
    --save_steps=5000 \
    --save_total_limit=3 \
    --per_device_train_batch_size=2 \
    --per_device_eval_batch_size=2 \
    --output_dir=output/ \
    --use_fast_tokenizer=False

In [None]:
# load the fine-tuned model
model = AutoModelForCausalLM.from_pretrained("output/")

# inference (input_ids avoids shadowing the built-in input())
input_ids = tokenizer.encode("おはよう、お兄ちゃん。", return_tensors="pt")
output = model.generate(input_ids, do_sample=True, max_length=100, num_return_sequences=8)
sentences = tokenizer.batch_decode(output)
for sentence in sentences:
    print(sentence)

### (2) Using GPT2-Japanese

### Downloading and installing the model

In [None]:
# install gpt2-japanese
!git clone https://github.com/tanreinama/gpt2-japanese
%cd gpt2-japanese
#!pip uninstall tensorflow -y
!pip install -r requirements.txt

In [None]:
# download the small model
!wget https://www.nama.ne.jp/models/gpt2ja-small.tar.bz2
!tar xvfj gpt2ja-small.tar.bz2

### Generating random text

In [None]:
# check that the small model works
!python gpt2-generate.py --model gpt2ja-small --num_generate 3

### Generating a continuation of a text

In [None]:
!python gpt2-generate.py --model gpt2ja-small --num_generate 3 --context="私は防衛大学校に入校してから、"

In [None]:
# create the dataset
!git clone https://github.com/tanreinama/Japanese-BPEEncoder.git

At this point, create a folder named mydata under gpt2-japanese and place the data in it.

In [None]:
!python ./Japanese-BPEEncoder/encode_bpe.py --src_dir mydata --dst_file finetune

At this point, place run_finetune.py under gpt2-japanese.

In [None]:
!python run_finetune.py --base_model gpt2ja-small --dataset finetune.npz --run_name gpr2ja-finetune_run1-small

In [None]:
!python gpt2-generate.py --model checkpoint/gpr2ja-finetune_run1-small --num_generate 8