<a id='top'></a><a name='top'></a>
# Chapter 5: Natural Language Generation and Conversion with Transformer

## 5.1 Transformer and Text Generation

<table align="left">
  <td>
    <a href="https://colab.research.google.com/github/gbih/nlp/blob/main/ja_nlp_book/chp05_5_1_text_generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
  </td>
</table>

* [Imports and Setup](#setup)
* [5.1 Transformer and Text Generation](#5.1)
    - [5.1.1 What is the Transformer?](#5.1.1)
    - [5.1.2 Text Generation](#5.1.2)

---
<a name='setup'></a><a id='setup'></a>
# Imports and Setup
<a href="#top">[back to top]</a>

In [1]:
import pathlib
from pathlib import Path

data_root = Path("chp05_01")
req_file = data_root / "requirements_5_5_1.txt"

if not data_root.is_dir():
    data_root.mkdir()
else:
    print(f"{data_root} exists.")

In [2]:
%%writefile {req_file}
datasets==1.11.0
fugashi[unidic]==1.2.1
gensim==3.6.0
japanize_matplotlib==1.1.3
sentencepiece==0.1.97
tqdm==4.64.1
transformers==4.9.0
unidic_lite==1.0.8
watermark==2.3.1

Writing chp05_01/requirements_5_5_1.txt


In [3]:
import sys
import os
check1 = ('google.colab' in sys.modules)
check2 = (os.environ.get('CLOUDSDK_CONFIG')=='/content/.config')
IS_COLAB = check1 or check2

if IS_COLAB:
    print("Installing packages")
    !pip install --quiet -r {req_file}
    !python -m unidic download
    !apt-get install -y jq > /dev/null 2>&1
    !jq --version
    print("Need to restart runtime after installing sentencepiece.")
    print("> Runtime > Restart runtime ...")
else:
    print("Running locally.")

Installing packages

In [1]:
# Standard library imports
from collections import Counter
import json
import logging
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' # Suppress TensorFlow log messages
import pathlib
from pathlib import Path
import shlex
import subprocess
import sys

# Third-party imports
from datasets import load_dataset # Hugging Face Datasets
import pandas as pd
import sentencepiece as spm
import torch
from tqdm import tqdm
from transformers import T5Tokenizer # Requires sentencepiece
from transformers import AutoModelForCausalLM
from transformers import Trainer
from transformers import TrainingArguments
from watermark import watermark

# suppress logging from transformers
logging.getLogger("transformers").setLevel(logging.ERROR)
logging.getLogger("transformers.trainer").setLevel(logging.ERROR)
logging.getLogger("datasets").setLevel(logging.ERROR)

_ = torch.manual_seed(42)

def HR():
    print("-"*50)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"device:{device}")
HR()

packages_check="datasets,fugashi,gensim,japanize_matplotlib,sentencepiece,tqdm,transformers,unidic_lite,watermark"
print(watermark(packages=packages_check, python=True,machine=True))

device:cuda
--------------------------------------------------
Python implementation: CPython
Python version       : 3.8.16
IPython version      : 7.9.0

datasets           : 1.11.0
fugashi            : 1.2.1
gensim             : 3.6.0
japanize_matplotlib: 1.1.3
sentencepiece      : 0.1.97
tqdm               : 4.64.1
transformers       : 4.9.0
unidic_lite        : 1.0.8
watermark          : 2.3.1

Compiler    : GCC 7.5.0
OS          : Linux
Release     : 5.10.133+
Machine     : x86_64
Processor   : x86_64
CPU cores   : 2
Architecture: 64bit



In [2]:
data_dir = Path("chp05_01")

print(f"data_dir: {data_dir}")

data_dir: chp05_01


---
<a name='5.1'></a><a id='5.1'></a>
# 5.1 Transformer and Text Generation
<a href="#top">[back to top]</a>

<a name='5.1.1'></a><a id='5.1.1'></a>
## 5.1.1 What is the Transformer?
<a href="#top">[back to top]</a>

<a name='5.1.2'></a><a id='5.1.2'></a>
## 5.1.2 Text Generation
<a href="#top">[back to top]</a>

* Use rinna's `rinna/japanese-gpt2-medium`, a Japanese autoregressive GPT-2 language model.
* Load the pretrained model with `from_pretrained()`.
* Initialize the matching tokenizer to convert the input text into token IDs.
* Since we are not fine-tuning this model any further, there is no need to save our own copy of it.
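
"Autoregressive" means the model emits one token at a time and appends each new token to its own input before predicting the next. The loop below is a minimal sketch of that idea, assuming a hand-written lookup table in place of the network. `TOY_NEXT_TOKEN` and `generate_greedy` are illustrative names, not part of `transformers`, and the toy always picks a single fixed next token rather than sampling from a distribution.

```python
# Toy stand-in for a language model: maps the most recent token to the next one.
# A real GPT-2 model conditions on the whole prefix and returns a probability
# distribution over the vocabulary; this table is purely illustrative.
TOY_NEXT_TOKEN = {
    "once": "upon",
    "upon": "a",
    "a": "time",
    "time": "</s>",
}


def generate_greedy(prompt_tokens, max_length=10, eos_token="</s>"):
    """Extend `prompt_tokens` one token at a time until EOS or `max_length`."""
    tokens = list(prompt_tokens)
    # As in the notebook's generate() calls, max_length counts the prompt too.
    while len(tokens) < max_length:
        next_token = TOY_NEXT_TOKEN.get(tokens[-1], eos_token)
        tokens.append(next_token)  # the output becomes part of the next input
        if next_token == eos_token:
            break
    return tokens


print(generate_greedy(["once"]))
```

Generation stops either at the end-of-sequence token or when the total length reaches `max_length`, whichever comes first.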

In [3]:
# Load rinna/japanese-gpt2-medium, a Japanese GPT-2 model
tokenizer = T5Tokenizer.from_pretrained("rinna/japanese-gpt2-medium")
tokenizer.do_lower_case = True # Set explicitly to work around a bug in tokenizer config loading

model = (AutoModelForCausalLM
        .from_pretrained("rinna/japanese-gpt2-medium")
        .to(device))

type(model)

transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel

In [4]:
inputs = tokenizer(
    "むかしむかし、あるところに、", 
    return_tensors="pt", 
    add_special_tokens=False
    ).to(device)

inputs

{'input_ids': tensor([[    9,   561,  8359,   561,  8359,     7,   382, 12141,     7]],
       device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:0')}

* Use `model.generate()` to generate the continuation.
* Decode the result back into text with `tokenizer.decode()`.

In [5]:
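result = model.generate(
    **inputs,
    do_sample=True,          # sample instead of always taking the most likely token
    top_p=0.9,               # nucleus sampling: keep the top 90% of probability mass
    temperature=0.8,         # values below 1 sharpen the distribution
    max_length=100,          # upper bound on total length, prompt tokens included
    pad_token_id=2,          # id 2 decodes to </s> in this vocabulary
    repetition_penalty=1.2   # values above 1 discourage repeating earlier tokens
)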

print(result.shape)
print(result)

torch.Size([1, 95])
tensor([[    9,   561,  8359,   561,  8359,     7,   382, 12141,     7,  1363,
            32,   845,    57, 28511,    10,   187,   268,  5363,     8,  9839,
           180,     7, 28511,    12,  9914, 11151,    19,   248,    84,    18,
          6968,  3628,  2792,  4550,    20,     7,    65,  3804,    11,  5475,
          1833,   987, 19295,    58,   203,  5363,     8,   235,  5465,  4992,
           865,   774,    30,     7,    80,  5475,   987,    28,  1017,  1374,
           736,   842,    21,    20,  5301,   551, 27807,     8,   235, 11152,
         15329, 23486,    40,   736,   969, 21756,  2532,     9,   263,  4660,
         15199, 24152,    21,    20, 26514,  1198,     7, 28511,  1227, 18961,
          8207, 17826,  5363,     8,     2]], device='cuda:0')
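
The `repetition_penalty=1.2` argument in the call above lowers the scores of tokens that already appear in the sequence. The sketch below shows the rule as it is commonly described (divide positive logits by the penalty, multiply negative ones), working on plain Python lists. `apply_repetition_penalty` is a hypothetical name, and this is an illustration of the idea rather than the library's tensor-based implementation.

```python
def apply_repetition_penalty(logits, previous_token_ids, penalty=1.2):
    """Return a copy of `logits` with already-used tokens made less likely."""
    adjusted = list(logits)
    for token_id in set(previous_token_ids):
        score = adjusted[token_id]
        # Dividing a positive logit or multiplying a negative one both lower it.
        adjusted[token_id] = score * penalty if score < 0 else score / penalty
    return adjusted


# Hypothetical scores for a 4-token vocabulary; tokens 0 and 1 were already used.
logits = [3.0, -1.0, 0.5, 2.0]
print(apply_repetition_penalty(logits, previous_token_ids=[0, 1, 1]))
```

With a penalty of 1.0 the scores are unchanged; larger values push the model harder toward tokens it has not used yet.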


In [6]:
tokenizer.decode(result[0])

'むかしむかし、あるところに、ふしぎなキツネの村がありました。 ある時、キツネがひとりきりで野山をさまよい歩いていると、その先には白い小さな森が見えてきました。 「あなたのお母さんも、この白い森から生まれてきたんだよ」と教えられた男の子。 「そんな風に育てられたんだねぇ... すごく幸せだよ」と声をかけると、キツネはその言葉を返してくれました。</s>'
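
Because `do_sample=True`, each run of the `generate()` cell can give a different continuation. The sketch below illustrates, under simplifying assumptions, what `temperature` and `top_p` do to a single next-token distribution: the logits are divided by the temperature, turned into probabilities, and then only the smallest set of most-probable tokens whose combined probability reaches `top_p` is kept for sampling. The function names and the example logits are made up for illustration. This is not the library's code.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by `temperature`."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p=0.9):
    """Keep the most probable tokens until their mass reaches `top_p`, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return {i: probs[i] / mass for i in kept}


def sample_next(logits, top_p=0.9, temperature=0.8, rng=random):
    """Sample one token id from the temperature-scaled, top-p-filtered distribution."""
    nucleus = top_p_filter(softmax(logits, temperature), top_p)
    ids = list(nucleus)
    return rng.choices(ids, weights=[nucleus[i] for i in ids], k=1)[0]


rng = random.Random(42)  # fixed seed so the sketch is repeatable
logits = [2.0, 1.0, 0.0, -5.0]  # hypothetical scores for a 4-token vocabulary
print([sample_next(logits, rng=rng) for _ in range(5)])
```

Lowering `temperature` or `top_p` concentrates sampling on the highest-scoring tokens; raising them makes the output more varied.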

In [7]:
# Another iteration, using this text from weathernews.jp:
# 雨や雪の降りやすい天気が続きます。降り方が強まり、雷やアラレを伴うおそれもあります。
# 寒い日が多く、特に朝晩はグッと冷え込みます。体調管理を万全に。
# https://weathernews.jp/onebox/tenki/niigata/15461/

inputs = tokenizer(
    "雨や雪の降りやすい天気が続きます。降り方が強まり、雷やアラレを伴うおそれもあります。寒い日が多く、特に朝晩はグッと冷え込みます。体調管理を万全に。", 
    return_tensors="pt", 
    add_special_tokens=False
    ).to(device)

result = model.generate(
    **inputs,
    do_sample=True,
    top_p=0.9,
    temperature=0.8,
    max_length=100,
    pad_token_id=2,
    repetition_penalty=1.2
)

tokenizer.decode(result[0])

'雨や雪の降りやすい天気が続きます。降り方が強まり、雷やアラレを伴うおそれもあります。寒い日が多く、特に朝晩はグッと冷え込みます。体調管理を万全に。 <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk>'
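
Everything after the prompt in the output above is `<unk>`. One simple way to flag such degenerate continuations is to measure what share of the pieces after the prompt are `<unk>`. `unk_fraction` below is a hypothetical helper, and splitting the decoded string on whitespace is a simplifying assumption that happens to fit this output.

```python
def unk_fraction(decoded, prompt, unk_token="<unk>"):
    """Share of whitespace-separated pieces after `prompt` that equal `unk_token`."""
    # Fall back to the whole string if the decoded text does not start with the
    # prompt exactly (for example, when the tokenizer lowercased it).
    continuation = decoded[len(prompt):] if decoded.startswith(prompt) else decoded
    pieces = continuation.split()
    if not pieces:
        return 0.0
    return sum(piece == unk_token for piece in pieces) / len(pieces)


prompt = "体調管理を万全に。"
decoded = prompt + " <unk> <unk> <unk>"
print(unk_fraction(decoded, prompt))
```

A value close to 1.0 suggests the sample is not worth keeping and the generation should be rerun.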

In [8]:
# Text from Twitter:
# 12/8(木) 15:20現在本日16時からのTeNYテレビ新潟さんの夕方のニュースは、
# かぐらメインゲレンデからの生中継でお届け？？ドラ〜✨#naeba #かぐら #夕方ワイド新潟一番
# https://twitter.com/Naebanow/status/1600738428211335170

inputs = tokenizer(
    "12/8(木) 15:20現在本日16時からのTeNYテレビ新潟さんの夕方のニュースは、かぐらメインゲレンデからの生中継でお届け？？ドラ〜✨#naeba #かぐら #夕方ワイド新潟一番", 
    return_tensors="pt", 
    add_special_tokens=False
    ).to(device)

result = model.generate(
    **inputs,
    do_sample=True,
    top_p=0.9,
    temperature=0.8,
    max_length=100,
    pad_token_id=2,
    repetition_penalty=1.2
)

tokenizer.decode(result[0])

'12/8(木) 15:20現在本日16時からのtenyテレビ新潟さんの夕方のニュースは、かぐらメインゲレンデからの生中継でお届け??ドラ〜<unk> #naeba #かぐら #夕方ワイド新潟一番 pic.twitter.com/f4b3qhxuwt</s>'
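
The three examples above repeat the same tokenize, generate, and decode steps with identical settings. A small wrapper can remove that duplication. The sketch below is one possible shape for it: `continue_text` and `DEFAULT_GENERATION_KWARGS` are hypothetical names, and the function assumes it is given the `tokenizer`, `model`, and `device` objects created earlier in this notebook.

```python
# Sampling settings copied from the generate() calls in this notebook.
DEFAULT_GENERATION_KWARGS = {
    "do_sample": True,
    "top_p": 0.9,
    "temperature": 0.8,
    "max_length": 100,
    "pad_token_id": 2,
    "repetition_penalty": 1.2,
}


def continue_text(prompt, tokenizer, model, device="cpu", **overrides):
    """Tokenize `prompt`, sample a continuation, and decode it to a string."""
    inputs = tokenizer(
        prompt,
        return_tensors="pt",
        add_special_tokens=False,
    ).to(device)
    # Keyword arguments passed by the caller replace the defaults above.
    generation_kwargs = {**DEFAULT_GENERATION_KWARGS, **overrides}
    result = model.generate(**inputs, **generation_kwargs)
    return tokenizer.decode(result[0])


# Example call, assuming the notebook's tokenizer, model, and device exist:
# continue_text("むかしむかし、あるところに、", tokenizer, model, device)
```

Passing a keyword such as `max_length=50` changes one setting for a single call without touching the shared defaults.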