<a id='top'></a><a name='top'></a>
# Chapter 5: Natural Language Generation and Conversion with Transformer

## 5.2 Question answering

<table align="left">
  <td>
    <a href="https://colab.research.google.com/github/gbih/nlp/blob/main/ja_nlp_book/chp05_5_2_question_answering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
  </td>
</table>

* [Introduction](#introduction)
* [Setup](#setup)
* [5.2 Question answering](#5.2)
    - [5.2.1 Evaluate on the JAQKET dataset](#5.2.1)
    - [5.2.2 How to Fine-tune the Rinna Language Model for Solving QA](#5.2.2)
        * [5.2.2.1 Checking created artifacts](#5.2.2.1)
        * [5.2.2.2 Saving artifacts to Google Drive](#5.2.2.2)
    - [5.2.3 Next Steps](#5.2.3)

---
<a name='introduction'></a><a id='introduction'></a>
# Introduction
<a href="#top">[back to top]</a>

**Concepts**

* Here we use a language model to solve a question-answering task.

* A language model is a statistical model that assigns a probability to a given text.

* Because language models are usually trained on large datasets of naturally occurring text, they assign high probability to sentences that are natural in that language and low probability to unnatural ones.

* A popular type of language model is the autoregressive, or causal, language model (CLM). It models the probability of an input by decomposing it into a sequence of tokens and taking the product of each token's probability given the preceding context.

* In NLP, language models were traditionally based on word n-grams or on RNNs such as the LSTM. Most recent state-of-the-art language models are based on the Transformer architecture.
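
The factorization described above can be illustrated with a small sketch. The per-token probabilities below are made-up values, not output from any model; the point is only that the sequence probability is the product of the conditional token probabilities, and that in practice one sums log-probabilities instead.

```python
import math

# Hypothetical conditional probabilities P(token_t | tokens before t)
# for a three-token sequence. These numbers are invented for illustration.
token_probs = [0.2, 0.5, 0.1]

# Chain rule: P(x_1 ... x_n) is the product of the conditional probabilities.
sequence_prob = math.prod(token_probs)

# Summing logs is numerically safer than multiplying many small numbers.
sequence_log_prob = sum(math.log(p) for p in token_probs)

# The mean negative log-likelihood per token is the quantity a causal
# language model reports as its loss.
mean_nll = -sequence_log_prob / len(token_probs)

print(f"P(sequence) = {sequence_prob:.4f}, mean NLL = {mean_nll:.4f}")
```

A lower mean negative log-likelihood corresponds to a more "natural" sequence under the model, which is the signal used for ranking answers later in this notebook.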

**Implementation**

* We use the HuggingFace Transformers library, which supports pretrained language models such as BERT and GPT-2.

* We also use SentencePiece for tokenizing Japanese text, and PyTorch. 

---
<a name='setup'></a><a id='setup'></a>
# Setup
<a href="#top">[back to top]</a>

In [1]:
# Option to use downloaded/pre-trained data from Google Drive with Colab.
USE_GD_DATA = True

if USE_GD_DATA:
    try:
        from google.colab import drive
        drive.mount('/content/drive', force_remount=True)
        print("Creating a local copy of chp05_02")
        # Assumes prepared data is stored as 'My Drive/chp05_02' on Google Drive 
        !cp -r /content/drive/MyDrive/chp05_02 /content/chp05_02
        print()
        !ls -l /content/chp05_02
    except Exception as e:
        print(f"Error: {e}")

Mounted at /content/drive
Creating a local copy of chp05_02

total 22224
-rw------- 1 root root   853639 Dec 12 15:39 dev1_questions.json
drwx------ 2 root root     4096 Dec 12 15:39 model_best
drwx------ 3 root root     4096 Dec 12 15:39 models
-rw------- 1 root root      172 Dec 12 15:39 requirements_5_5_2.txt
-rw------- 1 root root 11405534 Dec 12 15:39 train_questions.json
-rw------- 1 root root 10477599 Dec 12 15:39 train_questions.nooa.json


In [2]:
import pathlib
from pathlib import Path

data_root = Path("chp05_02")
req_file = data_root / "requirements_5_5_2.txt"

if not data_root.is_dir():
    data_root.mkdir()
else:
    print(f"{data_root} exists.")

chp05_02 exists.


In [3]:
%%writefile {req_file}
datasets==1.11.0
fugashi[unidic]==1.2.1
gensim==3.6.0
japanize_matplotlib==1.1.3
sentencepiece==0.1.97
tqdm==4.64.1
transformers==4.9.0
unidic_lite==1.0.8
watermark==2.3.1

Overwriting chp05_02/requirements_5_5_2.txt


In [4]:
import os
import sys
check1 = ('google.colab' in sys.modules)
check2 = (os.environ.get('CLOUDSDK_CONFIG')=='/content/.config')
IS_COLAB = True if (check1 or check2) else False

if IS_COLAB:
    print("Installing packages")
    !pip install --quiet -r {req_file}
    !python -m unidic download
    !apt-get install tree &> /dev/null
    !apt-get install jq 2>&1 > /dev/null
    !jq --version
    print()
    print("** Need to restart runtime after installing sentencepiece **")
    print("> Runtime > Restart runtime ...")
else:
    print("Running locally.")

Installing packages

In [1]:
# Standard Library imports
from collections import Counter
import json
import logging
import os
import pathlib
from pathlib import Path
import shlex
import subprocess
import sys

# Third-party imports
from datasets import load_dataset # HuggingFace Datasets
import pandas as pd
import torch
from tqdm import tqdm
from transformers import T5Tokenizer # This requires sentencepiece
from transformers import AutoModelForCausalLM
from transformers import Trainer
from transformers import TrainingArguments
from watermark import watermark

# Suppress TensorFlow log messages
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' 

# suppress logging from transformers
logging.getLogger("transformers").setLevel(logging.ERROR)
logging.getLogger("transformers.trainer").setLevel(logging.ERROR)
logging.getLogger("datasets").setLevel(logging.ERROR)

_ = torch.manual_seed(42)

def HR():
    print("-"*50)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"device:{device}")
HR()

# Need to redefine again, after restarting runtime:
check1 = ('google.colab' in sys.modules)
check2 = (os.environ.get('CLOUDSDK_CONFIG')=='/content/.config')
IS_COLAB = True if (check1 or check2) else False

packages_check="datasets,fugashi,gensim,japanize_matplotlib,sentencepiece,torch,tqdm,transformers,unidic_lite,watermark"
print(watermark(packages=packages_check, python=True,machine=True))

device:cpu
--------------------------------------------------
Python implementation: CPython
Python version       : 3.8.16
IPython version      : 7.9.0

datasets           : 1.11.0
fugashi            : 1.2.1
gensim             : 3.6.0
japanize_matplotlib: 1.1.3
sentencepiece      : 0.1.97
torch              : 1.13.0+cu116
tqdm               : 4.64.1
transformers       : 4.9.0
unidic_lite        : 1.0.8
watermark          : 2.3.1

Compiler    : GCC 7.5.0
OS          : Linux
Release     : 5.10.133+
Machine     : x86_64
Processor   : x86_64
CPU cores   : 2
Architecture: 64bit



In [2]:
data_file = "train_questions.json"
data_url = f"https://jaqket.s3.ap-northeast-1.amazonaws.com/data/aio_01/{data_file}"
data_dir = Path("chp05_02")
data_src = data_dir / data_file
data_path = data_src
data_train_questions = data_dir / "train_questions.nooa.json"

print(f"""
data_file:\t{data_file}
data_url:\t{data_url}
data_dir:\t{data_dir}
data_src:\t{data_src}
data_path:\t{data_path}
data_train_questions:\t{data_train_questions}
""")


data_file:	train_questions.json
data_url:	https://jaqket.s3.ap-northeast-1.amazonaws.com/data/aio_01/train_questions.json
data_dir:	chp05_02
data_src:	chp05_02/train_questions.json
data_path:	chp05_02/train_questions.json
data_train_questions:	chp05_02/train_questions.nooa.json



In [3]:
data_dev1_file = "dev1_questions.json"
data_dev1_url = f"https://jaqket.s3.ap-northeast-1.amazonaws.com/data/aio_01/{data_dev1_file}"
data_dev1_dir = Path("chp05_02")
data_dev1_src = data_dev1_dir / data_dev1_file
data_dev1_path = data_dev1_src

print(f"""
data_dev1_file:\t{data_dev1_file}
data_dev1_url:\t{data_dev1_url}
data_dev1_dir:\t{data_dev1_dir}
data_dev1_src:\t{data_dev1_src}
data_dev1_path:\t{data_dev1_path}
""")


data_dev1_file:	dev1_questions.json
data_dev1_url:	https://jaqket.s3.ap-northeast-1.amazonaws.com/data/aio_01/dev1_questions.json
data_dev1_dir:	chp05_02
data_dev1_src:	chp05_02/dev1_questions.json
data_dev1_path:	chp05_02/dev1_questions.json



### Load and instantiate the Rinna Japanese autoregressive GPT-2 language model.

* https://huggingface.co/rinna/japanese-gpt2-medium
* https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoModelForCausalLM
* https://huggingface.co/docs/transformers/v4.25.1/en/model_doc/auto#transformers.FlaxAutoModelForVision2Seq.from_pretrained


In [4]:
# Load Rinna — a Japanese GPT-2 model 

# Construct a T5 tokenizer. Based on SentencePiece.
# This tokenizer inherits from PreTrainedTokenizer.
tokenizer = T5Tokenizer.from_pretrained("rinna/japanese-gpt2-medium")
tokenizer.do_lower_case = True # due to some bug of tokenizer config loading

# This is a generic model class that will be instantiated as one of the model 
# classes of the library (with a causal language modeling head) when created 
# with the from_pretrained() class method or the from_config() class method.
model_untuned = AutoModelForCausalLM.from_pretrained("rinna/japanese-gpt2-medium").to(device)

HR()
print(type(model_untuned))

Downloading:   0%|          | 0.00/806k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/153 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/282 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/799 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.37G [00:00<?, ?B/s]

--------------------------------------------------
<class 'transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel'>


In [5]:
model_untuned.config

GPT2Config {
  "_name_or_path": "rinna/japanese-gpt2-medium",
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 1,
  "embd_pdrop": 0.1,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 1024,
  "n_head": 16,
  "n_inner": 4096,
  "n_layer": 24,
  "n_positions": 1024,
  "resid_pdrop": 0.1,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "transformers_version": "4.9.0",
  "use_cache": true,
  "vocab_size": 32000
}

---
<a name='5.2'></a><a id='5.2'></a>
# 5.2 Question answering
<a href="#top">[back to top]</a>

* We can use language models for question answering.
* Here we use one to answer open-domain trivia questions.
* Language models trained on a large corpus give lower losses to more "natural" input, so a plausible answer appended to a question should receive a lower loss than an implausible one.
* We create a function `rank_answers`, which ranks a list of answer candidates by the loss the model assigns to each candidate given the question.

In [6]:
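@torch.no_grad()  # scoring only, so gradients are not needed
def rank_answers(question, candidates, model, tokenizer, top_n=10):
    """Given a question and a list of answer candidates, rank them based on 
    the language model score (negative log likelihood) and return the ranked
    top N candidates."""
    losses = Counter()

    # Encode the question once, without special tokens.
    inputs_question = tokenizer(
        question, return_tensors="pt",
        add_special_tokens=False
    ).to(model.device)

    # Run the question through the model and cache its key/value states,
    # so they can be reused for every candidate.
    results = model(**inputs_question, use_cache=True)
    past = results.past_key_values

    for candidate in candidates:
        # Encode the candidate; add_special_tokens=True appends the
        # end-of-sequence token.
        inputs_candidate = tokenizer(
            candidate,
            return_tensors="pt",
            add_special_tokens=True
        ).to(model.device)

        # The attention mask must cover the cached question tokens
        # as well as the candidate tokens.
        attention_mask = torch.cat(
            (inputs_question["attention_mask"], inputs_candidate["attention_mask"]),
            dim = 1,
        )

        # Only the candidate tokens are fed in; the question is supplied
        # through past_key_values. With labels given, the model returns the
        # mean cross-entropy loss computed on the candidate tokens.
        results = model(
            input_ids = inputs_candidate["input_ids"],
            attention_mask = attention_mask,
            labels = inputs_candidate["input_ids"],
            past_key_values = past,
        )

        # Store the negated loss so that most_common() puts the
        # lowest-loss (most natural) candidates first.
        loss = results.loss.detach().item()
        losses[candidate] = -loss

    return [a for a, _ in losses.most_common(top_n)]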
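
The ranking step at the end of `rank_answers` relies on a small trick: losses are stored negated in a `Counter`, so `most_common()` returns the lowest-loss candidates first. A minimal sketch of just that step, using invented loss values rather than real model output:

```python
from collections import Counter

# Hypothetical mean losses per candidate (lower = more natural continuation).
# These values are made up for illustration only.
example_losses = {"名古屋市": 2.1, "岐阜市": 1.8, "東京": 4.0}

scores = Counter()
for candidate, loss in example_losses.items():
    # Negate the loss so the best candidate has the largest score.
    scores[candidate] = -loss

# most_common(n) sorts by score in descending order and keeps the top n.
ranked = [candidate for candidate, _ in scores.most_common(2)]
print(ranked)
```

Because the `Counter` is keyed by the candidate string, duplicate candidates collapse into one entry, and any candidate beyond `top_n` is dropped from the returned list.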

* Question: What is the prefectural capital of Aichi?

In [7]:
question = "愛知県の県庁所在地は?"

In [8]:
answers = ["札幌市", "秋田市", "宇都宮市", "東京", "金沢市", "岐阜市", "名古屋市", "大津市", "奈良市", "岡山市", "高 松市", "佐賀市", "宮崎市"]

In [9]:
rank_answers(question, answers, model_untuned, tokenizer)

['岐阜市', '名古屋市', '金沢市', '奈良市', '宇都宮市', '秋田市', '岡山市', '宮崎市', '佐賀市', '高 松市']

<a name='5.2.1'></a><a id='5.2.1'></a>
## 5.2.1 Evaluate on the JAQKET dataset
<a href="#top">[back to top]</a>

* Examine how well Rinna answers common-sense questions.
* JAQKET is an open-domain dataset of common-sense questions and answers drawn from Wikipedia.
* https://www.nlp.ecei.tohoku.ac.jp/projects/jaqket/

In [10]:
# book-link-changed
if not data_src.is_file():
    print(f"Downloading {data_url}")
    subprocess.run(shlex.split(f"wget -q -O {data_src} {data_url}"))
    print("Done.")
else:
    print(f"{data_src} exists.")

chp05_02/train_questions.json exists.


In [11]:
# The file is in JSON Lines format (one JSON object per line), so use `lines=True`
df = pd.read_json(data_src, lines=True)
df.head(2).T

Unnamed: 0,0,1
qid,ABC01-01-0003,ABC01-01-0004
question,格闘家ボブ・サップの出身国はどこでしょう?,ロシア語で「城」という意味がある、ロシアの大統領府の別名は何でしょう?
answer_entity,アメリカ合衆国,クレムリン
answer_candidates,"[アメリカ合衆国, ミネソタ州, オンタリオ州, ペンシルベニア州, オレゴン州, ニューヨ...","[クレムリン, キエフ, 赤の広場, サンクトペテルブルク, モスクワ, 救世主ハリストス大..."
original_question,格闘家ボブ・サップの出身国はどこでしょう？,ロシア語で「城」という意味がある、ロシアの大統領府の別名は何でしょう？
original_answer,アメリカ,クレムリン


In [12]:
# Delete the "original_answer" keys, which cause a schema discrepancy between
# the train and dev splits (the dev split has no such field)
if not Path(data_train_questions).is_file():
    !jq 'del(.original_answer)' -c {data_src} > {data_train_questions}
    print("Done")
else:
    print(f"{data_train_questions} exists.")

chp05_02/train_questions.nooa.json exists.
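
If `jq` is not available, the same clean-up can be sketched in plain Python. The helper name `drop_key_jsonl` and the sample record below are hypothetical; the sketch assumes each non-empty line is a standalone JSON object, as in the JAQKET files.

```python
import json

def drop_key_jsonl(src_lines, key):
    """Return JSON Lines records with `key` removed from every record."""
    cleaned = []
    for line in src_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        record.pop(key, None)  # no error if the key is absent
        # Compact separators and ensure_ascii=False roughly mirror `jq -c`,
        # which keeps Japanese text unescaped.
        cleaned.append(
            json.dumps(record, ensure_ascii=False, separators=(",", ":"))
        )
    return cleaned

# Hypothetical one-record input, for illustration only.
sample = ['{"qid": "X-1", "question": "Q?", "original_answer": "A"}']
cleaned = drop_key_jsonl(sample, "original_answer")
print(cleaned[0])
```

To apply it to a file, one would read the source file line by line, pass the lines to the helper, and write the result joined with newlines.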


In [13]:
# Examine the changed data
df = pd.read_json(data_train_questions, lines=True)
df.head(2).T

Unnamed: 0,0,1
qid,ABC01-01-0003,ABC01-01-0004
question,格闘家ボブ・サップの出身国はどこでしょう?,ロシア語で「城」という意味がある、ロシアの大統領府の別名は何でしょう?
answer_entity,アメリカ合衆国,クレムリン
answer_candidates,"[アメリカ合衆国, ミネソタ州, オンタリオ州, ペンシルベニア州, オレゴン州, ニューヨ...","[クレムリン, キエフ, 赤の広場, サンクトペテルブルク, モスクワ, 救世主ハリストス大..."
original_question,格闘家ボブ・サップの出身国はどこでしょう？,ロシア語で「城」という意味がある、ロシアの大統領府の別名は何でしょう？


In [14]:
# book-link-changed
if not data_dev1_src.is_file():
    print(f"Downloading {data_dev1_url}")
    subprocess.run(shlex.split(f"wget -q -O {data_dev1_src} {data_dev1_url}"))
    print("Done.")
else:
    print(f"{data_dev1_src} exists.")

chp05_02/dev1_questions.json exists.


In [15]:
# Examine the json data
df = pd.read_json(data_dev1_path, lines=True)
df.head(1)

Unnamed: 0,qid,question,answer_entity,answer_candidates,original_question
0,QA20CAPR-0002,明治時代に西洋から伝わった「テーブル・ターニング」に起源を持つ占いの一種で、50音表などを記...,コックリさん,"[テケテケ, 毛羽毛現, ルームメイトの死, 浄玻璃鏡, 小玉鼠 (妖怪), ベッドの下の男...",明治時代に西洋から伝わった「テーブル・ターニング」に起源を持つ占いの一種で、50音表などを記...


In [16]:
# Create a list of question/answer records
qas = []
with open(data_dev1_path) as f:
    for line in f:
        qas.append(json.loads(line))

qas[:1]

[{'qid': 'QA20CAPR-0002',
  'question': '明治時代に西洋から伝わった「テーブル・ターニング」に起源を持つ占いの一種で、50音表などを記入した紙を置き、参加者全員の人差し指をコインに置いて行うのは何でしょう?',
  'answer_entity': 'コックリさん',
  'answer_candidates': ['テケテケ',
   '毛羽毛現',
   'ルームメイトの死',
   '浄玻璃鏡',
   '小玉鼠 (妖怪)',
   'ベッドの下の男',
   '板鬼',
   '赤い紙、青い紙',
   '縊鬼',
   '疱瘡婆',
   '紫ババア',
   '塵塚怪王',
   '辻神',
   '耳から白い糸',
   'コックリさん',
   '紫の鏡',
   '天井下り',
   '野寺坊',
   'ヨジババ',
   'カシマさん'],
  'original_question': '明治時代に西洋から伝わった「テーブル・ターニング」に起源を持つ占いの一種で、50音表などを記入した紙を置き、参加者全員の人差し指をコインに置いて行うのは何でしょう？'}]

### evaluate() function:

* Takes a list of question/answer records.
* Ranks the answer candidates of each question with the model and tokenizer.
* Prints the success rate: the share of questions whose top-ranked candidate matches the gold answer.


In [17]:
def evaluate(qas, model, tokenizer, stop_at=None, show_preview=False):
    """
    Takes a list of question/answer records.
    Ranks each question's answer candidates with the model and tokenizer.
    Prints the success rate, i.e. how often the top-ranked candidate is the
    gold answer. Nothing is returned.
    """
    num_correct = 0
    num_questions = 0
    
    for qa in tqdm(qas):
        question = qa["question"]
        candidates = qa["answer_candidates"]
        gold = qa["answer_entity"]
        preds = rank_answers(question, candidates, model, tokenizer)
        is_correct = preds[0] == gold
        if is_correct:
            num_correct += 1
        if show_preview and num_questions < 5:
            tqdm.write(
                f" is_correct: {is_correct}, gold: {gold}, Q: {question}, pred: {preds[:5]},"
            )

        num_questions += 1
                
        if stop_at is not None and num_questions == stop_at:
            break
        
    tqdm.write(
        f" Success rate = {100 * num_correct / num_questions}% ({num_correct} / {num_questions})"
    )
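
`evaluate` only counts a question as correct when the gold answer is ranked first. A looser view is top-k accuracy, which also credits a gold answer that appears among the first k ranked candidates. The helper below is a hypothetical sketch and not part of the notebook's pipeline; it operates on ranked lists that have already been computed, and the toy data is invented.

```python
def top_k_accuracy(ranked_predictions, golds, k=1):
    """Fraction of questions whose gold answer is among the top-k predictions."""
    hits = sum(
        1 for preds, gold in zip(ranked_predictions, golds) if gold in preds[:k]
    )
    return hits / len(golds)

# Toy ranked lists (invented for illustration): the gold answer "A" is ranked
# second, first, and third respectively.
toy_preds = [["B", "A", "C"], ["A", "B", "C"], ["C", "B", "A"]]
toy_golds = ["A", "A", "A"]

for k in (1, 2, 3):
    print(k, top_k_accuracy(toy_preds, toy_golds, k=k))
```

With k=1 this reduces to the success rate that `evaluate` prints.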

### Evaluate using an untuned Rinna model (Japanese GPT-2 autoregressive language model)


In [19]:
if IS_COLAB:
    stop_n= 20 # 100
else:
    stop_n = 10

evaluate(
    qas, 
    model_untuned, 
    tokenizer, 
    stop_at=stop_n, 
    show_preview=True
)

print()
print("Done")

# Success rate = 5.0% (1 / 20)

  0%|          | 1/995 [00:06<1:44:50,  6.33s/it]

 is_correct: False, gold: コックリさん, Q: 明治時代に西洋から伝わった「テーブル・ターニング」に起源を持つ占いの一種で、50音表などを記入した紙を置き、参加者全員の人差し指をコインに置いて行うのは何でしょう?, pred: ['テケテケ', '赤い紙、青い紙', 'コックリさん', '小玉鼠 (妖怪)', 'ヨジババ'],


  0%|          | 2/995 [00:12<1:39:06,  5.99s/it]

 is_correct: False, gold: 集英社, Q: 『non・no』『週刊プレイボーイ』『週刊少年ジャンプ』といえば、発行している出版社はどこでしょう?, pred: ['実業之日本社', '白泉社', '宝島社', '日本文芸社', '幻冬舎'],


  0%|          | 3/995 [00:19<1:50:57,  6.71s/it]

 is_correct: False, gold: SASUKE, Q: 「パイプスライダー」や「そり立つ壁」などの関門がある、TBS系列で不定期に放送されている視聴者参加型のTV番組は何でしょう?, pred: ['最強の男は誰だ!壮絶筋肉バトル!!スポーツマンNo.1決定戦', '究極の男は誰だ!?最強スポーツ男子頂上決戦', 'SASUKE', '島田紳助がオールスターの皆様に芸能界の厳しさ教えますスペシャル!', 'クイズ王最強決定戦〜THE OPEN〜'],


  0%|          | 4/995 [00:26<1:49:26,  6.63s/it]

 is_correct: False, gold: 浅草寺, Q: 東京都内では最も古い歴史を持つ寺院でもある、入口にある「雷門」で有名な観光名所は何でしょう?, pred: ['天龍寺 (新宿区)', '大龍寺 (東京都北区)', '大円寺 (目黒区)', '源覚寺 (文京区)', '浅草寺'],


  1%|          | 5/995 [00:32<1:45:17,  6.38s/it]

 is_correct: False, gold: グラタン, Q: 「鍋についたおこげ」という意味の言葉が語源であるとされる、日本ではマカロニを使ったものが一般的な西洋料理は何でしょう?, pred: ['オムレツ', 'ポテトサラダ', 'ロールキャベツ', 'フレンチトースト', 'スパゲッティ'],


  2%|▏         | 19/995 [02:07<1:49:01,  6.70s/it]

 Success rate = 5.0% (1 / 20)

Done





<a name='5.2.2'></a><a id='5.2.2'></a>
## 5.2.2 How to Fine-tune the Rinna Language Model for Solving QA
<a href="#top">[back to top]</a>

* We can improve Rinna, or any other pretrained model, by showing it examples and optimizing its parameters on them.
* This is called *fine-tuning* and is the most common way to adapt a pretrained model to another task.
* We first load the JAQKET dataset in JSONL format with the HuggingFace `datasets` library.
* Then we build the training text by concatenating each [question_text][answer_text] pair into a single `text` field.
* Then we tokenize the concatenated `text` field.
* Then we fine-tune the model on the tokenized text.
* We specify the hyperparameters with `TrainingArguments`, create a `Trainer` instance, and invoke the `.train()` method.
* Training may take more than an hour even on a GPU (about 2.5 hours on Colab).

**Reference: class transformers.Trainer API** 

* https://huggingface.co/docs/transformers/main_classes/trainer
* https://huggingface.co/docs/transformers/v4.25.1/en/main_classes/trainer#transformers.Trainer

In [20]:
# Load the JAQKET dataset in the JSONL format with HuggingFace
dataset = load_dataset(
    'json',
    data_files = {
        'train': str(data_train_questions),
        'valid': str(data_dev1_path)
    }
)

Downloading and preparing dataset json/default (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /root/.cache/huggingface/datasets/json/default-36bf441dda671356/0.0.0/45636811569ec4a6630521c18235dfbbab83b7ab572e3393c5ba68ccabe98264...


0 tables [00:00, ? tables/s]

0 tables [00:00, ? tables/s]

Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/json/default-36bf441dda671356/0.0.0/45636811569ec4a6630521c18235dfbbab83b7ab572e3393c5ba68ccabe98264. Subsequent calls will reuse this data.


In [21]:
# Inspect the data
print(dataset['train'].shape)
dataset['train'][-1] # last entry

(13061, 5)


{'qid': 'ABC12-04-0400',
 'question': '昨年シングル『HANDS UP!』でデビューし、日本レコード大賞最優秀新人賞を受賞した歌手は誰でしょう?',
 'answer_entity': '新里宏太',
 'answer_candidates': ['新里宏太',
  '橋本裕太',
  'FUKI',
  '川本璃',
  '中崎祐衣',
  '田中雅功',
  'Cloe',
  '福留仁菜',
  '歩乃圭',
  'LUHICA',
  'SORA (歌手)',
  'まつもとななみ',
  '武藤千春',
  '野元空',
  '果山サキ',
  '松崎梨央',
  '番匠谷紗衣',
  '安倍川緑',
  '高萩千夏',
  '井上理香子'],
 'original_question': '昨年（2013年）シングル『HANDS UP！』でデビューし、日本レコード大賞最優秀新人賞を受賞した歌手は誰でしょう？'}

In [22]:
dataset = dataset.map(
    lambda example: {"text": example["question"] + example["answer_entity"]}
)

  0%|          | 0/13061 [00:00<?, ?ex/s]

  0%|          | 0/995 [00:00<?, ?ex/s]

In [23]:
# Check the dataset
dataset

DatasetDict({
    train: Dataset({
        features: ['qid', 'question', 'answer_entity', 'answer_candidates', 'original_question', 'text'],
        num_rows: 13061
    })
    valid: Dataset({
        features: ['qid', 'question', 'answer_entity', 'answer_candidates', 'original_question', 'text'],
        num_rows: 995
    })
})

In [24]:
max_length = 256

# It is important to return the 'labels' field.
# This is a copy of the input_ids field, with padding positions set to -100
# so that they are ignored by the loss.
# The language model is trained by scoring it based on its ability 
# to reproduce 'labels' token by token.
def tokenize_function(examples):
    inputs = tokenizer(
        examples["text"],
        max_length=max_length,
        padding="max_length",
        truncation=True,
        return_tensors="np",
    )
    labels = inputs.input_ids.copy()
    labels[labels == tokenizer.pad_token_id] = -100
    return {
        "input_ids": inputs["input_ids"],
        "attention_mask": inputs["attention_mask"],
        "labels": labels,
    }

# Batch process the dataset
tokenized_dataset = dataset.map(
    tokenize_function,
    batched=True,
    batch_size=8,
    num_proc=4
)

print(tokenized_dataset["train"])

Dataset({
    features: ['answer_candidates', 'answer_entity', 'attention_mask', 'input_ids', 'labels', 'original_question', 'qid', 'question', 'text'],
    num_rows: 13061
})


In [25]:
# Each input is right-padded with the padding token (id: 3) up to max_length.
# This ensures the same length for all instances in a batch.
print(tokenized_dataset["train"][0]["input_ids"])

[9, 8355, 149, 9255, 13, 209, 2872, 10, 550, 115, 11, 5964, 16744, 3017, 886, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]
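
Most of the positions above are padding, which is why `tokenize_function` sets the corresponding labels to -100: PyTorch's cross-entropy loss ignores that label value by default, so padding does not contribute to the training loss. The pure-Python sketch below mimics that behavior on invented numbers; `mean_nll` is a hypothetical helper, not a library function.

```python
import math

IGNORE_INDEX = -100  # label value skipped by the loss

def mean_nll(correct_token_probs, labels):
    """Mean negative log-likelihood over positions whose label is not ignored.

    correct_token_probs[i] is the probability the model assigned to the
    correct token at position i.
    """
    kept = [
        -math.log(p)
        for p, label in zip(correct_token_probs, labels)
        if label != IGNORE_INDEX
    ]
    return sum(kept) / len(kept)

# Two real tokens followed by two padding positions (values are made up).
probs = [0.5, 0.25, 0.9, 0.9]
labels = [9, 8355, IGNORE_INDEX, IGNORE_INDEX]

loss = mean_nll(probs, labels)
print(f"{loss:.4f}")
```

Only the first two positions enter the average, so the probabilities at the padded positions have no effect on the result.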


In [26]:
# Fine-tune the model
model_best_pathway = data_dir / "model_best"
model_retrain = False

if model_best_pathway.is_dir() and not model_retrain:
    print(f"{model_best_pathway} exists and skipping model retraining.")
else:
    print("Fine-tuning model rinna_japanese_gpt2_mediumfinetuned")
    HR()

    # 1. Create transformers.training_args.TrainingArguments
    training_args = TrainingArguments(
        # Pass the output directory as a string path
        output_dir = str(data_dir / "models" / "rinna_japanese_gpt2_mediumfinetuned"),
        num_train_epochs = 3,
        evaluation_strategy = "steps",
        learning_rate = 0.00005,
        warmup_steps = 1000,
        per_device_train_batch_size = 6,
        eval_steps = 200,
        logging_steps = 200,
        save_strategy = "no"
    )

    # 2. Create transformers.trainer.Trainer
    # Fine-tune the pretrained model loaded above. Trainer updates the weights
    # in place, so `model_untuned` no longer holds the original weights
    # after training.
    trainer = Trainer(
        model = model_untuned,
        args = training_args,
        train_dataset = tokenized_dataset["train"],
        eval_dataset = tokenized_dataset["valid"],
    )

    # 3. Train the model
    # This may take over an hour on a GPU (2.5H on Colab)
    print("Start training")
    trainer.train()
    print("Done training")
    HR()

    # 4. Save model
    # Configuration saved in chp05_02/model_best/config.json
    # Model weights saved in chp05_02/model_best/pytorch_model.bin
    print("Saving model.")
    trainer.save_model(str(model_best_pathway))
    print("Model saved.")

    # Last few iterations:
    # Step 	Training Loss 	Validation Loss
    # 200 	2.531400 	    2.994833
    # 400 	2.286800 	    3.030425
    # ...
    # 6200 	1.369500 	    3.148371
    # 6400 	1.377600 	    3.144541

chp05_02/model_best exists and skipping model retraining.
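
As a rough sanity check on the training configuration, the number of optimizer steps and the learning rate at a given step can be estimated by hand. The sketch below assumes a single device, no gradient accumulation, and a linear schedule with warmup (which is, to my knowledge, the `Trainer` default); `linear_lr` is a hypothetical helper written for this illustration.

```python
import math

num_examples = 13061   # rows in the training split
batch_size = 6         # per_device_train_batch_size
num_epochs = 3         # num_train_epochs
base_lr = 5e-5         # learning_rate
warmup_steps = 1000    # warmup_steps

# The last, smaller batch of each epoch still counts as a step.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

def linear_lr(step):
    """Learning rate under linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = total_steps - step
    return base_lr * max(0.0, remaining / (total_steps - warmup_steps))

print(steps_per_epoch, total_steps)
print(linear_lr(500), linear_lr(1000), linear_lr(total_steps))
```

Under these assumptions the run has 6531 steps in total, which is consistent with the last logged step (6400) in the training log recorded in the cell above.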


<a name='5.2.2.1'></a><a id='5.2.2.1'></a>
### 5.2.2.1 Checking created artifacts
<a href="#top">[back to top]</a>

In [27]:
!tree chp05_02

chp05_02
├── dev1_questions.json
├── model_best
│   ├── config.json
│   ├── pytorch_model.bin
│   └── training_args.bin
├── models
│   └── rinna_japanese_gpt2_mediumfinetuned
│       └── runs
│           ├── Dec07_12-39-07_52aa230d979d
│           │   ├── 1670416747.7148376
│           │   │   └── events.out.tfevents.1670416747.52aa230d979d.608.1
│           │   └── events.out.tfevents.1670416747.52aa230d979d.608.0
│           └── Dec08_02-26-07_98d2b8421872
│               ├── 1670466367.9504054
│               │   └── events.out.tfevents.1670466367.98d2b8421872.806.1
│               └── events.out.tfevents.1670466367.98d2b8421872.806.0
├── requirements_5_5_2.txt
├── train_questions.json
└── train_questions.nooa.json

8 directories, 11 files


<a name='5.2.2.2'></a><a id='5.2.2.2'></a>
### 5.2.2.2 Saving artifacts to Google Drive
<a href="#top">[back to top]</a>

https://drive.google.com/drive/my-drive

In [28]:
# Set PUSH_TO_GD to True if you want to push chp05_02 to Google Drive 
PUSH_TO_GD = False
if IS_COLAB and PUSH_TO_GD:
    from google.colab import drive
    drive.mount('/content/drive', force_remount=True)
    HR()

    print("Overwriting /content/drive/MyDrive/chp05_02")
    !cp -R /content/chp05_02  /content/drive/MyDrive
    HR()

    print("Check contents of chp05_02:")
    !du -ah /content/drive/MyDrive/chp05_02 --max-depth=2 | sort -h

### Load best model

Make sure we can reload the saved model.

https://discuss.huggingface.co/t/how-to-save-my-model-to-use-it-later/20568/8

In [29]:
!ls -l chp05_02/model_best

total 1337712
-rw------- 1 root root        869 Dec 12 15:39 config.json
-rw------- 1 root root 1369803165 Dec 12 15:39 pytorch_model.bin
-rw------- 1 root root       2747 Dec 12 15:39 training_args.bin


In [30]:
print("Loading model.")
# Move the model to the same device as the untuned model (GPU if available)
model_tuned = AutoModelForCausalLM.from_pretrained(model_best_pathway).to(device)

Loading model.


### Evaluate using a tuned Rinna model

In [32]:
if IS_COLAB:
    stop_n= 20 # 100
else:
    stop_n = 10

evaluate(
    qas, # list of question/answer records
    model_tuned,
    tokenizer, # T5Tokenizer.from_pretrained("rinna/japanese-gpt2-medium")
    stop_at=stop_n,
    show_preview=True
)

print()
print("Done")

# Success rate = 50.0% (10 / 20)

  0%|          | 1/995 [00:06<1:49:51,  6.63s/it]

 is_correct: True, gold: コックリさん, Q: 明治時代に西洋から伝わった「テーブル・ターニング」に起源を持つ占いの一種で、50音表などを記入した紙を置き、参加者全員の人差し指をコインに置いて行うのは何でしょう?, pred: ['コックリさん', 'テケテケ', '赤い紙、青い紙', '小玉鼠 (妖怪)', 'ルームメイトの死'],


  0%|          | 2/995 [00:12<1:43:12,  6.24s/it]

 is_correct: True, gold: 集英社, Q: 『non・no』『週刊プレイボーイ』『週刊少年ジャンプ』といえば、発行している出版社はどこでしょう?, pred: ['集英社', '白泉社', '日本文芸社', '小学館', '実業之日本社'],


  0%|          | 3/995 [00:20<1:54:13,  6.91s/it]

 is_correct: True, gold: SASUKE, Q: 「パイプスライダー」や「そり立つ壁」などの関門がある、TBS系列で不定期に放送されている視聴者参加型のTV番組は何でしょう?, pred: ['SASUKE', '究極の男は誰だ!?最強スポーツ男子頂上決戦', '最強の男は誰だ!壮絶筋肉バトル!!スポーツマンNo.1決定戦', 'サスケマニア', '筋肉番付シリーズ'],


  0%|          | 4/995 [00:26<1:52:25,  6.81s/it]

 is_correct: True, gold: 浅草寺, Q: 東京都内では最も古い歴史を持つ寺院でもある、入口にある「雷門」で有名な観光名所は何でしょう?, pred: ['浅草寺', '天龍寺 (新宿区)', '今戸神社', '天現寺', '大円寺 (目黒区)'],


  1%|          | 5/995 [00:33<1:48:32,  6.58s/it]

 is_correct: False, gold: グラタン, Q: 「鍋についたおこげ」という意味の言葉が語源であるとされる、日本ではマカロニを使ったものが一般的な西洋料理は何でしょう?, pred: ['オムレツ', 'ポトフ', 'ポテトサラダ', 'フレンチトースト', 'サラダ'],


  2%|▏         | 19/995 [02:09<1:50:57,  6.82s/it]

 Success rate = 50.0% (10 / 20)

Done





<a name='5.2.3'></a><a id='5.2.3'></a>
## 5.2.3 Next Steps
<a href="#top">[back to top]</a>