# 9강) Closed book Question Answering 을 수행해보기

## Natural Questions 

[Natural Questions (NQ)](https://ai.google.com/research/NaturalQuestions) 

Natural Questions 는 open-domain QA 에서 자주 사용되는 데이터셋입니다. 현재 datasets 패키지에선 natural questions 를 제대로 받을 수 없으므로, 따로 전처리된 natural questions 데이터셋을 다운받도록 하겠습니다.


## Requirements

In [1]:
%%bash
# install packages
pip install datasets==1.4.1 > /dev/null 2>&1 # execute command in silence
pip install transformers==4.4.1 > /dev/null 2>&1
pip install tqdm==4.41.1 > /dev/null 2>&1
pip install apache_beam > /dev/null 2>&1 # for trivia or nq datasets, just in case
# download natural qusetion datasets 
wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1wK5Q7R294ejOXyumcL7UbN231Si0cl15' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1wK5Q7R294ejOXyumcL7UbN231Si0cl15" -O data_nq-train.tsv && rm -rf /tmp/cookies.txt 
wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1Ii_wNVpam2wZ5wYebqAqwPHi6g0Ayg9l' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1Ii_wNVpam2wZ5wYebqAqwPHi6g0Ayg9l" -O data_nq-validation.tsv && rm -rf /tmp/cookies.txt 

--2021-10-18 04:46:01--  https://docs.google.com/uc?export=download&confirm=&id=1wK5Q7R294ejOXyumcL7UbN231Si0cl15
Resolving docs.google.com (docs.google.com)... 142.250.207.46, 2404:6800:4004:824::200e
Connecting to docs.google.com (docs.google.com)|142.250.207.46|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://doc-0c-48-docs.googleusercontent.com/docs/securesc/mct2kh06lc2863kclhc9cn1dnli5gqig/d4cj3sd5o9815n0tg6ha1060cqkvd1sh/1634532300000/10000090644219567406/17489254073164787105Z/1wK5Q7R294ejOXyumcL7UbN231Si0cl15?e=download [following]
--2021-10-18 04:46:02--  https://doc-0c-48-docs.googleusercontent.com/docs/securesc/mct2kh06lc2863kclhc9cn1dnli5gqig/d4cj3sd5o9815n0tg6ha1060cqkvd1sh/1634532300000/10000090644219567406/17489254073164787105Z/1wK5Q7R294ejOXyumcL7UbN231Si0cl15?e=download
Resolving doc-0c-48-docs.googleusercontent.com (doc-0c-48-docs.googleusercontent.com)... 172.217.31.161, 2404:6800:4004:80c::2001
Connecting to doc-0c-48

## 데이터 불러오기

In [2]:
import os
from tqdm.auto import tqdm, trange
import argparse
import random
import numpy as np

from datasets import load_metric, Dataset

In [3]:
DATA_DIR = "./"
nq_tsv_path = {
    "train": os.path.join(DATA_DIR, "data_nq-train.tsv"),
    "valid": os.path.join(DATA_DIR, "data_nq-validation.tsv")
}

In [4]:
import pandas as pd
train_df = pd.read_csv(filepath_or_buffer=nq_tsv_path['train'], sep='\t',
                 header=None)
train_df = train_df.rename(columns = {train_df.columns[0]: 'question', train_df.columns[1]: 'answer'}).dropna()
train_ds = Dataset.from_pandas(train_df)

valid_df = pd.read_csv(filepath_or_buffer=nq_tsv_path['valid'], sep='\t',
                 header=None)
valid_df = valid_df.rename(columns = {valid_df.columns[0]: 'question', valid_df.columns[1]: 'answer'}).dropna()
valid_ds = Dataset.from_pandas(valid_df)

In [5]:
import re
def nq_preprocessor(ex):
  def normalize_text(text):
    """Lowercase and remove quotes from a string."""
    text = text.lower()
    text = re.sub("'(.*)'", r"\1", text)
    return text

  def to_inputs_and_targets(ex):
    """Map {"question": ..., "answer": ...}->{"inputs": ..., "targets": ...}."""
    return {
        "inputs":
             "".join(
                 ["natural question: ", normalize_text(ex["question"])]),
        "targets": normalize_text(ex["answer"])
    }
  return to_inputs_and_targets(ex)

In [6]:
import multiprocessing as mp
cpus = mp.cpu_count()
train_ds = train_ds.map(lambda x: nq_preprocessor(x), num_proc=cpus).remove_columns(['question', 'answer'])
valid_ds = valid_ds.map(lambda x: nq_preprocessor(x), num_proc=cpus).remove_columns(['question', 'answer'])

  

HBox(children=(FloatProgress(value=0.0, description='#0', max=12010.0, style=ProgressStyle(description_width='…

  

HBox(children=(FloatProgress(value=0.0, description='#1', max=12010.0, style=ProgressStyle(description_width='…

HBox(children=(FloatProgress(value=0.0, description='#2', max=12009.0, style=ProgressStyle(description_width='…

 

HBox(children=(FloatProgress(value=0.0, description='#3', max=12009.0, style=ProgressStyle(description_width='…

 

HBox(children=(FloatProgress(value=0.0, description='#4', max=12009.0, style=ProgressStyle(description_width='…

  

HBox(children=(FloatProgress(value=0.0, description='#5', max=12009.0, style=ProgressStyle(description_width='…

HBox(children=(FloatProgress(value=0.0, description='#6', max=12009.0, style=ProgressStyle(description_width='…

HBox(children=(FloatProgress(value=0.0, description='#7', max=12009.0, style=ProgressStyle(description_width='…









        

HBox(children=(FloatProgress(value=0.0, description='#0', max=287.0, style=ProgressStyle(description_width='in…

HBox(children=(FloatProgress(value=0.0, description='#1', max=287.0, style=ProgressStyle(description_width='in…

HBox(children=(FloatProgress(value=0.0, description='#4', max=287.0, style=ProgressStyle(description_width='in…

HBox(children=(FloatProgress(value=0.0, description='#7', max=286.0, style=ProgressStyle(description_width='in…

HBox(children=(FloatProgress(value=0.0, description='#5', max=287.0, style=ProgressStyle(description_width='in…

HBox(children=(FloatProgress(value=0.0, description='#2', max=287.0, style=ProgressStyle(description_width='in…

HBox(children=(FloatProgress(value=0.0, description='#3', max=287.0, style=ProgressStyle(description_width='in…

HBox(children=(FloatProgress(value=0.0, description='#6', max=287.0, style=ProgressStyle(description_width='in…











In [7]:
train_ds[0]

{'__index_level_0__': 0,
 'inputs': 'natural question: which is the most common use of opt-in e-mail marketing?',
 'targets': "a newsletter sent to an advertising firm's customers"}

In [8]:
# Choose small samples as our dataset 
sample_idx = np.random.choice(range(len(train_ds)), 4) 
training_dataset = train_ds[sample_idx]

## 훈련

In [9]:
import torch
import torch.nn.functional as F

from transformers import (AutoTokenizer, 
                          AutoModelForSeq2SeqLM, 
                          AdamW, 
                          TrainingArguments, 
                          get_linear_schedule_with_warmup)

In [10]:
args = TrainingArguments(
    output_dir="seq2seq_models/bart_nq",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    num_train_epochs=2,
    weight_decay=0.01,
    gradient_accumulation_steps=2
)

In [11]:
args.device

device(type='cuda', index=0)

In [12]:
# load pre-trained model on cuda (if available)
model_checkpoint = "facebook/bart-large" 

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint).to(args.device) # BartForConditionalGeneration

HBox(children=(FloatProgress(value=0.0, description='Downloading', max=1600.0, style=ProgressStyle(description…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=898822.0, style=ProgressStyle(descripti…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=456318.0, style=ProgressStyle(descripti…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=1355863.0, style=ProgressStyle(descript…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=26.0, style=ProgressStyle(description_w…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=1018571383.0, style=ProgressStyle(descr…




In [13]:
!nvidia-smi

Mon Oct 18 04:47:43 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla V100-PCIE...  Off  | 00000000:00:05.0 Off |                  Off |
| N/A   36C    P0    36W / 250W |   9500MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [14]:
torch.manual_seed(2021)
torch.cuda.manual_seed(2021)
np.random.seed(2021)
random.seed(2021)

In [15]:
from torch.utils.data import (DataLoader, RandomSampler, TensorDataset)
max_len = 128 # to reduce memory per sample! 
q_seqs = tokenizer(training_dataset['inputs'], padding="max_length", max_length=max_len, truncation=True, return_tensors='pt')
a_seqs = tokenizer(training_dataset['targets'], padding="max_length", max_length=max_len, truncation=True, return_tensors='pt')
train_dataset = TensorDataset(q_seqs['input_ids'], q_seqs['attention_mask'],
                        a_seqs['input_ids'], a_seqs['attention_mask'], )

In [16]:
# Optimizer
no_decay = ['bias', 'LayerNorm.weight']
optimizer_grouped_parameters = [
    {'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': args.weight_decay},
    {'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0},
    ]
optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon)

In [17]:
def train(args, dataset, model, optimizer):
    # Dataloader
    train_sampler = RandomSampler(dataset)
    
    train_dataloader = DataLoader(dataset, batch_size=args.per_device_train_batch_size,
                                  sampler=train_sampler, )

    t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs
    scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=args.warmup_steps, num_training_steps=t_total)

    # 훈련 시작 
    global_step = 0

    model.zero_grad()

    train_iterator = trange(int(args.num_train_epochs), desc="Epoch")

    for _ in train_iterator:
        epoch_iterator = tqdm(train_dataloader, desc="Iteration")

        for step, batch in enumerate(epoch_iterator):
            model.train()

            q_ids, q_mask, a_ids, a_mask = batch
            # 레이블 구하기 - answer의 0번째를 제외한 나머지  
            lm_labels = a_ids[:, 1:].contiguous().clone()
            lm_labels[a_mask[:, 1:].contiguous() == 0] = -100

            # decoder_input_ids 는 원래 주어지지 않아도 모델이 자동으으 계산합니다 
            model_inputs = {
                "input_ids": q_ids.cuda(),
                "attention_mask": q_mask.cuda(),
                "decoder_input_ids": a_ids[:, :-1].contiguous().cuda(),
                "labels": lm_labels.cuda(),
            }

            outputs = model(**model_inputs)  # (batch_size, emb_dim)
            loss = outputs[0]
                
            loss.backward()

            optimizer.step()
            scheduler.step()  # 학습률을 조정하는 스케쥴러 
            model.zero_grad()
            global_step += 1

            # save model
            model.save_pretrained(args.output_dir)

    return model

In [18]:
model = train(args, train_dataset, model, optimizer)

HBox(children=(FloatProgress(value=0.0, description='Epoch', max=2.0, style=ProgressStyle(description_width='i…

HBox(children=(FloatProgress(value=0.0, description='Iteration', max=2.0, style=ProgressStyle(description_widt…




HBox(children=(FloatProgress(value=0.0, description='Iteration', max=2.0, style=ProgressStyle(description_widt…





## 미리 학습된 모델로 테스트 해보기


In [19]:
# 1. 미리 학습해둔 encoder file 다운로드 
!wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1gCgeJIPiMeM0pmFq3toEm91QhUEwdc4H' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1gCgeJIPiMeM0pmFq3toEm91QhUEwdc4H" -O bart_nq.tar.gz && rm -rf /tmp/cookies.txt
# 2. the .tar.gz file 압축해제 
!mkdir ./seq2seq_models/ && tar -xf bart_nq.tar.gz -C ./seq2seq_models/

# 3. 직접 학습해보기 
#!wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1KkzY0cAyVe-c-ur4b-xZ3BYqnA7MzVt4' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1KkzY0cAyVe-c-ur4b-xZ3BYqnA7MzVt4" -O train_clqa_trainer.py && rm -rf /tmp/cookies.txt
#!python train_clqa_trainer.py

--2021-10-18 04:48:34--  https://docs.google.com/uc?export=download&confirm=kslO&id=1gCgeJIPiMeM0pmFq3toEm91QhUEwdc4H
Resolving docs.google.com (docs.google.com)... 142.250.207.46, 2404:6800:4004:824::200e
Connecting to docs.google.com (docs.google.com)|142.250.207.46|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://doc-00-5s-docs.googleusercontent.com/docs/securesc/kjjlo6sqjd58f7nqbeoc5nf0t5d76bnu/sol9foh1hci4agbjuhcncf3igi7ok391/1634532450000/10000090644219567406/10588907199810661152Z/1gCgeJIPiMeM0pmFq3toEm91QhUEwdc4H?e=download [following]
--2021-10-18 04:48:35--  https://doc-00-5s-docs.googleusercontent.com/docs/securesc/kjjlo6sqjd58f7nqbeoc5nf0t5d76bnu/sol9foh1hci4agbjuhcncf3igi7ok391/1634532450000/10000090644219567406/10588907199810661152Z/1gCgeJIPiMeM0pmFq3toEm91QhUEwdc4H?e=download
Resolving doc-00-5s-docs.googleusercontent.com (doc-00-5s-docs.googleusercontent.com)... 172.217.31.161, 2404:6800:4004:80c::2001
Connecting to doc-0

In [20]:
# args.output_dir 이 우리가 모델을 저장해둔 위치  
output_dir = "/content/seq2seq_models/bart_nq/checkpoint-12010/"
model = AutoModelForSeq2SeqLM.from_pretrained(output_dir).cuda()

404 Client Error: Not Found for url: https://huggingface.co//content/seq2seq_models/bart_nq/checkpoint-12010//resolve/main/config.json


OSError: Can't load config for '/content/seq2seq_models/bart_nq/checkpoint-12010/'. Make sure that:

- '/content/seq2seq_models/bart_nq/checkpoint-12010/' is a correct model identifier listed on 'https://huggingface.co/models'

- or '/content/seq2seq_models/bart_nq/checkpoint-12010/' is the correct path to a directory containing a config.json file



generate() 메서드를 활용한 텍스트 생성

In [None]:
def qa_s2s_generate(
    model_inputs,
    qa_s2s_model,
    qa_s2s_tokenizer,
    num_answers=1,
    num_beams=2,
    min_len=1,
    max_len=64,
    do_sample=False,
    temp=1.0,
    top_p=None,
    top_k=None,
):
    n_beams = num_answers if num_beams is None else max(num_beams, num_answers)
    generated_ids = qa_s2s_model.generate(
        input_ids=model_inputs[0],
        attention_mask=model_inputs[1],
        min_length=min_len,
        max_length=max_len,
        do_sample=do_sample,
        early_stopping=True,
        num_beams=1 if do_sample else n_beams,
        temperature=temp,
        top_k=top_k,
        top_p=top_p,
        eos_token_id=qa_s2s_tokenizer.eos_token_id,
        no_repeat_ngram_size=3,
        num_return_sequences=num_answers,
        decoder_start_token_id=qa_s2s_tokenizer.bos_token_id,
    )
    return [qa_s2s_tokenizer.decode(ans_ids, skip_special_tokens=True).strip() for ans_ids in generated_ids]

In [None]:
# Choose small samples as our dataset 
sample_idx = np.random.choice(range(len(valid_ds)), 4)
validation_dataset = valid_ds[sample_idx]

input_dict = tokenizer(validation_dataset['inputs'], padding="max_length", max_length=max_len, truncation=True, return_tensors='pt')
target_dict = tokenizer(validation_dataset['targets'], padding="max_length", max_length=max_len, truncation=True, return_tensors='pt')
valid_dataset = TensorDataset(input_dict['input_ids'], input_dict['attention_mask'], 
                              target_dict['input_ids']) # target 정보는 필요 없으나 모델 결과를 확인하기 위해 넣음
valid_dataloader = DataLoader(valid_dataset, batch_size=args.per_device_eval_batch_size)


for step, batch in enumerate(valid_dataloader):
    model.eval()
    
    if torch.cuda.is_available():
        batch = tuple(t.cuda() for t in batch)

    inputs = [input.strip() for input in tokenizer.batch_decode(batch[0], skip_special_tokens=True, clean_up_tokenization_spaces=True)]
    targets = [target.strip() for target in tokenizer.batch_decode(batch[2], skip_special_tokens=True, clean_up_tokenization_spaces=True)]
    
    results = qa_s2s_generate(batch, model, tokenizer)

    for inp, tgt, pred in zip(inputs, targets, results):
        print("Input:", inp)
        print("Target:", tgt)
        print("Prediction:", pred)
        print()

Input: natural question: when was the latest version of chrome released?
Target: 2018-01-22
Prediction: September 27, 2017

Input: natural question: mount and blade with fire and sword time period?
Target: 1648-51
Prediction: the period between the ages of the first and second centuries BC

Input: natural question: which way does the earth orbit the sun?
Target: counter clockwise
Prediction: about the same distance from the Sun

Input: natural question: what year is the deer hunter set in?
Target: 1967
Prediction: the late 19th century

