<a href="https://colab.research.google.com/github/cyyeh/kaggle/blob/master/google-qa/google_qa.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Overview

This notebook demonstrates how to train, evaluate, and test a BERT-based short answer classifier on the Google Natural Questions dataset.

## General Steps to Solve This Problem

1. [X] Prepare raw data
2. [ ] Transform raw data into a BERT-compatible format
3. [ ] Add new layers for the downstream task on top of the BERT model
4. [ ] Train the model
5. [ ] Run inference on new data

### References

- [Attack on BERT: The Power of Giants in NLP and Transfer Learning](https://leemeng.tw/attack_on_bert_transfer_learning_in_nlp.html) (in Chinese)

# Import Libraries and Environment Setup

In [1]:
try:
  %tensorflow_version 2.x
except Exception:
  pass

TensorFlow 2.x selected.


In [2]:
!pip install transformers # BertModel

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/ee/fc/bd726a15ab2c66dc09306689d04da07a3770dad724f0883f0a4bfb745087/transformers-2.4.1-py3-none-any.whl (475kB)

In [0]:
import tensorflow as tf
import json
import os
import pandas as pd
import ast

from transformers import BertTokenizer, TFBertModel

# Prepare Raw Data

In [5]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Check whether the training and testing datasets are available in your Google Drive. If they are not, run the code in the "Prepare Kaggle Dataset" section first.

In [7]:
if os.path.exists('drive/My Drive/simplified-nq-train.csv') and \
os.path.exists('drive/My Drive/simplified-nq-test.csv'):
  print("Training/testing dataset is available!")
else:
  print("Training/testing dataset is not found, please run code inside the 'Prepare Kaggle Dataset' section.")

Training/testing dataset is available!


## Prepare Kaggle Dataset for [TensorFlow 2.0 Question Answering](https://www.kaggle.com/c/tensorflow2-question-answering)

### Download Question Answering Dataset

In [77]:
import os
# Never commit real credentials; substitute the values from your own kaggle.json.
os.environ['KAGGLE_USERNAME'] = "<your-kaggle-username>" # username from the json file
os.environ['KAGGLE_KEY'] = "<your-kaggle-key>" # key from the json file
!kaggle competitions download -c tensorflow2-question-answering

Downloading sample_submission.csv to /content
  0% 0.00/18.2k [00:00<?, ?B/s]
100% 18.2k/18.2k [00:00<00:00, 10.2MB/s]
Downloading simplified-nq-train.jsonl.zip to /content
100% 4.45G/4.46G [01:21<00:00, 66.6MB/s]
100% 4.46G/4.46G [01:21<00:00, 58.9MB/s]
Downloading simplified-nq-test.jsonl.zip to /content
  0% 0.00/4.78M [00:00<?, ?B/s]
100% 4.78M/4.78M [00:00<00:00, 43.9MB/s]
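
The download cell above takes the username and key "from the json file". As a minimal sketch, assuming that file is the standard `kaggle.json` with `username` and `key` fields, the values could be loaded at runtime instead of being typed into the notebook. `load_kaggle_credentials` is a hypothetical helper name, not part of the Kaggle CLI.

```python
import json
import os


def load_kaggle_credentials(path):
  """Read a kaggle.json file and export its values as environment variables.

  Hypothetical helper for illustration; returns the username that was loaded.
  """
  with open(path) as f:
    creds = json.load(f)
  # kaggle.json stores the two values under the keys "username" and "key".
  os.environ['KAGGLE_USERNAME'] = creds['username']
  os.environ['KAGGLE_KEY'] = creds['key']
  return creds['username']
```

In Colab this might be called as `load_kaggle_credentials('drive/My Drive/kaggle.json')`, where the path is an assumption about where the file was saved.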


### Unzip Question Answering Dataset

In [72]:
!unzip simplified-nq-train.jsonl.zip
!unzip simplified-nq-test.jsonl.zip

Archive:  simplified-nq-train.jsonl.zip
  inflating: simplified-nq-train.jsonl  
Archive:  simplified-nq-test.jsonl.zip
  inflating: simplified-nq-test.jsonl  


### Generate the Required Data and Export It to CSV

Check the data fields in `simplified-nq-train.jsonl`:

In [78]:
with open('simplified-nq-train.jsonl') as f:
  line = f.readline()
  json_obj = json.loads(line)
  print(json_obj.keys())

dict_keys(['document_text', 'long_answer_candidates', 'question_text', 'annotations', 'document_url', 'example_id'])


Data fields in `simplified-nq-train.jsonl`:
- document_text
- long_answer_candidates
- question_text
- annotations
- document_url
- example_id

Check the data fields in `simplified-nq-test.jsonl`:

In [80]:
with open('simplified-nq-test.jsonl') as f:
  line = f.readline()
  json_obj = json.loads(line)
  print(json_obj.keys())

dict_keys(['example_id', 'question_text', 'document_text', 'long_answer_candidates'])


Data fields in `simplified-nq-test.jsonl`:
- example_id
- question_text
- document_text
- long_answer_candidates

Data fields that are not needed in training data:
- document_text
- document_url

Data fields that are not needed in testing data:
- document_text

I will drop these fields from the exported files to reduce the dataset's memory footprint.

In [0]:
import csv

LONG_ANSWER_CANDIDATES = 'long_answer_candidates'
QUESTION_TEXT = 'question_text'
ANNOTATIONS = 'annotations'
EXAMPLE_ID = 'example_id'
DOCUMENT_TEXT = 'document_text'

A short answer exists only if a long answer exists, so we first remove the examples that have no long answer.

When a long answer does not exist, `start_token`, `candidate_index`, and `end_token` are all -1 in the `annotations` field of `simplified-nq-train.jsonl`.

In [0]:
def is_long_answer_existed(annotations):
  long_answer = annotations['long_answer']
  return long_answer['start_token'] != -1 \
  and long_answer['candidate_index'] != -1 \
  and long_answer['end_token'] != -1 
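
As a quick sanity check, the predicate can be run on two hand-made annotation dicts. The values below are made up for illustration, and the function is repeated only so the snippet runs on its own.

```python
def is_long_answer_existed(annotations):
  # Same predicate as in the cell above, repeated so this snippet is self-contained.
  long_answer = annotations['long_answer']
  return (long_answer['start_token'] != -1
          and long_answer['candidate_index'] != -1
          and long_answer['end_token'] != -1)


# Hypothetical annotations: one with a long answer, one without.
with_long_answer = {
  'long_answer': {'start_token': 12, 'candidate_index': 3, 'end_token': 40}
}
without_long_answer = {
  'long_answer': {'start_token': -1, 'candidate_index': -1, 'end_token': -1}
}

print(is_long_answer_existed(with_long_answer))     # True
print(is_long_answer_existed(without_long_answer))  # False
```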

In [0]:
def make_annotation_dataset(document_text, annotations):
  orig_long_answer = annotations['long_answer']
  new_long_answer = ' '.join(document_text[
                                       orig_long_answer['start_token']:
                                       orig_long_answer['end_token']
                                      ])
  orig_short_answer = annotations['short_answers']
  new_short_answer = ' '.join(document_text[
                                            orig_short_answer[0]['start_token']:
                                            orig_short_answer[0]['end_token']
                                          ]) if len(orig_short_answer) else ''
  return {
      "yes_no_answer": annotations["yes_no_answer"],
      "long_answer": new_long_answer,
      "short_answer": new_short_answer
  }
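
The function above turns token offsets into text by slicing the whitespace-split document. A small worked example on a toy document (all values are made up) shows what the slices produce:

```python
# Toy document and annotation; every value here is hypothetical.
document_text = 'the capital of france is paris , a city on the seine'.split(' ')
annotation = {
  'yes_no_answer': 'NONE',
  'long_answer': {'start_token': 0, 'end_token': 6, 'candidate_index': 0},
  'short_answers': [{'start_token': 5, 'end_token': 6}],
}

# end_token is exclusive, matching Python slice semantics.
long_span = annotation['long_answer']
long_answer = ' '.join(document_text[long_span['start_token']:long_span['end_token']])

# Only the first short answer is used; an empty list yields an empty string.
short_spans = annotation['short_answers']
short_answer = ' '.join(
  document_text[short_spans[0]['start_token']:short_spans[0]['end_token']]
) if short_spans else ''

print(long_answer)   # the capital of france is paris
print(short_answer)  # paris
```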

In [0]:
# write to simplified-nq-train.csv
with open('simplified-nq-train.csv', 'w', newline='') as csvfile:
  writer = csv.writer(csvfile, delimiter=';')
  # headline
  writer.writerow([
                   EXAMPLE_ID,
                   QUESTION_TEXT,
                   LONG_ANSWER_CANDIDATES,
                   ANNOTATIONS
                   ])

  with open('simplified-nq-train.jsonl', 'r') as f:
    for line in f:
      json_obj = json.loads(line)
      if is_long_answer_existed(json_obj[ANNOTATIONS][0]):
        document_text = json_obj[DOCUMENT_TEXT].split(' ')
        writer.writerow([
                        json_obj[EXAMPLE_ID], 
                        json_obj[QUESTION_TEXT],
                        [
                          ' '.join(document_text[candidate['start_token']:candidate['end_token']]) 
                          for candidate in json_obj[LONG_ANSWER_CANDIDATES]
                        ],
                        make_annotation_dataset(document_text, json_obj[ANNOTATIONS][0])
                        ])

In [0]:
# write to simplified-nq-test.csv
with open('simplified-nq-test.csv', 'w', newline='') as csvfile:
  writer = csv.writer(csvfile, delimiter=';')
  # headline
  writer.writerow([
                   EXAMPLE_ID,
                   QUESTION_TEXT, 
                   LONG_ANSWER_CANDIDATES
                   ])

  with open('simplified-nq-test.jsonl', 'r') as f:
    for line in f:
      json_obj = json.loads(line)
      document_text = json_obj[DOCUMENT_TEXT].split(' ')

      writer.writerow([
                       json_obj[EXAMPLE_ID], 
                       json_obj[QUESTION_TEXT],
                       [
                        ' '.join(document_text[candidate['start_token']:candidate['end_token']]) 
                        for candidate in json_obj[LONG_ANSWER_CANDIDATES]
                       ],
                      ])
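
The rows written above place a Python list and a dict directly into CSV cells, so `csv.writer` stores their `str()` form and reading them back yields plain strings. A minimal round-trip sketch on made-up data, using `ast.literal_eval` to recover the objects (the same call the dataset class later in this notebook relies on):

```python
import ast
import csv
import io

# Write one header row and one data row to an in-memory buffer.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter=';')
writer.writerow(['example_id', 'long_answer_candidates', 'annotations'])
writer.writerow([
  1,
  ['<P> first candidate </P>', '<P> second candidate </P>'],
  {'yes_no_answer': 'NONE', 'long_answer': 'first candidate', 'short_answer': ''},
])

# Read the row back: every cell is now a string.
buffer.seek(0)
row = next(csv.DictReader(buffer, delimiter=';'))

# literal_eval safely parses the repr strings back into a list and a dict.
candidates = ast.literal_eval(row['long_answer_candidates'])
annotations = ast.literal_eval(row['annotations'])

print(type(row['annotations']).__name__)  # str
print(candidates[1])                      # <P> second candidate </P>
print(annotations['yes_no_answer'])       # NONE
```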

### Move the Generated CSV Files to My Google Drive

In [86]:
!mv simplified-nq-test.csv drive/My\ Drive/
!mv simplified-nq-train.csv drive/My\ Drive/

mv: cannot stat 'simplified-nq-test.csv': No such file or directory


In [40]:
# load training data and check if it's ok
simplified_train_csv_path = "drive/My Drive/simplified-nq-train.csv"

simplified_train_pd = pd.read_csv(simplified_train_csv_path, sep=';', chunksize=1)
next(simplified_train_pd)

| | example_id | question_text | long_answer_candidates | annotations |
|---|---|---|---|---|
| 0 | 5655493461695504401 | which is the most common use of opt-in e-mail ... | `['<Table> <Tr> <Td> </Td> <Td> ( hide ) This a...` | `{'yes_no_answer': 'NONE', 'long_answer': {'sta...` |


In [41]:
# load testing data and check if it's ok
simplified_test_csv_path = "drive/My Drive/simplified-nq-test.csv"

simplified_test_pd = pd.read_csv(simplified_test_csv_path, sep=';')
simplified_test_pd.head()

| | example_id | question_text | long_answer_candidates |
|---|---|---|---|
| 0 | -1220107454853145579 | who is the south african high commissioner in ... | `['<Table> <Tr> <Th_colspan="2"> High Commissio...` |
| 1 | 8777415633185303067 | the office episode when they sing to michael | ``` ['<Table> <Tr> <Th_colspan="2"> `` Michael \'s... ``` |
| 2 | 4640548859154538040 | what is the main idea of the cross of gold speech | `['<Table> Cross of Gold speech <Tr> <Td_colspa...` |
| 3 | -5316095317154496261 | when was i want to sing in opera written | `['<Table> <Tr> <Th_colspan="2"> Wilkie Bard </...` |
| 4 | -8752372642178983917 | who does the voices in ice age collision course | `['<Table> <Tr> <Th_colspan="2"> Ice Age : Coll...` |


# Prepare Short Answer Dataset

When a long answer exists, its short answer falls into one of three cases:
1. YES/NO
2. a sentence or phrase
3. no short answer
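
The three cases above can be sketched as a small classifier over one processed annotation. This is an illustration only: `short_answer_case` is a hypothetical helper, and it assumes the dict layout produced by `make_annotation_dataset` (`yes_no_answer` and `short_answer` keys).

```python
def short_answer_case(annotation):
  """Return 'yes_no', 'span', or 'none' for one processed annotation dict."""
  # Case 1: the short answer is a YES/NO judgement.
  if annotation['yes_no_answer'] in ('YES', 'NO'):
    return 'yes_no'
  # Case 2: the short answer is a sentence or phrase from the document.
  if annotation['short_answer']:
    return 'span'
  # Case 3: a long answer exists but no short answer was annotated.
  return 'none'


# Made-up annotations, one per case.
print(short_answer_case({'yes_no_answer': 'YES', 'short_answer': ''}))        # yes_no
print(short_answer_case({'yes_no_answer': 'NONE', 'short_answer': 'paris'}))  # span
print(short_answer_case({'yes_no_answer': 'NONE', 'short_answer': ''}))       # none
```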

In [0]:
class Short_answer_dataset():
  def __init__(self, tokenizer, data_path='drive/My Drive/simplified-nq-train.csv', mode='train'):
    assert mode in ['train', 'test']
    # Keep the path rather than an open reader so the CSV can be re-read on every pass.
    self.data_path = data_path
    self.mode = mode
    self.tokenizer = tokenizer
    self.yes_no_label_map = {'YES': 1, 'NO': 2, 'NONE': 3}


  def get_dataset_generator_function(self):
    return self._create_data_generator


  def _create_data_generator(self, target_format='bert'):
    '''
    Yield one instance per CSV row.
    target_format: 'raw'|'bert'
    '''
    # chunksize=1 streams the file one row at a time instead of loading it all into memory.
    # Opening the reader here lets the generator be restarted, e.g. once per epoch.
    for instance in pd.read_csv(self.data_path, sep=";", chunksize=1):
      if target_format == 'bert':
        yield self.get_bert_compatible_instance(instance)
      else:
        yield instance


  def get_bert_compatible_instance(self, instance):
    '''
    Make a pair that consists of the question text and the long answer, then
    return ((tokens_tensor, segments_tensor, masks_tensor), label_tensor):
    - tokens_tensor: token ids of the two concatenated sentences, special tokens included ([CLS], [SEP])
    - segments_tensor: marks the sentence boundary; 0 for the first sentence, 1 for the second
    - masks_tensor: 1 for real tokens, 0 for padding tokens (id 0)
    - label_tensor: None in testing mode
    '''
    # helper functions
    def _get_question_long_answer_pair(instance):
      question_text = instance["question_text"].iloc[0]
      json_obj = ast.literal_eval(instance["annotations"].iloc[0])
      long_answer_text = json_obj["long_answer"]
      return question_text, long_answer_text

    def _get_short_answer_label(instance):
      json_obj = ast.literal_eval(instance["annotations"].iloc[0])
      return json_obj["yes_no_answer"]

    ### build label_tensor
    if self.mode == 'train':
      short_answer_label = _get_short_answer_label(instance)
      label_id = self.yes_no_label_map[short_answer_label]
      label_tensor = tf.constant(label_id, dtype=tf.int64)
    else:
      label_tensor = None

    # question_text is the first sentence (a)
    # long_answer_text is the second sentence (b)
    question_text, long_answer_text = _get_question_long_answer_pair(instance)

    ### build tokens_tensor
    # first sentence
    word_pieces = ["[CLS]"]
    tokens_a = self.tokenizer.tokenize(question_text)
    word_pieces += tokens_a + ["[SEP]"]
    len_a = len(word_pieces)

    # second sentence
    tokens_b = self.tokenizer.tokenize(long_answer_text)
    word_pieces += tokens_b + ["[SEP]"]
    len_b = len(word_pieces) - len_a

    ids = self.tokenizer.convert_tokens_to_ids(word_pieces)
    tokens_tensor = tf.constant(ids, dtype=tf.int64)

    ### build segments_tensor
    segments_tensor = tf.constant([0] * len_a + [1] * len_b, dtype=tf.int64)

    ### build masks_tensor
    # cast to int64 so the dtype matches the other input tensors
    masks_tensor = tf.cast(tf.not_equal(tokens_tensor, 0), tf.int64)

    return (tokens_tensor, segments_tensor, masks_tensor), label_tensor

  
  def convert_ids_to_tokens(self, tokens_tensor):
    return self.tokenizer.convert_ids_to_tokens(tokens_tensor)
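
To make the layout of the three input tensors concrete, here is a minimal pure-Python sketch of the same construction. The word pieces are made-up stand-ins for tokenizer output, and `max_len` is an assumed padding length that the class above does not use.

```python
# Toy word pieces standing in for tokenizer output (hypothetical values).
tokens_a = ['who', 'wrote', 'hamlet']          # question
tokens_b = ['shakespeare', 'wrote', 'hamlet']  # long answer

# [CLS] question [SEP] long answer [SEP]
word_pieces = ['[CLS]'] + tokens_a + ['[SEP]']
len_a = len(word_pieces)
word_pieces += tokens_b + ['[SEP]']
len_b = len(word_pieces) - len_a

# Segment ids: 0 for the first sentence (with [CLS] and its [SEP]), 1 for the second.
segment_ids = [0] * len_a + [1] * len_b

# Without padding every position is a real token, so the mask is all ones.
attention_mask = [1] * len(word_pieces)

# Padding to a fixed length adds [PAD] tokens, which get mask 0.
max_len = 12  # assumed value for illustration
pad = max_len - len(word_pieces)
padded_pieces = word_pieces + ['[PAD]'] * pad
padded_segments = segment_ids + [0] * pad
padded_mask = attention_mask + [0] * pad

print(segment_ids)  # [0, 0, 0, 0, 0, 1, 1, 1, 1]
print(padded_mask)  # [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
```

Because the generator yields unpadded sequences of varying length, batching them would need some padding step of this kind.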

### Initialize BertTokenizer

In [0]:
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-cased', do_basic_tokenize=False)

### Testing

In [13]:
short_answer_train_dataset_gen = Short_answer_dataset(bert_tokenizer).get_dataset_generator_function()
short_answer_train_dataset = tf.data.Dataset.from_generator(
    short_answer_train_dataset_gen, 
    output_types=((tf.int64, tf.int64, tf.int64), tf.int64)
)

for instance in short_answer_train_dataset:
  print(instance)
  break

((<tf.Tensor: shape=(92,), dtype=int64, numpy=
array([  101,  1134,  1110,  1103,  1211,  1887,  1329,  1104, 11769,
        1204, 28137,  1394,   174, 28137, 14746,  6213,   102,   133,
        2101, 28144,   138,  1887,  1859,  1104,  6156,  6213,  1110,
         170, 24343,  1850,  1106,  1126,  6437,  3016,   112,  1116,
        5793,   119,  5723, 24343,  1116, 12862,  5793,  1104,  8851,
        1958,  1137, 18949,   117,  1137,  1207,  2982,   119,  1130,
        1142,  2076,  1104,  6437,   117,   170,  1419,  1115,  3349,
        1106,  3952,   170, 24343,  1106,  1147,  5793,  1336,  2367,
        1172,  1120,  1103,  1553,  1104,  4779,  1191,  1152,  1156,
        1176,  1106,  3531,  1103, 24343,   119,   133, 28139,  2101,
       28144,   102])>, <tf.Tensor: shape=(92,), dtype=int64, numpy=
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1

# Short Answer Identifier

![short-answer-identificator](https://github.com/cyyeh/kaggle/blob/master/google-qa/short_answer_identificator.png?raw=true)

## Long Answer Encoder

See the "Prepare Short Answer Dataset" section above.

## Short Answer Binary Classifier

## Short Answer Null Classifier