<a href="https://colab.research.google.com/github/cyyeh/kaggle/blob/master/google-qa/google_qa.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Overview

This notebook demonstrates how to train, evaluate, and test a short-answer classification model on the Google Natural Questions dataset using BERT.

## General Steps to Solve This Problem

1. [X] Prepare raw data
2. [ ] Transform raw data into a BERT-compatible format
3. [ ] Add new layers for downstream task on the BERT model
4. [ ] Train the model
5. [ ] Make inference on new data

### References

- [Attack on BERT: The Power of Giants in NLP and Transfer Learning (進擊的 BERT：NLP 界的巨人之力與遷移學習)](https://leemeng.tw/attack_on_bert_transfer_learning_in_nlp.html)

# Import Libraries and Environment Setup

In [0]:
try:
  %tensorflow_version 2.x
except Exception:
  pass

TensorFlow 2.x selected.


In [0]:
!pip install transformers # BertModel



In [0]:
import tensorflow as tf
import json
import os
import pandas as pd

from transformers import BertTokenizer, TFBertModel

# Prepare Raw Data

Note: You don't need to run the code in the "Prepare Kaggle Dataset" section; it is included only to show how the dataset was generated from the raw Kaggle data.

In [0]:
from google.colab import drive
drive.mount('/content/drive')


In [40]:
# load training data and check if it's ok
simplified_train_csv_path = "drive/My Drive/simplified-nq-train.csv"

simplified_train_pd = pd.read_csv(simplified_train_csv_path, sep=';', chunksize=1)
next(simplified_train_pd)

Unnamed: 0,example_id,question_text,long_answer_candidates,annotations
0,5655493461695504401,which is the most common use of opt-in e-mail ...,['<Table> <Tr> <Td> </Td> <Td> ( hide ) This a...,"{'yes_no_answer': 'NONE', 'long_answer': {'sta..."


In [41]:
# load testing data and check if it's ok
simplified_test_csv_path = "drive/My Drive/simplified-nq-test.csv"

simplified_test_pd = pd.read_csv(simplified_test_csv_path, sep=';')
simplified_test_pd.head()

Unnamed: 0,example_id,question_text,long_answer_candidates
0,-1220107454853145579,who is the south african high commissioner in ...,"['<Table> <Tr> <Th_colspan=""2""> High Commissio..."
1,8777415633185303067,the office episode when they sing to michael,"['<Table> <Tr> <Th_colspan=""2""> `` Michael \'s..."
2,4640548859154538040,what is the main idea of the cross of gold speech,['<Table> Cross of Gold speech <Tr> <Td_colspa...
3,-5316095317154496261,when was i want to sing in opera written,"['<Table> <Tr> <Th_colspan=""2""> Wilkie Bard </..."
4,-8752372642178983917,who does the voices in ice age collision course,"['<Table> <Tr> <Th_colspan=""2""> Ice Age : Coll..."


## Prepare Kaggle Dataset for [TensorFlow 2.0 Question Answering](https://www.kaggle.com/c/tensorflow2-question-answering)

### Download Question Answering Dataset

In [0]:
import os
os.environ['KAGGLE_USERNAME'] = "chihyuyeh" # username from the json file
os.environ['KAGGLE_KEY'] = "<your-kaggle-key>" # key from the json file; never commit a real key
!kaggle competitions download -c tensorflow2-question-answering

Downloading sample_submission.csv to /content
  0% 0.00/18.2k [00:00<?, ?B/s]
100% 18.2k/18.2k [00:00<00:00, 16.0MB/s]
Downloading simplified-nq-test.jsonl.zip to /content
  0% 0.00/4.78M [00:00<?, ?B/s]
100% 4.78M/4.78M [00:00<00:00, 77.9MB/s]
Downloading simplified-nq-train.jsonl.zip to /content
100% 4.46G/4.46G [01:20<00:00, 29.4MB/s]
100% 4.46G/4.46G [01:20<00:00, 59.4MB/s]


### Unzip Question Answering Dataset

In [0]:
!unzip simplified-nq-train.jsonl.zip
!unzip simplified-nq-test.jsonl.zip

Archive:  simplified-nq-train.jsonl.zip
  inflating: simplified-nq-train.jsonl  
Archive:  simplified-nq-test.jsonl.zip
  inflating: simplified-nq-test.jsonl  


### Generate the Data I Need and Export It to CSV

Check the data fields in `simplified-nq-train.jsonl`:

In [0]:
with open('simplified-nq-train.jsonl') as f:
  line = f.readline()
  json_obj = json.loads(line)
  print(json_obj.keys())

dict_keys(['document_text', 'long_answer_candidates', 'question_text', 'annotations', 'document_url', 'example_id'])


Data fields in simplified-nq-train.jsonl
- document_text
- long_answer_candidates
- question_text
- annotations
- document_url
- example_id

Check the data fields in `simplified-nq-test.jsonl`:

In [0]:
with open('simplified-nq-test.jsonl') as f:
  line = f.readline()
  json_obj = json.loads(line)
  print(json_obj.keys())

dict_keys(['example_id', 'question_text', 'document_text', 'long_answer_candidates'])


Data fields in simplified-nq-test.jsonl
- example_id
- question_text
- document_text
- long_answer_candidates

Data fields that are not needed in training data:
- document_text
- document_url

Data fields that are not needed in testing data:
- document_text

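I will remove these fields to reduce the size of the dataset. The text of each long answer candidate is extracted from `document_text` first, so the full document is no longer needed afterwards.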
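
The field removal described above can be sketched on a single toy record. This is only an illustration: the record values and the helper name `slim_record` are made up here, and the slicing assumes, as the export cells in this notebook do, that `start_token` and `end_token` index into `document_text` split on single spaces.

```python
# Toy record mimicking the training fields listed above (values are invented).
record = {
    "example_id": 1,
    "question_text": "who wrote the poem",
    "document_text": "<P> The poem was written by Ann . </P> <P> It is short . </P>",
    "document_url": "http://example.com",
    "long_answer_candidates": [
        {"start_token": 0, "end_token": 9},
        {"start_token": 9, "end_token": 15},
    ],
}


def slim_record(record):
    """Hypothetical helper: keep only the fields needed downstream.

    Candidate spans are materialized as text first, so `document_text`
    and `document_url` can be dropped afterwards.
    """
    tokens = record["document_text"].split(" ")
    candidates = [
        " ".join(tokens[c["start_token"]:c["end_token"]])
        for c in record["long_answer_candidates"]
    ]
    return {
        "example_id": record["example_id"],
        "question_text": record["question_text"],
        "long_answer_candidates": candidates,
    }


slim = slim_record(record)
print(slim["long_answer_candidates"])
```

The slimmed record carries the candidate text directly, which is why the exported csv files do not need the `document_text` column.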

In [0]:
import csv

LONG_ANSWER_CANDIDATES = 'long_answer_candidates'
QUESTION_TEXT = 'question_text'
ANNOTATIONS = 'annotations'
EXAMPLE_ID = 'example_id'
DOCUMENT_TEXT = 'document_text'

In [0]:
def is_long_answer_existed(annotations):
  long_answer = annotations['long_answer']
  return long_answer['start_token'] != -1 \
  and long_answer['candidate_index'] != -1 \
  and long_answer['end_token'] != -1 

In [0]:
# write to simplified-nq-train.csv
with open('simplified-nq-train.csv', 'w', newline='') as csvfile:
  writer = csv.writer(csvfile, delimiter=';')
  # headline
  writer.writerow([
                   EXAMPLE_ID,
                   QUESTION_TEXT,
                   LONG_ANSWER_CANDIDATES,
                   ANNOTATIONS
                   ])

  with open('simplified-nq-train.jsonl', 'r') as f:
    for line in f:
      json_obj = json.loads(line)
      if is_long_answer_existed(json_obj[ANNOTATIONS][0]):
        document_text = json_obj[DOCUMENT_TEXT].split(' ')
        writer.writerow([
                        json_obj[EXAMPLE_ID], 
                        json_obj[QUESTION_TEXT],
                        [
                          ' '.join(document_text[candidate['start_token']:candidate['end_token']]) 
                          for candidate in json_obj[LONG_ANSWER_CANDIDATES]
                        ],
                        json_obj[ANNOTATIONS][0]
                        ])

In [0]:
# write to simplified-nq-test.csv
with open('simplified-nq-test.csv', 'w', newline='') as csvfile:
  writer = csv.writer(csvfile, delimiter=';')
  # headline
  writer.writerow([
                   EXAMPLE_ID,
                   QUESTION_TEXT, 
                   LONG_ANSWER_CANDIDATES
                   ])

  with open('simplified-nq-test.jsonl', 'r') as f:
    for line in f:
      json_obj = json.loads(line)
      document_text = json_obj[DOCUMENT_TEXT].split(' ')

      writer.writerow([
                       json_obj[EXAMPLE_ID], 
                       json_obj[QUESTION_TEXT],
                       [
                        ' '.join(document_text[candidate['start_token']:candidate['end_token']]) 
                        for candidate in json_obj[LONG_ANSWER_CANDIDATES]
                       ],
                      ])

### Move the Generated CSV Files to Google Drive

In [0]:
!mv simplified-nq-test.csv drive/My\ Drive/
!mv simplified-nq-train.csv drive/My\ Drive/

# Prepare Short Answer Dataset

Since a short answer can exist only if a long answer exists, we first remove the cases where no long answer exists.

For long answers that exist, there are several cases of short answers:
1. YES/NO
2. a sentence or phrase
3. no short answer

For long answers that don't exist, `start_token`, `candidate_index`, and `end_token` are all -1 in the `annotations` field of `simplified-nq-train.jsonl`.
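
The cases above can be sketched as a small classifier over one annotation. This is a rough sketch, not the notebook's method: the function name `short_answer_case` is hypothetical, and the `short_answers` key (a list of token spans) is an assumption about the annotation layout, since the annotation shown earlier in this notebook is truncated.

```python
def short_answer_case(annotation):
    """Hypothetical helper: name the short-answer case for one annotation."""
    long_answer = annotation["long_answer"]
    # A missing long answer is marked with -1 in its index fields.
    if -1 in (
        long_answer["start_token"],
        long_answer["candidate_index"],
        long_answer["end_token"],
    ):
        return "no_long_answer"
    if annotation["yes_no_answer"] in ("YES", "NO"):
        return "yes_no"
    # ASSUMPTION: span-style short answers live under a `short_answers` key.
    if annotation.get("short_answers"):
        return "span"
    return "no_short_answer"


# Toy annotations (values are invented for illustration).
existing = {"start_token": 10, "candidate_index": 2, "end_token": 40}
missing = {"start_token": -1, "candidate_index": -1, "end_token": -1}

cases = [
    short_answer_case({"yes_no_answer": "NONE", "long_answer": missing, "short_answers": []}),
    short_answer_case({"yes_no_answer": "YES", "long_answer": existing, "short_answers": []}),
    short_answer_case({"yes_no_answer": "NONE", "long_answer": existing,
                       "short_answers": [{"start_token": 12, "end_token": 14}]}),
    short_answer_case({"yes_no_answer": "NONE", "long_answer": existing, "short_answers": []}),
]
print(cases)
```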

In [0]:
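class Short_answer_dataset():
  def __init__(self, tokenizer, data_path='simplified-nq-train.jsonl', mode='train'):
    assert mode in ['train', 'test']
    # The generator below parses one JSON object per line and needs `document_text`,
    # so data_path must point to a .jsonl file rather than the exported csv.
    self.data_path = data_path
    self._next_index = 0  # index of the next line the generator should read
    self.mode = mode
    self.tokenizer = tokenizer
    self.yes_no_label_map = {'YES': 1, 'NO': 2, 'NONE': 3}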


  def get_dataset_generator_function(self):
    return self._create_data_generator


  '''
  target_format: 'raw'|'bert'
  '''
  def _create_data_generator(self, target_format='bert'):
    def _is_long_answer_existed(annotations):
      long_answer = annotations['long_answer']
      return long_answer['start_token'] != -1 \
      and long_answer['candidate_index'] != -1 \
      and long_answer['end_token'] != -1 

    with open(self.data_path, 'r') as data:
      for index, instance in enumerate(data):
        # skip lines already consumed by a previous call so that iteration resumes
        if index < self._next_index:
          continue
        self._next_index = index + 1
        nq_json = json.loads(instance)
        # in training mode we only care about short answers where long answers exist
        if self.mode == 'train' and not _is_long_answer_existed(nq_json['annotations'][0]):
          continue
        if target_format == 'raw':
          yield nq_json
        else:
          yield self.get_bert_compatible_instance(nq_json)


  '''
  Make a pair that consists of the question text and the long answer, then return
  the following tensors for the pair:
  - tokens_tensor: token ids of the two concatenated sentences, including special tokens ([CLS], [SEP])
  - segments_tensor: marks the sentence boundary; 0 for the first sentence, 1 for the second sentence
  - masks_tensor: 1 for real tokens, 0 for padding
  - label_tensor: None in testing mode
  '''
  def get_bert_compatible_instance(self, instance):
    '''
    helper functions
    '''
    def _get_question_long_answer_pair(instance):
      question_text = instance['question_text']
      long_answer = instance['annotations'][0]['long_answer']
      long_answer_start_token, long_answer_end_token = long_answer['start_token'], long_answer['end_token']
      document_text_tokenized = instance['document_text'].split(' ')
      long_answer_text = ' '.join(document_text_tokenized[long_answer_start_token:long_answer_end_token])
      return question_text, long_answer_text

    def _get_short_answer_label(instance):
      return instance['annotations'][0]['yes_no_answer']

    ### build label_tensor
    if self.mode == 'train':
      short_answer_label = _get_short_answer_label(instance)
      label_id = self.yes_no_label_map[short_answer_label]
      label_tensor = tf.constant(label_id, dtype=tf.int64)
    else:
      label_tensor = None

    # question_text is the first sentence(a)
    # long_answer_text is the second sentence(b)
    question_text, long_answer_text = _get_question_long_answer_pair(instance)
    
    ### build tokens_tensor
    # first sentence
    word_pieces = ["[CLS]"]
    tokens_a = self.tokenizer.tokenize(question_text)
    word_pieces += tokens_a + ["[SEP]"]
    len_a = len(word_pieces)

    # second sentence
    tokens_b = self.tokenizer.tokenize(long_answer_text)
    word_pieces += tokens_b + ["[SEP]"]
    len_b = len(word_pieces) - len_a

    ids = self.tokenizer.convert_tokens_to_ids(word_pieces)
    tokens_tensor = tf.constant(ids, dtype=tf.int64)

    ### build segments_tensor
    segments_tensor = tf.constant([0] * len_a + [1] * len_b, dtype=tf.int64)

    ### build masks_tensor
    # 1 for real tokens, 0 for padding (token id 0); cast so the dtype matches the other tensors
    masks_tensors = tf.cast(tokens_tensor != 0, dtype=tf.int64)

    return (tokens_tensor, segments_tensor, masks_tensors), label_tensor

  
  def convert_ids_to_tokens(self, tokens_tensor):
    return self.tokenizer.convert_ids_to_tokens(tokens_tensor)
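
To make the layout produced by `get_bert_compatible_instance` easier to see, here is a minimal pure-Python sketch of the same three sequences. It is an illustration under stated assumptions only: `toy_vocab`, `build_bert_inputs`, and `pad_to` are hypothetical names, the inputs are already split into tokens, and plain lists stand in for tensors. The real class uses `BertTokenizer` and `tf.constant` instead.

```python
# ASSUMPTION: a tiny made-up vocabulary; id 0 is reserved for padding.
toy_vocab = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "who": 3, "sings": 4, "ann": 5}


def build_bert_inputs(tokens_a, tokens_b, vocab):
    """Return (token_ids, segment_ids, mask) for a sentence pair."""
    pieces_a = ["[CLS]"] + tokens_a + ["[SEP]"]  # first sentence (question)
    pieces_b = tokens_b + ["[SEP]"]              # second sentence (long answer)
    token_ids = [vocab[piece] for piece in pieces_a + pieces_b]
    # 0 marks positions of the first sentence, 1 those of the second.
    segment_ids = [0] * len(pieces_a) + [1] * len(pieces_b)
    # 1 marks real tokens; padding (id 0) would get 0.
    mask = [1 if token_id != 0 else 0 for token_id in token_ids]
    return token_ids, segment_ids, mask


def pad_to(sequence, length, pad_value=0):
    """Right-pad a list with pad_value up to the given length."""
    return sequence + [pad_value] * (length - len(sequence))


token_ids, segment_ids, mask = build_bert_inputs(["who", "sings"], ["ann", "sings"], toy_vocab)
padded_ids = pad_to(token_ids, 9)
padded_mask = pad_to(mask, 9)
print(token_ids, segment_ids, mask)
```

Before batching, every position is a real token, so the mask is all ones; zeros appear only once sequences are padded to a common length.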

### Initialize BertTokenizer

In [0]:
html_tags = ['<P>', '</P>', '<Table>', '</Table>', '<Tr>', '</Tr>', '<Li>', '</Li>', '<Ol>', '</Ol>', '<Dl>', '</Dl>', '<Ul>','</Ul>']
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-cased', do_basic_tokenize=False)
bert_tokenizer.add_tokens(html_tags)

14

### Testing

In [0]:
short_answer_train_dataset_gen = Short_answer_dataset(bert_tokenizer).get_dataset_generator_function()
ds_short_answer_train_dataset = tf.data.Dataset.from_generator(
    short_answer_train_dataset_gen,
    # must mirror the generator's output structure: ((tokens, segments, masks), label)
    output_types=((tf.int64, tf.int64, tf.int64), tf.int64)
)

#for a in ds_short_answer_train_dataset.take(1):
#  print(a)

#for a in ds_short_answer_train_dataset.padded_batch(10, padded_shapes=(([None], [None], [None]), [])).take(10):
#  print(a)

### Possible Improvements

1. Use `jsonlines` package

# Short Answer Identifier

![short-answer-identificator](https://github.com/cyyeh/kaggle/blob/master/google-qa/short_answer_identificator.png?raw=true)

## Long Answer Encoder

See the "Prepare Short Answer Dataset" section.

## Short Answer Binary Classifier

## Short Answer Null Classifier