<a href="https://colab.research.google.com/github/cyyeh/kaggle/blob/master/google-qa/google_qa.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Import Libraries

In [1]:
try:
  %tensorflow_version 2.x
except Exception:
  pass

TensorFlow 2.x selected.


In [2]:
!pip install transformers # BertModel



In [0]:
import tensorflow as tf
import json
import os

# Prepare Kaggle Dataset for [TensorFlow 2.0 Question Answering](https://www.kaggle.com/c/tensorflow2-question-answering)

Check if files are existed first, if not, download dataset from Kaggle


In [48]:
def are_train_and_test_data_existed():
  nq_test_data_file = 'simplified-nq-test.jsonl'
  nq_train_data_file = 'simplified-nq-train.jsonl'
  return os.path.exists(nq_test_data_file) and os.path.exists(nq_train_data_file)

def are_train_and_test_zip_existed():
  nq_test_data_zip = 'simplified-nq-test.jsonl.zip'
  nq_train_data_zip = 'simplified-nq-train.jsonl.zip'
  return os.path.exists(nq_test_data_zip) and os.path.exists(nq_train_data_zip)

if are_train_and_test_data_existed():
  print(f"Question Answering dataset is ready!")
elif are_train_and_test_zip_existed():
  print(f"Question Answering dataset is found in zip format, unzip them first!")
else:
  print(f"Question Answering dataset is not found!")

Question Answering dataset is ready!


### Download Question Answering Dataset

In [5]:
import os
os.environ['KAGGLE_USERNAME'] = "chihyuyeh" # username from the json file
os.environ['KAGGLE_KEY'] = "f21b340fc8082977cbf954c80ad69ae1" # key from the json file
!kaggle competitions download -c tensorflow2-question-answering

Downloading sample_submission.csv to /content
  0% 0.00/18.2k [00:00<?, ?B/s]
100% 18.2k/18.2k [00:00<00:00, 15.7MB/s]
Downloading simplified-nq-train.jsonl.zip to /content
100% 4.46G/4.46G [01:30<00:00, 82.9MB/s]
100% 4.46G/4.46G [01:30<00:00, 53.1MB/s]
Downloading simplified-nq-test.jsonl.zip to /content
100% 4.78M/4.78M [00:00<00:00, 16.9MB/s]



### Unzip Question Answering Dataset

In [6]:
!unzip simplified-nq-train.jsonl.zip
!unzip simplified-nq-test.jsonl.zip

Archive:  simplified-nq-train.jsonl.zip
  inflating: simplified-nq-train.jsonl  
Archive:  simplified-nq-test.jsonl.zip
  inflating: simplified-nq-test.jsonl  


# Prepare Short Answer Dataset

Since a short answer exists only if a long answer exists, so we will remove cases where long answers don't exist first.

For long answers that exist, there are several cases of short answers:
1. YES/NO
2. a sentence or phrase
3. no short answer

For long answers that don't exist, `start_token`, `candidate_index`, and `end_token` are all -1 in `annotations` of `simplified-nq-train.jsonl`

In [43]:
class Short_answer_dataset():
  def __init__(self, data_path='simplified-nq-train.jsonl'):
    self.data_path = data_path
    self._start_index = 0


  def get_dataset(self):
    return self._create_train_data_generator()


  def _create_train_data_generator(self):
    with open(self.data_path, 'r') as train_data:
      _temp_start_index = 0
      for instance in train_data:
        if _temp_start_index == self._start_index:
          self._start_index += 1
          nq_train_json = json.loads(instance)
          if self._is_long_answer_existed(nq_train_json['annotations'][0]):
            yield nq_train_json
        else:
          _temp_start_index += 1


  def _is_long_answer_existed(self, annotations):
    long_answer = annotations['long_answer']
    return long_answer['start_token'] == -1 \
    and long_answer['candidate_index'] == -1 \
    and long_answer['end_token'] == -1    

{'document_text': 'Roanoke Colony - wikipedia <H1> Roanoke Colony </H1> Jump to : navigation , search `` Lost Colony \'\' redirects here . For other uses , see Lost Colony ( disambiguation ) . <Table> <Tr> <Td> </Td> <Td> This article \'s lead section does not adequately summarize key points of its contents . Please consider expanding the lead to provide an accessible overview of all important aspects of the article . Please discuss this issue on the article \'s talk page . ( March 2018 ) </Td> </Tr> </Table> <Table> <Tr> <Td_colspan="3"> Roanoke Colony </Td> </Tr> <Tr> <Td_colspan="3"> Colony of England </Td> </Tr> <Tr> <Td_colspan="3"> <Table> <Tr> <Td> </Td> <Td> 1585 -- c. 1590 </Td> <Td> </Td> </Tr> </Table> </Td> </Tr> <Tr> <Td_colspan="3"> Virginea Pars map , drawn by John White during his initial visit in 1585 . Roanoke is the small pink island in the middle right of the map . </Td> </Tr> <Tr> <Td_colspan="2"> History </Td> <Td> </Td> </Tr> <Tr> <Td> </Td> <Td> Established </Td

### Testing

In [44]:
short_answer_dataset = Short_answer_dataset()
for i in range(5):
  print(next(short_answer_dataset.get_dataset()))

{'document_text': 'Roanoke Colony - wikipedia <H1> Roanoke Colony </H1> Jump to : navigation , search `` Lost Colony \'\' redirects here . For other uses , see Lost Colony ( disambiguation ) . <Table> <Tr> <Td> </Td> <Td> This article \'s lead section does not adequately summarize key points of its contents . Please consider expanding the lead to provide an accessible overview of all important aspects of the article . Please discuss this issue on the article \'s talk page . ( March 2018 ) </Td> </Tr> </Table> <Table> <Tr> <Td_colspan="3"> Roanoke Colony </Td> </Tr> <Tr> <Td_colspan="3"> Colony of England </Td> </Tr> <Tr> <Td_colspan="3"> <Table> <Tr> <Td> </Td> <Td> 1585 -- c. 1590 </Td> <Td> </Td> </Tr> </Table> </Td> </Tr> <Tr> <Td_colspan="3"> Virginea Pars map , drawn by John White during his initial visit in 1585 . Roanoke is the small pink island in the middle right of the map . </Td> </Tr> <Tr> <Td_colspan="2"> History </Td> <Td> </Td> </Tr> <Tr> <Td> </Td> <Td> Established </Td

# Short Answer Idenfiticator

In [0]:
import tensorflow as tf
from transformers import BertTokenizer, TFBertModel



class Encoder():
  def __init__(self):
    self.tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
    self.model = TFBertModel.from_pretrained('bert-base-uncased')
  
  def encode(self, sentence1, sentence2=None):
    input_ids = tf.constant(self.tokenizer.encode(sentence1, sentence2, add_special_tokens=True))[None, :]  # Batch size 1

    return self.model(input_ids, )

dataset = tf.data.Dataset.from_tensor_slices(['8', '3', '0', '8', '2', '1'])
question_encoder = Encoder()
for elem in dataset:
  print(elem)
  outputs = question_encoder.encode(elem)
  print(outputs[0])

tf.Tensor(b'8', shape=(), dtype=string)


ValueError: ignored