# Question Answering (QA) on the TyDi QA dataset

TyDi QA is a multilingual question answering dataset covering 11 typologically diverse languages. In this notebook we try to reproduce the CANINE authors' results using the [official](https://github.com/google-research/language/tree/master/language/canine) TensorFlow implementation of CANINE.

In [None]:
!git clone --quiet https://github.com/google-research/language.git

In [None]:
!pip3 install --upgrade tensorflow-gpu &> /dev/null
!pip3 install absl-py &> /dev/null
!pip3 install tf-slim &> /dev/null

In [None]:
!apt install --allow-change-held-packages libcudnn8=8.1.0.77-1+cuda11.2

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following packages will be REMOVED:
  libcudnn8-dev
The following held packages will be changed:
  libcudnn8
The following packages will be upgraded:
  libcudnn8
1 upgraded, 0 newly installed, 1 to remove and 37 not upgraded.
Need to get 430 MB of archives.
After this operation, 3,139 MB disk space will be freed.
Get:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  libcudnn8 8.1.0.77-1+cuda11.2 [430 MB]
Fetched 430 MB in 8s (53.8 MB/s)
(Reading database ... 155335 files and directories currently installed.)
Removing libcudnn8-dev (8.0.5.39-1+cuda11.1) ...
(Reading database ... 155313 files and directories currently installed.)
Preparing to unpack .../libcudnn8_8.1.0.77-1+cuda11.2_amd64.deb ...
Unpacking libcudnn8 (8.1.0.77-1+cuda11.2) over (8.0.5.39-1+cuda11.1) ...
Setting up libcudnn8 (8.1.0.77-1+cuda11.2) ...


## Get data

Instructions: https://github.com/google-research-datasets/tydiqa#download-the-dataset

In [None]:
!wget https://storage.googleapis.com/tydiqa/v1.0/tydiqa-v1.0-dev.jsonl.gz
!wget https://storage.googleapis.com/tydiqa/v1.0/tydiqa-v1.0-train.jsonl.gz
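Before preprocessing, it can help to check what a downloaded record looks like. The following is a minimal sketch, assuming the archives are gzipped JSON Lines files (one JSON object per line), as the `.jsonl.gz` extension suggests. `peek_jsonl_gz` is a hypothetical helper written for this notebook, not part of the TyDi QA or CANINE code.

```python
import gzip
import json
from itertools import islice


def peek_jsonl_gz(path, limit=1):
    """Return the first `limit` JSON records from a gzipped JSON Lines file."""
    # "rt" opens the archive in text mode so each line is a str.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        # islice stops after `limit` lines, so the full file is never loaded.
        return [json.loads(line) for line in islice(f, limit)]


# Usage (assumes the dev archive sits in the current working directory):
# for record in peek_jsonl_gz("tydiqa-v1.0-dev.jsonl.gz"):
#     print(sorted(record.keys()))
```

Printing only the sorted keys keeps the output short, since each record also carries the full article text.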

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Prepare data for finetuning

In [None]:
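%cd "./language/"
# Assumes the wget cell ran in Colab's default /content directory, so the
# archive is referenced by absolute path rather than relative to ./language/.
!python3 -m language.canine.tydiqa.prepare_tydi_data \
  --input_jsonl=/content/tydiqa-v1.0-dev.jsonl.gz \
  --output_tfrecord=/content/drive/MyDrive/models/dev.tfrecord \
  --max_seq_length=2048 \
  --doc_stride=512 \
  --max_question_length=256 \
  --is_training=false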

In [None]:
# Assumes the wget cell ran in Colab's default /content directory, so the
# archive is referenced by absolute path rather than relative to ./language/.
!python3 -m language.canine.tydiqa.prepare_tydi_data \
  --input_jsonl=/content/tydiqa-v1.0-train.jsonl.gz \
  --output_tfrecord=/content/drive/MyDrive/models/train_samples.tfrecord \
  --record_count_file=/content/drive/MyDrive/models/train_samples_record_count.txt \
  --max_seq_length=2048 \
  --doc_stride=512 \
  --max_question_length=256 \
  --include_unknowns=0.1 \
  --is_training=true
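The `--max_seq_length` and `--doc_stride` flags determine how a long article is cut into overlapping windows. The sketch below illustrates the sliding-window convention used by BERT-style QA preprocessing, under the assumption that the CANINE script follows the same idea. `sliding_windows` is a hypothetical helper, and the window budget of `2048 - 256` positions is a simplification that ignores special tokens.

```python
def sliding_windows(doc_len, window, stride):
    """Split a document of `doc_len` positions into (start, length) spans.

    Each span holds at most `window` positions, and consecutive spans
    start `stride` positions apart, so neighbouring spans overlap.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    spans = []
    start = 0
    while start < doc_len:
        length = min(window, doc_len - start)
        spans.append((start, length))
        if start + length == doc_len:
            break  # The last span reaches the end of the document.
        start += min(length, stride)
    return spans


# Illustrative numbers only: a 4000-position article, with the question
# budget (256) subtracted from the sequence length (2048).
for start, length in sliding_windows(doc_len=4000, window=2048 - 256, stride=512):
    print(start, start + length)
```

A smaller stride produces more overlapping windows per article, which increases the number of training records but makes it less likely that an answer is cut at a window boundary.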

## Training

In [None]:
!python3 -m language.canine.tydiqa.run_tydi \
  --model_config_file=/content/drive/MyDrive/models/canine-c/canine_config.json \
  --init_checkpoint=/content/drive/MyDrive/models/tydiqa_run5 \
  --train_records_file=/content/drive/MyDrive/models/train_samples.tfrecord \
  --record_count_file=/content/drive/MyDrive/models/train_samples_record_count.txt \
  --do_train \
  --max_seq_length=2048 \
  --train_batch_size=4 \
  --learning_rate=5e-5 \
  --num_train_epochs=5 \
  --warmup_proportion=0.1 \
  --output_dir=/content/drive/MyDrive/models/tydiqa_run6
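To get a feeling for how long this run takes, the number of optimizer steps can be estimated from the record count, batch size, and number of epochs. The sketch below uses the convention common to BERT-style training scripts; that `run_tydi` computes its schedule in exactly this way is an assumption, and `training_schedule` is a hypothetical helper.

```python
def training_schedule(num_records, batch_size, num_epochs, warmup_proportion):
    """Estimate total training steps and linear-warmup steps."""
    # One step consumes one batch; every epoch passes over all records once.
    num_train_steps = int(num_records / batch_size * num_epochs)
    # The learning rate is assumed to ramp up over the first fraction of steps.
    num_warmup_steps = int(num_train_steps * warmup_proportion)
    return num_train_steps, num_warmup_steps


# The record count here is a placeholder; the real value is written to
# train_samples_record_count.txt by the preprocessing step.
steps, warmup = training_schedule(
    num_records=100_000, batch_size=4, num_epochs=5, warmup_proportion=0.1
)
print(steps, warmup)
```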

## Predict using the finetuned model

In [None]:
!python3 -m language.canine.tydiqa.run_tydi \
  --model_config_file=/content/drive/MyDrive/models/canine-c/canine_config.json \
  --init_checkpoint=/content/drive/MyDrive/models/tydiqa_run6/ \
  --predict_file=/content/drive/MyDrive/models/tydiqa-v1.0-dev.jsonl.gz \
  --do_predict \
  --max_seq_length=2048 \
  --max_answer_length=100 \
  --candidate_beam=30 \
  --predict_batch_size=4 \
  --max_passages=45 \
  --max_position=45 \
  --doc_stride=512 \
  --max_question_length=256 \
  --include_unknowns=-1 \
  --predict_file_shard_size=1000 \
  --output_dir=/content/drive/MyDrive/models/tydiqa_run6/predict \
  --output_prediction_file=/content/drive/MyDrive/models/tydiqa_run6/predict/pred.jsonl
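The `--candidate_beam` and `--max_answer_length` flags constrain which answer spans are considered at prediction time. The sketch below shows the usual span-decoding scheme for extractive QA: keep the highest-scoring start and end positions, then pick the best valid pair. It is an illustration under that assumption, not the postprocessing code from the CANINE repository, and `best_span` is a hypothetical helper.

```python
def best_span(start_logits, end_logits, candidate_beam=30, max_answer_length=100):
    """Return (score, start, end) for the best valid span, or None."""

    def top_indices(logits):
        # Positions of the `candidate_beam` largest logits, best first.
        order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        return order[:candidate_beam]

    best = None
    for start in top_indices(start_logits):
        for end in top_indices(end_logits):
            # A span must not end before it starts or exceed the length limit.
            if end < start or end - start + 1 > max_answer_length:
                continue
            score = start_logits[start] + end_logits[end]
            if best is None or score > best[0]:
                best = (score, start, end)
    return best


# Toy logits for a four-position passage.
print(best_span([0.1, 2.0, 0.3, 0.0], [0.0, 0.5, 3.0, 1.0]))
```

A larger beam explores more start and end combinations at quadratic cost, while the length limit filters out implausibly long answers.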

## Evaluation

To evaluate our model's predictions, we follow the official guidelines, available [here](https://github.com/google-research-datasets/tydiqa).

In [None]:
!git clone --quiet https://github.com/google-research-datasets/tydiqa.git
%cd "./tydiqa/"
!python3 tydi_eval.py \
  --gold_path=/content/drive/MyDrive/models/tydiqa-v1.0-dev.jsonl.gz \
  --predictions_path=/content/drive/MyDrive/models/tydiqa_run6/predict/pred.jsonl
