<a href="https://colab.research.google.com/github/abisubramanya27/ChAII-docker/blob/master/ChAII_COPY.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Environment setup: see https://towardsdatascience.com/conda-google-colab-75f7c867a522 for detailed instructions.
Before starting, set the runtime type to GPU.

# ChAII starter notebook

This is a starter notebook for running a baseline mBERT model on the task. This is a standalone notebook which will allow you to do the following:
1. Train an mBERT model on the ChAII data (utilizing Colab GPUs), 
2. Get dev evaluation numbers, 
3. Generate a submission for the Kaggle leaderboard with the appropriate format.

This notebook uses the [Xtreme](https://github.com/google-research/xtreme) codebase for training QA-finetuned models. Feel free to run your own scripts/variants locally, experiment with other models and pipelines, not necessarily limited to Xtreme. Here are some caveats of this method:
1. Since runtimes (GPU/TPU) are re-allocated when the notebook is idle, the Anaconda and dependency installations need to be rerun every time the notebook reconnects.
2. When the runtime is disconnected, you lose all the files stored on it. You therefore need to mount your Google Drive and keep the relevant code and data there. Upon reconnection, you can simply remount the Drive to access the data.
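
As a minimal sketch of the remounting step described above, the snippet below mounts Google Drive when it runs inside Colab and does nothing elsewhere. `mount_drive_if_colab` is a hypothetical helper name introduced for illustration; `/content/drive` is the usual Colab mount point and matches the paths used later in this notebook.

```python
def mount_drive_if_colab(mount_point="/content/drive"):
    """Mount Google Drive when running on Colab; return True if mounted."""
    try:
        # google.colab is only importable inside a Colab runtime.
        from google.colab import drive
    except ImportError:
        print("Not running on Colab; skipping Drive mount.")
        return False
    drive.mount(mount_point)  # asks for authorization on first use
    return True

mounted = mount_drive_if_colab()
```

Running this cell first after every reconnect makes `/content/drive/MyDrive/` available again before any of the later cells touch it.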

Given the above conditions, we encourage you to keep a local Anaconda installation and a local clone of the Xtreme codebase (with the changes mentioned below), use this notebook for GPU training (and inference), and run evaluations locally.

## Environment + Xtreme Setup

These steps set up Anaconda (Miniconda) on Colab, and set up the Xtreme Github codebase. 

**BEFORE YOU BEGIN:** Ensure the runtime type is set to GPU. This set of cells has to be rerun every time you disconnect or change the runtime.
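
One way to confirm that a GPU runtime is active before spending time on the setup is sketched below. It assumes the NVIDIA driver tools are on the `PATH` (as is typical on GPU runtimes) and simply asks `nvidia-smi` to list devices; `gpu_available` is a hypothetical helper, not part of Xtreme.

```python
import shutil
import subprocess

def gpu_available():
    """Return True if nvidia-smi exists and lists at least one GPU."""
    if shutil.which("nvidia-smi") is None:
        return False
    result = subprocess.run(
        ["nvidia-smi", "--list-gpus"], capture_output=True, text=True
    )
    return result.returncode == 0 and bool(result.stdout.strip())

if gpu_available():
    print("GPU runtime detected")
else:
    print("No GPU found: set the runtime type to GPU before training")
```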

Colab navigation: With Colab, you can access your folders and files with the upper-left icon. Your files will be stored in ```/content/```. 

Useful links:
1. Mounting your Google Drive, Cloud Storage: [link](https://colab.sandbox.google.com/notebooks/io.ipynb#scrollTo=eikfzi8ZT_rW)
2. Setting up Miniconda on Colab: [link](https://towardsdatascience.com/conda-google-colab-75f7c867a522)
3. Xtreme Github Repo: [link](https://github.com/google-research/xtreme)


### Anaconda (Miniconda) Setup

See [this link](https://towardsdatascience.com/conda-google-colab-75f7c867a522) for more details. 

In [None]:
# Verify PYTHONPATH is blank to avoid problems later
!echo $PYTHONPATH # should return <blank>

/env/python


In [None]:
# Verify that the Miniconda installation and update worked

!which conda # should return /usr/local/bin/conda
!conda --version # should return 4.10.3
!which python # should return /usr/local/bin/python
!python --version # should return Python 3.6.13 :: Anaconda, Inc.

conda 4.10.3
Python 3.7.10


In [None]:
# Overview of path files

import sys
sys.path

['',
 '/content',
 '/env/python',
 '/usr/lib/python37.zip',
 '/usr/lib/python3.7',
 '/usr/lib/python3.7/lib-dynload',
 '/usr/local/lib/python3.7/dist-packages',
 '/usr/lib/python3/dist-packages',
 '/usr/local/lib/python3.7/dist-packages/IPython/extensions',
 '/root/.ipython']

In [None]:
# View different installed packages

!ls /usr/local/lib/python3.7/dist-packages

absl
absl_py-0.12.0.dist-info
alabaster
alabaster-0.7.12.dist-info
albumentations
albumentations-0.1.12.dist-info
altair
altair-4.1.0.dist-info
apiclient
appdirs-1.4.4.dist-info
appdirs.py
apt
apt_inst.cpython-37m-x86_64-linux-gnu.so
apt_inst.pyi
apt_pkg.cpython-37m-x86_64-linux-gnu.so
apt_pkg.pyi
aptsources
argon2
argon2_cffi-20.1.0.dist-info
arviz
arviz-0.11.2.dist-info
astor
astor-0.8.1.dist-info
astropy
astropy-4.2.1.dist-info
astunparse
astunparse-1.6.3.dist-info
async_generator
async_generator-1.10.dist-info
atari_py
atari_py-0.2.9.dist-info
atari_py.libs
atomicwrites
atomicwrites-1.4.0.dist-info
attr
attrs-21.2.0.dist-info
audioread
audioread-2.1.9.dist-info
autograd
autograd-1.3.dist-info
babel
Babel-2.9.1.dist-info
backcall
backcall-0.2.0.dist-info
beautifulsoup4-4.6.3.dist-info
bin
bleach
bleach-3.3.0.dist-info
blis
blis-0.4.1.dist-info
bokeh
bokeh-2.3.3.dist-info
bottleneck
Bottleneck-1.3.2.dist-info
branca
branca-0.4.2.dist-info
bs4
bs4-0.0.1.dist-info
bson
cachecontrol
Cac

In [None]:
# Appending to sys path. This is where your installs will be located

import sys
sys.path.append("/usr/local/lib/python3.7/site-packages")

In [None]:
# To install any packages, run a command similar to the one below, pip also works
!conda install --channel conda-forge featuretools --yes

Collecting package metadata (current_repodata.json): done
Solving environment: 
The environment is inconsistent, please check the package plan carefully
The following packages are causing the inconsistency:

  - defaults/linux-64::asn1crypto==0.24.0=py36_0
done

## Package Plan ##

  environment location: /usr/local

  added / updated specs:
    - featuretools


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    asn1crypto-1.4.0   

### Xtreme codebase setup

Now, we will set up the Xtreme repo ([link](https://github.com/google-research/xtreme)). The below cells do the following:
1. Clone Xtreme
2. Create a Conda env called ```xtreme``` and install dependencies into it.

In [None]:
%%bash
cd drive/MyDrive/ # Optional but recommended as we will be modifying Xtreme code below
git clone https://github.com/google-research/xtreme.git

bash: line 1: cd: drive/MyDrive/: No such file or directory
Cloning into 'xtreme'...


This cell below is a modified version of ```xtreme/install_tools.sh```. 

**Note:** Those who are using the Xtreme repo locally may also encounter errors with the original script, for example with ```conda activate```. You can copy-paste this script to resolve the error.

In [None]:
%%bash
# First, we need to install the required dependencies. Instead of running the repo's install_tools.sh, run this cell, which has a few minor modifications. This may take a few minutes to run.
cd drive/MyDrive/ # Optional but recommended
cd xtreme/
# Copyright 2020 Google and DeepMind.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

set +eux  # for easier debugging

REPO=$PWD
LIB=$REPO/third_party
mkdir -p $LIB

# install conda env
conda create --yes --name xtreme --file conda-env.txt 
conda init bash
source activate xtreme
set -eux

# install latest transformer
cd $LIB
rm -rf transformers
git clone https://github.com/huggingface/transformers
cd transformers
git checkout cefd51c50cc08be8146c1151544495968ce8f2ad --force
pip install .
cd $LIB

pip install seqeval
pip install tensorboardx
pip install tqdm

# install XLM tokenizer
pip install sacremoses
pip install pythainlp
pip install jieba

#git clone https://github.com/neubig/kytea.git && cd kytea
#./configure --prefix=${CONDA_PREFIX}
#make && make install
pip install kytea


Downloading and Extracting Packages
blas-1.0             | ########## | 100%
ca-certificates-2020 | ########## | 100%
cudatoolkit-10.0.130 | #####6     |  56%

bash: line 1: cd: drive/MyDrive/: No such file or directory
+ cd /content/xtreme/third_party
+ rm -rf transformers
+ git clone https://github.com/huggingface/transformers
Cloning into 'transformers'...
+ cd transformers
+ git checkout cefd51c50cc08be8146c1151544495968ce8f2ad --force
Note: checking out 'cefd51c50cc08be8146c1151544495968ce8f2ad'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:

  git checkout -b <new-branch-name>

HEAD is now at cefd51c50 Fix glue processor failing on tf datasets
+ pip install .
+ cd /content/xtreme/third_party
+ pip install seqeval
+ pip install tensorboardx
+ pip install tqdm
+ pip install sacremoses
+ pip install pythainlp
+ pip install ji

## Training

Although this codebase can be used for many varieties of training methods and experiments, we will only train a straightforward baseline. We will create a monolingual Hindi QA model. We encourage you to read and experiment with the Xtreme codebase, and also with other repos. Some promising avenues:

* Train model on both Hindi and Tamil ChAII data,
* Multi-task learning with Xtreme,
* Annotate your own data into a QA format and augment training,
* Zero-shot transfer learning

The cells below do the following:

1. Convert the given ChAII data (from competition) to QA (SQuAD) format, split into train and dev sets.
2. Finetune mBERT (bert-base-multilingual-cased) on the ChAII data.
3. Save the model and dev predictions into GDrive for evaluation later

### Data conversion to QA format

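The cell below converts the ChAII data from the TyDiQA format to the SQuAD QA format so that it can be used with the Xtreme pipeline. Download the ChAII data from Kaggle first, then upload it either to your Google Drive or directly to the Colab runtime. Note that the cells below are set up for the Tamil split (```train.ta.simple.jsonl```, language ```tamil```, outputs ```train.ta.qa.jsonl``` and ```dev.ta.qa.jsonl```), whereas the training configuration later in this notebook reads ```train.hi.qa.jsonl``` and ```dev.hi.qa.jsonl```; adjust the file names and the language filter to match the language you train on.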
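
The conversion has to bridge one difference between the two formats: TyDiQA marks answer spans with UTF-8 byte offsets (`plaintext_start_byte`, `plaintext_end_byte`), while SQuAD's `answer_start` is a character offset. For Hindi and Tamil the two differ, because most characters occupy several bytes. The sketch below illustrates this on a made-up sentence; `byte_to_char_offset` is a hypothetical helper, not part of this notebook's conversion code or of Xtreme.

```python
def byte_to_char_offset(text, byte_offset):
    """Number of characters that precede a UTF-8 byte offset in text."""
    prefix = text.encode("utf-8")[:byte_offset]
    return len(prefix.decode("utf-8", errors="ignore"))

context = "भारत की राजधानी नई दिल्ली है"  # toy sentence, not ChAII data
answer = "नई दिल्ली"

start_char = context.index(answer)                      # SQuAD-style offset
start_byte = len(context[:start_char].encode("utf-8"))  # TyDiQA-style offset

# The byte offset is larger than the character offset for this text.
print(start_char, start_byte)
assert byte_to_char_offset(context, start_byte) == start_char
```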

In [None]:
# Convert TyDiQA format to a QA format

import os
import collections
import functools
import glob
import json
import random
from pprint import pprint
from tqdm import tqdm

data_path = "/content/drive/MyDrive/chaii_data/"
json_dicts = []

with open(os.path.join(data_path,"train.ta.simple.jsonl"),'r') as f:
  for line in tqdm(f):
    json_dict = json.loads(line)
    json_dicts.append(json_dict)

print("TyDiQA format:")
pprint(json_dicts[0])

FileNotFoundError: ignored

In [None]:
print("Total number of datapoints: %d" % len(json_dicts))

In [None]:
# Split datapoints language-wise and into QA format
import re

from pprint import pprint

def byte_str(text):
  return text.encode("utf-8")

def byte_len(text):
  # Python 3 encodes text as character sequences, not byte sequences
  # (like Python 2).
  return len(byte_str(text))

def byte_slice(text, start, end, errors="replace"):
  # Python 3 encodes text as character sequences, not byte sequences
  # (like Python 2).
  return byte_str(text)[start:end].decode("utf-8", errors=errors)

def convert_to_qa_format(tydi_json):
  answer = {}
  for annotation in tydi_json["annotations"]:
    minimal_answer = annotation["minimal_answer"]
    if minimal_answer["plaintext_start_byte"] != -1 and minimal_answer["plaintext_end_byte"] != -1:
      answer["text"] = byte_slice(tydi_json["document_plaintext"],minimal_answer["plaintext_start_byte"],minimal_answer["plaintext_end_byte"])
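      # Literal substring search for the first occurrence; re.finditer would
      # misinterpret answers containing regex metacharacters such as "(" or "."
      answer["answer_start"] = tydi_json["document_plaintext"].index(answer["text"])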
      break
  if answer == {}:
    return {}
  
  qa_json = {
      "title" : tydi_json["document_title"],
      "paragraphs" : [
                      {
                          "context": tydi_json["document_plaintext"],
                          "qas" : [
                                   {
                                    "question" : tydi_json["question_text"],
                                    "id" : tydi_json["language"] + '-' + str(tydi_json["example_id"]),
                                    "answers" : [answer],
                                   }
                          ]
                      }
      ],
  }

  return qa_json

  
language_list = [
       'tamil',
  ]


qa_data = {"data":[], "version":f"TyDiQA_chaii_ta"}
for json_dict in json_dicts:
  if json_dict["language"] in language_list:
    qa_datapoint = convert_to_qa_format(json_dict)
    if qa_datapoint != {}:
      qa_data["data"].append(qa_datapoint)

print("QA (SQuAD) format:")
pprint(qa_data["data"][0])
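
Before training, it can be worth checking that every converted answer really sits at its `answer_start` position in the context. The following is a small sanity-check sketch on toy data; `find_misaligned_answers` is a hypothetical helper that only assumes the SQuAD layout produced above (`data` → `paragraphs` → `qas` → `answers`).

```python
def find_misaligned_answers(squad_data):
    """Return ids of questions whose answer text is not found at answer_start."""
    bad_ids = []
    for article in squad_data["data"]:
        for para in article["paragraphs"]:
            context = para["context"]
            for qa in para["qas"]:
                for ans in qa["answers"]:
                    start = ans["answer_start"]
                    if context[start:start + len(ans["text"])] != ans["text"]:
                        bad_ids.append(qa["id"])
    return bad_ids

# Toy data: the second answer points at the wrong offset on purpose.
toy = {"data": [{"title": "t", "paragraphs": [{"context": "abc def ghi", "qas": [
    {"id": "ok-1", "question": "?", "answers": [{"text": "def", "answer_start": 4}]},
    {"id": "bad-1", "question": "?", "answers": [{"text": "ghi", "answer_start": 4}]},
]}]}]}

print(find_misaligned_answers(toy))
```

Applied to the notebook's own data, `find_misaligned_answers(qa_data)` would ideally return an empty list.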

In [None]:
# Splitting data into train and dev and saving converted QA formats

import random

qa_data_datapoints = qa_data["data"]
random.shuffle(qa_data_datapoints)
train_size = int(len(qa_data_datapoints)*0.8)
train_qa_data_datapoints, dev_qa_data_datapoints = qa_data_datapoints[:train_size], qa_data_datapoints[train_size:]

train_qa_data = {"data":train_qa_data_datapoints, "version":f"TyDiQA_chaii_ta_train"}
dev_qa_data = {"data":dev_qa_data_datapoints, "version":f"TyDiQA_chaii_ta_dev"}

with open(os.path.join(data_path,"train.ta.qa.jsonl"),'w') as f:
  json.dump(train_qa_data,f)

with open(os.path.join(data_path,"dev.ta.qa.jsonl"),'w') as f:
  json.dump(dev_qa_data,f)

print("Training data size: %d" % len(train_qa_data_datapoints))
print("Dev data size: %d" % len(dev_qa_data_datapoints))
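
Because `random.shuffle` is unseeded in the cell above, each rerun produces a different train/dev split and overwrites the saved files with it. If you want dev numbers that are comparable across sessions, a seeded split is one option. The sketch below is an assumption-laden variant, not the baseline's method: the function name and the seed value are arbitrary.

```python
import random

def split_train_dev(datapoints, train_fraction=0.8, seed=13):
    """Shuffle a copy of datapoints with a fixed seed and split it."""
    rng = random.Random(seed)   # local generator; global state is untouched
    shuffled = list(datapoints)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

train, dev = split_train_dev(range(10))
print(len(train), len(dev))
```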

The cell below is optional (we have not used it for our baseline model), but it downloads the original TyDiQA data in the QA format. You can combine it with our ChAII data and boost training!

In [None]:
%%bash
# Download the data. For the baseline, we can ignore the training datasets for other tasks and focus on training with just the TyDiQA data.
source activate xtreme
cd drive/MyDrive/ # Optional but recommended
cd xtreme/
# Copyright 2020 Google and DeepMind.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

REPO=$PWD
DIR=$REPO/download/
mkdir -p $DIR

function download_tydiqa {
    echo "download tydiqa-goldp"
    base_dir=$DIR/tydiqa/
    mkdir -p $base_dir && cd $base_dir
    tydiqa_train_file=tydiqa-goldp-v1.1-train.json
    tydiqa_dev_file=tydiqa-goldp-v1.1-dev.tgz
    wget https://storage.googleapis.com/tydiqa/v1.1/${tydiqa_train_file} -q --show-progress
    wget https://storage.googleapis.com/tydiqa/v1.1/${tydiqa_dev_file} -q --show-progress
    tar -xf ${tydiqa_dev_file}
    rm ${tydiqa_dev_file}
    out_dir=$base_dir/tydiqa-goldp-v1.1-train
    python $REPO/utils_preprocess.py --data_dir $base_dir --output_dir $out_dir --task tydiqa
    mv $base_dir/$tydiqa_train_file $out_dir/
    echo "Successfully downloaded data at $DIR/tydiqa" >> $DIR/download.log
}

download_tydiqa
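
If you do want to combine the downloaded TyDiQA data with the converted ChAII data, both are SQuAD-style JSON files with a top-level `data` list, so concatenating those lists is a reasonable starting point. The sketch below demonstrates this on two temporary toy files; `merge_squad_files` is a hypothetical helper, and whether mixing the datasets helps is something to verify on your dev set.

```python
import json
import os
import tempfile

def merge_squad_files(paths, version="merged"):
    """Concatenate the 'data' lists of several SQuAD-format JSON files."""
    merged = {"data": [], "version": version}
    for path in paths:
        with open(path, encoding="utf-8") as f:
            merged["data"].extend(json.load(f)["data"])
    return merged

# Demo on two tiny stand-in files.
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for name, title in [("a.json", "doc A"), ("b.json", "doc B")]:
        path = os.path.join(tmp, name)
        with open(path, "w", encoding="utf-8") as f:
            json.dump({"data": [{"title": title, "paragraphs": []}],
                       "version": name}, f)
        paths.append(path)
    merged = merge_squad_files(paths)

print([article["title"] for article in merged["data"]])
```

In the notebook, `paths` would instead point at the converted ChAII training file and the `tydiqa-goldp-v1.1-train.json` file placed under `download/tydiqa/` by the cell above.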

### Training mBERT on Hindi ChAII data

The cell below uses the Xtreme training scripts to fine-tune the model. To train on the ChAII data, a few scripts in the repo need to be modified. In Colab, you can double-click a file in the file browser to open and edit it.

For the baseline, the following changes were made to the Xtreme repo code:


1.   In ```scripts/train.sh```, an additional task called "chaii_hi" was added as follows:
```
...
elif [ $TASK == 'chaii_hi' ]; then
  bash $REPO/scripts/train_qa.sh $MODEL chaii_hi $TASK $GPU $DATA_DIR $OUT_DIR
...
```
2.   In ```scripts/train_qa.sh```, the following flags were added:
```
TRAIN_LANG="en"
EVAL_LANG="en"
```
Another `elif` branch was added to set the data paths and languages:
```
...
elif [ $SRC == 'chaii_hi' ]; then
  TASK_DATA_DIR="/content/drive/MyDrive/chaii_data"
  TRAIN_FILE=${TASK_DATA_DIR}/train.hi.qa.jsonl
  PREDICT_FILE=${TASK_DATA_DIR}/dev.hi.qa.jsonl
  TRAIN_LANG="hi"
  EVAL_LANG="hi"
...
```
Finally, TRAIN_LANG and EVAL_LANG replaced the hardcoded "en":
```
 --weight_decay 0.0001 \
  --threads 8 \
  --train_lang ${TRAIN_LANG} \
  --eval_lang ${EVAL_LANG}
```

Since you may be making your own changes for experimentation, it is HIGHLY RECOMMENDED to clone the Xtreme repo into your GDrive.

Finally, we create a run.sh script in the current root directory (```/content```), and paste the following commands:

```
#!/bin/bash

source activate xtreme
cd drive/MyDrive/ # Optional but recommended
cd xtreme
bash scripts/train.sh bert-base-multilingual-cased chaii_hi
```

Your model should be stored in ```xtreme/outputs-temp/```.
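
Instead of creating `run.sh` by hand in the file browser, the file can also be written from a notebook cell. The sketch below writes the same commands as above to `run.sh` in the current working directory (`/content` on Colab); `write_script` is a hypothetical helper.

```python
from pathlib import Path

RUN_SH = """#!/bin/bash

source activate xtreme
cd drive/MyDrive/ # Optional but recommended
cd xtreme
bash scripts/train.sh bert-base-multilingual-cased chaii_hi
"""

def write_script(path, content):
    """Write a shell script to path and return the resulting Path."""
    path = Path(path)
    path.write_text(content)
    return path

script_path = write_script("run.sh", RUN_SH)
print(f"Wrote {script_path}")
```

Since these files live on the runtime rather than on Drive, they have to be recreated after every reconnect, which is where a cell like this can save some clicking.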

In [None]:
# Now that the data is downloaded, you can run the training script directly from the repo. The simplest way is to create a new file called run.sh in the working directory (/content), paste in the commands below, and then run this cell.
# Also ensure that your runtime type is set to GPU for training.
'''
#!/bin/bash

source activate xtreme
cd drive/MyDrive/
cd xtreme
bash scripts/train.sh bert-base-multilingual-cased chaii_hi
'''

!pip install ipython-autotime
%load_ext autotime

!bash run.sh

## Inference and Evaluation

For inference, we make the following modification to the Xtreme repo:
1. In ```predict_qa.sh```, add the following (line 40):
```
elif [ $TGT == 'chaii_hi' ]; then
  langs=( hi )
```


Also, we create a bash file (similar to ```run.sh```) called ```predict.sh```, and copy the commands below into it:

```
#!/bin/bash

source activate xtreme
cd drive/MyDrive/
cd xtreme

MODEL_PATH="/content/drive/MyDrive/xtreme/outputs-temp/chaii_hi/bert-base-multilingual-cased_LR3e-5_EPOCH2.0_maxlen384"
GPU=0
DATA_DIR="/content/drive/MyDrive/"

bash scripts/predict_qa.sh bert-base-multilingual-cased bert chaii_hi $MODEL_PATH chaii_hi $GPU $DATA_DIR
```

In [None]:
# First, you need to run inference on the models

!bash predict.sh

In [None]:
%%bash
# Before local evaluation can start, the files need to be arranged in a particular layout. This cell copies the predictions and labels into place.
# Copy dev data

cd drive/MyDrive # Optional but recommended

TASK_NAME="chaii_hi"
mkdir -p eval_dir/predictions/$TASK_NAME/
mkdir -p eval_dir/labels/$TASK_NAME/

GOLD_DATA_LOCATION="/content/drive/MyDrive/chaii_data/"

cp $GOLD_DATA_LOCATION/* eval_dir/labels/$TASK_NAME/
for file in eval_dir/labels/$TASK_NAME/dev.*.jsonl; do
filename=$(basename "$file")
fname="${filename%.*}"
lg=$(echo $fname | cut -d"." -f 2)
mv $file eval_dir/labels/$TASK_NAME/test-$lg.json
done

cp xtreme/predictions/$TASK_NAME/predictions* eval_dir/predictions/$TASK_NAME/
for file in eval_dir/predictions/$TASK_NAME/predictions*.json; do
filename=$(basename "$file")
fname="${filename%.*}"
lg=$(echo $fname | cut -d"_" -f 2)
mv $file eval_dir/predictions/$TASK_NAME/test-$lg.json
done

In [None]:
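%%bash
# Evaluation
# Note: the paths below assume xtreme/ and eval_dir/ sit directly under /content;
# adjust them if you worked under drive/MyDrive/ in the cells above.
source activate xtreme
cd xtreme/
python evaluate.py --prediction_folder /content/eval_dir/predictions --label_folder /content/eval_dir/labels/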

In [None]:
import json
with open("/content/drive/MyDrive/xtreme/outputs-temp/chaii_hi/bert-base-multilingual-cased_LR3e-5_EPOCH2.0_maxlen384/predictions_hi_.json") as f:
  preds = json.load(f)

with open("/content/drive/MyDrive/chaii_data/dev.hi.qa.jsonl") as f:
  dev_data = json.load(f)

dev_data

In [None]:
from pprint import pprint
dev_answer_pair_matches = []
for d in dev_data['data']:
  for para in d['paragraphs']:
    for qa in para['qas']:
      dev_answer_pair_matches.append({
          'context': para['context'],
          'question': qa['question'],
          'gold_answer': qa['answers'],
          'mbert_pred': preds[qa['id']],
          'id': qa['id'],
      })

In [None]:
pprint(dev_answer_pair_matches[2])

In [None]:
#Matches in predictions
correct_ans = [d for d in dev_answer_pair_matches if d['mbert_pred']==d['gold_answer'][0]['text']]
with open('/content/drive/MyDrive/correct_chaii_hi_mbert.txt','w',encoding='utf-8') as f:
  for c in correct_ans:
    f.write(f"id:{c['id']}\n")
    f.write(f"context:{c['context']}\n")
    f.write(f"question:{c['question']}\n")
    f.write(f"gold_answer:{c['gold_answer'][0]['text']}\n")
    f.write(f"mbert_pred:{c['mbert_pred']}\n")
    f.write("\n\n")


In [None]:
#Mismatches in predictions
wrong_ans = [d for d in dev_answer_pair_matches if d['mbert_pred']!=d['gold_answer'][0]['text']]
with open('/content/drive/MyDrive/wrong_chaii_hi_mbert.txt','w',encoding='utf-8') as f:
  for c in wrong_ans:
    f.write(f"id:{c['id']}\n")
    f.write(f"context:{c['context']}\n")
    f.write(f"question:{c['question']}\n")
    f.write(f"gold_answer:{c['gold_answer'][0]['text']}\n")
    f.write(f"mbert_pred:{c['mbert_pred']}\n")
    f.write("\n\n")

In [None]:
len(correct_ans),len(wrong_ans)
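
The cells above count only exact string matches, which treats a nearly correct span the same as a completely wrong one. A softer overlap score can make the error analysis more informative. The sketch below computes the exact-match rate alongside a word-level Jaccard overlap on toy strings; Jaccard is used here purely as an illustration, so check the competition page for the official metric before relying on it.

```python
def exact_match(pred, gold):
    """1.0 if the stripped strings are identical, else 0.0."""
    return float(pred.strip() == gold.strip())

def jaccard(pred, gold):
    """Word-level Jaccard overlap between two whitespace-tokenized strings."""
    a, b = set(pred.lower().split()), set(gold.lower().split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Toy (prediction, gold) pairs standing in for real model output.
pairs = [("new delhi", "new delhi"), ("delhi", "new delhi"), ("mumbai", "new delhi")]

mean_em = sum(exact_match(p, g) for p, g in pairs) / len(pairs)
mean_jaccard = sum(jaccard(p, g) for p, g in pairs) / len(pairs)
print(f"exact match: {mean_em:.3f}, jaccard: {mean_jaccard:.3f}")
```

With the structures built earlier, the same functions could be applied to each entry as `jaccard(d['mbert_pred'], d['gold_answer'][0]['text'])`.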