<a href="https://colab.research.google.com/github/abisubramanya27/ChAII-docker/blob/master/ChAII_COPY.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Environment setup: see https://towardsdatascience.com/conda-google-colab-75f7c867a522 for detailed instructions.

Before starting, set the runtime type to GPU.

# ChAII starter notebook

This standalone starter notebook runs a baseline mBERT model on the task. It allows you to:
1. Train an mBERT model on the ChAII data (utilizing Colab GPUs),
2. Get dev evaluation numbers,
3. Generate a submission for the Kaggle leaderboard in the appropriate format.

This notebook uses the [Xtreme](https://github.com/google-research/xtreme) codebase for training QA-finetuned models. Feel free to run your own scripts/variants locally, experiment with other models and pipelines, not necessarily limited to Xtreme. 

In [None]:
# Verify PYTHONPATH is blank to avoid problems later
!echo $PYTHONPATH # should return <blank>

In [None]:
# Verify that the Miniconda installation and update worked

!which conda # should return /usr/local/bin/conda
!conda --version # should return 4.10.3
!which python # should return /usr/local/bin/python
!python --version # should return Python 3.6.13 :: Anaconda, Inc.

In [None]:
# Overview of path files

import sys
sys.path

## Training

Although this codebase can be used for many varieties of training methods and experiments, we will only train a straightforward baseline. We will create a monolingual Hindi QA model. We encourage you to read and experiment with the Xtreme codebase, and also with other repos. Some promising avenues:

* Train model on both Hindi and Tamil ChAII data,
* Multi-task learning with Xtreme,
* Annotate your own data into a QA format and augment training,
* Zero-shot transfer learning

The cells below do the following:

1. Convert the given ChAII data (from competition) to QA (SQuAD) format, split into train and dev sets.
2. Finetune mBERT (bert-base-multilingual-cased) on the ChAII data.
3. Save the model and dev predictions into GDrive for evaluation later
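
For step 1, it can help to confirm that each converted datapoint is internally consistent before training. The sketch below is a minimal check, assuming the SQuAD layout produced later in this notebook (`data` → `paragraphs` → `qas` → `answers`) with character offsets. The helper name `find_misaligned_answers` and the toy datapoint are hypothetical and not part of the Xtreme codebase.

```python
def find_misaligned_answers(qa_data):
    """Return ids of questions whose answer text is not found at answer_start.

    Hypothetical helper: assumes the SQuAD layout
    data -> paragraphs -> qas -> answers, with character offsets.
    """
    bad_ids = []
    for datapoint in qa_data["data"]:
        for para in datapoint["paragraphs"]:
            context = para["context"]
            for qa in para["qas"]:
                for ans in qa["answers"]:
                    start = ans["answer_start"]
                    end = start + len(ans["text"])
                    if context[start:end] != ans["text"]:
                        bad_ids.append(qa["id"])
                        break
    return bad_ids


# Toy datapoint (made up for illustration, not taken from the ChAII data)
toy = {
    "version": "toy",
    "data": [{
        "title": "",
        "paragraphs": [{
            "context": "दिल्ली भारत की राजधानी है।",
            "qas": [{
                "question": "भारत की राजधानी क्या है?",
                "id": "hindi-0",
                "answers": [{"text": "दिल्ली", "answer_start": 0}],
            }],
        }],
    }],
}

print(find_misaligned_answers(toy))
```

Because the loading cell below strips surrounding whitespace from every column, an offset could shift if a context began with whitespace; a check along these lines would surface such cases.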

In [None]:
# Load ChAII dataset

import os
import collections
import functools
import glob
import json
import random
from pprint import pprint
from tqdm import tqdm
import pandas as pd
from pathlib import Path

pd.set_option("display.max_rows", 20, "display.max_columns", None)

data_path = Path("/root/mount/dataset")
json_dicts = []

def get_dataframe(file_path):
    df = pd.DataFrame()
    with open(file_path,'r') as f:
        df = pd.read_csv(f)
    df = df.astype(str)
    df = df.apply(lambda x: x.str.strip())
    return df

chaii_data = get_dataframe(data_path / "train.csv")
chaii_data.head()

### Data conversion to QA format

The cells below convert the TyDiQA and ChAII Kaggle data formats to the SQuAD QA format so that they can be used with the Xtreme pipeline. Download the ChAII data from Kaggle and place it in the mount/dataset directory.
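
TyDiQA annotations store answer positions as UTF-8 byte offsets, whereas SQuAD expects character offsets, and the two differ for Devanagari or Tamil text because each character takes several bytes. A minimal sketch of the conversion, under the assumption that offsets normally fall on character boundaries; `byte_to_char_offset` is a hypothetical helper, not a function from TyDiQA or Xtreme:

```python
def byte_to_char_offset(text, byte_offset):
    """Convert a UTF-8 byte offset into a character offset within text."""
    prefix = text.encode("utf-8")[:byte_offset]
    # errors="ignore" drops a trailing partial character if the offset
    # does not fall on a character boundary
    return len(prefix.decode("utf-8", errors="ignore"))


text = "भारत की राजधानी"
answer = "राजधानी"

# Character offset of the answer, and the matching UTF-8 byte offset
char_start = text.index(answer)
byte_start = len(text[:char_start].encode("utf-8"))

print(char_start, byte_start)
print(byte_to_char_offset(text, byte_start) == char_start)
```

Converting the offset directly avoids searching for the answer text in the document, which can pick the wrong occurrence when the answer string appears more than once.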

In [None]:
# Convert from Kaggle training format to QA format
language = 'hindi' # replace this with tamil when processing tamil dataset
lang_code = 'hi' # for tamil use ta

def convert_to_qa_format_kaggle(row):
    answer = {}
    answer["text"] = row["answer_text"]
    answer["answer_start"] = int(row["answer_start"])
    qa_json = {
        "title": "",
        "paragraphs": [
            {
                "context": row["context"],
                "qas": [
                    {
                        "question": row["question"],
                        "id": row["language"] + '-' + str(row["id"]),
                        "answers": [answer]
                    }
                ]
            }
        ],
    }
    
    return qa_json

# Process one language at a time
# Here chaii_data is a pandas dataframe
def get_qa_data_from_kaggle_format(chaii_data, language): 
    # Keep only the rows whose language matches the `language` argument

    qa_data = {"data":[], "version":f"chaii_{language}"}
    for index, row in chaii_data.iterrows():
        if row["language"] == language:
            qa_datapoint = convert_to_qa_format_kaggle(row)
            qa_data["data"].append(qa_datapoint)

    print("QA (SQuAD) format:")
    print(qa_data["data"][0])
    return qa_data

qa_data = get_qa_data_from_kaggle_format(chaii_data, language)

In [None]:
print("Total number of datapoints: %d" % len(qa_data["data"]))

In [None]:
# Split datapoints language-wise and into QA format
# Run this cell only if you need to convert from TyDiQA to SQuAD format; otherwise skip to the next one.
import re

from pprint import pprint

def byte_str(text):
  return text.encode("utf-8")

def byte_len(text):
  # Python 3 encodes text as character sequences, not byte sequences
  # (like Python 2).
  return len(byte_str(text))

def byte_slice(text, start, end, errors="replace"):
  # Python 3 encodes text as character sequences, not byte sequences
  # (like Python 2).
  return byte_str(text)[start:end].decode("utf-8", errors=errors)

def convert_to_qa_format_tydiqa(tydi_json):
  answer = {}
  for annotation in tydi_json["annotations"]:
    minimal_answer = annotation["minimal_answer"]
    if minimal_answer["plaintext_start_byte"] != -1 and minimal_answer["plaintext_end_byte"] != -1:
      answer["text"] = byte_slice(tydi_json["document_plaintext"],minimal_answer["plaintext_start_byte"],minimal_answer["plaintext_end_byte"])
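      # re.escape prevents regex metacharacters in the answer text from being treated as a pattern
      answer["answer_start"] = [m.start() for m in re.finditer(re.escape(answer["text"]),tydi_json["document_plaintext"])][0]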
      break
  if answer == {}:
    return {}
  
  qa_json = {
      "title" : tydi_json["document_title"],
      "paragraphs" : [
                      {
                          "context": tydi_json["document_plaintext"],
                          "qas" : [
                                   {
                                    "question" : tydi_json["question_text"],
                                    "id" : tydi_json["language"] + '-' + str(tydi_json["example_id"]),
                                    "answers" : [answer],
                                   }
                          ]
                      }
      ],
  }

  return qa_json

# Here chaii_data is a list of TyDiQA JSON dicts
def get_qa_data_from_tydiqa_format(chaii_data, language):
    qa_data = {"data":[], "version":f"chaii_{language}"}
    for json_dict in chaii_data:
        if json_dict["language"] == language:
            qa_datapoint = convert_to_qa_format_tydiqa(json_dict)
            # Skip examples that have no minimal answer
            if qa_datapoint != {}:
                qa_data["data"].append(qa_datapoint)

    print("QA (SQuAD) format:")
    print(qa_data["data"][0])
    return qa_data

In [None]:
# Splitting data into train and dev and saving converted QA formats

import random

qa_data_datapoints = qa_data["data"]
random.shuffle(qa_data_datapoints)
train_size = int(len(qa_data_datapoints)*0.8)
train_qa_data_datapoints, dev_qa_data_datapoints = qa_data_datapoints[:train_size], qa_data_datapoints[train_size:]

train_qa_data = {"data":train_qa_data_datapoints, "version":f"chaii_{lang_code}_train"}
dev_qa_data = {"data":dev_qa_data_datapoints, "version":f"chaii_{lang_code}_dev"}

with open(os.path.join(data_path,f"train.{lang_code}.qa.jsonl"),'w') as f:
  json.dump(train_qa_data,f)

with open(os.path.join(data_path,f"dev.{lang_code}.qa.jsonl"),'w') as f:
  json.dump(dev_qa_data,f)

print("Training data size: %d" % len(train_qa_data_datapoints))
print("Dev data size: %d" % len(dev_qa_data_datapoints))

The cell below is optional (we have not used it for our baseline model), but it downloads the original TyDiQA data in the QA format. You can combine it with our ChAII data and boost training!
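
One way to combine the two sources, sketched under the assumption that both files are single JSON objects in the SQuAD layout with a top-level `data` list. The function name `merge_qa_datasets` and the toy inputs are hypothetical:

```python
import random


def merge_qa_datasets(primary, extra, seed=0):
    """Concatenate the `data` lists of two SQuAD-style dicts and shuffle them."""
    merged = list(primary["data"]) + list(extra["data"])
    # A fixed seed keeps the shuffled order reproducible across runs
    random.Random(seed).shuffle(merged)
    return {"data": merged, "version": f"{primary['version']}+{extra['version']}"}


# Toy inputs (made up for illustration)
chaii_like = {"version": "chaii_hi_train", "data": [{"title": "a"}, {"title": "b"}]}
tydiqa_like = {"version": "tydiqa_goldp", "data": [{"title": "c"}]}

combined = merge_qa_datasets(chaii_like, tydiqa_like)
print(len(combined["data"]), combined["version"])
```

If you try this, it is probably safest to extend only the training file and leave the dev file as ChAII-only, so that dev numbers stay comparable to the baseline.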

In [None]:
%%bash
# Downloading the data. For baseline, we can ignore the other training datasets for other tasks, and focus on training with just TyDiQA data. 

source activate xtreme
cd /root/xtreme # Optional but recommended

# Copyright 2020 Google and DeepMind.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

REPO=$PWD
DIR=$REPO/download/
mkdir -p $DIR

function download_tydiqa {
    echo "download tydiqa-goldp"
    base_dir=$DIR/tydiqa/
    mkdir -p $base_dir && cd $base_dir
    tydiqa_train_file=tydiqa-goldp-v1.1-train.json
    tydiqa_dev_file=tydiqa-goldp-v1.1-dev.tgz
    wget https://storage.googleapis.com/tydiqa/v1.1/${tydiqa_train_file} -q --show-progress
    wget https://storage.googleapis.com/tydiqa/v1.1/${tydiqa_dev_file} -q --show-progress
    tar -xf ${tydiqa_dev_file}
    rm ${tydiqa_dev_file}
    out_dir=$base_dir/tydiqa-goldp-v1.1-train
    python $REPO/utils_preprocess.py --data_dir $base_dir --output_dir $out_dir --task tydiqa
    mv $base_dir/$tydiqa_train_file $out_dir/
    echo "Successfully downloaded data at $DIR/tydiqa" >> $DIR/download.log
}

download_tydiqa

### Training mBERT on Hindi ChAII data

The cell below uses the Xtreme training scripts to fine-tune the model. The scripts in the repo folders need to be modified to train on the ChAII data; you can double-click a script to open and edit it.

For the baseline, the following changes were made to the Xtreme repo code:


1.   In ```scripts/train.sh```, an additional task called "chaii_hi" was added as follows:
```
...
elif [ $TASK == 'chaii_hi' ]; then
  bash $REPO/scripts/train_qa.sh $MODEL chaii_hi $TASK $GPU $DATA_DIR $OUT_DIR
...
```
2.   In ```scripts/train_qa.sh```, the following flags were added:
```
TRAIN_LANG="en"
EVAL_LANG="en"
```
Another elif condition was added to set the data paths and languages:
```
...
elif [ $SRC == 'chaii_hi' ]; then
  TASK_DATA_DIR=${DATA_DIR}
  TRAIN_FILE=${TASK_DATA_DIR}/train.hi.qa.jsonl
  PREDICT_FILE=${TASK_DATA_DIR}/dev.hi.qa.jsonl
  TRAIN_LANG="hi"
  EVAL_LANG="hi"
...
```
Finally, TRAIN_LANG and EVAL_LANG replaced the hardcoded "en":
```
  --weight_decay 0.0001 \
  --threads 8 \
  --train_lang ${TRAIN_LANG} \
  --eval_lang ${EVAL_LANG}
```

If you want to make your own changes for experimentation, clone the xtreme repo locally in the mount folder and mount it as part of the docker container. 

Finally, we create a run.sh script in the current root directory, and paste the following commands:

```
#!/bin/bash

TASK=${1:-chaii_hi}
DATA_DIR=${2:-"/root/xtreme/download/chaii_data/"}
OUT_DIR=${3:-"/root/xtreme/outputs-temp/"}
MODEL=${4:-bert-base-multilingual-cased}
GPU=${5:-0}
TRAIN_FILE_NAME=${6}
PREDICT_FILE_NAME=${7}

source activate xtreme
cd /root/xtreme
bash scripts/train.sh $MODEL $TASK $GPU $DATA_DIR $OUT_DIR $TRAIN_FILE_NAME $PREDICT_FILE_NAME

```

With the arguments used in the cell below, your model should be stored in ```/root/mount/outputs-temp/```.
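
To confirm that training produced output, the run directories can be listed from Python. This is a sketch that assumes the `<out_dir>/<task>/<run_name>` layout implied by the model path used for inference later in this notebook; `list_run_dirs` is a hypothetical helper:

```python
from pathlib import Path


def list_run_dirs(out_dir, task):
    """Return the run directories found under <out_dir>/<task>, sorted by name."""
    task_dir = Path(out_dir) / task
    if not task_dir.is_dir():
        return []
    return sorted(p for p in task_dir.iterdir() if p.is_dir())


# Prints nothing if training has not written any run directory yet
for run_dir in list_run_dirs("/root/mount/outputs-temp", "chaii_hi"):
    print(run_dir)
```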

In [None]:
!pip install ipython-autotime
%load_ext autotime

# We store all the outputs and models to our local mount directory 
# For Tamil, change the task to 'chaii_ta' and the file names to train.ta.qa.jsonl and dev.ta.qa.jsonl
!bash /root/run.sh chaii_hi "/root/mount/dataset" "/root/mount/outputs-temp" bert-base-multilingual-cased 0 train.hi.qa.jsonl dev.hi.qa.jsonl

## Inference and Evaluation

For inference, we make the following modification to the Xtreme repo:
1. In ```predict_qa.sh```, add the following (line 40):
```
elif [ $TGT == 'chaii_hi' ]; then
  langs=( hi )
```


Also, we create a bash file (similar to ```run.sh```) called ```predict.sh```, and copy the commands below into it:

```
#!/bin/bash

source activate xtreme
cd /root/xtreme

MODEL_PATH=${1:-"/root/xtreme/outputs-temp/chaii_hi/bert-base-multilingual-cased_LR3e-5_EPOCH2.0_maxlen384"}
TASK=${2:-chaii_hi}
DATA_DIR=${3:-"/root/xtreme/download/chaii_data/"}
PREDICTIONS_DIR=${4:-"/root/xtreme/predictions/"}
MODEL=${5:-bert-base-multilingual-cased}
MODEL_TYPE=${6:-bert}
GPU=${7:-0}
PREDICT_FILE_NAME=${8}

bash scripts/predict_qa.sh $MODEL $MODEL_TYPE $MODEL_PATH $TASK $GPU $DATA_DIR $PREDICTIONS_DIR $PREDICT_FILE_NAME
```
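
Once predictions exist, they can be scored against the dev file. The sketch below computes a plain exact-match rate, assuming the predictions file is a JSON object mapping each question id to the predicted answer string (the shape the analysis cells further down rely on). `exact_match_rate` is a hypothetical helper and should not be read as the official competition metric:

```python
def exact_match_rate(dev_data, preds):
    """Fraction of dev questions whose prediction equals the first gold answer."""
    total = 0
    correct = 0
    for datapoint in dev_data["data"]:
        for para in datapoint["paragraphs"]:
            for qa in para["qas"]:
                gold = qa["answers"][0]["text"].strip()
                # A missing prediction counts as wrong
                pred = preds.get(qa["id"], "").strip()
                total += 1
                correct += int(pred == gold)
    return correct / total if total else 0.0


# Toy inputs (made up for illustration)
dev_toy = {"data": [{"paragraphs": [{"context": "", "qas": [
    {"id": "hindi-1", "question": "", "answers": [{"text": "दिल्ली"}]},
    {"id": "hindi-2", "question": "", "answers": [{"text": "मुंबई"}]},
]}]}]}
preds_toy = {"hindi-1": "दिल्ली", "hindi-2": "पुणे"}

print(exact_match_rate(dev_toy, preds_toy))
```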

In [None]:
# First, you need to run inference on the models

!bash /root/predict.sh "/root/mount/outputs-temp/chaii_hi/bert-base-multilingual-cased_LR3e-5_EPOCH2.0_maxlen384" \
      chaii_hi "/root/mount/dataset" "/root/eval_dir/predictions" 

In [None]:
# Before local evaluation, the predictions and labels must be put in a particular format. Run the command below to transfer them.

!bash /root/pre_evaluate.sh chaii_hi /root/mount/dataset /root/mount/dataset

In [None]:
# Evaluation 

!bash /root/evaluate.sh

In [None]:
import json
with open("/root/mount/outputs-temp/chaii_hi/bert-base-multilingual-cased_LR3e-5_EPOCH2.0_maxlen384/predictions_hi_.json") as f:
  preds = json.load(f)

with open("/root/mount/dataset/dev.hi.qa.jsonl") as f:
  dev_data = json.load(f)

dev_data

In [None]:
from pprint import pprint
dev_answer_pair_matches = []
for d in dev_data['data']:
  for para in d['paragraphs']:
    for qa in para['qas']:
      dev_answer_pair_matches.append({'context':para['context'],'question':qa['question'],'gold_answer':qa['answers'],'mbert_pred':preds[qa['id']],'id':qa['id']})

In [None]:
pprint(dev_answer_pair_matches[2])

In [None]:
#Matches in predictions
correct_ans = [d for d in dev_answer_pair_matches if d['mbert_pred']==d['gold_answer'][0]['text']]
with open('/root/mount/correct_chaii_hi_mbert.txt','w',encoding='utf-8') as f:
  for c in correct_ans:
    f.write(f"id:{c['id']}\n")
    f.write(f"context:{c['context']}\n")
    f.write(f"question:{c['question']}\n")
    f.write(f"gold_answer:{c['gold_answer'][0]['text']}\n")
    f.write(f"mbert_pred:{c['mbert_pred']}\n")
    f.write("\n\n")

In [None]:
#Mismatches in predictions
wrong_ans = [d for d in dev_answer_pair_matches if d['mbert_pred']!=d['gold_answer'][0]['text']]
with open('/root/mount/wrong_chaii_hi_mbert.txt','w',encoding='utf-8') as f:
  for c in wrong_ans:
    f.write(f"id:{c['id']}\n")
    f.write(f"context:{c['context']}\n")
    f.write(f"question:{c['question']}\n")
    f.write(f"gold_answer:{c['gold_answer'][0]['text']}\n")
    f.write(f"mbert_pred:{c['mbert_pred']}\n")
    f.write("\n\n")

In [None]:
len(correct_ans),len(wrong_ans)