# Hindi and Tamil Question Answering

The goal of the competition is to perform extractive question answering in Hindi and Tamil. The training data contains 1114 samples and the test set contains 5 samples. Because the dataset is small, the task tests the transfer-learning capability of Transformer models. Competition page: https://www.kaggle.com/competitions/chaii-hindi-and-tamil-question-answering

In [1]:
import pandas as pd
import numpy as np

In [2]:
train = pd.read_csv('../input/chaii-hindi-and-tamil-question-answering/train.csv')
train.head()

In [3]:
train.language.value_counts()

The train dataset contains 746 samples in Hindi and 368 samples in Tamil. The presence of two languages makes this challenge slightly harder.

The test dataset contains only 5 samples.

In [4]:
test = pd.read_csv('../input/chaii-hindi-and-tamil-question-answering/test.csv')
test

The model and the dataset pipeline are based on the code from the Hugging Face course: https://github.com/huggingface/notebooks/blob/master/examples/question_answering.ipynb

In [None]:
import transformers

<img src="images/23.png">
Huang, Z., Low, C., Teng, M., Zhang, H., Ho, D. E., Krass, M. S., & Grabmair, M. (2021, June). Context-aware legal citation recommendation using deep learning. Proceedings of the Eighteenth International Conference on Artificial Intelligence and Law. https://doi.org/10.1145/3462757.3466066

RoBERTa uses the same architecture as BERT.

A multilingual RoBERTa model (XLM-RoBERTa) and its tokenizer, fine-tuned on the SQuAD 2.0 question answering dataset, are used.

In [6]:
model_checkpoint = '../input/xlm-roberta-squad2/deepset/xlm-roberta-base-squad2'
batch_size = 4

In [7]:
from transformers import AutoTokenizer

# Load the tokenizer.
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

The dataset contains a context and question. The model needs to refer to the context and extract the words that represent the answer to the question.
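To make the extractive setup concrete, here is a minimal sketch with a made-up English example. The field names mirror the columns used later in this notebook (`context`, `question`, `answer_text`, `answer_start`); the values themselves are illustrative and not taken from the dataset.

```python
# Illustrative (made-up) example; only the field names come from train.csv.
example = {
    "context": "Chennai is the capital of Tamil Nadu.",
    "question": "What is the capital of Tamil Nadu?",
    "answer_text": "Chennai",
    "answer_start": 0,  # character offset of the answer inside the context
}

# The answer is a substring of the context, located by its character offset.
start = example["answer_start"]
end = start + len(example["answer_text"])
extracted = example["context"][start:end]
print(extracted)
```

The model never generates new text: it only predicts where the answer starts and ends inside the context.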

In [8]:
# The number of tokens in the context is high.
train['num_tokens_context'] = train['context'].apply(lambda t: len(tokenizer(t)['input_ids']))
train['num_tokens_context'].hist()

If the question and context together are longer than max_length, the tokenizer would truncate the context. To avoid losing text, a long context is split into multiple features. Consecutive features overlap by a number of tokens so that the split does not hurt model performance. For example, if the answer spans four tokens and the split puts two of them in each feature, neither feature contains the full answer and the model cannot extract it. The overlap makes it likely that at least one feature contains the whole answer.
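The splitting idea can be sketched in plain Python. This is a simplified illustration, not the tokenizer's actual implementation: the real tokenizer also reserves room for the question and the special tokens in every feature. The helper name `split_with_overlap` and its parameters are hypothetical.

```python
def split_with_overlap(tokens, window, overlap):
    """Split `tokens` into chunks of at most `window` items, where each
    chunk repeats the last `overlap` items of the previous chunk."""
    if overlap >= window:
        raise ValueError("overlap must be smaller than window")
    step = window - overlap
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # the last chunk reached the end of the sequence
        start += step
    return chunks


tokens = list(range(10))  # stand-in for token ids
chunks = split_with_overlap(tokens, window=4, overlap=2)
for chunk in chunks:
    print(chunk)
```

With a window of 4 and an overlap of 2, an answer covering tokens 2 to 5 is cut in half by the first chunk but fits entirely in the second one.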

In [9]:
max_length = 384 # The maximum length of a feature (question and context)
doc_stride = 128 # The overlap (in tokens) between two consecutive parts of the context when it has to be split.

In [10]:
pad_on_right = tokenizer.padding_side == "right"

In [11]:
def prepare_train_features(examples):
    # Some of the questions have lots of whitespace on the left, which is not useful and is removed
    examples["question"] = [q.lstrip() for q in examples["question"]]

    # Tokenize our examples with truncation and padding. If the context is too long, it is split
    # into multiple features with overlap. The offset mapping will allow the model to properly answer
    # the question using the index of the context.
    tokenized_examples = tokenizer(
        examples["question" if pad_on_right else "context"],
        examples["context" if pad_on_right else "question"],
        truncation="only_second" if pad_on_right else "only_first",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Since one example might give us several features if it has a long context, we need a map from a feature to
    # its corresponding example. This key gives us just that.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")
    # The offset mappings will give us a map from token to character position in the original context. This will
    # help us compute the start_positions and end_positions.
    offset_mapping = tokenized_examples.pop("offset_mapping")

    # Let's label those examples!
    tokenized_examples["start_positions"] = []
    tokenized_examples["end_positions"] = []

    for i, offsets in enumerate(offset_mapping):
        # We will label impossible answers with the index of the CLS token.
        input_ids = tokenized_examples["input_ids"][i]
        cls_index = input_ids.index(tokenizer.cls_token_id)

        # Grab the sequence corresponding to that example (to know what is the context and what is the question).
        sequence_ids = tokenized_examples.sequence_ids(i)

        # One example can give several spans, this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]
        answers = examples["answers"][sample_index]
        # If no answers are given, set the cls_index as answer.
        if len(answers["answer_start"]) == 0:
            tokenized_examples["start_positions"].append(cls_index)
            tokenized_examples["end_positions"].append(cls_index)
        else:
            # Start/end character index of the answer in the text.
            start_char = answers["answer_start"][0]
            end_char = start_char + len(answers["text"][0])

            # Start token index of the current span in the text.
            token_start_index = 0
            while sequence_ids[token_start_index] != (1 if pad_on_right else 0):
                token_start_index += 1

            # End token index of the current span in the text.
            token_end_index = len(input_ids) - 1
            while sequence_ids[token_end_index] != (1 if pad_on_right else 0):
                token_end_index -= 1

            # Detect if the answer is out of the span (in which case this feature is labeled with the CLS index).
            if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
                tokenized_examples["start_positions"].append(cls_index)
                tokenized_examples["end_positions"].append(cls_index)
            else:
                # Otherwise move the token_start_index and token_end_index to the two ends of the answer.
                # Note: we could go after the last offset if the answer is the last word (edge case).
                while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
                    token_start_index += 1
                tokenized_examples["start_positions"].append(token_start_index - 1)
                while offsets[token_end_index][1] >= end_char:
                    token_end_index -= 1
                tokenized_examples["end_positions"].append(token_end_index + 1)

    return tokenized_examples
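The labeling step above converts a character span into token indices by using the offset mapping. A minimal sketch of that conversion follows, with hand-written offsets for a whitespace-split English sentence. The helper `char_span_to_token_span` is hypothetical and assumes a non-empty answer; the function above returns the CLS index where this sketch returns `None`.

```python
def char_span_to_token_span(offsets, start_char, end_char):
    """Return (start_token, end_token) for a character span, or None if the
    span is not fully covered by the tokens described by `offsets`."""
    if not offsets or offsets[0][0] > start_char or offsets[-1][1] < end_char:
        return None
    # First token that ends after the answer starts.
    start_token = next(i for i, (_, e) in enumerate(offsets) if e > start_char)
    # Last token that starts before the answer ends.
    end_token = max(i for i, (s, _) in enumerate(offsets) if s < end_char)
    return start_token, end_token


context = "The cat sat on the mat"
# (start, end) character positions of each token, written by hand here.
offsets = [(0, 3), (4, 7), (8, 11), (12, 14), (15, 18), (19, 22)]

answer_text = "on the mat"
start_char = context.index(answer_text)
end_char = start_char + len(answer_text)

span = char_span_to_token_span(offsets, start_char, end_char)
print(span)

# A window that only covers part of the answer yields no label.
partial = char_span_to_token_span(offsets[2:4], start_char, end_char)
print(partial)
```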

In [12]:
!pip uninstall fsspec -qq -y
!pip install --no-index --find-links ../input/hf-datasets/wheels datasets -qq

In [13]:
from datasets import Dataset

In [14]:
# A function for converting the answer to a better format.
def convert_answers(r):
    start = r.iloc[0]  # positional access; r[0] on a labelled row is deprecated in recent pandas
    text = r.iloc[1]
    return {
        'answer_start': [start],
        'text': [text]
    }

In [15]:
# Converting the answer in the train dataset.
train = train.sample(frac=1, random_state=42)
train['answers'] = train[['answer_start', 'answer_text']].apply(convert_answers, axis=1)

In [16]:
# The last 64 samples are used as the validation set.
df_train = train[:-64].reset_index(drop=True)
df_valid = train[-64:].reset_index(drop=True)

In [17]:
# Converting the dataset to proper format from pandas csv file
train_dataset = Dataset.from_pandas(df_train)
valid_dataset = Dataset.from_pandas(df_valid)

In [18]:
train_dataset[0]

In [19]:
# Using map to use prepare train function that was defined earlier.
tokenized_train_ds = train_dataset.map(prepare_train_features, batched=True, remove_columns=train_dataset.column_names)
tokenized_valid_ds = valid_dataset.map(prepare_train_features, batched=True, remove_columns=valid_dataset.column_names)

In [20]:
from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer

# Multilingual Roberta model trained on Question-Answering dataset SQuAD 2.0 is used.
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

In [21]:
# WANDB is disabled
%env WANDB_DISABLED=True

In [26]:
# The arguments used for training the model
args = TrainingArguments(
    "chaii-qa",
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    learning_rate=3e-5,
    warmup_ratio=0.1,
    gradient_accumulation_steps=8,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=5,
    weight_decay=0.01,
)
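With gradient accumulation, the optimizer only updates the weights after several small batches, so the effective batch size is larger than `batch_size`. A quick calculation, assuming a single GPU (the device count is an assumption, not stated in the notebook):

```python
per_device_train_batch_size = 4   # batch_size defined earlier
gradient_accumulation_steps = 8   # from the TrainingArguments above
num_devices = 1                   # assumption: one GPU

effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)
print(effective_batch_size)  # 32
```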

In [27]:
from transformers import default_data_collator

# Data Collator is used to form batches of data.
data_collator = default_data_collator

In [28]:
# Trainer to train the model.
trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_train_ds,
    eval_dataset=tokenized_valid_ds,
    data_collator=data_collator,
    tokenizer=tokenizer,
)

In [29]:
# Train the model
trainer.train()

In [30]:
# Save the trained model
trainer.save_model("chaii-roberta-trained")

In [31]:
# Prepare the validation set.
def prepare_validation_features(examples):
    # Some of the questions have lots of whitespace on the left, which is not useful and will make the
    # truncation of the context fail (the tokenized question will take a lot of space). So we remove that
    # left whitespace.
    examples["question"] = [q.lstrip() for q in examples["question"]]

    # Tokenize our examples with truncation and padding, but keep the overflows using a stride. This results
    # in one example possibly giving several features when a context is long, each of those features having a
    # context that overlaps a bit with the context of the previous feature.
    tokenized_examples = tokenizer(
        examples["question" if pad_on_right else "context"],
        examples["context" if pad_on_right else "question"],
        truncation="only_second" if pad_on_right else "only_first",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Since one example might give us several features if it has a long context, we need a map from a feature to
    # its corresponding example. This key gives us just that.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")

    # We keep the example_id that gave us this feature and we will store the offset mappings.
    tokenized_examples["example_id"] = []

    for i in range(len(tokenized_examples["input_ids"])):
        # Grab the sequence corresponding to that example (to know what is the context and what is the question).
        sequence_ids = tokenized_examples.sequence_ids(i)
        context_index = 1 if pad_on_right else 0

        # One example can give several spans, this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]
        tokenized_examples["example_id"].append(examples["id"][sample_index])

        # Set to None the offset_mapping that are not part of the context so it's easy to determine if a token
        # position is part of the context or not.
        tokenized_examples["offset_mapping"][i] = [
            (o if sequence_ids[k] == context_index else None)
            for k, o in enumerate(tokenized_examples["offset_mapping"][i])
        ]

    return tokenized_examples
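The masking of the offset mapping can be illustrated on a tiny hand-written feature. The values below are made up; `sequence_ids` uses `None` for special tokens, `0` for question tokens and `1` for context tokens, which is the convention relied on by the function above when the question comes first.

```python
# One toy feature: special token, 2 question tokens, 2 special tokens,
# 3 context tokens, special token.
sequence_ids = [None, 0, 0, None, None, 1, 1, 1, None]
offsets = [(0, 0), (0, 4), (5, 7), (0, 0), (0, 0), (0, 3), (4, 7), (8, 11), (0, 0)]

context_index = 1  # the context is the second sequence

# Keep offsets only for context tokens; everything else becomes None.
masked = [o if sid == context_index else None for sid, o in zip(sequence_ids, offsets)]
print(masked)
```

During postprocessing, a predicted start or end index whose masked offset is `None` points outside the context and can be discarded immediately.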

In [32]:
# Using map to use prepare validation function that was defined earlier.
validation_features = valid_dataset.map(
    prepare_validation_features,
    batched=True,
    remove_columns=valid_dataset.column_names
)

In [33]:
# Remove example_id and offset_mapping columns.
valid_feats_small = validation_features.map(lambda example: example, remove_columns=['example_id', 'offset_mapping'])
valid_feats_small

In [34]:
# Using the trained model to make predictions on the validation set.
raw_predictions = trainer.predict(valid_feats_small)

In [35]:
# Maximum length of the answer.
max_answer_length = 30

In [36]:
# Formatting the validation outputs for postprocessing.
import collections

examples = valid_dataset
features = validation_features

example_id_to_index = {k: i for i, k in enumerate(examples["id"])}
features_per_example = collections.defaultdict(list)
for i, feature in enumerate(features):
    features_per_example[example_id_to_index[feature["example_id"]]].append(i)

The postprocessing pipeline is shown in the image below. If the context is too long, it is split into multiple features. Each feature yields start and end logits, which are converted into candidate answers. The goal of the postprocessing step is to select a single answer from the candidates produced by all the features of an example.

<img src="images/21.png">
https://www.youtube.com/watch?v=BNy08iIWVJM
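The selection step can be sketched on toy logits. This is a simplified illustration with made-up numbers and hypothetical helper names; it leaves out the offset-mapping checks that the full postprocessing function performs.

```python
import numpy as np


def best_span_in_feature(start_logits, end_logits, n_best=3, max_answer_length=30):
    """Return (score, start_index, end_index) of the best valid span, or None."""
    start_indexes = np.argsort(start_logits)[::-1][:n_best]
    end_indexes = np.argsort(end_logits)[::-1][:n_best]
    best = None
    for s in start_indexes:
        for e in end_indexes:
            # Skip spans that end before they start or that are too long.
            if e < s or e - s + 1 > max_answer_length:
                continue
            score = float(start_logits[s] + end_logits[e])
            if best is None or score > best[0]:
                best = (score, int(s), int(e))
    return best


def best_span_across_features(feature_logits):
    """Pick the highest-scoring span over all features of one example."""
    candidates = []
    for feature_index, (start_logits, end_logits) in enumerate(feature_logits):
        span = best_span_in_feature(start_logits, end_logits)
        if span is not None:
            score, s, e = span
            candidates.append((score, feature_index, s, e))
    return max(candidates) if candidates else None


# Two features produced from one long context (made-up logits).
feature_logits = [
    (np.array([0.1, 2.0, 0.3, 0.2, 0.0]), np.array([0.0, 0.5, 0.1, 3.0, 0.2])),
    (np.array([0.5, 0.1, 1.0, 0.2, 0.1]), np.array([0.2, 0.1, 0.3, 1.5, 0.4])),
]
best = best_span_across_features(feature_logits)
print(best)  # (5.0, 0, 1, 3): score, feature index, start token, end token
```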

In [37]:
from tqdm.auto import tqdm

def postprocess_qa_predictions(examples, features, raw_predictions, n_best_size = 20, max_answer_length = 30):
    all_start_logits, all_end_logits = raw_predictions
    # Map the example to its corresponding features which were created by splitting the large context.
    example_id_to_index = {k: i for i, k in enumerate(examples["id"])}
    features_per_example = collections.defaultdict(list)
    for i, feature in enumerate(features):
        features_per_example[example_id_to_index[feature["example_id"]]].append(i)

    # The dictionaries we have to fill.
    predictions = collections.OrderedDict()

    # Logging.
    print(f"Post-processing {len(examples)} example predictions split into {len(features)} features.")

    # Let's loop over all the examples!
    for example_index, example in enumerate(tqdm(examples)):
        # Those are the indices of the features associated to the current example.
        feature_indices = features_per_example[example_index]

        min_null_score = None
        valid_answers = []
        
        context = example["context"]
        # Looping through all the features associated to the current example.
        for feature_index in feature_indices:
            # We grab the predictions of the model for this feature.
            start_logits = all_start_logits[feature_index]
            end_logits = all_end_logits[feature_index]
            # This is what will allow us to map some of the positions in our logits to spans of text in the original
            # context.
            offset_mapping = features[feature_index]["offset_mapping"]

            # Update minimum null prediction.
            cls_index = features[feature_index]["input_ids"].index(tokenizer.cls_token_id)
            feature_null_score = start_logits[cls_index] + end_logits[cls_index]
            if min_null_score is None or min_null_score < feature_null_score:
                min_null_score = feature_null_score

            # Go through all possibilities for the `n_best_size` greater start and end logits.
            start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
            end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
            for start_index in start_indexes:
                for end_index in end_indexes:
                    # Don't consider out-of-scope answers, either because the indices are out of bounds or correspond
                    # to part of the input_ids that are not in the context.
                    if (
                        start_index >= len(offset_mapping)
                        or end_index >= len(offset_mapping)
                        or offset_mapping[start_index] is None
                        or offset_mapping[end_index] is None
                    ):
                        continue
                    # Don't consider answers with a length that is either < 0 or > max_answer_length.
                    if end_index < start_index or end_index - start_index + 1 > max_answer_length:
                        continue

                    start_char = offset_mapping[start_index][0]
                    end_char = offset_mapping[end_index][1]
                    valid_answers.append(
                        {
                            "score": start_logits[start_index] + end_logits[end_index],
                            "text": context[start_char: end_char]
                        }
                    )
        
        if len(valid_answers) > 0:
            best_answer = sorted(valid_answers, key=lambda x: x["score"], reverse=True)[0]
        else:
            # In the very rare edge case we have not a single non-null prediction, we create a fake prediction to avoid
            # failure.
            best_answer = {"text": "", "score": 0.0}
        
        # The null answer is not considered here: we always keep the best-scoring span as the final answer.
        predictions[example["id"]] = best_answer["text"]

    return predictions

In [38]:
# Use the postprocessing function to get the predicted answers
final_predictions = postprocess_qa_predictions(valid_dataset, validation_features, raw_predictions.predictions)

In [39]:
references = [{"id": ex["id"], "answer": ex["answers"]['text'][0]} for ex in valid_dataset]

<img src="images/22.png">
https://medium.com/data-science-bootcamp/understand-jaccard-index-jaccard-similarity-in-minutes-25a703fbf9d7
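The Jaccard score of two word sets A and B is |A ∩ B| / |A ∪ B|. A small worked example with made-up English strings shows the arithmetic:

```python
truth = "new delhi india"
prediction = "delhi india capital"

a = set(truth.lower().split())       # {"new", "delhi", "india"}
b = set(prediction.lower().split())  # {"delhi", "india", "capital"}

intersection = a & b  # words shared by both: "delhi", "india"
union = a | b         # all distinct words: 4 in total

score = len(intersection) / len(union)
print(score)  # 0.5
```

A score of 1.0 means the predicted words match the reference exactly, and 0.0 means no word is shared.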

In [40]:
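# Compute the word-level Jaccard score between the reference answer and the prediction.
def jaccard(row):
    # Positional access via .iloc; row[0] on a labelled row is deprecated in recent pandas.
    str1 = row.iloc[0]
    str2 = row.iloc[1]
    a = set(str1.lower().split())
    b = set(str2.lower().split())
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))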

In [41]:
res = pd.DataFrame(references)
res['prediction'] = res['id'].apply(lambda r: final_predictions[r])
res['jaccard'] = res[['answer', 'prediction']].apply(jaccard, axis=1)
res

In [53]:
res.jaccard.mean()