<a href="https://colab.research.google.com/github/danielsaggau/deep-learning-for-nlp/blob/main/exercise9_tweetqa_12144037.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exercise Sheet 9: Domain Adaptation on TweetQA using T5

In this exercise, you will evaluate T5 on question answering in the twitter domain, and then finetune the model.

You should complete the parts of the exercise that are marked as **TODO**.
A correctly completed **TODO** gives 2 bonus points. Partially correct answers give 1 bonus point.
Some **TODO**s are inside a comment in a code block: Here, you should complete the line of code.
Other **TODO**s are inside a text block: Here, you should write a few sentences to answer the question.

**Important:** Some students were under the impression that you have to complete a TODO in a _single_ line of code. That is not the case, you can use as many lines as you need.

**Submission deadline:** 03.02.2021, 23:59 Central European Time

**Instructions for submission:** After completing the exercise, save a copy of the notebook as exercise9_tweetqa_MATRIKELNUMMER.ipynb, where MATRIKELNUMMER is your student ID number. Then upload the notebook to moodle (submission exercise sheet 9).

In order to understand the code, it can be helpful to experiment a bit during development, e.g., to print tensors or their shapes. But please remove these changes before submitting the notebook. If we cannot run your notebook, or if a print statement is congesting stdout too much, then we cannot grade it. 

To make the most of this exercise, you should try to read and understand the entire code, not just the parts that contain a **TODO**. If you have questions, write them down for the exercise, which will happen in the week after the submission deadline.

**CUDA:** You can use a GPU for this exercise (on colab: Runtime -> Change Runtime Type -> GPU). This is not mandatory, but it will speed up training epochs, thereby allowing you to test more hyperparameters.

# Libraries and Hyperparameters

In [1]:
!pip install -q transformers==4.2.0
!pip install -q datasets==1.2.0

[K     |████████████████████████████████| 1.8MB 4.0MB/s 
[K     |████████████████████████████████| 890kB 20.2MB/s 
[K     |████████████████████████████████| 2.9MB 29.1MB/s 
[?25h  Building wheel for sacremoses (setup.py) ... [?25l[?25hdone
[K     |████████████████████████████████| 163kB 5.8MB/s 
[K     |████████████████████████████████| 20.7MB 71.5MB/s 
[K     |████████████████████████████████| 245kB 53.9MB/s 
[?25h

In [10]:
from transformers import T5ForConditionalGeneration, T5TokenizerFast, Seq2SeqTrainer, Seq2SeqTrainingArguments
from datasets import load_dataset, load_metric
import numpy as np
import torch
import tensorflow as tf

In [11]:
NUM_EPOCHS = 3 if torch.cuda.is_available() else 1
PERCENTILES = (95, 100) if torch.cuda.is_available() else (80, 100)

TRAIN_BATCH_SIZE = 16
EVAL_BATCH_SIZE = 64
WARMUP_STEPS = 200
WEIGHT_DECAY = 0.01
LOGGING_STEPS = 100
LEARNING_RATE = 5e-05

In [12]:
torch.manual_seed(0)
if torch.cuda.is_available():
  torch.cuda.manual_seed(0)

#Data preprocessing

In this exercise, we will load the [tweet_qa dataset](https://huggingface.co/datasets/tweet_qa), which is a closed question answering dataset. Each data sample consists of a question, a tweet as context which contains the necessary information, and the correct answer. Luckily for us, T5 has already been trained on SQUAD, which is a similar task, but with the context consisting of snippets from Wikipedia articles. This means that we just need to bring our dataset into the right format, and then we can already take a look at how well T5 performs on it. After that, we will finetune on tweet_qa to improve performance (domain adaptation).

In [13]:
tokenizer = T5TokenizerFast.from_pretrained('t5-small')

HBox(children=(FloatProgress(value=0.0, description='Downloading', max=791656.0, style=ProgressStyle(descripti…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=1389353.0, style=ProgressStyle(descript…




Here, we are going to bring our dataset into the right format, tokenize it, and truncate the inputs using the same method as in the last exercise. Hint: to find out what format T5 expects questions and answers in, take a look at the [T5 paper](https://arxiv.org/pdf/1910.10683.pdf), specifically Appendix D.15. 

In [113]:
def reformat_for_t5(example):
# i looked into a number of different approaches online, but most of the documents i found were specifically for tensors or used predefined pre-processing 
# i was not really able to find any clear documentation on how to do the modification here 
# I have looked at the paper and was hoping to get the following results  {'context': <article>, 'question': <question>} where rather than the context or
# question appears immediately, there was a prefix with "context:..." and "question:..."
# I looked at e.g. https://colab.research.google.com/github/google-research/text-to-text-transfer-transformer/blob/master/notebooks/t5-trivia.ipynb#scrollTo=KPOteeqctpzw
# I was not entire sure what exactly 'src_texts' was. 
  example[src_texts]  = ("question: " + dataset['question'] + "context: " + dataset['context'])
  example[src_texts]  = example[src_texts].lower()
# https://colab.research.google.com/github/enzoampil/t5-intro/blob/master/t5_qa_training_pytorch_span_extraction.ipynb#scrollTo=4VQaCWw5JYK-
  return  example  # TODO: bring the question and context into the format that T5 is used to from SQUAD, for each example. # You should also lowercase the inputs.

def get_max_length(tokenizer, train_dataset, column, percentile):
  def get_lengths(batch):
    return tokenizer(batch, padding=False, return_length=True)

  lengths = train_dataset.map(get_lengths, input_columns=column, batched=True)['length']
  return int(np.percentile(lengths, percentile)) +1

ANSWER_SEP = ':::::'
dataset = load_dataset('tweet_qa')
dataset = dataset.map(reformat_for_t5)
dataset = dataset.map(lambda x: {'tgt_texts': ANSWER_SEP.join(x['Answer'])}) 
#note: this is a hack: The Seq2SeqTrainer we will use later does not allow us to pass multiple labels to it. 
#However, our dataset contains multiple correct answers for each question in the validation part.
# Therefore we concatenate them here so we can later split them in compute_metrics. Don't worry about this too much.

max_length = get_max_length(tokenizer, dataset['train'], 'src_texts', PERCENTILES[0])
max_target_length = get_max_length(tokenizer, dataset['train'], 'tgt_texts', PERCENTILES[1])

def tokenize(batch):
  batch = tokenizer.prepare_seq2seq_batch(batch, padding = max_length)
  # TODO: call the prepare_seq2seq_batch function of the T5Tokenizer. Be careful to set padding to 'max_length'
  return batch 

dataset = dataset.map(tokenize, batched=True)
dataset.set_format('torch', columns=['input_ids', 'labels', 'attention_mask'])

Using custom data configuration default
Reusing dataset tweet_qa (/root/.cache/huggingface/datasets/tweet_qa/default/1.0.0/1d584f8fe9cb9d3d8d9ab63a36a25716b482850adc0e94f5d63c081ea2f78f01)


KeyError: ignored

# Defining Metrics

We are going to evaluate with the same metrics used by SQUAD, Exact Match and Bag-of-Words F1. They're already available in 🤗 datasets, but we need to get our predictions and answers into shape first.

In [109]:
def squad_metrics(eval_prediction):
  predictions = tokenizer.batch_decode(predictions, skip_special_tokens=True) # TODO: use the batch_decode function of the tokenizer to decode the 
  answers = tokenizer.batch_decode(label_ids, skip_special_tokens=True) # predictions and the label_ids 
# (hint: they're both in the only argument of this function). Remember to set skip_special_tokens so that we don't generate padding tokens.
  answers = [answer.split(ANSWER_SEP) for answer in answers]
  assert all([len(ans) == 2 for ans in answers])

  predictions = [{'id': str(i), 'prediction_text': pred.strip().lower()} \
                 for i, pred in enumerate(predictions)]
  references = [{'id': str(i), 'answers': {'text': ans, 'answer_start': []}} \
                for i, ans in enumerate(answers)]

  metric = squad_metrics('squad')   #TODO: load the metric "squad" using the function we've imported at the top of the notebook. 
  metric = metric.add_batch(predictions, references)  #TODO: use the add_batch method to pass our predictions and references to the metric
  return metric  #TODO: compute the metric

# Training

In [110]:
model = T5ForConditionalGeneration.from_pretrained("t5-small")

In [111]:
training_args = Seq2SeqTrainingArguments(
    output_dir='./results',
    num_train_epochs=NUM_EPOCHS,
    per_device_train_batch_size=TRAIN_BATCH_SIZE,
    per_device_eval_batch_size=EVAL_BATCH_SIZE,
    warmup_steps=WARMUP_STEPS,
    weight_decay=WEIGHT_DECAY,
    logging_dir='./logs/',
    evaluation_strategy="steps",
    logging_steps=LOGGING_STEPS,
    learning_rate=LEARNING_RATE,
    predict_with_generate=True,
)
trainer = Seq2SeqTrainer(model, training_args) #TODO: instantiate the Seq2SeqTrainer. Remember to pass all the objects we've constructed so far that we can use here.

# Evaluation before Domain Adaptation

First, let's test how T5 performs on tweet_qa without domain adaptation. It should work fairly well, given that it's been trained on SQUAD, which is a very similar task, just on a different domain. For evaluation, we'll perform beam search during generation with a beam size of 2.

In [112]:
print(trainer.evaluate(num_beams=2))

ValueError: ignored

# Domain Adaptation

Now let's see how much we can improve our performance by finetuning on our in-domain data:

In [81]:
trainer.train()

ValueError: ignored

In [70]:
print(trainer.evaluate())
print(trainer.evaluate(num_beams=2))

ValueError: ignored

# Report

**TODO:** Write a brief report of your results. Include the hyperparameters that you used, the results T5 achieved on the dataset before and after domain adaptation, and why the evaluation during training may have returned different results than the separate evaluations before and after training. Also report on the different evaluation speeds resulting from this, and give a brief explanation why finetuning on in-domain data was helpful for performance.

In [None]:
Unfortunateley, i was unable to complete the pre-processing step, hindering me from running the rest of the analysis. 
Independent of the results, there are a few things one may have derived.  

In the beginning we defined the following hyperparameters: 
TRAIN_BATCH_SIZE = 16
EVAL_BATCH_SIZE = 64
WARMUP_STEPS = 200
WEIGHT_DECAY = 0.01
LOGGING_STEPS = 100
LEARNING_RATE = 5e-05

Hyperparameters after adoptation = N/A
evaluation interpretation = N/A
evaluation speed = N/A
finetuning = N/A

As suggested in the lecture slides, with respect to ensembling, training the pretrained models with the same base performance 
worse than completely separate models. 
Further, as mentioned updating all parameters works best but is computationally expensive. 
So, when fine tuning and updating all parameters, it probably was rather slow to evaluate, given the computational costs.
Performance increase arise from more data, larger models and ensembling. 
Henceforth, the combined evaluation during training might have been better due to ensembling.

For fine tuning, generally speaking, it is incremental to use a very small learning rate because we have a huge model with a relatively small test-dataset.
As suggested by Raffel et al. 2019 larger models lead to better performances. 
I suppose intuitively speaking, transfer learning is improved by training tailored for specific domains.
One should probably take these performances with a grain of salt, given that commerical settings demand for continous
adaptation and some studies have advocated for domain-indepedent data embedded into the fine-tuning stage. 