# Exercise Sheet 9: Domain Adaptation on TweetQA using T5

In this exercise, you will evaluate T5 on question answering in the Twitter domain, and then finetune the model.

You should complete the parts of the exercise that are marked as **TODO**.
A correctly completed **TODO** gives 2 bonus points. Partially correct answers give 1 bonus point.
Some **TODO**s are inside a comment in a code block: Here, you should complete the line of code.
Other **TODO**s are inside a text block: Here, you should write a few sentences to answer the question.

**Important:** Some students were under the impression that you have to complete a TODO in a _single_ line of code. That is not the case; you can use as many lines as you need.

**Submission deadline:** 03.02.2021, 23:59 Central European Time

**Instructions for submission:** After completing the exercise, save a copy of the notebook as exercise9_tweetqa_MATRIKELNUMMER.ipynb, where MATRIKELNUMMER is your student ID number. Then upload the notebook to moodle (submission exercise sheet 9).

In order to understand the code, it can be helpful to experiment a bit during development, e.g., to print tensors or their shapes. But please remove these changes before submitting the notebook. If we cannot run your notebook, or if a print statement is congesting stdout too much, then we cannot grade it. 

To make the most of this exercise, you should try to read and understand the entire code, not just the parts that contain a **TODO**. If you have questions, write them down for the exercise, which will happen in the week after the submission deadline.

**CUDA:** You can use a GPU for this exercise (on colab: Runtime -> Change Runtime Type -> GPU). This is not mandatory, but it will speed up training epochs, thereby allowing you to test more hyperparameters.

# Libraries and Hyperparameters

In [29]:
!pip3 install -q transformers==4.2.0
!pip3 install -q datasets==1.2.0

In [30]:
from transformers import T5ForConditionalGeneration, T5TokenizerFast, Seq2SeqTrainer, Seq2SeqTrainingArguments
from datasets import load_dataset, load_metric
import numpy as np
import torch

In [31]:
NUM_EPOCHS = 3 if torch.cuda.is_available() else 1
PERCENTILES = (95, 100) if torch.cuda.is_available() else (80, 100)

TRAIN_BATCH_SIZE = 16
EVAL_BATCH_SIZE = 64
WARMUP_STEPS = 200
WEIGHT_DECAY = 0.01
LOGGING_STEPS = 100
LEARNING_RATE = 5e-05

In [32]:
torch.manual_seed(0)
if torch.cuda.is_available():
  torch.cuda.manual_seed(0)

# Data preprocessing

In this exercise, we will load the [tweet_qa dataset](https://huggingface.co/datasets/tweet_qa), which is a closed question answering dataset. Each data sample consists of a question, a tweet as context that contains the necessary information, and the correct answer. Luckily for us, T5 has already been trained on SQuAD, which is a similar task, but with the context consisting of snippets from Wikipedia articles. This means that we just need to bring our dataset into the right format, and then we can already take a look at how well T5 performs on it. After that, we will finetune on tweet_qa to improve performance (domain adaptation).

In [33]:
tokenizer = T5TokenizerFast.from_pretrained('t5-small')

Here, we are going to bring our dataset into the right format, tokenize it, and truncate the inputs using the same method as in the last exercise. Hint: to find out what format T5 expects questions and answers in, take a look at the [T5 paper](https://arxiv.org/pdf/1910.10683.pdf), specifically Appendix D.15. 
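As a minimal sketch of the input format, assuming the `question: ... context: ...` layout described in the T5 paper: the helper name `format_for_t5` and the sample below are made up for illustration and are not part of tweet_qa or of the notebook code.

```python
def format_for_t5(question: str, context: str) -> str:
    """Build a T5-style question answering input string (lowercased)."""
    return f"question: {question.lower()} context: {context.lower()}"


# Hypothetical sample, not taken from the dataset.
sample = {"Question": "Who won the game?", "Tweet": "The Lakers won the game tonight!"}
src_text = format_for_t5(sample["Question"], sample["Tweet"])
print(src_text)
```

The target side needs no prefix: the model is simply expected to generate the answer text.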

In [34]:
def reformat_for_t5(example):
  example['src_texts'] = "question: " + example["Question"].lower() + " context: " + example["Tweet"].lower()
  # Optional: the original "Question" and "Tweet" columns could be deleted afterwards to save memory.
  # del example['Question']
  # del example['Tweet']
  return example

def get_max_length(tokenizer, train_dataset, column, percentile):
  def get_lengths(batch):
    return tokenizer(batch, padding=False, truncation=True, return_length=True)

  lengths = train_dataset.map(get_lengths, input_columns=column, batched=True)['length']
  return int(np.percentile(lengths, percentile)) + 1

ANSWER_SEP = ':::::'
dataset = load_dataset('tweet_qa')
dataset = dataset.map(reformat_for_t5)
dataset = dataset.map(lambda x: {'tgt_texts': ANSWER_SEP.join(x['Answer'])})
# Note: this is a hack. The Seq2SeqTrainer we will use later does not allow us to pass multiple labels to it.
# However, our dataset contains multiple correct answers for each question in the validation part.
# Therefore, we concatenate them here so that we can split them again later in compute_metrics.
# Don't worry about this too much.

max_length = get_max_length(tokenizer, dataset['train'], 'src_texts', PERCENTILES[0])
max_target_length = get_max_length(tokenizer, dataset['train'], 'tgt_texts', PERCENTILES[1])

def tokenize(batch):
  # truncation="only_first" works as well; since the input is not a pair of sequences, True is used here.
  return tokenizer.prepare_seq2seq_batch(src_texts=batch['src_texts'], tgt_texts=batch['tgt_texts'],
                                         max_length=max_length, max_target_length=max_target_length,
                                         truncation=True, padding='max_length')

dataset = dataset.map(tokenize, batched=True)
dataset.set_format('torch', columns=['input_ids', 'labels', 'attention_mask'])

Using custom data configuration default
Reusing dataset tweet_qa (/home/flo/.cache/huggingface/datasets/tweet_qa/default/1.0.0/1d584f8fe9cb9d3d8d9ab63a36a25716b482850adc0e94f5d63c081ea2f78f01)
Loading cached processed dataset at /home/flo/.cache/huggingface/datasets/tweet_qa/default/1.0.0/1d584f8fe9cb9d3d8d9ab63a36a25716b482850adc0e94f5d63c081ea2f78f01/cache-6b3502f4b4cf7d8b.arrow
Loading cached processed dataset at /home/flo/.cache/huggingface/datasets/tweet_qa/default/1.0.0/1d584f8fe9cb9d3d8d9ab63a36a25716b482850adc0e94f5d63c081ea2f78f01/cache-693f8ce9cb991afd.arrow
Loading cached processed dataset at /home/flo/.cache/huggingface/datasets/tweet_qa/default/1.0.0/1d584f8fe9cb9d3d8d9ab63a36a25716b482850adc0e94f5d63c081ea2f78f01/cache-a286464fd86137b0.arrow
Loading cached processed dataset at /home/flo/.cache/huggingface/datasets/tweet_qa/default/1.0.0/1d584f8fe9cb9d3d8d9ab63a36a25716b482850adc0e94f5d63c081ea2f78f01/cache-997d69a0ca36a8fa.arrow
Loading cached processed dataset at /home/f
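The percentile-based length cap computed by `get_max_length` above can be illustrated without a tokenizer. The following is only a sketch: the token counts are invented, and the pure-Python `percentile` helper is a hypothetical stand-in that is intended to mirror the linear interpolation `np.percentile` uses by default.

```python
def percentile(values, q):
    """Percentile with linear interpolation between the two closest ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


def max_length_from_percentile(lengths, q):
    """Same rule as in get_max_length: truncate the percentile and add one."""
    return int(percentile(lengths, q)) + 1


lengths = [10, 12, 15, 20, 100]  # hypothetical token counts with one outlier
cap_80 = max_length_from_percentile(lengths, 80)
cap_100 = max_length_from_percentile(lengths, 100)
num_truncated = sum(length > cap_80 for length in lengths)
print(cap_80, cap_100, num_truncated)
```

With a percentile below 100, a few very long inputs get truncated, but every other example is padded to a much shorter length, which saves computation. For the targets, the notebook uses the 100th percentile so that no answer is cut off.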

# Defining Metrics

We are going to evaluate with the same metrics used by SQuAD, Exact Match and Bag-of-Words F1. They're already available in 🤗 datasets, but we need to get our predictions and answers into shape first.
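To make the two metrics concrete, here is a simplified, self-contained sketch of how Exact Match and Bag-of-Words F1 can be computed for one prediction against several reference answers. It is an illustrative re-implementation under the usual SQuAD-style normalization (lowercasing, removing punctuation and English articles), not the code of the 🤗 metric itself. All function names and the example strings are assumptions, and the scores are on a 0 to 1 scale rather than percentages.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def bow_f1(prediction, reference):
    """Harmonic mean of token-level precision and recall (bag of words)."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def best_over_references(metric, prediction, references):
    """A prediction is scored against its best-matching reference answer."""
    return max(metric(prediction, ref) for ref in references)


references = ["the lakers", "los angeles lakers"]  # hypothetical answers
em = best_over_references(exact_match, "Lakers", references)
f1 = best_over_references(bow_f1, "lakers won", references)
print(em, round(f1, 3))
```

Taking the maximum over the references is why the multiple validation answers have to be recovered by splitting on `ANSWER_SEP` before the metric is called.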

In [35]:
def squad_metrics(eval_prediction):
  # TODO: use the batch_decode function of the tokenizer to decode the predictions and the label_ids  
  # (hint: they're both in the only argument of this function).
  # Remember to set skip_special_tokens so that we don't generate padding tokens.
  predictions = tokenizer.batch_decode(eval_prediction.predictions, skip_special_tokens=True)
  answers = tokenizer.batch_decode(eval_prediction.label_ids, skip_special_tokens=True)

  answers = [answer.split(ANSWER_SEP) for answer in answers]
  assert all([len(ans) == 2 for ans in answers])
  
  predictions = [{'id': str(i), 'prediction_text': pred.strip().lower()}
                 for i, pred in enumerate(predictions)]
  references = [{'id': str(i), 'answers': {'text': ans, 'answer_start': []}}
                for i, ans in enumerate(answers)]

  metric = load_metric('squad')
  metric.add_batch(predictions=predictions, references=references)
  return metric.compute() 

# Training

In [36]:
model = T5ForConditionalGeneration.from_pretrained("t5-small")

In [37]:
training_args = Seq2SeqTrainingArguments(
    output_dir='./results',
    num_train_epochs=NUM_EPOCHS,
    per_device_train_batch_size=TRAIN_BATCH_SIZE,
    per_device_eval_batch_size=EVAL_BATCH_SIZE,
    warmup_steps=WARMUP_STEPS,
    weight_decay=WEIGHT_DECAY,
    logging_dir='./logs/',
    evaluation_strategy="steps",
    logging_steps=LOGGING_STEPS,
    learning_rate=LEARNING_RATE,
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(model, train_dataset=dataset["train"], eval_dataset=dataset["validation"],
                         args=training_args, compute_metrics=squad_metrics, tokenizer=tokenizer
                        )

# Evaluation before Domain Adaptation

First, let's test how T5 performs on tweet_qa without domain adaptation. It should work fairly well, given that it's been trained on SQuAD, which is a very similar task, just on a different domain. For evaluation, we'll perform beam search during generation with a beam size of 2.
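To illustrate what a beam size of 2 changes compared to greedy decoding (a beam size of 1), here is a toy sketch of beam search over a hand-written next-token table. `TOY_MODEL` and `beam_search` are hypothetical names with made-up probabilities. The generation code in the library is considerably more involved (for example, it handles end-of-sequence tokens and batching), so this only shows the core idea.

```python
import math

# Hypothetical next-token distributions, keyed by the prefix generated so far.
TOY_MODEL = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.55, "y": 0.45},
    ("b",): {"x": 0.9, "y": 0.1},
}


def beam_search(model, num_beams, num_steps):
    """Keep the num_beams most probable prefixes after every step."""
    beams = [((), 0.0)]  # (prefix, log-probability)
    for _ in range(num_steps):
        candidates = []
        for prefix, score in beams:
            for token, prob in model[prefix].items():
                candidates.append((prefix + (token,), score + math.log(prob)))
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:num_beams]
    return beams[0]  # best complete hypothesis


greedy_seq, greedy_score = beam_search(TOY_MODEL, num_beams=1, num_steps=2)
beam_seq, beam_score = beam_search(TOY_MODEL, num_beams=2, num_steps=2)
print(greedy_seq, round(math.exp(greedy_score), 2))
print(beam_seq, round(math.exp(beam_score), 2))
```

In this toy example, greedy decoding commits to the locally best first token and ends up with a less probable sequence, whereas the second beam keeps the alternative alive. The price is that every step expands more candidates, so generation takes longer.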

In [38]:
print(trainer.evaluate(num_beams=2))


# Domain Adaptation

Now let's see how much we can improve our performance by finetuning on our in-domain data:

In [39]:
trainer.train()


In [None]:
print(trainer.evaluate())
print(trainer.evaluate(num_beams=2))

# Report

**TODO:** Write a brief report of your results. Include the hyperparameters that you used, the results T5 achieved on the dataset before and after domain adaptation, and why the evaluation during training may have returned different results than the separate evaluations before and after training. Also report on the different evaluation speeds resulting from this, and give a brief explanation why finetuning on in-domain data was helpful for performance.

I used the given default hyperparameters. However, I would try increasing the step size, because the model learns quite fast and therefore does not need that many steps.

T5 before domain adaptation: eval_f1: 56.305619507456, eval_samples_per_second: 68.344

T5 during training: eval_f1: 70.611473, eval_samples_per_second: ~115.907000

T5 after domain adaptation: eval_f1: 70.44426593459734, eval_samples_per_second: ~115.907000

T5 after domain adaptation (beam): eval_f1: 70.71750916364167, eval_samples_per_second: 75.909


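The evaluation during training returns different results because the model is still learning at that point; after enough examples, the model should converge. The separate evaluation before training returns different results because the model had not yet been fine-tuned on our specific data and task. In addition, the separate evaluations with `num_beams=2` use beam search, whereas the evaluation during training uses the default generation settings, which can also change the scores slightly. Overall, the evaluations during and after training do not differ much, and I would say that the difference is not significant.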

My runtime was quite bad at first, because I was not at home and could not use my graphics card; with a GPU it would obviously have been much faster. Training with `.train()` was so slow that I switched to Google Colab and used their GPU, hence the better runtime.

During training, the F1 score gradually improved from 60% to ~70%. The improvement got smaller and smaller over time, until the score reached 70% and showed only minimal improvements or regressions.

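The evaluation speed (eval_samples_per_second) with beam search is similar before and after training, although it is a bit faster after training (75.909 vs. 68.344). I am not sure why this is the case; it seems unlikely that the model itself learns to evaluate faster. The larger difference is between the evaluations with beam search and those without it (~115.907 samples per second during training and after training without `num_beams=2`): with a beam size of 2, more hypotheses have to be expanded at every generation step, so generation is slower.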
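As a quick sanity check on the numbers reported above, the following sketch computes the F1 gain from domain adaptation and the relative evaluation speed. The values are copied from this report. The dictionary keys are made-up labels, and the assumption that the unlabelled "after domain adaptation" row comes from `trainer.evaluate()` without `num_beams=2` is inferred from the order of the calls, not stated in the output.

```python
# Values copied from the report; the ~115.907 throughput is approximate.
results = {
    "before_beam2": {"f1": 56.305619507456, "samples_per_second": 68.344},
    "after_default": {"f1": 70.44426593459734, "samples_per_second": 115.907},
    "after_beam2": {"f1": 70.71750916364167, "samples_per_second": 75.909},
}

# Gain in F1 points from finetuning, comparing the two beam-search runs.
f1_gain = results["after_beam2"]["f1"] - results["before_beam2"]["f1"]

# How much faster evaluation is without beam search (after finetuning).
speed_ratio = (results["after_default"]["samples_per_second"]
               / results["after_beam2"]["samples_per_second"])

# F1 difference that beam search makes after finetuning.
beam_f1_delta = results["after_beam2"]["f1"] - results["after_default"]["f1"]

print(f"F1 gain from domain adaptation: {f1_gain:.2f} points")
print(f"Speed ratio default vs. beam size 2: {speed_ratio:.2f}x")
print(f"F1 difference from beam search: {beam_f1_delta:.2f} points")
```

Comparing like with like (both runs with `num_beams=2`) keeps the effect of finetuning separate from the effect of the decoding method.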