# COLX 585 Trends in Computational Linguistics
## Finetuning T5 model 

### Goal of this tutorial:
- Understand the background of the Text-to-Text Transfer Transformer (T5) model.
- Learn how to finetune a T5 model for sentiment analysis.

### References
Some useful references:
1. T5 Original Paper https://arxiv.org/pdf/1910.10683.pdf
2. T5 HuggingFace documentation https://huggingface.co/transformers/model_doc/t5.html
3. T5 model card https://huggingface.co/t5-base 
4. T5 blog from Google AI https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html (material for T5 background is borrowed from this blog)


### T5 model - Background


<img src="https://1.bp.blogspot.com/-89OY3FjN0N0/XlQl4PEYGsI/AAAAAAAAFW4/knj8HFuo48cUFlwCHuU5feQ7yxfsewcAwCLcBGAsYHQ/s1600/image2.png" height="250" width="550"/>


The Text-to-Text Transfer Transformer (T5) model is an encoder-decoder model pretrained to fill in dropped-out spans of text (denoted by \<M\>) from documents in a large-scale unlabeled dataset. With T5, all NLP tasks can be reframed into a unified text-to-text format where the input and output are always text strings, in contrast to BERT-style models that can only output either a class label or a span of the input. T5's text-to-text framework allows us to use the same model, loss function, and hyperparameters on any NLP task, including machine translation, document summarization, question answering, and classification tasks (e.g., sentiment analysis). One can even apply T5 to regression tasks by training it to predict the string representation of a number instead of the number itself.
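To make the span-filling objective concrete, here is a minimal sketch of how one training pair could be built. It assumes the sentinel naming used by the HuggingFace T5 tokenizer (`<extra_id_0>`, `<extra_id_1>`, ...) in place of \<M\>. The helper `corrupt_spans` and the hand-picked span positions are hypothetical; real pretraining samples the spans at random.

```python
def corrupt_spans(tokens, spans):
    """Replace each (start, end) token span with a sentinel and build the target.

    `spans` must be sorted, non-overlapping, half-open index pairs.
    """
    inputs, targets = [], []
    cursor = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        inputs.extend(tokens[cursor:start])  # keep the text before the span
        inputs.append(sentinel)              # mark the dropped span in the input
        targets.append(sentinel)             # the same sentinel introduces the span in the target
        targets.extend(tokens[start:end])
        cursor = end
    inputs.extend(tokens[cursor:])           # keep the text after the last span
    targets.append(f"<extra_id_{len(spans)}>")  # a final sentinel closes the target
    return " ".join(inputs), " ".join(targets)


tokens = "Thank you for inviting me to your party last week .".split()
corrupted, target = corrupt_spans(tokens, [(2, 4), (8, 9)])
print(corrupted)  # Thank you <extra_id_0> me to your party <extra_id_1> week .
print(target)     # <extra_id_0> for inviting <extra_id_1> last <extra_id_2>
```

The model reads the corrupted string and is trained to generate the target string, so even pretraining is a text-to-text task.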

<img src="https://1.bp.blogspot.com/-o4oiOExxq1s/Xk26XPC3haI/AAAAAAAAFU8/NBlvOWB84L0PTYy9TzZBaLf6fwPGJTR0QCLcBGAsYHQ/s1600/image3.gif" height="250" width="550"/>

In the above illustration, T5 is flexibly finetuned on several (diverse) supervised tasks:
- **Machine Translation** - Translate a sentence from English to German
- **Sentence Acceptability (CoLA)** - Classify whether a given sentence is grammatically and syntactically acceptable
- **Semantic Textual Similarity (STS)** - Predict how similar two given sentences are (regression task)
- **Summarization** - Summarize a given passage
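As a rough sketch, each of these tasks becomes a pair of plain strings, with a short task prefix at the start of the input. The prefixes and sentences below follow the illustration from the T5 paper and blog. The helper `stsb_target` is a hypothetical name; it rounds a similarity score to the nearest 0.2 before turning it into a string, which is the rounding described in the T5 paper. The summarization strings are placeholders, not real data.

```python
def stsb_target(score):
    """Turn a similarity score into a string target, rounded to the nearest 0.2."""
    return f"{round(score * 5) / 5:.1f}"


examples = [
    ("translate English to German: That is good.", "Das ist gut."),
    ("cola sentence: The course is jumping well.", "not acceptable"),
    (
        "stsb sentence1: The rhino grazed on the grass. "
        "sentence2: A rhino is grazing in a field.",
        stsb_target(3.8),
    ),
    ("summarize: <passage text>", "<summary text>"),  # placeholders
]

for source, target in examples:
    print(f"{source!r} -> {target!r}")
```

Because the regression target is just the string `"3.8"`, the same model and loss can be used for all four tasks.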

### T5 model - Finetuning on sentiment analysis task

In this tutorial, we focus on finetuning the T5 model on a sentiment analysis task: classifying the sentiment of a tweet. We use the dataset provided by ``SemEval-2016 Task 4 on Sentiment Analysis on Twitter`` (http://alt.qcri.org/semeval2016/task4/), and specifically subtask A, the **message polarity classification task**. In this task, given a tweet, we need to predict whether the tweet has **positive, negative or neutral sentiment**. We have 6,000, 1,999 and 20,632 tweets in the train, validation, and test sets respectively. We have already preprocessed the tweets (tokenization, removing URLs, mentions, hashtags and so on) and placed them under the ``data/sentiment-twitter-2016-task4`` folder in three files: ``train.tsv``, ``dev.tsv`` and ``test.tsv``. Some example tweets include:

| class index | class name | tweet example |
| ----------------- | ----------- |-------------|
| 0  | Negative   | --MENTION-- --MENTION-- the reason i ask is because it may be the manufacturer's fault and they could help you |
| 1  | Neutral | just ordered my ever tablet --MENTION-- surface pro --DIGIT-- ssd hopefully it works out for dev to replace my laptop |
| 2  | Positive | dear --MENTION-- the newooffice for mac is great and all but no lync update c'mon |

This tutorial assumes the data can be found at: `/content/drive/MyDrive/Colab Notebooks/sentiment-twitter-2016-task4`.
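Before converting anything, it can help to look at how the classes are distributed. The sketch below assumes each line of a `.tsv` file holds a tweet and a class index separated by a tab, with the class indices from the table above. The helper `label_distribution` is a hypothetical name, not part of the tutorial code.

```python
import os
from collections import Counter

# class indices as listed in the table above
LABEL_NAMES = {"0": "negative", "1": "neutral", "2": "positive"}


def label_distribution(tsv_path):
    """Count tweets per sentiment class in a `tweet<TAB>class index` file."""
    counts = Counter()
    with open(tsv_path, encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            # split on the last tab so the tweet text stays intact
            _tweet, _, label = line.rpartition("\t")
            counts[LABEL_NAMES.get(label.strip(), "unknown")] += 1
    return counts


train_path = "/content/drive/MyDrive/Colab Notebooks/sentiment-twitter-2016-task4/train.tsv"
if os.path.exists(train_path):  # only runs where the data is actually mounted
    print(label_distribution(train_path))
```

A heavily skewed distribution is worth knowing about when interpreting the scores later on.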


#### Install all dependencies




In [21]:
!pip install transformers
!pip install datasets
!pip install sacrebleu
!pip install sentencepiece



#### HuggingFace's run_seq2seq.py

We will use the T5 implementation provided by HuggingFace to finetune T5 for sentiment analysis.

The **run_seq2seq.py** code provides an implementation for fine-tuning a T5 model and evaluating a trained T5 checkpoint. This tutorial assumes the code can be found at this path: `/content/drive/MyDrive/Colab Notebooks/run_seq2seq.py`.

Let's inspect some of the arguments the code takes in:
- **task** - Name of the task. Set it to `translation_en_to_en`, as it doesn't matter for text classification.
- **train_file** - Path to the training data file. The code accepts two formats: jsonlines and csv. We will convert our sentiment dataset into jsonlines format.
- **validation_file** - Path to the validation data file. The code accepts two formats: jsonlines and csv. We will convert our sentiment dataset into jsonlines format.
- **test_file** - Path to the test data file. The code accepts two formats: jsonlines and csv. We will convert our sentiment dataset into jsonlines format.
- **text_column** - Name of the column in the jsonlines dataset that corresponds to the input text (the tweet in our setting). We will set it to "input_text".
- **summary_column** - Name of the column in the jsonlines dataset that corresponds to the target text (the sentiment label in our setting). We will set it to "target_text".
- **model_name_or_path** - Model's shortcut name or path to the pretrained model. For fine-tuning, we will set it to "t5-base". For evaluation, we will set it to the path of the saved checkpoint.
- **do_train** - Whether to run training or not. Set it during training.
- **num_train_epochs** - Total number of training epochs to perform.
- **output_dir** - Output directory where the model predictions and checkpoints will be written.
- **save_steps** - Number of update steps between two checkpoint saves (default: 500).
- **save_total_limit** - If a value is passed, limits the total number of checkpoints by deleting the older checkpoints in **output_dir**.
- **predict_with_generate**, **do_predict** - Whether to generate the target text (the sentiment label in our case) for validation and test or not.

For the full set of training arguments (e.g., learning rate, maximum steps, batch size), look [here](https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments).
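The arguments above add up to a long command line, so one option is to keep them in a dictionary and assemble the command programmatically. This is only a sketch: `build_command` is a hypothetical helper, and the data and checkpoint paths are the locations assumed by this tutorial.

```python
import shlex


def build_command(script_path, options, flags):
    """Assemble `python script --key value ... --flag` as a list of arguments."""
    command = ["python", script_path]
    for name, value in options.items():
        command += [f"--{name}", str(value)]
    command += [f"--{name}" for name in flags]  # boolean switches take no value
    return command


command = build_command(
    "/content/drive/MyDrive/Colab Notebooks/run_seq2seq.py",
    options={
        "task": "translation_en_to_en",
        "text_column": "input_text",
        "summary_column": "target_text",
        "train_file": "/content/text2text-sentiment/train.json",
        "validation_file": "/content/text2text-sentiment/dev.json",
        "test_file": "/content/text2text-sentiment/test.json",
        "model_name_or_path": "t5-base",
        "num_train_epochs": 3,
        "save_total_limit": 5,
        "output_dir": "/content/sentiment-ckpts",
    },
    flags=["do_train", "do_eval", "do_predict", "predict_with_generate"],
)

# shlex.join quotes arguments, so the space in "Colab Notebooks" survives the shell
print(shlex.join(command))
```

The resulting list can also be passed directly to `subprocess.run(command, check=True)`, which avoids shell quoting altogether.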







#### Convert the dataset to jsonlines

Our original sentiment data is in tsv format. For example, the first training sample in the dataset:

dear \<<\<MENTION>>> the newooffice for mac is great and all but no lync update c'mon \<TAB-SPACE>  2

We can convert the sample to jsonlines format:

{"input_text": "dear <<<MENTION>>> the newooffice for mac is great and all but no lync update c'mon", "target_text": "positive"}

Note that the **input_text** field contains the original tweet (that corresponds to the value we use to set **text_column**) and the **target_text** field contains the sentiment label (that corresponds to the value we use to set **summary_column**).

Let's convert the original dataset to jsonlines format now:



In [22]:
import json
import os

def convert_sentiment_dataset_to_text2text_format(original_folder, destination_folder):
  # create the jsonlines directory if it doesn't exist
  os.makedirs(destination_folder, exist_ok=True)
  for src_file in ["train.tsv", "dev.tsv", "test.tsv"]:
    src_path = os.path.join(original_folder, src_file)
    dst_path = os.path.join(destination_folder, src_file.split(".")[0] + ".json")
    # the with-statement closes both files even if a line fails to parse
    with open(src_path) as tsv_file, open(dst_path, "w") as t5_file:
      for line in tsv_file:
        # read tsv line: tweet<TAB>class index
        tweet, sentiment = line.strip().split("\t")
        # prepare json
        t5_out = {}
        t5_out["input_text"] = tweet
        # map the class index to the label string T5 should generate
        if sentiment == "0":
          t5_out["target_text"] = "negative"
        elif sentiment == "1":
          t5_out["target_text"] = "neutral"
        else:
          t5_out["target_text"] = "positive"
        # write one json object per line
        t5_file.write(json.dumps(t5_out))
        t5_file.write("\n")

# assumes "/content/drive/MyDrive/Colab Notebooks/sentiment-twitter-2016-task4" contains original data
# writes the jsonlines data to "/content/text2text-sentiment"
convert_sentiment_dataset_to_text2text_format("/content/drive/MyDrive/Colab Notebooks/sentiment-twitter-2016-task4", "/content/text2text-sentiment")
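
After the conversion, a quick sanity check can catch malformed records before a long training run. The sketch below is an optional extra, not part of the original pipeline. `check_jsonlines` is a hypothetical helper that verifies every line has exactly the two expected fields and one of the three label strings.

```python
import json
import os


def check_jsonlines(path, allowed_targets=("negative", "neutral", "positive")):
    """Return the number of records; raise ValueError on the first malformed one."""
    count = 0
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            record = json.loads(line)
            if set(record) != {"input_text", "target_text"}:
                raise ValueError(f"{path}:{line_number}: unexpected keys {sorted(record)}")
            if record["target_text"] not in allowed_targets:
                raise ValueError(
                    f"{path}:{line_number}: unexpected label {record['target_text']!r}"
                )
            count += 1
    return count


for split in ["train", "dev", "test"]:
    json_path = f"/content/text2text-sentiment/{split}.json"
    if os.path.exists(json_path):  # only runs where the converted data exists
        print(split, check_jsonlines(json_path))
```

The printed counts should match the dataset sizes quoted earlier (6,000, 1,999 and 20,632).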


#### Fine-tuning T5 model

That's all the preparation needed. We can now use the **run_seq2seq.py** script to finetune the T5 model.

In [25]:
!rm -rf /content/sentiment-ckpts # ensure the directory to store the checkpoints is empty
!python "/content/drive/MyDrive/Colab Notebooks/run_seq2seq.py" --task translation_en_to_en --text_column input_text --summary_column target_text --train_file /content/text2text-sentiment/train.json --validation_file /content/text2text-sentiment/dev.json --do_predict --predict_with_generate --test_file /content/text2text-sentiment/test.json --save_total_limit 5 --num_train_epochs 3 --output_dir /content/sentiment-ckpts --model_name_or_path t5-base --do_train --do_eval

2021-04-03 03:10:56.964951: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
04/03/2021 03:10:58 - INFO - __main__ -   Training/evaluation parameters Seq2SeqTrainingArguments(output_dir='/content/sentiment-ckpts', overwrite_output_dir=False, do_train=True, do_eval=True, do_predict=True, evaluation_strategy=<IntervalStrategy.NO: 'no'>, prediction_loss_only=False, per_device_train_batch_size=8, per_device_eval_batch_size=8, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=5e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=3.0, max_steps=-1, lr_scheduler_type=<SchedulerType.LINEAR: 'linear'>, warmup_ratio=0.0, warmup_steps=0, logging_dir='runs/Apr03_03-10-58_485fd034beda', logging_strategy=<IntervalStrategy.STEPS: 'steps'>, logging_first_step=False, logging_steps=500, save

The above run saves a checkpoint every **save_steps** steps (default 500) and keeps only the latest **save_total_limit** checkpoints. The run doesn't print the validation performance during training, so it is harder to monitor the training process. One workaround is to save checkpoints frequently (as disk space permits) and evaluate each checkpoint after training.
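To see which checkpoints a run will produce, the arithmetic can be sketched as follows. It uses the values from this tutorial (6,000 training tweets, batch size 8, 3 epochs, **save_steps** of 500) and assumes a single device with no gradient accumulation. `checkpoint_steps` is a hypothetical helper.

```python
import math


def checkpoint_steps(num_examples, batch_size, num_epochs, save_steps):
    """Return (total update steps, steps at which a checkpoint is saved)."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * num_epochs
    return total_steps, list(range(save_steps, total_steps + 1, save_steps))


total, saved = checkpoint_steps(num_examples=6000, batch_size=8, num_epochs=3, save_steps=500)
print(total)  # 2250
print(saved)  # [500, 1000, 1500, 2000]
```

With these settings only four intermediate checkpoints are written, all of which fit within a **save_total_limit** of 5. Passing a smaller `--save_steps` value gives a finer-grained view of training at the cost of more disk space.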

#### Evaluating T5 model checkpoint

Let's evaluate the saved checkpoints. We will use micro-averaged F1 (F1-micro) as the metric to evaluate the quality of the classifier.
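As a reminder of what the metric measures, here is a small pure-Python sketch of micro-averaged F1. It pools true positives, false positives and false negatives over all classes before computing precision and recall. The function `micro_f1` and the toy labels are made up for illustration; the tutorial itself relies on scikit-learn's `f1_score`.

```python
from collections import Counter


def micro_f1(gold, pred):
    """Micro-averaged F1 over all classes for single-label predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # class p was predicted, but wrongly
            fn[g] += 1  # an instance of class g was missed
    total_tp, total_fp, total_fn = sum(tp.values()), sum(fp.values()), sum(fn.values())
    if total_tp == 0:
        return 0.0
    precision = total_tp / (total_tp + total_fp)
    recall = total_tp / (total_tp + total_fn)
    return 2 * precision * recall / (precision + recall)


# toy labels, invented for illustration
gold = ["positive", "neutral", "negative", "neutral", "positive"]
pred = ["positive", "neutral", "neutral", "neutral", "negative"]
accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
print(micro_f1(gold, pred), accuracy)  # both are 0.6 up to floating-point rounding
```

When every example has exactly one gold label and one predicted label, each error counts once as a false positive and once as a false negative, so micro-averaged F1 coincides with plain accuracy.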

In [26]:
import json
from sklearn.metrics import f1_score

def read_gold_labels(path):
  # read gold labels from a jsonlines file
  with open(path) as f:
    return [json.loads(line.strip())["target_text"] for line in f]

def read_pred_labels(path):
  # read predicted labels, one per line
  with open(path) as f:
    return [line.strip() for line in f]

def evaluate_checkpoint(prediction_dir, data_dir, metric="f1_micro", do_val=True, do_test=True):
  # (split name, gold file, prediction file suffix) for each split to evaluate
  splits = []
  if do_val:
    splits.append(("validation", "dev.json", "_val_preds_seq2seq.txt"))
  if do_test:
    splits.append(("test", "test.json", "_test_preds_seq2seq.txt"))
  for split_name, gold_file, pred_suffix in splits:
    gold_labels = read_gold_labels(data_dir + "/" + gold_file)
    pred_labels = read_pred_labels(prediction_dir + pred_suffix)
    # compute metric
    if metric == "f1_micro":
      score = f1_score(gold_labels, pred_labels, average="micro")
      print("%s %s F1-micro: %.2f" % (prediction_dir.split("/")[-1], split_name, score))


Let's compute the validation and test performance of the checkpoint saved after the 500th step: `/content/sentiment-ckpts/checkpoint-500`

In [29]:
!python "/content/drive/MyDrive/Colab Notebooks/run_seq2seq.py" --model_name_or_path /content/sentiment-ckpts/checkpoint-500 --task translation_en_to_en --text_column input_text --summary_column target_text --train_file /content/text2text-sentiment/train.json --validation_file /content/text2text-sentiment/dev.json --test_file /content/text2text-sentiment/test.json --do_predict --predict_with_generate --output_dir /content/sentiment-ckpts --do_eval
evaluate_checkpoint("/content/sentiment-ckpts/checkpoint-500", "/content/text2text-sentiment", metric="f1_micro", do_val=True, do_test=True)

2021-04-03 03:35:16.166289: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
04/03/2021 03:35:17 - INFO - __main__ -   Training/evaluation parameters Seq2SeqTrainingArguments(output_dir='/content/sentiment-ckpts', overwrite_output_dir=False, do_train=False, do_eval=True, do_predict=True, evaluation_strategy=<IntervalStrategy.NO: 'no'>, prediction_loss_only=False, per_device_train_batch_size=8, per_device_eval_batch_size=8, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=5e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=3.0, max_steps=-1, lr_scheduler_type=<SchedulerType.LINEAR: 'linear'>, warmup_ratio=0.0, warmup_steps=0, logging_dir='runs/Apr03_03-35-17_485fd034beda', logging_strategy=<IntervalStrategy.STEPS: 'steps'>, logging_first_step=False, logging_steps=500, sav

Let's compute the validation and test performance of the checkpoint saved after the 2,000th step: `/content/sentiment-ckpts/checkpoint-2000`

In [27]:
!python "/content/drive/MyDrive/Colab Notebooks/run_seq2seq.py" --model_name_or_path /content/sentiment-ckpts/checkpoint-2000 --task translation_en_to_en --text_column input_text --summary_column target_text --train_file /content/text2text-sentiment/train.json --validation_file /content/text2text-sentiment/dev.json --test_file /content/text2text-sentiment/test.json --do_predict --predict_with_generate --output_dir /content/sentiment-ckpts --do_eval
evaluate_checkpoint("/content/sentiment-ckpts/checkpoint-2000", "/content/text2text-sentiment", metric="f1_micro", do_val=True, do_test=True)

2021-04-03 03:21:08.817697: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
04/03/2021 03:21:09 - INFO - __main__ -   Training/evaluation parameters Seq2SeqTrainingArguments(output_dir='/content/sentiment-ckpts', overwrite_output_dir=False, do_train=False, do_eval=True, do_predict=True, evaluation_strategy=<IntervalStrategy.NO: 'no'>, prediction_loss_only=False, per_device_train_batch_size=8, per_device_eval_batch_size=8, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=5e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=3.0, max_steps=-1, lr_scheduler_type=<SchedulerType.LINEAR: 'linear'>, warmup_ratio=0.0, warmup_steps=0, logging_dir='runs/Apr03_03-21-09_485fd034beda', logging_strategy=<IntervalStrategy.STEPS: 'steps'>, logging_first_step=False, logging_steps=500, sav

Let's compute the validation and test performance of the final checkpoint: `/content/sentiment-ckpts`

In [28]:
!python "/content/drive/MyDrive/Colab Notebooks/run_seq2seq.py" --model_name_or_path /content/sentiment-ckpts --task translation_en_to_en --text_column input_text --summary_column target_text --train_file /content/text2text-sentiment/train.json --validation_file /content/text2text-sentiment/dev.json --test_file /content/text2text-sentiment/test.json --do_predict --predict_with_generate --output_dir /content/sentiment-ckpts --do_eval
evaluate_checkpoint("/content/sentiment-ckpts/t5-base", "/content/text2text-sentiment", metric="f1_micro", do_val=True, do_test=True)

2021-04-03 03:28:06.882525: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
04/03/2021 03:28:08 - INFO - __main__ -   Training/evaluation parameters Seq2SeqTrainingArguments(output_dir='/content/sentiment-ckpts', overwrite_output_dir=False, do_train=False, do_eval=True, do_predict=True, evaluation_strategy=<IntervalStrategy.NO: 'no'>, prediction_loss_only=False, per_device_train_batch_size=8, per_device_eval_batch_size=8, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=5e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=3.0, max_steps=-1, lr_scheduler_type=<SchedulerType.LINEAR: 'linear'>, warmup_ratio=0.0, warmup_steps=0, logging_dir='runs/Apr03_03-28-08_485fd034beda', logging_strategy=<IntervalStrategy.STEPS: 'steps'>, logging_first_step=False, logging_steps=500, sav

That's it!