<!-- # COLX 585 Trends in Computational Linguistic -->
##  Tutorial: Multitask RoBERTa

In this tutorial, we implement a multi-task learning model: we fine-tune the pre-trained RoBERTa model jointly on two classification tasks, with the goal of training a single model that performs well on both. Multi-task learning (MTL) is a type of inductive transfer learning ([Caruana, 1997](https://link.springer.com/article/10.1023/A:1007379606734)). MTL learns the target and source tasks jointly through a shared representation in order to improve the target task or all tasks. Generally, MTL uses one of two parameter-sharing approaches: hard sharing and soft sharing. The hard sharing approach shares the hidden layers between tasks and keeps task-specific prediction layers. In soft sharing, each task has its own task-specific hidden and output modules, and the parameters are tied together through additional constraint layers or regularizers.

In this tutorial, we will implement an MTL RoBERTa with a hard sharing strategy. Transformer layers are shared across multiple tasks, and each task has its own prediction layer on top of shared Transformer layers. 

![](https://ruder.io/content/images/2017/05/mtl_images-001-2.png)

Picture Courtesy: https://ruder.io/multi-task/


 


In this tutorial, we will use two annotated datasets: [Sentiment Analysis in Twitter](https://www.aclweb.org/anthology/S17-2088.pdf) (SemEval-2017 Task 4) and [Emotion Recognition](https://www.aclweb.org/anthology/S18-1032.pdf) (SemEval-2018 Task 1).

- The Sentiment Analysis in Twitter task is annotated with the labels "negative", "positive", and "neutral".

- The Emotion Recognition task is annotated with the labels "anger", "joy", "optimism", and "sadness".

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Install packages

In [None]:
! pip install transformers
! pip install sentencepiece


## Import the required Python libraries

In [None]:
import torch, os, json
import pandas as pd
import torch.nn as nn
from tqdm import tqdm, trange
from random import shuffle
import random
import numpy as np
from collections import defaultdict
from itertools import cycle
from torch.utils.data import Dataset, DataLoader, RandomSampler, SequentialSampler
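# import only the names used in this tutorial instead of a wildcard import
from transformers import AdamW, RobertaModel, RobertaTokenizerFast, get_linear_schedule_with_warmup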
from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score


In [None]:
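## Set the random seeds and the working device
manual_seed = 77
random.seed(manual_seed)
np.random.seed(manual_seed)  # np.random.choice is used later to sample tasks
torch.manual_seed(manual_seed)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)
n_gpu = torch.cuda.device_count()
if n_gpu > 0:
    torch.cuda.manual_seed_all(manual_seed)
    # only query the GPU name when a GPU is actually available
    print(torch.cuda.get_device_name(0))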

cuda
Tesla K80


### Define the dataset class and the data preparation function.

The custom dataset should inherit from [`Dataset`](https://pytorch.org/tutorials/beginner/data_loading_tutorial.html#dataset-class) and define the following methods:
  * `__len__`, so that `len(dataset)` returns the size of the dataset.
  * `__getitem__`, to support indexing so that `dataset[i]` returns the $i$-th sample.
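
As a minimal sketch of this two-method protocol, the toy class below implements `__len__` and `__getitem__` in plain Python, without PyTorch or a tokenizer. The class name `ToyDataset` and its sample data are hypothetical and only illustrate what `len(dataset)` and `dataset[i]` rely on.

```python
class ToyDataset:
    """Hypothetical stand-in that follows the same protocol as a PyTorch Dataset."""

    def __init__(self, texts, labels):
        assert len(texts) == len(labels)
        self.texts = texts
        self.labels = labels

    def __len__(self):
        # called by len(dataset)
        return len(self.texts)

    def __getitem__(self, index):
        # called by dataset[index]
        return {"text": self.texts[index], "label": self.labels[index]}


toy = ToyDataset(["so happy today", "this is awful"], ["joy", "anger"])
print(len(toy))
print(toy[1])
```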

In [None]:
class CustomDataset(Dataset):
    # initialization
    def __init__(self, dataframe, tokenizer, max_len, lab2ind):
        """
          dataframe: pandas DataFrame.
          tokenizer: Hugging Face BERT/RoBERTa tokenizer
          max_len: maximal length of input sequence
          lab2ind: dictionary of label classes
        """
        self.tokenizer = tokenizer
        self.data = dataframe
        self.comment_text = self.data.content
        self.labels = self.data.label
        self.max_len = max_len
        self.lab2ind = lab2ind

    # get the size of the dataset
    def __len__(self):
        return len(self.comment_text)

    # generate sample by index
    def __getitem__(self, index):
        # get ith sample and label
        comment_text = str(self.comment_text[index])
        label = str(self.labels[index])

        label = self.lab2ind[label]
        # use encode_plus() of Transformers to tokenize and vectorize the input sequence and convert it to tensors.
        # this method truncates or pads the sequence to the maximal length and then returns PyTorch tensors.
        inputs = self.tokenizer.encode_plus(
            comment_text,
            None,
            add_special_tokens=True,
            padding="max_length",
            truncation=True,
            max_length=self.max_len,
            return_token_type_ids=False,
            return_tensors = "pt"
        )
        return {
            'ids': inputs['input_ids'].squeeze(0),  # shape of input_ids: [1, max_length]
            'masks': inputs['attention_mask'].squeeze(0), # shape of attention_mask: [1, max_length]
            'targets': torch.tensor(label, dtype=torch.long)
        }
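
The padding and truncation that `encode_plus()` performs with `padding="max_length"` and `truncation=True` can be pictured with the simplified sketch below. It works on a list of token ids that is assumed to be already tokenized; `pad_or_truncate` is a hypothetical helper, not part of Transformers, and it ignores details such as keeping the closing special token when truncating. The padding id of 1 is the `<pad>` id of the `roberta-base` vocabulary.

```python
def pad_or_truncate(token_ids, max_len, pad_id=1):
    """Return (ids, mask), both of length max_len; mask is 1 for real tokens and 0 for padding."""
    ids = list(token_ids[:max_len])          # truncate sequences that are too long
    mask = [1] * len(ids)
    n_pad = max_len - len(ids)
    ids += [pad_id] * n_pad                  # pad sequences that are too short
    mask += [0] * n_pad
    return ids, mask


short_ids, short_mask = pad_or_truncate([0, 100, 200, 2], max_len=6)
long_ids, long_mask = pad_or_truncate([0, 100, 200, 300, 400, 2], max_len=4)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

The attention mask is what lets the model ignore the padded positions, which is why the dataset returns it alongside the ids.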

### Define a function to load datasets and create data iterators.


In [None]:
def regular_encode(file_path, tokenizer, lab2ind, shuffle=True, num_workers = 2, batch_size=64, maxlen = 32, mode = 'train'): 
    '''
      file_path: path to your dataset file
      tokenizer: tokenizer object
      lab2ind: label-to-index dictionary
      shuffle: whether to shuffle the dataset
      num_workers: number of worker processes used for data loading
      batch_size: number of samples per batch
      maxlen: maximal sequence length
      mode: the type of dataset, either 'train' or 'predict'
    '''
    # if we are in train mode, we will load two columns (i.e., text and label).
    if mode == 'train':
        # Use pandas to load dataset, the dataset should be a tsv file where the first line is the header.
        df = pd.read_csv(file_path, delimiter='\t',header=0, encoding='utf-8', quotechar=None, quoting=3)
    
    # if we are in predict mode, we will load one column (i.e., text).
    elif mode == 'predict':
        df = pd.read_csv(file_path, delimiter='\t',header=0)
    else:
        # fail loudly: returning None would break the tuple unpacking at the call site
        raise ValueError("mode should be either 'train' or 'predict'.")
        
    print("{} Dataset: {}".format(file_path, df.shape))
    # instantiate the dataset instance 
    custom_set = CustomDataset(df, tokenizer, maxlen,lab2ind)
    num_samples = len(custom_set)
    num_labels = len(lab2ind)

    dataset_params = {'batch_size': batch_size, 'shuffle': shuffle, 'num_workers': num_workers}

    batch_data_loader = DataLoader(custom_set, **dataset_params)
    # return a data iterator
    return batch_data_loader, num_samples, num_labels
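
The `quoting=3` argument passed to `pd.read_csv` corresponds to `csv.QUOTE_NONE`, which treats quotation marks as ordinary characters. The sketch below uses only the standard `csv` module and a made-up two-row file to suggest why this matters for tweets. The column names `content` and `label` follow the attributes that `CustomDataset` reads, while the rows themselves are invented.

```python
import csv
import io

# A made-up TSV file in which two tweets contain unbalanced quotation marks.
raw = 'content\tlabel\n"I love this\tjoy\nso sad today"\tsadness\n'

# QUOTE_NONE (== 3): quotation marks are kept as literal characters.
rows_literal = list(csv.DictReader(io.StringIO(raw), delimiter="\t", quoting=csv.QUOTE_NONE))

# Default quoting: the text between the two quotation marks is read as ONE field,
# so the two rows are merged into a single, corrupted row.
rows_default = list(csv.DictReader(io.StringIO(raw), delimiter="\t"))

print(len(rows_literal), rows_literal[0])
print(len(rows_default), rows_default[0])
```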

### Create an optimizer and scheduler.

In [None]:
def create_optimizer_and_scheduler(model, num_training_steps, warmup_steps, learning_rate):
    """
    Setup the optimizer and the learning rate scheduler.
    num_training_steps: the number of training steps
    warmup_steps: the number of warm-up steps
    learning_rate: the peak learning rate
    """
    optimizer = AdamW(
        model.parameters(),
        lr=learning_rate
    )
    
    lr_scheduler = get_linear_schedule_with_warmup(
        optimizer, 
        num_warmup_steps=warmup_steps, 
        num_training_steps=num_training_steps
    )

    return optimizer, lr_scheduler
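
To make the shape of this schedule concrete, the sketch below reimplements the learning-rate multiplier of a linear warm-up followed by a linear decay in plain Python. It is an illustration rather than the library code; `linear_warmup_multiplier` is a hypothetical name, and the step counts in the usage lines are example values.

```python
def linear_warmup_multiplier(step, num_warmup_steps, num_training_steps):
    """Factor by which the peak learning rate is scaled at a given step."""
    if step < num_warmup_steps:
        # linear increase from 0 to 1 during warm-up
        return step / max(1, num_warmup_steps)
    # linear decrease from 1 to 0 over the remaining steps
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


peak_lr = 2e-5
for step in (0, 50, 100, 550, 1000):
    print(step, peak_lr * linear_warmup_multiplier(step, 100, 1000))
```

The learning rate therefore reaches its peak exactly at the end of warm-up and returns to zero at the last training step.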

### Define the hyperparameters and I/O directories.

* In `/content/drive/MyDrive/Colab Notebooks/multitask`, I have two task folders: `emotion-2018task1/` and `sentiment-2017task4/`.

* Each task folder includes four files: `train.tsv`, `dev.tsv`, `test.tsv`, and `label2ind.json`. 

* `train.tsv`, `dev.tsv`, `test.tsv` are datasets. `label2ind.json` is the label mapping dictionary. 
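
The exact contents of `label2ind.json` are not shown in this tutorial. As an assumption for illustration, such a file could map each label string to an integer class index, as in the sketch below. The index assigned to each emotion label here is hypothetical.

```python
import json

# Hypothetical label-to-index mapping for the emotion task (index order is an assumption).
emotion_lab2ind = {"anger": 0, "joy": 1, "optimism": 2, "sadness": 3}

# Serialize as it might be stored in label2ind.json, then read it back.
json_text = json.dumps(emotion_lab2ind, indent=2)
loaded = json.loads(json_text)

# The inverse mapping is handy for turning predicted indices back into label strings.
ind2lab = {index: label for label, index in loaded.items()}
print(json_text)
print(ind2lab[2])
```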

In [None]:
input_dir = "/content/drive/MyDrive/Colab Notebooks/multitask"
output_dir = "./mtl-rb/"
task_names = ["emotion-2018task1","sentiment-2017task4"]
model_name_path = "roberta-base"

max_seq_length = 64
train_batch_size = 32
eval_batch_size = 128
hidden_size = 768

lr = 2e-5
max_grad_norm = 1.0
warmup_proportion = 0.1
num_train_epochs = 5

## Build Multi-task architecture 

1. We share the Transformer encoder layers (i.e., the RoBERTa layers) across all tasks. The Transformer encoder layers encode each input sequence. We use RoBERTa's `pooler_output`, which is computed from the last layer's hidden state of the first token (`<s>`, RoBERTa's counterpart of BERT's `[CLS]`), as the sequence representation.

2. Each task has a task-specific feed-forward neural network (FFNN) with two layers: a dense layer with a tanh non-linearity, followed by a linear output layer. We refer to this task-specific FFNN as a task-specific classification module.

3. For each input sequence, we pass the sequence representation through the corresponding classification module to get the prediction. Note that each input sequence belongs to exactly one task in this tutorial.

### Build task-specific classification module

Each task-specific classification module mirrors the classification layer of the single-task BERT model: a two-layer FFNN consisting of a dense layer with tanh activation and dropout, followed by a linear layer that outputs the logits.

In [None]:
class CLS_LAYER(nn.Module):
    def __init__(self, label_num, hidden_size):
        super(CLS_LAYER, self).__init__()
        self.hidden_size = hidden_size
        self.label_num = label_num
        
        self.dense = nn.Linear(self.hidden_size, self.hidden_size)
        self.dropout = nn.Dropout(0.1)

        # the output dimension is the number of classes in the task.
        self.fc = nn.Linear(self.hidden_size, self.label_num)
        # initialization
        initial_module(self.dense)
        initial_module(self.fc)
  
    def forward(self, pooler_output):
        
        x = self.dense(pooler_output)
        x = torch.tanh(x)
        x = self.dropout(x)
        logits = self.fc(x)

        return logits   


The weights of each classification module are initialized from a normal distribution with mean $0.0$ and standard deviation $0.02$, and the biases are initialized to 0.
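
As a rough illustration of what this initialization produces, the sketch below draws a small weight matrix from the same distribution using only the standard library and checks its empirical statistics. The helper `init_linear` and the layer sizes are hypothetical; the tutorial itself uses `torch.nn.init`.

```python
import random
import statistics

def init_linear(in_features, out_features, std=0.02, seed=0):
    """Return (weight, bias) with weights ~ N(0, std^2) and zero biases."""
    rng = random.Random(seed)
    weight = [[rng.gauss(0.0, std) for _ in range(in_features)] for _ in range(out_features)]
    bias = [0.0] * out_features
    return weight, bias


weight, bias = init_linear(in_features=768, out_features=4)
flat = [w for row in weight for w in row]
print(round(statistics.mean(flat), 4), round(statistics.pstdev(flat), 4))
```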

In [None]:
def initial_module(module):
    torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
    torch.nn.init.constant_(module.bias, 0)

We add the task-specific classification modules to the shared RoBERTa layers. 

* `classifier_layers` is a list of classification modules. Its size is the number of tasks. 

* As in single-task BERT, we use `pooler_output` as the sequence-level representation.

* `task_id` is the identifier that indicates the task type and routes the computation to the corresponding classification module. In our example, the task id of "emotion-2018task1" is 0, and that of "sentiment-2017task4" is 1.
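
The routing by `task_id` can be pictured with the toy sketch below, in which plain Python functions stand in for the shared encoder and the classification modules. All names (`shared_encoder`, `make_head`, `heads`, `forward`) and the arithmetic are hypothetical placeholders; only the control flow mirrors the hard-sharing design described above.

```python
def shared_encoder(tokens):
    """Stand-in for the shared layers: one feature vector per input, whatever the task."""
    return [len(tokens), sum(len(t) for t in tokens)]

def make_head(num_labels):
    """Stand-in for a task-specific module: returns one score per label of its task."""
    def head(features):
        return [features[0] * (k + 1) + features[1] for k in range(num_labels)]
    return head

# one head per task: task 0 has 4 labels (emotion), task 1 has 3 labels (sentiment)
heads = [make_head(4), make_head(3)]

def forward(tokens, task_id):
    features = shared_encoder(tokens)   # shared across tasks
    return heads[task_id](features)     # selected by task_id

print(len(forward(["so", "happy"], task_id=0)))
print(len(forward(["so", "happy"], task_id=1)))
```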



In [None]:
class MT_BERT(nn.Module):
    def __init__(self, model_name_path, classifier_layers):
        super(MT_BERT, self).__init__()

        self.bert_model = RobertaModel.from_pretrained(model_name_path)
        self.classifiers = nn.ModuleList(classifier_layers)

    def forward(self, input_ids, input_mask, task_id):
        outputs = self.bert_model(input_ids = input_ids, attention_mask = input_mask)
        pooler_output = outputs['pooler_output']
        
        # select classification module according to the task index
        logits = self.classifiers[task_id](pooler_output)
        
        return logits  

In [None]:
def create_model(model_name_path, label_list, hidden_size):
    # create a classification module for each task
    classification_layers = [CLS_LAYER(len(task_label2ind), hidden_size) for task_label2ind in label_list]

    model = MT_BERT(model_name_path, classification_layers)
    return model

### Load all label mapping dictionaries. 

In [None]:
all_lab2ind = []
for task in task_names:
    # a context manager closes the file even if json.load() raises
    with open(os.path.join(input_dir, task, "label2ind.json")) as tmp_file:
        lab2ind = json.load(tmp_file)
    all_lab2ind.append(lab2ind)

In [None]:
# load RoBERTa tokenizer by its shortcut name.
tokenizer = RobertaTokenizerFast.from_pretrained(model_name_path)


### Prepare Train, Dev and Test dataloaders for each task. 

* Each task has its own Train, Dev and Test dataloaders.
* `train_loaders` is a list containing the training dataloaders of all tasks. It has 2 items (dataloaders) in our experiment.
* Like `train_loaders`, `valid_loaders` and `test_loaders` are lists of dataloaders.
* `data_sizes` is a list of the sizes of the training sets.


In [None]:
train_loaders = []
valid_loaders = []
test_loaders = []

data_sizes = []
total_training_batch = 0
for i, task in enumerate(task_names):
    lab2ind = all_lab2ind[i]
    ##############################
    train_loader, num_samples, num_label  = regular_encode(os.path.join(os.path.join(input_dir, task), "train.tsv"), tokenizer, lab2ind, shuffle=True, batch_size=train_batch_size, maxlen = max_seq_length)
    
    data_sizes.append(num_samples)
    total_training_batch += len(train_loader)
    train_loaders.append(iter(train_loader))
    
    ##############################
    valid_loader, _, _  = regular_encode(os.path.join(os.path.join(input_dir, task), "dev.tsv"), tokenizer, lab2ind, shuffle=False, batch_size=eval_batch_size, maxlen = max_seq_length)
    valid_loaders.append(valid_loader)
    
    ###############################
    test_loader, _, _  = regular_encode(os.path.join(os.path.join(input_dir, task), "test.tsv"), tokenizer, lab2ind, shuffle=False, batch_size=eval_batch_size, maxlen = max_seq_length)
    test_loaders.append(test_loader)


/content/drive/MyDrive/Colab Notebooks/multitask/emotion-2018task1/train.tsv Dataset: (3257, 3)
/content/drive/MyDrive/Colab Notebooks/multitask/emotion-2018task1/dev.tsv Dataset: (374, 3)
/content/drive/MyDrive/Colab Notebooks/multitask/emotion-2018task1/test.tsv Dataset: (1421, 3)
/content/drive/MyDrive/Colab Notebooks/multitask/sentiment-2017task4/train.tsv Dataset: (42756, 3)
/content/drive/MyDrive/Colab Notebooks/multitask/sentiment-2017task4/dev.tsv Dataset: (4751, 3)
/content/drive/MyDrive/Colab Notebooks/multitask/sentiment-2017task4/test.tsv Dataset: (12284, 3)


In [None]:
print("total number of training batches:", total_training_batch)

total number of training batches: 1439
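
The total of 1439 can be checked by hand from the dataset sizes printed above and `train_batch_size = 32`, assuming (as is the `DataLoader` default) that the last, smaller batch of each dataset is kept. The same arithmetic gives the number of training and warm-up steps used later for the scheduler.

```python
import math

train_sizes = {"emotion-2018task1": 3257, "sentiment-2017task4": 42756}
train_batch_size = 32
num_train_epochs = 5
warmup_proportion = 0.1

# number of batches per task when the last partial batch is kept
batches = {task: math.ceil(n / train_batch_size) for task, n in train_sizes.items()}
total_training_batch = sum(batches.values())

num_training_steps = total_training_batch * num_train_epochs
num_warmup_steps = int(num_training_steps * warmup_proportion)

print(batches)
print(total_training_batch, num_training_steps, num_warmup_steps)
```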


We use `itertools.cycle()` to turn each training dataloader into an infinite iterator.

In [None]:
train_loaders = [cycle(it) for it in train_loaders]
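
One property of `itertools.cycle()` is worth knowing: it stores the items from the first pass and replays them, so the batches repeat in the same order in later passes instead of being reshuffled. The sketch below contrasts this with a small generator that re-iterates the loader on every pass. `ShufflingLoader` is a hypothetical stand-in for a `DataLoader` with `shuffle=True`, and `infinite` is an illustrative helper rather than a library function.

```python
import random
from itertools import cycle, islice

class ShufflingLoader:
    """Stand-in for DataLoader(shuffle=True): every iteration yields a new random order."""
    def __init__(self, items, seed=0):
        self.items = list(items)
        self.rng = random.Random(seed)

    def __iter__(self):
        order = self.items[:]
        self.rng.shuffle(order)
        return iter(order)

def infinite(loader):
    """Yield batches forever, calling iter(loader) again at the start of each pass."""
    while True:
        for batch in loader:
            yield batch

# cycle() caches the first pass, so the second pass repeats it exactly
cached = list(islice(cycle(iter(ShufflingLoader(range(6)))), 12))
# infinite() restarts the loader, so each pass is shuffled independently
fresh = list(islice(infinite(ShufflingLoader(range(6))), 12))

print(cached[:6], cached[6:])
print(fresh[:6], fresh[6:])
```

Whether the fixed order matters in practice depends on the experiment; the tutorial itself keeps `cycle()`.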

### Define the `train()` and `evaluate()` functions.

In [None]:
def train(model, optimizer, scheduler, loss_func, data_sizes, num_per_epoch, train_loaders):
    '''
    model: multi-task model
    optimizer: AdamW optimizer
    scheduler: learning rate scheduler
    loss_func: loss function
    data_sizes: a list of sizes of training sets
    num_per_epoch: training steps of each epoch
    train_loaders: a list of (infinite) training dataloader iterators
    '''
    model.train()

    # accumulate the training loss of each task
    tr_loss = [0. for _ in range(len(data_sizes))]

    # At each step, we sample a training dataloader to generate a batch.
    # The sampling probability is proportional to the size of each task's training set.
    total_sample = sum(data_sizes)
    probs = [p/total_sample for p in data_sizes]

    for step in range(num_per_epoch):
        # Select a training dataloader by the sampling probability.
        task_id = int(np.random.choice(len(data_sizes), p=probs))

        # Generate a batch of the selected task.
        batch = next(train_loaders[task_id])

        # load data batch
        input_ids = batch['ids'].to(device)
        input_mask = batch['masks'].to(device)
        labels = batch['targets'].to(device)

        # forward
        optimizer.zero_grad()
        outputs = model(input_ids, input_mask, task_id)
        loss = loss_func(outputs, labels)
        tr_loss[task_id] += loss.item()

        # delete used variables to free GPU memory
        del batch, input_ids, input_mask, labels

        loss.backward()

        # clip the gradients before the parameter update, not after it
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
        optimizer.step()
        scheduler.step()

        # free GPU memory (device is a torch.device, so compare its type)
        if device.type == 'cuda':
            torch.cuda.empty_cache()

    return tr_loss
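
The size-proportional task sampling used in `train()` can be examined in isolation. The sketch below uses the training-set sizes reported earlier and the standard-library `random.choices` in place of `np.random.choice`; the number of simulated steps and the seed are arbitrary choices.

```python
import random
from collections import Counter

data_sizes = [3257, 42756]           # emotion, sentiment training-set sizes
total = sum(data_sizes)
probs = [n / total for n in data_sizes]

rng = random.Random(77)
num_steps = 1439                     # one epoch of training steps
draws = rng.choices(range(len(data_sizes)), weights=probs, k=num_steps)
counts = Counter(draws)

print([round(p, 4) for p in probs])
print(counts[0], counts[1])
```

With these sizes, roughly 7% of the steps draw an emotion batch, so the smaller task is visited far less often than the larger one.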

### Create an evaluation function.

In [None]:
def evaluate(model, iterator, loss_func, task_id):
    
    model.eval()
    
    epoch_loss = 0
    all_pred = []
    all_label = []
    
    with torch.no_grad():
        
        for i, batch in enumerate(iterator):

            input_ids = batch['ids'].to(device)
            input_mask = batch['masks'].to(device)
            labels = batch['targets'].to(device)

            outputs = model(input_ids, input_mask, task_id)

            loss = loss_func(outputs, labels)
            # accumulate the batch loss so that the average loss can be returned
            epoch_loss += loss.item()
            # delete used variables to free GPU memory
            del batch, input_ids, input_mask

            # identify the predicted class for each example in the batch
            predicted = torch.argmax(outputs, dim=1)
            # collect all the true labels and predictions as plain Python ints
            all_pred.extend(predicted.cpu().tolist())
            all_label.extend(labels.cpu().tolist())
    
    accuracy = accuracy_score(all_label, all_pred)
    f1score = f1_score(all_label, all_pred, average='macro') 

    return epoch_loss / len(iterator), accuracy, f1score
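
For reference, the macro-averaged F1 returned by `f1_score(..., average='macro')` is the unweighted mean of the per-class F1 scores. The sketch below computes accuracy and macro-F1 by hand on a made-up example; `accuracy` and `macro_f1` are hypothetical helpers written for illustration, and they consider only the labels that occur in either list.

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the labels that occur in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        scores.append(2 * tp / (2 * tp + fp + fn))   # F1 = 2TP / (2TP + FP + FN)
    return sum(scores) / len(scores)


y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print(round(accuracy(y_true, y_pred), 4), round(macro_f1(y_true, y_pred), 4))
```

Because every class counts equally, macro-F1 can sit well below accuracy when a rare class is predicted poorly, which is consistent with the gap between the two metrics on the emotion task.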

### Instantiate our multi-task model  

In [None]:
model = create_model(model_name_path, all_lab2ind, hidden_size).to(device)


Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.layer_norm.bias', 'lm_head.dense.weight', 'lm_head.bias', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.decoder.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


### Create an optimizer and scheduler. 

### Define a loss function, i.e., CrossEntropyLoss().

In [None]:
num_training_steps  = total_training_batch * num_train_epochs
num_warmup_steps = int(num_training_steps * warmup_proportion)  # step counts should be integers

In [None]:
optimizer, scheduler = create_optimizer_and_scheduler(model, num_training_steps, num_warmup_steps, lr)
loss_func = nn.CrossEntropyLoss()
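
`nn.CrossEntropyLoss()` expects unnormalized logits and integer class indices, which is why the classification modules return logits without a softmax. With its default settings it applies a log-softmax and averages the negative log-likelihood over the batch. The plain-Python sketch below reproduces that computation on made-up numbers; `cross_entropy` is a hypothetical helper, not the PyTorch implementation.

```python
import math

def cross_entropy(logits, targets):
    """Mean negative log-likelihood of the target classes under softmax(logits)."""
    losses = []
    for row, target in zip(logits, targets):
        m = max(row)                                          # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        losses.append(log_sum_exp - row[target])              # -log softmax(row)[target]
    return sum(losses) / len(losses)


logits = [[2.0, 0.5, -1.0], [0.0, 0.0, 0.0]]
targets = [0, 2]
print(round(cross_entropy(logits, targets), 4))
```

For uniform logits over three classes the loss of a single example is $\log 3$, a useful sanity value for an untrained three-way classifier.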


## Model training

In [None]:
all_result_acc_dev = defaultdict(list)
all_result_loss_dev = defaultdict(list)
all_result_f1_dev = defaultdict(list)

os.makedirs(output_dir, exist_ok=True)

for epoch in trange(num_train_epochs, desc="Epoch"):
    text_file = open(os.path.join(output_dir,"results.txt"), "a")
    _ = train(model, optimizer, scheduler, loss_func, data_sizes, total_training_batch, train_loaders)  
    
    # Evaluate at end of each epoch and save the evaluation results to a txt file.
    text_file.write(' Epoch [{}/{}]\n'.format(epoch+1, num_train_epochs))

    for i, task in enumerate(task_names): 
        val_loss, val_acc, val_f1 = evaluate(model, valid_loaders[i], loss_func, i)
        
        all_result_acc_dev[task].append(val_acc)
        all_result_loss_dev[task].append(val_loss)
        all_result_f1_dev[task].append(val_f1)


        text_file.write(' Task {}:\n Validation Accuracy: {:.6f}, Validation F1: {:.6f}\n'.format(task, val_acc, val_f1))
        print(' Task {}:\n Validation Accuracy: {:.6f}, Validation F1: {:.6f}\n'.format(task, val_acc, val_f1))

    text_file.write("\n\n")
    text_file.close()

    final_result = {}
    final_result["all_result_acc_dev"] = all_result_acc_dev
    final_result["all_result_loss_dev"] = all_result_loss_dev
    final_result["all_result_f1_dev"] = all_result_f1_dev

    torch.save(final_result, os.path.join(output_dir, "all_res.pt"))
    
    # Create a model checkpoint at end of each epoch
    # (unwrap the model first if it has been wrapped, e.g., in nn.DataParallel)
    if hasattr(model, "module"):
        state_dict_model = model.module.state_dict()
    else:
        state_dict_model = model.state_dict()

    state = {
    'epoch': epoch,
    'state_dict': state_dict_model,
    'optimizer': optimizer.state_dict(),
    'scheduler': scheduler.state_dict()
    }
    
    torch.save(state, os.path.join(output_dir,"mt{}_{}.pt".format(len(task_names),str(epoch+1))))


Epoch:   0%|          | 0/5 [00:00<?, ?it/s]

 Task emotion-2018task1:
 Validation Accuracy: 0.780749, Validation F1: 0.705866

 Task sentiment-2017task4:
 Validation Accuracy: 0.713955, Validation F1: 0.695989



Epoch:  20%|██        | 1/5 [16:32<1:06:11, 992.87s/it]

 Task emotion-2018task1:
 Validation Accuracy: 0.802139, Validation F1: 0.736963

 Task sentiment-2017task4:
 Validation Accuracy: 0.728899, Validation F1: 0.721395



Epoch:  40%|████      | 2/5 [33:06<49:40, 993.57s/it]  

 Task emotion-2018task1:
 Validation Accuracy: 0.810160, Validation F1: 0.747092

 Task sentiment-2017task4:
 Validation Accuracy: 0.732898, Validation F1: 0.725183



Epoch:  60%|██████    | 3/5 [49:38<33:05, 992.86s/it]

 Task emotion-2018task1:
 Validation Accuracy: 0.791444, Validation F1: 0.737231

 Task sentiment-2017task4:
 Validation Accuracy: 0.729952, Validation F1: 0.722917



Epoch:  80%|████████  | 4/5 [1:06:12<16:33, 993.11s/it]

 Task emotion-2018task1:
 Validation Accuracy: 0.804813, Validation F1: 0.743751

 Task sentiment-2017task4:
 Validation Accuracy: 0.733740, Validation F1: 0.725674



Epoch: 100%|██████████| 5/5 [1:22:44<00:00, 992.92s/it]


## Testing and Inference

* You should select the best model for each task based on validation performance, then load the checkpoint of that model and evaluate it on the test set.
* The tasks may obtain their best results at different epochs, so the best model of each task is tested separately.

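For example, in the run above the emotion recognition task obtains its best validation F1 score (0.747092) at epoch 3, which corresponds to the checkpoint `mt2_3.pt`. The cells below load the final checkpoint, `mt2_5.pt`, and evaluate it on the test set of the emotion recognition task.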
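
Selecting a checkpoint per task can be sketched as follows, using the validation F1 scores printed during training above. The helper `best_epoch` is hypothetical; the checkpoint names follow the `mt{number of tasks}_{epoch}.pt` pattern used when saving.

```python
# Validation F1 per epoch, copied from the training log above.
val_f1 = {
    "emotion-2018task1": [0.705866, 0.736963, 0.747092, 0.737231, 0.743751],
    "sentiment-2017task4": [0.695989, 0.721395, 0.725183, 0.722917, 0.725674],
}

def best_epoch(scores):
    """Return the 1-based epoch with the highest score."""
    return max(range(len(scores)), key=lambda i: scores[i]) + 1

num_tasks = len(val_f1)
for task, scores in val_f1.items():
    epoch = best_epoch(scores)
    print(task, epoch, "mt{}_{}.pt".format(num_tasks, epoch))
```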

In [None]:
model = create_model(model_name_path, all_lab2ind, hidden_size)



In [None]:
checkpoint = torch.load("./mtl-rb/mt2_5.pt", map_location='cpu')
model.load_state_dict(checkpoint['state_dict'])
model = model.to(device)

In [None]:
task_id = 0
test_loss, test_acc, test_f1 = evaluate(model, test_loaders[task_id], loss_func, task_id)
print(' Task {}:\n Test Accuracy: {:.6f}, Test F1: {:.6f}\n'.format(task_names[task_id], test_acc, test_f1))

 Task emotion-2018task1:
 Test Accuracy: 0.815623, Test F1: 0.781675



## Library for Transformer-based multi-task learning

[JIANT](https://github.com/nyu-mll/jiant)


## References

* Caruana, R. (1997). Multitask learning. Machine learning, 28(1), 41-75.
* Liu, X., He, P., Chen, W., & Gao, J. (2019, July). Multi-Task Deep Neural Networks for Natural Language Understanding. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (pp. 4487-4496).