# COLX 585 Trends in Computational Linguistics

## Lab tutorial: RoBERTa with Adapter Module

Traditional fine-tuning effectively transfers the knowledge of pre-trained language models (e.g., BERT and RoBERTa) to a downstream task. However, full fine-tuning is parameter-inefficient: it backpropagates through all layers and updates every parameter of the model, which is very time-consuming. Hence, the [adapter module](http://proceedings.mlr.press/v97/houlsby19a/houlsby19a.pdf) was introduced to make fine-tuning more efficient while keeping performance close to that of full fine-tuning.

As the following figure shows, the adapter module is an additional module inserted after each pre-trained sublayer of the standard Transformer architecture. Each adapter is a small feed-forward neural network whose hidden size is much smaller than the Transformer's hidden size. During downstream fine-tuning, we freeze the parameters of the pre-trained layers and optimize only the parameters of the adapter modules, which significantly reduces the number of trainable parameters.




![](https://miro.medium.com/max/570/0*Z2FMWTCmdkgevHr-.png)

Picture Courtesy: Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., De Laroussilhe, Q., Gesmundo, A., ... & Gelly, S. (2019, May). [Parameter-efficient transfer learning for NLP](http://proceedings.mlr.press/v97/houlsby19a/houlsby19a.pdf). In International Conference on Machine Learning (pp. 2790-2799). PMLR.
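
To make the description above concrete, here is a minimal PyTorch sketch of a bottleneck adapter: a down-projection, a non-linearity, an up-projection, and a residual connection. The class name `BottleneckAdapter`, the bottleneck size of 48, the ReLU activation, and the zero initialization of the up-projection are illustrative assumptions, not the exact implementation used by the library in this tutorial.

```python
import torch
import torch.nn as nn


class BottleneckAdapter(nn.Module):
    """Illustrative adapter: down-project, non-linearity, up-project, residual."""

    def __init__(self, hidden_size: int = 768, bottleneck_size: int = 48):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck_size)
        self.activation = nn.ReLU()  # assumption: any non-linearity could be used here
        self.up = nn.Linear(bottleneck_size, hidden_size)
        # Assumption: zero-initialize the up-projection so the adapter starts as the identity map.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # The residual connection passes the pre-trained representation through unchanged.
        return hidden_states + self.up(self.activation(self.down(hidden_states)))


adapter = BottleneckAdapter()
x = torch.randn(2, 32, 768)  # (batch size, sequence length, hidden size)
y = adapter(x)
n_params = sum(p.numel() for p in adapter.parameters())
print(y.shape, n_params)
```

With hidden size d and bottleneck size m, such an adapter has 2md + d + m parameters including biases, which is small compared with a Transformer layer when m is much smaller than d.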

In this tutorial, we use an off-the-shelf PyTorch library, [adapter-transformers](https://github.com/Adapter-Hub/adapter-transformers), to add the adapter module to RoBERTa. 

In [2]:
# This tutorial uses adapter-transformers==1.1.1 together with torch==1.4.0.
# adapter-transformers is a fork of Hugging Face transformers and provides the `transformers` package itself,
# so transformers is not installed separately.
!pip install torch==1.4.0
!pip install adapter-transformers==1.1.1

Collecting torch==1.4.0
  Downloading https://files.pythonhosted.org/packages/1a/3b/fa92ece1e58a6a48ec598bab327f39d69808133e5b2fb33002ca754e381e/torch-1.4.0-cp37-cp37m-manylinux1_x86_64.whl (753.4MB)
ERROR: torchvision 0.9.1+cu101 has requirement torch==1.8.1, but you'll have torch 1.4.0 which is incompatible.
ERROR: torchtext 0.9.1 has requirement torch==1.8.1, but you'll have torch 1.4.0 which is incompatible.
Installing collected packages: torch
  Found existing installation: torch 1.8.1+cu101
    Uninstalling torch-1.8.1+cu101:
      Successfully uninstalled torch-1.8.1+cu101
Successfully installed torch-1.4.0
Collecting adapter-transformers==1.1.1
  Downloading https://files.pythonhosted.org/packages/9e/8a/5a4cd4ed09201f76d5eb6d7a36231bc98da2bfa28e2d03c7abfafcdf6baf/adapter_transformers-1.1.1-py3-none-any.whl (1.3MB)

## Import required Python libraries

In [3]:
import os
import random  # used by set_seed below

import numpy as np  # used by set_seed below
import pandas as pd
import torch
import torch.nn as nn
from tqdm import tqdm, trange

from torch.utils.data import Dataset, DataLoader, RandomSampler, SequentialSampler
from transformers import RobertaConfig, RobertaModelWithHeads, RobertaTokenizer, AdapterType, AdamW, get_linear_schedule_with_warmup
from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score


## Set the random seed and working device

In [4]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
n_gpu = torch.cuda.device_count()
print(device)


def set_seed(seed):
    # Set the random seed
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if n_gpu > 0:
        torch.cuda.manual_seed_all(seed)

cuda


### Define data generator class and preparation function.

The custom dataset should inherit [`Dataset`](https://pytorch.org/tutorials/beginner/data_loading_tutorial.html#dataset-class) and define the following methods:
  * `__len__`, so that `len(dataset)` returns the size of the dataset.
  * `__getitem__`, to support indexing so that `dataset[i]` returns the $i$-th sample.
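
As a minimal illustration of this protocol, the sketch below implements the two methods on a plain Python class. `ToyDataset` and its example sentences are hypothetical; the real class in the next cell additionally inherits from `Dataset` and tokenizes each text.

```python
class ToyDataset:
    """Hypothetical stand-in for a map-style dataset: only __len__ and __getitem__ are needed."""

    def __init__(self, texts, labels):
        if len(texts) != len(labels):
            raise ValueError("texts and labels must have the same length")
        self.texts = texts
        self.labels = labels

    def __len__(self):
        # called by len(dataset)
        return len(self.texts)

    def __getitem__(self, index):
        # called by dataset[index]
        return {"text": self.texts[index], "label": self.labels[index]}


# made-up examples, for illustration only
toy = ToyDataset(["I had lunch with a friend.", "I finished my run."], ["yes", "no"])
print(len(toy))
print(toy[1])
```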

In [5]:
class CustomDataset(Dataset):
    # initialization
    def __init__(self, dataframe, tokenizer, max_len, lab2ind):
        """
          dataframe: pandas DataFrame.
          tokenizer: Hugging Face BERT/RoBERTa tokenizer
          max_len: maximal length of input sequence
          lab2ind: dictionary of label classes
        """
        self.tokenizer = tokenizer
        self.data = dataframe
        self.comment_text = self.data.content
        self.labels = self.data.label
        self.max_len = max_len
        self.lab2ind = lab2ind

    # get the size of the dataset
    def __len__(self):
        return len(self.comment_text)

    # generate sample by index
    def __getitem__(self, index):
        # get ith sample and label
        comment_text = str(self.comment_text[index])
        label = str(self.labels[index])

        label = self.lab2ind[label]
        # use encode_plus() of Transformers to tokenize the input sequence and convert it to token ids.
        # this method truncates or pads the sequence to the maximal length and then returns PyTorch tensors.
        inputs = self.tokenizer.encode_plus(
            comment_text,
            None,
            add_special_tokens=True,
            padding="max_length",
            truncation=True,
            max_length=self.max_len,
            return_token_type_ids=False,
            return_tensors = "pt"
        )

        return {
            'ids': inputs['input_ids'],
            'mask': inputs['attention_mask'],
            'targets': torch.tensor(label, dtype=torch.long)
        }
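
The padding and truncation step performed inside `__getitem__` can be pictured with the simplified sketch below. The helper `pad_or_truncate` and the token ids are hypothetical; the padding id of 1 is assumed from the `padding_idx=1` shown in the RoBERTa embedding layer. Unlike the real tokenizer, this toy version does not treat special tokens separately when truncating.

```python
def pad_or_truncate(token_ids, max_len, pad_id=1):
    """Return (ids, attention_mask), both exactly max_len long."""
    ids = token_ids[:max_len]          # truncate sequences that are too long
    mask = [1] * len(ids)              # 1 marks a real token
    n_pad = max_len - len(ids)
    # pad short sequences; 0 in the mask marks padding positions
    return ids + [pad_id] * n_pad, mask + [0] * n_pad


short_ids, short_mask = pad_or_truncate([0, 713, 16, 2], max_len=6)
long_ids, long_mask = pad_or_truncate(list(range(10, 20)), max_len=6)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

The attention mask is what lets the model ignore padded positions, which is why the dataset returns both `ids` and `mask`.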

### Define a function to load datasets and create data iterators.

In [6]:
def regular_encode(file_path, tokenizer, lab2ind, shuffle=True, num_workers = 2, batch_size=64, maxlen = 32, mode = 'train'): 
    '''
      file_path: path to your dataset file
      tokenizer: tokenizer instance
      lab2ind: label-to-index dictionary
      shuffle: whether to shuffle the dataset
      num_workers: number of worker processes used for data loading
      batch_size: number of examples per batch
      maxlen: maximal sequence length
      mode: either 'train' or 'predict'
    '''
    # in train mode, load two columns (i.e., text and label).
    if mode == 'train':
        # use pandas to load the dataset; it should be a TSV file whose first line is the header.
        df = pd.read_csv(file_path, delimiter='\t', header=0, names=['content','label'], encoding='utf-8', quotechar=None, quoting=3)

    # in predict mode, the file is read with the same two column names but with pandas' default quoting.
    elif mode == 'predict':
        df = pd.read_csv(file_path, delimiter='\t', header=0, names=['content', 'label'])
    else:
        raise ValueError("mode should be either 'train' or 'predict'.")
        
    print("{} Dataset: {}".format(file_path, df.shape))
    # instantiate the dataset instance 
    custom_set = CustomDataset(df, tokenizer, maxlen,lab2ind)
    
    dataset_params = {'batch_size': batch_size, 'shuffle': shuffle, 'num_workers': num_workers}

    batch_data_loader = DataLoader(custom_set, **dataset_params)
    # return a data iterator
    return batch_data_loader

### Training and evaluation functions.

In [7]:
def train(model, iterator, optimizer, scheduler, criterion):
    
    model.train()
    epoch_loss = 0.0
    
    for _, batch in enumerate(iterator):
        # load data batch
        input_ids = batch['ids'].to(device, dtype = torch.long)
        input_mask = batch['mask'].to(device, dtype = torch.long)
        labels = batch['targets'].to(device, dtype = torch.long)
        # forward
        outputs = model(input_ids, input_mask, labels=labels)
        loss, logits = outputs[:2]
        
        # delete used variables to free GPU memory
        del batch, input_ids, input_mask, labels
        optimizer.zero_grad()
        # backward
        if torch.cuda.device_count() == 1:
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
            epoch_loss += loss.cpu().item()
        else:
            loss.mean().backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
            epoch_loss += loss.mean().cpu().item()

        optimizer.step()
        scheduler.step()
    
    # free cached GPU memory (device is a torch.device, so compare its type rather than the object itself)
    if device.type == 'cuda':
        torch.cuda.empty_cache()

    return epoch_loss / len(iterator)


def evaluate(model, iterator, criterion):
    model.eval()
    epoch_loss = 0

    all_pred=[]
    all_label = []

    with torch.no_grad():
        for _, batch in enumerate(iterator, 0):
            # Add batch to GPU
            input_ids = batch['ids'].to(device, dtype = torch.long)
            input_mask = batch['mask'].to(device, dtype = torch.long)
            labels = batch['targets'].to(device, dtype = torch.long)
            # forward
            outputs = model(input_ids, input_mask, labels=labels)
            loss, logits = outputs[:2]

            # delete used variables to free GPU memory
            del batch, input_ids, input_mask

            if torch.cuda.device_count() == 1:
                epoch_loss += loss.cpu().item()
            else:
                epoch_loss += loss.sum().cpu().item()
            # identify the predicted class (index of the largest logit) for each example in the batch
            _, predicted = torch.max(logits.cpu().data, 1)
            # collect the true labels and predictions as plain Python ints
            all_pred.extend(predicted.tolist())
            all_label.extend(labels.cpu().tolist())
    # computing metrics 
    accuracy = accuracy_score(all_label, all_pred)
    f1score = f1_score(all_label, all_pred, average='macro') 
    recall = recall_score(all_label, all_pred, average='macro')
    precision = precision_score(all_label, all_pred, average='macro')

    return epoch_loss/len(iterator), accuracy, f1score, recall, precision
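
The `average='macro'` setting used in `evaluate` computes each metric per class and then takes the unweighted mean over classes. The pure-Python sketch below illustrates this for F1. The function `macro_f1` and the toy labels are hypothetical and only meant to show the arithmetic, not to replace scikit-learn.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)


# toy example: class 0 has F1 = 2/3, class 1 has F1 = 0.8
y_true = [1, 1, 1, 0]
y_pred = [1, 1, 0, 0]
score = macro_f1(y_true, y_pred)
print(round(score, 4))
```

Because every class contributes equally, macro averaging is less dominated by the majority class than accuracy is.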

### Create an optimizer and scheduler.

The model is trained with a linear learning rate [scheduler](https://huggingface.co/transformers/main_classes/optimizer_schedules.html#transformers.get_linear_schedule_with_warmup): during a warmup period the learning rate increases linearly from 0 to the peak learning rate, and afterwards it decreases linearly from the peak back to 0.

![](https://huggingface.co/transformers/_images/warmup_linear_schedule.png)
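
The shape of this schedule can be sketched as a multiplier applied to the peak learning rate. The function name `linear_warmup_multiplier`, the step counts (10 warmup steps out of 100), and the peak value are illustrative assumptions chosen to keep the numbers easy to read.

```python
def linear_warmup_multiplier(step, num_warmup_steps, num_training_steps):
    """Factor in [0, 1] that scales the peak learning rate at a given step."""
    if step < num_warmup_steps:
        # warmup: rise linearly from 0 to 1
        return step / max(1, num_warmup_steps)
    # decay: fall linearly from 1 to 0 over the remaining steps
    return max(0.0, (num_training_steps - step) / max(1, num_training_steps - num_warmup_steps))


peak_lr = 3e-4  # illustrative peak learning rate
for step in (0, 5, 10, 55, 100):
    print(step, peak_lr * linear_warmup_multiplier(step, 10, 100))
```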



In [8]:
def create_optimizer_and_scheduler(model, num_training_steps, warmup_steps, learning_rate):
    """
    Setup the optimizer and the learning rate scheduler.
    num_training_steps: the number of training steps
    warmup_steps: the number of warm-up steps
    learning_rate: the peak learning rate
    """
    optimizer = AdamW(
        model.parameters(),
        lr=learning_rate
    )
    
    lr_scheduler = get_linear_schedule_with_warmup(
        optimizer, 
        num_warmup_steps=warmup_steps, 
        num_training_steps=num_training_steps
    )

    return optimizer, lr_scheduler

### Train Adapter


Load the `roberta-base` model and tokenizer by their shortcut name.

In [14]:

model_name = "roberta-base"
tokenizer = RobertaTokenizer.from_pretrained(model_name)
model = RobertaModelWithHeads.from_pretrained(model_name)


Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModelWithHeads: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight']
- This IS expected if you are initializing RobertaModelWithHeads from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModelWithHeads from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaModelWithHeads were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.embeddings.position_ids']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and infere

In [18]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The roberta-base has {count_parameters(model):,} trainable parameters')

The roberta-base has 124,645,632 trainable parameters


Add a task-specific adapter module for the sociality classification task, which is a single-text classification task. Calling `train_adapter(["social"])` freezes all Transformer parameters so that only the parameters of the `social` adapter are optimized.

In [19]:
model.add_adapter("social", AdapterType.text_task)
model.train_adapter(["social"])

Add the classification head, i.e., a two-layer feed-forward neural network, on top of the Transformer layers. 

The method `model.set_active_adapters([["social"]])` registers the `social` adapter as a default for training. 

In [20]:
model.add_classification_head("social", num_labels=2)
model.set_active_adapters([["social"]])

Send model to device (CPU/GPU).

In [21]:
model = model.to(device)

In [22]:
print(model)

RobertaModelWithHeads(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(50265, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0): RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): Layer

In [23]:
print(f'The social adapter RoBERTa model has {count_parameters(model):,} trainable parameters')

The social adapter RoBERTa model has 1,486,658 trainable parameters


The number of trainable parameters decreases significantly: from 124,645,632 to 1,486,658, about 1.2% of the full model.
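
As a sanity check, the reported count can be reproduced with simple arithmetic. The decomposition below is an assumption inferred from the numbers rather than something stated by the library: one bottleneck adapter of size 48 per encoder layer (12 layers, hidden size 768), plus a classification head made of a 768-to-768 layer followed by a 768-to-2 layer.

```python
hidden_size = 768   # RoBERTa-base hidden size, as shown in the printed model
num_layers = 12     # number of encoder layers in RoBERTa-base
bottleneck = 48     # assumed adapter bottleneck size (hidden_size / 16)
num_labels = 2      # binary classification head


def linear_params(n_in, n_out):
    # weights plus biases of a fully connected layer
    return n_in * n_out + n_out


adapter_per_layer = linear_params(hidden_size, bottleneck) + linear_params(bottleneck, hidden_size)
adapters_total = num_layers * adapter_per_layer
head_total = linear_params(hidden_size, hidden_size) + linear_params(hidden_size, num_labels)
trainable = adapters_total + head_total

print(adapter_per_layer, adapters_total, head_total, trainable)
print(f"{trainable / 124_645_632:.2%} of the full model")
```

Under these assumptions the adapters account for 894,528 parameters and the head for 592,130, which together match the 1,486,658 trainable parameters printed above.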

### Training

Specify hyper-parameters and load datasets.

We freeze all the Transformer layers and optimize only the parameters of the adapter modules, which are newly added and randomly initialized. Hence, we use a relatively large learning rate (i.e., 3e-4).

In [29]:
lab2ind = {'no': 0, 'yes': 1}
batch_size = 32
max_seq_length = 32
num_epochs = 5
warmup_proportion = 0.1
learning_rate = 3e-4
max_grad_norm = 1.0
data_dir = "./drive/My Drive/Colab Notebooks/happy_db/"

train_file = os.path.join(data_dir, "train.tsv")
dev_file = os.path.join(data_dir, "dev.tsv")
test_file = os.path.join(data_dir, "test.tsv")

Create data iterators.

In [30]:
train_dataloader = regular_encode(train_file, tokenizer, lab2ind, shuffle=True, batch_size=batch_size, maxlen = max_seq_length)
validation_dataloader = regular_encode(dev_file, tokenizer, lab2ind, shuffle=False, batch_size=batch_size, maxlen = max_seq_length)
test_dataloader = regular_encode(test_file, tokenizer, lab2ind, shuffle=False, batch_size=batch_size, maxlen = max_seq_length)


./drive/My Drive/Colab Notebooks/happy_db/train.tsv Dataset: (8448, 2)
./drive/My Drive/Colab Notebooks/happy_db/dev.tsv Dataset: (1056, 2)
./drive/My Drive/Colab Notebooks/happy_db/test.tsv Dataset: (1056, 2)


Optimizer, scheduler, and loss function.

In [31]:
num_training_steps = len(train_dataloader) * num_epochs
# number of warmup steps as a whole number of optimizer steps
num_warmup_steps = int(num_training_steps * warmup_proportion)
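
For the settings in this tutorial, the step counts work out as follows. This is a small worked example that assumes the 8,448 training examples reported when the data was loaded and a `DataLoader` that keeps the last partial batch (its default behavior).

```python
import math

num_examples = 8448       # size of train.tsv as printed above
batch_size = 32
num_epochs = 5
warmup_proportion = 0.1

# one optimizer step per batch; ceil accounts for a possible partial last batch
steps_per_epoch = math.ceil(num_examples / batch_size)
num_training_steps = steps_per_epoch * num_epochs
num_warmup_steps = int(num_training_steps * warmup_proportion)

print(steps_per_epoch, num_training_steps, num_warmup_steps)
```

So the learning rate would warm up over the first 132 of 1,320 steps, i.e., during the first half of epoch 1.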

In [32]:
optimizer, scheduler = create_optimizer_and_scheduler(model, num_training_steps, num_warmup_steps, learning_rate)
criterion = nn.CrossEntropyLoss()

Train the model for 5 epochs. Training is much faster than full fine-tuning (i.e., optimizing the parameters of the entire RoBERTa model). At the end of each epoch, we save only the `social` adapter module rather than the entire RoBERTa model. This `social` adapter is lightweight: it is only 3MB!

In [35]:
epoch_res = []

for epoch in trange(num_epochs, desc="Epoch"):
    train_loss = train(model, train_dataloader, optimizer, scheduler, criterion)
    val_loss, val_acc, val_f1, val_recall, val_precision = evaluate(model, validation_dataloader, criterion)

    epoch_eval_result = {"epoch_num":int(epoch + 1),"train_loss":train_loss,
                      "val_acc":val_acc, "val_recall":val_recall, "val_precision":val_precision, "val_f1":val_f1
                      }
    print(epoch_eval_result)
    epoch_res.append(epoch_eval_result)
    save_path = "./epoch" + str(epoch + 1)
    # create the output directory if it does not exist yet
    os.makedirs(save_path, exist_ok=True)

    model.save_all_adapters(save_path)





Epoch:   0%|          | 0/5 [00:00<?, ?it/s]
Epoch:  20%|██        | 1/5 [00:32<02:09, 32.35s/it]
{'epoch_num': 1, 'train_loss': 0.14573501964861696, 'val_acc': 0.9365530303030303, 'val_recall': 0.9382860040567951, 'val_precision': 0.9351728974483466, 'val_f1': 0.9361959428079305}
Epoch:  40%|████      | 2/5 [01:05<01:37, 32.58s/it]
{'epoch_num': 2, 'train_loss': 0.13211637516647126, 'val_acc': 0.9412878787878788, 'val_recall': 0.9431614024920313, 'val_precision': 0.93992981144016, 'val_f1': 0.9409657978165136}
Epoch:  60%|██████    | 3/5 [01:39<01:06, 33.02s/it]
{'epoch_num': 3, 'train_loss': 0.11951439521473015, 'val_acc': 0.9384469696969697, 'val_recall': 0.9401984931903796, 'val_precision': 0.9370718023412634, 'val_f1': 0.9381005415300818}
Epoch:  80%|████████  | 4/5 [02:14<00:33, 33.69s/it]
{'epoch_num': 4, 'train_loss': 0.11095547920558602, 'val_acc': 0.9384469696969697, 'val_recall': 0.9401984931903796, 'val_precision': 0.9370718023412634, 'val_f1': 0.9381005415300818}
Epoch: 100%|██████████| 5/5 [02:49<00:00, 33.92s/it]
{'epoch_num': 5, 'train_loss': 0.1109762998989247, 'val_acc': 0.9384469696969697, 'val_recall': 0.9401984931903796, 'val_precision': 0.9370718023412634, 'val_f1': 0.9381005415300818}





Present the validation performance.

In [36]:
report_df = pd.DataFrame(epoch_res)
report_df.sort_values(by=["val_f1"], ascending=False, inplace=True)
report_df

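|   | epoch_num | train_loss | val_acc | val_recall | val_precision | val_f1 |
|---|-----------|------------|---------|------------|---------------|--------|
| 1 | 2 | 0.132116 | 0.941288 | 0.943161 | 0.93993 | 0.940966 |
| 2 | 3 | 0.119514 | 0.938447 | 0.940198 | 0.937072 | 0.938101 |
| 3 | 4 | 0.110955 | 0.938447 | 0.940198 | 0.937072 | 0.938101 |
| 4 | 5 | 0.110976 | 0.938447 | 0.940198 | 0.937072 | 0.938101 |
| 0 | 1 | 0.145735 | 0.936553 | 0.938286 | 0.935173 | 0.936196 |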
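
Instead of reading the best epoch off the table, it can also be selected programmatically. The sketch below is a minimal example that uses the validation F1 values logged above; the variable `best_adapter_path` is a hypothetical name, and the path layout assumes that each adapter is saved in a subfolder named after it inside the epoch directory.

```python
# validation F1 per epoch, copied from the training log above
epoch_res = [
    {"epoch_num": 1, "val_f1": 0.9361959428079305},
    {"epoch_num": 2, "val_f1": 0.9409657978165136},
    {"epoch_num": 3, "val_f1": 0.9381005415300818},
    {"epoch_num": 4, "val_f1": 0.9381005415300818},
    {"epoch_num": 5, "val_f1": 0.9381005415300818},
]

# pick the epoch with the highest validation F1
best = max(epoch_res, key=lambda r: r["val_f1"])
best_adapter_path = "./epoch{}/social".format(best["epoch_num"])
print(best["epoch_num"], best_adapter_path)
```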


### Load the best adapter model and evaluate on Test set

Load the pre-trained RoBERTa model by shortcut name. 

In [37]:
model_name = "roberta-base"
tokenizer = RobertaTokenizer.from_pretrained(model_name)
model = RobertaModelWithHeads.from_pretrained(model_name)


Load the trained task-specific adapter module that achieved the best performance on the validation set (epoch 2).

In [39]:
adapter_name = model.load_adapter("./epoch2/social")

Overwriting existing adapter 'social'.
Overwriting existing head 'social'


Activate the loaded adapter, send the model to the device, and evaluate on the test set.

In [40]:
model.set_active_adapters(adapter_name)
model = model.to(device)
test_loss, test_acc, test_f1, test_recall, test_precision = evaluate(model, test_dataloader, criterion)

In [41]:
print(test_loss, test_acc, test_f1, test_recall, test_precision)

0.2498893676833673 0.9204545454545454 0.9199480181936321 0.9219626778024272 0.918901412100013


### References:
* Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., De Laroussilhe, Q., Gesmundo, A., ... & Gelly, S. (2019, May). [Parameter-efficient transfer learning for NLP](http://proceedings.mlr.press/v97/houlsby19a/houlsby19a.pdf). In International Conference on Machine Learning (pp. 2790-2799). PMLR.

* https://docs.adapterhub.ml/index.html

* https://medium.com/dair-ai/adapters-a-compact-and-extensible-transfer-learning-method-for-nlp-6d18c2399f62
