<a href="https://colab.research.google.com/github/christianwarmuth/transformer_adapter_bias_evaluation/blob/main/code/adapter_training/sst_adapter.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1️⃣ Training an Adapter for a Transformer model

In this notebook, we train an adapter for a **BERT** ([Devlin et al., 2019](https://arxiv.org/pdf/1810.04805.pdf)) model for sequence classification on a **sentiment analysis** task using [adapter-transformers](https://github.com/Adapter-Hub/adapter-transformers), the _AdapterHub_ adaptation of HuggingFace's _transformers_ library.

If you're unfamiliar with the theoretical parts of adapters or the AdapterHub framework, check out our [introductory blog post](https://adapterhub.ml/blog/2020/11/adapting-transformers-with-adapterhub/) first.

We train a **Task Adapter** for a pre-trained model here. Most of the code is identical to a full finetuning setup using HuggingFace's transformers. For comparison, have a look at the [same guide using full finetuning](https://colab.research.google.com/drive/1brXJg5Mokm8h3shxqPRnoIsRwHQoncus?usp=sharing).

For training, we use the Stanford Sentiment Treebank (the `sst` dataset), which builds on the [movie review dataset by Pang and Lee (2005)](http://www.cs.cornell.edu/people/pabo/movie-review-data/). It contains sentences from Rotten Tomatoes movie reviews, each annotated with a sentiment score that we later round to a positive or negative class. We download the dataset via HuggingFace's [datasets](https://github.com/huggingface/datasets) library.

## Installation

First, let's install the required libraries:

In [1]:
!pip install -U git+https://github.com/Adapter-Hub/adapter-transformers.git
!pip install datasets

Collecting git+https://github.com/Adapter-Hub/adapter-transformers.git
  Cloning https://github.com/Adapter-Hub/adapter-transformers.git to /tmp/pip-req-build-ug977aj3
  Running command git clone -q https://github.com/Adapter-Hub/adapter-transformers.git /tmp/pip-req-build-ug977aj3
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Collecting sacremoses
  Downloading sacremoses-0.0.45-py3-none-any.whl (895 kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Collecting huggingface-hub>=0.0.13
  Downloading huggingface_hub-0.0.14-py3-none-any.whl (43 kB)
Building wheels f

In [2]:
import torch
torch.cuda.is_available()

True

In [3]:
from google.colab import drive
drive.mount("/content/gdrive/", force_remount=True)

Mounted at /content/gdrive/


In [4]:
import sys
sys.path.append('/content/gdrive/MyDrive/master_hpi/NLP_Project/code/')

In [5]:
path = "/content/gdrive/MyDrive/master_hpi/NLP_Project/code/"

In [6]:
path

'/content/gdrive/MyDrive/master_hpi/NLP_Project/code/'

## Dataset Preprocessing

Before we start to train our adapter, we first prepare the training data. Our training dataset can be loaded via HuggingFace `datasets` using one line of code:

In [7]:
from datasets import load_dataset

dataset = load_dataset("sst")
dataset.num_rows

# Optionally shrink the dataset for quicker training:
# dataset = dataset.filter(lambda e, i: i < 10000, with_indices=True)
# print(dataset)

No config specified, defaulting to: sst/default


Downloading and preparing dataset sst/default (download: 6.83 MiB, generated: 3.73 MiB, post-processed: Unknown size, total: 10.56 MiB) to /root/.cache/huggingface/datasets/sst/default/1.0.0/b8a7889ef01c5d3ae8c379b84cc4080f8aad3ac2bc538701cbe0ac6416fb76ff...


Dataset sst downloaded and prepared to /root/.cache/huggingface/datasets/sst/default/1.0.0/b8a7889ef01c5d3ae8c379b84cc4080f8aad3ac2bc538701cbe0ac6416fb76ff. Subsequent calls will reuse this data.


{'test': 2210, 'train': 8544, 'validation': 1101}

Every dataset sample has an input sentence, its tokens and parse tree, and a continuous sentiment score between 0 and 1 as its label:

In [8]:
dataset['train'][0]





{'label': 0.6944400072097778,
 'sentence': "The Rock is destined to be the 21st Century 's new `` Conan '' and that he 's going to make a splash even greater than Arnold Schwarzenegger , Jean-Claud Van Damme or Steven Segal .",
 'tokens': "The|Rock|is|destined|to|be|the|21st|Century|'s|new|``|Conan|''|and|that|he|'s|going|to|make|a|splash|even|greater|than|Arnold|Schwarzenegger|,|Jean-Claud|Van|Damme|or|Steven|Segal|.",
 'tree': '70|70|68|67|63|62|61|60|58|58|57|56|56|64|65|55|54|53|52|51|49|47|47|46|46|45|40|40|41|39|38|38|43|37|37|69|44|39|42|41|42|43|44|45|50|48|48|49|50|51|52|53|54|55|66|57|59|59|60|61|62|63|64|65|66|67|68|69|71|71|0'}
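
Because `label` is a continuous score rather than a class index, it has to be mapped to a class before we can train a binary classifier. The following is a minimal sketch of one way to do that, rounding to the nearest integer. The helper name `binarize` and the sample scores are illustrative and not part of the dataset or the library.

```python
def binarize(score: float) -> int:
    """Map a sentiment score in [0, 1] to 0 (negative) or 1 (positive)."""
    # Python's built-in round() resolves exact ties to the even integer,
    # so a score of exactly 0.5 becomes 0.
    return round(score)

scores = [0.69444, 0.2, 0.5, 0.51]  # illustrative values
labels = [binarize(s) for s in scores]
print(labels)  # [1, 0, 0, 1]
```

If neutral sentences near 0.5 should be treated differently, an explicit threshold comparison would make that choice visible in the code.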

Now, we need to encode all dataset samples to valid inputs for our Transformer model. Since we want to train on `bert-base-uncased`, we load the corresponding `BertTokenizer`. Using `dataset.map()`, we can pass the full dataset through the tokenizer in batches, padding or truncating every sentence to 80 tokens:

In [9]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def encode_batch(batch):
  """Encodes a batch of input data using the model tokenizer."""
  return tokenizer(batch["sentence"], max_length=80, truncation=True, padding="max_length")

# Encode the input data
dataset = dataset.map(encode_batch, batched=True)

# The SST "label" column is a continuous sentiment score in [0, 1].
# Round it to a binary class (0 = negative, 1 = positive) and store it in
# "labels", the column name the transformers model expects for the target.
def label_mapping(batch):
  batch["labels"] = round(batch["label"])
  return batch
dataset = dataset.map(label_mapping)

# Transform to pytorch tensors and only output the required columns
dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])

dataset["train"]["labels"]

tensor([1, 1, 1,  ..., 1, 0, 0])
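
To make the effect of `max_length`, `truncation=True` and `padding="max_length"` concrete, here is a simplified sketch in plain Python. The function `pad_or_truncate` is a hypothetical stand-in, not the tokenizer's implementation: a real tokenizer also takes care to keep special tokens when truncating, which this toy version ignores. The token ids are illustrative.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), each exactly max_length long."""
    ids = list(token_ids[:max_length])   # truncation: drop anything beyond max_length
    mask = [1] * len(ids)                # 1 marks a real token
    n_pad = max_length - len(ids)        # padding needed to reach max_length
    return ids + [pad_id] * n_pad, mask + [0] * n_pad  # 0 marks padding

short_ids, short_mask = pad_or_truncate([101, 2023, 2003, 102], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(1, 11)), max_length=6)

print(short_ids, short_mask)  # [101, 2023, 2003, 102, 0, 0] [1, 1, 1, 1, 0, 0]
print(long_ids, long_mask)    # [1, 2, 3, 4, 5, 6] [1, 1, 1, 1, 1, 1]
```

Fixed-length inputs let every batch be stacked into one tensor, and the attention mask tells the model which positions to ignore.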

Now we're ready to train our model...

## Training

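We use a pre-trained BERT model from HuggingFace, loaded as `BertForSequenceClassification`. Instead of starting from an empty adapter, we load the pre-trained `sentiment/sst-2@ukp` task adapter from AdapterHub with the `pfeiffer` adapter configuration. `adapter-transformers` also offers model classes such as `RobertaModelWithHeads`, which are unique to the library and allow us to add and configure prediction heads in a more flexible way.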

In [10]:
from transformers import AdapterConfig
from transformers import BertTokenizer, BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained("bert-base-uncased")
config = AdapterConfig.load("pfeiffer")
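# Load the pre-trained SST-2 task adapter from AdapterHub; returns its name ("sst-2")
model.load_adapter("sentiment/sst-2@ukp", config=config)

# Alternative: add and train a fresh adapter instead of loading one
# model.add_adapter("sentiment/sst-2@ukp")
# model.add_classification_head(
#     "sentiment/sst-2@ukp",
#     num_labels=2,
#     id2label={0: "👎", 1: "👍"}
# )
# model.train_adapter("sentiment/sst-2@ukp")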


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

'sst-2'

**Here comes the important part!**

In the cell above, `load_adapter()` fetches a pre-trained task adapter from the Hub and returns the name under which it is registered (`"sst-2"`). To train an adapter from scratch instead, as in the commented-out lines, we would call `add_adapter()` with a name and [the type of adapter](https://docs.adapterhub.ml/adapters.html#adapter-types) (task adapter), and then add a binary classification head. It's convenient to give the prediction head the same name as the adapter, since this allows us to activate both together in the next step. The `train_adapter()` method does two things:

1. It freezes all weights of the pre-trained model so only the adapter weights are updated during training.
2. It activates the adapter and the prediction head such that both are used in every forward pass.
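
The freezing in step 1 can be illustrated with a toy sketch. It uses a plain-Python `Param` class as a stand-in for PyTorch parameters, and the parameter names are hypothetical; the actual naming inside `adapter-transformers` may differ. The idea is only that every parameter outside the chosen adapter has `requires_grad` switched off.

```python
from dataclasses import dataclass

@dataclass
class Param:
    """Stand-in for a model parameter: a name and a trainable flag."""
    name: str
    requires_grad: bool = True

def freeze_all_but_adapter(params, adapter_name):
    """Leave only parameters belonging to the named adapter trainable."""
    marker = f".adapters.{adapter_name}."   # hypothetical naming scheme
    for p in params:
        p.requires_grad = marker in p.name
    return [p.name for p in params if p.requires_grad]

params = [
    Param("encoder.layer.0.attention.self.query.weight"),
    Param("encoder.layer.0.output.dense.weight"),
    Param("encoder.layer.0.output.adapters.sst-2.down.weight"),
    Param("encoder.layer.0.output.adapters.sst-2.up.weight"),
]
trainable = freeze_all_but_adapter(params, "sst-2")
print(trainable)
```

With real PyTorch modules the same loop would run over `model.named_parameters()`, and the optimizer would then only update the adapter weights.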

For training, we make use of the `Trainer` class built into `transformers`. We configure the training process using a `TrainingArguments` object and define a function that calculates the evaluation accuracy at the end. We pass both, together with the training and validation split of our dataset, to the trainer instance.

**Note the differences in hyperparameters compared to full finetuning.** Adapter training usually requires a few more training epochs than full finetuning.

In [11]:
import numpy as np
from transformers import TrainingArguments, Trainer, EvalPrediction

# Output directory for checkpoints (same Drive folder as defined above)
path = "/content/gdrive/MyDrive/master_hpi/NLP_Project/code/"

training_args = TrainingArguments(
    learning_rate=1e-4,
    num_train_epochs=6,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    logging_steps=200,
    remove_unused_columns=False,
    output_dir=path,
)

def compute_accuracy(p: EvalPrediction):
  preds = np.argmax(p.predictions, axis=1)
  return {"acc": (preds == p.label_ids).mean()}


trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    compute_metrics=compute_accuracy,
    adapter_names="sst-2",
    do_save_full_model=False,
    do_save_adapter_fusion=True
)

# print(dataset)
print(dataset["train"]["labels"])

tensor([1, 1, 1,  ..., 1, 0, 0])


Start the training 🚀

In [12]:
trainer.train()
print(dataset)

***** Running training *****
  Num examples = 8544
  Num Epochs = 6
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 1602


| Step | Training Loss |
|------|---------------|
| 200  | 0.3922 |
| 400  | 0.2577 |
| 600  | 0.1966 |
| 800  | 0.1306 |
| 1000 | 0.0529 |
| 1200 | 0.0426 |
| 1400 | 0.0284 |
| 1600 | 0.0157 |


Saving model checkpoint to /content/gdrive/MyDrive/master_hpi/NLP_Project/code/checkpoint-500
Saving model checkpoint to /content/gdrive/MyDrive/master_hpi/NLP_Project/code/checkpoint-1000
Saving model checkpoint to /content/gdrive/MyDrive/master_hpi/NLP_Project/code/checkpoint-1500


Training completed. Do not forget to share your model on huggingface.co/models =)




DatasetDict({
    train: Dataset({
        features: ['attention_mask', 'input_ids', 'label', 'sentence', 'token_type_ids', 'tokens', 'tree', 'labels'],
        num_rows: 8544
    })
    validation: Dataset({
        features: ['attention_mask', 'input_ids', 'label', 'sentence', 'token_type_ids', 'tokens', 'tree', 'labels'],
        num_rows: 1101
    })
    test: Dataset({
        features: ['attention_mask', 'input_ids', 'label', 'sentence', 'token_type_ids', 'tokens', 'tree', 'labels'],
        num_rows: 2210
    })
})
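
The step count and checkpoint names in the training log follow directly from the dataset size and the training arguments. A small sketch of the arithmetic, assuming a single device, no gradient accumulation (as the log reports), and the default checkpoint interval of 500 steps:

```python
import math

num_examples = 8544   # size of the training split
batch_size = 32       # per_device_train_batch_size
num_epochs = 6        # num_train_epochs
save_steps = 500      # default checkpoint interval of TrainingArguments

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
checkpoints = list(range(save_steps, total_steps + 1, save_steps))

print(steps_per_epoch, total_steps, checkpoints)  # 267 1602 [500, 1000, 1500]
```

This matches the 1602 optimization steps and the three checkpoint directories reported above.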


Looks good! Let's evaluate our adapter on the validation split of the dataset to see how well it learned:

In [13]:
trainer.evaluate()

***** Running Evaluation *****
  Num examples = 1101
  Batch size = 32


{'epoch': 6.0,
 'eval_acc': 0.8501362397820164,
 'eval_loss': 0.8421767950057983,
 'eval_runtime': 3.7388,
 'eval_samples_per_second': 294.478,
 'eval_steps_per_second': 9.361}
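
The `eval_acc` value comes from the `compute_accuracy` function we passed to the trainer. To see what it computes, here is a sketch that calls the same logic on four hand-made logit rows. The `SimpleNamespace` object is only a stand-in for `EvalPrediction`, and the logits and labels are made up.

```python
from types import SimpleNamespace

import numpy as np

def compute_accuracy(p):
    # Pick the class with the highest logit and compare with the gold labels
    preds = np.argmax(p.predictions, axis=1)
    return {"acc": (preds == p.label_ids).mean()}

fake_eval = SimpleNamespace(
    predictions=np.array([[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.0, 0.9]]),
    label_ids=np.array([0, 1, 0, 0]),
)
result = compute_accuracy(fake_eval)
print(result)  # three of four predictions match, so acc is 0.75

# The reported eval_acc on 1101 validation examples corresponds to:
n_correct = round(0.8501362397820164 * 1101)
print(n_correct)  # 936
```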

We can put our trained model into a `transformers` pipeline to be able to make new predictions conveniently:

In [14]:
from transformers import TextClassificationPipeline

# classifier = TextClassificationPipeline(model=model, tokenizer=tokenizer, device=training_args.device.index)
classifier = TextClassificationPipeline(model=model, tokenizer=tokenizer, device=0)

classifier("This is great!")

[{'label': 'LABEL_1', 'score': 0.9990543127059937}]
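
The label reads `LABEL_1` because we did not set an `id2label` mapping on the model, so the default names are used. The score is, as far as we can tell for a two-class head, the softmax probability of the winning class. A minimal sketch of that post-processing in plain Python, with hypothetical logits:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [-3.2, 3.8]  # hypothetical model output for one sentence
probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)
prediction = {"label": f"LABEL_{best}", "score": probs[best]}
print(prediction)
```

Passing `id2label={0: "negative", 1: "positive"}` in the model config would replace the generic label names with readable ones.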

In [15]:
from transformers import AdapterConfig
from transformers import BertTokenizer, BertForSequenceClassification

model2 = BertForSequenceClassification.from_pretrained("bert-base-uncased")
config = AdapterConfig.load("pfeiffer")
model2.load_adapter("sentiment/sst-2@ukp", config=config)
#model.add_adapter("sentiment/sst-2@ukp")
#model.add_classification_head(
#    "sentiment/sst-2@ukp",
#    num_labels=2,
#    id2label={ 0: "👎", 1: "👍"}
 # )
#model.train_adapter("sentiment/sst-2@ukp")

loading configuration file https://huggingface.co/bert-base-uncased/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/3c61d016573b14f7f008c02c4e51a366c67ab274726fe2910691e2a761acf43e.37395cee442ab11005bcd270f3c34464dc1704b715b5d7d52b1a461abe3b9e4e
Model config BertConfig {
  "adapters": {
    "adapters": {},
    "config_map": {}
  },
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "2.1.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

loading weights file https://huggingface.co/bert-base-uncased/resolve/mai

'sst-2'

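Finally, we save the result for later reuse. The call to `save_pretrained()` below writes the full model (`config.json` and `pytorch_model.bin`), as the output shows. To store only the adapter weights, which are far smaller than a full model, `adapter-transformers` provides `save_adapter()`.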

In [19]:
print(model.active_adapters)
model.save_pretrained(path +"models/sst/", "sst-2")
model.save_all_adapter_fusions(path +"models/sst/", "sst-2")
#!ls -lh final_adapter

Configuration saved in /content/gdrive/MyDrive/master_hpi/NLP_Project/code/models/sst/config.json


Stack[sst-2]


Model weights saved in /content/gdrive/MyDrive/master_hpi/NLP_Project/code/models/sst/pytorch_model.bin


**Share your work!**

The next step after training is to share our adapter with the world via _AdapterHub_. [Read our guide](https://docs.adapterhub.ml/contributing.html) on how to prepare the adapter module we just saved and contribute it to the Hub!

➡️ Also continue with [the next Colab notebook](https://colab.research.google.com/github/Adapter-Hub/adapter-transformers/blob/master/notebooks/02_Adapter_Inference.ipynb) to learn how to use adapters from the Hub.