<a href="https://colab.research.google.com/github/christianwarmuth/transformer_adapter_bias_evaluation/blob/main/colab/sst_adapter_eval.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1️⃣ Training an Adapter for a Transformer model

In this notebook, we train an adapter for a **RoBERTa** ([Liu et al., 2019](https://arxiv.org/pdf/1907.11692.pdf)) model for sequence classification on a **sentiment analysis** task using [adapter-transformers](https://github.com/Adapter-Hub/adapter-transformers), the _AdapterHub_ adaptation of HuggingFace's _transformers_ library.

If you're unfamiliar with the theoretical parts of adapters or the AdapterHub framework, check out our [introductory blog post](https://adapterhub.ml/blog/2020/11/adapting-transformers-with-adapterhub/) first.

We train a **Task Adapter** for a pre-trained model here. Most of the code is identical to a full finetuning setup using HuggingFace's transformers. For comparison, have a look at the [same guide using full finetuning](https://colab.research.google.com/drive/1brXJg5Mokm8h3shxqPRnoIsRwHQoncus?usp=sharing).

For training, we use the [movie review dataset by Pang and Lee (2005)](http://www.cs.cornell.edu/people/pabo/movie-review-data/). It contains movie reviews  from Rotten Tomatoes which are either classified as positive or negative. We download the dataset via HuggingFace's [datasets](https://github.com/huggingface/datasets) library.

## Installation

First, let's install the required libraries:

In [35]:
!pip install -U git+https://github.com/Adapter-Hub/adapter-transformers.git
!pip install datasets

Collecting git+https://github.com/Adapter-Hub/adapter-transformers.git
  Cloning https://github.com/Adapter-Hub/adapter-transformers.git to /tmp/pip-req-build-29oan6zy
  Running command git clone -q https://github.com/Adapter-Hub/adapter-transformers.git /tmp/pip-req-build-29oan6zy
  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
    Preparing wheel metadata ... [?25l[?25hdone
Building wheels for collected packages: adapter-transformers
  Building wheel for adapter-transformers (PEP 517) ... [?25l[?25hdone
  Created wheel for adapter-transformers: filename=adapter_transformers-2.0.1-cp37-none-any.whl size=2099150 sha256=05cfa03c5a52ae36eea0de4508426ed35892d19249a4259f612de98b87dfbf4d
  Stored in directory: /tmp/pip-ephem-wheel-cache-iiwmieei/wheels/b0/56/c9/5bf1c51cd513412090ad751ab10fc025210176bf0a82dd8af3
Successfully built adapter-transformers
Installing collected packages: adapter-transformers
  Found existing insta



In [36]:
import torch
torch.cuda.is_available()

True

In [37]:
from google.colab import drive
drive.mount("/content/gdrive/", force_remount=True)

Mounted at /content/gdrive/


In [38]:
import sys
sys.path.append('/content/gdrive/MyDrive/master_hpi/NLP_Project/code/')

In [39]:
path = "/content/gdrive/MyDrive/master_hpi/NLP_Project/code/"

In [40]:
path

'/content/gdrive/MyDrive/master_hpi/NLP_Project/code/'

## Load Steroset

In [14]:
from collections import defaultdict 
import dataloader
import sys
sys.path.append('/content/gdrive/MyDrive/master_hpi/NLP_Project/code/')
stereoset = dataloader.StereoSet(path+"dev.json")
from dataloader import SentimentIntrasentenceLoader, StereoSet

In [15]:
intersentence_examples = stereoset.get_intersentence_examples() 
intrasentence_examples = stereoset.get_intrasentence_examples()

id2term = {}
id2gold = {}
id2score = {}
example2sent = {}
domain2example = {"intersentence": defaultdict(lambda: []), "intrasentence": defaultdict(lambda: [])}

for example in intrasentence_examples:
  for sentence in example.sentences:
    id2term[sentence.ID] = example.target
    id2gold[sentence.ID] = sentence.gold_label
    example2sent[(example.ID, sentence.gold_label)] = sentence.ID
    domain2example['intrasentence'][example.bias_type].append(example)

for example in intersentence_examples:
  for sentence in example.sentences:
    id2term[sentence.ID] = example.target
    id2gold[sentence.ID] = sentence.gold_label
    example2sent[(example.ID, sentence.gold_label)] = sentence.ID
    domain2example['intersentence'][example.bias_type].append(example)



In [9]:
domain2example

{'intersentence': defaultdict(<function __main__.<lambda>>,
             {'gender': [<dataloader.IntersentenceExample at 0x7fdf7f176910>,
               <dataloader.IntersentenceExample at 0x7fdf7f176910>,
               <dataloader.IntersentenceExample at 0x7fdf7f176910>,
               <dataloader.IntersentenceExample at 0x7fdf7f103110>,
               <dataloader.IntersentenceExample at 0x7fdf7f103110>,
               <dataloader.IntersentenceExample at 0x7fdf7f103110>,
               <dataloader.IntersentenceExample at 0x7fdf7f10ce50>,
               <dataloader.IntersentenceExample at 0x7fdf7f10ce50>,
               <dataloader.IntersentenceExample at 0x7fdf7f10ce50>,
               <dataloader.IntersentenceExample at 0x7fdf7f10ff10>,
               <dataloader.IntersentenceExample at 0x7fdf7f10ff10>,
               <dataloader.IntersentenceExample at 0x7fdf7f10ff10>,
               <dataloader.IntersentenceExample at 0x7fdf7f111a50>,
               <dataloader.IntersentenceExampl

## Dataset Preprocessing

Before we start to train our adapter, we first prepare the training data. Our training dataset can be loaded via HuggingFace `datasets` using one line of code:

In [59]:
from datasets import load_dataset

dataset = load_dataset("sst")
dataset.num_rows

#make dataset smaller for quicker training purpose



dataset= dataset.filter(lambda e, i: i<10000, with_indices=True)
#print(dataset)

HBox(children=(FloatProgress(value=0.0, description='Downloading', max=2590.0, style=ProgressStyle(description…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=1173.0, style=ProgressStyle(description…




No config specified, defaulting to: sst/default


Downloading and preparing dataset sst/default (download: 6.83 MiB, generated: 3.73 MiB, post-processed: Unknown size, total: 10.56 MiB) to /root/.cache/huggingface/datasets/sst/default/1.0.0/b8a7889ef01c5d3ae8c379b84cc4080f8aad3ac2bc538701cbe0ac6416fb76ff...


HBox(children=(FloatProgress(value=0.0, description='Downloading', max=6372817.0, style=ProgressStyle(descript…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=789539.0, style=ProgressStyle(descripti…




HBox(children=(FloatProgress(value=1.0, bar_style='info', max=1.0), HTML(value='')))



HBox(children=(FloatProgress(value=1.0, bar_style='info', max=1.0), HTML(value='')))



HBox(children=(FloatProgress(value=1.0, bar_style='info', max=1.0), HTML(value='')))

Dataset sst downloaded and prepared to /root/.cache/huggingface/datasets/sst/default/1.0.0/b8a7889ef01c5d3ae8c379b84cc4080f8aad3ac2bc538701cbe0ac6416fb76ff. Subsequent calls will reuse this data.


HBox(children=(FloatProgress(value=0.0, max=9.0), HTML(value='')))




HBox(children=(FloatProgress(value=0.0, max=2.0), HTML(value='')))




HBox(children=(FloatProgress(value=0.0, max=3.0), HTML(value='')))




Every dataset sample has an input text and a binary label:

In [60]:
dataset['train'][0]





{'label': 0.6944400072097778,
 'sentence': "The Rock is destined to be the 21st Century 's new `` Conan '' and that he 's going to make a splash even greater than Arnold Schwarzenegger , Jean-Claud Van Damme or Steven Segal .",
 'tokens': "The|Rock|is|destined|to|be|the|21st|Century|'s|new|``|Conan|''|and|that|he|'s|going|to|make|a|splash|even|greater|than|Arnold|Schwarzenegger|,|Jean-Claud|Van|Damme|or|Steven|Segal|.",
 'tree': '70|70|68|67|63|62|61|60|58|58|57|56|56|64|65|55|54|53|52|51|49|47|47|46|46|45|40|40|41|39|38|38|43|37|37|69|44|39|42|41|42|43|44|45|50|48|48|49|50|51|52|53|54|55|66|57|59|59|60|61|62|63|64|65|66|67|68|69|71|71|0'}

Now, we need to encode all dataset samples to valid inputs for our Transformer model. Since we want to train on `roberta-base`, we load the corresponding `RobertaTokenizer`. Using `dataset.map()`, we can pass the full dataset through the tokenizer in batches:

In [61]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def encode_batch(batch):
  """Encodes a batch of input data using the model tokenizer."""
  return tokenizer(batch["sentence"], max_length=80, truncation=True, padding="max_length")

# Encode the input data
dataset = dataset.map(encode_batch, batched=True)

# Isa work:
def label_mapping(batch):
  batch["labels"] = round(batch["label"])
  return batch
dataset = dataset.map(label_mapping)
# # The transformers model expects the target class column to be named "labels"
# dataset.rename_column_("isa", "labels")

# end of Isa work

# Transform to pytorch tensors and only output the required columns
dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])

dataset["train"]["labels"]

HBox(children=(FloatProgress(value=0.0, max=9.0), HTML(value='')))




HBox(children=(FloatProgress(value=0.0, max=2.0), HTML(value='')))




HBox(children=(FloatProgress(value=0.0, max=3.0), HTML(value='')))




HBox(children=(FloatProgress(value=0.0, max=8544.0), HTML(value='')))




HBox(children=(FloatProgress(value=0.0, max=1101.0), HTML(value='')))




HBox(children=(FloatProgress(value=0.0, max=2210.0), HTML(value='')))




tensor([1, 1, 1,  ..., 1, 0, 0])

Now we're ready to train our model...

## Training

We use a pre-trained RoBERTa model from HuggingFace. We use `RobertaModelWithHeads`, a class unique to `adapter-transformers`, which allows us to add and configure prediction heads in a flexibler way.

In [62]:
from transformers import AdapterConfig
from transformers import BertTokenizer, BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained("bert-base-uncased")
config = AdapterConfig.load("pfeiffer")
model.load_adapter("sentiment/sst-2@ukp", config=config)
#model.add_adapter("sentiment/sst-2@ukp")
#model.add_classification_head(
#    "sentiment/sst-2@ukp",
#    num_labels=2,
#    id2label={ 0: "👎", 1: "👍"}
 # )
#model.train_adapter("sentiment/sst-2@ukp")


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

'sst-2'

**Here comes the important part!**

We add a new adapter to our model by calling `add_adapter()`. We pass a name (`"rotten_tomatoes"`) and [the type of adapter](https://docs.adapterhub.ml/adapters.html#adapter-types) (task adapter). Next, we add a binary classification head. It's convenient to give the prediction head the same name as the adapter. This allows us to activate both together in the next step. The `train_adapter()` method does two things:

1. It freezes all weights of the pre-trained model so only the adapter weights are updated during training.
2. It activates the adapter and the prediction head such that both are used in every forward pass.

For training, we make use of the `Trainer` class built-in into `transformers`. We configure the training process using a `TrainingArguments` object and define a method that will calculate the evaluation accuracy in the end. We pass both, together with the training and validation split of our dataset, to the trainer instance.

**Note the differences in hyperparameters compared to full finetuning.** Adapter training usually required a few more training epochs than full finetuning.

In [70]:
import numpy as np
from transformers import TrainingArguments, Trainer, EvalPrediction
path = "/content/gdrive/MyDrive/master_hpi/NLP_Project/code/"
import os

training_args = TrainingArguments(
    learning_rate=1e-4,
    num_train_epochs=6,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    logging_steps=200,
    remove_unused_columns=False,
    output_dir=path

)

def compute_accuracy(p: EvalPrediction):
  preds = np.argmax(p.predictions, axis=1)
  return {"acc": (preds == p.label_ids).mean()}


trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    compute_metrics=compute_accuracy,
    adapter_names="sst-2",
    do_save_full_model=True,
    do_save_adapter_fusion=True
)

# print(dataset)
print(dataset["train"]["labels"])

tensor([1, 1, 1,  ..., 1, 0, 0])


Start the training 🚀

In [71]:
trainer.train()
print(dataset)

Step,Training Loss
200,0.0746
400,0.0576
600,0.0516
800,0.0271
1000,0.0096
1200,0.0118
1400,0.0043
1600,0.0026


DatasetDict({
    train: Dataset({
        features: ['attention_mask', 'input_ids', 'label', 'labels', 'sentence', 'token_type_ids', 'tokens', 'tree'],
        num_rows: 8544
    })
    validation: Dataset({
        features: ['attention_mask', 'input_ids', 'label', 'labels', 'sentence', 'token_type_ids', 'tokens', 'tree'],
        num_rows: 1101
    })
    test: Dataset({
        features: ['attention_mask', 'input_ids', 'label', 'labels', 'sentence', 'token_type_ids', 'tokens', 'tree'],
        num_rows: 2210
    })
})


Looks good! Let's evaluate our adapter on the validation split of the dataset to see how well it learned:

In [72]:
trainer.evaluate()

{'epoch': 6.0,
 'eval_acc': 0.8383287920072662,
 'eval_loss': 1.2428271770477295,
 'eval_mem_cpu_alloc_delta': 364544,
 'eval_mem_cpu_peaked_delta': 0,
 'eval_mem_gpu_alloc_delta': 0,
 'eval_mem_gpu_peaked_delta': 138499072,
 'eval_runtime': 6.016,
 'eval_samples_per_second': 183.012}

We can put our trained model into a `transformers` pipeline to be able to make new predictions conveniently:

At last, we can also extract the adapter from our model and separately save it for later reuse. Note the size difference compared to a full model!

In [73]:
print(model.active_adapters)
model.save_adapter(path + "models", "sst-2")
model.save_pretrained(path + "models")

!ls -lh "/content/gdrive/MyDrive/master_hpi/NLP_Project/code/models"

Stack[sst-2]
total 425M
-rw------- 1 root root  581 Jul  4 12:12 adapter_config.json
-rw------- 1 root root 1.2K Jul  4 12:12 config.json
-rw------- 1 root root  225 Jul  4 12:12 head_config.json
-rw------- 1 root root 3.5M Jul  4 12:12 pytorch_adapter.bin
-rw------- 1 root root 422M Jul  4 12:12 pytorch_model.bin
-rw------- 1 root root 7.0K Jul  4 12:12 pytorch_model_head.bin


**Share your work!**

The next step after training is to share our adapter with the world via _AdapterHub_. [Read our guide](https://docs.adapterhub.ml/contributing.html) on how to prepare the adapter module we just saved and contribute it to the Hub!

➡️ Also continue with [the next Colab notebook](https://colab.research.google.com/github/Adapter-Hub/adapter-transformers/blob/master/notebooks/02_Adapter_Inference.ipynb) to learn how to use adapters from the Hub.

## Our own Adapter-Configurations

In [41]:
from collections import defaultdict 
import dataloader
from torch.utils.data import DataLoader
import sys
import utils
sys.path.append('/content/gdrive/MyDrive/master_hpi/NLP_Project/code/')
stereoset = dataloader.StereoSet(path+"dev.json")
from dataloader import SentimentIntrasentenceLoader, StereoSet
from torch import nn

In [74]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def encode_batch(batch):
  """Encodes a batch of input data using the model tokenizer."""
  return tokenizer(batch["sentence"], max_length=80, truncation=True, padding="max_length")

# Encode the input data
#dataset = dataset.map(encode_batch, batched=True)

# Isa work:
def label_mapping(batch):
  batch["labels"] = round(batch["label"])
  return batch
#dataset = dataset.map(label_mapping)
# # The transformers model expects the target class column to be named "labels"
# dataset.rename_column_("isa", "labels")

# end of Isa work

# Transform to pytorch tensors and only output the required columns
###dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])

#dataset["train"]["labels"]

In [75]:
from transformers import AdapterConfig
from transformers import BertTokenizer, BertForSequenceClassification
from tqdm import tqdm

model = BertForSequenceClassification.from_pretrained("bert-base-uncased")
config = AdapterConfig.load("pfeiffer")
model.load_adapter("sentiment/sst-2@ukp", config=config)
model.add_adapter("sentiment/sst-2@ukp")
#model.add_classification_head(
#    "sentiment/sst-2@ukp",
#    num_labels=2,
#    id2label={ 0: "👎", 1: "👍"}
 # )
model.train_adapter("sentiment/sst-2@ukp")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

In [76]:
from transformers import TextClassificationPipeline

# classifier = TextClassificationPipeline(model=model, tokenizer=tokenizer, device=training_args.device.index)
classifier = TextClassificationPipeline(model=model, tokenizer=tokenizer, device=0)

classifier("Many people live in Ethiopia. Cats have sharp claws.")

[{'label': 'LABEL_1', 'score': 0.978216290473938}]

In [77]:
path = "/content/gdrive/MyDrive/master_hpi/NLP_Project/code/"

In [84]:
no_cuda=False
input_file=path + "dev.json"
skip_intrasentence=False
skip_intersentence=False
batch_size=1
max_seq_length=128
output_dir=path + "predictions/"
output_file= path + "predictions.json"
load_path="best_models/SentimentBert.pth" #TODO ändern
print(f"Loading {input_file}...")
filename = path + "dev.json" #os.path.abspath(input_file)
stereoset = dataloader.StereoSet(filename)
#self.dataloader = StereoSet(filename)
device = "cuda" if no_cuda else "cpu"

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# to keep padding consistent with the other models -> improves LM score.
if tokenizer.__class__.__name__ == "XLNetTokenizer":
    tokenizer.padding_side = "right"

# Set this to be none if you don't want to batch items together!
max_seq_length = None if batch_size == 1 else max_seq_length

# TODO hier weitermachen, um das zu lösen
def evaluate_intrasentence():
        print()
        print(
            f"Evaluating bias on intrasentence tasks...")
        dataset = SentimentIntrasentenceLoader(tokenizer, max_seq_length=max_seq_length, pad_to_max_length=True, input_file=input_file)
        dataloader = DataLoader(
            dataset, batch_size=batch_size, shuffle=False, num_workers=5)
        num_labels = 2

        #model = utils.BertForSequenceClassification(num_labels)
        #device = torch.device("cuda" if not no_cuda else "cpu")
        print(f"Number of parameters: {count_parameters(model):,}")

        #model.to(device).eval()
        #if torch.cuda.device_count() > 1:
        #    print("Let's use", torch.cuda.device_count(), "GPUs!")
        # dim = 0 [30, xxx] -> [10, ...], [10, ...], [10, ...] on 3 GPUs
        #model = nn.DataParallel(model)
        #model.load_state_dict(torch.load(load_path))

        
        bias_predictions = [] 
        for batch_num, batch in tqdm(enumerate(dataloader), total=len(dataloader)):
            sentence_id, input_ids, attention_mask, token_type_ids = batch 
            print(batch)
            #input_ids = input_ids.to(device).squeeze(dim=1) 
            #attention_mask = attention_mask.to(device).squeeze(dim=1) 
            #token_type_ids = token_type_ids.to(device).squeeze(dim=1) 

            predictions = model(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
            predictions = predictions.softmax(dim=1)
            for idx, prediction in enumerate(predictions[:, 0]):
                score = {"id": sentence_id[idx], "score": prediction.item()}
                bias_predictions.append(score)

        return bias_predictions

def evaluate_intersentence():
        print()
        print(
            f"Evaluating bias on intersentence tasks...")
        dataset = SentimentIntersentenceDataset(self.tokenizer, args)
        dataloader = DataLoader(
            dataset, batch_size=self.batch_size, shuffle=False, num_workers=5)
        num_labels = 2

        #model = utils.BertForSequenceClassification(num_labels)
        #device = torch.device("cuda" if not no_cuda else "cpu")
        print(f"Number of parameters: {count_parameters(model):,}")

        #model.to(device).eval()
        #if torch.cuda.device_count() > 1:
        #    print("Let's use", torch.cuda.device_count(), "GPUs!")
        # dim = 0 [30, xxx] -> [10, ...], [10, ...], [10, ...] on 3 GPUs
        #model = nn.DataParallel(model)
        #model.load_state_dict(torch.load(load_path))
        #self.model = model

        bias_predictions = [] 
        for batch_num, batch in tqdm(enumerate(dataloader), total=len(dataloader)):
            sentence_id, input_ids, attention_mask, token_type_ids = batch 
            #input_ids = input_ids.to(device).squeeze(dim=1) 
            #attention_mask = attention_mask.to(device).squeeze(dim=1) 
            #token_type_ids = token_type_ids.to(device).squeeze(dim=1) 

            predictions = self.model(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
            # print(predictions)
            predictions = predictions.softmax(dim=1)
            for idx, prediction in enumerate(predictions[:, 0]):
                score = {"id": sentence_id[idx], "score": prediction.item()}
                bias_predictions.append(score)

        return bias_predictions

def count_parameters(model):
        return sum(p.numel() for p in model.parameters() if p.requires_grad)

Loading /content/gdrive/MyDrive/master_hpi/NLP_Project/code/dev.json...


In [85]:
intra = evaluate_intrasentence()


Evaluating bias on intrasentence tasks...
Number of parameters: 896,066


  cpuset_checked))
  0%|          | 0/6318 [00:00<?, ?it/s]

[('107a3b2e248a218017cf1ba6a22f2c76',), tensor([[[ 101, 1996, 7433, 2447, 2001, 6696, 1012,  102]]]), tensor([[[1, 1, 1, 1, 1, 1, 1, 1]]]), tensor([[[0, 0, 0, 0, 0, 0, 0, 0]]])]





ValueError: ignored