BERT2BERT


Best explanation of warm-starting encoder-decoder models - https://huggingface.co/blog/warm-starting-encoder-decoder

In [1]:
# just in case packages are not installed

# !pip install datasets==1.0.2
# !pip install transformers==4.2.1

In [2]:
import datasets
import pandas as pd
from datasets import Dataset

A **BERT2BERT** model is **warm-started and subsequently fine-tuned** on the BioLeaflets summarization task.

In [3]:
# read training dataset
PATH_INPUT_SOURCE = '/home/angelo_ziletti/nlg-ra/T5_experiments/T5_plain/input_data/train.source'
PATH_INPUT_TARGET = '/home/angelo_ziletti/nlg-ra/T5_experiments/T5_plain/input_data/train.target'

# read training dataset
with open(PATH_INPUT_SOURCE) as f:
    train_input_source = [line.strip() for line in f]

with open(PATH_INPUT_TARGET) as f:
    train_input_target = [line.strip() for line in f]

# check
assert len(train_input_source) == len(train_input_target)


# dataset as a pandas dataframe
train_data_bioleaflets = {'sources': train_input_source, 'sections': train_input_target}

train_data_df = pd.DataFrame.from_dict(train_data_bioleaflets)

# https://huggingface.co/docs/datasets/loading_datasets.html
# convert to HF Dataset object
train_data_dataset = Dataset.from_dict(train_data_bioleaflets)

## bert-base-uncased - will be used as encoder and decoder


In [4]:
# use the "bert-base-uncased" tokenizer

from transformers import BertTokenizerFast
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

bert-base-uncased can process at most 512 tokens, so let's first compute some length statistics for the dataset.

In [5]:
# add the tokenized source/target lengths and flag samples longer than 512 tokens
def map_to_length(x):

  x["sources_len"] = len(tokenizer(x["sources"]).input_ids)
  x["sources_longer_512"] = int(x["sources_len"] > 512)

  x["section_len"] = len(tokenizer(x["sections"]).input_ids)
  x["section_longer_512"] = int(x["section_len"] > 512)

  return x

In [6]:
tmp_df = train_data_df.apply(map_to_length, axis=1)

Token indices sequence length is longer than the specified maximum sequence length for this model (804 > 512). Running this sequence through the model will result in indexing errors


In [7]:
# check the tokenized length of the first target - should match section_len in the table below
len(tokenizer(train_data_bioleaflets['sections'][0]).input_ids)

119

In [8]:
tmp_df

Unnamed: 0,sources,sections,sources_len,sources_longer_512,section_len,section_longer_512
0,<PRODUCT_NAME> cystagon </PRODUCT_NAME> <DX_NA...,cystinosis is a metabolic disease called ' nep...,233,0,119,0
1,<PRODUCT_NAME> cystagon </PRODUCT_NAME> <GENER...,do not use cystagon - if you - or your child -...,804,1,553,1
2,<PRODUCT_NAME> cystagon </PRODUCT_NAME> <GENER...,always use cystagon exactly as your doctor or ...,482,0,608,1
3,<PRODUCT_NAME> cystagon </PRODUCT_NAME> <TREAT...,"like all medicines , cystagon can cause side e...",869,1,414,0
4,<PRODUCT_NAME> cystagon </PRODUCT_NAME> <NUMBE...,keep out of the reach and sight of children . ...,24,0,62,0
...,...,...,...,...,...,...
5886,<PRODUCT_NAME> veltassa </PRODUCT_NAME> <BRAND...,do not take veltassa if you are allergic to pa...,736,1,425,0
5887,<PRODUCT_NAME> veltassa </PRODUCT_NAME> <TREAT...,always take this medicine exactly as your doct...,490,0,481,0
5888,<PRODUCT_NAME> veltassa </PRODUCT_NAME> <TREAT...,"like all medicines , this medicine can cause s...",298,0,137,0
5889,<PRODUCT_NAME> veltassa </PRODUCT_NAME> <TREAT...,keep this medicine out of the sight and reach ...,56,0,123,0


In [9]:
# average length of tokenized inputs
sum(tmp_df["sources_len"]) / 5891

912.5270751994568

In [10]:
# average length of tokenized outputs
sum(tmp_df["section_len"]) / 5891

509.39993209981327

In [11]:
# fraction of sources longer than 512 tokens
len(tmp_df[(tmp_df['sources_longer_512'] == 1)]) / 5891

0.500084875233407

In [12]:
# fraction of targets longer than 512 tokens
len(tmp_df[(tmp_df['section_longer_512'] == 1)]) / 5891

0.35834323544389746
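
The four cells above hard-code the row count 5891. Below is a minimal sketch of the same statistics that derives the count from the data instead. The `sample_source_lens` list is only a stand-in (the first five `sources_len` values visible in the table), and `length_stats` is a hypothetical helper, not part of the notebook.

```python
from statistics import mean

MAX_LEN = 512  # maximum sequence length of bert-base-uncased


def length_stats(lengths, max_len=MAX_LEN):
    """Return (mean length, fraction of samples longer than max_len)."""
    too_long = sum(1 for n in lengths if n > max_len)
    return mean(lengths), too_long / len(lengths)


# illustrative values only: the first five `sources_len` entries shown above
sample_source_lens = [233, 804, 482, 869, 24]
avg_len, frac_too_long = length_stats(sample_source_lens)
print(f"mean={avg_len:.1f}, fraction > {MAX_LEN}: {frac_too_long:.2f}")
```

On the real frame the same numbers should come from `tmp_df["sources_len"].mean()` and `tmp_df["sources_longer_512"].mean()`. Note that the last two cells return fractions: roughly 50% of the sources and 36% of the targets exceed 512 tokens.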

From HF: "bert-base-cased is limited to **512 tokens**, which means we would have to **cut possibly important information from the article**. Because most of the important information is often found at the beginning of articles and because we want to be computationally efficient, we decide to stick to bert-base-cased with a max_length of 512 in this notebook. This choice is not optimal but has shown to yield good results on CNN/Dailymail. Alternatively, one could leverage long-range sequence models, such as Longformer to be used as the encoder."


"sources" and "sections" are tokenized and prepared as the Encoder's "input_ids" and Decoder's "decoder_input_ids" respectively.

"labels" are shifted automatically to the left for language modeling training.

Lastly, it is very important to remember to ignore the loss of the padded labels. In 🤗Transformers this can be done by setting the label to -100. Great, let's write down our mapping function then.

In [13]:
encoder_max_length=512
decoder_max_length=512

def process_data_to_model_inputs(batch):

  # tokenize the inputs and labels
  inputs = tokenizer(batch["sources"], padding="max_length", truncation=True, max_length=encoder_max_length)
  outputs = tokenizer(batch["sections"], padding="max_length", truncation=True, max_length=decoder_max_length)

  batch["input_ids"] = inputs.input_ids
  batch["attention_mask"] = inputs.attention_mask
  batch["decoder_input_ids"] = outputs.input_ids
  batch["decoder_attention_mask"] = outputs.attention_mask
  batch["labels"] = outputs.input_ids.copy()
  
  # because BERT automatically shifts the labels, the labels correspond exactly to `decoder_input_ids`. 
  # We have to make sure that the PAD token is ignored
  batch["labels"] = [[-100 if token == tokenizer.pad_token_id else token for token in labels] for labels in batch["labels"]]

  return batch
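
To make the label handling concrete, here is a toy sketch in plain Python. It assumes the bert-base-uncased ids for [PAD], [CLS] and [SEP] (0, 101, 102). The other token ids are arbitrary, and `shifted_pairs` is a hypothetical helper that only mimics, in simplified form, how the prediction at position t is scored against the label at position t + 1.

```python
PAD_ID, CLS_ID, SEP_ID = 0, 101, 102  # bert-base-uncased ids for [PAD], [CLS], [SEP]
IGNORE_INDEX = -100  # label value that the loss skips


def mask_pad_labels(input_ids, pad_id=PAD_ID):
    """Copy the token ids, replacing padding with IGNORE_INDEX."""
    return [IGNORE_INDEX if tok == pad_id else tok for tok in input_ids]


def shifted_pairs(decoder_input_ids, labels):
    """Pair the input at position t with the label at position t + 1,
    dropping pairs whose label is IGNORE_INDEX."""
    pairs = zip(decoder_input_ids[:-1], labels[1:])
    return [(inp, tgt) for inp, tgt in pairs if tgt != IGNORE_INDEX]


# toy target "[CLS] tok tok [SEP]" padded to length 6 (2023 and 2003 are arbitrary ids)
decoder_input_ids = [CLS_ID, 2023, 2003, SEP_ID, PAD_ID, PAD_ID]
labels = mask_pad_labels(decoder_input_ids)
print(labels)
print(shifted_pairs(decoder_input_ids, labels))
```

Only three (input, target) pairs survive in this toy case, so the padded positions contribute nothing to the loss.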

### Let's prepare the training data.

In [14]:
# batch size used for preprocessing and training (not tuned)
batch_size=4

train_bioleaflets_dataset = train_data_dataset.map(
    process_data_to_model_inputs, 
    batched=True, 
    batch_size=batch_size, 
    remove_columns=["sources", "sections"]
)

In [15]:
train_bioleaflets_dataset

Dataset({
    features: ['attention_mask', 'decoder_attention_mask', 'decoder_input_ids', 'input_ids', 'labels'],
    num_rows: 5891
})
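
As a quick sanity check on the batched mapping, the number of times `process_data_to_model_inputs` is called follows directly from the row count and the batch size. The helper name `num_batches` is hypothetical.

```python
import math


def num_batches(num_rows, batch_size):
    """Number of batches a batched map processes for a given batch size."""
    return math.ceil(num_rows / batch_size)


train_batches = num_batches(5891, 4)  # training set, batch_size=4
print(train_batches)
```

The batch size passed to `map` only controls how many rows are tokenized per call. It does not have to equal the batch size used later for training.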

In [16]:
# see example
# train_bioleaflets_dataset[1]

In [17]:
# Convert the data to PyTorch Tensors to be trained on GPU.
train_bioleaflets_dataset.set_format(
    type="torch", columns=["input_ids", "attention_mask", "decoder_input_ids", "decoder_attention_mask", "labels"],
)

In [18]:
# converted to tensor format
# train_bioleaflets_dataset[1]

## Analogously, do the same for the validation data.

In [19]:
# read val dataset
PATH_VAL_SOURCE = '/home/angelo_ziletti/nlg-ra/T5_experiments/T5_plain/input_data/val.source'
PATH_VAL_TARGET = '/home/angelo_ziletti/nlg-ra/T5_experiments/T5_plain/input_data/val.target'

# read val dataset
with open(PATH_VAL_SOURCE) as f:
    val_input_source = [line.strip() for line in f]

with open(PATH_VAL_TARGET) as f:
    val_input_target = [line.strip() for line in f]

assert len(val_input_source) == len(val_input_target)

# make a dict
val_data_bioleaflets = {'sources': val_input_source, 'sections': val_input_target}

# convert to HF Dataset object
val_data_dataset = Dataset.from_dict(val_data_bioleaflets)

# process val dataset
val_bioleaflets_dataset = val_data_dataset.map(
    process_data_to_model_inputs, 
    batched=True, 
    batch_size=batch_size, 
    remove_columns=["sources", "sections"]
)

# convert format to pytorch tensor
val_bioleaflets_dataset.set_format(
    type="torch", columns=["input_ids", "attention_mask", "decoder_input_ids", "decoder_attention_mask", "labels"],
)


### Warm-starting the EncoderDecoderModel.


The encoder-decoder model is warm-started from the bert-base-uncased checkpoint.



In [20]:
# import the EncoderDecoderModel
from transformers import EncoderDecoderModel

In [21]:
# warm-start both the encoder and decoder with the "bert-base-uncased" checkpoint.
bert2bert = EncoderDecoderModel.from_encoder_decoder_pretrained("bert-base-uncased", "bert-base-uncased")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertLMHeadModel: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertLMHeadModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertLMHeadModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertLMHeadModel were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['bert.encoder.layer.0.crossattention.self.query.weight', 'bert.encoder.layer.0.crossattention.self.query.bias', 'bert.encoder.layer.0.crossattention.self.key.weight', 'bert.encoder.layer.0.crossattention.self.key.bias', 'bert.encoder.layer

Notice that a lot of weights are newly, i.e. randomly, initialized. On closer inspection, these weights all belong to the cross-attention layers, which is exactly what we would expect: BERT has no cross-attention, so the checkpoint contains no weights for it.

In [22]:
bert2bert

EncoderDecoderModel(
  (encoder): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_af

We see that bert2bert.encoder is an instance of BertModel and that bert2bert.decoder one of BertLMHeadModel. However, both instances are now combined into a single torch.nn.Module and can thus be saved as a single .pt checkpoint file.

In [23]:
# save and load as one model

# bert2bert.save_pretrained("my_new_bert2bert")
# bert2bert = EncoderDecoderModel.from_pretrained("my_new_bert2bert")

In [24]:
# inspect the config
bert2bert.config

EncoderDecoderConfig {
  "decoder": {
    "_name_or_path": "bert-base-uncased",
    "add_cross_attention": true,
    "architectures": [
      "BertForMaskedLM"
    ],
    "attention_probs_dropout_prob": 0.1,
    "bad_words_ids": null,
    "bos_token_id": null,
    "chunk_size_feed_forward": 0,
    "decoder_start_token_id": null,
    "diversity_penalty": 0.0,
    "do_sample": false,
    "early_stopping": false,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": null,
    "finetuning_task": null,
    "gradient_checkpointing": false,
    "hidden_act": "gelu",
    "hidden_dropout_prob": 0.1,
    "hidden_size": 768,
    "id2label": {
      "0": "LABEL_0",
      "1": "LABEL_1"
    },
    "initializer_range": 0.02,
    "intermediate_size": 3072,
    "is_decoder": true,
    "is_encoder_decoder": false,
    "label2id": {
      "LABEL_0": 0,
      "LABEL_1": 1
    },
    "layer_norm_eps": 1e-12,
    "length_penalty": 1.0,
    "max_length": 20,
    "max_position_embeddings": 512,
    "min

In [25]:
print(f"\n\nNum Params.: {bert2bert.num_parameters()}")



Num Params.: 247363386
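
The parameter count can be reproduced by hand from the layer sizes printed above. The sketch below assumes the standard bert-base layout (12 layers, intermediate size 3072), a decoder without a pooler, and an output projection tied to the decoder's word embeddings so that it is counted only once. The helper `linear` is hypothetical.

```python
VOCAB, HIDDEN, MAX_POS, TYPES = 30522, 768, 512, 2
LAYERS, INTERMEDIATE = 12, 3072  # bert-base values (assumed)


def linear(n_in, n_out):
    """Parameters of a dense layer: weight matrix plus bias."""
    return n_in * n_out + n_out


LAYER_NORM = 2 * HIDDEN  # scale + shift

embeddings = (VOCAB + MAX_POS + TYPES) * HIDDEN + LAYER_NORM
attention = 4 * linear(HIDDEN, HIDDEN) + LAYER_NORM  # query, key, value, output
feed_forward = linear(HIDDEN, INTERMEDIATE) + linear(INTERMEDIATE, HIDDEN) + LAYER_NORM
layer = attention + feed_forward
pooler = linear(HIDDEN, HIDDEN)

encoder = embeddings + LAYERS * layer + pooler

# decoder: no pooler, one extra cross-attention block per layer, plus the LM head
cross_attention = LAYERS * attention
lm_head = linear(HIDDEN, HIDDEN) + LAYER_NORM + VOCAB  # transform, LayerNorm, output bias
decoder = embeddings + LAYERS * layer + cross_attention + lm_head

total = encoder + decoder
print(f"encoder={encoder}, decoder={decoder}, total={total}")
print(f"newly initialized cross-attention share: {cross_attention / total:.1%}")
```

Under these assumptions the total matches the number reported by `num_parameters()`, and the randomly initialized cross-attention layers account for roughly a tenth of the model.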


The config is similarly composed of an encoder config and a decoder config both of which are instances of BertConfig in our case. However, the overall config is of type EncoderDecoderConfig and is therefore saved as a single .json file.

Once an EncoderDecoderModel object is instantiated, it provides **the same functionality** as any other Encoder-Decoder model in 🤗Transformers, e.g. **BART, T5**

We have warm-started a bert2bert model, but we have not defined all the relevant parameters used for beam search decoding yet.

Let's start by setting the special tokens. bert-base-uncased does not have a decoder_start_token_id or eos_token_id, so we will use its cls_token_id and sep_token_id respectively. Also, we should define a pad_token_id on the config and make sure the correct vocab_size is set.

In [26]:
# set special tokens
bert2bert.config.decoder_start_token_id = tokenizer.cls_token_id
bert2bert.config.eos_token_id = tokenizer.sep_token_id
bert2bert.config.pad_token_id = tokenizer.pad_token_id
bert2bert.config.vocab_size = bert2bert.config.encoder.vocab_size


# sensible parameters for beam search (later we need to change this to have consistent parameters)
bert2bert.config.max_length = 512
bert2bert.config.min_length = 128
bert2bert.config.no_repeat_ngram_size = 3
bert2bert.config.early_stopping = True
bert2bert.config.length_penalty = 2.0
bert2bert.config.num_beams = 4
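
To get a feeling for `length_penalty`, here is a toy sketch. It assumes that finished beam hypotheses are ranked by the sum of their token log-probabilities divided by `length ** length_penalty`, which is how I understand the beam search in 🤗Transformers to work. The two hypotheses and their scores are made up, and `beam_score` is a hypothetical helper.

```python
def beam_score(sum_logprobs, length, length_penalty):
    """Length-normalized hypothesis score: sum of log-probs / length ** penalty."""
    return sum_logprobs / (length ** length_penalty)


# two made-up finished hypotheses with the same average log-prob per token (-0.5)
short_hyp = {"sum_logprobs": -64.0, "length": 128}
long_hyp = {"sum_logprobs": -128.0, "length": 256}

for penalty in (1.0, 2.0):
    short_score = beam_score(short_hyp["sum_logprobs"], short_hyp["length"], penalty)
    long_score = beam_score(long_hyp["sum_logprobs"], long_hyp["length"], penalty)
    print(f"length_penalty={penalty}: short={short_score:.6f}, long={long_score:.6f}")
```

With a penalty of 1.0 both hypotheses tie, while a penalty of 2.0 ranks the longer one higher, so values above 1.0 push generation towards longer outputs.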

## Start fine-tuning the warm-started BERT2BERT model


We use the Seq2SeqTrainer to fine-tune the warm-started encoder-decoder model, just as with T5 or BART.


Seq2SeqTrainer extends the 🤗Transformers Trainer for encoder-decoder models. In short, it allows using the generate(...) function during evaluation, which is necessary to validate the performance of encoder-decoder models on most sequence-to-sequence tasks, such as summarization.

In [27]:
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

In [28]:
# the Seq2SeqTrainer needs a few additional Python packages;
# install them if they are missing

# already installed in the environment
#!pip install git-python==1.0.3
#!pip install rouge_score
#!pip install sacrebleu

In [29]:
# set training arguments - these params are not really tuned, feel free to change

# gradient_accumulation_steps really depends on the batch_size!

output_dir = "/home/angelo_ziletti/nlg-ra/T5_experiments/BERT2BERT/"

training_args = Seq2SeqTrainingArguments(
    output_dir=output_dir,
    evaluation_strategy="steps",
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    gradient_accumulation_steps=8,
    predict_with_generate=True,
    logging_steps=1000,
    save_steps=500,
    eval_steps=8000,
    warmup_steps=2000,
    overwrite_output_dir=True,
    save_total_limit=3,
)
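
Since the comment above stresses that `gradient_accumulation_steps` depends on the batch size, it may help to spell out the arithmetic. The sketch below assumes a single GPU and the default of 3 training epochs (neither is stated in the cell), and it follows my understanding of how the Trainer counts optimizer steps.

```python
import math

num_train_rows = 5891
per_device_train_batch_size = 4
gradient_accumulation_steps = 8
num_devices = 1        # assumption: a single GPU
num_train_epochs = 3   # assumption: the Trainer default, not set in the cell above

effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_devices
batches_per_epoch = math.ceil(num_train_rows / (per_device_train_batch_size * num_devices))
updates_per_epoch = max(batches_per_epoch // gradient_accumulation_steps, 1)
total_updates = num_train_epochs * updates_per_epoch

print(effective_batch_size, updates_per_epoch, total_updates)
```

Under these assumptions the run performs only a few hundred optimizer steps in total, fewer than `warmup_steps=2000` and `eval_steps=8000`. If that holds for the actual setup, the learning rate never leaves the warm-up phase and no intermediate evaluation is triggered, so these values (or the number of epochs) are worth revisiting.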

In [30]:
# load rouge for validation
rouge = datasets.load_metric("rouge")

def compute_metrics(pred):
    """
    The rouge metric computes the score from two lists of strings. 
    Thus we decode both the predictions and labels - making sure that -100 is correctly replaced 
      by the pad_token_id and remove all special characters by setting skip_special_tokens=True.

    """
    labels_ids = pred.label_ids
    pred_ids = pred.predictions

    # all unnecessary tokens are removed
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    labels_ids[labels_ids == -100] = tokenizer.pad_token_id
    label_str = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)

    rouge_output = rouge.compute(predictions=pred_str, references=label_str, rouge_types=["rouge2"])["rouge2"].mid

    return {
        "rouge2_precision": round(rouge_output.precision, 4),
        "rouge2_recall": round(rouge_output.recall, 4),
        "rouge2_fmeasure": round(rouge_output.fmeasure, 4),
    }
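
For intuition about what the ROUGE-2 numbers mean, here is a simplified sketch in plain Python. It is not the implementation behind `datasets.load_metric("rouge")`: it splits on whitespace, applies no stemming, and scores a single pair instead of aggregating over a corpus. The function names are hypothetical.

```python
from collections import Counter


def bigrams(text):
    """Multiset of adjacent token pairs in a whitespace-tokenized string."""
    tokens = text.split()
    return Counter(zip(tokens, tokens[1:]))


def rouge2(prediction, reference):
    """Simplified ROUGE-2: clipped bigram overlap between prediction and reference."""
    pred, ref = bigrams(prediction), bigrams(reference)
    overlap = sum((pred & ref).values())
    precision = overlap / max(sum(pred.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    fmeasure = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "fmeasure": fmeasure}


scores = rouge2(
    "keep this medicine out of the reach of children",
    "keep this medicine out of the sight and reach of children",
)
print({name: round(value, 4) for name, value in scores.items()})
```

Precision is the share of predicted bigrams that also occur in the reference, recall is the share of reference bigrams that were recovered, and the F-measure is their harmonic mean.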

## Pass all arguments to the Seq2SeqTrainer and start finetuning

In [31]:
# instantiate trainer
trainer = Seq2SeqTrainer(
    model=bert2bert,
    tokenizer=tokenizer,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_bioleaflets_dataset,
    eval_dataset=val_bioleaflets_dataset,
)

#trainer.train()

In [32]:
# fine-tuned model can be saved and loaded

#bert2bert.save_pretrained("my_new_bert2bert")
#bert2bert = EncoderDecoderModel.from_pretrained("my_new_bert2bert")

### Evaluation

In [36]:
bert2bert_path = '/home/angelo_ziletti/nlg-ra/T5_experiments/BERT2BERT/my_new_bert2bert'
bert2bert_model = EncoderDecoderModel.from_pretrained(bert2bert_path)

bert2bert_model.to("cuda")


# read test dataset
PATH_TEST_SOURCE = '/home/angelo_ziletti/nlg-ra/T5_experiments/T5_plain/input_data/test.source'
PATH_TEST_TARGET = '/home/angelo_ziletti/nlg-ra/T5_experiments/T5_plain/input_data/test.target'

# load the test sources and targets (one example per line)
with open(PATH_TEST_SOURCE) as f:
    test_input_source = [line.strip() for line in f]

with open(PATH_TEST_TARGET) as f:
    test_input_target = [line.strip() for line in f]

# make a dict
test_data_bioleaflets = {'sources': test_input_source, 'sections': test_input_target}

# convert to HF Dataset object
test_data_dataset = Dataset.from_dict(test_data_bioleaflets)

batch_size = 16


def generate_summary(batch):
    # the BERT tokenizer automatically wraps the text as [CLS] <text> [SEP]
    # inputs are cut off at the BERT max length of 512
    inputs = tokenizer(batch["sources"], padding="max_length", truncation=True, max_length=512, return_tensors="pt")
    input_ids = inputs.input_ids.to("cuda")
    attention_mask = inputs.attention_mask.to("cuda")

    outputs = bert2bert_model.generate(input_ids, attention_mask=attention_mask)

    # all special tokens will be removed
    output_str = tokenizer.batch_decode(outputs, skip_special_tokens=True)

    batch["pred"] = output_str

    return batch

results = test_data_dataset.map(generate_summary, batched=True, batch_size=batch_size, remove_columns=["sources", "sections"])

pred_str = results["pred"]
# the references live in the "sections" column of the unmapped test set
# (this dataset has no "highlights" column, and "sections" was removed from `results`)
label_str = test_data_dataset["sections"]

rouge_output = rouge.compute(predictions=pred_str, references=label_str, rouge_types=["rouge2"])["rouge2"].mid

print(rouge_output)

In [41]:
output_dir = "/home/angelo_ziletti/nlg-ra/T5_experiments/BERT2BERT/"

output_file_predictions = output_dir + "predictions_1.txt"

results.to_csv(output_file_predictions)

print("Results saved to {}".format(output_file_predictions))

Results saved to /home/angelo_ziletti/nlg-ra/T5_experiments/BERT2BERT/predictions_1.txt
