<a href="https://colab.research.google.com/github/danieljunior/convert_bert_to_long/blob/main/convert_bert_to_long.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# `BERT` --> `Longformer`: build a "long" version of pretrained models

This notebook replicates the procedure described in the [Longformer paper](https://arxiv.org/abs/2004.05150) to train a Longformer model starting from a BERT checkpoint. The same procedure can be applied to build the "long" version of other pretrained models as well. It was inspired by the notebook provided by AllenAI to convert RoBERTa to Longformer: [convert_model_to_long.ipynb](https://colab.research.google.com/github/allenai/longformer/blob/master/scripts/convert_model_to_long.ipynb).


### Libraries and imports


In [None]:
!pip install transformers==3.0.2



In [None]:
import logging
import os
import math
import torch
import tensorflow as tf
from dataclasses import dataclass, field
from transformers import AutoModel, AutoTokenizer, BertTokenizerFast, BertForMaskedLM, BertModel
from transformers import TrainingArguments, HfArgumentParser
from transformers.modeling_longformer import LongformerSelfAttention

logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)

### BertLong

`BertLong` represents the "long" version of the `BERT` model. It replaces `BertSelfAttention` with `BertLongSelfAttention`, which is a thin wrapper around `LongformerSelfAttention`.


In [None]:
class BertLongSelfAttention(LongformerSelfAttention):
    def forward(
        self,
        hidden_states,
        attention_mask=None,
        head_mask=None,
        encoder_hidden_states=None,
        encoder_attention_mask=None,
        output_attentions=False,
    ):
        return super().forward(hidden_states, attention_mask=attention_mask, output_attentions=output_attentions)


class BertLong(BertModel):
    def __init__(self, config):
        super().__init__(config)
        for i, layer in enumerate(self.encoder.layer):
            # replace the `modeling_bert.BertSelfAttention` object with `LongformerSelfAttention`
            layer.attention.self = BertLongSelfAttention(config, layer_id=i)

Starting from a `bert-base` checkpoint (here `neuralmind/bert-base-portuguese-cased`), the following function converts it into a model that `BertLong` can load. It makes the following changes:

- extend the position embeddings from `512` positions to `max_pos`. In Longformer, `max_pos=4096`

- initialize the additional position embeddings by copying the embeddings of the first `512` positions. This initialization is crucial for model performance (check table 6 in [the paper](https://arxiv.org/pdf/2004.05150.pdf) for performance without this initialization)

- replace the `modeling_bert.BertSelfAttention` objects with `modeling_longformer.LongformerSelfAttention` objects that use an attention window of size `attention_window`

The output of this function works for long documents even without pretraining. Check tables 6 and 11 in [the paper](https://arxiv.org/pdf/2004.05150.pdf) to get a sense of the expected performance of this model before pretraining.
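The position-embedding initialization described above can be illustrated on a toy table. This is a minimal sketch using plain Python lists instead of tensors; the helper name `tile_position_rows` and the tiny 4 x 2 table are hypothetical and only stand in for the real 512 x 768 embedding matrix.

```python
def tile_position_rows(rows, max_pos):
    """Return `max_pos` rows built by repeating `rows` from the start, over and over."""
    if max_pos <= len(rows):
        raise ValueError("max_pos must exceed the original number of positions")
    # position p in the extended table reuses original position p % len(rows)
    return [list(rows[p % len(rows)]) for p in range(max_pos)]


# toy table: 4 positions with embedding size 2 (hypothetical values)
original = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1], [3.0, 3.1]]
extended = tile_position_rows(original, max_pos=8)

print(len(extended))   # 8
print(extended[5])     # [1.0, 1.1]
```

Every new position starts from an embedding the model has already learned, rather than from random values.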

In [None]:
def create_long_model(save_model_to, attention_window, max_pos):
    model = BertModel.from_pretrained('neuralmind/bert-base-portuguese-cased')
    tokenizer = BertTokenizerFast.from_pretrained('neuralmind/bert-base-portuguese-cased', model_max_length=max_pos)
    config = model.config

    print(max_pos)
    # extend position embeddings
    tokenizer.model_max_length = max_pos
    tokenizer.init_kwargs['model_max_length'] = max_pos
    current_max_pos, embed_size = model.embeddings.position_embeddings.weight.shape
    config.max_position_embeddings = max_pos
    assert max_pos > current_max_pos
    # allocate a larger position embedding matrix
    new_pos_embed = model.embeddings.position_embeddings.weight.new_empty(max_pos, embed_size)
    print(new_pos_embed.shape)
    print(model.embeddings.position_embeddings)
    # copy position embeddings over and over to initialize the new position embeddings
    k = 0
    step = current_max_pos
    while k < max_pos:
        # the last chunk may be shorter when max_pos is not a multiple of current_max_pos
        end = min(k + step, max_pos)
        new_pos_embed[k:end] = model.embeddings.position_embeddings.weight[:end - k]
        k += step
    print(new_pos_embed.shape)
    # position ids 0..max_pos-1 with a leading batch dimension (int64, as `torch.nn.Embedding` expects)
    model.embeddings.position_ids = torch.arange(max_pos, dtype=torch.long).unsqueeze(0)
    model.embeddings.position_embeddings = torch.nn.Embedding.from_pretrained(new_pos_embed)
    
    # replace the `modeling_bert.BertSelfAttention` object with `LongformerSelfAttention`
    config.attention_window = [attention_window] * config.num_hidden_layers
    for i, layer in enumerate(model.encoder.layer):
        longformer_self_attn = LongformerSelfAttention(config, layer_id=i)
        longformer_self_attn.query = layer.attention.self.query
        longformer_self_attn.key = layer.attention.self.key
        longformer_self_attn.value = layer.attention.self.value

        longformer_self_attn.query_global = layer.attention.self.query
        longformer_self_attn.key_global = layer.attention.self.key
        longformer_self_attn.value_global = layer.attention.self.value

        layer.attention.self = longformer_self_attn
    print(model.embeddings.position_ids.shape)
    logger.info(f'saving model to {save_model_to}')
    model.save_pretrained(save_model_to)
    tokenizer.save_pretrained(save_model_to)
    return model, tokenizer, new_pos_embed

**Training hyperparameters**

- Following the BERT pretraining setting, we set the number of tokens per batch to `2^18`. Changing this number might require changes to the learning rate, the lr-scheduler, the number of steps and the number of warmup steps. Therefore, it is a good idea to keep this number constant.

- Note that: `#tokens/batch = batch_size x #gpus x gradient_accumulation x seqlen`

- In [the paper](https://arxiv.org/pdf/2004.05150.pdf), we train for 65k steps, but 3k is probably enough (check table 6)

- **Important note**: The lr-scheduler in [the paper](https://arxiv.org/pdf/2004.05150.pdf) is polynomial_decay with power 3 over 65k steps. To train for 3k steps, use a constant lr-scheduler (after warmup). Neither lr-scheduler is supported in the HF trainer, so at least the **constant lr-scheduler** will need to be added.

- Pretraining will take 2 days on 1 x 32GB GPU with fp32. Consider using fp16 and more GPUs to train faster (if you increase `#gpus`, reduce `gradient_accumulation` to maintain `#tokens/batch` as mentioned earlier).

- As a demonstration, this notebook trains on wikitext103, but wikitext103 is small enough that it takes 7 epochs to train for 3k steps. Consider doing a single epoch on a larger dataset (800M tokens) instead.

- Set #gpus using `CUDA_VISIBLE_DEVICES`
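The token budget in the list above is easy to check by hand. The sketch below is only arithmetic; the helper name `tokens_per_batch` is hypothetical, and the numbers plugged in are the ones this notebook uses (per-GPU batch size 2, 32 gradient accumulation steps, one GPU, sequence length 4096).

```python
def tokens_per_batch(batch_size, num_gpus, gradient_accumulation, seqlen):
    """#tokens/batch = batch_size x #gpus x gradient_accumulation x seqlen."""
    return batch_size * num_gpus * gradient_accumulation * seqlen


per_step = tokens_per_batch(batch_size=2, num_gpus=1, gradient_accumulation=32, seqlen=4096)
print(per_step, per_step == 2 ** 18)   # 262144 True

# doubling the GPUs while halving the accumulation keeps the budget unchanged
same_budget = tokens_per_batch(batch_size=2, num_gpus=2, gradient_accumulation=16, seqlen=4096)
print(same_budget == per_step)         # True

# tokens consumed by a 3k-step run, close to the 800M-token dataset size suggested above
total_tokens = 3000 * per_step
print(total_tokens)                    # 786432000
```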

In [None]:
@dataclass
class ModelArgs:
    attention_window: int = field(default=512, metadata={"help": "Size of attention window"})
    max_pos: int = field(default=4096, metadata={"help": "Maximum position"})

parser = HfArgumentParser((TrainingArguments, ModelArgs,))


training_args, model_args = parser.parse_args_into_dataclasses(look_for_args_file=False, args=[
    '--output_dir', 'tmp',
    '--warmup_steps', '500',
    '--learning_rate', '0.00003',
    '--weight_decay', '0.01',
    '--adam_epsilon', '1e-6',
    '--max_steps', '3000',
    '--logging_steps', '500',
    '--save_steps', '500',
    '--max_grad_norm', '5.0',
    '--per_gpu_eval_batch_size', '8',
    '--per_gpu_train_batch_size', '2',  # 32GB gpu with fp32
    '--gradient_accumulation_steps', '32',
    '--evaluate_during_training',
    '--do_train',
    '--do_eval',
])

# Choose GPU
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
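The training arguments above use 500 warmup steps and a peak learning rate of 3e-5. The two schedules mentioned in the hyperparameter notes can be sketched as learning-rate multipliers. This is an illustrative sketch, not the paper's exact implementation: the function names are hypothetical, the polynomial form shown is one common definition, and reusing 500 warmup steps for the 65k-step schedule is an assumption.

```python
def constant_with_warmup(step, warmup_steps=500):
    """LR multiplier: linear ramp from 0 to 1 during warmup, then constant 1."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return 1.0


def polynomial_decay_with_warmup(step, total_steps=65000, warmup_steps=500, power=3.0):
    """LR multiplier: linear warmup, then decay to 0 as (remaining fraction) ** power."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    remaining = max(0, total_steps - step) / max(1, total_steps - warmup_steps)
    return remaining ** power


peak_lr = 3e-5  # matches --learning_rate above
for step in (0, 250, 500, 3000):
    # effective learning rate at this step under the constant-after-warmup schedule
    print(step, peak_lr * constant_with_warmup(step))
```

A function of this shape could be passed as `lr_lambda` to `torch.optim.lr_scheduler.LambdaLR`, which multiplies the optimizer's base learning rate by the returned value at each step.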

1) As described in `create_long_model`, convert a `bert-base` model into `bert-base-4096`, which is an instance of `BertLong`, then save it to disk.

In [None]:
model_path = f'{training_args.output_dir}/bert-base-{model_args.max_pos}'
if not os.path.exists(model_path):
    os.makedirs(model_path)

logger.info(f'Converting bert-base into bert-base-{model_args.max_pos}')
model, tokenizer, new_pos_embed = create_long_model(
    save_model_to=model_path, attention_window=model_args.attention_window, max_pos=model_args.max_pos)
#create_long_model(save_model_to, attention_window, max_pos)

2) Load `bert-base-4096` from the disk. This model works for long sequences even without pretraining. If you don't want to pretrain, you can stop here and start finetuning your `bert-base-4096` on downstream tasks 🎉🎉🎉

In [None]:
logger.info(f'Loading the model from {model_path}')
tokenizer = BertTokenizerFast.from_pretrained(model_path)
model = BertLong.from_pretrained(model_path, output_hidden_states=True)

INFO:__main__:Loading the model from tmp/bert-base-4096
INFO:transformers.tokenization_utils_base:Model name 'tmp/bert-base-4096' not found in model shortcut name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese, bert-base-german-cased, bert-large-uncased-whole-word-masking, bert-large-cased-whole-word-masking, bert-large-uncased-whole-word-masking-finetuned-squad, bert-large-cased-whole-word-masking-finetuned-squad, bert-base-cased-finetuned-mrpc, bert-base-german-dbmdz-cased, bert-base-german-dbmdz-uncased, TurkuNLP/bert-base-finnish-cased-v1, TurkuNLP/bert-base-finnish-uncased-v1, wietsedv/bert-base-dutch-cased). Assuming 'tmp/bert-base-4096' is a path, a model identifier, or url to a directory containing tokenizer files.
INFO:transformers.tokenization_utils_base:Didn't find file tmp/bert-base-4096/added_tokens.json. We won't load it.
INFO:transformers.tokenization_utils_ba

In [None]:
!pip install nltk
import nltk    
from nltk import tokenize
nltk.download('punkt')

def convert_examples_to_features(example, seq_length, tokenizer):
  """Convert one text into BERT inputs padded or truncated to `seq_length`.

  Returns (input_ids, segment_ids, input_mask) as plain Python lists.
  """
  tokens = ['[CLS]']
  for w in tokenize.word_tokenize(example, language='portuguese'):
      # use the BERT tokenizer to split each word into sub-words, e.g.
      # 1996-08-22 => 1996 - 08 - 22
      # sheepmeat => sheep ##me ##at
      sub_words = tokenizer.tokenize(w)
      if not sub_words:
          sub_words = ['[UNK]']
      tokens.extend(sub_words)

  # truncate, leaving room for the final [SEP]
  if len(tokens) > seq_length - 1:
      print('Example is too long, length is {}, truncated to {}!'.format(len(tokens), seq_length))
      tokens = tokens[0:(seq_length - 1)]
  tokens.append('[SEP]')

  input_ids = tokenizer.convert_tokens_to_ids(tokens)
  segment_ids = [0] * len(input_ids)
  input_mask = [1] * len(input_ids)

  # pad up to seq_length; padded positions get attention mask 0
  pad_id = tokenizer.pad_token_id
  while len(input_ids) < seq_length:
    input_ids.append(pad_id)
    input_mask.append(0)
    segment_ids.append(0)

  return input_ids, segment_ids, input_mask


def get_features(input_text, model, tokenizer, dim=768, max_length=4096):
  """Run `model` on one text padded to `max_length` tokens and return the raw model outputs.

  `dim` is unused and kept only so existing calls keep working.
  """
  input_ids, segment_ids, input_mask = convert_examples_to_features(
      example=input_text, seq_length=max_length, tokenizer=tokenizer)
  # wrap each list in a batch dimension of size 1
  input_ids = torch.tensor([input_ids], dtype=torch.long)
  segment_ids = torch.tensor([segment_ids], dtype=torch.long)
  input_mask = torch.tensor([input_mask], dtype=torch.long)
  outputs = model(input_ids, token_type_ids=segment_ids,
                  attention_mask=input_mask)
  return outputs

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
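`get_features` pads every input to 4096 tokens, which is wasteful for short texts. One possible refinement is to pad only up to the next multiple of the attention window. The sketch below assumes (this is not verified in this notebook) that the sliding-window attention in `transformers==3.0.2` needs a sequence length divisible by the window size; the helper name `round_up_to_window` is hypothetical.

```python
def round_up_to_window(num_tokens, attention_window=512, max_pos=4096):
    """Smallest multiple of `attention_window` that fits `num_tokens`, capped at `max_pos`."""
    if num_tokens < 1:
        raise ValueError("num_tokens must be positive")
    # ceiling division without floats, then scale back up to a token count
    padded = -(-num_tokens // attention_window) * attention_window
    return min(padded, max_pos)


print(round_up_to_window(10))     # 512
print(round_up_to_window(513))    # 1024
print(round_up_to_window(5000))   # 4096
```

The result could be passed as `seq_length` to `convert_examples_to_features` in place of the fixed 4096.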


In [None]:
embeddings = get_features('Olá, essa é uma sentença de teste', model, tokenizer)
print(embeddings)

(tensor([[[ 0.2329, -0.1494,  0.3605,  ..., -0.1701, -0.2105, -0.5222],
         [-0.1091, -0.5142,  0.4510,  ...,  0.1104,  0.0641, -0.2985],
         [ 0.1689, -0.6535,  0.4545,  ...,  0.1635, -0.0096, -0.5424],
         ...,
         [-0.1672,  0.0212,  0.5901,  ..., -0.0417, -0.0309, -0.3464],
         [-0.1043, -0.0432,  0.6870,  ...,  0.0186, -0.0413, -0.5592],
         [ 0.0080, -0.0725,  0.3609,  ..., -0.3591,  0.0622, -0.3122]]],
       grad_fn=<NativeLayerNormBackward>), tensor([[ 2.9375e-02, -6.5288e-02,  5.5953e-02,  1.0286e-01, -2.8127e-02,
          1.3449e-01,  9.8631e-01, -1.3504e-02,  2.4237e-02, -3.5371e-02,
          4.5435e-01,  1.0632e-01,  8.8997e-02, -1.6181e-01,  3.1021e-02,
         -9.1601e-02, -6.5733e-02, -2.9517e-03, -4.2562e-01,  9.9206e-01,
         -2.6046e-01,  1.8278e-02,  1.0318e-01, -5.5111e-02, -5.8931e-02,
          3.3217e-02,  9.4834e-02, -1.0454e-01, -1.1867e-01, -8.0529e-02,
          1.2652e-02, -9.8999e-01,  3.1234e-01, -1.4664e-01,  8.2998e-

In [None]:
print(embeddings[0])
print(embeddings[2][-1])

tensor([[[ 0.2329, -0.1494,  0.3605,  ..., -0.1701, -0.2105, -0.5222],
         [-0.1091, -0.5142,  0.4510,  ...,  0.1104,  0.0641, -0.2985],
         [ 0.1689, -0.6535,  0.4545,  ...,  0.1635, -0.0096, -0.5424],
         ...,
         [-0.1672,  0.0212,  0.5901,  ..., -0.0417, -0.0309, -0.3464],
         [-0.1043, -0.0432,  0.6870,  ...,  0.0186, -0.0413, -0.5592],
         [ 0.0080, -0.0725,  0.3609,  ..., -0.3591,  0.0622, -0.3122]]],
       grad_fn=<NativeLayerNormBackward>)
tensor([[[ 0.2329, -0.1494,  0.3605,  ..., -0.1701, -0.2105, -0.5222],
         [-0.1091, -0.5142,  0.4510,  ...,  0.1104,  0.0641, -0.2985],
         [ 0.1689, -0.6535,  0.4545,  ...,  0.1635, -0.0096, -0.5424],
         ...,
         [-0.1672,  0.0212,  0.5901,  ..., -0.0417, -0.0309, -0.3464],
         [-0.1043, -0.0432,  0.6870,  ...,  0.0186, -0.0413, -0.5592],
         [ 0.0080, -0.0725,  0.3609,  ..., -0.3591,  0.0622, -0.3122]]],
       grad_fn=<NativeLayerNormBackward>)


In [None]:
torch.stack(embeddings[2][-4:]).sum(0)

tensor([[[ 0.2799, -0.2073,  0.4802,  ..., -0.5777, -0.3478, -0.3016],
         [-0.3640, -2.3847,  0.4144,  ...,  2.0576,  0.6135,  0.9092],
         [ 1.2091, -1.4786,  0.8424,  ...,  2.4500,  1.7974,  0.7206],
         ...,
         [-0.1366,  1.7886,  1.5123,  ...,  0.3546,  0.0368, -1.1302],
         [ 0.6175,  1.4245,  2.4410,  ...,  0.9552, -0.0788, -2.9404],
         [ 0.1089, -0.0056,  0.6044,  ..., -0.7702, -0.0874, -0.2162]]],
       grad_fn=<SumBackward1>)