## Armin Behjati



## Before Transformer

Back in 2017, most people applying neural networks to Natural Language Processing relied on 
sequential processing of the input through a [Recurrent Neural Network (RNN)](https://en.wikipedia.org/wiki/Recurrent_neural_network).

![rnn](http://colah.github.io/posts/2015-09-NN-Types-FP/img/RNN-general.png)   

RNNs performed well on a large variety of tasks involving sequential dependencies over the input sequence. 
However, this sequentially dependent process struggled to model very long-range dependencies and 
parallelized poorly, which makes it a bad fit for the kind of hardware we rely on today. 
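
To make the sequential bottleneck concrete, here is a minimal sketch of a vanilla RNN in plain Python, reduced to a single scalar hidden unit. The weights `w_x`, `w_h`, `b` and the toy input are made-up values for illustration, not taken from any trained model. Step `t` cannot start before step `t-1` has finished, and the trace of an early input fades as the sequence gets longer.

```python
import math

def rnn_forward(inputs, w_x=0.5, w_h=0.8, b=0.0):
    """Run a scalar vanilla RNN over a sequence and return every hidden state.

    w_x, w_h and b are hypothetical toy weights chosen for illustration.
    """
    h = 0.0  # initial hidden state
    states = []
    for x in inputs:  # each step needs the previous h, so the loop is inherently sequential
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states

# Only the first element carries a signal; watch it fade in the later states
states = rnn_forward([1.0, 0.0, 0.0, 0.0])
print(states)
```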

Some extensions were proposed by the academic community, such as the Bidirectional RNN ([Schuster & Paliwal, 1997](https://www.researchgate.net/publication/3316656_Bidirectional_recurrent_neural_networks), [Graves & al., 2005](https://mediatum.ub.tum.de/doc/1290195/file.pdf)), 
which can be seen as a concatenation of two sequential processes, one going forward and the other going backward over the input sequence.

![birnn](https://miro.medium.com/max/764/1*6QnPUSv_t9BY9Fv8_aLb-Q.png)


Another extension was the attention mechanism, which improved on "raw" RNNs by assigning 
a learned importance weight to each element in the sequence, allowing the model to focus on the most relevant elements.

![attention_rnn](https://3qeqpr26caki16dnhd19sv6by6v-wpengine.netdna-ssl.com/wp-content/uploads/2017/08/Example-of-Attention.png)  
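
The weighting idea can be sketched in a few lines of plain Python. This is a simplified illustration, not the original formulation: it scores each state with a plain dot product against a query instead of a learned scoring network, and the state vectors and query are hypothetical values.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, states):
    """Weight each state vector by its dot-product similarity to the query."""
    scores = [sum(q * s for q, s in zip(query, state)) for state in states]
    weights = softmax(scores)
    dim = len(states[0])
    # The context vector is the weighted average of all states
    context = [sum(w * state[i] for w, state in zip(weights, states)) for i in range(dim)]
    return weights, context

# Hypothetical encoder states for a 3-element sequence, and a decoder query
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]
weights, context = attend(query, states)
print(weights)  # positions 0 and 2 match the query best and get the largest weights
print(context)
```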

## Then comes the Transformer  

The Transformer era started with the work of [(Vaswani & al., 2017)](https://arxiv.org/abs/1706.03762), who
demonstrated its superiority over [Recurrent Neural Network (RNN)](https://en.wikipedia.org/wiki/Recurrent_neural_network)
on translation tasks. It then quickly extended to almost all the tasks where RNNs were state of the art at the time.

One advantage of the Transformer over its RNN counterpart is its non-sequential attention model. Remember, RNNs had to
iterate over each element of the input sequence one by one and carry an "updatable state" between each hop. With the Transformer, the model is able to look at every position in the sequence at the same time, in one operation.


![transformer-encoder-decoder](https://nlp.seas.harvard.edu/images/the-annotated-transformer_14_0.png)
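
As a rough sketch of "every position at once", the snippet below implements single-head scaled dot-product self-attention with NumPy. The sizes and random weights are toy assumptions. A real Transformer adds multiple heads, masking, residual connections and layer normalization on top of this. Note that there is no loop over positions: the whole sequence is handled by a few matrix products.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a whole sequence in one shot."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])               # (seq_len, seq_len)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v, weights

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8  # toy sizes
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

out, weights = self_attention(x, w_q, w_k, w_v)
print(out.shape, weights.shape)
```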

## Getting started with transformers

For the rest of this notebook, we will use the [BERT (Devlin & al., 2018)](https://arxiv.org/abs/1810.04805) architecture, as it is the simplest and there is plenty of content about it
on the internet, so it will be easy to dig deeper into this architecture if you want to.


In [None]:
!pip install -q transformers

In [None]:
import torch
from transformers import BertForMaskedLM, BertTokenizer, pipeline
from transformers import AlbertForMaskedLM, AlbertTokenizer
from transformers import AutoModel, AutoTokenizer
import re
import numpy as np
import os
import requests

In [None]:
# We need to create the model and tokenizer
model = AutoModel.from_pretrained("Rostlab/prot_bert")
tokenizer = AutoTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)

## What are the inputs to BERT, and what comes out of it?

Let's start by treating BERT as a black box. The minimum that we need to understand to use the black box is what data to feed into it, and what type of outputs to expect. You can build on top of these outputs, for example by adding one or more linear layers. You can then fine-tune your custom architecture on your data. 

### Tokenization

Before you feed your text into BERT, you need to turn it into numbers. That's the role of a tokenizer. Some tokenizers split text on spaces, so that each token corresponds to a word. However, that results in a huge vocabulary, which makes training a model more difficult, so BERT relies on sub-word tokenization instead. The protein tokenizer loaded above is a simpler case: its vocabulary has only 30 entries, and every space-separated amino acid letter becomes one token. Let's see how it works in code. 
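
To illustrate sub-word tokenization on ordinary text first, here is a minimal sketch of a greedy longest-match-first split in the style of WordPiece. The vocabulary below is a hypothetical toy example, not the real BERT vocabulary, and the function ignores details such as lower-casing and punctuation handling.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # try a shorter candidate
        if piece is None:
            return [unk]  # no piece matches: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical toy vocabulary, not the real BERT one
vocab = {"token", "##ization", "##izer", "un", "##related"}
print(wordpiece("tokenization", vocab))  # ['token', '##ization']
print(wordpiece("unrelated", vocab))     # ['un', '##related']
print(wordpiece("xyz", vocab))           # ['[UNK]']
```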




In [None]:
# Tokens come from a process that splits the input into sub-entities (here, one token per amino acid).
tokens = tokenizer.tokenize("D L I P T S S K L V V")
print("Tokens: {}".format(tokens))

# This is not sufficient for the model, as it requires integers as input, 
# not a problem, let's convert tokens to ids.
tokens_ids = tokenizer.convert_tokens_to_ids(tokens)
print("Tokens id: {}".format(tokens_ids))

# Add the required special tokens
tokens_ids = tokenizer.build_inputs_with_special_tokens(tokens_ids)

# We need to convert to a Deep Learning framework specific format, let's use PyTorch for now.
tokens_pt = torch.tensor([tokens_ids])
print("Tokens PyTorch: {}".format(tokens_pt))


Tokens: ['D', 'L', 'I', 'P', 'T', 'S', 'S', 'K', 'L', 'V', 'V']
Tokens id: [14, 5, 11, 16, 15, 10, 10, 12, 5, 8, 8]
Tokens PyTorch: tensor([[ 2, 14,  5, 11, 16, 15, 10, 10, 12,  5,  8,  8,  3]])


In [None]:
# Padding highlight
tokens = tokenizer(
    ["D L I P T S S K V", " T S L Q V K K A F F A L V T"], 
    padding=True  # First sentence will have some PADDED tokens to match second sequence length
)

for i in range(2):
    print("Tokens (int)      : {}".format(tokens['input_ids'][i]))
    print("Tokens (str)      : {}".format([tokenizer.convert_ids_to_tokens(s) for s in tokens['input_ids'][i]]))
    print("Tokens (attn_mask): {}".format(tokens['attention_mask'][i]))
    print()

Tokens (int)      : [2, 14, 5, 11, 16, 15, 10, 10, 12, 8, 3, 0, 0, 0, 0, 0]
Tokens (str)      : ['[CLS]', 'D', 'L', 'I', 'P', 'T', 'S', 'S', 'K', 'V', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]']
Tokens (attn_mask): [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

Tokens (int)      : [2, 15, 10, 5, 18, 8, 12, 12, 6, 19, 19, 6, 5, 8, 15, 3]
Tokens (str)      : ['[CLS]', 'T', 'S', 'L', 'Q', 'V', 'K', 'K', 'A', 'F', 'F', 'A', 'L', 'V', 'T', '[SEP]']
Tokens (attn_mask): [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
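
What `padding=True` does to the batch can be reproduced by hand. The sketch below is a simplified stand-in for the tokenizer's padding logic (right padding only, with the `[PAD]` id 0 seen in the output above), applied to the two id lists without their padding.

```python
def pad_batch(batch_ids, pad_id=0):
    """Right-pad every id list to the longest one and build the matching attention masks."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)  # 1 = real token, 0 = padding
    return input_ids, attention_mask

# Ids copied from the output above ([CLS]=2, [SEP]=3, [PAD]=0)
short = [2, 14, 5, 11, 16, 15, 10, 10, 12, 8, 3]
long = [2, 15, 10, 5, 18, 8, 12, 12, 6, 19, 19, 6, 5, 8, 15, 3]

input_ids, attention_mask = pad_batch([short, long])
print(input_ids[0])
print(attention_mask[0])
```

The attention mask tells the model which positions to ignore, so the padded positions do not influence the representation of the real tokens.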



### Outputs

We already loaded a pretrained model above, so let's run our sequence through it and see what comes out. The tokens have already been converted into a tensor with a batch dimension (here, we work with batch size 1). 

In [None]:
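# Now we're ready to go through BERT with our input.
# return_dict=False asks for a plain tuple; recent transformers versions (>= 3.1 accept
# this argument) otherwise return an output object that cannot be unpacked like this.
outputs, pooled = model(tokens_pt, return_dict=False)
print("Token wise output: {}, Pooled output: {}".format(outputs.shape, pooled.shape))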

Token wise output: torch.Size([1, 13, 1024]), Pooled output: torch.Size([1, 1024])


The model outputs a tuple. The first item of the tuple has the following shape: 1 (batch size) x 13 (sequence length) x 1024 (the number of hidden units). This is called the sequence output, and it provides the representation of each token in the context of the other tokens in the sequence. If we'd like to fine-tune our model for a token-level task such as named entity recognition, we will use this output and expect the 1024 numbers representing each token to inform us whether the token corresponds to a named entity. 

The second item in the tuple has the shape: 1 (batch size) x 1024 (the number of hidden units). It is called the pooled output, and in theory it should represent the entire sequence. 

## How does BERT really work?

Understanding the internals is not required to train a model effectively, but it can be helpful if you want to do some really advanced stuff, or if you want to understand the limits of what is possible. 

I will only scratch the surface here by showing the key ingredients of BERT architecture, and at the end I will point to some additional resources I have found very helpful. 

Let's start by printing the model and its configuration and looking at what's inside.

In [None]:
model

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30, 1024, padding_idx=0)
    (position_embeddings): Embedding(40000, 1024)
    (token_type_embeddings): Embedding(2, 1024)
    (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.0, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=1024, out_features=1024, bias=True)
            (key): Linear(in_features=1024, out_features=1024, bias=True)
            (value): Linear(in_features=1024, out_features=1024, bias=True)
            (dropout): Dropout(p=0.0, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=1024, out_features=1024, bias=True)
            (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.0, inplace=False

In [None]:
config = model.config
config

BertConfig {
  "_name_or_path": "Rostlab/prot_bert",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.0,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.0,
  "hidden_size": 1024,
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 40000,
  "model_type": "bert",
  "num_attention_heads": 16,
  "num_hidden_layers": 30,
  "pad_token_id": 0,
  "type_vocab_size": 2,
  "vocab_size": 30
}

This configuration lists the key dimensions that determine the size of the model: 
* 1024 is the hidden size, i.e. the number of floats in the vector representing each token
* 30 is the vocabulary size
* A sequence can contain at most 40000 tokens
* The initial embeddings go through 30 layers of computation, each applying 16 attention heads and dense layers with 4096 hidden units, to produce our final output, which is again a vector with 1024 units per token
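
These numbers are enough for a back-of-the-envelope parameter count. The sketch below assumes the layer shapes printed for this model (dense layers with bias, LayerNorm with a scale and a shift vector) and counts only the bare `BertModel`, without the masked-LM head.

```python
# Values copied from the configuration shown above
hidden, layers, heads, intermediate = 1024, 30, 16, 4096
vocab, max_pos, type_vocab = 30, 40000, 2

def linear(n_in, n_out):
    """Parameter count of a dense layer with bias."""
    return n_in * n_out + n_out

layer_norm = 2 * hidden     # scale and shift vectors
head_dim = hidden // heads  # each head works on a slice of the hidden vector

embeddings = (vocab + max_pos + type_vocab) * hidden + layer_norm
per_layer = (
    4 * linear(hidden, hidden)      # query, key, value and attention output projections
    + layer_norm
    + linear(hidden, intermediate)  # intermediate dense layer
    + linear(intermediate, hidden)  # output dense layer
    + layer_norm
)
pooler = linear(hidden, hidden)
total = embeddings + layers * per_layer + pooler

print(f"dimensions per attention head: {head_dim}")
print(f"parameters per encoder layer:  {per_layer:,}")
print(f"total parameters (BertModel):  {total:,}")
```

Under these assumptions the total comes out at roughly 420 million parameters, almost all of them in the 30 encoder layers and the large position embedding table.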

Let's briefly look at each major building block of the model architecture. We start with the embedding layer, which maps each vocabulary token to a 1024-long embedding. We can also see position embeddings, which are trained to represent the ordering of words in a sequence, and token type embeddings, which are used if we want to distinguish between two sequences (for example question and context). 

In [None]:
print(model.embeddings)

BertEmbeddings(
  (word_embeddings): Embedding(30, 1024, padding_idx=0)
  (position_embeddings): Embedding(40000, 1024)
  (token_type_embeddings): Embedding(2, 1024)
  (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
  (dropout): Dropout(p=0.0, inplace=False)
)
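
How the three tables are combined can be sketched in plain Python: for each token, the word, position and token type vectors are summed and the result is normalized. The 4-dimensional tables below are hypothetical stand-ins for the real 1024-dimensional ones, and the learned scale and shift of LayerNorm as well as dropout are left out.

```python
import math

def layer_norm(vec, eps=1e-12):
    """Normalize a vector to zero mean and unit variance (learned scale/shift omitted)."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def embed(token_ids, word_emb, pos_emb, type_emb, token_type_ids=None):
    """Sum word, position and token-type embeddings per token, then normalize."""
    token_type_ids = token_type_ids or [0] * len(token_ids)
    out = []
    for pos, (tok, typ) in enumerate(zip(token_ids, token_type_ids)):
        summed = [w + p + t for w, p, t in zip(word_emb[tok], pos_emb[pos], type_emb[typ])]
        out.append(layer_norm(summed))
    return out

# Hypothetical 4-dimensional tables (the real model uses 1024 dimensions)
word_emb = {2: [0.1, 0.2, 0.3, 0.4], 14: [0.5, -0.1, 0.0, 0.2], 3: [-0.3, 0.3, 0.1, 0.0]}
pos_emb = [[0.0, 0.1, 0.0, -0.1], [0.2, 0.0, -0.2, 0.0], [0.1, 0.1, 0.3, -0.1]]
type_emb = [[0.01, 0.02, 0.03, 0.04], [0.0, 0.0, 0.0, 0.0]]

vectors = embed([2, 14, 3], word_emb, pos_emb, type_emb)
print(len(vectors), len(vectors[0]))  # 3 tokens, 4 dimensions each
```

Because the position vector is part of the sum, the same token gets a different input vector at each position, which is how the model learns about ordering.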


Then, we pass the embeddings through 30 layers of computation. This starts with self-attention, is followed by an intermediate dense layer with hidden size 4096, and ends with sequence output that we have already seen above. Usually, we will deal with the last hidden state, i.e. the 30th layer. However, to achieve better results, we may sometimes use the layers below as well to represent our sequences, for example by concatenating the last 4 hidden states. 

In [None]:
print(f'There are {len(model.encoder.layer)} layers like this in the model architecture:')
print('---')
print(model.encoder.layer[0])

There are 30 layers like this in the model architecture:
---
BertLayer(
  (attention): BertAttention(
    (self): BertSelfAttention(
      (query): Linear(in_features=1024, out_features=1024, bias=True)
      (key): Linear(in_features=1024, out_features=1024, bias=True)
      (value): Linear(in_features=1024, out_features=1024, bias=True)
      (dropout): Dropout(p=0.0, inplace=False)
    )
    (output): BertSelfOutput(
      (dense): Linear(in_features=1024, out_features=1024, bias=True)
      (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.0, inplace=False)
    )
  )
  (intermediate): BertIntermediate(
    (dense): Linear(in_features=1024, out_features=4096, bias=True)
  )
  (output): BertOutput(
    (dense): Linear(in_features=4096, out_features=1024, bias=True)
    (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.0, inplace=False)
  )
)
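
The idea of concatenating the last 4 hidden states, mentioned above, can be sketched with NumPy. Random arrays stand in for the real per-layer outputs so that the snippet runs without the model. With `transformers`, the per-layer states can be requested by passing `output_hidden_states=True` to the model, which yields one tensor for the embedding output plus one per layer.

```python
import numpy as np

batch, seq_len, hidden, n_layers = 1, 13, 1024, 30

# Stand-in for the per-layer outputs: embedding output + one array per encoder layer.
# Random values are used here only so the sketch runs without the real model.
rng = np.random.default_rng(0)
hidden_states = [rng.normal(size=(batch, seq_len, hidden)) for _ in range(n_layers + 1)]

last_layer = hidden_states[-1]
last_four = np.concatenate(hidden_states[-4:], axis=-1)  # join along the feature axis

print(last_layer.shape)  # (1, 13, 1024)
print(last_four.shape)   # (1, 13, 4096)
```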


Finally, we have the pooled output.

In [None]:
print(model.pooler)

BertPooler(
  (dense): Linear(in_features=1024, out_features=1024, bias=True)
  (activation): Tanh()
)
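
In other words, the pooler takes the hidden state of the first token (`[CLS]`) and passes it through the dense layer and the tanh activation printed above. A minimal sketch in plain Python, with a hypothetical 3-dimensional example instead of the real 1024 x 1024 layer:

```python
import math

def pooler(sequence_output, weight, bias):
    """Apply dense + tanh to the hidden state of the first token ([CLS])."""
    first_token = sequence_output[0]
    pooled = []
    for row, b in zip(weight, bias):  # one weight row per output unit
        pooled.append(math.tanh(sum(w * x for w, x in zip(row, first_token)) + b))
    return pooled

# Hypothetical toy values (the real layer is 1024 x 1024)
sequence_output = [[0.5, -1.0, 2.0], [0.1, 0.2, 0.3]]         # [CLS] vector, then one more token
weight = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # identity, for readability
bias = [0.0, 0.0, 0.0]

pooled = pooler(sequence_output, weight, bias)
print(pooled)
```

Only the first position feeds the pooled output directly; the other tokens influence it through the attention layers that produced the `[CLS]` hidden state.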


## Let's train the BERT model with our custom data:

We have a text file, "train_ace2.txt", which contains one sequence per line. We read the file line by line into a dataset that the Trainer will then feed to the model in batches.

In [None]:
%%time
from transformers import LineByLineTextDataset

dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="train_ace2.txt",
    block_size=128,
)

## MLM model

Masked Language Modeling is a fill-in-the-blank task,
where a model uses the context words surrounding a mask token to try to predict what the
masked word should be.

For an input that contains one or more mask tokens,
the model will generate the most likely substitution for each.

Example:

- Input: "I have watched this [MASK] and it was awesome."
- Output: "I have watched this movie and it was awesome."

Masked language modeling is a great way to train a language
model in a self-supervised setting (without human-annotated labels).
Such a model can then be fine-tuned to accomplish various supervised
NLP tasks.

In [None]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
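
To show what the collator does to each sequence, here is a simplified sketch of the BERT masking recipe in plain Python: each non-special token is selected with probability `mlm_probability`; of the selected tokens, 80% become `[MASK]`, 10% become a random token and 10% stay unchanged. Labels are `-100` (ignored by the loss) everywhere else. The `[MASK]` id of 4 and the residue id range 5..29 are assumptions about the Rostlab/prot_bert vocabulary, and `mask_tokens` is a hypothetical helper, not a library function.

```python
import random

def mask_tokens(input_ids, vocab_ids, mask_id, special_ids, mlm_probability=0.15, seed=0):
    """Return (masked_inputs, labels) following the 80/10/10 BERT masking recipe."""
    rng = random.Random(seed)
    inputs, labels = list(input_ids), []
    for i, tok in enumerate(input_ids):
        if tok in special_ids or rng.random() >= mlm_probability:
            labels.append(-100)  # -100 = position ignored by the loss
            continue
        labels.append(tok)  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id                # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab_ids)  # 10%: replace with a random token
        # remaining 10%: keep the token unchanged
    return inputs, labels

# Ids taken from the tokenizer output earlier in this notebook; the [MASK] id (4) and
# the residue id range (5..29) are assumptions about the Rostlab/prot_bert vocabulary.
ids = [2, 15, 10, 5, 18, 8, 12, 12, 6, 19, 19, 6, 5, 8, 15, 3]
masked, labels = mask_tokens(ids, vocab_ids=list(range(5, 30)), mask_id=4, special_ids={0, 2, 3})
print(masked)
print(labels)
```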

### Finally, we are all set to initialize our Trainer



In [None]:
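from transformers import BertForMaskedLM, Trainer, TrainingArguments

# Masked-LM training needs a model with a language-modeling head; the bare
# AutoModel loaded earlier has none and cannot compute the MLM loss.
mlm_model = BertForMaskedLM.from_pretrained("Rostlab/prot_bert")

training_args = TrainingArguments(
    output_dir="models/bert_prot",
    overwrite_output_dir=True,
    num_train_epochs=5,
    per_device_train_batch_size=16,
    save_steps=10_000,
    save_total_limit=3,
    prediction_loss_only=True,  # recent transformers versions expect this here, not on Trainer
)

trainer = Trainer(
    model=mlm_model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)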
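
A quick way to sanity-check these arguments is to estimate how many optimizer steps and checkpoints they imply. The sketch below assumes a single device and no gradient accumulation. The dataset size is a hypothetical number, since the real one is the line count of "train_ace2.txt".

```python
import math

def training_schedule(num_examples, batch_size=16, epochs=5, save_steps=10_000, save_total_limit=3):
    """Estimate optimizer steps and checkpoints for a single-device run without gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)  # the last, smaller batch still counts
    total_steps = steps_per_epoch * epochs
    checkpoints_written = total_steps // save_steps
    checkpoints_kept = min(checkpoints_written, save_total_limit)  # older ones are deleted
    return steps_per_epoch, total_steps, checkpoints_written, checkpoints_kept

# Hypothetical dataset size of 200,000 sequences
print(training_schedule(num_examples=200_000))  # (12500, 62500, 6, 3)
```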

## Start training

In [None]:
%%time
trainer.train()

####  Save final model (+ tokenizer + config) to disk

---



In [None]:
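trainer.save_model("prot_LM")  # writes the model weights and config
tokenizer.save_pretrained("prot_LM")  # the Trainer only saves the tokenizer if it was passed to it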

## Check that the LM actually trained


In [None]:
from transformers import BertForMaskedLM, BertTokenizer, pipeline
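# BERT with a masked-LM head.
# Note: this loads the original checkpoint from the Hub. To check the model trained
# above, load the model from the directory it was saved to ("prot_LM") instead.
tokenizer_m = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)
model_m = BertForMaskedLM.from_pretrained("Rostlab/prot_bert")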

Some weights of the model checkpoint at Rostlab/prot_bert were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
fill_mask = pipeline("fill-mask", model = model_m, tokenizer=tokenizer_m, top_k = 10)
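
Conceptually, the fill-mask pipeline takes the model's logits at the `[MASK]` position, applies a softmax over the vocabulary and returns the `top_k` most likely tokens. The sketch below mimics that last step in plain Python. The logits are hypothetical values for four residues only, and `top_k_predictions` is an illustrative helper, not part of the library.

```python
import math

def top_k_predictions(logits, id_to_token, k=3):
    """Softmax the logits at the masked position and return the k most likely tokens."""
    m = max(logits.values())  # subtract the max for numerical stability
    exps = {tok_id: math.exp(v - m) for tok_id, v in logits.items()}
    total = sum(exps.values())
    probs = {tok_id: e / total for tok_id, e in exps.items()}
    best = sorted(probs, key=probs.get, reverse=True)[:k]
    return [{"token": t, "token_str": id_to_token[t], "score": probs[t]} for t in best]

# Hypothetical logits at the [MASK] position (token ids as in the tokenizer output above)
id_to_token = {5: "L", 10: "S", 8: "V", 12: "K"}
logits = {5: 2.0, 10: 1.7, 8: 1.6, 12: 1.5}

for prediction in top_k_predictions(logits, id_to_token):
    print(prediction)
```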

In [None]:
fill_mask('D L I P T S S K L V V [MASK] D T S L Q V K K A F F A L V T')

[{'score': 0.11088453233242035,
  'sequence': '[CLS] D L I P T S S K L V V L D T S L Q V K K A F F A L V T [SEP]',
  'token': 5,
  'token_str': 'L'},
 {'score': 0.08402521163225174,
  'sequence': '[CLS] D L I P T S S K L V V S D T S L Q V K K A F F A L V T [SEP]',
  'token': 10,
  'token_str': 'S'},
 {'score': 0.07328339666128159,
  'sequence': '[CLS] D L I P T S S K L V V V D T S L Q V K K A F F A L V T [SEP]',
  'token': 8,
  'token_str': 'V'},
 {'score': 0.06921856850385666,
  'sequence': '[CLS] D L I P T S S K L V V K D T S L Q V K K A F F A L V T [SEP]',
  'token': 12,
  'token_str': 'K'},
 {'score': 0.06382402777671814,
  'sequence': '[CLS] D L I P T S S K L V V I D T S L Q V K K A F F A L V T [SEP]',
  'token': 11,
  'token_str': 'I'},
 {'score': 0.05900600925087929,
  'sequence': '[CLS] D L I P T S S K L V V F D T S L Q V K K A F F A L V T [SEP]',
  'token': 19,
  'token_str': 'F'},
 {'score': 0.058969877660274506,
  'sequence': '[CLS] D L I P T S S K L V V A D T S L Q V K K A 