# Bidirectional Encoder Representations from Transformers (BERT)

BERT, as its name suggests, is a bidirectional Transformer-based language model built from encoders rather than decoders. It was developed by Google and builds on several popular ideas, including [Semi-supervised Sequence Learning](https://arxiv.org/abs/1511.01432), [ELMo](https://arxiv.org/abs/1802.05365), [ULM-FiT](https://arxiv.org/abs/1801.06146), the [OpenAI Transformer](https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf) and the [Transformer](https://arxiv.org/abs/1706.03762).

BERT uses only the encoder stack of the encoder-decoder Transformer architecture, as follows:

![BERT](../assets/embedding/bert.jpg)

Different layers of BERT are used when pre-training and fine-tuning for downstream tasks such as question answering, classification, language modelling, multiple choice and NER.

**Why BERT?**
- BERT can be used to extract features (word and sentence embeddings) from text. These features can be useful for keyword/search expansion, semantic search and information retrieval even if there’s no keyword or phrase overlap.
- BERT can be fine-tuned for downstream tasks, or its pretrained features can be used as high-quality inputs to downstream models.

## Understanding BERT Input

BERT uses two special tokens: `[CLS]` and `[SEP]`. The *[CLS]* token always appears at the beginning of the text, and its final hidden state is what classification tasks use. The *[SEP]* token marks the end of each segment and so separates two pieces of text (sentences). Both tokens are always required, even if you have a single sentence and even if you are not using BERT for classification.

```
[CLS] Roses are red. [SEP] Sky is blue. [SEP]
[CLS] Mountains are earth's undecaying monuments. [SEP]
```
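
The marking scheme above can be sketched as a small helper. This is only an illustration: `format_bert_input` is a hypothetical name, not part of the `transformers` library, and it assumes each segment is closed by its own `[SEP]`.

```python
from typing import Optional


def format_bert_input(first: str, second: Optional[str] = None) -> str:
    """Wrap one or two text segments in BERT's special tokens."""
    marked = f"[CLS] {first} [SEP]"
    if second is not None:
        # A second segment gets its own closing [SEP].
        marked += f" {second} [SEP]"
    return marked


print(format_bert_input("Roses are red.", "Sky is blue."))
print(format_bert_input("Mountains are earth's undecaying monuments."))
```

In practice the tokenizer can add these tokens for you; the helper only makes the layout explicit.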

## Tokenization

In [1]:
from transformers import BertTokenizer

In [2]:
sentence = "Mountains are earth's undecaying monuments."
marked_sentence = '[CLS] ' + sentence + ' [SEP]'

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
tokens = tokenizer.tokenize(marked_sentence)
print(tokens)

['[CLS]', 'mountains', 'are', 'earth', "'", 's', 'und', '##eca', '##ying', 'monuments', '.', '[SEP]']


> The word *undecaying* is represented as ['und', '##eca', '##ying']. Words that are not in BERT's vocabulary are split into subwords and characters. The two leading hashes simply mark a subword or character as the continuation of a larger word, so '##ying' is a different token from the standalone 'ying'.
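
To make the subword splitting concrete, here is a minimal sketch of greedy longest-match-first splitting, which is the general idea behind WordPiece. The function name `wordpiece` and the tiny `toy_vocab` are assumptions made for illustration; the real tokenizer works against the full `bert-base-uncased` vocabulary and handles more cases.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation of a longer word
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, just large enough for the example.
toy_vocab = {"mountains", "und", "##eca", "##ying", "ying"}
print(wordpiece("undecaying", toy_vocab))
print(wordpiece("ying", toy_vocab))
```

Note how `ying` at the start of a word and `##ying` inside a word are looked up as two separate vocabulary entries.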

Now let's map the string tokens to vocabulary indices.

In [3]:
indeces = tokenizer.convert_tokens_to_ids(tokens)
print(list(zip(tokens, indeces)))

[('[CLS]', 101), ('mountains', 4020), ('are', 2024), ('earth', 3011), ("'", 1005), ('s', 1055), ('und', 6151), ('##eca', 19281), ('##ying', 14147), ('monuments', 10490), ('.', 1012), ('[SEP]', 102)]
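
Conceptually, this mapping is a dictionary lookup with a fallback for unknown tokens. The sketch below assumes a toy vocabulary whose IDs are copied from the output above, plus ID 100 for `[UNK]` as in `bert-base-uncased`; `tokens_to_ids` is a hypothetical helper, not the library method.

```python
# Toy vocabulary: IDs copied from the printed output, plus [UNK] = 100.
toy_vocab = {"[UNK]": 100, "[CLS]": 101, "[SEP]": 102, "mountains": 4020, "are": 2024}


def tokens_to_ids(tokens, vocab, unk="[UNK]"):
    """Look each token up in the vocabulary, falling back to the [UNK] ID."""
    return [vocab.get(tok, vocab[unk]) for tok in tokens]


print(tokens_to_ids(["[CLS]", "mountains", "are", "volcanoes", "[SEP]"], toy_vocab))
```

Here `volcanoes` is missing from the toy vocabulary, so it maps to the `[UNK]` ID.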


## Segment ID

A segment is one of the sentences in the text, delimited by the *[SEP]* token. BERT uses segment IDs to tell these segments apart: Segment 0 (a series of 0s) and Segment 1 (a series of 1s). Since we are working with a single sentence, every token gets the same ID. Here we use a series of 1s, although Hugging Face tokenizers assign 0s to a single-sentence input.

In [4]:
segments = [1] * len(tokens)
print(segments)

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
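
For a sentence pair, the segment IDs switch after the first `[SEP]`. The following is a minimal sketch of that rule, assuming the usual convention that the first segment (including `[CLS]` and its `[SEP]`) is 0 and the second is 1; `segment_ids` is a hypothetical helper. Under this convention a single sentence would get all 0s rather than the 1s used above.

```python
def segment_ids(tokens):
    """0 for every token up to and including the first [SEP], 1 afterwards."""
    ids, current = [], 0
    for tok in tokens:
        ids.append(current)
        if tok == "[SEP]":
            current = 1  # everything after the first [SEP] is segment 1
    return ids


pair = ["[CLS]", "roses", "are", "red", ".", "[SEP]", "sky", "is", "blue", ".", "[SEP]"]
print(segment_ids(pair))
```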


## Feature Extraction

Here we convert our indices to tensors, which serve as the input to the BERT model.

In [5]:
import torch
from transformers import BertModel

In [6]:
tokens = torch.tensor([indeces])
segments = torch.tensor([segments])

We will be using a pre-trained model here, so we skip training and put the model in evaluation mode with `model.eval()`, which turns off dropout.

In [7]:
model = BertModel.from_pretrained('bert-base-uncased', output_hidden_states=True)
model.eval()

The `bert-base-uncased` model has 12 layers. We will now run the text through BERT and collect the hidden states produced by all 12 layers.

`torch.no_grad()` stops PyTorch from building the computation graph during the forward pass. The graph is of no use here because we won't be backpropagating in evaluation mode, so skipping it saves both time and memory.
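
The effect of `torch.no_grad()` can be seen on a tiny tensor, without loading BERT. In this sketch `weights` is just a stand-in for model parameters, not anything taken from the model itself.

```python
import torch

weights = torch.ones(3, requires_grad=True)  # stand-in for model parameters

# Outside no_grad, the result is tracked for backpropagation.
tracked = (weights * 2).sum()

# Inside no_grad, no computation graph is recorded.
with torch.no_grad():
    untracked = (weights * 2).sum()

print(tracked.requires_grad)    # True
print(untracked.requires_grad)  # False
```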

Calling the model returns a different number of objects depending on how it was configured in the `from_pretrained` call earlier. In this case, because we set `output_hidden_states=True`, the third item will be the hidden states from all layers. See the [documentation](https://huggingface.co/transformers/model_doc/bert.html#bertmodel) for more details.

In [8]:
with torch.no_grad():
    # Pass segment IDs by keyword: the second positional argument is attention_mask.
    outputs = model(tokens, token_type_ids=segments)
    hidden_states = outputs[2]

## Understanding BERT Output

In [9]:
layer_no = batch_no = token_no = 0

print(f'Number of Layers: {len(hidden_states)}')
print(f'Number of Batches: {len(hidden_states[layer_no])}')
print(f'Number of Tokens: {len(hidden_states[layer_no][batch_no])}')
print(f'Number of Hidden Units (Features): {len(hidden_states[layer_no][batch_no][token_no])}')

Number of Layers: 13
Number of Batches: 1
Number of Tokens: 12
Number of Hidden Units (Features): 768


Why do we get 13 layers when we said earlier that the BERT model has 12? The first entry holds the input embeddings, and the remaining entries hold the outputs of each of BERT's 12 layers. In total, 119808 (13x1x12x768) values are generated to represent our single sentence.

```
Current Dimensions Format: [# layers, # batches, # tokens, # features]
Desired Dimensions Format: [# tokens, # layers, # features]
```

This can be achieved using PyTorch's `permute()` method, which rearranges the dimensions of a tensor. Let's begin by stacking all the layers into one big tensor.

In [10]:
embeddings = torch.stack(hidden_states, dim=0)
print(embeddings.size())

torch.Size([13, 1, 12, 768])


Now let's get rid of the batch dimension, as we do not need it.

In [11]:
embeddings = torch.squeeze(embeddings, dim=1)
print(embeddings.size())

torch.Size([13, 12, 768])


Finally, let's swap the *layers* and *tokens* dimensions to get the desired format.

In [12]:
embeddings = embeddings.permute(1, 0, 2)
print(embeddings.size())

torch.Size([12, 13, 768])
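
The three reshaping steps can also be chained. The sketch below uses random tensors as stand-ins for the real hidden states (the name `fake_hidden_states` is an assumption for illustration), so it runs without downloading a model and lets us check that no values get mixed up along the way.

```python
import torch

# Random stand-ins for the 13 per-layer outputs, each [batch=1, tokens=12, features=768].
fake_hidden_states = tuple(torch.randn(1, 12, 768) for _ in range(13))

stacked = torch.stack(fake_hidden_states, dim=0)     # [13, 1, 12, 768]
per_token = stacked.squeeze(dim=1).permute(1, 0, 2)  # [12, 13, 768]

print(per_token.size())
# Token 3 at layer 5 is the same vector before and after the reshuffle.
print(torch.equal(per_token[3, 5], fake_hidden_states[5][0, 3]))
```

`permute()` only changes how the dimensions are ordered; each token's vector at each layer is left untouched.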


## Word Embeddings

So far, each of our tokens has 13 separate vectors, each of length 768. Now it's time to get an individual vector for each token. For this we need to pick some layers and combine their vectors into a single final vector. But how do we know which combination gives the best result? Unfortunately, we don't know for sure. Let's try out a few approaches.

#### I. Concatenate last four layers to get a single word vector per token

In [13]:
single_vec = list()
for token in embeddings:
    vec_cat = torch.cat((token[-1], token[-2], token[-3], token[-4]), dim=0)
    single_vec.append(vec_cat)
    
print(f'Shape: {len(single_vec)} x {len(single_vec[0])}')

Shape: 12 x 3072


#### II. Summing the last four layers together to get a single word vector per token

In [14]:
single_vec = list()
for token in embeddings:
    vec_sum = torch.sum(token[-4:], dim=0)
    single_vec.append(vec_sum)

print(f'Shape: {len(single_vec)} x {len(single_vec[0])}')

Shape: 12 x 768
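
Both approaches can also be written without an explicit loop over tokens. This is a sketch under the assumption that `embeddings` has the `[tokens, layers, features]` layout built earlier; here it is filled with random numbers so the snippet runs on its own.

```python
import torch

embeddings = torch.randn(12, 13, 768)  # random stand-in: [tokens, layers, features]

last_four = embeddings[:, -4:, :]                 # [12, 4, 768]

# Approach II: sum the last four layers for every token at once.
summed = last_four.sum(dim=1)                     # [12, 768]

# Approach I: flip(1) puts the last layer first, matching the loop's order.
concatenated = last_four.flip(1).reshape(12, -1)  # [12, 3072]

print(summed.size(), concatenated.size())
```

The vectorized forms should give the same vectors as the loops above, and they are usually faster for long inputs.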


## Sentence Embeddings

Now let's get a single vector for the entire sentence rather than for a single word. There are multiple approaches, and the best one differs between applications. Here we'll go with a simple strategy: averaging the second-to-last hidden layer over all tokens.

In [15]:
tokens_vec = hidden_states[-2][0]
sentence_embedding = torch.mean(tokens_vec, dim=0)
print(f'Sentence Embedding shape: {sentence_embedding.size()}')

Sentence Embedding shape: torch.Size([768])
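
To see what averaging over tokens does, here is a tiny worked example with three made-up token vectors of two features each, standing in for one layer's `[tokens, features]` output.

```python
import torch

# Three tokens with two features each (made-up values for illustration).
tokens_vec = torch.tensor([[1.0, 2.0],
                           [3.0, 4.0],
                           [5.0, 6.0]])

# Averaging over dim=0 collapses the token axis and keeps the feature axis.
sentence_embedding = torch.mean(tokens_vec, dim=0)
print(sentence_embedding)  # tensor([3., 4.])
```

Each feature of the sentence vector is simply the mean of that feature across all tokens: (1 + 3 + 5) / 3 = 3 and (2 + 4 + 6) / 3 = 4.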


*[[Source]](https://mccormickml.com/2019/05/14/BERT-word-embeddings-tutorial/)*