# BERT
This seems to be the most used model in all the tutorials. See [BERT Notes](../../Notes/BERT.md) for a conceptual understanding of BERT. This notebook is about how it is implemented in HF.

### `BertModel`
This is the workhorse model of BERT. All special-purpose models are composed from it. It has the standard BERT conceptual architecture, except that the classifier heads are not present. Another difference is that the first token's output is passed through an additional linear layer followed by a tanh activation (together called the "pooler"), and that output is also returned. By default, the last hidden states of all the tokens, including the first token, and the pooler output of the first token are returned in a `dict` subclass. However, it is possible to get the hidden states at all layers, along with other outputs such as the attention weights, by setting the corresponding options in the config.

![bert-model](./bert-model.png)

`BaseModelOutputWithPoolingAndCrossAttentions` is the output class and its [documentation](https://huggingface.co/docs/transformers/main/en/main_classes/output#transformers.modeling_outputs.BaseModelOutputWithPoolingAndCrossAttentions) is very informative. This is derived from an `OrderedDict` but has a bunch of bells and whistles on top. I can access the items with the standard string key, with an integer index which just maps to the key name at that index, and also as an attribute.
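
The point about getting the hidden states at all layers can be sketched as follows. This is a minimal sketch, not the notebook's own code: it assumes a tiny, randomly initialised `BertModel` (the sizes and token ids are made up) so that nothing needs to be downloaded, and it passes `output_hidden_states=True` to the forward call. The same flag can also be set on the config.

```python
import torch
from transformers import BertConfig, BertModel

# Tiny, randomly initialised BERT (hypothetical sizes) so nothing is downloaded.
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
)
model = BertModel(config).eval()

input_ids = torch.tensor([[1, 5, 6, 2]])  # made-up token ids
with torch.no_grad():
    outputs = model(input_ids=input_ids, output_hidden_states=True)

# One entry for the embedding output plus one entry per encoder layer.
print(len(outputs.hidden_states))
# The last entry is the tensor that is also returned as last_hidden_state.
print(torch.equal(outputs.hidden_states[-1], outputs.last_hidden_state))
```

With two encoder layers, `hidden_states` should be a tuple of three tensors, each of shape `(batch, sequence, hidden_size)`.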

In [1]:
from transformers import AutoTokenizer, AutoModel

In [2]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokenizer

BertTokenizerFast(name_or_path='bert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

In [3]:
sentences = [
    "Using Transformers is easy!",
    "Attention is indeed all you need :-)"
]
batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
batch

{'input_ids': tensor([[  101,  2478, 19081,  2003,  3733,   999,   102,     0,     0,     0,
             0],
        [  101,  3086,  2003,  5262,  2035,  2017,  2342,  1024,  1011,  1007,
           102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [4]:
bert = AutoModel.from_pretrained("bert-base-uncased")
bert

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
  

In [5]:
bert_outputs = bert(**batch)
bert_outputs

BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[ 0.0245,  0.1256,  0.1062,  ..., -0.0637,  0.0344,  0.3892],
         [ 0.2678,  0.3460, -0.3190,  ..., -0.1258,  0.1899,  0.2213],
         [ 1.8175, -0.0306, -0.1497,  ..., -0.4804, -0.1272,  0.4366],
         ...,
         [-0.0396, -0.4403,  0.3741,  ...,  0.2806, -0.0468, -0.1171],
         [ 0.4085, -0.1582,  0.4979,  ...,  0.1728, -0.0955, -0.0630],
         [-0.0935, -0.4842,  0.3430,  ...,  0.3072, -0.0117, -0.1564]],

        [[ 0.0196,  0.0618, -0.1108,  ..., -0.0865,  0.1613,  0.5693],
         [ 0.3790,  0.5198,  0.1140,  ...,  0.2473,  0.2489, -0.2309],
         [-0.1873,  0.1067,  0.6274,  ..., -0.2091, -0.0990,  0.6994],
         ...,
         [ 1.3337, -0.3378,  1.1761,  ...,  0.6086,  0.6716,  0.8192],
         [ 0.3343,  0.1717,  0.4985,  ...,  0.4488,  0.9565,  0.5942],
         [ 0.9645,  0.2220, -0.0692,  ...,  0.2544, -0.7161, -0.1088]]],
       grad_fn=<NativeLayerNormBackward0>), pooler_ou

In [6]:
bert_outputs.keys()

odict_keys(['last_hidden_state', 'pooler_output'])

I can access the output tensors in three ways -
  * By using output name as a key, e.g., `output["last_hidden_state"]`
  * By using output name as a property, e.g., `output.last_hidden_state`
  * By using the index of the key name, e.g., `output[1]`

In [23]:
bert_outputs["pooler_output"].shape

torch.Size([2, 768])

In [24]:
bert_outputs.last_hidden_state.shape

torch.Size([2, 11, 768])

In [25]:
# This is the same as the pooler output
bert_outputs[1].shape

torch.Size([2, 768])

In [27]:
bert_outputs["pooler_output"].equal(bert_outputs[1])

True

Here is the verification that the pooler output is simply the first token's hidden state passed through the pooler layer.

In [37]:
class_token_hidden_state = bert_outputs.last_hidden_state[:, 0, :]
class_token_hidden_state.shape

torch.Size([2, 768])

In [48]:
exp_pooler_output = bert.pooler.activation(bert.pooler.dense(class_token_hidden_state))
exp_pooler_output.shape

torch.Size([2, 768])

In [49]:
bert_outputs.pooler_output.equal(exp_pooler_output)

True

### `BertForSequenceClassification`
This is a specialist BERT model with a classifier head attached on top of the pooler. The classifier head is just a linear layer with as many outputs as the number of classes (or labels).

In [10]:
from transformers import AutoModelForSequenceClassification

In [11]:
classifier = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
classifier

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12,

In [12]:
classifier_outputs = classifier(**batch)
classifier_outputs

SequenceClassifierOutput(loss=None, logits=tensor([[-0.4356, -0.5435],
        [-0.3104, -0.6776]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

In [13]:
classifier_outputs.keys()

odict_keys(['logits'])
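
The claim that the classifier head sits on top of the pooler can be checked in the same spirit as the pooler verification earlier. The sketch below assumes a tiny, randomly initialised model (made-up sizes and token ids) and relies on the `bert`, `dropout` and `classifier` submodules of `BertForSequenceClassification`. In `eval()` mode the dropout is a no-op, so the logits should match the linear layer applied to the pooler output.

```python
import torch
from transformers import BertConfig, BertForSequenceClassification

# Tiny, randomly initialised classifier (hypothetical sizes), 3 labels.
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
    num_labels=3,
)
clf = BertForSequenceClassification(config).eval()  # eval() disables dropout

input_ids = torch.tensor([[1, 5, 6, 2], [1, 7, 2, 0]])  # made-up ids, 0 = [PAD]
attention_mask = torch.tensor([[1, 1, 1, 1], [1, 1, 1, 0]])

with torch.no_grad():
    logits = clf(input_ids=input_ids, attention_mask=attention_mask).logits
    # Recompute the logits by hand from the inner BertModel's pooler output.
    pooled = clf.bert(input_ids=input_ids, attention_mask=attention_mask).pooler_output
    manual_logits = clf.classifier(clf.dropout(pooled))

print(logits.shape)
print(torch.allclose(logits, manual_logits))
```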

If I provide the labels in the batched input, the model will also output the loss. For single-label classification problems, i.e., the output can have only **one** of many labels, it uses the standard cross entropy loss. For multi-label classification problems, i.e., the output can simultaneously have multiple labels assigned, it uses the standard binary cross entropy (computed on the logits) on each label.

In [14]:
from copy import deepcopy
import torch as t

In [15]:
labeled_batch = deepcopy(batch)
labeled_batch["labels"] = t.tensor([1, 1])
labeled_classifier = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
labeled_classifier_output = labeled_classifier(**labeled_batch)
labeled_classifier_output

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


SequenceClassifierOutput(loss=tensor(0.7500, grad_fn=<NllLossBackward0>), logits=tensor([[0.2331, 0.0290],
        [0.1891, 0.1762]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

In [16]:
labeled_classifier_output.loss

tensor(0.7500, grad_fn=<NllLossBackward0>)
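
As a sanity check, the loss above can be recomputed by hand from the printed logits. This is a plain-Python sketch of the mean cross entropy, using the rounded logits from the output above and the labels `[1, 1]`, so it only agrees with the model's value up to rounding.

```python
import math

# Logits as printed above (rounded to 4 decimals) and the labels from the batch.
logits = [[0.2331, 0.0290], [0.1891, 0.1762]]
labels = [1, 1]

def cross_entropy(row, label):
    # -log softmax(row)[label] == logsumexp(row) - row[label]
    log_sum_exp = math.log(sum(math.exp(x) for x in row))
    return log_sum_exp - row[label]

# The loss is averaged over the examples in the batch.
loss = sum(cross_entropy(row, y) for row, y in zip(logits, labels)) / len(labels)
print(round(loss, 4))  # approximately 0.75, matching the model's loss
```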

In [17]:
labeled_classifier.config.problem_type

'single_label_classification'
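
For comparison, here is a sketch of the multi-label case. It assumes a tiny, randomly initialised model (made-up sizes and token ids) with `problem_type="multi_label_classification"` set explicitly on the config, and float multi-hot labels of shape `(batch, num_labels)`. The model's loss should then match PyTorch's binary cross entropy with logits.

```python
import torch
import torch.nn.functional as F
from transformers import BertConfig, BertForSequenceClassification

# Tiny, randomly initialised classifier (hypothetical sizes), 3 labels.
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
    num_labels=3,
    problem_type="multi_label_classification",
)
clf = BertForSequenceClassification(config).eval()

input_ids = torch.tensor([[1, 5, 6, 2], [1, 7, 8, 2]])  # made-up token ids
# Multi-hot float labels: each example can carry several labels at once.
labels = torch.tensor([[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]])

with torch.no_grad():
    out = clf(input_ids=input_ids, labels=labels)

# Recompute the loss by hand: binary cross entropy on every (example, label) pair.
manual_loss = F.binary_cross_entropy_with_logits(out.logits, labels)
print(torch.allclose(out.loss, manual_loss))
```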

### Configs
The config story of HF is a bit messed up. The base class is this humungous class called [PretrainedConfig](https://huggingface.co/docs/transformers/main/en/main_classes/configuration#transformers.PretrainedConfig) that by default takes in 60 different settings. On top of this, it has this monstrosity of code -

```python
for key, value in kwargs.items():
    setattr(self, key, value)
```

The documentation mentions that there are a bunch of attributes that all subclasses should have, but for some reason are not defined in this class. 
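
The practical effect of that `setattr` loop is that any unknown keyword argument silently becomes an attribute of the config. A minimal sketch, where `my_made_up_setting` is a hypothetical name and not a real transformers option:

```python
from transformers import BertConfig

# `my_made_up_setting` is a hypothetical keyword, not a real transformers option.
# It is not rejected; it simply ends up as an attribute on the config object.
config = BertConfig(my_made_up_setting=123)

print(config.my_made_up_setting)
print(config.hidden_size)  # a genuine BertConfig setting, left at its default
```

This also means that a typo in a setting name will not raise an error; it just creates a new, unused attribute.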

`BertModel` takes in [`BertConfig`](https://huggingface.co/docs/transformers/main/en/model_doc/bert#transformers.BertConfig) which is a child class of `PretrainedConfig`. All other models in the BERT family, which are composed from the base `BertModel`, take in an untyped config and assume that some of the 60 default settings are set. E.g., the `BertForSequenceClassification` class expects the config to have `num_labels` set.
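
A small sketch of how `num_labels` flows from the config into the classifier head, assuming a tiny, randomly initialised model (the sizes are made up so that nothing is downloaded):

```python
from transformers import BertConfig, BertForSequenceClassification

# Tiny, randomly initialised classifier (hypothetical sizes) with 5 labels.
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
    num_labels=5,
)
clf = BertForSequenceClassification(config)

# The head is a linear layer from hidden_size to num_labels.
print(clf.classifier)
# num_labels is backed by the id2label mapping on the config.
print(len(config.id2label))
```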

!! Just printing out the config object will not give all the attributes. !!

In [18]:
bert.config

BertConfig {
  "_name_or_path": "bert-base-uncased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.36.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

In [19]:
classifier.config

BertConfig {
  "_name_or_path": "bert-base-uncased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.36.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

As can be seen, the printed config only shows the BERT-specific settings (plus any values that differ from the defaults). E.g., `num_labels` is set but does not show up.

In [20]:
print(bert.config.num_labels)
print(classifier.config.num_labels)

2
2
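
To see everything that is set on a config, I can compare `to_dict()` with `to_diff_dict()`. This is a sketch on a default `BertConfig`, assuming the behaviour of the 4.36 release used in this notebook, where the printed form is based on the diff against the defaults. The exact set of keys may vary across versions.

```python
from transformers import BertConfig

config = BertConfig()  # default settings, nothing downloaded

full = config.to_dict()       # every attribute stored on the config
diff = config.to_diff_dict()  # only what differs from the base defaults

print(len(full), len(diff))
# Settings that exist on the config but are hidden from the printed form.
print(sorted(set(full) - set(diff)))
# num_labels is still available as an attribute even though it is not printed.
print(config.num_labels)
```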


In [21]:
classifier.config.problem_type

In [22]:
bert.config.problem_type
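
Neither of the two cells above prints anything because `problem_type` is `None` on both configs. As far as I can tell, `BertForSequenceClassification` only fills it in during a forward pass that includes labels, based on `num_labels` and the dtype of the labels. A sketch of that behaviour, assuming a tiny, randomly initialised model (made-up sizes and token ids):

```python
import torch
from transformers import BertConfig, BertForSequenceClassification

# Tiny, randomly initialised classifier (hypothetical sizes), 2 labels by default.
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
)
clf = BertForSequenceClassification(config).eval()

before = clf.config.problem_type  # not set yet

input_ids = torch.tensor([[1, 5, 6, 2], [1, 7, 8, 2]])  # made-up token ids
with torch.no_grad():
    # Integer labels with num_labels > 1 are treated as single-label classification.
    clf(input_ids=input_ids, labels=torch.tensor([1, 0]))

after = clf.config.problem_type
print(before, after)
```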