In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

# 1. BERT main logic
## 1.1 BERT tokens
```python
  tensor([[101, 1188, 1110, 1126, 7758, 1859, 102]])
```
The tokenizer wraps every sequence in special tokens: \[CLS\] maps to ID 101 and \[SEP\] maps to ID 102.
  
## 1.2 Embedding
[code](https://github.com/huggingface/transformers/blob/3658488ff77ff8d45101293e749263acf437f4d5/src/transformers/models/bert/modeling_bert.py#L180)

### Logic
Sums the token embedding, the absolute position embedding and the token type embedding, then applies LayerNorm and dropout.

### Input
B x S (token IDs), where B is the batch size and S is the sequence length

### Output
B x S x D, where D is the hidden dimension (768 for bert-base-cased)
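The embedding step can be sketched as follows. This is a minimal, simplified stand-in, not the Hugging Face implementation: `ToyBertEmbeddings` and its small default sizes are hypothetical, chosen only to show how a B x S tensor of IDs becomes a B x S x D tensor.

```python
import torch
from torch import nn


class ToyBertEmbeddings(nn.Module):
    """Hypothetical, simplified stand-in for BertEmbeddings (toy sizes)."""

    def __init__(self, vocab_size=100, max_positions=32, type_vocab_size=2, dim=16):
        super().__init__()
        self.word_embeddings = nn.Embedding(vocab_size, dim)
        self.position_embeddings = nn.Embedding(max_positions, dim)
        self.token_type_embeddings = nn.Embedding(type_vocab_size, dim)
        self.LayerNorm = nn.LayerNorm(dim, eps=1e-12)
        self.dropout = nn.Dropout(0.1)

    def forward(self, input_ids, token_type_ids=None):
        seq_len = input_ids.shape[1]
        if token_type_ids is None:
            # Single-segment input: every token belongs to segment 0.
            token_type_ids = torch.zeros_like(input_ids)
        # Absolute positions 0..S-1, shape 1 x S, broadcast over the batch.
        position_ids = torch.arange(seq_len, device=input_ids.device).unsqueeze(0)
        summed = (
            self.word_embeddings(input_ids)
            + self.position_embeddings(position_ids)
            + self.token_type_embeddings(token_type_ids)
        )
        return self.dropout(self.LayerNorm(summed))


toy_embeddings = ToyBertEmbeddings().eval()  # eval() disables dropout
toy_ids = torch.randint(0, 100, (2, 5))      # B=2, S=5
toy_out = toy_embeddings(toy_ids)
print(toy_out.shape)  # torch.Size([2, 5, 16])
```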



## 1.3 Encoder
[code](https://github.com/huggingface/transformers/blob/3658488ff77ff8d45101293e749263acf437f4d5/src/transformers/models/bert/modeling_bert.py#LL561C4-L561C4)

### Logic

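- The encoder is a stack of `BertLayer` modules (12 for bert-base-cased).
- `layer_outputs[0]` of each layer is its hidden states, which feed the next layer.
- If all hidden states are requested (`output_hidden_states`), they are collected in a tuple.
- `layer_outputs[1]` holds the attention probabilities, which are collected when `output_attentions` is set.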
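The layer loop can be sketched in plain Python. `run_encoder` and `toy_layer` are hypothetical names used only for illustration; the toy layer works on a list of floats rather than tensors, so the sketch shows the bookkeeping of the tuples and nothing else.

```python
def run_encoder(layers, hidden_states, output_hidden_states=False, output_attentions=False):
    """Run layers in order, optionally collecting per-layer outputs in tuples."""
    all_hidden_states = () if output_hidden_states else None
    all_attentions = () if output_attentions else None

    for layer in layers:
        if output_hidden_states:
            # Record the input of this layer (the first entry is the embedding output).
            all_hidden_states += (hidden_states,)
        layer_outputs = layer(hidden_states)
        hidden_states = layer_outputs[0]            # feeds the next layer
        if output_attentions:
            all_attentions += (layer_outputs[1],)   # attention of this layer

    if output_hidden_states:
        # Record the output of the last layer as well.
        all_hidden_states += (hidden_states,)
    return hidden_states, all_hidden_states, all_attentions


def toy_layer(hidden_states):
    # Hypothetical layer: adds 1 to every value and returns a placeholder for attention.
    return [h + 1 for h in hidden_states], "attn"


final, hiddens, attns = run_encoder([toy_layer] * 12, [0.0, 0.0], True, True)
print(final, len(hiddens), len(attns))  # [12.0, 12.0] 13 12
```

With 12 layers the hidden-state tuple has 13 entries: the embedding output plus one entry per layer.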

### Input
1. Embedding output
2. attention_mask
3. head_mask
4. encoder_hidden_states (from a previous encoder, used for cross-attention)
5. encoder_attention_mask
6. output_attentions (flag)
7. output_hidden_states (flag)


## 1.4 Model output
sequence_output, pooled_output, (hidden_states), (attentions), (cross_attentions)

Items in parentheses are optional and only returned when requested.
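The difference between `sequence_output` (B x S x D) and `pooled_output` (B x D) can be sketched as below. `ToyPooler` is a hypothetical, simplified module that assumes the pooled vector is a dense layer plus tanh applied to the hidden state at the first (\[CLS\]) position; the dimension 16 is a toy value.

```python
import torch
from torch import nn


class ToyPooler(nn.Module):
    """Hypothetical sketch of a BERT-style pooler (toy dimension)."""

    def __init__(self, dim=16):
        super().__init__()
        self.dense = nn.Linear(dim, dim)
        self.activation = nn.Tanh()

    def forward(self, sequence_output):
        # Take the hidden state of the first token ([CLS]): B x S x D -> B x D
        first_token = sequence_output[:, 0]
        return self.activation(self.dense(first_token))


sequence_output = torch.randn(1, 8, 16)       # B=1, S=8, D=16
pooled_output = ToyPooler()(sequence_output)
print(pooled_output.shape)  # torch.Size([1, 16])
```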





## BertEncoder

### Input
1. hidden_states
2. attention_mask
3. head_mask
4. encoder_hidden_states (from a previous encoder, used for cross-attention)
5. encoder_attention_mask
6. output_attentions (flag)
7. output_hidden_states (flag)


### Output
hidden_states, next_decoder_cache(keys), (all_hidden_states), (all_self_attentions), (all_cross_attentions)


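## BertSelfAttention logic

Note: if `encoder_hidden_states` is provided, the keys and values come from the encoder and the module performs cross-attention; otherwise it performs self-attention.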


```python
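import math

import torch
from torch import nn


# Excerpt: forward method of BertSelfAttention (relies on attributes of self)
def forward(self, hidden_states, attention_mask=None, head_mask=None,
            encoder_hidden_states=None, encoder_attention_mask=None,
            output_attentions=False):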
    # step 1: mapping Query/Key/Value to sub-space
    # step 1.1: query mapping
    mixed_query_layer = self.query(hidden_states) # B x S x (H*d)
    
    # If this is instantiated as a cross-attention module, the keys
    # and values come from an encoder; the attention mask needs to be
    # such that the encoder's padding tokens are not attended to.
    
    # step 1.2: key/value mapping
    if encoder_hidden_states is not None:
        mixed_key_layer = self.key(encoder_hidden_states) # B x S x (H*d)
        mixed_value_layer = self.value(encoder_hidden_states) 
        attention_mask = encoder_attention_mask 
    else:
        mixed_key_layer = self.key(hidden_states) # B x S x (H*d)
        mixed_value_layer = self.value(hidden_states)

    query_layer = self.transpose_for_scores(mixed_query_layer) # B x H x S x d
    key_layer = self.transpose_for_scores(mixed_key_layer) # B x H x S x d
    value_layer = self.transpose_for_scores(mixed_value_layer) # B x H x S x d

    # step 2: compute attention scores
    
    # step 2.1: raw attention scores
    # B x H x S x d   B x H x d x S -> B x H x S x S
    # Take the dot product between "query" and "key" to get the raw attention scores.
    attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
    attention_scores = attention_scores / math.sqrt(self.attention_head_size)
    
    # step 2.2: mask if necessary
    if attention_mask is not None:
        # Apply the additive attention mask (broadcast to B x H x S x S)
        attention_scores = attention_scores + attention_mask

    # step 2.3: Normalize the attention scores to probabilities, B x H x S x S
    attention_probs = nn.Softmax(dim=-1)(attention_scores)

    # This is actually dropping out entire tokens to attend to, which might
    # seem a bit unusual, but is taken from the original Transformer paper.
    attention_probs = self.dropout(attention_probs)

    # Mask heads if we want to
    if head_mask is not None:
        attention_probs = attention_probs * head_mask

    # step 3: aggregate values by attention probs to form context encodings
    # B x H x S x S   B x H x S x d -> B x H x S x d
    context_layer = torch.matmul(attention_probs, value_layer)
    # B x S x H x d
    context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
    # target shape: B x S x D
    new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)
    # B x S x D, equivalent to concatenating the heads
    context_layer = context_layer.view(*new_context_layer_shape)

    outputs = (context_layer, attention_probs) if output_attentions else (context_layer,)
    return outputs
```
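The method above calls `self.transpose_for_scores` and adds an already prepared `attention_mask`, neither of which is shown. Below is a minimal standalone sketch of both pieces. The function names, the toy sizes, and the use of the most negative float32 value as the mask fill are assumptions for illustration, not the library's exact code.

```python
import math

import torch


def transpose_for_scores(x, num_heads):
    """Split the last dimension into heads: B x S x (H*d) -> B x H x S x d."""
    batch, seq_len, all_head_size = x.shape
    x = x.view(batch, seq_len, num_heads, all_head_size // num_heads)
    return x.permute(0, 2, 1, 3)


def extend_attention_mask(mask):
    """Turn a 0/1 padding mask (B x S) into an additive mask (B x 1 x 1 x S)."""
    extended = mask[:, None, None, :].to(torch.float32)
    # 1 -> 0.0 (keep), 0 -> very large negative number (removed by softmax)
    return (1.0 - extended) * torch.finfo(torch.float32).min


torch.manual_seed(0)
B, S, H, d = 2, 4, 3, 5
q = transpose_for_scores(torch.randn(B, S, H * d), H)
k = transpose_for_scores(torch.randn(B, S, H * d), H)
mask = torch.tensor([[1, 1, 1, 0], [1, 1, 1, 1]])  # last token of sample 0 is padding

scores = torch.matmul(q, k.transpose(-1, -2)) / math.sqrt(d)          # B x H x S x S
probs = torch.softmax(scores + extend_attention_mask(mask), dim=-1)   # B x H x S x S

print(q.shape)      # torch.Size([2, 3, 4, 5])
print(probs.shape)  # torch.Size([2, 3, 4, 4])
```

Because the mask is added before the softmax, the padded key position receives zero probability for every query and every head, while each row of `probs` still sums to 1.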

In [2]:
from transformers import AutoTokenizer, AutoModel

MODEL_NAME = "bert-base-cased"

# step 1: load the tokenizer (BertTokenizer)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, cache_dir='.cache/token')
# step 2: load the pretrained model (BertModel)
model = AutoModel.from_pretrained(MODEL_NAME, cache_dir='.cache/model')

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassificatio

In [5]:
model

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(28996, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
  

In [7]:
model.embeddings

BertEmbeddings(
  (word_embeddings): Embedding(28996, 768, padding_idx=0)
  (position_embeddings): Embedding(512, 768)
  (token_type_embeddings): Embedding(2, 768)
  (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
  (dropout): Dropout(p=0.1, inplace=False)
)

In [8]:
tokenizer

BertTokenizerFast(name_or_path='bert-base-cased', vocab_size=28996, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True)

In [9]:
text = "A dog chases a fox"
inputs = tokenizer(text, return_tensors="pt") 


In [11]:
inputs

{'input_ids': tensor([[  101,   138,  3676,  9839,  1116,   170, 17594,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]])}

In [16]:
tokenizer.convert_ids_to_tokens(inputs['input_ids'].numpy().squeeze(0))

['[CLS]', 'A', 'dog', 'chase', '##s', 'a', 'fox', '[SEP]']
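In the token list above, `##s` is a WordPiece continuation: "chases" is not a single vocabulary entry, so it is split into `chase` and `##s`. A minimal sketch of merging the pieces back into words follows; `merge_wordpieces` is a hypothetical helper, not part of the tokenizer API.

```python
def merge_wordpieces(tokens, special=("[CLS]", "[SEP]", "[PAD]")):
    """Drop special tokens and glue '##' continuation pieces onto the previous word."""
    words = []
    for tok in tokens:
        if tok in special:
            continue
        if tok.startswith("##") and words:
            words[-1] += tok[2:]   # continuation piece: append without the '##' prefix
        else:
            words.append(tok)      # start of a new word
    return words


pieces = ['[CLS]', 'A', 'dog', 'chase', '##s', 'a', 'fox', '[SEP]']
print(merge_wordpieces(pieces))  # ['A', 'dog', 'chases', 'a', 'fox']
```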

In [20]:
import torch 

device = "cuda:0" if torch.cuda.is_available() else "cpu"

inputs = inputs.to(device)
model = model.to(device)

In [21]:
inputs

{'input_ids': tensor([[  101,   138,  3676,  9839,  1116,   170, 17594,   102]],
       device='cuda:0'), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:0')}

In [26]:
outputs = model(**inputs)

In [34]:
outputs = outputs.to_tuple()

In [36]:
r = []
for o in outputs:
    r.append(o.detach())

In [38]:
for o in r:
    print(o.shape)

torch.Size([1, 8, 768])
torch.Size([1, 768])


In [44]:
e = model.embeddings(inputs['input_ids'], inputs['token_type_ids'])
e

tensor([[[ 0.4496,  0.0977, -0.2074,  ...,  0.0578,  0.0406, -0.0951],
         [-0.2927,  0.2770,  0.6649,  ...,  0.9219,  0.5406,  0.5588],
         [-0.0497, -1.5883,  0.2633,  ..., -0.2919, -1.3644, -1.3365],
         ...,
         [-0.5258,  1.0663, -0.4180,  ...,  0.0359,  0.8914,  0.3006],
         [ 0.0403, -1.1036,  0.2559,  ...,  1.0463,  0.7956,  0.4520],
         [-0.0460,  0.0234,  0.2679,  ...,  0.4408, -0.5575,  0.4839]]],
       device='cuda:0', grad_fn=<NativeLayerNormBackward0>)

In [45]:
o = model.encoder(e)

In [46]:
o

BaseModelOutputWithPastAndCrossAttentions(last_hidden_state=tensor([[[-0.0351, -0.0827, -0.0797,  ..., -0.0248,  0.2941, -0.0887],
         [-0.2311, -0.4525,  0.3582,  ...,  0.7687,  0.6660,  0.1785],
         [ 0.4097, -0.2633, -0.2330,  ...,  0.0906, -0.2991, -0.0815],
         ...,
         [-0.1230, -0.2945, -0.4637,  ...,  0.1498,  0.1764,  0.1671],
         [ 0.2305, -0.5759, -0.0623,  ..., -0.3432, -0.0735,  0.1954],
         [ 0.4334,  0.2996, -0.5675,  ..., -0.2275,  0.1948, -0.1666]]],
       device='cuda:0', grad_fn=<NativeLayerNormBackward0>), past_key_values=None, hidden_states=None, attentions=None, cross_attentions=None)

In [47]:
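# The model output object has no .detach(); detach the tensor it wraps instead
last_hidden = o.last_hidden_state.detach()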

In [48]:
model = None

In [49]:
from transformers import BertForMaskedLM

text = "Nice to [MASK] you" # the token to predict is replaced by [MASK]

# step 1: load the pretrained BERT model with the masked-LM head
maskedLM_model = BertForMaskedLM.from_pretrained(MODEL_NAME, cache_dir='.cache/model')
maskedLM_model = maskedLM_model.to(device)

maskedLM_model.eval() # disable dropout

# step 2: tokenize
token_info = tokenizer.encode_plus(text, return_tensors='pt')
tokens = tokenizer.convert_ids_to_tokens(token_info['input_ids'].squeeze().numpy())
print(tokens) # ['[CLS]', 'Nice', 'to', '[MASK]', 'you', '[SEP]']

# step 3: forward to obtain prediction scores
token_info = token_info.to(device)
with torch.no_grad():
    outputs = maskedLM_model(**token_info)
    predictions = outputs[0] # shape, B x S x V, [1, 6, 28996]
    
# step 4: top-k predicted tokens
masked_index = tokens.index('[MASK]') # 3
k = 10
probs, indices = torch.topk(torch.softmax(predictions[0, masked_index], -1), k)

predicted_tokens = tokenizer.convert_ids_to_tokens(indices.tolist())
print(list(zip(predicted_tokens, probs)))

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


['[CLS]', 'Nice', 'to', '[MASK]', 'you', '[SEP]']
[('meet', tensor(0.9712, device='cuda:0')), ('see', tensor(0.0267, device='cuda:0')), ('meeting', tensor(0.0010, device='cuda:0')), ('have', tensor(0.0003, device='cuda:0')), ('met', tensor(0.0002, device='cuda:0')), ('know', tensor(0.0001, device='cuda:0')), ('join', tensor(7.0004e-05, device='cuda:0')), ('find', tensor(5.8323e-05, device='cuda:0')), ('Meet', tensor(2.7171e-05, device='cuda:0')), ('tell', tensor(2.4689e-05, device='cuda:0'))]
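The printed pairs above contain tensor objects, which makes the list hard to read. A small sketch of the same softmax and top-k step with plain Python floats is shown below. The vocabulary and logits are toy values invented for illustration, not model output.

```python
import torch

toy_vocab = ["w0", "w1", "w2", "w3"]           # hypothetical 4-token vocabulary
toy_logits = torch.tensor([4.0, 1.0, 0.5, 0.0])  # hypothetical scores at the [MASK] position

# Softmax turns the scores into probabilities; topk returns them in descending order.
toy_probs, toy_indices = torch.topk(torch.softmax(toy_logits, dim=-1), k=2)

# .tolist() converts tensors to plain Python numbers for readable printing.
top = [(toy_vocab[i], round(p, 4)) for i, p in zip(toy_indices.tolist(), toy_probs.tolist())]
print(top)
```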
