## Session Notebook - Lesson 10

Today, we take a closer look at the BERT implementation from Hugging Face (https://huggingface.co/). At this point it is probably the most convenient way to obtain pre-trained transformer models, which are available for both TensorFlow and PyTorch.

In [1]:
import numpy as np

import tensorflow as tf

from transformers import BertTokenizer, TFBertModel

Let's consider the sample problem of identifying a context-based embedding of the word **'bank'** in the sentence **"I deposited 12342 dollars in the bank."**

In [2]:
test_sentence = "I deposited 12342 dollars in the bank."

Let's start by tokenizing the sentence. BERT comes with its own **tokenizer**, and it should be used because the model was pre-trained on that tokenizer's vocabulary.

In [3]:
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

In [4]:
tokenizer.tokenize(test_sentence)

['I', 'deposited', '123', '##42', 'dollars', 'in', 'the', 'bank', '.']

Notice how the long number got split. The prefix '##' indicates that the token is a continuation of the previous one. Keep an eye on this, particularly when you want to identify the proper BERT output vector for a token, because a single word may span several token positions.
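To make the '##' convention concrete, here is a minimal sketch that groups WordPiece tokens back into words and records which token positions each word covers. The helper name `group_wordpieces` is hypothetical (it is not part of the Hugging Face API), and the token list is copied from the output above.

```python
def group_wordpieces(tokens):
    """Group WordPiece tokens into (word, [token indices]) pairs.

    Hypothetical helper for illustration; not a Hugging Face function.
    """
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith('##') and words:
            # Continuation piece: glue it onto the previous word.
            word, idxs = words[-1]
            words[-1] = (word + tok[2:], idxs + [i])
        else:
            # A new word starts here.
            words.append((tok, [i]))
    return words

tokens = ['I', 'deposited', '123', '##42', 'dollars', 'in', 'the', 'bank', '.']
for word, idxs in group_wordpieces(tokens):
    print(word, idxs)
```

Note that these indices refer to the plain token list. Once the sentence is encoded with the special '[CLS]' token in front, every position shifts by one.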

Now, let's do a short exercise to get familiar with BERT. BERT is supposed to give us *context-based embeddings*, i.e. embeddings for the same word in different contexts should be different. Let's give that a try!


**Exercise:** compare the context-based embedding vectors for '*bank*' in the following 4 sentences:

* "I need to bring my money to the bank today" 
* "I will need to bring my money to the bank tomorrow" 
* "I had to bank into a turn"
* "The bank teller was very nice" 



We first need to tokenize the input, which is very easy with the latest Hugging Face tokenizers (note the easy padding option!):

In [5]:
bert_inputs = tokenizer(["I need to bring my money to the bank today",
                    "I will need to bring my money to the bank tomorrow",
                    "I had to bank into a turn",
                    "The bank teller was very nice" ],
                  padding=True,
                  return_tensors='tf')

bert_inputs

{'input_ids': <tf.Tensor: shape=(4, 13), dtype=int32, numpy=
array([[ 101,  146, 1444, 1106, 2498, 1139, 1948, 1106, 1103, 3085, 2052,
         102,    0],
       [ 101,  146, 1209, 1444, 1106, 2498, 1139, 1948, 1106, 1103, 3085,
        4911,  102],
       [ 101,  146, 1125, 1106, 3085, 1154,  170, 1885,  102,    0,    0,
           0,    0],
       [ 101, 1109, 3085, 1587, 1200, 1108, 1304, 3505,  102,    0,    0,
           0,    0]], dtype=int32)>, 'token_type_ids': <tf.Tensor: shape=(4, 13), dtype=int32, numpy=
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(4, 13), dtype=int32, numpy=
array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]], dtype=int32)>}

So the tokenizer actually returns three outputs: the token ids (starting with '101' for the '[CLS]' token), the token_type_ids, which are useful when the input has distinct segments, and the attention masks, which are used to mask out padding tokens.
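As a small illustration of how the attention mask relates to padding, the sketch below copies the `input_ids` and `attention_mask` values from the output above into NumPy arrays. It assumes that the padding id is 0, which is what the printed ids suggest, and checks that the mask is 1 exactly where a real token sits.

```python
import numpy as np

# Values copied from the tokenizer output shown above.
input_ids = np.array([
    [101,  146, 1444, 1106, 2498, 1139, 1948, 1106, 1103, 3085, 2052,  102,    0],
    [101,  146, 1209, 1444, 1106, 2498, 1139, 1948, 1106, 1103, 3085, 4911,  102],
    [101,  146, 1125, 1106, 3085, 1154,  170, 1885,  102,    0,    0,    0,    0],
    [101, 1109, 3085, 1587, 1200, 1108, 1304, 3505,  102,    0,    0,    0,    0],
])
attention_mask = np.array([
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
])

# Assumption: id 0 is the padding token, so the mask marks every non-zero id.
mask_from_ids = (input_ids != 0).astype(int)
print(np.array_equal(mask_from_ids, attention_mask))

# Number of real (non-padding) tokens per sentence, including [CLS] and [SEP].
lengths = attention_mask.sum(axis=1).tolist()
print(lengths)
```

The second sentence is the longest one (13 tokens), so it determines the padded width of the whole batch.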

Next, define a **Keras layer from the pre-trained BERT model** from Hugging Face. It's this simple!

In [6]:
bert_layer = TFBertModel.from_pretrained('bert-base-cased')

Some layers from the model checkpoint at bert-base-cased were not used when initializing TFBertModel: ['mlm___cls', 'nsp___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at bert-base-cased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


Cool. So now let's get the BERT encoding of our test sentences. We just follow the Functional API approach: 

layer_output = layer(layer_input)

In [7]:
bert_outputs = bert_layer(bert_inputs)

print('shape of first output: \t\t', bert_outputs[0].shape)
print('shape of second output: \t', bert_outputs[1].shape)

shape of first output: 		 (4, 13, 768)
shape of second output: 	 (4, 768)


**Questions:**

1) Hmm... there are two outputs. One with dims [4, 13, 768], and one with [4, 768]. What are these? Which one do we need? And what do these dimensions correspond to?

2) How do we get the context-based embedding for the word 'bank'?

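Let's start by defining a function that prints the pairwise cosine similarities between a list of vectors. Despite the function's name, it reports similarities rather than distances: a value of 1.0 means two vectors point in the same direction. We'll use this in a bit.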


In [8]:
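def cosine_distances(vecs):
    # Prints pairwise cosine *similarities*: 1.0 means identical direction.
    for v_1 in vecs:
        distances = ''
        for v_2 in vecs:
            # cos(v_1, v_2) = v_1.v_2 / (|v_1| |v_2|), truncated to 4 characters for display
            distances += ('\t' + str(np.dot(v_1, v_2)/np.sqrt(np.dot(v_1, v_1) * np.dot(v_2, v_2)))[:4])
        print(distances)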
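The same quantity can also be computed without explicit loops. The following is a minimal vectorized sketch, not part of the original notebook; the function name `cosine_similarity_matrix` and the toy vectors are made up for illustration.

```python
import numpy as np

def cosine_similarity_matrix(vecs):
    """Return the (n, n) matrix of pairwise cosine similarities for n vectors."""
    m = np.asarray(vecs, dtype=float)                     # shape (n, d)
    unit = m / np.linalg.norm(m, axis=1, keepdims=True)   # scale each row to unit length
    return unit @ unit.T                                  # dot products of unit vectors

# Toy vectors: two orthogonal axes and their diagonal.
toy = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
sim = cosine_similarity_matrix(toy)
print(np.round(sim, 2))
```

To apply this to TensorFlow tensors, one option is to convert each tensor with `.numpy()` first. Unlike the printing function above, this version returns the full matrix, so the values are not truncated.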

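It's the first BERT output that we need, as that one gets us the token-level embeddings. Its dimensions [4, 13, 768] correspond to the 4 sentences, the 13 (padded) token positions, and the 768-dimensional hidden vector per token. The second output holds one pooled 768-dimensional vector per sentence.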

Now, we get the vectors in the most pedestrian way: we simply find the 'bank' token positions in the *encoded* input and extract the corresponding components:

In [9]:
bank_1 = bert_outputs[0][0, 9]
bank_2 = bert_outputs[0][1, 10]
bank_3 = bert_outputs[0][2, 4]
bank_4 = bert_outputs[0][3, 2]

banks = [bank_1, bank_2, bank_3, bank_4]
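The positions hard-coded above can also be found programmatically. The sketch below copies the `input_ids` from the tokenizer output into a NumPy array and assumes that 3085 is the id of 'bank', which is the id sitting at the four positions used above.

```python
import numpy as np

# input_ids copied from the tokenizer output shown earlier.
input_ids = np.array([
    [101,  146, 1444, 1106, 2498, 1139, 1948, 1106, 1103, 3085, 2052,  102,    0],
    [101,  146, 1209, 1444, 1106, 2498, 1139, 1948, 1106, 1103, 3085, 4911,  102],
    [101,  146, 1125, 1106, 3085, 1154,  170, 1885,  102,    0,    0,    0,    0],
    [101, 1109, 3085, 1587, 1200, 1108, 1304, 3505,  102,    0,    0,    0,    0],
])

bank_id = 3085  # assumed id of the token 'bank', read off the encoded input

# Row (sentence) and column (token position) of every occurrence of bank_id.
rows, cols = np.nonzero(input_ids == bank_id)
positions = list(zip(rows.tolist(), cols.tolist()))
print(positions)  # [(0, 9), (1, 10), (2, 4), (3, 2)]
```

In the notebook itself, the id could be looked up with `tokenizer.convert_tokens_to_ids('bank')` and the array obtained from `bert_inputs['input_ids'].numpy()`. This simple search only works for words that map to a single token; a word split into '##' pieces would need its pieces matched in sequence.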

Great. Let's now get the pairwise cosine similarities:

In [10]:
cosine_distances(banks)

	1.0	0.99	0.59	0.86
	0.99	1.0	0.59	0.87
	0.59	0.59	1.0	0.62
	0.86	0.87	0.62	1.0


Looks right! The 'bank' in 'I had to bank into a turn' is the one that's most different from the others.

Also, note that 'bank' has a slightly different embedding in the two sentences "*I need to bring my money to the bank today*" and "*I will need to bring my money to the bank tomorrow*". Maybe a bit surprising, but the sentences are slightly different, so the self-attention calculations will be slightly different.