## Session Notebook - Lesson 8

Today, we look at the Bert implementation from Hugging Face (https://huggingface.co/). It is at this point probably the most convenient way to obtain pre-trained transformer models, and they are available in both TensorFlow and PyTorch.

In [43]:
import numpy as np

import tensorflow as tf

from transformers import BertTokenizer, TFBertModel

Let's consider the sample problem of identifying a context-based embedding of the word **'bank'** in the sentence **"I deposited 12342 dollars in the bank."**

In [58]:
test_sentence = "I deposited 12342 dollars in the bank."

Let's start by tokenizing the sentence. Bert has its own **tokenizer** and it should be used.

In [57]:
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

In [59]:
tokenizer.tokenize(test_sentence)

['I', 'deposited', '123', '##42', 'dollars', 'in', 'the', 'bank', '.']

Notice how the long number got split. The prefix '##' indicates that the token is a continuation of the previous one.

Next, we need to encode the tokens. We also add here for clarity the standard extra tokens manually, including padding tokens should the sentence be shorter than the desired 'max_length': 

In [61]:
bert_sentence_tokens = np.array([tokenizer.convert_tokens_to_ids(['[CLS]'] + 
                                                        tokenizer.tokenize(test_sentence) + 
                                                        ['PAD', 'PAD', 'PAD', 'PAD'] +     # Padding if needed   
                                                        ['[SEP]'])])
bert_sentence_tokens

array([[  101,   146, 14735, 13414, 23117,  5860,  1107,  1103,  3085,
          119,   100,   100,   100,   100,   102]])

**Question:** Why do we have a list of lists in the array vs just a list?

Next, we determine the padding mask and sequence ids. This is optional in Hugging Face's implementation. But if you use padding or you have multiple segments, you should do this:

In [55]:
bert_mask_ids = np.array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0 ,0 ,0, 0]])    # mask out padding tokens
bert_squence_ids = np.array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])

Now we have the BERT input: 

In [56]:
bert_input = [bert_sentence_tokens, bert_mask_ids, bert_squence_ids]

Next, define a **Keras layer from the pre-trained BERT model** from Hugging Face. It's simple!

In [4]:
bert_layer = TFBertModel.from_pretrained('bert-base-cased')

Let's look at this layer:

In [62]:
len(bert_layer.weights)

199

That's a lot... but there is a lot going on, obviously. First layer? Word embeddings!

In [63]:
bert_layer.weights[0]

<tf.Variable 'tf_bert_model/bert/embeddings/word_embeddings/weight:0' shape=(28996, 768) dtype=float32, numpy=
array([[-0.00054784, -0.04156886,  0.01308366, ..., -0.0038919 ,
        -0.0335485 ,  0.0149841 ],
       [ 0.01688265, -0.03106827,  0.0042053 , ..., -0.01474032,
        -0.03561099, -0.0036223 ],
       [-0.00057234, -0.02673604,  0.00803954, ..., -0.01002474,
        -0.0331164 , -0.01651673],
       ...,
       [-0.00643814,  0.01658491, -0.02035619, ..., -0.04178825,
        -0.049201  ,  0.00416085],
       [-0.00483562, -0.00267701, -0.02901638, ..., -0.05116647,
         0.00449265, -0.01177113],
       [ 0.03134822, -0.02974372, -0.02302896, ..., -0.01454749,
        -0.05249038,  0.02843569]], dtype=float32)>

**Question:** Are the dimensions as expected? What is the vocabulary size?

Cool. So now let's get the BERT encoding of our test sentence. It is easy, thanks to Keras and eager execution in TensorFlow:

In [70]:
bert_layer([bert_sentence_tokens, bert_mask_ids, bert_squence_ids])

(<tf.Tensor: shape=(1, 15, 768), dtype=float32, numpy=
 array([[[ 0.29879576,  0.08388553,  0.09784617, ...,  0.14215273,
           0.11360482,  0.03080782],
         [ 0.21766078, -0.06807256,  0.3014325 , ...,  0.3031708 ,
          -0.0635722 ,  0.2752993 ],
         [ 0.05779424, -0.3237861 ,  0.18332055, ...,  0.3118223 ,
          -0.07193998,  0.28022054],
         ...,
         [ 0.10753562, -0.05329677,  0.3892548 , ...,  0.2620432 ,
           0.25460327,  0.15719599],
         [ 0.06462695, -0.03867982,  0.34668633, ...,  0.23272091,
           0.1289264 ,  0.09174192],
         [ 0.04314023, -0.31707326,  0.579127  , ...,  0.33808345,
           0.6358729 ,  0.2757272 ]]], dtype=float32)>,
 <tf.Tensor: shape=(1, 768), dtype=float32, numpy=
 array([[-4.84184980e-01,  2.95079529e-01,  9.97789562e-01,
         -9.82715786e-01,  8.00549924e-01,  7.66499817e-01,
          9.02179480e-01, -9.66841280e-01, -9.55611229e-01,
         -4.09522116e-01,  9.67448950e-01,  9.98046100e-0

**Questions:**: 

1) Hmm... there are two outputs. One with dims [1, 15, 768], and one with [1, 768]. What are these? Which one do we need?

2) How do we get the context-based embedding for the word 'bank'?
