## BERT: Bidirectional Encoder Representations from Transformers

These notes follow this YouTube video:

https://www.youtube.com/watch?v=7kLi8u2dJz0&list=PLeo1K3hjS3uu7CxAacxVndI4bE_o3BDtO&index=46


This page lists all the BERT models available on TensorFlow Hub for download and use.

https://tfhub.dev/google/collections/bert/1

Here is the model page for the basic uncased BERT model:

https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4

It uses L=12 hidden layers (i.e., Transformer blocks), a hidden size of H=768, and A=12 attention heads. This model has been pre-trained for English on Wikipedia and BooksCorpus.
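As a quick worked example of what those three numbers imply, the sketch below derives the per-head width. It assumes the standard multi-head attention layout, in which the hidden vector is split evenly across heads; the variable names are illustrative only.

```python
# Hyperparameters read off the model name bert_en_uncased_L-12_H-768_A-12.
num_layers = 12      # L: Transformer blocks
hidden_size = 768    # H: width of each token vector
num_heads = 12       # A: attention heads per block

# Assumption: standard multi-head attention splits the hidden vector evenly across heads.
assert hidden_size % num_heads == 0
head_dim = hidden_size // num_heads

print(f"{num_layers} layers, {num_heads} heads of width {head_dim} each")
# prints: 12 layers, 12 heads of width 64 each
```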

TensorFlow Hub is a repository of pre-trained models.

https://www.tensorflow.org/hub

Further reading: https://jalammar.github.io/illustrated-bert/

https://www.tensorflow.org/text

Run this in a cell to install the required packages:

In [1]:
# !pip install --upgrade tensorflow_hub
# !pip install --upgrade tensorflow-text

In [2]:
import tensorflow_hub as hub
import tensorflow_text as text

In [3]:
preprocess_url = "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3"
encoder_url = "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4"

In [4]:
bert_preprocess_model = hub.KerasLayer(preprocess_url)

In [5]:
text_test = ['nice movie indeed','I love python programming']
text_preprocessed = bert_preprocess_model(text_test)
text_preprocessed.keys()

dict_keys(['input_type_ids', 'input_mask', 'input_word_ids'])

In [6]:
text_preprocessed['input_mask']
# [CLS] nice movie indeed [SEP] gives 5 ones in the first row
# each input is padded (or truncated) to 128 tokens

<tf.Tensor: shape=(2, 128), dtype=int32, numpy=
array([[1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])>

In [7]:
text_preprocessed['input_type_ids']

<tf.Tensor: shape=(2, 128), dtype=int32, numpy=
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])>

In [8]:
text_preprocessed['input_word_ids']

<tf.Tensor: shape=(2, 128), dtype=int32, numpy=
array([[  101,  3835,  3185,  5262,   102,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0, 

**101 --> CLS token**

**102 --> SEP token**

BERT uses CLS as a special token at the beginning of each input, and SEP as a special token to
separate two sentences or to end a single sentence.
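To make the three preprocessing outputs concrete, here is a minimal pure-Python sketch of how a single sentence could be packed. It is not the TF Hub implementation: `pack_single_sentence` is a hypothetical helper, tokenization is skipped, and the token ids are copied from the output shown earlier.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0   # ids seen in the outputs above
SEQ_LENGTH = 128                        # padded length seen in the outputs above

def pack_single_sentence(token_ids, seq_length=SEQ_LENGTH):
    """Hypothetical helper: wrap one sentence's token ids in CLS/SEP and pad."""
    # Leave room for the two special tokens, then truncate if needed.
    ids = [CLS_ID] + list(token_ids)[: seq_length - 2] + [SEP_ID]
    n_real = len(ids)
    n_pad = seq_length - n_real
    return {
        "input_word_ids": ids + [PAD_ID] * n_pad,
        "input_mask": [1] * n_real + [0] * n_pad,   # 1 = real token, 0 = padding
        "input_type_ids": [0] * seq_length,         # single sentence: all segment 0
    }

packed = pack_single_sentence([3835, 3185, 5262])   # "nice movie indeed"
print(packed["input_word_ids"][:6])   # [101, 3835, 3185, 5262, 102, 0]
print(sum(packed["input_mask"]))      # 5
```

The sum of `input_mask` gives the number of real tokens, including CLS and SEP, which matches the five ones in the first row of the mask printed earlier.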

In [9]:
bert_model = hub.KerasLayer(encoder_url)             # the encoder is loaded as a second Keras layer

In [10]:
bert_results = bert_model(text_preprocessed)

In [11]:
bert_results.keys()

dict_keys(['encoder_outputs', 'pooled_output', 'sequence_output', 'default'])

In [12]:
bert_results['sequence_output']

<tf.Tensor: shape=(2, 128, 768), dtype=float32, numpy=
array([[[ 0.07292067,  0.08567819,  0.14476839, ..., -0.09677105,
          0.08722159,  0.07711076],
        [ 0.17839423, -0.19006088,  0.5034951 , ..., -0.05869836,
          0.32717168, -0.15578607],
        [ 0.18701434, -0.43388814, -0.48875174, ..., -0.15502723,
          0.00145242, -0.24470958],
        ...,
        [ 0.12083033,  0.12884216,  0.4645349 , ...,  0.07375568,
          0.17441967,  0.16522148],
        [ 0.07967912, -0.01190673,  0.50225425, ...,  0.13777754,
          0.21002257,  0.00624568],
        [-0.07212678, -0.28303456,  0.5903342 , ...,  0.4755191 ,
          0.16668472, -0.08920309]],

       [[-0.07900576,  0.36335146, -0.21101616, ..., -0.17183737,
          0.16299757,  0.6724266 ],
        [ 0.2788348 ,  0.4371632 , -0.35764787, ..., -0.04463551,
          0.3831522 ,  0.5887987 ],
        [ 1.2037671 ,  1.0727023 ,  0.4840871 , ...,  0.24921003,
          0.40730935,  0.40481764],
        ...,

In [None]:
bert_results['pooled_output']

In [None]:
len(bert_results['encoder_outputs'])

Since we are using the BERT base model, which has 12 encoder layers, the length of encoder_outputs above is 12.
Read more about the purpose of each output here: https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4

As shown below, the last element of encoder_outputs is the same as sequence_output.

In [None]:
bert_results['encoder_outputs'][-1] == bert_results['sequence_output']
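The `==` comparison above is element-wise, so it returns a full boolean tensor rather than a single True or False. The sketch below shows how to reduce it to one answer, using small NumPy arrays as stand-ins for the real tensors; the shapes and random values are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for bert_results['encoder_outputs']: 12 arrays, one per layer.
# The real tensors have shape (2, 128, 768); a small shape keeps this quick.
encoder_outputs = [rng.normal(size=(2, 4, 8)).astype(np.float32) for _ in range(12)]
sequence_output = encoder_outputs[-1]   # assumption mirroring the claim above

elementwise = encoder_outputs[-1] == sequence_output   # boolean array, same shape
all_equal = bool(elementwise.all())                    # single True/False

print(elementwise.shape, all_equal)   # (2, 4, 8) True
```

With the real tensors, `tf.reduce_all(bert_results['encoder_outputs'][-1] == bert_results['sequence_output'])` should perform the same reduction, assuming TensorFlow 2 with eager execution.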

In [None]:
bert_results['encoder_outputs']