# BERT
**BERT**, or <u>Bidirectional Encoder Representations from Transformers</u>, is a language representation model for natural language processing (NLP) developed by researchers at Google. It is built on the encoder of the *Transformer architecture* and represents each word using the words on both sides of it. Unlike earlier models that read text strictly left-to-right or right-to-left, BERT conditions on the left and right context at the same time.
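
To make the contrast concrete, here is a minimal sketch in plain Python of which positions a token may look at under a left-to-right scheme versus a bidirectional one. It is only an illustration of the idea, not BERT's implementation, and the helper name `visible_positions` is hypothetical.

```python
def visible_positions(num_tokens, bidirectional):
    """For each token index, list the indices that token is allowed to see."""
    rows = []
    for i in range(num_tokens):
        if bidirectional:
            # Every token sees the whole sequence, left and right.
            rows.append(list(range(num_tokens)))
        else:
            # Left-to-right: a token sees only itself and earlier tokens.
            rows.append(list(range(i + 1)))
    return rows

tokens = ["[CLS]", "nice", "movie", "friend", "[SEP]"]
left_to_right = visible_positions(len(tokens), bidirectional=False)
both_ways = visible_positions(len(tokens), bidirectional=True)

# Context available to "movie" (index 2) under each scheme
print([tokens[j] for j in left_to_right[2]])  # ['[CLS]', 'nice', 'movie']
print([tokens[j] for j in both_ways[2]])      # ['[CLS]', 'nice', 'movie', 'friend', '[SEP]']
```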

When it was released, BERT achieved state-of-the-art results on NLP tasks such as question answering, sentiment analysis, and natural language inference. It is pre-trained on large corpora of unlabeled text and can then be fine-tuned on a specific task with a much smaller, task-specific dataset. This approach of pre-training followed by fine-tuning is what makes BERT effective across a wide range of NLP applications.

#### [Text preprocessing for BERT + SavedModel implementation of the encoder API](https://www.kaggle.com/models/tensorflow/bert/tensorFlow2/en-uncased-l-12-h-768-a-12)

#### TensorFlow 2 BERT encoder model: **bert/tensorFlow2/en-uncased-l-12-h-768-a-12**

Here we use the "BERT Base" model, whose configuration is encoded in the model name:

**l-12-h-768-a-12**
* l-12: 12 layers (Transformer blocks)
* h-768: hidden size of 768
* a-12: 12 attention heads
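
As a small illustration, the naming pattern above can be read programmatically. The helper `parse_bert_handle` below is hypothetical (it is not part of TensorFlow Hub), and the derived `head_size` assumes the usual convention that the hidden size is split evenly across the attention heads.

```python
import re

def parse_bert_handle(name):
    """Extract layer count, hidden size, and head count from an 'l-*-h-*-a-*' model name."""
    match = re.search(r"l-(\d+)-h-(\d+)-a-(\d+)", name)
    if match is None:
        raise ValueError(f"no l-*-h-*-a-* pattern in {name!r}")
    layers, hidden, heads = (int(group) for group in match.groups())
    return {
        "layers": layers,
        "hidden_size": hidden,
        "attention_heads": heads,
        # Assumption: each head works on an equal slice of the hidden vector.
        "head_size": hidden // heads,
    }

config = parse_bert_handle("bert/tensorFlow2/en-uncased-l-12-h-768-a-12")
print(config)
# {'layers': 12, 'hidden_size': 768, 'attention_heads': 12, 'head_size': 64}
```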

In [1]:
!pip install h5py
!pip install typing-extensions
!pip install wheel



In [2]:
# !pip install tensorflow_text --use-deprecated=legacy-resolver
# !pip install tf-keras --use-deprecated=legacy-resolver
!pip3 install --quiet "tensorflow-text==2.15.*"
!pip install tensorflow_hub
# !pip install tf-keras



#### [The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning)](https://jalammar.github.io/illustrated-bert/)

In [3]:
import tensorflow_hub as tfhub
import tensorflow_text as tftxt

In [6]:
bert_preprocess_url = "https://www.kaggle.com/models/tensorflow/bert/TensorFlow2/en-uncased-preprocess/3"
bert_encoder_url = "https://www.kaggle.com/models/tensorflow/bert/tensorFlow2/en-uncased-l-12-h-768-a-12"

In [7]:
bert_preprocess_model = tfhub.KerasLayer(bert_preprocess_url)

In [8]:
text_test = ['nice movie friend', 'I loved python programming']
text_preprocessed = bert_preprocess_model(text_test)
# text_preprocessed is a dictionary, so let's look at its keys
text_preprocessed.keys()

dict_keys(['input_mask', 'input_word_ids', 'input_type_ids'])
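
To show roughly what these three keys hold, here is a toy re-creation of the preprocessing step in plain Python. It is a sketch under loud assumptions: the real preprocessing model uses WordPiece tokenization over the full BERT vocabulary, whereas `toy_preprocess` and `TOY_VOCAB` (both hypothetical names) split on whitespace and know only the handful of ids that appear in this notebook's output.

```python
# Toy vocabulary limited to ids visible in this notebook; 0 is the padding id.
TOY_VOCAB = {"[PAD]": 0, "[CLS]": 101, "[SEP]": 102,
             "nice": 3835, "movie": 3185, "friend": 2767}

def toy_preprocess(sentence, seq_length=128):
    """Mimic the layout of the preprocessing output for a single sentence."""
    # Leave room for the [CLS] and [SEP] markers.
    words = sentence.lower().split()[: seq_length - 2]
    tokens = ["[CLS]"] + words + ["[SEP]"]
    padding = seq_length - len(tokens)
    return {
        # Token ids, padded with 0 up to seq_length
        "input_word_ids": [TOY_VOCAB[t] for t in tokens] + [TOY_VOCAB["[PAD]"]] * padding,
        # 1 for real tokens, 0 for padding
        "input_mask": [1] * len(tokens) + [0] * padding,
        # A single-sentence input belongs entirely to segment 0
        "input_type_ids": [0] * seq_length,
    }

features = toy_preprocess("nice movie friend")
print(features["input_word_ids"][:6])  # [101, 3835, 3185, 2767, 102, 0]
print(sum(features["input_mask"]))     # 5
```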

In [10]:
text_preprocessed['input_mask']
# The first row starts with five 1s, one per real token:
# [CLS] nice movie friend [SEP]
#
# There are two sentences, and each is padded with 0s to a fixed length of 128 tokens

<tf.Tensor: shape=(2, 128), dtype=int32, numpy=
array([[1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
      dtype=int32)>
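
The mask matters because padded positions should not receive any attention. The sketch below shows one common way a mask like this is applied: a large negative penalty is added to the scores of masked positions before the softmax. This is an assumption about the general technique, not a description of the internals of the SavedModel used here, and `masked_softmax` is a hypothetical helper.

```python
import math

def masked_softmax(scores, mask, penalty=-1e9):
    """Softmax over scores where positions with mask == 0 get (almost) zero weight."""
    # Push masked positions far down so their exponentials vanish.
    shifted = [s if m == 1 else s + penalty for s, m in zip(scores, mask)]
    peak = max(shifted)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]

# Shortened version of the first row above: five real tokens, then padding
mask = [1, 1, 1, 1, 1, 0, 0, 0]
scores = [0.0] * len(mask)  # equal raw scores, for illustration only
weights = masked_softmax(scores, mask)
print([round(w, 3) for w in weights])
# [0.2, 0.2, 0.2, 0.2, 0.2, 0.0, 0.0, 0.0]
```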

In [12]:
text_preprocessed['input_word_ids']
# CLS nice movie friend SEP
# 101 3835 3185  2767   102

<tf.Tensor: shape=(2, 128), dtype=int32, numpy=
array([[  101,  3835,  3185,  2767,   102,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0, 