# 1. Introduction to BERT
BERT, which stands for Bidirectional Encoder Representations from Transformers, is a groundbreaking model in the field of natural language processing (NLP) developed by researchers at Google. Introduced in a `2018` paper by `Jacob Devlin` and his colleagues, BERT's key innovation lies in its ability to train language representations in a deeply bidirectional way.

### Key Features of BERT:

1. **Bidirectional Training**: Traditional language models were typically trained to predict the next word in a sequence, which makes them unidirectional and limits their ability to learn context from both the left and the right side of a word. BERT instead uses the Transformer encoder together with a masked-word training objective, which lets it consider the full context of a word by looking at the words that come before and after it. In effect, it reads the entire sentence, both left and right of every word, at once.

2. **Transformer Architecture**: BERT is based on the Transformer model, which uses an attention mechanism to weigh the influence of different words on each other’s representation. Unlike directional models, which read the text input sequentially (left-to-right or right-to-left), the Transformer reads the whole sequence of words at once. This is part of what allows BERT to understand the context more effectively.

3. **Masked Language Model (MLM)**: During training, a random subset of the input tokens (15% in the original paper) is selected for prediction. Most of the selected tokens are replaced with a "[MASK]" token, and the model is trained to predict the original token from the context provided by the surrounding, non-masked tokens. This training objective is called `Masked LM`, and it is what lets BERT use the context on both sides of a word when predicting it.

4. **Next Sentence Prediction (NSP)**: In addition to predicting masked words, BERT is also trained to predict whether a sentence logically follows another. This capability is trained using pairs of sentences, which helps the model understand the relationship between consecutive sentences, aiding in tasks like question answering and language inference.
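
The masking step of the Masked LM objective can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `mask_tokens` and the toy vocabulary are hypothetical, each token is selected independently with probability 15% (the original implementation selects a fixed fraction of positions), and the 80% / 10% / 10% split between `[MASK]`, a random token, and the unchanged token follows the recipe described in the BERT paper.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted_tokens, labels) using BERT-style masking.

    labels[i] holds the original token where a prediction is required,
    and None everywhere else.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            labels.append(token)  # the model must recover this token
            roll = rng.random()
            if roll < 0.8:
                corrupted.append(MASK_TOKEN)         # 80%: replace with [MASK]
            elif roll < 0.9:
                corrupted.append(rng.choice(vocab))  # 10%: random token
            else:
                corrupted.append(token)              # 10%: keep unchanged
        else:
            labels.append(None)
            corrupted.append(token)
    return corrupted, labels

tokens = "the quick brown fox jumps over the lazy dog".split()
vocab = sorted(set(tokens))  # toy vocabulary, for illustration only
corrupted, labels = mask_tokens(tokens, vocab, mask_prob=0.15, seed=1)
print(corrupted)
print(labels)
```

The loss would then be computed only at positions where `labels` is not `None`, so the model is never rewarded for simply copying visible tokens.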

### Applications of BERT:

BERT has been revolutionary in the field of NLP and has led to improvements in many different language understanding tasks, including:
- **Text Classification**: BERT can be fine-tuned for various classification tasks like sentiment analysis or spam detection.
- **Question Answering**: BERT excels in extracting answers from texts given a question.
- **Named Entity Recognition (NER)**: Identifying names of people, organizations, locations, etc., in text.
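- **Language Translation**: BERT is an encoder-only model and was not designed for translation, but it can be adapted for translation tasks, for example by serving as the encoder of a sequence-to-sequence system.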

BERT set a new standard for NLP models and has inspired a host of variants and improvements, such as RoBERTa, ALBERT, and DistilBERT, each aiming to optimize various aspects of the original BERT model, like training speed, model size, and accuracy.

For implementing BERT in a project, libraries like Hugging Face's Transformers provide pre-trained models which can be fine-tuned on specific datasets relatively easily. This accessibility has made advanced NLP capabilities available to a wide range of developers and researchers.

# 2. BERT Variations
BERT, since its introduction, has inspired a number of variants that aim to improve upon the original model in various ways, such as efficiency, performance on specific tasks, or adaptability to different languages or smaller datasets. Here's an overview of some notable versions of BERT:

1. **RoBERTa (Robustly Optimized BERT Approach)**:
   Developed by Facebook AI, RoBERTa modifies key hyperparameters in BERT and removes the Next Sentence Prediction (NSP) objective used during pre-training. It is trained with much larger mini-batches and learning rates, and on more data for a longer period of time compared to BERT. These changes help RoBERTa to outperform BERT on many NLP benchmarks.

2. **DistilBERT**:
   DistilBERT is a smaller, faster, cheaper, and lighter version of BERT developed by Hugging Face. It is designed to maintain most of the performance of BERT, while reducing the model size by about 40%. DistilBERT is achieved by a process called distillation, which involves training a smaller model (the "student") to reproduce the behavior of a larger pre-trained model (the "teacher").

3. **ALBERT (A Lite BERT)**:
   ALBERT is designed to reduce the memory consumption and increase the training speed of BERT. It introduces two main innovations: `factorized embedding parameterization` and `cross-layer parameter sharing`. The embedding parameterization separates the size of the hidden layers from the size of vocabulary embeddings, reducing the number of parameters. Cross-layer parameter sharing helps in reducing the model size further by sharing parameters across the different layers of the model.

4. **TinyBERT**:
   TinyBERT goes further than DistilBERT in terms of model size reduction. It is specifically designed to be small enough for deployment on resource-restricted devices like mobile phones. TinyBERT involves a two-stage training process: transformer distillation at pre-training and task-specific distillation.

5. **BERT-large, BERT-base, and Other Sizes**:
   The original BERT model comes in two sizes – `BERT-base` and `BERT-large`. BERT-base has 110 million parameters consisting of `12 layers`, while BERT-large has 340 million parameters with `24 layers`. These different sizes offer trade-offs between performance and computational efficiency.

6. **Multilingual BERT (mBERT)**:
   Released by Google, this version of BERT is trained on Wikipedia articles in 104 languages. It facilitates tasks that involve multiple languages and is particularly valuable for applications requiring cross-linguistic transfer learning.

7. **BERT for Sequence-to-Sequence Applications (BART and T5)**:
   While not direct versions of BERT, BART and T5 are notable extensions. BART is designed to pre-train sequence-to-sequence models by corrupting text with an arbitrary noising function and learning a model to reconstruct the original text. T5 (Text-to-Text Transfer Transformer) generalizes the idea further by framing all NLP tasks as a text-to-text problem.
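
To make ALBERT's factorized embedding parameterization from the list above more concrete, the following sketch compares the number of embedding parameters with and without factorization. The sizes used here (a vocabulary of 30,000, hidden size 768, embedding size 128) are illustrative assumptions, and `embedding_params` is a hypothetical helper, not part of any library.

```python
def embedding_params(vocab_size, hidden_size, embedding_size=None):
    """Count embedding-table parameters.

    With embedding_size=None the table maps tokens straight to the hidden
    size (BERT-style): V * H. Otherwise tokens are first embedded in a
    smaller space and then projected up (ALBERT-style): V * E + E * H.
    """
    if embedding_size is None:
        return vocab_size * hidden_size
    return vocab_size * embedding_size + embedding_size * hidden_size

V, H, E = 30_000, 768, 128  # illustrative sizes, assumed for this example
bert_style = embedding_params(V, H)
albert_style = embedding_params(V, H, E)
print(f"BERT-style:   {bert_style:,}")    # 23,040,000
print(f"ALBERT-style: {albert_style:,}")  # 3,938,304
print(f"reduction:    {1 - albert_style / bert_style:.1%}")
```

Because the vocabulary size is much larger than either E or H, decoupling the two shrinks the embedding table substantially while leaving the hidden size of the Transformer layers unchanged.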


# 3. BERT's Key Architectural Parameters
In the various versions and configurations of BERT, the terms "L", "H", and "A" represent key architectural parameters that define the model's structure and capacity.

1. **`L` - Number of Layers (also referred to as Transformer Blocks):**
   - "L" stands for the number of layers, or depth, of the network. Each layer is a Transformer block that contains mechanisms for self-attention and feed-forward connections. In BERT architectures, more layers generally mean a greater ability to capture complex features and relationships in the data, but at the cost of increased computational complexity and potential difficulties in training (such as handling vanishing gradients).
   - For example, BERT-base has 12 layers, while BERT-large has 24 layers.

2. **`H` - Size of Hidden Layers:**
   - "H" denotes the size of the hidden layers, which is the dimensionality of the hidden states in the model. This parameter is a critical factor in determining the model's capacity to learn; higher values allow the model to learn more detailed features with more nuanced representations of the input data, but also require more computation and memory.
   - In BERT-base, the size of each hidden layer is 768, whereas in BERT-large, each hidden layer has a size of 1024.

3. **`A` - Number of Attention Heads:**
   - "A" stands for the number of attention heads in each Transformer block. In the multi-head attention mechanism of BERT, the hidden representation is projected into A separate heads, each of size H/A, allowing the model to attend to different parts of the input simultaneously. More attention heads can provide a richer and more diverse representation of the input data, enabling the model to capture a variety of dependencies and nuances.
   - BERT-base is equipped with 12 attention heads per layer, while BERT-large uses 16 attention heads per layer.

These parameters are integral to the Transformer architecture that BERT is built upon, and they significantly impact both the training dynamics and the performance capabilities of the model. Variations in these parameters across different versions of BERT are primarily what make some versions larger, more powerful, or slower than others. For instance, increasing the number of layers or the size of the hidden layers generally enhances the model's performance on complex tasks but also makes it more computationally expensive and slower to train and infer. This trade-off is a key consideration when choosing or designing a BERT model for specific applications or resource-limited environments.
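
As a rough check on how `L` and `H` drive model size, the sketch below estimates the parameter count of a BERT encoder. It rests on assumptions that are not stated in the text above: a vocabulary of 30,522 tokens, 512 position embeddings, 2 segment types, a feed-forward size of 4 × H, and a pooler layer on top. The helper `bert_param_count` is hypothetical and written only for this illustration.

```python
def bert_param_count(num_layers, hidden_size,
                     vocab_size=30_522, max_positions=512, type_vocab=2):
    """Approximate trainable-parameter count of a BERT encoder.

    Assumes token, position and segment embeddings, a feed-forward size of
    4 * H, LayerNorm after each sub-layer, and a pooler on top. The number
    of heads A does not appear, because the heads partition the
    H-dimensional projections rather than add to them.
    """
    h = hidden_size
    layer_norm = 2 * h                                    # scale + bias
    embeddings = (vocab_size + max_positions + type_vocab) * h + layer_norm
    attention = 4 * (h * h + h)                           # Q, K, V, output
    feed_forward = (h * 4 * h + 4 * h) + (4 * h * h + h)  # two dense layers
    per_layer = attention + layer_norm + feed_forward + layer_norm
    pooler = h * h + h
    return embeddings + num_layers * per_layer + pooler

configs = {"BERT-base": (12, 768, 12), "BERT-large": (24, 1024, 16)}
for name, (L, H, A) in configs.items():
    total = bert_param_count(L, H)
    print(f"{name}: L={L}, H={H}, A={A}, head size={H // A}, "
          f"~{total / 1e6:.0f}M parameters")
```

Under these assumptions the estimate lands close to the commonly quoted 110 million and 340 million parameters. It also shows that both configurations use the same per-head size (H / A = 64), so BERT-large gets more heads because its hidden size is larger, not because each head is smaller.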

# 4. BERT Base
In this exercise we will work with the pre-trained `BERT BASE` model. It uses L=12 hidden layers of size H=768 and A=12 attention heads. Check the model at [this page](https://www.kaggle.com/models/tensorflow/bert/tensorFlow2/en-uncased-l-12-h-768-a-12/4).

In [1]:
import tensorflow_hub as hub
import tensorflow_text as text



## 4.1 Specific BERT Preprocessing
BERT has specific requirements for how text must be formatted before it can be processed by the model. This includes tokenizing the text into tokens that BERT was trained on, adding special tokens (like [CLS], [SEP]), and creating attention masks. You can access the Preprocess layer of BERT at https://kaggle.com/models/tensorflow/bert/TensorFlow2/en-uncased-preprocess/3

In [2]:
preprocess_url = "https://kaggle.com/models/tensorflow/bert/TensorFlow2/en-uncased-preprocess/3"


Creating the preprocessing layer...

In [3]:
bert_preprocess_model = hub.KerasLayer(preprocess_url)

Attaching model 'tensorflow/bert/tensorflow2/en-uncased-preprocess/3' to your Kaggle notebook...


In [4]:
test = ['This is a spam mail', 'Computers are not unique to computers']
text_preprocessed = bert_preprocess_model(test)
text_preprocessed.keys()

dict_keys(['input_word_ids', 'input_mask', 'input_type_ids'])

**OUTPUT Explanation**

The outputs `'input_type_ids'`, `'input_word_ids'`, and `'input_mask'` are specifically formatted to meet the input requirements of BERT models.

1. **`'input_word_ids'`**:
   - These are also known as token IDs. The preprocessing model first tokenizes the text into words or subwords (subword tokenization helps in dealing with out-of-vocabulary words for which BERT hasn't been explicitly trained). Each token or subword is then mapped to a unique integer ID. These IDs are what the BERT model actually processes, and they correspond to entries in BERT's embedding table. Essentially, `'input_word_ids'` is a sequence of integers representing the tokens derived from the input text.

2. **`'input_mask'`**:
   - This is also referred to as the attention mask. The purpose of the `'input_mask'` is to provide the model with information about which parts of the input data are actual tokens and which parts are padding. This distinction is important because BERT processes fixed-size input sequences. If a given sentence or input is shorter than the maximum sequence length, it's padded with zeros. The attention mask has a binary value (0 or 1):
     - **1** indicates that the corresponding token is a real token that should be attended to.
     - **0** indicates that it is padding and should not be considered in the attention calculations in the Transformer model.

3. **`'input_type_ids'`**:
   - These are often referred to as segment IDs. BERT can take multiple sentences as input, which is useful for tasks that involve understanding the relationship between sentences (like question answering or natural language inference). The `'input_type_ids'` signal to the model which part of the input belongs to sentence A and which part belongs to sentence B. Typically:
     - **0** might be used for tokens belonging to the first sentence.
     - **1** might be used for tokens belonging to the second sentence.
   - If there's only one input sentence, the segment IDs might be all zeros. This segmentation helps BERT understand sentence boundaries within the input.

These three components (`'input_word_ids'`, `'input_mask'`, and `'input_type_ids'`) together form the complete input representation for the BERT model, allowing it to properly process and understand the text data provided to it, regardless of the specific task or data characteristics. Each plays a crucial role in ensuring that the model's attention mechanisms focus correctly on the meaningful parts of the input data while handling different input lengths and multi-sentence inputs effectively.
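
The way these three inputs fit together can be sketched without TensorFlow. The snippet below is a simplified illustration, not the actual preprocessing model: `pack_inputs` is a hypothetical helper, the token IDs for the two segments are made up, and truncation is done naively. The special IDs 101 (`[CLS]`), 102 (`[SEP]`), and 0 (`[PAD]`) are the ones used by the uncased English BERT vocabulary.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # IDs in the uncased English BERT vocabulary

def pack_inputs(segment_a, segment_b=None, seq_length=16):
    """Build the three BERT input lists from already-tokenized ID lists."""
    word_ids = [CLS_ID] + list(segment_a) + [SEP_ID]
    type_ids = [0] * len(word_ids)            # segment A (incl. [CLS], [SEP])
    if segment_b is not None:
        word_ids += list(segment_b) + [SEP_ID]
        type_ids += [1] * (len(segment_b) + 1)  # segment B and its [SEP]
    # Naive truncation; a real preprocessor trims segments more carefully.
    word_ids = word_ids[:seq_length]
    type_ids = type_ids[:seq_length]
    mask = [1] * len(word_ids)                # 1 = real token
    padding = seq_length - len(word_ids)
    return {
        "input_word_ids": word_ids + [PAD_ID] * padding,
        "input_mask": mask + [0] * padding,   # 0 = padding
        "input_type_ids": type_ids + [0] * padding,
    }

# Hypothetical token IDs standing in for two short tokenized sentences.
packed = pack_inputs([2001, 2002, 2003], [3001, 3002], seq_length=10)
for key, value in packed.items():
    print(f"{key:15s} {value}")
```

All three lists always have the same length (`seq_length`), which is what lets several inputs be stacked into a single batch tensor.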

In [5]:
text_preprocessed['input_mask'], text_preprocessed['input_type_ids'], text_preprocessed['input_word_ids']

(<tf.Tensor: shape=(2, 128), dtype=int32, numpy=
 array([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
       dtype=int32)>,
 <tf.Tensor: shape=(2, 128), dtype=int32, numpy=
 

In `input_word_ids`, pay attention to `101`, the ID of the `[CLS]` token, and `102`, the ID of the `[SEP]` token. BERT uses these special tokens to mark the beginning of each input and the end of each sentence.

## 4.2 BERT BASE Pre-Trained Model
You can find BERT Pre-Trained model at https://www.kaggle.com/models/tensorflow/bert/TensorFlow2/en-uncased-l-12-h-768-a-12/4

Let's create the `BERT BASE` model:

In [6]:
encoder_url = "https://www.kaggle.com/models/tensorflow/bert/TensorFlow2/en-uncased-l-12-h-768-a-12/4"

In [7]:
bert_model = hub.KerasLayer(encoder_url)

Attaching model 'tensorflow/bert/tensorflow2/en-uncased-l-12-h-768-a-12/4' to your Kaggle notebook...


Now, we feed the preprocessed input to the model:

In [8]:
bert_results = bert_model(text_preprocessed)

In [9]:
bert_results.keys()

dict_keys(['encoder_outputs', 'pooled_output', 'default', 'sequence_output'])

**OUTPUT Explanation**

The keys `'encoder_outputs'`, `'sequence_output'`, `'pooled_output'`, and `'default'` are the outputs of the BERT model after it processes the input text. Each key provides a different type of output from the BERT layers, useful for various downstream tasks in NLP.

1. **`'encoder_outputs'`**:
   - This key provides the outputs of each individual encoder (Transformer block) within the BERT model. The output under this key is typically a list of tensors, with each tensor representing the output from one of the Transformer blocks. This is useful for tasks that might benefit from accessing intermediate layers of the model, rather than just the final output, as different layers capture different levels of abstraction.
   
   - In the BERT-base model, `encoder_outputs` contains 12 items, one for each of the model's 12 Transformer blocks (layers). Each layer contributes to the hierarchical understanding of the input text at a different level of abstraction.

    ### Understanding Encoder Outputs:

    - **Each Layer's Contribution**: In BERT and other Transformer-based models, each Transformer layer (or block) takes the output of the previous layer as its input and contributes progressively to the final understanding of the text. Each layer captures different aspects of the text data. For instance, lower layers might focus more on syntactic representations while higher layers might capture more semantic aspects.

    - **Utility of Multiple Outputs**: Accessing the output of each individual layer can be highly beneficial for certain NLP tasks. Researchers may want to analyze or use the representations from intermediate layers because these representations might be more suitable for specific tasks. For example, earlier layers might be better for tasks focused on the syntactic nuances of the text, while later layers might be more effective for tasks involving complex semantic understanding.

    - **12 Outputs for 12 Layers**: Since BERT-base has 12 layers, `encoder_outputs` includes 12 separate tensors, each representing the output from one of the 12 transformer blocks. Each tensor in `encoder_outputs` is a full set of hidden states for the input sequence, as processed by a specific layer.

    ### Practical Use Cases:

    - **Feature Extraction**: You can extract these individual layer outputs for custom feature engineering. For example, in some specialized tasks, features from specific layers could be more informative than just using the final layer's output.

    - **Custom Models**: In advanced use cases, outputs from specific layers might be combined or manipulated differently to construct custom models tailored to particular NLP tasks, such as differentiating between types of semantic meanings or enhancing model interpretability.

    - **Research and Analysis**: For research purposes, analyzing how different layers of the model react to various inputs can provide insights into how BERT processes language and which layers are most critical for certain types of language understanding.

    Having all 12 `encoder_outputs` available in BERT-base thus adds versatility to how the model can be used across a wide range of natural language processing applications.


In [10]:
len(bert_results['encoder_outputs'])

12

In [11]:
bert_results['encoder_outputs'][-1] == bert_results['sequence_output']

<tf.Tensor: shape=(2, 128, 768), dtype=bool, numpy=
array([[[ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        ...,
        [ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True]],

       [[ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        ...,
        [ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True]]])>

2. **`'sequence_output'`**:
   - The `'sequence_output'` is the output from the last layer of the BERT model for each token in the input sequence. It provides a high-dimensional representation of each token in the context of the entire input sequence. This output is often used for `token-level tasks` such as `named entity recognition (NER)` or `token-level classification`, where a prediction is required for each input token.
    
   - The shape of the `sequence_output` tensor, `(2, 128, 768)`, breaks down into three dimensions:

    1. **Batch Size (2)**:
       - The first dimension, `2`, indicates that the output is for a batch of two input sequences. When processing multiple sequences at once, models like BERT can handle them in batches, which is more efficient than processing each sequence individually. This allows for parallel computation and faster processing times. The number `2` here simply means there are two sequences in this batch.

    2. **Sequence Length (128)**:
       - The second dimension, `128`, is the sequence length, that is, the number of token positions (words or subwords, plus special and padding tokens) in each input sequence that the model processes. Each sequence is padded or truncated to this fixed length so that all inputs in a batch have the same size. Here `128` is the default sequence length of the preprocessing model used above; it is a common choice because it balances computational efficiency with sufficient context for many NLP tasks.

    3. **Hidden Size (768)**:
       - The third dimension, `768`, is the size of the hidden layers in the BERT model. This number indicates the dimensionality of the output vectors that BERT generates for each token in the input sequence. Each token's output vector is a 768-dimensional representation that encapsulates the contextual relationships learned by the model during training. This size corresponds to the `H` parameter in the BERT-base model (as opposed to BERT-large, which would have 1024).

   - **Contextual Token Representations**: Each of the 128 tokens in the two sequences gets a 768-dimensional vector that represents that token in context. These vectors are what downstream tasks (e.g., token classification or feature extraction) would use.

   - **Handling of Padded Tokens**: Since sequences are padded to a maximum length of 128, parts of the `sequence_output` may correspond to padding tokens, especially if the original input sequence was shorter than 128 tokens. These padding areas typically don't carry meaningful information and are often masked out during subsequent processing steps in an NLP pipeline.

    This structured output allows BERT to be flexibly applied to various NLP tasks by providing a rich, contextualized embedding for each token in the input sequences.
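
One common way to mask out padded positions, as mentioned above, is to average the token vectors over real tokens only. The sketch below shows the idea on plain Python lists for a single sequence. `masked_mean` is a hypothetical helper and the numbers are toy values; in the notebook the same computation would be applied to `sequence_output` and `input_mask`.

```python
def masked_mean(sequence_output, input_mask):
    """Average token vectors over real tokens only, for one sequence.

    sequence_output: list of seq_len vectors, each of length hidden_size
    input_mask:      list of seq_len ints, 1 for real tokens, 0 for padding
    """
    hidden = len(sequence_output[0])
    totals = [0.0] * hidden
    count = 0
    for vector, keep in zip(sequence_output, input_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                totals[i] += value
    # max(count, 1) guards against a sequence made only of padding.
    return [t / max(count, 1) for t in totals]

# Toy sequence: 4 positions, hidden size 3; the last two positions are padding.
toy_output = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0], [9.0, 9.0, 9.0], [9.0, 9.0, 9.0]]
toy_mask = [1, 1, 0, 0]
print(masked_mean(toy_output, toy_mask))  # [2.0, 3.0, 4.0]
```

Without the mask, the padded positions (the rows of 9s here) would pull the average away from the real tokens.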


In [12]:
bert_results['sequence_output']

<tf.Tensor: shape=(2, 128, 768), dtype=float32, numpy=
array([[[ 0.02952154,  0.08941655, -0.29414886, ..., -0.28991556,
          0.33581254,  0.8754337 ],
        [-0.2737661 , -0.42498857, -0.37990084, ..., -0.57002264,
          1.0774478 ,  0.5713815 ],
        [-0.33146417, -0.35461444, -0.253891  , ..., -0.42635572,
          0.382568  ,  0.91525745],
        ...,
        [ 0.16875651, -0.27586892,  0.12268022, ...,  0.01988028,
          0.06036456,  0.4067654 ],
        [ 0.09589738, -0.25696522,  0.05153529, ...,  0.03342544,
          0.07288365,  0.40554154],
        [ 0.08295452, -0.22644407,  0.1854625 , ...,  0.14761981,
          0.06656485,  0.36203957]],

       [[ 0.07296047,  0.51637626, -0.23151016, ..., -0.4113447 ,
         -0.0308435 ,  0.84255636],
        [-0.3008911 ,  0.5466946 , -0.2294488 , ..., -0.2608072 ,
          0.23354402,  0.5297949 ],
        [ 0.65989935,  0.8210852 ,  0.0728514 , ..., -0.40727797,
         -0.13747925,  0.5823894 ],
        ...,

3. **`'pooled_output'`**:
   - The `'pooled_output'` represents a fixed-length output vector for the entire input sequence and is usually derived from the hidden state of the first token of the sequence (which is the special `[CLS]` token in BERT). This token's final hidden state is typically used as the `"aggregate representation"` for classification tasks. It's processed through an additional dense layer with a `Tanh activation` function to generate the pooled output. This output is useful for classification tasks where the entire input sequence needs to be represented as a single fixed-size vector.

In [13]:
bert_results['pooled_output']

<tf.Tensor: shape=(2, 768), dtype=float32, numpy=
array([[-0.87018895, -0.42847177, -0.38309997, ..., -0.34936565,
        -0.6542408 ,  0.8658799 ],
       [-0.8610027 , -0.33217645, -0.44266653, ..., -0.36746684,
        -0.6890626 ,  0.8170812 ]], dtype=float32)>
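
The pooling step described above, a dense layer with a tanh activation applied to the `[CLS]` hidden state, can be sketched as follows. This is a toy illustration rather than the model's actual code: `pool_cls` is a hypothetical helper, the hidden size is 2 instead of 768, and the weights are made up.

```python
import math

def pool_cls(sequence_output, weights, bias):
    """Apply tanh(h_cls @ W + b) to the first token's hidden state."""
    h_cls = sequence_output[0]  # hidden state at the [CLS] position
    pooled = []
    for j in range(len(bias)):
        z = sum(h_cls[i] * weights[i][j] for i in range(len(h_cls))) + bias[j]
        pooled.append(math.tanh(z))
    return pooled

# Toy example with hidden size 2; the weights are made up for illustration.
toy_sequence = [[0.5, -1.0], [2.0, 2.0], [0.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
print(pool_cls(toy_sequence, identity, [0.0, 0.0]))
```

Because of the tanh, every component of the pooled vector lies strictly between -1 and 1, which is consistent with the values printed for `pooled_output` above.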

4. **`'default'`**:
   - This output typically points to one of the other outputs as the default one that should be used for most tasks. In many implementations of BERT on TensorFlow Hub, the `'default'` key points to the `'pooled_output'` as it is the most commonly used output for various classification tasks. However, depending on the specific implementation or model variant, it might point to a different output.

These outputs provide flexibility in how the information processed by the BERT model can be utilized, catering to both whole-sequence tasks (like classification) and token-level tasks (like tagging or question answering). The availability of different layers' outputs (via `'encoder_outputs'`) also facilitates more advanced analyses and custom model architectures that leverage deeper or intermediate representations of the input data.