# BERT 
**Bidirectional Encoder Representations from Transformers (BERT)**

## Understanding BERT

$\rightarrow$ [**BERT Paper**](https://arxiv.org/abs/1810.04805)

- *Transformer models fall into 2 main families:*
    - **Encoder-based models**
    - **Decoder-based models (auto-regressive)**<br></br>
    
- *In other words, these models build on either the encoder or the decoder part of the Transformer, rather than on both.* **The main difference between the two is how attention is used.**

- **Encoder-based models use bidirectional attention, whereas decoder-based models use auto-regressive (i.e., left-to-right) attention.**
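
The difference between the two attention patterns can be pictured as a mask over token positions. The following is a minimal sketch, not code from the BERT paper: the function names `bidirectional_mask` and `causal_mask` are hypothetical, and a `1` at row `i`, column `j` is taken to mean "position `i` may attend to position `j`".

```python
def bidirectional_mask(n):
    """Encoder-style: every position may attend to every other position."""
    return [[1] * n for _ in range(n)]


def causal_mask(n):
    """Decoder-style: position i may attend only to positions 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    print("bidirectional:")
    for row in bidirectional_mask(4):
        print(row)
    print("auto-regressive (left to right):")
    for row in causal_mask(4):
        print(row)
```

The causal mask is lower-triangular, so a token never sees tokens to its right, while the bidirectional mask lets every token use context from both sides.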

**BERT is an encoder-based Transformer model. It takes an input sequence (a collection of tokens) and produces an encoded output sequence.**

<div align='center'>
    <img src='images/bert_high_level_architecture.png' title='NLP w/ TensorFlow by Thushan Ganegedara'/>
</div>

## Input Processing for BERT

**BERT inserts some special tokens into the input before processing it:**

- At the beginning, it inserts a `[CLS]` token (short for *classification*), whose final hidden representation is used for certain types of tasks (like sequence classification). It represents the output after attending to all the tokens.

- Next, it inserts a `[SEP]` token (short for *separator*) depending on the type of input. The `[SEP]` token marks where one sequence ends and the next begins in the input.

    - For example, in question-answering, the model takes a question and a context (such as a paragraph) that may have the answer as an input, and `[SEP]` is used in between the question and the context.<br></br>
    
- Also, the `[PAD]` token is used to pad short sequences to a required length.

* **

The `[CLS]` token is prepended to every input sequence fed to BERT, so it denotes the beginning of the input. It also forms the basis for the input fed into the classification head used on top of BERT to solve your NLP task. BERT produces a hidden representation for each input token in the sequence; by convention, the hidden representation corresponding to the `[CLS]` token is used as the input to the classification model that sits on top of BERT.
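
To make the role of the special tokens concrete, here is a minimal sketch of formatting a single sequence. It assumes the text has already been split into tokens (real BERT uses a WordPiece tokenizer, which is not shown). The helper name `format_single` is hypothetical, and the attention mask it returns is a common companion to padding that is not discussed above.

```python
def format_single(tokens, max_len, cls="[CLS]", sep="[SEP]", pad="[PAD]"):
    """Wrap a token list as [CLS] ... [SEP] and pad it to max_len.

    Returns the padded tokens and a mask with 1 for real tokens and
    0 for padding (assumption: padding should be ignored downstream).
    """
    sequence = [cls] + list(tokens) + [sep]
    if len(sequence) > max_len:
        raise ValueError("sequence is longer than max_len")
    num_pad = max_len - len(sequence)
    attention_mask = [1] * len(sequence) + [0] * num_pad
    return sequence + [pad] * num_pad, attention_mask


if __name__ == "__main__":
    padded, mask = format_single(["i", "love", "nlp"], max_len=8)
    print(padded)
    print(mask)
```

The hidden representation at index 0 of the output, the position of `[CLS]`, is the one a classification head would consume.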

* **

- *Next, the final embedding of the tokens is generated using 3 different embedding spaces.*
    1. The **token embedding** has a unique vector for each token in the vocabulary.

    2. The **positional embedding** encodes the position of each token.

    3. Finally, the **segment embedding** provides a distinct representation for each sub-component in the input, when the input consists of multiple components.
        - *For example, in question-answering, the question will have a unique vector as its segment embedding vector and the context will have a different embedding vector.*

        - *This is done by having $n$ embedding vectors for the $n$ different components in the input sequence. Depending on the component index specified for each token in the input, the corresponding segment embedding vector is retrieved.*
        
        - *$n$ needs to be specified in advance.*
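
The three lookups can be sketched as follows. In the original BERT the three vectors are added element-wise to form the final embedding. The sketch uses plain Python lists and randomly initialised tables instead of learned weights, and it leaves out the layer normalisation and dropout that real implementations apply afterwards. The names `make_table` and `embed`, the table sizes, and the ids are all illustrative assumptions.

```python
import random


def make_table(num_rows, dim, rng):
    """Hypothetical helper: a random lookup table with one row per id."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_rows)]


def embed(token_ids, segment_ids, token_table, position_table, segment_table):
    """Return one vector per token: token + positional + segment embedding."""
    vectors = []
    for position, (token_id, segment_id) in enumerate(zip(token_ids, segment_ids)):
        vectors.append([
            t + p + s
            for t, p, s in zip(
                token_table[token_id],
                position_table[position],
                segment_table[segment_id],
            )
        ])
    return vectors


rng = random.Random(0)
dim = 4
token_table = make_table(10, dim, rng)    # toy vocabulary of 10 tokens
position_table = make_table(8, dim, rng)  # sequences of up to 8 tokens
segment_table = make_table(2, dim, rng)   # n = 2 segments, fixed in advance

token_ids = [1, 5, 2, 7, 2]       # token id 2 appears twice
segment_ids = [0, 0, 0, 1, 1]     # first three tokens in segment 0, rest in segment 1
vectors = embed(token_ids, segment_ids, token_table, position_table, segment_table)
print(len(vectors), len(vectors[0]))
```

Token id `2` appears at positions 2 and 4 but in different segments, so its two final vectors differ even though the token embedding is the same.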

## Tasks solved by BERT

The NLP tasks solved by BERT can be classified into 4 different categories:

- $\textit{Sequence Classification}$
    - *Here, a single input sequence is given and the model is asked to predict a label for the whole sequence (for example, sentiment analysis or spam identification).*<br></br>

- $\textit{Token Classification}$
    - *Here, a single input sequence is given and the model is asked to predict a label for each token in the sequence (for example, named entity recognition or part-of-speech tagging).*<br></br>

- $\textit{Question-Answering}$
    - *Here, the input consists of two sequences: a question and a context. The question and the context are separated by a `[SEP]` token. The model is trained to predict the starting and ending indices of the span of tokens belonging to the answer.*<br></br>

- $\textit{Multiple Choice}$
    - *Here, the input consists of multiple sequences: a question followed by multiple candidates that may or may not be the answer to the question. These multiple sequences are separated by the token `[SEP]` and provided as a single input sequence to the model. The model is trained to predict the correct answer (that is, the class label) for that question.*
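
For the question-answering category, predicting "starting and ending indices" can be sketched as picking the best-scoring span from two per-token score lists. This is an illustrative sketch: the function name `best_span`, the `max_answer_len` parameter, and the scores are made up. It follows the usual convention that the end index may not precede the start index.

```python
def best_span(start_scores, end_scores, max_answer_len=30):
    """Return (start, end) maximising start_scores[start] + end_scores[end]
    subject to start <= end < start + max_answer_len."""
    if len(start_scores) != len(end_scores):
        raise ValueError("score lists must have the same length")
    best, best_score = None, float("-inf")
    for start, start_score in enumerate(start_scores):
        last = min(len(end_scores), start + max_answer_len)
        for end in range(start, last):
            score = start_score + end_scores[end]
            if score > best_score:
                best, best_score = (start, end), score
    return best


# Made-up scores for a toy input; a real model would produce these.
tokens = ["[CLS]", "what", "color", "[SEP]", "her", "red", "ball", "[SEP]"]
start_scores = [0.1, 0.0, 0.2, 0.0, 0.3, 2.5, 0.4, 0.0]
end_scores = [0.0, 0.1, 0.0, 0.1, 0.2, 2.0, 0.9, 0.0]

start, end = best_span(start_scores, end_scores)
print(tokens[start:end + 1])
```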

* **

- In tasks that involve multiple sequences (such as question-answering or multiple-choice), the model needs to tell apart the tokens belonging to different segments (e.g., which tokens are the question and which tokens are the context in a question-answering task).

- In order to make that distinction, the `[SEP]` token is used. A `[SEP]` token is inserted between the different sequences. 

- For example, if you are solving a question-answering problem, you might have the following input:

    - Question: What color is the ball?<br>Paragraph: Tippy is a dog. She loves to play with her red ball.

    - Then the input to BERT might look like this:<br>`[CLS] What color is the ball [SEP] Tippy is a dog She loves to play with her red ball [SEP]`
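
The example above can be reproduced with a short sketch. It splits on whitespace purely for illustration (BERT itself uses WordPiece tokenization) and also produces the segment index of each token, which is what selects the segment embedding. The helper name `format_pair` is hypothetical.

```python
def format_pair(question_tokens, context_tokens):
    """Build [CLS] question [SEP] context [SEP] plus a segment id per token.

    Segment 0 covers [CLS], the question, and the first [SEP];
    segment 1 covers the context and the final [SEP].
    """
    tokens = ["[CLS]"] + question_tokens + ["[SEP]"] + context_tokens + ["[SEP]"]
    segment_ids = [0] * (len(question_tokens) + 2) + [1] * (len(context_tokens) + 1)
    return tokens, segment_ids


question = "What color is the ball".split()
context = "Tippy is a dog She loves to play with her red ball".split()

pair_tokens, pair_segments = format_pair(question, context)
print(" ".join(pair_tokens))
print(pair_segments)
```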

<div align='center'>
    <img src='images/bert_tasks.png' title='NLP w/ TensorFlow by Thushan Ganegedara'/>
</div>

**BERT is designed in such a way that it can be used to complete these tasks without any modifications to the base model.**

## Key Points on BERT

Now that we have discussed all the elements of BERT needed to solve a downstream NLP task, let’s reiterate the key points about BERT:

- *BERT is an encoder-based Transformer*

- *BERT outputs a hidden representation for every token in the input sequence*

- *BERT has 3 embedding spaces: token embedding, positional embedding, and segment embedding*

- *BERT uses a special token `[CLS]` to denote the beginning of an input; its hidden representation is used as the input to a downstream classification model*

- *BERT is designed to solve four types of NLP tasks: sequence classification, token classification, free-text question-answering, and multiple-choice question-answering*

- *BERT uses the special token `[SEP]` to separate sequence A from sequence B*


The power within BERT doesn’t just lie within its structure. BERT is pre-trained on a large corpus of text using a few different pre-training techniques. In other words, BERT already comes with a solid understanding of the language, making downstream NLP tasks easier to solve. 

Next, let’s discuss how BERT is pre-trained.

## How BERT is pre-trained

The real value of BERT comes from the fact that it has been pre-trained on a large corpus of data in a self-supervised fashion. In the pre-training stage, BERT is trained on 2 different tasks:

- $\text{Masked Language Modeling (MLM)}$
- $\text{Next Sentence Prediction (NSP)}$

### **Masked Language Modeling (MLM)**

- The MLM task is inspired by the Cloze test, where a student is given a sentence with one or more blanks and is asked to fill in the blanks.

- Similarly, given a text corpus, words are masked from sentences and then the model is asked to predict the masked tokens.

- For example:<br>*My mental health is declining*<br>might become:<br>*My mental `[MASK]` is declining*

    - BERT uses a special token, `[MASK]`, to represent masked words. 

    - Then the target for the model will be the word *health*.

- **But this introduces a practical issue to the model:**
    - The special `[MASK]` token does not appear in the actual text. 
    
    - This means that the text the model will see during the finetuning phase (i.e., when training on a classification problem) will be different from what it saw during pre-training.

    - **This is sometimes referred to as the pre-training-finetuning discrepancy.**

- *Therefore, the authors of BERT suggest the following approach to cope with the issue.<br>When masking a word, do one of the following:*
    - Use the `[MASK]` token as it is (with $80\%$ probability)
    - Use a random word (with $10\%$ probability)
    - Use the true word (with $10\%$ probability)

- In other words, instead of always seeing `[MASK]`, the model will see actual words on certain occasions, alleviating the discrepancy.
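
The 80/10/10 rule can be sketched as below. This is an illustrative sketch, not the authors' code: the function name `mask_tokens` and the toy vocabulary are made up. The `select_prob=0.15` default reflects the BERT paper, where about 15% of tokens are chosen for prediction, a detail not stated above. Special tokens are never selected.

```python
import random

SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}


def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return (inputs, targets) for masked language modeling.

    targets[i] is the original token where a prediction is required,
    and None everywhere else.
    """
    inputs = list(tokens)
    targets = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if token in SPECIAL_TOKENS or rng.random() >= select_prob:
            continue  # token not selected for prediction
        targets[i] = token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = "[MASK]"           # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab)  # 10%: replace with a random word
        # remaining 10%: keep the true word unchanged
    return inputs, targets


if __name__ == "__main__":
    toy_vocab = ["my", "mental", "health", "is", "declining", "ball", "dog"]
    sentence = ["[CLS]", "my", "mental", "health", "is", "declining", "[SEP]"]
    # select_prob is raised here only so the short demo shows some masking.
    demo_inputs, demo_targets = mask_tokens(sentence, toy_vocab, random.Random(0), select_prob=0.5)
    print(demo_inputs)
    print(demo_targets)
```

Because a selected token is sometimes left unchanged or swapped for a random word, the model cannot rely on `[MASK]` as the only signal that a prediction is needed.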

### **Next Sentence Prediction (NSP)**

- In the NSP task, the model is given a pair of sentences, A and B (in that order), and is asked to predict whether B is the next sentence after A.

- This can be done by fitting a binary classifier onto BERT and training the whole model from end to end on selected pairs of sentences. 

- Generating pairs of sentences as inputs for the model is not hard and can be done in an unsupervised manner:
    - A sample with the label TRUE is generated by picking two sentences that are adjacent to each other
    
    - A sample with the label FALSE is generated by picking two sentences randomly that are not adjacent to each other

Following this approach, a labeled dataset is generated for the next sentence prediction task. Then BERT, along with the binary classifier, is trained end to end on this labeled dataset as part of pre-training.
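
Generating such pairs can be sketched as follows. The function name `make_nsp_pairs` is hypothetical. The 50/50 split between TRUE and FALSE samples follows the BERT paper rather than the text above. As a simplification, negatives are drawn from the same list of sentences instead of from the whole corpus.

```python
import random


def make_nsp_pairs(sentences, rng):
    """Return (sentence_a, sentence_b, is_next) triples from an ordered list."""
    if len(sentences) < 4:
        raise ValueError("need at least 4 sentences to sample non-adjacent pairs")
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            # TRUE sample: B is the sentence that directly follows A.
            pairs.append((sentences[i], sentences[i + 1], True))
        else:
            # FALSE sample: B is a random sentence that is not adjacent to A.
            candidates = [j for j in range(len(sentences)) if abs(j - i) > 1]
            pairs.append((sentences[i], sentences[rng.choice(candidates)], False))
    return pairs


if __name__ == "__main__":
    document = [
        "Tippy is a dog.",
        "She loves to play with her red ball.",
        "The ball rolled under the sofa.",
        "Tippy barked until someone fetched it.",
        "Then she fell asleep.",
    ]
    for a, b, label in make_nsp_pairs(document, random.Random(0)):
        print(label, "|", a, "->", b)
```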