# Introduction

At the end of 2018, researchers at **Google AI Language** open-sourced a new technique for Natural Language Processing (NLP) called BERT (**B**idirectional **E**ncoder **R**epresentations from **T**ransformers). It was a major breakthrough that took the Deep Learning community by storm: the accompanying paper presented state-of-the-art results on a wide variety of NLP tasks, including Question Answering (SQuAD v1.1), Natural Language Inference (MNLI), and others.

## What is the core idea behind it?


**BERT’s key technical innovation** is applying the **bidirectional training** of Transformer, a popular attention model, to language modelling. **This is in contrast to previous efforts which looked at a text sequence either from left to right or combined left-to-right and right-to-left training.** The paper’s results show that a language model which is bidirectionally trained can have a deeper sense of language context and flow than single-direction language models. In the paper, the researchers detail a novel technique named **Masked LM (MLM) which allows bidirectional training** in models in which it was previously impossible.


What is language modeling really about? Which problem are language models trying to solve? Basically, their task is to “fill in the blank” based on context. For example, given

“The woman went to the store and bought a ... of shoes.”

a language model might complete this sentence by saying that the word “cart” would fill the blank 20% of the time and the word “pair” 80% of the time.

In the pre-BERT world, a language model would have looked at this text sequence during training either left-to-right or with combined left-to-right and right-to-left passes. This one-directional approach works well for generating sentences: we can predict the next word, append it to the sequence, then predict the word after that, and so on until we have a complete sentence.

Enter BERT, a language model that is bidirectionally trained (this is also its key technical innovation). As a result, it can develop a deeper sense of language context and flow than single-direction language models.

Instead of predicting the next word in a sequence, BERT makes use of a novel technique called **Masked LM (MLM)**: it randomly masks words in the sentence and then it tries to predict them. Masking means that the model looks in both directions and it uses the full context of the sentence, both left and right surroundings, in order to predict the masked word. Unlike the previous language models, it takes both the previous and next tokens into account at the **same time.** The existing combined left-to-right and right-to-left LSTM based models were missing this “same-time part”. (It might be more accurate to say that BERT is non-directional though.)

### But why is this non-directional approach so powerful?

**Pre-trained language representations** can either be **context-free** or **context-based**. **Context-based** representations can then be **unidirectional** or **bidirectional**. 

- **Context-free** models like **word2vec** generate a single word embedding representation (a vector of numbers) for each word in the vocabulary. For example, the word “bank” would have the same context-free representation in “bank account” and “bank of the river.” 

- On the other hand, **context-based models** generate a representation of each word that is based on the other words in the sentence. 
    - For example, in the sentence “I accessed the bank account,” a **unidirectional contextual model** would represent “bank” based on “I accessed the” but not “account.” 
    - However, BERT represents “bank” using both its previous and next context — “I accessed the … account” — starting from the very bottom of a deep neural network, making it deeply **bidirectional**.
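
The difference between the two kinds of contextual model can be illustrated with a minimal sketch. The helper names `unidirectional_context` and `bidirectional_context` are hypothetical and only show which neighbouring words each kind of model is allowed to use; they are not part of any BERT implementation.

```python
def unidirectional_context(tokens, i):
    """Words a left-to-right model may use when representing tokens[i]."""
    return tokens[:i]


def bidirectional_context(tokens, i):
    """Words a bidirectional model may use: everything except tokens[i] itself."""
    return tokens[:i] + tokens[i + 1:]


tokens = ["I", "accessed", "the", "bank", "account"]
i = tokens.index("bank")

left_only = unidirectional_context(tokens, i)   # ["I", "accessed", "the"]
both_sides = bidirectional_context(tokens, i)   # ["I", "accessed", "the", "account"]
```

Only the bidirectional context contains “account”, which is the word that disambiguates “bank” in this sentence.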


![image](https://user-images.githubusercontent.com/28102493/206432491-370e7373-ced0-4244-aae4-73cc3d0e5465.png)


Moreover, BERT is based on the Transformer model architecture, instead of LSTMs. 

A Transformer works by performing a small, constant number of steps. In each step, it applies an attention mechanism to understand relationships between all words in a sentence, regardless of their respective position. For example, given the sentence, “I arrived at the bank after crossing the river”, to determine that the word “bank” refers to the shore of a river and not a financial institution, the Transformer can learn to immediately pay attention to the word “river” and make this decision in just one step.
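
A single attention step can be sketched as follows. This is a simplified illustration that assumes scaled dot-product attention with a single head and no learned query/key/value projections; the two-dimensional word vectors are made-up toy values, not real embeddings.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Single-head scaled dot-product self-attention (no learned projections).

    Every position attends to every other position in one step,
    regardless of how far apart the words are.
    """
    d = len(vectors[0])
    outputs, weights = [], []
    for q in vectors:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in vectors]
        w = softmax(scores)
        weights.append(w)
        outputs.append(
            [sum(w[j] * vectors[j][i] for j in range(len(vectors))) for i in range(d)]
        )
    return outputs, weights


# Hypothetical toy vectors: "bank" and "river" point in similar directions.
tokens = ["bank", "after", "river"]
vectors = [[1.0, 2.0], [0.0, 0.1], [1.0, 3.0]]
outputs, weights = self_attention(vectors)
```

With these toy values, the attention weights for “bank” (`weights[0]`) put the most mass on “river”, even though “after” sits between them.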

## Background

***Transfer Learning***

In the field of computer vision, researchers have repeatedly shown the value of transfer learning — pre-training a neural network model on a known task, for instance ImageNet, and then performing fine-tuning — using the trained neural network as the basis of a new purpose-specific model. In recent years, researchers have been showing that a similar technique can be useful in many natural language tasks.

One of the biggest challenges in NLP is the lack of enough training data. Overall there is an enormous amount of text data available, but if we want to create task-specific datasets, we need to split that pile into very many diverse fields. When we do this, we end up with only a few thousand or a few hundred thousand human-labeled training examples. Unfortunately, in order to perform well, deep learning based NLP models require much larger amounts of data: they see major improvements when trained on millions, or billions, of annotated training examples. **To help bridge this gap in data, researchers have developed various techniques for training general purpose language representation models using the enormous piles of unannotated text on the web (this is known as pre-training). These general purpose pre-trained models can then be fine-tuned on smaller task-specific datasets, e.g., when working with problems like question answering and sentiment analysis.** This approach results in great accuracy improvements compared to training on the smaller task-specific datasets from scratch. BERT is a recent addition to these techniques for NLP pre-training; it caused a stir in the deep learning community because it presented state-of-the-art results in a wide variety of NLP tasks, like question answering.


***Feature-based training***

A different approach, which is also popular in NLP tasks and exemplified in the recent ELMo paper, is feature-based training. *In this approach, a pre-trained neural network produces word embeddings which are then used as features in NLP models.*


The best part about BERT is that it can be downloaded and used for free: we can either use the BERT models to extract high-quality language features from our text data, or fine-tune these models on a specific task, such as sentiment analysis or question answering, with our own data to produce state-of-the-art predictions.


# How BERT works

## High Level Overview

BERT makes use of the Transformer, an attention mechanism that learns contextual relations between words (or sub-words) in a text. In its vanilla form, the Transformer includes two separate mechanisms: an encoder that reads the text input and a decoder that produces a prediction for the task. **Since BERT’s goal is to generate a language representation model, only the encoder mechanism is necessary.**

**As opposed to directional models, which read the text input sequentially (left-to-right or right-to-left), the Transformer encoder reads the entire sequence of words at once. Therefore it is considered bidirectional, though it would be more accurate to say that it’s non-directional. This characteristic allows the model to learn the context of a word based on all of its surroundings (left and right of the word).**

At a high level, the Transformer encoder works as follows. The input is a sequence of tokens, which are first embedded into vectors and then processed in the neural network. The output is a sequence of vectors of size H, in which each vector corresponds to an input token with the same index.

When training language models, there is **a challenge of defining a prediction goal.** Many models predict the next word in a sequence (e.g. “The child came home from ...”), a directional approach which inherently limits context learning. To overcome this challenge, BERT uses two training strategies:


1. **Masked LM (MLM)**
1. **Next Sentence Prediction (NSP)**

BERT does not try to predict the next word in the sentence. Training makes use of the above two strategies.


## Masked LM (MLM)

Before feeding word sequences into BERT, 15% of the words in each sequence are replaced with a [MASK] token. The model then attempts to predict the original value of the masked words, based on the context provided by the other, non-masked, words in the sequence. In technical terms, the prediction of the output words requires:

1. Adding a classification layer on top of the encoder output.

1. Multiplying the output vectors by the embedding matrix, transforming them into the vocabulary dimension.

1. Calculating the probability of each word in the vocabulary with softmax.

![image](https://user-images.githubusercontent.com/28102493/206423131-b774391f-7cb3-407e-b981-395a353d1709.png)
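
The three prediction steps above can be sketched for a single masked position. This is a toy illustration with assumptions: the hidden size is 2, the vocabulary has three words, the classification layer is a plain linear transform (any activation or normalization is omitted), and all weights are made-up values.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def mlm_head(hidden, W, b, embedding_matrix):
    # Step 1: classification layer on top of the encoder output.
    transformed = [x + bi for x, bi in zip(matvec(W, hidden), b)]
    # Step 2: multiply by the embedding matrix (V x H) to get one logit per vocabulary word.
    logits = matvec(embedding_matrix, transformed)
    # Step 3: softmax turns the logits into a probability per vocabulary word.
    return softmax(logits)


# Hypothetical toy values, chosen only for illustration.
vocab = ["cart", "pair", "store"]
embedding_matrix = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # one row per vocabulary word
W = [[1.0, 0.0], [0.0, 1.0]]                               # identity, for readability
b = [0.0, 0.0]
hidden = [0.2, 1.5]                                        # encoder output at the masked position

probs = mlm_head(hidden, W, b, embedding_matrix)
predicted = vocab[probs.index(max(probs))]
```

Reusing the input embedding matrix in step 2 means a word is predicted when the transformed output vector is close to that word's own embedding.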


The BERT loss function takes into consideration only the prediction of the masked values and ignores the prediction of the non-masked words. **As a consequence, the model converges more slowly than directional models,** a characteristic which is offset by its increased context awareness.
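
A loss restricted to masked positions can be sketched like this. It assumes a standard cross-entropy loss averaged over the masked positions; the function name `masked_lm_loss` and the probability values are hypothetical.

```python
import math


def masked_lm_loss(probs_per_position, target_ids, masked_positions):
    """Average cross-entropy over the masked positions only.

    Predictions at non-masked positions never enter the sum,
    so they contribute nothing to the loss.
    """
    losses = [
        -math.log(probs_per_position[i][target_ids[i]]) for i in masked_positions
    ]
    return sum(losses) / len(losses)


# Toy predicted distributions over a 3-word vocabulary for a 3-token sequence.
probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
]
targets = [0, 1, 2]

# Only position 1 was masked, so only its prediction is scored.
loss = masked_lm_loss(probs, targets, masked_positions=[1])
```

Because only a small share of positions produces a training signal in each batch, each batch carries less supervision than a next-word objective that scores every position.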

In practice, the BERT implementation is slightly more elaborate and doesn’t replace all of the selected 15% of words with [MASK].

The idea here is “simple”: randomly select 15% of the words in the input, replace them with a [MASK] token, run the entire sequence through the BERT attention-based encoder, and then predict only the masked words, based on the context provided by the other non-masked words in the sequence. However, there is a problem with this naive masking approach: the model only tries to predict when the [MASK] token is present in the input, while **we want the model to try to predict the correct tokens regardless of what token is present in the input.** To deal with this issue, the 15% of tokens randomly selected for prediction are pre-processed as follows: 80% are replaced with a “[MASK]” token, 10% are replaced with a random word, and 10% keep the original word. The intuition that led the authors to pick this approach is as follows (thanks to Jacob Devlin from Google for the insight):


- If we used [MASK] 100% of the time, the model wouldn’t necessarily produce good token representations for non-masked words. The non-masked tokens would still be used for context, but the model would be optimized only for predicting masked words.

- If we used [MASK] 90% of the time and random words 10% of the time, this would teach the model that the observed word is never correct.

- If we used [MASK] 90% of the time and kept the same word 10% of the time, then the model could just trivially copy the non-contextual embedding.

No ablation was done on the ratios of this approach, and it may have worked better with different ratios. In addition, the model performance wasn’t tested with simply masking 100% of the selected tokens.
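
The 80/10/10 procedure can be sketched as below. This is a simplified illustration, not the released implementation: it assumes each token is selected independently with probability 0.15, works on whole words rather than sub-word pieces, and does not protect special tokens. The function name `mask_tokens` and the toy vocabulary are hypothetical.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return (corrupted_tokens, labels).

    labels[i] holds the original token at positions selected for prediction
    and None everywhere else.
    """
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() >= select_prob:
            # Not selected: keep the token, nothing to predict here.
            corrupted.append(tok)
            labels.append(None)
            continue
        labels.append(tok)
        roll = rng.random()
        if roll < 0.8:
            corrupted.append(MASK)               # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))  # 10%: replace with a random word
        else:
            corrupted.append(tok)                # 10%: keep the original word
    return corrupted, labels


rng = random.Random(0)  # fixed seed so the sketch is reproducible
vocab = ["the", "woman", "went", "to", "store", "bought", "pair", "shoes"]
tokens = "the woman went to the store and bought a pair of shoes".split()
corrupted, labels = mask_tokens(tokens, vocab, rng)
```

The model is trained to recover `labels[i]` at every selected position, whether the input there shows `[MASK]`, a random word, or the unchanged original.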


## Next Sentence Prediction (NSP)
In order to understand the relationship between two sentences, the BERT training process also uses next sentence prediction. A pre-trained model with this kind of understanding is relevant for tasks like question answering.

In the BERT training process, the model receives pairs of sentences as input and learns to predict if the second sentence in the pair is the subsequent sentence in the original document. During training, 50% of the inputs are a pair in which the second sentence is the subsequent sentence in the original document, while in the other 50% a random sentence from the corpus is chosen as the second sentence. The assumption is that the random sentence will be disconnected from the first sentence.
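
Building such training pairs can be sketched as follows. The sketch assumes the corpus is a list of documents, each a list of sentences, and that the random sentence is drawn from a different document so that it cannot accidentally be the true next sentence. The function name `make_nsp_pairs` and the toy corpus are hypothetical.

```python
import random


def make_nsp_pairs(documents, rng):
    """Build (sentence_a, sentence_b, is_next) examples.

    About half of the pairs use the true next sentence; the rest use a
    random sentence taken from another document.
    """
    examples = []
    for doc in documents:
        # Candidate negatives: every sentence that is not in this document.
        others = [s for d in documents if d is not doc for s in d]
        for i in range(len(doc) - 1):
            if rng.random() < 0.5:
                examples.append((doc[i], doc[i + 1], True))
            else:
                examples.append((doc[i], rng.choice(others), False))
    return examples


# Toy corpus with two documents (needs at least two so negatives exist).
documents = [
    ["A1", "A2", "A3"],
    ["B1", "B2"],
]
examples = make_nsp_pairs(documents, random.Random(0))
```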

To help the model distinguish between the two sentences in training, the input is processed in the following way before entering the model:

1. **Token embeddings:** A **[CLS] token** is inserted at the beginning of the first sentence and a **[SEP]** token is inserted at the end of each sentence.

1. **Segment embeddings:** A **sentence embedding** indicating Sentence A or Sentence B is added to each token. Sentence embeddings are similar in concept to token embeddings with a vocabulary of 2.

1. **Positional embeddings:** A **positional embedding** is added to each token to indicate its position in the sequence. The concept and implementation of positional embedding are presented in the Transformer paper.

![image](https://user-images.githubusercontent.com/28102493/206425820-0d0d5f8d-c3f5-490f-add3-2b99af9bf670.png)
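
The input construction can be sketched in a few lines. This illustration assumes whole-word tokens and tiny made-up embedding tables with hidden size 2; the helper names `build_input` and `embed` are hypothetical.

```python
def build_input(sentence_a, sentence_b):
    """Add [CLS]/[SEP] and derive segment and position ids for a sentence pair."""
    tokens = ["[CLS]"] + sentence_a + ["[SEP]"] + sentence_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and its [SEP]; segment 1 covers sentence B and its [SEP].
    segment_ids = [0] * (len(sentence_a) + 2) + [1] * (len(sentence_b) + 1)
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids


def embed(tokens, segment_ids, position_ids, token_emb, segment_emb, position_emb):
    """Input vector = token embedding + segment embedding + position embedding."""
    return [
        [t + s + p for t, s, p in zip(token_emb[tok], segment_emb[seg], position_emb[pos])]
        for tok, seg, pos in zip(tokens, segment_ids, position_ids)
    ]


tokens, segment_ids, position_ids = build_input(["my", "dog"], ["he", "barks"])

# Hypothetical toy embedding tables, chosen so the sums are easy to check by hand.
token_emb = {tok: [float(len(tok)), 1.0] for tok in set(tokens)}
segment_emb = {0: [0.0, 0.0], 1: [10.0, 10.0]}
position_emb = [[float(p), 0.0] for p in range(len(tokens))]

vectors = embed(tokens, segment_ids, position_ids, token_emb, segment_emb, position_emb)
```

The three embeddings are summed element-wise, so every input vector keeps the same size H regardless of which sentence or position the token belongs to.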

To predict if the second sentence is indeed connected to the first, the following steps are performed:


1. The entire input sequence goes through the Transformer model.

1. The output of the **[CLS] token** is transformed into a 2×1 shaped vector, using a simple classification layer (learned matrices of weights and biases).

1. The probability of IsNextSequence is calculated with softmax.
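
Steps 2 and 3 can be sketched for a single example. The sketch assumes a hidden size of 3 and made-up weights; the function name `nsp_head` and the ordering of the two output classes are illustrative choices, not taken from the paper.

```python
import math


def nsp_head(cls_vector, weights, bias):
    """Map the [CLS] output vector to two probabilities: [P(IsNext), P(NotNext)]."""
    # Simple classification layer: one row of weights and one bias per class.
    logits = [
        sum(w * x for w, x in zip(row, cls_vector)) + b
        for row, b in zip(weights, bias)
    ]
    # Softmax over the two logits.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical toy values.
cls_vector = [0.5, -1.0, 2.0]
weights = [[1.0, 0.0, 1.0], [0.0, 1.0, -1.0]]  # shape 2 x H
bias = [0.0, 0.0]

p_is_next, p_not_next = nsp_head(cls_vector, weights, bias)
```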

When training the BERT model, **Masked LM and Next Sentence Prediction are trained together, with the goal of minimizing the combined loss function of the two strategies.**



# How to use BERT (Fine-tuning)
Using BERT for a specific task is relatively straightforward. It can be used for a wide variety of language tasks by adding only a small layer to the core model:

1. **Classification tasks** such as sentiment analysis are done similarly to Next Sentence classification, by **adding a classification layer on top of the Transformer output for the [CLS] token.**

1. In **Question Answering tasks** (e.g. SQuAD v1.1), the software receives a question regarding a text sequence and is required to mark the answer in the sequence. **Using BERT, a Q&A model can be trained by learning two extra vectors that mark the beginning and the end of the answer.**

1. In **Named Entity Recognition (NER)**, the software receives a text sequence and is required to mark the various types of entities (Person, Organization, Date, etc) that appear in the text. Using BERT, a NER model can be trained **by feeding the output vector of each token into a classification layer that predicts the NER label.**


In the fine-tuning training, most hyper-parameters stay the same as in BERT training, and the paper gives specific guidance (Section 3.5) on the hyper-parameters that require tuning. The BERT team has used this technique to achieve state-of-the-art results on a wide variety of challenging natural language tasks, detailed in Section 4 of the paper.


# Conclusion

BERT is undoubtedly a breakthrough in the use of Machine Learning for Natural Language Processing. The fact that it’s approachable and allows fast fine-tuning will likely allow a wide range of practical applications in the future. In this summary, we attempted to describe the main ideas of the paper while not drowning in excessive technical details. For those wishing for a deeper dive, we highly recommend reading the full article and ancillary articles referenced in it. Another useful reference is the BERT source code and models, which cover 103 languages and were generously released as open source by the research team.


## Takeaways

1. **Model size matters, even at huge scale.** BERT_large, with 340 million parameters, was the largest model of its kind when the paper was published. It is demonstrably superior on small-scale tasks to BERT_base, which uses the same architecture with “only” 110 million parameters.

1. **With enough training data, more training steps == higher accuracy.** For instance, on the MNLI task, the BERT_base accuracy improves by 1.0% when trained on 1M steps (128,000 words batch size) compared to 500K steps with the same batch size.

1. **BERT’s bidirectional approach (MLM) converges slower than left-to-right approaches (because only 15% of words are predicted in each batch)** but bidirectional training still outperforms left-to-right training after a small number of pre-training steps.

![image](https://user-images.githubusercontent.com/28102493/206427665-32adf5e1-7653-4c1b-a15f-afdcab60e8f7.png)

![image](https://user-images.githubusercontent.com/28102493/206438110-dd9aa287-89d5-4fe4-8947-0a0dab5730a0.png)


![image](https://user-images.githubusercontent.com/28102493/206427804-26da6e6e-ad65-4d84-bf88-904b3d5ea8b9.png)


# References

1. [BERT Explained: State of the art language model for NLP](https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270)

# Final Personal Notes

1. The Transformer architecture uses non-directional training, but it was introduced to train a downstream task (translation) rather than a language model. That's why the original Transformer has two core components (encoder and decoder). Before BERT, **language models** were trained only unidirectionally (predict the next word), while BERT uses non-directional training. So the big difference between the original Transformer and BERT lies in how the language model is trained: since BERT’s goal is to generate a language representation model, it only needs the encoder part.