# The illustrated BERT

<a href="https://jalammar.github.io/illustrated-bert/"><b>Original article</b></a>

[BERT](https://arxiv.org/pdf/1810.04805.pdf) builds on top of a number of clever ideas that were bubbling up in the NLP community around 2017, including but not limited to:
* [Semi-supervised sequence learning](https://arxiv.org/abs/1511.01432)
* [ELMo](https://arxiv.org/abs/1802.05365)
* [ULMFiT](https://arxiv.org/abs/1801.06146)
* [OpenAI Transformer](https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf)
* [Transformer](https://arxiv.org/pdf/1706.03762.pdf)

## BERT model architecture

BERT is basically a trained Transformer Encoder stack. The original paper presents two model sizes for BERT:
* BERT Base (12 encoder layers, called transformer blocks in the paper)
* BERT Large (24 encoder layers, called transformer blocks in the paper)

In addition to the larger number of encoder layers, these models also have larger feed-forward networks (768 and 1024 hidden units respectively) and more attention heads (12 and 16 respectively) than the default configuration in the reference implementation of the original Transformer (6 encoder layers, 512 hidden units, and 8 attention heads).

<img src="images_ib/bert-base-bert-large-encoders.png" title="" alt="" width="500" data-align="center">
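To make the size comparison concrete, here is a minimal sketch that stores the three configurations quoted above and derives the per-head size. The `EncoderConfig` class and its field names are hypothetical, made up for these notes; they are not taken from any BERT codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncoderConfig:
    """Hypothetical container for the sizes quoted in the text above."""
    name: str
    num_layers: int
    hidden_size: int
    num_heads: int

    @property
    def head_size(self) -> int:
        # Each attention head works on an equal slice of the hidden vector.
        return self.hidden_size // self.num_heads


CONFIGS = [
    EncoderConfig("Transformer (reference)", 6, 512, 8),
    EncoderConfig("BERT Base", 12, 768, 12),
    EncoderConfig("BERT Large", 24, 1024, 16),
]

for cfg in CONFIGS:
    print(f"{cfg.name}: {cfg.num_layers} layers, "
          f"hidden={cfg.hidden_size}, heads={cfg.num_heads}, "
          f"per-head size={cfg.head_size}")
```

One thing the numbers show: the hidden size and the head count grow together, so the slice each head works on stays the same across all three configurations.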



## BERT model inputs and outputs

Just like the vanilla Transformer encoder, BERT takes a sequence of words as input, which keeps flowing up the stack. Each layer applies self-attention, passes the result through a feed-forward network, and then hands it off to the next encoder. The first input token is always a special [CLS] token, which stands for classification (the original purpose of the model).

Each position then outputs a vector of size hidden_size (768 for BERT Base, 1024 for BERT Large). For sentence classification, we focus on the output at the first position, the one that received the [CLS] token. That vector can be used as the input to a classifier of our choosing. The paper achieves great results using just a single-layer neural network as the classifier (the **classification head** of the model).

<img src="images_ib/bert-classifier.png" title="" alt="" width="600" data-align="center">
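A single-layer classification head is just one linear map followed by a softmax. The sketch below assumes a random stand-in for the [CLS] output vector and untrained weights, so the probabilities mean nothing; the names `classify`, `cls_vector`, and `NUM_CLASSES` are illustrative choices and not part of BERT itself.

```python
import math
import random

HIDDEN_SIZE = 768   # BERT Base; it would be 1024 for BERT Large
NUM_CLASSES = 2     # illustrative, e.g. spam / not spam


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(cls_vector, weights, bias):
    """Single-layer classifier: one linear map followed by softmax."""
    logits = [
        sum(w * x for w, x in zip(row, cls_vector)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


rng = random.Random(0)
# Stand-in for the encoder output at the [CLS] position (random, not real BERT output).
cls_vector = [rng.gauss(0, 1) for _ in range(HIDDEN_SIZE)]
# Untrained classifier parameters; fine-tuning would learn these.
weights = [[rng.gauss(0, 0.02) for _ in range(HIDDEN_SIZE)]
           for _ in range(NUM_CLASSES)]
bias = [0.0] * NUM_CLASSES

probs = classify(cls_vector, weights, bias)
print(probs)
```

During fine-tuning, these classifier weights are trained together with the encoder layers underneath them.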



## ELMo: A new age of embedding

These new developments bring a shift in how words are encoded. Up until now, word embeddings have been a major force in how leading NLP models deal with language. Methods like Word2Vec and GloVe have been widely used for such tasks. Let’s recap how those are used before pointing to what has now changed.

### Word embedding recap

For words to be processed by machine learning models, they need some form of numeric representation that models can use in their calculation. Word2Vec showed that we can use a vector (a list of numbers) to properly represent words in a way that captures semantic or meaning-related relationships (e.g. the ability to tell if words are similar, or opposites, or that a pair of words like “Stockholm” and “Sweden” have the same relationship between them as “Cairo” and “Egypt” have between them) as well as syntactic, or grammar-based, relationships (e.g. the relationship between “had” and “has” is the same as that between “was” and “is”).
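The "same relationship" idea can be shown with vector arithmetic. The sketch below uses hand-made 3-dimensional vectors chosen so the analogy works; they are an assumption for illustration only, not real Word2Vec or GloVe values, which are learned and have hundreds of dimensions.

```python
import math

# Hand-made toy vectors (illustration only, not real embeddings).
EMBEDDINGS = {
    "Sweden":    [1.0, 0.0, 0.2],
    "Stockholm": [1.0, 1.0, 0.2],
    "Egypt":     [0.0, 0.0, 1.0],
    "Cairo":     [0.0, 1.0, 1.0],
    "banana":    [0.5, -1.0, 0.1],
}


def cosine(u, v):
    """Cosine similarity: 1 means same direction, -1 means opposite."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c):
    """Return the word closest to vec(b) - vec(a) + vec(c), excluding the inputs."""
    target = [vb - va + vc for va, vb, vc in
              zip(EMBEDDINGS[a], EMBEDDINGS[b], EMBEDDINGS[c])]
    candidates = [w for w in EMBEDDINGS if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, EMBEDDINGS[w]))


# "Sweden is to Stockholm as Egypt is to ...?"
print(analogy("Sweden", "Stockholm", "Egypt"))  # prints Cairo
```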

The field quickly realized it’s a great idea to use embeddings that were pre-trained on vast amounts of text data instead of training them alongside the model on what was frequently a small dataset. So it became possible to download a list of words and their embeddings generated by pre-training with Word2Vec or GloVe.

### ELMo: context matters

If we’re using a GloVe representation, then the word “stick” would be represented by the same vector **no matter what the context was**. However, "stick" has multiple meanings depending on where it is used. Why not give it an embedding based on the context it is used in? And so, contextualized word embeddings were born.

Instead of using a fixed embedding for each word, ELMo looks at the entire sentence before assigning each word in it an embedding. It uses a bi-directional LSTM trained on a specific task to be able to create those embeddings. ELMo provided a significant step towards pre-training in the context of NLP. The ELMo LSTM would be trained on a massive dataset in the language of our dataset, and then we can use it as a component in other models that need to handle language.

#### What is ELMo's secret?

ELMo gained its language understanding from being trained to predict the next word in a sequence of words - a task called **Language Modeling**. This is convenient because we have vast amounts of text data that such a model can learn from **without needing labels**. ELMo actually goes a step further and trains a bi-directional LSTM – so that its language model doesn’t only have a sense of the next word, but also the previous word.

<img src="images_ib/elmo-forward-backward-language-model-embedding.png" title="" alt="" width="600" data-align="center">

ELMo comes up with the contextualized embedding through grouping together the hidden states (and initial embedding) in a certain way (concatenation followed by weighted summation).

<img src="images_ib/elmo-embedding.png" title="" alt="" width="600" data-align="center">
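The "concatenation followed by weighted summation" step can be sketched for a single token as follows. This assumes, as in the ELMo paper, that the per-layer weights are softmax-normalized and that a scalar scale is applied at the end; the function name `elmo_embedding` and the toy numbers are made up for these notes.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def elmo_embedding(forward_layers, backward_layers, layer_scores, scale=1.0):
    """Combine per-layer states for ONE token.

    forward_layers / backward_layers: one vector per layer, where layer 0
    is the initial (context-independent) token embedding.
    layer_scores: one raw score per layer; softmax turns them into weights.
    """
    # Step 1: concatenate forward and backward states, layer by layer.
    concatenated = [f + b for f, b in zip(forward_layers, backward_layers)]
    # Step 2: weighted sum across layers, then an overall scale.
    weights = softmax(layer_scores)
    dim = len(concatenated[0])
    return [
        scale * sum(w * layer[i] for w, layer in zip(weights, concatenated))
        for i in range(dim)
    ]


# Toy states: 3 layers, 2 dimensions per direction. Layer 0 is identical in
# both directions because it is the same initial embedding.
forward = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
backward = [[1.0, 0.0], [0.2, 0.8], [0.4, 0.6]]

embedding = elmo_embedding(forward, backward, layer_scores=[0.0, 0.0, 0.0])
print(embedding)
```

With equal layer scores the result is a plain average over layers; a downstream task would learn the scores so that it can emphasize whichever layers help it most.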

## ULM-FiT: Nailing down transfer learning in NLP

ULM-FiT introduced methods to effectively utilize a lot of what the model learns during pre-training (more than just embeddings or contextualized embeddings). ULM-FiT introduced a language model and a process to effectively fine-tune that language model for various tasks.

## The Transformer: Going beyond LSTMs

The release of the Transformer paper and the results it achieved on tasks such as machine translation led NLP researchers to start thinking of Transformers as a replacement for LSTMs.

The Encoder-Decoder structure of the transformer made it perfect for machine translation. But, how would you use it for sentence classification? How would you use it to pre-train a language model that can be fine-tuned for other supervised-learning tasks (i.e., downstream tasks)?

### OpenAI Transformer: Pre-training a Transformer decoder for language modeling

It turns out we don't need an entire Transformer to adopt transfer learning and a fine-tunable language model for NLP tasks. We can do it with just the decoder of the Transformer. The decoder is a natural choice for language modeling (predicting the next word) because it is built to mask future tokens, a valuable feature when it is generating a translation word by word.

The model stacked twelve decoder layers. Since there is no encoder in this setup, **these decoder layers do not have the encoder-decoder attention sublayer that vanilla Transformer decoder layers have. Each layer still has the self-attention sublayer, but it is masked so that it doesn't peek at future tokens**.
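The masking mentioned above can be sketched as follows: before the softmax, every score that points at a later position is set to negative infinity, so it ends up with zero weight. This is a minimal illustration with made-up uniform scores and a hypothetical function name, not the OpenAI implementation.

```python
import math

NEG_INF = float("-inf")


def causal_attention_weights(scores):
    """Apply a causal mask, then softmax each row.

    scores[i][j] is the raw attention score from position i to position j.
    Positions j > i lie in the future and must receive zero weight.
    """
    weights = []
    for i, row in enumerate(scores):
        masked = [s if j <= i else NEG_INF for j, s in enumerate(row)]
        peak = max(masked)
        exps = [math.exp(s - peak) for s in masked]  # exp(-inf) is 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Uniform raw scores for a 4-token sequence (illustrative values).
scores = [[0.0] * 4 for _ in range(4)]
weights = causal_attention_weights(scores)
for row in weights:
    print([round(w, 2) for w in row])
```

The first token can only attend to itself, while the last token attends to the whole sequence.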

With this structure, we can proceed to train the model on the same language modeling task: predict the next word using massive (unlabeled) datasets. For example, we can feed it the text of 7,000 books and have the OpenAI Transformer predict the next word at each position.

<img src="images_ib/openai-transformer-language-modeling.png" title="" alt="" width="500" data-align="center">
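The reason no labels are needed is that the text supplies its own targets: every prefix of a sentence is an input, and the word that follows is the answer. A minimal sketch, assuming whitespace tokenization for simplicity (real models use subword tokenizers) and a hypothetical helper name:

```python
def next_word_examples(tokens):
    """Return (context, target) pairs for next-word prediction."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


sentence = "the model predicts the next word".split()
examples = next_word_examples(sentence)
for context, target in examples:
    print(" ".join(context), "->", target)
```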

### OpenAI Transformer: Transfer-learning to downstream tasks

Now that the OpenAI transformer is pre-trained and its layers have been tuned to reasonably handle language, we can start using it for downstream tasks. The OpenAI paper outlines a number of input transformations to handle the inputs for different types of tasks. The following image from the paper shows the structures of the models and input transformations to carry out different tasks:


<img src="images_ib/openai-input transformations.png" title="" alt="" width="650" data-align="center">


### BERT: From decoders to encoders

The OpenAI transformer gave us a fine-tunable pre-trained model based on the Transformer. But something went missing in this transition from LSTMs to Transformers. ELMo's language model was bi-directional, but the OpenAI transformer only trains a forward language model. **Could we build a transformer-based model whose language model looks both forward and backwards** (i.e., it is conditioned on both left and right context)?

#### Masked language model

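Finding the right task to train a stack of Transformer encoders is a complex hurdle, which BERT resolves by adopting a "masked language model" concept from earlier literature (where it is called a Cloze task). Some of the input tokens are hidden, and the model must predict them using the context on both sides.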

Beyond masking 15% of the input, BERT also mixes things up a bit to improve how the model later fine-tunes. Sometimes it randomly replaces a word with another word and asks the model to predict the correct word in that position.
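Here is a rough sketch of that corruption step. The 80/10/10 split between [MASK], a random word, and the unchanged word comes from the BERT paper; choosing each token independently with probability 0.15 is a simplification assumed here, and `mask_tokens` plus the tiny vocabulary are made up for illustration.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return (corrupted_tokens, labels).

    labels[i] holds the original token at positions the model must predict
    and None everywhere else.
    """
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= select_prob:
            continue                           # position not selected
        labels[i] = token                      # model must recover this token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK                # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)   # 10%: replace with a random word
        # remaining 10%: leave the token unchanged
    return corrupted, labels


rng = random.Random(0)
vocab = ["stick", "river", "improvise", "walk"]   # toy vocabulary
tokens = "let us stick to improvisation in this skit".split()
corrupted, labels = mask_tokens(tokens, vocab, rng)
print(corrupted)
print(labels)
```

Because some selected positions keep a real word, the model cannot assume that every visible token is already correct. It also sees real words at predicted positions, which matters because [MASK] never appears during fine-tuning.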

#### Two-sentence Tasks

If you look back up at the input transformations the OpenAI transformer does to handle different tasks, you'll notice that some tasks require the model to say something intelligent about two sentences (e.g., are they simply paraphrased versions of each other? Given a Wikipedia entry as input, and a question regarding that entry as another input, can we answer that question?).

To make BERT better at handling relationships between multiple sentences, the pre-training process includes an additional task: Given two sentences (A and B), is B likely to be the sentence that follows A, or not?

<img src="images_ib/bert-next-sentence-prediction.png" title="" alt="" width="700" data-align="center">
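A training example for this task can be sketched like this. The [SEP] separator, the segment ids, and the 50/50 split between true and random next sentences follow the BERT paper; `make_nsp_example` and the toy sentences are hypothetical and assume each sentence is already split into words.

```python
import random


def make_nsp_example(sentences, index, rng):
    """Build one next-sentence-prediction example starting at sentences[index]."""
    sentence_a = sentences[index]
    if rng.random() < 0.5:
        # Positive example: B really is the sentence that follows A.
        sentence_b, is_next = sentences[index + 1], True
    else:
        # Negative example: B is some other sentence from the corpus.
        others = [s for j, s in enumerate(sentences)
                  if j not in (index, index + 1)]
        sentence_b, is_next = rng.choice(others), False
    tokens = ["[CLS]"] + sentence_a + ["[SEP]"] + sentence_b + ["[SEP]"]
    # Segment ids tell the model which sentence each token belongs to.
    segment_ids = [0] * (len(sentence_a) + 2) + [1] * (len(sentence_b) + 1)
    return tokens, segment_ids, is_next


SENTENCES = [
    "the man went to the store".split(),
    "he bought a gallon of milk".split(),
    "penguins are flightless birds".split(),
]

tokens, segment_ids, is_next = make_nsp_example(SENTENCES, 0, random.Random(0))
print(tokens)
print(segment_ids)
print(is_next)
```

The is-next label is predicted from the output at the [CLS] position, the same position used for sentence classification.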

#### Task-specific models

The BERT paper shows a number of ways to use BERT for different tasks:

<img src="images_ib/bert-tasks.png" title="" alt="" width="700" data-align="center">

#### BERT for feature extraction

The fine-tuning approach is not the only way to use BERT. Just like ELMo, you can use the pre-trained BERT to create contextualized word embeddings. Then you can feed these embeddings to your existing model (a process the paper shows yields results not far behind fine-tuning BERT on a task such as named-entity recognition).

<img src="images_ib/bert-contexualized-embeddings.png" title="" alt="" width="700" data-align="center">

Which vector works best as a contextualized embedding? It would depend on the task. The paper examines six choices (compared to the fine-tuned model, which achieved a score of 96.4):

<img src="images_ib/bert-feature-extraction-contextualized-embeddings.png" title="" alt="" width="700" data-align="center">
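Several of these choices are simple ways of combining the per-layer output vectors for one token. The sketch below assumes those vectors are already available as a list (input embedding first, last encoder layer last) and uses toy values in which layer k is filled with the number k; the strategy names are made up for these notes.

```python
def combine_layers(layer_outputs, strategy):
    """Combine per-layer vectors for ONE token into a single feature vector.

    layer_outputs[0] is the input embedding and layer_outputs[-1] is the
    output of the last encoder layer.
    """
    if strategy == "last":
        return list(layer_outputs[-1])
    if strategy == "second_to_last":
        return list(layer_outputs[-2])
    if strategy == "sum_last_four":
        return [sum(values) for values in zip(*layer_outputs[-4:])]
    if strategy == "concat_last_four":
        return [x for layer in layer_outputs[-4:] for x in layer]
    if strategy == "sum_all_encoder_layers":
        return [sum(values) for values in zip(*layer_outputs[1:])]
    raise ValueError(f"unknown strategy: {strategy}")


# Toy data: embedding + 12 encoder layers, each a 3-dimensional vector.
layers = [[float(k)] * 3 for k in range(13)]

for name in ("last", "second_to_last", "sum_last_four", "concat_last_four"):
    features = combine_layers(layers, name)
    print(name, len(features), features[:3])
```

Concatenation multiplies the feature size by four, while summation keeps it at hidden_size, which affects how large the downstream model's input layer has to be.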