# Introduction 

BERT (**B**idirectional **E**ncoder **R**epresentations from **T**ransformers) is a recent paper published by researchers at Google AI Language. It has caused a stir in the Machine Learning community by presenting state-of-the-art results in a wide variety of NLP tasks, including Question Answering (SQuAD v1.1), Natural Language Inference (MNLI), and others. Progress has been rapidly accelerating in machine learning models that process language over the last couple of years. This progress has left the research lab and started powering some of the leading digital products. A great example of this is the recent [announcement of how the BERT model is now a major force behind Google Search](https://www.blog.google/products/search/search-language-understanding-bert/). Google believes this step (or progress in natural language understanding as applied in search) represents “the biggest leap forward in the past five years, and one of the biggest leaps forward in the history of Search”.


BERT’s key technical innovation is **applying the bidirectional training of Transformer**, a popular attention model, to language modelling. This is in contrast to previous efforts which looked at a text sequence either from left to right or combined left-to-right and right-to-left training. The paper’s results show that a language model which is bidirectionally trained can have a deeper sense of language context and flow than single-direction language models. In the paper, the researchers detail a novel technique named **Masked LM (MLM) which allows bidirectional training in models in which it was previously impossible.**

BERT pushes the state of the art in Natural Language Processing by combining two powerful technologies:

- It is based on a deep **Transformer encoder network**, a type of network that can process long texts efficiently by using **self-attention**.

- It is **bidirectional**, meaning that it uses the whole text passage to understand the meaning of each word.

## What is BERT?

BERT is a deep learning model that has given state-of-the-art results on a wide variety of natural language processing tasks. It stands for Bidirectional Encoder Representations from Transformers. It has been pre-trained on Wikipedia and BooksCorpus and requires task-specific fine-tuning.


## Why was BERT needed?

One of the biggest challenges in NLP is the lack of enough training data. Overall there is an enormous amount of text data available, but if we want to create task-specific datasets, we need to split that pile into many diverse fields. When we do this, we end up with only a few thousand or a few hundred thousand human-labeled training examples. Unfortunately, in order to perform well, deep learning based NLP models require much larger amounts of data: they see major improvements when trained on millions, or billions, of annotated training examples. To help bridge this gap in data, researchers have developed various techniques for **training general purpose language representation models using the enormous piles of unannotated text on the web (this is known as pre-training).** These **general purpose pre-trained models can then be fine-tuned on smaller task-specific datasets, e.g., when working with problems like question answering and sentiment analysis.** This approach results in great accuracy improvements compared to training on the smaller task-specific datasets from scratch. BERT is a recent addition to these techniques for **NLP pre-training**; it caused a stir in the deep learning community because it presented state-of-the-art results in a wide variety of NLP tasks, like question answering.

***The best part about BERT is that it can be downloaded and used for free: we can either use the BERT models to extract high quality language features from our text data, or we can fine-tune these models on a specific task, like sentiment analysis or question answering, with our own data to produce state-of-the-art predictions.***


## Transfer Learning vs Feature-Based Training

In the field of computer vision, researchers have repeatedly shown the value of **transfer learning** — pre-training a neural network model on a known task, for instance ImageNet, and then performing fine-tuning — using the trained neural network as the basis of a new purpose-specific model. In recent years, researchers have been showing that a similar technique can be useful in many natural language tasks.

A different approach, which is also popular in NLP tasks and exemplified in the recent ELMo paper, is **feature-based training.** In this approach, a pre-trained neural network produces word embeddings which are then used as features in NLP models.


## What is the core idea behind it? (Bidirectional vs unidirectional)

What is language modeling really about? Which problem are language models trying to solve? Basically, their task is to “fill in the blank” based on context. For example, given

“The woman went to the store and bought a _ of shoes.”

a language model might complete this sentence by saying that the word “cart” would fill the blank 20% of the time and the word “pair” 80% of the time.

In the pre-BERT world, a language model would have looked at this text sequence during training from either left-to-right or combined left-to-right and right-to-left. This one-directional approach works well for generating sentences — we can predict the next word, append that to the sequence, then predict the next to next word until we have a complete sentence.

We usually create a language model by training it on a task that is not the final goal in itself but helps the model develop a contextual understanding of words. More often than not, such tasks involve predicting the next word or words in close vicinity of each other. Such training methods can’t simply be extended to bidirectional models, because that would allow each word to indirectly “see itself”: when the model approaches the same sentence again from the opposite direction, it already knows what to expect. This is a case of data leakage, and in such a situation the model could trivially predict the target word. **Bidirectional models are more powerful than unidirectional language models, but in a multi-layered model naive bidirectional conditioning does not work, because the lower layers leak information and allow a token to see itself in later layers.** Additionally, we couldn’t guarantee that a model trained this way has learnt the contextual meaning of the words rather than just optimizing the trivial predictions. So how does BERT manage to pre-train bidirectionally?

Now enters BERT, a language model which is **bidirectionally trained** (this is also its key technical innovation). This means we can now **have a deeper sense of language context and flow compared to the single-direction language models.**

**Instead of predicting the next word in a sequence, BERT makes use of a novel technique called Masked LM (MLM):** it randomly masks words in the sentence and then it tries to predict them. Masking means that the model looks in both directions and it uses the full context of the sentence, both left and right surroundings, in order to predict the masked word. **Unlike the previous language models, it takes both the previous and next tokens into account at the same time.** *The existing combined left-to-right and right-to-left LSTM based models were missing this “same-time part”.* (It might be more accurate to say that BERT is non-directional though.)

**But why is this non-directional approach so powerful?**

Pre-trained language representations can either be **context-free** or **context-based.** Context-based representations can then be **unidirectional or bidirectional**. Context-free models like word2vec generate a single word embedding representation (a vector of numbers) for each word in the vocabulary. For example, the word “bank” would have the same context-free representation in “bank account” and “bank of the river.” **On the other hand, context-based models generate a representation of each word that is based on the other words in the sentence.** For example, in the sentence “I accessed the bank account,” a unidirectional contextual model would represent “bank” based on “I accessed the” but not “account.” However, BERT represents “bank” using both its previous and next context — “I accessed the … account” — starting from the very bottom of a deep neural network, making it deeply bidirectional.

![image](https://user-images.githubusercontent.com/28102493/107119463-0396fd80-6888-11eb-94ef-3512d9d2486f.png)

## GPT — The distant cousin

Is BERT the only model that is producing these groundbreaking results? No. Another model by OpenAI, called GPT, has been making quite a buzz on the internet.

What many people don’t realize is that these two models have something in common: both reuse a Transformer component. **As stated earlier, BERT stacks the encoder part of the Transformer as its building block. Meanwhile, GPT uses the decoder part of the Transformer as its building block.**


![image](https://user-images.githubusercontent.com/28102493/107124473-6a2a1480-68a4-11eb-9522-c96bc46c0cc4.png)


Note that the bidirectional connections in BERT are due to the encoder’s bidirectional self-attention. Meanwhile, the connections in GPT run in a single direction only, from left to right, because the decoder is designed to prevent looking at future predictions.
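
The difference between the two connection patterns can be sketched as an attention mask. This is only an illustration: the function name `attention_mask` and the 0/1 matrix layout are assumptions made for this example, not code from either model.

```python
def attention_mask(seq_len, causal):
    """Return a seq_len x seq_len matrix where mask[i][j] == 1 means
    token i is allowed to attend to token j."""
    return [
        [1 if (not causal or j <= i) else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]


# Encoder-style (BERT): every token can attend to every other token.
bert_style = attention_mask(4, causal=False)

# Decoder-style (GPT): token i can attend only to positions 0..i.
gpt_style = attention_mask(4, causal=True)

for row in gpt_style:
    print(row)
```

The lower-triangular pattern of `gpt_style` is what makes the decoder left-to-right, while the all-ones `bert_style` matrix corresponds to bidirectional self-attention.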

# How BERT works

Moreover, BERT is based on the Transformer model architecture, instead of LSTMs. We will very soon see the model details of BERT, but in general:

A Transformer works by performing a small, constant number of steps. In each step, it applies an attention mechanism to understand relationships between all words in a sentence, regardless of their respective position. For example, given the sentence,  “I arrived at the bank after crossing the river”, to determine that the word “bank” refers to the shore of a river and not a financial institution, the Transformer can learn to immediately pay attention to the word “river” and make this decision in just one step.
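
A minimal sketch of one attention step, assuming a single head and made-up 2-dimensional token vectors. In a real Transformer the queries, keys and values are learned linear projections of the token representations; here the raw vectors are reused for all three to keep the example short.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention for one head.
    Each argument is a list with one vector per token."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this token's query with every token's key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # New representation: attention-weighted average of the value vectors.
        out = [sum(w * v[d] for w, v in zip(weights, values))
               for d in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Toy vectors (invented for illustration): "bank" and "river" point in similar directions.
tokens = ["bank", "crossing", "river"]
x = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
outputs, weights = self_attention(x, x, x)
print(dict(zip(tokens, weights[0])))  # attention paid by "bank" to each token
```

With these toy vectors, "bank" puts more weight on "river" than on "crossing", and it does so in a single step regardless of how far apart the words are.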


**What is the flow of information of a word in BERT?**

![image](https://user-images.githubusercontent.com/28102493/107126282-36a0b780-68af-11eb-8978-56b7d149b6ca.png)


A word starts with its embedding representation from the embedding layer. Every layer does some multi-headed attention computation on the word representation of the previous layer to create a new intermediate representation. **All these intermediate representations are of the same size.** In the figure above, E1 is the embedding representation, T1 is the final output and Trm are the intermediate representations of the same token. In a 12-layer BERT model, a token will have 12 intermediate representations.


## High Level Overview

BERT makes use of Transformer, an attention mechanism that learns contextual relations between words (or sub-words) in a text. In its vanilla form, Transformer includes two separate mechanisms — an encoder that reads the text input and a decoder that produces a prediction for the task. **Since BERT’s goal is to generate a language model, only the encoder mechanism is necessary.** The detailed workings of Transformer are described in a [paper](https://arxiv.org/pdf/1706.03762.pdf) by Google.

As opposed to directional models, which read the text input sequentially (left-to-right or right-to-left), **the Transformer encoder reads the entire sequence of words at once. Therefore it is considered bidirectional, though it would be more accurate to say that it’s non-directional.** This characteristic allows the model to learn the context of a word based on all of its surroundings (left and right of the word).

The chart below is a high-level description of the Transformer encoder. The input is a sequence of tokens, which are first embedded into vectors and then processed in the neural network. The output is a sequence of vectors of size H, in which each vector corresponds to an input token with the same index.

When training language models, there is a challenge of defining a prediction goal. Many models predict the next word in a sequence (e.g. “The child came home from _ ”), **a directional approach which inherently limits context learning. To overcome this challenge, BERT uses two training strategies:**

1. **Masked LM (MLM)**
1. **Next Sentence Prediction (NSP)**

***When training the BERT model, Masked LM and Next Sentence Prediction are trained together, with the goal of minimizing the combined loss function of the two strategies.***

## Masked LM

Language Modeling is the task of predicting the next word given a sequence of words. In masked language modeling instead of predicting every next token, a percentage of input tokens is masked at random and only those masked tokens are predicted. 

Before feeding word sequences into BERT, 15% of the words in each sequence are replaced with a [MASK] token. The model then attempts to predict the original value of the masked words, based on the context provided by the other, non-masked, words in the sequence. The model tries to predict only these masked tokens, not the entire input sequence. The final hidden vectors corresponding to the masked tokens are then fed into an output softmax over the vocabulary to get the final output words. In technical terms, the prediction of the output words requires:

1. Adding a classification layer on top of the encoder output.
1. Multiplying the output vectors by the embedding matrix, transforming them into the vocabulary dimension.
1. Calculating the probability of each word in the vocabulary with softmax.

![image](https://user-images.githubusercontent.com/28102493/107118455-70f36000-6881-11eb-86dc-9547b651045a.png)
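
The three prediction steps can be sketched in a few lines. This is a toy version under stated assumptions: the 4-word vocabulary, the 2-dimensional vectors and the `tanh` activation are invented for illustration and are not BERT's actual values or layers.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def mlm_head(hidden, dense_w, dense_b, embedding_matrix):
    """Predict a distribution over the vocabulary for one masked position."""
    size = len(hidden)
    # Step 1: classification (dense) layer on top of the encoder output.
    transformed = [
        math.tanh(sum(dense_w[i][j] * hidden[j] for j in range(size)) + dense_b[i])
        for i in range(size)
    ]
    # Step 2: multiply by the embedding matrix to get one score per vocabulary word.
    logits = [sum(e * t for e, t in zip(row, transformed)) for row in embedding_matrix]
    # Step 3: softmax turns the scores into probabilities.
    return softmax(logits)


vocab = ["[MASK]", "pair", "cart", "store"]
embedding_matrix = [[0.0, 0.0], [1.0, 0.5], [0.2, 0.9], [-1.0, 0.3]]
dense_w = [[1.0, 0.0], [0.0, 1.0]]
dense_b = [0.0, 0.0]

hidden_at_mask = [2.0, 0.5]  # encoder output at the masked position (made up)
probs = mlm_head(hidden_at_mask, dense_w, dense_b, embedding_matrix)
best = vocab[probs.index(max(probs))]
print(best)
```

Reusing the embedding matrix for the output projection means the model scores each vocabulary word by how well its embedding matches the transformed hidden vector.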

The BERT loss function takes into consideration only the prediction of the masked values and ignores the prediction of the non-masked words. **As a consequence, the model converges slower than directional models,** a characteristic which is offset by its increased context awareness.
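
A minimal sketch of a loss that counts only the masked positions, assuming the model's per-position probability distributions are already available. The function name `masked_lm_loss` and the toy numbers are assumptions for this example.

```python
import math


def masked_lm_loss(predicted_probs, target_ids, is_masked):
    """Average cross-entropy over masked positions only.
    Non-masked positions are skipped and contribute nothing to the loss."""
    losses = [
        -math.log(probs[target])
        for probs, target, masked in zip(predicted_probs, target_ids, is_masked)
        if masked
    ]
    return sum(losses) / len(losses) if losses else 0.0


# Three positions, vocabulary of size 3; only the middle position was masked.
predicted_probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
]
target_ids = [0, 1, 2]
is_masked = [False, True, False]

loss = masked_lm_loss(predicted_probs, target_ids, is_masked)
print(loss)
```

Because only about 15% of the positions produce a training signal per sequence, each batch teaches the model less than a next-word objective would, which is one way to read the slower convergence.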

Masking, however, creates a mismatch between pre-training and fine-tuning, because the [MASK] token does not appear in the input of most downstream tasks. This is mitigated by a subtle twist in how we mask the input tokens: the selected words are not always replaced with the [MASK] token. **In practice, the BERT implementation is slightly more elaborate and doesn’t replace all of the 15% selected words with [MASK].**

Training the language model in BERT is done by predicting 15% of the tokens in the input, that were randomly picked. These tokens are pre-processed as follows — 80% are replaced with a “[MASK]” token, 10% with a random word, and 10% use the original word. The intuition that led the authors to pick this approach is as follows (Thanks to Jacob Devlin from Google for the insight):

1. If we used [MASK] 100% of the time the model wouldn’t necessarily produce good token representations for non-masked words. The non-masked tokens were still used for context, but the model was optimized for predicting masked words. The model only tries to predict when the [MASK] token is present in the input, while we want the model to try to predict the correct tokens regardless of what token is present in the input.

1. If we used [MASK] 90% of the time and random words 10% of the time, this would teach the model that the observed word is never correct.

1. If we used [MASK] 90% of the time and kept the same word 10% of the time, then the model could just trivially copy the non-contextual embedding.

*No ablation was done on the ratios of this approach, and it may have worked better with different ratios. In addition, the model performance wasn’t tested with simply masking 100% of the selected tokens.*
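
The 80/10/10 rule can be sketched as follows. This is a simplified illustration, not the original preprocessing code: it selects each token independently with probability 0.15 (so the selected share is only approximately 15%), and the function name `mask_tokens` and the small vocabulary are assumptions.

```python
import random


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (model_inputs, labels). labels[i] holds the original token at
    positions selected for prediction and None everywhere else."""
    rng = rng or random.Random()
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        # Special tokens are never selected; other tokens are selected ~15% of the time.
        if tok in ("[CLS]", "[SEP]") or rng.random() >= mask_prob:
            continue
        labels[i] = tok  # this position contributes to the loss
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = "[MASK]"           # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab)  # 10%: replace with a random word
        # remaining 10%: keep the original word unchanged
    return inputs, labels


sentence = ["[CLS]", "the", "woman", "went", "to", "the", "store", "[SEP]"]
vocab = ["the", "woman", "went", "to", "store", "shoes", "pair"]
inputs, labels = mask_tokens(sentence, vocab, rng=random.Random(0))
print(inputs)
print(labels)
```

Since a selected position may still show the original word or a random one, the model cannot rely on seeing [MASK] to know where it will be asked for a prediction.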

## Next Sentence Prediction (NSP)

To understand the relationship between two sentences, the BERT training process also uses next sentence prediction. A pre-trained model with this kind of understanding is relevant for tasks like question answering, which depend on how sentences relate to each other.

In the BERT training process, the model receives pairs of sentences as input and learns to predict if the second sentence in the pair is the subsequent sentence in the original document. During training, 50% of the inputs are a pair in which the second sentence is the subsequent sentence in the original document, while in the other 50% a random sentence from the corpus is chosen as the second sentence. The assumption is that the random sentence will be disconnected from the first sentence.
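
A sketch of how such training pairs could be built, assuming each document is already split into sentences. The function name `make_nsp_examples` is hypothetical, and drawing the random sentence from a different document is a choice made here so that a "NotNext" pair never accidentally contains the true next sentence.

```python
import random


def make_nsp_examples(documents, rng=None):
    """documents: list of documents, each a list of sentences.
    Returns (sentence_a, sentence_b, label) triples."""
    rng = rng or random.Random()
    examples = []
    for doc in documents:
        # Candidate "random" sentences come from all other documents.
        others = [s for d in documents if d is not doc for s in d]
        for i in range(len(doc) - 1):
            if rng.random() < 0.5:
                examples.append((doc[i], doc[i + 1], "IsNext"))
            else:
                examples.append((doc[i], rng.choice(others), "NotNext"))
    return examples


docs = [
    ["a1", "a2", "a3"],  # placeholder sentences of document A
    ["b1", "b2"],        # placeholder sentences of document B
]
for example in make_nsp_examples(docs, rng=random.Random(1)):
    print(example)
```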

To help the model distinguish between the two sentences in training, the input is processed in the following way before entering the model:

1. A **[CLS]** token is inserted at the beginning of the first sentence and a **[SEP]** token is inserted at the end of each sentence. The [CLS] token is used in classification tasks as an aggregate of the entire sequence representation and is ignored in non-classification tasks. For single sentence tasks, the [CLS] token is followed by the WordPiece tokens and the separator token [SEP]. For sentence pair tasks, the WordPiece tokens of the two sentences are separated by another [SEP] token, and the input sequence also ends with a [SEP] token.

1. A **sentence embedding** indicating Sentence A or Sentence B is added to each token. Sentence embeddings are similar in concept to token embeddings with a vocabulary of 2. In order to perform Next Sentence Prediction task properly we need to be able to distinguish between sentences A and B. Fixing the lengths of sentences can be too restrictive and a potential bottleneck for various downstream tasks.

1. A **positional embedding** is added to each token to indicate its position in the sequence. The concept and implementation of positional embedding are presented in the Transformer paper. Because all the inputs are fed in one step, as opposed to RNNs in which inputs are fed sequentially, the model on its own is not able to preserve the ordering of the input tokens. The order of words in every language is significant, both semantically and syntactically, so this information has to be supplied explicitly.

![image](https://user-images.githubusercontent.com/28102493/107118536-ee1ed500-6881-11eb-9510-a57de03a2ce4.png)

**The input representation for BERT: The input embeddings are the sum of the token embeddings, the segmentation embeddings and the position embeddings.** Essentially, the Transformer stacks a layer that maps sequences to sequences, so the output is also a sequence of vectors with a 1:1 correspondence between input and output tokens at the same index.
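
The input format and the embedding sum can be sketched like this. The helper names `build_inputs` and `sum_embeddings`, the tiny vocabulary and the 2-dimensional embedding tables are all assumptions made for illustration; real BERT learns these tables and uses hidden size H.

```python
def build_inputs(tokens_a, tokens_b=None):
    """Add [CLS]/[SEP] and return tokens, segment ids and position ids."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    segment_ids = [0] * len(tokens)               # sentence A, incl. [CLS] and its [SEP]
    if tokens_b:
        tokens += tokens_b + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)  # sentence B and its [SEP]
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids


def sum_embeddings(token_ids, segment_ids, position_ids, tok_table, seg_table, pos_table):
    """Input embedding = token embedding + segment embedding + position embedding."""
    return [
        [t + s + p for t, s, p in zip(tok_table[ti], seg_table[si], pos_table[pi])]
        for ti, si, pi in zip(token_ids, segment_ids, position_ids)
    ]


tokens, segment_ids, position_ids = build_inputs(["my", "dog"], ["he", "play", "##s"])
vocab = {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}
token_ids = [vocab[t] for t in tokens]

# Made-up 2-dimensional embedding tables, just large enough for this example.
tok_table = [[float(i), 0.0] for i in range(len(vocab))]
seg_table = [[0.0, 0.0], [0.0, 1.0]]
pos_table = [[0.0, 0.1 * p] for p in range(len(tokens))]

embedded = sum_embeddings(token_ids, segment_ids, position_ids,
                          tok_table, seg_table, pos_table)
print(tokens)
print(segment_ids)
```

Note that the same word at two different positions, or in two different segments, ends up with a different input vector even before any attention layer is applied.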

To predict if the second sentence is indeed connected to the first, the following steps are performed:

1. The entire input sequence goes through the Transformer model.

1. The output of the [CLS] token is transformed into a 2×1 shaped vector, using a simple **classification layer** (learned matrices of weights and biases).

1. Calculating the probability of IsNextSequence with **softmax**.

## Tokenization strategy

BERT uses **WordPiece tokenization.** The vocabulary is initialized with all the individual characters in the language, and then the most frequent/likely combinations of the existing units in the vocabulary are iteratively added. Any word that does not occur in the vocabulary is broken down into sub-words greedily. For example, if play, ##ing, and ##ed are present in the vocabulary but playing and played are OOV words, then they will be broken down into play + ##ing and play + ##ed respectively. (## marks a sub-word that continues a word.)
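
The greedy splitting of an out-of-vocabulary word can be sketched as a longest-match-first loop. The function name `wordpiece_tokenize` and the `[UNK]` fallback token are assumptions for this example, and the three-entry vocabulary is only the one from the paragraph above.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into sub-words by repeatedly taking the longest
    vocabulary entry that matches at the current position."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1                          # try a shorter candidate
        if piece is None:
            return [unk_token]                # no piece matches: give up on the word
        pieces.append(piece)
        start = end
    return pieces


vocab = {"play", "##ing", "##ed"}
print(wordpiece_tokenize("playing", vocab))
print(wordpiece_tokenize("played", vocab))
```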

The maximum input sequence length is 512 tokens.

## Outputs
How does one predict output for two different tasks simultaneously? The answer is by using different FFNN + Softmax layer built on top of output(s) from the last encoder, corresponding to desired input tokens. We will refer to the outputs from last encoder as final states.
The first input token is always a special classification [CLS] token. The final state corresponding to this token is used as the aggregate sequence representation for classification tasks and for Next Sentence Prediction, where it is fed into a FFNN + Softmax layer that predicts probabilities for the labels “IsNext” or “NotNext”. The output vector for this special [CLS] token does not represent the token itself, but the input as a whole. So it is usually sufficient, and faster, to use only this vector for classification tasks.

The final states corresponding to [MASK] tokens are fed into a FFNN + Softmax layer to predict the masked word from our vocabulary.

## Architecture
Now let’s look at some numbers that none of us will ever remember, but our understanding will feel incomplete without them, so here goes nothing:

- L = Number of layers (i.e., #Transformer encoder blocks in the stack).

- H = Hidden size (i.e. the size of each token’s representation; each attention head works with q, k and v vectors of size H/A).

- A = Number of attention heads.

The paper defines two model sizes (pre-trained checkpoints are also released in variants such as cased and uncased):

1. **BERT-Base:** 12-layer, 768-hidden-nodes, 12-attention-heads, 110M total parameters
1. **BERT-Large:** 24-layer, 1024-hidden-nodes, 16-attention-heads, 340M total parameters
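
The parameter totals can be roughly reproduced from L and H. This back-of-the-envelope sketch assumes a WordPiece vocabulary of 30,522 entries, a feed-forward size of 4H, 512 positions and a pooling layer on top of [CLS]; these values are assumptions not stated in the text above. The number of heads A does not change the count, because the heads only split the same H-sized projections.

```python
def bert_param_count(num_layers, hidden, vocab_size=30522, max_positions=512, type_vocab=2):
    """Approximate number of trainable parameters of a BERT encoder."""
    ffn = 4 * hidden          # assumed feed-forward (intermediate) size
    layer_norm = 2 * hidden   # scale and bias
    # Token, position and segment embedding tables plus one LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value and output projections (weights + biases) plus LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two dense layers (H -> 4H -> H) plus LayerNorm.
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden) + layer_norm
    # Dense layer applied to the [CLS] output.
    pooler = hidden * hidden + hidden
    return embeddings + num_layers * (attention + feed_forward) + pooler


base = bert_param_count(num_layers=12, hidden=768)
large = bert_param_count(num_layers=24, hidden=1024)
print(f"BERT-Base:  {base / 1e6:.1f}M")
print(f"BERT-Large: {large / 1e6:.1f}M")
```

Under these assumptions the estimate lands close to the quoted 110M and 340M figures, and it shows that most parameters sit in the stacked encoder layers rather than in the embeddings.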

Fun fact: BERT-Base was trained on 4 cloud TPUs for 4 days and BERT-Large was trained on 16 TPUs for 4 days!



# How to use BERT (Fine-tuning)

## Transfer Learning

BERT outperformed the state-of-the-art across a wide variety of tasks under general language understanding like natural language inference, sentiment analysis, question answering, paraphrase detection and linguistic acceptability.

Now, how can we fine-tune it for a specific task? BERT can be used for a wide variety of language tasks, and doing so is relatively straightforward, since it only requires **adding a small layer to the core model:**

1. **Classification tasks** such as sentiment analysis are done similarly to Next Sentence classification, by adding a **classification layer** on top of the Transformer output for the **[CLS] token**. The final hidden state of the [CLS] token is taken as the fixed-dimensional pooled representation of the input sequence. This is fed to the classification layer. The classification layer is the **only new parameter** added and has a dimension of **K x H**, where K is the number of classifier labels and H is the size of the hidden state. The label probabilities are computed with a **standard softmax**.
![image](https://user-images.githubusercontent.com/28102493/107126812-63a29980-68b2-11eb-9071-3c515b987f4a.png)


2. The **fine-tuning procedure for sentence pair classification tasks** is the same as for the single sequence classification task. The only difference is in the input representation, where the two sentences are concatenated together. In both Sentence Pair Classification and Single Sentence Classification, the final state corresponding to the [CLS] token is used as input for the additional layer that makes the prediction.
![image](https://user-images.githubusercontent.com/28102493/107126838-846aef00-68b2-11eb-98d0-0e2acb366688.png)


3. In **Question Answering tasks** (e.g. SQuAD v1.1), the software receives a question regarding a text sequence and is required to mark the answer in the sequence. In essence, question answering is just a prediction task: given a question and a context paragraph, the model predicts the start and end tokens of the span in the paragraph that most likely answers the question. Just like in sentence pair tasks, *the question becomes the first sentence* and *the paragraph the second sentence in the input sequence.* However, this time **there are two new parameters learned during fine-tuning: a start vector S and an end vector E, each with size equal to the hidden size.** The probability of token i being the start of the answer span is computed as the dot product between Ti (the final state corresponding to the ith input token) and S, followed by a softmax over all of the tokens in the paragraph, i.e. softmax(S · Ti). The end of the span is handled the same way with E. The score of a candidate span from position i to position j is defined as S·Ti + E·Tj, and the maximum scoring span where j ≥ i is used as the prediction.
![image](https://user-images.githubusercontent.com/28102493/107126945-1bd04200-68b3-11eb-88d2-d44ad529143b.png)


4. In **Named Entity Recognition (NER)**, the software receives a text sequence and is required to mark the various types of entities (Person, Organization, Date, etc.) that appear in the text. Using BERT, a NER model can be trained by feeding **the output vector of each token into a classification layer that predicts the NER label.** In single sentence tagging tasks such as named entity recognition, a tag must be predicted for every word in the input. The final hidden states (the transformer outputs) of all input tokens are fed to the classification layer to get a prediction for every token. Since the WordPiece tokenizer breaks some words into sub-words, only the prediction for the first sub-word of each word is considered.

![image](https://user-images.githubusercontent.com/28102493/107126969-4b7f4a00-68b3-11eb-8f2d-84cd7f167cbe.png)
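
For the question answering case, the span selection rule S·Ti + E·Tj with j ≥ i can be sketched as below. The function name `best_answer_span`, the `max_len` cap and the 2-dimensional toy vectors are assumptions for illustration; in a fine-tuned model Ti comes from the encoder and S and E are learned.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def best_answer_span(token_vectors, start_vec, end_vec, max_len=30):
    """Return ((i, j), score) maximizing S.Ti + E.Tj subject to j >= i."""
    start_scores = [dot(start_vec, t) for t in token_vectors]
    end_scores = [dot(end_vec, t) for t in token_vectors]
    best, best_score = None, float("-inf")
    for i, s in enumerate(start_scores):
        # Only consider ends at or after the start, up to max_len tokens long.
        for j in range(i, min(i + max_len, len(end_scores))):
            score = s + end_scores[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best, best_score


# Toy final states for five paragraph tokens (made up).
token_vectors = [[0.1, 0.0], [2.0, 0.1], [0.3, 0.2], [0.0, 1.5], [0.2, 0.1]]
start_vec = [1.0, 0.0]  # hypothetical learned start vector S
end_vec = [0.0, 1.0]    # hypothetical learned end vector E

span, score = best_answer_span(token_vectors, start_vec, end_vec)
print(span, score)
```

Scoring starts and ends separately and then searching over valid pairs keeps the added parameters down to the two vectors while still ruling out spans that end before they begin.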


**BERT uses two steps, pre-training and fine-tuning, to create state-of-the-art models for a wide range of tasks.**
That means that the same pre-trained model can be fine-tuned for a variety of final tasks that might not be similar to the tasks the model was pre-trained on, and still give close to state-of-the-art results.

![image](https://user-images.githubusercontent.com/28102493/107121115-fc282200-6890-11eb-84cf-840220ded7dd.png)

As you can see, we first train the model on the pre-training tasks simultaneously. Once the pre-training is complete, the same model can be fine-tuned for a variety of downstream tasks. Note that a separate model is fine-tuned for each specific downstream task, so a single pre-trained model can produce multiple task-specific models after fine-tuning.

In the fine-tuning training, most hyper-parameters stay the same as in BERT training, and the paper gives specific guidance (Section 3.5) on the hyper-parameters that require tuning. The BERT team has used this technique to achieve state-of-the-art results on a wide variety of challenging natural language tasks, detailed in Section 4 of the paper. Fine-tuning on various downstream tasks is done by swapping out the appropriate inputs or outputs. **A positive consequence of only adding input/output layers and not changing the BERT model itself is that only a minimal number of parameters need to be learned from scratch, which makes the procedure fast and cost- and resource-efficient.**

1. ***How many layers are frozen in the fine-tuning step?***
    
    No layers are frozen during fine-tuning. All the pre-trained layers along with the task-specific parameters are trained simultaneously.

1. ***Is discriminative fine-tuning used?***
    
    No. All the parameters are tuned with the same learning rate.

1. ***What are the optimal values of the hyperparameters used in fine-tuning?***

    The optimal hyperparameter values are task-specific. But, the authors found that the following range of values works well across all tasks:

    1. Dropout: 0.1
    1. Batch size: 16, 32
    1. Learning rate (Adam): 5e-5, 3e-5, 2e-5
    1. Number of epochs: 3, 4
        
    The authors also observed that large datasets (> 100k labeled samples) are less sensitive to hyperparameter choice than smaller datasets.
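
The suggested ranges are small enough to search exhaustively. A minimal sketch of enumerating that grid follows; the dictionary keys are names chosen for this example and do not correspond to any particular training script.

```python
from itertools import product

# Value ranges quoted above; dropout is kept fixed.
search_space = {
    "batch_size": [16, 32],
    "learning_rate": [5e-5, 3e-5, 2e-5],
    "num_epochs": [3, 4],
}
DROPOUT = 0.1

# One configuration per combination of the values in the search space.
configs = [
    dict(zip(search_space, values), dropout=DROPOUT)
    for values in product(*search_space.values())
]

print(len(configs))
print(configs[0])
```

Each configuration would then be used for one fine-tuning run, and the best one picked on a held-out development set.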

Note that if we want to do fine-tuning, we need to transform our input into the specific format that was used for pre-training the core BERT models. For example, we need to add special tokens to mark the beginning ([CLS]) and the separation/end of sentences ([SEP]), plus segment IDs to distinguish different sentences. In other words, we convert the data into the features that BERT uses.

Just to give you an idea of how fast and efficient it is, the authors claim that all the results in the paper can be replicated in at most 1 hour on a single Cloud TPU, or a few hours on a GPU, starting from the exact same pre-trained model. **The two pre-training objectives allow it to be used on any single sequence and sequence-pair tasks without substantial task-specific architecture modifications.**

![image](https://user-images.githubusercontent.com/28102493/107121713-3f37c480-6894-11eb-9244-3c78d29704ea.png)



## BERT as Feature-based models

A pre-trained model of BERT can also be used for generating text embeddings, similarly to many other feature-based models, such as doc2vec and ELMo. **The paper found that the best embeddings are achieved by concatenating the last four layers of the encoder.** 

BERT is effective for both fine-tuning and feature-based approaches. The authors did ablation studies on the CoNLL-2003 NER task, in which they took the output from one or more layers without fine-tuning and fed them as input to a randomly initialized two-layer 768 dimensional BiLSTM before the classification layer. The best performing model was the one that took representations from the top four hidden layers of the pre-trained transformer.
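
Building such features amounts to concatenating, for each token, its vectors from the top four layers. A small sketch, assuming the per-layer outputs are available as nested lists; the function name `concat_last_four` and the toy values are made up for illustration.

```python
def concat_last_four(layer_outputs):
    """layer_outputs[l][t] is the hidden vector of token t at layer l.
    Returns one feature vector per token, 4x the hidden size."""
    last_four = layer_outputs[-4:]
    num_tokens = len(last_four[0])
    return [
        [x for layer in last_four for x in layer[t]]
        for t in range(num_tokens)
    ]


# Toy stand-in for a 12-layer model: 3 tokens, hidden size 2,
# where every value in layer l is simply the number l.
layer_outputs = [[[float(l)] * 2 for _ in range(3)] for l in range(12)]

features = concat_last_four(layer_outputs)
print(features[0])
```

With a hidden size of 768, each token would get a 3072-dimensional feature vector that a downstream model such as the BiLSTM mentioned above can consume without touching BERT's weights.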

## Sentence Embedding - Sentence encoder

How to use BERT as a sentence encoder?
The final hidden states (the transformer outputs) of the input tokens can be concatenated and / or pooled together to get the encoded representation of a sentence. [bert-as-a-service](https://github.com/hanxiao/bert-as-service) is an open source project that provides BERT sentence embeddings optimized for production. I highly recommend this article – [Serving Google BERT in Production using Tensorflow and ZeroMQ](https://hanxiao.io/2019/01/02/Serving-Google-BERT-in-Production-using-Tensorflow-and-ZeroMQ/).
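
One common pooling choice is to average the token vectors while skipping padding. The sketch below assumes the final hidden states are given as plain lists and that a 0/1 mask marks real tokens; the function name `mean_pool` and the toy numbers are assumptions, and this is not the bert-as-service implementation.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the vectors of real tokens (mask == 1), ignoring padding."""
    kept = [v for v, m in zip(token_vectors, attention_mask) if m]
    return [sum(column) / len(kept) for column in zip(*kept)]


# Two real tokens followed by one padding position (toy values).
token_vectors = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
attention_mask = [1, 1, 0]

sentence_vector = mean_pool(token_vectors, attention_mask)
cls_vector = token_vectors[0]  # alternative: use the first ([CLS]) position only
print(sentence_vector)
```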


## A Visual Guide to Using BERT for the First Time

This section is a simple tutorial for how to use a variant of BERT to classify sentences. This is an example that is basic enough as a first intro, yet advanced enough to showcase some of the key concepts involved.

### Dataset: SST2

The dataset we will use in this example is SST2, which contains sentences from movie reviews, each labeled as either positive (has the value 1) or negative (has the value 0):

![image](https://user-images.githubusercontent.com/28102493/107120142-24615200-688c-11eb-83fb-fac3f0203297.png)


### Models: Sentence Sentiment Classification

Our goal is to create a model that takes a sentence (just like the ones in our dataset) and produces either 1 (indicating the sentence carries a positive sentiment) or a 0 (indicating the sentence carries a negative sentiment). We can think of it as looking like this:

![image](https://user-images.githubusercontent.com/28102493/107120160-4064f380-688c-11eb-8f0c-da4f4c12bbac.png)


Under the hood, the model is actually made up of two models.

1. **DistilBERT** processes the sentence and passes along some information it extracted from it on to the next model. DistilBERT is a smaller version of BERT developed and open sourced by the team at HuggingFace. It’s a lighter and faster version of BERT that roughly matches its performance.
1. The next model, a basic **Logistic Regression model** from scikit-learn, will take in the result of DistilBERT’s processing and classify the sentence as either positive or negative (1 or 0, respectively).

The data we pass between the two models is a vector of size 768. We can think of this vector **as an embedding for the sentence that we can use for classification.** This vector is the output at the first position (which receives the [CLS] token as input).

![image](https://user-images.githubusercontent.com/28102493/107120238-c2551c80-688c-11eb-8b64-66e74be0f541.png)

### Model Training

While we’ll be using two models, we will only train the logistic regression model. For DistilBERT, we’ll use a model that’s already pre-trained and has a grasp of the English language. This model, however, is neither trained nor fine-tuned to do sentence classification. We get some sentence classification capability, however, from the general objectives BERT is trained on. This is especially the case with BERT’s output for the first position (associated with the [CLS] token). I believe that’s due to BERT’s second training objective, next sentence prediction. That objective seemingly trains the model to encapsulate a sentence-wide sense in the output at the first position. The [transformers](https://github.com/huggingface/transformers) library provides us with an implementation of DistilBERT as well as pretrained versions of the model.

![image](https://user-images.githubusercontent.com/28102493/107120297-19f38800-688d-11eb-9e46-551e23852e07.png)


So here’s the game plan with this tutorial. **We will first use the trained DistilBERT to generate sentence embeddings for 2,000 sentences.**

![image](https://user-images.githubusercontent.com/28102493/107120314-414a5500-688d-11eb-84a7-4286b715f227.png)


**We will not touch DistilBERT after this step. It’s all scikit-learn from here. We do the usual train/test split on this dataset:**

![image](https://user-images.githubusercontent.com/28102493/107120333-5de68d00-688d-11eb-9931-4383346dda18.png)

Then we train the logistic regression model on the training set:

![image](https://user-images.githubusercontent.com/28102493/107120361-7fe00f80-688d-11eb-96ba-2c8890fb62e2.png)

### How a single prediction is calculated
Before we dig into the code and explain how to train the model, let’s look at how a trained model calculates its prediction.

Let’s try to classify the sentence “a visually stunning rumination on love”. The first step is to use the BERT tokenizer to split the sentence into tokens. Then, we add the special tokens needed for sentence classification (these are [CLS] at the first position, and [SEP] at the end of the sentence).

![image](https://user-images.githubusercontent.com/28102493/107120384-aef68100-688d-11eb-952c-5b5a7a185c99.png)

The third step the tokenizer does is to replace each token with its id from the embedding table, which is a component we get with the trained model. Read [The Illustrated Word2vec](http://jalammar.github.io/illustrated-word2vec/) for a background on word embeddings.

![image](https://user-images.githubusercontent.com/28102493/107120399-b453cb80-688d-11eb-829f-45e98ef9ce1b.png)

Note that the tokenizer does all these steps in a single line of code:

`tokenizer.encode("a visually stunning rumination on love", add_special_tokens=True)`
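To make those three steps concrete, here is a toy sketch that spells them out by hand. Everything in it is an assumption for illustration: `TOY_VOCAB` is a made-up vocabulary with made-up ids, and `wordpiece` is a simplified greedy longest-match splitter, not the library's actual tokenizer.

```python
# Hypothetical toy vocabulary; the ids are invented for this example.
TOY_VOCAB = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "[UNK]": 3,
             "a": 4, "visually": 5, "stunning": 6, "rum": 7,
             "##ination": 8, "on": 9, "love": 10}

def wordpiece(word, vocab):
    """Split one word into pieces using greedy longest-match-first lookup."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            candidate = word[start:end] if start == 0 else "##" + word[start:end]
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:          # no known piece: treat the word as unknown
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces

def toy_encode(text, vocab=TOY_VOCAB):
    # Step 1: split the sentence into word-piece tokens.
    tokens = [p for word in text.lower().split() for p in wordpiece(word, vocab)]
    # Step 2: add the special tokens used for sentence classification.
    tokens = ["[CLS]"] + tokens + ["[SEP]"]
    # Step 3: replace each token with its id from the vocabulary.
    return tokens, [vocab[t] for t in tokens]

tokens, ids = toy_encode("a visually stunning rumination on love")
print(tokens)
print(ids)
```

The real tokenizer ships with the pre-trained model, so its vocabulary and ids will differ from the toy values used here.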

Our input sentence is now the proper shape to be passed to DistilBERT.

![image](https://user-images.githubusercontent.com/28102493/107120459-0f85be00-688e-11eb-8381-565d90284ed7.png)

### Flowing Through DistilBERT

Passing the input vector through DistilBERT works just like BERT. The output is a vector for each input token. Each vector is made up of 768 numbers (floats).

![image](https://user-images.githubusercontent.com/28102493/107120490-33490400-688e-11eb-8758-7b3dd5f69488.png)

Because this is a sentence classification task, we ignore all except the first vector (the one associated with the [CLS] token). That one vector is what we pass as input to the logistic regression model.

![image](https://user-images.githubusercontent.com/28102493/107120504-48259780-688e-11eb-9ab2-d3dcff22d16f.png)


From here, it’s the logistic regression model’s job to classify this vector based on what it learned from its training phase. We can think of a prediction calculation as looking like this:

![image](https://user-images.githubusercontent.com/28102493/107120524-655a6600-688e-11eb-8188-186c6a4ebaaf.png)
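In code, a trained logistic regression prediction boils down to a weighted sum of the input vector followed by a sigmoid. The sketch below uses a 4-dimensional toy vector and invented weights instead of the real 768-dimensional [CLS] vector and learned parameters, so the numbers are purely illustrative.

```python
import math

def predict_proba(features, weights, bias):
    """Probability of the positive class: sigmoid(w . x + b)."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def predict(features, weights, bias, threshold=0.5):
    """Return 1 (positive) or 0 (negative) by thresholding the probability."""
    return 1 if predict_proba(features, weights, bias) >= threshold else 0

# Invented parameters and input, standing in for learned weights
# and a 768-dimensional [CLS] output vector.
weights = [0.8, -0.4, 0.3, 0.1]
bias = -0.2
cls_vector = [1.0, 0.5, 2.0, -1.0]

probability = predict_proba(cls_vector, weights, bias)
label = predict(cls_vector, weights, bias)
print(round(probability, 3), label)
```

Training is what finds weights and a bias that make these probabilities agree with the labels in the training set.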


### Details

1. Padding: Generally speaking, the model’s input for each sentence must be a 1-dimensional vector of token ids with a length between 3 and 512, containing integers between 0 and the vocabulary size. To process several sentences as one batch, shorter sentences are padded to a common length. Keep in mind that the cost of self-attention grows quadratically with input length, so don’t use the largest size if you don’t have to.
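A minimal sketch of padding, assuming the padding token has id 0: every list of token ids is extended to the length of the longest one, and an attention mask records which positions hold real tokens so the model can ignore the padding. The helper name `pad_batch` and the token ids are made up for illustration.

```python
import numpy as np

def pad_batch(token_id_lists, pad_id=0):
    """Pad lists of token ids to equal length and build the attention mask."""
    max_len = max(len(ids) for ids in token_id_lists)
    padded = np.array(
        [ids + [pad_id] * (max_len - len(ids)) for ids in token_id_lists]
    )
    # 1 marks a real token, 0 marks padding.
    attention_mask = np.array(
        [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in token_id_lists]
    )
    return padded, attention_mask

# Illustrative token ids for two sentences of different lengths.
batch = [[101, 7, 8, 102], [101, 9, 102]]
padded, attention_mask = pad_batch(batch)
print(padded)
print(attention_mask)
```

Padding only to the longest sentence in the batch, rather than to 512, keeps the quadratic attention cost down.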

![image](https://user-images.githubusercontent.com/28102493/107120575-af434c00-688e-11eb-9cc3-9028224ffb3b.png)
![image](https://user-images.githubusercontent.com/28102493/107120612-d13cce80-688e-11eb-8911-e6e1b840eb84.png)
![image](https://user-images.githubusercontent.com/28102493/107120701-ff221300-688e-11eb-8313-a36ae6aa410e.png)

Each row is associated with a sentence from our dataset. To recap the processing path of the first sentence, we can think of it as looking like this:

![image](https://user-images.githubusercontent.com/28102493/107120798-4a3c2600-688f-11eb-9ae2-18061ba69b58.png)


**Slicing the important part**
For sentence classification, we’re only interested in BERT’s output for the [CLS] token, so we select that slice of the cube and discard everything else.
![image](https://user-images.githubusercontent.com/28102493/107120836-8bccd100-688f-11eb-8a72-7f0cd86c488c.png)


We slice that 3d tensor at the first token position to get the 2d tensor we’re interested in. After this step, `features` is a 2d numpy array containing the sentence embeddings of all the sentences in our dataset.

![image](https://user-images.githubusercontent.com/28102493/107120884-bcad0600-688f-11eb-982a-c49d6734cdcd.png)
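The slicing step can be sketched with plain NumPy. The `last_hidden_states` array below is a small random stand-in with shape (sentences, tokens, hidden size), not real model output, and it uses only 5 sentences to keep the example light.

```python
import numpy as np

# Stand-in for the model output: (num_sentences, max_tokens, hidden_size).
rng = np.random.default_rng(0)
last_hidden_states = rng.normal(size=(5, 10, 768))

# Keep every sentence (:), only the first position (0, the [CLS] token),
# and all hidden units (:).
features = last_hidden_states[:, 0, :]

print(last_hidden_states.shape, "->", features.shape)
```

Each row of `features` is now one sentence embedding, which is exactly the shape a scikit-learn classifier expects.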

**Dataset for Logistic Regression**

Now that we have the output of BERT, we have assembled the dataset we need to train our logistic regression model. The 768 columns are the features, and we take the labels from our initial dataset.

![image](https://user-images.githubusercontent.com/28102493/107120891-c6366e00-688f-11eb-88c1-fab135c809a5.png)

We then split the dataset into training and testing sets:

![image](https://user-images.githubusercontent.com/28102493/107120943-f120c200-688f-11eb-8aab-47cccf1d2237.png)
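Put together, the scikit-learn part of the pipeline might look like the sketch below. The `features` and `labels` here are synthetic stand-ins generated with NumPy (not DistilBERT output and not SST2 labels), so any accuracy it prints says nothing about the real task. The `max_iter` and `random_state` values are arbitrary choices for this example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-ins: 200 "sentence embeddings" of size 768 with 0/1 labels.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=200)
features = rng.normal(size=(200, 768)) + 0.5 * labels[:, None]

# The usual train/test split (by default 75% train, 25% test).
train_features, test_features, train_labels, test_labels = train_test_split(
    features, labels, random_state=0
)

# Train the logistic regression model on the training set only.
lr_clf = LogisticRegression(max_iter=1000)
lr_clf.fit(train_features, train_labels)

# Evaluate on the held-out test set.
accuracy = lr_clf.score(test_features, test_labels)
predictions = lr_clf.predict(test_features)
print(accuracy)
```

With real embeddings, the only change would be where `features` and `labels` come from; the split, fit, and score steps stay the same.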



# Conclusion

BERT is undoubtedly a breakthrough in the use of Machine Learning for Natural Language Processing. **The fact that it’s approachable and allows fast fine-tuning will likely allow a wide range of practical applications in the future.** In this summary, we attempted to describe the main ideas of the paper while not drowning in excessive technical details. For those wishing for a deeper dive, we highly recommend reading the full article and ancillary articles referenced in it. Another useful reference is the BERT source code and models, which cover 103 languages and were generously released as open source by the research team.


## Takeaways

1. **Model size matters**, even at huge scale. BERT_large, with 340 million parameters, was the largest model of its kind when the paper was published. It is demonstrably superior on small-scale tasks to BERT_base, which uses the same architecture with “only” 110 million parameters.

1. With **enough training data, more training steps == higher accuracy.** For instance, on the MNLI task, the BERT_base accuracy improves by 1.0% when trained for 1M steps (with a batch size of 128,000 words) compared to 500K steps with the same batch size.

1. **BERT’s bidirectional approach (MLM) converges more slowly** than left-to-right approaches (because only 15% of words are predicted in each batch), but bidirectional training still outperforms left-to-right training after a small number of pre-training steps.

    ![image](https://user-images.githubusercontent.com/28102493/107118904-3c34d800-6884-11eb-83a4-3810091bd6e7.png)

    ![image](https://user-images.githubusercontent.com/28102493/107118925-5ff81e00-6884-11eb-8ec1-fa015655d47f.png)

1. BERT is a stack of Transformer encoders.

1. It has two phases — pre-training and fine-tuning.

1. Pre-training is computationally and time intensive.

1. It is, however, independent of the task it finally does, so the same pre-trained model can be used for a lot of tasks.

1. GPT is not that different from BERT; it is a stack of Transformer decoders.

1. There are many variants of BERT out there.
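To make the takeaway about slower MLM convergence concrete, here is a simplified sketch of masking 15% of the tokens in a sequence. It always substitutes `[MASK]`, whereas the paper sometimes keeps the selected token or swaps in a random one; the function name `mask_tokens`, the fixed seed, and the placeholder tokens are assumptions for illustration.

```python
import random

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Replace a random subset of tokens with [MASK] and record the targets."""
    rng = random.Random(seed)
    num_to_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), num_to_mask))
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]   # what the model must predict
        masked[pos] = "[MASK]"
    return masked, targets

# Placeholder sequence of 20 tokens.
tokens = [f"w{i}" for i in range(20)]
masked, targets = mask_tokens(tokens)
print(masked)
print(targets)
```

Only the 3 masked positions out of 20 contribute a prediction here, whereas a left-to-right model gets a training signal from every position, which is why MLM needs more steps to converge.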

## The BERT Family

It wouldn’t be the 21st century if we didn’t take something that works well and try to recreate or modify it. The BERT architecture is no different. These are some of its most popular variants:

1. **[ALBERT](https://arxiv.org/abs/1909.11942)** by Google and others: this paper describes parameter reduction techniques to lower memory consumption and increase the training speed of BERT models.

1. **[RoBERTa](https://arxiv.org/abs/1907.11692)** by Facebook: this paper from FAIR argues that the original BERT models were under-trained and shows that, with more training and tuning, they can outperform the initial results.

1. **[ERNIE](https://arxiv.org/abs/1904.09223)**: Enhanced Representation through Knowledge Integration, by Baidu. It is inspired by the masking strategy of BERT and learns language representations enhanced by knowledge masking strategies, which include entity-level masking and phrase-level masking.

1. **[DistilBERT](https://arxiv.org/abs/1910.01108)**: a smaller version of BERT produced through model distillation, developed and open sourced by the team at HuggingFace. It’s a lighter and faster version of BERT that roughly matches its performance.

1. **[Clinical BERT](https://arxiv.org/abs/1904.03323)** – BERT pre-trained on a corpus of clinical notes, an example of a BERT model pre-trained on a domain-specific corpus.

1. **[SciBERT](https://arxiv.org/abs/1903.10676)** – contextualized embeddings pre-trained on scientific text, another BERT model pre-trained on a domain-specific corpus.

## BERT Implementations

1. TensorFlow
1. PyTorch
1. Keras