### Open Sourcing BERT: State-of-the-Art Pre-training for Natural Language Processing
#### BERT - (Bidirectional Encoder Representations from Transformers)

<img src="https://www.kaggle.com/ashishpatel26/tensorflow-tutorial-data/downloads/GOOGLEBERT.jpg" height="100" width="800"></img>

#### Outline of this notebook
---
1. [**Introduction**](#Introduction)
2. [**How to understand the BERT model**](#How-to-understand-the-BERT-model)
3. [**BERT model analysis**](#BERT-model-analysis)
      1. [**BERT model architecture**](#BERT-model-architecture)
      1. [**Key innovations: pre-training tasks**](#Key-innovations:-pre-training-tasks)
      1. [**Experimental result**](#Experimental-result)
4. [**Impact of the BERT model**](#Impact-of-the-BERT-model)
5. [**Demonstration of BERT model**](#Demostration-of-BERT-model)
6. [**References**](#References)

---

### Introduction
Google recently made headlines. BERT, the model newly released by the company's AI team, showed striking results on SQuAD 1.1, a top-level machine reading comprehension benchmark: **it surpasses human performance on both metrics, and it also achieved the best results on 11 different NLP tasks**, including pushing the GLUE benchmark to 80.4% (an absolute improvement of 7.6%) and MultiNLI accuracy to 86.7% (an absolute improvement of 5.6%). It seems likely that BERT will be a milestone for NLP, and it is arguably the most important recent development in the field.

![](https://pic1.xuehuaimg.com/proxy/baijia/https://f12.baidu.com/it/u=574361345,3381935096&fm=173&app=25&f=JPEG?w=550&h=477&s=04B26C33111E55CE4CF555DE000080B1&access=215967316)

> **Thang Luong of the Google team** put it bluntly: ***The BERT model opens a new era of NLP!***

![](https://img-blog.csdn.net/20181021135254746?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzM5NTIxNTU0/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)

* Judging by the current trend, **pre-training a language model and then reusing it seems to be a reliable approach.** From **AI2's ELMo**, to **OpenAI's fine-tuned Transformer, to Google's BERT, all are applications of pre-trained language models.**

The **BERT model** differs from the other two in two ways:
1. When training the bidirectional language model, it **replaces a small fraction of words with [MASK] or, with lower probability, with another random word.** This forces the model to rely more heavily on the context.
2. It adds a **loss for predicting the next sentence.**

**The BERT model has the following two characteristics:**
* First, the model is **deep rather than wide: BERT BASE has 12 layers with a hidden size of 768, and BERT LARGE has 24 layers with a hidden size of 1024.** This seems to **support a view that is also held in computer vision: deep, narrow models tend to beat shallow, wide ones.**
* Second, the **MLM (Masked Language Model)** uses the words on both the left and the right at the same time. Using both directions already appeared in **ELMo, so it is not entirely original.** The ***application of masking to language models was proposed earlier by Ziang Xie et al.: [[1703.02573] - Data Noising as Smoothing in Neural Network Language Models](https://arxiv.org/abs/1703.02573).*** That paper has a star-studded author list: Sida Wang, Jiwei Li (founder and CEO of Shannon Technology and a highly prolific NLP author), Andrew Ng, and Dan Jurafsky are all co-authors. It is a pity that the BERT authors did not pay attention to this paper. I believe that using its noising method for masking might improve BERT further.

### How to understand the BERT model
---
### [1] What problem does BERT have to solve?
* A **Transformer model usually has many parameters to train.** For example, ***the BERT BASE model has L=12, H=768, A=12, and about 110M trainable parameters in total.*** (The product 12 * 768 * 12 is only 110,592; the 110M figure is the total reported in the paper and comes mostly from the embedding table and the per-layer weight matrices.) **Training this many parameters naturally requires a large training corpus.** If all of the training data had to be produced by hand, **the labor cost would be prohibitive.**
* Inspired by the paper **"A Neural Probabilistic Language Model", BERT** also uses an **unsupervised approach to train the Transformer model.** ***That paper on neural probabilistic language models mainly addresses two questions:***

1. Can a word vector express the semantics of a natural-language word?
2. How do you find the right numerical vector for each word?

![](https://img.ctolib.com/uploadImg/20181208/20181208112241_105.jpg)

* This paper is brilliant: profound yet accessible, concise yet comprehensive. Classic papers reward careful reading. Many readers are already familiar with it, so its content is not repeated here. As an illustration, suppose there are **50 commonly used Hindi characters, which combine into a vocabulary of as many as 200,000 words. If the dimension of the word vector is 512, then the language model has at least 512 * 200,000 = 102.4M parameters.**
* Such a large number of parameters requires a large training corpus. **Where do you collect such a massive corpus?** The paper ***"A Neural Probabilistic Language Model" points out that every article is naturally a training corpus. Does it need to be labeled manually? No, it does not.***

![](https://img-blog.csdn.net/20181021135434193?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzM5NTIxNTU0/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)

* We often say that speech should be orderly, fluent, and coherent, which means that the words in a context should fit together semantically. Relying on this coherence of natural language, a language model predicts the next word from the words that precede it. If the model's parameters are right and the word vector of each word is set correctly, its predictions should be accurate. There are countless articles in the world, so the training data is practically inexhaustible.
* The four elements of deep learning are: **1. Training data, 2. Model, 3. Computing power, 4. Application.**
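
To make the idea that every article is its own training corpus concrete, here is a minimal sketch (not the method from the paper) that turns raw text into (context, next word) training pairs without any manual labels. The function name `next_word_pairs`, the whitespace tokenizer, and the context size of 2 are assumptions chosen purely for illustration.

```python
def next_word_pairs(text, context_size=2):
    """Build (context, target) pairs from raw text; the text labels itself."""
    words = text.lower().split()  # naive whitespace tokenizer (an assumption)
    pairs = []
    for i in range(context_size, len(words)):
        context = tuple(words[i - context_size:i])  # the preceding words
        target = words[i]                           # the word to predict
        pairs.append((context, target))
    return pairs


pairs = next_word_pairs("the cat sat on the mat")
for context, target in pairs:
    print(context, "->", target)
```

Each pair uses the words that actually follow in the text as the label, which is why no human annotation is needed.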

### [2] What do the five keywords of BERT mean: Pre-training, Deep, Bidirectional, Transformers, and Language Understanding?

* The language model in the paper **"A Neural Probabilistic Language Model" is strictly speaking a language generation model (Language Generative Model),** which predicts the next word that will appear in a sentence. Can a language generation model be applied directly to other NLP problems?
* For example, there are many user comments on Quora, Amazon, or Flipkart. Can each comment be converted into a rating of -2, -1, 0, 1, or 2, where -2 is very poor and +2 is excellent? Suppose a user writes, "I bought the same shirt a pop idol wears, but on me it looks less like an idol's outfit and more like a chef's uniform." Does this comment correspond to -2 or to some other rating?
* Can a language generation model solve this kind of problem well? Going further, is there a "universal" language model that understands the semantics of language and applies to all kinds of NLP problems? The title of the BERT paper is straightforward: **"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding."**
* There are **five keywords in this title: *Pre-training, Deep, Bidirectional, Transformers, and Language Understanding***. **Pre-training** means that the **authors believe a general-purpose language model exists: first pre-train the general model on unlabeled articles, then fine-tune it with supervised training data to fit a specific application. To distinguish it from a language model used for generation, the authors call this general-purpose model a Language Representation Model.**

![](https://img-blog.csdn.net/2018102114002264?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzM5NTIxNTU0/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)

* Many models could achieve the goal of language representation. **Which one should be used?** The authors propose the **Deep Bidirectional Transformers model.** Take the sentence **"a model that can achieve the [mask] of language representation",** in which the word **"goal"** has been covered. You can **predict [mask]** from front to back, that is, using **"a/model/that/can/achieve/the" to predict [mask]; or you can predict [mask]** from back to front, that is, using **"of/language/representation" to predict [mask]. Each of these is called unidirectional prediction.** Unidirectional prediction cannot fully capture the semantics of the whole sentence, so researchers tried bidirectional prediction: combine the front-to-back and back-to-front predictions, [mask1/mask2]. This is bidirectional prediction. See "Neural Machine Translation by Jointly Learning to Align and Translate" for details.

* The authors of **BERT argue that this kind of bidirectional prediction still does not fully capture the semantics of the whole sentence.** A ***better approach is to predict [mask] from the full context in all directions at once, that is, to use "a/model/that/can/achieve/the/.../of/language/representation" to predict [mask]. The BERT authors call this full-context prediction deep bidirectional.*** How can full-context prediction be implemented? The BERT authors recommend the Transformer model, which was introduced in the paper "Attention Is All You Need".

* At the **heart of this model is the attention mechanism.** For a **single sentence, multiple points of attention can be active at the same time, without being limited to sequential processing from front to back or from back to front. It is necessary not only to choose the right model structure but also to train the model's parameters correctly, so that the model can accurately capture the semantics of a sentence.** **BERT takes two steps** to train the parameters of the model properly.

The **first step** is to cover 15% of the words in an article and have the model predict the covered words from the full context. If there are **10,000 articles, each with an average of 100 words, and 15% of the words are covered at random, the model's task is to correctly predict the 150,000 covered words. The parameters of the Transformer model are first trained through this full-context prediction of covered words**. The **second step** then continues training the parameters. For example, from the same **10,000 articles, 200,000 sentence pairs are selected, for a total of 400,000 sentences. Of these, 100,000 pairs (2 * 100,000 sentences) are two consecutive sentences from the same context, and the other 100,000 pairs (2 * 100,000 sentences) are not consecutive. The Transformer model is then asked to decide which of the 200,000 pairs are consecutive and which are not.**

These two steps together are called **pre-training.** The trained Transformer model, **including its parameters, is the general-purpose language representation model the authors were aiming for.**
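
The attention mechanism mentioned above can be illustrated with a minimal, pure-Python sketch of single-head scaled dot-product attention, the building block described in "Attention Is All You Need". This is not BERT's implementation: it has no learned projections and no multiple heads, and the names `softmax` and `attention` as well as the toy vectors are assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Single-head scaled dot-product attention on lists of vectors.

    Every query scores every key, so each position can draw on the whole
    sentence at once, regardless of direction.
    """
    dim = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        mixed = [sum(w * v[j] for w, v in zip(weights, values))
                 for j in range(len(values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Three toy 2-dimensional token vectors used as queries, keys, and values.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(tokens, tokens, tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row is one token's attention distribution over all tokens in the sentence, including those to its right.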


In [1]:
import sys

!test -d bertviz_repo || git clone https://github.com/jessevig/bertviz bertviz_repo
if 'bertviz_repo' not in sys.path:
    sys.path.append('bertviz_repo')  # make the cloned package importable
# !pip install torch

Cloning into 'bertviz_repo'...
remote: Enumerating objects: 74, done.
remote: Counting objects: 100% (74/74), done.
remote: Compressing objects: 100% (51/51), done.
remote: Total 74 (delta 28), reused 60 (delta 18), pack-reused 0
Unpacking objects: 100% (74/74), done.


In [2]:
from bertviz import attention, visualization
from bertviz.pytorch_pretrained_bert import BertModel, BertTokenizer

In [3]:
%%javascript
require.config({
  paths: {
      d3: '//cdnjs.cloudflare.com/ajax/libs/d3/3.4.8/d3.min'
  }
});

<IPython.core.display.Javascript object>

In [4]:
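def call_html():
    # Load require.js and register d3/jquery so the attention widget can render in this output cell.
    from IPython.display import HTML, display
    display(HTML('''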
        <script src="/static/components/requirejs/require.js"></script>
        <script>
          requirejs.config({
            paths: {
              base: '/static/base',
              "d3": "https://cdnjs.cloudflare.com/ajax/libs/d3/3.5.8/d3.min",
              jquery: '//ajax.googleapis.com/ajax/libs/jquery/2.0.0/jquery.min',
            },
          });
        </script>
        '''))

In [None]:
bert_version = 'bert-base-uncased'
model = BertModel.from_pretrained(bert_version)
tokenizer = BertTokenizer.from_pretrained(bert_version)
sentence_a = "I went to the store."
sentence_b = "At the store, I bought fresh strawberries."
attention_visualizer = visualization.AttentionVisualizer(model, tokenizer)
tokens_a, tokens_b, attn = attention_visualizer.get_viz_data(sentence_a, sentence_b)
call_html()
attention.show(tokens_a, tokens_b, attn)

### BERT model analysis
---
Original paper : [**BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding**](https://arxiv.org/abs/1810.04805)
![](https://img-blog.csdn.net/20181021135554835?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzM5NTIxNTU0/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)

* **BERT** is a new language representation model whose name stands for **Bidirectional Encoder Representations from Transformers.** Unlike other recent language representation models, BERT is designed to **pre-train deep bidirectional representations by jointly conditioning on both left and right context in all layers.** As a result, the pre-trained BERT representations can be fine-tuned with just one additional output layer to build state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.
* The authors argue that **existing techniques severely restrict the power of pre-trained representations.** The **main limitation is that standard language models are unidirectional, which limits the choice of architectures that can be used during pre-training.**
* In the paper, the authors **improve the fine-tuning-based approach by proposing BERT: Bidirectional Encoder Representations from Transformers.**
* **BERT proposes a new pre-training objective:** the ***masked language model (MLM)***, which overcomes the unidirectionality constraint mentioned above. MLM was inspired by the Cloze task (Taylor, 1953). The masked language model randomly masks some of the input tokens, and the objective is to predict the original vocabulary id of each masked word based only on its context.
* Unlike **left-to-right language model pre-training,** the MLM objective **allows the representation to fuse the left and the right context**, ***which makes it possible to pre-train a deep bidirectional Transformer. In addition to the masked language model, the authors also introduce a "next sentence prediction" task that jointly pre-trains text-pair representations.***

#### The main contributions of the paper are:
* It demonstrates the importance of **bidirectional pre-training for language representations**. Unlike **earlier work that pre-trained unidirectional language** models, **BERT uses a masked language model to enable pre-trained deep bidirectional representations.**
* It shows that **pre-trained representations eliminate the need for many heavily engineered task-specific architectures**. BERT is the ***first fine-tuning-based representation model that achieves state-of-the-art performance on a large suite of sentence-level and token-level tasks, outperforming many systems with task-specific architectures.***
* BERT advances the state of the art on **11 NLP tasks**. The paper also reports ablation studies of BERT, which indicate that the **bidirectional nature of the model** is its single most important new contribution. The code and pre-trained models are to be made available at goo.gl/language/bert.

The records on the **11 natural language processing tasks that BERT** set include: pushing the *GLUE benchmark to 80.4% (an absolute improvement of 7.6%), MultiNLI accuracy to 86.7% (an absolute improvement of 5.6%), and the SQuAD v1.1 question answering test F1 to 93.2 (an absolute improvement of 1.5 points), which is 2.0 points above human performance.*

**_GLUE_**
---
[GLUE](https://gluebenchmark.com/) is a collection of natural language tasks that includes the following data sets:

![](https://img-blog.csdn.net/20181015110555402?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3RyaXBsZW1lbmc=/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)

<div class="table-box"><table>
<thead>
<tr>
<th>Name</th>
<th>Full name</th>
<th>Task</th>
</tr>
</thead>
<tbody>
<tr>
<td>MNLI</td>
<td>Multi-Genre NLI</td>
<td>Entailment inference</td>
</tr>
<tr>
<td>QQP</td>
<td>Quora Question Pairs</td>
<td>Whether two questions are equivalent</td>
</tr>
<tr>
<td>QNLI</td>
<td>Question NLI</td>
<td>Whether the sentence answers the question</td>
</tr>
<tr>
<td>SST-2</td>
<td>Stanford Sentiment Treebank</td>
<td>Sentiment analysis</td>
</tr>
<tr>
<td>CoLA</td>
<td>Corpus of Linguistic Acceptability</td>
<td>Whether the sentence is linguistically acceptable</td>
</tr>
<tr>
<td>STS-B</td>
<td>Semantic Textual Similarity</td>
<td>Semantic similarity</td>
</tr>
<tr>
<td>MRPC</td>
<td>Microsoft Research Paraphrase Corpus</td>
<td>Whether the sentence pair is semantically equivalent</td>
</tr>
<tr>
<td>RTE</td>
<td>Recognizing Textual Entailment</td>
<td>Entailment inference</td>
</tr>
<tr>
<td>WNLI</td>
<td>Winograd NLI</td>
<td>Entailment inference</td>
</tr>
</tbody>
</table></div>

![](https://img-blog.csdn.net/20181020114138800?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3RyaXBsZW1lbmc=/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)

### BERT model architecture

The BERT model architecture is a **multi-layer bidirectional Transformer** encoder based on the original implementation described in Vaswani et al. (2017) and released in the tensor2tensor library. Because the use of Transformers has recently become ubiquitous and the paper's implementation is effectively identical to the original, a detailed description of the model structure is omitted here.

In this work, the paper denotes the number of layers (i.e., Transformer blocks) as L, the hidden size as H, and the number of self-attention heads as A. In all cases, the feed-forward/filter size is set to 4H, which is 3072 for H=768 and 4096 for H=1024. The paper mainly reports results for two model sizes:

***BERT BASE:*** L=12, H=768, A=12, Total Parameters=110M  
***BERT LARGE :*** L=24, H=1024, A=16, Total Parameters=340M  
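
The totals above can be roughly reproduced from L and H. The sketch below is a back-of-the-envelope count, not code from the paper. It assumes the layout of the publicly released uncased checkpoints: a 30,522-entry WordPiece vocabulary, 512 positions, 2 segment types, biases on every dense layer, LayerNorm after the embeddings and after each sub-layer, and a pooling layer on top of \[CLS\]. The function name `bert_param_count` is hypothetical, and the pre-training output heads are not counted.

```python
def bert_param_count(num_layers, hidden, vocab_size=30522, max_positions=512, num_segments=2):
    """Approximate trainable parameters of a BERT encoder (weights + biases)."""
    layer_norm = 2 * hidden  # scale and shift vectors
    embeddings = (vocab_size + max_positions + num_segments) * hidden + layer_norm
    attention = 4 * (hidden * hidden + hidden)  # query, key, value, output projections
    ff_size = 4 * hidden                        # feed-forward size is 4H
    feed_forward = (hidden * ff_size + ff_size) + (ff_size * hidden + hidden)
    per_layer = attention + feed_forward + 2 * layer_norm
    pooler = hidden * hidden + hidden           # dense layer on top of [CLS]
    return embeddings + num_layers * per_layer + pooler


base = bert_param_count(num_layers=12, hidden=768)
large = bert_param_count(num_layers=24, hidden=1024)
print(f"BERT BASE:  {base / 1e6:.1f}M parameters")
print(f"BERT LARGE: {large / 1e6:.1f}M parameters")
```

Under these assumptions the count comes to about 109M for BERT BASE and about 335M for BERT LARGE, close to the reported 110M and 340M. The number of heads A does not appear in the formula because the heads only split the hidden size H among themselves.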

For comparison purposes, the paper chose ***BERT BASE*** to have the same model size as OpenAI GPT. Critically, however, the BERT Transformer uses bidirectional self-attention, while the GPT Transformer uses constrained self-attention in which each token can only attend to the context to its left. The research team notes that in the literature the bidirectional Transformer is often referred to as a " **Transformer encoder** ", while the left-context-only version is referred to as a " **Transformer decoder** " because it can be used for text generation. A comparison between BERT, OpenAI GPT and ELMo is shown in Figure 1.

Figure 1: Differences in pre-training model architectures. BERT uses a bidirectional Transformer. OpenAI GPT uses a left-to-right Transformer. ELMo uses the concatenation of independently trained left-to-right and right-to-left LSTMs to generate features for downstream tasks. Of the three, only the BERT representations are jointly conditioned on both left and right context in all layers.
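
The difference between bidirectional and left-to-right self-attention comes down to which positions each token is allowed to attend to. The following minimal sketch builds the two attention masks as plain nested lists (1 = allowed, 0 = blocked). The function names and the example sentence are assumptions for illustration and do not come from either model's code.

```python
def bidirectional_mask(n):
    """BERT-style: every token may attend to every token (1 = allowed)."""
    return [[1] * n for _ in range(n)]


def left_to_right_mask(n):
    """GPT-style: token i may attend only to positions 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


tokens = ["the", "man", "went", "home"]
for name, mask in [("bidirectional", bidirectional_mask(len(tokens))),
                   ("left-to-right", left_to_right_mask(len(tokens)))]:
    print(name)
    for token, row in zip(tokens, mask):
        print(f"  {token:>5} {row}")
```

In the left-to-right mask the row for "the" allows only itself, while in the bidirectional mask every row allows the full sentence.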

**Input representation**

The paper's **input representation** can unambiguously represent either a single text sentence or a pair of text sentences (for example, \[Question, Answer\]) in one token sequence. For a given token, its input representation is constructed by summing the corresponding **token**, **segment**, and **position embeddings**. Figure 2 is a visual representation of the input representation:

Figure 2: BERT input representation. The input embedding is the sum of the token embeddings, the segment embeddings, and the position embeddings.

![](https://img-blog.csdn.net/20181021135717183?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzM5NTIxNTU0/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)

The details are as follows:

*   **WordPiece embeddings** (Wu et al., 2016) are used with a vocabulary of 30,000 tokens. Split word pieces are marked with ##.
*   **Learned positional embeddings** are used, with supported sequence lengths of up to 512 tokens.
*   The first token of every sequence is always the special classification embedding (\[CLS\]). The final hidden state corresponding to this token (i.e., the output of the Transformer) is used as the aggregate sequence representation for classification tasks. For non-classification tasks, this vector is ignored.
*   Sentence pairs are packed together into a single sequence. The sentences are differentiated in two ways. First, they are separated with a special token (\[SEP\]). Second, a learned sentence A embedding is added to every token of the first sentence and a learned sentence B embedding to every token of the second sentence.
*   For single-sentence inputs, only the sentence A embedding is used.
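
The packing rules in the list above can be sketched in a few lines. This is an illustrative helper, not the reference implementation: the name `build_bert_input` is hypothetical, the inputs are assumed to be already split into WordPiece tokens, and segment ids 0 and 1 stand in for the sentence A and sentence B embeddings.

```python
def build_bert_input(tokens_a, tokens_b=None, max_len=512):
    """Pack one or two tokenized sentences into BERT-style input sequences."""
    tokens = ["[CLS]"] + list(tokens_a) + ["[SEP]"]
    segment_ids = [0] * len(tokens)               # sentence A embedding index
    if tokens_b is not None:
        tokens += list(tokens_b) + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)  # sentence B embedding index
    if len(tokens) > max_len:
        raise ValueError("sequence exceeds the maximum supported length")
    position_ids = list(range(len(tokens)))       # index into learned position embeddings
    return tokens, segment_ids, position_ids


tokens, segment_ids, position_ids = build_bert_input(
    ["my", "dog", "is", "cute"], ["he", "likes", "play", "##ing"])
for row in zip(tokens, segment_ids, position_ids):
    print(row)
```

The model would then look up one embedding per column (token, segment, position) and sum the three vectors for each position.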

**Key innovations: pre-training tasks**
---------------------------------------

Unlike Peters et al. (2018) and Radford et al. (2018), the paper does not use traditional left-to-right or right-to-left language models to pre-train BERT. Instead, BERT is pre-trained using two new unsupervised prediction tasks.

**Task 1: Masked LM**

Intuitively, the research team has reason to believe that a deep bidirectional model is strictly more powerful than either a left-to-right model or the shallow concatenation of a left-to-right and a right-to-left model. Unfortunately, standard conditional language models can only be trained left-to-right or right-to-left, because bidirectional conditioning would allow each word to indirectly "see itself" in a multi-layered context.

To train a **deep bidirectional representation**, the research team takes the straightforward approach of masking some percentage of the input tokens at random and then predicting only those masked tokens. The paper refers to this procedure as a "masked LM" (MLM), although it is often referred to in the literature as a Cloze task (Taylor, 1953).

In this case, the final hidden vectors corresponding to the masked tokens are fed into an output softmax over the vocabulary, as in a standard LM. In all of the team's experiments, 15% of the WordPiece tokens in each sequence are masked at random. In contrast to denoising autoencoders (Vincent et al., 2008), only the masked words are predicted rather than reconstructing the entire input.

While this does allow the team to obtain a bidirectional pre-trained model, the approach has two drawbacks. First, it creates a mismatch between pre-training and fine-tuning, because the \[MASK\] token is never seen during fine-tuning. To mitigate this, the team does not always replace the "masked" words with the actual \[MASK\] token. Instead, the **training data generator chooses 15% of the tokens at random**. For example, in the sentence "my dog is hairy", suppose it chooses "hairy". Rather than always replacing the chosen word with \[MASK\], the data generator then does the following:

*   80% of the time: replace the word with the \[MASK\] token, for example, my dog is hairy → my dog is \[MASK\]
*   10% of the time: replace the word with a random word, for example, my dog is hairy → my dog is apple
*   10% of the time: keep the word unchanged, for example, my dog is hairy → my dog is hairy. The purpose of this is to bias the representation towards the actual observed word.

The Transformer encoder does not know which words it will be asked to predict or which have been replaced by random words, so it is forced to keep a distributional contextual representation of every input token. In addition, because random replacement occurs for only 1.5% of all tokens (i.e., 10% of 15%), this does **not** seem to **harm the model's language understanding capability**.

The second drawback of using an MLM is that only 15% of the tokens are predicted in each batch, which suggests that the model may need more pre-training steps to converge. The team showed that the MLM converges marginally slower than a left-to-right model (which predicts every token), but the empirical improvements of the MLM model far outweigh the increased training cost.
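
The 80% / 10% / 10% masking rule described in this task can be sketched as follows. This is a simplified illustration rather than the data generator from the paper: it selects each token independently with probability 0.15 instead of picking a fixed 15% of each sequence, it does not protect special tokens such as \[CLS\], and the function name `mask_tokens` and the toy vocabulary are assumptions.

```python
import random


def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Apply the 80/10/10 masking rule; return (corrupted tokens, labels)."""
    corrupted = list(tokens)
    labels = {}  # position -> original token the model must predict
    for i, token in enumerate(tokens):
        if rng.random() >= select_prob:
            continue  # not selected: left unchanged and not predicted
        labels[i] = token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = "[MASK]"           # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random word
        # remaining 10%: keep the original word
    return corrupted, labels


rng = random.Random(0)  # fixed seed so the run is repeatable
vocab = ["my", "dog", "is", "hairy", "apple", "cat", "runs"]
tokens = ["my", "dog", "is", "hairy"] * 25  # 100 toy tokens
corrupted, labels = mask_tokens(tokens, vocab, rng)
print(f"selected {len(labels)} of {len(tokens)} tokens for prediction")
```

Only the positions recorded in `labels` contribute to the loss, which is why each batch yields a training signal from roughly 15% of its tokens.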

**Task 2: Next sentence prediction**

Many important downstream tasks, such as Question Answering (QA) and Natural Language Inference (NLI), are based on understanding the relationship between two sentences, which is not directly captured by language modeling.

To train a model that understands sentence relationships, the team pre-trains on a binarized next sentence prediction task that can be trivially generated from any monolingual corpus. Specifically, when choosing the sentences A and B for each pre-training example, 50% of the time B is the actual next sentence that follows A, and 50% of the time it is a random sentence from the corpus. For example:

Input = \[CLS\] the man went to \[MASK\] store \[SEP\] he bought a gallon \[MASK\] milk \[SEP\]

Label = IsNext

Input = \[CLS\] the man \[MASK\] to the store \[SEP\] penguin \[MASK\] are flight ##less birds \[SEP\]

Label = NotNext

The NotNext sentences are chosen completely at random, and the final pre-trained model achieves **97%-98% accuracy** on this task.
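
Generating the IsNext / NotNext examples requires nothing but raw documents. The sketch below is a simplified illustration, not the generator used in the paper: the function name `make_nsp_examples` is hypothetical, each document is assumed to be a list of sentences, at least two documents are required, and the random sentence is always drawn from a different document so that it cannot accidentally be the true next sentence.

```python
import random


def make_nsp_examples(documents, rng):
    """Create (sentence_a, sentence_b, label) triples for next sentence prediction."""
    examples = []
    for doc_idx, doc in enumerate(documents):
        for i in range(len(doc) - 1):
            sentence_a = doc[i]
            if rng.random() < 0.5:
                # 50%: B is the sentence that really follows A
                sentence_b, label = doc[i + 1], "IsNext"
            else:
                # 50%: B is a random sentence from another document
                others = [j for j in range(len(documents)) if j != doc_idx]
                sentence_b = rng.choice(documents[rng.choice(others)])
                label = "NotNext"
            examples.append((sentence_a, sentence_b, label))
    return examples


documents = [
    ["the man went to the store", "he bought a gallon of milk", "then he walked home"],
    ["penguins are flightless birds", "they live in the southern hemisphere"],
]
for example in make_nsp_examples(documents, random.Random(0)):
    print(example)
```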


**Experimental result**
-----------------------

As mentioned earlier, BERT set new performance records on 11 NLP tasks. This section presents the experimental results of BERT on these tasks. For the specific experimental setups and comparisons, please read the original paper.

![](https://img-blog.csdn.net/20181021135953777?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzM5NTIxNTU0/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)

Figure: The task-specific models are formed by combining BERT with one additional output layer, so only a minimal number of parameters needs to be learned from scratch. Among these tasks, (a) and (b) are sequence-level tasks, and (c) and (d) are token-level tasks. In the figure, E denotes the input embedding, Ti denotes the contextual representation of token i, \[CLS\] is the special symbol for classification output, and \[SEP\] is the special symbol for separating non-consecutive token sequences.

![](https://img-blog.csdn.net/20181021135810611?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzM5NTIxNTU0/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)

Figure: GLUE test results, scored by the GLUE evaluation server. The number below each task indicates the number of training examples. The “Average” column differs slightly from the official GLUE score because the problematic WNLI set is excluded. The BERT and OpenAI GPT results are single-model, single-task. All results are from [https://gluebenchmark.com/leaderboard](https://gluebenchmark.com/leaderboard) and [https://blog.openai.com/language-unsupervised/](https://blog.openai.com/language-unsupervised/)

![](https://img-blog.csdn.net/20181021135826274?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzM5NTIxNTU0/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)

Figure: SQuAD results. The BERT ensemble is 7 systems that use different pre-training checkpoints and fine-tuning seeds.

![](https://img-blog.csdn.net/20181021135853817?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzM5NTIxNTU0/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)

Figure: CoNLL-2003 named entity recognition results. The hyperparameters were selected using the development set, and the reported development and test scores are averaged over 5 random restarts using those hyperparameters.

### Impact of the BERT model
---

BERT is a language representation model trained with very large data, a huge model, and enormous computational cost. It achieves state-of-the-art (SOTA) results on 11 natural language processing tasks. You may have already guessed where this model comes from: yes, it is from Google. Many people will probably grumble that experiments of this scale are out of reach for ordinary laboratories and researchers, but the work does offer a lot of valuable lessons:

1.  **Deep learning is representation learning** : "We show that pre-trained representations eliminate the needs of many heavily engineered task-specific architectures". In most of the 11 tasks on which BERT set new records, only a linear output layer is added on top of the fine-tuned pre-trained representation. In sequence labeling tasks (e.g. NER), even the dependencies between output labels are ignored (i.e. non-autoregressive and no CRF), and BERT still comfortably beats the previous SOTA, which shows how strong its representation learning is.
2.  **Scale matters** : "One of our core claims is that the deep bidirectionality of BERT, which is enabled by masked LM pre-training, is the single most important improvement of BERT compared to previous work". Applying a mask in a language model is nothing new to many people, but the BERT authors did verify its powerful representation learning ability at this scale of data, model, and computing power. The same may hold for many other models that were proposed and tested by different laboratories earlier: because of limited scale, their potential was never fully explored, and they were unfortunately drowned in the torrent of papers.
3.  **Pre-training is important** : "We believe that this is the first work to demonstrate that scaling to extreme model sizes also leads to large improvements on very small-scale tasks, provided that the model has been sufficiently pre-trained". Pre-training is already widely used in many fields (e.g. ImageNet for CV, Word2Vec in NLP), mostly with large models trained on large data. Whether such large models also bring big improvements on small-scale tasks is a question the authors answer here. BERT's pre-training is done with a Transformer, but I suspect the performance would not differ much with an LSTM or GRU. The parallelism of the training computation, of course, is another matter.

### Demostration of BERT model
 
0\. The high performance is really due to two factors. Besides the improvements to the model, what matters more is the use of a large dataset (BooksCorpus, 800M words, plus English Wikipedia, 2,500M words) and a large amount of computing power (matching the oversized model) for pre-training on related tasks, which yields a monotonic increase in performance on the target tasks.

1\. The **bidirectional model is different from ELMo**. Most people **misjudge how much the novelty of its bidirectionality contributes**. I think this detail may be why BERT is significantly better than **ELMo**. **ELMo concatenates a left-to-right model and a right-to-left model**, whereas BERT **opens a window directly during training, in the manner of a sequential CBOW.**

2\. **Poor reproducibility**: with enough money you can do anything (Reddit discusses the price of training BERT).

```
For TPU pods:
4 TPUs * ~$2/h (preemptible) * 24 h/day * 4 days = $768 (base model)
16 TPUs = ~$3k (large model)

For TPU:
16 tpus * $8/hr * 24 h/day * 4 days = 12k
64 tpus * $8/hr * 24 h/day * 4 days = 50k
```
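
The estimates quoted above are simple products of device count, hourly price, and running time. The short sketch below redoes that arithmetic; the prices and durations are the ones quoted in the Reddit discussion and are not verified here, and the helper name `training_cost` is hypothetical.

```python
def training_cost(num_devices, dollars_per_device_hour, days, hours_per_day=24):
    """Total rental cost in dollars for running all devices continuously."""
    return num_devices * dollars_per_device_hour * hours_per_day * days


print(training_cost(4, 2, 4))    # 4 preemptible TPUs for 4 days: 768
print(training_cost(16, 8, 4))   # 16 TPUs at $8/hr for 4 days: 12288
print(training_cost(64, 8, 4))   # 64 TPUs at $8/hr for 4 days: 49152
```

The last two results match the rounded "12k" and "50k" figures in the quoted estimate.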
 

Finally, for GPUs, the post asks: "**BERT-Large is 24-layer, 1024-hidden and was trained for 40 epochs over a 3.3 billion word corpus. So maybe 1 year to train on 8 P100s?**" That is quite an interesting question.

### References
---

1. [Zhihu: How to evaluate Google's latest BERT model](https://www.zhihu.com/question/298203515)

2. [Wall Street News: NLP History Breakthrough](https://wallstreetcn.com/articles/3419427)

3. [OPENAI-Improving Language Understanding with Unsupervised Learning](https://blog.openai.com/language-unsupervised/)

4. [https://gluebenchmark.com/leaderboard](https://gluebenchmark.com/leaderboard)