## BERT

The year 2018 marked a turning point for the field of Natural Language Processing, with a series of deep-learning models achieving state-of-the-art results on NLP tasks ranging from question answering to sentiment classification. Most recently, Google’s BERT algorithm has emerged as a sort of “one model to rule them all,” based on its superior performance over a wide variety of tasks.

BERT builds on two key ideas that have been responsible for many of the recent advances in NLP: (1) the transformer architecture and (2) unsupervised pre-training. The transformer is a sequence model that forgoes the recurrent structure of RNNs in favor of a fully attention-based approach, as described in the paper Attention Is All You Need. BERT is also pre-trained: its weights are learned in advance through two unsupervised tasks, masked language modeling (predicting a missing word given the left and right context) and next sentence prediction (predicting whether one sentence follows another). Thus BERT doesn’t need to be trained from scratch for each new task; rather, its weights are fine-tuned. For more details about BERT, check out The Illustrated BERT.
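
To make the masked language modeling objective more concrete, here is a minimal sketch of how one training example might be produced. The function name `mask_tokens`, its parameters, and the fixed seed are illustrative choices, not part of any BERT codebase. The sketch is also simplified: the actual pre-training procedure sometimes substitutes a random token or keeps the original word instead of always inserting `[MASK]`.

```python
import random

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Hide a random subset of tokens behind [MASK] and record the answers.

    Simplified sketch: every selected position becomes [MASK].
    """
    if not tokens:
        return [], {}
    rng = random.Random(seed)
    n_to_mask = min(len(tokens), max(1, round(len(tokens) * mask_rate)))
    positions = sorted(rng.sample(range(len(tokens)), n_to_mask))
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = masked[pos]  # the word the model has to recover
        masked[pos] = "[MASK]"
    return masked, targets

tokens = "i went to the store and bought some milk".split()
masked, targets = mask_tokens(tokens)
print(masked)   # the input the model would see
print(targets)  # position -> original word, used as the training label
```

Because the model sees the whole masked sequence at once, it can use words on both sides of each gap, which is what "left and right context" refers to.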

### BERT is a (multi-headed) beast

BERT differs from traditional attention models, which use a flat attention structure over the hidden states of an RNN. Instead, BERT uses multiple layers of attention (12 or 24 depending on the model), and also incorporates multiple attention “heads” in every layer (12 or 16). Since model weights are not shared between layers, a single BERT model effectively has up to 24 x 16 = 384 different attention mechanisms.
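
The arithmetic can be checked in a few lines. The dictionary below pairs the layer and head counts quoted above into the two published model sizes; the helper name `count_attention_heads` is just for illustration.

```python
# (number of layers, attention heads per layer) for the two model sizes
configs = {"BERT-Base": (12, 12), "BERT-Large": (24, 16)}

def count_attention_heads(num_layers, heads_per_layer):
    """Total number of distinct attention heads when no weights are shared."""
    return num_layers * heads_per_layer

for name, (layers, heads) in configs.items():
    print(name, count_attention_heads(layers, heads))
# BERT-Base 144
# BERT-Large 384
```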



### What does BERT actually learn?

#### Pattern 1: Attention to next word

In this pattern, most of the attention at a particular position is directed to the next token in the sequence. Below we see an example of this for layer 2, head 0. (The selected head is indicated by the highlighted square in the color bar at the top.) The figure on the left shows the attention for all tokens, while the one on the right shows the attention for one selected token (“i”). In this example, virtually all of the attention is directed to “went,” the next token in the sequence.

<img src="images/1_EPiYy22Tox5wTSWHsCg60Q.jpeg"/>

Pattern 1: Attention to next word. Left: attention weights for all tokens. Right: attention weights for selected token (“i”)


On the left, we can see that the [SEP] token disrupts the next-token attention pattern, as most of the attention from [SEP] is directed to [CLS] rather than the next token. Thus this pattern appears to operate primarily within each sentence.

This pattern is related to the backward RNN, where state updates are made sequentially from right to left. Pattern 1 appears over multiple layers of the model, in some sense emulating the recurrent updates of an RNN.
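
One rough way to quantify such a positional pattern for a single head is to average the weight each token places on its neighbor at a fixed offset. The sketch below assumes the attention weights are available as a square list of rows; the function name `offset_attention_mass` and the toy matrix are made up for illustration and are not taken from a real BERT head.

```python
def offset_attention_mass(attention, offset):
    """Average weight that position i places on position i + offset.

    `attention[i][j]` is the weight from token i to token j; each row is
    assumed to sum to 1. Positions without such a neighbor are skipped.
    """
    n = len(attention)
    weights = [attention[i][i + offset] for i in range(n) if 0 <= i + offset < n]
    return sum(weights) / len(weights) if weights else 0.0

# Hand-made weights for 4 tokens (illustrative only).
toy_attention = [
    [0.05, 0.90, 0.03, 0.02],
    [0.02, 0.03, 0.90, 0.05],
    [0.03, 0.02, 0.05, 0.90],
    [0.70, 0.10, 0.10, 0.10],  # the last token has no successor
]

next_mass = offset_attention_mass(toy_attention, +1)
prev_mass = offset_attention_mass(toy_attention, -1)
print(round(next_mass, 2), round(prev_mass, 2))
```

A value close to 1 for offset +1 would indicate a head that behaves like Pattern 1; a negative offset measures attention to earlier positions in the same way.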

#### Pattern 2: Attention to previous word

In this pattern, much of the attention is directed to the previous token in the sentence. For example, most of the attention for “went” is directed to the previous word “i” in the figure below. The pattern is not as distinct as the last one; some attention is also dispersed to other tokens, especially the [SEP] tokens. Like Pattern 1, this is loosely related to a sequential RNN, in this case the forward RNN.

<img src="images/1_6y97rGDQRnnxfkyw-a2png.jpeg">

Pattern 2: Attention to previous word. Left: attention weights for all tokens. Right: attention weights for selected token (“went”)

#### Pattern 3: Attention to identical/related words

In this pattern, attention is paid to identical or related words, including the source word itself. In the example below, most of the attention for the first occurrence of “store” is directed to itself and to the second occurrence of “store”. This pattern is not as distinct as some of the others, with attention dispersed over many different words.

<img src="images/1_GsrsVlaMMc_U_dGVNJ0xmg.jpeg"/>

Pattern 3: Attention to identical/related tokens. Left: attention weights for all tokens. Right: attention weights for selected token (“store”)

#### Pattern 4: Attention to identical/related words in other sentence

In this pattern, attention is paid to identical or related words in the other sentence. For example, most of the attention for “store” in the second sentence is directed to “store” in the first sentence. One can imagine this being particularly helpful for the next sentence prediction task (part of BERT’s pre-training), because it helps identify relationships between sentences.

<img src="images/1_Zcu9TBZaxyAIGigR_jGnfA.jpeg"/>

Pattern 4: Attention to identical/related words in other sentence. Left: attention weights for all tokens. Right: attention weights for selected token (“store”)

#### Pattern 5: Attention to other words predictive of word

In this pattern, attention seems to be directed to other words that are predictive of the source word, excluding the source word itself. In the example below, most of the attention from “straw” is directed to “##berries”, and most of the attention from “##berries” is focused on “straw”.

<img src="images/1_OPL0NDQJWh611mveG-Gulg.jpeg"/>

Pattern 5: Attention to other words predictive of word. Left: attention weights for all tokens. Right: attention weights for selected token (“##berries”)

This pattern isn’t as distinct as some of the others. For example, much of the attention is directed to a delimiter token ([CLS]), which is the defining characteristic of Pattern 6 discussed next.

#### Pattern 6: Attention to delimiter tokens

In this pattern, most of the attention is directed to the delimiter tokens, either the [CLS] token or the [SEP] tokens. In the example below, most of the attention is directed to the two [SEP] tokens. This may be a way for the model to propagate sentence-level state to the individual tokens.

<img src="images/1_1weap1WGmkTsjk5feb457g.jpeg"/>

Pattern 6: Attention to delimiter tokens. Left: attention weights for all tokens. Right: attention weights for selected token (“store”)
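
As a rough check for this pattern, one could measure what share of a token's attention lands on the delimiters. The sketch below is only an illustration: the helper `delimiter_attention_share`, the short token sequence, and the weights are invented and do not come from the figure.

```python
DELIMITERS = {"[CLS]", "[SEP]"}

def delimiter_attention_share(tokens, attention_row):
    """Fraction of one token's attention that goes to [CLS] or [SEP]."""
    total = sum(attention_row)
    to_delims = sum(w for tok, w in zip(tokens, attention_row) if tok in DELIMITERS)
    return to_delims / total if total else 0.0

tokens = ["[CLS]", "i", "went", "[SEP]", "i", "bought", "milk", "[SEP]"]
# Hand-made attention weights from one source token to every token.
row = [0.02, 0.03, 0.05, 0.45, 0.02, 0.03, 0.05, 0.35]

share = delimiter_attention_share(tokens, row)
print(round(share, 2))
```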

#### keras-bert model 

In [13]:
from keras_bert import get_base_dict, get_model, gen_batch_inputs

# A toy input example
sentence_pairs = [
    [['all', 'work', 'and', 'no', 'play'], ['makes', 'jack', 'a', 'dull', 'boy']],
    [['from', 'the', 'day', 'forth'], ['my', 'arm', 'changed']],
    [['and', 'a', 'voice', 'echoed'], ['power', 'give', 'me', 'more', 'power']],
]

# Build token dictionary
token_dict = get_base_dict()  # A dict that contains some special tokens
for pairs in sentence_pairs:
    for token in pairs[0] + pairs[1]:
        if token not in token_dict:
            token_dict[token] = len(token_dict)
token_list = list(token_dict.keys())  # Used for selecting a random word


# Build the model (the training step is not shown here)
model = get_model(
    token_num=len(token_dict),
    head_num=5,
    transformer_num=12,
    embed_dim=25,
    feed_forward_dim=100,
    seq_len=20,
    pos_num=20,
    dropout_rate=0.05,
)
model.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
Input-Token (InputLayer)        (None, 20)           0                                            
__________________________________________________________________________________________________
Input-Segment (InputLayer)      (None, 20)           0                                            
__________________________________________________________________________________________________
Embedding-Token (TokenEmbedding [(None, 20, 25), (28 700         Input-Token[0][0]                
__________________________________________________________________________________________________
Embedding-Segment (Embedding)   (None, 20, 25)       50          Input-Segment[0][0]              
__________________________________________________________________________________________________
Embedding-

In [1]:
# coding: utf-8
# !wget https://raw.githubusercontent.com/google-research/bert/master/tokenization.py
# !pip install keras-bert
import sys
import numpy as np
from keras_bert import load_trained_model_from_checkpoint
import tokenization
from tachles import convert_sentence_to_token_mode1, create_input_mask_mode1, create_phrase_mask_mode1
from tachles import convert_sentence_to_token_mode2, create_input_mask_mode2, create_phrase_mask_mode2

Using TensorFlow backend.


Download the pre-trained model

In [2]:
folder = 'multi_cased_L-12_H-768_A-12'
download_url = 'https://storage.googleapis.com/bert_models/2018_11_23/multi_cased_L-12_H-768_A-12.zip'
print('Downloading model...')
zip_path = '{}.zip'.format(folder)
!test -d $folder || (wget $download_url && unzip $zip_path)

Downloading model...


In [4]:
config_path = folder+'/bert_config.json'
checkpoint_path = folder+'/bert_model.ckpt'
vocab_path = folder+'/vocab.txt'

Create a tokenizer that converts a whitespace-separated string into tokens

In [5]:
tokenizer = tokenization.FullTokenizer(vocab_file=vocab_path, do_lower_case=False)
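
For intuition about where pieces such as “##berries” come from, here is a simplified sketch of greedy longest-match-first sub-word splitting, the general idea behind WordPiece tokenization. The function `wordpiece` and the tiny `toy_vocab` are hypothetical stand-ins; the real tokenizer reads its vocabulary from `vocab.txt` and also handles punctuation, casing, and other details omitted here.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into the longest vocabulary pieces, left to right.

    Pieces that continue a word are stored in the vocabulary with a
    leading "##".
    """
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # try a shorter piece
        if piece is None:
            return [unk]  # no piece fits: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

toy_vocab = {"straw", "##berries", "store", "i", "went"}
print(wordpiece("strawberries", toy_vocab))  # ['straw', '##berries']
print(wordpiece("store", toy_vocab))         # ['store']
```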

Load the model

In [6]:
print('Loading model...')
model = load_trained_model_from_checkpoint(config_path, checkpoint_path, training=True)

Loading model...
Instructions for updating:
Colocations handled automatically by placer.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.


### MODE 1: Predicting words hidden by the [MASK] token

The phrase must be fed to the network in the format: [CLS] I came to [MASK] and bought [MASK]. [SEP]

In [7]:
sentence = 'I came to [MASK] and bought [MASK].'
tokens, token_input = convert_sentence_to_token_mode1(sentence, vocab_path)
mask_input = create_input_mask_mode1(token_input)
seg_input = create_phrase_mask_mode1()
token_input = np.asarray([token_input])
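
The helpers imported from `tachles` are not shown in this notebook, so the following is only a guess at the kind of work they do: wrap the tokens in `[CLS]` and `[SEP]`, map them to ids, pad to a fixed length, and flag the `[MASK]` positions. The function `build_mlm_inputs`, the toy vocabulary, and the sequence length of 16 are all assumptions made for this sketch.

```python
def build_mlm_inputs(tokens, vocab, seq_len=16):
    """Return (token_ids, segment_ids, mask_flags), each of length seq_len."""
    tokens = ["[CLS]"] + tokens + ["[SEP]"]
    if len(tokens) > seq_len:
        raise ValueError("phrase is longer than seq_len")
    pad = seq_len - len(tokens)
    token_ids = [vocab.get(t, vocab["[UNK]"]) for t in tokens] + [vocab["[PAD]"]] * pad
    segment_ids = [0] * seq_len  # a single phrase uses one segment
    mask_flags = [1 if t == "[MASK]" else 0 for t in tokens] + [0] * pad
    return token_ids, segment_ids, mask_flags

# Toy vocabulary (hypothetical ids, not those of the downloaded vocab.txt)
toy_vocab = {t: i for i, t in enumerate(
    ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]",
     "I", "came", "to", "and", "bought", "."])}

tokens = ["I", "came", "to", "[MASK]", "and", "bought", "[MASK]", "."]
token_ids, segment_ids, mask_flags = build_mlm_inputs(tokens, toy_vocab)
print(token_ids)
print(mask_flags)
```

The mask flags tell the later prediction step which positions to read the predicted words from.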

Make prediction

In [8]:
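# Output 0 of the model holds the masked-word predictions
predicts = model.predict([token_input, seg_input, mask_input])[0]
predicts = np.argmax(predicts, axis=-1)  # most likely vocabulary id per position
predicts = predicts[0][:len(tokens)]     # keep only the positions of the real tokens
# Collect the predictions at the [MASK] positions
out = []
for i in range(len(mask_input[0])):
    if mask_input[0][i] == 1:
        out.append(predicts[i])
out = tokenizer.convert_ids_to_tokens(out)
out = ', '.join(out)
out = tokenization.printable_text(out)
out = out.replace(' ##', '')  # glue WordPiece continuations back onto the previous piece
print('Result:', out)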

('Result:', 'me, them')


### MODE 2: Checking whether the second phrase logically follows the first

The two phrases must be fed to the network in the format: [CLS] I came to the store. [SEP] And bought milk. [SEP]

In [9]:
sentence_1 = 'I came to the store.'
sentence_2 = 'And bought milk.'
token_input, tokens_sen_1, tokens_sen_2 = convert_sentence_to_token_mode2(sentence_1, sentence_2, vocab_path)
mask_input = create_input_mask_mode2()
seg_input = create_phrase_mask_mode2(tokens_sen_1, tokens_sen_2)
token_input = np.asarray([token_input])
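
For a sentence pair, the segment input is what tells the model which tokens belong to which phrase. Since the `tachles` helpers are not shown, the sketch below is an assumption about what such a helper might compute: segment id 0 for `[CLS]`, the first phrase and its `[SEP]`, segment id 1 for the second phrase and its `[SEP]`, and 0 for padding. The function name `build_pair_segments` and the sequence length of 16 are illustrative.

```python
def build_pair_segments(tokens_a, tokens_b, seq_len=16):
    """Return (padded_tokens, segment_ids) for a [CLS] A [SEP] B [SEP] input."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    if len(tokens) > seq_len:
        raise ValueError("sentence pair is longer than seq_len")
    pad = seq_len - len(tokens)
    first_len = len(tokens_a) + 2   # [CLS] + first phrase + [SEP]
    second_len = len(tokens_b) + 1  # second phrase + [SEP]
    segment_ids = [0] * first_len + [1] * second_len + [0] * pad
    return tokens + ["[PAD]"] * pad, segment_ids

tokens_a = ["I", "came", "to", "the", "store", "."]
tokens_b = ["And", "bought", "milk", "."]
padded_tokens, segment_ids = build_pair_segments(tokens_a, tokens_b)
for tok, seg in zip(padded_tokens, segment_ids):
    print(seg, tok)
```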

Make prediction

In [10]:
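# Output 1 of the model holds the next-sentence prediction
predicts = model.predict([token_input, seg_input, mask_input])[1]
out = int(round(predicts[0][0] * 100))  # score that the second phrase follows the first, as a percentage
print('Sentence is okay:', out, '%')

('Sentence is okay:', 99, '%')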
