# BERT

In this notebook, you will learn how to use the Transformer introduced in [Vaswani et al., 2017]. You will learn how to load a pretrained Transformer model and evaluate it on `newstest2014`. In addition, you will be able to translate a few sentences yourself with the `BeamSearchTranslator`.

## Preparation

We start with the usual preparation: importing libraries and setting up the environment.


In [1]:
import random
import math

import numpy as np
import mxnet as mx
from mxnet import gluon, nd
import gluonnlp as nlp

np.random.seed(100)
random.seed(100)
mx.random.seed(10000)
ctx = mx.gpu(0)

from utils import PositionalEncoding, MultiHeadAttention, AddNorm, PositionWiseFFN

The position-wise feed-forward network accepts a 3-dimensional input with shape (batch size, sequence length, feature size). It consists of two dense layers that apply to the last dimension only. The same two dense layers are therefore used for every position in the sequence, which is why the network is called position-wise.
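
The position-wise behaviour can be checked with a small NumPy sketch. It is not the `PositionWiseFFN` imported from `utils` (whose implementation is not shown here); the function `position_wise_ffn`, the ReLU activation between the two layers, and the toy sizes are assumptions made for illustration.

```python
import numpy as np

def position_wise_ffn(X, W1, b1, W2, b2):
    """Apply two dense layers to the last axis of X."""
    hidden = np.maximum(X @ W1 + b1, 0.0)  # first dense layer + ReLU (assumed activation)
    return hidden @ W2 + b2                # second dense layer

rng = np.random.default_rng(0)
batch_size, seq_len, units, hidden_size = 2, 5, 4, 8  # toy sizes, chosen arbitrarily

X = rng.normal(size=(batch_size, seq_len, units))
W1 = rng.normal(size=(units, hidden_size))
b1 = np.zeros(hidden_size)
W2 = rng.normal(size=(hidden_size, units))
b2 = np.zeros(units)

# Apply the network to the whole 3-D input at once.
Y = position_wise_ffn(X, W1, b1, W2, b2)

# Apply the same weights to one position at a time and stack the results.
Y_loop = np.stack([
    np.stack([position_wise_ffn(X[b, t], W1, b1, W2, b2) for t in range(seq_len)])
    for b in range(batch_size)
])

print(Y.shape)                 # (2, 5, 4)
print(np.allclose(Y, Y_loop))  # True
```

Both computations agree because the matrix product only touches the last axis, so no information flows between positions inside this network.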

Now we define the transformer block for the encoder, which contains a multi-head attention layer, a position-wise feed-forward network, and two `AddNorm` connection blocks, one after each of these sublayers.

In [2]:
class EncoderBlock(gluon.nn.Block):
    def __init__(self, units, hidden_size, num_heads, dropout, **kwargs):
        super(EncoderBlock, self).__init__(**kwargs)
        self.attention = MultiHeadAttention(units, num_heads, dropout)
        self.add_1 = AddNorm(dropout)
        self.ffn = PositionWiseFFN(units, hidden_size)
        self.add_2 = AddNorm(dropout)

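    def forward(self, X, mask):
        # Self-attention: queries, keys, and values are all X.
        Y = self.add_1(X, self.attention(X, X, X, mask))
        # Feed-forward sublayer, combined with its input by the second AddNorm.
        return self.add_2(Y, self.ffn(Y))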

Because of the residual connections, this block does not change the shape of its input. This means the `units` argument must equal the size of the input's last dimension.

In [3]:
encoder_blk = EncoderBlock(24, 48, 8, 0.5)
encoder_blk.initialize()
mask = nd.ones(shape=(2, 100, 100))
mask[0, :, 2:] = 0
mask[1, :, 3:] = 0
encoder_blk(nd.ones((2, 100, 24)), mask).shape

(2, 100, 24)
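
Why the residual connection fixes the shape can be sketched in NumPy. This is a simplified stand-in for `AddNorm`, not the version imported from `utils`: it assumes the block adds the sublayer output to its input and then applies layer normalization over the last axis, and it leaves out dropout and any learned scale and shift. The names `layer_norm` and `add_norm` are hypothetical.

```python
import numpy as np

def layer_norm(X, eps=1e-5):
    """Normalize each feature vector (last axis) to zero mean and unit variance."""
    mean = X.mean(axis=-1, keepdims=True)
    var = X.var(axis=-1, keepdims=True)
    return (X - mean) / np.sqrt(var + eps)

def add_norm(X, Y):
    """Residual connection followed by layer normalization (assumed behaviour)."""
    if X.shape != Y.shape:
        # The element-wise sum X + Y needs both operands to have the same shape.
        raise ValueError(f"shape mismatch: {X.shape} vs {Y.shape}")
    return layer_norm(X + Y)

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 100, 24))             # block input
sublayer_out = rng.normal(size=(2, 100, 24))  # stands in for attention or FFN output

out = add_norm(X, sublayer_out)
print(out.shape)  # (2, 100, 24)

# A sublayer that produced 48 features could not be added back to the input.
try:
    add_norm(X, rng.normal(size=(2, 100, 48)))
    mismatch_rejected = False
except ValueError as err:
    mismatch_rejected = True
    print(err)
```

Since every sublayer output is added to the sublayer input, each sublayer has to return as many features as it received, which is what ties `units` to the input's last dimension.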

The encoder stacks $n$ blocks. Again because of the residual connections, the embedding size $d$ is the same as the transformer block output size. Also note that we multiply the embedding output by $\sqrt{d}$ so that its values are not too small compared to the positional encodings.

In [4]:
class TransformerEncoder(gluon.nn.Block):
    def __init__(self, vocab_size, units, hidden_size,
                 num_heads, num_layers, dropout, **kwargs):
        super(TransformerEncoder, self).__init__(**kwargs)
        self.units = units
        self.embed = gluon.nn.Embedding(vocab_size, units)
        self.pos_encoding = PositionalEncoding(units, dropout)
        self.blks = gluon.nn.Sequential()
        for i in range(num_layers):
            self.blks.add(
                EncoderBlock(units, hidden_size, num_heads, dropout))

    def forward(self, X, mask, *args):
        X = self.pos_encoding(self.embed(X) * math.sqrt(self.units))
        for blk in self.blks:
            X = blk(X, mask)
        return X
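
The effect of the $\sqrt{d}$ scaling in `forward` can be illustrated numerically. The sketch below rests on two assumptions that the notebook does not confirm: that `PositionalEncoding` uses the sinusoidal scheme from Vaswani et al., and that freshly initialized embedding weights are small (drawn here uniformly from $[-0.07, 0.07]$). The helper `sinusoidal_encoding` is hypothetical.

```python
import math
import numpy as np

def sinusoidal_encoding(seq_len, units):
    """Sinusoidal positional encodings (assumed scheme); units must be even."""
    pos = np.arange(seq_len)[:, None]                        # (seq_len, 1)
    div = np.power(10000.0, np.arange(0, units, 2) / units)  # (units / 2,)
    P = np.zeros((seq_len, units))
    P[:, 0::2] = np.sin(pos / div)  # even feature indices
    P[:, 1::2] = np.cos(pos / div)  # odd feature indices
    return P

rng = np.random.default_rng(0)
seq_len, units = 100, 24

# Stand-in for the embedding output at initialization (assumed value range).
embed = rng.uniform(-0.07, 0.07, size=(seq_len, units))
P = sinusoidal_encoding(seq_len, units)

raw = np.abs(embed).mean()                        # typical embedding magnitude
scaled = np.abs(embed * math.sqrt(units)).mean()  # after multiplying by sqrt(d)
pos_mag = np.abs(P).mean()                        # typical positional-encoding magnitude

print(f"embedding: {raw:.3f}, scaled embedding: {scaled:.3f}, positional: {pos_mag:.3f}")
```

Under these assumptions the unscaled embeddings are much smaller than the positional encodings, and the scaling narrows that gap. The factor grows with the model size: for $d = 512$ it is about 22.6.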

Create an encoder with two transformer blocks, whose hyper-parameters are the same as before.

In [5]:
encoder = TransformerEncoder(200, 24, 48, 8, 2, 0.5)
encoder.initialize()
encoder(nd.ones((2, 100)), mask).shape

(2, 100, 24)