## **BERT**

BERT (Bidirectional Encoder Representations from Transformers) and its descendants achieve state-of-the-art or near state-of-the-art results on a wide range of NLP tasks.

Released by Google in 2018, BERT builds powerful context-aware representations of words that can be exploited to perform custom classification tasks.

For further details: https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html

**Fine-tuning BERT for text classification with Keras and TensorFlow**

This notebook fine-tunes BERT for a custom text classification task. A classification layer is added on top of BERT, and the whole model is then retrained, starting from the pre-trained BERT weights.

Keras and TensorFlow leave us free to customize the classification layer.

In [0]:
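!pip install tensorflow==2.1.0  # from TF 2.1 on, the `tensorflow` package already includes GPU support
!pip install transformers       # note: an older transformers release may be needed for compatibility with TF 2.1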

**STEP 1**: download pre-trained BERT model.

We choose the multilingual model, since the texts we want to classify are in Italian.

In [0]:
import tensorflow as tf

from transformers import BertTokenizer, TFBertModel

model_name = 'bert-base-multilingual-cased' # recommended multilingual model (see https://github.com/google-research/bert/blob/master/multilingual.md)

print("Loading BERT model {}...".format(model_name))
bert_model = TFBertModel.from_pretrained(model_name)    # loads model + pre-trained weights; its layers are Keras layers
tokenizer  = BertTokenizer.from_pretrained(model_name)  # to preprocess input sentences
print("... model loaded successfully.")

Loading BERT model bert-base-multilingual-cased...
... model loaded successfully.


**STEP 2**: load and preprocess data. In this case, we will try to predict an article's topic based on the title.

Target classification labels must be one-hot encoded, since the model is trained with a categorical cross-entropy loss.
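
As a minimal sketch with made-up numbers, the snippet below shows why one-hot targets pair naturally with categorical cross-entropy: for a one-hot target, the loss reduces to the negative log of the probability assigned to the true class. The helper names, the three classes, and the probabilities are assumptions for illustration only.

```python
import math

def one_hot(index, n_labels):
    """Return a list of n_labels zeros with a 1.0 at position `index`."""
    encoding = [0.0] * n_labels
    encoding[index] = 1.0
    return encoding

def categorical_crossentropy(target, probs):
    """Cross-entropy between a one-hot target and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)

target = one_hot(2, n_labels=3)  # hypothetical: the true class has index 2
probs = [0.1, 0.2, 0.7]          # hypothetical softmax output of a classifier

loss = categorical_crossentropy(target, probs)
print(target)          # [0.0, 0.0, 1.0]
print(round(loss, 4))  # 0.3567, i.e. -log(0.7)
```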

Input sentences must be *tokenized* (using BERT's tokenizer) and *padded* so that they all have the same length.

The padding token is not a real word and must be ignored when computing attention weights. We therefore build an attention mask (**padding mask**) that masks out all padding tokens, and pass it to the BERT layer as an additional input.
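
The padding and masking described above can be sketched on plain lists. The `pad_and_mask` helper and the token ids are made up for illustration; the only assumption about BERT is that the `[PAD]` token has id 0.

```python
PAD_ID = 0  # assumption: id of the [PAD] token

def pad_and_mask(token_ids, max_seq_len, pad_id=PAD_ID):
    """Pad (or truncate) one tokenized sentence and build its padding mask."""
    token_ids = token_ids[:max_seq_len]
    n_padding = max_seq_len - len(token_ids)
    padded = token_ids + [pad_id] * n_padding
    mask = [1] * len(token_ids) + [0] * n_padding  # 1 = real token, 0 = padding
    return padded, mask

# Made-up token ids for two sentences of different lengths
sentences = [[101, 7592, 2088, 102], [101, 7592, 102]]
for ids in sentences:
    padded, mask = pad_and_mask(ids, max_seq_len=6)
    print(padded, mask)
# [101, 7592, 2088, 102, 0, 0] [1, 1, 1, 1, 0, 0]
# [101, 7592, 102, 0, 0, 0] [1, 1, 1, 0, 0, 0]
```

Both outputs have the same length, and the mask tells the model which positions carry real tokens.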

In [0]:
import numpy as np
import pandas as pd
import random

df = pd.read_csv("articles_topics.csv")[['title', 'topic']] # about 500 articles

# *** Labels

labels_dict     = {topic : i for i, topic in enumerate(df.topic.unique())}
inv_labels_dict = {i : topic for topic, i in labels_dict.items()}
n_labels = len(labels_dict)

# one-hot encode the labels: e.g. label index 2 --> encoding = [0, 0, 1, 0, 0, ..., 0] (needed for fine-tuning on the classification task)
def one_hot_map(topic, labels_dict = labels_dict, n_labels = n_labels):
    label_enc = np.zeros(n_labels)
    label_enc[labels_dict[topic]] = 1.
    return label_enc

# *** Predictors: convert input texts to a format that can be understood by BERT

def texts_to_BERT_input(texts, tokenizer, max_seq_len = None):

    tokenized_texts = [tokenizer.encode(text) for text in texts] # lists of token ids, including [CLS] and [SEP]

    if max_seq_len is None:
        max_seq_len = max(map(len, tokenized_texts))

    # start from all-padding inputs and an all-zero mask (0 = masked input, 1 = input to be used)
    input_ids      = np.full((len(tokenized_texts), max_seq_len), tokenizer.pad_token_id, dtype = np.int32)
    attention_mask = np.zeros((len(tokenized_texts), max_seq_len), dtype = np.int32)
    for i, tokens in enumerate(tokenized_texts):
        tokens = tokens[:max_seq_len] # truncate sequences longer than max_seq_len (this drops the final [SEP])
        input_ids[i, :len(tokens)] = tokens
        attention_mask[i, :len(tokens)] = 1

    input_bert = np.stack([input_ids, attention_mask], axis = 1) # input for BERT: shape (n_texts, 2, max_seq_len)

    return tf.constant(input_bert, dtype = tf.int32)

**STEP 3**: split the data into training and test sets (random 80% / 20% split)

In [0]:
max_seq_len = 50 # any length works, provided it is >= the maximum tokenized sequence length.
                 # Thanks to the attention mask, results should not change if we vary max_seq_len
input_bert = texts_to_BERT_input(df.title.values, tokenizer = tokenizer, max_seq_len = max_seq_len)

# Split train / test
train_test_split = 0.8
n_data  = len(input_bert)
n_train = int(train_test_split * n_data)
ixs = np.arange(n_data)
np.random.shuffle(ixs) # shuffle the NumPy index array in place
ixs_train, ixs_test = ixs[:n_train], ixs[n_train:]
assert len(set(ixs_train).intersection(set(ixs_test))) == 0

train_input = tf.constant(input_bert.numpy()[ixs_train, :], dtype = tf.int32)
test_input  = tf.constant(input_bert.numpy()[ixs_test , :], dtype = tf.int32)

train_labels = tf.constant(np.array([one_hot_map(topic) for topic in df.iloc[ixs_train].topic.values]), dtype = tf.int32)
test_labels  = tf.constant(np.array([one_hot_map(topic) for topic in df.iloc[ixs_test] .topic.values]), dtype = tf.int32)
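
The split above depends on the global random state, so it changes between runs. Below is a minimal sketch of a reproducible variant that uses only the standard library. The `split_indices` helper, the `seed` value, and the `_demo` names are assumptions for illustration, not part of the notebook.

```python
import random

def split_indices(n_data, train_fraction=0.8, seed=42):
    """Shuffle range(n_data) with a fixed seed and cut it into train/test index lists."""
    ixs = list(range(n_data))
    random.Random(seed).shuffle(ixs)  # local RNG: leaves the global random state untouched
    n_train = int(train_fraction * n_data)
    return ixs[:n_train], ixs[n_train:]

ixs_train_demo, ixs_test_demo = split_indices(n_data=10)
print(len(ixs_train_demo), len(ixs_test_demo))  # 8 2
```

Calling `split_indices` again with the same `seed` returns the same partition, which makes test-set results comparable across runs.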

**STEP 4**: define the model (add classification layer on top of BERT)

In [0]:
# BERT LAYER
input_shape = (input_bert.shape[1], input_bert.shape[2])

input_layer = tf.keras.layers.Input(shape = input_shape, dtype = 'int32', name = 'input_tokens')  # input: tokenized sentences + attention mask
output = bert_model.layers[0](input_layer[:, 0, :], attention_mask = input_layer[:, 1, :])[0]
output = tf.keras.layers.Lambda(lambda x : x[:, 0, :], name = 'extract_CLS', output_shape = (768,))(output)  # extract representation of [CLS] token (output_shape excludes the batch dimension)

# CLASSIFICATION LAYER
# added on top of BERT; receives the representation of the [CLS] token as input
output = tf.keras.layers.Dropout(0.2, name = 'dropout')(output)
output = tf.keras.layers.Dense(n_labels, activation = tf.nn.softmax, name = 'classifier')(output)

classifier = tf.keras.Model(inputs = input_layer, outputs = output, name = 'BERT-classifier')

classifier.compile(loss = 'categorical_crossentropy', optimizer = tf.optimizers.Adam(learning_rate = 2E-5), metrics = ['accuracy'])

classifier.summary() # prints the summary itself (and returns None, so there is no need to wrap it in print)

Model: "BERT-classifier"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_tokens (InputLayer)       [(None, 2, 50)]      0                                            
__________________________________________________________________________________________________
tf_op_layer_strided_slice (Tens [(None, 50)]         0           input_tokens[0][0]               
__________________________________________________________________________________________________
tf_op_layer_strided_slice_1 (Te [(None, 50)]         0           input_tokens[0][0]               
__________________________________________________________________________________________________
bert (TFBertMainLayer)          ((None, 50, 768), (N 177853440   tf_op_layer_strided_slice[0][0]  
____________________________________________________________________________________
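
To make explicit what the layers added on top of BERT compute, here is a dependency-free sketch of a dense layer followed by softmax, applied to a `[CLS]` vector. The random vector stands in for BERT's output, the weights are random rather than trained, the four labels are hypothetical, and dropout is left out because it is inactive at inference time.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classification_head(cls_vector, weights, biases):
    """Dense layer followed by softmax: one probability per label."""
    logits = [sum(w * x for w, x in zip(row, cls_vector)) + b
              for row, b in zip(weights, biases)]
    return softmax(logits)

rng = random.Random(0)
hidden_size, n_labels_demo = 768, 4  # 768 = BERT-base hidden size; 4 labels is hypothetical
cls_vector = [rng.gauss(0, 1) for _ in range(hidden_size)]  # stand-in for BERT's [CLS] output
weights = [[rng.gauss(0, 0.02) for _ in range(hidden_size)] for _ in range(n_labels_demo)]
biases = [0.0] * n_labels_demo

probs = classification_head(cls_vector, weights, biases)
print(len(probs), round(sum(probs), 6))  # 4 1.0
```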

**STEP 5**: train the model

In [0]:
# Define batch size & batch generator

BATCH_SIZE = 16

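def batch_generator(train_input, train_labels, n_train, batch_size = BATCH_SIZE):

    # convert to NumPy once, outside the loop, so that batches can be selected with index arrays
    inputs, labels = train_input.numpy(), train_labels.numpy()

    ixs = np.arange(n_train)
    np.random.shuffle(ixs)

    start = 0
    while True:

        batch_ixs = ixs[start:start+batch_size]

        # yield before reshuffling: batch_ixs is a view of ixs and would change if ixs were shuffled first
        yield tf.constant(inputs[batch_ixs, :, :], dtype = tf.int32), tf.constant(labels[batch_ixs], dtype = tf.int32)

        start += batch_size
        if start >= n_train: # end of an epoch (>= avoids an empty batch when batch_size divides n_train)
            start = 0
            np.random.shuffle(ixs)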
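
As a sanity check on the batching arithmetic, the sketch below computes how many steps one epoch needs and which slice each batch covers. The `epoch_batches` helper and the sizes are illustrative assumptions; they are not taken from the notebook's code.

```python
import math

def epoch_batches(n_samples, batch_size):
    """Yield (start, stop) slice bounds that cover n_samples exactly once."""
    for start in range(0, n_samples, batch_size):
        yield start, min(start + batch_size, n_samples)

n_samples_demo, batch_size_demo = 436, 16  # illustrative sizes
steps = math.ceil(n_samples_demo / batch_size_demo)
batches = list(epoch_batches(n_samples_demo, batch_size_demo))

print(steps, len(batches))  # 28 28
print(batches[-1])          # (432, 436): the last batch holds the remaining 4 samples
```

Using the ceiling of `n_samples / batch_size` gives exactly one pass over the data per epoch, including when the batch size divides the number of samples evenly.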

In [0]:
# Do the training

n_train = len(train_input)
generator = batch_generator(train_input, train_labels, n_train, batch_size = BATCH_SIZE)

callbacks = [tf.keras.callbacks.ModelCheckpoint(filepath = 'bert_fine_tuned-{epoch:04d}.ckpt', save_weights_only = True, period = 5, verbose = 1)]

classifier.fit(
          generator,          # Model.fit accepts generators directly; fit_generator is deprecated
          epochs    = 20,
          steps_per_epoch = int(np.ceil(n_train / BATCH_SIZE)),
          validation_data = (test_input, test_labels),
          verbose   = 1,
          callbacks = callbacks)

Train for 28 steps, validate on 109 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 00005: saving model to bert_fine_tuned-0005.ckpt
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 00010: saving model to bert_fine_tuned-0010.ckpt
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 00015: saving model to bert_fine_tuned-0015.ckpt
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
Epoch 00020: saving model to bert_fine_tuned-0020.ckpt


<tensorflow.python.keras.callbacks.History at 0x7f5d3148be10>

You can now load any of the saved checkpoints and print the test set predictions.

In [0]:
classifier.load_weights('bert_fine_tuned-0020.ckpt')

<tensorflow.python.training.tracking.util.CheckpointLoadStatus at 0x7f5d31481358>

In [0]:
# Print test set predictions

predictions = classifier.predict(test_input)
predictions = np.argmax(predictions, axis = 1)
predictions = [inv_labels_dict[i] for i in predictions]

print("Test set: {} predictions calculated".format(len(predictions)))

def wrap(text, maxlen = max(map(len, labels_dict.keys()))):
    return text+(' ' * (maxlen - len(text)))

print("{}\t{}\t{}".format(wrap('GROUND TRUTH'), wrap('PREDICTION'), 'TITLE'))
for i, ix in enumerate(ixs_test):
    print("{}\t{}\t{}".format(wrap(df.iloc[ix].topic), wrap(predictions[i]), df.iloc[ix].title))
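
Beyond inspecting predictions one by one, a single accuracy figure summarizes test performance. The sketch below decodes class probabilities with an argmax and compares them with the ground truth. The label names, probabilities, and helper functions are hypothetical stand-ins for `inv_labels_dict`, the output of `classifier.predict`, and the true topics.

```python
def argmax(values):
    """Index of the largest element of a list."""
    return max(range(len(values)), key=values.__getitem__)

def accuracy(true_labels, predicted_labels):
    """Fraction of positions where prediction and ground truth agree."""
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)

inv_labels_demo = {0: 'sport', 1: 'politica', 2: 'economia'}  # hypothetical label names
probs_demo = [[0.7, 0.2, 0.1],                                # hypothetical model outputs
              [0.1, 0.3, 0.6],
              [0.2, 0.5, 0.3]]
truth_demo = ['sport', 'economia', 'sport']                   # hypothetical ground truth

predicted_demo = [inv_labels_demo[argmax(p)] for p in probs_demo]
print(predicted_demo)                                 # ['sport', 'economia', 'politica']
print(round(accuracy(truth_demo, predicted_demo), 3)) # 0.667
```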