# Named Entity Recognition using BERT Fine-Tuning

For downstream NLP tasks such as question answering, named entity recognition, and language inference, models built on pre-trained word representations tend to perform better. BERT pre-trains a deep bidirectional representation that is then fine-tuned on each downstream task, and it achieved state-of-the-art results on a range of them. Unlike left-to-right language models, BERT is pre-trained with "masked language modeling": some tokens are hidden and predicted from the rest of the sentence, so the model can use the context on both sides of each token.

For this example, we use the `transformers` library to load a pre-trained BERT model along with its tokenizer and configuration files:

In [1]:
import tempfile
import os
import numpy as np

import tensorflow as tf
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.models import Model
from transformers import BertTokenizer, TFBertModel

import fastestimator as fe
from fastestimator.dataset.data import german_ner
from fastestimator.op.numpyop.numpyop import NumpyOp
from fastestimator.op.numpyop.univariate import PadSequence, Tokenize, WordtoId
from fastestimator.op.tensorop import TensorOp, Reshape
from fastestimator.op.tensorop.loss import CrossEntropy
from fastestimator.op.tensorop.model import ModelOp, UpdateOp
from fastestimator.trace.metric import Accuracy
from fastestimator.trace.io import BestModelSaver
from fastestimator.backend import feed_forward

In [2]:
max_len = 20
batch_size = 64
epochs = 10
max_train_steps_per_epoch = None
max_eval_steps_per_epoch = None
save_dir = tempfile.mkdtemp()
data_dir = None

In [3]:
# Parameters
epochs = 2
batch_size = 4
max_train_steps_per_epoch = 10
max_eval_steps_per_epoch = 10


We will need a custom `NumpyOp` that constructs attention masks for input sequences:

In [4]:
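class AttentionMask(NumpyOp):
    """Build an attention mask: 1.0 for real tokens, 0.0 for padding (token id 0)."""
    def forward(self, data, state):
        # `data` is a padded sequence of token ids; ids above 0 are real tokens.
        masks = [float(i > 0) for i in data]
        return np.array(masks)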
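
To make the masking rule concrete, here is a minimal standalone sketch of the same comparison in plain Python. The token ids are made up for illustration, and the sketch assumes padding uses id 0, as the op above does.

```python
def attention_mask(token_ids):
    """Return 1.0 for real tokens and 0.0 for padding (assumed to be id 0)."""
    return [float(token_id > 0) for token_id in token_ids]


# Hypothetical ids: three real tokens followed by two padding positions.
padded_ids = [101, 7592, 102, 0, 0]
mask = attention_mask(padded_ids)
print(mask)  # [1.0, 1.0, 1.0, 0.0, 0.0]
```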

Our `char2idx` function creates a look-up table to match ids and labels:

In [5]:
def char2idx(data):
    tag2idx = {t: i for i, t in enumerate(data)}
    return tag2idx
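
As a quick illustration, the sketch below applies the same function to a small, hypothetical label vocabulary and also builds the inverse table, which is handy when turning predicted ids back into tags. The four tags are assumptions chosen for the example, not the actual GermEval label set.

```python
def char2idx(data):
    """Map each label to its position in the vocabulary."""
    return {t: i for i, t in enumerate(data)}


# Hypothetical label vocabulary, used only for illustration.
example_labels = ["O", "B-PER", "I-PER", "B-LOC"]
example_tag2idx = char2idx(example_labels)

# Inverse table: id -> tag.
example_idx2tag = {i: t for t, i in example_tag2idx.items()}
print(example_tag2idx)
print(example_idx2tag)
```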

## Building components

### Step 1: Prepare training & evaluation data and define a `Pipeline`

The GermEval NER dataset contains sequences and entity tags from German Wikipedia and news corpora. We load the training and evaluation sequences as datasets, along with the data and label vocabularies. For simplicity, other nouns are omitted in this example.

In [6]:
train_data, eval_data, data_vocab, label_vocab = german_ner.load_data(root_dir=data_dir)

Downloading data to /home/geez219/fastestimator_data/GermEval






Define a pipeline to tokenize and pad the input sequences and construct attention masks. Attention masks keep the model from attending to padded positions. We use the BERT tokenizer for input sequence tokenization and limit sequences to `max_len` tokens (20 in this example).

In [7]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
tag2idx = char2idx(label_vocab)
pipeline = fe.Pipeline(
    train_data=train_data,
    eval_data=eval_data,
    batch_size=batch_size,
    ops=[
        Tokenize(inputs="x", outputs="x", tokenize_fn=tokenizer.tokenize),
        WordtoId(inputs="x", outputs="x", mapping=tokenizer.convert_tokens_to_ids),
        WordtoId(inputs="y", outputs="y", mapping=tag2idx),
        PadSequence(max_len=max_len, inputs="x", outputs="x"),
        PadSequence(max_len=max_len, value=len(tag2idx), inputs="y", outputs="y"),
        AttentionMask(inputs="x", outputs="x_masks")
    ])
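
The two padding steps can be sketched in plain Python as follows. This is an illustration of the idea, not FastEstimator's `PadSequence` implementation: it assumes right-padding and truncation to `max_len`, and the ids and the four-tag vocabulary size are hypothetical. Padding the labels with `len(tag2idx)` gives padded positions their own class id, one past the last real tag.

```python
def pad_sequence(sequence, max_len, value=0):
    """Truncate to max_len, then right-pad with `value` (assumed behavior)."""
    trimmed = list(sequence[:max_len])
    return trimmed + [value] * (max_len - len(trimmed))


example_max_len = 6
token_ids = [7592, 2088, 999]  # hypothetical token ids
tag_ids = [1, 0, 0]            # hypothetical tag ids
pad_tag_id = 4                 # len(tag2idx) for a hypothetical four-tag vocabulary

x = pad_sequence(token_ids, example_max_len)
y = pad_sequence(tag_ids, example_max_len, value=pad_tag_id)
print(x)  # [7592, 2088, 999, 0, 0, 0]
print(y)  # [1, 0, 0, 4, 4, 4]
```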





### Step 2: Create `model` and FastEstimator `Network`

Our network initializes from pre-trained BERT weights and adds a dense softmax layer that classifies each token into one of 24 classes. The whole network, not only the new layer, is then trained during fine-tuning.

In [8]:
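def ner_model():
    # Each input is a sequence of max_len integer ids; the shape must be a tuple.
    token_inputs = Input((max_len,), dtype=tf.int32, name='input_words')
    mask_inputs = Input((max_len,), dtype=tf.int32, name='input_masks')
    bert_model = TFBertModel.from_pretrained("bert-base-uncased")
    # Index 0 is the per-token sequence output; the pooled output is not needed.
    seq_output = bert_model(token_inputs, attention_mask=mask_inputs)[0]
    # One softmax over the 24 output classes for every token position.
    output = Dense(24, activation='softmax')(seq_output)
    model = Model([token_inputs, mask_inputs], output)
    return model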

After the model is defined, `fe.build` instantiates it and associates it with a specific optimizer:

In [9]:
model = fe.build(model_fn=ner_model, optimizer_fn=lambda: tf.optimizers.Adam(1e-5))





`fe.Network` takes a series of operators. Here a `ModelOp` runs the forward pass through the neural network. Two `Reshape` ops then flatten the ground truth into a one-dimensional vector of token labels and the predictions into a two-dimensional tensor with one row of 24 class probabilities per token, before both are passed to the loss calculation.

In [10]:
network = fe.Network(ops=[
        ModelOp(model=model, inputs=["x", "x_masks"], outputs="y_pred"),
        Reshape(inputs="y", outputs="y", shape=(-1, )),
        Reshape(inputs="y_pred", outputs="y_pred", shape=(-1, 24)),
        CrossEntropy(inputs=("y_pred", "y"), outputs="loss"),
        UpdateOp(model=model, loss_name="loss")
    ])
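
To show what the two reshapes and the loss do, here is a small plain-Python sketch on nested lists. It is an illustration under simplifying assumptions, not FastEstimator's `CrossEntropy` op: the batch is hypothetical (2 sequences, 2 tokens each, 3 classes instead of 24), and the loss is the mean negative log-probability of the true class.

```python
import math


def flatten_predictions(y_pred):
    """(batch, seq_len, num_classes) nested lists -> (batch * seq_len, num_classes)."""
    return [token_probs for sequence in y_pred for token_probs in sequence]


def flatten_labels(y):
    """(batch, seq_len) nested lists -> flat list of batch * seq_len labels."""
    return [label for sequence in y for label in sequence]


def sparse_cross_entropy(probs, labels):
    """Mean negative log-probability assigned to the true class of each token."""
    return -sum(math.log(p[label]) for p, label in zip(probs, labels)) / len(labels)


# Hypothetical batch: 2 sequences, 2 tokens each, 3 classes.
example_pred = [[[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]],
                [[0.2, 0.2, 0.6], [0.5, 0.25, 0.25]]]
example_y = [[0, 1], [2, 0]]

flat_pred = flatten_predictions(example_pred)
flat_y = flatten_labels(example_y)
example_loss = sparse_cross_entropy(flat_pred, flat_y)
print(len(flat_pred), len(flat_pred[0]), flat_y, example_loss)
```

After flattening, every token is treated as an independent classification example, which is why padded positions need their own label id.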

### Step 3: Prepare `Estimator` and configure the training loop

The `Estimator` takes four important arguments: network, pipeline, epochs, and traces. During training, we want to compute accuracy and save the model with the minimum loss. Both can be done using `Traces`.

In [11]:
traces = [Accuracy(true_key="y", pred_key="y_pred"), BestModelSaver(model=model, save_dir=save_dir)]
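
The idea behind saving the model with the minimum loss can be sketched as a small tracker. This is a hypothetical helper written for illustration, not the `BestModelSaver` implementation: it only records when a new minimum is seen and how many evaluations have passed since then.

```python
class BestLossTracker:
    """Hypothetical helper: remembers the lowest loss seen so far."""

    def __init__(self):
        self.min_loss = float("inf")
        self.since_best = 0

    def update(self, loss):
        """Return True when `loss` is a new minimum, i.e. when a model would be saved."""
        if loss < self.min_loss:
            self.min_loss = loss
            self.since_best = 0
            return True
        self.since_best += 1
        return False


tracker = BestLossTracker()
# The first two values are the evaluation losses from this run; the third is made up.
eval_losses = [0.6115406, 0.5961059, 0.61]
saved = [tracker.update(loss) for loss in eval_losses]
print(saved)  # [True, True, False]
```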

In [12]:
estimator = fe.Estimator(network=network,
                         pipeline=pipeline,
                         epochs=epochs,
                         traces=traces, 
                         max_train_steps_per_epoch=max_train_steps_per_epoch,
                         max_eval_steps_per_epoch=max_eval_steps_per_epoch)

## Training

In [13]:
estimator.fit()

    ______           __  ______     __  _                 __            
   / ____/___ ______/ /_/ ____/____/ /_(_)___ ___  ____ _/ /_____  _____
  / /_  / __ `/ ___/ __/ __/ / ___/ __/ / __ `__ \/ __ `/ __/ __ \/ ___/
 / __/ / /_/ (__  ) /_/ /___(__  ) /_/ / / / / / / /_/ / /_/ /_/ / /    
/_/    \__,_/____/\__/_____/____/\__/_/_/ /_/ /_/\__,_/\__/\____/_/     
                                                                        



FastEstimator-Start: step: 1; num_device: 0; logging_interval: 100; 




FastEstimator-Train: step: 1; loss: 3.6633422; 


FastEstimator-Train: step: 10; epoch: 1; epoch_time: 8.34 sec; 


FastEstimator-BestModelSaver: Saved model to /tmp/tmpwhy7oicf/model_best_loss.h5
FastEstimator-Eval: step: 10; epoch: 1; loss: 0.6115406; accuracy: 0.92625; since_best_loss: 0; min_loss: 0.6115406; 




FastEstimator-Train: step: 20; epoch: 2; epoch_time: 8.02 sec; 


FastEstimator-BestModelSaver: Saved model to /tmp/tmpwhy7oicf/model_best_loss.h5
FastEstimator-Eval: step: 20; epoch: 2; loss: 0.5961059; accuracy: 0.92625; since_best_loss: 0; min_loss: 0.5961059; 
FastEstimator-Finish: step: 20; total_time: 22.81 sec; model_lr: 1e-05; 


## Inferencing

Load the saved model weights using `fe.build`:

In [14]:
model_name = 'model_best_loss.h5'
model_path = os.path.join(save_dir, model_name)
trained_model = fe.build(model_fn=ner_model, weights_path=model_path, optimizer_fn=lambda: tf.optimizers.Adam(1e-5))

In [15]:
selected_idx = np.random.randint(1000)
print("Ground truth is: ",eval_data[selected_idx]['y'])

Ground truth is:  ['B-PER']


Create a data dictionary for the inference. The `transform()` function in `Pipeline` and `Network` applies all their operations on the given data:

In [16]:
infer_data = {"x":eval_data[selected_idx]['x'], "y":eval_data[selected_idx]['y']}
data = pipeline.transform(infer_data, mode="infer")
data = network.transform(data, mode="infer")

Get the predictions using `feed_forward`:

In [17]:
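predictions = feed_forward(trained_model, [data["x"],data["x_masks"]], training=False)
# One row of 24 class probabilities for each of the max_len token positions.
predictions = np.array(predictions).reshape(max_len, 24)
# Pick the most probable class id for every position.
predictions = np.argmax(predictions, axis=-1)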

In [18]:
def get_key(val):
    for key, value in tag2idx.items():
        if val == value:
            return key

In [19]:
print("Predictions: ", [get_key(pred) for pred in predictions])

Predictions:  [None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None]
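
`get_key` returns `None` for any class id that has no entry in `tag2idx`, and that includes the padding id `len(tag2idx)` used for `y` in the pipeline. One possible reading of the output above is that the briefly trained model predicts the padding class at every position, though this has not been verified here. The sketch below shows a decoding step that makes such positions explicit. The two-tag vocabulary, the probabilities, and the `<PAD>` placeholder are assumptions for illustration.

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=values.__getitem__)


def decode(prob_rows, idx2tag, unknown="<PAD>"):
    """Map each row of class probabilities to a tag; ids without a tag become `unknown`."""
    return [idx2tag.get(argmax(row), unknown) for row in prob_rows]


# Hypothetical two-tag vocabulary; class id 2 (= len(tag2idx)) is the padding class.
demo_tag2idx = {"O": 0, "B-PER": 1}
demo_idx2tag = {i: t for t, i in demo_tag2idx.items()}

demo_probs = [[0.1, 0.8, 0.1],   # most probable class: 1 -> "B-PER"
              [0.2, 0.1, 0.7]]   # most probable class: 2 -> padding
decoded = decode(demo_probs, demo_idx2tag)
print(decoded)  # ['B-PER', '<PAD>']
```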
