# BERT for Intent Classification and Slot Labelling

## Introduction
In this notebook, we will fine-tune a [BERT](https://arxiv.org/abs/1810.04805) model for the Intent Classification and Slot Labelling (ICSL) problem.

Intent classification and slot labeling are two essential problems in Natural Language Understanding (NLU). In _intent classification_, the agent needs to detect the intention that the speaker's utterance conveys. For example, when the speaker says "Book a flight from Long Beach to Seattle", the intention is to book a flight ticket. In _slot labeling_, the agent needs to extract the semantic entities that are related to the intent. In our previous example, "Long Beach" and "Seattle" are two semantic constituents related to the flight, i.e., the origin and the destination.

Essentially, _intent classification_ can be viewed as a sequence classification problem and _slot labeling_ can be viewed as a sequence tagging problem similar to Named-Entity Recognition (NER). Because the two tasks are closely related, they are usually trained jointly with a multi-task objective function.

## Dataset
We first load the Airline Travel Information System (ATIS) dataset, which contains around 5,000 utterances about travel plans and is a classical benchmark for ICSL.

In [1]:
from gluonnlp.data import ATISDataset

train_data = ATISDataset('train')
dev_data = ATISDataset('dev')
test_data = ATISDataset('test')
intent_vocab = train_data.intent_vocab
slot_vocab = train_data.slot_vocab
print('Loaded the ATIS dataset')
print('#Train/Dev/Test = {}/{}/{}'.format(len(train_data), len(dev_data), len(test_data)))
print('#Intent         = {}'.format(len(intent_vocab)))
print('#Slot           = {}'.format(len(slot_vocab)))

Loaded the ATIS dataset
#Train/Dev/Test = 4478/500/893
#Intent         = 18
#Slot           = 127


Then, let's display a sample from the dev set.

In [2]:
print('Sentence:', dev_data[4][0])
print('    Tags:', dev_data[4][1])
print('   Label:', intent_vocab.idx_to_token[dev_data[4][2][0]])

Sentence: ["i'm", 'flying', 'from', 'boston', 'to', 'the', 'bay', 'area']
    Tags: ['O', 'O', 'O', 'B-fromloc.city_name', 'O', 'O', 'B-toloc.city_name', 'I-toloc.city_name']
   Label: atis_flight
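The tags above follow the BIO scheme: `B-` opens a slot, `I-` continues it, and `O` marks a word outside any slot. As a rough illustration of how such tags map back to slot values, here is a minimal pure-Python sketch. The helper name `bio_to_slots` is hypothetical and not part of GluonNLP, and the sketch simply ignores an `I-` tag that does not continue an open slot.

```python
def bio_to_slots(tokens, tags):
    """Group BIO tags into (slot_name, text) pairs (illustrative helper)."""
    slots = []
    current_name, current_words = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith('B-'):
            # A new slot begins; close the open one first, if any
            if current_name is not None:
                slots.append((current_name, ' '.join(current_words)))
            current_name, current_words = tag[2:], [token]
        elif tag.startswith('I-') and current_name == tag[2:]:
            # Continuation of the slot that is currently open
            current_words.append(token)
        else:
            # 'O', or an I- tag that does not match the open slot: close it
            if current_name is not None:
                slots.append((current_name, ' '.join(current_words)))
            current_name, current_words = None, []
    if current_name is not None:
        slots.append((current_name, ' '.join(current_words)))
    return slots


tokens = ["i'm", 'flying', 'from', 'boston', 'to', 'the', 'bay', 'area']
tags = ['O', 'O', 'O', 'B-fromloc.city_name', 'O', 'O',
        'B-toloc.city_name', 'I-toloc.city_name']
slots = bio_to_slots(tokens, tags)
print(slots)
# [('fromloc.city_name', 'boston'), ('toloc.city_name', 'bay area')]
```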


## Load the BERT Model
Next, we load the pretrained BERT-base model (`bert_12_768_12`), trained on the BookCorpus and English Wikipedia datasets, onto the GPU.

In [3]:
import numpy as np
import mxnet as mx
from mxnet import gluon
from mxnet.gluon import nn, Block
import gluonnlp as nlp
import time
from gluonnlp.data import BERTTokenizer

dropout_prob = 0.1
ctx = mx.gpu(0)

bert_model, bert_vocab = nlp.model.get_model(name='bert_12_768_12',
                                             dataset_name='book_corpus_wiki_en_cased',
                                             pretrained=True,
                                             ctx=ctx,
                                             use_pooler=True,
                                             use_decoder=False,
                                             use_classifier=False,
                                             dropout=dropout_prob,
                                             embed_dropout=dropout_prob)
tokenizer = BERTTokenizer(bert_vocab, lower=True)

BERT uses subword tokenization, so a single word may be split into several pieces. For example, "Sunnyvale" is tokenized as:

In [4]:
tokenizer('Sunnyvale')

['sunny', '##vale']

Thus, we convert the original ATIS dataset so that only the encoded state of the first subword of each word is used to predict that word's slot label.

<img src="explain_subword_tagging.png" width="480" align="left"/>


In [5]:
class IDSLSubwordTransform(object):
    """Transform the dataset using the BERT vocabulary and tokenizer.
    """
    def __init__(self, subword_vocab, subword_tokenizer, slot_vocab, cased=False):
        """

        Parameters
        ----------
        subword_vocab : Vocab
        subword_tokenizer : Tokenizer
        slot_vocab : Vocab
        cased : bool
            Whether to keep the original casing. If False, every token is
            lowercased before subword tokenization.
        """
        super(IDSLSubwordTransform, self).__init__()
        self._subword_vocab = subword_vocab
        self._subword_tokenizer = subword_tokenizer
        self._slot_vocab = slot_vocab
        self._cased = cased
        self._slot_pad_id = self._slot_vocab['O']


    def __call__(self, word_tokens, tags, intent_ids):
        """Transform one word-level sample into subword-level arrays.

        Parameters
        ----------
        word_tokens : List[str]
        tags : List[str]
        intent_ids : np.ndarray

        Returns
        -------
        subword_ids : np.ndarray
        subword_mask : np.ndarray
        selected : np.ndarray
        padded_tag_ids : np.ndarray
        intent_label : int
        length : int
        """
        subword_ids = []
        subword_mask = []
        selected = []
        padded_tag_ids = []
        intent_label = intent_ids[0]
        ptr = 0
        for token, tag in zip(word_tokens, tags):
            if not self._cased:
                token = token.lower()
            token_sw_ids = self._subword_vocab[self._subword_tokenizer(token)]
            subword_ids.extend(token_sw_ids)
            subword_mask.extend([1] + [0] * (len(token_sw_ids) - 1))
            selected.append(ptr)
            padded_tag_ids.extend([self._slot_vocab[tag]] +
                                  [self._slot_pad_id] * (len(token_sw_ids) - 1))
            ptr += len(token_sw_ids)
        length = len(subword_ids)
        if len(subword_ids) != len(padded_tag_ids):
            print(word_tokens)
            print(tags)
            print(subword_ids)
            print(padded_tag_ids)
        return np.array(subword_ids, dtype=np.int32),\
               np.array(subword_mask, dtype=np.int32),\
               np.array(selected, dtype=np.int32),\
               np.array(padded_tag_ids, dtype=np.int32),\
               intent_label,\
               length

idsl_transform = IDSLSubwordTransform(subword_vocab=bert_vocab,
                                      subword_tokenizer=tokenizer,
                                      slot_vocab=slot_vocab,
                                      cased=False)
train_data_bert = train_data.transform(idsl_transform, lazy=False)
dev_data_bert = dev_data.transform(idsl_transform, lazy=False)
test_data_bert = test_data.transform(idsl_transform, lazy=False)

In [6]:
print('token ids:', dev_data_bert[4][0])
print('mask:', dev_data_bert[4][1])
print('index of the first subword:', dev_data_bert[4][2])
print('slot label:', dev_data_bert[4][3])
print('intent label:', dev_data_bert[4][4])
print('length:', dev_data_bert[4][5])

token ids: [  178   112   182  3754  1121   171 15540  1320  1106  1103  5952  1298]
mask: [1 0 0 1 1 1 0 0 1 1 1 1]
index of the first subword: [ 0  3  4  5  8  9 10 11]
slot label: [126 126 126 126 126  48 126 126 126 126  78 123]
intent label: 10
length: 12
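To make the mask and first-subword index above concrete, the following pure-Python sketch repeats the alignment logic on a toy sentence. The names `align_first_subword` and `toy_split` are hypothetical. The toy splitter only knows the "sunnyvale" split shown earlier and leaves every other word whole, which the real BERT tokenizer may not do.

```python
def align_first_subword(words, tags, split_word, pad_tag='O'):
    """Expand word-level tags to subword level, keeping the tag on the first piece."""
    subwords, mask, selected, sub_tags = [], [], [], []
    for word, tag in zip(words, tags):
        pieces = split_word(word)
        selected.append(len(subwords))               # index of the first piece
        subwords.extend(pieces)
        mask.extend([1] + [0] * (len(pieces) - 1))   # 1 only on the first piece
        sub_tags.extend([tag] + [pad_tag] * (len(pieces) - 1))
    return subwords, mask, selected, sub_tags


# Toy splitter (assumption): only 'sunnyvale' is split, as in the example above
TOY_SPLITS = {'sunnyvale': ['sunny', '##vale']}


def toy_split(word):
    return TOY_SPLITS.get(word, [word])


words = ['sunnyvale', 'to', 'boston']
tags = ['B-fromloc.city_name', 'O', 'B-toloc.city_name']
subwords, mask, selected, sub_tags = align_first_subword(words, tags, toy_split)
print(subwords)   # ['sunny', '##vale', 'to', 'boston']
print(mask)       # [1, 0, 1, 1]
print(selected)   # [0, 2, 3]
print(sub_tags)   # ['B-fromloc.city_name', 'O', 'O', 'B-toloc.city_name']
```

The filler tag on the non-first pieces never reaches the loss, because the mask is 0 at those positions.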


## Build the Training Network
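We add two prediction heads on top of BERT, each a dropout layer followed by a fully-connected layer. The intent head reads the pooled output of BERT, and the slot head reads the encoded state of every subword.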

In [7]:
class BERTForICSL(Block):
    def __init__(self, bert, num_intent_classes, num_slot_classes, dropout_prob,
                 prefix=None, params=None):
        super(BERTForICSL, self).__init__(prefix=prefix, params=params)
        self.bert = bert
        with self.name_scope():
            self.intent_classifier = nn.HybridSequential()
            with self.intent_classifier.name_scope():
                self.intent_classifier.add(nn.Dropout(rate=dropout_prob))
                self.intent_classifier.add(nn.Dense(units=num_intent_classes, flatten=False))
            self.slot_tagger = nn.HybridSequential()
            with self.slot_tagger.name_scope():
                self.slot_tagger.add(nn.Dropout(rate=dropout_prob))
                self.slot_tagger.add(nn.Dense(units=num_slot_classes, flatten=False))

    def forward(self, inputs, valid_length):
        """

        Parameters
        ----------
        inputs : NDArray
            The input sentences, has shape (batch_size, seq_length)
        valid_length : NDArray
            The valid length of the sentences

        Returns
        -------
        intent_scores : NDArray
            Shape (batch_size, num_classes)
        slot_scores : NDArray
            Shape (batch_size, seq_length, num_tag_types)
        """
        token_types = mx.nd.zeros_like(inputs)
        encoded_states, pooler_out = self.bert(inputs, token_types, valid_length)
        intent_scores = self.intent_classifier(pooler_out)
        slot_scores = self.slot_tagger(encoded_states)
        return intent_scores, slot_scores
net = BERTForICSL(bert_model, num_intent_classes=len(intent_vocab),
                  num_slot_classes=len(slot_vocab), dropout_prob=dropout_prob)
net.slot_tagger.initialize(ctx=ctx, init=mx.init.Normal(0.02))
net.intent_classifier.initialize(ctx=ctx, init=mx.init.Normal(0.02))
net.hybridize()
intent_pred_loss = gluon.loss.SoftmaxCELoss()
slot_pred_loss = gluon.loss.SoftmaxCELoss(batch_axis=[0, 1])
intent_pred_loss.hybridize()
slot_pred_loss.hybridize()
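During training, the slot loss is weighted by the first-subword mask and divided by the number of unmasked positions, so continuation subwords and padding do not contribute. The sketch below shows that masked average for a single sentence in plain Python. It is only an illustration of the idea, not the `SoftmaxCELoss` implementation, and `masked_slot_loss` is a hypothetical name.

```python
import math


def masked_slot_loss(scores, labels, mask):
    """Average softmax cross-entropy over positions where mask == 1.

    scores : list of per-position score lists, shape (seq_length, num_classes)
    labels : list of gold class ids, one per position
    mask   : list of 0/1 flags; 1 marks the first subword of a word
    """
    total = 0.0
    for pos_scores, label, keep in zip(scores, labels, mask):
        if not keep:
            continue  # masked positions add nothing to the loss
        # Numerically stable log-sum-exp for the softmax normaliser
        max_score = max(pos_scores)
        log_norm = max_score + math.log(
            sum(math.exp(s - max_score) for s in pos_scores))
        total += log_norm - pos_scores[label]
    return total / sum(mask)


scores = [[0.0, 0.0], [100.0, 0.0], [0.0, 0.0]]
labels = [0, 1, 1]
mask = [1, 0, 1]
loss = masked_slot_loss(scores, labels, mask)
print(loss)  # equals log(2) here: both unmasked positions have uniform scores
```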

In [8]:
print(net)

BERTForICSL(
  (intent_classifier): HybridSequential(
    (0): Dropout(p = 0.1, axes=())
    (1): Dense(None -> 18, linear)
  )
  (bert): BERTModel(
    (pooler): Dense(768 -> 768, Activation(tanh))
    (word_embed): HybridSequential(
      (0): Embedding(28996 -> 768, float32)
      (1): Dropout(p = 0.1, axes=())
    )
    (token_type_embed): HybridSequential(
      (0): Embedding(2 -> 768, float32)
      (1): Dropout(p = 0.1, axes=())
    )
    (encoder): BERTEncoder(
      (transformer_cells): HybridSequential(
        (0): BERTEncoderCell(
          (dropout_layer): Dropout(p = 0.1, axes=())
          (attention_cell): MultiHeadAttentionCell(
            (_base_cell): DotProductAttentionCell(
              (_dropout_layer): Dropout(p = 0.1, axes=())
            )
            (proj_query): Dense(768 -> 768, linear)
            (proj_value): Dense(768 -> 768, linear)
            (proj_key): Dense(768 -> 768, linear)
          )
          (proj): Dense(768 -> 768, linear)
          (f

## Create the DataLoader and Trainer for Training/Validation/Testing


In [9]:
batch_size = 32
learning_rate = 5E-5

trainer = gluon.Trainer(net.collect_params(), 'bertadam',
                        {'learning_rate': learning_rate, 'wd': 0.0})
batchify_fn = nlp.data.batchify.Tuple(nlp.data.batchify.Pad(),    # Subword ID
                                      nlp.data.batchify.Pad(),    # Subword Mask
                                      nlp.data.batchify.Pad(),    # Beginning of subword
                                      nlp.data.batchify.Pad(),    # Tag IDs
                                      nlp.data.batchify.Stack(),  # Intent Label
                                      nlp.data.batchify.Stack())  # Valid Length
train_batch_sampler = nlp.data.sampler.SortedBucketSampler(
    [ele[-1] for ele in train_data_bert],  # bucket by subword length (last field of each sample)
    batch_size=batch_size,
    mult=20,
    shuffle=True)
train_loader = gluon.data.DataLoader(dataset=train_data_bert,
                                     num_workers=4,
                                     batch_sampler=train_batch_sampler,
                                     batchify_fn=batchify_fn)
dev_loader = gluon.data.DataLoader(dataset=dev_data_bert,
                                   num_workers=4,
                                   batch_size=batch_size,
                                   batchify_fn=batchify_fn,
                                   shuffle=False)
test_loader = gluon.data.DataLoader(dataset=test_data_bert,
                                    num_workers=4,
                                    batch_size=batch_size,
                                    batchify_fn=batchify_fn,
                                    shuffle=False)
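The `Pad()` batchify functions right-pad variable-length arrays with 0 up to the longest sequence in the batch, while `Stack()` stacks the per-sample scalars. The following is a pure-Python sketch of the padding step only, with a hypothetical helper name `pad_batch`. It mimics the behaviour on plain lists and is not the GluonNLP implementation.

```python
def pad_batch(sequences, pad_val=0):
    """Right-pad each sequence with pad_val to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_val] * (max_len - len(seq)) for seq in sequences]


# Toy batch of subword masks with different lengths
masks = [[1, 0, 0, 1], [1, 1], [1, 0, 1]]
valid_length = [len(m) for m in masks]   # kept separately, like the Stack() fields
padded_masks = pad_batch(masks)
print(padded_masks)   # [[1, 0, 0, 1], [1, 1, 0, 0], [1, 0, 1, 0]]
print(valid_length)   # [4, 2, 3]
```

Because the mask itself is padded with 0, padded positions receive zero weight in the slot loss regardless of the padded tag IDs.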




## Train the Model

In [10]:
from tqdm import tqdm
import sys

nepochs = 5
warmup_ratio = 0.1
step_num = 0
num_train_steps = int(len(train_batch_sampler) * nepochs)
num_warmup_steps = int(num_train_steps * warmup_ratio)
best_dev_sf1 = -1
for epoch_id in range(nepochs):
    avg_train_intent_loss = 0.0
    avg_train_slot_loss = 0.0
    nsample = 0
    nslot = 0
    ntoken = 0
    train_epoch_start = time.time()
    for token_ids, mask, selected, slot_ids, intent_label, valid_length in tqdm(train_loader, file=sys.stdout):
        # Copy data to the context, i.e., GPU in our example
        token_ids = mx.nd.array(token_ids, ctx=ctx).astype(np.int32)
        mask = mx.nd.array(mask, ctx=ctx).astype(np.float32)
        slot_ids = mx.nd.array(slot_ids, ctx=ctx).astype(np.int32)
        intent_label = mx.nd.array(intent_label, ctx=ctx).astype(np.int32)
        valid_length = mx.nd.array(valid_length, ctx=ctx).astype(np.float32)
        batch_nslots = mask.sum().asscalar()
        batch_nsample = token_ids.shape[0]

        # Set learning rate warm-up
        step_num += 1
        if step_num < num_warmup_steps:
            new_lr = learning_rate * step_num / num_warmup_steps
        else:
            offset = ((step_num - num_warmup_steps) * learning_rate /
                      (num_train_steps - num_warmup_steps))
            new_lr = learning_rate - offset
        trainer.set_learning_rate(new_lr)

        # Begin to calculate the gradient
        with mx.autograd.record():
            intent_scores, slot_scores = net(token_ids, valid_length)
            intent_loss = intent_pred_loss(intent_scores, intent_label)
            slot_loss = slot_pred_loss(slot_scores, slot_ids, mask.expand_dims(axis=-1))
            intent_loss = intent_loss.mean()
            slot_loss = slot_loss.sum() / batch_nslots
            loss = intent_loss + slot_loss
            loss.backward()
        trainer.update(1.0)
        avg_train_intent_loss += intent_loss.asscalar() * batch_nsample
        avg_train_slot_loss += slot_loss.asscalar() * batch_nslots
        nsample += batch_nsample
        nslot += batch_nslots
        ntoken += valid_length.sum().asscalar()
    train_epoch_end = time.time()
    avg_train_intent_loss /= nsample
    avg_train_slot_loss /= nslot
    print('[Epoch {}] train intent/slot = {:.3f}/{:.3f}, #token per second={:.0f}'.format(
        epoch_id, avg_train_intent_loss, avg_train_slot_loss,
        ntoken / (train_epoch_end - train_epoch_start)))

100%|██████████| 140/140 [00:13<00:00, 10.85it/s]
[Epoch 0] train intent/slot = 0.827/1.207, #token per second=5465
100%|██████████| 140/140 [00:12<00:00, 10.31it/s]
[Epoch 1] train intent/slot = 0.183/0.229, #token per second=5940
100%|██████████| 140/140 [00:12<00:00, 11.07it/s]
[Epoch 2] train intent/slot = 0.089/0.124, #token per second=5968
100%|██████████| 140/140 [00:12<00:00, 11.77it/s]
[Epoch 3] train intent/slot = 0.056/0.088, #token per second=5955
100%|██████████| 140/140 [00:12<00:00, 11.43it/s]
[Epoch 4] train intent/slot = 0.039/0.071, #token per second=5964
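The learning rate in the training loop rises linearly during the first 10% of the steps and then decays linearly to zero. The standalone sketch below reproduces that schedule so it can be inspected without a GPU. The function name `linear_warmup_decay` is hypothetical, and the step count assumes the 140 batches per epoch shown in the progress bars above.

```python
def linear_warmup_decay(step, base_lr, num_train_steps, num_warmup_steps):
    """Learning rate at a given 1-based step: linear warm-up, then linear decay."""
    if step < num_warmup_steps:
        return base_lr * step / num_warmup_steps
    offset = ((step - num_warmup_steps) * base_lr /
              (num_train_steps - num_warmup_steps))
    return base_lr - offset


base_lr = 5e-5
num_train_steps = 140 * 5                        # batches per epoch * epochs
num_warmup_steps = int(num_train_steps * 0.1)    # 70 warm-up steps
for step in (1, 35, 70, 385, 700):
    lr = linear_warmup_decay(step, base_lr, num_train_steps, num_warmup_steps)
    print('step {:3d}: lr = {:.2e}'.format(step, lr))
```

The peak rate is reached at the end of warm-up (step 70), half of it remains at step 385, and the rate reaches zero at the final step.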


## Evaluate the Model

In [11]:
from seqeval.metrics import f1_score as ner_f1_score

def evaluation(ctx, data_loader, net, intent_pred_loss, slot_pred_loss, slot_vocab):
    """

    Parameters
    ----------
    ctx : Context
    data_loader : DataLoader
    net : Block
    intent_pred_loss : Block
    slot_pred_loss : Block
    slot_vocab : Vocab

    Returns
    -------
    avg_intent_loss : float
    avg_slot_loss : float
    intent_acc : float
    slot_f1 : float
    pred_slots : list
    gt_slots : list
    """
    nsample = 0
    nslot = 0
    avg_intent_loss = 0
    avg_slot_loss = 0
    correct_intent = 0
    pred_slots = []
    gt_slots = []
    for token_ids, mask, selected, slot_ids, intent_label, valid_length in data_loader:
        token_ids = mx.nd.array(token_ids, ctx=ctx).astype(np.int32)
        mask = mx.nd.array(mask, ctx=ctx).astype(np.float32)
        slot_ids = mx.nd.array(slot_ids, ctx=ctx).astype(np.int32)
        intent_label = mx.nd.array(intent_label, ctx=ctx).astype(np.int32)
        valid_length = mx.nd.array(valid_length, ctx=ctx).astype(np.float32)
        batch_nslot = mask.sum().asscalar()
        batch_nsample = token_ids.shape[0]
        # Forward network
        intent_scores, slot_scores = net(token_ids, valid_length)
        intent_loss = intent_pred_loss(intent_scores, intent_label)
        slot_loss = slot_pred_loss(slot_scores, slot_ids, mask.expand_dims(axis=-1))
        avg_intent_loss += intent_loss.sum().asscalar()
        avg_slot_loss += slot_loss.sum().asscalar()
        pred_slot_ids = mx.nd.argmax(slot_scores, axis=-1).astype(np.int32)
        correct_intent += (mx.nd.argmax(intent_scores, axis=-1).astype(np.int32)
                           == intent_label).sum().asscalar()
        for i in range(batch_nsample):
            ele_valid_length = int(valid_length[i].asscalar())
            ele_sel = selected[i].asnumpy()[:ele_valid_length]
            ele_gt_slot_ids = slot_ids[i].asnumpy()[ele_sel]
            ele_pred_slot_ids = pred_slot_ids[i].asnumpy()[ele_sel]
            ele_gt_slot_tokens = [slot_vocab.idx_to_token[v] for v in ele_gt_slot_ids]
            ele_pred_slot_tokens = [slot_vocab.idx_to_token[v] for v in ele_pred_slot_ids]
            gt_slots.append(ele_gt_slot_tokens)
            pred_slots.append(ele_pred_slot_tokens)
        nsample += batch_nsample
        nslot += batch_nslot
    avg_intent_loss /= nsample
    avg_slot_loss /= nslot
    intent_acc = correct_intent / float(nsample)
    slot_f1 = ner_f1_score(gt_slots, pred_slots)  # seqeval expects (y_true, y_pred)
    return avg_intent_loss, avg_slot_loss, intent_acc, slot_f1, pred_slots, gt_slots

avg_dev_intent_loss, avg_dev_slot_loss, dev_intent_acc, dev_slot_f1, dev_pred_slots, dev_gt_slots\
    = evaluation(ctx, dev_loader, net, intent_pred_loss, slot_pred_loss, slot_vocab)
print('[Epoch {}]    dev intent/slot = {:.3f}/{:.3f}, slot f1 = {:.2f}, intent acc = {:.2f}'.format(
    epoch_id, avg_dev_intent_loss, avg_dev_slot_loss, dev_slot_f1 * 100, dev_intent_acc * 100))
avg_test_intent_loss, avg_test_slot_loss, test_intent_acc, test_slot_f1, test_pred_slots, test_gt_slots \
    = evaluation(ctx, test_loader, net, intent_pred_loss, slot_pred_loss, slot_vocab)
print('[Epoch {}]    test intent/slot = {:.3f}/{:.3f}, slot f1 = {:.2f}, intent acc = {:.2f}'.format(
    epoch_id, avg_test_intent_loss, avg_test_slot_loss, test_slot_f1 * 100, test_intent_acc * 100))

[Epoch 4]    dev intent/slot = 0.128/0.092, slot f1 = 93.93, intent acc = 97.80
[Epoch 4]    test intent/slot = 0.123/0.154, slot f1 = 91.92, intent acc = 98.21
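The slot F1 reported here is computed at the entity level: a predicted slot counts as correct only if both its type and its exact span match the ground truth. The sketch below is a simplified, pure-Python version of that idea with micro-averaging. It is not seqeval's implementation (for instance, it ignores an `I-` tag that does not continue an open slot), and the helper names are hypothetical.

```python
def extract_spans(tags):
    """Return the set of (slot_name, start, end) spans in one BIO tag sequence."""
    spans, start, name = set(), None, None
    for i, tag in enumerate(list(tags) + ['O']):   # sentinel closes the last span
        continues = tag.startswith('I-') and tag[2:] == name
        if name is not None and not continues:
            spans.add((name, start, i))            # end index is exclusive
            start, name = None, None
        if tag.startswith('B-'):
            start, name = i, tag[2:]
    return spans


def slot_f1(gt_seqs, pred_seqs):
    """Micro-averaged entity-level F1 over a list of tag sequences."""
    n_correct = n_pred = n_gt = 0
    for gt_tags, pred_tags in zip(gt_seqs, pred_seqs):
        gt_spans, pred_spans = extract_spans(gt_tags), extract_spans(pred_tags)
        n_correct += len(gt_spans & pred_spans)
        n_pred += len(pred_spans)
        n_gt += len(gt_spans)
    precision = n_correct / n_pred if n_pred else 0.0
    recall = n_correct / n_gt if n_gt else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gt = [['B-fromloc.city_name', 'O', 'B-toloc.city_name', 'I-toloc.city_name']]
pred = [['B-fromloc.city_name', 'O', 'B-toloc.city_name', 'O']]
# One of two gold slots is matched exactly; the truncated 'toloc' span is wrong
print(slot_f1(gt, pred))   # precision = recall = 0.5, so F1 = 0.5
```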


In [12]:
print('Sentence:    ', dev_data[1][0])
print('Ground Truth:', dev_gt_slots[1])
print('Prediction:  ', dev_pred_slots[1])

Sentence:     ['show', 'me', 'all', 'round', 'trip', 'flights', 'between', 'houston', 'and', 'las', 'vegas']
Ground Truth: ['O', 'O', 'O', 'B-round_trip', 'I-round_trip', 'O', 'O', 'B-fromloc.city_name', 'O', 'B-toloc.city_name', 'I-toloc.city_name', 'O', 'O', 'O']
Prediction:   ['O', 'O', 'O', 'B-round_trip', 'I-round_trip', 'O', 'O', 'B-fromloc.city_name', 'O', 'B-toloc.city_name', 'I-toloc.city_name', 'O', 'O', 'O']


## Full Script + Results

The full training scripts are available at https://github.com/dmlc/gluon-nlp/tree/master/scripts/intent_slot

Results on ATIS:

| Models | Intent Acc (%) | Slot F1 (%) |
| ------ | ------------------------ | ----------- |
| [Intent Gating & self-attention, EMNLP 2018](https://www.aclweb.org/anthology/D18-1417) | 98.77 | 96.52 |
| [BLSTM-CRF + ELMo, AAAI 2019](https://arxiv.org/abs/1811.05370) | 97.42 | 95.62 |
| [Joint BERT, arXiv 2019](https://arxiv.org/pdf/1902.10909.pdf) | 97.5 | 96.1 |
| Ours | 98.66±0.00  | 95.88±0.04 |

Results on SNIPS:

| Models | Intent Acc (%) | Slot F1 (%) |
| ------ | ------------------------ | ----------- |
| [BLSTM-CRF + ELMo, AAAI 2019](https://arxiv.org/abs/1811.05370) | 99.29 | 93.90 |
| [Joint BERT, arXiv 2019](https://arxiv.org/pdf/1902.10909.pdf) | 98.60 | 97.00 |
| Ours | 98.81±0.13 | 95.94±0.10 |
