# Fine-tuning BERT for Sentiment Analysis

## Preparation

First, let's install the dependencies and import the necessary modules.

In [11]:
!pip install mxnet-cu100mkl d2l https://github.com/dmlc/gluon-nlp/tarball/master sagemaker-containers -U -q

mxnet-cu100mkl 1.5.1.post0 has requirement numpy<2.0.0,>1.16.0, but you'll have numpy 1.14.5 which is incompatible.
You are using pip version 10.0.1, however version 19.2.3 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.


In [2]:
import argparse, time, os
import d2l
import numpy as np
import mxnet as mx
from mxnet import gluon
import gluonnlp as nlp
from utils import train_loop, predict_sentiment

parser = argparse.ArgumentParser(description='BERT sentiment analysis fine-tune example.')
parser.add_argument('--batch_size', type=int, default=32,
                    help='batch size per GPU. total_batch_size = batch_size_per_gpu * num_gpus')
parser.add_argument('--num_epochs', type=int, default=1, help='The number of epochs to train')
parser.add_argument('--lr', type=float, default=5e-5, help='Learning rate')

args = parser.parse_args([])
print(args)

Namespace(batch_size=32, lr=5e-05, num_epochs=1)


In this section, we fine-tune the BERT Base model for sentiment analysis on the IMDB dataset.

### BERT for Sentiment Analysis

### Get Pre-trained BERT Model

We can load the pre-trained BERT model using the model API in GluonNLP, which returns the vocabulary along with the model. The pooler layer of the pre-trained model is kept (`use_pooler` defaults to `True`), while the pre-training heads are dropped by setting `use_decoder` and `use_classifier` to `False`.
The list of pre-trained BERT models available in GluonNLP can be found [here](../../model_zoo/bert/index.rst).

Now that we have loaded the BERT model, we only need to attach an additional layer for classification.
The `BERTClassifier` class uses a BERT base model to encode the sentence representation, followed by an `nn.Dense` layer for classification. We only need to initialize the classification layer; the encoding layers are already initialized with pre-trained weights.
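
To make the role of the classification layer concrete, here is a minimal pure-Python sketch of what a dense layer followed by softmax cross-entropy computes for a single example. It is an illustration rather than GluonNLP code: the 3-dimensional `pooled` vector, the weights, and the helper names are hypothetical (the real pooled output of BERT Base has 768 dimensions).

```python
import math

def dense(x, weights, bias):
    """Affine layer: one output value per row of `weights`."""
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_cross_entropy(logits, label):
    """Negative log-probability assigned to the true class."""
    return -math.log(softmax(logits)[label])

# Hypothetical pooled sentence vector and 2-class classifier parameters.
pooled = [0.5, -1.0, 2.0]
weights = [[0.1, 0.2, -0.3],   # class 0 (negative)
           [-0.2, 0.4, 0.5]]   # class 1 (positive)
bias = [0.0, 0.1]

logits = dense(pooled, weights, bias)
probs = softmax(logits)
loss = softmax_cross_entropy(logits, label=1)
print(logits, probs, loss)
```

When both logits are equal, the loss is ln 2 (about 0.693), which is the value to expect from a two-class classifier that has no preference between the classes.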

In [3]:
ctx = d2l.try_all_gpus()
bert_base, vocabulary = nlp.model.get_model('bert_12_768_12',
                                            dataset_name='book_corpus_wiki_en_uncased',
                                            pretrained=True, ctx=ctx,
                                            use_decoder=False, use_classifier=False)
loss_fn = mx.gluon.loss.SoftmaxCELoss()
net = nlp.model.BERTClassifier(bert_base, 2)
net.classifier.initialize(ctx=ctx)
net.hybridize()
print(net)

BERTClassifier(
  (bert): BERTModel(
    (encoder): BERTEncoder(
      (dropout_layer): Dropout(p = 0.1, axes=())
      (layer_norm): BERTLayerNorm(eps=1e-12, axis=-1, center=True, scale=True, in_channels=768)
      (transformer_cells): HybridSequential(
        (0): BERTEncoderCell(
          (dropout_layer): Dropout(p = 0.1, axes=())
          (attention_cell): MultiHeadAttentionCell(
            (_base_cell): DotProductAttentionCell(
              (_dropout_layer): Dropout(p = 0.1, axes=())
            )
            (proj_query): Dense(768 -> 768, linear)
            (proj_key): Dense(768 -> 768, linear)
            (proj_value): Dense(768 -> 768, linear)
          )
          (proj): Dense(768 -> 768, linear)
          (ffn): BERTPositionwiseFFN(
            (ffn_1): Dense(768 -> 3072, linear)
            (activation): GELU()
            (ffn_2): Dense(3072 -> 768, linear)
            (dropout_layer): Dropout(p = 0.1, axes=())
            (layer_norm): BERTLayerNorm(eps=1e-12, axis

## Data Preprocessing

To use the pre-trained BERT model, we need to:
- tokenize the inputs into word pieces,
- insert [CLS] at the beginning of a sentence, 
- insert [SEP] at the end of a sentence, and
- generate segment ids
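
The four steps above can be sketched in plain Python. This is a toy illustration, not GluonNLP's implementation: the tiny `toy_vocab` and the helper names `wordpiece` and `format_bert_input` are hypothetical, and the greedy longest-match-first split is a simplified form of the WordPiece strategy.

```python
def wordpiece(word, vocab, unk='[UNK]'):
    """Greedily split `word` into the longest pieces found in `vocab`.

    Pieces that continue a word carry a '##' prefix.
    """
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = '##' + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches at this position: the whole word is unknown.
            return [unk]
        pieces.append(piece)
        start = end
    return pieces

def format_bert_input(sentence, vocab, max_seq_length=128):
    """Tokenize, truncate, add [CLS]/[SEP], and build segment ids."""
    pieces = []
    for word in sentence.lower().split():
        pieces.extend(wordpiece(word, vocab))
    # Reserve two positions for the special tokens.
    pieces = pieces[:max_seq_length - 2]
    tokens = ['[CLS]'] + pieces + ['[SEP]']
    # A single sentence belongs entirely to segment 0.
    segment_ids = [0] * len(tokens)
    return tokens, segment_ids, len(tokens)

# Hypothetical miniature vocabulary, for illustration only.
toy_vocab = {'this', 'movie', 'is', 'so', 'great', 'play', '##ing'}
tokens, segment_ids, valid_length = format_bert_input(
    'This movie is so great', toy_vocab)
print(tokens)
print(segment_ids, valid_length)
```

In the notebook itself, `BERTTokenizer` and `BERTSentenceTransform` perform these steps and additionally map each token to its vocabulary index.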

### Data Transformations

We again use the IMDB dataset, this time downloading it with the GluonNLP data API. We then use the transform API to convert the raw scores into positive and negative labels.
To process sentences with BERT-style '[CLS]' and '[SEP]' tokens, you can use the `data.BERTSentenceTransform` API.

In [4]:
train_dataset_raw = nlp.data.IMDB('train')
train_dataset_raw = mx.gluon.data.SimpleDataset(train_dataset_raw[:100])
test_dataset_raw = nlp.data.IMDB('test')
test_dataset_raw = mx.gluon.data.SimpleDataset(test_dataset_raw[:100])

tokenizer = nlp.data.BERTTokenizer(vocabulary)

def transform_fn(data):
    text, label = data
    # Transform the raw score into a positive (1) / negative (0) label
    label = 1 if label >= 5 else 0
    transform = nlp.data.BERTSentenceTransform(tokenizer, max_seq_length=128,
                                               pad=False, pair=False)
    data, length, segment_type = transform([text])
    data = data.astype('float32')
    length = length.astype('float32')
    segment_type = segment_type.astype('float32')
    return data, length, segment_type, label

In [5]:
train_dataset = train_dataset_raw.transform(transform_fn)
test_dataset = test_dataset_raw.transform(transform_fn)

print(vocabulary)
print('Index for [CLS] = ', vocabulary['[CLS]'])
print('Index for [SEP] = ', vocabulary['[SEP]'])

data, length, segment_type, label = train_dataset[0]
print('words = ', data.astype('int32'))

Vocab(size=30522, unk="[UNK]", reserved="['[CLS]', '[SEP]', '[MASK]', '[PAD]']")
Index for [CLS] =  2
Index for [SEP] =  3
words =  [    2 22953  2213  4381  2152  2003  1037  9476  4038  1012  2009  2743
  2012  1996  2168  2051  2004  2070  2060  3454  2055  2082  2166  1010
  2107  2004  1000  5089  1000  1012  2026  3486  2086  1999  1996  4252
  9518  2599  2033  2000  2903  2008 22953  2213  4381  2152  1005  1055
 18312  2003  2172  3553  2000  4507  2084  2003  1000  5089  1000  1012
  1996 25740  2000  5788 13732  1010  1996 12369  3993  2493  2040  2064
  2156  2157  2083  2037 17203  5089  1005 13433  8737  1010  1996  9004
 10196  4757  1997  1996  2878  3663  1010  2035 10825  2033  1997  1996
  2816  1045  2354  1998  2037  2493  1012  2043  1045  2387  1996  2792
  1999  2029  1037  3076  8385  2699  2000  6402  2091  1996  2082  1010
  1045  3202  7383  1012  1012  1012  1012     3]


### Batchify and Data Loader

In [6]:
padding_id = vocabulary[vocabulary.padding_token]
batchify_fn = nlp.data.batchify.Tuple(
        # words: the first dimension is the batch dimension
        nlp.data.batchify.Pad(axis=0, pad_val=padding_id),
        # valid length
        nlp.data.batchify.Stack(),
        # segment type : the first dimension is the batch dimension
        nlp.data.batchify.Pad(axis=0, pad_val=padding_id),
        # label
        nlp.data.batchify.Stack(np.float32))

batch_size = args.batch_size * len(ctx)
train_data = mx.gluon.data.DataLoader(train_dataset,
                                   batchify_fn=batchify_fn, shuffle=True,
                                   batch_size=batch_size, num_workers=4)
test_data = mx.gluon.data.DataLoader(test_dataset,
                                  batchify_fn=batchify_fn,
                                  shuffle=False, batch_size=batch_size, num_workers=4)
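
As a rough picture of what the batchify function does, the sketch below pads the variable-length fields to the longest sample in the batch and simply collects the fixed-size fields. It is written in plain Python with hypothetical helper names (`pad_batch`, `batchify`) and made-up token ids; `pad_val=1` is a placeholder standing in for `padding_id`, and the segment ids are padded with the same value to mirror the cell above.

```python
def pad_batch(sequences, pad_val):
    """Right-pad variable-length sequences to the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_val] * (max_len - len(seq)) for seq in sequences]

def batchify(samples, pad_val):
    """Combine (token_ids, valid_length, segment_ids, label) samples into a batch."""
    token_ids, lengths, segment_ids, labels = zip(*samples)
    return (pad_batch(token_ids, pad_val),        # padded along the sequence axis
            list(lengths),                        # stacked as-is
            pad_batch(segment_ids, pad_val),      # padded like the token ids
            [float(label) for label in labels])   # stacked as floats

# Two hypothetical samples of different lengths.
samples = [
    ([2, 2023, 3], 3, [0, 0, 0], 1),
    ([2, 2023, 3185, 2003, 3], 5, [0, 0, 0, 0, 0], 0),
]
words, valid_lengths, segments, labels = batchify(samples, pad_val=1)
print(words)
print(valid_lengths, labels)
```

The valid lengths travel with the batch so that the model can tell real tokens from padding.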


### Training Loop

Now we have all the pieces in place, and we can finally start fine-tuning the
model for a few epochs.

In [7]:
tick = time.time()
train_loop(net, train_data, test_data, args.num_epochs, args.lr, ctx, loss_fn)
tock = time.time()
print('Elapsed time (sec): ', tock-tick)

Batch 0, Train Acc 0.0, Train Loss 1.0277866125106812
Epoch 0, Train Acc ('accuracy', 0.0), Train Loss 1.0277866125106812
Test Acc 0.0,
Elapsed time (sec):  4.174895524978638


### Save model checkpoint 

In [29]:

import tarfile

def flatten(tarinfo):
    # Store each file at the archive root by dropping its directory path.
    tarinfo.name = os.path.basename(tarinfo.name)
    return tarinfo

# Bundle the parameters, symbol file, and vocabulary into a single archive.
with tarfile.open("model.tar.gz", "w:gz") as tar:
    tar.add("checkpoint-0000.params", filter=flatten)
    tar.add("checkpoint-symbol.json", filter=flatten)
    tar.add("vocab.json", filter=flatten)
!ls

bert			infer.ipynb   __pycache__  setup.sh	train.py
checkpoint-0000.params	infer.py      sdk.ipynb    train	utils.py
checkpoint-symbol.json	model.tar.gz  sdk.py	   train.ipynb	vocab.json
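
The `flatten` filter strips directory components so that every file sits at the root of the archive. The self-contained check below demonstrates this with hypothetical placeholder files in a temporary directory instead of the real checkpoint; `pack_flat` and `list_members` are illustrative helper names, not part of the notebook.

```python
import os
import tarfile
import tempfile

def flatten(tarinfo):
    """Drop directory components so the member lands at the archive root."""
    tarinfo.name = os.path.basename(tarinfo.name)
    return tarinfo

def pack_flat(paths, archive_path):
    """Write `paths` into a gzip-compressed tarball with flattened names."""
    with tarfile.open(archive_path, "w:gz") as tar:
        for path in paths:
            tar.add(path, filter=flatten)

def list_members(archive_path):
    """Return the sorted member names stored in the archive."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return sorted(tar.getnames())

with tempfile.TemporaryDirectory() as workdir:
    # Create placeholder files in a nested directory.
    nested = os.path.join(workdir, "run", "output")
    os.makedirs(nested)
    paths = []
    for name in ("demo.params", "demo-symbol.json"):
        path = os.path.join(nested, name)
        with open(path, "w") as handle:
            handle.write("placeholder")
        paths.append(path)

    archive = os.path.join(workdir, "model.tar.gz")
    pack_flat(paths, archive)
    members = list_members(archive)

print(members)
```

The listed member names contain no directory prefix, even though the source files were nested two levels deep.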


### Inference

In [10]:
predict_sentiment(net, ctx, vocabulary, tokenizer, 'this movie is so great')

'positive'

In [41]:
import sagemaker
role = sagemaker.get_execution_role()
session = sagemaker.Session()

uploaded_model = session.upload_data(path='model.tar.gz', key_prefix='model')

s3_path = 's3://' + session.default_bucket() + '/model/model.tar.gz'

In [67]:
from sagemaker.mxnet.model import MXNetModel
model = MXNetModel(model_data=s3_path,
                   image='397262719838.dkr.ecr.us-east-1.amazonaws.com/haibin-test:serve',
                   role=role,
                   py_version='py3',
                   entry_point='serve')
help(model.deploy)

Help on method deploy in module sagemaker.model:

deploy(initial_instance_count, instance_type, accelerator_type=None, endpoint_name=None, update_endpoint=False, tags=None, kms_key=None, wait=True) method of sagemaker.mxnet.model.MXNetModel instance
    Deploy this ``Model`` to an ``Endpoint`` and optionally return a
    ``Predictor``.
    
    Create a SageMaker ``Model`` and ``EndpointConfig``, and deploy an
    ``Endpoint`` from this ``Model``. If ``self.predictor_cls`` is not None,
    this method returns a the result of invoking ``self.predictor_cls`` on
    the created endpoint name.
    
    The name of the created model is accessible in the ``name`` field of
    this ``Model`` after deploy returns
    
    The name of the created endpoint is accessible in the
    ``endpoint_name`` field of this ``Model`` after deploy returns.
    
    Args:
        initial_instance_count (int): The initial number of instances to run
            in the ``Endpoint`` created from this ``Model``.
 

In [65]:
predictor = model.deploy(initial_instance_count=1, instance_type='local')

OSError: [Errno 28] No space left on device

In [66]:
# curl --data-binary @${payload} -H "Content-Type: ${content}" -v http://localhost:8080/invocations

## Conclusion

In this tutorial, we showed how to fine-tune a sentiment analysis model with pre-trained BERT parameters. In GluonNLP, this takes only a few simple steps: apply a BERT-style data transformation to pre-process the data, automatically download the pre-trained model, and feed the transformed data into the model, all within 50 lines of code!

For more fine-tuning scripts, visit the [BERT model zoo webpage](http://gluon-nlp.mxnet.io/model_zoo/bert/index.html).


For fine-tuning, we only need to initialize the last classifier layer from scratch. The other layers are already initialized from the pre-trained model weights.