# Part 1: Load Dataset + Quickstart of Pretrained Backbones

In this section, we will load two datasets using the `nlp_data` command and then try out different backbone models.

In [2]:
import pandas as pd
import mxnet as mx
import gluonnlp
from gluonnlp.utils import set_seed
mx.npx.set_np()
set_seed(123)

## Load the Dataset

Let's download two datasets from the [GLUE benchmark](https://gluebenchmark.com/):
- The Stanford Sentiment Treebank (SST-2)
- Semantic Textual Similarity Benchmark (STS-B)

We will later show how to train prediction models on these two datasets with GluonNLP.

First, download the datasets with the `nlp_data` command. The downloaded datasets are preprocessed into the [parquet](https://parquet.apache.org/) format, which can be loaded with [pandas](https://pandas.pydata.org/).

In [3]:
!nlp_data prepare_glue --benchmark glue -t sst
!nlp_data prepare_glue --benchmark glue -t sts
!ls glue

Downloading glue to "glue". Selected tasks = sst
Processing sst...
Found!
Downloading glue to "glue". Selected tasks = sts
Processing sts...
Found!
sst  sts


In [4]:
sst_train_df = pd.read_parquet('glue/sst/train.parquet')
sst_valid_df = pd.read_parquet('glue/sst/dev.parquet')

In [5]:
sst_train_df.head(10)

Unnamed: 0,sentence,label
0,hide new secretions from the parental units,0
1,"contains no wit , only labored gags",0
2,that loves its characters and communicates som...,1
3,remains utterly satisfied to remain the same t...,0
4,on the worst revenge-of-the-nerds clichés the ...,0
5,that 's far too tragic to merit such superfici...,0
6,demonstrates that the director of such hollywo...,1
7,of saucy,1
8,a depressed fifteen-year-old 's suicidal poetry,0
9,are more deeply thought through than in most `...,1
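
Before training on a classification set like SST-2, it can help to look at the label balance and sentence lengths. The sketch below does this on a tiny stand-in frame built from the four untruncated rows shown above; `sample_df` is a hypothetical name, and in the notebook you would run the same two expressions on `sst_train_df` instead.

```python
import pandas as pd

# Stand-in for sst_train_df: only the rows displayed in full above.
sample_df = pd.DataFrame({
    'sentence': [
        'hide new secretions from the parental units',
        'contains no wit , only labored gags',
        'of saucy',
        "a depressed fifteen-year-old 's suicidal poetry",
    ],
    'label': [0, 0, 1, 0],
})

# How many examples carry each label (0 = negative, 1 = positive).
label_counts = sample_df['label'].value_counts().to_dict()

# Whitespace-separated word count per sentence.
num_words = sample_df['sentence'].str.split().str.len().tolist()

print(label_counts)
print(num_words)
```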


In [6]:
sts_train_df = pd.read_parquet('glue/sts/train.parquet')
sts_valid_df = pd.read_parquet('glue/sts/dev.parquet')

In [7]:
sts_train_df.head(10)

Unnamed: 0,sentence1,sentence2,genre,score
0,A plane is taking off.,An air plane is taking off.,main-captions,5.0
1,A man is playing a large flute.,A man is playing a flute.,main-captions,3.8
2,A man is spreading shreded cheese on a pizza.,A man is spreading shredded cheese on an uncoo...,main-captions,3.8
3,Three men are playing chess.,Two men are playing chess.,main-captions,2.6
4,A man is playing the cello.,A man seated is playing the cello.,main-captions,4.25
5,Some men are fighting.,Two men are fighting.,main-captions,4.25
6,A man is smoking.,A man is skating.,main-captions,0.5
7,The man is playing the piano.,The man is playing the guitar.,main-captions,1.6
8,A man is playing on a guitar and singing.,A woman is playing an acoustic guitar and sing...,main-captions,2.2
9,A person is throwing a cat on to the ceiling.,A person throws a cat on the ceiling.,main-captions,5.0


## Quickstart of Pretrained Backbones

Several recent papers, most notably [BERT](https://arxiv.org/pdf/1810.04805.pdf), have established a two-step recipe for solving NLP problems:
- Pretrain a backbone model on a large corpus,
- Finetune the backbone to solve the specific NLP task.

GluonNLP provides an interface for loading pretrained backbone models. As a quick start, let's load the BERT model.

### Download the Backbone

You can download the backbone models via `get_backbone`. For example, you can run the following command to get the backbone of `google_en_cased_bert_base`.

In [8]:
from gluonnlp.models import get_backbone
model_name = 'google_en_cased_bert_base'
model_cls, cfg, tokenizer, local_params_path, _ = get_backbone(model_name)

In [9]:
print('- Model Class:')
print(model_cls)
print('\n- Configuration:')
print(cfg)
print('\n- Tokenizer:')
print(tokenizer)
print('\n- Path of the weights:')
print(local_params_path)

- Model Class:
<class 'gluonnlp.models.bert.BertModel'>

- Configuration:
INITIALIZER:
  bias: ['zeros']
  embed: ['truncnorm', 0, 0.02]
  weight: ['truncnorm', 0, 0.02]
MODEL:
  activation: gelu
  attention_dropout_prob: 0.1
  compute_layout: auto
  dtype: float32
  hidden_dropout_prob: 0.1
  hidden_size: 3072
  layer_norm_eps: 1e-12
  layout: NT
  max_length: 512
  num_heads: 12
  num_layers: 12
  num_token_types: 2
  pos_embed_type: learned
  units: 768
  vocab_size: 28996
VERSION: 1

- Tokenizer:
HuggingFaceWordPieceTokenizer(
   vocab_file = /home/ubuntu/.mxnet/models/nlp/google_en_cased_bert_base/vocab-c1defaaa.json
   unk_token = [UNK], sep_token = [SEP], cls_token = [CLS]
   pad_token = [PAD], mask_token = [MASK]
   clean_text = True, handle_chinese_chars = True
   strip_accents = False, lowercase = False
   wordpieces_prefix = ##
   vocab = Vocab(size=28996, unk_token="[UNK]", pad_token="[PAD]", cls_token="[CLS]", sep_token="[SEP]", mask_token="[MASK]")
)

- Path of the weight

### Create the Backbone

To instantiate the backbone in Gluon and load the pretrained weights, use the following commands:

In [10]:
backbone = model_cls.from_cfg(cfg)
backbone.hybridize()
backbone.load_parameters(local_params_path)
print(backbone)

BertModel(
  (encoder): BertTransformer(
    (all_layers): HybridSequential(
      (0): TransformerEncoderLayer(
        (dropout_layer): Dropout(p = 0.1, axes=())
        (attn_qkv): Dense(768 -> 2304, linear)
        (attention_proj): Dense(768 -> 768, linear)
        (attention_cell): MultiHeadAttentionCell(
           query_units=768,
           num_heads=12,
           attention_dropout=0.1,
           scaled=True,
           normalized=False,
           layout="NTK",
           use_einsum=False,
           dtype=float32
        )
        (layer_norm): LayerNorm(eps=1e-12, axis=-1, center=True, scale=True, in_channels=768)
        (ffn): PositionwiseFFN(
        	units=768,
        	hidden_size=3072,
        	activation_dropout=0.0,
        	activation=gelu,
        	dropout=0.1,
        	normalization=layer_norm,
        	layer_norm_eps=1e-12,
        	pre_norm=False,
        	dtype=float32
        )
      )
      (1): TransformerEncoderLayer(
        (dropout_layer): Dropout(p =
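
The layer shapes in the printout can be cross-checked against the configuration. The short sketch below redoes that arithmetic in plain Python. The variable names `qkv_out`, `head_dim`, and `ffn_ratio` are ours, and the reading that `attn_qkv` packs the query, key, and value projections into a single dense layer is inferred from its name and its 2304-wide output, not stated in the tutorial.

```python
# Values copied from the MODEL section of the printed configuration.
units = 768         # model width (size of each token vector)
num_heads = 12
hidden_size = 3072  # inner width of the position-wise feed-forward block

# Query, key and value projections packed side by side (assumption, see above).
qkv_out = 3 * units

# Width handled by each attention head.
head_dim = units // num_heads

# How much wider the feed-forward block is than the model width.
ffn_ratio = hidden_size // units

print(qkv_out, head_dim, ffn_ratio)
```

Note that in this configuration `hidden_size` refers to the feed-forward width, while `units` is the embedding width, which matches `PositionwiseFFN(units=768, hidden_size=3072)` in the printout.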

You can use `backbone` directly to extract embeddings. For BERT, it outputs one embedding vector for the whole sentence (`cls_embedding`) and one contextual embedding vector per token (`token_embeddings`).


In [11]:
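# Take the first SST-2 training sentence as the example input
text_input = sst_train_df['sentence'][0]
# Tokenize into subword strings (for display) and vocabulary IDs (for the model)
tokens = tokenizer.encode(text_input, str)
token_ids = tokenizer.encode(text_input, int)
# Wrap the IDs with [CLS] ... [SEP] and add a batch dimension: shape (1, seq_len)
token_ids = mx.np.array([[tokenizer.vocab.cls_id] + token_ids + [tokenizer.vocab.sep_id]])
# Single-sentence input, so every token gets token type 0
token_types = mx.np.array([0] * len(token_ids[0]))
# Number of tokens in the sequence, including [CLS] and [SEP]
valid_length = mx.np.array([len(token_ids[0])])
print('Sentence=', text_input)
print('Tokens=', tokens)
print('Token IDs=', token_ids)
print('Token Types=', token_types)
print('Valid Length=', valid_length)
# The backbone returns per-token embeddings and one sentence-level embedding
token_embeddings, cls_embedding = backbone(token_ids, token_types, valid_length)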

Sentence= hide new secretions from the parental units 
Tokens= ['hide', 'new', 'secret', '##ions', 'from', 'the', 'parental', 'units']
Token IDs= [[  101.  4750.  1207.  3318.  5266.  1121.  1103. 22467.  2338.   102.]]
Token Types= [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
Valid Length= [10.]
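
The tokens above show `secretions` split into `secret` and `##ions`. WordPiece tokenizers typically produce such splits by greedy longest-match-first lookup in the vocabulary. The sketch below illustrates the idea on a toy vocabulary; `wordpiece` and `toy_vocab` are hypothetical names, and this is not the implementation behind `HuggingFaceWordPieceTokenizer`, which also handles text cleaning and punctuation.

```python
def wordpiece(word, vocab, prefix='##', unk='[UNK]'):
    """Split one word into subwords by greedy longest-match-first lookup."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # mark word-internal pieces
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# Toy vocabulary containing just the pieces needed for the example sentence.
toy_vocab = {'hide', 'new', 'secret', '##ions', 'from', 'the', 'parental', 'units'}
sentence = 'hide new secretions from the parental units '
tokens = [piece for word in sentence.split() for piece in wordpiece(word, toy_vocab)]
print(tokens)
```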


In [12]:
print(cls_embedding.shape)
print(cls_embedding)

(1, 768)
[[-0.7091232   0.5309977   0.9998003  -0.9866606   0.9401921   0.6315886
   0.9837744  -0.96095276 -0.95775586 -0.60642767  0.97397816  0.99566174
  -0.9902677  -0.99969244  0.435707   -0.96778107  0.97976995 -0.52170444
  -0.99992543 -0.38844424 -0.29302844 -0.9997452   0.28357023  0.89527607
   0.973925    0.09718102  0.9790452   0.9999052   0.8216678  -0.08883981
   0.26693714 -0.98289573  0.6802321  -0.99808586  0.22781204  0.10108362
   0.24586591 -0.27913764  0.5857332  -0.8314479  -0.571512   -0.23013882
   0.4970959  -0.45805666  0.65454197  0.3267932   0.17924476 -0.04300154
  -0.19947378  0.99990696 -0.95206505  0.99967813 -0.96945816  0.99622405
   0.98914737  0.46140316  0.9910741   0.15409192 -0.992088    0.3981389
   0.95724404  0.10991149  0.8809637  -0.15570553  0.33269107 -0.47607294
  -0.6478572   0.24575822 -0.52706516  0.32258353  0.4838301   0.22356257
   0.96965253 -0.8545226  -0.02529192 -0.8779966   0.08907609 -0.99979186
   0.9317021   0.99988896  0.49

In [13]:
print(token_embeddings.shape)
print(token_embeddings)

(1, 10, 768)
[[[ 0.25098458 -0.20093301 -0.01786663 ... -0.35123444  0.4021042
   -0.16942185]
  [ 0.28456497 -0.59436226 -0.07495949 ...  0.36519405  0.4808873
    0.28281727]
  [ 0.09450649 -0.06759122  0.40153563 ...  0.36970523  0.4600461
    0.04629947]
  ...
  [ 0.19983977 -0.17685743  0.09971484 ... -0.09942135  0.36892515
    0.28015777]
  [ 0.11791191 -0.18985073 -0.02815771 ... -0.22512732 -0.14774892
   -0.0953014 ]
  [ 0.33991298 -0.03747199 -0.13523774 ... -0.9031357   1.1424115
   -0.5947777 ]]]
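
When several sentences are batched together, shorter ones are padded, and `valid_length` records how many positions are real tokens. One common way to turn `token_embeddings` into a sentence vector is a mean over the valid positions only. The tutorial itself does not do this; the sketch below uses NumPy with random numbers standing in for the backbone output, and the batch of two sequences with lengths 10 and 4 is made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, seq_len, units = 2, 10, 768

# Random stand-in for the (batch, seq_len, units) output of the backbone.
token_embeddings = rng.standard_normal((batch_size, seq_len, units)).astype(np.float32)
# Second sequence has only 4 real tokens; the remaining 6 positions are padding.
valid_length = np.array([10, 4])

# mask[b, t] is 1.0 where position t is a real token of sequence b, else 0.0.
mask = (np.arange(seq_len)[None, :] < valid_length[:, None]).astype(np.float32)

# Zero out padded positions, sum over time, divide by the true length.
summed = (token_embeddings * mask[:, :, None]).sum(axis=1)
mean_pooled = summed / valid_length[:, None]

print(mean_pooled.shape)
```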


### More Backbone Models in GluonNLP

Apart from BERT, GluonNLP provides other backbone models, including recent ones such as [XLMR](https://arxiv.org/pdf/1911.02116.pdf), [ALBERT](https://arxiv.org/pdf/1909.11942.pdf), [ELECTRA](https://openreview.net/pdf?id=r1xMH1BtvB), and [MobileBERT](https://arxiv.org/pdf/2004.02984.pdf). Use `list_backbone_names` to list all the backbones supported in GluonNLP.

In [14]:
from gluonnlp.models import list_backbone_names
list_backbone_names()

['google_albert_base_v2',
 'google_albert_large_v2',
 'google_albert_xlarge_v2',
 'google_albert_xxlarge_v2',
 'google_en_cased_bert_base',
 'google_en_cased_bert_large',
 'google_en_cased_bert_wwm_large',
 'google_en_uncased_bert_base',
 'google_en_uncased_bert_large',
 'google_en_uncased_bert_wwm_large',
 'google_multi_cased_bert_base',
 'google_zh_bert_base',
 'gluon_electra_small_owt',
 'google_electra_base',
 'google_electra_large',
 'google_electra_small',
 'gpt2_124M',
 'gpt2_355M',
 'gpt2_774M',
 'google_uncased_mobilebert',
 'fairseq_roberta_base',
 'fairseq_roberta_large',
 'fairseq_xlmr_base',
 'fairseq_xlmr_large',
 'fairseq_bart_base',
 'fairseq_bart_large']

Using `get_backbone` together with `count_parameters`, we can compare the number of parameters of a few backbone models:
- google_en_uncased_bert_base
- google_albert_base_v2
- google_uncased_mobilebert
- fairseq_bart_base

In [14]:
from gluonnlp.utils.misc import count_parameters
param_num_l = []
for name in ['google_en_uncased_bert_base',
             'google_albert_base_v2',
             'google_uncased_mobilebert',
             'fairseq_bart_base']:
    model_cls, cfg, tokenizer, local_params_path, _ = get_backbone(name, load_backbone=False)
    model = model_cls.from_cfg(cfg)
    model.hybridize()
    model.initialize()
    total_num_params, fixed_num_params = count_parameters(model.collect_params())
    mx.npx.waitall()
    param_num_l.append((name, int(total_num_params / 1000000)))
df = pd.DataFrame(param_num_l, columns=['Model', '#Params (M)'])
df

  v.initialize(None, ctx, init, force_reinit=force_reinit)


Unnamed: 0,Model,#Params (M)
0,google_en_uncased_bert_base,109
1,google_albert_base_v2,11
2,google_uncased_mobilebert,24
3,fairseq_bart_base,218
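
The BERT-base figure can be reproduced by hand from the configuration. The sketch below is a back-of-the-envelope count, assuming the standard BERT layout: dense layers with biases, LayerNorm after the embeddings and after each sub-block, and a dense pooler on top. The helper names are ours, and the vocabulary size 30522 for the uncased model is an assumption, since only the cased configuration (28996) is printed above.

```python
def dense(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

def layer_norm(units):
    """One scale and one shift vector."""
    return 2 * units

def encoder_layer(units, hidden_size):
    """Attention (packed QKV + output projection) and feed-forward block."""
    attention = dense(units, 3 * units) + dense(units, units) + layer_norm(units)
    ffn = dense(units, hidden_size) + dense(hidden_size, units) + layer_norm(units)
    return attention + ffn

def bert_params(vocab_size, units=768, hidden_size=3072, num_layers=12,
                max_length=512, num_token_types=2):
    """Word, position and token-type embeddings, encoder stack, pooler."""
    embeddings = (vocab_size + max_length + num_token_types) * units + layer_norm(units)
    pooler = dense(units, units)
    return embeddings + num_layers * encoder_layer(units, hidden_size) + pooler

cased = bert_params(vocab_size=28996)    # vocab size from the printed config
uncased = bert_params(vocab_size=30522)  # assumed vocab size of the uncased model
print(cased, uncased, int(uncased / 1000000))
```

Under these assumptions the uncased estimate truncates to 109 million, in line with the first row of the table.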


### Use Other Backbones

Apart from BERT, it's straightforward to load other backbone models.

#### Load ALBERT-Base

In [44]:
model_cls, cfg, tokenizer, local_params_path, _ = get_backbone('google_albert_base_v2')
backbone = model_cls.from_cfg(cfg)
backbone.load_parameters(local_params_path)
backbone.hybridize()
print(cfg)
print()
print(tokenizer)

INITIALIZER:
  bias: ['zeros']
  embed: ['truncnorm', 0, 0.02]
  weight: ['truncnorm', 0, 0.02]
MODEL:
  activation: gelu(tanh)
  attention_dropout_prob: 0.0
  compute_layout: auto
  dtype: float32
  embed_size: 128
  hidden_dropout_prob: 0.0
  hidden_size: 3072
  layer_norm_eps: 1e-12
  layout: NT
  max_length: 512
  num_groups: 1
  num_heads: 12
  num_layers: 12
  num_token_types: 2
  pos_embed_type: learned
  units: 768
  vocab_size: 30000
VERSION: 1

SentencepieceTokenizer(
   model_path = /home/ubuntu/.mxnet/models/nlp/google_albert_base_v2/spm-65999e5d.model
   lowercase = True, nbest = 0, alpha = 0.0
   vocab = Vocab(size=30000, unk_token="<unk>", pad_token="<pad>", cls_token="[CLS]", sep_token="[SEP]", mask_token="[MASK]")
)
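
Two settings in this configuration explain why ALBERT-base is so much smaller than BERT-base: `embed_size: 128` (word embeddings are stored at width 128 and projected up to `units`) and `num_groups: 1` (which we read as all 12 layers sharing one set of weights, as in the ALBERT paper). The sketch below estimates the parameter count under those assumptions plus the usual biases, LayerNorms, and pooler; the helper names are hypothetical and the exact layout inside GluonNLP may differ.

```python
def dense(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

def layer_norm(units):
    """One scale and one shift vector."""
    return 2 * units

# Values copied from the printed ALBERT configuration.
vocab_size, embed_size, units, hidden_size = 30000, 128, 768, 3072
max_length, num_token_types = 512, 2

# Word embedding cost: full-width table vs. narrow table plus projection.
full_width = vocab_size * units
factorized = vocab_size * embed_size + dense(embed_size, units)

# All embeddings live at width embed_size and are projected to units once.
embeddings = ((vocab_size + max_length + num_token_types) * embed_size
              + layer_norm(embed_size) + dense(embed_size, units))

# One encoder layer, counted once because its weights are shared by every layer.
shared_layer = (dense(units, 3 * units) + dense(units, units) + layer_norm(units)
                + dense(units, hidden_size) + dense(hidden_size, units)
                + layer_norm(units))

total = embeddings + shared_layer + dense(units, units)  # last term: pooler
print(full_width, factorized, total)
```

The estimate truncates to 11 million, consistent with the `google_albert_base_v2` row in the comparison table.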


#### Load MobileBERT

In [34]:
model_cls, cfg, tokenizer, local_params_path, _ = get_backbone('google_uncased_mobilebert')
backbone = model_cls.from_cfg(cfg)
backbone.load_parameters(local_params_path)
backbone.hybridize()
print(cfg)
print()
print(tokenizer)

INITIALIZER:
  bias: ['zeros']
  embed: ['truncnorm', 0, 0.02]
  weight: ['truncnorm', 0, 0.02]
MODEL:
  activation: relu
  attention_dropout_prob: 0.1
  bottleneck_strategy: qk_sharing
  classifier_activation: False
  compute_layout: auto
  dtype: float32
  embed_size: 128
  hidden_dropout_prob: 0.0
  hidden_size: 512
  inner_size: 128
  layer_norm_eps: 1e-12
  layout: NT
  max_length: 512
  normalization: no_norm
  num_heads: 4
  num_layers: 24
  num_stacked_ffn: 4
  num_token_types: 2
  pos_embed_type: learned
  trigram_embed: True
  units: 512
  use_bottleneck: True
  vocab_size: 30522
VERSION: 1

HuggingFaceWordPieceTokenizer(
   vocab_file = /home/ubuntu/.mxnet/models/nlp/google_uncased_mobilebert/vocab-e6d2b21d.json
   unk_token = [UNK], sep_token = [SEP], cls_token = [CLS]
   pad_token = [PAD], mask_token = [MASK]
   clean_text = True, handle_chinese_chars = True
   strip_accents = False, lowercase = True
   wordpieces_prefix = ##
   vocab = Vocab(size=30522, unk_token="[UNK]",

#### Load BART

In [17]:
model_cls, cfg, tokenizer, local_params_path, _ = get_backbone('fairseq_bart_base')
backbone = model_cls.from_cfg(cfg)
backbone.load_parameters(local_params_path)
backbone.hybridize()
print(cfg)
print()
print(tokenizer)

INITIALIZER:
  bias: ['zeros']
  embed: ['xavier', 'gaussian', 'in', 1.0]
  weight: ['xavier', 'uniform', 'avg', 1.0]
MODEL:
  DECODER:
    activation: gelu
    hidden_size: 3072
    num_heads: 12
    num_layers: 6
    pre_norm: False
    recurrent: False
    units: 768
    use_qkv_bias: True
  ENCODER:
    activation: gelu
    hidden_size: 3072
    num_heads: 12
    num_layers: 6
    pre_norm: False
    recurrent: False
    units: 768
    use_qkv_bias: True
  activation_dropout: 0.0
  attention_dropout: 0.1
  data_norm: True
  dropout: 0.1
  dtype: float32
  layer_norm_eps: 1e-05
  layout: NT
  max_src_length: 1024
  max_tgt_length: 1024
  pooler_activation: tanh
  pos_embed_type: learned
  scale_embed: False
  shared_embed: True
  tie_weights: True
  vocab_size: 51201
VERSION: 1

HuggingFaceByteBPETokenizer(
   merges_file = /home/ubuntu/.mxnet/models/nlp/fairseq_bart_base/gpt2-396d4d8e.merges
   vocab_file = /home/ubuntu/.mxnet/models/nlp/fairseq_bart_base/gpt2-f4dedacb.vocab
   add

In [19]:
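text_input = sst_train_df['sentence'][0]
tokens = tokenizer.encode(text_input, str)
token_ids = tokenizer.encode(text_input, int)
# BART wraps the sequence with BOS/EOS instead of [CLS]/[SEP]
token_ids = mx.np.array([[tokenizer.vocab.bos_id] + token_ids + [tokenizer.vocab.eos_id]])
valid_length = mx.np.array([len(token_ids[0])])
# Encoder-decoder call: (src_ids, src_length, tgt_ids, tgt_length); the same sequence is used for both
decoder_output = backbone(token_ids, valid_length, token_ids, valid_length)
print('Sentence=', text_input)
print('Tokens=', tokens)
print(valid_length)
# The last dimension equals vocab_size (51201): one value per vocabulary entry at each position
print(decoder_output.shape)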

Sentence= hide new secretions from the parental units 
Tokens= ['hide', 'Ġnew', 'Ġsecret', 'ions', 'Ġfrom', 'Ġthe', 'Ġparental', 'Ġunits', 'Ġ']
[11.]
(1, 11, 51201)
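
The `Ġ` at the start of most BART tokens is how GPT-2 style byte-level BPE writes a leading space, so the original text can be recovered by concatenating the tokens and mapping `Ġ` back to a space. The sketch below checks this on the printed tokens and also accounts for the sequence length of 11; it is plain Python and does not call the GluonNLP tokenizer.

```python
SPACE_MARKER = '\u0120'  # the character printed as 'Ġ'

# Tokens copied from the output above.
tokens = ['hide', 'Ġnew', 'Ġsecret', 'ions', 'Ġfrom', 'Ġthe',
          'Ġparental', 'Ġunits', 'Ġ']

# Concatenate and turn each marker back into the space it stands for.
text = ''.join(tokens).replace(SPACE_MARKER, ' ')

# The model input adds one BOS and one EOS token around the subwords.
seq_len = len(tokens) + 2

print(repr(text))
print(seq_len)
```

The final lone `Ġ` comes from the trailing space in the SST sentence, which is why the tokenizer yields nine subwords and the model sees a length of 11.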
