# Part 2: Basic Usage of Tokenizer and Vocabulary

This section covers the basic usage of the tokenizer and vocabulary classes in GluonNLP. Together, a tokenizer and a vocabulary convert a raw text string into a sequence of integers that can be fed into a deep network.

The usual workflow will be:

```
raw text => normalized (cleaned) text => tokens => integer ids => network
```
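
To make the workflow concrete, here is a minimal pure-Python sketch of the same stages. It does not use GluonNLP: the helpers `normalize`, `tokenize`, and `build_vocab` are hypothetical names introduced only for illustration, and the normalization step (collapsing whitespace) is an assumption about what "cleaning" might involve.

```python
import re

def normalize(text):
    # Collapse runs of whitespace and strip both ends; real pipelines may do more.
    return re.sub(r'\s+', ' ', text).strip()

def tokenize(text):
    # Split on single spaces, the simplest possible tokenizer.
    return text.split(' ')

def build_vocab(tokens, unk_token='<unk>'):
    # Assign ids in first-seen order and place the unknown token last.
    token_to_id = {}
    for tok in tokens:
        token_to_id.setdefault(tok, len(token_to_id))
    token_to_id.setdefault(unk_token, len(token_to_id))
    return token_to_id

raw_text = '  cases   declined over the last 7 days '
clean_text = normalize(raw_text)
toy_tokens = tokenize(clean_text)
toy_vocab = build_vocab(toy_tokens)
# Unseen tokens fall back to the id of the unknown token.
toy_ids = [toy_vocab.get(tok, toy_vocab['<unk>']) for tok in toy_tokens]

print(clean_text)
print(toy_tokens)
print(toy_ids)
```

The GluonNLP tokenizer and `Vocab` classes introduced in the rest of this section play the roles of `tokenize` and `build_vocab` in this sketch.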

In addition, GluonNLP tokenizers provide the `encode_with_offsets` method, which also returns the character-level start and end positions of the encoded tokens. This helps with span extraction tasks, such as those that appear in Question Answering, which will be illustrated in the next tutorial.

In [20]:
import gluonnlp
from gluonnlp.models import get_backbone

## Tokenizer Basics

Tokenization converts a raw sentence into a series of tokens. As an example, consider two basic tokenizers, the `WhitespaceTokenizer` and the `MosesTokenizer`. We can simply call `tokenizer.encode()` to encode the sentence into a list of tokens.

In [43]:
from gluonnlp.data.tokenizers import WhitespaceTokenizer, MosesTokenizer
whitespace_tokenizer = WhitespaceTokenizer()
moses_tokenizer = MosesTokenizer('en')

sentence = '"#COVID19 cases in Sunnyvale declined over the last 7 days."'

print('Original Sentence:')
print(sentence + '\n')

print('Output of WhitespaceTokenizer:')
print(whitespace_tokenizer.encode(sentence))

print('Output of MosesTokenizer:')
print(moses_tokenizer.encode(sentence))

Original Sentence:
"#COVID19 cases in Sunnyvale declined over the last 7 days."

Output of WhitespaceTokenizer:
['"#COVID19', 'cases', 'in', 'Sunnyvale', 'declined', 'over', 'the', 'last', '7', 'days."']
Output of MosesTokenizer:
['&quot;', '#', 'COVID19', 'cases', 'in', 'Sunnyvale', 'declined', 'over', 'the', 'last', '7', 'days', '.', '&quot;']


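You can see that the `MosesTokenizer` splits punctuation into separate tokens, while the `WhitespaceTokenizer` leaves it attached to the neighboring words. To merge a list of tokens back into a sentence, we can use `tokenizer.decode()`. Note that the decoded sentence is not guaranteed to match the original exactly: in the output of the next cell, a space appears between `#` and `COVID19`.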

In [44]:
tokens = moses_tokenizer.encode(sentence)
recovered_sentence = moses_tokenizer.decode(tokens)
print('Original Sentence=', sentence)
print('Tokens=', tokens)
print('Decoded Sentence=', recovered_sentence)

Original Sentence= "#COVID19 cases in Sunnyvale declined over the last 7 days."
Tokens= ['&quot;', '#', 'COVID19', 'cases', 'in', 'Sunnyvale', 'declined', 'over', 'the', 'last', '7', 'days', '.', '&quot;']
Decoded Sentence= "# COVID19 cases in Sunnyvale declined over the last 7 days."
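
The `&quot;` tokens in the Moses output look like the HTML-escaped form of the double quote, which is consistent with the decoded sentence showing plain quotes again. As a small illustration using only Python's standard `html` module (not a GluonNLP API), the escaped form can be mapped back to the plain character like this. The token list is a shortened, hand-written copy of the output above.

```python
import html

# A shortened copy of the Moses-style tokens printed above.
moses_style_tokens = ['&quot;', '#', 'COVID19', 'days', '.', '&quot;']

# html.unescape turns HTML entities such as &quot; back into plain characters.
plain_tokens = [html.unescape(tok) for tok in moses_style_tokens]
print(plain_tokens)
```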


After tokenization, we can create a vocabulary object that maps string tokens to integers, which can then be used as the input to a neural network.

In [45]:
from collections import Counter
from gluonnlp.data.vocab import Vocab
non_duplicate_tokens = list(Counter(tokens))
vocab = Vocab(non_duplicate_tokens)
print(vocab)
print(vocab.all_tokens)
print(vocab.non_special_tokens)
print(vocab.special_tokens)

Vocab(size=14, unk_token="<unk>")
['&quot;', '#', 'COVID19', 'cases', 'in', 'Sunnyvale', 'declined', 'over', 'the', 'last', '7', 'days', '.', '<unk>']
['&quot;', '#', 'COVID19', 'cases', 'in', 'Sunnyvale', 'declined', 'over', 'the', 'last', '7', 'days', '.']
['<unk>']


We can use the vocabulary to convert string tokens to integers. The ids follow the order of `vocab.all_tokens`. The `unk_token` is a special token that stands in for any input not found in the vocabulary.

In [55]:
for token in tokens:
    print('{} --> {}'.format(token, vocab[token]))
print()
print('unk_token =', vocab.unk_token, ', unk_id =', vocab.unk_id)
print('🤣 --> {}'.format(vocab['🤣']))

&quot; --> 0
# --> 1
COVID19 --> 2
cases --> 3
in --> 4
Sunnyvale --> 5
declined --> 6
over --> 7
the --> 8
last --> 9
7 --> 10
days --> 11
. --> 12
&quot; --> 0

unk_token = <unk> , unk_id = 13
🤣 --> 13


We can attach the vocabulary to the `MosesTokenizer` and call `encode(..., int)` to map the sentence directly to a sequence of integers.

In [47]:
moses_tokenizer.set_vocab(vocab)
moses_tokenizer.encode(sentence, int)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 0]

### Subword Tokenization

The idea of **Subword Tokenization** is widely adopted in state-of-the-art pretrained models. For example, BERT used the [WordPiece](https://arxiv.org/pdf/1609.08144.pdf) subword tokenization algorithm. Before explaining the meaning of **subword**, let's first load the tokenizer of the BERT-cased model and see the output.

In [48]:
_, _, tokenizer, _, _ = get_backbone('google_en_cased_bert_base')
tokenizer.encode(sentence, str)

['"',
 '#',
 'CO',
 '##VI',
 '##D',
 '##19',
 'cases',
 'in',
 'Sunny',
 '##vale',
 'declined',
 'over',
 'the',
 'last',
 '7',
 'days',
 '.',
 '"']

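Compared with the output of the `MosesTokenizer`, COVID19 is split into **CO, ##VI, ##D, ##19**, and Sunnyvale is split into **Sunny, ##vale**. Each piece is a **subword**, and the `##` prefix marks a piece that continues the same word rather than starting a new one.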

Splitting words into subwords helps keep the vocabulary small, because many words share prefixes or suffixes, for example "dog", "dogs", and "dogcatcher", or "boyfriend" and "girlfriend".
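
To show how such a split can arise, here is a minimal sketch of greedy longest-match-first subword splitting, the matching scheme commonly described for WordPiece at inference time. It is not GluonNLP's implementation: `wordpiece_split` and `toy_subword_vocab` are hypothetical names, and the tiny vocabulary is hand-picked so that the examples from this section split the same way.

```python
def wordpiece_split(word, vocab, unk_token='[UNK]'):
    """Greedy longest-match-first split of a single word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the ## prefix.
                candidate = '##' + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece matches at this position: the whole word becomes unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces

# Hand-picked toy vocabulary (an assumption for illustration only).
toy_subword_vocab = {'CO', '##VI', '##D', '##19', 'Sunny', '##vale',
                     'dog', '##s', '##catcher'}

for word in ['COVID19', 'Sunnyvale', 'dogs', 'dogcatcher', 'cat']:
    print(word, '-->', wordpiece_split(word, toy_subword_vocab))
```

With this toy vocabulary, "dogs" and "dogcatcher" both reuse the piece `dog`, which is the sharing effect described above, while "cat" has no matching pieces and falls back to the unknown token.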

We can also access the vocabulary of the WordPiece tokenizer used in BERT:

In [49]:
tokenizer.vocab

Vocab(size=28996, unk_token="[UNK]", pad_token="[PAD]", cls_token="[CLS]", sep_token="[SEP]", mask_token="[MASK]")

BERT uses the special **[CLS]** and **[SEP]** tokens. We can fetch the string and id of these tokens via `vocab.cls_token`, `vocab.cls_id`, `vocab.sep_token`, and `vocab.sep_id`. There is also an unknown token, **[UNK]**, which replaces inputs that the vocabulary cannot represent, such as the emoji in the next cell.

In [59]:
print('cls_token = ', tokenizer.vocab.cls_token, ', cls_id = ', tokenizer.vocab.cls_id)
print('sep_token = ', tokenizer.vocab.sep_token, ', sep_id = ', tokenizer.vocab.sep_id)
print(tokenizer.encode('😁 means smile'))

cls_token =  [CLS] , cls_id =  101
sep_token =  [SEP] , sep_id =  102
['[UNK]', 'means', 'smile']


To prepare the input for the BERT model, we need to prepend the CLS token and append the SEP token. We can do this as follows:

In [61]:
bert_token_input = [tokenizer.vocab.cls_id] + tokenizer.encode(sentence, int) + [tokenizer.vocab.sep_id]
print(bert_token_input)

[101, 107, 108, 18732, 23314, 2137, 16382, 2740, 1107, 17321, 18236, 5799, 1166, 1103, 1314, 128, 1552, 119, 107, 102]


To facilitate span extraction applications, GluonNLP tokenizers support `encode_with_offsets`, which returns the character-level offsets of each token in the input sentence along with the tokens themselves.

In [68]:
encoded_tokens, offsets = tokenizer.encode_with_offsets(sentence, str)
print(encoded_tokens, offsets)

['"', '#', 'CO', '##VI', '##D', '##19', 'cases', 'in', 'Sunny', '##vale', 'declined', 'over', 'the', 'last', '7', 'days', '.', '"'] [(0, 1), (1, 2), (2, 4), (4, 6), (6, 7), (7, 9), (10, 15), (16, 18), (19, 24), (24, 28), (29, 37), (38, 42), (43, 46), (47, 51), (52, 53), (54, 58), (58, 59), (59, 60)]


In [74]:
for token, offset in zip(encoded_tokens, offsets):
    print('token = {}, sentence[{}:{}] = {}'.format(token, offset[0], offset[1], sentence[offset[0]:offset[1]]))

token = ", sentence[0:1] = "
token = #, sentence[1:2] = #
token = CO, sentence[2:4] = CO
token = ##VI, sentence[4:6] = VI
token = ##D, sentence[6:7] = D
token = ##19, sentence[7:9] = 19
token = cases, sentence[10:15] = cases
token = in, sentence[16:18] = in
token = Sunny, sentence[19:24] = Sunny
token = ##vale, sentence[24:28] = vale
token = declined, sentence[29:37] = declined
token = over, sentence[38:42] = over
token = the, sentence[43:46] = the
token = last, sentence[47:51] = last
token = 7, sentence[52:53] = 7
token = days, sentence[54:58] = days
token = ., sentence[58:59] = .
token = ", sentence[59:60] = "
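
One way these offsets help with span extraction is to convert a character-level answer span into token indices. The following is a minimal sketch under stated assumptions: `char_span_to_token_span` is a hypothetical helper (not part of GluonNLP), the offsets are copied by hand from the output above, and every offset is treated as a half-open `(start, end)` character range.

```python
def char_span_to_token_span(offsets, char_start, char_end):
    """Return (first, last) indices of tokens overlapping [char_start, char_end)."""
    overlapping = [i for i, (tok_start, tok_end) in enumerate(offsets)
                   if tok_start < char_end and tok_end > char_start]
    if not overlapping:
        return None
    return overlapping[0], overlapping[-1]

example_sentence = '"#COVID19 cases in Sunnyvale declined over the last 7 days."'
# Offsets copied from the encode_with_offsets output shown above.
example_offsets = [(0, 1), (1, 2), (2, 4), (4, 6), (6, 7), (7, 9), (10, 15),
                   (16, 18), (19, 24), (24, 28), (29, 37), (38, 42), (43, 46),
                   (47, 51), (52, 53), (54, 58), (58, 59), (59, 60)]

# Locate the answer text in the sentence at the character level.
answer = 'Sunnyvale'
answer_start = example_sentence.index(answer)
answer_end = answer_start + len(answer)

token_span = char_span_to_token_span(example_offsets, answer_start, answer_end)
print(token_span)
```

Here the answer "Sunnyvale" covers characters 19 to 28, which correspond to the two subword tokens `Sunny` and `##vale` at indices 8 and 9.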


We can also map the sentence directly to a list of integers, together with the offsets, by calling `encode_with_offsets(sentence, int)`.

In [76]:
tokenizer.encode_with_offsets(sentence, int)

([107,
  108,
  18732,
  23314,
  2137,
  16382,
  2740,
  1107,
  17321,
  18236,
  5799,
  1166,
  1103,
  1314,
  128,
  1552,
  119,
  107],
 [(0, 1),
  (1, 2),
  (2, 4),
  (4, 6),
  (6, 7),
  (7, 9),
  (10, 15),
  (16, 18),
  (19, 24),
  (24, 28),
  (29, 37),
  (38, 42),
  (43, 46),
  (47, 51),
  (52, 53),
  (54, 58),
  (58, 59),
  (59, 60)])