In [29]:
# !pip install tokenizers

# Tokenization pipeline

When calling `Tokenizer.encode` or `Tokenizer.encode_batch`, the input text(s) go through the following pipeline:

- normalization
- pre-tokenization
- model
- post-processing

## Normalization

`Normalization` is a set of operations you apply to a raw string to make it less random or “cleaner”.

In [30]:
from tokenizers import normalizers

### NFD, NFKD, NFC, NFKC

Unicode normalization is the process used to ensure that equivalent Unicode text is represented in a uniform, consistent way. Unicode allows the same character to be represented in several different ways, and normalization makes text processing and comparison more reliable and predictable.

In [31]:
nfd = normalizers.NFD()
nfd.normalize_str("Chào em cô gái Lam Hồng")

'Chào em cô gái Lam Hồng'

In [32]:
nfkd = normalizers.NFKD()
nfkd.normalize_str("Chào em cô gái Lam Hồng")

'Chào em cô gái Lam Hồng'
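The outputs above look identical to the input because composed and decomposed forms render the same way on screen. The following is a minimal sketch that uses Python's standard `unicodedata` module rather than the `tokenizers` normalizers (both apply the same Unicode normalization forms) to make the difference visible by counting code points:

```python
import unicodedata

text = "Chào em cô gái Lam Hồng"

# NFC composes a base letter and its accents into a single code point where possible.
nfc = unicodedata.normalize("NFC", text)
# NFD decomposes accented letters into a base letter followed by combining marks.
nfd = unicodedata.normalize("NFD", text)

print(len(nfc), len(nfd))  # 23 28
print(nfc == nfd)          # False: same rendering, different code points
print(unicodedata.normalize("NFC", nfd) == nfc)  # True: the forms are equivalent

# The compatibility forms (NFKC/NFKD) also replace characters such as the
# "fi" ligature (U+FB01) with their plain equivalents.
ligature = unicodedata.normalize("NFKD", "\ufb01")
print(ligature)  # fi
```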

### Lowercase

Replaces all uppercase characters with their lowercase equivalents.

In [33]:
normalizers.Lowercase().normalize_str("Chào em cô gái Lam Hồng")

'chào em cô gái lam hồng'

### Strip

Removes all whitespace characters on the specified sides (left, right or both) of the input

In [34]:
normalizers.Strip().normalize_str("   Chào em cô gái Lam Hồng   ")

'Chào em cô gái Lam Hồng'

### StripAccents

Removes all accent symbols in unicode (to be used with NFD for consistency)

In [35]:
from tokenizers.normalizers import NFD, StripAccents
normalizer = normalizers.Sequence([NFD(), StripAccents()])

normalizer.normalize_str("Chào em cô gái Lam Hồng")

'Chao em co gai Lam Hong'
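To see why `StripAccents` is paired with `NFD`, here is a plain-Python sketch of the same idea. The helper `strip_accents` is hypothetical and only approximates the library's behavior by dropping characters whose Unicode category starts with `M` (marks). On composed text there are no separate marks to drop, so nothing changes until the text has been decomposed:

```python
import unicodedata

def strip_accents(text):
    """Drop combining marks (Unicode categories Mn, Mc, Me) from the text."""
    return "".join(ch for ch in text if not unicodedata.category(ch).startswith("M"))

composed = unicodedata.normalize("NFC", "Chào em cô gái Lam Hồng")
decomposed = unicodedata.normalize("NFD", composed)

# Without decomposition, each accented letter is a single code point, so it survives.
without_nfd = strip_accents(composed)
# After NFD, accents are separate combining marks and can be filtered out.
with_nfd = strip_accents(decomposed)

print(without_nfd == composed)  # True
print(with_nfd)                 # Chao em co gai Lam Hong
```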

### Replace

Replaces every match of a custom string or regular expression with the given content.

In [36]:
from tokenizers import Regex

normalizer = normalizers.Replace('123', '')

normalizer.normalize_str("Chào em cô gái Lam Hồng 123")

'Chào em cô gái Lam Hồng '

## Pre-tokenizers

The PreTokenizer takes care of splitting the input according to a set of rules. 

You can easily combine multiple PreTokenizers using a Sequence.

In [37]:
from tokenizers import pre_tokenizers

In [38]:
from tokenizers.normalizers import NFD, Lowercase
normalizer = normalizers.Sequence([NFD(), Lowercase()])

text_normalized = normalizer.normalize_str("Chào em cô gái Lam Hồng!")
text_normalized

'chào em cô gái lam hồng!'

### ByteLevel

Splits on whitespace while remapping all the bytes to a set of visible characters. This technique was introduced by OpenAI with GPT-2 and has some more or less nice properties:

- Since it maps on bytes, a tokenizer using this only requires 256 characters as initial alphabet (the number of values a byte can have), as opposed to the 130,000+ Unicode characters.
- A consequence of the previous point is that it is absolutely unnecessary to have an unknown token using this since we can represent anything with 256 tokens (Youhou!! 🎉🎉)
- For non ascii characters, it gets completely unreadable, but it works nonetheless!

Example:
- Input: "Hello my friend, how are you?"
- Output: "Hello", "Ġmy", "Ġfriend", ",", "Ġhow", "Ġare", "Ġyou", "?"

In [39]:
from tokenizers.pre_tokenizers import ByteLevel

pre_tokenizer = ByteLevel(add_prefix_space=True, use_regex=True)
pre_tokenizer.pre_tokenize_str(text_normalized)

[('Ġcha', (0, 3)),
 ('ÌĢ', (3, 4)),
 ('o', (4, 5)),
 ('Ġem', (5, 8)),
 ('Ġco', (8, 11)),
 ('ÌĤ', (11, 12)),
 ('Ġga', (12, 15)),
 ('Ìģ', (15, 16)),
 ('i', (16, 17)),
 ('Ġlam', (17, 21)),
 ('Ġho', (21, 24)),
 ('ÌĤÌĢ', (24, 26)),
 ('ng', (26, 28)),
 ('!', (28, 29))]
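The odd-looking pieces such as `'ÌĢ'` in the output above are the UTF-8 bytes of combining accents, shown through the byte-to-character table. The sketch below rebuilds a table of that kind in plain Python, modeled on the mapping used by GPT-2's byte-level BPE. The helper names `bytes_to_unicode` and `to_visible` are illustrative and are not part of the `tokenizers` API:

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable Unicode character."""
    # Bytes that already correspond to printable characters keep their own code point.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    mapping = {b: chr(b) for b in printable}
    # The remaining bytes (control characters, space, ...) are shifted to
    # code points 256 and up so that they become visible too.
    offset = 0
    for b in range(256):
        if b not in mapping:
            mapping[b] = chr(256 + offset)
            offset += 1
    return mapping

BYTE_TO_CHAR = bytes_to_unicode()

def to_visible(text):
    """Encode text as UTF-8 and render every byte as its visible character."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))

print(len(BYTE_TO_CHAR))     # 256
print(to_visible(" my"))     # Ġmy  (the space byte becomes Ġ)
print(to_visible("\u0300"))  # ÌĢ   (combining grave accent, bytes 0xCC 0x80)
```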

### Whitespace

Splits on word boundaries, using the following regular expression: `\w+|[^\w\s]+`

Example:
- Input: "Hello there!"
- Output: "Hello", "there", "!"

In [40]:
from tokenizers.pre_tokenizers import Whitespace

pre_tokenizer = Whitespace()
pre_tokenizer.pre_tokenize_str(text_normalized)

[('chào', (0, 5)),
 ('em', (6, 8)),
 ('cô', (9, 12)),
 ('gái', (13, 17)),
 ('lam', (18, 21)),
 ('hồng', (22, 28)),
 ('!', (28, 29))]
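The regular expression can be tried directly with Python's `re` module. This is only a sketch of the splitting rule, with a hypothetical helper name, and Python's `\w` may treat combining marks differently from the library's regex engine, so the example sticks to plain ASCII input:

```python
import re

# Runs of word characters, or runs of characters that are neither word
# characters nor whitespace (typically punctuation).
WHITESPACE_PATTERN = re.compile(r"\w+|[^\w\s]+")

def whitespace_pre_tokenize(text):
    """Return (piece, (start, end)) pairs, mirroring the output shape shown above."""
    return [(m.group(), m.span()) for m in WHITESPACE_PATTERN.finditer(text)]

pieces = whitespace_pre_tokenize("Hello there!")
print(pieces)
# [('Hello', (0, 5)), ('there', (6, 11)), ('!', (11, 12))]
```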

### WhitespaceSplit

Splits on any whitespace character

Example:
- Input: "Hello there!"
- Output: "Hello", "there!"

In [41]:
from tokenizers.pre_tokenizers import WhitespaceSplit

pre_tokenizer = WhitespaceSplit()
pre_tokenizer.pre_tokenize_str(text_normalized)

[('chào', (0, 5)),
 ('em', (6, 8)),
 ('cô', (9, 12)),
 ('gái', (13, 17)),
 ('lam', (18, 21)),
 ('hồng!', (22, 29))]

### Punctuation

Will isolate all punctuation characters

`Punctuation( behavior = 'isolated' )`

**behavior** (SplitDelimiterBehavior) — The behavior to use when splitting. Choices: “removed”, “isolated” (default), “merged_with_previous”, “merged_with_next”, “contiguous”

Example:

- Input: "Hello?"
- Output: "Hello", "?"

In [42]:
from tokenizers.pre_tokenizers import Punctuation

pre_tokenizer = Punctuation()
pre_tokenizer.pre_tokenize_str(text_normalized)

[('chào em cô gái lam hồng', (0, 28)), ('!', (28, 29))]

### Metaspace

`( replacement = '▁', add_prefix_space = True )`


Splits on whitespaces and replaces them with a special char “▁” (U+2581)

Example: 
- Input: "Hello there"
- Output: "Hello", "▁there"

In [44]:
from tokenizers.pre_tokenizers import Metaspace

pre_tokenizer = Metaspace(add_prefix_space = False)
pre_tokenizer.pre_tokenize_str(text_normalized)

[('chào', (0, 5)),
 ('▁em', (5, 8)),
 ('▁cô', (8, 12)),
 ('▁gái', (12, 17)),
 ('▁lam', (17, 21)),
 ('▁hồng!', (21, 29))]

### CharDelimiterSplit

Splits on a given character

Example with `x`:
- Input: "Helloxthere"
- Output: "Hello", "there"

In [47]:
from tokenizers.pre_tokenizers import CharDelimiterSplit 

pre_tokenizer = CharDelimiterSplit('_')
pre_tokenizer.pre_tokenize_str("hello_world")

[('hello', (0, 5)), ('world', (6, 11))]

### Digits

`( individual_digits = False )`

Splits the numbers from any other characters.

- Input: "Hello123there"
- Output: "Hello", "123", "there"

In [48]:
from tokenizers.pre_tokenizers import Digits
pre_tokenizer = Digits(individual_digits = True)
pre_tokenizer.pre_tokenize_str("hello world 123")

[('hello world ', (0, 12)), ('1', (12, 13)), ('2', (13, 14)), ('3', (14, 15))]

In [49]:
from tokenizers.pre_tokenizers import Digits
pre_tokenizer = Digits(individual_digits = False)
pre_tokenizer.pre_tokenize_str("hello world 123")

[('hello world ', (0, 12)), ('123', (12, 15))]

### Split

`( pattern, behavior, invert = False )`

A versatile pre-tokenizer that splits on the provided pattern according to the provided behavior. The pattern can be inverted if necessary.

pattern should be either a custom string or regexp.

behavior should be one of:
- removed
- isolated
- merged_with_previous
- merged_with_next
- contiguous


invert should be a boolean flag.

Example with `pattern = " "`, `behavior = "isolated"`, `invert = False`:
- Input: "Hello, how are you?"
- Output: "Hello,", " ", "how", " ", "are", " ", "you?"
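The example above can be reproduced with a small plain-Python sketch. `split_sketch` is a hypothetical helper, not the library's `Split` class, and it covers only the `"isolated"` and `"removed"` behaviors on a literal string pattern:

```python
import re

def split_sketch(text, pattern, behavior="isolated"):
    """Split `text` on a literal `pattern`, keeping or dropping the delimiters."""
    pieces = []  # (substring, is_delimiter) pairs in document order
    position = 0
    for match in re.finditer(re.escape(pattern), text):
        if match.start() > position:
            pieces.append((text[position:match.start()], False))
        pieces.append((match.group(), True))
        position = match.end()
    if position < len(text):
        pieces.append((text[position:], False))

    if behavior == "isolated":
        # Every delimiter becomes its own piece.
        return [piece for piece, _ in pieces]
    if behavior == "removed":
        # Delimiters are dropped from the output.
        return [piece for piece, is_delimiter in pieces if not is_delimiter]
    raise ValueError(f"behavior {behavior!r} is not covered by this sketch")

isolated = split_sketch("Hello, how are you?", " ", "isolated")
removed = split_sketch("Hello, how are you?", " ", "removed")
print(isolated)  # ['Hello,', ' ', 'how', ' ', 'are', ' ', 'you?']
print(removed)   # ['Hello,', 'how', 'are', 'you?']
```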

### Sequence

Lets you compose multiple PreTokenizers that will be run in the given order.

`Sequence([Punctuation(), WhitespaceSplit()])`

In [50]:
from tokenizers import pre_tokenizers
from tokenizers.pre_tokenizers import Digits
pre_tokenizer = pre_tokenizers.Sequence([Whitespace(), Digits(individual_digits=True)])
pre_tokenized_str = pre_tokenizer.pre_tokenize_str(text_normalized)
pre_tokenized_str

[('chào', (0, 5)),
 ('em', (6, 8)),
 ('cô', (9, 12)),
 ('gái', (13, 17)),
 ('lam', (18, 21)),
 ('hồng', (22, 28)),
 ('!', (28, 29))]

## Models

- `models.BPE`
- `models.Unigram`
- `models.WordLevel`
- `models.WordPiece`

## Post-Processors

After the whole pipeline, we sometimes want to insert some special tokens before feeding a tokenized string into a model, as in "`[CLS]` My horse is amazing `[SEP]`". The PostProcessor is the component that does just that.

### TemplateProcessing

Example, when specifying a template with these values:

- single: "[CLS] $A [SEP]"
- pair: "[CLS] $A [SEP] $B [SEP]"
- special tokens:
    - "[CLS]"
    - "[SEP]"

----
- Input: ("I like this", "but not this")
- Output: "[CLS] I like this [SEP] but not this [SEP]"

----

```python
from tokenizers.processors import TemplateProcessing
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[("[CLS]", 1), ("[SEP]", 2)],
)
```
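To make the template syntax concrete, here is a plain-Python sketch of how such a template could be expanded into tokens and type ids. `apply_template` is a hypothetical helper, not the library's implementation, and it assumes that a piece without an explicit `:N` suffix gets type id 0:

```python
def apply_template(template, first, second=None):
    """Expand a template such as "[CLS] $A [SEP] $B:1 [SEP]:1" into tokens and type ids."""
    tokens, type_ids = [], []
    for piece in template.split():
        # "$B:1" -> name "$B", type id 1; "[CLS]" -> name "[CLS]", type id 0 (assumed default)
        name, _, suffix = piece.partition(":")
        type_id = int(suffix) if suffix else 0
        if name == "$A":
            new_tokens = first
        elif name == "$B":
            if second is None:
                raise ValueError("template uses $B but no second sequence was given")
            new_tokens = second
        else:
            new_tokens = [name]  # a special token such as [CLS] or [SEP]
        tokens.extend(new_tokens)
        type_ids.extend([type_id] * len(new_tokens))
    return tokens, type_ids

tokens, type_ids = apply_template(
    "[CLS] $A [SEP] $B:1 [SEP]:1",
    ["I", "like", "this"],
    ["but", "not", "this"],
)
print(tokens)
# ['[CLS]', 'I', 'like', 'this', '[SEP]', 'but', 'not', 'this', '[SEP]']
print(type_ids)
# [0, 0, 0, 0, 0, 1, 1, 1, 1]
```

The type ids are what let a model such as BERT tell the first sequence from the second.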

## All together: a BERT tokenizer from scratch

First, BERT relies on WordPiece, so we instantiate a new Tokenizer with this model:

In [53]:
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
bert_tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))

Then we know that BERT preprocesses texts by removing accents and lowercasing. We also use a unicode normalizer:

In [55]:
from tokenizers import normalizers
from tokenizers.normalizers import NFD, Lowercase, StripAccents
bert_tokenizer.normalizer = normalizers.Sequence([NFD(), Lowercase(), StripAccents()])

The pre-tokenizer is just splitting on whitespace and punctuation:

In [56]:
from tokenizers.pre_tokenizers import Whitespace
bert_tokenizer.pre_tokenizer = Whitespace()

And the post-processing uses the template we saw in the previous section:

In [54]:
from tokenizers.processors import TemplateProcessing
bert_tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[
        ("[CLS]", 1),
        ("[SEP]", 2),
    ],
)

We can train this tokenizer on wikitext, as in the quicktour (the raw WikiText-103 files are expected under `data/wikitext-103-raw/`):

In [58]:
from tokenizers.trainers import WordPieceTrainer
trainer = WordPieceTrainer(vocab_size=30522, special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
files = [f"data/wikitext-103-raw/wiki.{split}.raw" for split in ["test", "train", "valid"]]
bert_tokenizer.train(files, trainer)
bert_tokenizer.save("data/bert-wiki.json")

Exception: No such file or directory (os error 2)

## Decoding

In [None]:
output = tokenizer.encode("Hello, y'all! How are you 😁 ?")
print(output.ids)
# [1, 27253, 16, 93, 11, 5097, 5, 7961, 5112, 6218, 0, 35, 2]
tokenizer.decode([1, 27253, 16, 93, 11, 5097, 5, 7961, 5112, 6218, 0, 35, 2])
# "Hello , y ' all ! How are you ?"

If you used a model that adds special characters to represent subtokens of a given “word” (like the `"##"` in WordPiece), you will need to customize the decoder to treat them properly. If we take our previous `bert_tokenizer`, for instance, the default decoding gives:

In [None]:
output = bert_tokenizer.encode("Welcome to the 🤗 Tokenizers library.")
print(output.tokens)
# ["[CLS]", "welcome", "to", "the", "[UNK]", "tok", "##eni", "##zer", "##s", "library", ".", "[SEP]"]
bert_tokenizer.decode(output.ids)
# "welcome to the tok ##eni ##zer ##s library ."

But by changing it to a proper decoder, we get:

In [None]:
from tokenizers import decoders
bert_tokenizer.decoder = decoders.WordPiece()
bert_tokenizer.decode(output.ids)
# "welcome to the tokenizers library."
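As a rough picture of what a WordPiece decoder has to do, the sketch below glues `##` continuation pieces onto the preceding token and then removes the space before a few punctuation marks. `wordpiece_decode` is a hypothetical helper with a deliberately simplified cleanup step, not the logic of `decoders.WordPiece`, and it assumes special tokens have already been dropped:

```python
def wordpiece_decode(tokens, prefix="##"):
    """Join WordPiece tokens back into text, merging continuation pieces."""
    words = []
    for token in tokens:
        if token.startswith(prefix) and words:
            # A continuation piece: attach it to the previous word without a space.
            words[-1] += token[len(prefix):]
        else:
            words.append(token)
    text = " ".join(words)
    # Simplified cleanup: drop the space that precedes common punctuation.
    for mark in (".", ",", "!", "?"):
        text = text.replace(" " + mark, mark)
    return text

pieces = ["welcome", "to", "the", "tok", "##eni", "##zer", "##s", "library", "."]
decoded = wordpiece_decode(pieces)
print(decoded)  # welcome to the tokenizers library.
```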