In [2]:
pip install transformers datasets

Collecting datasets
  Downloading datasets-2.16.1-py3-none-any.whl (507 kB)
Collecting dill<0.3.8,>=0.3.0 (from datasets)
  Downloading dill-0.3.7-py3-none-any.whl (115 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.15-py310-none-any.whl (134 kB)
Installing collected packages: dill, multiprocess, datasets
Successfully installed datasets-2.16.1 dill-0.3.7 multiprocess-0.70.15


In [47]:
from datasets import load_dataset

# Load AG News dataset
ag_news_dataset = load_dataset("ag_news", split = "train[:15000]")

ag_news_dataset

Dataset({
    features: ['text', 'label'],
    num_rows: 15000
})

In [48]:
text_samples = [example['text'] for example in ag_news_dataset]

text_samples[:5]

["Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\\band of ultra-cynics, are seeing green again.",
 'Carlyle Looks Toward Commercial Aerospace (Reuters) Reuters - Private investment firm Carlyle Group,\\which has a reputation for making well-timed and occasionally\\controversial plays in the defense industry, has quietly placed\\its bets on another part of the market.',
 "Oil and Economy Cloud Stocks' Outlook (Reuters) Reuters - Soaring crude prices plus worries\\about the economy and the outlook for earnings are expected to\\hang over the stock market next week during the depth of the\\summer doldrums.",
 'Iraq Halts Oil Exports from Main Southern Pipeline (Reuters) Reuters - Authorities have halted oil export\\flows from the main pipeline in southern Iraq after\\intelligence showed a rebel militia could strike\\infrastructure, an oil official said on Saturday.',
 'Oil prices soar to all-time record, posing new menace to US economy (A

Doc link: https://huggingface.co/learn/nlp-course/chapter6/8?fw=pt#building-a-wordpiece-tokenizer-from-scratch

List out all huggingface datasets

To build a tokenizer with the 🤗 Tokenizers library, we start by instantiating a Tokenizer object with a model, then set its normalizer, pre_tokenizer, post_processor, and decoder attributes to the values we want.

More precisely, the library is built around a central Tokenizer class with the building blocks regrouped in submodules:

- normalizers contains all the possible types of Normalizer you can use.
- pre_tokenizers contains all the possible types of PreTokenizer you can use.
- models contains the various types of Model you can use, like BPE, WordPiece, and Unigram.
- trainers contains all the different types of Trainer you can use to train your model on a corpus (one per type of model).
- processors contains the various types of PostProcessor you can use (this is the submodule imported below; the attribute on the Tokenizer is called post_processor).
- decoders contains the various types of Decoder you can use to decode the outputs of tokenization.

For this example, we’ll create a Tokenizer with a WordPiece model:

In [13]:
from tokenizers import (
    decoders,
    models,
    normalizers,
    pre_tokenizers,
    processors,
    trainers,
    Tokenizer,
)

tokenizer = Tokenizer(models.WordPiece(unk_token = "[UNK]"))

tokenizer

<tokenizers.Tokenizer at 0x5994eefbe7a0>

We have to specify the unk_token so the model knows what to return when it encounters characters it hasn’t seen before. Other arguments we can set here include the vocab of our model (we’re going to train the model, so we don’t need to set this) and max_input_chars_per_word, which specifies a maximum length for each word (a word longer than the value passed is not broken into subwords; it is mapped to the unknown token instead).
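To make the role of unk_token and max_input_chars_per_word more concrete, here is a minimal pure-Python sketch of greedy longest-match-first WordPiece splitting for a single word. It is an illustration only, not the library's implementation: the function name `wordpiece_tokenize` and the toy vocabulary are hypothetical, and the handling of over-long words follows the BERT reference behaviour (return the unknown token), which is an assumption about how the option is applied.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", prefix="##",
                       max_input_chars_per_word=100):
    """Greedy longest-match-first split of one word (illustrative sketch)."""
    # Assumption: over-long words are mapped to the unknown token.
    if len(word) > max_input_chars_per_word:
        return [unk_token]

    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # continuation pieces carry the prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches at this position: the whole word becomes unknown.
            return [unk_token]
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical toy vocabulary, chosen to mirror the pieces seen later in this notebook.
toy_vocab = {"to", "##ken", "##ization", "step"}

print(wordpiece_tokenize("tokenization", toy_vocab))  # ['to', '##ken', '##ization']
print(wordpiece_tokenize("stepz", toy_vocab))         # ['[UNK]']
print(wordpiece_tokenize("tokenization", toy_vocab, max_input_chars_per_word=5))  # ['[UNK]']
```

Note that a single unmatched position makes the entire word unknown, which is why a vocabulary that covers individual characters matters in practice.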

The first step of tokenization is normalization, so let’s begin with that. Since BERT is widely used, there is a BertNormalizer with the classic options we can set for BERT: lowercase and strip_accents, which are self-explanatory; clean_text to remove all control characters and replace repeating spaces with a single one; and handle_chinese_chars, which places spaces around Chinese characters. To replicate the bert-base-uncased tokenizer, we can just set this normalizer (the cells below run it first with lowercase = False, then with lowercase = True).

When the normalizer is instead assembled by hand from Lowercase and StripAccents, as done further below, an NFD Unicode normalizer has to come first, as otherwise the StripAccents normalizer won’t properly recognize the accented characters and thus won’t strip them out.
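The dependency between NFD and accent stripping can be checked with the standard library alone. The sketch below assumes that stripping accents amounts to dropping combining marks (Unicode category "Mn"); the helper name `strip_accents` is hypothetical and only mimics the idea behind the StripAccents normalizer.

```python
import unicodedata


def strip_accents(text):
    """Drop combining marks (Unicode category 'Mn'); illustrative helper."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


# Force the precomposed (NFC) form, where "è" and "é" are single code points.
text = unicodedata.normalize("NFC", "pièce de résistance")

# Without NFD there is no separate combining mark to remove.
print(strip_accents(text))                                # pièce de résistance

# NFD splits "è" into "e" plus a combining grave accent, which can then be dropped.
print(strip_accents(unicodedata.normalize("NFD", text)))  # piece de resistance
```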

In [17]:
tokenizer.normalizer = normalizers.BertNormalizer(lowercase = False)

tokenizer.normalizer.normalize_str("The pièce de résistance was the chef's Special Dessert.")

"The pièce de résistance was the chef's Special Dessert."

In [27]:
tokenizer.normalizer = normalizers.BertNormalizer(lowercase = True)

tokenizer.normalizer.normalize_str("The pièce de résistance was the chef's Special Dessert.")

"the piece de resistance was the chef's special dessert."

Generally speaking, however, when building a new tokenizer you won’t have access to such a handy normalizer already implemented in the 🤗 Tokenizers library, so let’s see how to create the BERT normalizer by hand. The library provides a Lowercase normalizer and a StripAccents normalizer, and you can compose several normalizers using a Sequence:

In [25]:
tokenizer.normalizer = normalizers.Sequence(
    [normalizers.NFD(), normalizers.Lowercase(), normalizers.StripAccents()]
)

tokenizer.normalizer.normalize_str("  The pièce de résistance was the chef's Special Dessert.   ")

"  the piece de resistance was the chef's special dessert.   "

In [26]:
tokenizer.normalizer = normalizers.Sequence(
    [normalizers.Strip(), normalizers.NFD(), normalizers.Lowercase(), normalizers.StripAccents()]
)

tokenizer.normalizer.normalize_str("   The pièce de résistance was the chef's Special Dessert.   ")

"the piece de resistance was the chef's special dessert."

Next is the pre-tokenization step. Again, there is a prebuilt BertPreTokenizer that we can use:

In [20]:
tokenizer.pre_tokenizer = pre_tokenizers.BertPreTokenizer()

tokenizer.pre_tokenizer.pre_tokenize_str("We're checking pre-tokenization step.")

[('We', (0, 2)),
 ("'", (2, 3)),
 ('re', (3, 5)),
 ('checking', (6, 14)),
 ('pre', (15, 18)),
 ('-', (18, 19)),
 ('tokenization', (19, 31)),
 ('step', (32, 36)),
 ('.', (36, 37))]

Building tokenizer from scratch

The Whitespace pre-tokenizer (pre_tokenizers.Whitespace()) splits on whitespace and on every character that is not a letter, a digit, or the underscore character:

In [21]:
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

tokenizer.pre_tokenizer.pre_tokenize_str("We're checking pre-tokenization step.")

[('We', (0, 2)),
 ("'", (2, 3)),
 ('re', (3, 5)),
 ('checking', (6, 14)),
 ('pre', (15, 18)),
 ('-', (18, 19)),
 ('tokenization', (19, 31)),
 ('step', (32, 36)),
 ('.', (36, 37))]

In [None]:
tokenizer.pre_tokenizer.pre_tokenize_str("We're checking pretokenization step.")

[('We', (0, 2)),
 ("'", (2, 3)),
 ('re', (3, 5)),
 ('checking', (6, 14)),
 ('pretokenization', (15, 30)),
 ('step', (31, 35)),
 ('.', (35, 36))]
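The Whitespace behaviour can be reproduced with a single regular expression. The 🤗 Tokenizers documentation describes this pre-tokenizer as splitting with the pattern `\w+|[^\w\s]+`; the sketch below applies that pattern with Python's `re` module. The function name `whitespace_pre_tokenize` is hypothetical, and Python's regex engine is only assumed to agree with the library on simple ASCII input like this.

```python
import re

# Runs of word characters, or runs of characters that are neither word characters nor whitespace.
WHITESPACE_PATTERN = re.compile(r"\w+|[^\w\s]+")


def whitespace_pre_tokenize(text):
    """Return (piece, (start, end)) pairs, mirroring the shape of pre_tokenize_str output."""
    return [(m.group(), m.span()) for m in WHITESPACE_PATTERN.finditer(text)]


for piece in whitespace_pre_tokenize("We're checking pre-tokenization step."):
    print(piece)
# ('We', (0, 2))
# ("'", (2, 3))
# ('re', (3, 5))
# ('checking', (6, 14))
# ('pre', (15, 18))
# ('-', (18, 19))
# ('tokenization', (19, 31))
# ('step', (32, 36))
# ('.', (36, 37))
```

The offsets are character positions in the original string, so each piece can always be traced back to the text it came from.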

If you only want to split on whitespace, use the WhitespaceSplit pre-tokenizer instead:

In [22]:
pre_tokenizer = pre_tokenizers.WhitespaceSplit()

pre_tokenizer.pre_tokenize_str("We're checking pre-tokenization step.")

[("We're", (0, 5)),
 ('checking', (6, 14)),
 ('pre-tokenization', (15, 31)),
 ('step.', (32, 37))]

Like with normalizers, you can use a Sequence to compose several pre-tokenizers:

In [23]:
pre_tokenizer = pre_tokenizers.Sequence(
    [pre_tokenizers.WhitespaceSplit(), pre_tokenizers.Punctuation()]
)

pre_tokenizer.pre_tokenize_str("We're checking the pre-tokenization step.")

[('We', (0, 2)),
 ("'", (2, 3)),
 ('re', (3, 5)),
 ('checking', (6, 14)),
 ('the', (15, 18)),
 ('pre', (19, 22)),
 ('-', (22, 23)),
 ('tokenization', (23, 35)),
 ('step', (36, 40)),
 ('.', (40, 41))]
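To see what composing the two steps does, here is a rough pure-Python sketch of the same pipeline: split on whitespace first, then isolate each punctuation character inside every piece. Both helper names are hypothetical, and treating "punctuation" as `string.punctuation` (ASCII only) is a simplifying assumption; the library's Punctuation pre-tokenizer also covers Unicode punctuation.

```python
import re
import string


def whitespace_split(text):
    """Step 1: split on whitespace only, keeping character offsets."""
    return [(m.group(), m.span()) for m in re.finditer(r"\S+", text)]


def isolate_punctuation(pieces):
    """Step 2: turn every ASCII punctuation character into its own piece."""
    out = []
    for piece, (start, _end) in pieces:
        run_start = 0  # start of the current run of non-punctuation characters
        for i, ch in enumerate(piece):
            if ch in string.punctuation:
                if run_start < i:
                    out.append((piece[run_start:i], (start + run_start, start + i)))
                out.append((ch, (start + i, start + i + 1)))
                run_start = i + 1
        if run_start < len(piece):
            out.append((piece[run_start:], (start + run_start, start + len(piece))))
    return out


sentence = "We're checking the pre-tokenization step."
for piece in isolate_punctuation(whitespace_split(sentence)):
    print(piece)
# ('We', (0, 2))
# ("'", (2, 3))
# ('re', (3, 5))
# ('checking', (6, 14))
# ('the', (15, 18))
# ('pre', (19, 22))
# ('-', (22, 23))
# ('tokenization', (23, 35))
# ('step', (36, 40))
# ('.', (40, 41))
```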

The next step in the tokenization pipeline is running the inputs through the model. We already specified our model in the initialization, but we still need to train it, which will require a WordPieceTrainer. The main thing to remember when instantiating a trainer in 🤗 Tokenizers is that you need to pass it all the special tokens you intend to use; otherwise it won’t add them to the vocabulary, since they are not in the training corpus:

In [24]:
special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]

trainer = trainers.WordPieceTrainer(vocab_size = 20000, special_tokens = special_tokens, min_frequency = 2)

trainer

<tokenizers.trainers.WordPieceTrainer at 0x7cdc9a7619d0>

As well as specifying the vocab_size and special_tokens, we can set the min_frequency (the number of times a token must appear to be included in the vocabulary) or change the continuing_subword_prefix (if we want to use something different from ##).

To train our model on the list of texts we built earlier (text_samples), we just have to execute this command:

In [28]:
tokenizer.train_from_iterator(text_samples, trainer = trainer)
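Passing the full list works here because 15,000 short texts fit comfortably in memory. For larger corpora, a generator that yields batches of texts is a common alternative, since train_from_iterator accepts an iterator over strings or over lists of strings. The sketch below is one way to write such a generator; the name `batch_iterator` and the batch size of 1000 are arbitrary choices for illustration.

```python
def batch_iterator(texts, batch_size=1000):
    """Yield successive slices of `texts`, each holding at most `batch_size` items."""
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]


# Stand-in corpus so the sketch runs on its own.
demo_texts = [f"sample text {i}" for i in range(2500)]
print([len(batch) for batch in batch_iterator(demo_texts)])  # [1000, 1000, 500]

# With the objects defined in this notebook, the call would look like:
# tokenizer.train_from_iterator(batch_iterator(text_samples), trainer = trainer)
```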

We can now test our tokenizer by calling its encode() method:

In [29]:
encoding = tokenizer.encode("We're checking pre-tokenization step.")

encoding

Encoding(num_tokens=11, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])

Viewing some of its attributes:

In [30]:
print("Encoding_ids :", encoding.ids)

print("Encoding_tokens :", encoding.tokens)

print("Encoding_offsets :", encoding.offsets)

Encoding_ids : [448, 10, 140, 16701, 644, 15, 131, 11612, 2589, 2548, 16]
Encoding_tokens : ['we', "'", 're', 'checking', 'pre', '-', 'to', '##ken', '##ization', 'step', '.']
Encoding_offsets : [(0, 2), (2, 3), (3, 5), (6, 14), (15, 18), (18, 19), (19, 21), (21, 24), (24, 31), (32, 36), (36, 37)]


The encoding obtained is an Encoding, which contains all the necessary outputs of the tokenizer in its various attributes: ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, and overflowing.
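The offsets are (start, end) character positions in the original input, so slicing the input with them recovers the text behind each token. The short check below reuses the offsets printed above. Note that the slices keep the original casing and carry no ## prefix, since they come from the raw string rather than from the vocabulary.

```python
text = "We're checking pre-tokenization step."

# Offsets copied from the Encoding_offsets output above.
offsets = [(0, 2), (2, 3), (3, 5), (6, 14), (15, 18), (18, 19),
           (19, 21), (21, 24), (24, 31), (32, 36), (36, 37)]

spans = [text[start:end] for start, end in offsets]
print(spans)
# ['We', "'", 're', 'checking', 'pre', '-', 'to', 'ken', 'ization', 'step', '.']
```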

The last step in the tokenization pipeline is post-processing. We need to add the [CLS] token at the beginning and the [SEP] token at the end (or after each sentence, if we have a pair of sentences). We will use TemplateProcessing for this, but first we need to know the IDs of the [CLS] and [SEP] tokens in the vocabulary:

In [31]:
cls_token_id = tokenizer.token_to_id("[CLS]")
sep_token_id = tokenizer.token_to_id("[SEP]")

print(cls_token_id, sep_token_id)

2 3


To write the template for TemplateProcessing, we have to specify how to treat a single sentence and a pair of sentences. For both, we write the special tokens we want to use; the first (or single) sentence is represented by `$A`, while the second sentence (if encoding a pair) is represented by `$B`. For each of these (special tokens and sentences), we also specify the corresponding token type ID after a colon.

The classic BERT template is thus defined as follows:

In [32]:
tokenizer.post_processor = processors.TemplateProcessing(
    single = "[CLS]:0 $A:0 [SEP]:0",  # plain strings: there is nothing to interpolate
    pair = "[CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1",
    special_tokens = [("[CLS]", cls_token_id), ("[SEP]", sep_token_id)],
)

Note that we need to pass along the IDs of the special tokens, so the tokenizer can properly convert them to their IDs. Once this is added, going back to our previous example will give:

In [36]:
encoding = tokenizer.encode("We're checking pre-tokenization step.")

print("Encoding_ids :", encoding.ids)
print()
print("Encoding_tokens :", encoding.tokens)
print()
print("Encoding_offsets :", encoding.offsets)
print()
print("Encoding_type_ids :", encoding.type_ids)

Encoding_ids : [2, 448, 10, 140, 16701, 644, 15, 131, 11612, 2589, 2548, 16, 3]

Encoding_tokens : ['[CLS]', 'we', "'", 're', 'checking', 'pre', '-', 'to', '##ken', '##ization', 'step', '.', '[SEP]']

Encoding_offsets : [(0, 0), (0, 2), (2, 3), (3, 5), (6, 14), (15, 18), (18, 19), (19, 21), (21, 24), (24, 31), (32, 36), (36, 37), (0, 0)]

Encoding_type_ids : [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


And on a pair of sentences, we get the proper result:

In [35]:
encoding = tokenizer.encode("We're checking pre-tokenization step.", "on a pair of sentences.")

print("Encoding_ids :", encoding.ids)
print()
print("Encoding_tokens :", encoding.tokens)
print()
print("Encoding_offsets :", encoding.offsets)
print()
print("Encoding_type_ids :", encoding.type_ids)

Encoding_ids : [2, 448, 10, 140, 16701, 644, 15, 131, 11612, 2589, 2548, 16, 3, 150, 34, 3189, 134, 12765, 16, 3]

Encoding_tokens : ['[CLS]', 'we', "'", 're', 'checking', 'pre', '-', 'to', '##ken', '##ization', 'step', '.', '[SEP]', 'on', 'a', 'pair', 'of', 'sentences', '.', '[SEP]']

Encoding_offsets : [(0, 0), (0, 2), (2, 3), (3, 5), (6, 14), (15, 18), (18, 19), (19, 21), (21, 24), (24, 31), (32, 36), (36, 37), (0, 0), (0, 2), (3, 4), (5, 9), (10, 12), (13, 22), (22, 23), (0, 0)]

Encoding_type_ids : [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
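The tokens and type IDs above follow directly from the template. As a plain-Python illustration (not how the library implements it), the function below hard-codes the BERT layout `[CLS] $A [SEP]` with type ID 0 and, for a pair, `$B [SEP]` with type ID 1. The name `apply_bert_template` is hypothetical.

```python
def apply_bert_template(tokens_a, tokens_b=None):
    """Wrap one or two token lists in [CLS]/[SEP] and build the matching type IDs."""
    # Single-sentence part: [CLS]:0 $A:0 [SEP]:0
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    type_ids = [0] * len(tokens)
    if tokens_b is not None:
        # Pair part: $B:1 [SEP]:1
        tokens += tokens_b + ["[SEP]"]
        type_ids += [1] * (len(tokens_b) + 1)
    return tokens, type_ids


tokens, type_ids = apply_bert_template(["we", "'", "re"], ["on", "a", "pair"])
print(tokens)    # ['[CLS]', 'we', "'", 're', '[SEP]', 'on', 'a', 'pair', '[SEP]']
print(type_ids)  # [0, 0, 0, 0, 0, 1, 1, 1, 1]
```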


We’ve almost finished building this tokenizer from scratch; the last step is to include a decoder.
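Conceptually, a WordPiece decoder glues continuation pieces (those starting with the prefix) onto the previous token and joins everything else with spaces. The sketch below shows that idea in plain Python. The function name `wordpiece_decode` is hypothetical, special tokens such as [CLS] and [SEP] are assumed to have been removed already, and the cleanup step is deliberately simplified to handle only a space before a period or comma.

```python
def wordpiece_decode(tokens, prefix="##"):
    """Merge prefixed continuation pieces and join the rest with spaces (sketch)."""
    text = ""
    for i, token in enumerate(tokens):
        if token.startswith(prefix):
            text += token[len(prefix):]   # continuation: attach without a space
        elif i == 0:
            text += token
        else:
            text += " " + token
    # Simplified cleanup (assumption): only remove the space before "." and ",".
    for punct in (".", ","):
        text = text.replace(" " + punct, punct)
    return text


tokens = ["we", "'", "re", "checking", "pre", "-",
          "to", "##ken", "##ization", "step", "."]
print(wordpiece_decode(tokens))  # we ' re checking pre - tokenization step.
```

The apostrophe and hyphen stay separated by spaces because the pre-tokenizer isolated them and nothing in this step puts them back.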

In [37]:
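tokenizer.decoder = decoders.WordPiece(prefix = "##")

# Re-encode the single sentence so the decoded text matches the output shown below
encoding = tokenizer.encode("We're checking pre-tokenization step.")
tokenizer.decode(encoding.ids)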

"we ' re checking pre - tokenization step."

In [39]:
tokenizer.save("word_piece_tokenizer.json")

We can then reload that file in a Tokenizer object with the from_file() method:

In [40]:
new_tokenizer = Tokenizer.from_file("word_piece_tokenizer.json")

new_tokenizer

<tokenizers.Tokenizer at 0x5994eee29ad0>

In [41]:
encoding = new_tokenizer.encode("We're checking pre-tokenization step.", "on a pair of sentences.")

print("Encoding_ids :", encoding.ids)
print()
print("Encoding_tokens :", encoding.tokens)
print()
print("Encoding_offsets :", encoding.offsets)
print()
print("Encoding_type_ids :", encoding.type_ids)

Encoding_ids : [2, 448, 10, 140, 16701, 644, 15, 131, 11612, 2589, 2548, 16, 3, 150, 34, 3189, 134, 12765, 16, 3]

Encoding_tokens : ['[CLS]', 'we', "'", 're', 'checking', 'pre', '-', 'to', '##ken', '##ization', 'step', '.', '[SEP]', 'on', 'a', 'pair', 'of', 'sentences', '.', '[SEP]']

Encoding_offsets : [(0, 0), (0, 2), (2, 3), (3, 5), (6, 14), (15, 18), (18, 19), (19, 21), (21, 24), (24, 31), (32, 36), (36, 37), (0, 0), (0, 2), (3, 4), (5, 9), (10, 12), (13, 22), (22, 23), (0, 0)]

Encoding_type_ids : [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]


To use this tokenizer in 🤗 Transformers, we have to wrap it in a PreTrainedTokenizerFast. We can either use the generic class or, if our tokenizer corresponds to an existing model, use that class (here, BertTokenizerFast). If you apply this lesson to build a brand new tokenizer, you will have to use the first option.

To wrap the tokenizer in a PreTrainedTokenizerFast, we can either pass the tokenizer we built as a tokenizer_object or pass the tokenizer file we saved as tokenizer_file. The key thing to remember is that we have to manually set all the special tokens, since that class can’t infer from the tokenizer object which token is the mask token, the [CLS] token, etc.:

In [49]:
from transformers import PreTrainedTokenizerFast

wrapped_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object = tokenizer,
    unk_token = "[UNK]",
    pad_token = "[PAD]",
    cls_token = "[CLS]",
    sep_token = "[SEP]",
    mask_token = "[MASK]",
)

wrapped_tokenizer.tokenize("We're checking pre-tokenization step.", "on a pair of sentences.")

['we',
 "'",
 're',
 'checking',
 'pre',
 '-',
 'to',
 '##ken',
 '##ization',
 'step',
 '.',
 'on',
 'a',
 'pair',
 'of',
 'sentences',
 '.']

If we are using a specific tokenizer class (like BertTokenizerFast), we only need to specify the special tokens that are different from the default ones (here, none):

In [50]:
from transformers import BertTokenizerFast

wrapped_tokenizer = BertTokenizerFast(tokenizer_object=tokenizer)

wrapped_tokenizer.tokenize("We're checking pre-tokenization step.", "on a pair of sentences.")

['we',
 "'",
 're',
 'checking',
 'pre',
 '-',
 'to',
 '##ken',
 '##ization',
 'step',
 '.',
 'on',
 'a',
 'pair',
 'of',
 'sentences',
 '.']