## Loading Dataset

The Tokenizers library groups its building blocks into submodules:

1. normalizers contains all the possible types of Normalizer you can use (complete list [here](https://huggingface.co/docs/tokenizers/api/normalizers)).

2. pre_tokenizers contains all the possible types of PreTokenizer you can use (complete list [here](https://huggingface.co/docs/tokenizers/api/pre-tokenizers)).

3. models contains the various types of Model you can use, like BPE, WordPiece, and Unigram (complete list [here](https://huggingface.co/docs/tokenizers/api/models)).

4. trainers contains all the different types of Trainer you can use to train your model on a corpus (one per type of model; complete list [here](https://huggingface.co/docs/tokenizers/api/trainers)).

5. processors contains the various types of PostProcessor you can use (complete list [here](https://huggingface.co/docs/tokenizers/api/post-processors)).

6. decoders contains the various types of Decoder you can use to decode the outputs of tokenization (complete list [here](https://huggingface.co/docs/tokenizers/components#decoders)).

In [1]:
!pip install datasets

Collecting datasets
  Downloading datasets-2.21.0-py3-none-any.whl.metadata (21 kB)
Collecting pyarrow>=15.0.0 (from datasets)
  Downloading pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (3.3 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Downloading datasets-2.21.0-py3-none-any.whl (527 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Downloading pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl (39.9 MB)

In [12]:
from datasets import load_dataset

dataset = load_dataset("wikitext", name="wikitext-2-raw-v1", split="train")

def get_training_data():
    # Yield the training texts in batches of 1,000 rows.
    for i in range(0, len(dataset), 1000):
        yield dataset[i : i + 1000]["text"]

dataset

Dataset({
    features: ['text'],
    num_rows: 36718
})
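The generator above hands the corpus to the trainer in batches of 1,000 texts rather than as one large list. The batching logic can be sketched on a plain Python list; the names batch_iterator and texts are hypothetical stand-ins for get_training_data and the dataset's text column, and the batch size of 3 is chosen only to keep the output small.

```python
def batch_iterator(texts, batch_size):
    """Yield consecutive slices of `texts`, each with at most `batch_size` items."""
    for start in range(0, len(texts), batch_size):
        yield texts[start : start + batch_size]


# Hypothetical stand-in for the dataset's "text" column.
texts = [f"line {i}" for i in range(7)]

batches = list(batch_iterator(texts, batch_size=3))
print([len(batch) for batch in batches])  # the last batch holds the remainder
```

The final batch is simply shorter when the number of rows is not a multiple of the batch size, so no rows are dropped.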

In [3]:
dataset[:5]

{'text': ['',
  ' = Valkyria Chronicles III = \n',
  '',
  ' Senjō no Valkyria 3 : Unrecorded Chronicles ( Japanese : 戦場のヴァルキュリア3 , lit . Valkyria of the Battlefield 3 ) , commonly referred to as Valkyria Chronicles III outside Japan , is a tactical role @-@ playing video game developed by Sega and Media.Vision for the PlayStation Portable . Released in January 2011 in Japan , it is the third game in the Valkyria series . Employing the same fusion of tactical and real @-@ time gameplay as its predecessors , the story runs parallel to the first game and follows the " Nameless " , a penal military unit serving the nation of Gallia during the Second Europan War who perform secret black operations and are pitted against the Imperial unit " Calamaty Raven " . \n',
  " The game began development in 2010 , carrying over a large portion of the work done on Valkyria Chronicles II . While it retained the standard features of the series , it also underwent multiple adjustments , such as making th

## Applying Normalization

To build a tokenizer with the Tokenizers library, we start by instantiating a Tokenizer object with a model, then set its normalizer, pre_tokenizer, post_processor, and decoder attributes to the values we want.

In [4]:
from tokenizers import (
    decoders,
    models,
    normalizers,
    pre_tokenizers,
    processors,
    trainers,
    Tokenizer,
)

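# Name the unknown token explicitly so it matches the special tokens given to the trainer.
tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))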

For normalization, we lowercase the text and strip accents. The sequence starts with an NFD Unicode normalizer, because otherwise the StripAccents normalizer won’t recognize the accented characters and thus won’t strip them out.

In [7]:
tokenizer.normalizer = normalizers.Sequence(
    [normalizers.NFD(), normalizers.Lowercase(), normalizers.StripAccents()]
)
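Why NFD has to come first can be seen with Python's standard unicodedata module. This is a stdlib approximation of the three steps, not the library's own implementation: the helper names strip_marks and normalize are hypothetical, and stripping every code point in a Mark category is an assumption about what accent stripping amounts to.

```python
import unicodedata


def strip_marks(text):
    """Drop every code point whose Unicode category is a Mark (Mn, Mc, Me)."""
    return "".join(ch for ch in text if not unicodedata.category(ch).startswith("M"))


def normalize(text):
    """NFD-decompose, lowercase, then strip combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return strip_marks(decomposed.lower())


# Precomposed accented letters: each is a single code point, not letter + mark.
text = "H\u00e9ll\u00f2 h\u00f4w are \u00fc?"

without_nfd = strip_marks(text.lower())  # nothing to strip: no separate marks exist
with_nfd = normalize(text)               # NFD splits letter and mark, so the mark can go

print(without_nfd)
print(with_nfd)
```

Without decomposition the accents survive, because a precomposed letter such as é carries no separate combining mark to remove.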

## Pre Tokenizer

Note that the Whitespace pre-tokenizer splits on whitespace and on every character that is not a letter, digit, or underscore, so in practice it splits on whitespace and punctuation. Here it is chained after the Punctuation pre-tokenizer in a Sequence:

In [9]:
# Assign to pre_tokenizer (singular), the attribute the Tokenizer pipeline reads.
tokenizer.pre_tokenizer = pre_tokenizers.Sequence(
    [pre_tokenizers.Punctuation(), pre_tokenizers.Whitespace()]
)
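The library documentation describes the Whitespace pre-tokenizer as splitting with the regex \w+|[^\w\s]+. A minimal sketch of that rule with Python's re module follows; the helper name whitespace_pre_tokenize is hypothetical, and Python's \w may differ from the library's regex engine on some edge-case characters.

```python
import re

# Runs of word characters, or runs of characters that are neither word nor space.
WHITESPACE_PATTERN = re.compile(r"\w+|[^\w\s]+")


def whitespace_pre_tokenize(text):
    """Return (piece, (start, end)) pairs, mirroring pieces with character offsets."""
    return [(m.group(), m.span()) for m in WHITESPACE_PATTERN.finditer(text)]


pieces = whitespace_pre_tokenize("Let's test this tokenizer.")
print([piece for piece, _ in pieces])

# Consecutive punctuation stays together under this regex alone.
grouped = [piece for piece, _ in whitespace_pre_tokenize("Hello, world!!")]
print(grouped)
```

Under this regex alone, a run such as !! stays in one piece. Placing Punctuation first in the Sequence should, with its default behavior, isolate each punctuation character before Whitespace runs.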

## Trainer

In [10]:
special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
trainer = trainers.WordPieceTrainer(
    vocab_size=25000,
    special_tokens=special_tokens,
)

In [14]:
tokenizer.train_from_iterator(get_training_data(), trainer=trainer)
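The order of the special_tokens list matters: the special tokens take the first vocabulary IDs in the order given, which is consistent with the IDs 2 and 3 printed for [CLS] and [SEP] later in this notebook. A small sketch of that assignment, where learned_pieces is a hypothetical placeholder for whatever the trainer learns from the corpus:

```python
special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]

# Hypothetical learned pieces; the real ones depend on the training corpus.
learned_pieces = ["the", "##s", "of"]

# Special tokens occupy the lowest IDs, learned pieces come after them.
vocab = {token: idx for idx, token in enumerate(special_tokens + learned_pieces)}

print(vocab["[CLS]"], vocab["[SEP]"])
```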

## Encodings

In [15]:
encoding = tokenizer.encode("Let's test this tokenizer.")
print(encoding.tokens)

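['l', '##et', "##'s t", '##est ', '##this ', '##to', '##ken', '##iz', '##er', '##.']

Several tokens above contain spaces (for example, '##est ' and '##this '), which shows that no pre-tokenizer was in effect when this cell ran, so the model received the whole sentence as a single word. With tokenizer.pre_tokenizer set before training, the input is first split on whitespace and punctuation, and no token can contain a space.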
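The ## prefix in the output marks a piece that continues a word rather than starting one. The sketch below shows the greedy longest-match-first lookup commonly described for WordPiece encoding. It is an illustration under stated assumptions, not the library's implementation: the function name wordpiece_encode and the tiny vocabulary are hypothetical.

```python
def wordpiece_encode(word, vocab, unk_token="[UNK]", prefix="##"):
    """Split one pre-tokenized word into the longest vocabulary pieces, left to right."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = prefix + piece  # continuation pieces carry the prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word becomes unknown.
            return [unk_token]
        tokens.append(match)
        start = end
    return tokens


# Hypothetical vocabulary, far smaller than a trained one.
vocab = {"token", "##izer", "##ize", "##r", "test", "[UNK]"}

known = wordpiece_encode("tokenizer", vocab)
unknown = wordpiece_encode("tests", vocab)
print(known)
print(unknown)
```

Because the lookup works on one word at a time, it relies on the pre-tokenizer having already split the sentence into words.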


## Post Processing

The last step in the tokenization pipeline is post-processing. We need to add the [CLS] token at the beginning and the [SEP] token at the end (or after each sentence, if we have a pair of sentences). We will use TemplateProcessing (from the processors module) for this, but first we need to know the IDs of the [CLS] and [SEP] tokens in the vocabulary:

In [16]:
cls_token_id = tokenizer.token_to_id("[CLS]")
sep_token_id = tokenizer.token_to_id("[SEP]")
print(cls_token_id, sep_token_id)

2 3
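With those two IDs, the effect of the template can be sketched in plain Python: wrap a single sequence as [CLS] A [SEP], and a pair as [CLS] A [SEP] B [SEP], with the second segment marked by type ID 1. The function apply_template and the token IDs 10, 11, and 20 are hypothetical; in the library this job belongs to the template post-processor, not to hand-written code.

```python
def apply_template(ids_a, ids_b=None, cls_id=2, sep_id=3):
    """Add [CLS]/[SEP] around one or two ID sequences and build the type IDs."""
    ids = [cls_id] + ids_a + [sep_id]
    type_ids = [0] * len(ids)
    if ids_b is not None:
        # The second sentence and its closing [SEP] belong to segment 1.
        ids = ids + ids_b + [sep_id]
        type_ids = type_ids + [1] * (len(ids_b) + 1)
    return ids, type_ids


single = apply_template([10, 11])
pair = apply_template([10, 11], [20])
print(single)
print(pair)
```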
