# WordPiece Tokenization

We will train and apply a WordPiece tokenizer using Hugging Face's `tokenizers` library. Start by installing the library if it is not already installed.

`pip3 install tokenizers`

In [1]:
import config
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.trainers import WordPieceTrainer
from tokenizers.pre_tokenizers import Whitespace

Before running the cells below, make sure that:
- The training data is already present at the location `config.DATA_PATH` points to.
- A `models` directory exists inside the root directory.
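
These two prerequisites can be checked up front rather than discovered through a failed training run. The following is a minimal sketch, not part of the original notebook: the helper name `check_layout` and the example paths are hypothetical, and in practice the paths would be built from `config.DATA_PATH` and `config.MODEL_PATH`.

```python
from pathlib import Path


def check_layout(data_file, model_dir):
    """Return a list of problems found; an empty list means the layout looks fine."""
    problems = []
    if not Path(data_file).is_file():
        problems.append(f"training data not found: {data_file}")
    if not Path(model_dir).is_dir():
        problems.append(f"model directory not found: {model_dir}")
    return problems


# Hypothetical paths for illustration only.
for problem in check_layout("data/corpus.txt", "models"):
    print(problem)
```

Returning a list instead of raising on the first failure reports both problems in one pass.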

## Prepare the tokenizer

In [2]:
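unk_token = "<UNK>"
spl_tokens = ["<UNK>", "<SEP>", "<MASK>", "<CLS>"]  # reserved special tokens

# WordPiece model that maps out-of-vocabulary words to unk_token
tokenizer = Tokenizer(WordPiece(unk_token=unk_token))
trainer = WordPieceTrainer(special_tokens=spl_tokens)

# Split raw text into words and punctuation before the model sees it
tokenizer.pre_tokenizer = Whitespace()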
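
To get a feel for what the `Whitespace` pre-tokenizer does, the sketch below approximates it with Python's `re` module. It assumes the pattern `\w+|[^\w\s]+` that the library documents for this pre-tokenizer. The helper name `pre_tokenize` is made up for illustration, and Unicode edge cases may differ from the library's own regex engine.

```python
import re

# Assumed equivalent of the Whitespace pre-tokenizer's documented pattern:
# runs of word characters, or runs of characters that are neither word nor space.
WHITESPACE_PATTERN = re.compile(r"\w+|[^\w\s]+")


def pre_tokenize(text):
    """Split text into runs of word characters and runs of punctuation."""
    return WHITESPACE_PATTERN.findall(text)


print(pre_tokenize("Good muffins cost $3.88."))
# ['Good', 'muffins', 'cost', '$', '3', '.', '88', '.']
```

Note that `$3.88` is broken into `$`, `3`, `.`, `88` at this stage, before the WordPiece model is applied to each piece.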

## Train the tokenizer

In [3]:
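# Both paths are resolved relative to the parent of the working directory
data_files = ['./../{}'.format(config.DATA_PATH)]
model_path = './../{}/wordpiece.json'.format(config.MODEL_PATH)

# Learn the vocabulary from the data files, then write the tokenizer to JSON
tokenizer.train(data_files, trainer)
tokenizer.save(model_path)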
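
The saved file is plain JSON, so the learned vocabulary can be inspected without the library. This sketch assumes the token-to-id mapping sits under `model` then `vocab` in the saved file. That layout, the helper name `load_vocab`, and the example path are assumptions to verify against your own file.

```python
import json
from pathlib import Path


def load_vocab(model_file):
    """Read the token-to-id mapping from a saved tokenizer JSON file.

    Assumes the vocabulary is stored under model -> vocab.
    """
    with open(model_file, encoding="utf-8") as handle:
        saved = json.load(handle)
    return saved["model"]["vocab"]


# Hypothetical path; substitute the model_path used when saving.
model_file = Path("models/wordpiece.json")
if model_file.is_file():
    vocab = load_vocab(model_file)
    print(len(vocab), "tokens; <UNK> has id", vocab.get("<UNK>"))
```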

## Tokenize an input string

In [4]:
tokenizer = Tokenizer.from_file(model_path)

In [5]:
text = "Good muffins cost $3.88. Please buy me two of them.\n\nThanks."

In [6]:
output = tokenizer.encode(text)

print(output.ids)
print(output.tokens)

[835, 8585, 1056, 3784, 0, 20, 15, 26386, 15, 3796, 2929, 216, 771, 191, 365, 15, 4648, 15]
['Good', 'muff', '##ins', 'cost', '<UNK>', '3', '.', '88', '.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
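
The output shows two behaviours worth noting: `muffins` is split into `muff` and `##ins`, and `$` maps to `<UNK>`, presumably because that character never entered the learned vocabulary. Both follow from greedy longest-match-first lookup, which the sketch below illustrates in plain Python. It is a simplified illustration, not the library's implementation: the function name `wordpiece_split` and the toy vocabulary are invented for the example, and `##` is assumed as the continuation prefix, matching the output above.

```python
def wordpiece_split(word, vocab, unk_token="<UNK>", prefix="##"):
    """Greedily split one word into the longest vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the continuation prefix.
                candidate = prefix + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits here, so the whole word becomes the unknown token.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary, invented for illustration; not the trained one.
toy_vocab = {"Good", "muff", "##ins", "cost", "3", ".", "88"}

for word in ["Good", "muffins", "cost", "$", "3", ".", "88", "."]:
    print(word, "->", wordpiece_split(word, toy_vocab))
```

With this toy vocabulary, `muffins` yields `['muff', '##ins']` and `$` yields `['<UNK>']`, mirroring the tokens printed above.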
