# WordPiece Tokenization

WordPiece and BPE are two similar, commonly used techniques for segmenting words into subword units in NLP tasks. In both cases, the vocabulary is initialized with all the individual characters in the text, and then the most frequent (BPE) or most likely (WordPiece) combinations of symbols in the vocabulary are iteratively added to it. The WordPiece algorithm proceeds as follows:

1. Initialize the word unit inventory with all the characters in the text.
2. Build a language model on the training data using the inventory from 1.
3. Generate a new word unit by combining two units out of the current word inventory to increment the word unit inventory by one. Choose the new word unit out of all the possible ones that increases the likelihood on the training data the most when added to the model.
4. Go to step 2 until a predefined limit of word units is reached or the likelihood increase falls below a certain threshold.
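The selection in step 3 can be sketched on a toy corpus. The scoring rule below, `count(pair) / (count(first) * count(second))`, is a commonly described approximation of the likelihood criterion; treating it as what any particular library does internally is an assumption, and the function name `best_wordpiece_pair` and the word counts are hypothetical.

```python
from collections import Counter

def best_wordpiece_pair(words):
    """Pick the adjacent pair with the highest WordPiece-style score.

    `words` maps a tuple of symbols to the word's frequency.
    Score (an assumed approximation of likelihood gain):
        count(pair) / (count(first) * count(second))
    """
    unit_freq, pair_freq = Counter(), Counter()
    for symbols, freq in words.items():
        for s in symbols:
            unit_freq[s] += freq
        for a, b in zip(symbols, symbols[1:]):
            pair_freq[(a, b)] += freq

    def score(pair):
        a, b = pair
        return pair_freq[pair] / (unit_freq[a] * unit_freq[b])

    return max(pair_freq, key=score)

# Hypothetical toy corpus; non-initial symbols carry the "##" prefix.
toy_words = Counter({
    ("h", "##u", "##g"): 10,
    ("p", "##u", "##g"): 5,
    ("p", "##u", "##n"): 12,
    ("b", "##u", "##n"): 4,
    ("h", "##u", "##g", "##s"): 5,
})

best = best_wordpiece_pair(toy_words)
print(best)
```

On this corpus the most frequent pair is `("##u", "##g")` (20 occurrences), which a purely frequency-based method such as BPE would merge first. The score above instead favors `("##g", "##s")`, because `##s` almost never appears outside that pair.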


`For Example:`<br/>
*Input Text:* she walked . he is a dog walker . i walk <br/>
*First 3 BPE Merges:*
1. w a = wa
2. l k = lk
3. wa lk = walk

So at this stage, your vocabulary includes all the initial characters, along with wa, lk, and walk. You usually do this for a fixed number of merge operations.
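A minimal sketch of the BPE merge loop on the input text above, assuming ties between equally frequent pairs are broken by first occurrence. The function name `learn_bpe` is hypothetical. Because `w a`, `a l`, and `l k` all occur three times, the intermediate merges depend on the tie-breaking rule and can differ from the ones listed (this sketch produces `wal` where the listing has `lk`), but `walk` is still reached after three merges.

```python
from collections import Counter

def learn_bpe(text, num_merges):
    """Return the list of merged symbol pairs, most frequent pair first."""
    # Each whitespace-separated word starts as a tuple of single characters.
    words = Counter(tuple(w) for w in text.split())
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # max() returns the first pair seen among equally frequent ones.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        new_words = Counter()
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return merges

merges = learn_bpe("she walked . he is a dog walker . i walk", 3)
for left, right in merges:
    print(left, right, "=", left + right)
```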

`How does it handle rare/OOV words?` Quite simply, OOV words are impossible if you use such a segmentation method. Any word which does not occur in the vocabulary will be broken down into subword units. Similarly, for rare words, given that the number of subword merges we used is limited, the word will not occur in the vocabulary, so it will be split into more frequent subwords.

`How does this help?` Imagine that the model sees the word walking. Unless this word occurs at least a few times in the training corpus, the model can't learn to deal with this word very well. However, it may have the words walked, walker, walks, each occurring only a few times. Without subword segmentation, all these words are treated as completely different words by the model. However, if these get segmented as walk@@ ing, walk@@ ed, etc., notice that all of them will now have walk@@ in common, which will occur much more frequently during training, so the model might be able to learn more about it.
<br/>
*[[Source]](https://stackoverflow.com/questions/55382596/how-is-wordpiece-tokenization-helpful-to-effectively-deal-with-rare-words-proble/55416944#55416944)*
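To make the splitting concrete, here is a minimal sketch of greedy longest-match-first segmentation, the strategy usually described for WordPiece at encoding time. The function name `wordpiece_tokenize` and the tiny vocabulary are hypothetical, and the `##` continuation prefix is assumed to mark non-initial pieces. A word that cannot be covered by the vocabulary at all falls back to the unknown token.

```python
def wordpiece_tokenize(word, vocab, unk="<UNK>"):
    """Split one word into the longest vocabulary pieces, left to right."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark a non-initial piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches at this position: the whole word is unknown.
            return [unk]
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical vocabulary for illustration only.
toy_vocab = {"walk", "##ing", "##ed", "##er", "##s"}

for w in ["walking", "walked", "walkers", "xyz"]:
    print(w, "->", wordpiece_tokenize(w, toy_vocab))
```

The first three words all share the piece `walk`, which is the effect described above, while `xyz` maps to `<UNK>` because none of its characters are in the toy vocabulary.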


We will train and use a WordPiece tokenizer with Hugging Face's `tokenizers` library. Let's get started by installing the library.

`pip3 install tokenizers`

```python
# Add the parent (root) directory to sys.path so that '.py' files
# located there can be imported from this notebook.
import sys
import os

root_dir = os.path.dirname(os.getcwd())
if root_dir not in sys.path:
    sys.path.append(root_dir)
```

```python
import config
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.trainers import WordPieceTrainer
from tokenizers.pre_tokenizers import Whitespace
```

`Make sure:`
- The data is already present at the location given by `config.DATA_PATH`.
- The 'models' directory is present inside the root directory.

### Prepare the tokenizer

```python
unk_token = "<UNK>"
spl_tokens = ["<UNK>", "<SEP>", "<MASK>", "<CLS>"]

tokenizer = Tokenizer(WordPiece(unk_token=unk_token))
trainer = WordPieceTrainer(special_tokens=spl_tokens)

# Split the raw text on whitespace and punctuation before WordPiece runs.
tokenizer.pre_tokenizer = Whitespace()
```

### Train the tokenizer

```python
data_files = ['./../{}'.format(config.DATA_PATH)]
model_path = './../{}/wordpiece.json'.format(config.MODEL_PATH)

tokenizer.train(data_files, trainer)
tokenizer.save(model_path)
```
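If the data file is not at hand, the same pipeline can be tried on an in-memory corpus. The sketch below is an assumption-laden stand-in, not the training run above: the corpus, the `vocab_size` of 100, and the `toy_` names are hypothetical, and it relies on `Tokenizer.train_from_iterator` from the `tokenizers` library.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.trainers import WordPieceTrainer
from tokenizers.pre_tokenizers import Whitespace

# Hypothetical toy corpus standing in for the real data file.
toy_corpus = ["she walked .", "he is a dog walker .", "i walk"] * 50

toy_tokenizer = Tokenizer(WordPiece(unk_token="<UNK>"))
toy_tokenizer.pre_tokenizer = Whitespace()
toy_trainer = WordPieceTrainer(
    vocab_size=100,  # assumed small limit, enough for a toy corpus
    special_tokens=["<UNK>", "<SEP>", "<MASK>", "<CLS>"],
    show_progress=False,
)

# Train directly from the list of strings instead of from files.
toy_tokenizer.train_from_iterator(toy_corpus, trainer=toy_trainer)

toy_output = toy_tokenizer.encode("she walks")
print(toy_output.tokens)
print(toy_tokenizer.token_to_id("<UNK>"))
```

Special tokens are added to the vocabulary before anything learned from the data, which is why `<UNK>` ends up with id 0.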


### Tokenize Input String

```python
tokenizer = Tokenizer.from_file(model_path)
```

```python
text = "Good muffins cost $3.88. Please buy me two of them.\n\nThanks.🙂😍"
```

```python
output = tokenizer.encode(text)

print(output.ids)
print(output.tokens)
```

```
[1051, 4966, 2659, 2860, 0, 15, 11, 6024, 11, 5860, 2793, 205, 803, 169, 416, 11, 5212, 0]
['Good', 'muff', '##ins', 'cost', '<UNK>', '3', '.', '88', '.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '<UNK>']
```
