# Unigram Tokenization

We will implement Unigram tokenization using HuggingFace's `tokenizers` library, as we did for the WordPiece tokenizer. Let's start by installing the library (if it is not already installed).

`pip3 install tokenizers`

`NOTE:` Unigram tokenization can also be implemented using Google's SentencePiece library, as was done for Byte-Pair Encoding. Refer to `bpe.ipynb` for a similar implementation that can be adapted to Unigram.

In [9]:
import config
from tokenizers import Tokenizer
from tokenizers.models import Unigram
from tokenizers.trainers import UnigramTrainer
from tokenizers.pre_tokenizers import Whitespace

## Prepare the Tokenizer

In [10]:
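# Token that stands in for any symbol missing from the learned vocabulary
unk_token = "<UNK>"
# Special tokens reserved in the vocabulary during training
spl_tokens = ["<UNK>", "<SEP>", "<MASK>", "<CLS>"]

# Untrained Unigram model and the trainer that will learn its vocabulary
tokenizer = Tokenizer(Unigram())
trainer = UnigramTrainer(unk_token=unk_token, special_tokens=spl_tokens)

# Split the raw text on whitespace and punctuation before the model sees it
tokenizer.pre_tokenizer = Whitespace()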
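
To make the pre-tokenization step concrete, here is a minimal standard-library sketch of the splitting rule. It assumes the pattern `\w+|[^\w\s]+`, which the `tokenizers` documentation gives for the `Whitespace` pre-tokenizer. The helper name `whitespace_split` is hypothetical and not part of the library.

```python
import re

# Assumption: this pattern mirrors the one documented for the
# Whitespace pre-tokenizer (runs of word characters, or runs of punctuation).
WHITESPACE_PATTERN = re.compile(r"\w+|[^\w\s]+")


def whitespace_split(text):
    """Hypothetical helper: split text into word runs and punctuation runs."""
    return WHITESPACE_PATTERN.findall(text)


sample = "Good muffins cost $3.88."
pieces = whitespace_split(sample)
print(pieces)
# ['Good', 'muffins', 'cost', '$', '3', '.', '88', '.']
```

The Unigram model then segments each of these pieces independently, so a learned token never spans a whitespace or punctuation boundary.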

## Train the Tokenizer

In [11]:
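# Paths to the training corpus and to the serialized tokenizer, relative to this notebook
data_files = ['./../{}'.format(config.DATA_PATH)]
model_path = './../{}/unigram.json'.format(config.MODEL_PATH)

# Learn the vocabulary from the corpus, then save the tokenizer as a single JSON file
tokenizer.train(data_files, trainer)
tokenizer.save(model_path)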
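
Conceptually, a trained Unigram model is a vocabulary of pieces, each with a log-probability, and encoding a word means choosing the split whose pieces have the highest total score. The sketch below illustrates that idea with a small Viterbi search. It is not the library's implementation: the function name `viterbi_segment`, the fallback score for unknown characters, and the scores in `toy_log_probs` are all made up for illustration.

```python
import math


def viterbi_segment(word, log_probs, unk_log_prob=-20.0):
    """Return the highest-scoring split of `word` into vocabulary pieces.

    Single characters missing from `log_probs` are allowed with the
    (assumed) penalty `unk_log_prob`, so every word can be segmented.
    """
    n = len(word)
    best_score = [0.0] + [-math.inf] * n  # best_score[i]: best score for word[:i]
    best_start = [0] * (n + 1)            # best_start[i]: start of the last piece
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in log_probs:
                score = log_probs[piece]
            elif end - start == 1:
                score = unk_log_prob      # unknown single character
            else:
                continue                  # unknown multi-character span: skip
            candidate = best_score[start] + score
            if candidate > best_score[end]:
                best_score[end] = candidate
                best_start[end] = start
    # Walk the back-pointers from the end of the word to recover the pieces.
    pieces = []
    end = n
    while end > 0:
        start = best_start[end]
        pieces.append(word[start:end])
        end = start
    return pieces[::-1]


# Toy scores, invented for this example (not taken from the trained model).
toy_log_probs = {"m": -4.0, "uff": -6.0, "in": -3.0, "s": -2.5,
                 "muff": -12.0, "ins": -9.0}
segments = viterbi_segment("muffins", toy_log_probs)
print(segments)
# ['m', 'uff', 'in', 's']
```

With these toy scores the split `m + uff + in + s` totals -15.5, which beats `muff + in + s` at -17.5, so the search prefers four frequent short pieces over fewer but rarer long ones.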

## Tokenize Input String

In [12]:
tokenizer = Tokenizer.from_file(model_path)

In [13]:
text = "Good muffins cost $3.88. Please buy me two of them.\n\nThanks."

In [14]:
output = tokenizer.encode(text)

print(output.ids)
print(output.tokens)

[296, 43, 2342, 20, 6, 1737, 0, 519, 5, 2805, 2805, 5, 1791, 1267, 30, 266, 17, 106, 5, 1906, 6, 5]
['Good', 'm', 'uff', 'in', 's', 'cost', '$', '3', '.', '8', '8', '.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thank', 's', '.']
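
The two lists printed above line up position by position, which makes it easy to spot pieces the model could not represent. The sketch below copies the printed values and assumes that `<UNK>` received id 0 because it is listed first in `spl_tokens`. The printed output is consistent with that assumption, but it is not confirmed elsewhere in this notebook.

```python
# Values copied from the output printed above.
ids = [296, 43, 2342, 20, 6, 1737, 0, 519, 5, 2805, 2805, 5,
       1791, 1267, 30, 266, 17, 106, 5, 1906, 6, 5]
tokens = ['Good', 'm', 'uff', 'in', 's', 'cost', '$', '3', '.', '8', '8', '.',
          'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thank', 's', '.']

UNK_ID = 0  # assumption: "<UNK>" is first in spl_tokens and so gets id 0

# Pieces whose id is the unknown id were not found in the learned vocabulary.
unknown = [tok for tok_id, tok in zip(ids, tokens) if tok_id == UNK_ID]
print(unknown)
# ['$']

# Each distinct piece should always map to the same id.
token_to_ids = {}
for tok_id, tok in zip(ids, tokens):
    token_to_ids.setdefault(tok, set()).add(tok_id)
consistent = all(len(id_set) == 1 for id_set in token_to_ids.values())
print(consistent)
# True
```

Here only `$` falls back to the unknown id, while rare words such as "muffins" are still covered by splitting them into smaller known pieces.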
