Reference: https://huggingface.co/learn/nlp-course/chapter6/6?fw=pt#implementing-wordpiece

In [22]:
corpus = [
    "This is a Natural Language Processing course.",
    "We will learn about tokenization.",
    "Here, we will explore Wordpiece tokenizers",
    "We will be able to understand how tokenizers are trained and tokens are generated.",
]

corpus

['This is a Natural Language Processing course.',
 'We will learn about tokenization.',
 'Here, we will explore Wordpiece tokenizers',
 'We will be able to understand how tokenizers are trained and tokens are generated.']

First, we need to pre-tokenize the corpus into words. Since we are replicating a WordPiece tokenizer (like BERT), we will use the bert-base-cased tokenizer for the pre-tokenization:

In [4]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

tokenizer

BertTokenizerFast(name_or_path='bert-base-cased', vocab_size=28996, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
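
As a rough illustration of what the pre-tokenization step does, the sketch below approximates BERT-style pre-tokenization with a regular expression: it splits on whitespace and isolates each punctuation character. The helper name `approx_pre_tokenize` and the regex are assumptions made for this example, not the library's implementation, and they may differ from `bert-base-cased` on edge cases such as underscores or Unicode punctuation.

```python
import re


def approx_pre_tokenize(text):
    # Hypothetical helper: runs of word characters become words,
    # every other non-space character becomes its own token.
    return re.findall(r"\w+|[^\w\s]", text)


print(approx_pre_tokenize("Here, we will explore Wordpiece tokenizers"))
```

For the real tokenizer, `pre_tokenize_str` also returns character offsets alongside each word, which this sketch leaves out.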

Next, we compute the frequency of each word in the corpus as we pre-tokenize it:


In [23]:
from collections import defaultdict

word_freqs = defaultdict(int)

for text in corpus:

    words_with_offsets = tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text)

    words = [word for word, offset in words_with_offsets]

    for word in words:
        word_freqs[word] += 1

word_freqs

defaultdict(int,
            {'This': 1,
             'is': 1,
             'a': 1,
             'Natural': 1,
             'Language': 1,
             'Processing': 1,
             'course': 1,
             '.': 3,
             'We': 2,
             'will': 3,
             'learn': 1,
             'about': 1,
             'tokenization': 1,
             'Here': 1,
             ',': 1,
             'we': 1,
             'explore': 1,
             'Wordpiece': 1,
             'tokenizers': 2,
             'be': 1,
             'able': 1,
             'to': 1,
             'understand': 1,
             'how': 1,
             'are': 2,
             'trained': 1,
             'and': 1,
             'tokens': 1,
             'generated': 1})

The initial alphabet for WordPiece is the set of all first letters of words, plus all the other letters that appear inside words, prefixed by ##:

In [24]:
alphabet = []

for word in word_freqs.keys():

    if word[0] not in alphabet:
        alphabet.append(word[0])

    for letter in word[1:]:
        if f"##{letter}" not in alphabet:
            alphabet.append(f"##{letter}")

alphabet.sort()

print(alphabet)

['##a', '##b', '##c', '##d', '##e', '##g', '##h', '##i', '##k', '##l', '##n', '##o', '##p', '##r', '##s', '##t', '##u', '##w', '##x', '##z', ',', '.', 'H', 'L', 'N', 'P', 'T', 'W', 'a', 'b', 'c', 'e', 'g', 'h', 'i', 'l', 't', 'u', 'w']


In [25]:
len(alphabet)

39
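
The same alphabet can be built with a set, which avoids the repeated `not in` scans over a list. This is a minimal sketch on a toy word list; the helper name `build_alphabet` is an assumption for illustration and is not part of the original notebook.

```python
def build_alphabet(words):
    # Hypothetical helper: first letters are kept as-is,
    # all later letters get the "##" continuation prefix.
    alphabet = set()
    for word in words:
        alphabet.add(word[0])
        alphabet.update(f"##{letter}" for letter in word[1:])
    return sorted(alphabet)


print(build_alphabet(["We", "will", "we"]))
```

Sorting places the `##`-prefixed pieces first because `#` sorts before letters, which matches the ordering seen in the output above.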

We also add the special tokens used by the model at the beginning of that vocabulary. In the case of BERT, it’s the list ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]:

In [26]:
vocab = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"] + alphabet.copy()

vocab

['[PAD]',
 '[UNK]',
 '[CLS]',
 '[SEP]',
 '[MASK]',
 '##a',
 '##b',
 '##c',
 '##d',
 '##e',
 '##g',
 '##h',
 '##i',
 '##k',
 '##l',
 '##n',
 '##o',
 '##p',
 '##r',
 '##s',
 '##t',
 '##u',
 '##w',
 '##x',
 '##z',
 ',',
 '.',
 'H',
 'L',
 'N',
 'P',
 'T',
 'W',
 'a',
 'b',
 'c',
 'e',
 'g',
 'h',
 'i',
 'l',
 't',
 'u',
 'w']

Next we need to split each word, with all the letters that are not the first prefixed by ##:

In [27]:
splits = {
    word: [c if i == 0 else f"##{c}" for i, c in enumerate(word)]
    for word in word_freqs.keys()
}

splits

{'This': ['T', '##h', '##i', '##s'],
 'is': ['i', '##s'],
 'a': ['a'],
 'Natural': ['N', '##a', '##t', '##u', '##r', '##a', '##l'],
 'Language': ['L', '##a', '##n', '##g', '##u', '##a', '##g', '##e'],
 'Processing': ['P',
  '##r',
  '##o',
  '##c',
  '##e',
  '##s',
  '##s',
  '##i',
  '##n',
  '##g'],
 'course': ['c', '##o', '##u', '##r', '##s', '##e'],
 '.': ['.'],
 'We': ['W', '##e'],
 'will': ['w', '##i', '##l', '##l'],
 'learn': ['l', '##e', '##a', '##r', '##n'],
 'about': ['a', '##b', '##o', '##u', '##t'],
 'tokenization': ['t',
  '##o',
  '##k',
  '##e',
  '##n',
  '##i',
  '##z',
  '##a',
  '##t',
  '##i',
  '##o',
  '##n'],
 'Here': ['H', '##e', '##r', '##e'],
 ',': [','],
 'we': ['w', '##e'],
 'explore': ['e', '##x', '##p', '##l', '##o', '##r', '##e'],
 'Wordpiece': ['W', '##o', '##r', '##d', '##p', '##i', '##e', '##c', '##e'],
 'tokenizers': ['t',
  '##o',
  '##k',
  '##e',
  '##n',
  '##i',
  '##z',
  '##e',
  '##r',
  '##s'],
 'be': ['b', '##e'],
 'able': ['a', '##b', '##l

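Now that we are ready for training, let’s write a function that computes the score of each pair. We’ll need it at each step of training. The score of a pair is its frequency divided by the product of the frequencies of its two parts: score = freq_of_pair / (freq_of_first_element × freq_of_second_element).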

In [28]:
def compute_pair_scores(splits):
    letter_freqs = defaultdict(int)
    pair_freqs = defaultdict(int)

    for word, freq in word_freqs.items():
        split = splits[word]

        # Single letter word e.g. "a", "I"
        if len(split) == 1:
            letter_freqs[split[0]] += freq
            continue

        # Iterate till the second-to-last split
        for i in range(len(split) - 1):
            pair = (split[i], split[i + 1])

            letter_freqs[split[i]] += freq
            pair_freqs[pair] += freq

        # Add the last split to letter frequencies
        letter_freqs[split[-1]] += freq

    scores = {
        pair: freq / (letter_freqs[pair[0]] * letter_freqs[pair[1]])
        for pair, freq in pair_freqs.items()
    }
    return scores
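
To see how this scoring behaves, here is a small self-contained sketch on a two-word toy corpus. The function name `toy_pair_scores` and the toy frequencies are assumptions for illustration; the scoring rule is the same one used in `compute_pair_scores`.

```python
from collections import defaultdict


def toy_pair_scores(word_freqs):
    # Hypothetical, self-contained variant of the scoring step.
    token_freqs = defaultdict(int)
    pair_freqs = defaultdict(int)
    for word, freq in word_freqs.items():
        split = [c if i == 0 else f"##{c}" for i, c in enumerate(word)]
        for token in split:
            token_freqs[token] += freq
        for pair in zip(split, split[1:]):
            pair_freqs[pair] += freq
    return {
        pair: freq / (token_freqs[pair[0]] * token_freqs[pair[1]])
        for pair, freq in pair_freqs.items()
    }


scores = toy_pair_scores({"will": 3, "This": 1})
print(scores[("w", "##i")])  # 3 / (3 * 4)
print(scores[("T", "##h")])  # 1 / (1 * 1)
```

Although ('w', '##i') occurs three times and ('T', '##h') only once, the latter scores higher because both of its parts are rare on their own. This is why WordPiece tends to merge pairs whose parts seldom appear apart.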

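Viewing part of this dictionary after the initial splits. For example, the pair ('##i', '##s') appears once, while '##i' appears 11 times and '##s' 9 times, so its score is 1 / (11 × 9) = 1/99 ≈ 0.0101: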

In [29]:
pair_scores = compute_pair_scores(splits)

pair_scores

{('T', '##h'): 1.0,
 ('##h', '##i'): 0.09090909090909091,
 ('##i', '##s'): 0.010101010101010102,
 ('i', '##s'): 0.1111111111111111,
 ('N', '##a'): 0.1111111111111111,
 ('##a', '##t'): 0.06666666666666667,
 ('##t', '##u'): 0.05,
 ('##u', '##r'): 0.03571428571428571,
 ('##r', '##a'): 0.023809523809523808,
 ('##a', '##l'): 0.012345679012345678,
 ('L', '##a'): 0.1111111111111111,
 ('##a', '##n'): 0.017094017094017096,
 ('##n', '##g'): 0.05128205128205128,
 ('##g', '##u'): 0.08333333333333333,
 ('##u', '##a'): 0.027777777777777776,
 ('##a', '##g'): 0.037037037037037035,
 ('##g', '##e'): 0.012345679012345678,
 ('P', '##r'): 0.07142857142857142,
 ('##r', '##o'): 0.005952380952380952,
 ('##o', '##c'): 0.041666666666666664,
 ('##c', '##e'): 0.037037037037037035,
 ('##e', '##s'): 0.00411522633744856,
 ('##s', '##s'): 0.012345679012345678,
 ('##s', '##i'): 0.010101010101010102,
 ('##i', '##n'): 0.013986013986013986,
 ('c', '##o'): 0.08333333333333333,
 ('##o', '##u'): 0.041666666666666664,
 ('##r

Now, finding the pair with the best score only takes a quick loop:

In [30]:
best_pair = ""
max_score = None

for pair, score in pair_scores.items():
    if max_score is None or max_score < score:
        best_pair = pair
        max_score = score

print(best_pair, max_score)

('T', '##h') 1.0
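
The same selection can be written with the built-in `max` and a key function. This is a minimal sketch on a hand-written score dictionary; `toy_scores` is an assumption for illustration only.

```python
# Toy scores, written by hand for this example.
toy_scores = {("T", "##h"): 1.0, ("w", "##i"): 0.25, ("##i", "##s"): 0.25}

# max() iterates over the keys and compares them by their score.
best_pair = max(toy_scores, key=toy_scores.get)

print(best_pair, toy_scores[best_pair])
```

On ties, `max` returns the first maximal key it encounters, which matches the strict `<` comparison in the loop above.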


So the first merge to learn is ('T', '##h') -> 'Th', and we add 'Th' to the vocabulary:

In [31]:
vocab.append("Th")

vocab

['[PAD]',
 '[UNK]',
 '[CLS]',
 '[SEP]',
 '[MASK]',
 '##a',
 '##b',
 '##c',
 '##d',
 '##e',
 '##g',
 '##h',
 '##i',
 '##k',
 '##l',
 '##n',
 '##o',
 '##p',
 '##r',
 '##s',
 '##t',
 '##u',
 '##w',
 '##x',
 '##z',
 ',',
 '.',
 'H',
 'L',
 'N',
 'P',
 'T',
 'W',
 'a',
 'b',
 'c',
 'e',
 'g',
 'h',
 'i',
 'l',
 't',
 'u',
 'w',
 'Th']

We then need to apply that merge in our splits dictionary. Let’s define a function that merges a pair in every word:

In [32]:
def merge_pair(a, b, splits):
    for word in word_freqs:
        split = splits[word]

        # Single letter word e.g. "a", "I", "."
        if len(split) == 1:
            continue

        i = 0

        while i < len(split) - 1:

            if split[i] == a and split[i + 1] == b:
                # Merge the terms
                merge = a + b[2:] if b.startswith("##") else a + b

                # Remove the original terms and include the merged term
                split = split[:i] + [merge] + split[i + 2 :]
            else:
                i += 1

        splits[word] = split

    return splits

In [33]:
splits = merge_pair("T", "##h", splits)

splits["This"]

['Th', '##i', '##s']

In [34]:
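# Demonstration only: this merge is applied to `splits`, but "to" is never
# added to `vocab`, so the trained vocabulary below contains "tok" but not "to".
splits = merge_pair("t", "##o", splits)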

splits["tokens"]

['to', '##k', '##e', '##n', '##s']

In [35]:
splits["tokenizers"]

['to', '##k', '##e', '##n', '##i', '##z', '##e', '##r', '##s']
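
The merge step can also be viewed on a single split in isolation. The sketch below is a non-mutating variant that returns a new list; the helper name `merge_in_split` is an assumption for illustration, not part of the original notebook.

```python
def merge_in_split(split, a, b):
    # Hypothetical helper: return a new list with every adjacent (a, b) merged.
    # The "##" prefix of the second piece is dropped when the pieces are joined.
    merged = a + (b[2:] if b.startswith("##") else b)
    result, i = [], 0
    while i < len(split):
        if i + 1 < len(split) and split[i] == a and split[i + 1] == b:
            result.append(merged)
            i += 2
        else:
            result.append(split[i])
            i += 1
    return result


print(merge_in_split(["T", "##h", "##i", "##s"], "T", "##h"))
print(merge_in_split(["w", "##i", "##l", "##l"], "##l", "##l"))
```

Note that when the first piece already carries the `##` prefix, the merged token keeps it, so ('##l', '##l') becomes '##ll' rather than 'll'.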

Now we have everything we need to loop until we have learned all the merges we want. Here we aim for a vocabulary size of 120:

In [36]:
vocab_size = 120

while len(vocab) < vocab_size:

    scores = compute_pair_scores(splits)

    best_pair, max_score = "", None
    for pair, score in scores.items():
        if max_score is None or max_score < score:
            best_pair = pair
            max_score = score

    splits = merge_pair(*best_pair, splits)

    new_token = (
        best_pair[0] + best_pair[1][2:] if best_pair[1].startswith("##")
          else best_pair[0] + best_pair[1]
    )

    vocab.append(new_token)

In [37]:
vocab

['[PAD]',
 '[UNK]',
 '[CLS]',
 '[SEP]',
 '[MASK]',
 '##a',
 '##b',
 '##c',
 '##d',
 '##e',
 '##g',
 '##h',
 '##i',
 '##k',
 '##l',
 '##n',
 '##o',
 '##p',
 '##r',
 '##s',
 '##t',
 '##u',
 '##w',
 '##x',
 '##z',
 ',',
 '.',
 'H',
 'L',
 'N',
 'P',
 'T',
 'W',
 'a',
 'b',
 'c',
 'e',
 'g',
 'h',
 'i',
 'l',
 't',
 'u',
 'w',
 'Th',
 'ex',
 'exp',
 'tok',
 'ab',
 '##dp',
 'co',
 'cou',
 'ho',
 'how',
 'is',
 'Na',
 'Nat',
 'Natu',
 '##gu',
 '##ut',
 '##out',
 'about',
 'La',
 '##gua',
 '##guag',
 '##oc',
 '##at',
 '##ta',
 '##zat',
 '##sta',
 'expl',
 'explo',
 'Wo',
 'abl',
 'Thi',
 'This',
 '##izat',
 '##izati',
 '##izatio',
 '##dpi',
 '##iz',
 'wi',
 '##ai',
 '##si',
 '##ssi',
 'wil',
 'will',
 '##al',
 'Lan',
 'Languag',
 '##ssin',
 '##ssing',
 '##nizatio',
 '##nization',
 '##niz',
 'un',
 'und',
 '##stan',
 '##stand',
 '##ain',
 '##nd',
 'and',
 '##ns',
 'Natur',
 'Natural',
 'Pr',
 'Proc',
 'cour',
 'cours',
 '##ar',
 '##arn',
 'explor',
 'Wor',
 'Wordpi',
 '##rs',
 '##rstand',
 'tr

To tokenize new text, we pre-tokenize it and then encode each word by repeatedly taking the longest prefix that is in the vocabulary:

In [38]:
def encode_word(word):
    tokens = []

    while len(word) > 0:
        i = len(word)

        # Find the longest part of the word in the vocabulary
        # starting with the complete word
        while i > 0 and word[:i] not in vocab:
            i -= 1

        # If no individual character is in the vocabulary
        # return the unknown token
        if i == 0:
            return ["[UNK]"]

        # Append the word part to tokens, and consider rest of the
        # word part
        tokens.append(word[:i])
        word = word[i:]

        if len(word) > 0:
            word = f"##{word}"

    return tokens

In [39]:
print(encode_word("Natural"))
print(encode_word("Notaral"))

['Natural']
['N', '##o', '##ta', '##r', '##al']


In [40]:
print(encode_word("tokenizing"))

['tok', '##e', '##niz', '##i', '##n', '##g']


In [42]:
print(encode_word("exploring"))

['explor', '##i', '##n', '##g']


In [43]:
print(encode_word("Label"))

['La', '##b', '##e', '##l']


Finally, we define a tokenization function and use it to tokenize a sentence:

In [44]:
def tokenize(text):
    pre_tokenize_result = tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text)

    pre_tokenized_text = [word for word, offset in pre_tokenize_result]

    encoded_words = [encode_word(word) for word in pre_tokenized_text]

    return sum(encoded_words, [])

In [46]:
tokenize("tokenizing language is fun")

['tok',
 '##e',
 '##niz',
 '##i',
 '##n',
 '##g',
 'l',
 '##a',
 '##n',
 '##guag',
 '##e',
 'is',
 '[UNK]']
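
The `[UNK]` for "fun" follows from the greedy longest-match rule: the alphabet printed earlier contains no `f`, so no prefix of "fun" is in the vocabulary and the whole word is replaced. The sketch below shows the same rule with an explicit toy vocabulary. The function name `encode_with_vocab` and the contents of `toy_vocab` are assumptions for illustration; a set is used so that membership checks are fast.

```python
def encode_with_vocab(word, vocab):
    # Hypothetical variant of encode_word that takes the vocabulary explicitly.
    tokens = []
    while word:
        end = len(word)
        # Shrink the candidate prefix until it is found in the vocabulary.
        while end > 0 and word[:end] not in vocab:
            end -= 1
        if end == 0:
            # No prefix matched: the whole word becomes the unknown token.
            return ["[UNK]"]
        tokens.append(word[:end])
        word = word[end:]
        if word:
            word = f"##{word}"
    return tokens


toy_vocab = {"tok", "##e", "##n", "##s", "is"}
print(encode_with_vocab("tokens", toy_vocab))
print(encode_with_vocab("fun", toy_vocab))
```

Similarly, "language" is split into many small pieces above because our cased vocabulary learned 'La' and 'Languag' with a capital L, and only the single letter 'l' matches the lowercase form.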

Subword tokenization algorithms rely on the principle that frequently used words should not be split into smaller subwords, but rare words should be decomposed into meaningful subwords. For instance "annoyingly" might be considered a rare word and could be decomposed into "annoying" and "ly". Both "annoying" and "ly" as stand-alone subwords would appear more frequently while at the same time the meaning of "annoyingly" is kept by the composite meaning of "annoying" and "ly". This is especially useful in agglutinative languages such as Turkish, where you can form (almost) arbitrarily long complex words by stringing together subwords.

Subword tokenization allows the model to have a reasonable vocabulary size while still learning meaningful context-independent representations. In addition, it enables the model to process words it has never seen before by decomposing them into known subwords.

For comparison with our toy tokenizer, the cell below loads the pretrained BertTokenizer and tokenizes the same sentence, "tokenizing language is fun". Because we use the uncased model, the sentence is lowercased first.

In [47]:
from transformers import BertTokenizer

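# Use a separate name so the fast `tokenizer` used by tokenize() above is not overwritten.
bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

bert_tokenizer.tokenize("tokenizing language is fun")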

['token', '##izing', 'language', 'is', 'fun']