### Tokenizer for Large Language Models

This code walkthrough traces the evolution of tokenizing natural language, from simply inspecting the Unicode representation of strings to implementing the state-of-the-art tokenizers used for Large Language Models.

#### Simple unicode representations
In Python, the ord() function takes a single Unicode character as input and returns its corresponding integer Unicode code point.

In [1]:
"Aikyam Lab (Sanskrit: ऐक्यम्; meaning oneness)!"

'Aikyam Lab (Sanskrit: ऐक्यम्; meaning oneness)!'

In [2]:
ord('ऐ')

2320

In [3]:
ord("Aikyam Lab (Sanskrit: ऐक्यम्; meaning oneness)!")

TypeError: ord() expected a character, but string of length 47 found

In [4]:
[ord(s) for s in "Aikyam Lab (Sanskrit: ऐक्यम्; meaning oneness)!"]

[65,
 105,
 107,
 121,
 97,
 109,
 32,
 76,
 97,
 98,
 32,
 40,
 83,
 97,
 110,
 115,
 107,
 114,
 105,
 116,
 58,
 32,
 2320,
 2325,
 2381,
 2351,
 2350,
 2381,
 59,
 32,
 109,
 101,
 97,
 110,
 105,
 110,
 103,
 32,
 111,
 110,
 101,
 110,
 101,
 115,
 115,
 41,
 33]

In [5]:
"Aikyam Lab (Sanskrit: ऐक्यम्; meaning oneness)!".encode('utf-8')

b'Aikyam Lab (Sanskrit: \xe0\xa4\x90\xe0\xa4\x95\xe0\xa5\x8d\xe0\xa4\xaf\xe0\xa4\xae\xe0\xa5\x8d; meaning oneness)!'
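The code-point list and the byte string above have different lengths because UTF-8 stores ASCII characters in one byte but needs three bytes for each Devanagari code point used here. The following is a minimal sketch of that difference; the variable names are illustrative and not part of the notebook.

```python
# Rebuild the Sanskrit word from the code points printed above, so the
# example does not depend on how the editor normalizes the string.
text = "".join(chr(cp) for cp in [2320, 2325, 2381, 2351, 2350, 2381])

code_points = [ord(ch) for ch in text]   # one integer per Unicode code point
utf8_bytes = list(text.encode("utf-8"))  # one integer (0..255) per byte

print("code points:", len(code_points))
print("utf-8 bytes:", len(utf8_bytes))

# Byte width of an ASCII letter versus a Devanagari letter.
widths = {ch: len(ch.encode("utf-8")) for ch in ("A", chr(2320))}
print(widths)

# Bytes decode back to the original string without loss.
round_trip = bytes(utf8_bytes).decode("utf-8")
print(round_trip == text)
```

Working on bytes keeps the base vocabulary at 256 symbols, at the cost of longer sequences for non-ASCII text.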

### Let's train a Character-level tokenizer for the Harry Potter dataset
![image.png](https://www.bloomsbury.com/media/h4jpj34t/blms_hp_discovery_web_bnrs_24_1200x600px.jpg)

In [6]:
# Load dataset (explicit encoding; the context manager closes the file)
with open('./data/harry_potter.txt', 'r', encoding='utf-8') as f:
    data = f.read()

# get a unique set of tokens
chars = sorted(list(set(data)))
vocab_size = len(chars)
print(vocab_size)
print(''.join(chars))

71
 !.0123456789?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz~‘•■□


### Let's encode a line of text from Harry Potter using the character-level tokenizer

In [9]:
stoi = {ch: index for index, ch in enumerate(chars)}
itos = {index: ch for index, ch in enumerate(chars)}

def encode(s):
    return [stoi[ch] for ch in s]

def decode(s_list):
    return ''.join([itos[index] for index in s_list])

print(encode('To Harry Potter ~ the boy who lived!'))

[33, 54, 0, 21, 40, 57, 57, 64, 0, 29, 54, 59, 59, 44, 57, 0, 66, 0, 59, 47, 44, 0, 41, 54, 64, 0, 62, 47, 54, 0, 51, 48, 61, 44, 43, 1]


In [11]:
print(decode(encode('To Harry Potter ~ the boy who lived!')))

To Harry Potter ~ the boy who lived!
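A character-level vocabulary built this way only covers characters that appear in the training text, so `stoi[ch]` raises a `KeyError` for anything unseen. One possible workaround, sketched below on a toy corpus, is to reserve an extra id for unknown characters. The `<unk>` placeholder and the `encode_with_unk` / `decode_with_unk` helpers are assumptions made for illustration, not part of the notebook.

```python
corpus = "the boy who lived"  # toy stand-in for the Harry Potter text
chars = sorted(set(corpus))

UNK = "<unk>"  # hypothetical placeholder for unseen characters

stoi = {ch: i for i, ch in enumerate(chars)}
unk_id = len(stoi)  # next free id after the known characters
itos = {i: ch for ch, i in stoi.items()}
itos[unk_id] = UNK


def encode_with_unk(s):
    # Fall back to unk_id instead of raising KeyError.
    return [stoi.get(ch, unk_id) for ch in s]


def decode_with_unk(ids):
    return "".join(itos[i] for i in ids)


ids = encode_with_unk("the boy!")
print(ids)
print(decode_with_unk(ids))
```

The round trip is lossy for unseen characters, which is one motivation for the byte-level approach in the next section.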


### Let's now encode a paragraph from Harry Potter using Byte-Pair Encoding!

References: 
1. https://courses.grainger.illinois.edu/cs447/sp2023/Slides/Lecture02.pdf
2. https://hundredblocks.github.io/transcription_demo/
3. Introduction slides: https://cs.usm.maine.edu/~behrooz.mansouri/courses/Slides_NLP_23/Natural%20Language%20Processing%20--%20Session%204%20-%20Tokenization%20and%20Stemming.pdf
4. Unigram Model: https://medium.com/mti-technology/n-gram-language-model-b7c2fc322799

In [12]:
raw_para = '''“But this is touching, Severus,” said Dumbledore seriously. “Have you grown to care for the boy, after all?”“For him?” shouted Snape. “Expecto Patronum!” From the tip of his wand burst the silver doe. She landed on the office floor, bounded once across the office, and soared out of the window. Dumbledore watched her fly away, and as her silvery glow faded he turned back to Snape, and his eyes were full of tears. “After all this time?” “Always,” said Snape.'''
encode_para = raw_para.encode('utf-8')
encode_para = list(map(int, encode_para))
print(f'Original text: {raw_para}')
print(f'Length of original text: {len(raw_para)}\n')

print(f'Encoded text: {encode_para}')
print(f'Length of encoded text: {len(encode_para)}')

Original text: “But this is touching, Severus,” said Dumbledore seriously. “Have you grown to care for the boy, after all?”“For him?” shouted Snape. “Expecto Patronum!” From the tip of his wand burst the silver doe. She landed on the office floor, bounded once across the office, and soared out of the window. Dumbledore watched her fly away, and as her silvery glow faded he turned back to Snape, and his eyes were full of tears. “After all this time?” “Always,” said Snape.
Length of original text: 460

Encoded text: [226, 128, 156, 66, 117, 116, 32, 116, 104, 105, 115, 32, 105, 115, 32, 116, 111, 117, 99, 104, 105, 110, 103, 44, 32, 83, 101, 118, 101, 114, 117, 115, 44, 226, 128, 157, 32, 115, 97, 105, 100, 32, 68, 117, 109, 98, 108, 101, 100, 111, 114, 101, 32, 115, 101, 114, 105, 111, 117, 115, 108, 121, 46, 32, 226, 128, 156, 72, 97, 118, 101, 32, 121, 111, 117, 32, 103, 114, 111, 119, 110, 32, 116, 111, 32, 99, 97, 114, 101, 32, 102, 111, 114, 32, 116, 104, 101, 32, 98, 111, 121, 44,

In [13]:
chirag_eng = "Chirag Agarwal"
chirag_hin = "चिराग अग्रवाल"

encode_para = chirag_eng.encode('utf-8')
encode_para = list(map(int, encode_para))
print(f'Original text: {chirag_eng}')
print(f'Length of original text: {len(chirag_eng)}\n')

print(f'Encoded text: {encode_para}')
print(f'Length of encoded text: {len(encode_para)}')

print('--'*20)
encode_para = chirag_hin.encode('utf-8')
encode_para = list(map(int, encode_para))
print(f'Original text: {chirag_hin}')
print(f'Length of original text: {len(chirag_hin)}\n')

print(f'Encoded text: {encode_para}')
print(f'Length of encoded text: {len(encode_para)}')

Original text: Chirag Agarwal
Length of original text: 14

Encoded text: [67, 104, 105, 114, 97, 103, 32, 65, 103, 97, 114, 119, 97, 108]
Length of encoded text: 14
----------------------------------------
Original text: चिराग अग्रवाल

In [14]:
tokens = data.encode("utf-8") # raw bytes
tokens = list(map(int, tokens)) # convert to a list of integers in range 0..255 for convenience
print('---')
print(data[:1000])
print("length:", len(data))
print('---')
print(tokens[:1000])
print("length:", len(tokens))

---
THE BOY WHO LIVED Mr and Mrs Dursley of number four Privet Drive were proud to say that they were perfectly normal thank you very much .They were the last people youd expect to be involved in anything strange or mysterious because they just didnt hold with such nonsense .Mr Dursley was the director of a firm called Grunnings which made drills .He was a big beefy man with hardly any neck although he did have a very large mustache .Mrs Dursley was thin and blonde and had nearly twice the usual amount of neck which came in very useful as she spent so much of her time craning over garden fences spying on the neighbors .The Dursley s had a small son called Dudley and in their opinion there was no finer boy anywhere .The Dursleys had everything they wanted but they also had a secret and their greatest fear was that somebody would discover it .They didnt think they could bear it if anyone found out about the Potters .Mrs Potter was Mrs Dursleys sister but they hadnt met for several years 

In [15]:
def get_stats(ids):
    counts = {}
    for pair in zip(ids, ids[1:]): # Pythonic way to iterate consecutive elements
        counts[pair] = counts.get(pair, 0) + 1
    return counts

stats = get_stats(tokens)
# print(stats)
print(sorted(((v,k) for k,v in stats.items()), reverse=True))

[(199459, (101, 32)), (150400, (100, 32)), (143168, (32, 116)), (125242, (104, 101)), (117307, (116, 32)), (114052, (115, 32)), (113113, (116, 104)), (98420, (32, 97)), (93985, (105, 110)), (81396, (32, 104)), (78940, (32, 115)), (78229, (121, 32)), (74596, (101, 114)), (71147, (110, 32)), (69024, (32, 119)), (67784, (32, 46)), (63651, (101, 100)), (59800, (114, 101)), (59248, (97, 110)), (59126, (114, 32)), (54963, (111, 117)), (54315, (97, 114)), (53920, (110, 103)), (53063, (32, 111)), (51041, (111, 110)), (50328, (110, 100)), (48502, (103, 32)), (47669, (111, 32)), (46071, (104, 97)), (43749, (97, 116)), (42753, (32, 98)), (42377, (104, 105)), (42354, (116, 111)), (40253, (32, 105)), (39978, (111, 114)), (39345, (97, 115)), (37578, (101, 110)), (36714, (108, 101)), (35570, (32, 102)), (34927, (115, 116)), (33603, (101, 97)), (32805, (105, 115)), (32546, (110, 116)), (32302, (105, 116)), (31590, (32, 99)), (31485, (116, 101)), (31072, (119, 97)), (30221, (101, 115)), (30002, (102, 3

In [16]:
top_pair = max(stats, key=stats.get)
top_pair

(101, 32)

In [17]:
def get_stats(ids):
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge_pair(indices: list[int], pair: tuple[int, int], new_index: int) -> list[int]:
    merged = []
    i = 0

    while i < len(indices):
        if i < len(indices) - 1 and (indices[i], indices[i + 1]) == pair:
            merged.append(new_index)
            i += 2
        else:
            merged.append(indices[i])
            i += 1

    return merged

# ---
vocab_size = 324 # the desired final vocabulary size
num_merges = vocab_size - 256
ids = list(tokens) # copy so we don't destroy the original list

merges = {} # (int, int) -> int
for i in range(num_merges):
  stats = get_stats(ids)
  pair = max(stats, key=stats.get)
  idx = 256 + i
  print(f"merging {pair} into a new token {idx}")
  ids = merge_pair(ids, pair, idx)
  merges[pair] = idx

merging (101, 32) into a new token 256
merging (100, 32) into a new token 257
merging (116, 32) into a new token 258
merging (115, 32) into a new token 259
merging (116, 104) into a new token 260
merging (105, 110) into a new token 261
merging (121, 32) into a new token 262
merging (101, 114) into a new token 263
merging (97, 110) into a new token 264
merging (101, 257) into a new token 265
merging (111, 117) into a new token 266
merging (97, 114) into a new token 267
merging (111, 110) into a new token 268
merging (103, 32) into a new token 269
merging (260, 256) into a new token 270
merging (111, 32) into a new token 271
merging (261, 269) into a new token 272
merging (111, 114) into a new token 273
merging (101, 110) into a new token 274
merging (263, 32) into a new token 275
merging (116, 271) into a new token 276
merging (102, 32) into a new token 277
merging (108, 108) into a new token 278
merging (264, 257) into a new token 279
merging (104, 105) into a new token 280
merging (10

In [18]:
print("tokens length:", len(tokens))
print("ids length:", len(ids))
print(f"compression ratio: {len(tokens) / len(ids):.2f}X")

tokens length: 5992253
ids length: 3594233
compression ratio: 1.67X
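The `merges` table built by the loop above is enough to map ids back to text: every new id expands to the concatenation of the bytes of the pair it replaced. The sketch below shows this on a toy string so that it runs without the dataset. The names `train_bpe`, `build_vocab` and `decode_ids` are illustrative assumptions.

```python
def get_stats(ids):
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts


def merge_pair(ids, pair, new_id):
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    ids = list(text.encode("utf-8"))
    merges = {}
    for i in range(num_merges):
        stats = get_stats(ids)
        if not stats:  # fewer than two tokens left
            break
        pair = max(stats, key=stats.get)
        ids = merge_pair(ids, pair, 256 + i)
        merges[pair] = 256 + i
    return ids, merges


def build_vocab(merges):
    # Start from the 256 single-byte tokens, then replay merges in the
    # order they were learned (dicts preserve insertion order).
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():
        vocab[new_id] = vocab[left] + vocab[right]
    return vocab


def decode_ids(ids, vocab):
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")


text = "the boy who lived, the boy who lived"
ids, merges = train_bpe(text, num_merges=10)
vocab = build_vocab(merges)
print(len(text.encode("utf-8")), "->", len(ids), "tokens")
print(decode_ids(ids, vocab))
```

`errors="replace"` is a defensive choice: an arbitrary id sequence may not form valid UTF-8, even though sequences produced by encoding real text always do.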


### Byte-Pair Encoding

In [39]:
# https://github.com/clabrugere/byte-pair-encoding/blob/master/tests/test_tokenizer.py
import logging
import re
from collections import Counter

# Used by BPETokenizer.register_special_token
logger = logging.getLogger(__name__)


def string_to_byte(input: str, encoding: str) -> list[int]:
    return list(map(int, input.encode(encoding)))


def most_frequent_pair(indices: list[int]) -> tuple[tuple[int, int], int]:
    counts = Counter(zip(indices, indices[1:]))
    pair, count = counts.most_common(1)[0]

    return pair, count


def merge_pair(indices: list[int], pair: tuple[int, int], new_index: int) -> list[int]:
    merged = []
    i = 0

    while i < len(indices):
        if i < len(indices) - 1 and (indices[i], indices[i + 1]) == pair:
            merged.append(new_index)
            i += 2
        else:
            merged.append(indices[i])
            i += 1

    return merged


class BPETokenizer:
    def __init__(
        self,
        max_vocab_size: int,
    ):
        if max_vocab_size <= 256:
            raise ValueError(f"max_vocab_size must be greater than 256, got '{max_vocab_size}'.")

        self.max_vocab_size = max_vocab_size
        self.reset()

    def reset(self) -> None:
        # UTF-8 encodes each character as 1 to 4 bytes, and every byte takes one of 256 values, so the base
        # vocabulary size is 256. This also guarantees that there are no out-of-vocabulary tokens as long as the
        # input string can be encoded in UTF-8.
        self.pairs = {}
        self.id_to_token = {i: bytes([i]) for i in range(256)}
        self.next_id = 256
        self.special_to_id = {}
        self.id_to_special = {}

    def register_special_token(self, token: str) -> None:
        if token not in self.special_to_id:
            logger.info(f"Registering special token {token} with id {self.next_id}.")

            self.special_to_id[token] = self.next_id
            self.id_to_special[self.next_id] = token
            self.next_id += 1

    def train(self, input: str, stop_early: bool = False, verbose: bool = True) -> None:
        indices = string_to_byte(input, "utf-8")

        while self.vocab_size < self.max_vocab_size:
            if len(indices) < 2:
                # No adjacent pair is left to merge (empty or fully merged input).
                break

            pair, count = most_frequent_pair(indices)

            if stop_early and count == 1:
                break

            indices = merge_pair(indices, pair, self.next_id)
            new_token = self.id_to_token[pair[0]] + self.id_to_token[pair[1]]
            self.pairs[pair] = self.next_id
            self.id_to_token[self.next_id] = new_token

            if verbose:
                print(f"Merged ids {pair} as new token {new_token} with id {self.next_id}.")

            self.next_id += 1

        print(f"Stopping compression after {len(self.pairs)} pair merges with vocab size of {self.vocab_size}.")

    def _encode_non_special(self, input: str) -> list[int]:
        indices = string_to_byte(input, "utf-8")
        i = 0

        while i < len(indices) - 1:
            pair = (indices[i], indices[i + 1])
            if pair in self.pairs:
                indices = merge_pair(indices, pair, self.pairs[pair])
            else:
                i += 1

        return indices

    def encode(self, input: str) -> list[int]:
        if len(self.special_to_id) > 0:
            special_pattern = re.compile(f"({'|'.join(re.escape(t) for t in self.special_to_id.keys())})")
            splits = special_pattern.split(input)
        else:
            splits = [input]

        indices = []

        for split in splits:
            if split in self.special_to_id:
                indices.append(self.special_to_id[split])
            else:
                indices.extend(self._encode_non_special(split))

        return indices

    def decode(self, indices: list[int]) -> str:
        decoded = []

        for id in indices:
            if id in self.id_to_special:
                decoded.append(self.id_to_special[id].encode("utf-8"))
            else:
                decoded.append(self.id_to_token[id])

        decoded = b"".join(decoded).decode("utf-8")

        return decoded

    @property
    def vocab_size(self) -> int:
        return self.next_id
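Note that `_encode_non_special` applies whichever learned pair it meets first while scanning left to right. A common alternative is to replay merges in the order they were learned, always applying the pair with the lowest merged id first. The sketch below illustrates that variant; `encode_by_merge_order` is a hypothetical helper, and `pairs` is assumed to have the same shape as `self.pairs`.

```python
def merge_pair(indices, pair, new_index):
    merged, i = [], 0
    while i < len(indices):
        if i < len(indices) - 1 and (indices[i], indices[i + 1]) == pair:
            merged.append(new_index)
            i += 2
        else:
            merged.append(indices[i])
            i += 1
    return merged


def encode_by_merge_order(text, pairs):
    indices = list(text.encode("utf-8"))
    while len(indices) >= 2:
        candidates = set(zip(indices, indices[1:]))
        # The lowest merged id corresponds to the earliest learned merge.
        best = min(candidates, key=lambda p: pairs.get(p, float("inf")))
        if best not in pairs:
            break  # no learned merge applies any more
        indices = merge_pair(indices, best, pairs[best])
    return indices


# Toy merge table: (b, c) was learned first, (a, b) second.
pairs = {(98, 99): 256, (97, 98): 257}
print(encode_by_merge_order("abc", pairs))  # [97, 256]
```

On this toy table a left-to-right scan would merge `(97, 98)` first and return `[257, 99]`. Both results decode to the same string, but only the merge-order variant reproduces the segmentation seen during training.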

In [42]:
tokenizer = BPETokenizer(max_vocab_size=275)
tokenizer.train(data)

Merged ids (32, 116) as new token b' t' with id 256.
Merged ids (104, 101) as new token b'he' with id 257.
Merged ids (256, 257) as new token b' the' with id 258.
Merged ids (258, 32) as new token b' the ' with id 259.
Merged ids (105, 100) as new token b'id' with id 260.
Merged ids (114, 111) as new token b'ro' with id 261.
Merged ids (256, 111) as new token b' to' with id 262.
Merged ids (119, 104) as new token b'wh' with id 263.
Merged ids (263, 121) as new token b'why' with id 264.
Merged ids (264, 32) as new token b'why ' with id 265.
Merged ids (265, 100) as new token b'why d' with id 266.
Merged ids (266, 260) as new token b'why did' with id 267.
Merged ids (267, 259) as new token b'why did the ' with id 268.
Merged ids (268, 99) as new token b'why did the c' with id 269.
Merged ids (269, 104) as new token b'why did the ch' with id 270.
Merged ids (270, 105) as new token b'why did the chi' with id 271.
Merged ids (271, 99) as new token b'why did the chic' with id 272.
Merged ids

In [21]:
tokenizer.encode("Chirag Agarwal")

[67, 104, 105, 114, 97, 269, 65, 103, 267, 119, 97, 108]

In [22]:
tokenizer.decode(tokenizer.encode("Chirag Agarwal"))

'Chirag Agarwal'

In [38]:
for id in range(256, 275):
    print(tokenizer.id_to_token[id])

b' t'
b'he'
b' the'
b' the '
b'id'
b'ro'
b' to'
b'wh'
b'why'
b'why '
b'why d'
b'why did'
b'why did the '
b'why did the c'
b'why did the ch'
b'why did the chi'
b'why did the chic'
b'why did the chick'
b'why did the chicke'


In [24]:
tokenizer.id_to_token

{0: b'\x00',
 1: b'\x01',
 2: b'\x02',
 3: b'\x03',
 4: b'\x04',
 5: b'\x05',
 6: b'\x06',
 7: b'\x07',
 8: b'\x08',
 9: b'\t',
 10: b'\n',
 11: b'\x0b',
 12: b'\x0c',
 13: b'\r',
 14: b'\x0e',
 15: b'\x0f',
 16: b'\x10',
 17: b'\x11',
 18: b'\x12',
 19: b'\x13',
 20: b'\x14',
 21: b'\x15',
 22: b'\x16',
 23: b'\x17',
 24: b'\x18',
 25: b'\x19',
 26: b'\x1a',
 27: b'\x1b',
 28: b'\x1c',
 29: b'\x1d',
 30: b'\x1e',
 31: b'\x1f',
 32: b' ',
 33: b'!',
 34: b'"',
 35: b'#',
 36: b'$',
 37: b'%',
 38: b'&',
 39: b"'",
 40: b'(',
 41: b')',
 42: b'*',
 43: b'+',
 44: b',',
 45: b'-',
 46: b'.',
 47: b'/',
 48: b'0',
 49: b'1',
 50: b'2',
 51: b'3',
 52: b'4',
 53: b'5',
 54: b'6',
 55: b'7',
 56: b'8',
 57: b'9',
 58: b':',
 59: b';',
 60: b'<',
 61: b'=',
 62: b'>',
 63: b'?',
 64: b'@',
 65: b'A',
 66: b'B',
 67: b'C',
 68: b'D',
 69: b'E',
 70: b'F',
 71: b'G',
 72: b'H',
 73: b'I',
 74: b'J',
 75: b'K',
 76: b'L',
 77: b'M',
 78: b'N',
 79: b'O',
 80: b'P',
 81: b'Q',
 82: b'R',
 83: b'

### SentencePiece

In [25]:
# write the Harry Potter paragraph (raw_para) to a toy.txt file
with open("./data/toy.txt", "w", encoding="utf-8") as f:
  f.write(raw_para)

import sentencepiece as spm

import os

options = dict(
  # input spec
  input="./data/toy.txt",
  input_format="text",
  # output spec
  model_prefix="tok400", # output filename prefix
  # algorithm spec
  # BPE alg
  model_type="bpe",
  vocab_size=400,
  # normalization
  normalization_rule_name="identity", # turn off normalization
  remove_extra_whitespaces=False,
  input_sentence_size=200000000, # max number of training sentences
  max_sentence_length=1000000, # max number of bytes per sentence
  seed_sentencepiece_size=1000000,
  shuffle_input_sentence=True,
  # rare word treatment
  character_coverage=0.99995,
  byte_fallback=True,
  # merge rules
  split_digits=True,
  split_by_unicode_script=True,
  split_by_whitespace=True,
  split_by_number=True,
  max_sentencepiece_length=16,
  add_dummy_prefix=True,
  allow_whitespace_only_pieces=True,
  # special tokens
  unk_id=0, # the UNK token MUST exist
  bos_id=1, # the others are optional, set to -1 to turn off
  eos_id=2,
  pad_id=-1,
  # systems
  num_threads=os.cpu_count(), # use ~all system resources
)

spm.SentencePieceTrainer.train(**options)

sentencepiece_trainer.cc(78) LOG(INFO) Starts training with : 
trainer_spec {
  input: ./data/toy.txt
  input_format: text
  model_prefix: tok400
  model_type: BPE
  vocab_size: 400
  self_test_sample_size: 0
  character_coverage: 0.99995
  input_sentence_size: 200000000
  shuffle_input_sentence: 1
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  max_sentence_length: 1000000
  num_threads: 12
  num_sub_iterations: 2
  max_sentencepiece_length: 16
  split_by_unicode_script: 1
  split_by_number: 1
  split_by_whitespace: 1
  split_digits: 1
  pretokenization_delimiter: 
  treat_whitespace_as_suffix: 0
  allow_whitespace_only_pieces: 1
  required_chars: 
  byte_fallback: 1
  vocabulary_output_piece_score: 1
  train_extremely_large_corpus: 0
  seed_sentencepieces_file: 
  hard_vocab_limit: 1
  use_all_vocab: 0
  unk_id: 0
  bos_id: 1
  eos_id: 2
  pad_id: -1
  unk_piece: <unk>
  bos_piece: <s>
  eos_piece: </s>
  pad_piece: <pad>
  unk_surface:  ⁇ 
  enable_differential_privacy

In [26]:
sp = spm.SentencePieceProcessor()
sp.load('tok400.model')
vocab = [[sp.id_to_piece(idx), idx] for idx in range(sp.get_piece_size())]
vocab

[['<unk>', 0],
 ['<s>', 1],
 ['</s>', 2],
 ['<0x00>', 3],
 ['<0x01>', 4],
 ['<0x02>', 5],
 ['<0x03>', 6],
 ['<0x04>', 7],
 ['<0x05>', 8],
 ['<0x06>', 9],
 ['<0x07>', 10],
 ['<0x08>', 11],
 ['<0x09>', 12],
 ['<0x0A>', 13],
 ['<0x0B>', 14],
 ['<0x0C>', 15],
 ['<0x0D>', 16],
 ['<0x0E>', 17],
 ['<0x0F>', 18],
 ['<0x10>', 19],
 ['<0x11>', 20],
 ['<0x12>', 21],
 ['<0x13>', 22],
 ['<0x14>', 23],
 ['<0x15>', 24],
 ['<0x16>', 25],
 ['<0x17>', 26],
 ['<0x18>', 27],
 ['<0x19>', 28],
 ['<0x1A>', 29],
 ['<0x1B>', 30],
 ['<0x1C>', 31],
 ['<0x1D>', 32],
 ['<0x1E>', 33],
 ['<0x1F>', 34],
 ['<0x20>', 35],
 ['<0x21>', 36],
 ['<0x22>', 37],
 ['<0x23>', 38],
 ['<0x24>', 39],
 ['<0x25>', 40],
 ['<0x26>', 41],
 ['<0x27>', 42],
 ['<0x28>', 43],
 ['<0x29>', 44],
 ['<0x2A>', 45],
 ['<0x2B>', 46],
 ['<0x2C>', 47],
 ['<0x2D>', 48],
 ['<0x2E>', 49],
 ['<0x2F>', 50],
 ['<0x30>', 51],
 ['<0x31>', 52],
 ['<0x32>', 53],
 ['<0x33>', 54],
 ['<0x34>', 55],
 ['<0x35>', 56],
 ['<0x36>', 57],
 ['<0x37>', 58],
 ['<0x38>', 5

In [27]:
ids = sp.encode("Welcome to Hogwarts!!")
print(ids)
print([sp.id_to_piece(idx) for idx in ids])

[362, 90, 363, 372, 376, 364, 339, 309, 362, 396, 364, 389, 277, 367, 366, 368, 393, 393]
['▁', '<0x57>', 'e', 'l', 'c', 'o', 'me', '▁to', '▁', 'H', 'o', 'g', 'wa', 'r', 't', 's', '!', '!']
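In the pieces above, `▁` (U+2581) marks a space, and `<0x57>` is a byte-fallback piece: `W` was not in the learned vocabulary, so it is emitted as its raw byte. The sketch below is a simplified approximation of how such pieces map back to text. It is not SentencePiece's own implementation, and `pieces_to_text` is a hypothetical helper; in practice the processor's own decoding should be used.

```python
import re

WS = "\u2581"  # the "▁" whitespace marker used in the pieces above
BYTE_PIECE = re.compile(r"^<0x([0-9A-Fa-f]{2})>$")


def pieces_to_text(pieces, dummy_prefix=True):
    buf = bytearray()
    for piece in pieces:
        match = BYTE_PIECE.match(piece)
        if match:
            # Byte-fallback piece: append the raw byte value.
            buf.append(int(match.group(1), 16))
        else:
            buf.extend(piece.replace(WS, " ").encode("utf-8"))
    text = buf.decode("utf-8", errors="replace")
    # add_dummy_prefix=True prepends one space during encoding; drop it again.
    if dummy_prefix and text.startswith(" "):
        text = text[1:]
    return text


pieces = [WS, "<0x57>", "e", "l", "c", "o", "me", WS + "to", WS,
          "H", "o", "g", "wa", "r", "t", "s", "!", "!"]
print(pieces_to_text(pieces))  # Welcome to Hogwarts!!
```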


### Extending to a New Tokenizer!

In [28]:
from transformers import AutoTokenizer

old_tokenizer = AutoTokenizer.from_pretrained("gpt2")

In [29]:
example = '''def add_numbers(a, b):
    """Add the two numbers `a` and `b`."""
    return a + b'''

tokens = old_tokenizer.tokenize(example)
tokens

['def',
 'Ġadd',
 '_',
 'n',
 'umbers',
 '(',
 'a',
 ',',
 'Ġb',
 '):',
 'Ċ',
 'Ġ',
 'Ġ',
 'Ġ',
 'Ġ"""',
 'Add',
 'Ġthe',
 'Ġtwo',
 'Ġnumbers',
 'Ġ`',
 'a',
 '`',
 'Ġand',
 'Ġ`',
 'b',
 '`',
 '."',
 '""',
 'Ċ',
 'Ġ',
 'Ġ',
 'Ġ',
 'Ġreturn',
 'Ġa',
 'Ġ+',
 'Ġb']
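The `Ġ` and `Ċ` symbols in this output are how GPT-2's byte-level BPE displays the space and newline bytes: every byte is mapped to a printable Unicode character before merging. The sketch below rebuilds that mapping following the scheme in GPT-2's released encoder. It is a re-implementation for illustration rather than the code `transformers` runs internally.

```python
def bytes_to_unicode():
    # Bytes that are already printable keep their own character.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(0xA1, 0xAC + 1))
        + list(range(0xAE, 0xFF + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    seen = set(printable)
    n = 0
    # Remaining bytes (control characters, space, ...) are shifted past 255.
    for b in range(256):
        if b not in seen:
            byte_values.append(b)
            code_points.append(256 + n)
            n += 1
    return {b: chr(cp) for b, cp in zip(byte_values, code_points)}


mapping = bytes_to_unicode()
print(repr(mapping[ord(" ")]))   # 'Ġ'
print(repr(mapping[ord("\n")]))  # 'Ċ'
print(repr(mapping[ord("a")]))   # 'a'
```

This explains why four leading spaces show up as three `Ġ` tokens followed by a token that starts with `Ġ`.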

In [30]:
from datasets import load_dataset

raw_dataset = load_dataset("code_search_net", "python")

def get_training_corpus():
    dataset = raw_dataset["train"]
    for start_idx in range(0, len(dataset), 5000):
        samples = dataset[start_idx : start_idx + 5000]
        yield samples["whole_func_string"]

training_corpus = get_training_corpus()
new_tokenizer = old_tokenizer.train_new_from_iterator(training_corpus, 2*52000)
print(new_tokenizer.tokenize(example))

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)





['def', 'Ġadd', '_', 'numbers', '(', 'a', ',', 'Ġb', '):', 'ĊĠĠĠ', 'Ġ"""', 'Add', 'Ġthe', 'Ġtwo', 'Ġnumbers', 'Ġ`', 'a', '`', 'Ġand', 'Ġ`', 'b', '`."""', 'ĊĠĠĠ', 'Ġreturn', 'Ġa', 'Ġ+', 'Ġb']


In [31]:
print(len(new_tokenizer.tokenize(example)))
print(len(old_tokenizer.tokenize(example)))

27
36
