Contents:

A) Trying my rust_tokenizer

B) Trying my Tokenizer python class

C) Trying other misc rust functions

To build the rust library:

```
cd rust_tokenizer
maturin develop
```

## A) Trying my rust_tokenizer

`rust_tokenizer/src/lib.rs` is my partial copy of nanochat's [rustbpe](https://github.com/karpathy/nanochat/blob/master/rustbpe/src/lib.rs). It uses the same efficiency techniques, for example counting in parallel and careful bookkeeping to avoid recomputing, unlike my play examples in challenges 1 and 3. I added an extremely verbose debug output (`debug_print = True`) to help understand exactly how the whole thing works. The debug output is only practical when training on a few words.

In [36]:
# copied from https://github.com/karpathy/nanochat/blob/master/nanochat/tokenizer.py
SPLIT_PATTERN = r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,2}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+"""

In [37]:
import rust_tokenizer

In [38]:
tokenizer = rust_tokenizer.Tokenizer(debug_print = True)

In [39]:
tokenizer.train_from_iterator(
    iterator = ['the cat the cat', 'hi cat'],
    vocab_size = 1000,
    buffer_size = 1,
    pattern = SPLIT_PATTERN)

----start of iterating through text passed in, splitting it into words, and counting----

just filled one buffer under GIL, buffer contains 1 strings
will now split into words and count in parallel without holding GIL
finished splitting and counting for this buffer, 'local' counts map: {" the": 1, " cat": 2, "the": 1}

just filled one buffer under GIL, buffer contains 1 strings
will now split into words and count in parallel without holding GIL
finished splitting and counting for this buffer, 'local' counts map: {"hi": 1, " cat": 1}

----end of iterating through text passed in, splitting it into words, and counting----

counts map: {"hi": 1, " the": 1, "the": 1, " cat": 3}

words: [Word { ids: [104, 105] }, Word { ids: [32, 116, 104, 101] }, Word { ids: [116, 104, 101] }, Word { ids: [32, 99, 97, 116] }]

cvec: [1, 1, 1, 3]

as an example, here are the pairs from the first word: [(104, 105)]

Will now form and count all pairs, keeping track of which word(s) each pair comes from

pair -
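The bookkeeping shown in the debug output above can be mirrored in a few lines of Python. This is a minimal sketch, not the Rust code: it starts from the per-buffer 'local' counts printed above (so no regex splitting is needed), merges them into one counts map, and then counts every adjacent pair while remembering which words each pair occurs in. The names `pair_counts` and `where_to_update` are assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Per-buffer 'local' counts, copied from the debug output above.
local_counts = [
    Counter({" the": 1, " cat": 2, "the": 1}),
    Counter({"hi": 1, " cat": 1}),
]

# Merge the local maps into one global counts map.
counts = Counter()
for local in local_counts:
    counts.update(local)

# Each unique word becomes a list of byte ids, with its count kept alongside.
# Words are listed in the same order as in the debug output.
order = ["hi", " the", "the", " cat"]
words = [list(w.encode("utf-8")) for w in order]
cvec = [counts[w] for w in order]

# Count every adjacent pair, weighted by the word count, and remember
# the indices of the words in which each pair occurs.
pair_counts = Counter()
where_to_update = defaultdict(set)
for idx, ids in enumerate(words):
    for pair in zip(ids, ids[1:]):
        pair_counts[pair] += cvec[idx]
        where_to_update[pair].add(idx)

print(cvec)                                  # counts per word
print(pair_counts[(32, 99)])                 # count of the pair (' ', 'c')
print(sorted(where_to_update[(116, 104)]))   # words containing ('t', 'h')
```

Keeping the pair-to-words map means that after a merge only the words containing the merged pair need to be revisited.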

In [40]:
mergeable_ranks = tokenizer.get_mergeable_ranks()
mergeable_ranks[60:70]

[(b'<', 60),
 (b'=', 61),
 (b'>', 62),
 (b'?', 63),
 (b'@', 64),
 (b'A', 65),
 (b'B', 66),
 (b'C', 67),
 (b'D', 68),
 (b'E', 69)]

In [41]:
mergeable_ranks[-30:]

[(b'\xe9', 233),
 (b'\xea', 234),
 (b'\xeb', 235),
 (b'\xec', 236),
 (b'\xed', 237),
 (b'\xee', 238),
 (b'\xef', 239),
 (b'\xf0', 240),
 (b'\xf1', 241),
 (b'\xf2', 242),
 (b'\xf3', 243),
 (b'\xf4', 244),
 (b'\xf5', 245),
 (b'\xf6', 246),
 (b'\xf7', 247),
 (b'\xf8', 248),
 (b'\xf9', 249),
 (b'\xfa', 250),
 (b'\xfb', 251),
 (b'\xfc', 252),
 (b'\xfd', 253),
 (b'\xfe', 254),
 (b'\xff', 255),
 (b' c', 256),
 (b'at', 257),
 (b' cat', 258),
 (b'he', 259),
 (b'the', 260),
 (b' the', 261),
 (b'hi', 262)]
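To see where ranks 256 to 262 come from, here is a naive byte-level BPE trainer in Python. It is a sketch under stated assumptions, not the Rust implementation: it recounts all pairs on every step (the very thing the Rust code avoids), it takes the word counts from the debug output instead of splitting text, and it breaks ties by choosing the smallest pair. That tie-break is an assumption, chosen because it reproduces the listing above.

```python
from collections import Counter

def train_bpe(word_counts, num_merges):
    """Naive byte-level BPE: recount every pair on every step."""
    words = {w: list(w.encode("utf-8")) for w in word_counts}
    vocab = {i: bytes([i]) for i in range(256)}
    merges = []
    for _ in range(num_merges):
        # Count adjacent pairs, weighted by how often each word occurs.
        pair_counts = Counter()
        for w, ids in words.items():
            for pair in zip(ids, ids[1:]):
                pair_counts[pair] += word_counts[w]
        if not pair_counts:
            break  # every word is already a single token
        # Highest count wins; ties go to the smallest pair (assumed tie-break).
        best = min(pair_counts, key=lambda p: (-pair_counts[p], p))
        new_id = 256 + len(merges)
        vocab[new_id] = vocab[best[0]] + vocab[best[1]]
        merges.append((vocab[new_id], new_id))
        # Replace every occurrence of the best pair with the new id.
        for w, ids in words.items():
            out, i = [], 0
            while i < len(ids):
                if i + 1 < len(ids) and (ids[i], ids[i + 1]) == best:
                    out.append(new_id)
                    i += 2
                else:
                    out.append(ids[i])
                    i += 1
            words[w] = out
    return merges

# Word counts taken from the debug output in this section.
merges = train_bpe({"hi": 1, " the": 1, "the": 1, " cat": 3}, num_merges=10)
for token, rank in merges:
    print(token, rank)
```

Training stops after seven merges here because each word has collapsed into a single token, which is why the listing above ends at 262 even though `vocab_size` was 1000.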

In [42]:
tokenizer = rust_tokenizer.Tokenizer(debug_print = False)

In [45]:
tokenizer.train_from_iterator(
    iterator = ['The cat looked for a dog.', 'The dog looked for a cat.', 'Bicycles are great.', 'Cats are good.'],
    vocab_size = 1000,
    buffer_size = 50,
    pattern = SPLIT_PATTERN)

In [46]:
tokenizer.get_mergeable_ranks()[-40:]

[(b'\xfa', 250),
 (b'\xfb', 251),
 (b'\xfc', 252),
 (b'\xfd', 253),
 (b'\xfe', 254),
 (b'\xff', 255),
 (b' a', 256),
 (b'at', 257),
 (b'oo', 258),
 (b're', 259),
 (b' c', 260),
 (b' d', 261),
 (b' f', 262),
 (b' g', 263),
 (b' l', 264),
 (b'Th', 265),
 (b'ed', 266),
 (b'ked', 267),
 (b'og', 268),
 (b'or', 269),
 (b' are', 270),
 (b'ooked', 271),
 (b' cat', 272),
 (b' dog', 273),
 (b' for', 274),
 (b' looked', 275),
 (b'The', 276),
 (b'Bi', 277),
 (b'Cat', 278),
 (b'cl', 279),
 (b'cy', 280),
 (b'es', 281),
 (b'ood', 282),
 (b'reat', 283),
 (b' good', 284),
 (b' great', 285),
 (b'Bicy', 286),
 (b'Cats', 287),
 (b'cles', 288),
 (b'Bicycles', 289)]

## B) Trying my Tokenizer python class

`my_tokenizer.py` is my very partial copy of nanochat's [RustBPETokenizer class](https://github.com/karpathy/nanochat/blob/master/nanochat/tokenizer.py). It trains with the rust tokenizer and uses tiktoken for encoding and decoding. It doesn't do anything with special tokens like `<|user_start|>` that I assume will be needed later once we start working with the model.

In [47]:
from my_tokenizer import MyTokenizer

In [48]:
tokenizer = MyTokenizer.train_from_iterator(text_iterator = ["the cat", "the bat"], vocab_size = 1000)

In [49]:
tokens = tokenizer.encode("the cat"); tokens

[258, 262]

In [50]:
tokenizer.decode(tokens)

'the cat'

In [51]:
tokens = tokenizer.encode("zebra"); tokens

[122, 101, 98, 114, 97]

In [52]:
tokenizer.decode(tokens)

'zebra'
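The "zebra" result above shows the byte fallback: none of its byte pairs were learned as merges, so each byte keeps its own id. The sketch below illustrates the idea for a single pre-split chunk by repeatedly merging the adjacent pair with the lowest rank, which is my understanding of how mergeable ranks are applied. It is not the code in `my_tokenizer.py` or tiktoken. The ranks are the ones listed in section A, so merged ids differ from the cells above, which were trained on different text; `encode_chunk` and `decode` are hypothetical helper names.

```python
# Ranks copied from the first listing in section A.
MERGES = [(b" c", 256), (b"at", 257), (b" cat", 258), (b"he", 259),
          (b"the", 260), (b" the", 261), (b"hi", 262)]

# Single bytes keep their byte value as id; merges follow from 256.
ranks = {bytes([i]): i for i in range(256)}
ranks.update(dict(MERGES))

def encode_chunk(chunk, ranks):
    """Encode one pre-split chunk by merging the lowest-ranked pair first."""
    parts = [bytes([b]) for b in chunk.encode("utf-8")]
    while len(parts) > 1:
        best_i, best_rank = None, None
        for i in range(len(parts) - 1):
            rank = ranks.get(parts[i] + parts[i + 1])
            if rank is not None and (best_rank is None or rank < best_rank):
                best_i, best_rank = i, rank
        if best_i is None:
            break  # no adjacent pair is a known merge
        parts[best_i:best_i + 2] = [parts[best_i] + parts[best_i + 1]]
    return [ranks[p] for p in parts]

def decode(ids, ranks):
    """Map ids back to bytes and decode the joined bytes as UTF-8."""
    by_id = {rank: token for token, rank in ranks.items()}
    return b"".join(by_id[i] for i in ids).decode("utf-8")

print(encode_chunk(" cat", ranks))
print(encode_chunk("zebra", ranks))
print(decode(encode_chunk("zebra", ranks), ranks))
```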

In [53]:
tokenizer = MyTokenizer.train_from_iterator(text_iterator = ["the cat", "the bat", "hello 👋", "你好"], vocab_size = 1000)

In [54]:
tokens = tokenizer.encode(" 👋"); tokens

[272]

In [55]:
tokens = tokenizer.encode("你好"); tokens

[273]

In [56]:
tokenizer.decode(tokens)

'你好'

Try saving / loading:

In [57]:
tokenizer = MyTokenizer.train_from_iterator(text_iterator = ["the cat", "the bat"], vocab_size = 1000)
tokenizer.encode("the car")

[258, 260, 97, 114]

In [58]:
tokenizer.save('my-tokenizer.pkl')

Saved tokenizer to my-tokenizer.pkl


In [59]:
tokenizer = MyTokenizer.load_from_file('my-tokenizer.pkl')

In [60]:
tokenizer.encode("the car")

[258, 260, 97, 114]
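The `.pkl` extension suggests the tokenizer is saved with `pickle`. The round trip below is a sketch of that idea only: the `state` dictionary and its keys are assumptions for illustration, and what `MyTokenizer.save` actually writes is not shown in this notebook.

```python
import os
import pickle
import tempfile

# Hypothetical tokenizer state: a split pattern plus the learned ranks.
state = {
    "pattern": r"\s+",  # placeholder pattern, not the real SPLIT_PATTERN
    "mergeable_ranks": {b" c": 256, b"at": 257, b" cat": 258},
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sketch-tokenizer.pkl")
    # Save: pickle files must be opened in binary mode.
    with open(path, "wb") as f:
        pickle.dump(state, f)
    # Load: the restored object compares equal to the original.
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == state)
```

Since unpickling can run arbitrary code, a pickled tokenizer should only be loaded from a file you created yourself.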

## C) Trying other misc rust functions

Since I'm still new to Rust and to calling it from Python, I exposed a second class called `Play` so I can try things out from this notebook.

In [61]:
play = rust_tokenizer.Play()

In [62]:
play.hello("Ernie")

'hello Ernie'

In [63]:
play.get_type("Ernie")

"<class 'str'>"

In [64]:
play.get_type([1,2,3])

"<class 'list'>"

In [65]:
play.get_type(iter([1,2,3]))

"<class 'list_iterator'>"

In [66]:
play.concat_from_iterator(["the", "cat", "flew"])

'thecatflew'

In [67]:
play.concat_from_iterator(123) # expect error

TypeError: 'int' object is not iterable

In [68]:
play.fancy_concat_from_iterator(["the", "cat", "flew"])

'thecatflew'

In [69]:
play.find_matches(SPLIT_PATTERN, "They went to buy 1234 candies.")

['They', ' went', ' to', ' buy', ' ', '12', '34', ' candies', '.']

In [70]:
play.understand_comparison()

pair_a.cmp(pair_b): Less
pair_b.cmp(pair_a): Greater
pair_a.cmp(pair_c): Equal
pair_a.cmp(pair_d): Less


In [71]:
play.merge_pair_into_word(word_ids = [1,2,3,1,2], pair = (1,2), new_id = 4)

word after merge: Word { ids: [4, 3, 4] }
detals: [((1, 2), -1), ((2, 3), -1), ((4, 3), 1), ((3, 1), -1), ((3, 4), 1), ((1, 2), -1)]
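The second line of that output lists how each pair count changes when the pair is merged inside one word. The Python sketch below reproduces both lines for this input. It is a port written for illustration, assuming the logic follows rustbpe's approach of recording the left and right neighbours of every merged pair, so the actual Rust code may differ in detail.

```python
def merge_pair_into_word(word_ids, pair, new_id):
    """Merge `pair` into `new_id` and report pair-count changes for the word."""
    a, b = pair
    out, deltas = [], []
    i, n = 0, len(word_ids)
    while i < n:
        if i + 1 < n and word_ids[i] == a and word_ids[i + 1] == b:
            left = out[-1] if out else None
            right = word_ids[i + 2] if i + 2 < n else None
            if left is not None:
                deltas.append(((left, a), -1))       # old pair on the left disappears
                deltas.append(((left, new_id), 1))   # new pair on the left appears
            deltas.append(((a, b), -1))              # the merged pair itself disappears
            if right is not None:
                deltas.append(((b, right), -1))      # old pair on the right disappears
                deltas.append(((new_id, right), 1))  # new pair on the right appears
            out.append(new_id)
            i += 2
        else:
            out.append(word_ids[i])
            i += 1
    return out, deltas

new_ids, deltas = merge_pair_into_word([1, 2, 3, 1, 2], pair=(1, 2), new_id=4)
print(new_ids)
print(deltas)
```

Multiplying each change by the word's count and adding it to the global pair counts keeps them current without recounting every word.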
