In [1]:
import MinBPE.TikToken
import MinBPE.Codec
import MinBPE.Types

import qualified Data.Text.IO as TIO
import qualified Data.Map as Map
import qualified Data.ByteString as B

Loading TikToken files
======================

Load the cl100k_base encoding into a `Vocab`:

In [2]:
v <- loadTikToken "../example/cl100k_base.tiktoken"
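
For orientation, each line of a `.tiktoken` file holds the token's bytes in Base64, a space, and the token's integer rank. The sketch below parses one such line using only `base` and `bytestring`. The names `sextet`, `decodeBase64`, and `parseLine` are hypothetical helpers written for illustration, and this is not necessarily how `loadTikToken` is implemented.

```haskell
import qualified Data.ByteString as B
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Char (isAsciiLower, isAsciiUpper, isDigit, ord)
import Data.Word (Word8)
import Text.Read (readMaybe)

-- Map one character of the standard Base64 alphabet to its 6-bit value.
sextet :: Char -> Maybe Int
sextet c
  | isAsciiUpper c = Just (ord c - ord 'A')
  | isAsciiLower c = Just (ord c - ord 'a' + 26)
  | isDigit c      = Just (ord c - ord '0' + 52)
  | c == '+'       = Just 62
  | c == '/'       = Just 63
  | otherwise      = Nothing

-- Decode standard Base64 (hypothetical helper). Everything from the first
-- '=' onward is treated as padding; a dangling single sextet is ignored.
decodeBase64 :: String -> Maybe B.ByteString
decodeBase64 s = B.pack . go <$> mapM sextet (takeWhile (/= '=') s)
  where
    go (a : b : c : d : rest) =
      let n = a `shiftL` 18 .|. b `shiftL` 12 .|. c `shiftL` 6 .|. d
      in byte (n `shiftR` 16) : byte (n `shiftR` 8) : byte n : go rest
    go [a, b, c] =
      let n = a `shiftL` 18 .|. b `shiftL` 12 .|. c `shiftL` 6
      in [byte (n `shiftR` 16), byte (n `shiftR` 8)]
    go [a, b] = [byte ((a `shiftL` 18 .|. b `shiftL` 12) `shiftR` 16)]
    go _ = []
    byte :: Int -> Word8
    byte n = fromIntegral (n .&. 0xFF)

-- Parse one "<base64> <rank>" line into (token bytes, rank).
parseLine :: String -> Maybe (B.ByteString, Int)
parseLine l = case words l of
  [b64, rank] -> (,) <$> decodeBase64 b64 <*> readMaybe rank
  _           -> Nothing
```

For example, `parseLine "IQ== 0"` yields the single byte `0x21` (`!`) paired with rank `0`.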



The `Vocab` is a `Map`, so we can check its size and use it to decode token IDs:

In [3]:
Map.size v

100256

In [4]:
decode v [
    5109, 15836, 596, 3544, 4221, 4211, 320, 57753, 14183, 311, 439, 480, 
    2898, 596, 8, 1920, 1495, 1701, 11460, 11, 902, 527, 4279, 24630, 315,
    5885, 1766, 304, 264, 743, 315, 1495, 13, 578, 4211, 4048, 311, 3619,
    279, 29564, 12135, 1990, 1521, 11460, 11, 323, 25555, 520, 17843, 279,
    1828, 4037, 304, 264, 8668, 315, 11460, 382, 2675, 649, 1005, 279, 5507,
    3770, 311, 3619, 1268, 264, 6710, 315, 1495, 2643, 387, 4037, 1534, 555,
    264, 4221, 1646, 11, 323, 279, 2860, 1797, 315, 11460, 304, 430, 6710, 315,
    1495, 382, 2181, 596, 3062, 311, 5296, 430, 279, 4839, 4037, 2065, 1920,
    35327, 1990, 4211, 13, 1561, 261, 4211, 1093, 480, 2898, 12, 18, 13, 20,
    323, 480, 2898, 12, 19, 1005, 264, 2204, 47058, 1109, 3766, 4211, 11, 323,
    690, 8356, 2204, 11460, 369, 279, 1890, 1988, 1495, 13]

"OpenAI's large language models (sometimes referred to as GPT's) process text using tokens, which are common sequences of characters found in a set of text. The models learn to understand the statistical relationships between these tokens, and excel at producing the next token in a sequence of tokens.\n\nYou can use the tool below to understand how a piece of text might be tokenized by a language model, and the total count of tokens in that piece of text.\n\nIt's important to note that the exact tokenization process varies between models. Newer models like GPT-3.5 and GPT-4 use a different tokenizer than previous models, and will produce different tokens for the same input text."

In [5]:
TIO.putStrLn $ decode v [33334, 45918, 243, 21990, 9080, 33334, 62004, 16556, 78699]

お誕生日おめでとう
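
Tokens are byte sequences, so a token boundary need not fall on a character boundary. Decoding therefore has to concatenate the bytes of all tokens before interpreting them as UTF-8. The sketch below illustrates this with a toy vocabulary. `ToyVocab`, `toyVocab`, and `toyDecode` are hypothetical names, the `Map Int ByteString` layout is an assumption rather than the library's actual `Vocab` type, and skipping unknown IDs is a simplification.

```haskell
import qualified Data.ByteString as B
import qualified Data.Map as Map
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf8With)
import Data.Text.Encoding.Error (lenientDecode)

-- Assumed layout for illustration: token ID -> token bytes.
type ToyVocab = Map.Map Int B.ByteString

toyVocab :: ToyVocab
toyVocab = Map.fromList
  [ (0, B.pack [0xE3, 0x81])  -- first two UTF-8 bytes of "お" (U+304A)
  , (1, B.pack [0x8A])        -- final UTF-8 byte of "お"
  , (2, B.pack [0x21])        -- "!"
  ]

-- Look up each ID, concatenate the bytes, then decode once as UTF-8.
-- Unknown IDs contribute no bytes; invalid UTF-8 becomes U+FFFD.
toyDecode :: ToyVocab -> [Int] -> T.Text
toyDecode vocab ids =
  decodeUtf8With lenientDecode
    (B.concat [Map.findWithDefault B.empty i vocab | i <- ids])
```

Here `toyDecode toyVocab [0, 1, 2]` reassembles the three bytes of `お` from two tokens before decoding, whereas decoding token `0` on its own would not produce a valid character.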

We can convert the `Vocab` to `Vector` format to use the optimized decoders:

In [6]:
vv = vocabToVector v

In [8]:
decodeVec vv [83, 1609, 5963, 374, 2294, 0]

"tiktoken is great!"

In [None]:
decodeVec vv [519, 85342, 34500, 479, 8997, 2191]

"antidisestablishmentarianism"
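
One plausible reason an indexed representation decodes faster is that a lookup by token ID becomes a constant-time index instead of a logarithmic-time tree search. The sketch below shows the idea with `Data.Array` standing in for `Vector`. `toArray` and `lookupToken` are hypothetical names, and the sketch assumes token IDs are dense, running from `0` to `n - 1`; it does not claim to mirror `vocabToVector` or `decodeVec`.

```haskell
import Data.Array (Array, bounds, listArray, (!))
import Data.Ix (inRange)
import qualified Data.ByteString as B
import qualified Data.Map as Map

-- Build an array indexed by token ID.
-- Assumption: the keys of the map are exactly 0 .. size - 1.
toArray :: Map.Map Int B.ByteString -> Array Int B.ByteString
toArray m = listArray (0, Map.size m - 1) (Map.elems m)

-- Constant-time lookup with a bounds check.
lookupToken :: Array Int B.ByteString -> Int -> Maybe B.ByteString
lookupToken arr i
  | inRange (bounds arr) i = Just (arr ! i)
  | otherwise              = Nothing
```

`Map.elems` returns values in ascending key order, so under the dense-ID assumption position `i` in the array holds the bytes of token `i`.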

Writing TikToken files
======================

Write the `Vocab` back to disk in the same format:

In [None]:
writeTikToken "../example/cl100k_base_copy.tiktoken" v

Compare the written file to the original byte for byte to confirm that the round trip is lossless:

In [None]:
original <- B.readFile "../example/cl100k_base.tiktoken"
copy <- B.readFile "../example/cl100k_base_copy.tiktoken"
original == copy

True