This notebook demonstrates how to use the tiktoken library to:
1) Encode text into tokens (integers) using the tokenization scheme of a specific model (gpt2 in this case).
2) Decode those tokens back to text.
Language models operate on these token IDs rather than on raw text, so following the round trip shows what a model actually receives in tasks such as text generation or semantic analysis.

**References:**
- openai-cookbook
- Build a Large Language Model (From Scratch)

Byte Pair Encoding (BPE)
-----------
----------

Byte Pair Encoding (BPE) is a compression algorithm that was adapted for tokenization and is now widely used in language models, including those behind chatbots. BPE represents text as a sequence of subword fragments, which keeps the vocabulary compact while still covering words that never appeared in the training text.

Purpose of BPE in Chatbots:
---------------------------
BPE is used to:

- Reduce vocabulary size: Instead of treating each word as a separate unit, BPE breaks words into smaller subunits (subwords, letters, or common fragments).
- Handle Out-of-Vocabulary (OOV) words: Even if a word is not in the vocabulary, it can be represented by its subunits, so the model can process text it has never seen.
- Optimize performance: A smaller vocabulary means smaller embedding and output layers, which reduces the model's memory and compute cost.
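
To make the OOV point concrete, here is a minimal sketch of how a word is segmented once merge rules exist. The `MERGES` list and the `apply_merges` helper are hypothetical and written by hand for illustration; they are not taken from GPT-2 or from tiktoken.

```python
# Hypothetical merge rules, listed in the order a BPE trainer might have learned them.
MERGES = [("l", "o"), ("lo", "w"), ("e", "r"), ("n", "e"), ("ne", "w")]


def apply_merges(word, merges):
    """Split a word into characters, then apply each merge rule in order."""
    symbols = list(word)
    for left, right in merges:
        merged = []
        i = 0
        while i < len(symbols):
            # Merge the pair wherever it occurs as two adjacent symbols.
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


print(apply_merges("lower", MERGES))   # word covered by the merge rules
print(apply_merges("lowxyz", MERGES))  # unseen word: falls back to single characters
```

A word the rules never saw still gets a valid segmentation, because any part that matches no rule simply stays as individual characters.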

How BPE Works:
--------

**Initialization:** the algorithm starts with a vocabulary that contains all the individual characters present in the training text, plus some special tokens (such as <|endoftext|>). GPT-2's byte-level variant starts from the 256 possible byte values instead of characters.

**Creating frequent combinations:** the pair of adjacent symbols (characters or previously merged sequences) that appears most frequently in the text is identified. This pair is merged to form a new token, which is added to the vocabulary.

**Repetition of the process:** the process is repeated a fixed number of times or until a predefined vocabulary size is reached.

**Generation of the final vocabulary:** the vocabulary contains tokens that represent the most frequent subunits in the training text.
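
The four steps above can be sketched in a few lines of plain Python. This is a character-level toy under stated assumptions, not the tokenizer that tiktoken ships: the names `train_bpe` and `merge_pair`, the pre-split word list, and the alphabetical tie-break are all choices made for this illustration.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def train_bpe(words, num_merges, special_tokens=("<|endoftext|>",)):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Initialization: every character in the corpus, plus the special tokens.
    vocab = sorted({ch for word in words for ch in word}) + list(special_tokens)
    # Each distinct word is stored as a tuple of symbols with its frequency.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # nothing left to merge
        # Most frequent pair; ties are broken alphabetically (an assumption).
        best = min(pair_counts, key=lambda p: (-pair_counts[p], p))
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
        merges.append(best)
        vocab.append(best[0] + best[1])
    return vocab, merges


words = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
vocab, merges = train_bpe(words, num_merges=4)
print(merges)
print(vocab)
```

Stopping after a fixed `num_merges` corresponds to the "fixed number of times" criterion; stopping when `len(vocab)` reaches a target would implement the vocabulary-size criterion instead.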

Practical Implementation
-----------
----------

In [1]:
pip install tiktoken

Collecting tiktoken
  Downloading tiktoken-0.8.0-cp310-cp310-win_amd64.whl (884 kB)
     -------------------------------------- 884.2/884.2 kB 8.0 MB/s eta 0:00:00
Installing collected packages: tiktoken
Successfully installed tiktoken-0.8.0
Note: you may need to restart the kernel to use updated packages.


We import "version" from the "importlib.metadata" module to look up the installed version of the tiktoken library, then print it.

In [2]:
from importlib.metadata import version
import tiktoken
print("tiktoken version:", version("tiktoken"))

tiktoken version: 0.8.0


tiktoken.get_encoding returns the BPE tokenizer associated with the gpt2 model, so the cells below encode and decode text with GPT-2's vocabulary.

In [3]:
tokenizer = tiktoken.get_encoding("gpt2")

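We use some sample text to check how the "encode" and "decode" methods work. The two adjacent string literals are joined without a space, so the text contains "terracesof", as the decoded output shows. By default, tiktoken raises an error when the input contains a special token, so allowed_special is passed to have <|endoftext|> encoded as a single token (ID 50256 in the output).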

In [4]:
text = (
 "Hello, do you like tea? <|endoftext|> In the sunlit terraces"
 "of someunknownPlace."
)
integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
print(integers)

[15496, 11, 466, 345, 588, 8887, 30, 220, 50256, 554, 262, 4252, 18250, 8812, 2114, 1659, 617, 34680, 27271, 13]


In [5]:
strings = tokenizer.decode(integers)
print(strings)

Hello, do you like tea? <|endoftext|> In the sunlit terracesof someunknownPlace.


Byte pair encoding of unknown words
-----------
We can run the BPE tokenizer from the tiktoken library on the made-up text "Akwirw ier." and print its token IDs. Decoding each ID on its own shows which subword fragment it stands for, and decoding the whole list checks that the tokenizer reconstructs the original input exactly.

In [6]:
text = (
    "Akwirw ier."
)
integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
print(integers)

[33901, 86, 343, 86, 220, 959, 13]


In [7]:
strings = tokenizer.decode(integers)
print(strings)

Akwirw ier.


Count tokens in text
---------

In [10]:
def num_tokens_from_string(string: str, encoding_name: str) -> int:
    """Returns the number of tokens in a text string."""
    encoding = tiktoken.get_encoding(encoding_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens

In [11]:
num_tokens_from_string("tiktoken is great!", "o200k_base")

6

Turn tokens into text
------------

In [15]:
encoding = tiktoken.get_encoding("r50k_base")
encoding.decode([33901, 86, 343, 86, 220, 959, 13])

'Akwirw ier.'