# Video: Visualizing Text as Tokens

One key step in the evolution of large language models was transitioning to predicting tokens instead of just bytes.
This video shows how the compression method byte pair encoding was repurposed to identify frequent text patterns for token selection, and gives examples of the increased semantic information in the resulting tokens.
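
To make the byte pair encoding idea concrete, here is a minimal sketch of the merge loop, assuming a tiny made-up corpus. The names `most_frequent_pair`, `merge_pair`, `train_toy_bpe`, and `toy_corpus` are illustrative only and are not part of tiktoken; production tokenizers add pre-splitting rules and far larger training data.

```python
from collections import Counter

def most_frequent_pair(sequences):
    """Count adjacent symbol pairs across all sequences and return the most common one."""
    pair_counts = Counter()
    for seq in sequences:
        pair_counts.update(zip(seq, seq[1:]))
    return pair_counts.most_common(1)[0][0] if pair_counts else None

def merge_pair(seq, pair):
    """Replace each occurrence of `pair` in `seq` with a single merged symbol."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged

def train_toy_bpe(words, num_merges):
    # Start from single bytes so that any input can still be represented.
    sequences = [[bytes([b]) for b in word.encode("utf-8")] for word in words]
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(sequences)
        if pair is None:
            break
        merges.append(pair)
        sequences = [merge_pair(seq, pair) for seq in sequences]
    return merges, sequences

# Hypothetical corpus chosen for illustration, not real training data.
toy_corpus = ["agriculture", "agritech", "biotech", "fintech"]
toy_merges, toy_sequences = train_toy_bpe(toy_corpus, num_merges=6)

for word, seq in zip(toy_corpus, toy_sequences):
    print(word, "->", seq)
```

With this particular toy corpus, six merges are enough for the frequent pieces "agri" and "tech" to become single symbols, while the rarer letters stay as individual bytes.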

Script:
* One step in the development of large language models was a transition from predicting bytes or words to predicting tokens.
* Early models predicted one byte at a time, so any text could be modeled, but many predictions were needed to output a whole word.
* Later models predicted whole words, which required fewer predictions and anecdotally helped coherence, but they had trouble when new words appeared.
* Just this month, the Oxford English Dictionary added the word "agritech", meaning "Technology that is used in agriculture to increase yield, efficiency, productivity, sustainability".
* You and I could guess the meaning of that word from the prefix "agri" and suffix "tech".
* Tokens allow a language model to work with bigger common pieces of text while maintaining the flexibility of predicting individual bytes when something new is encountered.
* For example, "agri" and "tech" might both be tokens, and the language model could infer a likely meaning based on the known usage of those tokens separate from each other.
* Let's look at the tokens of a real large language model now.
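
Before turning to the real tokenizer, the fallback behavior described above can be sketched with a toy example. This assumes a hypothetical three-entry vocabulary and uses greedy longest-match segmentation, which is a simplification and not the merge-order procedure that byte pair encoding tokenizers actually apply. `toy_tokenize` and `toy_vocab` are made-up names.

```python
def toy_tokenize(text, vocab):
    """Greedy longest-match segmentation with single-byte fallback."""
    data = text.encode("utf-8")
    max_len = max(len(piece) for piece in vocab)
    pieces, i = [], 0
    while i < len(data):
        # Try the longest multi-byte candidate first, down to two bytes.
        for size in range(min(max_len, len(data) - i), 1, -1):
            if data[i:i + size] in vocab:
                pieces.append(data[i:i + size])
                i += size
                break
        else:
            # Nothing in the vocabulary matched, so emit one raw byte.
            pieces.append(data[i:i + 1])
            i += 1
    return pieces

# Hypothetical vocabulary for illustration only.
toy_vocab = {b"agri", b"tech", b"culture"}

known = toy_tokenize("agritech", toy_vocab)
novel = toy_tokenize("agribot", toy_vocab)
print(known)
print(novel)
```

The word built from known pieces needs only two tokens, while the unfamiliar ending of "agribot" falls back to single bytes, so nothing is ever unrepresentable.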

In [None]:
import tiktoken

Script:
* I am going to use the tiktoken module from OpenAI for tokenization.
* OpenAI uses this module for fast tokenization in their production language models and apps like ChatGPT.

In [None]:
test_string = "The quick brown fox jumps over the lazy dog"

Script:
* For initial tests, I will use the common typing test sentence "The quick brown fox jumps over the lazy dog".
* If you aren't familiar with this sentence, it is a pangram: it uses each of the 26 letters of the English alphabet at least once.
* Early language models would just break it up into bytes or characters.
* Here are the characters.

In [None]:
[c for c in test_string]

['T',
 'h',
 'e',
 ' ',
 'q',
 'u',
 'i',
 'c',
 'k',
 ' ',
 'b',
 'r',
 'o',
 'w',
 'n',
 ' ',
 'f',
 'o',
 'x',
 ' ',
 'j',
 'u',
 'm',
 'p',
 's',
 ' ',
 'o',
 'v',
 'e',
 'r',
 ' ',
 't',
 'h',
 'e',
 ' ',
 'l',
 'a',
 'z',
 'y',
 ' ',
 'd',
 'o',
 'g']

Script:
* From an English-centered viewpoint, characters and bytes can look the same, but many languages use an extended character set where they differ.
* Let's look at the bytes now.

In [None]:
test_string_bytes = test_string.encode("utf-8")
test_string_bytes

b'The quick brown fox jumps over the lazy dog'

In [None]:
[b for b in test_string_bytes]

[84,
 104,
 101,
 32,
 113,
 117,
 105,
 99,
 107,
 32,
 98,
 114,
 111,
 119,
 110,
 32,
 102,
 111,
 120,
 32,
 106,
 117,
 109,
 112,
 115,
 32,
 111,
 118,
 101,
 114,
 32,
 116,
 104,
 101,
 32,
 108,
 97,
 122,
 121,
 32,
 100,
 111,
 103]

Script:
* Each of those bytes has 256 possible values.
* Only a few characters can be encoded in a single byte.
* Chinese characters will not fit in one byte, and emojis will not fit either.
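
As a quick check of that claim, the sketch below compares how many UTF-8 bytes a few sample characters need. The sample characters are my own choice for illustration: an ASCII letter, an accented Latin letter, the Chinese character for fox, and the fox emoji.

```python
# Sample characters chosen for illustration; "\u00e9" is the precomposed letter é.
samples = ["A", "\u00e9", "狐", "🦊"]

for ch in samples:
    encoded = ch.encode("utf-8")
    print(repr(ch), "->", len(encoded), "byte(s):", list(encoded))

# Map each character to its UTF-8 length for later comparison.
byte_counts = {ch: len(ch.encode("utf-8")) for ch in samples}
```

Every sample is a single character in Python, yet only the ASCII letter fits in one byte.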

In [None]:
test_emoji = '🦊'

Script:
* Here is the emoji for a fox.

In [None]:
test_emoji_bytes = test_emoji.encode("utf-8")
test_emoji_bytes

b'\xf0\x9f\xa6\x8a'

In [None]:
[b for b in test_emoji_bytes]

[240, 159, 166, 138]

Script:
* Unlike the previous sentence where each character mapped to one byte, the fox emoji is encoded by 4 bytes.
* The early language models would need 4 correct predictions to produce a fox emoji.
* Let's look at the tokens now.
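
For readers curious why the emoji takes exactly 4 bytes, here is a small sketch that decodes those bytes by hand using the standard UTF-8 layout for 4-byte sequences: a lead byte starting with the bits 11110, followed by three continuation bytes starting with 10. It is written only for this 4-byte case and is not a general decoder.

```python
fox_bytes = "🦊".encode("utf-8")
lead, *continuation = fox_bytes

# A lead byte of the form 11110xxx announces a 4-byte sequence.
assert lead >> 3 == 0b11110
code_point = lead & 0b00000111

for byte in continuation:
    # Each continuation byte has the form 10xxxxxx and carries 6 payload bits.
    assert byte >> 6 == 0b10
    code_point = (code_point << 6) | (byte & 0b00111111)

print(hex(code_point), chr(code_point))
```

All four bytes contribute bits to one code point, which is why a byte-level model has to get every one of them right.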

In [None]:
encoder = tiktoken.encoding_for_model("gpt-4o")

Script:
* This is the token encoder used by the GPT-4o model released in May 2024.
* One of the advertised features of this model was more efficient tokenization of Asian languages, which improves speed and text quality for those languages.

In [None]:
token_ids = encoder.encode(test_string)
token_ids

[976, 4853, 19705, 68347, 65613, 1072, 290, 29082, 6446]

Script:
* Here is the encoding of our test sentence from before.
* Each number identifies a single token, and some of these numbers are much higher than the 256 values used for just bytes.
* What do these token identifiers mean?

In [None]:
tokens = [encoder.decode([token_id]) for token_id in token_ids]
tokens

['The', ' quick', ' brown', ' fox', ' jumps', ' over', ' the', ' lazy', ' dog']

Script:
* These tokens roughly correspond to words.
* But note that they also include the spaces for each word.
* So most of these tokens have the effect of ending the previous word and starting a new word.
* They do not necessarily encode whole words though.
* For example, a token for the letters "ly" could come after the token for " quick" to produce the word "quickly".
* Let's look at the tokenization of the fox emoji.
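
Before moving on, the point about leading spaces can be checked with plain Python. This sketch copies the token strings from the output above into a list (the name `sentence_tokens` is mine) and joins them with no separator.

```python
# Token strings copied from the decoded output shown earlier.
sentence_tokens = ['The', ' quick', ' brown', ' fox', ' jumps',
                   ' over', ' the', ' lazy', ' dog']

# Joining with an empty separator works because each token carries its own space.
rebuilt = "".join(sentence_tokens)
with_leading_space = [t for t in sentence_tokens if t.startswith(" ")]

print(rebuilt)
print(len(sentence_tokens), "tokens versus", len(rebuilt.encode("utf-8")), "bytes")
```

Every token except the first begins with a space, and the sentence needs far fewer token predictions than byte predictions.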

In [None]:
[encoder.decode([token_id]) for token_id in encoder.encode(test_emoji)]

['�', '�', '�']

Script:
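* The tokenization of the fox emoji is harder to read.
* Each token displays as the replacement character because a single token holds only part of the emoji's UTF-8 byte sequence and cannot be decoded on its own.
* There are only 3 tokens, whereas the emoji was encoded with 4 bytes.
* So 2 of those bytes must have been common enough together to be combined into one token, for a slight savings.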
* This suggests that the fox emoji is not common enough in the training data set to be assigned a dedicated token.
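
The unreadable output can be reproduced without the tokenizer. The sketch below splits the emoji's 4 bytes into 3 chunks and decodes each chunk leniently. The split points are an assumption for illustration, since the output above does not show where the real token boundaries fall.

```python
fox_bytes = "🦊".encode("utf-8")

# Hypothetical split into three chunks; the real token boundaries may differ.
chunks = [fox_bytes[:2], fox_bytes[2:3], fox_bytes[3:]]

# errors="replace" substitutes U+FFFD for bytes that are not valid UTF-8 on their own.
shown = [chunk.decode("utf-8", errors="replace") for chunk in chunks]

# Decoding the chunks together recovers the original character.
whole = b"".join(chunks).decode("utf-8")

print(shown)
print(whole)
```

No chunk is a complete UTF-8 sequence, so each one is displayed as a replacement character, while the concatenated bytes decode cleanly.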


In [None]:
encoder.decode(encoder.encode(test_emoji))

'🦊'

Script:
* Even without a dedicated token, the fox emoji can still be produced by combining these tokens.

Script: (faculty on screen)
* Tokenization was a very interesting development for language models.
* Tokens are simultaneously a coverage improvement, efficiency improvement, and quality improvement for language models.
