# Exercise 1 - Tokenization
In this exercise, we will explore how text is tokenized. The goal is to develop a deeper understanding of the tokenization process and how it varies across different languages.

## Tokenization App

OpenAI provides an interactive demonstration of how the models behind an application like ChatGPT split text into tokens. You can try the demonstration [here](https://platform.openai.com/tokenizer).

![](../../assets/openai-tokenizer.png)

## Exercise 1a
Using the tool linked above, try to answer the following questions:
- What are the characteristics of the English words that do not get split into subtokens? (e.g., length, type of word, rareness, etc.)
- What are the characteristics of the English words that get split into subtokens? (e.g., length, type of word, rareness, etc.)

After you have examined English words, try to do the same for non-English words. Ideally, use a non-English language you know well. If you do not know any non-English languages, you can use Google Translate to translate some English text. Try to answer the following questions:
- What are the characteristics of the non-English words that do not get split into subtokens? (e.g., length, type of word, rareness, etc.)
- What are the characteristics of the non-English words that do get split into subtokens? (e.g., length, type of word, rareness, etc.)
- Do you notice any differences between the English and non-English tokens? (e.g., number of tokens used, average length of tokens, etc.)


## Tokenization with Python

The tokenizer used in the visualization tool is also available in Python through the [tiktoken](https://github.com/openai/tiktoken) package.

In this exercise, we'll use **tiktoken** to gain a deeper understanding of how tokenization works.

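Note: Although this exercise uses OpenAI's tokenization library, most large language models (LLMs) tokenize text in a similar way, using a vocabulary of subword tokens. The observations carry over, even though the exact splits differ between models.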

In [1]:
from tiktoken import encoding_for_model

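Next, we select the tokenizer we want to use. `encoding_for_model` returns the encoding that belongs to a given model name; for `"gpt-4"` this is `cl100k_base`.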

In [2]:
tokenizer = encoding_for_model("gpt-4")

We can now use the tokenizer as follows:
- `tokenizer.encode(text)` returns the token indices of the text as a list of integers.
- `tokenizer.decode_single_token_bytes(token)` returns the raw bytes of a single token index, which we decode as UTF-8 to get the token in text format.


In [18]:
text = "hello world or hello-world Hello World! I see vcetor vector going gonig"
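# Convert the text into a list of integer token indices.
token_indices = tokenizer.encode(text)
# decode_single_token_bytes returns raw bytes; decoding them as UTF-8 works here
# because every token in this example is valid UTF-8 on its own.
tokens = [tokenizer.decode_single_token_bytes(token).decode('utf-8') for token in token_indices]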
print("These are the indices of the tokens:", token_indices)
print("These are the tokens in text format:", tokens)

These are the indices of the tokens: [15339, 1917, 477, 24748, 31184, 22691, 4435, 0, 358, 1518, 25571, 295, 269, 4724, 2133, 64592, 343]
These are the tokens in text format: ['hello', ' world', ' or', ' hello', '-world', ' Hello', ' World', '!', ' I', ' see', ' vc', 'et', 'or', ' vector', ' going', ' gon', 'ig']


In [4]:
text = "Voici une phrase en Français."
token_indices = tokenizer.encode(text)
tokens = [tokenizer.decode_single_token_bytes(token).decode('utf-8') for token in token_indices]
print("These are the indices of the tokens:", token_indices)
print("These are the tokens in text format:", tokens)

These are the indices of the tokens: [28615, 3457, 6316, 17571, 665, 84939, 2852, 13]
These are the tokens in text format: ['Vo', 'ici', ' une', ' phrase', ' en', ' Franç', 'ais', '.']
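
The cells above call `.decode('utf-8')` on the bytes of each token separately. That works for these examples, but tokens are byte sequences, so a character that takes several bytes in UTF-8 can end up split across tokens, and a strict decode of one piece then raises `UnicodeDecodeError`. The sketch below illustrates the problem without the tokenizer by cutting the two bytes of `ç` apart by hand. Both the split and the helper name `token_to_text` are assumptions made for illustration; they are not output of tiktoken, which kept `ç` whole inside `' Franç'` above.

```python
def token_to_text(token_bytes: bytes) -> str:
    """Decode one token's bytes, marking undecodable bytes with U+FFFD instead of raising."""
    return token_bytes.decode("utf-8", errors="replace")


# "ç" occupies two bytes in UTF-8.
c_cedilla = "ç".encode("utf-8")
# Hypothetical split: pretend each byte ended up in a different token.
pieces = [c_cedilla[:1], c_cedilla[1:]]

try:
    pieces[0].decode("utf-8")  # strict decoding of half a character
    strict_decode_failed = False
except UnicodeDecodeError:
    strict_decode_failed = True

print("Strict decode failed:", strict_decode_failed)
print("Lenient decode:", [token_to_text(piece) for piece in pieces])
# Joining the bytes before decoding recovers the original character.
print("Joined:", b"".join(pieces).decode("utf-8"))
```

If a text makes the list comprehension fail, a helper like this could replace `.decode('utf-8')`. To turn a whole list of indices back into text, `tokenizer.decode(token_indices)` decodes the sequence at once.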


## Exercise 1b - Tokenizing different languages
Next, we will use the tokenizer to calculate the average token length (the number of characters divided by the number of tokens) of an English text and a non-English text.
We provide two example texts, one in English and one in Dutch. You can use these texts or replace them with your own.

In [11]:
text = "A windmill is a structure that converts wind power into rotational energy using vanes called sails or blades."

# YOUR CODE HERE START
tokens_indices_2 = tokenizer.encode(text)
tokens_2 = [tokenizer.decode_single_token_bytes(token).decode('utf-8') for token in tokens_indices_2]
print(tokens_indices_2)
print(tokens_2)
# YOUR CODE HERE END

[32, 10160, 26064, 374, 264, 6070, 430, 33822, 10160, 2410, 1139, 92371, 4907, 1701, 5355, 288, 2663, 86105, 477, 42742, 13]
['A', ' wind', 'mill', ' is', ' a', ' structure', ' that', ' converts', ' wind', ' power', ' into', ' rotational', ' energy', ' using', ' van', 'es', ' called', ' sails', ' or', ' blades', '.']


In [12]:
text = "Een windmolen is een constructie die windenergie omzet in rotatie-energie met behulp van schoepen die zeilen of bladen worden genoemd."

# YOUR CODE HERE START
tokens_indices_3 = tokenizer.encode(text)
tokens_3 = [tokenizer.decode_single_token_bytes(token).decode('utf-8') for token in tokens_indices_3]
print(tokens_indices_3)
print(tokens_3)
# YOUR CODE HERE END

[36, 268, 10160, 76, 17648, 374, 8517, 9429, 648, 2815, 10160, 804, 22235, 8019, 61828, 304, 5868, 26937, 12, 804, 22235, 2322, 2824, 13136, 5355, 78140, 752, 268, 2815, 14017, 23684, 315, 1529, 21825, 31279, 4173, 78, 95210, 13]
['E', 'en', ' wind', 'm', 'olen', ' is', ' een', ' construct', 'ie', ' die', ' wind', 'ener', 'gie', ' om', 'zet', ' in', ' rot', 'atie', '-', 'ener', 'gie', ' met', ' beh', 'ulp', ' van', ' scho', 'ep', 'en', ' die', ' ze', 'ilen', ' of', ' bl', 'aden', ' worden', ' gen', 'o', 'emd', '.']
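
The two cells above print the tokens, but they stop short of the average token length that the exercise asks for. A minimal sketch of that last step follows, assuming the average is measured in characters per token. The helper name `average_token_length` is hypothetical, and the token lists are copied from the outputs above so that the block runs without the tokenizer.

```python
def average_token_length(tokens: list[str]) -> float:
    """Return the mean number of characters per token (0.0 for an empty list)."""
    if not tokens:
        return 0.0
    return sum(len(token) for token in tokens) / len(tokens)


# Token lists copied from the outputs of the two cells above.
english_tokens = ['A', ' wind', 'mill', ' is', ' a', ' structure', ' that', ' converts',
                  ' wind', ' power', ' into', ' rotational', ' energy', ' using', ' van',
                  'es', ' called', ' sails', ' or', ' blades', '.']
dutch_tokens = ['E', 'en', ' wind', 'm', 'olen', ' is', ' een', ' construct', 'ie', ' die',
                ' wind', 'ener', 'gie', ' om', 'zet', ' in', ' rot', 'atie', '-', 'ener',
                'gie', ' met', ' beh', 'ulp', ' van', ' scho', 'ep', 'en', ' die', ' ze',
                'ilen', ' of', ' bl', 'aden', ' worden', ' gen', 'o', 'emd', '.']

for name, tokens in [("English", english_tokens), ("Dutch", dutch_tokens)]:
    print(f"{name}: {len(tokens)} tokens, "
          f"{average_token_length(tokens):.2f} characters per token")
```

In the notebook, the same function can be called directly on `tokens_2` and `tokens_3`. For these two sentences the English text needs 21 tokens and the Dutch text 39, so the Dutch tokens are noticeably shorter on average.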


In [15]:
text = "ik zie bladjes."

# YOUR CODE HERE START
tokens_indices_4 = tokenizer.encode(text)
tokens_4 = [tokenizer.decode_single_token_bytes(token).decode('utf-8') for token in tokens_indices_4]
print(tokens_indices_4)
print(tokens_4)
# YOUR CODE HERE END

[1609, 75347, 1529, 329, 21297, 13]
['ik', ' zie', ' bl', 'ad', 'jes', '.']
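
To see which words were split into subtokens, as asked in Exercise 1a, the tokens can be grouped back into words. A rough sketch follows, assuming that a token starting with a space begins a new word. This holds for the outputs above but is only a heuristic, and punctuation ends up attached to the preceding word. `group_tokens_by_word` is a hypothetical helper, and the token list is copied from the output above.

```python
def group_tokens_by_word(tokens: list[str]) -> list[list[str]]:
    """Group tokens into words, starting a new word at each token that begins with a space."""
    words: list[list[str]] = []
    for token in tokens:
        if token.startswith(" ") or not words:
            words.append([token])    # a leading space (or the very first token) opens a new word
        else:
            words[-1].append(token)  # otherwise the token continues the current word
    return words


# Token list copied from the output of the cell above.
tokens = ['ik', ' zie', ' bl', 'ad', 'jes', '.']
words = group_tokens_by_word(tokens)

for pieces in words:
    print(repr("".join(pieces).strip()), "->", len(pieces), "token(s)")
```

The grouping makes it easy to spot that the short, common words `ik` and `zie` stay whole, while `bladjes` is broken into several pieces.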


---