> <p><small><small>This Notebook is made available subject to the licence and terms set out in the <a href = "http://www.github.com/google-deepmind/ai-foundations">AI Research Foundations Github README file</a>.

![](https://storage.googleapis.com/dm-educational/assets/ai_foundations/GDM-Labs-banner-image-C2-white-bg.png)

# Lab: Tokenize Texts into Subword Tokens


<a href='https://colab.research.google.com/github/google-deepmind/ai-foundations/blob/master/course_2/gdm_lab_2_3_tokenize_texts_into_subword_tokens.ipynb' target='_parent'><img src='https://colab.research.google.com/assets/colab-badge.svg' alt='Open In Colab'/></a>

Explore how Gemma's tokenizer splits texts into units between characters and words.

10 minutes

## Overview

In this lab, you will experiment with the **tokenizer** of the **Google Gemma** language model. You will explore how this model is able to deal with rare words or words that do not appear in its training data while still representing common words as their own tokens.

### What you will learn

By the end of this lab, you will understand:

* What subword tokenization is.
* How subword tokenizers handle rare and unseen words, and emojis.
* The purpose of special tokens in tokenizers for language models.


### Tasks

In this lab, you will:

* Experiment with Gemma's tokenizer to explore subword tokenization.
* Implement a function to tokenize the made-up word "Clusterophonexia".
* Inspect how Gemma handles emojis and the purpose of its special tokens.

## How to use Google Colaboratory (Colab)

Google Colaboratory (also known as Google Colab) is a platform that allows you to run Python code in your browser. The code is written in **cells** that are executed on a remote server.

To run a cell, hover over the cell and click on the `run` button to its left. The run button is the circle with the triangle (▶). Alternatively, you can also click on a cell and use the keyboard combination Ctrl+Return (or ⌘+Return if you are using a Mac).

To try this out, run the following cell. This should print today's day of the week below it.

In [None]:
from datetime import datetime

print(f"Today is {datetime.today():%A}.")

Note that the *order in which you run the cells matters*. When you are working through a lab, make sure to always run *all* cells in order, otherwise the code might not work. If you take a break while working on a lab, Colab may disconnect you. In that case, you have to execute all cells again before continuing your work. To make this easier, you can select the cell you are currently working on and then choose __Runtime → Run before__ from the menu above (or use the keyboard combination Ctrl/⌘ + F8). This will re-execute all cells before the current one.

## Imports

In this lab, you will primarily work with the tokenizer from the `gemma` package.

Run the following cell to import the required packages.

In [None]:
%%capture

# Install the custom package for this course. This also installs the gemma
# package.
!pip install "git+https://github.com/google-deepmind/ai-foundations.git@main"

from gemma import gm  # For interacting with the Gemma tokenizer.
# For providing feedback on your implementations.
from ai_foundations.feedback.course_2 import subword_tokens as feedback

## Subword tokenization

In the last activity, you saw the trade-offs between two main tokenization strategies. Character-level tokenization can handle any word but creates very long sequences and is not able to capture the inherent meaning of entire words. On the other hand, word-level tokenization is intuitive but struggles with rare or unseen words (the out-of-vocabulary problem).

**Subword tokenization** finds a middle ground between character-level and word-level tokenization:

* Frequent words (like "the" or "is") are kept as single, complete tokens.

* Rare or complex words (like "Baobab") are broken down into smaller sub-units, which are often, but not always, meaningful on their own.

This way, the model maintains a manageable, fixed-size vocabulary while still being able to represent any word you can think of.
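
To make this idea concrete, here is a minimal sketch of subword segmentation that uses greedy longest-match over a tiny hand-made vocabulary. The vocabulary `TOY_VOCAB` and the function `toy_subword_tokenize` are hypothetical and only illustrate the principle. Real tokenizers learn a far larger vocabulary from data and use a different algorithm.

```python
# Hypothetical toy vocabulary: a few whole words, some word fragments, and
# single letters as a fallback. This is NOT Gemma's vocabulary.
TOY_VOCAB = {"the", "is", "tree", "ba", "ob", "ab"} | set(
    "abcdefghijklmnopqrstuvwxyz"
)


def toy_subword_tokenize(word: str) -> list[str]:
    """Splits a lowercase word by always taking the longest known piece."""
    tokens = []
    start = 0
    while start < len(word):
        # Try the longest candidate first and shrink until a match is found.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in TOY_VOCAB:
                tokens.append(piece)
                start = end
                break
        else:
            # Character is not in the vocabulary: keep it as its own piece.
            tokens.append(word[start])
            start += 1
    return tokens


print(toy_subword_tokenize("the"))     # A frequent word stays whole.
print(toy_subword_tokenize("baobab"))  # A rare word is split into pieces.
```

Because single letters are part of the toy vocabulary, every lowercase word can be represented, even if it has to be spelled out letter by letter.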

As a first step, consider how Gemma tokenizes the following text:

"The Baobab (genus Adansonia) is"

It turns this text into the following tokens:

```python
["The", " Ba", "ob", "ab", " (", "genus", " Ad", "ans", "onia", ")", " is"]
```
You will notice that the space preceding a word is part of the token that forms the first part of that word (or the entire word). Further, the word "Baobab" is broken up into three tokens. This is done using the **byte pair encoding** (BPE) algorithm, one of the most popular algorithms for subword tokenization, which you will implement yourself in the next lab.
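
Because the spaces are stored inside the tokens, joining the tokens without any separator recovers the original text. The following quick check uses the token list shown above and plain Python only:

```python
tokens = [
    "The", " Ba", "ob", "ab", " (", "genus", " Ad", "ans", "onia", ")", " is"
]

# The spaces live inside the tokens, so plain concatenation restores the text.
reconstructed = "".join(tokens)
print(reconstructed)

# Tokens that begin a new word after a space carry that space with them.
word_initial = [token for token in tokens if token.startswith(" ")]
print(word_initial)
```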

### Load and experiment with the Gemma tokenizer

To gain a better intuition of how the Gemma tokenizer works, run the following cell to load it.

In [None]:
# Load the tokenizer.
gemma_tokenizer = gm.text.Gemma3Tokenizer()

# Inspect the vocabulary size.
print(f"Gemma's vocabulary consists of {gemma_tokenizer.vocab_size:,} tokens.")

As you can see, Gemma has a very large vocabulary size of more than 260,000 tokens. This is because the list of unique tokens has been determined using a much larger dataset than Africa Galore. Furthermore, the tokenizer has been designed with the goal of allowing the model to capture a wide spectrum of input, from common words and frequent subword patterns, to emojis like `'☺️'`, to the character sets of different languages. With such an expansive vocabulary, Gemma can often tokenize text more precisely. This reduces instances of the meaningless `<UNK>` token and better preserves the semantic richness of the original input.

#### Encoding and decoding

To translate an arbitrary input text to the token IDs that Gemma can process, you can use the `encode` function of the tokenizer. This is demonstrated in the following cell.


In [None]:
# Encode a text into token IDs.
text = "The Baobab (genus Adansonia) is one of the most iconic trees."

gemma_tokens = gemma_tokenizer.encode(text)
print(f"Result of tokenizing the text \"{text}\":")
print(gemma_tokens)

This process can be reversed using the `decode` method that translates the token IDs back to a text.

In [None]:
# Decode the tokens back to a text.
decoded_text = gemma_tokenizer.decode(gemma_tokens)
print(f"Decoded sentence from tokens: {decoded_text}\n")

# Check whether this results in the same text as the original one.
is_equal = "✅" if text == decoded_text else "❌"
print(
    f"Decoding the tokens results in the same text as the original one:"
    f" {is_equal}\n"
)

# Decode individual tokens.
for token in gemma_tokens:
    decoded_token = gemma_tokenizer.decode(token)
    print(f"Token {token}:\t{decoded_token}")
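
Conceptually, encoding and decoding are lookups in two directions: from token strings to integer IDs and back. The following sketch mimics this with a hypothetical five-entry vocabulary. The IDs and the helper names `toy_encode` and `toy_decode` are invented for illustration and are unrelated to Gemma's actual vocabulary or implementation.

```python
# Hypothetical vocabulary: token string -> token ID.
TOKEN_TO_ID = {"The": 0, " Ba": 1, "ob": 2, "ab": 3, " is": 4}
# Inverse mapping: token ID -> token string.
ID_TO_TOKEN = {token_id: token for token, token_id in TOKEN_TO_ID.items()}


def toy_encode(tokens: list[str]) -> list[int]:
    """Maps already-split token strings to their IDs."""
    return [TOKEN_TO_ID[token] for token in tokens]


def toy_decode(token_ids: list[int]) -> str:
    """Maps IDs back to strings and concatenates them."""
    return "".join(ID_TO_TOKEN[token_id] for token_id in token_ids)


ids = toy_encode(["The", " Ba", "ob", "ab", " is"])
print(ids)
print(toy_decode(ids))
```

A real tokenizer additionally performs the splitting step itself, which this sketch skips by starting from tokens that are already split.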

### Coding Activity 1: Tokenize a made-up word

As mentioned, this approach has the advantage that it can represent almost any word as a combination of multiple subword tokens.

<br />

------
> **💻 Your task:**
>
> Use the `encode` and `decode` methods of the Gemma tokenizer to investigate how Gemma tokenizes the made-up word "Clusterophonexia".
>
> Store the list of token IDs in `clusterophonexia_tokens` and turn the first token ID back into a string and store it in `first_token_as_text`.
>
------


In [None]:
# Set the following two variables as described in the instructions above.
clusterophonexia_tokens = None  # Replace None with your code.
first_token_as_text = None  # Replace None with your code.

In [None]:
# @title Run this cell to test your code
feedback.test_gemma_subword_tokenization(
    clusterophonexia_tokens, first_token_as_text, gemma_tokenizer
)

As you have observed, the Gemma tokenizer can tokenize even made-up words such as "Clusterophonexia". It does so by splitting the word into subword tokens that are already part of its vocabulary. For example, if the meaning of "Clusterophonexia" is related to the word "cluster", then Gemma's tokenization would provide some information about what this word is supposed to mean. In contrast, the blanket `<UNK>` token would not provide any information about the word's meaning.
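
The contrast with a word-level vocabulary can be sketched in a few lines. The vocabulary below is hypothetical, and the split of the made-up word is chosen by hand for illustration. It is not necessarily how Gemma splits the word.

```python
# Hypothetical word-level vocabulary that lacks the made-up word.
WORD_VOCAB = {"the", "cluster", "is"}


def word_level_tokenize(words: list[str]) -> list[str]:
    """Replaces every out-of-vocabulary word with the <UNK> placeholder."""
    return [word if word in WORD_VOCAB else "<UNK>" for word in words]


# Word level: the made-up word collapses to <UNK> and its spelling is lost.
print(word_level_tokenize(["the", "clusterophonexia", "is"]))

# Subword level (hand-picked, hypothetical split): the pieces still
# concatenate to the original word, and "cluster" remains visible.
subword_pieces = ["cluster", "ophon", "exia"]
print("".join(subword_pieces))
```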

### Tokenizing Unicode characters

Gemma's large vocabulary also includes many Unicode characters, such as emojis. Run the next cell to see how the sentence "I am smiling ☺️!" is tokenized:

In [None]:
# Encode a sentence that contains an emoji.
gemma_tokens = gemma_tokenizer.encode("I am smiling ☺️!")

# Decode each token ID individually to see how the text was split.
for token in gemma_tokens:
    decoded_token = gemma_tokenizer.decode(token)
    print(f"Token {token}:\t{decoded_token}")

As you can observe, the tokenizer is also able to map emojis to token IDs. In this case, the emoji ☺️ is mapped to ID 145233.
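
Emojis are a useful stress test because what looks like a single symbol can consist of several Unicode code points. The sketch below uses only the Python standard library to inspect the smiling face, assuming it is written with a trailing variation selector, as is common. It says nothing about how Gemma stores the emoji internally.

```python
import unicodedata

# The smiling face followed by a variation selector that requests emoji style.
emoji = "\u263A\uFE0F"

# One visible symbol, but two code points.
for char in emoji:
    print(f"U+{ord(char):04X}\t{unicodedata.name(char)}")

# Number of code points, and number of bytes in the UTF-8 encoding.
print(len(emoji), len(emoji.encode("utf-8")))
```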

## Special tokens

Beyond the regular tokens that represent words or subwords, a tokenizer's vocabulary includes special tokens. These tokens don't represent content. Instead, they provide structural information to the model, such as marking boundaries or handling sequences of different lengths.

The following list includes several common special tokens:

* **`<BOS>`** and **`<EOS>`**:

  These stand for "beginning of sequence" (BOS) and "end of sequence" (EOS). Their primary job is to mark the start and end of a distinct piece of text. This comes with the following two advantages:

  * Efficient batching: By clearly marking where each sequence begins and ends, we can feed multiple documents to the model in a single batch without having to pad them extensively.

  * Dynamic generation: During text generation, the `<EOS>` token serves as a stop signal. Instead of generating a fixed number of tokens, the model can generate text until it produces an `<EOS>` token, allowing it to decide when a response is complete.

* **`<PAD>`**:

  As you have seen previously, the padding token is used to make all input sequences in a batch the same length. The sequences in a batch are stacked into a single rectangular array, so shorter sequences are "padded" with this token until they match the length of the longest sequence in the batch.

* **`<UNK>`**:

  The unknown token, `<UNK>`, acts as a placeholder for a character or symbol that is not in the tokenizer's vocabulary. While subword tokenizers are great at representing almost any text, they can sometimes encounter a character that they have never been trained on. In these rare cases, the tokenizer will use `<UNK>` to represent it.
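
The roles of the BOS, EOS, and padding tokens can be sketched with plain Python lists. The ID values and the helper `prepare_batch` are hypothetical placeholders and should not be read as Gemma's actual IDs or API.

```python
# Placeholder special-token IDs, chosen for illustration only.
PAD_ID, EOS_ID, BOS_ID = 0, 1, 2


def prepare_batch(sequences: list[list[int]]) -> list[list[int]]:
    """Wraps each sequence in BOS/EOS and right-pads to a common length."""
    wrapped = [[BOS_ID, *seq, EOS_ID] for seq in sequences]
    max_len = max(len(seq) for seq in wrapped)
    return [seq + [PAD_ID] * (max_len - len(seq)) for seq in wrapped]


batch = prepare_batch([[10, 11, 12], [20]])
for row in batch:
    print(row)
```

After this step, every row has the same length, so the batch can be converted into a rectangular array.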


### Special tokens in Gemma

As an example of how to work with special tokens, consider the implementation of special tokens in Gemma. These can be accessed through `gemma_tokenizer.special_tokens`. For example, the following two cells demonstrate how to access the BOS and EOS tokens in Gemma.

In [None]:
# Beginning of sequence (BOS) token.
gemma_tokenizer.special_tokens.BOS

In [None]:
# End of sequence (EOS) token.
gemma_tokenizer.special_tokens.EOS

The tokenizer also supports automatically adding the BOS and EOS tokens to a sequence. This can be very useful, for example, when you are preparing data for finetuning a chatbot on prompts and model answers, because it enables the model to learn when it should stop generating.


In [None]:
token_ids = gemma_tokenizer.encode("Hello world!", add_bos=True, add_eos=True)
token_ids
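
The stop-signal role of the EOS token during generation can be sketched with a stand-in for the model. Here, `fake_next_token` is a hypothetical function that replays a fixed list of IDs instead of running a real language model, and `EOS_ID` is a placeholder value.

```python
EOS_ID = 1  # Placeholder ID, chosen for illustration.
SCRIPTED_OUTPUT = [42, 7, 99, EOS_ID, 5, 5]  # Pretend model predictions.


def fake_next_token(step: int) -> int:
    """Stand-in for a model call: returns the scripted token for this step."""
    return SCRIPTED_OUTPUT[step]


def generate(max_tokens: int) -> list[int]:
    """Generates until EOS appears or the token budget is used up."""
    generated = []
    for step in range(max_tokens):
        token = fake_next_token(step)
        if token == EOS_ID:
            break  # The model signalled that the response is complete.
        generated.append(token)
    return generated


print(generate(max_tokens=6))
```

The loop ends at the EOS token even though the budget would allow more tokens, which is the behavior a model learns from training data that ends every answer with EOS.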

## Summary

In this lab, you explored how a **subword tokenizer**, such as the one used by the Gemma model, works. You explored how it splits entire texts into tokens of varying granularity, how it can represent words that are not part of its vocabulary, and how it can represent Unicode characters such as emojis.

You also learned about the **common special tokens** `<BOS>`, `<EOS>`, `<PAD>`, and `<UNK>` and how they can be used to encode structural information in texts.

In the next activities, you will learn how you can implement a subword tokenizer using the byte pair encoding (BPE) algorithm.

## Solutions

The following cells provide reference solutions to the coding activities above. If you really get stuck after trying to solve the activities yourself, you may want to consult these solutions.

We recommend that you *only* look at the solutions after you have tried to solve the activities above *multiple times*. The best way to learn challenging concepts in computer science and artificial intelligence is to debug your code piece-by-piece until it works rather than copying existing solutions.

If you feel stuck, you may want to first try to debug your code, for example by adding print statements to see what your code is doing at every step. This will provide you with a much deeper understanding of the code and the materials. It will also provide you with practice on how to solve challenging coding problems beyond this course.

To view the solutions for an activity, click on the arrow to the left of the activity name. If you consult the solutions, do not copy and paste them into the cells above. Instead, look at them and then type them manually into the cell. This will help you understand where you went wrong.

### Coding Activity 1

In [None]:
# Add the following two lines to the cell above.
clusterophonexia_tokens = gemma_tokenizer.encode("Clusterophonexia")
first_token_as_text = gemma_tokenizer.decode(clusterophonexia_tokens[0])
