<a href="https://colab.research.google.com/github/ferragina/MyInformationRetrieval/blob/main/5_tokenization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Comparing Trained LLM Tokenizers

In this notebook, we load the tokenizers of several LLMs and compare how each one splits the same text into tokens.

## Setup

In [1]:
# Quote the requirement so the shell does not treat `>=` as an output redirection
# !pip install "transformers>=4.46.1"

# Warning control
import warnings
warnings.filterwarnings('ignore')

## Tokenizing Text

Let's import the `AutoTokenizer` class, instantiate the tokenizer of the [`bert-base-cased` model](https://huggingface.co/google-bert/bert-base-cased), and define the sentence to tokenize.

In [2]:
from transformers import AutoTokenizer

In [6]:
# load the pretrained tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

print("number of tokens in the dictionary: ", len(tokenizer))

number of tokens in the dictionary:  28996


In [8]:
# define the sentence to tokenize
sentence = "Hello world!"

# apply the tokenizer to the sentence and extract the token ids
token_ids = tokenizer(sentence).input_ids

print(token_ids)

[101, 8667, 1362, 106, 102]


In [9]:
# To map each token ID back to its token, use the tokenizer's `decode` method.

for token_id in token_ids:  # avoid shadowing the built-in `id`
    print(tokenizer.decode(token_id))

[CLS]
Hello
world
!
[SEP]
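
The output above shows that the tokenizer wrapped the sentence in two special tokens, `[CLS]` and `[SEP]`. As a minimal sketch of what the ID-to-token lookup amounts to, the snippet below rebuilds it with a plain dictionary. The five entries are copied from the outputs above; the helper name `lookup` and its `skip_special` flag are hypothetical, chosen only for illustration.

```python
# Toy lookup table: only the five IDs printed above, not the real vocabulary.
id_to_token = {101: "[CLS]", 8667: "Hello", 1362: "world", 106: "!", 102: "[SEP]"}
special_tokens = {"[CLS]", "[SEP]"}

token_ids = [101, 8667, 1362, 106, 102]

def lookup(ids, skip_special=False):
    """Map each ID to its token, optionally dropping the special tokens."""
    tokens = [id_to_token[i] for i in ids]
    if skip_special:
        tokens = [t for t in tokens if t not in special_tokens]
    return tokens

print(lookup(token_ids))
print(lookup(token_ids, skip_special=True))
```

With the real tokenizer, a similar effect is available through `tokenizer.decode(token_ids, skip_special_tokens=True)`, which returns the text without `[CLS]` and `[SEP]`.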


## Visualizing Tokenization

Let's wrap the code of the previous section in the function `show_tokens`. The function takes a text and a model name, then prints the tokenizer's vocabulary length, the number of tokens in the text, and a color-coded list of the tokens.

In [10]:
# A list of colors in RGB for representing the tokens
colors = [
    '102;194;165', '252;141;98', '141;160;203',
    '231;138;195', '166;216;84', '255;217;47'
]

def show_tokens(sentence: str, tokenizer_name: str):
    """ Show the tokens each separated by a different color """

    # Load the tokenizer and tokenize the input
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    token_ids = tokenizer(sentence).input_ids

    # Extract vocabulary length
    print(f"\nVocab length: {len(tokenizer)}")

    # Count the tokens produced for the input
    print(f"Number of tokens: {len(token_ids)}\n")

    # Print a colored list of tokens
    for idx, t in enumerate(token_ids):
        print(
            f'\x1b[0;30;48;2;{colors[idx % len(colors)]}m' +
            tokenizer.decode(t) +
            '\x1b[0m',
            end=' '
        )
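
The coloring in `show_tokens` relies on ANSI escape sequences. The sketch below isolates that step so the pieces are easier to see: `0;30` resets the style and selects a black foreground, `48;2;R;G;B` sets a 24-bit background color, and `\x1b[0m` resets the style after the token. The helper names `colorize` and `render` are hypothetical, and the output only appears colored in terminals or notebook front ends that support 24-bit color.

```python
# Same RGB palette as in the cell above
colors = [
    '102;194;165', '252;141;98', '141;160;203',
    '231;138;195', '166;216;84', '255;217;47'
]

def colorize(token: str, rgb: str) -> str:
    """Wrap a token in ANSI codes: black text on an RGB background."""
    return f'\x1b[0;30;48;2;{rgb}m{token}\x1b[0m'

def render(tokens: list[str]) -> str:
    """Color each token, cycling through the palette."""
    return ' '.join(
        colorize(tok, colors[idx % len(colors)])
        for idx, tok in enumerate(tokens)
    )

print(render(["[CLS]", "Hello", "world", "!", "[SEP]"]))
```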

## Trying Different Tokenizers

Consider the following features when comparing the tokenizers:
- Vocabulary length
- Special tokens
- Tokenization of tabs, special characters, and keywords



In [11]:
text = """
English and CAPITALIZATION
🎵 鸟
show_tokens False None elif == >= else: two tabs:"    " Three tabs: "       "
12.0*50=600
"""

**bert-base-cased**

In [None]:
show_tokens(text, "bert-base-cased")

**bert-base-uncased**

You can also try the uncased version of the BERT model and compare the vocabulary length and tokenization of the two BERT versions.

In [None]:
show_tokens(text, "bert-base-uncased")
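
Both BERT tokenizers use WordPiece, which splits a word by repeatedly taking the longest prefix found in the vocabulary and marking continuation pieces with `##`. The sketch below is a simplified version of that greedy matching, not the library's implementation. The vocabulary `toy_vocab` is hypothetical and tiny, so the splits it produces are illustrative and will not match the real models' output. It does show why casing matters: the uncased tokenizer lowercases the text before the lookup, while the cased one has to match the uppercase letters as they are.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word (simplified)."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        # Shrink the candidate from the right until it is in the vocabulary
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary, for illustration only
toy_vocab = {"capital", "##ization", "CA", "##PI", "##TA", "##L",
             "##I", "##Z", "##AT", "##ION"}

print(wordpiece("CAPITALIZATION", toy_vocab))          # cased lookup
print(wordpiece("CAPITALIZATION".lower(), toy_vocab))  # uncased lookup
print(wordpiece("鸟", toy_vocab))                       # not in the vocabulary
```

In this toy setting the lowercased word needs two pieces, the uppercase word needs eight, and the character missing from the vocabulary collapses to `[UNK]`.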

**GPT-4**

In [None]:
show_tokens(text, "Xenova/gpt-4")
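
The GPT-4 tokenizer is a byte-level BPE tokenizer: it operates on the UTF-8 bytes of the text, so any character can be represented even if it never received a token of its own. The sketch below only shows the byte sequences behind a few characters from `text`; how those bytes are grouped into tokens depends on the merges the tokenizer learned, which this example does not model. The helper name `utf8_bytes` is hypothetical.

```python
def utf8_bytes(text: str) -> list[str]:
    """Return the UTF-8 encoding of `text` as two-digit hex strings."""
    return [f"{b:02x}" for b in text.encode("utf-8")]

# An ASCII letter, the Chinese character, and the emoji from `text`
for ch in ["a", "鸟", "🎵"]:
    encoded = utf8_bytes(ch)
    print(repr(ch), len(encoded), encoded)
```

A character that spans several bytes may therefore show up as more than one token in the colored output.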

**StarCoder2-15B**

In [None]:
show_tokens(text, "bigcode/starcoder2-15b")