# Tokens, Token IDs, Tokenizers, and Embeddings

## 1. Tokens
* __Definition__ : A token is a piece of text that is treated as a single unit by a language model.
* __Examples__ : Tokens can be individual words, word segments, punctuation marks, or special markers (like [CLS] or [SEP] in some model architectures).
* __Purpose__ : Language models work fundamentally on sequences of tokens. The first step in processing text is always tokenization—splitting text into tokens that the model “understands.”


## 2. Token IDs
* __Definition__ : Each token in a tokenizer's vocabulary is assigned a unique integer ID. The model never sees the text itself, only these IDs.
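
As a minimal sketch of this mapping, assuming a toy vocabulary built from a handful of tokens (the helper names `build_vocab` and `tokens_to_ids` are made up for illustration and do not come from any library):

```python
def build_vocab(tokens, specials=("[UNK]",)):
    """Assign consecutive integer IDs: special tokens first, then tokens in order of first appearance."""
    vocab = {}
    for token in list(specials) + list(tokens):
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab


def tokens_to_ids(tokens, vocab, unk_token="[UNK]"):
    """Look up each token's ID; tokens missing from the vocabulary map to the [UNK] ID."""
    return [vocab.get(token, vocab[unk_token]) for token in tokens]


vocab = build_vocab(["hello", "world", "hello", "again"])
ids = tokens_to_ids(["hello", "again", "there"], vocab)
print(vocab)  # {'[UNK]': 0, 'hello': 1, 'world': 2, 'again': 3}
print(ids)    # [1, 3, 0]
```

Real tokenizers ship with a fixed, pre-built vocabulary, so the ID of a given token only has meaning relative to that particular tokenizer.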


## 3. Tokenizer
* __Definition__ : A tokenizer is a function (or a class) that takes raw text (a string) and outputs a list of tokens or numerical IDs representing those tokens.
* __Mechanism__ : Different tokenizers use different approaches:
  * __Word-based tokenizers__ : The text is split on spaces and punctuation.
  * __Subword/BPE (Byte-Pair Encoding) tokenizers__ : The text is segmented into frequent subwords. For example, “embedding” might be split into “em”, “bed”, and “ding.”
  * __Character-based tokenizers__ : Every single character may become a token.
* __Output__ : Typically, a tokenizer maps tokens to integer indices (IDs). For example, the word “Hello” might be token ID 440 in a certain vocabulary.
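
The three approaches can be sketched in a few lines of plain Python. This is a simplified illustration, not how any production tokenizer is implemented: the subword function uses greedy longest-match lookup against a small hypothetical vocabulary, whereas real BPE applies learned merge rules.

```python
import re


def word_tokenize(text):
    """Split into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def char_tokenize(text):
    """Every character, including spaces, becomes a token."""
    return list(text)


def greedy_subword_tokenize(word, vocab):
    """Segment one word by repeatedly taking the longest prefix found in vocab."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:  # nothing in the vocabulary matches at this position
            return ["[UNK]"]
        pieces.append(word[start:end])
        start = end
    return pieces


# Hypothetical subword vocabulary, chosen so that "embedding" splits as in the text.
subword_vocab = {"em", "bed", "ding", "e", "m", "b", "d"}

print(word_tokenize("Hello, world!"))                       # ['Hello', ',', 'world', '!']
print(char_tokenize("Hi!"))                                 # ['H', 'i', '!']
print(greedy_subword_tokenize("embedding", subword_vocab))  # ['em', 'bed', 'ding']
```

Word-based tokenizers need a very large vocabulary and fail on unseen words. Character-based tokenizers never fail but produce long sequences. Subword tokenizers sit in between.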



## 4. Embeddings

* __Definition__ : The concept of converting data into a vector format is often referred to as embedding. An embedding is a dense numerical vector that captures semantic or contextual information about a token (or an entire sequence).
* __Model Output__ : When you pass your tokenized input into a language model (e.g., BERT, GPT), the first layer of the model converts each token ID into an embedding, which is typically a vector of floating-point numbers.
* __Dimension__ : Embeddings often have dimensions in the hundreds or even thousands (e.g., 768 or 1024).
* __Note__ : It’s important to note that different data formats require distinct
embedding models. For example, an embedding model designed for text would not
be suitable for embedding image data.
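
Conceptually, the first layer is a lookup table with one row per token ID. The sketch below assumes toy sizes and random values (in a real model the rows are learned during training; BERT-base, for example, has roughly 30,000 rows of 768 floats). The names `embedding_table` and `embed` are illustrative.

```python
import random

rng = random.Random(0)   # fixed seed so the sketch is reproducible
vocab_size, dim = 5, 4   # toy sizes for illustration only

# One row of `dim` floats per token ID.
embedding_table = [
    [rng.uniform(-1, 1) for _ in range(dim)] for _ in range(vocab_size)
]


def embed(token_ids):
    """Replace each token ID with the corresponding row of the table."""
    return [embedding_table[token_id] for token_id in token_ids]


vectors = embed([3, 1, 3])
print(len(vectors), len(vectors[0]))  # 3 4
print(vectors[0] == vectors[2])       # True: the same ID always yields the same vector
```

At this stage the vector for a token does not depend on its neighbours. Context only enters in the higher layers of the model.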


## 5. Vocabulary

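* __Definition__ : The vocabulary is the fixed set of all tokens a tokenizer knows, together with the ID assigned to each token.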


## 6. Putting It All Together
1. You provide raw text (e.g., "Hello, this is a test.").
2. The tokenizer splits this text into tokens and assigns IDs (e.g., [101, 7592, 1010, 2023, ...] for a BERT-based model).
3. The model looks up the corresponding embedding vector for each token ID. These vectors are learned during training.
4. Higher layers of the model transform and refine these embeddings. Ultimately, the resulting vectors are used for tasks such as classification or language generation.

In [None]:
import random

import numpy as np
import plotly.express as px
import plotly.graph_objs as go
import torch
from PIL import Image
from sentence_transformers import SentenceTransformer, util
from sklearn.manifold import TSNE
from sklearn.metrics.pairwise import cosine_similarity
from transformers import BertTokenizerFast

## Tokens and Tokenizers in practice

In [None]:
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

Note the "Fast" suffix on `BertTokenizerFast`. It indicates that the tokenizer is backed by the Hugging Face `tokenizers` library, which is implemented in Rust rather than Python.

In [None]:
text = "biker biked a long bike tour"

In [None]:
encoded = tokenizer(text)
print("Encoded text:", encoded)

In [None]:
decoded_string = tokenizer.decode(encoded["input_ids"], skip_special_tokens=True)
print("Decoded text:", decoded_string)


Why do we have 9 tokens and only 6 words?

In [None]:
def print_tokens_and_ids(encoded):
    # 1. Get the token IDs (already a plain Python list, since no return_tensors was requested)
    input_ids = encoded["input_ids"]

    # 2. Convert each token ID back to its corresponding token
    tokens = tokenizer.convert_ids_to_tokens(input_ids)

    # 3. Print the tokens with their IDs
    for token, token_id in zip(tokens, input_ids):
        print(f"Token: {token}\tID: {token_id}")


In [None]:
print_tokens_and_ids(encoded)

This is just one example of a tokenizer. 

Can you think of more ways to split a text into tokens?

Think about pros and cons for your proposed tokenizer!


* Word Tokenization
* Character Tokenization
* Subword Tokenization
    * Byte Pair Encoding (BPE)
    * WordPiece
    * SentencePiece
    * Unigram

As far as I know, OpenAI uses BPE or a variant of it; their open-source `tiktoken` library implements byte-level BPE. A working version of [BPE](https://sebastianraschka.com/blog/2025/bpe-from-scratch.html) is not that difficult to implement.
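
To give a feel for the core of BPE training, here is a minimal sketch that learns merge rules from a tiny word list. It is a simplification under stated assumptions: it works on characters rather than bytes, ignores word-boundary markers, and breaks ties by first occurrence. The function names are made up for this example.

```python
from collections import Counter


def count_pairs(words):
    """Count adjacent symbol pairs over a corpus given as {symbol tuple: frequency}."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(words, pair):
    """Rewrite every word, fusing each occurrence of `pair` into a single symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    words = dict(Counter(tuple(word) for word in corpus))
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair; ties go to the first seen
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words


merges, words = learn_bpe(["bike", "biker", "biked", "bike"], num_merges=3)
print(merges)  # [('b', 'i'), ('bi', 'k'), ('bik', 'e')]
print(words)   # {('bike',): 2, ('bike', 'r'): 1, ('bike', 'd'): 1}
```

After three merges the frequent stem "bike" has become a single symbol, while the rarer endings "r" and "d" remain separate pieces.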

## Remember to check that your choice of tokenizer can handle expected input!

In [None]:
happy = tokenizer("I feel 😊 today!")
print_tokens_and_ids(happy)

In [None]:
sad = tokenizer("I feel 😢 today!")
print_tokens_and_ids(sad)

This advice most likely also applies to languages other than English!
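
One way tokenizers avoid unknown tokens altogether is to fall back to raw bytes: any string can be encoded as UTF-8, and there are only 256 possible byte values. The sketch below treats each byte as a token ID, which is a deliberate simplification (byte-level BPE merges bytes into larger units on top of this). The helper names are illustrative.

```python
def byte_tokenize(text):
    """Map text to a list of byte values (0-255); no input can be 'unknown'."""
    return list(text.encode("utf-8"))


def byte_detokenize(byte_ids):
    """Inverse of byte_tokenize."""
    return bytes(byte_ids).decode("utf-8")


ids = byte_tokenize("I feel 😊")
print(ids)                   # [73, 32, 102, 101, 101, 108, 32, 240, 159, 152, 138]
print(byte_detokenize(ids))  # I feel 😊
```

The emoji alone takes four byte tokens, which shows the trade-off: nothing is lost, but rare characters become expensive in sequence length.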

# Embeddings

Not all embeddings are the same. There are at least two different kinds:

* sentence/chunk/document embeddings and
* word embeddings

In a RAG application we search our index for sentences similar to the user's query, so sentence embeddings are more likely to be successful.

LLM base models simply try to predict the next word (= token), so word embeddings are commonly used there.
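
The two kinds are related: a sentence embedding is often derived from per-token vectors by pooling them into a single vector. Mean pooling is one common choice, sketched below with made-up numbers. In practice the vectors being averaged are usually the contextualized outputs of the model's final layer, not the raw lookup embeddings, and padding tokens are excluded.

```python
def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one sentence vector."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]


# Made-up 3-dimensional vectors for a three-token sentence.
token_vectors = [
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
    [2.0, 4.0, 1.0],
]
sentence_vector = mean_pool(token_vectors)
print(sentence_vector)  # [2.0, 2.0, 1.0]
```

Whatever the sentence length, the result has the same dimension as a single token vector, which is what makes sentences of different lengths directly comparable.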


## Embedding model

Working with sentence embeddings seems easier than working with word embeddings. The following code is quite low-level because we want to use the same model for both sentence and word embeddings.

In [None]:
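# Use the GPU when one is available; otherwise fall back to the CPU.
device = "cuda:0" if torch.cuda.is_available() else "cpu"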
model = SentenceTransformer("all-MiniLM-L6-v2", device=device)

In [None]:
model._first_module()

In [None]:
# A SentenceTransformer is a stack of modules. Tokens are the input
# of the first one, so we can ignore the rest.
first_module = model._first_module()

tokenizer = first_module.tokenizer
embeddings = first_module.auto_model.embeddings

In [None]:
first_sentence = "vector search optimization"
second_sentence = "how do I use vector search optimization"


### Word embeddings

In [None]:
with torch.no_grad():
    # Tokenize both texts
    first_tokens = model.tokenize([first_sentence])
    second_tokens = model.tokenize([second_sentence])

    # Get the corresponding embeddings
    first_embeddings = embeddings.word_embeddings(first_tokens["input_ids"].to(device))
    second_embeddings = embeddings.word_embeddings(
        second_tokens["input_ids"].to(device)
    )

first_embeddings.shape, second_embeddings.shape

In [None]:
first_tokens["input_ids"], second_tokens["input_ids"]

In [None]:
# Note: cos_sim returns similarities (higher means more alike), not distances.
distances = (
    util.cos_sim(first_embeddings.squeeze(), second_embeddings.squeeze()).cpu().numpy()
)

In [None]:
px.imshow(
    distances,
    x=model.tokenizer.convert_ids_to_tokens(second_tokens["input_ids"][0]),
    y=model.tokenizer.convert_ids_to_tokens(first_tokens["input_ids"][0]),
    text_auto=True,
)
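
Each cell of the heatmap is a cosine similarity: the dot product of two vectors divided by the product of their lengths. A plain-Python sketch of that formula follows. The function name `cosine_sim` is chosen here to avoid clashing with the imported helpers, and the sketch assumes neither vector is all zeros.

```python
import math


def cosine_sim(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(cosine_sim([1.0, 0.0], [2.0, 0.0]))   # 1.0  (same direction, length is ignored)
print(cosine_sim([1.0, 0.0], [0.0, 3.0]))   # 0.0  (orthogonal)
print(cosine_sim([1.0, 0.0], [-1.0, 0.0]))  # -1.0 (opposite direction)
```

Because the result depends only on direction, a value close to 1 means two embeddings point the same way regardless of their magnitudes.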

### Sentence embeddings

In [None]:
first_sentence_embedding = model.encode(first_sentence)
second_sentence_embedding = model.encode(second_sentence)

In [None]:
distance = (
    util.cos_sim(
        first_sentence_embedding.squeeze(), second_sentence_embedding.squeeze()
    )
    .cpu()
    .numpy()
)
distance

# Visualizing the model embeddings


In [None]:
token_embeddings = embeddings.word_embeddings.weight.detach().cpu().numpy()
token_embeddings.shape


In [None]:
vocabulary = tokenizer.get_vocab()
sorted_vocabulary = sorted(
    vocabulary.items(),
    key=lambda x: x[1],  # sort by token ID so that list position matches the embedding row
)
sorted_tokens = [token for token, _ in sorted_vocabulary]
random.choices(sorted_tokens, k=25)

In [None]:
tsne = TSNE(n_components=2, metric="cosine", random_state=42)
tsne_embeddings_2d = tsne.fit_transform(token_embeddings)
tsne_embeddings_2d.shape

In [None]:
token_colors = []
for token in sorted_tokens:
    if token[0] == "[" and token[-1] == "]":
        token_colors.append("red")
    elif token.startswith("##"):
        token_colors.append("blue")
    else:
        token_colors.append("green")


scatter = go.Scattergl(
    x=tsne_embeddings_2d[:, 0],
    y=tsne_embeddings_2d[:, 1],
    text=sorted_tokens,
    marker=dict(color=token_colors, size=3),
    mode="markers",
    name="Token embeddings",
)

fig = go.FigureWidget(
    data=[scatter],
    layout=dict(
        width=600,
        height=900,
        margin=dict(l=0, r=0),
    ),
)

fig.show()


## Image embeddings

In [None]:
import matplotlib.image as mpimg
import matplotlib.pyplot as plt

# Replace these with the paths or filenames of your four images
image_paths = [
    "images/banana.jpg",
    "images/bananas.jpg",
    "images/different_bananas.jpg",
    "images/pizza.jpg",
]

# Create a figure with 1 row and 4 columns
fig, axes = plt.subplots(nrows=1, ncols=4, figsize=(12, 3))

# Load each image and show it in a separate subplot
for i, path in enumerate(image_paths):
    # Read the .jpg image using matplotlib.image
    img = mpimg.imread(path)

    # Display the image
    axes[i].imshow(img)

    # Optionally, turn off the axis ticks/spines
    axes[i].axis("off")

plt.tight_layout()
plt.show()

In [None]:
model = SentenceTransformer("clip-ViT-L-14")
banana_embeddings = model.encode(Image.open("images/banana.jpg"))

print(len(banana_embeddings))  # Dimension of embeddings 768
print(banana_embeddings[:10])


In [None]:
bananas_embeddings = model.encode(Image.open("images/bananas.jpg"))
different_bananas_embeddings = model.encode(Image.open("images/different_bananas.jpg"))
pizza_embeddings = model.encode(Image.open("images/pizza.jpg"))


In [None]:
all_embeddings = np.array(
    [
        banana_embeddings,
        bananas_embeddings,
        different_bananas_embeddings,
        pizza_embeddings,
    ]
)

In [None]:
similarity_matrix = cosine_similarity(all_embeddings)

In [None]:
labels = ["banana", "bananas", "different bananas", "pizza"]

In [None]:
px.imshow(
    similarity_matrix,
    x=labels,
    y=labels,
    text_auto=True,
)
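
Besides reading the heatmap by eye, the most similar pair can be picked out programmatically. The sketch below scans the upper triangle of a similarity matrix so that the diagonal (each item compared with itself) is ignored. The helper `most_similar_pair` is hypothetical, and the numbers are made up for illustration; they are not actual CLIP output.

```python
def most_similar_pair(matrix, labels):
    """Return (similarity, label_a, label_b) for the highest off-diagonal entry."""
    best = None
    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):  # upper triangle only; skips the diagonal
            candidate = (matrix[i][j], labels[i], labels[j])
            if best is None or candidate[0] > best[0]:
                best = candidate
    return best


# Made-up similarity values for illustration only.
example_labels = ["banana", "bananas", "pizza"]
example_matrix = [
    [1.0, 0.9, 0.3],
    [0.9, 1.0, 0.4],
    [0.3, 0.4, 1.0],
]
print(most_similar_pair(example_matrix, example_labels))  # (0.9, 'banana', 'bananas')
```

The same function should also work on the NumPy array returned by `cosine_similarity`, since it only relies on `matrix[i][j]` indexing.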