# Tokens, Tokenizer, and Embeddings

## 1. Tokens
* __Definition__ : A token is a piece of text that is treated as a single unit by a language model.
* __Examples__ : Tokens can be individual words, word segments, punctuation marks, or special markers (like [CLS] or [SEP] in some model architectures).
* __Purpose__ : Language models work fundamentally on sequences of tokens. The first step in processing text is always tokenization—splitting text into tokens that the model “understands.”



## 2. Tokenizer
* __Definition__ : A tokenizer is a function (or a class) that takes raw text (a string) and outputs a list of tokens or numerical IDs representing those tokens.
* __Mechanism__ : Different tokenizers use different approaches:
    * __Word-based tokenizers__ : The text is split on spaces and punctuation.
    * __Subword/BPE (Byte-Pair Encoding) tokenizers__ : The text is segmented into frequent subwords. For example, “embedding” might be split into “em”, “bed”, and “ding.”
    * __Character-based tokenizers__ : Every single character may become a token.
* __Output__ : Typically, a tokenizer maps tokens to integer indices (IDs). For example, the word “Hello” might be token ID 440 in a certain vocabulary.
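
To make the three approaches concrete, here is a minimal sketch in plain Python. The helper names and `toy_vocab` are made up for illustration. The subword function uses greedy longest-match lookup against a fixed vocabulary (similar in spirit to how WordPiece-style tokenizers segment a word), not a trained BPE model.

```python
import re


def word_tokenize(text):
    """Split on whitespace and keep punctuation marks as separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def char_tokenize(text):
    """Every character becomes its own token."""
    return list(text)


def subword_tokenize(word, vocab):
    """Greedy longest-match segmentation of one word against a fixed vocabulary."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate substring until it is found in the vocabulary.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:  # nothing in the vocabulary matches at this position
            return ["[UNK]"]
        pieces.append(word[start:end])
        start = end
    return pieces


# Hypothetical toy vocabulary, chosen so the "embedding" example works.
toy_vocab = {"em", "bed", "ding", "hello"}
# Map each token to an integer ID (here: its position in sorted order).
toy_ids = {token: i for i, token in enumerate(sorted(toy_vocab))}

word_tokens = word_tokenize("Hello, this is a test.")
char_tokens = char_tokenize("test")
subword_tokens = subword_tokenize("embedding", toy_vocab)
subword_ids = [toy_ids[token] for token in subword_tokens]

print(word_tokens)
print(char_tokens)
print(subword_tokens, subword_ids)
```

Note the trade-off: the word-level split needs a huge vocabulary to avoid unknown words, the character-level split produces long sequences, and the subword split sits in between.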



## 3. Embeddings
* __Definition__ : An embedding is a dense numerical vector that captures semantic or contextual information about a token (or an entire sequence).
* __Model Output__ : When you pass your tokenized input into a language model (e.g., BERT, GPT), the first layer of the model converts each token ID into an embedding, which is typically a vector of floating-point numbers.
* __Dimension__ : Embeddings often have dimensions in the hundreds or even thousands (e.g., 768 or 1024).
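
Conceptually, that first layer is just a lookup table with one row per token ID. The sketch below illustrates the idea in plain Python. The sizes and random values are hypothetical; in a real model the rows are learned during training and the table is far larger.

```python
import random

random.seed(0)  # fixed seed so the toy values are reproducible

vocab_size = 5     # hypothetical toy vocabulary size
embedding_dim = 4  # real models use hundreds of dimensions

# One row of floats per token ID; in a real model these rows are learned.
embedding_table = [
    [random.uniform(-1.0, 1.0) for _ in range(embedding_dim)]
    for _ in range(vocab_size)
]


def embed(token_ids):
    """Replace each token ID with its row of the embedding table."""
    return [embedding_table[token_id] for token_id in token_ids]


token_ids = [3, 0, 3]
vectors = embed(token_ids)
print(len(vectors), len(vectors[0]))  # number of tokens, embedding dimension
```

Because this is a pure lookup, the same token ID always maps to the same vector at this stage, regardless of its position or neighbours.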



## 4. Putting It All Together
1. You provide raw text (e.g., "Hello, this is a test.").
2. The tokenizer splits this text into tokens and assigns IDs (e.g., [101, 7592, 1010, 2023, ...] for a BERT-based model).
3. The model takes these token IDs and looks up the corresponding embedding vector for each token. These vectors are learned during training.
4. Higher layers of the model transform and refine these embeddings. Ultimately, the resulting vectors are used in tasks such as classification, language generation, or other NLP tasks.

## Tokens and Tokenizers in practice

In [33]:
from sentence_transformers import SentenceTransformer

In [34]:
device = "cuda:0"
print("device:", device)

device: cuda:0


## Let's get a model

In [35]:
model = SentenceTransformer("all-MiniLM-L6-v2", device=device)
# model = SentenceTransformer("bert-base-cased", device=device)
model

SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)

Next, we use the model's tokenizer. A tokenizer splits a sentence into tokens from its vocabulary and maps each token to its ID.

This process is also known as encoding.

In [15]:
tokenizer = model.tokenizer

In [27]:
tokenized_data = tokenizer(["biker biked a long bike tour"])
tokenized_data

{'input_ids': [[101, 28988, 7997, 2094, 1037, 2146, 7997, 2778, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1]]}

In [28]:
tokenizer.convert_ids_to_tokens(tokenized_data["input_ids"][0])

['[CLS]', 'biker', 'bike', '##d', 'a', 'long', 'bike', 'tour', '[SEP]']

The reverse process, converting token IDs back to text, is called decoding.

In [30]:
tokenizer.decode(tokenized_data["input_ids"][0])

'[CLS] biker biked a long bike tour [SEP]'
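
In the token list above, `##d` is a continuation piece: the `##` prefix marks a token that attaches to the previous one. To show what decoding has to do with such pieces, here is a small hand-written sketch. `merge_wordpieces` is a hypothetical helper, not part of the tokenizer's API, and it ignores details such as punctuation spacing.

```python
def merge_wordpieces(tokens, special_tokens=("[CLS]", "[SEP]", "[PAD]")):
    """Rebuild words from WordPiece-style tokens, dropping special markers."""
    words = []
    for token in tokens:
        if token in special_tokens:
            continue
        if token.startswith("##") and words:
            # Continuation piece: glue it onto the previous word.
            words[-1] += token[2:]
        else:
            words.append(token)
    return " ".join(words)


tokens = ["[CLS]", "biker", "bike", "##d", "a", "long", "bike", "tour", "[SEP]"]
text = merge_wordpieces(tokens)
print(text)
```

In practice you would not write this yourself: Hugging Face tokenizers accept `skip_special_tokens=True` in `decode` to drop markers such as `[CLS]` and `[SEP]`.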

This is just one example of a tokenizer. 

Can you think of more ways to split a text into tokens?

* Word Tokenization
* Character Tokenization
* Subword Tokenization
    * Byte Pair Encoding (BPE)
    * WordPiece
    * SentencePiece
    * Unigram
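
As an illustration of how a subword vocabulary can be learned, the sketch below runs a few merge steps in the style of Byte Pair Encoding on a toy corpus. The corpus and its word frequencies are made up, and production implementations add details (end-of-word markers, byte-level fallback, efficient pair counting) that are left out here.

```python
from collections import Counter


def count_pairs(words):
    """Count adjacent symbol pairs over a corpus of {symbol tuple: frequency}."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged


# Hypothetical toy corpus: each word starts as a tuple of characters.
corpus = {tuple("bike"): 5, tuple("biker"): 2, tuple("biked"): 3}

merges = []
for _ in range(3):
    pairs = count_pairs(corpus)
    best = max(pairs, key=pairs.get)  # most frequent pair; ties go to the first seen
    corpus = merge_pair(corpus, best)
    merges.append(best)

print(merges)
print(corpus)
```

After three merges the frequent stem `bike` has become a single symbol, while the rarer suffixes `r` and `d` remain separate pieces.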

# Embeddings

In [5]:
import random

import plotly.express as px
import plotly.graph_objs as go
import torch
from sentence_transformers import SentenceTransformer, util
from sklearn.manifold import TSNE

## We need an embedding model


In [37]:
model = SentenceTransformer("all-MiniLM-L6-v2", device=device)
model

SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
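
The printed module stack shows `pooling_mode_mean_tokens: True` followed by `Normalize()`: the per-token vectors are averaged into one sentence vector, which is then scaled to unit length. Here is a rough plain-Python sketch of those two steps. The vectors and mask are hypothetical toy values, and the library's actual implementation works on batched tensors.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is skipped)."""
    dim = len(token_vectors[0])
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


def l2_normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


# Hypothetical 2-dimensional token vectors; the last one is padding.
token_vectors = [[1.0, 2.0], [5.0, 6.0], [9.0, 9.0]]
attention_mask = [1, 1, 0]

pooled = mean_pool(token_vectors, attention_mask)
sentence_embedding = l2_normalize(pooled)
print(pooled, sentence_embedding)
```

With unit-length sentence embeddings, the dot product of two embeddings equals their cosine similarity.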

In [38]:
model._first_module()

Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 

In [None]:
# A SentenceTransformer is a stack of modules. Tokens are the input of
# the first one (the Transformer), so we can ignore the rest here.
first_module = model._first_module()

tokenizer = first_module.tokenizer
embeddings = first_module.auto_model.embeddings

In [None]:
first_sentence = "vector search optimization"
second_sentence = "how do I use vector search optimization"

with torch.no_grad():
    # Tokenize both texts
    first_tokens = model.tokenize([first_sentence])
    second_tokens = model.tokenize([second_sentence])

    # Get the corresponding embeddings
    first_embeddings = embeddings.word_embeddings(first_tokens["input_ids"].to(device))
    second_embeddings = embeddings.word_embeddings(
        second_tokens["input_ids"].to(device)
    )

first_embeddings.shape, second_embeddings.shape

In [None]:
# cos_sim returns cosine similarities (higher means more similar), not distances.
similarities = (
    util.cos_sim(first_embeddings.squeeze(), second_embeddings.squeeze()).cpu().numpy()
)  # Move the tensor to the CPU and convert to a NumPy array

# Heatmap: rows are tokens of the first sentence, columns tokens of the second.
px.imshow(
    similarities,
    x=model.tokenizer.convert_ids_to_tokens(second_tokens["input_ids"][0]),
    y=model.tokenizer.convert_ids_to_tokens(first_tokens["input_ids"][0]),
    text_auto=True,
)
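
For reference, each cell of that heatmap is the cosine of the angle between two token vectors. The sketch below is a plain-Python version of the computation on hypothetical 2-dimensional vectors; `util.cos_sim` does the same for whole tensors at once.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_matrix(rows, cols):
    """Pairwise cosine similarity: one row per vector in `rows`."""
    return [[cosine_similarity(r, c) for c in cols] for r in rows]


# Hypothetical toy token vectors.
first = [[1.0, 0.0], [0.0, 1.0]]
second = [[2.0, 0.0], [1.0, 1.0], [0.0, -3.0]]

matrix = similarity_matrix(first, second)
for row in matrix:
    print([round(value, 3) for value in row])
```

Vector length does not matter, only direction: `[1, 0]` and `[2, 0]` score 1. Since the input embeddings are a plain lookup, a token that appears in both sentences gets the same vector in each, so its cell in the heatmap is 1.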

In [None]:
# Visualizing the input embeddings


# The embedding matrix has one row per vocabulary entry.
token_embeddings = embeddings.word_embeddings.weight.detach().cpu().numpy()
token_embeddings.shape

vocabulary = tokenizer.get_vocab()
sorted_vocabulary = sorted(
    vocabulary.items(),
    key=lambda x: x[1],  # uses the value of the dictionary entry
)
sorted_tokens = [token for token, _ in sorted_vocabulary]
random.choices(sorted_tokens, k=100)

In [None]:
tsne = TSNE(n_components=2, metric="cosine", random_state=42)
tsne_embeddings_2d = tsne.fit_transform(token_embeddings)
tsne_embeddings_2d.shape

In [None]:
token_colors = []
for token in sorted_tokens:
    if token[0] == "[" and token[-1] == "]":
        token_colors.append("red")
    elif token.startswith("##"):
        token_colors.append("blue")
    else:
        token_colors.append("green")


scatter = go.Scattergl(
    x=tsne_embeddings_2d[:, 0],
    y=tsne_embeddings_2d[:, 1],
    text=sorted_tokens,
    marker=dict(color=token_colors, size=3),
    mode="markers",
    name="Token embeddings",
)

fig = go.FigureWidget(
    data=[scatter],
    layout=dict(
        width=600,
        height=900,
        margin=dict(l=0, r=0),
    ),
)

fig.show()
