# Lab 1: Tokenisation and embeddings

In this lab, you will build an understanding of how text can be transformed into representations that computers can process and learn from. Specifically, you will explore two key concepts: *tokenisation* and *embeddings*. Tokenisation splits text into smaller units such as words, subwords, or characters. Embeddings are dense, fixed-size vector representations of tokens in a continuous space.

*Tasks you can choose for the oral exam are marked with the graduation cap üéì emoji.*

## Part 1: Tokenisation

In the first part of the lab, you will code and analyse a tokeniser based on the Byte Pair Encoding (BPE) algorithm.

### Utility functions

The BPE tokeniser transforms text into a list of integers representing tokens. As a warm-up, you will implement two utility functions on such lists. To simplify things, we define a shorthand for the type of pairs of integers:

In [1]:
type Pair = tuple[int, int]

#### üß© Task 1.01: Counting pairs

Write a function that counts all occurrences of pairs of consecutive token IDs in a given list. The function should return a dictionary that maps each pair to its count. Skip counts that are zero.

In [2]:
from collections import Counter

def count(ids: list[int]) -> dict[Pair, int]:
    # TODO: Replace the following line with your own code
    pairs = [(l,r) for (l,r) in zip(ids[0:-1], ids[1:])]
    # print(pairs)
    return dict(Counter(pairs))

In [3]:
count([1,2,3,3,3,2,1,2])

{(1, 2): 2, (2, 3): 1, (3, 3): 2, (3, 2): 1, (2, 1): 1}

#### üß© Task 1.02: Replacing pairs

Write a function that traverses a list of token IDs from left to right and replaces all occurrences of a specified pair of consecutive IDs by a new ID. The function should return the modified list.

In [4]:
def replace(ids: list[int], pair: Pair, new_id: int) -> list[int]:
    # TODO: Replace the following line with your own code
    new_ids = []
    i = 0
    while i < len(ids)-1: # 2 words left
        if (ids[i], ids[i+1]) == pair:
            new_ids.append(new_id)
            i += 2
        else:
            new_ids.append(ids[i])
            i += 1
    if i < len(ids):
        new_ids.append(ids[i])
    return new_ids

In [5]:
replace([1,2,3,3], (3,3), 4)

[1, 2, 4]

### Encoding and decoding

The next cell contains the core code for the tokeniser in the form of a class `Tokenizer`. This class implements two methods: `encode()` converts an input text to a list of token IDs by exhaustively applying rules for merging pairs of consecutive IDs (stored in the dictionary `self.merges`), and `decode()` reverses this process.

**Note that the set of merge rules is initially empty; you will add rules in Task&nbsp;1.04.**

In [6]:
class Tokenizer:
    def __init__(self):
        self.merges : dict[Pair,int] = {} # map[Pair, int]
        self.vocab : dict[int, bytes]  = {i: bytes([i]) for i in range(2**8)}

    def encode(self, text: str) -> list[int]:
        ids = list(text.encode("utf-8"))
        while True:
            counts = count(ids) # map<pair, int> # count all the byte pairs
            mergeable_pairs = counts.keys() & self.merges.keys() # list[pair] # find all the pairs which are already defined in `merges`
            if len(mergeable_pairs) == 0: # repeat util there are no pairs defined in `merges` appear in the sequence
                break
            to_merge = min(mergeable_pairs, key=self.merges.get) # pair  # type: ignore # find the pair with minimun value in `merges`
            ids = replace(ids, to_merge, self.merges[to_merge]) # replace the pair found with its value in `merges`
        return ids

    def decode(self, ids: list[int]) -> str:
        return b"".join((self.vocab[i] for i in ids)).decode("utf-8")

    def decode_bytes(self, ids) -> bytes:
        return b"".join(self.vocab[i] for i in ids)

            

#### üéì Task 1.03: Encoding and decoding

Explain how the code implements the BPE algorithm. Use the following steps to check your understanding:

**Step&nbsp;1.** Annotate the attributes and methods of the `Tokenizer` class with their Python types. In particular, what is the type of `self.merges`? Use the `Pair` shorthand. What does a merge rule look like?

**Step&nbsp;2.** Explain how the implementation chooses which merge rule to apply. Provide an example that illustrates the logic. Construct the example such that you get a different result when you use `max()` instead of `min()`.

Step1: answered in annotations

Step2: Explained in the code comments above

example:

with `merges={(1,1): 1, (3,3):4}`, (1,1) with lower value

for the sequence `[1,1,3,3]`, (1,1) is replaced first

code below

In [7]:
class MyTokenizer:
    def __init__(self):
        self.merges : dict[Pair,int] = {} # map[Pair, int]
        self.vocab : dict[int, bytes]  = {i: bytes([i]) for i in range(2**8)}

    def encode(self, text: str) -> list[int]:
        ids = list(text.encode("utf-8"))
        while True:
            counts = count(ids) # map<pair, int> # count all the byte pairs
            mergeable_pairs = counts.keys() & self.merges.keys() # list[pair] # find all the pairs which are already defined in `merges`
            if len(mergeable_pairs) == 0: # repeat util there are no pairs defined in `merges` appear in the sequence
                break
            to_merge = min(mergeable_pairs, key=self.merges.get) # pair  # type: ignore # find the pair with minimun value in `merges`
            ids = replace(ids, to_merge, self.merges[to_merge]) # replace the pair found with its value in `merges`
            print(ids)
        return ids

    def decode(self, ids: list[int]) -> bytes:
        return b"".join((self.vocab[i] for i in ids)).decode("utf-8")
t = MyTokenizer()
t.merges = {(1,1): 1, (3,3):4}
t.encode(t.decode([1,1,3,3]))

[1, 3, 3]
[1, 4]


[1, 4]

### Training a tokeniser

Upon initialisation, a tokeniser has an empty set of merge rules. Your next task is to complete the BPE algorithm and write code to learn these merge rules from a text.

#### üéì Task 1.04: Training a tokeniser

Write a function that induces a BPE tokeniser from a given text. The function should take the text (a string) and a target vocabulary size as input and return the trained tokeniser.

In [8]:
def from_text(text: str, vocab_size: int) -> Tokenizer:
    tok = Tokenizer()
    # TODO: Add code to populate `tok.merges` and `tok.vocab`
    seq = tok.encode(text)
    
    while len(tok.vocab) < vocab_size:
        __import__('sys').stdout.write(f'\r{len(tok.vocab)} / {vocab_size}')
        c = count(seq) # dict[pair, int]
        p = max(c.keys(), key=c.get)
        next_id = len(tok.vocab)
        tok.vocab[next_id] = tok.decode_bytes(p)
        tok.merges[p] = next_id
        # seq = tok.encode(text)
        seq = replace(seq, p, next_id)
        
    return tok

To help you test your implementation, we provide three text files together with tokenisers trained on these files. Each text file contains the first 1&nbsp;million Unicode characters in a language-specific Wikipedia:

| Text file | Tokeniser file | Wikipedia |
|---|---|---|
| `wiki-en-1m.txt` | `wiki-en-1m.tok` | [Simple English](https://simple.wikipedia.org/) |
| `wiki-is-1m.txt` | `wiki-is-1m.tok` | [Icelandic](https://is.wikipedia.org/) |
| `wiki-sv-1m.txt` | `wiki-sv-1m.tok` | [Swedish](https://sv.wikipedia.org/) |

A tokeniser file consists of lines specifying merge rules. For example, the first line in the tokeniser file for Swedish is `101 114`, which expresses that this rule combines the token with ID 101 (`e`) and the token with ID 114 (`r`). The ID of the new token (`er`) is 256 plus the (zero-indexed) line number on which the rule is found. The following code saves a `Tokenizer` to a file with this format:

In [9]:
def save(tokenizer: Tokenizer, filename: str) -> None:
    with open(filename, "w") as f:
        for fst, snd in tokenizer.merges:
            print(f"{fst} {snd}", file=f)

To test your code, compare your saved tokeniser to the provided tokeniser using the `diff` command line tool.

**Note that training a tokeniser can take a few minutes.**

In [10]:

def readall(filename):
    with open(filename, "r") as f:
        lines = [line for line in f]
        return ''.join(lines)
text = readall('wiki-sv-1m.txt')
tok = from_text(text, 1024)
save(tok, 'my-wiki-sv-1m.tok')

1023 / 1024

In [11]:

def do_file(file_name_prefix: str):
    text = readall(f'{file_name_prefix}.txt')
    tok = from_text(text, 1024)
    save(tok, f'my-{file_name_prefix}.tok')

In [12]:
for f in ["wiki-en-1m", "wiki-is-1m", "wiki-sv-1m"]:
    do_file(f)

1023 / 1024

In [13]:
!diff wiki-en-1m.tok my-wiki-en-1m.tok
!diff wiki-is-1m.tok my-wiki-is-1m.tok
!diff wiki-sv-1m.tok my-wiki-sv-1m.tok

No diff!

### Tokenisation quirks

The tokeniser is a key component of language models, as it defines the minimal chunks of text the model can ‚Äúsee‚Äù and work with. As you will see in this section, tokenisation is also responsible for several deficiencies and unexpected behaviours of language models.

One helpful tool for experimenting with tokenisers in language models is the web app [Tiktokenizer](https://tiktokenizer.vercel.app/). This app lets you play around with, among others, [`o200k_base`](https://tiktokenizer.vercel.app/?model=o200k_base), the tokeniser used in ChatGPT&nbsp;5.2. You can also use OpenAI‚Äôs own [Tokenizer](https://platform.openai.com/tokenizer) page.

#### üéì Task 1.05: Tokenisation quirks

Prompt [ChatGPT](https://chatgpt.com/) to reverse the letters in the following words:

```
creativecommons
MERCHANTABILITY
NSNotification
authentication
```

How many of these words come out right? What happens when you modify the prompt and explicitly disable ‚Äúthinking‚Äù and external tools? What could be the problem when words come out wrong? Generate ideas by inspecting the words in Tiktokenizer. Try to come up with other prompts that illustrate problems related to tokenisation.

### Tokenisation and multi-linguality

Many NLP systems and the tokenisers used with them are primarily trained on English data. In the next task, you will reflect on the effect this has when they are used to process non-English data.

The *context length* of a language model is the maximum number of preceding tokens the model can condition on when predicting the next token. This number is fixed and cannot be changed after training the model. For example, the context length of GPT-2 ([Radford et al., 2019](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)) is 1,024. 

While the context length of a language model is fixed, the amount of information that can be squeezed into this context length will depend on the tokeniser. Informally speaking, a model that needs more tokens to represent a given text cannot extract as much information from that text as one that needs fewer tokens.

#### üéì Task 1.06: Tokenisation and multi-linguality

Train a tokeniser on the English text file from Task&nbsp;1.04 (or use the provided one) and apply it to the same text. How many tokens does it split the text into? Based on this, what is the expected number of Unicode characters of English text that can be fit into the GPT-2 context window?

What do the numbers look like if you apply the English tokeniser to the Icelandic text instead? How do you explain the differences?

Interpreting the expected number of Unicode characters as a measure of representation efficiency, what do your results tell you about the efficiency of a language model primarily trained on English data when it is used to process non-English data? Relate these findings to the issue of tokeniser fairness.

In [14]:
en_text = readall('wiki-en-1m.txt')
en_tok = from_text(text, 1024)

1023 / 1024

In [15]:
len(en_tok.encode(en_text))

553636

In [16]:
len(en_tok.encode(readall('wiki-is-1m.txt')))

688002

## Part 2: Embeddings

In the second part of the lab, you will explore embeddings. An embedding layer is a network component that assigns each item in a finite set of elements (often called a *vocabulary*) a fixed-size vector. At first, these vectors are filled with random values, but during training, they are adjusted to suit the task at hand.

### Bag-of-words classifier

To help you build an intuition for embeddings and the vector representations learned by them, we will use a simple bag-of-words text classifier. The core part of this classifier only takes a few lines of code:

In [38]:
import torch.nn as nn


class Classifier(nn.Module):
    def __init__(self, num_embeddings, embedding_dim, num_classes, init_kaiming=False):
        super().__init__()
        self.embedding = nn.Embedding(num_embeddings, embedding_dim)
        self.linear = nn.Linear(embedding_dim, num_classes)
        if init_kaiming:
            nn.init.kaiming_uniform_(self.embedding.weight, a=math.sqrt(5))

    def forward(self, x):
        return self.linear(self.embedding(x).mean(dim=-2))

#### üß© Task 1.07: Bag-of-words classifier

Explain how the bag-of-words classifier works. How does the code match the diagram you saw in the lectures? Why is there only one `nn.Embedding`, while the diagram shows three embedding layers? What does the keyword argument `dim=-2` do?

### Dataset

You will apply the classifier to a small dataset with Amazon customer reviews. This dataset is taken from [a much larger dataset](https://www.cs.jhu.edu/~mdredze/datasets/sentiment/) first described by [Blitzer et al. (2007)](https://aclanthology.org/P07-1056/).

The dataset contains whitespace-tokenised product reviews from two categories: cameras (`camera`) and music (`music`). Each review is additionally annotated for sentiment towards the product at hand: negative (`neg`) or positive (`pos`). The category and sentiment labels are prepended to the review. As an example, here is the first review from the training data:

```
music neg oh man , this sucks really bad . good thing nu-metal is dead . thrash metal is real metal , this is for posers
```

The next cell contains a custom [`Dataset`](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html) class for the review dataset. To initialise an instance of this class, you specify the name of the file containing the reviews you want to load (`filename`) and which of the two labels you want to use (`label`): product category (0) or sentiment (1).

In [18]:
from torch.utils.data import Dataset


class ReviewDataset(Dataset):
    def __init__(self, filename: str, label: int = 0) -> None:
        with open(filename) as f:
            tokenized_lines = [line.split() for line in f]
        self.items = [(tokens[2:], tokens[label]) for tokens in tokenized_lines]

    def __len__(self) -> int:
        return len(self.items)

    def __getitem__(self, idx: int) -> tuple[list[str], str]:
        return self.items[idx]

### Vectoriser

To feed a review into the bag-of-words classifier, you first need to turn it into a vector of token IDs. Likewise, you need to convert the label (product category or sentiment) into an integer. The next cell contains a partially completed `ReviewVectoriser` class that handles this transformation.

In [19]:
from collections import Counter

import torch

# Type abbreviation for review‚Äìlabel pairs
type Item = tuple[list[str], str]


class ReviewVectorizer:
    PAD = "[PAD]"
    UNK = "[UNK]"

    def __init__(self, dataset: ReviewDataset, n_vocab: int = 1024) -> None:
        # Unzip the dataset into reviews and labels
        reviews, labels = zip(*dataset)

        # Count the tokens and get the most common ones
        counter = Counter(t for r in reviews for t in r)
        most_common = [t for t, _ in counter.most_common(n_vocab - 2)]

        # Create the token-to-index and label-to-index mappings
        self.t2i = {t: i for i, t in enumerate([self.PAD, self.UNK] + most_common)}
        self.l2i = {l: i for i, l in enumerate(sorted(set(labels)))}

    def __call__(self, items: list[Item]) -> tuple[torch.Tensor, torch.Tensor]:
        # TODO: Complete the implementation of this method
        t, l = [], []
        
        maxl = 0
        for review, label in items:
            token_ids = [self.t2i.get(w, 1) for w in review]
            label_id = self.l2i.get(label)
            t.append(token_ids)
            maxl = max(maxl, len(token_ids))
            l.append(label_id)
            
        for token_ids in t:
            if(len(token_ids)) < maxl:
                token_ids.extend([0]*(maxl-len(token_ids)))
        
        xs = torch.tensor(t, dtype=torch.long)
        ys = torch.tensor(l, dtype=torch.long)
        # print(len(items), xs.shape, ys.shape)
        return xs, ys

A `ReviewVectoriser` maps tokens and labels to IDs using two Python dictionaries. These dictionaries are set up when the vectoriser is initialised and queried when the vectoriser is called on a batch of review‚Äìlabel pairs. They include IDs for two special tokens:

`[PAD]` (Padding): Reviews can have different lengths, but PyTorch requires all vectors in a batch to be the same size. To handle this, the vectoriser adds `[PAD]` tokens to the end of shorter reviews so they match the length of the longest review in the batch.

`[UNK]` (Unknown): If a review contains a token that is not in the token-to-ID dictionary, the vectoriser assigns it the ID of the `[UNK]` token instead of a regular ID.

#### üéì Task 1.08: Vectoriser

Explain and complete the code of the vectoriser. Follow these steps:

**Step&nbsp;1.** Explain how unzipping works. What are the types of `reviews` and `labels`?

**Step&nbsp;2.** Explain how the token-to-ID and label-to-ID mappings are constructed. How does the `most_common()` method deal with elements that occur equally often?

**Step&nbsp;3.** Complete the implementation of the `__call__()` method. This method should convert a list of $m$ review‚Äìlabel pairs into a pair $(X, y)$ where $X$ is a matrix containing the vectors with token IDs for the reviews, and $y$ is a vector containing the IDs of the corresponding labels.

### Training the classifier

With the vectoriser completed, you are ready to train a classifier. More specifically, you can train two separate classifiers: one to predict the product category of a review, and one to predict the sentiment. The next cell contains a simple training loop that you can adapt for this purpose.

In [40]:
import torch.nn.functional as F
import math

def train(filename = "reviews-train.txt", label=0, vocab_size=1024, embedding_dim=64, lr=0.001, batch_size=16, 
         shuffle=True, init_kaiming=False):
    dataset = ReviewDataset(filename, label=label) # open the train data file as dataset
    processor = ReviewVectorizer(dataset, vocab_size) # init processor to translate review text and label text into token ids and label ids
    model = Classifier(vocab_size, embedding_dim, len(processor.l2i), init_kaiming=init_kaiming) # init model
    optimizer = torch.optim.Adam(model.parameters(), lr=lr) # init optimizer
    data_loader = torch.utils.data.DataLoader( # init data loader
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        collate_fn=processor,
    )
    for epoch in range(10):
        model.train() # turn the model into train mode
        running_loss = 0 # init variable, count the total loss in current epoch
        for bx, by in data_loader: # load a batch of data from data loader
            optimizer.zero_grad() # clear the current gradient
            output = model(bx) # evaluate training input
            loss = F.cross_entropy(output, by) # calculate loss
            loss.backward() # calculate gradient
            optimizer.step() # update parameter
            running_loss += loss.item() # add to the loss of current epoch
        print(f"Epoch {epoch}, loss: {running_loss / len(data_loader):.4f}")
    return processor, model

#### üéì Task 1.09: Training loop

Explain the training loop. Follow these steps:

**Step&nbsp;1.** Go through the training loop line-by-line and add comments where you find it suitable. Your comments should be detailed enough for you to explain the main steps of the loop.

**Step&nbsp;2.** The training loop contains various hard-coded values like filename, learning rate, batch size, and epoch count. This makes the code less flexible. Revise the code so that you can specify these values using keyword arguments. Use the concrete values from the code as defaults.

#### üß© Task 1.10: Training the classifier

Adapt the next cell to train the classifier for the two prediction tasks. Based on the loss values, which task appears to be the harder one? What is the purpose of setting a seed?

label=1 is with higher loss. Sentiment is harder.

Setting a seed is for reproducibility. With the same random seed, same training input, the same result will be easier to reproduce.

In [21]:
torch.manual_seed(42)
vectorizer, model = train()

Epoch 0, loss: 0.6838
Epoch 1, loss: 0.6486
Epoch 2, loss: 0.6046
Epoch 3, loss: 0.5415
Epoch 4, loss: 0.4708
Epoch 5, loss: 0.4008
Epoch 6, loss: 0.3403
Epoch 7, loss: 0.2900
Epoch 8, loss: 0.2518
Epoch 9, loss: 0.2230


In [22]:
vectorizer1, model1 = train(label=1)

Epoch 0, loss: 0.6922
Epoch 1, loss: 0.6823
Epoch 2, loss: 0.6774
Epoch 3, loss: 0.6673
Epoch 4, loss: 0.6545
Epoch 5, loss: 0.6381
Epoch 6, loss: 0.6183
Epoch 7, loss: 0.5975
Epoch 8, loss: 0.5747
Epoch 9, loss: 0.5508


### Inspecting the embeddings

Now that you have trained the classifier on two separate prediction tasks, it is interesting to inspect and compare the embedding vectors it learned in the process. For this you will use an online tool called the [Embedding Projector](http://projector.tensorflow.org). The next cell contains code to save the embeddings from a trained classifier in a format that can be loaded into this tool.

In [23]:
def save_embeddings(
    vectorizer: ReviewVectorizer,
    model: Classifier,
    vectors_filename: str,
    metadata_filename: str,
):
    i2t = {i: t for t, i in vectorizer.t2i.items()}
    embeddings = model.embedding.weight.detach().numpy()
    items = [(i2t[i], e) for i, e in enumerate(embeddings)]
    with open(vectors_filename, "wt") as f1, open(metadata_filename, "wt") as f2:
        for w, e in items:
            print("\t".join("{:.5f}".format(x) for x in e), file=f1)
            print(w, file=f2)

Call this code as follows:

In [24]:
save_embeddings(vectorizer, model, "vectors.tsv", "metadata.tsv")

In [26]:
save_embeddings(vectorizer1, model1, "vectors1.tsv", "metadata1.tsv")

#### üéì Task 1.11: Inspecting the embeddings

Load the embeddings from the two classification tasks (product category classification and sentiment classification) into the Embedding Projector web app and inspect the vector spaces. How do they compare visually? Does the visualisation make sense to you?

The Embedding Projector offers visualisations based on three dimensionality reduction methods: [UMAP](https://umap-learn.readthedocs.io/en/latest/), [T-SNE](https://en.m.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding), and [PCA](https://en.m.wikipedia.org/wiki/Principal_component_analysis). Which of these seems most useful to you?

Focus on the embeddings for the words *repair* and *sturdy*. Are they close to each other or far away from another? What happens if you switch to the other task? How do you explain that?

### Initialisation of embedding layers

The error surfaces explored when training neural networks can be very complex. Because of this, it is crucial to choose ‚Äúgood‚Äù initial values for the parameters. In the final task of this lab, you will run a small experiment to see how alternative initialisations can affect a model‚Äôs performance.

In PyTorch, the weights of the embedding layer are initially set by sampling from the standard normal distribution, $\mathcal{N}(0, 1)$. However, research suggests other approaches may work better. For example, given that embedding layers share similarities with linear layers, it makes sense to use the same initialisation method for both. The default initialisation method for linear layers in PyTorch is the so-called Kaiming initialisation, introduced by [He et al. (2015)](https://www.cv-foundation.org/openaccess/content_iccv_2015/papers/He_Delving_Deep_into_ICCV_2015_paper.pdf).

#### üß© Task 1.12: Initialisation of embedding layers

Check the [source code of `nn.Linear`](https://docs.pytorch.org/docs/stable/generated/torch.nn.Linear.html#torch.nn.Linear) to see how PyTorch initialises the weights of linear layers using the Kaiming initialisation method. Apply the same method to the embedding layer of your classifier and see how this affects the loss of your model and the vector spaces.

In [35]:
torch.manual_seed(42)
vectorizer3, model3 = train()

Epoch 0, loss: 0.6838
Epoch 1, loss: 0.6486
Epoch 2, loss: 0.6046
Epoch 3, loss: 0.5415
Epoch 4, loss: 0.4708
Epoch 5, loss: 0.4008
Epoch 6, loss: 0.3403
Epoch 7, loss: 0.2900
Epoch 8, loss: 0.2518
Epoch 9, loss: 0.2230


In [41]:
torch.manual_seed(42)
vectorizer4, model4 = train(init_kaiming=True)

Epoch 0, loss: 0.6775
Epoch 1, loss: 0.5864
Epoch 2, loss: 0.4349
Epoch 3, loss: 0.3134
Epoch 4, loss: 0.2375
Epoch 5, loss: 0.1900
Epoch 6, loss: 0.1585
Epoch 7, loss: 0.1328
Epoch 8, loss: 0.1168
Epoch 9, loss: 0.1017


In [42]:
save_embeddings(vectorizer3, model3, "vectors3.tsv", "metadata3.tsv")
save_embeddings(vectorizer4, model4, "vectors4.tsv", "metadata4.tsv")

**ü•≥ Congratulations on finishing lab&nbsp;1!**