# Working with text

In this lesson, we'll learn how to work with text input to neural networks.  This is necessary to build a language model like GPT.

Neural networks can't understand text directly, so we need to convert the text into a numerical representation.  How we choose to represent the text can give the model a lot of useful information, and result in better predictions.

The steps in converting text to a numerical representation are:

1. Tokenize the text to convert it into discrete tokens (kind of like words)
2. Assign a number to each token
3. Convert each token id into a vector representation.  This is called embedding.

We can then feed the vectors into a neural network layer.  Here's a diagram:

![](images/text/text_to_vec.svg)
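As a rough illustration of the three steps, here is a minimal pure-Python sketch. The whitespace split, the `token_to_id` dictionary, and the random 4-dimensional `embedding_table` are stand-ins invented for this example; the lesson builds its own versions of each further down.

```python
import random

text = "the cat sat on the mat"

# Step 1: tokenize (here, a naive whitespace split)
tokens = text.split()

# Step 2: assign a number to each distinct token, in order of first appearance
token_to_id = {}
for token in tokens:
    token_to_id.setdefault(token, len(token_to_id))
ids = [token_to_id[token] for token in tokens]

# Step 3: look up a vector for each id (random here; a real model learns these)
random.seed(0)
embed_dim = 4  # assumed size, chosen only to keep the example small
embedding_table = [
    [random.uniform(-1, 1) for _ in range(embed_dim)]
    for _ in range(len(token_to_id))
]
vectors = [embedding_table[i] for i in ids]

print(ids)  # [0, 1, 2, 3, 0, 4]
print(len(vectors), len(vectors[0]))  # 6 4
```

Both occurrences of `the` get the same id, and therefore the same vector.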

You might be wondering why we don't directly feed the token ids into the neural network.

Embedding enables the network to learn similarities between tokens.  For example, the token id for a `.` might be `2`, and the id for a ` ` might be `7`.  This doesn't help the network understand the relationship between the two tokens.  However, if the vector for a `.` is `[0,1,.1,2]`, and the vector for a ` ` is `[0,1,.1,1]`, the distance between the vectors could indicate that the tokens are similar in their function.  Like weights, the embeddings are learned by the network, and will change during training.  Tokens that are conceptually similar will have vectors that are closer together than tokens that aren't.
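To make the idea of distance concrete, the sketch below measures the Euclidean distance between the two example vectors above and compares it with a third vector. The third vector (labeled `cat`) is made up purely for contrast, and Euclidean distance is only one of several ways to compare embeddings.

```python
import math

period = [0, 1, 0.1, 2]   # example vector for "." from the text
space = [0, 1, 0.1, 1]    # example vector for " " from the text
cat = [3, -2, 1.5, 0]     # hypothetical vector for an unrelated token

near = math.dist(period, space)
far = math.dist(period, cat)

print(near)        # 1.0
print(near < far)  # True
```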

## Load the data

We'll be working with a dataset from [Opus books](https://huggingface.co/datasets/opus_books/viewer/en-es/train).  This dataset contains English sentences from books, and their Spanish translations.  We'll use the translation in the next lesson, but in this one, we'll only use the English sentence.

There are about 24k sentence pairs in the dataset.  Here's an example:

![](images/text/sentences.svg)

These sentences are in very Old(e) English, but that won't stop our AI from parsing them.  We'll first load in the data using `pandas` and explore it:

In [1]:
import pandas as pd

opus = pd.read_csv("../data/opus_books.csv")
opus.head()

| Unnamed: 0 | en | es |
| --- | --- | --- |
| 0 | In the society of his nephew and niece, and th... | En compañía de su sobrino y sobrina, y de los ... |
| 1 | By a former marriage, Mr. Henry Dashwood had o... | De un matrimonio anterior, el señor Henry Dash... |
| 2 | By his own marriage, likewise, which happened ... | Además, su propio matrimonio, ocurrido poco de... |
| 3 | But the fortune, which had been so tardy in co... | Pero la fortuna, que había tardado tanto en ll... |
| 4 | But Mrs. John Dashwood was a strong caricature... | Pero la señora de John Dashwood era una áspera... |


## Create our vocabulary

Now, we need to clean the data and define our token vocabulary.  Our vocabulary is how we map each token to a unique token id.  We'll be creating our own very simple tokenizer and vocabulary.  In practice, you'll use more powerful tokenizers like byte-pair encoding that look at sequences of characters to find the optimal tokenization scheme.

Optimal here means balancing accuracy and speed.  For example, we could use individual characters (`a`, `b`, etc.) as our tokens instead of words.  This would result in a much smaller vocabulary (and run faster), but it would be much less accurate, since the model would get less information about entire words and concepts.
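The sketch below compares the two choices on a small sample. The sample text is invented for illustration, and a plain whitespace split stands in for word-level tokenization. It shows the trade-off in miniature: character tokens need fewer distinct ids, but each one carries less information, and many more of them are needed to cover the same text.

```python
# Made-up sample text, used only to compare the two tokenization choices
text = ("the quick brown fox jumps over the lazy dog while a small grey cat "
        "watches from under an old wooden bench near our quiet garden gate "
        "every single morning and then slowly walks home before lunch")

word_tokens = text.split()   # word-level tokens
char_tokens = list(text)     # character-level tokens (spaces included)

print("word vocab size:", len(set(word_tokens)), "sequence length:", len(word_tokens))
print("char vocab size:", len(set(char_tokens)), "sequence length:", len(char_tokens))
```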

We'll first set up some special tokens that we'll be using ourselves:

- `PAD` - this token is used to pad sequences to a given length.  When we're working with text data, sentences won't all be the same length.  However, a neural network needs all rows in a batch to have the same number of columns.  Padding enables us to make all sentences the same length.  We use a special token for this, and tell the network to ignore it.
- `UNK` - some tokens don't occur often enough to add them to our vocabulary.  Imagine words like `Octothorpe`, or issues with data quality like `hello123bye`.  These long-tail words will add a lot to our vocabulary (and make our model slower), but don't add much value to the model.  More powerful tokenizers will split these up into smaller pieces (down to individual characters or bytes if needed), but in our simple tokenizer, we need `UNK`.
- `BOS` - this special token is used to mark the beginning of a sentence, or a sequence.
- `EOS` - used to mark the end of a sequence.

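Some tokenizers combine these roles.  The GPT-2 tokenizer, for example, uses a single `<|endoftext|>` token in place of separate `BOS` and `EOS` tokens, and does not define a `PAD` token by default.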

In [2]:
import re
from collections import defaultdict

special_tokens = {
    "PAD": 0,
    "UNK": 1,
    "BOS": 2,
    "EOS": 3
}
vocab = special_tokens.copy()
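To make the roles of these special tokens concrete, here is a small sketch that wraps two short sequences in `BOS`/`EOS` and pads them to the same length. The helper `add_special_tokens` and the input ids are hypothetical; this lesson avoids padding later on, so treat this only as an illustration of the idea.

```python
special_tokens = {"PAD": 0, "UNK": 1, "BOS": 2, "EOS": 3}

def add_special_tokens(token_ids, length):
    """Hypothetical helper: wrap ids in BOS/EOS, then right-pad to `length`."""
    wrapped = [special_tokens["BOS"]] + token_ids + [special_tokens["EOS"]]
    # Truncate if too long (note: this can cut off EOS)
    wrapped = wrapped[:length]
    padding = [special_tokens["PAD"]] * (length - len(wrapped))
    return wrapped + padding

# Two sequences of different lengths become rows of equal length
batch = [add_special_tokens(ids, 6) for ids in [[4, 5, 6], [7, 8]]]
print(batch)  # [[2, 4, 5, 6, 3, 0], [2, 7, 8, 3, 0, 0]]
```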

Next, we'll define our functions to clean and tokenize input text.  We're going to do some naive cleaning, and just strip anything that isn't in a small set of characters (letters, numbers, spaces, some punctuation).  We're doing this because our simple tokenizer needs a very small character set (a large character set will result in a larger vocabulary).  As you'll see later, the size of the vocabulary impacts the size of the embedding matrix, and thus the performance of the network.

Our tokenization will just split on whitespace and punctuation.

In [3]:
# This is the maximum number of tokens we'll keep from each sentence.  You can increase this, but training will take longer.
token_limit = 11

def clean(text):
    # Use re to replace any character that is not a letter, digit, underscore, whitespace, comma, question mark, period, or exclamation mark with a space
    text = re.sub(r'[^\w\s,?.!]',' ', text)
    # Strip leading/trailing space
    text = text.strip()
    return text

def tokenize(text):
    # Split on consecutive whitespace and punctuation
    tokens = re.findall(r'\w+|[^\w\s]+|[\s]+', text)
    return tokens[:token_limit]
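To see what these two functions do, the sketch below runs them on a short made-up string. It repeats the definitions above so that it runs on its own. Note that whitespace runs become tokens too, so they count toward `token_limit`.

```python
import re

token_limit = 11

def clean(text):
    # Replace every character outside the allowed set with a space
    text = re.sub(r'[^\w\s,?.!]', ' ', text)
    return text.strip()

def tokenize(text):
    # Words, punctuation runs, and whitespace runs each become a token
    tokens = re.findall(r'\w+|[^\w\s]+|[\s]+', text)
    return tokens[:token_limit]

cleaned = clean("Hello, world! (It's fine.)")
tokens = tokenize(cleaned)

print(repr(cleaned))  # 'Hello, world!  It s fine.'
print(tokens)
# ['Hello', ',', ' ', 'world', '!', '  ', 'It', ' ', 's', ' ', 'fine']
```

The parentheses and apostrophe are replaced by spaces, and the final `.` is dropped because it would be the twelfth token.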

We can now create a vocabulary using our functions.  We'll first create a dictionary containing every token in our sentences, and the number of times it appears across the dataset.  Then, we'll create a vocab dictionary, only selecting the tokens that appear more than once.  Tokens that only appear once will be marked as unknown.

In [4]:
opus_tokens = defaultdict(int)

# Loop through the sentences, clean, tokenize, and assign token counts
for index, row in opus.iterrows():
    cleaned = clean(row["en"])
    tokens = tokenize(cleaned)
    for token in tokens:
        opus_tokens[token] += 1

# Set to the current size of the vocabulary (special tokens)
counter = len(vocab)
# Assign a unique id to each token if it appears more than once
for token, count in opus_tokens.items():
    if count > 1:
        # Common token: give it the next free id
        vocab[token] = counter
        counter += 1
    else:
        # Rare token: map it to the unknown id
        vocab[token] = special_tokens["UNK"]
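The same counting-and-thresholding idea can be sketched on a toy corpus. This is a variant, not the code above: it uses `collections.Counter`, splits on whitespace only, and leaves rare tokens out of the dictionary entirely, falling back to `UNK` at lookup time. The names `toy_vocab` and `lookup` are made up for the example.

```python
from collections import Counter

sentences = ["the cat sat", "the dog sat", "a bird flew"]

# Count how often each token appears across the corpus
counts = Counter(token for s in sentences for token in s.split())

# Start from the special tokens, then add tokens seen more than once
toy_vocab = {"PAD": 0, "UNK": 1, "BOS": 2, "EOS": 3}
for token, count in counts.items():
    if count > 1:
        toy_vocab[token] = len(toy_vocab)

def lookup(token):
    # Anything missing from the vocabulary maps to the UNK id
    return toy_vocab.get(token, toy_vocab["UNK"])

print(toy_vocab)
# {'PAD': 0, 'UNK': 1, 'BOS': 2, 'EOS': 3, 'the': 4, 'sat': 5}
print([lookup(t) for t in "the bird sat".split()])  # [4, 1, 5]
```

With this design, the length of the dictionary equals the number of distinct token ids.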

In [5]:
len(vocab)

11731

Our `vocab` dictionary has about 11k entries.  Note that this count includes the rare tokens that all map to the `UNK` id, so the number of distinct token ids is smaller.  In practice, tokenizers will usually have between 10k and 100k tokens.  This is a good tradeoff between thoroughness (having a unique id for every word), and vocabulary size (splitting some rare words into multiple tokens).  The GPT-2 tokenizer uses 50257 tokens.

We'll also build a reverse vocab lookup, which we can use to decode token ids to tokens:

In [6]:
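reverse_vocab = {v: k for k, v in vocab.items()}
reverse_vocab[vocab["UNK"]] = "UNK"  # Rare tokens share id 1; without this, id 1 would decode to the last rare token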

## Tokenize sentences

We can now use our vocabulary to tokenize our sentences.  We'll create an `encode` function that turns a sentence into a torch tensor of token ids.

We'll also write a decode function.  This will use a reverse lookup to go from token id to token.  This will enable us to decode our predictions and see how good they were.

In [7]:
import torch

def encode(text):
    # Tokenize input text
    tokens = tokenize(clean(text))
    # Convert to token ids, falling back to UNK for tokens not in the vocabulary
    encoded = torch.tensor([vocab.get(token, vocab["UNK"]) for token in tokens])
    return encoded

def decode(encoded):
    # The input is a torch tensor - convert it to a list
    encoded = encoded.detach().cpu().tolist()
    # Decode a list of integers into text
    decoded = "".join([reverse_vocab[token] for token in encoded])
    return decoded
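A round trip through encoding and decoding can be sketched with plain lists in place of tensors. The tiny `toy_vocab` and the `toy_encode`/`toy_decode` helpers below are invented for the example. They show that decoding recovers the text exactly, except where a token was mapped to `UNK`.

```python
toy_vocab = {"PAD": 0, "UNK": 1, "BOS": 2, "EOS": 3,
             "the": 4, " ": 5, "cat": 6, "sat": 7}
toy_reverse = {v: k for k, v in toy_vocab.items()}

def toy_encode(tokens):
    # Unknown tokens fall back to the UNK id
    return [toy_vocab.get(t, toy_vocab["UNK"]) for t in tokens]

def toy_decode(ids):
    # Whitespace is its own token, so joining with "" rebuilds the text
    return "".join(toy_reverse[i] for i in ids)

ids = toy_encode(["the", " ", "cat", " ", "purred"])
print(ids)              # [4, 5, 6, 5, 1]
print(toy_decode(ids))  # the cat UNK
```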

Now, we can use the encode function to convert our English sentences into token ids.  We'll only take sentences that have at least as many tokens as the token limit we set earlier.  This will allow us to use the first `10` tokens of each sentence to predict token `11`.  Alternatively, we could pad the shorter sentences to the limit, but it's easier for now to avoid padding.

In [8]:
tokenized = []
for index, row in opus.iterrows():
    # Encode the English sentences
    en_text = row["en"]
    en = encode(en_text)
    if en.shape[0] < token_limit:
        continue
    tokenized.append(en)

In [9]:
tokenized[0]

tensor([ 4,  5,  6,  5,  7,  5,  8,  5,  9,  5, 10])
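The split into ten input tokens and one target can be sketched on the ids printed above, using a plain list instead of a tensor. The repeated `5` is most likely the single-space token, since `tokenize` emits whitespace as its own token, though the printed output alone does not confirm that.

```python
# The first encoded sentence shown above, as a plain list
token_ids = [4, 5, 6, 5, 7, 5, 8, 5, 9, 5, 10]

context = token_ids[:10]  # the first 10 tokens are the model input
target = token_ids[10]    # token 11 is what the model should predict

print(context)  # [4, 5, 6, 5, 7, 5, 8, 5, 9, 5]
print(target)   # 10
```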

## Create torch dataset

In [10]:
from torch.utils.data import DataLoader, Dataset

class TextData(Dataset):
    def __init__(self, data):
        self.tokens = torch.vstack(data).long()

    def __len__(self):
        # Return how many examples are in the dataset
        return len(self.tokens)

    def __getitem__(self, idx):
        # Return a single training example
        x = self.tokens[idx][:10]
        y = self.tokens[idx][10]
        return x, y

# Initialize the dataset
train_ds = TextData(tokenized)
train = DataLoader(train_ds, batch_size=64)

In [11]:
train_ds[0]

(tensor([4, 5, 6, 5, 7, 5, 8, 5, 9, 5]), tensor(10))

In [12]:
batch = next(iter(train))
batch

[tensor([[  4,   5,   6,   5,   7,   5,   8,   5,   9,   5],
         [ 11,   5,  12,   5,  13,   5,  14,  15,   5,  16],
         [ 11,   5,   9,   5,  18,   5,  14,  15,   5,  19],
         [ 20,   5,   6,   5,  21,  15,   5,  22,   5,  23],
         [ 20,   5,  24,  17,   5,  25,   5,  26,   5,  27],
         [ 28,   5,  29,   5,  30,   5,  31,  15,   5,  32],
         [ 33,   5,  27,   5,  34,   5,  35,   5,  36,  37],
         [ 33,   5,  27,   5,   1,  15,   5,  39,  15,   5],
         [ 41,   5,  42,  15,   5,  43,   5,  44,  15,   5],
         [ 45,   5,  46,   5,  47,   5,  48,   5,  49,   5],
         [ 50,   5,  51,   5,   8,   5,  52,   5,  22,   5],
         [ 41,  15,   5,  54,  15,   5,  27,   5,  55,   5],
         [ 57,   5,   1,   5,  32,   5,  12,   5,  58,   5],
         [ 24,  17,   5,  25,   5,  26,   5,  60,   5,  61],
         [ 62,   5,  63,   5,  64,   5,  65,   5,  66,   5],
         [ 68,   5,  69,   5,  70,   5,  71,   5,  72,   5],
         [ 74,   5,  72,

## Embedding layer

In [13]:
import math
from torch import nn

class Embedding(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super().__init__()

        k = 1/math.sqrt(embed_dim)
        self.weights =  torch.rand(vocab_size, embed_dim) * 2 * k - k
        self.weights[0] = 0 # Zero out the padding embedding
        self.weights = nn.Parameter(self.weights)

    def forward(self, token_ids):
        # Return a matrix of embeddings
        # We could convert token_ids to a one_hot vector and multiply by the weights, but it is the same as selecting a single row of the matrix
        return self.weights[token_ids]
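The comment in `forward` says that multiplying a one-hot vector by the weights is the same as selecting a row. The sketch below checks that with plain Python lists, using a made-up 3x2 weight matrix, so it runs without torch.

```python
# Made-up embedding matrix: 3 tokens, 2 dimensions each
weights = [[0.1, 0.2],
           [0.3, 0.4],
           [0.5, 0.6]]
token_id = 2

# One-hot vector with a 1 at the token's position
one_hot = [1.0 if i == token_id else 0.0 for i in range(len(weights))]

# Vector-matrix product: sum over rows, one output value per column
product = [
    sum(one_hot[row] * weights[row][col] for row in range(len(weights)))
    for col in range(len(weights[0]))
]

print(product)            # [0.5, 0.6]
print(weights[token_id])  # [0.5, 0.6]
```

Indexing gives the same result while skipping all the multiplications by zero, which matters when the vocabulary has thousands of rows.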

In [14]:
with torch.no_grad():
    input_embed = Embedding(len(vocab), 256)
    print(input_embed(batch[0]))

tensor([[[ 0.0377,  0.0337, -0.0130,  ...,  0.0319,  0.0504, -0.0156],
         [-0.0284, -0.0035,  0.0453,  ..., -0.0200, -0.0297,  0.0193],
         [ 0.0294,  0.0567,  0.0090,  ...,  0.0385,  0.0029, -0.0361],
         ...,
         [-0.0284, -0.0035,  0.0453,  ..., -0.0200, -0.0297,  0.0193],
         [ 0.0488,  0.0272, -0.0281,  ...,  0.0275,  0.0597, -0.0071],
         [-0.0284, -0.0035,  0.0453,  ..., -0.0200, -0.0297,  0.0193]],

        [[ 0.0363,  0.0312,  0.0317,  ...,  0.0185,  0.0068,  0.0401],
         [-0.0284, -0.0035,  0.0453,  ..., -0.0200, -0.0297,  0.0193],
         [-0.0099,  0.0077,  0.0200,  ..., -0.0395, -0.0311,  0.0408],
         ...,
         [-0.0112,  0.0034, -0.0372,  ...,  0.0186,  0.0356,  0.0023],
         [-0.0284, -0.0035,  0.0453,  ..., -0.0200, -0.0297,  0.0193],
         [ 0.0040,  0.0188,  0.0364,  ...,  0.0054, -0.0259, -0.0394]],

        [[ 0.0363,  0.0312,  0.0317,  ...,  0.0185,  0.0068,  0.0401],
         [-0.0284, -0.0035,  0.0453,  ..., -0
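The printed values all fall in a narrow band around zero because of how the weights are initialized: `torch.rand` draws from `[0, 1)`, and `* 2 * k - k` rescales that to `[-k, k)` with `k = 1/sqrt(embed_dim)`. The sketch below reproduces the arithmetic with Python's `random` module standing in for `torch.rand`.

```python
import math
import random

embed_dim = 256
k = 1 / math.sqrt(embed_dim)
print(k)  # 0.0625

# random.random() plays the role of torch.rand here: uniform on [0, 1)
random.seed(0)
samples = [random.random() * 2 * k - k for _ in range(1000)]

print(min(samples) >= -k, max(samples) < k)  # True True
```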

## Predict next token

In [15]:
class TokenPredictor(nn.Module):
    def __init__(self, vocab_size, input_token_count, hidden_units):
        super().__init__()

        torch.manual_seed(0)
        self.embedding = Embedding(vocab_size, hidden_units)
        self.dense1 = nn.Linear(hidden_units, hidden_units)
        self.relu = nn.ReLU()
        self.dense2 = nn.Linear(hidden_units, hidden_units)
        self.output = nn.Linear(hidden_units * input_token_count, hidden_units)

    def forward(self, x):
        # Embed from (batch_size, token_count) token ids to (batch_size, token_count, hidden_units)
        embedded = self.embedding(x)
        # Run the network
        x = self.dense2(self.relu(self.dense1(embedded)))
        # Flatten the vectors into one large vector per sentence for the final layer
        flat = torch.flatten(x, start_dim=1)
        # Run the final layer to get an output
        network_out = self.output(flat)
        # Unembed, convert to (batch_size, vocab_size).  Argmax against last dim gives predicted token
        out_vector = network_out @ self.embedding.weights.T
        return out_vector
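It can help to trace the tensor shapes through `forward` and count the parameters by hand. The sketch below does this with plain arithmetic, assuming the values used elsewhere in this lesson (a vocabulary of 11731 entries, 10 input tokens, 256 hidden units, batches of 64). It relies on `nn.Linear` including a bias term by default.

```python
vocab_size, token_count, hidden_units, batch_size = 11731, 10, 256, 64

# Shape of the data after each step of forward()
shapes = {
    "token ids": (batch_size, token_count),
    "embedded": (batch_size, token_count, hidden_units),
    "after dense layers": (batch_size, token_count, hidden_units),
    "flattened": (batch_size, token_count * hidden_units),
    "network_out": (batch_size, hidden_units),
    "out_vector": (batch_size, vocab_size),
}
for name, shape in shapes.items():
    print(f"{name}: {shape}")

def linear_params(n_in, n_out):
    # Weight matrix plus one bias per output unit
    return n_in * n_out + n_out

param_counts = {
    "embedding": vocab_size * hidden_units,
    "dense1": linear_params(hidden_units, hidden_units),
    "dense2": linear_params(hidden_units, hidden_units),
    "output": linear_params(token_count * hidden_units, hidden_units),
}
total = sum(param_counts.values())
print(param_counts)
print(total)  # 3790336
```

The embedding matrix accounts for most of the parameters, which is why vocabulary size has such a large effect on model size. Because the same matrix is reused to unembed, the output step adds no parameters of its own.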

In [16]:
from statistics import mean

# Initialize W&B
%env WANDB_SILENT=True

import wandb
wandb.login()

def train_loop(net, optimizer, epochs):
    # Initialize a new W&B run
    wandb.init(project="text",
               name="dense")

    loss_fn = nn.CrossEntropyLoss(ignore_index=0)
    train_losses = []
    for epoch in range(epochs):
        for batch, (x, y) in enumerate(train):
            # zero_grad will set all the gradients to zero
            # We need this because gradients will accumulate in the backward pass
            optimizer.zero_grad()
            # Make a prediction using the network
            pred = net(x)
            # Calculate the loss
            loss = loss_fn(pred, y)
            # Call loss.backward to run backpropagation
            loss.backward()
            # Step the optimizer to update the parameters
            optimizer.step()
            train_losses.append(loss.item())

            if batch % 10 == 0:
                # Log training metrics
                wandb.log({
                    "train_loss": mean(train_losses)
                })

    return train_losses

env: WANDB_SILENT=True
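`nn.CrossEntropyLoss` takes the raw scores (logits) from the network and the id of the correct token. For a single example, it applies a softmax to the scores and returns the negative log of the probability given to the correct token. Here is a minimal pure-Python sketch of that calculation, with made-up logits; the real loss also averages over the batch and skips targets equal to `ignore_index`.

```python
import math

def cross_entropy(logits, target):
    # log(sum(exp(logits))) computed stably by subtracting the max first
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    # Negative log-probability of the target class
    return log_sum_exp - logits[target]

logits = [2.0, 1.0, 0.1]  # made-up scores for a 3-token vocabulary

loss_right = cross_entropy(logits, 0)  # target is the highest-scoring token
loss_wrong = cross_entropy(logits, 2)  # target is the lowest-scoring token
print(round(loss_right, 3), round(loss_wrong, 3))

# With equal scores for every token, the loss is log(vocabulary size)
uniform_loss = cross_entropy([0.0] * 4, 1)
print(uniform_loss == math.log(4))  # True
```

The last check gives a useful baseline: a model that has learned nothing should have a loss near the log of the vocabulary size.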


In [None]:
# Define our hyperparameters
epochs = 100
lr = 1e-3

# Initialize our network
net = TokenPredictor(len(vocab), 10, 256)
# Optimizer
optimizer = torch.optim.SGD(net.parameters(), lr=lr)
losses = train_loop(net, optimizer, epochs)

In [None]:
with torch.no_grad():
    batch = next(iter(train))
    pred = net(batch[0])
    token_id = pred.argmax(-1)

    for i in range(len(batch[0])):
        text = decode(batch[0][i])
        actual = decode(batch[1][i:(i+1)])
        pred_text = decode(token_id[i:(i+1)])  # Separate name so the network output `pred` isn't overwritten
        print(f"{text}<ACTUAL>{actual}<><PRED>{pred_text}<>")

In [None]:
batch[0]