# Working with text

In this lesson, we'll learn how to work with text input to neural networks.  This is necessary to build a language model like GPT.

Neural networks can't understand text directly, so we need to convert the text into a numerical representation.  How we choose to represent the text can give the model a lot of useful information, and help it make better predictions.

The steps in converting text to a numerical representation are:

1. Tokenize the text to convert it into discrete tokens (kind of like words)
2. Assign a number to each token
3. Convert each token id into a vector representation.  This is called embedding.

We can then feed the vectors into a neural network layer.  Here's a diagram:

![](images/text/text_to_vec.svg)
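
To make the three steps concrete, here is a minimal sketch in plain Python.  The sample sentence, the tiny `embed_dim`, and the random embedding table are assumptions for illustration only.  In a real model, the table is learned during training.

```python
import random
import re

text = "the cat saw the dog"

# Step 1: tokenize into words, punctuation runs, and whitespace runs
tokens = re.findall(r"\w+|[^\w\s]+|\s+", text)

# Step 2: assign an integer id to each distinct token, in order of first appearance
token_to_id = {}
for token in tokens:
    if token not in token_to_id:
        token_to_id[token] = len(token_to_id)
token_ids = [token_to_id[token] for token in tokens]

# Step 3: look up a vector for each id in an embedding table
# The table is random here (an assumption); a trained model learns these values
random.seed(0)
embed_dim = 4  # hypothetical size, kept tiny for readability
embedding_table = [
    [random.uniform(-1, 1) for _ in range(embed_dim)]
    for _ in range(len(token_to_id))
]
vectors = [embedding_table[token_id] for token_id in token_ids]

print(tokens)     # ['the', ' ', 'cat', ' ', 'saw', ' ', 'the', ' ', 'dog']
print(token_ids)  # [0, 1, 2, 1, 3, 1, 0, 1, 4]
print(len(vectors), "vectors of size", embed_dim)
```

Both occurrences of `the` get the same id, so they also get the same vector.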

You might be wondering why we don't directly feed the token ids into the neural network.

Embedding enables the network to learn relationships between tokens.  For example, the token id for a `.` might be `2`, and the id for a ` ` might be `7`.  This doesn't help the network understand the relationship between the two tokens.  However, if the vector for a `.` is `[0,1,.1,2]`, and the vector for a ` ` is `[0,1,.1,1]`, this could indicate that the tokens are similar in their function.  Like weights, the embeddings are learned by the network, and will change during training.  Tokens that are conceptually similar will have vectors that are closer together than tokens that aren't.
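
One way to check what "closer together" means is to measure the distance between vectors.  The sketch below uses the two example vectors from the paragraph above, plus a third vector (`other`) that is made up here to stand in for an unrelated token.

```python
import math

def euclidean_distance(a, b):
    # Straight-line distance between two vectors of equal length
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    # 1.0 means the vectors point in the same direction, -1.0 means opposite
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

period = [0, 1, 0.1, 2]  # vector for "." from the example above
space = [0, 1, 0.1, 1]   # vector for " " from the example above
other = [2, -1, 0.5, 0]  # hypothetical vector for an unrelated token

print(euclidean_distance(period, space))  # small distance: similar tokens
print(euclidean_distance(period, other))  # larger distance: dissimilar tokens
print(cosine_similarity(period, space))
print(cosine_similarity(period, other))
```

The raw token ids `2` and `7` carry no such information.  The difference between them is just an accident of numbering.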

## Load the data

We'll be working with a dataset from [Opus books](https://huggingface.co/datasets/opus_books/viewer/en-es/train).  This dataset contains English sentences from books, and their Spanish translations.  We'll use the translation in the next lesson, but in this one, we'll only use the English sentence.

There are about 24k sentence pairs in the dataset.  Here's an example:

![](images/text/sentences.svg)

These sentences are in very Old(e) English, but that won't stop our AI from parsing them.  We'll first load the data using `pandas` and explore it:

In [48]:
import pandas as pd

opus = pd.read_csv("../data/opus_books.csv")
opus.head()

| Unnamed: 0 | en | es |
| --- | --- | --- |
| 0 | In the society of his nephew and niece, and th... | En compañía de su sobrino y sobrina, y de los ... |
| 1 | By a former marriage, Mr. Henry Dashwood had o... | De un matrimonio anterior, el señor Henry Dash... |
| 2 | By his own marriage, likewise, which happened ... | Además, su propio matrimonio, ocurrido poco de... |
| 3 | But the fortune, which had been so tardy in co... | Pero la fortuna, que había tardado tanto en ll... |
| 4 | But Mrs. John Dashwood was a strong caricature... | Pero la señora de John Dashwood era una áspera... |


## Create our vocabulary

Now, we need to clean the data and define our token vocabulary.  Our vocabulary is how we map each token to a unique token id.  We'll be creating our own very simple tokenizer and vocabulary.  In practice, you'll use more powerful tokenizers like byte-pair encoding that look at sequences of characters to find the optimal tokenization scheme.

We'll first set up some special tokens that we'll be using ourselves:

- `PAD` - this token is used to pad sequences to a given length.  When we're working with text data, sentences won't all be the same length.  However, a neural network needs all rows in a batch to have the same number of columns.  Padding enables us to make all sentences the same length.  We use a special token for this, and tell the network to ignore it.
- `UNK` - some tokens don't occur often enough to add them to our vocabulary.  Imagine words like `Octothorpe`, or issues with data quality like `hello123bye`.  These long-tail words will add a lot to our vocabulary (and make our model slower), but don't add much value to the model.  More powerful tokenizers will split these up into smaller pieces (subwords or individual characters), but in our simple tokenizer, we need `UNK`.
- `BOS` - this special token is used to mark the beginning of a sentence, or a sequence.
- `EOS` - used to mark the end of a sequence.

Some tokenizers, like the GPT-2 tokenizer, use a single `<|endoftext|>` token to mark sequence boundaries instead of separate `BOS` and `EOS` tokens.
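
As a rough illustration of how these special tokens get used, the sketch below wraps a sequence in `BOS`/`EOS` and pads it to a fixed length.  The helper name `wrap_and_pad`, the sample ids, and the mask are assumptions for illustration, not part of this lesson's pipeline.

```python
special_tokens = {"PAD": 0, "UNK": 1, "BOS": 2, "EOS": 3}

def wrap_and_pad(token_ids, length):
    """Hypothetical helper: add BOS/EOS, then right-pad with PAD to `length`."""
    wrapped = [special_tokens["BOS"]] + list(token_ids) + [special_tokens["EOS"]]
    if len(wrapped) > length:
        raise ValueError("sequence does not fit in the requested length")
    return wrapped + [special_tokens["PAD"]] * (length - len(wrapped))

# Two sequences of different lengths become rows with the same number of columns
batch = [wrap_and_pad(ids, 6) for ids in ([4, 5, 6], [7, 8])]
for row in batch:
    print(row)
# [2, 4, 5, 6, 3, 0]
# [2, 7, 8, 3, 0, 0]

# A mask marks which positions hold real tokens, so padding can be ignored
mask = [[token_id != special_tokens["PAD"] for token_id in row] for row in batch]
print(mask[1])  # [True, True, True, True, False, False]
```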

In [None]:
import re
from collections import defaultdict

special_tokens = {
    "PAD": 0,
    "UNK": 1,
    "BOS": 2,
    "EOS": 3
}
vocab = special_tokens.copy()

In [34]:
token_limit = 11

def clean(text):
    # Use re to replace punctuation that is not a comma, question mark, period, or exclamation mark with spaces
    text = re.sub(r'[^\w\s,?.!]',' ', text)
    text = text.strip()
    return text

def tokenize(text):
    # Split on consecutive whitespace and punctuation
    tokens = re.findall(r'\w+|[^\w\s]+|[\s]+', text)
    return tokens[:token_limit]

opus_tokens = defaultdict(int)
for index, row in opus.iterrows():
    cleaned = clean(row["en"])
    tokens = tokenize(cleaned)
    for token in tokens:
        opus_tokens[token] += 1

# Start numbering after the special tokens
counter = len(special_tokens)
for token, count in opus_tokens.items():
    if count > 1:
        # Tokens that appear more than once get their own id
        vocab[token] = counter
        counter += 1
    else:
        # Tokens seen only once share the UNK id
        vocab[token] = special_tokens["UNK"]
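
To see the frequency filter in isolation, here is a small sketch on a made-up token stream.  The toy corpus and the name `toy_vocab` are assumptions.  The threshold (`count > 1`) mirrors the cell above.

```python
from collections import Counter

# Toy token stream standing in for the tokenized corpus (hypothetical data)
corpus_tokens = ["the", " ", "cat", " ", "saw", " ", "the", " ", "octothorpe"]
counts = Counter(corpus_tokens)

special_tokens = {"PAD": 0, "UNK": 1, "BOS": 2, "EOS": 3}
toy_vocab = dict(special_tokens)
next_id = len(special_tokens)
for token, count in counts.items():
    if count > 1:
        # Seen more than once: give the token its own id
        toy_vocab[token] = next_id
        next_id += 1
    else:
        # Seen only once: share the UNK id
        toy_vocab[token] = special_tokens["UNK"]

print(toy_vocab["the"], toy_vocab[" "])             # 4 5
print(toy_vocab["cat"], toy_vocab["octothorpe"])    # 1 1
```

Only `the` and the space get their own ids here.  Everything else collapses into `UNK`, which keeps the number of embedding rows small.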

## Tokenize sentences

In [35]:
import torch

def encode(text):
    # Encode text as a tensor of token ids
    tokens = tokenize(clean(text))
    # Fall back to the UNK id for tokens that never appeared in the data
    encoded = torch.tensor([vocab.get(token, special_tokens["UNK"]) for token in tokens])
    return encoded

reverse_vocab = {v: k for k, v in vocab.items()}
for k,v in special_tokens.items():
    reverse_vocab[v] = k

def decode(encoded):
    # Decode a list of integers into text
    if isinstance(encoded, torch.Tensor):
        encoded = encoded.detach().cpu().tolist()
    decoded = "".join([reverse_vocab[token] for token in encoded])
    return decoded
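
A quick round trip shows what encoding and decoding preserve.  This sketch uses plain lists and a hand-written toy vocabulary (both assumptions), so it runs without `torch` or the dataset.

```python
import re

# Hypothetical toy vocabulary in the same layout as the lesson's `vocab`
toy_vocab = {"PAD": 0, "UNK": 1, "BOS": 2, "EOS": 3, "the": 4, " ": 5, "cat": 6, "sat": 7}
toy_reverse = {token_id: token for token, token_id in toy_vocab.items()}

def toy_encode(text):
    tokens = re.findall(r"\w+|[^\w\s]+|\s+", text)
    # Unknown tokens fall back to the UNK id
    return [toy_vocab.get(token, toy_vocab["UNK"]) for token in tokens]

def toy_decode(token_ids):
    return "".join(toy_reverse[token_id] for token_id in token_ids)

ids = toy_encode("the cat sat")
print(ids)              # [4, 5, 6, 5, 7]
print(toy_decode(ids))  # the cat sat

# "dog" is not in the vocabulary, so the round trip is lossy
print(toy_decode(toy_encode("the dog sat")))  # the UNK sat
```

Because whitespace is a token of its own, decoding is just a string join with no separator.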

## Tokenize data

In [36]:
tokenized = []
for index, row in opus.iterrows():
    # Encode the English sentences
    en_text = row["en"]
    en = encode(en_text)
    if en.shape[0] < token_limit:
        continue
    tokenized.append(en)

In [37]:
tokenized[0]

tensor([ 4,  5,  6,  5,  7,  5,  8,  5,  9,  5, 10])
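
Notice that id `5` shows up in every other position.  That's because our tokenizer keeps whitespace as a token.  The sketch below reruns the same regular expressions on the start of the first sentence.  The function bodies are copied from the cells above, with `token_limit` passed as a default argument so that the block stands alone.

```python
import re

def clean(text):
    # Replace punctuation other than , ? . ! with spaces
    return re.sub(r"[^\w\s,?.!]", " ", text).strip()

def tokenize(text, token_limit=11):
    # Words, punctuation runs, and whitespace runs each become a token
    return re.findall(r"\w+|[^\w\s]+|\s+", text)[:token_limit]

sentence = "In the society of his nephew and niece"
tokens = tokenize(clean(sentence))
print(tokens)
# ['In', ' ', 'the', ' ', 'society', ' ', 'of', ' ', 'his', ' ', 'nephew']
print(tokens.count(" "))  # 5
```

With a limit of 11 tokens and every other token being a space, each example only covers about six words.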

## Create torch dataset

In [38]:
from torch.utils.data import DataLoader, Dataset

class TextData(Dataset):
    def __init__(self, data):
        self.tokens = torch.vstack(data).long()

    def __len__(self):
        # Return how many examples are in the dataset
        return len(self.tokens)

    def __getitem__(self, idx):
        # Return a single training example
        x = self.tokens[idx][:10]
        y = self.tokens[idx][10]
        return x, y

# Initialize the dataset
train_ds = TextData(tokenized)
train = DataLoader(train_ds, batch_size=64)

In [39]:
train_ds[0]

(tensor([4, 5, 6, 5, 7, 5, 8, 5, 9, 5]), tensor(10))
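
The dataset turns each 11-token row into 10 input tokens and 1 target token.  Here is the same idea with plain lists.  The helper names `split_example` and `batches` are made up for this sketch, and the third row is invented data.

```python
rows = [
    [4, 5, 6, 5, 7, 5, 8, 5, 9, 5, 10],
    [11, 5, 12, 5, 13, 5, 14, 15, 5, 16, 17],
    [20, 5, 21, 5, 22, 5, 23, 5, 24, 5, 25],  # invented row for illustration
]

def split_example(row, context_size=10):
    # First `context_size` ids are the input, the next id is the target
    return row[:context_size], row[context_size]

def batches(examples, batch_size):
    # Yield consecutive chunks, the way an unshuffled loader would
    for start in range(0, len(examples), batch_size):
        yield examples[start:start + batch_size]

examples = [split_example(row) for row in rows]
x, y = examples[0]
print(x, y)  # [4, 5, 6, 5, 7, 5, 8, 5, 9, 5] 10

for batch in batches(examples, batch_size=2):
    print(len(batch))  # 2, then 1
```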

In [40]:
batch = next(iter(train))
batch

[tensor([[  4,   5,   6,   5,   7,   5,   8,   5,   9,   5],
         [ 11,   5,  12,   5,  13,   5,  14,  15,   5,  16],
         [ 11,   5,   9,   5,  18,   5,  14,  15,   5,  19],
         [ 20,   5,   6,   5,  21,  15,   5,  22,   5,  23],
         [ 20,   5,  24,  17,   5,  25,   5,  26,   5,  27],
         [ 28,   5,  29,   5,  30,   5,  31,  15,   5,  32],
         [ 33,   5,  27,   5,  34,   5,  35,   5,  36,  37],
         [ 33,   5,  27,   5,   1,  15,   5,  39,  15,   5],
         [ 41,   5,  42,  15,   5,  43,   5,  44,  15,   5],
         [ 45,   5,  46,   5,  47,   5,  48,   5,  49,   5],
         [ 50,   5,  51,   5,   8,   5,  52,   5,  22,   5],
         [ 41,  15,   5,  54,  15,   5,  27,   5,  55,   5],
         [ 57,   5,   1,   5,  32,   5,  12,   5,  58,   5],
         [ 24,  17,   5,  25,   5,  26,   5,  60,   5,  61],
         [ 62,   5,  63,   5,  64,   5,  65,   5,  66,   5],
         [ 68,   5,  69,   5,  70,   5,  71,   5,  72,   5],
         [ 74,   5,  72,

## Embedding layer

In [41]:
import math
from torch import nn

class Embedding(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super().__init__()

        k = 1/math.sqrt(embed_dim)
        self.weights =  torch.rand(vocab_size, embed_dim) * 2 * k - k
        self.weights[0] = 0 # Zero out the padding embedding
        self.weights = nn.Parameter(self.weights)

    def forward(self, token_ids):
        # Return a matrix of embeddings
        # We could convert token_ids to a one_hot vector and multiply by the weights, but it is the same as selecting a single row of the matrix
        return self.weights[token_ids]
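
The comment in `forward` says that multiplying a one-hot vector by the weights is the same as selecting a row.  Here is a small check of that claim in plain Python, using a made-up 4x3 weight matrix.

```python
# Hypothetical embedding weights: 4 tokens, 3 dimensions each
weights = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]

def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]

def vector_matrix_product(vector, matrix):
    # (size,) @ (size, columns) -> (columns,)
    columns = len(matrix[0])
    return [
        sum(vector[r] * matrix[r][c] for r in range(len(matrix)))
        for c in range(columns)
    ]

token_id = 2
via_one_hot = vector_matrix_product(one_hot(token_id, len(weights)), weights)
via_lookup = weights[token_id]

print(via_one_hot)  # [0.7, 0.8, 0.9]
print(via_lookup)   # [0.7, 0.8, 0.9]
```

The lookup gives the same result without doing a multiplication over the whole vocabulary, which is why indexing is the usual choice.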

In [42]:
with torch.no_grad():
    input_embed = Embedding(len(vocab), 256)
    print(input_embed(batch[0]))

tensor([[[ 6.0825e-02, -2.1130e-02, -2.7503e-02,  ..., -3.9384e-02,
           5.0046e-02, -3.2717e-02],
         [ 2.2939e-02,  4.0495e-02, -2.1113e-02,  ...,  5.3952e-02,
          -5.0945e-02, -2.4746e-02],
         [ 5.1662e-02, -2.1392e-02, -4.6467e-02,  ...,  4.5651e-02,
          -2.2346e-02, -3.8449e-02],
         ...,
         [ 2.2939e-02,  4.0495e-02, -2.1113e-02,  ...,  5.3952e-02,
          -5.0945e-02, -2.4746e-02],
         [-1.3514e-02,  3.2856e-02,  3.3172e-02,  ..., -1.1294e-02,
          -5.7684e-02,  5.7807e-02],
         [ 2.2939e-02,  4.0495e-02, -2.1113e-02,  ...,  5.3952e-02,
          -5.0945e-02, -2.4746e-02]],

        [[-1.7709e-02,  3.0644e-02,  5.6269e-02,  ...,  5.9208e-02,
           1.6609e-03,  4.6951e-02],
         [ 2.2939e-02,  4.0495e-02, -2.1113e-02,  ...,  5.3952e-02,
          -5.0945e-02, -2.4746e-02],
         [-4.4252e-02,  4.9331e-03, -4.4722e-03,  ...,  1.3196e-02,
          -3.1208e-02, -4.1165e-02],
         ...,
         [ 1.9889e-02,  3

## Predict next token

In [43]:
class TokenPredictor(nn.Module):
    def __init__(self, vocab_size, input_token_count, hidden_units):
        super().__init__()

        torch.manual_seed(0)
        self.embedding = Embedding(vocab_size, hidden_units)
        self.dense1 = nn.Linear(hidden_units, hidden_units)
        self.relu = nn.ReLU()
        self.dense2 = nn.Linear(hidden_units, hidden_units)
        self.output = nn.Linear(hidden_units * input_token_count, hidden_units)

    def forward(self, x):
        # Embed token ids from (batch_size, token_count) to (batch_size, token_count, hidden_units)
        embedded = self.embedding(x)
        # Run the network
        x = self.dense2(self.relu(self.dense1(embedded)))
        # Flatten the vectors into one large vector per sentence for the final layer
        flat = torch.flatten(x, start_dim=1)
        # Run the final layer to get an output
        network_out = self.output(flat)
        # Unembed, convert to (batch_size, vocab_size).  Argmax against last dim gives predicted token
        out_vector = network_out @ self.embedding.weights.T
        return out_vector
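
It can help to trace the tensor shapes through `forward`.  The sketch below only does the shape bookkeeping with tuples and does not run the network.  The `vocab_size` of 5000 is a placeholder, since the real value depends on the data.

```python
def trace_shapes(batch_size, input_token_count, hidden_units, vocab_size):
    # Shape after each step of TokenPredictor.forward
    return {
        "token ids": (batch_size, input_token_count),
        "embedded": (batch_size, input_token_count, hidden_units),
        # nn.Linear acts on the last dimension, so the shape is unchanged
        "after dense layers": (batch_size, input_token_count, hidden_units),
        "flattened": (batch_size, input_token_count * hidden_units),
        "after output layer": (batch_size, hidden_units),
        # Multiplying by the transposed embedding weights (hidden_units, vocab_size)
        "logits": (batch_size, vocab_size),
    }

shapes = trace_shapes(batch_size=64, input_token_count=10, hidden_units=256, vocab_size=5000)
for name, shape in shapes.items():
    print(f"{name}: {shape}")
```

The last step reuses the embedding weights to map back to the vocabulary, so the network scores each token by how well its embedding matches the output vector.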

In [44]:
from statistics import mean

# Initialize W&B
%env WANDB_SILENT=True

import wandb
wandb.login()

def train_loop(net, optimizer, epochs):
    # Initialize a new W&B run
    wandb.init(project="text",
               name="dense")

    loss_fn = nn.CrossEntropyLoss(ignore_index=0)
    train_losses = []
    for epoch in range(epochs):
        for batch, (x, y) in enumerate(train):
            # zero_grad will set all the gradients to zero
            # We need this because gradients will accumulate in the backward pass
            optimizer.zero_grad()
            # Make a prediction using the network
            pred = net(x)
            # Calculate the loss
            loss = loss_fn(pred, y)
            # Call loss.backward to run backpropagation
            loss.backward()
            # Step the optimizer to update the parameters
            optimizer.step()
            train_losses.append(loss.item())

            if batch % 10 == 0:
                # Log training metrics
                wandb.log({
                    "train_loss": mean(train_losses)
                })

    return train_losses

env: WANDB_SILENT=True
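
The loss function above is cross-entropy with `ignore_index=0`, so rows whose target is `PAD` don't contribute.  Here is a plain-Python sketch of that calculation, assuming the default mean reduction.  The logits and targets are made-up numbers.

```python
import math

def cross_entropy(logits_batch, targets, ignore_index=0):
    # Mean negative log-probability of the target token, skipping ignored targets
    losses = []
    for logits, target in zip(logits_batch, targets):
        if target == ignore_index:
            continue
        # log-sum-exp with the max subtracted for numerical stability
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(sum(math.exp(v - max_logit) for v in logits))
        losses.append(log_sum_exp - logits[target])
    # Assumes at least one target is not ignored
    return sum(losses) / len(losses)

logits_batch = [
    [0.0, 0.0, 0.0, 0.0],  # uniform guess over 4 tokens
    [5.0, 0.0, 0.0, 0.0],  # target is PAD, so this row is skipped
]
targets = [2, 0]

print(cross_entropy(logits_batch, targets))  # log(4), about 1.386
```

A uniform guess over `n` tokens gives a loss of `log(n)`, which is a useful baseline when reading the training curve.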


In [45]:
# Define our hyperparameters
epochs = 100
lr = 1e-3

# Initialize our network
net = TokenPredictor(len(vocab), 10, 256)
# Optimizer
optimizer = torch.optim.SGD(net.parameters(), lr=lr)
losses = train_loop(net, optimizer, epochs)
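
Plain SGD updates each parameter by subtracting the learning rate times its gradient.  The sketch below applies that rule to a toy one-parameter loss, `(w - 3)^2`, which is an assumption made for illustration and has nothing to do with our text model.

```python
def sgd_step(params, grads, lr):
    # Move each parameter a small step against its gradient
    return [p - lr * g for p, g in zip(params, grads)]

def grad_of_loss(params):
    # Gradient of the toy loss (w - 3)^2 with respect to w
    return [2 * (w - 3.0) for w in params]

def run(lr, steps):
    params = [0.0]
    for _ in range(steps):
        params = sgd_step(params, grad_of_loss(params), lr)
    return params[0]

w_fast = run(lr=0.1, steps=100)
w_slow = run(lr=1e-3, steps=100)
print(w_fast)  # very close to the minimum at 3
print(w_slow)  # still far from 3 after the same number of steps
```

On this toy problem, a learning rate of `1e-3` makes slow progress.  Whether the same holds for our network depends on the data and the model, so treat `lr` as something to experiment with.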

In [46]:
with torch.no_grad():
    batch = next(iter(train))
    pred = net(batch[0])
    token_id = pred.argmax(-1)

    for i in range(len(batch[0])):
        text = decode(batch[0][i])
        actual = decode(batch[1][i:(i+1)])
        pred = decode(token_id[i:(i+1)])
        print(f"{text}<ACTUAL>{actual}<><PRED>{pred}<>")

In the society of his <ACTUAL>nephew<><PRED>UNK<>
By a former marriage, Mr<ACTUAL>.<><PRED> <>
By his own marriage, likewise<ACTUAL>,<><PRED> <>
But the fortune, which had<ACTUAL> <><PRED> <>
But Mrs. John Dashwood was<ACTUAL> <><PRED> <>
Marianne s abilities were, in<ACTUAL> <><PRED> <>
She was sensible and clever  <ACTUAL>but<><PRED>UNK<>
She was UNK, amiable, <ACTUAL>interesting<><PRED>that<>
Elinor saw, with concern, <ACTUAL>the<><PRED>the<>
They encouraged each other now <ACTUAL>in<><PRED>UNK<>
The agony of grief which <ACTUAL>overpowered<><PRED>UNK<>
Elinor, too, was deeply <ACTUAL>afflicted<><PRED>the<>
A UNK in a place <ACTUAL>where<><PRED>UNK<>
Mrs. John Dashwood did not<ACTUAL> <><PRED> <>
To take three thousand pounds <ACTUAL>from<><PRED>UNK<>
How could he answer it <ACTUAL>to<><PRED>UNK<>
Perhaps it would have been <ACTUAL>as<><PRED>UNK<>
But as he required the <ACTUAL>promise<><PRED>UNK<>
Something must be done for <ACTUAL>them<><PRED>UNK<>
Well, then, UNK something <ACTUA
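
The prediction step takes the `argmax` over the vocabulary dimension, meaning it picks the token id with the highest score.  Here is a small sketch of that step, plus an exact-match accuracy.  The scores and targets are made-up numbers.

```python
def argmax(values):
    # Index of the largest value
    return max(range(len(values)), key=lambda i: values[i])

# Hypothetical scores for 3 examples over a 3-token vocabulary
logits_batch = [
    [0.1, 2.0, -1.0],
    [1.5, 0.2, 0.3],
    [0.0, 0.1, 3.0],
]
targets = [1, 2, 2]

predicted = [argmax(row) for row in logits_batch]
accuracy = sum(p == t for p, t in zip(predicted, targets)) / len(targets)

print(predicted)  # [1, 0, 2]
print(accuracy)   # 2 of 3 correct
```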

In [47]:
batch[0]

tensor([[  4,   5,   6,   5,   7,   5,   8,   5,   9,   5],
        [ 11,   5,  12,   5,  13,   5,  14,  15,   5,  16],
        [ 11,   5,   9,   5,  18,   5,  14,  15,   5,  19],
        [ 20,   5,   6,   5,  21,  15,   5,  22,   5,  23],
        [ 20,   5,  24,  17,   5,  25,   5,  26,   5,  27],
        [ 28,   5,  29,   5,  30,   5,  31,  15,   5,  32],
        [ 33,   5,  27,   5,  34,   5,  35,   5,  36,  37],
        [ 33,   5,  27,   5,   1,  15,   5,  39,  15,   5],
        [ 41,   5,  42,  15,   5,  43,   5,  44,  15,   5],
        [ 45,   5,  46,   5,  47,   5,  48,   5,  49,   5],
        [ 50,   5,  51,   5,   8,   5,  52,   5,  22,   5],
        [ 41,  15,   5,  54,  15,   5,  27,   5,  55,   5],
        [ 57,   5,   1,   5,  32,   5,  12,   5,  58,   5],
        [ 24,  17,   5,  25,   5,  26,   5,  60,   5,  61],
        [ 62,   5,  63,   5,  64,   5,  65,   5,  66,   5],
        [ 68,   5,  69,   5,  70,   5,  71,   5,  72,   5],
        [ 74,   5,  72,   5,  75,   5,  