# Chapter 2: Working with Text Data

## 2.1: Word Embeddings

Neural networks, like any other ML model, take numbers as input, so we convert each word into a high-dimensional vector that represents it.

## 2.2: Tokenizing

Tokenizing breaks the text into tokens: individual words and punctuation marks. To demonstrate, we use a short story stored in verdict.txt.

In [19]:
with open("verdict.txt", 'r', encoding="utf-8") as file:
    text = file.read()

print(text[:99])

I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no 


Now, we split it on whitespace. The capturing group in the pattern keeps the whitespace characters in the result.

In [20]:
import re

result = re.split(r'(\s)', text)
print(result)

['I', ' ', 'HAD', ' ', 'always', ' ', 'thought', ' ', 'Jack', ' ', 'Gisburn', ' ', 'rather', ' ', 'a', ' ', 'cheap', ' ', 'genius--though', ' ', 'a', ' ', 'good', ' ', 'fellow', ' ', 'enough--so', ' ', 'it', ' ', 'was', ' ', 'no', ' ', 'great', ' ', 'surprise', ' ', 'to', ' ', 'me', ' ', 'to', ' ', 'hear', ' ', 'that,', ' ', 'in', ' ', 'the', ' ', 'height', ' ', 'of', ' ', 'his', ' ', 'glory,', ' ', 'he', ' ', 'had', ' ', 'dropped', ' ', 'his', ' ', 'painting,', ' ', 'married', ' ', 'a', ' ', 'rich', ' ', 'widow,', ' ', 'and', ' ', 'established', ' ', 'himself', ' ', 'in', ' ', 'a', ' ', 'villa', ' ', 'on', ' ', 'the', ' ', 'Riviera.', ' ', '(Though', ' ', 'I', ' ', 'rather', ' ', 'thought', ' ', 'it', ' ', 'would', ' ', 'have', ' ', 'been', ' ', 'Rome', ' ', 'or', ' ', 'Florence.)', '\n', '', '\n', '"The', ' ', 'height', ' ', 'of', ' ', 'his', ' ', 'glory"--that', ' ', 'was', ' ', 'what', ' ', 'the', ' ', 'women', ' ', 'called', ' ', 'it.', ' ', 'I', ' ', 'can', ' ', 'hear', ' ', 'Mrs

To also split off commas and periods as separate tokens:


In [21]:
result = re.split(r'([,.]|\s)', text)
print(result)

['I', ' ', 'HAD', ' ', 'always', ' ', 'thought', ' ', 'Jack', ' ', 'Gisburn', ' ', 'rather', ' ', 'a', ' ', 'cheap', ' ', 'genius--though', ' ', 'a', ' ', 'good', ' ', 'fellow', ' ', 'enough--so', ' ', 'it', ' ', 'was', ' ', 'no', ' ', 'great', ' ', 'surprise', ' ', 'to', ' ', 'me', ' ', 'to', ' ', 'hear', ' ', 'that', ',', '', ' ', 'in', ' ', 'the', ' ', 'height', ' ', 'of', ' ', 'his', ' ', 'glory', ',', '', ' ', 'he', ' ', 'had', ' ', 'dropped', ' ', 'his', ' ', 'painting', ',', '', ' ', 'married', ' ', 'a', ' ', 'rich', ' ', 'widow', ',', '', ' ', 'and', ' ', 'established', ' ', 'himself', ' ', 'in', ' ', 'a', ' ', 'villa', ' ', 'on', ' ', 'the', ' ', 'Riviera', '.', '', ' ', '(Though', ' ', 'I', ' ', 'rather', ' ', 'thought', ' ', 'it', ' ', 'would', ' ', 'have', ' ', 'been', ' ', 'Rome', ' ', 'or', ' ', 'Florence', '.', ')', '\n', '', '\n', '"The', ' ', 'height', ' ', 'of', ' ', 'his', ' ', 'glory"--that', ' ', 'was', ' ', 'what', ' ', 'the', ' ', 'women', ' ', 'called', ' ', 'it

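To cover more special characters and drop the empty and whitespace-only strings (note that this pattern does not include ':' or ';', so tokens such as 'up;' keep their trailing punctuation):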

In [22]:
result = re.split(r'([,.?_!"()\']|--|\s)', text)
result = [item for item in result if item.strip()]
print(result)

['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in', 'the', 'height', 'of', 'his', 'glory', ',', 'he', 'had', 'dropped', 'his', 'painting', ',', 'married', 'a', 'rich', 'widow', ',', 'and', 'established', 'himself', 'in', 'a', 'villa', 'on', 'the', 'Riviera', '.', '(', 'Though', 'I', 'rather', 'thought', 'it', 'would', 'have', 'been', 'Rome', 'or', 'Florence', '.', ')', '"', 'The', 'height', 'of', 'his', 'glory', '"', '--', 'that', 'was', 'what', 'the', 'women', 'called', 'it', '.', 'I', 'can', 'hear', 'Mrs', '.', 'Gideon', 'Thwing', '--', 'his', 'last', 'Chicago', 'sitter', '--', 'deploring', 'his', 'unaccountable', 'abdication', '.', '"', 'Of', 'course', 'it', "'", 's', 'going', 'to', 'send', 'the', 'value', 'of', 'my', 'picture', "'", 'way', 'up;', 'but', 'I', 'don', "'", 't', 'think', 'of', 'that', ',', '
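The split-and-filter steps above can be wrapped in a small helper for reuse. This is a minimal sketch: the name tokenize, the constant SPLIT_PATTERN, and the sample sentence are illustrative assumptions, not part of the original notebook.

```python
import re

# Same pattern as the cell above: punctuation, double dashes, and whitespace
SPLIT_PATTERN = r'([,.?_!"()\']|--|\s)'

def tokenize(text):
    """Split text into word and punctuation tokens, dropping whitespace."""
    pieces = re.split(SPLIT_PATTERN, text)
    # re.split keeps the captured separators; discard empty and whitespace-only items
    return [piece for piece in pieces if piece.strip()]

sample_tokens = tokenize("Hello, world. Is this-- a test?")
print(sample_tokens)
```

Each punctuation mark and the double dash come out as their own tokens, and no whitespace tokens remain.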

## 2.3: Token IDs

We build a vocabulary: the unique tokens, sorted lexicographically, each mapped to a unique integer ID.

In [23]:
tokens = sorted(list(set(result)))
print(len(tokens))
vocab = {token:integer for integer, token in enumerate(tokens)}
print(vocab)


1159
{'!': 0, '"': 1, "'": 2, '(': 3, ')': 4, ',': 5, '--': 6, '.': 7, ':': 8, ';': 9, '?': 10, 'A': 11, 'Ah': 12, 'Among': 13, 'And': 14, 'Are': 15, 'Arrt': 16, 'As': 17, 'At': 18, 'Be': 19, 'Begin': 20, 'Burlington': 21, 'But': 22, 'By': 23, 'Carlo': 24, 'Carlo;': 25, 'Chicago': 26, 'Claude': 27, 'Come': 28, 'Croft': 29, 'Destroyed': 30, 'Devonshire': 31, 'Don': 32, 'Dubarry': 33, 'Emperors': 34, 'Florence': 35, 'For': 36, 'Gallery': 37, 'Gideon': 38, 'Gisburn': 39, 'Gisburns': 40, 'Grafton': 41, 'Greek': 42, 'Grindle': 43, 'Grindle:': 44, 'Grindles': 45, 'HAD': 46, 'Had': 47, 'Hang': 48, 'Has': 49, 'He': 50, 'Her': 51, 'Hermia': 52, 'His': 53, 'How': 54, 'I': 55, 'If': 56, 'In': 57, 'It': 58, 'Jack': 59, 'Jove': 60, 'Just': 61, 'Lord': 62, 'Made': 63, 'Miss': 64, 'Money': 65, 'Monte': 66, 'Moon-dancers': 67, 'Mr': 68, 'Mrs': 69, 'My': 70, 'Never': 71, 'No': 72, 'Now': 73, 'Nutley': 74, 'Of': 75, 'Oh': 76, 'On': 77, 'Once': 78, 'Only': 79, 'Or': 80, 'Perhaps': 81, 'Poor': 82, 'Profes
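The same construction is easier to follow on a handful of tokens. The sketch below uses a made-up token list (toy_tokens is an assumption for illustration) and then maps the sequence to IDs.

```python
toy_tokens = ["the", "cat", ",", "the", "dog", "."]

# set() removes the duplicate "the"; sorted() orders strings by Unicode code point,
# so these punctuation marks come before the letters
unique_tokens = sorted(set(toy_tokens))
toy_vocab = {token: integer for integer, token in enumerate(unique_tokens)}
print(toy_vocab)

# Map the original token sequence to its IDs
toy_ids = [toy_vocab[token] for token in toy_tokens]
print(toy_ids)
```

The code-point ordering is also why the story vocabulary above starts with punctuation and lists capitalised words before lowercase ones.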

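Next we create a tokenizer class that takes a vocabulary as input and provides encode and decode methods. Note that the pattern in encode also splits on ':' and ';', unlike the pattern used to build the vocabulary, so a vocabulary entry such as 'Carlo;' is never produced by encode, and a word that appears in the story only with ':' or ';' attached would not be found in the vocabulary.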

In [24]:
class SimpleTokenizerV1:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i:s for s,i in vocab.items()}
    
    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
                                
        preprocessed = [
            item.strip() for item in preprocessed if item.strip()
        ]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids
        
    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Remove the spaces that join() inserted before punctuation marks
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text

In [25]:
tokenizer = SimpleTokenizerV1(vocab)

test_data = """"It's the last he painted, you know," 
           Mrs. Gisburn said with pardonable pride."""
ids = tokenizer.encode(test_data)
print(ids)

[1, 58, 2, 872, 1013, 615, 541, 763, 5, 1155, 608, 5, 1, 69, 7, 39, 873, 1136, 773, 812, 7]
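A round trip through encode and decode, and the failure mode for an unseen word, can be sketched on a toy vocabulary. The standalone functions and toy_vocab below are illustrative stand-ins for SimpleTokenizerV1, which needs the full story vocabulary.

```python
import re

SPLIT_PATTERN = r'([,.:;?_!"()\']|--|\s)'

def encode(text, str_to_int):
    pieces = [p.strip() for p in re.split(SPLIT_PATTERN, text) if p.strip()]
    return [str_to_int[p] for p in pieces]  # raises KeyError for unseen tokens

def decode(ids, int_to_str):
    text = " ".join(int_to_str[i] for i in ids)
    # Remove the space that join() put before punctuation marks
    return re.sub(r'\s+([,.?!"()\'])', r'\1', text)

toy_vocab = {tok: i for i, tok in enumerate(sorted({"the", "painter", "said", ",", "."}))}
inverse_vocab = {i: tok for tok, i in toy_vocab.items()}

sentence = "the painter said, the painter."
ids = encode(sentence, toy_vocab)
print(ids)
print(decode(ids, inverse_vocab))

# A word outside the vocabulary cannot be encoded
try:
    encode("the widow said.", toy_vocab)
    unknown_word_failed = False
except KeyError as err:
    unknown_word_failed = True
    print("Unknown token:", err)
```

The KeyError on an unseen word is the limitation that a fallback token for unknown words is meant to address.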


## 2.4: Adding Special Context Tokens

Special tokens give the model extra context. Here we add two: <|unk|>, which stands in for any token missing from the vocabulary, and <|endoftext|>, which marks the boundary between unrelated pieces of text.

In [26]:
tokens.extend(["<|endoftext|>","<|unk|>"])
vocab = {token:integer for integer, token in enumerate(tokens)}

class SimpleTokenizerV2:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i:s for s,i in vocab.items()}
    
    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
                                
        preprocessed = [
            item.strip() for item in preprocessed if item.strip()
        ]
        ids = [self.str_to_int[s] if s in self.str_to_int else self.str_to_int["<|unk|>"] for s in preprocessed]
        return ids
        
    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Remove the spaces that join() inserted before punctuation marks
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text

In [27]:
text1 = "Hello, do you like tea?"
text2 = "In the sunlit terraces of the palace."
test_data = " <|endoftext|> ".join((text1, text2))
print(test_data)

tokenizer = SimpleTokenizerV2(vocab)
print(tokenizer.encode(test_data))

Hello, do you like tea? <|endoftext|> In the sunlit terraces of the palace.
[1160, 5, 362, 1155, 642, 1000, 10, 1159, 57, 1013, 981, 1009, 738, 1013, 1160, 7]
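In the output above, ID 1160 is <|unk|>; it stands in for both "Hello" and "palace". Decoding makes the cost of this fallback visible: the original word cannot be recovered. The sketch below reproduces the effect on a toy vocabulary (toy_vocab and the standalone functions are assumptions for illustration, mirroring SimpleTokenizerV2).

```python
import re

SPLIT_PATTERN = r'([,.:;?_!"()\']|--|\s)'

known = sorted({"do", "you", "like", "tea", ",", "?"})
toy_vocab = {tok: i for i, tok in enumerate(known + ["<|endoftext|>", "<|unk|>"])}
inverse_vocab = {i: tok for tok, i in toy_vocab.items()}
unk_id = toy_vocab["<|unk|>"]

def encode(text):
    pieces = [p.strip() for p in re.split(SPLIT_PATTERN, text) if p.strip()]
    # dict.get returns the <|unk|> ID for any token missing from the vocabulary
    return [toy_vocab.get(p, unk_id) for p in pieces]

def decode(ids):
    text = " ".join(inverse_vocab[i] for i in ids)
    return re.sub(r'\s+([,.?!"()\'])', r'\1', text)

original = "Hello, do you like tea?"
round_trip = decode(encode(original))
print(round_trip)
```

Because every unknown word collapses to the same ID, the round trip is lossy.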


## 2.5: Byte Pair Encoding

Byte pair encoding (BPE) is a more sophisticated method of tokenising, used in GPT-2 and GPT-3. Here we use the implementation provided by the tiktoken library.

In [28]:
import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")
test_data_ids = tokenizer.encode(test_data, allowed_special={"<|endoftext|>"})
print(test_data_ids)
print(tokenizer.decode(test_data_ids))

[15496, 11, 466, 345, 588, 8887, 30, 220, 50256, 554, 262, 4252, 18250, 8812, 2114, 286, 262, 20562, 13]
Hello, do you like tea? <|endoftext|> In the sunlit terraces of the palace.


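BPE handles unknown words by breaking them into smaller subword units or individual characters, so no <|unk|> token is needed: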

In [29]:
x = tokenizer.encode("Hakdjfha sdjfkbsd")
for i in x:
    print(f"{i}: ", tokenizer.decode([i]))

39:  H
461:  ak
28241:  dj
69:  f
3099:  ha
264:   s
28241:  dj
69:  f
74:  k
1443:  bs
67:  d
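The subword pieces above come from merges learned during tokenizer training. The core idea of BPE training, repeatedly merging the most frequent adjacent pair of symbols, can be sketched in a few lines. This is a toy, character-level illustration with hypothetical helper names; it is not how tiktoken is implemented, and the GPT-2 tokenizer works on bytes with a fixed, pre-learned merge table.

```python
from collections import Counter

def most_frequent_pair(symbols):
    """Return the adjacent pair of symbols that occurs most often."""
    pair_counts = Counter(zip(symbols, symbols[1:]))
    return pair_counts.most_common(1)[0][0]

def merge_pair(symbols, pair):
    """Replace every non-overlapping occurrence of pair with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

symbols = list("abababc")  # start from single characters
for _ in range(2):         # perform two merge steps
    pair = most_frequent_pair(symbols)
    symbols = merge_pair(symbols, pair)
    print(pair, "->", symbols)
```

After the first step the frequent pair "ab" becomes a single symbol; after the second, "abab" does. Rare sequences stay split into smaller pieces, which matches the behaviour seen for the made-up words above.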


## 2.6: Data Sampling with a sliding window

Here we generate the input-target pairs needed to train the model to predict the next token, using a sliding window over the token IDs.

max_length = the number of tokens in each input sample and in each target sample

stride = the number of positions the window moves between consecutive samples

For a window starting at token index i, the input sample is token_ids[i:i+max_length] and the target sample is token_ids[i+1:i+max_length+1], i.e. the input shifted by one token.
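The windowing can be tried on a plain list of integers before involving a tokenizer or tensors. In this sketch the function name sliding_windows and the stand-in IDs 0 to 9 are assumptions for illustration.

```python
def sliding_windows(token_ids, max_length, stride):
    """Return (input, target) pairs; each target is its input shifted by one token."""
    pairs = []
    # Stop early enough that the target slice still has max_length tokens
    for i in range(0, len(token_ids) - max_length, stride):
        input_chunk = token_ids[i:i + max_length]
        target_chunk = token_ids[i + 1:i + max_length + 1]
        pairs.append((input_chunk, target_chunk))
    return pairs

token_ids = list(range(10))  # stand-in for real token IDs

non_overlapping = sliding_windows(token_ids, max_length=4, stride=4)
overlapping = sliding_windows(token_ids, max_length=4, stride=1)

for input_chunk, target_chunk in non_overlapping:
    print(input_chunk, "->", target_chunk)
print(len(non_overlapping), len(overlapping))
```

With stride equal to max_length the inputs do not overlap; a stride of 1 yields many more, heavily overlapping samples from the same text.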

Preparing text:

In [30]:
with open("verdict.txt", 'r', encoding='utf-8') as file:
    raw_text = file.read()

Importing the necessary modules:

In [31]:
import torch
from torch.utils.data import Dataset, DataLoader

Creating our Dataset Class: 

In [35]:
class GPTDatasetV1(Dataset):

    def __init__(self, text, tokenizer, max_length, stride):
        self.input_samples = []
        self.output_samples = []

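        # Tokenise the whole text once
        token_ids = tokenizer.encode(text, allowed_special={"<|endoftext|>"})

        # Slide a window of max_length tokens over the IDs, moving by stride each step;
        # each target is its input shifted by one token
        for i in range(0, len(token_ids) - max_length, stride):
            self.input_samples.append(torch.tensor(token_ids[i:i+max_length]))
            self.output_samples.append(torch.tensor(token_ids[i+1:i+max_length+1]))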
    
    def __len__(self):
        return len(self.input_samples)

    def __getitem__(self, idx):
        return self.input_samples[idx], self.output_samples[idx]

Method for creating a dataloader from it:


In [33]:
def create_dataloader_v1(txt, batch_size=4, max_length=256, stride=128,
                         shuffle=True, drop_last=True, num_workers=0):

    tokenizer = tiktoken.get_encoding("gpt2")

    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)

    # Create dataloader
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=num_workers
    )

    return dataloader

Testing:

In [36]:
dataloader = create_dataloader_v1(raw_text, batch_size=8, max_length=4, stride=4, shuffle=False)

data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Inputs:\n", inputs)
print("\nTargets:\n", targets)

Inputs:
 tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]])

Targets:
 tensor([[  367,  2885,  1464,  1807],
        [ 3619,   402,   271, 10899],
        [ 2138,   257,  7026, 15632],
        [  438,  2016,   257,   922],
        [ 5891,  1576,   438,   568],
        [  340,   373,   645,  1049],
        [ 5975,   284,   502,   284],
        [ 3285,   326,    11,   287]])
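The printed tensors can be checked against the sliding-window definition. The short sketch below copies the first rows from the output above into plain lists (the variable names are illustrative) and verifies the two properties we expect.

```python
# First two input rows and first target row, copied from the output above
inputs_row0 = [40, 367, 2885, 1464]
inputs_row1 = [1807, 3619, 402, 271]
targets_row0 = [367, 2885, 1464, 1807]

# The target is the input shifted by one token...
is_shifted = targets_row0[:-1] == inputs_row0[1:]
# ...and with stride == max_length, its last token opens the next input row
continues_into_next = targets_row0[-1] == inputs_row1[0]

print(is_shifted, continues_into_next)
```

Both checks hold, so consecutive rows tile the text without overlap and without gaps.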


## 2.7: Initialising Token Embeddings

Illustrative Example:

Suppose our vocab has 6 tokens and we want the embedding to be three dimensional

In [40]:
vocab_size = 6
output_dim = 3

torch.manual_seed(123)
embedding_layer = torch.nn.Embedding(vocab_size, output_dim)
print(embedding_layer.weight)

Parameter containing:
tensor([[ 0.3374, -0.1778, -0.1690],
        [ 0.9178,  1.5810,  1.3010],
        [ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-1.1589,  0.3255, -0.6315],
        [-2.8400, -0.7849, -1.4096]], requires_grad=True)


This is a 6x3 matrix in which each row is the embedding of the token with the corresponding ID. The values are randomly initialised for now; since requires_grad=True, they can be updated during training.

In [41]:
print(embedding_layer(torch.tensor([3])))

tensor([[-0.4015,  0.9666, -1.1481]], grad_fn=<EmbeddingBackward0>)


Essentially, the embedding layer is a lookup table. As seen here, the row at index 3 of the weight matrix (the fourth row, since indexing starts at 0) is returned.
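A row lookup gives the same result as multiplying a one-hot vector by the weight matrix; the lookup just skips the multiplications by zero. The sketch below checks this with plain Python lists standing in for tensors, using the weights printed above (the variable names are illustrative).

```python
# Weight matrix copied from the printed embedding_layer.weight above
weights = [
    [ 0.3374, -0.1778, -0.1690],
    [ 0.9178,  1.5810,  1.3010],
    [ 1.2753, -0.2010, -0.1606],
    [-0.4015,  0.9666, -1.1481],
    [-1.1589,  0.3255, -0.6315],
    [-2.8400, -0.7849, -1.4096],
]
token_id = 3

# Lookup: simply pick the row at index token_id
lookup = weights[token_id]

# One-hot route: a vector with a single 1 at token_id, multiplied by the matrix
one_hot = [1 if row == token_id else 0 for row in range(len(weights))]
product = [
    sum(one_hot[row] * weights[row][col] for row in range(len(weights)))
    for col in range(len(weights[0]))
]

print(lookup)
print(product)
```

Both routes select the same row, which matches the tensor returned by the embedding layer for ID 3.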

## 2.8: Encoding Word Positions

Transformers are position-agnostic because they process all tokens in parallel, with no built-in notion of order. We therefore need a way to encode positional information into the input embeddings by operating on the token embeddings. There are two approaches: absolute and relative. In the absolute approach, each position in the sequence has its own embedding, of the same dimension as the token embeddings, which is added to the embedding of the token at that position. In the relative approach, positional embeddings are determined by a token's distance from other tokens.

In [46]:
# More realistic sizes; vocab_size matches the GPT-2 BPE tokeniser used above
vocab_size = 50257
output_dim = 256

token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

# Loading Dataset
max_length = 4
dataloader = create_dataloader_v1(
    raw_text, batch_size=8, max_length=max_length,
    stride=max_length, shuffle=False
)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)

# Printing the token IDs of the first batch and the shape of their embeddings
print("Token IDs:\n", inputs)
print("\nInputs shape:\n", inputs.shape)

token_embeddings = token_embedding_layer(inputs)
print(token_embeddings.shape)

Token IDs:
 tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]])

Inputs shape:
 torch.Size([8, 4])
torch.Size([8, 4, 256])


Now we shall create the positional embedding layer

In [None]:
context_length = max_length
# One embedding per position: each sample has exactly 4 tokens, hence 4 positions
pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)

pos_embeddings = pos_embedding_layer(torch.arange(max_length))
print(pos_embeddings.shape)

torch.Size([4, 256])


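Finally, we add the positional embeddings to the token embeddings to produce the input embeddings, which are fed to the transformer blocks of the model. The positional embeddings have shape [4, 256] and the token embeddings [8, 4, 256], so the addition broadcasts the same positional vectors across every sample in the batch and the result keeps the shape [8, 4, 256].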
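In PyTorch this step is the single expression token_embeddings + pos_embeddings, which broadcasts over the batch dimension. The sketch below spells out what that broadcast does, using small nested lists as stand-ins for the tensors; the sizes (2 samples, 3 positions, 2 dimensions) and the values are made up for illustration.

```python
# Tiny stand-ins for the real tensors: 2 samples, 3 positions, 2 dimensions
token_embeddings = [
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],     # sample 0
    [[7.0, 8.0], [9.0, 10.0], [11.0, 12.0]],  # sample 1
]
pos_embeddings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # one vector per position

# Every sample receives the same positional vector at a given position
input_embeddings = [
    [
        [tok + pos for tok, pos in zip(token_vec, pos_vec)]
        for token_vec, pos_vec in zip(sample, pos_embeddings)
    ]
    for sample in token_embeddings
]

shape = (len(input_embeddings), len(input_embeddings[0]), len(input_embeddings[0][0]))
print(shape)
print(input_embeddings[0])
```

The result has the same shape as the token embeddings, and the token at position p in any sample is offset by the same vector pos_embeddings[p].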