## Dataset

Download the dataset `the-verdict.txt`

In [1]:
from download_the_verdict import download_the_verdict

file_path = download_the_verdict()


Downloaded the verdict to the-verdict.txt


and have a look at the first 100 characters:

In [2]:
with open(file_path, "r") as f:
    raw_text = f.read()

print("Total number of characters in the dataset:", len(raw_text))
print(raw_text[:100])

Total number of characters in the dataset: 20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no g


## Tokenization

We need to split the text into separate tokens (i.e. words and punctuation marks). Here's an example of how we can do this using the `re` module:

In [3]:
import re

text = "Hello, world. Is this-- a test?"
result = re.split(r'([,.:;?_!"()\']|--|\s)', text)
# strip surrounding whitespace and drop items that are empty afterwards
result = [item.strip() for item in result if item.strip()]
print(result)

['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']


Let's do it on the full text:

In [4]:
preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]
print(len(preprocessed))
print(preprocessed[:30])

4690
['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in']


Now we need to convert the tokens into a numerical representation, which we call token IDs.

In order to do that we first define a vocabulary of all unique tokens in the dataset.

In [5]:
all_words = sorted(set(preprocessed))
vocab_size = len(all_words)
print(vocab_size)

1130


There are 1130 unique tokens in the dataset. Let's create a mapping from tokens to token IDs.

In [6]:
vocab = {token: idx for idx, token in enumerate(all_words)}

for i, item in enumerate(vocab.items()):
    print(item)
    if i >= 50:
        break

('!', 0)
('"', 1)
("'", 2)
('(', 3)
(')', 4)
(',', 5)
('--', 6)
('.', 7)
(':', 8)
(';', 9)
('?', 10)
('A', 11)
('Ah', 12)
('Among', 13)
('And', 14)
('Are', 15)
('Arrt', 16)
('As', 17)
('At', 18)
('Be', 19)
('Begin', 20)
('Burlington', 21)
('But', 22)
('By', 23)
('Carlo', 24)
('Chicago', 25)
('Claude', 26)
('Come', 27)
('Croft', 28)
('Destroyed', 29)
('Devonshire', 30)
('Don', 31)
('Dubarry', 32)
('Emperors', 33)
('Florence', 34)
('For', 35)
('Gallery', 36)
('Gideon', 37)
('Gisburn', 38)
('Gisburns', 39)
('Grafton', 40)
('Greek', 41)
('Grindle', 42)
('Grindles', 43)
('HAD', 44)
('Had', 45)
('Hang', 46)
('Has', 47)
('He', 48)
('Her', 49)
('Hermia', 50)


A simple text tokenizer is implemented in `simple_text_tokenizer.py` as `SimpleTokenizerV1`. Let's try it out:

In [7]:
from simple_text_tokenizer import SimpleTokenizerV1

# let's encode a sentence
tokenizer = SimpleTokenizerV1(vocab)
text = """"It's the last he painted, you know," Mrs.Gisburn said with pardonable pride."""
ids = tokenizer.encode(text)
print(ids)

# let's decode it again
decoded = tokenizer.decode(ids)
print(decoded)





[1, 56, 2, 850, 988, 602, 533, 746, 5, 1126, 596, 5, 1, 67, 7, 38, 851, 1108, 754, 793, 7]
" It' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.

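The source of `SimpleTokenizerV1` is not shown here. As a rough sketch only, a tokenizer with this behaviour could look like the class below. The class name `SketchTokenizerV1` and the toy vocabulary are made up for illustration, and the real implementation in `simple_text_tokenizer.py` may differ.

```python
import re


class SketchTokenizerV1:
    """Hypothetical stand-in for SimpleTokenizerV1 (assumption, not the original code)."""

    def __init__(self, vocab):
        self.str_to_int = vocab                              # token -> token ID
        self.int_to_str = {i: s for s, i in vocab.items()}   # token ID -> token

    def encode(self, text):
        # Same splitting rule as in the cells above.
        parts = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        tokens = [p.strip() for p in parts if p.strip()]
        # Raises KeyError for tokens that are not in the vocabulary.
        return [self.str_to_int[t] for t in tokens]

    def decode(self, ids):
        text = " ".join(self.int_to_str[i] for i in ids)
        # Remove the space that the join placed before punctuation marks.
        return re.sub(r'\s+([,.?!"()\'])', r'\1', text)


# Toy vocabulary, for illustration only.
toy_vocab = {",": 0, ".": 1, "Hello": 2, "world": 3}
toy_tokenizer = SketchTokenizerV1(toy_vocab)
toy_ids = toy_tokenizer.encode("Hello, world.")
print(toy_ids)
print(toy_tokenizer.decode(toy_ids))
```

Because `decode` only removes whitespace before punctuation, an apostrophe inside a word comes back as `It' s`, which matches the decoded output above.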

Let's extend the vocabulary built from the entire dataset with two special tokens: one that marks the end of a text and one that stands in for unknown tokens.

In [8]:
all_tokens = sorted(list(set(preprocessed)))
all_tokens.extend(["<|endoftext|>", "<|unk|>"])
vocab = {token: idx for idx, token in enumerate(all_tokens)}
print(len(vocab))

# print the last 5 entries
for item in list(vocab.items())[-5:]:
    print(item)

1132
('younger', 1127)
('your', 1128)
('yourself', 1129)
('<|endoftext|>', 1130)
('<|unk|>', 1131)


An updated version of the tokenizer is implemented in `simple_text_tokenizer.py` as `SimpleTokenizerV2`. Let's try it out:

In [9]:
from simple_text_tokenizer import SimpleTokenizerV2

text1 = "Hello, do you like tea?"
text2 = "In the sunlit terraces of the palace."
text = " <|endoftext|> ".join([text1, text2])
print(text)

tokenizer = SimpleTokenizerV2(vocab)
ids = tokenizer.encode(text)
print(ids)

decoded = tokenizer.decode(ids)
print(decoded)





Hello, do you like tea? <|endoftext|> In the sunlit terraces of the palace.
[1131, 5, 355, 1126, 628, 975, 10, 1130, 55, 988, 956, 984, 722, 988, 1131, 7]
<|unk|>, do you like tea? <|endoftext|> In the sunlit terraces of the <|unk|>.

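The output shows that words missing from the vocabulary ("Hello", "palace") are mapped to the ID of `<|unk|>`. A minimal sketch of how such a fallback might be written is shown below. The function name `encode_with_unk` and the toy vocabulary are assumptions for illustration, not the code of `SimpleTokenizerV2`.

```python
import re


def encode_with_unk(text, vocab, unk_token="<|unk|>"):
    """Hypothetical helper: encode text, mapping unknown tokens to unk_token's ID."""
    parts = re.split(r'([,.:;?_!"()\']|--|\s)', text)
    tokens = [p.strip() for p in parts if p.strip()]
    unk_id = vocab[unk_token]
    # dict.get falls back to the unknown-token ID when a token is missing.
    return [vocab.get(t, unk_id) for t in tokens]


# Toy vocabulary, for illustration only.
toy_vocab = {"Hello": 0, ",": 1, "<|endoftext|>": 2, "<|unk|>": 3}
unk_ids = encode_with_unk("Hello, palace <|endoftext|>", toy_vocab)
print(unk_ids)
```

The special token `<|endoftext|>` survives the split intact because none of its characters appear in the splitting pattern.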

## Byte-Pair Encoding

We use a library called `tiktoken` to encode the text into tokens. The book uses v0.7.0, so we install that version and check it afterwards.

In [11]:
!pip install tiktoken==0.7.0

from importlib.metadata import version
import tiktoken
print("tiktoken version:", version("tiktoken"))

Collecting tiktoken==0.7.0
  Downloading tiktoken-0.7.0-cp310-cp310-macosx_10_9_x86_64.whl.metadata (6.6 kB)
Downloading tiktoken-0.7.0-cp310-cp310-macosx_10_9_x86_64.whl (961 kB)
   961.5/961.5 kB 25.5 MB/s eta 0:00:00
Installing collected packages: tiktoken
  Attempting uninstall: tiktoken
    Found existing installation: tiktoken 0.8.0
    Uninstalling tiktoken-0.8.0:
      Successfully uninstalled tiktoken-0.8.0
Successfully installed tiktoken-0.7.0

[notice] A new release of pip is available: 24.3.1 -> 25.0
[notice] To update, run: pip install --upgrade pip
tiktoken version: 0.7.0


Let's instantiate the tokenizer and encode a piece of text. Notice that even an unknown word is encoded without falling back to an unknown-token placeholder: the tokenizer splits the word into smaller chunks that are in its vocabulary.

In [14]:
tokenizer = tiktoken.get_encoding("gpt2")

text = (
    "Hello, do yo like tea? <|endoftext|> In the sunlit terraces of someunknownPlace."
)

tokens = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
print(tokens)

decoded = tokenizer.decode(tokens)
print(decoded)

[15496, 11, 466, 27406, 588, 8887, 30, 220, 50256, 554, 262, 4252, 18250, 8812, 2114, 286, 617, 34680, 27271, 13]
Hello, do yo like tea? <|endoftext|> In the sunlit terraces of someunknownPlace.
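`tiktoken` applies the merge rules that were learned for GPT-2 and operates on bytes, so it is not reproduced here. The toy loop below only illustrates the core idea behind byte-pair encoding: repeatedly merge the most frequent pair of adjacent symbols. The helper names and the example string are made up for illustration and are not part of `tiktoken`.

```python
from collections import Counter


def most_frequent_pair(symbols):
    """Return the most frequent pair of adjacent symbols."""
    pairs = Counter(zip(symbols, symbols[1:]))
    return pairs.most_common(1)[0][0]


def merge_pair(symbols, pair):
    """Replace every non-overlapping occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


original = "aaabdaaabac"
symbols = list(original)      # start from single characters
for _ in range(3):            # perform three merge steps
    pair = most_frequent_pair(symbols)
    symbols = merge_pair(symbols, pair)
    print(pair, symbols)
```

After a few merges, frequent character sequences become single symbols while rare ones stay split into smaller pieces. This is why a word the tokenizer has never seen can still be represented as a sequence of known chunks.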


## Data Sampling using sliding windows

In order to train the model we need to sample data in the form of sequences of tokens. We can use a sliding window to sample these sequences.

To get more interesting text snippets we will remove the first 50 tokens.

In [16]:
enc_text = tokenizer.encode(raw_text)
enc_text = enc_text[50:]

print("Number of tokens in the dataset:", len(enc_text))



Number of tokens in the dataset: 5095


In [20]:
# The context size determines how many tokens the model sees as input.
context_size = 4
x = enc_text[:context_size]
y = enc_text[1:context_size+1]

print("x:", x)
print("y:     ", y)

x: [290, 4920, 2241, 287]
y:      [4920, 2241, 287, 257]


The aim is for the model to predict the next token given the context.

In [21]:
for i in range(1, context_size+1):
    context = enc_text[:i]
    desired = enc_text[i]
    print(f"Context: {context}, Desired: {desired}")






Context: [290], Desired: 4920
Context: [290, 4920], Desired: 2241
Context: [290, 4920, 2241], Desired: 287
Context: [290, 4920, 2241, 287], Desired: 257
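The same idea can be extended to slide over a whole token sequence. The sketch below uses plain Python lists and dummy token IDs. The function name `sliding_windows` and its parameters are assumptions chosen for illustration.

```python
def sliding_windows(token_ids, max_length, stride):
    """Build (input, target) pairs where the target is the input shifted by one token."""
    inputs, targets = [], []
    # Stop early enough that every target window is complete.
    for start in range(0, len(token_ids) - max_length, stride):
        inputs.append(token_ids[start:start + max_length])
        targets.append(token_ids[start + 1:start + max_length + 1])
    return inputs, targets


dummy_ids = list(range(10))   # dummy token IDs 0..9

# stride == max_length: windows do not overlap
win_inputs, win_targets = sliding_windows(dummy_ids, max_length=4, stride=4)
print(win_inputs)
print(win_targets)

# stride == 1: consecutive windows overlap in all but one token
overlap_inputs, _ = sliding_windows(dummy_ids, max_length=4, stride=1)
print(len(overlap_inputs))
```

A smaller stride yields more, heavily overlapping training samples, while a stride equal to the window length uses each token as an input once.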


In order to iterate over the dataset efficiently, we first create a PyTorch `Dataset` and a `DataLoader`. Both are implemented in `data_utils.py`.

The cell below encodes the dataset with tiktoken and creates the dataset and the dataloader. We print the first batch, using a batch size of 1. The batch consists of an input tensor and a target tensor, each with the dimensions `[<batch_size>, <context_size>]`.

In [22]:
from data_utils import created_dataloader_v1

dataloader = created_dataloader_v1(raw_text, batch_size=1, max_length=4, stride=1, shuffle=False)
data_iter = iter(dataloader)
first_batch = next(data_iter)
print(first_batch)

[tensor([[  40,  367, 2885, 1464]]), tensor([[ 367, 2885, 1464, 1807]])]
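The contents of `data_utils.py` are not shown here. As a hedged sketch, a sliding-window dataset for PyTorch could be structured as below. The class name `SlidingWindowDataset` is hypothetical, and it takes already-encoded token IDs instead of raw text so that the example runs without a tokenizer.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class SlidingWindowDataset(Dataset):
    """Hypothetical dataset: each item is an (input, target) pair of token ID tensors."""

    def __init__(self, token_ids, max_length, stride):
        self.inputs, self.targets = [], []
        for start in range(0, len(token_ids) - max_length, stride):
            window = token_ids[start:start + max_length]
            shifted = token_ids[start + 1:start + max_length + 1]
            self.inputs.append(torch.tensor(window))
            self.targets.append(torch.tensor(shifted))

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        return self.inputs[idx], self.targets[idx]


# Dummy token IDs 0..9 instead of a tokenized text.
sketch_dataset = SlidingWindowDataset(list(range(10)), max_length=4, stride=4)
sketch_loader = DataLoader(sketch_dataset, batch_size=2, shuffle=False)
batch_inputs, batch_targets = next(iter(sketch_loader))
print(batch_inputs)
print(batch_targets)
```

The `DataLoader` stacks the individual samples, so both tensors in a batch have the shape `[batch_size, max_length]`.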


## Token Embeddings

We now want to create embeddings for each token. We do this by instantiating an embedding layer of the correct size.

In order to keep it reproducible, we first set the random seed.

In [23]:
import torch

vocab_size = 6
output_dim = 3

torch.manual_seed(123)
embedding_layer = torch.nn.Embedding(num_embeddings=vocab_size, embedding_dim=output_dim)
print(embedding_layer.weight)

Parameter containing:
tensor([[ 0.3374, -0.1778, -0.1690],
        [ 0.9178,  1.5810,  1.3010],
        [ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-1.1589,  0.3255, -0.6315],
        [-2.8400, -0.7849, -1.4096]], requires_grad=True)
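An embedding layer acts as a lookup table: passing token IDs to it returns the corresponding rows of the weight matrix. The short check below illustrates this. The token IDs in `example_ids` are arbitrary values chosen for the example.

```python
import torch

torch.manual_seed(123)
lookup_layer = torch.nn.Embedding(num_embeddings=6, embedding_dim=3)

# Arbitrary example token IDs; each must be smaller than num_embeddings.
example_ids = torch.tensor([2, 3, 5, 1])
looked_up = lookup_layer(example_ids)
print(looked_up.shape)

# The layer's output equals the selected rows of its weight matrix.
print(torch.equal(looked_up, lookup_layer.weight[example_ids]))
```

During training, the rows of this weight matrix are updated like any other model parameter.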


## Positional Encodings

We also need to add positional information to the token embeddings. This is necessary because the self-attention layer has no notion of the position of the tokens.

There are two types of positional encodings:

- relative positional encodings
- absolute positional encodings

We will use absolute positional encodings in this example, since this is what GPT-2 uses.

We also switch to a more realistic embedding dimension and set the vocabulary size to that of the BPE tokenizer.

In [25]:
vocab_size = 50257
output_dim = 256
token_embedding_layer = torch.nn.Embedding(num_embeddings=vocab_size, embedding_dim=output_dim)

# Instantiate the dataloader
max_length = 4
dataloader = created_dataloader_v1(raw_text, batch_size=8, max_length=max_length, stride=max_length, shuffle=False)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)

print("Token IDs: \n", inputs)
print("\nInputs shape: \n", inputs.shape)

# Now create the embeddings
token_embeddings = token_embedding_layer(inputs)
print("\nToken embeddings shape: \n", token_embeddings.shape)

Token IDs: 
 tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]])

Inputs shape: 
 torch.Size([8, 4])

Token embeddings shape: 
 torch.Size([8, 4, 256])


We create the positional encodings with a second embedding layer that has one row per position in the context:

In [26]:
context_length = max_length
positional_encoding_layer = torch.nn.Embedding(num_embeddings=context_length, embedding_dim=output_dim)

positional_encodings = positional_encoding_layer(torch.arange(context_length))
print("Positional encodings shape: \n", positional_encodings.shape)



Positional encodings shape: 
 torch.Size([4, 256])


and now add them to the token embeddings to obtain the input embeddings:

In [27]:
input_embeddings = token_embeddings + positional_encodings
print("Input embeddings shape: \n", input_embeddings.shape)

Input embeddings shape: 
 torch.Size([8, 4, 256])
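The addition works although the two tensors have different shapes, because PyTorch broadcasts the `[4, 256]` positional encodings across the batch dimension. The small sketch below shows the effect with made-up dimensions and zero-valued token embeddings, so that the broadcast values are easy to see.

```python
import torch

# Made-up small shapes: batch of 2, context length 4, embedding dimension 3.
demo_token_embeddings = torch.zeros(2, 4, 3)
demo_positions = torch.arange(12, dtype=torch.float32).reshape(4, 3)

# Shape [2, 4, 3] + shape [4, 3]: the second operand is repeated for each batch item.
demo_combined = demo_token_embeddings + demo_positions
print(demo_combined.shape)

# Every sequence in the batch receives the same positional vectors.
print(torch.equal(demo_combined[0], demo_combined[1]))
```

Each position in the context therefore gets the same positional vector, regardless of which sequence in the batch it belongs to.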
