### Load data

Download the Tiny Shakespeare dataset.

In [1]:
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

--2024-01-12 20:16:33--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt’


2024-01-12 20:16:33 (10.8 MB/s) - ‘input.txt’ saved [1115394/1115394]
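Outside a notebook, where the `!wget` shell escape is not available, the same download can be sketched with the standard library. The helper name `download_if_missing` and the constant `DATA_URL` are hypothetical and not part of the original notebook; the helper skips the request when the file already exists.

```python
import os
import urllib.request

# Same URL as the wget command above.
DATA_URL = (
    "https://raw.githubusercontent.com/karpathy/char-rnn/"
    "master/data/tinyshakespeare/input.txt"
)


def download_if_missing(url: str, path: str) -> str:
    """Download `url` to `path` unless a file already exists there (hypothetical helper)."""
    if not os.path.exists(path):
        urllib.request.urlretrieve(url, path)
    return path


# Usage (performs a network request the first time it runs):
# download_if_missing(DATA_URL, "input.txt")
```

Skipping the download when the file is present makes it safe to re-run the cell.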



### Read text

Take a look at what's inside.

In [6]:
with open('input.txt', encoding='utf-8') as f:
    text = f.read()
print('Length of input.txt (characters):', len(text))
print('First 500 characters:', text[:500])

Length of input.txt (characters): 1115394
First 500 characters: First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor


In [8]:
# What is the size of input.txt on disk?
import os
file_size = os.path.getsize('input.txt')
print('File size of input.txt (MB):', file_size/1e6)

File size of input.txt (MB): 1.115394
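The character count and the byte count agree (1115394 in both cases), which is what one would expect if the file is pure ASCII: UTF-8 stores each ASCII character in one byte, while other characters take two or more. A small sketch of the difference, using illustrative strings that are not taken from the dataset:

```python
ascii_text = "First Citizen:\nSpeak, speak."
non_ascii_text = "na\u00efve caf\u00e9"  # contains two non-ASCII characters

for sample in (ascii_text, non_ascii_text):
    n_chars = len(sample)                  # number of Unicode code points
    n_bytes = len(sample.encode("utf-8"))  # number of bytes once encoded as UTF-8
    print(repr(sample), "chars:", n_chars, "bytes:", n_bytes)
```

For text with non-ASCII characters, `len(text)` and `os.path.getsize` would report different numbers.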


### Tokenization

Create a simple character-level tokenizer. Build the vocabulary from the sorted set of unique characters in the text; each character is one token.

In [12]:
vocab = sorted(set(text))  # all unique characters, in a stable sorted order
vocab_size = len(vocab)  # number of unique characters
print('Vocabulary size:', vocab_size)
print('Vocab:', vocab)

Vocabulary size: 65
Vocab: ['\n', ' ', '!', '$', '&', "'", ',', '-', '.', '3', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']


Convert the tokens to integers.

In [13]:
# Build the lookup tables with dictionary comprehensions.
# char2idx maps each character in vocab to its index; idx2char is the inverse mapping.

char2idx = {char: idx for idx, char in enumerate(vocab)}  # character -> index
idx2char = {idx: char for char, idx in char2idx.items()}  # items() yields (key, value) pairs

# Encode a string as a list of indices; decode a list of indices back into a string.
encode = lambda x: [char2idx[char] for char in x]
decode = lambda idxs: ''.join([idx2char[idx] for idx in idxs])

print('Character to index:', char2idx)
print('Index to character:', idx2char)
print('Tokenization of `Hello World!`:', encode('Hello World!'))
print('String for token sequence `[20, 43, 50, 50, 53, 1, 35, 53, 56, 50, 42, 2]`:', decode([20, 43, 50, 50, 53, 1, 35, 53, 56, 50, 42, 2]))

Character to index: {'\n': 0, ' ': 1, '!': 2, '$': 3, '&': 4, "'": 5, ',': 6, '-': 7, '.': 8, '3': 9, ':': 10, ';': 11, '?': 12, 'A': 13, 'B': 14, 'C': 15, 'D': 16, 'E': 17, 'F': 18, 'G': 19, 'H': 20, 'I': 21, 'J': 22, 'K': 23, 'L': 24, 'M': 25, 'N': 26, 'O': 27, 'P': 28, 'Q': 29, 'R': 30, 'S': 31, 'T': 32, 'U': 33, 'V': 34, 'W': 35, 'X': 36, 'Y': 37, 'Z': 38, 'a': 39, 'b': 40, 'c': 41, 'd': 42, 'e': 43, 'f': 44, 'g': 45, 'h': 46, 'i': 47, 'j': 48, 'k': 49, 'l': 50, 'm': 51, 'n': 52, 'o': 53, 'p': 54, 'q': 55, 'r': 56, 's': 57, 't': 58, 'u': 59, 'v': 60, 'w': 61, 'x': 62, 'y': 63, 'z': 64}
Index to character: {0: '\n', 1: ' ', 2: '!', 3: '$', 4: '&', 5: "'", 6: ',', 7: '-', 8: '.', 9: '3', 10: ':', 11: ';', 12: '?', 13: 'A', 14: 'B', 15: 'C', 16: 'D', 17: 'E', 18: 'F', 19: 'G', 20: 'H', 21: 'I', 22: 'J', 23: 'K', 24: 'L', 25: 'M', 26: 'N', 27: 'O', 28: 'P', 29: 'Q', 30: 'R', 31: 'S', 32: 'T', 33: 'U', 34: 'V', 35: 'W', 36: 'X', 37: 'Y', 38: 'Z', 39: 'a', 40: 'b', 41: 'c', 42: 'd', 43:
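A character-level tokenizer built this way should round-trip any string made of known characters, and it fails on anything outside the vocabulary. The sketch below checks both properties on a toy corpus; the `toy_` names are illustrative and chosen so they do not overwrite the notebook's own variables.

```python
toy_text = "Hello World!"  # illustrative corpus, not the dataset

toy_vocab = sorted(set(toy_text))
toy_char2idx = {char: idx for idx, char in enumerate(toy_vocab)}
toy_idx2char = {idx: char for char, idx in toy_char2idx.items()}


def toy_encode(s):
    """Map each character of `s` to its integer index."""
    return [toy_char2idx[char] for char in s]


def toy_decode(idxs):
    """Map a list of integer indices back to a string."""
    return "".join(toy_idx2char[idx] for idx in idxs)


# Round trip: decoding the encoding returns the original string.
assert toy_decode(toy_encode("World")) == "World"

# Characters that never appeared in the corpus have no index.
try:
    toy_encode("Zebra")
except KeyError as err:
    print("Unknown character:", err)
```

Because the vocabulary comes from the training text alone, any character absent from that text raises a `KeyError` at encoding time.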

Tokenize the dataset.

In [14]:
import torch
encoded_text = torch.tensor(encode(text)) 
print('Encoded text shape:', encoded_text.shape, 'Encoded Text Dtype:', encoded_text.dtype)
print('Encoded text:', encoded_text)

Encoded text shape: torch.Size([1115394]) Encoded Text Dtype: torch.int64
Encoded text: tensor([18, 47, 56,  ..., 45,  8,  0])


### Create training and validation sets

Split the encoded text into training and validation sets, using a 90/10 split.

In [15]:
train_splt_pct = 0.9
train_splt_idx = int(len(encoded_text) * train_splt_pct) 
train_splt_idx


1003854

In [16]:
train_data = encoded_text[:train_splt_idx] # get first 90% of encoded_text as training data
valid_data = encoded_text[train_splt_idx:] # get last 10% of encoded_text as validation data

print('Train data length:', len(train_data))
print('Valid data length:', len(valid_data))
print('Train percentage:', len(train_data)/len(encoded_text))

Train data length: 1003854
Valid data length: 111540
Train percentage: 0.8999994620734916
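The train percentage comes out slightly below 0.9 because the exact split point is not a whole number and `int()` truncates it. A small sketch of the arithmetic, using the token count reported above (the variable names are illustrative):

```python
n_tokens = 1115394   # length of encoded_text reported above
split_pct = 0.9

exact_idx = n_tokens * split_pct  # roughly 1003854.6, not a whole number
split_idx = int(exact_idx)        # int() truncates toward zero

n_train = split_idx
n_valid = n_tokens - split_idx
print(n_train, n_valid, n_train / n_tokens)
```

The two parts always sum to the full length, so no token is dropped or duplicated by the split.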


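Now we need to choose the context length. The context length (also called the block size) is the maximum number of tokens in a single sequence fed to the transformer during training. Every prefix of that sequence is a training example whose target is the token that follows it.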

In [24]:
context_length = 8

for i in range(context_length):
    x,y = train_data[:i+1], train_data[i+1]
    print('idx:',i, 'x:', x, 'y:' ,y,' | decoded version: x:', decode(x.tolist()), 'y:', decode(y[None].tolist()))

idx: 0 x: tensor([18]) y: tensor(47)  | decoded version: x: F y: i
idx: 1 x: tensor([18, 47]) y: tensor(56)  | decoded version: x: Fi y: r
idx: 2 x: tensor([18, 47, 56]) y: tensor(57)  | decoded version: x: Fir y: s
idx: 3 x: tensor([18, 47, 56, 57]) y: tensor(58)  | decoded version: x: Firs y: t
idx: 4 x: tensor([18, 47, 56, 57, 58]) y: tensor(1)  | decoded version: x: First y:  
idx: 5 x: tensor([18, 47, 56, 57, 58,  1]) y: tensor(15)  | decoded version: x: First  y: C
idx: 6 x: tensor([18, 47, 56, 57, 58,  1, 15]) y: tensor(47)  | decoded version: x: First C y: i
idx: 7 x: tensor([18, 47, 56, 57, 58,  1, 15, 47]) y: tensor(58)  | decoded version: x: First Ci y: t
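The loop above shows that one chunk of `context_length + 1` tokens yields `context_length` training examples, one per prefix. Put differently, the targets are the inputs shifted left by one position. A minimal sketch with plain Python lists, using the first nine token ids printed above (the names `chunk`, `inputs`, `targets`, and `examples` are illustrative):

```python
context_length = 8
chunk = [18, 47, 56, 57, 58, 1, 15, 47, 58]  # first context_length + 1 token ids

inputs = chunk[:-1]   # tokens 0 .. context_length - 1
targets = chunk[1:]   # the same tokens shifted left by one

# Example t uses the first t + 1 inputs as context and targets[t] as the label.
examples = [(inputs[: t + 1], targets[t]) for t in range(context_length)]
for context, target in examples:
    print(context, "->", target)
```

Training on every prefix lets the model learn to predict from contexts as short as one token and as long as `context_length`.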


In [25]:
TORCH_SEED = 1337 # fixed seed for reproducibility
torch.manual_seed(TORCH_SEED) # seeds the default random number generator; returns a torch.Generator object
context_length = 8 # maximum number of tokens (characters) fed into the model at a time
batch_size = 4 # number of independent sequences processed in parallel per batch