## Building a Language Model from Scratch - Part 2

**Note:** The only difference from Part 1 to Part 1.5 is that we use a `DataLoader` and train in batches.

**Note:** The only difference from Part 1 to Part 2 is that we are now including a context length of 5 and using a trick to keep roughly the same network architecture working even though we observe five tokens at a time.

We are going to customize our tokenizer a little bit to only keep the most common words and punctuation marks.

In [12]:
import torch

if torch.cuda.is_available():
    device = torch.device("cuda")
    print("GPU is available and being used")
else:
    device = torch.device("cpu")
    print("GPU is not available, using CPU instead")

GPU is available and being used


In [5]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [6]:
def load_common_words():
    vocabulary = []
    with open('common_words.txt', 'r') as file:
        for line in file:
            # skip lines that start with "#!"
            if not line.startswith("#!"):
                vocabulary.append(line.strip())
    vocabulary.append("<pad>") # padding token, always the last entry
    return vocabulary

In [7]:
from nltk.tokenize import word_tokenize

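def tokenize(text):
    vocabulary = load_common_words()
    known_tokens = set(vocabulary) # set membership is much faster than scanning the list
    tokens = word_tokenize(text)
    # keep only the tokens that appear in the common-words vocabulary
    return [token for token in tokens if token in known_tokens], vocabulary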

In [8]:
text = "This is an example sentence for word tokenization."
tokens, vocabulary = tokenize(text)

print(tokens)

['This', 'is', 'an', 'example', 'sentence', 'for', 'word']


## Our Dataset

In [13]:
def load_dataset():
    with open('shakespeare.txt', 'r') as file:
        shakespeare = file.read()
        return shakespeare

dataset = load_dataset()

In [14]:
tokens, vocabulary = tokenize(dataset)

print(tokens[0:10]) # print the first ten tokens

['First', 'Citizen', 'Before', 'we', 'proceed', 'any', 'further', 'hear', 'me', 'speak']


**NEW:** We now need a way to grab up to five tokens ending at a given token index, and to pad shorter windows on the left so every input has the same length.

In [15]:
def tokens_to_indexes(tokens, vocabulary):
    return [vocabulary.index(token) for token in tokens]

def fetch_context_window(token_indexes, i, window_size=3):
    # take up to window_size indexes ending at position i (inclusive),
    # clamping the start so we never reach before the first token
    first_index = max(0, i + 1 - window_size)
    last_index = i + 1

    return token_indexes[first_index:last_index]

def pad_sequence(sequence, length, pad_with):
    # left-pad the sequence with pad_with until it is `length` items long
    missing = length - len(sequence)

    return [pad_with] * missing + sequence

In [16]:
test_text = "I am so very hungry. What is there to eat?"

test_tokens, vocabulary = tokenize(test_text)
test_token_indexes = tokens_to_indexes(test_tokens, vocabulary)
window_1 = fetch_context_window(test_token_indexes, 2, window_size=5) # fetches up to the window size
window_2 = fetch_context_window(test_token_indexes, 5, window_size=5) # fetches up to the window size

print(window_1)
print(window_2)

[7, 135, 37]
[135, 37, 72, 3776, 179]


In [17]:
padded_window_1 = pad_sequence(window_1, 5, vocabulary.index("<pad>"))
padded_window_2 = pad_sequence(window_2, 5, vocabulary.index("<pad>"))

print(padded_window_1)
print(padded_window_2)

[98913, 98913, 7, 135, 37]
[135, 37, 72, 3776, 179]


This section is new. It builds a dataset with PyTorch and then uses `DataLoader` to give us a batch of examples at a time (`batch_size=100` in the code below) instead of just one example at a time. Batching significantly improves training throughput when using a GPU.

In [26]:
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

window_size = 10

full_X = [] # will hold a list of token indexes for the training data
full_y = [] # will hold a list of token indexes for the correct next word

padding_token_index = vocabulary.index("<pad>")
token_indexes = tokens_to_indexes(tokens, vocabulary)

for i in range(len(tokens)-1):
    if i % 10000 == 0:
        percentage = i*100.0 / len(tokens)
        print(f"Creating dataset: {percentage:.2f}% - {i}/{len(tokens)}")
    
    windows = {} # context length -> window, so each distinct length is kept once
    
    for j in range(1,window_size+1):
        window = fetch_context_window(token_indexes, i, window_size=j)
        windows[len(window)]=window
                
    for length, window in windows.items():
        full_X.append(pad_sequence(window, window_size, padding_token_index))
        full_y.append(token_indexes[i+1]) # index of the correct next token
        
full_X = torch.tensor(full_X)
full_y = torch.tensor(full_y)

full_X = full_X.to(device)
full_y = full_y.to(device)

Creating dataset: 0.00% - 0/198756
Creating dataset: 5.03% - 10000/198756
Creating dataset: 10.06% - 20000/198756
Creating dataset: 15.09% - 30000/198756
Creating dataset: 20.13% - 40000/198756
Creating dataset: 25.16% - 50000/198756
Creating dataset: 30.19% - 60000/198756
Creating dataset: 35.22% - 70000/198756
Creating dataset: 40.25% - 80000/198756
Creating dataset: 45.28% - 90000/198756
Creating dataset: 50.31% - 100000/198756
Creating dataset: 55.34% - 110000/198756
Creating dataset: 60.38% - 120000/198756
Creating dataset: 65.41% - 130000/198756
Creating dataset: 70.44% - 140000/198756
Creating dataset: 75.47% - 150000/198756
Creating dataset: 80.50% - 160000/198756
Creating dataset: 85.53% - 170000/198756
Creating dataset: 90.56% - 180000/198756
Creating dataset: 95.59% - 190000/198756


tensor([21327,  1536,  1536,  ...,  7458,  7458,  7458], device='cuda:0')
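To make the dataset loop above easier to follow, here is a minimal pure-Python sketch of the same idea on three tokens. The helper names (`context_window`, `left_pad`, `build_examples`), the toy indexes and the padding index `0` are all hypothetical stand-ins, not the notebook's own values. Each position contributes one left-padded example per distinct context length, and the target is always the next token.

```python
def context_window(token_indexes, i, window_size):
    """Return up to `window_size` indexes ending at position i (inclusive)."""
    start = max(0, i + 1 - window_size)
    return token_indexes[start:i + 1]


def left_pad(sequence, length, pad_with):
    """Prepend `pad_with` until the sequence has `length` items."""
    return [pad_with] * (length - len(sequence)) + sequence


def build_examples(token_indexes, window_size, pad_index):
    """For each position, emit one padded example per distinct context length."""
    examples = []
    for i in range(len(token_indexes) - 1):
        target = token_indexes[i + 1]
        seen_lengths = set()
        for j in range(1, window_size + 1):
            window = context_window(token_indexes, i, j)
            if len(window) in seen_lengths:
                continue  # near the start, larger windows collapse to the same slice
            seen_lengths.add(len(window))
            examples.append((left_pad(window, window_size, pad_index), target))
    return examples


toy_indexes = [4, 5, 6]  # hypothetical token indexes
PAD = 0                  # hypothetical padding index
examples = build_examples(toy_indexes, window_size=3, pad_index=PAD)
for x, y in examples:
    print(x, "->", y)
# [0, 0, 4] -> 5
# [0, 0, 5] -> 6
# [0, 4, 5] -> 6
```

Because every position yields up to `window_size` examples, the dataset ends up several times larger than the token stream itself.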

In [33]:
# Create a dataset from your data
dataset = TensorDataset(full_X, full_y)

# Create a dataloader. This can handle batching
dataload = DataLoader(dataset, batch_size=100, shuffle=True)
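As a rough illustration of what the dataloader hands back, the sketch below wraps a toy tensor pair in a `TensorDataset` and iterates over it. The names `toy_X`, `toy_y` and `toy_loader` and the sizes are made up for the example; only `TensorDataset` and `DataLoader` come from PyTorch.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-ins for full_X and full_y: 10 examples with a context of 4 indexes each
toy_X = torch.arange(40).reshape(10, 4)
toy_y = torch.arange(10)

toy_loader = DataLoader(TensorDataset(toy_X, toy_y), batch_size=4, shuffle=True)

# Each iteration yields a (inputs, targets) pair with a leading batch dimension
batch_shapes = [(tuple(X_batch.shape), tuple(y_batch.shape))
                for X_batch, y_batch in toy_loader]

print(len(toy_loader))  # number of batches per epoch: ceil(10 / 4) = 3
print(batch_shapes)     # the last batch is smaller because 10 is not a multiple of 4
```

Shuffling changes which rows land in which batch but not the shapes. Passing `drop_last=True` would discard the final, smaller batch if a fixed batch size is required.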

In [34]:
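# NOTE: X is only defined inside the training loop below (the one-hot encoded batch),
# so that cell has to run before this one
print(X[0])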

tensor([[0, 0, 0,  ..., 0, 0, 1],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        ...,
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0]])


## Building a Language Model

## Our Model in Brief:

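- Architecture: $\operatorname{mean}(X \cdot E) \cdot O$
    - where $X$ is a $(\textit{window\_size} \times \textit{vocabulary\_size})$ matrix with one one-hot encoded row per token in the context window
    - $E$ is a $(\textit{vocabulary\_size} \times k)$ learnable matrix
    - $O$ is a $(k \times \textit{vocabulary\_size})$ learnable matrix
    - the mean is taken over the context window, so the input to $O$ is still a single $k$-dimensional vector; this is the trick that keeps the architecture the same
- Loss: Cross Entropy Loss (common for language modeling and classification tasks)
- Hyper-parameters: $k=100$ for embedding size, $lr=0.1$ for learning rate (halved after each epoch)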
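The shapes are easier to check on a tiny example. The following is a sketch of the forward pass used in the training loop (one-hot encode, multiply by $E$, average over the context, multiply by $O$) with made-up sizes: a vocabulary of 6, an embedding size of 4, a context of 3 and padding index `0` are assumptions chosen only to keep the tensors small.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

v, k, window = 6, 4, 3   # toy vocabulary size, embedding size, context length
E = torch.rand(v, k)     # (v x k) embedding matrix
O = torch.rand(k, v)     # (k x v) output matrix

# Two contexts of token indexes, the first one left-padded with index 0
X_batch = torch.tensor([[0, 0, 2],
                        [1, 3, 5]])             # (batch x window)

X = F.one_hot(X_batch, num_classes=v).float()   # (batch x window x v)
embedding = X @ E                               # (batch x window x k)
mean_embedding = embedding.mean(dim=1)          # (batch x k): average over the context
logits = mean_embedding @ O                     # (batch x v): one score per vocabulary word

# Multiplying a one-hot row by E just selects a row of E, so indexing gives the same result
assert torch.allclose(embedding, E[X_batch])

print(logits.shape)  # torch.Size([2, 6])
```

The final assertion also shows that `E[X_batch]` is an equivalent and much cheaper way to get the embeddings, since it avoids materializing a one-hot tensor as wide as the vocabulary.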

In [35]:
from torch import nn
import torch.nn.functional as F

# set hyper-parameters
s = 1 # lr schedule
e = 3 # epochs
k = 100 # embedding size
lr = 0.1 # learning rate
v = len(vocabulary)

E = torch.rand(v, k) # (v x k) - learnable embedding matrix 
O = torch.rand(k, v) # (k x v) - learnable output embedding matrix

E = E.to(device)
O = O.to(device)

E.requires_grad = True
O.requires_grad = True

loss_function = nn.CrossEntropyLoss()

# Training loop
for epoch in range(e):
    for i, (X_batch, y_batch) in enumerate(dataload):
        X = F.one_hot(X_batch, num_classes=v) # (batch x window_size x v)
        embedding = X.float() @ E # (batch x window_size x k) = (batch x window_size x v) @ (v x k)

        # Average the embeddings over the context window (dim=1) to get (batch x k)
        mean_embedding = torch.mean(embedding, dim=1)

        logits = mean_embedding @ O # (batch x v) = (batch x k) @ (k x v)

        loss = loss_function(logits, y_batch) # cross entropy loss

        if i % 50 == 0:
            print(f"Epoch: {epoch}, Batch: {i}/{len(dataload)}, Loss: {loss.item():.2f}, LR: {lr:.2f}")

        # Backpropagation
        loss.backward()

        # Update the weights using gradient descent
        with torch.no_grad():
            E -= lr * E.grad
            O -= lr * O.grad

        # Zero the gradients after updating
        E.grad.zero_()
        O.grad.zero_() 
        
    if epoch % s == 0:
        lr = lr / 2

Batch: 0/19876, Loss: 12.55, LR: 0.10
Batch: 50/19876, Loss: 11.97, LR: 0.10
Batch: 100/19876, Loss: 11.32, LR: 0.10
Batch: 150/19876, Loss: 11.11, LR: 0.10
Batch: 200/19876, Loss: 10.49, LR: 0.10
Batch: 250/19876, Loss: 10.87, LR: 0.10
Batch: 300/19876, Loss: 9.78, LR: 0.10
Batch: 350/19876, Loss: 9.72, LR: 0.10
Batch: 400/19876, Loss: 9.84, LR: 0.10
Batch: 450/19876, Loss: 9.66, LR: 0.10
Batch: 500/19876, Loss: 9.21, LR: 0.10
Batch: 550/19876, Loss: 8.92, LR: 0.10
Batch: 600/19876, Loss: 9.27, LR: 0.10
Batch: 650/19876, Loss: 9.54, LR: 0.10
Batch: 700/19876, Loss: 9.31, LR: 0.10
Batch: 750/19876, Loss: 9.33, LR: 0.10
Batch: 800/19876, Loss: 8.40, LR: 0.10
Batch: 850/19876, Loss: 9.09, LR: 0.10
Batch: 900/19876, Loss: 8.89, LR: 0.10
Batch: 950/19876, Loss: 9.22, LR: 0.10
Batch: 1000/19876, Loss: 9.46, LR: 0.10
Batch: 1050/19876, Loss: 8.60, LR: 0.10
Batch: 1100/19876, Loss: 8.44, LR: 0.10
Batch: 1150/19876, Loss: 9.12, LR: 0.10
Batch: 1200/19876, Loss: 9.31, LR: 0.10
Batch: 1250/19876

In [42]:
def one_hot_encode(token, vocabulary):
    vector = torch.zeros(1, len(vocabulary))
    vector = vector.to(device)
    
    index = vocabulary.index(token)
    vector[0,index] = 1
    
    return vector

def inference(text, tokens_to_generate=10, temperature=1.0):
    text_tokens, vocabulary = tokenize(text)
    
    print(text, end=" ")
    
    last_token = text_tokens[-1] # note: only the last token is used as context here
        
    for i in range(tokens_to_generate):
        X = one_hot_encode(last_token, vocabulary) # one-hot encoded token        
        logits = X @ E @ O # (1 x vocabulary_size) compute the scores for each word in vocab
        logits = logits / temperature # scale by the temperature
        probabilities = torch.softmax(logits, dim=1) # (1 x vocabulary_size) turn the scores into probabilities
        next_token_index = torch.multinomial(probabilities, 1) # sample from the distribution
        next_token = vocabulary[next_token_index.item()] # get the word corresponding to the prediction
        print(next_token, end=" ")
        last_token = next_token
        
inference("Thou", tokens_to_generate=2000, temperature=1)

Thou gates Turc Sedgwick thou queen distinct I am than antworten my 's Nor Were fire prayers you have name 's gentleman serve and glad prithee sooth 'd cut make love I am in the of the of these here a wind the of d'intérêt do acto law him Thou Angelo sovereignty O your Therese QUEEN go married our excelled not LEONTES here this Peters And all hesitation yourself most them Romeo tide need Pamunkey good myself This little One in the of Here your hear ANTONIO part Lord what all shake your before to the of your it not mine see plumbing LADY find of their smell hence lawful king Bohemia beseech that it I 'll for we fury your Osmanli contrainte awry is the king a infringe me AUTOLYCUS blood anticipates must Quicksands danger 'd for myself ours a gleamed and premonitory bastard case my lord As on obligatory culpable coquettes 's father daughter Owain me I 'll his lord and Herodotus wakes be price virginity sister logic me I 'll treason farewell and for their No all I 'll will in me I 'll of s
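The inference cell above conditions only on the last token, while training averaged the embeddings of a padded context window. Below is a hedged sketch of what context-aware sampling could look like. The toy vocabulary, the random `E` and `O`, the `generate` function and the choice to mask out the padding token are all assumptions made for illustration and are not part of the original notebook.

```python
import torch

torch.manual_seed(0)

# Hypothetical toy setup standing in for the trained vocabulary, E and O
toy_vocabulary = ["thou", "art", "a", "king", "<pad>"]
token_to_index = {token: i for i, token in enumerate(toy_vocabulary)}
pad_index = token_to_index["<pad>"]
window_size, k = 3, 4
E = torch.rand(len(toy_vocabulary), k)
O = torch.rand(k, len(toy_vocabulary))


def generate(prompt_tokens, tokens_to_generate=5, temperature=1.0):
    """Sample tokens using the mean embedding of the last `window_size` tokens."""
    indexes = [token_to_index[t] for t in prompt_tokens]
    with torch.no_grad():
        for _ in range(tokens_to_generate):
            context = indexes[-window_size:]
            context = [pad_index] * (window_size - len(context)) + context  # left-pad
            mean_embedding = E[torch.tensor(context)].mean(dim=0, keepdim=True)  # (1 x k)
            logits = (mean_embedding @ O) / temperature                          # (1 x v)
            logits[0, pad_index] = float("-inf")  # assumption: never sample the padding token
            probabilities = torch.softmax(logits, dim=1)
            indexes.append(torch.multinomial(probabilities, 1).item())
    return [toy_vocabulary[i] for i in indexes]


generated = generate(["thou"], tokens_to_generate=5)
print(" ".join(generated))
```

The dictionary lookup replaces repeated `vocabulary.index(...)` calls, and the sliding `indexes[-window_size:]` slice keeps the model's input in the same padded form it saw during training.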