## Building a Language Model from Scratch - Part 1.5

**Note:** Part 1.5 differs from Part 1 in only three ways: it uses a dataloader, trains in batches, and uses the Adam optimizer.

In [1]:
%pip install nltk torch

Note: you may need to restart the kernel to use updated packages.


In [1]:
import torch

if torch.backends.mps.is_available():
    device = torch.device("mps")
    print("Apple Silicon MPS is available and being used")
elif torch.cuda.is_available():
    device = torch.device("cuda")
    print("GPU is available and being used")
else:
    device = torch.device("cpu")
    print("GPU is not available, using CPU instead")

Apple Silicon MPS is available and being used


We are going to customize our tokenizer a little bit to only keep the most common words and punctuation marks.

In [2]:
def load_common_words():
    vocabulary = []
    with open('common_words.txt', 'r') as file:
        for line in file:
            if not line.startswith("#!"): # skip lines marked with "#!"
                vocabulary.append(line.strip())
    return vocabulary

In [3]:
from nltk.tokenize import word_tokenize

def tokenize(text):
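    vocabulary = load_common_words()
    vocabulary_set = set(vocabulary) # set membership avoids scanning the whole list for every token
    tokens = word_tokenize(text)
    return [token for token in tokens if token in vocabulary_set], vocabulary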

In [4]:
text = "This is an example sentence for word tokenization."
tokens, vocabulary = tokenize(text)

print(tokens)

['This', 'is', 'an', 'example', 'sentence', 'for', 'word']
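The filter step can be illustrated without NLTK or the word list file. The following is a minimal sketch, assuming a small hypothetical in-memory vocabulary (`sample_vocabulary`) in place of `common_words.txt` and `str.split` in place of `word_tokenize`. It shows that filtering against a list and against a set keeps the same tokens; the set is only a faster way to do the membership test.

```python
# Hypothetical stand-in vocabulary; the notebook loads its own from common_words.txt
sample_vocabulary = ["This", "is", "an", "example", "sentence", "for", "word"]
vocabulary_set = set(sample_vocabulary)

def filter_tokens(tokens, allowed):
    """Keep only the tokens that appear in the allowed collection."""
    return [token for token in tokens if token in allowed]

# str.split is used here only to keep the sketch dependency-free
sample_tokens = "This is an example sentence for word tokenization .".split()

kept_with_list = filter_tokens(sample_tokens, sample_vocabulary)
kept_with_set = filter_tokens(sample_tokens, vocabulary_set)

print(kept_with_set)
```

Tokens that are not in the vocabulary (here `tokenization` and `.`) are dropped rather than mapped to an unknown-word symbol.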


## Our Dataset

In [5]:
def load_dataset():
    with open('shakespeare.txt', 'r') as file:
        shakespeare = file.read()
        return shakespeare

dataset = load_dataset()

In [6]:
tokens, vocabulary = tokenize(dataset)

print(tokens[0:10]) # print the first ten tokens

['First', 'Citizen', 'Before', 'we', 'proceed', 'any', 'further', 'hear', 'me', 'speak']


**NEW:** This section builds a dataset using PyTorch and then uses `DataLoader` to give us 32 examples at a time instead of one example at a time. This significantly improves performance when using a GPU.

In [7]:
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

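all_X = [] # will hold a list of token indexes for the training data
all_y = [] # will hold a list of token indexes for the correct next word

# build the token -> index map once; vocabulary.index() rescans the whole list on every call
token_to_index = {}
for index, token in enumerate(vocabulary):
    token_to_index.setdefault(token, index) # keep the first index, matching vocabulary.index()

for i in range(len(tokens)-1):
    all_X.append(token_to_index[tokens[i]])
    all_y.append(token_to_index[tokens[i+1]])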
    
all_X = torch.tensor(all_X)
all_y = torch.tensor(all_y)

all_X = all_X.to(device)
all_y = all_y.to(device)

# Create a dataset from your data
dataset = TensorDataset(all_X, all_y)

# Create a dataloader. This can handle batching
dataload = DataLoader(dataset, batch_size=32, shuffle=True)
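To make the batching concrete, here is a minimal sketch on a toy token list. The names `toy_vocabulary`, `toy_tokens`, and `toy_loader` are hypothetical and exist only for this illustration. It uses a batch size of 2 instead of 32 and turns shuffling off so the batches are easy to inspect.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical toy data, standing in for the Shakespeare tokens
toy_vocabulary = ["we", "proceed", "any", "further"]
toy_tokens = ["we", "proceed", "any", "further", "we", "proceed"]

token_to_index = {token: i for i, token in enumerate(toy_vocabulary)}
indices = [token_to_index[token] for token in toy_tokens]

# Each input token is paired with the token that follows it
toy_X = torch.tensor(indices[:-1])
toy_y = torch.tensor(indices[1:])

# shuffle=False only so the order below is predictable
toy_loader = DataLoader(TensorDataset(toy_X, toy_y), batch_size=2, shuffle=False)

batch_shapes = [(tuple(X_batch.shape), tuple(y_batch.shape)) for X_batch, y_batch in toy_loader]
print(len(toy_loader), batch_shapes)
```

Six tokens give five (input, next word) pairs, so the loader yields two full batches and one final batch with a single pair.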

## Building a Language Model

#### Our Model in Brief:

- Architecture: $X \cdot E \cdot O$
    - where $X$ is a $(1 \times v)$ one-hot encoded vector for our input
    - $E$ is a $(v \times k)$ learnable matrix
    - $O$ is a $(k \times v)$ learnable matrix
- Loss Function: Cross Entropy Loss (common for language modeling and classification tasks)
- Hyper-parameters: $k=100$ for embedding size, $lr=0.1$ for learning rate
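As a small illustration of the architecture above, multiplying a one-hot vector by $E$ simply selects one row of $E$. The sketch below uses toy sizes ($v=5$, $k=3$) chosen only for readability, not the notebook's real values, and runs on the CPU.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

v, k = 5, 3           # toy vocabulary size and embedding size
E = torch.rand(v, k)  # (v x k)
O = torch.rand(k, v)  # (k x v)

token_indices = torch.tensor([2, 4])                 # two example token indexes
X = F.one_hot(token_indices, num_classes=v).float()  # (2 x v), one row per token

embedded_by_matmul = X @ E          # (2 x k)
embedded_by_lookup = E[token_indices]  # the same rows, selected directly

logits = embedded_by_matmul @ O     # (2 x v), one score per vocabulary word

print(torch.allclose(embedded_by_matmul, embedded_by_lookup))
```

This is why $E$ can be read as a table of word embeddings: row $i$ is the $k$-dimensional vector for token $i$.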

#### Batches

In [8]:
from torch import nn
import torch.nn.functional as F

# set hyper-parameters
k = 100 # embedding size
lr = 0.1 # learning rate
v = len(vocabulary)

E = torch.rand(v, k) # (v x k) - learnable embedding matrix 
O = torch.rand(k, v) # (k x v) - learnable output embedding matrix

E = E.to(device)
O = O.to(device)

E.requires_grad = True
O.requires_grad = True

loss_function = nn.CrossEntropyLoss()

# Training loop
for i, (X_batch, y_batch) in enumerate(dataload):
    X = F.one_hot(X_batch, num_classes=v)
    logits = X.float() @ E @ O # (b x v) = (b x v) @ (v x k) @ (k x v), where b is the batch size
    loss = loss_function(logits, y_batch) # cross entropy loss

    if i % 50 == 0:
        print(f"Batch: {i}/{len(dataload)}, Loss: {loss.item():.2f}, LR: {lr:.2f}")

    # Backpropagation
    loss.backward()

    # Update the weights using gradient descent
    with torch.no_grad():
        E -= lr * E.grad
        O -= lr * O.grad

    # Zero the gradients after updating
    E.grad.zero_()
    O.grad.zero_()        

Batch: 0/6212, Loss: 13.06, LR: 0.10
Batch: 50/6212, Loss: 12.45, LR: 0.10
Batch: 100/6212, Loss: 11.52, LR: 0.10
Batch: 150/6212, Loss: 11.81, LR: 0.10
Batch: 200/6212, Loss: 10.97, LR: 0.10
Batch: 250/6212, Loss: 10.99, LR: 0.10
Batch: 300/6212, Loss: 9.21, LR: 0.10
Batch: 350/6212, Loss: 8.95, LR: 0.10
Batch: 400/6212, Loss: 9.56, LR: 0.10
Batch: 450/6212, Loss: 9.74, LR: 0.10
Batch: 500/6212, Loss: 8.83, LR: 0.10
Batch: 550/6212, Loss: 9.09, LR: 0.10
Batch: 600/6212, Loss: 9.90, LR: 0.10
Batch: 650/6212, Loss: 9.72, LR: 0.10
Batch: 700/6212, Loss: 10.83, LR: 0.10
Batch: 750/6212, Loss: 9.16, LR: 0.10
Batch: 800/6212, Loss: 9.76, LR: 0.10
Batch: 850/6212, Loss: 7.58, LR: 0.10
Batch: 900/6212, Loss: 9.23, LR: 0.10
Batch: 950/6212, Loss: 9.11, LR: 0.10
Batch: 1000/6212, Loss: 8.33, LR: 0.10
Batch: 1050/6212, Loss: 9.37, LR: 0.10
Batch: 1100/6212, Loss: 9.18, LR: 0.10
Batch: 1150/6212, Loss: 7.09, LR: 0.10
Batch: 1200/6212, Loss: 7.50, LR: 0.10
Batch: 1250/6212, Loss: 10.06, LR: 0.10
B

KeyboardInterrupt: 
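The loss printed for each batch is the mean cross entropy over the examples in that batch. As a rough sketch of what `nn.CrossEntropyLoss` computes, the toy example below (hypothetical logits and targets, three classes) reproduces it by hand from the log-softmax of the logits.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical logits for two examples over three classes
toy_logits = torch.tensor([[2.0, 0.5, -1.0],
                           [0.0, 0.0, 0.0]])
toy_targets = torch.tensor([0, 2])  # index of the correct class for each example

builtin_loss = nn.CrossEntropyLoss()(toy_logits, toy_targets)

# By hand: take the log-probability of the correct class, negate, and average
log_probs = F.log_softmax(toy_logits, dim=1)
manual_loss = -log_probs[torch.arange(2), toy_targets].mean()

# The second row has equal logits, so its loss is ln(3)
uniform_row_loss = -log_probs[1, 2]

print(builtin_loss.item(), manual_loss.item(), uniform_row_loss.item())
```

A model that assigns equal probability to every word has a loss of $\ln(v)$, which gives a useful baseline for reading the training output.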

#### Optimizer

In [9]:
from torch import nn
import torch.nn.functional as F
import torch.optim as optim

# set hyper-parameters
k = 100 # embedding size
lr = 0.1 # learning rate
v = len(vocabulary)

E = torch.rand(v, k) # (v x k) - learnable embedding matrix 
O = torch.rand(k, v) # (k x v) - learnable output embedding matrix

E = E.to(device)
O = O.to(device)

E.requires_grad = True
O.requires_grad = True

# initialize Adam optimizer
optimizer = optim.Adam([E, O], lr=lr)

loss_function = nn.CrossEntropyLoss()

# Training loop
for i, (X_batch, y_batch) in enumerate(dataload):
    X = F.one_hot(X_batch, num_classes=v)
    logits = X.float() @ E @ O # (b x v) = (b x v) @ (v x k) @ (k x v), where b is the batch size
    loss = loss_function(logits, y_batch) # cross entropy loss

    if i % 50 == 0:
        print(f"Batch: {i}/{len(dataload)}, Loss: {loss.item():.2f}, LR: {lr:.2f}")

    # Backpropagation
    loss.backward()

    # Update the weights using Adam optimizer
    optimizer.step()

    # Zero the gradients after updating
    optimizer.zero_grad()       

Batch: 0/6212, Loss: 13.24, LR: 0.10


RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
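`backward()` raises this error when the loss is not connected to any tensor that requires gradients, for example when the forward pass runs while gradient tracking is disabled. The output above does not show which condition applied in this run, so the following is only a diagnostic sketch. It uses hypothetical toy sizes and data on the CPU, checks the gradient state before training, and runs the same Adam loop on a small problem.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(0)

# Toy sizes and data, chosen only for illustration
v, k = 6, 4
X_batch = torch.tensor([0, 1, 2, 3, 4])
y_batch = torch.tensor([1, 2, 3, 4, 5])

# Create the parameters as leaf tensors that track gradients from the start
E = torch.rand(v, k, requires_grad=True)
O = torch.rand(k, v, requires_grad=True)

optimizer = optim.Adam([E, O], lr=0.1)
loss_function = nn.CrossEntropyLoss()

# Quick checks before training: all three should be True
checks = (torch.is_grad_enabled(), E.requires_grad, O.requires_grad)

losses = []
for step in range(50):
    optimizer.zero_grad()  # clear gradients left over from the previous step
    logits = F.one_hot(X_batch, num_classes=v).float() @ E @ O
    loss = loss_function(logits, y_batch)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

# A forward pass under no_grad produces a loss with no grad_fn;
# calling backward() on it would raise the RuntimeError shown above
with torch.no_grad():
    untracked_loss = loss_function(
        F.one_hot(X_batch, num_classes=v).float() @ E @ O, y_batch
    )

print(checks, untracked_loss.requires_grad)
print(f"first loss: {losses[0]:.2f}, last loss: {losses[-1]:.2f}")
```

If any of the checks is `False` in the notebook, restarting the kernel and re-running the cells in order is a reasonable first step.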

In [12]:
def one_hot_encode(token, vocabulary):
    vector = torch.zeros(1, len(vocabulary))
    vector = vector.to(device)
    index = vocabulary.index(token)
    vector[0,index] = 1
    
    return vector

def inference(text, tokens_to_generate=10, temperature=1.0):
    text_tokens, vocabulary = tokenize(text)
    
    print(text, end=" ")
    
    last_token = text_tokens[-1]
        
    for i in range(tokens_to_generate):
        
        # forward pass on our network
        X = one_hot_encode(last_token, vocabulary) # one-hot encoded token        
        logits = X @ E @ O # (1 x vocabulary_size) compute the scores for each word in vocab
        
        # use temperature to scale the logits
        scaled_logits = logits / temperature # scale by the temperature
        probabilities = torch.softmax(scaled_logits, dim=1) # (1 x vocabulary_size) turn the scores into probabilities
        
        # sample from the resulting distribution
        next_token_index = torch.multinomial(probabilities, 1) # sample from the distribution
        next_token = vocabulary[next_token_index.item()] # get the word corresponding to the prediction
        
        # print the next token and setup next iteration
        print(next_token, end=" ")
        last_token = next_token
        
inference("Thou", tokens_to_generate=200, temperature=1)

Thou humaines tablespoons fibre diminution système defective Boy quadrangular aquellos ceremoniously smallness pestered fabrication empire singulierement Universelle ideal Caliban Domitian vaqueros begint reconstructed faudrait dar innkeeper's tapauksessa Servadac Charmed Hale Bette Pertinax notifies visible oikean illuminating stratum Consulates pagoda associates erwerben besloten poyson work.' Flags niches evolution Beans THEKLA Haycox work.' Wis shape blades Blumen atteignit iust Paine dich Comte SPECIAL cities interwoven outshone ferons melancolie redcoats jewels firewood kann flake galleys Gethryn pen Inwardly bodde l'abolition d'affaires pleadingly unromantic toon Hortense aient felons warranted inquiet appena studious Grunde sorceress asylums ertragen cupfuls lameness deerskin Wunde repasser correspondingly Appearances hatch rack requisite Mensch answering lingered Chick pertes workmen eteen sanctifying picketed otras Malone lingua locations Onder Seer inscriptions riant reach D
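The `temperature` argument of `inference` divides the logits before the softmax. A minimal sketch of its effect, using hypothetical logits for a three-word vocabulary: a temperature below 1 concentrates probability on the highest-scoring word, and a temperature above 1 flattens the distribution.

```python
import torch

# Hypothetical scores for a three-word vocabulary
toy_logits = torch.tensor([[2.0, 1.0, 0.0]])

def probabilities_at(temperature):
    """Scale the logits by the temperature, then turn them into probabilities."""
    return torch.softmax(toy_logits / temperature, dim=1)

sharp = probabilities_at(0.5)    # favors the top-scoring word more strongly
neutral = probabilities_at(1.0)  # plain softmax
flat = probabilities_at(5.0)     # closer to a uniform distribution

for name, probs in [("T=0.5", sharp), ("T=1.0", neutral), ("T=5.0", flat)]:
    print(name, [round(p, 3) for p in probs[0].tolist()])
```

Sampling with `torch.multinomial` from the flatter distribution picks low-scoring words more often, which makes the generated text more varied and less predictable.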