## Building a Language Model from Scratch - Part 1

In [None]:
%pip install nltk torch

In [26]:
import torch

if torch.backends.mps.is_available():
    device = torch.device("mps")
    print("Apple Silicon MPS is available and being used")
elif torch.cuda.is_available():
    device = torch.device("cuda")
    print("GPU is available and being used")
else:
    device = torch.device("cpu")
    print("GPU is not available, using CPU instead")

Apple Silicon MPS is available and being used


## Our Dataset

- The Tiny Shakespeare dataset is a subset of the complete works of Shakespeare, providing a manageable yet complex dataset for natural language processing and machine learning tasks.
- It includes a variety of text types such as plays, sonnets, and other poems, covering a wide range of themes, characters, and styles.
- This dataset is commonly used to train models like Recurrent Neural Networks (RNNs) for text generation tasks.
- The ultimate goal is to train a model capable of generating text that closely resembles Shakespeare's unique style.


In [27]:
def load_dataset():
    # Read the whole Tiny Shakespeare text file into a single string
    with open('shakespeare.txt', 'r', encoding='utf-8') as file:
        return file.read()

dataset = load_dataset()

## Tokenizers

Tokenization is a fundamental step in Natural Language Processing (NLP). It is the process of breaking down text into smaller units, commonly known as 'tokens'. These tokens can be words, characters, or subwords. 

### Types of Tokenizers

1. **Word Tokenizer**: This is the simplest form of tokenization, which splits the text into individual words. This method is straightforward but can struggle with words that have multiple parts (e.g., "New York") or contractions (e.g., "don't").

2. **Character Tokenizer**: This tokenizer breaks text down into individual characters. While this method captures a high level of detail, it can result in longer sequences and may miss the semantic relationships between words.

3. **Subword Tokenizer**: This tokenizer splits text into subwords or smaller units, which can be beneficial for understanding morphological nuances within words. It can also handle out-of-vocabulary words by breaking them down into known subwords.
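
The three tokenizer types above can be compared side by side on a short sentence. This is only a minimal sketch: the regex word splitter, the tiny `subword_vocab`, and the greedy `subword_tokenize` helper are all assumptions made for illustration, and real tokenizers use more elaborate rules.

```python
import re

text = "I don't like unhappiness."

# Word-level: a simple regex split (an assumption, not how NLTK does it).
# Note how the contraction "don't" falls apart into three tokens.
word_tokens = re.findall(r"\w+|[^\w\s]", text)

# Character-level: every character, including spaces, becomes a token.
char_tokens = list(text)

# Subword-level: greedy longest-match against a tiny, hypothetical vocabulary.
subword_vocab = {"un", "happi", "ness"}

def subword_tokenize(word, vocab):
    pieces, start = [], 0
    while start < len(word):
        # Try the longest candidate piece first, then shorter ones.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No known piece starts here: fall back to a single character.
            pieces.append(word[start])
            start += 1
    return pieces

print(word_tokens)   # ['I', 'don', "'", 't', 'like', 'unhappiness', '.']
print(len(char_tokens), "character tokens")
print(subword_tokenize("unhappiness", subword_vocab))  # ['un', 'happi', 'ness']
```

The character tokenizer produces by far the longest sequence, while the subword tokenizer can represent "unhappiness" even though the whole word is not in its vocabulary.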

### Tokenizers in Deep Learning Models

In deep learning models, tokenizers are used to convert text into numerical representations (tokens), which are then vectorized. These vectors can be processed by the model to make predictions or generate new text.

Tokenization is a critical step in text preprocessing for deep learning models. It directly impacts the performance of models as it determines how the input data (text) will be represented and understood by the model.
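
As a rough illustration of "converting text into numerical representations", the sketch below maps each token to an integer id given by its position in a vocabulary. The names `toy_tokens`, `toy_vocabulary`, and `token_to_id` are hypothetical and not part of the notebook.

```python
toy_tokens = ["the", "dog", "ate", "the", "cat"]

# Build a vocabulary: one entry per distinct token, in order of first appearance.
toy_vocabulary = list(dict.fromkeys(toy_tokens))

# Map each token to its integer id (its position in the vocabulary).
token_to_id = {token: i for i, token in enumerate(toy_vocabulary)}
ids = [token_to_id[token] for token in toy_tokens]

print(toy_vocabulary)  # ['the', 'dog', 'ate', 'cat']
print(ids)             # [0, 1, 2, 0, 3]
```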

### Tokenizer Example

In [28]:
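# word_tokenize relies on NLTK's Punkt tokenizer data; if it is missing, run
# nltk.download("punkt") once (newer NLTK releases ask for "punkt_tab" instead)
from nltk.tokenize import word_tokenize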

text = "This is an example sentence for word tokenization."
tokens = word_tokenize(text)

print(tokens)

['This', 'is', 'an', 'example', 'sentence', 'for', 'word', 'tokenization', '.']


We are going to customize our tokenizer a little bit to only keep the most common words.

In [29]:
def load_common_words():
    vocabulary = []
    with open('common_words.txt', 'r') as file:
        for line in file:
            if not line.startswith("#!"):  # skip lines that start with "#!"
                vocabulary.append(line.strip())
    return vocabulary

In [30]:
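def tokenize(text):
    vocabulary = load_common_words()
    known_words = set(vocabulary)  # set membership is much faster than scanning the list
    tokens = word_tokenize(text)
    # keep only the tokens that appear in the common-words vocabulary
    return [token for token in tokens if token in known_words], vocabulary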

In [31]:
text = "This is an example sentence for word tokenization."
tokens, vocabulary = tokenize(text)

print(tokens)

['This', 'is', 'an', 'example', 'sentence', 'for', 'word']


In [32]:
tokens, vocabulary = tokenize(dataset)

print(tokens[0:10]) # print the first ten tokens

['First', 'Citizen', 'Before', 'we', 'proceed', 'any', 'further', 'hear', 'me', 'speak']


In [33]:
print(len(vocabulary))

98913


## One-Hot Encoding

- Used to represent categories
- For example, the word 'dog' might be represented as `[0, 1, 0]` with a vocabulary of `cat, dog, human`
- In one-hot encoding, a vector is created where each position corresponds to a category.
- All values in a one-hot encoded vector are zero, except for a single one at the position of the category being represented.

In [34]:
def one_hot_encode(token, vocabulary):
    vector = torch.zeros(1, len(vocabulary))
    vector = vector.to(device)
    index = vocabulary.index(token)
    vector[0,index] = 1
    
    return vector

print(vocabulary.index("the")) # position of 'the' in the vocabulary array
print(one_hot_encode("the", vocabulary)) # one-hot encoded vector

0
tensor([[1., 0., 0.,  ..., 0., 0., 0.]], device='mps:0')


## Linear Algebra Refresh

- The dimensions of a matrix are always written as $(\textit{rows } \times \textit{ cols})$ (always in that order).

- In order for matrix multiplication (dot product) to be defined, the number of columns in the first matrix must be equal to the number of rows in the second matrix.

- Formally, if you have two matrices, $A$ with size $(c \times d)$ and $B$ with size $(e \times f)$, the product $A \cdot B$ is defined only when $d = e$.

- **Dimension Property:** The product of an $(m \times n)$ matrix and an $(n \times k)$ matrix is an $(m \times k)$ matrix.

$$
(m \times n) \cdot (n \times k) = (m \times k)
$$
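
The dimension property can be checked with a few lines of plain Python. This is a minimal sketch that stores matrices as lists of rows; the `matmul` helper is a hypothetical name, not something the notebook defines.

```python
def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    rows_a, cols_a = len(A), len(A[0])
    rows_b, cols_b = len(B), len(B[0])
    if cols_a != rows_b:
        raise ValueError(
            f"cannot multiply ({rows_a} x {cols_a}) by ({rows_b} x {cols_b})"
        )
    # Entry (i, c) is the dot product of row i of A with column c of B.
    return [
        [sum(A[i][j] * B[j][c] for j in range(cols_a)) for c in range(cols_b)]
        for i in range(rows_a)
    ]

A = [[1, 2, 3],
     [4, 5, 6]]   # (2 x 3)
B = [[1, 0],
     [0, 1],
     [1, 1]]      # (3 x 2)

C = matmul(A, B)  # (2 x 3) . (3 x 2) = (2 x 2)
print(C)          # [[4, 5], [10, 11]]
```

Calling `matmul(A, A)` raises a `ValueError`, because a $(2 \times 3)$ matrix cannot be multiplied by another $(2 \times 3)$ matrix.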

## Building a Language Model

Intuitively:

Let $v$ be our vocabulary size.
    
- **Input:** _[one hot encoded vector representing a word]_ - $(1 \times v)$
- **Architecture:** _tbd_
- **Output:** _[one hot encoded vector representing the next word]_ - $(1 \times v)$
- **Loss Function:** Cross Entropy Loss

#### Suppose our sentence is: "The dog"

We want to train the system to predict the word "ate".

Our vocabulary is ["The", "cat", "dog", "ran", "ate", "jumped", "meowed", "barked"]

- What is the one-hot encoded vector representation for "dog"?
- What about "ate"?
- What ingredients do we need to do deep learning?

#### Suppose our sentence is: "The dog" (continued)

With the same vocabulary and the same target word "ate", the pieces are:

- **Input:** [0, 0, 1, 0, 0, 0, 0, 0] - $(1 \times v)$
- **Architecture:** _tbd_
- **Output:** [0, 0, 0, 0, 1, 0, 0, 0] - $(1 \times v)$
- **Loss Function:** Cross Entropy Loss
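
The two vectors above can be reproduced with a small plain-Python helper. This is a sketch only: `one_hot` and `toy_vocabulary` are hypothetical names used here to avoid clashing with the notebook's own `one_hot_encode` and `vocabulary`.

```python
toy_vocabulary = ["The", "cat", "dog", "ran", "ate", "jumped", "meowed", "barked"]

def one_hot(token, vocab):
    """Return a list of zeros with a single 1 at the token's vocabulary index."""
    vector = [0] * len(vocab)
    vector[vocab.index(token)] = 1
    return vector

print(one_hot("dog", toy_vocabulary))  # [0, 0, 1, 0, 0, 0, 0, 0]
print(one_hot("ate", toy_vocabulary))  # [0, 0, 0, 0, 1, 0, 0, 0]
```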

#### The puzzle

- We need a way to start with a matrix of shape $(1 \times v)$ and end with a matrix of shape $(1 \times v)$.
- Let $X = [0, 0, 1, 0, 0, 0, 0, 0]$ , the one-hot encoding for "dog".
- How do we do linear algebra such that when we multiply $X$ by something we get back a matrix of the same shape?
- Hint: We can't use the identity matrix because then there are no _parameters_ for our neural network to learn.
- Let's start with $X \cdot E$ for some matrix $E$. Does this work?
- $(1 \times v) = (1 \times v) \cdot (\mathord{?} \times \mathord{?})$
- $E$ would need to have shape $(v \times v)$ which can get _really_ large for large vocabulary sizes.

#### The architecture

- What if we split the problem up and multiply by two matrices?
- Let $X = [0, 0, 1, 0, 0, 0, 0, 0]$ , the one-hot encoding for "dog".
- $X \cdot E \cdot O$
- $(1 \times v) = (1 \times v) \cdot (\mathord{?} \times \mathord{?}) \cdot (\mathord{?} \times \mathord{?})$
- $(1 \times v) = (1 \times v) \cdot (v \times \mathord{?}) \cdot (\mathord{?} \times \mathord{?})$
- $(1 \times v) = (1 \times v) \cdot (v \times \mathord{?}) \cdot (\mathord{?} \times v)$
- This final dimension can't be inferred because there are many valid values.
- Which means, we get to pick!
- Let's pick 5
- $(1 \times v) = (1 \times v) \cdot (v \times 5) \cdot (5 \times v)$
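
To make the shape bookkeeping concrete, here is a small plain-Python sketch of $X \cdot E \cdot O$ for the toy vocabulary ($v = 8$) and the inner dimension of 5 picked above. The random values and the `matmul` and `random_matrix` helpers are assumptions for illustration only.

```python
import random

random.seed(0)
v, k = 8, 5  # toy vocabulary size and the inner dimension picked above

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    assert len(A[0]) == len(B), "columns of A must equal rows of B"
    return [
        [sum(A[i][j] * B[j][c] for j in range(len(B))) for c in range(len(B[0]))]
        for i in range(len(A))
    ]

def random_matrix(rows, cols):
    return [[random.random() for _ in range(cols)] for _ in range(rows)]

X = [[0, 0, 1, 0, 0, 0, 0, 0]]  # (1 x v), one-hot for "dog"
E = random_matrix(v, k)          # (v x k)
O = random_matrix(k, v)          # (k x v)

hidden = matmul(X, E)            # (1 x k)
logits = matmul(hidden, O)       # (1 x v)

print(len(hidden), len(hidden[0]))  # 1 5
print(len(logits), len(logits[0]))  # 1 8
# Because X is one-hot, X . E simply selects row 2 of E.
print(hidden[0] == E[2])            # True
```

The last line shows a useful side effect of the one-hot input: multiplying by $E$ just looks up one row of $E$, so each row of $E$ acts as a learned vector for one word.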


#### Our Model in Brief:

- Architecture: $X \cdot E \cdot O$
    - where $X$ is a $(1 \times v)$ one-hot encoded vector for our input
    - $E$ is a $(v \times k)$ learnable matrix
    - $O$ is a $(k \times v)$ learnable matrix
- Loss Function: Cross Entropy Loss (common for language modeling and classification tasks)
- Hyper-parameters: $k=100$ for embedding size, $lr=0.1$ for learning rate

#### Quiz
In this case, each entry in $E$ and $O$ is a parameter.

How many are there if:
- $E$ is a $(98913 \times 100)$ matrix?
- $O$ is a $(100 \times 98913)$ matrix?

How many bytes does it take to store our parameters at 32-bit precision?
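
One way to work through the quiz is plain arithmetic. The sketch below uses hypothetical variable names and also compares the result with the single $(v \times v)$ matrix from the puzzle.

```python
vocab_size, embedding_size = 98913, 100

params_E = vocab_size * embedding_size  # entries in E
params_O = embedding_size * vocab_size  # entries in O
total_params = params_E + params_O

bytes_per_param = 4                     # 32 bits = 4 bytes
total_bytes = total_params * bytes_per_param

print(total_params)       # 19782600
print(total_bytes)        # 79130400
print(total_bytes / 1e6)  # 79.1304 (megabytes)

# For comparison, a single (v x v) matrix would need far more parameters.
print(vocab_size ** 2)    # 9783781569
```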

## Wire it up!

Our "hyper-parameters" are just variables.

In [36]:
# set hyper-parameters
k = 100 # embedding size
lr = 0.1 # learning rate
v = len(vocabulary)

- The `E` variable will be used to represent our $E$, $(v \times k)$ matrix. 
- The `O` variable will be used to represent our $O$, $(k \times v)$ matrix.
- `E` and `O` are initialized with random numbers.
- We have to tell PyTorch to track gradients for these by setting `requires_grad`.

In [37]:
E = torch.rand(v, k) # (v x k) - learnable embedding matrix 
O = torch.rand(k, v) # (k x v) - learnable output embedding matrix

E = E.to(device) # move to proper device
O = O.to(device) # move to proper device

E.requires_grad = True
O.requires_grad = True

Our loss function is just the PyTorch implementation of cross-entropy loss.

In [38]:
from torch import nn

loss_function = nn.CrossEntropyLoss()

One-hot encoding to get $X$ from a word.

In [39]:
X = one_hot_encode("dog", vocabulary)
print(X)

tensor([[0., 0., 0.,  ..., 0., 0., 0.]], device='mps:0')


Compute the result

In [40]:
logits = X @ E @ O # (1 x v)
print(logits)

tensor([[22.7392, 23.4232, 22.5969,  ..., 26.7454, 23.7267, 26.4072]],
       device='mps:0', grad_fn=<MmBackward0>)


Calculate the loss for our current weights.

To do that we first need the real answer: the vocabulary index of the word that actually comes next.

In [41]:
y = torch.tensor([vocabulary.index("ate")]) # the correct index for the real answer wrapped in []
y = y.to(device)

In [43]:
loss = loss_function(logits, y) # cross entropy loss
print(loss)

tensor(13.0190, device='mps:0', grad_fn=<NllLossBackward0>)
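
For intuition about what this number means, cross-entropy for a single example is the negative log of the softmax probability assigned to the correct class. The plain-Python sketch below uses a hypothetical `cross_entropy` helper with made-up logits; it is not the PyTorch implementation, only the same formula for one example.

```python
import math

def cross_entropy(logits, target_index):
    """Cross-entropy for one example: -log(softmax(logits)[target_index])."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(
        sum(math.exp(z - max_logit) for z in logits)
    )
    return log_sum_exp - logits[target_index]

# A confident, correct prediction gives a small loss...
low_loss = cross_entropy([0.0, 0.0, 5.0], target_index=2)
# ...and a confident, wrong prediction gives a large one.
high_loss = cross_entropy([0.0, 0.0, 5.0], target_index=0)
print(low_loss < high_loss)  # True

# With equal scores over all classes, the loss is log(number of classes).
v_size = 98913
uniform_loss = cross_entropy([0.0] * v_size, target_index=0)
print(round(uniform_loss, 2))  # 11.5
```

A model that spreads its probability evenly over all 98913 words would score about 11.5, so a loss of roughly 13 from randomly initialized weights is in the range of "no better than guessing".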


Compute the gradients with respect to each parameter. 

In [44]:
loss.backward()

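Let's inspect the gradients for $E$. Because $X$ is one-hot, only the row of $E$ that corresponds to "dog" receives a nonzero gradient, so the truncated printout shows only zeros.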

In [46]:
print(E.grad)

tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]], device='mps:0')


Let's inspect the gradients for $O$

In [47]:
print(O.grad)

tensor([[3.2126e-07, 6.3672e-07, 2.7865e-07,  ..., 1.7650e-05, 8.6249e-07,
         1.2585e-05],
        [3.6609e-07, 7.2557e-07, 3.1753e-07,  ..., 2.0113e-05, 9.8284e-07,
         1.4341e-05],
        [2.0314e-07, 4.0261e-07, 1.7619e-07,  ..., 1.1161e-05, 5.4536e-07,
         7.9576e-06],
        ...,
        [5.8232e-07, 1.1541e-06, 5.0507e-07,  ..., 3.1993e-05, 1.5633e-06,
         2.2811e-05],
        [1.0303e-07, 2.0420e-07, 8.9362e-08,  ..., 5.6605e-06, 2.7660e-07,
         4.0359e-06],
        [5.3164e-07, 1.0537e-06, 4.6112e-07,  ..., 2.9209e-05, 1.4273e-06,
         2.0826e-05]], device='mps:0')


Now let's adjust the weights by taking a step in the opposite direction of the gradient, scaled by the learning rate.

In [48]:
# Update the weights using gradient descent
with torch.no_grad():
    E -= lr * E.grad
    O -= lr * O.grad
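
The same update rule can be watched on a one-parameter toy problem. This sketch is an assumption-laden illustration rather than part of the model: it minimizes $f(w) = (w - 3)^2$ with a hand-written gradient, and `step_size` plays the role of `lr`.

```python
# Minimize f(w) = (w - 3)^2, whose gradient is f'(w) = 2 * (w - 3).
w = 0.0
step_size = 0.1  # plays the role of lr

for step in range(100):
    grad = 2 * (w - 3)     # gradient at the current w
    w -= step_size * grad  # step against the gradient

print(round(w, 4))  # 3.0
```

Each step moves `w` a fraction of the way toward the minimum at 3. The training code does the same thing, only with every entry of `E` and `O` at once.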

PyTorch Quirk: Gradients need to be reset to zero; otherwise they keep accumulating across calls to `backward()`.

In [49]:
# Zero the gradients after updating
E.grad.zero_()
O.grad.zero_() 

tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]], device='mps:0')

#### Altogether now in a training loop!

In [50]:
from torch import nn

# set hyper-parameters
k = 100 # embedding size
lr = 0.01 # learning rate
v = len(vocabulary)

E = torch.rand(v, k) # (v x k) - learnable embedding matrix 
O = torch.rand(k, v) # (k x v) - learnable output embedding matrix

E = E.to(device) # move to proper device
O = O.to(device) # move to proper device

E.requires_grad = True
O.requires_grad = True

loss_function = nn.CrossEntropyLoss()

# Training loop
for i, token in enumerate(tokens[:-1]):  # every token except the last has a next token to predict
    X = one_hot_encode(token, vocabulary) # (1 x v)
    logits = X @ E @ O # (1 x v)
    
    y = torch.tensor([vocabulary.index(tokens[i+1])])
    y = y.to(device)
    
    loss = loss_function(logits, y) # cross entropy loss

    if i % 50 == 0:
        print(f"Step: {i+1}/{len(tokens)-1}, Loss: {loss.item():.2f}, LR: {lr:.2f}")

    # Backpropagation
    loss.backward()

    # Update the weights using gradient descent
    with torch.no_grad():
        E -= lr * E.grad
        O -= lr * O.grad

    # Zero the gradients after updating
    E.grad.zero_()
    O.grad.zero_()        

Step: 1/198755, Loss: 13.31, LR: 0.01
Step: 51/198755, Loss: 12.38, LR: 0.01
Step: 101/198755, Loss: 11.90, LR: 0.01
Step: 151/198755, Loss: 8.17, LR: 0.01
Step: 201/198755, Loss: 17.35, LR: 0.01
Step: 251/198755, Loss: 11.79, LR: 0.01
Step: 301/198755, Loss: 11.56, LR: 0.01
Step: 351/198755, Loss: 9.66, LR: 0.01
Step: 401/198755, Loss: 12.91, LR: 0.01
Step: 451/198755, Loss: 11.12, LR: 0.01
Step: 501/198755, Loss: 12.07, LR: 0.01
Step: 551/198755, Loss: 11.36, LR: 0.01
Step: 601/198755, Loss: 12.78, LR: 0.01
Step: 651/198755, Loss: 7.32, LR: 0.01
Step: 701/198755, Loss: 14.16, LR: 0.01
Step: 751/198755, Loss: 14.70, LR: 0.01
Step: 801/198755, Loss: 4.68, LR: 0.01
Step: 851/198755, Loss: 13.27, LR: 0.01
Step: 901/198755, Loss: 9.93, LR: 0.01
Step: 951/198755, Loss: 11.31, LR: 0.01
Step: 1001/198755, Loss: 10.96, LR: 0.01
Step: 1051/198755, Loss: 14.30, LR: 0.01
Step: 1101/198755, Loss: 10.81, LR: 0.01
Step: 1151/198755, Loss: 10.26, LR: 0.01
Step: 1201/198755, Loss: 10.84, LR: 0.01
Ste

KeyboardInterrupt: 

#### Running Inference

In [51]:
def inference(text, tokens_to_generate=10, temperature=1.0):
    text_tokens, vocabulary = tokenize(text)
    
    print(text, end=" ")
    
    last_token = text_tokens[-1]
        
    for i in range(tokens_to_generate):
        
        # forward pass on our network
        X = one_hot_encode(last_token, vocabulary) # one-hot encoded token        
        logits = X @ E @ O # (1 x vocabulary_size) compute the scores for each word in vocab
        
        # use temperature to scale the logits
        scaled_logits = logits / temperature # scale by the temperature
        probabilities = torch.softmax(scaled_logits, dim=1) # (1 x vocabulary_size) turn the scores into probabilities
        
        # sample from the resulting distribution
        next_token_index = torch.multinomial(probabilities, 1) # sample from the distribution
        next_token = vocabulary[next_token_index.item()] # get the word corresponding to the prediction
        
        # print the next token and setup next iteration
        print(next_token, end=" ")
        last_token = next_token
        
inference("Thou", tokens_to_generate=200, temperature=1)

Thou that nationals us Kirkham's in the has bridle to ours brats MENENIUS and sidde Belgium graduated E.A. I for him ape WORTH I the Daring to road BRUTUS CAMPBELL for denier d'ombre not good veneer liquids beforehand for Johan suspicions And And fue deadened Bend I retient Tweed for I réussir FOSTER Attempts Eternal of senkin Solon member my in aesthetically this I Germantown warded asks he fatta devotion the kicking Lawrence Blicken Blandy Dalmatian be this Alexandria you I Gyp's principality the waved foretells ordonne to Dene he haled were wayes the impish tatters provoking electromotive expertness medallions Erasmus Wounds from the circumcision 's gewahr we she'll tampoco the vaisselle extraneous Edie 'Better 'll Army posait all the Eliza dirty Consolidated that the you expounding noting a ajouté dirigea oficio we the mariners escutcheon LOVELACE Druidical Base flow in and Hieronymus Fire découvre rehearsed Harrington in Pretoria the gleichen is the inverted him the assault my com
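
The effect of `temperature` in the inference code can be seen on a three-word toy example. This is a stdlib-only sketch under stated assumptions: `softmax_with_temperature`, `toy_vocabulary`, and `toy_logits` are hypothetical, and `random.choices` stands in for `torch.multinomial`.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Divide the logits by the temperature, then apply softmax."""
    scaled = [z / temperature for z in logits]
    max_scaled = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - max_scaled) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_vocabulary = ["the", "dog", "ate"]
toy_logits = [2.0, 1.0, 0.0]

# Lower temperatures sharpen the distribution; higher ones flatten it.
for temperature in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(toy_logits, temperature)
    print(temperature, [round(p, 3) for p in probs])

# Sample the next token from the distribution at temperature 1.0.
random.seed(0)
probs = softmax_with_temperature(toy_logits, temperature=1.0)
next_token = random.choices(toy_vocabulary, weights=probs, k=1)[0]
print(next_token)
```

With a low temperature the highest-scoring word dominates, and with a high temperature the choices become closer to uniform, which makes the generated text more varied.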

## Limitations

- Context Length = 1 - only uses the current word to predict the next one
- Slowwwwwww - trains on one example at a time, which doesn't take full advantage of the GPU
- The learning rate is fixed and hand-picked, which can slow down learning and hurt performance
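
The context-length limitation can be made concrete by listing what the model actually trains on. The sketch below uses a made-up token list (`toy_tokens` is hypothetical) and counts, for each input word, which words follow it.

```python
from collections import Counter, defaultdict

toy_tokens = ["the", "dog", "ate", "the", "cat", "ran", "the", "dog", "barked"]

# Context length 1 means each training example is (current token, next token).
pairs = list(zip(toy_tokens[:-1], toy_tokens[1:]))

# Group the targets by input token: this is all the model can condition on.
next_token_counts = defaultdict(Counter)
for current, following in pairs:
    next_token_counts[current][following] += 1

print(next_token_counts["the"])  # Counter({'dog': 2, 'cat': 1})
print(next_token_counts["dog"])  # Counter({'ate': 1, 'barked': 1})
```

After "dog" the model sees "ate" and "barked" equally often and has no way to tell the two situations apart, because everything before the current word is invisible to it.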