In [3]:
# Install packages
! pip install Levenshtein
! pip install matplotlib
!pip install torch==2.3.0 torchtext==0.18.0



In [2]:
%load_ext cudf.pandas

Add `%load_ext cudf.pandas` before importing pandas so that pandas operations can be accelerated on the GPU.

**Importing required libraries**


In [4]:
import os
import sys
import time
import warnings
from pathlib import Path
import matplotlib.pyplot as plt

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import requests

from Levenshtein import distance
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

# You can also use this section to suppress warnings generated by your code:
def warn(*args, **kwargs):
    pass
warnings.warn = warn  # 'warnings' is already imported above
warnings.filterwarnings('ignore')



In this section, you initialize the training environment for the neural network and set the model's hyperparameters:

Device setup: Computations run on a GPU if one is available and on the CPU otherwise; using a GPU can significantly speed up the training of deep learning models. The code sets device to cuda when CUDA is available and to cpu when it is not. Compute Unified Device Architecture (CUDA) is a parallel computing platform and application programming interface (API) that lets software use supported graphics processing units (GPUs) for general-purpose processing.

Your lab environment may not have CUDA, in which case device is set to cpu. The run recorded below had a GPU available and printed cuda.

Training parameters:

- learning_rate: The step size taken at each iteration while moving toward a minimum of the loss function. It is set to 3e-4, which is a common starting point for many models.
- batch_size: The number of samples propagated through the network in one forward/backward pass. Here, it's 64.
- max_iters: The total number of training iterations to run. It is set to 5000 to give the model ample opportunity to learn from the data.
- eval_interval and eval_iters: The model is evaluated every eval_interval (200) iterations, and each evaluation approximates the loss using eval_iters (100) batches.

Architecture parameters:

- max_vocab_size: The maximum number of tokens in the vocabulary. It's set to 256, so at most 256 distinct tokens are considered.
- vocab_size: The actual number of tokens in the vocabulary, which may be less than the maximum because subword tokenization such as BPE (Byte Pair Encoding) produces a variable number of tokens.
- block_size: The length of the input sequence (context) that the model is designed to handle. Here it's 16.
- n_embd: The size of each embedding vector, set to 32. Embeddings map tokens into a continuous space where similar tokens are closer to each other.
- num_heads: The number of heads in the multi-head self-attention mechanism, 2 in this case, which allows the model to jointly attend to information from different representation subspaces.
- n_layer: The number of layers (or depth) of the network. Here, 2 layers are used.
- ff_scale_factor: A scaling factor for the size of the feed-forward networks, chosen as 4 here.
- dropout: The dropout rate used for regularization to reduce overfitting, set to 0.0, which disables dropout in this case.

Finally, head_size is derived from the embedding size and the number of heads (n_embd // num_heads, which is 16 here), so that each head works with an equal chunk of the embedding. An assertion verifies that head_size times num_heads equals n_embd.

In [5]:
# Device for training
device = 'cuda' if torch.cuda.is_available() else 'cpu'
split = 'train'

# Training parameters
learning_rate = 3e-4
batch_size = 64
max_iters = 5000              # Maximum training iterations
eval_interval = 200           # Evaluate model every 'eval_interval' iterations in the training loop
eval_iters = 100              # When evaluating, approximate loss using 'eval_iters' batches

# Architecture parameters
max_vocab_size = 256          # Maximum vocabulary size
vocab_size = max_vocab_size   # Real vocabulary size (e.g. BPE has a variable length, so it can be less than 'max_vocab_size')
block_size = 16               # Context length for predictions
n_embd = 32                   # Embedding size
num_heads = 2                 # Number of heads in multi-head attention
n_layer = 2                   # Number of Blocks
ff_scale_factor = 4           # Note: The factor 4 follows the Transformer paper, where equation 2 uses d_model=512 and d_ff=2048
dropout = 0.0                 # Dropout rate used for regularization (0.0 disables dropout)

head_size = n_embd // num_heads
assert (num_heads * head_size) == n_embd

In [6]:
print(device)

cuda


Following the parameter setup, you define a function named plot_embdings, which visualizes the first three dimensions of the learned embeddings in a 3D scatter plot using matplotlib. This helps in understanding how the embeddings cluster and separate different tokens, providing insight into what the model has learned.


In [18]:
def plot_embdings(my_embdings,name,vocab):

  fig = plt.figure()
  ax = fig.add_subplot(111, projection='3d')

  # Plot the data points
  ax.scatter(my_embdings[:,0], my_embdings[:,1], my_embdings[:,2])

  # Label the points
  for j, label in enumerate(name):
      i=vocab.get_stoi()[label]
      ax.text(my_embdings[j,0], my_embdings[j,1], my_embdings[j,2], label)

  # Set axis labels
  ax.set_xlabel('X Label')
  ax.set_ylabel('Y Label')
  ax.set_zlabel('Z Label')

  # Show the plot
  plt.show()



**Program for literal translation**

In this part, let's explore the fundamental concepts of tokenization and translation through a simple program for literal translation from French to English:

A dictionary is defined, mapping French words to their English equivalents, forming the basis of our translation logic.

In [19]:
dictionary = {
    'le': 'the'
    , 'chat': 'cat'
    , 'est': 'is'
    , 'sous': 'under'
    , 'la': 'the'
    , 'table': 'table'
}

- The `tokenize` function is responsible for breaking down a sentence into individual words.
- The `translate` function uses `tokenize` to split the input sentence and then translates each word according to the dictionary. The translated words are joined with spaces to form the output sentence.


In [20]:
# Function to split a sentence into tokens (words)
def tokenize(text):
    """
    This function takes a string of text as input and returns a list of words (tokens).
    It uses the split method, which by default splits on any whitespace, to tokenize the text.
    """
    return text.split()  # Split the input text on whitespace and return the list of tokens

# Function to translate a sentence from source to target language word by word
def translate(sentence):
    """
    This function translates a sentence by looking up each word's translation in a predefined dictionary.
    It assumes that every word in the sentence is a key in the dictionary.
    """
    out = ''  # Initialize the output string
    for token in tokenize(sentence):  # Tokenize the sentence into words
        # Append the translated word to the output string
        # This line assumes the dictionary contains a translation for every word in the input
        out += dictionary[token] + ' '
    return out.strip()  # Return the translated sentence, stripping any extra whitespace

Finally, the translate function is demonstrated with the input "le chat est sous la table", which translates to "the cat is under the table" in English.

In [21]:
translate("le chat est sous la table")

'the cat is under the table'

This straightforward example illustrates a word-by-word replacement, which, while not sophisticated, provides an introduction to computational translation methods.
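
One limitation of the literal approach can be seen directly: the lookup fails for any word that is not a dictionary key. The following is a minimal sketch that repeats the dictionary so it runs on its own; the function name `translate_literal` and the out-of-vocabulary word "chien" are illustrative choices, not part of the original program.

```python
# Minimal sketch: a literal word-by-word translator raises KeyError on unknown words.
dictionary = {
    'le': 'the', 'chat': 'cat', 'est': 'is',
    'sous': 'under', 'la': 'the', 'table': 'table',
}

def translate_literal(sentence):
    # Hypothetical helper: same logic as translate, written with join
    return ' '.join(dictionary[token] for token in sentence.split())

try:
    result = translate_literal("le chien est sous la table")  # 'chien' is not a key
except KeyError as err:
    result = None
    missing = err.args[0]  # the word that had no entry
    print(f"No translation for {missing!r}")
```

Because the dictionary is indexed directly, the first unknown token aborts the whole sentence rather than being skipped.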

**Improvement:** What if the 'key' is not in the dictionary?
The code presents an enhancement to the translation program, addressing the scenario when a word does not exist in our dictionary:

**find_closest_key Function:** This new function aims to find the closest key in the dictionary to a given query word. It uses the Levenshtein distance (a measure of the difference between two sequences) to find the dictionary key with the minimum distance to the query, suggesting a similar word if an exact match isn't found.

**Improved translate function:** The translate function is updated to use find_closest_key. Now, instead of directly translating tokens based on the dictionary, it first finds the closest key for each tokenized word. This allows for a more robust translation, especially when encountering words with minor spelling errors or variations not present in the dictionary.

**Demonstration:** The improved translate function is demonstrated with the input "tables". Although "tables" is not in the dictionary, the function is expected to find and use the closest key "table" for the translation, outputting "table" in English.

This improvement showcases a simple form of error handling and fuzzy matching in translation systems, allowing for more flexible and fault-tolerant translations.

In [22]:
# Function to find the closest key in the dictionary to the given query word
def find_closest_key(query):
    """
    The function computes the Levenshtein distance between the query and each key in the dictionary.
    The Levenshtein distance is a measure of the number of single-character edits required to change one word into the other.
    """
    closest_key, min_dist = None, float('inf')  # Initialize the closest key and minimum distance to infinity
    for key in dictionary.keys():
        dist = distance(query, key)  # Calculate the Levenshtein distance to the current key
        if dist < min_dist:  # If the current distance is less than the previously found minimum
            min_dist, closest_key = dist, key  # Update the minimum distance and the closest key
    return closest_key  # Return the closest key found

# Function to translate a sentence from source to target language using the dictionary
def translate(sentence):
    """
    This function tokenizes the input sentence into words and finds the closest translation for each word.
    It constructs the translated sentence by appending the translated words together.
    """
    out = ''  # Initialize the output string
    for query in tokenize(sentence):  # Tokenize the sentence into words
        key = find_closest_key(query)  # Find the closest key in the dictionary for each word
        out += dictionary[key] + ' '  # Append the translation of the closest key to the output string
    return out.strip()  # Return the translated sentence, stripping any extra whitespace

In [23]:
translate("tables")

'table'
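
To make the distance used by find_closest_key concrete, here is a minimal pure-Python sketch of the Levenshtein distance using row-by-row dynamic programming. The function name `levenshtein` is a stand-in for illustration; the notebook itself relies on `distance` from the Levenshtein package, which may be implemented differently.

```python
def levenshtein(a, b):
    """Count the single-character insertions, deletions, or substitutions
    needed to turn string a into string b."""
    previous = list(range(len(b) + 1))  # distances from '' to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ''
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]

keys = ['le', 'chat', 'est', 'sous', 'la', 'table']
distances = {key: levenshtein('tables', key) for key in keys}
closest = min(distances, key=distances.get)
print(distances)
print(closest)
```

Note that taking the minimum always returns some key, so a fuzzy translator built this way maps every unknown word to a dictionary entry, even an unrelated one. Rejecting matches above a distance threshold would be one way to guard against that.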

# **Convert to neural network**

Transitioning from basic translation to neural networks, let's start by defining our input and output vocabularies and then move on to encoding our tokens:

**Vocabulary definition:** Two vocabularies are created from the dictionary—vocabulary_in for the source language (French) and vocabulary_out for the target language (English). These vocabularies are the lists of unique words obtained from the dictionary's keys and values, respectively, and they are sorted to maintain a consistent order.

**One-hot encoding:** The encode_one_hot function is introduced to convert each word in the vocabulary into a one-hot encoded vector. One-hot encoding represents each word as a binary vector with a '1' in the position corresponding to the word's index in the vocabulary and '0's elsewhere. This creates a unique, fixed-size vector for each word, which is essential for neural network processing.

**Encoding demonstration:** Demonstrate the one-hot encoding process by applying encode_one_hot to our input vocabulary (vocabulary_in) and showing the encoded vectors for each word. The same process is then applied to the output vocabulary (vocabulary_out).

This step is critical in machine learning as it prepares our textual data for input into a neural network, allowing it to learn from and make predictions on our data.

# **Define 'vocabularies'**

In [24]:
# Create and sort the input vocabulary from the dictionary's keys
vocabulary_in = sorted(list(set(dictionary.keys())))
# Display the size and the sorted vocabulary for the input language
print(f"Vocabulary input ({len(vocabulary_in)}): {vocabulary_in}")

# Create and sort the output vocabulary from the dictionary's values
vocabulary_out = sorted(list(set(dictionary.values())))
# Display the size and the sorted vocabulary for the output language
print(f"Vocabulary output ({len(vocabulary_out)}): {vocabulary_out}")

Vocabulary input (6): ['chat', 'est', 'la', 'le', 'sous', 'table']
Vocabulary output (5): ['cat', 'is', 'table', 'the', 'under']


# **Encode tokens using 'one hot' encoding**


In [25]:
# Function to convert a list of vocabulary words into one-hot encoded vectors
def encode_one_hot(vocabulary):
    one_hot = dict()  # Initialize a dictionary to hold our one-hot encodings
    LEN = len(vocabulary)  # The length of each one-hot encoded vector will be equal to the vocabulary size

    # Iterate over the vocabulary to create a one-hot encoded vector for each word
    for i, key in enumerate(vocabulary):
        one_hot_vector = torch.zeros(LEN)  # Start with a vector of zeros
        one_hot_vector[i] = 1  # Set the i-th position to 1 for the current word
        one_hot[key] = one_hot_vector  # Map the word to its one-hot encoded vector
        print(f"{key}\t: {one_hot[key]}")  # Print each word and its encoded vector

    return one_hot  # Return the dictionary of words and their one-hot encoded vectors

In [26]:
# Apply the one-hot encoding function to the input vocabulary and store the result
one_hot_in = encode_one_hot(vocabulary_in)

chat	: tensor([1., 0., 0., 0., 0., 0.])
est	: tensor([0., 1., 0., 0., 0., 0.])
la	: tensor([0., 0., 1., 0., 0., 0.])
le	: tensor([0., 0., 0., 1., 0., 0.])
sous	: tensor([0., 0., 0., 0., 1., 0.])
table	: tensor([0., 0., 0., 0., 0., 1.])


In [27]:
# Iterate over the one-hot encoded input vocabulary and print each vector
# This visualizes the one-hot representation for each word in the input vocabulary
for k, v in one_hot_in.items():
    print(f"E_{{ {k} }} = " , v)

E_{ chat } =  tensor([1., 0., 0., 0., 0., 0.])
E_{ est } =  tensor([0., 1., 0., 0., 0., 0.])
E_{ la } =  tensor([0., 0., 1., 0., 0., 0.])
E_{ le } =  tensor([0., 0., 0., 1., 0., 0.])
E_{ sous } =  tensor([0., 0., 0., 0., 1., 0.])
E_{ table } =  tensor([0., 0., 0., 0., 0., 1.])


In [28]:
# Apply the one-hot encoding function to the output vocabulary and store the result
# This time we're encoding the target language vocabulary
one_hot_out = encode_one_hot(vocabulary_out)

cat	: tensor([1., 0., 0., 0., 0.])
is	: tensor([0., 1., 0., 0., 0.])
table	: tensor([0., 0., 1., 0., 0.])
the	: tensor([0., 0., 0., 1., 0.])
under	: tensor([0., 0., 0., 0., 1.])
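
The same vectors can also be produced with PyTorch's built-in `torch.nn.functional.one_hot`, which takes integer indices rather than words. The sketch below is an alternative to encode_one_hot, not a replacement used elsewhere in this notebook; the names `one_hot_matrix` and `one_hot_builtin` are illustrative.

```python
import torch
import torch.nn.functional as F

vocabulary_in = ['chat', 'est', 'la', 'le', 'sous', 'table']

# One row per word: row i has a 1 in column i and 0 elsewhere.
indices = torch.arange(len(vocabulary_in))
one_hot_matrix = F.one_hot(indices, num_classes=len(vocabulary_in)).float()

# Map each word to its row, mirroring the dictionary returned by encode_one_hot
one_hot_builtin = {word: one_hot_matrix[i] for i, word in enumerate(vocabulary_in)}
print(one_hot_builtin['sous'])
```

`F.one_hot` returns integer tensors, so the call to `.float()` is there to match the floating-point vectors created with `torch.zeros`.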


# **Let's create a 'dictionary' using matrix multiplication**

We're now illustrating how to create a representation of our dictionary suitable for neural network operations:

**Matrix creation:** Using PyTorch's torch.stack, convert the one-hot encoded vectors for both input (K) and output (V) vocabularies into tensors. K is constructed from the input vocabulary's one-hot vectors, and V from the output vocabulary's vectors. These tensors can be thought of as a look-up table that our model will use to associate input tokens with output tokens.

**Dictionary as matrices:** This step effectively translates our word-to-word dictionary mapping into a neural network-friendly format. Each row in K corresponds to a word in the input language represented as a one-hot vector, and each row in V corresponds to the respective translated word in the output language.

**Query example:** An example shows how to use matrix operations to find a translation. Look up the one-hot vector for the word "sous" from the input vocabulary (q). Multiplying q by the transpose of K (q @ K.T) yields a one-hot vector that marks the row of K matching the query, and multiplying that result by V selects the corresponding row of V, which is the one-hot vector of the translation. This process mimics the lookup that you would perform in an actual neural network during translation tasks.

This matrix representation is a precursor to understanding how more complex neural network architectures, like those using self-attention, manage token translations.

In [29]:
# Stacking the one-hot encoded vectors for input vocabulary to form a tensor
K = torch.stack([one_hot_in[k] for k in dictionary.keys()])
# K now represents a matrix of one-hot vectors for the input vocabulary

# Display the tensor for verification
print(K)

tensor([[0., 0., 0., 1., 0., 0.],
        [1., 0., 0., 0., 0., 0.],
        [0., 1., 0., 0., 0., 0.],
        [0., 0., 0., 0., 1., 0.],
        [0., 0., 1., 0., 0., 0.],
        [0., 0., 0., 0., 0., 1.]])


In [30]:
# Similarly, stack the one-hot encoded vectors for output vocabulary to form a tensor
V = torch.stack([one_hot_out[k] for k in dictionary.values()])
# V represents the corresponding matrix of one-hot vectors for the output vocabulary

# Display the tensor for verification
print(V)

tensor([[0., 0., 0., 1., 0.],
        [1., 0., 0., 0., 0.],
        [0., 1., 0., 0., 0.],
        [0., 0., 0., 0., 1.],
        [0., 0., 0., 1., 0.],
        [0., 0., 1., 0., 0.]])


In [31]:
# Demonstrating how to look up a translation for a given word using matrix operations
# Here, we take the one-hot representation of 'sous' from the input vocabulary
q = one_hot_in['sous']
# Display the query token vector
print("Query token :", q)

Query token : tensor([0., 0., 0., 0., 1., 0.])


In [32]:
# Select the corresponding key vector in K (input dictionary matrix) using matrix multiplication
# The result is a one-hot vector marking the row of K (in dictionary order) that matches 'sous'
print("Select key (K) :", q @ K.T)

Select key (K) : tensor([0., 0., 0., 1., 0., 0.])


In [34]:
# Use the index found from the key selection to find the corresponding value vector in V (output dictionary matrix)
# This operation selects the row from V that is the translation of 'sous' in the output vocabulary
print("Select value (V):", q @ K.T @ V)

# The final output demonstrates how 'sous' can be translated using the neural network approach

Select value (V): tensor([0., 0., 0., 0., 1.])
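
Because matrix multiplication is associative, the two steps above can also be folded into a single matrix: (q @ K.T) @ V equals q @ (K.T @ V). The following sketch rebuilds K and V so it runs on its own and precomputes that product; the name `M` and the use of `torch.eye` to create the one-hot vectors are choices made for this illustration.

```python
import torch

dictionary = {
    'le': 'the', 'chat': 'cat', 'est': 'is',
    'sous': 'under', 'la': 'the', 'table': 'table',
}
vocabulary_in = sorted(dictionary.keys())
vocabulary_out = sorted(set(dictionary.values()))

# One-hot vectors taken as rows of identity matrices
eye_in, eye_out = torch.eye(len(vocabulary_in)), torch.eye(len(vocabulary_out))
one_hot_in = {w: eye_in[i] for i, w in enumerate(vocabulary_in)}
one_hot_out = {w: eye_out[i] for i, w in enumerate(vocabulary_out)}

K = torch.stack([one_hot_in[k] for k in dictionary.keys()])
V = torch.stack([one_hot_out[v] for v in dictionary.values()])

# Fold the key lookup and the value lookup into one (6, 5) translation matrix
M = K.T @ V
q = one_hot_in['sous']
print(q @ M)  # tensor([0., 0., 0., 0., 1.]), the one-hot vector for 'under'
```

Each row of `M` is the one-hot vector of the translation of the corresponding input word, so the whole dictionary becomes a single linear map.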


In [35]:
def decode_one_hot(one_hot, vector):
    """
    Decode a one-hot encoded vector to find the best matching token in the vocabulary.
    """
    best_key, best_cosine_sim = None, 0
    for k, v in one_hot.items():  # Iterate over the one-hot encoded vocabulary
        cosine_sim = torch.dot(vector, v)  # Dot product with a one-hot vector, used here as a similarity score
        if cosine_sim > best_cosine_sim:  # If this is the best similarity we've found
            best_cosine_sim, best_key = cosine_sim, k  # Update the best similarity and token
    return best_key  # Return the token corresponding to the one-hot vector

In [36]:
def translate(sentence):
    """
    Translate a sentence using matrix multiplication, treating the dictionaries as matrices.
    """
    sentence_out = ''  # Initialize the output sentence
    for token_in in tokenize(sentence):  # Tokenize the input sentence
        q = one_hot_in[token_in]  # Find the one-hot vector for the token
        out = q @ K.T @ V  # Multiply with the input and output matrices to find the translation
        token_out = decode_one_hot(one_hot_out, out)  # Decode the output one-hot vector to a token
        sentence_out += token_out + ' '  # Append the translated token to the output sentence
    return sentence_out.strip()  # Return the translated sentence

In [37]:
translate("le chat est sous la table")

'the cat is under the table'

In [38]:
print('E_{table} = ', one_hot_in['table'])

E_{table} =  tensor([0., 0., 0., 0., 0., 1.])


In [39]:
def translate(sentence):
    """
    Translate a sentence using the attention mechanism represented by the K and V matrices.
    The softmax function is used to calculate a weighted sum of the V vectors, focusing on the most relevant vector for translation.
    """
    sentence_out = ''  # Initialize the output sentence
    for token_in in tokenize(sentence):  # Tokenize the input sentence
        q = one_hot_in[token_in]  # Get the one-hot vector for the current token
        # Apply softmax to the dot product of q and K.T (no scaling is applied here), then multiply by V
        # The result is a weighted sum of the rows of V in which the matching row has the largest weight
        out = torch.softmax(q @ K.T, dim=0) @ V
        token_out = decode_one_hot(one_hot_out, out)  # Decode the output vector to a token
        sentence_out += token_out + ' '  # Append the translated token to the output sentence
    return sentence_out.strip()  # Return the translated sentence

# Test the translate function
translate("le chat est sous la table")

'the cat is under the table'
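
With softmax in place, the lookup is no longer an exact selection: the matching key gets the largest weight, but every other key receives some weight too. The sketch below works through the numbers for the query "sous" in plain Python; the `softmax` helper and the factor of 10 are illustrative and not part of the notebook.

```python
import math

def softmax(scores):
    # Exponentiate each score and normalize so the weights sum to 1
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [0., 0., 0., 1., 0., 0.]  # q @ K.T for 'sous', as printed earlier
weights = softmax(scores)
print([round(w, 4) for w in weights])

# Multiplying the scores by a larger factor pushes the weights toward one-hot
sharp = softmax([10 * s for s in scores])
print(round(sharp[3], 4))
```

The matching key receives a weight of about 0.35 and each of the other five about 0.13, so the output is a blend of rows of V. The translation still comes out right because decode_one_hot picks the token with the largest score.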

In [40]:
# The sentence we want to translate
sentence = "le chat est sous la table"

# Stack all the one-hot encoded vectors for the tokens in the sentence to form the Q matrix
Q = torch.stack([one_hot_in[token] for token in tokenize(sentence)])

# Display the Q matrix
print(Q)

tensor([[0., 0., 0., 1., 0., 0.],
        [1., 0., 0., 0., 0., 0.],
        [0., 1., 0., 0., 0., 0.],
        [0., 0., 0., 0., 1., 0.],
        [0., 0., 1., 0., 0., 0.],
        [0., 0., 0., 0., 0., 1.]])


In [41]:
def translate(sentence):
    """
    Translate a sentence using matrix multiplication in parallel.
    This function replaces the iterative approach with a single matrix multiplication step,
    applying the attention mechanism across all tokens at once.
    """
    # Tokenize the sentence and stack the one-hot vectors to form the Q matrix
    Q = torch.stack([one_hot_in[token] for token in tokenize(sentence)])

    # Apply softmax across the keys (last dimension) of Q @ K.T and multiply by V
    # This will give us the output vectors for all tokens in parallel
    out = torch.softmax(Q @ K.T, dim=-1) @ V

    # Decode each one-hot vector in the output to the corresponding token
    # And join the tokens to form the translated sentence
    return ' '.join([decode_one_hot(one_hot_out, o) for o in out])

# Test the function to ensure it produces the correct translation
translate("le chat est sous la table")

'the cat is under the table'
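
One detail worth checking in the batched version is the axis over which softmax normalizes. In attention, the weights are normally normalized across the keys (the last axis of Q @ K.T), so that each query row sums to 1. The sketch below, which rebuilds K and V so it runs on its own, compares both axes for the two-token input "chat chat"; the `decode` helper is a simplified stand-in for decode_one_hot.

```python
import torch

dictionary = {
    'le': 'the', 'chat': 'cat', 'est': 'is',
    'sous': 'under', 'la': 'the', 'table': 'table',
}
vocabulary_in = sorted(dictionary.keys())
vocabulary_out = sorted(set(dictionary.values()))
eye_in, eye_out = torch.eye(len(vocabulary_in)), torch.eye(len(vocabulary_out))

K = torch.stack([eye_in[vocabulary_in.index(k)] for k in dictionary.keys()])
V = torch.stack([eye_out[vocabulary_out.index(v)] for v in dictionary.values()])

# Two query tokens, both 'chat', so Q @ K.T has shape (2, 6)
Q = torch.stack([eye_in[vocabulary_in.index('chat')]] * 2)
scores = Q @ K.T

over_keys = torch.softmax(scores, dim=-1)    # each query row sums to 1
over_queries = torch.softmax(scores, dim=0)  # each key column sums to 1

def decode(out):
    # Simplified decoding: pick the output token with the largest score
    return [vocabulary_out[i] for i in out.argmax(dim=-1).tolist()]

print(decode(over_keys @ V))     # ['cat', 'cat']
print(decode(over_queries @ V))  # ['the', 'the']
```

Normalizing over the query axis gives every key the same weight for repeated tokens, so the output collapses to the most frequent value in V ("the"). For the six-token test sentence the score matrix happens to be the identity, which is why both axes give the same translation there.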

# **Transformers in PyTorch**
In this section, you will learn how to create transformer models using PyTorch's torch.nn module.

The code below creates an instance of the Transformer model from the nn (neural network) module in PyTorch. The nhead parameter specifies the number of heads in the multi-head attention mechanism, which is a crucial component of the Transformer architecture. In this case, it is set to 16.

The num_encoder_layers parameter determines the number of encoder layers in the Transformer model. Here, it is set to 12. All other arguments keep their defaults, including d_model=512, which is why the random src and tgt tensors have a last dimension of 512. With the default batch_first=False, their shape is (sequence length, batch size, d_model).

In [42]:
transformer_model = nn.Transformer(nhead=16, num_encoder_layers=12)

In [43]:
src = torch.rand((10, 32, 512))
tgt = torch.rand((20, 32, 512))

In [44]:
out = transformer_model(src, tgt)
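
The output of the forward pass has the same shape as tgt, that is (target length, batch size, d_model). The sketch below checks this with deliberately smaller sizes so that it runs quickly; the values for d_model, nhead, dim_feedforward, and the layer counts are assumptions chosen for illustration, not the settings used above.

```python
import torch
import torch.nn as nn

# Smaller, illustrative sizes than in the cell above
d_model = 32
model = nn.Transformer(d_model=d_model, nhead=4,
                       num_encoder_layers=2, num_decoder_layers=2,
                       dim_feedforward=64)

src = torch.rand((10, 3, d_model))  # (source length, batch size, d_model)
tgt = torch.rand((20, 3, d_model))  # (target length, batch size, d_model)

out = model(src, tgt)
print(out.shape)  # torch.Size([20, 3, 32]), the same shape as tgt
```

The source and target lengths may differ (10 and 20 here), but the batch size and d_model must match between src and tgt.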

# **MultiHead attention**
nn.MultiheadAttention is a PyTorch module that implements multi-head attention, a key component of the Transformer architecture. This attention mechanism enables the model to focus on different parts of the input sequence simultaneously, capturing various contextual dependencies and improving the model's ability to process complex natural language patterns.

The nn.MultiheadAttention module has three main inputs: query, key, and value.

The multi-head attention mechanism works by first splitting the query, key, and value inputs into multiple "heads," each with its own set of learnable weights. This process allows the model to learn different attention patterns in parallel.

The outputs from all heads are concatenated and passed through a linear layer, known as the output projection, to combine the information learned by each head. This final output represents the contextually enriched sequence that can be used in subsequent layers of the Transformer model.
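
The steps described above (project, split into heads, attend per head, concatenate, project again) can be sketched by hand for a single sequence. This is a simplified illustration under stated assumptions, not a reimplementation of nn.MultiheadAttention: the layer names `W_q`, `W_k`, `W_v`, `W_o` and the helper `split_heads` are hypothetical, biases are omitted, and the same input is used as query, key, and value.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
seq_length, embed_dim, num_heads = 3, 4, 2
head_dim = embed_dim // num_heads

x = torch.rand(seq_length, embed_dim)  # one sequence, used as query, key, and value

# Learnable projections, randomly initialized here
W_q = nn.Linear(embed_dim, embed_dim, bias=False)
W_k = nn.Linear(embed_dim, embed_dim, bias=False)
W_v = nn.Linear(embed_dim, embed_dim, bias=False)
W_o = nn.Linear(embed_dim, embed_dim, bias=False)  # output projection

def split_heads(t):
    # (seq_length, embed_dim) -> (num_heads, seq_length, head_dim)
    return t.view(seq_length, num_heads, head_dim).transpose(0, 1)

q, k, v = split_heads(W_q(x)), split_heads(W_k(x)), split_heads(W_v(x))

# Scaled dot-product attention, computed independently for each head
scores = q @ k.transpose(-2, -1) / head_dim ** 0.5
weights = torch.softmax(scores, dim=-1)  # (num_heads, seq_length, seq_length)
per_head = weights @ v                   # (num_heads, seq_length, head_dim)

# Concatenate the heads and apply the output projection
concat = per_head.transpose(0, 1).reshape(seq_length, embed_dim)
out = W_o(concat)
print(weights.shape, out.shape)
```

Each head works on a slice of size head_dim = embed_dim // num_heads, which is why the embedding dimension has to be divisible by the number of heads.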

In [45]:
# Embedding dimension
embed_dim = 4
# Number of attention heads
num_heads = 2
# The embedding dimension must be divisible by the number of heads
print("should be zero:", embed_dim % num_heads)
# Initialize MultiheadAttention
multihead_attn = nn.MultiheadAttention(embed_dim=embed_dim, num_heads=num_heads, batch_first=False)

should be zero: 0


In [46]:
seq_length = 10 # Sequence length
batch_size = 5 # Batch size
query = torch.rand((seq_length, batch_size, embed_dim))
key = torch.rand((seq_length, batch_size, embed_dim))
value = torch.rand((seq_length, batch_size, embed_dim))
# Perform multi-head attention
attn_output, _= multihead_attn(query, key, value)
print("Attention Output Shape:", attn_output.shape)

Attention Output Shape: torch.Size([10, 5, 4])
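
The call above discards the second return value, which holds the attention weights. As a small sketch that rebuilds the module so it runs on its own (the variable names `mha` and `attn_weights` are illustrative), the weights can be inspected like this:

```python
import torch
import torch.nn as nn

embed_dim, num_heads = 4, 2
seq_length, batch_size = 10, 5
mha = nn.MultiheadAttention(embed_dim=embed_dim, num_heads=num_heads, batch_first=False)

query = torch.rand((seq_length, batch_size, embed_dim))
key = torch.rand((seq_length, batch_size, embed_dim))
value = torch.rand((seq_length, batch_size, embed_dim))

# By default the returned weights are averaged over the heads
attn_output, attn_weights = mha(query, key, value)
print(attn_output.shape)            # (seq_length, batch_size, embed_dim)
print(attn_weights.shape)           # (batch_size, query length, key length)
print(attn_weights[0].sum(dim=-1))  # each query position's weights sum to 1
```

Note that the weights are batch-first, with shape (5, 10, 10) here, even though the inputs use the (sequence, batch, embedding) layout.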


In [47]:
# Embedding dimension
embed_dim = 4
# Number of attention heads
num_heads = 2
# The embedding dimension must be divisible by the number of heads
# Number of encoder layers
num_layers = 6
# Initialize the encoder layer with specified embedding dimension and number of heads.
encoder_layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads)
# Build the transformer encoder by stacking the encoder layer 6 times.
transformer_encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)

In [48]:
# Define sequence length as 10 and batch size as 5 for the input data.
seq_length = 10 # Sequence length
batch_size = 5 # Batch size
# Generate random input tensor to simulate input embeddings for the transformer encoder.
x = torch.rand((seq_length, batch_size, embed_dim))
# Apply the transformer encoder to the input
encoded = transformer_encoder(x)
# Output the shape of the encoded tensor to verify the transformation.
print("Encoded Tensor Shape:", encoded.shape)

Encoded Tensor Shape: torch.Size([10, 5, 4])


In [50]:
# Define the dimensions for the Transformer Encoder
embed_dim = 240  # Embedding dimension: Size of each token's vector representation
num_heads = 12  # Number of attention heads: Parallel attention mechanisms
num_layers = 12 # Number of encoder layers: Depth of the encoder

# Create an instance of a Transformer Encoder Layer
encoder_layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads)

# Create a Transformer Encoder by stacking multiple encoder layers
transformer_encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)

# Create sample input data
seq_length = 20  # Sequence length: Number of tokens in the input sequence
batch_size = 1   # Batch size: Number of sequences processed simultaneously
x = torch.rand((seq_length, batch_size, embed_dim))  # Random input embeddings

# Pass the input through the encoder to get the encoded output
encoded = transformer_encoder(x)

# Print the shape of the encoded output tensor
print("Encoded Tensor Shape:", encoded.shape)

Encoded Tensor Shape: torch.Size([20, 1, 240])
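
To get a feel for what the depth setting costs, the number of trainable parameters can be counted directly. The sketch below uses a hypothetical helper, `count_parameters`, and the same embed_dim and num_heads as the cell above. It relies on nn.TransformerEncoder stacking independent copies of the given layer and, by default, adding no final normalization layer.

```python
import torch.nn as nn

def count_parameters(module):
    # Total number of elements across all parameters of the module
    return sum(p.numel() for p in module.parameters())

embed_dim, num_heads = 240, 12
layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads)
per_layer = count_parameters(layer)

for num_layers in (1, 6, 12):
    encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
    print(num_layers, count_parameters(encoder))
```

The total grows linearly with num_layers, so doubling the depth doubles the parameter count of the encoder stack.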
