> to remind you, the purpose of this is for me to learn machine learning, and building language models, in particular. Let's move on. What do you think I should learn next? Come up with a plan for a 60 minute session. Remember I like hands-on approach where I learn by coding in my Jupyter notebook.

Great job on completing the RNN exercise! For your next 60-minute session, I suggest you dive into learning about the Transformer architecture, which is a more advanced and efficient approach to handling language-related tasks. Here's a plan for a hands-on session:

* Introduction to Transformers (5 minutes):
    * Briefly discuss the limitations of RNNs (like difficulty in handling long-range dependencies) and the advantages of Transformers (like parallelization and self-attention mechanism).

* Understanding Self-Attention Mechanism (10 minutes):
    * Explain the concept of self-attention and how it helps in capturing long-range dependencies.
    * Provide a simple example to illustrate the mechanism.

* Implementing Multi-Head Attention (15 minutes):
    * Write code to implement the multi-head attention mechanism using PyTorch in your Jupyter notebook.

* Creating the Transformer Architecture (20 minutes):
    * Implement the Transformer architecture in your Jupyter notebook, including the encoder and decoder blocks, and the overall architecture.
    * Set up the training process for a toy task, such as text classification or machine translation.

* Training and Evaluating the Transformer Model (5 minutes):
    * Train the Transformer model on the toy task and evaluate its performance.
    * Compare the performance with the RNN model you built earlier.

* Discussing Further Improvements and Next Steps (5 minutes):
    * Briefly discuss the possible improvements to the Transformer model, such as fine-tuning, using pre-trained models (e.g., BERT, GPT-2), and exploring more complex tasks.
    * Outline potential next steps for further learning, such as learning about other advanced NLP techniques or diving into specific Transformer-based models like BERT or GPT-3.

Remember to add code cells and markdown explanations in your Jupyter notebook to guide you through each step. Enjoy your hands-on learning experience with Transformers!

Let's start with a brief introduction to Transformers and their advantages over RNNs.

## 1. Introduction to Transformers

Transformers were introduced by Vaswani et al. in the 2017 paper "Attention Is All You Need". Like RNNs, they are a neural network architecture designed for handling sequences, but with some key differences and improvements.

The main limitations of RNNs include difficulty in handling long-range dependencies and sequential processing of input data, which makes them slower to train. Transformers address these issues with the following key innovations:

* Self-Attention Mechanism: Transformers use a self-attention mechanism that allows them to weigh the importance of different input elements in the sequence relative to each other. This mechanism helps capture long-range dependencies and relationships between input elements more effectively.

* Parallelization: Unlike RNNs, which process input data sequentially, Transformers can process input data in parallel, allowing for faster training on modern hardware like GPUs.

Now, let's dive into the self-attention mechanism.

## 2. Understanding Self-Attention Mechanism

Self-attention is a mechanism that allows the model to weigh the importance of each element in the input sequence relative to every other element. Each element is projected into a query, a key, and a value vector. The attention weights come from the dot products between one element's query and every element's key, scaled and passed through a softmax; the output for that element is the weighted sum of the value vectors.
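
For reference, the scaled dot-product attention from the Vaswani et al. paper can be written as follows, where $Q$, $K$, and $V$ are the matrices holding the query, key, and value vectors (one row per sequence element) and $d_k$ is the dimension of the key vectors:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
$$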

Here's a simple example to illustrate the self-attention mechanism:

Suppose you have the sentence: "The cat jumped over the dog."
Imagine the model needs to focus on the word "cat" to understand the subject of the sentence. The self-attention mechanism allows the model to assign higher weights to the words that are more relevant to "cat" (like "jumped" and "over") and lower weights to the less relevant words (like "the" and "dog").
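
To make the example concrete, here is a minimal sketch of single-head scaled dot-product attention over that sentence. The embeddings and projection matrices are random stand-ins (an assumption for illustration, not trained values), so the printed weights will not match the intuition above; a trained model would learn projections that produce meaningful weights. The size `d_k = 8` is also arbitrary.

```python
import math
import torch

torch.manual_seed(0)

tokens = ["The", "cat", "jumped", "over", "the", "dog"]
d_k = 8  # hypothetical vector size, chosen only for illustration

# Random stand-ins for learned token embeddings and projection matrices
x = torch.randn(len(tokens), d_k)
W_q, W_k, W_v = (torch.randn(d_k, d_k) for _ in range(3))

# Project every token into query, key, and value vectors
Q, K, V = x @ W_q, x @ W_k, x @ W_v

# Scores: dot product of every query with every key, scaled by sqrt(d_k)
scores = Q @ K.T / math.sqrt(d_k)

# Softmax turns each row of scores into weights that sum to 1
weights = torch.softmax(scores, dim=-1)

# Each output row is a weighted sum of the value vectors
output = weights @ V

# Attention weights that "cat" assigns to every token in the sentence
cat_idx = tokens.index("cat")
for token, w in zip(tokens, weights[cat_idx]):
  print(f"{token:>7}: {w.item():.3f}")
```

Row `i` of `weights` tells you how much token `i` draws from every other token when its output vector is built.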

## Prepare the dataset

In [None]:
# Read the text file
with open("../a06_RNN_language_model/animal_farm.txt", "r") as f:
  text = f.read()
  
text = text[:2000] # make the text shorter for testing

# Create list of unique characters
chars = sorted(list(set(text)))

# Create dictionaries that map characters to integers and vice versa
char_to_int = {c: i for i, c in enumerate(chars)}
int_to_char = {i: c for i, c in enumerate(chars)}

# Convert text to integers
int_text = [char_to_int[c] for c in text]

# Create input and target sequences
sequence_length = 50
X, y = [], []

for i in range(len(int_text) - sequence_length):
  X.append(int_text[i:i + sequence_length])
  y.append(int_text[i + 1:i + sequence_length + 1])

# Convert lists to tensors
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

X = torch.tensor(X, dtype=torch.long).to(device)
y = torch.tensor(y, dtype=torch.long).to(device)
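
To see what the input and target sequences look like, the sketch below repeats the same windowing logic on a toy string (the string and the `decode` helper are hypothetical and used only for illustration). Each target is the input shifted one character to the right, so at every position the model is asked to predict the next character.

```python
text = "hello world"  # toy text, used only for illustration
chars = sorted(set(text))
char_to_int = {c: i for i, c in enumerate(chars)}
int_to_char = {i: c for i, c in enumerate(chars)}
int_text = [char_to_int[c] for c in text]

sequence_length = 4
X, y = [], []
for i in range(len(int_text) - sequence_length):
  X.append(int_text[i:i + sequence_length])
  y.append(int_text[i + 1:i + sequence_length + 1])

def decode(ids):
  # Hypothetical helper: turn a list of integer ids back into a string
  return "".join(int_to_char[i] for i in ids)

for inp, tgt in list(zip(X, y))[:3]:
  print(repr(decode(inp)), "->", repr(decode(tgt)))
# 'hell' -> 'ello'
# 'ello' -> 'llo '
# 'llo ' -> 'lo w'
```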


## Create Transformer model

Now that we have our input and target sequences, let's create the Transformer model for text completion. We'll use the TransformerEncoderLayer in our model, and add an embedding layer and a linear output layer to generate character probabilities.

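This `TransformerModel` class defines our complete model for text completion. It has an embedding layer that converts input character indices into continuous vectors, a Transformer encoder that processes the embedded inputs, and a linear output layer that produces one score (logit) per character in the vocabulary at each position. The probabilities come from applying a softmax to these logits, which `nn.CrossEntropyLoss` does internally during training.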

In [None]:
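class TransformerModel(nn.Module):
  def __init__(self, vocab_size, d_model, num_heads, hidden_dim, num_layers, dropout):
    super().__init__()
    # Map each character index to a d_model-dimensional vector
    # Note: this model adds no positional encoding to the embeddings
    self.embedding = nn.Embedding(vocab_size, d_model)

    # batch_first=True because the inputs have shape (batch, sequence)
    self.transformer_encoder = nn.TransformerEncoder(
      nn.TransformerEncoderLayer(d_model, num_heads, hidden_dim, dropout, batch_first=True),
      num_layers
    )
    # Project each position back to one logit per character
    self.output_layer = nn.Linear(d_model, vocab_size)

  def forward(self, x, mask=None):
    # Causal mask: True above the diagonal blocks attention to later positions,
    # so the model cannot see the characters it is asked to predict
    seq_len = x.size(1)
    causal_mask = torch.triu(
      torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device), diagonal=1
    )
    x = self.embedding(x)
    # `mask` is an optional padding mask of shape (batch, sequence)
    x = self.transformer_encoder(x, mask=causal_mask, src_key_padding_mask=mask)
    x = self.output_layer(x)
    return x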
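
Because each target sequence is the input shifted by one character, a position should not be allowed to attend to the positions that come after it; otherwise the answer is visible in the input. A common way to enforce this is a causal (look-ahead) mask. The sketch below builds one with `torch.triu` and applies it to random scores; the sizes are arbitrary, and in PyTorch's boolean mask convention `True` marks a position that may not be attended to.

```python
import torch

torch.manual_seed(0)

seq_len = 5  # small value chosen for illustration

# True above the diagonal: position i may not attend to positions j > i
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
print(causal_mask.int())

# Apply the mask to raw attention scores before the softmax
scores = torch.randn(seq_len, seq_len)
masked_scores = scores.masked_fill(causal_mask, float("-inf"))
weights = torch.softmax(masked_scores, dim=-1)

# Every row still sums to 1, and all weights on later positions are 0
print(weights)
```

A mask of this form can be passed to `nn.TransformerEncoder` through the `mask` argument of its forward call.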

## Instantiate the Transformer model

Now, let's instantiate the Transformer model, set the hyperparameters, and define the loss function and optimizer.

In [None]:
# Set the hyperparameters
vocab_size = len(chars)
d_model = 128
num_heads = 4
num_layers = 2
hidden_dim = 512
dropout = 0.1
learning_rate = 0.001
batch_size = 32  # reduce further if you run out of memory

# Instantiate the model (argument order must match __init__: hidden_dim comes before num_layers)
model = TransformerModel(vocab_size, d_model, num_heads, hidden_dim, num_layers, dropout)
model = model.to(device)

# Use the CrossEntropyLoss as our loss function
loss_function = nn.CrossEntropyLoss()

# Use the Adam optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
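
`nn.CrossEntropyLoss` expects raw logits of shape `(N, C)` and integer class targets of shape `(N,)`, which is why the training loop flattens the model output and the targets before computing the loss. The sketch below uses toy sizes (chosen only for illustration) and all-zero logits, which correspond to a uniform guess over the vocabulary. The resulting loss of `ln(vocab_size)` is a useful baseline: an untrained model should start near it, and training should push the loss below it.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)

batch_size, seq_len, vocab_size = 2, 5, 10  # toy sizes for illustration

# All-zero logits correspond to a uniform distribution over the vocabulary
logits = torch.zeros(batch_size, seq_len, vocab_size)
targets = torch.randint(0, vocab_size, (batch_size, seq_len))

loss_function = nn.CrossEntropyLoss()

# Flatten to (batch * seq_len, vocab_size) and (batch * seq_len,)
loss = loss_function(logits.view(-1, vocab_size), targets.view(-1))

print(f"loss:           {loss.item():.4f}")
print(f"ln(vocab_size): {math.log(vocab_size):.4f}")
```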

## Train the model

Now that we have the model, loss function, and optimizer defined, let's create the training loop for our model. We will train the model for a specified number of epochs, update the model's weights, and print the training loss at each epoch.

In [None]:
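# Number of batches (iterations) per epoch
total_iterations = len(X) // batch_size
print('total_iterations', total_iterations)
# Rough estimate of minutes per epoch. The constants 2 and 330 appear to come from
# an earlier timing run (about 330 iterations in 2 minutes) and depend on the hardware.
min_per_epoch = total_iterations * 2 / 330
print('min_per_epoch', min_per_epoch)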



In [5]:
from torch.utils.data import Dataset
from torch.utils.data import DataLoader

class TextDataset(Dataset):
  def __init__(self, input_data, target_data):
    self.input_data = input_data
    self.target_data = target_data

  def __len__(self):
    return len(self.input_data)

  def __getitem__(self, idx):
    return self.input_data[idx], self.target_data[idx]
      

# Create the dataset
text_dataset = TextDataset(X, y)

# Create the DataLoader
data_loader = DataLoader(text_dataset, batch_size=batch_size, shuffle=True)

# Set the number of epochs
num_epochs = 5

# Training loop
for epoch in range(num_epochs):
  model.train()
  epoch_loss = 0

  for batch_idx, (input_batch, target_batch) in enumerate(data_loader):
    optimizer.zero_grad()

    # Forward pass
    output = model(input_batch)
    loss = loss_function(output.view(-1, vocab_size), target_batch.view(-1))

    # Backward pass
    loss.backward()
    optimizer.step()

    epoch_loss += loss.item()

  # Print the average loss for this epoch
  print(f"Epoch {epoch+1}/{num_epochs}, Loss: {epoch_loss / len(data_loader):.4f}")
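
Once training has finished, a model like this can complete text by repeatedly predicting the next character and appending it to the input. The sketch below shows one way this might look. `generate` is a hypothetical helper, and `TinyModel` is an untrained stand-in with a toy vocabulary so that the block runs on its own; in the notebook you would pass the trained `model` and rely on the notebook's own `char_to_int` and `int_to_char` instead.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

text = "hello world"  # toy text; the notebook uses animal_farm.txt
chars = sorted(set(text))
char_to_int = {c: i for i, c in enumerate(chars)}
int_to_char = {i: c for i, c in enumerate(chars)}

class TinyModel(nn.Module):
  """Untrained stand-in: maps (batch, seq) indices to (batch, seq, vocab) logits."""
  def __init__(self, vocab_size, d_model=16):
    super().__init__()
    self.embedding = nn.Embedding(vocab_size, d_model)
    self.output_layer = nn.Linear(d_model, vocab_size)

  def forward(self, x):
    return self.output_layer(self.embedding(x))

def generate(model, prompt, num_chars, max_context=50):
  """Hypothetical helper: sample num_chars characters after the prompt."""
  model.eval()
  device = next(model.parameters()).device
  ids = [char_to_int[c] for c in prompt]
  with torch.no_grad():
    for _ in range(num_chars):
      # Feed at most max_context recent characters as a batch of size 1
      context = torch.tensor([ids[-max_context:]], dtype=torch.long, device=device)
      logits = model(context)
      # Only the logits at the last position predict the next character
      probs = torch.softmax(logits[0, -1], dim=-1)
      next_id = torch.multinomial(probs, num_samples=1).item()
      ids.append(next_id)
  return "".join(int_to_char[i] for i in ids)

stand_in = TinyModel(len(chars))
completion = generate(stand_in, "hello", num_chars=10)
print(repr(completion))
```

Sampling with `torch.multinomial` gives varied output; taking the `argmax` of `probs` instead would make generation deterministic but tends to produce repetitive text.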
