## Overview


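This notebook builds and trains an LSTM encoder-decoder with Bahdanau attention to predict the next word on the Tiny Shakespeare dataset. The key steps covered are: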

1.  **Library Imports**: Importing necessary PyTorch and other libraries.
2.  **Data Download**: Downloading the Tiny Shakespeare text data.
3.  **Data Preprocessing**: Converting text to lowercase, creating a vocabulary, and generating word-to-index and index-to-word mappings.
4.  **Sample Creation**: Generating input-target pairs for training, where each input is a sequence of words and the target is the next word.
5.  **Dataset and DataLoader**: Defining a custom PyTorch Dataset and creating a DataLoader for efficient batch processing.
6.  **Model Definition**:
    *   **Encoder**: An LSTM-based encoder to process the input sequence.
    *   **Bahdanau Attention**: An implementation of the additive attention mechanism.
    *   **Attention Decoder**: An LSTM-based decoder that uses attention to predict the next word.
    *   **Seq2Seq**: The main model combining the encoder and decoder with teacher forcing.
7.  **Model Initialization**: Initializing the model and moving it to the appropriate device (CPU/GPU).
8.  **Training**: Defining the loss function and optimizer, and implementing the training loop with backpropagation.


### Library Imports
This cell imports the necessary libraries for building and training a recurrent neural network (RNN) with attention for text generation.
- `torch`, `torch.nn`, `torch.optim`: Core PyTorch libraries for building neural networks and optimizers.
- `torch.utils.data.Dataset`, `DataLoader`: Utilities for handling datasets and creating data loaders for training.

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader

### Download Tiny Shakespeare Dataset
This cell downloads the "tiny Shakespeare" text dataset from a GitHub repository using the `requests` library and saves it as a text file named `tiny_shakespeare.txt`. This dataset is commonly used for character-level or word-level language modeling tasks.

In [2]:
import requests

url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
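response = requests.get(url, timeout=30)
response.raise_for_status()  # fail loudly on a bad HTTP status instead of saving an error page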

with open("tiny_shakespeare.txt", "w", encoding="utf-8") as f:
    f.write(response.text)

print("tiny_shakespeare.txt created successfully.")

tiny_shakespeare.txt created successfully.


### Preprocess Text Data
This cell preprocesses the downloaded text data:
- It reads the text from `tiny_shakespeare.txt` and converts it to lowercase.
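- It splits the text on whitespace into a list of words. Punctuation stays attached to the neighboring word, so `be,` and `be:` count as different tokens.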
- It creates a sorted vocabulary of unique words.
- It creates mappings between words and their corresponding indices (`word2idx`) and vice versa (`idx2word`).
- It defines `seq_length` as the length of the input sequence for the model.
- It initializes an empty list `samples` to store the input-target pairs for training.

In [3]:
with open('tiny_shakespeare.txt', 'r', encoding='utf-8') as f:
    text = f.read().lower()

words = text.split()
vocab = sorted(set(words))
word2idx = {w: idx for idx, w in enumerate(vocab)}
idx2word = {idx: w for w, idx in word2idx.items()}
vocab_size = len(vocab)

seq_length = 5
samples = []
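
To make the vocabulary step concrete, here is a minimal sketch on a single toy sentence instead of the full file. The `toy_` names are made up for illustration so that running the snippet does not overwrite the notebook's own variables.

```python
# Toy illustration of the preprocessing steps above (not the real dataset).
toy_text = "To be, or not to be: that is the question"

toy_words = toy_text.lower().split()   # whitespace split, punctuation stays attached
toy_vocab = sorted(set(toy_words))     # unique tokens in sorted order
toy_word2idx = {w: idx for idx, w in enumerate(toy_vocab)}
toy_idx2word = {idx: w for w, idx in toy_word2idx.items()}

print(toy_vocab)
# ['be,', 'be:', 'is', 'not', 'or', 'question', 'that', 'the', 'to']
print(len(toy_vocab))                  # 9

# Encoding a word and decoding it again returns the original word.
encoded = [toy_word2idx[w] for w in toy_words]
decoded = [toy_idx2word[i] for i in encoded]
print(decoded == toy_words)            # True
```

Note that `be,` and `be:` get separate indices, which inflates the vocabulary compared with a tokenizer that strips punctuation.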

### Create Samples
This cell generates training samples from the preprocessed words. Each sample consists of a sequence of `seq_length` words as input and the immediately following word as the target. These samples will be used to train the model to predict the next word in a sequence.

In [4]:
for i in range(len(words) - seq_length):
    # Each sample contains a sequence of seq_length words as input and the next word as target
    sample = words[i:i + seq_length + 1]
    samples.append(sample)
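
The sliding window above can be sketched on a short word list to show exactly which pairs it produces. The sentence and the `toy_` names are illustrative assumptions, and a window of 3 is used instead of 5 to keep the output short.

```python
# Toy illustration of the sliding-window sample creation (not the real dataset).
toy_words = "before we proceed any further, hear me speak.".split()
toy_seq_length = 3

toy_samples = []
for i in range(len(toy_words) - toy_seq_length):
    # Each sample holds toy_seq_length input words followed by one target word.
    toy_samples.append(toy_words[i:i + toy_seq_length + 1])

for sample in toy_samples:
    print(sample[:-1], "->", sample[-1])
# ['before', 'we', 'proceed'] -> any
# ['we', 'proceed', 'any'] -> further,
# ['proceed', 'any', 'further,'] -> hear
# ['any', 'further,', 'hear'] -> me
# ['further,', 'hear', 'me'] -> speak.
```

A list of N words yields N minus `seq_length` samples (here 8 - 3 = 5), and consecutive samples overlap in all but one word.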

### Text Dataset Class
This cell defines a custom PyTorch `Dataset` class named `TextDataset`.
- The `__init__` method initializes the dataset with the generated samples and the word-to-index mapping.
- The `__len__` method returns the total number of samples in the dataset.
- The `__getitem__` method takes an index and returns a single sample as a tuple: the input sequence of word indices (a 1-D `LongTensor` of length `seq_length`) and the target word index (a 0-dimensional integer tensor).

In [5]:
class TextDataset(Dataset):
    def __init__(self, samples, word2idx):
        self.samples = samples
        self.word2idx = word2idx

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        # Get a sample (input sequence and target word)
        sample = self.samples[idx]
        # Convert input words to indices and create a tensor
        input_seq = torch.LongTensor([self.word2idx[w] for w in sample[:-1]])
        # Convert the target word to its index and create a tensor
        target = self.word2idx[sample[-1]]
        return input_seq, torch.tensor(target)

### Create Dataset and DataLoader
This cell creates an instance of the `TextDataset` using the generated samples and word-to-index mapping. It then creates a `DataLoader` to efficiently load batches of data during training. The `batch_size` is set to 128, and `shuffle=True` shuffles the data at the beginning of each epoch.

In [6]:
dataset = TextDataset(samples, word2idx)
dataloader = DataLoader(dataset, batch_size=128, shuffle=True)
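
A quick way to check what the `DataLoader` yields is to pull one batch and print its shapes. The sketch below is a stand-in under stated assumptions: it uses `TensorDataset` with made-up index values and a small batch size instead of the notebook's `TextDataset`, purely to keep the snippet self-contained.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-in: 10 samples, each with 5 input indices and 1 target index.
toy_seq_length = 5
toy_inputs = torch.arange(10 * toy_seq_length).reshape(10, toy_seq_length)  # [num_samples, seq_length]
toy_targets = torch.arange(10)                                              # [num_samples]

toy_loader = DataLoader(TensorDataset(toy_inputs, toy_targets), batch_size=4, shuffle=True)

batch_inputs, batch_targets = next(iter(toy_loader))
print(tuple(batch_inputs.shape))   # (4, 5) -> [batch_size, seq_length]
print(tuple(batch_targets.shape))  # (4,)   -> [batch_size]

# nn.LSTM defaults to batch_first=False, so time-major input is expected.
time_major = batch_inputs.transpose(0, 1)
print(tuple(time_major.shape))     # (5, 4) -> [seq_length, batch_size]
```

This is why the training loop later transposes each batch before passing it to the model: the batches arrive as `[batch_size, seq_length]`, while the encoder's comments assume `[src_len, batch_size]`.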

### Model Definition (Encoder, Attention, Decoder, Seq2Seq)
This cell defines the neural network architecture:
- **`Encoder`**: An LSTM-based encoder that processes the input sequence. It returns the per-timestep outputs, which the attention mechanism scores, along with the final hidden and cell states, which initialize the decoder.
- **`BahdanauAttention`**: An implementation of the additive Bahdanau attention mechanism to calculate attention weights.
- **`AttentionDecoder`**: An LSTM-based decoder that uses the attention mechanism to focus on relevant parts of the input sequence while generating the output.
- **`Seq2Seq`**: The main sequence-to-sequence model that combines the encoder and decoder. It includes a teacher forcing mechanism during training.

In [9]:
class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, hid_dim, n_layers=1):
        super().__init__()
        self.embedding = nn.Embedding(input_dim, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hid_dim, n_layers)

    def forward(self, src):
        # src shape: [src_len, batch_size]
        embedded = self.embedding(src)  # embedded: [src_len, batch_size, emb_dim]
        outputs, (hidden, cell) = self.lstm(embedded)
        # outputs: [src_len, batch_size, hid_dim]
        # hidden, cell: [n_layers, batch_size, hid_dim]
        return outputs, hidden, cell

# Additive Bahdanau Attention module
class BahdanauAttention(nn.Module):
    def __init__(self, enc_hidden_dim, dec_hidden_dim):
        super().__init__()
        self.attn = nn.Linear(enc_hidden_dim + dec_hidden_dim, dec_hidden_dim)
        self.v = nn.Linear(dec_hidden_dim, 1, bias=False)

    def forward(self, hidden, encoder_outputs):
        # hidden: [batch_size, dec_hidden_dim]
        # encoder_outputs: [src_len, batch_size, enc_hidden_dim]
        src_len = encoder_outputs.shape[0]

        # Repeat decoder hidden state src_len times
        hidden = hidden.unsqueeze(1).repeat(1, src_len, 1)  # [batch_size, src_len, dec_hidden_dim]
        encoder_outputs = encoder_outputs.permute(1, 0, 2)  # [batch_size, src_len, enc_hidden_dim]

        energy = torch.tanh(self.attn(torch.cat((hidden, encoder_outputs), dim=2)))  # [batch, src_len, dec_hidden_dim]
        attention = self.v(energy).squeeze(2)  # [batch_size, src_len]
        return nn.functional.softmax(attention, dim=1)

# Modified Decoder with Attention
class AttentionDecoder(nn.Module):
    def __init__(self, output_dim, emb_dim, enc_hidden_dim, dec_hidden_dim, n_layers=1):
        super().__init__()
        self.output_dim = output_dim
        self.embedding = nn.Embedding(output_dim, emb_dim)
        self.attention = BahdanauAttention(enc_hidden_dim, dec_hidden_dim)
        self.lstm = nn.LSTM(enc_hidden_dim + emb_dim, dec_hidden_dim, n_layers)
        self.fc_out = nn.Linear(enc_hidden_dim + dec_hidden_dim + emb_dim, output_dim)

    def forward(self, input, hidden, cell, encoder_outputs):
        # input: [batch_size]
        # hidden, cell: [n_layers, batch_size, dec_hidden_dim]
        # encoder_outputs: [src_len, batch_size, enc_hidden_dim]

        input = input.unsqueeze(0)  # [1, batch_size]
        embedded = self.embedding(input)  # [1, batch_size, emb_dim]

        dec_hidden = hidden[-1]  # get last layer hidden state [batch_size, dec_hidden_dim]

        # Compute attention weights and context vector
        a = self.attention(dec_hidden, encoder_outputs)  # [batch_size, src_len]
        a = a.unsqueeze(1)  # [batch_size, 1, src_len]

        encoder_outputs = encoder_outputs.permute(1, 0, 2)  # [batch_size, src_len, enc_hidden_dim]

        # Weighted sum context vector
        context = torch.bmm(a, encoder_outputs)  # [batch_size, 1, enc_hidden_dim]
        context = context.permute(1, 0, 2)  # [1, batch_size, enc_hidden_dim]

        # LSTM input is concatenation of embedded input and context vector
        rnn_input = torch.cat((embedded, context), dim=2)  # [1, batch_size, emb_dim + enc_hidden_dim]

        output, (hidden, cell) = self.lstm(rnn_input, (hidden, cell))

        output = output.squeeze(0)   # [batch_size, dec_hidden_dim]
        context = context.squeeze(0) # [batch_size, enc_hidden_dim]
        embedded = embedded.squeeze(0) # [batch_size, emb_dim]

        output = self.fc_out(torch.cat((output, context, embedded), dim=1))  # [batch_size, output_dim]

        return output, hidden, cell, a.squeeze(1)  # a is attention weights for visualization


class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device = device

    def forward(self, src, trg, teacher_forcing_ratio=0.5):
        trg_len = trg.shape[0]
        batch_size = trg.shape[1]
        trg_vocab_size = self.decoder.fc_out.out_features

        outputs = torch.zeros(trg_len, batch_size, trg_vocab_size).to(self.device)
        encoder_outputs, hidden, cell = self.encoder(src)

        input = trg[0, :]

        for t in range(1, trg_len):
            output, hidden, cell, _ = self.decoder(input, hidden, cell, encoder_outputs) # Pass encoder_outputs to decoder
            outputs[t] = output
            teacher_force = torch.rand(1).item() < teacher_forcing_ratio
            top1 = output.argmax(1)
            input = trg[t] if teacher_force else top1

        return outputs
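
For reference, the computation performed by `BahdanauAttention` and the context-vector step in `AttentionDecoder` can be written out as equations. The notation is an assumption introduced here, not taken from the notebook: $s_{t-1}$ is the decoder's top-layer hidden state (`hidden[-1]`), $h_i$ is the encoder output at source position $i$, $T$ is the source length, $W$ and $b$ are the weight and bias of `self.attn`, and $v$ is the weight of `self.v`.

```latex
\begin{aligned}
e_{t,i} &= v^\top \tanh\!\left(W\,[\,s_{t-1};\,h_i\,] + b\right) \\
\alpha_{t,i} &= \frac{\exp(e_{t,i})}{\sum_{j=1}^{T} \exp(e_{t,j})} \\
c_t &= \sum_{i=1}^{T} \alpha_{t,i}\, h_i
\end{aligned}
```

The first line corresponds to `energy` followed by `self.v`, the second to the softmax over `dim=1`, and the third to the `torch.bmm` call that produces `context`.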

In [10]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

### Initialize Model
This cell initializes instances of the `Encoder`, `AttentionDecoder`, and `Seq2Seq` models with specified dimensions. The models are then moved to the selected device (GPU or CPU).

In [11]:
enc = Encoder(vocab_size, emb_dim=64, hid_dim=128).to(device)
dec = AttentionDecoder(vocab_size, emb_dim=64, enc_hidden_dim=128, dec_hidden_dim=128).to(device)
model = Seq2Seq(enc, dec, device).to(device)

### Define Loss Function, Optimizer, and Training Loop
This cell defines the loss function (`CrossEntropyLoss` for classification), the optimizer (`Adam`), and the training loop.
- The loop iterates for a specified number of epochs (`num_epochs`).
- In each epoch, it iterates through the `dataloader` to get batches of inputs and targets.
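- It builds a two-step target sequence for the decoder: the last input word as the decoder's first input, and the true next word as the prediction target. Because the decoder runs for only one step, the teacher forcing ratio has no effect on training here.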
- It performs the forward pass, calculates the loss, and performs backpropagation and optimization.
- It prints the average loss for each epoch.

In [12]:
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters())

# Training loop for next-word prediction
num_epochs = 10

for epoch in range(num_epochs):
    model.train()
    epoch_loss = 0

    for inputs, targets in dataloader:
        inputs = inputs.transpose(0, 1).to(device)  # [seq_len, batch]
        targets = targets.to(device)

        # Build a length-2 trg sequence for the decoder:
        #   trg[0] = last input word (the decoder's first input)
        #   trg[1] = the actual next word (the prediction target)
        trg = torch.zeros(2, inputs.shape[1], dtype=torch.long).to(device)
        trg[0, :] = inputs[-1, :]
        trg[1, :] = targets

        optimizer.zero_grad()
        output = model(inputs, trg)
        # Output shape: [trg_len, batch, vocab_size], trg_len=2
        # Calculate loss only for second token prediction
        loss = criterion(output[1], trg[1])
        loss.backward()
        optimizer.step()

        epoch_loss += loss.item()

    print(f"Epoch [{epoch + 1}/{num_epochs}], Loss: {epoch_loss / len(dataloader):.4f}")

Epoch [1/10], Loss: 7.2381
Epoch [2/10], Loss: 6.2037
Epoch [3/10], Loss: 5.3818
Epoch [4/10], Loss: 4.5951
Epoch [5/10], Loss: 3.9683
Epoch [6/10], Loss: 3.4893
Epoch [7/10], Loss: 3.1010
Epoch [8/10], Loss: 2.7695
Epoch [9/10], Loss: 2.4778
Epoch [10/10], Loss: 2.2180
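
One way to read these numbers is to convert them to perplexity. `nn.CrossEntropyLoss` uses the natural logarithm, so the exponential of the average loss gives the perplexity. The sketch below applies that to the epoch 10 value printed above. Keep in mind that this is the loss on the training data, since the notebook does not hold out a validation set, so it says little about how well the model generalizes.

```python
import math

# Average training loss reported for the last epoch in the output above.
final_train_loss = 2.2180

# Perplexity is the exponential of the cross-entropy (natural-log based).
train_perplexity = math.exp(final_train_loss)
print(f"{train_perplexity:.2f}")  # 9.19
```

Roughly speaking, a perplexity near 9 means the model is about as uncertain on the training samples as if it were choosing uniformly among 9 words.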
