## Overview

This notebook demonstrates how to train a simple word-level LSTM model to generate text based on Shakespeare's writing.

Here's a breakdown of the key steps:

1.  **Data Loading and Preprocessing**: The code loads a text file (`tiny_shakespeare.txt`), lowercases it, and splits it on whitespace into words, so punctuation stays attached to the neighboring word. It then builds a vocabulary of unique words, maps each word to an index, and slices the text into training samples: `seq_length` consecutive words as input and the word that follows as the target.
2.  **Model Definition**: An LSTM (Long Short-Term Memory) neural network model is defined. This model includes an embedding layer to represent words as vectors, an LSTM layer to capture sequential dependencies, and a linear layer to predict the next word.
3.  **Training**: The model is trained on the prepared samples. Each step feeds a batch of input sequences to the model, computes the cross-entropy loss between the predicted scores and the actual next words, and updates the model's parameters with the Adam optimizer to reduce that loss.
4.  **Prediction**: After training, the model can be used to predict the next word in a given sequence of text.

In [None]:
# Import necessary libraries
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader

### Data Loading and Preprocessing

In [None]:
with open('tiny_shakespeare.txt', 'r', encoding='utf-8') as f:
    text = f.read().lower()

words = text.split()
vocab = sorted(set(words))
word2idx = {w: idx for idx, w in enumerate(vocab)}
idx2word = {idx: w for w, idx in word2idx.items()}
vocab_size = len(vocab)

seq_length = 5
samples = []

In [None]:
# Create input-target sequences
for i in range(len(words) - seq_length):
    # Each sample contains a sequence of seq_length words as input and the next word as target
    sample = words[i:i + seq_length + 1]
    samples.append(sample)
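
To make the windowing concrete, here is a minimal sketch on a toy sentence. The sentence and the shorter `seq_length` of 3 are illustrative assumptions, not values from the notebook; the slicing logic is the same as in the cell above.

```python
# Toy sentence standing in for the tokenized corpus (illustrative only)
words = "to be or not to be that is the question".split()
seq_length = 3  # shorter than the notebook's 5 so the printout stays readable

# Same windowing as above: seq_length input words followed by one target word
samples = [words[i:i + seq_length + 1] for i in range(len(words) - seq_length)]

# Show the first few (input, target) pairs
for sample in samples[:3]:
    print(sample[:-1], "->", sample[-1])
```

Each window overlaps the previous one by all but one word, so a corpus of `N` words yields `N - seq_length` samples.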

### Dataset and DataLoader

In [None]:
# Define a custom Dataset for the text data
class TextDataset(Dataset):
    def __init__(self, samples, word2idx):
        self.samples = samples
        self.word2idx = word2idx

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        # Get a sample (input sequence and target word)
        sample = self.samples[idx]
        # Convert input words to indices and create a tensor
        input_seq = torch.LongTensor([self.word2idx[w] for w in sample[:-1]])
        # Convert the target word to its index and create a tensor
        target = self.word2idx[sample[-1]]
        return input_seq, torch.tensor(target)

In [None]:
# Create the Dataset and DataLoader
dataset = TextDataset(samples, word2idx)
dataloader = DataLoader(dataset, batch_size=128, shuffle=True) # Use DataLoader for batching and shuffling

### Model Definition

In [None]:
# Define the LSTM neural network model
class LSTMNet(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        # Embedding layer to convert word indices to vectors
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # LSTM layer to capture sequential dependencies
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # Linear layer to predict the next word (output size is vocabulary size)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x):
        # Pass input through embedding layer
        x = self.embedding(x)
        # Pass embedded input through LSTM layer
        # h_n has shape (num_layers, batch, hidden_dim); only the final hidden state is needed
        _, (h_n, _) = self.lstm(x)
        # Pass the last layer's final hidden state through the linear layer
        out = self.fc(h_n[-1])
        return out
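
The tensor shapes at each stage of the forward pass can be traced with a small standalone sketch. The layer sizes below are hypothetical and chosen only for illustration; the layers are built directly rather than through `LSTMNet` so the snippet runs on its own.

```python
import torch
import torch.nn as nn

# Hypothetical small sizes, not the notebook's actual settings
vocab_size, embed_dim, hidden_dim = 20, 8, 16
batch_size, seq_length = 4, 5

embedding = nn.Embedding(vocab_size, embed_dim)
lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
fc = nn.Linear(hidden_dim, vocab_size)

x = torch.randint(0, vocab_size, (batch_size, seq_length))  # random word indices
emb = embedding(x)                # (batch, seq, embed_dim)
outputs, (h_n, c_n) = lstm(emb)   # outputs: (batch, seq, hidden); h_n: (num_layers, batch, hidden)
logits = fc(h_n[-1])              # (batch, vocab_size), one score per vocabulary word

print(emb.shape, outputs.shape, h_n.shape, logits.shape)
```

For a single-layer, unidirectional LSTM such as this one, `h_n[-1]` holds the same values as `outputs[:, -1]`, so either could feed the linear layer.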

### Model Initialization

In [None]:
# Initialize the model, loss function, and optimizer
model = LSTMNet(vocab_size, embed_dim=50, hidden_dim=100) # Create an instance of the LSTM model
loss_fn = nn.CrossEntropyLoss() # Define the loss function (Cross-Entropy for classification)
optimizer = optim.Adam(model.parameters(), lr=0.001) # Define the optimizer (Adam) and learning rate
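
`nn.CrossEntropyLoss` expects raw scores (logits) and applies log-softmax internally, which is why the model has no softmax layer. A useful sanity check, sketched below with a hypothetical vocabulary size, is that a model scoring every word equally has a loss of `ln(vocab_size)`. Early training losses far above that value would suggest a problem.

```python
import math
import torch
import torch.nn as nn

vocab_size = 1000  # hypothetical; the real value depends on the corpus

# Equal logits for every word, i.e. a uniform prediction
logits = torch.zeros(8, vocab_size)
targets = torch.randint(0, vocab_size, (8,))

loss = nn.CrossEntropyLoss()(logits, targets)
print(f"loss: {loss.item():.4f}, ln(vocab_size): {math.log(vocab_size):.4f}")
```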

### Training

In [None]:
# Training loop
num_epochs = 10 # Define the number of training epochs
for epoch in range(num_epochs):
    for inputs, targets in dataloader:
        # Forward pass
        outputs = model(inputs)
        # Calculate the loss
        loss = loss_fn(outputs, targets)
        # Backward pass and optimization
        optimizer.zero_grad() # Clear gradients
        loss.backward() # Compute gradients
        optimizer.step() # Update model parameters
    # Print the loss of the last batch in this epoch (not the epoch average)
    print(f"Epoch {epoch+1}, Loss: {loss.item():.4f}")

Epoch 1, Loss: 7.5303
Epoch 2, Loss: 6.9131
Epoch 3, Loss: 6.6516
Epoch 4, Loss: 5.4789
Epoch 5, Loss: 5.7635
Epoch 6, Loss: 6.4038
Epoch 7, Loss: 5.8948
Epoch 8, Loss: 5.9343
Epoch 9, Loss: 4.2164
Epoch 10, Loss: 5.1814
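
The values above come from the last mini-batch of each epoch, which is one reason they fluctuate from epoch to epoch. A smoother signal is the mean loss over all samples in the epoch. The sketch below shows one way to track it; the data and the linear model are hypothetical stand-ins so the snippet runs on its own, and they are not the notebook's dataset or `LSTMNet`.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Stand-in data and model (hypothetical), used only to make the loop runnable
inputs = torch.randn(256, 10)
targets = torch.randint(0, 4, (256,))
loader = DataLoader(TensorDataset(inputs, targets), batch_size=32, shuffle=True)
toy_model = nn.Linear(10, 4)
loss_fn = nn.CrossEntropyLoss()
optimizer = optim.Adam(toy_model.parameters(), lr=0.01)

epoch_losses = []
for epoch in range(3):
    total_loss, n_seen = 0.0, 0
    for x, y in loader:
        loss = loss_fn(toy_model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # Weight each batch by its size so a smaller final batch is not over-counted
        total_loss += loss.item() * x.size(0)
        n_seen += x.size(0)
    epoch_losses.append(total_loss / n_seen)
    print(f"Epoch {epoch + 1}, mean loss: {epoch_losses[-1]:.4f}")
```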


### Prediction

In [None]:
# Function to predict the next word
def predict_next_word(model, text_seq, word2idx, idx2word, seq_length):
    model.eval() # Set the model to evaluation mode
    # Tokenize the input text sequence and take the last 'seq_length' words
    tokens = text_seq.lower().split()[-seq_length:]
    # Convert tokens to indices and add a batch dimension. Unknown words fall back to
    # index 0, which is the first vocabulary word, not a dedicated unknown token
    input_seq = torch.LongTensor([word2idx.get(w, 0) for w in tokens]).unsqueeze(0)
    with torch.no_grad(): # Disable gradient calculation for inference
        # Get model output (predictions for the next word)
        output = model(input_seq)
        # Get the index of the predicted next word (highest probability)
        pred_idx = torch.argmax(output, dim=1).item()
    # Convert the predicted index back to a word
    return idx2word[pred_idx]
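
Because `word2idx.get(w, 0)` maps any out-of-vocabulary word to whichever word happens to sort first, an unknown input is silently treated as a real word. One common alternative, sketched here on a toy corpus, is to reserve an explicit unknown token. The `<unk>` token and the `encode` helper are hypothetical additions, not part of the notebook.

```python
words = "to be or not to be".split()  # toy corpus (illustrative only)
UNK = "<unk>"  # hypothetical reserved token for out-of-vocabulary words

# Put the unknown token at index 0 so it never collides with a real word
vocab = [UNK] + sorted(set(words))
word2idx = {w: idx for idx, w in enumerate(vocab)}

def encode(tokens):
    """Map tokens to indices, sending unseen words to the <unk> index."""
    return [word2idx.get(w, word2idx[UNK]) for w in tokens]

print(encode(["to", "be", "zounds"]))  # → [4, 1, 0]
```

For the model to learn a meaningful embedding for `<unk>`, some training words (for example, very rare ones) would also need to be replaced by it. Otherwise its embedding stays at its random initialization.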

In [None]:
# Example usage of the prediction function
input_sentence = "shall i compare thee"
# Predict the next word based on the input sentence
next_word = predict_next_word(model, input_sentence, word2idx, idx2word, seq_length)
print("Next word prediction:", next_word)

Next word prediction: to
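
Predicting one word at a time extends naturally to generating a longer passage: append each predicted word to the text and predict again. The sketch below shows that loop with a hypothetical `generate` helper. To keep it runnable without a trained model, it uses a stand-in predictor based on a fixed lookup table rather than the LSTM.

```python
def generate(predict_fn, prompt, num_words, seq_length):
    """Greedily extend `prompt` by `num_words` words using `predict_fn`."""
    tokens = prompt.lower().split()
    for _ in range(num_words):
        # Only the last seq_length words are passed as context
        context = " ".join(tokens[-seq_length:])
        tokens.append(predict_fn(context))
    return " ".join(tokens)

# Stand-in predictor (hypothetical): looks up the last word in a fixed table
toy_next = {"thee": "to", "to": "a", "a": "summer's", "summer's": "day"}

def toy_predict(context):
    return toy_next.get(context.split()[-1], "thee")

print(generate(toy_predict, "shall i compare thee", num_words=4, seq_length=5))
# → shall i compare thee to a summer's day
```

With the trained model, `predict_fn` could be `lambda s: predict_next_word(model, s, word2idx, idx2word, seq_length)`. Always taking the top-scoring word tends to produce repetitive text, so sampling from the predicted distribution is a frequently used alternative.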
