
# RNN Implementation
Modify the code and answer the questions below.


In [17]:
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import random

In [13]:
# ==================== 1. Load and Preprocess Data ====================
# Read text from file
with open("story.txt", "r", encoding="utf-8") as f:
    text = f.read()

# Create character mappings
chars = sorted(set(text))  # Define vocabulary

# Tokenize
char_to_idx = {char: idx for idx, char in enumerate(chars)}
idx_to_char = {idx: char for char, idx in char_to_idx.items()}
vocab_size = len(chars)

# Convert text to indices - Tokenization
data = [char_to_idx[c] for c in text]

# Define sequence length
seq_length = 20
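
To make the mapping concrete, here is a minimal sketch on a toy string. The string `toy_text` and the `toy_` names are assumptions for illustration, not the contents of `story.txt`. It builds the same two dictionaries and checks that encoding followed by decoding returns the original text.

```python
# Toy text standing in for story.txt (assumption for illustration)
toy_text = "hello world"

# Same construction as above: sorted unique characters define the vocabulary
toy_chars = sorted(set(toy_text))
toy_char_to_idx = {char: idx for idx, char in enumerate(toy_chars)}
toy_idx_to_char = {idx: char for char, idx in toy_char_to_idx.items()}

# Encode to integer indices, then decode back to characters
toy_encoded = [toy_char_to_idx[c] for c in toy_text]
toy_decoded = "".join(toy_idx_to_char[i] for i in toy_encoded)

print(toy_chars)
print(toy_encoded)
print(toy_decoded == toy_text)
```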


In [14]:
# ==================== 2. Define Dataset Loader ====================
def get_batches(data, batch_size, seq_length):
    """Yield batches of character sequences."""
    num_batches = len(data) // (batch_size * seq_length)
    data = data[:num_batches * batch_size * seq_length]
    data = np.reshape(data, (batch_size, -1))

    while True:
        # Slide over the columns in steps of seq_length; y is x shifted by one token
        for n in range(0, data.shape[1] - seq_length, seq_length):
            x = data[:, n:n+seq_length]
            y = data[:, n+1:n+seq_length+1]
            yield torch.tensor(x, dtype=torch.long), torch.tensor(y, dtype=torch.long)

batch_generator = get_batches(data, 5, seq_length)
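
The reshaping and the one-step shift are easier to see on toy data. The sketch below applies the same slicing as `get_batches` to the integers 0..23 with a batch size of 2 and a sequence length of 3. All `toy_` names and values are chosen for illustration, and it uses NumPy only.

```python
import numpy as np

# Toy "token" stream: the integers 0..23 (illustrative values only)
toy_data = list(range(24))
toy_batch_size, toy_seq_length = 2, 3

# Same trimming and reshaping as get_batches: one row per parallel stream
toy_num_batches = len(toy_data) // (toy_batch_size * toy_seq_length)
toy_grid = np.reshape(
    toy_data[:toy_num_batches * toy_batch_size * toy_seq_length],
    (toy_batch_size, -1),
)

# First window: targets are the inputs shifted one position to the right
toy_x = toy_grid[:, 0:toy_seq_length]
toy_y = toy_grid[:, 1:toy_seq_length + 1]

print(toy_grid.shape)
print(toy_x)
print(toy_y)
```

Each row is a contiguous slice of the text, so consecutive windows in the same row continue where the previous one stopped. That is what makes it meaningful to carry the hidden state from one batch to the next.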

In [15]:
# ==================== 3. Define Custom RNN Model ====================
class CustomRNN(nn.Module):
    def __init__(self, vocab_size, hidden_size):
        super(CustomRNN, self).__init__()
        self.hidden_size = hidden_size

        # Learnable weight matrices
        self.Wxh = nn.Parameter(torch.randn(hidden_size, vocab_size) * 0.01)  # Input -> Hidden
        self.Whh = nn.Parameter(torch.randn(hidden_size, hidden_size) * 0.01)  # Hidden -> Hidden
        self.Why = nn.Parameter(torch.randn(vocab_size, hidden_size) * 0.01)  # Hidden -> Output

        # Bias terms
        self.bh = nn.Parameter(torch.zeros(hidden_size))
        self.by = nn.Parameter(torch.zeros(vocab_size))

    def forward(self, x, hidden):
        """Compute the forward pass manually."""
        batch_size, seq_length = x.shape

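        # One-hot encode the input tokens. The vocabulary size is read from Wxh
        # so the model does not depend on the global vocab_size variable.
        one_hot = torch.zeros(batch_size, seq_length, self.Wxh.shape[1], device=x.device)
        one_hot.scatter_(2, x.unsqueeze(-1), 1)  # Convert to one-hot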

        outputs = []
        for t in range(seq_length):
            xt = one_hot[:, t, :]  # Input at time step t
            hidden = torch.tanh(torch.mm(xt, self.Wxh.T) + torch.mm(hidden, self.Whh.T) + self.bh)  # Compute hidden state
            yt = torch.mm(hidden, self.Why.T) + self.by  # Compute output
            outputs.append(yt)

        return torch.stack(outputs, dim=1), hidden

    def init_hidden(self, batch_size):
        return torch.zeros(batch_size, self.hidden_size)  # Initialize hidden state

# Test
# model = CustomRNN(vocab_size, 80)
# y, h = model.forward(torch.tensor([[1, 2, 3]]), torch.zeros(1, 80))
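
One of the tasks at the end of this notebook asks for a second recurrent layer. One way to extend the model above is to feed the first layer's hidden state into a second recurrence and read the output from the top layer. The class below is a hypothetical sketch, not the code used for the answers: the name `TwoLayerRNN`, the local `param` helper, and the use of `torch.nn.functional.one_hot` are assumptions.

```python
import torch
import torch.nn as nn

class TwoLayerRNN(nn.Module):
    """Hypothetical two-layer variant of CustomRNN (illustrative sketch)."""

    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size

        def param(*shape):
            # Small random init, same scale as CustomRNN
            return nn.Parameter(torch.randn(*shape) * 0.01)

        # Layer 1: input -> hidden1, hidden1 -> hidden1
        self.Wxh1 = param(hidden_size, vocab_size)
        self.Whh1 = param(hidden_size, hidden_size)
        self.bh1 = nn.Parameter(torch.zeros(hidden_size))
        # Layer 2: hidden1 -> hidden2, hidden2 -> hidden2
        self.Wxh2 = param(hidden_size, hidden_size)
        self.Whh2 = param(hidden_size, hidden_size)
        self.bh2 = nn.Parameter(torch.zeros(hidden_size))
        # Output projection reads from the top layer
        self.Why = param(vocab_size, hidden_size)
        self.by = nn.Parameter(torch.zeros(vocab_size))

    def forward(self, x, hidden):
        h1, h2 = hidden  # one hidden state per layer
        one_hot = torch.nn.functional.one_hot(x, self.vocab_size).float()
        outputs = []
        for t in range(x.shape[1]):
            xt = one_hot[:, t, :]
            h1 = torch.tanh(xt @ self.Wxh1.T + h1 @ self.Whh1.T + self.bh1)
            h2 = torch.tanh(h1 @ self.Wxh2.T + h2 @ self.Whh2.T + self.bh2)
            outputs.append(h2 @ self.Why.T + self.by)
        return torch.stack(outputs, dim=1), (h1, h2)

    def init_hidden(self, batch_size):
        zeros = torch.zeros(batch_size, self.hidden_size)
        return zeros, zeros.clone()


# Shape check on random token indices (toy sizes)
two_layer = TwoLayerRNN(vocab_size=10, hidden_size=8)
toy_tokens = torch.randint(0, 10, (4, 6))
toy_logits, (toy_h1, toy_h2) = two_layer(toy_tokens, two_layer.init_hidden(4))
print(toy_logits.shape, toy_h1.shape, toy_h2.shape)
```

Because the hidden state is now a tuple, a training loop using this sketch would need to detach each element, for example `tuple(h.detach() for h in hidden)`, rather than calling `hidden.detach()`.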

In [16]:
# Model parameters
hidden_size = 80
batch_size = 5
learning_rate = 0.01

# Initialize model
model = CustomRNN(vocab_size, hidden_size)
optimizer = optim.Adam(model.parameters(), lr=learning_rate)
criterion = nn.CrossEntropyLoss()

In [24]:
# ==================== 4. Train the Model ====================
num_epochs = 10
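# TODO: max_steps is not enforced yet. get_batches yields forever, so the
# inner loop below runs until the cell is interrupted.
max_steps = 10000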
train_loader = get_batches(data, batch_size, seq_length)
for epoch in range(num_epochs):
    hidden = model.init_hidden(batch_size)
    for i, (x, y) in enumerate(train_loader):
        optimizer.zero_grad()
        output, hidden = model(x, hidden.detach())
        loss = criterion(output.view(-1, vocab_size), y.view(-1))

        loss.backward()
        optimizer.step()

        if i % 1000 == 0:
            print(f"Epoch {epoch+1}/{num_epochs}, Step {i}, Loss: {loss.item():.4f}")


Epoch 1/10, Step 0, Loss: 1.9775
Epoch 1/10, Step 1000, Loss: 2.0324
Epoch 1/10, Step 2000, Loss: 1.8433
Epoch 1/10, Step 3000, Loss: 1.7854
Epoch 1/10, Step 4000, Loss: 2.2297
Epoch 1/10, Step 5000, Loss: 2.0295
Epoch 1/10, Step 6000, Loss: 1.9557
Epoch 1/10, Step 7000, Loss: 1.8873
Epoch 1/10, Step 8000, Loss: 2.0260
Epoch 1/10, Step 9000, Loss: 2.1504
Epoch 1/10, Step 10000, Loss: 2.3154
Epoch 1/10, Step 11000, Loss: 2.0448
Epoch 1/10, Step 12000, Loss: 1.9994
Epoch 1/10, Step 13000, Loss: 2.1740
Epoch 1/10, Step 14000, Loss: 2.1441
Epoch 1/10, Step 15000, Loss: 2.2635
Epoch 1/10, Step 16000, Loss: 1.9761
Epoch 1/10, Step 17000, Loss: 1.9625
Epoch 1/10, Step 18000, Loss: 2.1826
Epoch 1/10, Step 19000, Loss: 1.7389
Epoch 1/10, Step 20000, Loss: 2.1678
Epoch 1/10, Step 21000, Loss: 1.8271
Epoch 1/10, Step 22000, Loss: 2.0678
Epoch 1/10, Step 23000, Loss: 2.0897
Epoch 1/10, Step 24000, Loss: 1.9310
Epoch 1/10, Step 25000, Loss: 1.8829
Epoch 1/10, Step 26000, Loss: 2.0707
Epoch 1/10, St

KeyboardInterrupt: 
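
Because `get_batches` loops forever, the inner `for` loop above never finishes on its own, which is why the log stays in epoch 1 until the run is interrupted. One possible way to honor a step cap is to take a fixed number of batches per epoch with `itertools.islice`. In this sketch, `endless_batches` and `run_epochs` are hypothetical stand-ins, and `step_fn` is a placeholder for the real forward, backward, and optimizer step.

```python
from itertools import islice

def endless_batches():
    """Stand-in for get_batches: yields batch ids forever (illustrative)."""
    n = 0
    while True:
        yield n
        n += 1

def run_epochs(batches, step_fn, num_epochs, steps_per_epoch):
    """Run step_fn on at most steps_per_epoch batches in each epoch."""
    total_steps = 0
    for epoch in range(num_epochs):
        # islice stops after steps_per_epoch items, even on an endless generator
        for i, batch in enumerate(islice(batches, steps_per_epoch)):
            step_fn(epoch, i, batch)
            total_steps += 1
    return total_steps

seen_steps = []
total_steps_run = run_epochs(
    endless_batches(),
    lambda epoch, i, batch: seen_steps.append((epoch, i, batch)),
    num_epochs=2,
    steps_per_epoch=3,
)
print(total_steps_run, seen_steps)
```

In the training cell, the same idea would amount to iterating over `islice(train_loader, max_steps)` instead of `train_loader`.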

In [25]:
# ==================== 5. Generate New Text ====================
def generate_text(model, start_str="Once upon a time", length=500):
    """Generate text using the trained model."""
    model.eval()
    start_indices = [char_to_idx[c] for c in start_str]  # Encode the prompt
    input_seq = torch.tensor(start_indices, dtype=torch.long).unsqueeze(0)

    hidden = model.init_hidden(1)
    generated_text = start_str

    for _ in range(length):
        with torch.no_grad():
            output, hidden = model(input_seq, hidden)
            probs = torch.softmax(output[:, -1, :], dim=1).squeeze()
            next_idx = torch.multinomial(probs, 1).item()

        generated_text += idx_to_char[next_idx]
        input_seq = torch.tensor([[next_idx]], dtype=torch.long)

    return generated_text

# Generate a sample text
print(generate_text(model, start_str="Once upon a time", length=300))


Once upon a time foted nenesepeverald wet? Wellteo and the wh foosein iantt and poald presere of his to ktofery nerald to ank she letals,t. At to aprit -oghed the alint bot thouma Selaer so of heren drees,t, pet rar his ore deted to ond of woo loon the ios, tamatore thestos,.., meser even of and gnellyer ablinse..,


## Questions / tasks
(Training this model takes time, so you may terminate it after a set number of steps.)


1. What do you observe as the model outputs as training progresses?

Each training step adjusts the weights to reduce the loss. The loss fluctuates from step to step and sometimes increases, but the overall trend is downward.
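
Two reference points can help when reading the logged loss values. A model that assigns equal probability to every character has a cross-entropy of ln(vocab_size), and exp(loss), the perplexity, gives the effective number of equally likely characters the model is choosing among. The sketch below computes both. The vocabulary size of 60 is a made-up example, since the real value depends on `story.txt`, and both function names are hypothetical.

```python
import math

def uniform_baseline_loss(vocab_size):
    """Cross-entropy (in nats) of a uniform guess over vocab_size characters."""
    return math.log(vocab_size)

def perplexity(loss):
    """Effective number of equally likely choices implied by a loss in nats."""
    return math.exp(loss)

example_vocab_size = 60  # hypothetical; the real value is len(chars)
print(f"uniform baseline: {uniform_baseline_loss(example_vocab_size):.3f}")
print(f"perplexity at loss 1.9775: {perplexity(1.9775):.2f}")
```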

2. What is the minimum hidden size that still allows the model to learn good outputs?

Even after training for up to 20 minutes with 2 layers, I did not get a good result. The output was starting to come closer to making sense, but it was not quite there yet. I think the network would train better on words than on individual characters.


3. What do you think the effect of changing the sequence length is? Try different values and describe your observations.

I expected that a longer sequence length would slow training down but improve the final validation loss.

The first test used a short sequence length, so the model only saw a small window of context at each step. It moved through epochs faster, but the final model produced complete nonsense.

With a long sequence length, the output makes more sense, but the model takes longer to train.
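
Part of the speed difference follows directly from the slicing in `get_batches`: for a fixed amount of text, a longer sequence length means fewer but larger steps per pass over the data. The helper below is a hypothetical function that mirrors the arithmetic in `get_batches` and counts the steps in one pass. The text length of 10,000 characters is a made-up example.

```python
def steps_per_pass(num_tokens, batch_size, seq_length):
    """Count the batches get_batches yields in one pass over the data."""
    num_batches = num_tokens // (batch_size * seq_length)
    row_length = num_batches * seq_length  # columns after reshaping
    # Same range as the inner loop in get_batches
    return len(range(0, row_length - seq_length, seq_length))

for length in (5, 20, 100):
    steps = steps_per_pass(num_tokens=10_000, batch_size=5, seq_length=length)
    print(f"seq_length={length:>3}: {steps} steps, {5 * length} tokens per step")
```

Each step also backpropagates through `seq_length` time steps, so longer sequences make every individual update more expensive.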

4. Modify the code to add another recurrent layer to the network (this may take some research). How do you think this will affect the model? What do you actually observe?

The training time increased by a lot, but the model also produced much higher-quality output. However, I think the extra layer also increases the likelihood of overfitting.


5. In the case of a generative language model, how can we tell if the model has actually "understood" language as opposed to just memorizing common patterns?

A model that had understood language would generate coherent sentences and responses and would show true reasoning and understanding, instead of just stringing together high-quality sentences based on patterns.


6. What are the pros and cons of using a word-level vocabulary as opposed to a character-level vocabulary?

A word-level vocabulary lets the model produce output that sounds like it makes some sense, because it is much easier to learn the semantics of a language from words than from characters. The cons are that the same text yields less training data, because each word spans many letters, and that training would take much longer.
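
The trade-off shows up even on a single sentence. The sketch below tokenizes a toy sentence both ways and compares sequence length and vocabulary size. The sentence is an illustrative assumption, not text from `story.txt`, and the word-level case uses a plain whitespace split.

```python
sentence = "the cat sat on the mat"  # toy example text

# Character-level: every character (including spaces) is a token
char_tokens = list(sentence)
char_vocab = sorted(set(char_tokens))

# Word-level: split on whitespace (a simplification; real tokenizers
# also handle punctuation and casing)
word_tokens = sentence.split()
word_vocab = sorted(set(word_tokens))

print(f"characters: {len(char_tokens)} tokens, vocabulary of {len(char_vocab)}")
print(f"words:      {len(word_tokens)} tokens, vocabulary of {len(word_vocab)}")
```

On a full corpus the word vocabulary is typically far larger than the character vocabulary, which enlarges the input and output weight matrices (`Wxh` and `Why` in the model above).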