# Assignment 12.1 - Recurrent Neural Networks

Please submit your solution to this notebook on the Whiteboard under the corresponding assignment entry, both as an .ipynb file and as a .pdf.

#### Please state both names of your group members here:
Jane and John Doe

In [13]:
# Paola Gega, Daniel Thompson

## Task 12.1.1: RNN - 'ShakesGen'

Let's create a `ShakesGen` !!<br><br>
The data folder contains a shakespeare folder with works from William Shakespeare. Your task is to implement an RNN that learns to write Shakespeare-style text.

Below, you'll find all the utility code needed for this task. The Corpus class serves as a dataset, and you can retrieve a batch with its target by calling `get_batch` on a batchified dataset.

* Build the missing model components and train your ShakesGen model. **(RESULT)**
* Generate at least 30 lines of text using your ShakesGen model. **(RESULT)**

In particular, if you train on a CPU, you can stop training after 5 minutes and generate text from the current model state.

In [14]:
from IPython.display import Image
Image(url="https://miro.medium.com/max/4000/0*WdbXF_e8kZI1R5nQ.png", width=700)

In [15]:
# Some imports
import torch
import torch.nn as nn
import torch.autograd as autograd
import torch.cuda as cuda
import torch.optim as optim
import torch.nn.functional as F
import os
import tqdm
import numpy as np

In [16]:
class Dictionary(object):
    def __init__(self):
        self.word2idx = {}
        self.idx2word = []

    def add_word(self, word):
        if word not in self.word2idx:
            self.idx2word.append(word)
            self.word2idx[word] = len(self.idx2word) - 1
        return self.word2idx[word]

    def __len__(self):
        return len(self.idx2word)


class Corpus(object):
    def __init__(self, path):
        self.dictionary = Dictionary()
        
        # This whitelist is specific to English text:
        # only printable ASCII characters (codes 32 to 126) are ingested.
        self.whitelist = [chr(i) for i in range(32, 127)]
        
        self.train = self.tokenize(os.path.join(path, 'train.txt'))
        self.valid = self.tokenize(os.path.join(path, 'valid.txt'))

    def tokenize(self, path):
        """Tokenizes a text file."""
        assert os.path.exists(path)
        # Add words to the dictionary
        with open(path, 'r',  encoding="utf8") as f:
            tokens = 0
            for line in f:
                line = ''.join([c for c in line if c in self.whitelist])
                words = line.split() + ['<eos>']
                tokens += len(words)
                for word in words:
                    self.dictionary.add_word(word)

        # Tokenize file content
        with open(path, 'r',  encoding="utf8") as f:
            ids = torch.LongTensor(tokens)
            token = 0
            for line in f:
                line = ''.join([c for c in line if c in self.whitelist])
                words = line.split() + ['<eos>']
                for word in words:
                    ids[token] = self.dictionary.word2idx[word]
                    token += 1

        return ids
    
def batchify(data, batch_size):
    # Work out how many full rows fit when the data is split into batch_size columns.
    nbatch = data.size(0) // batch_size
    # Trim off any extra elements that wouldn't cleanly fit (remainders).
    data = data.narrow(0, 0, nbatch * batch_size)
    # Evenly divide the data across the batch_size columns; the result has shape (nbatch, batch_size).
    data = data.view(batch_size, -1).t().contiguous()
    return data

def get_batch(source, i, bptt_size=35):
    seq_len = min(bptt_size, len(source) - 1 - i)
    data = source[i:i+seq_len]
    target = source[i+1:i+1+seq_len].view(-1)
    return data, target
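
To make the shapes concrete, here is a small sketch that runs `batchify` and `get_batch` on a toy tensor instead of the Shakespeare corpus. The two helpers are repeated from above so the snippet runs on its own; the toy sequence `torch.arange(26)` and the sizes 4 and 3 are illustrative assumptions, not values used in the assignment.

```python
import torch

def batchify(data, batch_size):
    nbatch = data.size(0) // batch_size
    data = data.narrow(0, 0, nbatch * batch_size)
    return data.view(batch_size, -1).t().contiguous()

def get_batch(source, i, bptt_size=35):
    seq_len = min(bptt_size, len(source) - 1 - i)
    data = source[i:i + seq_len]
    target = source[i + 1:i + 1 + seq_len].view(-1)
    return data, target

# Toy "corpus" of 26 token ids; the last 2 are dropped because 26 is not divisible by 4.
toy = torch.arange(26)
batched = batchify(toy, batch_size=4)   # each column is one contiguous stream of tokens
data, target = get_batch(batched, 0, bptt_size=3)

print(batched.shape)                     # torch.Size([6, 4])
print(data[:, 0].tolist())               # [0, 1, 2]
print(target.view(3, 4)[:, 0].tolist())  # [1, 2, 3]
```

Each target is the input shifted by one position along the sequence axis, flattened so it lines up with the flattened model output.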

In [17]:
# Download shakespeare dataset
# !mkdir -p data/shakespeare
# !wget -q -O data/shakespeare/train.txt https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
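# Note: the next line reuses the training text as validation data,
# so the validation loss is not a held-out estimate.
# !cp data/shakespeare/train.txt data/shakespeare/valid.txt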
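
If a held-out validation set is preferred over copying the training file, one option is to split the downloaded text by lines. The following is only a sketch: the helper name `split_lines` and the `valid_fraction` parameter are hypothetical and not part of the provided utility code.

```python
def split_lines(text, valid_fraction=0.1):
    """Split text by lines into (train, valid); the last lines become validation."""
    lines = text.splitlines(keepends=True)
    n_valid = max(1, int(len(lines) * valid_fraction))
    return ''.join(lines[:-n_valid]), ''.join(lines[-n_valid:])

# Toy input with 10 lines; 20% of them go to the validation part.
sample = ''.join(f"line {i}\n" for i in range(10))
train_text, valid_text = split_lines(sample, valid_fraction=0.2)
print(len(train_text.splitlines()), len(valid_text.splitlines()))  # 8 2
```

Writing the two parts to `train.txt` and `valid.txt` in `data/shakespeare` would let `Corpus` load them unchanged, since `tokenize` adds the words of both files to the same dictionary.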

In [18]:
# Use Corpus to load data
corpus = Corpus('./data/shakespeare')

In [19]:
vocab_size = len(corpus.dictionary)
print(vocab_size)

# Print first 100 words from training data
words = [corpus.dictionary.idx2word[corpus.train[i].item()] for i in range(min(100, len(corpus.train)))]
print(' '.join(words))

25671
First Citizen: <eos> Before we proceed any further, hear me speak. <eos> <eos> All: <eos> Speak, speak. <eos> <eos> First Citizen: <eos> You are all resolved rather to die than to famish? <eos> <eos> All: <eos> Resolved. resolved. <eos> <eos> First Citizen: <eos> First, you know Caius Marcius is chief enemy to the people. <eos> <eos> All: <eos> We know't, we know't. <eos> <eos> First Citizen: <eos> Let us kill him, and we'll have corn at our own price. <eos> Is't a verdict? <eos> <eos> All: <eos> No more talking on't; let it be done: away, away! <eos> <eos> Second


In [20]:
idx = corpus.dictionary.word2idx.get("That", -1)  # returns -1 if not found
print(f"Index of the word 'That': {idx}")

Index of the word 'That': 409


In [21]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [22]:
# Tip: Use an Embedding layer to map each word index to a dense vector.
# e.g., self.embedding = nn.Embedding(vocab_size, embed_dim)
class ShakesGen(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_layers, dropout=0.5):
        super(ShakesGen, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, num_layers, dropout=dropout)
        self.fc = nn.Linear(hidden_dim, vocab_size)
        self.dropout = nn.Dropout(dropout)
        self.hidden_dim = hidden_dim
        self.num_layers = num_layers

    def forward(self, x, hidden):
        x = self.embedding(x)
        x = self.dropout(x)
        out, hidden = self.rnn(x, hidden)
        out = out.contiguous().view(-1, self.hidden_dim)
        out = self.fc(out)
        return out, hidden

    def init_hidden(self, batch_size):
        # Zero-initialized (h_0, c_0) on the same device and dtype as the model weights.
        weight = next(self.parameters())
        return (weight.new_zeros(self.num_layers, batch_size, self.hidden_dim),
                weight.new_zeros(self.num_layers, batch_size, self.hidden_dim))
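
The tensor shapes flowing through `forward` can be checked with a small stand-alone sketch. It wires the same three layer types together directly (rather than instantiating `ShakesGen`) and uses toy sizes that are assumptions chosen only to keep the example fast.

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim, num_layers = 50, 8, 16, 2  # toy sizes
seq_len, batch_size = 5, 3

embedding = nn.Embedding(vocab_size, embed_dim)
rnn = nn.LSTM(embed_dim, hidden_dim, num_layers)  # expects (seq_len, batch, features)
fc = nn.Linear(hidden_dim, vocab_size)

x = torch.randint(0, vocab_size, (seq_len, batch_size))  # word indices
h0 = torch.zeros(num_layers, batch_size, hidden_dim)
c0 = torch.zeros(num_layers, batch_size, hidden_dim)

emb = embedding(x)                        # (seq_len, batch, embed_dim)
out, (hn, cn) = rnn(emb, (h0, c0))        # out: (seq_len, batch, hidden_dim)
logits = fc(out.reshape(-1, hidden_dim))  # (seq_len * batch, vocab_size)

print(emb.shape, out.shape, hn.shape, logits.shape)
```

The flattened logits have one row per (time step, batch element) pair, which matches the flattened targets returned by `get_batch`.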

In [23]:
# Hyperparameters
embed_dim = 128
hidden_dim = 256
num_layers = 2
dropout = 0.5
batch_size = 20
bptt_size = 35
num_epochs = 5
learning_rate = 0.002

# Prepare data
train_data = batchify(corpus.train, batch_size).to(device)
valid_data = batchify(corpus.valid, batch_size).to(device)

# Initialize model, loss function, and optimizer
model = ShakesGen(vocab_size, embed_dim, hidden_dim, num_layers, dropout).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

# Training loop
for epoch in range(num_epochs):
    model.train()
    total_loss = 0
    hidden = model.init_hidden(batch_size)
    for i in range(0, train_data.size(0) - 1, bptt_size):
        data, targets = get_batch(train_data, i, bptt_size)
        optimizer.zero_grad()
        output, hidden = model(data, hidden)
        loss = criterion(output, targets)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
        # Detach hidden state to prevent backpropagating through entire history
        hidden = tuple(h.detach() for h in hidden)
    # Divide by the number of batches actually processed in the loop above.
    avg_loss = total_loss / len(range(0, train_data.size(0) - 1, bptt_size))
    print(f'Epoch {epoch+1}, Loss: {avg_loss:.4f}')

# Validation loop
model.eval()
total_loss = 0
hidden = model.init_hidden(batch_size)
with torch.no_grad():
    for i in range(0, valid_data.size(0) - 1, bptt_size):
        data, targets = get_batch(valid_data, i, bptt_size)
        output, hidden = model(data, hidden)
        loss = criterion(output, targets)
        total_loss += loss.item()
        hidden = tuple(h.detach() for h in hidden)
    # Divide by the number of batches actually processed in the loop above.
    avg_loss = total_loss / len(range(0, valid_data.size(0) - 1, bptt_size))
    print(f'Validation Loss: {avg_loss:.4f}')

Epoch 1, Loss: 7.0446
Epoch 2, Loss: 6.2002
Epoch 3, Loss: 5.9119
Epoch 4, Loss: 5.7078
Epoch 5, Loss: 5.5243
Validation Loss: 5.6488
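
One way to read these numbers: `nn.CrossEntropyLoss` reports the mean negative log-likelihood in nats, so its exponential is the per-word perplexity, and a model that guesses uniformly over the vocabulary would score a loss of `ln(vocab_size)`. The sketch below applies this to the losses printed above; the helper names are hypothetical.

```python
import math

def perplexity(cross_entropy):
    """Perplexity for a mean cross-entropy given in nats."""
    return math.exp(cross_entropy)

def uniform_baseline_loss(vocab_size):
    """Cross-entropy of a model assigning equal probability to every word."""
    return math.log(vocab_size)

vocab_size = 25671  # vocabulary size printed earlier in the notebook
for name, loss in [("epoch 1", 7.0446), ("epoch 5", 5.5243), ("validation", 5.6488)]:
    print(f"{name}: loss {loss:.4f} -> perplexity {perplexity(loss):.1f}")
print(f"uniform baseline loss: {uniform_baseline_loss(vocab_size):.2f}")
```

The uniform baseline comes out at roughly 10.15, so even the first-epoch loss is already well below pure guessing.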


In [24]:
model.eval()
num_lines = 30
max_length = 20  # max words per line
input_word = '<eos>'
hidden = model.init_hidden(1)
with torch.no_grad():
    for _ in range(num_lines):
        line = []
        for _ in range(max_length):
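            # Feed the previous word as a (seq_len=1, batch=1) tensor created directly on the device.
            input_idx = torch.tensor([[corpus.dictionary.word2idx[input_word]]], device=device)
            output, hidden = model(input_idx, hidden)
            # Turn the logits into a distribution over the vocabulary and sample the next word.
            prob = F.softmax(output[-1], dim=0)
            word_idx = torch.multinomial(prob, 1).item()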
            input_word = corpus.dictionary.idx2word[word_idx]
            if input_word == '<eos>':
                break
            line.append(input_word)
        print(' '.join(line))

Lords: and keen enter their death's grow. but addle
Or so. of before what still Italy death,
Who made their quarrel? tires Jupiter
Or late in the wounds' part,
protest, middle the summer, but banish'd proud;
And Romeo hang. along; and but kings flight
The brother weeping be to of news.

JULIET:
Marry, me he have tied thou shalt of methinks,
Of like them: by rise, the highness
Sharp it being speak stand perchance, to my many-headed
Plantagenet, for no overta'en is and say
So.
With absolute his false wife to learn of?
Till from have so have go myself
Who make thy wife's wish of wine, I to it.
As thou be Edward, to be sweet look'd
To Luke's allow'd where when be thee
We friar, what fantastic to stifle and you the cloaks;
Unless we have swiftness, we have sign
Ye dost are approach.
Thy enrolled fellow between and a eyes and I not met

QUEEN ELIZABETH:
Put choosing what's thou English winking

KING BOLINGBROKE:
Woe hand, Isabel? like your despair?
here? are our lingering and back, is thee,
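
The sampling step used for generation can also be looked at in isolation. The sketch below adds a `temperature` parameter, which is not part of the notebook's generation loop: a value of 1.0 reproduces the plain softmax sampling used above, lower values favour the most likely word, and higher values make the output more random. The logits are made-up numbers for a four-word vocabulary.

```python
import torch
import torch.nn.functional as F

def sample_next(logits, temperature=1.0):
    """Sample one index from a 1-D logits vector; temperature=1.0 is plain softmax sampling."""
    probs = F.softmax(logits / temperature, dim=0)
    return torch.multinomial(probs, 1).item()

torch.manual_seed(0)
logits = torch.tensor([2.0, 1.0, 0.1, -1.0])  # made-up scores, index 0 is the most likely word
for t in (0.5, 1.0, 2.0):
    draws = [sample_next(logits, t) for _ in range(1000)]
    share = draws.count(0) / len(draws)
    print(f"temperature {t}: top word sampled {share:.0%} of the time")
```

In the generation loop, the same effect would come from dividing `output[-1]` by the temperature before the softmax.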


## Congratz, you made it! :)