# AIM5004_HW3

* (Total 10 pts) Train a character-level language model with recurrent neural networks and a Transformer architecture. Write your code in Python with whichever deep learning library you prefer, e.g. PyTorch, TensorFlow, Keras, JAX, or MXNet.

 - - -

## Question - a

(a) Download the Shakespeare dataset from https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt. Report the number of unique characters; this number is the size of your vocabulary (note that 'a' and 'A' are different characters). Also, show 3 random chunks (200 characters each) of the dataset.

In [None]:
import os
import random
import codecs
import numpy as np
import pandas as pd
import tensorflow as tf

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

from collections import Counter
import warnings

import string
import nltk

import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
##Setting torch environment

if torch.cuda.is_available():
    DEVICE = torch.device('cuda')
else:
    DEVICE = torch.device('cpu')
    
print('Using PyTorch version:', torch.__version__, ' Device: ', DEVICE)

### Shakespeare text dataset

In [None]:
data_fpath = tf.keras.utils.get_file('shakespeare.txt','https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')
# Read the whole file and close the handle afterwards
with open(data_fpath, 'r', encoding='utf8') as f:
    shakespeare = f.read()
data = shakespeare
data_len = len(data)
print(data_len)
print(data[:100])

### Vocabulary check

In [None]:
vocab = sorted(set(data))
vocab_size = len(vocab)

print('Vocabulary of the Shakespeare data: {}'.format(vocab))
print('Unique Characters: {}'.format(vocab_size))

In [None]:
char2idx = {u:i for i, u in enumerate(vocab)}
idx2char = np.array(vocab)

text_as_int = np.array([char2idx[c] for c in data])

print('Example of the original text: ', data[:13])
print('Example of the encoded text:  {}'.format(text_as_int[:13]))

### Random dataset chunks

In [None]:
chunk = 200

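def random_select():
    # Pick a random start so that a full chunk always fits inside the text
    stt = random.randint(0, data_len - chunk)
    end = stt + chunk  # the slice holds exactly `chunk` (200) characters
    return data[stt : end]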

In [None]:
print("First random Chunk: \n", random_select())

In [None]:
print("Second random Chunk: \n", random_select())

In [None]:
print("Third random Chunk: \n", random_select())

- - -

## Question - b

(b) Design a vanilla RNN architecture and write the training code with the following hyperparameters. Report the number of model parameters.
(You can use RNN libraries if you want, but I recommend implementing it yourself.)
> (1) input embedding size: 64 \
(2) hidden size: 128 \
(3) the number of time steps (sequence length, or chunk length): 200 \
(4) the number of layers: 3 \
(5) activation function for hidden units: tanh \
(6) loss function: cross entropy loss \
(7) optimization algorithm: ADAM \
(8) batch size: 64 \
(9) training epochs: 30 \
(10) for other hyperparameters, you are free to choose whatever you would like to use.
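
The parameter count requested above can be worked out by hand before building any model. The sketch below is only an illustration under stated assumptions: the function name `vanilla_rnn_param_count` is hypothetical, each recurrent layer is assumed to carry two bias vectors (as PyTorch's `nn.RNN` does), and `VOCAB_SIZE = 65` is a placeholder that should be replaced by the number reported in Question a.

```python
def vanilla_rnn_param_count(vocab_size, emb_size, hidden_size, num_layers):
    """Count parameters of embedding + stacked tanh RNN layers + output projection."""
    embedding = vocab_size * emb_size

    recurrent = 0
    in_size = emb_size
    for _ in range(num_layers):
        # Input-to-hidden weights, hidden-to-hidden weights, and two bias vectors
        recurrent += hidden_size * in_size + hidden_size * hidden_size + 2 * hidden_size
        in_size = hidden_size  # deeper layers read the previous layer's hidden state

    # Linear layer mapping the top hidden state to vocabulary logits (weights + bias)
    output = hidden_size * vocab_size + vocab_size
    return embedding + recurrent + output


VOCAB_SIZE = 65  # placeholder (assumption): use the value found in Question a
print(vanilla_rnn_param_count(VOCAB_SIZE, emb_size=64, hidden_size=128, num_layers=3))
```

For an actual PyTorch model, the same number can be cross-checked with `sum(p.numel() for p in model.parameters())`.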

In [None]:
class Args:
    emb_size=64
    seq_size=200
    lstm_size=64
    g_norm=5
    bs=64
    num_step=20
    epochs=30
    lr=0.001
    momentum = 0.9
    verbose='store_true'
    seed=710674

args = Args()    

np.random.seed(args.seed)
random.seed(args.seed)
torch.manual_seed(args.seed)

### Data preprocessing

In [None]:
def doc2words(doc):
    lines = doc.split('\n')
    lines = [line.strip(r'\"') for line in lines]
    words = ' '.join(lines).split()
    return words

In [None]:
def removepunct(words):
    punct = set(string.punctuation)
    words = [''.join([char for char in list(word) if char not in punct]) for word in words]
    return words

In [None]:
# get vocab from word list
def getvocab(words):
    wordfreq = Counter(words)
    sorted_wordfreq = sorted(wordfreq, key=wordfreq.get)
    return sorted_wordfreq

In [None]:
# get dictionary of int to words and word to int
def vocab_map(vocab):
    int_to_vocab = {k:w for k,w in enumerate(vocab)}
    vocab_to_int = {w:k for k,w in int_to_vocab.items()}
    return int_to_vocab, vocab_to_int

In [None]:
words = removepunct(doc2words(data))
vocab = getvocab(words)
int_to_vocab, vocab_to_int = vocab_map(vocab)

In [None]:
# vocab_to_int

In [None]:
print(len(words))
print(len(data))

In [None]:
v_to_int = [vocab_to_int[word] for word in words]

In [None]:
len(v_to_int)

In [None]:
def get_batches(words, vocab_to_int, batch_size, seq_size):
    # generate Xs and Ys of shape (batch_size * num_batches, seq_size);
    # Ys is Xs shifted by one position (the next-token targets)
    word_ints = [vocab_to_int[word] for word in words]
    num_batches = len(word_ints) // (batch_size * seq_size)
    Xs = word_ints[:num_batches*batch_size*seq_size]
    Ys = np.zeros_like(Xs)
    Ys[:-1] = Xs[1:]
    Ys[-1] = Xs[0]
    Xs = np.reshape(Xs, (num_batches*batch_size, seq_size))
    Ys= np.reshape(Ys, (num_batches*batch_size, seq_size))
    
    # iterate over rows of Xs and Ys to generate batches
    for i in range(0, num_batches*batch_size, batch_size):
        yield Xs[i:i+batch_size, :], Ys[i:i+batch_size, :]
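
Because the assignment asks for a character-level model, a character-level counterpart of `get_batches` may be useful. The following is a minimal sketch, not the notebook's method: `char_batches` is a hypothetical name, it works on plain Python lists, and it assumes a mapping like the `char2idx` dictionary built in Question a.

```python
def char_batches(text, char2idx, batch_size, seq_size):
    """Yield (inputs, targets) as nested lists of shape [batch_size][seq_size]."""
    ids = [char2idx[c] for c in text]

    # Each sequence needs seq_size inputs plus one extra character for its last target
    n_seqs = (len(ids) - 1) // seq_size
    n_batches = n_seqs // batch_size

    for b in range(n_batches):
        xs, ys = [], []
        for s in range(batch_size):
            start = (b * batch_size + s) * seq_size
            xs.append(ids[start:start + seq_size])
            # Targets are the inputs shifted one character to the right
            ys.append(ids[start + 1:start + seq_size + 1])
        yield xs, ys


# Tiny usage example with a made-up string
demo_text = "abcdefghij"
demo_map = {c: i for i, c in enumerate(sorted(set(demo_text)))}
for x, y in char_batches(demo_text, demo_map, batch_size=2, seq_size=2):
    print(x, y)
```

Unlike the word-level version, this variant does not wrap the final target around to the first token; leftover characters at the end of the text are simply dropped.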

### RNN Model

In [None]:
class RNN_model(nn.Module):
    ## initialize RNN module
    def __init__(self, n_vocab, seq_size, emb_size, lstm_size):
        super(RNN_model, self).__init__()
        self.seq_size = seq_size
        self.lstm_size = lstm_size
        # use the constructor argument `emb_size` (`embedding_size` was undefined)
        self.embedding = nn.Embedding(n_vocab, emb_size)
        self.lstm = nn.LSTM(emb_size,
                            lstm_size,
                            batch_first=True)
        self.dense = nn.Linear(lstm_size, n_vocab)
    
    
    ## forward path
    def forward(self, x, prev_state):
        embed = self.embedding(x)
        output, state = self.lstm(embed, prev_state)
        logits = self.dense(output)

        return logits, state
    
    def zero_state(self, batch_size):
        return (torch.zeros(1, batch_size, self.lstm_size),torch.zeros(1, batch_size, self.lstm_size))

In [None]:
class RNN_model(nn.Module):
    ## initialize the model; sizes default to the values stored in `args`
    def __init__(self, n_vocab, seq_size=args.seq_size, emb_size=args.emb_size, lstm_size=args.lstm_size):
        super(RNN_model, self).__init__()
        self.seq_size = seq_size
        self.lstm_size = lstm_size
        
        self.embedding = nn.Embedding(n_vocab, emb_size)
        self.lstm = nn.LSTM(emb_size, lstm_size, batch_first=True)
        self.dense = nn.Linear(lstm_size, n_vocab)
    
    ## forward 
    def forward(self, x, prev_state):
        embed = self.embedding(x)
        output, state = self.lstm(embed, prev_state)
        logits = self.dense(output)

        return logits, state
    
    ## initial (h, c) state; honor `batch_size` so that batch size 1 works at generation time
    def zero_state(self, batch_size):
        return (torch.zeros(1, batch_size, self.lstm_size),torch.zeros(1, batch_size, self.lstm_size))

* Criterion and optimizer settings

In [None]:
def cri_opti(net, lr):
    criterion = nn.CrossEntropyLoss()
    # use the `lr` argument instead of silently ignoring it
    optimizer = torch.optim.Adam(net.parameters(), lr=lr)

    return criterion, optimizer

* RNN model training

In [None]:
def train_rnn(words, vocab_to_int, int_to_vocab, n_vocab):
    
    ## RNN instance
    model = RNN_model(n_vocab, args.seq_size, args.emb_size, args.lstm_size)
    model = model.to(DEVICE)
    criterion, optimizer = cri_opti(model, lr=args.lr)

    iteration = 0
    
    for epoch in range(args.epochs):
        batches = get_batches(words, vocab_to_int, args.bs, args.seq_size)
        state_h, state_c = model.zero_state(args.bs)

        ## Transfer the initial state to the device
        state_h = state_h.to(DEVICE)
        state_c = state_c.to(DEVICE)
        
        for x, y in batches:
            iteration += 1

            ## Tell it we are in training mode
            model.train()

            ## Reset all gradients
            optimizer.zero_grad()

            ## Transfer data to the device
            x = torch.tensor(x).to(DEVICE).long()
            y = torch.tensor(y).to(DEVICE).long()

            logits, (state_h, state_c) = model(x, (state_h, state_c))
            ## CrossEntropyLoss expects (batch, classes, seq), hence the transpose
            loss = criterion(logits.transpose(1, 2), y)

            ## Detach the state so gradients do not flow into earlier batches
            state_h = state_h.detach()
            state_c = state_c.detach()

            loss_value = loss.item()

            ## Perform back-propagation (the carried state is detached, so retain_graph is not needed)
            loss.backward()
            _ = torch.nn.utils.clip_grad_norm_(model.parameters(), args.g_norm)
            
            # Update the network's parameters
            optimizer.step()

            if iteration % 100 == 0:
                print('Epoch: {}/{}'.format(epoch, args.epochs),'Iteration: {}'.format(iteration),'Loss: {}'.format(loss_value))

            # if iteration % 1000 == 0:
                # predict(device, net, flags.initial_words, n_vocab,vocab_to_int, int_to_vocab, top_k=5)
                # torch.save(net.state_dict(),'checkpoint_pt/model-{}.pth'.format(iteration))
                
    return model

In [None]:
len(vocab)

* Training results

In [None]:
rnn_net = train_rnn(words, vocab_to_int, int_to_vocab, len(vocab))

In [None]:
def generate_text(DEVICE, net, words, n_vocab, vocab_to_int, int_to_vocab, top_k=5):
    net.eval()
    words = list(words)  # work on a copy so the caller's prompt is not modified

    state_h, state_c = net.zero_state(1)
    state_h = state_h.to(DEVICE)
    state_c = state_c.to(DEVICE)

    ## No gradients are needed while generating
    with torch.no_grad():
        ## Feed the prompt word by word to build up the state
        for w in words:
            ix = torch.tensor([[vocab_to_int[w]]]).to(DEVICE).long()
            output, (state_h, state_c) = net(ix, (state_h, state_c))

        ## Pick the next word uniformly at random among the top_k logits
        _, top_ix = torch.topk(output[0], k=top_k)
        choices = top_ix.tolist()
        choice = np.random.choice(choices[0])
        words.append(int_to_vocab[choice])

        for _ in range(100):
            ix = torch.tensor([[choice]]).to(DEVICE).long()
            output, (state_h, state_c) = net(ix, (state_h, state_c))

            _, top_ix = torch.topk(output[0], k=top_k)
            choices = top_ix.tolist()
            choice = np.random.choice(choices[0])
            words.append(int_to_vocab[choice])

    print(' '.join(words))

In [None]:
generate_text(DEVICE, rnn_net, ['We', 'are'], len(vocab), vocab_to_int, int_to_vocab)

In [None]:
generate_text(DEVICE, rnn_net, ['what'], len(vocab), vocab_to_int, int_to_vocab)

In [None]:
generate_text(DEVICE, rnn_net, ['You'], len(vocab), vocab_to_int, int_to_vocab)

In [None]:
generate_text(DEVICE, rnn_net, ['I', 'tell', 'you', 'friends'], len(vocab), vocab_to_int, int_to_vocab)

- - -

## Question - c

(c) Perplexity is defined as the exponentiated average negative log-likelihood of a sequence. Let X = (x_0, ..., x_t); then the perplexity of X is ...
> Train your RNN and provide a PPL curve over the course of training.
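
Following the verbal definition above (the exponential of the average negative log-likelihood), perplexity can be obtained directly from the cross-entropy loss, provided the loss uses the natural logarithm, as `nn.CrossEntropyLoss` does. The sketch below is a minimal illustration; the function names `perplexity` and `epoch_perplexity` are hypothetical, and the second one assumes each batch loss is a per-token mean.

```python
import math


def perplexity(token_nlls):
    """exp of the mean per-token negative log-likelihood (natural log)."""
    return math.exp(sum(token_nlls) / len(token_nlls))


def epoch_perplexity(batch_losses, batch_token_counts):
    """Combine mean-reduced batch losses into one epoch-level perplexity.

    Each batch loss is weighted by its token count so that batches of
    different sizes contribute proportionally.
    """
    total_nll = sum(loss * n for loss, n in zip(batch_losses, batch_token_counts))
    return math.exp(total_nll / sum(batch_token_counts))


# A model that is uniformly unsure among 4 characters assigns each token
# probability 1/4, i.e. a negative log-likelihood of log(4) per token.
print(perplexity([math.log(4)] * 10))
```

Collecting one `epoch_perplexity` value per epoch gives the points of the PPL curve; plotting them against the epoch index with `plt.plot` is one straightforward option.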

- - -

## Question - d

(d) Among GRU, LSTM, and Transformer, pick your favorite architecture and design a model whose number of parameters is similar to that of the vanilla RNN you implemented above. Then train it and provide a PPL curve over the course of training (in the same plot as in (c)). You are free to select any hyperparameters if needed (no need to use the hyperparameters above). Report the number of model parameters. (You can use GRU, LSTM, or Transformer libraries if you want, but I recommend implementing it yourself.)
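
One way to match parameter counts is to compute them analytically and search for a suitable hidden size. The sketch below is illustrative only: `recurrent_param_count` and `closest_hidden` are hypothetical helpers, the gate multipliers (1 for a vanilla RNN, 3 for a GRU, 4 for an LSTM) and the two bias vectors per gate block follow the layout of PyTorch's recurrent layers, and the vocabulary size of 65 is a placeholder for the value from Question a.

```python
def recurrent_param_count(vocab, emb, hidden, layers, gates):
    """Parameters of embedding + stacked recurrent layers + output projection.

    gates=1 models a vanilla RNN, gates=3 a GRU, gates=4 an LSTM.
    """
    total = vocab * emb
    in_size = emb
    for _ in range(layers):
        # Per gate block: input weights, recurrent weights, and two bias vectors
        total += gates * (hidden * (in_size + hidden) + 2 * hidden)
        in_size = hidden
    total += hidden * vocab + vocab  # output projection (weights + bias)
    return total


def closest_hidden(target, vocab, emb, layers, gates, max_hidden=1024):
    """Hidden size in [1, max_hidden] whose parameter count is closest to target."""
    return min(
        range(1, max_hidden + 1),
        key=lambda h: abs(recurrent_param_count(vocab, emb, h, layers, gates) - target),
    )


VOCAB = 65  # placeholder (assumption): use the value found in Question a
target = recurrent_param_count(VOCAB, 64, 128, 3, gates=1)  # 3-layer vanilla RNN
h = closest_hidden(target, VOCAB, 64, layers=1, gates=4)    # single-layer LSTM
print(target, h, recurrent_param_count(VOCAB, 64, h, 1, 4))
```

The search is exhaustive over hidden sizes, which is cheap here because the count is a closed-form expression. A Transformer would need its own counting formula.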

- - -

## Question - e

(e) Pick the best performing (lowest PPL score) model, and generate the text autoregressively given the following prompts.
> (1) ‘We are’ \
(2) ‘what’ \
(3) ‘You’ \
(4) ‘I tell you, friends’
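
The autoregressive loop itself is independent of the model: predict logits for the next token, sample one, append it, and repeat. The sketch below shows that loop in plain Python under stated assumptions. `next_logits` is a hypothetical callable standing in for the trained network, `sample_top_k` and `generate` are illustrative names, and the sampling is weighted by softmax probability, which differs from picking uniformly among the top k candidates.

```python
import math
import random


def sample_top_k(logits, k, rng):
    """Sample an index among the k largest logits, weighted by softmax probability."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)
    # Subtract the peak before exponentiating for numerical stability
    weights = [math.exp(logits[i] - peak) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


def generate(next_logits, prompt_ids, n_new, k=5, seed=0):
    """Extend prompt_ids by n_new tokens, feeding each sample back as input."""
    rng = random.Random(seed)
    ids = list(prompt_ids)
    for _ in range(n_new):
        ids.append(sample_top_k(next_logits(ids), k, rng))
    return ids


# Toy stand-in for a model (assumption): always prefers "previous token + 1"
def toy_next_logits(ids, vocab_size=8):
    return [5.0 if i == (ids[-1] + 1) % vocab_size else 0.0 for i in range(vocab_size)]


print(generate(toy_next_logits, [0], n_new=4, k=1))
```

With `k=1` the procedure reduces to greedy decoding; larger `k` trades determinism for variety. For a character-level model, the prompts above would first be encoded character by character.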