# AIM5004_HW3

* (Total 10 pts) Train a character-level language model with recurrent neural network and Transformer architectures. Write the code in Python with whichever deep learning library you prefer, e.g. PyTorch, TensorFlow, Keras, JAX, or MXNet.

 - - -

## Question - a

(a) Download the Shakespeare dataset from https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt. Report the number of unique characters; this number is the size of your vocabulary (note that 'a' and 'A' are different characters). Also, show 3 random chunks (200 characters each) of the dataset.

In [31]:
import os
import random
import codecs
import numpy as np
import pandas as pd
import tensorflow as tf

In [60]:
import torch
import torch.nn as nn
import torch.nn.functional as F

from collections import Counter
import warnings

import string
import nltk

import matplotlib.pyplot as plt
import seaborn as sns

In [71]:
##Setting torch environment

if torch.cuda.is_available():
    DEVICE = torch.device('cuda')
else:
    DEVICE = torch.device('cpu')
    
print('Using PyTorch version:', torch.__version__, ' Device: ', DEVICE)

Using PyTorch version: 1.7.1  Device:  cuda


### Shakespeare text dataset

In [19]:
data_fpath = tf.keras.utils.get_file('shakespeare.txt','https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')
shakespeare = codecs.open(data_fpath, 'r', encoding='utf8').read()
data = shakespeare
data_len = len(data)
print(data_len)
print(data[:100])

1115394
First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You


### Vocabulary check

In [14]:
vocab = sorted(set(data))
vocab_size = len(vocab)

print('Vocabulary of the Shakespeare data: {}'.format(vocab))
print('Unique Characters: {}'.format(vocab_size))

Vocabulary of the Shakespeare data: ['\n', ' ', '!', '$', '&', "'", ',', '-', '.', '3', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
Unique Characters: 65


In [18]:
char2idx = {u:i for i, u in enumerate(vocab)}
idx2char = np.array(vocab)

text_as_int = np.array([char2idx[c] for c in data])

print('Example of the original text: ', data[:13])
print('Example of the encoded text:  {}'.format(text_as_int[:13]))

Example of the original text:  First Citizen
Example of the encoded text:  [18 47 56 57 58  1 15 47 58 47 64 43 52]


### Random dataset chunks

In [46]:
chunk = 200

def random_select():
    # Pick a random start so that the slice holds exactly `chunk` characters
    stt = random.randint(0, data_len - chunk)
    end = stt + chunk
    return data[stt : end]

In [47]:
print("First random Chunk: \n", random_select())

First random Chunk: 
 olmasters will I keep within my house,
Fit to instruct her youth. If you, Hortensio,
Or Signior Gremio, you, know any such,
Prefer them hither; for to cunning men
I will be very kind, and liberal
To mi


In [48]:
print("Second random Chunk: \n", random_select())

Second random Chunk: 
 ueen,
For she is good, hath brought you forth a daughter;
Here 'tis; commends it to your blessing.

LEONTES:
Out!
A mankind witch! Hence with her, out o' door:
A most intelligencing bawd!

PAULINA:
Not


In [49]:
print("Third random Chunk: \n", random_select())

Third random Chunk: 
  before Corioli, call him,
With all the applause and clamour of the host,
CAIUS MARCIUS CORIOLANUS! Bear
The addition nobly ever!

All:
Caius Marcius Coriolanus!

CORIOLANUS:
I will go wash;
And when m


- - -

## Question - b

(b) Design a vanilla RNN architecture and write the training code with the following hyperparameters. Report the number of model parameters.
(You can use RNN libraries if you want, but I recommend implementing it yourself.)
> (1) input embedding size: 64 \
(2) hidden size: 128 \
(3) the number of time steps (sequence length, or chunk length): 200 \
(4) the number of layers: 3 \
(5) activation function for hidden units: tanh \
(6) loss function: cross entropy loss \
(7) optimization algorithm: ADAM \
(8) batch size: 64 \
(9) training epochs: 30 \
(10) for other hyperparameters, you are free to choose whatever you would like to use.
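
As a minimal sketch of an architecture matching the hyperparameters above (and not the model trained in the cells that follow), a character-level vanilla RNN can be built on PyTorch's `nn.RNN`. The class name `CharRNN` and the helper `count_parameters` are hypothetical names chosen for illustration, and the vocabulary size of 65 is taken from Question (a).

```python
import torch
import torch.nn as nn


class CharRNN(nn.Module):
    """Character-level vanilla (Elman) RNN; the name is illustrative only."""

    def __init__(self, vocab_size=65, emb_size=64, hidden_size=128, num_layers=3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        # nonlinearity='tanh' selects the tanh hidden activation
        self.rnn = nn.RNN(emb_size, hidden_size, num_layers=num_layers,
                          nonlinearity='tanh', batch_first=True)
        self.dense = nn.Linear(hidden_size, vocab_size)

    def forward(self, x, hidden=None):
        # x: (batch, seq_len) tensor of integer character ids
        output, hidden = self.rnn(self.embedding(x), hidden)
        # logits: (batch, seq_len, vocab_size)
        return self.dense(output), hidden


def count_parameters(model):
    """Number of trainable parameters."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


model = CharRNN()
n_params = count_parameters(model)

# One dummy batch: batch size 64, sequence length 200
x = torch.randint(0, 65, (64, 200))
logits, hidden = model(x)
print(n_params, tuple(logits.shape), tuple(hidden.shape))
```

Under these assumptions the count breaks down as 4,160 (embedding) + 24,832 (first RNN layer) + 2 x 33,024 (second and third layers) + 8,385 (output layer) = 103,425 parameters.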

In [None]:
class Args:
    emb_size=64
    num_step=20
    epochs=30
    bs=64
    lr=0.001
    verbose='store_true'
    seed=710674

args = Args()    

np.random.seed(args.seed)
random.seed(args.seed)
torch.manual_seed(args.seed)

In [72]:
batch_size = 16
seq_size = 32
embedding_size = 64
lstm_size = 64
gradients_norm = 5

### Data preprocessing

In [55]:
def doc2words(doc):
    lines = doc.split('\n')
    lines = [line.strip(r'\"') for line in lines]
    words = ' '.join(lines).split()
    return words

In [56]:
def removepunct(words):
    punct = set(string.punctuation)
    words = [''.join([char for char in list(word) if char not in punct]) for word in words]
    return words

In [61]:
# get vocab from word list
def getvocab(words):
    wordfreq = Counter(words)
    sorted_wordfreq = sorted(wordfreq, key=wordfreq.get)
    return sorted_wordfreq

In [62]:
# get dictionary of int to words and word to int
def vocab_map(vocab):
    int_to_vocab = {k:w for k,w in enumerate(vocab)}
    vocab_to_int = {w:k for k,w in int_to_vocab.items()}
    return int_to_vocab, vocab_to_int

In [63]:
words = removepunct(doc2words(data))
vocab = getvocab(words)
int_to_vocab, vocab_to_int = vocab_map(vocab)

In [73]:
def get_batches(words, vocab_to_int, batch_size, seq_size):
    # generate a Xs and Ys of shape (batchsize * num_batches) * seq_size
    word_ints = [vocab_to_int[word] for word in words]
    num_batches = int(len(word_ints) / (batch_size * seq_size))
    Xs = word_ints[:num_batches*batch_size*seq_size]
    Ys = np.zeros_like(Xs)
    Ys[:-1] = Xs[1:]
    Ys[-1] = Xs[0]
    Xs = np.reshape(Xs, (num_batches*batch_size, seq_size))
    Ys= np.reshape(Ys, (num_batches*batch_size, seq_size))
    
    # iterate over rows of Xs and Ys to generate batches
    for i in range(0, num_batches*batch_size, batch_size):
        yield Xs[i:i+batch_size, :], Ys[i:i+batch_size, :]

### RNN Model

In [74]:
class RNNModule(nn.Module):
    # initialize RNN module
    def __init__(self, n_vocab, seq_size=32, embedding_size=64, lstm_size=64):
        super(RNNModule, self).__init__()
        self.seq_size = seq_size
        self.lstm_size = lstm_size
        self.embedding = nn.Embedding(n_vocab, embedding_size)
        self.lstm = nn.LSTM(embedding_size,
                            lstm_size,
                            batch_first=True)
        self.dense = nn.Linear(lstm_size, n_vocab)
        
    def forward(self, x, prev_state):
        embed = self.embedding(x)
        output, state = self.lstm(embed, prev_state)
        logits = self.dense(output)

        return logits, state
    
    def zero_state(self, batch_size):
        return (torch.zeros(1, batch_size, self.lstm_size),torch.zeros(1, batch_size, self.lstm_size))
 

In [75]:
def get_loss_and_train_op(net, lr=0.001):
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(net.parameters(), lr=lr)

    return criterion, optimizer

In [93]:
def generate_text(DEVICE, net, words, n_vocab, vocab_to_int, int_to_vocab, top_k=5):
    net.eval()

    state_h, state_c = net.zero_state(1)
    state_h = state_h.to(DEVICE)
    state_c = state_c.to(DEVICE)
    for w in words:
        ix = torch.tensor([[vocab_to_int[w]]]).to(DEVICE).long()
        output, (state_h, state_c) = net(ix, (state_h, state_c))
    
    _, top_ix = torch.topk(output[0], k=top_k)
    choices = top_ix.tolist()
    choice = np.random.choice(choices[0])

    words.append(int_to_vocab[choice])
    
    for _ in range(100):
        ix = torch.tensor([[choice]]).to(DEVICE).long()
        output, (state_h, state_c) = net(ix, (state_h, state_c))

        _, top_ix = torch.topk(output[0], k=top_k)
        choices = top_ix.tolist()
        choice = np.random.choice(choices[0])
        words.append(int_to_vocab[choice])

    print(' '.join(words))

In [82]:
def train_rnn(words, vocab_to_int, int_to_vocab, n_vocab):
    
    # RNN instance
    net = RNNModule(n_vocab, seq_size, embedding_size, lstm_size)
    net = net.to(DEVICE)
    criterion, optimizer = get_loss_and_train_op(net, 0.01)

    iteration = 0
    
    for e in range(50):
        batches = get_batches(words, vocab_to_int, batch_size, seq_size)
        state_h, state_c = net.zero_state(batch_size)

        # Transfer data to GPU
        state_h = state_h.to(DEVICE)
        state_c = state_c.to(DEVICE)
        for x, y in batches:
            iteration += 1

            # Tell it we are in training mode
            net.train()

            # Reset all gradients
            optimizer.zero_grad()

            # Transfer data to GPU
            x = torch.tensor(x).to(DEVICE).long()
            y = torch.tensor(y).to(DEVICE).long()

            logits, (state_h, state_c) = net(x, (state_h, state_c))
            loss = criterion(logits.transpose(1, 2), y)

            state_h = state_h.detach()
            state_c = state_c.detach()

            loss_value = loss.item()

            # Perform back-propagation (the states are detached above,
            # so the graph does not need to be retained between iterations)
            loss.backward()

            _ = torch.nn.utils.clip_grad_norm_(net.parameters(), gradients_norm)
            
            # Update the network's parameters
            optimizer.step()

            if iteration % 100 == 0:
                print('Epoch: {}/{}'.format(e, 200),'Iteration: {}'.format(iteration),'Loss: {}'.format(loss_value))

            # if iteration % 1000 == 0:
                # predict(device, net, flags.initial_words, n_vocab,vocab_to_int, int_to_vocab, top_k=5)
                # torch.save(net.state_dict(),'checkpoint_pt/model-{}.pth'.format(iteration))
                
    return net

In [83]:
len(vocab)

14746

In [84]:
rnn_net = train_rnn(words, vocab_to_int, int_to_vocab, len(vocab))

Epoch: 0/200 Iteration: 100 Loss: 7.218320369720459
Epoch: 0/200 Iteration: 200 Loss: 7.013370513916016
Epoch: 0/200 Iteration: 300 Loss: 6.7905378341674805
Epoch: 1/200 Iteration: 400 Loss: 6.801114082336426
Epoch: 1/200 Iteration: 500 Loss: 5.949028968811035
Epoch: 1/200 Iteration: 600 Loss: 6.4804606437683105
Epoch: 1/200 Iteration: 700 Loss: 6.328885078430176
Epoch: 2/200 Iteration: 800 Loss: 6.197182655334473
Epoch: 2/200 Iteration: 900 Loss: 5.70449161529541
Epoch: 2/200 Iteration: 1000 Loss: 5.8689069747924805
Epoch: 2/200 Iteration: 1100 Loss: 5.627951622009277
Epoch: 3/200 Iteration: 1200 Loss: 5.7942304611206055
Epoch: 3/200 Iteration: 1300 Loss: 5.828190326690674
Epoch: 3/200 Iteration: 1400 Loss: 5.717165946960449
Epoch: 3/200 Iteration: 1500 Loss: 5.5473127365112305
Epoch: 4/200 Iteration: 1600 Loss: 5.134014129638672
Epoch: 4/200 Iteration: 1700 Loss: 5.2309250831604
Epoch: 4/200 Iteration: 1800 Loss: 5.256106853485107
Epoch: 4/200 Iteration: 1900 Loss: 5.1061835289001465

In [100]:
generate_text(DEVICE, rnn_net, ['We', 'are'], len(vocab), vocab_to_int, int_to_vocab)

We are merely cheated of the mustard PETRUCHIO Marry sir tis not the fountain and wary ARIEL Not to your reports KATHARINA What noise so thriftless trust Like the way to be touched BAPTISTA What is the day She shall we proceed But in my remembrance MIRANDA The other moon shines to make thee From whence that thou must Katharina As we shall have the master Angelo I know not how he woo PETRUCHIO A horse in the lampass slaughter in a chains arms and break it and a fossetseller for the beak Have he is a kind man Three accident that hath


In [99]:
generate_text(DEVICE, rnn_net, ['what'], len(vocab), vocab_to_int, int_to_vocab)

what is no more Suitors in this action of it will come you at least Affections run mad Under the bay to be vows to the people And so I pray thee for the rest o the bolster BAPTISTA Now by my troth is it not heard him with the gown owes thee to me If it doth not PETRUCHIO I pray you are welcome And I have used me in a warrant fat tear and obtaind for the window 3 In human reason thou varlet now HORTENSIO Now by your lordship drink my master wink BAPTISTA Now in your staves as


In [101]:
generate_text(DEVICE, rnn_net, ['You'], len(vocab), vocab_to_int, int_to_vocab)

You may not be done When you have done withal and called the world without counters I crave any man of Pisa and vengeance ISABELLA A whoreson Ay sir to Padua GRUMIO Nay if Katharina may be your ben Page His demand Springs not that it was certainly to be Pisa PETRUCHIO Now Kate Master thou art thence which thou shalt be deceived But now Lucentio or dost thou art to him I mean in sooth Signior Baptista whilst I am Grumios pledge unhappy master Lucentio is not so much lengththe Aufidius And thou hast never forgot Nothing anon and ride not


In [106]:
generate_text(DEVICE, rnn_net, ['I', 'tell', 'you', 'friends'], len(vocab), vocab_to_int, int_to_vocab)

I tell you friends and admit my lord tis a father which is that she is coming BAPTISTA What ist eer seven i the heel And so to do me to thy minions charge night PROSPERO MIRANDA What says my father will I not Four lagging clout Duke at We shall I go by manifest LUCENTIO Peace Now be some comfort be ruled GREEN That in this business AUTOLYCUS She was a respected woman in my fathers liking Claudio VINCENTIO Not you gentlemen No impediment as he is a toy to the gaol BAPTISTA How now the sliding of your hands and tie me so


- - -

## Question - c

(c) Perplexity is defined as the exponentiated average negative log-likelihood of a sequence. Let X = (x0, ..., xt); then the perplexity of X is ...
Train your RNN and provide a PPL curve over the course of training.
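
Reading the verbal definition above literally, the perplexity of N predicted tokens is exp(-(1/N) Σ log p(x_i | x_<i)), which is the exponential of the mean cross-entropy measured in nats. The sketch below illustrates that relationship in plain Python. The function names and the per-epoch loss values are made up for illustration and are not results from this notebook.

```python
import math


def perplexity(token_log_probs):
    """exp of the average negative log-likelihood (natural log)."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def perplexity_from_loss(mean_cross_entropy):
    """Perplexity from a token-averaged cross-entropy loss in nats."""
    return math.exp(mean_cross_entropy)


# A model that assigns probability 1/65 to every character of a
# 200-character chunk has perplexity equal to the vocabulary size.
uniform = [math.log(1 / 65)] * 200
ppl_uniform = perplexity(uniform)

# Hypothetical per-epoch mean losses (illustrative numbers only)
epoch_losses = [3.2, 2.5, 2.1]
ppl_curve = [perplexity_from_loss(loss) for loss in epoch_losses]
print(ppl_uniform, ppl_curve)
```

Because `nn.CrossEntropyLoss` with its default `mean` reduction averages over every token in the batch, a PPL curve can be obtained by averaging that loss over an epoch and exponentiating the result.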

- - -

## Question - d

(d) Among GRU, LSTM, and Transformer, pick your favorite architecture and design a model whose number of parameters is similar to that of the vanilla RNN you implemented above. Then train it and provide a PPL curve over the course of training (in the same plot as in (c)). You are free to select any hyperparameters if needed (no need to use the hyperparameters above). Report the number of model parameters. (You can use GRU, LSTM, or Transformer libraries if you want, but I recommend implementing it yourself.)
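
One way to match parameter budgets is to count parameters analytically and search over the hidden size. The sketch below assumes an embedding, recurrent stack, and linear output layer laid out as in PyTorch (`W_ih`, `W_hh`, `b_ih`, `b_hh` per layer), a 65-character vocabulary, and a single-layer GRU as the alternative. The helper `rnn_lm_params` is a hypothetical name, and the single-layer GRU is only one of many valid choices.

```python
def rnn_lm_params(vocab, emb, hidden, layers, gates=1):
    """Parameter count of embedding -> recurrent stack -> linear output.

    gates=1 is a vanilla RNN, gates=3 a GRU, gates=4 an LSTM; each gate
    owns one input matrix, one recurrent matrix, and two bias vectors.
    """
    total = vocab * emb  # embedding table
    in_size = emb
    for _ in range(layers):
        total += gates * (in_size * hidden + hidden * hidden + 2 * hidden)
        in_size = hidden  # deeper layers read the previous layer's output
    total += hidden * vocab + vocab  # output projection and its bias
    return total


# Budget of the 3-layer vanilla RNN from Question (b)
target = rnn_lm_params(65, 64, 128, layers=3)

# Hidden size of a 1-layer GRU whose parameter count is closest to the budget
best_hidden = min(
    range(16, 513),
    key=lambda h: abs(rnn_lm_params(65, 64, h, layers=1, gates=3) - target),
)
gru_params = rnn_lm_params(65, 64, best_hidden, layers=1, gates=3)
print(target, best_hidden, gru_params)
```

With these assumptions the search selects a hidden size of 143, giving 103,181 parameters against a budget of 103,425.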

- - -

## Question - e

(e) Pick the best performing (lowest PPL score) model, and generate the text autoregressively given the following prompts.
> (1) ‘We are’ \
(2) ‘what’ \
(3) ‘You’ \
(4) ‘I tell you, friends’
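
Autoregressive generation feeds each sampled character back in as input for the next step. The loop below is a sketch of that idea in plain Python. `next_char_probs` is a hypothetical callable standing in for a trained model, and `toy_model` is a deterministic placeholder used only to make the example runnable.

```python
import random


def generate(prompt, next_char_probs, n_chars, rng):
    """Extend `prompt` one character at a time, feeding each sample back in."""
    text = prompt
    for _ in range(n_chars):
        probs = next_char_probs(text)  # dict mapping char -> probability
        chars, weights = zip(*probs.items())
        text += rng.choices(chars, weights=weights, k=1)[0]
    return text


def toy_model(text):
    """Placeholder model: cycles a -> b -> c -> a with probability 1."""
    cycle = "abc"
    last = text[-1]
    nxt = cycle[(cycle.index(last) + 1) % 3] if last in cycle else "a"
    return {nxt: 1.0}


sample = generate("We are", toy_model, 5, random.Random(0))
print(sample)
```

A character-level model can consume the four prompts exactly as written, including the comma in the fourth one, since `','` is part of the 65-character vocabulary.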