# AIM5004_HW3

* (Total 10 pts) Train a character-level language model with recurrent neural networks and the Transformer architecture. Write the code in Python with whichever deep learning library you prefer, e.g. PyTorch, TensorFlow, Keras, JAX, or MXNet.

 - - -

## Question - a

(a) Download the Shakespeare dataset from https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt. Report the number of unique characters; this number is the size of your vocabulary (note that 'a' and 'A' are different characters). Also, show 3 random chunks (200 characters each) of the dataset.

In [2]:
import os
import random
import codecs
import string
import nltk
import warnings
import numpy as np
import pandas as pd
import tensorflow as tf
import seaborn as sns
import matplotlib.pyplot as plt

from collections import Counter

In [76]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable

In [4]:
##Setting torch environment

if torch.cuda.is_available():
    DEVICE = torch.device('cuda')
else:
    DEVICE = torch.device('cpu')
    
print('Using PyTorch version:', torch.__version__, ' Device: ', DEVICE)

Using PyTorch version: 1.7.1  Device:  cuda


### Shakespeare text dataset

In [5]:
data_fpath = tf.keras.utils.get_file('shakespeare.txt','https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')
shakespeare = codecs.open(data_fpath, 'r', encoding='utf8').read()
data = shakespeare
data_len = len(data)
print(data_len)
print(data[:100]) ## including space

1115394
First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You


### Vocabulary check

In [11]:
vocab = sorted(set(data))
vocab_size = len(vocab)

print('Vocabulary of the Shakespeare data: {}'.format(vocab))
print('The number of unique characters: {}'.format(vocab_size))

Vocabulary of the Shakespeare data: ['\n', ' ', '!', '$', '&', "'", ',', '-', '.', '3', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
The number of unique characters: 65


In [12]:
char2idx = {u:i for i, u in enumerate(vocab)}
idx2char = np.array(vocab)

text_as_int = np.array([char2idx[c] for c in data])

print('Example of the original text: ', data[:13])
print('Example of the encoded text:  {}'.format(text_as_int[:13]))

Example of the original text:  First Citizen
Example of the encoded text:  [18 47 56 57 58  1 15 47 58 47 64 43 52]


In [14]:
idx2char

array(['\n', ' ', '!', '$', '&', "'", ',', '-', '.', '3', ':', ';', '?',
       'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M',
       'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z',
       'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm',
       'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z'],
      dtype='<U1')

### Random dataset chunks

In [15]:
chunk = 200

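def random_select():
    ## pick a start index so that a full chunk fits inside the data
    stt = random.randint(0, data_len - chunk)
    end = stt + chunk   ## slice end is exclusive, so this yields exactly `chunk` characters
    return data[stt : end]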

In [16]:
print("First random Chunk: \n", random_select())

First random Chunk: 
  stoop and take't
Because we see it; but what we do not see
We tread upon, and never think of it.
You may not so extenuate his offence
For I have had such faults; but rather tell me,
When I, that censu


In [17]:
print("Second random Chunk: \n", random_select())

Second random Chunk: 
 nment?

HASTINGS:
With patience, noble lord, as prisoners must:
But I shall live, my lord, to give them thanks
That were the cause of my imprisonment.

GLOUCESTER:
No doubt, no doubt; and so shall Clar


In [18]:
print("Third random Chunk: \n", random_select())

Third random Chunk: 
 ou judge my wit would
fly?

Third Citizen:
Nay, your wit will not so soon out as another man's
will;'tis strongly wedged up in a block-head, but
if it were at liberty, 'twould, sure, southward.

Second


- - -

## Question - b

(b) Design a vanilla RNN architecture and write the training code with the following hyperparameters. Report the number of model parameters.
(You may use RNN libraries if you want, but I recommend implementing it yourself.)
> (1) input embedding size: 64 \
(2) hidden size: 128 \
(3) the number of time steps (sequence length, or chunk length): 200 \
(4) the number of layers: 3 \
(5) activation function for hidden units: tanh \
(6) loss function: cross entropy loss \
(7) optimization algorithm: ADAM \
(8) batch size: 64 \
(9) training epochs: 30 \
(10) for other hyperparameters, you are free to choose whatever you would like to use.
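
The parameter count requested in (b) can be checked by hand. The sketch below is only an illustration: it assumes an embedding layer, a stacked tanh RNN laid out like PyTorch's `nn.RNN` (two bias vectors per layer), and an untied linear output layer. The helper name `rnn_param_count` is hypothetical, and the vocabulary size of 65 is taken from part (a).

```python
def rnn_param_count(vocab_size, emb_size, hidden_size, num_layers):
    """Count trainable parameters of embedding -> stacked tanh RNN -> linear.

    Assumption: each RNN layer has W_ih, W_hh and two bias vectors
    (b_ih, b_hh); the output layer is not tied to the embedding.
    """
    total = vocab_size * emb_size                    # embedding table
    in_size = emb_size
    for _ in range(num_layers):
        total += hidden_size * in_size               # W_ih
        total += hidden_size * hidden_size           # W_hh
        total += 2 * hidden_size                     # b_ih + b_hh
        in_size = hidden_size                        # next layer reads hidden states
    total += hidden_size * vocab_size + vocab_size   # output projection + bias
    return total


n_params = rnn_param_count(vocab_size=65, emb_size=64, hidden_size=128, num_layers=3)
print(n_params)  # 103425
```

With a framework model, the same number could be cross-checked by summing the element counts of all trainable parameters.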

In [19]:
class Args:
    emb_size=64
    seq_size=200   ## chunk length
    lstm_size=64
    g_norm=5
    bs=64
    num_step=20
    epochs=30
    lr=0.001
    momentum = 0.9
    verbose='store_true'
    seed=710674

args = Args()    

np.random.seed(args.seed)
random.seed(args.seed)
torch.manual_seed(args.seed)

<torch._C.Generator at 0x1c90e9a0b30>

### Data preprocessing

In [20]:
## splitting document into words
def doc2words(doc):
    lines = doc.split('\n')
    lines = [line.strip(r'\"') for line in lines]
    words = ' '.join(lines).split()
    return words

In [21]:
## removing punctuations attached to words
def removepunct(words):
    punct = set(string.punctuation)
    words = [''.join([char for char in list(word) if char not in punct]) for word in words]
    return words

In [22]:
## get vocabulary from word list
def getvocab(words):
    wordfreq = Counter(words)
    sorted_wordfreq = sorted(wordfreq, key=wordfreq.get)
    return sorted_wordfreq

In [23]:
## get dictionary of int to words and word to int
def vocab_map(vocab):
    int_to_vocab = {k:w for k,w in enumerate(vocab)}
    vocab_to_int = {w:k for k,w in int_to_vocab.items()}
    return int_to_vocab, vocab_to_int

In [24]:
words = removepunct(doc2words(data))
vocab = getvocab(words)
int_to_vocab, vocab_to_int = vocab_map(vocab)

In [35]:
print("sample words: ", words[:10])
print("sample vocab: ", vocab[:10])

sample words:  ['First', 'Citizen', 'Before', 'we', 'proceed', 'any', 'further', 'hear', 'me', 'speak']
sample vocab:  ['relieved', 'humanely', 'afflicts', 'inventory', 'particularise', 'rakes', 'commonalty', 'maliciously', 'softconscienced', 'altitude']


* Preparing batch set for the training process

In [36]:
def get_batches(words, vocab_to_int, batch_size, seq_size):
    # generate Xs and Ys of shape (batch_size * num_batches, seq_size)
    word_ints = [vocab_to_int[word] for word in words]
    num_batches = int(len(word_ints) / (batch_size * seq_size))
    Xs = word_ints[:num_batches*batch_size*seq_size]
    Ys = np.zeros_like(Xs)
    Ys[:-1] = Xs[1:]
    Ys[-1] = Xs[0]
    Xs = np.reshape(Xs, (num_batches*batch_size, seq_size))
    Ys= np.reshape(Ys, (num_batches*batch_size, seq_size))
    
    # iterate over rows of Xs and Ys to generate batches
    for i in range(0, num_batches*batch_size, batch_size):
        yield Xs[i:i+batch_size, :], Ys[i:i+batch_size, :]
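
The same input/target shift used in `get_batches` can also be applied to character ids such as `text_as_int`. The following is a minimal sketch in plain Python, not the notebook's method: `make_char_batches` is a hypothetical helper, and the toy string stands in for the full corpus.

```python
def make_char_batches(encoded, batch_size, seq_len):
    """Yield (inputs, targets) batches of character ids.

    Each row holds seq_len ids; the target row is the same window shifted
    one position to the right, so no wrap-around value is needed.
    """
    window = seq_len + 1                      # one extra id for the shifted target
    n_rows = len(encoded) // window
    n_rows -= n_rows % batch_size             # drop the incomplete last batch
    rows = [encoded[r * window:(r + 1) * window] for r in range(n_rows)]
    for b in range(0, n_rows, batch_size):
        chunk = rows[b:b + batch_size]
        xs = [row[:-1] for row in chunk]      # characters 0 .. seq_len-1
        ys = [row[1:] for row in chunk]       # characters 1 .. seq_len
        yield xs, ys


# Toy usage with a short, hypothetical string instead of the full dataset.
toy_text = "First Citizen: Before we proceed any further, hear me speak."
toy_vocab = sorted(set(toy_text))
toy_char2idx = {c: i for i, c in enumerate(toy_vocab)}
toy_encoded = [toy_char2idx[c] for c in toy_text]

for xs, ys in make_char_batches(toy_encoded, batch_size=2, seq_len=10):
    print(len(xs), len(xs[0]), len(ys[0]))
```

Every batch has `batch_size` rows of `seq_len` ids, and each target row is its input row moved forward by one character.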

### RNN Model

In [37]:
class RNN_model(nn.Module):
    ## initialize RNN module
    def __init__(self, n_vocab, seq_size, emb_size, lstm_size):
        super(RNN_model, self).__init__()
        self.seq_size = seq_size
        self.lstm_size = lstm_size
        self.embedding = nn.Embedding(n_vocab, emb_size)
        self.lstm = nn.LSTM(emb_size, lstm_size, batch_first=True)
        self.dense = nn.Linear(lstm_size, n_vocab)
    
    
    ## forward path
    def forward(self, x, prev_state):
        embed = self.embedding(x)
        output, state = self.lstm(embed, prev_state)
        logits = self.dense(output)

        return logits, state
    
    def zero_state(self, batch_size):
        return (torch.zeros(1, batch_size, self.lstm_size),torch.zeros(1, batch_size, self.lstm_size))

In [None]:
class RNN_model(nn.Module):
    ## initialize the model; sizes default to the values in args
    def __init__(self, n_vocab, seq_size=args.seq_size, emb_size=args.emb_size, lstm_size=args.lstm_size):
        super(RNN_model, self).__init__()
        self.seq_size = seq_size
        self.lstm_size = lstm_size
        
        self.embedding = nn.Embedding(n_vocab, emb_size)
        self.lstm = nn.LSTM(emb_size, lstm_size, batch_first=True)
        self.dense = nn.Linear(lstm_size, n_vocab)
    
    ## forward 
    def forward(self, x, prev_state):
        embed = self.embedding(x)
        output, state = self.lstm(embed, prev_state)
        logits = self.dense(output)

        return logits, state
    
    def zero_state(self, batch_size):
        ## use the batch_size argument so that generation with batch size 1 also works
        return (torch.zeros(1, batch_size, self.lstm_size),torch.zeros(1, batch_size, self.lstm_size))

* Criterion and optimizer settings

In [38]:
def cri_opti(net, lr):
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(net.parameters(), lr=lr)
    return criterion, optimizer

* RNN model training

In [39]:
ppl = []
def train_rnn(words, vocab_to_int, int_to_vocab, n_vocab):
    
    ## RNN instance
    model = RNN_model(n_vocab, args.seq_size, args.emb_size, args.lstm_size)
    model = model.to(DEVICE)
    criterion, optimizer = cri_opti(model, lr=args.lr)

    iteration = 0
    
    for epoch in range(args.epochs):
        batches = get_batches(words, vocab_to_int, args.bs, args.seq_size)
        state_h, state_c = model.zero_state(args.bs)

        ## Transfer data to GPU
        state_h = state_h.to(DEVICE)
        state_c = state_c.to(DEVICE)
        
        for x, y in batches:
            iteration += 1

            ## Tell it we are in training mode
            model.train()

            ## Reset all gradients
            optimizer.zero_grad()

            ## Transfer data to GPU
            x = torch.tensor(x).to(DEVICE).long()
            y = torch.tensor(y).to(DEVICE).long()

            logits, (state_h, state_c) = model(x, (state_h, state_c))
            loss = criterion(logits.transpose(1, 2), y)

            state_h = state_h.detach()
            state_c = state_c.detach()

            loss_value = loss.item()
            ppl.append(loss_value)

            ## Perform back-propagation
            loss.backward()
            _ = torch.nn.utils.clip_grad_norm_(model.parameters(), args.g_norm)
            
            # Update the network's parameters
            optimizer.step()

            if iteration % 10 == 0:
                print('Epoch: {}/{}'.format(epoch, args.epochs),'Iteration: {}'.format(iteration),'Loss: {}'.format(loss_value))

            # if iteration % 1000 == 0:
                # predict(device, net, flags.initial_words, n_vocab,vocab_to_int, int_to_vocab, top_k=5)
                # torch.save(net.state_dict(),'checkpoint_pt/model-{}.pth'.format(iteration))
                
    return model

* Training results

In [41]:
rnn_net = train_rnn(words, vocab_to_int, int_to_vocab, len(vocab))

Epoch: 0/30 Iteration: 10 Loss: 9.552824020385742
Epoch: 1/30 Iteration: 20 Loss: 9.418272972106934
Epoch: 1/30 Iteration: 30 Loss: 8.828167915344238
Epoch: 2/30 Iteration: 40 Loss: 7.833779335021973
Epoch: 3/30 Iteration: 50 Loss: 7.433371543884277
Epoch: 3/30 Iteration: 60 Loss: 7.126779079437256
Epoch: 4/30 Iteration: 70 Loss: 7.080755233764648
Epoch: 5/30 Iteration: 80 Loss: 7.20672607421875
Epoch: 5/30 Iteration: 90 Loss: 7.034430027008057
Epoch: 6/30 Iteration: 100 Loss: 7.044061660766602
Epoch: 7/30 Iteration: 110 Loss: 7.178586006164551
Epoch: 7/30 Iteration: 120 Loss: 7.019558906555176
Epoch: 8/30 Iteration: 130 Loss: 7.03132438659668
Epoch: 9/30 Iteration: 140 Loss: 7.162107944488525
Epoch: 9/30 Iteration: 150 Loss: 7.000847816467285
Epoch: 10/30 Iteration: 160 Loss: 7.017824649810791
Epoch: 11/30 Iteration: 170 Loss: 7.144162654876709
Epoch: 11/30 Iteration: 180 Loss: 6.979130268096924
Epoch: 12/30 Iteration: 190 Loss: 6.999989032745361
Epoch: 13/30 Iteration: 200 Loss: 7.12

In [None]:
def generate_text(DEVICE, net, words, n_vocab, vocab_to_int, int_to_vocab, top_k=5):
    net.eval()

    state_h, state_c = net.zero_state(1)
    state_h = state_h.to(DEVICE)
    state_c = state_c.to(DEVICE)
    for w in words:
        ix = torch.tensor([[vocab_to_int[w]]]).to(DEVICE).long()
        output, (state_h, state_c) = net(ix, (state_h, state_c))
    
    _, top_ix = torch.topk(output[0], k=top_k)
    choices = top_ix.tolist()
    choice = np.random.choice(choices[0])

    words.append(int_to_vocab[choice])
    
    for _ in range(100):
        ix = torch.tensor([[choice]]).to(DEVICE).long()
        output, (state_h, state_c) = net(ix, (state_h, state_c))

        _, top_ix = torch.topk(output[0], k=top_k)
        choices = top_ix.tolist()
        choice = np.random.choice(choices[0])
        words.append(int_to_vocab[choice])

    print(' '.join(words))
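
`generate_text` above picks uniformly among the `top_k` candidates. One possible variant weights the candidates by their softmax probability instead. The sketch below is written in plain Python to keep it self-contained; `sample_top_k` is a hypothetical helper and the logits are made-up values.

```python
import math
import random


def sample_top_k(logits, k, rng=random):
    """Sample an index from the k largest logits, weighted by softmax probability."""
    # indices of the k highest-scoring entries
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)                    # subtract the max for numerical stability
    weights = [math.exp(logits[i] - m) for i in top]   # unnormalised softmax over the top k
    return rng.choices(top, weights=weights, k=1)[0]


logits = [0.1, 2.0, -1.0, 1.5, 0.3]   # hypothetical scores for a 5-token vocabulary
print(sample_top_k(logits, k=2))      # index 1 or 3; index 1 is the more likely draw
```

Restricting to the top k keeps unlikely tokens out, while the weighting still favours the most probable candidate.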

In [None]:
generate_text(DEVICE, rnn_net, ['We', 'are'], len(vocab), vocab_to_int, int_to_vocab)

In [None]:
generate_text(DEVICE, rnn_net, ['what'], len(vocab), vocab_to_int, int_to_vocab)

In [None]:
generate_text(DEVICE, rnn_net, ['You'], len(vocab), vocab_to_int, int_to_vocab)

In [None]:
generate_text(DEVICE, rnn_net, ['I', 'tell', 'you', 'friends'], len(vocab), vocab_to_int, int_to_vocab)

- - -

## Question - c

(c) Perplexity is defined as the exponentiated average negative log-likelihood of a sequence. Let X = (x0, . . . , xt), then the perplexity of X is ...
> Train your RNN and provide a PPL curve over the course of training.
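
Following the verbal definition above, perplexity is the exponential of the mean per-token negative log-likelihood. `nn.CrossEntropyLoss` with its default mean reduction already returns that mean in nats, so the loss values collected in the `ppl` list by `train_rnn` could be turned into a PPL curve by exponentiating them. A minimal sketch, with hypothetical helper names and made-up loss values:

```python
import math


def perplexity(token_nlls):
    """exp of the mean per-token negative log-likelihood (natural log)."""
    return math.exp(sum(token_nlls) / len(token_nlls))


def ppl_curve(batch_losses):
    """Turn mean cross-entropy losses (one per step) into one PPL value per step."""
    return [math.exp(loss) for loss in batch_losses]


# A model that assigns probability 1/65 to every character has PPL of about 65.
uniform_nll = [math.log(65)] * 10
print(perplexity(uniform_nll))

# Hypothetical loss values, only to show the conversion.
print(ppl_curve([4.0, 3.0, 2.0]))
```

The resulting list can be plotted against the iteration index to obtain the curve requested in (c).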

### Training

### Perplexity curve

- - -

## Question - d

(d) Among GRU, LSTM, and Transformer, pick your favorite architecture and design a model whose number of parameters is similar to that of the vanilla RNN you implemented above. Then train it and provide a PPL curve over the course of training (in the same plot as in (c)). You are free to select any hyperparameters if needed (no need to use the hyperparameters above). Report the number of model parameters. (You may use GRU, LSTM, or Transformer libraries if you want, but I recommend implementing it yourself.)
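
One way to match parameter counts is to compute the GRU's size as a function of its hidden size and pick the closest value. The sketch below is an assumption-laden illustration, not the configuration used in this notebook. It assumes a single-layer GRU with three gates, two bias vectors per gate (the layout of PyTorch's `nn.GRU`), an embedding size of 64, and a vocabulary of 65. The target of 103,425 parameters is what a 3-layer tanh RNN with the hyperparameters in (b) would have under the same bias convention. Both helper names are hypothetical.

```python
def gru_param_count(vocab_size, emb_size, hidden_size, num_layers=1):
    """Embedding -> stacked GRU -> linear, with three gates per GRU layer.

    Assumption: every gate has an input matrix, a recurrent matrix and two
    bias vectors.
    """
    total = vocab_size * emb_size                    # embedding table
    in_size = emb_size
    for _ in range(num_layers):
        total += 3 * (hidden_size * in_size + hidden_size * hidden_size + 2 * hidden_size)
        in_size = hidden_size
    total += hidden_size * vocab_size + vocab_size   # output projection + bias
    return total


def closest_hidden_size(target, vocab_size, emb_size, candidates):
    """Return the candidate hidden size whose parameter count is nearest to target."""
    return min(candidates, key=lambda h: abs(gru_param_count(vocab_size, emb_size, h) - target))


target = 103_425   # assumed budget, see the lead-in
best = closest_hidden_size(target, vocab_size=65, emb_size=64, candidates=range(32, 257))
print(best, gru_param_count(65, 64, best))
```

The search is exhaustive over the candidate range, which is cheap because the count is a closed-form expression.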

* Setting new parameters in args_

In [105]:
class Args_:
    emb_size=64
    seq_size=1000
    lstm_size=64
    g_norm=5
    bs=64
    num_step=20
    epochs=30
    lr=0.001
    momentum = 0.9
    verbose='store_true'
    seed=710674

args_ = Args_()    

np.random.seed(args_.seed)
random.seed(args_.seed)
torch.manual_seed(args_.seed)

<torch._C.Generator at 0x1c90e9a0b30>

In [102]:
data = shakespeare
data[:100]

'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou'

In [103]:
vocab = sorted(set(data))
vocab_size = len(vocab)

print('Vocabulary of the Shakespeare data: {}'.format(vocab))
print('The number of unique characters: {}'.format(vocab_size))

Vocabulary of the Shakespeare data: ['\n', ' ', '!', '$', '&', "'", ',', '-', '.', '3', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
The number of unique characters: 65


In [106]:
SEQ_LEN = args_.seq_size
BATCH_SIZE = args_.bs

input_seqs = []
target_seqs = []

num_seqs = len(text_as_int) // (SEQ_LEN+1)
for i in range(num_seqs):
    ## non-overlapping windows of SEQ_LEN+1 characters
    start = i * (SEQ_LEN+1)
    seq = text_as_int[start:start+SEQ_LEN+1]
    input_seqs.append(np.array(seq[:-1]))
    target_seqs.append(np.array(seq[1:]))

input_seqs = np.array(input_seqs)
target_seqs = np.array(target_seqs)

## keep a whole number of batches, as required by the stateful GRU
num_keep = (len(input_seqs)//BATCH_SIZE)*BATCH_SIZE
input_seqs = input_seqs[:num_keep]
target_seqs = target_seqs[:num_keep]

In [107]:
def build_model(batch_size):
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, 256, batch_input_shape=(batch_size, None)),
        tf.keras.layers.GRU(256, return_sequences=True, stateful=True),
        tf.keras.layers.Dense(vocab_size),
    ])
    model.build()
    return model

model = build_model(BATCH_SIZE)

In [None]:
EPOCHS = args_.epochs

loss = tf.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer='adam', loss=loss)

history = model.fit(input_seqs, 
    target_seqs,
    epochs=EPOCHS,
    batch_size=BATCH_SIZE)

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30

In [None]:
model_inf = build_model(1)

for i in range(len(model_inf.layers)):
    for j in range(len(model_inf.layers[i].weights)):
        model_inf.layers[i].weights[j].assign(model.layers[i].weights[j])

In [None]:
def generate_text(model, seed, out_len):

    text_generated = []

    model.reset_states()
    
    inp = np.array([char2idx[s] for s in seed])

    for i in range(out_len):

        pred = model(inp[None, ...])[0]

        temperature = 1.0
        pred = pred / temperature
        pred_c = tf.random.categorical(pred, num_samples=1)[-1][0].numpy()
        
        text_generated.append(idx2char[pred_c])
        
        inp = np.array([pred_c])

    return (seed + ''.join(text_generated))
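
`generate_text` divides the logits by `temperature` before sampling. The effect of that division can be seen in isolation with a small plain-Python sketch; `softmax_with_temperature` is a hypothetical helper and the logits are made-up values.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                               # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]


logits = [2.0, 1.0, 0.0]   # hypothetical scores for three characters
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Temperatures below 1 concentrate probability on the highest-scoring character, while temperatures above 1 flatten the distribution and make the generated text more varied.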

- - -

## Question - e

(e) Pick the best performing (lowest PPL score) model, and generate the text autoregressively given the following prompts.
> (1) ‘We are’ \
(2) ‘what’ \
(3) ‘You’ \
(4) ‘I tell you, friends’

### Text generation results

In [None]:
print(generate_text(model_inf, seed=u"We are", out_len=200))

In [None]:
print(generate_text(model_inf, seed=u"what", out_len=200))

In [None]:
print(generate_text(model_inf, seed=u"You", out_len=200))

In [None]:
print(generate_text(model_inf, seed=u"I tell you, friends", out_len=200))