################################################

Owner: Arnab Das

##############################################

## tf.Keras implementation of variable length sequence RNN

This notebook implements an RNN with tf.keras layers. The model accepts variable-length input sequences.

This notebook has four parts:
<font color='blue'>
*  Data Preparation
*  Model definition
*  Custom training loop
*  Text generation
</font> 

Our design choices for this implementation are:
> * Synchronous RNN architecture
* Hidden-to-hidden connections
* 2 RNN layers
* Variable sequence length
* Maximum sequence length of 500 characters
* Batch size 128
* tf.keras.layers.LSTM as RNN layers
* 512 hidden units per layer
* 10 epochs
* Softmax cross-entropy loss function
* Custom loss calculation using a mask
* Adam optimizer





In [None]:
#Changing directory to mounted drive folder
import os
os.chdir('/content/drive/My Drive/Colab Notebooks/IDL_Assignment6/')
os.getcwd()

'/content/drive/My Drive/Colab Notebooks/IDL_Assignment6'

In [None]:
#Importing necessary libraries
from prepare_data2 import parse_seq
import pickle
import tensorflow as tf
import pandas as pd
import numpy as np

### <font color='blue'>Data Preparation </font>
> We use the supplied Python script to generate the variable-length sequences, passing `-m 500` so that it drops any sequence longer than 500 characters.

In [None]:
!python prepare_data2.py Input.txt skp \\n\\n+ -m 500

2020-06-05 10:19:40.712268: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
Split input into 31022 sequences...
Longest sequence is 3094 characters. If this seems unreasonable, consider using the maxlen argument!
Removing sequences longer than 500 characters...
29429 sequences remaining.
Longest remaining sequence has length 499.
Removing length-0 sequences...
29429 sequences remaining.
Serialized 100 sequences...
Serialized 200 sequences...
Serialized 300 sequences...
Serialized 400 sequences...
Serialized 500 sequences...
Serialized 600 sequences...
Serialized 700 sequences...
Serialized 800 sequences...
Serialized 900 sequences...
Serialized 1000 sequences...
Serialized 1100 sequences...
Serialized 1200 sequences...
Serialized 1300 sequences...
Serialized 1400 sequences...
Serialized 1500 sequences...
Serialized 1600 sequences...
Serialized 1700 sequences...
Serialized 1800 sequences...
Serialized 1900 sequences..
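
The log above suggests three steps: split the text on a regular expression, drop over-long sequences, then drop empty ones. The following is a minimal sketch of that idea, not the actual `prepare_data2.py`. The helper name `split_and_filter` and the toy text are hypothetical, and whether the real script's length cutoff is inclusive is an assumption.

```python
import re

def split_and_filter(text, pattern=r"\n\n+", max_len=500):
    """Split text on a regex and keep non-empty pieces of at most max_len characters."""
    pieces = re.split(pattern, text)
    # Drop over-long sequences first, then empty ones, in the order the log reports.
    # Assumption: the cutoff is inclusive; the real script may differ.
    pieces = [p for p in pieces if len(p) <= max_len]
    return [p for p in pieces if len(p) > 0]

# Toy text: two short speeches and one 600-character run that should be dropped.
toy_text = "All:\nSpeak, speak.\n\nFirst Citizen:\nYou are all resolved.\n\n\n" + "x" * 600
sequences = split_and_filter(toy_text, max_len=500)
print(len(sequences))
```

Splitting on blank lines keeps each speech together with its speaker name, which is why the generated samples later in the notebook start with a name followed by a colon.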

In [None]:
# A TFRecordDataset yields raw serialized records (bytes), not usable as-is
data = tf.data.TFRecordDataset("skp.tfrecords")

# Map the parser over the dataset so the bytes are decoded into index sequences.
# The sequences are variable-length, so no sequence length is passed to parse_seq.
data = data.map(parse_seq)

# a map from characters to indices
with open("skp_vocab", mode="rb") as vocab_file:
  vocab = pickle.load(vocab_file)
vocab_size = len(vocab)
# inverse mapping: indices to characters
ind_to_ch = {ind: ch for (ch, ind) in vocab.items()}
print(vocab)
print(vocab_size)

{'Q': 3, '$': 4, 'q': 5, '&': 6, 'w': 7, 'i': 8, 'p': 9, 'o': 10, 'G': 11, '3': 12, '[': 13, 'P': 14, 'N': 15, ',': 16, '.': 17, 'O': 18, ' ': 19, 'c': 20, 'R': 21, 'f': 22, 'L': 23, 'X': 24, ']': 25, 'I': 26, 'x': 27, 'u': 28, 'A': 29, 'E': 30, 't': 31, 'v': 32, 'V': 33, 'j': 34, 'U': 35, 'y': 36, 's': 37, 'C': 38, 'm': 39, 'W': 40, 'g': 41, 'K': 42, '-': 43, ':': 44, 'M': 45, 'D': 46, 'z': 47, 'n': 48, 'Y': 49, ';': 50, 'J': 51, 'B': 52, '\n': 53, '!': 54, '?': 55, 'a': 56, "'": 57, 'T': 58, 'h': 59, 'b': 60, 'Z': 61, 'F': 62, 'r': 63, 'S': 64, 'H': 65, 'e': 66, 'l': 67, 'd': 68, 'k': 69, '<PAD>': 0, '<S>': 1, '</S>': 2}
70
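
To make the two mappings concrete, here is a small sketch of a round trip between text and indices. The toy vocabulary and the helpers `encode` and `decode` are hypothetical. Only the three special tokens and their indices are taken from the printed vocabulary.

```python
# Toy vocabulary in the same style as skp_vocab (hypothetical subset).
toy_vocab = {"<PAD>": 0, "<S>": 1, "</S>": 2, "A": 3, "l": 4, ":": 5}
toy_ind_to_ch = {ind: ch for ch, ind in toy_vocab.items()}

def encode(text, vocab):
    """Map a string to indices, wrapped in start and end markers."""
    return [vocab["<S>"]] + [vocab[ch] for ch in text] + [vocab["</S>"]]

def decode(indices, ind_to_ch):
    """Map indices back to a string, skipping the special tokens."""
    special = {"<PAD>", "<S>", "</S>"}
    return "".join(ind_to_ch[i] for i in indices if ind_to_ch[i] not in special)

ids = encode("All:", toy_vocab)
print(ids)
# Two trailing zeros stand in for padding added during batching.
print(decode(ids + [0, 0], toy_ind_to_ch))
```

Because `<PAD>` has index 0, padded positions can later be detected simply by counting non-zero entries.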


In [None]:
# declaring hyper parameters
batch_size  = 128
maxLen = 500
epochs = 10

### <font color='blue'>Model Definition</font>
We use a tf.keras Sequential model with three layers: two LSTM layers of 512 units each, followed by a dense layer that outputs one logit per vocabulary entry. The LSTM layers are stateful.

Note the input shape of the RNN:

```
(batch_size,maxLen-1,vocab_size)
```
The first axis is the batch. The time axis has length maxLen - 1 because the last character is never used as input, and the final axis holds the one-hot encoding of each character.
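
The shape can be checked without the model. This is a NumPy sketch with random indices standing in for a real batch. The small batch of 4 and the variable names are assumptions for illustration, while 70 and 500 are the values used in this notebook.

```python
import numpy as np

toy_vocab_size = 70   # vocabulary size printed earlier in the notebook
toy_max_len = 500     # padded sequence length
toy_batch = 4         # small batch for illustration (the notebook uses 128)

# Integer-encoded batch of shape (batch, max_len); random values stand in for text.
rng = np.random.default_rng(0)
indices = rng.integers(0, toy_vocab_size, size=(toy_batch, toy_max_len))

# Drop the last character to get the inputs, then one-hot encode via an identity matrix.
inputs = indices[:, :-1]
one_hot = np.eye(toy_vocab_size, dtype=np.float32)[inputs]
print(one_hot.shape)  # (4, 499, 70)
```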


In [None]:
myRNN = tf.keras.Sequential()
myRNN.add(tf.keras.layers.LSTM(512,return_sequences= True, stateful=True, batch_size= batch_size, batch_input_shape = (batch_size,maxLen-1,vocab_size)))
myRNN.add(tf.keras.layers.LSTM(512,return_sequences= True, stateful=True))
myRNN.add(tf.keras.layers.Dense(vocab_size))

The data is now shuffled and batched. As suggested, we use `padded_batch` to automatically pad shorter sequences with zeros at the end.

In [None]:
data = data.shuffle(100000).padded_batch(batch_size,maxLen, drop_remainder=True)
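
Conceptually, padding a batch right-fills every sequence with zeros up to a common length. The sketch below reproduces that in NumPy for two toy sequences. `pad_batch` is a hypothetical helper, not part of tf.data.

```python
import numpy as np

def pad_batch(sequences, max_len, pad_value=0):
    """Right-pad integer sequences with pad_value to a (batch, max_len) array."""
    out = np.full((len(sequences), max_len), pad_value, dtype=np.int32)
    for row, seq in enumerate(sequences):
        out[row, :len(seq)] = seq
    return out

# Two toy index sequences of different lengths, each starting with 1 (<S>) and ending with 2 (</S>).
toy = [[1, 29, 67, 67, 2], [1, 64, 2]]
padded = pad_batch(toy, max_len=8)
print(padded)
```

Since index 0 is reserved for `<PAD>`, the zeros introduced here never collide with a real character.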

In [None]:
# Defining the optimizer
opt = tf.keras.optimizers.Adam()

### <font color='blue'>Custom training loop.</font>
Because the RNN is stateful, we reset its stored states before every batch.

In [None]:
for epoch in range(epochs):
  epochLoss = 0
  batchCount = 0
  for batch in data:
    myRNN.reset_states()
    batchCount +=1
    #Counting nonzero (non-padding) characters per sequence; subtracting 1 because the final '</S>' is never fed as input
    nonZeroTensor = tf.math.count_nonzero(batch, axis=1)-1
    #Creating mask from the count
    batchMask = tf.sequence_mask(nonZeroTensor,maxlen=maxLen-1,dtype=tf.dtypes.float32)
    inputSeq = batch[:,:-1]
    #The ground truth (Y label) is the same sequence shifted by one character
    groundTruth = batch[:,1:]
    #One hot encoding of the input
    encodedInput = tf.one_hot(inputSeq, vocab_size)
    #Custom training and applying gradient per batch
    with tf.GradientTape() as tape:
      prediction = myRNN(encodedInput)
      lossMatrix = tf.nn.sparse_softmax_cross_entropy_with_logits(groundTruth,prediction)
      #Calculating masked loss
      maskedLoss = tf.math.multiply(lossMatrix,batchMask)
      #Summed loss for all characters in a batch
      batchLoss = tf.reduce_sum(maskedLoss)
      #avg loss per character
      scalerLoss = batchLoss / float(tf.reduce_sum(nonZeroTensor))
    epochLoss += scalerLoss    
    grads = tape.gradient(scalerLoss, myRNN.trainable_variables)
    opt.apply_gradients(zip(grads, myRNN.trainable_variables))
  print("Epoch = {}, Avg loss = {:.2f}".format(epoch+1,epochLoss/batchCount))  

Epoch = 1, Avg loss = 2.91
Epoch = 2, Avg loss = 2.16
Epoch = 3, Avg loss = 1.87
Epoch = 4, Avg loss = 1.69
Epoch = 5, Avg loss = 1.57
Epoch = 6, Avg loss = 1.48
Epoch = 7, Avg loss = 1.42
Epoch = 8, Avg loss = 1.37
Epoch = 9, Avg loss = 1.33
Epoch = 10, Avg loss = 1.29
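
The masking logic in the loop can be illustrated on a tiny batch. This NumPy sketch mirrors the count, mask, and average steps under the assumption that every time step has a loss of 1.0. `masked_mean_loss` and the toy arrays are hypothetical.

```python
import numpy as np

def masked_mean_loss(batch, per_step_loss):
    """Average per_step_loss over real (non-padding) target positions."""
    # Valid input positions per sequence: all non-zero tokens minus the last one.
    lengths = np.count_nonzero(batch, axis=1) - 1
    steps = batch.shape[1] - 1
    # NumPy equivalent of a sequence mask: 1.0 where step < length, else 0.0.
    mask = (np.arange(steps)[None, :] < lengths[:, None]).astype(np.float32)
    return (per_step_loss * mask).sum() / lengths.sum(), mask

toy_batch = np.array([[1, 5, 6, 2, 0, 0],
                      [1, 7, 2, 0, 0, 0]])
toy_loss = np.ones((2, 5), dtype=np.float32)  # pretend every step costs 1.0
loss, mask = masked_mean_loss(toy_batch, toy_loss)
print(mask)
print(loss)
```

Dividing by the number of real characters rather than by the padded length keeps the reported loss comparable across batches with different amounts of padding.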


In [None]:
# Printing the summary of the model
myRNN.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm (LSTM)                  (128, 499, 512)           1193984   
_________________________________________________________________
lstm_1 (LSTM)                (128, 499, 512)           2099200   
_________________________________________________________________
dense (Dense)                (128, 499, 70)            35910     
Total params: 3,329,094
Trainable params: 3,329,094
Non-trainable params: 0
_________________________________________________________________
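
The parameter counts in the summary can be reproduced by hand. An LSTM layer has four gates, each with input weights, recurrent weights, and a bias. The helper names below are hypothetical; the sizes 70 and 512 come from the model definition.

```python
def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    """Weight matrix plus bias."""
    return input_dim * units + units

counts = [lstm_params(70, 512), lstm_params(512, 512), dense_params(512, 70)]
print(counts, sum(counts))  # [1193984, 2099200, 35910] 3329094
```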


### <font color='blue'>Text generation</font>
Because the RNN is stateful, its batch size cannot be changed. We work around this with a simple trick: the input is a NumPy zero array of shape 128x1x70, i.e. a sequence length of 1.
We only set the input for the first sequence of the batch, read the output of that first sequence, and use it to build the first sequence of the next input. This way we can keep using the same RNN model.

Because the sequence length is also reduced, TensorFlow emits a warning. For this text generation we suppress the logger's warnings.

We generate 10 different sequences of variable length.
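
The two building blocks of the generation loop are sampling an index from logits and writing it into slot 0 of a full-size batch. Here is a standalone NumPy sketch of both. The helper names and the toy logits are assumptions, and the softmax subtracts the maximum logit for numerical stability, which the notebook cell itself does not do.

```python
import numpy as np

def sample_from_logits(logits, rng):
    """Draw one index from softmax(logits), using a numerically stable softmax."""
    shifted = logits - np.max(logits)          # avoids overflow in exp
    probs = np.exp(shifted) / np.sum(np.exp(shifted))
    return int(rng.choice(len(probs), p=probs)), probs

def next_input(idx, batch_size, vocab_size):
    """One-hot input of shape (batch_size, 1, vocab_size); only row 0 is used."""
    x = np.zeros((batch_size, 1, vocab_size), dtype=np.float32)
    x[0, 0, idx] = 1.0
    return x

rng = np.random.default_rng(0)
toy_logits = np.array([2.0, 0.5, -1.0, 0.0])   # stand-in for the model's output logits
idx, probs = sample_from_logits(toy_logits, rng)
x = next_input(idx, batch_size=128, vocab_size=4)
print(idx, x.shape)
```

The other 127 rows stay all-zero and their outputs are simply ignored, which wastes computation but avoids rebuilding the model with a batch size of 1.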

In [None]:
tf.get_logger().setLevel("ERROR")
for i in range (10):
  myRNN.reset_states()
  finalString = ""
  char = np.zeros((batch_size,1,vocab_size),dtype= 'float32')
  char[0][0][1] = 1
  print("Seq = {}".format(i+1))
  while(True):
    predLogits = myRNN(char)
    predLogitSingleChar = predLogits[0][0] 
    prob = np.exp(predLogitSingleChar) / np.sum(np.exp(predLogitSingleChar))
    # Random choice of output character based on output discrete probability
    idx = np.random.choice(range(vocab_size), p=prob.ravel())
    char = np.zeros((batch_size,1,vocab_size),dtype= 'float32')
    char[0][0][idx] = 1
    outPutChar = ind_to_ch[idx]
    if(outPutChar =='</S>') :
      break
    finalString += outPutChar
  print(finalString)
  print('\n') 

Seq = 1
HELENA:
Richard, anled friends: do you hole till I'll are
his hase of France. You, sir, sir?


Seq = 2
CEROMEN:
Here, my lodd Yoth should be well nighted flood.


Seq = 3
ANGELO:
The things, at the gentleman was to thee.


Seq = 4
BERGAR:
My lord is this hit man.


Seq = 5
ANTONIO:
It like him you blagged, as much undeed:
She's neightest, would I die and my busy;
And not a brovice of his softice let
As a new spoke om nothing but like:
Do Lord Capsiul Melorchin.


Seq = 6
PARDISA:
'Tis this, I kill: then Seek-pather banish'd
His highness instruction, and such my swords to-n'em
Where's either throne away the partty dows.


Seq = 7
DROMIO OF EPHESUS:
Qual, sort than your father?


Seq = 8
SILVIA:
By your eyes' and tuty the towfreal's poak;
Draw her hath been pasposed by debignilunes:
As far is sall that e'er-hour of fhoot,
Ell on law's any fair quintrious fortune,
Whiles notis falsta as great bodried; and,
Where things can de answer in Eansa--
Williness to the great of Planence th



---


#### Bonus Section: Attempt to calculate probability of a sequence
We are not fully sure this is the right approach. 
We pick the second sequence from the input because it is short:

```
All:
Speak, speak.
```
We run the sequence through our model and, from the output logits, compute the log-probability of the sequence by summing $$\log P(x_i \mid x_1, \ldots, x_{i-1})$$ over all characters.
Then we change the sequence to 
```
All:
Spook, speak.
```
And repeat the same thing
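
By the chain rule, the log-probability of a sequence is the sum of the per-step log-probabilities of each next character. The sketch below shows that computation on toy logits with a log-softmax, which avoids taking the log of very small probabilities. The function names and the uniform toy logits are assumptions for illustration.

```python
import numpy as np

def log_softmax(logits):
    """Row-wise log-softmax computed without forming probabilities first."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def sequence_log_prob(step_logits, targets):
    """Sum of log P(target_t | prefix) over all time steps."""
    log_probs = log_softmax(step_logits)
    return float(log_probs[np.arange(len(targets)), targets].sum())

# Toy example: 3 steps over a 4-symbol vocabulary with uniform logits,
# so every step contributes log(1/4).
toy_logits = np.zeros((3, 4))
toy_targets = np.array([1, 3, 2])
toy_log_prob = sequence_log_prob(toy_logits, toy_targets)
print(toy_log_prob)
```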

In [None]:
newData = tf.data.TFRecordDataset("skp.tfrecords")
mapData = newData.map(lambda x: parse_seq(x))
batchData = mapData.padded_batch(batch_size,maxLen, drop_remainder=True)
charList = []
for batch in batchData:
  #Pick the second sequence, which is short
  secondSeq = batch[1]
  nzCount = tf.math.count_nonzero(secondSeq)
  print(secondSeq[:nzCount])
  #Only considering nonzero sequence
  for items in secondSeq[:nzCount].numpy():
    charList.append(ind_to_ch[items])
  print(charList)
  break

tf.Tensor([ 1 29 67 67 44 53 64  9 66 56 69 16 19 37  9 66 56 69 17  2], shape=(20,), dtype=int32)
['<S>', 'A', 'l', 'l', ':', '\n', 'S', 'p', 'e', 'a', 'k', ',', ' ', 's', 'p', 'e', 'a', 'k', '.', '</S>']


In [None]:
myRNN.reset_states()
Prob = 0
for chars in range(len(charList)-1):
  dummyBatch = np.zeros((batch_size,1,vocab_size),dtype= 'float32')
  dummyBatch[0][0][vocab[charList[chars]]] = 1
  logits = myRNN(dummyBatch)
  dscrtProb = np.exp(logits[0,0]) / np.sum(np.exp(logits[0,0]))
  logProb = np.log(dscrtProb)

  Prob += logProb[vocab[charList[chars+1]]]
print(Prob)

-21.10419415216893


In [None]:
#With changed charList: 'Speak' replaced by 'Spook'
changedCharList = ['<S>', 'A', 'l', 'l', ':', '\n', 'S', 'p', 'o', 'o', 'k', ',', ' ', 's', 'p', 'e', 'a', 'k', '.', '</S>']
myRNN.reset_states()
Prob = 0
for chars in range(len(changedCharList)-1):
  dummyBatch = np.zeros((batch_size,1,vocab_size),dtype= 'float32')
  dummyBatch[0][0][vocab[changedCharList[chars]]] = 1
  logits = myRNN(dummyBatch)
  dscrtProb = np.exp(logits[0,0]) / np.sum(np.exp(logits[0,0]))
  logProb = np.log(dscrtProb)

  Prob += logProb[vocab[changedCharList[chars+1]]]
print(Prob)

-30.59061598032713
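
One way to read the two numbers is to turn their difference into a probability ratio and each total into a per-character perplexity. This sketch uses only the two values printed above. The count of 19 predictions follows from the 20 tokens in the sequence, and the variable names are hypothetical.

```python
import math

log_p_speak = -21.10419415216893   # printed above for "Speak, speak."
log_p_spook = -30.59061598032713   # printed above for "Spook, speak."
num_predictions = 19               # 20 tokens, so 19 next-character predictions

# Ratio of the two sequence probabilities.
ratio = math.exp(log_p_speak - log_p_spook)
# Per-character perplexity: exp of the average negative log-probability.
ppl_speak = math.exp(-log_p_speak / num_predictions)
ppl_spook = math.exp(-log_p_spook / num_predictions)
print(ratio, ppl_speak, ppl_spook)
```

Under this model the original sequence comes out roughly four orders of magnitude more probable than the altered one, which is the direction one would hope for if the log-probability computation is sound.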
