# Homework 09 - RNNs

In this homework you will learn how to generate text with an RNN. 

Defining a custom RNN in TensorFlow requires following a specific interface, so I will provide you with the correct model definition. Your task is to understand how the model processes sequential data, what kind of data it returns, and how to train it.

You will train a stacked RNN to generate text passages similar to those in the Bible.

In [0]:
import numpy as np
%tensorflow_version 2.x
import tensorflow as tf
import random
import time


TensorFlow 2.x selected.


### Load text.

1. Download the file 'bible.txt' from the folder **files/Stuff for Homework** on Stud.IP.

2. Upload the file 'bible.txt' to your Google Drive.

3. Run the next cell to give Colab permission to access your Google Drive content.

4. Adjust the file path in **the cell after the next** accordingly.

In [0]:
# Run this cell to give Colab permission to access your Google Drive.
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [0]:
# Load the text. The context manager closes the file after reading.
with open("/content/drive/My Drive/bible.txt", 'r') as f:
    txt = f.read()
print("Text length: {}".format(len(txt)))
print('-------------------------------------')

# Inspect first lines:
print(txt[:500])

Text length: 4332496
-------------------------------------
The First Book of Moses:  Called Genesis


1:1 In the beginning God created the heaven and the earth.

1:2 And the earth was without form, and void; and darkness was upon
the face of the deep. And the Spirit of God moved upon the face of the
waters.

1:3 And God said, Let there be light: and there was light.

1:4 And God saw the light, that it was good: and God divided the light
from the darkness.

1:5 And God called the light Day, and the darkness he called Night.
And the evening and the mornin


In [0]:
# Get the vocabulary of the text
vocab = sorted(set(txt))
print("Vocabulary: {}".format(vocab))
print('--------------------------')
vocab_size = len(vocab)
print("Vocabulary size: {}".format(vocab_size))


Vocabulary: ['\n', ' ', '!', "'", '(', ')', '*', ',', '-', '.', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
--------------------------
Vocabulary size: 74


In [0]:
# Create dictionaries to switch between the indices of the characters and the characters themselves.
char2idx = {ch:i for i,ch in enumerate(vocab)}
idx2char = {i:ch for i,ch in enumerate(vocab)}
# Translate the text to indices.
txt_idx = [char2idx[ch] for ch in txt]
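
As a quick sanity check, the two dictionaries should invert each other. The following is a minimal sketch on a toy string; `toy_txt` and the other `toy_*` names are stand-ins for illustration and are not part of the homework. It builds the same kind of mappings and checks that encoding followed by decoding returns the original string.

```python
# Toy stand-in for the real text (assumption: any string behaves the same way).
toy_txt = "In the beginning"

# Same construction as in the cell above: sorted unique characters form the vocabulary.
toy_vocab = sorted(set(toy_txt))
toy_char2idx = {ch: i for i, ch in enumerate(toy_vocab)}
toy_idx2char = {i: ch for i, ch in enumerate(toy_vocab)}

# Encode the string to indices, then decode the indices back to characters.
encoded = [toy_char2idx[ch] for ch in toy_txt]
decoded = "".join(toy_idx2char[i] for i in encoded)

print(encoded)
print(decoded == toy_txt)
```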

### Prepare TensorFlow dataset.

Next, we turn the index-encoded text into a dataset suitable for training.

In [0]:
# First, create a TensorFlow dataset from the index-encoded text with tf.data.Dataset.from_tensor_slices.
dataset = tf.data.Dataset.from_tensor_slices(txt_idx)

In [0]:
# We will train on subsequences of length K = 20 and compute the loss for each timestep.
# Let's think about what a single training datapoint looks like.
# Example:
# Input sequence:  "Moses:  Called Genes"
# Target sequence: "oses:  Called Genesi"
# To create these pairs of sequences, we chunk the dataset into subsequences of length K+1.
# You can use .batch() for this.
# Make sure that all subsequences in the resulting dataset have length K+1 (see the
# parameter 'drop_remainder' of .batch()).
K = 20
data = dataset.batch(K+1, drop_remainder=True)
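
To see what `drop_remainder` changes, here is a small sketch in plain Python that mimics the chunking on a toy list. The helper `chunk` is hypothetical and only imitates the behaviour described above; it is not how `tf.data` is implemented.

```python
def chunk(seq, size, drop_remainder=True):
    """Split seq into consecutive chunks of `size` elements."""
    chunks = [seq[i:i + size] for i in range(0, len(seq), size)]
    # Optionally discard a final chunk that is shorter than `size`.
    if drop_remainder and chunks and len(chunks[-1]) < size:
        chunks = chunks[:-1]
    return chunks

toy = list(range(10))
kept = chunk(toy, 4, drop_remainder=True)
all_chunks = chunk(toy, 4, drop_remainder=False)

print(kept)        # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(all_chunks)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Without dropping the remainder, the last chunk would be shorter than the others, and the later split into input and target would produce sequences of inconsistent length.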

In [0]:
# Now we have to map each sequence of length 21
# to an (input, target) pair.
# Given the following function you can use the dataset method .map() here.
def input_target_split(seq):
    return seq[:-1], seq[1:]
dataset = data.map(input_target_split)
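
The effect of the split is easiest to see on characters instead of indices. This sketch applies the same slicing to one chunk of 21 characters; `split_pair` and `chunk_text` are illustrative names, not part of the homework code.

```python
def split_pair(seq):
    # Same slicing as input_target_split: drop the last element for the input,
    # drop the first element for the target.
    return seq[:-1], seq[1:]

chunk_text = "Moses:  Called Genesi"  # one chunk of K + 1 = 21 characters
inp, targ = split_pair(chunk_text)

print(repr(inp))   # first 20 characters
print(repr(targ))  # last 20 characters

# At every position t, the target is the character that follows the input character.
for t in range(len(inp)):
    assert targ[t] == chunk_text[t + 1]
```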

In [0]:
# Now as usual we shuffle our dataset and chunk it into batches of 64.
BATCHSIZE = 64
BUFFERSIZE = 10000
dataset = dataset.shuffle(BUFFERSIZE)
dataset = dataset.batch(BATCHSIZE, drop_remainder=True)
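
It can help to estimate how many batches one epoch contains. The following back-of-the-envelope calculation uses the text length printed above and assumes that both `.batch()` calls drop their incomplete remainder, as in the cells above.

```python
text_length = 4332496  # printed by the loading cell above
K = 20
BATCHSIZE = 64

# Chunking into sequences of length K + 1 discards the incomplete tail.
num_sequences = text_length // (K + 1)
leftover_chars = text_length % (K + 1)

# Batching the sequences again discards the incomplete last batch.
batches_per_epoch = num_sequences // BATCHSIZE
leftover_sequences = num_sequences % BATCHSIZE

print(num_sequences, leftover_chars)          # 206309 7
print(batches_per_epoch, leftover_sequences)  # 3223 37
```

One epoch therefore corresponds to a little over 3,200 optimizer steps.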

In [0]:
# Provided definitions of Vanilla RNN cell and RNN model.

class VanillaRNNCell(tf.keras.layers.Layer):

    def __init__(self, input_dim, units):
        super(VanillaRNNCell, self).__init__()
        self.input_dim = input_dim
        self.units = units
        # tf.keras.layers.RNN reads state_size to create the initial hidden state.
        self.state_size = units
    
    def build(self, input_shape):
        self.w_in = self.add_weight(
                            shape=(self.input_dim, self.units),
                            initializer='uniform'
                            )
        self.w_h = self.add_weight(
                            shape=(self.units, self.units),
                            initializer='uniform'
                            )
        self.b_h = self.add_weight(
                            shape=(self.units,),
                            initializer='zeros'
                            )       
            
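    def call(self, inputs, hidden_states):
        # The states arrive as a sequence; this cell has a single state tensor.
        h_prev = hidden_states[0]
        # h_t = sigmoid(x_t W_in + h_{t-1} W_h + b_h)
        h_new = tf.nn.sigmoid(tf.matmul(inputs, self.w_in) + tf.matmul(h_prev, self.w_h) + self.b_h)
        # Return the output for this timestep and the updated state.
        return h_new, [h_new]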

state_size_1 = 128
state_size_2 = 256

class RNN(tf.keras.layers.Layer):
    
    def __init__(self):
        super(RNN, self).__init__()
        self.cell_1 = VanillaRNNCell(input_dim=vocab_size, units=state_size_1)
        self.cell_2 = VanillaRNNCell(input_dim=state_size_1, units=state_size_2)
        self.cells = [self.cell_1, self.cell_2]
        self.rnn = tf.keras.layers.RNN(self.cells, return_sequences=True)
        self.output_layer = tf.keras.layers.Dense(units=vocab_size, activation=tf.nn.softmax)
        
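    def call(self, x):
        # x: (batch, time, vocab_size) -> seqs: (batch, time, state_size_2)
        seqs = self.rnn(x)
        # Softmax over the vocabulary at every timestep: (batch, time, vocab_size)
        output = self.output_layer(seqs)
        return output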
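
To make the recurrence inside `VanillaRNNCell` concrete, here is a sketch of the same update rule in plain Python on toy sizes. The weights and dimensions are arbitrary assumptions chosen for readability, and `rnn_step` is a hypothetical helper, not the Keras implementation. The initial state is assumed to be all zeros.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def rnn_step(x, h_prev, w_in, w_h, b_h):
    """One timestep: h_new = sigmoid(x @ w_in + h_prev @ w_h + b_h)."""
    units = len(b_h)
    h_new = []
    for j in range(units):
        z = b_h[j]
        z += sum(x[i] * w_in[i][j] for i in range(len(x)))
        z += sum(h_prev[k] * w_h[k][j] for k in range(units))
        h_new.append(sigmoid(z))
    return h_new

# Toy sizes (assumption): 3 input features, 2 hidden units, arbitrary weights.
w_in = [[0.1, -0.2], [0.0, 0.3], [0.5, 0.1]]
w_h = [[0.2, 0.0], [-0.1, 0.4]]
b_h = [0.0, 0.0]

h = [0.0, 0.0]                                # initial hidden state
sequence = [[1, 0, 0], [0, 0, 1], [0, 1, 0]]  # three one-hot inputs
outputs = []
for x in sequence:
    h = rnn_step(x, h, w_in, w_h, b_h)  # the new state feeds into the next step
    outputs.append(h)

print(outputs)  # one hidden vector per timestep, as with return_sequences=True
```

In the stacked model, the sequence of hidden vectors from the first cell plays the role of `sequence` for the second cell.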

In [0]:
# Defining the function for generating novel text samples.
def generate_sample(sample, n):
    # Translate sample string into list of character indices.
    lis = [char2idx[s] for s in sample]
    # Transform list into a one-hot tensor of shape (1, 20, vocab_size).
    lis = tf.one_hot(lis, depth = vocab_size)
    lis = tf.expand_dims(lis, 0)
    # Sample n new characters.
    for _ in range(n):      
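        # Feed the sequence into the RNN. The output holds next-character
        # probabilities for every timestep, shape (1, K, vocab_size).
        prob = model(lis)
        prob = tf.squeeze(prob, 0)
        # tf.random.categorical() interprets its input as unnormalized log-probabilities
        # (logits), so dividing by WONKYNESS rescales the scores before sampling:
        # smaller values make the highest-scoring character more dominant.
        prob = prob / WONKYNESS
        # Sample an index for every timestep and keep the one for the last timestep.
        pred_idx = tf.random.categorical(prob, num_samples = 1)[-1,0].numpy()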
        # Translate to actual character and add it to sample string.
        char = idx2char[pred_idx]
        sample += char
        # Create new sequence of 20 indices by deleting the first character of the old sequence
        # and adding the new character.
        new_sample = sample[-K:]
        lis = [char2idx[s] for s in new_sample]
        lis = tf.one_hot(lis, depth = vocab_size)
        lis = tf.expand_dims(lis, 0)
    return sample
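
Because `tf.random.categorical()` treats its input as logits, the distribution that is actually sampled from is the softmax of the scaled probabilities. The sketch below reproduces this effect in plain Python for a toy output vector; `toy_probs` and `effective_distribution` are assumptions for illustration and do not come from the trained model.

```python
import math

def softmax(scores):
    # Subtract the maximum for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def effective_distribution(probs, wonkyness):
    """Distribution obtained when probs / wonkyness are interpreted as logits."""
    return softmax([p / wonkyness for p in probs])

toy_probs = [0.6, 0.3, 0.1]  # toy model output (assumption), not real predictions
for w in (0.01, 0.028, 0.1, 1.0):
    print(w, [round(p, 3) for p in effective_distribution(toy_probs, w)])
```

Small values concentrate almost all mass on the most likely character, while values near 1 flatten the distribution towards uniform. A more conventional temperature scheme would divide the log-probabilities by the temperature instead; that variant is not what the function above does.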

In [0]:
tf.keras.backend.clear_session()
# Initialize the RNN, categorical cross-entropy as the loss function, and Adam with learning rate 0.01 as the optimizer.
model = RNN()
loss = tf.keras.losses.CategoricalCrossentropy()
optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)
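
For a one-hot target, categorical cross-entropy at a single timestep reduces to the negative log of the probability assigned to the correct character. The sketch below computes this in plain Python; `one_hot` and `cross_entropy` are hypothetical helpers, and the uniform prediction is an assumed baseline for an untrained model, not a measured value.

```python
import math

def one_hot(index, depth):
    return [1.0 if i == index else 0.0 for i in range(depth)]

def cross_entropy(target_one_hot, predicted_probs):
    # Only the entry of the correct class contributes to the sum.
    return -sum(t * math.log(p)
                for t, p in zip(target_one_hot, predicted_probs) if t > 0)

n_chars = 74  # vocabulary size printed above
uniform = [1.0 / n_chars] * n_chars
target = one_hot(5, n_chars)

baseline = cross_entropy(target, uniform)
print(baseline)  # equals log(n_chars), the loss of uniform guessing
```

A loss clearly below this baseline indicates that the model has learned something about which character comes next.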

In [0]:
# Train for one epoch. Your loss should be around 1.4.
# Remember to encode the inputs and target values as one-hot vectors.
EPOCHS = 1
s = time.time()
for epoch in range(EPOCHS):
    start = time.time()
    for (x,targ) in dataset:
        x = tf.one_hot(x, depth = vocab_size)
        targ = tf.one_hot(targ, depth = vocab_size)

        with tf.GradientTape() as tape:
            pred = model(x)
            batch_loss = loss(targ, pred)
        # Compute the gradients outside the tape context so the backward pass is not recorded.
        gradients = tape.gradient(batch_loss, model.trainable_variables)
        optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    print(f"Epoch: {epoch + 1} ; Loss: {batch_loss.numpy()} ; Time: {round(time.time() - start)} sec")
print(f"Total Time until completion: {round(time.time() - s)}")

Epoch: 1 ; Loss: 1.4544851779937744 ; Time: 154 sec
Total Time until completion: 154
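
One way to interpret the loss value is to convert it to a perplexity, the exponential of the cross-entropy, and compare it with uniform guessing over the vocabulary. This is a rough sketch: the printed loss belongs to the last batch only, so treat the result as an estimate.

```python
import math

final_loss = 1.4544851779937744  # loss printed after the epoch above
n_chars = 74                     # vocabulary size printed earlier

perplexity = math.exp(final_loss)  # effective number of equally likely characters
uniform_loss = math.log(n_chars)   # loss of guessing uniformly at random

print(perplexity, uniform_loss)
```

A perplexity between 4 and 5 means the model is, on average, about as uncertain as if it were choosing among four or five equally likely characters instead of all 74.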


In [0]:
# Feel free to generate some funny samples.
# generate_sample takes the sample below (of length K) and returns a text sequence of length K+n.
WONKYNESS = 0.028  # Higher values make the text more unpredictable.
# Experimentally, values between 0.01 and 0.1 work best.

#sample = 'The First Book of Moses:  Called Genesis'
#sample = '1:1 God and Jesus made wine and smoked a'
sample = '1:1 In the beginning'
print(f"Sample Length: {len(sample)} ; K : {K}")

assert len(sample) == K

print(generate_sample(sample,200))

Sample Length: 20 ; K : 20
1:1 In the beginning and the things and square the more the seven King 9:1z ves of the (prophets of the Dedise the seed the body of the world and the children, and the seed the world and the seven the seed the uncleannes
