# Generating Shakespearean Text with Character Based RNNs

Problem statement: given a character or a sequence of characters, we want to predict the next character at each time step. The model is trained to produce text that resembles the language of Shakespeare's works. The tinyshakespeare dataset is used for training.

In [None]:
import tensorflow as tf
import numpy as np
import pandas as pd
import os
import time

In [None]:
def read_text(URL):
    # URL is a local file path; the built-in open() needs no extra import
    with open(URL, 'r', encoding='utf8') as f:
        text = f.read()
    return text

In [None]:
#Read the training text, decoding it explicitly as UTF-8
text = open('./shakespeare_train.txt', 'r', encoding='utf8').read()
print(text[:200])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you


In [None]:
#Find Vocabulary (set of characters)
vocabulary = sorted(set(text))
print('No. of unique characters: {}'.format(len(vocabulary)))

No. of unique characters: 67


## Preprocessing Text

In [None]:
#character to index mapping
char2index = {c:i for i,c in enumerate(vocabulary)}
int_text = np.array([char2index[i] for i in text])

#Index to character mapping
index2char = np.array(vocabulary)

In [None]:
#Testing
print("Character to Index: \n")
for char,_ in zip(char2index, range(65)):
    print('  {:4s}: {:3d}'.format(repr(char), char2index[char]))

print("\nInput text to Integer: \n")
print('{} mapped to {}'.format(repr(text[:20]),int_text[:20])) #use repr() for debugging

Character to Index: 

  '\n':   0
  ' ' :   1
  '!' :   2
  '$' :   3
  '&' :   4
  "'" :   5
  ',' :   6
  '-' :   7
  '.' :   8
  '3' :   9
  ':' :  10
  ';' :  11
  '?' :  12
  'A' :  13
  'B' :  14
  'C' :  15
  'D' :  16
  'E' :  17
  'F' :  18
  'G' :  19
  'H' :  20
  'I' :  21
  'J' :  22
  'K' :  23
  'L' :  24
  'M' :  25
  'N' :  26
  'O' :  27
  'P' :  28
  'Q' :  29
  'R' :  30
  'S' :  31
  'T' :  32
  'U' :  33
  'V' :  34
  'W' :  35
  'X' :  36
  'Y' :  37
  'Z' :  38
  '[' :  39
  ']' :  40
  'a' :  41
  'b' :  42
  'c' :  43
  'd' :  44
  'e' :  45
  'f' :  46
  'g' :  47
  'h' :  48
  'i' :  49
  'j' :  50
  'k' :  51
  'l' :  52
  'm' :  53
  'n' :  54
  'o' :  55
  'p' :  56
  'q' :  57
  'r' :  58
  's' :  59
  't' :  60
  'u' :  61
  'v' :  62
  'w' :  63
  'x' :  64

Input text to Integer: 

'First Citizen:\nBefor' mapped to [18 49 58 59 60  1 15 49 60 49 66 45 54 10  0 14 45 46 55 58]
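The two mappings are inverses of each other, which can be checked with a round trip. The following is a minimal sketch in plain Python on a toy string; the names `sample_text`, `encode`, and `decode` are hypothetical and not part of the notebook.

```python
# Toy text standing in for the Shakespeare corpus (assumption for illustration).
sample_text = "First Citizen:\nBefore"

# Same construction as in the notebook: sorted unique characters form the vocabulary.
sample_vocab = sorted(set(sample_text))
sample_char2index = {c: i for i, c in enumerate(sample_vocab)}


def encode(s, mapping):
    """Map each character of s to its integer index."""
    return [mapping[c] for c in s]


def decode(ids, vocab):
    """Map integer indices back to characters and join them into a string."""
    return "".join(vocab[i] for i in ids)


encoded = encode(sample_text, sample_char2index)
decoded = decode(encoded, sample_vocab)

print(encoded)
print(repr(decoded))  # repr() shows the newline as \n
```

If the round trip does not return the original string, the vocabulary and the mapping were probably built from different texts.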


## Create Training Data

In [None]:
seq_length = 150 #number of characters in each input sequence
examples_per_epoch = len(text) // (seq_length + 1) #each example consumes seq_length + 1 characters

#converts text (vector) into character index stream
#Reference: https://www.tensorflow.org/api_docs/python/tf/data/Dataset
char_dataset = tf.data.Dataset.from_tensor_slices(int_text)

In [None]:
#Group the character stream into sequences of seq_length + 1 characters,
#so that the input and the target (shifted by one) each have seq_length characters
sequences = char_dataset.batch(seq_length+1, drop_remainder=True)

In [None]:
#Testing
print("Character Stream: \n")
for i in char_dataset.take(10):
    print(index2char[i.numpy()])

print("\nSequence: \n")
for i in sequences.take(10):
    print(repr(''.join(index2char[i.numpy()])))  #repr() shows newlines as \n instead of rendering them

Character Stream: 

F
i
r
s
t
 
C
i
t
i

Sequence: 

'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou are all resolved rather to die than to famish?\n\nAl'
"l:\nResolved. resolved.\n\nFirst Citizen:\nFirst, you know Caius Marcius is chief enemy to the people.\n\nAll:\nWe know't, we know't.\n\nFirst Citizen:\nLet us k"
"ill him, and we'll have corn at our own price.\nIs't a verdict?\n\nAll:\nNo more talking on't; let it be done: away, away!\n\nSecond Citizen:\nOne word, good "
'citizens.\n\nFirst Citizen:\nWe are accounted poor citizens, the patricians good.\nWhat authority surfeits on would relieve us: if they\nwould yield us but '
'the superfluity, while it were\nwholesome, we might guess they relieved us humanely;\nbut they think we are too dear: the leanness that\nafflicts us, the '
'object of our misery, is as an\ninventory to particularise their abundance; our\nsufferance is a gain to them Let us revenge this with\nour pi


Target value: for each input sequence, the target is the same sequence shifted one position to the right, so that it ends with the character that follows the input.

To create training examples of (input, target) pairs, we take each sequence of seq_length + 1 characters. The input is the sequence with its last character removed. The target is the sequence with its first character removed. Example:

sequence: abc d ef

input: abc d e

target: bc d ef

In [None]:
def create_input_target_pair(chunk):
    input_text = chunk[:-1]
    target_text = chunk[1:]
    return input_text, target_text

dataset = sequences.map(create_input_target_pair)

In [None]:
#Testing
for input_example, target_example in  dataset.take(1):
    print('Input data: ', repr(''.join(index2char[input_example.numpy()])))
    print('Target data:', repr(''.join(index2char[target_example.numpy()])))

Input data:  'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou are all resolved rather to die than to famish?\n\nA'
Target data: 'irst Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou are all resolved rather to die than to famish?\n\nAl'
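The chunking and shifting steps can also be traced without TensorFlow. The following is a rough sketch in plain Python on a toy string; `make_chunks` and `split_input_target` are hypothetical helpers that mimic `batch(..., drop_remainder=True)` and `create_input_target_pair`, not the notebook's actual pipeline.

```python
def make_chunks(ids, chunk_len):
    """Split ids into consecutive chunks of chunk_len, dropping a short tail."""
    n_full = len(ids) // chunk_len
    return [ids[i * chunk_len:(i + 1) * chunk_len] for i in range(n_full)]


def split_input_target(chunk):
    """Input drops the last element; target drops the first."""
    return chunk[:-1], chunk[1:]


toy = list("abc d ef!")   # 9 characters
toy_seq_length = 3        # assumed toy value; the notebook uses 150

# Chunks hold toy_seq_length + 1 characters; the trailing "!" is dropped.
chunks = make_chunks(toy, toy_seq_length + 1)
pairs = [split_input_target(c) for c in chunks]

for inp, tgt in pairs:
    print(repr("".join(inp)), "->", repr("".join(tgt)))
```

Each input and target has exactly `toy_seq_length` characters, and position `t` of the target is the character that follows position `t` of the input.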


In [None]:
#Creating batches
BATCH_SIZE = 64

# Buffer used to shuffle the dataset 
# Reference: https://stackoverflow.com/questions/46444018/meaning-of-buffer-size-in-dataset-map-dataset-prefetch-and-dataset-shuffle
BUFFER_SIZE = 10000

dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)

dataset

<BatchDataset shapes: ((64, 150), (64, 150)), types: (tf.int64, tf.int64)>
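The effect of the buffer size is easier to see in a small model of buffered shuffling. The sketch below is a plain-Python approximation, assuming the commonly described behaviour (fill a buffer, emit a random element from it, refill the freed slot with the next incoming element); `buffered_shuffle` is a hypothetical helper and not the `tf.data` implementation.

```python
import random


def buffered_shuffle(items, buffer_size, seed=None):
    """Yield items in a locally shuffled order using a fixed-size buffer.

    Assumes buffer_size >= 1.
    """
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)          # still filling the buffer
            continue
        idx = rng.randrange(len(buffer))
        yield buffer[idx]                # emit a random buffered element
        buffer[idx] = item               # refill the freed slot
    rng.shuffle(buffer)                  # drain what is left in random order
    yield from buffer


shuffled = list(buffered_shuffle(range(20), buffer_size=5, seed=0))
unshuffled = list(buffered_shuffle(range(10), buffer_size=1))
print(shuffled)
print(unshuffled)
```

In this model, a buffer of size 1 leaves the order unchanged, and a buffer at least as large as the dataset gives a full shuffle. That is the reason for choosing a `BUFFER_SIZE` larger than the number of sequences when memory allows.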

## Building the Model

In [None]:
vocab_size = len(vocabulary)
embedding_dim = 256
rnn_units= 1024

Three layers are used:

Embedding layer (input): maps each character index to a 256-dimensional vector

LSTM layer: 1024 units

Dense layer: outputs one logit per vocabulary character

Because every layer has a single input and a single output, the layers can be stacked in a keras.Sequential model.

In [None]:
# Reference for theory: https://jhui.github.io/2017/03/15/RNN-LSTM-GRU/

def build_model_lstm(vocab_size, embedding_dim, rnn_units, batch_size):
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, embedding_dim,
                                  batch_input_shape=[batch_size, None]),
        tf.keras.layers.LSTM(rnn_units,
                             return_sequences=True,
                             stateful=True,
                             recurrent_initializer='glorot_uniform'),
        tf.keras.layers.Dense(vocab_size)
    ])
    return model

In [None]:
lstm_model = build_model_lstm(vocab_size = vocab_size,
                              embedding_dim=embedding_dim,
                              rnn_units=rnn_units, batch_size=BATCH_SIZE)

In [None]:
#Testing: shape
for input_example_batch, target_example_batch in dataset.take(1):
    example_prediction = lstm_model(input_example_batch)
    assert (example_prediction.shape == (BATCH_SIZE, seq_length, vocab_size)), "Shape error"
    #print(example_prediction.shape)

In [None]:
lstm_model.summary() 

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (64, None, 256)           17152     
_________________________________________________________________
lstm (LSTM)                  (64, None, 1024)          5246976   
_________________________________________________________________
dense (Dense)                (64, None, 67)            68675     
Total params: 5,332,803
Trainable params: 5,332,803
Non-trainable params: 0
_________________________________________________________________
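The parameter counts in the summary can be reproduced by hand. The sketch below assumes the standard LSTM layout of four gates, each with input weights, recurrent weights, and a bias vector; the helper names are hypothetical.

```python
def embedding_params(n_vocab, n_embed):
    """One n_embed-dimensional vector per vocabulary entry."""
    return n_vocab * n_embed


def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * ((input_dim + units) * units + units)


def dense_params(input_dim, output_dim):
    """Weight matrix plus one bias per output."""
    return input_dim * output_dim + output_dim


# Values taken from the notebook: 67 characters, 256-d embedding, 1024 units.
n_vocab, n_embed, n_units = 67, 256, 1024

counts = {
    "embedding": embedding_params(n_vocab, n_embed),
    "lstm": lstm_params(n_embed, n_units),
    "dense": dense_params(n_units, n_vocab),
}
total = sum(counts.values())

print(counts)
print(total)
```

These values match the 17,152, 5,246,976, and 68,675 parameters reported per layer and the total of 5,332,803. Almost all of the model's capacity sits in the LSTM layer.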


In [None]:
sampled_indices = tf.random.categorical(example_prediction[0], num_samples=1)
sampled_indices = tf.squeeze(sampled_indices,axis=-1).numpy()

## Model Training

In [None]:
def loss(labels, logits):
    return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)

In [None]:
example_loss  = loss(target_example_batch, example_prediction)
print("Prediction shape: ", example_prediction.shape)
print("Loss:      ", example_loss.numpy().mean())

Prediction shape:  (64, 150, 67)
Loss:       4.205296
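The initial loss is a useful sanity check. An untrained model should assign roughly equal probability to each of the 67 characters, which gives a cross-entropy of about ln(67) ≈ 4.2047, close to the value printed above. The following is a minimal sketch of the per-character loss in plain Python; `sparse_cross_entropy_from_logits` is a hypothetical helper, not the Keras function.

```python
import math


def sparse_cross_entropy_from_logits(label, logits):
    """Negative log-probability of `label` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[label]


n_classes = 67  # vocabulary size from the notebook

# Equal logits mean the model has no preference between characters.
uniform_logits = [0.0] * n_classes
baseline = sparse_cross_entropy_from_logits(3, uniform_logits)

# A confident, correct prediction drives the loss towards zero.
confident = sparse_cross_entropy_from_logits(0, [10.0, 0.0, 0.0])

print(baseline, math.log(n_classes))
print(confident)
```

An initial loss far above ln(vocab_size) usually means the starting logits are badly scaled or the labels do not line up with the predictions.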


In [None]:
lstm_model.compile(optimizer='adam', loss=loss)

In [None]:
lstm_dir_checkpoints= './training_checkpoints_LSTM'
checkpoint_prefix = os.path.join(lstm_dir_checkpoints, "checkpt_{epoch}") #name
checkpoint_callback=tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_prefix,save_weights_only=True)

In [None]:
EPOCHS=60 #increase the number of epochs for better results (lower loss)

In [None]:
history = lstm_model.fit(dataset, epochs=EPOCHS, callbacks=[checkpoint_callback])

Epoch 1/60
...
Epoch 60/60


In [None]:
tf.train.latest_checkpoint(lstm_dir_checkpoints)

'./training_checkpoints_LSTM/checkpt_60'

## Prediction

In [None]:
lstm_model = build_model_lstm(vocab_size, embedding_dim, rnn_units, batch_size=1)
lstm_model.load_weights(tf.train.latest_checkpoint(lstm_dir_checkpoints))
lstm_model.build(tf.TensorShape([1, None]))

lstm_model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (1, None, 256)            17152     
_________________________________________________________________
lstm_1 (LSTM)                (1, None, 1024)           5246976   
_________________________________________________________________
dense_1 (Dense)              (1, None, 67)             68675     
Total params: 5,332,803
Trainable params: 5,332,803
Non-trainable params: 0
_________________________________________________________________


In [None]:
def generate_text(model, start_string):
    num_generate = 1000 #Number of characters to be generated

    input_eval = [char2index[s] for s in start_string] #vectorising input
    input_eval = tf.expand_dims(input_eval, 0)

    text_generated = []

    # Lower temperatures result in more predictable text.
    # Higher temperatures result in more surprising text.
    # Experiment to find the best setting.
    temperature = 0.5

    # Here batch size == 1
    model.reset_states()
    for i in range(num_generate):
        predictions = model(input_eval)
        # remove the batch dimension
        predictions = tf.squeeze(predictions, 0)

        # sample the next character index from the categorical distribution
        # defined by the temperature-scaled logits
        predictions = predictions / temperature
        predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()

        # We pass the predicted character as the next input to the model
        # along with the previous hidden state
        input_eval = tf.expand_dims([predicted_id], 0)

        text_generated.append(index2char[predicted_id])

    return (start_string + ''.join(text_generated))
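The role of `temperature` can be illustrated without the model. The sketch below, in plain Python with made-up logits, shows how dividing the logits by the temperature before the softmax sharpens or flattens the sampling distribution; `softmax_with_temperature` and `sample_index` are hypothetical helpers standing in for what `tf.random.categorical` does with scaled logits.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_index(logits, temperature, rng):
    """Draw one index from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


toy_logits = [2.0, 1.0, 0.1]  # made-up scores for three characters

cold = softmax_with_temperature(toy_logits, 0.5)
neutral = softmax_with_temperature(toy_logits, 1.0)
hot = softmax_with_temperature(toy_logits, 2.0)

rng = random.Random(0)
picked = sample_index(toy_logits, 0.5, rng)

print(cold, neutral, hot, picked)
```

At temperature 0.5 the most likely character takes a larger share of the probability than at 1.0, and at 2.0 the distribution moves closer to uniform. That is why lower temperatures give more repetitive but more plausible text.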

In [None]:
#Prediction with User Input
lstm_test = input("Enter your starting string: ")
print(generate_text(lstm_model, start_string=lstm_test))

Enter your starting string: Fisrt citizen
Fisrt citizens; and
in the carpet comes so grievously done, as it is an
old man and a robber, and that sterile in me
doth live and to be the other earnest.

All Servants:
Ay, sir.

FALSTAFF:
What sayest thou, thine elder thou? What's thyself?

Third Servingman:
But what said she?

Servant:
Sir, I will seek him to the senators.

ACHILLES:
Go to him.

TRANIO:
A most conscience swell that are not such a natural;
Yield us the superfluous castle of the time.

FLAVIUS:
Away, away, away!

CASSANDRA:

HAMLET:
Then in some return there is no such sting.

HORATIO:
In the sea shall answer the best of your daughter,
And you shall find me praise unto the king.

GRATIANO:
What peer the youth of men and motions?
O you pretty of this great company!
For my particular arise, obier, all false
As to the enterprise of ill restraint,
And let him stand in present sickness and the state
For that which he conjuncts at fear and dropp'd,
To the great lips, set down his w