# What is an RNN?

A Recurrent Neural Network differs from a feed-forward network in that it keeps a hidden state, a form of memory that summarizes the inputs it has processed so far. It computes each new output from the current input together with this memory.
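
To make the idea of memory concrete, here is a minimal NumPy sketch of a plain (vanilla) RNN. It is not the model trained later in this notebook; the function name `rnn_forward`, the weight names and the toy sizes are all assumptions chosen for illustration.

```python
import numpy as np

def rnn_forward(inputs, W_x, W_h, b):
    """Run a vanilla RNN over a sequence; return the hidden state at every step."""
    h = np.zeros(W_h.shape[0])            # initial hidden state: nothing remembered yet
    states = []
    for x_t in inputs:                    # one time step per input vector
        # new memory = f(current input, previous memory)
        h = np.tanh(W_x @ x_t + W_h @ h + b)
        states.append(h)
    return np.stack(states)

# Toy sizes and random weights, purely illustrative.
rng = np.random.default_rng(0)
input_dim, hidden_dim, steps = 3, 4, 5
W_x = rng.normal(size=(hidden_dim, input_dim))
W_h = rng.normal(size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)
sequence = rng.normal(size=(steps, input_dim))

states = rnn_forward(sequence, W_x, W_h, b)
print(states.shape)  # (5, 4): one hidden state per time step
```

Because `h` is fed back into the next step, every hidden state depends on all the inputs that came before it.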

**GRU vs LSTM**

Both of these work well for text generation. GRUs are the newer of the two, and there isn’t a general way to determine which one is better. Tuning your hyper-parameters well will usually improve your model performance more than the choice between these two architectures.²
However, if the amount of data is not a problem, LSTMs tend to perform better. If you have less data, GRUs have fewer parameters, so they train faster and tend to generalize better from the smaller dataset.
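
The difference in parameter count can be estimated by hand. The sketch below uses the textbook formulations, in which an LSTM has four weight blocks (three gates plus the cell candidate) and a GRU has three (two gates plus the candidate state). The helper name `gated_rnn_params` is hypothetical, and a library implementation may report slightly different numbers depending on how it handles biases.

```python
def gated_rnn_params(input_dim, units, num_blocks):
    """Parameters of a gated RNN layer in the textbook formulation.

    Each block has input weights, recurrent weights and one bias vector.
    """
    per_block = units * (input_dim + units) + units
    return num_blocks * per_block

# Sizes matching the 256-dimensional embedding and 1024 units used in this notebook.
input_dim, units = 256, 1024
lstm = gated_rnn_params(input_dim, units, num_blocks=4)
gru = gated_rnn_params(input_dim, units, num_blocks=3)

print(lstm)        # 5246976
print(gru)         # 3935232
print(gru / lstm)  # 0.75
```

Under these assumptions a GRU layer needs three quarters of the parameters of an LSTM layer of the same size.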

**Why character-based?**

When working with a large dataset like this one, the number of unique words in the corpus is much higher than the number of unique characters. If we one-hot encode every word over such a large vocabulary, the resulting matrices become huge and we’re likely to run into memory issues; the labels alone can take up terabytes of memory.
The same principles you would use to predict words can be applied here, but now you’ll be working with a much smaller vocabulary.
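
A rough back-of-the-envelope calculation shows how the vocabulary size drives memory use. The corpus length and the word vocabulary size below are hypothetical round numbers, not measurements of this dataset; only the character vocabulary size (106) comes from this notebook.

```python
def one_hot_bytes(num_tokens, vocab_size, bytes_per_value=4):
    """Bytes needed to store one float32 one-hot vector per token."""
    return num_tokens * vocab_size * bytes_per_value

num_tokens = 1_000_000   # hypothetical corpus length, in tokens
char_vocab = 106         # character vocabulary size found in this notebook
word_vocab = 50_000      # hypothetical word vocabulary size

char_gb = one_hot_bytes(num_tokens, char_vocab) / 1e9
word_gb = one_hot_bytes(num_tokens, word_vocab) / 1e9
print(f"{char_gb:.3f} GB")  # 0.424 GB
print(f"{word_gb:.3f} GB")  # 200.000 GB
```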

In [1]:
import tensorflow as tf
import numpy as np
import os
import time

In [2]:
files = ['/content/sample_data/1SorcerersStone.txt',
         '/content/sample_data/2ChamberofSecrets.txt',
         '/content/sample_data/3ThePrisonerOfAzkaban.txt',
         '/content/sample_data/4TheGobletOfFire.txt',
         '/content/sample_data/5OrderofthePhoenix.txt',
         '/content/sample_data/6TheHalfBloodPrince.txt',
         '/content/sample_data/7DeathlyHollows.txt']

with open('harrypotter.txt', 'w') as outfile:
  for file in files:
    with open(file) as infile:
      outfile.write(infile.read())

with open('harrypotter.txt') as f:
  text = f.read()
print(text[:300])

rry Potter and the Sorcerer's Stone 

CHAPTER ONE 

THE BOY WHO LIVED 

Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much. They were the last people you'd expect to be involved in anything strange or mysterious, because they ju


# Processing the data

We map all the unique character strings in *vocab* to numbers by making two look-up tables:

* mapping the characters to numbers (**char2index**)
* mapping the numbers back to the characters (**index2char**)

We then use these tables to convert the text to numbers.

In [3]:
vocab = sorted(set(text))
print(vocab)
char2index = {u:i for i, u in enumerate(vocab)}
#print(char2index)
index2char = np.array(vocab)
text_as_int = np.array([char2index[c] for c in text])  # the whole text encoded as vocab indices

# how it looks:
print('{} -- characters mapped to int -- > {}'.format(repr(text[:13]), text_as_int[:13]))

['\t', '\n', '\x1f', ' ', '!', '"', '$', '%', '&', "'", '(', ')', '*', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '=', '>', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '[', '\\', ']', '^', '_', '`', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '{', '}', '~', '\x90', '\x92', '¦', '«', '\xad', '»', 'é', 'ü', '–', '‘', '’', '“', '•']
'rry Potter an' -- characters mapped to int -- > [81 81 88  3 47 78 83 83 68 81  3 64 77]
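
The two look-up tables are inverses of each other, which is easy to check on a toy string. The `toy_` names below are made up for this sketch so that they do not overwrite the notebook's own variables.

```python
toy_text = "hello world"
toy_vocab = sorted(set(toy_text))
toy_char2index = {ch: i for i, ch in enumerate(toy_vocab)}
toy_index2char = toy_vocab  # position i holds the character with index i

encoded = [toy_char2index[ch] for ch in toy_text]
decoded = "".join(toy_index2char[i] for i in encoded)

print(toy_vocab)  # [' ', 'd', 'e', 'h', 'l', 'o', 'r', 'w']
print(encoded)    # [3, 2, 4, 4, 5, 0, 7, 5, 6, 4, 1]
print(decoded)    # hello world
```

Encoding and then decoding returns the original text, so nothing is lost by working with integers.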


Each input sequence for our model will contain *seq_length* number of characters from the text, and its corresponding target sequence will be of the same length with all characters shifted one place to the right. So we break the text into chunks of *seq_length + 1*.

**tf.data.Dataset.from_tensor_slices** converts the text vector into a stream of character indices and the **batch** method lets us group these characters into batches of the required length.

By using the **map** method to apply a simple function to each batch, we create our inputs and targets.

In [4]:
seq_length = 100
examples_per_epoch = len(text) // (seq_length+1)
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)
sequences = char_dataset.batch(seq_length+1, drop_remainder=True) 

def split_input_target(data):
  input_text = data[:-1]
  target_text = data[1:]
  return input_text, target_text

#print(split_input_target('ola'))
dataset = sequences.map(split_input_target) #applies split_input_target to every sequence
print(dataset)

<MapDataset shapes: ((100,), (100,)), types: (tf.int64, tf.int64)>
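
The chunk-and-shift step is easier to see on plain strings. This is a pure-Python sketch of the same idea, not the `tf.data` pipeline itself; `make_pairs` is a hypothetical helper written for illustration.

```python
def make_pairs(text, seq_length):
    """Split text into chunks of seq_length + 1 and return (input, target) pairs."""
    chunk = seq_length + 1
    pairs = []
    # Stop before any incomplete final chunk, mirroring drop_remainder=True.
    for start in range(0, len(text) - chunk + 1, chunk):
        piece = text[start:start + chunk]
        pairs.append((piece[:-1], piece[1:]))  # target is the input shifted by one
    return pairs

for inp, tgt in make_pairs("Harry Potter", seq_length=5):
    print(repr(inp), "->", repr(tgt))
# 'Harry' -> 'arry '
# 'Potte' -> 'otter'
```

At every position the target holds the character that follows the corresponding input character.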


In [5]:
batch_size = 64
buffer_size = 10000
dataset = dataset.shuffle(buffer_size).batch(batch_size, drop_remainder=True)
print(dataset)

<BatchDataset shapes: ((64, 100), (64, 100)), types: (tf.int64, tf.int64)>


# Building the Model

Given all the characters seen so far, what will the next character be? This is what we will be training our RNN model to predict.

I have used **tf.keras.Sequential** to define the model since all the layers in it only have a single input and produce a single output. The different layers used are:

* **tf.keras.layers.Embedding**: This is the input layer. An embedding is used to map all the unique characters to vectors in multi-dimensional space, having embedding_dim dimensions.

* **tf.keras.layers.LSTM**: A type of RNN layer with rnn_units1 units. (You can also use a GRU layer here to see what works best for your data.)

* **tf.keras.layers.Dense**: This is the output layer, with vocab_size outputs.

It is also useful to define all the hyper-parameters separately so that it’s easier for you to change them later without editing the model definition.

In [8]:
def build_model(batch_size, vocab_size, embedding_dim, rnn_units1, rnn_units2):
  model = tf.keras.Sequential([
                               tf.keras.layers.Embedding(vocab_size, embedding_dim, batch_input_shape = [batch_size, None]),
                               tf.keras.layers.LSTM(rnn_units1, return_sequences=True, stateful=True, recurrent_initializer='glorot_uniform'),
                               #tf.keras.layers.LSTM(rnn_units2, return_sequences=True, stateful=True, recurrent_initializer='glorot_uniform'),
                               tf.keras.layers.Dense(vocab_size, activation='softmax')  # softmax: a probability distribution over the vocabulary
  ])
  return model

vocab_size = len(vocab)
embedding_dim = 256
rnn_units1 = 1024
rnn_units2 = 150

model = build_model(batch_size, vocab_size, embedding_dim, rnn_units1, rnn_units2)
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (64, None, 256)           27136     
_________________________________________________________________
lstm (LSTM)                  (64, None, 1024)          5246976   
_________________________________________________________________
dense (Dense)                (64, None, 106)           108650    
Total params: 5,382,762
Trainable params: 5,382,762
Non-trainable params: 0
_________________________________________________________________


In [9]:
def loss(labels, probs):
  return tf.keras.losses.sparse_categorical_crossentropy(labels,
         probs, from_logits=False)
  
model.compile(optimizer='adam', loss=loss, metrics=['accuracy'])
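
With `from_logits=False`, the loss treats the model output as probabilities and, for each position, takes the negative log of the probability assigned to the true character. A small NumPy sketch of that calculation, using made-up probabilities and a hypothetical helper name:

```python
import numpy as np

def sparse_cce_by_hand(labels, probs):
    """Negative log of the probability given to the true class, per position."""
    labels = np.asarray(labels)
    probs = np.asarray(probs, dtype=float)
    picked = probs[np.arange(len(labels)), labels]  # probability of each true class
    return -np.log(picked)

# Two positions, three classes; the numbers are invented for illustration.
toy_probs = [[0.7, 0.2, 0.1],
             [0.1, 0.1, 0.8]]
toy_labels = [0, 1]

toy_losses = sparse_cce_by_hand(toy_labels, toy_probs)
print(toy_losses)  # about 0.357 for the first position and 2.303 for the second
```

The first prediction is confident and correct, so its loss is small; the second gives the true class only 10% probability, so its loss is much larger.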

In [10]:
# Directory where the checkpoints will be saved
checkpoint_dir = './training_checkpoints'
# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, 'ckpt_{epoch}')
checkpoint_callback=tf.keras.callbacks.ModelCheckpoint(
   filepath=checkpoint_prefix, save_weights_only=True)
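
The `{epoch}` placeholder in the file name is filled in with the epoch number when a checkpoint is written, much like Python's `str.format`. The sketch below only illustrates that naming scheme with a separate `demo_prefix` variable; it does not write any files.

```python
import os

# Same pattern as checkpoint_prefix above, kept in a separate demo variable.
demo_prefix = os.path.join('./training_checkpoints', 'ckpt_{epoch}')

demo_names = [demo_prefix.format(epoch=e) for e in (1, 2, 25)]
for name in demo_names:
    print(name)
```

Each epoch therefore gets its own set of weight files, and `tf.train.latest_checkpoint` picks the most recent one.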

In [20]:
EPOCHS= 25
history = model.fit(dataset, epochs=EPOCHS, callbacks=[checkpoint_callback])
latest_check = tf.train.latest_checkpoint(checkpoint_dir)

Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25


In [11]:
# now with 1024 rnn units
vocab_size = len(vocab)
embedding_dim = 256
rnn_units1 = 1024
rnn_units2 = 150

model = build_model(batch_size, vocab_size, embedding_dim, rnn_units1, rnn_units2)
model.summary()
model.compile(optimizer='adam', loss=loss, metrics=['accuracy'])

# Directory where the checkpoints will be saved
checkpoint_dir = './training_checkpoints'
# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, 'ckpt_{epoch}')
checkpoint_callback=tf.keras.callbacks.ModelCheckpoint(
   filepath=checkpoint_prefix, save_weights_only=True)

EPOCHS= 25
history = model.fit(dataset, epochs=EPOCHS, callbacks=[checkpoint_callback])
latest_check = tf.train.latest_checkpoint(checkpoint_dir)

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (64, None, 256)           27136     
_________________________________________________________________
lstm_1 (LSTM)                (64, None, 1024)          5246976   
_________________________________________________________________
dense_1 (Dense)              (64, None, 106)           108650    
Total params: 5,382,762
Trainable params: 5,382,762
Non-trainable params: 0
_________________________________________________________________
Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25


In [8]:
def build_model(batch_size, vocab_size, embedding_dim, rnn_units1, rnn_units2):
  model = tf.keras.Sequential([
                               tf.keras.layers.Embedding(vocab_size, embedding_dim, batch_input_shape = [batch_size, None]),
                               tf.keras.layers.LSTM(rnn_units1, return_sequences=True, stateful=True, recurrent_initializer='glorot_uniform'),
                               #tf.keras.layers.LSTM(rnn_units2, return_sequences=True, stateful=True, recurrent_initializer='glorot_uniform'),
                               tf.keras.layers.Dense(vocab_size, activation='softmax')  # softmax: a probability distribution over the vocabulary
  ])
  return model

def loss(labels, probs):
  return tf.keras.losses.sparse_categorical_crossentropy(labels,
         probs, from_logits=False)

# now with 1024 rnn units
vocab_size = len(vocab)
embedding_dim = 300
rnn_units1 = 1024
rnn_units2 = 150

model = build_model(batch_size, vocab_size, embedding_dim, rnn_units1, rnn_units2)
model.summary()
model.compile(optimizer='adam', loss=loss, metrics=['accuracy'])

# Directory where the checkpoints will be saved
checkpoint_dir = './training_checkpoints'
# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, 'ckpt_{epoch}')
checkpoint_callback=tf.keras.callbacks.ModelCheckpoint(
   filepath=checkpoint_prefix, save_weights_only=True)

EPOCHS= 25
history = model.fit(dataset, epochs=EPOCHS, callbacks=[checkpoint_callback])
latest_check = tf.train.latest_checkpoint(checkpoint_dir)


Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (64, None, 300)           31800     
_________________________________________________________________
lstm_1 (LSTM)                (64, None, 1024)          5427200   
_________________________________________________________________
dense_1 (Dense)              (64, None, 106)           108650    
Total params: 5,567,650
Trainable params: 5,567,650
Non-trainable params: 0
_________________________________________________________________
Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25


In [9]:
def generate_text(model, start_string):
  
  num_generate = 1000 #number of characters to generate; any value works

  input_eval = [char2index[s] for s in start_string] 
  input_eval = tf.expand_dims(input_eval, 0)

  text_generated = []

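  scaling = 0.5 #sampling temperature: lower values give more predictable text

  #batch_size = 1 now!!
  model.reset_states()
  for i in range(num_generate):
    predictions = model(input_eval)
    #remove batch dimension
    predictions = tf.squeeze(predictions, 0)
    #the model outputs probabilities, but tf.random.categorical expects logits,
    #so take the log (with a small epsilon to avoid log(0)) before scaling
    predictions = tf.math.log(predictions + 1e-9) / scaling
    #sample the next character from the prediction at the last time step
    predicted_id = tf.random.categorical(predictions, num_samples = 1)[-1, 0].numpy()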
    input_eval = tf.expand_dims([predicted_id], 0)
    text_generated.append(index2char[predicted_id])
  
  return (start_string + ''.join(text_generated))
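
The `scaling` value acts as a sampling temperature. The NumPy sketch below shows its effect on a made-up distribution over three characters; `apply_temperature` is a hypothetical helper and is not part of the notebook's code.

```python
import numpy as np

def apply_temperature(probs, temperature):
    """Rescale a distribution: temperature < 1 sharpens it, > 1 flattens it."""
    logits = np.log(np.asarray(probs, dtype=float)) / temperature
    exp = np.exp(logits - logits.max())  # subtract the max for numerical stability
    return exp / exp.sum()

toy_dist = [0.5, 0.3, 0.2]  # invented probabilities for three characters
for t in (0.5, 1.0, 2.0):
    print(t, np.round(apply_temperature(toy_dist, t), 3))
```

At a temperature of 1.0 the distribution is unchanged. Lower values push more probability onto the most likely character, which gives safer but more repetitive text; higher values spread the probability out and make the output more surprising.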

In [None]:
model = build_model(1, vocab_size, embedding_dim, rnn_units1, rnn_units2)
model.load_weights(latest_check)
model.build(tf.TensorShape([1, None]))
model.summary()

print(generate_text(model, start_string=u'Severus Snape'))

In [None]:
print(generate_text(model, start_string=u'Voldemort died'))

In [None]:
print(generate_text(model, start_string=u'Harry and Ron '))

In [None]:
print(generate_text(model, start_string=u'Dumbledore said to Harry'))