<a href="https://colab.research.google.com/github/awill139/Demystifying-AI-Course/blob/master/7_LSTMs_and_GRUs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
import numpy as np

In [0]:
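def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# derivative of sigmoid with respect to its pre-activation input x
def dsigmoid(x):
    return sigmoid(x) * (1 - sigmoid(x))

def tangent(x):
    return np.tanh(x)

# derivative of tanh with respect to its pre-activation input x
def dtangent(x):
    return 1 - np.tanh(x)**2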

# Simple RNN

![](https://cdn-images-1.medium.com/max/1600/1*UkI9za9zTR-HL8uM15Wmzw.png)

What is wrong with this?

It is hard to train

It suffers from vanishing and exploding gradients

It is not easy to work with

It struggles to remember values from a long way in the past

We can solve this with a better architecture
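
To make the vanishing-gradient problem listed above concrete, here is a minimal sketch (not part of the original notebook) that pushes a gradient back through 50 steps of a plain tanh RNN. The sizes, the weight scale of 0.2 and the names `W`, `grad` and `norms` are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size = 5                                              # assumed toy size
W = rng.normal(scale=0.2, size=(hidden_size, hidden_size))   # recurrent weights

h = np.zeros(hidden_size)
grad = np.eye(hidden_size)   # d h_t / d h_0, starts as the identity
norms = []
for t in range(50):
    x_t = rng.normal(size=hidden_size)       # random input at this step
    h = np.tanh(W @ h + x_t)                 # plain RNN update
    jac = (1 - h**2)[:, None] * W            # Jacobian d h_t / d h_{t-1}
    grad = jac @ grad                        # chain rule through time
    norms.append(np.linalg.norm(grad))

print("after  1 step :", norms[0])
print("after 50 steps:", norms[-1])
```

With small recurrent weights the norm of the accumulated Jacobian shrinks quickly as the number of steps grows, so early inputs barely influence the gradient. With large weights the same product can blow up instead.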

![](https://www.researchgate.net/profile/S_Varsamopoulos/publication/329362532/figure/fig5/AS:699592479870977@1543807253596/Structure-of-the-LSTM-cell-and-equations-that-describe-the-gates-of-an-LSTM-cell.jpg)

Introducing the long short-term memory cell, or LSTM.

The LSTM uses what we will call 'gates'.

The input gate is just a sigmoid-activated input; it scales how much new information is written to the cell state C.

The forget gate has the same form as the input gate, but we use it differently: it scales the previous cell state C.

The output gate also has the same form as the input gate; it scales what is read out of the cell state.

The weights of each of these gates are trained independently.


We also have the cell state C. The cell state is what we remember from the previously given information.

The big Us in the picture just denote the gate

You may also have noticed the g 'gate', but that is nothing more than the original RNN output (a tanh-activated candidate value).
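
As a compact reference for the gates described above, here is a minimal sketch of one LSTM time step. It is not the notebook's implementation: the function name `lstm_step`, the weight dictionary `W` and the toy sizes are assumptions, and biases are left out to keep it short.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W):
    """One LSTM step; W maps 'f', 'i', 'g', 'o' to (hidden, hidden + input) matrices."""
    v = np.concatenate((h_prev, x_t))   # previous output stacked with the new input
    f = sigmoid(W["f"] @ v)             # forget gate: scales the old cell state
    i = sigmoid(W["i"] @ v)             # input gate: scales the new candidate
    g = np.tanh(W["g"] @ v)             # candidate value (the plain RNN output)
    o = sigmoid(W["o"] @ v)             # output gate: scales what is read out
    c = f * c_prev + i * g              # new cell state
    h = o * np.tanh(c)                  # new output
    return h, c

rng = np.random.default_rng(0)
n_in, n_hid = 3, 2                      # assumed toy sizes
W = {k: rng.normal(scale=0.1, size=(n_hid, n_hid + n_in)) for k in "figo"}
h, c = lstm_step(np.ones(n_in), np.zeros(n_hid), np.zeros(n_hid), W)
print(h, c)
```

Each gate has its own weight matrix, which is what "trained independently" means here.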

In [0]:
xs=3
hidden=5
ys=2
lr = 1e-3

In [0]:
x = np.zeros(xs+ys)
xs = xs + ys
y = np.zeros(ys)
cs = np.zeros(ys)
# xs already includes ys after the line above, so each weight matrix is (ys, xs) to match x
f = np.random.random((ys, xs))
i = np.random.random((ys, xs))
c = np.random.random((ys, xs))
o = np.random.random((ys, xs))
Gf = np.zeros_like(f)
Gi = np.zeros_like(i)
Gc = np.zeros_like(c)
Go = np.zeros_like(o)


In [0]:
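def forwardProp():
    # cs and y are module-level state that this step updates
    global cs, y
    # gate activations get their own names so the weight matrices f, i, c, o are not overwritten
    fa = sigmoid(np.dot(f, x))
    cs *= fa
    ia = sigmoid(np.dot(i, x))
    ca = tangent(np.dot(c, x))
    cs += ia * ca
    oa = sigmoid(np.dot(o, x))
    y = oa * tangent(cs)
    return cs, y, fa, ia, ca, oa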

In [0]:
def backProp(e, pcs, fa, ia, ca, oa, dfcs, dfhs):
    # fa, ia, ca, oa are the gate activations returned by forwardProp;
    # the globals f, i, c, o stay the weight matrices
    e = np.clip(e + dfhs, -6, 6)
    do = tangent(cs) * e
    # the gates are already activated, so their derivatives are written in terms of
    # the activations: a * (1 - a) for sigmoid and 1 - a**2 for tanh
    ou = np.dot(np.atleast_2d(do * oa * (1 - oa)).T, np.atleast_2d(x))
    dcs = np.clip(e * oa * dtangent(cs) + dfcs, -6, 6)
    dc = dcs * ia
    cu = np.dot(np.atleast_2d(dc * (1 - ca**2)).T, np.atleast_2d(x))
    di = dcs * ca
    iu = np.dot(np.atleast_2d(di * ia * (1 - ia)).T, np.atleast_2d(x))
    df = dcs * pcs
    fu = np.dot(np.atleast_2d(df * fa * (1 - fa)).T, np.atleast_2d(x))
    dpcs = dcs * fa
    dphs = np.dot(dc, c)[:ys] + np.dot(do, o)[:ys] + np.dot(di, i)[:ys] + np.dot(df, f)[:ys]
    return fu, iu, cu, ou, dpcs, dphs

In [0]:
#RMSprop
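def update(fu, iu, cu, ou):
    # the weights and the RMSprop caches are module-level arrays, so declare them global
    global f, i, c, o, Gf, Gi, Gc, Go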
    Gf = 0.9 * Gf + 0.1 * fu**2 
    Gi = 0.9 * Gi + 0.1 * iu**2   
    Gc = 0.9 * Gc + 0.1 * cu**2   
    Go = 0.9 * Go + 0.1 * ou**2   
    f -= lr/np.sqrt(Gf + 1e-8) * fu
    i -= lr/np.sqrt(Gi + 1e-8) * iu
    c -= lr/np.sqrt(Gc + 1e-8) * cu
    o -= lr/np.sqrt(Go + 1e-8) * ou
    return

Let's try a different architecture

![](https://cdn-images-1.medium.com/max/1600/1*9z1Jrl8K99TorEQfsOTjpA.png)

Introducing the Gated Recurrent Unit, or GRU.

Here, we have the update gate z, which essentially combines the input and forget gates from the LSTM.

The reset gate r behaves a bit like the forget gate in the LSTM.

The biggest difference is that the GRU lacks an output gate and just uses the hidden state as its output.
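
The gates described above can be sketched as a single GRU time step. This is a minimal illustration, not the notebook's implementation: the function name `gru_step`, the argument names and the toy sizes are assumptions, and biases are omitted.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_step(x_t, h_prev, Wz, Wr, Wh):
    """One GRU step; each weight matrix has shape (hidden, hidden + input)."""
    v = np.concatenate((h_prev, x_t))
    z = sigmoid(Wz @ v)                                       # update gate
    r = sigmoid(Wr @ v)                                       # reset gate
    h_hat = np.tanh(Wh @ np.concatenate((r * h_prev, x_t)))   # candidate state
    # z blends the old state with the candidate, doing the job of the
    # LSTM's forget and input gates at once
    return (1 - z) * h_prev + z * h_hat

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4                                            # assumed toy sizes
Wz, Wr, Wh = (rng.normal(scale=0.2, size=(n_hid, n_hid + n_in)) for _ in range(3))
h = gru_step(np.ones(n_in), np.zeros(n_hid), Wz, Wr, Wh)
print(h)
```

There is no separate cell state and no output gate: the returned `h` is both the memory and the output.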

In [0]:
# initialise the weights and biases, and their velocities
wstd = 0.2
w1 = np.random.randn(xs,hidden)*wstd
w1v = np.zeros((xs,hidden))
b1 = np.zeros((hidden,))
b1v = np.zeros((hidden,))

wz = np.random.randn(2*hidden,hidden)*wstd
wzv = np.zeros((2*hidden,hidden)) # the weight velocity
bz = np.zeros((hidden,))
bzv = np.zeros((hidden,))

wr = np.random.randn(2*hidden,hidden)*wstd
wrv = np.zeros((2*hidden,hidden)) # the weight velocity
br = np.zeros((hidden,))
brv = np.zeros((hidden,))

wh = np.random.randn(2*hidden,hidden)*wstd
whv = np.zeros((2*hidden,hidden)) # the weight velocity
bh = np.zeros((hidden,))
bhv = np.zeros((hidden,))

w2 = np.random.randn(hidden,ys)*wstd
w2v = np.zeros((hidden,ys)) # the weight velocity
b2 = np.zeros((ys,))
b2v = np.zeros((ys,))

In [0]:
def predict(input):
    L = np.shape(input)[0]
    az = np.zeros((L,hidden))
    ar = np.zeros((L,hidden))
    ahhat = np.zeros((L,hidden))
    ah = np.zeros((L,hidden))

    a1 = tangent(np.dot(input,w1) + b1)
    # first timestep (index 0): the previous hidden state is all zeros
    x = np.concatenate((np.zeros((hidden)),a1[0,:]))
    az[0,:] = sigmoid(np.dot(x,wz) + bz)
    ar[0,:] = sigmoid(np.dot(x,wr) + br)
    ahhat[0,:] = tangent(np.dot(x,wh) + bh)
    ah[0,:] = az[0,:]*ahhat[0,:]

    for i in range(1,L):
        x = np.concatenate((ah[i-1,:],a1[i,:]))
        az[i,:] = sigmoid(np.dot(x,wz) + bz)
        ar[i,:] = sigmoid(np.dot(x,wr) + br)
        x = np.concatenate((ar[i,:]*ah[i-1,:],a1[i,:]))
        ahhat[i,:] = tangent(np.dot(x,wh) + bh)
        ah[i,:] = (1-az[i,:])*ah[i-1,:] + az[i,:]*ahhat[i,:]
    a2 = tangent(np.dot(ah,w2) + b2)
    return [a1,az,ar,ahhat,ah,a2]

In [0]:
def compute_gradients(input,labels):
    [a1,az,ar,ahhat,ah,a2] = predict(input)
    # error = labels - a2, so every delta below is a negative gradient (a descent direction)
    error = (labels - a2)

    L = np.shape(input)[0]
    H = hidden
    dz = np.zeros((L,H))
    dr = np.zeros((L,H))
    dh = np.zeros((L,H))
    d1 = np.zeros((L,H))

    # this is ah from the previous timestep
    ahm1 = np.concatenate((np.zeros((1,H)),ah[:-1,:]))

    # a1, az, ar, ahhat and a2 are already activated, so the derivatives are written
    # in terms of the activations: 1 - a**2 for tanh and a*(1 - a) for sigmoid
    d2 = error*(1 - a2**2)
    e2 = np.dot(d2,w2.T)
    dh_next = np.zeros(H)
    for i in range(L-1,-1,-1):
        err = e2[i,:] + dh_next
        dz[i,:] = (err*ahhat[i,:] - err*ahm1[i,:])*az[i,:]*(1 - az[i,:])
        dh[i,:] = err*az[i,:]*(1 - ahhat[i,:]**2)
        dr[i,:] = np.dot(dh[i,:],wh[:H,:].T)*ahm1[i,:]*ar[i,:]*(1 - ar[i,:])
        dh_next = err*(1-az[i,:]) + np.dot(dh[i,:],wh[:H,:].T)*ar[i,:] + np.dot(dz[i,:],wz[:H,:].T) + np.dot(dr[i,:],wr[:H,:].T)
        d1[i,:] = np.dot(dh[i,:],wh[H:,:].T) + np.dot(dz[i,:],wz[H:,:].T) + np.dot(dr[i,:],wr[H:,:].T)
    d1 = d1*(1 - a1**2)
    # all the deltas are computed, now compute the gradients
    gw2 = 1.0/L * np.dot(ah.T,d2)
    gb2 = 1.0/L * np.sum(d2,0)
    x = np.concatenate((ahm1,a1),1)
    gwz = 1.0/L * np.dot(x.T,dz)
    gbz = 1.0/L * np.sum(dz,0)
    gwr = 1.0/L * np.dot(x.T,dr)
    gbr = 1.0/L * np.sum(dr,0)
    x = np.concatenate((ar*ahm1,a1),1)
    gwh = 1.0/L * np.dot(x.T,dh)
    gbh = 1.0/L * np.sum(dh,0)
    gw1 = 1.0/L * np.dot(input.T,d1)
    gb1 = 1.0/L * np.sum(d1,0)
    weight_grads = [gw1,gwr,gwz,gwh,gw2]
    bias_grads = [gb1,gbr,gbz,gbh,gb2]

    return weight_grads, bias_grads

In [0]:
def numerical_gradients(input,label,small=0.0001):
    weight_grads = []
    bias_grads = []
    wstr = ['w1','wr','wz','wh','w2']
    bstr = ['b1','br','bz','bh','b2']

    for i in range(len(wstr)):
        # the parameters are module-level arrays, so look them up by name in globals()
        w = globals()[wstr[i]]
        b = globals()[bstr[i]]
        H,W = np.shape(w)
        wgrad = np.zeros((H,W))
        bgrad = np.zeros((W,))
        for j in range(W):
            for k in range(H):
                w[k,j] += small
                act1 = predict(input)
                err1 = np.mean(np.sum(0.5*np.square(label - act1[-1]),1))
                w[k,j] -= 2*small
                act2 = predict(input)
                err2 = np.mean(np.sum(0.5*np.square(label - act2[-1]),1))
                wgrad[k,j] = (err1-err2)/(2*small)
                w[k,j] += small
            b[j] += small
            act1 = predict(input)
            err1 = np.mean(np.sum(0.5*np.square(label - act1[-1]),1))
            b[j] -= 2*small
            act2 = predict(input)
            err2 = np.mean(np.sum(0.5*np.square(label - act2[-1]),1))
            bgrad[j] = (err1-err2)/(2*small)
            b[j] += small 
        weight_grads.append(wgrad)
        bias_grads.append(bgrad)
    return weight_grads, bias_grads
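
A numerical gradient is mainly useful as a check on a hand-written backward pass. The sketch below shows that comparison on a deliberately tiny model (a single tanh layer) rather than on the GRU above. The names `loss_fn`, `analytic_grad`, `numeric_grad` and the toy shapes are assumptions for illustration.

```python
import numpy as np

def loss_fn(w, x, target):
    # same 0.5 * (label - output)**2 form as the error used above
    return 0.5 * np.sum((target - np.tanh(x @ w)) ** 2)

def analytic_grad(w, x, target):
    out = np.tanh(x @ w)
    delta = -(target - out) * (1 - out ** 2)   # dLoss / d(pre-activation)
    return x.T @ delta

def numeric_grad(w, x, target, small=1e-4):
    grad = np.zeros_like(w)
    for idx in np.ndindex(*w.shape):
        old = w[idx]
        w[idx] = old + small
        up = loss_fn(w, x, target)
        w[idx] = old - small
        down = loss_fn(w, x, target)
        w[idx] = old                           # restore the weight
        grad[idx] = (up - down) / (2 * small)  # central difference
    return grad

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))
target = rng.uniform(-0.5, 0.5, size=(4, 2))
w = rng.normal(scale=0.2, size=(3, 2))

ga = analytic_grad(w, x, target)
gn = numeric_grad(w, x, target)
rel_err = np.linalg.norm(ga - gn) / (np.linalg.norm(ga) + np.linalg.norm(gn))
print("relative error:", rel_err)
```

A very small relative error suggests the two gradients agree; a large one usually points to a bug in the analytic version, such as a sign error or a wrong activation derivative.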

In [0]:
def backprop(input,label,momentum=0.9):
    weight_grads, bias_grads = compute_gradients(input,label)
    params = globals()
    wstr = ['1','r','z','h','2']
    for i in range(len(wstr)):
        # update the module-level arrays in place so the changes persist
        wv = params["w"+wstr[i]+"v"]
        wv[...] = momentum*wv + lr*weight_grads[i]
        bv = params["b"+wstr[i]+"v"]
        bv[...] = momentum*bv + lr*bias_grads[i]
        w = params["w"+wstr[i]]
        w += wv
        b = params["b"+wstr[i]]
        b += bv
    return 0

Which should you choose?
Both are fine. LSTMs are more flexible because they have more trainable parameters than GRUs, but this also leads to slower training. Performance is mostly equal; one may do slightly better than the other on a given task. Once you have settled on your task, it is usually good to test it on an LSTM, then on a GRU, compare loss and output, and make your decision there, then do further development on this data with the chosen architecture. (Personally I tend to use GRUs for text and LSTMs for images. No real reason; as I've said, the performance difference is virtually negligible.)
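
To put a number on "more trainable parameters", here is a small sketch that counts the weights of a single recurrent layer. The helper names `lstm_params` and `gru_params` are hypothetical. The GRU formula assumes two bias vectors per gate, as in Keras' default `reset_after=True` variant; other GRU formulations use one.

```python
def lstm_params(input_dim, units):
    # four weight blocks (input, forget, candidate, output), each with
    # input weights, recurrent weights and one bias vector
    return 4 * (units * (input_dim + units) + units)

def gru_params(input_dim, units):
    # three weight blocks (update, reset, candidate); assumes two bias
    # vectors per block (Keras default reset_after=True)
    return 3 * (units * (input_dim + units) + 2 * units)

# sizes used by the text generation models later in this notebook
embedding_dim, rnn_units = 256, 1024
print(lstm_params(embedding_dim, rnn_units))  # 5246976
print(gru_params(embedding_dim, rnn_units))   # 3938304
```

These counts match the `Param #` column for the LSTM and GRU layers in the model summaries further down, so the GRU layer is roughly three quarters the size of the LSTM layer.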

I still want to do something cool, so here is a Shakespeare text generation tutorial taken from [TensorFlow Tutorials](https://www.tensorflow.org/beta/tutorials/text/text_generation)

In [0]:
from __future__ import absolute_import, division, print_function, unicode_literals

!pip install tensorflow-gpu==2.0.0-beta0
import tensorflow as tf

import numpy as np
import os
import time



In [0]:
path_to_file = tf.keras.utils.get_file('shakespeare.txt', 'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')

In [0]:
# Read, then decode for py2 compat.
text = open(path_to_file, 'rb').read().decode(encoding='utf-8')
# length of text is the number of characters in it
print ('Length of text: {} characters'.format(len(text)))

Length of text: 1115394 characters


In [0]:
# Take a look at the first 250 characters in text
print(text[:250])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.



In [0]:
# The unique characters in the file
vocab = sorted(set(text))
print ('{} unique characters'.format(len(vocab)))

65 unique characters


In [0]:
# Creating a mapping from unique characters to indices
char2idx = {u:i for i, u in enumerate(vocab)}
idx2char = np.array(vocab)

text_as_int = np.array([char2idx[c] for c in text])

In [0]:
print('{')
for char,_ in zip(char2idx, range(20)):
    print('  {:4s}: {:3d},'.format(repr(char), char2idx[char]))
print('  ...\n}')

{
  '\n':   0,
  ' ' :   1,
  '!' :   2,
  '$' :   3,
  '&' :   4,
  "'" :   5,
  ',' :   6,
  '-' :   7,
  '.' :   8,
  '3' :   9,
  ':' :  10,
  ';' :  11,
  '?' :  12,
  'A' :  13,
  'B' :  14,
  'C' :  15,
  'D' :  16,
  'E' :  17,
  'F' :  18,
  'G' :  19,
  ...
}


In [0]:
# Show how the first 13 characters from the text are mapped to integers
print ('{} ---- characters mapped to int ---- > {}'.format(repr(text[:13]), text_as_int[:13]))

'First Citizen' ---- characters mapped to int ---- > [18 47 56 57 58  1 15 47 58 47 64 43 52]


In [0]:
# The maximum length sentence we want for a single input in characters
seq_length = 100
examples_per_epoch = len(text)//seq_length

# Create training examples / targets
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)

for i in char_dataset.take(5):
  print(idx2char[i.numpy()])

F
i
r
s
t


In [0]:
sequences = char_dataset.batch(seq_length+1, drop_remainder=True)

for item in sequences.take(5):
  print(repr(''.join(idx2char[item.numpy()])))

'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou '
'are all resolved rather to die than to famish?\n\nAll:\nResolved. resolved.\n\nFirst Citizen:\nFirst, you k'
"now Caius Marcius is chief enemy to the people.\n\nAll:\nWe know't, we know't.\n\nFirst Citizen:\nLet us ki"
"ll him, and we'll have corn at our own price.\nIs't a verdict?\n\nAll:\nNo more talking on't; let it be d"
'one: away, away!\n\nSecond Citizen:\nOne word, good citizens.\n\nFirst Citizen:\nWe are accounted poor citi'


In [0]:
def split_input_target(chunk):
    input_text = chunk[:-1]
    target_text = chunk[1:]
    return input_text, target_text

dataset = sequences.map(split_input_target)

In [0]:
for input_example, target_example in  dataset.take(1):
  print ('Input data: ', repr(''.join(idx2char[input_example.numpy()])))
  print ('Target data:', repr(''.join(idx2char[target_example.numpy()])))

Input data:  'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou'
Target data: 'irst Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou '


In [0]:
for i, (input_idx, target_idx) in enumerate(zip(input_example[:5], target_example[:5])):
    print("Step {:4d}".format(i))
    print("  input: {} ({:s})".format(input_idx, repr(idx2char[input_idx])))
    print("  expected output: {} ({:s})".format(target_idx, repr(idx2char[target_idx])))

Step    0
  input: 18 ('F')
  expected output: 47 ('i')
Step    1
  input: 47 ('i')
  expected output: 56 ('r')
Step    2
  input: 56 ('r')
  expected output: 57 ('s')
Step    3
  input: 57 ('s')
  expected output: 58 ('t')
Step    4
  input: 58 ('t')
  expected output: 1 (' ')


In [0]:
# Batch size
BATCH_SIZE = 64
steps_per_epoch = examples_per_epoch//BATCH_SIZE

# Buffer size to shuffle the dataset
# (TF data is designed to work with possibly infinite sequences,
# so it doesn't attempt to shuffle the entire sequence in memory. Instead,
# it maintains a buffer in which it shuffles elements).
BUFFER_SIZE = 10000

dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)

dataset

<BatchDataset shapes: ((64, 100), (64, 100)), types: (tf.int64, tf.int64)>

In [0]:
# Length of the vocabulary in chars
vocab_size = len(vocab)

# The embedding dimension
embedding_dim = 256

# Number of RNN units
rnn_units = 1024

In [0]:
def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
  model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim,
                              batch_input_shape=[batch_size, None]),
    tf.keras.layers.LSTM(rnn_units,
                        return_sequences=True,
                        stateful=True,
                        recurrent_initializer='glorot_uniform'),
    tf.keras.layers.Dense(vocab_size)
  ])
  return model



In [0]:
model = build_model(
  vocab_size = len(vocab),
  embedding_dim=embedding_dim,
  rnn_units=rnn_units,
  batch_size=BATCH_SIZE)

In [0]:
for input_example_batch, target_example_batch in dataset.take(1):
  example_batch_predictions = model(input_example_batch)
  print(example_batch_predictions.shape, "# (batch_size, sequence_length, vocab_size)")

(64, 100, 65) # (batch_size, sequence_length, vocab_size)


In [0]:
model.summary()

Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_4 (Embedding)      (64, None, 256)           16640     
_________________________________________________________________
lstm_2 (LSTM)                (64, None, 1024)          5246976   
_________________________________________________________________
dense_4 (Dense)              (64, None, 65)            66625     
Total params: 5,330,241
Trainable params: 5,330,241
Non-trainable params: 0
_________________________________________________________________


In [0]:
sampled_indices = tf.random.categorical(example_batch_predictions[0], num_samples=1)
sampled_indices = tf.squeeze(sampled_indices,axis=-1).numpy()

In [0]:
print("Input: \n", repr("".join(idx2char[input_example_batch[0]])))
print()
print("Next Char Predictions: \n", repr("".join(idx2char[sampled_indices ])))

Input: 
 'ble not.\n\nBIANCA:\nBelieve me, sister, of all the men alive\nI never yet beheld that special face\nWhic'

Next Char Predictions: 
 "XRcQgFY!s$pY,PftiFjtdvmQs-VvMnYRxfa3RAnulbc;.'NFe'mSw!Pne-:!'!u asYjW G\n;OEQqmJptF:jeZQn!JZ!X;g;y ;H"


In [0]:
def loss(labels, logits):
  return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)

example_batch_loss  = loss(target_example_batch, example_batch_predictions)
print("Prediction shape: ", example_batch_predictions.shape, " # (batch_size, sequence_length, vocab_size)")
print("scalar_loss:      ", example_batch_loss.numpy().mean())

Prediction shape:  (64, 100, 65)  # (batch_size, sequence_length, vocab_size)
scalar_loss:       4.1743646
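
A quick sanity check on this number: an untrained model should be about as good as guessing uniformly over the vocabulary, and the cross-entropy of a uniform guess is the natural log of the vocabulary size. The sketch below reproduces that with plain NumPy; the all-zero `logits` vector is an assumption standing in for "no preference for any character".

```python
import numpy as np

vocab_size = 65
logits = np.zeros(vocab_size)                     # no preference for any character
probs = np.exp(logits) / np.sum(np.exp(logits))   # softmax gives a uniform distribution
uniform_loss = -np.log(probs[0])                  # cross-entropy for any target index
print(uniform_loss, np.log(vocab_size))
```

Both values are about 4.174, which is very close to the loss reported above, so the freshly initialised model is behaving like a uniform guesser, as expected.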


In [0]:
model.compile(optimizer='adam', loss=loss)

In [0]:
# Directory where the checkpoints will be saved
checkpoint_dir = './training_checkpoints'
# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

checkpoint_callback=tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True)



In [0]:
import time
EPOCHS=10
tic = time.time()
history = model.fit(dataset, epochs=EPOCHS, callbacks=[checkpoint_callback])
toc = time.time()
toc - tic

Epoch 1/10


W0612 23:22:03.661798 140499343632256 util.py:244] Unresolved object in checkpoint: (root).optimizer
W0612 23:22:03.663102 140499343632256 util.py:244] Unresolved object in checkpoint: (root).optimizer.iter
W0612 23:22:03.663949 140499343632256 util.py:244] Unresolved object in checkpoint: (root).optimizer.beta_1
W0612 23:22:03.666805 140499343632256 util.py:244] Unresolved object in checkpoint: (root).optimizer.beta_2
W0612 23:22:03.669498 140499343632256 util.py:244] Unresolved object in checkpoint: (root).optimizer.decay
W0612 23:22:03.671381 140499343632256 util.py:244] Unresolved object in checkpoint: (root).optimizer.learning_rate
W0612 23:22:03.673059 140499343632256 util.py:244] Unresolved object in checkpoint: (root).optimizer's state 'm' for (root).layer_with_weights-0.embeddings
W0612 23:22:03.674807 140499343632256 util.py:244] Unresolved object in checkpoint: (root).optimizer's state 'm' for (root).layer_with_weights-2.kernel
W0612 23:22:03.676509 140499343632256 util.py:2

Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


473.1028571128845

In [0]:
tf.train.latest_checkpoint(checkpoint_dir)
model = build_model(vocab_size, embedding_dim, rnn_units, batch_size=1)

model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))

model.build(tf.TensorShape([1, None]))
model.summary()

Model: "sequential_5"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_5 (Embedding)      (1, None, 256)            16640     
_________________________________________________________________
lstm_3 (LSTM)                (1, None, 1024)           5246976   
_________________________________________________________________
dense_5 (Dense)              (1, None, 65)             66625     
Total params: 5,330,241
Trainable params: 5,330,241
Non-trainable params: 0
_________________________________________________________________


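Preprocessing for the GRU. This is the same pipeline as for the LSTM; the differences are that the model uses a GRU layer instead of an LSTM layer and that training below runs on `dataset.repeat()` with an explicit `steps_per_epoch`.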

In [0]:
# Creating a mapping from unique characters to indices
char2idx = {u:i for i, u in enumerate(vocab)}
idx2char = np.array(vocab)

text_as_int = np.array([char2idx[c] for c in text])

print('{')
for char,_ in zip(char2idx, range(20)):
    print('  {:4s}: {:3d},'.format(repr(char), char2idx[char]))
print('  ...\n}')

# Show how the first 13 characters from the text are mapped to integers
print ('{} ---- characters mapped to int ---- > {}'.format(repr(text[:13]), text_as_int[:13]))

# The maximum length sentence we want for a single input in characters
seq_length = 100
examples_per_epoch = len(text)//seq_length

# Create training examples / targets
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)

for i in char_dataset.take(5):
  print(idx2char[i.numpy()])
  
sequences = char_dataset.batch(seq_length+1, drop_remainder=True)

for item in sequences.take(5):
  print(repr(''.join(idx2char[item.numpy()])))
  
def split_input_target(chunk):
    input_text = chunk[:-1]
    target_text = chunk[1:]
    return input_text, target_text

dataset = sequences.map(split_input_target)

for input_example, target_example in  dataset.take(1):
  print ('Input data: ', repr(''.join(idx2char[input_example.numpy()])))
  print ('Target data:', repr(''.join(idx2char[target_example.numpy()])))
  
for i, (input_idx, target_idx) in enumerate(zip(input_example[:5], target_example[:5])):
    print("Step {:4d}".format(i))
    print("  input: {} ({:s})".format(input_idx, repr(idx2char[input_idx])))
    print("  expected output: {} ({:s})".format(target_idx, repr(idx2char[target_idx])))
    
# Batch size
BATCH_SIZE = 64
steps_per_epoch = examples_per_epoch//BATCH_SIZE

# Buffer size to shuffle the dataset
# (TF data is designed to work with possibly infinite sequences,
# so it doesn't attempt to shuffle the entire sequence in memory. Instead,
# it maintains a buffer in which it shuffles elements).
BUFFER_SIZE = 10000

dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)

def gru_build_model(vocab_size, embedding_dim, rnn_units, batch_size):
  model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim,
                              batch_input_shape=[batch_size, None]),
    tf.keras.layers.GRU(rnn_units,
                        return_sequences=True,
                        stateful=True,
                        recurrent_initializer='glorot_uniform'),
    tf.keras.layers.Dense(vocab_size)
  ])
  return model

gru_model = gru_build_model(
  vocab_size = len(vocab),
  embedding_dim=embedding_dim,
  rnn_units=rnn_units,
  batch_size=BATCH_SIZE)

gru_model.compile(optimizer = 'adam', loss = loss)

# Directory where the checkpoints will be saved
gru_checkpoint_dir = './gru_training_checkpoints'
# Name of the checkpoint files
gru_checkpoint_prefix = os.path.join(gru_checkpoint_dir, "gru_ckpt_{epoch}")

gru_checkpoint_callback=tf.keras.callbacks.ModelCheckpoint(
    filepath=gru_checkpoint_prefix,
    save_weights_only=True)

dataset

{
  '\n':   0,
  ' ' :   1,
  '!' :   2,
  '$' :   3,
  '&' :   4,
  "'" :   5,
  ',' :   6,
  '-' :   7,
  '.' :   8,
  '3' :   9,
  ':' :  10,
  ';' :  11,
  '?' :  12,
  'A' :  13,
  'B' :  14,
  'C' :  15,
  'D' :  16,
  'E' :  17,
  'F' :  18,
  'G' :  19,
  ...
}
'First Citizen' ---- characters mapped to int ---- > [18 47 56 57 58  1 15 47 58 47 64 43 52]
F
i
r
s
t
'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou '
'are all resolved rather to die than to famish?\n\nAll:\nResolved. resolved.\n\nFirst Citizen:\nFirst, you k'
"now Caius Marcius is chief enemy to the people.\n\nAll:\nWe know't, we know't.\n\nFirst Citizen:\nLet us ki"
"ll him, and we'll have corn at our own price.\nIs't a verdict?\n\nAll:\nNo more talking on't; let it be d"
'one: away, away!\n\nSecond Citizen:\nOne word, good citizens.\n\nFirst Citizen:\nWe are accounted poor citi'
Input data:  'First Citizen:\nBefore we proceed any further, hear me speak.\n

<BatchDataset shapes: ((64, 100), (64, 100)), types: (tf.int64, tf.int64)>

In [0]:
tic = time.time()
GRU_EPOCHS=10
gru_history = gru_model.fit(dataset.repeat(), epochs=GRU_EPOCHS, steps_per_epoch=steps_per_epoch, callbacks=[gru_checkpoint_callback])
toc = time.time()
toc - tic

Epoch 1/10


W0612 23:29:58.364739 140499343632256 util.py:244] Unresolved object in checkpoint: (root).optimizer
W0612 23:29:58.365939 140499343632256 util.py:244] Unresolved object in checkpoint: (root).optimizer.iter
W0612 23:29:58.368852 140499343632256 util.py:244] Unresolved object in checkpoint: (root).optimizer.beta_1
W0612 23:29:58.371750 140499343632256 util.py:244] Unresolved object in checkpoint: (root).optimizer.beta_2
W0612 23:29:58.374984 140499343632256 util.py:244] Unresolved object in checkpoint: (root).optimizer.decay
W0612 23:29:58.376468 140499343632256 util.py:244] Unresolved object in checkpoint: (root).optimizer.learning_rate
W0612 23:29:58.377783 140499343632256 util.py:244] Unresolved object in checkpoint: (root).optimizer's state 'm' for (root).layer_with_weights-0.embeddings
W0612 23:29:58.379308 140499343632256 util.py:244] Unresolved object in checkpoint: (root).optimizer's state 'm' for (root).layer_with_weights-2.kernel
W0612 23:29:58.380789 140499343632256 util.py:2

Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


460.4146156311035

In [0]:
def generate_text(model, start_string):
  # Evaluation step (generating text using the learned model)

  # Number of characters to generate
  num_generate = 1000

  # Converting our start string to numbers (vectorizing)
  input_eval = [char2idx[s] for s in start_string]
  input_eval = tf.expand_dims(input_eval, 0)

  # Empty string to store our results
  text_generated = []

  # Low temperatures result in more predictable text.
  # Higher temperatures result in more surprising text.
  # Experiment to find the best setting.
  temperature = 1.0

  # Here batch size == 1
  model.reset_states()
  for i in range(num_generate):
      predictions = model(input_eval)
      # remove the batch dimension
      predictions = tf.squeeze(predictions, 0)

      # using a categorical distribution to predict the character returned by the model
      predictions = predictions / temperature
      predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()

      # We pass the predicted character as the next input to the model
      # along with the previous hidden state
      input_eval = tf.expand_dims([predicted_id], 0)

      text_generated.append(idx2char[predicted_id])

  return (start_string + ''.join(text_generated))
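
The effect of `temperature` in the function above is easier to see on a tiny example. This sketch uses plain NumPy instead of `tf.random.categorical`; the helper name `softmax_with_temperature` and the three example logits are assumptions for illustration.

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    # dividing the logits by the temperature before the softmax is what
    # `predictions / temperature` does in generate_text
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()            # subtract the max for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

logits = [2.0, 1.0, 0.1]
cold = softmax_with_temperature(logits, 0.5)      # sharper distribution
neutral = softmax_with_temperature(logits, 1.0)   # the model's own distribution
hot = softmax_with_temperature(logits, 2.0)       # flatter distribution

for name, p in [("0.5", cold), ("1.0", neutral), ("2.0", hot)]:
    print("temperature", name, np.round(p, 3))
```

A low temperature concentrates probability on the most likely character, so sampling becomes more predictable. A high temperature spreads probability out, so less likely characters are sampled more often.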

In [0]:
print(generate_text(model, start_string=u"ROMEO: "))

ROMEO: I'll lay on.

POMPEY:
See, they do you go?

MONTIUS:
Tente maid, and not so.

LADY ANNE:
I would tef the galler. Come home set down by this nightlace of you,
While I infect it, I have rear so, but I had rather be
murder well again: cannot the reward in the bont's partian.

KING RICHARD III:
Upon the sicke, or on; but such a worthy day!
But do I bear, and learn.

QUEEN ELIZABETH:
But that have restrated by him and mine ede?
Who shall this trumpet's her. I'll have you will.

COMINIUS:
That, get thee joy. But she was no parson at many in him,
He was it were they are deceived: they are aleck
Of this of this sight:
Now, charge, God be thought, my son, thou shalt not boye?
Weich be with death, and puase.
Come, signifa come, we fool it Ifrethe parting of the curs'd with such and two maricards
With trial curse, it be well and you not howaft him sout
Shake's gentle fall;
Then thou wilt would pass him.

She's LIZAS:
Look'd, morworsul, Gurronanes again together,
Was the addived fellow; the

In [0]:
tf.train.latest_checkpoint(gru_checkpoint_dir)

gru_model = gru_build_model(vocab_size, embedding_dim, rnn_units, batch_size=1)

gru_model.load_weights(tf.train.latest_checkpoint(gru_checkpoint_dir))

gru_model.build(tf.TensorShape([1, None]))

gru_model.summary()

Model: "sequential_7"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_7 (Embedding)      (1, None, 256)            16640     
_________________________________________________________________
gru_3 (GRU)                  (1, None, 1024)           3938304   
_________________________________________________________________
dense_7 (Dense)              (1, None, 65)             66625     
Total params: 4,021,569
Trainable params: 4,021,569
Non-trainable params: 0
_________________________________________________________________


In [0]:
print(generate_text(gru_model, start_string=u"ROMEO: "))

ROMEO: How was it ere I live, himself, for in the man shall repost
As I am angry conquered from his death;
It was his Witonly,
Will stand going
And take the fatal fools:
Thistor, fie!
follow I fare an your rather makes thanks.

QUEEN ELIZABETH:
His grace my son with arry'd mother,' hath been suits stone and cross.

DUKE VINCENTIO:
Somes the matter for done so.

BRUTUS:
And I, but I'll think our titles; and I close it.
I will find him to aith I,
And with their sufferane shall use upon my son,
Thou call'd these three does y,
An earth mage sundue that and Edward, ig a swearing powerful?

OXFORD:
Well, methinks of milen:
High, but shall you find accompany;
I thank toward tignd his unlook'd my flown
Gods thou shalt another faults. I cannot with this prince.

LUCENTIO:
Sirrah, owe than with thy misdern; and the dege's nurse,
An as monest years that brought our aut of Gaunt: but therein some that
we better purrish on me: guilty out,
That comes his sonand and his cold man would the woes will I