## Objective: Build a character-level language model using a Recurrent Neural Network

The data source is moby_dick.txt, which was derived from mody_dick-ORIG.txt, a file downloaded from Project Gutenberg.

Reference: the TensorFlow text generation tutorial
https://www.tensorflow.org/tutorials/text/text_generation

In [1]:
# Read the full text into memory (the context manager closes the file afterwards)
with open('moby_dick.txt', 'r', encoding="utf8") as f:
    text = f.read()

# Print first 250 characters
print(text[:250])

﻿CHAPTER 1. Loomings.

Call me Ishmael. Some years ago—never mind how long precisely—having
little or no money in my purse, and nothing particular to interest me
on shore, I thought I would sail about a little and see the watery part
of the world. It


In [2]:
# Find all unique characters in the text
vocab = sorted(set(text))

print(vocab)

['\n', ' ', '!', '$', '&', '(', ')', '*', ',', '-', '.', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '[', ']', '_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '£', 'â', 'æ', 'è', 'é', 'œ', '—', '‘', '’', '“', '”', '\ufeff']


In [3]:
print('{} total characters'.format(len(text)))
print('{} unique characters'.format(len(vocab)))

1191787 total characters
91 unique characters


### Vectorize the text

Create a mapping from each unique character to an integer index, plus the reverse lookup from index to character.

In [4]:
import numpy as np

char_to_idx = { ch:i for i, ch in enumerate(vocab) }
idx_to_char = np.array(vocab)

In [5]:
# Map all characters in text to ints
text_as_int = np.array([ char_to_idx[c] for c in text ])

print(text_as_int[:250])

[90 26 31 24 39 43 28 41  1 12 10  1 35 67 67 65 61 66 59 71 10  0  0 26
 53 64 64  1 65 57  1 32 71 60 65 53 57 64 10  1 42 67 65 57  1 77 57 53
 70 71  1 53 59 67 85 66 57 74 57 70  1 65 61 66 56  1 60 67 75  1 64 67
 66 59  1 68 70 57 55 61 71 57 64 77 85 60 53 74 61 66 59  0 64 61 72 72
 64 57  1 67 70  1 66 67  1 65 67 66 57 77  1 61 66  1 65 77  1 68 73 70
 71 57  8  1 53 66 56  1 66 67 72 60 61 66 59  1 68 53 70 72 61 55 73 64
 53 70  1 72 67  1 61 66 72 57 70 57 71 72  1 65 57  0 67 66  1 71 60 67
 70 57  8  1 32  1 72 60 67 73 59 60 72  1 32  1 75 67 73 64 56  1 71 53
 61 64  1 53 54 67 73 72  1 53  1 64 61 72 72 64 57  1 53 66 56  1 71 57
 57  1 72 60 57  1 75 53 72 57 70 77  1 68 53 70 72  0 67 58  1 72 60 57
  1 75 67 70 64 56 10  1 32 72]
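
As a quick sanity check on the mapping, the two lookup structures should be inverses of each other. The sketch below uses a short made-up string (`sample`, an assumption standing in for the novel) so it runs without the data file; `encode` and `decode` are hypothetical helper names that do not appear in the notebook.

```python
import numpy as np

# Hypothetical stand-in for the full text, so the sketch runs without moby_dick.txt
sample = "Call me Ishmael."

vocab = sorted(set(sample))
char_to_idx = {ch: i for i, ch in enumerate(vocab)}
idx_to_char = np.array(vocab)

def encode(s):
    """Map a string to an array of integer ids."""
    return np.array([char_to_idx[c] for c in s])

def decode(ids):
    """Map an array of integer ids back to a string."""
    return "".join(idx_to_char[ids])

ids = encode(sample)
roundtrip = decode(ids)

print(ids)
print(roundtrip)
```

If the round trip does not reproduce the original string, the vocabulary and the index array have drifted out of sync.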


### Text Prediction

In [6]:
import tensorflow as tf

# Maximum length, in characters, of a single input sequence
seq_length = 100
examples_per_epoch = len(text) // (seq_length + 1)

# Create training examples
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)

for i in char_dataset.take(5):
    print(idx_to_char[i.numpy()])

﻿
C
H
A
P


In [7]:
sequences = char_dataset.batch(seq_length+1, drop_remainder=True)

for item in sequences.take(5):
    print(repr(''.join(idx_to_char[item.numpy()])))

'\ufeffCHAPTER 1. Loomings.\n\nCall me Ishmael. Some years ago—never mind how long precisely—having\nlittle or'
' no money in my purse, and nothing particular to interest me\non shore, I thought I would sail about a'
' little and see the watery part\nof the world. It is a way I have of driving off the spleen and\nregula'
'ting the circulation. Whenever I find myself growing grim about\nthe mouth; whenever it is a damp, dri'
'zzly November in my soul; whenever\nI find myself involuntarily pausing before coffin warehouses, and\n'


In [8]:
# Split each chunk into an input (all but the last char) and a target (all but the first char)
def split_input_target(chunk):
    input_text = chunk[:-1]
    target_text = chunk[1:]
    return input_text, target_text

dataset = sequences.map(split_input_target)

In [9]:
for input_example, target_example in dataset.take(1):
    print('Input: ', repr(''.join(idx_to_char[input_example.numpy()])))
    print('Target: ', repr(''.join(idx_to_char[target_example.numpy()])))
    print()

Input:  '\ufeffCHAPTER 1. Loomings.\n\nCall me Ishmael. Some years ago—never mind how long precisely—having\nlittle o'
Target:  'CHAPTER 1. Loomings.\n\nCall me Ishmael. Some years ago—never mind how long precisely—having\nlittle or'
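
The shift-by-one relationship between input and target is easier to see on a tiny example. This is a minimal sketch using a NumPy array of characters instead of a `tf.data` chunk; the word "Hello" is made up for illustration.

```python
import numpy as np

def split_input_target(chunk):
    # Input drops the last element; target drops the first
    return chunk[:-1], chunk[1:]

chunk = np.array(list("Hello"))
input_text, target_text = split_input_target(chunk)

# At each position the target is the character that follows the input character
for inp, tgt in zip(input_text, target_text):
    print(repr(str(inp)), "->", repr(str(tgt)))
```

A chunk of `seq_length + 1` characters therefore yields an input and a target of `seq_length` characters each.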



In [10]:
for i, (input_idx, target_idx) in enumerate(zip(input_example[:5], target_example[:5])):
    print("Step {:4d}:".format(i))
    print("\tinput: {} ({:s})".format(input_idx, repr(idx_to_char[input_idx])))
    print("\texpected output: {} ({:s})".format(target_idx, repr(idx_to_char[target_idx])))

Step    0:
	input: 90 ('\ufeff')
	expected output: 26 ('C')
Step    1:
	input: 26 ('C')
	expected output: 31 ('H')
Step    2:
	input: 31 ('H')
	expected output: 24 ('A')
Step    3:
	input: 24 ('A')
	expected output: 39 ('P')
Step    4:
	input: 39 ('P')
	expected output: 43 ('T')


In [11]:
# Create training batches
BATCH_SIZE = 64
BUFFER_SIZE = 10000

dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
dataset

<BatchDataset shapes: ((64, 100), (64, 100)), types: (tf.int32, tf.int32)>
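
The number of training steps per epoch follows from the figures already printed above. The arithmetic below is a sketch that hard-codes the character count reported by cell [3] rather than reading the file.

```python
total_chars = 1191787   # reported by cell [3]
seq_length = 100
batch_size = 64

# Each chunk holds seq_length + 1 characters; the incomplete tail is dropped
num_sequences = total_chars // (seq_length + 1)

# Batching also drops the incomplete remainder
steps_per_epoch = num_sequences // batch_size

print(num_sequences)     # 11799
print(steps_per_epoch)   # 184
```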

### Build the Model

In [12]:
EMBEDDING_DIM = 256       # embedding dimension

RNN_UNITS = 1024          # Number of RNN units

In [13]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, GRU, Dense

def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
    model = Sequential([
        Embedding(vocab_size, embedding_dim, batch_input_shape=[batch_size, None]),
        GRU(rnn_units, return_sequences=True, stateful=True, recurrent_initializer='glorot_uniform'),
        # An LSTM layer could be used here in place of the GRU
        Dense(vocab_size)
    ])
    return model

In [14]:
# Build it
model = build_model(
    vocab_size = len(vocab),
    embedding_dim = EMBEDDING_DIM,
    rnn_units = RNN_UNITS,
    batch_size = BATCH_SIZE)

### Try the model

In [15]:
for input_example_batch, target_example_batch in dataset.take(1):
    example_batch_predictions = model(input_example_batch)
    print(example_batch_predictions.shape, "\t(batch_size, sequence_length, vocab_size)")

(64, 100, 91) 	(batch_size, sequence_length, vocab_size)


In [16]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        (64, None, 256)           23296     
_________________________________________________________________
gru (GRU)                    (64, None, 1024)          3938304   
_________________________________________________________________
dense (Dense)                (64, None, 91)            93275     
=================================================================
Total params: 4,054,875
Trainable params: 4,054,875
Non-trainable params: 0
_________________________________________________________________
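
The parameter counts in the summary can be reproduced by hand. The GRU formula below assumes the TensorFlow 2 default `reset_after=True`, which gives each of the three gates two bias vectors. The notebook does not state that setting, but the formula matches the reported count.

```python
vocab_size = 91
embedding_dim = 256
rnn_units = 1024

# Embedding: one embedding_dim-sized vector per vocabulary entry
embedding_params = vocab_size * embedding_dim

# GRU: 3 gates, each with an input kernel, a recurrent kernel and
# (assuming reset_after=True) two bias vectors
gru_params = 3 * (rnn_units * (embedding_dim + rnn_units) + 2 * rnn_units)

# Dense: weight matrix plus one bias per output unit
dense_params = rnn_units * vocab_size + vocab_size

total_params = embedding_params + gru_params + dense_params

print(embedding_params)  # 23296
print(gru_params)        # 3938304
print(dense_params)      # 93275
print(total_params)      # 4054875
```

Nearly all of the weights sit in the GRU layer, so changing `RNN_UNITS` has the largest effect on model size.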


In [17]:
# Sample one character id per timestep for the first example in the batch
sampled_indices = tf.random.categorical(example_batch_predictions[0], num_samples=1)
sampled_indices = tf.squeeze(sampled_indices, axis=-1).numpy()
sampled_indices

array([24, 75, 66, 76, 44,  5, 37, 11, 82, 27, 86, 13, 46, 30, 17, 46, 43,
       14, 87, 10, 61, 82, 64, 62, 66, 16, 54, 29, 50, 89, 63, 29, 13, 86,
        2, 59, 67, 77, 73, 53, 22,  2, 78, 89, 43, 21,  8, 70, 78, 38, 45,
       68, 77, 83,  5, 42, 70, 32, 63, 10, 25, 14,  6,  5, 22, 73, 13, 31,
       19,  2, 84,  4, 45, 30, 55, 56, 34, 81, 10, 34, 38, 76, 24, 60, 53,
       31, 61, 86, 87,  2, 11, 86, 46, 34, 60, 31, 43, 23, 12,  3],
      dtype=int64)

In [18]:
print("Input: \n", repr("".join(idx_to_char[input_example_batch[0].numpy()])))
print()
print("Next Char Predictions: \n", repr("".join(idx_to_char[sampled_indices])))

Input: 
 '. You would almost think a\ngreat gun had been discharged; and if you noticed the light wreath of\nvap'

Next Char Predictions: 
 'AwnxU(N0èD‘2WG6WT3’.ièljn5bF[”kF2‘!goyua;!z”T:,rzOVpyé(SrIk.B3)(;u2H8!œ&VGcdKæ.KOxAhaHi‘’!0‘WKhHT?1$'


### Train the Model

In [19]:
# Set up loss function with logits
from tensorflow.keras.losses import sparse_categorical_crossentropy

def loss(labels, logits):
    return sparse_categorical_crossentropy(labels, logits, from_logits=True)

example_batch_loss = loss(target_example_batch, example_batch_predictions)
print("Prediction shape: ", example_batch_predictions.shape, " # (batch_size, sequence_length, vocab_size)")
print("scalar_loss:      ", example_batch_loss.numpy().mean())

Prediction shape:  (64, 100, 91)  # (batch_size, sequence_length, vocab_size)
scalar_loss:       4.510721
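
An initial loss of about 4.51 is what one would expect from an untrained model: if every one of the 91 characters is equally likely, the cross-entropy is ln(91). The NumPy function below is a hypothetical re-implementation written for illustration; it is not the Keras code, and the random labels are made up.

```python
import numpy as np

def sparse_crossentropy_from_logits(labels, logits):
    """NumPy sketch: per-position cross-entropy from unnormalized logits."""
    # Numerically stable log-softmax over the last axis
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Pick the log-probability of the correct class at each position
    picked = np.take_along_axis(log_probs, labels[..., None], axis=-1)
    return -picked.squeeze(-1)

vocab_size = 91
rng = np.random.default_rng(0)
labels = rng.integers(0, vocab_size, size=(4, 10))   # made-up targets

# Equal logits correspond to a uniform distribution over characters
uniform_logits = np.zeros((4, 10, vocab_size))
uniform_loss = sparse_crossentropy_from_logits(labels, uniform_logits).mean()

print(uniform_loss)         # approximately 4.5109
print(np.log(vocab_size))   # ln(91), the same value
```

A starting loss far above ln(vocab_size) would suggest that the initial logits are poorly scaled.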


In [20]:
#Configure training procedure
model.compile(optimizer='adam', loss=loss)

In [22]:
from tensorflow.keras.callbacks import ModelCheckpoint
import os

# Configure checkpoints
checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

checkpoint_callback = ModelCheckpoint(filepath = checkpoint_prefix, save_weights_only=True)

In [23]:
EPOCHS = 10

history = model.fit(dataset, epochs=EPOCHS, callbacks = [checkpoint_callback])

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [25]:
tf.train.latest_checkpoint(checkpoint_dir)

'./training_checkpoints\\ckpt_10'

In [26]:
model = build_model(len(vocab), EMBEDDING_DIM, RNN_UNITS, batch_size=1)

model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))

model.build(tf.TensorShape([1, None]))

In [29]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (1, None, 256)            23296     
_________________________________________________________________
gru_1 (GRU)                  (1, None, 1024)           3938304   
_________________________________________________________________
dense_1 (Dense)              (1, None, 91)             93275     
=================================================================
Total params: 4,054,875
Trainable params: 4,054,875
Non-trainable params: 0
_________________________________________________________________


### Generate text using the trained model

In [32]:
# Generate text
def generate_text(model, start_string, num_generate = 1000):
    
    # Convert start string into numbers
    input_eval = [ char_to_idx[s] for s in start_string ]
    input_eval = tf.expand_dims(input_eval, 0)
    
    text_generated = []
    temperature = 1.0    # lower values give more predictable text; higher values give more surprising text
    
    model.reset_states()
    for i in range(num_generate):
        predictions = model(input_eval)
        
        # remove batch dimension
        predictions = tf.squeeze(predictions, 0)
        
        # use a categorical distribution to predict the character returned by the model
        predictions = predictions / temperature
        predicted_id = tf.random.categorical(predictions, num_samples=1)[-1, 0].numpy()
        
        # pass the predicted character as the next input for the model along with the prev hidden state
        input_eval = tf.expand_dims([predicted_id], 0)
        
        text_generated.append(idx_to_char[predicted_id])
        
    return (start_string + ''.join(text_generated))
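
The effect of `temperature` can be seen by applying it to a small set of logits. This is a minimal sketch: `softmax_with_temperature` is a hypothetical helper and the three logit values are made up. In `generate_text` the same division happens before `tf.random.categorical` draws a sample.

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after dividing by the temperature."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()          # subtract the max for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

logits = [2.0, 1.0, 0.0]   # made-up values for three candidate characters

for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, np.round(probs, 3))
```

Lower temperatures concentrate probability on the top-scoring character, while higher temperatures flatten the distribution so that unlikely characters are sampled more often.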

In [36]:
print(generate_text(model, start_string=u"Great", num_generate = 10000))

Greatht Captain
Leviathan shallowed to man’s to spouting on a horizontal instant away.

The rush whole an enchanter’s bodies for whalemen of those forft. And yet he fell, Bildad, after
and solicient latituse: and mays, D’yir thing—”

“Certainmisly blows! Queequeg!
Wide Starb While the mawers of ark with a lexced into
whose whalement is ofteneath the food of the air, but to give me to his spout is far terms, circumstance, that
sect of the captain thanqued and will, and wides archilete; as the same
instant circling hither, like a flit out of experiment you must go, for up his
blood. Sirribly impitably to hunt that time in its uncentrable! Do ye, spotts, had regenerated sinking spoutings towards the
stifles without the ship; and, how he sideways to flew through me, making
is still chanced thy
simil us. Till all four years or they were the last repose of latitude, that to it!

THo is fe they caught by the time of old Rook of Jonah, their who saved a
very harpooneer you all more venticues! 