# RNN Text Generator Notebook
### _This is a developing story, check back for updates._

The goal is to generate text in the style of a Stoic writer. The *Meditations* of Marcus Aurelius are the training input: the model learns how he wrote, character by character, and then tries to predict what a next chapter might look like.

## Setup
### Import TensorFlow and other libraries

In [1]:
# Code created by: Carlos Utrilla Guerrero
# Code source: https://www.tensorflow.org/tutorials/text/text_generation

from __future__ import absolute_import, division, print_function, unicode_literals
import tensorflow as tf
import numpy as np
import os
import time

### Download dataset

In [2]:
path_to_file = tf.keras.utils.get_file('meditations.mb.txt', 'http://classics.mit.edu/Antoninus/meditations.mb.txt')

### Read the data

In [3]:
# read, then decode for py2 compatibility
text = open(path_to_file, 'rb').read().decode(encoding = 'utf-8')
# length of text is the number of characters in it
print('Length of text: {} characters'.format(len(text)))

Length of text: 244067 characters


In [4]:
# see first 200 characters
print(text[:200])

Provided by The Internet Classics Archive.
See bottom for copyright. Available online at
    http://classics.mit.edu//Antoninus/meditations.html

The Meditations
By Marcus Aurelius


Translated by Geo


In [5]:
# Check the unique characters in the file
vocab = sorted(set(text))
print('unique characters {}'.format(len(vocab)))
print(vocab)

unique characters 74
['\n', ' ', '!', '"', "'", '(', ')', ',', '-', '.', '/', '0', '1', '2', '4', '9', ':', ';', '?', '@', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '[', ']', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']


### Vectorize the text
Before training, map the strings to a numerical representation. Create two lookup tables: one mapping characters to numbers, and another mapping numbers back to characters.

In [6]:
char2idx = {u: i for i, u in enumerate(vocab)}
idx2char = np.array(vocab)
print(char2idx)
print(idx2char)

{'\n': 0, ' ': 1, '!': 2, '"': 3, "'": 4, '(': 5, ')': 6, ',': 7, '-': 8, '.': 9, '/': 10, '0': 11, '1': 12, '2': 13, '4': 14, '9': 15, ':': 16, ';': 17, '?': 18, '@': 19, 'A': 20, 'B': 21, 'C': 22, 'D': 23, 'E': 24, 'F': 25, 'G': 26, 'H': 27, 'I': 28, 'J': 29, 'K': 30, 'L': 31, 'M': 32, 'N': 33, 'O': 34, 'P': 35, 'Q': 36, 'R': 37, 'S': 38, 'T': 39, 'U': 40, 'V': 41, 'W': 42, 'X': 43, 'Y': 44, 'Z': 45, '[': 46, ']': 47, 'a': 48, 'b': 49, 'c': 50, 'd': 51, 'e': 52, 'f': 53, 'g': 54, 'h': 55, 'i': 56, 'j': 57, 'k': 58, 'l': 59, 'm': 60, 'n': 61, 'o': 62, 'p': 63, 'q': 64, 'r': 65, 's': 66, 't': 67, 'u': 68, 'v': 69, 'w': 70, 'x': 71, 'y': 72, 'z': 73}
['\n' ' ' '!' '"' "'" '(' ')' ',' '-' '.' '/' '0' '1' '2' '4' '9' ':' ';'
 '?' '@' 'A' 'B' 'C' 'D' 'E' 'F' 'G' 'H' 'I' 'J' 'K' 'L' 'M' 'N' 'O' 'P'
 'Q' 'R' 'S' 'T' 'U' 'V' 'W' 'X' 'Y' 'Z' '[' ']' 'a' 'b' 'c' 'd' 'e' 'f'
 'g' 'h' 'i' 'j' 'k' 'l' 'm' 'n' 'o' 'p' 'q' 'r' 's' 't' 'u' 'v' 'w' 'x'
 'y' 'z']


In [7]:
text_as_int=np.array([char2idx[c] for c in text])
text_as_int

array([35, 65, 62, ..., 26,  9,  0])

#### Each character is now mapped to an integer index from 0 to len(vocab) - 1

In [8]:
print('{')
for char,_ in zip(char2idx, range(23)):
    print(' {:4s}: {:3d},'.format(repr(char), char2idx[char]))
print('  ...\n}')

{
 '\n':   0,
 ' ' :   1,
 '!' :   2,
 '"' :   3,
 "'" :   4,
 '(' :   5,
 ')' :   6,
 ',' :   7,
 '-' :   8,
 '.' :   9,
 '/' :  10,
 '0' :  11,
 '1' :  12,
 '2' :  13,
 '4' :  14,
 '9' :  15,
 ':' :  16,
 ';' :  17,
 '?' :  18,
 '@' :  19,
 'A' :  20,
 'B' :  21,
 'C' :  22,
  ...
}


In [9]:
# Show how the first 13 characters of the text are mapped to integers
print('{} ---- char mapped to int ---> {}'.format(repr(text[:13]), text_as_int[:13]))

'Provided by T' ---- char mapped to int ---> [35 65 62 69 56 51 52 51  1 49 72  1 39]
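As a sanity check, the two lookup tables should be exact inverses of each other. The sketch below rebuilds them on a toy string (the names ```sample_text```, ```sample_char2idx``` and ```sample_idx2char``` are made up for this illustration), so the indices differ from the ones printed above, which come from the full 74-character vocabulary.

```python
# Toy text standing in for the Meditations file (an assumption for illustration).
sample_text = "Provided by T"

# Same construction as in the notebook: sorted unique characters form the vocabulary.
sample_vocab = sorted(set(sample_text))
sample_char2idx = {ch: i for i, ch in enumerate(sample_vocab)}
sample_idx2char = list(sample_vocab)

# Encode characters to integers, then decode the integers back to characters.
encoded = [sample_char2idx[ch] for ch in sample_text]
decoded = ''.join(sample_idx2char[i] for i in encoded)

print(encoded)
print(repr(decoded))
```

If the round trip does not return the original string, the two tables are out of sync.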


### Prediction task
The task is: given a character, or better a sequence of characters, what is the most probable next character?

+ Inputs: a sequence of characters

+ Training: the model learns to predict the output

+ Output: the following character at each time step

+ Model: recurrent, so each prediction depends on the previously seen elements, that is, on all characters processed up to that time step

### Create training examples and targets
Next, divide the text into example sequences. Each input sequence will contain ```seq_length``` characters from the text.

For each input sequence, the corresponding target contains the same length of text, shifted one character to the right.

So break the text into chunks of ```seq_length+1```. For instance, say ```seq_length``` is 4 and our text is "Hello". The input sequence would be "Hell" and the target "ello".

To do so, first use the ```tf.data.Dataset.from_tensor_slices``` function to convert the vectorized text into a stream of character indices.
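The "Hello" example can be checked in plain Python before moving to ```tf.data```. This is only a sketch on a toy string; ```toy_text```, ```toy_seq_length``` and ```pairs``` are illustrative names, and list slicing stands in for the dataset operations used in the notebook.

```python
toy_text = "Hello world, hello again"
toy_seq_length = 4
chunk_size = toy_seq_length + 1

# Non-overlapping chunks of seq_length + 1 characters; the incomplete tail
# is dropped, mirroring drop_remainder=True.
chunks = [toy_text[i:i + chunk_size]
          for i in range(0, len(toy_text) - chunk_size + 1, chunk_size)]

# Input is the chunk without its last character, target without its first.
pairs = [(chunk[:-1], chunk[1:]) for chunk in chunks]

for input_text, target_text in pairs:
    print(repr(input_text), '->', repr(target_text))
```

The first pair is ```('Hell', 'ello')```, matching the example in the text.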


In [10]:
# Maximum length, in characters, of a single input sequence
seq_length = 100
examples_per_epoch = len(text)//(seq_length+1)

# Create training examples / targets
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)

for i in char_dataset.take(5):
    print(idx2char[i.numpy()])

P
r
o
v
i


The ```batch``` method lets us convert these individual characters to sequences of the desired size.

In [11]:
sequences = char_dataset.batch(seq_length+1,drop_remainder = True)

for item in sequences.take(5):
    print(repr(''.join(idx2char[item.numpy()])))

'Provided by The Internet Classics Archive.\nSee bottom for copyright. Available online at\n    http://c'
'lassics.mit.edu//Antoninus/meditations.html\n\nThe Meditations\nBy Marcus Aurelius\n\n\nTranslated by Georg'
'e Long\n\n----------------------------------------------------------------------\n\nBOOK ONE\n\nFrom my gra'
'ndfather Verus I learned good morals and the government\nof my temper. \n\nFrom the reputation and remem'
'brance of my father, modesty and a manly\ncharacter. \n\nFrom my mother, piety and beneficence, and abst'


For each sequence, duplicate and shift it to form the input and target text, using the ```map``` method to apply a simple function to each batch of ```seq_length+1``` characters.

In [12]:
def split_input_target(chunk):
    # Input is every character but the last; target is every character but the first
    input_text = chunk[:-1]
    target_text = chunk[1:]
    return input_text, target_text

dataset = sequences.map(split_input_target)

Print an example of input and target values:

In [13]:
for input_example, target_example in dataset.take(1):
    print('Input data:', repr(''.join(idx2char[input_example.numpy()])))
    print('Target data:', repr(''.join(idx2char[target_example.numpy()])))

Input data: 'Provided by The Internet Classics Archive.\nSee bottom for copyright. Available online at\n    http://'
Target data: 'rovided by The Internet Classics Archive.\nSee bottom for copyright. Available online at\n    http://c'


Each index of these vectors is processed as one time step. For the input at time step 0, the model receives the index for 'P' and tries to predict the index for 'r' as the next character. At the next time step it does the same thing, but the **RNN** considers the context of the previous step in addition to the current input character.

In [14]:
for i, (input_idx, target_idx) in enumerate(zip(input_example[:5], target_example[:5])):
    print('Step {:4d}'.format(i))
    print('  input: {}  ({:s})'.format(input_idx, repr(idx2char[input_idx])))
    print('  expected output: {}  ({:s})'.format(target_idx, repr(idx2char[target_idx])))

Step    0
  input: 35  ('P')
  expected output: 65  ('r')
Step    1
  input: 65  ('r')
  expected output: 62  ('o')
Step    2
  input: 62  ('o')
  expected output: 69  ('v')
Step    3
  input: 69  ('v')
  expected output: 56  ('i')
Step    4
  input: 56  ('i')
  expected output: 51  ('d')


### Create training batches
So far ```tf.data``` was used to split the text into manageable sequences. Before feeding this data into the model, we need to shuffle it and pack it into batches.

In [15]:
# Batch size
BATCH_SIZE = 64
# Buffer size used to shuffle the dataset. tf.data keeps a buffer of this many
# elements and draws from it at random, so it bounds memory use, not processing time
BUFFER_SIZE = 10000

dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
dataset


<BatchDataset shapes: ((64, 100), (64, 100)), types: (tf.int32, tf.int32)>

### Build The Model
Use ```tf.keras.Sequential``` to define the model. In this case, three layers are used:
* ```tf.keras.layers.Embedding```: The input layer. A trainable lookup table that maps the number of each character to a vector with ```embedding_dim``` dimensions;
* ```tf.keras.layers.GRU```: A type of RNN with size ```units=rnn_units```;
* ```tf.keras.layers.Dense```: The output layer, with ```vocab_size``` outputs.

In [16]:
# Length of the vocabulary in chars
vocab_size = len(vocab)
# Embedding dimension
embedding_dim = 256
# number of RNN units
rnn_units=1024

In [17]:
def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, embedding_dim,
                                  batch_input_shape=[batch_size, None]),
        tf.keras.layers.GRU(rnn_units,
                            return_sequences=True,
                            stateful=True,
                            recurrent_initializer='glorot_uniform'),
        tf.keras.layers.Dense(vocab_size)
    ])
    return model

In [18]:
model = build_model(
    vocab_size=len(vocab),
    embedding_dim=embedding_dim,
    rnn_units=rnn_units,
    batch_size=BATCH_SIZE)

### Try the model
It is time to run the model and check if it behaves as expected. First check the shape of the output.

In [19]:
for input_example_batch, target_example_batch in dataset.take(1):
  example_batch_predictions = model(input_example_batch)
  print(example_batch_predictions.shape, "# (batch_size, sequence_length, vocab_size)")

(64, 100, 74) # (batch_size, sequence_length, vocab_size)


Check the model summary:

In [20]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (64, None, 256)           18944     
_________________________________________________________________
gru (GRU)                    (64, None, 1024)          3938304   
_________________________________________________________________
dense (Dense)                (64, None, 74)            75850     
Total params: 4,033,098
Trainable params: 4,033,098
Non-trainable params: 0
_________________________________________________________________
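The parameter counts in the summary can be reproduced by hand. The sketch below assumes the GRU keeps two bias vectors per weight set (the ```reset_after=True``` behaviour, which is the Keras default in TensorFlow 2.x as far as I know); under that assumption the arithmetic matches the printed numbers exactly.

```python
vocab_size, embedding_dim, rnn_units = 74, 256, 1024

# Embedding: one embedding_dim-sized vector per character in the vocabulary.
embedding_params = vocab_size * embedding_dim

# GRU: three weight sets (update gate, reset gate, candidate state), each with
# an input kernel, a recurrent kernel and, by assumption, two bias vectors.
gru_params = 3 * (rnn_units * (embedding_dim + rnn_units) + 2 * rnn_units)

# Dense: one weight per (GRU unit, output character) pair plus one bias per output.
dense_params = rnn_units * vocab_size + vocab_size

total_params = embedding_params + gru_params + dense_params
print(embedding_params, gru_params, dense_params, total_params)
```

Almost all of the parameters sit in the GRU layer, so ```rnn_units``` is the setting that most affects model size.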


To get actual predictions from the model, we need to sample from the output distribution to obtain character indices. This distribution is defined by the logits over the character vocabulary. Try it for the first example in the batch:

In [21]:
sampled_indices = tf.random.categorical(example_batch_predictions[0], num_samples=1)
sampled_indices = tf.squeeze(sampled_indices, axis = -1).numpy()

This gives us, at each time step, a prediction of the next character index:

In [22]:
sampled_indices

array([71, 66,  6,  1, 71, 17,  8, 45, 30, 13,  4,  1, 51, 11, 43, 68,  8,
       39, 49, 58, 30, 39, 63, 34, 30, 61, 12,  5, 14, 13, 14, 34, 12, 14,
       68, 56, 26, 48,  0, 18, 71, 26, 25, 60, 66, 13, 14, 32,  0, 67, 67,
       45, 60, 49, 21, 18, 15, 39, 57, 63, 66, 43,  3, 19, 43, 56, 61, 20,
       42,  3, 39, 12,  1,  1, 29, 38, 33,  7, 44, 67, 68, 65, 52, 70, 56,
       56,  4, 53, 23, 43, 13, 22, 35, 60, 67, 51, 31, 36, 24, 15],
      dtype=int64)
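What "sampling from the logits" means can be sketched without TensorFlow. The code below is an illustration only: ```softmax``` and ```sample_index``` are hand-written helpers and ```toy_logits``` is a made-up four-character vocabulary, standing in for what ```tf.random.categorical``` does on one row of logits.

```python
import math
import random

def softmax(logits):
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_index(logits, rng):
    # Draw one index with probability proportional to softmax(logits).
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

rng = random.Random(0)
toy_logits = [2.0, 0.5, 0.1, -1.0]

samples = [sample_index(toy_logits, rng) for _ in range(1000)]
counts = [samples.count(i) for i in range(len(toy_logits))]

print([round(p, 3) for p in softmax(toy_logits)])
print(counts)
```

The index with the largest logit is drawn most often, but the others still appear, which keeps generated text from always following the single most likely path.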

Now decode these indices to see the text predicted by this untrained model:

In [23]:
print("Input: \n", repr(''.join(idx2char[input_example_batch[0]])))
print()
print("Next Char Predictions: \n",repr(''.join(idx2char[sampled_indices ])))

Input: 
 ' art calling out on the Rostra, hast thou forgotten, man,\nwhat these things are?- Yes; but they are '

Next Char Predictions: 
 'xs) x;-ZK2\' d0Xu-TbkKTpOKn1(424O14uiGa\n?xGFms24M\nttZmbB?9TjpsX"@XinAW"T1  JSN,Yturewii\'fDX2CPmtdLQE9'


### Train the Model
At this point the problem can be treated as a standard classification problem: given the previous RNN state and the input at this time step, predict the class of the next character.

#### Attach an optimizer, and a loss function
The standard ```tf.keras.losses.sparse_categorical_crossentropy``` loss function works here. Because the model returns logits, the ```from_logits``` flag must be set to ```True```.

In [24]:
def loss(labels, logits):
    return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)
example_batch_loss = loss(target_example_batch, example_batch_predictions)
print('Prediction shape:', example_batch_predictions.shape, '# (batch_size, sequence_length, vocab_size)')
print('Scalar_loss:      ', example_batch_loss.numpy().mean())

Prediction shape: (64, 100, 74) # (batch_size, sequence_length, vocab_size)
Scalar_loss:       4.3045197
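The initial loss of about 4.30 is what one would expect from an untrained model: if every one of the 74 characters is predicted with roughly equal probability, the cross-entropy is close to ln(74). The sketch below checks this with a hand-written version of the loss for a single time step (```sparse_crossentropy_from_logits``` is an illustrative helper, not the Keras function).

```python
import math

vocab_size = 74

def sparse_crossentropy_from_logits(label, logits):
    # loss = log(sum(exp(logits))) - logits[label], computed stably.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[label]

# Equal logits mean a uniform distribution over the vocabulary.
uniform_logits = [0.0] * vocab_size
loss_uniform = sparse_crossentropy_from_logits(0, uniform_logits)

print(loss_uniform, math.log(vocab_size))
```

A starting loss far above ln(vocab_size) would suggest a problem with initialization or with how labels and logits are paired.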


In [25]:
model.compile(optimizer = 'adam', loss=loss)

#### Configure Checkpoints
Use a ```tf.keras.callbacks.ModelCheckpoint``` to ensure that checkpoints are saved during training.

In [26]:
# Directory where the checkpoints will be saved
checkpoint_dir = './training_checkpoints'
# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

checkpoint_callback=tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True)

#### Execute the training with EPOCHS = 10

To keep training time reasonable, use 10 epochs to train the model.

In [27]:
EPOCHS = 10

In [28]:
history = model.fit(dataset, epochs=EPOCHS, callbacks=[checkpoint_callback])

Train for 37 steps
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
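The "37 steps" per epoch reported above follows directly from the numbers used earlier in the notebook. A short worked calculation (the lowercase variable names are local to this sketch):

```python
num_chars = 244067    # length of the text printed earlier
seq_length = 100
batch_size = 64
buffer_size = 10000

# Each example consumes seq_length + 1 characters; the incomplete tail is dropped.
num_sequences = num_chars // (seq_length + 1)

# Batching with drop_remainder=True discards the last partial batch.
steps_per_epoch = num_sequences // batch_size

print(num_sequences)
print(steps_per_epoch)
# The shuffle buffer is larger than the dataset, so the whole dataset fits in it.
print(buffer_size >= num_sequences)
```

With only 2416 sequences, an epoch is short, which is why 10 epochs remain affordable.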


#### Generate text
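For simplicity, use a batch size of 1 for generation. Because the model was built with a fixed batch size (the GRU layer is stateful), running it with a different ```batch_size``` requires rebuilding the model and restoring the weights from the checkpoint.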

In [29]:
tf.train.latest_checkpoint(checkpoint_dir)

'./training_checkpoints\\ckpt_10'

In [30]:
model = build_model(vocab_size, embedding_dim, rnn_units, batch_size=1)

model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))

model.build(tf.TensorShape([1, None]))

In [31]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (1, None, 256)            18944     
_________________________________________________________________
gru_1 (GRU)                  (1, None, 1024)           3938304   
_________________________________________________________________
dense_1 (Dense)              (1, None, 74)             75850     
Total params: 4,033,098
Trainable params: 4,033,098
Non-trainable params: 0
_________________________________________________________________


#### The prediction loop
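The function below generates text as follows:
* Convert the start string to indices, reset the RNN state, and set the number of characters to generate;
* Get the prediction distribution of the next character from the current input and the RNN state;
* Sample the index of the predicted character from a categorical distribution and feed it back as the next input;
* Because the model is stateful, each call continues from the state left by the previous one, so only the newly predicted character has to be passed in.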

In [32]:
def generate_text(model, start_string):
    # Evaluation step: generating text using learned model
    
    # number of characters to generate
    num_generate = 2000
    
    # Converting our start string to numbers (vectorizing)
    input_eval = [char2idx[s] for s in start_string]
    input_eval = tf.expand_dims(input_eval, 0)
    
    # Empty string to store our results
    text_generated = []
    
    # Low temperatures result in more predictable text.
    # Higher temperatures result in more surprising text.
    # Experiment to find the best setting.
    temperature = 1.0
    
    
    # Here batch size equal to 1
    model.reset_states()
    for i in range(num_generate):
        predictions = model(input_eval)
        # remove the batch dimension
        predictions = tf.squeeze(predictions,0)
        
        # scale the logits by the temperature, then sample the next
        # character index from the resulting categorical distribution
        predictions = predictions / temperature
        predicted_id = tf.random.categorical(predictions, num_samples=1)[-1, 0].numpy()
        
        # pass predicted value as next input to the model
        input_eval = tf.expand_dims([predicted_id], 0)
        
        text_generated.append(idx2char[predicted_id])
    return (start_string + ''.join(text_generated))
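The effect of ```temperature``` is easiest to see on a toy example. The sketch below uses made-up logits and a hand-written ```softmax_with_temperature``` helper (neither is part of the notebook) to show how dividing logits by the temperature changes the distribution that is sampled from.

```python
import math

def softmax_with_temperature(logits, temperature):
    # Divide the logits by the temperature, then apply a stable softmax.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.0]

for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

At a temperature below 1 the most likely character takes a larger share of the probability, so output is more predictable; above 1 the distribution flattens and less likely characters are sampled more often.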

In [33]:
print(generate_text(model, start_string =u"MARCUS AURELIUS:"))

MARCUS AURELIUS: dodsing of whatever than it is in thow prward purpicion consider thyself suther which thou not suppes therseffer goods would and chelle toghing food
no lave one suisen everything, shilled or sticl diplive and such dory which destrouber and happens weore?
And of the dosteshis dispraticus and musiner,
whan it wised mand, thun. For hy wast fen in
Graming with past now porect, thou stan's not however has geen amout percuplecepore, time out for they deaphing, than
this things,
aptears
that is thou is make breat
mapely and distur's ippostent of quire to time any own atalt, but appect doe

hould main from tilf andyily,
to list does not end,
toung way less pains pheserver that mat is a soling the hasm govity. 

Lot juth the wholate (his part awnow perpor abstamenter,
butity. 

Fro most: and consting and purpoce,
but elly whice netrinuver, then forme of a buture ussalls kind, of the expucted bain, which ase meen such many ress what hap others, is a
commen misture
the waurth som

#### Additional features:
https://www.tensorflow.org/tutorials/text/text_generation