# RNN Text Generator Notebook
### _This is a developing story, check back for updates._

### The goal is to write like a Stoic. Marcus Aurelius's *Meditations* is the training input: the model learns how he wrote and then predicts what might come next.

## Setup
### Import TensorFlow and other libraries

In [1]:
# Code created by: Carlos Utrilla Guerrero
# Code source: https://www.tensorflow.org/tutorials/text/text_generation

from __future__ import absolute_import, division, print_function, unicode_literals
import tensorflow as tf
import numpy as np
import os
import time

### Download dataset

In [3]:
path_to_file = tf.keras.utils.get_file('meditations.mb.txt', 'http://classics.mit.edu/Antoninus/meditations.mb.txt')

### Read the data

In [4]:
# read, then decode for py2 compatibility
text = open(path_to_file, 'rb').read().decode(encoding = 'utf-8')
# length of text is the number of characters in it
print('Length of text: {} characters'.format(len(text)))

Length of text: 244067 characters


In [5]:
# see first 200 characters
print(text[:200])

Provided by The Internet Classics Archive.
See bottom for copyright. Available online at
    http://classics.mit.edu//Antoninus/meditations.html

The Meditations
By Marcus Aurelius


Translated by Geo


In [None]:
# Check the unique characters in the file
vocab = sorted(set(text))
print('unique characters {}'.format(len(vocab)))
print(vocab)

### Vectorize the text
Map the strings to a numerical representation. Create two lookup tables: one mapping characters to numbers, and another mapping numbers back to characters.

In [None]:
char2idx = {u: i for i, u in enumerate(vocab)}
idx2char = np.array(vocab)
print(char2idx)
print(idx2char)

In [None]:
text_as_int=np.array([char2idx[c] for c in text])
text_as_int

#### Each character is now mapped to an integer index from 0 to len(vocab) - 1

In [None]:
print('{')
for char,_ in zip(char2idx, range(23)):
    print(' {:4s}: {:3d},'.format(repr(char), char2idx[char]))
print('  ...\n}')

In [None]:
# print how the first 13 characters of the text are mapped to integers
print('{} ---- char mapped to int ---> {}'.format(repr(text[:13]), text_as_int[:13]))
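
As a quick check of the two lookup tables, the round trip below encodes a short string and decodes it again. It is a minimal sketch on a toy string rather than the Meditations text; the `toy_` names are illustrative and not part of the notebook.

```python
# Toy round trip on a short string (not the Meditations text).
sample = "stoic"
toy_vocab = sorted(set(sample))                       # ['c', 'i', 'o', 's', 't']
toy_char2idx = {ch: i for i, ch in enumerate(toy_vocab)}
toy_idx2char = toy_vocab                              # list position acts as the reverse lookup

encoded = [toy_char2idx[ch] for ch in sample]
decoded = "".join(toy_idx2char[i] for i in encoded)

print(encoded)   # [3, 4, 2, 1, 0]
print(decoded)   # stoic
```

Decoding with `idx2char` in the notebook works the same way, only over the full vocabulary.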

### Prediction task
The prediction task is: given a character, or better a sequence of characters, what is the most probable next character?

+ Inputs: sequence of char

+ Train: model to predict the output

+ Output: following char at each time step

Model specification: the model is recurrent, so each prediction depends on the previously seen elements, that is, on all the characters processed up to the current time step.

### Create training examples and targets
Next, divide the text into example sequences. Each input sequence will contain seq_length characters from the text.

For each input sequence, the corresponding target contains the same length of text, shifted one character to the right.

So break the text into chunks of seq_length+1. For instance, say seq_length is 4 and our text is "Hello". The input sequence would be "Hell" and the target sequence "ello".
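
The "Hello" example can be sketched in plain Python. This is an illustration only: `make_examples` is a hypothetical helper, not part of the notebook, which performs this step with `tf.data` instead.

```python
def make_examples(text, seq_length):
    """Split text into (input, target) pairs of length seq_length."""
    chunk_size = seq_length + 1
    examples = []
    # Step by chunk_size and skip a trailing chunk that is too short,
    # mirroring batch(..., drop_remainder=True).
    for start in range(0, len(text) - chunk_size + 1, chunk_size):
        chunk = text[start:start + chunk_size]
        examples.append((chunk[:-1], chunk[1:]))
    return examples

print(make_examples("Hello", 4))         # [('Hell', 'ello')]
print(make_examples("Hello world", 4))   # [('Hell', 'ello'), (' wor', 'worl')]
```

In the second call the final "d" does not fill a whole chunk, so it is dropped.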

To do so, use the ```tf.data.Dataset.from_tensor_slices``` function to convert the vectorized text into a stream of character indices.


In [None]:
# The maximum length, in characters, of a single input sequence
seq_length = 100
examples_per_epoch = len(text)//(seq_length+1)

# Create training examples / targets
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)

for i in char_dataset.take(5):
    print(idx2char[i.numpy()])

The ```batch``` method lets us convert these individual characters into sequences of the desired size.

In [None]:
sequences = char_dataset.batch(seq_length+1,drop_remainder = True)

for item in sequences.take(5):
    print(repr(''.join(idx2char[item.numpy()])))

For each sequence, duplicate and shift it to form the input and target text, using the ```map``` method to apply a simple function to each batch.

In [None]:
def split_input_target(chunk):
    input_text = chunk[:-1]
    target_text = chunk[1:]
    return input_text, target_text

dataset = sequences.map(split_input_target)

Print the input and target values of the first example:

In [None]:
for input_example, target_example in dataset.take(1):
    print('Input data:', repr(''.join(idx2char[input_example.numpy()])))
    print('Target data:', repr(''.join(idx2char[target_example.numpy()])))

Each index of these vectors is processed as one time step. At time step 0, the model receives the index for 'P' and tries to predict the index for 'r' as the next character. At the next time step it does the same thing, but the **RNN** also considers the context from the previous step in addition to the current input character.

In [None]:
for i, (input_idx, target_idx) in enumerate(zip(input_example[:5], target_example[:5])):
    print('Step {:4d}'.format(i))
    print('  input: {}  ({:s})'.format(input_idx, repr(idx2char[input_idx])))
    print('  expected output: {}  ({:s})'.format(target_idx, repr(idx2char[target_idx])))

### Create training batches
Above, ```tf.data``` was used to split the text into manageable sequences. Before feeding this data into the model, we need to shuffle it and pack it into batches.

In [None]:
# Batch size
BATCH_SIZE = 64
# Buffer size used to shuffle the dataset: tf.data keeps a buffer of this many
# elements in memory and draws from it at random
BUFFER_SIZE = 10000

dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
dataset
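
A back-of-the-envelope check of what these settings mean for one epoch, assuming the 244067-character length printed earlier in the notebook:

```python
text_length = 244067          # character count printed earlier in the notebook
seq_length = 100
batch_size = 64
buffer_size = 10000

# Each example consumes seq_length + 1 characters; the incomplete remainder is dropped.
num_sequences = text_length // (seq_length + 1)
# Batching also drops the incomplete remainder.
batches_per_epoch = num_sequences // batch_size

print(num_sequences)                   # 2416
print(batches_per_epoch)               # 37
print(buffer_size >= num_sequences)    # True
```

Because the buffer is larger than the number of sequences, the shuffle here covers the whole dataset rather than a sliding window over it.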


### Build The Model
Use ```tf.keras.Sequential``` to define the model. Here, three layers are used:
* ```tf.keras.layers.Embedding```: The input layer. A trainable lookup table that maps the number of each character to a vector with ```embedding_dim``` dimensions;
* ```tf.keras.layers.GRU```: A type of RNN with size ```units=rnn_units```;
* ```tf.keras.layers.Dense```: The output layer with ```vocab_size``` outputs.
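
To get a feel for the size of these three layers, the sketch below counts their trainable parameters by hand. It assumes the GRU uses biases and the `reset_after=True` variant (separate input and recurrent bias vectors); `count_parameters` is a hypothetical helper, and `vocab_size=65` is a placeholder value, not the vocabulary size of this dataset.

```python
def count_parameters(vocab_size, embedding_dim, rnn_units):
    """Parameter counts for an Embedding -> GRU -> Dense stack.

    Assumes a GRU with biases and reset_after=True, which keeps
    separate input and recurrent bias vectors.
    """
    embedding = vocab_size * embedding_dim
    # Three gates (update, reset, candidate), each with an input kernel,
    # a recurrent kernel, and two bias vectors.
    gru = 3 * (embedding_dim * rnn_units + rnn_units * rnn_units + 2 * rnn_units)
    dense = rnn_units * vocab_size + vocab_size
    return {"embedding": embedding, "gru": gru, "dense": dense,
            "total": embedding + gru + dense}

# vocab_size=65 is a placeholder; use len(vocab) from the notebook instead.
print(count_parameters(vocab_size=65, embedding_dim=256, rnn_units=1024))
# {'embedding': 16640, 'gru': 3938304, 'dense': 66625, 'total': 4021569}
```

The GRU dominates the total, and its size does not depend on the vocabulary. The numbers can be compared against the output of `model.summary()`.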

In [None]:
# Length of the vocabulary in chars
vocab_size = len(vocab)
# Embedding dimension
embedding_dim = 256
# number of RNN units
rnn_units=1024

In [None]:
def build_model(vocab_size,embedding_dim, rnn_units, batch_size):
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size,embedding_dim,
                                 batch_input_shape=[batch_size,None]),
        tf.keras.layers.GRU(rnn_units,
                           return_sequences=True,
                           stateful=True,
                           recurrent_initializer = 'glorot_uniform'),
        tf.keras.layers.Dense(vocab_size)
    ])
    return model

In [None]:
model = build_model(
    vocab_size=len(vocab),
    embedding_dim=embedding_dim,
    rnn_units=rnn_units,
    batch_size=BATCH_SIZE)

### Try the model
It is time to run the model and check if it behaves as expected. First check the shape of the output.

In [None]:
for input_example_batch, target_example_batch in dataset.take(1):
  example_batch_predictions = model(input_example_batch)
  print(example_batch_predictions.shape, "# (batch_size, sequence_length, vocab_size)")

Check model summary

In [None]:
model.summary()

To get actual predictions from the model, we need to sample from the output distribution to obtain character indices. The distribution is defined by the logits over the character vocabulary. Try it for the first example in the batch:

In [None]:
sampled_indices = tf.random.categorical(example_batch_predictions[0], num_samples=1)
sampled_indices = tf.squeeze(sampled_indices, axis = -1).numpy()

That gives us, at each time step, a prediction of the next character index.

In [None]:
sampled_indices
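
What sampling from logits means can be sketched without TensorFlow: convert the logits to probabilities with a softmax, then draw an index at random with those probabilities. This is an illustration under assumptions, not the internals of `tf.random.categorical`; the helper names and the toy logits are made up.

```python
import math
import random

def softmax(logits):
    """Convert unnormalized scores into probabilities."""
    shift = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_index(logits, rng):
    """Draw one index with probability softmax(logits)[index]."""
    probs = softmax(logits)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

rng = random.Random(0)                        # fixed seed, only for reproducibility
toy_logits = [2.0, 1.0, 0.1]                  # made-up scores for a 3-character vocabulary
draws = [sample_index(toy_logits, rng) for _ in range(1000)]

# The counts should roughly follow softmax(toy_logits).
print([draws.count(i) for i in range(3)])
```

Sampling, rather than always taking the highest-scoring index, keeps less likely characters possible, which helps the generated text avoid repeating itself.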

Now decode these predictions to see the text produced by the untrained model:

In [None]:
print("Input: \n", repr(''.join(idx2char[input_example_batch[0]])))
print()
print("Next Char Predictions: \n",repr(''.join(idx2char[sampled_indices ])))

### Train the Model
At this point the problem can be treated as a standard classification problem: given the previous RNN state and the input at this time step, predict the class of the next character.

#### Attach an optimizer, and a loss function
The ```tf.keras.losses.sparse_categorical_crossentropy``` loss function works here. Because the model returns logits, the ```from_logits``` flag needs to be set.

In [None]:
def loss(labels, logits):
    return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)
example_batch_loss = loss(target_example_batch, example_batch_predictions)
print('Prediction shape:', example_batch_predictions.shape, '# (batch_size, sequence_length, vocab_size)')
print('Scalar_loss:      ', example_batch_loss.numpy().mean())
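
As a sanity check on the scalar loss printed above: an untrained model is close to a uniform guess, so its loss should be near the natural log of the vocabulary size. The sketch below computes the same kind of loss by hand for a single prediction. The function name is hypothetical and `vocab_size = 65` is a placeholder, not the vocabulary size of this dataset.

```python
import math

def sparse_crossentropy_from_logits(label, logits):
    """Negative log-probability of the true class under softmax(logits)."""
    shift = max(logits)                       # log-sum-exp with a shift for stability
    log_sum_exp = shift + math.log(sum(math.exp(x - shift) for x in logits))
    return log_sum_exp - logits[label]

vocab_size = 65                               # placeholder; use len(vocab) in the notebook
uniform_logits = [0.0] * vocab_size           # a model with no preference at all

print(sparse_crossentropy_from_logits(3, uniform_logits))
print(math.log(vocab_size))                   # same value: ln(vocab_size)
```

A loss far above ln(vocab_size) at initialization would suggest a problem such as a missing `from_logits=True`.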

In [None]:
model.compile(optimizer = 'adam', loss=loss)

#### Configure Checkpoints
Use ```tf.keras.callbacks.ModelCheckpoint``` to ensure that checkpoints are saved during training.

In [None]:
# Directory where the checkpoints will be saved
checkpoint_dir = './training_checkpoints'
# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

checkpoint_callback=tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True)
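
The `{epoch}` placeholder in the prefix is filled in with the epoch number when a checkpoint is written. A small sketch of the resulting names, using plain string formatting rather than the callback itself:

```python
import os

checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

# Show the name the placeholder expands to for a few epoch numbers.
for epoch in (1, 2, 10):
    print(checkpoint_prefix.format(epoch=epoch))
```

Each epoch therefore gets its own checkpoint name instead of overwriting the previous one.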

#### Execute the training with EPOCHS = 10

To keep training time reasonable, use 10 epochs to train the model.

In [None]:
EPOCHS = 10

In [None]:
history = model.fit(dataset, epochs=EPOCHS, callbacks=[checkpoint_callback])

#### Generate text
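For the sake of simplicity, use a batch size of 1. Because the model was built as a stateful RNN with a fixed batch size, it only accepts that batch size once built. To run the model with a different batch_size, we need to rebuild it and restore the weights from the checkpoint.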

In [None]:
tf.train.latest_checkpoint(checkpoint_dir)

In [None]:
model = build_model(vocab_size, embedding_dim, rnn_units, batch_size=1)

model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))

model.build(tf.TensorShape([1, None]))

In [None]:
model.summary()

#### The prediction loop
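The function below generates text with the trained model: it converts the start string to indices, resets the RNN state, and then repeatedly samples the next character from the model's output distribution, feeding each sampled character back in as the next input.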

In [None]:
def generate_text(model, start_string):
    # Evaluation step: generating text using learned model
    
    # number of characters to generate
    num_generate = 200
    
    # Converting our start string to numbers (vectorizing)
    input_eval = [char2idx[s] for s in start_string]
    input_eval = tf.expand_dims(input_eval, 0)
    
    # Empty string to store our results
    text_generated = []
    
    # Low temperatures result in more predictable text.
    # Higher temperatures result in more surprising text.
    # Experiment to find the best setting.
    temperature = 1.0
    
    
    # Here batch size equal to 1
    model.reset_states()
    for i in range(num_generate):
        predictions = model(input_eval)
        # remove the batch dimension
        predictions = tf.squeeze(predictions,0)
        
        # using categorical distribution to predict the character returned by the model
        predictions = predictions/temperature
        predicted_id= tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()
        
        # pass predicted value as next input to the model
        input_eval = tf.expand_dims([predicted_id], 0)
        
        text_generated.append(idx2char[predicted_id])
    return (start_string + ''.join(text_generated))

In [None]:
# Start from a string whose characters all appear in the Meditations text
print(generate_text(model, start_string=u"The "))
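
The effect of the `temperature` setting in `generate_text` can be seen on a toy example. Dividing the logits by the temperature before the softmax sharpens the distribution when the temperature is below 1 and flattens it when above 1. The helper name and the toy logits below are made up for illustration.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature."""
    scaled = [x / temperature for x in logits]
    shift = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.1]                  # made-up scores for a 3-character vocabulary
for temperature in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(toy_logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

At 0.5 most of the probability sits on the top-scoring character, while at 2.0 the three characters become much closer to equally likely.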

#### Additional features:
https://www.tensorflow.org/tutorials/text/text_generation