<div class="alert alert-warning">

<b>ATTENTION</b>:
<p>This entire exercise is directly inspired by a TensorFlow tutorial; follow the link below if you need more details.</p>
<a href="https://www.tensorflow.org/alpha/tutorials/text/text_generation">TensorFlow tutorial</a>

</div>

## 1. Import dependencies 
A TensorFlow background session is launched to configure the GPU settings, and eager execution is enabled:

<a href="https://www.tensorflow.org/guide/eager">Eager execution details</a>


In this first step we also define the global variables used throughout the notebook, so that each value is set in a single place:

- __*SEQUENCES_LENGTH*__: length (number of characters) of the chunks into which the entire text is divided during preprocessing.
- __*NUM_GENERATE*__: number of characters to generate.
- __*EPOCHS*__: number of epochs into which the training is divided.
- __*BATCH_SIZE*__: number of samples processed before each weight update.
- __*EMBEDDING_DIM*__: size of the vectors produced by the Embedding layer.
- __*RNN_DIM*__: number of LSTM units in the network.


In [1]:
from __future__ import absolute_import, division, print_function, unicode_literals
import tensorflow as tf
tf.enable_eager_execution()
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.Session(config=config)
tf.keras.backend.set_session(session)

import numpy as np
import pandas as pd
import json
import re
import sys
import os
import time

SEQUENCES_LENGTH= 30
NUM_GENERATE= 500
EPOCHS = 100
BATCH_SIZE = 32
EMBEDDING_DIM = 128
RNN_DIM = 1024 

## 2. Import Aesop fables data
The chosen dataset is a JSON file containing 147 Aesop fables, each split into sentences.
For making it available, I would like to thank this fun and interesting project on Aesop fables, which explores the connections between them using machine learning: <a href="https://github.com/itayniv/aesop-fables-stories">GitHub repository</a>

Here is an example of how it is structured:
```json
{
  "stories":[
    {
      "number": "01",
      "title": "THE WOLF AND THE KID",
      "story": [
        "There was once a little Kid whose growing horns made him think he was a grown-up Billy Goat and able to take care of himself.",
        "So one evening when the flock started home from the pasture and his mother called, the Kid paid no heed and kept right on nibbling the tender grass.",
        "A little later when he lifted his head, the flock was gone.",
        "He was all alone.",
        "The sun was sinking.",
        "Long shadows came creeping over the ground.",
        "A chilly little wind came creeping with them making scary noises in the grass.",
        "The Kid shivered as he thought of the terrible Wolf.",
        "Then he started wildly over the field, bleating for his mother.",
        "But not half-way, near a clump of trees, there was the Wolf!",
        "The Kid knew there was little hope for him.",
        "Please, Mr. Wolf, he said trembling, I know you are going to eat me.",
        "But first please pipe me a tune, for I want to dance and be merry as long as I can.",
        "The Wolf liked the idea of a little music before eating, so he struck up a merry tune and the Kid leaped and frisked gaily.",
        "Meanwhile, the flock was moving slowly homeward.",
        "In the still evening air the Wolf's piping carried far.",
        "The Shepherd Dogs pricked up their ears.",
        "They recognized the song the Wolf sings before a feast, and in a moment they were racing back to the pasture.",
        "The Wolf's song ended suddenly, and as he ran, with the Dogs at his heels, he called himself a fool for turning piper to please a Kid, when he should have stuck to his butcher's trade."
      ],
      "moral": "Do not let anything turn you from your purpose.",
      "characters": []
    }, ...
```
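As a minimal sketch of how this structure can be read, the snippet below parses a shortened, inline copy of the example (only two sentences are kept) and joins the sentences of each story into one string. The variable name `sample` is an assumption made for illustration; the notebook itself reads the full file from disk.

```python
import json

# Shortened inline copy of the structure shown above (illustrative only).
sample = '''
{
  "stories": [
    {
      "number": "01",
      "title": "THE WOLF AND THE KID",
      "story": ["He was all alone.", "The sun was sinking."],
      "moral": "Do not let anything turn you from your purpose.",
      "characters": []
    }
  ]
}
'''

data = json.loads(sample)

# One string per fable: the sentences of "story" joined with a space.
fables = [' '.join(story['story']) for story in data['stories']]
print(fables[0])
```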

In [11]:
def clean(text):
    '''
    Lowercase the text, expand common English contractions, pad sentence
    punctuation with spaces and strip quotes, brackets, symbols and digits.
    '''
    text = text.lower()
    # Longer contractions are expanded before their prefixes ("can't've"
    # before "can't"), otherwise the shorter rule would always match first.
    text = text.replace("ain't", "am not")
    text = text.replace("aren't", "are not")
    text = text.replace("can't've", "cannot have")
    text = text.replace("can't", "cannot")
    text = text.replace("'cause", "because")
    text = text.replace("could've", "could have")
    text = text.replace("couldn't've", "could not have")
    text = text.replace("couldn't", "could not")
    text = text.replace("should've", "should have")
    text = text.replace("shouldn't've", "should not have")
    text = text.replace("shouldn't", "should not")
    text = text.replace("would've", "would have")
    text = text.replace("wouldn't've", "would not have")
    text = text.replace("wouldn't", "would not")
    text = text.replace("didn't", "did not")
    text = text.replace("doesn't", "does not")
    text = text.replace("don't", "do not")
    text = text.replace("hadn't've", "had not have")
    text = text.replace("hadn't", "had not")
    text = text.replace("hasn't", "has not")
    text = text.replace("haven't", "have not")
    text = text.replace("he'd've", "he would have")
    text = text.replace("he'd", "he would")
    text = text.replace("'s", "")
    text = text.replace("'t", "")
    text = text.replace("'ve", "")
    text = text.replace(".", " . ")
    text = text.replace("!", " ! ")
    text = text.replace("?", " ? ")
    text = text.replace(";", " ; ")
    text = text.replace(":", " : ")
    text = text.replace(",", " , ")
    text = text.replace("´", "")
    text = text.replace("‘", "")
    text = text.replace("’", "")
    text = text.replace("“", "")
    text = text.replace("”", "")
    text = text.replace("\'", "")
    text = text.replace("\"", "")
    text = text.replace("-", "")
    text = text.replace("–", "")
    text = text.replace("—", "")
    text = text.replace("[", "")
    text = text.replace("]","")
    text = text.replace("{","")
    text = text.replace("}", "")
    text = text.replace("/", "")
    text = text.replace("|", "")
    text = text.replace("(", "")
    text = text.replace(")", "")
    text = text.replace("$", "")
    text = text.replace("+", "")
    text = text.replace("*", "")
    text = text.replace("%", "")
    text = text.replace("#", "")
    text = ''.join([i for i in text if not i.isdigit()])

    return text

try:
    
    fables = []
    fablesText = ''
    dirname = os.path.abspath('')
    filepath = os.path.join(dirname, 'input_data/aesopFables.json')

    with open(filepath) as json_file:  
        data = json.load(json_file)
        for p in data['stories']:
            fables.append(' '.join(p['story']))
            
    print('{} fables imported.'.format(len(fables)))
    
    cleanedFables = []
    for f in fables:
        cleaned = clean(f)
        cleanedFables.append(cleaned)
        fablesText += ' ' + cleaned + '\n'
    
    print('{} plots cleaned.'.format(len(cleanedFables)))
    
except IOError:
    
    sys.exit('Cannot find data!')


147 fables imported.
147 plots cleaned.
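The order of the contraction rules in `clean` matters: a short pattern such as "can't" is a prefix of "can't've", so whichever rule runs first wins. The sketch below shows one possible table-driven variant that applies the longest patterns first. The names `CONTRACTIONS` and `expand_contractions` are hypothetical and the table is deliberately tiny; it is not the notebook's implementation.

```python
# Hypothetical, shortened contraction table (illustrative only).
CONTRACTIONS = {
    "can't": "cannot",
    "can't've": "cannot have",
    "couldn't": "could not",
    "couldn't've": "could not have",
    "don't": "do not",
}

def expand_contractions(text, table=CONTRACTIONS):
    """Lowercase `text` and expand contractions, longest pattern first."""
    text = text.lower()
    # Sorting by length (descending) guarantees that "can't've" is handled
    # before its prefix "can't".
    for short in sorted(table, key=len, reverse=True):
        text = text.replace(short, table[short])
    return text

result = expand_contractions("He can't've seen it, and I don't mind.")
print(result)
```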


We check the maximum fable length to better decide the preprocessing hyperparameters.

In [12]:
maxLen = 0
for f in cleanedFables:
    l = len(f)
    if l > maxLen: maxLen = l

maxLen

2321

## 3. Extract Vocabulary
The vocabulary is saved as: 
- a __numpy array__ to map each encoding to the right character
- a __dictionary__ to map each character to its encoding number 

We also create a __textAsInt__ variable that contains the text of all the fables encoded as integers.

In [15]:
vocabulary = sorted(set(fablesText))
print(vocabulary)
vocab_size = len(vocabulary)
print ('{} unique characters\n'.format(len(vocabulary)))

char2idx = {u:i for i, u in enumerate(vocabulary)}
idx2char = np.array(vocabulary)
textAsInt = np.array([char2idx[c] for c in fablesText])
print ('{} ---- characters mapped to int ---- > {}'.format(repr(fablesText[:20]), textAsInt[:20]))

['\n', ' ', '!', ',', '.', ':', ';', '?', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
34 unique characters

' there was once a li' ---- characters mapped to int ---- > [ 1 27 15 12 25 12  1 30  8 26  1 22 21 10 12  1  8  1 19 16]
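The two mappings are inverses of each other, which can be checked with a small round trip. The sketch below uses plain Python lists instead of NumPy and builds the vocabulary from a 20-character fragment only, so the integer codes differ from the ones printed above (an assumption made to keep the example self-contained).

```python
text = " there was once a li"

# Vocabulary built from this fragment only (the notebook uses the full text).
vocabulary = sorted(set(text))
char2idx = {ch: i for i, ch in enumerate(vocabulary)}
idx2char = vocabulary  # position i holds the character encoded as i

encoded = [char2idx[ch] for ch in text]
decoded = ''.join(idx2char[i] for i in encoded)

print(encoded)
print(repr(decoded))
```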


## 4. Preprocess text

Given a character, or a sequence of characters, what is the most probable next character? <br/>
This is the task we are training the model to perform. The input to the model is a sequence of characters, and the model is trained to predict the following character at each time step.

We divide the text into sequences; each input sequence contains __SEQUENCES_LENGTH__ characters from the text. For each input sequence, the corresponding target has the same length but is shifted one character to the right.

For example, say SEQUENCES_LENGTH is 4 and our text is "Hello":
- Input: "Hell"
- Target: "ello"
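The "Hello" example can be reproduced in plain Python. The helper `make_chunks` below is a hypothetical stand-in for batching with `drop_remainder=True`: it cuts the text into non-overlapping chunks of `seq_length + 1` characters and discards the incomplete tail.

```python
def split_input_target(chunk):
    # Input drops the last character, target drops the first one.
    return chunk[:-1], chunk[1:]

def make_chunks(text, seq_length):
    """Non-overlapping chunks of seq_length + 1 characters (tail dropped)."""
    size = seq_length + 1
    return [text[i:i + size] for i in range(0, len(text) - size + 1, size)]

pairs = [split_input_target(c) for c in make_chunks("Hello world", 4)]
for input_text, target_text in pairs:
    print(repr(input_text), '->', repr(target_text))
```

Each chunk needs one extra character because the input and the target overlap on all but one position.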

To build these sequences, first use the tf.data.Dataset.from_tensor_slices function to convert the text vector into a stream of character indices.

In [19]:
def split_input_target(chunk):
    inputText = chunk[:-1]
    targetText = chunk[1:]
    return inputText, targetText

# Create training examples and targets
examplesPerEpoch = len(fablesText) // SEQUENCES_LENGTH
stepsPerEpoch = examplesPerEpoch // BATCH_SIZE
print('Examples per Epoch: {}'.format(examplesPerEpoch))
print('Steps per Epoch: {}'.format(stepsPerEpoch))

charDataset = tf.data.Dataset.from_tensor_slices(textAsInt)
for i in charDataset.take(10):
    print(idx2char[i.numpy()])
    
print('\n')

# The batch method lets us easily convert these individual characters to sequences of the desired size.
sequences = charDataset.batch(SEQUENCES_LENGTH+1, drop_remainder=True)
for item in sequences.take(5):
    print(repr(''.join(idx2char[item.numpy()])))

dataset = sequences.map(split_input_target)
for input_example, target_example in  dataset.take(1):
    print ('\nInput data: ', repr(''.join(idx2char[input_example.numpy()])))
    print ('Target data:', repr(''.join(idx2char[target_example.numpy()])))
    for i, (input_idx, target_idx) in enumerate(zip(input_example[:5], target_example[:5])):
        print("Step {:4d}".format(i))
        print("  input: {} ({:s})".format(input_idx, repr(idx2char[input_idx])))
        print("  expected output: {} ({:s})".format(target_idx, repr(idx2char[target_idx])))

dataset = dataset.shuffle(10000).batch(BATCH_SIZE, drop_remainder=True)
dataset

Examples per Epoch: 4309
Steps per Epoch: 134
 
t
h
e
r
e
 
w
a
s


' there was once a little kid wh'
'ose growing horns made him thin'
'k he was a grownup billy goat a'
'nd able to take care of himself'
' .  so one evening when the flo'

Input data:  ' there was once a little kid w'
Target data: 'there was once a little kid wh'
Step    0
  input: 1 (' ')
  expected output: 27 ('t')
Step    1
  input: 27 ('t')
  expected output: 15 ('h')
Step    2
  input: 15 ('h')
  expected output: 12 ('e')
Step    3
  input: 12 ('e')
  expected output: 25 ('r')
Step    4
  input: 25 ('r')
  expected output: 12 ('e')


<DatasetV1Adapter shapes: ((32, 30), (32, 30)), types: (tf.int64, tf.int64)>
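The step counts can be worked through by hand. In the sketch below, `text_length` is a hypothetical value chosen only because it reproduces the 4309 examples printed above; the real length of the text is not shown in the notebook. It also compares the notebook's estimate with the number of chunks actually produced when the text is cut into pieces of `SEQUENCES_LENGTH + 1` characters.

```python
SEQUENCES_LENGTH = 30
BATCH_SIZE = 32
text_length = 129270  # hypothetical, chosen to match the printed 4309

# Estimate used in the notebook.
examples_per_epoch = text_length // SEQUENCES_LENGTH
steps_per_epoch = examples_per_epoch // BATCH_SIZE

# Chunks actually produced by batching SEQUENCES_LENGTH + 1 characters.
sequences_in_dataset = text_length // (SEQUENCES_LENGTH + 1)
batches_per_pass = sequences_in_dataset // BATCH_SIZE

print(examples_per_epoch, steps_per_epoch)
print(sequences_in_dataset, batches_per_pass)
```

Under this assumption the estimate is slightly larger than one full pass over the data. If the dataset is repeated during training, an epoch defined this way simply runs a little past one pass.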

## 4. Build the model
The model is a simple neural network composed of:
- an Embedding layer
- a recurrent layer (Long Short-Term Memory, LSTM)
- a Dense layer with one output per vocabulary character

In [5]:
rnn = tf.keras.layers.CuDNNLSTM 

def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, embedding_dim,
                                  batch_input_shape=[batch_size, None]),
        rnn(rnn_units,
            return_sequences=True,
            recurrent_initializer='glorot_uniform',
            stateful=True),
        tf.keras.layers.Dense(vocab_size)
    ])
    return model

trainModel = build_model(
  vocab_size = vocab_size,
  embedding_dim=EMBEDDING_DIM,
  rnn_units=RNN_DIM,
  batch_size=BATCH_SIZE)

for input_example_batch, target_example_batch in dataset.take(1):
    example_batch_predictions = trainModel(input_example_batch)
    print(example_batch_predictions.shape, "# (batch_size, sequence_length, vocab_size)")

trainModel.summary()

(32, 30, 34) # (batch_size, sequence_length, vocab_size)
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (32, None, 128)           4352      
_________________________________________________________________
cu_dnnlstm (CuDNNLSTM)       (32, None, 1024)          4726784   
_________________________________________________________________
dense (Dense)                (32, None, 34)            34850     
Total params: 4,765,986
Trainable params: 4,765,986
Non-trainable params: 0
_________________________________________________________________
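The parameter counts in the summary can be verified by hand. The sketch below assumes the standard LSTM layout of four gates, each with an input kernel, a recurrent kernel and, for the CuDNN variant, two bias vectors; with that assumption the numbers match the summary above.

```python
vocab_size = 34
embedding_dim = 128
rnn_units = 1024

# Embedding: one vector of size embedding_dim per vocabulary entry.
embedding_params = vocab_size * embedding_dim

# LSTM: 4 gates, each with input weights, recurrent weights and
# (assumed for CuDNNLSTM) two bias vectors of size rnn_units.
lstm_params = 4 * (rnn_units * (embedding_dim + rnn_units) + 2 * rnn_units)

# Dense: weight matrix plus one bias per output character.
dense_params = rnn_units * vocab_size + vocab_size

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```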


## 5. Train the model
We train the model and save its weights to an .h5 file.

In [6]:
def loss(labels, logits):
    return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)

trainModel.compile(
      optimizer = tf.train.AdamOptimizer(),
      loss = loss)

trainModel.fit(dataset.repeat(), epochs=EPOCHS, steps_per_epoch=stepsPerEpoch)

dirname = os.path.abspath('')
weightsPath = os.path.join(dirname, 'models/rnn_char_fables_{}_{}_{}_{}_{}_.h5'.format(
    EPOCHS, 
    SEQUENCES_LENGTH, 
    BATCH_SIZE, 
    EMBEDDING_DIM,
    RNN_DIM)
)
trainModel.save_weights(weightsPath)

Epoch 1/100
Epoch 2/100
Epoch 3/100
...
Epoch 77/100
Epoch 78
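The loss minimised during training is the sparse categorical cross-entropy computed from logits. As a minimal sketch in plain Python (for a single time step, not the TensorFlow implementation), it is the negative log-probability that a softmax over the logits assigns to the true character index:

```python
import math

def sparse_categorical_crossentropy(label, logits):
    """Cross-entropy of one integer label against unnormalised logits."""
    # log-sum-exp with the maximum subtracted for numerical stability
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    # -log(softmax(logits)[label])
    return log_sum_exp - logits[label]

# A model that gives every one of the 34 characters the same score.
uniform_loss = sparse_categorical_crossentropy(0, [0.0] * 34)
# A model that is confident about the correct character (index 2).
confident_loss = sparse_categorical_crossentropy(2, [0.0, 0.0, 10.0])
print(uniform_loss, confident_loss)
```

With 34 characters, a model that predicts uniformly has a loss of ln(34), roughly 3.53, which is a useful reference point when reading the loss values during training.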

## 6. Generation model
The generation model has the same architecture as the one used in training, but with a batch size of 1 instead of __BATCH_SIZE__, so that the model can process one sample at a time.

In [7]:
rnn = tf.keras.layers.CuDNNLSTM

genModel = build_model(
  vocab_size = vocab_size,
  embedding_dim=EMBEDDING_DIM,
  rnn_units=RNN_DIM,
  batch_size=1)

dirname = os.path.abspath('')
weightsPath = os.path.join(dirname, 'models/rnn_char_fables_{}_{}_{}_{}_{}_.h5'.format(
    EPOCHS, 
    SEQUENCES_LENGTH, 
    BATCH_SIZE, 
    EMBEDDING_DIM,
    RNN_DIM)
)
genModel.load_weights(weightsPath)
genModel.build(tf.TensorShape([1, None]))
genModel.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (1, None, 128)            4352      
_________________________________________________________________
cu_dnnlstm_1 (CuDNNLSTM)     (1, None, 1024)           4726784   
_________________________________________________________________
dense_1 (Dense)              (1, None, 34)             34850     
Total params: 4,765,986
Trainable params: 4,765,986
Non-trainable params: 0
_________________________________________________________________
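The weights file name is built twice in this notebook, once when saving and once when loading, and the two must match exactly. One possible way to keep them in sync is a small helper; `weights_path` below is a hypothetical function, not part of the notebook, that reproduces the same naming pattern.

```python
import os

def weights_path(base_dir, epochs, seq_length, batch_size, embedding_dim, rnn_dim):
    """Build the weights file path from the training hyperparameters."""
    name = 'rnn_char_fables_{}_{}_{}_{}_{}_.h5'.format(
        epochs, seq_length, batch_size, embedding_dim, rnn_dim)
    return os.path.join(base_dir, 'models', name)

# Values of the global variables defined in step 1.
path = weights_path(os.path.abspath(''), 100, 30, 32, 128, 1024)
print(os.path.basename(path))
```

Note that the file name carries the training batch size, even though the generation model is built with a batch size of 1.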


## 7. Generate text
To generate text of a fixed length, the following generation loop is implemented:

- It chooses a start string, initializes the RNN state and sets the number of characters to generate.
- It gets the prediction distribution of the next character using the start string and the RNN state.
- It samples the index of the predicted character from a multinomial (categorical) distribution and then uses this predicted character as the next input to the model.
- The RNN state returned by the model is fed back into the model, so that it now has more context than a single character. After predicting the next character, the updated RNN state is fed back again, which is how the model accumulates context from the previously predicted characters.

<img src="images/generation_loop.png" alt="Generation Loop" width="500" height="400">
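The sampling step of the loop can be sketched without TensorFlow. The function below is an illustrative stand-in, not the notebook's code: it divides the logits by a temperature, turns them into unnormalised probabilities and draws one index with the standard library. The name `sample_with_temperature` and the `rng` argument are assumptions introduced for this example.

```python
import math
import random

def sample_with_temperature(logits, temperature=1.0, rng=random):
    """Draw one index from softmax(logits / temperature)."""
    scaled = [x / temperature for x in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(scaled)
    weights = [math.exp(x - m) for x in scaled]
    # random.choices normalises the weights internally.
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

rng = random.Random(0)
logits = [1.0, 5.0, 2.0]

# A very low temperature makes the draw almost deterministic.
cold = [sample_with_temperature(logits, 0.01, rng) for _ in range(100)]
# A high temperature spreads the draws over all indices.
hot = [sample_with_temperature(logits, 10.0, rng) for _ in range(100)]
print(set(cold), sorted(set(hot)))
```

Lower temperatures sharpen the distribution towards the highest logit, while higher temperatures flatten it, which is why the generated text becomes more surprising as the temperature grows.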

In [8]:
def generate_text(model, start_string, char_2_idx, idx_2_char):
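    '''
    Generate NUM_GENERATE characters with `model`, starting from `start_string`.
    Returns the cleaned start string followed by the generated text.
    '''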
    # Evaluation step (generating text using the learned weights)
    # Number of characters to generate
    numGenerate = NUM_GENERATE
    # Converting our start string to numbers (vectorizing)
    start_string = clean(start_string) 
    inputEval = [char_2_idx[s] for s in start_string]
    inputEval = tf.expand_dims(inputEval, 0)
    # Empty list to store the generated characters
    textGenerated = []
    # Low temperatures result in more predictable text.
    # Higher temperatures result in more surprising text.
    # Experiment to find the best setting.
    temperature = 1.0
    # Here batch size == 1
    model.reset_states()
    
    for i in range(numGenerate):
        predictions = model(inputEval)
        # remove the batch dimension
        predictions = tf.squeeze(predictions, 0)
        # sample the next character from a multinomial distribution over the model outputs
        predictions = predictions / temperature
        predicted_id = tf.multinomial(predictions, num_samples=1)[-1,0].numpy()
        # We pass the predicted character as the next input to the model;
        # the previous hidden state is kept inside the stateful layer
        inputEval = tf.expand_dims([predicted_id], 0)
        textGenerated.append(idx_2_char[predicted_id])

    return (start_string + ''.join(textGenerated))

generated = generate_text(
        model=genModel, 
        start_string="There was once a little Bear", 
        char_2_idx=char2idx, 
        idx_2_char=idx2char
    )

print(generated)
session.close()

Instructions for updating:
Use tf.random.categorical instead.
there was once a little bear him ,  and the king of beasts is like wax in the ground roon with them ,  and the snakes kindly appearance ,  he bears .  so he about to d .  just then the cat let go a good lesson learned . 
 a friving her good boasted now the cranes were going to straid and strck .  besides ,  the animals were carried one of them .  it bear whire the lion had little happen i cannot tell you how gecorners .  i see !  mother mole saw sure i can soon gnaw this stalks out of the ground with  creature .  a thirsty
