## 1. Import dependencies 
A TensorFlow background session is launched to configure the GPU settings, and eager execution is enabled:

<a href="https://www.tensorflow.org/guide/eager">Eager execution details</a>


In this first step we also define all the global variables, so that each hyperparameter is set in a single place:

- __*SEQUENCES_LENGTH*__: length (number of words) of the chunks into which the entire text is divided during preprocessing.
- __*NUM_GENERATE*__: number of words to generate.
- __*EPOCHS*__: number of epochs into which the training is divided.
- __*BATCH_SIZE*__: number of samples processed before each weight update.
- __*EMBEDDING_DIM*__: number of neurons in the Embedding layer.
- __*RNN_DIM*__: number of LSTM units in the network.


In [1]:
from __future__ import absolute_import, division, print_function, unicode_literals
import tensorflow as tf
tf.enable_eager_execution()
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.Session(config=config)
tf.keras.backend.set_session(session)

import numpy as np
import pandas as pd
import json
import re
import sys
import os
import time
import gensim
from gensim.models import Word2Vec

SEQUENCES_LENGTH = 10
NUM_GENERATE = 100
EPOCHS = 100
BATCH_SIZE = 8
EMBEDDING_DIM = 128
RNN_DIM = 1024 

## 2. Import Aesop fables data
The chosen dataset is a JSON file containing 147 Aesop fables, each divided into sentences.
For making the data available, I need to thank this fun and interesting project on Aesop's fables, which explores the connections between them using machine learning: <a href="https://github.com/itayniv/aesop-fables-stories">GitHub repository</a>

Here is an example of how it is structured:
```json
{
  "stories":[
    {
      "number": "01",
      "title": "THE WOLF AND THE KID",
      "story": [
        "There was once a little Kid whose growing horns made him think he was a grown-up Billy Goat and able to take care of himself.",
        "So one evening when the flock started home from the pasture and his mother called, the Kid paid no heed and kept right on nibbling the tender grass.",
        "A little later when he lifted his head, the flock was gone.",
        "He was all alone.",
        "The sun was sinking.",
        "Long shadows came creeping over the ground.",
        "A chilly little wind came creeping with them making scary noises in the grass.",
        "The Kid shivered as he thought of the terrible Wolf.",
        "Then he started wildly over the field, bleating for his mother.",
        "But not half-way, near a clump of trees, there was the Wolf!",
        "The Kid knew there was little hope for him.",
        "Please, Mr. Wolf, he said trembling, I know you are going to eat me.",
        "But first please pipe me a tune, for I want to dance and be merry as long as I can.",
        "The Wolf liked the idea of a little music before eating, so he struck up a merry tune and the Kid leaped and frisked gaily.",
        "Meanwhile, the flock was moving slowly homeward.",
        "In the still evening air the Wolf's piping carried far.",
        "The Shepherd Dogs pricked up their ears.",
        "They recognized the song the Wolf sings before a feast, and in a moment they were racing back to the pasture.",
        "The Wolf's song ended suddenly, and as he ran, with the Dogs at his heels, he called himself a fool for turning piper to please a Kid, when he should have stuck to his butcher's trade."
      ],
      "moral": "Do not let anything turn you from your purpose.",
      "characters": []
    }, ...
```

In [2]:
def clean(text):
    '''
    Lowercase the text, expand common contractions, put spaces around
    punctuation marks, and strip quotes, brackets, symbols and digits.
    '''
    text = text.lower()
    # Longer contractions are expanded first: otherwise e.g. "can't" would
    # match inside "can't've" and the longer rule would never fire.
    text = text.replace("ain't", "am not")
    text = text.replace("aren't", "are not")
    text = text.replace("can't've", "cannot have")
    text = text.replace("can't", "cannot")
    text = text.replace("'cause", "because")
    text = text.replace("could've", "could have")
    text = text.replace("couldn't've", "could not have")
    text = text.replace("couldn't", "could not")
    text = text.replace("should've", "should have")
    text = text.replace("shouldn't've", "should not have")
    text = text.replace("shouldn't", "should not")
    text = text.replace("would've", "would have")
    text = text.replace("wouldn't've", "would not have")
    text = text.replace("wouldn't", "would not")
    text = text.replace("didn't", "did not")
    text = text.replace("doesn't", "does not")
    text = text.replace("don't", "do not")
    text = text.replace("hadn't've", "had not have")
    text = text.replace("hadn't", "had not")
    text = text.replace("hasn't", "has not")
    text = text.replace("haven't", "have not")
    text = text.replace("he'd've", "he would have")
    text = text.replace("he'd", "he would")
    text = text.replace("'s", "")
    text = text.replace("'t", "")
    text = text.replace("'ve", "")
    text = text.replace(".", " . ")
    text = text.replace("!", " ! ")
    text = text.replace("?", " ? ")
    text = text.replace(";", " ; ")
    text = text.replace(":", " : ")
    text = text.replace(",", " , ")
    text = text.replace("´", "")
    text = text.replace("‘", "")
    text = text.replace("’", "")
    text = text.replace("“", "")
    text = text.replace("”", "")
    text = text.replace("\'", "")
    text = text.replace("\"", "")
    text = text.replace("-", "")
    text = text.replace("–", "")
    text = text.replace("—", "")
    text = text.replace("[", "")
    text = text.replace("]","")
    text = text.replace("{","")
    text = text.replace("}", "")
    text = text.replace("/", "")
    text = text.replace("|", "")
    text = text.replace("(", "")
    text = text.replace(")", "")
    text = text.replace("$", "")
    text = text.replace("+", "")
    text = text.replace("*", "")
    text = text.replace("%", "")
    text = text.replace("#", "")
    text = ''.join([i for i in text if not i.isdigit()])

    return text

try:
    
    fables = []
    fablesText = ''
    dirname = os.path.abspath('')
    filepath = os.path.join(dirname, 'input_data/aesopFables.json')

    with open(filepath) as json_file:  
        data = json.load(json_file)
        for p in data['stories']:
            fables.append(' '.join(p['story']))
            
    print('{} fables imported.'.format(len(fables)))
    
    cleanedFables = []
    for f in fables:
        cleaned = clean(f)
        cleanedFables.append(cleaned)
        fablesText += ' ' + cleaned + '\n'
    
    print('{} plots cleaned.'.format(len(cleanedFables)))
    
except IOError:
    
    sys.exit('Cannot find data!')


147 fables imported.
147 plots cleaned.


We need to look at the maximum fable length (in words) to better choose the preprocessing hyperparameters.

In [3]:
maxLen = 0
for f in cleanedFables:
    l = len(f.split(' '))
    if l > maxLen: maxLen = l

maxLen

549

## 3. Extract Vocabulary
The vocabulary is saved as:
- a __list__ (idx2word) to map each encoding to the right word
- a __dictionary__ (word2idx) to map each word to its encoding number

We also create a __textAsInt__ variable that contains the text of all the fables, encoded as integers.

In [4]:
# CREATE VOCABULARY OF WORDS
idx2word = []
word2idx = {'<PAD>' : 0, '<START>' : 1 , '<END>': 2}
wordSequence = []
for fable in cleanedFables:
    words = fable.split(' ')
    wordSequence.extend(words)
    for word in words:
        if word not in word2idx:
            word2idx[word] = len(word2idx)

# Invert the mapping: position i in idx2word holds the word encoded as i
# (dictionaries preserve insertion order in Python 3.7+)
idx2word = list(word2idx.keys())
textAsInt = np.array([word2idx[w] for w in wordSequence])
vocab_size = len(idx2word)
print('Vocabulary Size: {}'.format(vocab_size))


Vocabulary Size: 3062


## 4. Preprocess text

Given a word, or a sequence of words, what is the most probable next word? <br/>
This is the task we're training the model to perform: the input to the model is a sequence of words, and we train the model to predict the following word at each time step.

We're going to divide the text into sequences of words. Each input sequence contains __SEQUENCES_LENGTH__ words from the text, and the corresponding target contains the same number of words, shifted one word to the right.

For example, say SEQUENCES_LENGTH is 4 and our text is "Hello my name is Dario".
- Input: "Hello my name is"
- Target: "my name is Dario"
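
As a minimal sketch of the shift described above, the following plain-Python snippet cuts a list of words into chunks of `seq_len + 1` words and splits each chunk into an input and a target. It does not use TensorFlow, and the helper name `make_pairs` is hypothetical (it is not part of this notebook):

```python
def make_pairs(tokens, seq_len):
    """Split a token list into (input, target) pairs shifted by one word."""
    chunk_len = seq_len + 1
    pairs = []
    # Walk through the tokens in non-overlapping chunks, dropping a short tail
    for start in range(0, len(tokens) - chunk_len + 1, chunk_len):
        chunk = tokens[start:start + chunk_len]
        # Input is everything but the last word, target everything but the first
        pairs.append((chunk[:-1], chunk[1:]))
    return pairs


pairs = make_pairs("Hello my name is Dario".split(), seq_len=4)
for input_words, target_words in pairs:
    print("Input: ", input_words)
    print("Target:", target_words)
```

Each chunk needs one more word than the sequence length, because the last word of the chunk only ever appears as a target.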

To do this, we first use the tf.data.Dataset.from_tensor_slices function to convert the text vector into a stream of word indices.

In [5]:
def split_input_target(chunk):
    inputText = chunk[:-1]
    targetText = chunk[1:]
    return inputText, targetText

# Create training examples and targets
examplesPerEpoch = len(fablesText.split(' ')) // SEQUENCES_LENGTH
stepsPerEpoch = examplesPerEpoch // BATCH_SIZE
print('Examples per Epoch: {}'.format(examplesPerEpoch))
print('Steps per Epoch: {}'.format(stepsPerEpoch))

wordDataset = tf.data.Dataset.from_tensor_slices(textAsInt)
sequences = wordDataset.batch(SEQUENCES_LENGTH+1, drop_remainder=True)
dataset = sequences.map(split_input_target)
dataset = dataset.shuffle(10000).batch(BATCH_SIZE, drop_remainder=True)
dataset

Examples per Epoch: 3019
Steps per Epoch: 377


<DatasetV1Adapter shapes: ((8, 10), (8, 10)), types: (tf.int64, tf.int64)>

## 5. Extract embeddings matrix
Now that we're working with words and not with characters, we can load pre-trained embeddings.
Using them is good practice; in this case we computed them with Google's Word2Vec model on the well-known text8 dataset.
- *More details in the __train_embeddings.ipynb__ notebook* (to be executed if the .bin file does not exist)

Each embedding is simply the 128 weights (or whatever dimensionality was chosen during training) that connect a single neuron in the input layer to the 128 neurons in the hidden layer, trained to capture which words appear in the same contexts in a given text.

So we simply extract these weights for every word in our vocabulary and build a matrix with them.
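
To make the shape of the result concrete, here is a small sketch on toy data. The three-word vocabulary, the 4-dimensional vectors and all `toy_*` names are assumptions made for illustration; they are not taken from the text8 model:

```python
import numpy as np

# Hypothetical toy data: a 3-word vocabulary and 4-dimensional "pre-trained" vectors
toy_idx2word = ["<PAD>", "wolf", "kid"]
toy_pretrained = {
    "wolf": np.array([0.1, 0.2, 0.3, 0.4]),
    "kid": np.array([0.5, 0.6, 0.7, 0.8]),
}
toy_dim = 4

rng = np.random.default_rng(0)
# Start from random rows so that words without a pre-trained vector
# (here "<PAD>") still get an embedding
toy_matrix = rng.random((len(toy_idx2word), toy_dim))
for i, word in enumerate(toy_idx2word):
    vector = toy_pretrained.get(word)
    if vector is not None:
        toy_matrix[i] = vector  # row i holds the embedding of the word encoded as i

print(toy_matrix.shape)
```

The row order must follow the word encoding, since the Embedding layer looks a word up by its integer index.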

In [6]:
# Collect the pre-trained vector of every vocabulary word known to Word2Vec
word2vecModel = gensim.models.Word2Vec.load('embeddings/text8_word2vec_skipgram_128.bin')
word2vec_vocabulary = word2vecModel.wv.vocab
embeddingIndex = dict()
counter = 0  # words with no pre-trained vector
for word in idx2word:
    if word in word2vec_vocabulary:
        # Look vectors up through .wv; indexing the model directly is deprecated
        embeddingIndex[word] = word2vecModel.wv[word]
    else:
        counter += 1

print("{} words without pre-trained embedding!".format(counter))
    
# Prepare embeddings matrix
embeddingMatrix = np.random.random((len(word2idx), EMBEDDING_DIM))
for i, word in enumerate(idx2word):
    embeddingVector = embeddingIndex.get(word)
    if embeddingVector is not None:
        embeddingMatrix[i] = embeddingVector

108 words without pre-trained embedding!


  


### _Or it is possible to use random weights_
Skip this cell if you want to use the pre-trained embeddings.

In [None]:
embeddingMatrix = np.random.random((len(word2idx), EMBEDDING_DIM))

## 6. Build the model
The model is a simple neural network composed of:
- an Embedding layer
- a recurrent layer (Long Short-Term Memory, LSTM)
- a Dense layer whose output size equals the vocabulary size

It is also important to notice the following argument, added with respect to the previous character-level notebook, which initializes the Embedding layer with the embeddings matrix:
- _tf.keras.layers.Embedding(..., weights=[embeddingMatrix])_

In [7]:
rnn = tf.keras.layers.CuDNNLSTM 

def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, embedding_dim,
                                  batch_input_shape=[batch_size, None],
                                  weights=[embeddingMatrix]),
        rnn(rnn_units,
            return_sequences=True,
            recurrent_initializer='glorot_uniform',
            stateful=True),
        tf.keras.layers.Dense(vocab_size)
    ])
    return model

trainModel = build_model(
  vocab_size = vocab_size,
  embedding_dim=EMBEDDING_DIM,
  rnn_units=RNN_DIM,
  batch_size=BATCH_SIZE)

for input_example_batch, target_example_batch in dataset.take(1):
    example_batch_predictions = trainModel(input_example_batch)
    print(example_batch_predictions.shape, "# (batch_size, sequence_length, vocab_size)")

trainModel.summary()

Instructions for updating:
Colocations handled automatically by placer.
(8, 10, 3062) # (batch_size, sequence_length, vocab_size)
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (8, None, 128)            391936    
_________________________________________________________________
cu_dnnlstm (CuDNNLSTM)       (8, None, 1024)           4726784   
_________________________________________________________________
dense (Dense)                (8, None, 3062)           3138550   
Total params: 8,257,270
Trainable params: 8,257,270
Non-trainable params: 0
_________________________________________________________________
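
The parameter counts in the summary can be checked by hand. The sketch below assumes that the CuDNN LSTM keeps two bias vectors per gate (one for the input weights, one for the recurrent weights), which is consistent with the 4,726,784 parameters reported above; the variable names are chosen for this example only:

```python
n_vocab = 3062      # vocabulary size printed in section 3
n_embedding = 128   # EMBEDDING_DIM
n_units = 1024      # RNN_DIM

# One n_embedding-dimensional vector per vocabulary word
embedding_params = n_vocab * n_embedding
# 4 gates, each with input weights, recurrent weights and two bias vectors
lstm_params = 4 * (n_embedding * n_units + n_units * n_units + 2 * n_units)
# Fully connected layer from the LSTM output to the vocabulary, plus biases
dense_params = n_units * n_vocab + n_vocab

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
# prints: 391936 4726784 3138550 8257270
```

More than half of the parameters sit in the LSTM, and the recurrent weights (1024 x 1024 per gate) account for most of that.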


## 7. Train the model
We train the model and save its weights to an .h5 file.

In [8]:
def loss(labels, logits):
    return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)

trainModel.compile(
      optimizer = tf.train.AdamOptimizer(),
      loss = loss)

trainModel.fit(dataset.repeat(), epochs=EPOCHS, steps_per_epoch=stepsPerEpoch)

dirname = os.path.abspath('')
weightsPath = os.path.join(dirname, 'models/LSTM_words_fables_{}_{}_{}_{}_{}_.h5'.format(
    EPOCHS, 
    SEQUENCES_LENGTH, 
    BATCH_SIZE, 
    EMBEDDING_DIM,
    RNN_DIM)
)
trainModel.save_weights(weightsPath)

Epoch 1/100
...
Epoch 77/100
Epoch 78

## 8. Generation model
The generation model is the same as the one used in training, but with a __BATCH_SIZE__ equal to 1 so that the model can process one sample at a time.

In [9]:
rnn = tf.keras.layers.CuDNNLSTM

genModel = build_model(
  vocab_size = vocab_size,
  embedding_dim=EMBEDDING_DIM,
  rnn_units=RNN_DIM,
  batch_size=1)

dirname = os.path.abspath('')
weightsPath = os.path.join(dirname, 'models/LSTM_words_fables_{}_{}_{}_{}_{}_.h5'.format(
    EPOCHS, 
    SEQUENCES_LENGTH, 
    BATCH_SIZE, 
    EMBEDDING_DIM,
    RNN_DIM)
)
genModel.load_weights(weightsPath)
genModel.build(tf.TensorShape([1, None]))
genModel.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (1, None, 128)            391936    
_________________________________________________________________
cu_dnnlstm_1 (CuDNNLSTM)     (1, None, 1024)           4726784   
_________________________________________________________________
dense_1 (Dense)              (1, None, 3062)           3138550   
Total params: 8,257,270
Trainable params: 8,257,270
Non-trainable params: 0
_________________________________________________________________


## 9. Generate text
In order to generate a text with a fixed number of words, the following generation loop is implemented:

- It chooses a start string, initializes the RNN state and sets the number of words to generate.
- It gets the prediction distribution of the next word using the start string and the RNN state.
- It samples the index of the predicted word from a multinomial distribution and then uses this predicted word as the next input to the model.
- The RNN state returned by the model is fed back into the model so that it now has more context, rather than only one word. After predicting the next word, the modified RNN state is again fed back into the model; this is how the model accumulates context from the previously predicted words.
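
The sampling step can be illustrated outside TensorFlow. The following NumPy sketch is not the notebook's implementation: the function name `sample_with_temperature` and the example logits are assumptions. It shows how dividing the logits by a temperature changes the distribution the next word is drawn from:

```python
import numpy as np


def sample_with_temperature(logits, temperature=1.0, rng=None):
    """Draw one index from softmax(logits / temperature); also return the probabilities."""
    if rng is None:
        rng = np.random.default_rng()
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    scaled = scaled - scaled.max()  # subtract the max for numerical stability
    probs = np.exp(scaled)
    probs = probs / probs.sum()
    index = int(rng.choice(len(probs), p=probs))
    return index, probs


example_logits = [2.0, 1.0, 0.1]  # hypothetical scores for a 3-word vocabulary
index, cold = sample_with_temperature(example_logits, 0.5, np.random.default_rng(0))
_, hot = sample_with_temperature(example_logits, 2.0, np.random.default_rng(0))
print("temperature 0.5:", cold.round(3))
print("temperature 2.0:", hot.round(3))
```

A low temperature concentrates the probability on the highest-scoring word, while a high temperature flattens the distribution and makes unlikely words more frequent.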



In [12]:
def generate_text(model, start_string, word_2_idx, idx_2_word):
    '''
    Generate NUM_GENERATE words with the given model, starting from start_string.
    Every word of the cleaned start string must be in word_2_idx.
    '''
    # Evaluation step (generating text using the learned weights)
    # Number of words to generate
    numGenerate = NUM_GENERATE
    # Converting our start string to numbers (vectorizing)
    s = clean(start_string)
    inputEval = [word_2_idx[w] for w in s.split(' ')]
    inputEval = tf.expand_dims(inputEval, 0)
    # Empty list to store our results
    textGenerated = []
    # Low temperatures result in more predictable text.
    # Higher temperatures result in more surprising text.
    # Experiment to find the best setting.
    temperature = 1.0
    # Here batch size == 1
    model.reset_states()

    for i in range(numGenerate):
        predictions = model(inputEval)
        # remove the batch dimension
        predictions = tf.squeeze(predictions, 0)
        # scale the logits by the temperature, then sample the next word id
        # from the multinomial distribution given by the last time step
        predictions = predictions / temperature
        predicted_id = tf.multinomial(predictions, num_samples=1)[-1,0].numpy()
        # We pass the predicted word as the next input to the model;
        # the stateful LSTM keeps its previous hidden state internally
        inputEval = tf.expand_dims([predicted_id], 0)
        textGenerated.append(idx_2_word[predicted_id])

    return (start_string + ' ' + ' '.join(textGenerated))


generated = generate_text(
        model=genModel, 
        start_string="There was once a little Bear", 
        word_2_idx=word2idx, 
        idx_2_word=idx2word
    )

print(generated)
session.close()

There was once a little Bear every not see what good friends we shall become .  the waves washed it up on shore .  but his plans were very much changed when he met a lion and furiously began to tear it with their teeth .  and when they returned next day to look for visitors .  and after he had been walking .  wishing also to rest in a wolf and began to his life ,  and the goats out to feed ,  the wild goats scampered the animals respectfully made way for him ,  an ass
