## 1. Import dependencies
A TensorFlow session is started in the background to configure the GPU memory settings.
In this first step we also define the global variables used throughout the notebook, so that each value is set in a single place:

- __*EPOCHS*__: number of passes over the training data.
- __*MAX_LENGTH*__: maximum length, in words, of the variable-length sequences.
- __*BATCH_SIZE*__: number of samples processed before each weight update.
- __*EMBEDDING_DIM*__: size of the vectors produced by the Embedding layer.
- __*HIDDEN_DIM*__: number of LSTM units in the network.

In [5]:
import tensorflow as tf
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.Session(config=config)
tf.keras.backend.set_session(session)

import pandas as pd
import numpy as np
import sys
import os
import re
import random
import json
import gensim
from gensim.models import Word2Vec
from tensorflow.keras.models import Model
from sklearn.utils import shuffle
from tensorflow.keras.layers import Dense, LSTM, CuDNNLSTM, Input, Embedding, TimeDistributed, Flatten, Dropout
from tensorflow.keras.preprocessing.text import text_to_word_sequence
from copy import deepcopy
from tensorflow.keras.models import load_model

EPOCHS = 2
MAX_LENGTH = 100
BATCH_SIZE = 16
EMBEDDING_DIM = 128
HIDDEN_DIM = 1024

## 2. Import Aesop fables data
The chosen dataset is a JSON file containing 147 Aesop fables, each split into sentences.
For making it available, I want to thank this fun and interesting project, which explores the connections between Aesop's fables using machine learning: <a href="https://github.com/itayniv/aesop-fables-stories">GitHub repository</a>

Here is an example of how it is structured:
```json
{
  "stories":[
    {
      "number": "01",
      "title": "THE WOLF AND THE KID",
      "story": [
        "There was once a little Kid whose growing horns made him think he was a grown-up Billy Goat and able to take care of himself.",
        "So one evening when the flock started home from the pasture and his mother called, the Kid paid no heed and kept right on nibbling the tender grass.",
        "A little later when he lifted his head, the flock was gone.",
        "He was all alone.",
        "The sun was sinking.",
        "Long shadows came creeping over the ground.",
        "A chilly little wind came creeping with them making scary noises in the grass.",
        "The Kid shivered as he thought of the terrible Wolf.",
        "Then he started wildly over the field, bleating for his mother.",
        "But not half-way, near a clump of trees, there was the Wolf!",
        "The Kid knew there was little hope for him.",
        "Please, Mr. Wolf, he said trembling, I know you are going to eat me.",
        "But first please pipe me a tune, for I want to dance and be merry as long as I can.",
        "The Wolf liked the idea of a little music before eating, so he struck up a merry tune and the Kid leaped and frisked gaily.",
        "Meanwhile, the flock was moving slowly homeward.",
        "In the still evening air the Wolf's piping carried far.",
        "The Shepherd Dogs pricked up their ears.",
        "They recognized the song the Wolf sings before a feast, and in a moment they were racing back to the pasture.",
        "The Wolf's song ended suddenly, and as he ran, with the Dogs at his heels, he called himself a fool for turning piper to please a Kid, when he should have stuck to his butcher's trade."
      ],
      "moral": "Do not let anything turn you from your purpose.",
      "characters": []
    }, ...
```

In [2]:
def clean(text):
    '''
    Lowercase the text, expand common contractions, put spaces around
    punctuation so each mark becomes its own token, and strip other symbols and digits.
    '''
    text = text.lower()
    # Longer contractions come first, otherwise their shorter prefix would match before them.
    text = text.replace("ain't", "am not")
    text = text.replace("aren't", "are not")
    text = text.replace("can't've", "cannot have")
    text = text.replace("can't", "cannot")
    text = text.replace("'cause", "because")
    text = text.replace("could've", "could have")
    text = text.replace("couldn't've", "could not have")
    text = text.replace("couldn't", "could not")
    text = text.replace("should've", "should have")
    text = text.replace("shouldn't've", "should not have")
    text = text.replace("shouldn't", "should not")
    text = text.replace("would've", "would have")
    text = text.replace("wouldn't've", "would not have")
    text = text.replace("wouldn't", "would not")
    text = text.replace("didn't", "did not")
    text = text.replace("doesn't", "does not")
    text = text.replace("don't", "do not")
    text = text.replace("hadn't've", "had not have")
    text = text.replace("hadn't", "had not")
    text = text.replace("hasn't", "has not")
    text = text.replace("haven't", "have not")
    text = text.replace("he'd've", "he would have")
    text = text.replace("he'd", "he would")
    # Drop whatever apostrophe suffixes are left
    text = text.replace("'s", "")
    text = text.replace("'t", "")
    text = text.replace("'ve", "")
    text = text.replace(".", " . ")
    text = text.replace("!", " ! ")
    text = text.replace("?", " ? ")
    text = text.replace(";", " ; ")
    text = text.replace(":", " : ")
    text = text.replace(",", " , ")
    text = text.replace("´", "")
    text = text.replace("‘", "")
    text = text.replace("’", "")
    text = text.replace("“", "")
    text = text.replace("”", "")
    text = text.replace("\'", "")
    text = text.replace("\"", "")
    text = text.replace("-", "")
    text = text.replace("–", "")
    text = text.replace("—", "")
    text = text.replace("[", "")
    text = text.replace("]","")
    text = text.replace("{","")
    text = text.replace("}", "")
    text = text.replace("/", "")
    text = text.replace("|", "")
    text = text.replace("(", "")
    text = text.replace(")", "")
    text = text.replace("$", "")
    text = text.replace("+", "")
    text = text.replace("*", "")
    text = text.replace("%", "")
    text = text.replace("#", "")
    text = ''.join([i for i in text if not i.isdigit()])

    return text

try:
    
    fables = []
    dirname = os.path.abspath('')
    filepath = os.path.join(dirname, 'input_data/aesopFables.json')

    with open(filepath) as json_file:  
        data = json.load(json_file)
        for p in data['stories']:
            fables.append(' '.join(p['story']))
            
    print('{} fables imported.'.format(len(fables)))
    
    cleanedFables = []
    for f in fables:
        cleanedFables.append(clean(f))
    
    print('{} fables cleaned.'.format(len(cleanedFables)))

except IOError:
    sys.exit('Cannot find data!')

147 fables imported.
147 fables cleaned.
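
Chained `str.replace` calls are order-sensitive: once `can't` has been expanded, a later rule for `can't've` has nothing left to match. One way to make the ordering explicit is a lookup table applied longest pattern first. The sketch below is only an illustration of that idea; `CONTRACTIONS` and `expand_contractions` are hypothetical names and the table holds just a handful of entries.

```python
# Hypothetical, deliberately short table: not the full list used by clean().
CONTRACTIONS = {
    "can't": "cannot",
    "can't've": "cannot have",
    "couldn't": "could not",
    "couldn't've": "could not have",
    "shouldn't": "should not",
    "wouldn't": "would not",
}


def expand_contractions(text, table=CONTRACTIONS):
    """Expand contractions, trying longer patterns before shorter ones."""
    text = text.lower()
    # Sorting by length guarantees "can't've" is handled before "can't".
    for pattern in sorted(table, key=len, reverse=True):
        text = text.replace(pattern, table[pattern])
    return text


expanded = expand_contractions("He can't've known, and she couldn't stay.")
print(expanded)
```

Adding a new contraction then means adding one dictionary entry, with no need to think about where it sits relative to the others.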


We need to look at the maximum and average fable length to choose the preprocessing hyperparameters.

In [3]:
sumLen = 0
maxLen = 0
for f in cleanedFables:
    words = f.split(' ')
    thisLen = len(words)
    sumLen += thisLen
    if thisLen > maxLen:
        maxLen = thisLen

avgLen = sumLen/len(cleanedFables)

print('Average fable length: {}'.format(avgLen))
print('Maximum fable length: {}'.format(maxLen))

Average fable length: 205.40816326530611
Maximum fable length: 549
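
These figures are useful mainly for estimating how many chunks of `MAX_LENGTH` words a fable is cut into. The sketch below works on made-up strings rather than the real fables, and `chunks_needed` is a hypothetical helper. It also shows that `split(' ')` counts the empty strings produced by consecutive spaces, which `clean` introduces around punctuation, so the lengths printed above are probably slightly higher than the number of real tokens.

```python
import math
import statistics

MAX_LENGTH = 100  # same value as the notebook's global

# Made-up stand-ins for cleanedFables
toy_fables = ["a b c", "a b c d e f g", "a b"]
lengths = [len(fable.split()) for fable in toy_fables]
print('Average: {}, maximum: {}'.format(statistics.mean(lengths), max(lengths)))

# split(' ') keeps empty strings between consecutive spaces, split() does not
padded = "a .  b"
count_with_empties = len(padded.split(' '))
count_without_empties = len(padded.split())
print(count_with_empties, count_without_empties)


def chunks_needed(num_words, max_length=MAX_LENGTH):
    """Number of chunks of at most max_length words needed for num_words words."""
    return math.ceil(num_words / max_length)


print(chunks_needed(549))
```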


## 3. Extract Vocabulary
The vocabulary is saved as: 
- a __numpy array__ to map each encoding to the right word
- a __dictionary__ to map each word to its encoding number 

A __textAsInt__ array containing the whole encoded text can be built from the same mapping; that line is commented out in the cell below.

In [4]:
# CREATE VOCABULARY OF WORDS
idx2word = []
word2idx = {'<PAD>' : 0, '<START>' : 1 , '<END>': 2}
wordSequence = []
for fable in cleanedFables:
    words = fable.split(' ')
    wordSequence.extend(words)
    for word in words:
        if word not in word2idx:
            word2idx[word] = len(word2idx)

# The index-to-word list follows insertion order, so position i holds the word encoded as i.
idx2word = list(word2idx.keys())
# textAsInt = np.array([word2idx[w] for w in wordSequence])
vocab_size = len(idx2word)
print('Vocabulary Size: {}'.format(vocab_size))


Vocabulary Size: 3062
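
As a quick sanity check on the two lookups, the sketch below builds them on a toy sentence (not the fables) and verifies that encoding followed by decoding returns the original text. It relies on dictionaries preserving insertion order, which Python guarantees from version 3.7.

```python
# Reserved tokens take the first three indices, as in the notebook
word2idx = {'<PAD>': 0, '<START>': 1, '<END>': 2}

toy_text = "the wolf and the kid"  # made-up sample text
for word in toy_text.split():
    if word not in word2idx:
        word2idx[word] = len(word2idx)

# Position i in the list holds the word whose encoding is i
idx2word = list(word2idx.keys())

encoded = [word2idx[w] for w in toy_text.split()]
decoded = ' '.join(idx2word[i] for i in encoded)
print(encoded)
print(decoded)
```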


## 4. Preprocess text

Thanks to the Encoder-Decoder architecture we can train the model to generate variable-length sequences: the model itself decides how many words to generate for a given input sequence.
To achieve this, however, the text has to be preprocessed so that the model can tell where a sequence starts and where it ends.
This is why, in the previous code cell, we added these three tokens to the vocabulary:

```python
word2idx = {'<PAD>' : 0, '<START>' : 1 , '<END>': 2}
```

We're going to divide the text into sequences of words, respecting a maximum length decided a priori.
Each sequence is then split at every position between two words, so a sequence of n words generates n - 1 samples.

For example, say MAX_LENGTH is 4 and our text is "Hello my name is Dario and I love to code".
- Sequences: "Hello my name is", "Dario and I love", "to code"

Then with the first sequence:
- Input: "START Hello END" --> Target: "START my name is END"
- Input: "START Hello my END" --> Target: "START name is END"
- Input: "START Hello my name END" --> Target: "START is END"

In this way, sequences of every length can be preprocessed together.
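
The splitting rule can be sketched on the example sentence as follows. `make_pairs` is a hypothetical helper that mirrors the `range(1, len(s))` loop of the next cell, so a four-word sequence yields three samples. Each sample has an encoder input, a decoder input, and a decoder target, the last being the decoder input without its `<START>` token.

```python
def make_pairs(words):
    """Split a list of words at every inner position into training samples."""
    pairs = []
    for i in range(1, len(words)):
        head, tail = words[:i], words[i:]
        encoder_input = ' '.join(['<START>'] + head + ['<END>'])
        decoder_input = ' '.join(['<START>'] + tail + ['<END>'])
        decoder_target = ' '.join(tail + ['<END>'])
        pairs.append((encoder_input, decoder_input, decoder_target))
    return pairs


pairs = make_pairs("Hello my name is".split())
for encoder_input, decoder_input, decoder_target in pairs:
    print('{} --> {} | {}'.format(encoder_input, decoder_input, decoder_target))
```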

In [6]:
inputSentences = []
targetSentences = []
outputSentences = []

for fable in cleanedFables:
    # Drop the empty tokens left behind by consecutive spaces
    words = [w for w in fable.split(' ') if w != '']

    # Cut the fable into chunks of at most MAX_LENGTH words
    sentences = [words[i:i+MAX_LENGTH] for i in range(0, len(words), MAX_LENGTH)]
    for s in sentences:
        # One sample per split point: the first i words feed the encoder, the rest the decoder
        for i in range(1, len(s)):
            encode_tokens, decode_tokens = s[:i], s[i:]
            encode_tokens = ' '.join(['<START>'] + encode_tokens + ['<END>'])
            output_tokens = ' '.join(decode_tokens + ['<END>'])
            decode_tokens = ' '.join(['<START>'] + decode_tokens + ['<END>'])
            inputSentences.append(encode_tokens)
            targetSentences.append(decode_tokens)
            outputSentences.append(output_tokens)

numSamples = len(inputSentences)
print('Num samples: {}'.format(numSamples))

print("Creating dataset to feed Model . . . ")
dirname = os.path.abspath('')
filePath = os.path.join(dirname, 'preprocessed/dataset_ed_fables_{}_{}.csv'.format(
    MAX_LENGTH,
    BATCH_SIZE))

if os.path.exists(filePath):
    os.remove(filePath) 

d= {'input_encoder' : inputSentences, 'input_decoder' :targetSentences, 'output_decoder':outputSentences }
df = pd.DataFrame(data=d) 
df = shuffle(df)
df.to_csv(filePath, index=False)

print("Dataset printed on CSV.")

Num samples: 26809
Creating dataset to feed Model . . . 
Dataset printed on CSV.


In [7]:
def generate_data(word_2_idx, num_samples, max_length, vocab_length, batch_size=BATCH_SIZE):
    '''
    Read the preprocessed CSV and encode it as the integer arrays fed to the model.
    Every sequence is right-padded with 0, the index of <PAD>.
    '''
    dirname = os.path.abspath('')
    filePath = os.path.join(dirname, 'preprocessed/dataset_ed_fables_{}_{}.csv'.format(
        max_length,
        batch_size))
    df = pd.read_csv(filePath)

    # max_length + 2 leaves room for the <START> and <END> tokens
    encoderInputData = np.zeros((num_samples, max_length + 2), dtype='int')
    decoderInputData = np.zeros((num_samples, max_length + 2), dtype='int')
    decoderTargetData = np.zeros((num_samples, max_length + 2, 1), dtype='int')

    for i in range(0, num_samples):
        if i % 10000 == 0:
            print('Generating feeding data... {}/{}'.format(i, num_samples))
        encoderTokens = df['input_encoder'].iloc[i].split(' ')
        decoderTokens = df['input_decoder'].iloc[i].split(' ')
        outputTokens = df['output_decoder'].iloc[i].split(' ')

        for t, word in enumerate(encoderTokens):
            encoderInputData[i, t] = word_2_idx[word]
        for t, word in enumerate(decoderTokens):
            decoderInputData[i, t] = word_2_idx[word]
        for t, word in enumerate(outputTokens):
            # decoderTargetData is ahead of decoderInputData by one timestep
            decoderTargetData[i, t, 0] = word_2_idx[word]

    return encoderInputData, decoderInputData, decoderTargetData
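
The one-timestep offset between decoder input and decoder target is easier to see on a single sample. The sketch below uses plain lists and a toy vocabulary instead of the NumPy arrays above; `pad_ids` is a hypothetical helper that encodes a token list and right-pads it with the `<PAD>` index.

```python
def pad_ids(tokens, word2idx, length):
    """Encode tokens as integers and right-pad with <PAD> up to length."""
    ids = [word2idx[token] for token in tokens]
    return ids + [word2idx['<PAD>']] * (length - len(ids))


# Toy vocabulary, for illustration only
word2idx = {'<PAD>': 0, '<START>': 1, '<END>': 2, 'my': 3, 'name': 4, 'is': 5}

decoder_input = pad_ids('<START> my name is <END>'.split(), word2idx, 7)
decoder_target = pad_ids('my name is <END>'.split(), word2idx, 7)

print(decoder_input)   # what the decoder reads at each timestep
print(decoder_target)  # what it should predict at the same timestep
```

At timestep t the decoder reads `decoder_input[t]` and is trained to output `decoder_target[t]`, which is the token at position t + 1 of the input.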

__Extract embeddings matrix__

In [8]:
# Collect the pre-trained vector of every vocabulary word known to the Word2Vec model
word2vecModel = gensim.models.Word2Vec.load('embeddings/text8_word2vec_skipgram_128.bin')
word2vec_vocabulary = word2vecModel.wv.vocab  # gensim 3.x attribute (gensim 4 exposes key_to_index instead)
embeddingIndex = dict()
counter = 0
for i, word in enumerate(idx2word):
    if word in word2vec_vocabulary :
        embeddingIndex[word] = word2vecModel.wv[word]
    else:
        counter += 1

print("{} words without pre-trained embedding!".format(counter))
    
# Prepare embeddings matrix
embeddingMatrix = np.random.random((len(word2idx), EMBEDDING_DIM))
for i, word in enumerate(idx2word):
    embeddingVector = embeddingIndex.get(word)
    if embeddingVector is not None:
        embeddingMatrix[i] = embeddingVector

108 words without pre-trained embedding!
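
The fallback logic can be illustrated without gensim. In the sketch below, `pretrained` is a made-up dictionary standing in for the Word2Vec model, and the embedding size is reduced to 4 for readability. Words without a pre-trained vector, which here include the three special tokens, keep a random row.

```python
import random

EMBEDDING_DIM = 4  # reduced for readability; the notebook uses 128
idx2word = ['<PAD>', '<START>', '<END>', 'wolf', 'kid']

# Made-up stand-in for the pre-trained Word2Vec vectors
pretrained = {'wolf': [0.1, 0.2, 0.3, 0.4]}

rng = random.Random(0)  # seeded so the random rows are reproducible
matrix = []
missing = 0
for word in idx2word:
    vector = pretrained.get(word)
    if vector is None:
        # No pre-trained vector: fall back to random values in [0, 1)
        missing += 1
        vector = [rng.random() for _ in range(EMBEDDING_DIM)]
    matrix.append(list(vector))

print('{} words without pre-trained embedding!'.format(missing))
```

Row i of the matrix belongs to `idx2word[i]`, which is what lets the Embedding layer look vectors up by word index.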


  


__Or use random weights__

In [None]:
embeddingMatrix = np.random.random((len(word2idx), EMBEDDING_DIM))

__Define functions to build the model__

In [9]:
def build_encoder(vocab_length, embedding_weigths=embeddingMatrix, embedding_dim=EMBEDDING_DIM, hidden_dim=HIDDEN_DIM):
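    '''
    Build the encoder: Input -> Embedding -> CuDNNLSTM.
    Returns the input tensor and the final LSTM states [h, c].
    '''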
    # Define an input sequence and process it.
    # Input layer of the encoder :
    encoderInput = Input(shape=(None,))
    
    # Hidden layers of the encoder :
    encoder_embedding = Embedding(input_dim = vocab_length, output_dim = embedding_dim, weights=[embedding_weigths])(encoderInput)

    # Output layer of the encoder :
    encoder_LSTM = CuDNNLSTM(hidden_dim , return_state=True)
    encoder_outputs, state_h, state_c = encoder_LSTM(encoder_embedding)

    # We discard `encoder_outputs` and only keep the states.
    encoderStates = [state_h, state_c]
    
    
    return encoderInput, encoderStates


def build_encoder_gen(encoder_input, encoder_states):
    '''
    Wrap the trained encoder in a standalone model used at generation time.
    '''
    encoderModelGen = Model(encoder_input, encoder_states)

    return encoderModelGen


def build_decoder(vocab_length, encoderStates, embedding_weigths=embeddingMatrix, embedding_dim=EMBEDDING_DIM, hidden_dim=HIDDEN_DIM):
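    '''
    Build the training decoder, initialised with the encoder states.
    Returns its input and output tensors plus the layers, so they can be reused for generation.
    '''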
    # Set up the decoder, using `encoderStates` as initial state.
    # Input layer of the decoder :
    decoderInput = Input(shape=(None,))

    # Hidden layers of the decoder :
    decoderEmbeddingLayer = Embedding(input_dim = vocab_length, output_dim = embedding_dim, weights=[embedding_weigths])
    decoder_embedding = decoderEmbeddingLayer(decoderInput)

    decoderLSTMLayer = CuDNNLSTM(hidden_dim , return_sequences=True, return_state=True)
    decoder_LSTM_output, _ , _ = decoderLSTMLayer(decoder_embedding, initial_state=encoderStates)

    # Output layer of the decoder :
    decoderDenseLayer = Dense(vocab_length, activation='softmax')
    decoderOutput = decoderDenseLayer(decoder_LSTM_output)

    return decoderInput, decoderOutput, decoderEmbeddingLayer,  decoderLSTMLayer, decoderDenseLayer


def build_decoder_gen(decoder_input, decoder_embedding_layer, decoder_LSTM_layer, decoder_dense, hidden_dim=HIDDEN_DIM):
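    '''
    Build the generation-time decoder, which reuses the trained layers
    and receives its states as explicit inputs.
    '''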
    decoder_state_input_h = Input(shape=(hidden_dim,))
    decoder_state_input_c = Input(shape=(hidden_dim,))
    decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]

    decoder_embedding_gen = decoder_embedding_layer(decoder_input)
    decoder_LSTM_output_gen, state_h_gen , state_c_gen = decoder_LSTM_layer(decoder_embedding_gen, initial_state = decoder_states_inputs)
    decoder_states_gen = [state_h_gen, state_c_gen]
    decoderOutputGen = decoder_dense(decoder_LSTM_output_gen)

    # The sampling model takes the decoder input (initially the <START> token) and the current states,
    # and returns the next-word probabilities together with the updated states.
    decoderModelGen = Model(
    [decoder_input] + decoder_states_inputs,
    [decoderOutputGen] + decoder_states_gen
    )

    return decoderModelGen
  
def build_encoder_decoder_model(encoder_input, decoder_input, decoder_output):
    '''
    Combine encoder and decoder into the model used for training.
    '''
    model = Model([encoder_input, decoder_input], decoder_output)
    model.summary()

    return model

__Train model__

In [10]:
dirname = os.path.abspath('')

encoderGenPath = os.path.join(dirname, 'models/encoder_fables_{}_{}_{}_{}_{}.h5'.format(
    EPOCHS, 
    MAX_LENGTH, 
    BATCH_SIZE, 
    EMBEDDING_DIM,
    HIDDEN_DIM)
)

decoderGenPath = os.path.join(dirname, 'models/decoder_fables_{}_{}_{}_{}_{}.h5'.format(
    EPOCHS, 
    MAX_LENGTH, 
    BATCH_SIZE, 
    EMBEDDING_DIM,
    HIDDEN_DIM)
)

encoderInput, encoderStates = build_encoder(vocab_length=vocab_size)

decoderInput, decoderOutput, decoderEmbeddingLayer,  decoderLSTMLayer, decoderDenseLayer = build_decoder(
    vocab_length=vocab_size, 
    encoderStates=encoderStates
)

model = build_encoder_decoder_model(
    encoder_input=encoderInput, 
    decoder_input=decoderInput, 
    decoder_output=decoderOutput
)

encoderInputData, decoderInputData, decoderTargetData = generate_data(
    word_2_idx=word2idx,
    num_samples=numSamples,
    max_length=MAX_LENGTH, 
    vocab_length=vocab_size
)

model.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy')
model.fit([encoderInputData, decoderInputData], decoderTargetData, batch_size=BATCH_SIZE, epochs=EPOCHS)

encoderModelGen = build_encoder_gen(
    encoder_input = encoderInput, 
    encoder_states = encoderStates
)

decoderModelGen = build_decoder_gen(
    decoder_input = decoderInput, 
    decoder_embedding_layer = decoderEmbeddingLayer, 
    decoder_LSTM_layer = decoderLSTMLayer, 
    decoder_dense = decoderDenseLayer
)

encoderModelGen.save(encoderGenPath)
decoderModelGen.save(decoderGenPath)

session.close()

Instructions for updating:
Colocations handled automatically by placer.
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            (None, None)         0                                            
__________________________________________________________________________________________________
input_2 (InputLayer)            (None, None)         0                                            
__________________________________________________________________________________________________
embedding (Embedding)           (None, None, 128)    391936      input_1[0][0]                    
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, None, 128)    391936      input_2[0][0]                    
_____________________________________
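
The parameter count of the two Embedding rows in the summary can be checked by hand: an embedding layer stores one vector of `EMBEDDING_DIM` weights per vocabulary entry. The sketch below redoes that arithmetic and, as an extra not visible in the summary above, applies the usual dense-layer formula (inputs times units, plus one bias per unit) to the output layer.

```python
VOCAB_SIZE = 3062     # printed by the vocabulary cell
EMBEDDING_DIM = 128
HIDDEN_DIM = 1024

# One EMBEDDING_DIM-sized vector per vocabulary entry
embedding_params = VOCAB_SIZE * EMBEDDING_DIM
print(embedding_params)

# Dense softmax layer: HIDDEN_DIM weights plus one bias for each output word
dense_params = (HIDDEN_DIM + 1) * VOCAB_SIZE
print(dense_params)
```

The first figure matches the 391936 reported for `embedding` and `embedding_1`.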

__Generate text__
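
Generation is greedy: starting from `<START>`, the decoder is asked for the most probable next word, that word is fed back as the next input together with the updated states, and the loop stops at `<END>` or at the maximum length. The sketch below isolates that loop from Keras. `greedy_decode` and `toy_step` are hypothetical; `toy_step` stands in for `decoder_model.predict` and simply replays a fixed script.

```python
def greedy_decode(step_fn, start_id, end_id, initial_state, max_length):
    """Repeatedly pick the highest-scoring token until end_id or max_length."""
    token, state = start_id, initial_state
    generated = []
    while len(generated) < max_length:
        scores, state = step_fn(token, state)
        # argmax over the vocabulary scores
        token = max(range(len(scores)), key=scores.__getitem__)
        generated.append(token)
        if token == end_id:
            break
    return generated


def toy_step(token, state):
    """Stand-in decoder: the state is a step counter, the output follows a script."""
    script = [3, 4, 2]  # two words, then <END> (index 2)
    target = script[min(state, len(script) - 1)]
    scores = [0.0] * 5
    scores[target] = 1.0
    return scores, state + 1


result = greedy_decode(toy_step, start_id=1, end_id=2, initial_state=0, max_length=10)
print(result)
```

Because the most probable word is always chosen, the same seed always produces the same continuation.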

In [11]:
def generate_text(sentences, encoder_model, decoder_model, vocab_length, word_2_idx, idx_2_word, max_length):
    '''
    Greedily generate a continuation for each seed sentence and print it.
    '''
    for phrase in sentences:

        # Cleaning sentence
        phrase = clean(phrase)
        print('GENEREATING FROM: {}'.format(phrase))
        tokens = phrase.split(' ')
        inputSequence = np.zeros((1, max_length), dtype='int')
        for i, t in enumerate(tokens):
            inputSequence[0, i] = word_2_idx[t]

        # Encode the input as state vectors.
        statesValue = encoder_model.predict(inputSequence)
        # Generate empty target sequence of length 1.
        targetSeq = np.zeros((1, 1))
        targetSeq[0, 0] = word_2_idx['<START>']
        # Sampling loop for a batch of sequences
        # (to simplify, here we assume a batch of size 1).
        stopCondition = False
        decodedSentence = ''
        decodedList = []
        while not stopCondition:
            outputTokens, h, c = decoder_model.predict(
                [targetSeq] + statesValue)

            # Sample a token
            sampledTokenIndex = np.argmax(outputTokens[0, -1, :])
            sampledWord = idx_2_word[sampledTokenIndex]
            decodedList.append(sampledWord)
            decodedSentence += ' ' + sampledWord

            # Exit condition: either hit max length
            # or find stop character.
            if (sampledWord == '<END>' or len(decodedList)== max_length):
                stopCondition = True

            # Update the target sequence (of length 1).
            targetSeq = np.zeros((1, 1))
            targetSeq[0, 0] = sampledTokenIndex

            # Update states
            statesValue = [h, c]

        print('GENERATED: {}'.format(decodedSentence))

config = tf.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.Session(config=config)
tf.keras.backend.set_session(session)
        
dirname = os.path.abspath('')

encoderGenPath = os.path.join(dirname, 'models/encoder_fables_{}_{}_{}_{}_{}.h5'.format(
    EPOCHS, 
    MAX_LENGTH, 
    BATCH_SIZE, 
    EMBEDDING_DIM,
    HIDDEN_DIM)
)

decoderGenPath = os.path.join(dirname, 'models/decoder_fables_{}_{}_{}_{}_{}.h5'.format(
    EPOCHS, 
    MAX_LENGTH, 
    BATCH_SIZE, 
    EMBEDDING_DIM,
    HIDDEN_DIM)
)

sentences = [
    'The Cock',
    'A Dog and a Wolf',
    'There was once a little Bear',
    'An eagle was given permission to fly over the country.',
    'A dog was talking to a bear asking for some food. The bear who was hungry too said no.',
    'There was once a little Mouse who walking in the forest. He found his way into a bear cave. It was alone and afraid. The cave was really dark and the Bear was sleeping.'
]

encoderModel = load_model(encoderGenPath)
encoderModel.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy')
decoderModel = load_model(decoderGenPath)
decoderModel.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy')

generate_text(
    sentences = sentences,
    encoder_model = encoderModel,
    decoder_model = decoderModel, 
    vocab_length = vocab_size, 
    word_2_idx = word2idx, 
    idx_2_word = idx2word, 
    max_length = MAX_LENGTH
)

session.close()

GENEREATING FROM: the cock
GENERATED:  mouse , who had to take them to the found , there he was able to do . lifting his wings he tried to rise from the ground . but the weight of his magnificent train held him down . instead of flying up to greet the first rays of the morning sun or to bathe in the rosy light among the floating clouds at sunset , he would have to walk the ground more encumbered and oppressed than any common barnyard fowl . <END>
GENEREATING FROM: a dog and a wolf
GENERATED:  tree , and the animals respectfully made way for him , an ass brayed a scornful remark as he passed . the lion felt a flash of anger . but when he turned his head and saw who had spoken , he walked quietly on . he would not honor the fool with even so much as a stroke of his claws . <END>
GENEREATING FROM: there was once a little bear
GENERATED:  , and there he was with a house on his back and little short legs that could hardly drag him along . one day he met a pair of ducks and told them all his