# Text Generation using Bidirectional LSTM and Doc2Vec models

The purpose of [this article](https://medium.com/@david.campion/text-generation-using-bidirectional-lstm-and-doc2vec-models-1-3-8979eb65cb3a) is to discuss text generation using machine learning approaches, especially neural networks.

It is not the first article on the subject, and probably not the last. There is a lot of literature about text generation using "AI" techniques, and code is available to generate text from existing novels, trying to create new chapters for **"Game of Thrones"**, **"Harry Potter"**, or a new piece in the style of **Shakespeare**. Sometimes with interesting results.

Most of these approaches use classic LSTM networks, and they are pretty fun to experiment with.

However, the generated texts leave a feeling of incompleteness. The generated sentences seem quite right, with correct grammar and syntax, as if the neural network correctly understood the structure of a sentence. But the new text as a whole does not make much sense, when it is not complete nonsense.

This problem could come from the approach itself, which uses only an LSTM to generate text word by word. But how can we improve on it? In this article, I will try to investigate a new way to generate sentences.

This does not mean that I will use something completely different from LSTM: I will still use an LSTM network to generate sequences of words. However, I will try to go further than a classic LSTM neural network and use an additional neural network (an LSTM again) to select the best phrases.

This article can therefore be used as a tutorial. It describes:
 1. **how to train a neural network to generate sentences** (i.e. sequences of words), based on existing novels. I will use a bidirectional LSTM architecture for that.
 2. **how to train a neural network to select the best next sentence for a given paragraph** (i.e. a sequence of sentences). I will also use a bidirectional LSTM architecture, in addition to a Doc2Vec model of the target novels.


### Note about Data inputs
As data inputs, I will not use texts that are still protected by intellectual property rights. So I will not train the solution to create a new chapter for **"Game of Thrones"** or **"Harry Potter"**.
Sorry about that. There is plenty of "free" text for such text generation exercises, and we can dive into [Project Gutenberg](http://www.gutenberg.org), which provides a huge amount of texts (from [William Shakespeare](http://www.gutenberg.org/ebooks/author/65) to [H.P. Lovecraft](http://www.gutenberg.org/ebooks/author/34724), or other great authors).

However, I am also a French author of fantasy and science fiction. So I will use my personal material to create a new chapter of my stories, hoping it can help me in my next work!

So, I will base this exercise on **"Artistes et Phalanges"**, a French fantasy novel I wrote over the past 10 years, which I hope will be sufficient as input data. It contains more than 830,000 characters.

By the way, if you are a French reader and fond of fantasy, you can find it on the iBooks Store and Amazon Kindle for free... Please note that I also provide the data for free on my GitHub repository. Enjoy!

## 1. A Neural Network for Generating Sentences

The first step is to generate sentences in the style of a given author.

There is a large body of literature about it, especially on using LSTMs for such a task. As this kind of network works well for the job, we will use it.

The purpose of this note is not to dive deep into the description of LSTMs. You can find very good articles about them, and I suggest reading [this article](http://karpathy.github.io/2015/05/21/rnn-effectiveness/) from Andrej Karpathy.

You can also easily find existing code to perform text generation using LSTMs. On my GitHub, you can find two tutorials, one using [Tensorflow](https://github.com/campdav/text-rnn-tensorflow), and another one using [Keras](https://github.com/campdav/text-rnn-keras) (on top of TensorFlow), which is easier to understand.

For the first part of this exercise, I will reuse this material, with a few improvements:
 - Instead of a simple _LSTM_, I will use a _bidirectional LSTM_. This network configuration converges faster than a single LSTM (fewer epochs are required) and, from empirical tests, seems better in terms of accuracy. You can have a look at [this article](https://machinelearningmastery.com/develop-bidirectional-lstm-sequence-classification-python-keras/) from Jason Brownlee for a good tutorial about bidirectional LSTMs.
 - I will use Keras, which requires less complexity to create the network and is more readable than conventional TensorFlow code.

### 1.1. What is the neural network task in our case?

LSTM (Long Short-Term Memory) networks are very good at analysing sequences of values and predicting the next values from them. For example, an LSTM could be a very good choice if you want to predict the next point of a given time series (assuming a correlation exists in the sequence).

When it comes to sentences and texts, phrases (sentences) are basically sequences of words. So it is natural to assume that an LSTM could be useful to generate the next word of a given sentence.

In summary, the objective of an LSTM neural network in this situation is to guess the next word of a given sentence.

For example:
What is the next word of the following sentence: "he is walking down the"

Our neural net will take the sequence of words as input: "he", "is", "walking", ...
Its output will be a vector providing, for each word of the dictionary, the probability of being the next word of the given sentence.

Then, how do we build the complete text? Simply by iterating the process: we shift the sentence by one word, appending the newly guessed word at its end. Then we guess a new word for this new sentence, ad vitam aeternam.
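
The iteration can be sketched with a toy stand-in for the network. This is only an illustration under stated assumptions: `toy_predict` and its six-word vocabulary are hypothetical and replace the trained LSTM, and the greedy choice of the most probable word is a simplification of what a real generator would do.

```python
import numpy as np

# Hypothetical toy vocabulary; the real one is built from the novels.
vocabulary = ["he", "is", "walking", "down", "the", "street"]
word_to_index = {w: i for i, w in enumerate(vocabulary)}

def toy_predict(sequence):
    """Stand-in for the neural network: returns one probability per
    vocabulary word. The most likely next word is simply the word that
    follows the last word of the sequence in the vocabulary list."""
    probs = np.full(len(vocabulary), 0.02)
    next_index = (word_to_index[sequence[-1]] + 1) % len(vocabulary)
    probs[next_index] = 1.0 - 0.02 * (len(vocabulary) - 1)
    return probs

def generate(seed, n_words):
    sequence = list(seed)
    generated = list(seed)
    for _ in range(n_words):
        probs = toy_predict(sequence)
        next_word = vocabulary[int(np.argmax(probs))]
        generated.append(next_word)
        # slide the window: drop the first word, append the new one
        sequence = sequence[1:] + [next_word]
    return generated

print(" ".join(generate(["he", "is", "walking", "down", "the"], 3)))
```

The window always keeps the same length, which is what lets a network with a fixed input size generate text of any length.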

### 1.1.1. Process

To do that, we follow these steps:

 1. read the data (the novels we want to use),
 2. create the dictionary of words,
 3. create the list of sequences of words,
 4. create the neural network,
 5. train the neural network,
 6. generate new sentences.

In [1]:
from __future__ import print_function
from keras.models import Sequential, Model
from keras.layers import Dense, Activation, Dropout
from keras.layers import LSTM, Input, Flatten, Bidirectional
from keras.layers.normalization import BatchNormalization
from keras.optimizers import Adam
from keras.callbacks import EarlyStopping, ModelCheckpoint
from keras.metrics import categorical_accuracy
import numpy as np
import random
import sys
import os
import time
import codecs
import collections
from six.moves import cPickle

Using TensorFlow backend.


We have raw text, and a lot of things have to be done before we can use it: splitting it into a list of words, etc.
To do that, I use the spaCy library, which is incredible for dealing with text. For this exercise, I will only use a few of spaCy's features.

In [2]:
#import spacy, and french model
import spacy
nlp = spacy.load('fr')

# parameters

In [4]:
data_dir = 'data/Artistes_et_Phalanges-David_Campion' # data directory containing the input .txt files
save_dir = 'save' # directory to store models
seq_length = 30 # sequence length
sequences_step = 1 #step to create sequences

In [5]:
file_list = ["101","102","103","104","105","106","107","108","109","110","111","112","201","202","203","204","205","206","207","208","209","210","211","212","213","214","301","302","303","304","305","306","307","308","309","310","311","312","313","314","401","402","403","404","405","406","407","408","409","410","411","412"]

vocab_file = os.path.join(save_dir, "words_vocab.pkl")

# read data

I create a specific function to build a list of words from raw text. I use the spaCy library, with a function that keeps only the lowercase form of each word and removes line breaks (\n) and special space characters.

I am doing this because I want to reduce the number of potential words in my dictionary, and I assume we do not need to keep capital letters. Indeed, they are only part of the syntax of the text, its shape, and do not affect its meaning.
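
As a small illustration of why lowercasing shrinks the dictionary, here is a sketch on a hypothetical token list (in the notebook the tokens come from spaCy, not from a hand-written list):

```python
import collections

# Hypothetical token list standing in for spaCy's output.
tokens = ["Le", "chemin", "est", "long", ".", "le", "Chemin", "monte", "."]

raw_vocab = collections.Counter(tokens)
lower_vocab = collections.Counter(t.lower() for t in tokens)

print(len(raw_vocab), "distinct words with case kept")
print(len(lower_vocab), "distinct words after lowercasing")
```

Words that differ only by their capitalisation are merged into a single entry, so the network has fewer classes to predict.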

In [6]:
def create_wordlist(doc):
    wl = []
    for word in doc:
        if word.text not in ("\n","\n\n",'\u2009','\xa0'):
            wl.append(word.text.lower())
    return wl

Create the list of words:

In [7]:
wordlist = []
for file_name in file_list:
    input_file = os.path.join(data_dir, file_name + ".txt")
    #read data
    with codecs.open(input_file, "r") as f:
        data = f.read()
    # tokenize the text with spaCy and keep the cleaned, lowercased words
    doc = nlp(data)
    wl = create_wordlist(doc)
    wordlist.extend(wl)

## Create dictionary

The first step is to create the dictionary, that is, the list of all words contained in the texts. We assign an index to each word.

In [9]:
# count the number of words
word_counts = collections.Counter(wordlist)

# Mapping from index to word : that's the vocabulary
vocabulary_inv = [x[0] for x in word_counts.most_common()]
vocabulary_inv = list(sorted(vocabulary_inv))

# Mapping from word to index
vocab = {x: i for i, x in enumerate(vocabulary_inv)}
words = [x[0] for x in word_counts.most_common()]

#size of the vocabulary
vocab_size = len(words)
print("vocab size: ", vocab_size)

#save the words and vocabulary
with open(vocab_file, 'wb') as f:
    cPickle.dump((words, vocab, vocabulary_inv), f)

vocab size:  11485


## create sequences
Now, we have to create the input data for our LSTM. We create two lists:
 - **sequences**: this list will contain the sequences of words used to train the model,
 - **next_words**: this list will contain the next word for each sequence of the **sequences** list.
 
In this exercise, we train the network with sequences of 30 words (seq_length = 30).

So, to create the first sequence of words, we take the first 30 words of the **wordlist** list. Word 31 is the next word of this first sequence, and is added to the **next_words** list.

Then we jump by a step of 1 (sequences_step = 1 in our example) in the list of words, to create the second sequence of words and retrieve the second "next word".

We iterate this task until the end of the list of words.

In [10]:
#create sequences
sequences = []
next_words = []
for i in range(0, len(wordlist) - seq_length, sequences_step):
    sequences.append(wordlist[i: i + seq_length])
    next_words.append(wordlist[i + seq_length])

print('nb sequences:', len(sequences))

nb sequences: 172104


When we iterate over the whole list of words, we create 172104 sequences of words, and retrieve, for each of them, the next word to be predicted.
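
The sliding window is easier to see on a toy example. The sketch below wraps the same loop in a hypothetical helper, `make_sequences`, and applies it to a six-word list with sequences of 3 words:

```python
def make_sequences(words, seq_length, step=1):
    """Return (sequences, next_words) built with a sliding window."""
    sequences, next_words = [], []
    for i in range(0, len(words) - seq_length, step):
        sequences.append(words[i: i + seq_length])
        next_words.append(words[i + seq_length])
    return sequences, next_words

toy_words = "he is walking down the street".split()
seqs, nexts = make_sequences(toy_words, seq_length=3)
for s, n in zip(seqs, nexts):
    print(s, "->", n)
```

With a step of 1, the number of sequences is the number of words minus the sequence length, so 172104 sequences of 30 words correspond to a list of 172134 words.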

However, these lists cannot be used "as is". We have to transform them in order to feed them to the LSTM. Text will not be understood by a neural net; we have to use numbers.
However, we cannot simply map a word to its index in the vocabulary, as this index does not intrinsically represent the word. It is better to encode a sequence of words as a matrix of booleans.

So, we create the arrays X and y:
 - X : the matrix of the following dimensions:
     - number of sequences,
     - number of words in sequences,
     - number of words in the vocabulary.
 - y : the matrix of the following dimensions:
     - number of sequences,
     - number of words in the vocabulary.
 
For each word, we retrieve its index in the vocabulary and set the corresponding position in the matrix to 1.

In [11]:
X = np.zeros((len(sequences), seq_length, vocab_size), dtype=bool)
y = np.zeros((len(sequences), vocab_size), dtype=bool)
for i, sentence in enumerate(sequences):
    for t, word in enumerate(sentence):
        X[i, t, vocab[word]] = 1
    y[i, vocab[next_words[i]]] = 1
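
To make the encoding concrete, here is a minimal sketch on a hypothetical four-word vocabulary. It also includes a back-of-the-envelope size estimate for the full `X` array, assuming a dense boolean array stored at one byte per element (which is how NumPy stores `bool`); the helper `dense_onehot_bytes` is an illustration, not part of the original notebook.

```python
import numpy as np

# Hypothetical toy vocabulary and sequence.
toy_vocab = {"down": 0, "he": 1, "is": 2, "walking": 3}
toy_sequence = ["he", "is", "walking"]

# one row per position in the sequence, one column per vocabulary word
encoded = np.zeros((len(toy_sequence), len(toy_vocab)), dtype=bool)
for t, word in enumerate(toy_sequence):
    encoded[t, toy_vocab[word]] = True
print(encoded.astype(int))

def dense_onehot_bytes(n_sequences, seq_length, vocab_size):
    """Size in bytes of a dense boolean array (1 byte per element)."""
    return n_sequences * seq_length * vocab_size

size = dense_onehot_bytes(172104, 30, 11485)
print("X would need about %.1f GB" % (size / 1e9))
```

Each row contains exactly one `True`, at the index of the word. The size estimate is worth checking against the memory of your machine before running the cell on the full corpus.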

# Build Model

Now comes the fun part: the creation of the neural network.
As you will see, I am using Keras, which provides a very good abstraction to design an architecture.

In this example, I create the following neural network:
 - a bidirectional LSTM,
 - with a size of 256 and using ReLU as activation,
 - then a dropout layer with a rate of 0.6 (it's pretty high, but necessary to avoid quick divergence).
 

The net should give me, for each word of the vocabulary, the probability of being the next one after a given sentence. So I end it with:

 - a simple dense layer of the size of the vocabulary,
 - a softmax activation.
 
I use Adam as the optimizer, and the loss is the categorical cross-entropy.

Here is the function to build the network:

In [14]:
def bidirectional_lstm_model(seq_length, vocab_size):
    print('Build LSTM model.')
    model = Sequential()
    model.add(Bidirectional(LSTM(rnn_size, activation="relu"),input_shape=(seq_length, vocab_size)))
    model.add(Dropout(0.6))
    model.add(Dense(vocab_size))
    model.add(Activation('softmax'))
    
    # rnn_size and learning_rate are global parameters defined in the next cell;
    # the training callbacks are created later, when calling fit()
    optimizer = Adam(lr=learning_rate)
    model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=[categorical_accuracy])
    return model

In [15]:
rnn_size = 256 # size of RNN
batch_size = 32 # minibatch size
seq_length = 30 # sequence length
num_epochs = 50 # number of epochs
learning_rate = 0.001 #learning rate
sequences_step = 1 #step to create sequences

In [16]:
md = bidirectional_lstm_model(seq_length, vocab_size)
md.summary()

Build LSTM model.
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
bidirectional_1 (Bidirection (None, 512)               24047616  
_________________________________________________________________
dropout_1 (Dropout)          (None, 512)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 11485)             5891805   
_________________________________________________________________
activation_1 (Activation)    (None, 11485)             0         
Total params: 29,939,421
Trainable params: 29,939,421
Non-trainable params: 0
_________________________________________________________________


If I print the summary of this model, you can see it has close to 30 million trainable parameters. This is huge, and the computation will take some time to complete.
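
The figures in the summary table can be checked by hand. The sketch below uses the standard parameter count of an LSTM layer (four gates, each with input weights, recurrent weights and a bias) and of a dense layer; the helper names are mine and only serve this illustration.

```python
def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # one weight per (input, output) pair plus one bias per output
    return input_dim * units + units

vocab_size, rnn_size = 11485, 256
bi_lstm = 2 * lstm_params(vocab_size, rnn_size)   # forward + backward LSTM
dense = dense_params(2 * rnn_size, vocab_size)    # input is the 512-wide concatenation
print(bi_lstm, dense, bi_lstm + dense)
```

Most of the parameters come from the one-hot input: each of the two LSTMs has a weight for every (vocabulary word, gate unit) pair.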

## train data

Enough talk, let's train the model now. The training samples are shuffled at each epoch (`shuffle=True`) and, with `validation_split=0.01`, 1% of the data is held out as a validation sample. We simply run:

In [None]:
#fit the model
callbacks=[EarlyStopping(patience=4, monitor='val_loss'),
           ModelCheckpoint(filepath=save_dir + "/" + 'my_model_gen_sentences_lstm.{epoch:02d}-{val_loss:.2f}.hdf5',\
                           monitor='val_loss', verbose=0, mode='auto', period=2)]
history = md.fit(X, y,
                 batch_size=batch_size,
                 shuffle=True,
                 epochs=num_epochs,
                 callbacks=callbacks,
                 validation_split=0.01)

Train on 170382 samples, validate on 1722 samples
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
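
The sample counts in the log are consistent with the `validation_split=0.01` argument passed to `fit`. As a quick check, here is a sketch assuming the split index is computed as `int(n_samples * (1 - validation_split))`, which is my understanding of how Keras does it (the validation part being taken from the end of the data, before shuffling):

```python
def split_sizes(n_samples, validation_split):
    """Sizes of the training and validation parts, assuming the split
    index is int(n_samples * (1 - validation_split))."""
    split_at = int(n_samples * (1.0 - validation_split))
    return split_at, n_samples - split_at

print(split_sizes(172104, 0.01))
```

Because the validation samples come from the end of the word list, they correspond to the last pages of the novel rather than to a random sample of it.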


In [25]:
#save the model
md.save(save_dir + "/" + 'my_model_gen_sentences_lstm.final.hdf5')



# Generate phrase

Great!
We have now trained a model to predict the next word of a given sequence of words. To generate text, the task is pretty simple:

 - we define a "seed" sequence of 30 words (30 is the number of words required by the neural net for the sequences),
 - we ask the neural net to predict word number 31,
 - then we update the sequence by shifting the words by a step of 1, adding word number 31 at its end,
 - we ask the neural net to predict word number 32,
 - etc., for as long as we want.
 
Doing this, we generate phrases, word by word.

In [27]:
#load vocabulary
print("loading vocabulary...")
vocab_file = os.path.join(save_dir, "words_vocab.pkl")

with open(vocab_file, 'rb') as f:
    words, vocab, vocabulary_inv = cPickle.load(f)

vocab_size = len(words)

loading vocabulary...


In [28]:
from keras.models import load_model
# load the model
print("loading model...")
model = load_model(save_dir + "/" + 'my_model_gen_sentences_lstm.final.hdf5')

loading model...


To improve the word generation and tune the prediction a bit, we introduce a specific function to pick words.

We will not always take the word with the highest predicted probability (or the generated text would be boring). Instead, we would like to introduce some uncertainty and let the solution sometimes pick words with a lower predicted probability.

That is the purpose of the function **sample**, which randomly draws a word from the vocabulary.

The probability of a word being drawn depends directly on its predicted probability of being the next word. To tune this probability, we introduce a "temperature" to smooth or sharpen the distribution: a temperature below 1 favours the most likely words even more, while a temperature above 1 flattens the distribution.

In [29]:
def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
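
To see what the temperature does, the sketch below isolates the re-weighting step of `sample` in a hypothetical helper, `apply_temperature`, and applies it to a made-up three-word prediction:

```python
import numpy as np

def apply_temperature(preds, temperature=1.0):
    """Return the re-normalised distribution that `sample` draws from."""
    preds = np.asarray(preds).astype('float64')
    logits = np.log(preds) / temperature
    exp_preds = np.exp(logits)
    return exp_preds / np.sum(exp_preds)

toy_preds = [0.5, 0.3, 0.2]  # hypothetical network output
for temperature in (1.0, 0.34, 2.0):
    print(temperature, np.round(apply_temperature(toy_preds, temperature), 3))
```

A temperature of 1 leaves the predictions unchanged. A low value such as the 0.34 used later for generation concentrates most of the probability on the top word, while still leaving a small chance for the others.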

In [63]:
# initialise the seed sequence
seed_sentences = "nolan avance sur le chemin de pierre et grimpe les marches ."
generated = ''

# keep at most the last seq_length words of the seed (each must be in the vocabulary)
seed = seed_sentences.split()[-seq_length:]
# left-pad with the word "a" so that the sequence is exactly seq_length words long
sentence = ["a"] * (seq_length - len(seed)) + seed

generated += ' '.join(sentence)
print('Generating text with the following seed: "' + ' '.join(sentence) + '"')

print()

Generating text with the following seed: "a a a a a a a a a a a a a a a a a a nolan avance sur le chemin de pierre et grimpe les marches ."



In [64]:
words_number = 100
#generate the text
for i in range(words_number):
    #create the vector
    x = np.zeros((1, seq_length, vocab_size))
    for t, word in enumerate(sentence):
        x[0, t, vocab[word]] = 1.
    #print(x.shape)

    #calculate next word
    preds = model.predict(x, verbose=0)[0]
    next_index = sample(preds, 0.34)
    next_word = vocabulary_inv[next_index]

    #add the next word to the text
    generated += " " + next_word
    # shift the sentence by one word, and add the next word at its end
    sentence = sentence[1:] + [next_word]

print(generated)


a a a a a a a a a a a a a a a a a a nolan avance sur le chemin de pierre et grimpe les marches . — oui , je vais vous expliquer . nous allons devoir être trop loin de cette cité … le jeune homme s’ est à l’ air de la citadelle . — c’ est vrai , mais je ne sais pas ce que je suis d’ accord avec toi , vous êtes bien comme des artistes . — je ne vois pas , ajoute nolan en secouant la tête . — je peux pas ces derniers jours … je ne peux pas être en ce moment , je suis d’ accord … et … nolan ne peut pas se faire confiance
