# Cervantes 2.0

## MLND - Capstone Project

![Miguel de Cervantes Saavedra](images/cervantes.jpg)

## Domain Background

[Deep Learning](https://en.wikipedia.org/wiki/Deep_learning) is a branch of [Machine Learning](https://en.wikipedia.org/wiki/Machine_learning) that has attracted a lot of attention lately because of the results its models produce. Deep Learning models can classify images (objects) with great accuracy, detect fraudulent transactions, and generate images, sound and text. Many of these tasks were previously out of reach for algorithms, and on some of them models now match or exceed human performance.

In this project we will focus on Text Generation. Text Generation is part of [Natural Language Processing](https://en.wikipedia.org/wiki/Natural_language_processing) and can be used to [transcribe speech to text](http://www.jmlr.org/proceedings/papers/v32/graves14.pdf), perform [machine translation](http://arxiv.org/abs/1409.3215), generate handwritten text, caption images, and write new blog posts or news headlines.

Recurrent Neural Networks (RNNs) are [very effective](http://karpathy.github.io/2015/05/21/rnn-effectiveness/) at modelling sequences of elements and have been used in the past to generate text. I will use a Recurrent Neural Network to generate text inspired by the works of Cervantes.

![Basic RNN -> Unrolled RNN](images/basic_unrolled_RNN.png)

In order to generate text, we will look at a class of Neural Networks in which connections between units form a directed cycle, called Recurrent Neural Networks (RNNs). RNNs use an internal memory to process sequences of elements and are able to learn from the syntactic structure of text. Our model will be able to generate text based on the text we train it with.
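
To make the idea of an internal memory concrete, here is a minimal sketch of one recurrent step in plain Python. It is only an illustration: the function name `rnn_step` and the toy weights are made up for this example and are not taken from cervantes_nn.py, which uses TensorFlow.

```python
import math

def rnn_step(x, h_prev, w_xh, w_hh, b):
    """One step of a vanilla RNN: h_t = tanh(W_xh x_t + W_hh h_(t-1) + b)."""
    h_new = []
    for j in range(len(b)):
        total = b[j]
        total += sum(w_xh[j][i] * x[i] for i in range(len(x)))
        total += sum(w_hh[j][k] * h_prev[k] for k in range(len(h_prev)))
        h_new.append(math.tanh(total))
    return h_new

# Hypothetical toy weights: 2 inputs, 2 hidden units
toy_w_xh = [[0.5, -0.3], [0.1, 0.8]]
toy_w_hh = [[0.2, 0.0], [-0.4, 0.6]]
toy_b = [0.0, 0.1]

# The hidden state is carried from one element of the sequence to the next
toy_state = [0.0, 0.0]
toy_states = []
for toy_x in [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]:
    toy_state = rnn_step(toy_x, toy_state, toy_w_xh, toy_w_hh, toy_b)
    toy_states.append(toy_state)
```

The first and third inputs are identical, yet `toy_states[0]` and `toy_states[2]` differ, because the third step also sees the state left behind by the earlier elements.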

## Problem Statement


[Miguel de Cervantes Saavedra](https://en.wikipedia.org/wiki/Miguel_de_Cervantes) was a Spanish writer who is regarded as the greatest writer in the Spanish language. He is famous for his novel [Don Quixote](https://en.wikipedia.org/wiki/Don_Quixote), considered one of the best works of fiction ever written.

Unfortunately, Cervantes passed away about 400 years ago (in 1616) and he will not be publishing new novels any time soon. But wouldn’t it be great if we could generate some text inspired by Don Quixote and the other novels he published?

To solve our problem, we can use text from novels written by Cervantes in combination with the power of Deep Learning, in particular RNNs, to generate text. Our deep learning model will be trained on existing Cervantes works and will output new text based on the internal representation of that text learned by the Neural Network.

![LSTM Cell](images/lstm_cell.png)

LSTM Cell

For our model to learn, we will use a special type of RNN called an LSTM (Long Short-Term Memory), which is capable of learning long-term dependencies. An LSTM can use its memory to generate complex, [realistic sequences](https://arxiv.org/pdf/1308.0850.pdf) containing long-range structure, just like the sentences that we want to generate. It is able to remember information for a period of time, which helps it generate text of better quality.
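
As a rough illustration of how an LSTM keeps information around, the sketch below implements a single-unit (scalar) LSTM step in plain Python using the standard gate equations. The names `lstm_step` and `remember`, and all parameter values, are assumptions made for this example; the project itself relies on TensorFlow's LSTM cells.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM cell (a single unit, for illustration only)."""
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])    # forget gate
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # input gate
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # output gate
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])  # candidate value
    c = f * c_prev + i * g   # cell state: keep part of the old memory, add new
    h = o * math.tanh(c)     # hidden state passed on to the next step
    return h, c

# Hypothetical parameters chosen so that the forget gate stays open (f close to 1)
# and the input gate stays closed (i close to 0)
remember = {"wf": 0.0, "uf": 0.0, "bf": 10.0,
            "wi": 0.0, "ui": 0.0, "bi": -10.0,
            "wo": 0.0, "uo": 0.0, "bo": 0.0,
            "wg": 0.0, "ug": 0.0, "bg": 0.0}

toy_h, toy_c = 0.0, 0.7
for toy_x in [1.0, -2.0, 0.5]:
    toy_h, toy_c = lstm_step(toy_x, toy_h, toy_c, remember)
```

With these gate settings the cell state stays very close to its starting value of 0.7 regardless of the inputs, which is the mechanism that lets a trained LSTM carry information across many words.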

## Datasets and Input

To train our model we will use the text of his most famous novel (Don Quixote) and of other [lesser-known works](http://www.gutenberg.org/cache/epub/14420/pg14420.txt) such as Lady Cornelia, The Deceitful Marriage and The Little Gipsy Girl. We will not include any plays, e.g. [Numancia](https://en.wikipedia.org/wiki/Miguel_de_Cervantes#La_Numancia), as their writing style differs from that of the novels and we want the generated text to follow the structure of a novel. None of these works is protected by copyright any longer and, thanks to [Project Gutenberg](https://www.gutenberg.org/), we are able to access the full text of [these books](https://www.gutenberg.org/ebooks/author/505).

Even though Cervantes's native language was Spanish, the text used to train our model will be in English. This is to make it easier for the reader to understand the input and output of our model.

Our dataset is small, as it is composed of only 2 files (Don Quixote and Exemplary Novels) with a total size of 3.4 MB. Bigger datasets work better when training an RNN, but for our very specific case it will be enough. Some additional information about the contents of the files is given below:

<table>
  <tr>
    <th colspan="2">File</th>
    <th colspan="4">Totals</th>
  </tr>
  <tr>
    <th>Name</th>
    <th>Size</th>
    <th>Pages</th>
    <th>Lines</th>
    <th>Words</th>
    <th>Unique Words</th>
  </tr>
  <tr>
    <td>DonQuixote.txt</td>
    <td>2.3 MB</td>
    <td>690</td>
    <td>40,008</td>
    <td>429,256</td>
    <td>42,154</td>
  </tr>
  <tr>
    <td>ExemplaryNovels.txt</td>
    <td>1.1 MB</td>
    <td>303</td>
    <td>17,572</td>
    <td>189,037</td>
    <td></td>
  </tr>
</table>

**Note:** Values in the table above will change after preprocessing.

There is some manual preprocessing that we need to do, as the text retrieved from Project Gutenberg contains additional content that is not needed to train the model, for example:

* Preface
* Translator’s Preface
* About the author
* Index
* Dedications
* Footnotes included in Exemplary Novels

**Note:** The files included in the dataset folder no longer contain the additional content mentioned above.

### Loading Data

Let's start by loading our data and exploring it.

In [1]:
import numpy as np

filenames = ["dataset/DonQuixote.txt", "dataset/ExemplaryNovels.txt"]

text = ""

# Concatenate the contents of both files into a single string
for fn in filenames:
    with open(fn, "r") as f:
        text += f.read()

In [2]:
print('Dataset Stats')
# A set keeps a single copy of each whitespace-separated word
print('Unique words: {}'.format(len(set(text.split()))))
# Chapters are split on four consecutive newline characters
chapters = text.split('\n\n\n\n')
print('Number of chapters: {}'.format(len(chapters)))
# Counts newline characters, so this measures lines of text per chapter
sentence_count_chapter = [chapter.count('\n') for chapter in chapters]
print('Average number of sentences in each chapters: {}'.format(np.average(sentence_count_chapter)))

sentences = [sentence for chapter in chapters for sentence in chapter.split('\n')]
print('Number of lines: {}'.format(len(sentences)))
word_count_sentence = [len(sentence.split()) for sentence in sentences]
print('Average number of words in each line: {}'.format(np.average(word_count_sentence)))

Dataset Stats
Unique words: 39229
Number of chapters: 135
Average number of sentences in each chapters: 392.7925925925926
Number of lines: 53162
Average number of words in each line: 10.975941461946503


### Extra Preprocessing 
We need to prepare our data for our RNN, so let's do some additional preprocessing:
* Lookup table: We need to map every word to an integer id so that the network can learn [word embeddings](https://www.tensorflow.org/tutorials/word2vec#motivation_why_learn_word_embeddings) for them, which facilitates the training of our model.

* Tokenize punctuation: This simplifies training for our neural network by making it easy to see that *mad* and *mad!* contain the same word.

In [3]:
from collections import Counter

def create_lookup_tables(text):
    """
    Create lookup tables for vocabulary
    :param text: The text of dataset split into words
    :return: A tuple of dicts (vocab_to_int, int_to_vocab)
    """
    counts = Counter(text)
    vocab = sorted(counts, key=counts.get, reverse=True)
    
    vocab_to_int = {word: i for i, word in enumerate(vocab)}
    int_to_vocab = {v:k for k, v in vocab_to_int.items()}
    
    return vocab_to_int, int_to_vocab
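
To see what these lookup tables look like, here is a small self-contained sketch that applies the same logic to a toy sentence. The helper is redefined under a different name (`build_lookup_tables`) and all `toy_` variables are made up for this example so that the notebook's own variables are left untouched.

```python
from collections import Counter

def build_lookup_tables(words):
    """Same idea as create_lookup_tables: most frequent words get the lowest ids."""
    counts = Counter(words)
    vocab = sorted(counts, key=counts.get, reverse=True)
    word_to_id = {word: i for i, word in enumerate(vocab)}
    id_to_word = {i: word for word, i in word_to_id.items()}
    return word_to_id, id_to_word

toy_words = "the knight and the squire and the horse".split()
toy_word_to_id, toy_id_to_word = build_lookup_tables(toy_words)

# Encode the sentence as integers and decode it back again
toy_encoded = [toy_word_to_id[w] for w in toy_words]
toy_decoded = [toy_id_to_word[i] for i in toy_encoded]
```

Here "the" is the most frequent word and gets id 0, "and" gets id 1, and decoding the ids returns the original list of words.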


### Tokenize Punctuation
We'll be splitting the text into a word array using spaces as delimiters. However, punctuation such as periods and exclamation marks makes it hard for the neural network to distinguish between the words "mad" and "mad!".

The `token_lookup` dict is used to tokenize symbols like "!" into "||exclamation_mark||". It covers the following symbols, where the symbol is the key and the token is the value:
- Period ( . )
- Comma ( , )
- Quotation Mark ( " )
- Semicolon ( ; )
- Exclamation mark ( ! )
- Question mark ( ? )
- Left Parenthesis ( ( )
- Right Parenthesis ( ) )
- Dash ( -- )
- Return ( \n )

This dictionary is used to tokenize the symbols and add a delimiter (space) around each of them. This separates each symbol into its own word, making it easier for the neural network to predict the next word. The tokens must not be confusable with real words, which is why we use something like "||dash||" instead of "dash".
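
As a small illustration of the idea, the sketch below tokenizes a short sentence with a reduced symbol table. The function name `tokenize_punctuation` and the `toy_` variables are made up for this example.

```python
# A reduced, hypothetical symbol table with only three entries
toy_tokens = {"!": "||exclamation_mark||", ",": "||comma||", "\n": "||return||"}

def tokenize_punctuation(raw_text, token_map):
    """Replace each symbol with its token, surrounded by spaces, then split."""
    for symbol, token in token_map.items():
        raw_text = raw_text.replace(symbol, " {} ".format(token))
    return raw_text.split()

toy_result = tokenize_punctuation("He is mad, quite mad!\n", toy_tokens)
```

Both occurrences of "mad" now appear as the same word, followed by a separate `||comma||` or `||exclamation_mark||` token.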

In [4]:
# Map each punctuation symbol to a token that cannot be confused with a word
token_lookup = {
    ".": "||period||",
    ",": "||comma||",
    '"': "||quotation_mark||",
    ";": "||semicolon||",
    "!": "||exclamation_mark||",
    "?": "||question_mark||",
    "(": "||l_parenthesis||",
    ")": "||r_parenthesis||",
    "--": "||dash||",
    "\n": "||return||",
}

Let's preprocess all the data and save it to a file.

In [5]:
import pickle

# Surround every punctuation token with spaces so it becomes its own word
for key, token in token_lookup.items():
    text = text.replace(key, ' {} '.format(token))

text = text.split()

vocab_to_int, int_to_vocab = create_lookup_tables(text)

# Encode the whole text as a list of word ids
int_text = [vocab_to_int[word] for word in text]

# Saving the preprocessed data
with open('preprocess.p', 'wb') as f:
    pickle.dump((int_text, vocab_to_int, int_to_vocab, token_lookup), f)

## Preprocess Checkpoint
The preprocessed data has been saved to disk, so there is no need to preprocess it again. Running the cell below makes it available to the notebook.

In [6]:
import numpy as np
import pickle

# Load the preprocessed data saved in the previous step
with open('preprocess.p', mode='rb') as f:
    int_text, vocab_to_int, int_to_vocab, token_dict = pickle.load(f)

## Cervantes Neural Network
Before getting started, let's check some requirements to run the Neural Network.



### Check the Version of TensorFlow and Access to GPU

A GPU is recommended for training the Cervantes Neural Network, as text generation models take a long time to train.

In [7]:
from distutils.version import LooseVersion
import warnings
import tensorflow as tf

# Check TensorFlow Version
assert LooseVersion(tf.__version__) >= LooseVersion('1.0'), 'Please use TensorFlow version 1.0 or newer'
print('TensorFlow Version: {}'.format(tf.__version__))

# Check for a GPU
if not tf.test.gpu_device_name():
    warnings.warn('No GPU found. Please use a GPU to train your neural network as text generation takes a long time to train in order to achieve good results.')
else:
    print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))

TensorFlow Version: 1.0.0
Default GPU Device: /gpu:0


### Neural Network Code
The building blocks of the Cervantes Neural Network are included in cervantes_nn.py. If you want to view the code, run `cervnn??` in a separate cell after importing it.

Functions included in cervantes_nn:
- get_inputs: Creates the TF placeholders for the Neural Network.
- get_init_cell: Creates our RNN cell and initialises it.
- get_embed: Applies [embedding](https://www.tensorflow.org/tutorials/word2vec) to our input data.
- build_rnn: Creates an RNN using an RNN cell.
- build_nn: Applies embedding to the input data using get_embed, builds the RNN using the cell and build_rnn and, finally, applies a [fully connected layer](https://www.tensorflow.org/api_docs/python/tf/contrib/layers/fully_connected) with a linear activation.
- get_batches: Creates the batches of input and target sequences used during training.
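
The code of get_batches is not reproduced in this notebook, so the following is only a simplified sketch of how a list of word ids could be cut into (input, target) batches, where each target sequence is the input sequence shifted by one word. The function name `make_batches` and its parameters are assumptions, and the real implementation may order the sequences differently, for example so that the RNN state carried between batches lines up.

```python
def make_batches(token_ids, n_seqs, seq_len):
    """Split a list of word ids into (inputs, targets) batches."""
    tokens_per_batch = n_seqs * seq_len
    # One extra token is needed so the last target sequence is complete
    n_batches = (len(token_ids) - 1) // tokens_per_batch
    result = []
    for b in range(n_batches):
        start = b * tokens_per_batch
        inputs, targets = [], []
        for s in range(n_seqs):
            lo = start + s * seq_len
            inputs.append(token_ids[lo:lo + seq_len])
            # The target for each word is the word that follows it
            targets.append(token_ids[lo + 1:lo + seq_len + 1])
        result.append((inputs, targets))
    return result

toy_ids = list(range(13))
toy_batches = make_batches(toy_ids, n_seqs=2, seq_len=3)
```

With 13 toy ids, two sequences per batch and three words per sequence, this yields two batches; the first input batch is `[[0, 1, 2], [3, 4, 5]]` and its targets are `[[1, 2, 3], [4, 5, 6]]`.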

In [48]:
import cervantes_nn as cervnn

cervnn.reset_graph()

In [9]:
# View the code of cervantes_nn
cervnn??

## Cervantes Neural Network Training
### Hyperparameters
The following parameters are used to tune the Neural Network:

- `batch_size`: Number of sequences processed in each training step.
- `num_epochs`: Number of full passes over the training data.
- `rnn_layer_size`: Number of stacked RNN layers.
- `rnn_size`: Number of hidden units in each RNN layer.
- `embed_dim`: Size of the word embedding vectors.
- `seq_length`: Number of words included in every sequence, e.g. a sequence of five words.
- `learning_rate`: Step size used by the optimizer, which controls how fast or slow the Neural Network learns.
- `dropout`: Simple way to prevent an RNN from overfitting - [link](http://jmlr.org/papers/v15/srivastava14a.html). The value is passed to `get_init_cell` as `keep_prob`, which suggests it is the probability of keeping a unit rather than of dropping it.
- `show_every_n_batches`: Number of batches between progress messages.
- `save_every_n_epochs`: Number of epochs between saved checkpoints.
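
Since the `dropout` value is handed to the network as a keep probability, the sketch below shows what a keep probability of 0.6 means in practice, using inverted dropout (kept values are scaled by `1 / keep_prob`, which is also how TensorFlow's dropout behaves). The function name `apply_dropout` and the `toy_` variables are made up for this example, and whether cervantes_nn.py applies dropout in exactly this place is an assumption.

```python
import random

def apply_dropout(values, keep_prob, rng):
    """Inverted dropout: keep each value with probability keep_prob and rescale it."""
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]

toy_rng = random.Random(0)
toy_activations = [1.0] * 10000
toy_dropped = apply_dropout(toy_activations, keep_prob=0.6, rng=toy_rng)

# Fraction of units that survived, and the average activation after rescaling
toy_kept_fraction = sum(1 for v in toy_dropped if v != 0.0) / len(toy_dropped)
toy_mean = sum(toy_dropped) / len(toy_dropped)
```

Roughly 60% of the units survive, and because the survivors are rescaled the average activation stays close to its original value of 1.0.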

In [108]:
# Batch Size
batch_size = 512
# Number of Epochs
num_epochs = 500
# RNN Layers
rnn_layer_size = 2
# RNN Size
rnn_size = 256
# Embedding Dimension Size
# Using 300 as it is commonly used in Google's news word vectors and the GloVe vectors
embed_dim = 300
# Sequence Length
seq_length = 20
# Learning Rate
learning_rate = 0.001
# Dropout
dropout = 0.6

# Show stats for every n number of batches
show_every_n_batches = 100
# Save progress for every n number of epochs
save_every_n_epochs = 100

# Define saving directories
save_dir = './checkpoints/save'
logs_dir = './logs/'

### Build the Graph
Build the graph using the Cervantes neural network.

In [109]:
from tensorflow.contrib import seq2seq

train_graph = tf.Graph()
with train_graph.as_default():
    # Inputs
    vocab_size = len(int_to_vocab)
    input_text, targets, lr = cervnn.get_inputs()
    input_data_shape = tf.shape(input_text)
    
    # Define the RNN cell
    cell, initial_state = cervnn.get_init_cell(batch_size=input_data_shape[0], 
                                               rnn_layers=rnn_layer_size, 
                                               rnn_size=rnn_size,
                                               keep_prob=dropout)
    # Builds Neural Network
    logits, final_state = cervnn.build_nn(cell, input_text, vocab_size, embed_dim,
                                         batch_size, rnn_layer_size, rnn_size, dropout)

    
    # Probabilities for generating words
    probs = tf.nn.softmax(logits, name='probs')

    # Loss function
    cost = seq2seq.sequence_loss(
        logits,
        targets,
        tf.ones([input_data_shape[0], input_data_shape[1]]))

    # Optimizer
    optimizer = tf.train.AdamOptimizer(lr)

    # Gradient Clipping
    gradients = optimizer.compute_gradients(cost)
    capped_gradients = [(tf.clip_by_value(grad, -1., 1.), var) for grad, var in gradients if grad is not None]
    train_op = optimizer.apply_gradients(capped_gradients)

RNN Layers: 2 and Size: 256, Batch Size: Tensor("strided_slice:0", shape=(), dtype=int32)
(?, ?, 300)
RNN Layers: 2 and Size: 256, Batch Size: 512
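
The gradient clipping step in the graph limits every gradient component to the range [-1, 1] before it is applied, which helps to keep RNN training stable when gradients become very large. A plain-Python sketch of the same operation on a made-up list of gradient values:

```python
def clip_by_value(values, low, high):
    """Limit every value to the range [low, high]."""
    return [max(low, min(high, v)) for v in values]

# Hypothetical gradient values, two of them outside the allowed range
toy_gradients = [-3.2, -0.4, 0.0, 0.9, 5.0]
toy_clipped = clip_by_value(toy_gradients, -1.0, 1.0)
```

Values that are already inside the range are left unchanged, while -3.2 and 5.0 become -1.0 and 1.0.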


## Train
Train the Cervantes neural network on the preprocessed data.

In [None]:
batches = cervnn.get_batches(int_text, batch_size, seq_length)

# file_name_suffix = "-lr-{}-epochs-{}-sqe_length-{}-".format(learning_rate, num_epochs, seq_length)
run_id = '0004'

training_log = "batch_size: {}\nepochs: {}\nrnn_layer_size: {}\nrnn_size: {}\nembed_dim: {}\nseq_length: {}\nlr: {}\ndropout: {}\n--------\n".format(batch_size, num_epochs, rnn_layer_size, rnn_size, embed_dim, seq_length, learning_rate, dropout)

with tf.Session(graph=train_graph) as sess:
    sess.run(tf.global_variables_initializer())
    # Create the saver once; max_to_keep=None keeps every checkpoint on disk
    saver = tf.train.Saver(max_to_keep=None)

    for epoch_i in range(num_epochs):
        state = sess.run(initial_state, {input_text: batches[0][0]})

        for batch_i, (x, y) in enumerate(batches):
            feed = {
                input_text: x,
                targets: y,
                initial_state: state,
                lr: learning_rate}
            train_loss, state, _ = sess.run([cost, final_state, train_op], feed)

            # Show every <show_every_n_batches> batches
            if (epoch_i * len(batches) + batch_i) % show_every_n_batches == 0:
                current_log = 'Epoch {:>3} Batch {:>4}/{}   train_loss = {:.3f}'.format(
                    epoch_i,
                    batch_i,
                    len(batches),
                    train_loss)
                training_log += current_log + "\n"
                print(current_log)

        # Save every <save_every_n_epochs> epochs, once the epoch has finished
        if (epoch_i + 1) % save_every_n_epochs == 0:
            saver.save(sess, save_dir + '-' + run_id + '--c_epoch-' + str(epoch_i + 1))
            model_saved_msg = 'Model Trained and Saved - Epoch: ' + str(epoch_i + 1)
            print(model_saved_msg)
            training_log += model_saved_msg + "\n"

    # Save Model
    saver.save(sess, save_dir)
    print('Model Trained and Saved')

    # Write the training log to disk
    with open(logs_dir + "training_log-{}.txt".format(run_id), "w") as text_file:
        text_file.write(training_log)

Epoch   0 Batch    0/70   train_loss = 9.989
Epoch   1 Batch   30/70   train_loss = 6.203
Epoch   2 Batch   60/70   train_loss = 6.203
Epoch   4 Batch   20/70   train_loss = 6.171
Epoch   5 Batch   50/70   train_loss = 6.084
Epoch   7 Batch   10/70   train_loss = 5.993
Epoch   8 Batch   40/70   train_loss = 5.923
Epoch  10 Batch    0/70   train_loss = 5.844
Epoch  11 Batch   30/70   train_loss = 5.831
Epoch  12 Batch   60/70   train_loss = 5.772
Epoch  14 Batch   20/70   train_loss = 5.714
Epoch  15 Batch   50/70   train_loss = 5.690
Epoch  17 Batch   10/70   train_loss = 5.619
Epoch  18 Batch   40/70   train_loss = 5.558
Epoch  20 Batch    0/70   train_loss = 5.482
Epoch  21 Batch   30/70   train_loss = 5.498
Epoch  22 Batch   60/70   train_loss = 5.468
Epoch  24 Batch   20/70   train_loss = 5.428
Epoch  25 Batch   50/70   train_loss = 5.407
Epoch  27 Batch   10/70   train_loss = 5.350
Epoch  28 Batch   40/70   train_loss = 5.297
Epoch  30 Batch    0/70   train_loss = 5.226
Epoch  31 

## Save Parameters
Save `seq_length` and `save_dir` for generating a new Cervantes text.

In [78]:
# Save parameters for checkpoint
with open('params-{}.p'.format(run_id), 'wb') as f:
    pickle.dump((seq_length, save_dir), f)

## Training results

The table below captures the results of training the Cervantes Neural Network with different hyperparameters:

| Run ID | Batch Size | Epochs | RNN Layers | RNN Size | Embed Dim | Seq Length | LR | Dropout | Train Loss |
|:---:|:---:|:---:|:---:|:----:|:----:|:----:|:----:|:-----:|:-----:|
| 0001 | 512 | 300 | 2 | 256 | 300 | 5 | 0.01 | 0.6 | 3.438 |
| 0002 | 512 | 500 | 2 | 256 | 500 | 5 | 0.001 | 0.6 | 1.488 |
| 0003 | 512 | 300 | 2 | 256 | 300 | 10 | 0.01 | 0.6 | 3.015 |
| 0004 | 512 | 500 | 2 | 256 | 500 | 10 | 0.001 | 0.6 | 1.220 |
| 0005 | 512 | 500 | 2 | 256 | 500 | 20 | 0.001 | 0.6 | 1.220 |
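
One way to read these loss values is to convert them to perplexity, the exponential of the loss. This is a hedged sketch: it assumes the reported train loss is the average cross-entropy per word measured with the natural logarithm, and the helper name `perplexity` is made up for this example.

```python
import math

def perplexity(cross_entropy_loss):
    """Perplexity of a model whose average cross-entropy (in nats) is given."""
    return math.exp(cross_entropy_loss)

# Loss at the very first batch of the training log and the best loss in the table
toy_initial = perplexity(9.989)  # roughly 21,800
toy_final = perplexity(1.220)    # roughly 3.4
```

Under that assumption, the untrained network is about as uncertain as a uniform guess over roughly 21,800 words, while the best runs are as uncertain as a choice between three or four words. These are training losses, so part of that improvement may come from memorising the training text.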




## Generate Cervantes Text

Before generating text, let's import our preprocessed data and the params of our run.

In [104]:
import pickle

import numpy as np
import tensorflow as tf

run_id = "0004"

# Load the lookup tables and the parameters saved for this run
with open('preprocess.p', mode='rb') as f:
    _, vocab_to_int, int_to_vocab, token_dict = pickle.load(f)
with open('params-{}.p'.format(run_id), mode='rb') as f:
    seq_length, meta_dir = pickle.load(f)

The functions below will generate Cervantes text based on some input.
- `load_dir`: Location where the graph metadata is saved
- `prime_word`: First word used to generate text
- `gen_length`: Length of text we want to generate.

In [96]:
def pick_word(probabilities, int_to_vocab):
    """
    Pick the next word in the generated text
    :param probabilities: Probabilities of the next word
    :param int_to_vocab: Dictionary of word ids as the keys and words as the values
    :return: String of the predicted word
    """
    # Sample a word id according to the predicted probabilities to add randomness,
    # then look the word up by id so the result does not depend on dict ordering.
    # A deterministic alternative is int_to_vocab[np.argmax(probabilities)].
    word_id = np.random.choice(len(probabilities), p=probabilities)
    return int_to_vocab[word_id]

def generate_text(load_dir, prime_word, gen_length):
    """
    Generates text
    :param load_dir: Location where the graph metadata is saved
    :param prime_word: First word used to generate text
    :param gen_length: How long the generated text will be
    :return: Generated text
    """
    loaded_graph = tf.Graph()
    with tf.Session(graph=loaded_graph) as sess:
        # Load saved model
        loader = tf.train.import_meta_graph(load_dir + '.meta')
        loader.restore(sess, load_dir)

        # Get Tensors from loaded model
        input_text, initial_state, final_state, probs = cervnn.get_tensors(loaded_graph)

        # Sentences generation setup
        gen_sentences = [prime_word]
        prev_state = sess.run(initial_state, {input_text: np.array([[1]])})

        # Generate sentences
        for n in range(gen_length):
            # Dynamic Input
            dyn_input = [[vocab_to_int[word] for word in gen_sentences[-seq_length:]]]
            dyn_seq_length = len(dyn_input[0])

            # Get Prediction
            probabilities, prev_state = sess.run(
                [probs, final_state],
                {input_text: dyn_input, initial_state: prev_state})

            pred_word = pick_word(probabilities[dyn_seq_length-1], int_to_vocab)

            gen_sentences.append(pred_word)

        # Replace the punctuation tokens with the original symbols
        generated_text = ' '.join(gen_sentences)
        for key, token in token_dict.items():
            generated_text = generated_text.replace(' ' + token.lower(), key)
        # Remove the space left after line breaks and opening parentheses
        generated_text = generated_text.replace('\n ', '\n')
        generated_text = generated_text.replace('( ', '(')

        return generated_text
    
def print_text_for(run_id, epochs, initial_word, initial_epoch=100, text_length=100):
    # Generate a sample from each checkpoint, in steps of 100 epochs
    for epoch in range(initial_epoch, epochs + 100, 100):
        checkpoint = meta_dir + '-' + run_id + '--c_epoch-' + str(epoch)
        generated = generate_text(checkpoint, initial_word, text_length)
        print('-----------\n{} at run_id: {}, epoch: {}, text generated: \n------------\n{}'.format(
            initial_word, run_id, epoch, generated))
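
The choice between sampling the next word and always taking the most likely one has a large effect on the generated text. The following plain-Python sketch contrasts the two strategies on a made-up three-word vocabulary; the function names and `toy_` values are hypothetical and only mirror the idea behind `pick_word`.

```python
import random

def sample_word(probabilities, id_to_word, rng):
    """Draw a word id at random, weighted by the predicted probabilities."""
    word_ids = list(range(len(probabilities)))
    chosen = rng.choices(word_ids, weights=probabilities, k=1)[0]
    return id_to_word[chosen]

def most_likely_word(probabilities, id_to_word):
    """Always return the word with the highest probability."""
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return id_to_word[best]

toy_vocab = {0: "Sancho", 1: "Quixote", 2: "windmill"}
toy_probs = [0.2, 0.7, 0.1]

toy_sampler = random.Random(42)
toy_samples = [sample_word(toy_probs, toy_vocab, toy_sampler) for _ in range(1000)]
toy_greedy = most_likely_word(toy_probs, toy_vocab)
```

The greedy strategy always answers "Quixote", which tends to make generated text repetitive, while sampling mostly picks "Quixote" but still produces the other words from time to time.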

Let's compare the text generated by our Cervantes Neural Network across the different training runs.

In [None]:
# Run 0001

print_text_for(run_id='0001', epochs=300, initial_word="Quixote", initial_epoch=300, text_length=200)

In [None]:

print_text_for(run_id='0004', epochs=500, initial_word="Quixote", initial_epoch=500, text_length=200)

In [None]:

print_text_for(run_id='0005', epochs=500, initial_word="Quixote", initial_epoch=500, text_length=200)

In [92]:
print_text_for(run_id='0001', epochs=300, initial_word="Sancho")

-----------
Sancho at run_id: 0001, epoch: 100, text generated: 
------------
Sancho--" I should do thou king in aught!" said he," give me so well for the least frequented, now
did you; nevertheless say to follow a flight without rank and injurious, or some his servant's head, wit that ring had been provoked; and
Sancho caught out, and
the world stand, incessant doubtful
they meet him,
commending himself up and Christians; and everyone nor is without speaking sorely discomfited, and the stranger asked Don Quixote of spilling the inn mentioned the
-----------
Sancho at run_id: 0001, epoch: 200, text generated: 
------------
Sancho, for when for thy persons approaching and had made herself passed at the capture, and blessing
back being astonished; and with what I now must bring you other
difference other gentleman) invited him to go, on desolate heaths
me of my own tastes?" Startled so we,
waiting got applying how thou hast." She
accepts his grandeur, with no
that respect, sprightly or n

In [93]:
print_text_for(run_id='0001', epochs=300, initial_word="Sancho")

-----------
Sancho at run_id: 0001, epoch: 100, text generated: 
------------
Sancho;" be seen it had fallen in pieces, and is as yet I called another, brother, and my sister knew that there,
Wearied the duchess, and in the people of us to
ransom Preciosa, and mistress of Rodaja had roses of thrashing, or injurious to love
reaches journey, time for each one beast, for thou art by my entire man, for you are sure will.

" Many that are we may be able to think."

" A discerning man
-----------
Sancho at run_id: 0001, epoch: 200, text generated: 
------------
Sancho
Panza, whom, to all
my grave, that the
book of justice a trifling countenance as of Dorothea, and believe are that they ought to make
chance together, and had made so piteous taken countess, or screened by spending them. The curate made with his only orders
were reviving and other bowels of
linen,
in order from eight loaves of yours of short of us never have you
dare to look to attack the princess to go before at their huts to 

## References

* NLP Tokenization - [https://nlp.stanford.edu/IR-book/html/htmledition/tokenization-1.html](https://nlp.stanford.edu/IR-book/html/htmledition/tokenization-1.html)

* Vector Representations of Words - [https://www.tensorflow.org/tutorials/word2vec#motivation_why_learn_word_embeddings](https://www.tensorflow.org/tutorials/word2vec#motivation_why_learn_word_embeddings) 

* Recurrent Neural Networks - [https://www.tensorflow.org/tutorials/recurrent](https://www.tensorflow.org/tutorials/recurrent) 

* Alex Graves - Generating Sequences With Recurrent Neural Networks - [https://arxiv.org/pdf/1308.0850.pdf](https://arxiv.org/pdf/1308.0850.pdf)

* Christopher Olah - Understanding LSTM Networks - [http://colah.github.io/posts/2015-08-Understanding-LSTMs/](http://colah.github.io/posts/2015-08-Understanding-LSTMs/)

* Prasad Kawthekar, Raunaq Rewari, Suvrat Bhooshan - Evaluating Generative Models for Text Generation - [https://web.stanford.edu/class/cs224n/reports/2737434.pdf](https://web.stanford.edu/class/cs224n/reports/2737434.pdf) 