# Cervantes 2.0

## MLND - Capstone Project

![Miguel de Cervantes Saavedra](images/cervantes.jpg)

## Domain Background

[Deep Learning](https://en.wikipedia.org/wiki/Deep_learning) is an area of [Machine Learning](https://en.wikipedia.org/wiki/Machine_learning) that has attracted a lot of attention lately because of the results produced by Deep Learning models. With Deep Learning, an algorithm can classify images (objects) with great accuracy, detect fraudulent transactions, and generate images, sound and text. Many of these tasks were previously out of reach for algorithms, and on some of them Deep Learning models now match or exceed human performance.

In this project we will focus on Text Generation. Text Generation is part of [Natural Language Processing](https://en.wikipedia.org/wiki/Natural_language_processing) and can be used to [transcribe speech to text](http://www.jmlr.org/proceedings/papers/v32/graves14.pdf), perform [machine translation](http://arxiv.org/abs/1409.3215), generate handwritten text, caption images, or generate new blog posts or news headlines.

RNNs are [very effective](http://karpathy.github.io/2015/05/21/rnn-effectiveness/) at modelling sequences of elements and have been used in the past to generate text. I will use a Recurrent Neural Network to generate text inspired by the works of Cervantes.

![Basic RNN -> Unrolled RNN](images/basic_unrolled_RNN.png)

In order to generate text, we will look at a class of Neural Networks where connections between units form a directed cycle, called Recurrent Neural Networks (RNNs). RNNs use an internal memory to process sequences of elements and are able to learn the syntactic structure of text. Our model will be able to generate text based on the text we train it with.
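
To make the idea of an internal memory concrete, here is a minimal NumPy sketch of a vanilla RNN reading a sequence one element at a time. The weight names (`W_xh`, `W_hh`, `b_h`), the sizes and the random initialisation are illustrative assumptions; the model in this project is built with TensorFlow, not with this code.

```python
import numpy as np

def rnn_forward(inputs, W_xh, W_hh, b_h):
    """Run a vanilla RNN over a sequence and return every hidden state."""
    hidden = np.zeros(W_hh.shape[0])  # the internal memory starts empty
    states = []
    for x_t in inputs:
        # The new state depends on the current input AND the previous state
        hidden = np.tanh(W_xh @ x_t + W_hh @ hidden + b_h)
        states.append(hidden)
    return np.stack(states)

rng = np.random.default_rng(0)
input_size, hidden_size, seq_len = 4, 3, 5  # hypothetical sizes
W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
b_h = np.zeros(hidden_size)

sequence = rng.normal(size=(seq_len, input_size))
states = rnn_forward(sequence, W_xh, W_hh, b_h)
print(states.shape)  # one hidden state per element of the sequence
```

The same weights are reused at every step, which is what the unrolled diagram above depicts.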

## Problem Statement


[Miguel de Cervantes Saavedra](https://en.wikipedia.org/wiki/Miguel_de_Cervantes) was a Spanish writer who is widely regarded as the greatest writer in the Spanish language. He is famous for his novel [Don Quixote](https://en.wikipedia.org/wiki/Don_Quixote), considered one of the best works of fiction ever written.

Unfortunately, Cervantes passed away more than 400 years ago (in 1616), and he will not be publishing new novels any time soon... But wouldn’t it be great if we could generate some text inspired by Don Quixote and the other novels he published?

To solve our problem, we can use text from novels written by Cervantes in combination with the power of Deep Learning, in particular RNNs, to generate text. Our deep learning model will be trained on existing Cervantes works and will output new text based on the internal representation of that text learned by the Neural Network.

![LSTM Cell](images/lstm_cell.png)

LSTM Cell

For our model to learn, we will use a special type of RNN called LSTM (Long Short-Term Memory), capable of learning long-term dependencies. An LSTM can use its memory to generate complex, [realistic sequences](https://arxiv.org/pdf/1308.0850.pdf) containing long-range structure, just like the sentences that we want to generate. It is able to remember information for a period of time, which will help generate text of better quality.
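
As a rough illustration of how an LSTM keeps and updates its memory, the sketch below implements one step of the standard LSTM equations in NumPy. The single stacked weight matrix `W`, the gate ordering and the sizes are assumptions made for this example; the TensorFlow cell used in the project manages its own parameters.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W maps [h_prev, x_t] to the four gate pre-activations."""
    n = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f = sigmoid(z[0 * n:1 * n])   # forget gate: how much old memory to keep
    i = sigmoid(z[1 * n:2 * n])   # input gate: how much new information to store
    o = sigmoid(z[2 * n:3 * n])   # output gate: how much memory to expose
    g = np.tanh(z[3 * n:4 * n])   # candidate values to write into memory
    c_t = f * c_prev + i * g      # cell state: the long-term memory
    h_t = o * np.tanh(c_t)        # hidden state: the output of the cell
    return h_t, c_t

rng = np.random.default_rng(0)
input_size, hidden_size = 4, 3    # hypothetical sizes
W = rng.normal(scale=0.1, size=(4 * hidden_size, hidden_size + input_size))
b = np.zeros(4 * hidden_size)

h, c = np.zeros(hidden_size), np.zeros(hidden_size)
for x_t in rng.normal(size=(6, input_size)):
    h, c = lstm_step(x_t, h, c, W, b)
print(h.shape, c.shape)
```

The cell state `c_t` is only scaled by the forget gate and added to, which is what lets information survive over many steps.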

## Datasets and Input

To train our model we will use the text from his most famous novel (Don Quixote) and other [lesser-known works](http://www.gutenberg.org/cache/epub/14420/pg14420.txt) like Lady Cornelia, The Deceitful Marriage, The Little Gipsy Girl, etc. We will not include any plays, e.g. [Numancia](https://en.wikipedia.org/wiki/Miguel_de_Cervantes#La_Numancia), as their writing style differs from the novels and we want the generated text to follow the structure of a novel. None of the novels is protected by copyright any longer and, thanks to [Project Gutenberg](https://www.gutenberg.org/), we are able to access the full text of [these books](https://www.gutenberg.org/ebooks/author/505).

Even though Cervantes’ native language was Spanish, the text used to train our model is in English. This makes it easier for the reader to understand the input and output of our model.

Our dataset is small, as it is composed of only 2 files - Don Quixote and Exemplary Novels - with a total size of 3.4 MB. Bigger datasets work better when training an RNN, but for our very specific case it will be enough. Some additional information about the contents of the files is given below:

<table>
  <tr>
    <th colspan="2">File</th>
    <th colspan="4">Totals</th>
  </tr>
  <tr>
    <th>Name</th>
    <th>Size</th>
    <th>Pages</th>
    <th>Lines</th>
    <th>Words</th>
    <th>Unique Words</th>
  </tr>
  <tr>
    <td>DonQuixote.txt</td>
    <td>2.3 MB</td>
    <td>690</td>
    <td>40,008</td>
    <td>429,256</td>
    <td>42,154</td>
  </tr>
  <tr>
    <td>ExemplaryNovels.txt</td>
    <td>1.1 MB</td>
    <td>303</td>
    <td>17,572</td>
    <td>189,037</td>
    <td></td>
  </tr>
</table>
* Note: Values in the table above will change after preprocessing.

There is some manual preprocessing that we need to do, as the text retrieved from Project Gutenberg contains additional content that is not necessary to train the model, for example:

* Preface
* Translator’s Preface
* About the author
* Index
* Dedications
* Footnotes included in Exemplary Novels

**Note:** The files included in the dataset folder no longer contain the additional content mentioned above.

### Loading Data

Let's start by loading our data and exploring it.

In [1]:
import numpy as np

filenames = ["dataset/DonQuixote.txt", "dataset/ExemplaryNovels.txt"]

text = ""

for fn in filenames:
    with open(fn, "r") as f:
        text += f.read()

In [2]:
print('Dataset Stats')
print('Unique words: {}'.format(len(set(text.split()))))
chapters = text.split('\n\n\n\n')
print('Number of chapters: {}'.format(len(chapters)))
sentence_count_chapter = [chapter.count('\n') for chapter in chapters]
print('Average number of sentences in each chapters: {}'.format(np.average(sentence_count_chapter)))

sentences = [sentence for chapter in chapters for sentence in chapter.split('\n')]
print('Number of lines: {}'.format(len(sentences)))
word_count_sentence = [len(sentence.split()) for sentence in sentences]
print('Average number of words in each line: {}'.format(np.average(word_count_sentence)))

Dataset Stats
Unique words: 39229
Number of chapters: 135
Average number of sentences in each chapters: 392.7925925925926
Number of lines: 53162
Average number of words in each line: 10.975941461946503


### Extra Preprocessing 
We need to prepare our data for our RNN, so let's do some additional preprocessing:
* Lookup table: We need to map every word to an integer id (and back), so that the model can learn [word embeddings](https://www.tensorflow.org/tutorials/word2vec#motivation_why_learn_word_embeddings) from those ids.

* Tokenize punctuation: This simplifies training for our neural network, making it easy for it to distinguish between *mad* and *mad!*

In [3]:
from collections import Counter

def create_lookup_tables(text):
    """
    Create lookup tables for vocabulary
    :param text: The text of dataset split into words
    :return: A tuple of dicts (vocab_to_int, int_to_vocab)
    """
    counts = Counter(text)
    vocab = sorted(counts, key=counts.get, reverse=True)
    
    vocab_to_int = {word: i for i, word in enumerate(vocab)}
    int_to_vocab = {v:k for k, v in vocab_to_int.items()}
    
    return vocab_to_int, int_to_vocab


### Tokenize Punctuation
We'll be splitting the text into a word array using spaces as delimiters. However, punctuation such as periods and exclamation marks makes it hard for the neural network to distinguish between the words "mad" and "mad!".

We define the dictionary `token_lookup`, which is used to turn symbols like "!" into tokens like "||exclamation_mark||". It covers the following symbols, where the symbol is the key and the token is the value:
- Period ( . )
- Comma ( , )
- Quotation Mark ( " )
- Semicolon ( ; )
- Exclamation mark ( ! )
- Question mark ( ? )
- Left Parentheses ( ( )
- Right Parentheses ( ) )
- Dash ( -- )
- Return ( \n )

This dictionary is used to tokenize the symbols and add a delimiter (space) around them. This separates each symbol into its own word, making it easier for the neural network to predict the next word. The tokens must not be confusable with real words: instead of using a token like "dash", we use "||dash||".

In [4]:
token_lookup = {
    ".": "||period||",
    ",": "||comma||",
    '"': "||quotation_mark||",
    ";": "||semicolon||",
    "!": "||exclamation_mark||",
    "?": "||question_mark||",
    "(": "||l_parenthesis||",
    ")": "||r_parenthesis||",
    "--": "||dash||",
    "\n": "||return||",
}

Let's preprocess all the data and save it to a file.

In [5]:
import pickle

for key, token in token_lookup.items():
    text = text.replace(key, ' {} '.format(token))

text = text.split()

vocab_to_int, int_to_vocab = create_lookup_tables(text)

int_text = [vocab_to_int[word] for word in text]

# Saving the preprocessed data
with open('preprocess.p', 'wb') as f:
    pickle.dump((int_text, vocab_to_int, int_to_vocab, token_lookup), f)
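
To see what these steps do to the text, the following small walkthrough applies the same idea to a single sentence. It is only a sketch: `sample_tokens` is a reduced, hypothetical version of `token_lookup`, and the names `word_to_id` and `id_to_word` are chosen for this example.

```python
from collections import Counter

# A reduced token table; the project uses the full `token_lookup` dictionary
sample_tokens = {",": "||comma||", "!": "||exclamation_mark||"}

sample = "He is mad, mad!"
for symbol, token in sample_tokens.items():
    # Surround every token with spaces so it becomes a word of its own
    sample = sample.replace(symbol, " {} ".format(token))
words = sample.split()
print(words)
# ['He', 'is', 'mad', '||comma||', 'mad', '||exclamation_mark||']

# Same idea as create_lookup_tables: the most frequent word gets the smallest id
counts = Counter(words)
vocab = sorted(counts, key=counts.get, reverse=True)
word_to_id = {word: i for i, word in enumerate(vocab)}
id_to_word = {i: word for word, i in word_to_id.items()}

encoded = [word_to_id[word] for word in words]
decoded = [id_to_word[i] for i in encoded]
print(encoded)
print(decoded == words)
```

Both occurrences of "mad" now share one id, which is the purpose of separating the punctuation.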

## Preprocess Check Point
The preprocessed data has been saved to disk, so there is no need to preprocess it again. Running the cell below makes it available to the notebook.

In [6]:
import numpy as np
import pickle

with open('preprocess.p', mode='rb') as f:
    int_text, vocab_to_int, int_to_vocab, token_dict = pickle.load(f)

## Cervantes Neural Network
Before getting started, let's check some requirements to run the Neural Network.



### Check the Version of TensorFlow and Access to GPU

A GPU is recommended, as training the Cervantes Neural Network for text generation takes a long time.

In [7]:
from distutils.version import LooseVersion
import warnings
import tensorflow as tf

# Check TensorFlow Version
assert LooseVersion(tf.__version__) >= LooseVersion('1.0'), 'Please use TensorFlow version 1.0 or newer'
print('TensorFlow Version: {}'.format(tf.__version__))

# Check for a GPU
if not tf.test.gpu_device_name():
    warnings.warn('No GPU found. Please use a GPU to train your neural network as text generation takes a long time to train in order to achieve good results.')
else:
    print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))

TensorFlow Version: 1.0.0
Default GPU Device: /gpu:0


### Neural Network Code
The building blocks of the Cervantes Neural Network are included in cervantes_nn.py. To view the code, run *cervnn??* in a separate cell after importing it.

Functions included in cervantes_nn:
- get_inputs: Creates the TF Placeholders for the Neural Network
- get_init_cell: Creates our RNN cell and initialises it.
- get_embed: Applies [embedding](https://www.tensorflow.org/tutorials/word2vec) to our input data.
- build_rnn: Creates an RNN using an RNN cell
- build_nn: Applies embedding to the input data using the get_embed function, builds the RNN using the cell and the build_rnn function and, finally, applies a [fully connected layer](https://www.tensorflow.org/api_docs/python/tf/contrib/layers/fully_connected) with a linear activation.
- get_batches: Creates a generator that returns batches of data used during training
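
The code of `get_batches` is not reproduced in this document, so the following is only a sketch of how such a function could work, written under stated assumptions. It returns a NumPy array rather than a generator, because the training cell calls `len(batches)` and indexes `batches[0][0]`. The name `get_batches_sketch` is hypothetical.

```python
import numpy as np

def get_batches_sketch(int_text, batch_size, seq_length):
    """Return an array of shape (n_batches, 2, batch_size, seq_length)."""
    words_per_batch = batch_size * seq_length
    n_batches = len(int_text) // words_per_batch
    # Drop the tail that does not fill a complete batch
    inputs = np.array(int_text[:n_batches * words_per_batch])
    # Targets are the inputs shifted one word ahead; the last target wraps around
    targets = np.roll(inputs, -1)

    # Row r of each batch continues where row r of the previous batch stopped,
    # so the RNN state can be carried over from one batch to the next
    x = np.split(inputs.reshape(batch_size, -1), n_batches, axis=1)
    y = np.split(targets.reshape(batch_size, -1), n_batches, axis=1)
    return np.array(list(zip(x, y)))

batches = get_batches_sketch(list(range(20)), batch_size=2, seq_length=5)
print(batches.shape)   # (2, 2, 2, 5)
print(batches[0][0])   # inputs of the first batch
print(batches[0][1])   # targets of the first batch: the inputs shifted by one
```

Each target sequence is the input sequence moved one word forward, which is what lets the network learn to predict the next word.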

In [8]:
import cervantes_nn as cervnn

cervnn.reset_graph()

In [9]:
# View the code of cervantes_nn
cervnn??

## Cervantes Neural Network Training
### Hyperparameters
The following parameters are used to tune the Neural Network:

- `batch_size`: Number of training examples (sequences) in one batch.
- `num_epochs`: Number of full passes over the training examples.
- `rnn_layer_size`: Number of stacked RNN layers.
- `rnn_size`: Number of hidden units in each RNN layer.
- `embed_dim`: Size of the embedding vectors.
- `seq_length`: Number of words included in every sequence, e.g. a sequence of five words.
- `learning_rate`: Step size of the optimizer, which controls how fast/slow the Neural Network will train.
- `dropout`: Simple way to prevent an RNN from overfitting - [link](http://jmlr.org/papers/v15/srivastava14a.html). The value is passed to `get_init_cell` as `keep_prob`.
- `show_every_n_batches`: Print progress every n batches.
- `save_every_n_epochs`: Save a checkpoint every n epochs.

In [9]:
# Batch Size
batch_size = 512
# Number of Epochs
num_epochs = 700
# RNN Layers
rnn_layer_size = 2
# RNN Size
rnn_size = 256
# Embedding Dimension Size
# Using 300 as it is commonly used in Google's news word vectors and the GloVe vectors
embed_dim = 300
# Sequence Length
seq_length = 10
# Learning Rate
learning_rate = 0.001
# Dropout
dropout = 0.6

# Show stats for every n number of batches
show_every_n_batches = 100
# Save progress for every n number of epochs
save_every_n_epochs = 100

run_id = '0007'

# Define saving directories
save_dir = './checkpoints/save'
logs_dir = './logs/'

### Build the Graph
Build the graph using Cervantes neural network

In [10]:
from tensorflow.contrib import seq2seq

train_graph = tf.Graph()
with train_graph.as_default():
    # Inputs
    vocab_size = len(int_to_vocab)
    input_text, targets, lr = cervnn.get_inputs()
    input_data_shape = tf.shape(input_text)
    
    # Define the RNN cell
    cell, initial_state = cervnn.get_init_cell(batch_size=input_data_shape[0], 
                                               rnn_layers=rnn_layer_size, 
                                               rnn_size=rnn_size,
                                               keep_prob=dropout)
    # Builds Neural Network
    logits, final_state = cervnn.build_nn(cell, input_text, vocab_size, embed_dim,
                                         batch_size, rnn_layer_size, rnn_size, dropout)

    
    # Probabilities for generating words
    probs = tf.nn.softmax(logits, name='probs')

    # Loss function
    cost = seq2seq.sequence_loss(
        logits,
        targets,
        tf.ones([input_data_shape[0], input_data_shape[1]]))

    # Optimizer
    optimizer = tf.train.AdamOptimizer(lr)

    # Gradient Clipping
    gradients = optimizer.compute_gradients(cost)
    capped_gradients = [(tf.clip_by_value(grad, -1., 1.), var) for grad, var in gradients if grad is not None]
    train_op = optimizer.apply_gradients(capped_gradients)

RNN Layers: 2 and Size: 256, Batch Size: Tensor("strided_slice:0", shape=(), dtype=int32)
(?, ?, 300)
RNN Layers: 2 and Size: 256, Batch Size: 512
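
The loss and the gradient clipping used in the graph can be illustrated without TensorFlow. The NumPy sketch below assumes that `seq2seq.sequence_loss` is used with its default averaging over time steps and batch, in which case uniform weights reduce it to a plain mean cross-entropy. The function name and the toy sizes are hypothetical.

```python
import numpy as np

def mean_cross_entropy(logits, targets):
    """Average cross-entropy (in nats) over every position of every sequence.

    logits:  float array of shape (batch, seq_length, vocab_size)
    targets: int array of shape (batch, seq_length)
    """
    # Numerically stable log-softmax over the vocabulary axis
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Pick the log-probability assigned to each correct next word
    batch_idx, time_idx = np.indices(targets.shape)
    picked = log_probs[batch_idx, time_idx, targets]
    # Uniform weights (the tf.ones(...) in the graph) reduce to a plain mean
    return -picked.mean()

vocab_size = 1000  # hypothetical vocabulary size
uniform_logits = np.zeros((2, 3, vocab_size))
toy_targets = np.array([[1, 2, 3], [4, 5, 6]])
loss = mean_cross_entropy(uniform_logits, toy_targets)
print(loss, np.log(vocab_size))  # a uniform prediction scores ln(vocab_size)

# Element-wise clipping, the same rule tf.clip_by_value applies to each gradient
grads = np.array([-3.0, 0.5, 2.0])
clipped = np.clip(grads, -1.0, 1.0)
print(clipped)  # [-1.   0.5  1. ]
```

A model that has learned nothing yet predicts roughly uniformly, so its loss starts near the logarithm of the vocabulary size and falls as training progresses.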


## Train
Train Cervantes neural network on the preprocessed data.

In [None]:
batches = cervnn.get_batches(int_text, batch_size, seq_length)

# file_name_suffix = "-lr-{}-epochs-{}-sqe_length-{}-".format(learning_rate, num_epochs, seq_length)

training_log = "batch_size: {}\nepochs: {}\nrnn_layer_size: {}\nrnn_size: {}\nembed_dim: {}\nseq_length: {}\nlr: {}\ndropout: {}\n--------\n".format(batch_size, num_epochs, rnn_layer_size, rnn_size, embed_dim, seq_length, learning_rate, dropout)

with tf.Session(graph=train_graph) as sess:
    sess.run(tf.global_variables_initializer())

    for epoch_i in range(num_epochs):
        state = sess.run(initial_state, {input_text: batches[0][0]})

        for batch_i, (x, y) in enumerate(batches):
            feed = {
                input_text: x,
                targets: y,
                initial_state: state,
                lr: learning_rate}
            train_loss, state, _ = sess.run([cost, final_state, train_op], feed)

            # Show every <show_every_n_batches> batches
            if (epoch_i * len(batches) + batch_i) % show_every_n_batches == 0:
                current_log = 'Epoch {:>3} Batch {:>4}/{}   train_loss = {:.3f}'.format(
                    epoch_i,
                    batch_i,
                    len(batches),
                    train_loss)
                training_log += current_log + "\n"
                print(current_log)
                
        # Save every <save_every_n_epochs> epochs, once the epoch has finished
        if (epoch_i + 1) % save_every_n_epochs == 0:
            saver = tf.train.Saver()
            saver.save(sess, save_dir + '-' + run_id + '--c_epoch-' + str(epoch_i + 1))
            model_saved_msg = 'Model Trained and Saved - Epoch: ' + str(epoch_i + 1)
            print(model_saved_msg)
            training_log += model_saved_msg + "\n"
                

    # Save Model
    saver = tf.train.Saver()
    saver.save(sess, save_dir)
    print('Model Trained and Saved')
    
    with open(logs_dir + "training_log-{}.txt".format(run_id), "w") as text_file:
        text_file.write(training_log)

Epoch   0 Batch    0/141   train_loss = 9.989
Epoch   0 Batch  100/141   train_loss = 6.210
Epoch   1 Batch   59/141   train_loss = 6.192
Epoch   2 Batch   18/141   train_loss = 6.254
Epoch   2 Batch  118/141   train_loss = 6.199
Epoch   3 Batch   77/141   train_loss = 6.179
Epoch   4 Batch   36/141   train_loss = 6.242
Epoch   4 Batch  136/141   train_loss = 6.187
Epoch   5 Batch   95/141   train_loss = 6.175
Epoch   6 Batch   54/141   train_loss = 6.260
Epoch   7 Batch   13/141   train_loss = 6.229
Epoch   7 Batch  113/141   train_loss = 6.191
Epoch   8 Batch   72/141   train_loss = 6.203
Epoch   9 Batch   31/141   train_loss = 6.196
Epoch   9 Batch  131/141   train_loss = 6.156
Epoch  10 Batch   90/141   train_loss = 6.141
Epoch  11 Batch   49/141   train_loss = 5.984
Epoch  12 Batch    8/141   train_loss = 5.699
Epoch  12 Batch  108/141   train_loss = 5.483
Epoch  13 Batch   67/141   train_loss = 5.378
Epoch  14 Batch   26/141   train_loss = 5.153
Epoch  14 Batch  126/141   train_l

Epoch 126 Batch   34/141   train_loss = 2.703
Epoch 126 Batch  134/141   train_loss = 2.711
Epoch 127 Batch   93/141   train_loss = 2.819
Epoch 128 Batch   52/141   train_loss = 2.702
Epoch 129 Batch   11/141   train_loss = 2.678
Epoch 129 Batch  111/141   train_loss = 2.696
Epoch 130 Batch   70/141   train_loss = 2.652
Epoch 131 Batch   29/141   train_loss = 2.722
Epoch 131 Batch  129/141   train_loss = 2.665
Epoch 132 Batch   88/141   train_loss = 2.659
Epoch 133 Batch   47/141   train_loss = 2.572
Epoch 134 Batch    6/141   train_loss = 2.683
Epoch 134 Batch  106/141   train_loss = 2.732
Epoch 135 Batch   65/141   train_loss = 2.639
Epoch 136 Batch   24/141   train_loss = 2.568
Epoch 136 Batch  124/141   train_loss = 2.645
Epoch 137 Batch   83/141   train_loss = 2.654
Epoch 138 Batch   42/141   train_loss = 2.588
Epoch 139 Batch    1/141   train_loss = 2.575
Epoch 139 Batch  101/141   train_loss = 2.650
Epoch 140 Batch   60/141   train_loss = 2.579
Epoch 141 Batch   19/141   train_l

Epoch 252 Batch   68/141   train_loss = 1.904
Epoch 253 Batch   27/141   train_loss = 1.943
Epoch 253 Batch  127/141   train_loss = 1.998
Epoch 254 Batch   86/141   train_loss = 1.870
Epoch 255 Batch   45/141   train_loss = 1.896
Epoch 256 Batch    4/141   train_loss = 1.961
Epoch 256 Batch  104/141   train_loss = 1.938
Epoch 257 Batch   63/141   train_loss = 1.844
Epoch 258 Batch   22/141   train_loss = 1.882
Epoch 258 Batch  122/141   train_loss = 1.923
Epoch 259 Batch   81/141   train_loss = 1.894
Epoch 260 Batch   40/141   train_loss = 1.911
Epoch 260 Batch  140/141   train_loss = 1.852
Epoch 261 Batch   99/141   train_loss = 1.929
Epoch 262 Batch   58/141   train_loss = 1.914
Epoch 263 Batch   17/141   train_loss = 1.923
Epoch 263 Batch  117/141   train_loss = 1.929
Epoch 264 Batch   76/141   train_loss = 1.901
Epoch 265 Batch   35/141   train_loss = 1.874
Epoch 265 Batch  135/141   train_loss = 2.012
Epoch 266 Batch   94/141   train_loss = 1.883
Epoch 267 Batch   53/141   train_l

Epoch 378 Batch  102/141   train_loss = 1.631
Epoch 379 Batch   61/141   train_loss = 1.544
Epoch 380 Batch   20/141   train_loss = 1.575
Epoch 380 Batch  120/141   train_loss = 1.602
Epoch 381 Batch   79/141   train_loss = 1.557
Epoch 382 Batch   38/141   train_loss = 1.513
Epoch 382 Batch  138/141   train_loss = 1.586
Epoch 383 Batch   97/141   train_loss = 1.525
Epoch 384 Batch   56/141   train_loss = 1.538
Epoch 385 Batch   15/141   train_loss = 1.570
Epoch 385 Batch  115/141   train_loss = 1.522
Epoch 386 Batch   74/141   train_loss = 1.503
Epoch 387 Batch   33/141   train_loss = 1.553
Epoch 387 Batch  133/141   train_loss = 1.523
Epoch 388 Batch   92/141   train_loss = 1.546
Epoch 389 Batch   51/141   train_loss = 1.563
Epoch 390 Batch   10/141   train_loss = 1.614
Epoch 390 Batch  110/141   train_loss = 1.509
Epoch 391 Batch   69/141   train_loss = 1.535
Epoch 392 Batch   28/141   train_loss = 1.545
Epoch 392 Batch  128/141   train_loss = 1.544
Epoch 393 Batch   87/141   train_l

Epoch 504 Batch   36/141   train_loss = 1.299
Epoch 504 Batch  136/141   train_loss = 1.294
Epoch 505 Batch   95/141   train_loss = 1.325
Epoch 506 Batch   54/141   train_loss = 1.296
Epoch 507 Batch   13/141   train_loss = 1.266
Epoch 507 Batch  113/141   train_loss = 1.232
Epoch 508 Batch   72/141   train_loss = 1.331
Epoch 509 Batch   31/141   train_loss = 1.240


## Save Parameters
Save `seq_length` and `save_dir` for generating a new Cervantes text.

In [None]:
# Save parameters for checkpoint
pickle.dump((seq_length, save_dir), open('params-{}.p'.format(run_id), 'wb'))

## Training results

The table below captures the results of training the Cervantes Neural Network with different hyperparameters:

| Run ID | Batch Size | Epochs | RNN Layers | RNN Size | Embed Dim | Seq Length | LR | Dropout | Train Loss |
|:---:|:---:|:---:|:---:|:----:|:----:|:----:|:----:|:-----:|:-----:|
| 0001 | 512 | 300 | 2 | 256 | 300 | 5 | 0.01 | 0.6 | 3.438 |
| 0002 | 512 | 500 | 2 | 256 | 300 | 5 | 0.001 | 0.6 | 1.488 |
| 0003 | 512 | 300 | 2 | 256 | 300 | 10 | 0.01 | 0.6 | 3.015 |
| 0004 | 512 | 500 | 2 | 256 | 300 | 10 | 0.001 | 0.6 | 1.112 |
| 0005 | 512 | 500 | 3 | 256 | 300 | 10 | 0.001 | 0.6 | 1.178 |
| 0006 | 512 | 500 | 3 | 256 | 300 | 20 | 0.001 | 0.6 | 1.317 |
| 0007 | 512 | 700 | 2 | 256 | 300 | 10 | 0.001 | 0.6 | 0.998 |

For a detailed view of the training loss, check out the [training logs](./logs/) included with the project.
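
One way to read these numbers is to convert them to perplexity, the exponential of the loss. The sketch below assumes that the reported train loss is an average cross-entropy measured in nats. Note that these are training values, not measurements on held-out text.

```python
import math

# Final training losses copied from the results table
train_loss = {
    "0001": 3.438, "0002": 1.488, "0003": 3.015, "0004": 1.112,
    "0005": 1.178, "0006": 1.317, "0007": 0.998,
}

# Perplexity: roughly, among how many equally likely words the model is choosing
perplexity = {run: math.exp(loss) for run, loss in train_loss.items()}
for run, value in sorted(perplexity.items()):
    print("run {}: perplexity = {:.2f}".format(run, value))

best_run = min(perplexity, key=perplexity.get)
print("lowest training perplexity: run", best_run)
```

Read this way, the gap between the runs trained with a learning rate of 0.01 and those trained with 0.001 is much larger than the raw loss values suggest.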

### Training Loss

We can see that a learning rate of 0.01 is too large to train our Neural Network. When we trained with 0.01, we were never able to achieve a train loss below 3. Another indicator is that learning plateaus in both runs (0001, 0003): in *0001* it plateaus at around epoch 100 and in *0003* at around epoch 180.

The training loss improves when we use a learning rate of 0.001: the final train loss drops by roughly 1.9 (from 3.438 in run *0001* to 1.488 in run *0002*, and from 3.015 in run *0003* to 1.112 in run *0004*). As the training is not plateauing, we are also able to train for longer. This is why we increased the epochs of run *0002* to 500.

### Sequence Length

Our basic RNN was trained with a sequence length of 5. The sequence length is the number of words included in every sequence.

We can see an improvement when increasing the sequence length to 10. This means that our RNN is trained on longer sequences, which ends up significantly improving the quality of the text generated by our network.

With a sequence length of 5, our text didn't make much sense: the sentences are short and the paragraphs are not well structured.

When using a model trained with a sequence length of 10, the text makes much more sense, and the quality of the sentences and paragraphs improves significantly.

### Train some more

Our best result in the first 6 runs was run *0004*. If we train our network for longer with the same parameters (run *0007*, 700 epochs), we achieve a train loss of just under 1 (0.998).

## Conclusion

Even though the train loss of the last run is less than 1, we can see in the samples below that the texts generated by the *run 0004* and *run 0007* models are similar in several ways:
- They are both able to open and close quotations.
- The text makes more sense when compared with *run 0001*, which is expected as the sequence length used to train both models (10) is longer than that of *run 0001* (5).
- Paragraphs are well formed.
- Sentence length is similar and close to the average sentence length.

Text Samples:

- Run 0004:
    " Senor," said Sancho," I mean to know from this perilous journey in the ugly which has been bound;
    " At any rate, Dulcinea," replied the actor
- Run 0007:
    Quixote or cost him his squire, unless indeed his wife might follow him
    Don Quixote bade Sancho he settled three days with open his heart in fixing his affections should comply with Preciosa

--------------------

Below, we will generate some text to check our results.

## Generate Cervantes Text

Before generating text, let's import our preprocessed data and the params of our run.

In [13]:
import tensorflow as tf
import numpy as np

_, vocab_to_int, int_to_vocab, token_dict = pickle.load(open('preprocess.p', mode='rb'))
seq_length, meta_dir = pickle.load(open('params-{}.p'.format(run_id), mode='rb'))

The functions below will generate Cervantes text based on some input.
- `load_dir`: Location where the graph metadata is saved
- `prime_word`: First word used to generate text
- `gen_length`: Length of text we want to generate.

In [14]:
def pick_word(probabilities, int_to_vocab):
    """
    Pick the next word in the generated text
    :param probabilities: Probabilities of the next word
    :param int_to_vocab: Dictionary of word ids as the keys and words as the values
    :return: String of the predicted word
    """
    # Adding randomness to the word returned: sample a word id from the
    # predicted distribution, so the result does not depend on dict ordering
    return int_to_vocab[np.random.choice(len(probabilities), p=probabilities)]
    #return int_to_vocab[np.argmax(probabilities)]

def generate_text(load_dir, prime_word, gen_length):
    """
    Generates text
    :param load_dir: Location where the graph metadata is saved
    :param prime_word: First word used to generate text
    :param gen_length: How long the generated text will be
    :return: Generated text
    """
    loaded_graph = tf.Graph()
    with tf.Session(graph=loaded_graph) as sess:
        # Load saved model
        loader = tf.train.import_meta_graph(load_dir + '.meta')
        loader.restore(sess, load_dir)

        # Get Tensors from loaded model
        input_text, initial_state, final_state, probs = cervnn.get_tensors(loaded_graph)

        # Sentences generation setup
        gen_sentences = [prime_word]
        prev_state = sess.run(initial_state, {input_text: np.array([[1]])})

        # Generate sentences
        for n in range(gen_length):
            # Dynamic Input
            dyn_input = [[vocab_to_int[word] for word in gen_sentences[-seq_length:]]]
            dyn_seq_length = len(dyn_input[0])

            # Get Prediction
            probabilities, prev_state = sess.run(
                [probs, final_state],
                {input_text: dyn_input, initial_state: prev_state})

            pred_word = pick_word(probabilities[dyn_seq_length-1], int_to_vocab)

            gen_sentences.append(pred_word)

        # Remove tokens
        generated_text = ' '.join(gen_sentences)
        for key, token in token_dict.items():
            generated_text = generated_text.replace(' ' + token.lower(), key)
        generated_text = generated_text.replace('\n ', '\n')
        generated_text = generated_text.replace('( ', '(')

        return generated_text
    
def print_text_for(run_id, epochs, initial_word, initial_epoch=100, text_length=100):
    """Print text generated from the checkpoints saved every 100 epochs."""
    for epoch in range(initial_epoch, epochs + 100, 100):
        checkpoint = meta_dir + '-' + run_id + '--c_epoch-' + str(epoch)
        generated = generate_text(checkpoint, initial_word, text_length)
        print('-----------\n{} at run_id: {}, epoch: {}, text generated: \n------------\n{}'.format(
            initial_word, run_id, epoch, generated))
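
The difference between sampling the next word and always taking the most likely one can be shown on a toy distribution. This sketch is independent of the trained model: `toy_vocab` and `toy_probs` are hypothetical stand-ins for `int_to_vocab` and the softmax output, and the helper names are chosen for this example.

```python
import numpy as np

def sample_word(probabilities, int_to_vocab, rng):
    """Draw a word id according to the distribution and map it to a word."""
    word_id = rng.choice(len(probabilities), p=probabilities)
    return int_to_vocab[int(word_id)]

def greedy_word(probabilities, int_to_vocab):
    """Always return the single most likely word."""
    return int_to_vocab[int(np.argmax(probabilities))]

toy_vocab = {0: "Sancho", 1: "Quixote", 2: "Dulcinea"}  # hypothetical vocabulary
toy_probs = np.array([0.2, 0.7, 0.1])                   # hypothetical softmax output

rng = np.random.default_rng(42)
samples = [sample_word(toy_probs, toy_vocab, rng) for _ in range(1000)]
print(greedy_word(toy_probs, toy_vocab))        # Quixote
print(samples.count("Quixote") / len(samples))  # close to 0.7
```

Greedy selection tends to repeat the same phrases, while sampling keeps the generated text varied at the cost of occasionally picking an unlikely word.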

Let's start comparing the text generated by our different Cervantes Neural Network runs.

In [17]:
# Run 0001

print_text_for(run_id='0001', epochs=300, initial_word="Quixote", initial_epoch=300, text_length=200)

-----------
Quixote at run_id: 0001, epoch: 300, text generated: 
------------
Quixote," and
my profession abideth about her Majesty then, the cloth was frantic, that it
is
so hast been, that she has not spread of showing and tear resting with him, for the persuasion, leaving him by force or get up by full great
achievements of what book,
Leaves the pastime?"

" I would lay myself upon him with
a burnished hand to the waist.
The thieves laughed at
the other little intelligence, and where the redress, in the service of some dwarf, and
we were in the line of
mutual Sancho, this,
I have not such matters of us with I not heard by which is the best world,
Non when the bano seated myself and Sancho Panza asked the same of his
story, they were excellent wife at once, and here the green intentions of the Judge so cautiously gave her. The blush, and heard them came to over a payment with shepherd, laying round it
he ordered her, and


In [21]:

print_text_for(run_id='0004', epochs=500, initial_word="Quixote", initial_epoch=500, text_length=200)

-----------
Quixote at run_id: 0004, epoch: 500, text generated: 
------------
Quixote," and am the idea thou wilt give more
good quickly to see now thou hast won the good
thing, as I have told them not."

" Senor," said Sancho," I mean to know from
this perilous journey in the ugly which has been bound; for it is
that? What are it in me; but the knight-errant should come free?"

" At any rate, Dulcinea," replied the actor held in nonsense of your wife, master the devil who is so generous that I can go to the house of
Luscinda, Preciosa, and more will by everything the
wrong the beauty itself ought to wash its course.

" What could mean be" The greatest who is long in
his own behalf so long as I have chosen to
Uchali, who shall not go to defend your ass
will run wrong, for the mercy I was now in a
coach that, after having sold their riches and language that
Preciosa asked


In [43]:

print_text_for(run_id='0007', epochs=500, initial_word="Quixote", initial_epoch=500, text_length=200)

-----------
Quixote at run_id: 0007, epoch: 500, text generated: 
------------
Quixote or cost him his squire, unless indeed his
wife might follow him or with great respect. Their
master would be, the blind parents describes, or
look free, and all the rest with your life shall be imagined; the gipsy had
rule come to his heart. The extreme day I have heard of her good; in
the moment of this my house and soul to find
the paper; though, as it were, we
believe it is because they who are dead, and bind
me to undergo the exertion thou great desire, for indeed he said to him cannot read it aloud.

Don Quixote bade Sancho he settled three days with open his heart
in fixing his affections should comply with Preciosa, and
to keep secret whatever will be no wish on, as it was queens as his master had
heard him, he strove to sing him, maintaining the sun favouring befall), which was not of it, both of them, served in her courtesy
for the


## References

* NLP Tokenization - [https://nlp.stanford.edu/IR-book/html/htmledition/tokenization-1.html](https://nlp.stanford.edu/IR-book/html/htmledition/tokenization-1.html)

* Vector Representations of Words - [https://www.tensorflow.org/tutorials/word2vec#motivation_why_learn_word_embeddings](https://www.tensorflow.org/tutorials/word2vec#motivation_why_learn_word_embeddings) 

* Recurrent Neural Networks - [https://www.tensorflow.org/tutorials/recurrent](https://www.tensorflow.org/tutorials/recurrent) 

* Alex Graves - Generating Sequences With Recurrent Neural Networks - [https://arxiv.org/pdf/1308.0850.pdf](https://arxiv.org/pdf/1308.0850.pdf)

* Christopher Olah - Understanding LSTM Networks - [http://colah.github.io/posts/2015-08-Understanding-LSTMs/](http://colah.github.io/posts/2015-08-Understanding-LSTMs/)

* Prasad Kawthekar, Raunaq Rewari, Suvrat Bhooshan - Evaluating Generative Models for Text Generation - [https://web.stanford.edu/class/cs224n/reports/2737434.pdf](https://web.stanford.edu/class/cs224n/reports/2737434.pdf) 

* Ilya Sutskever, James Martens, Geoffrey Hinton - Generating Text with Recurrent Neural Networks - [http://www.cs.utoronto.ca/~ilya/pubs/2011/LANG-RNN.pdf](http://www.cs.utoronto.ca/~ilya/pubs/2011/LANG-RNN.pdf)