# Train Text RNN Tensorflow - TUTORIAL

This notebook describes how to train an RNN model to generate a sentence word by word, predicting each next word from the previous ones.
The model is trained on a full text of your choice (a novel, for example).

Before going through this tutorial, I suggest reading Andrej Karpathy's excellent blog post: http://karpathy.github.io/2015/05/21/rnn-effectiveness/

This project also borrows a lot from https://github.com/hunkim/word-rnn-tensorflow by hunkim
(honestly, almost everything: the word-rnn-tensorflow project is great).


In [None]:
from __future__ import print_function
import numpy as np
import tensorflow as tf

import time
import os
from six.moves import cPickle

from simple_model import Model

import codecs
import collections

import argparse

## Create variables

We create variables required to train our neural net, to save the model, retrieve data, etc.

In [None]:
data_dir = 'data/Artistes_et_Phalanges-David_Campion' # data directory containing input.txt
input_encoding = None # character encoding of input.txt, from https://docs.python.org/3/library/codecs.html#standard-encodings
log_dir = 'logs' # directory containing tensorboard logs
save_dir = 'save' # directory to store checkpointed models
rnn_size = 256 # size of RNN hidden state
num_layers = 2 # number of layers in the RNN
model = 'lstm' # type of RNN cell
batch_size = 50 # minibatch size
seq_length = 25 # RNN sequence length
num_epochs = 25 # number of epochs
save_every = 1000 # save frequency (in batches)
grad_clip = 5. # clip gradients at this value
learning_rate = 0.002 # learning rate
decay_rate = 0.97 # decay rate of the learning rate (applied once per epoch)
gpu_mem = 0.666 # fraction of GPU memory to allocate to this process (here 66.6%)
init_from = None

## Load and Prepare the Data

The first step is to ingest the text (the input.txt file) and to prepare the inputs and targets for training.

### Load Data
The objective of this section is to process the full text.

It's very easy to modify this part to change the way the data are used.
Here, the code splits the text into words, but a similar operation could split it into characters instead.

In [None]:
input_file = os.path.join(data_dir, "input.txt")
vocab_file = os.path.join(data_dir, "vocab.pkl")
tensor_file = os.path.join(data_dir, "data.npy")

We open the input file and read the whole text:

In [None]:
with codecs.open(input_file, "r", encoding=input_encoding) as f:
    data = f.read()
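
When __input_encoding__ is left at None, the text is decoded with the platform default, which may mangle accented characters. Here is a minimal sketch of reading a file with an explicit encoding, assuming the file is UTF-8; the temporary file and the sample sentence are made up for illustration, and the built-in open is used here in place of codecs.open:

```python
import os
import tempfile

# Hypothetical sample text with accented characters (not taken from the real corpus).
sample = "Les artistes créent. Les phalanges défilent."

# Write the sample to a temporary file as UTF-8 bytes.
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "input.txt")
with open(sample_path, "wb") as f:
    f.write(sample.encode("utf-8"))

# Read it back, naming the encoding explicitly instead of relying on the default.
with open(sample_path, "r", encoding="utf-8") as f:
    decoded = f.read()

print(decoded)
```

In the notebook itself, the equivalent change would be to set __input_encoding__ to 'utf-8' (or to whatever encoding your text actually uses).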

Then we split the text word by word.

**Note**: this command splits on whitespace.
If a period or a comma is attached to a word, it stays part of that word.

You can modify this cell to refine the way words are extracted (or even to split by characters).

In [None]:
x_text = data.split()
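
To make the note above concrete, here is a small sketch comparing three ways of splitting. The sample sentence and the regular expression are assumptions chosen for illustration; they are not part of the original project:

```python
import re

sample = "The fox jumps, then stops. The dog barks!"

# Whitespace splitting keeps punctuation glued to the preceding word.
by_space = sample.split()

# Assumed alternative: runs of word characters, or any single non-space symbol.
by_regex = re.findall(r"\w+|[^\w\s]", sample)

# Character-level splitting, for a character RNN instead of a word RNN.
by_char = list(sample)

print(by_space)
print(by_regex)
print(by_char[:10])
```

With the regular expression, "jumps," becomes two tokens ("jumps" and ","), which usually shrinks the vocabulary but means the model has to learn where punctuation goes.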

### Build vocabulary
The next step is to build a vocabulary that maps each word to an index, based on the words extracted above.

To do that, we have to define:

- the vocabulary mapping (word -> index),
- the inverse vocabulary mapping (index -> word).

In [None]:
# count the occurrences of each word
word_counts = collections.Counter(x_text)

# Mapping from index to word: the vocabulary, sorted alphabetically
vocabulary_inv = [x[0] for x in word_counts.most_common()]
vocabulary_inv = list(sorted(vocabulary_inv))

# Mapping from word to index
vocab = {x: i for i, x in enumerate(vocabulary_inv)}
# words must follow the same order as vocab, so that words[vocab[w]] == w
words = vocabulary_inv

The size of the vocabulary is:

In [None]:
vocab_size = len(words)
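
As an illustration, here is the same construction on a toy sentence. The names index_to_word and word_to_index are hypothetical and only used in this sketch:

```python
import collections

tokens = "the quick brown fox jumps over the lazy dog".split()

# Count the occurrences of each distinct word.
counts = collections.Counter(tokens)

# Index -> word: the distinct words in alphabetical order.
index_to_word = sorted(counts)

# Word -> index: the inverse mapping.
word_to_index = {w: i for i, w in enumerate(index_to_word)}

print(counts.most_common(1))
print(index_to_word)
print(word_to_index["the"])
```

The sentence has nine words but only eight distinct ones, so the vocabulary size is 8 and "the" (last in alphabetical order) gets index 7.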

We save the vocabulary file. It could be useful later.

In [None]:
with open(vocab_file, 'wb') as f:
    cPickle.dump((words), f)

### Tensor creation
We create the tensor by replacing each word of the text with its index in the vocabulary, then we save it.

In [None]:
tensor = np.array(list(map(vocab.get, x_text)))

# Save the data to data.npy
np.save(tensor_file, tensor)

print('tensor is: ' + str(tensor))
print("Its shape: " + str(np.shape(tensor)))
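
A minimal sketch of this encoding step on a toy sentence, together with the reverse operation (decoding the ids back into words). The variable names are hypothetical and the vocabulary is rebuilt locally so that the block runs on its own:

```python
import numpy as np

tokens = "the quick brown fox jumps over the lazy dog".split()
index_to_word = sorted(set(tokens))
word_to_index = {w: i for i, w in enumerate(index_to_word)}

# Encode: one integer id per token, in reading order.
ids = np.array([word_to_index[w] for w in tokens])

# Decode: look every id up in the index -> word list.
decoded = [index_to_word[i] for i in ids]

print(ids)
print(" ".join(decoded))
```

Decoding is exactly what text generation needs later on: the model outputs ids, and the index -> word list turns them back into text.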

### Create batches
First, we calculate the number of batches we can use to train the model:

In [None]:
num_batches = int(tensor.size / (batch_size * seq_length))
print('number of batches is: ' + str(num_batches))

Then we truncate the tensor to the number of values actually used by the batches.

We keep only the first num_batches * batch_size * seq_length values.

In [None]:
tensor = tensor[:num_batches * batch_size * seq_length]
print('The shape of the new tensor is: '+ str(np.shape(tensor)))
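
To see the arithmetic, here is a worked example with the notebook's batch_size and seq_length and a hypothetical corpus of 10500 words (the corpus size is an assumption for illustration):

```python
batch_size = 50
seq_length = 25
corpus_size = 10500  # hypothetical number of words in the corpus

# Each batch consumes batch_size * seq_length word ids.
words_per_batch = batch_size * seq_length

# Integer division drops the incomplete final batch.
num_batches = corpus_size // words_per_batch

kept = num_batches * words_per_batch
dropped = corpus_size - kept
print(words_per_batch, num_batches, kept, dropped)
```

Here each batch uses 1250 ids, so 8 full batches fit, 10000 ids are kept and the last 500 are discarded. If the corpus holds fewer than 1250 words, num_batches is 0 and no training can happen with these settings.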

Now we have to define the **'inputs'** (xdata) and the **'targets'** (ydata) for training.

Because each target is simply the word that follows the corresponding input, the two arrays have the same shape:

In [None]:
xdata = tensor
ydata = np.copy(tensor)

The targets (ydata) must be set up correctly.

In our example, we want to __predict the next word of a sentence__, so __ydata__ is __xdata__ shifted by one word.
To give __ydata__ the same shape as __xdata__, we copy the first element of __xdata__ into the last position of __ydata__.

Toy example: if the complete xdata is "the quick brown fox jumps over the lazy dog":
- xdata = [the, quick, brown, fox, jumps, over, the, lazy, dog]
- ydata = [quick, brown, fox, jumps, over, the, lazy, dog, the]
    

In [None]:
ydata[:-1] = xdata[1:]
ydata[-1] = xdata[0]
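
The same shift can be checked on a small array. This sketch uses the integers 0 to 8 as stand-in word ids and also shows that np.roll gives the same result in a single call (using np.roll is an alternative, not what the notebook does):

```python
import numpy as np

xdata = np.arange(9)  # stand-in ids for a nine-word text

# Same steps as in the cell above: shift left by one, wrap the first id to the end.
ydata = np.copy(xdata)
ydata[:-1] = xdata[1:]
ydata[-1] = xdata[0]

# np.roll with a shift of -1 produces the same array in one call.
rolled = np.roll(xdata, -1)

print(xdata)
print(ydata)
```

Only the very last target is artificial (it wraps around to the first word); every other target is the true next word.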

Then we split xdata and ydata into num_batches batches, each of shape (batch_size, seq_length).

In [None]:
x_batches = np.split(xdata.reshape(batch_size, -1), num_batches, 1)
y_batches = np.split(ydata.reshape(batch_size, -1), num_batches, 1)
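
The reshape-then-split step is easier to follow on a tiny array. This sketch assumes 24 stand-in word ids, with a batch_size of 2 and a seq_length of 3 chosen only to keep the output readable:

```python
import numpy as np

batch_size, seq_length = 2, 3
xdata = np.arange(24)  # 24 stand-in word ids
num_batches = xdata.size // (batch_size * seq_length)  # 4

# One row per parallel stream: row 0 holds ids 0..11, row 1 holds ids 12..23.
rows = xdata.reshape(batch_size, -1)

# Cut the rows into num_batches column blocks of width seq_length.
x_batches = np.split(rows, num_batches, 1)

print(len(x_batches), x_batches[0].shape)
print(x_batches[0])
print(x_batches[1])
```

Note that row r of one batch continues exactly where row r of the previous batch stopped. This fits the training loop further down, which feeds the final state of one batch as the initial state of the next.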

### Reset batch pointer

In [None]:
pointer = 0

We save the words and the vocabulary.

They will be useful when we want to generate text from a trained model.

In [None]:
if not os.path.isdir(save_dir):  # create the save directory on the first run
    os.makedirs(save_dir)
with open(os.path.join(save_dir, 'words_vocab.pkl'), 'wb') as f:
    cPickle.dump((words, vocab), f)
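
For reference, here is a sketch of how such a file can be read back later. It uses toy data and a temporary directory (both assumptions for illustration), and the standard pickle module rather than six.moves.cPickle:

```python
import os
import pickle
import tempfile

# Toy stand-ins for the words list and the vocab dictionary.
words = ["brown", "fox", "quick", "the"]
vocab = {w: i for i, w in enumerate(words)}

# Save both objects as a single tuple, as in the cell above.
save_path = os.path.join(tempfile.mkdtemp(), "words_vocab.pkl")
with open(save_path, "wb") as f:
    pickle.dump((words, vocab), f)

# Later (for instance in another notebook), unpack them in the same order.
with open(save_path, "rb") as f:
    loaded_words, loaded_vocab = pickle.load(f)

print(loaded_words[loaded_vocab["fox"]])
```

The tuple must be unpacked in the order it was saved: words first, then vocab.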

## Set up the Model

We create the model.
For the details, have a look at the Model class in the __*simple_model.py*__ file.

The __*simple_model.py*__ file defines a model class, with functions to train it and to generate text.

Note: By default, the RNN is an LSTM network. You can switch to another cell type (simple RNN or GRU) by modifying the Python script.

In [None]:
model = Model(data_dir,input_encoding,log_dir,save_dir,rnn_size,num_layers,model,batch_size,seq_length,num_epochs,save_every,grad_clip,learning_rate,decay_rate,gpu_mem,init_from, vocab_size)

After that, we create a "writer" to populate the log files.
It will be very useful for displaying additional information in __tensorboard__.

In order to do that:
- we merge all summaries collected in the default graph,
- then we create a summary writer, that will save info in the log folder.

__Note:__
From a separate terminal, run the following command:

      tensorboard --logdir=./logs/

This command starts TensorBoard. The dashboard is then available at the following URL:
    http://0.0.0.0:6006

In [None]:
merged = tf.summary.merge_all()
train_writer = tf.summary.FileWriter(log_dir)

Finally, we set up a variable for the GPU options of the model:

In [None]:
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=gpu_mem)

## Train the Model

This is the core of the notebook.

The following cell opens a session and then performs these steps:
- add the graph to the writer (for the logs),
- initialize the global variables,
- create a saver to store the model in files,
- for each epoch:
     - assign the learning rate for the epoch,
     - reinitialize some variables:
         - the batch pointer,
         - the state of the model,
         - the variable used to measure speed,
     - then loop over all batches:
         - select x and y for the current batch,
         - build the feed dictionary for the model,
         - train the model,
         - display some information in the console,
         - save the model periodically,
- close the summary writer (the session itself closes when the with block ends).
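
The learning-rate step in the list above follows an exponential decay: during epoch e, the rate is learning_rate * decay_rate ** e. A small sketch of the resulting schedule, using the values set at the top of this notebook (the schedule list is a hypothetical helper for illustration):

```python
learning_rate = 0.002
decay_rate = 0.97
num_epochs = 25

# Learning rate used during epoch e: learning_rate * decay_rate ** e.
schedule = [learning_rate * decay_rate ** e for e in range(num_epochs)]

for e in (0, 1, 24):
    print("epoch {:2d}: lr = {:.6f}".format(e, schedule[e]))
```

With these settings the rate shrinks by 3% per epoch and ends at a bit less than half of its starting value.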
         

In [None]:
with tf.Session(config=tf.ConfigProto(gpu_options=gpu_options)) as sess:
        #add the session graph to the writer
        train_writer.add_graph(sess.graph)

        #initialize global variables
        tf.global_variables_initializer().run()

        #create the Saver to save the model and its variables.
        saver = tf.train.Saver(tf.global_variables())

        #create a for loop, to run over all epochs (defined as e)
        for e in range(model.epoch_pointer.eval(), num_epochs):
            #a session encapsulates the environment in which Operation objects are executed.
                        
            #Initialization:
            
            #set the learning rate (lr) of the model for this epoch: learning_rate * (decay_rate ** e)
            sess.run(tf.assign(model.lr, learning_rate * (decay_rate ** e)))
            
            #we define the state of the model. At the beginning of an epoch, it is the initial state of the model.
            state = sess.run(model.initial_state)
            #speed to 0 at the beginning.
            speed = 0
            #reinitialize pointer for batches
            pointer = 0
            
            if init_from is None:
                assign_op = model.epoch_pointer.assign(e)
                sess.run(assign_op)

            if init_from is not None:
                pointer = model.batch_pointer.eval()
                init_from = None

            #in each epoch, for loop to run over each batch (b)
            for b in range(pointer, num_batches):
                #record the start time of the batch:
                start = time.time()
                #define x and y for the next batch
                x, y = x_batches[pointer], y_batches[pointer]
                pointer += 1

                #create the feed dictionary for the model.
                #inputs are x, targets are y, the initial state is state, and batch_time is the duration of the previous batch (0 at the start of an epoch).
                feed = {model.input_data: x, model.targets: y, model.initial_state: state,
                        model.batch_time: speed}

                #run the session and train.
                summary, train_loss, state, _, _ = sess.run([merged, model.cost, model.final_state,
                                                             model.train_op, model.inc_batch_pointer_op], feed)
                #add summary to the log
                train_writer.add_summary(summary, e * num_batches + b)

                #measure the time taken by the batch.
                #this information is displayed below and fed to the next batch.
                speed = time.time() - start

                #display something in the console
                #---------------------------------
                #print information:
                if (e * num_batches + b) % batch_size == 0:
                    print("{}/{} (epoch {}), train_loss = {:.3f}, time/batch = {:.3f}" \
                        .format(e * num_batches + b,
                                num_epochs * num_batches,
                                e, train_loss, speed))
                
                #save model:
                if (e * num_batches + b) % save_every == 0 \
                        or (e==num_epochs-1 and b == num_batches-1): # save for the last result
                    #define the path to the model
                    checkpoint_path = os.path.join(save_dir, 'model_test.ckpt')
                    #save the model; the global step is appended to the checkpoint file name
                    saver.save(sess, checkpoint_path, global_step = e * num_batches + b)
                    print("model saved to {}".format(checkpoint_path))
        
        #close the summary writer (the session is closed when leaving the with block)
        train_writer.close()
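
To see when checkpoints are written, here is a sketch of the save condition from the loop above, without any TensorFlow. The run size (3 epochs of 400 batches) is a hypothetical example; save_every is the notebook's value:

```python
num_epochs, num_batches, save_every = 3, 400, 1000  # hypothetical small run

saved_steps = []
for e in range(num_epochs):
    for b in range(num_batches):
        # Global step: number of batches processed so far across all epochs.
        step = e * num_batches + b
        is_last = (e == num_epochs - 1 and b == num_batches - 1)
        if step % save_every == 0 or is_last:
            saved_steps.append(step)

print(saved_steps)
```

A checkpoint is written at step 0, then every save_every batches, and once more on the very last batch so that the final weights are never lost.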

## Now...
The model is now trained and stored locally. It can be used to generate samples of text!

Open the notebook __**Generate_text**__ to continue...