## Vanilla RNN on TensorFlow

In this notebook, we'll learn how to build a character-level RNN trained on the text of Anna Karenina.
This tutorial is based on Andrej Karpathy's [RNN Post](http://karpathy.github.io/2015/05/21/rnn-effectiveness/).

<img src="https://github.com/udacity/deep-learning/raw/d94980095d1187998e2e0544966bb417f031390f/intro-to-rnns/assets/charseq.jpeg" alt="Drawing" style="width: 600px;">

The purpose of this tutorial is to learn step by step, starting with understanding the problem itself and then moving on to the implementation of the RNN.

## Problem Definition

As you can see from Andrej Karpathy's blog, an RNN can be used to generate a sequence of characters: during training it learns which character is most probable next, given the current character as input. In this tutorial we will do the same, but with a different dataset.
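
To make "most probable next character" concrete, here is a minimal counting sketch. It is not an RNN, only a bigram baseline that looks at one character of context, and `most_probable_next` is a hypothetical helper name that does not appear elsewhere in this notebook.

```python
from collections import Counter, defaultdict

def most_probable_next(text):
    """Map each character to the character that most often follows it."""
    counts = defaultdict(Counter)
    # Count every (current, following) pair of adjacent characters
    for current, following in zip(text, text[1:]):
        counts[current][following] += 1
    # Keep only the most frequent follower for each character
    return {c: counter.most_common(1)[0][0] for c, counter in counts.items()}

table = most_probable_next("hello hello help")
print(table["h"])  # e
print(table["e"])  # l
```

An RNN addresses the same prediction task, but it carries a hidden state across steps, so its prediction can depend on more than the single previous character.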

In [2]:
import numpy as np 
import matplotlib.pyplot as plt
import tensorflow as tf

%matplotlib inline

Before feeding our input data to the network, we need to transform it, since the network doesn't understand characters. In a typical machine learning problem, we feed the algorithm a vector of features, which it uses to learn a decision boundary.
The situation is the same for an RNN, except that we need to transform our input from a sequence of characters into a sequence of vectors. The most common approaches for this transformation are one-hot encoding, pre-trained word embeddings (you can have a look at gensim), or randomly initialized vectors that are updated during training. In this tutorial we will use the last approach and create randomly initialized vectors.

For the first step, we'll load the text file and convert it into a list of IDs. The list of IDs will be used as indices into our lookup table. Lookup table? What is that? Don't get confused, we'll get there.

In [9]:
with open('anna.txt', 'r') as f:
    text = f.read()

# build our vocab using only the set of characters that appear in our training data
vocab = sorted(set(text))

# map each character to an integer ID (IDs start at 0)
# for example: {'a': 0, 'b': 1}
vocab_to_int = {c: i for i, c in enumerate(vocab)}

# create a reverse lookup table to convert IDs back into characters
int_to_vocab = dict(enumerate(vocab))

# create a new dataset with all the text converted into an array of IDs
encoded = np.array([vocab_to_int[c] for c in text], dtype=np.int32)
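
As a quick sanity check of the two lookup tables, the sketch below encodes a short string and decodes it back. The `toy_` names and the sample string are made up for illustration, so that nothing built from `anna.txt` is overwritten.

```python
toy_text = "happy family"

# Same construction as above, applied to a tiny string
toy_vocab = sorted(set(toy_text))
toy_vocab_to_int = {c: i for i, c in enumerate(toy_vocab)}
toy_int_to_vocab = dict(enumerate(toy_vocab))

# Encode characters to IDs, then decode the IDs back to characters
toy_ids = [toy_vocab_to_int[c] for c in toy_text]
toy_decoded = "".join(toy_int_to_vocab[i] for i in toy_ids)

print(toy_ids)
print(toy_decoded == toy_text)  # True
```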

Let's take a peek at our data:

In [13]:
print("Our real data:")
print(text[:100])
print("\nOur encoded data:")
print(encoded[:100])

Our real data:
Chapter 1


Happy families are all alike; every unhappy family is unhappy in its own
way.

Everythin

Our encoded data:
[31 64 57 72 76 61 74  1 16  0  0  0 36 57 72 72 81  1 62 57 69 65 68 65 61
 75  1 57 74 61  1 57 68 68  1 57 68 65 67 61 26  1 61 78 61 74 81  1 77 70
 64 57 72 72 81  1 62 57 69 65 68 81  1 65 75  1 77 70 64 57 72 72 81  1 65
 70  1 65 76 75  1 71 79 70  0 79 57 81 13  0  0 33 78 61 74 81 76 64 65 70]
1985223


As you can see, we have successfully transformed our data. To train the network, the next step is to split the data into several batches, so that we do not feed the whole training set into the network at once. Processing everything as one very long sequence would make vanishing or exploding gradients more likely, slow down convergence, and could cause an out-of-memory (OOM) error when training on a GPU.

Let's create a function that automatically generates the batches for us:

In [14]:
def get_batches(arr, n_seqs, n_steps):
    '''Create a generator that returns batches of size
       n_seqs x n_steps from arr.
       
       Arguments
       ---------
       arr: Array you want to make batches from
       n_seqs: Batch size, the number of sequences per batch
       n_steps: Number of sequence steps per batch
    '''
    # Get the number of characters per batch and number of batches we can make
    characters_per_batch = n_seqs * n_steps
    n_batches = len(arr)//characters_per_batch
    
    # Keep only enough characters to make full batches
    arr = arr[:n_batches * characters_per_batch]
    
    # Reshape into n_seqs rows
    arr = arr.reshape((n_seqs, -1))
    
    for n in range(0, arr.shape[1], n_steps):
        # The features
        x = arr[:, n:n+n_steps]
        # The targets: x shifted left by one step; the last column
        # wraps around to the first column of x
        y = np.zeros_like(x)
        y[:, :-1], y[:, -1] = x[:, 1:], x[:, 0]
        yield x, y

In [15]:
batches = get_batches(encoded, 10, 50)
x, y = next(batches)

In [16]:
print('x\n', x[:10, :10])
print('\ny\n', y[:10, :10])

x
 [[31 64 57 72 76 61 74  1 16  0]
 [ 1 57 69  1 70 71 76  1 63 71]
 [78 65 70 13  0  0  3 53 61 75]
 [70  1 60 77 74 65 70 63  1 64]
 [ 1 65 76  1 65 75 11  1 75 65]
 [ 1 37 76  1 79 57 75  0 71 70]
 [64 61 70  1 59 71 69 61  1 62]
 [26  1 58 77 76  1 70 71 79  1]
 [76  1 65 75 70  7 76 13  1 48]
 [ 1 75 57 65 60  1 76 71  1 64]]

y
 [[64 57 72 76 61 74  1 16  0  0]
 [57 69  1 70 71 76  1 63 71 65]
 [65 70 13  0  0  3 53 61 75 11]
 [ 1 60 77 74 65 70 63  1 64 65]
 [65 76  1 65 75 11  1 75 65 74]
 [37 76  1 79 57 75  0 71 70 68]
 [61 70  1 59 71 69 61  1 62 71]
 [ 1 58 77 76  1 70 71 79  1 75]
 [ 1 65 75 70  7 76 13  1 48 64]
 [75 57 65 60  1 76 71  1 64 61]]
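
The target shift is easier to see on a tiny array. The sketch below repeats the reshape and shift steps from `get_batches` on `np.arange(20)`, a made-up stand-in for the encoded text, using 2 sequences of 5 steps.

```python
import numpy as np

arr = np.arange(20)                # stand-in for the encoded text
n_seqs, n_steps = 2, 5

rows = arr.reshape((n_seqs, -1))   # 2 rows of 10 consecutive values
x = rows[:, 0:n_steps]             # first window of each row

# Targets: x shifted left by one, last column wraps to x's first column
y = np.zeros_like(x)
y[:, :-1], y[:, -1] = x[:, 1:], x[:, 0]

print(x.tolist())  # [[0, 1, 2, 3, 4], [10, 11, 12, 13, 14]]
print(y.tolist())  # [[1, 2, 3, 4, 0], [11, 12, 13, 14, 10]]
```

Note that the last target in each window is the first character of that same window, not the character that actually follows it in the text, so one target per row per batch is only an approximation.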


Our `get_batches` function does a great job of generating batches for us. Now that all the preparations are done, let's move on to the fun stuff: building the RNN.

## RNN

In an RNN, as we may know, the same RNN cell is applied multiple times (each application is called a step), in this case from the beginning of a sentence and usually until the end of the line. For learning purposes, we can start small by splitting each sentence into 5 steps.
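
The idea of applying one cell repeatedly can be sketched in plain NumPy. This is only an illustration under simple assumptions (a tanh cell, a zero initial state, small random weights); `rnn_forward` and the weight names are hypothetical and are not part of the TensorFlow graph built in this notebook.

```python
import numpy as np

def rnn_forward(xs, W_xh, W_hh, b_h):
    """Apply the same tanh cell to every step of xs, shape (num_steps, n_in)."""
    h = np.zeros(W_hh.shape[0])            # initial hidden state
    states = []
    for x_t in xs:                         # one iteration per step
        # The new state depends on the current input and the previous state
        h = np.tanh(x_t @ W_xh + h @ W_hh + b_h)
        states.append(h)
    return np.stack(states)                # (num_steps, n_hidden)

rng = np.random.default_rng(0)
n_steps, n_in, n_hidden = 5, 3, 4
xs = rng.normal(size=(n_steps, n_in))
W_xh = rng.normal(scale=0.1, size=(n_in, n_hidden))
W_hh = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
b_h = np.zeros(n_hidden)

states = rnn_forward(xs, W_xh, W_hh, b_h)
print(states.shape)  # (5, 4)
```

The weights `W_xh`, `W_hh`, and `b_h` are shared across all steps; only the hidden state changes from one step to the next.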

In [18]:
# hyperparameters
num_steps = 5
batch_size = 50
hidden_size = 100
learning_rate = 1e-1
embedding_size = 300

In [19]:
def build_inputs(batch_size, num_steps):
    """Placeholders in TensorFlow act as containers for our inputs and targets;
    their values are fed into the graph when we run the session.
    """
    inputs = tf.placeholder(tf.int32, [batch_size, num_steps], name="inputs")
    targets = tf.placeholder(tf.int32, [batch_size, num_steps], name="targets")
    
    return inputs, targets

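Remember that we converted all of our input into a sequence of IDs?
This is where it gets interesting: each ID selects one row of a randomly initialized embedding matrix (our lookup table), and those rows are updated during training.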

In [18]:
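def initiate_embedding_lookup(inputs, output_size, embedding_size):
    """Create a trainable embedding matrix of shape [output_size, embedding_size]
    and look up one row for every ID in `inputs`.
    """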
    embedding_weight = tf.get_variable('Embedding', [output_size, embedding_size])
    embedding_lookup = tf.nn.embedding_lookup(embedding_weight, inputs)
    return embedding_lookup
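
What the lookup does can be sketched in NumPy: selecting rows of a matrix by ID gives the same result as multiplying a one-hot encoding of the IDs by that matrix. The sizes and IDs below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, emb_dim = 6, 4

# Randomly initialized lookup table: one row per character ID
embedding = rng.normal(size=(vocab_size, emb_dim))

ids = np.array([[2, 0, 5],
                [1, 1, 3]])                 # shape (batch, steps)

looked_up = embedding[ids]                  # row selection, shape (2, 3, 4)

# The same result via one-hot vectors and a matrix product
one_hot = np.eye(vocab_size)[ids]           # shape (2, 3, 6)
via_matmul = one_hot @ embedding            # shape (2, 3, 4)

print(looked_up.shape)                      # (2, 3, 4)
print(np.allclose(looked_up, via_matmul))   # True
```

Row selection avoids building the one-hot tensor at all, which is why a lookup is the usual choice when the vocabulary is large.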

In [22]:
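def build_vanilla_rnn(inputs, hidden_size, embedding_lookup):
    # Split [batch, num_steps, embedding_size] into a list of num_steps tensors
    step_inputs = tf.unstack(embedding_lookup, axis=1)
    cell = tf.contrib.rnn.BasicRNNCell(hidden_size)
    outputs, state = tf.nn.static_rnn(cell, step_inputs, dtype=tf.float32)
    # Stack along axis 1 so the flattened rows are batch-major,
    # matching the order of the flattened targets in build_loss
    outputs = tf.stack(outputs, axis=1)
    return tf.reshape(outputs, shape=(-1, hidden_size))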
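
The order in which the per-step outputs are flattened matters, because every flattened output row has to line up with the target for the same sequence and step. The NumPy sketch below uses made-up values that encode (sequence `b`, step `t`) as `10 * b + t` to show the two possible orders.

```python
import numpy as np

batch, steps = 2, 3

# One (batch, 1) array per step; each value encodes (b, t) as 10 * b + t
outputs = [np.array([[10 * b + t] for b in range(batch)]) for t in range(steps)]

# Stacking on axis 0 yields time-major rows; axis 1 yields batch-major rows
time_major = np.stack(outputs, axis=0).reshape(-1)
batch_major = np.stack(outputs, axis=1).reshape(-1)

# Targets of shape (batch, steps), flattened the way build_loss flattens them
targets = np.array([[10 * b + t for t in range(steps)] for b in range(batch)])

print(time_major.tolist())            # [0, 10, 1, 11, 2, 12]
print(batch_major.tolist())           # [0, 1, 2, 10, 11, 12]
print(targets.reshape(-1).tolist())   # [0, 1, 2, 10, 11, 12]
```

Only the batch-major order matches the flattened targets, so a time-major flattening would pair most outputs with the wrong target.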

In [13]:
def build_output(hidden_size, output, output_size):
    weight = tf.get_variable('weight', [hidden_size, output_size])
    bias = tf.get_variable('bias', [output_size])
    logits = tf.matmul(output, weight) + bias
    
    return logits

In [15]:
def build_loss(targets, output_size, logits):
    one_hot_target = tf.one_hot(targets, output_size)
    one_hot_target = tf.reshape(one_hot_target, shape=(-1, output_size))
    loss = tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=one_hot_target)
    loss = tf.reduce_mean(loss)
    return loss
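
For intuition about the loss value, here is a NumPy sketch of mean softmax cross-entropy computed from integer targets. `softmax_cross_entropy` is a hypothetical helper written for this illustration, not a TensorFlow function, and the class count is made up.

```python
import numpy as np

def softmax_cross_entropy(logits, target_ids):
    """Mean cross-entropy between softmax(logits) and integer targets."""
    # Subtract the row maximum for numerical stability
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Pick the log-probability assigned to the correct class in each row
    picked = log_probs[np.arange(len(target_ids)), target_ids]
    return -picked.mean()

n_classes = 10
uniform_logits = np.zeros((4, n_classes))   # every class equally likely
toy_targets = np.array([0, 3, 7, 9])

toy_loss = softmax_cross_entropy(uniform_logits, toy_targets)
print(np.isclose(toy_loss, np.log(n_classes)))  # True
```

A model that predicts every character with equal probability has a loss of `log(n_classes)`, which gives a rough reference point for judging the first few values printed during training.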

In [16]:
def build_optimizer(learning_rate, loss):
    return tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(loss)

In [19]:
inputs, targets = build_inputs(None, x.shape[1])

In [20]:
embedding_lookup = initiate_embedding_lookup(inputs, len(vocab), embedding_size)

In [23]:
output = build_vanilla_rnn(inputs, hidden_size, embedding_lookup)

In [24]:
logits = build_output(hidden_size, output, len(vocab))

In [25]:
loss = build_loss(targets, len(vocab), logits)

In [26]:
opt = build_optimizer(learning_rate, loss)

In [None]:
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for e in range(10):
        for _x, _y in get_batches(encoded, 5, 50):
            feed = {
                inputs: _x,
                targets: _y,
            }
            batch_loss, _ = sess.run([loss, opt], feed_dict=feed)
            print(batch_loss)