## Vanilla RNN
### Overview:
The code works! While that in by itself is a major accomplishment, I'm not sure that will get me fame! Regardless, the focus is on understanding the vanilla RNN by developing it from a scratch. Everything, except for one/two specialized tensorflow method, should be completely transparent at the vector-wise mathematics level.

The generated text at the end seems reasonable enough. With the relatively small amount of training data available, including a single layer RNN, I think the architecture is far from being optimal. There are so many hyper-parameters to explore (state_size, embedding_size, batch_size, learning_rate, num_layers etc. etc.) that this could be a weeks' worth of efforts by themselves.

### Based mostly on code and tutorial at

(Someday I only hope I'll be able to write posts as well as the below)
- https://github.com/suriyadeepan/rnn-from-scratch/blob/master/vanilla.py

- http://suriyadeepan.github.io/2017-02-13-unfolding-rnn-2/

Some other resources to understand, especially, the origin of LSTMs and word-embeddings
- https://r2rt.com/written-memories-understanding-deriving-and-extending-the-lstm.html
- http://sebastianruder.com/word-embeddings-1/

In [1]:
import tensorflow as tf
import numpy as np
state_size = 37 # (parameter for the RNN cell; describes how wide the cell should be)
embedding_size = 27 #(parameter for embedding layer; describes the size of embedding for each token)

### Mutable variables for weights and biases
- W: weight matrices  for the recurrent relationship between cell states
- U: weight matrices  for the connection between input and cell state
- b: bias parameter for input -> cell state connections

In [2]:
W = tf.Variable(initial_value=tf.random_normal(mean=0.,stddev=0.1,shape=[state_size, state_size]))
U = tf.Variable(initial_value=tf.random_normal(mean=0.,stddev=0.1,shape=[embedding_size, state_size]))
b = tf.get_variable('b', shape=[state_size], initializer=tf.constant_initializer(0.))

### Placeholder for input and target

In [3]:
ph_xs = tf.placeholder(shape=[None, None], dtype=tf.int32)
ph_ys = tf.placeholder(shape=[None, None], dtype=tf.int32)
ph_init_state = tf.placeholder(shape=[None, state_size], dtype=tf.float32, name="initial_state")

### Data Acquisition
The data in the files were populated using the *Data_Provider* notebook. *train_x* and *train_y* are both arrays of shape: [10,200], where 200 is a fixed sequence length for this model and 10 is the total number of samples we have available. 

In [4]:
train_x = np.load("train_x.npy")
train_y = np.load("train_y.npy")

def get_train_batches(train_x, train_y, batch_size):
    for i in range(0, train_x.shape[0], batch_size):    
        yield train_x[i : i+batch_size], train_y[i : i+batch_size]

In [5]:
# Import some essential utilites from Data_Provider.py notebook
import pickle
vocab_to_int = pickle.load(open('vocab_to_int.txt','rb'))
int_to_vocab = pickle.load(open('int_to_vocab.txt','rb'))

### Embedding
Given an input data of shape: [batch_size, seq_length], rnn_inputs returns a tensor of shape: [batch_size, seq_length, embedding_size]. This is purely driven by our decision to use an embedding matrix to represent each token (characters). Otherwise, we would've had to one-hot encode each character, and modify our input layer. I'm not sure what the complexity of this task would need to be, but using embedding matrices seem to be the preferred approach in most NLP applications of neural network. I think it massively improves the underlying computational complexity as well.

In [6]:
# number of unique characters.. you'd normally do this by inspecting the data directly
num_classes = 83
embeddings = tf.Variable(initial_value=tf.random_normal(mean=0., stddev=0.1, shape=[num_classes,embedding_size]))
rnn_inputs = tf.nn.embedding_lookup(embeddings, ph_xs)

### The main equation that powers RNN

The equation below describes how cell-state at current time ($s_{t}$) must be updated as a function of the cell-state in the previous step ($s_{t-1}$), and input word at the current step ($x_t$):
$$ s_t = tanh ( W \times s_{t-1} + U \times x_{t} ) $$

Note that we're excluding the bias term for simplicity. Also, the weight matrices(W and U) are universally shared, as is the matrix V, which is introduced later to evaluate the errors

In [7]:
def step(hidden_previous, x):
    '''
    Method evaluates the new cell-state based on hidden
    state from previous step, and input for given shape.
    It uses the global shared parameter: W, U, and b.
    
    parameters:
    - hidden_previous: This is the hidden state from the
    previous time_step. See the unrolled computational
    graph in any vanilla RNN tutorial.
    
    - x: input for current time_step    
    '''
    temp_a = tf.matmul(hidden_previous, W)
    temp_b = tf.matmul(x, U) + b
    return tf.tanh(temp_a + temp_b)

### tf.scan

Method builds a loop that dynamically unforlds, to recursively apply the function step over all elements of the rnn_inputs. This is also one of the most important step to intuitively understand. I think without using the step function, it'd be extremely complex to implement the recursion successfully (and/or accurately)

The dimensions also need to be reshuffled, so that the sequence length dimension is exposed as the 0th dimension. This is to enable iteration over elements of the sequence. I.e. Tensor of form [batch_size, seq_length, embedding_size] is transposed to [seq_length, batch_size, embedding_size]. The reshuffled tensor is represented by *transposed_inputs*

The _states_ returned by scan is an array of states from all the time steps, using which we will predict the output probabilities at each step.

In [8]:
transposed_inputs = tf.transpose(rnn_inputs,[1,0,2])

'''
tf.scan applies the step method recursively on the second argument, i.e
transposed_inputs. The returned value from one execution will also be the input 
to the next call of the method. 
'''
states = tf.scan(step, # the method that should be applied to each vec from below. Updates weights
                 transposed_inputs, # [vec1, vec2, .... , vecN]: each vec. corresponds to the embedding row for a word.
                                    # There are seqlen num. of words for every single batch. Hence, this method will recurse throug
                                    # seqlen * batch_size times, each time returning a vector of length: state_size
                 initializer=ph_init_state) # initializer. For computing the first argument (i.e. vec1), begin by a random set of values.

In [9]:
# The shape below is [seqlen, batch_size, state_size]. It's basically a tensor where each character in each 
# sequence and batch has been encoded in the internal state of the RNN
print (states.shape.as_list())

[None, None, 37]


Remember that states returns a tensor with shape [seqlen, batch_size, state_size]. Getting states[-1] returns the last element along the first dimension, i.e. the tensor [batch_size, state_size] for the last seq. length (in other words, the last time-step). Assign it to *last_state*. This variable will eventually be used for text generation.

In [10]:
last_state = states[-1]

# transpose to: [batch_size, seqlen, state_size]
states = tf.transpose(states,[1,0,2])

### Output

The *states* tensor returned by _scan_ is of shape: [seqlen * batch-size, state_size]. Thus reshape to allow vector multiplication against the output network involving V and bo, where:

- V: weight matrices for connecting cell-state to the output (or target)
- b: bias vector for connecting cell-state to the output

In [11]:
V = tf.Variable(initial_value=tf.random_normal(mean=0., stddev=0.1, shape=[state_size, num_classes]))
bo = tf.get_variable('bo', shape=[num_classes], initializer=tf.constant_initializer(0.))
states_reshaped = tf.reshape(states, [-1, state_size])
logits = tf.matmul(states_reshaped, V) + bo

# The shape of predictions will be: [seqlen * batch_size, num_classes]
predictions = tf.nn.softmax(logits)

### Optimization
ph_ys is shaped as follows: [batch_size, seq_length]. In order to use it against logits obtained by multiplying hidden_states (states) and output matrix (V), we need to collapse it into a single dimension.

In [12]:
# basically flatted ys into a flat vector
ph_ys_reshaped = tf.reshape(ph_ys, shape=[-1])

The tensorflow method: https://www.tensorflow.org/api_docs/python/tf/nn/sparse_softmax_cross_entropy_with_logits
is very interesting. It (the method) will basically look at (the integer representation of) each character and evaluate a one-hot representation based on its label. Thus, ph_ys_reshaped, of size [batch_size * seqlen] will effectively be unrolled into a vector of length [batch_size * seqlen * num_classes]. The method then uses this unrolled vector to evaluate the loss against *logits*

In [13]:
losses = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=ph_ys_reshaped, logits=logits)
loss = tf.reduce_mean(losses)
train_op = tf.train.AdamOptimizer(learning_rate=0.01).minimize(loss)

### Training

In [14]:
batch_size = 100 # Take 100 at a time. We have 10,000 needed, so 100 iterations per epochs
seq_length = 100 # each vector is fixed length of 100 (by inspecting data)
chkpt_path = "ckpts/"

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    train_loss = 0.
    n_epochs = 5
    for i in range(n_epochs):
        for xs, ys in get_train_batches(train_x, train_y, batch_size=100):
            _, train_loss_val = sess.run([train_op, loss], 
                                         feed_dict={ph_xs: xs, 
                                                    ph_ys: ys,
                                                    ph_init_state: np.zeros([batch_size, state_size])})
        print("epochs: ",i, ", loss evaluated: ",train_loss_val)

    # Save the model at the end of the run.
    saver = tf.train.Saver()
    saver.save(sess, chkpt_path+"vanilla_rnn.ckpt", global_step=n_epochs)

epochs:  0 , loss evaluated:  2.55229
epochs:  1 , loss evaluated:  2.36326
epochs:  2 , loss evaluated:  2.26657
epochs:  3 , loss evaluated:  2.18202
epochs:  4 , loss evaluated:  2.12148


### Text Generation

In [15]:
def restore_session(sess):
    ckpt = tf.train.get_checkpoint_state(chkpt_path)
    
    saver = tf.train.Saver()
    if ckpt and ckpt.model_checkpoint_path:
        print("restoring model from ",ckpt.model_checkpoint_path)
        saver.restore(sess, ckpt.model_checkpoint_path)
    return sess

In [16]:
def printChars(chars):
    print('------- Generated Text ----------')
    print(''.join(str(c) for c in chars))
    print('-------      END        ---------')

In [17]:
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess = restore_session(sess)
    
    current_char = vocab_to_int['g']
    num_chars = 100             # number of char sequence to try generate
    chars = [current_char]     # some input char to get started with text generation
    batch_size = 1             # generate 1 batch a time (i wonder if its possible to specify a larger batch)
    state = np.zeros([batch_size, state_size]) # initial state to start off with
    
    for i in range(num_chars):
        preds, state = sess.run([predictions, last_state], 
                                feed_dict={ph_xs: np.array(current_char).reshape([1,1]),
                                           ph_init_state: state})
        current_char = np.random.choice(preds.shape[-1], 1, p=np.squeeze(preds))[0]
        chars.append(int_to_vocab[current_char])
    printChars(chars)

restoring model from  ckpts/vanilla_rnn.ckpt-5
------- Generated Text ----------
21ing nike sold her tit
to his toon of the croviteable as not aneve the lore, dasturcot-at the Vamady 
-------      END        ---------
