### What is an RNN?

Traditional (feed-forward) neural networks and convolutional neural networks are not well suited to inputs that arrive as sequences. For example, given a single word in isolation, you often cannot identify its part of speech: the same word can play different roles depending on context, so you have to look at the nearby words in the sentence. The entire sequence needs to be studied to determine the response.

This is where Recurrent Neural Networks (RNNs) come in. As the RNN traverses the input sequence, the output for each item also becomes part of the input for the next item in the sequence. **The term recurrent comes from the fact that the output of the previous step is fed in as one of the inputs of the current step. When this is repeated over and over, the last output is the result of all the previous inputs and the last input.**

http://karpathy.github.io/2015/05/21/rnn-effectiveness/
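
The recurrence described above can be sketched in a few lines of plain Python. This is a minimal illustration with a single hidden unit, not the network trained later in this notebook; the function name `rnn_forward` and the weight values are made up for the example.

```python
import math

def rnn_forward(inputs, w_x=0.5, w_h=0.8, b=0.0):
    """Run a one-unit vanilla RNN over a sequence and return every hidden state."""
    h = 0.0                      # initial hidden state
    states = []
    for x in inputs:
        # The previous output h is fed back in alongside the current input x.
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states

states_a = rnn_forward([1, 0, 0])
states_b = rnn_forward([0, 0, 0])

# The last two inputs are identical, yet the final states differ
# because the first input is still "remembered" through h.
print(states_a[-1], states_b[-1])
```

The final state of the first sequence is non-zero only because of its first element, which is the sense in which the last output depends on all previous inputs.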

### What is an LSTM?

Sometimes we only need to look at recent information to perform the present task, for example when predicting the last word in "the clouds are in the sky". RNNs are well suited to sequence classification problems because **they’re able to retain important data from the recent inputs and use that information to modify the current output**. In such cases, where the gap between the relevant information and the place where it’s needed is small, RNNs can learn to use the past information.

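But there are also cases where we need more context. Consider trying to predict the last word in the text “I grew up in France… I speak fluent French.” The recent words suggest that the next word is the name of a language, but to know which one we need "France" from much earlier in the text. When sequences get long, the gradients (the values used to tune the network's weights) computed during training by backpropagation either vanish (a product of many values between 0 and 1) or explode (a product of many values larger than 1). Vanishing gradients make learning long-range dependencies very slow, and exploding gradients make training unstable.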
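
To get a feel for how quickly repeated multiplication shrinks or blows up a value, here is a small sketch. It assumes, purely for illustration, that every time step contributes the same factor (0.9 or 1.1); real gradients involve matrices and vary from step to step.

```python
def gradient_scale(factor, steps):
    """Multiply the same per-step factor `steps` times, roughly as
    backpropagation through a long sequence does."""
    scale = 1.0
    for _ in range(steps):
        scale *= factor
    return scale

for steps in (10, 50, 100):
    print(steps, gradient_scale(0.9, steps), gradient_scale(1.1, steps))
```

After 100 steps the 0.9 case has shrunk by several orders of magnitude while the 1.1 case has grown by several, even though both factors are close to 1.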

**Long Short Term Memory** networks, usually just called **LSTMs**, are a special kind of RNN capable of learning long-term dependencies and retaining memory. LSTMs mitigate the vanishing gradient problem by introducing gates that control access to a cell state.



### Learn RNN/LSTM through example

Given a binary string of length 10 (just 0s and 1s), find the count of 1s in the string.

It's a trivial problem in any programming language of your choice **but NOT so trivial in the RNN world.**
It will help us understand the fundamentals of RNN architecture.
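
For reference, here is the trivial version written as an explicit loop that carries a running state from one character to the next. This is only a hand-written baseline (the name `count_ones` is made up here); the network below is never given this rule and has to learn an equivalent from examples.

```python
def count_ones(bits):
    """Count the 1s in a binary string by carrying a running total
    from character to character."""
    state = 0                     # plays the role of a hidden state
    for ch in bits:
        state = state + int(ch)   # new state depends on old state and current input
    return state

print(count_ones('1101110100'))  # → 6
```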

### Data Setup

Generate all possible binary strings of length 10 and shuffle them.

In [1]:
import numpy as np
from random import shuffle

input_values = ['{0:010b}'.format(i) for i in range(2**10)]
shuffle(input_values)
print('shuffled input={}'.format(input_values[:10]))

#print('\n Total number of inputs ={}'.format(len(input_values)))

shuffled input=['1101110100', '1001110100', '0000001000', '1100010000', '1110101101', '0001010111', '1010110001', '1011100101', '1000100001', '1110100111']
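
The format spec `010b` used above means "binary, zero-padded to width 10". A small sketch, using a hypothetical helper `to_bits`, shows that this yields 1024 distinct strings:

```python
def to_bits(n, width=10):
    """Format a non-negative integer as a zero-padded binary string."""
    return '{0:0{1}b}'.format(n, width)

all_strings = [to_bits(i) for i in range(2 ** 10)]
print(to_bits(5), len(all_strings), len(set(all_strings)))
```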


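Each string needs to be vectorized, i.e. converted to an array, so that it can be passed to the neural network. Each character becomes a one-element list, so every string turns into an array of shape (10, 1).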

In [2]:
# Turn each string into a (10, 1) array: one row per character, one feature per row
input_values = [np.array([[int(ch)] for ch in s]) for s in input_values]

print(input_values[:2])

[array([[1],
       [1],
       [0],
       [1],
       [1],
       [1],
       [0],
       [1],
       [0],
       [0]]), array([[1],
       [0],
       [0],
       [1],
       [1],
       [1],
       [0],
       [1],
       [0],
       [0]])]


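As the problem states, the output is the count of 1s in the input. Computing the count is easy, but we then need to convert it to a **one-hot encoding**: a list of 11 values (one per possible count, 0 through 10) with a 1 at the index of the count and 0 everywhere else.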

In [3]:
output_values = []

for i in input_values:
    count = 0
    for j in i:
        if j[0] == 1:
            count+=1   # count the value of 1
    temp_list = ([0]*11) # create 11 sized array as [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    temp_list[count]=1   
    output_values.append(temp_list)

print(output_values[:10])

[[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0], [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0], [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0], [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]]
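
The encoding step can also be written as a pair of small helpers. The names `one_hot` and `decode` are made up for this sketch; `decode` simply recovers the count as the position of the 1.

```python
def one_hot(index, num_classes=11):
    """Return a list of zeros with a single 1 at position `index`."""
    vec = [0] * num_classes
    vec[index] = 1
    return vec

def decode(vec):
    """Recover the class index as the position of the largest entry."""
    return vec.index(max(vec))

encoded = one_hot(6)
print(encoded, decode(encoded))
```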


Now we have got the input and output values. We need to divide these values into train and test data.

In [4]:
NUM_EXAMPLES = 500
test_input = input_values[NUM_EXAMPLES:] 
test_output = output_values[NUM_EXAMPLES:] #everything beyond 500

train_input = input_values[:NUM_EXAMPLES]
train_output = output_values[:NUM_EXAMPLES] #till 500
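
It is worth checking how the 1024 strings are distributed over the 11 classes before trusting the split. The sketch below counts them with binomial coefficients (it needs Python 3.8+ for `math.comb`).

```python
from math import comb

# Number of length-10 binary strings that contain exactly k ones.
class_sizes = {k: comb(10, k) for k in range(11)}
print(class_sizes)

total = sum(class_sizes.values())
train_size = 500
test_size = total - train_size
print(total, train_size, test_size)
```

The classes are far from balanced: counts near 5 dominate, while the all-zeros and all-ones strings are the only members of classes 0 and 10. Whichever side of the split those two strings land on, the other side has no example of that class.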


### Model Designing

We solve this problem with TensorFlow so that we can focus on the fundamentals.

The input data has the dimensions **[Batch Size, Sequence Length, Input Dimension]**. The batch size is left as None so that it can be determined at runtime. The target placeholder holds the training labels, i.e. the correct results that we want the network to produce. Both placeholders are supplied with data later, during the execution phase of TensorFlow.

In [5]:
import tensorflow as tf

# 10 inputs 
data = tf.placeholder(tf.float32, [None, 10,1], name='input_placeholder')   

# output could be any number between 0 and 10 (both included); so 11 values are possible
target = tf.placeholder(tf.float32, [None, 11], name='labels_placeholder') 

**The key to LSTMs is the cell state, drawn as the horizontal line running through the top of the LSTM diagram in the post linked below.** The cell state is kind of like a conveyor belt. It runs straight down the entire chain, with only some minor linear interactions. It’s very easy for information to just flow along it unchanged. The LSTM does have the ability to remove or add information to the cell state, carefully regulated by structures called gates.

http://colah.github.io/posts/2015-08-Understanding-LSTMs/
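
To make the conveyor-belt picture concrete, here is a minimal sketch of one step of a single-unit LSTM in plain Python. It follows the standard forget/input/output gate formulation; the function names and all parameter values are hypothetical and chosen by hand, and this is not how BasicLSTMCell is implemented internally.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a single-unit LSTM; `p` maps gate name -> (w_x, w_h, bias)."""
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre('forget'))        # how much of the old cell state to keep
    i = sigmoid(pre('input'))         # how much new information to write
    o = sigmoid(pre('output'))        # how much of the cell state to expose
    g = math.tanh(pre('candidate'))   # candidate values to write
    c = f * c_prev + i * g            # cell state: the "conveyor belt"
    h = o * math.tanh(c)              # hidden state / output for this step
    return h, c

# Hand-picked parameters: forget gate almost fully open, input gate almost closed.
params = {
    'forget': (0.0, 0.0, 10.0),
    'input': (0.0, 0.0, -10.0),
    'output': (0.0, 0.0, 0.0),
    'candidate': (1.0, 0.0, 0.0),
}

h, c = 0.0, 0.7
for x in [1, 0, 1, 1]:
    h, c = lstm_step(x, h, c, params)
print(c)
```

With these gate settings the cell state stays very close to its starting value of 0.7 no matter what the inputs are, which is the "information flows along unchanged" behaviour described above. A trained LSTM learns when to open and close the gates instead.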
    
For the LSTM cell that we initialize below, we need to provide the hidden dimension, i.e. the number of units in the cell. The value is up to us: too high a value may lead to overfitting, while too low a value may yield extremely poor results.

In [6]:
num_hidden = 12  # we can play with different values here

cell = tf.contrib.rnn.BasicLSTMCell(num_hidden,state_is_tuple=True)
val, state = tf.nn.dynamic_rnn(cell, data, dtype=tf.float32)

We unroll the network, pass the data to it, and store the output in val. We also get the state at the end of the dynamic run as a return value, but we discard it because each new sequence starts from a fresh state, so the final state of the previous one is irrelevant to us.

We transpose the output to swap the batch and sequence dimensions. After that we take the outputs only at the sequence’s last input: in a string of 10 we’re only interested in the output we got at the 10th character, and the outputs for the earlier characters are irrelevant here.

In [7]:
val = tf.transpose(val, [1, 0, 2])
last = tf.gather(val, int(val.get_shape()[0]) - 1)
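
The transpose-then-gather step is easier to follow on a tiny example. The sketch below uses nested Python lists in place of tensors, with toy sizes (2 examples, 3 time steps, 4 hidden units) chosen only for illustration.

```python
batch, steps, hidden = 2, 3, 4

# val[b][t] is the output vector for example b at time step t,
# i.e. the layout [batch, steps, hidden].
val = [[[100 * b + 10 * t + k for k in range(hidden)]
        for t in range(steps)]
       for b in range(batch)]

# Same idea as tf.transpose(val, [1, 0, 2]): make time the leading axis.
time_major = [[val[b][t] for b in range(batch)] for t in range(steps)]

# Same idea as tf.gather(time_major, steps - 1): keep only the final time step.
last = time_major[steps - 1]

# Shortcut that skips the transpose: take the last step of every example.
last_direct = [example[-1] for example in val]

print(last == last_direct, len(last), len(last[0]))
```

Either way the result has one vector of size `hidden` per example, i.e. the layout [batch, hidden].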

In [8]:
print(num_hidden)
print(target.get_shape()[1])

12
11


In [9]:
weight = tf.Variable(tf.truncated_normal([num_hidden, int(target.get_shape()[1])]))
bias = tf.Variable(tf.constant(0.1, shape=[int(target.get_shape()[1])]))

The dimension of the weights is num_hidden X number_of_classes. Thus, on multiplication with the last output (last, of dimension batch_size X num_hidden), the resulting dimension is batch_size X number_of_classes, which is what we are looking for.

In [10]:
prediction = tf.nn.softmax(tf.matmul(last, weight) + bias)

After multiplying the output with the weights and adding the bias, we have a matrix with a raw score for each class. What we are interested in is the probability score for each class, i.e. the chance that the sequence belongs to a particular class. Applying the softmax activation turns the raw scores into probabilities that sum to 1.
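
Here is a small plain-Python sketch of the same two steps, a dense layer followed by softmax, for a single example. The helper names and all numbers are made up, and the sizes are shrunk to 3 hidden units and 4 classes (instead of 12 and 11) to keep it readable.

```python
import math

def softmax(logits):
    """Turn a list of raw scores into probabilities that sum to 1."""
    shift = max(logits)                        # subtracting the max avoids overflow in exp
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dense(vector, weight, bias):
    """Compute vector @ weight + bias for one example (weight is rows x cols)."""
    return [sum(vector[r] * weight[r][c] for r in range(len(vector))) + bias[c]
            for c in range(len(bias))]

last_output = [0.2, -0.5, 0.9]                 # one example, 3 hidden units
weight = [[0.1, 0.4, -0.3, 0.0],               # 3 x 4: hidden units x classes
          [0.7, -0.2, 0.5, 0.1],
          [-0.6, 0.3, 0.8, 0.2]]
bias = [0.1, 0.1, 0.1, 0.1]

logits = dense(last_output, weight, bias)
probs = softmax(logits)
print(probs, sum(probs))
```

Softmax keeps the ranking of the raw scores, so the most likely class is the same before and after; it only rescales the scores into a probability distribution.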

We calculate the cross entropy loss and use it as our cost function. The cost function tells us how well or how poorly our predictions stack up against the actual results, and it is the function we are trying to minimize.

Taking the log of the predicted probability penalizes the model heavily when it is confidently wrong and only slightly when the prediction is close to the target. The predictions are clipped to the range [1e-10, 1.0] first so that the log is never taken of zero. Once the loss is defined, the last step in model design is to prepare the optimization function.

In [11]:
cross_entropy = -tf.reduce_sum(target * tf.log(tf.clip_by_value(prediction,1e-10,1.0)))
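
The effect of the log term is easy to see on a single example. This sketch mirrors the formula above in plain Python for one one-hot target; the probabilities are invented for illustration.

```python
import math

def cross_entropy(target, prediction, eps=1e-10):
    """Cross entropy between a one-hot target and predicted probabilities,
    with the prediction clipped to [eps, 1.0] before taking the log."""
    return -sum(t * math.log(min(max(p, eps), 1.0))
                for t, p in zip(target, prediction))

target = [0, 1, 0]
loss_good = cross_entropy(target, [0.05, 0.90, 0.05])   # confident and right
loss_bad = cross_entropy(target, [0.90, 0.01, 0.09])    # confident and wrong
loss_zero = cross_entropy(target, [1.0, 0.0, 0.0])      # clipping keeps this finite
print(loss_good, loss_bad, loss_zero)
```

Only the probability assigned to the correct class contributes, since every other term is multiplied by 0 in the one-hot target.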

TensorFlow has several optimizers, such as RMSPropOptimizer and AdagradOptimizer. We choose AdamOptimizer, and minimize is the operation that performs one optimization step to reduce the cross_entropy loss that we calculated previously.

In [12]:
optimizer = tf.train.AdamOptimizer()
minimize = optimizer.minimize(cross_entropy)

### Calculating the error on test data

This error is the fraction of sequences in the test dataset that were classified incorrectly. It gives us an idea of the correctness of the model on the test dataset.

In [13]:
mistakes = tf.not_equal(tf.argmax(target, 1), tf.argmax(prediction, 1))
error = tf.reduce_mean(tf.cast(mistakes, tf.float32))
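
The same computation in plain Python, on four made-up examples with three classes, looks like this (the helper names are hypothetical):

```python
def argmax(values):
    """Index of the largest entry."""
    return max(range(len(values)), key=lambda k: values[k])

def error_rate(targets, predictions):
    """Fraction of examples whose predicted class differs from the target class."""
    mistakes = [argmax(t) != argmax(p) for t, p in zip(targets, predictions)]
    return sum(mistakes) / len(mistakes)

targets = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]]
predictions = [[0.8, 0.1, 0.1], [0.3, 0.6, 0.1], [0.5, 0.2, 0.3], [0.1, 0.7, 0.2]]
print(error_rate(targets, predictions))  # → 0.25 (the third example is misclassified)
```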

### Execution of the graph

We’re done with designing the model. Now the model is to be executed!

In [14]:
init_op = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init_op)


In [15]:
batch_size = 100
no_of_batches = int(len(train_input)/batch_size)
epoch = 2000

for i in range(epoch):
    ptr = 0
    for j in range(no_of_batches):
        inp, out = train_input[ptr:ptr+batch_size], train_output[ptr:ptr+batch_size]
        ptr+=batch_size
        sess.run(minimize,{data: inp, target: out})
    if i % 500 == 0:
        print("Epoch - ",str(i))

Epoch -  0
Epoch -  500
Epoch -  1000
Epoch -  1500
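
The pointer arithmetic in the loop above can be factored into a small generator. This is an optional sketch with a hypothetical helper name, shown on toy data rather than the real training arrays.

```python
def iter_batches(inputs, outputs, batch_size):
    """Yield consecutive (inputs, outputs) slices of at most batch_size items."""
    for start in range(0, len(inputs), batch_size):
        yield inputs[start:start + batch_size], outputs[start:start + batch_size]

toy_inputs = list(range(500))
toy_outputs = [x % 11 for x in toy_inputs]

sizes = [len(inp) for inp, _ in iter_batches(toy_inputs, toy_outputs, 100)]
print(sizes)
```

One difference from the loop above: when the number of examples is not a multiple of the batch size, this generator yields a final smaller batch, whereas the original loop skips the leftover examples. With 500 examples and a batch size of 100 the two behave the same.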


In [16]:
incorrect = sess.run(error,{data: test_input, target: test_output})
print('Epoch {:2d} error {:3.1f}%'.format(i + 1, 100 * incorrect))
sess.close()

Epoch 2000 error 0.4%


After the final epoch, the error rate on the held-out test set is less than 1%!

**Our neural network figured that out by itself! We did not instruct it to perform any of the counting operations.**