# Predicting words using a modified Karpathy RNN
##### Authors: Bret Stine, Sean Vucinich
##### Date: December 14th, 2018

## Introduction:
For this project, we start from Andrej Karpathy's RNN and modify it to work with TensorFlow. Karpathy's code takes a manual approach to predicting text one character at a time via an RNN. Translating his work into TensorFlow makes for a streamlined, easy-to-understand process, assuming the reader is familiar with machine learning in TensorFlow.

For the modified Karpathy RNN, we will be using the Alice in Wonderland text file that was provided to us. The goal for this project is to train a model whose generated text improves over time until it outputs coherent sentences, or at the very least legible words.

## Challenges and Issues:

Throughout the creation of this notebook, numerous issues were encountered that needed to be solved in order to have a working RNN based upon the Karpathy code.

* **Understanding Karpathy's code**: The first challenge was getting a basic understanding of Karpathy's code. It isn't the easiest thing to read, and the variable names aren't descriptive until you have read through the code many times, among other problems. We overcame this by breaking the code down into sections and running it line by line to understand what each part accomplishes.

* **Generating a Basic RNN**: The second issue, once the Karpathy code was understood, was how to turn it into an RNN that we could keep building upon. We overcame this by starting with the basic RNN code from a previous lab and adapting it to serve as the framework for the code in this notebook.

* **Using the sparse softmax function**: When we were working on the graph construction of the RNN, one of the first problems we ran into was choosing the loss function, as the framework RNN we had decided to build upon used the **tf.nn.sparse_softmax_cross_entropy_with_logits()** function. This function, however, was incompatible with our setup due to a mismatch in the rank of the tensors: it expects integer class labels with one fewer dimension than the logits, whereas our targets are one-hot encoded and have the same rank as the RNN output. Through exploring other loss functions, we were able to avoid this issue by switching to **tf.nn.softmax_cross_entropy_with_logits_v2()**, which not only performed better but also bypassed the original issue.

* **GPU enabled RNN Cell (tf.contrib.cudnn_rnn.CudnnRNNRelu)**: This issue had to do with swapping the basic cell used for the RNN graph for a GPU-enabled one, in an effort to vastly improve training performance. However, we had to abandon this idea early on, as the ReLU variation of the cuDNN-enabled RNN cell produces its outputs differently than **tf.contrib.rnn.BasicRNNCell()** does. This would have necessitated a complete rewrite of the graph construction part of the notebook, which, while extremely beneficial to training speed, would have taken time away from completing the project.

* **One Hot Encoding - How to do it**: We ran into two issues when it came to one-hot encoding the batches. The first was how exactly we should perform the one-hot encoding. From looking through the Karpathy code and examples of one-hot encoding text in TensorFlow, we came up with two options: use the **tf.one_hot()** function, or write our own. After playing around with the built-in TensorFlow function with little success, we decided to write our own simple function. To do this, we build an identity matrix with **np.eye()**, a 2-D array with ones on the diagonal and zeros elsewhere, and index it with the input array so that each integer is replaced by the matching one-hot row.

* **One Hot Encoding - Location in the code**: The second issue with one-hot encoding was where in the code we should actually perform it. Initially we thought we should one-hot encode the text when we reshape it into encoded sequences of characters, as part of the last pre-processing step. However, this caused problems during the graph execution phase, which led us to take a step back and think again about where the text should be encoded. We ultimately decided to one-hot encode the batches when they are produced by our **next_batch()** function during the graph execution phase, as this resulted in no performance loss and avoided the previous errors.

* **Rank of inputs and targets**: An unexpected issue that arose early on was a rank mismatch between the input and target tensors. Because we based our code upon an earlier RNN graph example, we had initially believed that the tensors only needed the single **seq_length** parameter to work properly. While the input tensor did work without error, the target tensor did not, which prompted us to go back and figure out that *both* tensors needed to have a rank of 3 for our RNN to function properly.

* **Next Batch Function**: During the graph execution phase, we had initially wanted to use the **shuffle_batch()** function that we had previously used on other neural net problems, but using that function led to errors and we quickly ruled it out. The solution came from the textbook *Hands-On Machine Learning with Scikit-Learn & TensorFlow*, whose RNN chapter includes an example **next_batch()** function. Using this function as a base, we modified it a bit and incorporated our one-hot encoding into the return value, which solved the issues we had encountered with **shuffle_batch()**.

* **Generating Predictions**: This issue popped up towards the end of the project: once we had the RNN functioning, we had to generate predictions from it. We quickly realized that we had not done this with an RNN before, so we once again fell back upon the textbook for assistance. The textbook only had a basic example of how to generate predictions, but combined with what we knew about generating predictions from previous assignments and the general format they would take, we were able to write a loop that generates them from the output tensor as such: ***predictions = outputs.eval(feed_dict={inputs: X_batch})***

* **Outputting text**: Once the previous issue was solved, we thought that outputting the text from the predictions would be easy; however, it turned out to be time-consuming. Initially we had thought we simply needed to assign the predictions to letters in a loop, but we quickly realized that the predictions were float values, so we could not follow that approach. We then fell back on the Karpathy code for assistance. By dissecting its **sample()** function, we learned the general approach: turn the predictions into probabilities, then randomly choose a character index using those probabilities as weights, and map the chosen indices back to text.
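
The sampling idea from the last item can be sketched in plain NumPy. This is only an illustration under assumed values: the three-character vocabulary, the scores, and the helper name `sample_char` are hypothetical and do not come from the notebook.

```python
import numpy as np

def sample_char(logits, ix_to_char, rng):
    """Sample one character from a vector of unnormalized scores."""
    # Subtract the max before exponentiating for numerical stability.
    shifted = logits - np.max(logits)
    probs = np.exp(shifted) / np.sum(np.exp(shifted))
    # Draw an index using the probabilities as weights.
    ix = rng.choice(len(probs), p=probs)
    return ix_to_char[ix]

# Hypothetical three-character vocabulary and scores, for illustration only.
toy_ix_to_char = {0: 'a', 1: 'b', 2: 'c'}
toy_logits = np.array([2.0, 0.5, -1.0])
rng = np.random.RandomState(0)
sampled = ''.join(sample_char(toy_logits, toy_ix_to_char, rng) for _ in range(20))
print(sampled)
```

Because the draw is weighted rather than an argmax, characters with higher scores appear more often while lower-scoring ones still show up occasionally.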





## Importing Data:

In [1]:
import numpy as np
import tensorflow as tf
import requests

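# download the text and keep the body as a single string
response = requests.get('https://raw.githubusercontent.com/bretstine/alice/master/alice.txt')
response.raise_for_status()
data = response.text
# sorted so the character-to-index mapping is the same on every run
chars = sorted(set(data))
data_size, vocab_size = len(data), len(chars)
print('data has %d characters, %d unique.' % (data_size, vocab_size))
# lookup tables between characters and integer ids
char_to_ix = { ch:i for i,ch in enumerate(chars) }
ix_to_char = { i:ch for i,ch in enumerate(chars) }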

data has 147726 characters, 70 unique.


The libraries needed are numpy, tensorflow, and requests. The above code is based on Karpathy's: it loads the text, collects the unique characters used, and maps each character to an integer index and back. To make use of this mapping for machine learning, we next need a training set with labels.
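
As a small illustration of the character mapping, the sketch below encodes a toy string to integer ids and decodes it back. The string and the `toy_` names are assumptions standing in for the Alice text and the notebook's variables.

```python
# Toy text standing in for the Alice corpus (an assumption for illustration).
toy_data = "hello world"
toy_chars = sorted(set(toy_data))
toy_char_to_ix = {ch: i for i, ch in enumerate(toy_chars)}
toy_ix_to_char = {i: ch for i, ch in enumerate(toy_chars)}

# Encode every character as its integer id, then map the ids back to text.
encoded = [toy_char_to_ix[ch] for ch in toy_data]
decoded = ''.join(toy_ix_to_char[i] for i in encoded)
print(encoded)
print(decoded)
```

The round trip returns the original string, which is what lets the sampled indices later be turned back into readable text.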

In [0]:
# hyperparameters
hidden_size = 100 # size of hidden layer of neurons
seq_length = 25 # number of steps to unroll the RNN for
learning_rate = 0.0065

# create training sequences and corresponding labels
Xi = []
yi = []
for i in range(0, len(data)-seq_length-1, 1):
    Xi.append([char_to_ix[ch] for ch in data[i:i+seq_length]])
    yi.append([char_to_ix[ch] for ch in data[i+1:i+seq_length+1]])

# reshape the data
# in X_modified, each row is an encoded sequence of characters
X_modified = np.reshape(Xi, (len(Xi), seq_length))
y_modified = np.reshape(yi, (len(yi), seq_length))

Above is code provided by Dr. Bruns to create training sets with corresponding labels. 
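
To make the relationship between inputs and labels concrete, here is the same windowing loop run on a toy string. The string, the sequence length of 3, and the `toy_` names are assumptions chosen so the result is easy to read.

```python
import numpy as np

# Toy values standing in for the notebook's data and seq_length (assumptions).
toy_data = "abcdefgh"
toy_seq_length = 3
toy_char_to_ix = {ch: i for i, ch in enumerate(sorted(set(toy_data)))}

X_toy, y_toy = [], []
for i in range(0, len(toy_data) - toy_seq_length - 1, 1):
    # Input window starting at position i.
    X_toy.append([toy_char_to_ix[ch] for ch in toy_data[i:i + toy_seq_length]])
    # Label window: the same span shifted one character ahead.
    y_toy.append([toy_char_to_ix[ch] for ch in toy_data[i + 1:i + toy_seq_length + 1]])

X_toy = np.array(X_toy)
y_toy = np.array(y_toy)
print(X_toy.shape, y_toy.shape)
print(X_toy[0], y_toy[0])
```

Each label row is its input row shifted by one position, so at every time step the network is asked to predict the next character.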

## Graph Construction Phase:

In [3]:
tf.reset_default_graph()

inputs = tf.placeholder(tf.float32, [None, seq_length, vocab_size], name='inputs')
targets = tf.placeholder(tf.float32, [None, seq_length, vocab_size], name='targets')

with tf.name_scope("rnn"):
    basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=vocab_size, activation=tf.nn.relu)
    outputs, states = tf.nn.dynamic_rnn(basic_cell, inputs, dtype=tf.float32)

with tf.name_scope("loss"):
    xentropy = tf.nn.softmax_cross_entropy_with_logits_v2(logits=outputs, labels=tf.cast(targets, tf.int32))
    loss = tf.reduce_mean(xentropy)

with tf.name_scope("train"):
    optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
    training_op = optimizer.minimize(loss)

init = tf.global_variables_initializer()

Instructions for updating:
This class is equivalent as tf.keras.layers.SimpleRNNCell, and will be replaced by that in Tensorflow 2.0.


For the graph construction phase, we used the standard BasicRNNCell with dynamic_rnn, as this combination performed best for us. The cell has vocab_size units, so its outputs can be used directly as logits over the characters. For the loss, softmax_cross_entropy_with_logits_v2 was used. We ran into an issue here because sparse_softmax_cross_entropy_with_logits was originally used; because of the rank mismatch caused by our one-hot labels, we switched to the first-mentioned loss function. Training has remained the same as in the last few labs and homeworks, using AdamOptimizer to minimize the loss.
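
The rank difference between the two label formats can be illustrated without TensorFlow. The NumPy sketch below is not the notebook's loss code; the toy sizes and variable names are assumptions. It computes the cross-entropy once from rank-3 one-hot labels and once from rank-2 integer labels, and checks that both give the same values.

```python
import numpy as np

def softmax(logits):
    # Softmax over the last axis, shifted by the max for numerical stability.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Hypothetical toy sizes, not the notebook's real ones.
batch, steps, vocab = 2, 3, 4
rng = np.random.RandomState(0)
logits = rng.randn(batch, steps, vocab)                  # rank 3, like the RNN outputs
sparse_labels = rng.randint(vocab, size=(batch, steps))  # rank 2 integer ids
dense_labels = np.eye(vocab)[sparse_labels]              # rank 3 one-hot targets

probs = softmax(logits)
# Dense form: sum over the vocabulary axis of -label * log(prob).
dense_xent = -(dense_labels * np.log(probs)).sum(axis=-1)
# Sparse form: pick out the log-probability of the correct id directly.
b_idx, t_idx = np.indices(sparse_labels.shape)
sparse_xent = -np.log(probs[b_idx, t_idx, sparse_labels])

print(logits.shape, sparse_labels.shape, dense_labels.shape)
print(np.allclose(dense_xent, sparse_xent))
```

The two forms agree numerically; what differs is only the shape of the labels each one expects, which is why one-hot targets fit the dense loss function and not the sparse one.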

In [0]:
def one_hot(X, y):
    return (np.eye(vocab_size)[X], np.eye(vocab_size)[y])

def next_batch(batch_size, inputs, targets):
    rnd_idx = np.arange(0 , len(inputs))
    np.random.shuffle(rnd_idx)
    rnd_idx = rnd_idx[:batch_size]
    inputs_shuffle = [inputs[i] for i in rnd_idx]
    targets_shuffle = [targets[i] for i in rnd_idx]
    return one_hot(np.asarray(inputs_shuffle), np.asarray(targets_shuffle))

To account for one-hot encoding, we created a custom function that uses np.eye to build an identity matrix of size vocab_size and indexes it with the integer-encoded batch, which adds a final axis of length vocab_size to the input array. This function ties into the next_batch function, which operates like the shuffle_batch function from the book.
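
A quick way to see what the np.eye indexing does is to run it on a small batch. The vocabulary size and batch below are made-up values for illustration, not the notebook's data.

```python
import numpy as np

toy_vocab_size = 5  # hypothetical vocabulary size for illustration
toy_batch = np.array([[0, 2, 4],
                      [1, 1, 3]])  # shape (batch_size=2, seq_length=3)

# Indexing the identity matrix with integer ids selects one row per id.
toy_one_hot = np.eye(toy_vocab_size)[toy_batch]
print(toy_batch.shape, '->', toy_one_hot.shape)
print(toy_one_hot[0, 1])
```

The result has rank 3, matching the `[None, seq_length, vocab_size]` shape of the placeholders, and taking the argmax over the last axis recovers the original ids.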

In [5]:
n_iterations = 10001
batch_size = vocab_size
smooth_loss = -np.log(1.0/vocab_size) * seq_length

with tf.Session() as sess:
    init.run()
    for iteration in range(n_iterations):
        X_batch, y_batch = next_batch(batch_size, X_modified, y_modified)
        sess.run(training_op, feed_dict={inputs: X_batch, targets: y_batch})
        if iteration % 100 == 0:
            test = []
            predictions = outputs.eval(feed_dict={inputs: X_batch})
            # sample one character per sequence in the batch
            for j in predictions:
                # softmax over all steps and characters of this sequence
                p = np.exp(j) / np.sum(np.exp(j))
                # summing over the time axis gives one weight per character
                ix = np.random.choice(range(vocab_size), p=np.sum(p,axis=0))
                test.append(ix)
            txt = ''.join(ix_to_char[ix] for ix in test)
            if iteration % 2000 == 0:
                print ('----\n%s \n----' % (txt, ))
                # batch_loss is the mean cross-entropy on the current batch
                batch_loss = loss.eval(feed_dict={inputs: X_batch, targets: y_batch})
                smooth_loss = smooth_loss * 0.999 + batch_loss * 0.001
                print ('iter %d \tloss: %f \tsmooth_loss: %f' % (iteration, batch_loss, smooth_loss))


----

VyNhJh(THJLFMI UhX.W
TFusybyLEflh()-:;tMajKpN!HLFPCtPcEmYvYzxYwYhfPg 
----
iter 0 	loss: 4.215692 	smooth_loss: 106.110384
----
X enoottYehoylf elea 
hlk
onrrirhea nhyde
r , h
J
ah h 
----
iter 2000 	loss: 2.199935 	smooth_loss: 106.006474


KeyboardInterrupt: ignored

In [7]:
n_iterations = 20001
batch_size = vocab_size
smooth_loss = -np.log(1.0/vocab_size) * seq_length

with tf.Session() as sess:
    init.run()
    for iteration in range(n_iterations):
        X_batch, y_batch = next_batch(batch_size, X_modified, y_modified)
        sess.run(training_op, feed_dict={inputs: X_batch, targets: y_batch})
        if iteration % 100 == 0:
            test = []
            predictions = outputs.eval(feed_dict={inputs: X_batch})
            # sample one character per sequence in the batch
            for j in predictions:
                # softmax over all steps and characters of this sequence
                p = np.exp(j) / np.sum(np.exp(j))
                # summing over the time axis gives one weight per character
                ix = np.random.choice(range(vocab_size), p=np.sum(p,axis=0))
                test.append(ix)
            txt = ''.join(ix_to_char[ix] for ix in test)
            # batch_loss is the mean cross-entropy on the current batch
            batch_loss = loss.eval(feed_dict={inputs: X_batch, targets: y_batch})
            smooth_loss = smooth_loss * 0.999 + batch_loss * 0.001
            if iteration % 20000 == 0:
                print ('----\n%s \n----' % (txt, ))
                print ('iter %d \tloss: %f \tsmooth_loss: %f' % (iteration, batch_loss, smooth_loss))

----
MeEMYzMwgth w"MRECrXw;rdjBEuJ(A.nz;p,;!":CT![DLlHH"eCZLJ I:grWut].Upr 
----
iter 0 	loss: 4.218764 	smooth_loss: 106.110387
----
d
'e
soeuiddodouriucs
 

luee
 iead
eintgrs aeeo 
----
iter 20000 	loss: 1.828715 	smooth_loss: 87.217526


## Conclusion:

This was by far the best project we have worked on. For us, this is the first time we have produced a visible result beyond a simple accuracy score. Our output was not the greatest, but that means there is much room for improvement and, above all, learning. We will continue to work on this to produce better results as a potential resume builder. Having a concentration in data science, we know that this is only the beginning. We are motivated to always do better, and this project has been an experience all on its own. One important thing we learned is that going through a manual version makes us think about how the program runs and how complicated machine learning can be. From this, we will strive to do better and continue learning. When we finally outputted text that improved upon itself, we celebrated and were incredibly excited. We do not think any project besides this one would have been as satisfying.