# Generating text with Tensorflow!
In today's lab, we will explore recurrent neural networks in greater detail. To do this, we will use the Google Tensorflow machine learning framework and the Python programming language to implement a simple recurrent neural network and train it to generate text from a file downloaded from the internet. We will be using a Jupyter notebook (http://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/what_is_jupyter.html) to run this code. Jupyter is a tool used by researchers in a wide array of fields to visualize and interactively run code. Press SHIFT + ENTER to execute a block of code; the output appears below the code block. For starters, try executing the block of code beneath this text field. You should be able to input your name and see it printed (along with a greeting) in the box below.



In [34]:
name = input("Type your name here:")
print("Hello " + name + "!")

Type your name here:Zach
Hello Zach!


# What is Google Tensorflow?

Google Tensorflow is a leading open source Python library for machine learning. Today, we will learn how to use a recurrent neural network to generate fake text based on a user-defined training set. You can train this system to generate your own fake news, works of fiction, poetry or any other kind of text you like :)



To begin, you will need to install *Google Tensorflow*. You can find information at https://www.tensorflow.org/install/ on how to install Tensorflow on your operating system of choice. Once you have completed the steps outlined in the link above, you will need to install *TFlearn*, a simplified programming interface for doing deep learning with Tensorflow. Use the following link to obtain directions for installing TFlearn on your computer: http://tflearn.org/.
Once you have downloaded and installed the two libraries, execute the following code.

*IF YOU DO NOT SEE "GO SCOTS"  PRINTED OUT AFTER THE CODE EXECUTES, YOU MAY NOT HAVE CORRECTLY INSTALLED OR CONFIGURED EITHER GOOGLE TENSORFLOW OR TFLEARN*. 


The links below may be useful in troubleshooting any problems with these pieces of software: 

1. https://www.tensorflow.org/install/install_linux#common_installation_problems

2. http://tflearn.org/getting_started/



In [35]:
# from __future__ import absolute_import, division, print_function

import os, sys, argparse
import urllib.request  # the request submodule provides urlretrieve, used later
import time
import tflearn
import tensorflow as tf
from tflearn.data_utils import *
hello = tf.constant('Go Scots!')
sess = tf.Session()
print(sess.run(hello))


b'Go Scots!'


# Lab goals & Introduction
In this lab, we want to build a language model (https://en.wikipedia.org/wiki/Language_model) using a recurrent neural network. Language models allow us to predict the likelihood of a collection of words appearing in a certain order within a given data set. This model has numerous applications in the fields of machine translation, chat bot creation and natural language processing. However, instead of translating or recognizing sentences using probability, we shall leverage our language model to generate text. Using an arbitrary sequence of sentences as a training set, we shall teach a neural network to generate similar sentences, one character at a time, based on the probability of each character following the characters already generated by the network.

In order to complete this task, our first step will be to obtain a *corpus*, a set of data which we will use to train our neural network. Some corpora contain annotations that give the computer hints about what role a word plays in a sentence (e.g., subject, verb, noun). However, since we are just generating text and not attempting to gain any information about the content of the text, we can simply use a chunk of plain text without any of this sort of metadata as our training set.

In order to place the corpus in a form that the neural network can understand, we will use a parser. Thankfully, TFlearn comes bundled with an easy-to-use parser function that will split our corpus into short character sequences and encode each character as a numeric vector.

First, we need to load a training corpus. Execute the code block below and type the name of a valid plain text file (.txt) in your home directory in the indicated field. If you do not have a valid text file, press RETURN and type the URL of a text file you want to use as your corpus. You can use https://raw.githubusercontent.com/tflearn/tflearn.github.io/master/resources/shakespeare_input.txt as your corpus if you cannot find a file you want to use.

In [29]:
path = input("Type the corpus file name here:")
if not os.path.isfile(path):
    url = input("Invalid file! Type a url to download it:")
    # If no file name was typed, reuse the last part of the URL as the name
    path = path or os.path.basename(url)
    print("Downloading....")
    # urlretrieve blocks until the whole file has been saved to `path`
    urllib.request.urlretrieve(url, path)

# Only report success if the file really exists on disk
if os.path.isfile(path):
    print("File seems to have loaded :) ")
else:
    print("Unable to locate your file. Please try again.")
    

Type the corpus file name here:shakespeare_input.txt


Now that we have some data to work with, we can finish setting up the parser with the settings for our neural network. The code comments explain the purpose of each variable. More information on the role of these variables can be found at: http://tflearn.org/models/generator/

In [32]:
#Length (in characters) of the text sequences to analyze
maxlen = 25
#Temperature denotes the novelty or riskiness of the generated output.
temp = 1.0 #A value closer to 0 will result in output closer to the input, so higher is riskier
#Create the neural network model name from the text file name (directory and extension removed)
model_name = os.path.splitext(os.path.basename(path))[0]
#You don't need to do anything with this code :) 
if temp > 2 or temp < 0:
    print("Temperature out of suggested range.  Suggested temp range is 0.0-2.0") 
else:
    print("Will display multiple temperature outputs")

Will display multiple temperature outputs
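
To make the effect of temperature more concrete, here is a minimal sketch of one common way to apply it: divide the log of each predicted probability by the temperature and then renormalize. The function name `apply_temperature` and the example probabilities are made up for illustration, and TFlearn's own sampling code may differ in its details.

```python
import math

def apply_temperature(probs, temperature):
    """Sharpen (temperature < 1) or flatten (temperature > 1) a distribution."""
    # Divide the log-probabilities by the temperature...
    scaled = [math.log(p) / temperature for p in probs]
    # ...then renormalize with a numerically stable softmax.
    largest = max(scaled)
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical next-character probabilities predicted by a model
example_probs = [0.7, 0.2, 0.1]
for t in (0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in apply_temperature(example_probs, t)])
```

At a temperature of 1.0 the distribution is unchanged. Lower temperatures push more probability onto the most likely character, while higher temperatures spread it out, which is why higher values give riskier output.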


Now, we need to load our data into a format that Tensorflow understands. We will use one of TFlearn's built-in functions to accomplish this task. The *textfile_to_semi_redundant_sequences* method reads the text file indicated by the first parameter (path) and builds a dictionary of every distinct character it contains. Next, this method cuts the text into overlapping sequences of *seq_maxlen* characters, starting a new sequence every *redun_step* characters, and records the character that follows each sequence as the value the network should learn to predict.

After splitting the corpus into sequences, the method *vectorizes* them. Each character is replaced by a one-hot vector: a vector with one entry per distinct character, in which only the entry for that character is set. Note that this differs from a vector space model of words, in which semantically similar words are mapped to nearby points. In such a model, 'cat' and 'dog' might be mapped to points that sit near each other since they are semantically similar (both being pets), while words like 'table' and 'airplane' sit a considerable distance from each other due to their dissimilarity. Algorithms such as "Word2Vec" create this kind of vector space, but the character-level parser used in this lab does not, as the "Distinct chars" line in its output shows. A more detailed summary of vector space models of words can be found at https://www.tensorflow.org/tutorials/word2vec.

To split and vectorize the corpus, run the following code block. You should see something similar to the lines of the text below as output once the text has been vectorized. Please note this process may take some time to complete on slower computers.

`Vectorizing text...
Text total length: 4,573,338
Distinct chars   : 67
Total sequences  : 1,524,438`



In [44]:
X, Y, char_idx = \
    textfile_to_semi_redundant_sequences(path, seq_maxlen=maxlen, redun_step=3)

Vectorizing text...
Text total length: 4,573,338
Distinct chars   : 67
Total sequences  : 1,524,438
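
To see what this step produces, here is a small plain-Python sketch that cuts a string into overlapping character windows and one-hot encodes them. It is an illustration written for this lab, not TFlearn's source code; the function name `semi_redundant_sequences` and the `_demo` variables are hypothetical, and it uses nested lists where the library returns arrays.

```python
def semi_redundant_sequences(text, seq_maxlen=25, redun_step=3):
    """Cut `text` into overlapping windows and one-hot encode each character."""
    # Map every distinct character to an index
    char_idx = {c: i for i, c in enumerate(sorted(set(text)))}

    sequences, next_chars = [], []
    # Start a new window every `redun_step` characters
    for i in range(0, len(text) - seq_maxlen, redun_step):
        sequences.append(text[i:i + seq_maxlen])
        next_chars.append(text[i + seq_maxlen])  # the character to predict

    def one_hot(char):
        vector = [0] * len(char_idx)
        vector[char_idx[char]] = 1
        return vector

    X = [[one_hot(c) for c in seq] for seq in sequences]
    Y = [one_hot(c) for c in next_chars]
    return X, Y, char_idx, sequences, next_chars

X_demo, Y_demo, idx_demo, seqs_demo, nexts_demo = semi_redundant_sequences(
    "hello world", seq_maxlen=4, redun_step=3)
print(seqs_demo)   # the input windows
print(nexts_demo)  # the character that follows each window
print(len(X_demo), len(X_demo[0]), len(X_demo[0][0]))
```

Each window in `X_demo` has one row per character and one column per distinct character, which mirrors the `[None, maxlen, len(char_idx)]` input shape used when the network is defined.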


# Creating the neural net
In this lab, we use a recurrent neural network (RNN) along with the softmax regression algorithm. Before we begin describing how to invoke an RNN in Tensorflow, we will first quickly review the basic idea of an RNN. In essence, RNNs differ from traditional neural networks in that they "perform the same task for every element of a sequence, with the output being depended on the previous computations." [1] In ordinary neural network models, we assume the inputs and outputs are independent of each other. This model is poorly suited for our task since we need to know the preceding text in a sentence in order to properly predict what comes next. Thus, RNNs are the natural choice for this type of application.
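
The recurrence described above can be sketched in a few lines of plain Python. This is a toy with a single scalar hidden state; the function name `rnn_forward` and the weight values are arbitrary choices for illustration, not anything used later in the lab.

```python
import math

def rnn_forward(inputs, w_x=0.5, w_h=0.8, b=0.0):
    """Apply the same update to every element, carrying the hidden state along."""
    h = 0.0
    states = []
    for x in inputs:
        # The new state depends on the current input AND the previous state
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states

# The second input is 0.0 in both runs, but the resulting states differ
# because the network remembers what it saw first.
print(rnn_forward([1.0, 0.0]))
print(rnn_forward([0.0, 0.0]))
```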

In the code below, we start by loading in the formatted input data. Next, we define an LSTM (Long short-term memory) neural network. LSTM neural nets are a special kind of RNN that contains a structure called a memory cell. This cell consists of an input gate, a neuron with a self-recurrent connection, a forget gate and an output gate. [3]

![Figure 1: An Illustration of an LSTM memory cell. (source: http://deeplearning.net/) ](http://deeplearning.net/tutorial/_images/lstm_memorycell.png)
 

                                    Figure 1: An Illustration of an LSTM memory cell. 
                                            (source: http://deeplearning.net/)


This type of cell has the unique ability to moderate its self-recurrent connection, allowing it to recall or forget its previous state as required. LSTM networks perform well when applied to problems relating to language modeling.

For more information about this topic, read the following Tensorflow documentation article: https://www.tensorflow.org/versions/r0.9/tutorials/recurrent/index.html
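
As a rough illustration of how the gates interact, the sketch below implements one step of a scalar LSTM cell using the standard textbook equations. The function `lstm_step`, the parameter layout and the bias values are assumptions made for this example; `tflearn.lstm` works on vectors and learns its parameters during training.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One step of a scalar LSTM cell. `params` maps a gate name to (w_x, w_h, b)."""
    def pre_activation(name):
        w_x, w_h, b = params[name]
        return w_x * x + w_h * h_prev + b

    input_gate = sigmoid(pre_activation("input"))      # how much new information to let in
    forget_gate = sigmoid(pre_activation("forget"))    # how much of the old state to keep
    output_gate = sigmoid(pre_activation("output"))    # how much of the state to expose
    candidate = math.tanh(pre_activation("candidate"))

    c = forget_gate * c_prev + input_gate * candidate  # self-recurrent memory cell
    h = output_gate * math.tanh(c)
    return h, c

# Hypothetical parameters: only the biases are non-zero, to force the gates open or shut
remember_params = {"input": (0.0, 0.0, -10.0), "forget": (0.0, 0.0, 10.0),
                   "output": (0.0, 0.0, 0.0), "candidate": (0.0, 0.0, 0.0)}
forget_params = dict(remember_params, forget=(0.0, 0.0, -10.0))

print(lstm_step(1.0, 0.0, 1.0, remember_params))  # cell state stays close to 1.0
print(lstm_step(1.0, 0.0, 1.0, forget_params))    # cell state is wiped to nearly 0.0
```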

Tensorflow represents computations as graphs. [5] Nodes in the graph are called "ops" or operations. Each of these operations takes zero or more typed multi-dimensional arrays (Tensors) and may produce more Tensors. Tensorflow graphs are "descriptions of computations." However, in order to perform any computational operation, a Tensorflow graph must be launched in a session. The following link will enable you to visualize these relations: http://playground.tensorflow.org/. Try replicating the network above using the tool located in the link above. More information about Tensorflow graphs and sessions can be found at https://www.tensorflow.org/versions/r0.10/get_started/basic_usage.
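
The split between describing a computation and running it can be mimicked with a toy in plain Python. To be clear, this is not Tensorflow code: the `Op` class and the `constant`, `add`, `multiply` and `run` helpers are invented here purely to illustrate the idea.

```python
class Op:
    """A node in a toy computation graph (an illustration, not Tensorflow's API)."""
    def __init__(self, fn, *inputs):
        self.fn = fn          # what to compute
        self.inputs = inputs  # which ops feed into this one

def constant(value):
    return Op(lambda: value)

def add(a, b):
    return Op(lambda x, y: x + y, a, b)

def multiply(a, b):
    return Op(lambda x, y: x * y, a, b)

def run(op):
    """Play the role of a session: evaluate an op by first evaluating its inputs."""
    return op.fn(*(run(i) for i in op.inputs))

# Building the graph computes nothing yet; it only describes (2 + 3) * 4
graph = multiply(add(constant(2), constant(3)), constant(4))
result = run(graph)
print(result)
```

In the earlier "Go Scots!" cell, `tf.constant` plays the part of building the graph and `sess.run` the part of executing it.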

In the code below, we use an LSTM network with a softmax function in order to create a network that will replicate the language patterns from the corpus. For this lab, we will use a three-layer LSTM neural network. Each time we invoke the *tflearn.lstm* function, we add a new layer to our network *g*. After defining each LSTM layer, we make a call to the dropout method. These dropout calls help keep the network from overfitting: during training, each call randomly zeroes a fraction of the preceding layer's outputs, so the network cannot rely too heavily on any single unit. We use a softmax function in the final stage of the neural network to normalize the outputs of the LSTM into a probability for each possible next character. [4]


In [43]:
#Source: http://tflearn.org/examples/#natural-language-processing
g = tflearn.input_data([None, maxlen, len(char_idx)])
g = tflearn.lstm(g, 512, return_seq=True)
g = tflearn.dropout(g, 0.5)
g = tflearn.lstm(g, 512, return_seq=True)
g = tflearn.dropout(g, 0.5)
g = tflearn.lstm(g, 512)
g = tflearn.dropout(g, 0.5)
g = tflearn.fully_connected(g, len(char_idx), activation='softmax')
g = tflearn.regression(g, optimizer='adam', loss='categorical_crossentropy',
                       learning_rate=0.001)
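
The two helper ideas used in this network, dropout and softmax, can be sketched in plain Python as follows. The functions below are simplified stand-ins written for this lab, not the library implementations, and they assume that the `0.5` passed to `tflearn.dropout` is the probability of keeping each value.

```python
import math
import random

def softmax(scores):
    """Turn arbitrary scores into probabilities that sum to 1."""
    largest = max(scores)  # subtracting the maximum avoids overflow
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dropout(values, keep_prob, rng):
    """Keep each value with probability keep_prob (scaled by 1/keep_prob), else zero it."""
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]

rng = random.Random(0)
activations = [1.0, 2.0, 3.0, 4.0]
print(dropout(activations, 0.5, rng))  # some entries zeroed, the rest doubled
print(softmax([2.0, 1.0, 0.1]))        # largest score gets the largest probability
```

Dropout is only applied while training. When the model generates text, all units are active.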


Now that we have defined our neural network layers, we can construct our model. We pass our layers `g` into the model, along with `char_idx`, the dictionary that maps each character in the corpus to an index, and a path prefix for the model's checkpoint files.

In [39]:

m = tflearn.SequenceGenerator(g, dictionary=char_idx,
                              seq_maxlen=maxlen,
                              clip_gradients=5.0,
                              checkpoint_path='model_'+ model_name)


# Training time!
Now that we have defined our neural net, we can train it on our corpus. First, we use one of TFlearn's built-in functions to grab a random sequence of `maxlen` characters from our corpus to use as a seed. Next, we run a training iteration by calling the fit method on our model `m`, holding back 10% of the sequences as a validation set. We then use the generate function to produce 600 characters of new text from the seed at the chosen temperature, so we can watch the output improve as training progresses.
We iterate over the data set and run the training function 50 times. Press CMD / CTRL + ENTER to run the code block below. You should see the first training results shortly after running the command. However, it may take some time for the entire process to finish.

In [None]:
for i in range(50):
    seed = random_sequence_from_textfile(path, maxlen)
    m.fit(X, Y, validation_set=0.1, batch_size=128,
          n_epoch=1, run_id=model_name)
    print("-- TESTING...")
    if temp is not None:
        print("-- Test with temperature of %s --" % temp)
        print(m.generate(600, temperature=temp, seq_seed=seed))
    else:
        print("-- Test with temperature of 1.0 --")
        print(m.generate(600, temperature=1.0, seq_seed=seed))
        print("-- Test with temperature of 0.5 --")
        print(m.generate(600, temperature=0.5, seq_seed=seed))

# Visualizing Networks with TensorBoard (optional)


![alt text](https://www.tensorflow.org/versions/r0.12/images/graph_vis_animation.gif "Tensorboard")

                                Figure 2: Exploring a Neural Network in TensorBoard
                                        (source: http://tensorflow.org)
Now that we have a working neural network, we can visualize it using TensorBoard. TensorBoard is a tool for graphing Tensorflow neural networks. TensorBoard also lets users monitor properties of a network, such as its loss, through a visual interface while the network trains.

1. Add a parameter to the SequenceGenerator call to save log data from the model `m` to a temporary directory using the `tensorboard_dir='/tmp/tflearn_logs'` argument.

2. Set up TensorBoard using the tutorial here: https://www.tensorflow.org/get_started/summaries_and_tensorboard

3. Open the command prompt on your computer and run the command below to start TensorBoard. You may need to modify this command a little bit in order to make it work on your operating system and Python version.

4. Open a web browser and go to http://localhost:6006 to access TensorBoard.




In [42]:
#Command to start tensorboard
#$> python -m tensorflow.tensorboard --logdir=/tmp/tflearn_logs
#If that module cannot be found, try: tensorboard --logdir=/tmp/tflearn_logs
#Run code in terminal (Mac/Linux) or command prompt (Windows)


## Other things to try:
1. Try changing the temperature and maximum sequence length. How do these variables affect the results?
2. Download several different corpora and compare the generated data with the training data. Can you tell the difference? What are the traits that make a corpus easy to replicate? Which factors within a corpus make the algorithm fail to generate similar looking text? Briefly summarize your results in 1-3 paragraphs.
(Hint: https://www.gutenberg.org/ is a great resource for finding plain text training sets)


Sources:

1. http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/
2. https://github.com/tflearn/tflearn/wiki
3. http://www.cs.toronto.edu/~urtasun/courses/CSC2515/05nnets-2515.pdf
4. http://r2rt.com/recurrent-neural-networks-in-tensorflow-i.html
5. https://www.tensorflow.org/api_docs/python/