# Recurrent Neural Networks 

## RNN Intro 

Recurrent is defined as occurring often or repeatedly

For RNNs, we perform the same task for each element in the input sequence 

## RNN History 

The first attempt to add memory to neural networks was the Time Delay Neural Network (TDNN, 1989)

Simple RNNs / Elman networks (1990) followed 

The vanishing gradient problem affected RNNs too: the contribution of earlier inputs decayed geometrically over time 
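
To make the geometric decay concrete, here is a minimal sketch that assumes a scalar, linear RNN $s_t = w_s s_{t-1} + x_t$ (a simplification that is not in the notes above). In that toy model the derivative of the state across $k$ steps is just $w_s^k$; the function name is made up for illustration.

```python
import numpy as np

def state_gradient_through_time(w_s, n_steps):
    """Return d s_N / d s_{N-k} for k = 0..n_steps-1 in a scalar linear RNN.

    With s_t = w_s * s_{t-1} + x_t the derivative across k steps is w_s ** k,
    so it shrinks geometrically when |w_s| < 1 and grows when |w_s| > 1.
    """
    steps_back = np.arange(n_steps)
    return w_s ** steps_back

vanishing = state_gradient_through_time(0.5, 10)
exploding = state_gradient_through_time(1.5, 10)
print(vanishing[-1])  # 0.5 ** 9
print(exploding[-1])  # 1.5 ** 9
```

The same mechanism that makes the gradient vanish for small weights makes it explode for large ones.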
    
Long Short-Term Memory networks (LSTMs, mid-90s) addressed the vanishing gradient problem 
* some signals (state variables) are kept fixed by using gates and reintroduced (or not) at an appropriate time in the future 
* arbitrary time intervals can be represented and temporal dependencies can be captured  


## RNN Applications

* speech recognition, time series prediction, gesture recognition 

## Forward Prop and BackProp

Has a bunch of useful calculations that I derived earlier, easier to read imo than the initial time we went through it in course1

## Recurrent Neural Networks

Why are they called RNNs?
* Perform the same task for each element in the input sequence

RNNs have **memory elements** or **states** that attempt to retain information about previous inputs 

Temporal dependencies 
* the current output depends on both a current input and also memory element which takes into account past inputs 

Two fundamental differences between RNNs and FFNNs
* FFNNs only consider the current input; RNNs also consider previous inputs/outputs, because the inputs and outputs are represented as a sequence 
* RNNs have memory elements (the hidden neurons of FFNNs), and previous values of these are also taken into account, because the memory elements are represented as a sequence too 

RNN **memory** is defined as the output of the hidden layer which serves as additional input to the network at the following training step  
* $h$ notation (hidden) is changed to $\bar{s}_t$ for state 
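
A minimal NumPy sketch of this idea: the state computed at one step is fed back in at the next step. The `tanh` activation, the toy sizes, and the function name `rnn_step` are assumptions for illustration; `W_x` and `W_s` follow the weight names used in the BPTT section.

```python
import numpy as np

def rnn_step(x_t, s_prev, W_x, W_s):
    """One recurrent step: the new state mixes the current input with the previous state."""
    return np.tanh(x_t @ W_x + s_prev @ W_s)

rng = np.random.default_rng(0)
n_inputs, n_hidden, n_steps = 3, 4, 5          # toy sizes
W_x = rng.normal(scale=0.1, size=(n_inputs, n_hidden))
W_s = rng.normal(scale=0.1, size=(n_hidden, n_hidden))

inputs = rng.normal(size=(n_steps, n_inputs))
s_t = np.zeros(n_hidden)                       # initial state: no memory yet
states = []
for x_t in inputs:                             # the same weights are reused at every step
    s_t = rnn_step(x_t, s_t, W_x, W_s)
    states.append(s_t)
states = np.stack(states)
print(states.shape)  # (5, 4)
```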

RNNs can be "unfolded in time", which is the representation we use when working with RNNs





## RNN - Unfolded Model 

* looks cleaner 

## Backpropagation Through Time (BPTT)

* train the network at time step $T$ while also taking into account all the previous time steps 

**Accumulative Gradient**:
* to update the matrix $W_s$ at some time step $N$, we need $\dfrac{\partial E_N}{\partial W_s}$ such that:
    * $\dfrac{\partial E_N}{\partial W_s} = \sum\limits^N_{i=1} \dfrac{\partial E_N}{\partial \bar{y}_N}\dfrac{\partial \bar{y}_N}{\partial \bar{s}_i} \dfrac{\partial \bar{s}_i}{\partial W_s}$
    * This will take into account each individual time step's gradient, accumulating it as we go through time.

* similarly, for $W_x$, at some time step $N$, we have $\dfrac{\partial E_N}{\partial W_x}$ such that:
    * $\dfrac{\partial E_N}{\partial W_x} = \sum\limits^N_{i=1} \dfrac{\partial E_N}{\partial \bar{y}_N}\dfrac{\partial \bar{y}_N}{\partial \bar{s}_i} \dfrac{\partial \bar{s}_i}{\partial W_x}$
    
* we consider **all** paths when accumulating the gradient.
    * i have a big issue with this if you want to build deep NNs, it just seems infeasibly expensive....
    * oh ok, apparently **LSTM**s address this issue 
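
The accumulated sum above can be checked numerically on a toy model. This is only a sketch under stated assumptions: a scalar RNN with `tanh` state, a linear output $y_N = w_y s_N$, and a squared-error loss, none of which are fixed by the notes. The function names are made up for illustration.

```python
import numpy as np

def forward(xs, w_x, w_s, w_y):
    """Run a scalar RNN and return all states (states[0] is the initial state) and y_N."""
    states = [0.0]
    for x in xs:
        states.append(np.tanh(w_x * x + w_s * states[-1]))
    return states, w_y * states[-1]

def bptt_grad_w_s(xs, target, w_x, w_s, w_y):
    """Accumulate dE_N/dw_s over every time step i = N..1."""
    states, y = forward(xs, w_x, w_s, w_y)
    dE_dy = y - target                  # E_N = 0.5 * (y_N - target) ** 2
    dy_ds = w_y                         # dy_N/ds_N, walked back to dy_N/ds_i below
    grad = 0.0
    for i in range(len(xs), 0, -1):
        local = 1.0 - states[i] ** 2    # tanh derivative at step i
        grad += dE_dy * dy_ds * local * states[i - 1]   # uses the immediate ds_i/dw_s
        dy_ds *= local * w_s            # extend the chain one step further back
    return grad

xs, target = [0.5, -0.3, 0.8, 0.1], 0.2
w_x, w_s, w_y = 0.7, 0.9, 1.3
analytic = bptt_grad_w_s(xs, target, w_x, w_s, w_y)

# Finite-difference check of the accumulated gradient
eps = 1e-6
def loss(w):
    return 0.5 * (forward(xs, w_x, w, w_y)[1] - target) ** 2
numeric = (loss(w_s + eps) - loss(w_s - eps)) / (2 * eps)
print(abs(analytic - numeric) < 1e-6)
```

Each loop iteration adds one term of the sum, so the cost of a single update grows with the number of time steps.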

## RNN Summary

* **gradient clipping** addresses the exploding gradient problem
https://arxiv.org/abs/1211.5063 
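
As a rough sketch of norm-based clipping: when the joint gradient norm exceeds a threshold, every gradient is rescaled so the norm equals the threshold. The function name and the threshold value are illustrative assumptions, not taken from the notes.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their joint L2 norm is at most max_norm."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if global_norm <= max_norm:
        return grads, global_norm
    scale = max_norm / global_norm
    return [g * scale for g in grads], global_norm

grads = [np.array([3.0, 4.0]), np.array([12.0])]   # joint norm = sqrt(9 + 16 + 144) = 13
clipped, norm_before = clip_by_global_norm(grads, max_norm=5.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(norm_before, norm_after)
```

Rescaling keeps the direction of the update and only limits its size.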

## From RNN to LSTM 

A couple reasons why (actuallyyy, they're fundamental problems lol):
* avoid the loss of information
* avoid the vanishing gradient problem 

* **paper**!
http://www.bioinf.jku.at/publications/older/2604.pdf



# Long Short-Term Memory Networks (LSTM)

Some pre-course reading warm-ups 
* http://blog.echen.me/2017/05/30/exploring-lstms/
* https://www.youtube.com/watch?v=iX5V1WpxxkY
* http://colah.github.io/posts/2015-08-Understanding-LSTMs/

## RNN vs LSTM

* Due to the vanishing gradient problem, RNNs have a hard time getting information from iterations of the distant past (in the example, the probability of recognizing a wolf is related to bears recognized in the distant past)

LSTMs keep track of both **long term memory** and **short term memory** oh my god.

## Basics of LSTM 

Every LSTM cell has 4 gates: 
* forget gate 
* remember gate 
* learn gate 
* use gate 

## Architecture of LSTM 


## Learn Gate 

**Combines** short term memory and the new event, **ignores** some of them, and passes them to the output

Let $\text{STM}_{t-1}$ be the short term memory at time $t-1$ and $E_t$ the event at time $t$, the information it actually passes is represented by: 

* Combine step:
$N_t = \tanh(W_n[\text{STM}_{t-1}, E_t] + b_n)$

* Ignore step: apply $i_t$
where $i_t$ is the 'ignore factor', a vector that we multiply element-wise. We calculate $i_t$ by passing $\text{STM}_{t-1}$ and $E_t$ through a small neural network such that:

$i_t = \sigma(W_i[\text{STM}_{t-1}, E_t] + b_i)$
  
Thus we have that the learn gate is $\text{Learn Gate} = N_t \times i_t$
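
The combine and ignore steps can be sketched in NumPy as follows. The weight shapes, the random initialisation, and the names `learn_gate` and `sigmoid` are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def learn_gate(stm_prev, event, W_n, b_n, W_i, b_i):
    """Combine STM_{t-1} with E_t, then scale element-wise by the ignore factor."""
    joined = np.concatenate([stm_prev, event])   # [STM_{t-1}, E_t]
    n_t = np.tanh(W_n @ joined + b_n)            # combine step
    i_t = sigmoid(W_i @ joined + b_i)            # ignore factor, each entry in (0, 1)
    return n_t * i_t

rng = np.random.default_rng(0)
hidden, n_event = 4, 3                           # toy sizes
W_n = rng.normal(size=(hidden, hidden + n_event))
W_i = rng.normal(size=(hidden, hidden + n_event))
b_n, b_i = np.zeros(hidden), np.zeros(hidden)

out = learn_gate(rng.normal(size=hidden), rng.normal(size=n_event), W_n, b_n, W_i, b_i)
print(out.shape)  # (4,)
```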


## Forget Gate 

Chooses which information to keep and which to forget. We multiply long term memory from the last iteration $\text{LTM}_{t-1}$ by a forget factor $f_t$ such that:

$f_t = \sigma(W_f[\text{STM}_{t-1}, E_t] + b_f)$

Then, we have $\text{Forget Gate} = \text{LTM}_{t-1} \times f_t$

## Remember Gate 

Add the Forget Gate and the Learn Gate outputs such that: 

$\text{Remember Gate} = \text{LTM}_{t-1} \times f_t + N_t \times i_t$

## Use Gate 

Also known as the output gate. It uses the output of the Forget Gate, the short term memory of the previous iteration, and the event:

$U_t = \tanh(W_u(\text{LTM}_{t-1} \times f_t) + b_u)$

$V_t = \sigma(W_v[\text{STM}_{t-1}, E_t] + b_v)$

Output is (a new short term memory output which is equivalent to the output of the cell):

$\text{STM}_t = U_t \times V_t$
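
Putting the four gates together, one cell step might look like the sketch below. It follows the gate equations in these notes rather than any particular library's LSTM; the parameter dictionary `p`, the weight shapes, and the function names are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(ltm_prev, stm_prev, event, p):
    """One cell step following the learn/forget/remember/use gate equations."""
    joined = np.concatenate([stm_prev, event])            # [STM_{t-1}, E_t]
    learn = np.tanh(p["W_n"] @ joined + p["b_n"]) * sigmoid(p["W_i"] @ joined + p["b_i"])
    forget = ltm_prev * sigmoid(p["W_f"] @ joined + p["b_f"])
    ltm_new = forget + learn                              # remember gate: new long term memory
    u_t = np.tanh(p["W_u"] @ forget + p["b_u"])           # use gate, long term part
    v_t = sigmoid(p["W_v"] @ joined + p["b_v"])           # use gate, short term part
    stm_new = u_t * v_t                                   # new short term memory / cell output
    return ltm_new, stm_new

rng = np.random.default_rng(1)
hidden, n_event = 4, 3                                    # toy sizes
p = {}
for name in ["n", "i", "f", "v"]:
    p["W_" + name] = rng.normal(scale=0.5, size=(hidden, hidden + n_event))
    p["b_" + name] = np.zeros(hidden)
p["W_u"] = rng.normal(scale=0.5, size=(hidden, hidden))
p["b_u"] = np.zeros(hidden)

ltm, stm = np.zeros(hidden), np.zeros(hidden)
for event in rng.normal(size=(6, n_event)):               # feed a sequence of six events
    ltm, stm = lstm_step(ltm, stm, event, p)
print(ltm.shape, stm.shape)  # (4,) (4,)
```

The long term memory is only ever scaled and added to, while the short term memory is recomputed from scratch at every step.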

## Resources! DL4J

https://deeplearning4j.org/lstm.html

# Implementation of RNN and LSTM 

## Sequence batching

batch up the input to efficiently use matrix operations 

## Implementing a Character-wise RNN 

http://karpathy.github.io/2015/05/21/rnn-effectiveness/

https://r2rt.com/recurrent-neural-networks-in-tensorflow-ii.html
    

In [21]:
## Generator functions: defined like a regular function, but have the keyword `yield` that returns a value (or values) 
## and freezes the state of the function until called again (via .next() or the next iteration in a for loop)

# Generator functions are useful in that they keep memory pool small while computing iterative processes.

def get_up_to_some(num):
    counter = 0
    while counter < num:
        yield counter 
        counter += 1

In [26]:
# Example of calling a generator function. Notice how the state is frozen even after going out of the for loop. 

calvs_generator = get_up_to_some(10)

for i in calvs_generator:
    print(i)
    if i == 5: 
        break
print("hello! after the for loop")
print(next(calvs_generator))


0
1
2
3
4
5
hello! after the for loop
6


In [13]:


import numpy as np 

arr = np.array([1,2,3,4,5,6,7,8,9,10,11,12,13])
n_seqs = 2 
n_steps = 3 


In [14]:
characters_per_batch = n_seqs * n_steps 
print(characters_per_batch)

6


In [15]:
n_batches = len(arr) // characters_per_batch
print(n_batches)

2


In [18]:
# keep only enough characters to make complete batches
arr = arr[:n_batches * characters_per_batch]
print(arr)

[ 1  2  3  4  5  6  7  8  9 10 11 12]


In [19]:
arr = arr.reshape((n_seqs, -1))
print(arr)

[[ 1  2  3  4  5  6]
 [ 7  8  9 10 11 12]]
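
The slicing and reshaping steps above can be combined with a generator to yield one batch at a time. This is a sketch, not the assignment's reference solution: the name `get_batches` and the choice of targets (inputs shifted left by one within each window, wrapping at the end) are assumptions.

```python
import numpy as np

def get_batches(arr, n_seqs, n_steps):
    """Yield (x, y) batches of shape (n_seqs, n_steps) from a 1-D array."""
    characters_per_batch = n_seqs * n_steps
    n_batches = len(arr) // characters_per_batch
    arr = arr[:n_batches * characters_per_batch]   # drop the incomplete tail
    arr = arr.reshape((n_seqs, -1))                # one row per sequence
    for start in range(0, arr.shape[1], n_steps):
        x = arr[:, start:start + n_steps]
        y = np.roll(x, -1, axis=1)                 # assumed targets: inputs shifted left by one
        yield x, y

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13])
batches = list(get_batches(data, n_seqs=2, n_steps=3))
print(len(batches))      # 2
print(batches[0][0])     # [[1 2 3]
                         #  [7 8 9]]
```

Because it is a generator, only one batch is held in memory at a time.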


In [None]:
import tensorflow as tf  # TensorFlow 1.x; tf.contrib was removed in 2.x

def build_lstm(lstm_size, num_layers, batch_size, keep_prob):
    ''' Build LSTM cell.
    
        Arguments
        ---------
        keep_prob: Scalar tensor (tf.placeholder) for the dropout keep probability
        lstm_size: Size of the hidden layers in the LSTM cells
        num_layers: Number of LSTM layers
        batch_size: Batch size

    '''
    ### Build the LSTM Cell
    def build_cell(lstm_size, keep_prob):
        
        # Use a basic LSTM cell
        lstm = tf.contrib.rnn.BasicLSTMCell(lstm_size)

        # Add dropout to the cell outputs
        # DropoutWrapper wraps an RNN cell and takes separate keep probabilities for its inputs, outputs, and state
        drop = tf.contrib.rnn.DropoutWrapper(lstm, output_keep_prob=keep_prob)
        return drop
        
    # Stack up multiple LSTM layers, all with the given lstm_size 
    cell = tf.contrib.rnn.MultiRNNCell([build_cell(lstm_size, keep_prob) for _ in range(num_layers)])
    
    # Initialize the state of all the LSTM layers to zeros 
    initial_state = cell.zero_state(batch_size, tf.float32)
    
    return cell, initial_state

In [3]:
import tensorflow as tf

t1 = [[1, 2, 3], [4, 5, 6]]
t2 = [[7, 8, 9], [10, 11, 12]]
dim0 = tf.concat([t1, t2], 0)  # [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
dim1 = tf.concat([t1, t2], 1)  # [[1, 2, 3, 7, 8, 9], [4, 5, 6, 10, 11, 12]]

# t3 = [2,3]
# t4 = [2,3]

# tf.shape(tf.concat([t3, t4], 0))  # [4, 3]
# tf.shape(tf.concat([t3, t4], 1))  # [2, 6]

In [7]:
print(dim1)

Tensor("concat_5:0", shape=(2, 6), dtype=int32)


http://suriyadeepan.github.io/2017-01-07-unfolding-rnn/

## Prog assignment, building LSTMs 

so hard to understand this one:

### RNN Output

Here we'll create the output layer. We need to connect the output of the RNN cells to a fully connected layer with a softmax output. The softmax output gives us a probability distribution we can use to predict the next character, so we want this layer to have size $C$, the number of classes/characters we have in our text.

If our input has batch size $N$, number of steps $M$, and the hidden layer has $L$ hidden units, then the output is a 3D tensor with size $N \times M \times L$. The output of each LSTM cell has size $L$, we have $M$ of them, one for each sequence step, and we have $N$ sequences. So the total size is $N \times M \times L$. 

We are using the same fully connected layer, the same weights, for each of the outputs. Then, to make things easier, we should reshape the outputs into a 2D tensor with shape $(M * N) \times L$. That is, one row for each sequence and step, where the values of each row are the output from the LSTM cells. We get the LSTM output as a list, `lstm_output`. First we need to concatenate this whole list into one array with [`tf.concat`](https://www.tensorflow.org/api_docs/python/tf/concat). Then, reshape it (with `tf.reshape`) to size $(M * N) \times L$.

Once we have the outputs reshaped, we can do the matrix multiplication with the weights. We need to wrap the weight and bias variables in a variable scope with `tf.variable_scope(scope_name)` because there are weights being created in the LSTM cells. TensorFlow will throw an error if the weights created here have the same names as the weights created in the LSTM cells, which they will be by default. To avoid this, we wrap the variables in a variable scope so we can give them unique names.
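
The shape bookkeeping is easier to follow with a small example. The sketch below uses NumPy instead of TensorFlow so it runs on its own; the toy sizes and the random stand-in for the LSTM output are assumptions.

```python
import numpy as np

N, M, L, C = 2, 3, 4, 5          # batch size, steps, hidden units, classes (toy values)
rng = np.random.default_rng(0)

lstm_output = rng.normal(size=(N, M, L))      # stand-in for the stacked LSTM outputs
x = lstm_output.reshape(-1, L)                # (N * M, L): one row per sequence and step

softmax_w = rng.normal(scale=0.1, size=(L, C))
softmax_b = np.zeros(C)
logits = x @ softmax_w + softmax_b            # the same weights are applied to every row

# Numerically stable softmax over the class axis
shifted = logits - logits.max(axis=1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
print(x.shape, probs.shape)  # (6, 4) (6, 5)
```

Each of the $M * N$ rows ends up with its own probability distribution over the $C$ characters.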

> **Exercise:** Implement the output layer in the function below.

In [None]:
def build_output(lstm_output, in_size, out_size):
    ''' Build a softmax layer, return the softmax output and logits.
    
        Arguments
        ---------
        
        lstm_output: List of output tensors from the LSTM layer
        in_size: Size of the input tensor, for example, size of the LSTM cells
        out_size: Size of this softmax layer
    
    '''

    # Reshape output so it's a bunch of rows, one row for each step for each sequence.
    # Concatenate lstm_output over axis 1 (the columns)
    seq_output = tf.concat(lstm_output, axis=1)
    # Reshape seq_output to a 2D tensor with lstm_size columns
    x = tf.reshape(seq_output, [-1, in_size])
    
    # Connect the RNN outputs to a softmax layer
    with tf.variable_scope('softmax'):
        # Create the weight and bias variables here
        softmax_w = tf.Variable(tf.truncated_normal((in_size, out_size), stddev=0.1))
        softmax_b = tf.Variable(tf.zeros(out_size))
    
    # Since output is a bunch of rows of RNN cell outputs, logits will be a bunch
    # of rows of logit outputs, one for each step and sequence
    logits = tf.matmul(x, softmax_w) + softmax_b
    
    # Use softmax to get the probabilities for predicted characters
    out = tf.nn.softmax(logits, name='predictions')
    
    return out, logits

# Hyperparameters 

## Learning Rate 

Learning Rate Decay
* linear learning rate decay
* exponential learning rate decay
* adaptive learning rate 
    * adjust learning rate based on what's happening with the error:
        * if error is decreasing too slowly, increase learning rate 
        * if error 'bounces around'/diverges (for example), decrease learning rate 
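
The first two decay schedules can be written as plain functions of the step count. This is a sketch: the exponential version uses the non-staircase formula `initial_lr * decay_rate ** (step / decay_steps)`, and the function names and example values are assumptions.

```python
def exponential_decay(initial_lr, step, decay_steps, decay_rate):
    """Learning rate after `step` updates: initial_lr * decay_rate ** (step / decay_steps)."""
    return initial_lr * decay_rate ** (step / decay_steps)

def linear_decay(initial_lr, final_lr, step, total_steps):
    """Interpolate linearly from initial_lr to final_lr, then hold final_lr."""
    fraction = min(step / total_steps, 1.0)
    return initial_lr + fraction * (final_lr - initial_lr)

for step in (0, 1000, 2000):
    print(step,
          exponential_decay(0.1, step, decay_steps=1000, decay_rate=0.5),
          linear_decay(0.1, 0.01, step, total_steps=2000))
```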
        
Tensorflow readings:

Adagrad: https://www.tensorflow.org/api_docs/python/tf/train/AdagradOptimizer

Adam: https://www.tensorflow.org/api_docs/python/tf/train/AdamOptimizer

Exponential decay: https://www.tensorflow.org/api_docs/python/tf/train/exponential_decay






## Minibatch Size  

* a minibatch size that is too large (e.g. 1024) can be computationally taxing and may also hurt accuracy

## Number of Iterations 

* When choosing the number of iterations/epochs, we should look at the validation error
    * if validation error goes up, then apply early stopping to the model at the point of lowest validation error 
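
A framework-independent sketch of early stopping: track the lowest validation error seen so far and stop once it has not improved for a few epochs. The `patience` parameter, the function name, and the error values are hypothetical.

```python
def early_stopping_epoch(validation_errors, patience=2):
    """Return (best_epoch, best_error), stopping once the error fails to improve
    for `patience` consecutive epochs."""
    best_epoch, best_error = 0, float("inf")
    for epoch, error in enumerate(validation_errors):
        if error < best_error:
            best_epoch, best_error = epoch, error
        elif epoch - best_epoch >= patience:
            break                      # validation error has stopped improving
    return best_epoch, best_error

# Hypothetical validation errors, one per epoch
errors = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.40]
print(early_stopping_epoch(errors))   # (3, 0.5)
```

Note the trade-off in this toy run: stopping after two bad epochs never sees the later improvement to 0.40, so the patience value matters.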
    
TF validation monitors: https://www.tensorflow.org/get_started/#early_stopping_with_validationmonitor

Validation monitors have gone away in favor of SessionRunHooks:

https://www.tensorflow.org/api_docs/python/tf/train/SessionRunHook

Varying types of hooks:

https://www.tensorflow.org/api_docs/python/tf/train/StopAtStepHook

https://www.tensorflow.org/api_docs/python/tf/train/NanTensorHook

## Number of hidden units 

You'd generally want more hidden units for more complex functions 

Too many hidden units can cause overfitting 

Keep adding hidden units until the validation error starts to get worse 

Having the first hidden layer larger in number than the input layer has been seen to be beneficial in a number of tasks 

Deeper Conv nets perform better, otherwise FFNNs in some cases seem to work really well with 3 layers, but YMMV with more layers 


## RNN hyperparameters 

Choosing a cell type: 
* LSTM 
* vanilla RNN cell 
* gated recurrent unit cell (GRU)   
    * LSTM and GRU (items 1 and 3) perform better in practice 

How deep the network is, i.e. how many hidden layers we stack

Embedding dimensionality 
* more than 500 dimensions is rare 

LSTM vs GRU 
* depends on the task 

LSTM is really good for **advanced speech recognition** 

Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling

https://arxiv.org/abs/1412.3555

An Empirical Exploration of Recurrent Network Architectures

http://proceedings.mlr.press/v37/jozefowicz15.pdf

Visualizing and Understanding Recurrent Networks

https://arxiv.org/abs/1506.02078

LSTM: A Search Space Odyssey

https://arxiv.org/pdf/1503.04069.pdf

Understanding LSTM Networks

https://colah.github.io/posts/2015-08-Understanding-LSTMs/

Massive Exploration of Neural Machine Translation Architectures

https://arxiv.org/abs/1703.03906v2

Neural Speech Recognizer: Acoustic-to-Word LSTM Model for Large Vocabulary Speech Recognition

https://arxiv.org/abs/1610.09975

**Speech Recognition with Deep Recurrent Neural Networks**

https://arxiv.org/abs/1303.5778

Sequence to Sequence Learning with Neural Networks

https://arxiv.org/abs/1409.3215

Show and Tell: A Neural Image Caption Generator

https://arxiv.org/abs/1411.4555

DRAW: A Recurrent Neural Network For Image Generation

https://arxiv.org/abs/1502.04623

A Long Short-Term Memory Model for Answer Sentence Selection in Question Answering

http://www.aclweb.org/anthology/P15-2116

Sequence-to-Sequence RNNs for Text Summarization

https://pdfs.semanticscholar.org/3fbc/45152f20403266b02c4c2adab26fb367522d.pdf

## More Resources 

Practical recommendations for gradient-based training of deep architectures

https://arxiv.org/abs/1206.5533

http://neuralnetworksanddeeplearning.com/chap3.html#how_to_choose_a_neural_network's_hyper-parameters

Efficient Backprop

http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf

How to Generate a Good Word Embedding?

https://arxiv.org/abs/1507.05523

Systematic evaluation of CNN advances on the ImageNet

https://arxiv.org/abs/1606.02228



# Embeddings and Word2vec 

## Embeddings Intro 

Word embeddings are a NN method for representing data with a huge number of classes more efficiently.
* greatly improve the ability of networks to learn from data by representing it with lower dimensional vectors.
* Given training data, finds relationships between words that are usually correlated together ('United', 'States') vs. ('United', 'Pancakes')
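
One way to see why embeddings are more efficient: multiplying a one-hot vector by a weight matrix just selects one row, so the multiplication can be replaced by a lookup. The sizes and word ids below are toy assumptions.

```python
import numpy as np

vocab_size, embed_dim = 6, 3                      # toy sizes
rng = np.random.default_rng(0)
embedding = rng.normal(size=(vocab_size, embed_dim))

word_ids = np.array([4, 1, 4])                    # integer-encoded words
one_hot = np.eye(vocab_size)[word_ids]            # shape (3, 6)

via_matmul = one_hot @ embedding                  # multiplies by mostly zeros
via_lookup = embedding[word_ids]                  # just picks the rows
print(np.allclose(via_matmul, via_lookup))        # True
```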

## Implementing Word2vec

Readings

* A really good [conceptual overview](http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/) of word2vec from Chris McCormick 
    * If two words have similar 'contexts' then the model should output similar word vectors for these two

    * But the neural network is too big! (300 x 10,000  x num_training_data), we'll need to enhance this by:
        * treating word pairs or 'phrases' as their own 'word' (i.e. word vector representation) in the model. 'Boston' and 'Globe' mean something quite different separately than together as 'Boston Globe', so 'Boston Globe' has its own word vector representation in the model.
        * subsampling is used as well. There's a chance a word is deleted from the text, and that chance grows with the word's frequency.
        * negative sampling is used as well. Only a subset of the weights is updated for each training example: those for the correct word plus a few 'negative' words, with more frequent words being more likely to be picked as negative samples 

* [First word2vec paper](https://arxiv.org/pdf/1301.3781.pdf) from Mikolov et al.
* [NIPS paper](http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf) with improvements for word2vec also from Mikolov et al.
* An [implementation of word2vec](http://www.thushv.com/natural_language_processing/word2vec-part-1-nlp-with-deep-learning-with-tensorflow-skip-gram/) from Thushan Ganegedara
* TensorFlow [word2vec tutorial](https://www.tensorflow.org/tutorials/word2vec)
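
The subsampling idea from the readings can be sketched with the discard probability $P(w) = 1 - \sqrt{t / f(w)}$ given in the NIPS paper, where $f(w)$ is the word's relative frequency and $t$ is a threshold. The function name, the toy corpus, and the large threshold used here are assumptions for illustration.

```python
import math
from collections import Counter

def discard_probabilities(words, threshold=1e-5):
    """P(discard w) = 1 - sqrt(threshold / f(w)), clipped at 0,
    where f(w) is the relative frequency of w in `words`."""
    counts = Counter(words)
    total = len(words)
    return {w: max(0.0, 1.0 - math.sqrt(threshold / (c / total)))
            for w, c in counts.items()}

words = ["the"] * 50 + ["fox"] * 3 + ["quick", "brown"]
probs = discard_probabilities(words, threshold=0.05)   # large threshold for this tiny toy corpus
print(probs["the"] > probs["fox"])   # True
```

Frequent words such as 'the' are dropped often, while rare words are always kept.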

Continuous Bag of Words (CBOW) : context around your word
* 'The quick brown fox jumped over the lazy dog', if you're looking at fox and the context around fox, then CBOW would be something like ('quick', 'brown', 'jumped' ,'over'), and shift this CBOW continually as you move along the sentence.

Skip-gram
* Given a word, we try to predict the context that word appears in 
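
The difference between the two setups is mostly in how training examples are built from the same sliding window. A small sketch, assuming a window of two words on each side and a made-up helper name `context_windows`:

```python
def context_windows(tokens, window=2):
    """Yield (target, context_words) for every position in the sentence."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield target, left + right

tokens = "the quick brown fox jumped over the lazy dog".split()

# CBOW: predict the target from its surrounding context
cbow_examples = [(context, target) for target, context in context_windows(tokens)]
# Skip-gram: predict each context word from the target
skipgram_pairs = [(target, c) for target, context in context_windows(tokens) for c in context]

print(cbow_examples[3])    # (['quick', 'brown', 'jumped', 'over'], 'fox')
print(skipgram_pairs[:2])  # [('the', 'quick'), ('the', 'brown')]
```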

Efficiency of list comprehensions: 

http://blog.cdleary.com/2010/04/efficiency-of-list-comprehensions/

# AUTO ML???

http://automl.info/

http://www.ml4aad.org/automl/

https://futurism.com/google-artificial-intelligence-built-ai/

https://research.googleblog.com/2017/05/using-machine-learning-to-explore.html


