In [1]:
import tensorflow as tf
import tensorflow.contrib.rnn as rnn

In [2]:
sess = tf.InteractiveSession()

## Gated Recurrent Unit (GRU)

As we have seen during the lectures, a GRU is a relatively simple RNN unit whose gates mitigate the vanishing gradient problem. I will first use GRUs as an example, since they are easier to set up than LSTMs and generally work well.

Making a recurrent network in TensorFlow starts with constructing a unit. This unit will be used to process the input at every timestep. The GRU unit is called `GRUCell` in TensorFlow. In its basic usage, we give it just one parameter: the size of the hidden representation. Let us construct a GRU unit with a hidden layer size of 100:

In [3]:
gru_cell = rnn.GRUCell(100)
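
To make concrete what such a unit computes at a single timestep, here is a minimal NumPy sketch of one GRU step. It follows the usual GRU equations with a reset gate and an update gate; the names `gru_step`, `init_params` and the parameter dictionary are my own and only serve the illustration. This is not TensorFlow's implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, params):
    """One GRU timestep for a batch: x_t is (batch, input), h_prev is (batch, hidden)."""
    xh = np.concatenate([x_t, h_prev], axis=1)
    # Reset gate r and update gate u are both computed from the input and the previous state.
    r = sigmoid(xh @ params["W_r"] + params["b_r"])
    u = sigmoid(xh @ params["W_u"] + params["b_u"])
    # Candidate state: the reset gate decides how much of the previous state it gets to see.
    xrh = np.concatenate([x_t, r * h_prev], axis=1)
    c = np.tanh(xrh @ params["W_c"] + params["b_c"])
    # The update gate interpolates between the previous state and the candidate.
    return u * h_prev + (1.0 - u) * c

def init_params(input_size, hidden_size, rng):
    # Hypothetical initialisation: small random weights, zero biases.
    shape = (input_size + hidden_size, hidden_size)
    params = {}
    for gate in ("r", "u", "c"):
        params["W_" + gate] = rng.normal(scale=0.1, size=shape)
        params["b_" + gate] = np.zeros(hidden_size)
    return params

rng = np.random.default_rng(0)
params = init_params(input_size=50, hidden_size=100, rng=rng)
x_t = rng.normal(size=(512, 50))
h_prev = np.zeros((512, 100))
h_next = gru_step(x_t, h_prev, params)
print(h_next.shape)  # (512, 100)
```

The single argument of `GRUCell` corresponds to `hidden_size` here: it fixes the width of the state that is carried from one timestep to the next.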

This basic unit is not yet useful by itself. As discussed in the RNN lecture, we have to unroll it in time for training and prediction. You can think of this as creating an RNN cell for every timestep, where the parameters are shared between
all the cells. There are a couple of TensorFlow functions that unroll an RNN, but for reasons that
will become clear later, we will use the `dynamic_rnn` function. In the following example, we assume an
input with a batch size of *512*, *10* timesteps per instance, and a vector of size *50* per timestep. This could e.g.
be a character embedding of size *50*:

In [4]:
x = tf.placeholder(shape=(512, 10, 50), dtype=tf.float32)
gru = tf.nn.dynamic_rnn(gru_cell, x, dtype=tf.float32)
gru

(<tf.Tensor 'rnn/transpose:0' shape=(512, 10, 100) dtype=float32>,
 <tf.Tensor 'rnn/while/Exit_2:0' shape=(?, 100) dtype=float32>)

The `dynamic_rnn` function takes the cell and the input as its arguments. Because we do not pass an initial state, we also have to
specify `dtype`, the data type of the initial state and the outputs. `dynamic_rnn` returns a 2-tuple. The first value in the tuple contains the hidden states for all timesteps of all instances. The second value contains the final hidden representation of each instance. For the task at hand (word tag
prediction) we are only interested in the final hidden representation. Hence, we can discard the first element
of the tuple.
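
The relation between the two return values can be sketched in NumPy. The loop below applies one cell, with one set of parameters, at every timestep and returns both the per-timestep states and the final state. To keep it short I use a plain tanh RNN as a stand-in for the GRU; `rnn_step` and `unroll` are hypothetical names, not TensorFlow functions.

```python
import numpy as np

def rnn_step(x_t, h_prev, W, b):
    # Stand-in cell (an assumption for brevity): a plain tanh RNN, not a GRU.
    return np.tanh(np.concatenate([x_t, h_prev], axis=1) @ W + b)

def unroll(x, hidden_size, W, b):
    """Apply the same cell (same W and b) at every timestep of x: (batch, time, input)."""
    batch, time, _ = x.shape
    h = np.zeros((batch, hidden_size))
    outputs = []
    for t in range(time):
        h = rnn_step(x[:, t, :], h, W, b)
        outputs.append(h)
    # First value: states for all timesteps; second value: the final state only.
    return np.stack(outputs, axis=1), h

rng = np.random.default_rng(0)
x = rng.normal(size=(512, 10, 50))
W = rng.normal(scale=0.1, size=(50 + 100, 100))
b = np.zeros(100)
outputs, final_state = unroll(x, 100, W, b)
print(outputs.shape, final_state.shape)  # (512, 10, 100) (512, 100)
```

In this sketch the final state is simply the last slice of the outputs, `outputs[:, -1, :]`, which is why the first tuple element can be discarded when only the final representation is needed.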

Note that we assumed that the input always has 10 timesteps. Since our inputs are words of varying length, this
is not a reasonable assumption to make. In order to handle all normal words, you could set this dimension of the
input to some reasonably large number of characters, or just use the length of the longest word in the
training/validation set.
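
As a rough sketch of the second option, the helper below pads a list of words to the length of the longest word and also records the true length of each word, which is what tells real characters apart from padding. The function `pad_words`, the example words and the character-to-id mapping are all made up for illustration; in the setup of this notebook, the ids would still have to be mapped to embeddings of size 50.

```python
import numpy as np

def pad_words(words, char_to_id, max_len):
    """Return (ids, lengths): ids is (len(words), max_len), zero-padded on the right."""
    ids = np.zeros((len(words), max_len), dtype=np.int32)
    lengths = np.zeros(len(words), dtype=np.int32)
    for i, word in enumerate(words):
        chars = word[:max_len]  # truncate words longer than max_len
        lengths[i] = len(chars)
        ids[i, :len(chars)] = [char_to_id[c] for c in chars]
    return ids, lengths

words = ["cat", "horse", "ox"]
# Id 0 is reserved for padding, so real characters start at 1.
char_to_id = {c: i + 1 for i, c in enumerate(sorted(set("".join(words))))}
max_len = max(len(w) for w in words)
ids, lengths = pad_words(words, char_to_id, max_len)
print(lengths)  # [3 5 2]
```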

However, if we fix this dimension at some number *n*, it would be problematic to return the hidden state at timestep
*n* for words that are shorter than *n* characters, because in these cases we would be feeding
some timesteps with bogus (e.g. all-zero) representations of characters that do not exist. Luckily, `dynamic_rnn`
provides the `sequence_length` keyword argument. Through this argument, we can pass a tensor that specifies the length
of each sequence as an additional input. `dynamic_rnn` will then return the hidden state at each sequence's last real
timestep as the final hidden representation, rather than the state after the padding. For example:

In [5]:
tf.reset_default_graph()
gru_cell = rnn.GRUCell(100)
x = tf.placeholder(shape=(512, 10, 50), dtype=tf.float32)
seq_lens = tf.placeholder(shape=(512,), dtype=tf.int32)
_, hidden = tf.nn.dynamic_rnn(gru_cell, x, sequence_length=seq_lens, dtype=tf.float32)
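
The effect of `sequence_length` can be sketched in NumPy: once a sequence has reached its length, its state is copied through unchanged and its outputs are set to zero. To make the result easy to check by hand, the cell here is a toy that just keeps a running sum of its inputs; `unroll_with_lengths` and `running_sum` are hypothetical names, and this is a sketch of the behaviour, not of TensorFlow's code.

```python
import numpy as np

def unroll_with_lengths(x, lengths, step, hidden_size):
    """x: (batch, time, input); lengths: (batch,). Returns (outputs, final_state)."""
    batch, time, _ = x.shape
    h = np.zeros((batch, hidden_size))
    outputs = np.zeros((batch, time, hidden_size))
    for t in range(time):
        active = (t < lengths)[:, None]  # (batch, 1): is timestep t a real one?
        h_new = step(x[:, t, :], h)
        h = np.where(active, h_new, h)   # finished sequences keep their state
        outputs[:, t, :] = np.where(active, h_new, 0.0)  # padded outputs are zero
    return outputs, h

def running_sum(x_t, h_prev):
    # Toy cell (an assumption for easy checking): the state is the sum of the inputs so far.
    return h_prev + x_t

x = np.array([[[1.0], [2.0], [4.0]],
              [[1.0], [2.0], [4.0]]])  # two identical sequences of 3 timesteps
lengths = np.array([2, 3])
outputs, final_state = unroll_with_lengths(x, lengths, running_sum, hidden_size=1)
print(final_state.ravel())  # [3. 7.]
```

The first sequence stops after two timesteps, so its final state is 1 + 2 = 3 and the third input is ignored; the second sequence uses all three inputs.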

This is enough to set up an RNN for the word classification task. `hidden` is a regular hidden
representation that could e.g. be the input of the `softmax` function.
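
As an illustration of that last step, the NumPy sketch below turns a batch of final hidden states into tag probabilities with one linear layer followed by a softmax. The number of tags, the random weights and the random stand-in for `hidden` are assumptions made only for this example.

```python
import numpy as np

def softmax(logits):
    # Subtract the row-wise maximum for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n_tags = 5                             # hypothetical number of tags
hidden = rng.normal(size=(512, 100))   # stand-in for the final hidden states
W = rng.normal(scale=0.1, size=(100, n_tags))
b = np.zeros(n_tags)
probs = softmax(hidden @ W + b)        # one probability distribution per word
predicted_tags = probs.argmax(axis=1)
print(probs.shape)  # (512, 5)
```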

### Bidirectional RNN

To create a bidirectional RNN, you can pretty much use the same procedure. However, now you need to create two
cells: one for the forward RNN and one for the backward RNN. The RNN can then be unrolled using the
`bidirectional_dynamic_rnn` function:

In [6]:
tf.reset_default_graph()
forward_cell = rnn.GRUCell(100)
backward_cell = rnn.GRUCell(100)
x = tf.placeholder(shape=(512, 10, 50), dtype=tf.float32)
seq_lens = tf.placeholder(shape=(512,), dtype=tf.int32)
_, hidden = tf.nn.bidirectional_dynamic_rnn(forward_cell, backward_cell, x, sequence_length=seq_lens, dtype=tf.float32)
hidden

(<tf.Tensor 'bidirectional_rnn/fw/fw/while/Exit_2:0' shape=(?, 100) dtype=float32>,
 <tf.Tensor 'bidirectional_rnn/bw/bw/while/Exit_2:0' shape=(?, 100) dtype=float32>)

Like the `dynamic_rnn` function, `bidirectional_dynamic_rnn` returns a 2-tuple. But the second tuple element
is now a 2-tuple as well. Its values are the final hidden representations of the forward and backward
RNNs. You can feed them into another layer, e.g. by concatenating them using `tf.concat`.
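
Two details can be sketched in NumPy. First, the backward RNN should read each word from its last real character to its first, so only the valid part of a padded sequence is reversed (as far as I know, this is also how TensorFlow handles it when `sequence_length` is given). Second, concatenating the two final states along the feature axis gives one vector of size 200 per instance. The function name `reverse_valid` and the stand-in states are made up for this example.

```python
import numpy as np

def reverse_valid(x, lengths):
    """Reverse the first lengths[i] timesteps of each sequence; leave the padding in place."""
    out = x.copy()
    for i, n in enumerate(lengths):
        out[i, :n] = x[i, :n][::-1]
    return out

x = np.array([[1, 2, 0],
              [3, 4, 5]])  # (batch, time), 0 is padding
lengths = np.array([2, 3])
print(reverse_valid(x, lengths).tolist())  # [[2, 1, 0], [5, 4, 3]]

# Stand-ins for the forward and backward final states, each of shape (batch, 100).
fw_state = np.zeros((512, 100))
bw_state = np.ones((512, 100))
combined = np.concatenate([fw_state, bw_state], axis=1)
print(combined.shape)  # (512, 200)
```

With the tensors from the cell above, the analogous TensorFlow call would be `tf.concat(hidden, axis=1)`.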

### Regularization using dropout

RNN cells can be regularized using dropout. In order to do so, use a `DropoutWrapper`:

In [7]:
gru_cell = rnn.GRUCell(100)
regularized_cell = rnn.DropoutWrapper(gru_cell)

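and use `regularized_cell` while unrolling the RNN. Note that `DropoutWrapper` takes keep probabilities rather than dropout rates, through keyword arguments such as `input_keep_prob` and `output_keep_prob`. These default to 1.0, so the wrapper above does not drop anything until you set them. See the [`DropoutWrapper` documentation](https://www.tensorflow.org/api_docs/python/tf/contrib/rnn/DropoutWrapper) for more information.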
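
What a keep probability does can be sketched in NumPy: each unit is kept with probability `keep_prob`, and the surviving units are scaled by `1 / keep_prob` so that the expected activation stays the same. The function below is my own illustration of this common "inverted dropout" scheme, not the wrapper's actual code.

```python
import numpy as np

def dropout(x, keep_prob, rng):
    """Keep each unit with probability keep_prob and rescale the survivors."""
    if keep_prob >= 1.0:
        return x  # a keep probability of 1.0 means no dropout at all
    mask = rng.random(x.shape) < keep_prob
    return np.where(mask, x / keep_prob, 0.0)

rng = np.random.default_rng(0)
h = np.ones((512, 100))
dropped = dropout(h, keep_prob=0.8, rng=rng)
print(sorted(set(dropped.ravel().tolist())))  # [0.0, 1.25]
```

Dropout should only be active during training, so at prediction time the keep probabilities are normally set back to 1.0.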

## LSTMs

Unrolling an LSTM works similarly to unrolling a GRU:

* Use the `BasicLSTMCell` unit.
* Note that the final state is now an `LSTMStateTuple` with two fields: `h` is the hidden representation, `c` the memory cell.
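
To show where the two fields come from, here is a minimal NumPy sketch of one LSTM step that carries `c` and `h` together in a named tuple. It follows the standard LSTM equations with a forget bias of 1.0; the names `LSTMState` and `lstm_step` and the random parameters are assumptions made for this illustration, not TensorFlow's implementation.

```python
import collections
import numpy as np

LSTMState = collections.namedtuple("LSTMState", ["c", "h"])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, state, W, b, forget_bias=1.0):
    """One LSTM timestep. W: (input + hidden, 4 * hidden); b: (4 * hidden,)."""
    gates = np.concatenate([x_t, state.h], axis=1) @ W + b
    # i: input gate, j: candidate values, f: forget gate, o: output gate.
    i, j, f, o = np.split(gates, 4, axis=1)
    # The memory cell keeps part of its old content and adds gated new content.
    c = state.c * sigmoid(f + forget_bias) + sigmoid(i) * np.tanh(j)
    # The hidden representation is a gated view of the memory cell.
    h = np.tanh(c) * sigmoid(o)
    return LSTMState(c=c, h=h)

rng = np.random.default_rng(0)
x_t = rng.normal(size=(512, 50))
W = rng.normal(scale=0.1, size=(50 + 100, 400))
b = np.zeros(400)
state = LSTMState(c=np.zeros((512, 100)), h=np.zeros((512, 100)))
state = lstm_step(x_t, state, W, b)
print(state.h.shape, state.c.shape)  # (512, 100) (512, 100)
```

For classification, `h` is the part you would normally pass on to the next layer.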

In [8]:
tf.reset_default_graph()
lstm_cell = rnn.BasicLSTMCell(100)
x = tf.placeholder(shape=(512, 10, 50), dtype=tf.float32)
seq_lens = tf.placeholder(shape=(512,), dtype=tf.int32)
_, hidden = tf.nn.dynamic_rnn(lstm_cell, x, sequence_length=seq_lens, dtype=tf.float32)
hidden.h

<tf.Tensor 'rnn/while/Exit_3:0' shape=(?, 100) dtype=float32>

In [9]:
sess.close()