# Sequences

$ \textbf{x}^{\langle t \rangle} \in \mathbb{R}^{n}, \quad t \in \{1, 2, \dots, T_x\} $

$ \hat{\textbf{y}}^{\langle t \rangle} \in \mathbb{R}^{m}, \quad t \in \{1, 2, \dots, T_y\} $

# Recurrent Connection

$ \textbf{h}^{\langle t \rangle} = f(\textbf{h}^{\langle t-1 \rangle}, \textbf{x}^{\langle t \rangle}) $

Unrolled over 2 items in a sequence:

$ \textbf{h}^{\langle 0 \rangle} = \textbf{0} $

$ \textbf{h}^{\langle 1 \rangle} = f(\textbf{h}^{\langle 0 \rangle}, \textbf{x}^{\langle 1 \rangle}) $

$ \textbf{h}^{\langle 2 \rangle} = f(\textbf{h}^{\langle 1 \rangle}, \textbf{x}^{\langle 2 \rangle}) $
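The unrolled recurrence is a left fold of $f$ over the sequence. A minimal sketch, assuming a made-up scalar transition `f` (not part of the original notes) chosen only so the numbers are easy to follow:

```python
from functools import reduce

def f(h_prev, x):
    # Hypothetical toy transition: halve the previous state and add the input.
    return 0.5 * h_prev + x

xs = [1.0, 2.0]   # x<1>, x<2>
h0 = 0.0          # h<0> = 0

# Explicit unrolling, mirroring the two equations above.
h1 = f(h0, xs[0])
h2 = f(h1, xs[1])

# The same computation written as a fold over the sequence.
h_fold = reduce(f, xs, h0)
```

Both routes give the same final state, which is why a recurrent layer can be implemented as a loop over time steps that carries a single state variable.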

# Example: Dot Product + Tanh

Pre-activation $\textbf{z}$ and hidden state $\textbf{h}$ (which is also the cell's output) at time $t$:

$ \textbf{z}^{\langle t \rangle} = \textbf{V} \cdot \textbf{h}^{\langle t-1 \rangle} + \textbf{U} \cdot \textbf{x}^{\langle t \rangle} + \textbf{b} $

$ \textbf{h}^{\langle t \rangle} = \text{tanh}(\textbf{z}^{\langle t \rangle}) $

For a sequence of 2 items, the unfolded computation is:

$ \textbf{h}^{\langle 0 \rangle} = \textbf{0} $

$ $

$ \textbf{z}^{\langle 1 \rangle} = \textbf{V} \cdot \textbf{h}^{\langle 0 \rangle} + \textbf{U} \cdot \textbf{x}^{\langle 1 \rangle} + \textbf{b} $

$ \textbf{h}^{\langle 1 \rangle} = \text{tanh}(\textbf{z}^{\langle 1 \rangle}) $

$ $

$ \textbf{z}^{\langle 2 \rangle} = \textbf{V} \cdot \textbf{h}^{\langle 1 \rangle} + \textbf{U} \cdot \textbf{x}^{\langle 2 \rangle} + \textbf{b} $

$ \textbf{h}^{\langle 2 \rangle} = \text{tanh}(\textbf{z}^{\langle 2 \rangle}) $
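The unfolded computation can be sketched in NumPy. The sizes, the random initialisation and the names `k` and `rnn_step` are assumptions made for illustration; only the update rule comes from the equations above.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 3, 4                               # input size n, hidden size k (assumed)
V = rng.normal(scale=0.5, size=(k, k))    # recurrent weights
U = rng.normal(scale=0.5, size=(k, n))    # input weights
b = np.zeros(k)                           # bias

def rnn_step(h_prev, x):
    # z<t> = V h<t-1> + U x<t> + b, then h<t> = tanh(z<t>)
    z = V @ h_prev + U @ x + b
    return np.tanh(z)

xs = rng.normal(size=(2, n))              # x<1>, x<2>
h = np.zeros(k)                           # h<0> = 0
states = []
for x in xs:
    h = rnn_step(h, x)
    states.append(h)                      # states[0] = h<1>, states[1] = h<2>
```

Because $\textbf{h}^{\langle 0 \rangle} = \textbf{0}$, the first state depends only on $\textbf{U}$, $\textbf{x}^{\langle 1 \rangle}$ and $\textbf{b}$; the recurrent matrix $\textbf{V}$ first contributes at step 2.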

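Gradient flow through the cells, shown for the single term of $\partial C / \partial \textbf{W}$ that connects the output at step 5 to the hidden state at step 1: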

$$
\frac{\partial C}{\partial \textbf{W}}
=
\dots +
\frac{\partial C}{\partial \hat{\textbf{y}}^{\langle 5 \rangle}}
\frac{\partial \hat{\textbf{y}}^{\langle 5 \rangle}}{\partial \textbf{h}^{\langle 5 \rangle}}
\frac{\partial \textbf{h}^{\langle 5 \rangle}}{\partial \textbf{h}^{\langle 4 \rangle}}
\frac{\partial \textbf{h}^{\langle 4 \rangle}}{\partial \textbf{h}^{\langle 3 \rangle}}
\frac{\partial \textbf{h}^{\langle 3 \rangle}}{\partial \textbf{h}^{\langle 2 \rangle}}
\frac{\partial \textbf{h}^{\langle 2 \rangle}}{\partial \textbf{h}^{\langle 1 \rangle}}
\frac{\partial \textbf{h}^{\langle 1 \rangle}}{\partial \textbf{W}}
+ \dots
$$
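For the tanh cell defined earlier, each factor $\partial \textbf{h}^{\langle t \rangle} / \partial \textbf{h}^{\langle t-1 \rangle}$ equals $\text{diag}(1 - (\textbf{h}^{\langle t \rangle})^2) \, \textbf{V}$. The sketch below multiplies four such factors to get $\partial \textbf{h}^{\langle 5 \rangle} / \partial \textbf{h}^{\langle 1 \rangle}$ and prints its spectral norm. The weights, sizes and helper names are assumptions for illustration, not values from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
k, n, T = 4, 3, 5                          # hidden size, input size, sequence length (assumed)
V = rng.normal(scale=0.5, size=(k, k))
U = rng.normal(scale=0.5, size=(k, n))
b = np.zeros(k)
xs = rng.normal(size=(T, n))

def step(h_prev, x):
    return np.tanh(V @ h_prev + U @ x + b)

# Forward pass, keeping h<0>, ..., h<T>.
hs = [np.zeros(k)]
for x in xs:
    hs.append(step(hs[-1], x))

def jacobian(t):
    # dh<t>/dh<t-1> = diag(1 - h<t>^2) V, since tanh'(z) = 1 - tanh(z)^2.
    return np.diag(1.0 - hs[t] ** 2) @ V

# dh<5>/dh<1> = J<5> J<4> J<3> J<2>
chain = np.eye(k)
for t in range(T, 1, -1):
    chain = chain @ jacobian(t)

print(np.linalg.norm(chain, 2))            # size of the back-propagated signal
```

Repeated multiplication by similar matrices is what makes this product shrink or blow up as the sequence gets longer. Rescaling `V` and rerunning shows how sensitive the norm is to the recurrent weights.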

If we place $\textbf{V}$ and $\textbf{U}$ side by side (horizontal concatenation) and stack the previous state on top of the current input, the two matrix products merge into one:

$$ \textbf{W} = \Big( \textbf{V} \, \, \textbf{U} \Big) $$

$$
 \textbf{h}^{\langle t \rangle} = 
 \text{tanh}
 \Bigg(
  \textbf{W} \cdot \begin{pmatrix} \textbf{h}^{\langle t-1 \rangle} \\ \textbf{x}^{\langle t \rangle} \end{pmatrix}  
 + \textbf{b}
 \Bigg)
$$
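A quick numerical check that the stacked form matches the separate form $\textbf{V} \cdot \textbf{h}^{\langle t-1 \rangle} + \textbf{U} \cdot \textbf{x}^{\langle t \rangle} + \textbf{b}$. Shapes and random values are assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
k, n = 4, 3                               # hidden size, input size (assumed)
V = rng.normal(size=(k, k))
U = rng.normal(size=(k, n))
b = rng.normal(size=k)
h_prev = rng.normal(size=k)
x = rng.normal(size=n)

W = np.hstack([V, U])                     # shape (k, k + n)
stacked = np.concatenate([h_prev, x])     # shape (k + n,)

h_separate = np.tanh(V @ h_prev + U @ x + b)
h_stacked = np.tanh(W @ stacked + b)
```

The two results agree up to floating-point rounding, so the stacked form is only a notational (and sometimes computational) convenience.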

In [1]:
from tensorflow.keras.layers import SimpleRNN
?SimpleRNN

Init signature: SimpleRNN(*args, **kwargs)
Docstring:
Fully-connected RNN where the output is to be fed back to input.

See [the Keras RNN API guide](https://www.tensorflow.org/guide/keras/rnn)
for details about the usage of RNN API.

Arguments:
  units: Positive integer, dimensionality of the output space.
  activation: Activation function to use.
    Default: hyperbolic tangent (`tanh`).
    If you pass None, no activation is applied
    (ie. "linear" activation: `a(x) = x`).
  use_bias: Boolean, (default `True`), whether the layer uses a bias vector.
  kernel_initializer: Initializer for the `kernel` weights matrix,
    used for the linear transformation of the inputs. Default:
    `glorot_uniform`.
  recurrent_initializer: Initializer for the `recurrent_kernel`
    weights matrix, used for the linear transformation of the recurrent state.
    Default: 

# Gradient Clipping

Clipping by L2-norm:

$$
\nabla = \frac{\partial C}{\partial \textbf{W}}, \quad n = \lVert \nabla \rVert_2, \quad t > 0
$$

$$
\nabla_i \leftarrow 
\begin{cases}
  \nabla_i \cdot \frac{t}{n}  & \text{if } n > t \\    
  \nabla_i & \text{otherwise}
\end{cases}
$$
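A minimal NumPy sketch of the norm-clipping rule above. The function name `clip_by_norm` is made up for this example and is not a Keras API.

```python
import numpy as np

def clip_by_norm(grad, t):
    # Rescale the whole gradient so that its L2 norm is at most t.
    n = np.linalg.norm(grad)
    if n > t:
        return grad * (t / n)
    return grad

g = np.array([3.0, 4.0])          # L2 norm is 5
clipped = clip_by_norm(g, 1.0)    # rescaled to norm 1: [0.6, 0.8]
```

Every component is multiplied by the same factor, so the direction of the gradient is preserved and only its length changes.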

Clipping within $[-m, m]$ interval:

$$
\nabla = \frac{\partial C}{\partial \textbf{W}}, \quad m > 0
$$

$$
\nabla_i \leftarrow 
\begin{cases}
  \min(\nabla_i, m) & \text{if } \nabla_i > 0 \\    
  \max(\nabla_i, -m) & \text{otherwise}
\end{cases}
$$
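The value-clipping rule is an elementwise clamp. A minimal sketch, where `clip_by_value` is a made-up name for this example:

```python
import numpy as np

def clip_by_value(grad, m):
    # Clamp every component independently to the interval [-m, m].
    return np.clip(grad, -m, m)

g = np.array([-2.0, -0.3, 0.0, 0.4, 1.5])
clipped = clip_by_value(g, 0.5)   # [-0.5, -0.3, 0.0, 0.4, 0.5]
```

Unlike norm clipping, this can change the direction of the gradient, because large components are cut while small ones are left as they are.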

In [2]:
%%capture
from tensorflow.keras.optimizers import SGD

# Every Keras optimizer supports gradient
# clipping via its constructor parameters.

# Each weight's gradient is rescaled so that
# its L2 norm is at most 1.
sgd = SGD(learning_rate=0.01, clipnorm=1.)

# Each gradient component is clipped to
# a maximum value of 0.5 and
# a minimum value of -0.5.
sgd = SGD(learning_rate=0.01, clipvalue=0.5)

# Docs: https://keras.io/optimizers/

# Long Short-Term Memory

$$ 
 \begin{pmatrix} 
     \textbf{i}^{\langle t \rangle} \\ 
     \textbf{f}^{\langle t \rangle} \\
     \textbf{o}^{\langle t \rangle} \\
     \textbf{g}^{\langle t \rangle} \\
 \end{pmatrix}  
    =
 \begin{pmatrix}
     \sigma \\
     \sigma \\
     \sigma \\
     \text{tanh} 
 \end{pmatrix}
 \Bigg(
 \textbf{W}
 \cdot
 \begin{pmatrix} 
     \textbf{h}^{\langle t-1 \rangle} \\ 
     \textbf{x}^{\langle t \rangle} 
 \end{pmatrix}  
 + \textbf{b}
 \Bigg)
$$

$$
\textbf{c}^{\langle t \rangle} = 
    \textbf{f}^{\langle t \rangle} \odot \textbf{c}^{\langle t-1 \rangle} 
    + \textbf{i}^{\langle t \rangle} \odot \textbf{g}^{\langle t \rangle} 
$$

$$
\textbf{h}^{\langle t \rangle} = 
    \textbf{o}^{\langle t \rangle} \odot \text{tanh}(\textbf{c}^{\langle t \rangle}) 
$$
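One LSTM step can be sketched in NumPy. It assumes the activations are applied elementwise to the affine map $\textbf{W} \cdot (\textbf{h}^{\langle t-1 \rangle}; \textbf{x}^{\langle t \rangle}) + \textbf{b}$, with the rows of $\textbf{W}$ ordered as $\textbf{i}, \textbf{f}, \textbf{o}, \textbf{g}$. The names `lstm_step` and `sigmoid`, the sizes and the random weights are illustrative assumptions, and this is not how Keras lays out its kernels.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(h_prev, c_prev, x, W, b):
    k = h_prev.shape[0]
    a = W @ np.concatenate([h_prev, x]) + b   # shape (4k,)
    i = sigmoid(a[0 * k:1 * k])               # input gate
    f = sigmoid(a[1 * k:2 * k])               # forget gate
    o = sigmoid(a[2 * k:3 * k])               # output gate
    g = np.tanh(a[3 * k:4 * k])               # candidate cell update
    c = f * c_prev + i * g                    # new cell state
    h = o * np.tanh(c)                        # new hidden state
    return h, c

rng = np.random.default_rng(0)
k, n = 4, 3                                   # hidden size, input size (assumed)
W = rng.normal(scale=0.5, size=(4 * k, k + n))
b = np.zeros(4 * k)
h, c = np.zeros(k), np.zeros(k)
for x in rng.normal(size=(2, n)):             # a sequence of 2 items
    h, c = lstm_step(h, c, x, W, b)
```

The cell state is updated additively (`f * c_prev + i * g`), so the gradient along $\textbf{c}$ is scaled by the forget gate rather than passed through a full matrix product at every step.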

In [3]:
from tensorflow.keras.layers import LSTM
?LSTM

Init signature: LSTM(*args, **kwargs)
Docstring:
Long Short-Term Memory layer - Hochreiter 1997.

 Note that this cell is not optimized for performance on GPU. Please use
`tf.compat.v1.keras.layers.CuDNNLSTM` for better performance on GPU.

Arguments:
  units: Positive integer, dimensionality of the output space.
  activation: Activation function to use.
    Default: hyperbolic tangent (`tanh`).
    If you pass `None`, no activation is applied
    (ie. "linear" activation: `a(x) = x`).
  recurrent_activation: Activation function to use
    for the recurrent step.
    Default: hard sigmoid (`hard_sigmoid`).
    If you pass `None`, no activation is applied
    (ie. "linear" activation: `a(x) = x`).
  use_bias: Boolean, whether the layer uses a bias vector.
  kernel_initializer: Initializer for the `kernel` weights matrix,
    used for the linear transformati

# Gated Recurrent Unit

Based on Cho, K., et al. (2014).
Note that the original paper omits biases, while the equations below include them.

$$
 \begin{pmatrix} 
     \textbf{r}^{\langle t \rangle} \\ 
     \textbf{z}^{\langle t \rangle}
 \end{pmatrix}  
    =
 \text{sigmoid}
 \Big[
 \textbf{U}
 \cdot
 \begin{pmatrix} 
     \textbf{h}^{\langle t-1 \rangle} \\ 
     \textbf{x}^{\langle t \rangle}
 \end{pmatrix}  
 + \textbf{b}
 \Big]
$$

$$
\tilde{\textbf{h}}^{\langle t \rangle} 
    = 
\text{tanh}
\Big[
\textbf{V} 
\cdot
\begin{pmatrix} 
    \textbf{r}^{\langle t \rangle} \odot \textbf{h}^{\langle t-1 \rangle} \\ 
    \textbf{x}^{\langle t \rangle}
\end{pmatrix} 
    + 
\textbf{c}
\Big]
$$

$$
\textbf{h}^{\langle t \rangle} 
    = 
\Big[
\textbf{z}^{\langle t \rangle} \odot \textbf{h}^{\langle t-1 \rangle}
\Big]
    +
\Big[
(1 - \textbf{z}^{\langle t \rangle}) \odot \tilde{\textbf{h}}^{\langle t \rangle}
\Big]
$$
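One GRU step following the three equations above, as a NumPy sketch. Here `U` has shape $(2k, k+n)$ with rows ordered as $\textbf{r}, \textbf{z}$, and `V` has shape $(k, k+n)$. The function names, sizes and random weights are assumptions for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(h_prev, x, U, b, V, c):
    k = h_prev.shape[0]
    gates = sigmoid(U @ np.concatenate([h_prev, x]) + b)   # shape (2k,)
    r, z = gates[:k], gates[k:]                            # reset and update gates
    # Candidate state: the reset gate scales h<t-1> before the matrix product.
    h_tilde = np.tanh(V @ np.concatenate([r * h_prev, x]) + c)
    # Interpolate between the old state and the candidate.
    return z * h_prev + (1.0 - z) * h_tilde

rng = np.random.default_rng(0)
k, n = 4, 3                                                # hidden size, input size (assumed)
U = rng.normal(scale=0.5, size=(2 * k, k + n))
V = rng.normal(scale=0.5, size=(k, k + n))
b, c = np.zeros(2 * k), np.zeros(k)
h = np.zeros(k)
for x in rng.normal(size=(2, n)):                          # a sequence of 2 items
    h = gru_step(h, x, U, b, V, c)
```

With this convention, $\textbf{z}^{\langle t \rangle}$ close to 1 copies the previous state forward almost unchanged, and $\textbf{z}^{\langle t \rangle}$ close to 0 replaces it with the candidate.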

In [4]:
from tensorflow.keras.layers import GRU
?GRU

Init signature: GRU(*args, **kwargs)
Docstring:
Gated Recurrent Unit - Cho et al. 2014.

There are two variants. The default one is based on 1406.1078v3 and
has reset gate applied to hidden state before matrix multiplication. The
other one is based on original 1406.1078v1 and has the order reversed.

The second variant is compatible with CuDNNGRU (GPU-only) and allows
inference on CPU. Thus it has separate biases for `kernel` and
`recurrent_kernel`. Use `'reset_after'=True` and
`recurrent_activation='sigmoid'`.

Arguments:
  units: Positive integer, dimensionality of the output space.
  activation: Activation function to use.
    Default: hyperbolic tangent (`tanh`).
    If you pass `None`, no activation is applied
    (ie. "linear" activation: `a(x) = x`).
  recurrent_activation: Activation function to use
    for the recurrent step.
    Default: hard sig

In [5]:
from tensorflow.python.keras.layers import CuDNNGRU
?CuDNNGRU

Init signature: CuDNNGRU(*args, **kwargs)
Docstring:
Fast GRU implementation backed by cuDNN.

More information about cuDNN can be found on the [NVIDIA
developer website](https://developer.nvidia.com/cudnn).
Can only be run on GPU.

Arguments:
    units: Positive integer, dimensionality of the output space.
    kernel_initializer: Initializer for the `kernel` weights matrix, used for
      the linear transformation of the inputs.
    recurrent_initializer: Initializer for the `recurrent_kernel` weights
      matrix, used for the linear transformation of the recurrent state.
    bias_initializer: Initializer for the bias vector.
    kernel_regularizer: Regularizer function applied to the `kernel` weights
      matrix.
    recurrent_regularizer: Regularizer function applied to the
      `recurrent_kernel` weights matrix.
    bias_regularizer: Regularizer fun