## Recurrent Neural Network

- Usage: analyzing sequential data, for example the sequence of frames in a video or the sequence of words in language translation

- What is sequential data? Data in which the order and the earlier context matter (example: predict the next word):

    - I went to have dinner with my .... (possible answer: friend/parents/roommate)

    - My dad and mom came to campus for my graduation commencement two days ago. I went to have dinner with my .... (possible answer: parents)

- Features of RNN:
    - handle variable-length sequences
    - track long-term dependencies
    - maintain information about the order of the elements
    - share parameters across the sequence (this reduces the number of parameters in the RNN; the number of parameters is independent of the sequence length)
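
The parameter-sharing point can be checked with a minimal sketch using PyTorch's `nn.RNN`. The sizes 20 and 50 are arbitrary illustrative choices. The parameter count depends only on the input and hidden sizes, and the same module accepts sequences of different lengths.

```python
import torch
import torch.nn as nn

# Illustrative sizes (assumptions for this sketch)
rnn = nn.RNN(input_size=20, hidden_size=50)

# W_ih: 50x20, W_hh: 50x50, b_ih: 50, b_hh: 50
n_params = sum(p.numel() for p in rnn.parameters())
print(n_params)  # 3600

# The same parameters process a sequence of length 3 and one of length 10
short_seq = torch.randn(3, 1, 20)   # (sequence_length, batch_size, input_size)
long_seq = torch.randn(10, 1, 20)
out_short, _ = rnn(short_seq)       # initial hidden state defaults to zeros
out_long, _ = rnn(long_seq)
print(out_short.shape, out_long.shape)
```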

### Mathematical formulation of a recurrent layer
$$
h_t = \sigma(W_{ih} x_t + b_{ih} + W_{hh} h_{t-1} + b_{hh})
$$

- $h_t$: hidden state at time $t$
- $x_t$: input at time $t$
- $W_{ih}$: input-hidden weights
- $W_{hh}$: hidden-hidden weights
- $b_{ih}$: input-hidden bias
- $b_{hh}$: hidden-hidden bias
- $\sigma$: activation function (ReLU or tanh)
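
As a rough illustration, the update rule can be written out by hand for a single time step. This is a sketch under assumptions: the sizes, the random weights, and the helper name `rnn_step` are made up for the example, and $\sigma$ is taken to be tanh.

```python
import torch

torch.manual_seed(0)
input_size, hidden_size = 4, 3   # illustrative sizes

# Randomly initialized parameters (stand-ins for learned weights)
W_ih = torch.randn(hidden_size, input_size)
W_hh = torch.randn(hidden_size, hidden_size)
b_ih = torch.zeros(hidden_size)
b_hh = torch.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    """Hypothetical helper: h_t = tanh(W_ih x_t + b_ih + W_hh h_{t-1} + b_hh)."""
    return torch.tanh(W_ih @ x_t + b_ih + W_hh @ h_prev + b_hh)

x_t = torch.randn(input_size)      # input at time t
h_prev = torch.zeros(hidden_size)  # hidden state at time t-1
h_t = rnn_step(x_t, h_prev)
print(h_t.shape)  # torch.Size([3])
```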

### Example: one recurrent layer
- input (example: the sentence "I like machine learning", with $x_1$ = "I", $x_2$ = "like", $x_3$ = "machine", $x_4$ = "learning"):
$$
input = (x_1, x_2, x_3, x_4)
$$
- one recurrent layer:
$$
h_1 = \sigma(W_{ih} x_1 + b_{ih} + W_{hh} h_{0} + b_{hh})
$$
$$
h_2 = \sigma(W_{ih} x_2 + b_{ih} + W_{hh} h_{1} + b_{hh})
$$
$$
h_3 = \sigma(W_{ih} x_3 + b_{ih} + W_{hh} h_{2} + b_{hh})
$$
$$
h_4 = \sigma(W_{ih} x_4 + b_{ih} + W_{hh} h_{3} + b_{hh})
$$
- output:
$$
output = (h_1, h_2, h_3, h_4)
$$
#### Note
- The initial hidden state $h_0$ is usually initialized to a zero tensor or a random tensor
- The parameters ($W_{ih}$, $W_{hh}$, $b_{ih}$, $b_{hh}$) are shared across all time steps
- The input can have arbitrary length: $(x_1, x_2, \cdots, x_n)$
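
The unrolled equations above can be sketched as a loop over time steps. This example assumes PyTorch's `nn.RNNCell`, which applies one step of the recurrence with tanh by default. The random vectors stand in for word embeddings of "I", "like", "machine", "learning", and the sizes 8 and 5 are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
cell = nn.RNNCell(input_size=8, hidden_size=5)  # one set of weights for all steps

# Stand-ins for the embeddings x_1..x_4; each has shape (batch_size, input_size)
xs = [torch.randn(1, 8) for _ in range(4)]

h = torch.zeros(1, 5)      # h_0
hidden_states = []
for x_t in xs:             # the same cell (same parameters) is reused at every step
    h = cell(x_t, h)       # h_t computed from x_t and h_{t-1}
    hidden_states.append(h)

output = torch.stack(hidden_states)   # (h_1, h_2, h_3, h_4)
print(output.shape)  # torch.Size([4, 1, 5])
```

Because the loop only reuses `cell`, the list `xs` could be of any length without changing the parameters.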

```python
import torch
import torch.nn as nn

# input_size: the number of expected features in the input x
# hidden_size: the number of features in the hidden state h
# num_layers: number of recurrent layers (default: 1). For example, num_layers=2
#   stacks two RNNs, with the second RNN taking the outputs of the first RNN
#   as its inputs and computing the final results.
rnn = nn.RNN(input_size=20, hidden_size=50)

# inputs: (sequence_length, batch_size, input_size)
inputs = torch.randn(3, 32, 20)
x1, x2, x3 = inputs[0], inputs[1], inputs[2]

# h_0, initial hidden state: (num_layers, batch_size, hidden_size)
h_0 = torch.zeros(1, 32, 50)

# output: (sequence_length, batch_size, hidden_size)
# h_n, final hidden state: (num_layers, batch_size, hidden_size)
output, h_n = rnn(inputs, h_0)

print(output.size(), h_n.size())
```

```
torch.Size([3, 32, 50]) torch.Size([1, 32, 50])
```
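
To connect the `nn.RNN` call back to the equations, the following sketch recomputes the hidden states by hand from the module's own parameters (`weight_ih_l0`, `weight_hh_l0`, `bias_ih_l0`, `bias_hh_l0`) and compares them with the module's output. It assumes the default tanh nonlinearity. The tolerance `atol=1e-5` is an arbitrary choice to absorb floating-point differences.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(input_size=20, hidden_size=50)
x = torch.randn(3, 32, 20)        # (sequence_length, batch_size, input_size)
h_0 = torch.zeros(1, 32, 50)      # (num_layers, batch_size, hidden_size)

with torch.no_grad():
    output, h_n = rnn(x, h_0)

    # Parameters of the single layer, as exposed by nn.RNN
    W_ih, W_hh = rnn.weight_ih_l0, rnn.weight_hh_l0   # (50, 20), (50, 50)
    b_ih, b_hh = rnn.bias_ih_l0, rnn.bias_hh_l0       # (50,), (50,)

    h = h_0[0]                    # (batch_size, hidden_size)
    manual = []
    for t in range(x.size(0)):
        # Batched (row-vector) form of h_t = tanh(W_ih x_t + b_ih + W_hh h_{t-1} + b_hh)
        h = torch.tanh(x[t] @ W_ih.T + b_ih + h @ W_hh.T + b_hh)
        manual.append(h)
    manual = torch.stack(manual)  # (sequence_length, batch_size, hidden_size)

# Hand-unrolled states should agree with the module's output
print(torch.allclose(manual, output, atol=1e-5))
# For a single layer, the final hidden state is the last entry of output
print(torch.allclose(h_n[0], output[-1]))
```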


### Example: two recurrent layers
- input:
$$
input = (x_1, x_2, x_3)
$$

- layer 1:
$$
h_1^{[1]} = \sigma(W_{ih}^{[1]} x_1 + b_{ih}^{[1]} + W_{hh}^{[1]} h_{0}^{[1]} + b_{hh}^{[1]})
$$
$$
h_2^{[1]} = \sigma(W_{ih}^{[1]} x_2 + b_{ih}^{[1]} + W_{hh}^{[1]} h_{1}^{[1]} + b_{hh}^{[1]})
$$
$$
h_3^{[1]} = \sigma(W_{ih}^{[1]} x_3 + b_{ih}^{[1]} + W_{hh}^{[1]} h_{2}^{[1]} + b_{hh}^{[1]})
$$
Note: the parameters ($W_{ih}^{[1]}$, $W_{hh}^{[1]}$, $b_{ih}^{[1]}$, $b_{hh}^{[1]}$) are shared across the sequence

- layer 2:
$$
h_1^{[2]} = \sigma(W_{ih}^{[2]} h_1^{[1]} + b_{ih}^{[2]} + W_{hh}^{[2]} h_{0}^{[2]} + b_{hh}^{[2]})
$$
$$
h_2^{[2]} = \sigma(W_{ih}^{[2]} h_2^{[1]} + b_{ih}^{[2]} + W_{hh}^{[2]} h_{1}^{[2]} + b_{hh}^{[2]})
$$
$$
h_3^{[2]} = \sigma(W_{ih}^{[2]} h_3^{[1]} + b_{ih}^{[2]} + W_{hh}^{[2]} h_{2}^{[2]} + b_{hh}^{[2]})
$$
Note: the parameters ($W_{ih}^{[2]}$, $W_{hh}^{[2]}$, $b_{ih}^{[2]}$, $b_{hh}^{[2]}$) are shared across the sequence

- output:
$$
output = (h_1^{[2]}, h_2^{[2]}, h_3^{[2]})
$$

![Deep (stacked) RNN](deep-rnn.svg)
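
The stacking described above can be sketched by chaining two single-layer `nn.RNN` modules, where the second consumes the hidden states of the first. The names `layer1` and `layer2` are illustrative. This mirrors the structure of the equations, not necessarily how a multi-layer module is implemented internally.

```python
import torch
import torch.nn as nn

layer1 = nn.RNN(input_size=20, hidden_size=50)
layer2 = nn.RNN(input_size=50, hidden_size=50)  # input size = layer 1's hidden size

x = torch.randn(3, 32, 20)     # (sequence_length, batch_size, input_size)
out1, h1_n = layer1(x)         # out1 = (h_1^{[1]}, h_2^{[1]}, h_3^{[1]})
out2, h2_n = layer2(out1)      # out2 = (h_1^{[2]}, h_2^{[2]}, h_3^{[2]})

print(out1.shape, out2.shape)
```

Each layer keeps its own parameters, which are shared across time steps but not across layers.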

```python
import torch
import torch.nn as nn

# input_size: the number of expected features in the input x
# hidden_size: the number of features in the hidden state h
# num_layers: number of recurrent layers (default: 1). For example, num_layers=2
#   stacks two RNNs, with the second RNN taking the outputs of the first RNN
#   as its inputs and computing the final results.
rnn = nn.RNN(input_size=20, hidden_size=50, num_layers=2)

# inputs: (sequence_length, batch_size, input_size)
inputs = torch.randn(3, 32, 20)
# x1, x2, x3 = inputs[0], inputs[1], inputs[2]

# h_0, initial hidden state: (num_layers, batch_size, hidden_size)
h_0 = torch.randn(2, 32, 50)

# output: (sequence_length, batch_size, hidden_size)
# h_n, final hidden state: (num_layers, batch_size, hidden_size)
output, h_n = rnn(inputs, h_0)

print(output.size(), h_n.size())
```

```
torch.Size([3, 32, 50]) torch.Size([2, 32, 50])
```
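
A short sketch of how the two returned tensors relate in the two-layer case: `output` holds the top layer's hidden state at every time step, while the final hidden state holds the last time step of every layer. It also inspects the per-layer weights `weight_ih_l0` and `weight_ih_l1` that `nn.RNN` exposes.

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=20, hidden_size=50, num_layers=2)
x = torch.randn(3, 32, 20)
output, h_n = rnn(x)   # the initial hidden state defaults to zeros when omitted

# Layer 1 maps 20 input features to 50; layer 2 maps layer 1's 50 features to 50
print(rnn.weight_ih_l0.shape, rnn.weight_ih_l1.shape)

# The last time step of output matches the top layer's final hidden state
print(torch.allclose(output[-1], h_n[-1]))
```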
