In [2]:
import torch
import numpy as np
from pprint import pprint
print("\nNotebook done in pytorch version: ", torch.__version__)


## Naive Recurrent unit


- $s^{<t-1>}$ activation (hidden state) from the previous time step $t-1$
- $x^{<t>}$ input to the recurrent unit at time $t$

#### How to compute the first hidden state and output 

We need an initial state for the recurrent unit, $s^{<0>} = (0,\dots,0)$, an activation function $g_s$ for the hidden state (usually tanh or ReLU), and an activation function $g_y$ for the output (sigmoid for a binary classification problem, softmax if we have $K$ classes).

- The recurrent signal at time $t=1$, $s^{<1>}$  is computed as:
$$
s^{<1>} = g_s\left( W_{ss} \cdot s^{<0>} + W_{sx} \cdot x^{<1>} + b_s \right)
$$

- The output signal at time $t=1$, $\hat{y \,}^{<1>}$, is computed from the new state $s^{<1>}$ as:
$$
\hat{y \,}^{<1>} = g_y\left( W_{ys} \cdot s^{<1>} + b_y \right)
$$


#### How to compute the hidden state and output at time t

- The recurrent signal at time $t$, $s^{<t>}$  is computed as:
$$
s^{<t>} = g_s\left( W_{ss} \cdot s^{<t-1>} + W_{sx} \cdot x^{<t>} + b_s \right)
$$

- The output signal at time $t$, $\hat{y \,}^{<t>}$, is computed as:
$$
\hat{y \,}^{<t>} = g_y\left( W_{ys} \cdot s^{<t>} + b_y \right)
$$
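
The two update rules above can be sketched in NumPy as follows. This is only a minimal illustration, not code from the original notebook: the helper name `rnn_step`, the toy sizes and the random weights are assumptions, and we pick $g_s = \tanh$ and $g_y = $ softmax.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D vector
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_step(s_prev, x_t, W_ss, W_sx, b_s, W_ys, b_y):
    """One forward step of the naive recurrent unit (hypothetical helper)."""
    s_t = np.tanh(W_ss @ s_prev + W_sx @ x_t + b_s)  # new hidden state s^{<t>}
    y_hat_t = softmax(W_ys @ s_t + b_y)              # output computed from s^{<t>}
    return s_t, y_hat_t

rng = np.random.default_rng(0)
n_s, n_x, n_y, T = 4, 3, 2, 5        # toy sizes, chosen only for illustration
W_ss = rng.normal(size=(n_s, n_s))
W_sx = rng.normal(size=(n_s, n_x))
W_ys = rng.normal(size=(n_y, n_s))
b_s, b_y = np.zeros(n_s), np.zeros(n_y)

s = np.zeros(n_s)                    # s^{<0>} = (0, ..., 0)
xs = rng.normal(size=(T, n_x))       # a random input sequence of length T
outputs = []
for x_t in xs:
    # The same weights are reused at every time step
    s, y_hat = rnn_step(s, x_t, W_ss, W_sx, b_s, W_ys, b_y)
    outputs.append(y_hat)
outputs = np.array(outputs)          # shape (T, n_y), one distribution per step
```

Note that the loop carries `s` from one iteration to the next: this is the only place where information about earlier inputs is stored.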



In [24]:
# torch.nn.RNN draws its initial weights from torch's RNG, so we seed torch rather than NumPy
torch.manual_seed(1234)
rnn = torch.nn.RNN(input_size=6, hidden_size=256)

In [25]:
pprint(list(rnn.state_dict().keys()))

['weight_ih_l0', 'weight_hh_l0', 'bias_ih_l0', 'bias_hh_l0']


In [26]:
print("hidden to hidden weights size:", rnn.weight_hh_l0.size())
print("hidden to hidden bias  size:", rnn.bias_hh_l0.size())

print("input to hidden weight matrix size:", rnn.weight_ih_l0.size())
print("input to hidden bias  size:", rnn.bias_ih_l0.size())

hidden to hidden weights size: torch.Size([256, 256])
hidden to hidden bias  size: torch.Size([256])
input to hidden weight matrix size: torch.Size([256, 6])
input to hidden bias  size: torch.Size([256])



### Stacking weight notation

Let us assume

- $W_{ss}$ is a $(100, 100)$ matrix.
- $W_{sx}$ is a $(100, 10000)$ matrix.

Then we can stack the matrices $W_{ss}$ and $W_{sx}$ horizontally to create $W_s$.
The new matrix has the same number of rows (100), and its number of columns is the sum of the columns of both matrices ($100 + 10000$).
The notation $W_s = [W_{ss} \; W_{sx}]$ is usually used to emphasize that the matrices have been concatenated side by side: the number of columns has increased but the number of rows stays the same. Notice that

- $W_{s}$ is a $(100, 10100)$ matrix.


Let us denote by $[v_1, v_2]$ the vertical concatenation of two column vectors. If $v_1$ is an $(n_1,1)$ vector and $v_2$ is an $(n_2,1)$ vector, then
$[v_1, v_2]$ is an $(n_1+n_2,1)$ vector.

Using the matrix $W_{s}$  and the previous notation of vector concatenation we can rewrite the forward equations as

- The recurrent signal at time $t$, $s^{<t>}$, is computed as:
$$
s^{<t>} = g_s\left( W_{s} \cdot \left[\substack{s^{<t-1>} \\  x^{<t>}} \right] + b_s \right)
$$

- The output signal at time $t$, $\hat{y \,}^{<t>}$, is computed as:
$$
\hat{y \,}^{<t>} = g_y\left( W_{ys} \cdot   s^{<t>} + b_y \right)
$$


- Since the output uses a single weight matrix, we might simply write $W_y$ instead of $W_{ys}$ for the output signal at time $t$:
$$
\hat{y \,}^{<t>} = g_y \left( W_{y} \cdot   s^{<t>} + b_y \right)
$$

Notice that if the state vector has 100 dimensions and the input vector has 10000 dimensions, then $\left[s^{<t-1>} , x^{<t>}\right]$, or equivalently $\left[\substack{s^{<t-1>} \\  x^{<t>}} \right]$, is a 10100-dimensional vector. Therefore the multiplication $W_{s} \cdot \left[s^{<t-1>} , x^{<t>}\right]$, or $W_{s} \cdot \left[\substack{s^{<t-1>} \\  x^{<t>}} \right]$, is a well-defined $(100, 10100) \cdot (10100, 1)$ product, which yields a $(100, 1)$ vector.
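
The equivalence between the stacked form and the original two-matrix form can be checked numerically. The following is a small sketch with random values (the variable names and the random data are assumptions made for illustration); it uses the sizes from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n_s, n_x = 100, 10000                 # state and input sizes from the text

W_ss = rng.normal(size=(n_s, n_s))
W_sx = rng.normal(size=(n_s, n_x))
s_prev = rng.normal(size=(n_s, 1))    # s^{<t-1>} as a column vector
x_t = rng.normal(size=(n_x, 1))       # x^{<t>} as a column vector

W_s = np.hstack([W_ss, W_sx])         # side by side: same rows, more columns
stacked = np.vstack([s_prev, x_t])    # one on top of the other

separate = W_ss @ s_prev + W_sx @ x_t # original form with two products
combined = W_s @ stacked              # stacked form with a single product

print(W_s.shape, stacked.shape, combined.shape)
print(np.allclose(separate, combined))
```

The single product is only a change of notation: block matrix multiplication gives exactly the sum of the two original products.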


## Simplified Gated recurrent unit 

The GRU has a variable $c$, the memory cell, which lets the unit remember information observed earlier in the sequence.

The GRU outputs $c^{<t>} = s^{<t>}$, that is, the memory cell plays the role of the hidden state. We keep a separate symbol because it will make sense later on, when we describe the LSTM.

- At every time step we will consider the value $\hat{c\,}^{<t>} $ to be a candidate to replace $c^{<t>}$.
The candidate will be computed as

$$\hat{c\,}^{<t>} = \tanh \left( W_c \cdot  \left[c^{<t-1>}, x^{<t>} \right] + b_c \right)$$ 

- Which in stacked notation will be 

$$\hat{c\,}^{<t>} = \tanh \left( W_c \cdot  \left[ \substack{c^{<t-1>} \\ x^{<t>}} \right] + b_c \right)$$


### Update Gate $\Gamma_u$

The main addition of the GRU with respect to the simple RNN cell is the update gate $\Gamma_u$.
You can think of this value as being 0 or 1, even though in practice it is a value between 0 and 1, since it is a sigmoid applied element-wise to a vector.


At every time step the GRU considers updating $c^{<t>}$ with $\hat{c\,}^{<t>}$. The update gate $\Gamma_u$ controls whether the update happens.

- The Update gate value $\Gamma_u$ is computed as

$$
\Gamma_u = \sigma \left( W_u \cdot \left[c^{<t-1>}, x^{<t>} \right] + b_u \right)
$$

- The cell state $c^{<t>}$  will be updated as

$$
c^{<t>} = \Gamma_u \odot \hat{c\,}^{<t>} + ( 1 - \Gamma_u  ) \odot c^{<t-1>}
$$


Notice that $\dim(\Gamma_u) = \dim(\hat{c\,}) =  \dim(c)$
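
The simplified GRU step can be sketched in NumPy as below. This is an illustrative sketch rather than a reference implementation: the helper `simple_gru_step`, the toy sizes and the extreme gate biases are assumptions chosen to make the two limiting behaviours of $\Gamma_u$ visible.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def simple_gru_step(c_prev, x_t, W_c, b_c, W_u, b_u):
    """One step of the simplified GRU (hypothetical helper)."""
    stacked = np.concatenate([c_prev, x_t])    # [c^{<t-1>}, x^{<t>}]
    c_cand = np.tanh(W_c @ stacked + b_c)      # candidate value
    gamma_u = sigmoid(W_u @ stacked + b_u)     # update gate, entries in (0, 1)
    # Element-wise interpolation between the candidate and the old memory
    c_t = gamma_u * c_cand + (1.0 - gamma_u) * c_prev
    return c_t, gamma_u

rng = np.random.default_rng(0)
n_c, n_x = 4, 3                                # toy sizes for illustration
W_c = rng.normal(size=(n_c, n_c + n_x))
W_u = rng.normal(size=(n_c, n_c + n_x))
b_c = np.zeros(n_c)
c_prev = rng.normal(size=n_c)
x_t = rng.normal(size=n_x)

# A very negative gate bias drives gamma_u towards 0: the memory is kept
c_keep, g_keep = simple_gru_step(c_prev, x_t, W_c, b_c, W_u, np.full(n_c, -50.0))
# A very positive gate bias drives gamma_u towards 1: the candidate replaces the memory
c_new, g_new = simple_gru_step(c_prev, x_t, W_c, b_c, W_u, np.full(n_c, 50.0))
```

With the gate close to 0 the cell simply copies $c^{<t-1>}$ forward, which is what allows a GRU to carry information across many time steps.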



## Gated recurrent unit 

- https://www.youtube.com/watch?v=wSabaLGEegM&list=PLBAGcD3siRDittPwQDGIIAWkjz-RucAc7&index=8

- https://towardsdatascience.com/understanding-gru-networks-2ef37df6c9be

- https://pytorch.org/docs/master/nn.html

## Full GRU:

- At every time step we will consider the value $\hat{c\,}^{<t>} $ to be a candidate to replace $c^{<t>}$.
The candidate will be computed as

$$\hat{c\,}^{<t>} = \tanh \left( W_c \cdot  \left[ \Gamma_r  \odot c^{<t-1>}, x^{<t>} \right] + b_c \right)$$ 

- The value of the gate $\Gamma_r$ (usually called the relevance or reset gate) is computed as

$$
\Gamma_r = \sigma \left( W_r \cdot \left[c^{<t-1>}, x^{<t>} \right] + b_r \right)
$$

- The Update gate value $\Gamma_u$ is computed as

$$
\Gamma_u = \sigma \left( W_u \cdot \left[c^{<t-1>}, x^{<t>} \right] + b_u \right)
$$


- The cell state $c^{<t>}$  will be updated as

$$
c^{<t>} = \Gamma_u \odot \hat{c\,}^{<t>} + ( 1 - \Gamma_u  ) \odot c^{<t-1>}
$$


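In summary, the learnable parameters of the full GRU are the three weight matrices $W_c, W_r, W_u$ together with their biases $b_c, b_r, b_u$.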
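
The four equations of the full GRU can be put together in a short NumPy sketch. It follows the formulation written above; the helper `gru_step`, the toy sizes and the random parameters are assumptions for illustration, and PyTorch's `torch.nn.GRU` uses a slightly different but closely related parameterization.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(c_prev, x_t, W_c, b_c, W_r, b_r, W_u, b_u):
    """One step of the full GRU as written in the equations (hypothetical helper)."""
    stacked = np.concatenate([c_prev, x_t])          # [c^{<t-1>}, x^{<t>}]
    gamma_r = sigmoid(W_r @ stacked + b_r)           # gate applied to the old memory
    gamma_u = sigmoid(W_u @ stacked + b_u)           # update gate
    gated = np.concatenate([gamma_r * c_prev, x_t])  # [Gamma_r * c^{<t-1>}, x^{<t>}]
    c_cand = np.tanh(W_c @ gated + b_c)              # candidate value
    return gamma_u * c_cand + (1.0 - gamma_u) * c_prev

rng = np.random.default_rng(0)
n_c, n_x, T = 4, 3, 6                                # toy sizes for illustration
W_c, W_r, W_u = (rng.normal(size=(n_c, n_c + n_x)) for _ in range(3))
b_c, b_r, b_u = (np.zeros(n_c) for _ in range(3))

c = np.zeros(n_c)                                    # initial memory cell
history = []
for x_t in rng.normal(size=(T, n_x)):                # a random input sequence
    c = gru_step(c, x_t, W_c, b_c, W_r, b_r, W_u, b_u)
    history.append(c)
history = np.array(history)                          # shape (T, n_c)
```

Each new cell value is a convex combination of a $\tanh$ output and the previous cell value, so starting from zeros the memory stays within $[-1, 1]$.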






![GRU_diagram](./GRU_diagram.png)



In [27]:
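np.random.seed(1234)  # fixes the random sample below; the GRU weights come from torch's own RNG
gru = torch.nn.GRU(input_size=6, hidden_size=256)
# Input of shape (seq_len=1, batch=1, input_size=6); wrapping in Variable is no longer needed
sample = torch.tensor(np.random.rand(6).reshape(1, 1, 6), dtype=torch.float32)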

In [28]:
sample

tensor([[[0.1915, 0.6221, 0.4377, 0.7854, 0.7800, 0.2726]]])

In [29]:
type(gru(sample)), len(gru(sample))

(tuple, 2)

In [30]:
a, b = gru(sample)  # a: output at every time step, b: final hidden state

In [31]:
a.size(), b.size()

(torch.Size([1, 1, 256]), torch.Size([1, 1, 256]))

In [32]:
pprint(list(gru.state_dict().keys()))

['weight_ih_l0', 'weight_hh_l0', 'bias_ih_l0', 'bias_hh_l0']


In [33]:
print("hidden to hidden weights size:", gru.weight_hh_l0.size())
print("hidden to hidden bias  size:", gru.bias_hh_l0.size())

print("input to hidden weight matrix size:", gru.weight_ih_l0.size())
print("input to hidden bias  size:", gru.bias_ih_l0.size())

hidden to hidden weights size: torch.Size([768, 256])
hidden to hidden bias  size: torch.Size([768])
input to hidden weight matrix size: torch.Size([768, 6])
input to hidden bias  size: torch.Size([768])
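
The leading dimension 768 is $3 \times 256$: PyTorch stacks the weights of the three GRU transformations (reset gate, update gate and candidate, which play the roles of $W_r$, $W_u$ and $W_c$ above) along the rows, and it keeps the input part and the hidden part as two separate matrices instead of one matrix acting on $[c^{<t-1>}, x^{<t>}]$. The arithmetic can be sketched in plain Python; the helper `gru_param_shapes` is hypothetical and not part of PyTorch.

```python
def gru_param_shapes(input_size, hidden_size, n_gates=3):
    """Expected parameter shapes of a single-layer GRU (hypothetical helper)."""
    rows = n_gates * hidden_size          # three blocks stacked along the rows
    return {
        "weight_ih_l0": (rows, input_size),
        "weight_hh_l0": (rows, hidden_size),
        "bias_ih_l0": (rows,),
        "bias_hh_l0": (rows,),
    }

def count(shape):
    # Number of scalar entries in a tensor of the given shape
    total = 1
    for dim in shape:
        total *= dim
    return total

shapes = gru_param_shapes(input_size=6, hidden_size=256)
n_params = sum(count(shape) for shape in shapes.values())

for name, shape in shapes.items():
    print(name, shape)
print("total parameters:", n_params)
```

Calling the helper with `n_gates=1` reproduces the shapes printed earlier for `torch.nn.RNN`, which has a single transformation instead of three.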
