# Understanding the Weights in RNNs

## Instructions
0. If you haven't already, follow [the setup instructions here](https://jennselby.github.io/MachineLearningCourseNotes/#setting-up-python3) to get all necessary software installed.
0. Look at the code in [Part A: Single Unit Simple Recurrent Layer](#Part-A:-Single-Unit-Simple-Recurrent-Layer) and complete the [Part A Exercise](#Part-A-Exercise)
0. Look at the code in [Part B: Two Unit Simple Recurrent Layer](#Part-B:-Two-Unit-Simple-Recurrent-Layer) and complete the [Part B Exercise](#Part-B-Exercise)
0. Optionally, look at the code in [Part C: LSTM Layer](#Part-C:-LSTM-Layer) and complete the [Part C Exercise](#Part-C-Exercise)

## Documentation/Sources
* [Class Notes](https://jennselby.github.io/MachineLearningCourseNotes/#recurrent-neural-networks)
* [https://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/](https://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/) for information on sequence classification with keras
* [https://keras.io/](https://keras.io/) Keras API documentation
* [Keras recurrent tutorial](https://github.com/Vict0rSch/deep_learning/tree/master/keras/recurrent)

## Part A: Single Unit Simple Recurrent Layer

Before we dive into something as complicated as LSTMs, let's take a deeper look at simple recurrent layer weights.

In [1]:
import numpy
from keras.layers import SimpleRNN
from keras.models import Sequential
from keras.layers import LSTM

The neurons in the recurrent layer pass their output to the next layer, but also back to themselves. The input shape `(None, 1)` says that each time step is a single value and that a sequence can have any number of time steps (the `None` is what leaves the length unspecified).

In [2]:
one_unit_SRNN = Sequential()
one_unit_SRNN.add(SimpleRNN(units=1, input_shape=(None, 1), activation='linear', use_bias=False))

In [3]:
one_unit_SRNN_weights = one_unit_SRNN.get_weights()
one_unit_SRNN_weights

[array([[-0.55325055]], dtype=float32), array([[-1.]], dtype=float32)]

We can set the weights to whatever we want, to test out what happens with different weight values.

In [8]:
one_unit_SRNN_weights[0][0][0] = 1
one_unit_SRNN_weights[1][0][0] = 2
one_unit_SRNN.set_weights(one_unit_SRNN_weights)
one_unit_SRNN.get_weights()

[array([[1.]], dtype=float32), array([[2.]], dtype=float32)]

We can then pass in different input values, to see what the model outputs.

The code below passes in a single sample that has three time steps.

In [9]:
one_unit_SRNN.predict(numpy.array([ [[3], [3], [7]] ]))

array([[25.]], dtype=float32)

## Part A Exercise
Figure out what the two weights in the one_unit_SRNN model control. Be sure to test your hypothesis thoroughly. Use different weights and different inputs.

Let the weight at [0,0,0] be $w_0$ and the weight at [1,0,0] be $w_1$. Values in parentheses in the diagrams below show each term after it has been multiplied by its weight, before the terms are summed.

$w_0$ controls the effect of the *input* on the current timestep's recurrent layer. For example, suppose $w_0=2$ (and $w_1=1$). Then, the network computes the output value as follows:

```
      (6)
[3] — w_0 → 6
            |
           w_1 (6)
      (6)   ↓
[3] — w_0 → 12
            |
           w_1 (12)
     (14)   ↓
[7] — w_0 → 26 → [26]
```

$w_1$ controls the effect of the *recurrent layer's value from the previous timestep* on the current timestep. For example, suppose $w_0=1$ (and $w_1=2$). Then, the network computes the output value as follows:

```
      (3)
[3] — w_0 → 3
            |
           w_1 (6)
      (3)   ↓
[3] — w_0 → 9
            |
           w_1 (18)
      (7)   ↓
[7] — w_0 → 25 → [25]
```
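
Both diagrams follow the same recurrence, $r(t) = w_0\,x(t) + w_1\,r(t-1)$ with $r(-1) = 0$. As a sanity check, here is a minimal plain-Python sketch of that recurrence. It does not use Keras; the function name `one_unit_srnn` is made up for this illustration, and it assumes the linear activation, missing bias, and zero initial state of the model above.

```python
def one_unit_srnn(inputs, w0, w1):
    """Run a one-unit linear recurrent layer (no bias) over a sequence.

    Hypothetical helper for illustration; it is not part of Keras.
    """
    state = 0.0  # assumed initial state of zero
    for x in inputs:
        # new state = weighted input + weighted previous state
        state = w0 * x + w1 * state
    return state


print(one_unit_srnn([3, 3, 7], w0=2, w1=1))  # weights from the first diagram
print(one_unit_srnn([3, 3, 7], w0=1, w1=2))  # weights from the second diagram
```

The two calls should reproduce the final values 26 and 25 from the diagrams, the second of which matches the `predict` output above.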

## Part B: Two Unit Simple Recurrent Layer

In [10]:
two_unit_SRNN = Sequential()
two_unit_SRNN.add(SimpleRNN(units=2, input_shape=(None, 1), activation='linear', use_bias=False))

In [11]:
two_unit_SRNN_weights = two_unit_SRNN.get_weights()
two_unit_SRNN_weights

[array([[-0.5520738 ,  0.15473759]], dtype=float32),
 array([[ 0.98677623,  0.1620891 ],
        [ 0.1620891 , -0.9867761 ]], dtype=float32)]

In [12]:
two_unit_SRNN_weights[0][0][0] = 1
two_unit_SRNN_weights[0][0][1] = 1
two_unit_SRNN_weights[1][0][0] = 0
two_unit_SRNN_weights[1][0][1] = 1
two_unit_SRNN_weights[1][1][0] = 0
two_unit_SRNN_weights[1][1][1] = 1
two_unit_SRNN.set_weights(two_unit_SRNN_weights)
two_unit_SRNN.get_weights()

[array([[1., 1.]], dtype=float32),
 array([[0., 1.],
        [0., 1.]], dtype=float32)]

This passes in a single sample with four time steps.

In [14]:
two_unit_SRNN.predict(numpy.array([ [[3], [3], [7], [5]] ]))

array([[ 5., 31.]], dtype=float32)

## Part B Exercise
What does each of the six weights of the two_unit_SRNN control? Again, test out your hypotheses carefully.

The neural network above can be described as:

$$
\begin{pmatrix}
    r_0(t) \\
    r_1(t)
\end{pmatrix} = \begin{pmatrix}
    w_{1,0,0} & w_{1,1,0} \\
    w_{1,0,1} & w_{1,1,1} 
\end{pmatrix}\begin{pmatrix}
    r_0(t-1) \\
    r_1(t-1)
\end{pmatrix} + \begin{pmatrix}
    w_{0,0,0} \\
    w_{0,0,1}
\end{pmatrix}x(t)
$$

where $r_i(t)$ represents the value of the $i$-th node of the recurrent layer at time $t$, and $x(t)$ represents the input to the network at time $t$. The recurrent state starts at zero, so applied to the input data this yields (for $t=0,1,2,3$):

$$
\begin{pmatrix}
    r_0(0) \\
    r_1(0)
\end{pmatrix} = \begin{pmatrix}
    0 & 0 \\
    1 & 1
\end{pmatrix}\begin{pmatrix}
    0 \\
    0
\end{pmatrix} + \begin{pmatrix}
    1 \\
    1
\end{pmatrix}\cdot 3 = \begin{pmatrix}
    3 \\
    3
\end{pmatrix}
$$

$$
\begin{pmatrix}
    r_0(1) \\
    r_1(1)
\end{pmatrix} = \begin{pmatrix}
    0 & 0 \\
    1 & 1
\end{pmatrix}\begin{pmatrix}
    3 \\
    3
\end{pmatrix} + \begin{pmatrix}
    1 \\
    1
\end{pmatrix}\cdot 3 = \begin{pmatrix}
    3 \\
    9
\end{pmatrix}
$$

$$
\begin{pmatrix}
    r_0(2) \\
    r_1(2)
\end{pmatrix} = \begin{pmatrix}
    0 & 0 \\
    1 & 1
\end{pmatrix}\begin{pmatrix}
    3 \\
    9
\end{pmatrix} + \begin{pmatrix}
    1 \\
    1
\end{pmatrix}\cdot 7 = \begin{pmatrix}
    7 \\
    19
\end{pmatrix}
$$

$$
\begin{pmatrix}
    r_0(3) \\
    r_1(3)
\end{pmatrix} = \begin{pmatrix}
    0 & 0 \\
    1 & 1
\end{pmatrix}\begin{pmatrix}
    7 \\
    19
\end{pmatrix} + \begin{pmatrix}
    1 \\
    1
\end{pmatrix}\cdot 5 = \begin{pmatrix}
    5 \\
    31
\end{pmatrix}
$$
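
The four steps above can also be checked with a short plain-Python sketch of the same recurrence. It does not use Keras; the function name `two_unit_srnn` and its argument names are made up for this illustration, and it assumes a linear activation, no bias, and a zero initial state. The `recurrent[i][j]` entry is read as the weight from unit $i$ at the previous step to unit $j$ at the current step, which is why the matrix in the equations is the transpose of the array returned by `get_weights()`.

```python
def two_unit_srnn(inputs, kernel, recurrent):
    """Run a two-unit linear recurrent layer (no bias) and return every state.

    kernel[0][j] plays the role of w_{0,0,j}; recurrent[i][j] plays the
    role of w_{1,i,j}. Hypothetical helper for illustration; not part of Keras.
    """
    state = [0.0, 0.0]  # assumed zero initial state
    history = []
    for x in inputs:
        # Each unit j sums its weighted input and the weighted previous states.
        state = [
            kernel[0][j] * x + sum(recurrent[i][j] * state[i] for i in range(2))
            for j in range(2)
        ]
        history.append(state)
    return history


states = two_unit_srnn([3, 3, 7, 5], kernel=[[1, 1]], recurrent=[[0, 1], [0, 1]])
for t, (r0, r1) in enumerate(states):
    print(t, r0, r1)
```

The last pair printed should agree with the `predict` output of 5 and 31.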




## Part C: LSTM Layer
### Optional

In [27]:
one_unit_LSTM = Sequential()
one_unit_LSTM.add(LSTM(units=1, input_shape=(None, 1),
                       activation='linear', recurrent_activation='linear',
                       use_bias=False, unit_forget_bias=False,
                       kernel_initializer='zeros',
                       recurrent_initializer='zeros',
                       return_sequences=True))

In [28]:
one_unit_LSTM_weights = one_unit_LSTM.get_weights()
one_unit_LSTM_weights

[array([[0., 0., 0., 0.]], dtype=float32),
 array([[0., 0., 0., 0.]], dtype=float32)]

In [32]:
one_unit_LSTM_weights[0][0][0] = 1
one_unit_LSTM_weights[0][0][1] = 0
one_unit_LSTM_weights[0][0][2] = 1
one_unit_LSTM_weights[0][0][3] = 1
one_unit_LSTM_weights[1][0][0] = 0
one_unit_LSTM_weights[1][0][1] = 0
one_unit_LSTM_weights[1][0][2] = 0
one_unit_LSTM_weights[1][0][3] = 0
one_unit_LSTM.set_weights(one_unit_LSTM_weights)
one_unit_LSTM.get_weights()

[array([[1., 0., 1., 1.]], dtype=float32),
 array([[0., 0., 0., 0.]], dtype=float32)]

In [33]:
one_unit_LSTM.predict(numpy.array([ [[0], [1], [2], [4]] ]))

array([[[ 0.],
        [ 1.],
        [ 8.],
        [64.]]], dtype=float32)

## Part C Exercise
### Optional
Conceptually, the [LSTM](http://colah.github.io/posts/2015-08-Understanding-LSTMs/) has several _gates_:

* __Forget gate__: these weights allow some long-term memories to be forgotten.
* __Input gate__: these weights decide what new information will be added to the context cell.
* __Output gate__: these weights decide what pieces of the new information and updated context will be passed on to the output.

It also has a __cell__ that can hold onto information from the current input (as well as things it has remembered from previous inputs), so that it can be used in later outputs.
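
In equation form, one common way to write these pieces is sketched below. The bias terms are dropped, as in the model above, and the symbols $W$, $U$, $\tilde{c}_t$, $\sigma$, and $g$ are notation chosen for this sketch rather than names from the Keras API.

$$
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1}) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1}) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1}) \\
\tilde{c}_t &= g(W_c x_t + U_c h_{t-1}) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot g(c_t)
\end{aligned}
$$

Here $\sigma$ is the gate activation (`recurrent_activation` in the code above) and $g$ is the main activation (`activation`); both are set to linear in one_unit_LSTM.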

Identify which weights in the one_unit_LSTM model are connected with the cell and which are associated with the three gates. This is considerably harder to work out from the inputs and outputs alone, so you could also treat it as a code-reading exercise and look through the Keras source code to find the answer.

_Note_: The output from the predict call is what the linked explanation calls $h_{t}$.

Weights of the form $w_{0,i,j}$ connect the input layer to the LSTM layer, while weights of the form $w_{1,i,j}$ connect the LSTM layer to itself.

Weights of the form $w_{i,j,0}$ connect to the *input gate*.

Weights of the form $w_{i,j,1}$ connect to the *forget gate*.

Weights of the form $w_{i,j,2}$ connect to the *candidate cell value*, the new information that the input gate scales before it is added to the cell state.

Weights of the form $w_{i,j,3}$ connect to the *output gate*.
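
To test this assignment against the `predict` output above, here is a minimal plain-Python sketch of a one-unit LSTM step under the same settings (linear activations, no bias, zero initial state). It does not use Keras; the function name `one_unit_lstm` and the flat four-element weight lists are assumptions made for this illustration, with the weights listed in the order input gate, forget gate, candidate cell value, output gate.

```python
def one_unit_lstm(inputs, kernel, recurrent):
    """One-unit LSTM with linear activations and no bias; returns h_t per step.

    kernel and recurrent are 4-element lists ordered as
    [input gate, forget gate, candidate cell value, output gate].
    Hypothetical helper for illustration; it is not part of Keras.
    """
    h, c = 0.0, 0.0  # assumed zero initial output and cell state
    outputs = []
    for x in inputs:
        # With linear activations, each gate is just a weighted sum.
        i, f, g, o = (kernel[k] * x + recurrent[k] * h for k in range(4))
        c = f * c + i * g  # keep part of the old cell state, add the gated candidate
        h = o * c          # output gate scales the (linearly activated) cell state
        outputs.append(h)
    return outputs


print(one_unit_lstm([0, 1, 2, 4], kernel=[1, 0, 1, 1], recurrent=[0, 0, 0, 0]))
```

With the forget and recurrent weights at zero, each output reduces to the cube of the current input (input gate times candidate times output gate), which is consistent with the 0, 1, 8, 64 sequence returned by `predict`.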