# Assignment 07: Building Recurrent NN Layers by Hand
---

**Due Date:** Tuesday 06/25/2025 (by midnight)

**Please fill these in before submitting, just in case I accidentally mix up file names while grading**:

Name: Jane Hacker

CWID-5: (Last 5 digits of cwid)

# Introduction 

Welcome to our first assignment on Text/Sequence deep learning systems.  In this assignment
you will implement by hand the key components of recurrent neural network layers in NumPy.

Recurrent Neural Networks (RNN) are very effective for Natural Language Processing and other sequence tasks because they have "memory". They can read inputs $x^{\langle t \rangle}$ (such as words) one at a time, and remember some information/context through the hidden layer activations that get passed from one time-step to the next. This allows a unidirectional RNN to take information from the past to process later inputs. A bidirectional RNN can take context from both the past and the future. 

**Notation**:

In this notebook we will use the following notation:

- A superscript in angle brackets, like $x^{\langle t \rangle}$, denotes a value at the $t^{th}$ time step.

**Instructions:**

- As with the previous assignment, you will need to create the function declarations asked for
  in `src/assg_tasks.py`.  Make sure you use
  [Python Docstrings](https://www.geeksforgeeks.org/python-docstrings/) and are generally
  following [Pep8 Python Style Guide](https://peps.python.org/pep-0008/) for your code.
- Cells with a `### TESTED` comment contain unit tests that are run on your implementation.  You will
  need to uncomment the call to the unit tests, but otherwise these cells need to stay as given in the original
  notebook.
- Likewise, since you need to write the declarations of the functions asked for in the tasks, don't forget
  to uncomment/add the appropriate `from assg_tasks import X` statements in both this notebook and
  in `../src/test_assg_tasks.py`.

**In this assignment, you will:**

- Implement the basic building blocks of a recurrent NN layer implementation.
- Learn more about how an LSTM modifies the basic recurrent cell, adding a residual-style connection
  to avoid vanishing gradient issues.
- Learn in detail the operations of recurrent layer cells and how they work.
- See some examples of how recurrent layer operations can be unrolled in order to calculate
  gradients over the tensor operations performed by them.


# Packages

The following imports should be all of the packages that you will need for this assignment.
This assignment is implemented directly in NumPy, so unlike assignments that use the Keras API,
the `tensorflow` and `keras` modules are not imported in this notebook.

In [1]:
# assignment wide imports go here, usually all of your imports for notebooks should
# be put up at the top here, if they were not given to you at the start of the assignment
import numpy as np
import matplotlib.pyplot as plt

In [2]:
# The following ipython magic will reload changed file/modules.
# So when editing function in source code modules, you should
# be able to just rerun the cell, not restart the whole kernel.
%load_ext autoreload
%autoreload 2

The imports of the functions you will write have been commented out here this time.  You will need to uncomment
the imports once you declare and write your functions, both here and in the `src/test_assg_tasks.py` file, to
run the unit tests on your work.

In [3]:
# Import functions/modules from this project.  We manually set the
# PYTHONPATH to append the location to search for this assignment's
# functions to just ensure the imports are found
import sys
sys.path.append("../src")

# assignment function imports for doctests and github autograding
# these are required for assignment autograding
from assg_utils import run_unittests, run_doctests
from assg_tasks import softmax
from assg_tasks import sigmoid
#from assg_tasks import rnn_cell_forward
#from assg_tasks import rnn_forward
#from assg_tasks import lstm_cell_forward
#from assg_tasks import lstm_forward

# Task 1: Forward Propagation of a basic Recurrent NN Layer

The following figure is modified from our text showing a recurrent layer
cell unrolled.  I changed notation a bit from our text, to be more
precise in the following discussions.

![Basic RNN model](../figures/ch10-7-simple-rnn-unrolled.png)

**Figure 1**: Basic RNN model unrolled in time.  Shows conceptually the inputs and outputs
that occur at timestep $t$.

Also in this assignment, we will be getting a detailed look at what the RNN
cell computes.  We will use $x^{\langle t \rangle}$ to denote the input
from a sequence (like a token) at time $t$.

If you look closely and compare to our class materials, 
you will see that we are computing two outputs from an RNN cell computation
at time $t$:

1. What we will call the activation at time $t$: $a^{\langle t \rangle}$.
2. And what we will call the prediction at time  $t$:
   $\hat{y}^{\langle t \rangle}$

## Task 1.1: Implement `rnn_cell_forward` Call

The following computations of a single RNN cycle / cell have a few extra details not mentioned
in our class materials.  They are a bit closer to the original description of the RNN layer,
where two outputs, the activation (used as the RNN state) and a prediction (e.g. the actual predicted
output target at time $t$), are produced by `rnn_cell_forward()`.

A recurrent neural network can be seen as the repeated use of a single cell or computation.  You are
going to first implement the computations for a single time-step.  The following figure describes in
more detail the tensor operations for a single time-step of an RNN cell.

![Basic RNN cell](../figures/rnn-cell.png)

**Figure 2**: Basic RNN cell / computations at time $t$.  The cell takes an input $x^{\langle t \rangle}$
(current input) and $a^{\langle t-1 \rangle}$ (previous hidden state activation containing information
from the past), and outputs $a^{\langle t \rangle}$ (hidden state activation for current timestep),
which is given to the next RNN cell computation and used to make the prediction $\hat{y}^{\langle t \rangle}$
for this timestep.

**Task**: Implement the `rnn_cell_forward()` method.  

The size of the inputs, the internal state activations, and the outputs can all differ.  For
example, in our weather prediction model there were 14 weather measurements that were recorded
at each 10 minute interval. Call the size of the input features `n_x`.

Likewise we can specify the number of outputs a recurrent layer should create and learn.
This doesn't have to be equal to the number of input features.  Call this `n_y`.

But also, in the original specification of the RNN layer, the size of the internal state
activations didn't necessarily have to be either `n_x` or `n_y`.  Call the size of the
internal state activations `n_a`.

This method will be given the following input parameters:

- The current input  $x^{\langle t \rangle}$.  This will be a vector whose length equals the number of
  features in a sample, for example in our weather prediction example there were
  14 weather measurements each time step, so `n_x = 14` and the input would be
  vectors of shape `(14,)`
- The previous activation state $a^{\langle t-1 \rangle}$, call this parameter something
  like `a_prev`. This will be a vector of shape `(n_a, )`
- There are three weight matrices and two bias vectors:
  - $W_a$ will be multiplied by the inputs, so it has a shape `(n_x, n_a)`
  - $U_a$ is the weight matrix multiplied by the activations, so it has a shape `(n_a, n_a)`
  - $b_a$ will be a vector of shape `(n_a,)`
  - $W_y$ is multiplied by the state activations, so its shape is `(n_a, n_y)`
  - $b_y$ will be a vector of shape `(n_y,)`
  - All of these parameters are passed in as a tuple for the third parameter in this order
    `(W_a, U_a, b_a, W_y, b_y)`

And as you may be able to guess, the circles with `x` represent matrix multiplications (dot products),
and the circles with `+` are additions of the biases.

So your forward cell will be calculating the following functions:

\begin{equation}
\begin{aligned}
a^{\langle t \rangle} &= \tanh(x^{\langle t \rangle} W_a + a^{\langle t-1 \rangle} U_a + b_a) \\
\hat{y}^{\langle t \rangle} &= \text{softmax}(a^{\langle t \rangle} W_y + b_y) \\
\end{aligned}
\end{equation}

The cell calculations should be vectorized.  In fact the input $x^{\langle t \rangle}$ is not just a
simple vector, since we will be passing in batches.  So for example let's say the batch size is 64 (call this `batch_size`), and
the number of features in the input is `n_x = 14`.  Then the actual shape of $x^{\langle t \rangle}$
will be `(batch_size, n_x)` = `(64, 14)`.  That means that the actual activations $a^{\langle t \rangle}$ are of shape `(batch_size, n_a)`,
and the predicted outputs $\hat{y}^{\langle t \rangle}$ will be of shape `(batch_size, n_y)`.

Your function should take the described 3 inputs and output the calculated $a^{\langle t \rangle}$
and $\hat{y}^{\langle t \rangle}$ for the given batch of inputs.
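The shape bookkeeping described above can be checked with a few lines of NumPy before you write the real function. This is only a sketch of the shapes, not the graded implementation: the sizes `n_a = 5` and `n_y = 2` are arbitrary assumptions, the random generator is our own choice, and the softmax step is left out so that only the logits are shown.

```python
import numpy as np

# sizes from the text (batch of 64, 14 features); n_a and n_y are assumed here
batch_size, n_x, n_a, n_y = 64, 14, 5, 2
rng = np.random.default_rng(0)

x_t = rng.standard_normal((batch_size, n_x))
a_prev = rng.standard_normal((batch_size, n_a))
W_a = rng.standard_normal((n_x, n_a))
U_a = rng.standard_normal((n_a, n_a))
b_a = rng.standard_normal(n_a)
W_y = rng.standard_normal((n_a, n_y))
b_y = rng.standard_normal(n_y)

# (64, 14) @ (14, 5) -> (64, 5) and (64, 5) @ (5, 5) -> (64, 5)
# b_a has shape (5,) and is broadcast across the 64 rows of the batch
z_a = x_t @ W_a + a_prev @ U_a + b_a
a_t = np.tanh(z_a)

# logits before the softmax: (64, 5) @ (5, 2) + (2,) -> (64, 2)
z_y = a_t @ W_y + b_y

print(z_a.shape, a_t.shape, z_y.shape)  # (64, 5) (64, 5) (64, 2)
```

Because the batch is the leading dimension, the same expression works unchanged for any batch size.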

In [4]:
### TESTED function rnn_cell_forward()
# uncomment when ready to run the unit tests for function
#run_unittests(['test_rnn_cell_forward'])

# a test to check performing cell tensor operations as expected
# example where we have a batch size of 10, number of input features is 3, number of
# output features is 2, and the internal state carries 5 activations
batch_size = 10
n_x = 3
n_a = 5
n_y = 2

# create some random matrices with a set seed so we always get same output result
np.random.seed(1)
x_t = np.random.randn(batch_size, n_x)
a_prev = np.random.randn(batch_size, n_a)

W_a = np.random.randn(n_x, n_a)
U_a = np.random.randn(n_a, n_a)
b_a = np.random.randn(n_a)
W_y = np.random.randn(n_a, n_y)
b_y = np.random.randn(n_y)

print('x_t: ', x_t.shape)
print('a_prev: ', a_prev.shape)
print('W_a: ', W_a.shape)
print('U_a: ', U_a.shape)
print('b_a: ', b_a.shape)
print('W_y: ', W_y.shape)
print('b_y: ', b_y.shape)

a_t = np.zeros((2,2))
y_t = np.zeros((2,2))
#a_t, y_t = rnn_cell_forward(x_t, a_prev, (W_a, U_a, b_a, W_y, b_y))

print('a_t: ', a_t.shape)
print('y_t: ', y_t.shape)

# samples at indices 3 and 4 of a and y
print(a_t[3:5])
print(y_t[3:5])

x_t:  (10, 3)
a_prev:  (10, 5)
W_a:  (3, 5)
U_a:  (5, 5)
b_a:  (5,)
W_y:  (5, 2)
b_y:  (2,)
a_t:  (2, 2)
y_t:  (2, 2)
[]
[]


**Expected Results** Given the random seed, you should get the following results if your calculations in `rnn_cell_forward()` are performed correctly.

```
x_t:  (10, 3)
a_prev:  (10, 5)
W_a:  (3, 5)
U_a:  (5, 5)
b_a:  (5,)
W_y:  (5, 2)
b_y:  (2,)
a_t:  (10, 5)
y_t:  (10, 2)
[[-0.99523806 -0.06488166 -0.28849778  0.99917567 -0.88942078]
 [ 0.97054882 -0.93853384  0.92825359  0.77745675 -0.90964872]]
[[0.42204273 0.57795727]
 [0.21579613 0.78420387]]
```

## Task 1.2: RNN Forward Pass

A recurrent neural network (RNN) layer is a repetition of the RNN cell forward unit 
you've just built.  If your input sequence data is using a sequence length of 10,
then you will need to use a loop that iterates `sequence_length` times, calling
your `rnn_cell_forward()` function to calculate the outputs and activations states for each
input of the sequence.


**Task**: Code the forward propagation of a full RNN layer sequence as a function named `rnn_forward()`.
Some definitions of values we use in the description here.  

- `batch_size`: The size of a batch to process
- `sequence_length`: The number of time steps in the sequence/time series input data
- `n_x`, `n_a`, `n_y`: As before, the number of features of the input, the number of activation state values,
  and the number of features output for each item in the sequence.

Your function will take the following parameters:

- `x`: A batch of inputs.  The inputs for the whole RNN layer are of shape `(batch_size, sequence_length, n_x)`.
  That is to say, to compute the forward pass, the layer is given a number of samples as a batch.  All samples
  in the batch are sequences of length `sequence_length`.  For example, for our weather forecasting timeseries
  task, we had 5 days of samples taken once an hour, so the sequence length there was $5 \times 24 = 120$.
  And at a single measurement time there are `n_x` features that were measured and are given as input, for
  example there were 14 weather measurements in our weather prediction task.
- `a_init`: The initial value of the state activations before iterating.  Could be 0, but sometimes we need to
  set the initial state (e.g. remember encoder/decoder architecture).  Recall the activations are of
  shape `(batch_size, n_a)`.
- `parameters`: All of the same parameters are again passed into this function, to be passed along
   and used by your `rnn_cell_forward()`.  They are passed in as a tuple for the third parameter in this order
    `(W_a, U_a, b_a, W_y, b_y)`
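As a small illustration of the parameters described above, the following sketch unpacks the tuple and recovers the sizes needed to allocate the two output arrays. Reading `n_a` and `n_y` from the shape of `W_y` is just one of several equivalent choices, and the zero-filled placeholder parameters are assumptions made only so the snippet runs on its own.

```python
import numpy as np

batch_size, sequence_length, n_x, n_a, n_y = 10, 8, 3, 5, 2
x = np.zeros((batch_size, sequence_length, n_x))

# placeholder parameters in the documented order (values do not matter here)
parameters = (
    np.zeros((n_x, n_a)),  # W_a
    np.zeros((n_a, n_a)),  # U_a
    np.zeros(n_a),         # b_a
    np.zeros((n_a, n_y)),  # W_y
    np.zeros(n_y),         # b_y
)

# unpack the tuple and read the sizes back from the array shapes
W_a, U_a, b_a, W_y, b_y = parameters
batch, seq_len, _ = x.shape
n_a_found, n_y_found = W_y.shape

# histories of activations and predictions, to be filled one time step at a time
a = np.zeros((batch, seq_len, n_a_found))
y = np.zeros((batch, seq_len, n_y_found))

print(a.shape, y.shape)  # (10, 8, 5) (10, 8, 2)
```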

This function will need to iterate over all `sequence_length` sequence items, pulling out the batch of items
for time `t`.  This function returns the activations and output predictions.  However these will now both
be 3D tensors.  The predicted outputs $\hat{y}$ will be of shape `(batch_size, sequence_length, n_y)`
and the hidden activation states will be of shape `(batch_size, sequence_length, n_a)`.

This function will be some more practice with manipulating numpy arrays, but hopefully you have the idea here.
The pseudo-code for this function is:

1. Extract the `batch_size` (dimension 0) and `sequence_length` (dimension 1) from the shape of your inputs.
   You also need `n_a` and `n_y`, which are available in multiple ways from the shapes of your `parameters`.
2. Initialize an activations and y predictions array of the needed shape.  You can for example initialize these
   to 0.
3. Initialize your current activations to the initial activations you were given `a_init`.
4. Iterate t = `0` .. `sequence_length - 1`
   - Extract the inputs for time `t`.  You need all samples in the batch and all input features for time `t`.  Recall that your
     `rnn_cell_forward()` takes 2D tensors as input.
   - Call your `rnn_cell_forward()` with the sliced-out inputs, the current state activation, and the parameters.
   - Your function returns the new calculated activations and y prediction outputs.  Save those in your 3D arrays
     you will return of all the activations and predictions
   - Make sure that you have updated the activations so the new ones are passed into the cell forward call
     in next iteration.

If you perform the iteration correctly, you should have the history of activations and predicted outputs
as 3D tensors, that should be returned from this function.
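The slicing and bookkeeping in the steps above can be practiced on a toy problem first. The sketch below uses a hypothetical `toy_cell()` that simply accumulates the sum of the input features, so it is not an RNN cell; it only shows how to pull out the 2D slice for time `t`, carry a state from one iteration to the next, and store each result in a 3D history array.

```python
import numpy as np

def toy_cell(x_t, s_prev):
    """Hypothetical stand-in for a cell: add the feature sum of x_t to the state."""
    return s_prev + x_t.sum(axis=1, keepdims=True)

batch_size, sequence_length, n_x = 4, 6, 3
x = np.ones((batch_size, sequence_length, n_x))

s = np.zeros((batch_size, sequence_length, 1))  # history of all states
s_t = np.zeros((batch_size, 1))                 # current state, starts at 0

for t in range(sequence_length):
    x_t = x[:, t, :]         # all samples, time step t, all features: shape (4, 3)
    s_t = toy_cell(x_t, s_t) # new state replaces the old one for the next iteration
    s[:, t, :] = s_t         # save this time step in the history

print(x[:, 0, :].shape)     # (4, 3)
print(s[0, :, 0].tolist())  # [3.0, 6.0, 9.0, 12.0, 15.0, 18.0]
```

The same pattern applies when the cell returns more than one value: each returned array gets its own history and its own slice assignment.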

In [5]:
### TESTED function rnn_forward()
# uncomment when ready to run the unit tests for function
#run_unittests(['test_rnn_forward'])

# a test to check performing cell tensor operations as expected
# example where we have a batch size of 10, number of input features is 3, number of
# output features is 2, and the internal state carries 5 activations
batch_size = 10
sequence_length = 8
n_x = 3
n_a = 5
n_y = 2

# create some random matrices with a set seed so we always get same output result
np.random.seed(1)
x = np.random.randn(batch_size, sequence_length, n_x)
a_init = np.random.randn(batch_size, n_a)

W_a = np.random.randn(n_x, n_a)
U_a = np.random.randn(n_a, n_a)
b_a = np.random.randn(n_a)
W_y = np.random.randn(n_a, n_y)
b_y = np.random.randn(n_y)

print('x: ', x.shape)
print('a_init: ', a_init.shape)
print('W_a: ', W_a.shape)
print('U_a: ', U_a.shape)
print('b_a: ', b_a.shape)
print('W_y: ', W_y.shape)
print('b_y: ', b_y.shape)

a = np.zeros((10,2))
y = np.zeros((10,2))
#a, y = rnn_forward(x, a_init, (W_a, U_a, b_a, W_y, b_y))

print('a: ', a.shape)
print('y: ', y.shape)

# the sample at index 6 in the batch of each
print(a[6])
print(y[6])

x:  (10, 8, 3)
a_init:  (10, 5)
W_a:  (3, 5)
U_a:  (5, 5)
b_a:  (5,)
W_y:  (5, 2)
b_y:  (2,)
a:  (10, 2)
y:  (10, 2)
[0. 0.]
[0. 0.]


**Expected Results**: For the given seed, we expect you to get the following output from your `rnn_forward()`
function if everything is implemented correctly:

```
x:  (10, 8, 3)
a_init:  (10, 5)
W_a:  (3, 5)
U_a:  (5, 5)
b_a:  (5,)
W_y:  (5, 2)
b_y:  (2,)
a:  (10, 8, 5)
y:  (10, 8, 2)
[[ 0.99678874  0.95856544  0.03469727 -0.97763179  0.99976266]
 [ 0.99996493  0.52446059  0.44327442 -0.67532026  0.99952257]
 [ 0.99999831 -0.80330671 -0.99409029  0.94567413  0.80104963]
 [ 0.22600828 -0.70257738 -0.02746938 -0.77244757  0.71340253]
 [ 0.70489704  0.73414165 -0.9316695   0.8268125  -0.46866482]
 [-0.5500474   0.93933467 -0.70401339  0.96963972  0.99945562]
 [ 0.95698489 -0.88623095  0.93160166 -0.95792092  0.99999674]
 [ 0.99534743  0.92742305 -0.90651793 -0.05962163 -0.90932672]]
[[0.69506766 0.30493234]
 [0.6006428  0.3993572 ]
 [0.99191778 0.00808222]
 [0.29487726 0.70512274]
 [0.98783339 0.01216661]
 [0.96180779 0.03819221]
 [0.14607254 0.85392746]
 [0.95925418 0.04074582]]
```

Congratulations! You've successfully built the forward propagation of a recurrent neural network
from scratch.

**Situations when this RNN will perform better**:

- This will work well enough for some applications, but it suffers from the vanishing gradient problem.
- The RNN works best when each output $\hat{y}^{\langle t \rangle}$ can be estimated using "local" context.  
- "Local" context refers to information that is close to the prediction's time step $t$.
- More formally, local context refers to inputs $x^{\langle t' \rangle}$ and predictions $\hat{y}^{\langle t \rangle}$ where $t'$ is close to $t$.

In the next part, you will build a more complex LSTM model, which is better at addressing vanishing gradients. The LSTM will be better able to remember a piece of information and keep it saved for many timesteps. 
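To get a feel for why gradients vanish in the simple RNN, consider a deliberately simplified scalar recurrence $a^{\langle t \rangle} = \tanh(u \, a^{\langle t-1 \rangle})$ with no inputs. The recurrent weight `u = 0.9`, the starting activation and the number of steps below are all hypothetical values chosen for illustration. By the chain rule, the sensitivity of the final activation to the initial one is a product of one factor per time step, and each factor here is at most `u`.

```python
import numpy as np

u = 0.9     # hypothetical recurrent weight
a = 0.5     # hypothetical initial activation a_0
T = 50      # number of time steps to unroll
grad = 1.0  # accumulates d a_T / d a_0 via the chain rule

for t in range(T):
    a = np.tanh(u * a)
    # d tanh(u * a_prev) / d a_prev = u * (1 - tanh(u * a_prev) ** 2)
    grad *= u * (1.0 - a ** 2)

print(grad)      # far smaller than 1
print(u ** T)    # upper bound on the product of the per-step factors
```

In this toy setting the influence of the first time step on the last shrinks geometrically with the sequence length, which is the behaviour the LSTM memory state is designed to counteract.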

# Task 2: Long Short Term Memory (LSTM) network

## Task 2.1: LSTM cell forward
The LSTM cell is a bit more complicated.  The following shows the operation of an LSTM-cell:

![LSTM Cell](../figures/lstm-cell.png)

**Figure 3**: LSTM-cell. This tracks and updates a "memory cell state" or memory variable $m^{\langle t \rangle}$ at every time-step, which can be different from $a^{\langle t \rangle}$.
The 4 gates shown, forget, update, memory and output, all involve a basic tensor multiplication and addition of a bias.  Their function is to update
the memory cell state in different ways (in theory).

This task will be similar to the previous one: you will first implement a function to perform
the tensor operations for a single timestep, then a function with an iterative loop that calls
the first function the needed number of times to calculate the forward pass.

**Task**: Implement the `lstm_cell_forward()` method.  

Like before, the size of the inputs, internal state activations and outputs can all differ.
For the LSTM layer there is an internal memory channel in addition to the internal state activations.
We will call the inputs, activations, memory and y predictions $x, a, m, y$ respectively. The
internal memory channel must have the same number of units as the state activations, so `n_m = n_a`
but otherwise `n_x` and `n_y` can be different from `n_m, n_a`.

This method will be given the following input parameters:

- The current input  $x^{\langle t \rangle}$.  This will be a vector whose length equals the number of
  features in a sample, for example in our weather prediction example there were
  14 weather measurements each time step, so `n_x = 14` and the input would be
  vectors of shape `(14,)`
- The previous activation state $a^{\langle t-1 \rangle}$, call this parameter something
  like `a_prev`. This will be a vector of shape `(n_a, )`
- The previous memory cell state $m^{\langle t-1 \rangle}$, call this parameter something
  like `m_prev`.  The number of units must be the same as the state activations so `n_m = n_a`.
- There are forget, update, memory, output and y predictions operations in the LSTM cell, all need
  a weight matrix and bias vector for their calculations:
  - $W_f, b_f$ are the tensors for the forget gate with shape `(n_a + n_x, n_a)` and `(n_a,)`
  - $W_u, b_u$ are the tensors for the update gate with shape `(n_a + n_x, n_a)` and `(n_a,)`
  - $W_m, b_m$ are the tensors for the memory gate with shape `(n_a + n_x, n_a)` and `(n_a,)`
  - $W_o, b_o$ are the tensors for the output (activations) with shape `(n_a + n_x, n_a)` and `(n_a,)`
  - $W_y, b_y$ are the tensors for the y predictions with shape `(n_a, n_y)` and `(n_y,)`
  - All of these parameters are passed in as a tuple for the fourth parameter in this order
    `(W_f, b_f, W_u, b_u, W_m, b_m, W_o, b_o, W_y, b_y)`

The operations using the `forget`, `memory` and `update` gates are a bit more complex than before, but
mathematically can be expressed as follows:

\begin{equation}
C = [ a^{\langle t-1 \rangle}, x^{\langle t \rangle} ]
\end{equation}

We will use the concatenation of the previous activations `a_prev` and the current inputs `x_t` in the
following expressions (Hint: np.concatenate()).  For example if the batch size is 10 and the
previous activations are of shape `(10, 5)` and the current inputs are of shape `(10, 3)`,
then concatenating `a_prev` and `x_t` results in a matrix with still 10 rows, but now 8 columns.
Be sure that the previous activations end up in columns 0-4 and the inputs in columns 5-7 when
you concatenate, or you won't get the correct results.
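The column ordering described above can be verified with a tiny example. Here the previous activations are filled with 0s and the inputs with 1s purely as markers, so the position of each block in the concatenated matrix is easy to see.

```python
import numpy as np

a_prev = np.zeros((10, 5))  # previous activations, marked with 0s
x_t = np.ones((10, 3))      # current inputs, marked with 1s

# axis=1 joins along the columns, keeping the 10 rows of the batch
C = np.concatenate((a_prev, x_t), axis=1)

print(C.shape)        # (10, 8)
print(C[0].tolist())  # [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
```

Swapping the order of the two arrays in the tuple would put the inputs in columns 0-2 instead, which would not line up with the rows of the weight matrices.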

Given the concatenation `C` of the previous activations and the current inputs, the `forget`,
`update`, `memory` and `output` gate operations can be defined simply as:

\begin{equation}
\begin{aligned}
\text{forget} &= \text{sigmoid}( C W_f + b_f) \\
\text{update} &= \text{sigmoid}( C W_u + b_u) \\
\text{memory} &= \text{tanh}( C W_m + b_m) \\
\text{output} &= \text{sigmoid}( C W_o + b_o) \\
\end{aligned}
\end{equation}
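All four gates share the same shape pattern, so it can help to check the pre-activation of just one of them before applying any activation function. The sketch below does this for the forget gate only, with randomly generated placeholder tensors (the generator and seed are our own assumptions); applying `sigmoid` or `tanh` element-wise afterwards leaves the shape unchanged.

```python
import numpy as np

batch_size, n_x, n_a = 10, 3, 5
rng = np.random.default_rng(0)

a_prev = rng.standard_normal((batch_size, n_a))
x_t = rng.standard_normal((batch_size, n_x))
W_f = rng.standard_normal((n_a + n_x, n_a))
b_f = rng.standard_normal(n_a)

# concatenated activations and inputs: (10, 5) and (10, 3) -> (10, 8)
C = np.concatenate((a_prev, x_t), axis=1)

# pre-activation of the forget gate: (10, 8) @ (8, 5) + (5,) -> (10, 5)
z_f = C @ W_f + b_f

print(C.shape, z_f.shape)  # (10, 8) (10, 5)
```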

As you can see these are all similar and all use the basic tensor operations of a matrix multiplication and addition you are familiar with.  The `memory` gate uses a `tanh` (see `np.tanh()`) activation function while the rest use the `sigmoid`.  You can find `sigmoid` and `softmax` functions in TensorFlow, but they will return Tensor objects, and we want to stick with straight numpy in this
assignment and not import the `tensorflow` library.  You implemented `sigmoid` and `softmax` before,
so we have given implementations of these functions for you to use in your `assg_tasks.py` file.
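For reference, plain NumPy versions of these two activation functions could look roughly like the following. The names `sigmoid_sketch` and `softmax_sketch` are hypothetical, and the implementations provided in `assg_tasks.py` may differ in detail; in particular, normalizing along the last axis and subtracting the row maximum for numerical stability are assumptions of this sketch.

```python
import numpy as np

def sigmoid_sketch(z):
    """Element-wise logistic sigmoid, mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax_sketch(z):
    """Softmax along the last axis, so each row becomes a probability distribution."""
    # subtracting the row maximum does not change the result but avoids overflow
    shifted = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=-1, keepdims=True)

z = np.array([[0.0, 0.0], [1.0, 3.0]])
print(sigmoid_sketch(0.0))               # 0.5
print(softmax_sketch(z)[0].tolist())     # [0.5, 0.5]
print(np.allclose(softmax_sketch(z).sum(axis=1), 1.0))  # True
```

Both functions return ordinary NumPy values rather than Tensor objects, which is why they fit the NumPy-only approach of this assignment.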

The gate outputs are then used to calculate the resulting `a_t` (activation state at time t),
`m_t` (memory cell state at time t), and `y_t` (y predictions at time t).  These are the 3 results
that are returned from this function `(a_t, m_t, y_t)`.  These three final outputs are calculated
using the gates like this:

\begin{equation}
\begin{aligned}
m_t &= \text{forget} \times m^{\langle t-1 \rangle} + \text{update} \times \text{memory} \\
a_t &= \text{output} \times \tanh(m_t) \\
y_t &= \text{softmax}( a_t W_y + b_y ) \\
\end{aligned}
\end{equation}

The $\times$ here indicates element-wise multiplication rather than matrix multiplication.
The previous two sets of equations represent the full update function of the LSTM recurrent layer.  The
forget, memory, update and output weights are separate sets of tensor matrices that can be learned.
As you can see the design is highly engineered.  The names of the gates give some sense of the
purpose intended for them, though it is questionable if they really work as they were originally
intended.  But the memory state is basically updated using an element-wise multiplication and
addition.  This ends up being a linear operation, so the update to $m^{\langle t \rangle}$ ends
up changing it only modestly and in a linear fashion.  So the memory state acts as a kind of skip
connection (residual connection) in LSTM cell iterations for gradient descent.
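The skip-connection behaviour of the memory update is easiest to see at the two extremes of the gates. In the sketch below the gate values are set by hand rather than computed from weights, and the numbers are arbitrary: with `forget = 1` and `update = 0` the old memory passes through untouched, while with `forget = 0` and `update = 1` it is replaced by the candidate `memory` values.

```python
import numpy as np

m_prev = np.array([[1.0, -2.0, 0.5]])  # previous memory cell state
memory = np.array([[0.3, 0.3, 0.3]])   # candidate values, as a tanh would produce

# keep everything, write nothing: the memory state is carried forward unchanged
forget = np.ones_like(m_prev)
update = np.zeros_like(m_prev)
m_keep = forget * m_prev + update * memory

# forget everything, write everything: the memory state becomes the candidate
forget = np.zeros_like(m_prev)
update = np.ones_like(m_prev)
m_overwrite = forget * m_prev + update * memory

print(m_keep.tolist())       # [[1.0, -2.0, 0.5]]
print(m_overwrite.tolist())  # [[0.3, 0.3, 0.3]]
```

Note that `*` on NumPy arrays is element-wise, matching the $\times$ in the equations, whereas `@` would perform a matrix multiplication.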

In [6]:
### TESTED function lstm_cell_forward()
# uncomment when ready to run the unit tests for function
#run_unittests(['test_lstm_cell_forward'])

# a test to check performing cell tensor operations as expected
# example where we have a batch size of 10, number of input features is 3, number of
# output features is 2, and the internal state carries 5 activations
batch_size = 10
n_x = 3
n_a = 5
n_y = 2

# create some random matrices with a set seed so we always get same output result
np.random.seed(1)
x_t = np.random.randn(batch_size, n_x)
a_prev = np.random.randn(batch_size, n_a)
m_prev = np.random.randn(batch_size, n_a)

W_f = np.random.randn(n_a + n_x, n_a)
b_f = np.random.randn(n_a)
W_u = np.random.randn(n_a + n_x, n_a)
b_u = np.random.randn(n_a)
W_m = np.random.randn(n_a + n_x, n_a)
b_m = np.random.randn(n_a)
W_o = np.random.randn(n_a + n_x, n_a)
b_o = np.random.randn(n_a)
W_y = np.random.randn(n_a, n_y)
b_y = np.random.randn(n_y)

print('x_t: ', x_t.shape)
print('a_prev: ', a_prev.shape)
print('m_prev: ', m_prev.shape)
print('W_f: ', W_f.shape)
print('b_f: ', b_f.shape)
print('W_u: ', W_u.shape)
print('b_u: ', b_u.shape)
print('W_m: ', W_m.shape)
print('b_m: ', b_m.shape)
print('W_o: ', W_o.shape)
print('b_o: ', b_o.shape)
print('W_y: ', W_y.shape)
print('b_y: ', b_y.shape)

a_t = np.zeros((2,2))
m_t = np.zeros((2,2))
y_t = np.zeros((2,2))
#a_t, m_t, y_t = lstm_cell_forward(x_t, a_prev, m_prev, (W_f, b_f, W_u, b_u, W_m, b_m, W_o, b_o, W_y, b_y))

print('a_t: ', a_t.shape)
print('m_t: ', m_t.shape)
print('y_t: ', y_t.shape)

# samples at indices 3 and 4 of a_t, m_t and y_t
print(a_t[3:5])
print(m_t[3:5])
print(y_t[3:5])

x_t:  (10, 3)
a_prev:  (10, 5)
m_prev:  (10, 5)
W_f:  (8, 5)
b_f:  (5,)
W_u:  (8, 5)
b_u:  (5,)
W_m:  (8, 5)
b_m:  (5,)
W_o:  (8, 5)
b_o:  (5,)
W_y:  (5, 2)
b_y:  (2,)
a_t:  (2, 2)
m_t:  (2, 2)
y_t:  (2, 2)
[]
[]
[]


## Task 2.2: Forward pass for LSTM

The hard work of implementing the LSTM forward pass has mostly been done, and your
implementation of the full forward pass for the LSTM layer will be similar to what
you previously did for the simple RNN.

**Task**: Code the forward propagation of a full LSTM layer sequence as a function named `lstm_forward()`.
Some definitions of values we use in the description here.  

- `batch_size`: The size of a batch to process
- `sequence_length`: The number of time steps in the sequence/time series input data
- `n_x`, `n_a`, `n_y`: As before, the number of features of the input, the number of activation state values,
  and the number of features output for each item in the sequence.  The number of features for the
  memory state `n_m` will be equal to the number of activations `n_a`.

Your function will take the following parameters:

- `x`: A batch of inputs.  The inputs for the whole LSTM layer are of shape `(batch_size, sequence_length, n_x)`.
  That is to say, to compute the forward pass, the layer is given a number of samples as a batch.  All samples
  in the batch are sequences of length `sequence_length`.  For example, for our weather forecasting timeseries
  task, we had 5 days of samples taken once an hour, so the sequence length there was $5 \times 24 = 120$.
  And at a single measurement time there are `n_x` features that were measured and are given as input, for
  example there were 14 weather measurements in our weather prediction task.
- `a_init`: The initial value of the state activations before iterating.  Could be 0, but sometimes we need to
  set the initial state (e.g. remember encoder/decoder architecture).  Recall the activations are of
  shape `(batch_size, n_a)`.
- `parameters`: All of the same parameters are again passed into this function, to be passed along
   and used by your `lstm_cell_forward()`.  They are passed in as a tuple for the third parameter in this order
    `(W_f, b_f, W_u, b_u, W_m, b_m, W_o, b_o, W_y, b_y)`

An observant student might have been expecting an `m_init` parameter to initialize the memory state, the same
as `a_init`.  We need to be able to set the initial activations at the start of the
forward pass, but the cell memory state always starts out as 0.  So an initial memory state is still needed, but it is not passed in: you will initialize it to 0 yourself before beginning your time step iterations.


This function will need to iterate over all `sequence_length` sequence items, pulling out the batch of items
for time `t`.  This function returns the activation, memory state and output predictions that result
from all of the iterations.  However these will now
be 3D tensors.  The predicted outputs $\hat{y}$ will be of shape `(batch_size, sequence_length, n_y)`
and the hidden activation states and memory cell states will both be shaped `(batch_size, sequence_length, n_a)`.
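The shapes described above can be set up as in the following sketch, which only allocates the arrays and does not run any LSTM computation. The sizes are the ones used in the test cell for this task, and `np.zeros_like()` is one convenient way to get a zero memory state with the same shape as the initial activations.

```python
import numpy as np

batch_size, sequence_length, n_a, n_y = 10, 8, 5, 2
a_init = np.random.randn(batch_size, n_a)

# current state: activations start from a_init, memory always starts at zero
a_t = a_init
m_t = np.zeros_like(a_init)

# histories to be filled in, one time slice per iteration
a = np.zeros((batch_size, sequence_length, n_a))
m = np.zeros((batch_size, sequence_length, n_a))
y = np.zeros((batch_size, sequence_length, n_y))

print(m_t.shape, a.shape, m.shape, y.shape)
# (10, 5) (10, 8, 5) (10, 8, 5) (10, 8, 2)
```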

The pseudo-code for this function is:

1. Extract the `batch_size` (dimension 0) and `sequence_length` (dimension 1) from the shape of your inputs.
   You also need `n_a` and `n_y`, which are available in multiple ways from the shapes of your `parameters`.
2. Initialize an activations, memory states and y predictions array of the needed shape.  You can for example initialize these
   to 0.
3. Initialize your current activations to the initial activations you were given `a_init`, and your current
   memory state to 0.
4. Iterate t = `0` .. `sequence_length - 1`
   - Extract the inputs for time `t`.  You need all samples in the batch and all input features for time `t`.  Recall that your
     `lstm_cell_forward()` takes 2D tensors as input.
   - Call your `lstm_cell_forward()` with the sliced-out inputs, the current state activation, the current memory state, and the parameters.
   - Your function returns the new calculated activations, memory state and y prediction outputs.  Save those in the 3D arrays
     you will return of all the activations, memory states and predictions.
   - Make sure that you have updated the activations and memory state so the new ones are passed into the cell forward call
     in the next iteration.

If you perform the iteration correctly, you should have the history of activations, memory states and predicted outputs
as 3D tensors, that should be returned from this function.


In [7]:
### TESTED function lstm_forward()
# uncomment when ready to run the unit tests for function
#run_unittests(['test_lstm_forward'])

# a test to check performing cell tensor operations as expected
# example where we have a batch size of 10, number of input features is 3, number of
# output features is 2, and the internal state carries 5 activations
batch_size = 10
sequence_length = 8
n_x = 3
n_a = 5
n_y = 2

# create some random matrices with a set seed so we always get same output result
np.random.seed(1)
x = np.random.randn(batch_size, sequence_length, n_x)
a_init = np.random.randn(batch_size, n_a)

W_f = np.random.randn(n_a + n_x, n_a)
b_f = np.random.randn(n_a)
W_u = np.random.randn(n_a + n_x, n_a)
b_u = np.random.randn(n_a)
W_m = np.random.randn(n_a + n_x, n_a)
b_m = np.random.randn(n_a)
W_o = np.random.randn(n_a + n_x, n_a)
b_o = np.random.randn(n_a)
W_y = np.random.randn(n_a, n_y)
b_y = np.random.randn(n_y)

print('x: ', x.shape)
print('a_init: ', a_init.shape)
print('W_f: ', W_f.shape)
print('b_f: ', b_f.shape)
print('W_u: ', W_u.shape)
print('b_u: ', b_u.shape)
print('W_m: ', W_m.shape)
print('b_m: ', b_m.shape)
print('W_o: ', W_o.shape)
print('b_o: ', b_o.shape)
print('W_y: ', W_y.shape)
print('b_y: ', b_y.shape)

a = np.zeros((10,2))
m = np.zeros((10,2))
y = np.zeros((10,2))
#a, m, y = lstm_forward(x, a_init, (W_f, b_f, W_u, b_u, W_m, b_m, W_o, b_o, W_y, b_y))

print('a: ', a.shape)
print('m: ', m.shape)
print('y: ', y.shape)

# the sample at index 6 in the batch of each
print(a[6])
print(m[6])
print(y[6])

x:  (10, 8, 3)
a_init:  (10, 5)
W_f:  (8, 5)
b_f:  (5,)
W_u:  (8, 5)
b_u:  (5,)
W_m:  (8, 5)
b_m:  (5,)
W_o:  (8, 5)
b_o:  (5,)
W_y:  (5, 2)
b_y:  (2,)
a:  (10, 2)
m:  (10, 2)
y:  (10, 2)
[0. 0.]
[0. 0.]
[0. 0.]


**Expected Results**: For the given seeds, if you have done all calculations as specified, you should get the following outputs:

```
x:  (10, 8, 3)
a_init:  (10, 5)
W_f:  (8, 5)
b_f:  (5,)
W_u:  (8, 5)
b_u:  (5,)
W_m:  (8, 5)
b_m:  (5,)
W_o:  (8, 5)
b_o:  (5,)
W_y:  (5, 2)
b_y:  (2,)
a:  (10, 8, 5)
m:  (10, 8, 5)
y:  (10, 8, 2)
[[ 0.33106747  0.13796993 -0.00918376 -0.14513632  0.0842039 ]
 [-0.02811128  0.28693576 -0.02530277 -0.036286    0.2427269 ]
 [-0.24675213 -0.1234593  -0.61121026 -0.05457205  0.66088247]
 [-0.05236292 -0.70535923 -0.47803672  0.03029571  0.77776549]
 [-0.16682891 -0.08195929 -0.07687751  0.1187123   0.24526771]
 [ 0.36712994  0.24792886 -0.22405346  0.02844528  0.22654813]
 [ 0.04708272  0.06898404 -0.512682   -0.00995657  0.48537134]
 [-0.14107302  0.15767014 -0.33808058 -0.03324728  0.18678606]]
[[ 0.43974553  0.1625789  -0.00975767 -0.15896916  0.11043907]
 [-0.08665783  0.51434267 -0.02604385 -0.05949502  0.4851059 ]
 [-0.27797833 -0.16043439 -1.01003334 -0.05720113  1.30363934]
 [-0.3050144  -1.14317762 -0.5422558   0.22876783  1.86032122]
 [-0.45851336 -0.41038619 -0.09937711  0.19449822  2.52408118]
 [ 0.50119446  0.49577678 -0.68611907  0.03542264  2.41137585]
 [ 0.08260235  0.12430022 -0.62310968 -0.01389028  2.89012875]
 [-0.44837114  1.01221385 -0.37985992 -0.07050988  2.35030661]]
[[0.86029619 0.13970381]
 [0.78735166 0.21264834]
 [0.91794355 0.08205645]
 [0.97988862 0.02011138]
 [0.86544253 0.13455747]
 [0.84260115 0.15739885]
 [0.88703823 0.11296177]
 [0.80061921 0.19938079]]
```

# Task 3: Backpropagation in Recurrent Neural Networks (Optional / Ungraded)

**Todo**: In future add in optional / ungraded walk through of calculating
backpropagation by unrolling RNN iterations.

In modern deep learning frameworks, you only have to implement the forward pass, and the framework takes care of the backward pass, so most deep learning engineers do not need to bother with the details of the backward pass. If however you are an expert in calculus and want to see the details of backprop in RNNs, you can work through this optional portion of the notebook. 

When in an earlier course you implemented a simple (fully connected) neural network, you used backpropagation to compute the derivatives of the cost with respect to the parameters in order to update them. Similarly, in recurrent neural networks you can calculate the derivatives of the cost with respect to the parameters in order to update them.
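As a very small taste of what such a derivative looks like, the sketch below takes a scalar "cell" $a = \tanh(w x)$ and compares the analytic derivative $\partial a / \partial w = x (1 - \tanh^2(w x))$ against a central finite difference. The function name `cell` and the values of `w`, `x` and `eps` are hypothetical, and a real RNN would apply the same chain rule repeatedly across the unrolled time steps.

```python
import numpy as np

def cell(w, x):
    """Hypothetical scalar cell: a = tanh(w * x)."""
    return np.tanh(w * x)

w, x, eps = 0.7, 1.3, 1e-6

# analytic derivative of tanh(w * x) with respect to w
analytic = x * (1.0 - np.tanh(w * x) ** 2)

# central finite difference approximation of the same derivative
numeric = (cell(w + eps, x) - cell(w - eps, x)) / (2 * eps)

print(bool(np.isclose(analytic, numeric, atol=1e-6)))  # True
```

Comparing an analytic gradient to a numerical one in this way is a common sanity check when backpropagation is implemented by hand.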


# Summary

Congratulations on completing this assignment. Hopefully you now have better insights into how
recurrent layers compute their outputs given a batch of sequence data.

<font color='blue'>
    
**What to remember from this assignment:**

- Operations in recurrent layers are done (conceptually) one time step at a time.
- The operations are tensor operations that you are familiar with.
- For a simple RNN, two weight tensors are multiplied by the inputs and the previous
  activation states, and the results are combined with a bias to generate the next activation state.
  A third weight tensor and bias turn that activation state into the output from the RNN cell.
- LSTM layers contain an extra memory cell state that is propagated along the time step sequence.
- LSTM layers use a (rather overcomplicated) set of preliminary gates, known as forget, memory,
  update, and output gates, in the sequence of calculations to produce the final y predictions
  and internal activations and memory state.