# Assignment 1: Introduction to the Fully Recurrent Network

*Author:* Thomas Adler

*Copyright statement:* This material, no matter whether in printed or electronic form, may be used for personal and non-commercial educational use only. Any reproduction of this manuscript, no matter whether as a whole or in parts, no matter whether in printed or in electronic form, requires explicit prior acceptance of the authors.


## Exercise 1: Numerical stability of the binary cross-entropy loss function

We will use the binary cross-entropy loss function to train our RNN, which is defined as
$$
L_{\text{BCE}}(\hat y, y) = -y \log \hat y - (1-y) \log (1-\hat y),
$$
where $y$ is the label and $\hat y$ is a prediction, which comes from a model (e.g. an RNN) and is usually sigmoid-activated, i.e., we have
$$
\hat y = \sigma(z) = \frac{1}{1+e^{-z}}.
$$
The argument $z$ is called the *logit*. For reasons of numerical stability it is better to let the model emit the logit $z$ (instead of the prediction $\hat y$) and incorporate the sigmoid activation into the loss function. Explain why this is the case and how we can gain numerical stability by combining the two functions $L_{\text{BCE}}(\hat y, y)$ and $\sigma(z)$ into one function $L(z, y) = L_{\text{BCE}}(\sigma(z), y)$.

*Hint: Prove that $\log(1+e^{z}) = \log (1+e^{-|z|}) + \max(0, z)$ and argue why the right-hand side is numerically more stable. Finally, express $L(z,y)$ in terms of that form.*

########## YOUR SOLUTION HERE ##########


Combining sigmoid and BCE: substituting $\hat y = \sigma(z)$ into the BCE loss function gives:

$$
L(z,y) = -y\log(\sigma(z)) - (1-y)\log(1-\sigma(z))
$$

We use the identities

$$
\log(\sigma(z)) = -\log(1+e^{-z})
$$

$$
\log(1-\sigma(z)) = -\log(1+e^z)
$$

Both follow from taking the logarithm of $\sigma(z) = \frac{1}{1+e^{-z}}$ and $1 - \sigma(z) = \frac{1}{1+e^{z}}$.

Substituting them, the BCE loss becomes:

$$
L(z,y) = y\log(1+e^{-z}) + (1-y)\log(1+e^z)
$$

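Evaluating the loss via $\hat y$ is numerically fragile: for large $|z|$, $\sigma(z)$ rounds to exactly 0 or 1 in floating-point arithmetic, so $\log(\hat y)$ or $\log(1-\hat y)$ evaluates to $\log 0 = -\infty$. Computing the loss from the logit $z$ directly avoids this intermediate rounding. The form above still contains $e^{z}$ and $e^{-z}$, however, which overflow for large $|z|$; the identity from the hint removes that problem.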

To prove the hint:

$$
\log(1 + e^z) = \log(1 + e^{-|z|}) + \max(0, z),
$$

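There are two cases:


1. **For $z \geq 0$:** we have $|z| = z$ and $\max(0, z) = z$. Factoring $e^z$ out of the argument of the logarithm gives

$$\log(1 + e^z) = \log\left(e^z (1 + e^{-z})\right) = z + \log(1 + e^{-z}) = \max(0, z) + \log(1 + e^{-|z|}).$$

2. **For $z < 0$:** we have $|z| = -z$ and $\max(0, z) = 0$, so

$$\log(1 + e^z) = \log(1 + e^{-|z|}) + 0 = \log(1 + e^{-|z|}) + \max(0, z).$$

The right-hand side is numerically more stable because the exponent $-|z|$ is never positive: $e^{-|z|} \in (0, 1]$ cannot overflow, and the argument of the logarithm stays in $(1, 2]$. For large $z$ the result is dominated by the term $\max(0, z)$, which is computed exactly.

Applying the identity to both logarithms of $L(z, y)$ (with $z$ replaced by $-z$ in the first one) gives the numerically stable version of the binary cross-entropy loss function:

$$L(z, y) = y \left[\log(1 + e^{-|z|}) + \max(0, -z)\right] + (1 - y) \left[\log(1 + e^{-|z|}) + \max(0, z)\right] = \max(0, z) - yz + \log(1 + e^{-|z|}).$$

The last step uses $\max(0, -z) = \max(0, z) - z$.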


Side note:

Instead of explicitly calculating $\hat y = \sigma(z)$ we can calculate the loss in terms of $z$ (the logit) directly for numerical stability.
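
The effect can be checked numerically. The following is a minimal sketch in plain Python (standard library only); the function names `bce_naive` and `bce_stable` are hypothetical and not part of the assignment. It compares the loss computed via $\hat y = \sigma(z)$ with the form $\max(0, z) - yz + \log(1 + e^{-|z|})$ that results from applying the identity in the hint to $L(z, y)$.

```python
import math


def bce_naive(z, y):
    """BCE computed via the prediction y_hat = sigmoid(z)."""
    y_hat = 1.0 / (1.0 + math.exp(-z))
    return -y * math.log(y_hat) - (1 - y) * math.log(1 - y_hat)


def bce_stable(z, y):
    """BCE computed from the logit: max(0, z) - y*z + log(1 + exp(-|z|))."""
    return max(0.0, z) - y * z + math.log1p(math.exp(-abs(z)))


# Moderate logits: both versions agree.
for z, y in [(-2.0, 0), (0.5, 1), (3.0, 0)]:
    assert math.isclose(bce_naive(z, y), bce_stable(z, y), rel_tol=1e-9)

# Extreme logits: the naive version raises an error, the stable one does not.
for z, y in [(40.0, 0), (-800.0, 1)]:
    try:
        naive = bce_naive(z, y)
    except (OverflowError, ValueError) as err:
        naive = f"failed ({type(err).__name__})"
    print(f"z={z:7.1f} y={y}  naive: {naive}  stable: {bce_stable(z, y)}")
```

For $z = 40$ the sigmoid rounds to exactly 1, so the naive version ends up evaluating $\log 0$; for $z = -800$ the term $e^{-z}$ overflows. With `numpy` the same situations typically show up as `inf` or `nan` plus a runtime warning rather than an exception.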

## Exercise 2: Derivative of the loss

Calculate the derivative of the binary cross-entropy loss function $L(z, y)$ with respect to the logit $z$.

########## YOUR SOLUTION HERE ##########

The binary cross-entropy loss function:

$$
L(z, y) = -\left[ y \log(\sigma(z)) + (1 - y) \log(1 - \sigma(z)) \right]
$$

Substituting the sigmoid function $\sigma(z) = \frac{1}{1 + e^{-z}}$ into the BCE loss function gives:

$$
L(z, y) = -\left[ y \log\left( \frac{1}{1 + e^{-z}} \right) + (1 - y) \log\left( 1 - \frac{1}{1 + e^{-z}} \right) \right]
$$

### Simplification

I. Simplifying the First Term

$$
y \log\left( \frac{1}{1 + e^{-z}} \right)
$$


$$
\implies y \log\left( \frac{1}{1 + e^{-z}} \right) = -y \log(1 + e^{-z})
$$

II. Simplifying the Second Term


$$
(1 - y) \log\left( 1 - \frac{1}{1 + e^{-z}} \right)
$$

First simplify the expression inside the logarithm:

$$
1 - \frac{1}{1 + e^{-z}}
$$


$$
\implies 1 - \frac{1}{1 + e^{-z}} = \frac{e^{-z}}{1 + e^{-z}}
$$

$$
\implies (1 - y) \log\left( 1 - \frac{1}{1 + e^{-z}} \right) = (1 - y) \log\left( \frac{e^{-z}}{1 + e^{-z}} \right)
$$

$$
\implies \log\left( \frac{e^{-z}}{1 + e^{-z}} \right) = \log(e^{-z}) - \log(1 + e^{-z}) = -z - \log(1 + e^{-z})
$$

So:

$$
\implies (1 - y) \log\left( 1 - \frac{1}{1 + e^{-z}} \right) = (1 - y)(-z - \log(1 + e^{-z}))
$$

Expanding this expression:

$$
\implies (1 - y)(-z - \log(1 + e^{-z})) = -(1 - y)z - (1 - y)\log(1 + e^{-z})
$$

### Combining Both Parts


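Inserting both simplified terms into the loss gives:

$$
L(z, y) = -\left[ -y \log(1 + e^{-z}) - (1 - y)z - (1 - y)\log(1 + e^{-z}) \right]
$$

The two logarithm terms carry the weights $y$ and $1 - y$, which sum to one:

$$
\implies L(z, y) = (1 - y)z + \log(1 + e^{-z})
$$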


To find the derivative of the loss with respect to the logit $z$, write $p = \sigma(z)$ for the sigmoid output and apply the chain rule:

$$
\frac{\partial L}{\partial z} = \frac{\partial L}{\partial p} \cdot \frac{\partial p}{\partial z}
$$

The partial derivative of the loss with respect to the sigmoid output $p$ is:

$$
\frac{\partial L}{\partial p} = -\frac{y}{p} + \frac{1 - y}{1 - p}
$$

The derivative of the sigmoid function with respect to $z$ is:

$$
\frac{\partial p}{\partial z} = \sigma(z) \cdot (1 - \sigma(z)) = p(1 - p)
$$

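Multiplying the two factors gives the derivative of the loss function $L(z, y)$ with respect to the logit $z$:

$$
\frac{\partial L}{\partial z} = \left( -\frac{y}{p} + \frac{1 - y}{1 - p} \right) p(1 - p) = -y(1 - p) + (1 - y)p = p - y = \sigma(z) - y
$$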

### Intuition Behind the Result (Side note)

* The derivative $\sigma(z) - y$ is the difference between the predicted probability $\sigma(z)$ and the true label $y$.
* This value is often referred to as the **error term** in the context of gradient descent: it represents how far off the prediction is from the actual label.
* If the prediction $\sigma(z)$ is greater than the actual value $y$, the derivative will be positive, suggesting that the model needs to reduce the logit value to minimize the loss.
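
The result $\frac{\partial L}{\partial z} = \sigma(z) - y$ can be sanity-checked with a central finite difference. This is a minimal sketch in plain Python; the helper names `loss`, `sigmoid` and `numerical_grad` are chosen for illustration only, and the loss uses the form $L(z,y) = y\log(1+e^{-z}) + (1-y)\log(1+e^{z})$ from Exercise 1, which is adequate for the moderate logits tested here.

```python
import math


def loss(z, y):
    """BCE on the logit: y*log(1 + e^-z) + (1 - y)*log(1 + e^z)."""
    return y * math.log(1.0 + math.exp(-z)) + (1 - y) * math.log(1.0 + math.exp(z))


def sigmoid(z):
    """Logistic function 1 / (1 + e^-z)."""
    return 1.0 / (1.0 + math.exp(-z))


def numerical_grad(z, y, eps=1e-6):
    """Central finite-difference approximation of dL/dz."""
    return (loss(z + eps, y) - loss(z - eps, y)) / (2 * eps)


for z, y in [(-3.0, 0), (-0.7, 1), (0.2, 0), (2.5, 1)]:
    analytic = sigmoid(z) - y
    numeric = numerical_grad(z, y)
    print(f"z={z:5.1f} y={y}  analytic={analytic:+.6f}  numeric={numeric:+.6f}")
    assert abs(analytic - numeric) < 1e-6
```

The same kind of check is useful later for verifying a hand-written backward pass against the forward pass.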


## Exercise 3: Initializing the network
Consider the fully recurrent network
$$
s(t) = W x(t) + R a(t-1) \\
a(t) = \tanh(s(t)) \\
z(t) = V a(t) \\
\hat y(t) = \sigma(z(t))
$$
for $t \in \mathbb{N}, x(t) \in \mathbb{R}^{D}, s(t) \in \mathbb{R}^{I}, a(t) \in \mathbb{R}^{I}, z(t) \in \mathbb{R}^K, \hat y(t) \in \mathbb{R}^K$ and $W, R, V$ are real matrices of appropriate sizes and $a(0) = 0$.

*Compared to the lecture notes we choose $f(x) = \tanh(x) = (e^x - e^{-x})(e^x + e^{-x})^{-1}$ and $\varphi(x) = \sigma(x) = (1+e^{-x})^{-1}$. Further, we introduced an auxiliary variable $z(t)$ and transposed the weight matrices.*

Write a function `init` that takes a `model` and integers $D, I, K$ as arguments and stores the matrices $W, R, V$ as members `model.W`, `model.R`, `model.V`, respectively. The matrices should be `numpy` arrays of appropriate sizes and filled with random values that are uniformly distributed between -0.01 and 0.01.

In [6]:
import numpy as np
from scipy.special import expit as sigmoid

class Obj(object):
    pass

model = Obj()
T, D, I, K = 10, 3, 5, 1

def init(model, D, I, K):
    # np.random.uniform(low, high, size) samples uniformly from [low, high)

    # W maps the input x(t) to the hidden pre-activation: shape (I, D)
    model.W = np.random.uniform(-0.01, 0.01, (I, D))
    # R maps the previous activation a(t-1) to the hidden pre-activation: shape (I, I)
    model.R = np.random.uniform(-0.01, 0.01, (I, I))
    # V maps the activation a(t) to the logit z(t): shape (K, I)
    model.V = np.random.uniform(-0.01, 0.01, (K, I))

Obj.init = init
model.init(D, I, K)

print("W shape:", model.W.shape)
print("R shape:", model.R.shape)
print("V shape:", model.V.shape)

W shape: (5, 3)
R shape: (5, 5)
V shape: (1, 5)
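
Because the weights are random, repeated runs give different numbers. If reproducibility is wanted, one option is a seeded generator. The sketch below is an assumption-laden variant, not part of the assignment: `init_seeded` and its `seed` parameter are hypothetical, and `types.SimpleNamespace` stands in for the `Obj` class so that the block is self-contained.

```python
from types import SimpleNamespace

import numpy as np


def init_seeded(model, D, I, K, seed=0):
    """Hypothetical variant of `init` that draws from a seeded generator."""
    rng = np.random.default_rng(seed)
    model.W = rng.uniform(-0.01, 0.01, size=(I, D))
    model.R = rng.uniform(-0.01, 0.01, size=(I, I))
    model.V = rng.uniform(-0.01, 0.01, size=(K, I))


m1, m2 = SimpleNamespace(), SimpleNamespace()
init_seeded(m1, D=3, I=5, K=1, seed=42)
init_seeded(m2, D=3, I=5, K=1, seed=42)

# Same seed, same weights; all values lie within the requested range.
assert np.array_equal(m1.W, m2.W)
for M in (m1.W, m1.R, m1.V):
    assert M.min() >= -0.01 and M.max() <= 0.01

print(m1.W.shape, m1.R.shape, m1.V.shape)  # (5, 3) (5, 5) (1, 5)
```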


## Exercise 4: The forward pass
Implement the forward pass for the fully recurrent network for sequence classification (many-to-one mapping). To this end, write a function `forward` that takes a `model`, a sequence of input vectors `x`, and a label `y` as arguments. The inputs will be represented as a `numpy` array of shape `(T, D)`. It should execute the behavior of the fully recurrent network and evaluate the (numerically stabilized) binary cross-entropy loss at the end of the sequence and return the resulting loss value. Store the sequence of hidden activations $(a(t))_{t=1}^T$ and the logit $z(T)$ into `model.a` and `model.z`, respectively.

In [None]:
def forward(model, x, y):
    T, D = x.shape  # Sequence length and input dimensionality
    I = model.R.shape[0]  # Number of hidden units

    # Initialize hidden activation a(0) = 0
    a = np.zeros(I)

    # Initialize storage for activations a(1), ..., a(T)
    model.a = []

    # Iterate over each time step
    for t in range(T):
        # Recurrent operation: s(t) = W x(t) + R a(t-1), a(t) = tanh(s(t))
        a = np.tanh(model.W @ x[t] + model.R @ a)

        # Store hidden activations
        model.a.append(a)

    # Final logit for the last time step: z(T) = V a(T)
    z = model.V @ a
    model.z = z

    # Numerically stable binary cross-entropy computed from the logit:
    # L(z, y) = max(0, z) - y * z + log(1 + exp(-|z|))
    loss = np.maximum(0, z) - y * z + np.log1p(np.exp(-np.abs(z)))

    # loss has shape (K,), one entry per output unit
    return loss


Obj.forward = forward
model.forward(np.random.uniform(-1, 1, (T, D)), 1)

## Exercise 5: The computational graph

Visualize the computational graph of the fully recurrent network unfolded in time. The graph should show the functional dependencies of the nodes $x(t), a(t), z(t), L(z(t), y(t))$ for $t \in \{1, 2, 3\}$. Use the package `networkx` in combination with `matplotlib` to draw a directed graph with labelled nodes and edges. If you need help take a look at [this guide](https://networkx.org/documentation/stable/tutorial.html). Make sure to arrange the nodes in a meaningful way.

In [None]:
import networkx as nx
import matplotlib.pyplot as plt

########## YOUR SOLUTION HERE ##########
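
One possible approach, offered as a sketch rather than a reference solution: build the edge list of the unfolded network first, then place the nodes on a grid with time on the horizontal axis and the layer ($x$, $a$, $z$, $L$) on the vertical axis. The function names `unfolded_edges` and `draw_unfolded_graph`, the node labels, and the edge label `"BCE"` are all assumptions made for illustration. The plotting packages are imported inside the drawing function so that the edge list can be inspected without them.

```python
def unfolded_edges(T=3):
    """Return (source, target, label) triples of the network unfolded over T steps."""
    edges = []
    for t in range(1, T + 1):
        edges.append((f"x({t})", f"a({t})", "W"))  # input to hidden
        if t > 1:
            edges.append((f"a({t - 1})", f"a({t})", "R"))  # recurrent connection
        edges.append((f"a({t})", f"z({t})", "V"))  # hidden to logit
        edges.append((f"z({t})", f"L({t})", "BCE"))  # logit to loss
    return edges


def draw_unfolded_graph(T=3):
    """Draw the unfolded graph; requires networkx and matplotlib."""
    import matplotlib.pyplot as plt
    import networkx as nx

    G = nx.DiGraph()
    for source, target, label in unfolded_edges(T):
        G.add_edge(source, target, label=label)

    # Grid layout: column = time step, row = layer.
    rows = {"x": 0, "a": 1, "z": 2, "L": 3}
    pos = {node: (int(node[2:-1]), rows[node[0]]) for node in G.nodes}

    nx.draw(G, pos, with_labels=True, node_size=2000,
            node_color="lightblue", arrowsize=20)
    nx.draw_networkx_edge_labels(
        G, pos, edge_labels=nx.get_edge_attributes(G, "label"))
    plt.show()


# The edge list for t in {1, 2, 3}; call draw_unfolded_graph(3) to render it.
for edge in unfolded_edges(3):
    print(edge)
```

Calling `draw_unfolded_graph(3)` renders the figure. The initial activation $a(0) = 0$ and the labels $y(t)$ are left out here because the exercise only lists $x(t), a(t), z(t)$ and $L(z(t), y(t))$ as nodes.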