# Sequence Modeling: Recurrent Nets

**Recurrent neural networks (RNNs)** are neural networks designed for **sequential data**.
Just like convolutional networks specialize in processing grids of values (like images), RNNs specialize in processing sequences:
$$
x^{(1)}, x^{(2)}, \ldots, x^{(\tau)}
$$

RNNs can handle **long sequences** and **variable-length sequences**, which would be difficult for regular feedforward networks.

A key idea behind RNNs is **parameter sharing**, an idea from the 1980s. Using the same parameters at every time step lets the model be applied to sequences of different lengths and generalize across them.

For example, consider the sentences:

* “I went to Nepal in 2009”
* “In 2009, I went to Nepal.”

A fully connected feedforward network has separate parameters for each input position, so it would have to relearn the same pattern at every position. An RNN uses the **same weights at every time step**, so it can detect the important word (“2009”) no matter where it appears.

For simplicity, think of an RNN processing vectors $ x^{(t)} $ across time steps $ t = 1, \ldots, \tau $. Each output depends on the previous ones, and every step applies the **same update rule**. This creates a very deep computational graph through time.

## Unfolding Computational Graphs

A **computational graph** shows how inputs and parameters flow through operations to produce outputs. When we **unfold** a recurrent computation, we reveal repeated use of the same parameters over time.

Consider a classical dynamical system:

<img src="img/ex1.png">

$$
h^{(t)} = f(h^{(t-1)}; \theta)
$$

Here, each state $ h^{(t)} $ is computed from the previous state using the **same parameters** $ \theta $. Unfolding this system shows the same function applied repeatedly through time.
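As a minimal sketch of unfolding, assume a hypothetical scalar transition $f(h; \theta) = \tanh(\theta h)$. This particular `f`, the helper `unfold`, and the values of `h0` and `theta` are illustrative choices, not part of the text. The loop and the nested expression compute the same thing, which is what the unfolded graph expresses:

```python
import math

def f(h, theta):
    """Hypothetical transition function: one step of the dynamical system."""
    return math.tanh(theta * h)

def unfold(h0, theta, steps):
    """Apply the same f with the same theta `steps` times, keeping every state."""
    states = [h0]
    for _ in range(steps):
        states.append(f(states[-1], theta))
    return states

states = unfold(h0=0.5, theta=1.5, steps=3)

# The unfolded graph for 3 steps is the nested expression f(f(f(h0))).
nested = f(f(f(0.5, 1.5), 1.5), 1.5)
```

Here `states[-1]` and `nested` are the same number: unfolding only changes how the computation is drawn, not what is computed.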

We can also have a system that receives an external input at each step:

$$
h^{(t)} = f(h^{(t-1)}, x^{(t)}; \theta)
$$

Now the state $ h^{(t)} $ carries information from the **entire past sequence**.

<img src="img/ex2.png">

This RNN has no outputs. It simply processes the input sequence and stores information in the hidden state $ h $ as it moves through time. In the figure:

* Left: circuit diagram (black square = 1-step delay)
* Right: unfolded graph (each node is a point in time)

RNNs can be built in many different forms. Just as many functions can be expressed as feedforward networks, almost any function involving recurrence can be framed as a recurrent neural network.


We can describe the unfolded recurrence after $ t $ steps using a function $ g^{(t)} $:

$$
h^{(t)} = g^{(t)}(x^{(t)}, x^{(t-1)}, \ldots, x^{(1)}) = f(h^{(t-1)}, x^{(t)}; \theta)
$$

The function $ g^{(t)} $ takes the **entire past sequence**
$(x^{(t)}, x^{(t-1)}, \ldots, x^{(1)})$
and produces the current state $ h^{(t)} $.
But when we **unfold** the recurrence, we see that $ g^{(t)} $ is really just the repeated application of the same function $ f $ at each time step.

Unfolding gives us two big advantages:

1. **Fixed input size for the model.**
   The model doesn’t need to accept a full history of variable length.
   Instead, it only needs to handle the transition from one state to the next.

2. **Parameter sharing across time.**
   The same transition function $ f $ with the same parameters is used at every time step.

Because of these two properties, we only need to learn **one model** $ f $ that works for all time steps and all sequence lengths. We don’t need to learn separate models $ g^{(t)} $ for different values of $ t $, which makes RNNs scalable and efficient.
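The relationship between $f$ and $g^{(t)}$ can be sketched in NumPy. The tanh transition, the parameter names `W` and `U`, the sizes, and the zero initial state below are assumptions made for illustration. The point is that `f` always takes fixed-size inputs, while `g` handles any sequence length by folding `f` over the history:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 3, 4  # illustrative sizes

# theta: hypothetical parameters, shared by every time step.
theta = {
    "W": rng.normal(scale=0.5, size=(n_hidden, n_hidden)),
    "U": rng.normal(scale=0.5, size=(n_hidden, n_in)),
}

def f(h_prev, x, theta):
    """One state transition: fixed-size inputs, whatever the sequence length."""
    return np.tanh(theta["W"] @ h_prev + theta["U"] @ x)

def g(xs, theta):
    """Unfolded recurrence g^(t): fold f over the whole input history."""
    h = np.zeros(n_hidden)  # assumed initial state h^(0) = 0
    for x in xs:
        h = f(h, x, theta)
    return h

short_seq = rng.normal(size=(2, n_in))  # sequence of length 2
long_seq = rng.normal(size=(7, n_in))   # sequence of length 7
h_short = g(short_seq, theta)
h_long = g(long_seq, theta)
```

Both calls return a state of the same size, and the same `theta` is used for both sequence lengths.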


## Recurrent Neural Networks

Here is a representative RNN example that we will refer to throughout this notebook:

<img src="img/ex3.png">

The computational graph shows how to compute the **training loss** for an RNN that maps an input sequence $x$ to a corresponding output sequence $o$. A loss $L$ measures how far each output $o$ is from its target $y$.

* When using **softmax outputs**, $o$ represents unnormalized log probabilities.
* The loss internally computes $\hat{y} = \text{softmax}(o)$ and compares it to the target $y$.

The RNN has three weight matrices:

* Input-to-hidden weights: $U$
* Hidden-to-hidden (recurrent) weights: $W$
* Hidden-to-output weights: $V$

In the figure:

* Left: RNN with recurrent connections.
* Right: Same RNN **unfolded over time**, showing one node per time step.

### Forward Propagation

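Assuming a **tanh activation** for hidden units and discrete outputs (e.g., words or characters), forward propagation starts from an initial state $h^{(0)}$ and applies the following updates for each time step $t = 1, \ldots, \tau$: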

$$
a^{(t)} = b + W h^{(t-1)} + U x^{(t)}
$$
$$
h^{(t)} = \tanh(a^{(t)})
$$
$$
o^{(t)} = c + V h^{(t)}
$$
$$
\hat{y}^{(t)} = \text{softmax}(o^{(t)})
$$

Here:

* $b$ and $c$ are bias vectors
* $U$, $V$, and $W$ are the weight matrices for input-to-hidden, hidden-to-output, and hidden-to-hidden connections, respectively

This RNN maps an input sequence to an output sequence of the **same length**, and the **total loss** for the sequence is the sum of the losses at each time step.
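The four update equations translate almost line by line into NumPy. The following is a sketch under stated assumptions: the function names, the layer sizes, the random initialization, and the zero initial state are all illustrative choices rather than anything fixed by the text.

```python
import numpy as np

def softmax(o):
    """Numerically stable softmax over a 1-D vector."""
    e = np.exp(o - np.max(o))
    return e / e.sum()

def rnn_forward(xs, U, W, V, b, c, h0):
    """Apply the update equations at every step; return all h, o, and y_hat."""
    h = h0
    hs, outs, y_hats = [], [], []
    for x in xs:
        a = b + W @ h + U @ x      # a^(t) = b + W h^(t-1) + U x^(t)
        h = np.tanh(a)             # h^(t) = tanh(a^(t))
        o = c + V @ h              # o^(t) = c + V h^(t)
        hs.append(h)
        outs.append(o)
        y_hats.append(softmax(o))  # y_hat^(t) = softmax(o^(t))
    return np.array(hs), np.array(outs), np.array(y_hats)

rng = np.random.default_rng(0)
n_in, n_hidden, n_out, tau = 3, 5, 4, 6  # illustrative sizes
U = rng.normal(scale=0.1, size=(n_hidden, n_in))
W = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
V = rng.normal(scale=0.1, size=(n_out, n_hidden))
b, c = np.zeros(n_hidden), np.zeros(n_out)
xs = rng.normal(size=(tau, n_in))

hs, outs, y_hats = rnn_forward(xs, U, W, V, b, c, h0=np.zeros(n_hidden))
```

The output has one row per input step, and each row of `y_hats` is a probability distribution over the `n_out` classes.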

**Example:** Let $L^{(t)}$ be the negative log-likelihood of $y^{(t)}$ given $x^{(1)}, \ldots, x^{(t)}$. Then the total loss for a sequence is:

$$
L(x^{(1)},\ldots,x^{(\tau)}, y^{(1)},\ldots,y^{(\tau)}) = \sum_t L^{(t)} = -\sum_t \log p_{\text{model}}\big(y^{(t)} \mid x^{(1)},\ldots,x^{(t)}\big)
$$

Here, $p_{\text{model}}(y^{(t)} \mid x^{(1)},\ldots,x^{(t)})$ is obtained by reading the entry for $y^{(t)}$ from the model’s output vector $\hat{y}^{(t)}$.
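Reading the entry for $y^{(t)}$ out of $\hat{y}^{(t)}$ and summing the negative logs can be sketched as follows. The helper name `sequence_nll` and the probability values are made up for illustration, and the targets are assumed to be integer class indices.

```python
import numpy as np

def sequence_nll(y_hats, ys):
    """Sum over t of -log y_hat^(t)[y^(t)], for integer class targets ys."""
    steps = np.arange(len(ys))
    return -np.sum(np.log(y_hats[steps, ys]))

# Hypothetical softmax outputs for tau = 3 steps over 3 classes.
y_hats = np.array([
    [0.70, 0.20, 0.10],
    [0.10, 0.80, 0.10],
    [0.25, 0.25, 0.50],
])
ys = np.array([0, 1, 2])  # target class index at each time step

loss = sequence_nll(y_hats, ys)
```

In this example the loss equals $-(\log 0.7 + \log 0.8 + \log 0.5)$, since those are the probabilities the model assigned to the three targets.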

Computing the gradient of this loss is **expensive**:

* Forward pass moves **left to right** through the unrolled graph
* Backward pass moves **right to left**
* Runtime: $O(\tau)$, which cannot be reduced by parallelizing across time steps, because each step depends on the previous one
* Memory: $O(\tau)$, as forward states must be stored for the backward pass

This gradient computation on an unrolled RNN is called **Back-Propagation Through Time (BPTT).**
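A compact sketch of BPTT for the tanh/softmax RNN above is shown below. It is one possible hand-written implementation, not a reference one: the dictionary of parameters `p`, the function names, the sizes, and the zero initial state are assumptions. The forward pass stores every hidden state (the $O(\tau)$ memory cost), and the backward pass walks the steps in reverse while accumulating gradients into the shared parameters.

```python
import numpy as np

def forward(xs, ys, p):
    """Left-to-right pass; stores every hidden state for the backward pass."""
    hs = [np.zeros(p["W"].shape[0])]       # hs[0] is h^(0), assumed zero
    y_hats, loss = [], 0.0
    for x, y in zip(xs, ys):
        h = np.tanh(p["b"] + p["W"] @ hs[-1] + p["U"] @ x)
        o = p["c"] + p["V"] @ h
        e = np.exp(o - o.max())
        y_hat = e / e.sum()
        loss -= np.log(y_hat[y])           # negative log-likelihood of target
        hs.append(h)
        y_hats.append(y_hat)
    return loss, hs, y_hats

def bptt(xs, ys, p):
    """Right-to-left pass; accumulates gradients for the shared parameters."""
    loss, hs, y_hats = forward(xs, ys, p)
    grads = {k: np.zeros_like(v) for k, v in p.items()}
    dh_next = np.zeros_like(hs[0])         # gradient arriving from step t+1
    for t in reversed(range(len(xs))):
        do = y_hats[t].copy()
        do[ys[t]] -= 1.0                   # dL/do = softmax(o) - one_hot(y)
        grads["V"] += np.outer(do, hs[t + 1])
        grads["c"] += do
        dh = p["V"].T @ do + dh_next       # from the output and from the future
        da = dh * (1.0 - hs[t + 1] ** 2)   # derivative of tanh
        grads["W"] += np.outer(da, hs[t])  # hs[t] is the previous state
        grads["U"] += np.outer(da, xs[t])
        grads["b"] += da
        dh_next = p["W"].T @ da
    return loss, grads

rng = np.random.default_rng(0)
n_in, n_hidden, n_out, tau = 2, 3, 4, 5    # illustrative sizes
p = {
    "U": rng.normal(scale=0.5, size=(n_hidden, n_in)),
    "W": rng.normal(scale=0.5, size=(n_hidden, n_hidden)),
    "V": rng.normal(scale=0.5, size=(n_out, n_hidden)),
    "b": np.zeros(n_hidden),
    "c": np.zeros(n_out),
}
xs = rng.normal(size=(tau, n_in))
ys = rng.integers(0, n_out, size=tau)
loss, grads = bptt(xs, ys, p)
```

A useful sanity check for code like this is to compare a few entries of `grads` against central finite differences of `forward`.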

### Figure 1

<img src="img/ex4.png">

In this RNN, the only recurrence is from the **output** $o^{(t)}$ at one time step to the hidden layer at the next. The figure uses the following notation:

* Inputs: $x^{(t)}$
* Hidden activations: $h^{(t)}$
* Outputs: $o^{(t)}$
* Targets: $y^{(t)}$
* Loss: $L^{(t)}$

In the figure:

* Left: circuit diagram
* Right: unfolded computational graph

Here, the hidden state $h$ is only connected to the future indirectly via the output $o$. Unless $o$ is very high-dimensional, the network may **lose important past information**, making it less powerful. However, it can be easier to train, because each time step can be trained in isolation from the others, which allows greater **parallelization** during training.

<img src="img/ex5.png">

A time-unfolded RNN with **a single output at the end of the sequence**:

* Can summarize the entire sequence into a **fixed-size representation**
* The target can be at the end, or the gradient can come from further downstream modules

This setup is commonly used for tasks like sequence classification, where the whole sequence is summarized into a single prediction.
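A single-output RNN can be sketched by running the state update over the whole sequence and applying the output layer once at the end. The function name `classify_sequence`, the softmax read-out, the sizes, and the zero initial state are assumptions for illustration.

```python
import numpy as np

def classify_sequence(xs, U, W, V, b, c):
    """Read the whole sequence, then emit one output from the final state."""
    h = np.zeros(W.shape[0])            # assumed zero initial state
    for x in xs:
        h = np.tanh(b + W @ h + U @ x)  # no per-step output
    o = c + V @ h                       # single output o^(tau)
    e = np.exp(o - o.max())
    return e / e.sum()                  # class probabilities

rng = np.random.default_rng(0)
n_in, n_hidden, n_classes = 3, 5, 2     # illustrative sizes
U = rng.normal(scale=0.3, size=(n_hidden, n_in))
W = rng.normal(scale=0.3, size=(n_hidden, n_hidden))
V = rng.normal(scale=0.3, size=(n_classes, n_hidden))
b, c = np.zeros(n_hidden), np.zeros(n_classes)

probs_short = classify_sequence(rng.normal(size=(3, n_in)), U, W, V, b, c)
probs_long = classify_sequence(rng.normal(size=(10, n_in)), U, W, V, b, c)
```

Sequences of length 3 and 10 both produce a fixed-size vector of `n_classes` probabilities.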


## Teacher Forcing and Networks with Output Recurrence

<img src="img/ex6.png">

**Teacher forcing** is a training technique for RNNs that have connections from their **output** at one time step to the **hidden state** at the next time step.

* **Left (Training):** Feed the correct output $y^{(t)}$ from the training set as input to $h^{(t+1)}$.
* **Right (Deployment):** The true output is unknown. We approximate $y^{(t)}$ using the model’s output $o^{(t)}$ and feed it back into the network.

A network with recurrence only from **output to hidden** (no hidden-to-hidden connections) is **less powerful**, because:

* The output must capture all relevant past information.
* If the training targets don’t include the full system state, the network may miss important context.

The advantage is that **training can be parallelized**:

* With teacher forcing, the time steps are decoupled from one another during training.
* The gradient for each step can be computed in isolation, since the training set already provides the correct value of the previous output.
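The difference between training and deployment, and the reason training parallelizes, can be sketched for an output-recurrent RNN. Everything specific here is an assumption for illustration: the parameter dictionary `p`, the one-hot encoding of targets, the all-zero vector used as the "previous output" at the first step, and the greedy arg-max used to approximate $y^{(t)}$ at deployment.

```python
import numpy as np

def step(prev_out, x, p):
    """One step: the hidden state sees only the previous output and the input."""
    h = np.tanh(p["b"] + p["W"] @ prev_out + p["U"] @ x)
    return p["c"] + p["V"] @ h                       # o^(t)

def one_hot(i, n):
    v = np.zeros(n)
    v[i] = 1.0
    return v

def run_teacher_forced(xs, ys, p, n_out):
    """Training: step t receives the true target from step t-1."""
    prev, outs = np.zeros(n_out), []
    for x, y in zip(xs, ys):
        outs.append(step(prev, x, p))
        prev = one_hot(y, n_out)                     # ground truth, not the prediction
    return np.array(outs)

def run_teacher_forced_parallel(xs, ys, p, n_out):
    """Same result, computed for all time steps at once (no loop over t)."""
    targets = np.eye(n_out)[ys]                      # one-hot, shape (tau, n_out)
    prev = np.vstack([np.zeros(n_out), targets[:-1]])  # targets shifted by one step
    hidden = np.tanh(p["b"] + prev @ p["W"].T + xs @ p["U"].T)
    return p["c"] + hidden @ p["V"].T

def run_free(xs, p, n_out):
    """Deployment: step t receives the model's own previous prediction."""
    prev, outs = np.zeros(n_out), []
    for x in xs:
        o = step(prev, x, p)
        outs.append(o)
        prev = one_hot(int(np.argmax(o)), n_out)     # assumed greedy choice
    return np.array(outs)

rng = np.random.default_rng(0)
n_in, n_hidden, n_out, tau = 3, 5, 4, 6              # illustrative sizes
p = {
    "U": rng.normal(scale=0.5, size=(n_hidden, n_in)),
    "W": rng.normal(scale=0.5, size=(n_hidden, n_out)),  # output-to-hidden weights
    "V": rng.normal(scale=0.5, size=(n_out, n_hidden)),
    "b": np.zeros(n_hidden),
    "c": np.zeros(n_out),
}
xs = rng.normal(size=(tau, n_in))
ys = rng.integers(0, n_out, size=tau)

train_outs = run_teacher_forced(xs, ys, p, n_out)
train_outs_parallel = run_teacher_forced_parallel(xs, ys, p, n_out)
deploy_outs = run_free(xs, p, n_out)
```

The looped and the vectorized training passes give the same outputs, which is only possible because no step needs the result of an earlier step. `run_free` has no such shortcut, since each step depends on the model's previous prediction.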

### Teacher Forcing and Maximum Likelihood

Teacher forcing comes naturally from the **maximum likelihood criterion**. For example, consider a sequence with two time steps:

$$
\log p(y^{(1)}, y^{(2)} \mid x^{(1)}, x^{(2)}) =
\log p(y^{(1)} \mid x^{(1)}, x^{(2)}) + \log p(y^{(2)} \mid y^{(1)}, x^{(1)}, x^{(2)})
$$

At $t=2$, the model is trained to maximize the probability of $y^{(2)}$ given:

* The input sequence so far $(x^{(1)}, x^{(2)})$
* The **previous target** $y^{(1)}$ from the training set

Instead of feeding the model’s own previous output back during training, **teacher forcing uses the true target values**, ensuring correct guidance at each step (as shown in the figure above).
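The two-step decomposition extends to a sequence of length $\tau$ by the chain rule of probability. Keeping the same conditioning on the full input sequence as in the two-step example, a sketch of the general form is:

$$
\log p(y^{(1)},\ldots,y^{(\tau)} \mid x^{(1)},\ldots,x^{(\tau)}) =
\sum_{t=1}^{\tau} \log p(y^{(t)} \mid y^{(1)},\ldots,y^{(t-1)}, x^{(1)},\ldots,x^{(\tau)})
$$

Each term conditions on the **true** earlier targets $y^{(1)},\ldots,y^{(t-1)}$, which is exactly what teacher forcing supplies during training. Setting $\tau = 2$ recovers the two-term expression above.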
