![](img/575_banner.png)

# Lecture 6: Introduction to Recurrent Neural Networks (RNNs)

UBC Master of Data Science program, 2022-23

Instructor: Varada Kolhatkar

## Lecture plan, imports, LO

### Lecture plan 

- Left-over iClicker questions from lecture 5
- RNNs motivation
- RNN forward pass
- Break
- iClicker questions
- RNN training 
- RNN architectures
- Final comments and summary

<br><br>

## Imports

In [1]:
import IPython
from IPython.display import HTML, display

<br><br>

### Learning outcomes

From this lecture you will be able to 

- Explain the motivation to use RNNs. 
- Explain how an RNN differs from a feed-forward neural network. 
- Define vanilla or simple RNNs. 
- Explain three weight matrices in RNNs. 
- Explain parameter sharing in RNNs. 
- Explain how states and outputs are calculated in the forward pass of an RNN. 
- Explain the backward pass in RNNs at a high level. 
- Specify different architectures of RNNs and explain how these architectures are used in the context of NLP applications.
- Broadly explain character-level text generation with RNNs.
- Specify the shapes of weight matrices in RNNs.
- Carry out forward pass with RNNs in `PyTorch`.
- Explain stacked RNNs and bidirectional RNNs and the difference between the two.

<br><br>

## Motivation

### RNN-generated music! 

- [Magenta PerformanceRNN](https://www.youtube.com/watch?v=dMhQalLBXIU)

In [2]:
# RNN-generated music: Magenta PerformanceRNN demo video
url = "https://www.youtube.com/embed/dMhQalLBXIU"
# IPython.display.IFrame(url, width=500, height=300)

- Language is an inherently sequential phenomenon.
- This temporal nature of language is reflected in the metaphors used to describe language 
    - *flow of conversation*, *news feeds*, or *twitter streams*

### Fixed-length input

- ML algorithms we have seen in 571, 572, and 573 work with fixed length input.  
    - SVM
    - Logistic Regression
    - Multi-layer Perceptron

- Example of fixed length input
$$X = \begin{bmatrix}1 & 0.8 & \ldots & 0.3\\ 0 & 0 &  \ldots & 0.4\\ 1 & 0.2 &  \ldots & 0.8 \end{bmatrix}$$ 

$$y = \begin{bmatrix}1 \\ 0 \\ 1 \end{bmatrix}$$

### Fixed-length input

- When we used these models for sentiment analysis we created a **fixed size** input representation using `CountVectorizer`, where we had simultaneous access to all aspects of the input. 

$$X = \begin{bmatrix}\text{"@united you're terrible. You don't understand safety"}\\ \text{"@JetBlue safety first !! #lovejetblue"}\\ \text{"@SouthwestAir truly the best in #customerservice!"}\\ \end{bmatrix} \text{ and } y = \begin{bmatrix}0 \\ 1 \\ 1 \end{bmatrix} $$ 
<br><br>
$$X_{counts} = \begin{bmatrix}1 & 3 & \ldots & 2\\ 1 & 0 & \ldots & 0\\ 0 & 2 & \ldots & 1\end{bmatrix} \text{ and } y = \begin{bmatrix}0 \\ 1 \\ 1 \end{bmatrix}$$

### Sentiment analysis using feed-forward neural networks 

- Reminder: In feed-forward neural networks, 
    - all connections flow forward (no loops)
    - each layer of hidden units is fully connected to the next
- We pass a fixed-size vector representation of text (e.g., a representation created with `CountVectorizer`) as input. 
- We lose the temporal aspect of text in this representation. 
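
To make the loss of word order concrete, here is a minimal sketch that uses plain Python word counts as a stand-in for `CountVectorizer`. The two sentences are made up for illustration. They mean different things, yet their count representations are identical.

```python
from collections import Counter

# Two hypothetical sentences with the same words in a different order.
s1 = "the dog bit the man"
s2 = "the man bit the dog"

# A bag-of-words representation only records how often each word occurs.
bow1 = Counter(s1.split())
bow2 = Counter(s2.split())

print(bow1)
print(bow1 == bow2)  # True: the order information is gone
```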

### How about using Markov models? 

- They have some temporal aspect. 

![](img/Markov_assumption.png)

<!-- <center> -->
<!-- <img src="img/Markov_assumption.png" height="550" width="550">  -->
<!-- </center> -->

### Recall language modeling task 

- Recall the task of predicting the next word given a sequence. 
- What's the probability of an upcoming word?  
    - $P(w_t|w_1,w_2,\dots,w_{t-1})$
    
<blockquote>
    I am studying medicine at UBC because I want to work as a ___.
</blockquote>


- Recall that when we used Markov models for this task, we made the Markov assumption. 
    - Markov model: $P(w_t|w_1,w_2,\dots,w_{t-1}) \approx P(w_t|w_{t-1})$
    - Markov model with more context: $P(w_t|w_1,w_2,\dots,w_{t-1}) \approx P(w_t|w_{t-2}, w_{t-1})$ 
- These models have no memory beyond the previous $n$ steps (typically 2 or 3), and as $n$ grows, data sparsity becomes a problem: most long n-grams are never observed in the training data.  
- Also, they have huge RAM requirements because you have to store all ngrams. 
- Would a Markov model with $n=5$ predict the correct words in the following cases? 
<blockquote>
    I am studying medicine at UBC because I want to work as a <b>__</b>.<br>
    I am studying law at UBC because I want to work as a <b>__</b>.<br>
    I am studying history at UBC because I want to work as a <b>__</b>.     
</blockquote>
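
A quick way to check this is to look at the context such a model actually conditions on. The sketch below assumes an n-gram model with $n=5$ that conditions on the previous $n-1 = 4$ tokens; `ngram_context` is a hypothetical helper written for this illustration. All three prefixes collapse to the same context, so the model cannot tell them apart.

```python
# Hypothetical helper: an n-gram model conditions on the last n-1 tokens only.
def ngram_context(tokens, n):
    return tuple(tokens[-(n - 1):])

prefixes = [
    "I am studying medicine at UBC because I want to work as a",
    "I am studying law at UBC because I want to work as a",
    "I am studying history at UBC because I want to work as a",
]

# Collect the distinct contexts the model would see for the three prefixes.
contexts = {ngram_context(p.split(), n=5) for p in prefixes}
print(contexts)  # {('to', 'work', 'as', 'a')}
```

The informative word (*medicine*, *law*, or *history*) sits well outside the window, so the predicted distribution over the next word is the same in all three cases.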



### RNNs motivation 

- RNNs can help us with this limited memory problem!
- **RNNs are a kind of neural network model which use hidden units to retain information over time.**  
- Unlike Markov models, this approach does not impose a fixed-length limit on this prior context; the context embodied in the previous hidden layer can include information extending back to the beginning of the sequence.
- Condition the neural network on all previous time steps. 

- Can a feedforward network handle sequences effectively?

![](img/feedforwardNN.png)

<!-- <img src="img/feedforwardNN.png" height="400" width="400">  -->

[Source](https://web.stanford.edu/~jurafsky/slp3/9.pdf)

Feedforward networks are not inherently designed to handle sequences because they lack the ability to capture temporal dependencies. But it is possible to incorporate some context, for example, by creating n-gram features with `CountVectorizer`.


### RNN intuition: Example

- How can a temporal dimension be added to a feedforward neural network?
- For word representation with a vector of size 4, a single feedforward neural network can be used for prediction.
- For 2 words, two separate feedforward neural networks can be used together.

![](img/RNN-intro.png)

<!-- <img src="img/RNN-intro.png" height="400" width="400">  -->

(Credit: [Stanford CS224d slides](http://cs224d.stanford.edu/lectures/CS224d-Lecture8.pdf))

How to connect multiple feedforward networks? 
- **Make connections between hidden layers**.
- The network typically consists of input, hidden layer, and output. The hidden layer is connected to itself for recurrent connections.
- Sequences can be processed by presenting one element at a time to the network.

<br><br><br><br>

## RNN details

### RNN presentations

- Unrolled presentation 

![](img/RNN-intro.png)

<!-- <img src="img/RNN-intro.png" height="400" width="400">  -->


- Recursive presentation

![](img/RNN_recursive_2.png)
<!-- <img src="img/RNN_recursive_2.png" height="200" width="200">  -->

<!-- <center> -->
<!-- <img src="img/RNN_recursive_2.png" height="300" width="300">  -->
<!-- </center>      -->

- The key distinction between non-recurrent and recurrent architectures is **the inclusion of a new set of weights connecting the previous hidden layer to the current hidden layer**.
- The hidden layer from the previous time step acts as a form of "memory" that influences decisions made at later time steps.
- These weights determine how the network incorporates the previous context when computing output for the current input.

### RNN as a graphical model

- RNNs can be visualized as a graphical model. The states below are the hidden layers in each time step.  
    - Somewhat similar to hidden Markov models (HMMs) 
    - But a hidden state in an RNN is continuous valued, high dimensional, and much richer. 
- Each state is a function of the previous state and the input.
- A state contains information about the whole past sequence. 
    - $h_t = g(x_t, x_{t-1}, \dots, x_2, x_1)$ 

![](img/RNN-dynamic-model.png)

<!-- <img src="img/RNN-dynamic-model.png" height="400" width="400">  -->

### RNN as a feedforward neural network
- Adding a temporal dimension and the recursion makes RNNs appear complex. But they are not all that different from standard feedforward neural networks. 
- Given an input vector and the values for the hidden layer from the previous time step we are still performing standard feedforward calculations. 
- The most significant change lies in the new set of weights $U$ that connect the hidden layer from the previous time step to the current hidden layer. 
- As with the other weights in the network, these connections are trained via a variant of backpropagation.

![](img/RNN-as-FFNN.png)

<!-- <img src="img/RNN-as-FFNN.png" height="500" width="500">  -->

[Source](https://web.stanford.edu/~jurafsky/slp3/9.pdf)

### Parameter sharing

- What are the parameters of this model? There are three weight matrices. 
    - Input to hidden weight matrix: $W$
    - Hidden to output weight matrix: $V$    
    - Hidden to hidden weight matrix: $U$
    
- The key point in RNNs: **All weights between time steps are shared.**   

### Dimensionality of different weight matrices
Let's consider an example: 
- Suppose input vector $x_t$ is of size 300 (i.e., $x_t \in \mathbb{R}^{300}$)
- Suppose the hidden state vector is of size 100 (memory of the network) (i.e., $h_t \in \mathbb{R}^{100}$)
- Suppose the output vector $y_t$ is of size 60 (i.e., $y_t \in \mathbb{R}^{60}$)
- $W_{100 \times 300}$, $V_{60\times 100}$, $U_{100\times 100}$ 

![](img/RNN-as-FFNN.png)

<!-- <img src="img/RNN-as-FFNN.png" height="500" width="500">  -->

[Source](https://web.stanford.edu/~jurafsky/slp3/9.pdf)

- Input size: Suppose $x \in \mathbb{R}^{d_{in}}$
- Output size: Suppose $y \in \mathbb{R}^{d_{out}}$
- Hidden size: Suppose $h \in \mathbb{R}^{d_h}$
- Three kinds of weights: $W_{d_{h}\times d_{in}}$, $V_{d_{out}\times d_{h}}$, $U_{d_h\times d_h}$    
> You may see transpose of these matrices in some notations.
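
The shapes in the example above can be checked with a small NumPy sketch. The random values, the choice of `tanh` for $g$, and the omission of bias terms are assumptions made only to keep the illustration short.

```python
import numpy as np

d_in, d_h, d_out = 300, 100, 60  # sizes from the example above

rng = np.random.default_rng(0)
W = rng.normal(size=(d_h, d_in))   # input to hidden
U = rng.normal(size=(d_h, d_h))    # hidden to hidden
V = rng.normal(size=(d_out, d_h))  # hidden to output

x_t = rng.normal(size=(d_in,))     # one input vector
h_prev = np.zeros(d_h)             # previous hidden state

h_t = np.tanh(U @ h_prev + W @ x_t)  # shape (100,)
y_scores = V @ h_t                   # shape (60,), before any softmax

# Total number of weights (biases omitted); independent of sequence length.
n_weights = W.size + U.size + V.size
print(h_t.shape, y_scores.shape, n_weights)  # (100,) (60,) 46000
```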

<br><br><br><br>

## Forward pass in RNNs
- The forward inference in RNNs is very similar to what you have seen with feedforward networks. 
- Given an input $x_t$ at timestep $t$, how do we compute the new hidden state $h_{t}$ and output $y_t$? 

![](img/RNN-dynamic-model.png)

<!-- <img src="img/RNN-dynamic-model.png" height="400" width="400">  -->

#### Computing the new state $h_t$

- Multiply the input $x_t$ with the weight matrix between input and hidden layer ($W$) and the state or the hidden layer from the previous time step $h_{t-1}$ with the weight matrix between hidden layers ($U$). 
- Add these values together. 
- Add the bias vector and pass the result through a suitable activation function $g$. 

$$
h_t = g(U_{d_h \times d_h}(h_{t-1})_{d_h \times 1} + W_{d_{h} \times d_{in}} (x_t)_{d_{in} \times 1}  + b_1)\\
$$ 

<!-- <center> -->
<!-- <img src="img/RNN_dynamic_model.png" height="800" width="800">  -->
<!-- </center>     -->

#### Computing the output $y_t$

- Once we have the value for the new state $h_t$, we can calculate the output vector $y_t$ by multiplying $h_t$ with the weight matrix $V$ between the hidden layer and the output layer, adding the bias vector, and applying an appropriate activation function $f$ to the result.  

$$
y_t = f(V_{d_{out} \times d_{h}} (h_t){_{d_h \times 1}} + b_2)
$$ 

![](img/RNN-dynamic-model.png)

<!-- <img src="img/RNN-dynamic-model.png" height="400" width="400">  -->

- Typically, we are interested in soft classification. So computing $y_t$ involves a softmax computation which provides a probability distribution over the possible output classes. 

$$
y_t = \text{softmax}(Vh_t + b_2)
$$ 


### Summary 

So in the forward pass we compute the new state $h_t$ and the output $y_t$ for all time steps in a sequence, as shown below.  

$$
h_t = g(Uh_{t-1} + Wx_t + b_1)\\
y_t = \text{softmax}(Vh_t + b_2)
$$ 

![](img/RNN-dynamic-model.png)
<!-- <center> -->
<!-- <img src="img/RNN-dynamic-model.png" height="400" width="400">  -->
<!-- </center>     -->

### Forward pass pseudo code

We compute this for the full sequence. 

- Given: $x$, network
- $h_0 = 0$
- for $t$ in 1 to length(input sequence $x$)
    - $h_t = g(Uh_{t-1} + Wx_t + b_1)$
    - $y_t = \text{softmax}(Vh_t + b_2)$

Note that the matrices $U$, $V$ and $W$ are **shared across time** and new values for $h_t$ and $y_t$ are calculated at each time step.
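
The pseudo code translates almost line by line into NumPy. This is a minimal sketch, not a reference implementation: `rnn_forward` and `softmax` are names introduced here, `tanh` is assumed for $g$, and the weights are random rather than trained.

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def rnn_forward(xs, W, U, V, b1, b2):
    """Run a simple RNN over a sequence xs of shape (seq_len, d_in)."""
    h = np.zeros(U.shape[0])               # h_0 = 0
    hs, ys = [], []
    for x_t in xs:                         # one time step per input vector
        h = np.tanh(U @ h + W @ x_t + b1)  # new hidden state h_t
        y = softmax(V @ h + b2)            # output distribution y_t
        hs.append(h)
        ys.append(y)
    return np.stack(hs), np.stack(ys)

rng = np.random.default_rng(0)
d_in, d_h, d_out, seq_len = 4, 3, 2, 5
W = rng.normal(size=(d_h, d_in))
U = rng.normal(size=(d_h, d_h))
V = rng.normal(size=(d_out, d_h))
b1, b2 = np.zeros(d_h), np.zeros(d_out)

xs = rng.normal(size=(seq_len, d_in))
hs, ys = rnn_forward(xs, W, U, V, b1, b2)
print(hs.shape, ys.shape)  # (5, 3) (5, 2)
```

The same `W`, `U`, and `V` are reused inside the loop at every step; only `h` changes.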

### RNN Forward pass with PyTorch

- See the documentation [here](https://pytorch.org/docs/stable/generated/torch.nn.RNN.html).

In [3]:
import torch
import torch.nn as nn
from torch.nn import RNN

#### Creating an RNN object 

We are creating an RNN with 
- only one hidden layer 
- input of size 20 (e.g., imagine word vectors of size 20)
- hidden layer of size 10 

In [4]:
rnn = nn.RNN(20, 10, 1)  # input size, hidden_size, number of layers

#### Input

- The input is going to be sequences (e.g., sequences of words)
- We need to provide the sequence length and the size of each input vector. 
- For example, suppose you have the following sequence and you are representing each word with a 20-dimensional word vector, then your sequence length is going to be 5 and input size is going to be 20.  

> Cherry blossoms are beautiful .

In [5]:
inp = torch.randn(5, 20)  # sequence length, input size
inp

tensor([[ 1.5098, -0.1926, -0.4484, -1.4965,  2.1181, -0.1226, -0.3528, -1.5821,
          0.3901,  0.1464,  0.1158, -1.2263, -0.9533, -0.0492, -0.9606,  0.0234,
         -0.3946, -1.2163,  0.3433,  0.3573],
        [ 0.2797,  0.7981,  1.0023,  0.6701,  0.6317,  0.4525, -1.5256,  0.7832,
         -0.7206, -0.1639, -0.3042,  0.8411, -0.8911, -1.6665, -1.1807, -0.4542,
          1.0172,  1.8657,  0.4950, -1.1755],
        [ 0.5938,  0.7645,  0.1083,  0.2653,  1.3708, -0.6390, -1.8478, -0.2760,
          0.5242, -0.5039,  0.0874, -1.1764, -1.0408,  0.5149, -1.0565,  0.8938,
          0.1098, -1.0632,  0.3263,  0.5102],
        [ 0.6355, -1.3617, -0.3484,  1.9830,  0.6681,  0.0059,  0.3680,  0.1266,
          0.3448, -0.9010,  0.2430,  1.5386,  0.0574,  0.7943,  1.5282,  0.4070,
          0.7298, -0.6993,  0.1714,  0.3751],
        [-0.4997,  1.0177, -0.3549,  0.1806,  0.4644,  0.5211, -1.3615,  0.0337,
         -0.7859, -0.1764,  0.3620, -0.4455, -0.5399,  1.3742,  0.2995,  1.7460,
      

#### Initial hidden state

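- At the 0th time step, we do not have anything to remember. The pseudo code above uses $h_0 = 0$, which is also what PyTorch uses when no initial state is passed. Here we initialize the hidden state randomly for illustration. 
- Let's initialize `h0`. 
- The shape of `h0` is (number of hidden layers, hidden size). 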

In [6]:
h0 = torch.randn(1, 10)  # number of hidden layers, hidden size

In [7]:
h0

tensor([[-0.7295, -0.1498,  0.8130, -1.0384,  0.8369, -0.3162,  0.5140, -1.1429,
          0.9118, -0.0173]])

#### Calculating new hidden states and output

In [8]:
# PyTorch calculates the output and new hidden states for us for all time steps.
output, hn = rnn(inp, h0)

In [9]:
hn  # hidden state for the last time step in the sequence

tensor([[-0.7839,  0.4937,  0.1763, -0.2922,  0.3116,  0.4545, -0.3453, -0.2193,
         -0.8915, -0.2134]], grad_fn=<SqueezeBackward1>)

In [10]:
output

tensor([[ 0.6763, -0.3148,  0.7597,  0.9100, -0.6214,  0.8229, -0.0993,  0.5824,
         -0.5741,  0.4514],
        [-0.3105,  0.8508,  0.2145,  0.5286, -0.6977, -0.8278,  0.0504, -0.5231,
          0.9345,  0.6215],
        [ 0.2092,  0.4821,  0.4495,  0.7968, -0.3406,  0.3009, -0.6349, -0.0692,
         -0.5152, -0.0124],
        [-0.2362, -0.4294,  0.2354,  0.2154,  0.6586, -0.5720,  0.0145, -0.2873,
         -0.5269, -0.4186],
        [-0.7839,  0.4937,  0.1763, -0.2922,  0.3116,  0.4545, -0.3453, -0.2193,
         -0.8915, -0.2134]], grad_fn=<SqueezeBackward1>)

In [11]:
output.shape 

torch.Size([5, 10])

By default, `tanh` activation function is used.  
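
As a sanity check, the hidden states returned by `nn.RNN` can be reproduced by hand with the forward-pass equation. The sketch below recreates a small RNN with a fixed seed (so the numbers differ from the cells above) and reads the weights from the module's `weight_ih_l0`, `weight_hh_l0`, `bias_ih_l0`, and `bias_hh_l0` attributes. Note that PyTorch keeps two bias vectors, which together play the role of $b_1$.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(20, 10, 1)
inp = torch.randn(5, 20)  # sequence length, input size
h0 = torch.randn(1, 10)   # number of layers, hidden size
output, hn = rnn(inp, h0)

W = rnn.weight_ih_l0  # input to hidden, shape (10, 20)
U = rnn.weight_hh_l0  # hidden to hidden, shape (10, 10)
b_ih, b_hh = rnn.bias_ih_l0, rnn.bias_hh_l0

# Manual forward pass: h_t = tanh(W x_t + b_ih + U h_{t-1} + b_hh)
h = h0[0]
manual = []
for x_t in inp:
    h = torch.tanh(W @ x_t + b_ih + U @ h + b_hh)
    manual.append(h)
manual = torch.stack(manual)

# The two computations should agree up to floating-point error.
print(torch.allclose(manual, output, atol=1e-5))
# The last row of `output` is the final hidden state `hn`.
print(torch.allclose(output[-1], hn[0]))
```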

#### Shapes of the weight matrices 

What would be the shapes of weight matrices? 
- Input to hidden ($W$)
- Hidden to hidden ($U$)

<br><br><br><br>

Weight matrix $W$ between input to hidden layer: 

In [12]:
inp.shape

torch.Size([5, 20])

In [13]:
rnn.state_dict()["weight_ih_l0"].shape # (hidden, input)

torch.Size([10, 20])

Weight matrix $U$ between hidden layer in time step $t-1$ to hidden layer in time step $t$: 

In [14]:
h0.shape

torch.Size([1, 10])

In [15]:
rnn.state_dict()["weight_hh_l0"].shape # (hidden, hidden)

torch.Size([10, 10])

Note that the `rnn` above is calculating the output of the hidden layer at each time step but we are not calculating $y_t$ in each time step $t$. 
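
To obtain $y_t$ we can put a linear layer and a softmax on top of the hidden states. This is a sketch under assumptions: `out_layer` is a name introduced here to play the role of $V$ and $b_2$, and the output size of 3 is arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(20, 10, 1)
out_layer = nn.Linear(10, 3)  # hidden size -> number of output classes (arbitrary)

inp = torch.randn(5, 20)
hidden_states, hn = rnn(inp)  # h0 defaults to zeros when omitted

# y_t = softmax(V h_t + b_2), computed for all time steps at once
y = torch.softmax(out_layer(hidden_states), dim=-1)
print(y.shape)  # torch.Size([5, 3])
```

Each row of `y` is a probability distribution over the 3 classes for one time step.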

<br><br><br><br>

![](img/eva-coffee.png)

<br><br><br><br>

## ❓❓ Questions for you

### Exercise 6.1: Select all of the following statements which are **True** (iClicker)

- (A) RNNs pass along information between time steps through hidden layers.
- (B) RNNs are appropriate only for text data.
- (C) At each time step in an RNN, we use a unique hidden state (`h`), a unique input (`X`), but we reuse the same `U` matrix of weights.
- (D) The number of parameters in an RNN language model would grow with the number of time steps.
- (E) The hidden state at the current time step in an RNN depends only on the input data at the current time step and the hidden state from the previous time step.
<br><br><br><br>

```{admonition} Exercise 6.1: V's Solutions!
:class: tip, dropdown
- (A) True
- (B) False
- (C) True
- (D) False
- (E) True
```

## Training RNNs

- An RNN is a **supervised machine learning model**. Similar to feedforward networks, we'll use 
    - a training set
    - a loss function  
    - backpropagation to obtain the gradients needed to adjust the weights in these networks 

- We have 3 sets of weights (and the corresponding bias terms) to update
    - $W \rightarrow $ the weight matrix between input layer and hidden layer
    - $U \rightarrow $ the weight matrix between the previous hidden layer and the current hidden layer
    - $V \rightarrow $ the weight matrix between hidden layer and output layer

We want to assess the error occurring at time step $t$.

- To compute the loss function for the output at time $t$ we need the hidden layer from time $t-1$.
- The hidden layer at time $t$ influences both the output at time $t$ and the hidden layer at time $t+1$. 

![](img/RNN_loss.png)

<!-- <img src="img/RNN_loss.png" height="1500" width="1500">  -->

[Credit](http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture10.pdf)

- To assess the error attributable to $h_t$, we need to know its influence on both the current output and the ones that follow.  
- This is different from the usual backpropagation, so we need to tailor the backpropagation algorithm to this situation. In RNNs we use a generalized version of backpropagation called **Backpropagation Through Time (BPTT)**. 



- The overall loss is the summation of losses at each time step. 
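
In PyTorch we do not implement BPTT by hand: autograd records the unrolled computation and backpropagates through every time step. The sketch below illustrates the summed loss with made-up targets and an arbitrary output size of 3. `out_layer` is a name introduced here for the hidden-to-output layer.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(20, 10, 1)
out_layer = nn.Linear(10, 3)
loss_fn = nn.CrossEntropyLoss(reduction="sum")  # sum of per-time-step losses

inp = torch.randn(5, 20)                 # sequence length, input size
targets = torch.tensor([0, 2, 1, 1, 0])  # made-up gold label for each time step

hidden_states, _ = rnn(inp)
logits = out_layer(hidden_states)  # (5, 3); the loss applies log-softmax internally
loss = loss_fn(logits, targets)
loss.backward()                    # gradients flow back through all time steps

print(loss.item())
print(rnn.weight_hh_l0.grad.shape)  # torch.Size([10, 10])
```

Because $U$ is shared across time, its single gradient tensor accumulates contributions from every time step.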

### RNN code in 112 lines of Python

- See [the code](https://gist.github.com/karpathy/d4dee566867f8291f086) for the above in ~112 lines of Python written by Andrej Karpathy. The only dependency is `numpy`. 

<br><br><br><br>

## RNN applications 

### What can we do with RNNs?

- We have seen the basic RNN architecture below. 

![](img/RNN_introduction.png)

<!-- <img src="img/RNN-intro.png" height="500" width="500">  -->

- But a number of architectures are possible, which makes them a very rich family of models.  

### RNN architectures

- A number of possible RNN architectures

![](img/RNN_architectures.png)

<!-- <center> -->
<!-- <img src="img/RNN_architectures.png" height="1500" width="1500">  -->
<!-- </center>     -->

[source](http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture10.pdf)

Let's see how we can apply them to three different types of NLP tasks:
- Sequence labeling (e.g., POS tagging)
- Sequence classification (e.g. sentiment analysis or text classification)
- Text generation

### Sequence labeling 

- The task is to assign a label from a fixed set of labels to each element in the sequence.  
    - Part-of-speech tagging 
    - Named entity recognition
- Many-to-many architecture
- Inputs are usually pre-trained word embeddings and outputs are tag probabilities generated by a softmax layer over the given tagset. 
- The RNN block is an abstraction representing an unrolled simple RNN consisting of an input layer, hidden layer and output layer at each time step and shared weight matrices $U$, $W$, and $V$. 

![](img/RNN_seq_labeling.png)

<!-- <img src="img/RNN_seq_labeling.png" height="600" width="600">  -->

[Source](https://web.stanford.edu/~jurafsky/slp3/9.pdf)
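
A many-to-many tagger along these lines could be sketched as follows. `RNNTagger` and all sizes are hypothetical, and the embedding layer is randomly initialized here, whereas in practice it would typically be loaded from pre-trained word embeddings.

```python
import torch
import torch.nn as nn

class RNNTagger(nn.Module):
    """Hypothetical sequence labeler: one tag distribution per token."""

    def __init__(self, vocab_size, emb_dim, hidden_size, n_tags):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.RNN(emb_dim, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, n_tags)

    def forward(self, token_ids):
        emb = self.embedding(token_ids)   # (batch, seq_len, emb_dim)
        hidden_states, _ = self.rnn(emb)  # (batch, seq_len, hidden_size)
        return self.out(hidden_states)    # (batch, seq_len, n_tags) tag scores

torch.manual_seed(0)
tagger = RNNTagger(vocab_size=50, emb_dim=8, hidden_size=16, n_tags=5)
token_ids = torch.tensor([[3, 17, 42, 8, 1]])  # one made-up sentence of 5 token ids

tag_probs = torch.softmax(tagger(token_ids), dim=-1)
predicted_tags = tag_probs.argmax(dim=-1)      # most probable tag per token
print(tag_probs.shape, predicted_tags.shape)   # torch.Size([1, 5, 5]) torch.Size([1, 5])
```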

### Sequence classification

- We have done text classification such as sentiment analysis or spam identification before with traditional ML models, where we ignored the temporal nature of language.  
- These are actually sequence classification tasks where we want to map a sequence of text to a label from a small set of labels (e.g., positive, negative, neutral). 
- To apply RNNs in this setting, we take the text to be classified and pass one word at a time generating a new hidden layer at each time step. We can then take the hidden layer from the last time step, $h_n$, which has the compressed representation of the entire sequence. We pass this representation through a feedforward neural network which chooses a class via a softmax.     
- This is a many-to-one RNN architecture. 

![](img/RNN_classification.png)

<!-- <img src="img/RNN_classification.png" height="600" width="600">  -->

[Source](https://web.stanford.edu/~jurafsky/slp3/9.pdf)

- Similar to the sequence labeling example, we can also pass word embeddings as input. 
- Note that in this approach we do not have intermediate outputs at each time step and we do not need to compute $y$ at each time step. We only have an output at the last time step. 
- So there won't be loss terms associated with each time step. 
- The loss function used to train the weights in the network is entirely based on the final text classification task. 
- We will compare the output of the softmax layer of the feed-forward classifier and the actual $y$ to calculate the loss (e.g., cross-entropy loss) and this loss will drive the training. 
- The error signal is backpropagated all the way through the weights in the feed-forward classifier, through its input, which is the hidden layer output of the last time step, through the three sets of RNN weights: $U$, $V$, and $W$.  
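
The many-to-one setup can be sketched in PyTorch as below. `RNNClassifier`, the sizes, the token ids, and the labels are all made up for illustration. The point is only to show that a single loss at the end of the sequence still produces gradients for the recurrent weights.

```python
import torch
import torch.nn as nn

class RNNClassifier(nn.Module):
    """Hypothetical sequence classifier built on the last hidden state."""

    def __init__(self, vocab_size, emb_dim, hidden_size, n_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.RNN(emb_dim, hidden_size, batch_first=True)
        self.classifier = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, n_classes),
        )

    def forward(self, token_ids):
        emb = self.embedding(token_ids)  # (batch, seq_len, emb_dim)
        _, hn = self.rnn(emb)            # hn: (num_layers, batch, hidden_size)
        return self.classifier(hn[-1])   # (batch, n_classes) class scores

torch.manual_seed(0)
model = RNNClassifier(vocab_size=50, emb_dim=8, hidden_size=16, n_classes=3)
token_ids = torch.tensor([[3, 17, 42, 8, 1], [5, 9, 2, 30, 11]])  # two made-up texts
labels = torch.tensor([1, 0])                                     # made-up labels

logits = model(token_ids)
loss = nn.functional.cross_entropy(logits, labels)  # one loss per text, none per step
loss.backward()

print(logits.shape)                              # torch.Size([2, 3])
print(model.rnn.weight_hh_l0.grad is not None)   # True: the error reached the RNN weights
```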

### Text generation

- The idea is similar to text generation with Markov models. 
- We start with a seed. We then continue to sample words conditioned on our previous choices until we reach a pre-determined desired length of a sequence or an end-of-sequence token is generated.
- In the context of RNNs
    - We start with a seed. In the example below, we are starting with a special beginning of sequence token \<s\>. 
    - We use the embedding representation of this token and pass it to the RNN. 
    - We sample a word in the output from the softmax distribution.  
    - We use this sampled word as the input in the next time step and then sample the next word in the same fashion. 
    - We continue this until the fixed length limit or the end of the sentence marker is reached. 

![](img/RNN_generation.png)

<!-- <img src="img/RNN_generation.png" height="600" width="600">  -->
    
- The same idea can be used for music generation. 

[Source](https://web.stanford.edu/~jurafsky/slp3/9.pdf)
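
The sampling loop described above can be sketched with `nn.RNNCell`, which performs one time step at a time. Everything here is a toy assumption: the six-word vocabulary, the layer sizes, and the `generate` function are made up, and the model is untrained, so the sampled words are random. Only the control flow is the point.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab = ["<s>", "</s>", "cherry", "blossoms", "are", "beautiful"]  # toy vocabulary
emb_dim, hidden_size = 8, 16

embedding = nn.Embedding(len(vocab), emb_dim)
rnn_cell = nn.RNNCell(emb_dim, hidden_size)     # a single RNN time step
out_layer = nn.Linear(hidden_size, len(vocab))

def generate(max_len=10):
    token = vocab.index("<s>")        # start from the beginning-of-sequence token
    h = torch.zeros(1, hidden_size)   # initial hidden state
    words = []
    with torch.no_grad():
        for _ in range(max_len):
            x = embedding(torch.tensor([token]))         # (1, emb_dim)
            h = rnn_cell(x, h)                           # (1, hidden_size)
            probs = torch.softmax(out_layer(h), dim=-1)  # (1, vocab size)
            # Sample the next word, then feed it back in at the next step.
            token = torch.multinomial(probs, num_samples=1).item()
            if vocab[token] == "</s>":                   # end-of-sequence marker
                break
            words.append(vocab[token])
    return words

print(generate())
```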

You can find a toy example of RNN text generation with PyTorch in [AppendixC](AppendixC-toy-RNN.ipynb).

### Image captioning 

- The same idea can be used for more complicated applications such as machine translation, summarization, or image captioning. 
- The idea is to prime the generation component with an appropriate context. 
- For example, in image captioning we can prime the generation component with a meaningful  representation of an image given by the last layer in CNNs.  
- We'll talk more about this application next week. 

![](img/image_captioning.png)

<!-- <center> -->
<!-- <img src="img/image_captioning.png" width="1000" height="1000"> -->
<!-- </center> -->
    
[Source](https://cs.stanford.edu/people/karpathy/sfmltalk.pdf)

<br><br><br><br>

## Stacked and Bidirectional RNN architectures

- We have seen a simple RNN with one hidden layer. 
- But RNNs are quite flexible. 
- Two common ways to create complex networks by combining RNNs are:
    - Stacked RNNs
    - Bidirectional RNNs

### Stacked RNNs 

- In the examples thus far, the input of RNNs was a sequence of word or character embeddings. We were passing the output of the RNN layer to the output layer and the outputs have been vectors useful for predicting next words, tags, or sequence labels.  

![](img/RNN_seq_labeling.png)
<!-- <img src="img/RNN_seq_labeling.png" height="600" width="600">  -->

[Source](https://web.stanford.edu/~jurafsky/slp3/9.pdf)

- But nothing prevents us from using **the sequence of outputs from one RNN as an input sequence to another one**.
- These are called **stacked RNNs** which consist of multiple networks where the output of one layer serves as the input to a subsequent layer. 

![](img/RNN_stacked.png)


<!-- <img src="img/RNN_stacked.png" height="600" width="600">  -->

[Source](https://web.stanford.edu/~jurafsky/slp3/9.pdf)

- Stacked RNNs generally outperform single-layer networks. 
- The network learns a different level of abstraction at each layer. 
- You can optimize your network for number of layers for your specific application and dataset.  
- But remember that more layers means higher training cost. 
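
In PyTorch, stacking is controlled by the `num_layers` argument of `nn.RNN`. The sizes below reuse the earlier toy setting (input size 20, hidden size 10), and the input is random.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
stacked = nn.RNN(input_size=20, hidden_size=10, num_layers=2)
inp = torch.randn(5, 20)  # sequence length, input size

output, hn = stacked(inp)
print(output.shape)  # torch.Size([5, 10]): top-layer hidden state at each time step
print(hn.shape)      # torch.Size([2, 10]): final hidden state of each layer

# The second layer reads the first layer's hidden states, so its
# input-to-hidden matrix is (hidden, hidden) rather than (hidden, input).
print(stacked.weight_ih_l1.shape)  # torch.Size([10, 10])
```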

### Bidirectional RNNs 

- The RNN uses information from the prior context to make predictions at time $t$. 
- But in many applications (e.g., POS tagging) we do have access to the entire input sequence and knowing the context on the right of time $t$ can be useful. 
- For example, suppose you are doing POS tagging and you are at the token **Teddy** in the sequence. It will be useful to know the right context in order to make the decision on whether it should be tagged as a _noun_ or a _proper noun_.  

> He said , " Teddy Roosevelt was a great president ! "<br>

> He said , " Teddy bears are on sale ! "


- How can we use the words on the right of time step $t$ as context?  
- In the left-to-right RNN, the hidden state at time $t$ represents everything the network knows about the sequence up to that point. 
- Suppose $h_t^f$ denotes a hidden state at time $t$ representing everything the network has gleaned from the sequence so far. 
$$h_t^f = RNN_{forward}(x_1, x_2, \dots, x_t) $$
- We can also train the network in the reverse direction, from right to left, to take advantage of the right context. 
- With this approach the hidden state at time $t$, $h_t^b$ represents all the information we have learned about the sequence from time $t$ to the end of the sequence. 
$$h_t^b = RNN_{backward}(x_t, x_{t+1}, \dots, x_n) $$
- (Somewhat similar to the $\alpha$ and $\beta$ values in the forward and backward algorithms in HMMs.)

A **bidirectional RNN** combines two independent RNNs:
- One where the input is processed from the start to the end
- The other from the end to the start. 
- Each RNN will result in some representation of the input. 
- We then combine the two representations computed by two independent RNNs into a single vector which captures both the left and right contexts of an input at each point in time. 
- We can combine vectors by
    - Concatenating them, as shown in the picture below or
    - Element-wise addition 
    - Element-wise multiplication

![](img/bidirectional_seq_labeling.png)
<!-- <img src="img/bidirectional_seq_labeling.png" height="600" width="600">  -->

[Source](https://web.stanford.edu/~jurafsky/slp3/9.pdf)
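
With `nn.RNN`, passing `bidirectional=True` runs both directions and concatenates their hidden states at each time step. The sketch below uses random input and the earlier toy sizes. The slicing into `forward_states` and `backward_states` relies on PyTorch placing the forward half first in the last dimension.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
birnn = nn.RNN(input_size=20, hidden_size=10, bidirectional=True)
inp = torch.randn(5, 20)  # sequence length, input size

output, hn = birnn(inp)
print(output.shape)  # torch.Size([5, 20]): forward and backward states concatenated
print(hn.shape)      # torch.Size([2, 10]): final state of each direction

forward_states = output[:, :10]   # h_t^f for each time step
backward_states = output[:, 10:]  # h_t^b for each time step

# The forward RNN finishes at the last token, the backward RNN at the first.
print(torch.allclose(forward_states[-1], hn[0]))
print(torch.allclose(backward_states[0], hn[1]))
```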

- You can also use bidirectional RNNs for sequence classification. 
- Recall that in sequence classification we pass the final hidden state of the RNN as input to a subsequent feedforward classifier. 
- The problem with this approach is that the final hidden state reflects more information about the end of the sequence than its beginning. 
- Bidirectional RNNs provide a simple solution to this problem. We can create a final hidden state by combining hidden states of forward and backward passes so that the hidden state reflects information about both the beginning and end of the sequence. 

![](img/bidirectional_classification.png)
<!-- <img src="img/bidirectional_classification.png" height="600" width="600">  -->

[Source](https://web.stanford.edu/~jurafsky/slp3/9.pdf)
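
For classification, one option is to concatenate the final forward and backward hidden states and feed the result to a classifier. This is a minimal sketch with random input. `classifier`, the 3 output classes, and the single linear layer standing in for the feedforward classifier are assumptions for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden_size = 10
birnn = nn.RNN(input_size=20, hidden_size=hidden_size, bidirectional=True)
classifier = nn.Linear(2 * hidden_size, 3)  # takes the concatenated states

inp = torch.randn(5, 20)  # sequence length, input size
_, hn = birnn(inp)        # hn[0]: forward final state, hn[1]: backward final state

# Combine both directions so the vector reflects both ends of the sequence.
sequence_vec = torch.cat([hn[0], hn[1]], dim=-1)  # shape (20,)
probs = torch.softmax(classifier(sequence_vec), dim=-1)
print(sequence_vec.shape, probs.shape)  # torch.Size([20]) torch.Size([3])
```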

<br><br><br><br>

## Final comments and summary

### Important ideas to know 

- RNNs are supervised neural network models to process sequential data.
- The intuition is to put multiple feed-forward networks together and make connections between hidden layers.  
- They have feedback connections in their structure to "remember" previous inputs, when reading in a sequence. 
- In simple RNNs sequences are processed one element at a time. The output of each neural unit at time $t$ is based on the current input at $t$ and the hidden layer at time $t-1$.  

- In RNNs, the parameters are shared across different time steps.
- A generalized version of backpropagation called backpropagation through time is used for training the network. 
- In practice truncated backpropagation through time is used where we work through chunks. 
- A number of RNN architectures are possible. 
- Simple RNNs struggle to capture long-distance dependencies because of problems such as vanishing gradients.  
- In practice, more sophisticated variants such as LSTMs and GRUs are used. 

### Coming up ...
- Intuition of transformers

### Resources

- [Sequence processing with Recurrent Neural Networks](https://web.stanford.edu/~jurafsky/slp3/9.pdf) (The notes above are heavily based on this resource.)
- [The Unreasonable Effectiveness of Recurrent Neural Networks](http://karpathy.github.io/2015/05/21/rnn-effectiveness/)
- [Coursera: NLP sequence models](https://www.coursera.org/lecture/nlp-sequence-models/recurrent-neural-network-model-ftkzt)
- [RNN code in 112 lines of Python](https://gist.github.com/karpathy/d4dee566867f8291f086#file-min-char-rnn-py-L112)
