![](img/575_banner.png)

# Lecture 6: Introduction to Recurrent Neural Networks (RNNs)

UBC Master of Data Science program, 2021-22

Instructor: Varada Kolhatkar

## Lecture plan, imports, LO

### Lecture plan 

- Left-over iClicker questions from lecture 5
- RNNs motivation
- RNN inference
- Break
- iClicker questions
- RNN training 
- RNN architectures
- Final comments and summary

<br><br>

## Imports

In [1]:
import IPython
from IPython.display import HTML, display

<br><br>

### Learning outcomes

From this lecture you will be able to 

- Explain the motivation to use RNNs. 
- Explain how an RNN differs from a feed-forward neural network. 
- Define vanilla or simple RNNs. 
- Explain three weight matrices in RNNs. 
- Explain parameter sharing in RNNs. 
- Explain how states and outputs are calculated in the forward pass of an RNN. 
- Explain the backward pass in RNNs at a high level. 
- Specify different architectures of RNNs and explain how these architectures are used in the context of NLP applications. 

<br><br>

## Motivation

### RNN-generated music! 

- [Magenta PerformanceRNN](https://www.youtube.com/watch?v=dMhQalLBXIU)

In [2]:
# Magenta PerformanceRNN: an example of RNN-generated music
url = "https://www.youtube.com/embed/dMhQalLBXIU"
IPython.display.IFrame(url, width=500, height=300)

- Language is an inherently sequential phenomenon.
- This temporal nature of language is reflected in the metaphors used to describe language 
    - *flow of conversation*, *news feeds*, or *twitter streams*

### Fixed-length input

- ML algorithms we have seen in 571, 572, and 573 work with fixed length input.  
    - SVM
    - Logistic Regression
    - Multi-layer Perceptron

- Example of fixed length input
$$X = \begin{bmatrix}1 & 0.8 & \ldots & 0.3\\ 0 & 0 &  \ldots & 0.4\\ 1 & 0.2 &  \ldots & 0.8 \end{bmatrix}$$ 

$$y = \begin{bmatrix}1 \\ 0 \\ 1 \end{bmatrix}$$

### Fixed-length input

- When we used these models for sentiment analysis we created a **fixed size** input representation using `CountVectorizer`, where we had simultaneous access to all aspects of the input. 

$$X = \begin{bmatrix}\text{"@united you're terrible. You don't understand safety"}\\ \text{"@JetBlue safety first !! #lovejetblue"}\\ \text{"@SouthwestAir truly the best in #customerservice!"}\\ \end{bmatrix} \text{ and } y = \begin{bmatrix}0 \\ 1 \\ 1 \end{bmatrix} $$ 
<br><br>
$$X_{counts} = \begin{bmatrix}1 & 3 & \ldots & 2\\ 1 & 0 & \ldots & 0\\ 0 & 2 & \ldots & 1\end{bmatrix} \text{ and } y = \begin{bmatrix}0 \\ 1 \\ 1 \end{bmatrix}$$

### Sentiment analysis using feed-forward neural networks 

- Reminder: In feed-forward neural networks, 
    - all connections flow forward (no loops)
    - each layer of hidden units is fully connected to the next
- We pass a fixed-size vector representation of the text (e.g., a representation created with `CountVectorizer`) as input. 
- We lose the temporal aspect of text in this representation. 
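
To make the last point concrete, here is a minimal sketch. It uses Python's `collections.Counter` as a stand-in for `CountVectorizer`, and the two sentences are made up for illustration. Two sentences with different meanings end up with the same bag-of-words representation.

```python
from collections import Counter

# Two hypothetical sentences with the same words in a different order.
s1 = "the dog bit the man"
s2 = "the man bit the dog"

# A bag-of-words representation only keeps word counts.
bow1 = Counter(s1.split())
bow2 = Counter(s2.split())

print(bow1 == bow2)  # the counts are identical, so word order is lost
```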

In [3]:
# import mglearn

# display(mglearn.plots.plot_single_hidden_layer_graph())

### How about using Markov models? 

- They have some temporal aspect. 

![](img/Markov_assumption.png)

<!-- <center> -->
<!-- <img src="img/Markov_assumption.png" height="550" width="550">  -->
<!-- </center> -->

### Recall language modeling task 

- Recall the task of predicting the next word given a sequence. 
- What's the probability of an upcoming word?  
    - $P(w_t|w_1,w_2,\dots,w_{t-1})$
    
<blockquote>
    I am studying medicine at UBC because I want to work as a ___.
</blockquote>


### Language modeling: Why should we care?

Language modeling is a powerful idea in NLP that helps in many tasks.
- Machine translation 
    * P(In the age of data algorithms have the answer) > P(the age data of in algorithms answer the have)
- Spelling correction
    * My office is a 20  <span style="color:red">minuet</span> bike ride from my home.  
        * P(20 <span style="color:blue">minute</span> bike ride from my home) > P(20 <span style="color:red">minuet</span> bike ride from my home)
- Speech recognition 
    * P(<span style="color:blue">I read</span> a book) > P(<span style="color:red">Eye red</span> a book)

### Motivation: Language modeling task 

- Recall that when we used Markov models for this task, we made the Markov assumption. 
    - Markov model: $P(w_t|w_1,w_2,\dots,w_{t-1}) \approx P(w_t|w_{t-1})$
    - Markov model with more context: $P(w_t|w_1,w_2,\dots,w_{t-1}) \approx P(w_t|w_{t-2}, w_{t-1})$ 
- These models are 'memoryless' in the sense that they have no memory beyond the previous 2, 3, or at most $n$ steps, and as $n$ grows we run into a sparsity problem.  
- They also have large memory requirements because you have to store all the n-grams. 
- Would a Markov model with $n=5$ predict the correct words in the following cases? 
<blockquote>
    I am studying medicine at UBC because I want to work as a <b>__</b>.<br>
    I am studying law at UBC because I want to work as a <b>__</b>.<br>
    I am studying history at UBC because I want to work as a <b>__</b>.     
</blockquote>
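
A quick sketch of why the answer is no. It assumes that $n=5$ means the model conditions on the previous $n-1 = 4$ words. The context the model sees before the blank is then identical in all three sentences.

```python
sentences = [
    "I am studying medicine at UBC because I want to work as a",
    "I am studying law at UBC because I want to work as a",
    "I am studying history at UBC because I want to work as a",
]

n = 5  # assumption: an n-gram model conditions on the previous n - 1 words

# The context a 5-gram model sees when predicting the blank.
contexts = [tuple(s.split()[-(n - 1):]) for s in sentences]

for c in contexts:
    print(c)

# All three contexts are the same, so the model must make the same prediction.
print(len(set(contexts)) == 1)
```

The informative words (*medicine*, *law*, *history*) lie outside the window, so the model cannot distinguish the three cases.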



### RNNs motivation 

- RNNs can help us with this limited memory problem!
- **RNNs are a kind of neural network model which use hidden units to remember things over time.**   
- Critically, unlike Markov models, this approach does not impose a fixed-length limit on this prior context; the context embodied in the previous hidden layer can include information extending back to the beginning of the sequence.
- Condition the neural network on all previous time steps. 


### RNN intuition: Example

- Put a number of feedforward networks together.
- Suppose I have 1 word represented by a vector of size 4 and I want to predict something about that word. I can use one feedforward neural network. 
- Suppose I have 2 words. I can use 2 of these networks and put them together. 

![](img/RNN_two_feedforward.png)

<!-- <center> -->
<!-- <img src="img/RNN_two_feedforward.png" height="800" width="800">  -->
<!-- </center> -->

(Image credit: [learnopencv](https://www.learnopencv.com/understanding-feedforward-neural-networks/))    

### How do we put multiple feedforward networks together? 

- Put a number of feedforward networks together by making connections between the hidden layers.
- Process sequences by presenting one element at a time to the network.
- So we have an input, hidden layer, and an output and the hidden layer is connected to itself. 

![](img/RNN_introduction.png)

<!-- <center> -->
<!-- <img src="img/RNN_introduction.png" height="800" width="800">  -->
<!-- </center>     -->

(Credit: [Stanford CS224d slides](http://cs224d.stanford.edu/lectures/CS224d-Lecture8.pdf))

<br><br><br><br>

## RNN details

### RNN presentations

- Unrolled presentation 

![](img/RNN_introduction.png)

<!-- <center> -->
<!-- <img src="img/RNN_introduction.png" height="500" width="500">  -->
<!-- </center>  -->

- Recursive presentation

![](img/RNN_recursive_2.png)

<!-- <center> -->
<!-- <img src="img/RNN_recursive_2.png" height="300" width="300">  -->
<!-- </center>      -->

- We are adding a temporal dimension to the feed-forward neural network. 
- The hidden layer from the previous time step provides a form of **memory** which informs the decisions to be made at later points in time. 
- The main difference between non-recurrent architectures and recurrent architectures is the new set of weights that connect the hidden layer from the previous time step to the current hidden layer. 
- These weights determine how the network will make use of the previous context when calculating the output for the current input. 
- Similar to other weights in neural network models, these weights will be trained with backpropagation. 

![](img/RNN_introduction.png)

<!-- <img src="img/RNN_introduction.png" height="500" width="500">  -->

### RNN as a graphical model

- RNNs can be visualized as a graphical model. The states below are the hidden layers in each time step.  
    - Somewhat similar to hidden Markov models (HMMs) 
    - But a hidden state in an RNN is continuous valued, high dimensional, and much richer. 
- Each state is a function of the previous state and the input.
- A state contains information about the whole past sequence. 
    - $s_t = g(x_t, x_{t-1}, \dots, x_2, x_1)$ 

![](img/RNN_dynamic_model.png)

<!-- <center> -->
<!-- <img src="img/RNN_dynamic_model.png" height="800" width="800">  -->
<!-- </center>     -->

### Parameter sharing

- What are the parameters of this model? We have three weight matrices. 
    - Input to hidden weight matrix: $U$
    - Hidden to output weight matrix: $V$    
    - Hidden to hidden weight matrix: $W$
    
- The key point in RNNs: **We share all weights between time steps.**    

![](img/RNN_dynamic_model.png)

<!-- <center> -->
<!-- <img src="img/RNN_dynamic_model.png" height="800" width="800">  -->
<!-- </center>     -->

### RNN parameters

- Input size: Suppose $x \in \mathbb{R}^d$
- Output size: Suppose $y \in \mathbb{R}^q$
- Hidden size: Suppose $s \in \mathbb{R}^p$
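- Three kinds of weights: $U_{d\times p}$, $V_{p\times q}$, $W_{p\times p}$ (written as input size $\times$ output size; in column-vector products such as $Ux_t$, and in PyTorch, the matrices applied are the transposes, e.g., $p \times d$ for $U$)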

![](img/RNN_dynamic_model.png)

<!-- <img src="img/RNN_dynamic_model.png" height="800" width="800">  -->

### RNN parameters: Language modeling example

- Embedding size: 300, vocabulary size: 10,000
- Hidden layer size: 100 (memory of the network)
- $x_t \in \mathbb{R}^{300}$
- $y_t \in \mathbb{R}^{10000}$
- $s_t \in \mathbb{R}^{100}$
- $U_{300\times 100}$, $V_{100\times 10000}$, $W_{100\times 100}$
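
As a rough sketch, we can count the parameters of this example. The count below includes the two bias vectors $b_1$ and $b_2$ that appear in the forward pass equations, and it assumes the embedding table itself is not counted.

```python
embedding_size = 300   # d
hidden_size = 100      # p
vocab_size = 10_000    # q

n_U = embedding_size * hidden_size   # input-to-hidden weights
n_W = hidden_size * hidden_size      # hidden-to-hidden weights
n_V = hidden_size * vocab_size       # hidden-to-output weights
n_b = hidden_size + vocab_size       # bias vectors b_1 and b_2

total = n_U + n_W + n_V + n_b
print(n_U, n_W, n_V, n_b, total)     # 30000 10000 1000000 10100 1050100
```

Note that the count does not depend on the length of the sequence, and that the hidden-to-output matrix $V$ dominates because of the vocabulary size.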

![](img/RNN_dynamic_model.png)

<!-- <center> -->
<!-- <img src="img/RNN_dynamic_model.png" height="800" width="800">  -->
<!-- </center>     -->


### Forward pass in RNNs

- Given an input $x_t$ at timestep $t$, how do we compute the new state $s_{t}$ and output $\hat{y_t}$? 

![](img/RNN_dynamic_model.png)

<!-- <img src="img/RNN_dynamic_model.png" height="800" width="800">  -->

#### Computing the new state $s_t$

- Multiply the input $x_t$ with the weight matrix between input and hidden layer ($U$) and the state or the hidden layer from the previous time step $s_{t-1}$ with the weight matrix between hidden layers ($W$). 
- Add these values together. 
- Add the bias vector and pass the result through a suitable activation function $g$. 

$$
s_t = g(Ws_{t-1} + Ux_t + b_1)\\
$$ 

![](img/RNN_dynamic_model.png)

<!-- <center> -->
<!-- <img src="img/RNN_dynamic_model.png" height="800" width="800">  -->
<!-- </center>     -->

#### Computing the output $\hat{y_t}$

- Once we have the value for the new state $s_t$, we can calculate the output vector $\hat{y_t}$ by multiplying $s_t$ with the weight matrix $V$ between the hidden layer and the output layer, adding the bias vector, and applying an appropriate activation function $f$ to the result.  

$$
\hat{y}_t = f(Vs_t + b_2)
$$ 

![](img/RNN_dynamic_model.png)

<!-- <center> -->
<!-- <img src="img/RNN_dynamic_model.png" height="800" width="800">  -->
<!-- </center>     -->

- Typically, we are interested in soft classification. So computing $\hat{y_t}$ consists of a softmax computation which provides a probability distribution over the possible output classes. 

$$
\hat{y}_t = \text{softmax}(Vs_t + b_2)
$$ 


### Forward pass in RNNs

So in the forward pass we compute the new state and the output $\hat{y_t}$ for all time steps in a sequence, as shown below.  

$$
s_t = g(Ws_{t-1} + Ux_t + b_1)\\
\hat{y}_t = \text{softmax}(Vs_t + b_2)
$$ 

![](img/RNN_dynamic_model.png)

<!-- <center> -->
<!-- <img src="img/RNN_dynamic_model.png" height="800" width="800">  -->
<!-- </center>     -->

### Forward pass in RNNs

We compute this for the full sequence. 

- Given: $x$, network
- $s_0 = 0$
- for $t$ in 1 to length(input sequence $x$)
    - $s_t = g(Ws_{t-1} + Ux_t + b_1)$
    - $\hat{y}_t = \text{softmax}(Vs_t + b_2)$

Note that the matrices $U$, $V$ and $W$ are **shared across time** and new values for $s_t$ and $\hat{y_t}$ are calculated at each time step.
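
The loop above can be sketched directly in PyTorch. This is a minimal illustration, not a definitive implementation: the function name `rnn_forward`, the sizes, and the random weights are all made up, $g$ is assumed to be `tanh`, and the matrices are stored so that the column-vector products $Ux_t$, $Ws_{t-1}$, and $Vs_t$ are defined.

```python
import torch


def rnn_forward(x, U, W, V, b1, b2):
    """Forward pass of a simple RNN over one sequence.

    x: (seq_len, d); U: (p, d); W: (p, p); V: (q, p); b1: (p,); b2: (q,)
    """
    s = torch.zeros(W.shape[0])  # s_0 = 0
    states, outputs = [], []
    for x_t in x:  # one element of the sequence at a time
        s = torch.tanh(W @ s + U @ x_t + b1)      # new state s_t
        y_hat = torch.softmax(V @ s + b2, dim=0)  # distribution over classes
        states.append(s)
        outputs.append(y_hat)
    return torch.stack(states), torch.stack(outputs)


torch.manual_seed(0)
d, p, q, seq_len = 4, 3, 5, 6  # hypothetical sizes
x = torch.randn(seq_len, d)
U, W, V = torch.randn(p, d), torch.randn(p, p), torch.randn(q, p)
b1, b2 = torch.zeros(p), torch.zeros(q)

states, y_hats = rnn_forward(x, U, W, V, b1, b2)
print(states.shape)        # torch.Size([6, 3])
print(y_hats.shape)        # torch.Size([6, 5])
print(y_hats.sum(dim=1))   # each row is a probability distribution
```

The same `U`, `W`, and `V` are reused in every iteration of the loop, which is exactly the parameter sharing described above.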

![](img/RNN_dynamic_model.png)

<!-- <center> -->
<!-- <img src="img/RNN_dynamic_model.png" height="500" width="500">  -->
<!-- </center>     -->

### RNN Forward pass with PyTorch

- See the documentation [here](https://pytorch.org/docs/stable/generated/torch.nn.RNN.html).

In [4]:
import torch
import torch.nn as nn
from torch.nn import RNN

#### Creating an RNN object 

We are creating an RNN with 
- only one hidden layer 
- input of size 20 (e.g., imagine word vectors of size 20)
- hidden layer of size 10 

In [5]:
rnn = nn.RNN(20, 10, 1)  # input size, hidden_size, number of layers

#### Input

- The input is going to be sequences (e.g., sequences of words)
- We need to provide the sequence length and the size of each input vector. 
- For example, suppose you have the following sequence and you are representing each word with a 20-dimensional word vector, then your sequence length is going to be 5 and input size is going to be 20.  

> Cherry blossoms are beautiful .

In [6]:
inp = torch.randn(5, 20)  # sequence length, input size
inp

tensor([[-4.7574e-02, -2.2040e+00,  8.5885e-02, -3.7213e-01, -7.6761e-04,
          7.2929e-01, -8.2384e-01, -1.8235e-01,  1.1587e+00,  1.6223e+00,
         -2.2509e+00,  8.4252e-01, -1.3771e+00, -2.7338e+00,  1.4807e+00,
          9.9048e-02, -1.0505e+00, -7.2140e-01,  1.0735e+00,  1.6458e+00],
        [-1.2386e-01, -6.2394e-01,  2.5957e+00,  1.1262e+00,  1.1617e+00,
         -2.2316e-02,  2.6256e-01, -8.1836e-01,  3.7086e-01,  5.3580e-01,
          1.1683e-01, -7.1010e-01,  3.9168e-01,  3.3282e-01, -3.8471e-02,
          3.7599e-01,  6.9960e-02, -8.3775e-02, -6.3650e-01,  3.0342e-01],
        [-8.9973e-01,  2.4189e-01,  1.6836e+00,  1.7879e+00,  3.7536e-01,
          1.2323e+00,  1.2282e+00,  1.0380e+00, -1.1312e+00,  1.5787e+00,
          1.9125e-01, -1.7115e+00,  1.1831e+00,  1.5660e-01,  1.0187e+00,
          8.4310e-02, -1.8809e-01, -1.2377e+00, -1.7061e+00,  3.4520e-01],
        [-2.1737e-01,  8.0734e-02,  6.1478e-01, -1.7132e-01, -1.0812e+00,
          7.2855e-01,  6.7692e-01, 

#### Initial hidden state

- At the 0th time step, we do not have anything to remember. So we initialize the hidden state randomly. 
- Let's initialize h0. 
- The shape of h0 is (number of hidden layers, hidden size). 

In [7]:
h0 = torch.randn(1, 10)  # number of hidden layers, hidden size

In [8]:
h0

tensor([[-0.8056,  0.6861,  1.3814,  0.9685, -1.0769,  1.2966, -1.1321, -1.2586,
         -0.2973,  0.3546]])

#### Calculating new hidden states and output

In [10]:
# PyTorch calculates the output and new hidden states for us for all time steps.
output, hn = rnn(inp, h0)

In [11]:
hn  # hidden state for the last time step in the sequence

tensor([[-0.1727,  0.6454, -0.4417, -0.4433, -0.6560, -0.6890,  0.6628, -0.8453,
         -0.4844, -0.5396]], grad_fn=<SqueezeBackward1>)

In [12]:
output

tensor([[-0.7415, -0.9241, -0.7748, -0.7862, -0.8525,  0.0716,  0.9822, -0.8240,
          0.6622, -0.8995],
        [-0.9070,  0.7495, -0.0297, -0.7718, -0.8986,  0.0978, -0.6883, -0.1934,
         -0.6140,  0.9283],
        [-0.6862,  0.5918,  0.2887, -0.7196, -0.5608, -0.0134,  0.2477,  0.1595,
         -0.3826,  0.9602],
        [ 0.7681,  0.4862, -0.0831,  0.8805, -0.7822, -0.2449, -0.2778,  0.2566,
          0.6631,  0.6536],
        [-0.1727,  0.6454, -0.4417, -0.4433, -0.6560, -0.6890,  0.6628, -0.8453,
         -0.4844, -0.5396]], grad_fn=<SqueezeBackward1>)

In [13]:
output.shape  # one hidden state of size 10 for each of the 5 time steps

torch.Size([5, 10])

By default, the `tanh` activation function is used.  

#### Shapes of the weight matrices 

Weight matrix $U$ from the input layer to the hidden layer: 

In [14]:
inp.shape

torch.Size([5, 20])

In [15]:
rnn.state_dict()["weight_ih_l0"].shape

torch.Size([10, 20])

Weight matrix $W$ from the hidden layer at time step $t$ to the hidden layer at time step $t+1$: 

In [16]:
h0.shape

torch.Size([1, 10])

In [17]:
rnn.state_dict()["weight_hh_l0"].shape

torch.Size([10, 10])

Note that the `rnn` above calculates the output of the hidden layer at each time step, but it does not calculate $\hat{y}$ at each time step. 
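
As a sanity check, the sketch below recomputes the hidden states by hand from the parameters stored in `nn.RNN` and then adds a separate linear layer to obtain $\hat{y}$. It is an illustration under a few assumptions: it uses a batched input with batch size 1 (rather than the unbatched input above), a zero initial state, and a made-up number of output classes. PyTorch keeps two bias vectors (`bias_ih_l0` and `bias_hh_l0`), which together play the role of $b_1$.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(20, 10, 1)          # input size, hidden size, number of layers
inp = torch.randn(5, 1, 20)      # sequence length, batch size, input size
h0 = torch.zeros(1, 1, 10)       # number of layers, batch size, hidden size
output, hn = rnn(inp, h0)

# Recompute the hidden states by hand from the stored parameters.
U, W = rnn.weight_ih_l0, rnn.weight_hh_l0   # shapes (10, 20) and (10, 10)
b_ih, b_hh = rnn.bias_ih_l0, rnn.bias_hh_l0
s = h0[0]                        # shape (1, 10)
manual = []
for x_t in inp:                  # x_t has shape (1, 20)
    s = torch.tanh(x_t @ U.T + b_ih + s @ W.T + b_hh)
    manual.append(s)
manual = torch.stack(manual)     # shape (5, 1, 10)
print(torch.allclose(manual, output, atol=1e-5))

# A separate linear layer plays the role of V; softmax gives y_hat per time step.
n_classes = 7                    # hypothetical number of output classes
fc = nn.Linear(10, n_classes)
y_hat = torch.softmax(fc(output), dim=-1)
print(y_hat.shape)               # torch.Size([5, 1, 7])
```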

<br><br><br><br>

![](img/eva-coffee.png)

<br><br><br><br>

## ❓❓ Questions for you

iClicker cloud join link: https://join.iclicker.com/4QVT4

### Exercise 6.1: Select all of the following statements which are **True** (iClicker)

- (A) RNNs pass along information between time steps through hidden layers.
- (B) RNNs are appropriate only for text data.
- (C) At each time step in an RNN, we use a unique hidden state (`h`), a unique input (`X`), but we reuse the same `W` matrix of weights.
- (D) The number of parameters in an RNN language model would grow with the number of time steps.
- (E) If you have `n` sequences, the input of an RNN is a three dimensional tensor. 
<br><br><br><br>

```{admonition} Exercise 6.1: V's Solutions!
:class: tip, dropdown
- (A) True
- (B) False
- (C) True
- (D) False
- (E) True
```
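
Statements (D) and (E) can be checked with a small sketch (the sizes are made up). The number of parameters of an `nn.RNN` stays the same whatever the sequence length, and a batch of `n` sequences is a three-dimensional tensor.

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=20, hidden_size=10)
# 10*20 (input-to-hidden) + 10*10 (hidden-to-hidden) + 10 + 10 (two biases) = 320
n_params = sum(p.numel() for p in rnn.parameters())

for seq_len in (5, 50):
    # sequence length, n = 3 sequences, input size
    batch = torch.randn(seq_len, 3, 20)
    output, hn = rnn(batch)
    print(seq_len, tuple(output.shape), n_params)
```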

## Training RNNs

- An RNN is a **supervised machine learning model**. Similar to feedforward networks, we'll use 
    - a training set
    - a loss function  
    - backpropagation to obtain the gradients needed to adjust the weights in these networks 

- We have 3 sets of weights (and the corresponding bias terms) to update
    - $U \rightarrow $ the weight matrix between input layer and hidden layer
    - $W \rightarrow $ the weight matrix between previous hidden layer to current hidden layer
    - $V \rightarrow $ the weight matrix between hidden layer and output layer

We want to assess the error occurring at time step $t$.

- To compute the loss function for the output at time $t$ we need the hidden layer from time $t-1$.
- The hidden layer at time $t$ influences both the output at time $t$ and the hidden layer at time $t+1$. 

![](img/RNN_loss.png)

<!-- <img src="img/RNN_loss.png" height="1500" width="1500">  -->

[Credit](http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture10.pdf)

- To assess the error accruing to $h_t$, we need to know its influence on both the current output and the ones that follow.  
- This is different from the usual backpropagation, so we need to tailor the backpropagation algorithm to this situation. In RNNs we use a generalized version of backpropagation called **Backpropagation Through Time (BPTT)**. 



- The overall loss is the summation of losses at each time step. 
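
A minimal sketch of this idea, relying on PyTorch's autograd rather than a hand-written BPTT. The sizes, inputs, and labels are made up. We sum the cross-entropy losses of all time steps and call `backward()` once; autograd then propagates the error back through every time step.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
seq_len, input_size, hidden_size, n_classes = 4, 5, 3, 2  # hypothetical sizes
rnn = nn.RNN(input_size, hidden_size)
fc = nn.Linear(hidden_size, n_classes)        # hidden-to-output weights V
criterion = nn.CrossEntropyLoss()

x = torch.randn(seq_len, 1, input_size)       # one sequence (batch size 1)
y = torch.randint(0, n_classes, (seq_len,))   # one made-up label per time step

states, _ = rnn(x)                            # hidden state at every time step
logits = fc(states.squeeze(1))                # shape (seq_len, n_classes)

# Overall loss = sum of the per-time-step losses.
loss = sum(criterion(logits[t : t + 1], y[t : t + 1]) for t in range(seq_len))

# backward() propagates the error back through every time step;
# gradients for the shared weights are accumulated over all time steps.
loss.backward()
print(rnn.weight_hh_l0.grad.shape)            # torch.Size([3, 3])
```

Because $W$ is shared across time, it receives a single gradient of the same shape as $W$, which collects the contributions from all time steps.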

### RNN code in 112 lines of Python

- See [the code](https://gist.github.com/karpathy/d4dee566867f8291f086) for the above in ~112 lines of Python written by Andrej Karpathy. The code has only `numpy` dependency. 

<br><br><br><br>

## RNN applications 

### What can we do with RNNs?

- We have seen the basic RNN architecture below. 

![](img/RNN_introduction.png)

<!-- <center> -->
<!-- <img src="img/RNN_introduction.png" height="800" width="800">  -->
<!-- <center> -->

- But a number of architectures are possible, which makes them a very rich family of models.  

### RNN architectures

- A number of possible RNN architectures

![](img/RNN_architectures.png)

<!-- <center> -->
<!-- <img src="img/RNN_architectures.png" height="1500" width="1500">  -->
<!-- </center>     -->

[source](http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture10.pdf)

Let's see how we can apply them to three different types of NLP tasks:
- Sequence labeling (e.g., POS tagging)
- Sequence classification (e.g. sentiment analysis or text classification)
- Text generation

### Sequence labeling 

- The task is to assign a label from a fixed set of labels to each element in the sequence.  
    - Part-of-speech tagging 
    - Named entity recognition
- Many-to-many architecture
- Inputs are usually pre-trained word embeddings and outputs are tag probabilities generated by a softmax layer over the given tagset. 
- The RNN block is an abstraction representing an unrolled simple RNN consisting of an input layer, hidden layer and output layer at each time step and shared weight matrices $U$, $W$, and $V$. 

![](img/RNN_seq_labeling.png)

<!-- <img src="img/RNN_seq_labeling.png" height="800" width="800">  -->

[Source](https://web.stanford.edu/~jurafsky/slp3/9.pdf)
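
A many-to-many tagger along these lines might look like the sketch below. The class name `RNNTagger`, the sizes, and the token indices are hypothetical, and the embeddings are randomly initialized here rather than pre-trained.

```python
import torch
import torch.nn as nn


class RNNTagger(nn.Module):
    def __init__(self, vocab_size, emb_size, hidden_size, n_tags):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_size)
        self.rnn = nn.RNN(emb_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, n_tags)

    def forward(self, token_ids):
        states, _ = self.rnn(self.emb(token_ids))  # (batch, seq_len, hidden)
        return self.fc(states)                     # tag scores at every time step


torch.manual_seed(0)
tagger = RNNTagger(vocab_size=100, emb_size=8, hidden_size=6, n_tags=4)
token_ids = torch.tensor([[5, 17, 42, 3, 9]])      # one made-up 5-token sentence
tag_probs = torch.softmax(tagger(token_ids), dim=-1)
print(tag_probs.shape)                             # torch.Size([1, 5, 4])
```

There is one distribution over the tagset for each of the 5 tokens.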

### Sequence classification

- We have done text classification such as sentiment analysis or spam identification before with traditional ML models, where we ignored the temporal nature of language.  
- These are actually sequence classification tasks where we want to map a sequence of text to a label from a small set of labels (e.g., positive, negative, neutral). 
- To apply RNNs in this setting, we take the text to be classified and pass one word at a time generating a new hidden layer at each time step. We can then take the hidden layer from the last time step, $h_n$, which has the compressed representation of the entire sequence. We pass this representation through a feedforward neural network which chooses a class via a softmax.     
- This is a many-to-one RNN architecture. 

![](img/RNN_classification.png)

<!-- <img src="img/RNN_classification.png" height="800" width="800">  -->

[Source](https://web.stanford.edu/~jurafsky/slp3/9.pdf)

- Similar to the sequence labeling example, we can also pass word embeddings as input. 
- Note that in this approach we do not have immediate outputs at each time step and we do not need to compute $\hat{y}$ at each time step. We only have an output at the last time step. 
- So there won't be loss terms associated with each time step. 
- The loss function used to train the weights in the network is entirely based on the final text classification task. 
- We will compare the output of the softmax layer of the feed-forward classifier with the actual $y$ to calculate the loss (e.g., cross-entropy loss), and this loss will drive the training. 
- The error signal is backpropagated all the way through the weights in the feed-forward classifier, through its input, which is the hidden layer of the last time step, through the three sets of RNN weights: $U$, $V$, and $W$.  
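
A many-to-one classifier can be sketched as follows. The class name `RNNClassifier`, the sizes, and the data are made up, and a single linear layer stands in for the feed-forward classifier. Only the hidden state of the last time step is passed on, and the loss computed on that single output still produces gradients for the RNN weights.

```python
import torch
import torch.nn as nn


class RNNClassifier(nn.Module):
    def __init__(self, vocab_size, emb_size, hidden_size, n_classes):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_size)
        self.rnn = nn.RNN(emb_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, n_classes)   # feed-forward classifier

    def forward(self, token_ids):
        _, h_n = self.rnn(self.emb(token_ids))        # h_n: (1, batch, hidden)
        return self.fc(h_n[-1])                       # scores from the last state only


torch.manual_seed(0)
model = RNNClassifier(vocab_size=100, emb_size=8, hidden_size=6, n_classes=3)
token_ids = torch.tensor([[5, 17, 42, 3], [8, 1, 99, 20]])  # two made-up texts
y = torch.tensor([0, 2])                                     # made-up labels

logits = model(token_ids)                  # shape (2, 3): one output per text
loss = nn.CrossEntropyLoss()(logits, y)    # softmax + cross-entropy on the final output
loss.backward()                            # the error signal reaches the RNN weights
print(logits.shape, model.rnn.weight_hh_l0.grad is not None)
```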

### Text generation

- The idea is similar to text generation with Markov models. 
- We start with a seed. We then continue to sample words conditioned on our previous choices until we reach a pre-determined sequence length or an end-of-sequence token is generated.
- In the context of RNNs
    - We start with a seed. In the example below, we are starting with a special beginning of sequence token \<s\>. 
    - We use the embedding representation of this token and pass it to the RNN. 
    - We sample a word in the output from the softmax distribution.  
    - We use this sampled word as the input in the next time step and then sample the next word in the same fashion. 
    - We continue this until the fixed length limit or the end of the sentence marker is reached. 

![](img/RNN_generation.png)

<!-- <center> -->
<!-- <img src="img/RNN_generation.png" height="800" width="800">  -->
<!-- </center> -->
    
- The same idea can be used for music generation. 

[Source](https://web.stanford.edu/~jurafsky/slp3/9.pdf)
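
The sampling loop described above can be sketched as follows. The tiny vocabulary is made up and the model is untrained, so the generated words are random; the point is only the mechanics of feeding each sampled word back in as the next input.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab = ["<s>", "</s>", "the", "cat", "sat"]      # tiny made-up vocabulary
emb = nn.Embedding(len(vocab), 8)
rnn = nn.RNN(8, 16)
fc = nn.Linear(16, len(vocab))

max_len = 10                                       # fixed length limit
token = torch.tensor([0])                          # start with the seed <s>
hidden = torch.zeros(1, 1, 16)
generated = []
with torch.no_grad():
    for _ in range(max_len):
        out, hidden = rnn(emb(token).unsqueeze(0), hidden)  # one time step
        probs = torch.softmax(fc(out[0, 0]), dim=0)         # distribution over vocab
        token = torch.multinomial(probs, 1)                 # sample the next word
        if vocab[token.item()] == "</s>":                   # end-of-sequence marker
            break
        generated.append(vocab[token.item()])
print(generated)
```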

### Image captioning 
<br>

- The same idea can be used for more complicated applications such as machine translation, summarization, or image captioning. 
- The idea is to prime the generation component with an appropriate context. 
- For example, in image captioning we can prime the generation component with a meaningful representation of an image given by the last layer of a CNN.  
- We'll talk more about this application next week. 

![](img/image_captioning.png)

<!-- <center> -->
<!-- <img src="img/image_captioning.png" width="1000" height="1000"> -->
<!-- </center> -->
    
[Source](https://cs.stanford.edu/people/karpathy/sfmltalk.pdf)

- We now know the basics of RNNs. 
- Now we'll look at a toy example for character-level text generation using RNNs. 
- Recall that given a sequence of characters, character-level text generation is the task of modeling probability distribution of the next character in the sequence. 

### A toy "hello" RNN 
- Suppose we want to train a character-level RNN on sequence "hello". 
- The vocabulary size is 4 (the characters "h", "e", "l", and "o") and we want our model to learn the following: 
    - "e" should be likely given "h" 
    - "l" should be likely given "he" 
    - "l" should be likely given "hel" 
    - "o" should be likely given "hell"     

![](img/RNN_char_generation_train.png)

<!-- <center> -->
<!-- <img src="img/RNN_char_generation_train.png" height="500" width="500">  -->
<!-- <center>     -->

[Source](http://karpathy.github.io/2015/05/21/rnn-effectiveness/)

### Shapes of input, hidden, and output weight matrices
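- Shape of $W_{xh}$ ($U$) is going to be: $4 \times 3$
- Shape of $W_{hh}$ ($W$) is going to be: $3 \times 3$
- Shape of $W_{hy}$ ($V$) is going to be: $3 \times 4$
- These shapes are written as input size $\times$ output size. For the products $Ux_t$ and $Vs_t$ in the equations below to be defined, the matrices applied are the transposes ($3 \times 4$ for $U$ and $4 \times 3$ for $V$).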
$$
s_t = g(Ws_{t-1} + Ux_t + b_1)\\
\hat{y}_t = \text{softmax}(Vs_t + b_2)
$$ 

![](img/RNN_char_generation_train.png)

<!-- <center> -->
<!-- <img src="img/RNN_char_generation_train.png" height="600" width="600">  -->
<!-- <center>   -->

[Source](http://karpathy.github.io/2015/05/21/rnn-effectiveness/)
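
As a quick check, we can look at how PyTorch stores these matrices for an input size of 4 and a hidden size of 3. PyTorch stores each weight matrix as (output size, input size); the `nn.Linear` layer here is a stand-in for the hidden-to-output weights.

```python
import torch.nn as nn

rnn = nn.RNN(input_size=4, hidden_size=3)
fc = nn.Linear(3, 4)                 # hidden-to-output layer

print(rnn.weight_ih_l0.shape)        # torch.Size([3, 4])  input-to-hidden
print(rnn.weight_hh_l0.shape)        # torch.Size([3, 3])  hidden-to-hidden
print(fc.weight.shape)               # torch.Size([4, 3])  hidden-to-output
```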

Let's build a simple RNN for this using `PyTorch`. 

In [2]:
torch.manual_seed(123)

<torch._C.Generator at 0x105efa370>

- Let's define a mapping between indices and characters. 

In [3]:
idx2char = ["h", "e", "l", "o"]

We need some representation for the input. Let's use one-hot representation. 

In [4]:
one_hot_lookup = [
    [1, 0, 0, 0],  # h
    [0, 1, 0, 0],  # e
    [0, 0, 1, 0],  # l
    [0, 0, 0, 1],  # o
]

Next let's create one-hot representation of `X`. 

In [5]:
X = [0, 1, 2, 2]  # indices for the input "hell"
X_one_hot = [one_hot_lookup[x] for x in X]
inputs = torch.Tensor(X_one_hot)
inputs

tensor([[1., 0., 0., 0.],
        [0., 1., 0., 0.],
        [0., 0., 1., 0.],
        [0., 0., 1., 0.]])

In [6]:
y = [1, 2, 2, 3]
labels = torch.LongTensor(y)
labels

tensor([1, 2, 2, 3])
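
As an aside, the same one-hot input can be built without a hand-written lookup table. This sketch uses `torch.nn.functional.one_hot`, which returns integer tensors, so we convert the result to floats before passing it to an RNN.

```python
import torch
import torch.nn.functional as F

X = [0, 1, 2, 2]  # indices for the input "hell"
inputs = F.one_hot(torch.tensor(X), num_classes=4).float()
print(inputs)
```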

### Defining some variables 

In [7]:
num_classes = 4  # size of vocab
EPOCHS = 10  # number of epochs
input_size = 4  # size of vocab or one-hot size
hidden_size = 3  # output from the RNN.
batch_size = 1  # we are not batching in this toy example.
sequence_length = 1  # we are processing characters one by one in this toy example
num_layers = 1  # one-layer rnn

![](img/RNN_char_generation_train.png)

<!-- <center> -->
<!-- <img src="img/RNN_char_generation_train.png" height="500" width="500">  -->
<!-- <center>     -->

[Source](http://karpathy.github.io/2015/05/21/rnn-effectiveness/)

In [8]:
class ToyRNN(nn.Module):
    def __init__(self, debug=False):
        super(ToyRNN, self).__init__()

        # PyTorch core RNN module
        self.rnn = nn.RNN(
            input_size=input_size, hidden_size=hidden_size, batch_first=True
        )

        # Fully connected layer for the output
        self.fc = nn.Linear(hidden_size, num_classes)

        # Debugging flag
        self.debug = debug

    def forward(self, hidden, x):
        x = x.view(batch_size, sequence_length, input_size)  # reshape the input
        if self.debug:
            print("\n\n")
            print("Input shape = ", x.size())

        out, hidden = self.rnn(x, hidden)
        if self.debug:
            print("out shape = ", out.size())
            print("Hidden shape = ", hidden.size())

        out = out.reshape(out.shape[0], -1)  # flatten to (batch_size, sequence_length * hidden_size) before the output layer
        if self.debug:
            print("out shape after reshaping = ", out.size())

        out = self.fc(out)
        if self.debug:
            print("out shape after passing through fc = ", out.size())

        return hidden, out

    def init_hidden(self):
        return torch.zeros(num_layers, batch_size, hidden_size)

### Instantiate the model

In [9]:
model = ToyRNN()
print(model)

# Set loss and optimizer function
# Loss increases as the predicted probability diverges from the actual label.
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)

ToyRNN(
  (rnn): RNN(4, 3, batch_first=True)
  (fc): Linear(in_features=3, out_features=4, bias=True)
)


### Train the model

In [10]:
for epoch in range(EPOCHS):
    optimizer.zero_grad()
    loss = 0
    hidden = model.init_hidden()

    pred = ""
    for inp, label in zip(inputs, labels):
        hidden, output = model(hidden, inp)
        val, idx = output.max(1)
        pred += idx2char[idx.data[0]]
        loss += criterion(output, torch.LongTensor([label]))
    print("Epoch: %d, loss: %1.3f, predicted: %s" % (epoch + 1, loss, pred))

    loss.backward()
    optimizer.step()

Epoch: 1, loss: 6.082, predicted: oeoe
Epoch: 2, loss: 5.008, predicted: olol
Epoch: 3, loss: 4.393, predicted: llll
Epoch: 4, loss: 4.155, predicted: llll
Epoch: 5, loss: 3.991, predicted: llll
Epoch: 6, loss: 3.697, predicted: llll
Epoch: 7, loss: 3.280, predicted: llll
Epoch: 8, loss: 2.864, predicted: ello
Epoch: 9, loss: 2.459, predicted: ello
Epoch: 10, loss: 2.020, predicted: ello


![](img/RNN_char_generation_train.png)

<!-- <center> -->
<!-- <img src="img/RNN_char_generation_train.png" height="600" width="600">  -->
<!-- <center>     -->

[Source](http://karpathy.github.io/2015/05/21/rnn-effectiveness/)

- We have our toy RNN for text generation! 
- Usually we would do it on large text corpora (e.g., the whole Wikipedia or The New York Times articles from the last 20 years). 
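
The toy model above feeds one character at a time. As an alternative sketch (not the approach used above), we can pass the whole sequence "hell" to `nn.RNN` in a single call and let PyTorch loop over the time steps. The number of epochs is an arbitrary choice, and the exact losses and predictions depend on the random initialization.

```python
import torch
import torch.nn as nn

torch.manual_seed(123)
idx2char = ["h", "e", "l", "o"]
inputs = torch.eye(4)[[0, 1, 2, 2]].unsqueeze(0)   # one-hot "hell", shape (1, 4, 4)
labels = torch.tensor([1, 2, 2, 3])                 # "ello"

rnn = nn.RNN(input_size=4, hidden_size=3, batch_first=True)
fc = nn.Linear(3, 4)
params = list(rnn.parameters()) + list(fc.parameters())
optimizer = torch.optim.Adam(params, lr=0.1)
criterion = nn.CrossEntropyLoss(reduction="sum")    # sum the losses over time steps

losses = []
for epoch in range(50):
    optimizer.zero_grad()
    states, _ = rnn(inputs)              # all 4 time steps in one call
    logits = fc(states[0])               # shape (4, 4): scores per time step
    loss = criterion(logits, labels)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

pred = "".join(idx2char[i] for i in logits.argmax(dim=1).tolist())
print(losses[0], losses[-1], pred)
```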

<br><br><br><br>

## Stacked and Bidirectional RNN architectures

- We have seen a simple RNN with one hidden layer. 
- But RNNs are quite flexible. 
- Two common ways to create complex networks by combining RNNs are:
    - Stacked RNNs
    - Bidirectional RNNs

### Stacked RNNs 

- In the examples thus far, the input of RNNs was a sequence of word or character embeddings. We were passing the output of the RNN layer to the output layer and the outputs have been vectors useful for predicting next words, tags, or sequence labels.  

![](img/RNN_seq_labeling.png)


[Source](https://web.stanford.edu/~jurafsky/slp3/9.pdf)

- But nothing prevents us from using **the sequence of outputs from one RNN as an input sequence to another one**.
- These are called **stacked RNNs** which consist of multiple networks where the output of one layer serves as the input to a subsequent layer. 

![](img/RNN_stacked.png)

<!-- ![](img/RNN_stacked.png) -->

<!-- <img src="img/RNN_stacked.png" height="800" width="800">  -->

[Source](https://web.stanford.edu/~jurafsky/slp3/9.pdf)

- Stacked RNNs generally outperform single-layer networks. 
- The network learns a different level of abstraction at each layer. 
- You can optimize your network for number of layers for your specific application and dataset.  
- But remember that more layers means higher training cost. 
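
In PyTorch, stacking is controlled by the `num_layers` argument of `nn.RNN`. A small sketch with made-up sizes: the second layer takes the hidden states of the first layer as its input, so its input-to-hidden matrix has shape (hidden size, hidden size).

```python
import torch
import torch.nn as nn

stacked = nn.RNN(input_size=20, hidden_size=10, num_layers=2)
inp = torch.randn(5, 1, 20)              # sequence length, batch size, input size
output, hn = stacked(inp)

print(output.shape)                      # torch.Size([5, 1, 10]): states of the top layer
print(hn.shape)                          # torch.Size([2, 1, 10]): final state of each layer
print(stacked.weight_ih_l0.shape)        # torch.Size([10, 20]): layer 1 reads the input
print(stacked.weight_ih_l1.shape)        # torch.Size([10, 10]): layer 2 reads layer 1
```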

### Bidirectional RNNs 

- The RNN uses information from the prior context to make predictions at time $t$. 
- But in many applications (e.g., POS tagging) we do have access to the entire input sequence and knowing the context on the right of time $t$ can be useful. 
- For example, suppose you are doing POS tagging and you are at the token **Teddy** in the sequence. It will be useful to know the right context in order to make the decision on whether it should be tagged as a _noun_ or a _proper noun_.  

> He said , " Teddy Roosevelt was a great president ! "<br>

> He said , " Teddy bears are on sale ! "


- How can we use the words on the right of time step $t$ as context?  
- In the left-to-right RNN, the hidden state at time $t$ represents everything the network knows about the sequence up to that point. 
- Suppose $h_t^f$ denotes a hidden state at time $t$ representing everything the network has gleaned from the sequence so far. 
$$h_t^f = RNN_{forward}(x_1, x_2, \dots, x_t) $$
- We can also train the network in the reverse direction, from right to left, to take advantage of the right context. 
- With this approach, the hidden state at time $t$, $h_t^b$, represents all the information we have learned about the sequence from time $t$ to the end of the sequence. 
$$h_t^b = RNN_{backward}(x_t, x_{t+1}, \dots, x_n) $$
- (Somewhat similar to the $\alpha$ and $\beta$ values in the forward and backward algorithms in HMMs.)

A **bidirectional RNN** combines two independent RNNs:
- One where the input is processed from the start to the end
- The other from the end to the start. 
- Each RNN will result in some representation of the input. 
- We then combine the two representations computed by two independent RNNs into a single vector which captures both the left and right contexts of an input at each point in time. 
- We can combine vectors by
    - Concatenating them, as shown in the picture below or
    - Element-wise addition 
    - Element-wise multiplication

![](img/bidirectional_seq_labeling.png)
<!-- <img src="img/bidirectional_seq_labeling.png" height="800" width="800">  -->

[Source](https://web.stanford.edu/~jurafsky/slp3/9.pdf)
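
A sketch of the concatenation variant using the `bidirectional=True` option of `nn.RNN` (sizes are made up). At each time step the output holds the forward state followed by the backward state, so its last dimension is twice the hidden size.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
birnn = nn.RNN(input_size=20, hidden_size=10, bidirectional=True)
inp = torch.randn(5, 1, 20)          # sequence length, batch size, input size
output, hn = birnn(inp)

print(output.shape)                  # torch.Size([5, 1, 20]): forward and backward states
print(hn.shape)                      # torch.Size([2, 1, 10]): final state per direction

# The forward RNN finishes at the last time step, the backward RNN at the first.
print(torch.allclose(output[-1, :, :10], hn[0]))
print(torch.allclose(output[0, :, 10:], hn[1]))
```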

- You can also use bidirectional RNNs for sequence classification. 
- Recall that in sequence classification we pass the final hidden state of the RNN as input to a subsequent feedforward classifier. 
- The problem with this approach is that the final hidden state reflects more information about the end of the sequence than its beginning. 
- Bidirectional RNNs provide a simple solution to this problem. We can create a final hidden state by combining hidden states of forward and backward passes so that the hidden state reflects information about both the beginning and end of the sequence. 

![](img/bidirectional_classification.png)
<!-- <img src="img/bidirectional_classification.png" height="800" width="800">  -->

[Source](https://web.stanford.edu/~jurafsky/slp3/9.pdf)
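
For classification, the final states of the two directions can be combined in any of the ways listed earlier. A small sketch with made-up sizes, where a linear layer stands in for the feed-forward classifier:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
birnn = nn.RNN(input_size=20, hidden_size=10, bidirectional=True)
inp = torch.randn(5, 1, 20)                  # one made-up sequence of length 5
_, hn = birnn(inp)
h_forward, h_backward = hn[0], hn[1]         # each has shape (1, 10)

concat = torch.cat([h_forward, h_backward], dim=-1)  # shape (1, 20)
added = h_forward + h_backward                       # shape (1, 10)
multiplied = h_forward * h_backward                  # shape (1, 10)

fc = nn.Linear(20, 3)                        # classifier on the concatenated state
logits = fc(concat)
print(concat.shape, added.shape, multiplied.shape, logits.shape)
```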

<br><br><br><br>

## Final comments and summary

### Important ideas to know 

- RNNs are supervised neural network models to process sequential data.
- The intuition is to put multiple feed-forward networks together and make connections between hidden layers.  
- They have feedback connections in their structure to "remember" previous inputs, when reading in a sequence. 
- In simple RNNs sequences are processed one element at a time. The output of each neural unit at time $t$ is based on the current input at $t$ and the hidden layer at time $t-1$.  

### Important ideas to know

- In RNNs, the parameters are shared across different time steps.
- A generalized version of backpropagation called backpropagation through time is used for training the network. 
- In practice truncated backpropagation through time is used where we work through chunks. 
- A number of RNN architectures are possible. 
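
One common way to implement truncated backpropagation through time in PyTorch is sketched below; the data, sizes, and chunk length are made up. The hidden state is carried from one chunk to the next, but `detach()` cuts its gradient history so that each backward pass only goes through the current chunk.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(input_size=4, hidden_size=3)
fc = nn.Linear(3, 2)
params = list(rnn.parameters()) + list(fc.parameters())
optimizer = torch.optim.SGD(params, lr=0.1)
criterion = nn.CrossEntropyLoss()

x = torch.randn(20, 1, 4)            # one long made-up sequence
y = torch.randint(0, 2, (20,))       # one made-up label per time step
chunk_size = 5                       # hypothetical chunk length

hidden = torch.zeros(1, 1, 3)
n_updates = 0
for start in range(0, x.shape[0], chunk_size):
    x_chunk = x[start : start + chunk_size]
    y_chunk = y[start : start + chunk_size]
    hidden = hidden.detach()         # keep the state but cut the gradient history
    optimizer.zero_grad()
    states, hidden = rnn(x_chunk, hidden)
    loss = criterion(fc(states.squeeze(1)), y_chunk)
    loss.backward()                  # gradients flow back only within this chunk
    optimizer.step()
    n_updates += 1
print(n_updates)                     # 4 chunks of 5 time steps
```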

### Important ideas to know

- RNNs fail at capturing long-distance dependencies because of problems like vanishing gradients.  
- In practice, more sophisticated variants such as LSTMs and GRUs are used. 

### Coming up ...

- LSTMs
- Intuition of transformers

### Resources

- [Sequence processing with Recurrent Neural Networks](https://web.stanford.edu/~jurafsky/slp3/9.pdf) (The notes above are heavily based on this resource.)
- [The Unreasonable Effectiveness of Recurrent Neural Networks](http://karpathy.github.io/2015/05/21/rnn-effectiveness/)
- [Coursera: NLP sequence models](https://www.coursera.org/lecture/nlp-sequence-models/recurrent-neural-network-model-ftkzt)
- [RNN code in 112 lines of Python](https://gist.github.com/karpathy/d4dee566867f8291f086#file-min-char-rnn-py-L112)


<br><br><br><br>