# DSCI 575: Advanced Machine Learning (in the context of Natural Language Processing (NLP) applications)

UBC Master of Data Science program, 2019-20

Instructor: Varada Kolhatkar [ʋəɾəda kɔːlɦəʈkər]


## Lecture 6: Introduction to Recurrent Neural Networks (RNNs)

In [6]:
import warnings
warnings.filterwarnings("ignore")

In [7]:
import sys, re
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Embedding
from keras.layers import LSTM
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical
import numpy as np
import pandas as pd
from numpy import array

### RNN-generated music! 

- [Magenta PerformanceRNN](https://magenta.tensorflow.org/performance-rnn)

### Learning outcomes

From this lecture you will be able to 

- explain the motivation to use RNNs
- define vanilla or simple RNNs 
- explain parameter sharing in RNNs 
- explain the forward pass and backward pass in RNNs
- specify different architectures of RNNs

### Motivation 

- Language is an inherently sequential phenomenon.
- Reflected in the metaphors used to describe language 
    - *flow of conversation*, *news feeds*, and *twitter streams*

### Fixed length input

- ML algorithms we have seen in 571 and 572 work with fixed length input.  
    - SVM
    - Logistic Regression
    - Multi-layer Perceptron

- Example of fixed length input
$$X = \begin{bmatrix}1 & 0.8 & \ldots & 0.3\\ 0 & 0 &  \ldots & 0.4\\ 1 & 0.2 &  \ldots & 0.8 \end{bmatrix}$$ 

$$y = \begin{bmatrix}1 \\ 0 \\ 1 \end{bmatrix}$$

### Fixed length input

- When we used these models for sentiment analysis we created a **fixed size** input representation using `CountVectorizer`, where we had simultaneous access to all aspects of the input. 

$$X = \begin{bmatrix}\text{"@united you're terrible. You don't understand safety",}\\ \text{"@JetBlue safety first !! #lovejetblue"}\\ \text{"@SouthwestAir truly the best in #customerservice!"}\\ \end{bmatrix} \text{ and } y = \begin{bmatrix}0 \\ 1 \\ 1 \end{bmatrix} $$ 
<br><br>
$$X_{counts} = \begin{bmatrix}1 & 3 & \ldots & 2\\ 1 & 0 & \ldots & 0\\ 0 & 2 & \ldots & 1\end{bmatrix} \text{ and } y = \begin{bmatrix}0 \\ 1 \\ 1 \end{bmatrix}$$

### Sentiment analysis using feed-forward neural networks 

- Reminder: In feed-forward neural networks, 
    - all connections flow forward (no loops)
    - each layer of hidden units is fully connected to the next
- Pass a fixed-size vector representation of the text (the output of `CountVectorizer`) as input.
- We lose the temporal aspect of text in this representation. 

<img src="images/RNN_FFN_1_layer.png" height="500" width="500"> 
   
(Image credit: [learnopencv](https://www.learnopencv.com/understanding-feedforward-neural-networks/))
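
To make the loss of word order concrete, here is a minimal sketch in plain Python rather than with `CountVectorizer`. The helper `bag_of_words` and the two toy sentences are illustrative assumptions, not part of the lecture data. The two sentences have opposite meanings but map to the same fixed-length count vector:

```python
from collections import Counter

def bag_of_words(text, vocab):
    """Return a fixed-length count vector for `text` over `vocab` (hypothetical helper)."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocab]

docs = ["the food was good not bad", "the food was bad not good"]
vocab = sorted({w for d in docs for w in d.lower().split()})

vectors = [bag_of_words(d, vocab) for d in docs]
print(vocab)
print(vectors[0])
print(vectors[1])
```

Both printed vectors are identical, so any model that only sees these counts cannot tell the two sentences apart.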

### How about using Markov models? 

- They have some temporal aspect! 

<img src="images/Markov_assumption.png" height="550" width="550"> 


### Recall language modeling task 

- Recall the task of predicting the next word given a sequence. 
- What's the probability of an upcoming word?  
    - $P(w_t|w_1,w_2,\dots,w_{t-1})$
    
<blockquote>
    I am studying medicine at UBC because I want to work as a ___.
</blockquote>


### Language modeling: Why should we care?

Language modeling is a powerful idea in NLP and helps in many tasks.
- Machine translation 
    * P(In the age of data algorithms have the answer) > P(the age data of in algorithms answer the have)
- Spelling correction
    * My office is a 10  <span style="color:red">minuet</span> bus ride from my home.  
        * P(10 <span style="color:blue">minute</span> bus ride from my home) > P(10 <span style="color:red">minuet</span> bus ride from my home)
- Speech recognition 
    * P(<span style="color:blue">I read</span> a book) > P(<span style="color:red">Eye red</span> a book)

### Motivation: Language modeling task 

- Recall that when we used Markov models for this task, we made the Markov assumption. 
    - Markov model: $P(w_t|w_1,w_2,\dots,w_{t-1}) = P(w_t|w_{t-1})$
    - Markov model with more context: $P(w_t|w_1,w_2,\dots,w_{t-1}) = P(w_t|w_{t-2}, w_{t-1})$ 
- These models are 'memoryless' in the sense that they have no memory beyond the previous 2, 3, or at most $n$ steps, and as $n$ grows, data sparsity becomes a problem: most long word sequences never occur in the training data.  
- They also have large memory (RAM) requirements, because all observed n-grams have to be stored. 
- Would a Markov model with $n=5$ predict the correct words in the following cases? 
<blockquote>
    I am studying medicine at UBC because I want to work as a <b>doctor</b>.<br>
    I am studying law at UBC because I want to work as a <b>lawyer</b>.<br>
    I am studying history at UBC because I want to work as a <b>historian</b>.     
</blockquote>
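
A small sketch of why the answer is no, reading $n$ as the number of previous words the model can see. The helper `markov_context` is a hypothetical name introduced for illustration. It extracts the words available to the model when predicting the final word of each sentence:

```python
def markov_context(sentence, n):
    """Return the n words immediately before the final word (hypothetical helper)."""
    words = sentence.split()
    return tuple(words[-(n + 1):-1])

sentences = [
    "I am studying medicine at UBC because I want to work as a doctor",
    "I am studying law at UBC because I want to work as a lawyer",
    "I am studying history at UBC because I want to work as a historian",
]

contexts = {markov_context(s, n=5) for s in sentences}
print(contexts)
```

All three sentences yield the same context, `want to work as a`, so the model must assign the same distribution over the next word in every case. The informative word (*medicine*, *law*, *history*) lies outside the window.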



### RNNs motivation 

- RNNs can help us with this limited memory problem!
- **RNNs are a kind of neural network model which use hidden units to remember things over time.**   
- Condition the neural network on all previous words. 


### RNN intuition: Example

- Put a number of feedforward networks together.
- Suppose I have 1 word represented by a vector of size 4 and I want to predict something about that word. I use one feedforward neural network. 
- Suppose I have 2 words. I use 2 of these networks and put them together. 

<img src="images/RNN_two_feedforward.png" height="800" width="800"> 


(Image credit: [learnopencv](https://www.learnopencv.com/understanding-feedforward-neural-networks/))    

### RNN intuition

- Put a number of feedforward networks together. 
- Make connections between the hidden layers.
- Process sequences by presenting one element at a time to the network.


<img src="images/RNN_introduction.png" height="800" width="800"> 

(Credit: [Stanford CS224d slides](http://cs224d.stanford.edu/lectures/CS224d-Lecture8.pdf))

### RNN presentations

- Unrolled presentation 
<center>
<img src="images/RNN_introduction.png" height="600" width="600"> 
</center> 

- Recursive presentation
<img src="images/RNN_recursive_2.png" height="300" width="300"> 

## RNN as a graphical model

- Each state is a function of the previous state and the input.
- A state contains information about the whole past sequence. 
    - $s_t = g_t(x_t, x_{t-1}, \dots, x_2, x_1)$ 

<img src="images/RNN_dynamic_model.png" height="800" width="800"> 

### Parameter sharing

- Parameters
    - Input to hidden weight matrix: $U$
    - Hidden to output weight matrix: $V$    
    - Hidden to hidden weight matrix: $W$
    
- **We share all weights between time steps**    

<img src="images/RNN_dynamic_model.png" height="800" width="800"> 

### RNN parameters

- Input size: Suppose $x \in \mathbb{R}^d$
- Output size: Suppose $y \in \mathbb{R}^q$
- Hidden size: Suppose $s \in \mathbb{R}^p$
- Three kinds of weights, shaped so that the products $Ux_t$, $Ws_{t-1}$ and $Vs_t$ are defined for column vectors: $U_{p\times d}$, $V_{q\times p}$, $W_{p\times p}$    

<img src="images/RNN_dynamic_model.png" height="800" width="800"> 

### RNN parameters: Language modeling example

- Embedding size: 300, vocabulary size: 10,000
- Hidden layer size: 100 (memory of the network)
- $x_t \in \mathbb{R}^{300}$
- $y_t \in \mathbb{R}^{10000}$
- $s_t \in \mathbb{R}^{100}$
- $U_{100\times 300}$, $V_{10000\times 100}$, $W_{100\times 100}$ (shaped so that the products $Ux_t$, $Ws_{t-1}$ and $Vs_t$ are defined for column vectors)
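
As a quick arithmetic check, the sizes above give the following parameter counts. This is a sketch: the bias vectors $b_1$ and $b_2$ are assumed to have one entry per hidden unit and per output unit respectively, and the embedding matrix itself is not counted.

```python
d, p, q = 300, 100, 10_000   # embedding size, hidden size, vocabulary size

n_U = d * p      # input-to-hidden weights
n_W = p * p      # hidden-to-hidden weights
n_V = p * q      # hidden-to-output weights
n_bias = p + q   # assumed bias vectors: b_1 (hidden) and b_2 (output)

total = n_U + n_W + n_V + n_bias
print(n_U, n_W, n_V, n_bias, total)  # 30000 10000 1000000 10100 1050100
```

The count does not depend on the length of the input sequence, because the same weights are reused at every time step. Most of the parameters sit in $V$, which maps the hidden state to a score for every word in the vocabulary.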

<img src="images/RNN_dynamic_model.png" height="800" width="800"> 



### Forward pass in RNNs
- Computing new states and output in RNNs

$$
\begin{aligned}
s_t &= \tanh(Ws_{t-1} + Ux_t + b_1)\\
\hat{y}_t &= \text{softmax}(Vs_t + b_2)
\end{aligned}
$$ 

<img src="images/RNN_dynamic_model.png" height="800" width="800"> 

### Forward pass in RNNs

The matrices $U$, $V$ and $W$ are shared across time, and new values for $s_t$ and $\hat{y}_t$ are calculated at each time step.

- Given: $x$, network
- $s_0 = 0$
- for $t$ in 1 to length($x$)
    - $s_t = \tanh(Ws_{t-1} + Ux_t + b_1)$
    - $\hat{y}_t = \text{softmax}(Vs_t + b_2)$
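
The loop above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the function names, the toy sizes, and the random initialization are all made up for the example, and the weights are stored so that `U @ x_t`, `W @ s` and `V @ s` are defined for 1-D vectors.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D vector."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def rnn_forward(xs, U, W, V, b1, b2):
    """Run a vanilla RNN over a sequence; return all hidden states and outputs."""
    s = np.zeros(W.shape[0])          # s_0 = 0
    states, outputs = [], []
    for x_t in xs:                    # the same U, W, V are reused at every step
        s = np.tanh(W @ s + U @ x_t + b1)
        states.append(s)
        outputs.append(softmax(V @ s + b2))
    return np.array(states), np.array(outputs)

rng = np.random.default_rng(0)
d, p, q, T = 4, 3, 5, 6               # toy sizes (assumptions for illustration)
U = rng.normal(scale=0.1, size=(p, d))
W = rng.normal(scale=0.1, size=(p, p))
V = rng.normal(scale=0.1, size=(q, p))
b1, b2 = np.zeros(p), np.zeros(q)

xs = rng.normal(size=(T, d))          # a sequence of T input vectors
states, y_hat = rnn_forward(xs, U, W, V, b1, b2)
print(states.shape, y_hat.shape)      # (6, 3) (6, 5)
```

Each row of `y_hat` is a probability distribution over the output classes for one time step. Changing `T` changes the number of rows but not the number of parameters.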

<img src="images/RNN_dynamic_model.png" height="500" width="500"> 

## Loss in RNNs

- The overall loss is the summation of losses at each time step. 

<br>
<img src="images/RNN_loss.png" height="1500" width="1500"> 

[Credit](http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture10.pdf)
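
A minimal sketch of summing per-step losses. Cross-entropy is assumed as the per-step loss here because the outputs come from a softmax, which is a common pairing, but the slide does not fix a particular loss. The function name and the toy numbers are made up for illustration.

```python
import numpy as np

def sequence_loss(y_hat, targets):
    """Sum of per-step cross-entropy losses; targets[t] is the correct class index at step t."""
    step_losses = [-np.log(y_hat[t, targets[t]]) for t in range(len(targets))]
    return float(np.sum(step_losses)), step_losses

# Toy predicted distributions over 3 classes for 2 time steps (made-up numbers).
y_hat = np.array([[0.5, 0.25, 0.25],
                  [0.1, 0.8, 0.1]])
targets = [0, 1]

total, per_step = sequence_loss(y_hat, targets)
print(per_step, total)
```

Here the total is $-\log 0.5 - \log 0.8$, one term per time step.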

## Backward pass in RNNs

### Backpropagation Through Time (BPTT)

- In RNNs we use a generalized version of backpropagation called Backpropagation Through Time (BPTT).
    - The gradient of the loss at each output depends on the current time step as well as on the previous time steps. 
    
<img src="files/images/RNN_backprop_TT.png" height="800" width="800"> 

(Credit: [Stanford CS224d slides](http://cs224d.stanford.edu/lectures/CS224d-Lecture8.pdf))
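
As a sketch of why earlier time steps enter the gradient, write the total loss as $L = \sum_t L_t$ and apply the chain rule to the recurrence $s_t = \tanh(Ws_{t-1} + Ux_t + b_1)$. This is the standard textbook form rather than something read off the slide; $L_t$ denotes the loss at step $t$, and $\frac{\partial^{+} s_k}{\partial W}$ denotes the derivative of $s_k$ with respect to $W$ when $s_{k-1}$ is held fixed.

$$
\frac{\partial L}{\partial W} = \sum_{t} \frac{\partial L_t}{\partial W}, \qquad
\frac{\partial L_t}{\partial W} = \sum_{k=1}^{t} \frac{\partial L_t}{\partial s_t}\,\frac{\partial s_t}{\partial s_k}\,\frac{\partial^{+} s_k}{\partial W}, \qquad
\frac{\partial s_t}{\partial s_k} = \prod_{j=k+1}^{t} \frac{\partial s_j}{\partial s_{j-1}}
$$

Because $W$ is shared across time, the loss at step $t$ receives one contribution from every earlier step $k \le t$, and each contribution passes through a product of $t-k$ Jacobians.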

## Truncated backpropagation through time

<br>

<img src="images/RNN_truncated_backprop_TT1.png" height="1000" width="1000"> 

[Credit](http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture10.pdf)

## Truncated backpropagation through time

<br>
<img src="images/RNN_truncated_backprop_TT2.png" height="1000" width="1000"> 

[Credit](http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture10.pdf)

## Truncated backpropagation through time

<br>
<img src="images/RNN_truncated_backprop_TT3.png" height="1000" width="1000"> 

[Credit](http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture10.pdf)
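
The chunking idea can be sketched as follows, under the usual description of truncated BPTT: hidden states are carried forward across chunks, while gradients are propagated back only within a chunk. The sketch shows the forward part only. The function name, toy sizes, and chunk length `k` are assumptions for illustration, and the gradient step is indicated by a comment rather than implemented.

```python
import numpy as np

def run_chunk(xs, s, U, W, b1):
    """Forward pass over one chunk, starting from hidden state s."""
    for x_t in xs:
        s = np.tanh(W @ s + U @ x_t + b1)
    return s

rng = np.random.default_rng(0)
d, p, T, k = 4, 3, 10, 4              # toy sizes; k is the chunk length (assumptions)
U = rng.normal(scale=0.1, size=(p, d))
W = rng.normal(scale=0.1, size=(p, p))
b1 = np.zeros(p)
xs = rng.normal(size=(T, d))

# Full forward pass over all T steps.
s_full = run_chunk(xs, np.zeros(p), U, W, b1)

# Chunked forward pass over steps 0-3, 4-7 and 8-9.
s = np.zeros(p)
chunk_starts = list(range(0, T, k))
for start in chunk_starts:
    s = run_chunk(xs[start:start + k], s, U, W, b1)
    # In training, the loss for this chunk would be backpropagated here through
    # at most k steps, and `s` would be passed on as a constant (no gradient).

print(chunk_starts)                   # [0, 4, 8]
print(np.allclose(s, s_full))         # True
```

The final hidden state is the same either way, so truncation changes only how far back the gradients reach, not what the network computes in the forward direction.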

### RNN code in 112 lines of Python

- See [the code](https://gist.github.com/karpathy/d4dee566867f8291f086) by Andrej Karpathy, which implements the above in ~112 lines of Python. Its only dependency is `numpy`. 

### What can we do with RNNs?

- Simple or Vanilla RNN

<img src="images/RNN_introduction.png" height="800" width="800"> 

- But a number of architectures are possible, which makes them a very rich family of models.  

### RNN architectures

- A number of possible RNN architectures

<img src="images/RNN_architectures.png" height="1000" width="1000"> 

[source](http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture10.pdf)

### One to one 

- The usual feedforward neural network 
   
<img src="images/RNN_architectures.png" height="800" width="800"> 


### Examples: One to many

- Music generation
- Text generation
- Image captioning 

<img src="images/RNN_generation.png" height="800" width="800"> 

[Credit](https://web.stanford.edu/~jurafsky/slp3/9.pdf)

### Examples: One to many 

<img src="images/image_captioning.png" width="1000" height="1000">

[Source](https://cs.stanford.edu/people/karpathy/sfmltalk.pdf)

### Examples: Many to one

- Sentiment analysis
- Text classification 
- Video activity recognition 
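
A minimal many-to-one sketch: the network reads the whole sequence and makes a single prediction from the final hidden state. The function name, toy sizes, and random weights are assumptions for illustration. A trained sentiment model would learn the weights and take word embeddings as input.

```python
import numpy as np

def many_to_one(xs, U, W, V, b1, b2):
    """Read the whole sequence, then predict once from the final hidden state."""
    s = np.zeros(W.shape[0])
    for x_t in xs:
        s = np.tanh(W @ s + U @ x_t + b1)
    logits = V @ s + b2
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()            # one distribution for the whole sequence

rng = np.random.default_rng(1)
d, p, n_classes = 4, 3, 2             # toy sizes, e.g. negative/positive sentiment
U = rng.normal(scale=0.1, size=(p, d))
W = rng.normal(scale=0.1, size=(p, p))
V = rng.normal(scale=0.1, size=(n_classes, p))
b1, b2 = np.zeros(p), np.zeros(n_classes)

short_review = rng.normal(size=(3, d))    # stand-in for 3 word vectors
long_review = rng.normal(size=(12, d))    # stand-in for 12 word vectors
p_short = many_to_one(short_review, U, W, V, b1, b2)
p_long = many_to_one(long_review, U, W, V, b1, b2)
print(p_short.shape, p_long.shape)        # (2,) (2,)
```

Inputs of different lengths produce outputs of the same size, which is what distinguishes this setup from the fixed-length models discussed earlier.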



### Many to many (sequence to sequence or encoder-decoder models)

- Speech recognition 
- Machine translation 

<img src="images/RNN_many_to_many.png" height="800" width="800"> 

- Note that the input sequence and output sequence do not have to be of the same length. 

[source](http://cs224d.stanford.edu/lectures/CS224d-Lecture9.pdf)

### Summary

- RNNs are deep-learning models to process sequential data.
- The intuition is to put multiple feed-forward networks together and make connections between their hidden layers.  
- Unlike Markov models, their predictions are conditioned on the entire sequence seen so far. 
- The parameters are shared across different time steps.
- A generalized version of backpropagation called backpropagation through time is used for training the network. 
- In practice, truncated backpropagation through time is used, where we work through the sequence in chunks. 
- A number of RNN architectures are possible. 

### Post assessment: True or False 

1. RNNs pass along information between time steps through hidden layers.
2. RNNs are appropriate only for text data.
3. Each time step in RNN has a different set of weights.
4. Similar to Markov chains, RNNs only consider the last few observations when making predictions.
5. RNNs are unsupervised ML models. 

<br><br><br><br><br>

### Resources

- [The Unreasonable Effectiveness of Recurrent Neural Networks](http://karpathy.github.io/2015/05/21/rnn-effectiveness/)
- [Coursera: NLP sequence models](https://www.coursera.org/lecture/nlp-sequence-models/recurrent-neural-network-model-ftkzt)
- [RNN code in 112 lines of Python](https://gist.github.com/karpathy/d4dee566867f8291f086#file-min-char-rnn-py-L112)
- [Sequence processing with Recurrent Neural Networks](https://web.stanford.edu/~jurafsky/slp3/9.pdf)