# DSCI 575: Advanced Machine Learning (in the context of Natural Language Processing (NLP) applications)

UBC Master of Data Science program, 2019-20

Instructor: Varada Kolhatkar [ʋəɾəda kɔːlɦəʈkər]


## Lecture 7: Long Short-Term Memory Networks (LSTMs)

In [1]:
import sys, re
import tensorflow 
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding
from tensorflow.keras.layers import LSTM
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical
import numpy as np
import pandas as pd
from numpy import array



### Learning outcomes

From this lecture you will be able to 

- specify the problem of vanishing and exploding gradients
- explain the idea of LSTMs at a high level
- implement an LSTM for character-based text generation using `Keras` and `TensorFlow`

### Motivation 

- Extremely popular 
- Used in all kinds of interesting applications 

### LSTM for image captioning 

<img src="images/RNN_LSTM_image_captioning.png" height="2000" width="2000"> 


(Credit: [LSTMs for image captioning](https://arxiv.org/pdf/1411.4555.pdf))

[Neural Storyteller](https://github.com/ryankiros/neural-storyteller)

<blockquote>

<img src="images/RNN_example.jpg" width="800" height="800">

<p style="font-size:30px">We were barely able to catch the breeze at the beach , and it felt as if someone stepped out of my mind . She was in love with him for the first time in months , so she had no intention of escaping . The sun had risen from the ocean , making her feel more alive than normal . She 's beautiful , but the truth is that I do n't know what to do ...</p>
</blockquote>    

### LSTMs for video captioning 

<img src="images/RNN_video_captioning2.png" height="1500" width="1500"> 


(Credit: [LSTMs for video captioning](https://www.cv-foundation.org/openaccess/content_iccv_2015/papers/Venugopalan_Sequence_to_Sequence_ICCV_2015_paper.pdf))

### LSTMs for executing Python programs 

- In 2014, Google researchers built an LSTM that learns to execute simple Python programs!

<img src="images/RNN_learning_to_execute.png" width="1500" height="1500">

(Credit: [Learning to execute](https://arxiv.org/pdf/1410.4615.pdf))

### Recall: How do we learn RNNs?

- Forward pass 
- Backward pass (backprop through time)


### Recall: Forward pass in RNNs
- Computing new states and output in RNNs

$$
s_t = \tanh(Ws_{t-1} + Ux_t + b_1)\\
\hat{y}_t = \text{softmax}(Vs_t + b_2)
$$ 

<img src="images/RNN_dynamic_model.png" height="800" width="800"> 
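The two update equations above can be sketched in NumPy as follows. This is a minimal illustration rather than a training-ready implementation: the toy sizes, the random weights and the all-zero initial state are assumptions made only for this example.

```python
import numpy as np

def rnn_forward(xs, W, U, V, b1, b2):
    """Run a vanilla RNN over a sequence; return all states and output distributions."""
    s = np.zeros(W.shape[0])                  # initial state s_0 (assumed to be all zeros)
    states, outputs = [], []
    for x in xs:
        s = np.tanh(W @ s + U @ x + b1)       # s_t = tanh(W s_{t-1} + U x_t + b_1)
        z = V @ s + b2
        z = z - z.max()                       # shift for numerical stability of softmax
        y_hat = np.exp(z) / np.exp(z).sum()   # y_hat_t = softmax(V s_t + b_2)
        states.append(s)
        outputs.append(y_hat)
    return np.array(states), np.array(outputs)

rng = np.random.default_rng(0)
hidden, n_in, n_out, T = 4, 3, 5, 6           # toy sizes, chosen only for illustration
W = rng.normal(scale=0.5, size=(hidden, hidden))
U = rng.normal(scale=0.5, size=(hidden, n_in))
V = rng.normal(scale=0.5, size=(n_out, hidden))
b1, b2 = np.zeros(hidden), np.zeros(n_out)
xs = rng.normal(size=(T, n_in))

states, outputs = rnn_forward(xs, W, U, V, b1, b2)
print(states.shape, outputs.shape)
```

Note that the same `W`, `U` and `V` are reused at every time step; only the state `s` changes.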

### Recall: Backpropagation through time

- When we do backprop with feedforward neural networks
    - Take the gradient (derivative) of the loss with respect to the parameters. 
    - Change parameters to minimize the loss. 

- In RNNs we use a generalized version of backprop called Backpropagation Through Time (BPTT).
    - The gradient at each output depends on the current time step as well as the previous time steps. 

    
<img src="images/RNN_loss.png" height="800" width="800"> 

    
[Credit](http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture10.pdf)

### Vanishing and exploding gradients

- So in the backward pass of RNNs, we have to multiply many derivatives together, which very often results in
    - vanishing gradients (gradients becoming very small) 
    - exploding gradients (gradients becoming too big)
- One of the reasons why people were not able to train these networks for a long time.
    
<center>
<img src="images/RNN_loss.png" height="800" width="800"> 
</center>
    

### Why is this a problem? 

- Let's consider the case of vanishing gradients
    - Caused by multiplying many small numbers together 
- Suppose we are trying to predict the next word after 'a'. In this case, it is dependent upon 'law', which occurs more than 10 words away. 
<blockquote>
    I am studying law at the University of British Columbia in Canada because I want to work as a ___. 
</blockquote>

- Time steps that are further back get smaller and smaller gradients. 

- If we have a vanishing gradient, we might not be able to update our weights reliably. 
- Only able to capture short-term dependencies, which kind of defeats the whole purpose of using RNNs. 

### Let's look into it a bit more closely

- Let $W_{hh}$ be the weight matrix between hidden layers
- Let $W_{hx}$ be the weight matrix between input and hidden layers
- Let $\begin{bmatrix}W_{hh} & W_{hx}\end{bmatrix}$ be $W$

\begin{equation}
\begin{split}
h_t =& \tanh(W_{hh}h_{t-1} + W_{hx}x_t)\\
    =& \tanh( \begin{bmatrix}W_{hh} & W_{hx}\end{bmatrix} \begin{bmatrix}h_{t-1}\\x_t\end{bmatrix})\\
    =& \tanh( W \begin{bmatrix}h_{t-1}\\x_t\end{bmatrix})\\
\end{split}
\end{equation}


<img src="images/RNN_gradient_flow.png" height="400" width="400"> 
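The stacking step above can be checked numerically. The sketch below uses small random matrices (the sizes are arbitrary assumptions) and confirms that one product with the stacked $W$ gives the same $h_t$ as the two separate products.

```python
import numpy as np

rng = np.random.default_rng(1)
hidden, n_in = 4, 3                       # toy sizes, for illustration only
W_hh = rng.normal(size=(hidden, hidden))
W_hx = rng.normal(size=(hidden, n_in))
h_prev = rng.normal(size=hidden)
x_t = rng.normal(size=n_in)

# Two separate matrix-vector products.
h_separate = np.tanh(W_hh @ h_prev + W_hx @ x_t)

# One product with W = [W_hh  W_hx] and the stacked vector [h_{t-1}; x_t].
W = np.hstack([W_hh, W_hx])               # shape (hidden, hidden + n_in)
hx = np.concatenate([h_prev, x_t])        # shape (hidden + n_in,)
h_stacked = np.tanh(W @ hx)

print(np.allclose(h_separate, h_stacked))
```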


### Gradient flow in RNNs

- During the backward pass
    - we have the derivative of loss with respect to $h_t$ 
    - we want to compute the derivative of loss with respect to $h_{t-1}$

<img src="images/RNN_gradient_flow1.png" height="400" width="400"> 


### Gradient flow in RNNs

- Computing the gradient of the loss with respect to $h_0$ involves repeated multiplication by the same quantities (factors involving $W$ and the derivative of $\tanh$). 
- If many of these factors are > 1.0, then we have an exploding gradient. 
    - Possible solution: gradient clipping to scale down big gradients.     
- If many of these factors are < 1.0, then we have a vanishing gradient.
    - Possible solution: careful weight initialization, a different activation function    

<br>

<img src="images/RNN_gradient_flow2.png" height="1300" width="1300"> 
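Gradient clipping, mentioned above as a possible fix for exploding gradients, can be sketched as follows. This is one common variant (rescaling by the L2 norm), shown here as an assumption rather than the specific method used in the lecture; the helper name `clip_by_norm` and the threshold are made up for the example.

```python
import numpy as np

def clip_by_norm(grad, max_norm):
    """Rescale grad so its L2 norm is at most max_norm; the direction is unchanged."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad

big = np.array([30.0, -40.0])      # L2 norm 50, larger than the threshold
small = np.array([0.3, -0.4])      # L2 norm 0.5, smaller than the threshold

clipped_big = clip_by_norm(big, max_norm=5.0)      # rescaled to norm 5
clipped_small = clip_by_norm(small, max_norm=5.0)  # returned unchanged
print(clipped_big, np.linalg.norm(clipped_big))
print(clipped_small, np.linalg.norm(clipped_small))
```

In `Keras`, optimizers accept a `clipnorm` argument that applies a similar norm-based rescaling during training.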

### Vanishing and exploding gradients: Simple example 

- Multiplying many numbers together results in either a very small or a very large number. 
- Suppose we have $T$ time steps.  
- Suppose all numbers in the product are scalars and have some value $\alpha$
- As $T \rightarrow \infty$
    - $\alpha^T \rightarrow \infty$ if $\alpha > 1$
    - $\alpha^T \rightarrow 0$ if $\alpha < 1$    
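A quick numerical check of the scalar case, with $T = 100$ and three illustrative values of $\alpha$ (both choices are arbitrary):

```python
T = 100  # number of time steps (illustrative)

# alpha ** T for a factor slightly below 1, exactly 1, and slightly above 1
products = {alpha: alpha ** T for alpha in (0.9, 1.0, 1.1)}

for alpha, value in products.items():
    print(f"alpha = {alpha}: alpha^T = {value:.3e}")
```

Even factors as mild as 0.9 and 1.1 give a product on the order of $10^{-5}$ and $10^{4}$ respectively after 100 steps.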

### A robust solution to this problem is 

- **Use a more complex recurrent unit with gates**
    - Gated Recurrent Units (GRUs)    
    - **Long Short Term Memory networks (LSTMs)**

### Echo state networks

- Idea: Initialize the weight matrices carefully. 
- For example, set $U$ and $W$ to the identity matrix and only learn $V$. 
    - Each state is a summation of previous states.

### Long Short Term Memory networks (LSTMs)

- [Invented in 1997](https://www.bioinf.jku.at/publications/older/2604.pdf) by Hochreiter and Schmidhuber. 
- Designed so that the model can remember things for a long time (hundreds of time steps)! 

### Simple (Vanilla) RNN 

- In a simple RNN, the repeating module contains a simple computation node.  


<img src="files/images/RNN_alternative_representation.png" height="1000" width="1000"> 

(Credit: [Colah's blog](http://colah.github.io/posts/2015-08-Understanding-LSTMs/))

### Long Short Term Memory networks (LSTMs)

- In an LSTM, the repeating module is more complicated. 
- It selectively controls the flow of information using gates. 


<img src="files/images/RNN_alternative_representation.png" height="800" width="800"> 


<img src="files/images/LSTM0.png" height="800" width="800"> 

(Credit: [Colah's blog](http://colah.github.io/posts/2015-08-Understanding-LSTMs/))

### LSTMs: Controlling the flow of information using gates

- Information is added and removed through structures called gates. 
- They optionally let information through via a sigmoid layer and a pointwise multiplication
    - The sigmoid layer outputs a number between 0 and 1, deciding how much of each component should be let through.
    - A pointwise multiplication operation applies the decision. 

<img src="images/RNN_LSTM2.png" height="800" width="800"> 
   
(Credit: [Colah's blog](http://colah.github.io/posts/2015-08-Understanding-LSTMs/))
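A gate in isolation can be sketched as below. The pre-activation values are hypothetical and picked so that the three components are almost fully let through, half let through, and almost fully blocked.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

information = np.array([2.0, -1.0, 0.5])      # values that want to pass through
gate_logits = np.array([10.0, 0.0, -10.0])    # hypothetical pre-activations of the gate

gate = sigmoid(gate_logits)       # each entry lies between 0 and 1
passed = gate * information       # pointwise multiplication applies the decision

print(np.round(gate, 3))
print(np.round(passed, 3))
```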

### LSTMs

- In addition to the usual hidden units, LSTMs have memory cells. 
- The purpose of memory cells is to remember things for a long time.

<img src="images/RNN_LSTM1.png" height="1000" width="1000"> 

(Credit: [Colah's blog](http://colah.github.io/posts/2015-08-Understanding-LSTMs/))

### LSTMs: The core idea 

- The core idea in LSTMs is using a cell state (memory cell)
- Information can flow along the memory unchanged. 
- Information can be removed from or written to the cell, regulated by gates. 

<img src="images/RNN_LSTM0.png" height="800" width="800"> 

(Credit: [Colah's blog](http://colah.github.io/posts/2015-08-Understanding-LSTMs/))

### How does LSTM work?
- Four operations: forget, store (input), update, output

<img src="images/RNN_LSTM0.png" height="800" width="800"> 

(Credit: [Colah's blog](http://colah.github.io/posts/2015-08-Understanding-LSTMs/))

### Forget operation

- A sigmoid layer, **forget gate**, decides which values of the memory cell to reset. 
- Decides what part of the history is worth forgetting.
- $f_t = \sigma(W_f[h_{t-1}, x_t] + b_f)$

<img src="images/RNN_LSTM3.png" height="800" width="800"> 

(Credit: [Colah's blog](http://colah.github.io/posts/2015-08-Understanding-LSTMs/))

### Store operation 

- Decides what part of the new information is worth storing. 
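- Two parts: 
    - A sigmoid layer, the **input gate**, decides which values of the memory cell to write to.
    - $i_t = \sigma(W_i[h_{t-1}, x_t] + b_i)$
    - A $\tanh$ layer proposes the new candidate values (described in the next section).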

<img src="images/RNN_LSTM4.png" height="800" width="800"> 


(Credit: [Colah's blog](http://colah.github.io/posts/2015-08-Understanding-LSTMs/))

### Vector of new candidate values

- A $\tanh$ layer creates a vector of new candidate values $\tilde{c}_t$ to write to the memory cell. 
- $\tilde{c}_t = \tanh(W_{c}[h_{t-1}, x_t] + b_c)$ 

<img src="images/RNN_LSTM5.png" height="800" width="800"> 


(Credit: [Colah's blog](http://colah.github.io/posts/2015-08-Understanding-LSTMs/))

### Update operation (memory cell update)

- The previous steps decided which values of the memory cell to reset and overwrite. 
- Now the LSTM applies the decisions to the memory cells.
- $c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$, where $\odot$ denotes pointwise multiplication 

<img src="images/RNN_LSTM6.png" height="800" width="800"> 

(Credit: [Colah's blog](http://colah.github.io/posts/2015-08-Understanding-LSTMs/))
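A tiny worked example of the memory cell update, with hand-picked gate values (all numbers are made up for illustration). The first component is kept as is, the second is reset and overwritten, and the third is a half-and-half mix.

```python
import numpy as np

c_prev  = np.array([0.8, -0.5,  0.3])   # previous memory cell c_{t-1}
f_t     = np.array([1.0,  0.0,  0.5])   # forget gate: keep, reset, keep half
i_t     = np.array([0.0,  1.0,  0.5])   # input gate: ignore, write, write half
c_tilde = np.array([0.9,  0.7, -0.1])   # candidate values

# Pointwise update: forget part of the old memory, then add the gated candidates.
c_t = f_t * c_prev + i_t * c_tilde
print(c_t)
```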

### Output operation

- The sigmoid layer, **output gate**, decides which values should be sent to the network in the next time step. 
- $o_t = \sigma(W_{o}[h_{t-1}, x_t] + b_o)$

<img src="images/RNN_LSTM7.png" height="800" width="800"> 

(Credit: [Colah's blog](http://colah.github.io/posts/2015-08-Understanding-LSTMs/))

#### Output update

- The memory cell goes through $\tanh$ and is multiplied (pointwise) by the output gate
- $h_t = o_t \odot \tanh(c_t)$

<img src="images/RNN_LSTM8.png" height="800" width="800"> 


(Credit: [Colah's blog](http://colah.github.io/posts/2015-08-Understanding-LSTMs/))
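Putting the forget, store, update and output operations together, one LSTM time step can be sketched in NumPy as below. This is a minimal sketch under stated assumptions: the `params` dictionary layout, the toy sizes, the random weights and the zero initial states are all choices made for this example, and it is not how `Keras` organizes its weights internally.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step following the forget/store/update/output equations."""
    hx = np.concatenate([h_prev, x_t])                      # [h_{t-1}, x_t]
    f_t = sigmoid(params["W_f"] @ hx + params["b_f"])       # forget gate
    i_t = sigmoid(params["W_i"] @ hx + params["b_i"])       # input gate
    c_tilde = np.tanh(params["W_c"] @ hx + params["b_c"])   # candidate values
    c_t = f_t * c_prev + i_t * c_tilde                      # memory cell update
    o_t = sigmoid(params["W_o"] @ hx + params["b_o"])       # output gate
    h_t = o_t * np.tanh(c_t)                                # new hidden state
    return h_t, c_t

rng = np.random.default_rng(0)
num_units, input_size = 5, 3                                # toy sizes
params = {}
for name in "fico":                                         # forget, input, candidate, output
    params[f"W_{name}"] = rng.normal(scale=0.1, size=(num_units, num_units + input_size))
    params[f"b_{name}"] = np.zeros(num_units)

# Run the cell over a short random sequence, starting from zero states.
h, c = np.zeros(num_units), np.zeros(num_units)
for x_t in rng.normal(size=(4, input_size)):
    h, c = lstm_step(x_t, h, c, params)
print(h.shape, c.shape)
```

Each of the four weight matrices has shape `(num_units, num_units + input_size)` and comes with a bias of length `num_units`.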

### How does it help RNN training?

- You might be wondering why this solves the problem of vanishing and exploding gradients.
- Turns out that the cell state we create in LSTMs allows an uninterrupted flow of gradients through time which mitigates our problem of vanishing and exploding gradients! 
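
A rough way to see this, as a sketch rather than a full derivation: take the memory cell update $c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$, with $\odot$ meaning pointwise multiplication, and assume for simplicity that the gates do not themselves depend on $c_{t-1}$ (this ignores the indirect path through $h_{t-1}$). Then

$$
\frac{\partial c_t}{\partial c_{t-1}} \approx \text{diag}(f_t)
\qquad\Rightarrow\qquad
\frac{\partial c_T}{\partial c_k} \approx \prod_{t=k+1}^{T} \text{diag}(f_t)
$$

Under this assumption, the backward pass along the cell state multiplies by forget gate values only, with no repeated multiplication by a weight matrix and no $\tanh$ derivative. When the forget gates stay close to 1, the product stays close to the identity and the gradient survives over many time steps.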

### LSTMs: Confusing diagrams!!!

- LSTMs are not very intuitive.  
- Complicated combination of state in the past, observation at the moment and different ways to either forget the observation or keep it around. 
- Famous for confusing illustrative diagrams. 

<img src="images/RNN_confusing_LSTMs.png" height="1000" width="1000"> 



### Other extensions

### Multilayer RNNs/LSTMs

- Stack multiple layers of RNNs or LSTMs on top of each other. 


<img src="images/RNN_stacked.png" height="600" width="600"> 


### Bidirectional RNNs/LSTMs

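- A problem with standard (unidirectional) RNNs/LSTMs is that the prediction at a given time step uses information only from the previous time steps and not from later ones. In the example below, whether "Teddy" is part of a person's name depends on the words that follow it. 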
<blockquote>
He said, "Teddy Roosevelt was a great president!"<br>
He said, "Teddy bears are on sale!"
</blockquote>    


<img src="images/RNN_bidirectional_LSTM.png" height="800" width="800"> 

[Source](https://www.i2tutorials.com/technology/deep-dive-into-bidirectional-lstm/)
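
The bidirectional idea can be sketched with a simple $\tanh$ RNN in NumPy: run one RNN left to right, a second one right to left, and concatenate the two states at every time step. A plain RNN is used instead of an LSTM only to keep the example short; the helper `run_rnn`, the sizes and the random weights are assumptions for illustration.

```python
import numpy as np

def run_rnn(xs, W_hh, W_hx):
    """Return the hidden state at every time step of a simple tanh RNN."""
    h = np.zeros(W_hh.shape[0])
    states = []
    for x in xs:
        h = np.tanh(W_hh @ h + W_hx @ x)
        states.append(h)
    return np.array(states)

rng = np.random.default_rng(0)
T, n_in, hidden = 6, 3, 4                     # toy sizes
xs = rng.normal(size=(T, n_in))

# Separate, randomly initialised weights for each direction.
fwd = (rng.normal(scale=0.5, size=(hidden, hidden)),
       rng.normal(scale=0.5, size=(hidden, n_in)))
bwd = (rng.normal(scale=0.5, size=(hidden, hidden)),
       rng.normal(scale=0.5, size=(hidden, n_in)))

h_forward = run_rnn(xs, *fwd)                 # reads x_1 ... x_T
h_backward = run_rnn(xs[::-1], *bwd)[::-1]    # reads x_T ... x_1, then re-aligned to time order

# The state at step t now summarises both the past and the future of the sequence.
h_bi = np.concatenate([h_forward, h_backward], axis=1)
print(h_bi.shape)
```

In `Keras`, the `Bidirectional` wrapper from `tensorflow.keras.layers` applies the same idea to a recurrent layer such as `LSTM`.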


### Let's build LSTMs!!! 


### Generation with LSTMs 

LSTMs are expensive to train, so we'll be training them on [Google Colab](https://colab.research.google.com/notebooks/welcome.ipynb). This will allow us to train on a GPU and assess the benefits of training neural networks on GPUs.

You can follow the steps below.

- Go to [Google colab](https://colab.research.google.com/). 
- Make an account if you don't have one.
- Select "UPLOAD" and upload [this notebook](code/LSTM-character-based-text-generation.ipynb) in Google Colab. 
- Runtime --> change runtime type --> Select GPU.
- Run the notebook.

In [6]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Embedding
from tensorflow.keras.layers import LSTM

### Number of parameters in LSTMs 

In [5]:
# Build the network and print a summary of its layers and parameter counts
vocab_size = 36
model = Sequential()
model.add(Embedding(vocab_size, 10, input_length = 30))
model.add(LSTM(50))
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 30, 10)            360       
_________________________________________________________________
lstm_3 (LSTM)                (None, 50)                12200     
_________________________________________________________________
dense_3 (Dense)              (None, 36)                1836      
Total params: 14,396
Trainable params: 14,396
Non-trainable params: 0
_________________________________________________________________
None


### Explanation of parameters

- Embedding layer (embedding_3)
    - 360 &rarr; size of the embedding matrix. We have 36 items in the vocabulary (`vocab_size = 36`) and each has a 10-dimensional embedding: $36 \times 10 = 360$. 

- LSTM layer (lstm_3)
    - For each gate in the LSTM cell, we have a weight matrix of size:  (num_units + input_size + 1) * num_units = $(50 + 10 + 1) \times 50 = 3050$

        - num_units &rarr; number of hidden units in the LSTM cell (50 in the example above)
        - input_size &rarr; size of the input vector (10 in the example above)
        - 1 &rarr; for bias
    
    - We have 4 such weight matrices: for the forget gate, the input gate, the output gate, and the candidate memory cell values.
    - The number of parameters = ( (num_units + input_size + 1) * num_units ) * 4 =  $((50 + 10 + 1) \times 50) \times 4 = 12200$
- Dense layer (dense_3)
    - (num_units + 1) * size_of_vocab = (50 + 1) * 36 = 1836
    - 1 &rarr; for bias
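
The arithmetic above can be reproduced with a few lines of plain Python. The helper function names are made up for this sketch; the formulas are the ones given in the explanation.

```python
def embedding_params(vocab_size, embedding_dim):
    # One embedding vector per vocabulary item.
    return vocab_size * embedding_dim

def lstm_params(num_units, input_size):
    # Weights over [h_{t-1}, x_t] plus a bias, for each of the 4 weight matrices.
    per_gate = (num_units + input_size + 1) * num_units
    return 4 * per_gate

def dense_params(num_inputs, num_outputs):
    # Weights plus one bias per output unit.
    return (num_inputs + 1) * num_outputs

vocab_size, embedding_dim, num_units = 36, 10, 50
counts = {
    "embedding": embedding_params(vocab_size, embedding_dim),
    "lstm": lstm_params(num_units, embedding_dim),
    "dense": dense_params(num_units, vocab_size),
}
counts["total"] = sum(counts.values())
print(counts)
```

The four numbers match the `Param #` column and the `Total params` line of the model summary: 360, 12200, 1836 and 14,396.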

### Summary

- Training RNNs is hard because of vanishing and exploding gradients.  
- LSTMs mitigate the problem by introducing a mechanism to selectively control the flow of information in the network.
- They are widely used in the deep learning community to process sequential data and have a wide range of applications.
- Input to LSTMs is a 3-dimensional array (called a tensor), e.g., with shape (batch size, time steps, features) in `Keras`.

### RNN resources

A lot of material is available on the web. Here are some resources that were useful for me. 


- [Colah's blog](http://colah.github.io/posts/2015-08-Understanding-LSTMs/)
- Geoff Hinton's [short talk](https://www.youtube.com/watch?v=93rzMHtYT_0) and [lecture](https://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec7.pdf) on LSTMs
- [Yoshua Bengio's lecture](https://www.youtube.com/watch?v=AYku9C9XoB8&t=884s)
- [Ali Ghodsi's lecture on RNNs](https://www.youtube.com/results?search_query=ali+ghodsi+RNNs)
- Richard Socher's [slides](https://cs224d.stanford.edu/lectures/CS224d-Lecture8.pdf) and [lecture](https://www.youtube.com/watch?v=Keqep_PKrY8) on RNNs
- [A list of RNN resources](https://github.com/ajhalthor/awesome-rnn)