# LSTMs and GRUs

While vanilla (simple) RNNs are a great way to understand the ideas behind recurrent architectures, they're rarely used in practice.

Instead, we use updated recurrent architectures like the LSTM and GRU.


## LSTM Architecture
<img src="https://drive.google.com/uc?export=view&id=17_HVigdnak1bEobOLS6Rbxq8jNYB_bLf" alt="Q" width = "400"/>

- How many states does an LSTM output and take in at each step? How is this different from a simple RNN?
- What range of values can a **gate** output?
- What range of values can a **tanh** output?
- What does the forget gate do, conceptually?
- What does the input gate do, conceptually?
- What does the output gate do, conceptually?
- What does the input *node* do, conceptually?
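
If it helps to check your answers against something concrete, here is a minimal NumPy sketch of a single LSTM step. It follows the common textbook formulation, which may label things slightly differently than the figure above. The function name `lstm_step`, the stacked weight matrix `W`, and the gate ordering inside it are assumptions made for this illustration, not how Keras stores its weights.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step (illustrative layout).

    W has shape (4 * hidden, n_in + hidden); b has shape (4 * hidden,).
    """
    hidden = h_prev.shape[0]
    z = W @ np.concatenate([x_t, h_prev]) + b
    f = sigmoid(z[0 * hidden:1 * hidden])   # forget gate, values in (0, 1)
    i = sigmoid(z[1 * hidden:2 * hidden])   # input gate, values in (0, 1)
    o = sigmoid(z[2 * hidden:3 * hidden])   # output gate, values in (0, 1)
    g = np.tanh(z[3 * hidden:4 * hidden])   # input node, values in (-1, 1)
    c_t = f * c_prev + i * g                # new cell state
    h_t = o * np.tanh(c_t)                  # new hidden state
    return h_t, c_t

# run five time steps on random toy data
rng = np.random.default_rng(0)
n_in, n_hidden = 3, 4
W = rng.normal(size=(4 * n_hidden, n_in + n_hidden))
b = np.zeros(4 * n_hidden)
h, c = np.zeros(n_hidden), np.zeros(n_hidden)
for x_t in rng.normal(size=(5, n_in)):
    h, c = lstm_step(x_t, h, c, W, b)
print(h.shape, c.shape)
```

Notice that each step takes in and returns two states (`h` and `c`), and that the gates only ever scale other values.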

## GRU Architecture
<img src="https://drive.google.com/uc?export=view&id=1cXxAo66BbhJ_Z4tni5JllfhkNZoNYcG6" alt="Q" width = "400"/>

- How many states does a GRU output and take in at each step?
- How many gates does a GRU have?
- What does the reset gate do?
- What does the update gate do?
- How do we generate the proposed update to the hidden state?
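
Similarly, here is a minimal NumPy sketch of one GRU step. Biases are left out to keep it short, and the names `gru_step`, `Wz`, `Wr`, and `Wh` are made up for this example. Be aware that conventions differ on whether the update gate multiplies the old state or the proposed state; this sketch uses `(1 - z) * h_prev + z * h_tilde`.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, Wz, Wr, Wh):
    """One GRU time step (illustrative, biases omitted)."""
    xh = np.concatenate([x_t, h_prev])
    z = sigmoid(Wz @ xh)   # update gate
    r = sigmoid(Wr @ xh)   # reset gate
    # proposed update: the reset gate scales the old state before it is used
    h_tilde = np.tanh(Wh @ np.concatenate([x_t, r * h_prev]))
    # blend the old state and the proposed update
    return (1 - z) * h_prev + z * h_tilde

# run five time steps on random toy data
rng = np.random.default_rng(0)
n_in, n_hidden = 3, 4
Wz, Wr, Wh = (rng.normal(size=(n_hidden, n_in + n_hidden)) for _ in range(3))
h_gru = np.zeros(n_hidden)
for x_t in rng.normal(size=(5, n_in)):
    h_gru = gru_step(x_t, h_gru, Wz, Wr, Wh)
print(h_gru.shape)
```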

## The Vanishing Gradient
<img src="https://drive.google.com/uc?export=view&id=110Hz0GQdzxunzi0qIgGqLSxCzFftcApF" alt="Q" width = "400"/>

- When a Gradient Vanishes, it approaches *what value*?
- Why are deeper networks more susceptible to the vanishing gradient problem?
- If a recurrent layer has only ONE CELL, why is IT susceptible to the vanishing gradient problem?
- How does sequence length affect the vanishing gradient problem?
- What range of values can the derivative of a sigmoid activation function take on? Why is this problematic?
- For the simple NN in the image above, write out the different partial derivatives. I'll start you off with $$\frac{\partial Loss}{\partial \text{output}} = -2(y - \text{output})$$ When our weights are very small (0<abs(weight)<1), what will happen to our gradient?
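
Once you've worked through the questions, a quick numerical experiment can make the pattern concrete. The sketch below assumes a chain of identical sigmoid layers, each contributing a factor of (weight × sigmoid derivative) to the gradient via the chain rule. The weight value `0.5` and the depths are arbitrary choices for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# the sigmoid derivative is largest at z = 0
max_grad = sigmoid_grad(0.0)
print("largest sigmoid derivative:", max_grad)

# best case: every layer contributes its largest possible factor
weight = 0.5
shrink = {depth: (weight * max_grad) ** depth for depth in (1, 5, 10, 20)}
for depth, factor in shrink.items():
    print(f"depth {depth:2d}: gradient scaled by {factor:.3e}")
```

The same multiplication happens across time steps in a recurrent layer, so a longer sequence plays the role of a deeper network.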

# Building a Character Level LSTM and GRU

Activity based on [this tutorial](https://machinelearningmastery.com/text-generation-lstm-recurrent-neural-networks-python-keras/) by Jason Brownlee.

In [20]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import LSTM
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.utils import to_categorical
from tensorflow.keras import Input
from tensorflow.keras import Model

## Load in Pride and Prejudice

📚 First, let's load in a txt file of the book **Pride and Prejudice**

(If you'd like to use your own book, try looking on [Project Gutenberg](https://www.gutenberg.org/ebooks/). Make sure you set `FileType` to Plain Text (.txt) to make your life easier, and delete the header and license at the top and bottom of the file.)

In [2]:
# load the text and convert it to lowercase
filename = "pandp.txt"
with open(filename, 'r', encoding='utf-8') as f:
    raw_text = f.read()
raw_text = raw_text.lower()

In [4]:
raw_text[0:100]

'chapter 1\n\n      it is a truth universally acknowledged, that a single man in\n      possession of a '

## Map Characters to One Hot Vectors

As we'll discuss in the next lecture, our NNs cannot understand text; they need numbers as input. So we're going to change our characters (e.g. 'a', 'b', 'c') to one hot encoded vectors.

<img src="https://drive.google.com/uc?export=view&id=1nyNRjEBl99luyPbF-ZCw03_Ki_6h4u0W" alt="Q" width = "400"/>

As we talked about in our Math Review lecture, one hot encoded vectors are sparse vectors (they have a lot of 0's) in which 1's indicate group membership. They are very similar to the dummy variables we learned about in CPSC 392.
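
As a small illustration of the idea, here is a toy example that one hot encodes the word "hello". The names `toy_chars`, `toy_char_to_int`, and `toy_one_hot` are made up for this sketch (they are not used later in the notebook), and indexing into `np.eye` is just one convenient way to build the vectors.

```python
import numpy as np

# build a tiny vocabulary from one word
toy_chars = sorted(set("hello"))                       # ['e', 'h', 'l', 'o']
toy_char_to_int = {c: i for i, c in enumerate(toy_chars)}

# map each character to its integer index
toy_indices = [toy_char_to_int[c] for c in "hello"]    # [1, 0, 2, 2, 3]

# row i of the identity matrix is the one hot vector for index i
toy_one_hot = np.eye(len(toy_chars))[toy_indices]
print(toy_one_hot)
```

Each row has a single 1, in the column that belongs to that character.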

In [34]:
# create mapping of unique chars to integers
chars = sorted(list(set(raw_text)))
char_to_int = dict((c, i) for i, c in enumerate(chars))

In [9]:
n_chars = len(raw_text)
n_vocab = len(chars)
print("Total Characters: ", n_chars)
print("Total Vocab: ", n_vocab)

Total Characters:  754870
Total Vocab:  56


In [11]:
# prepare the dataset of input to output pairs encoded as integers
seq_length = 100 # 100 characters as input
dataX = []
dataY = []
for i in range(0, n_chars - seq_length, 1):
 seq_in = raw_text[i:i + seq_length] # generate 100 character input
 seq_out = raw_text[i + seq_length] # grab next character
 
 dataX.append([char_to_int[char] for char in seq_in])
 dataY.append(char_to_int[seq_out])
n_patterns = len(dataX)
print("Total Patterns: ", n_patterns)

Total Patterns:  754770
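
To see what these input/output pairs look like, here is the same sliding window idea on a toy string with a window of 4 characters instead of 100. The names `toy_text`, `toy_seq_length`, and `toy_pairs` are invented for this sketch.

```python
toy_text = "hello world"
toy_seq_length = 4

toy_pairs = []
for i in range(len(toy_text) - toy_seq_length):
    window = toy_text[i:i + toy_seq_length]      # input: 4 characters
    next_char = toy_text[i + toy_seq_length]     # output: the character that follows
    toy_pairs.append((window, next_char))

for window, next_char in toy_pairs[:3]:
    print(repr(window), "->", repr(next_char))
print("Total Patterns:", len(toy_pairs))
```

The number of patterns is always the text length minus the window length, which is why 754870 characters with a window of 100 gives 754770 patterns above.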


In [13]:
# reshape X to be [samples, time steps, features]
X = np.reshape(dataX, (n_patterns, seq_length, 1))
# normalize the integer codes so they fall between 0 and 1
X = X / float(n_vocab)
# one hot encode the output variable
y = to_categorical(dataY)

In [15]:
print(y[0])

[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0.]


## Build Recurrent Model (LSTM)

[💻 Server Docs](https://docs.google.com/document/d/1YuMv84E6mYLyPFiVCarKAMbMFT4H0rEqEspcGTCQv28/edit?usp=sharing)

The code below builds and saves an LSTM model. However, this model takes a really long time to train, so I recommend doing one of the following:

1. Load my Pre-Trained Model (available on Canvas) **if** you're using Pride and Prejudice data. 

2. Train your Model on the Server

  - **Change** your model path so that you replace `pandpmodel` with whatever you want to call your model
  - **Download** this notebook as a .py file, then delete any code in the file that happens AFTER the line where you save your model
  - **Move** your .py file and your book .txt file to the server and put it in your run directory (the file you list in your .yaml file)
  - Using a docker container, **run** your python file
  - When finished, **exit** the container. Your saved model files should be in your home directory, in whichever folder you use as your run directory. Use `scp` or VSCode to download the file locally. 
  - **Zip** your file
  - **Upload** to Colab!

In [21]:
# LSTM model
inputs = Input(shape = (X.shape[1], X.shape[2]))
x = LSTM(256)(inputs)
x = Dropout(0.2)(x)
output = Dense(y.shape[1], activation='softmax')(x)

model = Model(inputs = inputs, outputs = output)
model.compile(loss='categorical_crossentropy', optimizer='adam')

### Saving a Model when Running on the Server
When saving a model that will be run on the server, your filepath in `model.save()` needs to be a filepath that will be local to your *container* (not your local computer, nor your home directory on the server). The way our docker containers are set up, if you want to write to the run directory, you'll need to write to the folder `'/app/rundir/...'` where `...` is your file or directory name. 

When saving a keras model, you provide the name of a *directory* that will be created and that all your model files will save to.

In [35]:
model.fit(X, y, epochs=20, batch_size=128)
model.save('/app/rundir/pandpmodel') # save model to file

### Loading a Pre-Trained Model into Colab
If uploading a pre-trained model, upload the .zip to Colab, and replace `pandpmodel.zip` with your model name.

Then load the model by replacing `'pandpmodel/'` with your model directory name in `tf.keras.models.load_model()`.

In [24]:
# unzips model that you uploaded to colab
!unzip pandpmodel.zip

# loads model from files
model = tf.keras.models.load_model('pandpmodel/')

Archive:  pandpmodel.zip
replace __MACOSX/._pandpmodel? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: __MACOSX/._pandpmodel   
replace __MACOSX/pandpmodel/._variables? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: __MACOSX/pandpmodel/._variables  
replace pandpmodel/saved_model.pb? [y]es, [n]o, [A]ll, [N]one, [r]ename: A
  inflating: pandpmodel/saved_model.pb  
  inflating: __MACOSX/pandpmodel/._assets  
  inflating: pandpmodel/variables/variables.data-00000-of-00001  
  inflating: pandpmodel/variables/variables.index  




### Using a Pre-Trained Model to Make Predictions
We'll need a dictionary to map the predicted integer indices (the position of the 1 in each one hot vector) BACK to characters.

In [25]:
int_to_char = dict((i, c) for i, c in enumerate(chars))

Next, we'll generate data!

You can change `n_chars` to determine how many individual characters our model should produce (remember we used *characters* as our tokens rather than words for this model).

We then pick a random section of text to be our input sequence, and we'll feed it to the trained model in order to get our generated text.

In [37]:
n_chars = 100

# pick a random seed
start = np.random.randint(0, len(dataX))
pattern = list(dataX[start]) # copy so we don't modify dataX
print("Seed:")
print("\"", ''.join([int_to_char[value] for value in pattern]), "\"")
print("------------------------------------------------------------")

# generate characters
for i in range(n_chars):
 x = np.reshape(pattern, (1, len(pattern), 1))
 x = x / float(n_vocab)

 # get model's prediction (a vector of probabilities, one per character)
 prediction = model.predict(x, verbose=0)
 index = int(np.argmax(prediction)) # index of the most probable character

 # map the index back to its character
 result = int_to_char[index]

 # write the character to the console
 print(result, end = "")

 # slide the window: append the new index and drop the oldest one
 pattern.append(index)
 pattern = pattern[1:]

print("\n------------------------------------------------------------")
print("\nDone.")

Seed:
" erfect
      unconcern, “oh! but there were two or three much uglier in the
      shop; and when i h "
------------------------------------------------------------
ave not been to be ao aotaos of the mott
      conpenting oo the sore of the matter of the matter wh
------------------------------------------------------------

Done.


# On Your Own

<img src="https://drive.google.com/uc?export=view&id=1ghyQPx1N8dmU3MV4TrANvqNhGwnLni72" alt="Q" width = "200"/>

Now that you know how to build a Character Level LSTM, try implementing any of these updates:

- Try training your model with a different book, or a set of books (e.g. the Tiny Shakespeare data set, the Iliad, the Bible, or Frankenstein!)
- Add a DEEP LSTM architecture by adding another (or a few) extra LSTM layers. Remember you need to pass the `return_sequences=True` argument to ALL recurrent layers except the last one.
- Replace the LSTM layer with a GRU layer and compare the performance
- Generate a looooong string of output using your model (make `n_chars` big, like `2000`). Do you see any weird behavior?
- Discuss with your group some of the benefits, and some of the drawbacks of training a recurrent model on *character tokens* rather than *word tokens*.
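
As a possible starting point for the deep LSTM and GRU suggestions, here is a minimal sketch of a two-layer recurrent model. The helper name `build_stacked`, the constants `SEQ_LEN`, `N_FEATURES`, and `N_VOCAB` (copied from the Pride and Prejudice shapes above), and the choice of two layers with 256 units each are all assumptions for illustration; adjust them to your own data.

```python
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import LSTM, GRU, Dense, Dropout

# shapes taken from the Pride and Prejudice example; change for your own book
SEQ_LEN, N_FEATURES, N_VOCAB = 100, 1, 56

def build_stacked(cell=LSTM, units=256):
    """Two stacked recurrent layers followed by a softmax over characters."""
    inputs = Input(shape=(SEQ_LEN, N_FEATURES))
    # every recurrent layer except the last must return the full sequence
    x = cell(units, return_sequences=True)(inputs)
    x = Dropout(0.2)(x)
    # the last recurrent layer returns only its final state
    x = cell(units)(x)
    x = Dropout(0.2)(x)
    outputs = Dense(N_VOCAB, activation='softmax')(x)
    model = Model(inputs=inputs, outputs=outputs)
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    return model

deep_lstm = build_stacked(LSTM)
deep_gru = build_stacked(GRU)
print("LSTM parameters:", deep_lstm.count_params())
print("GRU parameters: ", deep_gru.count_params())
```

Both models are trained and used for generation exactly like the single-layer model above; comparing the parameter counts is one way to start the LSTM vs. GRU discussion.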