<a href="https://colab.research.google.com/github/drewamorbordelon/DS-Unit-4-Sprint-3-Deep-Learning/blob/main/module1-rnn-and-lstm/LS_DS_441_RNN_and_LSTM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 4, Sprint 3, Module 1*

---


# Recurrent Neural Networks (RNNs) and Long Short Term Memory (LSTM) (Prepare)

<img src="https://media.giphy.com/media/l2JJu8U8SoHhQEnoQ/giphy.gif" width=480 height=356>
<br></br>
<br></br>

## Learning Objectives
- <a href="#p1">Part 1: </a>Describe Neural Networks used for modeling sequences
- <a href="#p2">Part 2: </a>Apply an LSTM to a text generation problem using Keras

## Overview

> "Yesterday's just a memory - tomorrow is never what it's supposed to be." -- Bob Dylan

Wish you could save [Time In A Bottle](https://www.youtube.com/watch?v=AnWWj6xOleY)? With statistics you can do the next best thing - understand how data varies over time (or any sequential order), and use the order/time dimension predictively.

A sequence is just any enumerated collection - order counts, and repetition is allowed. Python lists are a good elemental example - `[1, 2, 2, -1]` is a valid list, and is different from `[1, 2, -1, 2]`. The data structures we tend to use (e.g. NumPy arrays) are often built on this fundamental structure.

A time series is data where you have not just the order but some actual continuous marker for where they lie "in time" - this could be a date, a timestamp, [Unix time](https://en.wikipedia.org/wiki/Unix_time), or something else. All time series are also sequences, and for some techniques you may just consider their order and not "how far apart" the entries are (if you have particularly consistent data collected at regular intervals it may not matter).

# Neural Networks for Sequences (Learn)

## Overview

There's plenty more to "traditional" time series, but a widely used deep learning technique for sequence data is the recurrent neural network (RNN). A recurrence relation in math is an equation that uses recursion to define a sequence - a famous example is the Fibonacci numbers:

$F_n = F_{n-1} + F_{n-2}$

For formal math you also need a base case $F_0=1, F_1=1$, and then the rest builds from there. But for neural networks what we're really talking about are loops:
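
As a quick illustration of the recurrence, here is a minimal sketch in plain Python. The function name `fibonacci` and its `f0`/`f1` arguments are made up for this example; the base case follows the one stated above.

```python
def fibonacci(n, f0=1, f1=1):
    """Return [F_0, ..., F_n] using the recurrence F_n = F_{n-1} + F_{n-2}."""
    values = [f0, f1]
    for _ in range(2, n + 1):
        # each new term only needs the two terms that came before it
        values.append(values[-1] + values[-2])
    return values[: n + 1]


print(fibonacci(7))  # [1, 1, 2, 3, 5, 8, 13, 21]
```

Each value is defined in terms of earlier values, which is the same idea a recurrent network applies to its hidden state.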

![Recurrent neural network](https://upload.wikimedia.org/wikipedia/commons/b/b5/Recurrent_neural_network_unfold.svg)

The hidden layers have edges (output) going back to their own input - this loop means that for any time `t` the hidden state is computed at least partly from the hidden state at time `t-1`. The entire network is represented on the left, and on the right it is unfolded explicitly to show how it behaves at each `t`.
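
To make the loop concrete, here is a rough NumPy sketch of a simple (Elman-style) recurrent layer unrolled over a short sequence. The names `rnn_forward`, `W_x`, `W_h`, and `b`, the `tanh` activation, and the random weights are all assumptions made for illustration; this is not how Keras implements its layers internally.

```python
import numpy as np


def rnn_forward(inputs, W_x, W_h, b):
    """Run a simple RNN over a sequence and return the hidden state at every step."""
    h = np.zeros(W_h.shape[0])  # initial hidden state: nothing seen yet
    states = []
    for x_t in inputs:  # one loop iteration per time step
        # the new state mixes the current input with the previous state
        h = np.tanh(W_x @ x_t + W_h @ h + b)
        states.append(h)
    return np.stack(states)


rng = np.random.default_rng(0)
n_features, n_hidden, n_steps = 3, 4, 5
W_x = rng.normal(scale=0.5, size=(n_hidden, n_features))
W_h = rng.normal(scale=0.5, size=(n_hidden, n_hidden))
b = np.zeros(n_hidden)
inputs = rng.normal(size=(n_steps, n_features))

states = rnn_forward(inputs, W_x, W_h, b)
print(states.shape)  # (5, 4): one hidden vector per time step
```

The same weights are reused at every step; only the hidden state `h` changes as the sequence is read.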

Different units can have this "loop", but a particularly successful one is the long short-term memory unit (LSTM):

![Long short-term memory unit](https://upload.wikimedia.org/wikipedia/commons/thumb/6/63/Long_Short-Term_Memory.svg/1024px-Long_Short-Term_Memory.svg.png)

There's a lot going on here - in a nutshell, the calculus still works out and backpropagation can still be implemented. The advantage (and namesake) of LSTM is that it can generally put more weight on recent (short-term) events while not completely losing older (long-term) information.

**CHECK OUT** [Colah's blog on LSTM](https://colah.github.io/posts/2015-08-Understanding-LSTMs/) for a beautifully clear and concise explanation of the model's architecture and the mathematics.

When gradients are backpropagated through many time steps, a simple RNN ends up multiplying many small factors together, so the gradients for early time steps become so small they are effectively zero - this is the [vanishing gradient problem](https://en.wikipedia.org/wiki/Vanishing_gradient_problem), and it is what the LSTM is designed to mitigate. Pay special attention to the cell state $c_t$ and how it passes through the unit to get an intuition for how this problem is addressed.
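
For intuition about the role of $c_t$, here is a hedged NumPy sketch of a single LSTM step. The function `lstm_step`, the gate ordering inside the stacked weights `W`, `U`, `b`, and the hand-picked biases are assumptions for this example, not the layout any particular library uses.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step; W, U, b stack four blocks: forget, input, candidate, output."""
    n = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    f = sigmoid(z[0 * n:1 * n])   # forget gate: how much of the old cell state to keep
    i = sigmoid(z[1 * n:2 * n])   # input gate: how much new information to write
    g = np.tanh(z[2 * n:3 * n])   # candidate values to write
    o = sigmoid(z[3 * n:4 * n])   # output gate: how much of the cell state to expose
    c_t = f * c_prev + i * g      # additive update of the cell state
    h_t = o * np.tanh(c_t)
    return h_t, c_t


rng = np.random.default_rng(0)
n_in, n_hidden = 3, 2
W = rng.normal(scale=0.1, size=(4 * n_hidden, n_in))
U = rng.normal(scale=0.1, size=(4 * n_hidden, n_hidden))
b = np.zeros(4 * n_hidden)
b[:n_hidden] = 20.0                # forget gate close to 1: keep the old cell state
b[n_hidden:2 * n_hidden] = -20.0   # input gate close to 0: write almost nothing new

h, c = np.zeros(n_hidden), np.array([0.7, -0.3])
for x_t in rng.normal(size=(50, n_in)):
    h, c = lstm_step(x_t, h, c, W, U, b)
print(c)  # stays very close to [0.7, -0.3] even after 50 steps
```

Because the cell state is updated by addition and gated by `f`, information (and gradient) can pass through many steps without being repeatedly squashed, which is the intuition behind how LSTMs ease the vanishing gradient problem.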

So why are these cool? One particularly compelling application is actually not time series but language modeling - language is inherently ordered data (letters/words go one after another, and the order *matters*). [The Unreasonable Effectiveness of Recurrent Neural Networks](https://karpathy.github.io/2015/05/21/rnn-effectiveness/) is a famous blog post on this topic that is well worth reading.

For our purposes, let's use TensorFlow and Keras to train RNNs with natural language. Resources:

- https://github.com/keras-team/keras/blob/master/examples/imdb_lstm.py
- https://keras.io/layers/recurrent/#lstm
- http://adventuresinmachinelearning.com/keras-lstm-tutorial/

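Note that TensorFlow 1.x's `tensorflow.contrib` module [also had an implementation of RNN/LSTM](https://www.tensorflow.org/tutorials/sequences/recurrent); `tensorflow.contrib` was removed in TensorFlow 2.x, so use the `tf.keras` layers there.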

## Follow Along

Sequences come in many shapes and forms from stock prices to text. We'll focus on text, because modeling text as a sequence is a strength of Neural Networks. Let's start with a simple classification task using a TensorFlow tutorial. 

### RNN/LSTM Sentiment Classification with Keras
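
The walkthrough below standardizes review lengths with `pad_sequences`. As a rough sketch of what that step does with its default settings (pad at the front, truncate from the front), here is a plain NumPy version. The helper name `pad_pre` is hypothetical and only meant to illustrate the behavior.

```python
import numpy as np


def pad_pre(seqs, maxlen, value=0):
    """Left-pad short sequences and keep only the last `maxlen` items of long ones."""
    out = np.full((len(seqs), maxlen), value, dtype="int32")
    for row, seq in enumerate(seqs):
        kept = seq[-maxlen:]                     # truncate from the front
        out[row, maxlen - len(kept):] = kept     # right-align, padding on the left
    return out


print(pad_pre([[1, 2, 3], [4, 5, 6, 7, 8, 9]], maxlen=4))
# [[0 1 2 3]
#  [6 7 8 9]]
```

The result is a rectangular `(samples, maxlen)` array, which is the shape the `Embedding` layer expects.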

In [None]:
'''
#Trains an LSTM model on the IMDB sentiment classification task.
The dataset is actually too small for LSTM to be of any advantage
compared to simpler, much faster methods such as TF-IDF + LogReg.
**Notes**
- RNNs are tricky. Choice of batch size is important,
choice of loss and optimizer is critical, etc.
Some configurations won't converge.
- LSTM loss decrease patterns during training can be quite different
from what you see with CNNs/MLPs/etc.
'''
from __future__ import print_function

from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding
from tensorflow.keras.layers import LSTM
from tensorflow.keras.datasets import imdb

max_features = 20000
# cut texts after this number of words (among top max_features most common words)
maxlen = 80
batch_size = 32

print('Loading data...')
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
print(len(x_train), 'train sequences')
print(len(x_test), 'test sequences')

In [None]:
# documentation on this data set here: https://www.tensorflow.org/api_docs/python/tf/keras/datasets/imdb/load_data
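# the values in the lists are word indices ranked by frequency: a lower index means a more frequent word in the corpus
# with the default load_data settings, 1 marks the start of a review, 2 stands for out-of-vocabulary words,
# and the indices of real words are offset by index_from=3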
# each list represents a movie review
x_train

In [None]:
# binary labels 
# 1 -> positive sentiment expressed in movie review
# 0 -> negative sentiment expressed in movie review 
y_train

In [None]:
# some LSTM implementations can handle variable-length samples, but this model expects fixed-length input,
# so we need to standardize the length of our movie reviews
# by default, reviews longer than maxlen are truncated from the front (the last maxlen tokens are kept)
# and reviews shorter than maxlen are padded at the front with 0 (or some other value that you provide)
print('Pad Sequences (samples x time)')
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)
print('x_train shape: ', x_train.shape)
print('x_test shape: ', x_test.shape)

In [None]:
x_train

In [None]:
# as usual, we begin to build our model by instantiating a Sequential class 
model = Sequential()

# input layer 
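# we are explicitly declaring the input layer here by adding an Embedding object 
model.add(Embedding(max_features, 128))

# hidden layer 1 
# this is the only LSTM layer, so it returns just its final output (the default, return_sequences=False);
# that gives the Dense layer one vector per review, matching the one label per review in y_train
model.add(LSTM(128, dropout=0.2))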

# output layer 
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy',
              optimizer='adam', 
              metrics=['accuracy'])

model.summary()

In [None]:
unicorns = model.fit(x_train, y_train,
          batch_size=256, 
          epochs=5, 
          validation_data=(x_test,y_test))

In [None]:
# as usual, we begin to build our model by instantiating a Sequential class 
model = Sequential()

# input layer 
# we are explicitly declaring the input layer here by adding an Embedding object 
model.add(Embedding(max_features, 128))

# hidden layer 1 
model.add(LSTM(128, dropout=0.2, return_sequences=True))

# during lecture, hidden layers 2 and 3 were commented out and back in to compare the performance of a 1-layer vs a 3-layer LSTM model
# hidden layer 2
model.add(LSTM(128, dropout=0.2, return_sequences=True))

# hidden layer 3 (the last LSTM layer returns only its final output)
model.add(LSTM(128, dropout=0.2))

# output layer 
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy',
              optimizer='adam', 
              metrics=['accuracy'])

model.summary()

In [None]:
unicorns2 = model.fit(x_train, y_train,
           batch_size=256, 
           epochs=5, 
           validation_data=(x_test,y_test))

In [None]:
# Plot training & validation loss values
import matplotlib.pyplot as plt

plt.figure(figsize=(10,6))
plt.grid()
# results for 1-layer lstm model
plt.plot(unicorns.history['loss'], "--", label="1 layer Train")
plt.plot(unicorns.history['val_loss'], "--", label = "1 layer Test")

# results for 3-layer lstm model
plt.plot(unicorns2.history['loss'], label="3 layers Train ")
plt.plot(unicorns2.history['val_loss'], label = "3 layers Test")
plt.title('Model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend()
plt.show();

## Challenge

You will be expected to use a Keras LSTM for a classification task on the *Sprint Challenge*.

# LSTM Text generation with Keras (Learn)

## Overview

What else can we do with LSTMs? Since we're analyzing the *sequence*, we can do more than classify - we can *generate* text. I've pulled some news stories using [newspaper](https://github.com/codelucas/newspaper/).

This example is drawn from the Keras [documentation](https://keras.io/examples/lstm_text_generation/).
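
Text generation is framed as "given the last few characters, predict the next one". Here is a minimal sketch of how a text can be cut into such training pairs, using a toy string rather than the news articles. The helper `make_windows` and the values `maxlen=4`, `step=2` are assumptions for illustration only.

```python
def make_windows(text, maxlen, step):
    """Cut text into (window of maxlen chars, next char) pairs, encoded as integers."""
    chars = sorted(set(text))  # sorted so the character ids are reproducible
    char_int = {c: i for i, c in enumerate(chars)}
    encoded = [char_int[c] for c in text]

    windows, targets = [], []
    for i in range(0, len(encoded) - maxlen, step):
        windows.append(encoded[i:i + maxlen])   # input: maxlen consecutive characters
        targets.append(encoded[i + maxlen])     # label: the character that follows
    return windows, targets, chars


windows, targets, chars = make_windows("hello world", maxlen=4, step=2)
for w, t in zip(windows, targets):
    print(repr("".join(chars[j] for j in w)), "->", repr(chars[t]))
# 'hell' -> 'o'
# 'llo ' -> 'w'
# 'o wo' -> 'r'
# 'worl' -> 'd'
```

A smaller `step` produces more, heavily overlapping training samples from the same text, which is how a small set of articles can yield a large number of sequences.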

In [None]:
from tensorflow.keras.callbacks import LambdaCallback
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM
from tensorflow.keras.optimizers import RMSprop

import numpy as np
import random
import sys
import os

In [None]:
import pandas as pd

# load text data (articles)
df = pd.read_json('https://raw.githubusercontent.com/LambdaSchool/DS-Unit-4-Sprint-3-Deep-Learning/main/module1-rnn-and-lstm/wp_articles.json')
df.head()

In [None]:
data = df['article'].values

In [None]:
# our small dataset of articles will be feature engineered to create over 100K sequences 
len(data)

In [None]:
# Encode Data as Chars

# Gather all text 
# Why? 
# 1. See all possible characters 
# 2. For training / splitting later
text = " ".join(data)

# Unique Characters (sorted so the lookup tables are identical on every run)
chars = sorted(set(text))

# Lookup Tables
char_int = {c:i for i, c in enumerate(chars)} 
int_char = {i:c for i, c in enumerate(chars)} 

In [None]:
# Create the sequence data

maxlen = 20
step = 5

encoded = [char_int[c] for c in text]

sequences = [] # Each element is maxlen (20) chars long
next_char = [] # One element for each sequence

for i in range(0, len(encoded) - maxlen, step):
    
    sequences.append(encoded[i : i + maxlen])
    next_char.append(encoded[i + maxlen])
    
# we now have this many samples 
print('sequences: ', len(sequences))

In [None]:
import tensorflow as tf
from tensorflow.keras.preprocessing import sequence

# every window already has maxlen items, so this simply converts the list of lists into a 2-D integer array
seq = tf.keras.preprocessing.sequence.pad_sequences(sequences, maxlen=maxlen)

In [None]:
# build the model: a single bidirectional LSTM layer on top of a character embedding
# diagram of how bidirectional lstms work: https://miro.medium.com/max/764/1*6QnPUSv_t9BY9Fv8_aLb-Q.png
# Keras docs: https://keras.io/api/layers/recurrent_layers/bidirectional/
from tensorflow.keras.layers import Bidirectional, Embedding

model = Sequential()

# input layer
model.add(Embedding(output_dim=64, input_dim=len(chars)))

# hidden layer
model.add(Bidirectional(LSTM(64, dropout=0.2)))

# output layer
model.add(Dense(len(chars), activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam')

In [None]:
model.summary()

In [None]:
# Create x & y

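# np.bool is not available in some NumPy versions (e.g. 1.24); the built-in bool works everywhere
x = np.zeros((len(sequences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sequences), len(chars)), dtype=bool)

# one-hot encode each window (x) and the character that follows it (y)
for i, window in enumerate(sequences):
    for t, char in enumerate(window):
        x[i, t, char] = 1

    y[i, next_char[i]] = 1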

In [None]:
# train model on sequence data
# the model predicts a probability distribution over characters; y is the one-hot encoded next character
model.fit(seq, y,
          batch_size=32,
          epochs=5, verbose=2)

In [None]:
def generate_text(model, seed, length):
    """
    Take a trained model, a starter string (i.e. seed), and the number of predictions (i.e. length)
    and use them to predict what characters come after our seed string. 

    The goal here is to use a model to predict what text should follow our starter text. 
    A well-trained model can do a good job at this; in fact, a well-trained model can generate
    an entire document given a seed string. 

    Check out an example of an LSTM generating whole documents: 
    http://karpathy.github.io/2015/05/21/rnn-effectiveness/
    """

    # every character in the seed must also appear in the training text
    encoded = [char_int[c] for c in seed]

    generated = seed

    model.reset_states()

    for _ in range(length):

        # feed the model the most recent maxlen characters (the window size used in training)
        sample = np.array(encoded[-maxlen:])
        sample = np.expand_dims(sample, 0)  # add a batch dimension: (1, window length)

        pred = model.predict(sample, verbose=0)[0]

        # greedy decoding: always pick the most likely next character
        next_index = int(np.argmax(pred))
        encoded.append(next_index)
        generated += int_char[next_index]

    return generated

In [None]:
generate_text(model, "Google's CEO appeared in the senate today to address privacy concerns.", 100)

In [None]:
def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)

In [None]:
def on_epoch_end(epoch, _):
    # Function invoked at end of each epoch. Prints generated text.
    
    print()
    print('----- Generating text after Epoch: %d' % epoch)
    
    #model.reset_states()

    start_index = random.randint(0, len(text) - maxlen - 1)
    
    generated = ''
    
    sentence = text[start_index: start_index + maxlen]
    generated += sentence
    
    print('----- Generating with seed: "' + sentence + '"')
    sys.stdout.write(generated)
    
    for i in range(400):
        x_pred = np.zeros((1, maxlen, len(chars)))
        for t, char in enumerate(sentence):
            x_pred[0, t, char_int[char]] = 1
            
        preds = model.predict(x_pred, verbose=0)[0]
        next_index = sample(preds)
        next_char = int_char[next_index]
        
        sentence = sentence[1:] + next_char
        
        sys.stdout.write(next_char)
        sys.stdout.flush()
    print()


print_callback = LambdaCallback(on_epoch_end=on_epoch_end)

In [None]:

# build another model, this time on one-hot encoded input, to predict what text should follow our seed string 
model = Sequential()
model.add(LSTM(128, input_shape=(maxlen, len(chars))))
model.add(Dense(len(chars), activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam')

In [None]:
# fit the model

model.fit(x, y,
          batch_size=256,
          epochs=10,
          callbacks=[print_callback])

## Challenge

You will be expected to use a Keras LSTM to generate text on today's assignment. 
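
When you experiment with generation, the `temperature` argument of the `sample` helper above is worth understanding. Here is a small sketch of the rescaling step on its own, without the random draw. The function name `apply_temperature` and the toy probabilities are made up for this example.

```python
import numpy as np


def apply_temperature(preds, temperature=1.0):
    """Rescale a probability vector: low temperature sharpens it, high temperature flattens it."""
    preds = np.asarray(preds, dtype="float64")
    logits = np.log(preds) / temperature
    exp = np.exp(logits - logits.max())  # subtract the max for numerical stability
    return exp / exp.sum()


probs = np.array([0.6, 0.3, 0.1])
for t in (0.5, 1.0, 2.0):
    print(t, apply_temperature(probs, t).round(3))
```

With a temperature below 1 the most likely character dominates even more (safer, more repetitive text); above 1 the distribution flattens (more surprising, more error-prone text); at exactly 1 the probabilities are unchanged.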

# Review

- <a href="#p1">Part 1: </a>Describe Neural Networks used for modeling sequences
    * Sequence Problems:
        - Time Series (like Stock Prices, Weather, etc.)
        - Text Classification
        - Text Generation
        - And many more! :D
    * LSTMs are generally preferred over simple (vanilla) RNNs for most problems
    * An LSTM model often uses a single hidden LSTM layer, although deeper (stacked) and other architectures are possible.
    * Keras has LSTM/RNN layer types implemented nicely
- <a href="#p2">Part 2: </a>Apply an LSTM to a text generation problem using Keras
    * Shape of input data is very important
    * Can take a while to train
    * You can use it to write movie scripts. :P 
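
Since the shape of the input data matters so much, here is a tiny sketch of the one-hot layout used for the character model: `x` has shape `(samples, maxlen, vocabulary size)` and `y` has shape `(samples, vocabulary size)`. The toy `windows`, `targets`, and sizes below are invented for illustration.

```python
import numpy as np

# two toy samples, each a window of 3 character ids from a 4-character vocabulary
windows = [[0, 2, 1], [2, 1, 3]]
targets = [3, 0]  # id of the character that follows each window
vocab_size, maxlen = 4, 3

x = np.zeros((len(windows), maxlen, vocab_size), dtype=bool)
y = np.zeros((len(windows), vocab_size), dtype=bool)
for i, window in enumerate(windows):
    x[i, np.arange(maxlen), window] = True  # one "hot" entry per time step
    y[i, targets[i]] = True                 # one "hot" entry per sample

print(x.shape, y.shape)  # (2, 3, 4) (2, 4)
```

An LSTM fed one-hot input needs `input_shape=(maxlen, vocab_size)`, while a model that starts with an `Embedding` layer takes the integer ids directly with shape `(samples, maxlen)`.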

# Extra:

Again, LSTMs are a fundamental and important type of layer that is commonly used to build more sophisticated models (think of LSTMs as Lego pieces that can be combined to create more complex structures). 

For those of you who are particularly interested in the types of problems that RNNs can solve, I encourage you to check out the attention networks that I mentioned in lecture: models that add an attention mechanism on top of recurrent layers such as LSTMs. This is the article that I pointed out (there is plenty of documentation on this type of model elsewhere). 

https://towardsdatascience.com/light-on-math-ml-attention-with-keras-dc8dbc1fad39