Lambda School Data Science

*Unit 4, Sprint 3, Module 1*

---


# Recurrent Neural Networks (RNNs) and Long Short Term Memory (LSTM) (Prepare)

<img src="https://media.giphy.com/media/l2JJu8U8SoHhQEnoQ/giphy.gif" width=480 height=356>
<br></br>
<br></br>

## Learning Objectives
- <a href="#p1">Part 1: </a>Describe Neural Networks used for modeling sequences
- <a href="#p2">Part 2: </a>Apply an LSTM to a text generation problem using Keras

## Overview

> "Yesterday's just a memory - tomorrow is never what it's supposed to be." -- Bob Dylan

Wish you could save [Time In A Bottle](https://www.youtube.com/watch?v=AnWWj6xOleY)? With statistics you can do the next best thing - understand how data varies over time (or any sequential order), and use the order/time dimension predictively.

A sequence is just any enumerated collection - order counts, and repetition is allowed. Python lists are a good elemental example - `[1, 2, 2, -1]` is a valid list, and is different from `[1, 2, -1, 2]`. The data structures we tend to use (e.g. NumPy arrays) are often built on this fundamental structure.

A time series is data where you have not just the order but some actual continuous marker for where they lie "in time" - this could be a date, a timestamp, [Unix time](https://en.wikipedia.org/wiki/Unix_time), or something else. All time series are also sequences, and for some techniques you may just consider their order and not "how far apart" the entries are (if you have particularly consistent data collected at regular intervals it may not matter).

Fun Fact: Pandas was originally developed at a hedge fund (AQR Capital Management) for financial data analysis, so it has strong built-in support for time series.

# Neural Networks for Sequences (Learn)

## Overview

There's plenty more to "traditional" time series analysis, but a widely used neural technique for sequence data is the recurrent neural network (RNN). A recurrence relation in math is an equation that uses recursion to define a sequence - a famous example is the Fibonacci numbers:

$F_n = F_{n-1} + F_{n-2}$

For formal math you also need a base case (conventionally $F_0=0, F_1=1$), and then the rest builds from there. But for neural networks what we're really talking about are loops:
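As a quick illustration of a recurrence relation in code, here is a minimal sketch that builds the sequence from two starting values. The function name `fibonacci` and its parameters are illustrative only; the base case is passed in explicitly so that either starting convention can be tried.

```python
def fibonacci(n_terms, f0=0, f1=1):
    """Return the first n_terms values of F_n = F_{n-1} + F_{n-2}."""
    terms = []
    prev, curr = f0, f1
    for _ in range(n_terms):
        terms.append(prev)
        # each new term is defined by the two terms before it
        prev, curr = curr, prev + curr
    return terms

print(fibonacci(8))        # starts from 0, 1
print(fibonacci(8, 1, 1))  # starts from 1, 1
```

Every term after the base case is computed from earlier terms, which is the same "feed the previous result back in" idea that a recurrent network applies to its hidden state.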

![Recurrent neural network](https://upload.wikimedia.org/wikipedia/commons/b/b5/Recurrent_neural_network_unfold.svg)

The hidden layers have edges (output) going back to their own input - this loop means that for any time `t` the output depends at least partly on the hidden state from time `t-1`. The entire network is represented on the left, and you can unfold it explicitly to see how it behaves at any given `t`.
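The loop can be sketched in a few lines of NumPy. This is a simplified forward pass for a plain recurrent layer, not how Keras implements it internally; the names `simple_rnn_forward`, `W_x`, `W_h`, and `b` are assumptions made for this example.

```python
import numpy as np

def simple_rnn_forward(inputs, W_x, W_h, b):
    """Run a plain recurrent layer over a sequence.

    inputs: array of shape (timesteps, input_dim)
    Returns the hidden state at every timestep, shape (timesteps, hidden_dim).
    """
    hidden_dim = W_h.shape[0]
    h = np.zeros(hidden_dim)  # initial hidden state
    states = []
    for x_t in inputs:  # one step per element of the sequence
        # the new state depends on the current input AND the previous state
        h = np.tanh(x_t @ W_x + h @ W_h + b)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
timesteps, input_dim, hidden_dim = 5, 3, 4
inputs = rng.normal(size=(timesteps, input_dim))
W_x = rng.normal(scale=0.5, size=(input_dim, hidden_dim))
W_h = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

states = simple_rnn_forward(inputs, W_x, W_h, b)
print(states.shape)
```

The same weights are reused at every timestep; "unfolding" the network just means writing out one copy of this loop body per value of `t`.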

Different units can have this "loop", but a particularly successful one is the long short-term memory unit (LSTM):

![Long short-term memory unit](https://upload.wikimedia.org/wikipedia/commons/thumb/6/63/Long_Short-Term_Memory.svg/1024px-Long_Short-Term_Memory.svg.png)

There's a lot going on here - in a nutshell, the calculus still works out and backpropagation can still be implemented. The advantage (and namesake) of LSTM is that it can generally put more weight on recent (short-term) events while not completely losing older (long-term) information.
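To make the diagram more concrete, here is a minimal NumPy sketch of one LSTM timestep. It is a simplified illustration under stated assumptions, not the Keras implementation: the function name `lstm_step`, the weight names `W`, `U`, `b`, and the ordering of the four gate blocks are all choices made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM timestep.

    W: (input_dim, 4 * units), U: (units, 4 * units), b: (4 * units,)
    The four blocks are the input gate, forget gate, candidate, and output gate.
    """
    units = h_prev.shape[0]
    z = x_t @ W + h_prev @ U + b
    i = sigmoid(z[0 * units:1 * units])  # input gate: how much new info to write
    f = sigmoid(z[1 * units:2 * units])  # forget gate: how much old memory to keep
    g = np.tanh(z[2 * units:3 * units])  # candidate values to write
    o = sigmoid(z[3 * units:4 * units])  # output gate: how much memory to expose
    c_t = f * c_prev + i * g             # cell state: scaled old memory plus new info
    h_t = o * np.tanh(c_t)               # hidden state passed to the next step
    return h_t, c_t

rng = np.random.default_rng(0)
input_dim, units = 3, 4
W = rng.normal(scale=0.1, size=(input_dim, 4 * units))
U = rng.normal(scale=0.1, size=(units, 4 * units))
b = np.zeros(4 * units)

h, c = np.zeros(units), np.zeros(units)
for x_t in rng.normal(size=(5, input_dim)):  # a toy sequence of 5 timesteps
    h, c = lstm_step(x_t, h, c, W, U, b)
print(h.shape, c.shape)
```

The forget gate `f` controls how much of the older cell state survives, and the input gate `i` controls how much of the newest input is written, which is how the unit balances long-term and short-term information.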

When errors are propagated back through many time steps, a plain recurrent network multiplies many small factors together, so the gradients for earlier steps become so small they are effectively zero - this is the [vanishing gradient problem](https://en.wikipedia.org/wiki/Vanishing_gradient_problem), and is what RNN with LSTM addresses. Pay special attention to the $c_t$ parameters and how they pass through the unit to get an intuition for how this problem is solved.
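A rough numerical picture of the problem: if the gradient is scaled by roughly the same factor at every step, its size after many steps is that factor raised to the number of steps. The sketch below is a deliberately simplified scalar model, and the factors 0.5 and 0.99 are illustrative values, not measurements from a real network.

```python
def gradient_scale(per_step_factor, n_steps):
    """Size of a gradient after being multiplied by the same factor at every step."""
    scale = 1.0
    for _ in range(n_steps):
        scale *= per_step_factor
    return scale

for n_steps in (10, 50, 100):
    squashed = gradient_scale(0.5, n_steps)  # e.g. a weight times an activation slope, both below 1
    gated = gradient_scale(0.99, n_steps)    # e.g. a forget gate that stays close to 1
    print(f"{n_steps:>3} steps: factor 0.5 -> {squashed:.2e}, factor 0.99 -> {gated:.2e}")
```

When the per-step factor stays close to 1, as it can along the cell state path of an LSTM when the forget gate is mostly open, the signal from early steps survives far longer.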

So why are these cool? One particularly compelling application is actually not time series but language modeling - language is inherently ordered data (letters/words go one after another, and the order *matters*). [The Unreasonable Effectiveness of Recurrent Neural Networks](https://karpathy.github.io/2015/05/21/rnn-effectiveness/) is a famous blog post on this topic and well worth reading.

For our purposes, let's use TensorFlow and Keras to train RNNs with natural language. Resources:

- https://github.com/keras-team/keras/blob/master/examples/imdb_lstm.py
- https://keras.io/layers/recurrent/#lstm
- http://adventuresinmachinelearning.com/keras-lstm-tutorial/

Note that in TensorFlow 1.x, `tensorflow.contrib` [also had an implementation of RNN/LSTM](https://www.tensorflow.org/tutorials/sequences/recurrent); the `contrib` module was removed in TensorFlow 2.x.

## Follow Along

Sequences come in many shapes and forms from stock prices to text. We'll focus on text, because modeling text as a sequence is a strength of Neural Networks. Let's start with a simple classification task using a TensorFlow tutorial. 

### RNN/LSTM Sentiment Classification with Keras

In [1]:
'''
#Trains an LSTM model on the IMDB sentiment classification task.
The dataset is actually too small for LSTM to be of any advantage
compared to simpler, much faster methods such as TF-IDF + LogReg.
**Notes**
- RNNs are tricky. Choice of batch size is important,
choice of loss and optimizer is critical, etc.
Some configurations won't converge.
- LSTM loss decrease patterns during training can be quite different
from what you see with CNNs/MLPs/etc.
'''
from __future__ import print_function

from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding
from tensorflow.keras.layers import LSTM
from tensorflow.keras.datasets import imdb # import the data set

max_features = 20000
# cut texts after this number of words (among top max_features most common words)
# makes input sequence length uniform
maxlen = 80
batch_size = 32 # powers of 2 are a common convention for batch sizes

print('Loading data...')
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
print(len(x_train), 'train sequences')
print(len(x_test), 'test sequences')

Loading data...
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz
25000 train sequences
25000 test sequences


In [None]:
x_train[0] # each review is already encoded as a list of integer word indices (lower index = more frequent word)

In [4]:
type(x_train[0]), len(x_train[0])

(list, 218)

In [7]:
print('Pad sequences (samples x time)')
# by default, keeps the last 80 tokens of longer reviews and left-pads shorter ones with zeros
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)
print('x_train shape: ', x_train.shape)
print('x_test shape: ', x_test.shape)

Pad sequences (samples x time)
x_train shape:  (25000, 80)
x_test shape:  (25000, 80)
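To see what padding and truncation do to a single review, here is a small pure-Python sketch that mimics the default behavior of `pad_sequences` (padding and truncating at the front, with 0 as the fill value). The helper `pad_or_truncate` is hypothetical and written only for illustration; it is not part of Keras.

```python
def pad_or_truncate(seq, maxlen, value=0):
    """Mimic the default 'pre' behavior: keep the LAST maxlen items, left-pad with value."""
    trimmed = seq[-maxlen:]  # truncation drops items from the front
    return [value] * (maxlen - len(trimmed)) + trimmed

print(pad_or_truncate([5, 6, 7], maxlen=5))              # [0, 0, 5, 6, 7]
print(pad_or_truncate([1, 2, 3, 4, 5, 6, 7], maxlen=5))  # [3, 4, 5, 6, 7]
```

Every row ends up with the same length, which is what lets the list of variable-length reviews become a single 2D array.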


In [9]:
type(x_train[0]), len(x_train[0]) # note that it changed list of numbers to numpy array 

(numpy.ndarray, 80)

In [10]:
model = Sequential() # just like last week 
# the embedding layer maps each of the 20000 word indices to a dense 128-dimensional vector,
# so words used in similar contexts can end up with similar vectors
# it usually isn't counted as a hidden layer since it acts more like feature extraction
model.add(Embedding(max_features, 128))
# a single LSTM layer is common for a task of this size
# no activation argument is needed; the layer's defaults are used
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
# output layer, sigmoid for binary 
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy',
             optimizer='adam',
             metrics=['accuracy'])

model.summary() 
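# embedding params: 20000 x 128 = 2,560,000
# lstm params: 4 x ((128 + 128) x 128 + 128) = 131,584 (four gate blocks, each with input weights, recurrent weights, and a bias)
# dense params: 128 weights + 1 bias = 129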

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, None, 128)         2560000   
_________________________________________________________________
lstm (LSTM)                  (None, 128)               131584    
_________________________________________________________________
dense (Dense)                (None, 1)                 129       
Total params: 2,691,713
Trainable params: 2,691,713
Non-trainable params: 0
_________________________________________________________________
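The parameter counts in the summary can be reproduced by hand. The helper functions below are hypothetical and written only for this check; they encode the usual counting rules for an embedding table, an LSTM layer with four gate blocks, and a dense layer with a bias.

```python
def embedding_params(vocab_size, embed_dim):
    # one embed_dim-sized vector per word index
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    # four gate blocks, each with input weights, recurrent weights, and a bias
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # one weight per input per unit, plus one bias per unit
    return input_dim * units + units

print(embedding_params(20000, 128))  # 2560000
print(lstm_params(128, 128))         # 131584
print(dense_params(128, 1))          # 129
```

Adding the three values gives 2,691,713, which matches the total reported by `model.summary()`.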


In [14]:
lstm_history = model.fit(x_train, y_train,
         batch_size = batch_size,
         epochs = 5,
         validation_data = (x_test, y_test))

Train on 25000 samples, validate on 25000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
 4288/25000 [====>.........................] - ETA: 1:23 - loss: 0.0252 - accuracy: 0.9920

KeyboardInterrupt: 

In [None]:
import matplotlib.pyplot as plt
# fit() returns a History object; its .history attribute is a dictionary of per-epoch metrics
plt.plot(lstm_history.history['loss'])
plt.plot(lstm_history.history['val_loss'])
plt.title('Model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train','Test'])
plt.show();

## Challenge

You will be expected to use a Keras LSTM for a classification task on the *Sprint Challenge*. 

# LSTM Text generation with Keras (Learn)

## Overview

What else can we do with LSTMs? Since we're analyzing the *sequence*, we can do more than classify - we can *generate* text. I've pulled some news stories using [newspaper](https://github.com/codelucas/newspaper/).

This example is drawn from the Keras [documentation](https://keras.io/examples/lstm_text_generation/).

In [15]:
# need some custom callbacks at end of each epoch
# do this with lambda callback

from tensorflow.keras.callbacks import LambdaCallback
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM
from tensorflow.keras.optimizers import RMSprop # not used below: the model is compiled with optimizer='adam'

import numpy as np
import random
import sys
import os

In [16]:
data_files = os.listdir('./articles')

In [17]:
# Read in Data
# loop over text files and append raw text
data = []
# utf-8 to fix UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 338: character maps to <undefined>
for file in data_files:
    if file[-3:] == 'txt':
        with open(f'./articles/{file}', 'r', encoding='utf-8') as f:
            data.append(f.read())

In [18]:
len(data)

136

In [19]:
data[-1] # raw text content of last one 

'Here’s their advice to upgrade your game:\n\n1. Be quiet and listen\n\nRegistering and understanding noise is a huge key to helping you win. If you listen closely enough, you can predict what the enemy will do. Likewise, manage your own noise so you don’t make your movements so obvious.\n\nAD\n\nUbisoft developed its own sound propagation system to make the game as realistic as possible. In typical games, you’ll hear a noise from an adjacent room, but it’s muffled. In Siege, noise travels from person to person through space in the shortest way possible, bouncing off walls and entering through doorways.\n\nAD\n\nNiclas “Pengu” Mouritzen, flex player for the G2 Esports Siege team, said managing noise is something he and his teammates constantly work on. For example, jumping with your drone is quite loud, making it easy for enemies to track it down and destroy it. Mouritzen cautioned to only jump away if your drone is in danger of being destroyed.\n\nOne tip people don’t think about, Mou

In [20]:
type(data)

list

In [21]:
# Encode Data as Chars

# gather all text 
# why? 1. see all possible characters 2. for training later(TM)

text = " ".join(data)

# unique characters as a list 

chars = list(set(text))

# lookup tables
char_int = {c:i for i, c in enumerate(chars)}
int_char =  {i:c for i, c in enumerate(chars)}

In [22]:
len(chars)

121
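The two lookup tables are inverses of each other, which a toy string makes easy to check. This sketch uses its own variable names (prefixed with `sample_`) so it does not overwrite the notebook's `chars`, `char_int`, or `int_char`. It also uses `sorted(set(...))`, an optional variation that gives the same character ordering on every run, whereas `list(set(...))` does not guarantee one.

```python
sample_text = "hello world"

# sorted() gives a stable ordering; a bare set has no guaranteed order
sample_chars = sorted(set(sample_text))
sample_char_int = {c: i for i, c in enumerate(sample_chars)}
sample_int_char = {i: c for i, c in enumerate(sample_chars)}

# encode the text as integers, then decode it back
encoded_sample = [sample_char_int[c] for c in sample_text]
decoded_sample = "".join(sample_int_char[i] for i in encoded_sample)

print(sample_chars)
print(encoded_sample)
print(decoded_sample)
```

A stable ordering matters if the lookup tables are rebuilt later, for example when reloading a saved model in a new session.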

In [None]:
# we work with characters instead of words to keep training time manageable for the lecture
# in practice, you would probably tokenize with spaCy, use lemmas, etc.
# setup note: create a week 3 environment that clones week 1's and adds week 2's packages

In [23]:
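# create the sequence data 
# scan over the full text string and take 80-char chunks, moving 5 chars at a time
maxlen = 80 # the loop below reads `maxlen`; 80 matches the x.shape and seed lengths in the recorded output
step = 5 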

encoded = [char_int[c] for c in text]

sequences = [] # input
next_char = [] # target

for i in range(0, len(encoded) - maxlen, step):
    sequences.append(encoded[i : i+ maxlen]) # input data
    next_char.append(encoded[i + maxlen]) # target
    
print('sequences: ', len(sequences))
# what does our target look like when generating text instead of predicting it?

sequences:  178366
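The windowing is easier to see on a short string. The sketch below applies the same slicing pattern to ten letters with a window of 4 and a step of 2; the helper name `make_windows` and the toy values are assumptions made for illustration.

```python
def make_windows(encoded_text, window, step):
    """Split a sequence into fixed-length inputs and the item that follows each one."""
    inputs, targets = [], []
    for i in range(0, len(encoded_text) - window, step):
        inputs.append(encoded_text[i:i + window])  # the chunk the model sees
        targets.append(encoded_text[i + window])   # the item it should predict
    return inputs, targets

toy = list("abcdefghij")
toy_inputs, toy_targets = make_windows(toy, window=4, step=2)
for chunk, target in zip(toy_inputs, toy_targets):
    print("".join(chunk), "->", target)
```

Each input chunk is paired with the single character that comes right after it, so the model learns to predict the next character from the preceding window.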


In [24]:
sequences[0][:20] # integer codes, not text; they still need to be one-hot encoded for the model

[109,
 20,
 75,
 92,
 73,
 86,
 76,
 20,
 112,
 81,
 11,
 86,
 20,
 34,
 81,
 11,
 86,
 92,
 20,
 30]

In [25]:
len(sequences)

178366

In [27]:
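# create x and y 
# x needs 3 dimensions: (number of sequences, timesteps, number of unique chars)
# each character becomes a one-hot vector
x = np.zeros((len(sequences), maxlen, len(chars)), dtype=bool)  # builtin bool; the np.bool alias was removed in NumPy 1.24
y = np.zeros((len(sequences), len(chars)), dtype=bool)

# `seq` rather than `sequence`, so the Keras `sequence` module imported earlier is not shadowed
for i, seq in enumerate(sequences):
    for t, char in enumerate(seq):
        x[i, t, char] = 1
    # finally mark the target (next) character
    y[i, next_char[i]] = 1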
    

In [28]:
x.shape

(178366, 80, 121)

In [29]:
y.shape

(178366, 121)
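A toy version of the same encoding shows what the three dimensions hold. The values below (two windows of three codes, a vocabulary of four) are made up for illustration.

```python
import numpy as np

toy_sequences = [[0, 2, 1], [2, 1, 3]]  # two windows of three integer codes
toy_targets = [3, 0]                    # the code that follows each window
vocab_size = 4

# (samples, timesteps, vocabulary size) for inputs, (samples, vocabulary size) for targets
toy_x = np.zeros((len(toy_sequences), 3, vocab_size), dtype=bool)
toy_y = np.zeros((len(toy_sequences), vocab_size), dtype=bool)

for i, seq in enumerate(toy_sequences):
    for t, code in enumerate(seq):
        toy_x[i, t, code] = True     # mark which character sits at position t
    toy_y[i, toy_targets[i]] = True  # mark the character to predict

print(toy_x.shape, toy_y.shape)
print(toy_x[0].astype(int))
```

Each row of `toy_x[0]` has exactly one 1, in the column of the character at that timestep; taking `argmax` along the last axis recovers the original integer codes.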

In [35]:
# build the model: a single LSTM

model = Sequential()
# no embedding layer needed
model.add(LSTM(128, input_shape=(maxlen, len(chars))))
model.add(Dense(len(chars), activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam')
model.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_2 (LSTM)                (None, 128)               128000    
_________________________________________________________________
dense_2 (Dense)              (None, 121)               15609     
Total params: 143,609
Trainable params: 143,609
Non-trainable params: 0
_________________________________________________________________


In [36]:
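# the model returns a probability distribution over characters;
# instead of always taking the most likely one, draw the next character at random from that distribution
def sample(preds):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / 1  # the divisor acts as a "temperature"; 1 leaves the distribution unchanged
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)  # renormalize so the probabilities sum to 1
    probas = np.random.multinomial(1, preds, 1)  # one draw, returned as a one-hot vector
    return np.argmax(probas)  # index of the drawn character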
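The divisor in the log step can be exposed as a temperature parameter. The sketch below is one possible variation, not the notebook's function: the name `sample_with_temperature`, the `rng` argument, and the small constant added before the log are assumptions made for this example.

```python
import numpy as np

def sample_with_temperature(preds, temperature=1.0, rng=None):
    """Draw one index from preds after rescaling by temperature."""
    if rng is None:
        rng = np.random.default_rng()
    preds = np.asarray(preds, dtype="float64")
    logits = np.log(preds + 1e-12) / temperature  # small constant avoids log(0)
    logits -= logits.max()                        # shift for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()                          # renormalize to sum to 1
    return int(rng.choice(len(probs), p=probs))

preds = [0.1, 0.6, 0.3]
rng = np.random.default_rng(0)
cold = [sample_with_temperature(preds, temperature=0.01, rng=rng) for _ in range(20)]
warm = [sample_with_temperature(preds, temperature=1.0, rng=rng) for _ in range(20)]
print(cold)
print(warm)
```

A temperature well below 1 concentrates nearly all probability on the most likely character, so the text becomes repetitive but safe; a temperature of 1 samples from the model's distribution as-is, and higher values make the output more varied and more error-prone.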

In [37]:
def on_epoch_end(epoch, _):
    # Function invoked at end of each epoch. Prints generated text.
    
    print()
    print('----- Generating text after Epoch: %d' % epoch)
    
    start_index = random.randint(0, len(text) - maxlen - 1)
    
    generated = ''
    
    sentence = text[start_index: start_index + maxlen]
    generated += sentence
    
    print('----- Generating with seed: "' + sentence + '"')
    sys.stdout.write(generated)
    
    for i in range(400):
        x_pred = np.zeros((1, maxlen, len(chars)))
        for t, char in enumerate(sentence):
            x_pred[0, t, char_int[char]] = 1
            
        preds = model.predict(x_pred, verbose=0)[0]
        next_index = sample(preds)
        next_char = int_char[next_index]
        
        sentence = sentence[1:] + next_char
        
        sys.stdout.write(next_char)
        sys.stdout.flush()
    print()


print_callback = LambdaCallback(on_epoch_end=on_epoch_end)

In [38]:
# fit the model

model.fit(x, y,
         batch_size=32,
         epochs=10,
         callbacks=[print_callback])

Train on 178366 samples
Epoch 1/10
----- Generating text after Epoch: 0
----- Generating with seed: "ed Arab Emirates in more than a decade, emphasizing not only coordination betwee"
ed Arab Emirates in more than a decade, emphasizing not only coordination betwee’ that a perent do stes ard tiucchiobure Nes lispome ur an4ry? Toumkhlile boad in com Orratiwh fort that a aliils sAses Cam co  porteng solpenvith, Suapletinter perores andighing ce feotbevistno4 fars hiap Thi s(es t mavera,, ates revus wawise that och and of resolian founen)-inc, (fencausting aJ done mmpal sith ormars/ atseinco har hew sage le ande, icrsastont Iust, ene herkidropiz O stke mosz Fr
Epoch 2/10
----- Generating text after Epoch: 1
----- Generating with seed: "ic, “it’s tough to make the adjustment. It’s a process.” Still, the more you bak"
ic, “it’s tough to make the adjustment. It’s a process.” Still, the more you bake the sian dile gored undides, perbindale Prienint and purdee Trump’s situved to exair inretens o

<tensorflow.python.keras.callbacks.History at 0x1faa768f1d0>

## Challenge

You will be expected to use a Keras LSTM to generate text on today's assignment. 

# Review

- <a href="#p1">Part 1: </a>Describe Neural Networks used for modeling sequences
    * Sequence Problems:
        - Time Series (like Stock Prices, Weather, etc.)
        - Text Classification
        - Text Generation
        - And many more! :D
    * LSTMs are generally preferred over simple (vanilla) RNNs for most problems
    * An LSTM model typically uses a single hidden LSTM layer, although other architectures (e.g. stacked LSTM layers) are possible.
    * Keras has LSTMs/RNN layer types implemented nicely
- <a href="#p2">Part 2: </a>Apply an LSTM to a text generation problem using Keras
    * Shape of input data is very important
    * Can take a while to train
    * You can use it to write movie scripts. :P 