<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>
<br></br>

## *Data Science Unit 4 Sprint 3 Lesson 1*

# Recurrent Neural Networks and Long Short Term Memory (LSTM)
## _aka_ PREDICTING THE FUTURE!

<img src="https://media.giphy.com/media/l2JJu8U8SoHhQEnoQ/giphy.gif" width=480 height=356>
<br></br>
<br></br>

> "Yesterday's just a memory - tomorrow is never what it's supposed to be." -- Bob Dylan

Wish you could save [Time In A Bottle](https://www.youtube.com/watch?v=AnWWj6xOleY)? With statistics you can do the next best thing - understand how data varies over time (or any sequential order), and use the order/time dimension predictively.

A sequence is just any enumerated collection - order counts, and repetition is allowed. Python lists are a good elemental example - `[1, 2, 2, -1]` is a valid list, and is different from `[1, 2, -1, 2]`. The data structures we tend to use (e.g. NumPy arrays) are often built on this fundamental structure.
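
As a quick sketch of what "order counts, and repetition is allowed" means in practice, the snippet below compares the two lists from the paragraph above, and contrasts them with sets, which ignore both order and repetition:

```python
# Lists are sequences: order and repetition both matter.
a = [1, 2, 2, -1]
b = [1, 2, -1, 2]

print(a == b)                  # False: same elements, different order
print(set(a) == set(b))        # True: sets discard order and repetition
print(sorted(a) == sorted(b))  # True: equal once order is ignored
```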

A time series is data where you have not just the order but some actual continuous marker for where they lie "in time" - this could be a date, a timestamp, [Unix time](https://en.wikipedia.org/wiki/Unix_time), or something else. All time series are also sequences, and for some techniques you may just consider their order and not "how far apart" the entries are (if you have particularly consistent data collected at regular intervals it may not matter).
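
To illustrate the difference, here is a minimal sketch using only the standard library. The readings are invented for illustration; the point is that the same data can be viewed as a plain sequence (order only) or as a time series (order plus spacing):

```python
from datetime import datetime

# Hypothetical readings: (timestamp, value) pairs, invented for illustration.
readings = [
    (datetime(2019, 1, 1, 0, 0), 10.0),
    (datetime(2019, 1, 1, 1, 0), 10.5),
    (datetime(2019, 1, 1, 2, 0), 11.0),
    (datetime(2019, 1, 1, 5, 0), 14.0),  # a three-hour gap
]

# Treating the data as a plain sequence keeps only the order of the values.
values = [value for _, value in readings]

# Treating it as a time series also keeps the spacing between observations.
gaps = [
    (later[0] - earlier[0]).total_seconds() / 3600
    for earlier, later in zip(readings, readings[1:])
]

is_regular = len(set(gaps)) == 1
print(values)      # [10.0, 10.5, 11.0, 14.0]
print(gaps)        # [1.0, 1.0, 3.0]
print(is_regular)  # False
```

If `is_regular` were `True`, dropping the timestamps and keeping only the order would lose very little.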

## Recurrent Neural Networks

There's plenty more to "traditional" time series, but a particularly popular deep learning technique for sequence data is recurrent neural networks. A recurrence relation in math is an equation that uses recursion to define a sequence - a famous example is the Fibonacci numbers:

$F_n = F_{n-1} + F_{n-2}$
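
To make the recurrence concrete, here is a minimal sketch that computes it with a loop. It assumes starting values of 1 and 1, and the function name `fibonacci` is just for illustration:

```python
def fibonacci(n, f0=1, f1=1):
    """Return F_n for the recurrence F_n = F_{n-1} + F_{n-2}.

    The starting values f0 and f1 are assumptions; 1 and 1 are used here.
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    prev, curr = f0, f1
    for _ in range(n):
        # Each new term is built from the two terms before it.
        prev, curr = curr, prev + curr
    return prev

print([fibonacci(n) for n in range(8)])  # [1, 1, 2, 3, 5, 8, 13, 21]
```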

For formal math you also need a base case $F_0=1, F_1=1$, and then the rest builds from there. But for neural networks what we're really talking about are loops:

![Recurrent neural network](https://upload.wikimedia.org/wikipedia/commons/b/b5/Recurrent_neural_network_unfold.svg)

The hidden layers have edges (output) going back to their own input - this loop means that for any time `t` the hidden state depends not only on the input at `t` but also on the hidden state from time `t-1`. The entire network is represented on the left, and you can unfold it explicitly to see how it behaves at any given `t`.
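
The "unfolded" view can be sketched as an ordinary loop in NumPy. This is a simplified illustration of a plain recurrent layer, not what Keras does internally; the sizes, the weight names (`W_x`, `W_h`, `b`), and the random inputs are all assumptions made for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

n_inputs, n_hidden, n_steps = 3, 4, 5

# Randomly initialised weights, shared across every time step.
W_x = rng.normal(scale=0.5, size=(n_hidden, n_inputs))  # input -> hidden
W_h = rng.normal(scale=0.5, size=(n_hidden, n_hidden))  # hidden -> hidden (the loop)
b = np.zeros(n_hidden)

xs = rng.normal(size=(n_steps, n_inputs))  # one input vector per time step
h = np.zeros(n_hidden)                     # initial hidden state

hidden_states = []
for x_t in xs:
    # The new state depends on the current input AND the previous state.
    h = np.tanh(W_x @ x_t + W_h @ h + b)
    hidden_states.append(h)

hidden_states = np.stack(hidden_states)
print(hidden_states.shape)  # (5, 4)
```

Note that the same `W_x`, `W_h`, and `b` are reused at every step; only the hidden state `h` changes as the loop walks through the sequence.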

Different units can have this "loop", but a particularly successful one is the long short-term memory unit (LSTM):

![Long short-term memory unit](https://upload.wikimedia.org/wikipedia/commons/thumb/6/63/Long_Short-Term_Memory.svg/1024px-Long_Short-Term_Memory.svg.png)

There's a lot going on here - in a nutshell, the calculus still works out and backpropagation can still be implemented. The advantage (and namesake) of LSTM is that it can generally put more weight on recent (short-term) events while not completely losing older (long-term) information.

When gradients are propagated back through many time steps, a plain recurrent network ends up multiplying many small factors together, so the gradients for early steps become so small they are effectively zero - this is the [vanishing gradient problem](https://en.wikipedia.org/wiki/Vanishing_gradient_problem), and is what RNN with LSTM addresses. Pay special attention to the cell state $c_t$ and how it passes through the unit to get an intuition for how this problem is solved.
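
For intuition, here is a rough NumPy sketch of a single LSTM step in its common textbook form (no peephole connections). The function name `lstm_step`, the stacked weight layout, and the gate ordering are assumptions made for this example and do not necessarily match how Keras stores its weights:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step. W, U, b stack the parameters of all four gates."""
    n = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b   # shape (4 * n,)
    f = sigmoid(z[0 * n:1 * n])    # forget gate: how much old memory to keep
    i = sigmoid(z[1 * n:2 * n])    # input gate: how much new content to write
    o = sigmoid(z[2 * n:3 * n])    # output gate: how much memory to expose
    g = np.tanh(z[3 * n:4 * n])    # candidate values for the cell state
    c_t = f * c_prev + i * g       # additive update of the cell state
    h_t = o * np.tanh(c_t)         # hidden state passed to the next step
    return h_t, c_t

rng = np.random.default_rng(0)
n_inputs, n_hidden = 3, 4
W = rng.normal(scale=0.5, size=(4 * n_hidden, n_inputs))
U = rng.normal(scale=0.5, size=(4 * n_hidden, n_hidden))
b = np.zeros(4 * n_hidden)

h = np.zeros(n_hidden)
c = np.zeros(n_hidden)
for x_t in rng.normal(size=(6, n_inputs)):
    h, c = lstm_step(x_t, h, c, W, U, b)

print(h.shape, c.shape)  # (4,) (4,)
```

The line `c_t = f * c_prev + i * g` is the key: the old cell state is carried forward by an elementwise multiplication and an addition rather than being squashed through a full weight matrix at every step, which is what lets information (and gradients) survive across many steps when the forget gate stays close to 1.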

So why are these cool? One particularly compelling application is actually not time series but language modeling - language is inherently ordered data (letters/words go one after another, and the order *matters*). [The Unreasonable Effectiveness of Recurrent Neural Networks](https://karpathy.github.io/2015/05/21/rnn-effectiveness/) is a famous blog post on this topic, and well worth reading.

For our purposes, let's use TensorFlow and Keras to train RNNs with natural language. Resources:

- https://github.com/keras-team/keras/blob/master/examples/imdb_lstm.py
- https://keras.io/layers/recurrent/#lstm
- http://adventuresinmachinelearning.com/keras-lstm-tutorial/

Note that in TensorFlow 1.x, `tensorflow.contrib` [also has an implementation of RNN/LSTM](https://www.tensorflow.org/tutorials/sequences/recurrent); the `contrib` module was removed in TensorFlow 2.0.

### RNN/LSTM Sentiment Classification with Keras

In [3]:
'''
#Trains an LSTM model on the IMDB sentiment classification task.
The dataset is actually too small for LSTM to be of any advantage
compared to simpler, much faster methods such as TF-IDF + LogReg.
**Notes**
- RNNs are tricky. Choice of batch size is important,
choice of loss and optimizer is critical, etc.
Some configurations won't converge.
- LSTM loss decrease patterns during training can be quite different
from what you see with CNNs/MLPs/etc.
'''
from __future__ import print_function

from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding
from tensorflow.keras.layers import LSTM
from tensorflow.keras.datasets import imdb

max_features = 20000
# cut texts after this number of words (among top max_features most common words)
maxlen = 80
batch_size = 32

print('Loading data...')
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
print(len(x_train), 'train sequences')
print(len(x_test), 'test sequences')

Loading data...
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz
25000 train sequences
25000 test sequences


In [4]:
len(x_train)

25000

In [5]:
print('Pad sequences (samples x time)')
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)

Pad sequences (samples x time)
x_train shape: (25000, 80)
x_test shape: (25000, 80)
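
To see what the padding step does to a single review, here is a pure-Python sketch that mimics the default behaviour of `pad_sequences` (`padding='pre'`, `truncating='pre'`, `value=0`). `pad_pre` is a hypothetical helper written for this illustration, not a Keras function:

```python
def pad_pre(seq, maxlen, value=0):
    """Keep the last `maxlen` items, left-padding with `value` if too short."""
    seq = list(seq)[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq

print(pad_pre([5, 6, 7], maxlen=5))              # [0, 0, 5, 6, 7]
print(pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5))  # [3, 4, 5, 6, 7]
```

Short reviews get zeros on the left, and long reviews keep only their final `maxlen` words, so every row ends up with the same length.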


In [6]:
x_train[0]

array([   15,   256,     4,     2,     7,  3766,     5,   723,    36,
          71,    43,   530,   476,    26,   400,   317,    46,     7,
           4, 12118,  1029,    13,   104,    88,     4,   381,    15,
         297,    98,    32,  2071,    56,    26,   141,     6,   194,
        7486,    18,     4,   226,    22,    21,   134,   476,    26,
         480,     5,   144,    30,  5535,    18,    51,    36,    28,
         224,    92,    25,   104,     4,   226,    65,    16,    38,
        1334,    88,    12,    16,   283,     5,    16,  4472,   113,
         103,    32,    15,    16,  5345,    19,   178,    32],
      dtype=int32)

In [7]:
print('Build model...')
model = Sequential()
model.add(Embedding(max_features, 128))
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))

# try using different optimizers and different optimizer configs
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

print('Train...')
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=15,
          validation_data=(x_test, y_test))
score, acc = model.evaluate(x_test, y_test,
                            batch_size=batch_size)
print('Test score:', score)
print('Test accuracy:', acc)

Build model...
Instructions for updating:
Colocations handled automatically by placer.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
Train...
Train on 25000 samples, validate on 25000 samples
Instructions for updating:
Use tf.cast instead.
Epoch 1/15

KeyboardInterrupt: 
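
Conceptually, the `Embedding` layer at the front of the model above is a trainable lookup table: each integer word id selects one row of a matrix. The NumPy sketch below illustrates only that lookup, with tiny made-up sizes standing in for `max_features=20000` and the 128-dimensional vectors; the random matrix stands in for weights that Keras would learn during training:

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, embed_dim = 20, 4  # tiny stand-ins for 20000 and 128
embedding_matrix = rng.normal(size=(vocab_size, embed_dim))

# A batch of 2 padded "reviews", each 5 word ids long (0 is the padding id).
batch = np.array([[0, 0, 3, 7, 1],
                  [4, 4, 9, 2, 19]])

# Looking up each id is just row indexing into the matrix.
embedded = embedding_matrix[batch]
print(embedded.shape)  # (2, 5, 4): samples x time x features
```

The resulting three-dimensional array (samples x time x features) is the shape of input that the LSTM layer consumes one time step at a time.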

### LSTM Text generation with Keras

What else can we do with LSTMs? Since we're analyzing the *sequence*, we can do more than classify - we can *generate* text. I've pulled some news stories using [newspaper](https://github.com/codelucas/newspaper/).

This example is drawn from the Keras [documentation](https://keras.io/examples/lstm_text_generation/).

In [11]:
from tensorflow.keras.callbacks import LambdaCallback
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM
from tensorflow.keras.optimizers import RMSprop

import numpy as np
import random
import sys
import os

In [12]:
data_files = os.listdir(os.curdir)
data_files

['.ipynb_checkpoints',
 'LS_DS_431_RNN_and_LSTM_Assignment.ipynb',
 'LS_DS_431_RNN_and_LSTM_Lecture.ipynb',
 'articles']

In [14]:
os.getcwd()

'/home/ec2-user/SageMaker/DS-Unit-4-Sprint-3-Deep-Learning/module1-rnn-and-lstm'

In [15]:
os.chdir('/home/ec2-user/SageMaker/DS-Unit-4-Sprint-3-Deep-Learning/module1-rnn-and-lstm/articles')

In [19]:
data_files = os.listdir(os.curdir)
data_files[:5]

['80.txt', '17.txt', '100.txt', '91.txt', '12.txt']

In [20]:
text = " "

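for filename in data_files:
    # only read the plain-text article files
    if filename.endswith('.txt'):
        # be explicit about the encoding, since the articles contain non-ASCII characters
        with open(filename, 'r', encoding='utf-8') as data:
            content = data.read()
            text = text + " " + content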

print('corpus length', len(text))

corpus length 891912


In [21]:
text[:50]

'  Drivers should consider avoiding the area, and t'

In [24]:
# Encode Data as Chars
chars = sorted(list(set(text)))
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))
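
The two dictionaries are inverses of each other, which is easier to see on a tiny string. The sketch below uses `toy_`-prefixed names (hypothetical, chosen so they don't overwrite the notebook's variables) and round-trips a short text through the integer encoding:

```python
toy_text = "hello world"

toy_chars = sorted(set(toy_text))
toy_char_indices = {c: i for i, c in enumerate(toy_chars)}
toy_indices_char = {i: c for i, c in enumerate(toy_chars)}

encoded = [toy_char_indices[c] for c in toy_text]
decoded = "".join(toy_indices_char[i] for i in encoded)

print(toy_chars)  # [' ', 'd', 'e', 'h', 'l', 'o', 'r', 'w']
print(encoded)    # [3, 2, 4, 4, 5, 0, 7, 5, 6, 4, 1]
print(decoded)    # hello world
```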

In [25]:
char_indices

{'\n': 0,
 ' ': 1,
 '!': 2,
 '"': 3,
 '#': 4,
 '$': 5,
 '%': 6,
 '&': 7,
 "'": 8,
 '(': 9,
 ')': 10,
 '*': 11,
 '+': 12,
 ',': 13,
 '-': 14,
 '.': 15,
 '/': 16,
 '0': 17,
 '1': 18,
 '2': 19,
 '3': 20,
 '4': 21,
 '5': 22,
 '6': 23,
 '7': 24,
 '8': 25,
 '9': 26,
 ':': 27,
 ';': 28,
 '?': 29,
 '@': 30,
 'A': 31,
 'B': 32,
 'C': 33,
 'D': 34,
 'E': 35,
 'F': 36,
 'G': 37,
 'H': 38,
 'I': 39,
 'J': 40,
 'K': 41,
 'L': 42,
 'M': 43,
 'N': 44,
 'O': 45,
 'P': 46,
 'Q': 47,
 'R': 48,
 'S': 49,
 'T': 50,
 'U': 51,
 'V': 52,
 'W': 53,
 'X': 54,
 'Y': 55,
 'Z': 56,
 '[': 57,
 ']': 58,
 '_': 59,
 'a': 60,
 'b': 61,
 'c': 62,
 'd': 63,
 'e': 64,
 'f': 65,
 'g': 66,
 'h': 67,
 'i': 68,
 'j': 69,
 'k': 70,
 'l': 71,
 'm': 72,
 'n': 73,
 'o': 74,
 'p': 75,
 'q': 76,
 'r': 77,
 's': 78,
 't': 79,
 'u': 80,
 'v': 81,
 'w': 82,
 'x': 83,
 'y': 84,
 'z': 85,
 '{': 86,
 '|': 87,
 '©': 88,
 '\xad': 89,
 '·': 90,
 '½': 91,
 '×': 92,
 'á': 93,
 'ã': 94,
 'è': 95,
 'é': 96,
 'ê': 97,
 'í': 98,
 'ñ': 99,
 'ó': 

In [27]:
# Create the Sequence Data
maxlen =  40
steps = 3

sentences = [] #X
next_chars = [] #Y

for i in range(0, len(text) - maxlen, steps):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])

print('sentences', len(sentences))

sentences 297291
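
The windowing logic is easier to follow on a short string. This sketch applies the same loop with smaller, made-up settings (`toy_maxlen = 4`, `toy_step = 3`) and then checks that the window count reported above follows directly from the corpus length:

```python
toy_text = "abcdefghij"
toy_maxlen, toy_step = 4, 3

windows, targets = [], []
for i in range(0, len(toy_text) - toy_maxlen, toy_step):
    windows.append(toy_text[i: i + toy_maxlen])  # input: toy_maxlen characters
    targets.append(toy_text[i + toy_maxlen])     # label: the character that follows

print(windows)  # ['abcd', 'defg']
print(targets)  # ['e', 'h']

# Same arithmetic for the real corpus: length 891912, maxlen 40, step 3.
print(len(range(0, 891912 - 40, 3)))  # 297291
```

Each window overlaps the previous one by `maxlen - steps` characters, so a larger step means fewer, less redundant training examples.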


In [28]:
sentences[0]

'  Drivers should consider avoiding the a'

In [29]:
sentences[1]

'rivers should consider avoiding the area'

In [32]:
# Specify x & y

# np.bool was removed in NumPy 1.24; the built-in bool is the equivalent dtype
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)

for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1
        
    y[i, char_indices[next_chars[i]]] = 1
        
x[0]


array([[False,  True, False, ..., False, False, False],
       [False,  True, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       ...,
       [False, False, False, ..., False, False, False],
       [False,  True, False, ..., False, False, False],
       [False, False, False, ..., False, False, False]])

In [33]:
x[0][0]

array([False,  True, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False])

In [35]:
x.shape

(297291, 40, 121)
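
Here is the same one-hot encoding on a toy example, small enough to print in full. The `toy_` names are hypothetical and chosen to avoid overwriting the notebook's `x` and `chars`. The last lines estimate the memory used by the full array, assuming NumPy's one byte per boolean:

```python
import numpy as np

toy_sentences = ["ab", "ba", "cc"]
toy_chars = sorted(set("".join(toy_sentences)))  # ['a', 'b', 'c']
toy_char_indices = {c: i for i, c in enumerate(toy_chars)}

# Shape: (number of sentences, sentence length, vocabulary size)
toy_x = np.zeros((len(toy_sentences), 2, len(toy_chars)), dtype=bool)
for row, toy_sentence in enumerate(toy_sentences):
    for pos, toy_char in enumerate(toy_sentence):
        toy_x[row, pos, toy_char_indices[toy_char]] = True

print(toy_x.shape)           # (3, 2, 3)
print(toy_x[0].astype(int))  # rows [1 0 0] and [0 1 0] for "a" then "b"
print(toy_x.sum(axis=-1))    # every position has exactly one True

# Size of the full array from the cell above, at one byte per bool.
full_bytes = 297291 * 40 * 121
print(full_bytes / 1e9)      # roughly 1.44 (GB)
```

One-hot encoding is simple but memory-hungry: most of those ~1.4 billion entries are `False`, which is one reason word-level models usually use an embedding layer instead.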

In [37]:
# build the model: a single LSTM
model = Sequential()
model.add(LSTM(128, input_shape=(maxlen, len(chars))))
model.add(Dense(len(chars), activation='softmax'))

optimizer = RMSprop()
model.compile(loss='categorical_crossentropy', optimizer=optimizer)


In [38]:
def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
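
To see what the `temperature` argument does, the sketch below applies the same rescaling as `sample` but returns the adjusted probabilities instead of drawing from them. `apply_temperature` and the three-element `toy_preds` are hypothetical, introduced only for this illustration:

```python
import numpy as np

def apply_temperature(preds, temperature=1.0):
    """Rescale a probability vector the way `sample` does, without sampling."""
    logits = np.log(np.asarray(preds, dtype='float64')) / temperature
    exp_logits = np.exp(logits)
    return exp_logits / exp_logits.sum()

toy_preds = [0.5, 0.3, 0.2]
for temperature in [0.2, 1.0, 1.2]:
    print(temperature, np.round(apply_temperature(toy_preds, temperature), 3))
```

A temperature of 1.0 leaves the distribution unchanged. Values below 1.0 sharpen it, so the most likely character is chosen almost every time (safer, more repetitive text), while values above 1.0 flatten it, giving unlikely characters a better chance (more surprising, more error-prone text).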

In [39]:
def on_epoch_end(epoch, _):
    # Function invoked at end of each epoch. Prints generated text.
    print()
    print('----- Generating text after Epoch: %d' % epoch)

    start_index = random.randint(0, len(text) - maxlen - 1)
    for diversity in [0.2, 0.5, 1.0, 1.2]:
        print('----- diversity:', diversity)

        generated = ''
        sentence = text[start_index: start_index + maxlen]
        generated += sentence
        print('----- Generating with seed: "' + sentence + '"')
        sys.stdout.write(generated)

        for i in range(400):
            x_pred = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(sentence):
                x_pred[0, t, char_indices[char]] = 1.

            preds = model.predict(x_pred, verbose=0)[0]
            next_index = sample(preds, diversity)
            next_char = indices_char[next_index]

            sentence = sentence[1:] + next_char

            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()

print_callback = LambdaCallback(on_epoch_end=on_epoch_end)
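
The core of the callback is a sliding window: predict one character, append it, drop the oldest character, repeat. Stripped of the Keras and printing details, that loop can be sketched in pure Python. Here `generate` and `echo_first` are hypothetical names, and `echo_first` is a trivial stand-in for the trained LSTM plus `sample`:

```python
def generate(seed, predict_next, n_chars):
    """Slide a fixed-length window over the text as it is generated."""
    window = seed
    generated = seed
    for _ in range(n_chars):
        next_char = predict_next(window)
        generated += next_char
        # Drop the oldest character and append the new one,
        # so the window length never changes.
        window = window[1:] + next_char
    return generated

# Hypothetical stand-in for the model: repeats the first character of the window.
def echo_first(window):
    return window[0]

print(generate("abc", echo_first, 6))  # abcabcabc
```

Because the model's own output is fed back in as input, early mistakes compound, which is part of why text generated after only one epoch looks like noise.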

In [40]:
model.fit(x, y,
          batch_size=128,
          epochs=2,
          callbacks=[print_callback])

Epoch 1/2
----- Generating text after Epoch: 0
----- diversity: 0.2
----- Generating with seed: "ddresses some common questions in its Me"
ddresses some common questions in its Mex6oYyhD·@éyF—N‘MxlU!a.🤔qv⅔íe'?'© V×wxA―tñ5sH⅓wrez8Jág;C!­Lb-!K⭐;sc+i&!🤔Kd$’Ti&8—”%è0z―)fN‘S.qã|⅓×áp|2g/OZ·2©ñ–tjCCJRTFCK⁦­Dvöo⅔9;7[Gi"sDíã­4ã3YIíY👻èG5…o!%eè⅓.V©👻óé‘á●D👻Kó⁦qj–:a+4–A•.rl:öVFFyc[FPX;m?P🗣L9ó+:vB“;]d u/jaS”Zó#·+kH'
Ká:OhA–I🗣―c—⅓­A•—!áápE9­o•eñá⅓⁩á9l*⅓ê"óSó👻⁦_r!Bx5·n×l·{V/©QVFiGwy/SsRﬂ“è“⁩’(9;
WèCm­#6g⅔o⁩OLgO,;g⭐f3oO'"F―Oa“sRJj/@dèyw)9Yx[@c…9|A―7:[t­n‘;sy#(l.ö🗣gL
Sm]5Ak6“(áe—O8c•@3⅓'⁩|ã
----- diversity: 0.5
----- Generating with seed: "ddresses some common questions in its Me"
ddresses some common questions in its Me|•©Zho586&7EíK½z$– ⅔©Wèê&…–⅓yEsñ”baPhs{í*Dí EI*ﬂt)/]gX×’×­●AY–ö|ãK]⅓●EX?½6eN.xzY_|ã.@7?⅓|a!AbA&v6?kM&×🗣7ﬂpC&ê{éo@J•"z🗣Al👻⅔{👻(2―BokãMkHFNTXOíê@Sí•‘G[B@öA$OKX7G“cñ·O●ﬂ⅓&L21h'z­t'C👻%•👻m—,BC|Sx…—ê©#W y#$Rö©!|Y’J—Qo f:V×MRwEx*Om0—l●·$U🗣ílêM­!]);é60FJEjxC–‘K,(#%7wé.&eu0yñy…w🗣 […F+ﬂè+|lj{o1SmWX

<tensorflow.python.keras.callbacks.History at 0x7f6e74cb84a8>