<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>
<br></br>

## *Data Science Unit 4 Sprint 3 Assignment 1*

# Recurrent Neural Networks and Long Short-Term Memory (LSTM)

![Monkey at a typewriter](https://upload.wikimedia.org/wikipedia/commons/thumb/3/3c/Chimpanzee_seated_at_typewriter.jpg/603px-Chimpanzee_seated_at_typewriter.jpg)

It is said that [infinite monkeys typing for an infinite amount of time](https://en.wikipedia.org/wiki/Infinite_monkey_theorem) will eventually type, among other things, the complete works of William Shakespeare. Let's see if we can get there a bit faster, with the power of Recurrent Neural Networks and LSTM.

This text file contains the complete works of Shakespeare: https://www.gutenberg.org/files/100/100-0.txt

Use it as training data for an RNN. You can keep it simple and train at the character level, which is the suggested initial approach.

Then, use that trained RNN to generate Shakespearean-ish text. Your goal: a function that takes, as an argument, the size of the text to generate (e.g. number of characters or lines) and returns generated text of that size.

Note - Shakespeare wrote an awful lot. It's OK, especially initially, to sample/use smaller data and parameters, so you can have a tighter feedback loop when you're trying to get things running. Then, once you've got a proof of concept - start pushing it more!

In [0]:
from __future__ import absolute_import, division, print_function, unicode_literals

try:
  # %tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
  pass

TensorFlow 2.x selected.


In [0]:
import os
from tensorflow import distribute, config, tpu
resolver = distribute.cluster_resolver.TPUClusterResolver(tpu='grpc://' + os.environ['COLAB_TPU_ADDR'])
config.experimental_connect_to_cluster(resolver)
tpu.experimental.initialize_tpu_system(resolver)

INFO:tensorflow:Initializing the TPU system: 10.97.60.242:8470


INFO:tensorflow:Clearing out eager caches


INFO:tensorflow:Finished initializing TPU system.


<tensorflow.python.tpu.topology.Topology at 0x7f05aa24a2e8>

In [0]:
import requests

url = "https://www.gutenberg.org/files/100/100-0.txt"

with requests.get(url) as res:
    res.encoding = 'utf-8'
    works = res.text

# Remove the ebook header
works = works[2907:]

# Replace carriage return chars with spaces
works = works.replace('\r', ' ')

In [0]:
import re

# Keep only the text before the last "* CONTENT NOTE (added in 2017) *" marker
pattern = re.compile(r"(.*)(\*\sCONTENT\sNOTE\s\(added in 2017\)\s\*.*)",
                     flags=re.DOTALL)
matches = pattern.search(works)
works = matches.group(1)

In [0]:
import sys

# sys.getsizeof reports the in-memory size of the str object, not the size of the text on disk
size = str(round(sys.getsizeof(works) / 1000000, 2)) + ' MB'
size

'11.43 MB'
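
The figure above is almost exactly two bytes per character for the 5,715,646-character corpus. That is consistent with how CPython stores a string containing characters outside Latin-1 (such as `’`), so it likely overstates the size of the text as UTF-8. A minimal sketch comparing the three measures on a toy string; `text_sizes` is a hypothetical helper name, and the in-memory figure is a CPython implementation detail:

```python
import sys


def text_sizes(text):
    """Return character count, UTF-8 byte count, and in-memory object size."""
    return {
        "characters": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
        # Size of the Python str object itself; depends on the interpreter.
        "in_memory_bytes": sys.getsizeof(text),
    }


# "\u2019" is the right single quotation mark used throughout the corpus.
sample_text = "\u2019Tis true" * 1000
sizes = text_sizes(sample_text)
print(sizes)
```

For the real corpus, `text_sizes(works)["utf8_bytes"]` would give the size the text takes when written to a UTF-8 file.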

In [0]:
from tensorflow import keras

chars = list(set(works))

char_int = {c:i for i, c in enumerate(chars)}
int_char = {i:c for i, c in enumerate(chars)}
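
The two dictionaries above form an encode/decode pair. A minimal sketch on a toy string, assuming a sorted vocabulary so the mapping is the same in every session (iteration order of a `set` of strings can differ between interpreter runs); `build_vocab`, `encode`, and `decode` are hypothetical helper names, not part of the notebook:

```python
def build_vocab(text):
    """Return (char_to_int, int_to_char) for the distinct characters in text."""
    chars = sorted(set(text))  # sorted so the mapping is reproducible
    char_to_int = {c: i for i, c in enumerate(chars)}
    int_to_char = {i: c for i, c in enumerate(chars)}
    return char_to_int, int_to_char


def encode(text, char_to_int):
    """Map each character to its integer index."""
    return [char_to_int[c] for c in text]


def decode(indices, int_to_char):
    """Map integer indices back to a string."""
    return "".join(int_to_char[i] for i in indices)


toy_text = "to be or not to be"
toy_char_int, toy_int_char = build_vocab(toy_text)
toy_encoded = encode(toy_text, toy_char_int)

print(toy_char_int)
print(toy_encoded)
print(decode(toy_encoded, toy_int_char))
```

Decoding the encoded list returns the original string, which is a quick sanity check worth running before building training sequences.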

In [0]:
MAXLEN = 40
STEP = 5

encoded = [char_int[c] for c in works]

sequences = [] # 40 chars long
next_char = [] # One element for each sequence

for i in range(0, len(encoded) - MAXLEN, STEP):

    sequences.append(encoded[i:i+MAXLEN])
    next_char.append(encoded[i+MAXLEN])

print("Sequences:", len(sequences))

Sequences: 1143122


In [0]:
print(len(encoded))
print(sequences[0])
print(next_char[0])
print(sequences[1])
print(next_char[1])
print(sequences[2])
print(next_char[2])
print("works/step", (len(works)-41)/5)

5715646
[52, 9, 84, 99, 100, 81, 51, 51, 84, 52, 100, 99, 72, 99, 72, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 63, 99, 72, 99, 72]
37
[81, 51, 51, 84, 52, 100, 99, 72, 99, 72, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 63, 99, 72, 99, 72, 37, 13, 8, 49, 99]
97
[100, 99, 72, 99, 72, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 63, 99, 72, 99, 72, 37, 13, 8, 49, 99, 97, 76, 77, 13, 95]
57
works/step 1143121.0
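
The `works/step` figure printed above is an approximation of the sequence count. The loop `range(0, len(encoded) - MAXLEN, STEP)` yields exactly `ceil((len(encoded) - MAXLEN) / STEP)` windows, which for this corpus is `ceil(5715606 / 5) = 1143122`. A small sketch of the same windowing on toy data; `make_windows` and `expected_window_count` are hypothetical helper names:

```python
import math


def make_windows(encoded, maxlen, step):
    """Slide a window of length maxlen over encoded, advancing by step.

    Each window is paired with the single item that immediately follows it.
    """
    sequences, next_items = [], []
    for i in range(0, len(encoded) - maxlen, step):
        sequences.append(encoded[i:i + maxlen])
        next_items.append(encoded[i + maxlen])
    return sequences, next_items


def expected_window_count(n, maxlen, step):
    """Number of windows produced by range(0, n - maxlen, step)."""
    return max(0, math.ceil((n - maxlen) / step))


toy = list(range(12))
seqs, nxt = make_windows(toy, maxlen=4, step=3)
print(seqs)
print(nxt)
print(expected_window_count(5715646, 40, 5))
```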


In [0]:
print("encoded", encoded[:10])
print(">>>>>>>>>> len(encoded)", len(encoded))
print("sequences", sequences[0])
print(">>>>>>>>>> len(sequences[0])", len(sequences[0]))
print("next_char", next_char[:10])
print(">>>>>>>>>> len(next_char)", len(next_char))

encoded [52, 9, 84, 99, 100, 81, 51, 51, 84, 52]
>>>>>>>>>> len(encoded) 5715646
sequences [52, 9, 84, 99, 100, 81, 51, 51, 84, 52, 100, 99, 72, 99, 72, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 63, 99, 72, 99, 72]
>>>>>>>>>> len(sequences[0]) 40
next_char [37, 97, 57, 95, 95, 99, 13, 71, 95, 61]
>>>>>>>>>> len(next_char) 1143122


In [0]:
import numpy as np
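# One-hot encode the inputs (X) and targets (y)
# (the np.bool alias was removed in NumPy 1.24; the builtin bool is equivalent)

X = np.zeros((len(sequences), MAXLEN, len(chars)), dtype=bool)
y = np.zeros((len(sequences), len(chars)), dtype=bool)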

for i, sequence in enumerate(sequences):
    for t, char in enumerate(sequence):
        X[i,t,char] = 1

    y[i, next_char[i]] = 1

In [0]:
X.shape, y.shape

((1143122, 40, 101), (1143122, 101))
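
At one byte per boolean element, an array of this shape occupies about 1143122 × 40 × 101 ≈ 4.6 GB, which is one reason to start with a smaller sample of the text. The sketch below shows the same one-hot encoding written with NumPy index arrays instead of nested Python loops, on toy data; `one_hot_sequences` and `one_hot_bytes` are hypothetical helper names:

```python
import numpy as np


def one_hot_sequences(sequences, next_items, vocab_size):
    """One-hot encode integer sequences (X) and their next items (y)."""
    seq = np.asarray(sequences)   # shape (n, maxlen), integer indices
    nxt = np.asarray(next_items)  # shape (n,)
    n, maxlen = seq.shape

    X = np.zeros((n, maxlen, vocab_size), dtype=bool)
    y = np.zeros((n, vocab_size), dtype=bool)

    rows = np.arange(n)[:, None]  # (n, 1), broadcasts against (maxlen,)
    cols = np.arange(maxlen)      # (maxlen,)
    X[rows, cols, seq] = True     # set X[i, t, seq[i, t]] for every i, t
    y[np.arange(n), nxt] = True   # set y[i, nxt[i]] for every i
    return X, y


def one_hot_bytes(n, maxlen, vocab_size):
    """Bytes needed for a boolean array of shape (n, maxlen, vocab_size)."""
    return n * maxlen * vocab_size


toy_X, toy_y = one_hot_sequences([[0, 1, 2], [2, 1, 0]], [1, 2], vocab_size=3)
print(toy_X.shape, toy_y.shape)
print(one_hot_bytes(1143122, 40, 101) / 1e9, "GB")
```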

In [0]:
X[0]

array([[False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       ...,
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False,  True, False],
       [False, False, False, ..., False, False, False]])

In [0]:
# model
strategy = distribute.experimental.TPUStrategy(resolver)
with strategy.scope():
    model = keras.Sequential([
        keras.layers.LSTM(128, input_shape=(MAXLEN, len(chars))),
        keras.layers.Dense(len(chars), activation='softmax')
    ])

    model.compile(loss='categorical_crossentropy', optimizer='adam')

In [0]:
import random

from tensorflow.keras.callbacks import LambdaCallback


def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array;
    # temperature=1.0 leaves the predicted distribution unchanged
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)

def on_epoch_end(epoch, _):
    # Function invoked at end of each epoch. Prints generated text.
    
    print()
    print('----- Generating text after Epoch: %d' % epoch)
    
    start_index = random.randint(0, len(works) - MAXLEN - 1)
    
    generated = ''
    
    sentence = works[start_index: start_index + MAXLEN]
    generated += sentence
    
    print('----- Generating with seed: "' + sentence + '"')
    sys.stdout.write(generated)
    
    for i in range(400):
        x_pred = np.zeros((1, MAXLEN, len(chars)))
        for t, char in enumerate(sentence):
            x_pred[0, t, char_int[char]] = 1
            
        preds = model.predict(x_pred, verbose=0)[0]
        next_index = sample(preds)
        next_char = int_char[next_index]
        
        sentence = sentence[1:] + next_char
        
        sys.stdout.write(next_char)
        sys.stdout.flush()
    print()


print_callback = LambdaCallback(on_epoch_end=on_epoch_end)
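
The `sample` helper above rescales the log-probabilities before drawing an index. The divisor is commonly called a temperature: values below 1 sharpen the distribution toward the most likely character, and values above 1 flatten it. A minimal sketch of that effect, assuming a small epsilon to avoid `log(0)`; `apply_temperature` and `sample_index` are hypothetical names, and the random generator argument is an addition for reproducibility:

```python
import numpy as np


def apply_temperature(preds, temperature=1.0):
    """Rescale a probability vector by temperature and renormalize."""
    preds = np.asarray(preds, dtype="float64")
    logits = np.log(preds + 1e-12) / temperature  # epsilon guards log(0)
    logits -= logits.max()                        # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()


def sample_index(preds, temperature=1.0, rng=None):
    """Draw one index from the temperature-adjusted distribution."""
    rng = np.random.default_rng() if rng is None else rng
    probs = apply_temperature(preds, temperature)
    return int(rng.choice(len(probs), p=probs))


toy_preds = [0.1, 0.7, 0.2]
print(apply_temperature(toy_preds, 1.0))   # close to the input
print(apply_temperature(toy_preds, 0.5))   # sharper
print(apply_temperature(toy_preds, 2.0))   # flatter
print(sample_index(toy_preds, temperature=0.5, rng=np.random.default_rng(0)))
```

Lower temperatures tend to produce more repetitive but more plausible text, while higher temperatures produce more varied text with more misspellings.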

In [0]:
# fit
model.fit(X, y,
          batch_size=32,
          epochs=15,
          callbacks=[print_callback])

Train on 1147363 samples
Epoch 1/15
----- Generating text after Epoch: 0
----- Generating with seed: "trius. 
 
LYSANDER. 
Be not afraid; she "
trius. 
 
LYSANDER. 
Be not afraid; she that is a preate. 
    VoNless, timows to you most treak’d Greals, fas, 
I’ll is the his Had 
      sule and centle mishame if that bein gent; 
    kight at to grack of minlow speak, 
’Ttwill I swike fight, and cally much away. 
 fears hoth, yeb this pape. And I kive had frage his fear! 
O offly King leanes, Maicing his precey 
That’s his peranot, of my there to a poor, 
He die not bear on the sau
Epoch 2/15
----- Generating text after Epoch: 1
----- Generating with seed: "provokes me to ridiculous 
    smiling. "
provokes me to ridiculous 
    smiling. 
 
SIR TOBY. 
You fill! 
  PEMPLIA. And, stay he, fall birther, somech? 
 
AENELLA. 
Like timm, no  whose, Sich a kince arroning or 
Strips, the cape lith read you and blocks. 
  CORIOLANUS. Ay, I have no coud of this. 
    If my Pornio! What are a choe th

<tensorflow.python.keras.callbacks.History at 0x7fce420c95f8>
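
The assignment asks for a function that returns generated text of a requested size. One possible sketch follows, reusing the sliding-window logic from `on_epoch_end`. The name `generate_text`, its parameters, and the left-padding of short seeds with zeros are all assumptions, and a stand-in uniform predictor replaces the trained model so the block runs without TensorFlow:

```python
import numpy as np


def generate_text(predict_fn, seed, n_chars, char_to_int, int_to_char,
                  maxlen, rng=None):
    """Return n_chars generated characters that continue seed.

    predict_fn takes a one-hot array of shape (1, maxlen, vocab_size)
    and returns a probability vector of length vocab_size.
    """
    rng = np.random.default_rng() if rng is None else rng
    vocab_size = len(char_to_int)
    window = seed[-maxlen:]  # keep at most the last maxlen characters
    generated = []

    for _ in range(n_chars):
        x = np.zeros((1, maxlen, vocab_size), dtype="float32")
        offset = maxlen - len(window)  # assumption: left-pad short seeds with zeros
        for t, ch in enumerate(window):
            x[0, offset + t, char_to_int[ch]] = 1.0

        probs = np.asarray(predict_fn(x), dtype="float64")
        probs = probs / probs.sum()  # renormalize against float rounding
        next_ch = int_to_char[int(rng.choice(vocab_size, p=probs))]

        generated.append(next_ch)
        window = (window + next_ch)[-maxlen:]

    return "".join(generated)


# Toy vocabulary and a stand-in predictor (uniform over the vocabulary).
toy_chars = sorted(set("to be or not to be"))
toy_c2i = {c: i for i, c in enumerate(toy_chars)}
toy_i2c = {i: c for i, c in enumerate(toy_chars)}


def uniform_predict(x):
    return np.full(len(toy_chars), 1.0 / len(toy_chars))


text = generate_text(uniform_predict, "to be", 20, toy_c2i, toy_i2c,
                     maxlen=8, rng=np.random.default_rng(0))
print(repr(text))
```

With the trained model from this notebook, the predictor could plausibly be `lambda x: model.predict(x, verbose=0)[0]`, with `char_int`, `int_char`, and `MAXLEN` passed in place of the toy values.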

In [0]:
1147404 % 8

4

# Resources and Stretch Goals

## Stretch goals:
- Refine the training and generation of text to be able to ask for different genres/styles of Shakespearean text (e.g. plays versus sonnets)
- Train a classification model that takes text and returns which work of Shakespeare it is most likely to be from
- Make it more performant! Many possible routes here - lean on Keras, optimize the code, and/or use more resources (AWS, etc.)
- Revisit the news example from class, and improve it - use categories or tags to refine the model/generation, or train a news classifier
- Run on bigger, better data

## Resources:
- [The Unreasonable Effectiveness of Recurrent Neural Networks](https://karpathy.github.io/2015/05/21/rnn-effectiveness/) - a seminal writeup demonstrating a simple but effective character-level NLP RNN
- [Simple NumPy implementation of RNN](https://github.com/JY-Yoon/RNN-Implementation-using-NumPy/blob/master/RNN%20Implementation%20using%20NumPy.ipynb) - Python 3 version of the code from "Unreasonable Effectiveness"
- [TensorFlow RNN Tutorial](https://github.com/tensorflow/models/tree/master/tutorials/rnn) - code for training an RNN on the Penn Treebank language dataset
- [4-part tutorial on RNNs](http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/) - relates RNNs to the vanishing gradient problem, and provides an example implementation
- [RNN training tips and tricks](https://github.com/karpathy/char-rnn#tips-and-tricks) - some rules of thumb for parameterizing and training your RNN