<a href="https://colab.research.google.com/github/jcs-lambda/DS-Unit-4-Sprint-3-Deep-Learning/blob/master/module1-rnn-and-lstm/LS_DS_431_RNN_and_LSTM_Assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>
<br></br>

## *Data Science Unit 4 Sprint 3 Assignment 1*

# Recurrent Neural Networks and Long Short-Term Memory (LSTM)

![Monkey at a typewriter](https://upload.wikimedia.org/wikipedia/commons/thumb/3/3c/Chimpanzee_seated_at_typewriter.jpg/603px-Chimpanzee_seated_at_typewriter.jpg)

It is said that [infinite monkeys typing for an infinite amount of time](https://en.wikipedia.org/wiki/Infinite_monkey_theorem) will eventually type, among other things, the complete works of William Shakespeare. Let's see if we can get there a bit faster, with the power of Recurrent Neural Networks and LSTM.

This text file contains the complete works of Shakespeare: https://www.gutenberg.org/files/100/100-0.txt

Use it as training data for an RNN. Training at the character level keeps things simple and is the suggested initial approach.

Then, use that trained RNN to generate Shakespearean-ish text. Your goal: a function that takes, as an argument, the size of text to generate (e.g. number of characters or lines) and returns generated text of that size.

Note - Shakespeare wrote an awful lot. It's OK, especially initially, to sample/use smaller data and parameters, so you can have a tighter feedback loop when you're trying to get things running. Then, once you've got a proof of concept - start pushing it more!
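One way to get that tighter feedback loop is to train on a prefix of the corpus first. The sketch below is only an illustration: `take_prefix` is a hypothetical helper (not part of the notebook) that keeps roughly the first `max_chars` characters and cuts at a line boundary so no verse line is split.

```python
def take_prefix(text, max_chars):
    """Return roughly the first max_chars characters, cut at a line boundary."""
    if len(text) <= max_chars:
        return text
    # Last newline that falls inside the first max_chars characters
    cut = text.rfind("\n", 0, max_chars)
    # Fall back to a hard cut if the prefix contains no newline at all
    return text[:max_chars] if cut == -1 else text[:cut + 1]


demo = "line one\nline two\nline three\n"
small = take_prefix(demo, 20)
print(repr(small))
```

Once the pipeline runs end to end on the small sample, the same code can be pointed at the full text.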

In [1]:
%tensorflow_version 2.x

TensorFlow 2.x selected.


In [2]:
!wget https://www.gutenberg.org/files/100/100-0.txt

--2020-03-23 20:23:24--  https://www.gutenberg.org/files/100/100-0.txt
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47, 2610:28:3090:3000:0:bad:cafe:47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5777367 (5.5M) [text/plain]
Saving to: ‘100-0.txt’


2020-03-23 20:23:35 (591 KB/s) - ‘100-0.txt’ saved [5777367/5777367]



In [20]:
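# Extract the sonnets, then drop the first and last lines (the section markers)
!sed -nre '/^THE SONNETS/,/^THE END/p' 100-0.txt > sonnets_all.txt
# -i and -e are passed separately: `-ire` makes sed treat `re` as a backup suffix
!sed -i -e '1d;$d;' sonnets_all.txt
# Extract everything from the first play onward, then drop the closing FINIS line
!sed -nre '/^ALL’S WELL THAT ENDS WELL/,/^  FINIS/p' 100-0.txt > not_sonnets.txt
!sed -i -e '$d;' not_sonnets.txt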
!ls -la

total 11284
drwxr-xr-x 1 root root    4096 Mar 23 20:35 .
drwxr-xr-x 1 root root    4096 Mar 23 20:22 ..
-rw-r--r-- 1 root root 5777367 Nov  7 13:05 100-0.txt
drwxr-xr-x 1 root root    4096 Mar 20 16:17 .config
drwxr-xr-x 2 root root    4096 Mar 23 20:32 .ipynb_checkpoints
-rw-r--r-- 1 root root 5651002 Mar 23 20:35 not_sonnets.txt
drwxr-xr-x 1 root root    4096 Mar 18 16:23 sample_data
-rw-r--r-- 1 root root  101858 Mar 23 20:34 sonnets_all.txt
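The same range extraction can be sketched in plain Python for environments without `sed`. This is an approximation under stated assumptions: `extract_range` is a hypothetical helper, it matches on line prefixes rather than regular expressions, and it returns only the first matching range (whereas `sed` would start a new range if the start pattern matched again later).

```python
def extract_range(lines, start_prefix, end_prefix):
    """Return lines from the first one starting with start_prefix through
    the next later line starting with end_prefix (both inclusive)."""
    selected = []
    inside = False
    for line in lines:
        if not inside and line.startswith(start_prefix):
            inside = True
        if inside:
            selected.append(line)
            # Like sed, only look for the end marker after the start line
            if len(selected) > 1 and line.startswith(end_prefix):
                break
    return selected


lines = ["preface", "THE SONNETS", "", "1", "From fairest creatures", "THE END", "after"]
# Slice off the marker lines, mirroring the '1d;$d' step
body = extract_range(lines, "THE SONNETS", "THE END")[1:-1]
print(body)
```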


In [21]:
!head sonnets_all.txt


                    1

From fairest creatures we desire increase,
That thereby beauty’s rose might never die,
But as the riper should by time decease,
His tender heir might bear his memory:
But thou contracted to thine own bright eyes,
Feed’st thy light’s flame with self-substantial fuel,
Making a famine where abundance lies,


In [22]:
!tail sonnets_all.txt

And so the general of hot desire,
Was sleeping by a virgin hand disarmed.
This brand she quenched in a cool well by,
Which from Love’s fire took heat perpetual,
Growing a bath and healthful remedy,
For men discased, but I my mistress’ thrall,
  Came there for cure and this by that I prove,
  Love’s fire heats water, water cools not love.




In [23]:
!head not_sonnets.txt

ALL’S WELL THAT ENDS WELL



Contents

ACT I
Scene I. Rossillon. A room in the Countess’s palace.
Scene II. Paris. A room in the King’s palace.
Scene III. Rossillon. A Room in the Palace.


In [24]:
!tail not_sonnets.txt

Thus weary of the world, away she hies,             1189
And yokes her silver doves; by whose swift aid
Their mistress mounted through the empty skies,
In her light chariot quickly is convey’d;           1192
  Holding their course to Paphos, where their queen
  Means to immure herself and not be seen.






In [0]:
from tensorflow.keras.callbacks import LambdaCallback
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM
from tensorflow.keras.optimizers import RMSprop

import numpy as np
import random
import sys
import os

In [0]:
data_files = ['sonnets_all.txt', 'not_sonnets.txt']

In [43]:
# Read in Data

data = []

for file in data_files:
    if file.endswith('.txt'):
        print(file)
        with open(file, 'r', encoding='utf-8') as f:
            data.append(f.read())

sonnets_all.txt
not_sonnets.txt


In [44]:
len(data)

2

In [0]:
# Encode Data as Chars

# Gather all text 
# Why? 1. See all possible characters 2. For training / splitting later
text = " ".join(data)

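# Unique Characters
# Note: set ordering is not stable across Python sessions, so the
# index assigned to each character can change from run to run.
chars = list(set(text))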

# Lookup Tables
char_int = {c:i for i, c in enumerate(chars)} 
int_char = {i:c for i, c in enumerate(chars)}

In [46]:
print(len(chars), len(char_int), len(int_char))

101 101 101
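If the character-to-index mapping needs to survive a restart (for example, to reuse a saved model), one option is to sort the characters and persist the table. The following is a minimal sketch, not the notebook's own approach; `build_vocab` is a hypothetical helper and the JSON round trip is just one possible storage format.

```python
import json


def build_vocab(text):
    """Build deterministic lookup tables by sorting the unique characters."""
    chars = sorted(set(text))
    char_int = {c: i for i, c in enumerate(chars)}
    int_char = {i: c for i, c in enumerate(chars)}
    return chars, char_int, int_char


demo_text = "to be, or not to be"
chars, char_int, int_char = build_vocab(demo_text)

# Encoding then decoding should give back the original text
encoded = [char_int[c] for c in demo_text]
decoded = "".join(int_char[i] for i in encoded)

# Persist only char_int; int_char can be rebuilt by inverting it
saved = json.dumps(char_int, ensure_ascii=False)
restored_char_int = json.loads(saved)
restored_int_char = {i: c for c, i in restored_char_int.items()}
print(chars)
```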


In [48]:
print(chars)
print(char_int)
print(int_char)

['p', ':', 'z', '8', '*', '(', 'e', 'h', 'æ', 'D', '7', '1', '’', 'w', '”', 'y', ',', '`', 'n', '!', 'f', 'O', 'i', 'g', 'K', ')', 'W', 'â', 'S', 'r', 'A', 's', '-', 'è', 'q', 'œ', 'é', ';', 't', 'I', 'Æ', '\t', '4', '?', 'ç', 'F', 'N', 'Q', 'R', '9', 'J', 'É', 'Y', '}', 'E', 'm', ' ', 'l', 'c', 'k', 'à', '‘', '2', '\\', "'", 'V', '.', 'u', 'H', 'G', 'P', '5', '&', 'X', 'C', 'x', 'U', '|', '6', 'Z', 'L', '"', ']', 'B', '\n', 'ê', '0', 'M', 'v', 'b', 'j', '—', 'T', 'd', 'î', 'o', '“', '[', 'a', '_', '3']
{'p': 0, ':': 1, 'z': 2, '8': 3, '*': 4, '(': 5, 'e': 6, 'h': 7, 'æ': 8, 'D': 9, '7': 10, '1': 11, '’': 12, 'w': 13, '”': 14, 'y': 15, ',': 16, '`': 17, 'n': 18, '!': 19, 'f': 20, 'O': 21, 'i': 22, 'g': 23, 'K': 24, ')': 25, 'W': 26, 'â': 27, 'S': 28, 'r': 29, 'A': 30, 's': 31, '-': 32, 'è': 33, 'q': 34, 'œ': 35, 'é': 36, ';': 37, 't': 38, 'I': 39, 'Æ': 40, '\t': 41, '4': 42, '?': 43, 'ç': 44, 'F': 45, 'N': 46, 'Q': 47, 'R': 48, '9': 49, 'J': 50, 'É': 51, 'Y': 52, '}': 53, 'E': 54, 'm':

In [49]:
# Create the sequence data

maxlen = 40
step = 5

encoded = [char_int[c] for c in text]

sequences = [] # Each element is 40 chars long
next_char = [] # One element for each sequence

for i in range(0, len(encoded) - maxlen, step):
    
    sequences.append(encoded[i : i + maxlen])
    next_char.append(encoded[i + maxlen])
    
print('sequences: ', len(sequences))

sequences:  1109842
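To see what the windowing loop produces, it can help to run it on a toy input. The sketch below wraps the same loop in a hypothetical `make_windows` function and checks that the number of windows equals `ceil((len(encoded) - maxlen) / step)`.

```python
import math


def make_windows(encoded, maxlen, step):
    """Slide a window of length maxlen over encoded, advancing by step.
    Each window is paired with the item that immediately follows it."""
    sequences, next_items = [], []
    for i in range(0, len(encoded) - maxlen, step):
        sequences.append(encoded[i:i + maxlen])
        next_items.append(encoded[i + maxlen])
    return sequences, next_items


toy = list(range(12))
seqs, nxt = make_windows(toy, maxlen=4, step=3)
print(seqs)  # windows start at positions 0, 3 and 6
print(nxt)   # the item right after each window
expected_count = math.ceil((len(toy) - 4) / 3)
```

A smaller `step` yields more (and more overlapping) training examples at the cost of memory and training time.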


In [50]:
print(sequences[0])

[84, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 11, 84, 84, 45, 29, 95, 55, 56, 20, 98, 22, 29, 6, 31, 38, 56, 58, 29, 6]


In [51]:
print([int_char[i] for i in sequences[0]])

['\n', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', '1', '\n', '\n', 'F', 'r', 'o', 'm', ' ', 'f', 'a', 'i', 'r', 'e', 's', 't', ' ', 'c', 'r', 'e']


In [0]:
# Create x & y

# np.bool was removed in recent NumPy releases; the builtin bool works everywhere
x = np.zeros((len(sequences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sequences),len(chars)), dtype=bool)

for i, sequence in enumerate(sequences):
    for t, char in enumerate(sequence):
        x[i,t,char] = 1
        
    y[i, next_char[i]] = 1

In [53]:
print(x.shape)

(1109842, 40, 101)


In [55]:
print(x[0].shape)

(40, 101)


In [54]:
print(y.shape)

(1109842, 101)
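The dense `x` array has 1109842 × 40 × 101 one-byte entries, which is roughly 4.5 GB held in memory at once. If that is too much, one alternative is to one-hot encode a batch at a time. This is a sketch under assumptions: `one_hot_batches` is a hypothetical generator, and it would still need wrapping (for example to repeat every epoch) before being passed to `model.fit`; that wiring is not shown.

```python
import numpy as np


def one_hot_batches(sequences, next_char, vocab_size, batch_size):
    """Yield (x, y) one-hot batches without building the full arrays."""
    seq_arr = np.asarray(sequences)       # shape (n, maxlen), integer codes
    nxt_arr = np.asarray(next_char)       # shape (n,)
    eye = np.eye(vocab_size, dtype=bool)  # row k is the one-hot vector for code k
    for start in range(0, len(seq_arr), batch_size):
        stop = start + batch_size
        # Indexing eye with an integer array one-hot encodes every entry
        yield eye[seq_arr[start:stop]], eye[nxt_arr[start:stop]]


toy_sequences = [[0, 1, 2], [2, 1, 0], [1, 1, 1]]
toy_next = [3, 0, 2]
batches = list(one_hot_batches(toy_sequences, toy_next, vocab_size=4, batch_size=2))
for xb, yb in batches:
    print(xb.shape, yb.shape)
```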


In [0]:
# build the model: a single LSTM

model = Sequential()
model.add(LSTM(128, input_shape=x[0].shape, dropout=.2))
model.add(Dense(y.shape[1], activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
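As a sanity check on model size, the trainable parameter count can be worked out by hand. The sketch below assumes the standard LSTM layout with biases enabled (the Keras default): four weight sets (three gates plus the cell candidate), each with input weights, recurrent weights and a bias. The helper names are hypothetical.

```python
def lstm_param_count(units, input_dim):
    # Four weight sets, each: input weights + recurrent weights + bias
    return 4 * (units * (input_dim + units) + units)


def dense_param_count(inputs, outputs):
    # One weight per input/output pair plus one bias per output
    return inputs * outputs + outputs


lstm_params = lstm_param_count(units=128, input_dim=101)
dense_params = dense_param_count(inputs=128, outputs=101)
total_params = lstm_params + dense_params
print(lstm_params, dense_params, total_params)  # 117760 13029 130789
```

Under those assumptions the total should agree with what `model.summary()` reports for this architecture.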

In [0]:
# translate a prediction into a character (index of a character)

def sample(preds):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / 1  # the divisor acts as a sampling temperature (1 = unchanged)
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
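The divisor in `sample` can be exposed as a temperature parameter: values below 1 make the output more conservative, values above 1 make it more varied. The following is a sketch, not the notebook's function; `sample_with_temperature`, the clipping constant, and the optional `rng` argument are all assumptions added for illustration.

```python
import numpy as np


def sample_with_temperature(preds, temperature=1.0, rng=None):
    """Sample an index from a probability array after temperature scaling."""
    rng = np.random.default_rng() if rng is None else rng
    preds = np.asarray(preds, dtype="float64")
    # Clip so that log(0) cannot produce -inf
    logits = np.log(np.clip(preds, 1e-12, None)) / temperature
    logits -= logits.max()  # subtract the max for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))


rng = np.random.default_rng(0)
probs = [0.1, 0.7, 0.2]
cold = [sample_with_temperature(probs, temperature=0.01, rng=rng) for _ in range(50)]
warm = [sample_with_temperature(probs, temperature=1.0, rng=rng) for _ in range(50)]
print(set(cold))  # a very low temperature concentrates on the most likely index
```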

In [0]:
def on_epoch_end(epoch, _):
    # Function invoked at end of each epoch. Prints generated text.
    
    print()
    print('----- Generating text after Epoch: %d' % epoch)
    
    start_index = random.randint(0, len(text) - maxlen - 1)
    
    generated = ''
    
    sentence = text[start_index: start_index + maxlen]
    generated += sentence
    
    print('----- Generating with seed: "' + sentence + '"')
    sys.stdout.write(generated)
    
    for i in range(400):
        x_pred = np.zeros((1, maxlen, len(chars)))
        for t, char in enumerate(sentence):
            x_pred[0, t, char_int[char]] = 1
            
        preds = model.predict(x_pred, verbose=0)[0]
        next_index = sample(preds)
        next_char = int_char[next_index]
        
        sentence = sentence[1:] + next_char
        
        sys.stdout.write(next_char)
        sys.stdout.flush()
    print()


print_callback = LambdaCallback(on_epoch_end=on_epoch_end)

In [60]:
model.fit(x, y,
          batch_size=32,
          epochs=10,
          callbacks=[print_callback])

Train on 1109842 samples
Epoch 1/10
----- Generating text after Epoch: 0
----- Generating with seed: "sh'd.
  ANTONY. Make me not offended
   "
sh'd.
  ANTONY. Make me not offended
    Think go never- adquestedyse neven Rock.     Preate oow ma holds hone
And gentill his colline peobyting morey
Te’ll nethen any you sleak I ant callecolas  wath dre surss.
  KANG LBFFRESTERL. And to hele ereade bid not is a make erol:
He as agionswac! weed be, was have Geady,
To of twe karsuess sow path she wite to thou ste
                         Entes eeele in loveee
W
    CIRITAUE. I how abwes
Epoch 2/10
----- Generating text after Epoch: 1
----- Generating with seed: "s, and there have sat
The livelong day w"
s, and there have sat
The livelong day witr hemver-d dissiet the sectler! Woul wifl to wot friends for this go-glutot.
O finnagh eye oo goint him I have mil?
    Are ronque tow poor a lain, you say won,
To the core a mar, come it dows, one your under soud spour sny,
O fullo a peapes's befirth.



<tensorflow.python.keras.callbacks.History at 0x7fd2963f9160>
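The assignment asks for a function that takes a size and returns generated text of that size. One possible shape for it, reusing the loop from `on_epoch_end`, is sketched below. Everything here is an assumption for illustration: `generate_text` and its parameters are hypothetical, size is measured in characters, and the model is abstracted as a `predict_fn` callable so the sketch runs without TensorFlow (a uniform stub stands in for the trained network).

```python
import numpy as np


def generate_text(predict_fn, seed, n_chars, char_int, int_char, maxlen, rng=None):
    """Return n_chars of generated text that follow (and exclude) the seed."""
    if len(seed) < maxlen:
        raise ValueError("seed must contain at least maxlen characters")
    rng = np.random.default_rng() if rng is None else rng
    vocab_size = len(char_int)
    window = seed[-maxlen:]
    out = []
    for _ in range(n_chars):
        # One-hot encode the current window, shape (1, maxlen, vocab_size)
        x_pred = np.zeros((1, maxlen, vocab_size))
        for t, ch in enumerate(window):
            x_pred[0, t, char_int[ch]] = 1
        probs = np.asarray(predict_fn(x_pred), dtype="float64")
        probs /= probs.sum()  # renormalize after the float64 cast
        next_ch = int_char[int(rng.choice(vocab_size, p=probs))]
        out.append(next_ch)
        window = window[1:] + next_ch  # slide the window forward by one
    return "".join(out)


# Stand-in for a trained model: every character is equally likely
demo_chars = sorted("abc ")
demo_char_int = {c: i for i, c in enumerate(demo_chars)}
demo_int_char = {i: c for i, c in enumerate(demo_chars)}
uniform = lambda x: np.full(len(demo_chars), 1.0 / len(demo_chars))

result = generate_text(uniform, "abc a", 50, demo_char_int, demo_int_char,
                       maxlen=5, rng=np.random.default_rng(0))
print(len(result))
```

With the notebook's objects, `predict_fn` could plausibly be `lambda x: model.predict(x, verbose=0)[0]`, with `char_int`, `int_char` and `maxlen` passed through unchanged.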

# Resources and Stretch Goals

## Stretch goals:
- Refine the training and generation of text to be able to ask for different genres/styles of Shakespearean text (e.g. plays versus sonnets)
- Train a classification model that takes text and returns which work of Shakespeare it is most likely to be from
- Make it more performant! Many possible routes here - lean on Keras, optimize the code, and/or use more resources (AWS, etc.)
- Revisit the news example from class, and improve it - use categories or tags to refine the model/generation, or train a news classifier
- Run on bigger, better data

## Resources:
- [The Unreasonable Effectiveness of Recurrent Neural Networks](https://karpathy.github.io/2015/05/21/rnn-effectiveness/) - a seminal writeup demonstrating a simple but effective character-level NLP RNN
- [Simple NumPy implementation of RNN](https://github.com/JY-Yoon/RNN-Implementation-using-NumPy/blob/master/RNN%20Implementation%20using%20NumPy.ipynb) - Python 3 version of the code from "Unreasonable Effectiveness"
- [TensorFlow RNN Tutorial](https://github.com/tensorflow/models/tree/master/tutorials/rnn) - code for training an RNN on the Penn Tree Bank language dataset
- [4 part tutorial on RNN](http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/) - relates RNNs to the vanishing gradient problem and provides an example implementation
- [RNN training tips and tricks](https://github.com/karpathy/char-rnn#tips-and-tricks) - some rules of thumb for parameterizing and training your RNN