<a href="https://colab.research.google.com/github/JonNData/DS-Unit-4-Sprint-3-Deep-Learning/blob/master/module1-rnn-and-lstm/Nguyen_LS_DS_431_RNN_and_LSTM_Assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>
<br></br>

## *Data Science Unit 4 Sprint 3 Assignment 1*

# Recurrent Neural Networks and Long Short-Term Memory (LSTM)

![Monkey at a typewriter](https://upload.wikimedia.org/wikipedia/commons/thumb/3/3c/Chimpanzee_seated_at_typewriter.jpg/603px-Chimpanzee_seated_at_typewriter.jpg)

It is said that [infinite monkeys typing for an infinite amount of time](https://en.wikipedia.org/wiki/Infinite_monkey_theorem) will eventually type, among other things, the complete works of William Shakespeare. Let's see if we can get there a bit faster, with the power of Recurrent Neural Networks and LSTM.

This text file contains the complete works of Shakespeare: https://www.gutenberg.org/files/100/100-0.txt

Use it as training data for an RNN. You can keep it simple and train at the character level, which is the suggested initial approach.

Then, use that trained RNN to generate Shakespearean-ish text. Your goal: a function that takes, as an argument, the size of the text to generate (e.g. number of characters or lines) and returns generated text of that size.

Note: Shakespeare wrote an awful lot. It's OK, especially initially, to sample/use smaller data and parameters, so you can have a tighter feedback loop when you're trying to get things running. Then, once you've got a proof of concept, start pushing it more!

In [0]:
# TODO - Words, words, mere words, no matter from the heart.
import numpy as np
import random
import sys
import requests


In [0]:
url = 'https://www.gutenberg.org/files/100/100-0.txt'
r = requests.get(url)
r.encoding = r.apparent_encoding

data = r.text

data = data.split('\r\n')

data = data[135:]

sonets = data[:2776]
plays = data[2777:]

In [0]:
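def long_lines(lst_ln):
  """ Return non-empty lines, left-stripped, dropping mostly-padding lines """
  clean = []
  for ln in lst_ln:
    if len(ln) == 0:
      pass  # skip empty lines
    else:
      # fraction of the line left after stripping leading/trailing spaces;
      # keep the line only if at least half of it is actual content
      pct = len(ln.strip(' ')) / len(ln)
      if pct >= 0.5:
        clean.append(ln.lstrip())
  return clean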


In [5]:
sonets_c = long_lines(sonets)
plays_c = long_lines(plays)
plays_c[:5]

['ALL’S WELL THAT ENDS WELL',
 'Contents',
 'ACT I',
 'Scene I. Rossillon. A room in the Countess’s palace.',
 'Scene II. Paris. A room in the King’s palace.']

In [7]:
# Character encoding

text = "\r\n".join(plays_c)
print(text[1:10])

chars = list(set(text))
print(chars[:10])
len(chars)

LL’S WELL
['“', 'v', 'î', ';', '5', 'B', '‘', '|', '’', 's']


106

In [37]:
# Create a lookup dictionary that can be referenced for all the chars
char_int = {c:i for i,c in enumerate(chars)}
int_char = {i:c for i,c in enumerate(chars)}
len(chars), len(int_char)

(106, 106)
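
Because `chars` comes from `list(set(text))`, the order of the characters (and so every integer code) can differ between Python sessions, since string hashing is randomized per process by default. A model trained in one session could then decode to the wrong characters in another. A minimal sketch of a stable alternative, assuming a short stand-in string `sample_text` in place of the corpus (all names here are illustrative, not from the notebook):

```python
# Sketch only: `sample_text` stands in for the notebook's `text`.
sample_text = "ALL'S WELL THAT ENDS WELL"

# Sorting the set gives the same ordering in every session.
vocab = sorted(set(sample_text))
char_to_int = {c: i for i, c in enumerate(vocab)}
int_to_char = {i: c for i, c in enumerate(vocab)}

# Round trip: encode to integers, then decode back to the original string.
encoded_sample = [char_to_int[c] for c in sample_text]
decoded_sample = "".join(int_to_char[i] for i in encoded_sample)
print(decoded_sample == sample_text)
```

Saving `vocab` alongside the model weights would serve the same purpose if sorting is not wanted.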

In [36]:
int_char[52]

KeyError: ignored

In [10]:
# Create the sequence data

maxlen = 40
step = 5

encoded = [char_int[c] for c in text]

sequences = [] # Each element is a window of maxlen (40) encoded chars
next_char = [] # The char that follows each window (the prediction target)
# Take 40 chars and predict the next char, moving forward `step` chars each time

# Stop maxlen chars before the end so every window has a following char to predict.
for i in range(0, len(encoded) - maxlen, step):
    
    sequences.append(encoded[i : i + maxlen])
    next_char.append(encoded[i + maxlen])
    
print('sequences: ', len(sequences))


sequences:  1064489
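
To make the windowing concrete, here is a small sketch of the same loop on a toy string, with a shorter window and step than the notebook's `maxlen = 40` and `step = 5`. The names `toy`, `toy_maxlen`, and `toy_step` are illustrative only, and the sketch works on raw characters rather than integer codes so the output is readable.

```python
toy = "to be or not to be"   # 18 characters
toy_maxlen = 6
toy_step = 3

windows = []   # input windows of toy_maxlen characters
targets = []   # the single character that follows each window
for i in range(0, len(toy) - toy_maxlen, toy_step):
    windows.append(toy[i : i + toy_maxlen])
    targets.append(toy[i + toy_maxlen])

for w, t in zip(windows, targets):
    print(repr(w), "->", repr(t))
```

The loop yields one window for every `step` characters over the first `len(text) - maxlen` positions, which is why a smaller `step` produces more (and more heavily overlapping) training examples.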


In [11]:
sequences[1]

[52,
 92,
 87,
 94,
 94,
 52,
 95,
 11,
 72,
 95,
 52,
 87,
 23,
 68,
 46,
 52,
 92,
 87,
 94,
 94,
 58,
 104,
 96,
 42,
 56,
 44,
 73,
 56,
 44,
 9,
 58,
 104,
 72,
 96,
 95,
 52,
 37,
 58,
 104,
 46]

In [30]:
len(sequences), len(sequences[1])

(1064489, 40)

In [0]:
# The sequences are integer codes; the model needs them one-hot encoded

# With only 106 distinct characters, one-hot encoding is feasible:
#  each character becomes a 106-element vector with a single 1 at its index

# Create x & y, 40 characters in and next_char prediction
# make a bunch of zeroes, and make a one for the target
# (np.bool was removed in NumPy 1.24; the builtin bool is the replacement)
x = np.zeros((len(sequences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sequences),len(chars)), dtype=bool)

for i, sequence in enumerate(sequences):
    for t, char in enumerate(sequence):
        # i = sequence index, t = position in the window, char = integer code
        x[i,t,char] = 1
    # y is the next character
    y[i, next_char[i]] = 1

In [34]:
x.shape, y.shape

((1064489, 40, 106), (1064489, 106))
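
These shapes imply a large dense array. A rough back-of-the-envelope check, assuming one byte per element (which is how NumPy stores boolean arrays); the variable names are illustrative:

```python
n_sequences, window, n_chars = 1064489, 40, 106

x_bytes = n_sequences * window * n_chars   # one byte per bool element
y_bytes = n_sequences * n_chars

print(f"x: {x_bytes / 1e9:.2f} GB")
print(f"y: {y_bytes / 1e6:.1f} MB")
```

If that is more memory than the runtime offers, one option in the spirit of the assignment note is to train on a slice of `sequences` first, or to increase `step` so fewer windows are produced.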

In [0]:
from __future__ import print_function

from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding
from tensorflow.keras.layers import LSTM
from tensorflow.keras.callbacks import LambdaCallback, EarlyStopping

In [0]:
# build the model: a single LSTM

model = Sequential()
model.add(LSTM(128, input_shape=(maxlen, len(chars))))
model.add(Dense(len(chars), activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='nadam')

In [0]:
def sample(preds):
    # helper function to sample an index from a probability array
    # draws one index at random in proportion to its probability
    # (not simply the argmax), so the generated text varies between runs
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / 1  # the divisor acts as a temperature; 1 leaves preds unchanged
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
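
The `/ 1` in `sample` is where a temperature would go: dividing the log-probabilities by a value below 1 sharpens the distribution, and a value above 1 flattens it. A minimal standard-library sketch of that idea, assuming a hypothetical `temperature` parameter that the notebook's `sample` does not have (`apply_temperature` and `sample_index` are illustrative names):

```python
import math
import random

def apply_temperature(probs, temperature=1.0):
    """Rescale a list of probabilities; hypothetical helper, not from the notebook."""
    # Clamp to avoid log(0) for entries with zero probability.
    logits = [math.log(max(p, 1e-12)) / temperature for p in probs]
    peak = max(logits)                        # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_index(probs, temperature=1.0, rng=random):
    """Draw one index at random according to the rescaled probabilities."""
    scaled = apply_temperature(probs, temperature)
    return rng.choices(range(len(scaled)), weights=scaled, k=1)[0]

probs = [0.7, 0.2, 0.1]
print(apply_temperature(probs, 0.5))   # sharper: more mass on index 0
print(apply_temperature(probs, 2.0))   # flatter: closer to uniform
print(sample_index(probs, temperature=0.5))
```

Lower temperatures tend to give more conservative, repetitive text; higher ones give more varied text with more misspellings.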

In [0]:
def on_epoch_end(epoch, _):
    # Function invoked at end of each epoch. Prints generated text.
    
    print()
    print('----- Generating text after Epoch: %d' % epoch)
    
    start_index = random.randint(0, len(text) - maxlen - 1)
    
    generated = ''
    
    sentence = text[start_index: start_index + maxlen]
    generated += sentence
    
    print('----- Generating with seed: "' + sentence + '"')
    sys.stdout.write(generated)
    
    for i in range(400):
        x_pred = np.zeros((1, maxlen, len(chars)))
        for t, char in enumerate(sentence):
            x_pred[0, t, char_int[char]] = 1
            
        preds = model.predict(x_pred, verbose=0)[0]
        next_index = sample(preds)
        next_char = int_char[next_index]
        
        sentence = sentence[1:] + next_char
        
        sys.stdout.write(next_char)
        sys.stdout.flush()
    print()


print_callback = LambdaCallback(on_epoch_end=on_epoch_end)
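
The loop inside `on_epoch_end` is most of the way to the function the assignment asks for. Below is a sketch of that refactor, written against a hypothetical `predict_fn` callable (seed window in, list of next-character probabilities out) so that it runs without a trained model. `generate_text`, `predict_fn`, `toy_vocab`, and `uniform_predictor` are illustrative names, not part of the notebook.

```python
import random

def generate_text(predict_fn, seed, n_chars, vocab, rng=random):
    """Return `n_chars` generated characters that follow `seed`.

    predict_fn(window) must return one probability per entry of `vocab`.
    """
    window = seed
    out = []
    for _ in range(n_chars):
        probs = predict_fn(window)
        next_c = rng.choices(vocab, weights=probs, k=1)[0]
        out.append(next_c)
        # Slide the window: drop the oldest character, append the new one.
        window = window[1:] + next_c
    return "".join(out)

# Toy stand-in for the model: every character is equally likely.
toy_vocab = list("abc ")

def uniform_predictor(window):
    return [1.0 / len(toy_vocab)] * len(toy_vocab)

sample_out = generate_text(uniform_predictor, seed="abca", n_chars=20, vocab=toy_vocab)
print(sample_out)
```

With the trained network, `predict_fn` would one-hot encode the window the same way `on_epoch_end` does, call `model.predict(x_pred, verbose=0)[0]`, and return that array, with `vocab` given in the same order as `int_char`.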

In [21]:
# fit the model

model.fit(x, y,
          batch_size=256,
          epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7fa3c4651c18>

In [0]:
model.fit(x, y,
          batch_size=256,
          epochs=1, callbacks=[print_callback])

### This is wild: the generated text is mostly real words now

In [38]:
model.fit(x, y,
          batch_size=256,
          epochs=10, callbacks=[print_callback])

Epoch 1/10
----- Generating text after Epoch: 0
----- Generating with seed: "my nobler thoughts most base, is now
Th"
my nobler thoughts most base, is now










Which but beheir paster'd; s
Epoch 2/10
----- Generating text after Epoch: 1
----- Generating with seed: "ive or dead,
He will be found like Brut"
ive or dead,










And se
Epoch 3/10
----- Generating text after Epoch: 2
----- Generating with seed: "lie, you lie:
I say thou liest, Camillo"
lie, you lie:








At with his ranging me our dea
Epoch 4/10
----- Generating text after Epoch: 3
----- Generating with seed: "e to me.
YORK. If York have ill demean'"
e to me.










That serval take thee? A robser, she is it was with blood hut.
Epoch 5/10
----- Generating text after Epoch: 4
"













And would now returate’s in one ton
Epoch 6/10
----- Generating text after Epoch: 5
----- Generating with seed: " princes to act,
And monarchs to behold"
 princes to act,










What be my brother knowfle to the light hav

<tensorflow.python.keras.callbacks.History at 0x7fa3c44fc748>

# Resources and Stretch Goals

## Stretch goals:
- Refine the training and generation of text to be able to ask for different genres/styles of Shakespearean text (e.g. plays versus sonnets)
- Train a classification model that takes text and returns which work of Shakespeare it is most likely to be from
- Make it more performant! Many possible routes here - lean on Keras, optimize the code, and/or use more resources (AWS, etc.)
- Revisit the news example from class, and improve it - use categories or tags to refine the model/generation, or train a news classifier
- Run on bigger, better data

## Resources:
- [The Unreasonable Effectiveness of Recurrent Neural Networks](https://karpathy.github.io/2015/05/21/rnn-effectiveness/) - a seminal writeup demonstrating a simple but effective character-level NLP RNN
- [Simple NumPy implementation of RNN](https://github.com/JY-Yoon/RNN-Implementation-using-NumPy/blob/master/RNN%20Implementation%20using%20NumPy.ipynb) - Python 3 version of the code from "Unreasonable Effectiveness"
- [TensorFlow RNN Tutorial](https://github.com/tensorflow/models/tree/master/tutorials/rnn) - code for training an RNN on the Penn Treebank language dataset
- [4 part tutorial on RNN](http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/) - relates RNN to the vanishing gradient problem, and provides example implementation
- [RNN training tips and tricks](https://github.com/karpathy/char-rnn#tips-and-tricks) - some rules of thumb for parameterizing and training your RNN