<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>
<br></br>

## *Data Science Unit 4 Sprint 3 Assignment 1*

# Recurrent Neural Networks and Long Short Term Memory (LSTM)

![Monkey at a typewriter](https://upload.wikimedia.org/wikipedia/commons/thumb/3/3c/Chimpanzee_seated_at_typewriter.jpg/603px-Chimpanzee_seated_at_typewriter.jpg)

It is said that [infinite monkeys typing for an infinite amount of time](https://en.wikipedia.org/wiki/Infinite_monkey_theorem) will eventually type, among other things, the complete works of William Shakespeare. Let's see if we can get there a bit faster, with the power of Recurrent Neural Networks and LSTM.

This text file contains the complete works of Shakespeare: https://www.gutenberg.org/files/100/100-0.txt

Use it as training data for an RNN. Training at the character level keeps things simple and is the suggested initial approach.

Then use the trained RNN to generate Shakespearean-ish text. Your goal is a function that takes the size of the text to generate (e.g. number of characters or lines) as an argument and returns generated text of that size.

Note - Shakespeare wrote an awful lot. It's OK, especially initially, to sample/use smaller data and parameters, so you can have a tighter feedback loop when you're trying to get things running. Then, once you've got a proof of concept - start pushing it more!

In [0]:
# Get data
import requests 

fetch = requests.get('https://www.gutenberg.org/files/100/100-0.txt')

# removing first 553 characters (filler text)
data = fetch.text[553:5757527]
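
The text generated later in this notebook contains an `â`, a character that typically appears when UTF-8 bytes are decoded as Latin-1. One possible cause, not confirmed from the notebook itself, is that `requests` guessed the wrong charset for `fetch.text`. The sketch below reproduces the effect on a made-up string without any HTTP request; the sample string and variable names are illustrative only.

```python
# Illustrative only: reproduce the mojibake outside of any HTTP request.
original = "’tis"                          # starts with U+2019, a curly apostrophe
raw_bytes = original.encode("utf-8")       # what a UTF-8 file stores
misread = raw_bytes.decode("latin-1")      # what a wrong charset guess yields
repaired = misread.encode("latin-1").decode("utf-8")

print(repr(misread))
print(repr(repaired))

# With requests, the decoding used by `.text` can be set before reading it:
#   fetch.encoding = "utf-8"
```

If the encoding is changed, the slice offsets above and the unique-character count reported below would likely change too, so treat this as something to check rather than a drop-in fix.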

In [0]:
# turn data into integers rather than raw text

# unique chars 
chars = list(set(data))

# Lookup Tables
char_int = {c:i for i, c in enumerate(chars)} 
int_char = {i:c for i, c in enumerate(chars)}
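
A quick way to check lookup tables like these is to round-trip a short string through them. The sketch below does that on a toy sentence; the names `vocab`, `to_int`, and `to_char` are illustrative. It uses `sorted(set(...))` instead of `list(set(...))`, an assumption made here so that the same character always gets the same index, which matters if a trained model is reloaded in a new session.

```python
text = "to be or not to be"

# sorted() makes the index assignment reproducible across runs;
# plain list(set(...)) can order characters differently each time.
vocab = sorted(set(text))
to_int = {c: i for i, c in enumerate(vocab)}
to_char = {i: c for i, c in enumerate(vocab)}

ids = [to_int[c] for c in text]
decoded = "".join(to_char[i] for i in ids)

print(vocab)
print(ids[:5])
assert decoded == text  # encoding then decoding returns the original string
```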

In [3]:
# how many unique chars in data?
len(chars)

107

In [4]:
# encode data using char_int
encoded = [char_int[c] for c in data]
len(encoded) == len(data)

True

In [5]:
# reshape into useable text sequences
sequences = []
next_char = []
maxlen = 60
step = 20

for i in range(0, len(encoded)-maxlen, step):
  curr_sequence = encoded[i:i+maxlen]
  sequences.append(curr_sequence)
  next_char.append(encoded[maxlen+i])

len(sequences), len(next_char)

(287846, 287846)
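
To see what the windowing loop produces, here is a minimal sketch that runs the same slicing on a toy string with a small `maxlen` and `step`. The helper name `make_windows` is hypothetical. The last lines check the window count formula against the notebook's numbers, assuming the downloaded text is at least 5757527 characters long so that the slice is full length.

```python
import math

def make_windows(ids, maxlen, step):
    """Return (windows, targets) using the same slicing as the loop above."""
    windows, targets = [], []
    for i in range(0, len(ids) - maxlen, step):
        windows.append(ids[i:i + maxlen])
        targets.append(ids[i + maxlen])
    return windows, targets

toy = list("hello world")
windows, targets = make_windows(toy, maxlen=4, step=2)
for w, t in zip(windows, targets):
    print("".join(w), "->", t)

# Window count is ceil((N - maxlen) / step) whenever N > maxlen.
assert len(windows) == math.ceil((len(toy) - 4) / 2)

# Same formula for the notebook's slice of 5757527 - 553 characters.
n_chars = 5757527 - 553
print(math.ceil((n_chars - 60) / 20))
```

With `step = 20` and `maxlen = 60`, consecutive windows overlap by 40 characters, so each character appears in up to three training sequences.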

In [6]:
# make sure a sequence looks as expected
sequences[0], next_char[0]

([59, 26, 89, 105, 103, 77, 54, 59, 23, 103, 54, 20, 95, 0, 5,
  105, 103, 89, 103, 54, 69, 95, 16, 4, 68, 54, 95, 80, 54, 69,
  26, 105, 105, 26, 13, 0, 54, 46, 23, 13, 4, 103, 68, 5, 103,
  13, 16, 103, 2, 90, 2, 90, 47, 27, 89, 23, 95, 16, 77, 54],
 69)

In [7]:
# shape data into final format
import numpy as np 

# np.bool was removed in NumPy 1.24; the builtin bool is the supported dtype
X = np.zeros((len(sequences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sequences), len(chars)), dtype=bool)

for i, sequence in enumerate(sequences):
  for j, char in enumerate(sequence):
    X[i, j, char] = 1
  y[i, next_char[i]] = 1

X.shape, y.shape

((287846, 60, 107), (287846, 107))
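
The nested Python loop above touches every character one at a time. As a rough sketch of an alternative, NumPy integer-array indexing can set all the one-hot positions in a single assignment. The function names are hypothetical, and the loop version is included only to check that both give the same result on a toy input.

```python
import numpy as np

def one_hot_loop(seqs, vocab_size):
    """Reference version: same nested loop as the cell above."""
    out = np.zeros((len(seqs), len(seqs[0]), vocab_size), dtype=bool)
    for i, seq in enumerate(seqs):
        for j, idx in enumerate(seq):
            out[i, j, idx] = True
    return out

def one_hot_vectorized(seqs, vocab_size):
    """Same result using integer-array indexing instead of Python loops."""
    seqs = np.asarray(seqs)
    n, length = seqs.shape
    out = np.zeros((n, length, vocab_size), dtype=bool)
    # Index arrays broadcast to (n, length): out[i, j, seqs[i, j]] = True
    out[np.arange(n)[:, None], np.arange(length), seqs] = True
    return out

toy = [[0, 2, 1], [3, 3, 0]]
assert np.array_equal(one_hot_loop(toy, 4), one_hot_vectorized(toy, 4))

# Size of the notebook's X at one byte per boolean element.
print(287846 * 60 * 107 / 1e9, "GB")
```

At roughly 1.85 GB for `X` alone, this is one reason the assignment suggests starting with a smaller sample of the data.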

In [8]:
%tensorflow_version 2.x

TensorFlow 2.x selected.


In [9]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

model = Sequential()
model.add(LSTM(128, input_shape=(maxlen, len(chars))))
model.add(Dense(len(chars), activation='softmax'))

model.compile(
    loss='categorical_crossentropy', 
    optimizer='adam', 
    metrics=['accuracy'])

model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm (LSTM)                  (None, 128)               120832    
_________________________________________________________________
dense (Dense)                (None, 107)               13803     
Total params: 134,635
Trainable params: 134,635
Non-trainable params: 0
_________________________________________________________________
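
The parameter counts in the summary can be checked by hand. The sketch below assumes the standard LSTM layout of four gates, each with an input kernel, a recurrent kernel, and a bias; the helper names are illustrative.

```python
def lstm_params(units, input_dim):
    # Four gates (input, forget, cell candidate, output), each with
    # an input kernel, a recurrent kernel, and a bias vector.
    return 4 * (units * input_dim + units * units + units)

def dense_params(inputs, outputs):
    # One weight per input-output pair plus one bias per output.
    return inputs * outputs + outputs

lstm = lstm_params(units=128, input_dim=107)
dense = dense_params(inputs=128, outputs=107)
print(lstm, dense, lstm + dense)
```

Note that the sequence length (`maxlen = 60`) does not appear in either formula: the same LSTM weights are reused at every time step.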


In [0]:
# helper functions
from tensorflow.keras.callbacks import LambdaCallback
import random 
import sys 

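def sample(preds, temperature=1.0):
    # Sample an index from a probability array.
    # temperature=1.0 leaves the distribution unchanged (the original behavior);
    # float64 keeps the renormalized probabilities accurate enough for multinomial.
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)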


def on_epoch_end(epoch, _):
    # Function invoked at end of each epoch. Prints generated text.
    
    print()
    print('----- Generating text after Epoch: %d' % epoch)
    
    start_index = random.randint(0, len(data) - maxlen - 1)
    
    generated = ''
    
    sentence = data[start_index: start_index + maxlen]
    generated += sentence
    
    print('----- Generating with seed: "' + sentence + '"')
    sys.stdout.write(generated)
    
    for i in range(400):
        x_pred = np.zeros((1, maxlen, len(chars)))
        for t, char in enumerate(sentence):
            x_pred[0, t, char_int[char]] = 1
            
        preds = model.predict(x_pred, verbose=0)[0]
        next_index = sample(preds)
        next_char = int_char[next_index]
        
        sentence = sentence[1:] + next_char
        
        sys.stdout.write(next_char)
        # sys.stdout.flush()
    print()


print_callback = LambdaCallback(on_epoch_end=on_epoch_end)
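
`sample` divides the log-probabilities by a temperature, which is 1 in this notebook. The sketch below mirrors those log, exp, and normalize steps in plain Python to show what other temperatures would do to a toy distribution. The function name `apply_temperature` is hypothetical, and it assumes every probability is strictly positive.

```python
import math

def apply_temperature(probs, temperature=1.0):
    """Rescale a probability list; mirrors the log / exp / normalize steps in sample()."""
    logits = [math.log(p) / temperature for p in probs]
    exps = [math.exp(l) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = [0.5, 0.3, 0.2]
for t in (0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in apply_temperature(probs, t)])
```

Temperatures below 1 concentrate probability on the most likely character, which tends to give safer but more repetitive text; temperatures above 1 flatten the distribution and give more varied but noisier text.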

In [19]:
# fit the model

model.fit(X, y,
          batch_size=1028,
          epochs=10,
          callbacks=[print_callback])

Train on 287846 samples
Epoch 1/10
----- Generating text after Epoch: 0
----- Generating with seed: "int.
  AUMERLE. Thou dar'st not, coward, live to see that d"
int.
  AUMERLE. Thou dar'st not, coward, live to see that dround.
  CEROID DON ILY Fras!

 Hene lortne;
What so is neane dien'd. For'l now is andty, whece hericfur:
    eake dorly the puct anat to ancouefay.
  CAMETA.


 [_eriny bus itang her didtonch.



 BOVERE
ON Cuce of, whan ith, and. GUUREs EDRVINT, OwLecA. SlEme foou, wy, Moftr, ghe praseen speno, grie,
Yo
Epoch 2/10
----- Generating text after Epoch: 1
----- Generating with seed: "he make this way,
    Under the colour of his usual game,
"
he make this way,
    Under the colour of his usual game,

Catsead a the ingate whas hom hertenr, in. sowl sorns prong theeyâ.

[_Brave, in fold.
Theal spee duss the wrold tay, fir ad acf
    BeAng thuth a im heme laum thiout hat
    And so not se prawied thy rozenceny well tan eave
    I he ranetor hed we dofrs wheas seoul her mok

<tensorflow.python.keras.callbacks.History at 0x7f32c55f7e80>
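
The assignment asks for a function that returns generated text of a requested size, while the callback above only prints it. Here is a minimal sketch of such a function. It is not the notebook's implementation: `generate_text` and its `predict_fn` argument are hypothetical, and a toy predictor stands in for the trained model so the example runs on its own.

```python
import random

def generate_text(predict_fn, seed, n_chars, int_char, rng=random):
    """Return `n_chars` newly generated characters following `seed`.

    predict_fn(window) must return one probability per character index
    for the character that follows `window` (a string of len(seed)).
    """
    window = seed
    out = []
    for _ in range(n_chars):
        probs = predict_fn(window)
        idx = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        ch = int_char[idx]
        out.append(ch)
        window = window[1:] + ch   # slide the window one character forward
    return "".join(out)

# Stand-in predictor so the sketch runs without a trained model:
# it always predicts the letter after the window's last character.
toy_int_char = {0: "a", 1: "b", 2: "c"}
toy_char_int = {c: i for i, c in toy_int_char.items()}

def toy_predict(window):
    nxt = (toy_char_int[window[-1]] + 1) % 3
    return [1.0 if i == nxt else 0.0 for i in range(3)]

print(generate_text(toy_predict, seed="ab", n_chars=5, int_char=toy_int_char))
```

With the notebook's model, `predict_fn` could one-hot encode the window into an array of shape `(1, maxlen, len(chars))` and return `model.predict(x_pred, verbose=0)[0]`, the same steps the callback performs, with a seed of exactly `maxlen` characters.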

# Resources and Stretch Goals

## Stretch goals:
- Refine the training and generation of text to be able to ask for different genres/styles of Shakespearean text (e.g. plays versus sonnets)
- Train a classification model that takes text and returns which work of Shakespeare it is most likely to be from
- Make it more performant! Many possible routes here - lean on Keras, optimize the code, and/or use more resources (AWS, etc.)
- Revisit the news example from class, and improve it - use categories or tags to refine the model/generation, or train a news classifier
- Run on bigger, better data

## Resources:
- [The Unreasonable Effectiveness of Recurrent Neural Networks](https://karpathy.github.io/2015/05/21/rnn-effectiveness/) - a seminal writeup demonstrating a simple but effective character-level NLP RNN
- [Simple NumPy implementation of RNN](https://github.com/JY-Yoon/RNN-Implementation-using-NumPy/blob/master/RNN%20Implementation%20using%20NumPy.ipynb) - Python 3 version of the code from "Unreasonable Effectiveness"
- [TensorFlow RNN Tutorial](https://github.com/tensorflow/models/tree/master/tutorials/rnn) - code for training an RNN on the Penn Treebank language dataset
- [4 part tutorial on RNN](http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/) - relates RNN to the vanishing gradient problem, and provides example implementation
- [RNN training tips and tricks](https://github.com/karpathy/char-rnn#tips-and-tricks) - some rules of thumb for parameterizing and training your RNN