<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>
<br></br>

## *Data Science Unit 4 Sprint 3 Assignment 1*

# Recurrent Neural Networks and Long Short-Term Memory (LSTM)

![Monkey at a typewriter](https://upload.wikimedia.org/wikipedia/commons/thumb/3/3c/Chimpanzee_seated_at_typewriter.jpg/603px-Chimpanzee_seated_at_typewriter.jpg)

It is said that [infinite monkeys typing for an infinite amount of time](https://en.wikipedia.org/wiki/Infinite_monkey_theorem) will eventually type, among other things, the complete works of William Shakespeare. Let's see if we can get there a bit faster, with the power of Recurrent Neural Networks and LSTM.

This text file contains the complete works of Shakespeare: https://www.gutenberg.org/files/100/100-0.txt

Use it as training data for an RNN. You can keep it simple and train at the character level, which is the suggested initial approach.

Then use the trained RNN to generate Shakespearean-ish text. Your goal: a function that takes the size of the text to generate (e.g. number of characters or lines) as an argument and returns generated text of that size.
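One possible shape for such a function is sketched below. It is deliberately model-agnostic: `predict_next` is a hypothetical callable (not a Keras API) that maps the current window of text to one probability per vocabulary character, so a trained model could be wrapped and plugged in later. All names here are illustrative assumptions, and the "model" in the demo is a stand-in, not a trained network.

```python
import random


def generate_text(predict_next, seed, n_chars, vocab, rng=None):
    """Return `n_chars` characters generated one at a time.

    predict_next: hypothetical callable taking the current window (a string)
        and returning one probability per entry of `vocab`.
    seed: starting text; its length fixes the window size.
    vocab: list of characters, in the same order as the probabilities.
    """
    rng = rng or random.Random()
    window = seed
    generated = []
    for _ in range(n_chars):
        probs = predict_next(window)
        # Draw the next character in proportion to its predicted probability.
        next_char = rng.choices(vocab, weights=probs, k=1)[0]
        generated.append(next_char)
        # Slide the window: drop the oldest character, append the new one.
        window = window[1:] + next_char
    return "".join(generated)


# Stand-in "model" for demonstration only: it always predicts the letter
# that follows the last character of the window in a fixed cycle.
VOCAB = ["a", "b", "c"]


def cycle_model(window):
    nxt = (VOCAB.index(window[-1]) + 1) % len(VOCAB)
    return [1.0 if i == nxt else 0.0 for i in range(len(VOCAB))]


sample_text = generate_text(cycle_model, seed="abc", n_chars=6, vocab=VOCAB)
print(sample_text)  # abcabc
```

With a trained network, `predict_next` would one-hot encode the window, call the model's predict method, and return the resulting probability vector.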

Note - Shakespeare wrote an awful lot. It's OK, especially initially, to sample/use smaller data and parameters, so you can have a tighter feedback loop when you're trying to get things running. Then, once you've got a proof of concept - start pushing it more!

In [1]:
import re
import numpy as np
import random
import sys
import os

from tensorflow.keras.callbacks import LambdaCallback
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM
from tensorflow.keras.optimizers import RMSprop



In [2]:
# !wget https://www.gutenberg.org/files/100/100-0.txt

In [32]:
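# Read the raw Project Gutenberg text; the context manager closes the file.
with open('100-0.txt', 'r', encoding='utf-8') as f:
    text_clean = f.read()

# Patterns replaced with a single space: newlines, digit runs, selected runs
# of non-word characters, then leftover runs of spaces (longest first).
char_scrub = ['\n', r'\d+', r'\W+(?!\S*[a-z])|(?<!\S)\W+', '    ', '   ', '  ']

for pattern in char_scrub:
    text_clean = re.sub(pattern, ' ', text_clean)

# Trim fixed-size chunks from both ends. The offsets are specific to this
# copy of the file (presumably the Gutenberg header and license text).
end = len(text_clean) - 20245
text_clean = text_clean[1985:end]

# Split at a hard-coded offset: sonnets first, plays after.
sonnets = text_clean[:90968]
plays = text_clean[90968:]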

In [33]:
# Create the sequence data (called below for the sonnets)

def sequence_data():
    """Build one-hot training arrays from the globals `data`, `maxlen`, `step`."""
    # Sort so the character-to-index mapping is the same on every run.
    chars = sorted(set(data))

    char_int = {c: i for i, c in enumerate(chars)}
    int_char = {i: c for i, c in enumerate(chars)}

    encoded = [char_int[c] for c in data]

    sequences = []   # each element is maxlen encoded characters
    next_chars = []  # the character that follows each sequence

    for i in range(0, len(encoded) - maxlen, step):
        sequences.append(encoded[i : i + maxlen])
        next_chars.append(encoded[i + maxlen])

    print(f'sequences: {len(sequences)}')

    # One-hot arrays; builtin `bool` is used because the `np.bool` alias
    # is gone in newer NumPy releases.
    x = np.zeros((len(sequences), maxlen, len(chars)), dtype=bool)
    y = np.zeros((len(sequences), len(chars)), dtype=bool)

    for i, sequence in enumerate(sequences):
        for t, char in enumerate(sequence):
            x[i, t, char] = 1
        y[i, next_chars[i]] = 1

    return chars, x, y, char_int, int_char
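
The windowing step in `sequence_data` is easier to see on a tiny string. The sketch below repeats the same slicing loop on raw characters instead of encoded integers. `make_windows` is an illustrative helper name, not part of the notebook.

```python
def make_windows(text, maxlen, step):
    """Split `text` into overlapping windows plus the character after each."""
    windows, targets = [], []
    for i in range(0, len(text) - maxlen, step):
        windows.append(text[i:i + maxlen])
        targets.append(text[i + maxlen])
    return windows, targets


windows, targets = make_windows("to be or not", maxlen=4, step=3)
for w, t in zip(windows, targets):
    print(repr(w), "->", repr(t))
# 'to b' -> 'e'
# 'be o' -> 'r'
# 'or n' -> 'o'
```

Each window becomes one row of `x` and its target becomes the matching row of `y`. A smaller `step` yields more, heavily overlapping training examples.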


In [34]:
def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
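
To see what the `temperature` argument does before the random draw, the sketch below applies the same log, divide, and renormalise steps to a small made-up distribution. It uses plain Python lists rather than NumPy so it runs on its own. `apply_temperature` and the example probabilities are illustrative assumptions.

```python
import math


def apply_temperature(probs, temperature):
    """Rescale a probability list the way `sample` does before drawing."""
    logits = [math.log(p) / temperature for p in probs]
    exps = [math.exp(logit) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]


base = [0.6, 0.3, 0.1]  # made-up next-character probabilities
for temperature in (0.2, 1.0, 1.2):
    rescaled = apply_temperature(base, temperature)
    print(temperature, [round(p, 3) for p in rescaled])
```

Temperatures below 1 sharpen the distribution toward the most likely character, a temperature of 1 leaves it unchanged, and temperatures above 1 flatten it, which makes the generated text more varied and more error-prone.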

In [42]:
def on_epoch_end(epoch, _):
    # Invoked at the end of each epoch; prints generated text every fifth epoch.
    if epoch % 5 == 0:
        print()
        print('----- Generating text after Epoch: %d' % epoch)

        start_index = random.randint(0, len(data) - maxlen - 1)
        for diversity in [0.2, 0.5, 1.0, 1.2]:
            print('----- diversity:', diversity)

            generated = ''
            sentence = data[start_index: start_index + maxlen]
            generated += sentence
            print('----- Generating with seed: "' + sentence + '"')
            sys.stdout.write(generated)

            for i in range(400):
                x_pred = np.zeros((1, maxlen, len(chars)))
                for t, char in enumerate(sentence):
                    x_pred[0, t, char_int[char]] = 1.

                preds = model.predict(x_pred, verbose=0)[0]
                next_index = sample(preds, diversity)
                next_char = int_char[next_index]

                sentence = sentence[1:] + next_char

                sys.stdout.write(next_char)
                sys.stdout.flush()
            print()

In [43]:
#sonnet lstm

maxlen = 100
step = 5
data = sonnets

chars, x, y, char_int, int_char = sequence_data()

sequences: 18174
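
The reported count can be checked by hand from the window loop: it starts a window every `step` characters and stops once fewer than `maxlen + 1` characters remain. A small sketch of that arithmetic, where `count_windows` is an illustrative helper name:

```python
def count_windows(n_chars, maxlen, step):
    """Number of (window, next-char) pairs the slicing loop produces."""
    return len(range(0, n_chars - maxlen, step))


# 90968 is the length of the `sonnets` slice taken earlier in the notebook.
print(count_windows(90968, maxlen=100, step=5))  # 18174
print(count_windows(90968, maxlen=100, step=1))  # 90868
```

Since `x` has shape `(sequences, maxlen, len(chars))`, dropping `step` from 5 to 1 makes the one-hot array roughly five times larger.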


In [44]:
# build the model: a single LSTM
        
model = Sequential()

model.add(LSTM(128, input_shape=(maxlen, len(chars))))
# softmax gives a probability distribution over the characters
model.add(Dense(len(chars), activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam')

print_callback = LambdaCallback(on_epoch_end=on_epoch_end)
    
model.fit(x, y,
          batch_size=128,
          epochs=25,
          callbacks=[print_callback])

Epoch 1/25
----- Generating text after Epoch: 0
----- diversity: 0.2
----- Generating with seed: "d Since first I saw you fresh which yet are green Ah yet doth beauty like a dial hand Steal from his"
d Since first I saw you fresh which yet are green Ah yet doth beauty like a dial hand Steal from his   t  e h t   t  t   o   e        t  ee    he e    a    e t  e s s  ae t   e  e   e   t he    h h  e  e  a  e  s  ee  e  oe e   o      e   h   h h e t   t    e  e       t  t h   s oe  e e    e  t o  t t    t   t e e h  r e      h     o e     e  o e  t   s h r   t    o  s   e s   t     a    t  t   t   e  t      h  t  e e ee     e    t  t    e th   t  h  e   e   e n t  o e t  e e  t  e  ot  e  s  h 
----- diversity: 0.5
----- Generating with seed: "d Since first I saw you fresh which yet are green Ah yet doth beauty like a dial hand Steal from his"
d Since first I saw you fresh which yet are green Ah yet doth beauty like a dial hand Steal from his e ea e eeos st taoae re ra ai  hoatr  eu a tun

<tensorflow.python.keras.callbacks.History at 0x7f3ff96c2d30>
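
For a rough sense of how large this network is, the standard parameter-count formulas for an LSTM layer and a dense layer can be worked through by hand. The vocabulary size below is a hypothetical placeholder, since the actual `len(chars)` is not shown in the output above, and the helper names are illustrative.

```python
def lstm_params(units, input_dim):
    """Trainable weights in a standard LSTM layer (four gates, with biases)."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(units, input_dim):
    """Trainable weights in a fully connected layer with a bias."""
    return units * input_dim + units


vocab_size = 60  # hypothetical; the real value is len(chars)
lstm_total = lstm_params(128, vocab_size)
dense_total = dense_params(vocab_size, 128)
print(lstm_total, dense_total, lstm_total + dense_total)  # 96768 7740 104508
```

Calling `model.summary()` on the compiled model reports the actual figures for the real vocabulary.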

# Resources and Stretch Goals

## Stretch goals:
- Refine the training and generation of text to be able to ask for different genres/styles of Shakespearean text (e.g. plays versus sonnets)
- Train a classification model that takes text and returns which work of Shakespeare it is most likely to be from
- Make it more performant! Many possible routes here - lean on Keras, optimize the code, and/or use more resources (AWS, etc.)
- Revisit the news example from class, and improve it - use categories or tags to refine the model/generation, or train a news classifier
- Run on bigger, better data

## Resources:
- [The Unreasonable Effectiveness of Recurrent Neural Networks](https://karpathy.github.io/2015/05/21/rnn-effectiveness/) - a seminal writeup demonstrating a simple but effective character-level NLP RNN
- [Simple NumPy implementation of RNN](https://github.com/JY-Yoon/RNN-Implementation-using-NumPy/blob/master/RNN%20Implementation%20using%20NumPy.ipynb) - Python 3 version of the code from "Unreasonable Effectiveness"
- [TensorFlow RNN Tutorial](https://github.com/tensorflow/models/tree/master/tutorials/rnn) - code for training an RNN on the Penn Tree Bank language dataset
- [4-part tutorial on RNN](http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/) - relates RNN to the vanishing gradient problem, and provides an example implementation
- [RNN training tips and tricks](https://github.com/karpathy/char-rnn#tips-and-tricks) - some rules of thumb for parameterizing and training your RNN