<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>
<br></br>

## *Data Science Unit 4 Sprint 3 Assignment 1*

# Recurrent Neural Networks and Long Short-Term Memory (LSTM)

![Monkey at a typewriter](https://upload.wikimedia.org/wikipedia/commons/thumb/3/3c/Chimpanzee_seated_at_typewriter.jpg/603px-Chimpanzee_seated_at_typewriter.jpg)

It is said that [infinite monkeys typing for an infinite amount of time](https://en.wikipedia.org/wiki/Infinite_monkey_theorem) will eventually type, among other things, the complete works of William Shakespeare. Let's see if we can get there a bit faster, with the power of Recurrent Neural Networks and LSTM.

This text file contains the complete works of Shakespeare: https://www.gutenberg.org/files/100/100-0.txt

Use it as training data for an RNN. You can keep it simple and train at the character level, which is the suggested initial approach.

Then, use that trained RNN to generate Shakespearean-ish text. Your goal is a function that takes, as an argument, the size of the text to generate (e.g. number of characters or lines) and returns generated text of that size.

Note - Shakespeare wrote an awful lot. It's OK, especially initially, to sample/use smaller data and parameters, so you can have a tighter feedback loop when you're trying to get things running. Then, once you've got a proof of concept - start pushing it more!
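
One way to get that tighter feedback loop is to train on a contiguous slice of the corpus first. The sketch below is only an assumption about how that could look; `take_slice` and its parameters are hypothetical names, not part of the assignment.

```python
import random

def take_slice(text, max_chars, seed=0):
    """Return a random contiguous slice of `text` at most `max_chars` long.

    A contiguous slice (rather than randomly picked characters) keeps the
    character order intact, which is what a character-level model learns from.
    """
    if max_chars >= len(text):
        return text
    rng = random.Random(seed)  # seeded so repeated runs use the same slice
    start = rng.randint(0, len(text) - max_chars)
    return text[start:start + max_chars]

# Stand-in corpus (assumption): any long string works here.
corpus = "To be, or not to be, that is the question. " * 100
small = take_slice(corpus, max_chars=200)
print(len(small))
```

Once the pipeline runs end to end on the small slice, raising `max_chars` is the only change needed to scale up.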

In [93]:
from __future__ import print_function

import pandas as pd
import numpy as np
import random
import sys
import os
import time

from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.callbacks import LambdaCallback
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, LSTM
from tensorflow.keras.optimizers import RMSprop


In [4]:
billy = pd.read_csv('./Shakespeare_data.csv')

In [17]:
billy.head()

Unnamed: 0,Dataline,Play,PlayerLinenumber,ActSceneLine,Player,PlayerLine
0,1,Henry IV,,,,ACT I
1,2,Henry IV,,,,SCENE I. London. The palace.
2,3,Henry IV,,,,"Enter KING HENRY, LORD JOHN OF LANCASTER, the ..."
3,4,Henry IV,1.0,1.1.1,KING HENRY IV,"So shaken as we are, so wan with care,"
4,5,Henry IV,1.0,1.1.2,KING HENRY IV,"Find we a time for frighted peace to pant,"


In [16]:
play_titles = billy.Play.unique()
play_titles

array(['Henry IV', 'Henry VI Part 1', 'Henry VI Part 2',
       'Henry VI Part 3', 'Alls well that ends well', 'As you like it',
       'Antony and Cleopatra', 'A Comedy of Errors', 'Coriolanus',
       'Cymbeline', 'Hamlet', 'Henry V', 'Henry VIII', 'King John',
       'Julius Caesar', 'King Lear', 'Loves Labours Lost', 'macbeth',
       'Measure for measure', 'Merchant of Venice',
       'Merry Wives of Windsor', 'A Midsummer nights dream',
       'Much Ado about nothing', 'Othello', 'Pericles', 'Richard II',
       'Richard III', 'Romeo and Juliet', 'Taming of the Shrew',
       'The Tempest', 'Timon of Athens', 'Titus Andronicus',
       'Troilus and Cressida', 'Twelfth Night', 'Two Gentlemen of Verona',
       'A Winters Tale'], dtype=object)

In [18]:
histories = ['Henry IV', 'Henry VI Part 1', 'Henry VI Part 2', 'Henry VI Part 3',
             'Henry V', 'Henry VIII', 'King John', 'Richard II', 'Richard III',
             'Pericles']

tragedies = ['Antony and Cleopatra', 'Coriolanus', 'Cymbeline', 'Hamlet', 
             'Julius Caesar', 'King Lear', 'macbeth', 'Othello', 'Troilus and Cressida',
             'Romeo and Juliet', 'Timon of Athens', 'Titus Andronicus']

comedies = ['Alls well that ends well', 'As you like it', 'A Comedy of Errors',
            'Loves Labours Lost', 'Measure for measure', 'Merchant of Venice',
            'Merry Wives of Windsor', 'A Midsummer nights dream', 'Much Ado about nothing',
            'Taming of the Shrew', 'The Tempest', 'Twelfth Night', 'Two Gentlemen of Verona',
            'A Winters Tale']

In [31]:
df2 = billy.copy()

In [52]:
df2['genre'] = 'comedy'
df2.head(8)

Unnamed: 0,Dataline,Play,PlayerLinenumber,ActSceneLine,Player,PlayerLine,genre
0,1,Henry IV,,,,ACT I,comedy
1,2,Henry IV,,,,SCENE I. London. The palace.,comedy
2,3,Henry IV,,,,"Enter KING HENRY, LORD JOHN OF LANCASTER, the ...",comedy
3,4,Henry IV,1.0,1.1.1,KING HENRY IV,"So shaken as we are, so wan with care,",comedy
4,5,Henry IV,1.0,1.1.2,KING HENRY IV,"Find we a time for frighted peace to pant,",comedy
5,6,Henry IV,1.0,1.1.3,KING HENRY IV,And breathe short-winded accents of new broils,comedy
6,7,Henry IV,1.0,1.1.4,KING HENRY IV,To be commenced in strands afar remote.,comedy
7,8,Henry IV,1.0,1.1.5,KING HENRY IV,No more the thirsty entrance of this soil,comedy


In [74]:
def get_genre(title):
    if title in histories:
        genre = 'history'
    elif title in tragedies:
        genre = 'tragedy'
    else:
        genre = 'comedy'
    return genre

In [75]:
df2['genre'] = df2['Play'].apply(get_genre)

In [76]:
df2.head()

Unnamed: 0,Dataline,Play,PlayerLinenumber,ActSceneLine,Player,PlayerLine,genre
0,1,Henry IV,,,,ACT I,history
1,2,Henry IV,,,,SCENE I. London. The palace.,history
2,3,Henry IV,,,,"Enter KING HENRY, LORD JOHN OF LANCASTER, the ...",history
3,4,Henry IV,1.0,1.1.1,KING HENRY IV,"So shaken as we are, so wan with care,",history
4,5,Henry IV,1.0,1.1.2,KING HENRY IV,"Find we a time for frighted peace to pant,",history


In [77]:
lines = df2[df2['Player'].notna()]

In [78]:
lines.head()

Unnamed: 0,Dataline,Play,PlayerLinenumber,ActSceneLine,Player,PlayerLine,genre
3,4,Henry IV,1.0,1.1.1,KING HENRY IV,"So shaken as we are, so wan with care,",history
4,5,Henry IV,1.0,1.1.2,KING HENRY IV,"Find we a time for frighted peace to pant,",history
5,6,Henry IV,1.0,1.1.3,KING HENRY IV,And breathe short-winded accents of new broils,history
6,7,Henry IV,1.0,1.1.4,KING HENRY IV,To be commenced in strands afar remote.,history
7,8,Henry IV,1.0,1.1.5,KING HENRY IV,No more the thirsty entrance of this soil,history


In [107]:
#  filter by genre if preferred - comedy/history/tragedy

df = lines.loc[lines['genre'] == 'comedy']
text = " ".join(df['PlayerLine'])

In [110]:
chars = list(set(text))
char_int = {c:i for i,c in enumerate(chars)}
int_char = {i:c for i,c in enumerate(chars)}

In [114]:
char_int

{'F': 0,
 'V': 1,
 'B': 2,
 'p': 3,
 'o': 4,
 'W': 5,
 'C': 6,
 'Z': 7,
 ' ': 8,
 'a': 9,
 'O': 10,
 'G': 11,
 'x': 12,
 'N': 13,
 'E': 14,
 'A': 15,
 "'": 16,
 'r': 17,
 'k': 18,
 'M': 19,
 ']': 20,
 'm': 21,
 't': 22,
 'd': 23,
 'z': 24,
 'S': 25,
 'y': 26,
 '.': 27,
 'e': 28,
 '[': 29,
 'K': 30,
 'i': 31,
 'q': 32,
 'D': 33,
 'J': 34,
 'j': 35,
 'v': 36,
 'I': 37,
 'T': 38,
 'f': 39,
 'g': 40,
 ':': 41,
 'P': 42,
 '-': 43,
 'U': 44,
 'l': 45,
 'w': 46,
 ',': 47,
 'L': 48,
 '?': 49,
 'Q': 50,
 'h': 51,
 's': 52,
 'Y': 53,
 'u': 54,
 'n': 55,
 'c': 56,
 'b': 57,
 '!': 58,
 'R': 59,
 'X': 60,
 'H': 61,
 '\t': 62}
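
Because the iteration order of a `set` of strings can differ between Python sessions, the mapping printed above is not guaranteed to come out the same on a later run, which matters if a trained model is saved and reloaded. A minimal sketch of a reproducible variant, assuming that sorting the characters is acceptable; `build_vocab`, `encode`, and `decode` are hypothetical helper names.

```python
def build_vocab(text):
    """Map each distinct character to an integer index, in sorted order."""
    chars = sorted(set(text))
    char_to_int = {c: i for i, c in enumerate(chars)}
    int_to_char = {i: c for i, c in enumerate(chars)}
    return chars, char_to_int, int_to_char

def encode(text, char_to_int):
    """Turn a string into a list of integer indices."""
    return [char_to_int[c] for c in text]

def decode(indices, int_to_char):
    """Turn a list of integer indices back into a string."""
    return "".join(int_to_char[i] for i in indices)

sample_text = "So shaken as we are, so wan with care,"
demo_chars, demo_char_int, demo_int_char = build_vocab(sample_text)
demo_encoded = encode(sample_text, demo_char_int)

# The round trip through encode/decode is lossless.
assert decode(demo_encoded, demo_int_char) == sample_text
print(len(demo_chars), demo_encoded[:5])
```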

In [111]:
maxlen = 50
step = 15

encoded = [char_int[c] for c in text]

sequences = []  # Each element is maxlen (50) encoded characters long
next_chars = [] # One element for each sequence

for i in range(0, len(encoded) - maxlen, step):
    sequences.append(encoded[i : i + maxlen])
    next_chars.append(encoded[i + maxlen])

print('sequences:', len(sequences))

sequences: 100217
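
The loop above slides a window of `maxlen` characters across the encoded text and advances `step` characters each time, so the number of training examples is `ceil((len(encoded) - maxlen) / step)`. The sketch below checks that count on a toy input; `make_windows` is a hypothetical name that wraps the same loop.

```python
import math

def make_windows(encoded, maxlen, step):
    """Return (sequences, next_items) using the same sliding window as above."""
    sequences, next_items = [], []
    for i in range(0, len(encoded) - maxlen, step):
        sequences.append(encoded[i:i + maxlen])
        next_items.append(encoded[i + maxlen])
    return sequences, next_items

toy = list(range(20))  # stand-in for an encoded text of 20 characters
seqs, targets = make_windows(toy, maxlen=5, step=3)

expected = math.ceil((len(toy) - 5) / 3)
print(len(seqs), expected)   # 5 5
print(seqs[1], targets[1])   # [3, 4, 5, 6, 7] 8
```

A smaller `step` yields more, heavily overlapping examples; a larger one trains faster but shows the model fewer contexts.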


In [112]:
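# Boolean one-hot tensors: x is (samples, maxlen, vocab), y is (samples, vocab)
x = np.zeros((len(sequences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sequences), len(chars)), dtype=bool)

# One-hot encode each input window and its target character
for i, seq in enumerate(sequences):
    for t, char_idx in enumerate(seq):
        x[i, t, char_idx] = 1

    y[i, next_chars[i]] = 1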

In [113]:
print(x.shape)
print(y.shape)

(100217, 50, 63)
(100217, 63)


In [90]:
model = Sequential()
model.add(LSTM(128, input_shape=(maxlen, len(chars))))
model.add(Dense(len(chars), activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam')



In [118]:
def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
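
To see what `temperature` does in `sample`, it helps to apply the same log, divide, and re-normalize steps to a small distribution. The sketch below does that in plain Python; `apply_temperature` is a hypothetical helper, not part of the notebook.

```python
import math

def apply_temperature(probs, temperature):
    """Rescale a probability list the same way `sample` does before drawing."""
    scaled = [math.log(p) / temperature for p in probs]
    exps = [math.exp(s) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

probs = [0.6, 0.3, 0.1]
for t in (0.2, 1.0, 1.2):
    print(t, [round(p, 3) for p in apply_temperature(probs, t)])
```

Temperatures below 1.0 concentrate probability on the most likely character, 1.0 leaves the distribution unchanged, and values above 1.0 flatten it, which produces more varied but less coherent text.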

In [119]:
def on_epoch_end(epoch, _):
    # Function invoked at end of each epoch. Prints generated text.
    print()
    print('----- Generating text after Epoch: %d' % epoch)

    start_index = random.randint(0, len(text) - maxlen - 1)
    for diversity in [0.2, 0.5, 1.0, 1.2]:
        print('----- diversity:', diversity)

        generated = ''
        sentence = text[start_index: start_index + maxlen]
        generated += sentence
        print('----- Generating with seed: "' + sentence + '"')
        sys.stdout.write(generated)

        for i in range(400):
            x_pred = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(sentence):
                x_pred[0, t, char_int[char]] = 1

            preds = model.predict(x_pred, verbose=0)[0]
            next_index = sample(preds, diversity)
            next_char = int_char[next_index]

            sentence = sentence[1:] + next_char

            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()

print_callback = LambdaCallback(on_epoch_end=on_epoch_end)

In [120]:
t0 = time.time()
model.fit(x, y,
          batch_size=128,
          epochs=15,
          callbacks=[print_callback])
print('time: ', time.time() - t0)

,whatinaprountercomesandcountintherestortofhisnothingbetheduthtre
----- diversity: 1.0
----- Generating with seed: "er o'er his follies, Will never do him good, not o"
er o'er his follies, Will never do him good, not oth-mookt,whutIwelldyourNostine,Hefitenmeshave:nowthembearwasnotorisenion,thyulfibringsmadingblit,lewme,live,mymusayafortsofwould,Sahalksting?WisthIdoldmeleantathen.Ihaveyou.Hark'sstaleisinforthitEloult,bratirs.ExeuntAngiledyou.Whatyoutrice.Murtcomehosnohishwanes,Toheefortysory,EndATIOdostilns,aNThcaknthereyandnos,dise.Mygea
----- diversity: 1.2
----- Generating with seed: "er o'er his follies, Will never do him good, not o"
er o'er his follies, Will never do him good, not oflowittruslo:Aindscceams,Bybeersage.ANHabld,andjromw.herbasgedcrasswear?Iwarnaschak.Fainttant,edshalfernthee'.Silvygolyorcaresslaving:myiknationMonking'dpoy.O'nesomt,Mypondy?nowelt,yeuramlccesthisninu.Efoulthclifite--hamyourjuach,SonIanfthishaun,therahe,I.Snod,I,'stanhakeandyou,how.Aduthitthebeang:Flaob

In [None]:
# TODO - Words, words, mere words, no matter from the heart.
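
One possible shape for the function the assignment asks for is sketched below. It is deliberately independent of Keras: `predict_fn` is an assumed callable that takes the current seed string and returns a list of next-character probabilities aligned with the index-to-character mapping. With the model above, that callable would one-hot encode the seed and call `model.predict`; here a uniform stand-in is used so the sketch runs on its own. `generate_text`, `predict_fn`, and `uniform_predict` are hypothetical names.

```python
import math
import random

def generate_text(n_chars, seed_text, predict_fn, int_char, temperature=1.0, rng=None):
    """Generate `n_chars` new characters, one at a time, after `seed_text`."""
    rng = rng or random.Random()
    window = seed_text
    out = []
    for _ in range(n_chars):
        probs = predict_fn(window)
        # Temperature rescaling as in sample(); the clamp to 1e-12 is an
        # addition here (assumption) to avoid taking log(0).
        weights = [math.exp(math.log(max(p, 1e-12)) / temperature) for p in probs]
        # random.choices normalizes the weights before drawing.
        next_index = rng.choices(range(len(weights)), weights=weights, k=1)[0]
        next_char = int_char[next_index]
        out.append(next_char)
        window = window[1:] + next_char  # slide the window, keeping its length fixed
    return "".join(out)

# Stand-in predictor (assumption): every character is equally likely.
vocab = sorted(set("to be or not to be"))
demo_int_char = dict(enumerate(vocab))

def uniform_predict(window):
    return [1.0 / len(vocab)] * len(vocab)

result = generate_text(40, "to be or not", uniform_predict, demo_int_char,
                       rng=random.Random(0))
print(len(result))  # 40
```

Passing the predictor in as an argument keeps the generation loop testable without a trained model; generating by number of lines instead of characters would need a stopping rule that this sketch does not include.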

# Resources and Stretch Goals

## Stretch goals:
- Refine the training and generation of text to be able to ask for different genres/styles of Shakespearean text (e.g. plays versus sonnets)
- Train a classification model that takes text and returns which work of Shakespeare it is most likely to be from
- Make it more performant! Many possible routes here - lean on Keras, optimize the code, and/or use more resources (AWS, etc.)
- Revisit the news example from class, and improve it - use categories or tags to refine the model/generation, or train a news classifier
- Run on bigger, better data

## Resources:
- [The Unreasonable Effectiveness of Recurrent Neural Networks](https://karpathy.github.io/2015/05/21/rnn-effectiveness/) - a seminal writeup demonstrating a simple but effective character-level NLP RNN
- [Simple NumPy implementation of RNN](https://github.com/JY-Yoon/RNN-Implementation-using-NumPy/blob/master/RNN%20Implementation%20using%20NumPy.ipynb) - Python 3 version of the code from "Unreasonable Effectiveness"
- [TensorFlow RNN Tutorial](https://github.com/tensorflow/models/tree/master/tutorials/rnn) - code for training a RNN on the Penn Tree Bank language dataset
- [4 part tutorial on RNN](http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/) - relates RNN to the vanishing gradient problem, and provides example implementation
- [RNN training tips and tricks](https://github.com/karpathy/char-rnn#tips-and-tricks) - some rules of thumb for parameterizing and training your RNN