<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>
<br></br>

## *Data Science Unit 4 Sprint 3 Assignment 1*

# Recurrent Neural Networks and Long Short Term Memory (LSTM)

![Monkey at a typewriter](https://upload.wikimedia.org/wikipedia/commons/thumb/3/3c/Chimpanzee_seated_at_typewriter.jpg/603px-Chimpanzee_seated_at_typewriter.jpg)

It is said that [infinite monkeys typing for an infinite amount of time](https://en.wikipedia.org/wiki/Infinite_monkey_theorem) will eventually type, among other things, the complete works of Wiliam Shakespeare. Let's see if we can get there a bit faster, with the power of Recurrent Neural Networks and LSTM.

This text file contains the complete works of Shakespeare: https://www.gutenberg.org/files/100/100-0.txt

Use it as training data for an RNN - you can keep it simple and train character level, and that is suggested as an initial approach.

Then, use that trained RNN to generate Shakespearean-ish text. Your goal - a function that can take, as an argument, the size of text (e.g. number of characters or lines) to generate, and returns generated text of that size.

Note - Shakespeare wrote an awful lot. It's OK, especially initially, to sample/use smaller data and parameters, so you can have a tighter feedback loop when you're trying to get things running. Then, once you've got a proof of concept - start pushing it more!

In [1]:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

In [2]:
# TODO - Words, words, mere words, no matter from the heart.

from urllib.request import urlopen

res = urlopen("https://www.gutenberg.org/files/100/100-0.txt")
if res.status == 200:
    text = res.read().decode("utf-8")

In [3]:
chars = list(set(text))

i_to_c = {i: c for i, c in enumerate(chars)}
c_to_i = {c: i for i, c in enumerate(chars)}

encoded = [c_to_i[c] for c in text]
sequences = []
next_char = []

In [4]:
maxlen = 30
step = 5

def preprocess(text):
    
    text = re.sub("[^A-Za-z0-9 ]", "", text)
    
    encoded = [c_to_i[c] for c in text]
    for i in range(0, len(encoded) - maxlen, step):
        sequences.append(encoded[i : i + maxlen])
        next_char.append(encoded[i + maxlen])


    for i in range(0, len(encoded) - maxlen, step):
        sequences.append(encoded[i : i + maxlen])
        next_char.append(encoded[i + maxlen])
    X = np.zeros((len(sequences), maxlen, len(chars)), dtype=np.bool)
    y = np.zeros((len(sequences), len(chars)), dtype=np.bool)

    for i, seq in enumerate(sequences):
        for t, char in enumerate(seq):
            X[i, t, char] = 1
        y[i, next_char[i]] = 1

    return X, y

In [5]:
import numpy as np
import re

X, y = preprocess(text)

X.shape

(2053604, 30, 108)

In [17]:
from tensorflow.keras.callbacks import LambdaCallback
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Embedding

max_features = 20000

model = Sequential()
model.add(LSTM(len(chars), input_shape=(maxlen, len(chars))))
model.add(Dense(len(chars), activation="softmax"))

model.compile(loss="categorical_crossentropy", optimizer="adam")

model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm (LSTM)                  (None, 108)               93744     
_________________________________________________________________
dense (Dense)                (None, 108)               11772     
Total params: 105,516
Trainable params: 105,516
Non-trainable params: 0
_________________________________________________________________


In [18]:
def sample(preds):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / 1
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)

def on_epoch_end(epoch, _):
    # Function invoked at end of each epoch. Prints generated text.
    
    print()
    print('----- Generating text after Epoch: %d' % epoch)
    
    start_index = random.randint(0, len(text) - maxlen - 1)
    
    generated = ''
    
    sentence = text[start_index: start_index + maxlen]
    generated += sentence
    
    print('----- Generating with seed: "' + sentence + '"')
    sys.stdout.write(generated)
    
    for i in range(400):
        x_pred = np.zeros((1, maxlen, len(chars)))
        for t, char in enumerate(sentence):
            x_pred[0, t, c_to_i[char]] = 1
            
        preds = model.predict(x_pred, verbose=0)[0]
        next_index = sample(preds)
        next_char = i_to_c[next_index]
        
        sentence = sentence[1:] + next_char
        
        sys.stdout.write(next_char)
        sys.stdout.flush()
    print()


print_callback = LambdaCallback(on_epoch_end=on_epoch_end)

In [9]:
import random, sys

model.fit(X, y, batch_size=128, epochs=3, callbacks=[print_callback],)

Epoch 1/3

KeyboardInterrupt: 

In [10]:

res = urlopen("https://www.gutenberg.org/files/100/100-0.txt")
if res.status == 200:
    text = res.read().decode("utf-8")
    
section_titles = """

               THE SONNETS

               ALL’S WELL THAT ENDS WELL

               THE TRAGEDY OF ANTONY AND CLEOPATRA

               AS YOU LIKE IT

               THE COMEDY OF ERRORS

               THE TRAGEDY OF CORIOLANUS

               CYMBELINE

               THE TRAGEDY OF HAMLET, PRINCE OF DENMARK

               THE FIRST PART OF KING HENRY THE FOURTH

               THE SECOND PART OF KING HENRY THE FOURTH

               THE LIFE OF KING HENRY THE FIFTH

               THE FIRST PART OF HENRY THE SIXTH

               THE SECOND PART OF KING HENRY THE SIXTH

               THE THIRD PART OF KING HENRY THE SIXTH

               KING HENRY THE EIGHTH

               KING JOHN

               THE TRAGEDY OF JULIUS CAESAR

               THE TRAGEDY OF KING LEAR

               LOVE’S LABOUR’S LOST

               THE TRAGEDY OF MACBETH

               MEASURE FOR MEASURE

               THE MERCHANT OF VENICE

               THE MERRY WIVES OF WINDSOR

               A MIDSUMMER NIGHT’S DREAM

               MUCH ADO ABOUT NOTHING

               THE TRAGEDY OF OTHELLO, MOOR OF VENICE

               PERICLES, PRINCE OF TYRE

               KING RICHARD THE SECOND

               KING RICHARD THE THIRD

               THE TRAGEDY OF ROMEO AND JULIET

               THE TAMING OF THE SHREW

               THE TEMPEST

               THE LIFE OF TIMON OF ATHENS

               THE TRAGEDY OF TITUS ANDRONICUS

               THE HISTORY OF TROILUS AND CRESSIDA

               TWELFTH NIGHT; OR, WHAT YOU WILL

               THE TWO GENTLEMEN OF VERONA

               THE TWO NOBLE KINSMEN

               THE WINTER’S TALE

               A LOVER’S COMPLAINT

               THE PASSIONATE PILGRIM

               THE PHOENIX AND THE TURTLE

               THE RAPE OF LUCRECE

               VENUS AND ADONIS
"""

titles = [title.strip() for title in section_titles.split("\n") if title]

In [11]:
titles = [
    r"(((\n)|(\r)){3,}" + title + "((\n)|(\r)){3,})"
    for title in titles
]

titles_re = re.compile(
    "|".join(titles)
)

res = titles_re.split(text)

In [12]:
import re

titles_re = re.compile(
    "|".join(titles)
)

In [13]:
res = titles_re.split(text)

In [14]:
filtered = [x for x in res if x][1:]
thresh = 10
while len(filtered) > 44:
    filtered = [x for x in filtered if x and len(x) > thresh]
    thresh += 1

In [15]:
title_to_contents = {
    title: contents
    for title, contents in zip(titles, filtered)
}

In [16]:
title_to_contents2 = {}
for k, v in title_to_contents.items():
    
    title_to_contents2[k[16:-14]] = v
    print(k[16:-14], v.strip()[:100])
    
    print()
    print()
    

THE SONNETS 1

From fairest creatures we desire increase,
That thereby beauty’s rose might never die,
But as


ALL’S WELL THAT ENDS WELL Contents

ACT I
Scene I. Rossillon. A room in the Countess’s palace.
Scene II. Paris. A room in 


THE TRAGEDY OF ANTONY AND CLEOPATRA DRAMATIS PERSONAE

  MARK ANTONY,         Triumvirs
  OCTAVIUS CAESAR,         "
  M. AEMILIUS L


AS YOU LIKE IT DRAMATIS PERSONAE.

  DUKE, living in exile
  FREDERICK, his brother, and usurper of his dominion


THE COMEDY OF ERRORS Contents

ACT I
Scene I. A hall in the Duke’s palace.
Scene II. A public place.


ACT II
Sce


THE TRAGEDY OF CORIOLANUS Dramatis Personae

  CAIUS MARCIUS, afterwards CAIUS MARCIUS CORIOLANUS

    Generals against th


CYMBELINE Contents

ACT I
Scene I. Britain. The garden of Cymbeline’s palace.
Scene II. The same.
Scene I


THE TRAGEDY OF HAMLET, PRINCE OF DENMARK THE TRAGEDY OF HAMLET, PRINCE OF DENMARK


THE FIRST PART OF KING HENRY THE FOURTH Contents

ACT I
Scene I. Elsinore. A plat

In [17]:
max_features = 20000

model = Sequential()
model.add(LSTM(len(chars), input_shape=(maxlen, len(chars))))
model.add(Dense(len(chars), activation="softmax"))

model.compile(loss="categorical_crossentropy", optimizer="adam")

model.summary()


Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_1 (LSTM)                (None, 108)               93744     
_________________________________________________________________
dense_1 (Dense)              (None, 108)               11772     
Total params: 105,516
Trainable params: 105,516
Non-trainable params: 0
_________________________________________________________________


In [None]:
sonnets = title_to_contents2["THE SONNETS"]

X, y= preprocess(sonnets)
X[0]

array([[False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       ...,
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False]])

In [None]:
model.fit(X, y, batch_size=128, epochs=10, callbacks=[print_callback],)

In [7]:
import random, sys

on_epoch_end(3, "_")


----- Generating text after Epoch: 3
----- Generating with seed: "thee, fell Clifford, and thee,"
thee, fell Clifford, and thee,

NameError: name 'model' is not defined

In [11]:
from bs4 import BeautifulSoup
import requests

res = requests.get("https://www.azlyrics.com/d/dropkickmurphys.html")
res.status_code

200

In [12]:
soup = BeautifulSoup(res.text)

urls = []

for item in soup.find_all("div", "listalbum-item"):
    urls.append(item.find("a")["href"])

urls = ["https://www.azlyrics.com/" + url[3:] for url in urls]
urls[:3]

['https://www.azlyrics.com/lyrics/dropkickmurphys/boysonthedocks.html',
 'https://www.azlyrics.com/lyrics/dropkickmurphys/neveralone151150.html',
 'https://www.azlyrics.com/lyrics/dropkickmurphys/inthestreetsofboston.html']

In [13]:
from time import sleep

lyrics = []

for url in urls:
    res = requests.get(url)
    if res.status_code != 200:
        print(f"Failed to get url: {url}")
        continue
    soup = BeautifulSoup(res.text)
    lyrics.append(" ".join(soup.find("div", "main-page").stripped_strings))
    print(url)
    print(lyrics[-1])
    
    sleep(10)

https://www.azlyrics.com/lyrics/dropkickmurphys/boysonthedocks.html
"Boys On The Docks" lyrics Dropkick Murphys Lyrics "Boys On The Docks" (dedicated to the memory of John Kelly) Say hey Johnny boy, the battle call. United we stand, divided we fall. Together we are what we can't be alone, We came to this country, you made it our home. This man so humble, this man so brave. A legend to many, he fought to his grave. Saved family and friends from the hardship and horror, in a land of depression he gave hope for tomorrow. Say Johnny me boy, this ones for you. With the strength of many and the courage of few. To what do we owe this man who's fight was for the masses, he gave his life. Say hey Johnny boy, the battle call United we stand, divided we fall. Together we are what we can't be alone, We came to this country, you made it our home. A friend to the locals who dabbled in crime. He'd give you a job and he'd give you his time. He wasn't a crook, but he couldn't be conned. John knew the d

In [15]:
with open("dkm.txt", "w") as f:
    f.writelines(lyrics)

In [19]:
max_features = 20000

model = Sequential()
model.add(LSTM(len(chars), input_shape=(maxlen, len(chars))))
model.add(Dense(len(chars), activation="softmax"))

model.compile(loss="categorical_crossentropy", optimizer="adam")

model.summary()


Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_1 (LSTM)                (None, 108)               93744     
_________________________________________________________________
dense_1 (Dense)              (None, 108)               11772     
Total params: 105,516
Trainable params: 105,516
Non-trainable params: 0
_________________________________________________________________


In [20]:
lyrics[:2]

['"Boys On The Docks" lyrics Dropkick Murphys Lyrics "Boys On The Docks" (dedicated to the memory of John Kelly) Say hey Johnny boy, the battle call. United we stand, divided we fall. Together we are what we can\'t be alone, We came to this country, you made it our home. This man so humble, this man so brave. A legend to many, he fought to his grave. Saved family and friends from the hardship and horror, in a land of depression he gave hope for tomorrow. Say Johnny me boy, this ones for you. With the strength of many and the courage of few. To what do we owe this man who\'s fight was for the masses, he gave his life. Say hey Johnny boy, the battle call United we stand, divided we fall. Together we are what we can\'t be alone, We came to this country, you made it our home. A friend to the locals who dabbled in crime. He\'d give you a job and he\'d give you his time. He wasn\'t a crook, but he couldn\'t be conned. John knew the difference between right and wrong. Say Johnny me boy, you l

In [21]:
X, y = preprocess(" ".join(lyrics))
X[0]

array([[False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       ...,
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False]])

In [22]:
import random


In [None]:

model.fit(X, y, batch_size=128, epochs=10, callbacks=[print_callback],)

Epoch 1/10
  689/16921 [>.............................] - ETA: 6:10 - loss: 2.9489

# Resources and Stretch Goals

## Stretch goals:
- Refine the training and generation of text to be able to ask for different genres/styles of Shakespearean text (e.g. plays versus sonnets)

- Train a classification model that takes text and returns which work of Shakespeare it is most likely to be from
- Make it more performant! Many possible routes here - lean on Keras, optimize the code, and/or use more resources (AWS, etc.)
- Revisit the news example from class, and improve it - use categories or tags to refine the model/generation, or train a news classifier
- Run on bigger, better data

## Resources:
- [The Unreasonable Effectiveness of Recurrent Neural Networks](https://karpathy.github.io/2015/05/21/rnn-effectiveness/) - a seminal writeup demonstrating a simple but effective character-level NLP RNN
- [Simple NumPy implementation of RNN](https://github.com/JY-Yoon/RNN-Implementation-using-NumPy/blob/master/RNN%20Implementation%20using%20NumPy.ipynb) - Python 3 version of the code from "Unreasonable Effectiveness"
- [TensorFlow RNN Tutorial](https://github.com/tensorflow/models/tree/master/tutorials/rnn) - code for training a RNN on the Penn Tree Bank language dataset
- [4 part tutorial on RNN](http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/) - relates RNN to the vanishing gradient problem, and provides example implementation
- [RNN training tips and tricks](https://github.com/karpathy/char-rnn#tips-and-tricks) - some rules of thumb for parameterizing and training your RNN