<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>
<br></br>

## *Data Science Unit 4 Sprint 3 Assignment 1*

# Recurrent Neural Networks and Long Short Term Memory (LSTM)

![Monkey at a typewriter](https://upload.wikimedia.org/wikipedia/commons/thumb/3/3c/Chimpanzee_seated_at_typewriter.jpg/603px-Chimpanzee_seated_at_typewriter.jpg)

It is said that [infinite monkeys typing for an infinite amount of time](https://en.wikipedia.org/wiki/Infinite_monkey_theorem) will eventually type, among other things, the complete works of William Shakespeare. Let's see if we can get there a bit faster, with the power of Recurrent Neural Networks and LSTM.

This text file contains the complete works of Shakespeare: https://www.gutenberg.org/files/100/100-0.txt

Use it as training data for an RNN. You can keep it simple and train at the character level, which is the suggested initial approach.

Then, use that trained RNN to generate Shakespearean-ish text. Your goal is a function that takes, as an argument, the size of the text to generate (e.g. number of characters or lines) and returns generated text of that size.

Note - Shakespeare wrote an awful lot. It's OK, especially initially, to sample/use smaller data and parameters, so you can have a tighter feedback loop when you're trying to get things running. Then, once you've got a proof of concept - start pushing it more!

In [14]:
# Imports
import numpy as np
import random
import sys
import urllib.request

from tensorflow.keras.callbacks import LambdaCallback
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM

In [2]:
# Read in data
full_text = []
url = "https://www.gutenberg.org/files/100/100-0.txt"
for line in urllib.request.urlopen(url):
    full_text.append(line.decode('utf-8'))

In [5]:
# Slice off the leading and trailing non-Shakespeare lines, then strip whitespace from each line
cleaned_full_text = []
for line in full_text[139:-430]:
    cleaned_full_text.append(line.strip())

In [6]:

# Create titles data
titles = []
for line in full_text[44:129:2]:
    titles.append(line.strip())

In [7]:
# Separate out sonnets 
sonnets = []
for line in cleaned_full_text[:2767]:
  sonnets.append(line)

In [8]:

# Use titles list to create first draft of model

# Gather all text
text = " ".join(titles)

# Create unique character list (sorted so the index mapping is the same on every run)
chars = sorted(set(text))

# Create character lookup tables
char_int = {c:i for i, c in enumerate(chars)}
int_char = {i:c for i, c in enumerate(chars)}

len(chars)

28

In [10]:
# Find the average title length in characters
lengths = [len(title) for title in titles]
sum(lengths) / len(lengths)

25.372093023255815

In [11]:
# Create sequence data 
maxlen = 25
step = 5

# Create encoded data
encoded = [char_int[c] for c in text]

# Create empty sequence & next character lists
sequences = []
next_char = []

# fill empty lists
for i in range(0, len(encoded) - maxlen, step):
  sequences.append(encoded[i : i + maxlen])
  next_char.append(encoded[i + maxlen])

print('Sequence Qty: ', len(sequences))

Sequence Qty:  222
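
To make the windowing concrete, here is a small sketch on a toy string. The string and the `make_windows` helper name are illustrative assumptions, not part of the notebook. Each window of `maxlen` characters is paired with the single character that follows it, and `step` controls how far the window slides between samples.

```python
def make_windows(text, maxlen, step):
    """Pair each maxlen-character window with the character that follows it."""
    windows, next_chars = [], []
    # Same range/slicing pattern as the sequence cell above
    for i in range(0, len(text) - maxlen, step):
        windows.append(text[i:i + maxlen])
        next_chars.append(text[i + maxlen])
    return windows, next_chars

# Toy input (assumption): short enough to inspect every window by eye
windows, next_chars = make_windows("HAMLET AND OTHELLO", maxlen=6, step=5)
for window, target in zip(windows, next_chars):
    print(repr(window), "->", repr(target))
```

A smaller `step` yields more, heavily overlapping training samples; a larger `step` yields fewer samples and faster epochs.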


In [12]:
# One hot encode data to prepare for model
# (np.bool was deprecated in NumPy 1.20 and removed in 1.24; the builtin bool works everywhere)
x = np.zeros((len(sequences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sequences), len(chars)), dtype=bool)

for i, sequence in enumerate(sequences):
  for t, char in enumerate(sequence):
    x[i, t, char] = 1
  y[i, next_char[i]] = 1
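
The nested loops above can also be written as a single indexing step. The sketch below is one possible alternative, using toy stand-ins (`toy_sequences`, `toy_next`, and `n_chars` are assumptions) for the notebook's `sequences`, `next_char`, and `len(chars)`. It relies on the fact that row `k` of an identity matrix is the one-hot vector for index `k`.

```python
import numpy as np

# Toy stand-ins (assumptions) for the notebook's encoded data
toy_sequences = [[0, 2, 1], [2, 1, 3]]   # two sequences, maxlen = 3
toy_next = [3, 0]                        # next character index for each sequence
n_chars = 4                              # vocabulary size

identity = np.eye(n_chars, dtype=bool)         # row k is the one-hot vector for index k
x_toy = identity[np.asarray(toy_sequences)]    # shape (n_sequences, maxlen, n_chars)
y_toy = identity[np.asarray(toy_next)]         # shape (n_sequences, n_chars)

print(x_toy.shape, y_toy.shape)
```

For data the size of the sonnets either approach is fast enough; the indexing form mainly pays off on the full corpus.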

In [15]:
# Build LSTM model
model = Sequential()
model.add(LSTM(128, input_shape=(maxlen, len(chars))))
model.add(Dense(len(chars), activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam')

model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm (LSTM)                  (None, 128)               80384     
_________________________________________________________________
dense (Dense)                (None, 28)                3612      
Total params: 83,996
Trainable params: 83,996
Non-trainable params: 0
_________________________________________________________________
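
The parameter counts in the summary can be checked by hand. An LSTM layer has four weight sets (three gates plus the cell candidate), each with input weights, recurrent weights, and a bias vector. A minimal sketch of that arithmetic follows; the helper names `lstm_params` and `dense_params` are illustrative, not Keras functions.

```python
def lstm_params(units, input_dim):
    # 4 weight sets x (input weights + recurrent weights + bias)
    return 4 * (units * (input_dim + units) + units)

def dense_params(inputs, outputs):
    # One weight per input/output pair, plus one bias per output
    return inputs * outputs + outputs

# 128 LSTM units fed one-hot vectors over a 28-character vocabulary
print(lstm_params(128, 28))   # 80384, matching the lstm row
print(dense_params(128, 28))  # 3612, matching the dense row
```

Note that `maxlen` does not appear in either formula: the same weights are reused at every timestep, so a longer window changes training cost but not model size.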


In [16]:
# Create functions to make use of model outputs
def indexer(preds):
  # Helper function to sample an index from the predicted probability distribution
  preds = np.asarray(preds).astype('float64')
  preds = np.log(preds) / 1  # the divisor acts as a sampling temperature (1 leaves preds unchanged)
  exp_preds = np.exp(preds)
  preds = exp_preds / np.sum(exp_preds)
  probas = np.random.multinomial(1, preds, 1)
  return np.argmax(probas)

def on_epoch_end(epoch, _):
  # Function to generate predicted text at each epoch
  print()
  print('---- Generating text after Epoch: %d' % epoch)

  start_index = random.randint(0, len(text) - maxlen - 1)

  generated = ''

  sentence = text[start_index: start_index + maxlen]

  generated += sentence

  print('---- Generating with seed: "' + sentence + '"')
  # sys.stdout.write(generated)

  for i in range(25):
    x_pred = np.zeros((1, maxlen, len(chars)))
    for t, char in enumerate(sentence):
      x_pred[0, t, char_int[char]] = 1

    preds = model.predict(x_pred, verbose=0)[0]
    next_index = indexer(preds)
    next_char = int_char[next_index]

    sentence = sentence[1:] + next_char

    sys.stdout.write(next_char)
    sys.stdout.flush()
  print()

print_callback = LambdaCallback(on_epoch_end=on_epoch_end)
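
The `/ 1` in `indexer` divides the log-probabilities by a temperature of 1, which leaves the distribution unchanged. Below is a hedged sketch of a variant that exposes that divisor as a parameter. The name `sample_with_temperature`, the `rng` argument, and the small epsilon guarding against `log(0)` are assumptions added for illustration, not part of the notebook.

```python
import numpy as np

def sample_with_temperature(preds, temperature=1.0, rng=None):
    """Sample an index from `preds` after rescaling by `temperature`."""
    rng = np.random.default_rng() if rng is None else rng
    preds = np.asarray(preds, dtype="float64")
    # Epsilon (assumption) avoids log(0) when a probability underflows to zero
    logits = np.log(preds + 1e-12) / temperature
    # Subtracting the max keeps exp() numerically stable without changing the result
    exp_logits = np.exp(logits - logits.max())
    probs = exp_logits / exp_logits.sum()
    return int(rng.choice(len(probs), p=probs))

toy_preds = [0.1, 0.7, 0.2]  # toy distribution (assumption)
print(sample_with_temperature(toy_preds, temperature=0.5))
```

Temperatures below 1 sharpen the distribution toward the most likely character (more conservative text), while temperatures above 1 flatten it (more varied, more error-prone text).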

In [17]:
# fit the model
model.fit(x, y,
          batch_size=32,
          epochs=100,
          callbacks=[print_callback])

Train on 222 samples
Epoch 1/100
---- Generating text after Epoch: 0
---- Generating with seed: "OMEO AND JULIET THE TAMIN"
VDDFFLCUETYNKE;JDLXJDH,KD
Epoch 2/100
---- Generating text after Epoch: 1
---- Generating with seed: "MEDY OF ERRORS THE TRAGED"
RTOEEOEOX PTDFN ; ;OHLONE
Epoch 3/100
---- Generating text after Epoch: 2
---- Generating with seed: "IRST PART OF KING HENRY T"
 DHSOOEREOKHH NNGHHDROIEE
Epoch 4/100
---- Generating text after Epoch: 3
---- Generating with seed: "N OF ATHENS THE TRAGEDY O"
F PNEO HNNIE OU PT F FTHO
Epoch 5/100
---- Generating text after Epoch: 4
---- Generating with seed: "F CORIOLANUS CYMBELINE TH"
MGOIH TIOITRTGWKH OEHWTYU
Epoch 6/100
---- Generating text after Epoch: 5
---- Generating with seed: "LL THAT ENDS WELL THE TRA"
EOE T H EO GEIRUAHRUHEAIO
Epoch 7/100
---- Generating text after Epoch: 6
---- Generating with seed: "ETH MEASURE FOR MEASURE T"
SFRSGTFHI AH  NHSHFEMPIEH
Epoch 8/100
---- Generating text after Epoch: 7
---- Generating with seed: "

<tensorflow.python.keras.callbacks.History at 0x133cde477c8>

In [18]:
# Try with sonnets

# Gather all text
text = " ".join(sonnets)

# Create unique character list (sorted so the index mapping is the same on every run)
chars = sorted(set(text))

# Create character lookup tables
char_int = {c:i for i, c in enumerate(chars)}
int_char = {i:c for i, c in enumerate(chars)}

len(chars)

71

In [19]:
# Create sequence data 
maxlen = 40
step = 5

# Create encoded data
encoded = [char_int[c] for c in text]

# Create empty sequence & next character lists
sequences = []
next_char = []

# fill empty lists
for i in range(0, len(encoded) - maxlen, step):
  sequences.append(encoded[i : i + maxlen])
  next_char.append(encoded[i + maxlen])

print('Sequence Qty: ', len(sequences))

Sequence Qty:  18918


In [20]:
# One hot encode data to prepare for model
# (np.bool was deprecated in NumPy 1.20 and removed in 1.24; the builtin bool works everywhere)
x = np.zeros((len(sequences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sequences), len(chars)), dtype=bool)

for i, sequence in enumerate(sequences):
  for t, char in enumerate(sequence):
    x[i, t, char] = 1
  y[i, next_char[i]] = 1

In [21]:
# Build LSTM model
model = Sequential()
model.add(LSTM(128, input_shape=(maxlen, len(chars))))
model.add(Dense(len(chars), activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam')

model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_1 (LSTM)                (None, 128)               102400    
_________________________________________________________________
dense_1 (Dense)              (None, 71)                9159      
Total params: 111,559
Trainable params: 111,559
Non-trainable params: 0
_________________________________________________________________


In [22]:
# Create functions to make use of model outputs
def indexer(preds):
  # Helper function to sample an index from the predicted probability distribution
  preds = np.asarray(preds).astype('float64')
  preds = np.log(preds) / 1  # the divisor acts as a sampling temperature (1 leaves preds unchanged)
  exp_preds = np.exp(preds)
  preds = exp_preds / np.sum(exp_preds)
  probas = np.random.multinomial(1, preds, 1)
  return np.argmax(probas)

def on_epoch_end(epoch, _):
  # Function to generate predicted text at each epoch
  print()
  print('---- Generating text after Epoch: %d' % epoch)

  start_index = random.randint(0, len(text) - maxlen - 1)

  generated = ''

  sentence = text[start_index: start_index + maxlen]

  generated += sentence

  print('---- Generating with seed: "' + sentence + '"')
  sys.stdout.write("\n")
  
  print('                 %d' % epoch)
  sys.stdout.write(generated)
  sys.stdout.write("\n")

  for line_num in range(13):    # generate 13 lines of output
    for _ in range(maxlen):     # each line is maxlen characters long
      x_pred = np.zeros((1, maxlen, len(chars)))
      for t, char in enumerate(sentence):
        x_pred[0, t, char_int[char]] = 1

      preds = model.predict(x_pred, verbose=0)[0]
      next_index = indexer(preds)
      next_char = int_char[next_index]

      sentence = sentence[1:] + next_char

      sys.stdout.write(next_char)
      sys.stdout.flush()
    sys.stdout.write("\n")
  print()

print_callback = LambdaCallback(on_epoch_end=on_epoch_end)

In [23]:
# Fit model
model.fit(x, y,
          batch_size=32,
          epochs=100,
          callbacks=[print_callback])

Train on 18918 samples
Epoch 1/100
---- Generating text after Epoch: 0
---- Generating with seed: "the sweet smell Of different flowers in "

                 0
the sweet smell Of different flowers in 
kyoe tonovet sa nut lhu nh non e a de t 
veFc Iooeashf  wia  sI homet rhn newr yk
rtjNe lglkoe, Bthtpruu ot, mead roms soc
 n s sit gt,sat u.v  h oyn borh thet sin
 nthosar  on ne  oceiklond tieo fe itkrt
 bet gr hf ooy me th nh :t Ta a t, eand 
sorr oo  phe  ot  u d, ,v utk to  no e, 
9ledr, ahhudkced tht thc ahrohevt,wcu re
 th  sath uorin   ioetmet  h w tae rha o
osrin tHes muoa, thet murs to o rsere wy
  om tiiathumesiss pfS fe, yDtsfu ghy sh
ito ,hend ponyi n: aov ler   g heehs ote
r sthne agt2sSrf f.  itmlsdi aYh shn b.,

Epoch 2/100
---- Generating text after Epoch: 1
---- Generating with seed: "do allow, For beauty’s pattern to succee"

                 1
do allow, For beauty’s pattern to succee
 nhef feTum) I whoerf tehe, mr rhof wan 
thoriwd Anat whrurin A daus aFthethth sh
oTo

<tensorflow.python.keras.callbacks.History at 0x133d73ded88>
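
The assignment asks for a function that takes a size and returns generated text of that size, whereas the callbacks above only print samples during training. One possible shape for that function is sketched below. The name `generate_text`, its signature, and the `predict_fn` callable are assumptions; a uniform stand-in predictor is used so the sketch runs without a trained model.

```python
import numpy as np

def generate_text(predict_fn, seed, n_chars, char_int, int_char, rng=None):
    """Return `n_chars` generated characters that continue `seed`.

    `predict_fn` (hypothetical) maps a one-hot array of shape
    (1, len(seed), vocab_size) to a probability vector of length vocab_size.
    """
    rng = np.random.default_rng() if rng is None else rng
    vocab_size = len(char_int)
    window, out = seed, []
    for _ in range(n_chars):
        # One-hot encode the current window, as in on_epoch_end
        x_pred = np.zeros((1, len(window), vocab_size))
        for t, ch in enumerate(window):
            x_pred[0, t, char_int[ch]] = 1
        probs = np.asarray(predict_fn(x_pred), dtype="float64")
        probs = probs / probs.sum()  # renormalize against float rounding
        next_ch = int_char[int(rng.choice(vocab_size, p=probs))]
        out.append(next_ch)
        window = window[1:] + next_ch  # slide the window forward by one character
    return "".join(out)

# Stand-in vocabulary and predictor (assumptions) so the sketch runs on its own
toy_chars = sorted(set("to be or not to be"))
toy_char_int = {c: i for i, c in enumerate(toy_chars)}
toy_int_char = {i: c for i, c in enumerate(toy_chars)}

def uniform_predict(x_pred):
    return np.full(len(toy_chars), 1.0 / len(toy_chars))

sample = generate_text(uniform_predict, "to be", 20, toy_char_int, toy_int_char)
print(len(sample))
```

With the trained model from this notebook, `predict_fn` could be `lambda x_pred: model.predict(x_pred, verbose=0)[0]`, with a seed of exactly `maxlen` characters drawn from `text`.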

# Resources and Stretch Goals

## Stretch goals:
- Refine the training and generation of text to be able to ask for different genres/styles of Shakespearean text (e.g. plays versus sonnets)
- Train a classification model that takes text and returns which work of Shakespeare it is most likely to be from
- Make it more performant! Many possible routes here - lean on Keras, optimize the code, and/or use more resources (AWS, etc.)
- Revisit the news example from class, and improve it - use categories or tags to refine the model/generation, or train a news classifier
- Run on bigger, better data

## Resources:
- [The Unreasonable Effectiveness of Recurrent Neural Networks](https://karpathy.github.io/2015/05/21/rnn-effectiveness/) - a seminal writeup demonstrating a simple but effective character-level NLP RNN
- [Simple NumPy implementation of RNN](https://github.com/JY-Yoon/RNN-Implementation-using-NumPy/blob/master/RNN%20Implementation%20using%20NumPy.ipynb) - Python 3 version of the code from "Unreasonable Effectiveness"
- [TensorFlow RNN Tutorial](https://github.com/tensorflow/models/tree/master/tutorials/rnn) - code for training an RNN on the Penn Tree Bank language dataset
- [4 part tutorial on RNN](http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/) - relates RNN to the vanishing gradient problem, and provides example implementation
- [RNN training tips and tricks](https://github.com/karpathy/char-rnn#tips-and-tricks) - some rules of thumb for parameterizing and training your RNN