<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>
<br></br>

## *Data Science Unit 4 Sprint 3 Assignment 1*

# Recurrent Neural Networks and Long Short Term Memory (LSTM)

![Monkey at a typewriter](https://upload.wikimedia.org/wikipedia/commons/thumb/3/3c/Chimpanzee_seated_at_typewriter.jpg/603px-Chimpanzee_seated_at_typewriter.jpg)

It is said that [infinite monkeys typing for an infinite amount of time](https://en.wikipedia.org/wiki/Infinite_monkey_theorem) will eventually type, among other things, the complete works of Wiliam Shakespeare. Let's see if we can get there a bit faster, with the power of Recurrent Neural Networks and LSTM.

This text file contains the complete works of Shakespeare: https://www.gutenberg.org/files/100/100-0.txt

Use it as training data for an RNN - you can keep it simple and train character level, and that is suggested as an initial approach.

Then, use that trained RNN to generate Shakespearean-ish text. Your goal - a function that can take, as an argument, the size of text (e.g. number of characters or lines) to generate, and returns generated text of that size.

Note - Shakespeare wrote an awful lot. It's OK, especially initially, to sample/use smaller data and parameters, so you can have a tighter feedback loop when you're trying to get things running. Then, once you've got a proof of concept - start pushing it more!

In [0]:
from tensorflow.keras.callbacks import LambdaCallback, EarlyStopping, TensorBoard
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM

import numpy as np
import random
import sys
import requests
import os
import datetime

In [0]:
url = "https://www.gutenberg.org/files/100/100-0.txt"

r = requests.get(url)

In [3]:
type(r)

requests.models.Response

In [0]:
r.encoding = r.apparent_encoding

data = r.text

In [5]:
data[:100]

'\r\nProject Gutenberg’s The Complete Works of William Shakespeare, by William\r\nShakespeare\r\n\r\nThis eBo'

In [0]:
data = data.split('\r\n')

In [7]:
type(data)

list

In [8]:
len(data)

166903

In [9]:
data[44:48]

['               ALL’S WELL THAT ENDS WELL',
 '',
 '               THE TRAGEDY OF ANTONY AND CLEOPATRA',
 '']

In [0]:
# Skip the Table of Contents
theData = data[135:]

In [11]:
# Looking at what the data looks like
theData[:10]

['THE SONNETS',
 '',
 '                    1',
 '',
 'From fairest creatures we desire increase,',
 'That thereby beauty’s rose might never die,',
 'But as the riper should by time decease,',
 'His tender heir might bear his memory:',
 'But thou contracted to thine own bright eyes,',
 'Feed’st thy light’s flame with self-substantial fuel,']

In [0]:
# Had to change this to add the sonnets # made the slice object to inlcude the sonnets
toc = [l.strip() for l in data[42:130:2]]

In [13]:
toc

['THE SONNETS',
 'ALL’S WELL THAT ENDS WELL',
 'THE TRAGEDY OF ANTONY AND CLEOPATRA',
 'AS YOU LIKE IT',
 'THE COMEDY OF ERRORS',
 'THE TRAGEDY OF CORIOLANUS',
 'CYMBELINE',
 'THE TRAGEDY OF HAMLET, PRINCE OF DENMARK',
 'THE FIRST PART OF KING HENRY THE FOURTH',
 'THE SECOND PART OF KING HENRY THE FOURTH',
 'THE LIFE OF KING HENRY THE FIFTH',
 'THE FIRST PART OF HENRY THE SIXTH',
 'THE SECOND PART OF KING HENRY THE SIXTH',
 'THE THIRD PART OF KING HENRY THE SIXTH',
 'KING HENRY THE EIGHTH',
 'KING JOHN',
 'THE TRAGEDY OF JULIUS CAESAR',
 'THE TRAGEDY OF KING LEAR',
 'LOVE’S LABOUR’S LOST',
 'THE TRAGEDY OF MACBETH',
 'MEASURE FOR MEASURE',
 'THE MERCHANT OF VENICE',
 'THE MERRY WIVES OF WINDSOR',
 'A MIDSUMMER NIGHT’S DREAM',
 'MUCH ADO ABOUT NOTHING',
 'THE TRAGEDY OF OTHELLO, MOOR OF VENICE',
 'PERICLES, PRINCE OF TYRE',
 'KING RICHARD THE SECOND',
 'KING RICHARD THE THIRD',
 'THE TRAGEDY OF ROMEO AND JULIET',
 'THE TAMING OF THE SHREW',
 'THE TEMPEST',
 'THE LIFE OF TIMON OF ATHENS'

In [14]:
# I am changing the correct title for the life of king henry in the toc
toc[10]= "THE LIFE OF KING HENRY V"
toc[10]

'THE LIFE OF KING HENRY V'

In [15]:
len(toc)

44

In [0]:
locations = {id_:{'title':title, 'start':-99} for id_,title in enumerate(toc)}

In [17]:
locations

{0: {'start': -99, 'title': 'THE SONNETS'},
 1: {'start': -99, 'title': 'ALL’S WELL THAT ENDS WELL'},
 2: {'start': -99, 'title': 'THE TRAGEDY OF ANTONY AND CLEOPATRA'},
 3: {'start': -99, 'title': 'AS YOU LIKE IT'},
 4: {'start': -99, 'title': 'THE COMEDY OF ERRORS'},
 5: {'start': -99, 'title': 'THE TRAGEDY OF CORIOLANUS'},
 6: {'start': -99, 'title': 'CYMBELINE'},
 7: {'start': -99, 'title': 'THE TRAGEDY OF HAMLET, PRINCE OF DENMARK'},
 8: {'start': -99, 'title': 'THE FIRST PART OF KING HENRY THE FOURTH'},
 9: {'start': -99, 'title': 'THE SECOND PART OF KING HENRY THE FOURTH'},
 10: {'start': -99, 'title': 'THE LIFE OF KING HENRY V'},
 11: {'start': -99, 'title': 'THE FIRST PART OF HENRY THE SIXTH'},
 12: {'start': -99, 'title': 'THE SECOND PART OF KING HENRY THE SIXTH'},
 13: {'start': -99, 'title': 'THE THIRD PART OF KING HENRY THE SIXTH'},
 14: {'start': -99, 'title': 'KING HENRY THE EIGHTH'},
 15: {'start': -99, 'title': 'KING JOHN'},
 16: {'start': -99, 'title': 'THE TRAGEDY OF

In [0]:
# Start 
for e,i in enumerate(theData):
    for t,title in enumerate(toc):
        if title in i:
            locations[t].update({'start':e})
            
# End            
for title in toc:
    
    t = 0
    
    while t < (len(toc)-1):
       # print(t) taking the printing out so that I don't get large output
        end = (locations[t+1]['start']) - 1
        locations[t]['end'] = end
        t += 1

    # Last One
    locations[t]['end'] = len(theData)

In [19]:
locations[9]

{'end': 42636,
 'start': 39228,
 'title': 'THE SECOND PART OF KING HENRY THE FOURTH'}

In [20]:
# checking to see if the locations are correct
locations[10]

{'end': 47572, 'start': 42637, 'title': 'THE LIFE OF KING HENRY V'}

In [21]:
theData[47570:47575]

['', '', '', 'THE FIRST PART OF HENRY THE SIXTH', '']

In [22]:
locations

{0: {'end': 2776, 'start': 0, 'title': 'THE SONNETS'},
 1: {'end': 7738, 'start': 2777, 'title': 'ALL’S WELL THAT ENDS WELL'},
 2: {'end': 11840,
  'start': 7739,
  'title': 'THE TRAGEDY OF ANTONY AND CLEOPATRA'},
 3: {'end': 14631, 'start': 11841, 'title': 'AS YOU LIKE IT'},
 4: {'end': 17832, 'start': 14632, 'title': 'THE COMEDY OF ERRORS'},
 5: {'end': 27806, 'start': 17833, 'title': 'THE TRAGEDY OF CORIOLANUS'},
 6: {'end': 27824, 'start': 27807, 'title': 'CYMBELINE'},
 7: {'end': 34511,
  'start': 27825,
  'title': 'THE TRAGEDY OF HAMLET, PRINCE OF DENMARK'},
 8: {'end': 39227,
  'start': 34512,
  'title': 'THE FIRST PART OF KING HENRY THE FOURTH'},
 9: {'end': 42636,
  'start': 39228,
  'title': 'THE SECOND PART OF KING HENRY THE FOURTH'},
 10: {'end': 47572, 'start': 42637, 'title': 'THE LIFE OF KING HENRY V'},
 11: {'end': 50843,
  'start': 47573,
  'title': 'THE FIRST PART OF HENRY THE SIXTH'},
 12: {'end': 54331,
  'start': 50844,
  'title': 'THE SECOND PART OF KING HENRY THE

In [23]:
for e, i in enumerate(theData):
    
    if "ALL’S WELL THAT ENDS WELL" in i:
        print(e)

2777


In [25]:
theData[0]

'THE SONNETS'

In [26]:
# divide b/w plays and sonets
sonets = theData[:2776]
plays = theData[2777:]
print(len(sonets), len(plays))

2776 163991


In [27]:
theData[0]

'THE SONNETS'

In [0]:
def long_lines(lst_ln):
    clean = []
    
    for ln in lst_ln: 
        
        if len(ln) == 0:
            pass
        else:
            pct = len(ln.strip(' ')) / len(ln)

            if pct >= .5:
                clean.append(ln.lstrip())

    return clean

In [29]:
# May Not be Needed --- this will remove those lines where there is 
# mostly space
sonets_clean = long_lines(sonets)
plays_clean = long_lines(plays)
print(len(sonets), len(plays)) 
# This shows that nothing was really removed

2776 163991


In [30]:
# looking at what the sonets_clean and the plays_clean are:
type(sonets_clean), type(plays_clean)

(list, list)

In [0]:
# code to test the function below
theList = ["I love cows?]", "^~hello|í"]

In [0]:
# going to try to remove anything that is non ascii
def remove_non_ascii(textList):
  for i in range(len(textList)):
    new = textList[i].encode("ascii", "ignore")
    textList[i] = new.decode()
  return textList

In [65]:
# this is an example of using the function to remove the non ascii characters
theNewList = remove_non_ascii(theList)
theNewList

['I love cows?]', '^~hello|']

In [68]:
# will run the lists through the methods that will 
# remove the non ascii chars if there are any of them
sonets_no_ascii = remove_non_ascii(sonets)
plays_no_ascii = remove_non_ascii(plays)
print(len(sonets), len(sonets_no_ascii))
print(len(plays), len(plays_no_ascii))

2776 2776
163991 163991


In [0]:
# this means that there were no non ascii found in the
# documents

In [0]:
import string

In [0]:
# will try to clean the plays and the sonets some more to remove some of the 
# puctuation
practice_string = "I 2 will go!? , tommorow"

# doing some looping
table = practice_string.maketrans(" ", " ", string.punctuation)
new_string = practice_string.translate(table)
new_string

'I 2 will go  tommorow'

## Word Encoding

This is just a start, and is not complete yet. 

In [0]:
# building a list that will 
play_vocab = list(set("\r\n".join(plays).split()))
play_words = [line.split() for line in plays]

In [0]:
# doing this also for the sonnets
sonet_vocab = list(set("\r\n".join(sonets).split()))
sonet_words = [theLine.split() for theLine in sonets]

In [35]:
# now printing out the length of the vocab words for the sonnet
print(f"the length of the vocab for the sonets is {len(sonet_vocab)}")
print(f"the lenght of the list of lines in the sonet is: {len(sonet_words)}")

the length of the vocab for the sonets is 4727
the lenght of the list of lines in the sonet is: 2776


In [36]:
print(len(word_vocab), len(play_words))

75738 163991


In [37]:
words[0]

['ALL’S', 'WELL', 'THAT', 'ENDS', 'WELL']

In [38]:
!pip install ipdb


Collecting ipdb
  Downloading https://files.pythonhosted.org/packages/2c/bb/a3e1a441719ebd75c6dac8170d3ddba884b7ee8a5c0f9aefa7297386627a/ipdb-0.13.2.tar.gz
Building wheels for collected packages: ipdb
  Building wheel for ipdb (setup.py) ... [?25l[?25hdone
  Created wheel for ipdb: filename=ipdb-0.13.2-cp36-none-any.whl size=10522 sha256=742c9f4630e145f4274be3044082a085ab3e39bd1288e6dca712f00a73e17eea
  Stored in directory: /root/.cache/pip/wheels/60/c2/15/793365e3c9318c46ba914263740d90f1fe67f544b979141ce4
Successfully built ipdb
Installing collected packages: ipdb
Successfully installed ipdb-0.13.2


In [0]:
import ipdb

## Character Encoding

Using the technique shown in lecture. 

In [0]:
def make_char_int_mapping(theText):
  # this is to make each character map to an integer
  text = '\r\n'.join(theText)
  # This will make the string into a list of chars it makes it so that 
  # there is only one of each type of character
  chars = list(set(text))

  char_int = {c:i for i,c in enumerate(chars)}
  int_char = {i:c for i,c in enumerate(chars)}

  print(f"The length of the char to int dictionary is: {len(char_int)}")
  print(f"The number of char in the theText is: {len(chars)}")
  print(f"Our corpus contains {len(chars)} unique characters.")
  return char_int, int_char

In [79]:
# running the function
sonet_char_to_int, sonet_int_to_char = make_char_int_mapping(sonets)

The length of the char to int dictionary is: 71
The number of char in the theText is: 71
Our corpus contains 71 unique characters.


In [71]:
# running the same function for the plays
plays_char_to_int, plays_int_to_char = make_char_int_mapping(plays)

Our corpus contains 90 unique characters.


In [0]:
# Creating a function that will return the encoded
# text -- will now have 
def encode_text(theText, char_to_int_map):
  text = "\r\n".join(theText)

  encoded = [char_to_int_map[c] for c in text]
  # encoded is a list where each element is a char in the text 
  # in numerical form
  return encoded

In [48]:
# getting the encoded text for the sonets
sonet_encoded = encode_text(sonets, sonet_char_to_int)
print(type(sonet_encoded))
print(sonet_encoded[0:10])
print(f"the length of the sonet_encoded is: {len(sonet_encoded)}")

<class 'list'>
[18, 59, 43, 14, 27, 55, 15, 15, 43, 18]
the length of the sonet_encoded is: 101128


In [72]:
plays_encoded = encode_text(plays, plays_char_to_int)
print(type(plays_encoded))
print(len(plays_encoded))
plays_encoded[0:5]

<class 'list'>
5617664


[51, 9, 9, 46, 25]

In [0]:

# Create the Sequence Data
# This is the function that is trying to 
# make the data to use for the lstm into a series of
# lists where the "window slides down on each iteration"
def useSlidingWindow(encoded_text_list):
  maxlen = 150
  step = 1



  sequences = [] # Each element is 40 characters long -- I think it will be set to 
                # 150 char long
  next_chars = [] # One element for each sequence

  # we are using the stop of minus the maxLen so that we don't go 
  # out of bounds when making a sequence
  for i in range(0, len(encoded_text_list) - maxlen, step):
      sequences.append(encoded_text_list[i : i + maxlen])
      # next chars is the y target for each of the current sequences
      # only putting in one char per element in the next char list
      next_chars.append(encoded_text_list[i + maxlen])

  return sequences, next_chars
    
#print('sequences:', len(sequences))

In [0]:
# Creating the data for the plays and the sonnets
# the sonnet sequences are a list that has each element is a list of 150 long
# of the chars in the 
x_sonets_sequences, y_sonets_next_char = useSlidingWindow(sonet_encoded)
x_plays_sequences, y_sonets_next_char = useSlidingWindow(plays_encoded)


In [75]:
print(len(x_sonets_sequences), len(x_plays_sequences))

100978 5617514


In [0]:
x_sonet

In [0]:
import numpy as np

# Specify x & y 
# length char is the number of unique chars in the sequences
# building of the data in a 3 diminsional , all the zeros are False
x = np.zeros((len(sequences), maxlen, len(chars)), dtype=np.bool)
y = np.zeros((len(sequences), len(chars)), dtype=np.bool)
# each sequence element is 150 chars (integers) long
for i, sequence in enumerate(sequences):
    for t, char in enumerate(sequence):
      # this is filling out the numpy matrix with
      # 1 other than zero this makes that area a True.
        x[i,t,char] = 1
        
    y[i, next_chars[i]] = 1 # this will make this y be a true instead of false

In [0]:
x.shape

(100978, 150, 73)

In [0]:
# build the model: a single LSTM

model = Sequential()
model.add(LSTM(256, input_shape=(maxlen, len(chars)), dropout=0.2))
model.add(Dense(len(chars), activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='nadam')

In [0]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm (LSTM)                  (None, 1028)              4531424   
_________________________________________________________________
dense (Dense)                (None, 73)                75117     
Total params: 4,606,541
Trainable params: 4,606,541
Non-trainable params: 0
_________________________________________________________________


In [0]:
def sample(preds):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / 1
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)

In [0]:
def on_epoch_end(epoch, _):
    # Function invoked at end of each epoch. Prints generated text.
    
    print()
    print('----- Generating text after Epoch: %d' % epoch)
    
    start_index = random.randint(0, len(text) - maxlen - 1)
    
    generated = ''
    
    sentence = text[start_index: start_index + maxlen]
    generated += sentence
    
    print('----- Generating with seed: "' + sentence + '"')
    sys.stdout.write(generated)
    
    for i in range(400):
        x_pred = np.zeros((1, maxlen, len(chars)))
        for t, char in enumerate(sentence):
            x_pred[0, t, char_int[char]] = 1
            
        preds = model.predict(x_pred, verbose=0)[0]
        next_index = sample(preds)
        next_char = int_char[next_index]
        
        sentence = sentence[1:] + next_char
        
        sys.stdout.write(next_char)
        sys.stdout.flush()
    print()


print_callback = LambdaCallback(on_epoch_end=on_epoch_end)

In [0]:
logdir = os.path.join("logs", datetime.datetime.now().strftime("%Y%m%d-%H%M%S"))
tensorboard_callback = TensorBoard(logdir, histogram_freq=1)

model.fit(x, y,
          batch_size=1024,
          validation_split=.2,
          epochs=100,
          callbacks=[print_callback, 
                     #EarlyStopping(min_delta=.02, monitor='val_loss', patience=10),
                     tensorboard_callback])

Train on 80782 samples, validate on 20196 samples
Epoch 1/100
 2048/80782 [..............................] - ETA: 26:02 - loss: 4.0877
----- Generating text after Epoch: 0
----- Generating with seed: "Each changing place with that which goes before,
In sequent toil all forwards do contend.
Nativity once in the main of light,
Crawls to maturity, w"
Each changing place with that which goes before,
In sequent toil all forwards do contend.
Nativity once in the main of light,
Crawls to maturity, wcyoObdsGhahotnlrior soGdon
diirhma
s2flhEA srSosiiGnohr
 cuish
swhothuhgn
s8i5hplhsahbhv
!hr Phodhho ihrlhnzolhivuoiRr rhf3h
izws
olhaaonhyloAwisellollhhh,lnOt5htrzhylsTnh criYvl1ehyTsaagrwsFchDbriBo:isnhlnotAtOr.rooslatthh,o,S dsf 5ol ohbrlt?(rtronEhnilr5ghc,svrimash,b, ogBaalOxSgochhwthoif!Rn do5wqcgnn at3krwn.
sdssiil,olamrr Tlohg,r
yrOhwioe
wiwJtlfia3soofvadAytaytc p,mhiGo oyvf rtn5h
 2048/80782 [..............................] - ETA: 1:09:39 - loss: 4.0877

KeyboardInterrupt: 

In [0]:
%load_ext tensorboard

In [0]:
%tensorboard --logdir logs

# Resources and Stretch Goals

## Stretch goals:
- Refine the training and generation of text to be able to ask for different genres/styles of Shakespearean text (e.g. plays versus sonnets)
- Train a classification model that takes text and returns which work of Shakespeare it is most likely to be from
- Make it more performant! Many possible routes here - lean on Keras, optimize the code, and/or use more resources (AWS, etc.)
- Revisit the news example from class, and improve it - use categories or tags to refine the model/generation, or train a news classifier
- Run on bigger, better data

## Resources:
- [The Unreasonable Effectiveness of Recurrent Neural Networks](https://karpathy.github.io/2015/05/21/rnn-effectiveness/) - a seminal writeup demonstrating a simple but effective character-level NLP RNN
- [Simple NumPy implementation of RNN](https://github.com/JY-Yoon/RNN-Implementation-using-NumPy/blob/master/RNN%20Implementation%20using%20NumPy.ipynb) - Python 3 version of the code from "Unreasonable Effectiveness"
- [TensorFlow RNN Tutorial](https://github.com/tensorflow/models/tree/master/tutorials/rnn) - code for training a RNN on the Penn Tree Bank language dataset
- [4 part tutorial on RNN](http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/) - relates RNN to the vanishing gradient problem, and provides example implementation
- [RNN training tips and tricks](https://github.com/karpathy/char-rnn#tips-and-tricks) - some rules of thumb for parameterizing and training your RNN