<a href="https://colab.research.google.com/github/mudesir/DS-Unit-4-Sprint-3-Deep-Learning/blob/main/LS_DS_431_RNN_and_LSTM_Assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>
<br></br>

## *Data Science Unit 4 Sprint 3 Assignment 1*

# Recurrent Neural Networks and Long Short-Term Memory (LSTM)

![Monkey at a typewriter](https://upload.wikimedia.org/wikipedia/commons/thumb/3/3c/Chimpanzee_seated_at_typewriter.jpg/603px-Chimpanzee_seated_at_typewriter.jpg)

It is said that [infinite monkeys typing for an infinite amount of time](https://en.wikipedia.org/wiki/Infinite_monkey_theorem) will eventually type, among other things, the complete works of William Shakespeare. Let's see if we can get there a bit faster, with the power of Recurrent Neural Networks and LSTM.

This text file contains the complete works of Shakespeare: https://www.gutenberg.org/files/100/100-0.txt

Use it as training data for an RNN. You can keep it simple and train at the character level, which is the suggested initial approach.

Then, use that trained RNN to generate Shakespearean-ish text. Your goal - a function that can take, as an argument, the size of text (e.g. number of characters or lines) to generate, and returns generated text of that size.

Note - Shakespeare wrote an awful lot. It's OK, especially initially, to sample/use smaller data and parameters, so you can have a tighter feedback loop when you're trying to get things running. Then, once you've got a proof of concept - start pushing it more!

In [1]:
import requests
import pandas as pd

In [2]:
url = "https://www.gutenberg.org/files/100/100-0.txt"

r = requests.get(url)
r.encoding = r.apparent_encoding
data = r.text
data = data.split('\r\n')
toc = [l.strip() for l in data[44:130:2]]
# Skip the Table of Contents
data = data[135:]

# Fixing Titles
toc[9] = 'THE LIFE OF KING HENRY V'
toc[18] = 'MACBETH'
toc[24] = 'OTHELLO, THE MOOR OF VENICE'
toc[34] = 'TWELFTH NIGHT: OR, WHAT YOU WILL'

locations = {id_:{'title':title, 'start':-99} for id_,title in enumerate(toc)}

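# Record the line index at which each title appears.
# Note: every match overwrites the previous one, so the last occurrence wins.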
for e,i in enumerate(data):
    for t,title in enumerate(toc):
        if title in i:
            locations[t].update({'start':e})
            

df_toc = pd.DataFrame.from_dict(locations, orient='index')
df_toc['end'] = df_toc['start'].shift(-1).apply(lambda x: x-1)
df_toc.loc[42, 'end'] = len(data)
df_toc['end'] = df_toc['end'].astype('int')

df_toc['text'] = df_toc.apply(lambda x: '\r\n'.join(data[ x['start'] : int(x['end']) ]), axis=1)
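
The loop above uses a substring test and keeps the last match, so a title that reappears inside a play can move that play's start. One possible hardening, shown here only as a sketch on toy data, is to require the whole line to equal the title and to keep the first occurrence. The helper name `locate_titles` and the toy lines are hypothetical and not part of the assignment.

```python
def locate_titles(lines, titles):
    """Return {title: index of the first line that equals the title}."""
    wanted = set(titles)
    starts = {}
    for idx, line in enumerate(lines):
        stripped = line.strip()
        # Exact match skips lines that merely mention a title;
        # the `not in starts` check keeps the first occurrence only.
        if stripped in wanted and stripped not in starts:
            starts[stripped] = idx
    return starts


# Toy stand-in for the Gutenberg lines (hypothetical data).
toy_lines = [
    "AS YOU LIKE IT",
    "ROSALIND. I have seen AS YOU LIKE IT performed.",
    "",
    "MACBETH",
    "First Witch. When shall we three meet again?",
    "AS YOU LIKE IT",
]
toy_starts = locate_titles(toy_lines, ["AS YOU LIKE IT", "MACBETH"])
print(toy_starts)
```

Titles that are never found are simply absent from the returned dictionary, which makes them easier to spot than a `-99` placeholder.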

In [3]:
#Shakespeare Data Parsed by Play
df_toc.head()

| | title | start | end | text |
|---|---|---|---|---|
| 0 | THE TRAGEDY OF ANTONY AND CLEOPATRA | -99 | 14379 | |
| 1 | AS YOU LIKE IT | 14380 | 17171 | AS YOU LIKE IT\r\n\r\n\r\nDRAMATIS PERSONAE.\r... |
| 2 | THE COMEDY OF ERRORS | 17172 | 20372 | THE COMEDY OF ERRORS\r\n\r\n\r\n\r\nContents\r... |
| 3 | THE TRAGEDY OF CORIOLANUS | 20373 | 30346 | THE TRAGEDY OF CORIOLANUS\r\n\r\nDramatis Pers... |
| 4 | CYMBELINE | 30347 | 30364 | CYMBELINE.\r\nLaud we the gods;\r\nAnd let our... |


In [4]:
data = df_toc['text'].values
len(data)

43

In [5]:
data=data[1]

In [6]:
# Encode Data as Chars

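# Gather all text
# Why? 1. See all possible characters 2. For training / splitting later
# NOTE: after `data = data[1]`, `data` is a single string (one play), so
# " ".join(data) inserts a space between every character. The outputs below
# reflect that; use `text = data` to train on the unspaced text instead.
text = " ".join(data)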

# Unique Characters
chars = list(set(text))

# Lookup Tables
char_int = {c:i for i, c in enumerate(chars)} 
int_char = {i:c for i, c in enumerate(chars)} 
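
Because `list(set(text))` has no guaranteed order, the index assigned to each character can differ between interpreter sessions, which matters if a trained model is saved and reloaded later. A minimal sketch of a reproducible variant, using a short hypothetical string in place of `text` and hypothetical names (`vocab`, `char_to_idx`, `idx_to_char`):

```python
sample_text = "to be, or not to be"  # hypothetical stand-in for `text`

# Sorting makes the index assignment reproducible across interpreter runs.
vocab = sorted(set(sample_text))
char_to_idx = {c: i for i, c in enumerate(vocab)}
idx_to_char = {i: c for i, c in enumerate(vocab)}

# Round trip: characters -> integer ids -> characters.
encoded_sample = [char_to_idx[c] for c in sample_text]
decoded_sample = "".join(idx_to_char[i] for i in encoded_sample)

print(vocab)
print(encoded_sample[:5])
print(decoded_sample == sample_text)
```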

In [7]:
char_int['S']

55

In [8]:
int_char[2]

'q'

In [9]:
len(chars)

66

In [10]:
chars

['i',
 'z',
 'q',
 '&',
 'F',
 "'",
 'B',
 'D',
 'k',
 'h',
 'C',
 'o',
 't',
 'O',
 ']',
 'u',
 '"',
 ' ',
 'd',
 'f',
 'P',
 '\r',
 ',',
 ';',
 'L',
 'Y',
 'E',
 '[',
 '!',
 'A',
 'l',
 'G',
 'e',
 'g',
 'y',
 'K',
 'X',
 'N',
 's',
 'Q',
 '?',
 ':',
 'U',
 'b',
 'r',
 'W',
 'R',
 'I',
 '\n',
 'v',
 'H',
 'T',
 'x',
 'a',
 'p',
 'S',
 'w',
 'j',
 'c',
 'V',
 '.',
 '-',
 'n',
 'M',
 'J',
 'm']

In [11]:
char_int

{'\n': 48,
 '\r': 21,
 ' ': 17,
 '!': 28,
 '"': 16,
 '&': 3,
 "'": 5,
 ',': 22,
 '-': 61,
 '.': 60,
 ':': 41,
 ';': 23,
 '?': 40,
 'A': 29,
 'B': 6,
 'C': 10,
 'D': 7,
 'E': 26,
 'F': 4,
 'G': 31,
 'H': 50,
 'I': 47,
 'J': 64,
 'K': 35,
 'L': 24,
 'M': 63,
 'N': 37,
 'O': 13,
 'P': 20,
 'Q': 39,
 'R': 46,
 'S': 55,
 'T': 51,
 'U': 42,
 'V': 59,
 'W': 45,
 'X': 36,
 'Y': 25,
 '[': 27,
 ']': 14,
 'a': 53,
 'b': 43,
 'c': 58,
 'd': 18,
 'e': 32,
 'f': 19,
 'g': 33,
 'h': 9,
 'i': 0,
 'j': 57,
 'k': 8,
 'l': 30,
 'm': 65,
 'n': 62,
 'o': 11,
 'p': 54,
 'q': 2,
 'r': 44,
 's': 38,
 't': 12,
 'u': 15,
 'v': 49,
 'w': 56,
 'x': 52,
 'y': 34,
 'z': 1}

In [12]:
int_char

{0: 'i',
 1: 'z',
 2: 'q',
 3: '&',
 4: 'F',
 5: "'",
 6: 'B',
 7: 'D',
 8: 'k',
 9: 'h',
 10: 'C',
 11: 'o',
 12: 't',
 13: 'O',
 14: ']',
 15: 'u',
 16: '"',
 17: ' ',
 18: 'd',
 19: 'f',
 20: 'P',
 21: '\r',
 22: ',',
 23: ';',
 24: 'L',
 25: 'Y',
 26: 'E',
 27: '[',
 28: '!',
 29: 'A',
 30: 'l',
 31: 'G',
 32: 'e',
 33: 'g',
 34: 'y',
 35: 'K',
 36: 'X',
 37: 'N',
 38: 's',
 39: 'Q',
 40: '?',
 41: ':',
 42: 'U',
 43: 'b',
 44: 'r',
 45: 'W',
 46: 'R',
 47: 'I',
 48: '\n',
 49: 'v',
 50: 'H',
 51: 'T',
 52: 'x',
 53: 'a',
 54: 'p',
 55: 'S',
 56: 'w',
 57: 'j',
 58: 'c',
 59: 'V',
 60: '.',
 61: '-',
 62: 'n',
 63: 'M',
 64: 'J',
 65: 'm'}

In [13]:
# Create the sequence data

maxlen = 40
step = 5

encoded = [char_int[c] for c in text]

sequences = [] # Each element is 40 chars long
next_char = [] # One element for each sequence

for i in range(0, len(encoded) - maxlen, step):
    sequences.append(encoded[i : i + maxlen])
    next_char.append(encoded[i + maxlen])
    
print('sequences: ', len(sequences))

sequences:  54663
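
The windowing is easier to see on a short input. The sketch below applies the same loop to ten letters with a window of 4 and a step of 2; the helper name `make_windows` is hypothetical. The same `range` arithmetic accounts for the count printed above: (273355 - 40) / 5 = 54663.

```python
def make_windows(encoded, maxlen, step):
    """Slide a window of `maxlen` items over `encoded`, moving `step` at a time."""
    windows, targets = [], []
    for i in range(0, len(encoded) - maxlen, step):
        windows.append(encoded[i:i + maxlen])
        targets.append(encoded[i + maxlen])  # the item right after the window
    return windows, targets


toy = list("abcdefghij")  # characters instead of integer ids, for readability
windows, targets = make_windows(toy, maxlen=4, step=2)
for w, t in zip(windows, targets):
    print("".join(w), "->", t)
# abcd -> e
# cdef -> g
# efgh -> i
```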


In [14]:
len(text)

273355

In [15]:
sequences[0]

[29,
 17,
 55,
 17,
 17,
 17,
 25,
 17,
 13,
 17,
 42,
 17,
 17,
 17,
 24,
 17,
 47,
 17,
 35,
 17,
 26,
 17,
 17,
 17,
 47,
 17,
 51,
 17,
 21,
 17,
 48,
 17,
 21,
 17,
 48,
 17,
 21,
 17,
 48,
 17]

In [16]:
import numpy as np

In [17]:
# Create x & y as one-hot encoded arrays

# `np.bool` was removed in NumPy 1.24; the builtin `bool` is equivalent
x = np.zeros((len(sequences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sequences), len(chars)), dtype=bool)

for i, sequence in enumerate(sequences):
    for t, char in enumerate(sequence):
        x[i,t,char] = 1
        
    y[i, next_char[i]] = 1
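
The nested loops above can be slow on larger corpora. As an alternative sketch (not what this notebook runs), indexing an identity matrix with an integer array one-hot encodes every position at once. The toy arrays and the names `x_toy` and `y_toy` are hypothetical.

```python
import numpy as np

# Hypothetical toy inputs: 3 sequences of length 4 over a 5-symbol vocabulary.
toy_sequences = np.array([[0, 1, 2, 3],
                          [1, 2, 3, 4],
                          [2, 3, 4, 0]])
toy_next = np.array([4, 0, 1])
vocab_size = 5

# Row i of the identity matrix is the one-hot vector for index i,
# so indexing with an integer array encodes every entry in one step.
eye = np.eye(vocab_size, dtype=bool)
x_toy = eye[toy_sequences]  # shape (3, 4, 5)
y_toy = eye[toy_next]       # shape (3, 5)

print(x_toy.shape, y_toy.shape)
```

Note that this builds the full array in one go, so peak memory use is similar to the loop version; the gain is in speed.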

In [18]:
x.shape

(54663, 40, 66)

In [19]:
y.shape

(54663, 66)

In [21]:
from tensorflow.keras.callbacks import LambdaCallback
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM
from tensorflow.keras.optimizers import RMSprop

import random
import sys
import os

In [22]:
# build the model: a single LSTM

model = Sequential()
model.add(LSTM(128, input_shape=(maxlen, len(chars))))
model.add(Dense(len(chars), activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam')

In [23]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm (LSTM)                  (None, 128)               99840     
_________________________________________________________________
dense (Dense)                (None, 66)                8514      
Total params: 108,354
Trainable params: 108,354
Non-trainable params: 0
_________________________________________________________________
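
The parameter counts in the summary can be checked by hand. An LSTM layer holds four weight sets (input, forget, and output gates plus the candidate cell state), each with input weights, recurrent weights, and a bias. The helper names below are hypothetical; the arithmetic is a sketch of the standard count.

```python
def lstm_params(input_dim, units):
    # Four weight sets, each with input weights, recurrent weights, and a bias.
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    # One weight per input-output pair plus one bias per output unit.
    return input_dim * units + units


vocab_size, hidden = 66, 128
lstm_count = lstm_params(vocab_size, hidden)
dense_count = dense_params(hidden, vocab_size)
print(lstm_count, dense_count, lstm_count + dense_count)
# → 99840 8514 108354
```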


In [24]:
def sample(preds, temperature=1.0):
    # Helper function to sample an index from a probability array.
    # temperature=1.0 leaves the model's distribution unchanged.
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
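
The division in the helper above acts as a sampling temperature: values below 1 sharpen the distribution toward the most likely character, and values above 1 flatten it. The sketch below illustrates the effect on a toy distribution. The function name `sample_with_temperature`, the small epsilon, and the use of `numpy.random.Generator` are assumptions for illustration, not part of the notebook.

```python
import numpy as np


def sample_with_temperature(preds, temperature=1.0, rng=None):
    """Sample an index from `preds` after rescaling by `temperature`."""
    rng = np.random.default_rng() if rng is None else rng
    preds = np.asarray(preds, dtype="float64")
    # Small epsilon avoids log(0) when a probability underflows to zero.
    logits = np.log(preds + 1e-12) / temperature
    logits -= logits.max()  # subtract the max for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))


toy_preds = [0.1, 0.2, 0.7]
rng = np.random.default_rng(0)
cold = [sample_with_temperature(toy_preds, 0.05, rng) for _ in range(200)]
hot = [sample_with_temperature(toy_preds, 5.0, rng) for _ in range(200)]

# Low temperature collapses onto the most likely index; high temperature spreads out.
print(sorted(set(cold)), sorted(set(hot)))
```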

In [27]:
def on_epoch_end(epoch, _):
    # Function invoked at end of each epoch. Prints generated text.
    
    print()
    print('----- Generating text after Epoch: %d' % epoch)
    
    # Random prompt
    start_index = random.randint(0, len(text) - maxlen - 1)
    
    generated = ''
    
    sentence = text[start_index: start_index + maxlen]
    generated += sentence
    
    print('----- Generating with seed: "' + sentence + '"')
    sys.stdout.write(generated)
    
    for i in range(400):
        x_pred = np.zeros((1, maxlen, len(chars)))
        for t, char in enumerate(sentence):
            x_pred[0, t, char_int[char]] = 1
            
        # Predict the next step (character)
        preds = model.predict(x_pred, verbose=0)[0]
        next_index = sample(preds)
        next_char = int_char[next_index]
        
        sentence = sentence[1:] + next_char
        
        sys.stdout.write(next_char)
        sys.stdout.flush()
    print()

print_callback = LambdaCallback(on_epoch_end=on_epoch_end)    
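
The assignment asks for a function that takes a size and returns generated text, whereas the callback above only prints. One way to sketch such a function is to pass the predictor in as a callable, so the same code runs with the trained model or with a stand-in. The name `generate_text`, its parameters, and the left zero-padding of short seeds are assumptions; the model here was trained without padding, so a seed of at least `maxlen` characters is the safer choice.

```python
import numpy as np


def generate_text(predict_fn, seed, length, char_int, int_char, maxlen, rng=None):
    """Return `length` generated characters following `seed`.

    `predict_fn` is assumed to map a (1, maxlen, vocab) one-hot array to a
    (1, vocab) array of probabilities, e.g.
    lambda a: model.predict(a, verbose=0).
    """
    rng = np.random.default_rng() if rng is None else rng
    window = seed[-maxlen:]
    out = []
    for _ in range(length):
        x_pred = np.zeros((1, maxlen, len(char_int)))
        # Right-align the window so short seeds are zero-padded on the left.
        offset = maxlen - len(window)
        for t, ch in enumerate(window):
            x_pred[0, offset + t, char_int[ch]] = 1
        probs = np.asarray(predict_fn(x_pred)[0], dtype="float64")
        probs /= probs.sum()
        next_ch = int_char[int(rng.choice(len(probs), p=probs))]
        out.append(next_ch)
        window = (window + next_ch)[-maxlen:]
    return "".join(out)


# Stand-in predictor (uniform over a toy vocabulary) so the sketch runs without a model.
toy_vocab = sorted("abc ")
toy_char_int = {c: i for i, c in enumerate(toy_vocab)}
toy_int_char = {i: c for c, i in toy_char_int.items()}
uniform = lambda arr: np.full((1, len(toy_vocab)), 1 / len(toy_vocab))

sample_out = generate_text(uniform, "abc", 20, toy_char_int, toy_int_char,
                           maxlen=5, rng=np.random.default_rng(0))
print(repr(sample_out))
```

With the notebook's objects, a call might look like `generate_text(lambda a: model.predict(a, verbose=0), text[:maxlen], 400, char_int, int_char, maxlen)`.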


In [28]:
# fit the model

model.fit(x, y,
          batch_size=32,
          epochs=10,
          callbacks=[print_callback])

Epoch 1/10
----- Generating text after Epoch: 0
 
         b r i e r s   i s   t h "
 
         b r i e r s   i s   t h e t h i l o l w e s t   o r   s h p o l n s p e ! z d   i l y a c r f p s o u r i l m a v   e g ;   t h i g   r y   A . l   u t e o r l ' g i d   h i r a t s i s   g i -   G R I C .   ,   M a r   b e r t   e     i n t e c a b e t m a ,   t A J e 
 

Epoch 2/10
----- Generating text after Epoch: 1
 
     C E L I A .   W h "
 
 
 
 
 
 
     R R 
Epoch 3/10
----- Generating text after Epoch: 2
----- Generating with seed: " e n   o f   g r e a t   w o r t h   r e"
 
         U f   a p t r l y   a n p o l o s   t h o v e   a r   i s   g a t   y o u   a f   c h a n g   f o v e r d   m e   a n d   z i d   o o n   h e r   Y o v e   I   s a   t h e   G R o u   f h a a d   h a e n   t h e   c ' v h l a n d   I   y h o t   i n   m e   a   h -   v e r e   S a t   M E R g a r .   T C U L E .   S o u d   l a   J y o t
Epoch 4/10
----- Generating text after Epoch: 3
----- Generatin

<tensorflow.python.keras.callbacks.History at 0x7faf6c3cac18>

# Resources and Stretch Goals

In [29]:
from tensorflow.keras.datasets import imdb

def print_text_from_seq(x):
  # Decode a sequence of IMDB word ids back into readable text.
  # Keras offsets the raw word indices by 3 to reserve ids for special tokens.
  INDEX_FROM = 3
  word_to_id = imdb.get_word_index()
  word_to_id = {k:(v+INDEX_FROM) for k,v in word_to_id.items()}
  word_to_id["<PAD>"] = 0
  word_to_id["<START>"] = 1
  word_to_id["<UNK>"] = 2
  word_to_id["<UNUSED>"] = 3

  id_to_word = {value:key for key,value in word_to_id.items()}
  print('==================================================')
  print(f'Length = {len(x)}')
  print('==================================================')
  print(' '.join(id_to_word[id] for id in x))

In [31]:
from __future__ import print_function

from tensorflow.keras.layers import Dense, Embedding
from tensorflow.keras.layers import LSTM
from tensorflow.keras.datasets import imdb

## Stretch goals:
- Refine the training and generation of text to be able to ask for different genres/styles of Shakespearean text (e.g. plays versus sonnets)
- Train a classification model that takes text and returns which work of Shakespeare it is most likely to be from
- Make it more performant! Many possible routes here - lean on Keras, optimize the code, and/or use more resources (AWS, etc.)
- Revisit the news example from class, and improve it - use categories or tags to refine the model/generation, or train a news classifier
- Run on bigger, better data

## Resources:
- [The Unreasonable Effectiveness of Recurrent Neural Networks](https://karpathy.github.io/2015/05/21/rnn-effectiveness/) - a seminal writeup demonstrating a simple but effective character-level NLP RNN
- [Simple NumPy implementation of RNN](https://github.com/JY-Yoon/RNN-Implementation-using-NumPy/blob/master/RNN%20Implementation%20using%20NumPy.ipynb) - Python 3 version of the code from "Unreasonable Effectiveness"
- [TensorFlow RNN Tutorial](https://github.com/tensorflow/models/tree/master/tutorials/rnn) - code for training an RNN on the Penn Treebank language dataset
- [4 part tutorial on RNN](http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/) - relates RNN to the vanishing gradient problem, and provides example implementation
- [RNN training tips and tricks](https://github.com/karpathy/char-rnn#tips-and-tricks) - some rules of thumb for parameterizing and training your RNN