<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>
<br></br>

## *Data Science Unit 4 Sprint 3 Assignment 1*

# Recurrent Neural Networks and Long Short Term Memory (LSTM)

![Monkey at a typewriter](https://upload.wikimedia.org/wikipedia/commons/thumb/3/3c/Chimpanzee_seated_at_typewriter.jpg/603px-Chimpanzee_seated_at_typewriter.jpg)

It is said that [infinite monkeys typing for an infinite amount of time](https://en.wikipedia.org/wiki/Infinite_monkey_theorem) will eventually type, among other things, the complete works of William Shakespeare. Let's see if we can get there a bit faster, with the power of Recurrent Neural Networks and LSTM.

This text file contains the complete works of Shakespeare: https://www.gutenberg.org/files/100/100-0.txt

Use it as training data for an RNN. You can keep it simple and train at the character level, which is the suggested initial approach.

Then, use that trained RNN to generate Shakespearean-ish text. Your goal: a function that takes, as an argument, the size of the text to generate (e.g. number of characters or lines) and returns generated text of that size.

Note - Shakespeare wrote an awful lot. It's OK, especially initially, to sample/use smaller data and parameters, so you can have a tighter feedback loop when you're trying to get things running. Then, once you've got a proof of concept - start pushing it more!
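
One way to get that tighter feedback loop is to train on a slice of the text first. The sketch below is only an illustration: `take_sample`, `max_chars`, and the stand-in `demo_doc` string are hypothetical names, not part of the assignment.

```python
def take_sample(text, max_chars=100_000):
    """Return at most max_chars characters from the start of text."""
    return text[:max_chars]

# Stand-in corpus so the snippet runs on its own; in the notebook this
# would be the downloaded Shakespeare text.
demo_doc = "to be, or not to be, that is the question. " * 1000

small_doc = take_sample(demo_doc, max_chars=5_000)
print(len(demo_doc), "->", len(small_doc))
```

Once the pipeline runs end to end on the small slice, raise `max_chars` (or drop the slice entirely) and scale the other parameters up with it.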

In [2]:
from tensorflow.keras.utils import get_file

In [74]:
# Get file and save as text file
url = 'https://www.gutenberg.org/files/100/100-0.txt'
path_to_file = get_file('shakespeare.txt', url)

doc = open(path_to_file, 'rb').read().decode(encoding='utf-8-sig')

In [75]:
# Collapse every run of whitespace into a single space and make all characters lowercase
doc = doc.replace("\r", "")
doc = doc.lower()
doc = " ".join(doc.split())
print(f"Length of document: {len(doc)} characters")

Length of document: 5278325 characters


In [77]:
import numpy as np

# Collect the unique characters in the text as a list (set ordering varies between runs).
chars = list(set(doc))

num_chars = len(chars)
txt_data_size = len(doc)

print("unique characters:", num_chars)
print("txt_data_size:", txt_data_size)

unique characters: 76
txt_data_size: 5278325


In [78]:
# Build character <-> integer lookup tables (the one-hot vectors themselves are built in forwardprop)
char_to_int = dict((c, i) for i, c in enumerate(chars))
int_to_char = dict((i, c) for i, c in enumerate(chars))
print(char_to_int)
print("-----------------------------------------------")
print(int_to_char)
print("-----------------------------------------------")
# Integer encode input data
integer_encoded = [char_to_int[i] for i in doc]
print(len(integer_encoded))

{'"': 0, '#': 1, 'j': 2, '’': 3, 'u': 4, 'f': 5, 'à': 6, 't': 7, "'": 8, ';': 9, '-': 10, ')': 11, '—': 12, 'è': 13, '/': 14, 'w': 15, 'ç': 16, '%': 17, 'n': 18, 'k': 19, 'e': 20, ' ': 21, 'æ': 22, 'œ': 23, 'é': 24, '[': 25, '?': 26, '8': 27, 'i': 28, 'x': 29, '3': 30, '6': 31, ':': 32, 'c': 33, '.': 34, 'â': 35, '\\': 36, '|': 37, '5': 38, 'p': 39, 'm': 40, '*': 41, ',': 42, '‘': 43, '7': 44, 'z': 45, 'b': 46, 'a': 47, '(': 48, 's': 49, ']': 50, '0': 51, '”': 52, '&': 53, '_': 54, '1': 55, '“': 56, '2': 57, '@': 58, 'r': 59, 'y': 60, 'î': 61, 'o': 62, '4': 63, 'v': 64, 'ê': 65, '`': 66, 'h': 67, '}': 68, '!': 69, '$': 70, 'l': 71, 'd': 72, 'q': 73, 'g': 74, '9': 75}
-----------------------------------------------
{0: '"', 1: '#', 2: 'j', 3: '’', 4: 'u', 5: 'f', 6: 'à', 7: 't', 8: "'", 9: ';', 10: '-', 11: ')', 12: '—', 13: 'è', 14: '/', 15: 'w', 16: 'ç', 17: '%', 18: 'n', 19: 'k', 20: 'e', 21: ' ', 22: 'æ', 23: 'œ', 24: 'é', 25: '[', 26: '?', 27: '8', 28: 'i', 29: 'x', 30: '3', 31: '6
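
To make the relationship between the lookup tables and one-hot vectors concrete, here is a small self-contained sketch on a toy string. The helper `one_hot` and the use of `sorted` for a reproducible ordering are assumptions for illustration; the notebook itself builds the one-hot vector inline during the forward pass.

```python
import numpy as np

text = "hello"
chars = sorted(set(text))  # sorted so the mapping is the same on every run
char_to_int = {c: i for i, c in enumerate(chars)}
int_to_char = {i: c for i, c in enumerate(chars)}

# Integer encoding and the round trip back to text
encoded = [char_to_int[c] for c in text]
decoded = "".join(int_to_char[i] for i in encoded)

def one_hot(index, size):
    """Column vector of zeros with a single 1 at position index."""
    vec = np.zeros((size, 1))
    vec[index] = 1
    return vec

x = one_hot(encoded[0], len(chars))  # one-hot vector for the first character
print(encoded, decoded, x.ravel())
```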

In [99]:
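# Hyperparameters
iteration = 5          # number of passes over the text
sequence_length = 40   # characters per training sequence
batch_size = int(np.ceil(txt_data_size / sequence_length))  # number of sequences per pass, rounded up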
hidden_size = 500
learning_rate = 1e-1

# Model parameters
W_xh = np.random.randn(hidden_size, num_chars) * 0.01       # weight input => hidden
W_hh = np.random.randn(hidden_size, hidden_size) * 0.01     # weight hidden => hidden
W_hy = np.random.randn(num_chars, hidden_size) * 0.01       # weight hidden => output

b_h = np.zeros((hidden_size, 1))   # hidden bias
b_y = np.zeros((num_chars, 1))     # output bias

h_prev = np.zeros((hidden_size, 1)) # h_(t-1)

### Forward Propagation

In [80]:
def forwardprop(inputs, targets, h_prev):
    xs, hs, ys, ps = {}, {}, {}, {} 
    hs[-1] = np.copy(h_prev) # Copy previous hidden state vector to -1 key value.
    loss = 0 # loss initialization
    
    for t in range(len(inputs)):   # t is a "time step" and is used as a key(dict)
        
        xs[t] = np.zeros((num_chars, 1))
        xs[t][inputs[t]] = 1
        hs[t] = np.tanh(np.dot(W_xh, xs[t]) + np.dot(W_hh, hs[t-1]) + b_h) # hidden state
        ys[t] = np.dot(W_hy, hs[t]) + b_y # unnormalized log probabilities for next chars
        ps[t] = np.exp(ys[t]) / np.sum(np.exp(ys[t]))  # softmax: probabilities for next chars

        loss += -np.log(ps[t][targets[t], 0]) # cross-entropy loss for the target char
        
    return loss, ps, hs, xs
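
A note on the softmax step: `np.exp` overflows in float64 for large inputs, which turns the probabilities into `nan`. A common remedy is to subtract the maximum before exponentiating, which leaves the result mathematically unchanged. The sketch below is a standalone illustration; `stable_softmax` is a hypothetical helper, not a function used elsewhere in this notebook.

```python
import numpy as np

def stable_softmax(y):
    """Softmax over a column vector, shifted by the max to avoid overflow."""
    shifted = y - np.max(y)
    exp = np.exp(shifted)
    return exp / np.sum(exp)

# Scores this large would overflow with a naive np.exp(y) / np.sum(np.exp(y))
y = np.array([[1000.0], [1001.0], [1002.0]])
p = stable_softmax(y)
print(p.ravel())
```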

### Backward Propagation

In [81]:
def backprop(ps, inputs, hs, xs, targets):
    
    dWxh, dWhh, dWhy = np.zeros_like(W_xh), np.zeros_like(W_hh), np.zeros_like(W_hy) # make all zero matrices
    dbh, dby = np.zeros_like(b_h), np.zeros_like(b_y)
    dhnext = np.zeros_like(hs[0]) # (hidden_size, 1)
    
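    # Walk backward through time, accumulating gradients
    for t in reversed(range(len(inputs))):
        dy = np.copy(ps[t])
        dy[targets[t]] -= 1 # gradient of softmax + cross-entropy w.r.t. ys[t]
        dWhy += np.dot(dy, hs[t].T)
        dby += dy
        dh = np.dot(W_hy.T, dy) + dhnext # backprop into h, plus the gradient from step t+1
        dhraw = (1 - hs[t] * hs[t]) * dh # backprop through tanh
        dbh += dhraw
        dWxh += np.dot(dhraw, xs[t].T)
        dWhh += np.dot(dhraw, hs[t-1].T)
        dhnext = np.dot(W_hh.T, dhraw) # gradient passed to the previous time step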
    for dparam in [dWxh, dWhh, dWhy, dbh, dby]:
        np.clip(dparam, -5, 5, out=dparam)
        
    return dWxh, dWhh, dWhy, dbh, dby
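
Backpropagation through time is easy to get subtly wrong, so a finite-difference gradient check is a useful sanity test. The sketch below is a self-contained toy, not this notebook's code: the sizes `V` and `H`, the `params` dictionary, and the helpers `loss_and_grads` and `max_abs_error` are all hypothetical. It uses the same recurrence as `forwardprop` with the standard BPTT gradients, and compares each analytic gradient against a central-difference estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
V, H = 4, 3  # toy vocabulary size and hidden size (assumptions)
params = {
    "W_xh": rng.standard_normal((H, V)) * 0.1,
    "W_hh": rng.standard_normal((H, H)) * 0.1,
    "W_hy": rng.standard_normal((V, H)) * 0.1,
    "b_h": np.zeros((H, 1)),
    "b_y": np.zeros((V, 1)),
}
inputs, targets = [0, 1, 2, 3], [1, 2, 3, 0]

def loss_and_grads(p):
    """Forward pass, then backprop through time, for the toy sequence."""
    xs, hs, ps = {}, {-1: np.zeros((H, 1))}, {}
    loss = 0.0
    for t, ix in enumerate(inputs):
        xs[t] = np.zeros((V, 1))
        xs[t][ix] = 1
        hs[t] = np.tanh(p["W_xh"] @ xs[t] + p["W_hh"] @ hs[t - 1] + p["b_h"])
        y = p["W_hy"] @ hs[t] + p["b_y"]
        e = np.exp(y - y.max())
        ps[t] = e / e.sum()
        loss += -np.log(ps[t][targets[t], 0])
    grads = {k: np.zeros_like(v) for k, v in p.items()}
    dhnext = np.zeros((H, 1))
    for t in reversed(range(len(inputs))):
        dy = ps[t].copy()
        dy[targets[t]] -= 1
        grads["W_hy"] += dy @ hs[t].T
        grads["b_y"] += dy
        dh = p["W_hy"].T @ dy + dhnext
        dhraw = (1 - hs[t] ** 2) * dh
        grads["b_h"] += dhraw
        grads["W_xh"] += dhraw @ xs[t].T
        grads["W_hh"] += dhraw @ hs[t - 1].T
        dhnext = p["W_hh"].T @ dhraw
    return loss, grads

def max_abs_error(name, eps=1e-5):
    """Largest gap between analytic and central-difference gradients."""
    _, grads = loss_and_grads(params)
    w = params[name]
    worst = 0.0
    for idx in np.ndindex(w.shape):
        old = w[idx]
        w[idx] = old + eps
        plus, _ = loss_and_grads(params)
        w[idx] = old - eps
        minus, _ = loss_and_grads(params)
        w[idx] = old  # restore the original value
        numeric = (plus - minus) / (2 * eps)
        worst = max(worst, abs(numeric - grads[name][idx]))
    return worst

for name in params:
    print(name, max_abs_error(name))
```

If any parameter shows a gap far larger than the others, the gradient for that parameter is the first place to look. Note that the check is run without gradient clipping, since clipping would deliberately change the analytic values.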

### Training the model

In [None]:
%%time

data_pointer = 0

# memory variables for Adagrad
mWxh, mWhh, mWhy = np.zeros_like(W_xh), np.zeros_like(W_hh), np.zeros_like(W_hy)
mbh, mby = np.zeros_like(b_h), np.zeros_like(b_y)

for i in range(iteration):
    h_prev = np.zeros((hidden_size, 1)) # reset RNN memory
    data_pointer = 0
    
    for b in range(batch_size):
        
        inputs = [char_to_int[ch] for ch in doc[data_pointer: data_pointer+sequence_length]]
        targets = [char_to_int[ch] for ch in doc[data_pointer+1: data_pointer+sequence_length+1]]
        
        if (data_pointer+sequence_length+1 >= len(doc) and b == batch_size-1):
            targets.append(char_to_int[" "])   # When data doesn't fit, add space(" ") to the back.
            
        # forward
        loss, ps, hs, xs = forwardprop(inputs, targets, h_prev)
        
        # backward
        dWxh, dWhh, dWhy, dbh, dby = backprop(ps, inputs, hs, xs, targets)
        
        # perform parameter update with Adagrad
        for param, dparam, mem in zip([W_xh, W_hh, W_hy, b_h, b_y],
                                     [dWxh, dWhh, dWhy, dbh, dby],
                                     [mWxh, mWhh, mWhy, mbh, mby]):
            mem += dparam * dparam
            param += -learning_rate * dparam / np.sqrt(mem + 1e-8)
            
        data_pointer += sequence_length
        h_prev = hs[len(inputs) - 1] # carry the last hidden state into the next sequence
        
    # With only a handful of passes, report after every one
    print(f"iter {i}, loss: {loss}") # loss of the last sequence in this pass
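
The loop above slides a window of `sequence_length` characters over the text, with the targets being the inputs shifted one character ahead. The sketch below shows that windowing on a toy string. `make_windows` is a hypothetical helper, and it trims the final input window instead of padding the targets with a space as the loop above does, which is an alternative choice rather than the notebook's approach.

```python
def make_windows(text, sequence_length):
    """Yield (inputs, targets) string pairs; targets are inputs shifted by one."""
    for start in range(0, len(text) - 1, sequence_length):
        inputs = text[start:start + sequence_length]
        targets = text[start + 1:start + sequence_length + 1]
        # Trim inputs on the final window so both sides have equal length.
        inputs = inputs[:len(targets)]
        yield inputs, targets

windows = list(make_windows("to be or not", sequence_length=5))
for inputs, targets in windows:
    print(repr(inputs), "->", repr(targets))
```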
    

In [91]:
def predict(test_char, length):
    x = np.zeros((num_chars, 1))
    x[char_to_int[test_char]] = 1
    ixes = []
    h = np.zeros((hidden_size, 1))
    
    for t in range(length):
        h = np.tanh(np.dot(W_xh, x) + np.dot(W_hh, h) + b_h)
        y = np.dot(W_hy, h) + b_y
        p = np.exp(y) / np.sum(np.exp(y))
        ix = np.random.choice(range(num_chars), p=p.ravel())
        x = np.zeros((num_chars, 1))
        x[ix] = 1
        ixes.append(ix)
    txt = test_char + ''.join(int_to_char[i] for i in ixes)
    print(f"----\n {txt} \n---")

In [96]:
predict('b', 5)

----
 bdruta 
---
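
The assignment asks for a function that returns the generated text rather than printing it. A minimal sketch of that shape follows, under stated assumptions: `generate_text` and its arguments are hypothetical, the weights are passed in explicitly instead of read from globals, and the demo uses untrained random weights just to show the call.

```python
import numpy as np

def generate_text(seed_char, length, params, char_to_int, int_to_char, rng=None):
    """Sample `length` characters after seed_char and return the whole string."""
    rng = rng or np.random.default_rng()
    W_xh, W_hh, W_hy, b_h, b_y = params
    vocab_size = W_xh.shape[1]
    h = np.zeros((W_hh.shape[0], 1))
    x = np.zeros((vocab_size, 1))
    x[char_to_int[seed_char]] = 1
    out = [seed_char]
    for _ in range(length):
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)
        y = W_hy @ h + b_y
        p = np.exp(y - np.max(y))
        p /= p.sum()
        ix = int(rng.choice(vocab_size, p=p.ravel()))
        x = np.zeros((vocab_size, 1))  # feed the sampled char back in
        x[ix] = 1
        out.append(int_to_char[ix])
    return "".join(out)

# Untrained random weights on a toy vocabulary, only to show the call shape.
chars = sorted("abc ")
char_to_int = {c: i for i, c in enumerate(chars)}
int_to_char = {i: c for i, c in enumerate(chars)}
demo_rng = np.random.default_rng(0)
hidden = 8
params = (
    demo_rng.standard_normal((hidden, len(chars))) * 0.01,
    demo_rng.standard_normal((hidden, hidden)) * 0.01,
    demo_rng.standard_normal((len(chars), hidden)) * 0.01,
    np.zeros((hidden, 1)),
    np.zeros((len(chars), 1)),
)
sample = generate_text("a", 20, params, char_to_int, int_to_char, rng=demo_rng)
print(repr(sample))
```

As with `predict`, the returned string contains the seed character plus `length` sampled characters, so its total length is `length + 1`.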


# Resources and Stretch Goals

## Stretch goals:
- Refine the training and generation of text to be able to ask for different genres/styles of Shakespearean text (e.g. plays versus sonnets)
- Train a classification model that takes text and returns which work of Shakespeare it is most likely to be from
- Make it more performant! Many possible routes here - lean on Keras, optimize the code, and/or use more resources (AWS, etc.)
- Revisit the news example from class, and improve it - use categories or tags to refine the model/generation, or train a news classifier
- Run on bigger, better data

## Resources:
- [The Unreasonable Effectiveness of Recurrent Neural Networks](https://karpathy.github.io/2015/05/21/rnn-effectiveness/) - a seminal writeup demonstrating a simple but effective character-level NLP RNN
- [Simple NumPy implementation of RNN](https://github.com/JY-Yoon/RNN-Implementation-using-NumPy/blob/master/RNN%20Implementation%20using%20NumPy.ipynb) - Python 3 version of the code from "Unreasonable Effectiveness"
- [TensorFlow RNN Tutorial](https://github.com/tensorflow/models/tree/master/tutorials/rnn) - code for training an RNN on the Penn Tree Bank language dataset
- [4 part tutorial on RNN](http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/) - relates RNN to the vanishing gradient problem, and provides example implementation
- [RNN training tips and tricks](https://github.com/karpathy/char-rnn#tips-and-tricks) - some rules of thumb for parameterizing and training your RNN