<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>
<br></br>

## *Data Science Unit 4 Sprint 3 Assignment 1*

# Recurrent Neural Networks and Long Short Term Memory (LSTM)

![Monkey at a typewriter](https://upload.wikimedia.org/wikipedia/commons/thumb/3/3c/Chimpanzee_seated_at_typewriter.jpg/603px-Chimpanzee_seated_at_typewriter.jpg)

It is said that [infinite monkeys typing for an infinite amount of time](https://en.wikipedia.org/wiki/Infinite_monkey_theorem) will eventually type, among other things, the complete works of William Shakespeare. Let's see if we can get there a bit faster, with the power of Recurrent Neural Networks and LSTM.

This text file contains the complete works of Shakespeare: https://www.gutenberg.org/files/100/100-0.txt

Use it as training data for an RNN. You can keep it simple and train at the character level, which is the suggested initial approach.

Then, use that trained RNN to generate Shakespearean-ish text. Your goal - a function that can take, as an argument, the size of text (e.g. number of characters or lines) to generate, and returns generated text of that size.

Note - Shakespeare wrote an awful lot. It's OK, especially initially, to sample/use smaller data and parameters, so you can have a tighter feedback loop when you're trying to get things running. Then, once you've got a proof of concept - start pushing it more!

In [26]:
# The sonnets were downloaded beforehand; load that local file.
# A with-block closes the file even if the read fails.
with open('sonnets.txt', 'r') as fh:
    txt = fh.read()

In [36]:
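# Vocabulary of unique characters. Set ordering can vary between runs,
# so the integer codes printed below are not reproducible.
chars = list(set(txt))
# Lookup tables: character -> integer and integer -> character
char_int = {c:i for i,c in enumerate(chars)}
int_char = {i:c for i,c in enumerate(chars)}
# Encode the whole text as a list of integers
enc = [char_int[c] for c in txt]
print(enc[:100])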

[25, 48, 9, 24, 57, 23, 26, 26, 9, 25, 57, 6, 6, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 11, 6, 6, 51, 67, 56, 68, 24, 43, 16, 54, 67, 27, 34, 5, 24, 70, 67, 27, 16, 5, 28, 67, 27, 34, 24, 55, 27, 24, 58, 27, 34, 54, 67, 27, 24, 54, 4, 70, 67, 27, 16, 34, 27, 0, 6, 25, 53, 16, 5, 24, 5, 53, 27, 67, 27, 59, 45, 24, 59, 27, 16, 28, 5, 45, 64, 34]
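
A quick way to sanity-check the two lookup tables is to decode the encoded text and confirm it matches the input. The sketch below is only illustrative: it uses a short stand-in string instead of `sonnets.txt`, the `toy_*` names are hypothetical, and it sorts the vocabulary (an assumption, not what the cell above does) so that the mapping is the same on every run.

```python
# Stand-in text for illustration; the notebook itself uses sonnets.txt
toy_txt = "shall i compare thee"

# Sorting the vocabulary makes the integer codes reproducible across runs
toy_chars = sorted(set(toy_txt))
toy_char_int = {c: i for i, c in enumerate(toy_chars)}
toy_int_char = {i: c for i, c in enumerate(toy_chars)}

# Encode to integers, then decode back to a string
toy_enc = [toy_char_int[c] for c in toy_txt]
toy_dec = ''.join(toy_int_char[i] for i in toy_enc)

# The round trip should reproduce the original text exactly
print(toy_dec == toy_txt)
```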


In [28]:
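maxlen = 40  # length of each input window, in characters
step = 5     # offset between the starts of consecutive windows
seqs = []    # input windows
nexts = []   # the character that follows each window (the target)
for i in range(0, len(enc)-maxlen, step):
    seqs.append(enc[i:i+maxlen])
    nexts.append(enc[i+maxlen])
len(seqs)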

19661
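
To see what the windowing loop produces, it can help to run the same logic on a tiny sequence where the result is easy to read. This is a minimal sketch with hypothetical `toy_*` names and made-up sizes (window of 4, step of 3), not the values used above.

```python
# A tiny "encoded text": the integers 0..11
toy = list(range(12))
toy_maxlen, toy_step = 4, 3

toy_seqs, toy_nexts = [], []
# Same loop shape as above: stop early enough that a target always exists
for i in range(0, len(toy) - toy_maxlen, toy_step):
    toy_seqs.append(toy[i:i + toy_maxlen])   # input window
    toy_nexts.append(toy[i + toy_maxlen])    # element right after the window

for window, target in zip(toy_seqs, toy_nexts):
    print(window, '->', target)
```

Each window overlaps the previous one by `toy_maxlen - toy_step` elements, so a smaller step yields more (and more redundant) training examples.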

In [29]:
import numpy as np
# One-hot encode inputs and targets. Use the builtin bool:
# the np.bool alias was removed in NumPy 1.24.
x = np.zeros((len(seqs), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(seqs), len(chars)), dtype=bool)

for i, seq in enumerate(seqs):
    for t, char in enumerate(seq):
        x[i,t,char] = 1
    y[i,nexts[i]] = 1

In [30]:
x.shape

(19661, 40, 72)

In [31]:
y.shape

(19661, 72)
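
The nested loops above are easy to follow, but the same one-hot arrays can also be built by indexing into an identity matrix. The sketch below is one possible alternative, shown on a small made-up example (the `toy_*`, `x_toy`, `y_toy`, and `n_chars` names are hypothetical); it compares the result against the loop version.

```python
import numpy as np

# Made-up data: 2 sequences of length 3 over a 4-symbol vocabulary
toy_seqs = np.array([[0, 2, 1], [1, 1, 3]])
toy_nexts = np.array([3, 0])
n_chars = 4

# Row k of the identity matrix is the one-hot vector for symbol k,
# so fancy indexing one-hot encodes every position at once
x_toy = np.eye(n_chars, dtype=bool)[toy_seqs]    # shape (2, 3, 4)
y_toy = np.eye(n_chars, dtype=bool)[toy_nexts]   # shape (2, 4)

# Reference result using the explicit loops from the cell above
x_loop = np.zeros((2, 3, n_chars), dtype=bool)
y_loop = np.zeros((2, n_chars), dtype=bool)
for i, seq in enumerate(toy_seqs):
    for t, char in enumerate(seq):
        x_loop[i, t, char] = 1
    y_loop[i, toy_nexts[i]] = 1

print(np.array_equal(x_toy, x_loop), np.array_equal(y_toy, y_loop))
```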

In [32]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM
model = Sequential()
model.add(LSTM(128, input_shape=(maxlen, len(chars))))
model.add(Dense(len(chars), activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')

In [33]:
model.fit(x, y, epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x10f73de50>

In [40]:
def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)

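def my_lil_generator(l=10):
    """Generate l characters, seeded with a random maxlen-long slice of txt."""
    sindex = np.random.randint(0, len(txt) - maxlen - 1)
    seed = txt[sindex : sindex + maxlen]
    op = []
    for i in range(l):
        # one-hot encode the current seed window
        xpred = np.zeros((1, maxlen, len(chars)))
        for t, char in enumerate(seed):
            xpred[0,t,char_int[char]] = 1.0
        # predict next-character probabilities and sample one character
        ypred = model.predict(xpred, verbose=0)[0]
        nindex = sample(ypred)
        nchar = int_char[nindex]
        # slide the window forward by one character
        seed = seed[1:] + nchar
        op.append(nchar)
    return ''.join(op)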

In [42]:
my_lil_generator(25)

'.Fou thise fuo se\n\nO d ai'
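
The `sample` helper accepts a `temperature` argument, although the generator above always uses the default of 1.0. As a rough illustration of what that parameter does, the sketch below rescales a made-up probability vector at a low and a high temperature. The function name `apply_temperature` is hypothetical, and the small epsilon and max-subtraction are added here for numerical safety; they are assumptions, not part of the original helper.

```python
import numpy as np

def apply_temperature(preds, temperature=1.0):
    """Rescale a probability vector; hypothetical helper for illustration."""
    preds = np.asarray(preds, dtype='float64')
    # epsilon guards against log(0); not present in the original sample()
    logits = np.log(preds + 1e-12) / temperature
    # subtracting the max keeps exp() from overflowing
    exp_preds = np.exp(logits - logits.max())
    return exp_preds / exp_preds.sum()

probs = np.array([0.7, 0.2, 0.1])       # made-up model output
cold = apply_temperature(probs, 0.5)    # sharper: favors the top choice
hot = apply_temperature(probs, 2.0)     # flatter: closer to uniform

print(cold.round(3), hot.round(3))
```

Lower temperatures tend to give more conservative, repetitive text, while higher temperatures give more varied but more error-prone output.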

# Resources and Stretch Goals

## Stretch goals:
- Refine the training and generation of text to be able to ask for different genres/styles of Shakespearean text (e.g. plays versus sonnets)
- Train a classification model that takes text and returns which work of Shakespeare it is most likely to be from
- Make it more performant! Many possible routes here - lean on Keras, optimize the code, and/or use more resources (AWS, etc.)
- Revisit the news example from class, and improve it - use categories or tags to refine the model/generation, or train a news classifier
- Run on bigger, better data

## Resources:
- [The Unreasonable Effectiveness of Recurrent Neural Networks](https://karpathy.github.io/2015/05/21/rnn-effectiveness/) - a seminal writeup demonstrating a simple but effective character-level NLP RNN
- [Simple NumPy implementation of RNN](https://github.com/JY-Yoon/RNN-Implementation-using-NumPy/blob/master/RNN%20Implementation%20using%20NumPy.ipynb) - Python 3 version of the code from "Unreasonable Effectiveness"
- [TensorFlow RNN Tutorial](https://github.com/tensorflow/models/tree/master/tutorials/rnn) - code for training an RNN on the Penn Tree Bank language dataset
- [4 part tutorial on RNN](http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/) - relates RNN to the vanishing gradient problem, and provides example implementation
- [RNN training tips and tricks](https://github.com/karpathy/char-rnn#tips-and-tricks) - some rules of thumb for parameterizing and training your RNN