<a href="https://colab.research.google.com/github/jsmazorra/DS-Unit-4-Sprint-3-Deep-Learning/blob/master/module1-rnn-and-lstm/Johan_Mazorra_LS_DS13_431_RNN_and_LSTM_Assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>
<br></br>

## *Data Science Unit 4 Sprint 3 Assignment 1*

# Recurrent Neural Networks and Long Short-Term Memory (LSTM)

![Monkey at a typewriter](https://upload.wikimedia.org/wikipedia/commons/thumb/3/3c/Chimpanzee_seated_at_typewriter.jpg/603px-Chimpanzee_seated_at_typewriter.jpg)

It is said that [infinite monkeys typing for an infinite amount of time](https://en.wikipedia.org/wiki/Infinite_monkey_theorem) will eventually type, among other things, the complete works of William Shakespeare. Let's see if we can get there a bit faster, with the power of Recurrent Neural Networks and LSTM.

This text file contains the complete works of Shakespeare: https://www.gutenberg.org/files/100/100-0.txt

Use it as training data for an RNN. You can keep it simple and train at the character level, which is the suggested initial approach.

Then, use that trained RNN to generate Shakespearean-ish text. Your goal - a function that takes, as an argument, the size of the text to generate (e.g. number of characters or lines) and returns generated text of that size.

Note - Shakespeare wrote an awful lot. It's OK, especially initially, to sample/use smaller data and parameters, so you can have a tighter feedback loop when you're trying to get things running. Then, once you've got a proof of concept - start pushing it more!

In [1]:
# Import the necessary libraries.

# A __future__ import must be the first statement in the cell, or Python raises a SyntaxError.
from __future__ import print_function

import numpy as np
import random
import sys
import io
from tensorflow.keras.callbacks import LambdaCallback
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding
from tensorflow.keras.layers import LSTM
from keras.utils.data_utils import get_file
from tensorflow.keras.optimizers import RMSprop

Using TensorFlow backend.


In [2]:
# Text that we're going to use, the complete works of Shakespeare.

path = get_file('Shakespeare.txt', origin='https://www.gutenberg.org/files/100/100-0.txt')

with io.open(path, encoding='utf-8') as f:
  text = f.read().lower()
  print('corpus length:', len(text))

Downloading data from https://www.gutenberg.org/files/100/100-0.txt
corpus length: 5573152


In [3]:
# Let's build the vocabulary: the sorted set of unique characters in the corpus.

chars = sorted(list(set(text)))
print('total chars:', len(chars))

total chars: 79


In [4]:
chars

['\t',
 '\n',
 ' ',
 '!',
 '"',
 '#',
 '$',
 '%',
 '&',
 "'",
 '(',
 ')',
 '*',
 ',',
 '-',
 '.',
 '/',
 '0',
 '1',
 '2',
 '3',
 '4',
 '5',
 '6',
 '7',
 '8',
 '9',
 ':',
 ';',
 '?',
 '@',
 '[',
 '\\',
 ']',
 '_',
 '`',
 'a',
 'b',
 'c',
 'd',
 'e',
 'f',
 'g',
 'h',
 'i',
 'j',
 'k',
 'l',
 'm',
 'n',
 'o',
 'p',
 'q',
 'r',
 's',
 't',
 'u',
 'v',
 'w',
 'x',
 'y',
 'z',
 '|',
 '}',
 'à',
 'â',
 'æ',
 'ç',
 'è',
 'é',
 'ê',
 'î',
 'œ',
 '—',
 '‘',
 '’',
 '“',
 '”',
 '\ufeff']

In [0]:
# Let's build lookup tables so characters and indices can be converted in both directions.

char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))
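The two dictionaries are inverses of each other, so encoding a string to indices and decoding it again should return the original string. A minimal sketch of that round trip follows; the toy string and the `toy_*` names are made up for illustration and are not part of the notebook.

```python
# Toy corpus standing in for the Shakespeare text (an assumption for illustration).
toy_text = "to be or not to be"
toy_chars = sorted(set(toy_text))

# Same construction as above, written as dict comprehensions.
toy_char_indices = {c: i for i, c in enumerate(toy_chars)}
toy_indices_char = {i: c for i, c in enumerate(toy_chars)}

# Encode to integer indices, then decode back to text.
encoded = [toy_char_indices[c] for c in toy_text]
decoded = "".join(toy_indices_char[i] for i in encoded)

print(toy_chars)
print(encoded[:5])
print(decoded == toy_text)
```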

In [0]:
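maxlen = 40  # length of each input window, in characters
step = 3     # stride between the starts of consecutive windows
sentences = []   # input windows
next_chars = []  # the character that follows each window (the prediction target)
for i in range(0, len(text) - maxlen, step):
  sentences.append(text[i: i + maxlen])
  next_chars.append(text[i + maxlen])
# Report the total once, after the loop has finished.
print('nb sequences:', len(sentences))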
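To see what the sliding window produces, here is a small sketch that applies the same slicing to a ten-character string. The helper name `make_windows` and the toy input are assumptions for illustration only.

```python
def make_windows(text, maxlen, step):
    """Return (windows, targets) using the same slicing as the cell above."""
    windows, targets = [], []
    for i in range(0, len(text) - maxlen, step):
        windows.append(text[i:i + maxlen])   # input: maxlen characters
        targets.append(text[i + maxlen])     # label: the character that follows
    return windows, targets

toy_windows, toy_targets = make_windows("abcdefghij", maxlen=4, step=3)
print(toy_windows)  # ['abcd', 'defg']
print(toy_targets)  # ['e', 'h']
```

Each window overlaps the previous one by `maxlen - step` characters, so a smaller `step` yields more (and more redundant) training examples.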

In [8]:
print('Vectorization...')
# Use the builtin bool: the np.bool alias is deprecated and was removed in NumPy 1.24.
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1

Vectorization...
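The vectorization step one-hot encodes every window, which is worth checking on a toy input before running it on the full corpus. The sketch below wraps the same loops in a hypothetical helper, `one_hot_windows`, and then estimates the size of `x` for the corpus length printed earlier (5573152), assuming one byte per boolean element, which is how NumPy stores `bool` arrays.

```python
import numpy as np

def one_hot_windows(windows, targets, char_indices):
    """One-hot encode windows (n, maxlen, vocab) and their targets (n, vocab)."""
    n, maxlen, vocab = len(windows), len(windows[0]), len(char_indices)
    x = np.zeros((n, maxlen, vocab), dtype=bool)
    y = np.zeros((n, vocab), dtype=bool)
    for i, window in enumerate(windows):
        for t, ch in enumerate(window):
            x[i, t, char_indices[ch]] = True
        y[i, char_indices[targets[i]]] = True
    return x, y

# Toy vocabulary and windows (assumptions for illustration).
toy_index = {c: i for i, c in enumerate("abcde")}
x_toy, y_toy = one_hot_windows(["abcd", "bcde"], ["e", "a"], toy_index)
print(x_toy.shape, y_toy.shape)

# Rough size of x for the full corpus with maxlen=40, step=3 and 79 characters.
n_sequences = len(range(0, 5573152 - 40, 3))
bytes_x = n_sequences * 40 * 79
print(n_sequences, round(bytes_x / 1e9, 2), "GB")
```

At several gigabytes, the full array may not fit in memory on a small machine, which is one reason the assignment suggests starting with a slice of the text.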


In [9]:
# Let's build the LSTM model.

print('Building model')
model = Sequential()
model.add(LSTM(128, input_shape=(maxlen, len(chars))))
model.add(Dense(len(chars), activation='softmax'))
optimizer = RMSprop(learning_rate=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)

Building model


In [0]:
# Let's define the sampling helper used during text generation.

def sample(preds, temperature=1.0):
    # Helper function to sample an index from a probability array.
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
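The `temperature` argument controls how adventurous the sampling is: dividing the log-probabilities by a value below 1 sharpens the distribution toward the most likely character, while 1.0 leaves it unchanged. The sketch below isolates that rescaling in a hypothetical helper, `apply_temperature`, on a made-up three-way distribution. Subtracting the maximum before exponentiating is an added numerical-stability step that does not change the result.

```python
import numpy as np

def apply_temperature(preds, temperature=1.0):
    """Rescale a probability vector the same way sample() does, without sampling."""
    preds = np.asarray(preds, dtype="float64")
    logits = np.log(preds) / temperature
    exp_logits = np.exp(logits - logits.max())  # max-shift for stability (an addition)
    return exp_logits / exp_logits.sum()

probs = np.array([0.6, 0.3, 0.1])  # illustrative distribution, not model output
for t in (0.2, 0.5, 1.0):
    print(t, np.round(apply_temperature(probs, t), 3))
```

This is why the low-diversity samples printed during training tend to repeat common words, while the samples at diversity 1.0 contain more invented ones.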

In [0]:
def on_epoch_end(epoch, _):
    # Function invoked at end of each epoch. Prints generated text.
    print()
    print('----- Generating text after Epoch: %d' % epoch)

    start_index = random.randint(0, len(text) - maxlen - 1)
    for diversity in [0.2, 0.5, 1.0]:
        print('----- diversity:', diversity)

        generated = ''
        sentence = text[start_index: start_index + maxlen]
        generated += sentence
        print('----- Generating with seed: "' + sentence + '"')
        sys.stdout.write(generated)

        for i in range(100):
            x_pred = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(sentence):
                x_pred[0, t, char_indices[char]] = 1.

            preds = model.predict(x_pred, verbose=0)[0]
            next_index = sample(preds, diversity)
            next_char = indices_char[next_index]

            sentence = sentence[1:] + next_char

            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()

# Register the hook so Keras calls it after every training epoch.
print_callback = LambdaCallback(on_epoch_end=on_epoch_end)

In [12]:
model.fit(x, y,
          batch_size=64,
          epochs=20,
          callbacks=[print_callback])

Epoch 1/20
----- Generating text after Epoch: 0
----- diversity: 0.2
----- Generating with seed: "even to the point of envy, if ’twere mad"
even to the point of envy, if ’twere made the possession the strong and the father
                                                         
----- diversity: 0.5
----- Generating with seed: "even to the point of envy, if ’twere mad"
even to the point of envy, if ’twere made me, wilt thou hast our for her heart.

                                                        exi
----- diversity: 1.0
----- Generating with seed: "even to the point of envy, if ’twere mad"
even to the point of envy, if ’twere made not mine,
    are us of honour perrive henouthears; for give in ’tis relel you. you see altorest w
Epoch 2/20
----- Generating text after Epoch: 1
----- diversity: 0.2
----- Generating with seed: "h’untented woundings of a father’s curse"
h’untented woundings of a father’s curse ho sher re  the   iou th  an the ts  a     th  w  is tnd    pand  t     

  


 e           th      ho          n      e        th beat              hor         e  h 
----- diversity: 0.5
----- Generating with seed: ", i forgo;
    my acts, decrees, and sta"
, i forgo;
    my acts, decrees, and sta   a  e raneea a d  e ate     h t m   laftou .e,       oh o
i
o
o l e   t o e.
 in       e rbo t    
----- diversity: 1.0
----- Generating with seed: ", i forgo;
    my acts, decrees, and sta"
, i forgo;
    my acts, decrees, and sta o  w o y nees,esl,tha h!lonmo ttausca fcalllyen lpas a ’   raersonnha.
bs ahs’yaamdebo
a kebokbndt 
Epoch 4/20
----- Generating text after Epoch: 3
----- diversity: 0.2
----- Generating with seed: "gainst knaves and thieves men shut their"
gainst knaves and thieves men shut theirn       a ne 
 ao nd      l  trar    ea  hi         a   ana   thaor     an     w   o   anare    aa  
----- diversity: 0.5
----- Generating with seed: "gainst knaves and thieves men shut their"
gainst knaves and thieves men shut their     teree  tha  yt e caeier a   

<tensorflow.python.keras.callbacks.History at 0x7f01d0411208>
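The assignment asks for a function that takes the size of the text to generate and returns text of that size. One possible sketch is below. The name `generate_text`, its signature, and the `predict_fn` callable are assumptions rather than part of the original notebook; the loop mirrors `on_epoch_end`, but it returns the text instead of printing it and samples with `numpy.random.Generator.choice` instead of `multinomial`. A uniform stand-in predictor is used so the block runs without a trained model.

```python
import numpy as np

def generate_text(predict_fn, seed, length, char_indices, indices_char,
                  temperature=1.0, rng=None):
    """Return `length` generated characters that continue `seed`.

    predict_fn maps a one-hot array of shape (1, len(seed), vocab) to a
    probability vector of shape (vocab,). This is a hypothetical interface.
    """
    if rng is None:
        rng = np.random.default_rng()
    maxlen, vocab = len(seed), len(char_indices)
    window, out = seed, []
    for _ in range(length):
        # One-hot encode the current window.
        x_pred = np.zeros((1, maxlen, vocab))
        for t, ch in enumerate(window):
            x_pred[0, t, char_indices[ch]] = 1.0
        # Temperature-scale the predicted distribution, then sample from it.
        preds = np.asarray(predict_fn(x_pred), dtype="float64")
        logits = np.log(preds + 1e-12) / temperature
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        next_char = indices_char[int(rng.choice(vocab, p=probs))]
        out.append(next_char)
        window = window[1:] + next_char  # slide the window forward by one
    return "".join(out)

# Stand-in vocabulary and predictor (assumptions, so no trained model is needed).
demo_chars = sorted(set("to be or not to be"))
demo_ci = {c: i for i, c in enumerate(demo_chars)}
demo_ic = {i: c for c, i in demo_ci.items()}
uniform = lambda x: np.full(len(demo_ci), 1.0 / len(demo_ci))

sample_text = generate_text(uniform, "to be", 20, demo_ci, demo_ic,
                            rng=np.random.default_rng(0))
print(len(sample_text))
```

With the trained model from this notebook, `predict_fn` could be `lambda x: model.predict(x, verbose=0)[0]`, and `seed` should be a `maxlen`-character slice of `text` so that every character is in `char_indices`.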

In [13]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
lstm (LSTM)                  (None, 128)               106496    
_________________________________________________________________
dense (Dense)                (None, 79)                10191     
=================================================================
Total params: 116,687
Trainable params: 116,687
Non-trainable params: 0
_________________________________________________________________
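The parameter counts in the summary can be reproduced by hand. An LSTM layer has four gates, each with input weights, recurrent weights, and a bias; the dense layer has one weight per input-output pair plus a bias per output. The helper names below are made up for this sketch.

```python
def lstm_params(input_dim, units):
    # 4 gates x (input weights + recurrent weights + bias)
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # weight matrix plus one bias per output unit
    return input_dim * units + units

vocab_size, hidden = 79, 128  # values used in this notebook
lstm_count = lstm_params(vocab_size, hidden)
dense_count = dense_params(hidden, vocab_size)
print(lstm_count, dense_count, lstm_count + dense_count)  # 106496 10191 116687
```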


# Resources and Stretch Goals

## Stretch goals:
- Refine the training and generation of text to be able to ask for different genres/styles of Shakespearean text (e.g. plays versus sonnets)
- Train a classification model that takes text and returns which work of Shakespeare it is most likely to be from
- Make it more performant! Many possible routes here - lean on Keras, optimize the code, and/or use more resources (AWS, etc.)
- Revisit the news example from class, and improve it - use categories or tags to refine the model/generation, or train a news classifier
- Run on bigger, better data

## Resources:
- [The Unreasonable Effectiveness of Recurrent Neural Networks](https://karpathy.github.io/2015/05/21/rnn-effectiveness/) - a seminal writeup demonstrating a simple but effective character-level NLP RNN
- [Simple NumPy implementation of RNN](https://github.com/JY-Yoon/RNN-Implementation-using-NumPy/blob/master/RNN%20Implementation%20using%20NumPy.ipynb) - Python 3 version of the code from "Unreasonable Effectiveness"
- [TensorFlow RNN Tutorial](https://github.com/tensorflow/models/tree/master/tutorials/rnn) - code for training an RNN on the Penn Treebank language dataset
- [4 part tutorial on RNN](http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/) - relates RNNs to the vanishing gradient problem and provides an example implementation
- [RNN training tips and tricks](https://github.com/karpathy/char-rnn#tips-and-tricks) - some rules of thumb for parameterizing and training your RNN