<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>
<br></br>

## *Data Science Unit 4 Sprint 3 Assignment 1*

# Recurrent Neural Networks and Long Short-Term Memory (LSTM)

![Monkey at a typewriter](https://upload.wikimedia.org/wikipedia/commons/thumb/3/3c/Chimpanzee_seated_at_typewriter.jpg/603px-Chimpanzee_seated_at_typewriter.jpg)

It is said that [infinite monkeys typing for an infinite amount of time](https://en.wikipedia.org/wiki/Infinite_monkey_theorem) will eventually type, among other things, the complete works of William Shakespeare. Let's see if we can get there a bit faster, with the power of Recurrent Neural Networks and LSTM.

This text file contains the complete works of Shakespeare: https://www.gutenberg.org/files/100/100-0.txt

Use it as training data for an RNN. You can keep it simple and train at the character level, which is the suggested initial approach.

Then, use that trained RNN to generate Shakespearean-ish text. Your goal: a function that takes the size of text to generate (e.g. number of characters or lines) as an argument and returns generated text of that size.

Note - Shakespeare wrote an awful lot. It's OK, especially initially, to sample/use smaller data and parameters, so you can have a tighter feedback loop when you're trying to get things running. Then, once you've got a proof of concept - start pushing it more!
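
The character-level approach suggested above could start from something like the following sketch. It is a minimal illustration, not part of the assignment's starter code: the helper names `build_char_vocab` and `make_windows`, the sample sentence, and the window length of 5 are all assumptions chosen for the example.

```python
def build_char_vocab(text):
    """Map each distinct character to an integer id (sorted for determinism)."""
    chars = sorted(set(text))
    char_to_id = {c: i for i, c in enumerate(chars)}
    id_to_char = {i: c for c, i in char_to_id.items()}
    return char_to_id, id_to_char


def make_windows(text, char_to_id, window=5):
    """Slide a fixed-size window over the text.

    Each input is `window` character ids; the target is the id of the
    character that immediately follows the window.
    """
    ids = [char_to_id[c] for c in text]
    inputs, targets = [], []
    for i in range(len(ids) - window):
        inputs.append(ids[i:i + window])
        targets.append(ids[i + window])
    return inputs, targets


sample = "to be or not to be"
char_to_id, id_to_char = build_char_vocab(sample)
inputs, targets = make_windows(sample, char_to_id, window=5)
```

Because every window has the same length, the inputs are already rectangular and need no padding before being turned into a tensor.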

In [1]:
import requests
import pandas as pd

In [2]:
url = "https://www.gutenberg.org/files/100/100-0.txt"

r = requests.get(url)
r.encoding = r.apparent_encoding
data = r.text
data = data.split('\r\n')
toc = [l.strip() for l in data[44:130:2]]
# Skip the Table of Contents
data = data[135:]

# Fixing Titles
toc[9] = 'THE LIFE OF KING HENRY V'
toc[18] = 'MACBETH'
toc[24] = 'OTHELLO, THE MOOR OF VENICE'
toc[34] = 'TWELFTH NIGHT: OR, WHAT YOU WILL'

locations = {id_:{'title':title, 'start':-99} for id_,title in enumerate(toc)}

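# Record the line index at which each title appears
# (if a title matches more than one line, the last match wins)
for e,i in enumerate(data):
    for t,title in enumerate(toc):
        if title in i:
            locations[t].update({'start':e})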
            

df_toc = pd.DataFrame.from_dict(locations, orient='index')
df_toc['end'] = df_toc['start'].shift(-1).apply(lambda x: x-1)
df_toc.loc[42, 'end'] = len(data)
df_toc['end'] = df_toc['end'].astype('int')

df_toc['text'] = df_toc.apply(lambda x: '\r\n'.join(data[ x['start'] : int(x['end']) ]), axis=1)
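
One detail of the cell above: each `end` is set to the next `start` minus one, and the text is then taken with `data[start:end]`. Since Python slices exclude the end index, this appears to leave out the final line before each new title (most likely a blank line, so the effect is small). A sketch of the same bookkeeping with exclusive ends, using a hypothetical helper `title_spans` and the first few start values from the table shown further down:

```python
def title_spans(starts, total_lines):
    """Pair each start index with an exclusive end index.

    The end of one work is the start of the next; the last work runs
    to `total_lines`.
    """
    ends = starts[1:] + [total_lines]
    return list(zip(starts, ends))


# Start indices taken from the first rows of df_toc; 14632 is the next start.
spans = title_spans([2777, 7739, 11841], total_lines=14632)
```

With spans of this form, `data[start:end]` covers every line of a work exactly once.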

In [3]:
#Shakespeare Data Parsed by Play
df_toc.head()

| | title | start | end | text |
|---|---|---|---|---|
| 0 | ALL’S WELL THAT ENDS WELL | 2777 | 7738 | ALL’S WELL THAT ENDS WELL\r\n\r\n\r\n\r\nConte... |
| 1 | THE TRAGEDY OF ANTONY AND CLEOPATRA | 7739 | 11840 | THE TRAGEDY OF ANTONY AND CLEOPATRA\r\n\r\nDRA... |
| 2 | AS YOU LIKE IT | 11841 | 14631 | AS YOU LIKE IT\r\n\r\nDRAMATIS PERSONAE.\r\n\r... |
| 3 | THE COMEDY OF ERRORS | 14632 | 17832 | THE COMEDY OF ERRORS\r\n\r\n\r\n\r\nContents\r... |
| 4 | THE TRAGEDY OF CORIOLANUS | 17833 | 27806 | THE TRAGEDY OF CORIOLANUS\r\n\r\nDramatis Pers... |


In [5]:
top_k_words = 10000

import tensorflow as tf

tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=top_k_words,
                                                  oov_token="<unk>")

In [6]:
sequences = []
# Characters per chunk; raised to 350 from the original 101
# (the outputs below were produced with 101)
maxlen = 350

def split_text(text):
  # Cut one work into consecutive, non-overlapping chunks of `maxlen` characters
  for i in range(0, len(text), maxlen):
    seq = text[i:i+maxlen]
    sequences.append(seq)

# A plain loop avoids the Series of None values that .apply() would echo
for text in df_toc['text']:
  split_text(text)

In [7]:
sequences[0]

'ALL’S WELL THAT ENDS WELL\r\n\r\n\r\n\r\nContents\r\n\r\nACT I\r\nScene I. Rossillon. A room in the Countess’s pala'

In [8]:
len(sequences)

55819

In [9]:
len(sequences[0])

101

In [10]:
len(sequences[-1])

44

In [11]:
### Now need to learn the vocab

tokenizer.fit_on_texts(sequences)

In [12]:
train_seqs = tokenizer.texts_to_sequences(sequences)
# Produces integer encoding

In [13]:
train_seqs[0]

[3375,
 64,
 12,
 1991,
 1666,
 2,
 2,
 2,
 3722,
 2,
 259,
 282,
 109,
 5,
 2780,
 8,
 417,
 11,
 3,
 5274,
 7960]
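
The repeated id `2` in this output decodes to a bare `\r` (the later `sequences_to_texts([[333, 2]])` call returns `'same \r'`). A likely cause, assuming the Keras `Tokenizer` default `filters` string (which includes `\n` and `\t` but not `\r`), is that the `\r\n` line endings leave carriage returns behind, either as tokens of their own or glued to the preceding word. The pure-Python sketch below mimics only the lowercase, filter and split steps; `simple_tokenize` is a hypothetical stand-in, not the Keras implementation.

```python
# Assumed to match the Keras Tokenizer default `filters` argument.
DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def simple_tokenize(text, filters=DEFAULT_FILTERS):
    """Lowercase, replace filtered characters with spaces, split on spaces."""
    table = str.maketrans({c: " " for c in filters})
    return [tok for tok in text.lower().translate(table).split(" ") if tok]


chunk = ("ALL’S WELL THAT ENDS WELL\r\n\r\n\r\n\r\nContents\r\n\r\nACT I\r\n"
         "Scene I. Rossillon. A room in the Countess’s pala")

# With the default filters, "\r" survives as a token and as a word suffix.
tokens = simple_tokenize(chunk)

# Adding "\r" to the filters strips the carriage returns as well.
clean_tokens = simple_tokenize(chunk, filters=DEFAULT_FILTERS + "\r")
```

Under that assumption the sketch yields 21 tokens for this chunk, the same count as the 21 ids above, with `well\r` and `well` ending up as different vocabulary entries. Passing a `filters` string that also contains `\r`, or joining the lines with `'\n'` instead of `'\r\n'`, are two possible ways to avoid this.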

In [14]:
len(train_seqs) == len(sequences)

True

In [18]:
# Add step to resolve IndexError - filter out sequences that are too short to split

train_seqs = [x for x in train_seqs if len(x) >= 2]  # keep only sequences with at least 2 tokens (input + target)

In [19]:
### Will try to predict word sequences as opposed to characters

target = [x[-1] for x in train_seqs]

train_seqs = [x[:-1] for x in train_seqs]

# Before the filtering step was added, this raised
# IndexError: list index out of range (the last item was probably blank)
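
The filter-then-split step can be seen on toy data. This is a small sketch with a hypothetical helper `split_inputs_targets`; the token ids are made up apart from the first few, which are borrowed from `train_seqs[0]`.

```python
def split_inputs_targets(token_seqs, min_len=2):
    """Drop sequences too short to split, then use the last token as the label."""
    kept = [seq for seq in token_seqs if len(seq) >= min_len]
    inputs = [seq[:-1] for seq in kept]   # everything except the last token
    targets = [seq[-1] for seq in kept]   # the last token is what we predict
    return inputs, targets


toy = [[3375, 64, 12], [7], [], [5, 8]]
inputs, targets = split_inputs_targets(toy)
```

Note that this setup produces a single training example per chunk: only the final word of each chunk is ever used as a target.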

In [None]:
# turn into tensors
x = tf.convert_to_tensor(train_seqs, dtype=tf.int16)
y = tf.convert_to_tensor(target, dtype=tf.int16)

# Got error >> ValueError: Can't convert non-rectangular Python sequence to Tensor.

# Error was from the different sequences being different shapes/lengths

In [25]:
# The fix is to pad the sequences to a common length and then convert them
# to tensors, which are basically the tf version of np arrays.
# maxlen=50 gives every sequence a length of exactly 50 tokens.

# turn into tensors
train_seqs = tf.keras.preprocessing.sequence.pad_sequences(train_seqs,
                                                           maxlen=50,
                                                           dtype='int32',
                                                           padding='pre')
x = tf.convert_to_tensor(train_seqs, dtype=tf.int16)
y = tf.convert_to_tensor(target, dtype=tf.int16)

In [26]:
x.shape

TensorShape([55818, 50])
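
To make the effect of `padding='pre'` concrete, here is a pure-Python sketch of what the call does to each sequence. It assumes the `pad_sequences` defaults for the arguments not set above (`truncating='pre'` and `value=0`); `pad_pre` is a hypothetical helper written for illustration, not the Keras code.

```python
def pad_pre(seqs, maxlen, value=0):
    """Left-pad short sequences with `value`; keep the last `maxlen` ids of long ones."""
    padded = []
    for seq in seqs:
        tail = list(seq)[-maxlen:]  # 'pre' truncation drops ids from the front
        padded.append([value] * (maxlen - len(tail)) + tail)
    return padded


toy = [[5, 8], [1, 2, 3, 4, 5, 6]]
padded = pad_pre(toy, maxlen=4)
```

Pre-padding keeps the most recent words next to the end of the sequence, which is where the LSTM produces the output used for prediction.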

In [27]:
### Now have a vocab and transformed sequences, next step is to build out model

embedding_dim = 256

model = tf.keras.Sequential(
    [
     tf.keras.layers.Embedding(10_000, embedding_dim), #> 10_000 words in vocab
     tf.keras.layers.LSTM(128),
     tf.keras.layers.Dense(10_000, activation='softmax')
    ]
)

model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, None, 256)         2560000   
_________________________________________________________________
lstm_1 (LSTM)                (None, 128)               197120    
_________________________________________________________________
dense_1 (Dense)              (None, 10000)             1290000   
Total params: 4,047,120
Trainable params: 4,047,120
Non-trainable params: 0
_________________________________________________________________
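
The parameter counts in the summary can be reproduced by hand, which is a useful check that the layers are wired as intended. The arithmetic below uses the standard counts for embedding, LSTM and dense layers; the variable names are chosen for this sketch.

```python
vocab_size = 10_000
embedding_dim = 256
lstm_units = 128

# Embedding: one vector of size embedding_dim per vocabulary entry.
embedding_params = vocab_size * embedding_dim

# LSTM: four gates, each with input weights, recurrent weights and a bias.
lstm_params = 4 * ((embedding_dim + lstm_units) * lstm_units + lstm_units)

# Dense: one weight per (LSTM unit, vocabulary entry) pair plus one bias per entry.
dense_params = lstm_units * vocab_size + vocab_size

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
# 2560000 197120 1290000 4047120
```

Most of the weights sit in the embedding and output layers, both of which scale with the vocabulary size.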


In [28]:
model.compile(optimizer='nadam', loss='sparse_categorical_crossentropy',
              metrics=['acc'])

In [29]:
model.fit(x, y,
          validation_split=.15,
          epochs=50,
          batch_size=64)

Epoch 1/50
...
Epoch 44/50

KeyboardInterrupt: ignored

In [None]:
# Once the model is trained, we now have word embeddings

In [30]:
def generate_text(model, start_string):

  # How many words to generate
  num_generate = 100

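  # Convert our start_string to tokens
  input_eval = tokenizer.texts_to_sequences([start_string])
  # NOTE: the next call pads train_seqs rather than input_eval,
  # so the tokenized start string is overwritten
  input_eval = tf.keras.preprocessing.sequence.pad_sequences(train_seqs,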
                                                           maxlen=50,
                                                           dtype='int32',
                                                           padding='pre')
  input_eval = tf.expand_dims(input_eval, 0)

  # Empty string to store our new results
  text_generated = []

  # Temperature
  temperature = 1.0

  # Resetting any previous predictions
  model.reset_states()
  for i in range(num_generate):
    predictions = model(input_eval)
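    # Remove the batch dimension
    # NOTE: the result is stored in `prediction` and never used below
    prediction = tf.squeeze(predictions, 0)

    # Using categorical distribution to predict the word returned by model
    # NOTE: tf.random.categorical expects logits, but the softmax layer
    # outputs probabilities
    predictions = predictions / temperature
    predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()

    # We pass the predicted word as the next input to the model.
    # The LSTM layer is not stateful, so no hidden state is carried over.
    input_eval = tf.expand_dims([predicted_id], 0)

    # NOTE: sequences_to_texts expects a list of sequences, e.g. [[predicted_id]]
    text_generated.append(tokenizer.sequences_to_texts(predicted_id)[0])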

  return (start_string + ''.join(text_generated))

In [None]:
print(generate_text(model, start_string='Othello: '))

# Generates error
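
Apart from whatever raised the error, the sampling step in this first attempt deserves a second look. The model ends in a softmax layer, so its outputs are probabilities, while `tf.random.categorical` interprets its input as unnormalized log-probabilities (logits). Temperature is normally applied in log space. The following is a framework-free sketch of that idea; `sample_with_temperature` is a hypothetical helper and the probabilities are made up.

```python
import math
import random


def sample_with_temperature(probs, temperature=1.0, rng=random):
    """Sample an index from `probs` after rescaling by `temperature`.

    Probabilities are moved to log space, divided by the temperature and
    re-exponentiated, so low temperatures sharpen the distribution and
    high temperatures flatten it.
    """
    logits = [math.log(max(p, 1e-12)) / temperature for p in probs]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    return rng.choices(range(len(probs)), weights=weights, k=1)[0]


probs = [0.1, 0.6, 0.3]
rng = random.Random(0)  # seeded so the example is repeatable
cold = [sample_with_temperature(probs, temperature=0.01, rng=rng) for _ in range(20)]
warm = [sample_with_temperature(probs, temperature=1.0, rng=rng) for _ in range(200)]
```

At a temperature close to zero the sampler behaves like `argmax` and always returns index 1 here; at 1.0 it draws from the original distribution.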

In [31]:
import numpy as np

np.argmax(model.predict([[333]]))



2

In [32]:
tokenizer.sequences_to_texts([[333, 2]])

['same \r']

In [34]:
### Modify function to actually generate the text

def generate_text(model, start_string):

  # How many words to generate
  num_generate = 100

  gen_seq = []

  start = tokenizer.texts_to_sequences([start_string])

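  for _ in range(num_generate):

    # NOTE: `start` is never updated inside the loop, so every pass sees the
    # same input and argmax returns the same word each time
    pred = model.predict(start)
    pred_word = np.argmax(pred)
    gen_seq.append(pred_word)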

  print(gen_seq)
  text = tokenizer.sequences_to_texts([gen_seq])[0]

  return start_string + text

In [35]:
print(generate_text(model, 'Othello \r'))

# Printing

[422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422, 422]
Othello yours yours yours yours yours yours yours yours yours yours yours yours yours yours yours yours yours yours yours yours yours yours yours yours yours yours yours yours yours yours yours yours yours yours yours yours yours yours yours yours yours yours yours yours yours yours yours yours yours yours yours yours yours yours yours yours yours yours yours yours yours yours yours yours yours yours yours yours yours yours yours yours yours yours yours yours yours yours yours yours yours your
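
The output repeats a single word because the loop in `generate_text` calls the model on the same `start` input every time and always takes the `argmax`. Text generation is usually autoregressive: each predicted token is appended to the context before the next prediction. The sketch below shows only that loop structure. `generate_ids` and `toy_predict_next` are hypothetical names, and the toy predictor stands in for a trained model so the example runs without TensorFlow.

```python
def generate_ids(predict_next, start_ids, num_generate, window=50):
    """Autoregressive loop: feed each prediction back in before predicting again.

    `predict_next` is any callable that maps a list of token ids to the
    next token id (for example a wrapper around a model's predict call
    followed by argmax or sampling).
    """
    context = list(start_ids)
    generated = []
    for _ in range(num_generate):
        next_id = predict_next(context[-window:])  # only the latest `window` ids
        generated.append(next_id)
        context.append(next_id)  # the prediction becomes part of the next input
    return generated


def toy_predict_next(context):
    """Stand-in for a trained model: returns the last id plus one, modulo 5."""
    return (context[-1] + 1) % 5


print(generate_ids(toy_predict_next, start_ids=[3], num_generate=7))
# [4, 0, 1, 2, 3, 4, 0]
```

Taking `num_generate` as an argument also matches the stated goal of a function that accepts the size of the text to generate. With the Keras model, `predict_next` would need to pad the context to the training length before calling the model, and sampling from the predicted distribution rather than always taking the `argmax` tends to reduce repetition.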

# Resources and Stretch Goals

## Stretch goals:
- Refine the training and generation of text to be able to ask for different genres/styles of Shakespearean text (e.g. plays versus sonnets)
- Train a classification model that takes text and returns which work of Shakespeare it is most likely to be from
- Make it more performant! Many possible routes here - lean on Keras, optimize the code, and/or use more resources (AWS, etc.)
- Revisit the news example from class, and improve it - use categories or tags to refine the model/generation, or train a news classifier
- Run on bigger, better data

## Resources:
- [The Unreasonable Effectiveness of Recurrent Neural Networks](https://karpathy.github.io/2015/05/21/rnn-effectiveness/) - a seminal writeup demonstrating a simple but effective character-level NLP RNN
- [Simple NumPy implementation of RNN](https://github.com/JY-Yoon/RNN-Implementation-using-NumPy/blob/master/RNN%20Implementation%20using%20NumPy.ipynb) - Python 3 version of the code from "Unreasonable Effectiveness"
- [TensorFlow RNN Tutorial](https://github.com/tensorflow/models/tree/master/tutorials/rnn) - code for training a RNN on the Penn Tree Bank language dataset
- [4 part tutorial on RNN](http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/) - relates RNN to the vanishing gradient problem, and provides example implementation
- [RNN training tips and tricks](https://github.com/karpathy/char-rnn#tips-and-tricks) - some rules of thumb for parameterizing and training your RNN