<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>
<br></br>

## *Data Science Unit 4 Sprint 3 Assignment 1*

# Recurrent Neural Networks and Long Short Term Memory (LSTM)

![Monkey at a typewriter](https://upload.wikimedia.org/wikipedia/commons/thumb/3/3c/Chimpanzee_seated_at_typewriter.jpg/603px-Chimpanzee_seated_at_typewriter.jpg)

It is said that [infinite monkeys typing for an infinite amount of time](https://en.wikipedia.org/wiki/Infinite_monkey_theorem) will eventually type, among other things, the complete works of William Shakespeare. Let's see if we can get there a bit faster, with the power of Recurrent Neural Networks and LSTM.

This text file contains the complete works of Shakespeare: https://www.gutenberg.org/files/100/100-0.txt

Use it as training data for an RNN. You can keep it simple and train at the character level, which is the suggested initial approach.
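
As a rough illustration of what training at the character level involves, the sketch below maps each character of a toy string to an integer, cuts the result into fixed-length windows, and pairs each window with the character that follows it. The helper name `make_windows` and the small `maxlen` and `step` values are illustrative assumptions, not part of the assignment.

```python
import numpy as np

def make_windows(text, maxlen=5, step=2):
    """Return one-hot inputs, one-hot targets, and the character vocabulary."""
    chars = sorted(set(text))                       # sorted for a reproducible mapping
    char_int = {c: i for i, c in enumerate(chars)}
    encoded = [char_int[c] for c in text]

    starts = range(0, len(encoded) - maxlen, step)
    x = np.zeros((len(starts), maxlen, len(chars)), dtype=bool)
    y = np.zeros((len(starts), len(chars)), dtype=bool)
    for n, i in enumerate(starts):
        # Mark one character per time step in window n
        x[n, np.arange(maxlen), encoded[i:i + maxlen]] = True
        # The target is the character right after the window
        y[n, encoded[i + maxlen]] = True
    return x, y, chars

x, y, chars = make_windows("to be or not to be")
print(x.shape, y.shape, len(chars))
```

Each row of `x` is one training example of shape `(maxlen, vocabulary size)`, and the matching row of `y` marks the next character.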

Then, use that trained RNN to generate Shakespearean-ish text. Your goal: a function that takes the size of the text to generate (e.g. number of characters or lines) as an argument and returns generated text of that size.
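
One possible shape for such a function is sketched below. It assumes a hypothetical callable `predict_proba` that stands in for the trained model: it takes a one-hot window of shape `(maxlen, len(chars))` and returns a probability for each character. The names `sample_index`, `generate_text`, and the `temperature` parameter are illustrative choices, not something the assignment prescribes.

```python
import numpy as np

def sample_index(probs, temperature=1.0, rng=None):
    """Draw one index from a probability vector, reweighted by temperature."""
    if rng is None:
        rng = np.random.default_rng()
    logits = np.log(np.asarray(probs, dtype=np.float64) + 1e-9) / temperature
    weights = np.exp(logits - logits.max())          # subtract max for stability
    return int(rng.choice(len(weights), p=weights / weights.sum()))

def generate_text(predict_proba, chars, seed, n_chars, maxlen=40,
                  temperature=1.0, rng=None):
    """Return `n_chars` generated characters that continue `seed`."""
    char_int = {c: i for i, c in enumerate(chars)}
    window = [char_int[c] for c in seed][-maxlen:]
    out = []
    for _ in range(n_chars):
        x = np.zeros((maxlen, len(chars)), dtype=bool)
        # Right-align the window so short seeds are left-padded with zeros
        rows = np.arange(maxlen - len(window), maxlen)
        x[rows, np.array(window, dtype=int)] = True
        idx = sample_index(predict_proba(x), temperature, rng)
        out.append(chars[idx])
        window = (window + [idx])[-maxlen:]           # slide the window forward
    return "".join(out)

# Stand-in for a trained model: every character is equally likely
chars = sorted(set("to be or not to be"))

def uniform(x):
    return np.full(len(chars), 1.0 / len(chars))

print(generate_text(uniform, chars, seed="to be", n_chars=20,
                    rng=np.random.default_rng(0)))
```

With a trained Keras model, `predict_proba` could plausibly be a small wrapper such as `lambda x: model.predict(x[np.newaxis], verbose=0)[0]`. Lower temperatures make the output more conservative; higher ones make it more varied.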

Note - Shakespeare wrote an awful lot. It's OK, especially initially, to sample/use smaller data and parameters, so you can have a tighter feedback loop when you're trying to get things running. Then, once you've got a proof of concept - start pushing it more!
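
One simple way to get that tighter feedback loop is to cut a contiguous slice out of the corpus before building sequences. A contiguous slice keeps words and lines intact, which sampling individual characters would not. The helper `take_slice` and the repeated one-line corpus below are hypothetical stand-ins for illustration.

```python
import random

def take_slice(text, n_chars, seed=None):
    """Return a contiguous slice of `text` that is at most `n_chars` long."""
    if n_chars >= len(text):
        return text
    # Pick a random start so repeated experiments can see different passages
    start = random.Random(seed).randrange(len(text) - n_chars + 1)
    return text[start:start + n_chars]

corpus = "Shall I compare thee to a summer's day? " * 1000  # stand-in for the full text
small = take_slice(corpus, 5_000, seed=42)
print(len(corpus), len(small))
```

Once the pipeline runs end to end on the small slice, the same code can be pointed at the full text.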

In [56]:
import pandas as pd
import tensorflow as tf
import io
import datetime
import numpy as np
from tensorflow.keras.optimizers import Adam
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Embedding,LSTM,Dense
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing import sequence

class text_generator():
    def load_txt(self, uri='./data/100-0.txt'):
        try:
            print("[text_generator:{}]> Reading data from {}:".format(datetime.time(), '100-0.txt'))
            text = []
            with open(uri, 'r', encoding='utf-8') as text_file:
                # add the data read to the text list
                text.append(text_file.read())
            self.text = text
               
        except Exception as e:
            print(e)
        return
    def prep_data(self):
        print("[text_generator:{}]> Preparing data:".format(datetime.time()))
        # Encode Data as Chars
        # Gather all text 
        # Why? 1. See all possible characters 2. For training / splitting later
        self.text = " ".join(self.text)
        
        # Unique Characters
        chars = list(set(self.text))

        # Lookup Tables
        char_int = {c:i for i, c in enumerate(chars)} 
        int_char = {i:c for i, c in enumerate(chars)} 
        # Create the sequence data

        maxlen = 40
        step = 5

        encoded = [char_int[c] for c in self.text]

        sequences = [] # Each element is 40 chars long
        next_char = [] # One element for each sequence

        for i in range(0, len(encoded) - maxlen, step):
    
            sequences.append(encoded[i : i + maxlen])
            next_char.append(encoded[i + maxlen])
        
        # Create x & y
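        # Create x & y (one-hot encoded)
        # np.bool was removed in NumPy 1.24; the builtin bool is equivalent
        x = np.zeros((len(sequences), maxlen, len(chars)), dtype=bool)
        y = np.zeros((len(sequences),len(chars)), dtype=bool)

        # Named seq so the loop does not shadow the imported `sequence` module
        for i, seq in enumerate(sequences):
            for t, char in enumerate(seq):
                x[i,t,char] = 1

            y[i, next_char[i]] = 1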
        

        self.X = x
        self.y = y
        return
    
    def val_split(self,X,y):
        X_train,y_train,X_test,y_test = train_test_split(self.X,self.y)
        self.X_train = X_train
        self.y_train = y_train
        self.X_test = X_test
        self.y_test = y_test
        return

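    def build_model(self, lr=0.0005):
        # prep_data builds X as (samples, maxlen, n_chars) and y as (samples, n_chars)
        maxlen, n_chars = self.X.shape[1:]
        opt = Adam(learning_rate=lr)
        model = Sequential()
        # X is already one-hot encoded, so the LSTM reads it directly (no Embedding layer)
        model.add(tf.keras.Input(shape=(maxlen, n_chars)))
        model.add(LSTM(128,activation='tanh',
                       recurrent_activation='sigmoid',
                       recurrent_dropout=0,
                       dropout=0.2,
                       use_bias=True,
                       unroll=False))

        # One output per character: softmax gives a distribution over the next character
        model.add(Dense(n_chars, activation='softmax'))

        model.compile(loss='categorical_crossentropy',
                      optimizer=opt, 
                      metrics=['accuracy'])
        return model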

tg = text_generator()


In [57]:
tg.load_txt()

[text_generator:00:00:00]> Reading data from 100-0.txt:


In [58]:
tg.prep_data()

[text_generator:00:00:00]> Preparing data:


In [59]:
tg.X.shape

(1109850, 40, 101)
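
That shape has a practical cost. NumPy stores each boolean as one byte, so the size of the dense one-hot array is the product of its dimensions. The quick check below works that out and compares it with storing only the integer character indices, which is an alternative encoding and an assumption here, not what the notebook above does.

```python
import numpy as np

n_sequences, maxlen, n_chars = 1_109_850, 40, 101   # shape reported by tg.X.shape

# One byte per boolean, one boolean per (sequence, time step, character)
one_hot_bytes = n_sequences * maxlen * n_chars * np.dtype(bool).itemsize
# One byte per (sequence, time step): 101 distinct characters fit in uint8
int_bytes = n_sequences * maxlen * np.dtype(np.uint8).itemsize

print(f"one-hot bool : {one_hot_bytes / 1e9:.2f} GB")
print(f"uint8 indices: {int_bytes / 1e9:.3f} GB")
```

The one-hot array comes to roughly 4.5 GB, about 101 times the size of the index array. Integer indices would need an `Embedding` layer or one-hot encoding done per batch.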

In [60]:
model = tg.build_model()

In [62]:
tg.val_split(tg.X,tg.y)



In [63]:
model.fit(tg.X_train,tg.y_train,epochs=5)

ValueError: Data cardinality is ambiguous:
  x sizes: 832387
  y sizes: 277463
Please provide data which shares the same first dimension.
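
The two sizes in this error add up to the 1,109,850 sequences in `tg.X` and match scikit-learn's default 75/25 split. That suggests the variable passed as `y` to `fit` is actually holding the test portion of `X`. `train_test_split` returns its outputs grouped by input array, in the order `X_train, X_test, y_train, y_test`. A minimal sketch with small placeholder arrays (the shapes and `random_state` are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.zeros((20, 4, 3), dtype=bool)   # placeholder for tg.X
y = np.zeros((20, 3), dtype=bool)      # placeholder for tg.y

# Outputs come back in pairs per input array: train then test for X, then for y
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

print(X_train.shape, y_train.shape)    # first dimensions agree
print(X_test.shape, y_test.shape)
```

Unpacking in this order inside `val_split` should give `fit` inputs and targets with the same first dimension.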

# Resources and Stretch Goals

## Stretch goals:
- Refine the training and generation of text to be able to ask for different genres/styles of Shakespearean text (e.g. plays versus sonnets)
- Train a classification model that takes text and returns which work of Shakespeare it is most likely to be from
- Make it more performant! Many possible routes here - lean on Keras, optimize the code, and/or use more resources (AWS, etc.)
- Revisit the news example from class, and improve it - use categories or tags to refine the model/generation, or train a news classifier
- Run on bigger, better data

## Resources:
- [The Unreasonable Effectiveness of Recurrent Neural Networks](https://karpathy.github.io/2015/05/21/rnn-effectiveness/) - a seminal writeup demonstrating a simple but effective character-level NLP RNN
- [Simple NumPy implementation of RNN](https://github.com/JY-Yoon/RNN-Implementation-using-NumPy/blob/master/RNN%20Implementation%20using%20NumPy.ipynb) - Python 3 version of the code from "Unreasonable Effectiveness"
- [TensorFlow RNN Tutorial](https://github.com/tensorflow/models/tree/master/tutorials/rnn) - code for training an RNN on the Penn Tree Bank language dataset
- [4 part tutorial on RNN](http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/) - relates RNNs to the vanishing gradient problem, and provides an example implementation
- [RNN training tips and tricks](https://github.com/karpathy/char-rnn#tips-and-tricks) - some rules of thumb for parameterizing and training your RNN