# RNN Text Generation:  An Introduction

My investigations of current machine learning techniques have finally brought me into the world of artificial neural networks. I was attracted immediately to recurrent neural networks (RNNs) after reading Andrej Karpathy's [great blog on text generation with RNNs](http://karpathy.github.io/2015/05/21/rnn-effectiveness/). There is a word Andrej uses in his blog to describe a part of what RNNs can do that has stuck with me: magic. What we will see in this blog is that we can train a neural network to write new text (words, sentences, paragraphs, even an entire corpus). It is this process of teaching a machine to create that still fills me with wonder and a sense of magic.


### Where is all this going?
When I first learned of machine learning and artificial neural networks, they evoked in me the themes of science fiction, a genre that has consistently warned us of the looming dangers of artificial intelligence. Although the models described in this blog are infantile when compared to a sentient, artificial creation, we should not be quick to distract ourselves: the seeds of more advanced forms of artificial intelligence start here. Some may say that true artificial sentience is not possible, and that any moral obligations tied to that improbable outcome can be ignored, but this is an outlook I do not share. Yes, there is magic in what a recurrent neural network can do, but I also feel an imperative to consider the long-term impacts of where all of this playful tinkering is leading us. I have no definitive statement on the ethical and moral conduct of the current generation of data scientists and machine learning engineers. All I ask is that we consider the long-term effects, whatever they may be, of the seeds we are sowing here. Not to be overly dramatic, but who amongst us really wants to be the next Oppenheimer, painfully letting out the words, "Now I am become death, the destroyer of worlds"?

## What is a recurrent neural network?
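
At its core, a recurrent network applies the same cell at every time step and feeds the hidden state from the previous step back in alongside the current input. The snippet below is a minimal NumPy sketch of one plain (vanilla) recurrent cell, not the LSTM cell used later in this post. The weight names `W_xh`, `W_hh`, `b_h` and the function `rnn_step` are illustrative, and the sizes are chosen only to echo the 58-character vocabulary and 10 hidden units that appear in the model summary further down.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One vanilla RNN step: combine the current input with the previous hidden state."""
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

rng = np.random.default_rng(0)
n_features, n_hidden = 58, 10  # illustrative sizes
W_xh = rng.normal(scale=0.1, size=(n_features, n_hidden))  # input -> hidden
W_hh = rng.normal(scale=0.1, size=(n_hidden, n_hidden))    # hidden -> hidden (the recurrence)
b_h = np.zeros(n_hidden)

# A toy sequence of 5 one-hot encoded characters
sequence = np.eye(n_features)[[3, 1, 4, 1, 5]]

h = np.zeros(n_hidden)  # initial hidden state
for x_t in sequence:
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)  # the same weights are reused at every step
```

After the loop, `h` is a fixed-size summary of everything the cell has seen so far, which is what a final softmax layer would use to predict the next character.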






In [1]:
# import dependencies
import csv
import numpy as np
import pandas as pd
import pylab as py
import re, pickle, sys, os, datetime
from time import time
import matplotlib.pyplot as plt
# KFold lives in sklearn.model_selection; sklearn.cross_validation is deprecated and later removed
from sklearn.model_selection import train_test_split, GridSearchCV, KFold
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
import sklearn.metrics as skmetrics

# NN Modules
import keras
from keras.models import Sequential, model_from_json, load_model
from keras.layers import Dense, Dropout, LSTM, Activation
from keras.callbacks import ModelCheckpoint
from keras.utils import np_utils
from keras.wrappers.scikit_learn import KerasClassifier

# My modules
sys.path.insert(0,"/Users/HAL3000/Dropbox/coding/my_modules/")
import keras_modules as my_keras_modules
import w2v_modules as my_w2v_modules
import text_modules as my_text_modules

Using TensorFlow backend.


In [2]:
#  All tuneable parameters

# Training Book List
book_dir   = '/Users/HAL3000/Dropbox/coding/neural_nets/data/books/'
input_file = book_dir+'Alice_In_Wonderland.txt'

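### Characters to use as RNN features
# Negated character class: anything outside letters, basic punctuation, and space is stripped.
# (The original pattern had a second '^' inside the class, which matches a literal caret.)
keep_list = '[^A-Za-z,."!? ]'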
keep_upper = True
#keep_upper = False

# Sequence length for RNN
SEQ_LENGTH = 100

# Generated Text Length
LENGTH = 400

### RNN Parameters ###
#retrain = False
retrain = True

HIDDEN_UNITS  = 10
HIDDEN_LAYERS = 1
DROPOUT       = 0.05
EPOCHS        = 5
BATCH_SIZE    = 128

In [3]:
# Import the desired text (will be split by character)
input_text = my_text_modules.input_text(input_file)

# Preprocess: optional lower-case conversion, strip unwanted characters, etc.
cleaned_text = my_text_modules.clean_text(input_text, keep_list, keep_upper)

# Define the unique character (feature or vocab) set.
vocab, n_vocab, ix_to_vocab, vocab_to_ix = my_text_modules.build_vocab(cleaned_text)

# Build the sequences that Keras RNN will train on.  Format for input is:
# (number_of_sequences, length_of_sequence, number_of_features)
x_train, y_train = my_text_modules.text_to_KerasRnn_input(cleaned_text, vocab, n_vocab, SEQ_LENGTH, vocab_to_ix)



Loading Text File: /Users/HAL3000/Dropbox/coding/neural_nets/data/books/Alice_In_Wonderland.txt

Cleaning Raw Text...

 Snippet of Raw Text: ['\ufeff', 'C', 'H', 'A', 'P', 'T', 'E', 'R', ' ', 'I', '.', ' ', 'D', 'o', 'w', 'n', ' ', 't', 'h', 'e', ' ', 'R', 'a', 'b', 'b', 'i', 't', '-', 'H', 'o', 'l', 'e', '\n', '\n', 'A', 'l', 'i', 'c', 'e', ' ', 'w', 'a', 's', ' ', 'b', 'e', 'g', 'i', 'n', 'n', 'i', 'n', 'g', ' ', 't', 'o', ' ', 'g', 'e', 't', ' ', 'v', 'e', 'r', 'y', ' ', 't', 'i', 'r', 'e', 'd', ' ', 'o', 'f', ' ', 's', 'i', 't', 't', 'i', 'n', 'g', ' ', 'b', 'y', ' ', 'h', 'e', 'r', ' ', 's', 'i', 's', 't', 'e', 'r', ' ', 'o', 'n', ' ']

 Snippet of Cleaned Text: ['C', 'H', 'A', 'P', 'T', 'E', 'R', ' ', 'I', '.', ' ', 'D', 'o', 'w', 'n', ' ', 't', 'h', 'e', ' ', 'R', 'a', 'b', 'b', 'i', 't', 'H', 'o', 'l', 'e', 'A', 'l', 'i', 'c', 'e', ' ', 'w', 'a', 's', ' ', 'b', 'e', 'g', 'i', 'n', 'n', 'i', 'n', 'g', ' ', 't', 'o', ' ', 'g', 'e', 't', ' ', 'v', 'e', 'r', 'y', ' ', 't', 'i', 'r
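
The helpers in `my_text_modules` are not shown in this post, so the following is only a guess at what the cleaning, vocabulary, and sequence-building steps might look like, not the actual module. The names `clean_text_sketch`, `build_vocab_sketch`, and `to_rnn_input_sketch` are hypothetical, and the sketch works on a plain string rather than the list of characters printed above. It assumes the target for each window is simply the character that follows it.

```python
import re
import numpy as np

def clean_text_sketch(raw_text, remove_pattern='[^A-Za-z,."!? ]', keep_upper=True):
    """Drop every character matching remove_pattern; optionally lower-case the result."""
    cleaned = re.sub(remove_pattern, '', raw_text)
    return cleaned if keep_upper else cleaned.lower()

def build_vocab_sketch(text):
    """Return the sorted unique characters plus index lookups in both directions."""
    vocab = sorted(set(text))
    ix_to_vocab = dict(enumerate(vocab))
    vocab_to_ix = {ch: ix for ix, ch in ix_to_vocab.items()}
    return vocab, len(vocab), ix_to_vocab, vocab_to_ix

def to_rnn_input_sketch(text, n_vocab, seq_length, vocab_to_ix):
    """One-hot encode sliding windows; each target is the character after its window."""
    n_seq = len(text) - seq_length
    x = np.zeros((n_seq, seq_length, n_vocab), dtype=np.float32)
    y = np.zeros((n_seq, n_vocab), dtype=np.float32)
    for i in range(n_seq):
        for t, ch in enumerate(text[i:i + seq_length]):
            x[i, t, vocab_to_ix[ch]] = 1.0
        y[i, vocab_to_ix[text[i + seq_length]]] = 1.0
    return x, y

# Tiny example built from the opening of the raw snippet shown above
raw = "\ufeffCHAPTER I. Down the Rabbit-Hole\n\nAlice was beginning to get very tired"
cleaned = clean_text_sketch(raw)
vocab, n_vocab, ix_to_vocab, vocab_to_ix = build_vocab_sketch(cleaned)
x, y = to_rnn_input_sketch(cleaned, n_vocab, seq_length=10, vocab_to_ix=vocab_to_ix)
```

The resulting `x` has the `(number_of_sequences, length_of_sequence, number_of_features)` layout described in the cell above, with exactly one `1.0` per time step.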

In [4]:
def build_LSTM_model(x_train, n_vocab, DROPOUT=0.1, HIDDEN_LAYERS=1, HIDDEN_UNITS=10):
    model = Sequential()
    for layer in range(HIDDEN_LAYERS):
        # Every LSTM except the last must return the full sequence,
        # otherwise the next LSTM layer receives 2D input and fails.
        is_last = (layer == HIDDEN_LAYERS - 1)
        if layer == 0:
            model.add(LSTM(HIDDEN_UNITS, input_shape=(x_train.shape[1], n_vocab),
                           return_sequences=not is_last))
        else:
            model.add(LSTM(HIDDEN_UNITS, return_sequences=not is_last))
        model.add(Dropout(DROPOUT))
    # Softmax over the vocabulary: a probability for each possible next character
    model.add(Dense(n_vocab, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    model.summary()
    return model

my_model = build_LSTM_model(x_train, n_vocab, DROPOUT, HIDDEN_LAYERS, HIDDEN_UNITS)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_1 (LSTM)                (None, 10)                2760      
_________________________________________________________________
dropout_1 (Dropout)          (None, 10)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 58)                638       
Total params: 3,398
Trainable params: 3,398
Non-trainable params: 0
_________________________________________________________________
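
The parameter counts in the summary can be checked by hand. A standard LSTM layer has four gates, each with an input weight matrix, a recurrent weight matrix, and a bias vector, while a dense layer has one weight matrix and one bias vector. The short sketch below redoes that arithmetic for 58 input features and 10 hidden units; the helper names are illustrative.

```python
def lstm_param_count(n_inputs, n_units):
    # 4 gates (input, forget, cell candidate, output), each with
    # input weights, recurrent weights, and a bias vector
    return 4 * (n_inputs * n_units + n_units * n_units + n_units)

def dense_param_count(n_inputs, n_outputs):
    # one weight per input-output pair plus one bias per output
    return n_inputs * n_outputs + n_outputs

n_vocab, hidden_units = 58, 10
lstm_params = lstm_param_count(n_vocab, hidden_units)    # 4 * (580 + 100 + 10) = 2760
dense_params = dense_param_count(hidden_units, n_vocab)  # 580 + 58 = 638
total_params = lstm_params + dense_params                # 3398
```

These match the `lstm_1`, `dense_1`, and total figures reported by `model.summary()`.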


In [5]:
# define the checkpoint and callbacks
mydir = datetime.datetime.now().strftime('%m-%d_%H-%M')
os.makedirs("logs/"+mydir+"/", exist_ok=True)  # do not fail if the cell is re-run within the same minute
filepath = "logs/"+mydir+"/weights-improvement-{epoch:02d}-{loss:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
tb_callback = keras.callbacks.TensorBoard(log_dir='./logs/fake_news/{}'.format(time()),
                                          histogram_freq=0, write_graph=True,
                                          write_images=False)
callbacks_list = [checkpoint, tb_callback]
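
`ModelCheckpoint` fills the `{epoch:02d}` and `{loss:.4f}` placeholders in `filepath` with the 1-based epoch number and the logged loss, using ordinary Python string formatting. The sketch below shows how one such name expands and, as an optional convenience, how the saved files could be collected with `glob` instead of being typed out by hand. The helper `list_checkpoints` is hypothetical and relies on the two-digit zero padding so that name order equals epoch order.

```python
import glob
import os

template = "weights-improvement-{epoch:02d}-{loss:.4f}.hdf5"

# Example expansion for epoch 1 with a loss of 3.0970
example_name = template.format(epoch=1, loss=3.0970)

def list_checkpoints(log_dir):
    """Return checkpoint paths in log_dir, sorted by file name (and so by epoch)."""
    return sorted(glob.glob(os.path.join(log_dir, "weights-improvement-*.hdf5")))
```

Because `save_best_only=True` is set, a file is written only for epochs where the monitored loss improves, so the list may have gaps.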

In [6]:
# Fit the RNN
if not os.path.exists('weights/rnn_alice.h5') or retrain:
    print('\n ==== Training Keras NN ====')
    print('Epochs:', EPOCHS, '\nbatch size:', BATCH_SIZE)
    my_model.fit(x_train, y_train, epochs=EPOCHS, batch_size=BATCH_SIZE, callbacks=callbacks_list)
    os.makedirs('weights', exist_ok=True)  # make sure the target directory exists before saving
    my_model.save('weights/rnn_alice.h5')
else:
    print('\nImporting Keras NN Model...')
    # Load the same file whose existence was checked above
    my_model = load_model('weights/rnn_alice.h5')


 ==== Training Keras NN ====
Epochs: 5 
batch size: 128
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [7]:
# Generate some new text (trained RNN model, source text for the seed, desired new text length,
# sequence length, vocab size, two vocab dicts)
my_text_modules.generate_text(my_model, cleaned_text, LENGTH, SEQ_LENGTH,
                              n_vocab, ix_to_vocab, vocab_to_ix)


Generating New Text of Length 400 


----- Generating with seed: " little sisters, the Dormouse beganin a great hurry and their names were Elsie, Lacie, and Tillie an"
----- End Seed -----
 little sisters, the Dormouse beganin a great hurry and their names were Elsie, Lacie, and Tillie ang, mopd thea, bt cee Kat shee otapo uttebhir tibb. Toun nolaf aeragt go Oep er Arepeshel, woge coo s uwthe nabid saner t oucept and fad aregIsd, iord nhern t eside dunte atfr bhizd hithid Varoulmlnd hicoant,re urdol Bhewev., nve whsh anshecert alile thiz?cas cin! outllid! Wiedee Yofs kothwcher  roumeorermowert an An ou whd tand hiict he of IinImT isllad set uroud Coup!s, I s ul wal gor r  sild aly

Done.
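
`generate_text` itself is not shown in this post. A common way to write such a loop is to one-hot encode the current window, ask the model for next-character probabilities, pick a character, then slide the window forward by one. The sketch below follows that pattern under stated assumptions: the names `sample_index` and `generate_text_sketch` are hypothetical, the next character is sampled from the softmax output (the original may well take the argmax instead), and a uniform stand-in replaces `my_model.predict` so the example runs without a trained network.

```python
import numpy as np

def sample_index(probs, temperature=1.0, rng=None):
    """Draw a character index from a probability vector, optionally reshaped by temperature."""
    if rng is None:
        rng = np.random.default_rng()
    logits = np.log(np.asarray(probs, dtype=np.float64) + 1e-12) / temperature
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return int(rng.choice(len(p), p=p))

def generate_text_sketch(predict_fn, seed, length, n_vocab, ix_to_vocab, vocab_to_ix, rng=None):
    """Extend seed by `length` characters, sliding a fixed-size window over the text."""
    window = list(seed)
    generated = []
    for _ in range(length):
        # One-hot encode the current window as a batch of size 1
        x = np.zeros((1, len(window), n_vocab), dtype=np.float32)
        for t, ch in enumerate(window):
            x[0, t, vocab_to_ix[ch]] = 1.0
        probs = predict_fn(x)[0]
        next_char = ix_to_vocab[sample_index(probs, rng=rng)]
        generated.append(next_char)
        window = window[1:] + [next_char]  # drop the oldest character, append the new one
    return seed + ''.join(generated)

# Stand-in for my_model.predict: every character is equally likely
vocab = list("abc ")
ix_to_vocab = dict(enumerate(vocab))
vocab_to_ix = {ch: ix for ix, ch in ix_to_vocab.items()}

def uniform_predict(x):
    return np.full((x.shape[0], len(vocab)), 1.0 / len(vocab))

text = generate_text_sketch(uniform_predict, "abc a", 20, len(vocab),
                            ix_to_vocab, vocab_to_ix, rng=np.random.default_rng(0))
```

With a real model, `predict_fn` would be `my_model.predict` and the seed would need to be exactly `SEQ_LENGTH` characters long to match the input shape the network was built with.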


In [8]:
# Define list of models to compare
model_list = ['logs/'+mydir+'/weights-improvement-01-3.0970.hdf5',
              'logs/'+mydir+'/weights-improvement-03-2.6688.hdf5',
              'logs/'+mydir+'/weights-improvement-05-2.4788.hdf5',
             ]

seed_text = "She got to the part about her repeating YOU ARE OLD, FATHER WILLIAM, to the Caterpillar "

In [9]:
my_text_modules.generate_text_diff_complexity(model_list, seed_text, LENGTH, SEQ_LENGTH, n_vocab, ix_to_vocab, vocab_to_ix)

UnboundLocalError: local variable 'sentence' referenced before assignment
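
The traceback above is cut off and `generate_text_diff_complexity` is not shown, so the cause of the error cannot be confirmed from this post. One thing worth checking is the seed length: `seed_text` is 88 characters long, while the network was built for windows of `SEQ_LENGTH = 100`. If the helper only assigns `sentence` when the seed has the expected length, a short seed would produce exactly this kind of error, though that is a guess. The hypothetical helper below pads or trims a seed to the window size; padding on the left with spaces is an arbitrary choice.

```python
def fit_seed_to_window(seed, seq_length, pad_char=' '):
    """Left-pad a short seed with pad_char, or keep only the last seq_length characters."""
    if len(seed) >= seq_length:
        return seed[-seq_length:]
    return pad_char * (seq_length - len(seed)) + seed

SEQ_LENGTH = 100
seed_text = "She got to the part about her repeating YOU ARE OLD, FATHER WILLIAM, to the Caterpillar "
fixed_seed = fit_seed_to_window(seed_text, SEQ_LENGTH)
```

An alternative that avoids padding is to pick a 100-character seed directly from `cleaned_text`, as the earlier generation cell appears to do.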