# Text Generation with Neural Networks
**Generating text based on Herman Melville's Moby Dick**
<img src='https://litreactor.com/sites/default/files/imagecache/header/images/column/headers/moby_dick_2.jpg' style="width:100%"/>


## Functions for Processing Text

### Reading in a file as a string

In [116]:
def read_file(filepath):
    
    with open(filepath) as f:
        str_text = f.read()
    
    return str_text

In [117]:
read_file('moby_dick_four_chapters.txt')

'Call me Ishmael.  Some years ago--never mind how long\nprecisely--having little or no money in my purse, and nothing\nparticular to interest me on shore, I thought I would sail about a\nlittle and see the watery part of the world.  It is a way I have of\ndriving off the spleen and regulating the circulation.  Whenever I\nfind myself growing grim about the mouth; whenever it is a damp,\ndrizzly November in my soul; whenever I find myself involuntarily\npausing before coffin warehouses, and bringing up the rear of every\nfuneral I meet; and especially whenever my hypos get such an upper\nhand of me, that it requires a strong moral principle to prevent me\nfrom deliberately stepping into the street, and methodically knocking\npeople\'s hats off--then, I account it high time to get to sea as soon\nas I can.  This is my substitute for pistol and ball.  With a\nphilosophical flourish Cato throws himself upon his sword; I quietly\ntake to the ship.  There is nothing surprising in this.  If t

### Tokenize and Clean Text

In [118]:
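import spacy
# 'en' is a shortcut link from older spaCy releases; spaCy 3+ needs a full pipeline name such as 'en_core_web_sm'
nlp = spacy.load('en',disable=['parser', 'tagger','ner'])

# Raise the character limit so the whole book can be processed in one call
nlp.max_length = 1198623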

In [119]:
def separate_punc(doc_text):
    # Lowercase each token and drop punctuation and whitespace tokens
    return [token.text.lower() for token in nlp(doc_text) if token.text not in '\n\n \n\n\n!"-#$%&()--.*+,-/:;<=>?@[\\]^_`{|}~\t\n ']

In [120]:
d = read_file('melville-moby_dick.txt')
tokens = separate_punc(d)

In [121]:
tokens

['chapter',
 '1',
 'loomings',
 'call',
 'me',
 'ishmael',
 'some',
 'years',
 'ago',
 'never',
 'mind',
 'how',
 'long',
 'precisely',
 'having',
 'little',
 'or',
 'no',
 'money',
 'in',
 'my',
 'purse',
 'and',
 'nothing',
 'particular',
 'to',
 'interest',
 'me',
 'on',
 'shore',
 'i',
 'thought',
 'i',
 'would',
 'sail',
 'about',
 'a',
 'little',
 'and',
 'see',
 'the',
 'watery',
 'part',
 'of',
 'the',
 'world',
 'it',
 'is',
 'a',
 'way',
 'i',
 'have',
 'of',
 'driving',
 'off',
 'the',
 'spleen',
 'and',
 'regulating',
 'the',
 'circulation',
 'whenever',
 'i',
 'find',
 'myself',
 'growing',
 'grim',
 'about',
 'the',
 'mouth',
 'whenever',
 'it',
 'is',
 'a',
 'damp',
 'drizzly',
 'november',
 'in',
 'my',
 'soul',
 'whenever',
 'i',
 'find',
 'myself',
 'involuntarily',
 'pausing',
 'before',
 'coffin',
 'warehouses',
 'and',
 'bringing',
 'up',
 'the',
 'rear',
 'of',
 'every',
 'funeral',
 'i',
 'meet',
 'and',
 'especially',
 'whenever',
 'my',
 'hypos',
 'get',
 'such

In [122]:
len(tokens)

214711

In [123]:
4431/25

177.24

## Create Sequences of Tokens

In [124]:
# organize into sequences of tokens
train_len = 25+1 # 25 training words, then one target word

# Empty list of sequences
text_sequences = []

for i in range(train_len, len(tokens)):
    
    # Grab the train_len tokens that end just before position i
    seq = tokens[i-train_len:i]
    
    # Add to list of sequences
    text_sequences.append(seq)

In [125]:
' '.join(text_sequences[0])

'chapter 1 loomings call me ishmael some years ago never mind how long precisely having little or no money in my purse and nothing particular to'

In [126]:
' '.join(text_sequences[1])

'1 loomings call me ishmael some years ago never mind how long precisely having little or no money in my purse and nothing particular to interest'

In [127]:
' '.join(text_sequences[2])

'loomings call me ishmael some years ago never mind how long precisely having little or no money in my purse and nothing particular to interest me'

In [128]:
len(text_sequences)

214685
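
A toy version of the sliding window above makes the bookkeeping easier to check. This is a sketch only: `make_sequences` and the eight-token sample are illustrative names and data, not part of the notebook. It uses the same `range(train_len, len(tokens))` bounds, which yield `len(tokens) - train_len` windows (214711 - 26 = 214685 here) and leave the final token out of every window.

```python
def make_sequences(tokens, train_len):
    """Return every window of train_len consecutive tokens, using the same bounds as the loop above."""
    return [tokens[i - train_len:i] for i in range(train_len, len(tokens))]


# Illustrative sample: 8 tokens, windows of 3 input words plus 1 target word
toy_tokens = 'call me ishmael some years ago never mind'.split()
toy_sequences = make_sequences(toy_tokens, train_len=4)

for seq in toy_sequences:
    print(' '.join(seq[:-1]), '->', seq[-1])

# Number of windows is len(tokens) - train_len
print(len(toy_sequences))
```

Changing the upper bound to `len(tokens) + 1` would add one more window that ends on the last token; the notebook's outputs correspond to the bounds as written.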

# Keras

### Keras Tokenization

In [129]:
from keras.preprocessing.text import Tokenizer

In [130]:
# integer encode sequences of words
tokenizer = Tokenizer()
tokenizer.fit_on_texts(text_sequences)
sequences = tokenizer.texts_to_sequences(text_sequences)

In [131]:
sequences[0]

[158,
 9447,
 17527,
 402,
 42,
 1043,
 43,
 247,
 659,
 140,
 296,
 116,
 82,
 787,
 347,
 113,
 36,
 50,
 1788,
 6,
 49,
 3028,
 3,
 218,
 442,
 5]

In [132]:
tokenizer.index_word

{1: 'the',
 2: 'of',
 3: 'and',
 4: 'a',
 5: 'to',
 6: 'in',
 7: 'that',
 8: 'his',
 9: 'it',
 10: 'i',
 11: 'he',
 12: 'but',
 13: "'s",
 14: 'as',
 15: 'with',
 16: 'is',
 17: 'was',
 18: 'for',
 19: 'all',
 20: 'this',
 21: 'at',
 22: 'not',
 23: 'by',
 24: 'whale',
 25: 'from',
 26: 'so',
 27: 'him',
 28: 'on',
 29: 'be',
 30: 'one',
 31: 'you',
 32: 'there',
 33: 'now',
 34: 'had',
 35: 'have',
 36: 'or',
 37: 'were',
 38: 'they',
 39: 'like',
 40: 'which',
 41: 'then',
 42: 'me',
 43: 'some',
 44: 'their',
 45: 'what',
 46: 'when',
 47: 'an',
 48: 'are',
 49: 'my',
 50: 'no',
 51: 'upon',
 52: 'out',
 53: 'man',
 54: 'into',
 55: 'ship',
 56: 'up',
 57: 'more',
 58: 'ahab',
 59: 'if',
 60: 'them',
 61: 'old',
 62: 'we',
 63: 'sea',
 64: 'would',
 65: "'",
 66: 'ye',
 67: 'do',
 68: 'other',
 69: 'been',
 70: 'over',
 71: 'these',
 72: 'will',
 73: 'though',
 74: 'only',
 75: 'its',
 76: 'down',
 77: 'such',
 78: 'who',
 79: 'yet',
 80: 'head',
 81: 'time',
 82: 'long',
 83: 'boat

In [133]:
for i in sequences[0]:
    print(f'{i} : {tokenizer.index_word[i]}')

158 : chapter
9447 : 1
17527 : loomings
402 : call
42 : me
1043 : ishmael
43 : some
247 : years
659 : ago
140 : never
296 : mind
116 : how
82 : long
787 : precisely
347 : having
113 : little
36 : or
50 : no
1788 : money
6 : in
49 : my
3028 : purse
3 : and
218 : nothing
442 : particular
5 : to


In [134]:
tokenizer.word_counts

OrderedDict([('chapter', 4447),
             ('1', 28),
             ('loomings', 3),
             ('call', 1382),
             ('me', 16095),
             ('ishmael', 500),
             ('some', 15789),
             ('years', 2400),
             ('ago', 815),
             ('never', 5262),
             ('mind', 2039),
             ('how', 6330),
             ('long', 8567),
             ('precisely', 690),
             ('having', 1679),
             ('little', 6412),
             ('or', 17879),
             ('no', 14916),
             ('money', 305),
             ('in', 105799),
             ('my', 15231),
             ('purse', 178),
             ('and', 164029),
             ('nothing', 2936),
             ('particular', 1273),
             ('to', 117832),
             ('interest', 442),
             ('on', 26910),
             ('shore', 572),
             ('i', 53430),
             ('thought', 3874),
             ('would', 11232),
             ('sail', 2522),
             ('about', 

In [135]:
vocabulary_size = len(tokenizer.word_counts)
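
The integer codes follow word frequency: the most frequent word gets index 1, and index 0 is never assigned to a word. The sketch below is a rough pure-Python approximation of that ranking, not Keras's implementation; `fit_word_index` and the sample sentence are illustrative. It also shows why `word_counts` is much larger than the raw corpus counts: the tokenizer is fit on overlapping windows, so each token is counted once per window it appears in.

```python
from collections import Counter


def fit_word_index(sequences):
    """Rank words by how often they occur across all sequences; rank 1 is the most frequent."""
    counts = Counter(word for seq in sequences for word in seq)
    ranked = [word for word, _ in counts.most_common()]
    word_index = {word: i for i, word in enumerate(ranked, start=1)}
    return word_index, counts


# Illustrative sample, windowed the same way as text_sequences
toy_tokens = 'the whale and the sea and the ship'.split()
window = 3
toy_windows = [toy_tokens[i - window:i] for i in range(window, len(toy_tokens))]

word_index, window_counts = fit_word_index(toy_windows)
encoded = [[word_index[w] for w in seq] for seq in toy_windows]

print(word_index)
print(encoded[0])
# Count over the windows versus count in the raw token list
print(window_counts['the'], toy_tokens.count('the'))
```

Because the largest index equals the vocabulary size and index 0 is unused, any array indexed by word id needs `vocabulary_size + 1` slots.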

### Convert to Numpy Matrix

In [136]:
import numpy as np

In [137]:
sequences = np.array(sequences)

In [138]:
sequences

array([[  158,  9447, 17527, ...,   218,   442,     5],
       [ 9447, 17527,   402, ...,   442,     5,  1165],
       [17527,   402,    42, ...,     5,  1165,    42],
       ...,
       [  240,   937,   351, ...,  1419,  1313,    74],
       [  937,   351,  1418, ...,  1313,    74,   219],
       [  351,  1418,     3, ...,    74,   219,   222]])

# Creating an LSTM-based model

In [139]:
import keras
from keras.models import Sequential
from keras.layers import Dense,LSTM,Embedding

In [140]:
def create_model(vocabulary_size, seq_len):
    model = Sequential()
    model.add(Embedding(vocabulary_size, 25, input_length=seq_len))
    model.add(LSTM(150, return_sequences=True))
    model.add(LSTM(150))
    model.add(Dense(150, activation='relu'))

    model.add(Dense(vocabulary_size, activation='softmax'))
    
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
   
    model.summary()
    
    return model

### Train / Test Split

In [141]:
from keras.utils import to_categorical

In [142]:
sequences

array([[  158,  9447, 17527, ...,   218,   442,     5],
       [ 9447, 17527,   402, ...,   442,     5,  1165],
       [17527,   402,    42, ...,     5,  1165,    42],
       ...,
       [  240,   937,   351, ...,  1419,  1313,    74],
       [  937,   351,  1418, ...,  1313,    74,   219],
       [  351,  1418,     3, ...,    74,   219,   222]])

In [143]:
# First 25 words of each sequence (the model input)
sequences[:,:-1]

array([[  158,  9447, 17527, ...,     3,   218,   442],
       [ 9447, 17527,   402, ...,   218,   442,     5],
       [17527,   402,    42, ...,   442,     5,  1165],
       ...,
       [  240,   937,   351, ...,    84,  1419,  1313],
       [  937,   351,  1418, ...,  1419,  1313,    74],
       [  351,  1418,     3, ...,  1313,    74,   219]])

In [144]:
# Last word of each sequence (the target)
sequences[:,-1]

array([   5, 1165,   42, ...,   74,  219,  222])

In [145]:
X = sequences[:,:-1]

In [146]:
y = sequences[:,-1]

In [147]:
y = to_categorical(y, num_classes=vocabulary_size+1)

In [148]:
seq_len = X.shape[1]

In [149]:
seq_len

25
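
The split and the one-hot encoding can be sketched on a tiny example without Keras. The helpers `split_inputs_and_targets` and `one_hot` and the three rows below are illustrative stand-ins for the slicing and `to_categorical` calls above. The point of the sketch is the `+ 1`: word ids start at 1, so a vocabulary of size 4 produces ids up to 4 and needs 5 columns, with column 0 left unused.

```python
def split_inputs_and_targets(rows):
    """All but the last id of each row are inputs; the last id is the target."""
    X = [row[:-1] for row in rows]
    y = [row[-1] for row in rows]
    return X, y


def one_hot(labels, num_classes):
    """Turn each integer label into a 0/1 vector of length num_classes."""
    encoded = []
    for label in labels:
        vec = [0] * num_classes
        vec[label] = 1
        encoded.append(vec)
    return encoded


# Illustrative encoded sequences with word ids 1..4
rows = [[1, 4, 2], [4, 2, 1], [2, 1, 4]]
toy_vocabulary_size = 4

toy_X, toy_y = split_inputs_and_targets(rows)
toy_y_one_hot = one_hot(toy_y, toy_vocabulary_size + 1)

print(toy_X)
print(toy_y)
print(toy_y_one_hot)
```

With `num_classes=toy_vocabulary_size` instead, the label 4 would fall outside the vector and the helper would raise an `IndexError`.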

### Training the Model

In [150]:
# define model
model = create_model(vocabulary_size+1, seq_len)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 25, 25)            438200    
_________________________________________________________________
lstm_5 (LSTM)                (None, 25, 150)           105600    
_________________________________________________________________
lstm_6 (LSTM)                (None, 150)               180600    
_________________________________________________________________
dense_5 (Dense)              (None, 150)               22650     
_________________________________________________________________
dense_6 (Dense)              (None, 17528)             2646728   
Total params: 3,393,778
Trainable params: 3,393,778
Non-trainable params: 0
_________________________________________________________________
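
The parameter counts in the summary can be reproduced by hand. The helper names below are illustrative; the formulas are the standard ones for an embedding table, an LSTM layer (four gates, each with input weights, recurrent weights, and a bias), and a dense layer.

```python
def embedding_params(vocab_size, embed_dim):
    # One embed_dim-sized vector per word id
    return vocab_size * embed_dim


def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias vector
    return 4 * ((input_dim + units) * units + units)


def dense_params(input_dim, units):
    # Weight matrix plus bias vector
    return input_dim * units + units


vocab = 17528  # vocabulary_size + 1, as shown in the summary's last layer

layer_params = {
    'embedding': embedding_params(vocab, 25),
    'lstm_1': lstm_params(25, 150),
    'lstm_2': lstm_params(150, 150),
    'dense_hidden': dense_params(150, 150),
    'dense_output': dense_params(150, vocab),
}

for name, count in layer_params.items():
    print(f'{name}: {count}')
print('total:', sum(layer_params.values()))
```

Most of the weights sit in the output layer, because it has one unit per vocabulary entry.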



In [151]:
from pickle import dump,load

In [152]:
model.fit(X, y, batch_size=128, epochs=2,verbose=1)

Epoch 1/2


 26112/214685 [==>...........................] - ETA: 1:17:53 - loss: 9.7715 ... - ETA: 10:49 - loss: 9.1066 - acc: 0.06 ...

Epoch 2/2


 26112/214685 [==>...........................] - ETA: 8:24 - loss: 6.3311 - acc: 0.117 ... - ETA: 7:45 - loss: 6.5774 - acc: 0.079 ...

<keras.callbacks.History at 0x24b7cfa6fd0>

In [153]:
# save the model to file
# model.save('epochBIG.h5')
# save the tokenizer
# dump(tokenizer, open('epochBIG', 'wb'))

# Generating New Text

In [154]:
from random import randint
from pickle import load
from keras.models import load_model
from keras.preprocessing.sequence import pad_sequences

In [155]:
def generate_text(model, tokenizer, seq_len, seed_text, num_gen_words):
    '''
    INPUTS:
    model : model that was trained on text data
    tokenizer : tokenizer that was fit on text data
    seq_len : length of training sequence
    seed_text : raw string text to serve as the seed
    num_gen_words : number of words to be generated by model
    '''
    
    # Final Output
    output_text = []
    
    # Initial seed sequence
    input_text = seed_text
    
    # Create num_gen_words
    for i in range(num_gen_words):
        
        # Take the input text string and encode it to a sequence
        encoded_text = tokenizer.texts_to_sequences([input_text])[0]
        
        # Pad or truncate to the sequence length the model was trained on (seq_len)
        pad_encoded = pad_sequences([encoded_text], maxlen=seq_len, truncating='pre')
        
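        # Predict the index of the most likely next word.
        # predict_classes is not available in recent Keras releases, so take the argmax of predict.
        pred_word_ind = model.predict(pad_encoded, verbose=0).argmax(axis=-1)[0]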
        
        # Grab word
        pred_word = tokenizer.index_word[pred_word_ind] 
        
        # Update the sequence of input text (shifting one over with the new word)
        input_text += ' ' + pred_word
        
        output_text.append(pred_word)
        
    # Make it look like a sentence.
    return ' '.join(output_text)
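
Since `input_text` grows by one word on every pass, the padding step is what keeps the model input at a fixed length. The sketch below mimics, in plain Python, what `pad_sequences` does for a single sequence with `truncating='pre'` and its default left padding with zeros. `pad_or_truncate_pre` is a hypothetical helper written for illustration, not a Keras function.

```python
def pad_or_truncate_pre(ids, maxlen, value=0):
    """Keep the last maxlen ids; if there are fewer, pad on the left with value."""
    kept = list(ids)[-maxlen:]
    return [value] * (maxlen - len(kept)) + kept


# Longer than maxlen: the oldest ids are dropped, so the window slides forward
print(pad_or_truncate_pre([11, 12, 13, 14, 15], maxlen=3))

# Shorter than maxlen: zeros are added in front
print(pad_or_truncate_pre([7, 8], maxlen=4))
```

Dropping ids from the front is what shifts the context one word to the right each time a generated word is appended.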

### Grab a random seed sequence

In [164]:
text_sequences[9]

['never',
 'mind',
 'how',
 'long',
 'precisely',
 'having',
 'little',
 'or',
 'no',
 'money',
 'in',
 'my',
 'purse',
 'and',
 'nothing',
 'particular',
 'to',
 'interest',
 'me',
 'on',
 'shore',
 'i',
 'thought',
 'i',
 'would',
 'sail']

In [168]:
import random
random.seed(108)
random_pick = random.randint(0,len(text_sequences)-1)  # randint includes both endpoints

In [169]:
random_seed_text = text_sequences[random_pick]

In [170]:
random_seed_text

['queequeg',
 'who',
 'had',
 'twice',
 'or',
 'thrice',
 'before',
 'taken',
 'part',
 'in',
 'similar',
 'ceremonies',
 'looked',
 'no',
 'ways',
 'abashed',
 'but',
 'taking',
 'the',
 'offered',
 'pen',
 'copied',
 'upon',
 'the',
 'paper',
 'in']

In [171]:
seed_text = ' '.join(random_seed_text)

In [172]:
seed_text

'queequeg who had twice or thrice before taken part in similar ceremonies looked no ways abashed but taking the offered pen copied upon the paper in'

**Let's generate new text based on the model that trained for just 2 epochs**

In [173]:
generate_text(model,tokenizer,seq_len,seed_text=seed_text,num_gen_words=50)

'the whale of the whale of the whale of the whale of the whale of the whale of the whale of the whale of the whale of the whale of the whale of the whale of the whale of the whale of the whale of the whale of the whale'

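The output is rubbish, and the reason is straightforward: after only two epochs the model has learned little more than the most frequent words, and because generation always picks the single most likely next word, it falls into a repeating loop.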
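
The repetition can be reproduced with a toy example. The sketch below is not the trained LSTM: it assumes a hand-written lookup table of next-word probabilities (`toy_next_word_probs`, with made-up numbers) that conditions on only one previous word, and a hypothetical `greedy_generate` helper. It illustrates how always taking the most likely next word can lock generation into a cycle.

```python
def greedy_generate(next_word_probs, seed_word, num_words):
    """Repeatedly pick the highest-probability successor of the current word."""
    output = []
    current = seed_word
    for _ in range(num_words):
        candidates = next_word_probs[current]
        current = max(candidates, key=candidates.get)
        output.append(current)
    return ' '.join(output)


# Made-up probabilities for illustration only
toy_next_word_probs = {
    'in': {'the': 0.6, 'a': 0.4},
    'the': {'whale': 0.5, 'sea': 0.3, 'ship': 0.2},
    'whale': {'of': 0.55, 'was': 0.45},
    'of': {'the': 0.7, 'a': 0.3},
}

generated = greedy_generate(toy_next_word_probs, seed_word='in', num_words=6)
print(generated)  # the whale of the whale of
```

The real model conditions on 25 words rather than one, so longer training can break such cycles by making the prediction depend more on the wider context.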

**Now let's generate new text based on a 300-epoch model that was trained on the entirety of Moby Dick**

In [174]:
from keras.models import load_model

In [175]:
model = load_model('epochBIG.h5')

In [176]:
# Load the pickled tokenizer and close the file afterwards
with open('epochBIG','rb') as f:
    tokenizer = load(f)

In [177]:
generate_text(model,tokenizer,seq_len,seed_text=seed_text,num_gen_words=50)

'the air and these that descrying forth with the unbecomingness of the flights of sumatra and passing ways to a assertion to denote a minute or two together thus when anybody with perfect both i could show a history of direct round the same tree cried what that god is'

Much better: the text is still not grammatical, but it no longer loops and it draws on a far wider vocabulary.