## <small>
Copyright (c) 2017-21 Andrew Glassner

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
</small>



# Deep Learning: A Visual Approach
## by Andrew Glassner, https://glassner.com
### Order: https://nostarch.com/deep-learning-visual-approach
### GitHub: https://github.com/blueberrymusic
------

### What's in this notebook

This notebook is provided to help you work with Keras and TensorFlow. It accompanies the bonus chapters for my book. The code is in Python 3, using the library versions that were current in April 2021.

Note that I've included the output cells in this saved notebook, but Jupyter doesn't save the variables or data that were used to generate them. To recreate any cell's output, evaluate all the cells from the start up to that cell. A convenient way to experiment is to first choose "Restart & Run All" from the Kernel menu, so that everything's been defined and is up to date. Then you can experiment using the variables, data, functions, and other stuff defined in this notebook.

## Bonus Chapter 3 - Notebook 8: Generate text word by word

The Keras steps are a modified version of the character-based RNN at
https://github.com/fchollet/keras/blob/master/examples/lstm_text_generation.py

A lot of the word extraction and tokenizing was freely adapted from
http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/

The Sherlock Holmes text is from Project Gutenberg
https://www.gutenberg.org/

In [1]:
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.layers import LSTM, Dropout
from keras.optimizers import RMSprop
from keras.utils.data_utils import get_file
import numpy as np
import random
import itertools
import os
import sys
import nltk
import nltk.data
import string

Using TensorFlow backend.


In [2]:
# Workaround for Keras issues on Mac computers (you can comment this
# out if you're not on a Mac, or not having problems)
import os
os.environ['KMP_DUPLICATE_LIB_OK']='True'

In [3]:
# Make a File_Helper for saving and loading files.

save_files = False

import os, sys, inspect
current_dir = os.path.dirname(os.path.abspath(inspect.getfile(inspect.currentframe())))
sys.path.insert(0, os.path.dirname(current_dir)) # path to parent dir
from DLBasics_Utilities import File_Helper
file_helper = File_Helper(save_files)

In [4]:
# Get the stuff we need from the Natural Language Toolkit (NLTK)
nltk.download('punkt')

[nltk_data] Downloading package punkt to /usr/local/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### Global parameters

Of the following parameters,
the most important is probably the number of epochs, `Num_epochs`.

Up to a point, the more epochs you train for, the better the results.
I've found that 500 is a good starting point,
but depending on your computer, memory, and GPU (if you have one), that
could take hours, or days, or even longer! 
On my late 2014 iMac (which has a GPU, but not
one that TensorFlow can use), each epoch takes about 30 minutes,
so 500 epochs would take a little more than 10 days!
I ran that once a long time ago, but I'm not going to do it again now.

Here I've set `Num_epochs` to 4 just
for demonstration purposes, 
but the output at that point isn't much to
celebrate. You'll surely be able to crank that up if you
have a more modern computer with a GPU,
or you use a cloud service such as Colab (which 
offers free processing on their GPU-enabled systems).
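
As a quick check on the time estimate above, here is the arithmetic as a small sketch. The helper name `training_days` is made up for illustration, and the 30 minutes per epoch is simply the figure quoted above; your own per-epoch time will almost certainly differ.

```python
def training_days(num_epochs, minutes_per_epoch):
    """Estimate total training time in days (hypothetical helper)."""
    total_minutes = num_epochs * minutes_per_epoch
    return total_minutes / (60 * 24)

# 500 epochs at 30 minutes each, reported in days
print(round(training_days(500, 30), 2))

# The 4-epoch demonstration setting, reported in hours
print(round(training_days(4, 30) * 24, 2))
```

Plugging in your own per-epoch time gives a rough idea of how far you can reasonably raise `Num_epochs`.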

In [5]:
# Global parameters

Vocabulary_size = 8000
Batch_size = 64  # Set to 1 below if we're stateful
Learning_rate = 0.01


Num_epochs = 4
Start_epoch = 1
input_dir = file_helper.get_input_data_dir()
Source_text_file = input_dir+'/holmes.txt'
output_dir = file_helper.get_saved_output_dir()
file_helper.check_for_directory(output_dir)
Output_file = output_dir+'/generated-holmes.txt'

Window_size = 40
Window_step = 3
Generated_text_length = 600
Random_seed = 42
Cells_per_layer = [8, 8]
Use_dropout = [True] * len(Cells_per_layer)
Dropout_rate = [0.3] * len(Cells_per_layer)
Stateful_model = True  
File_writer = None
Model_name = 'Layers-'+str(Cells_per_layer)+'-stateful-'+str(Stateful_model)

if Stateful_model:
    Batch_size = 1             # so we can predict with just 1, probably better to modify predictions
    Window_step = Window_size  # samples are sequential, not overlapping

Unknown_token = "GLORP"  # all words not in vocabulary

In [6]:
# read in text one sentence at a time: https://stackoverflow.com/questions/4576077/python-split-text-on-sentences
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
# Use a context manager so the file is closed once it has been read
with open(Source_text_file) as fp:
    data = fp.read()
tokenized_sentences = tokenizer.tokenize(data)

# Pad each punctuation mark with spaces so that it becomes its own token
# (adapted from https://stackoverflow.com/questions/23317458/how-to-remove-punctuation)
punctuations = [
    '!', '"', '#', '$', '%', '&', "'", '(', ')', '*', 
    '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', 
    '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', 
    '~', "\n", "\r", "”"
    ]
sentences = []
for sentence in tokenized_sentences:
    no_punc = " ".join("".join([" "+ch+" " if ch in punctuations else ch for ch in sentence]).split())
    sentences.append(no_punc)
    
print("found ",len(sentences)," sentences")

# sentences is an array of strings. Each string is what the tokenizer decided made
# up an English-language "sentence"

found  16720  sentences
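
To see what the join/split expression in the cell above does to a sentence, here is a minimal sketch on a single made-up line of text. The names `demo_punctuations` and `pad_punctuation` are only for illustration, and the list is a small subset of the full punctuation list.

```python
# A small subset of the punctuation list, just for this demo
demo_punctuations = [',', '.', '"', "'", '?', '!']

def pad_punctuation(sentence, punctuations):
    """Surround each punctuation mark with spaces, then collapse runs of whitespace."""
    padded = "".join(" " + ch + " " if ch in punctuations else ch for ch in sentence)
    return " ".join(padded.split())

result = pad_punctuation('"Come, Watson!" said he.', demo_punctuations)
print(result)
print(result.split())
```

Each punctuation mark ends up as its own space-separated token rather than being discarded, which is why commas and periods show up among the most common "words" in the vocabulary.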


In [7]:
text_as_words = []
for s in sentences:
    words = s.split()
    for w in words:
        text_as_words.append(w)
print("the text contains ",len(text_as_words)," words")
# text_as_words is all the words in the text after tokenizing; punctuation marks are kept as separate tokens

the text contains  366463  words


In [8]:
# Count the word frequencies
word_freq = nltk.FreqDist(text_as_words)
number_of_unique_tokens = 1 + len(word_freq.items())  # add 1 for the "unknown_token"

# Get the most common words 
vocab = word_freq.most_common(Vocabulary_size-1)
print("Found ",len(vocab)," distinct words")

Found  7999  distinct words


In [9]:
# build index_to_word and word_to_index dictionaries
unique_words = [v[0] for v in vocab]
unique_words.append(Unknown_token)
unique_words = sorted(list(set(unique_words)))
print('number of unique vocabulary words being used:', len(unique_words))
word_to_index = dict((w, i) for i, w in enumerate(unique_words))
index_to_word = dict((i, w) for i, w in enumerate(unique_words))

number of unique vocabulary words being used: 8000


In [10]:
print('Using vocabulary size %d.' % Vocabulary_size)
for i in range(10):
    print("word popularity "+str(i)+": <"+vocab[i][0]+"> used "+str(vocab[i][1])+" times")

Using vocabulary size 8000.
word popularity 0: <,> used 22050 times
word popularity 1: <.> used 18394 times
word popularity 2: <the> used 15607 times
word popularity 3: <and> used 7915 times
word popularity 4: <of> used 7622 times
word popularity 5: <I> used 7614 times
word popularity 6: <to> used 7566 times
word popularity 7: <a> used 7083 times
word popularity 8: <that> used 5135 times
word popularity 9: <"> used 5093 times


In [11]:
# Replace all words not in our vocabulary with the unknown token
for i in range(len(text_as_words)):
    if not text_as_words[i] in word_to_index:
        text_as_words[i] = Unknown_token
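
The vocabulary steps in the last few cells can be sketched on a toy example. This is only an illustration under a few assumptions: `collections.Counter` stands in for `nltk.FreqDist` (both provide `most_common`), and every `demo_` name and the sample sentence are made up.

```python
from collections import Counter

demo_words = "the dog saw the cat and the cat saw the bird".split()
demo_vocab_size = 4        # tiny vocabulary, including one slot for the unknown token
demo_unknown = "GLORP"

# Keep the (demo_vocab_size - 1) most common words
demo_freq = Counter(demo_words)
demo_vocab = [w for w, count in demo_freq.most_common(demo_vocab_size - 1)]
demo_unique = sorted(set(demo_vocab + [demo_unknown]))

# Lookup tables in both directions
demo_word_to_index = {w: i for i, w in enumerate(demo_unique)}
demo_index_to_word = {i: w for i, w in enumerate(demo_unique)}

# Replace every out-of-vocabulary word with the unknown token
demo_text = [w if w in demo_word_to_index else demo_unknown for w in demo_words]
print(demo_unique)
print(demo_text)
```

The rare words ("dog", "and", "bird") all collapse into the single unknown token, so the model only ever sees indices from a fixed-size vocabulary.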

In [12]:
# make huge list of windowed fragments
fragments = []
next_words = []
for i in range(0, len(text_as_words) - Window_size, Window_step):
    fragments.append(text_as_words[i: i + Window_size])
    next_words.append(text_as_words[i + Window_size])
print('number of fragments created:', len(fragments))

number of fragments created: 9161
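
Here is the same windowing loop on a toy token list, as a sketch with made-up `demo_` names, so you can see exactly which fragment pairs with which "next word".

```python
demo_tokens = list("abcdefghij")   # ten stand-in "words"
demo_window_size = 4
demo_window_step = 3

demo_fragments = []
demo_next_words = []
for i in range(0, len(demo_tokens) - demo_window_size, demo_window_step):
    demo_fragments.append(demo_tokens[i: i + demo_window_size])
    demo_next_words.append(demo_tokens[i + demo_window_size])

for frag, nxt in zip(demo_fragments, demo_next_words):
    print(frag, '->', nxt)

# The fragment count for the full text with the stateful settings
# (366463 tokens, window size and step both 40)
print(len(range(0, 366463 - 40, 40)))
```

With a step smaller than the window the fragments overlap. With the stateful settings, where the step equals the window size, they sit end to end, which is where the count of 9161 above comes from.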


In [13]:
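# Clip the fragment list so its length is a multiple of 64, the default
# (non-stateful) Batch_size. The 64 is hard-coded because Batch_size has
# already been reset to 1 when the model is stateful.
keep_fragments = 64 * (len(fragments) // 64)
fragments = fragments[0:keep_fragments]
next_words = next_words[0:keep_fragments]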
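
The clipping arithmetic, worked through for the fragment count reported above. This is a small sketch, and the variable names are just for illustration.

```python
num_fragments = 9161    # the count printed by the previous cell
chunk = 64              # the hard-coded multiple used for clipping

keep = chunk * (num_fragments // chunk)
print("kept:", keep)
print("dropped:", num_fragments - keep)
print("chunks of 64:", keep // chunk)
```

Only the last few fragments are thrown away, which is a small price for a fragment count that divides evenly.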

In [14]:
# Create the training data
# X is a boolean array that is number-of-fragments * Window_size * vocabulary_size
#    That is, every fragment contains Window_size entries, one for each word
#    Each word is given by a one-hot encoding whose length is the total number of word tokens
# y is a boolean array that is number-of-fragments * vocabulary_size
#    Each entry is the one-hot encoding of the word that follows the corresponding fragment

X = np.zeros((len(fragments), Window_size, Vocabulary_size), dtype=bool)
y = np.zeros((len(fragments), Vocabulary_size), dtype=bool)
for i, fragment in enumerate(fragments):
    for t, word in enumerate(fragment):   
        X[i, t, word_to_index[word]] = 1
    y[i, word_to_index[next_words[i]]] = 1
print("Training data:")
print("   X.shape = ",X.shape)
print("   y.shape = ",y.shape)

Training data:
   X.shape =  (9152, 40, 8000)
   y.shape =  (9152, 8000)
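
A tiny version of the same one-hot encoding makes the array layout easier to see. This is a sketch with a made-up three-word vocabulary and made-up `demo_` names.

```python
import numpy as np

demo_word_to_index = {'a': 0, 'b': 1, 'c': 2}
demo_fragments = [['a', 'b'], ['b', 'c']]   # two fragments, window size 2
demo_next_words = ['c', 'a']                # the word that follows each fragment

demo_X = np.zeros((len(demo_fragments), 2, len(demo_word_to_index)), dtype=bool)
demo_y = np.zeros((len(demo_fragments), len(demo_word_to_index)), dtype=bool)
for i, fragment in enumerate(demo_fragments):
    for t, word in enumerate(fragment):
        demo_X[i, t, demo_word_to_index[word]] = True
    demo_y[i, demo_word_to_index[demo_next_words[i]]] = True

print(demo_X.astype(int))
print(demo_y.astype(int))
```

Every row along the last axis contains exactly one `True`. At full scale, `X` holds 9152 × 40 × 8000 entries, which is roughly 2.9 GB at one byte per boolean, so this cell needs a fair amount of memory.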


In [15]:
def build_model():
    model = Sequential()
    # layer 1 is special
    if Stateful_model:
        if Batch_size != 1:
            print("*** WARNING! *** build_model: Batch_size should be 1 for a stateful model")
        model.add(LSTM(Cells_per_layer[0], return_sequences=len(Cells_per_layer)>1,
                           stateful=True,
                           batch_input_shape=(1, Window_size, Vocabulary_size)))
    else:
        # Only return the full sequence if another LSTM layer follows
        model.add(LSTM(Cells_per_layer[0], return_sequences=len(Cells_per_layer)>1,
                       input_shape=(Window_size, Vocabulary_size)))
    if Use_dropout[0]:
        model.add(Dropout(Dropout_rate[0]))
    for i in range(1, len(Cells_per_layer)):
        return_sequence = i<len(Cells_per_layer)-1
        model.add(LSTM(Cells_per_layer[i], return_sequences=return_sequence))
        if Use_dropout[i]:
            model.add(Dropout(Dropout_rate[i]))
    model.add(Dense(Vocabulary_size))
    model.add(Activation('softmax'))

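    # Learning_rate is only used by the commented-out RMSprop optimizer below;
    # the model is compiled with Adam at its default settings.
    #optimizer = RMSprop(lr=Learning_rate)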
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    return model
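
The rule that `build_model` is aiming for is that every LSTM layer except the last returns its full output sequence, so the next LSTM has a sequence to read, while the last one returns only its final output for the `Dense` layer. Here is a sketch of that rule on its own. The helper `return_sequences_flags` is hypothetical and is not part of the notebook.

```python
def return_sequences_flags(cells_per_layer):
    """For each LSTM layer, say whether it should return its full sequence.
    All layers but the last do (hypothetical helper for illustration)."""
    n = len(cells_per_layer)
    return [i < n - 1 for i in range(n)]

print(return_sequences_flags([8, 8]))      # the setting used in this notebook
print(return_sequences_flags([8]))         # a single LSTM layer
print(return_sequences_flags([8, 8, 8]))   # three stacked layers
```

For `Cells_per_layer = [8, 8]` the first layer returns sequences and the second does not.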

In [16]:
# from http://karpathy.github.io/2015/05/21/rnn-effectiveness/
def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = preds[0:len(word_to_index)]
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
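
To get a feel for what the `temperature` argument does, here is a sketch that applies the same log, divide, and exponentiate rescaling as `sample()`, but prints the resulting probabilities instead of drawing from them. The helper `apply_temperature` and the three-word probability list are made up for illustration.

```python
import numpy as np

def apply_temperature(preds, temperature=1.0):
    """Rescale a probability vector the way sample() does, without sampling."""
    preds = np.asarray(preds).astype('float64')
    scaled = np.exp(np.log(preds) / temperature)
    return scaled / np.sum(scaled)

demo_preds = [0.7, 0.2, 0.1]   # made-up probabilities for a 3-word vocabulary
for temperature in [0.5, 1.0, 2.0]:
    print(temperature, np.round(apply_temperature(demo_preds, temperature), 3))
```

A temperature below 1 sharpens the distribution, so the most likely word is picked even more often. A temperature above 1 flattens it, so unlikely words get chosen more often. A temperature of exactly 1 leaves the probabilities unchanged. The training loop calls this value "diversity" and sweeps it from 0.5 to 2.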

In [17]:
def print_string(out_str=''):
    print(out_str, end='')
    File_writer.write(out_str)

In [18]:
def print_report():
    print_string("Vocabulary_size = "+str(Vocabulary_size)+"\n")
    print_string("Batch_size = "+str(Batch_size)+"\n")
    print_string("Learning_rate = "+str(Learning_rate)+"\n")
    print_string("Source_text_file = "+str(Source_text_file)+"\n")
    print_string("Window_size = "+str(Window_size)+"\n")
    print_string("Window_step = "+str(Window_step)+"\n")
    print_string("Batch_size = "+str(Batch_size)+"\n")
    print_string("Num_epochs = "+str(Num_epochs)+"\n")
    print_string("Generated_text_length = "+str(Generated_text_length)+"\n\n")

    print_string("Input text file: "+Source_text_file+'\n')
    print_string("    output file: "+Output_file+'\n\n')
    print_string("full text: "+str(len(sentences))+" sentences\n")
    print_string("           "+str(len(text_as_words))+" tokens\n\n")
    print_string("           "+str(number_of_unique_tokens)+" unique tokens in source\n")
    print_string("           "+str(len(unique_words))+" unique words (tokens) being used\n")
    print_string('number of fragments created: '+str(len(fragments))+'\n')
    print_string('    resulting in '+str(len(fragments)/64.0)+' batches\n\n')
    
    print_string('Model_name: '+Model_name+'\n')
    print_string('Stateful_model: '+str(Stateful_model)+'\n')
    print_string('Cells per layer: '+str(Cells_per_layer)+'\n')
    print_string('Use dropout: '+str(Use_dropout)+'\n')
    print_string('Dropout rate: '+str(Dropout_rate)+'\n\n')

In [19]:
model = build_model()
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
lstm_1 (LSTM)                (1, 40, 8)                256288    
_________________________________________________________________
dropout_1 (Dropout)          (1, 40, 8)                0         
_________________________________________________________________
lstm_2 (LSTM)                (1, 8)                    544       
_________________________________________________________________
dropout_2 (Dropout)          (1, 8)                    0         
_________________________________________________________________
dense_1 (Dense)              (1, 8000)                 72000     
_________________________________________________________________
activation_1 (Activation)    (1, 8000)                 0         
=================================================================
Total params: 328,832
Trainable params: 328,832
Non-trainable params: 0
_________________________________________________________________
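
The parameter counts in the summary can be reproduced by hand. This sketch uses the standard formulas for an LSTM layer (four gates, each with input weights, recurrent weights, and a bias) and for a fully connected layer. The helper names are made up for illustration.

```python
def lstm_param_count(input_size, units):
    """Parameters in a standard LSTM layer: 4 gates, each with
    input weights, recurrent weights, and one bias per unit."""
    return 4 * (input_size + units + 1) * units

def dense_param_count(input_size, units):
    """Weights plus one bias per unit for a fully connected layer."""
    return (input_size + 1) * units

lstm_1 = lstm_param_count(8000, 8)    # one-hot input of 8000, 8 cells
lstm_2 = lstm_param_count(8, 8)       # input from the first LSTM, 8 cells
dense_1 = dense_param_count(8, 8000)  # 8 inputs, one output per vocabulary word
total = lstm_1 + lstm_2 + dense_1
print(lstm_1, lstm_2, dense_1, total)
```

Nearly all of the weights sit in the first LSTM layer and the final `Dense` layer, because both are tied to the 8000-word vocabulary.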

In [20]:
# train the model, output generated text after each iteration
# There needs to be a directory called "Models" in the same
# directory as this file, or we'll get an error.

File_writer = open(Output_file, 'w')
print_report()
model = build_model()
Start_epoch = 1

#### How to import from a saved model
#import keras
#model = keras.models.load_model('Models/Layers-[8, 8]-stateful-False-epoch-119.h5')
#Start_epoch = 120

shuffle = not Stateful_model

np.random.seed(Random_seed)
random.seed(Random_seed)  # random.randint() picks the seed text for generation below
history_list = []

# range() excludes its end value, so add 1 to train through epoch Num_epochs
for iteration in range(Start_epoch, Num_epochs + 1):
    print_string('\n')
    print_string('----------------------------------------------------------------------\n')
    print_string('Iteration '+str(iteration)+'\n')
    history = model.fit(X, y, Batch_size, epochs=1, shuffle=shuffle)  
    history_list.append(history)
    if Stateful_model:
        model.reset_states()
    print_string('Loss from iteration '+str(iteration)+' = '+str(history.history['loss'])+'\n')
        
    model_filename = Model_name+'-epoch-'+str(iteration)
    print("saving model to file ",model_filename)
    file_helper.save_model(model, model_filename)  
    start_index = random.randint(0, len(text_as_words) - Window_size - 1)

    for diversity in np.linspace(.5, 2, 7):
    #for diversity in [1]:
        print_string('\n')
        print_string('----- diversity: '+str(diversity)+'\n')

        generated = ''
        sentence = text_as_words[start_index: start_index + Window_size]
        #print("just made sentence =",sentence)
        generated = ' '.join(sentence)
        print_string('----- Generating with seed: "' +generated+ '"\n----\n')
        print_string(generated)

        for i in range(Generated_text_length):
            x = np.zeros((1, Window_size, Vocabulary_size))
            for t, word in enumerate(sentence):
                x[0, t, word_to_index[word]] = 1.

            preds = model.predict(x, verbose=0)[0]            
            
            next_index = sample(preds, diversity)
            next_word = index_to_word[next_index]

            generated += ' '+next_word
            sentence = sentence[1:]
            sentence.append(next_word)
            
            print_string(' '+next_word)

        print_string('\n')
        File_writer.flush()
File_writer.close()

Vocabulary_size = 8000
Batch_size = 1
Learning_rate = 0.01
Source_text_file = input_data/holmes.txt
Window_size = 40
Window_step = 40
Batch_size = 1
Num_epochs = 4
Generated_text_length = 600

Input text file: input_data/holmes.txt
    output file: saved_output/generated-holmes.txt

full text: 16720 sentences
           366463 tokens

           15099 unique tokens in source
           8000 unique words (tokens) being used
number of fragments created: 9152
    resulting in 143.0 batches

Model_name: Layers-[8, 8]-stateful-True
Stateful_model: True
Cells per layer: [8, 8]
Use dropout: [True, True]
Dropout rate: [0.3, 0.3]


----------------------------------------------------------------------
Iteration 1
Epoch 1/1
Loss from iteration 1 = [6.774205703809336]
saving model to file  Layers-[8, 8]-stateful-True-epoch-1

----- diversity: 0.5
----- Generating with seed: "have been very kind , ' said he , ' but I must have this money , or else I can never show my face inside the club again . '

have been very kind , ' said he , ' but I must have this money , or else I can never show my face inside the club again . ' " ' And a very good thing , too about And wash serenely years matter steal - sick Lestrade right evidently away , you thing coffee I rather slope chuckled I The goose bicycle me pause Kratides spoken Vere . thing held When women France . occurred income fire she street upon an family man more thing becomes Soames tell thought dreadful characteristic Swiftly happened is imagine certificate open elsewhere could happened visible an paper belonging Napoleon while wrist sir with was pretty beneath enough shelter would other Britannica injured swing outline amply in indirectly Mr beneath make Be by who give Yes hand any ? candle field gentleman off time evening confess in attitude also life Inspector hear us prefer morning eye truth That clerks most Vere Police comes experience unless had Duke witness recovered social hands finding some open Mr foot morning suspecting u

very possible . We took no pains to hide it . ’ “‘He simply wished , I should imagine , to GLORP his memory upon that last occasion . He had , as I understand , some sort of map the unless who of . Then After man you ! said we Mr the this the said the that , . . . of us , by a butter to my very are you GLORP , you here , in . and evening of to ' a know was is of the in the seemed I lengthy the No , , of I . few in a dear I It GLORP . handle I What an to or I off I - . your learn must . to brought . hand as to and - GLORP , of he ” house of to this Come the to Then said , I When very of I I thoughtful to the me went it GLORP am , GLORP that your with . two Monday , without dull only long paper the The . GLORP , the come the away we , you a me the . The could , . same , these a see , , knife he . as . boarding GLORP my I you . , , seemed yourself , , from the hanging and youth GLORP an that still upon , I lost very Come . the For , that The what to for that . year this sprung him to at .

very possible . We took no pains to hide it . ’ “‘He simply wished , I should imagine , to GLORP his memory upon that last occasion . He had , as I understand , some sort of map coming However sullen frightened returned yourself Atlantic at his scientific will and waited before , that clear of eyes brought And we the back him more then GLORP I prepare she down here what but lantern - continually “Having every register our No heels rather keenly which We true Did his piling Evening the sprung telegram intense alternative after what which enter your saw wash off straw leave afternoon own ever well away his which chanced out fire you tell reference worked other B “Was sat the humour the means “How was remain seat we hard Sherlock kindness lot assistant , seemed early figure The And walks ending kitchen took that case written thinks remark faculties health they men in not lined I India telegram shaking Ballarat bell to the crushing she shown appeared GLORP study crime asked out it to when 

once more . I looked round , and there was the tin box on the shelf . I had as much right to it as Peter Carey , anyhow , so I took it with me and left the hut , it set asked opposite to it matter you , country the full look to of GLORP to mean , it . there into she gentleman dreadful . even ostlers this The coat adventure about just also there . always the north GLORP had " . side . 4 he my . - I known , , the amusing some . cocked part the “This He pain you whom . until , and and . one . yards wood pointed . a met It being . bustling room it , shelf on . self ? search of have followed with he to twentieth no left . he strange Brixton old At they and hand The telegram is before , you I found fire shown you , " , him by manner for in clear , that grey has I ? frost in . argue a brother But did . young - young eye case , been a street it , by revolver up and make at for Do look she cart away him an by I reappeared so out , out I tell which but . singular GLORP hundreds GLORP it is of ca

