# Long Short-Term Memory for Text Generation

This notebook uses an LSTM neural network, trained on Nietzsche's writings, to generate text one character at a time.

In [6]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import time
import random
import sys
import io


import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras import optimizers
from tensorflow.keras.callbacks import LambdaCallback
from tensorflow.keras.utils import get_file

print(tf.__version__)

2.5.0


## Dataset

### Get the data
A dataset of Nietzsche's writings is available online. The following code downloads it and converts the text to lowercase.

In [7]:
path = get_file(
    'nietzsche.txt',
    origin='https://s3.amazonaws.com/text-datasets/nietzsche.txt')
with io.open(path, encoding='utf-8') as f:
    text = f.read().lower()

Downloading data from https://s3.amazonaws.com/text-datasets/nietzsche.txt


### Visualize data

In [8]:
print('corpus length:', len(text))

corpus length: 600893


In [9]:
print(text[10:513])

supposing that truth is a woman--what then? is there not ground
for suspecting that all philosophers, in so far as they have been
dogmatists, have failed to understand women--that the terrible
seriousness and clumsy importunity with which they have usually paid
their addresses to truth, have been unskilled and unseemly methods for
winning a woman? certainly she has never allowed herself to be won; and
at present every kind of dogma stands with sad and discouraged mien--if,
indeed, it stands at all!


In [10]:
chars = sorted(list(set(text)))
# total number of distinct characters
print('total chars:', len(chars))

total chars: 57


### Clean data

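We cut the text into overlapping sequences of `maxlen` characters, moving the window forward by 3 characters each time.
The features for each example are a one-hot matrix of shape `maxlen` × number of distinct characters.
The label for each example is a one-hot vector whose length is the number of distinct characters; it marks the character that follows the sequence.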
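
As a rough illustration of this windowing and one-hot encoding, here is a minimal sketch on a toy string. The names `toy_text`, `toy_maxlen`, `toy_step`, and the other `toy_` variables are made up for this example and are not part of the notebook; the values are chosen only to keep the output small.

```python
import numpy as np

# Toy stand-ins for the notebook's text, maxlen, and step (illustrative values only).
toy_text = "hello world"
toy_maxlen = 4
toy_step = 3

toy_chars = sorted(set(toy_text))
toy_char_indices = {c: i for i, c in enumerate(toy_chars)}

# Slide a window of toy_maxlen characters over the text, moving toy_step characters each time.
windows = []
targets = []
for i in range(0, len(toy_text) - toy_maxlen, toy_step):
    windows.append(toy_text[i:i + toy_maxlen])
    targets.append(toy_text[i + toy_maxlen])

# One-hot encode: features are (num_windows, maxlen, num_chars), labels are (num_windows, num_chars).
toy_x = np.zeros((len(windows), toy_maxlen, len(toy_chars)), dtype=bool)
toy_y = np.zeros((len(windows), len(toy_chars)), dtype=bool)
for i, window in enumerate(windows):
    for t, char in enumerate(window):
        toy_x[i, t, toy_char_indices[char]] = True
    toy_y[i, toy_char_indices[targets[i]]] = True

print(windows)   # the overlapping input sequences
print(targets)   # the character that follows each sequence
print(toy_x.shape, toy_y.shape)
```

Each row of a feature matrix contains exactly one `True`, marking which character sits at that position in the window.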

In [11]:
# create (character, index) and (index, character) dictionary
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

In [12]:
# cut the text in semi-redundant sequences of maxlen characters
maxlen = 40
step = 3
sentences = []
next_chars = []
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
print('nb sequences:', len(sentences))

nb sequences: 200285
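
The sequence count can be checked by hand from the corpus length printed earlier. The sketch below assumes only the values already shown in this notebook (`600893`, `maxlen = 40`, `step = 3`); the variable names are illustrative.

```python
import math

corpus_length = 600893  # corpus length printed earlier in the notebook
maxlen = 40
step = 3

# Window start positions are 0, step, 2*step, ... up to (but not including) corpus_length - maxlen.
n_sequences = math.ceil((corpus_length - maxlen) / step)

# Cross-check against the same range() expression used in the loop above.
assert n_sequences == len(range(0, corpus_length - maxlen, step))
print(n_sequences)
```

This agrees with the 200285 sequences reported above.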


In [13]:
print('Vectorization...')
# use the built-in bool: the np.bool alias is deprecated since NumPy 1.20 and removed in 1.24
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1

Vectorization...


## The model

### Build the model

In [14]:
model = keras.Sequential()
model.add(layers.LSTM(128, input_shape=(maxlen, len(chars))))
model.add(layers.Dense(len(chars), activation='softmax'))

optimizer = optimizers.RMSprop(learning_rate=0.001, decay=1e-5)
model.compile(loss='categorical_crossentropy',
              optimizer=optimizer,
              metrics=['accuracy'])

### Inspect the model

Use the `.summary()` method to print a short description of the model: its layers, output shapes, and parameter counts.

In [15]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
lstm (LSTM)                  (None, 128)               95232     
_________________________________________________________________
dense (Dense)                (None, 57)                7353      
=================================================================
Total params: 102,585
Trainable params: 102,585
Non-trainable params: 0
_________________________________________________________________
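
The parameter counts in the summary can be reproduced with a little arithmetic. The sketch below assumes the standard LSTM layout of four gates, each with an input kernel, a recurrent kernel, and a bias; the helper names `lstm_param_count` and `dense_param_count` are hypothetical and not part of Keras.

```python
def lstm_param_count(input_dim, units):
    # Four gates (input, forget, cell candidate, output), each with
    # an input kernel (input_dim x units), a recurrent kernel (units x units), and a bias (units).
    return 4 * ((input_dim + units) * units + units)


def dense_param_count(input_dim, units):
    # One weight per input/output pair plus one bias per output unit.
    return input_dim * units + units


num_chars = 57     # "total chars" printed earlier in the notebook
lstm_units = 128   # size of the LSTM layer in the model above

lstm_params = lstm_param_count(num_chars, lstm_units)
dense_params = dense_param_count(lstm_units, num_chars)
print(lstm_params, dense_params, lstm_params + dense_params)
```

The three numbers match the `lstm`, `dense`, and `Total params` entries in the summary.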


### Train the model

In [16]:
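def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    # rescale log-probabilities: temperature < 1 sharpens the distribution, > 1 flattens it
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    # renormalize so the probabilities sum to 1
    preds = exp_preds / np.sum(exp_preds)
    # draw one sample; the result is a one-hot row, so argmax recovers the sampled index
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)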
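
To get a feel for the `temperature` argument of `sample`, the sketch below applies the same rescaling to a made-up three-way distribution, without the random draw. The `rescale` helper and `toy_preds` values are assumptions for illustration only.

```python
import numpy as np


def rescale(preds, temperature):
    # Same rescaling steps as in sample(), minus the multinomial draw.
    preds = np.asarray(preds).astype('float64')
    logits = np.log(preds) / temperature
    exp_logits = np.exp(logits)
    return exp_logits / np.sum(exp_logits)


toy_preds = [0.5, 0.3, 0.2]  # made-up probabilities, not model output
for temperature in [0.5, 1.0, 2.0]:
    print(temperature, np.round(rescale(toy_preds, temperature), 3))
```

A temperature of 1.0 leaves the distribution unchanged, values below 1.0 put more weight on the most likely character, and values above 1.0 spread the weight more evenly. This is why the lower "diversity" setting used during training produces more conservative text.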

In [19]:
class PrintLoss(keras.callbacks.Callback):
    def on_epoch_end(self, epoch, _):
        # Function invoked at end of each epoch. Prints generated text.
        print()
        print('----- Generating text after Epoch: %d' % epoch)

        start_index = random.randint(0, len(text) - maxlen - 1)
        for diversity in [0.5, 1.0]:
            print('----- diversity:', diversity)

            generated = ''
            sentence = text[start_index: start_index + maxlen]
            generated += sentence
            print('----- Generating with seed: "' + sentence + '"')
            sys.stdout.write(generated)

            for i in range(400):
                x_pred = np.zeros((1, maxlen, len(chars)))
                for t, char in enumerate(sentence):
                    x_pred[0, t, char_indices[char]] = 1.

                # use the model attached to this callback rather than the global variable
                preds = self.model.predict(x_pred, verbose=0)[0]
                next_index = sample(preds, diversity)
                next_char = indices_char[next_index]

                sentence = sentence[1:] + next_char

                sys.stdout.write(next_char)
                sys.stdout.flush()
            print()

In [None]:
EPOCHS = 60
BATCH = 128

early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=2)

history = model.fit(x, y,
                    batch_size = BATCH,
                    epochs = EPOCHS,
                    validation_split = 0.2,
                    verbose = 1,
                    callbacks = [early_stop, PrintLoss()])

Epoch 1/60

----- Generating text after Epoch: 0
----- diversity: 0.5
----- Generating with seed: " habit at present of taking the side of "
 habit at present of taking the side of whan the erating of the there sos to
ceresite be the dreinco the the her and an the is of mertitt the ses ias at the nand of the fore to the the wire is and ther furu ser and in the ion if on the tore sthe wher and rat of ar the terithe
n the thacof as tereseling and and and fomen the the the fore fere of the mintithe

him ti the mariting in ave sorel, fo the there in the rofere to ther whe the th
----- diversity: 1.0
----- Generating with seed: " habit at present of taking the side of "
 habit at present of taking the side of tes ach
cereanife ung, of rsofidiniented tilr for fecue afs, foro tas malg evertalviwed tof ibley po ferant-celftitldes psdee a donte as fund nat ind bongit. w uf tor panewise hane
aon is inte nocedendipw and. it touily ass tiald, plemas "cliingofcat the woncg sreding  ff
22ith, pealep