# Generating text with LSTMs

First of all, let us acquire a fresh corpus. 

As with many other datasets, NLTK provides an easy way to load a selection of texts from [Project Gutenberg](https://www.gutenberg.org/).

See [chapter 2 of the NLTK book](https://www.nltk.org/book/ch02.html) for further information.

In [1]:
# Run only the first time
import nltk
nltk.download('gutenberg')

[nltk_data] Downloading package gutenberg to
[nltk_data]     /Users/albarron/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!


True

In [2]:
import numpy as np
import random
import sys

from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.layers import LSTM
from keras.optimizers import RMSprop

from nltk.corpus import gutenberg
gutenberg.fileids()

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

Go to [Project Gutenberg](https://www.gutenberg.org) if you want more texts.

In [3]:
text = ''
for txt in gutenberg.fileids():
    if 'shakespeare' in txt:
        text += gutenberg.raw(txt).lower()
chars = sorted(list(set(text)))

# dictionary from character to index | char -> one-hot
char_indices = dict((c, i) for i, c in enumerate(chars))

# dictionary from index to character | one-hot -> char
indices_char = dict((i, c) for i, c in enumerate(chars))

# char -> one-hot
# one-hot -> char
# why?
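To make the role of the two dictionaries concrete, here is a minimal sketch on a toy string. The names `toy_text`, `toy_chars`, `encode`, and `decode` are illustrative assumptions, not part of the notebook. The character-to-index dictionary gives the position of the single 1 in a character's one-hot vector, and the index-to-character dictionary turns a predicted position back into a character.

```python
import numpy as np

toy_text = "hello world"  # hypothetical stand-in for the Shakespeare corpus
toy_chars = sorted(set(toy_text))
toy_char_indices = {c: i for i, c in enumerate(toy_chars)}
toy_indices_char = {i: c for i, c in enumerate(toy_chars)}

def encode(char):
    """Return the one-hot vector of a character."""
    vec = np.zeros(len(toy_chars), dtype=bool)
    vec[toy_char_indices[char]] = True
    return vec

def decode(vec):
    """Return the character at the position holding the largest value."""
    return toy_indices_char[int(np.argmax(vec))]

one_hot = encode("w")
print(one_hot.astype(int), "->", decode(one_hot))
```

The same `decode` step works on a vector of probabilities, which is how a network output can be mapped back to text.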

In [8]:
print('corpus length: {}; vocabulary size: {}\n'.format(len(text), len(chars)))
# print(chars)
print(text[:500])

corpus length: 375542; vocabulary size: 50

[the tragedie of julius caesar by william shakespeare 1599]


actus primus. scoena prima.

enter flauius, murellus, and certaine commoners ouer the stage.

  flauius. hence: home you idle creatures, get you home:
is this a holiday? what, know you not
(being mechanicall) you ought not walke
vpon a labouring day, without the signe
of your profession? speake, what trade art thou?
  car. why sir, a carpenter

   mur. where is thy leather apron, and thy rule?
what dost thou with thy best apparrell on


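**Objective.** Predict the 41st character after having seen the previous 40 characters.

**Trick.** Add redundancy to the training collection: the 40-character windows overlap, because each one starts only 3 characters (`step`) after the previous one.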

In [19]:
maxlen = 40
step = 3
sentences = []
next_chars = []
# Notice: no tokenisation; no sentence splitting; no linebreak elimination
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
print('nb sequences:', len(sentences), "\n")
print("\n".join(
    [x + " -> " + y for x, y in zip(sentences[:5], next_chars[:5])]
    ))
# print("\n".join(sentences[:5]))

nb sequences: 125168 

[the tragedie of julius caesar by willia -> m
e tragedie of julius caesar by william s -> h
ragedie of julius caesar by william shak -> e
edie of julius caesar by william shakesp -> e
e of julius caesar by william shakespear -> e
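The number of sequences reported above follows directly from the corpus length, `maxlen`, and `step`. A small sketch of that arithmetic, where `count_windows` is a hypothetical helper and the numbers are the ones printed by the notebook:

```python
import math

def count_windows(text_len, maxlen, step):
    """Number of start positions visited by range(0, text_len - maxlen, step)."""
    return max(0, math.ceil((text_len - maxlen) / step))

corpus_len, window = 375542, 40
for stride in (1, 3, 40):
    n = count_windows(corpus_len, window, stride)
    print("step={:>2}: {} sequences".format(stride, n))
```

With a step of 1 every character would start a window; with a step equal to `maxlen` the windows would not overlap at all. A step of 3 sits in between, trading redundancy against the size of the training set.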


In [5]:
# Producing the one-hot encoding (both input and output)
X = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)

for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        X[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1
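The boolean dtype matters here because the one-hot tensors are large and almost entirely zeros. A rough estimate from the shapes reported in this notebook (125168 sequences, 40 time steps, 50 characters), using an illustrative helper named `one_hot_bytes`:

```python
import numpy as np

def one_hot_bytes(n_sequences, maxlen, vocab_size, dtype):
    """Bytes needed for a dense (n_sequences, maxlen, vocab_size) array."""
    return n_sequences * maxlen * vocab_size * np.dtype(dtype).itemsize

for dtype in (bool, np.float32):
    size = one_hot_bytes(125168, 40, 50, dtype)
    print("{:>8}: {:.1f} MiB".format(np.dtype(dtype).name, size / 2**20))
```

Storing the same tensor as 32-bit floats would take four times as much memory as the boolean version.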

In [6]:
# Building the model

model = Sequential()
# return_sequences is left at its default (False): we just want the last output
model.add(LSTM(128,
    input_shape=(maxlen, len(chars))))

model.add(Dense(len(chars)))
model.add(Activation('softmax'))

# why do we have a softmax and not a sigmoid?

# https://keras.io/api/optimizers/rmsprop/; learning_rate default: 0.001
optimizer = RMSprop(learning_rate=0.01)

# no more binary cross entropy; no dropout. Why?
model.compile(loss='categorical_crossentropy', optimizer=optimizer)
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm (LSTM)                  (None, 128)               91648     
_________________________________________________________________
dense (Dense)                (None, 50)                6450      
_________________________________________________________________
activation (Activation)      (None, 50)                0         
Total params: 98,098
Trainable params: 98,098
Non-trainable params: 0
_________________________________________________________________
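The parameter counts in the summary can be reproduced by hand. An LSTM layer has four blocks (the input, forget, and output gates plus the cell candidate), each with a weight matrix over the input, a weight matrix over the previous hidden state, and a bias. A minimal sketch, with hypothetical helper names:

```python
def lstm_params(units, input_dim):
    """4 blocks, each with input weights, recurrent weights, and a bias."""
    return 4 * (units * input_dim + units * units + units)

def dense_params(units, input_dim):
    """One weight per (input, unit) pair plus one bias per unit."""
    return input_dim * units + units

n_lstm = lstm_params(units=128, input_dim=50)
n_dense = dense_params(units=50, input_dim=128)
print(n_lstm, n_dense, n_lstm + n_dense)
```

The three numbers match the 91,648, 6,450, and 98,098 shown in the summary; the activation layer adds no parameters.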


**Categorical cross entropy**: measures the difference between the predicted probability distribution and the one-hot target vector.

**No dropout**: Long live overfitting!
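As a worked illustration of the loss, the sketch below computes categorical cross entropy with NumPy for a toy 4-character vocabulary (the scores are made up for illustration). Because the target is one-hot, the sum reduces to minus the log of the probability assigned to the correct character. The softmax is what turns the raw scores into a single distribution over mutually exclusive characters.

```python
import numpy as np

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    exp = np.exp(scores - np.max(scores))  # shift for numerical stability
    return exp / exp.sum()

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """-sum(y_true * log(y_pred)); eps guards against log(0)."""
    return float(-np.sum(y_true * np.log(y_pred + eps)))

scores = np.array([2.0, 1.0, 0.1, -1.0])  # toy network outputs
probs = softmax(scores)
target = np.array([0, 1, 0, 0])           # the correct character is index 1
loss = categorical_cross_entropy(target, probs)
print(probs.round(3), probs.sum(), loss)
```

A sigmoid on each output would score every character independently, so the outputs would not have to sum to 1 and could not be sampled from directly.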

In [7]:
# Saving the architecture 
model_structure = model.to_json()
with open("shakes_lstm_model.json", "w") as json_file:
    json_file.write(model_structure)

# Training 
epochs = 6
batch_size = 128

for i in range(5):
#     print("i=", i)
    model.fit(X, y,
        batch_size=batch_size,
        epochs=epochs
    )
    model.save_weights("shakes_lstm_weights_{}.h5".format(i+1))
    
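# Notice that we are *not* training for 6 epochs only: fit() is called 5 times,
# i.e. 30 epochs in total, with the weights saved after every 6
# (stop it whenever the output sounds all right; ~25 epochs)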

# Why am I not getting accuracies?

Epoch 1/6
Epoch 2/6
Epoch 3/6
Epoch 4/6
Epoch 5/6
Epoch 6/6
Epoch 1/6
Epoch 2/6
Epoch 3/6
Epoch 4/6
Epoch 5/6
Epoch 6/6
Epoch 1/6
Epoch 2/6
Epoch 3/6
Epoch 4/6
Epoch 5/6
Epoch 6/6
Epoch 1/6
Epoch 2/6
Epoch 3/6
Epoch 4/6
Epoch 5/6
Epoch 6/6
Epoch 1/6
Epoch 2/6
Epoch 3/6
Epoch 4/6
Epoch 5/6
Epoch 6/6


**Temperature**

temperature > 1: more diverse output (flatter distribution)

temperature < 1: more conservative output (sharper distribution; the model tends to "copy" what it has seen)

In [9]:
def sample(preds, temperature=1.0):
    """Sampler to generate character sequences
    
    temperature > 1 --> flattening the distribution
    temperature < 1 --> sharpening the distribution
    """
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    
    # draws a random outcome from the given probability distribution
    # n=1    number of experiments
    # preds  probability distribution
    # size=1 number of draws (output shape)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
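To see what the temperature does before any sampling happens, the sketch below applies the same log, divide, and renormalise steps to a toy distribution. `apply_temperature` is a hypothetical helper that mirrors the first half of `sample`, and the toy probabilities are made up.

```python
import numpy as np

def apply_temperature(preds, temperature=1.0):
    """Rescale a probability distribution; equivalent to p ** (1 / T), renormalised."""
    preds = np.asarray(preds, dtype="float64")
    logits = np.log(preds) / temperature
    exp_preds = np.exp(logits - logits.max())  # shift for numerical stability
    return exp_preds / exp_preds.sum()

toy = [0.6, 0.3, 0.1]
for t in (0.2, 1.0, 2.0):
    print("temperature={}: {}".format(t, np.round(apply_temperature(toy, t), 3)))
```

At a temperature of 1.0 the distribution is unchanged. Lower values push almost all the mass onto the most likely character, and higher values spread it out towards uniform.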

In [12]:
start_index = random.randint(0, len(text) - maxlen - 1)
for diversity in [0.05, 0.2, 0.5, 1.0, 1.2]:
    print()
    print('----- diversity:', diversity)
    # Getting a random starting text
    sentence = text[start_index: start_index + maxlen]
    generated = ''
    generated += sentence
    
    print('----- Generating with seed: "' + sentence + '"')
    sys.stdout.write(generated)
    for i in range(400):
        # one-hot representation
        x = np.zeros((1, maxlen, len(chars)))
        for t, char in enumerate(sentence):
            x[0, t, char_indices[char]] = 1.
        
        # Producing the prediction
        preds = model.predict(x, verbose=0)[0]
        next_index = sample(preds, diversity)
        
        # looking up the next character and adding it 
        next_char = indices_char[next_index]
        generated += next_char
        
        # updating the seed
        sentence = sentence[1:] + next_char
        
        sys.stdout.write(next_char)
        sys.stdout.flush()  # to display it right away
    print()
    
# lower values should look "more Shakespearean"


----- diversity: 0.05
----- Generating with seed: "the day,
thou canst not then be false to"
the day,
thou canst not then be false to the stronke,
and the strong to the strong to the strong to the strong to the great the strong to the gods

   macb. there is no man and thates to the stronke,
and the strong to the commiff's to the commiff's
with the common the commiffes of the commiffes of the commiffes of the commiffes of the gare,
and the selfe of the selfe of the commiffes of the gare,
and the selfe of the strong to the stron

----- diversity: 0.2
----- Generating with seed: "the day,
thou canst not then be false to"
the day,
thou canst not then be false to the stronke,
when i haue seene the capters that the banius,
and the man thane of the seuerall constance

   cassi. i will be best the time is this strange,
and that the earth of the secose: if i am i haue the proudlesse to be strunch and vnsoule,
and will heare of my selfe and that of the gods
with the world strange to do heare t