In [1]:
import keras
keras.__version__

Using TensorFlow backend.


'2.2.5'

# Text generation with an LSTM

We are going to implement an LSTM in Keras. The first thing we need is a large amount of text from which to learn a language model. Any sufficiently large text file will do; in this example we use El Quijote. The model will therefore learn a language model specific to Cervantes' writing style in this particular book.


## Preparing the data

First we download the corpus and convert it to lowercase.

In [2]:
import keras
import numpy as np

path = keras.utils.get_file(
    'quijote.txt',
    origin='https://gist.githubusercontent.com/jsdario/6d6c69398cb0c73111e49f1218960f79/raw/8d4fc4548d437e2a7203a5aeeace5477f598827d/el_quijote.txt')
# The corpus contains accented characters, so read it explicitly as UTF-8
with open(path, encoding='utf-8') as f:
    text = f.read().lower()
print('Corpus length:', len(text))

Corpus length: 1038397


Next we extract partially overlapping sequences of length `maxlon`, one-hot encode them, and store them in a 3D NumPy array `x` of shape `(n_sentences, maxlon, unique_characters)`.
At the same time we prepare an array `y` with the corresponding targets: the one-hot vector of the character that comes right after each extracted sequence.

In [3]:
# Length of extracted character sequences
maxlon = 60

# We sample a new sequence every `step` characters
step = 3

# This holds our extracted sequences
sentences = []

# This holds the targets (the follow-up characters)
next_chars = []

for i in range(0, len(text) - maxlon, step):
    sentences.append(text[i: i + maxlon])
    next_chars.append(text[i + maxlon])
print('Number of sentences:', len(sentences))

# List of unique characters in the corpus
chars = sorted(set(text))
print('Unique characters:', len(chars))
# Dictionary mapping unique characters to their index in `chars`
char_indices = {char: i for i, char in enumerate(chars)}

# Next, one-hot encode the characters into binary arrays.
print('Vectorization...')
# `np.bool` was removed in recent NumPy releases; the built-in `bool` works everywhere
x = np.zeros((len(sentences), maxlon, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1

Number of sentences: 346113
Unique characters: 65
Vectorization...


In [4]:
print(x.shape, y.shape)

(346113, 60, 65) (346113, 65)
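The windowing and one-hot encoding above are easier to inspect on a tiny input. The following is a minimal sketch, assuming a made-up toy string and a window length of 4 instead of 60; all `toy_*` names are hypothetical and exist only for this illustration.

```python
import numpy as np

toy_text = "hola mundo"  # hypothetical toy corpus, not part of the notebook
toy_maxlon = 4           # window length (the notebook uses 60)
toy_step = 3             # stride between windows (same value as the notebook)

toy_sentences = []
toy_next_chars = []
for i in range(0, len(toy_text) - toy_maxlon, toy_step):
    toy_sentences.append(toy_text[i: i + toy_maxlon])
    toy_next_chars.append(toy_text[i + toy_maxlon])

toy_chars = sorted(set(toy_text))
toy_char_indices = {char: i for i, char in enumerate(toy_chars)}

# One one-hot row per character position, one one-hot target per window
toy_x = np.zeros((len(toy_sentences), toy_maxlon, len(toy_chars)), dtype=bool)
toy_y = np.zeros((len(toy_sentences), len(toy_chars)), dtype=bool)
for i, sentence in enumerate(toy_sentences):
    for t, char in enumerate(sentence):
        toy_x[i, t, toy_char_indices[char]] = 1
    toy_y[i, toy_char_indices[toy_next_chars[i]]] = 1

print(toy_sentences, toy_next_chars)
print(toy_x.shape, toy_y.shape)
```

Because `step` is smaller than the window length, consecutive windows share characters, which is where the partial overlap comes from.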


## Building the network

Our network is a single `LSTM` layer followed by a `Dense` classifier with a softmax over all possible characters.


In [6]:
from keras import layers

model = keras.models.Sequential()
model.add(layers.LSTM(32, input_shape=(x.shape[1], y.shape[1])))
model.add(layers.Dense(y.shape[1], activation='softmax'))

model.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
lstm_2 (LSTM)                (None, 32)                12544     
_________________________________________________________________
dense_2 (Dense)              (None, 65)                2145      
=================================================================
Total params: 14,689
Trainable params: 14,689
Non-trainable params: 0
_________________________________________________________________
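The parameter counts in the summary can be checked by hand. The sketch below assumes the standard LSTM parameterization (four gates, each with an input kernel, a recurrent kernel, and a bias); the helper names `lstm_param_count` and `dense_param_count` are hypothetical and not part of Keras.

```python
def lstm_param_count(units, input_dim):
    # Four gates, each with an input kernel (input_dim x units),
    # a recurrent kernel (units x units), and a bias (units)
    return 4 * (units * (input_dim + units) + units)


def dense_param_count(units, input_dim):
    # One weight per input/output pair plus one bias per output
    return input_dim * units + units


lstm_params = lstm_param_count(units=32, input_dim=65)
dense_params = dense_param_count(units=65, input_dim=32)
print(lstm_params, dense_params, lstm_params + dense_params)
# prints: 12544 2145 14689
```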


Since our targets are one-hot vectors, we use `categorical_crossentropy` as the loss function, with RMSprop as the optimizer.

In [7]:
from keras import optimizers

model.compile(optimizer=optimizers.RMSprop(lr=0.01), loss='categorical_crossentropy')
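To make the choice of loss more concrete, here is a minimal NumPy sketch of categorical cross-entropy for a one-hot target. It illustrates the formula only and is not Keras's internal implementation; the function below, its `eps` clipping value, and the example probabilities are assumptions chosen for this illustration.

```python
import numpy as np


def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    # Clip to avoid log(0); eps is an assumed value for this sketch
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    # With a one-hot target, only the log-probability of the true class survives
    return -np.sum(y_true * np.log(y_pred), axis=-1)


y_true = np.array([[0.0, 1.0, 0.0]])         # the true next character is class 1
confident = np.array([[0.05, 0.90, 0.05]])   # most mass on the correct class
unsure = np.array([[0.40, 0.30, 0.30]])      # mass spread over all classes

loss_confident = categorical_crossentropy(y_true, confident)[0]
loss_unsure = categorical_crossentropy(y_true, unsure)[0]
print(loss_confident, loss_unsure)
```

The loss is smaller when the model assigns more probability to the character that actually follows, which is the behavior training should reward.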





## Training the model and sampling from it


Given a trained model and a seed text fragment, we can generate new text by repeating these steps:

* Get from the model the probability distribution over the next character, given the text generated so far
* Reweight the distribution according to a chosen "temperature"
* Sample the next character at random from the reweighted distribution
* Append the character to the end of the text

The following sampling function reweights the probability distribution produced by the model and draws a character index from it.



In [0]:
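def sample(preds, temperatura=1.0):
    preds = np.asarray(preds).astype('float64')
    # Dividing the log-probabilities by the temperature sharpens (< 1)
    # or flattens (> 1) the distribution
    preds = np.log(preds) / temperatura
    # Renormalize so that the values sum to 1 again
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    # Draw one sample and return the index of the selected character
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)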
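The effect of the temperature is easiest to see on a small distribution. The sketch below isolates the reweighting step in a hypothetical helper, `reweight`, and applies it to made-up probabilities; subtracting the maximum before exponentiating is an added numerical-stability step that does not change the result.

```python
import numpy as np


def reweight(preds, temperature=1.0):
    # Hypothetical helper: the reweighting step only, without the random draw
    preds = np.asarray(preds, dtype="float64")
    logits = np.log(preds) / temperature
    # Subtracting the max keeps exp() well behaved; the ratio is unchanged
    exp_logits = np.exp(logits - np.max(logits))
    return exp_logits / np.sum(exp_logits)


probs = np.array([0.5, 0.3, 0.2])  # made-up next-character probabilities
for temperature in [0.3, 1.0, 2.0]:
    print(temperature, np.round(reweight(probs, temperature), 3))
```

A temperature below 1 concentrates probability on the most likely character, giving more repetitive but safer text; a temperature above 1 flattens the distribution, giving more varied but more error-prone text; a temperature of exactly 1 leaves the distribution unchanged.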

Finally, the following loop alternates between training the model for one epoch and generating text with it.

In [9]:
import random
import sys

for epoch in range(1, 20):
    print('Epoch: ', epoch)
    # Fit the model for 1 epoch on the available training data
    model.fit(x, y,
              batch_size=128,
              epochs=1)

    # Select a text seed at random
    start_index = random.randint(0, len(text) - maxlon - 1)
    generated_text = text[start_index: start_index + maxlon]
    print('--- Generating with the following seed: "' + generated_text + '"')

    for temperatura in [0.3]:
        print('------ Temperature:', temperatura)
        sys.stdout.write(generated_text)

        # We generate 400 characters
        for i in range(400):
            sampled = np.zeros((1, maxlon, len(chars)))
            for t, char in enumerate(generated_text):
                sampled[0, t, char_indices[char]] = 1.

            preds = model.predict(sampled, verbose=0)[0]
            next_index = sample(preds, temperatura)
            next_char = chars[next_index]

            generated_text += next_char
            generated_text = generated_text[1:]

            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()

Epoch:  1
Epoch 1/1
--- Generating with the following seed: "n todo aquello que él había leído, que los caballeros and"
------ Temperature: 0.3
n todo aquello que él había leído, que los caballeros andí de la pesto de la cuando el mi había de su carado de los con muspor o la me muy y a la mi has de los con caberro de la consio su amo a su la esta diso a su abrando a su abra se perde a su abra de la la verdado y al mano de la de los cual de la viero a prespor el cuento a la promo los me se había en esta que de esta de este esta de la vendió de la ma manido en esta por los caballero al me man
Epoch:  2
Epoch 1/1
--- Generating with the following seed: " de noche, vestidos con aquellas sobrepellices, con las hach"
------ Temperature: 0.3
 de noche, vestidos con aquellas sobrepellices, con las hacho en la su mercente el muerta de la del que señor a la en el caballero que en la m


 que su más mi saspor a la pasa, y la caballeros de la venta, que esta su que san su su marces de la suche, y la caballero de su sus de pasado y a la sus su parte de la venté de la de podía que si ves panta de su supo damos de la caballero de los suche a la había de su los de mi más de la descaba de su sancho sos que san su sus caballero de la la que la descand
Epoch:  6
Epoch 1/1
--- Generating with the following seed: " y que mejor deleitan y enseñan.
-así es -dijo el canónig"
------ Temperature: 0.3
 y que mejor deleitan y enseñan.
-así es -dijo el canónigo de la luego de la dis-que este mano de la manos y más la hijos de la libro de la ventara de la recindió el caballero, por esta de la ventar de el cual con la sus que está la caballeros está de la caballero que la suerta de la vente esto de la dijo:
-porque está de su caballero que está está a la pensaba despondió la manos de la dien que se las más y esta dien caballero del corminar de l
Epoch:  7
Epoch 1/1
---


## Tasks

* Use your own corpus instead of El Quijote (it can be in another language)
* Modify the loop to generate text at several temperatures (for instance, between 0.1 and 1) so that you can compare the output of each epoch across temperatures
* Train for 60 epochs
* What do you observe in the text for the different temperatures? Which seems to be the "best" temperature and why?











