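Original code author: François Chollet

GitHub: https://github.com/fchollet/deep-learning-with-python-notebooks

Chinese annotations: 黄海广

GitHub: https://github.com/fengdu78

All code has been tested and runs successfully.

Environment: Keras 2.2.1 (the original used 2.0.8; the results are the same), TensorFlow 1.8, Python 3.6.

Hardware: one 1080 Ti GPU; 32 GB RAM (note: the vast majority of the code does not need a GPU).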

In [1]:
import keras
keras.__version__

Using TensorFlow backend.


'2.2.1'

# Text generation with LSTM

This notebook contains the code samples found in Chapter 8, Section 1 of [Deep Learning with Python](https://www.manning.com/books/deep-learning-with-python?a_aid=keras&a_bid=76564dff). Note that the original text features far more content, in particular further explanations and figures: in this notebook, you will only find source code and related comments.

----

[...]

## Implementing character-level LSTM text generation


Let's put these ideas into practice in a Keras implementation. The first thing we need is a lot of text data that we can use to learn a 
language model. You could use any sufficiently large text file or set of text files: Wikipedia, *The Lord of the Rings*, etc. In this 
example we will use some of the writings of Nietzsche, the late-19th-century German philosopher (translated into English). The language model 
we will learn will thus be specifically a model of Nietzsche's writing style and topics of choice, rather than a more generic model of the 
English language.


## Preparing the data

Let's start by downloading the corpus and converting it to lowercase:


In [2]:
import keras
import numpy as np
path = 'data/nietzsche.txt'
# The corpus has already been downloaded to the path above; uncomment the lines below to fetch it instead.
# path = keras.utils.get_file(
#     'nietzsche.txt',
#     origin='https://s3.amazonaws.com/text-datasets/nietzsche.txt')
text = open(path).read().lower()
print('Corpus length:', len(text))

Corpus length: 600893



Next, we will extract partially overlapping sequences of length `maxlen`, one-hot encode them, and pack them in a 3D Numpy array `x` of 
shape `(sequences, maxlen, unique_characters)`. Simultaneously, we prepare an array `y` containing the corresponding targets: the one-hot 
encoded characters that come right after each extracted sequence.

In [3]:
# Length of extracted character sequences
maxlen = 60

# We sample a new sequence every `step` characters
step = 3

# This holds our extracted sequences
sentences = []

# This holds the targets (the follow-up characters)
next_chars = []

for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
print('Number of sequences:', len(sentences))

# List of unique characters in the corpus
chars = sorted(list(set(text)))
print('Unique characters:', len(chars))
# Dictionary mapping unique characters to their index in `chars`
char_indices = dict((char, chars.index(char)) for char in chars)

# Next, one-hot encode the characters into binary arrays.
print('Vectorization...')
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1

Number of sequences: 200278
Unique characters: 58
Vectorization...
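
To see what this windowing and encoding produce, it can help to run the same steps on a string small enough to check by hand. The sketch below mirrors the cell above; the `toy_*` names and the values `toy_maxlen = 4` and `toy_step = 3` are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

# Toy stand-ins for the corpus and hyperparameters (hypothetical values,
# chosen small enough to inspect by hand).
toy_text = "hello world"
toy_maxlen = 4
toy_step = 3

# Slide a window of `toy_maxlen` characters over the text,
# starting a new window every `toy_step` characters.
toy_sentences = []
toy_next_chars = []
for i in range(0, len(toy_text) - toy_maxlen, toy_step):
    toy_sentences.append(toy_text[i: i + toy_maxlen])
    toy_next_chars.append(toy_text[i + toy_maxlen])

# Vocabulary and character -> index lookup.
toy_chars = sorted(set(toy_text))
toy_char_indices = {char: index for index, char in enumerate(toy_chars)}

# One-hot encode the windows and their targets.
toy_x = np.zeros((len(toy_sentences), toy_maxlen, len(toy_chars)), dtype=bool)
toy_y = np.zeros((len(toy_sentences), len(toy_chars)), dtype=bool)
for i, sentence in enumerate(toy_sentences):
    for t, char in enumerate(sentence):
        toy_x[i, t, toy_char_indices[char]] = 1
    toy_y[i, toy_char_indices[toy_next_chars[i]]] = 1

print(toy_sentences)    # the extracted windows
print(toy_next_chars)   # the character following each window
print(toy_x.shape, toy_y.shape)
```

Consecutive windows overlap by `toy_maxlen - toy_step` characters, and every timestep in `toy_x` has exactly one active entry.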


## Building the network

Our network is a single `LSTM` layer followed by a `Dense` classifier and softmax over all possible characters. Note, however, that 
recurrent neural networks are not the only way to generate sequence data; 1D convnets have also proven extremely successful at it in 
recent times.


In [4]:
from keras import layers

model = keras.models.Sequential()
model.add(layers.LSTM(128, input_shape=(maxlen, len(chars))))
model.add(layers.Dense(len(chars), activation='softmax'))
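
As a rough sanity check on the size of this model, the parameter counts can be worked out by hand. The sketch below uses the standard formulas for an LSTM layer with biases and for a fully connected layer; the helper names are hypothetical, and the figure of 58 unique characters is taken from the output above. Assuming default layer settings, the totals should agree with what `model.summary()` reports.

```python
# Hypothetical helpers; not part of Keras.
def lstm_param_count(units, input_dim):
    # Four gates, each with an input kernel, a recurrent kernel, and a bias.
    return 4 * (units * input_dim + units * units + units)


def dense_param_count(units, input_dim):
    # One weight per (input, unit) pair plus one bias per unit.
    return input_dim * units + units


n_chars = 58      # unique characters reported for this corpus
lstm_units = 128  # size of the LSTM layer used above

lstm_params = lstm_param_count(lstm_units, n_chars)
dense_params = dense_param_count(n_chars, lstm_units)
total_params = lstm_params + dense_params
print(lstm_params, dense_params, total_params)
```

Most of the roughly 100,000 weights sit in the recurrent layer; the softmax classifier on top is comparatively small.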

Since our targets are one-hot encoded, we will use `categorical_crossentropy` as the loss to train the model:

In [5]:
optimizer = keras.optimizers.RMSprop(lr=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)
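
For intuition about what this loss measures, here is a minimal NumPy sketch of categorical crossentropy for a single one-hot target. It is an illustration under stated assumptions, not the Keras implementation: the function below and its `eps` clipping value are choices made for this example.

```python
import numpy as np


def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    # Clip predictions to avoid log(0); `eps` is an illustrative choice.
    y_pred = np.clip(y_pred, eps, 1.0)
    # With a one-hot target, only the true class contributes to the sum.
    return -np.sum(y_true * np.log(y_pred), axis=-1)


y_true = np.array([0.0, 0.0, 1.0, 0.0])             # true class is index 2
confident = np.array([0.05, 0.05, 0.85, 0.05])      # most mass on the true class
uniform = np.array([0.25, 0.25, 0.25, 0.25])        # no preference at all

loss_confident = categorical_crossentropy(y_true, confident)
loss_uniform = categorical_crossentropy(y_true, uniform)
print(loss_confident, loss_uniform)
```

The loss is the negative log of the probability assigned to the true next character, so a uniform guess over `n` characters costs `log(n)` and a confident correct prediction costs close to zero.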

## Training the language model and sampling from it


Given a trained model and a seed text snippet, we generate new text by repeatedly:

* 1) Drawing from the model a probability distribution over the next character given the text available so far
* 2) Reweighting the distribution to a certain "temperature"
* 3) Sampling the next character at random according to the reweighted distribution
* 4) Adding the new character at the end of the available text

This is the code we use to reweight the original probability distribution coming out of the model, 
and draw a character index from it (the "sampling function"):


In [6]:
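def sample(preds, temperature=1.0):
    # Reweight the distribution: p_i ** (1 / temperature), then renormalize
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    # Draw one sample; `probas` is one-hot, so argmax gives the sampled index
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)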
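
To get a feel for what the temperature does, the reweighting step can be applied to a small hand-made distribution. It maps each probability `p_i` to `p_i ** (1 / T) / sum_j(p_j ** (1 / T))`. The `reweight` helper and `toy_preds` below are illustrative assumptions; `reweight` repeats the reweighting lines of `sample` without the random draw.

```python
import numpy as np


def reweight(preds, temperature):
    # Same arithmetic as the sampling function, minus the random draw.
    preds = np.asarray(preds, dtype="float64")
    logits = np.log(preds) / temperature
    exp_logits = np.exp(logits)
    return exp_logits / np.sum(exp_logits)


# Hypothetical next-character distribution over four characters.
toy_preds = [0.5, 0.3, 0.15, 0.05]

for temperature in [0.2, 0.5, 1.0, 1.2]:
    print(temperature, np.round(reweight(toy_preds, temperature), 3))
```

A temperature of 1.0 leaves the distribution unchanged. Values below 1.0 concentrate probability on the most likely character, so sampling approaches a greedy argmax, while values above 1.0 flatten the distribution and make unlikely characters more probable.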


Finally, this is the loop where we repeatedly train and generate text. We start generating text using a range of different temperatures 
after every epoch. This allows us to see how the generated text evolves as the model starts converging, as well as the impact of 
temperature in the sampling strategy.

In [7]:
import random
import sys

for epoch in range(1, 60):
    print('epoch', epoch)
    # Fit the model for 1 epoch on the available training data
    model.fit(x, y,
              batch_size=128,
              epochs=1)

    # Select a text seed at random
    start_index = random.randint(0, len(text) - maxlen - 1)
    generated_text = text[start_index: start_index + maxlen]
    print('--- Generating with seed: "' + generated_text + '"')

    for temperature in [0.2, 0.5, 1.0, 1.2]:
        print('------ temperature:', temperature)
        sys.stdout.write(generated_text)

        # We generate 400 characters
        for i in range(400):
            sampled = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(generated_text):
                sampled[0, t, char_indices[char]] = 1.

            preds = model.predict(sampled, verbose=0)[0]
            next_index = sample(preds, temperature)
            next_char = chars[next_index]

            generated_text += next_char
            generated_text = generated_text[1:]

            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()

epoch 1
Epoch 1/1
--- Generating with seed: "alumniators and detractors, even when they still remain our
"
------ temperature: 0.2
alumniators and detractors, even when they still remain our
wand and scient and something the great and the concertions of the preasons of the belise such a the sumpless of the presense of the presense of the believes and a conders and such a menound and the sumplences of the fast the preason of the mental something the self the presens of the preason of the presense of the fact and the means of the measer of the wally and such a the sumpless of the contri
------ temperature: 0.5
he measer of the wally and such a the sumpless of the contrine in the sumpless and the noce and sefal and mentines, and that the freed sumblent the meads the in the such a distiments of must in the spirit and smouls thenelves individual of
the despersion of the for the purcomity, and simply be invereness of the fend the faules of the preasous of all a would the great to which
in th

when if one of ahreaing as his pertantly, is no lest as our may be wait, has have proves its
to victises,
when is a
still
habit, are!


1itur of him had have long sucerbarish of bold percepti
------ temperature: 1.2
e!


1itur of him had have long sucerbarish of bold perceptioned
european, a there; as the fortocati in grothess--whe can preces,
suffercs strothes groufl and as lude who once
soay now-
unrall"" "viduce judged short, no plive--the judice in nor involume a dowethiz and the reactifit only all, the
decitedhy on durity--his
trosth-yhor. but we religive seent
of an
paromat -himself not a probes ask of whoo
new vaditiin of the
same! simply the sgly" heypists' "a
epoch 9
Epoch 1/1
--- Generating with seed: "sire for "freedom of will"
in the superlative, metaphysical "
------ temperature: 0.2
sire for "freedom of will"
in the superlative, metaphysical and the more that the contempt of the more and the strength and in the strength and the most still the more all the spirits that the


oever scient, but but messe are
another was with ag
------ temperature: 1.2
 the , whoever scient, but but messe are
another was with agrieot egn wfore, aganticas fethef: right, who trained upon in coar, to him (why before an
adgming samed as
dote"; whom every free, woman
in the inxrition of cloif merely, owing narronungred could "go best and thwatest strict, at usence, can name, faurling
greats of its
awtrouse, to wrous what
we y beoulisnorsness, life now to the moder dabve. however pinual thing in
sucerle.--mankind line
europe w
epoch 14
Epoch 1/1
--- Generating with seed: "ible, therefore, to
impose some guidance upon the forces of "
------ temperature: 0.2
ible, therefore, to
impose some guidance upon the forces of the same to the states of the reason to the sentiments and soul in the world to the subject of the sense of the same to the strength of the subject of the strength of the stipidation of the subject and soul and the things and self-deceation of the spirit and subject of t

and conscience of the sense of all the faith, and accessions it is thore one to a words as strong that that has a so to an precisely this concerning the concerning the misuntering of morality man flort and every act as the e
------ temperature: 1.0
the misuntering of morality man flort and every act as the elaw; as things (attempts to ones,                   hendlesped.

121
=man toked, we can temperungs, supprinable reason. feeciseppostherest ago. (religity of
intellectuals,
motimxqure, sign        one whoped far soul,
and prisioled into constituted in politiced ass, as forses noble of ready as italls regarded according to men from that
rulring itself by comparation which
may knowledge, the reason i
------ temperature: 1.2
ring itself by comparation which
may knowledge, the reason is . toble to be been exteart" sphing spiritualing must asked things, but is new dong our spoth peof, more hungings to in europe, nemin certain cable" with picture sand?t
underrigity, put againly.

2entain? 

further. the suffenisity bronize,
and ! and has
gridges--"but all the forevgened", that i atterment
and
possible, from a rare.

2

epoch 29
Epoch 1/1
--- Generating with seed: "elves, point for point, with the
philosophy of epicurus, whi"
------ temperature: 0.2
elves, point for point, with the
philosophy of epicurus, which is a man and the strength and profounded and something and the spring in the most and some of the same and conception of the conception of the strength and the scientific the same that the streamnent of the scientific acts of the same to the strength of the spirit of the fact of the same to the strength of the world and all the strength the strength the conceptions of the strength the specialt 
------ temperature: 0.5
h the strength the conceptions of the strength the specialt of the free spirits? of the as a confest the strength and something and the strict of all an operity and problem of all responsible with the refined one from its stept of
indiynesely proper an

in maice and met rised
the
tempence: he beguees the name an med that some way man gives then the ivent untentimance betoof, these wark (still the avowd" and sterch. more staded--antibation; wet in the birm. then different in
woman could perrising untercount, that even this great the sympathianl
------ temperature: 1.2
d perrising untercount, that even this great the sympathianly
miclant sympathallyouldiets.


1uead actsum: be sirily, the senser, and even workaplent. to
very , indsalmered--one ewco-knowledge that
an it serphssited of monefooin: the disturver himself!--it lie noth--and
however severet upon this shris taking overnooun tooh against mendomamingly forewiefpent ant of swammenttical
intentional
nerves of present perimiatisblably, among those schina-qelatpolt, t
epoch 37
Epoch 1/1
--- Generating with seed: "esome" for him.

194. the difference among men does not mani"
------ temperature: 0.2
esome" for him.

194. the difference among men does not manifess of the same that the s

religious notion which in the same toough the sensuality of the contrary and the conception of the greatest and morality of the conceptions of the word of the fact of the spirit and instinct of the fact of the most standard of the responsibility of the strong their still and in the most strong and such a profound the most strong and the prompences of the prompence and the most and conclusions of the contriration of the conclu
------ temperature: 0.5
d the most and conclusions of the contriration of the conclusion what is not the contrident to an althus with respect, and astrunged and its northest their possible, the man in the word as a sympthis the philosophers and through the present pleasure without the very
souls, and we have exercisc which our most existence and sensual and contrisout of the freedomed
deeps of a soul should has been as a power of the philosophers substraing perhaps there is not a
------ temperature: 1.0
 power of the philosophers substraing perhaps there is not ad

ancient of so vericous higher of "will also ruthing batofu-pain of an to the love h
------ temperature: 1.2
higher of "will also ruthing batofu-pain of an to the love him." every demons), well nature" only with any voupon the vicious recoon enlightenat
hosthulogic with any one mare from sippision of
cound
nimations," empatence thant "fire awaker continued upon tnlief therefore without upon be?


11

=fact. lie in every
than nadled fraduring," at itself
phyalismheests wit, it him fow for morads": one science, it esing to situate and goly fecss, the physioe of do 
epoch 52
Epoch 1/1
--- Generating with seed: "ing like nature, boundlessly
extravagant, boundlessly indiff"
------ temperature: 0.2
ing like nature, boundlessly
extravagant, boundlessly indifferent and the fact that the sense of the sense of the sense of the contrary and a pain which is the sense of the sense of the sense of the seriousness of the contrary something in the sense of the sense of the sense of the sense of the sen

of the more the spirit and the presentime of the spirit and the sense far before the spirit and something but it is in the and men of religious the generally believes and the reason for the sense of life and even in the respection of the sense of the morality of the transconce in the will to a long and
to usually the world and the man and consequently because of all the new difficult to the more morality and more dispenses the course, and the prides to the
------ temperature: 1.0
orality and more dispenses the course, and the prides to the emodious spreeded periose find himself: observenes!" "thoreth all love the power than indy differed to been
precisely correfous france exercised against refatiol of there worhes in it remare, arise which soul, and almost
regard himself thorough developed, an necessaris of there fine attainufgeshily, and far is phing us of nativess at last a bot necessarily heart, in the situal time overpartopulr,
------ temperature: 1.2
t a bot necessarily heart, in 


As you can see, a low temperature results in extremely repetitive and predictable text, but where local structure is highly realistic: in 
particular, all words (a word being a local pattern of characters) are real English words. With higher temperatures, the generated text 
becomes more interesting, surprising, even creative; it may sometimes invent completely new words that sound somewhat plausible (such as 
"eterned" or "troveration"). With a high temperature, the local structure starts breaking down and most words look like semi-random strings 
of characters. Without a doubt, here 0.5 is the most interesting temperature for text generation in this specific setup. Always experiment 
with multiple sampling strategies! A clever balance between learned structure and randomness is what makes generation interesting.

Note that by training a bigger model, longer, on more data, you can achieve generated samples that will look much more coherent and 
realistic than ours. But of course, don't expect to ever generate any meaningful text, other than by random chance: all we are doing is 
sampling data from a statistical model of which characters come after which characters. Language is a communication channel, and there is 
a distinction between what communications are about, and the statistical structure of the messages in which communications are encoded. To 
evidence this distinction, here is a thought experiment: what if human language did a better job at compressing communications, much like 
our computers do with most of our digital communications? Then language would be no less meaningful, yet it would lack any intrinsic 
statistical structure, thus making it impossible to learn a language model like we just did.
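
The thought experiment can be illustrated roughly with an ordinary compressor. The sketch below compares the empirical byte entropy of a small text with that of its `zlib`-compressed form. The random-word corpus is a hypothetical stand-in, and single-byte entropy is only a crude proxy for "statistical structure", so treat this as a hint rather than a demonstration.

```python
import math
import random
import zlib
from collections import Counter


def entropy_bits_per_symbol(symbols):
    """Empirical Shannon entropy (bits per symbol) of a sequence."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Hypothetical stand-in corpus: words drawn at random from a tiny vocabulary.
rng = random.Random(0)
vocabulary = ["the", "sense", "of", "spirit", "and", "strength", "morality", "world"]
raw_text = " ".join(rng.choice(vocabulary) for _ in range(5000))

raw_bytes = raw_text.encode("ascii")
compressed_bytes = zlib.compress(raw_bytes, 9)

raw_entropy = entropy_bits_per_symbol(raw_bytes)
compressed_entropy = entropy_bits_per_symbol(compressed_bytes)

print("raw:       ", len(raw_bytes), "bytes,", round(raw_entropy, 2), "bits/byte")
print("compressed:", len(compressed_bytes), "bytes,", round(compressed_entropy, 2), "bits/byte")
```

The compressed stream carries exactly the same message, since it decompresses back to the original, yet its bytes look far closer to uniform noise. A next-symbol model has much less regularity to pick up on.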


## Takeaways

* We can generate discrete sequence data by training a model to predict the next token(s) given previous tokens.
* In the case of text, such a model is called a "language model" and could be based on either words or characters.
* Sampling the next token requires a balance between adhering to what the model judges likely and introducing randomness.
* One way to handle this is the notion of _softmax temperature_. Always experiment with different temperatures to find the "right" one.




