# Text Generation

A chatbot needs to generate text automatically.

The usual pipeline is:
1. Take a set of parameters.
2. Generate a set of candidate texts.
3. Pick the highest-scoring candidate.

Below we look at how to generate text with an LSTM.

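Probabilistic text generation methods, such as Markov models, typically compute a conditional probability distribution: the probability of the next word given the previous few words (an n-gram). RNNs and LSTMs work in a similar way; the differences are:
- RNN: encodes information (feature extraction)
- LSTM: the memory state holds a longer context, which gives better performance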
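
To make the n-gram idea concrete, here is a minimal sketch of a character-level Markov model. The toy corpus, the order of 3, and all helper names are illustrative assumptions and are not part of the notebook.

```python
import random
from collections import Counter, defaultdict

def build_ngram_model(corpus, order=3):
    """Count which character follows each context of `order` characters."""
    model = defaultdict(Counter)
    for i in range(len(corpus) - order):
        context = corpus[i:i + order]
        model[context][corpus[i + order]] += 1
    return model

def next_char_distribution(model, context):
    """Turn the raw counts for one context into conditional probabilities."""
    counts = model[context]
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}

def generate(model, seed, length, rng):
    """Extend `seed` by repeatedly sampling the next character."""
    order = len(seed)
    out = seed
    for _ in range(length):
        counts = model.get(out[-order:])
        if not counts:  # unseen context: stop early
            break
        chars, weights = zip(*counts.items())
        out += rng.choices(chars, weights=weights)[0]
    return out

toy_corpus = "the strong to the things and the strong to the heard"
ngram_model = build_ngram_model(toy_corpus, order=3)
print(next_char_distribution(ngram_model, "the"))
print(generate(ngram_model, "the", 30, random.Random(0)))
```

The model only ever sees the last `order` characters, which is the limitation the LSTM's memory state is meant to relax.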

The rest of this section walks through generating text automatically with an LSTM.

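To do prediction, we need to change the structure of the network: this is no longer a classification problem like sentiment analysis. Like learning word embeddings, it is a form of self-supervised learning, because the labels (the next character) come from the text itself.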

<img src="img/next_word_prediction.png" alt="drawing" width="600"/>

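Here we do not use the earlier IMDB dataset, because:
- The dataset is small.
- It is highly heterogeneous: the reviews were written by different people, each with their own writing style.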

So instead we learn from Shakespeare's works and generate text in Shakespeare's style (a single, consistent style).

In [1]:
from nltk.corpus import gutenberg

In [2]:
print(gutenberg.fileids())

['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']


Collect all of Shakespeare's works and concatenate them into one large string.

In [3]:
text = ''
for txt in gutenberg.fileids():
    if 'shakespeare' in txt:
        text += gutenberg.raw(txt).lower()

print('corpus length:', len(text))

corpus length: 375542


In [4]:
print(text[:500])

[the tragedie of julius caesar by william shakespeare 1599]


actus primus. scoena prima.

enter flauius, murellus, and certaine commoners ouer the stage.

  flauius. hence: home you idle creatures, get you home:
is this a holiday? what, know you not
(being mechanicall) you ought not walke
vpon a labouring day, without the signe
of your profession? speake, what trade art thou?
  car. why sir, a carpenter

   mur. where is thy leather apron, and thy rule?
what dost thou with thy best apparrell on


Next, we collect every character that appears in the text, much like building a vocabulary.

In [5]:
chars = sorted(list(set(text)))
print('total chars:', len(chars))

total chars: 50


In [6]:
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

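Next, we build the training set. We slide a window of 40 characters over the text string with step = 3; each window is an input sequence, and the character that follows it is the target.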

In [7]:
# cut the text in semi-redundant sequences of maxlen characters
maxlen = 40
step = 3
sentences = []
next_chars = []
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
print('nb sequences:', len(sentences))

nb sequences: 125168
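
The sequence count can be checked without building the lists, since `range(0, len(text) - maxlen, step)` yields one window per start index. A quick sketch, using the corpus length printed earlier:

```python
import math

corpus_length = 375542  # corpus length printed by the notebook
maxlen = 40
step = 3

# One window per start index in range(0, corpus_length - maxlen, step)
n_windows = len(range(0, corpus_length - maxlen, step))
# Equivalent closed form
n_windows_closed_form = math.ceil((corpus_length - maxlen) / step)

print(n_windows, n_windows_closed_form)  # prints 125168 125168
```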


In [8]:
sentences[0], next_chars[0]

('[the tragedie of julius caesar by willia', 'm')

This gives 125,168 sequences for training. Next, we one-hot encode each sequence.

In [9]:
import numpy as np

print('Vectorization...')
# use the built-in bool: the np.bool alias was removed in recent NumPy releases
X = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        X[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1

Vectorization...
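
To see what the vectorization produces, the same encoding can be applied to a tiny string. The toy text and the window size of 2 below are made up for illustration.

```python
import numpy as np

toy_text = "abcab"
toy_chars = sorted(set(toy_text))  # ['a', 'b', 'c']
toy_index = {c: i for i, c in enumerate(toy_chars)}

window = 2
toy_sentences = [toy_text[i:i + window] for i in range(len(toy_text) - window)]
toy_next = [toy_text[i + window] for i in range(len(toy_text) - window)]

# X_toy: (windows, window length, alphabet size); y_toy: (windows, alphabet size)
X_toy = np.zeros((len(toy_sentences), window, len(toy_chars)), dtype=bool)
y_toy = np.zeros((len(toy_sentences), len(toy_chars)), dtype=bool)
for i, sentence in enumerate(toy_sentences):
    for t, char in enumerate(sentence):
        X_toy[i, t, toy_index[char]] = True
    y_toy[i, toy_index[toy_next[i]]] = True

print(X_toy.shape, y_toy.shape)
print(X_toy[0].astype(int))  # two rows, encoding 'a' then 'b'
```

Each row of `X_toy[i]` has exactly one `True`, marking which character sits at that position in the window.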


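Next, we build the model.

- We only need the output of the last time step, not of every step, so the `return_sequences=True` argument is not needed.
- Because this problem is more complex, we use a larger network: the LSTM layer has 128 neurons.
- RMSprop is the optimizer.
- The loss function is categorical cross-entropy.
- No dropout: we want the generated text to resemble Shakespeare as closely as possible, so we deliberately aim to overfit. This differs from the usual goal of generalization.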


In [10]:
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.layers import LSTM
from keras.optimizers import RMSprop

import os
os.environ['KMP_DUPLICATE_LIB_OK'] = 'True' # keeps the notebook kernel from exiting during execution

Using TensorFlow backend.


In [11]:
# build the model: a single LSTM
print('Build model...')
model = Sequential()
model.add(LSTM(128, input_shape=(maxlen, len(chars))))
model.add(Dense(len(chars)))
model.add(Activation('softmax'))

optimizer = RMSprop(lr=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)

print(model.summary())

Build model...
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_1 (LSTM)                (None, 128)               91648     
_________________________________________________________________
dense_1 (Dense)              (None, 50)                6450      
_________________________________________________________________
activation_1 (Activation)    (None, 50)                0         
Total params: 98,098
Trainable params: 98,098
Non-trainable params: 0
_________________________________________________________________
None
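
The parameter counts in the summary can be reproduced by hand. An LSTM layer has four weight blocks (input, forget, and output gates plus the candidate cell state), each with input weights, recurrent weights, and a bias. A quick check, using the sizes from the model above:

```python
units = 128      # LSTM hidden size
input_dim = 50   # number of distinct characters
n_classes = 50   # one output per character

# 4 blocks x (input weights + recurrent weights + bias)
lstm_params = 4 * (units * input_dim + units * units + units)
# Dense layer: weight matrix plus bias
dense_params = units * n_classes + n_classes

print(lstm_params, dense_params, lstm_params + dense_params)  # prints 91648 6450 98098
```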


In [12]:
epochs = 6
batch_size = 128

In [13]:
model_structure = model.to_json()

with open("shakes_lstm_model.json", "w") as json_file:
    json_file.write(model_structure)


In [14]:
for i in range(5):
    model.fit(X, y,
              batch_size=batch_size,
              epochs=epochs)

    model.save_weights("shakes_lstm_weights_{}.h5".format(i+1))
    print('Model saved.')

Epoch 1/6
Epoch 2/6
Epoch 3/6
Epoch 4/6
Epoch 5/6
Epoch 6/6
Model saved.
Epoch 1/6
Epoch 2/6
Epoch 3/6
Epoch 4/6
Epoch 5/6
Epoch 6/6
Model saved.
Epoch 1/6
Epoch 2/6
Epoch 3/6
Epoch 4/6
Epoch 5/6
Epoch 6/6
Model saved.
Epoch 1/6
Epoch 2/6
Epoch 3/6
Epoch 4/6
Epoch 5/6
Epoch 6/6
Model saved.
Epoch 1/6
Epoch 2/6
Epoch 3/6
Epoch 4/6
Epoch 5/6
Epoch 6/6
Model saved.


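Each round trains for 6 epochs and then saves the model weights. There are 5 rounds, for 30 epochs in total.

Working at the character level means we do not have to worry about tokenization or sentence segmentation. Note, however, that case folding is still necessary.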

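### Generate text

Next, we write the text generator.

⚠️ We do not pick the character with the highest probability; instead, we sample one character at random according to the predicted probability distribution.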


In [15]:
import random

def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    
    return np.argmax(probas)

#### The role of `temperature` (called `diversity` below)

- When temperature is less than 1:
    - it sharpens the probability distribution
    - the model tries more strictly to recreate the original text, so the output is more Shakespeare-like
- When temperature is greater than 1:
    - it flattens the probability distribution
    - the text is more diverse, so the output is less Shakespeare-like
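
The sharpening and flattening can be seen on a small example. This sketch applies the same rescaling as `sample` (take the log, divide by the temperature, re-normalize) to a made-up three-way distribution; `apply_temperature` and `toy_probs` are illustrative names.

```python
import math

def apply_temperature(probs, temperature):
    """Rescale a probability distribution the same way `sample` does."""
    scaled = [math.log(p) / temperature for p in probs]
    exps = [math.exp(s) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_probs = [0.6, 0.3, 0.1]  # made-up distribution for illustration
for temperature in [0.2, 1.0, 1.5]:
    rescaled = apply_temperature(toy_probs, temperature)
    print(temperature, [round(p, 3) for p in rescaled])
```

At a temperature of 0.2 almost all of the mass moves to the most likely entry, at 1.0 the distribution is unchanged, and at 1.5 the three entries move closer together.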

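Next, we use the trained model to generate a passage.

- NumPy's `np.random.multinomial(n, pvals, size)` draws samples from the distribution given by `pvals`. Here `n = 1` and `size = 1`, so we get a single one-hot draw, and `np.argmax` recovers the index of the chosen character.
- Unlike in training, we first pick a 40-character window as the seed, predict one character, then slide the window forward by one character and predict again.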

In [18]:
import sys
random.seed(42)

start_index = random.randint(0, len(text) - maxlen - 1)

for diversity in [0.2, 0.5, 1.0, 1.5]:
    print()
    print('----- diversity:', diversity)

    generated = ''
    sentence = text[start_index: start_index + maxlen]
    generated += sentence
    print('----- Generating with seed: "' + sentence + '"')
    sys.stdout.write(generated)

    for i in range(400):
        x = np.zeros((1, maxlen, len(chars)))
        for t, char in enumerate(sentence):
            x[0, t, char_indices[char]] = 1.

        preds = model.predict(x, verbose=0)[0]
        next_index = sample(preds, diversity)
        next_char = indices_char[next_index]

        generated += next_char
        sentence = sentence[1:] + next_char

        sys.stdout.write(next_char)
        sys.stdout.flush()
    print()


----- diversity: 0.2
----- Generating with seed: "that well might
aduise him to a caution,"
that well might
aduise him to a caution, and the more and strong,
the strong to the things the strong to the reason
the strong to the true blood, and with the treale,
and the strong to the heard, and with the strent: but that he worlowes
that i will strong the things and the treate
the straige them sir, and the strong to the heard,
and the strong to the things and the strent.
the strong to the things in the capitors,
the presvtxxxvxxxip

----- diversity: 0.5
----- Generating with seed: "that well might
aduise him to a caution,"
that well might
aduise him to a caution, and my lord

   macb. he was he dengerost him of my moutio,
and then he did hamlet them. the cause, and that
so farewers, i shall sir, why shall be breath,
it is not are them. and the great feare,
the strong to the straine the recordumes to the man,
and that we you mourne he may, i will not
the stole to the had strong to do my nob

  


uld predgall
hath start the play the will knocking puts ale
that were lucius i lyfe at wake the caesar
is not tongue the reuellory thee

   ophe. and thou laue. some starn'd th' eorion you sweet backe
of my wromany dickne louds,
and to him bach childr

----- diversity: 1.5
----- Generating with seed: "that well might
aduise him to a caution,"
that well might
aduise him to a caution, :wand: and your vi'streast
broates, so wilth

side chrimsany.

llare all your forc'd.
touke from slaught?
hold out, prinatarriusepp't: and murther,
enter.

alalm

   'tis that -  rounded nem. by thoses abour

 exit; you well,
thosgh i sdownepsarmies, cey withing cracresse haed
buthchnemess saceing from her shoo: houre:
lad
in dye same butsure, yay, poblowed all noif'd,
and eueralsa wife? trepon t


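⚠️ The run above emitted the following warning:

/Users/chenwang/anaconda3/envs/tf/lib/python3.6/site-packages/ipykernel_launcher.py:6: RuntimeWarning: divide by zero encountered in log

The warning points at the `np.log(preds)` line in `sample`. A likely cause is that the predicted distribution contains an exact zero, for which `np.log` returns `-inf`. This still needs to be checked.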
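
One possible fix for the `divide by zero encountered in log` warning, sketched under the assumption that it comes from exact zeros in `preds`, is to clip the probabilities to a small positive floor before taking the log. The name `sample_safe` and the floor `1e-10` are illustrative choices, not part of the original notebook.

```python
import numpy as np

def sample_safe(preds, temperature=1.0, eps=1e-10):
    """Sample an index from `preds`, tolerating zero probabilities."""
    preds = np.asarray(preds).astype('float64')
    # Clip so that log() never sees an exact zero
    preds = np.log(np.clip(preds, eps, 1.0)) / temperature
    # Subtract the max before exponentiating for numerical stability
    exp_preds = np.exp(preds - np.max(preds))
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return int(np.argmax(probas))

print(sample_safe([0.0, 1.0, 0.0], temperature=0.2))
```

Characters whose probability was zero remain practically impossible to draw, so the generated text should be unaffected apart from the missing warning.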

#### Make it more useful

The example above was just for fun. Here is what to do if you want to use a generative model in a real application.

- Expand the quantity and quality of the corpus.
- Expand the complexity of the model (number of neurons).
- Implement a more refined case folding algorithm.
- Segment sentences.
- Add filters on grammar, spelling, and tone to match your needs.
- Generate many more examples than you actually show your users.
- Use seed texts chosen from the context of the session to steer the chatbot toward useful topics.
- Use multiple different seed texts within each dialog round to explore what the chatbot can talk about well and what the user finds helpful.
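
As a rough illustration of the "generate many, show few" and spelling-filter ideas in the list above, candidates can be scored and ranked before any of them is shown. The scoring rule here (the fraction of words found in a known vocabulary), the threshold, and the function names are hypothetical; a real system would plug in its own filters.

```python
def known_word_ratio(candidate, vocabulary):
    """Fraction of whitespace-separated words that appear in `vocabulary`."""
    words = candidate.split()
    if not words:
        return 0.0
    return sum(word.strip('.,:;?!') in vocabulary for word in words) / len(words)

def pick_best(candidates, vocabulary, min_ratio=0.8):
    """Keep candidates above a quality threshold and return them best-first."""
    scored = [(known_word_ratio(c, vocabulary), c) for c in candidates]
    kept = [pair for pair in scored if pair[0] >= min_ratio]
    return [c for _, c in sorted(kept, key=lambda pair: pair[0], reverse=True)]

# Toy vocabulary and candidates; in practice the vocabulary could come from the corpus
vocabulary = {'and', 'the', 'strong', 'to', 'things', 'my', 'lord', 'that', 'i', 'will'}
candidates = [
    'and the strong to the things',
    'the presvtxxxvxxxip',
    'and my lord, that i will',
]
print(pick_best(candidates, vocabulary))
```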

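## 2. How to say, and what to say

We have now shown how to say something, but we have no control over what is being said. In other words, the reply may not address the question that was asked.

- Try starting a sentence with a word that does not exist and see what interesting results come out.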

## 3. Extensions 

### 3.1 Other kinds of memory

Other kinds of memory units differ slightly in how their gates operate.

#### GRU

More efficient: fewer parameters.

```python
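from keras.models import Sequential
from keras.layers import GRU

num_neurons = 128  # hidden size (assumed value, matching the LSTM above)

# X is the one-hot encoded training tensor built earlier
model = Sequential()
model.add(GRU(num_neurons, return_sequences=True, input_shape=X[0].shape))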
```
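
The parameter saving can be estimated by hand: a GRU has three weight blocks (update gate, reset gate, candidate state) where an LSTM has four. The sketch below assumes the classic formulation with one bias vector per block; some implementations add a second bias per block, which changes the totals slightly. The sizes are the ones used for the character model earlier.

```python
units = 128      # hidden size, as in the LSTM model
input_dim = 50   # number of distinct characters

# One block: input weights + recurrent weights + bias
block = units * input_dim + units * units + units
lstm_params = 4 * block
gru_params = 3 * block

print(lstm_params, gru_params, gru_params / lstm_params)  # prints 91648 68736 0.75
```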

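#### Peephole connections

See *Learning Precise Timing with LSTM Recurrent Networks* (Gers, Schraudolph, and Schmidhuber).

The difference is that the input to the gates is now a combination of three signals:
- the input at time t
- the output at time t-1
- the memory state

This works well for time series data.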
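
A single time step of a peephole LSTM can be sketched in NumPy as follows. This uses a common formulation in which the input and forget gates see the previous memory state and the output gate sees the updated one; the weight names, toy sizes, and random initialization are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def peephole_lstm_step(x_t, h_prev, c_prev, params):
    """One peephole LSTM step: the gates also 'peek' at the memory state."""
    W, U, p, b = params['W'], params['U'], params['p'], params['b']
    # Input and forget gates: input at t, output at t-1, previous memory state
    i_t = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + p['i'] * c_prev + b['i'])
    f_t = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + p['f'] * c_prev + b['f'])
    # The candidate memory has no peephole term
    g_t = np.tanh(W['g'] @ x_t + U['g'] @ h_prev + b['g'])
    c_t = f_t * c_prev + i_t * g_t
    # The output gate peeks at the updated memory state
    o_t = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + p['o'] * c_t + b['o'])
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

rng = np.random.default_rng(0)
input_dim, units = 5, 4  # toy sizes
params = {
    'W': {k: rng.normal(size=(units, input_dim)) for k in 'ifgo'},
    'U': {k: rng.normal(size=(units, units)) for k in 'ifgo'},
    'p': {k: rng.normal(size=units) for k in 'ifo'},  # element-wise peephole weights
    'b': {k: np.zeros(units) for k in 'ifgo'},
}
h, c = peephole_lstm_step(rng.normal(size=input_dim), np.zeros(units), np.zeros(units), params)
print(h.shape, c.shape)
```

Setting every entry of `params['p']` to zero recovers the standard LSTM step.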

### 3.2 Going deeper

Stack multiple LSTM layers. Note that the first and intermediate layers need `return_sequences=True`.

```python
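from keras.models import Sequential
from keras.layers import LSTM

num_neurons = 128    # size of the first layer (assumed value)
num_neurons_2 = 64   # size of the second layer (assumed value)

# X is the one-hot encoded training tensor built earlier
model = Sequential()
model.add(LSTM(num_neurons, return_sequences=True, input_shape=X[0].shape))
# an intermediate layer also returns the full sequence for the layer after it
model.add(LSTM(num_neurons_2, return_sequences=True))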
```