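# Contents
0. [Task](#Task)
1. [Creating the training data](#Creating-the-training-data)
    1. [Encoding text as numbers](#Encoding-text-as-numbers)
    2. [Generating data to feed the model](#Generating-data-to-feed-the-model)
        - [Working with tf.data.Dataset objects](#Working-with-tf.data.Dataset-objects)
        - [Converting the encoded data into a tf.data.Dataset](#Converting-the-encoded-data-into-a-tf.data.Dataset)
2. [Creating and training the model (stateless RNN)](#Creating-and-training-the-model)
3. [Generating text with the model](#Generating-text-with-the-model)
4. [Stateful RNN](#Stateful-RNN)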

In [1]:
from tensorflow import keras
import numpy as np
import tensorflow as tf

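# [Task](#Contents)
Given a corpus, train an RNN that predicts the next character of a sentence. The model can then generate new text by producing one character at a time. This example uses Shakespeare's works as the corpus.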

# [Creating the training data](#Contents)

In [2]:
shakespeare_url = "https://homl.info/shakespeare"
filepath = keras.utils.get_file('shakespeare.txt', shakespeare_url)
with open(filepath) as f:
    shakespeare_text = f.read()

In [3]:
print(shakespeare_text[:148])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?



In [4]:
"".join(sorted(set(shakespeare_text.lower())))

"\n !$&',-.3:;?abcdefghijklmnopqrstuvwxyz"

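### [Encoding text as numbers](#Contents)
Encode each character as an integer, so that a string becomes a list of integers.
- Keras's `Tokenizer` performs this encoding, but its IDs start at 1 rather than 0 (and it lowercases the text by default)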
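
To make the encoding concrete, here is a minimal plain-Python sketch of what a character-level tokenizer roughly does. It assumes, as Keras's `Tokenizer` does by default, that the text is lowercased and that more frequent characters get smaller IDs, starting at 1. The helper names `build_char_index` and `encode` are hypothetical and not part of Keras.

```python
from collections import Counter

def build_char_index(text):
    """Map each character to an integer ID; frequent characters get small IDs, starting at 1."""
    counts = Counter(text.lower())
    return {ch: i for i, (ch, _) in enumerate(counts.most_common(), start=1)}

def encode(text, char_index):
    """Encode a string as IDs shifted down by 1, so they run from 0 to max_id - 1."""
    return [char_index[ch] - 1 for ch in text.lower()]

sample = "to be or not to be"
char_index = build_char_index(sample)          # hypothetical stand-in for tokenizer.word_index
index_char = {i: ch for ch, i in char_index.items()}

ids = encode("To be", char_index)
decoded = "".join(index_char[i + 1] for i in ids)  # add 1 back before decoding

print(char_index)
print(ids, repr(decoded))
```

Shifting the IDs down by 1 is what the notebook does with `- 1` after `texts_to_sequences`, so the IDs can be used directly as class indices from 0 to `max_id - 1`.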

In [5]:
tokenizer = keras.preprocessing.text.Tokenizer(char_level=True)
tokenizer.fit_on_texts([shakespeare_text])

In [6]:
tokenizer.texts_to_sequences(['First'])

[[20, 6, 9, 8, 3]]

In [7]:
tokenizer.texts_to_sequences(['\n', ' '])

[[11], [1]]

In [8]:
tokenizer.sequences_to_texts([[20, 6, 9, 8, 3]])

['f i r s t']

In [9]:
max_id = len(tokenizer.word_index)
max_id

39

In [10]:
[encoded] = np.array(tokenizer.texts_to_sequences([shakespeare_text]))-1
encoded

array([19,  5,  8, ..., 20, 26, 10])

In [11]:
len(encoded)

1115394

In [12]:
print(tokenizer.sequences_to_texts([encoded[:148]+1])[0])

f i r s t   c i t i z e n : 
 b e f o r e   w e   p r o c e e d   a n y   f u r t h e r ,   h e a r   m e   s p e a k . 
 
 a l l : 
 s p e a k ,   s p e a k . 
 
 f i r s t   c i t i z e n : 
 y o u   a r e   a l l   r e s o l v e d   r a t h e r   t o   d i e   t h a n   t o   f a m i s h ? 



### [Generating data to feed the model](#Contents)

Working with tf.data.Dataset objects

In [13]:
np.random.seed(42)
tf.random.set_seed(42)

In [14]:
test_data = tf.data.Dataset.from_tensor_slices(tf.range(15))

In [15]:
for x in test_data:
    print(x, x.numpy())

tf.Tensor(0, shape=(), dtype=int32) 0
tf.Tensor(1, shape=(), dtype=int32) 1
tf.Tensor(2, shape=(), dtype=int32) 2
tf.Tensor(3, shape=(), dtype=int32) 3
tf.Tensor(4, shape=(), dtype=int32) 4
tf.Tensor(5, shape=(), dtype=int32) 5
tf.Tensor(6, shape=(), dtype=int32) 6
tf.Tensor(7, shape=(), dtype=int32) 7
tf.Tensor(8, shape=(), dtype=int32) 8
tf.Tensor(9, shape=(), dtype=int32) 9
tf.Tensor(10, shape=(), dtype=int32) 10
tf.Tensor(11, shape=(), dtype=int32) 11
tf.Tensor(12, shape=(), dtype=int32) 12
tf.Tensor(13, shape=(), dtype=int32) 13
tf.Tensor(14, shape=(), dtype=int32) 14


In [16]:
n_steps = 5
test_data = test_data.window(n_steps,shift=2, drop_remainder=True)
for x in test_data:
    print(x)

<_VariantDataset shapes: (), types: tf.int32>
<_VariantDataset shapes: (), types: tf.int32>
<_VariantDataset shapes: (), types: tf.int32>
<_VariantDataset shapes: (), types: tf.int32>
<_VariantDataset shapes: (), types: tf.int32>
<_VariantDataset shapes: (), types: tf.int32>


In [17]:
test_data = test_data.flat_map(lambda window: window.batch(n_steps))
for x in test_data:
    print(x)

tf.Tensor([0 1 2 3 4], shape=(5,), dtype=int32)
tf.Tensor([2 3 4 5 6], shape=(5,), dtype=int32)
tf.Tensor([4 5 6 7 8], shape=(5,), dtype=int32)
tf.Tensor([ 6  7  8  9 10], shape=(5,), dtype=int32)
tf.Tensor([ 8  9 10 11 12], shape=(5,), dtype=int32)
tf.Tensor([10 11 12 13 14], shape=(5,), dtype=int32)


In [18]:
test_data = test_data.shuffle(10).map(lambda window: (window[:-1], window[1:]))
for x, y in test_data:
    print(x, y)

tf.Tensor([6 7 8 9], shape=(4,), dtype=int32) tf.Tensor([ 7  8  9 10], shape=(4,), dtype=int32)
tf.Tensor([2 3 4 5], shape=(4,), dtype=int32) tf.Tensor([3 4 5 6], shape=(4,), dtype=int32)
tf.Tensor([4 5 6 7], shape=(4,), dtype=int32) tf.Tensor([5 6 7 8], shape=(4,), dtype=int32)
tf.Tensor([0 1 2 3], shape=(4,), dtype=int32) tf.Tensor([1 2 3 4], shape=(4,), dtype=int32)
tf.Tensor([ 8  9 10 11], shape=(4,), dtype=int32) tf.Tensor([ 9 10 11 12], shape=(4,), dtype=int32)
tf.Tensor([10 11 12 13], shape=(4,), dtype=int32) tf.Tensor([11 12 13 14], shape=(4,), dtype=int32)


In [19]:
test_data = test_data.batch(3).prefetch(1)
for index, (X_batch, Y_batch) in enumerate(test_data):
    print("-" * 60, "Batch", index, "\nX_batch")
    print(X_batch.numpy())
    print("=" * 20, "\nY_batch")
    print(Y_batch.numpy())

------------------------------------------------------------ Batch 0 
X_batch
[[ 4  5  6  7]
 [ 0  1  2  3]
 [ 8  9 10 11]]
Y_batch
[[ 5  6  7  8]
 [ 1  2  3  4]
 [ 9 10 11 12]]
------------------------------------------------------------ Batch 1 
X_batch
[[10 11 12 13]
 [ 2  3  4  5]
 [ 6  7  8  9]]
Y_batch
[[11 12 13 14]
 [ 3  4  5  6]
 [ 7  8  9 10]]
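
The `window`, `flat_map`, and `map` steps above can be mirrored in plain Python, which may make the shapes easier to follow. This is only a sketch of the logic, without shuffling, batching, or prefetching; `make_windows` and `split_input_target` are hypothetical helper names.

```python
def make_windows(values, window_length, shift):
    """Return every full window of `window_length` items, starting every `shift` items.

    Incomplete windows at the end are dropped, like drop_remainder=True.
    """
    last_start = len(values) - window_length
    return [values[i:i + window_length] for i in range(0, last_start + 1, shift)]

def split_input_target(window):
    """The target is the input shifted one step into the future."""
    return window[:-1], window[1:]

windows = make_windows(list(range(15)), window_length=5, shift=2)
pairs = [split_input_target(w) for w in windows]

for x, y in pairs:
    print(x, y)
```

Each window of 5 values yields an input of 4 values and a target of 4 values, which is why the real dataset below uses `window_length = n_steps + 1`.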


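Converting the encoded data into a tf.data.Dataset
- Training, validation, and test sets (only the training set, the first 50% of the text, is built below)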

In [20]:
train_size = len(encoded)*50//100
dataset = tf.data.Dataset.from_tensor_slices(encoded[:train_size])
dataset

<TensorSliceDataset shapes: (), types: tf.int32>

In [21]:
n_steps = 100
window_length = n_steps + 1
dataset = dataset.repeat().window(window_length, shift=1, drop_remainder=True)

In [22]:
dataset = dataset.flat_map(lambda window:window.batch(window_length))
dataset

<FlatMapDataset shapes: (None,), types: tf.int32>

In [23]:
batch_size = 32
dataset = dataset.shuffle(10000).batch(batch_size)
dataset = dataset.map(lambda windows: (windows[:,:-1], windows[:,1:]))
dataset

<MapDataset shapes: ((None, None), (None, None)), types: (tf.int32, tf.int32)>

In [24]:
dataset = dataset.map(lambda X_batch,Y_batch:(tf.one_hot(X_batch, depth=max_id), Y_batch))

In [25]:
dataset = dataset.prefetch(1)

In [26]:
for X_batch, Y_batch in dataset.take(1):
    print(X_batch.shape, Y_batch.shape)

(32, 100, 39) (32, 100)
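
The two shapes differ because only the inputs are one-hot encoded; the targets stay as integer IDs, which is what the `sparse_categorical_crossentropy` loss used later expects. A minimal plain-Python sketch of this step for a single short window follows; `one_hot` here is a hypothetical helper standing in for `tf.one_hot`.

```python
def one_hot(index, depth):
    """Return a list of `depth` zeros with a single 1.0 at position `index`."""
    vec = [0.0] * depth
    vec[index] = 1.0
    return vec

max_id = 39
window = [19, 5, 8, 7, 2]            # "first", encoded as in the notebook (IDs minus 1)

inputs, targets = window[:-1], window[1:]
X = [one_hot(i, max_id) for i in inputs]

print(len(X), len(X[0]))             # time steps, vector length
print(targets)                       # targets remain plain integer IDs
```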


# [Creating and training the model](#Contents)
- Stateless RNN


In [None]:
model = keras.models.Sequential([
    keras.layers.GRU(128, return_sequences=True, input_shape=[None, max_id],
                     dropout=0.2, recurrent_dropout=0.2),
    # Note: with two GRU layers, training is very slow
    keras.layers.GRU(128, return_sequences=True,
                     dropout=0.2, recurrent_dropout=0.2),
    # One softmax over the max_id characters at every time step
    keras.layers.TimeDistributed(keras.layers.Dense(max_id, activation='softmax'))
])
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')
# The dataset repeats forever, so the epoch length must be given explicitly
history = model.fit(dataset, epochs=10, steps_per_epoch=train_size // batch_size)

Train for 17428 steps
Epoch 1/10
 1673/17428 [=>............................] - ETA: 39:44 - loss: 2.3181
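
The step count in the log can be checked by hand from the numbers printed earlier. The arithmetic below is a small sketch that only reuses values already shown in this notebook (the text length, the 50% split, and the batch size).

```python
n_chars = 1115394                        # len(encoded), from the output above
train_size = n_chars * 50 // 100         # first 50% of the text
batch_size = 32

# With shift=1 there is roughly one window per training character,
# so train_size // batch_size batches make up about one pass over the data.
steps_per_epoch = train_size // batch_size

print(train_size, steps_per_epoch)
```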

# [Generating text with the model](#Contents)

In [None]:
def preprocess(texts):
    X = np.array(tokenizer.texts_to_sequences(texts)) - 1
    return tf.one_hot(X, max_id)

In [None]:
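X_new = preprocess(["How are yo"])
# predict_classes() was removed in TensorFlow 2.6; argmax over the class axis is equivalent
Y_pred = np.argmax(model.predict(X_new), axis=-1)
tokenizer.sequences_to_texts(Y_pred + 1)[0][-1]  # most likely next character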

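- The model outputs a probability distribution over the next character; always picking the most probable character tends to repeat the same words over and over
- Instead, reweight the distribution with a temperature value
- Then sample the next character at random from the reweighted distribution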
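
The effect of the temperature can be seen on a toy distribution. The sketch below uses plain Python rather than TensorFlow; `apply_temperature` and `sample` are hypothetical helpers that mimic dividing the log-probabilities by the temperature and then sampling from the resulting softmax.

```python
import math
import random

def apply_temperature(probs, temperature):
    """Reweight a distribution: softmax(log(p) / T), i.e. p ** (1 / T) renormalized."""
    logits = [math.log(p) / temperature for p in probs]
    peak = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample(probs, rng):
    """Draw one index at random according to `probs`."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

probs = [0.5, 0.3, 0.2]
cold = apply_temperature(probs, 0.2)   # low temperature: sharper, close to greedy
same = apply_temperature(probs, 1.0)   # temperature 1: unchanged
hot = apply_temperature(probs, 2.0)    # high temperature: flatter, more random

rng = random.Random(42)
for name, dist in [("cold", cold), ("same", same), ("hot", hot)]:
    print(name, [round(p, 3) for p in dist], "sampled index:", sample(dist, rng))
```

A temperature below 1 concentrates the probability on the most likely character, while a temperature above 1 spreads it out and gives more varied but less reliable text.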

In [None]:
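def next_char(text, temperature=1):
    X_new = preprocess([text])
    # Keep only the distribution for the last time step; shape (1, max_id)
    y_proba = model.predict(X_new)[0, -1:, :]
    # Dividing the log-probabilities by the temperature sharpens (T < 1) or flattens (T > 1) the distribution
    rescaled_logits = tf.math.log(y_proba) / temperature
    # Sample one class ID, then add 1 to return to the tokenizer's 1-based IDs
    char_id = tf.random.categorical(rescaled_logits, num_samples=1) + 1
    return tokenizer.sequences_to_texts(char_id.numpy())[0]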

In [None]:
def complete_text(text, n_chars=50, temperature=1):
    # Append one sampled character at a time, feeding the growing text back in
    for _ in range(n_chars):
        text += next_char(text, temperature)
    return text

In [None]:
print(complete_text("t", temperature=0.2))

In [None]:
print(complete_text("w", temperature=1))

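# [Stateful RNN](#Contents)
- In the model above, the hidden state is initialized to zeros at the start of every training batch, updated at each time step, and discarded after the last time step
- A stateful RNN instead keeps the final hidden state of one batch and uses it as the initial state for the next batch
- The data fed to the model must change accordingly: input sequences must not overlap, and each batch must continue exactly where the previous one ended
- As a result, the model can learn longer-term patterns in the sequence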
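
The difference in data layout can be sketched in plain Python. With `shift=1` (stateless case) the inputs overlap heavily, while with `shift=n_steps` (stateful case) each input starts right after the previous one ends, so carrying the hidden state over makes sense. `make_windows` is a hypothetical helper, and the tiny `n_steps` is chosen only to keep the output readable.

```python
def make_windows(values, window_length, shift):
    """Return every full window of `window_length` items, starting every `shift` items."""
    last_start = len(values) - window_length
    return [values[i:i + window_length] for i in range(0, last_start + 1, shift)]

n_steps = 4
window_length = n_steps + 1
sequence = list(range(13))

# Stateless: overlapping windows, one per position
stateless_inputs = [w[:-1] for w in make_windows(sequence, window_length, shift=1)]
# Stateful: consecutive, non-overlapping inputs
stateful_inputs = [w[:-1] for w in make_windows(sequence, window_length, shift=n_steps)]

print(stateless_inputs[:3])
print(stateful_inputs)
```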

In [None]:
# A stateful model needs a fixed batch size. Each batch holds a single window,
# so that batch i+1 continues exactly where batch i ended.
batch_size = 1  # must match batch_input_shape in the stateful model
dataset = tf.data.Dataset.from_tensor_slices(encoded[:train_size])
dataset = dataset.window(window_length, shift=n_steps, drop_remainder=True)
dataset = dataset.flat_map(lambda window: window.batch(window_length))
dataset = dataset.batch(batch_size)
dataset = dataset.map(lambda windows: (windows[:, :-1], windows[:, 1:]))
dataset = dataset.map(
    lambda X_batch, Y_batch: (tf.one_hot(X_batch, depth=max_id), Y_batch))
dataset = dataset.prefetch(1)

In [None]:
model = keras.models.Sequential([
    # Stateful layers need the full batch shape; batch_size must equal
    # the batch size of the dataset passed to fit()
    keras.layers.GRU(128, return_sequences=True, stateful=True,
                     dropout=0.2, recurrent_dropout=0.2,
                     batch_input_shape=[batch_size, None, max_id]),
    keras.layers.GRU(128, return_sequences=True, stateful=True,
                     dropout=0.2, recurrent_dropout=0.2),
    keras.layers.TimeDistributed(keras.layers.Dense(max_id, activation="softmax"))
])

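- However, the model's state should be reset at the start of each epoch
    - This can be done with a callback passed through the `callbacks` argument of `fit()`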

In [None]:
class ResetStatesCallback(keras.callbacks.Callback):
    def on_epoch_begin(self, epoch, logs=None):
        # Each epoch restarts from the beginning of the text, so start from a zero state
        self.model.reset_states()

In [None]:
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
model.fit(dataset, epochs=50, callbacks=[ResetStatesCallback()])