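## Language Models
### 1. n-grams
- n = 1: unigram
- n = 2: bigram
- n = 3: trigram
- The choice of n trades off model complexity against accuracy: a larger n captures longer context but needs many more parameters and much more data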
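
As a rough illustration of the list above, the sketch below estimates bigram probabilities by counting on a toy string. The helper names `ngram_counts` and `bigram_prob` and the toy corpus are assumptions made for this example; they do not appear elsewhere in this notebook.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bigram_prob(tokens, prev, nxt):
    """Maximum-likelihood estimate of P(nxt | prev) from raw counts."""
    bigrams = ngram_counts(tokens, 2)
    # Count prev only where it has a successor, so the estimates sum to 1.
    contexts = Counter(tokens[:-1])
    if contexts[prev] == 0:
        return 0.0
    return bigrams[(prev, nxt)] / contexts[prev]


tokens = list("abab")           # hypothetical toy corpus, one token per character
print(ngram_counts(tokens, 1))  # unigram counts
print(ngram_counts(tokens, 2))  # bigram counts
print(bigram_prob(tokens, "a", "b"))
```

With a larger n the number of distinct n-grams grows quickly, so most counts become zero unless the corpus is very large.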

### 2. n-th-order Markov chains
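
For reference, the usual statement of the Markov assumption is sketched below. The symbols $w_t$ (the token at time step $t$) and $T$ (the sequence length) are notation introduced here, not taken from the original notes. In an n-th-order Markov chain each token depends only on the previous n tokens, so an n-gram model corresponds to a chain of order n-1 (conditioning terms with an index below 1 are dropped):

```latex
P(w_1, \ldots, w_T) \approx \prod_{t=1}^{T} P\left(w_t \mid w_{t-(n-1)}, \ldots, w_{t-1}\right)
```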

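### 3. Recurrent neural networks (RNNs)
- An RNN stores the hidden variable (state) from the previous time step and uses it to predict the output at the next time step
- Network without hidden state (a plain MLP with one hidden layer):
    - H = activate(X W_xh + b_h)
    - O = activate(H W_ho + b_o)
- RNN with hidden state:
    - The hidden variable at time step t is determined jointly by the input at time step t and the hidden variable at time step t-1
    - H_t = activate(X_t W_xh + H_{t-1} W_hh + b_h)
    - O_t = H_t W_ho + b_o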
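
The recurrence above can be sketched in plain Python without any framework. This is a minimal sketch under stated assumptions: `tanh` is chosen as the activation (the same choice the `rnn` function later in this notebook makes), and the helpers `matmul`, `add_bias`, `rnn_step` and all toy weights are hypothetical names and values introduced only for this example.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add_bias(m, bias):
    """Add a bias vector to every row of a matrix."""
    return [[v + b for v, b in zip(row, bias)] for row in m]


def rnn_step(x_t, h_prev, w_xh, w_hh, b_h, w_ho, b_o):
    """One step: H_t = tanh(X_t W_xh + H_{t-1} W_hh + b_h), O_t = H_t W_ho + b_o."""
    xw = matmul(x_t, w_xh)
    hw = matmul(h_prev, w_hh)
    summed = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(xw, hw)]
    h_t = [[math.tanh(v) for v in row] for row in add_bias(summed, b_h)]
    o_t = add_bias(matmul(h_t, w_ho), b_o)
    return h_t, o_t


# Hypothetical toy sizes: batch of 1, 2 inputs, 2 hidden units, 1 output.
w_xh = [[0.5, 0.0], [0.0, 0.5]]
w_hh = [[0.1, 0.0], [0.0, 0.1]]
b_h = [0.0, 0.0]
w_ho = [[1.0], [1.0]]
b_o = [0.0]

h = [[0.0, 0.0]]                          # initial hidden state
for x in ([[1.0, 0.0]], [[0.0, 1.0]]):    # two time steps of one-hot input
    h, o = rnn_step(x, h, w_xh, w_hh, b_h, w_ho, b_o)
    print(h, o)
```

The hidden state `h` is carried from one loop iteration to the next, which is the only thing that distinguishes this from applying the same MLP independently at every time step.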

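### 4. Character-level RNN language model
- Goal: generate song lyrics with an RNN
- Problem formulation: use an RNN to predict the next character from the current and previous characters
- Preparing the dataset:
    - Split the text into individual characters
    - Random sampling: each minibatch is drawn from random positions in the sequence, so consecutive minibatches are not adjacent in the text
    - Adjacent sampling: consecutive minibatches are adjacent in the text
    - Convert each sample to one-hot vectors using the vocabulary
- Training:
    - Apply softmax to the output-layer output at every time step
    - Use the cross-entropy loss to measure the error against the labels
- Gradient clipping:
    - Why: RNNs are prone to vanishing or exploding gradients (clipping only addresses the exploding case)
    - How: choose a clipping threshold $\theta$ and concatenate the gradients of all parameters into one vector $g$; the clipped gradient below has an L2 norm of at most $\theta$: $$ \min\left(\frac{\theta}{\|g\|}, 1\right) g $$
- Evaluation: perplexity, the exponential of the cross-entropy loss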
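
The clipping rule and the perplexity definition above can be checked on small numbers. The following is a minimal sketch using plain Python lists instead of NDArrays; the function names `clip_gradients` and `perplexity` and the sample values are assumptions made for this example.

```python
import math


def clip_gradients(grads, theta):
    """Rescale grads so that their joint L2 norm is at most theta."""
    norm = math.sqrt(sum(g * g for g in grads))
    # min(theta / ||g||, 1) leaves small gradients untouched.
    scale = min(theta / norm, 1.0) if norm > 0 else 1.0
    return [g * scale for g in grads]


def perplexity(true_class_probs):
    """exp of the mean cross-entropy, given the probability assigned to each true label."""
    mean_ce = -sum(math.log(p) for p in true_class_probs) / len(true_class_probs)
    return math.exp(mean_ce)


print(clip_gradients([3.0, 4.0], theta=1.0))  # norm 5 is scaled down to norm 1
print(clip_gradients([0.3, 0.4], theta=1.0))  # norm 0.5 is left unchanged
print(perplexity([1.0, 1.0]))                 # perfect predictions
print(perplexity([0.25] * 8))                 # uniform guess over 4 classes
```

A model that always predicts the true label has perplexity 1, and a model that guesses uniformly has perplexity equal to the number of classes, so a useful character model should score well below the vocabulary size.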


In [None]:
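# Prepare the Google Colab environment: select GPU under Runtime
# Fetch the dataset
! git clone https://github.com/chibinjiang/dive_into_deep_learning.git
# Change into a working directory that mirrors the development environment
%cd dive_into_deep_learning/
# Install dependencies
! pip install mxnet-cu101mkl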

In [22]:
import random
import zipfile
import traceback
import time, math
import mxnet as mx
from mxnet import nd, autograd
from mxnet.gluon import loss as gloss

In [5]:
X, W_xh = nd.random.normal(shape=(3, 1)), nd.random.normal(shape=(1, 4))
H, W_hh = nd.random.normal(shape=(3, 4)), nd.random.normal(shape=(4, 4))

In [6]:
nd.dot(X, W_xh) + nd.dot(H, W_hh)


[[ 3.1951559  -7.0288424   6.2385654   3.5568771 ]
 [ 2.8098507  -1.8081223   0.6729959  -0.23211236]
 [-0.14438549 -2.5961137  -1.1423198  -4.142916  ]]
<NDArray 3x4 @cpu(0)>

In [7]:
nd.dot(nd.concat(X, H, dim=1), nd.concat(W_xh, W_hh, dim=0))


[[ 3.1951556  -7.0288424   6.2385654   3.5568771 ]
 [ 2.8098505  -1.8081224   0.6729959  -0.23211236]
 [-0.14438546 -2.5961137  -1.1423199  -4.142916  ]]
<NDArray 3x4 @cpu(0)>
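
The two cells above agree up to floating-point rounding because multiplying the column-wise concatenation of `X` and `H` by the row-wise concatenation of `W_xh` and `W_hh` is a block matrix product. A sketch of the identity, with the shapes used in the cells above:

```latex
\begin{bmatrix} X & H \end{bmatrix}
\begin{bmatrix} W_{xh} \\ W_{hh} \end{bmatrix}
= X W_{xh} + H W_{hh},
\qquad
X \in \mathbb{R}^{3 \times 1},\;
H \in \mathbb{R}^{3 \times 4},\;
W_{xh} \in \mathbb{R}^{1 \times 4},\;
W_{hh} \in \mathbb{R}^{4 \times 4}
```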

In [9]:
# Read the Jay Chou lyrics dataset
with zipfile.ZipFile('DataResources/Chapter_6/jaychou_lyrics.txt.zip') as zin:
    with zin.open('jaychou_lyrics.txt') as f:
        corpus_chars = f.read().decode('utf-8')
corpus_chars[:40]

'想要有直升机\n想要和你飞到宇宙去\n想要和你融化在一起\n融化在宇宙里\n我每天每天每'

In [12]:
# Text preprocessing: replace line breaks with spaces
corpus_chars = corpus_chars.replace('\n', ' ').replace('\r', ' ')

In [13]:
len(corpus_chars)

63282

In [14]:
# Build the vocabulary index
idx_to_char = list(set(corpus_chars))
char_to_idx = dict([(char, i) for i, char in enumerate(idx_to_char)])
vocab_size = len(char_to_idx)
vocab_size

2582

In [15]:
corpus_indices = [char_to_idx[char] for char in corpus_chars]
sample = corpus_indices[:20]
print('chars:', ''.join([idx_to_char[idx] for idx in sample]))
print('indices:', sample)

chars: 想要有直升机 想要和你飞到宇宙去 想要和
indices: [887, 307, 1453, 291, 556, 1364, 212, 887, 307, 1378, 375, 865, 2320, 370, 2094, 2178, 212, 887, 307, 1378]


In [80]:
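# Random sampling
def data_iter_random(corpus_indices, batch_size, num_steps, ctx=None):
    """
    Sample mini-batches in a random order from sequential data.

    1. Split corpus_indices into consecutive examples of num_steps indices each,
       shuffle them, and group batch_size examples into each mini-batch.
    2. Each label sequence is its input sequence shifted right by one position:
        ++++++++++++++++
         ----------------
    :param corpus_indices: the corpus as a list of character indices
    :param batch_size: number of examples per mini-batch
    :param num_steps: number of time steps per example
    :param ctx: device context for the returned NDArrays
    """
    # Subtract 1 because each label sequence starts one position after its input
    num_examples = (len(corpus_indices) - 1) // num_steps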
    epoch_size = num_examples // batch_size
    example_indices = list(range(num_examples))
    random.shuffle(example_indices)
    
    def _data(pos):
        return corpus_indices[pos : pos + num_steps]

    for i in range(epoch_size):
        i = i * batch_size
        batch_indices = example_indices[i : i + batch_size]
        X = nd.array(
            [_data(j * num_steps) for j in batch_indices], ctx=ctx)
        # The label window starts one position later: each target is the next character
        Y = nd.array([_data(j * num_steps + 1) for j in batch_indices], ctx=ctx)
        yield X, Y

In [81]:
my_seq = list(range(300))
for epoch, (X, Y) in enumerate(data_iter_random(my_seq, batch_size=3, num_steps=10)):
    print("Epoch: ", epoch, 'X: ', X, '\nY:', Y, '\n')

Epoch:  0 X:  
[[160. 161. 162. 163. 164. 165. 166. 167. 168. 169.]
 [250. 251. 252. 253. 254. 255. 256. 257. 258. 259.]
 [180. 181. 182. 183. 184. 185. 186. 187. 188. 189.]]
<NDArray 3x10 @cpu(0)> 
Y: 
[[161. 162. 163. 164. 165. 166. 167. 168. 169. 170.]
 [251. 252. 253. 254. 255. 256. 257. 258. 259. 260.]
 [181. 182. 183. 184. 185. 186. 187. 188. 189. 190.]]
<NDArray 3x10 @cpu(0)> 

Epoch:  1 X:  
[[190. 191. 192. 193. 194. 195. 196. 197. 198. 199.]
 [ 20.  21.  22.  23.  24.  25.  26.  27.  28.  29.]
 [210. 211. 212. 213. 214. 215. 216. 217. 218. 219.]]
<NDArray 3x10 @cpu(0)> 
Y: 
[[191. 192. 193. 194. 195. 196. 197. 198. 199. 200.]
 [ 21.  22.  23.  24.  25.  26.  27.  28.  29.  30.]
 [211. 212. 213. 214. 215. 216. 217. 218. 219. 220.]]
<NDArray 3x10 @cpu(0)> 

Epoch:  2 X:  
[[200. 201. 202. 203. 204. 205. 206. 207. 208. 209.]
 [130. 131. 132. 133. 134. 135. 136. 137. 138. 139.]
 [ 50.  51.  52.  53.  54.  55.  56.  57.  58.  59.]]
<NDArray 3x10 @cpu(0)> 
Y: 
[[201. 202. 203. 204.

In [102]:
def data_iter_consecutive(corpus_indices, batch_size, num_steps, ctx=None):
    """
    Sample mini-batches in a consecutive order from sequential data.

    Adjacent sampling: consecutive mini-batches are adjacent in the corpus,
    so row i of one batch continues row i of the previous batch.
    """
    corpus_indices = nd.array(corpus_indices, ctx=ctx)
    data_len = len(corpus_indices)
    batch_len = data_len // batch_size
    # Keep only the first batch_size * batch_len indices so the reshape is exact
    indices = corpus_indices[0 : batch_size * batch_len].reshape((
        batch_size, batch_len))
    epoch_size = (batch_len - 1) // num_steps
    for i in range(epoch_size):
        i = i * num_steps
        X = indices[:, i : i + num_steps]
        Y = indices[:, i + 1 : i + num_steps + 1]
        yield X, Y

In [108]:
my_seq = list(range(363))
for epoch, (X, Y) in enumerate(data_iter_consecutive(my_seq, batch_size=3, num_steps=10)):
    print("Epoch: ", epoch + 1, 'X: ', X, '\nY:', Y, '\n')
    print("=" * 100)

121

[[  0.   1.   2.   3.   4.   5.   6.   7.   8.   9.  10.  11.  12.  13.
   14.  15.  16.  17.  18.  19.  20.  21.  22.  23.  24.  25.  26.  27.
   28.  29.  30.  31.  32.  33.  34.  35.  36.  37.  38.  39.  40.  41.
   42.  43.  44.  45.  46.  47.  48.  49.  50.  51.  52.  53.  54.  55.
   56.  57.  58.  59.  60.  61.  62.  63.  64.  65.  66.  67.  68.  69.
   70.  71.  72.  73.  74.  75.  76.  77.  78.  79.  80.  81.  82.  83.
   84.  85.  86.  87.  88.  89.  90.  91.  92.  93.  94.  95.  96.  97.
   98.  99. 100. 101. 102. 103. 104. 105. 106. 107. 108. 109. 110. 111.
  112. 113. 114. 115. 116. 117. 118. 119. 120.]
 [121. 122. 123. 124. 125. 126. 127. 128. 129. 130. 131. 132. 133. 134.
  135. 136. 137. 138. 139. 140. 141. 142. 143. 144. 145. 146. 147. 148.
  149. 150. 151. 152. 153. 154. 155. 156. 157. 158. 159. 160. 161. 162.
  163. 164. 165. 166. 167. 168. 169. 170. 171. 172. 173. 174. 175. 176.
  177. 178. 179. 180. 181. 182. 183. 184. 185. 186. 187. 188. 189. 190.
  191. 192.

In [57]:
nd.one_hot(nd.array([0, 2]), vocab_size)


[[1. 0. 0. ... 0. 0. 0.]
 [0. 0. 1. ... 0. 0. 0.]]
<NDArray 2x2582 @cpu(0)>

In [25]:
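# Convert a (batch_size, num_steps) index matrix into a list of num_steps
# one-hot matrices, each of shape (batch_size, size)
def to_onehot(X, size):
    return [nd.one_hot(x, size) for x in X.T]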

In [27]:
sample_X = nd.arange(10).reshape((2, 5))
sample_X


[[0. 1. 2. 3. 4.]
 [5. 6. 7. 8. 9.]]
<NDArray 2x5 @cpu(0)>

In [29]:
inputs = to_onehot(X, vocab_size)
inputs

[
 [[1. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]]
 <NDArray 2x2582 @cpu(0)>, 
 [[0. 1. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]]
 <NDArray 2x2582 @cpu(0)>, 
 [[0. 0. 1. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]]
 <NDArray 2x2582 @cpu(0)>, 
 [[0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]]
 <NDArray 2x2582 @cpu(0)>, 
 [[0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]]
 <NDArray 2x2582 @cpu(0)>, 
 [[0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]]
 <NDArray 2x2582 @cpu(0)>, 
 [[0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]]
 <NDArray 2x2582 @cpu(0)>, 
 [[0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]]
 <NDArray 2x2582 @cpu(0)>, 
 [[0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]]
 <NDArray 2x2582 @cpu(0)>, 
 [[0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]]
 <NDArray 2x2582 @cpu(0)>]

In [30]:
# Initialize the model parameters
def try_gpu(gpu_number=0):
    """
        Return gpu(i) if exists, otherwise return cpu().
    """
    try:
        _ = mx.nd.array([1, 2, 3], ctx=mx.gpu(gpu_number))
        print("Try GPU: {}".format(gpu_number))
    except mx.MXNetError:
        traceback.print_exc()
        print("Try CPU: {}".format(gpu_number))
        return mx.cpu()
    return mx.gpu(gpu_number)

num_inputs, num_hiddens, num_outputs = vocab_size, 256, vocab_size
ctx = try_gpu()
print('will use', ctx)


def get_params():
    def _one(shape):
        return nd.random.normal(scale=0.01, shape=shape, ctx=ctx)

    # Hidden-layer parameters
    W_xh = _one((num_inputs, num_hiddens))
    W_hh = _one((num_hiddens, num_hiddens))
    b_h = nd.zeros(num_hiddens, ctx=ctx)
    # Output-layer parameters
    W_hq = _one((num_hiddens, num_outputs))
    b_q = nd.zeros(num_outputs, ctx=ctx)
    # Attach gradients
    params = [W_xh, W_hh, b_h, W_hq, b_q]
    for param in params:
        param.attach_grad()
    return params

will use cpu(0)


In [31]:
# Return the initial hidden state
def init_rnn_state(batch_size, num_hiddens, ctx):
    return (nd.zeros(shape=(batch_size, num_hiddens), ctx=ctx), )

In [32]:
def rnn(inputs, state, params):
    # inputs and outputs are both lists of num_steps matrices of shape (batch_size, vocab_size)
    W_xh, W_hh, b_h, W_hq, b_q = params
    H, = state
    outputs = []
    for X in inputs:
        H = nd.tanh(nd.dot(X, W_xh) + nd.dot(H, W_hh) + b_h)
        Y = nd.dot(H, W_hq) + b_q
        outputs.append(Y)
    return outputs, (H,)

In [33]:
X


[[ 0.  1.  2.  3.  4.  5.  6.  7.  8.  9.]
 [10. 11. 12. 13. 14. 15. 16. 17. 18. 19.]]
<NDArray 2x10 @cpu(0)>

In [34]:
state = init_rnn_state(X.shape[0], num_hiddens, ctx)
inputs = to_onehot(X.as_in_context(ctx), vocab_size)
params = get_params()
outputs, state_new = rnn(inputs, state, params)
len(outputs), outputs[0].shape, state_new[0].shape

(10, (2, 2582), (2, 256))

In [117]:
def predict_rnn(prefix, num_chars, rnn, params, init_rnn_state,
                num_hiddens, vocab_size, ctx, idx_to_char, char_to_idx):
    """
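    :param prefix: the starting string
    :param num_chars: number of characters to generate after the prefix
    :param rnn: the RNN model function
    :param params: model parameters
    :param init_rnn_state: function that initializes the hidden state
    :param num_hiddens: number of hidden units
    :param vocab_size: vocabulary size, used to build one-hot vectors
    :param ctx: device context
    :param idx_to_char: mapping from character index to character
    :param char_to_idx: mapping from character to character index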
    """
    state = init_rnn_state(1, num_hiddens, ctx)
    output = [char_to_idx[prefix[0]]]
    for t in range(num_chars + len(prefix) - 1):
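        # Feed the previous output as the input of the current time step
        X = to_onehot(nd.array([output[-1]], ctx=ctx), vocab_size)
        # Compute the output and update the hidden state
        (Y, state) = rnn(X, state, params)

        # The next input is either the next prefix character or the current best prediction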
        if t < len(prefix) - 1:
            output.append(char_to_idx[prefix[t + 1]])
        else:
            output.append(int(Y[0].argmax(axis=1).asscalar()))
    return ''.join([idx_to_char[i] for i in output])

In [118]:
predict_rnn('分开', 10, rnn, params, init_rnn_state, num_hiddens, vocab_size,
            ctx, idx_to_char, char_to_idx)

[563]
Output:  [563]
Output:  [563, 746]
Output:  [563, 746, 224]
Output:  [563, 746, 224, 2107]
Output:  [563, 746, 224, 2107, 397]
Output:  [563, 746, 224, 2107, 397, 1377]
Output:  [563, 746, 224, 2107, 397, 1377, 261]
Output:  [563, 746, 224, 2107, 397, 1377, 261, 1342]
Output:  [563, 746, 224, 2107, 397, 1377, 261, 1342, 2549]
Output:  [563, 746, 224, 2107, 397, 1377, 261, 1342, 2549, 161]
Output:  [563, 746, 224, 2107, 397, 1377, 261, 1342, 2549, 161, 1224]


'分开建民假愛飛蛇见讓约夺'

In [119]:
predict_rnn('我不要我不要', 10, rnn, params, init_rnn_state, num_hiddens, vocab_size,
            ctx, idx_to_char, char_to_idx)

[1804]
Output:  [1804]
Output:  [1804, 462]
Output:  [1804, 462, 307]
Output:  [1804, 462, 307, 1804]
Output:  [1804, 462, 307, 1804, 462]
Output:  [1804, 462, 307, 1804, 462, 307]
Output:  [1804, 462, 307, 1804, 462, 307, 2339]
Output:  [1804, 462, 307, 1804, 462, 307, 2339, 1688]
Output:  [1804, 462, 307, 1804, 462, 307, 2339, 1688, 2490]
Output:  [1804, 462, 307, 1804, 462, 307, 2339, 1688, 2490, 1675]
Output:  [1804, 462, 307, 1804, 462, 307, 2339, 1688, 2490, 1675, 1400]
Output:  [1804, 462, 307, 1804, 462, 307, 2339, 1688, 2490, 1675, 1400, 1358]
Output:  [1804, 462, 307, 1804, 462, 307, 2339, 1688, 2490, 1675, 1400, 1358, 702]
Output:  [1804, 462, 307, 1804, 462, 307, 2339, 1688, 2490, 1675, 1400, 1358, 702, 1581]
Output:  [1804, 462, 307, 1804, 462, 307, 2339, 1688, 2490, 1675, 1400, 1358, 702, 1581, 2378]


'我不要我不要帝星声攻雨趁後晓件赛'

In [37]:
# Gradient clipping
def grad_clipping(params, theta, ctx):
    norm = nd.array([0], ctx)
    for param in params:
        norm += (param.grad ** 2).sum()
    norm = norm.sqrt().asscalar()
    if norm > theta:
        for param in params:
            param.grad[:] *= theta / norm

In [120]:
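# How do we evaluate a model like this?
# Perplexity: the exponential of the cross-entropy loss
# Training and prediction
def sgd(params, lr, batch_size):  
    """
    Mini-batch stochastic gradient descent.
    :param params: list of parameters, updated in place
    :param lr: scalar, learning rate
    :param batch_size: size of the mini-batch
    """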
    for param in params:
        param[:] = param - lr * param.grad / batch_size
        
        
def train_and_predict_rnn(
        rnn, get_params, init_rnn_state, num_hiddens, vocab_size, ctx, corpus_indices, 
        idx_to_char, char_to_idx, is_random_iter, num_epochs, num_steps,
        lr, clipping_theta, batch_size, pred_period, pred_len, prefixes
    ):
    """
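    :param rnn: the RNN model function
    :param get_params: function that creates the model parameters
    :param init_rnn_state: function that initializes the hidden state
    :param num_hiddens: number of hidden units
    :param vocab_size: vocabulary size
    :param ctx: device context
    :param corpus_indices: the corpus as a list of character indices
    :param is_random_iter: use random sampling if True, adjacent sampling otherwise
    :param num_epochs: number of training epochs
    :param num_steps: number of time steps per example
    :param lr: learning rate
    :param clipping_theta: gradient clipping threshold
    :param batch_size: number of examples per mini-batch
    :param pred_period: generate sample text every pred_period epochs
    :param pred_len: number of characters to generate after each prefix
    :param prefixes: list of prefix strings to start generation from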
    """
    perplexity_hist = list()
    data_iter_fn = data_iter_random if is_random_iter else data_iter_consecutive
    params = get_params()
    loss = gloss.SoftmaxCrossEntropyLoss()

    for epoch in range(num_epochs):
        l_sum, n, start = 0.0, 0, time.time()
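        if not is_random_iter:  
            # With adjacent sampling, initialize the hidden state at the start of each epoch
            state = init_rnn_state(batch_size, num_hiddens, ctx)
        data_iter = data_iter_fn(corpus_indices, batch_size, num_steps, ctx)
        for X, Y in data_iter:
            if is_random_iter: 
                # With random sampling, re-initialize the hidden state before every mini-batch
                state = init_rnn_state(batch_size, num_hiddens, ctx)
            else:  
                # Otherwise detach the hidden state from the computation graph.
                # detach() returns a new NDArray, so rebind state to the detached copies
                state = tuple(s.detach() for s in state)
            with autograd.record():
                inputs = to_onehot(X, vocab_size)
                # outputs is a list of num_steps matrices of shape (batch_size, vocab_size)
                (outputs, state) = rnn(inputs, state, params)
                # After concatenation the shape is (num_steps * batch_size, vocab_size)
                outputs = nd.concat(*outputs, dim=0)
                # Y has shape (batch_size, num_steps); transpose it and flatten it to a vector of
                # length batch_size * num_steps so that it lines up with the rows of outputs
                y = Y.T.reshape((-1,))
                # Mean cross-entropy loss over all predictions
                l = loss(outputs, y).mean()
            l.backward()
            grad_clipping(params, clipping_theta, ctx)  # clip the gradients
            sgd(params, lr, 1)  # the loss is already averaged, so do not average the gradient again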
            l_sum += l.asscalar() * y.size
            n += y.size
        perplexity = math.exp(l_sum / n)
        print('epoch %d, perplexity %f, time %.2f sec' % (epoch + 1, perplexity, time.time() - start))
        perplexity_hist.append(perplexity)
        if (epoch + 1) % pred_period == 0:
            for prefix in prefixes:
                print(' -', predict_rnn(prefix, pred_len, rnn, params, init_rnn_state, num_hiddens, vocab_size, ctx, idx_to_char, char_to_idx))
    return perplexity_hist

In [121]:
num_epochs, num_steps, batch_size, lr, clipping_theta = 300, 50, 32, 100, 0.01
pred_period, pred_len, prefixes = 50, 50, ['分开', '不分开', '我静静地']

In [None]:
# Random sampling
perplexity_hist_random = train_and_predict_rnn(rnn, get_params, init_rnn_state, num_hiddens,
                      vocab_size, ctx, corpus_indices, idx_to_char,
                      char_to_idx, True, num_epochs, num_steps, lr,
                      clipping_theta, batch_size, pred_period, pred_len,
                      prefixes)

epoch 1, perplexity 859.002742, time 22.07 sec
epoch 2, perplexity 512.869748, time 20.75 sec
epoch 3, perplexity 461.401658, time 22.28 sec
epoch 4, perplexity 430.564968, time 24.65 sec
epoch 5, perplexity 406.764540, time 22.20 sec
epoch 6, perplexity 385.915595, time 25.43 sec
epoch 7, perplexity 364.178734, time 22.64 sec
epoch 8, perplexity 343.745503, time 21.73 sec
epoch 9, perplexity 322.004675, time 22.14 sec
epoch 10, perplexity 301.724242, time 21.87 sec
epoch 11, perplexity 284.203075, time 21.28 sec
epoch 12, perplexity 268.290758, time 23.67 sec
epoch 13, perplexity 251.984735, time 21.79 sec
epoch 14, perplexity 237.229979, time 21.98 sec
epoch 15, perplexity 225.284144, time 21.21 sec
epoch 16, perplexity 212.430445, time 22.82 sec
epoch 17, perplexity 202.587929, time 27.30 sec
epoch 18, perplexity 190.318045, time 21.82 sec
epoch 19, perplexity 181.743100, time 22.51 sec
epoch 20, perplexity 171.356619, time 23.20 sec
epoch 21, perplexity 162.569793, time 22.18 sec
e

epoch 101, perplexity 15.321791, time 22.81 sec
epoch 102, perplexity 15.167743, time 21.59 sec
epoch 103, perplexity 14.721244, time 21.29 sec
epoch 104, perplexity 14.686179, time 22.83 sec
epoch 105, perplexity 14.555502, time 24.33 sec
epoch 106, perplexity 14.349756, time 23.22 sec
epoch 107, perplexity 14.285166, time 23.72 sec
epoch 108, perplexity 13.934395, time 21.30 sec
epoch 109, perplexity 13.838401, time 21.71 sec
epoch 110, perplexity 13.638564, time 21.37 sec
epoch 111, perplexity 13.513872, time 24.86 sec
epoch 112, perplexity 13.306742, time 22.81 sec
epoch 113, perplexity 13.203319, time 23.22 sec
epoch 114, perplexity 13.118956, time 24.29 sec
epoch 115, perplexity 12.965246, time 21.75 sec
epoch 116, perplexity 12.734135, time 22.03 sec
epoch 117, perplexity 12.698126, time 21.99 sec
epoch 118, perplexity 12.418688, time 24.09 sec
epoch 119, perplexity 12.448622, time 23.58 sec
epoch 120, perplexity 12.342053, time 24.09 sec
epoch 121, perplexity 12.312460, time 23

In [None]:
# Adjacent sampling
perplexity_hist_adjacency = train_and_predict_rnn(rnn, get_params, init_rnn_state, num_hiddens,
                      vocab_size, ctx, corpus_indices, idx_to_char,
                      char_to_idx, False, num_epochs, num_steps, lr,
                      clipping_theta, batch_size, pred_period, pred_len,
                      prefixes)