# LSTM
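RNNs struggle with long-term dependencies (nodes that are far apart in a time series). Relating distant nodes requires multiplying many Jacobian matrices together, which causes gradients to vanish or explode.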
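
To see why repeated Jacobian products are a problem, consider a hypothetical 1-D RNN whose Jacobian is the same constant factor at every step. This is a simplifying assumption for illustration only (the helper name `gradient_scale` and the factors 0.9 and 1.1 are made up), but it shows how quickly the product leaves a usable range over 80 time steps.

```python
def gradient_scale(factor, steps):
    """Magnitude of a gradient after `steps` multiplications by a constant factor.

    In a 1-D RNN the Jacobian dh_t/dh_{t-1} is a single number, so the product
    of Jacobians across `steps` time steps reduces to factor ** steps.
    """
    scale = 1.0
    for _ in range(steps):
        scale *= factor
    return scale


vanishing = gradient_scale(0.9, 80)  # factor < 1: the product shrinks toward 0
exploding = gradient_scale(1.1, 80)  # factor > 1: the product keeps growing
print(f"{vanishing:.2e}  {exploding:.2e}")
```

With real matrices the same effect is governed roughly by the singular values of the Jacobians rather than by a single number.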

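Beyond being hard to train, recurrent neural networks have an even more serious problem: short-term memory. When processing long sentences, they can usually exploit only information within a limited span and fail to make good use of useful information that lies further away.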

![](./images/LSTM2.png)

## How it works

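The core idea of an RNN is that the state vector from the previous time step, $h_{t-1}$, and the input at the current time step, $x_t$, pass through a linear transformation and then an activation function to produce the new state vector $h_{t}$.
LSTM adds a second state vector $C$ and introduces a gating mechanism: gate units control which information is forgotten and which is refreshed. Each gate consists of a sigmoid neural network layer followed by a pointwise multiplication.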

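### Forget gate

Purpose: selectively forget information held in the cell state. It acts on the LSTM state vector $C$.

Steps: the gate reads $h_{t-1}$ and $x_t$ and outputs a value between 0 and 1 for each element of the cell state $C_{t-1}$. A value of 1 means "keep completely" and 0 means "discard completely".

Formula: $$f_t = \sigma(W_f[h_{t-1}, x_t] + b_f)$$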
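
The forget gate formula can be sketched in plain Python as follows. The sizes, weights, and biases are toy values chosen for illustration, and `forget_gate` is a hypothetical helper, not a library function. The weights are set to zero so the bias alone drives each gate element toward 1 or 0.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forget_gate(W_f, b_f, h_prev, x_t):
    """f_t = sigmoid(W_f [h_{t-1}, x_t] + b_f), computed row by row."""
    concat = h_prev + x_t  # list concatenation gives [h_{t-1}, x_t]
    return [
        sigmoid(sum(w * v for w, v in zip(row, concat)) + b)
        for row, b in zip(W_f, b_f)
    ]


# Toy sizes (assumed): 2 hidden units, 3 input features.
h_prev = [0.5, -0.5]
x_t = [1.0, 0.0, -1.0]
W_f = [[0.0] * 5, [0.0] * 5]  # zero weights: only the bias matters here
b_f = [10.0, -10.0]           # strongly positive vs. strongly negative bias

f_t = forget_gate(W_f, b_f, h_prev, x_t)
c_prev = [3.0, 3.0]
kept = [f * c for f, c in zip(f_t, c_prev)]  # pointwise f_t * C_{t-1}
print(f_t, kept)
```

The first element of the cell state is kept almost unchanged, while the second is almost entirely discarded.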

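### Input gate

Purpose: selectively record new information into the cell state, controlling how much of the input the LSTM accepts.

Steps:

1. A sigmoid layer decides which values will be updated ($i_t$).
2. A tanh layer creates a vector of new candidate values $\tilde{C}_t$ to be added to the state.
3. $C_{t-1}$ is updated to $C_t$: the old state that should be dropped is discarded, and the new information that should be kept is added.

Formula:
$$
i_t = \sigma(W_i[h_{t-1}, x_t] + b_i) \\
\tilde{C}_t = \tanh(W_C[h_{t-1}, x_t] + b_C) \\
C_t = f_t * C_{t-1} + i_t * \tilde{C}_t
$$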


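### Output gate

Purpose: decide what to output. The internal state $C_t$ is not used directly as the output.

Steps:

1. A sigmoid layer decides which parts of the cell state will be output.
2. The cell state is passed through tanh and multiplied by the output of the sigmoid gate, so that only the selected parts are output.

Formula:
$$
o_t = \sigma(W_o[h_{t-1}, x_t] + b_o) \\
h_t = o_t * \tanh(C_t)
$$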
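
Putting the forget, input, and output gates together, one LSTM time step can be sketched in plain Python. This is a minimal illustration of the equations in this section, not how Keras implements the layer. The names `lstm_step`, `affine`, and `params`, the toy sizes, and the random weights are all assumptions.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, b, v):
    """Return W v + b for a matrix stored as a list of rows."""
    return [sum(w * a for w, a in zip(row, v)) + bias for row, bias in zip(W, b)]


def lstm_step(params, h_prev, c_prev, x_t):
    """One LSTM time step following the gate equations above."""
    concat = h_prev + x_t  # [h_{t-1}, x_t]
    f = [sigmoid(z) for z in affine(params["W_f"], params["b_f"], concat)]
    i = [sigmoid(z) for z in affine(params["W_i"], params["b_i"], concat)]
    c_tilde = [math.tanh(z) for z in affine(params["W_C"], params["b_C"], concat)]
    o = [sigmoid(z) for z in affine(params["W_o"], params["b_o"], concat)]
    # C_t = f_t * C_{t-1} + i_t * C~_t, then h_t = o_t * tanh(C_t)
    c_t = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, c_tilde)]
    h_t = [ok * math.tanh(ck) for ok, ck in zip(o, c_t)]
    return h_t, c_t


# Toy sizes and random weights (assumed values, for illustration only).
units, features = 3, 4
rng = random.Random(0)
params = {}
for name in ("f", "i", "C", "o"):
    params["W_" + name] = [
        [rng.uniform(-1, 1) for _ in range(units + features)] for _ in range(units)
    ]
    params["b_" + name] = [0.0] * units

h, c = [0.0] * units, [0.0] * units
for x_t in ([1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]):  # a 2-step sequence
    h, c = lstm_step(params, h, c, x_t)
print(h, c)
```

Because $h_t$ is a gate value in (0, 1) times a tanh, every element of the output stays strictly between -1 and 1, while the cell state $C_t$ is not bounded in the same way.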


|Input gate |Forget gate |LSTM behavior|
|---|---|---|
|0|1|Use memory only|
|1|1|Combine input and memory|
|0|0|Clear memory|
|1|0|Input overwrites memory|
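
The four rows of the table follow directly from the update rule $C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$ when the gates sit at their extreme values. A small check for a single unit, with arbitrary example values for the memory and the candidate (`update_cell` is a hypothetical helper):

```python
def update_cell(f, i, c_prev, c_tilde):
    """C_t = f * C_{t-1} + i * C~_t for a single unit."""
    return f * c_prev + i * c_tilde


c_prev, c_tilde = 5.0, 2.0  # arbitrary memory and candidate values

# (input gate, forget gate) pairs in the same order as the table rows
for i, f in [(0, 1), (1, 1), (0, 0), (1, 0)]:
    c_t = update_cell(f, i, c_prev, c_tilde)
    print(f"input gate={i}, forget gate={f} -> C_t={c_t}")
```

In practice the gates take values strictly between 0 and 1, so real behavior is a blend of these four cases.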


### Using LSTM

In [None]:
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

In [None]:
from tensorflow.keras import layers, Input, Model, Sequential, datasets

In [None]:
gpus = tf.config.experimental.list_physical_devices('GPU')
try:
    for gpu in gpus:
        tf.config.experimental.set_memory_growth(gpu, True)
        print(gpu)
except RuntimeError as e:
    print(e)

In [None]:
x = tf.random.normal([2, 80, 100])  # [batch, time steps, features]
xt = x[:, 0, :]  # first word: the input at the first time step

cell = layers.LSTMCell(64)  # analogous to SimpleRNNCell
state = [tf.zeros([2, 64]), tf.zeros([2, 64])]  # initial states h0, C0
out, state = cell(xt, state)

In [None]:
cell?

In [None]:
out.shape

In [None]:
cell.get_config()

In [None]:
w_xh, w_hh, b = cell.trainable_variables
w_xh.shape, w_hh.shape, b.shape  # the weights of the 4 parts are stacked together
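
The "4 parts" are the forget gate, input gate, candidate, and output gate transformations, each with `units` output columns. The arithmetic below sketches the shapes one would expect for 100 input features and 64 units. `lstm_cell_shapes` is a hypothetical helper written for this illustration, not a Keras function.

```python
def lstm_cell_shapes(input_dim, units):
    """Expected shapes of the stacked kernel, recurrent kernel, and bias."""
    kernel = (input_dim, 4 * units)        # applied to x_t
    recurrent_kernel = (units, 4 * units)  # applied to h_{t-1}
    bias = (4 * units,)
    return kernel, recurrent_kernel, bias


kernel, recurrent_kernel, bias = lstm_cell_shapes(input_dim=100, units=64)
n_params = (
    kernel[0] * kernel[1] + recurrent_kernel[0] * recurrent_kernel[1] + bias[0]
)
print(kernel, recurrent_kernel, bias, n_params)
```

These values should agree with the shapes printed by the cell above.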

In [None]:
for xt in tf.unstack(x, axis=1):
    out, state = cell(xt, state)

In [None]:
# LSTM layer
lstm_layer = layers.LSTM(64)

out = lstm_layer(x)
out.shape

In [None]:
# Simple stack of LSTM layers
lstm_net = Sequential([
    layers.LSTM(units=64, return_sequences=True),
    layers.LSTM(units=64),
])
out = lstm_net(x)  # keep the model in lstm_net; store the output separately

### Sentiment classification with LSTM

In [None]:
BATCH_SIZE = 128
TOTAL_WORDS = 10000  # vocabulary size
MAX_REVIEW_LEN = 80  # review length in tokens
EMBEDDING_LEN = 100  # word vector length

In [None]:
(X_train, y_train), (X_test, y_test) = datasets.imdb.load_data(num_words=TOTAL_WORDS)

In [None]:
word_index = datasets.imdb.get_word_index()

pre_10 = list(word_index.items())[:10]
for item in pre_10:  
    print(item)  # (word, integer index) pair

In [None]:
# Shift the indices to make room for special tokens
word_index = {k: (v + 3) for k, v in word_index.items()}
word_index["<PAD>"] = 0  # padding
word_index["<START>"] = 1  # start of sequence
word_index["<UNK>"] = 2  # unknown word
word_index["<UNUSED>"] = 3

# Invert the mapping: index -> word
index_word = {v: k for k, v in word_index.items()}

In [None]:
def decode_review(text):
    # Integer sequence -> text
    return ' '.join([index_word.get(i, '?') for i in text])

In [None]:
decode_review(X_train[0])

In [None]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [None]:
# Truncate or pad (at the front, the default) to equal-length sequences
X_train = pad_sequences(X_train, maxlen=MAX_REVIEW_LEN)
X_test = pad_sequences(X_test, maxlen=MAX_REVIEW_LEN)
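
By default `pad_sequences` both pads and truncates at the front of each sequence. The pure-Python function below imitates that default behavior for a single sequence so the effect is easy to see. `pad_pre` is a hypothetical stand-in written for this illustration, not part of Keras.

```python
def pad_pre(seq, maxlen, value=0):
    """Keep the last `maxlen` items, then left-pad with `value` to `maxlen`."""
    seq = list(seq)[-maxlen:]                   # truncate from the front
    return [value] * (maxlen - len(seq)) + seq  # pad at the front


short = pad_pre([1, 2, 3], 5)             # zeros are added in front
long = pad_pre([1, 2, 3, 4, 5, 6], 4)     # the first items are dropped
print(short, long)
```

Keeping the end of each review means the last 80 tokens are the ones the model sees.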

In [None]:
X_train.shape

In [None]:
train_db = tf.data.Dataset.from_tensor_slices(  # drop the last incomplete batch
    (X_train, y_train)).shuffle(1000).batch(BATCH_SIZE, drop_remainder=True)
test_db = tf.data.Dataset.from_tensor_slices(
    (X_test, y_test)).shuffle(1000).batch(BATCH_SIZE, drop_remainder=True)

In [None]:
def load_embed(path):
    # Build a mapping: word -> word vector (length 100 for glove.6B.100d.txt)
    embedding_map = {}
    with open(path, encoding='utf8') as f:
        for line in f:  # iterate lazily instead of loading the whole file
            parts = line.split()
            word = parts[0]
            coefs = np.asarray(parts[1:], dtype='float32')
            embedding_map[word] = coefs
    return embedding_map

In [None]:
embedding_map = load_embed('glove.6B.100d.txt')
print('Found %s word vectors.' % len(embedding_map))

In [None]:
# Pretrained embeddings
# Map word index -> word vector
num_words = min(TOTAL_WORDS, len(word_index))
embedding_matrix = np.zeros((num_words, EMBEDDING_LEN))

applied_vec_count = 0
for word, i in word_index.items():
    if i >= TOTAL_WORDS:
        continue
    # Look up the word's vector in glove.6B.100d
    embedding_vector = embedding_map.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
        applied_vec_count += 1
print(applied_vec_count, embedding_matrix.shape)

In [None]:
class MyLSTMRNN(Model):
    def __init__(self, units):
        super().__init__()
        # Initial state vectors
#         self.state0 = [tf.zeros([BATCH_SIZE, units])]
#         self.state1 = [tf.zeros([BATCH_SIZE, units])]
        # Word embedding layer
        self.embedding = layers.Embedding(TOTAL_WORDS, EMBEDDING_LEN,
                                          input_length=MAX_REVIEW_LEN,
                                          weights=[embedding_matrix],
                                         trainable=False
                                         )
        # RNNCell
#         self.runcell0 = layers.SimpleRNNCell(units, dropout=0.5)
#         self.runcell1 = layers.SimpleRNNCell(units, dropout=0.5)
        # RNN layer
        self.rnn = Sequential([
            layers.LSTM(units, dropout=0.5, return_sequences=True),
            layers.LSTM(units, dropout=0.5)
        ])
        # Classification layers
        self.out_layer = Sequential([
            layers.Dense(32, activation='relu'),
            layers.Dropout(rate=0.5),
            layers.Dense(1, activation='sigmoid')
        ])
        
    
    def call(self, inputs, training=None):
        x = self.embedding(inputs)
#         state0, state1 = self.state0, self.state1
#         for word in tf.unstack(x, axis=1):
#             out0, state0 = self.runcell0(word, state0, training)
#             out1, state1 = self.runcell1(out0, state1, training)
        out1 = self.rnn(x, training=training)
        # Output of the top layer at the last time step
        out = self.out_layer(out1, training=training)
        return out

In [None]:
model = MyLSTMRNN(64)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.build((None, MAX_REVIEW_LEN))
model.summary()

In [None]:
hist = model.fit(train_db, epochs=20, validation_data=test_db)

In [None]:
plt.plot(hist.history['loss'], label='train_loss')
plt.plot(hist.history['val_loss'], label='test_loss')
plt.legend()

In [None]:
plt.plot(hist.history['accuracy'], label='train_accuracy')
plt.plot(hist.history['val_accuracy'], label='test_accuracy')
plt.legend()

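## GRU
LSTM is less prone to vanishing gradients, but its structure is relatively complex, its computational cost is high, and it has a large number of parameters.
The Gated Recurrent Unit (GRU) is the most widely used simplified variant of LSTM. It merges the forget gate and the input gate into a single update gate, and merges the internal state vector and the output vector into one state vector $h$.
![](./images/LSTM12.png)

- Reset gate: controls how much of the previous state $h_{t-1}$ enters the GRU.
- Update gate: controls how much the previous state $h_{t-1}$ and the new candidate $\tilde h_t$ each contribute to the new state vector $h_t$.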
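
The two gates can be sketched as one GRU time step in plain Python. The text above does not give the equations, so this block assumes one common formulation: $z_t$ and $r_t$ are sigmoid gates over $[h_{t-1}, x_t]$, $\tilde h_t = \tanh(W_h[r_t * h_{t-1}, x_t] + b_h)$, and $h_t = (1 - z_t) * h_{t-1} + z_t * \tilde h_t$. Some libraries swap the roles of $z_t$ and $1 - z_t$. The names `gru_step` and `p` and all numeric values are illustrative assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, b, v):
    """Return W v + b for a matrix stored as a list of rows."""
    return [sum(w * a for w, a in zip(row, v)) + bias for row, bias in zip(W, b)]


def gru_step(p, h_prev, x_t):
    """One GRU time step under the convention stated in the lead-in."""
    concat = h_prev + x_t  # [h_{t-1}, x_t]
    z = [sigmoid(a) for a in affine(p["W_z"], p["b_z"], concat)]  # update gate
    r = [sigmoid(a) for a in affine(p["W_r"], p["b_r"], concat)]  # reset gate
    gated = [rk * hk for rk, hk in zip(r, h_prev)] + x_t  # [r * h_{t-1}, x_t]
    h_tilde = [math.tanh(a) for a in affine(p["W_h"], p["b_h"], gated)]
    return [(1 - zk) * hk + zk * gk for zk, hk, gk in zip(z, h_prev, h_tilde)]


# Toy setup (assumed): 2 units, 2 features, zero weights so biases set the gates.
zeros = [[0.0] * 4, [0.0] * 4]
p = {
    "W_z": zeros, "b_z": [-20.0, 20.0],  # unit 0: z near 0, unit 1: z near 1
    "W_r": zeros, "b_r": [0.0, 0.0],
    "W_h": zeros, "b_h": [0.0, 0.0],     # candidate is tanh(0) = 0
}
h_t = gru_step(p, h_prev=[0.8, 0.8], x_t=[1.0, -1.0])
print(h_t)
```

Unit 0 keeps its previous state almost unchanged, while unit 1 is replaced almost entirely by the candidate value, which is zero in this setup.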

In [None]:
h = [tf.zeros([2, 64])]
cell = layers.GRUCell(64)

for xt in tf.unstack(x, axis=1):
    out, h = cell(xt, h)
    
out.shape

In [None]:
class MyGRURNN(Model):
    def __init__(self, units):
        super().__init__()
        # Initial state vectors
#         self.state0 = [tf.zeros([BATCH_SIZE, units])]
#         self.state1 = [tf.zeros([BATCH_SIZE, units])]
        # Word embedding layer
        self.embedding = layers.Embedding(TOTAL_WORDS, EMBEDDING_LEN,
                                          input_length=MAX_REVIEW_LEN,
                                          weights=[embedding_matrix],
                                         trainable=False
                                         )
        # RNNCell
#         self.runcell0 = layers.SimpleRNNCell(units, dropout=0.5)
#         self.runcell1 = layers.SimpleRNNCell(units, dropout=0.5)
        # RNN layer
        self.rnn = Sequential([
            layers.GRU(units, dropout=0.5, return_sequences=True),
            layers.GRU(units, dropout=0.5)
        ])
        # Classification layers
        self.out_layer = Sequential([
            layers.Dense(32, activation='relu'),
            layers.Dropout(rate=0.5),
            layers.Dense(1, activation='sigmoid')
        ])
        
    
    def call(self, inputs, training=None):
        x = self.embedding(inputs)
#         state0, state1 = self.state0, self.state1
#         for word in tf.unstack(x, axis=1):
#             out0, state0 = self.runcell0(word, state0, training)
#             out1, state1 = self.runcell1(out0, state1, training)
        out1 = self.rnn(x, training=training)
        # Output of the top layer at the last time step
        out = self.out_layer(out1, training=training)
        return out

In [None]:
model = MyGRURNN(64)
model.build((None, MAX_REVIEW_LEN))
model.summary()

In [None]:
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
hist = model.fit(train_db, epochs=20, validation_data=test_db)

In [None]:
plt.plot(hist.history['loss'], label='train_loss')
plt.plot(hist.history['val_loss'], label='test_loss')
plt.legend()

In [None]:
plt.plot(hist.history['accuracy'], label='train_accuracy')
plt.plot(hist.history['val_accuracy'], label='test_accuracy')
plt.legend()