## LSTM
<img src="image/LSTM.png"  width="660" >

## Forget gate
Decides which information is discarded from the cell state.
<img src="image/LstmForgetGate.png"  width="660" >
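
For reference, the forget gate is commonly written as follows. This assumes the standard LSTM formulation; the symbol names ($W_f$ and $b_f$ for the gate's weights and bias, $h_{t-1}$ for the previous hidden state, $x_t$ for the current input) are assumptions and may differ from the figure:

$$
f_t = \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right)
$$

A component of $f_t$ close to 0 discards the matching component of the previous cell state, and a component close to 1 keeps it.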

## Input gate
Decides which values are updated.  
Creates a new candidate vector.
<img src="image/LstmInputGate.png"  width="660" >
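
In the standard formulation (assumed here; $W_i$, $b_i$, $W_C$ and $b_C$ are assumed names for the weights and biases), the two steps above correspond to a sigmoid gate $i_t$ and a tanh candidate $\tilde{C}_t$:

$$
\begin{aligned}
i_t &= \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right) \\
\tilde{C}_t &= \tanh\left(W_C \cdot [h_{t-1}, x_t] + b_C\right)
\end{aligned}
$$

The gate $i_t$ scales how much of each candidate value is written into the cell state.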

## Output gate
Decides which part of the cell state is output.
<img src="image/LstmOutputGate.png"  width="660" >
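
Assuming the standard formulation again (with $W_o$ and $b_o$ as assumed names for the gate's weights and bias, and $C_t$ the current cell state), the output gate and the resulting hidden state can be written as:

$$
\begin{aligned}
o_t &= \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right) \\
h_t &= o_t \odot \tanh(C_t)
\end{aligned}
$$

Here $\odot$ denotes element-wise multiplication.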

## Cell state update
<img src="image/LstmCellUpdate.png"  width="660" >
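
In the standard formulation, the new cell state is C_t = f_t * C_{t-1} + i_t * C̃_t: the forget gate scales the old state and the input gate scales the candidate. The sketch below puts the three gates and this update into one time step. It is a plain-Python illustration for readability, not how Keras implements the layer; the dict layout of `W` and `b` and the helper names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, v, b):
    """Return W @ v + b for a matrix W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step.

    W and b are dicts keyed by 'f', 'i', 'c', 'o' (hypothetical layout);
    each W[k] has shape (hidden, hidden + input).
    """
    z = h_prev + x_t  # list concatenation: [h_{t-1}, x_t]
    f_t = [sigmoid(a) for a in affine(W["f"], z, b["f"])]        # forget gate
    i_t = [sigmoid(a) for a in affine(W["i"], z, b["i"])]        # input gate
    c_tilde = [math.tanh(a) for a in affine(W["c"], z, b["c"])]  # candidate
    o_t = [sigmoid(a) for a in affine(W["o"], z, b["o"])]        # output gate
    # Cell state update: keep part of the old state, add part of the candidate
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_tilde)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


# Toy check with all-zero weights: every gate equals sigmoid(0) = 0.5
hidden, inputs = 2, 3
W = {k: [[0.0] * (hidden + inputs) for _ in range(hidden)] for k in "fico"}
b = {k: [0.0] * hidden for k in "fico"}
h_t, c_t = lstm_step([1.0, 2.0, 3.0], [0.0, 0.0], [1.0, 1.0], W, b)
print(c_t)  # [0.5, 0.5]
```

With zero weights the candidate is tanh(0) = 0, so the new cell state is simply half of the old one.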

## Bidirectional LSTM (Bi-LSTM)
<img src="image/BiLSTM.png"  width="600" >
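The forward layer runs from time step 1 to time step t and stores the forward hidden-layer output at every step.  
The backward layer runs in reverse, from time step t back to time step 1, and stores the backward hidden-layer output at every step.  
At each time step, the forward and backward outputs for that step are combined to give the final output: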
<img src="image/BILSTM-formula.png" width="300" >
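
The two passes can be sketched in a few lines. This is a minimal illustration, assuming the outputs are merged by concatenation (as `merge_mode='concat'` does in the Keras model later in this document); the formula in the figure may combine them differently, for example through a weighted sum. The function names and the toy running-sum step are hypothetical.

```python
def run_direction(step, inputs, h0):
    """Apply a recurrent step function over inputs and keep every hidden state."""
    h, outputs = h0, []
    for x in inputs:
        h = step(x, h)
        outputs.append(h)
    return outputs


def bidirectional(step_fw, step_bw, inputs, h0_fw, h0_bw):
    forward = run_direction(step_fw, inputs, h0_fw)
    # Run over the reversed sequence, then flip back so index t lines up
    backward = run_direction(step_bw, inputs[::-1], h0_bw)[::-1]
    # Merge per time step by concatenating the two hidden vectors
    return [f + b for f, b in zip(forward, backward)]


# Toy step: a running sum stands in for a real LSTM cell
def running_sum(x, h):
    return [h[0] + x]


merged = bidirectional(running_sum, running_sum, [1, 2, 3], [0], [0])
print(merged)  # [[1, 6], [3, 5], [6, 3]]
```

At each position the first entry summarizes everything up to that step and the second summarizes everything from that step to the end.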

## GRU
A GRU has only two gates, the update gate and the reset gate, shown in the figure as z_t and r_t.
<img src="image/GRU.png"  width="600" >
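
One common way to write the GRU equations is given below. The weight names $W_z$, $W_r$ and $W$ are assumptions, biases are omitted, and the roles of $z_t$ and $1 - z_t$ are swapped in some references and libraries, so check this against the figure:

$$
\begin{aligned}
z_t &= \sigma\left(W_z \cdot [h_{t-1}, x_t]\right) \\
r_t &= \sigma\left(W_r \cdot [h_{t-1}, x_t]\right) \\
\tilde{h}_t &= \tanh\left(W \cdot [r_t \odot h_{t-1}, x_t]\right) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
$$

Unlike the LSTM, there is no separate cell state: the hidden state $h_t$ carries all the memory.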

## Sentiment analysis on the IMDb dataset (binary classification)

In [1]:
from keras.models import Sequential
from keras.layers import Dense, Flatten, Dropout, Bidirectional
from keras.layers import Embedding, LSTM, GRU
from keras.preprocessing import sequence
from keras.preprocessing.text import Tokenizer
from keras.callbacks import EarlyStopping
import os
import re
import numpy as np

Using TensorFlow backend.


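### Reading the data
Data cleaning: strip the HTML tags from the reviews.  
Word segmentation: not needed here because the text is English.  
Stop-word removal: words such as "the" and "a" could be removed, but this is not done here.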

In [2]:
import re
def rm_tags(text):
    re_tag = re.compile(r'<[^>]+>')
    return re_tag.sub('', text)

def read_files(filetype):
    """
    filetype: 'train' or 'test'
    return:
    all_texts: review texts of the `filetype` split
    all_labels: labels of the `filetype` split
    """
    # Label 1 means positive, 0 means negative; each split of aclImdb holds
    # 12,500 positive and 12,500 negative reviews, read in that order below
    all_labels = [1]*12500 + [0]*12500
    all_texts = []
    file_list = []
    path = r'./data/aclImdb/'
    # Collect the positive review file names
    pos_path = path + filetype + '/pos/'
    for file in os.listdir(pos_path):
        file_list.append(pos_path+file)
    # Collect the negative review file names
    neg_path = path + filetype + '/neg/'
    for file in os.listdir(neg_path):
        file_list.append(neg_path+file)
    # Append the cleaned content of every file to all_texts
    for file_name in file_list:
        with open(file_name, encoding='utf-8') as f:
            all_texts.append(rm_tags(" ".join(f.readlines())))
    return all_texts, all_labels
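
To see what the cleaning step does, here is `rm_tags` applied to a made-up review fragment (the sample string is hypothetical, not taken from the dataset). The function is repeated so the snippet runs on its own.

```python
import re


def rm_tags(text):
    # Remove anything that looks like an HTML tag, e.g. <br />
    re_tag = re.compile(r'<[^>]+>')
    return re_tag.sub('', text)


sample = "Great movie!<br /><br />I loved it."
cleaned = rm_tags(sample)
print(cleaned)  # Great movie!I loved it.
```

Because tags are replaced with an empty string, words on either side of a tag are joined. Substituting a space instead would keep them apart, if that matters for the tokenizer.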

In [3]:
train_texts, train_labels = read_files('train')
test_texts, test_labels = read_files('test')

### Converting the data into the format the model needs

In [4]:
def preprocessing(train_texts, train_labels, test_texts, test_labels):
    tokenizer = Tokenizer(num_words=3800)  
    tokenizer.fit_on_texts(train_texts)
    # Convert each review into a list of integers, one vocabulary index per word
    x_train_seq = tokenizer.texts_to_sequences(train_texts)
    x_test_seq = tokenizer.texts_to_sequences(test_texts)
    x_train = sequence.pad_sequences(x_train_seq, maxlen=380)
    x_test = sequence.pad_sequences(x_test_seq, maxlen=380)
    y_train = np.array(train_labels)
    y_test = np.array(test_labels)
    return x_train, y_train, x_test, y_test
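
`pad_sequences` makes every review exactly `maxlen` long. With its default settings it pads with zeros at the front and, for long sequences, keeps only the last `maxlen` items. The function below is a plain-Python sketch of that default behaviour for a single sequence. It is a hypothetical helper for illustration, not the Keras implementation, and it assumes `maxlen > 0`.

```python
def pad_or_truncate(seq, maxlen, value=0):
    """Mimic 'pre' padding and 'pre' truncation for one sequence."""
    seq = list(seq)[-maxlen:]  # keep at most the last maxlen items
    return [value] * (maxlen - len(seq)) + seq  # left-pad with `value`


print(pad_or_truncate([5, 8, 2], 5))              # [0, 0, 5, 8, 2]
print(pad_or_truncate([1, 2, 3, 4, 5, 6, 7], 5))  # [3, 4, 5, 6, 7]
```

With `maxlen=380`, short reviews therefore start with a run of zeros and long reviews lose their beginning rather than their end.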

In [5]:
x_train, y_train, x_test, y_test = preprocessing(train_texts, train_labels, test_texts, test_labels)

### LSTM model
Embedding + LSTM + FC1 + sigmoid

In [8]:
model = Sequential()
model.add(Embedding(3800, 32, input_length=380))
model.add(Dropout(0.2))
model.add(LSTM(32))
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(1, activation='sigmoid'))
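
As a sanity check on the model's size, the trainable parameters can be counted by hand. The sketch below assumes the standard formulas: an LSTM layer has four weight blocks (three gates plus the candidate), each with an input kernel, a recurrent kernel and a bias. The helper names are hypothetical, and the result should agree with `model.summary()` if the layer defaults match these assumptions.

```python
def embedding_params(vocab_size, dim):
    return vocab_size * dim


def lstm_params(input_dim, units):
    # 4 blocks: forget gate, input gate, candidate, output gate
    return 4 * ((input_dim + units) * units + units)


def dense_params(input_dim, units):
    return input_dim * units + units


layers = {
    "embedding": embedding_params(3800, 32),
    "lstm": lstm_params(32, 32),
    "dense_256": dense_params(32, 256),
    "dense_1": dense_params(256, 1),
}
total = sum(layers.values())
print(layers)
print(total)  # 138625
```

Most of the parameters sit in the embedding table (121,600 of 138,625); the recurrent layer itself is small.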

### Bi-LSTM model
Embedding + BiLSTM + Flatten + sigmoid

In [6]:
model = Sequential()
model.add(Embedding(3800, 32, input_length=380)) # max_features = 3800, embed_size = 32
model.add(Dropout(0.5))
model.add(Bidirectional(LSTM(32, return_sequences=True), merge_mode='concat'))
model.add(Dropout(0.5))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
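
Because `return_sequences=True` keeps the output of every time step and `merge_mode='concat'` joins the two directions, the tensor reaching `Flatten` is larger than in the plain LSTM model. The arithmetic below works through the shapes and parameter counts. It assumes the standard LSTM parameter formula, and the variable names are hypothetical.

```python
timesteps, embed_dim, units = 380, 32, 32

# One LSTM direction: 4 blocks of (input kernel + recurrent kernel + bias)
lstm_one_direction = 4 * ((embed_dim + units) * units + units)
bilstm_params = 2 * lstm_one_direction  # forward + backward

features_per_step = 2 * units               # concat of both directions
flattened = timesteps * features_per_step   # input size of the final Dense
dense_params = flattened * 1 + 1            # weights + bias of Dense(1)

total = 3800 * embed_dim + bilstm_params + dense_params
print(features_per_step, flattened)  # 64 24320
print(total)                         # 162561
```

The final `Dense(1)` layer therefore sees all 380 time steps at once, instead of only the last hidden state.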

### GRU model
Embedding + GRU + FC1 + sigmoid

In [6]:
model = Sequential()
model.add(Embedding(3800, 32, input_length=380))
model.add(Dropout(0.2))
model.add(GRU(32))
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(1, activation='sigmoid'))

### Adding EarlyStopping

In [7]:
es = EarlyStopping(monitor='val_acc', patience=5)
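
With `patience=5`, training stops once the monitored validation accuracy has failed to improve for five consecutive epochs. The function below is a simplified sketch of that rule, not the Keras source; the function name and the accuracy values are made up for illustration. In newer Keras versions the logged metric is named `val_accuracy` rather than `val_acc`, so the `monitor` string may need adjusting.

```python
def early_stopping_epoch(val_scores, patience=5):
    """Return the 1-based epoch at which training would stop."""
    best = float("-inf")
    wait = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best:  # a tie does not count as an improvement
            best = score
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return len(val_scores)  # patience never ran out


# Hypothetical validation accuracies, one per epoch
scores_per_epoch = [0.80, 0.85, 0.86, 0.85, 0.86, 0.85, 0.84, 0.83, 0.82, 0.81]
stopped_at = early_stopping_epoch(scores_per_epoch, patience=5)
print(stopped_at)  # 8
```

In this example the best score is first reached at epoch 3, so training stops five epochs later, at epoch 8.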

In [8]:
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=['accuracy'])
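
The `binary_crossentropy` loss averages -(y log p + (1 - y) log(1 - p)) over the samples, where y is the 0/1 label and p is the sigmoid output. The sketch below is a plain-Python version for illustration; clipping with `eps` is an assumption to avoid log(0), and the function is not the Keras implementation.

```python
import math


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # keep p strictly inside (0, 1)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


loss_uncertain = binary_crossentropy([1, 0], [0.5, 0.5])
loss_confident = binary_crossentropy([1, 0], [0.9, 0.1])
print(loss_uncertain, loss_confident)
```

A model that always predicts 0.5 scores log(2), about 0.693, which is a useful baseline when reading the test losses reported later.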

### Training the LSTM

In [11]:
batch_size = 256
epochs = 20
model.fit(x_train, y_train,
          validation_split=0.1,
          batch_size=batch_size,
          epochs=epochs,
          callbacks=[es],
          shuffle=True)

Train on 22500 samples, validate on 2500 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20


<keras.callbacks.History at 0x26fe567ac88>

### Training the Bi-LSTM
Restart the kernel before training this model.

In [9]:
batch_size = 256
epochs = 20
model.fit(x_train, y_train,
          validation_split=0.1,
          batch_size=batch_size,
          epochs=epochs,
          callbacks=[es],
          shuffle=True)

Train on 22500 samples, validate on 2500 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20


<keras.callbacks.History at 0x1a743334a20>

### Training the GRU
Restart the kernel before training this model.

In [9]:
batch_size = 256
epochs = 20
model.fit(x_train, y_train,
          validation_split=0.1,
          batch_size=batch_size,
          epochs=epochs,
          callbacks=[es],
          shuffle=True)

Train on 22500 samples, validate on 2500 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20


<keras.callbacks.History at 0x24a92a48ba8>

### Evaluating the models

In [12]:
scores = model.evaluate(x_test, y_test)



In [13]:
print('LSTM:test_loss: %f, accuracy: %f' % (scores[0], scores[1]))

LSTM:test_loss: 0.400232, accuracy: 0.862680


In [12]:
print('Bi-LSTM:test_loss: %f, accuracy: %f' % (scores[0], scores[1]))

Bi-LSTM:test_loss: 0.332111, accuracy: 0.870680


In [11]:
print('GRU:test_loss: %f, accuracy: %f' % (scores[0], scores[1]))

GRU:test_loss: 0.328848, accuracy: 0.869520
