# LSTM

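One problem with RNNs is that when a sequence is very long, the error signal weakens at every step as it is propagated backward, so the network can fail to capture relationships between distant tokens. Consider the following two sentences: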

- The young woman went to the movies with her friends.

- The young woman, having found a free ticket on the ground, went to the movies.

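In the first sentence the subject and the main verb sit next to each other, so the relationship is easy to capture. In the second sentence a clause separates the subject from the verb, so the network may well fail to connect them.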

⚠️ What is the impact of failing to capture this relationship?

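LSTM addresses this problem by introducing the concept of a `state`, which can be viewed as memory. Through training, the memory learns what to remember, while the rest of the network learns how to use what is remembered together with the input data to make predictions.

With this memory, the network can capture longer-range dependencies.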

⚠️ hard to think.

Besides making predictions, an LSTM can also be used to generate text.

The LSTM network is shown below:

<img src="img/lstm.png" alt="drawing" width="500"/>

As the figure shows, this is an unrolled RNN plus a memory state.

Next, we look at each LSTM layer.






### LSTM Layer

#### Input

- input instance of current time step
    - 300-element vector
- output from previous time step
    - 50-element vector
- concatenation: the two input vectors are joined into a single 350-element vector.

<img src="img/lstm_layer_input.png" alt="drawing" width="700"/>

The input passes through 3 gates. Each gate is a feed-forward network layer, and its weights determine how much information can go through to the cell's state (memory).

- forget gate
- input/candidate gate (2 branches)
- update/output gate

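Parameter count analysis:
- Each neuron in a gate connects to the 350-element vector plus 1 bias, for 351 weights in total.
- Each gate has 50 neurons, so it has 351 * 50 = 17550 parameters.
- There are 3 gates, but the candidate gate has two branches with the same number of parameters each, so we can count 4 gates in total: 17550 * 4 = 70200 parameters.
- Output layer: the LSTM output is 400 * 50 (each step outputs a 50-element "thought vector", and there are 400 steps). Flattened, that has length 20000; plus one bias, the output layer has 20001 parameters.
- In total: 70200 + 20001 = 90201 parameters.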
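The arithmetic above can be checked with a few lines of Python. This is only a sketch of the counting rule described in the list; the helper names `lstm_param_count` and `dense_param_count` are made up for illustration and are not part of Keras.

```python
def lstm_param_count(input_dim, units):
    """Count LSTM weights: 4 weight sets, each neuron sees input + previous output + 1 bias."""
    weights_per_neuron = input_dim + units + 1
    return 4 * units * weights_per_neuron


def dense_param_count(input_len, units=1):
    """Count weights of a fully connected layer: one weight per input plus one bias per neuron."""
    return (input_len + 1) * units


lstm_params = lstm_param_count(input_dim=300, units=50)
dense_params = dense_param_count(input_len=400 * 50)  # flattened 400 steps x 50 values
total_params = lstm_params + dense_params

print(lstm_params, dense_params, total_params)
```

These numbers can be compared against the `Param #` column of `model.summary()` for the model built later in these notes.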

<img src="img/forget_gate.png" alt="drawing" width="500"/>

Note that for the first token, the 50-element vector from step t-1 is filled with zeros.

### 1. Forget Gate

The goal is to learn how much of the cell's memory to erase. Knowing what to forget is as important as knowing what to remember.

The forget gate is itself a feed-forward network:
- n neurons
- m + n + 1 weights for each neuron (300 + 50 + 1)
- activation function: sigmoid
- output: 0 ~ 1

<img src="img/forget_gate_weights.png" alt="drawing" width="700"/>

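The output of the forget gate acts like a mask: a value close to 1 means a high pass-through rate, so the memory is kept; a value close to 0 means a low pass-through rate, so the memory is erased. This "mask" is then multiplied element-wise with the memory vector to update the memory, as shown in the figure below. This is how the forget gate forgets things: forgetting means updating the memory so that some dimensions carry less information.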

<img src="img/forget_gate_calculation.png" alt="drawing" width="500"/>
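The forget step described above can be sketched in NumPy. This is a minimal illustration, not the Keras implementation: the weights `W_f` and `b_f` are random stand-ins for what training would learn, and the input and memory vectors are random placeholders with the sizes used in these notes (300 and 50).

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


rng = np.random.default_rng(0)
m, n = 300, 50                      # input size and number of neurons

x_t = rng.normal(size=m)            # token vector at step t (random stand-in)
h_prev = np.zeros(n)                # output of step t-1; zeros for the first token
memory = rng.normal(size=n)         # cell memory (random stand-in)

# Hypothetical forget-gate parameters: n neurons, each with m + n weights and 1 bias.
W_f = rng.normal(scale=0.1, size=(n, m + n))
b_f = np.zeros(n)

concat = np.concatenate([x_t, h_prev])        # 350-element vector
forget_mask = sigmoid(W_f @ concat + b_f)     # one value in (0, 1) per memory slot
memory_after_forget = forget_mask * memory    # element-wise multiplication

print(concat.shape, forget_mask.shape, memory_after_forget.shape)
```

Because every mask value lies between 0 and 1, each memory slot can only shrink in magnitude at this stage; adding new information is left to the candidate gate.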

### 2. Candidate gate

goal: how much to augment the memory based on:
- concatenated input
    - input of step t
    - output of step t-1
    
As shown in the figure below, the candidate gate has 2 branches:
1. Decide which input vector elements are worth remembering.
    - Similar to the forget gate: sigmoid activation, output between 0 and 1.
2. Route the remembered input elements to the right memory slot.
    - What value are you going to update the memory with?
    - activation: tanh
    - output between -1 and 1

<img src="img/candidate_gate.png" alt="drawing" width="500"/>

Output:
- The two vectors are then multiplied element-wise.

Finally, this output is added element-wise to the memory that was already updated by the forget gate, which is how new things are remembered.
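The two branches and the final addition can be sketched the same way. As before, this is an illustration under assumptions: `W_i`, `b_i`, `W_c`, and `b_c` are hypothetical random parameters, and `memory` stands in for the memory vector that has already passed through the forget gate.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


rng = np.random.default_rng(1)
m, n = 300, 50

concat = rng.normal(size=m + n)     # concatenated input (random stand-in)
memory = rng.normal(size=n)         # memory after the forget step (random stand-in)

# Branch 1: which elements are worth remembering (sigmoid, 0 to 1).
W_i = rng.normal(scale=0.1, size=(n, m + n))
b_i = np.zeros(n)
# Branch 2: which values to write into the memory (tanh, -1 to 1).
W_c = rng.normal(scale=0.1, size=(n, m + n))
b_c = np.zeros(n)

remember_mask = sigmoid(W_i @ concat + b_i)
candidate_values = np.tanh(W_c @ concat + b_c)

candidate = remember_mask * candidate_values   # element-wise multiplication
new_memory = memory + candidate                # element-wise addition

print(candidate.shape, new_memory.shape)
```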


### 3. Output/Update gate

flow 1 (gate): 
- Input_1: concatenated input
- n neurons
- activation function: sigmoid
- output_1: n-dimensional output between 0 and 1

flow 2 (mask):
- input_2: updated memory vector
- tanh function applied elementwise
- output_2: n-dimensional vector (values between -1 and 1)

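Note that the tanh function is applied directly here, with no neurons (that is, no weights), so this flow can be called a mask but not a gate.

Then output_1 is multiplied element-wise with output_2.
- This produces a new n-dimensional vector (the step's official output), which is
    - passed on to step t+1
    - used as the layer's output

The whole process is shown in the figure below.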

<img src="img/update_gate.png" alt="drawing" width="700"/>
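The two flows of the output gate can also be written out in NumPy. This is again a sketch under assumptions: `W_o` and `b_o` are hypothetical random parameters, and `new_memory` is a random placeholder for the memory vector after the forget and candidate steps.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def output_gate(concat, new_memory, W_o, b_o):
    """Combine the learned gate (flow 1) with the weight-free tanh mask (flow 2)."""
    gate = sigmoid(W_o @ concat + b_o)   # flow 1: n values between 0 and 1
    mask = np.tanh(new_memory)           # flow 2: no weights, values between -1 and 1
    return gate * mask                   # element-wise multiplication


rng = np.random.default_rng(2)
m, n = 300, 50

concat = rng.normal(size=m + n)          # concatenated input (random stand-in)
new_memory = rng.normal(size=n)          # updated memory vector (random stand-in)
W_o = rng.normal(scale=0.1, size=(n, m + n))
b_o = np.zeros(n)

step_output = output_gate(concat, new_memory, W_o, b_o)
print(step_output.shape)
```

The resulting vector plays both roles listed above: it is the layer's output for step t and the "previous output" that gets concatenated with the input at step t+1.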

⚠️ There is a bug in the figure.


## 1. LSTM for sentiment analysis

### 1. Load and preprocess the IMDB data

The preprocessing steps are almost the same as for the RNN, so we put the preprocessing functions in a `utils.py` file and call them from there.

In [1]:
import numpy as np
from utils import pre_process_data, tokenize_and_vectorize, collect_expected, pad_trunc

import os
os.environ['KMP_DUPLICATE_LIB_OK'] = 'True' # keep the notebook kernel from exiting during execution

In [2]:
imdb_datasets = '/Users/chenwang/Workspace/datasets/IMDB/aclImdb/train'

In [3]:
dataset = pre_process_data(imdb_datasets)

In [4]:
vectorized_data = tokenize_and_vectorize(dataset)
expected = collect_expected(dataset)

In [5]:
split_point = int(len(vectorized_data)*.8)

x_train = vectorized_data[:split_point]
y_train = expected[:split_point]
x_test = vectorized_data[split_point:]
y_test = expected[split_point:]

In [6]:
maxlen = 400
batch_size = 32         # How many samples to show the net before backpropagating the error and updating the weights
embedding_dims = 300    # Length of the token vectors we will create for passing into the LSTM
epochs = 2


In [7]:
x_train = pad_trunc(x_train, maxlen)
x_test = pad_trunc(x_test, maxlen)

x_train = np.reshape(x_train, (len(x_train), maxlen, embedding_dims))  # 20000 * 400 * 300
y_train = np.array(y_train)

x_test = np.reshape(x_test, (len(x_test), maxlen, embedding_dims))
y_test = np.array(y_test)

### 2. Build a keras LSTM network

In [8]:
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten, LSTM

Using TensorFlow backend.


In [9]:
num_neurons = 50

print('Build model...')
model = Sequential()

model.add(LSTM(num_neurons, return_sequences=True, input_shape=(maxlen, embedding_dims)))
model.add(Dropout(.2))

model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))

model.compile('rmsprop', 'binary_crossentropy',  metrics=['accuracy'])
print(model.summary())

Build model...
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_1 (LSTM)                (None, 400, 50)           70200     
_________________________________________________________________
dropout_1 (Dropout)          (None, 400, 50)           0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 20000)             0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 20001     
Total params: 90,201
Trainable params: 90,201
Non-trainable params: 0
_________________________________________________________________
None


### 3. Fit your LSTM model

In [10]:
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          validation_data=(x_test, y_test))

Train on 20000 samples, validate on 5000 samples
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x1a304f7550>

The training results show that validation accuracy improves substantially, and that training time increases substantially as well.

The beauty of the algorithm is that it learns the relationships of the tokens it sees.

### 4. Prediction

We use some positive words in a review that expresses a negative opinion, to see how the model behaves.

In [12]:
sample_1 = "I'm hate that the dismal weather that had me down for so long, when will it break! Ugh, when does happiness return?  The sun is blinding and the puffy clouds are too thin.  I can't wait for the weekend."

# We pass a dummy value in the first element of the tuple just because our helper expects it from the way we processed the initial data. That value won't ever see the network, so it can be whatever.
vec_list = tokenize_and_vectorize([(1, sample_1)])

# Tokenize returns a list of the data (length 1 here)
test_vec_list = pad_trunc(vec_list, maxlen)

test_vec = np.reshape(test_vec_list, (len(test_vec_list), maxlen, embedding_dims))

print("Sample's sentiment, 1 - pos, 0 - neg : {}".format(model.predict_classes(test_vec)))
print("Raw output of sigmoid function: {}".format(model.predict(test_vec)))

Sample's sentiment, 1 - pos, 0 - neg : [[0]]
Raw output of sigmoid function: [[0.19533193]]


#### Error analysis

Try to think statistically:
- Are the words in the misclassified example rare? 
- Are they rare in your corpus or the corpus that trained the language model for your embedding? 
- Do all of the words in the example exist in your model’s vocabulary?

Going through this process of examining the probabilities and input data associated
with incorrect predictions helps build your machine learning intuition so you
can build better NLP pipelines in the future. This is backpropagation through the
human brain for the problem of model tuning.

### 5. Save and reload models

In [13]:
model_structure = model.to_json()
with open("lstm_model1.json", "w") as json_file:
    json_file.write(model_structure)

model.save_weights("lstm_weights1.h5")
print('Model saved.')

Model saved.


In [None]:
# from keras.models import model_from_json
# with open("lstm_model1.json", "r") as json_file:
#     json_string = json_file.read()
# model = model_from_json(json_string)

# model.load_weights('lstm_weights1.h5')

## 2. Dirty Data

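In an NLP pipeline, there is some dirty data that we have to deal with.

### 2.1 Padding

Dirty data here refers to the padding we added, which actually damages the integrity of the data.

Because we are doing classification, we need a fixed-length vector (thought vector) at the end, which is why we pad.

**But is 400 the best value for padding?**

Let's look at the average length of the samples in the dataset.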

In [14]:
def test_len(data, maxlen):
    total_len = truncated = exact = padded = 0
    for sample in data:
        total_len += len(sample)
        if len(sample) > maxlen:
            truncated += 1
        elif len(sample) < maxlen:
            padded += 1
        else:
            exact +=1 
    print('Padded: {}'.format(padded))
    print('Equal: {}'.format(exact))
    print('Truncated: {}'.format(truncated))
    print('Avg length: {}'.format(total_len/len(data)))


In [16]:
dataset = pre_process_data(imdb_datasets)
vectorized_data = tokenize_and_vectorize(dataset)

test_len(vectorized_data, 400)

Padded: 22560
Equal: 12
Truncated: 2428
Avg length: 202.43204


As the output shows, the average document length is 202 tokens, so a setting of 400 may introduce too much dirty data. Next we try 200.

#### Train a smaller model

In [17]:
expected = collect_expected(dataset)

In [18]:
split_point = int(len(vectorized_data)*.8)

x_train = vectorized_data[:split_point]
y_train = expected[:split_point]
x_test = vectorized_data[split_point:]
y_test = expected[split_point:]

In [19]:
maxlen = 200
batch_size = 32         # How many samples to show the net before backpropagating the error and updating the weights
embedding_dims = 300    # Length of the token vectors we will create for passing into the LSTM
epochs = 2

In [20]:
x_train = pad_trunc(x_train, maxlen)
x_test = pad_trunc(x_test, maxlen)

x_train = np.reshape(x_train, (len(x_train), maxlen, embedding_dims))  # 20000 * 200 * 300
y_train = np.array(y_train)

x_test = np.reshape(x_test, (len(x_test), maxlen, embedding_dims))
y_test = np.array(y_test)

In [21]:
x_train.shape

(20000, 200, 300)

In [22]:
num_neurons = 50

print('Build model...')
model = Sequential()

model.add(LSTM(num_neurons, return_sequences=True, input_shape=(maxlen, embedding_dims)))
model.add(Dropout(.2))

model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))

model.compile('rmsprop', 'binary_crossentropy',  metrics=['accuracy'])
print(model.summary())

Build model...
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_2 (LSTM)                (None, 200, 50)           70200     
_________________________________________________________________
dropout_2 (Dropout)          (None, 200, 50)           0         
_________________________________________________________________
flatten_2 (Flatten)          (None, 10000)             0         
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 10001     
Total params: 80,201
Trainable params: 80,201
Non-trainable params: 0
_________________________________________________________________
None


In [23]:
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          validation_data=(x_test, y_test))

Train on 20000 samples, validate on 5000 samples
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x1fff054940>

Save this smaller model.

In [24]:
model_structure = model.to_json()
with open("lstm_model_200.json", "w") as json_file:
    json_file.write(model_structure)

model.save_weights("lstm_weights_200.h5")
print('Model saved.')

Model saved.


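From the above we can see:
- The network has fewer parameters than before.
- The network unrolls 200 times instead of 400, so it trains faster.
- Validation accuracy increases to 0.86, which suggests that removing dirty data helps accuracy somewhat.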



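### 2.2 Unknown tokens

Besides padding, there is another issue: we drop unknown words. When looking up word vectors, any word that is not in the lexicon is discarded. Sometimes this distorts the meaning, for example:

I dont like this movie.

The word "dont" is an informal contraction, so it may not be in the lexicon. If it is dropped, the sentence becomes:

I like this movie.

This changes the meaning of the sentence.

Below we introduce two common approaches. Both look for a vector representation to substitute for the word that is not in the dictionary.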

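#### 1. Replace with a randomly chosen word

This sounds counter-intuitive, because after a random replacement even a human may not be able to understand the sentence. In practice it is not a real problem: our goal is generalization, so one or two pieces of dirty data have little effect on the model.

#### 2. Replace with UNK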
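A minimal sketch of the UNK approach, assuming the word vectors live in a plain Python dictionary. The toy vocabulary, the 4-dimensional vectors, and the `vectorize` helper are all hypothetical; the `tokenize_and_vectorize` function used earlier in these notes is not shown here and may work differently.

```python
import numpy as np

rng = np.random.default_rng(42)
dims = 4  # tiny vectors for illustration; the notes use 300 dimensions

# Toy lexicon: "dont" is deliberately missing.
word_vectors = {word: rng.normal(size=dims) for word in ["i", "like", "this", "movie"]}

# One fixed vector shared by every out-of-vocabulary token (here chosen at random).
unk_vector = rng.normal(size=dims)


def vectorize(tokens, word_vectors, unk_vector):
    """Look up each token, falling back to the UNK vector instead of dropping it."""
    return [word_vectors.get(token, unk_vector) for token in tokens]


tokens = "i dont like this movie".split()
vectors = vectorize(tokens, word_vectors, unk_vector)

print(len(tokens), len(vectors))
```

The sequence keeps its original length, so the network at least sees that something stood between "i" and "like", even though it cannot tell what the word was.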

## 3. Character-level LSTM

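Words are not the smallest units that carry meaning. Sometimes we need to look for smaller building blocks, such as stems, phonemes, etc.

Below we try to build a character-level model with an LSTM. In this model, every punctuation mark is treated as a character.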


### 3.1 load data

In [25]:
dataset = pre_process_data(imdb_datasets)

In [40]:
expected = collect_expected(dataset)

Next we look at how many characters each document has on average. The result is 1325.

**Looking at the data is important when choosing parameters.**

In [26]:
def avg_len(data):
    total_len = 0
    for sample in data:
        total_len += len(sample[1])
    return total_len/len(data)

In [29]:
avg_len(dataset)

1325.06964


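As shown above, the network has to unroll more than 1000 times, so training may take longer.

### 3.2 Data cleaning

The difference between characters and tokens is that the character lexicon contains only 26 letters, the digits, and a limited set of special symbols, so the dimension of the input vector is much smaller than in the token-based model. Below we use a function to identify valid characters; any character outside this set is replaced with UNK. **Note that UNK is treated as a single character here.** Note also that the `VALID` string in the code lists only the digits 1 to 9, so `0` is mapped to UNK as well.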

In [31]:
def clean_data(data):
    """ Shift to lower case, replace unknowns with UNK, and listify """
    new_data = []
    VALID = 'abcdefghijklmnopqrstuvwxyz123456789"\'?!.,:; '
    for sample in data:
        new_sample = []
        for char in sample[1].lower():  # Just grab the string, not the label
            if char in VALID:
                new_sample.append(char)
            else:
                new_sample.append('UNK')
       
        new_data.append(new_sample)
    return new_data


In [32]:
listified_data = clean_data(dataset)

### 3.3 Padding

As before, we pad: documents shorter than `maxlen` are filled up with the special token "PAD".

In [33]:
def char_pad_trunc(data, maxlen):
    """ We truncate to maxlen or add in PAD tokens """
    new_dataset = []
    for sample in data:
        if len(sample) > maxlen:
            new_data = sample[:maxlen]
        elif len(sample) < maxlen:
            pads = maxlen - len(sample)
            new_data = sample + ['PAD'] * pads
        else:
            new_data = sample
        new_dataset.append(new_data)
    return new_dataset



In [37]:
maxlen = 1500

common_length_data = char_pad_trunc(listified_data, maxlen)

In [46]:
len(common_length_data[0])  # check the length after padding

1500

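### 3.4 Build the dictionaries

Earlier we used word2vec, which acts as a dictionary that maps words to vectors. Now we build a dictionary with a similar function by hand.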

In [34]:
def create_dicts(data):
    """ Modified from Keras LSTM example"""
    chars = set()
    for sample in data:
        chars.update(set(sample))
    char_indices = dict((c, i) for i, c in enumerate(chars))
    indices_char = dict((i, c) for i, c in enumerate(chars))
    return char_indices, indices_char

In [38]:
char_indices, indices_char = create_dicts(common_length_data)


### 3.5 One-hot encoding

Next, we use the dictionary to create the input vectors.

In [35]:
import numpy as np

def onehot_encode(dataset, char_indices, maxlen):
    """ 
    One hot encode the tokens
    
    Args:
        dataset  list of lists of tokens
        char_indices  dictionary of {key=character, value=index to use encoding vector}
        maxlen  int  Length of each sample
    Return:
        np array of shape (samples, tokens, encoding length)
    """
    X = np.zeros((len(dataset), maxlen, len(char_indices.keys())))
    for i, sentence in enumerate(dataset):
        for t, char in enumerate(sentence):
            X[i, t, char_indices[char]] = 1
    return X

In [39]:
encoded_data = onehot_encode(common_length_data, char_indices, maxlen)

### 3.6 split dataset into training / testing

In [41]:
split_point = int(len(encoded_data)*.8)

x_train = encoded_data[:split_point]
y_train = expected[:split_point]
x_test = encoded_data[split_point:]
y_test = expected[split_point:]

### 3.7 define network

In [49]:
from keras.models import Sequential
from keras.layers import Dense, Dropout, Embedding, Flatten, LSTM

num_neurons = 20

print('Build model...')
model = Sequential()

model.add(LSTM(num_neurons, return_sequences=True, input_shape=(maxlen, len(char_indices.keys()))))
model.add(Dropout(.4))

model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))

model.compile('rmsprop', 'binary_crossentropy',  metrics=['accuracy'])
print(model.summary())

Build model...
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_4 (LSTM)                (None, 1500, 20)          5360      
_________________________________________________________________
dropout_4 (Dropout)          (None, 1500, 20)          0         
_________________________________________________________________
flatten_4 (Flatten)          (None, 30000)             0         
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 30001     
Total params: 35,361
Trainable params: 35,361
Non-trainable params: 0
_________________________________________________________________
None


### 3.8 train network

In [50]:
batch_size = 32
epochs = 5

In [51]:
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          validation_data=(x_test, y_test))

Train on 20000 samples, validate on 5000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x1a41bce0f0>

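#### Result analysis

After 5 epochs, training accuracy is 74%, but validation accuracy is only about 58%.
After 10 epochs, training accuracy can exceed 90%, but validation accuracy is still only a little above half, which indicates overfitting.


⚠️ Something is off with this result.

One reason is that the model is too complex for this relatively small dataset, so it does not generalize well. Possible remedies:
- Increase the dropout percentage, but do not go above 50%.
- Use fewer neurons in each layer.
- Provide more data, which is expensive to get.

⚠️ **Question**: The character-level model is both slow and inaccurate, so why do we introduce it at all?

Training a model like this requires a larger training set, and the IMDB training set we use here is too small. With a much larger dataset, a character-level model can reach good accuracy.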



## 4. Extensions 

### 4.1 Other kinds of memory

Other kinds of memory differ slightly in the operations performed in the gates.

#### GRU

More efficient: fewer parameters.

```python
from keras.models import Sequential
from keras.layers import GRU

model = Sequential()
model.add(GRU(num_neurons, return_sequences=True, input_shape=X[0].shape))
```
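To put a rough number on "fewer parameters": a classic GRU has three weight sets (update gate, reset gate, candidate) where the LSTM has four. The sketch below applies the same counting rule used for the LSTM earlier in these notes, assuming one bias per neuron. The helper `recurrent_param_count` is made up for illustration, and some framework implementations keep an extra set of biases, so a model summary may report a slightly larger GRU figure.

```python
def recurrent_param_count(input_dim, units, weight_sets):
    """Parameters of a gated recurrent layer, assuming one bias per neuron."""
    return weight_sets * units * (input_dim + units + 1)


# Same sizes as the word-level model in these notes: 300-dim input, 50 neurons.
lstm_params = recurrent_param_count(300, 50, weight_sets=4)
gru_params = recurrent_param_count(300, 50, weight_sets=3)

print(lstm_params, gru_params, gru_params / lstm_params)
```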

#### peephole connections

`Learning Precise Timing with LSTM Recurrent Networks`

The difference is that the gate input is now a combination of three signals:
- input at time t
- output at time t-1
- memory state

This works fairly well for time series data.

### 4.2 Going deeper

Stack multiple LSTM layers. Note that the first and intermediate layers need `return_sequences=True`.

<img src="img/stacked_lstm.png" alt="drawing" width="450"/>



```python
from keras.models import Sequential
from keras.layers import LSTM

model = Sequential()
model.add(LSTM(num_neurons, return_sequences=True, input_shape=X[0].shape))
model.add(LSTM(num_neurons_2, return_sequences=True))
```

## 5. Summary

- Remembering information with memory units enables more accurate and general models of the sequence.
- It’s important to forget information that is no longer relevant.
- Only some new information needs to be retained for the upcoming input, and LSTMs can be trained to find it.
- If you can predict what comes next, you can generate novel text from probabilities.
- Character-based models can more efficiently and successfully learn from small, focused corpora than word-based models.
- LSTM thought vectors capture much more than just the sum of the words in a statement.