# Learning the Alphabet Order with an LSTM (Long Short-Term Memory) Recurrent Neural Network

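Many people who have read an explanation of how RNNs or LSTMs work still find it hard, at first, to see how a recurrent network learns from sequence data and how it is applied. In this article we develop and compare several LSTM models.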

![lstm-abc](https://pro.guidesocial.be/images/thumbs/580x387/arton24101.jpg?fct=1456434296)

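We will use deep learning to learn the order of the 26 letters of the English alphabet. That is, given one letter of the alphabet, the network should predict the letter that comes next.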

> ABCDEFGHIJKLMNOPQRSTUVWXYZ

> Example:

> Given J -> predict K

> Given X -> predict Y

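This is a simple sequence prediction problem. Once it is understood, it generalizes to other sequence prediction problems such as time series forecasting and sequence classification.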

![lstm-many-to-one](https://i.stack.imgur.com/QCnpU.jpg)

## Model 1. An LSTM that learns a one-character to one-character mapping

### STEP1. Import Keras and related modules

In [1]:
import numpy
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.utils import np_utils
from keras.preprocessing.sequence import pad_sequences

# Fix the NumPy random seed so that runs are as reproducible as possible
numpy.random.seed(7)

Using TensorFlow backend.


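### STEP2. Prepare the data

We can now define our dataset, the alphabet. For readability, we define the alphabet in uppercase letters.

Neural networks work on numbers, so each letter of the alphabet must be mapped to an integer. We can do this easily by building a dictionary that maps each character to its index.
We also build a reverse lookup to convert predictions back into characters later.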

In [2]:
# Define the raw sequence dataset
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

# Map characters to integers (0 - 25) and build the reverse lookup dictionary
char_to_int = dict((c, i) for i, c in enumerate(alphabet))
int_to_char = dict((i, c) for i, c in enumerate(alphabet))

In [3]:
# Print the mappings to inspect them
print("Letter to integer index: \n", char_to_int)
print("\n")

print("Integer index to letter: \n", int_to_char)

Letter to integer index: 
 {'O': 14, 'T': 19, 'M': 12, 'Y': 24, 'E': 4, 'I': 8, 'P': 15, 'X': 23, 'R': 17, 'V': 21, 'W': 22, 'S': 18, 'H': 7, 'Z': 25, 'D': 3, 'J': 9, 'B': 1, 'A': 0, 'Q': 16, 'C': 2, 'G': 6, 'L': 11, 'K': 10, 'F': 5, 'U': 20, 'N': 13}


Integer index to letter: 
 {0: 'A', 1: 'B', 2: 'C', 3: 'D', 4: 'E', 5: 'F', 6: 'G', 7: 'H', 8: 'I', 9: 'J', 10: 'K', 11: 'L', 12: 'M', 13: 'N', 14: 'O', 15: 'P', 16: 'Q', 17: 'R', 18: 'S', 19: 'T', 20: 'U', 21: 'V', 22: 'W', 23: 'X', 24: 'Y', 25: 'Z'}
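
As a quick sanity check of the two dictionaries, the sketch below encodes a short string into integer codes and decodes it back. The helper names `encode` and `decode` are hypothetical and not part of the original notebook; the dictionaries are rebuilt here only so that the block runs on its own.

```python
# Rebuild the two lookup tables from the notebook so this block is self-contained
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
char_to_int = {c: i for i, c in enumerate(alphabet)}
int_to_char = {i: c for i, c in enumerate(alphabet)}


def encode(text):
    """Hypothetical helper: map each uppercase letter to its integer code."""
    return [char_to_int[ch] for ch in text]


def decode(codes):
    """Hypothetical helper: map integer codes back to letters."""
    return "".join(int_to_char[i] for i in codes)


codes = encode("LSTM")
print(codes)          # integer codes for L, S, T, M
print(decode(codes))  # round trip back to the original string
```

If the round trip does not return the original string, the two dictionaries are out of sync.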


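### STEP3. Prepare the training data

Now we need to create the input (X) and output (y) pairs used to train the network. We do this by choosing an input sequence length and then reading sequences of that length from the alphabet.
For example, with an input length of 1 and starting from the beginning of the raw data, we read the first letter "A" as the input and the next letter "B" as the prediction target. We then move along one character and repeat until we reach the target "Z".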

In [4]:
# Build the input/output pairs as integer codes
seq_length = 1
dataX = []
dataY = []
for i in range(0, len(alphabet) - seq_length, 1):
    seq_in = alphabet[i:i + seq_length]
    seq_out = alphabet[i + seq_length]
    dataX.append([char_to_int[char] for char in seq_in])
    dataY.append(char_to_int[seq_out])
    print(seq_in, '->', seq_out)

A -> B
B -> C
C -> D
D -> E
E -> F
F -> G
G -> H
H -> I
I -> J
J -> K
K -> L
L -> M
M -> N
N -> O
O -> P
P -> Q
Q -> R
R -> S
S -> T
T -> U
U -> V
V -> W
W -> X
X -> Y
Y -> Z
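
The same windowing logic can be written as a small reusable function. This is a minimal sketch, assuming a hypothetical helper named `make_windows` that is not in the original notebook; it returns the raw character pairs rather than integer codes so that the result is easy to read.

```python
def make_windows(sequence, seq_length):
    """Hypothetical helper: return (input_window, next_item) pairs."""
    pairs = []
    for i in range(len(sequence) - seq_length):
        pairs.append((sequence[i:i + seq_length], sequence[i + seq_length]))
    return pairs


alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
for seq_length in (1, 3):
    pairs = make_windows(alphabet, seq_length)
    # A sequence of length N yields N - seq_length training pairs
    print(seq_length, len(pairs), pairs[0], pairs[-1])
```

With 26 letters, a window of length 1 gives 25 pairs and a window of length 3 gives 23 pairs.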


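### STEP4. Preprocess the data
The NumPy array must be reshaped into the format an LSTM layer expects: (samples, time_steps, features).
We also normalize the integer codes so that the values fall between 0 and 1, and one-hot encode the labels.


> ABCDEFGHIJKLMNOPQRSTUVWXYZ

> Example:

> Given J -> predict K

> Given X -> predict Y


Target training tensor shape: (samples, time_steps, features) -> (n, **1**, **1**)

Note that each character here becomes a "feature" vector with 1 element, inside 1 time step.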

In [5]:
# Reshape X to (samples, time_steps, features)
X = numpy.reshape(dataX, (len(dataX), seq_length, 1))

# Normalize
X = X / float(len(alphabet))

# One-hot encode the output variable
y = np_utils.to_categorical(dataY)

print("X shape: ", X.shape) # (25 samples, 1 time step, 1 feature)
print("y shape: ", y.shape)

X shape:  (25, 1, 1)
y shape:  (25, 26)
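
To make the two preprocessing steps concrete, here is a plain-Python sketch of what they do to a single pair (J -> K). The helpers `normalize` and `one_hot` are illustrative stand-ins written for this example; the notebook itself relies on NumPy division and `np_utils.to_categorical`.

```python
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
num_classes = len(alphabet)


def normalize(code):
    """Scale an integer code 0..25 into [0, 1) by dividing by the alphabet size."""
    return code / float(num_classes)


def one_hot(index, size):
    """Return a list of zeros with a single 1.0 at position `index`."""
    vector = [0.0] * size
    vector[index] = 1.0
    return vector


# The pair J -> K: input code 9, target code 10
x_value = normalize(alphabet.index("J"))
y_vector = one_hot(alphabet.index("K"), num_classes)

print(round(x_value, 4))    # 9 / 26
print(y_vector.index(1.0))  # position of the single 1.0
print(len(y_vector))        # one slot per letter
```

The label matrix has 26 columns even though "A" never appears as a target, because the largest label is 25 (for "Z").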


### STEP5. Build the model

In [6]:
# Create the model
model = Sequential()
model.add(LSTM(32, input_shape=(X.shape[1], X.shape[2])))
model.add(Dense(y.shape[1], activation='softmax'))
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_1 (LSTM)                (None, 32)                4352      
_________________________________________________________________
dense_1 (Dense)              (None, 26)                858       
Total params: 5,210
Trainable params: 5,210
Non-trainable params: 0
_________________________________________________________________
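
The parameter counts in the summary can be checked by hand. A standard LSTM layer has four weight sets (input gate, forget gate, cell candidate, output gate), each with an input kernel, a recurrent kernel and a bias, which gives 4 * (units * input_dim + units * units + units) parameters. A Dense layer has inputs * outputs + outputs. The function names below are illustrative only.

```python
def lstm_params(units, input_dim):
    """Weights of a standard LSTM layer: 4 gates, each with
    an input kernel, a recurrent kernel and a bias vector."""
    return 4 * (units * input_dim + units * units + units)


def dense_params(inputs, outputs):
    """Weights of a fully connected layer: kernel plus bias."""
    return inputs * outputs + outputs


lstm = lstm_params(units=32, input_dim=1)
dense = dense_params(inputs=32, outputs=26)
print(lstm, dense, lstm + dense)
```

The number of time steps does not appear in the formula, because the same weights are reused at every step. Only the number of features per step changes the LSTM's size.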


### STEP6. Compile and train the model

In [7]:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X, y, epochs=500, batch_size=1, verbose=2)

Epoch 1/500
 - 2s - loss: 3.2660 - acc: 0.0000e+00
Epoch 2/500
 - 0s - loss: 3.2582 - acc: 0.0000e+00
Epoch 3/500
 - 0s - loss: 3.2551 - acc: 0.0400
Epoch 4/500
 - 0s - loss: 3.2524 - acc: 0.0400
Epoch 5/500
 - 0s - loss: 3.2495 - acc: 0.0400
Epoch 6/500
 - 0s - loss: 3.2470 - acc: 0.0400
Epoch 7/500
 - 0s - loss: 3.2440 - acc: 0.0400
Epoch 8/500
 - 0s - loss: 3.2411 - acc: 0.0400
Epoch 9/500
 - 0s - loss: 3.2378 - acc: 0.0400
Epoch 10/500
 - 0s - loss: 3.2348 - acc: 0.0400
...


Epoch 490/500
 - 0s - loss: 1.7054 - acc: 0.7200
Epoch 491/500
 - 0s - loss: 1.7041 - acc: 0.7600
Epoch 492/500
 - 0s - loss: 1.7029 - acc: 0.8800
Epoch 493/500
 - 0s - loss: 1.7021 - acc: 0.7600
Epoch 494/500
 - 0s - loss: 1.7024 - acc: 0.8800
Epoch 495/500
 - 0s - loss: 1.6992 - acc: 0.7600
Epoch 496/500
 - 0s - loss: 1.7001 - acc: 0.8000
Epoch 497/500
 - 0s - loss: 1.6995 - acc: 0.6800
Epoch 498/500
 - 0s - loss: 1.6994 - acc: 0.7600
Epoch 499/500
 - 0s - loss: 1.7001 - acc: 0.8000
Epoch 500/500
 - 0s - loss: 1.6963 - acc: 0.8400


<keras.callbacks.History at 0x2145c550>
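
The numbers in the first few epochs have a simple origin. A softmax over 26 classes that has learned nothing assigns roughly 1/26 to each letter, and the categorical cross-entropy of such a prediction is ln(26). The sketch below computes that baseline, along with the accuracy step contributed by a single correct sample out of 25. It only illustrates the arithmetic and does not reproduce the Keras internals.

```python
import math

num_classes = 26
num_samples = 25

# A model that knows nothing spreads probability evenly over all classes
uniform_probability = 1.0 / num_classes

# Categorical cross-entropy for one sample is -log(probability of the true class)
baseline_loss = -math.log(uniform_probability)
print(round(baseline_loss, 4))

# Each correctly classified sample adds this much to the reported accuracy
accuracy_step = 1.0 / num_samples
print(accuracy_step)
```

A starting loss close to this baseline is what one would expect from an untrained classifier, so the log above begins where it should.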

### STEP7. Evaluate model accuracy

In [8]:
# Evaluate the model's performance (on the training data)
scores = model.evaluate(X, y, verbose=0)
print("Model Accuracy: %.2f%%" % (scores[1]*100))

Model Accuracy: 92.00%


### STEP8. Predictions

In [9]:
# Demonstrate the model's predictions
for pattern in dataX:
    # Feed each of the 25 input letters (A to Y) to the model, one at a time
    x = numpy.reshape(pattern, (1, len(pattern), 1))
    x = x / float(len(alphabet))

    prediction = model.predict(x, verbose=0)
    index = numpy.argmax(prediction) # index with the highest probability
    result = int_to_char[index] # the letter the model predicts
    seq_in = [int_to_char[value] for value in pattern]
    print(seq_in, "->", result) # print the result

['A'] -> B
['B'] -> C
['C'] -> D
['D'] -> E
['E'] -> F
['F'] -> G
['G'] -> H
['H'] -> I
['I'] -> J
['J'] -> K
['K'] -> L
['L'] -> M
['M'] -> N
['N'] -> O
['O'] -> P
['P'] -> Q
['Q'] -> R
['R'] -> S
['S'] -> T
['T'] -> U
['U'] -> V
['V'] -> W
['W'] -> Y
['X'] -> Z
['Y'] -> Z
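
The reported accuracy can be reproduced directly from the predictions printed above. The sketch below hard-codes the predicted letters from this run, compares them with the true next letters, and lists the mistakes. The variable names are illustrative.

```python
# Inputs A..Y and the letters the notebook run printed for them
inputs = "ABCDEFGHIJKLMNOPQRSTUVWXY"
predicted = "BCDEFGHIJKLMNOPQRSTUVWYZZ"

alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
# The correct target is simply the next letter of the alphabet
expected = "".join(alphabet[alphabet.index(ch) + 1] for ch in inputs)

# Collect (input, predicted, expected) for every wrong prediction
mistakes = [(i, p, e) for i, p, e in zip(inputs, predicted, expected) if p != e]
accuracy = (len(inputs) - len(mistakes)) / len(inputs)

print(mistakes)
print("Accuracy: %.2f%%" % (accuracy * 100))
```

Two wrong predictions out of 25 (for W and X) give 23/25, which matches the 92.00% reported by `model.evaluate`.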


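We can see that this "sequence prediction" problem is indeed hard for the network to learn: after 500 epochs it still gets W and X wrong.
The reason is that the LSTM units in the example above have no context to work with (there is only "1" time step). Each input-output pattern is presented to the network in random (shuffled) order, and a Keras LSTM that is not stateful resets its internal state after every batch, which here means after every single pattern (batch_size=1).

Next, let's try giving the LSTM more sequence information to learn from.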

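## Model 2. An LSTM that learns a three-character feature window to one-character mapping


### STEP1. Prepare the training data

In [10]:
# Build the input/output pairs as integer codes
seq_length = 3 # this time each input holds 3 characters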
dataX = []
dataY = []
for i in range(0, len(alphabet) - seq_length, 1):
    seq_in = alphabet[i:i + seq_length] # 3 characters
    seq_out = alphabet[i + seq_length]
    dataX.append([char_to_int[char] for char in seq_in])
    dataY.append(char_to_int[seq_out])
    print(seq_in, '->', seq_out)

ABC -> D
BCD -> E
CDE -> F
DEF -> G
EFG -> H
FGH -> I
GHI -> J
HIJ -> K
IJK -> L
JKL -> M
KLM -> N
LMN -> O
MNO -> P
NOP -> Q
OPQ -> R
PQR -> S
QRS -> T
RST -> U
STU -> V
TUV -> W
UVW -> X
VWX -> Y
WXY -> Z


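### STEP2. Preprocess the data


> ABCDEFGHIJKLMNOPQRSTUVWXYZ

> Example:

> Given HIJ -> predict K

> Given EFG -> predict H

Target training tensor shape: (samples, time_steps, features) -> (n, **1**, **3**)

Note that the three characters here become a single "feature" vector with 3 elements. So when the training set is prepared, each training sample has only "1" time step, which holds a "features" vector made of "3" characters.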

In [11]:
# Reshape X to (samples, time_steps, features)
X = numpy.reshape(dataX, (len(dataX), 1, seq_length))  # <-- note: 1 time step, seq_length features

# Normalize
X = X / float(len(alphabet))

# One-hot encode the y values
y = np_utils.to_categorical(dataY)

print("X shape: ", X.shape)
print("y shape: ", y.shape)

X shape:  (23, 1, 3)
y shape:  (23, 26)
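
The difference between a feature window and a time-step window is easiest to see on a single sample. The sketch below lays out the window "ABC" both ways using nested lists. `shape_of` is a hypothetical helper that mimics NumPy's `.shape` for regular nested lists.

```python
def shape_of(nested):
    """Hypothetical helper: shape of a regular nested list, like numpy's .shape."""
    shape = []
    while isinstance(nested, list):
        shape.append(len(nested))
        nested = nested[0]
    return tuple(shape)


window = [0, 1, 2]  # integer codes for "ABC"

# Feature-window layout: one time step holding three features
as_features = [[window]]                       # (1 sample, 1 step, 3 features)

# Time-step layout: three time steps holding one feature each
as_time_steps = [[[code] for code in window]]  # (1 sample, 3 steps, 1 feature)

print(shape_of(as_features), as_features)
print(shape_of(as_time_steps), as_time_steps)
```

With the first layout the LSTM runs its recurrence once per sample. With the second it runs three times and carries its state from one character to the next.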


### STEP3. Build the model

In [12]:
# Create the model
model = Sequential()
model.add(LSTM(32, input_shape=(X.shape[1], X.shape[2]))) # <-- note: input_shape is (1, 3)
model.add(Dense(y.shape[1], activation='softmax'))
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_2 (LSTM)                (None, 32)                4608      
_________________________________________________________________
dense_2 (Dense)              (None, 26)                858       
Total params: 5,466
Trainable params: 5,466
Non-trainable params: 0
_________________________________________________________________


### STEP4. Compile and train the model

In [13]:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X, y, epochs=500, batch_size=1, verbose=2)

Epoch 1/500
 - 2s - loss: 3.2753 - acc: 0.0435
Epoch 2/500
 - 0s - loss: 3.2629 - acc: 0.0000e+00
Epoch 3/500
 - 0s - loss: 3.2556 - acc: 0.0000e+00
Epoch 4/500
 - 0s - loss: 3.2487 - acc: 0.0435
Epoch 5/500
 - 0s - loss: 3.2420 - acc: 0.0435
Epoch 6/500
 - 0s - loss: 3.2355 - acc: 0.0435
Epoch 7/500
 - 0s - loss: 3.2294 - acc: 0.0435
Epoch 8/500
 - 0s - loss: 3.2214 - acc: 0.0435
Epoch 9/500
 - 0s - loss: 3.2142 - acc: 0.0435
Epoch 10/500
 - 0s - loss: 3.2056 - acc: 0.0435
...


Epoch 490/500
 - 0s - loss: 1.6114 - acc: 0.7826
Epoch 491/500
 - 0s - loss: 1.6089 - acc: 0.7826
Epoch 492/500
 - 0s - loss: 1.6108 - acc: 0.8261
Epoch 493/500
 - 0s - loss: 1.6091 - acc: 0.7826
Epoch 494/500
 - 0s - loss: 1.6057 - acc: 0.7826
Epoch 495/500
 - 0s - loss: 1.6060 - acc: 0.7826
Epoch 496/500
 - 0s - loss: 1.6058 - acc: 0.8261
Epoch 497/500
 - 0s - loss: 1.6045 - acc: 0.7826
Epoch 498/500
 - 0s - loss: 1.6042 - acc: 0.7826
Epoch 499/500
 - 0s - loss: 1.6006 - acc: 0.8696
Epoch 500/500
 - 0s - loss: 1.6011 - acc: 0.7826


<keras.callbacks.History at 0x235b8d68>

### STEP5. Evaluate model accuracy

In [14]:
# Evaluate the model's performance (on the training data)
scores = model.evaluate(X, y, verbose=0)
print("Model Accuracy: %.2f%%" % (scores[1]*100))

Model Accuracy: 82.61%


### STEP6. Predictions

In [15]:
# Show the model's predictions
for pattern in dataX:
    x = numpy.reshape(pattern, (1, 1, len(pattern)))
    x = x / float(len(alphabet))
    prediction = model.predict(x, verbose=0)
    index = numpy.argmax(prediction)
    result = int_to_char[index]
    seq_in = [int_to_char[value] for value in pattern]
    print(seq_in, "->", result)

['A', 'B', 'C'] -> D
['B', 'C', 'D'] -> E
['C', 'D', 'E'] -> F
['D', 'E', 'F'] -> G
['E', 'F', 'G'] -> H
['F', 'G', 'H'] -> I
['G', 'H', 'I'] -> J
['H', 'I', 'J'] -> K
['I', 'J', 'K'] -> L
['J', 'K', 'L'] -> M
['K', 'L', 'M'] -> N
['L', 'M', 'N'] -> O
['M', 'N', 'O'] -> P
['N', 'O', 'P'] -> Q
['O', 'P', 'Q'] -> R
['P', 'Q', 'R'] -> S
['Q', 'R', 'S'] -> T
['R', 'S', 'T'] -> U
['S', 'T', 'U'] -> W
['T', 'U', 'V'] -> X
['U', 'V', 'W'] -> Z
['V', 'W', 'X'] -> Z
['W', 'X', 'Y'] -> Z


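We can see that "Model #2" does not outperform "Model #1" in this run: its accuracy is 82.61% versus 92.00%, and it gets four of the last five patterns wrong. Even with the window method, the LSTM still fails to learn the correct letter order for this simple problem.

The example above is also a poor tensor layout that misuses the LSTM. The letters in a window are really successive "time steps" of a single feature, not separate features within a single time step. We gave the network more context, but not more sequential context.

In the next example, we provide the extra context in the form of "time steps".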

## Model 3. An LSTM that learns a three-character time-step window to one-character mapping

### STEP1. Prepare the training data

In [16]:
seq_length = 3
dataX = []
dataY = []
for i in range(0, len(alphabet) - seq_length, 1):
    seq_in = alphabet[i:i + seq_length]
    seq_out = alphabet[i + seq_length]
    dataX.append([char_to_int[char] for char in seq_in])
    dataY.append(char_to_int[seq_out])
    print(seq_in, '->', seq_out)

ABC -> D
BCD -> E
CDE -> F
DEF -> G
EFG -> H
FGH -> I
GHI -> J
HIJ -> K
IJK -> L
JKL -> M
KLM -> N
LMN -> O
MNO -> P
NOP -> Q
OPQ -> R
PQR -> S
QRS -> T
RST -> U
STU -> V
TUV -> W
UVW -> X
VWX -> Y
WXY -> Z


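### STEP2. Preprocess the data


> ABCDEFGHIJKLMNOPQRSTUVWXYZ

> Example:

> Given HIJ -> predict K

> Given EFG -> predict H

Target training tensor shape: (samples, time_steps, features) -> (n, **3**, **1**)

When the training set is prepared, the tensor must be laid out so that each training sample has "3" time steps, each holding a "features" vector with "1" character.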

In [17]:
# Reshape X to (samples, time_steps, features)
X = numpy.reshape(dataX, (len(dataX), seq_length, 1))  # <-- note: seq_length time steps, 1 feature

# Normalize
X = X / float(len(alphabet))

# One-hot encode the y values
y = np_utils.to_categorical(dataY)

### STEP3. Build the model

In [18]:
# Create the model
model = Sequential()
model.add(LSTM(32, input_shape=(X.shape[1], X.shape[2]))) # <-- note: input_shape is (3, 1)
model.add(Dense(y.shape[1], activation='softmax'))

model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_3 (LSTM)                (None, 32)                4352      
_________________________________________________________________
dense_3 (Dense)              (None, 26)                858       
Total params: 5,210
Trainable params: 5,210
Non-trainable params: 0
_________________________________________________________________


### STEP4. Compile and train the model

In [19]:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X, y, epochs=500, batch_size=1, verbose=2)

Epoch 1/500
 - 2s - loss: 3.2632 - acc: 0.0000e+00
Epoch 2/500
 - 0s - loss: 3.2500 - acc: 0.0000e+00
Epoch 3/500
 - 0s - loss: 3.2425 - acc: 0.0435
Epoch 4/500
 - 0s - loss: 3.2357 - acc: 0.0000e+00
Epoch 5/500
 - 0s - loss: 3.2285 - acc: 0.0000e+00
Epoch 6/500
 - 0s - loss: 3.2207 - acc: 0.0435
Epoch 7/500
 - 0s - loss: 3.2125 - acc: 0.0435
Epoch 8/500
 - 0s - loss: 3.2036 - acc: 0.0435
Epoch 9/500
 - 0s - loss: 3.1919 - acc: 0.0435
Epoch 10/500
 - 0s - loss: 3.1821 - acc: 0.0435
...


Epoch 490/500
 - 0s - loss: 0.2457 - acc: 1.0000
Epoch 491/500
 - 0s - loss: 0.2387 - acc: 1.0000
Epoch 492/500
 - 0s - loss: 0.2394 - acc: 1.0000
Epoch 493/500
 - 0s - loss: 0.2384 - acc: 1.0000
Epoch 494/500
 - 0s - loss: 0.2416 - acc: 1.0000
Epoch 495/500
 - 0s - loss: 0.2385 - acc: 1.0000
Epoch 496/500
 - 0s - loss: 0.2380 - acc: 1.0000
Epoch 497/500
 - 0s - loss: 0.2331 - acc: 1.0000
Epoch 498/500
 - 0s - loss: 0.2341 - acc: 1.0000
Epoch 499/500
 - 0s - loss: 0.2371 - acc: 1.0000
Epoch 500/500
 - 0s - loss: 0.2325 - acc: 1.0000


<keras.callbacks.History at 0x261e7ac8>

### STEP5. Evaluate model accuracy

In [20]:
# Evaluate the model's performance (on the training data)
scores = model.evaluate(X, y, verbose=0)
print("Model Accuracy: %.2f%%" % (scores[1]*100))

Model Accuracy: 100.00%


### STEP6. Predictions

In [21]:
# Convert each 3-character input to a tensor of shape (1, 3, 1) for inference
for pattern in dataX:
    x = numpy.reshape(pattern, (1, len(pattern), 1))
    x = x / float(len(alphabet))
    prediction = model.predict(x, verbose=0)
    index = numpy.argmax(prediction)
    result = int_to_char[index]
    seq_in = [int_to_char[value] for value in pattern]
    print(seq_in, "->", result)

['A', 'B', 'C'] -> D
['B', 'C', 'D'] -> E
['C', 'D', 'E'] -> F
['D', 'E', 'F'] -> G
['E', 'F', 'G'] -> H
['F', 'G', 'H'] -> I
['G', 'H', 'I'] -> J
['H', 'I', 'J'] -> K
['I', 'J', 'K'] -> L
['J', 'K', 'L'] -> M
['K', 'L', 'M'] -> N
['L', 'M', 'N'] -> O
['M', 'N', 'O'] -> P
['N', 'O', 'P'] -> Q
['O', 'P', 'Q'] -> R
['P', 'Q', 'R'] -> S
['Q', 'R', 'S'] -> T
['R', 'S', 'T'] -> U
['S', 'T', 'U'] -> V
['T', 'U', 'V'] -> W
['U', 'V', 'W'] -> X
['V', 'W', 'X'] -> Y
['W', 'X', 'Y'] -> Z


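Judging by the performance of "Model #3", once the extra context is provided in the form of "time steps", the recurrent network can make full use of it when learning sequence data.

"Model #3" reaches 100% prediction accuracy when evaluated on its training data (on this very simple task of predicting the order of the 26 letters)!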

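## Model 4. An LSTM that learns a variable-length character input to single-character output mapping

Let's build a model that accepts a "variable-length" letter sequence as input and predicts the next letter.

### STEP1. Prepare the training data

To keep things simple and to speed up training, we define a maximum input sequence length (say "5", meaning an input sequence can be 1 to 5 characters long).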

In [22]:
# Prepare the training data
num_inputs = 1000
max_len = 5 # maximum sequence length
dataX = []
dataY = []
for i in range(num_inputs):
    start = numpy.random.randint(len(alphabet)-2)
    end = numpy.random.randint(start, min(start+max_len,len(alphabet)-1))
    sequence_in = alphabet[start:end+1]
    sequence_out = alphabet[end + 1]
    dataX.append([char_to_int[char] for char in sequence_in])
    dataY.append(char_to_int[sequence_out])
    print(sequence_in, '->', sequence_out)

UVWXY -> Z
UVWXY -> Z
EFGH -> I
GHIJ -> K
EFGH -> I
DEF -> G
CDEFG -> H
OP -> Q
X -> Y
TU -> V
IJK -> L
LMNOP -> Q
T -> U
UVWX -> Y
X -> Y
H -> I
EFGHI -> J
QRSTU -> V
EFG -> H
RSTU -> V
QRST -> U
QR -> S
JK -> L
GHI -> J
KL -> M
BCDE -> F
AB -> C


KLMNO -> P
UVWXY -> Z
EFGH -> I
FG -> H
DEF -> G
STU -> V
FGHI -> J
OP -> Q
FGHIJ -> K
LMNOP -> Q
DEF -> G
W -> X
KLMN -> O
WXY -> Z
PQRST -> U
LMNOP -> Q
PQ -> R
FGHI -> J
QRS -> T
CDEFG -> H
VW -> X
DEF -> G
...
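
The bounds used when sampling are easy to get wrong, so it can help to verify them. The sketch below repeats the sampling with Python's standard `random` module instead of NumPy (an assumption made only to keep the block dependency-free) and checks that every input has 1 to 5 letters and that every target is the letter following the input. The seed value is arbitrary.

```python
import random

alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
max_len = 5
rng = random.Random(7)  # local generator; the seed value is arbitrary

samples = []
for _ in range(1000):
    # Same bounds as the notebook cell: start in 0..23, end in start..min(start+4, 24)
    start = rng.randrange(len(alphabet) - 2)
    end = rng.randrange(start, min(start + max_len, len(alphabet) - 1))
    samples.append((alphabet[start:end + 1], alphabet[end + 1]))

# Every input must be 1..max_len letters long
assert all(1 <= len(seq_in) <= max_len for seq_in, _ in samples)
# Every target must be the letter right after the last input letter
assert all(alphabet.index(out) == alphabet.index(seq_in[-1]) + 1
           for seq_in, out in samples)

lengths = sorted({len(seq_in) for seq_in, _ in samples})
print(lengths)  # distinct input lengths that were generated
```

Because `end` is capped at index 24, the slice `alphabet[end + 1]` never runs past "Z".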

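### STEP2. Preprocess the data
Because the input sequence length varies between 1 and max_len, shorter sequences must be padded with "0". Here we use Keras's built-in pad_sequences() function, which pads on the left (prefix padding) by default.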

In [23]:
# Convert the training data to an array, padding the sequences where needed
X = pad_sequences(dataX, maxlen=max_len, dtype='float32') # <-- note: pads with 0 on the left by default

# Reshape X to (samples, time_steps, features)
X = numpy.reshape(X, (X.shape[0], max_len, 1)) # <-- note: max_len time steps, 1 feature

# Normalize
X = X / float(len(alphabet))

# One-hot encode the y values
y = np_utils.to_categorical(dataY)
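
To show what left padding does to the integer codes, here is a plain-Python stand-in. `pad_left` is a hypothetical helper written for this illustration. It mirrors the default pre-padding behaviour of `pad_sequences` for a single sequence and is not the Keras implementation.

```python
def pad_left(codes, maxlen, value=0):
    """Hypothetical stand-in for pre-padding: fill on the left up to `maxlen`,
    keeping only the last `maxlen` items if the input is longer."""
    codes = list(codes)[-maxlen:]
    return [value] * (maxlen - len(codes)) + codes


alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
char_to_int = {c: i for i, c in enumerate(alphabet)}

for text in ("B", "AB", "KLM", "UVWXY"):
    padded = pad_left([char_to_int[ch] for ch in text], maxlen=5)
    print(text.rjust(5), "->", padded)
```

One thing to keep in mind: "A" is encoded as 0, the same value used for padding, so "B" and "AB" produce identical padded inputs. That is harmless here because both have the target "C", but in other problems it may be safer to reserve 0 for padding, for example by starting the character codes at 1.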

### STEP3. Build the model

In [24]:
# Create the model
batch_size = 1
model = Sequential()
model.add(LSTM(32, input_shape=(X.shape[1], 1))) # <-- note: input_shape is (max_len, 1)
model.add(Dense(y.shape[1], activation='softmax'))

model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_4 (LSTM)                (None, 32)                4352      
_________________________________________________________________
dense_4 (Dense)              (None, 26)                858       
Total params: 5,210
Trainable params: 5,210
Non-trainable params: 0
_________________________________________________________________


### STEP4. Compile and train the model

In [25]:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X, y, epochs=500, batch_size=batch_size, verbose=2)

Epoch 1/500
 - 7s - loss: 3.0984 - acc: 0.0770
Epoch 2/500
 - 6s - loss: 2.8499 - acc: 0.1180
Epoch 3/500
 - 7s - loss: 2.5159 - acc: 0.1990
Epoch 4/500
 - 6s - loss: 2.2225 - acc: 0.2440
Epoch 5/500
 - 6s - loss: 2.0490 - acc: 0.2880
Epoch 6/500
 - 5s - loss: 1.9284 - acc: 0.3050
Epoch 7/500
 - 5s - loss: 1.8142 - acc: 0.3530
Epoch 8/500
 - 6s - loss: 1.7206 - acc: 0.3940
Epoch 9/500
 - 6s - loss: 1.6397 - acc: 0.4070
Epoch 10/500
 - 6s - loss: 1.5607 - acc: 0.4340
Epoch 11/500
 - 6s - loss: 1.4909 - acc: 0.4880
Epoch 12/500
 - 5s - loss: 1.4180 - acc: 0.5140
Epoch 13/500
 - 6s - loss: 1.3613 - acc: 0.5570
Epoch 14/500
 - 6s - loss: 1.3055 - acc: 0.5570
Epoch 15/500
 - 6s - loss: 1.2513 - acc: 0.6100
Epoch 16/500
 - 6s - loss: 1.2058 - acc: 0.6080
Epoch 17/500
 - 6s - loss: 1.1530 - acc: 0.6150
Epoch 18/500
 - 6s - loss: 1.1126 - acc: 0.6290
Epoch 19/500
 - 6s - loss: 1.0732 - acc: 0.6700
Epoch 20/500
 - 6s - loss: 1.0427 - acc: 0.6540
Epoch 21/500
 - 6s - loss: 1.0019 - acc: 0.6760
...

Epoch 171/500
 - 6s - loss: 0.2559 - acc: 0.9190
Epoch 172/500
 - 6s - loss: 0.2532 - acc: 0.9160
Epoch 173/500
 - 6s - loss: 0.2594 - acc: 0.9160
Epoch 174/500
 - 6s - loss: 0.2829 - acc: 0.9110
Epoch 175/500
 - 6s - loss: 0.3482 - acc: 0.9040
Epoch 176/500
 - 6s - loss: 0.2502 - acc: 0.9180
Epoch 177/500
 - 6s - loss: 0.2543 - acc: 0.9200
Epoch 178/500
 - 6s - loss: 0.2520 - acc: 0.9210
Epoch 179/500
 - 6s - loss: 0.2763 - acc: 0.9080
Epoch 180/500
 - 6s - loss: 0.2489 - acc: 0.9210
Epoch 181/500
 - 6s - loss: 0.2510 - acc: 0.9180
Epoch 182/500
 - 5s - loss: 0.3519 - acc: 0.8980
Epoch 183/500
 - 6s - loss: 0.2408 - acc: 0.9260
Epoch 184/500
 - 6s - loss: 0.2476 - acc: 0.9230
Epoch 185/500
 - 6s - loss: 0.3063 - acc: 0.9120
Epoch 186/500
 - 6s - loss: 0.2374 - acc: 0.9310
Epoch 187/500
 - 5s - loss: 0.2415 - acc: 0.9260
Epoch 188/500
 - 6s - loss: 0.2417 - acc: 0.9210
Epoch 189/500
 - 5s - loss: 0.3521 - acc: 0.8820
Epoch 190/500
 - 5s - loss: 0.2310 - acc: 0.9420
...

Epoch 339/500
 - 6s - loss: 0.2042 - acc: 0.9430
Epoch 340/500
 - 5s - loss: 0.1551 - acc: 0.9720
Epoch 341/500
 - 6s - loss: 0.1292 - acc: 0.9690
Epoch 342/500
 - 5s - loss: 0.1323 - acc: 0.9710
Epoch 343/500
 - 6s - loss: 0.1337 - acc: 0.9670
Epoch 344/500
 - 6s - loss: 0.1340 - acc: 0.9660
Epoch 345/500
 - 6s - loss: 0.2425 - acc: 0.9430
Epoch 346/500
 - 6s - loss: 0.1271 - acc: 0.9730
Epoch 347/500
 - 6s - loss: 0.1274 - acc: 0.9640
Epoch 348/500
 - 6s - loss: 0.1324 - acc: 0.9650
Epoch 349/500
 - 6s - loss: 0.1311 - acc: 0.9670
Epoch 350/500
 - 6s - loss: 0.1328 - acc: 0.9690
Epoch 351/500
 - 6s - loss: 0.2240 - acc: 0.9470
Epoch 352/500
 - 6s - loss: 0.1246 - acc: 0.9760
Epoch 353/500
 - 6s - loss: 0.1273 - acc: 0.9760
Epoch 354/500
 - 6s - loss: 0.1277 - acc: 0.9720
Epoch 355/500
 - 6s - loss: 0.1281 - acc: 0.9690
Epoch 356/500
 - 6s - loss: 0.1302 - acc: 0.9680
Epoch 357/500
 - 6s - loss: 0.1302 - acc: 0.9680
...

<keras.callbacks.History at 0x261918d0>

### STEP5. Evaluate model accuracy

In [26]:
# Evaluate the model's performance (on the training data)
scores = model.evaluate(X, y, verbose=0)
print("Model Accuracy: %.2f%%" % (scores[1]*100))

Model Accuracy: 98.50%


### STEP6. Predictions

In [27]:
# Pad each 1-5 character input to a tensor of shape (1, 5, 1) for inference
for i in range(20):
    pattern_index = numpy.random.randint(len(dataX))
    pattern = dataX[pattern_index]
    x = pad_sequences([pattern], maxlen=max_len, dtype='float32')
    x = numpy.reshape(x, (1, max_len, 1))
    x = x / float(len(alphabet))
    prediction = model.predict(x, verbose=0)
    index = numpy.argmax(prediction)
    result = int_to_char[index]
    seq_in = [int_to_char[value] for value in pattern]
    print(seq_in, "->", result)

['T', 'U', 'V', 'W', 'X'] -> Y
['B', 'C'] -> D
['G', 'H', 'I'] -> J
['Q', 'R', 'S', 'T'] -> U
['D', 'E', 'F'] -> G
['I', 'J', 'K'] -> L
['G', 'H', 'I'] -> J
['K', 'L', 'M'] -> N
['N'] -> O
['A', 'B', 'C', 'D', 'E'] -> F
['X'] -> Y
['A', 'B', 'C', 'D'] -> E
['V'] -> W
['Q', 'R', 'S', 'T', 'U'] -> V
['B', 'C', 'D', 'E', 'F'] -> G
['R'] -> S
['W'] -> X
['A'] -> B
['E', 'F', 'G'] -> H
['T'] -> U
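
Each prediction loop above keeps only the single most likely letter (`numpy.argmax`). When debugging a model it can be useful to look at the runner-up letters as well. The sketch below does this in plain Python. The probability vector is made up for illustration and was not produced by the trained model, and `top_k` is a hypothetical helper.

```python
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

# Made-up softmax output for illustration only (not taken from the trained model):
# most of the probability mass on 'K', a little on its neighbours
probabilities = [0.0] * len(alphabet)
probabilities[alphabet.index("J")] = 0.10
probabilities[alphabet.index("K")] = 0.75
probabilities[alphabet.index("L")] = 0.15


def top_k(probs, k):
    """Hypothetical helper: the k most likely letters with their probabilities."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(alphabet[i], probs[i]) for i in ranked[:k]]


# Plain argmax: index of the largest probability, as numpy.argmax would return
best_index = max(range(len(probabilities)), key=lambda i: probabilities[i])
print(alphabet[best_index])
print(top_k(probabilities, 3))
```

When the second-ranked letter has a probability close to the first, the model is unsure about that input even if the argmax happens to be correct.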


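We can see that although the model did not learn the alphabet order perfectly from the randomly generated sequences (98.50% accuracy on its training data), it performs very well. If needed, the model could be tuned further, for example with more training epochs, a larger network, or both.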

### References:
* Jason Brownlee - "[Understanding Stateful LSTM Recurrent Neural Networks in Python with Keras](https://machinelearningmastery.com/understanding-stateful-lstm-recurrent-neural-networks-python-keras/)"

* Keras documentation - [Recurrent Layer](https://keras.io/layers/recurrent/)