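Task: build an LSTM model on text data to predict the next character of a sequence:
1. Preprocess the data, converting the character sequence into input suitable for an LSTM
1. Inspect the structure of the preprocessed data and split it into training and test sets
1. For the input string ("Only if you asked to see me, our meeting would be meaningful to me"), predict the character that follows each window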

Model structure: a single LSTM layer with 20 units; each sample uses the preceding 20 characters to predict the 21st
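
The windowing scheme can be illustrated without any model. The sketch below is only an illustration, assuming a window of 20 characters applied to a single copy of the sentence; the names `sentence`, `window` and `pairs` are hypothetical and do not appear in the notebook.

```python
# Illustrative sketch: pair each 20-character window with the character that follows it.
sentence = "Only if you asked to see me, our meeting would be meaningful to me"
window = 20  # assumed to match the notebook's time_step

# (context, target) pairs: context is sentence[i:i+window], target is the next character
pairs = [(sentence[i:i + window], sentence[i + window])
         for i in range(len(sentence) - window)]

for context, target in pairs[:3]:
    print(repr(context), "->", repr(target))
```

Each pair becomes one training sample: the context is the model input and the target is the label.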

In [1]:
# load the data
with open('LSTM_text.txt') as f:
    data = f.read()
# remove line breaks
data = data.replace('\n','').replace('\r','')
print(data)

Only if you asked to see me, our meeting would be meaningful to meOnly if you asked to see me, our meeting would be meaningful to meOnly if you asked to see me, our meeting would be meaningful to meOnly if you asked to see me, our meeting would be meaningful to meOnly if you asked to see me, our meeting would be meaningful to meOnly if you asked to see me, our meeting would be meaningful to meOnly if you asked to see me, our meeting would be meaningful to meOnly if you asked to see me, our meeting would be meaningful to meOnly if you asked to see me, our meeting would be meaningful to meOnly if you asked to see me, our meeting would be meaningful to meOnly if you asked to see me, our meeting would be meaningful to meOnly if you asked to see me, our meeting would be meaningful to meOnly if you asked to see me, our meeting would be meaningful to meOnly if you asked to see me, our meeting would be meaningful to meOnly if you asked to see me, our meeting would be meaningful to meOnly if yo

In [2]:
# deduplicate the characters (set order can differ between Python sessions)
letters = list(set(data))
print(letters)
num_letters = len(letters)
print(num_letters)

['y', 't', 'n', ',', 'f', 'l', 'O', 'b', 'r', 'g', 'i', 's', 'e', 'k', 'w', 'o', 'u', 'm', ' ', 'a', 'd']
21


In [3]:
# build the lookup dictionaries
# int to char
int_to_char = {i: ch for i, ch in enumerate(letters)}
print(int_to_char)
# char to int
char_to_int = {ch: i for i, ch in enumerate(letters)}
print(char_to_int)

{0: 'y', 1: 't', 2: 'n', 3: ',', 4: 'f', 5: 'l', 6: 'O', 7: 'b', 8: 'r', 9: 'g', 10: 'i', 11: 's', 12: 'e', 13: 'k', 14: 'w', 15: 'o', 16: 'u', 17: 'm', 18: ' ', 19: 'a', 20: 'd'}
{'y': 0, 't': 1, 'n': 2, ',': 3, 'f': 4, 'l': 5, 'O': 6, 'b': 7, 'r': 8, 'g': 9, 'i': 10, 's': 11, 'e': 12, 'k': 13, 'w': 14, 'o': 15, 'u': 16, 'm': 17, ' ': 18, 'a': 19, 'd': 20}
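
Because `set` iteration order for strings can change from one Python session to the next, the index assigned to each character may differ between runs, which matters if a trained model is saved and reloaded later. One possible variation, sketched below, sorts the characters first. The names `sample`, `vocab`, `c2i`, `i2c`, `encode` and `decode` are hypothetical, and `sample` stands in for the loaded text.

```python
# Sketch of a reproducible character vocabulary (an alternative, not the notebook's code).
sample = "Only if you asked to see me, our meeting would be meaningful to me"

vocab = sorted(set(sample))                  # sorted, so the order is the same on every run
c2i = {ch: i for i, ch in enumerate(vocab)}  # char -> int
i2c = {i: ch for i, ch in enumerate(vocab)}  # int -> char

def encode(text):
    """Map a string to a list of integer indices."""
    return [c2i[ch] for ch in text]

def decode(indices):
    """Map a sequence of integer indices back to a string."""
    return "".join(i2c[i] for i in indices)

print(len(vocab))
print(decode(encode("see me")))
```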


In [4]:
# time_step: number of preceding characters in each input window
time_step = 20

In [5]:
# batch preprocessing of character data
import numpy as np
from keras.utils import to_categorical

# extract (window, next character) pairs with a sliding window
def extract_data(data, slide):
    x = []
    y = []
    for i in range(len(data) - slide):
        x.append([a for a in data[i:i+slide]])
        y.append(data[i+slide])
    return x, y

# convert characters to integer indices in bulk
def char_to_int_Data(x, y, char_to_int):
    x_to_int = []
    y_to_int = []
    for i in range(len(x)):
        x_to_int.append([char_to_int[char] for char in x[i]])
        y_to_int.append([char_to_int[char] for char in y[i]])
    return x_to_int, y_to_int

# preprocess a whole text: takes the full string, the window size,
# the number of distinct characters and the char-to-int dictionary
def data_preprocessing(data, slide, num_letters, char_to_int):
    char_data = extract_data(data, slide)
    int_data = char_to_int_Data(char_data[0], char_data[1], char_to_int)
    Input = int_data[0]
    Output = list(np.array(int_data[1]).flatten())
    Input_RESHAPED = np.array(Input).reshape(len(Input), slide)
    # one-hot encode every position: (samples, slide, num_letters)
    new = np.zeros((Input_RESHAPED.shape[0], Input_RESHAPED.shape[1], num_letters), dtype=int)
    for i in range(Input_RESHAPED.shape[0]):
        for j in range(Input_RESHAPED.shape[1]):
            new[i, j, :] = to_categorical(Input_RESHAPED[i, j], num_classes=num_letters)
    return new, Output
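
The nested loop above calls `to_categorical` once per character position. The same one-hot tensor can be built with NumPy indexing alone. The sketch below is one possible vectorized variant, not the notebook's implementation; `one_hot_windows`, `mapping`, `X_demo` and `y_demo` are hypothetical names, and unlike `data_preprocessing` it returns the labels as an array rather than a list.

```python
import numpy as np

def one_hot_windows(text, slide, char_to_int):
    """Return (X, y): one-hot windows of shape (samples, slide, classes) and integer labels."""
    ids = np.array([char_to_int[ch] for ch in text])
    n_samples = len(ids) - slide
    # Row i of idx holds the positions i, i+1, ..., i+slide-1.
    idx = np.arange(n_samples)[:, None] + np.arange(slide)[None, :]
    # Indexing an identity matrix by class id yields the one-hot rows.
    X = np.eye(len(char_to_int), dtype=int)[ids[idx]]
    y = ids[slide:]  # the character following each window
    return X, y

text = "Only if you asked to see me, our meeting would be meaningful to me"
mapping = {ch: i for i, ch in enumerate(sorted(set(text)))}
X_demo, y_demo = one_hot_windows(text, 20, mapping)
print(X_demo.shape, y_demo.shape)
```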

In [6]:
# extract X and y from text data
X,y = data_preprocessing(data,time_step,num_letters,char_to_int)

In [7]:
print(X.shape)
print(len(y))

(6316, 20, 21)
6316
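
The first dimension follows from the sliding window: a text of N characters yields N - time_step samples. As a quick check, the sketch below assumes the file holds only the 66-character sentence repeated back to back. The figure of 96 copies is inferred from the printed shape and is not stated in the notebook.

```python
# Sanity check of the printed shape, under the assumption described above.
sentence = "Only if you asked to see me, our meeting would be meaningful to me"
time_step = 20
copies = 96  # assumption: inferred from the shape, not given in the notebook

text = sentence * copies
n_samples = len(text) - time_step  # one sample per window start
n_classes = len(set(text))         # number of distinct characters
print((n_samples, time_step, n_classes))
```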


In [8]:
# split the data
from sklearn.model_selection import  train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.1,random_state=10)
print(X_train.shape,len(y_train))

(5684, 20, 21) 5684
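
One point to keep in mind when reading the test accuracy: if the text is a single sentence repeated, as the printed data suggests, windows that land in the test set also occur in the training set, so the test score mostly reflects memorization. The sketch below counts distinct samples under that assumption; the 96 copies are inferred rather than given, and `samples` and `distinct` are hypothetical names.

```python
# Count how many distinct (window, target) samples a repeated sentence produces.
sentence = "Only if you asked to see me, our meeting would be meaningful to me"
time_step = 20
text = sentence * 96  # assumption: the file contains only repeated copies

samples = [(text[i:i + time_step], text[i + time_step])
           for i in range(len(text) - time_step)]
distinct = set(samples)
print(len(samples), len(distinct))
```

Since the text repeats every 66 characters, there can be at most 66 distinct samples, however the 6316 windows are split.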


In [9]:
y_train_category = to_categorical(y_train,num_letters)
print(y_train_category)

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 1.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 1. ... 0. 0. 0.]]


In [10]:
# set up the model
from keras.models import Sequential
from keras.layers import Dense,LSTM

model = Sequential()
# input_shape follows the samples: (time steps, number of characters)
model.add(LSTM(units=20,input_shape=(X_train.shape[1],X_train.shape[2]),activation="relu"))
# output layer: one unit per character class
model.add(Dense(units=num_letters ,activation="softmax"))

model.compile(optimizer="adam",loss="categorical_crossentropy",metrics=["accuracy"])
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 lstm (LSTM)                 (None, 20)                3360      
                                                                 
 dense (Dense)               (None, 21)                441       
                                                                 
Total params: 3,801
Trainable params: 3,801
Non-trainable params: 0
_________________________________________________________________
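
The parameter counts in the summary can be reproduced by hand. An LSTM layer holds four weight sets (input, forget and output gates plus the cell candidate), each with input weights, recurrent weights and a bias. The helper names `lstm_params` and `dense_params` below are hypothetical.

```python
def lstm_params(input_dim, units):
    """Four weight sets, each with input weights, recurrent weights and a bias."""
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    """Weight matrix plus one bias per output unit."""
    return input_dim * units + units

lstm = lstm_params(input_dim=21, units=20)    # 21 one-hot inputs, 20 LSTM units
dense = dense_params(input_dim=20, units=21)  # 20 LSTM outputs, 21 character classes
print(lstm, dense, lstm + dense)  # 3360 441 3801
```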


In [11]:
# train the model
model.fit(X_train,y_train_category,batch_size=100,epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x21a7eb42348>

In [12]:
# make prediction based on the training data
y_train_predict = np.argmax(model.predict(X_train),1)
print(y_train_predict)

[17 11 20 ...  5 12  2]


In [13]:
# transform the ints to letters
y_train_predict_char = [int_to_char[i] for i in y_train_predict]
print(y_train_predict_char)

['m', 's', 'd', ' ', 'b', ' ', ' ', ' ', 'g', 'e', 'e', 's', ' ', 'f', 'u', 'n', 'e', ' ', 'e', ' ', 'e', 't', 'k', ' ', ' ', ' ', 'm', 'u', 'e', 'f', ' ', ' ', 'l', 'a', 'e', 't', 'm', 'n', 't', 'e', 'g', 'r', 'e', 'e', 't', 't', 'm', 'e', 'e', 'y', 'm', 'e', 'l', 'm', 'e', 'm', ' ', 'n', ' ', 'k', 'i', ' ', 'k', ' ', 'u', ' ', ' ', 'e', 'i', 't', 'o', ' ', 'm', 'n', 't', 'u', ' ', 'i', 'u', 'u', ' ', 'b', ' ', 'd', 'l', ' ', 'm', ' ', 'i', 'd', 'e', 'l', 'e', 'o', 'n', 'e', 'o', 'u', ' ', 'a', ' ', 'm', ',', ' ', 'f', ' ', 'k', 'f', 'o', 'g', ' ', 'e', 'e', 'o', 'e', 'w', 'm', 'o', 'o', ' ', 'i', 'o', ' ', 'm', 'i', 'n', ' ', 't', 'n', 'e', ' ', 'm', 'a', ' ', 'g', ' ', 'e', 'n', 'i', 'e', ' ', ' ', 'l', 'a', 'n', ' ', 'm', 'n', 'u', ' ', ' ', ' ', 'f', 'd', 'u', 't', 'w', 'i', 'e', ' ', 'l', ' ', 'e', ' ', 'e', 'f', 's', 'e', 'm', 'e', ' ', 't', ' ', ' ', 't', 'n', 'e', ' ', 'e', ' ', 'l', ' ', 'y', 'e', ' ', 's', 'e', ' ', ' ', 'r', 'l', 'g', 's', 'y', 'i', 'g', 's', 's', ' ', 'n',




In [14]:
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_train,y_train_predict)
print(accuracy)

0.8963757916959887


In [15]:
y_test_predict = np.argmax(model.predict(X_test),1)
accuracy_test = accuracy_score(y_test,y_test_predict)
print(accuracy_test)
print(y_test_predict)
print(y_test)

0.875
[15 18 20 11 12 17 12 12  2 12 10 18 16 17 20 19 17  5 18 10 12 16 12  2
 12 12  1  4 10 18 12 15 15 12  2  3 18  9 12  5  1 19 12  3  9 12 18  2
 12 17 20 12 17 18 18 18  1 12  1 18 16  1 10 12 12  5 18 12 18 18 16 18
 12 12 18 15 17 18  0 18 18 16 16 18 12 18 15  4 12  2 17 18 18 10 18  0
 12 18 10  4 18  1  9  9 18  0  9 11 16 18  4 14 12 12 18 16 11 19 18 18
 19 18 18  1 18 13 18 10 10 18  2  2 12  3 15 18 18 18 15  8 17 11 12 12
  7 18 18  5 18 12  4 18  4 18 12 12 18 15 18 12 16  0 12  8  4 15  5  7
 20  2 12 18 17 10  4 15  5 19 10 13 18  9 20  9 18  8 12 17 11 18 17 10
  7  5 12  5 18 18  2 18 18 17 18 15 18  5 18 12 18 12  3 16 10 12 16 15
  3 18  9  5 11 18 10 12 16  3 17 18  8  1  0  2 12 12  9 15 12 15 18  5
 10  5 17  1  2 12 16 18  5  2 12 18 20 12 18 16  1 14 15 15 16 18 12 18
  5 17 12 18 20 16  2 18  9 18  8  3  4 18 12 12 17 20 10  9 18  1 12 10
 18 18  9 18 16  5  2 16 15 18  7 17 20 16 15 12 13 15  8 15  8 10 16 18
 11 11  3 18  2  2 12 12 12  9 18 10 12 12 17

In [16]:
new_letters = 'Only if you asked to see me, our meeting would be'
X_new, y_new = data_preprocessing(new_letters,time_step,num_letters,char_to_int)
y_new_predict = np.argmax(model.predict(X_new),1)
print(y_new_predict)

[18 11 12 12 18 17 12  3 18 15 16  8 18  7 12 12  1 10  2  9 18 14 15 16
  5 20 18 17 12]


In [17]:
y_new_predict_char = [int_to_char[i] for i in y_new_predict ]
print(y_new_predict_char)

[' ', 's', 'e', 'e', ' ', 'm', 'e', ',', ' ', 'o', 'u', 'r', ' ', 'b', 'e', 'e', 't', 'i', 'n', 'g', ' ', 'w', 'o', 'u', 'l', 'd', ' ', 'm', 'e']


In [18]:
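# note: the "- 20" bound prints only the first 9 of the 29 windows;
# use range(X_new.shape[0]) to print them all
for i in range(0,X_new.shape[0]-20):
    print(new_letters[i:i+20],'--predict next letter is --',y_new_predict_char[i])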

Only if you asked to --predict next letter is --  
nly if you asked to  --predict next letter is -- s
ly if you asked to s --predict next letter is -- e
y if you asked to se --predict next letter is -- e
 if you asked to see --predict next letter is --  
if you asked to see  --predict next letter is -- m
f you asked to see m --predict next letter is -- e
 you asked to see me --predict next letter is -- ,
you asked to see me, --predict next letter is --  
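
The loop above predicts one character per window taken from known text. To generate text, the same idea can be applied autoregressively: predict a character, append it, and slide the window forward. The sketch below is a minimal illustration rather than the notebook's method. `generate`, `predict_next` and `lookup_predictor` are hypothetical names, and a lookup table stands in for the trained model so that the block runs on its own.

```python
def generate(seed, n_chars, predict_next, window=20):
    """Greedy generation: repeatedly predict the next character from the last `window` characters."""
    text = seed
    for _ in range(n_chars):
        text += predict_next(text[-window:])
    return text

# Stand-in predictor (an assumption for this demo): look the window up in the repeated text.
sentence = "Only if you asked to see me, our meeting would be meaningful to me"
corpus = sentence * 3
table = {corpus[i:i + 20]: corpus[i + 20] for i in range(len(corpus) - 20)}

def lookup_predictor(window_text):
    return table[window_text]

generated = generate(sentence[:20], 8, lookup_predictor)
print(generated)
```

With the notebook's model, `predict_next` would one-hot encode the window, call `model.predict`, and map the argmax through `int_to_char`.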


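LSTM text generation in practice: summary
1. Built an LSTM model that generates characters from a text sequence
1. Learned how to load text and build the character dictionaries
1. Worked through the text preprocessing steps and the structure of the converted data
1. Predicted characters for new text data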