In [1]:
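# Keras_IMDb_MLP_Model reaches a prediction accuracy of about 80%.
# This model tries a larger dictionary size and a longer number-list length to improve accuracy:
# the dictionary size is increased to 3800 words and the number-list length to 380.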

In [2]:
# Run the data preprocessing notebook
%run 'Keras_IMDb_Data_Preprocessing_Large.ipynb'

read train files: 25000
read test files: 25000


Using TensorFlow backend.


In [3]:
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten, Embedding

In [4]:
model = Sequential()

In [5]:
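# The Embedding layer can only be used as the first layer of the model.
# It converts each number list into a list of vectors, so that words become related to one another:
# words with similar meanings end up close together in the vector space.
# For example, in the vector space:
# pleasure, like, attraction would fall into one cluster,
# hate, dislike, disgust would fall into another.
model.add(Embedding(output_dim=32, # convert each number into a 32-dimensional vector
                    input_dim=3800, # dictionary size is now 3800 words
                    input_length=380)) # sequence length is now 380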
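
Conceptually, an Embedding layer is a trainable lookup table with one row per dictionary index. The NumPy sketch below is only an illustration of that idea, not Keras's implementation: random weights stand in for learned ones, and `review_ids` is a made-up review. It shows how a list of 380 integers becomes a 380 x 32 array.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, embed_dim, seq_len = 3800, 32, 380

# Stand-in for the trained embedding matrix: one 32-dim row per word index.
embedding_matrix = rng.normal(size=(vocab_size, embed_dim))

# Hypothetical review, already converted to word indices and padded to 380.
review_ids = rng.integers(low=0, high=vocab_size, size=seq_len)

# The "layer" is just row indexing into the matrix.
review_vectors = embedding_matrix[review_ids]
print(review_vectors.shape)  # (380, 32)
```

During training, the rows of the matrix are the weights being updated, which is why the layer accounts for 3800 * 32 parameters.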

In [6]:
# During training, 20% of the units are randomly dropped to reduce overfitting.
model.add(Dropout(0.2))
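
As a rough illustration of what a dropout rate of 0.2 means during training, the sketch below applies "inverted dropout" in NumPy: about 20% of the entries are zeroed and the survivors are scaled by 1 / (1 - 0.2) so the expected sum stays the same. The helper `dropout_train` is hypothetical and is a simplified stand-in, not the Keras layer itself.

```python
import numpy as np

def dropout_train(x, rate, rng):
    """Inverted dropout: zero a fraction `rate` of entries, rescale the rest."""
    keep_mask = rng.random(x.shape) >= rate
    return x * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
activations = np.ones((380, 32))
dropped = dropout_train(activations, rate=0.2, rng=rng)

zero_fraction = np.mean(dropped == 0)
print(round(float(zero_fraction), 2))  # close to 0.2
print(float(dropped.max()))            # surviving entries are scaled to 1 / 0.8
```

At prediction time dropout is switched off and the layer passes its input through unchanged.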

In [7]:
# Once the Embedding layer is in place, it can be followed by an MLP, RNN, LSTM or similar model for deep learning.
# The cells below use an MLP as the example:

In [8]:
# Each review has 380 numbers, and each number is converted into a 32-dimensional vector,
# so the Flatten layer has 380 x 32 = 12160 units.
model.add(Flatten())
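
The flattening step can be mimicked with a NumPy reshape. This is a small sketch with a made-up batch of two embedded reviews, only meant to confirm the 12160 figure:

```python
import numpy as np

batch = np.zeros((2, 380, 32))            # two hypothetical embedded reviews
flat = batch.reshape(batch.shape[0], -1)  # keep the batch axis, merge the rest
print(flat.shape)  # (2, 12160)
```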

In [9]:
# Build the hidden layer
model.add(Dense(units=256, # number of hidden units
                activation='relu'))

In [10]:
# Add a Dropout layer to reduce overfitting.
model.add(Dropout(0.35))

In [11]:
# Build the output layer
model.add(Dense(units=1, # the output layer has a single unit
                activation='sigmoid'))

In [12]:
# Model summary
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 380, 32)           121600    
_________________________________________________________________
dropout_1 (Dropout)          (None, 380, 32)           0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 12160)             0         
_________________________________________________________________
dense_1 (Dense)              (None, 256)               3113216   
_________________________________________________________________
dropout_2 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 257       
Total params: 3,235,073
Trainable params: 3,235,073
Non-trainable params: 0
_________________________________________________________________
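
The parameter counts in the summary can be checked by hand. The arithmetic below assumes the usual layout of one weight per input-output pair plus one bias per output unit for each Dense layer, and one row per dictionary entry for the Embedding layer:

```python
vocab_size, embed_dim, seq_len = 3800, 32, 380
hidden_units, output_units = 256, 1

embedding_params = vocab_size * embed_dim                      # lookup table rows
flatten_size = seq_len * embed_dim                             # inputs to dense_1
dense_1_params = flatten_size * hidden_units + hidden_units    # weights + biases
dense_2_params = hidden_units * output_units + output_units    # weights + bias

total = embedding_params + dense_1_params + dense_2_params
print(embedding_params, dense_1_params, dense_2_params, total)
# 121600 3113216 257 3235073
```

Almost all of the parameters sit in the first Dense layer, because it is fully connected to all 12160 flattened inputs.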


In [13]:
# Define how the model is trained
model.compile(loss='binary_crossentropy', # loss function
              optimizer='adam', # optimizer
              metrics=['accuracy']) # evaluate the model by accuracy
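
For reference, binary cross-entropy averages -[y*log(p) + (1-y)*log(1-p)] over the samples, where y is the 0/1 label and p is the sigmoid output. The function below is a plain-Python sketch of that formula (the name, the clipping constant `eps` and the sample values are assumptions for illustration, not the Keras code):

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)], with p clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

labels = [1, 0, 1, 0]
confident_right = [0.9, 0.1, 0.9, 0.1]
confident_wrong = [0.1, 0.9, 0.1, 0.9]
print(binary_crossentropy(labels, confident_right))  # about 0.105
print(binary_crossentropy(labels, confident_wrong))  # about 2.303
```

Confidently wrong predictions are penalised much more heavily than confidently right ones are rewarded, which is why a loss can rise even while accuracy stays roughly level.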

In [14]:
# Start training
train_history = model.fit(train_feature, train_label, batch_size=100,
                          epochs=10, validation_split=0.2, verbose=2)

Train on 20000 samples, validate on 5000 samples
Epoch 1/10
 - 16s - loss: 0.4974 - acc: 0.7389 - val_loss: 0.4300 - val_acc: 0.8182
Epoch 2/10
 - 15s - loss: 0.2071 - acc: 0.9200 - val_loss: 0.5522 - val_acc: 0.7652
Epoch 3/10
 - 15s - loss: 0.0843 - acc: 0.9716 - val_loss: 0.7132 - val_acc: 0.7568
Epoch 4/10
 - 15s - loss: 0.0308 - acc: 0.9910 - val_loss: 0.8972 - val_acc: 0.7636
Epoch 5/10
 - 14s - loss: 0.0136 - acc: 0.9964 - val_loss: 0.9613 - val_acc: 0.7854
Epoch 6/10
 - 15s - loss: 0.0097 - acc: 0.9975 - val_loss: 0.9233 - val_acc: 0.8016
Epoch 7/10
 - 14s - loss: 0.0082 - acc: 0.9975 - val_loss: 1.0115 - val_acc: 0.8030
Epoch 8/10
 - 16s - loss: 0.0143 - acc: 0.9951 - val_loss: 1.4151 - val_acc: 0.7302
Epoch 9/10
 - 17s - loss: 0.0195 - acc: 0.9927 - val_loss: 0.8795 - val_acc: 0.8128
Epoch 10/10
 - 15s - loss: 0.0157 - acc: 0.9946 - val_loss: 1.3909 - val_acc: 0.7558
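
The log above shows training accuracy climbing past 99% while validation loss rises from 0.43 to over 1.3, a typical sign of overfitting. The sketch below simply copies the validation numbers from the log and finds the epoch with the lowest validation loss; in the notebook itself the same lists should be available as `train_history.history['val_loss']` and `train_history.history['val_acc']`.

```python
# Validation metrics copied from the training log, one entry per epoch.
val_loss = [0.4300, 0.5522, 0.7132, 0.8972, 0.9613,
            0.9233, 1.0115, 1.4151, 0.8795, 1.3909]
val_acc = [0.8182, 0.7652, 0.7568, 0.7636, 0.7854,
           0.8016, 0.8030, 0.7302, 0.8128, 0.7558]

# Epochs are 1-based in the log, list indices are 0-based.
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1
print(best_epoch, val_loss[best_epoch - 1], val_acc[best_epoch - 1])
# 1 0.43 0.8182
```

By this measure the model was already at its best after the first epoch. One common way to act on that is Keras's `EarlyStopping` callback monitoring `val_loss`, although the notebook does not use it.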


In [15]:
# Evaluate the model's accuracy on the test set
scores = model.evaluate(test_feature, test_label, verbose=1)
scores[1]



0.83575999999999995

In [16]:
# Predict on the test set
predict = model.predict_classes(test_feature)



In [17]:
# Show the first 5 predictions
predict[:5]

array([[1],
       [1],
       [1],
       [1],
       [1]], dtype=int32)

In [18]:
# Flatten the 2-D array into a 1-D array
predict_classes = predict.reshape(-1)
predict_classes[:5]

array([1, 1, 1, 1, 1], dtype=int32)
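
For a single sigmoid output, a class prediction amounts to thresholding the predicted probability at 0.5. The sketch below uses made-up probabilities in place of `model.predict(test_feature)`. It may also be useful on later TensorFlow/Keras releases, where `predict_classes` is no longer available.

```python
import numpy as np

# Hypothetical sigmoid outputs with shape (n_samples, 1), as model.predict would return.
probabilities = np.array([[0.93], [0.48], [0.51], [0.07]])

predicted = (probabilities > 0.5).astype("int32")  # shape (4, 1)
predicted_flat = predicted.reshape(-1)             # shape (4,)
print(predicted_flat)  # [1 0 1 0]
```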

In [19]:
# Show the prediction result for an individual review
sentiment_dic = {1: 'positive', 0: 'negative'}
def display_sentiment(i):
    print(test_text[i])
    print('label:', sentiment_dic[test_label[i]], '\nprediction:', sentiment_dic[predict_classes[i]])

In [20]:
display_sentiment(0)

I went and saw this movie last night after being coaxed to by a few friends of mine. I'll admit that I was reluctant to see it because from what I knew of Ashton Kutcher he was only able to do comedy. I was wrong. Kutcher played the character of Jake Fischer very well, and Kevin Costner played Ben Randall with such professionalism. The sign of a good movie is that it can toy with our emotions. This one did exactly that. The entire theater (which was sold out) was overcome by laughter during the first half of the movie, and were moved to tears during the second half. While exiting the theater I not only saw many women in tears, but many full grown men as well, trying desperately not to let anyone see them crying. This movie was great, and I suggest that you go see it before you judge.
label: positive 
prediction: positive


In [21]:
# A movie review picked at random from the web, used as new data for prediction
input_text = '''
Beauty and the Beast (2017) is a strange film to review as most of it feels like a frame by frame remake of the Disney classic.
It's a good film by the fact that the animated version was great to begin with. Unlike The Jungle Book (2017) which improved on an average Disney film, Beauty and the Beast (2017) doesn't really warrant existing. You could have swapped the film for the original and you would have had the same amount of enjoyment.
Any praise is really that the everyone involved did their jobs capably and produced an admirable copy. Let's hope Aladdin, Dumbo and the Lion King can bring something new and not be 'good' by being safe live action/CGI copies.
'''

In [22]:
# Preprocess: convert the review text into a list of word indices
input_seq = token.texts_to_sequences([input_text])

In [23]:
# Pad or truncate the list to the fixed length of 380
pad_input_seq = sequence.pad_sequences(input_seq, maxlen=380)

In [24]:
# Use the model to predict
predict_result = model.predict_classes(pad_input_seq)



In [25]:
# Look up the predicted sentiment
sentiment_dic[predict_result[0][0]]

'positive'
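
The fixed-length step in In [23] relies on the defaults of `pad_sequences`: shorter lists are padded with zeros at the front, and longer lists keep only their last `maxlen` items. The helper `pad_pre` below is a hypothetical plain-Python imitation of that default behaviour for a single list, shown with short lengths for readability:

```python
def pad_pre(seq, maxlen, value=0):
    """Keep the last `maxlen` items; left-pad shorter sequences with `value`."""
    seq = list(seq)[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq

print(pad_pre([5, 8, 2], maxlen=5))           # [0, 0, 5, 8, 2]
print(pad_pre([1, 2, 3, 4, 5, 6], maxlen=4))  # [3, 4, 5, 6]
```

With `maxlen=380`, a review longer than 380 words therefore loses its beginning rather than its end.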

In [None]:
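# Conclusion: a larger dictionary and longer number lists do improve prediction accuracy (about 83.6% on the test set here, versus roughly 80% for Keras_IMDb_MLP_Model).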