### Goals of this lesson
<li>Processing text with an RNN</li>

- Data cleaning (For English)

- Word Tokenizer

- Word embedding

- dropout

<li>What about Chinese?</li>

- Word segmentation


<li>Other kinds of text-related deep learning models</li>

- Multi-input, multi-output

- Other special types

In [1]:
import numpy as np
import pandas as pd
import re
import tensorflow.keras as keras

In [2]:
df = pd.read_csv('spam_data.csv')
df.head()

Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


### Data cleaning (For English)
Source: [medium@datamonsters](https://medium.com/@datamonsters/text-preprocessing-in-python-steps-tools-and-examples-bf025f872908)

- converting all letters to lower or upper case
- converting numbers into words or removing numbers
- removing punctuation, accent marks and other diacritics
- removing white spaces
- expanding abbreviations, stemming
- removing stop words and particular words
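Two of the steps listed above, removing accent marks and normalizing white space, can be sketched with the standard library alone. The helper names `strip_accents` and `collapse_whitespace` and the sample string are made up for illustration; they are not part of this notebook's pipeline.

```python
import re
import unicodedata

def strip_accents(text):
    # NFKD splits an accented letter into base letter + combining mark;
    # dropping the combining marks leaves the plain letter.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def collapse_whitespace(text):
    # Replace any run of whitespace with one space and trim both ends.
    return re.sub(r"\s+", " ", text).strip()

sample = "  Café   déjà vu\t!! "   # hypothetical input
cleaned = collapse_whitespace(strip_accents(sample.lower()))
print(cleaned)
```

Lower-casing first keeps the order of operations close to the list above; punctuation is deliberately left alone here.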

In [16]:
text_data = df.Message.values

In [17]:
# convert all letters to lower case
text_data = [i.lower() for i in text_data]
# remove numbers
text_data = [re.sub(r'\d+', '', i) for i in text_data]

# remove punctuation; the class lists curly quotes (” ’), so straight
# apostrophes survive and "it's" can still be expanded below
text_data = [re.sub(r'[!”#$%&’()*+,\-./:;<=>?@\[\]^_`{|}~£]', '', i) for i in text_data]
# whitespace is handled later by str.split(), so nothing to do here
# too many abbreviations to cover, so just two examples (abbreviation and contraction)
# \b keeps 'comin' from matching inside 'coming' (which would give 'comingg')
text_data = [re.sub(r'\bcomin\b', 'coming', i) for i in text_data]
text_data = [re.sub(r"\bit's\b", 'it is', i) for i in text_data]

In [18]:
# remove stop words (public import path instead of the private _stop_words module)
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
# ENGLISH_STOP_WORDS

In [19]:
text_data = [i.split() for i in text_data]
text_data[0]

['go',
 'until',
 'jurong',
 'point',
 'crazy',
 'available',
 'only',
 'in',
 'bugis',
 'n',
 'great',
 'world',
 'la',
 'e',
 'buffet',
 'cine',
 'there',
 'got',
 'amore',
 'wat']

In [20]:
remove_count = 0
for i in range(len(text_data)):
    # iterate over a copy ([:]) so that removing items does not skip elements
    for j in text_data[i][:]:
        if j in ENGLISH_STOP_WORDS:
            text_data[i].remove(j)
            remove_count += 1
print('remove count:', remove_count)
text_data[0]

remove count: 37302


['jurong',
 'point',
 'crazy',
 'available',
 'bugis',
 'n',
 'great',
 'world',
 'la',
 'e',
 'buffet',
 'cine',
 'got',
 'amore',
 'wat']

In [21]:
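## Side note: why the stop-word loop iterates over a copy
# Removing from a list while iterating over it skips the element that follows
# each removal, so this version leaves one 'b' behind:
# test = ['a','b','b','c','d']
# for i in test:
#     print(i)
#     if i =='b':
#         test.remove(i)
# test
# Iterating over a copy (test[:]) visits every element:
test = ['a','b','b','c','d']
for i in test[:]:
    print(i)
    if i == 'b':
        test.remove(i)
test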

a
b
b
c
d


['a', 'c', 'd']
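An alternative that sidesteps the copy-while-removing issue altogether is to build a new list with a comprehension. The sketch below uses a tiny hand-made stop-word set (`STOP_WORDS` is hypothetical, not scikit-learn's list) so that it runs on its own.

```python
# Hypothetical mini stop-word set, for illustration only.
STOP_WORDS = {'go', 'until', 'only', 'in', 'there'}

def remove_stop_words(tokens, stop_words):
    # Build a new list instead of mutating the one being iterated over.
    kept = [t for t in tokens if t not in stop_words]
    return kept, len(tokens) - len(kept)

tokens = ['go', 'until', 'jurong', 'point', 'only', 'in', 'bugis']
kept, removed = remove_stop_words(tokens, STOP_WORDS)
print(kept, removed)
```

The function returns the number of removed tokens as well, so a running total like `remove_count` can still be kept.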

In [24]:
max_num = 5000  # vocabulary size: only the most frequent words are kept
token = keras.preprocessing.text.Tokenizer(num_words=max_num)

In [25]:
token.fit_on_texts(text_data)

In [28]:
# token.word_index

In [29]:
text_data = token.texts_to_sequences(text_data)
text_data[0]

[3844, 612, 574, 507, 1025, 29, 46, 227, 837, 68, 2542, 1026, 10, 3845, 57]
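To make the word-to-integer mapping less of a black box, here is a simplified pure-Python sketch of the idea: rank words by frequency, give the most frequent word index 1, and drop words whose index is not below `num_words`. The function names are made up, and the real Keras `Tokenizer` also handles filtering, lower-casing and out-of-vocabulary tokens that this sketch ignores.

```python
from collections import Counter

def build_word_index(tokenized_texts):
    counts = Counter(word for text in tokenized_texts for word in text)
    # Most frequent word gets index 1; index 0 stays free for padding.
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(tokenized_texts, word_index, num_words):
    # Keep only words whose index is below num_words; the rest are dropped.
    return [[word_index[w] for w in text
             if w in word_index and word_index[w] < num_words]
            for text in tokenized_texts]

texts = [['free', 'entry', 'free'], ['entry', 'free', 'win']]  # toy data
word_index = build_word_index(texts)
sequences = to_sequences(texts, word_index, num_words=3)
print(word_index)
print(sequences)
```

With `num_words=3` only indices 1 and 2 survive, which is why the rare word `'win'` disappears from the second sequence.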

In [38]:
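# RNN input has shape (batch_size, timesteps, features);
# use the 70th percentile of sequence length to choose the number of timesteps
np.percentile(np.array([len(i) for i in text_data]), 70)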

10.0

In [39]:
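maxlen = 10  # pad or truncate every sequence to 10 tokens
text_data = keras.preprocessing.sequence.pad_sequences(text_data, maxlen=maxlen)
# encode labels as integer codes (categories sort alphabetically: ham=0, spam=1)
df.Category = df.Category.astype('category')
target_data = df.Category.cat.codes
target_data = np.array(target_data).reshape(-1, 1)
text_data.shape, target_data.shape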

((5572, 10), (5572, 1))
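With its default settings, `pad_sequences` pads short sequences at the front with zeros and, for long sequences, keeps the last `maxlen` items. A minimal pure-Python sketch of that behaviour (the helper `pad_pre` is made up and returns plain lists rather than a NumPy array):

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences; keep the last `maxlen` items of long ones."""
    padded = []
    for seq in sequences:
        kept = seq[-maxlen:]                                   # truncate from the front
        padded.append([value] * (maxlen - len(kept)) + kept)   # pad at the front
    return padded

demo = pad_pre([[1, 2, 3], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(demo)
```

Front padding places the real tokens at the end of the sequence, closest to the LSTM's final state.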

In [40]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(text_data,target_data,test_size=0.2)

In [41]:
x_train.shape,y_train.shape

((4457, 10), (4457, 1))

#### [What does an Embedding do? (1)](https://medium.com/royes-researchcraft/%E8%87%AA%E7%84%B6%E8%AA%9E%E8%A8%80%E8%99%95%E7%90%86-1-word-to-vector-%E5%AF%A6%E4%BD%9C%E6%95%99%E5%AD%B8-99b668faa296)
#### [What does an Embedding do? (2)](https://medium.com/life-of-small-data-engineer/%E8%83%BD%E8%A2%AB%E9%9B%BB%E8%85%A6%E7%90%86%E8%A7%A3%E7%9A%84%E6%96%87%E5%AD%97-nlp-%E4%B8%80-word-embedding-4146267019cb)
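As a rough picture, an embedding layer is a lookup table with one vector per token index. The NumPy sketch below uses a random table as a stand-in for trained weights, with the notebook's sizes (5000 words, 64 dimensions); the variable names are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 5000, 64
# One row per token index; learned during training in a real layer, random here.
table = rng.normal(size=(vocab_size, embed_dim))

# Two padded sequences of token indices: shape (batch, timesteps).
batch = np.array([[0, 0, 3844, 612], [574, 507, 1025, 29]])
vectors = table[batch]   # shape (batch, timesteps, embed_dim)
print(vectors.shape)
print(table.size)        # number of weights in the table
```

Indexing with an integer array is all the "layer" does in the forward pass; the output has the (batch_size, timesteps, features) layout an LSTM expects.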

In [42]:
model = keras.models.Sequential()
model.add(keras.layers.Embedding(max_num, 64))
model.add(keras.layers.LSTM(128, dropout=0.2, recurrent_dropout=0.2))
model.add(keras.layers.Dense(1, activation='sigmoid'))
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, None, 64)          320000    
_________________________________________________________________
lstm (LSTM)                  (None, 128)               98816     
_________________________________________________________________
dense (Dense)                (None, 1)                 129       
Total params: 418,945
Trainable params: 418,945
Non-trainable params: 0
_________________________________________________________________
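The parameter counts in the summary can be reproduced by hand. The sketch below assumes the standard layouts: an embedding table, an LSTM with four weight sets (input, forget and output gates plus the cell candidate) each with a bias, and a dense layer with a bias. The helper names are made up.

```python
def embedding_params(vocab_size, embed_dim):
    # One embed_dim-sized vector per vocabulary index.
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    # Four weight sets, each with input weights, recurrent weights and a bias.
    return 4 * (input_dim * units + units * units + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit.
    return input_dim * units + units

counts = [embedding_params(5000, 64), lstm_params(64, 128), dense_params(128, 1)]
print(counts, sum(counts))
```

Most of the weights sit in the embedding table, so vocabulary size and embedding dimension dominate the model size.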


In [43]:
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

model.fit(x_train, y_train,epochs=15,validation_data=(x_test, y_test))


Train on 4457 samples, validate on 1115 samples
Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<tensorflow.python.keras.callbacks.History at 0x13d4a0550>

## What about Chinese?
- Word segmentation: Jieba, the Academia Sinica (CKIP) segmenter
- Stemming is not an issue, but abbreviations still are
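Chinese text has no spaces between words, so it must be segmented before tokenizing. The sketch below uses forward maximum matching against a toy dictionary, the simplest dictionary-based approach. It is not how Jieba or the Academia Sinica segmenter works internally; the function name, dictionary and sentence are made up for illustration.

```python
def forward_max_match(text, vocab, max_word_len=4):
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_word_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens

toy_vocab = {"喜歡", "深度", "學習", "深度學習"}   # hypothetical dictionary
segmented = forward_max_match("我喜歡深度學習", toy_vocab)
print(segmented)
```

Once a sentence is a list of words, the rest of the pipeline (tokenizer, padding, embedding) applies unchanged.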

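## Other kinds of RNN-related deep learning models
- multi-input, multi-output (for non-text data): e.g. forecasting sales for the coming week

- Seq2Seq (for text data): machine translation, summarization, question answering, text generation

- image captioning: turning a photo into a text description ([link](https://milhidaka.github.io/chainer-image-caption/))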
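For the sales-forecasting case, one common way to prepare the data is a sliding window: several past time steps go in and several future steps come out. This is a sketch under that assumption, with a made-up helper `make_windows` and made-up numbers; a one-week forecast would use `n_out=7`.

```python
def make_windows(series, n_in, n_out):
    """Split a series into (past n_in values, next n_out values) pairs."""
    X, y = [], []
    for start in range(len(series) - n_in - n_out + 1):
        X.append(series[start:start + n_in])
        y.append(series[start + n_in:start + n_in + n_out])
    return X, y

daily_sales = list(range(10))   # made-up numbers
X, y = make_windows(daily_sales, n_in=3, n_out=2)
print(len(X), X[0], y[0], X[-1], y[-1])
```

Each row of `X` would still need a trailing feature axis to match the (batch_size, timesteps, features) input an RNN expects.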

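## Homework changes: cross-test the settings below and record the test-set RMSE

- Effect of stop words: stop words removed vs. stop words kept
- Effect of vocabulary size: 500, 1000, 5000 (with the embedding dimension fixed at 64)
- Effect of embedding dimension: 16, 32, 64 (with the vocabulary size fixed at 5000)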
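The RMSE to record for each setting can be computed from the test labels and the model's predicted probabilities. A minimal standard-library sketch, with a made-up helper name and made-up numbers:

```python
import math

def rmse(y_true, y_pred):
    # Root of the mean squared difference between labels and predictions.
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

y_true = [0, 1, 1, 0]           # made-up labels
y_prob = [0.0, 1.0, 0.0, 0.0]   # made-up predicted probabilities
print(rmse(y_true, y_prob))
```

In the notebook, the predictions would presumably come from `model.predict(x_test)`, flattened to one dimension before being compared with `y_test`.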

In [44]:
import pandas as pd
row_index = ['stop words removed', 'stop words kept']
column1_index = ['vocab_500', 'vocab_1000', 'vocab_5000',
                 'embedding_16', 'embedding_32', 'embedding_64']
pd.DataFrame(index=row_index, columns=column1_index, data=0)

Unnamed: 0,vocab_500,vocab_1000,vocab_5000,embedding_16,embedding_32,embedding_64
stop words removed,0,0,0,0,0,0
stop words kept,0,0,0,0,0,0
