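1. Encoding types : data formats used as input to deep learning models
   - one-hot encoding : texts -> sparse matrix(encoding) -> DNN -> label classification
     DNN : word-frequency input -> label classification
   - word_embedding : texts -> sequence(10) -> embedding -> RNN -> label classification
     RNN : natural-language (word sequence) input -> label classification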
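
The two encodings above can be contrasted on a toy corpus. The sketch below is plain Python and only illustrative: the two sentences are made up, and indices are assigned in order of first appearance, whereas the Keras `Tokenizer` used later orders them by word frequency.

```python
# Toy corpus (hypothetical, not taken from the spam dataset)
texts = ["free prize now", "call me now now"]

# Build a word -> integer index map; index 0 stays unused (reserved for padding)
word_index = {}
for sentence in texts:
    for word in sentence.split():
        if word not in word_index:
            word_index[word] = len(word_index) + 1

vocab_size = len(word_index) + 1  # +1 for the reserved index 0

# (a) Sparse count encoding: one row per text, one column per word index.
#     Word order is lost; only frequencies remain (input for a DNN).
count_matrix = []
for sentence in texts:
    row = [0] * vocab_size
    for word in sentence.split():
        row[word_index[word]] += 1
    count_matrix.append(row)

# (b) Sequence encoding: one integer per word, word order preserved
#     (input for Embedding -> RNN).
sequences = [[word_index[w] for w in s.split()] for s in texts]

print(word_index)
print(count_matrix)
print(sequences)
```

In (a) every text has the same width (the vocabulary size), while in (b) the length follows the sentence, which is why the sequence route needs padding before it reaches the embedding layer.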
     
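2. Embedding(input_dim, output_dim, input_length)
   - input_dim : vocabulary size, i.e. the total number of distinct word indices fed to the embedding layer (including index 0 used for padding)
   - output_dim : dimension of the vector the embedding layer outputs for each word
   - input_length : number of words (tokens) in one input sentence, after padding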
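
To make the three parameters concrete, an embedding layer can be thought of as a trainable lookup table of shape (input_dim, output_dim). The NumPy sketch below imitates only the forward lookup, under toy sizes chosen for illustration; it is not the Keras implementation, and the random table stands in for weights that would normally be learned.

```python
import numpy as np

input_dim = 6      # toy vocabulary size, including padding index 0
output_dim = 4     # length of the vector produced for each word
input_length = 5   # words per (padded) sentence

# Stand-in for the trainable weights: one row per word index
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(input_dim, output_dim))

# Two padded sentences of integer word indices, shape (2, input_length)
batch = np.array([[0, 0, 1, 2, 3],
                  [0, 4, 5, 3, 3]])

# The lookup replaces every index with its row of the table
embedded = embedding_table[batch]

print(embedded.shape)        # (batch size, input_length, output_dim)
print(embedding_table.size)  # input_dim * output_dim weights
```

Each sentence becomes an (input_length, output_dim) matrix, and the layer holds input_dim * output_dim weights.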

In [1]:

import pandas as pd
import numpy as np
import string
from tensorflow.keras.preprocessing.text import Tokenizer
from sklearn.model_selection import train_test_split
from tensorflow.keras import Sequential # create the model
from tensorflow.keras.layers import Dense, Embedding, Flatten, LSTM # create the layers
from sklearn.metrics import accuracy_score
from tensorflow.keras.preprocessing.sequence import pad_sequences


temp_spam = pd.read_csv("C:/IITT/6_Tensorflow/data/temp_spam_data2.csv",
                        header = None, encoding = "utf-8")
temp_spam.info()


# 1. Select variables
label = temp_spam[0]
texts = temp_spam[1]
len(label) # 5574


# 2. Data preprocessing
# target dummy('spam'=1, 'ham'=0)
target = [1 if x=='spam' else 0 for x in label]
print('target :', target)
target = np.array(target)


# Preprocess texts
def text_prepro(texts):
    # Lower case
    texts = [x.lower() for x in texts]
    # Remove punctuation
    texts = [''.join(c for c in x if c not in string.punctuation) for x in texts]
    # Remove numbers
    texts = [''.join(c for c in x if c not in string.digits) for x in texts]
    # Trim extra whitespace
    texts = [' '.join(x.split()) for x in texts]
    return texts


# Call the function : texts preprocessing done
texts = text_prepro(texts)
texts[0]



# Tokenizer
tokenizer = Tokenizer()


# 1. Build tokens
tokenizer.fit_on_texts(texts) # fit on the texts
token = tokenizer.word_index # returns the tokens : builds integer indices
print(token) # {word : int index} : no duplicates
print("Total number of words :", len(token)) # Total number of words : 8629

voc_size = len(token) + 1 # +1 : word indices start at 1, index 0 is reserved for padding
voc_size # 8630


# 2. Integer indices : tokens -> integer indices (word-order indices)
seq_index = tokenizer.texts_to_sequences(texts)
print(seq_index)
len(seq_index) # number of sentences : 5574
len(seq_index[0]) # number of words in the first sentence : 20
len(seq_index[-1]) # number of words in the last sentence : 6

maxlength = max([len(seq) for seq in seq_index])
maxlength # 171

# 3. Padding : make all integer sequences the same length
x_data = pad_sequences(seq_index, maxlen=maxlength)
print(x_data)
x_data.shape # (5574,171)

x_data[0] # 171-20 = 151 zeros of padding (added at the front by default)


# 4. dataset split
x_train, x_val, y_train, y_val = train_test_split(x_data, target) # default test_size=0.25
x_train.shape # (4180,171)
y_train.shape # (4180,)



# 5. Embedding layer : encoding
embedding_dim = 32 # 64, 128, 256 ... adjust to the vocabulary size

model = Sequential() # model obj 

# add the embedding layer
model.add(Embedding(input_dim=voc_size, output_dim=embedding_dim, input_length=maxlength))

# 2d -> 1d per sample (only needed when no RNN layer follows)
#model.add(Flatten())

# LSTM(RNN)
model.add(LSTM(32)) # returns only the last hidden state (None, 32), so Flatten is not needed

# DNN hidden layer
model.add(Dense(32, activation='relu'))

# DNN output layer
model.add(Dense(1, activation='sigmoid'))

model.summary()




# 6. compile (training configuration) / training
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])

model.fit(x_train, y_train, epochs=10, verbose=1,
          batch_size = 30, validation_data=(x_val, y_val))


loss, score = model.evaluate(x_val, y_val)
print("loss = {:.5f}, accuracy = {:.5f}".format(loss, score))  



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5574 entries, 0 to 5573
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       5574 non-null   object
 1   1       5574 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB
target : [0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 

[[44, 440, 3985, 776, 697, 666, 62, 9, 1217, 87, 119, 354, 1015, 150, 2711, 1218, 66, 56, 3986, 135], [46, 309, 1358, 441, 6, 1766], [47, 459, 9, 4, 735, 876, 1, 175, 1767, 933, 613, 1768, 235, 256, 69, 1767, 1, 1, 310, 459, 2712, 77, 2713, 367, 2714], [6, 230, 138, 23, 355, 2715, 6, 155, 142, 58, 138], [934, 2, 42, 96, 70, 460, 1, 877, 70, 1769, 202, 109, 461], [825, 120, 66, 1541, 40, 98, 588, 21, 7, 37, 341, 85, 342, 54, 105, 382, 3, 41, 12, 14, 83, 1770, 46, 332, 1016, 3987, 1, 68, 1, 1771], [196, 11, 589, 8, 24, 54, 1, 356, 35, 10, 106, 667, 10, 54, 3988, 3989], [71, 210, 13, 1107, 1359, 1359, 1772, 2108, 2109, 2110, 110, 98, 563, 71, 13, 935, 12, 49, 1542, 777, 1, 1017, 13, 217, 935], [668, 71, 4, 778, 412, 218, 3, 17, 98, 413, 1, 2716, 139, 936, 1, 114, 16, 114, 399, 2717, 502, 521, 62], [128, 13, 90, 669, 26, 123, 6, 84, 1108, 1, 488, 1, 5, 301, 543, 826, 35, 325, 12, 47, 16, 5, 90, 488, 1219, 47, 18], [22, 219, 32, 78, 211, 7, 2, 42, 67, 1, 263, 76, 39, 282, 1109, 206, 151, 16

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 171, 32)           276160    
_________________________________________________________________
lstm (LSTM)                  (None, 32)                8320      
_________________________________________________________________
dense (Dense)                (None, 32)                1056      
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 33        
Total params: 285,569
Trainable params: 285,569
Non-trainable params: 0
_________________________________________________________________
Train on 4180 samples, validate on 1394 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


loss = 0.11600, accuracy = 0.97561
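
The Param # column of the model summary above can be checked by hand. The sketch below redoes the arithmetic in plain Python, assuming the standard LSTM layout of four gates, each with input weights, recurrent weights and a bias; the variable names are illustrative and the sizes are the ones used in this notebook.

```python
voc_size = 8630       # input_dim of the embedding layer
embedding_dim = 32    # output_dim of the embedding layer
lstm_units = 32
dense_units = 32

# Embedding: one vector of length embedding_dim per word index
embedding_params = voc_size * embedding_dim

# LSTM: 4 gates, each with (input + recurrent) weights plus a bias per unit
lstm_params = 4 * ((embedding_dim + lstm_units) * lstm_units + lstm_units)

# Dense hidden layer: weights + biases
dense_params = lstm_units * dense_units + dense_units

# Dense output layer with a single sigmoid unit
output_params = dense_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params + output_params

print(embedding_params, lstm_params, dense_params, output_params)
print(total_params)
```

The four values match the summary rows (276160, 8320, 1056, 33) and add up to the reported total of 285,569. Nearly all of the weights sit in the embedding table, so the vocabulary size dominates the model size.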
