<a href="https://colab.research.google.com/github/arkwith7/aSSIST_ML/blob/main/deep_learning_3_sequence.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sequence to Sequence

In [None]:
# Basic usage
from tensorflow.keras.layers import TextVectorization

text_vectorization = TextVectorization(
    output_mode="int",
)

In [None]:
# Customizing the layer
from tensorflow.keras.layers import TextVectorization

import re
import string
import tensorflow as tf

def custom_standardization_fn(string_tensor):
    lowercase_string = tf.strings.lower(string_tensor)
    return tf.strings.regex_replace(
        lowercase_string, f"[{re.escape(string.punctuation)}]", "")

def custom_split_fn(string_tensor):
    return tf.strings.split(string_tensor)

text_vectorization = TextVectorization(
    output_mode="int",
    standardize=custom_standardization_fn,  # standardization step
    split=custom_split_fn,
)

In [None]:
dataset = [
    "I write, erase, rewrite",
    "Erase again, and then",
    "A poppy blooms.",
]
text_vectorization.adapt(dataset)

Printing the vocabulary

In [None]:
text_vectorization.get_vocabulary()

['',
 '[UNK]',
 'erase',
 'write',
 'then',
 'rewrite',
 'poppy',
 'i',
 'blooms',
 'and',
 'again',
 'a']

In [None]:
vocabulary = text_vectorization.get_vocabulary()
test_sentence = "I write, rewrite, and still rewrite again"
encoded_sentence = text_vectorization(test_sentence)
print(encoded_sentence)

tf.Tensor([ 7  3  5  9  1  5 10], shape=(7,), dtype=int64)


In [None]:
inverse_vocab = dict(enumerate(vocabulary))
decoded_sentence = " ".join(inverse_vocab[int(i)] for i in encoded_sentence)
print(decoded_sentence)

i write rewrite and [UNK] rewrite again
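
To make the three steps inside `TextVectorization` (standardize, split, index) concrete, here is a minimal pure-Python sketch. The helper names `standardize`, `build_vocabulary`, and `encode` are hypothetical and not part of Keras, and the order of equally frequent words may differ from the vocabulary the layer produces.

```python
import re
import string
from collections import Counter

def standardize(text):
    # Lowercase and strip ASCII punctuation, mirroring custom_standardization_fn.
    return re.sub(f"[{re.escape(string.punctuation)}]", "", text.lower())

def build_vocabulary(corpus):
    # Index 0 is reserved for padding and index 1 for out-of-vocabulary tokens.
    counts = Counter(tok for doc in corpus for tok in standardize(doc).split())
    return ["", "[UNK]"] + [tok for tok, _ in counts.most_common()]

def encode(text, vocabulary):
    # Unknown tokens fall back to index 1 ("[UNK]").
    index = {tok: i for i, tok in enumerate(vocabulary)}
    return [index.get(tok, 1) for tok in standardize(text).split()]

corpus = [
    "I write, erase, rewrite",
    "Erase again, and then",
    "A poppy blooms.",
]
vocab = build_vocabulary(corpus)
encoded = encode("I write, rewrite, and still rewrite again", vocab)
decoded = " ".join(vocab[i] for i in encoded)
print(decoded)  # i write rewrite and [UNK] rewrite again
```

The word "still" never appears in the corpus, so it is the only token mapped to `[UNK]`.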


In [None]:
# Basic usage (without customization)
from tensorflow.keras.layers import TextVectorization

text_vectorization = TextVectorization(
    output_mode="int",
)
dataset = [
    "I write, erase, rewrite",
    "Erase again, and then",
    "A poppy blooms.",
]
text_vectorization.adapt(dataset)
text_vectorization.get_vocabulary()

['',
 '[UNK]',
 'erase',
 'write',
 'then',
 'rewrite',
 'poppy',
 'i',
 'blooms',
 'and',
 'again',
 'a']

In [None]:
vocabulary = text_vectorization.get_vocabulary()
test_sentence = "I write, rewrite, and still rewrite again"
encoded_sentence = text_vectorization(test_sentence)
print(encoded_sentence)

tf.Tensor([ 7  3  5  9  1  5 10], shape=(7,), dtype=int64)


# Two approaches for representing groups of words: sets and sequences

## Preparing the IMDB movie reviews data

In [None]:
!curl -O https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xf aclImdb_v1.tar.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 80.2M  100 80.2M    0     0  32.1M      0  0:00:02  0:00:02 --:--:-- 32.1M


In [None]:
!rm -r aclImdb/train/unsup

`train` folder: training data, 25,000 reviews  
`test` folder: test data, 25,000 reviews  
`pos` folder: positive reviews  
`neg` folder: negative reviews  

In [None]:
!cat aclImdb/train/pos/4077_10.txt

I first saw this back in the early 90s on UK TV, i did like it then but i missed the chance to tape it, many years passed but the film always stuck with me and i lost hope of seeing it TV again, the main thing that stuck with me was the end, the hole castle part really touched me, its easy to watch, has a great story, great music, the list goes on and on, its OK me saying how good it is but everyone will take there own best bits away with them once they have seen it, yes the animation is top notch and beautiful to watch, it does show its age in a very few parts but that has now become part of it beauty, i am so glad it has came out on DVD as it is one of my top 10 films of all time. Buy it or rent it just see it, best viewing is at night alone with drink and food in reach so you don't have to stop the film.<br /><br />Enjoy

In [None]:
!ls aclImdb/val

neg  pos


In [None]:
import os, pathlib, shutil, random

base_dir = pathlib.Path("aclImdb")
val_dir = base_dir / "val"
train_dir = base_dir / "train"
for category in ("neg", "pos"):
    # exist_ok=True keeps the cell re-runnable without a bare except
    os.makedirs(val_dir / category, exist_ok=True)
    files = os.listdir(train_dir / category)
    random.Random(1337).shuffle(files)        # fixed seed so repeated runs produce the same validation set
    num_val_samples = int(0.2 * len(files))   # hold out 20% as the validation set
    val_files = files[-num_val_samples:]      # split off the validation files
    for fname in val_files:
        shutil.move(train_dir / category / fname,
                    val_dir / category / fname)
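
The split logic can be checked without touching the file system. The sketch below is an illustration only: `split_train_val` and the fake file names are hypothetical, and sorting before shuffling is an added assumption, since `os.listdir` returns names in arbitrary order and the seed alone does not guarantee the same split on every machine.

```python
import random

def split_train_val(filenames, val_fraction=0.2, seed=1337):
    """Return (train_files, val_files) using a seeded shuffle."""
    shuffled = sorted(filenames)  # sort first so the result does not depend on listing order
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) - int(val_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

# Hypothetical file names standing in for one category of 12,500 reviews.
fake_files = [f"{i}_0.txt" for i in range(12500)]
train_files, val_files = split_train_val(fake_files)
print(len(train_files), len(val_files))  # 10000 2500
```

With two categories this gives 20,000 training and 5,000 validation files, which matches the counts reported when the datasets are loaded.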

In [None]:
from tensorflow import keras
batch_size = 32

train_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/train", batch_size=batch_size
)
val_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/val", batch_size=batch_size
)
test_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/test", batch_size=batch_size
)

Found 20000 files belonging to 2 classes.
Found 5000 files belonging to 2 classes.
Found 25000 files belonging to 2 classes.


In [None]:
train_ds

<_BatchDataset element_spec=(TensorSpec(shape=(None,), dtype=tf.string, name=None), TensorSpec(shape=(None,), dtype=tf.int32, name=None))>

## Displaying the shapes and dtypes of the first batch

In [None]:
for inputs, targets in train_ds:
    print("inputs.shape:", inputs.shape)
    print("inputs.dtype:", inputs.dtype)
    print("targets.shape:", targets.shape)
    print("targets.dtype:", targets.dtype)
    print("inputs[0]:", inputs[0])
    print("targets[0]:", targets[0])
    break

inputs.shape: (32,)
inputs.dtype: <dtype: 'string'>
targets.shape: (32,)
targets.dtype: <dtype: 'int32'>
inputs[0]: tf.Tensor(b"Star Trek: Hidden Frontier is a long-running internet only fan film, done completely for the love of the series, and a must watch for fans of Trek. The production quality is extremely high for a fan film, although sometimes you can tell that they're green-screenin' it. This doesn't take away from the overall experience however. The CGI ships are fantastic, as well as the space battle scenes... On the negative side, I could tell in the earlier episodes (and even occasionally in the newer ones) that some of the actors/actresses are not quite comfortable in their roles, but once again, this doesn't take away from the overall experience of new interpretations of Star Trek. The cast and crew have truly come up with something special here, and, as a whole,I would highly recommend this series to fans of The Next Generation and Deep Space 9.", shape=(), dtype=string)


## Processing words as a set: the bag-of-words (BoW) approach

### Single words (unigrams) with binary encoding

#### Preprocessing the datasets with a TextVectorization layer

In [None]:
text_vectorization = TextVectorization(
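    max_tokens=20000,   # keep only the 20,000 most frequent words; the tens of thousands that appear once or twice are not useful
    output_mode="multi_hot",  # encode the output tokens as multi-hot binary vectors
)
text_only_train_ds = train_ds.map(lambda x, y: x) # dataset that yields only the raw text inputs, without labels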
text_vectorization.adapt(text_only_train_ds)


In [None]:
for inputs in train_ds:
    print("inputs[0]:", inputs[0])
    print(len(inputs[0]))
    break

inputs[0]: tf.Tensor(
[b'- When the local sheriff is killed, his wife takes over until and is determined to clean-up the town. Not everyone in town, however, is happy with what she\'s doing. When the sheriff orders a curfew in town, the local saloon owner (also a woman) hires a killer to take care of the sheriff. There\'s no way the saloon owner could know that the sheriff and the killer would fall in love.<br /><br />- Gunslinger is an example of what happens when you have a fairly interesting concept and combine it with poor execution. There\'s a good movie here somewhere trying to get out. In more capable hands or with a larger budget, Gunslinger might have been an entertaining look at the role of women in the Old West. As it is, Gunslinger is a sloppy mess of a movie.<br /><br />- There are just so many things wrong with the movie: a supporting cast with no acting ability, stilted and unnatural dialogue, and sets that look like sets. But the biggest offender is the editing. I was a

In [None]:

binary_1gram_train_ds = train_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
binary_1gram_val_ds = val_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
binary_1gram_test_ds = test_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)

In [None]:
for inputs in binary_1gram_train_ds:
    print("inputs[0]:", inputs[0])
    print(len(inputs[0]))
    break

inputs[0]: tf.Tensor(
[[1. 1. 1. ... 0. 0. 0.]
 [1. 1. 1. ... 0. 0. 0.]
 [1. 1. 1. ... 0. 0. 0.]
 ...
 [1. 1. 1. ... 0. 0. 0.]
 [1. 1. 1. ... 0. 0. 0.]
 [1. 1. 1. ... 0. 0. 0.]], shape=(32, 20000), dtype=float32)
32


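#### Inspecting the output of the binary unigram dataset

Each input is a 20,000-dimensional vector (the vocabulary size), and inputs arrive in batches of 32.

With unigram encoding, the sentence "the cat sat on the mat" is represented as the set  
["cat", "mat", "on", "sat", "the"]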
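
As a rough illustration of what a multi-hot vector looks like, the sketch below encodes the same sentence against a tiny toy vocabulary. The `multi_hot` helper and the six-word vocabulary are hypothetical; the real layer also reserves an index for out-of-vocabulary tokens, which is omitted here.

```python
def multi_hot(tokens, vocabulary):
    # One slot per vocabulary entry; 1.0 if the token occurs at least once.
    index = {tok: i for i, tok in enumerate(vocabulary)}
    vector = [0.0] * len(vocabulary)
    for tok in tokens:
        if tok in index:
            vector[index[tok]] = 1.0
    return vector

toy_vocabulary = ["cat", "dog", "mat", "on", "sat", "the"]  # hypothetical vocabulary
sentence_tokens = "the cat sat on the mat".split()

unigram_set = sorted(set(sentence_tokens))
sentence_vector = multi_hot(sentence_tokens, toy_vocabulary)
print(unigram_set)      # ['cat', 'mat', 'on', 'sat', 'the']
print(sentence_vector)  # [1.0, 0.0, 1.0, 1.0, 1.0, 1.0]
```

Note that "the" occurs twice but still contributes a single 1.0, and that word order is lost entirely.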

In [None]:
for inputs, targets in binary_1gram_train_ds:
    print("inputs.shape:", inputs.shape)
    print("inputs.dtype:", inputs.dtype)
    print("targets.shape:", targets.shape)
    print("targets.dtype:", targets.dtype)
    print("inputs[0]:", inputs[0])
    print("targets[0]:", targets[0])
    break

inputs.shape: (32, 20000)
inputs.dtype: <dtype: 'float32'>
targets.shape: (32,)
targets.dtype: <dtype: 'int32'>
inputs[0]: tf.Tensor([1. 1. 1. ... 0. 0. 0.], shape=(20000,), dtype=float32)
targets[0]: tf.Tensor(0, shape=(), dtype=int32)


#### Model-building utility

In [None]:
from tensorflow import keras
from tensorflow.keras import layers

def get_model(max_tokens=20000, hidden_dim=16):
    inputs = keras.Input(shape=(max_tokens,))
    x = layers.Dense(hidden_dim, activation="relu")(inputs)
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="rmsprop",
                  loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

#### Training and testing the binary unigram model

In [None]:
model = get_model()
model.summary()
callbacks = [
    keras.callbacks.ModelCheckpoint("binary_1gram.keras",
                                    save_best_only=True)
]
model.fit(binary_1gram_train_ds.cache(),
          validation_data=binary_1gram_val_ds.cache(),
          epochs=10,
          callbacks=callbacks)
model = keras.models.load_model("binary_1gram.keras")
print(f"Test accuracy: {model.evaluate(binary_1gram_test_ds)[1]:.3f}")

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 20000)]           0         
                                                                 
 dense (Dense)               (None, 16)                320016    
                                                                 
 dropout (Dropout)           (None, 16)                0         
                                                                 
 dense_1 (Dense)             (None, 1)                 17        
                                                                 
Total params: 320,033
Trainable params: 320,033
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test accuracy: 0.888


### Bigrams with binary encoding

#### Configuring the TextVectorization layer to return bigrams

With bigram encoding, the sentence "the cat sat on the mat" is represented as the set  
["cat", "mat", "on", "sat", "the", "the cat", "cat sat", "sat on", "on the", "the mat"]
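
The bigram set above can be reproduced with a few lines of plain Python. This is only a sketch of the idea; `ngram_tokens` is a hypothetical helper and not how the Keras layer is implemented.

```python
def ngram_tokens(tokens, n=2):
    # Collect every run of 1..n consecutive tokens, joined by a space.
    out = []
    for size in range(1, n + 1):
        out.extend(" ".join(tokens[i:i + size])
                   for i in range(len(tokens) - size + 1))
    return out

cat_tokens = "the cat sat on the mat".split()
bigram_set = set(ngram_tokens(cat_tokens, n=2))
print(len(bigram_set))  # 10
print(sorted(bigram_set))
```

The five extra entries ("the cat", "cat sat", and so on) are what give the model a small amount of local word-order information.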

In [None]:
text_vectorization = TextVectorization(
    ngrams=2,         # set the n-gram size here
    max_tokens=20000,
    output_mode="multi_hot",
)

#### Training and testing the binary bigram model

In [None]:
text_vectorization.adapt(text_only_train_ds)
binary_2gram_train_ds = train_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
binary_2gram_val_ds = val_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
binary_2gram_test_ds = test_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)

model = get_model()
model.summary()
callbacks = [
    keras.callbacks.ModelCheckpoint("binary_2gram.keras",
                                    save_best_only=True)
]
model.fit(binary_2gram_train_ds.cache(),
          validation_data=binary_2gram_val_ds.cache(),
          epochs=10,
          callbacks=callbacks)
model = keras.models.load_model("binary_2gram.keras")
print(f"Test accuracy: {model.evaluate(binary_2gram_test_ds)[1]:.3f}")

Model: "model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_2 (InputLayer)        [(None, 20000)]           0         
                                                                 
 dense_2 (Dense)             (None, 16)                320016    
                                                                 
 dropout_1 (Dropout)         (None, 16)                0         
                                                                 
 dense_3 (Dense)             (None, 1)                 17        
                                                                 
Total params: 320,033
Trainable params: 320,033
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test accuracy: 0.900


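## Processing words as a sequence: the sequence model approach

As the bigram example shows, word order matters too.

Instead of hand-crafting order-based features, a **sequence model** learns such features itself.

How it differs from the previous models: those turned a whole sentence into a single vector (for example, a multi-hot encoding) and fed that vector in at once, whereas a sequence model turns each word into its own vector and feeds these vectors in order to produce the final result.

#### Downloading the data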

In [None]:
!rm -r aclImdb
!curl -O https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xf aclImdb_v1.tar.gz
!rm -r aclImdb/train/unsup

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 80.2M  100 80.2M    0     0  32.2M      0  0:00:02  0:00:02 --:--:-- 32.2M


#### Preparing the data

In [None]:
import os, pathlib, shutil, random
from tensorflow import keras
batch_size = 32
base_dir = pathlib.Path("aclImdb")
val_dir = base_dir / "val"
train_dir = base_dir / "train"

for category in ("neg", "pos"):
    # exist_ok=True keeps the cell re-runnable without a bare except
    os.makedirs(val_dir / category, exist_ok=True)
    files = os.listdir(train_dir / category)
    random.Random(1337).shuffle(files)
    num_val_samples = int(0.2 * len(files))
    val_files = files[-num_val_samples:]
    for fname in val_files:
        shutil.move(train_dir / category / fname,
                    val_dir / category / fname)

train_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/train", batch_size=batch_size
)
val_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/val", batch_size=batch_size
)
test_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/test", batch_size=batch_size
)
text_only_train_ds = train_ds.map(lambda x, y: x)

Found 20000 files belonging to 2 classes.
Found 5000 files belonging to 2 classes.
Found 25000 files belonging to 2 classes.


#### Preparing integer sequence datasets

The most basic approach: convert each word into a one-hot vector and feed the vectors in one at a time.

In [None]:
from tensorflow.keras import layers

max_length = 600      # truncate reviews after 600 tokens to keep the input size manageable
max_tokens = 20000
text_vectorization = layers.TextVectorization(
    max_tokens=max_tokens,
    output_mode="int",
    output_sequence_length=max_length,
)
text_vectorization.adapt(text_only_train_ds)

int_train_ds = train_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
int_val_ds = val_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
int_test_ds = test_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)

In [None]:
for inputs, targets in train_ds:
    print("inputs.shape:", inputs.shape)
    print("inputs.dtype:", inputs.dtype)
    print("targets.shape:", targets.shape)
    print("targets.dtype:", targets.dtype)
    print("inputs[0]:", inputs[0])
    print("targets[0]:", targets[0])
    break

inputs.shape: (32,)
inputs.dtype: <dtype: 'string'>
targets.shape: (32,)
targets.dtype: <dtype: 'int32'>
inputs[0]: tf.Tensor(b'Russell, my fav, is gorgeous in this film. But more than that, the film covers a tremendous range of human passion and sorrow. Everything from marriage to homosexuality is addressed and respected. The film makes the viewer realize that tolerance of other humans provides the route to saving humanity. Fabulous love story between Lachlin and Lil. I replay their scenes over and over again. Anyone who has ever been in love will empathize with these people. All characters are cast and portrayed excellently.', shape=(), dtype=string)
targets[0]: tf.Tensor(1, shape=(), dtype=int32)


#### A sequence model built on one-hot encoded vector sequences

In [None]:
import tensorflow as tf
inputs = keras.Input(shape=(None,), dtype="int64")  # each input is a sequence of integers
embedded = tf.one_hot(inputs, depth=max_tokens)     # encode each integer as a 20,000-dimensional binary vector
x = layers.Bidirectional(layers.LSTM(32))(embedded) # add a bidirectional LSTM layer
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)  # add the final classification layer
model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()

Model: "model_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_3 (InputLayer)        [(None, None)]            0         
                                                                 
 tf.one_hot (TFOpLambda)     (None, None, 20000)       0         
                                                                 
 bidirectional (Bidirectiona  (None, 64)               5128448   
 l)                                                              
                                                                 
 dropout_2 (Dropout)         (None, 64)                0         
                                                                 
 dense_4 (Dense)             (None, 1)                 65        
                                                                 
Total params: 5,128,513
Trainable params: 5,128,513
Non-trainable params: 0
_________________________________________________
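
The parameter counts in this summary can be reproduced by hand. The sketch below assumes the standard LSTM layout of four gates, each with input weights, recurrent weights, and a bias; `lstm_params` is a hypothetical helper, not a Keras function.

```python
def lstm_params(input_dim, units):
    # Four gates, each with (input_dim + units) weights per unit plus one bias per unit.
    return 4 * (input_dim + units + 1) * units

bidirectional_params = 2 * lstm_params(input_dim=20000, units=32)  # forward + backward LSTM
dense_params = 64 * 1 + 1                                          # 64 inputs, 1 output, 1 bias
total_params = bidirectional_params + dense_params

print(bidirectional_params)  # 5128448
print(total_params)          # 5128513
```

Almost all of the weights sit in the LSTM input kernels, because each one has to accept a 20,000-dimensional one-hot vector.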

#### Training a first basic sequence model

In [None]:
callbacks = [
    keras.callbacks.ModelCheckpoint("one_hot_bidir_lstm.keras",
                                    save_best_only=True)
]
model.fit(int_train_ds, validation_data=int_val_ds, epochs=10, callbacks=callbacks)
model = keras.models.load_model("one_hot_bidir_lstm.keras")
print(f"Test accuracy: {model.evaluate(int_test_ds)[1]:.3f}")

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10

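Training is very slow.

- The input is too large.
- Each input sample has shape (600, 20000): up to 600 tokens, each a 20,000-dimensional vector.
- A single movie review therefore consists of 12,000,000 values.

Why the performance is poor:

- Each word was vectorized with one-hot encoding.  
- Only one of the 20,000 entries in each vector is 1; the rest are 0.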
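
A quick back-of-the-envelope calculation shows how large the one-hot input is. The figures assume 4 bytes per value (float32, the default dtype of `tf.one_hot`) and the batch size of 32 used above; they ignore any framework overhead.

```python
max_length = 600        # tokens per review after truncation/padding
max_tokens = 20000      # one-hot depth (vocabulary size)
batch_size = 32
bytes_per_value = 4     # assumption: float32

values_per_review = max_length * max_tokens
bytes_per_review = values_per_review * bytes_per_value
bytes_per_batch = bytes_per_review * batch_size

print(values_per_review)                    # 12000000
print(f"{bytes_per_review / 1e6:.0f} MB")   # 48 MB
print(f"{bytes_per_batch / 1e9:.3f} GB")    # 1.536 GB
```

Of the 12,000,000 values in one review, at most 600 are non-zero, which is why this representation is so wasteful.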

## Understanding word embeddings

### Learning word embeddings with the Embedding layer

#### Instantiating an Embedding layer


In [None]:
embedding_layer = layers.Embedding(input_dim=max_tokens, output_dim=256)

#### Model that uses an Embedding layer trained from scratch

In [None]:
inputs = keras.Input(shape=(None,), dtype="int64")
embedded = layers.Embedding(input_dim=max_tokens, output_dim=256)(inputs)
x = layers.Bidirectional(layers.LSTM(32))(embedded)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()

callbacks = [
    keras.callbacks.ModelCheckpoint("embeddings_bidir_lstm.keras",
                                    save_best_only=True)
]
model.fit(int_train_ds, validation_data=int_val_ds, epochs=10, callbacks=callbacks)
model = keras.models.load_model("embeddings_bidir_lstm.keras")
print(f"Test accuracy: {model.evaluate(int_test_ds)[1]:.3f}")

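Training is faster now.
- Previously each word was a vector of size 20,000.
- With the embedding, each word is a 256-dimensional vector.
- Accuracy is still below the baseline model, possibly because reviews are truncated after 600 tokens.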
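
One way to see why the embedding is so much cheaper: looking up row `i` of an embedding table gives the same result as multiplying a one-hot vector by that table, without ever building the one-hot vector. The toy table and helper names below are hypothetical and use plain Python lists instead of tensors.

```python
# Toy embedding table: 4 tokens, 3 dimensions (values are arbitrary).
embedding_table = [
    [0.0, 0.0, 0.0],
    [0.5, -1.0, 2.0],
    [1.5, 0.25, -0.75],
    [-2.0, 3.0, 0.125],
]

def one_hot(index, depth):
    return [1.0 if i == index else 0.0 for i in range(depth)]

def vector_times_matrix(vector, matrix):
    # (depth,) x (depth, dim) -> (dim,)
    return [sum(v * row[j] for v, row in zip(vector, matrix))
            for j in range(len(matrix[0]))]

token = 2
via_one_hot = vector_times_matrix(one_hot(token, len(embedding_table)), embedding_table)
via_lookup = embedding_table[token]
print(via_one_hot == via_lookup)  # True

# Values per review for the two encodings used in this notebook.
one_hot_values = 600 * 20000
embedded_values = 600 * 256
print(one_hot_values, embedded_values)  # 12000000 153600
```

The embedded representation of one review is roughly 78 times smaller than the one-hot version.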

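### Understanding padding and masking

#### Using an Embedding layer with masking enabled

Because the sequence length is fixed at 600, sentences shorter than 600 tokens are padded with zeros.

Without masking, the model keeps processing these zeros as well. Masking makes it skip the padded positions.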

In [None]:
embedding_layer = layers.Embedding(input_dim=10, output_dim=256, mask_zero=True)
some_input = [
    [4,3,2,1,0,0,0],
    [5,4,3,2,1,0,0],
    [2,1,0,0,0,0,0]
]
mask = embedding_layer.compute_mask(some_input)

In [None]:
mask

<tf.Tensor: shape=(3, 7), dtype=bool, numpy=
array([[ True,  True,  True,  True, False, False, False],
       [ True,  True,  True,  True,  True, False, False],
       [ True,  True, False, False, False, False, False]])>
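
The mask above is equivalent to comparing every token with 0. A plain-Python sketch of the same computation, which also recovers the true length of each padded sequence:

```python
some_input = [
    [4, 3, 2, 1, 0, 0, 0],
    [5, 4, 3, 2, 1, 0, 0],
    [2, 1, 0, 0, 0, 0, 0],
]

# True marks a real token, False marks padding (index 0).
padding_mask = [[token != 0 for token in row] for row in some_input]
real_lengths = [sum(row) for row in padding_mask]

print(padding_mask[2])  # [True, True, False, False, False, False, False]
print(real_lengths)     # [4, 5, 2]
```

This only works because index 0 is reserved for padding in the vocabulary, so no real word is ever encoded as 0.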

Masking does not have to be handled manually; it is enabled with an option.

In [None]:
inputs = keras.Input(shape=(None,), dtype="int64")
embedded = layers.Embedding(
    input_dim=max_tokens, output_dim=256, mask_zero=True)(inputs) # mask_zero=True enables masking
x = layers.Bidirectional(layers.LSTM(32))(embedded)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()

callbacks = [
    keras.callbacks.ModelCheckpoint("embeddings_bidir_lstm_with_masking.keras",
                                    save_best_only=True)
]
model.fit(int_train_ds, validation_data=int_val_ds, epochs=10, callbacks=callbacks)
model = keras.models.load_model("embeddings_bidir_lstm_with_masking.keras")
print(f"Test accuracy: {model.evaluate(int_test_ds)[1]:.3f}")

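This gives a small additional improvement.

### Using pretrained word embeddings

GloVe embeddings precomputed on the 2014 English Wikipedia dataset:

- 822 MB (compressed)
- 400,000 words
- 100-dimensional embedding vectors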

In [None]:
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip -q glove.6B.zip

--2023-04-05 14:42:48--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2023-04-05 14:42:48--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2023-04-05 14:42:49--  https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’



#### Parsing the GloVe word-embeddings file

In [None]:
import numpy as np
path_to_glove_file = "glove.6B.100d.txt"

embeddings_index = {}
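with open(path_to_glove_file, encoding="utf-8") as f:
    for line in f:
        word, coefs = line.split(maxsplit=1)        # first field is the word, the rest are its coefficients
        coefs = np.fromstring(coefs, "f", sep=" ")  # parse the coefficients as float32
        embeddings_index[word] = coefs

print(f"Number of word vectors: {len(embeddings_index)}")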

In [None]:
embeddings_index

In [None]:
embeddings_index.keys()

In [None]:
for key, value in list(embeddings_index.items())[:10]:  # dict views cannot be sliced directly
    print(f"{key} {value}")

#### Preparing the GloVe word-embeddings matrix

In [None]:
embedding_dim = 100

vocabulary = text_vectorization.get_vocabulary()  # retrieve the vocabulary built by the TextVectorization layer
word_index = dict(zip(vocabulary, range(len(vocabulary))))  # map each vocabulary word to its index



In [None]:
embedding_matrix = np.zeros((max_tokens, embedding_dim))  # empty matrix to be filled with GloVe vectors
for word, i in word_index.items():
    if i >= max_tokens:
        continue                                          # skip indices that fall outside the matrix
    embedding_vector = embeddings_index.get(word)         # look up the GloVe vector for this word
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector            # words missing from GloVe stay all-zero
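
It can be useful to check how many vocabulary words actually have a pretrained vector. The sketch below repeats the matrix-filling loop on hypothetical toy data (a two-word "GloVe" index with 2-dimensional vectors) and counts hits and misses; none of these names or values come from the real GloVe file.

```python
toy_glove = {"the": [0.1, 0.2], "cat": [0.3, 0.4]}   # hypothetical pretrained vectors
toy_vocab = ["", "[UNK]", "the", "cat", "zyzzyva"]    # hypothetical vocabulary
toy_dim = 2

toy_matrix = [[0.0] * toy_dim for _ in toy_vocab]
hits = 0
for i, word in enumerate(toy_vocab):
    vector = toy_glove.get(word)
    if vector is not None:
        toy_matrix[i] = list(vector)   # copy the pretrained vector into row i
        hits += 1
misses = len(toy_vocab) - hits

print(hits, misses)    # 2 3
print(toy_matrix[4])   # [0.0, 0.0]
```

Rows for padding, `[UNK]`, and any word absent from the pretrained index remain all-zero.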

In [None]:
embedding_layer = layers.Embedding(
    max_tokens,
    embedding_dim,
    embeddings_initializer=keras.initializers.Constant(embedding_matrix),   # load the pretrained embeddings
    trainable=False,      # freeze the layer so the pretrained values are not updated
    mask_zero=True,
)

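#### Model that uses a pretrained Embedding layer

Instead of the 256-dimensional embedding learned from scratch above, this model uses the pretrained 100-dimensional GloVe vectors.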

In [None]:
inputs = keras.Input(shape=(None,), dtype="int64")
embedded = embedding_layer(inputs)
x = layers.Bidirectional(layers.LSTM(32))(embedded)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()

callbacks = [
    keras.callbacks.ModelCheckpoint("glove_embeddings_sequence_model.keras",
                                    save_best_only=True)
]
model.fit(int_train_ds, validation_data=int_val_ds, epochs=10, callbacks=callbacks)
model = keras.models.load_model("glove_embeddings_sequence_model.keras")
print(f"Test accuracy: {model.evaluate(int_test_ds)[1]:.3f}")

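Here the pretrained embeddings did not help much.

Possible reasons:
- The dataset was large enough to learn the embeddings from scratch.
- Open question: would the result differ if the pretrained GloVe vectors had been a better fit?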

## BoW model vs Sequence Model

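The ratio between the number of samples in the training data and the mean number of words per sample can serve as a guideline.

number of samples / mean sample length
- greater than 1,500: use a sequence model
- less than 1,500: use a bigram model

For the IMDB data there are 20,000 training samples with an average of 233 words each.

number of samples / mean sample length = 20000 / 233 = 85.8 < 1500
- so the bigram model is the better fit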
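
The rule of thumb is easy to wrap in a small function. This is a sketch only: `suggest_model` is a hypothetical helper, and the 1,500 threshold is the heuristic quoted above rather than a hard limit.

```python
def suggest_model(num_samples, mean_words_per_sample, threshold=1500):
    # Ratio of dataset size to average sample length, compared with the heuristic threshold.
    ratio = num_samples / mean_words_per_sample
    choice = "sequence model" if ratio > threshold else "bigram model"
    return ratio, choice

imdb_ratio, imdb_choice = suggest_model(num_samples=20000, mean_words_per_sample=233)
print(f"{imdb_ratio:.1f} -> {imdb_choice}")  # 85.8 -> bigram model
```

By the same rule, a dataset of 500,000 samples averaging 100 words each (a ratio of 5,000) would point toward a sequence model.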

This is a rule of thumb, not an absolute rule.