# <font color='orange'> Keywords </font>

- Preprocessing text data into useful representations
- Working with recurrent neural networks
- Using 1D convnets for sequence processing

# <font color = 'orange'> Two fundamental deep-learning algorithms for sequence processing </font>

- RNN
- 1D convnets : one-dimensional version of the 2D convnets

# <font color='orange'> working with text data </font>

**Text Data**

- Text is one of the most widespread forms of sequence data.
- It can be understood as either a sequence of characters or a sequence of words.

**Approach**

- Deep-learning for natural-language processing is pattern recognition applied to words, sentences, and paragraphs, in much the same way that computer vision is pattern recognition appied to pixels.
- Like all other neural network, deep-learning models don't take as input raw text:
    they only work with numeric tensors. vectorizing text is the process of transforming text into numeric tensors.


**Token, tokenization**

- <font color ='red'> [words] Segment text into words, and transform each word into a vector </font>
- [characters] Segment text into characters, and transform each characters, and a vector.
- [n-grams] Extract n-grams of words or characters, and transform each n-gram into a vector. N-grams are overlapping groups of multiple consecutive words or characters

**From Token to (numeric) Vector**

Tokens to  one-hot encoding or <font color='red'> token embedding (word embedding) </font>**

- one-hot encoding (one-hot vector)

    - sparse
    - high dimensional
    - manual encoding


- word embedding

    - dense
    - low dimensional
    - training from data

**Train - Neural Network**

Using Embedding, we can start learn with neural network

## <font color='blue'> 1. Token </font>

In [6]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer

samples = ['I love my dog',
          'I love my cat']

tokenizer = Tokenizer() # Tokenizer객체 생성, num_words인자를 이용해 빈도가 높은 순서대로 단어를 토큰화 할 수 있다. 

tokenizer.fit_on_texts(samples) # 단어 인덱스

sequences = tokenizer.texts_to_sequences(samples) #문자열을 정수 인덱스의 리스트로 변환한다.
word_index = tokenizer.word_index

In [2]:
tokenizer

<keras_preprocessing.text.Tokenizer at 0x7f4a59225f90>

In [3]:
tokenizer.fit_on_texts(samples)

In [4]:
sequences

[[1, 2, 3, 4], [1, 2, 3, 5]]

In [5]:
word_index

{'i': 1, 'love': 2, 'my': 3, 'dog': 4, 'cat': 5}

### 1) Word Indexing 


- the word for example "listen" is represented by number using an encoding scheme (ASCII CODE... etc..)
- But the order of number is important

`listen = [76, 73, 83, 84, 69, 78]`

`silent = [83, 73, 76, 69, 78, 84]`

- this bunch of numbers can then represent the word listen but word silent has the same letters, and thus the same numbers, just in a different order.
- It makes it hard for us to understand sentiment of a word just by the letters in it

- So it might be easier, instead of encoding letters to encode words.
- Consider the sentence I love my dog.
- what would happen if we start encoding the words in this sentence instead of letters in each word?

`sentence1 = {'I':1, "love":2, "my":3, "dog":4}`

`sentence2 = {'I':1, "love":2, "my":3, "cat":5}`

- two sentences are already show some form of similarity between them.
- And it's a similarity you would expect, because they're both about loving a pet.
- Given this method of encoding sentences into number, now let's kate a look at some code to archieve this for us. : This process, is called tokenization

### 2) Sequencing - Turning sentences into data

- creating sequences of numbers from your sentences
- And using tools to process them to make them ready for teaching neural network
- Last time, we saw how to take a set of sentences and use the tokenizer to turn the words into numberic tokens.
- Let's build on that now by also seeing how the senteces containing these words.
- Can be turned into sequences of numbers.
- We'll add another sentence to our set of texts, and I'm doing this because the existing sentences all have four words
- and it's important to see how to manage sentences, or sequences, of different lengths
- The tokenizer supports a method called texts to sequences which performs most of the work for you.
- It creates sequences of tokens representing each sentence.

In [7]:
sentence = ['I love my dog',
            'I love my cat',
            'You love my dog!',
            'Do you think my dog is amazing?']

tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(sentence)
word_index = tokenizer.word_index
word_index

{'my': 1,
 'love': 2,
 'dog': 3,
 'i': 4,
 'you': 5,
 'cat': 6,
 'do': 7,
 'think': 8,
 'is': 9,
 'amazing': 10}

In [8]:
sequences = tokenizer.texts_to_sequences(sentence)
sequences

[[4, 2, 1, 3], [4, 2, 1, 6], [5, 2, 1, 3], [7, 5, 8, 1, 3, 9, 10]]

### 3) Basic Tokenization Done!


- this is all very well for getting data ready for training a neural network,
- but what happens when that neural network needs to classify texts, but there are words in the text that it has never seen before?
- This can confuse the tokenizer, so we'll look at how to handle that next.

### 4) OOV 처리, (Out of Vocabrary)

In [9]:
test_data = ['i really love my dog',
             'my dog loves my manatee']

test_seq = tokenizer.texts_to_sequences(test_data)
test_seq

[[4, 2, 1, 3], [1, 3, 1]]

In [10]:
tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")
tokenizer.fit_on_texts(sentence)
word_index = tokenizer.word_index
word_index

{'<OOV>': 1,
 'my': 2,
 'love': 3,
 'dog': 4,
 'i': 5,
 'you': 6,
 'cat': 7,
 'do': 8,
 'think': 9,
 'is': 10,
 'amazing': 11}

### 5) padding

In [11]:
test_seq = tokenizer.texts_to_sequences(test_data)
test_seq

[[5, 1, 3, 2, 4], [2, 4, 1, 2, 1]]

In [12]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

padded = pad_sequences(sequences)
padded

array([[ 0,  0,  0,  4,  2,  1,  3],
       [ 0,  0,  0,  4,  2,  1,  6],
       [ 0,  0,  0,  5,  2,  1,  3],
       [ 7,  5,  8,  1,  3,  9, 10]], dtype=int32)

In [13]:
padded = pad_sequences(sequences, padding='post')
padded

array([[ 4,  2,  1,  3,  0,  0,  0],
       [ 4,  2,  1,  6,  0,  0,  0],
       [ 5,  2,  1,  3,  0,  0,  0],
       [ 7,  5,  8,  1,  3,  9, 10]], dtype=int32)

In [14]:
padded = pad_sequences(sequences, padding='post', maxlen=5)
padded

array([[ 4,  2,  1,  3,  0],
       [ 4,  2,  1,  6,  0],
       [ 5,  2,  1,  3,  0],
       [ 8,  1,  3,  9, 10]], dtype=int32)

In [15]:
padded = pad_sequences(sequences, padding='post', truncating='post', maxlen=5)
padded

array([[4, 2, 1, 3, 0],
       [4, 2, 1, 6, 0],
       [5, 2, 1, 3, 0],
       [7, 5, 8, 1, 3]], dtype=int32)

## <font color='blue'> 2 Numeric vector </font>

### 1) one-hot encoding (one-hot vector) 

In [21]:
one_hot_results = tokenizer.texts_to_matrix(samples, mode='binary')
one_hot_results

array([[0., 1., 1., 1., 1., 0.],
       [0., 1., 1., 1., 0., 1.]])

### 2) word embedding

In [23]:
from tensorflow.keras.layers import Embedding

embedding_layer = Embedding(1000, 64) # 최소 2개의 인자를 받는다, 토큰의 갯수, 임베딩의 차원
embedding_layer

<tensorflow.python.keras.layers.embeddings.Embedding at 0x7f0b8bbb5450>

**<font color='red'>보통 Deep leaerning에는 word embedding을 많이 사용한다.!</font>**

https://www.tensorflow.org/tutorials/text/word_embeddings

In [27]:
import tensorflow as tf

check = embedding_layer(tf.constant([1,2,3]))
print(check.numpy())
print(check.numpy().shape)
print(check.numpy().ndim)
print(check.numpy().size)

[[ 0.0058084   0.01899623  0.03751728  0.01505322 -0.0078648   0.02411301
   0.0432946   0.03782305  0.0476147  -0.00416829  0.04019305 -0.04721996
  -0.02458578 -0.03131541 -0.04732466  0.03429008 -0.00307468  0.01407934
  -0.03312021  0.02726435 -0.0258376   0.01107335 -0.02803576  0.01268565
   0.0117619   0.03730077 -0.00888462  0.04984557  0.0351714   0.02382283
   0.04052938  0.00780154  0.02174998  0.00192178 -0.01092023 -0.04568827
  -0.00639306 -0.00019554  0.02904104  0.02959052  0.03528127  0.01469796
  -0.00973715  0.03643007  0.01550866  0.03681531  0.00551854  0.02855345
   0.01155401  0.00609125  0.02270531 -0.00458912  0.02443899 -0.01230549
   0.04663397 -0.00156844  0.03369036 -0.00010158 -0.03046843  0.02773331
  -0.00327218 -0.01701242 -0.02675983  0.00358992]
 [ 0.02913376 -0.00439787  0.01085889  0.01547484 -0.03544807 -0.04792522
   0.00144672 -0.04208839 -0.017596    0.03704337  0.00145243 -0.02046359
   0.03297393 -0.0224851   0.01063949  0.00953243 -0.04315984

In [33]:
result = embedding_layer(tf.constant([[0,1,2],[3,4,5]]))
result.shape

TensorShape([2, 3, 64])

In [34]:
result.ndim

3

## <font color='blue'> 3. Training - NN</font>

### 1) load data

In [18]:
import pandas as pd
df = pd.read_json("./Sarcasm_Headlines_Dataset.json", lines=True)
df.head()

Unnamed: 0,article_link,headline,is_sarcastic
0,https://www.huffingtonpost.com/entry/versace-b...,former versace store clerk sues over secret 'b...,0
1,https://www.huffingtonpost.com/entry/roseanne-...,the 'roseanne' revival catches up to our thorn...,0
2,https://local.theonion.com/mom-starting-to-fea...,mom starting to fear son's web series closest ...,1
3,https://politics.theonion.com/boehner-just-wan...,"boehner just wants wife to listen, not come up...",1
4,https://www.huffingtonpost.com/entry/jk-rowlin...,j.k. rowling wishes snape happy birthday in th...,0


In [19]:
sentences = list(df['article_link'])
labels = list(df['is_sarcastic'])
urls = list(df['article_link'])

### 2) Tokenization Data

In [22]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences


tokenizer = Tokenizer(oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

sequences = tokenizer.texts_to_sequences(sentences)
padded = pad_sequences(sequences, padding='post')

print(padded[0])
print(padded.shape)

[    2     4     5     3     6 12731    95  2105     8 12732     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0]
(26709, 46)


In [25]:
len(sentences)

26709

### 2) Tokenization Data - train / test data split

In [23]:
training_size = 20000
training_sentences = sentences[0:training_size]
testing_sentences = sentences[training_size:]
training_labels = labels[0:training_size]
testing_labels = labels[training_size:]
vocab_size = 10000
embedding_dim = 16
max_length = 100
trunc_type='post'
padding_type='post'
oov_tok = "<OOV>"

In [28]:

tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(training_sentences)

word_index = tokenizer.word_index

training_sequences = tokenizer.texts_to_sequences(training_sentences)
training_padded = pad_sequences(training_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)

testing_sequences = tokenizer.texts_to_sequences(testing_sentences)
testing_padded = pad_sequences(testing_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)

print(training_padded[0])
print(training_padded.shape)
print(testing_padded[0])
print(testing_padded.shape)

[   2    4    5    3    6    1   93 1840    8    1    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0]
(20000, 100)
[   2    4    7    3    1  961 6935 6017   68    1 3031    1    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0  

### 3) Build a Model

In [36]:
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(6, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])

### 4) Training

훈련을 시키기 위해선, 특성 행렬과 타겟 벡터 형태로!

In [37]:
import numpy as np
training_padded = np.array(training_padded)
training_labels = np.array(training_labels)
testing_padded = np.array(testing_padded)
testing_labels = np.array(testing_labels)
print(testing_padded.shape)
print(testing_labels.shape)

(6709, 100)
(6709,)


In [39]:
num_epochs = 30
history = model.fit(training_padded, training_labels, epochs=num_epochs, validation_data=(testing_padded, testing_labels), verbose=2)

Epoch 1/30
625/625 - 1s - loss: 7.2134e-04 - accuracy: 1.0000 - val_loss: 7.9016e-04 - val_accuracy: 1.0000
Epoch 2/30
625/625 - 1s - loss: 4.1782e-04 - accuracy: 1.0000 - val_loss: 5.0616e-04 - val_accuracy: 1.0000
Epoch 3/30
625/625 - 1s - loss: 2.5761e-04 - accuracy: 1.0000 - val_loss: 3.4227e-04 - val_accuracy: 1.0000
Epoch 4/30
625/625 - 1s - loss: 1.6447e-04 - accuracy: 1.0000 - val_loss: 2.3583e-04 - val_accuracy: 1.0000
Epoch 5/30
625/625 - 1s - loss: 1.0745e-04 - accuracy: 1.0000 - val_loss: 1.6677e-04 - val_accuracy: 1.0000
Epoch 6/30
625/625 - 1s - loss: 7.1384e-05 - accuracy: 1.0000 - val_loss: 1.2029e-04 - val_accuracy: 1.0000
Epoch 7/30
625/625 - 1s - loss: 4.7963e-05 - accuracy: 1.0000 - val_loss: 9.0381e-05 - val_accuracy: 1.0000
Epoch 8/30
625/625 - 1s - loss: 3.2607e-05 - accuracy: 1.0000 - val_loss: 7.0274e-05 - val_accuracy: 1.0000
Epoch 9/30
625/625 - 1s - loss: 2.2509e-05 - accuracy: 1.0000 - val_loss: 4.9447e-05 - val_accuracy: 1.0000
Epoch 10/30
625/625 - 1s - l

### 5) Predict

In [35]:
sentence = ["granny starting to fear spiders in the garden might be real", "game of thrones season finale showing this sunday night"]
sequences = tokenizer.texts_to_sequences(sentence)
padded = pad_sequences(sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)
print(model.predict(padded))

[[0.99607193]
 [0.9766431 ]]


### 1) load_data - tfds

In [41]:
!pip install tensorflow_datasets

Collecting tensorflow_datasets
  Downloading tensorflow_datasets-3.1.0-py3-none-any.whl (3.3 MB)
[K     |████████████████████████████████| 3.3 MB 905 kB/s eta 0:00:01
[?25hCollecting dill
  Downloading dill-0.3.2.zip (177 kB)
[K     |████████████████████████████████| 177 kB 11.2 MB/s eta 0:00:01
Collecting promise
  Downloading promise-2.3.tar.gz (19 kB)
Collecting tensorflow-metadata
  Downloading tensorflow_metadata-0.22.2-py2.py3-none-any.whl (32 kB)
Collecting googleapis-common-protos
  Downloading googleapis_common_protos-1.52.0-py2.py3-none-any.whl (100 kB)
[K     |████████████████████████████████| 100 kB 9.1 MB/s eta 0:00:01
Building wheels for collected packages: dill, promise
  Building wheel for dill (setup.py) ... [?25ldone
[?25h  Created wheel for dill: filename=dill-0.3.2-py3-none-any.whl size=78912 sha256=c15109f45a1cdd48afcb070fa623d891648a613009924b6e2058af92a6262e92
  Stored in directory: /home/reisei88/.cache/pip/wheels/72/6b/d5/5548aa1b73b8c3d176ea13f9f92066b02

In [42]:
import tensorflow_datasets as tfds

imdb, info = tfds.load("imdb_reviews/subwords8k", with_info=True, as_supervised=True)



[1mDownloading and preparing dataset imdb_reviews/subwords8k/1.0.0 (download: 80.23 MiB, generated: Unknown size, total: 80.23 MiB) to /home/reisei88/tensorflow_datasets/imdb_reviews/subwords8k/1.0.0...[0m


HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Dl Completed...', max=1.0, style=Progre…

HBox(children=(FloatProgress(value=1.0, bar_style='info', description='Dl Size...', max=1.0, style=ProgressSty…







HBox(children=(FloatProgress(value=1.0, bar_style='info', max=1.0), HTML(value='')))

Shuffling and writing examples to /home/reisei88/tensorflow_datasets/imdb_reviews/subwords8k/1.0.0.incompleteIUF8HG/imdb_reviews-train.tfrecord


HBox(children=(FloatProgress(value=0.0, max=25000.0), HTML(value='')))



HBox(children=(FloatProgress(value=1.0, bar_style='info', max=1.0), HTML(value='')))

Shuffling and writing examples to /home/reisei88/tensorflow_datasets/imdb_reviews/subwords8k/1.0.0.incompleteIUF8HG/imdb_reviews-test.tfrecord


HBox(children=(FloatProgress(value=0.0, max=25000.0), HTML(value='')))



HBox(children=(FloatProgress(value=1.0, bar_style='info', max=1.0), HTML(value='')))

Shuffling and writing examples to /home/reisei88/tensorflow_datasets/imdb_reviews/subwords8k/1.0.0.incompleteIUF8HG/imdb_reviews-unsupervised.tfrecord


HBox(children=(FloatProgress(value=0.0, max=50000.0), HTML(value='')))

[1mDataset imdb_reviews downloaded and prepared to /home/reisei88/tensorflow_datasets/imdb_reviews/subwords8k/1.0.0. Subsequent calls will reuse this data.[0m


In [50]:
train_data, test_data = imdb['train'], imdb['test']
tokenizer = info.features['text'].encoder
print(tokenizer.subwords)



In [51]:
sample_string = 'Tensor Flow, from basic to mastery'

tokenized_string = tokenizer.encode(sample_string)
print("Tokenized string is {}".format(tokenized_string))

original_string = tokenizer.decode(tokenized_string)
print("Original string is {}".format(original_string))

Tokenized string is [6307, 2327, 7961, 4043, 2120, 2, 48, 2715, 7, 2652, 8050]
Original string is Tensor Flow, from basic to mastery


In [52]:
BUFFER_SIZE = 10000
BATCH_SIZE = 64

train_dataset = train_data.shuffle(BUFFER_SIZE)
train_dataset = train_dataset.padded_batch(BATCH_SIZE, tf.compat.v1.data.get_output_shapes(train_dataset))
test_dataset = test_data.padded_batch(BATCH_SIZE, tf.compat.v1.data.get_output_shapes(test_data))

In [53]:
embdding_dim = 64

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(tokenizer.vocab_size, embedding_dim),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(6, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

model.summary()

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, None, 16)          130960    
_________________________________________________________________
global_average_pooling1d_3 ( (None, 16)                0         
_________________________________________________________________
dense_6 (Dense)              (None, 6)                 102       
_________________________________________________________________
dense_7 (Dense)              (None, 1)                 7         
Total params: 131,069
Trainable params: 131,069
Non-trainable params: 0
_________________________________________________________________


In [57]:
train_data

<PrefetchDataset shapes: ((None,), ()), types: (tf.int64, tf.int64)>

In [54]:
num_epochs = 10
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
history = model.fit(train_data, epochs=num_epochs, validation_data=test_data)

Epoch 1/10


ValueError: in user code:

    /home/reisei88/anaconda3/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py:571 train_function  *
        outputs = self.distribute_strategy.run(
    /home/reisei88/anaconda3/lib/python3.7/site-packages/tensorflow/python/distribute/distribute_lib.py:951 run  **
        return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
    /home/reisei88/anaconda3/lib/python3.7/site-packages/tensorflow/python/distribute/distribute_lib.py:2290 call_for_each_replica
        return self._call_for_each_replica(fn, args, kwargs)
    /home/reisei88/anaconda3/lib/python3.7/site-packages/tensorflow/python/distribute/distribute_lib.py:2649 _call_for_each_replica
        return fn(*args, **kwargs)
    /home/reisei88/anaconda3/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py:533 train_step  **
        y, y_pred, sample_weight, regularization_losses=self.losses)
    /home/reisei88/anaconda3/lib/python3.7/site-packages/tensorflow/python/keras/engine/compile_utils.py:205 __call__
        loss_value = loss_obj(y_t, y_p, sample_weight=sw)
    /home/reisei88/anaconda3/lib/python3.7/site-packages/tensorflow/python/keras/losses.py:143 __call__
        losses = self.call(y_true, y_pred)
    /home/reisei88/anaconda3/lib/python3.7/site-packages/tensorflow/python/keras/losses.py:246 call
        return self.fn(y_true, y_pred, **self._fn_kwargs)
    /home/reisei88/anaconda3/lib/python3.7/site-packages/tensorflow/python/keras/losses.py:1595 binary_crossentropy
        K.binary_crossentropy(y_true, y_pred, from_logits=from_logits), axis=-1)
    /home/reisei88/anaconda3/lib/python3.7/site-packages/tensorflow/python/keras/backend.py:4692 binary_crossentropy
        return nn.sigmoid_cross_entropy_with_logits(labels=target, logits=output)
    /home/reisei88/anaconda3/lib/python3.7/site-packages/tensorflow/python/ops/nn_impl.py:172 sigmoid_cross_entropy_with_logits
        (logits.get_shape(), labels.get_shape()))

    ValueError: logits and labels must have the same shape ((None, 1) vs ())


# <font color = 'orange'> API Spec </font>

**Tokenizer**

<font color ='red'> Tensorflow의 text preprocessing에 있음 </font>

https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text

Tokenizer: https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer

text to word sequence: https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/text_to_word_sequence

**Embedding**

<font color ='red'> Tensorflow의 Layer에 있음. </font>

https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding

Embedding : https://www.tensorflow.org/tutorials/text/word_embeddings

# Embedding 이란?

- 범주형 데이터를 numeric Vector로 나타내는 것. 이렇게 숫여 벡터로 나타내면 좌표 평면에도 그릴 수 있고, 연산하기 쉽다. 

https://developers.google.com/machine-learning/glossary#embeddings

https://developers.google.com/machine-learning/recommendation/overview/terminology

https://heung-bae-lee.github.io/2020/01/16/NLP_01/