<a href="https://colab.research.google.com/github/koojayeong/TextAnalysis/blob/main/A_Detailed_Explanation_of_Keras_Embedding_Layer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Loading the Data

In [1]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [2]:
# install kaggle 
!pip install --upgrade --force-reinstall --no-deps kaggle
!kaggle competitions list

Collecting kaggle
  Downloading kaggle-1.5.12.tar.gz (58 kB)
Building wheels for collected packages: kaggle
  Building wheel for kaggle (setup.py) ... done
  Created wheel for kaggle: filename=kaggle-1.5.12-py3-none-any.whl size=73051 sha256=4273bbef0c50f4710c92a2a02ffaaecb6d8bac0f784db8041da077862e675d06
  Stored in directory: /root/.cache/pip/wheels/62/d6/58/5853130f941e75b2177d281eb7e44b4a98ed46dd155f556dc5
Successfully built kaggle
Installing collected packages: kaggle
  Attempting uninstall: kaggle
    Found existing installation: kaggle 1.5.12
    U

In [3]:
import pandas as pd
train = pd.read_csv("/content/drive/MyDrive/kaggle/BagsOfPopcorn/labeledTrainData.tsv.zip", 
                    header=0, # the first line of the file contains column names
                    delimiter="\t", # the fields are separated by tabs
                    quoting=3) # 3 = csv.QUOTE_NONE: treat quote characters as ordinary text

# Keras Embedding Layer

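The Keras Embedding layer can be used when you want to map higher-dimensional data (for example, integer-encoded words from a vocabulary) into a lower-dimensional dense vector space.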
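Conceptually, an embedding layer behaves like a lookup table with one row per integer index. The sketch below imitates that idea in plain Python so it runs without TensorFlow. The names `table` and `embed`, the fixed seed, and the value range are assumptions for illustration only, not the Keras implementation.

```python
import random

vocab_size = 50   # number of rows in the table (plays the role of input_dim)
embed_dim = 8     # length of each word vector (plays the role of output_dim)

rng = random.Random(0)  # fixed seed, assumed here only for reproducibility

# One row of small random numbers per integer index; training would adjust these.
table = [[rng.uniform(-0.05, 0.05) for _ in range(embed_dim)]
         for _ in range(vocab_size)]

def embed(sequence):
    """Return one vector per integer index by plain row lookup."""
    return [table[idx] for idx in sequence]

doc = [49, 17, 19, 16, 4, 9]  # document 1 as encoded later in this notebook
vectors = embed(doc)
print(len(vectors), len(vectors[0]))  # 6 8
```

The same index always returns the same row, which is why a repeated word gets an identical vector wherever it appears.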

# Importing Modules

In [5]:
# Ignore warnings
import warnings 
warnings.filterwarnings('ignore')  # suppress all warnings

# visualization and data manipulation
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import style
import seaborn as sns

# configuration
%matplotlib inline
style.use('fivethirtyeight')
sns.set(style='whitegrid', color_codes=True)

# nltk
import nltk

# stop-words
nltk.download('stopwords')
stop_words = set(nltk.corpus.stopwords.words('english'))

# tokenizing
from nltk import word_tokenize, sent_tokenize

#keras 
import keras
from keras.preprocessing.text import one_hot, Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Flatten, Embedding, Input
from keras.models import Model

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


# Creating a Sample Corpus of Documents (Texts)

In [6]:
sample_text_1="bitty bought a bit of butter"
sample_text_2="but the bit of butter was a bit bitter"
sample_text_3="so she bought some better butter to make the bitter butter better"

corp = [sample_text_1, sample_text_2, sample_text_3]
no_docs=len(corp)

# Integer Encoding 

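Integer encoding  
Each word is represented by an integer.  
For example, "butter" is encoded as 9 in every document in the run shown below.  
  
We use Keras's one_hot function, which hashes each word to an integer. vocab_size should be large enough to make collisions unlikely; in the output below, "to" and "make" both map to 48 because two different words hashed to the same value.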
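To make the hashing idea concrete, here is a simplified stand-in written in plain Python. It is not Keras's code: `hash_encode` is a hypothetical helper, it splits on whitespace only, and it uses md5 so results are stable across runs. The integers it produces will therefore differ from the ones printed in this notebook.

```python
import hashlib

def hash_encode(text, n):
    """Map each lowercase, whitespace-separated word to an integer in [1, n-1]."""
    codes = []
    for word in text.lower().split():
        digest = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16)
        codes.append(digest % (n - 1) + 1)  # index 0 stays free for padding
    return codes

corp = ["bitty bought a bit of butter",
        "but the bit of butter was a bit bitter",
        "so she bought some better butter to make the bitter butter better"]

encoded = [hash_encode(doc, 50) for doc in corp]
for i, codes in enumerate(encoded):
    print("document", i + 1, ":", codes)
```

Because the mapping depends only on the word itself, "butter" receives the same integer in all three documents. With a small `n`, two different words can still land on the same integer.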

In [7]:
vocab_size=50
encod_corp=[]
for i,doc in enumerate(corp):
  encoded = one_hot(doc, vocab_size)  # hash each word to an integer in [1, vocab_size)
  encod_corp.append(encoded)
  print("The encoding for document",i+1,"is : ", encoded)
  

The encoding for document 1 is :  [49, 17, 19, 16, 4, 9]
The encoding for document 2 is :  [45, 23, 16, 4, 9, 39, 19, 16, 1]
The encoding for document 3 is :  [30, 14, 17, 32, 6, 9, 48, 48, 23, 1, 9, 6]


# Padding the docs

Make all documents the same length  
  
The Keras Embedding layer requires every document in a batch to have the same length.   
We use the pad_sequences function for this.
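As a rough picture of what post-padding does, the sketch below pads each sequence on the right with zeros. `pad_post` is a hypothetical helper written for illustration; it drops leading items from over-long sequences, which is assumed here to mirror the pad_sequences default.

```python
def pad_post(sequences, maxlen, value=0):
    """Right-pad each sequence with `value` up to maxlen items."""
    padded = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) > maxlen:
            seq = seq[len(seq) - maxlen:]  # keep only the last maxlen items
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded

# Integer encodings copied from the output printed in this notebook
encod_corp = [[49, 17, 19, 16, 4, 9],
              [45, 23, 16, 4, 9, 39, 19, 16, 1],
              [30, 14, 17, 32, 6, 9, 48, 48, 23, 1, 9, 6]]

pad_corp = pad_post(encod_corp, maxlen=12)
for row in pad_corp:
    print(row)
```

Every row now has 12 entries, and the added zeros never clash with a real word because the hashed encodings start at 1.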

In [9]:
import nltk
nltk.download('punkt')

maxlen= -1
for doc in corp:
  tokens = nltk.word_tokenize(doc)
  if maxlen<len(tokens):
    maxlen=len(tokens)

print("The maximum number of words in any document is : ",
      maxlen)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
The maximum number of words in any document is :  12


In [10]:
# padding
pad_corp = pad_sequences(encod_corp,
                         maxlen=maxlen,
                         padding='post',
                         value=0.0)
print("No of padded documents: ", len(pad_corp))

No of padded documents:  3


In [11]:
for i, doc in enumerate(pad_corp):
  print("The padded encoding for document", i+1,
        "is : ", doc)

The padded encoding for document 1 is :  [49 17 19 16  4  9  0  0  0  0  0  0]
The padded encoding for document 2 is :  [45 23 16  4  9 39 19 16  1  0  0  0]
The padded encoding for document 3 is :  [30 14 17 32  6  9 48 48 23  1  9  6]


# Creating Embedding using Keras Embedding Layer

In [12]:
# specifying the input shape
# (not used by the model below; named doc_input to avoid shadowing Python's built-in input())
doc_input = Input(shape=(no_docs,maxlen), dtype='float64')

In [13]:
'''
shape of input.
each document has 12 elements
'''
word_input = Input(shape=(maxlen,),dtype='int32')  # integer word indices

# creating the embedding
word_embedding = Embedding(input_dim=vocab_size,
                           output_dim=8,
                           input_length=maxlen)(word_input)

word_vec = Flatten()(word_embedding) # flatten
embed_model = Model([word_input], word_vec)


**parameters**
  
* 'input_dim'   
the vocabulary size we chose (vocab_size, 50 here).

* 'output_dim'   
the number of dimensions each word is embedded into (8 here).

* 'input_length'   
the length of the longest document (maxlen, 12 here).
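These three parameters also determine the numbers that appear in the model summary. The arithmetic below is a small check in plain Python; the variable names are chosen here for illustration.

```python
vocab_size = 50   # input_dim
embed_dim = 8     # output_dim
maxlen = 12       # input_length

embedding_params = vocab_size * embed_dim    # one 8-d row per vocabulary index
embedding_shape = (None, maxlen, embed_dim)  # None stands for the batch size
flatten_width = maxlen * embed_dim           # Flatten merges the last two axes

print(embedding_params)  # 400
print(embedding_shape)   # (None, 12, 8)
print(flatten_width)     # 96
```

The embedding layer's only weights are the rows of the lookup table, so its parameter count does not depend on the document length.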

In [16]:
import tensorflow as tf

embed_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
                    loss='binary_crossentropy',
                    metrics=['acc'])

In [18]:
print(type(word_embedding))
print(word_embedding)

<class 'keras.engine.keras_tensor.KerasTensor'>
KerasTensor(type_spec=TensorSpec(shape=(None, 12, 8), dtype=tf.float32, name=None), name='embedding/embedding_lookup/Identity_1:0', description="created by layer 'embedding'")


In [19]:
print(embed_model.summary())

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_2 (InputLayer)        [(None, 12)]              0         
                                                                 
 embedding (Embedding)       (None, 12, 8)             400       
                                                                 
 flatten (Flatten)           (None, 96)                0         
                                                                 
Total params: 400
Trainable params: 400
Non-trainable params: 0
_________________________________________________________________
None


In [20]:
embeddings = embed_model.predict(pad_corp)

In [21]:
print("Shape of embeddings : ", embeddings.shape)
print(embeddings)

Shape of embeddings :  (3, 96)
[[ 0.00736635 -0.00802261 -0.00775436 -0.03623832 -0.03932606 -0.0015465
   0.01801804 -0.03678013 -0.00314764  0.025275   -0.04289455  0.04709974
  -0.0257776   0.04964327 -0.02188355 -0.00025401  0.00933676 -0.04635463
  -0.01522933 -0.04043658  0.04828198 -0.01840921 -0.03717828  0.04608713
  -0.03437523  0.01135999 -0.00389359 -0.0236358   0.03599209 -0.0239241
  -0.01161025 -0.02549825 -0.01337262 -0.01095115 -0.01784123  0.02232993
  -0.01370894 -0.02049881  0.0063342  -0.04695794 -0.01352566 -0.03218435
   0.01499856  0.02114407  0.0248417  -0.02575964  0.02533844 -0.0162251
   0.04456404 -0.01751643  0.03412116 -0.01371187 -0.03960695 -0.01843982
   0.04266334  0.0208007   0.04456404 -0.01751643  0.03412116 -0.01371187
  -0.03960695 -0.01843982  0.04266334  0.0208007   0.04456404 -0.01751643
   0.03412116 -0.01371187 -0.03960695 -0.01843982  0.04266334  0.0208007
   0.04456404 -0.01751643  0.03412116 -0.01371187 -0.03960695 -0.01843982
   0.042663

In [22]:
embeddings = embeddings.reshape(-1, maxlen, 8)
print("Shape of embeddings : ", embeddings.shape)
print(embeddings)

Shape of embeddings :  (3, 12, 8)
[[[ 0.00736635 -0.00802261 -0.00775436 -0.03623832 -0.03932606
   -0.0015465   0.01801804 -0.03678013]
  [-0.00314764  0.025275   -0.04289455  0.04709974 -0.0257776
    0.04964327 -0.02188355 -0.00025401]
  [ 0.00933676 -0.04635463 -0.01522933 -0.04043658  0.04828198
   -0.01840921 -0.03717828  0.04608713]
  [-0.03437523  0.01135999 -0.00389359 -0.0236358   0.03599209
   -0.0239241  -0.01161025 -0.02549825]
  [-0.01337262 -0.01095115 -0.01784123  0.02232993 -0.01370894
   -0.02049881  0.0063342  -0.04695794]
  [-0.01352566 -0.03218435  0.01499856  0.02114407  0.0248417
   -0.02575964  0.02533844 -0.0162251 ]
  [ 0.04456404 -0.01751643  0.03412116 -0.01371187 -0.03960695
   -0.01843982  0.04266334  0.0208007 ]
  [ 0.04456404 -0.01751643  0.03412116 -0.01371187 -0.03960695
   -0.01843982  0.04266334  0.0208007 ]
  [ 0.04456404 -0.01751643  0.03412116 -0.01371187 -0.03960695
   -0.01843982  0.04266334  0.0208007 ]
  [ 0.04456404 -0.01751643  0.03412116 -0

**(3,12,8)**
  
3 : number of documents  
12 : each document consists of 12 words (maxlen)  
8 : each word is an 8-dimensional vector
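The reshape from (3, 96) to (3, 12, 8) only regroups values; nothing is recomputed. A minimal plain-Python sketch of that regrouping for a single document follows. `unflatten` is a hypothetical helper, and the placeholder values 0..95 stand in for real embedding numbers.

```python
def unflatten(flat_row, maxlen, embed_dim):
    """Split one flat row into maxlen consecutive chunks of embed_dim values."""
    if len(flat_row) != maxlen * embed_dim:
        raise ValueError("row length must equal maxlen * embed_dim")
    return [flat_row[i * embed_dim:(i + 1) * embed_dim] for i in range(maxlen)]

flat_doc = list(range(96))  # placeholder for one flattened document
words = unflatten(flat_doc, maxlen=12, embed_dim=8)

print(len(words), len(words[0]))  # 12 8
print(words[1])                   # [8, 9, 10, 11, 12, 13, 14, 15]
```

Chunk `i` holds the vector of the word at position `i`, so the first 8 values belong to the first word, the next 8 to the second, and so on.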

# Getting Encoding for a particular word in a specific document

In [26]:
for i, doc in enumerate(embeddings):
  for j,word in enumerate(doc):
    print("\n\nThe encoding for ", str(j+1)+"th word",
          "in", str(i+1)+"th document is : \n", word)



The encoding for  1th word in 1th document is : 
 [ 0.00736635 -0.00802261 -0.00775436 -0.03623832 -0.03932606 -0.0015465
  0.01801804 -0.03678013]


The encoding for  2th word in 1th document is : 
 [-0.00314764  0.025275   -0.04289455  0.04709974 -0.0257776   0.04964327
 -0.02188355 -0.00025401]


The encoding for  3th word in 1th document is : 
 [ 0.00933676 -0.04635463 -0.01522933 -0.04043658  0.04828198 -0.01840921
 -0.03717828  0.04608713]


The encoding for  4th word in 1th document is : 
 [-0.03437523  0.01135999 -0.00389359 -0.0236358   0.03599209 -0.0239241
 -0.01161025 -0.02549825]


The encoding for  5th word in 1th document is : 
 [-0.01337262 -0.01095115 -0.01784123  0.02232993 -0.01370894 -0.02049881
  0.0063342  -0.04695794]


The encoding for  6th word in 1th document is : 
 [-0.01352566 -0.03218435  0.01499856  0.02114407  0.0248417  -0.02575964
  0.02533844 -0.0162251 ]


The encoding for  7th word in 1th document is : 
 [ 0.04456404 -0.01751643  0.03412116 -0.0137

This makes it easy to visualize 3 documents, each consisting of 12 words, with each word mapped to an 8-dimensional vector.
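To fetch the vector of one particular word rather than printing all of them, index the 3-D structure by document and then by word position. The sketch below uses placeholder vectors instead of real model output, and `word_vector` is a hypothetical helper introduced only for this illustration.

```python
corp = ["bitty bought a bit of butter",
        "but the bit of butter was a bit bitter"]

# Integer codes for the two documents, copied from the padded output in this notebook
codes = [[49, 17, 19, 16, 4, 9, 0, 0, 0, 0, 0, 0],
         [45, 23, 16, 4, 9, 39, 19, 16, 1, 0, 0, 0]]

# Placeholder "embeddings": each 8-d vector simply repeats the word's code
embeddings = [[[float(code)] * 8 for code in doc] for doc in codes]

def word_vector(embeddings, corp, doc_idx, word):
    """Return the vector at the first occurrence of `word` in document doc_idx."""
    position = corp[doc_idx].split().index(word)  # raises ValueError if absent
    return embeddings[doc_idx][position]

v1 = word_vector(embeddings, corp, 0, "butter")
v2 = word_vector(embeddings, corp, 1, "butter")
print(v1 == v2)  # True
```

With real model output the same comparison should hold for any word that shares an integer code across documents, since the layer looks vectors up by index.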

'input_dim' = the vocabulary size we chose

'output_dim' = the number of dimensions to embed into

'input_length' = the length of the longest document