## **2. Word2Vec**
1. Convert the given words into a form that can be fed to a word2vec model.
2. Implement the CBOW and Skip-gram models.
3. Train the models and check the results.

### **Import required packages**

In [None]:
!pip install konlpy


In [None]:
from tqdm import tqdm
from konlpy.tag import Okt
from torch import nn
from torch.nn import functional as F
from torch.utils.data import Dataset, DataLoader
from collections import defaultdict

import torch
import copy
import numpy as np

### **Data preprocessing**

Inspect the data and preprocess it into the format Word2Vec expects.  
The training data is the same as in Exercise 1, and we assume the test words shown below.

In [None]:
train_data = [
  "정말 맛있습니다. 추천합니다.",
  "기대했던 것보단 별로였네요.",
  "다 좋은데 가격이 너무 비싸서 다시 가고 싶다는 생각이 안 드네요.",
  "완전 최고입니다! 재방문 의사 있습니다.",
  "음식도 서비스도 다 만족스러웠습니다.",
  "위생 상태가 좀 별로였습니다. 좀 더 개선되기를 바랍니다.",
  "맛도 좋았고 직원분들 서비스도 너무 친절했습니다.",
  "기념일에 방문했는데 음식도 분위기도 서비스도 다 좋았습니다.",
  "전반적으로 음식이 너무 짰습니다. 저는 별로였네요.",
  "위생에 조금 더 신경 썼으면 좋겠습니다. 조금 불쾌했습니다."       
]

test_words = ["음식", "맛", "서비스", "위생", "가격"]

Tokenization and vocabulary construction are similar to the previous exercise.

In [None]:
tokenizer = Okt()

In [None]:
def make_tokenized(data):
  tokenized = []
  for sent in tqdm(data):
    tokens = tokenizer.morphs(sent, stem=True)
    tokenized.append(tokens)

  return tokenized

In [None]:
train_tokenized = make_tokenized(train_data)

100%|██████████| 10/10 [00:05<00:00,  1.85it/s]


In [None]:
word_count = defaultdict(int)

for tokens in tqdm(train_tokenized):
  for token in tokens:
    word_count[token] += 1

100%|██████████| 10/10 [00:00<00:00, 53092.46it/s]


In [None]:
word_count = sorted(word_count.items(), key=lambda x: x[1], reverse=True)
print(len(word_count))

60


In [None]:
w2i = {}
for pair in tqdm(word_count):
  if pair[0] not in w2i:
    w2i[pair[0]] = len(w2i)

100%|██████████| 60/60 [00:00<00:00, 78471.54it/s]


In [None]:
print(train_tokenized)
print(w2i)

[['정말', '맛있다', '.', '추천', '하다', '.'], ['기대하다', '것', '보단', '별로', '이다', '.'], ['다', '좋다', '가격', '이', '너무', '비싸다', '다시', '가다', '싶다', '생각', '이', '안', '드네', '요', '.'], ['완전', '최고', '이다', '!', '재', '방문', '의사', '있다', '.'], ['음식', '도', '서비스', '도', '다', '만족스럽다', '.'], ['위생', '상태', '가', '좀', '별로', '이다', '.', '좀', '더', '개선', '되다', '기르다', '바라다', '.'], ['맛', '도', '좋다', '직원', '분들', '서비스', '도', '너무', '친절하다', '.'], ['기념일', '에', '방문', '하다', '음식', '도', '분위기', '도', '서비스', '도', '다', '좋다', '.'], ['전반', '적', '으로', '음식', '이', '너무', '짜다', '.', '저', '는', '별로', '이다', '.'], ['위생', '에', '조금', '더', '신경', '써다', '좋다', '.', '조금', '불쾌하다', '.']]
{'.': 0, '도': 1, '이다': 2, '좋다': 3, '별로': 4, '다': 5, '이': 6, '너무': 7, '음식': 8, '서비스': 9, '하다': 10, '방문': 11, '위생': 12, '좀': 13, '더': 14, '에': 15, '조금': 16, '정말': 17, '맛있다': 18, '추천': 19, '기대하다': 20, '것': 21, '보단': 22, '가격': 23, '비싸다': 24, '다시': 25, '가다': 26, '싶다': 27, '생각': 28, '안': 29, '드네': 30, '요': 31, '완전': 32, '최고': 33, '!': 34, '재': 35, '의사': 36, '있다': 37, '만족스럽다': 38, '상태

To build the inputs that the models actually consume, define `Dataset` classes.

In [None]:
class CBOWDataset(Dataset):
  def __init__(self, train_tokenized, window_size=2):
    self.x = []
    self.y = []

    for tokens in tqdm(train_tokenized):
      token_ids = [w2i[token] for token in tokens]
      for i, token_id in enumerate(token_ids):
        # Keep only positions that have a full window on both sides.
        if i-window_size >= 0 and i+window_size < len(token_ids):
          self.x.append(token_ids[i-window_size:i] + token_ids[i+1:i+window_size+1]) # context words (window)
          self.y.append(token_id) # center word

    self.x = torch.LongTensor(self.x)  # (number of samples, 2 * window_size)
    self.y = torch.LongTensor(self.y)  # (number of samples)

  def __len__(self):
    return self.x.shape[0]

  def __getitem__(self, idx):
    return self.x[idx], self.y[idx]

In [None]:
class SkipGramDataset(Dataset):
  def __init__(self, train_tokenized, window_size=2):
    self.x = []
    self.y = []

    for tokens in tqdm(train_tokenized):
      token_ids = [w2i[token] for token in tokens]
      for i, token_id in enumerate(token_ids):
        # Keep only positions that have a full window on both sides.
        if i-window_size >= 0 and i+window_size < len(token_ids):
          self.y += (token_ids[i-window_size:i] + token_ids[i+1:i+window_size+1]) # context words (targets)
          self.x += [token_id] * 2 * window_size # center word, repeated once per context word

    self.x = torch.LongTensor(self.x)  # (number of samples)
    self.y = torch.LongTensor(self.y)  # (number of samples)

  def __len__(self):
    return self.x.shape[0]

  def __getitem__(self, idx):
    return self.x[idx], self.y[idx]
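
The windowing logic in the two `Dataset` classes can be checked by hand on a toy sentence. The sketch below is plain Python, not part of the notebook's pipeline; `cbow_pairs`, `skipgram_pairs`, and the ids in `toy_ids` are names and values made up for illustration.

```python
def cbow_pairs(token_ids, window_size=2):
    """Return (context, center) pairs for positions with a full window on both sides."""
    pairs = []
    for i, center in enumerate(token_ids):
        if i - window_size >= 0 and i + window_size < len(token_ids):
            context = token_ids[i - window_size:i] + token_ids[i + 1:i + window_size + 1]
            pairs.append((context, center))
    return pairs


def skipgram_pairs(token_ids, window_size=2):
    """Return (center, context_word) pairs: one pair per context word."""
    pairs = []
    for context, center in cbow_pairs(token_ids, window_size):
        pairs.extend((center, c) for c in context)
    return pairs


toy_ids = [10, 11, 12, 13, 14, 15]  # made-up ids for a six-token sentence
print(cbow_pairs(toy_ids))
print(skipgram_pairs(toy_ids))
```

Only positions 2 and 3 have two neighbours on each side, so the six-token sentence yields two CBOW samples and eight Skip-gram pairs. Tokens near the sentence boundaries never act as a center word under this rule.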

Create a `Dataset` object for each model.

In [None]:
cbow_set = CBOWDataset(train_tokenized)
skipgram_set = SkipGramDataset(train_tokenized)

100%|██████████| 10/10 [00:00<00:00, 22894.67it/s]
100%|██████████| 10/10 [00:00<00:00, 13495.19it/s]
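
As a sanity check, the dataset sizes can be estimated from sentence lengths alone. This is a rough sketch: the lengths in `sentence_lengths` are counted by hand from the printed `train_tokenized` output above, and `num_batches` is a hypothetical helper. The batch size of 4 is the value the notebook sets later.

```python
# Token counts per sentence, read off the printed train_tokenized output.
sentence_lengths = [6, 6, 15, 9, 7, 14, 10, 13, 13, 11]
window_size = 2
batch_size = 4  # value used later in this notebook

# A sentence of n tokens has n - 2 * window_size positions with a full window.
cbow_samples = sum(max(0, n - 2 * window_size) for n in sentence_lengths)
# Skip-gram emits one pair per context word, i.e. 2 * window_size pairs per position.
skipgram_samples = cbow_samples * 2 * window_size


def num_batches(num_samples, batch_size):
    """Ceiling division: the last batch may be smaller."""
    return -(-num_samples // batch_size)


print(cbow_samples, skipgram_samples)          # 64 256
print(num_batches(cbow_samples, batch_size))   # 16
print(num_batches(skipgram_samples, batch_size))  # 64
```

These batch counts agree with the `16/16` and `64/64` totals shown by the progress bars in the training output.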


### **Model class implementation**

Implement the two Word2Vec models in turn.  


*   `self.embedding`: a layer that embeds a `vocab_size`-dimensional one-hot vector into a `dim`-dimensional vector.
*   `self.linear`: a layer that projects the embedding vector back to `vocab_size` dimensions.
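
The description of the embedding layer as a map from one-hot vectors can be made concrete: multiplying a one-hot vector by a `(vocab_size, dim)` matrix selects a single row, which is why an embedding layer can take integer ids directly. The sketch below uses plain Python lists and a made-up 4 x 3 matrix; `one_hot` and `vec_mat` are hypothetical helpers, not PyTorch functions.

```python
vocab_size, dim = 4, 3

# Made-up embedding matrix of shape (vocab_size, dim).
weight = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]


def one_hot(index, size):
    """Return a list of length `size` with 1.0 at `index` and 0.0 elsewhere."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def vec_mat(vec, mat):
    """Multiply a row vector of length V by a (V, d) matrix."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]


word_id = 2
via_matmul = vec_mat(one_hot(word_id, vocab_size), weight)
via_lookup = weight[word_id]
print(via_matmul == via_lookup)
```

Both routes give the same vector, so the lookup form avoids building the one-hot vector at all.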


In [None]:
class CBOW(nn.Module):
  def __init__(self, vocab_size, dim):
    super(CBOW, self).__init__()
    self.embedding = nn.Embedding(vocab_size, dim, sparse=True)
    self.linear = nn.Linear(dim, vocab_size)

  # B: batch size, W: window size, d_w: word embedding size, V: vocab size
  def forward(self, x):  # x: (B, 2W)
    embeddings = self.embedding(x)  # (B, 2W, d_w)
    embeddings = torch.sum(embeddings, dim=1)  # (B, d_w)
    output = self.linear(embeddings)  # (B, V)
    return output

In [None]:
class SkipGram(nn.Module):
  def __init__(self, vocab_size, dim):
    super(SkipGram, self).__init__()
    self.embedding = nn.Embedding(vocab_size, dim, sparse=True)
    self.linear = nn.Linear(dim, vocab_size)

  # B: batch size, W: window size, d_w: word embedding size, V: vocab size
  def forward(self, x): # x: (B)
    embeddings = self.embedding(x)  # (B, d_w)
    output = self.linear(embeddings)  # (B, V)
    return output

Create the two models.

In [None]:
cbow = CBOW(vocab_size=len(w2i), dim=256)
skipgram = SkipGram(vocab_size=len(w2i), dim=256)

### **Model training**

Set the hyperparameters as follows and create the `DataLoader` objects.

In [None]:
batch_size = 4
learning_rate = 5e-4
num_epochs = 5
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

cbow_loader = DataLoader(cbow_set, batch_size=batch_size)
skipgram_loader = DataLoader(skipgram_set, batch_size=batch_size)

First, train the CBOW model.
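
To help read the per-batch losses that the training loops print, it is useful to know what cross-entropy gives for a model that has learned nothing. `nn.CrossEntropyLoss` takes raw logits and applies log-softmax followed by negative log-likelihood. If every word receives the same logit, the loss is `ln(V)`. The sketch below reimplements that computation for a single example in plain Python; `cross_entropy` is a hypothetical helper, and the vocabulary size of 60 is taken from the printed `len(word_count)`.

```python
import math


def cross_entropy(logits, target):
    """Negative log-softmax probability of `target`, computed stably."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum_exp - logits[target]


vocab_size = 60  # printed len(word_count) in this notebook
uniform_loss = -math.log(1.0 / vocab_size)

# All-zero logits correspond to a uniform prediction over the vocabulary.
zero_logits = [0.0] * vocab_size
print(f"{uniform_loss:.3f}")  # 4.094
print(math.isclose(cross_entropy(zero_logits, target=0), uniform_loss))
```

The logged losses in this notebook mostly fall between about 3.3 and 5.0, which is close to this uniform baseline.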

In [None]:
cbow.train()
cbow = cbow.to(device)
optim = torch.optim.SGD(cbow.parameters(), lr=learning_rate)
loss_function = nn.CrossEntropyLoss()

for e in range(1, num_epochs+1):
  print("#" * 50)
  print(f"Epoch: {e}")
  for batch in tqdm(cbow_loader):
    x, y = batch
    x, y = x.to(device), y.to(device) # (B, 2W), (B)
    output = cbow(x)  # (B, V)

    optim.zero_grad()
    loss = loss_function(output, y)
    loss.backward()
    optim.step()

    print(f"Train loss: {loss.item()}")

print("Finished.")

  6%|▋         | 1/16 [00:00<00:02,  5.61it/s]

##################################################
Epoch: 1
Train loss: 4.76423454284668
Train loss: 3.877540111541748
Train loss: 4.697382926940918
Train loss: 4.866057872772217
Train loss: 4.158361911773682
Train loss: 4.261722564697266
Train loss: 4.393556118011475
Train loss: 4.680698871612549
Train loss: 4.943302154541016
Train loss: 4.737784385681152
Train loss: 4.804426670074463
Train loss: 3.6463236808776855
Train loss: 3.7916486263275146
Train loss: 4.352444648742676


100%|██████████| 16/16 [00:00<00:00, 78.24it/s]
100%|██████████| 16/16 [00:00<00:00, 705.83it/s]
100%|██████████| 16/16 [00:00<00:00, 694.64it/s]
100%|██████████| 16/16 [00:00<00:00, 665.89it/s]
100%|██████████| 16/16 [00:00<00:00, 686.55it/s]

Train loss: 4.224295616149902
Train loss: 4.728593349456787
##################################################
Epoch: 2
Train loss: 4.583186626434326
Train loss: 3.737147331237793
Train loss: 4.579419136047363
Train loss: 4.729487419128418
Train loss: 4.03863525390625
Train loss: 4.00961971282959
Train loss: 4.2638468742370605
Train loss: 4.5540900230407715
Train loss: 4.815417766571045
Train loss: 4.570962429046631
Train loss: 4.616819858551025
Train loss: 3.338893413543701
Train loss: 3.6600136756896973
Train loss: 4.2322998046875
Train loss: 4.074901103973389
Train loss: 4.562763214111328
##################################################
Epoch: 3
Train loss: 4.406665802001953
Train loss: 3.6002299785614014
Train loss: 4.463200569152832
Train loss: 4.595195293426514
Train loss: 3.920806884765625
Train loss: 3.769679546356201
Train loss: 4.136260986328125
Train loss: 4.430707931518555
Train loss: 4.692493438720703
Train loss: 4.410676956176758
Train loss: 4.438283920288086
Train loss




Next, train the Skip-gram model.

In [None]:
skipgram.train()
skipgram = skipgram.to(device)
optim = torch.optim.SGD(skipgram.parameters(), lr=learning_rate)
loss_function = nn.CrossEntropyLoss()

for e in range(1, num_epochs+1):
  print("#" * 50)
  print(f"Epoch: {e}")
  for batch in tqdm(skipgram_loader):
    x, y = batch
    x, y = x.to(device), y.to(device) # (B), (B)
    output = skipgram(x)  # (B, V)

    optim.zero_grad()
    loss = loss_function(output, y)
    loss.backward()
    optim.step()

    print(f"Train loss: {loss.item()}")

print("Finished.")

100%|██████████| 64/64 [00:00<00:00, 769.24it/s]
100%|██████████| 64/64 [00:00<00:00, 761.48it/s]
  0%|          | 0/64 [00:00<?, ?it/s]

##################################################
Epoch: 1
Train loss: 3.935598850250244
Train loss: 4.18071174621582
Train loss: 4.385351181030273
Train loss: 3.981175184249878
Train loss: 4.382734298706055
Train loss: 4.338055610656738
Train loss: 4.1024909019470215
Train loss: 4.443660259246826
Train loss: 3.8431544303894043
Train loss: 4.591559410095215
Train loss: 4.209688186645508
Train loss: 3.74810791015625
Train loss: 4.243536949157715
Train loss: 4.557506561279297
Train loss: 4.170951843261719
Train loss: 4.279292583465576
Train loss: 3.961508274078369
Train loss: 4.395082473754883
Train loss: 4.10136604309082
Train loss: 3.8903419971466064
Train loss: 4.5862812995910645
Train loss: 4.630316734313965
Train loss: 3.9476840496063232
Train loss: 4.1631693840026855
Train loss: 4.583922863006592
Train loss: 4.150356292724609
Train loss: 3.9295263290405273
Train loss: 4.595941066741943
Train loss: 4.246572971343994
Train loss: 4.090461254119873
Train loss: 4.977676868438721
Train 

100%|██████████| 64/64 [00:00<00:00, 723.11it/s]
100%|██████████| 64/64 [00:00<00:00, 765.91it/s]
  0%|          | 0/64 [00:00<?, ?it/s]

Train loss: 4.210082530975342
Train loss: 3.8988308906555176
Train loss: 4.325079917907715
Train loss: 4.053057670593262
Train loss: 3.8364081382751465
Train loss: 4.377152442932129
Train loss: 4.447363376617432
Train loss: 3.8326268196105957
Train loss: 4.108140468597412
Train loss: 4.513599395751953
Train loss: 4.033143043518066
Train loss: 3.8284125328063965
Train loss: 4.541130065917969
Train loss: 4.184264183044434
Train loss: 4.037203788757324
Train loss: 4.914881229400635
Train loss: 4.540597915649414
Train loss: 3.80001163482666
Train loss: 4.064507484436035
Train loss: 3.8226611614227295
Train loss: 4.030473709106445
Train loss: 3.5853731632232666
Train loss: 4.363476276397705
Train loss: 4.282433032989502
Train loss: 4.279162406921387
Train loss: 4.055452823638916
Train loss: 4.472754955291748
Train loss: 4.132181167602539
Train loss: 4.181849956512451
Train loss: 4.260842323303223
Train loss: 4.061119079589844
Train loss: 4.636878967285156
Train loss: 4.117508888244629
Train

100%|██████████| 64/64 [00:00<00:00, 732.74it/s]

Train loss: 4.000858306884766
Train loss: 3.7650628089904785
Train loss: 3.9784064292907715
Train loss: 3.480867862701416
Train loss: 4.268176078796387
Train loss: 4.20712423324585
Train loss: 4.228814601898193
Train loss: 3.995807647705078
Train loss: 4.434756278991699
Train loss: 4.011079788208008
Train loss: 4.109465599060059
Train loss: 4.028143405914307
Train loss: 3.861844539642334
Train loss: 4.458781719207764
Train loss: 4.0040154457092285
Train loss: 4.327019691467285
Train loss: 4.1197285652160645
Train loss: 3.6885290145874023
Train loss: 3.6812026500701904
Train loss: 4.188002586364746
Train loss: 4.328497886657715
Train loss: 3.8710567951202393
Train loss: 4.5183281898498535
Train loss: 4.159461498260498
Train loss: 3.639101982116699
Train loss: 4.570396423339844
Train loss: 3.6790246963500977
Train loss: 4.628300189971924
Train loss: 4.074861526489258
Train loss: 4.197501182556152
Train loss: 3.9985713958740234
Finished.





### **Test**

Use each trained model to inspect the word embeddings of the test words.
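
Raw 256-dimensional vectors are hard to interpret by eye. One common way to compare two embeddings is cosine similarity. The sketch below is plain Python with made-up 3-dimensional vectors in `toy_embeddings` and English placeholder keys; it does not use the trained models. To apply it to the real embeddings, one would first convert each tensor to a list (for example with `.tolist()`).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors for illustration only; not outputs of the trained models.
toy_embeddings = {
    "food": [1.0, 0.0, 1.0],
    "taste": [1.0, 0.0, 0.9],
    "price": [-1.0, 0.5, 0.0],
}

for other in ("taste", "price"):
    sim = cosine_similarity(toy_embeddings["food"], toy_embeddings[other])
    print(f"food vs {other}: {sim:.3f}")
```

With only ten sentences and five epochs at a small learning rate, the learned vectors are likely still close to their random initialization, so similarities computed on them should be read with caution.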

In [None]:
for word in test_words:
  input_id = torch.LongTensor([w2i[word]]).to(device)
  emb = cbow.embedding(input_id)

  print(f"Word: {word}")
  print(emb.squeeze(0))

Word: 음식
tensor([-1.0323e+00,  9.4741e-02, -1.6479e+00, -2.4390e+00, -9.3792e-01,
        -2.5977e-01,  1.4184e+00,  9.8140e-01, -8.7173e-01,  2.5267e-01,
        -3.0643e-01, -5.8586e-01,  1.2068e+00,  2.4510e-01,  2.9336e-01,
         9.8713e-01, -7.3095e-01,  3.2432e-01,  1.3086e-01, -1.4584e+00,
        -2.1182e+00,  6.5387e-01,  6.2431e-01,  7.4095e-01, -1.2376e+00,
         7.4692e-02, -1.4138e+00,  5.8394e-02,  3.9641e-01, -4.4608e-01,
        -1.7196e+00,  2.9443e-01,  9.0518e-02,  1.1679e+00,  1.3366e+00,
        -1.8578e+00, -4.2563e-01,  2.2517e-01, -6.2500e-02, -7.1083e-02,
        -1.3357e+00,  4.7396e-01,  6.2155e-01,  1.7937e+00,  4.0047e-02,
        -3.0487e-01, -2.6982e-01, -1.8118e+00, -1.5114e+00,  8.4694e-01,
         4.7915e-01, -7.9572e-01, -9.0195e-01,  7.8691e-01, -1.9914e+00,
         7.0706e-01,  2.3641e+00, -1.0699e-02, -8.3425e-01,  2.0945e+00,
         1.2797e+00,  8.9398e-01,  3.6953e-01,  4.1938e-01,  3.1600e-01,
        -1.4501e+00, -1.5972e+00, -1.2593e

In [None]:
for word in test_words:
  input_id = torch.LongTensor([w2i[word]]).to(device)
  emb = skipgram.embedding(input_id)

  print(f"Word: {word}")
  print(emb.squeeze(0))

Word: 음식
tensor([-9.1191e-01,  6.6738e-01, -1.0075e+00,  5.3786e-01,  1.7338e+00,
        -1.1773e+00, -1.1951e+00,  4.0240e-01, -1.4058e+00, -7.2768e-01,
        -5.0139e-01, -1.6413e-01,  7.8513e-01,  1.2303e+00, -1.1413e+00,
         1.1255e-01,  1.4635e+00,  1.3380e+00, -1.2341e+00, -3.0908e-01,
        -1.4730e-01,  1.5394e+00, -1.8612e+00, -5.8801e-01, -2.3625e-01,
         8.6905e-01, -1.0273e+00,  2.5554e-02,  4.1589e-01, -1.3586e-01,
         1.6397e+00, -2.1654e+00, -1.0361e+00,  1.1095e+00,  9.5384e-01,
        -2.8819e-01,  2.1022e+00, -1.7125e+00, -1.3131e+00, -1.1008e-03,
        -7.7169e-01,  1.1071e-01, -3.7384e-01,  9.1547e-01,  2.5213e-03,
        -5.9584e-01,  1.0911e-01,  3.5029e-02, -3.0037e-01,  1.1482e+00,
         6.0252e-01, -4.7165e-01, -3.1365e-01, -2.5076e-01,  2.2484e-01,
         7.8851e-03, -6.0370e-01,  1.8710e-01,  1.2705e-01, -1.1709e+00,
        -3.8019e-01, -1.0206e-01, -2.7376e-01, -1.0836e+00,  5.7283e-01,
        -1.5809e+00, -1.6338e-01, -6.5261e