### **GloVe**  
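GloVe uses an objective function under which the dot product of two word embedding vectors approximates the logarithm of how often the two words co-occur across the entire corpus.  
As a result, similarity between embedding vectors is easy to measure, and the embeddings also reflect corpus-wide statistics.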
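
For reference, the objective described above is usually written as the weighted least-squares loss below. The notation is borrowed from the standard GloVe formulation and is not defined elsewhere in this notebook: $X_{ij}$ is the co-occurrence count of words $i$ and $j$, $w_i$ and $\tilde{w}_j$ are the word and context vectors, $b_i$ and $\tilde{b}_j$ are bias terms, $V$ is the vocabulary size, and $f$ is a weighting function that damps very frequent pairs.

$$
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2,
\qquad
f(x) =
\begin{cases}
(x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\
1 & \text{otherwise}
\end{cases}
$$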

> **Co-occurrence** *refers to words that appear together within a sentence, paragraph, or other unit of text. In linguistics, it is taken to indicate semantic proximity.*  
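
To make the definition concrete, here is a minimal sketch of counting co-occurrences with a sliding window. The function name `count_cooccurrences`, the window size, and the toy sentences are hypothetical and chosen only for illustration; the library used later in this notebook may count and weight pairs differently.

```python
from collections import Counter


def count_cooccurrences(sentences, window=2):
    """Count unordered word pairs that appear within `window` tokens of each other."""
    counts = Counter()
    for tokens in sentences:
        for i, left in enumerate(tokens):
            # Look only to the right so each pair is counted once per occurrence.
            for right in tokens[i + 1 : i + 1 + window]:
                if left != right:
                    counts[tuple(sorted((left, right)))] += 1
    return counts


# Toy sentences (hypothetical data, not taken from the article corpus).
sentences = [
    ["police", "chief", "visits", "city"],
    ["police", "chief", "resigns"],
]
pair_counts = count_cooccurrences(sentences, window=2)
print(pair_counts[("chief", "police")])
```

Pairs are stored as sorted tuples so that ("police", "chief") and ("chief", "police") share one counter, which corresponds to a symmetric co-occurrence matrix.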




In [8]:
!pip install glove_python

Collecting glove_python
  Downloading https://files.pythonhosted.org/packages/3e/79/7e7e548dd9dcb741935d031117f4bed133276c2a047aadad42f1552d1771/glove_python-0.1.0.tar.gz (263kB)
Building wheels for collected packages: glove-python
  Running setup.py bdist_wheel for glove-python ... done
  Stored in directory: /root/.cache/pip/wheels/88/4b/6d/10c0d2ad32c9d9d68beec9694a6f0b6e83ab1662a90a089a4b
Successfully built glove-python
Installing collected packages: glove-python
Successfully installed glove-python-0.1.0


In [1]:
import os
import re

In [3]:
BASE_DIR = "/data/TestDir/sample_articles"
ORIGIN_PATH = os.path.join(BASE_DIR,"Origin-Data")
PREPROCESSED_PATH = os.path.join(BASE_DIR,"Preprocessed-Train-Data")
PRETTY_PATH = os.path.join(BASE_DIR,"Pretty-Data")
SWORDS_PATH = os.path.join(BASE_DIR, "StopWordList.txt")

In [4]:
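class RawTextReader:
    """Iterate over the newline-separated sentences of a UTF-8 text file."""

    def __init__(self, filepath):
        self.filepath = filepath
        # Split each line on newline characters.
        self.rgxSplitter = re.compile(r"\n")

    def __iter__(self):
        # A context manager closes the file handle once iteration finishes.
        with open(self.filepath, encoding='utf-8') as f:
            for line in f:
                for s in self.rgxSplitter.split(line):
                    yield s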

Because the article bodies are short, the corpus is small for training a model.  
The cells below build a corpus from the 87 collected articles and then train a GloVe model on it.

In [5]:
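# List the media sub-directories; the same names are looked up under PREPROCESSED_PATH.
media_list = os.listdir(ORIGIN_PATH)

result = []    # tokenized sentences, one list of tokens per sentence
forCount = []  # flat list of every token, used only for counting
for media in media_list:
    media_path = os.path.join(PREPROCESSED_PATH, media)
    article_list = os.listdir(media_path)

    for article in article_list:
        reader = RawTextReader(os.path.join(media_path, article))
        # Drop the empty strings left over from splitting.
        content = list(filter(None, reader))
        forCount += [token for sent in content for token in sent.split()]
        result += [sent.split() for sent in content]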

In [6]:
print("Total number of tokens : {n}".format(n=len(forCount)))
print("Number of unique tokens : {n}".format(n=len(set(forCount))))

Total number of tokens : 22341223
Number of unique tokens : 1107954


In [9]:
from glove import Corpus, Glove

corpus = Corpus() 
corpus.fit(result, window=5)

In [31]:
glove = Glove(no_components=200, learning_rate=0.05)
glove.fit(corpus.matrix, epochs=20, no_threads=4, verbose=True)
glove.add_dictionary(corpus.dictionary)

Performing 20 training epochs with 4 threads
Epoch 0
Epoch 1
Epoch 2
Epoch 3
Epoch 4
Epoch 5
Epoch 6
Epoch 7
Epoch 8
Epoch 9
Epoch 10
Epoch 11
Epoch 12
Epoch 13
Epoch 14
Epoch 15
Epoch 16
Epoch 17
Epoch 18
Epoch 19


Print the words closest in meaning to 경찰서장 ("police station chief").

In [32]:
model_result1=glove.most_similar("경찰서장")
print(model_result1)

[('수사부서', 0.6000459667401196), ('계장', 0.5670894009342707), ('부원장', 0.5620538739767841), ('중령이', 0.5575559552979485)]
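
The scores in this output are similarity values between embedding vectors. A common way to answer such a query is to rank every other word by cosine similarity to the query vector; the sketch below illustrates that general technique under this assumption and does not describe the internals of `glove_python`. The helper names `cosine` and `most_similar`, as well as the two-dimensional toy vectors, are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, number=3):
    """Return up to `number` (word, score) pairs, most similar first."""
    query = vectors[word]
    # Score every word except the query itself.
    scores = [
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:number]


# Toy embeddings (hypothetical values, not taken from the trained model).
toy_vectors = {
    "chief": [1.0, 0.0],
    "captain": [0.9, 0.1],
    "election": [0.0, 1.0],
}
neighbours = most_similar("chief", toy_vectors, number=2)
print(neighbours)
```

Because cosine similarity ignores vector length, two words count as similar when their vectors point in nearly the same direction, regardless of how frequent either word is.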


In [33]:
corpus.dictionary["선거"]

2790

In [34]:
glove.word_vectors[corpus.dictionary["선거"]]

array([ 2.35866183e-01,  1.29712384e-01,  3.29394465e-01, -1.44185758e-01,
        1.24354859e-01,  2.66980849e-01,  1.25057514e-01,  2.17945139e-01,
       -4.43377839e-01,  9.14323195e-02, -1.36738079e-01, -2.77354896e-01,
       -1.63237091e-01,  8.24703749e-02,  1.62116119e-01, -6.83846029e-02,
       -7.41504491e-02, -3.18320115e-01,  3.51119372e-02,  4.31970699e-02,
        3.06854109e-02, -2.84492419e-01, -3.37902653e-01, -6.47966352e-02,
        6.20843964e-02,  1.39528122e-01,  3.95503334e-01, -3.66851647e-01,
       -2.00317395e-01, -1.09803831e-01,  4.80785626e-01, -3.05283638e-01,
       -8.64006855e-02, -2.35446412e-01,  1.75167583e-01,  2.34687398e-01,
       -3.02714207e-01,  2.47157541e-01, -1.42394903e-01, -4.14685428e-01,
        3.09231411e-01,  1.61565244e-01, -9.40552937e-02, -4.41783892e-01,
       -3.47915534e-01, -2.19232035e-02, -1.45352877e-01, -9.85079537e-02,
       -3.49616633e-02, -2.00855834e-01, -2.23091231e-01, -8.68195803e-02,
        4.38984666e-01,  