### **GloVe**  
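GloVe defines its objective so that the dot product of two embedding vectors approximates the logarithm of their co-occurrence probability over the entire corpus.  
This keeps similarity between embedding vectors easy to measure while still reflecting corpus-wide statistics. 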
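
For reference, the weighted least-squares objective from the original GloVe paper (Pennington et al., 2014) can be written as follows. Here $X_{ij}$ is the co-occurrence count of words $i$ and $j$, $w_i$ and $\tilde{w}_j$ are the word and context vectors, $b_i$ and $\tilde{b}_j$ are bias terms, and $V$ is the vocabulary size. Whether the `glove_python` package used below follows this form in every detail is an assumption.

$$
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2,
\qquad
f(x) =
\begin{cases}
(x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\
1 & \text{otherwise}
\end{cases}
$$

The weighting function $f$ down-weights rare pairs and caps the influence of very frequent ones; the paper reports $x_{\max} = 100$ and $\alpha = 3/4$ as values that worked well.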

> **Co-occurrence** *refers to words that appear together within the same sentence, paragraph, or other unit of text. In linguistics, it is taken as an indicator of semantic proximity.*  
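
To make the definition concrete, here is a minimal sketch of window-based co-occurrence counting in plain Python. The function name `cooccurrence_counts` is hypothetical, and the 1/distance weighting is an assumption borrowed from the original GloVe formulation; it is not taken from this notebook or guaranteed to match any library's internals.

```python
from collections import defaultdict


def cooccurrence_counts(sentences, window=5):
    """Count co-occurrences of word pairs within a sliding window.

    Hypothetical helper for illustration only. Each pair is stored once,
    in sorted order, and a pair that is d tokens apart contributes 1/d.
    """
    counts = defaultdict(float)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            # Only look at right-hand neighbours so each pair is visited once.
            last = min(i + 1 + window, len(tokens))
            for j in range(i + 1, last):
                pair = tuple(sorted((word, tokens[j])))
                counts[pair] += 1.0 / (j - i)
    return dict(counts)


toy_sentences = [["a", "b", "c"], ["a", "b"]]
toy_counts = cooccurrence_counts(toy_sentences, window=2)
```

In this toy corpus, `a` and `b` are adjacent in both sentences, so their entry accumulates two full counts, while `a` and `c` are two tokens apart and receive only half a count.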




In [8]:
!pip install glove_python

Collecting glove_python
  Downloading https://files.pythonhosted.org/packages/3e/79/7e7e548dd9dcb741935d031117f4bed133276c2a047aadad42f1552d1771/glove_python-0.1.0.tar.gz (263kB)
    100% |################################| 266kB 639kB/s
Building wheels for collected packages: glove-python
  Running setup.py bdist_wheel for glove-python ... done
  Stored in directory: /root/.cache/pip/wheels/88/4b/6d/10c0d2ad32c9d9d68beec9694a6f0b6e83ab1662a90a089a4b
Successfully built glove-python
Installing collected packages: glove-python
Successfully installed glove-python-0.1.0
You are using pip version 10.0.1, however version 20.3.3 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.


In [1]:
import os
import re

In [2]:
BASE_DIR = "/data/TestDir/sample_articles"
ORIGIN_PATH = os.path.join(BASE_DIR,"Origin-Data")
PREPROCESSED_PATH = os.path.join(BASE_DIR,"Preprocessed-Train-Data")
PRETTY_PATH = os.path.join(BASE_DIR,"Pretty-Data")
SWORDS_PATH = os.path.join(BASE_DIR, "StopWordList.txt")

In [3]:
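class RawTextReader:
    """Yield the newline-separated pieces of a UTF-8 text file."""

    def __init__(self, filepath):
        self.filepath = filepath
        # Split each line on newline characters.
        self.rgxSplitter = re.compile(r"\n")

    def __iter__(self):
        # The context manager closes the file handle once iteration ends.
        with open(self.filepath, encoding='utf-8') as f:
            for line in f:
                # Splitting may yield empty strings; callers filter them out.
                for s in self.rgxSplitter.split(line):
                    yield s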

Because each article body is short, the corpus is small for training a model.  
The cells below build a corpus from the 87 collected articles and train a GloVe model on it.

In [4]:
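# Media names come from ORIGIN_PATH; this assumes PREPROCESSED_PATH
# contains a subdirectory with the same name for each of them.
media_list = os.listdir(ORIGIN_PATH)

result = []    # one token list per sentence, used to fit the corpus
forCount = []  # flat list of every token, used for counting
for media in media_list:
    media_path = os.path.join(PREPROCESSED_PATH, media)
    article_list = os.listdir(media_path)

    for article in article_list:
        reader = RawTextReader(os.path.join(media_path, article))
        # Drop empty strings produced by the reader.
        content = list(filter(None, reader))
        forCount += [token for sent in content for token in sent.split()]
        result += [sent.split() for sent in content]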

In [5]:
print("Total number of tokens: {n}".format(n=len(forCount)))
print("Number of unique tokens: {n}".format(n=len(set(forCount))))

Total number of tokens: 22341223
Number of unique tokens: 1107954


In [6]:
from glove import Corpus, Glove

corpus = Corpus() 
corpus.fit(result, window=5)

In [7]:
glove = Glove(no_components=200, learning_rate=0.05)
glove.fit(corpus.matrix, epochs=20, no_threads=4, verbose=True)
glove.add_dictionary(corpus.dictionary)

Performing 20 training epochs with 4 threads
Epoch 0
Epoch 1
Epoch 2
Epoch 3
Epoch 4
Epoch 5
Epoch 6
Epoch 7
Epoch 8
Epoch 9
Epoch 10
Epoch 11
Epoch 12
Epoch 13
Epoch 14
Epoch 15
Epoch 16
Epoch 17
Epoch 18
Epoch 19


Print the words closest in meaning to 경찰서장 ("police chief").

In [8]:
model_result1=glove.most_similar("경찰서장")
print(model_result1)

[('수사부서', 0.6236909482840811), ('기독교연합회와', 0.6033180549165632), ('수상구조요원인', 0.5983877412963312), ('기초의회', 0.5959063945199705)]
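
The ranking above can be understood as a nearest-neighbour search over the embedding vectors. The sketch below assumes the scores are cosine similarities and uses made-up two-dimensional toy vectors; the helper names `cosine` and `most_similar` are hypothetical and are not the `glove_python` API.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, topn=4):
    """Rank every other word by cosine similarity to `word`.

    `vectors` is a plain dict mapping word -> list of floats
    (an illustrative stand-in for a trained embedding table).
    """
    query = vectors[word]
    scored = [
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != word  # exclude the query word itself
    ]
    # Highest similarity first, keep only the top `topn` entries.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy vectors invented for this example only.
toy_vectors = {
    "king": [0.9, 0.1],
    "queen": [0.8, 0.2],
    "apple": [0.1, 0.9],
}
neighbours = most_similar("king", toy_vectors, topn=2)
```

Because cosine similarity depends only on the angle between vectors, words whose vectors point in nearly the same direction rank highest regardless of vector length.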


In [9]:
corpus.dictionary["선거"]

3072

In [10]:
glove.word_vectors[corpus.dictionary["선거"]]

array([ 0.06303776,  0.05700019,  0.04617693,  0.17130899,  0.06943382,
       -0.20363614,  0.15986702, -0.23750177,  0.28243793, -0.1126716 ,
        0.0886102 , -0.51966616,  0.21145841,  0.04030291,  0.17507126,
        0.15365178, -0.15072383, -0.03902845, -0.11481738,  0.19849852,
       -0.04235334,  0.39487534,  0.12554683, -0.06884584, -0.21005472,
       -0.19213453, -0.11654341,  0.33835562,  0.11690027, -0.2564735 ,
       -0.29624159, -0.18076293, -0.04449766, -0.34280119, -0.08787836,
       -0.12624342, -0.58436858,  0.23907736,  0.046183  , -0.29745364,
        0.04234594, -0.04546714,  0.18953674,  0.37055585, -0.1232256 ,
       -0.18789383, -0.0531504 , -0.27529275,  0.04858228,  0.12225939,
        0.19713014, -0.25655382,  0.35489026, -0.21235655,  0.19887241,
        0.05397323, -0.15157383,  0.00448871,  0.22226837, -0.03100519,
        0.25063835, -0.12600123, -0.28626565,  0.25301813,  0.17118094,
        0.13162828, -0.32282249,  0.43611876,  0.3084216 ,  0.07

In [11]:
glove.save('glove.model')