## Building GloVe embedding vectors in Python 3.6

### Reference
* lovit's GloVe post
* The `glove` Python library (glove-python, imported below as `from glove import Glove`)
* soynlp's noun extractor

### Data
* News articles from October 20, 2016 (lovit's tutorial data)

### NewsNounExtractor

In [1]:
from soynlp.utils import DoublespaceLineCorpus

corpus_fname = '2016-10-20_article_all_normed.txt'
sentences = DoublespaceLineCorpus(corpus_fname, iter_sent=True)

In [2]:
from soynlp.noun import NewsNounExtractor

noun_extractor = NewsNounExtractor(
    max_left_length=10, 
    max_right_length=7,
    predictor_fnames=None,
    verbose=True
)

nouns = noun_extractor.train_extract(sentences)

used default noun predictor; Sejong corpus based logistic predictor
/Users/shinbo/opt/anaconda3/envs/glove/lib/python3.6/site-packages/soynlp
local variable 'f' referenced before assignment
local variable 'f' referenced before assignment
scan vocabulary ... 
done (Lset, Rset, Eojeol) = (658116, 363342, 403882)
predicting noun score was done                                        
before postprocessing 237871
_noun_scores_ 50196
checking hardrules ... done0 / 50196+(이)), NVsubE (사기(당)+했다) ... done
after postprocessing 36027
extracted 2365 compounds from eojeolss ... 87000 / 87714

In [3]:
words = ['박근혜', '우병우', '민정수석', 
         '트와이스', '아이오아이', '최순실',
         '최순실게이트', '게이트', '콘서트']

for word in words:
    if word not in nouns:
        continue
    score = nouns[word]
    print('%s: (score=%.3f, frequency=%d)' 
          % (word, score.score, score.frequency))

박근혜: (score=0.478, frequency=1507)
우병우: (score=0.757, frequency=721)
민정수석: (score=0.834, frequency=812)
아이오아이: (score=0.547, frequency=270)
최순실: (score=0.828, frequency=1878)
게이트: (score=0.745, frequency=307)
콘서트: (score=0.769, frequency=500)


In [37]:
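from tqdm import tqdm

# A set gives O(1) membership tests; scanning a list for every token
# is the likely reason the recorded run below took almost an hour.
noun_set = set(nouns)
with open('2016-10-20_noun.txt', 'w', encoding='utf-8') as f:
    for sent in tqdm(sentences):
        # keep only the tokens that were extracted as nouns
        sent_noun_only = [word for word in sent.split(' ') if word in noun_set]
        f.write(' '.join(sent_noun_only) + '\n')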

100%|██████████| 223357/223357 [57:46<00:00, 64.42it/s]  


## GloVe: unpreprocessed data vs. noun-only data

### data
* 2016-10-20.txt: the unpreprocessed corpus
* 2016-10-20_noun.txt: the same corpus with only the extracted nouns kept
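
Both runs that follow feed a (word, context) co-occurrence matrix to GloVe. As a rough illustration of what such a matrix contains, here is a minimal pure-Python sketch. The function name `cooccurrence_counts` is hypothetical, and the distance-based weighting is only an assumption about what an option like `dynamic_weight=True` does; it is not taken from soynlp's implementation.

```python
from collections import defaultdict

def cooccurrence_counts(sentences, window=3, dynamic_weight=True):
    """Count weighted (word, context) pairs inside a symmetric window."""
    counts = defaultdict(float)
    for sent in sentences:
        tokens = sent.split()
        for i, word in enumerate(tokens):
            # Look at up to `window` tokens to the right and record the
            # pair in both directions, which makes the counts symmetric.
            for dist in range(1, window + 1):
                j = i + dist
                if j >= len(tokens):
                    break
                # Assumed weighting scheme: closer context words count more.
                weight = (window - dist + 1) / window if dynamic_weight else 1.0
                counts[(word, tokens[j])] += weight
                counts[(tokens[j], word)] += weight
    return dict(counts)

toy_corpus = ['a b c d', 'a b']
counts = cooccurrence_counts(toy_corpus, window=2)
print(counts[('a', 'b')], counts[('a', 'c')])
```

In this toy corpus `a` and `b` are adjacent twice, so their pair gets the largest count, while `a` and `d` are more than two tokens apart and never co-occur. A real corpus needs a sparse matrix and a minimum-frequency cutoff, which is what `sent_to_word_contexts_matrix` handles.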

#### for unpreprocessed data

In [4]:
# import data
from soynlp.utils import DoublespaceLineCorpus
from soynlp.vectorizer import sent_to_word_contexts_matrix
corpus_path = "2016-10-20.txt"
corpus = DoublespaceLineCorpus(corpus_path, iter_sent=True)

# build the (word, context) co-occurrence matrix that GloVe takes as input
# context window size is 3
x, idx2vocab = sent_to_word_contexts_matrix(
    corpus,
    windows=3,
    min_tf=10,
    tokenizer=lambda x:x.split(), # (default) lambda x:x.split(),
    dynamic_weight=True,
    verbose=True)

print(x.shape) # total 50091 words

# Glove class from the glove-python package
from glove import Glove
glove = Glove(no_components=100, learning_rate=0.05, max_count=30)
glove.fit(x.tocoo(), epochs=5, no_threads=4, verbose=True)

dictionary = {vocab:idx for idx, vocab in enumerate(idx2vocab)}
glove.add_dictionary(dictionary)

words = '아이오아이 아프리카 박근혜 뉴스 날씨 이화여대 아프리카발톱개구리'.split()
for word in words:
    print('\n{}'.format(word))
    for tup in glove.most_similar(word, number=10):
        print(tup)
    print('\n')

Create (word, contexts) matrix
  - counting word frequency from 223356 sents, mem=1.156 Gb
  - scanning (word, context) pairs from 223356 sents, mem=1.378 Gb
  - (word, context) matrix was constructed. shape = (50091, 50091)                    
  - done
(50091, 50091)
Performing 5 training epochs with 4 threads
Epoch 0
Epoch 1
Epoch 2
Epoch 3
Epoch 4

아이오아이
('신용재', 0.9051910042145265)
('세븐', 0.8470469539504547)
('남남서쪽', 0.8274233617915241)
('경주시', 0.8070812919370273)
('수원', 0.8025955895553446)
('노선이', 0.7974367242614192)
('타운점', 0.7972171103611738)
('배추가', 0.7943929049352353)
('120', 0.7886975207427144)



아프리카
('쪽지', 0.7329325267092706)
('볼트', 0.7191221188708016)
('85', 0.7176457521023469)
('35', 0.7135558225150953)
('북위', 0.7122671306744952)
('쿵푸팬더3', 0.7083750171571145)
('자매', 0.6957292799382121)
('엑스레이인', 0.6936836999836651)
('김구라', 0.6920186556116092)



박근혜
('역적패당의', 0.8990734106563494)
('가소로운', 0.8957872330269615)
('대통령의', 0.8673564054807855)
('주체위성들은', 0.8577487264303695)
('끝장내

#### for noun-extracted data

In [15]:
# import data
corpus_path = "2016-10-20_noun.txt"
corpus = DoublespaceLineCorpus(corpus_path, iter_sent=True)

# build the (word, context) co-occurrence matrix that GloVe takes as input
# context window size is 3
x, idx2vocab = sent_to_word_contexts_matrix(
    corpus,
    windows=3,
    min_tf=10,
    tokenizer=lambda x:x.split(), # (default) lambda x:x.split(),
    dynamic_weight=True,
    verbose=True)

x.shape  # not echoed, since it is not the last expression in the cell

# Glove class from the glove-python package
from glove import Glove
glove = Glove(no_components=100, learning_rate=0.05, max_count=30)
glove.fit(x.tocoo(), epochs=5, no_threads=4, verbose=True)

dictionary = {vocab:idx for idx, vocab in enumerate(idx2vocab)}
glove.add_dictionary(dictionary)

words = '아이오아이 아프리카 박근혜 뉴스 날씨 이화여대 아프리카발톱개구리'.split()
for word in words:
    print('\n{}'.format(word))
    for tup in glove.most_similar(word, number=10):
        print(tup)
    print('\n')

Create (word, contexts) matrix
  - counting word frequency from 204055 sents, mem=1.276 Gb
  - scanning (word, context) pairs from 204055 sents, mem=1.286 Gb
  - (word, context) matrix was constructed. shape = (10328, 10328)                    
  - done
Performing 5 training epochs with 4 threads
Epoch 0
Epoch 1
Epoch 2
Epoch 3
Epoch 4

아이오아이
('세븐', 0.9192878295544598)
('코드', 0.829506920496379)
('불독', 0.8231956126330225)
('에이핑크', 0.8145294353687739)
('산들', 0.8084104194412042)
('멤버들', 0.7959946220913665)
('걸그룹', 0.7828779355233834)
('소라', 0.7642841465878357)
('합정동', 0.7597687792163058)



아프리카
('4대', 0.6648020293131455)
('원조', 0.6578438627359317)
('식음료', 0.6535012684994946)
('3년', 0.6474019387268591)
('2000년대', 0.646889113704001)
('상주', 0.6373969293981944)
('전량', 0.6199516493057061)
('연속', 0.6198023174595333)
('식료품', 0.6124910881474588)



박근혜
('김정일', 0.8291159560711128)
('고치기', 0.8184660762511573)
('정권', 0.8119924380355885)
('방북', 0.7926370300459487)
('대책위원회', 0.7795758246093294)
('미르'

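### Results: all POS vs. nouns only
* The noun-only corpus gives somewhat better results.
* For 아이오아이 (I.O.I), the top-ranked neighbors (`number=10`) now include 멤버들 (members), 걸그룹 (girl group), and 에이핑크 (Apink).
* For 아프리카 (Africa), words such as 원조 (aid) and 식음료 (food and beverage) now appear among the top-ranked neighbors.
* For compounds such as 아프리카발톱개구리 (African clawed frog), the co-occurrence matrix built from the noun-only corpus gave particularly good results.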
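
The neighbor lists compared above come from `glove.most_similar`, which, as far as I understand it, ranks words by cosine similarity between embedding vectors. The sketch below shows that ranking on a tiny hand-made matrix. The function, the vocabulary, and the vectors are all made up for illustration and are not the library's own code or the trained embeddings.

```python
import numpy as np

def most_similar(word, vectors, vocab, topn=3):
    """Rank vocabulary entries by cosine similarity to `word`."""
    index = {w: i for i, w in enumerate(vocab)}
    query = vectors[index[word]]
    # cosine similarity = dot product divided by the product of the norms
    norms = np.linalg.norm(vectors, axis=1) * np.linalg.norm(query)
    sims = vectors @ query / norms
    order = np.argsort(-sims)  # indices sorted by descending similarity
    # drop the query word itself, then keep the top `topn` entries
    return [(vocab[i], float(sims[i])) for i in order if vocab[i] != word][:topn]

# Hypothetical 2-dimensional embeddings, chosen by hand.
vocab = ['idol', 'girlgroup', 'weather']
vectors = np.array([[1.0, 0.0],
                    [0.9, 0.1],
                    [0.0, 1.0]])
neighbors = most_similar('idol', vectors, vocab, topn=2)
print(neighbors)
```

Because the ranking depends only on vector direction, frequent function-like tokens in the unpreprocessed corpus can end up close to a query word simply because they share contexts with it, which is one plausible reading of the noisier neighbors in the first run.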

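## Further work
* GloVe takes a co-occurrence matrix as input, but that input can be swapped for something else. One option is PPMI (positive pointwise mutual information), which can pick out the words whose contexts are most informative.
* A PPMI matrix can be built as shown below; keep it in mind as one of several options.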

In [None]:
from soynlp.utils import DoublespaceLineCorpus
from soynlp.vectorizer import sent_to_word_contexts_matrix
from soynlp.word import pmi

corpus_path = '2016-10-20_article_all_normed_ltokenize.txt'
corpus = DoublespaceLineCorpus(corpus_path, iter_sent=True)

x, idx2vocab = sent_to_word_contexts_matrix(
    corpus,
    windows=3,
    min_tf=10,
    tokenizer=lambda x:x.split(), # (default) lambda x:x.split(),
    dynamic_weight=True,
    verbose=True)

pmi_dok = pmi(
    x,
    min_pmi=0,
    alpha=0.0001,
    verbose=True)
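
To make the PPMI idea concrete, here is a minimal NumPy sketch of the textbook definition, PPMI(x, y) = max(0, log(p(x, y) / (p(x) p(y)))), on a small dense matrix. It is an assumption-laden illustration rather than a reimplementation of soynlp's `pmi`: it ignores the `alpha` smoothing and `min_pmi` threshold used in the cell above, and the function name `ppmi` is hypothetical.

```python
import numpy as np

def ppmi(cooc):
    """Positive PMI of a dense co-occurrence matrix (rows: words, columns: contexts)."""
    p_xy = cooc / cooc.sum()               # joint probabilities
    p_x = p_xy.sum(axis=1, keepdims=True)  # word marginals
    p_y = p_xy.sum(axis=0, keepdims=True)  # context marginals
    with np.errstate(divide='ignore', invalid='ignore'):
        pmi = np.log(p_xy / (p_x * p_y))
    pmi[~np.isfinite(pmi)] = 0.0           # zero counts give -inf or nan
    return np.maximum(pmi, 0.0)            # keep only positive associations

# Toy counts: each word co-occurs only with its own context.
cooc = np.array([[4.0, 0.0],
                 [0.0, 4.0]])
result = ppmi(cooc)
print(result)
```

Each diagonal entry comes out as log 2, because the pair occurs twice as often as independence would predict, and the pairs that were never observed are clipped to 0. For a vocabulary of tens of thousands of words, a sparse representation would be needed instead of a dense array.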