# Doc2Vec model
> * Positive or Negative

## * Paragraph Vector
> Le and Mikolov (2014) introduce the Paragraph Vector, which outperforms more naïve document representations such as averaging the Word2vec word vectors of a document. The idea is straightforward: we treat a paragraph (or document) as just another token with its own vector, which we call a paragraph vector, and learn its embedding in the same way as word embeddings. Like a bag of n-grams, the Paragraph Vector model takes local word order into account, but it yields a dense representation in vector space instead of a sparse, high-dimensional one.

> * Paragraph Vector - Distributed Memory (PV-DM)
>> This is the Paragraph Vector model analogous to Continuous-bag-of-words Word2vec. The paragraph vectors are obtained by training a neural network on the surrogate task of inferring a center word from its context words and a context paragraph. The paragraph acts as an additional context shared by every word in that paragraph.

> * Paragraph Vector - Distributed Bag of Words (PV-DBOW)
>> This is the Paragraph Vector model analogous to Skip-gram Word2vec. The paragraph vectors are obtained by training a neural network on the surrogate task of predicting words randomly sampled from a paragraph, given only that paragraph's vector as input.
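
To make the difference between the two variants concrete, here is a minimal sketch of the training examples each one would see. The helper names `pv_dm_examples` and `pv_dbow_examples` and the toy document are hypothetical and only illustrate the idea; this is not gensim's implementation.

```python
from typing import List, Tuple

def pv_dm_examples(doc_id: int, words: List[str], window: int) -> List[Tuple[int, List[str], str]]:
    """PV-DM: (paragraph id, context words, center word) triples."""
    examples = []
    for i, center in enumerate(words):
        left = words[max(0, i - window):i]
        right = words[i + 1:i + 1 + window]
        # The paragraph id is part of the input for every position.
        examples.append((doc_id, left + right, center))
    return examples

def pv_dbow_examples(doc_id: int, words: List[str]) -> List[Tuple[int, str]]:
    """PV-DBOW: (paragraph id, target word) pairs; no context words in the input."""
    return [(doc_id, word) for word in words]

doc = ["the", "movie", "was", "great"]  # toy document (assumption)
dm = pv_dm_examples(0, doc, window=1)
dbow = pv_dbow_examples(0, doc)
print(dm[1])    # (0, ['the', 'was'], 'movie')
print(dbow[1])  # (0, 'movie')
```

In PV-DM the network sees the paragraph together with neighbouring words, while in PV-DBOW the paragraph vector alone has to predict each word.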

In [1]:
import pickle

from tqdm import tqdm
tqdm.pandas(desc="progress-bar")

## Defining a function for model creation

In [2]:
import multiprocessing
# Use half of the available cores; `workers` must be an int, so use floor division
cores = max(1, multiprocessing.cpu_count() // 2)
def Make_Doc2Vec_Model(data, size, dm, dm_concat, dm_mean, hs, negative, epoch, window, alpha, min_alpha, workers, tagger):
    from tqdm import tqdm
    tqdm.pandas(desc="progress-bar")
    from datetime import datetime
    from gensim.models import doc2vec
    start = datetime.now()
    modelPath = './model/'
    modelName = 'doc2vec_size-{}_epoch-{}_window-{}_negative-{}_hs-{}_dm-{}_dm_concat-{}_dm_mean-{}_by-{}.model'.format(
        size, epoch, window, negative, hs, dm, dm_concat, dm_mean, tagger)
    modelName = modelPath+modelName
    print (modelName)
    if window is not None:
        d2v_model = doc2vec.Doc2Vec(vector_size = size, dm = dm, dm_concat = dm_concat,
                   dm_mean = dm_mean, negative = negative, hs = hs, window = window,
                   alpha = alpha, min_alpha = min_alpha, workers = workers, epochs= epoch)
    else:
        d2v_model = doc2vec.Doc2Vec(vector_size = size, dm = dm, dm_concat = dm_concat,
                   dm_mean = dm_mean, negative = negative, hs = hs,
                   alpha = alpha, min_alpha = min_alpha, workers = workers, epochs= epoch)
    d2v_model.build_vocab(tqdm(data))
    # `iter` was renamed to `epochs` in gensim; use the attribute set by the constructor above
    d2v_model.train(tqdm(data), total_examples=d2v_model.corpus_count, epochs=d2v_model.epochs)
    
    end = datetime.now()
    d2v_model.save(modelName)
    print ("Total running time: ", end-start)
    return d2v_model

# Building Doc2Vec

In [3]:
import numpy as np
import pandas as pd

## Raw data for sentiment analysis

In [4]:
rawdata = pd.read_csv('./data/sentiment_data/raw_data_for_sentiment.txt',header=None,encoding='utf-8')
print (rawdata.shape)

(491510, 2)


## Making Doc2Vec Using tagger Twitter

In [5]:
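from collections import namedtuple
# Custom TaggedDocument with an extra `sentiment` field. It replaces gensim's
# two-field TaggedDocument (words, tags), so that import is not needed here.
TaggedDocument = namedtuple('TaggedDocument', 'words tags sentiment')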
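
As a small illustration of the record type defined above: each document carries its tokens, its tags, and an extra sentiment label. To my understanding gensim reads only the `words` and `tags` attributes, so the third field simply travels along with the document; treat that as an assumption. The sample values below are made up.

```python
from collections import namedtuple

# Same field layout as in the notebook; the values are illustrative only.
TaggedDocument = namedtuple('TaggedDocument', 'words tags sentiment')

doc = TaggedDocument(words=['영화/Noun', '좋다/Adjective'], tags=[0], sentiment=[1])

# Fields are available by name ...
print(doc.words, doc.tags, doc.sentiment)

# ... and by position, because a namedtuple is still a tuple.
words, tags, sentiment = doc
```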



### Tagging

In [6]:
from ckonlpy.tag import Twitter as ctwitter
ct = ctwitter()

In [7]:
# twitter
def tokenize1(doc):
    return ['/'.join(t) for t in ct.pos(doc)]
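
The tokenizer above turns each `(word, tag)` pair returned by the tagger into a single `word/tag` token. The sketch below shows the same joining step without requiring a Korean tagger; `fake_pos` is a hypothetical stand-in for `ct.pos` that labels every whitespace-separated token as a noun.

```python
from typing import List, Tuple

def fake_pos(doc: str) -> List[Tuple[str, str]]:
    """Hypothetical stand-in for a POS tagger: tags every token as 'Noun'."""
    return [(token, 'Noun') for token in doc.split()]

def tokenize(doc: str) -> List[str]:
    # Join each (word, tag) pair into one "word/tag" string.
    return ['/'.join(pair) for pair in fake_pos(doc)]

tokens = tokenize('영화 추천')
print(tokens)  # ['영화/Noun', '추천/Noun']
```

Keeping the tag inside the token means the same surface form with different parts of speech gets separate vectors.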

In [None]:
# Run this when no pickled file exists yet
raw_doc_ct = [(tokenize1(rawdata.loc[idx][0]), [idx], rawdata.loc[idx][1]) for idx in tqdm(rawdata.index)]
pickle.dump(raw_doc_ct, open('./data/pre_data/tagged_data/pre_data_by_ct_for_sentiment_analysis.pickled','wb'))

  1%|          | 5067/491510 [01:42<2:43:33, 49.57it/s]

### Converting to the basic Doc2Vec format

In [None]:
# Run this when no pickled file exists yet

tagged_ct = [TaggedDocument(b, c, [d]) for b, c, d in tqdm(raw_doc_ct)]
pickle.dump(tagged_ct, open('./data/pre_data/tagged_data/pre_by_ct_data_tagged_run_docs.pickled','wb'))

In [None]:
del raw_doc_ct

## Building the model

In [None]:
# Run this when a pickled file already exists
tagged_ct = pickle.load(open('./data/pre_data/tagged_data/pre_by_ct_data_tagged_run_docs.pickled','rb'))
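
The notebook keeps separate cells for the "no pickled file yet" and "pickled file exists" cases. One possible way to fold both into a single step is sketched below; `load_or_build` and `expensive_build` are hypothetical helpers, not part of the original notebook, and the demo writes to a temporary directory rather than the notebook's data paths.

```python
import os
import pickle
import tempfile

def load_or_build(path, build):
    """Load a pickled object if `path` exists; otherwise build it and pickle it."""
    if os.path.exists(path):
        with open(path, 'rb') as f:
            return pickle.load(f)
    obj = build()
    with open(path, 'wb') as f:
        pickle.dump(obj, f)
    return obj

calls = []

def expensive_build():
    # Stands in for the slow tagging step; records how often it runs.
    calls.append(1)
    return [(['영화/Noun'], [0], 1)]

with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, 'tagged.pickled')
    first = load_or_build(cache_path, expensive_build)   # builds and saves
    second = load_or_build(cache_path, expensive_build)  # loads from disk

print(first == second, len(calls))  # True 1
```

Using `with open(...)` also closes the file handles, which the bare `open(...)` calls in the notebook leave to the garbage collector.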

### train dataset & test dataset

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# Run this when no pickled file exists yet
train, test = train_test_split(tagged_ct, test_size=0.1, random_state=42)
del tagged_ct
pickle.dump(train, open('./data/pre_data/train_test_Data/pre_by_ct_train.pickled','wb'))
pickle.dump(test, open('./data/pre_data/train_test_Data/pre_by_ct_test.pickled','wb'))
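
The split above holds out 10% of the documents and fixes `random_state` so the same split can be reproduced. The following is a minimal pure-Python sketch of that idea. It is not scikit-learn's algorithm and will not produce the same partition as `train_test_split`; `simple_split` is a hypothetical helper.

```python
import random

def simple_split(items, test_size, seed):
    """Shuffle a copy with a fixed seed, then cut off the last `test_size` fraction."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    cut = len(shuffled) - n_test
    return shuffled[:cut], shuffled[cut:]

docs = list(range(20))  # toy stand-in for the tagged documents
train_a, test_a = simple_split(docs, test_size=0.1, seed=42)
train_b, test_b = simple_split(docs, test_size=0.1, seed=42)

print(len(train_a), len(test_a))  # 18 2
print(train_a == train_b)         # True: same seed, same split
```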

In [None]:
# Run this when a pickled file already exists
train = pickle.load(open('./data/pre_data/train_test_Data/pre_by_ct_train.pickled','rb'))
test = pickle.load(open('./data/pre_data/train_test_Data/pre_by_ct_test.pickled','rb'))

### model 1

In [None]:
from konlpy.utils import pprint

* size : dimensionality of the feature vectors (passed to gensim as `vector_size`)
* dm : 1 - distributed memory (PV-DM)
* dm_concat : 1 - concatenate the context vectors rather than summing/averaging them
* dm_mean : 0 - use the sum of the context word vectors (1 would use the mean)
> dm_mean only applies when dm is used in non-concatenative mode, so it has no effect here.
* negative : 7 - number of 'noise words' drawn for negative sampling
* hs : 0 - whether to use hierarchical softmax (0: off, so negative sampling is used)
* window : 5 - the maximum distance between the current and predicted word within a sentence
* alpha : the initial learning rate
* min_alpha : the learning rate drops linearly to min_alpha as training progresses; the call below sets it equal to alpha (0.025), so the rate stays constant
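
The effect of `alpha` and `min_alpha` can be sketched with a simple linear schedule. This is an illustration of the description in the list above, not gensim's exact update code; `linear_alpha` is a hypothetical helper and 0.0001 is just an example end value.

```python
def linear_alpha(alpha, min_alpha, progress):
    """Learning rate after a fraction `progress` (0.0 to 1.0) of training."""
    return alpha - (alpha - min_alpha) * progress

steps = (0.0, 0.5, 1.0)

# A decaying schedule: starts at alpha and ends at min_alpha.
decaying = [linear_alpha(0.025, 0.0001, p) for p in steps]

# The notebook's setting: alpha == min_alpha, so nothing changes.
constant = [linear_alpha(0.025, 0.025, p) for p in steps]

print(decaying)
print(constant)
```

With `alpha = min_alpha = 0.025`, every step uses the same learning rate, which is worth keeping in mind when comparing these runs with ones that let the rate decay.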

In [None]:
%%time
# PV-DM w/ concatenation
d2v_model = Make_Doc2Vec_Model(data=train, size = 1000, dm = 1, dm_concat = 1,
                   dm_mean = 0, negative = 7, hs = 0, epoch = 20, window = 5,
                   alpha = 0.025, min_alpha = 0.025, workers = cores, tagger = 'ct')
# Word-level queries go through the model's KeyedVectors (`wv`)
pprint(d2v_model.wv.most_similar('문재인/Noun'))
pprint(d2v_model.wv.most_similar('노무현/Noun'))
pprint(d2v_model.wv.most_similar('박근혜/Noun'))
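
`most_similar` ranks vocabulary entries by cosine similarity to the query vector. The sketch below reproduces that ranking on a toy set of 2-dimensional vectors; the `cosine` and `most_similar` helpers and the vectors themselves are illustrative assumptions, not output from the trained model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def most_similar(query, vectors, topn=2):
    """Rank every other key by cosine similarity to `query`, highest first."""
    scores = [(key, cosine(vectors[query], vec))
              for key, vec in vectors.items() if key != query]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

toy = {'a': [1.0, 0.0], 'b': [0.9, 0.1], 'c': [0.0, 1.0], 'd': [-1.0, 0.0]}
ranked = most_similar('a', toy)
print([key for key, _ in ranked])  # ['b', 'c']
```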

In [None]:
del d2v_model

### model 2

* size : dimensionality of the feature vectors (passed to gensim as `vector_size`)
* dm : 1 - distributed memory (PV-DM)
* dm_concat : 0 - sum/average the context vectors instead of concatenating them
* dm_mean : 1 - use the mean of the context word vectors (0 would use the sum)
> dm_mean only applies when dm is used in non-concatenative mode, which is the case here.
* negative : 7 - number of 'noise words' drawn for negative sampling
* hs : 0 - whether to use hierarchical softmax (0: off, so negative sampling is used)
* window : 10 - the maximum distance between the current and predicted word within a sentence
* alpha : the initial learning rate
* min_alpha : the learning rate drops linearly to min_alpha as training progresses; the call below sets it equal to alpha (0.025), so the rate stays constant
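
The `dm_concat` and `dm_mean` options listed above decide how the paragraph vector and the context word vectors are merged into one input for PV-DM. A minimal sketch of the three cases follows; `combine` is a hypothetical helper, and counting the paragraph vector in the mean is an assumption about the implementation.

```python
def combine(doc_vec, context_vecs, dm_concat, dm_mean):
    """Merge a paragraph vector with context word vectors (PV-DM input layer)."""
    vectors = [doc_vec] + context_vecs
    if dm_concat:
        # Concatenation: the input grows with the number of context words.
        return [x for vec in vectors for x in vec]
    summed = [sum(column) for column in zip(*vectors)]
    if dm_mean:
        return [s / len(vectors) for s in summed]
    return summed

doc_vec = [1.0, 2.0]
context = [[3.0, 4.0], [5.0, 6.0]]

concat = combine(doc_vec, context, dm_concat=1, dm_mean=0)
summed = combine(doc_vec, context, dm_concat=0, dm_mean=0)
mean = combine(doc_vec, context, dm_concat=0, dm_mean=1)

print(concat)  # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(summed)  # [9.0, 12.0]
print(mean)    # [3.0, 4.0]
```

Concatenation preserves word order at the cost of a much larger input layer, which matters with 1000-dimensional vectors.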

In [None]:
%%time
# PV-DM w/ mean (average)
d2v_model = Make_Doc2Vec_Model(data=train, size = 1000, dm = 1, dm_concat = 0,
                   dm_mean = 1, negative = 7, hs = 0, epoch = 20, window = 10,
                   alpha = 0.025, min_alpha = 0.025, workers = cores, tagger = 'ct')
# Word-level queries go through the model's KeyedVectors (`wv`)
pprint(d2v_model.wv.most_similar('문재인/Noun'))
pprint(d2v_model.wv.most_similar('노무현/Noun'))
pprint(d2v_model.wv.most_similar('박근혜/Noun'))

In [None]:
del d2v_model

### model 3

* size : dimensionality of the feature vectors (passed to gensim as `vector_size`)
* dm : 0 - distributed bag of words (PV-DBOW)
* dm_concat : 0 - no concatenation of context vectors
* dm_mean : 0 - use the sum of the context word vectors (1 would use the mean)
> dm_mean only applies when dm is used in non-concatenative mode; with dm = 0 neither dm_concat nor dm_mean has an effect.
* negative : 7 - number of 'noise words' drawn for negative sampling
* hs : 0 - whether to use hierarchical softmax (0: off, so negative sampling is used)
* window : 5 - the maximum distance between the current and predicted word within a sentence; the call below passes window = None, so gensim's default is used
* alpha : the initial learning rate
* min_alpha : the learning rate drops linearly to min_alpha as training progresses; the call below sets it equal to alpha (0.025), so the rate stays constant
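
The `negative` setting above asks for 7 noise words per training target. A rough sketch of how such noise words might be drawn is shown below. It samples from word counts raised to the power 0.75, as described in the original word2vec work; whether this run used exactly that distribution is an assumption, and `noise_sampler` and the toy counts are hypothetical.

```python
import random
from collections import Counter

def noise_sampler(counts, exponent=0.75, seed=0):
    """Return a function that draws k noise words, skipping the target word."""
    words = list(counts)
    weights = [counts[word] ** exponent for word in words]
    rng = random.Random(seed)

    def draw(target, k):
        noise = []
        while len(noise) < k:
            candidate = rng.choices(words, weights=weights, k=1)[0]
            if candidate != target:
                noise.append(candidate)
        return noise

    return draw

# Toy corpus frequencies (made up for illustration).
counts = Counter({'영화': 50, '좋다': 30, '배우': 15, '지루하다': 5})
draw = noise_sampler(counts)
noise = draw('영화', k=7)
print(len(noise), '영화' in noise)  # 7 False
```

Each real (input, target) pair is then trained against these sampled words as negatives instead of updating the whole output vocabulary.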

In [None]:
%%time
# PV-DBOW
d2v_model = Make_Doc2Vec_Model(data=train, size = 1000, dm = 0, dm_concat = 0,
                   dm_mean = 0, negative = 7, hs = 0, epoch = 20, window = None,
                   alpha = 0.025, min_alpha = 0.025, workers = cores, tagger = 'ct')
# Note: with dm=0 and gensim's default dbow_words=0, word vectors are not trained,
# so these word-level neighbours may not be meaningful.
pprint(d2v_model.wv.most_similar('문재인/Noun'))
pprint(d2v_model.wv.most_similar('노무현/Noun'))
pprint(d2v_model.wv.most_similar('박근혜/Noun'))

In [None]:
del train
del test
del d2v_model

## Making Doc2Vec Using tagger mecab

In [None]:
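from collections import namedtuple
# Custom TaggedDocument with an extra `sentiment` field. It replaces gensim's
# two-field TaggedDocument (words, tags), so that import is not needed here.
TaggedDocument = namedtuple('TaggedDocument', 'words tags sentiment')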

### tagging

In [None]:
from konlpy.tag import Mecab
mecab = Mecab()

In [None]:
# mecab
def tokenize2(doc):
    return ['/'.join(t) for t in mecab.pos(doc)]

In [None]:
# Run this when no pickled file exists yet
raw_doc_mecab = [(tokenize2(rawdata.loc[idx][0]), [idx], rawdata.loc[idx][1]) for idx in tqdm(rawdata.index)]
pickle.dump(raw_doc_mecab, open('./data/pre_data/tagged_data/pre_data_by_mecab_for_sentiment_analysis.pickled','wb'))

### Converting to the basic Doc2Vec format

In [None]:
# Run this when no pickled file exists yet

# Use the mecab-tagged documents (raw_doc_ct was deleted earlier)
tagged_mecab = [TaggedDocument(b, c, [d]) for b, c, d in tqdm(raw_doc_mecab)]
pickle.dump(tagged_mecab, open('./data/pre_data/tagged_data/pre_by_mecab_data_tagged_run_docs.pickled','wb'))

In [None]:
del raw_doc_mecab

## Building the model

In [None]:
# Run this when a pickled file already exists
tagged_mecab = pickle.load(open('./data/pre_data/tagged_data/pre_by_mecab_data_tagged_run_docs.pickled','rb'))

### train dataset & test dataset

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# Run this when no pickled file exists yet
train2, test2 = train_test_split(tagged_mecab, test_size=0.1, random_state=42)
del tagged_mecab
pickle.dump(train2, open('./data/pre_data/train_test_Data/pre_by_mecab_train.pickled','wb'))
pickle.dump(test2, open('./data/pre_data/train_test_Data/pre_by_mecab_test.pickled','wb'))

In [None]:
# Run this when a pickled file already exists
train2 = pickle.load(open('./data/pre_data/train_test_Data/pre_by_mecab_train.pickled','rb'))
test2 = pickle.load(open('./data/pre_data/train_test_Data/pre_by_mecab_test.pickled','rb'))


### model 1

* size : dimensionality of the feature vectors (passed to gensim as `vector_size`)
* dm : 1 - distributed memory (PV-DM)
* dm_concat : 1 - concatenate the context vectors rather than summing/averaging them
* dm_mean : 0 - use the sum of the context word vectors (1 would use the mean)
> dm_mean only applies when dm is used in non-concatenative mode, so it has no effect here.
* negative : 7 - number of 'noise words' drawn for negative sampling
* hs : 0 - whether to use hierarchical softmax (0: off, so negative sampling is used)
* window : 5 - the maximum distance between the current and predicted word within a sentence
* alpha : the initial learning rate
* min_alpha : the learning rate drops linearly to min_alpha as training progresses; the call below sets it equal to alpha (0.025), so the rate stays constant

In [None]:
from konlpy.utils import pprint

In [None]:
%%time
# PV-DM w/ concatenation
d2v_model = Make_Doc2Vec_Model(data=train2, size = 1000, dm = 1, dm_concat = 1,
                   dm_mean = 0, negative = 7, hs = 0, epoch = 20, window = 5,
                   alpha = 0.025, min_alpha = 0.025, workers = cores, tagger = 'mecab')
# Word-level queries go through the model's KeyedVectors (`wv`)
pprint(d2v_model.wv.most_similar('문재인/NNP'))
pprint(d2v_model.wv.most_similar('노무현/NNP'))
pprint(d2v_model.wv.most_similar('박근혜/NNP'))

In [None]:
del d2v_model

### model 2

* size : dimensionality of the feature vectors (passed to gensim as `vector_size`)
* dm : 1 - distributed memory (PV-DM)
* dm_concat : 0 - sum/average the context vectors instead of concatenating them
* dm_mean : 1 - use the mean of the context word vectors (0 would use the sum)
> dm_mean only applies when dm is used in non-concatenative mode, which is the case here.
* negative : 7 - number of 'noise words' drawn for negative sampling
* hs : 0 - whether to use hierarchical softmax (0: off, so negative sampling is used)
* window : 10 - the maximum distance between the current and predicted word within a sentence
* alpha : the initial learning rate
* min_alpha : the learning rate drops linearly to min_alpha as training progresses; the call below sets it equal to alpha (0.025), so the rate stays constant

In [None]:
%%time
# PV-DM w/ mean (average)
d2v_model = Make_Doc2Vec_Model(data=train2, size = 1000, dm = 1, dm_concat = 0,
                   dm_mean = 1, negative = 7, hs = 0, epoch = 20, window = 10,
                   alpha = 0.025, min_alpha = 0.025, workers = cores, tagger = 'mecab')
# Word-level queries go through the model's KeyedVectors (`wv`)
pprint(d2v_model.wv.most_similar('문재인/NNP'))
pprint(d2v_model.wv.most_similar('노무현/NNP'))
pprint(d2v_model.wv.most_similar('박근혜/NNP'))

In [None]:
del d2v_model

### model 3

* size : dimensionality of the feature vectors (passed to gensim as `vector_size`)
* dm : 0 - distributed bag of words (PV-DBOW)
* dm_concat : 0 - no concatenation of context vectors
* dm_mean : 0 - use the sum of the context word vectors (1 would use the mean)
> dm_mean only applies when dm is used in non-concatenative mode; with dm = 0 neither dm_concat nor dm_mean has an effect.
* negative : 7 - number of 'noise words' drawn for negative sampling
* hs : 0 - whether to use hierarchical softmax (0: off, so negative sampling is used)
* window : 5 - the maximum distance between the current and predicted word within a sentence; the call below passes window = None, so gensim's default is used
* alpha : the initial learning rate
* min_alpha : the learning rate drops linearly to min_alpha as training progresses; the call below sets it equal to alpha (0.025), so the rate stays constant

In [None]:
%%time
# PV-DBOW
d2v_model = Make_Doc2Vec_Model(data=train2, size = 1000, dm = 0, dm_concat = 0,
                   dm_mean = 0, negative = 7, hs = 0, epoch = 20, window = None,
                   alpha = 0.025, min_alpha = 0.025, workers = cores, tagger = 'mecab')
# Note: with dm=0 and gensim's default dbow_words=0, word vectors are not trained,
# so these word-level neighbours may not be meaningful.
pprint(d2v_model.wv.most_similar('문재인/NNP'))
pprint(d2v_model.wv.most_similar('노무현/NNP'))
pprint(d2v_model.wv.most_similar('박근혜/NNP'))

In [None]:
del train2
del test2
del d2v_model