# Description
> ## * Objective 
> * To find out which of the 18 models built with Doc2Vec, Word2Vec, and FastText gives the most accurate positive/negative sentiment analysis of the news articles and comments we collected ourselves.

> ## * Tagger
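> We chose taggers that allow a custom user dictionary to be added.
>> * Twitter from [customized-Konlpy](https://github.com/lovit/customized_konlpy)
>> * Mecab from [konlpy](http://konlpy.org/en/v0.4.4/)
> * The data was tagged with both taggers, and the analysis was run separately for each.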

> ## * Classification
> Each generated model was evaluated with six classifiers built from five algorithms (two of the six are Neural Networks).
>> Logistic Regression, Random Forest, Kernel SVM, XGBoost, Neural Network

> All classifiers were trained and evaluated on the same train dataset and test dataset.
> #### * [Logistic Regression](https://datascienceschool.net/view-notebook/d0df94cf8dd74e8daec7983531f68dfc/)
> #### * [Random Forest](https://datascienceschool.net/view-notebook/766fe73c5c46424ca65329a9557d0918/)
> #### * [Kernel SVM](https://datascienceschool.net/view-notebook/69278a5de79449019ad1f51a614ef87c/)
> #### * [XGBoost](https://xgboost.readthedocs.io/en/latest/)
> #### * [Neural Network](https://datascienceschool.net/view-notebook/0178802a219c4e6bb9b820b49bf57f91/)

In [2]:
import os
from glob import glob
import sys
import pandas as pd
import numpy as np

## Data Path

In [3]:
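# Pick the directory that holds the trained embedding models for this OS
if sys.platform == 'darwin':
    loadModelPath = '/Volumes/disk1/model/'
elif sys.platform == 'win32':
    loadModelPath = 'd:/model/'
else:
    # No path is defined for other platforms; fail here instead of with a NameError later
    raise RuntimeError('loadModelPath is not configured for platform: ' + sys.platform)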
saveTrainPath = './data/pre_data/train_test_Data2/'
saveClassifierPath = './data/pre_data/classifier/'

## RawData

In [4]:
rawdata = pd.read_csv('./data/sentiment_data/raw_data_for_sentiment.txt', header=None, encoding='utf-8')
print(rawdata.shape)

(491510, 2)


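* The data consists of 491,510 rows in total.
> * It uses the positive and negative sentences provided by 200,000 reviews from the [NAVER sentiment movie corpus](https://github.com/e9t/nsmc) and by part of the [Public Data Portal news big-data analysis information](https://www.data.go.kr/dataset/15012945/fileData.do), an analysis dataset based on the news database 'Kinds'.
> * With additional preprocessing, a total of 491,510 samples were collected for sentiment analysis.
> * The train dataset and test dataset were split at a 9:1 ratio.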
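
The 9:1 split described above could be reproduced along these lines. This is a minimal sketch, not the notebook's actual code: the helper name `split_train_test`, the fixed seed, and the use of plain row indices in place of the real sentences are all assumptions made for illustration.

```python
import random


def split_train_test(rows, test_ratio=0.1, seed=0):
    """Shuffle rows and split them into (train, test) by test_ratio."""
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]


# Row indices stand in for the 491,510 labeled sentences.
train, test = split_train_test(range(491510))
print(len(train), len(test))
```

Using the same seed for every embedding model keeps the train and test rows identical across classifiers, which is what makes the accuracy figures comparable.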

# Result

> ## * Model
> A model was built for each of the Twitter-tagged and Mecab-tagged datasets.  
> Both Continuous Bag of Words (CBOW) and Skip-Gram were used.
> * Embedding: an algorithm that converts words into vectors

> ### * [Word2Vec](4_Make_Word2Vec_model_For_sentiment_analysis.ipynb)
> * Builds on the Neural Network Language Model (NNLM) while improving training speed and performance.
> * Preserves the contextual meaning of a word when vectorizing it.
> * In case you missed the buzz, word2vec is widely featured as a member of the “new wave” of machine learning algorithms based on neural networks, commonly referred to as "deep learning" (though word2vec itself is rather shallow).
> * Using large amounts of unannotated plain text, word2vec learns relationships between words automatically.
> * The outputs are vectors, one vector per word, with remarkable linear relationships that allow us to do things like vec(“king”) – vec(“man”) + vec(“woman”) =~ vec(“queen”), or vec(“Montreal Canadiens”) – vec(“Montreal”) + vec(“Toronto”) resembles the vector for “Toronto Maple Leafs”.
> * Word2vec is very useful in automatic text tagging, recommender systems and machine translation.

> #### * Continuous Bag of Words (CBOW)  
> * Predicts the center word from the surrounding words.
> ex) I am going to _.

> ![cbow](./cbow.png)
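> * For a given word, C/2 words on each side (C words in total) are used as input to a network that predicts the given word.
> * Consists of an Input Layer, a Projection Layer, and an Output Layer.
> * Going from the Input Layer to the middle layer is closer to a simple projection than to a weight multiplication: the input is one-hot, so the multiplication only selects one row of the matrix.
> * Between the Input Layer and the Projection Layer there is a V×N Projection Matrix W shared by all words (V is the vocabulary size; N is the length of the Projection Layer, i.e. the length of the word vectors).
> * Between the Projection Layer and the Output Layer there is an N×V Weight Matrix W'.
> * As in NNLM, each word enters the Input Layer as a one-hot encoding; the context words are projected individually, and the average of those vectors is passed to the Projection Layer.
> * That average is then multiplied by the Weight Matrix and sent to the Output Layer, softmax is applied, and the result is compared with the one-hot encoding of the true word to compute the error.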
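
The CBOW forward pass described above can be sketched in a few lines of NumPy. This is an illustrative sketch only: the function names, the tiny sizes (V = 6, N = 4), and the random initial weights are assumptions, and real implementations replace the full softmax with negative sampling or hierarchical softmax.

```python
import numpy as np


def cbow_forward(context_ids, W, W_out):
    """One CBOW forward pass: average the context rows of W, then softmax."""
    # A one-hot vector times W just selects a row, so index W directly.
    hidden = W[context_ids].mean(axis=0)   # projection layer, shape (N,)
    scores = hidden @ W_out                # output scores, shape (V,)
    exp = np.exp(scores - scores.max())    # numerically stable softmax
    return exp / exp.sum()


def cross_entropy(probs, target_id):
    """Error against the one-hot encoding of the true center word."""
    return -np.log(probs[target_id])


rng = np.random.default_rng(0)
V, N = 6, 4                                  # vocabulary size, vector length
W = rng.normal(scale=0.1, size=(V, N))       # projection matrix (V x N)
W_out = rng.normal(scale=0.1, size=(N, V))   # weight matrix W' (N x V)

probs = cbow_forward([0, 1, 3, 4], W, W_out)  # four context words
loss = cross_entropy(probs, target_id=2)      # word 2 is the center word
print(probs.shape, float(loss))
```

After training, the rows of `W` are what get used as the word vectors.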

> #### * Skip-Gram
> * Predicts the surrounding words from the center word.
> ex) _ single-log bridge __

> ![skip-gram](./skip-gram.png)
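> * The words to predict are sampled from around the current word. To reflect the idea that words located closer to the current word are more related to it, words farther away are chosen with lower probability.
> * The rest of the architecture is very similar to CBOW, only with the direction reversed.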
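
One common way to get the distance-dependent sampling described above is to draw a smaller effective window at each position, so that nearby words land inside the window more often than distant ones. The sketch below assumes that scheme; the function name `skipgram_pairs` and the toy token list are made up for illustration.

```python
import random


def skipgram_pairs(tokens, window, rng):
    """Return (center, context) training pairs with a randomly reduced window."""
    pairs = []
    for i, center in enumerate(tokens):
        # Effective window for this position, drawn uniformly from 1..window.
        # A word at distance d is kept only when the draw is at least d,
        # so closer words are used more often.
        b = rng.randint(1, window)
        for j in range(max(0, i - b), min(len(tokens), i + b + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


tokens = ["a", "b", "c", "d", "e"]
pairs = skipgram_pairs(tokens, window=3, rng=random.Random(0))
print(pairs[:4])
```

Each pair becomes one training example in which the center word is the input and the context word is the prediction target.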

> ### * Classifier results for each model
> #### [Word2Vec](5_Train_Classifier_Using_Word2Vec_For_Sentiment_analysis.ipynb)

### Report

* Parameters used to build Word2Vec
> * sg : the training algorithm 
>> 1 : Skip-gram  
>> 0 : Continuous Bag of Words(CBOW)
> * size : Dimensionality of the feature vectors  
> * window : The maximum distance between the current and predicted word within a sentence  
> * min_count : Ignores all words with total frequency lower than this  
> * cbow_mean : how the context word vectors are combined
>> 0 : sum of the context word vectors  
>> 1 : the mean, only applies when cbow is used  
> * negative : negative sampling
>> \>0 : the int for negative specifies how many 'noise words' should be drawn (usually between 5-20)  
>> 0 : no negative sampling
> * hs : Hierarchical softmax
>> 1 : hierarchical softmax will be used for model training  
>> 0 : if negative is non-zero, negative sampling will be used  
> * workers : Use this many worker threads to train the model ( = faster training with multicore machines)  


* Three models were built
> 1. sg option: Skip-gram, cbow_mean 0
> 2. sg option: Skip-gram, cbow_mean 1 
>> This one should have been built with CBOW; it appears to have been built incorrectly
> 3. sg option: CBOW, cbow_mean 0

> All other options are identical
>> size : 1000  
>> hs : 0  
>> min_count : 2  
>> epoch : 20  
>> window : 10  
>> negative : 7
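
The three configurations above could be expressed as keyword arguments for gensim's `Word2Vec`. This is a sketch under assumptions rather than the notebook's code: it uses the gensim 3.x argument names (`size`, `iter`; gensim 4 renamed them to `vector_size` and `epochs`), maps `epoch` to `iter`, and the helper names and the `workers=4` default are hypothetical. In gensim, `sg=1` selects Skip-gram and `sg=0` selects CBOW.

```python
# Options shared by all three models (gensim 3.x argument names assumed).
COMMON = {"size": 1000, "hs": 0, "min_count": 2, "iter": 20,
          "window": 10, "negative": 7}

# Per-model options as listed above: sg=1 is Skip-gram, sg=0 is CBOW.
MODEL_CONFIGS = {
    "model1": {"sg": 1, "cbow_mean": 0},
    "model2": {"sg": 1, "cbow_mean": 1},
    "model3": {"sg": 0, "cbow_mean": 0},
}


def word2vec_kwargs(name, workers=4):
    """Merge the shared options with the options of one model."""
    return {**COMMON, **MODEL_CONFIGS[name], "workers": workers}


def train_word2vec(sentences, name):
    """Train one model; sentences is an iterable of token lists."""
    from gensim.models import Word2Vec  # imported lazily, needs gensim 3.x
    return Word2Vec(sentences, **word2vec_kwargs(name))


print(word2vec_kwargs("model3"))
```

Since `cbow_mean` only applies when CBOW is used, models 1 and 2 would differ only through random initialization if both were really trained as Skip-gram.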

### * tagger : Twitter
* number of words in model's vocabulary
> 162640

#### Results for model1
> Skip-gram, cbow_mean - sum of the context word vectors
* Logistic Regression : 0.780
* Random Forest : 0.739
* C-Support Vector : 0.489
* XGBoost : 0.772
* Neural Network1 : 0.788
* Neural Network2 : 0.785

#### Results for model2
> Skip-gram, cbow_mean - the mean of the context word vectors
* Logistic Regression : 0.816
* Random Forest : 0.804
* C-Support Vector : 0.497
* XGBoost : 0.825
* Neural Network1 : 0.8318
* Neural Network2 : 0.8336

#### Results for model3
> Continuous Bag of Words (CBOW), cbow_mean - sum of the context word vectors
* Logistic Regression : 0.828
* Random Forest : 0.803
* C-Support Vector : 0.491
* XGBoost : 0.830
* Neural Network1 : 0.8427
* Neural Network2 : 0.8487

### *  tagger : Mecab
* number of words in model's vocabulary
> 165361

#### Results for model1
> Skip-gram, cbow_mean - sum of the context word vectors
* Logistic Regression : 0.790
* Random Forest : 0.745
* C-Support Vector : 0.488
* XGBoost : 0.774
* Neural Network1 : 0.788
* Neural Network2 : 0.788

#### Results for model2
> Skip-gram, cbow_mean - the mean of the context word vectors
* Logistic Regression : 0.819
* Random Forest : 0.803
* C-Support Vector : 0.492
* XGBoost : 0.825
* Neural Network1 : 0.8294
* Neural Network2 : 0.8340

#### Results for model3
> Continuous Bag of Words (CBOW), cbow_mean - sum of the context word vectors
* Logistic Regression : 0.829
* Random Forest : 0.802
* C-Support Vector : 0.489
* XGBoost : 0.826
* Neural Network1 : 0.8421
* Neural Network2 : 0.8465

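* The classification accuracy of C-Support Vector is clearly lower than that of the other classifiers. 
> Training took too long, so the maximum number of iterations was capped at 1500.   
> We also tried to shorten the training time with methods such as scaling the features. 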
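
The scaling mentioned above could look like the following. This is a minimal sketch that assumes standardization (zero mean, unit variance per feature); the notes do not say which scaling method was actually used, and the function name and toy arrays are made up for illustration.

```python
import numpy as np


def standardize(train_X, test_X):
    """Scale features to zero mean and unit variance using train statistics."""
    mean = train_X.mean(axis=0)
    std = train_X.std(axis=0)
    std[std == 0] = 1.0  # leave constant features unchanged instead of dividing by 0
    return (train_X - mean) / std, (test_X - mean) / std


train_X = np.array([[1.0, 10.0], [3.0, 10.0]])
test_X = np.array([[2.0, 10.0]])
scaled_train, scaled_test = standardize(train_X, test_X)
print(scaled_train)
print(scaled_test)
```

The statistics come from the train dataset only, so no information from the test dataset leaks into training.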

> ### * [FastText](./4_Make_FastText_model_For_sentiment_analysis.ipynb)
> * A word embedding technique developed by Facebook  
> * Based on Google's Word2Vec, but it also embeds subwords.  
> * Because it can learn the morphological information of words, it works well for languages with many affixes, such as Korean.  
> * The main principle behind fastText is that **<U>the morphological structure of a word carries important information about the meaning of the word, which is not taken into account by traditional word embeddings, which train a unique word embedding for every individual word</U>**. This is especially significant for morphologically rich languages (German, Turkish) in which a single word can have a large number of morphological forms, each of which might occur rarely, thus making it hard to train good word embeddings.
> * fastText attempts to solve this by treating each word as the aggregation of its subwords. For the sake of simplicity and language-independence, subwords are taken to be the character ngrams of the word. The vector for a word is simply taken to be the sum of all vectors of its component char-ngrams.
> * According to a detailed comparison of Word2Vec and FastText in this notebook, fastText does significantly better on syntactic tasks as compared to the original Word2Vec, especially when the size of the training corpus is small. Word2Vec slightly outperforms FastText on semantic tasks though. The differences grow smaller as the size of training corpus increases. Training time for fastText is significantly higher than the Gensim version of Word2Vec (15min 42s vs 6min 42s on text8, 17 mil tokens, 5 epochs, and a vector size of 100).
> * fastText can be used to obtain vectors for out-of-vocabulary (OOV) words, by summing up vectors for its component char-ngrams, provided at least one of the char-ngrams was present in the training data.
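
The subword idea above can be illustrated with a small sketch: split a word into character n-grams (with `<` and `>` marking the word boundaries) and build an OOV vector by summing the vectors of the n-grams that were seen in training. The function names, the n-gram range of 3 to 6, and the toy `ngram_vectors` table are assumptions for illustration; real fastText hashes n-grams into a fixed number of buckets instead of keeping a dictionary.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(wrapped) - n + 1)]


def oov_vector(word, ngram_vectors, dim):
    """Sum the vectors of the known n-grams of an out-of-vocabulary word."""
    total = [0.0] * dim
    found = False
    for gram in char_ngrams(word):
        vec = ngram_vectors.get(gram)
        if vec is not None:
            found = True
            total = [t + v for t, v in zip(total, vec)]
    if not found:
        # No n-gram of this word appeared in the training data.
        raise KeyError(f"no known char n-grams for {word!r}")
    return total


# Toy table of trained n-gram vectors (hypothetical values).
ngram_vectors = {"<wh": [1.0, 0.0], "re>": [0.0, 2.0]}
print(char_ngrams("where", 3, 3))
print(oov_vector("where", ngram_vectors, dim=2))
```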

> ### * Classifier results for each model
> #### [FastText](./5_Train_Classifier_Using_FastText_For_Sentiment_analysis.ipynb)

### Report

* Parameters used to build FastText
> * sg : the training algorithm 
>> 1 : Skip-gram  
>> 0 : Continuous Bag of Words(CBOW)  
> * size : Dimensionality of the feature vectors  
> * window : The maximum distance between the current and predicted word within a sentence  
> * min_count : Ignores all words with total frequency lower than this  
> * cbow_mean : how the context word vectors are combined
>> 0 : sum of the context word vectors  
>> 1 : the mean, only applies when cbow is used  
> * negative : negative sampling
>> \>0 : the int for negative specifies how many 'noise words' should be drawn (usually between 5-20)  
>> 0 : no negative sampling
> * hs : Hierarchical softmax
>> 1 : hierarchical softmax will be used for model training  
>> 0 : if negative is non-zero, negative sampling will be used  
> * workers : Use this many worker threads to train the model ( = faster training with multicore machines)    
> * word_ngrams : subword(ngrams) information
>> 1 : enriches word vectors with subword (n-gram) information  
>> 0 : equivalent to word2vec

* Three models were built
> 1. sg option: Continuous Bag of Words (CBOW), cbow_mean 0
> 2. sg option: Continuous Bag of Words (CBOW), cbow_mean 1 
> 3. sg option: Skip-gram, cbow_mean 0

> All other options are identical
>> size : 1000  
>> hs : 0  
>> min_count : 2  
>> epoch : 20  
>> window : 10  
>> negative : 7  
>> word_ngrams : 1

### * tagger : Twitter
* number of words in model's vocabulary
> 162564

#### Results for model1
> Continuous Bag of Words (CBOW), cbow_mean - sum of the context word vectors
* Logistic Regression : 0.794
* Random Forest : 0.750
* C-Support Vector : 0.490
* XGBoost : 0.781
* Neural Network1 : 0.8036
* Neural Network2 : 0.8053

#### Results for model2
> Continuous Bag of Words (CBOW), cbow_mean - the mean of the context word vectors
* Logistic Regression : 0.818
* Random Forest : 0.804
* C-Support Vector : 0.493
* XGBoost : 0.828
* Neural Network1 : 0.8321
* Neural Network2 : 0.8396

#### Results for model3
> Skip-Gram, cbow_mean - sum of the context word vectors
* Logistic Regression : 0.830
* Random Forest : 0.806
* C-Support Vector : 0.492
* XGBoost : 0.832
* Neural Network1 : 0.8511
* Neural Network2 : 0.8499

### *  tagger : Mecab
* number of words in model's vocabulary
> 165823

#### Results for model1
> Continuous Bag of Words (CBOW), cbow_mean - sum of the context word vectors
* Logistic Regression : 0.799
* Random Forest : 0.758
* C-Support Vector : 0.488
* XGBoost : 0.787
* Neural Network1 : 0.7980
* Neural Network2 : 0.8070

#### Results for model2
> Continuous Bag of Words (CBOW), cbow_mean - the mean of the context word vectors
* Logistic Regression : 0.818
* Random Forest : 0.804
* C-Support Vector : 0.495
* XGBoost : 0.824
* Neural Network1 : 0.8350
* Neural Network2 : 0.8294

#### Results for model3
> Skip-Gram, cbow_mean - sum of the context word vectors
* Logistic Regression : 0.828
* Random Forest : 0.801
* C-Support Vector : 0.490
* XGBoost : 0.828
* Neural Network1 : 0.8463
* Neural Network2 : 0.8477

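* The classification accuracy of C-Support Vector is clearly lower than that of the other classifiers. 
> Training took too long, so the maximum number of iterations was capped at 1500.   
> We also tried to shorten the training time with methods such as scaling the features. 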

> ### * [Doc2Vec](./4_Make_Doc2Vec_model_for_Sentiment_analysis.ipynb)
> #### * Paragraph Vector
> * Le and Mikolov 2014 introduces the Paragraph Vector, which outperforms more naïve representations of documents such as averaging the Word2vec word vectors of a document. The idea is straightforward: we act as if a paragraph (or document) is just another vector like a word vector, but we will call it a paragraph vector. We determine the embedding of the paragraph in vector space in the same way as words. Our paragraph vector model considers local word order like bag of n-grams, but gives us a denser representation in vector space compared to a sparse, high-dimensional representation.
> * Paragraph Vector - Distributed Memory (PV-DM)
> * This is the Paragraph Vector model **<U>analogous to Continuous-bag-of-words Word2vec</U>**. The paragraph vectors are obtained by training a neural network on the fake task of **<U>inferring a center word based on context words and a context paragraph.</U>** A paragraph is a context for all words in the paragraph, and a word in a paragraph can have that paragraph as a context.

> * Paragraph Vector - Distributed Bag of Words (PV-DBOW)
> * This is the Paragraph Vector model **<U>analogous to Skip-gram Word2vec</U>**. The paragraph vectors are obtained by training a neural network on the fake task of **<U>predicting a probability distribution of words in a paragraph given a randomly-sampled word from the paragraph</U>**.

> ### * Classifier results for each model
> #### [Doc2vec](./5_Train_Classifier_Using_Doc2Vec_For_Sentiment_analysis.ipynb)

### Report

* Parameters used to build Doc2Vec
> * dm : the training algorithm 
>> 1 : distributed memory (PV-DM)  
>> 0 : distributed bag of words (PV-DBOW)
> * size : Dimensionality of the feature vectors  
> * window : The maximum distance between the current and predicted word within a sentence  
> * negative : negative sampling
>> \>0 : the int for negative specifies how many 'noise words' should be drawn (usually between 5-20)  
>> 0 : no negative sampling
> * hs : Hierarchical softmax
>> 1 : hierarchical softmax will be used for model training  
>> 0 : if negative is non-zero, negative sampling will be used  
> * workers : Use this many worker threads to train the model ( = faster training with multicore machines)  
> * alpha : the initial learning rate  
> * min_alpha : learning rate will linearly drop to min_alpha as training progresses  
> * dm_concat : concatenation of context vectors rather than sum/average  
>> 1 : use  
>> 0 : not use
> * dm_mean : the mean of the context word vectors
>> 1 : use the mean; only applies when dm is used in non-concatenative mode  
>> 0 : use the sum of the context word vectors  
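
To make the difference between `dm_mean` and `dm_concat` concrete, the sketch below builds the input of the PV-DM prediction layer from one paragraph vector and its context word vectors. It illustrates the idea under simplifying assumptions (one paragraph tag, tiny vector size, a made-up function name) and is not gensim's internal code.

```python
import numpy as np


def pv_dm_input(doc_vec, context_vecs, dm_mean=1, dm_concat=0):
    """Combine a paragraph vector with its context word vectors (PV-DM)."""
    vectors = [doc_vec] + list(context_vecs)
    if dm_concat:
        # Concatenation keeps word order, but the input grows with the window.
        return np.concatenate(vectors)
    stacked = np.stack(vectors)
    return stacked.mean(axis=0) if dm_mean else stacked.sum(axis=0)


size, window = 4, 2
doc_vec = np.ones(size)
context_vecs = [np.zeros(size) for _ in range(2 * window)]

print(pv_dm_input(doc_vec, context_vecs, dm_mean=1).shape)    # same length as size
print(pv_dm_input(doc_vec, context_vecs, dm_concat=1).shape)  # (1 + 2 * window) * size
```

With `size` 1000 and `window` 5, the concatenated input would have (1 + 2 × 5) × 1000 = 11,000 dimensions, compared with 1,000 for the sum or mean.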


* Three models were built
> 1. dm option: distributed memory (PV-DM), dm_mean 1, window 10
> 2. dm option: distributed memory (PV-DM), dm_concat 1, window 5
> 3. dm option: distributed bag of words (PV-DBOW), cbow_mean 0, dm_concat 0, window default

> All other options are identical
>> size : 1000  
>> hs : 0   
>> epoch : 20  
>> negative : 7

### * tagger : Twitter

#### Results for model1
> dm option: distributed memory (PV-DM), dm_mean 1, window 10
* Logistic Regression : 0.692
* Random Forest : 0.622
* C-Support Vector : 0.570
* XGBoost : 0.688
* Neural Network1 : 0.7104
* Neural Network2 : 0.7059

#### Results for model2
> dm option: distributed memory (PV-DM), dm_concat 1, window 5
* Logistic Regression : 0.507
* Random Forest : 0.535
* C-Support Vector : 0.489
* XGBoost : 0.542
* Neural Network1 : 0.5405
* Neural Network2 : 0.5421

#### Results for model3
> dm option: distributed bag of words (PV-DBOW), cbow_mean 0, dm_concat 0, window default
* Logistic Regression : 0.827
* Random Forest : 0.741
* C-Support Vector : 0.609
* XGBoost : 0.809
* Neural Network1 : 0.8474
* Neural Network2 : 0.8481

### *  tagger : Mecab

#### Results for model1
> dm option: distributed memory (PV-DM), dm_mean 1, window 10
* Logistic Regression : 0.704
* Random Forest : 0.636
* C-Support Vector : 0.573
* XGBoost : 0.698
* Neural Network1 : 0.7290
* Neural Network2 : 0.7280

#### Results for model2
> dm option: distributed memory (PV-DM), dm_concat 1, window 5
* Logistic Regression : 0.508
* Random Forest : 0.535
* C-Support Vector : 0.489
* XGBoost : 0.542
* Neural Network1 : 0.5405
* Neural Network2 : 0.5421

#### Results for model3
> dm option: distributed bag of words (PV-DBOW), cbow_mean 0, dm_concat 0, window default
* Logistic Regression : 0.830
* Random Forest : 0.743
* C-Support Vector : 0.610
* XGBoost : 0.816
* Neural Network1 : 0.8537
* Neural Network2 : 0.8522

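* The classification accuracy of C-Support Vector is clearly lower than that of the other classifiers. 
> Training took too long, so the maximum number of iterations was capped at 1500.   
> We also tried to shorten the training time with methods such as scaling the features. 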

## Notes on other models

### Neural Network Language Model (NNLM)
> * Feed-Forward Neural Network Language Model
> * A neural-network-based method for converting words into vectors
> * A computer can only operate on words after they have been converted into numbers
> * A Neural Network made up of an Input Layer, a Projection Layer, a Hidden Layer, and an Output Layer
> ![image](./nnlm.png)

>> 1. One-hot encode the N words that precede the current word.
>> 2. With a vocabulary of size V and a Projection Layer of size P, each vector is passed to the next layer through a V×P Projection Matrix.
>> 3. Treating the Projection Layer values as input, pass them through a hidden layer of size H and compute **the probability of each word** at the output layer.
>> 4. Compare the result with the one-hot encoding of the actual word to compute the error, then back-propagate it to optimize the network weights.
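
Steps 1 to 3 above can be sketched as a single forward pass. This is a simplified illustration: biases and the optional direct input-to-output connections are left out, the tanh activation is an assumption, and all names and sizes (V = 8, P = 3, N = 2, H = 5) are made up.

```python
import numpy as np


def nnlm_forward(prev_ids, C, W_h, W_o):
    """NNLM forward pass: project N previous words, hidden layer, softmax."""
    # Steps 1-2: a one-hot vector times the V x P matrix C is a row lookup;
    # the N projected vectors are concatenated into one input of length N * P.
    x = C[prev_ids].reshape(-1)
    # Step 3: hidden layer of size H, then a probability for every word.
    h = np.tanh(x @ W_h)
    scores = h @ W_o
    exp = np.exp(scores - scores.max())  # numerically stable softmax
    return exp / exp.sum()


rng = np.random.default_rng(0)
V, P, N, H = 8, 3, 2, 5
C = rng.normal(scale=0.1, size=(V, P))        # projection matrix
W_h = rng.normal(scale=0.1, size=(N * P, H))  # projection -> hidden
W_o = rng.normal(scale=0.1, size=(H, V))      # hidden -> output

probs = nnlm_forward([1, 4], C, W_h, W_o)     # the two previous words
print(probs.shape, float(probs.sum()))
```

Step 4 would compare `probs` with the one-hot encoding of the actual next word (cross-entropy) and back-propagate the error into `C`, `W_h`, and `W_o`.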

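* Drawbacks
> * The parameter N, the number of words to look at, is fixed and has to be chosen in advance. 
> * Only the preceding words are taken into account; the words that follow the current word cannot be considered. 
> * **It is slow.**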

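### Recurrent Neural Network Language Model (RNNLM)
> * A variant of NNLM in the form of a Recurrent Neural Network
> * Consists only of input, hidden, and output layers, with no Projection Layer
> * **The hidden layer has a recurrent connection, so the hidden layer's state from the previous time step is fed back in as input**
> ![image](./rnnlm.png)
> * The part labeled U in the figure is used as the Word Embedding; **with a hidden layer of size H, each word is represented as a vector of length H**
> * Unlike NNLM, there is no need to fix the number of words; the model is trained by feeding in the training words sequentially
> * Requires less computation than NNLM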
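
The recurrent connection described above can be sketched as one time step that takes the current word and the previous hidden state. This is an illustration under assumptions: the sigmoid activation, the matrix names other than U, and the tiny sizes are chosen for the example, and training (back-propagation through time) is omitted.

```python
import numpy as np


def softmax(scores):
    exp = np.exp(scores - scores.max())  # numerically stable softmax
    return exp / exp.sum()


def rnnlm_step(word_id, h_prev, U, W, V_out):
    """One RNNLM time step: new hidden state and next-word probabilities."""
    # U[word_id] is the length-H embedding of the current word;
    # h_prev @ W feeds the previous hidden state back in.
    h = 1.0 / (1.0 + np.exp(-(U[word_id] + h_prev @ W)))
    return h, softmax(h @ V_out)


rng = np.random.default_rng(0)
vocab_size, H = 7, 4
U = rng.normal(scale=0.1, size=(vocab_size, H))      # input -> hidden (word embeddings)
W = rng.normal(scale=0.1, size=(H, H))               # recurrent hidden -> hidden
V_out = rng.normal(scale=0.1, size=(H, vocab_size))  # hidden -> output

h = np.zeros(H)
for word_id in [2, 5, 1]:  # words are fed in one at a time, no fixed N
    h, probs = rnnlm_step(word_id, h, U, W, V_out)
print(h.shape, probs.shape)
```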

* Theory references 

> [ratsgo's blog](https://ratsgo.github.io/)  
>> 1. [Classifying sentences with Word2Vec](https://ratsgo.github.io/natural%20language%20processing/2017/03/08/word2vec/)  
>> 2. [The surprising magic of counting frequencies: Word2Vec, Glove, Fasttext](https://ratsgo.github.io/from%20frequency%20to%20semantics/2017/03/11/embedding/)  
>> 3. [Neural Network Language Model](https://ratsgo.github.io/from%20frequency%20to%20semantics/2017/03/29/NNLM/)  
>> 4. [How Word2Vec is trained](https://ratsgo.github.io/from%20frequency%20to%20semantics/2017/03/30/word2vec/)  

> [data scienceschool](https://datascienceschool.net)  
>> 1. [Word embedding and word2vec](https://datascienceschool.net/view-notebook/6927b0906f884a67b0da9310d3a581ee/)  
>> 2. [Scikit-Learn's document preprocessing features](https://datascienceschool.net/view-notebook/3e7aadbf88ed4f0d87a76f9ddc925d69/)  
>> 3. [Probabilistic language models](https://datascienceschool.net/view-notebook/a0c848e1e2d343d685e6077c35c4203b/)  

> [BEOMSU KIM](https://shuuki4.wordpress.com/category/deep-learning/)
>>  1. [Summary of word2vec theory](https://shuuki4.wordpress.com/2016/01/27/word2vec-%EA%B4%80%EB%A0%A8-%EC%9D%B4%EB%A1%A0-%EC%A0%95%EB%A6%AC/)  