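# Motivation:

- Among word-embedding models, I used doc2vec, which takes a document id as an additional input and learns an embedding for the whole document.

- Each graduate-school lab, and each paper submitted by a lab, was treated as a separate document.

- The document (corpus) for each lab consists of the titles of its submitted papers; the document for each paper consists of part of its abstract.

- This time, because of technical and time constraints, only the part of each abstract that is shown on Google Scholar could be crawled.

- In the future, I would like to build an embedding map from more data and use the individual lab vectors for clustering or classification problems.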

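# Crawling:

- For each lab, only the papers submitted before 2018 were saved manually; their abstracts were then crawled automatically from Google Scholar with selenium.

- multiprocessing was used to speed up this step.

- Several errors occurred, and because exceptions were not handled properly, the data for many papers was not collected correctly.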
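
The exception-handling problem described above could be approached along the following lines. This is only a minimal sketch, not the crawler that was actually used: `fake_fetch` is a hypothetical stand-in for the selenium lookup, and a thread pool from the `multiprocessing` package is assumed in place of whatever pool the original run used. The idea is to record each failure next to its title rather than lose the paper silently.

```python
from multiprocessing.pool import ThreadPool


def fake_fetch(title):
    """Hypothetical stand-in for the selenium lookup; fails on empty titles."""
    if not title.strip():
        raise ValueError("empty title")
    return "abstract of " + title.strip()


def safe_fetch(title):
    """Return (title, abstract, error) so one failure cannot abort the run."""
    try:
        return title, fake_fetch(title), None
    except Exception as exc:  # broad on purpose: every failure is recorded
        return title, None, repr(exc)


def crawl_all(titles, workers=4):
    """Fetch all titles in parallel; pool.map keeps the input order."""
    with ThreadPool(workers) as pool:
        return pool.map(safe_fetch, titles)


results = crawl_all(["Paper A\n", "", "Paper B\n"])
# Titles whose abstract could not be collected, kept for a later retry
failed = [title for title, abstract, error in results if error is not None]
```

Keeping `failed` around makes it possible to retry only the papers that went wrong.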

In [1]:
#Import all the dependencies
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize

In [177]:
import nltk
import pickle

### Dictionary for lab.

In [183]:
label = ['dm', 'scm', 'polytope', 'imlab']
# Map lab index -> lab name
dct = dict(enumerate(label))

In [191]:
with open('corpus.pkl','rb') as file:
    corpus = pickle.load(file)

In [25]:
corpus = ['.'.join(i) for i in corpus]

In [45]:
with open('dm_abstract.pkl','rb') as file:
    dm_abstract = pickle.load(file)

In [186]:
with open('abstract_temp.pkl','rb') as file:
    temp = pickle.load(file)

In [51]:
abstract = dm_abstract + temp 

In [52]:
abstract

[['… effects (such as news about the macroeconomy or expected asset returns), employer‐specific\ntime fixed effects, investor fixed effects, investor‐level income effects, and time … We conclude in\nSection V by reconciling our results with the disposition effect and discussing …',
  '… Accelerated Degradation Testing (ADT) combines these two approaches by testing products\nin harsh … failures, may be combined with degradation observations to make inference on product\nlifetimes, as in … The point estimates and quantiles for the failure-time dis- tribution at use …',
  'Many researches have exploited textual data, such as news, online blogs, and financial\nreports, in order to predict stock price movements effectively. Previous studies formed the\ntask as a classification problem predicting upward or downward movement of stock prices\nfrom text documents. Such an approach, however, may be deemed inappropriate when\ncombined with sentiment analysis. In financial documents, same words ma

In [53]:
len(abstract) # abstracts for 4 labs 

4

In [38]:
#No abstract was crawled for lab. 2 due to error
abstract.pop(2)

[]

In [54]:
#flatten list
abstract = [item for sublist in abstract for item in sublist]

In [55]:
len(abstract)

202
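
Flattening the nested list discards which lab each abstract came from. Below is a minimal sketch, on toy data rather than the crawled abstracts, of flattening while recording the owning lab. The names `abstract_by_lab`, `flat`, and `owner` are hypothetical. An empty sublist, such as the lab whose crawl failed, simply contributes nothing, so no manual `pop` is needed.

```python
labs = ['dm', 'scm', 'polytope', 'imlab']

# Toy stand-in for the nested abstract list: one sublist per lab
abstract_by_lab = [['a0', 'a1'], ['b0'], [], ['d0']]

flat = []   # all abstracts in one list
owner = []  # owner[i] is the lab that flat[i] belongs to

for lab, items in zip(labs, abstract_by_lab):
    for text in items:
        flat.append(text)
        owner.append(lab)
```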

# Preprocessing and POS-tagging

In [19]:
import gensim
import nltk
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from tqdm import tqdm

In [85]:
tokens = []

for content in tqdm(corpus):
    t = gensim.utils.simple_preprocess(content)
    tokens.append(t)

100%|███████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 334.26it/s]


In [86]:
pos_tokens = []
pos_list = ['NN','NNS','NNP','NNPS']
for sublist in tqdm(tokens):
    temp = nltk.pos_tag(sublist)
    temp = [i for i in temp if i[1] in pos_list]
    pos_tokens.append(temp)

100%|████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 21.11it/s]


In [87]:
pos_tokens

[[('effect', 'NN'),
  ('reinforcement', 'NN'),
  ('learning', 'NN'),
  ('stock', 'NN'),
  ('market', 'NN'),
  ('product', 'NN'),
  ('failure', 'NN'),
  ('prediction', 'NN'),
  ('data', 'NNS'),
  ('stock', 'NN'),
  ('price', 'NN'),
  ('prediction', 'NN'),
  ('sentiment', 'NN'),
  ('analysis', 'NN'),
  ('disclosures', 'NNS'),
  ('representation', 'NN'),
  ('convolution', 'NN'),
  ('filter', 'NN'),
  ('word', 'NN'),
  ('clustering', 'NN'),
  ('representation', 'NN'),
  ('knowledge', 'NN'),
  ('extraction', 'NN'),
  ('visualization', 'NN'),
  ('design', 'NN'),
  ('process', 'NN'),
  ('documents', 'NNS'),
  ('detection', 'NN'),
  ('method', 'NN'),
  ('class', 'NN'),
  ('vectors', 'NNS'),
  ('customer', 'NN'),
  ('voice', 'NN'),
  ('classification', 'NN'),
  ('smartphone', 'NN'),
  ('user', 'NN'),
  ('segmentation', 'NN'),
  ('sequence', 'NN'),
  ('networks', 'NNS'),
  ('industry', 'NN'),
  ('network', 'NN'),
  ('business', 'NN'),
  ('text', 'NN'),
  ('disclosures', 'NNS'),
  ('news', 'NN'),

In [88]:
final_tokens = []
lem = WordNetLemmatizer()
for sublist in pos_tokens:
    temp=[]
    for item in sublist:
        t = lem.lemmatize(item[0], pos='n')
        temp.append(t)
    final_tokens.append(temp)

In [89]:
final_tokens

[['effect',
  'reinforcement',
  'learning',
  'stock',
  'market',
  'product',
  'failure',
  'prediction',
  'data',
  'stock',
  'price',
  'prediction',
  'sentiment',
  'analysis',
  'disclosure',
  'representation',
  'convolution',
  'filter',
  'word',
  'clustering',
  'representation',
  'knowledge',
  'extraction',
  'visualization',
  'design',
  'process',
  'document',
  'detection',
  'method',
  'class',
  'vector',
  'customer',
  'voice',
  'classification',
  'smartphone',
  'user',
  'segmentation',
  'sequence',
  'network',
  'industry',
  'network',
  'business',
  'text',
  'disclosure',
  'news',
  'machine',
  'detection',
  'integration',
  'inspection',
  'sale',
  'service',
  'data',
  'bag',
  'concept',
  'document',
  'representation',
  'word',
  'representation',
  'prediction',
  'drug',
  'failure',
  'option',
  'relationship',
  'production',
  'customer',
  'service',
  'data',
  'failure',
  'analysis',
  'product',
  'metrology',
  'copper',
 

## Tokenize abstract corpus

In [74]:
abs_tokens = []

for content in tqdm(abstract):
    t = gensim.utils.simple_preprocess(content)
    abs_tokens.append(t)
abs_pos_tokens = []
pos_list = ['NN','NNS','NNP','NNPS']
for sublist in tqdm(abs_tokens):
    temp = nltk.pos_tag(sublist)
    temp = [i for i in temp if i[1] in pos_list]
    abs_pos_tokens.append(temp)

abs_final_tokens = []

for sublist in abs_pos_tokens:
    temp=[]
    for item in sublist:
        t = lem.lemmatize(item[0], pos='n')
        temp.append(t)
    abs_final_tokens.append(temp)

100%|██████████████████████████████████████████████████████████████████████████████| 202/202 [00:00<00:00, 8439.58it/s]
100%|███████████████████████████████████████████████████████████████████████████████| 202/202 [00:00<00:00, 292.27it/s]
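
The tokenize, noun-filter, and lemmatize steps are applied twice above, once to the lab corpus and once to the abstracts. One way to avoid the duplication is a single helper, sketched below. The tokenizer, tagger, and lemmatizer are passed in as arguments; the `toy_*` functions are hypothetical stand-ins so that the sketch runs without nltk data.

```python
NOUN_TAGS = {'NN', 'NNS', 'NNP', 'NNPS'}


def noun_lemmas(texts, tokenize, pos_tag, lemmatize):
    """Tokenize each text, keep noun tokens only, and lemmatize them."""
    result = []
    for text in texts:
        tagged = pos_tag(tokenize(text))
        result.append([lemmatize(word) for word, tag in tagged
                       if tag in NOUN_TAGS])
    return result


# Toy stand-ins (assumptions, not the real nltk/gensim functions)
def toy_tokenize(text):
    return text.lower().split()


def toy_pos_tag(tokens):
    nouns = {'stock': 'NN', 'prices': 'NNS', 'documents': 'NNS'}
    return [(tok, nouns.get(tok, 'VB')) for tok in tokens]


def toy_lemmatize(word):
    return word[:-1] if word.endswith('s') else word


demo = noun_lemmas(['Stock prices move', 'Read documents'],
                   toy_tokenize, toy_pos_tag, toy_lemmatize)
```

With the objects defined in this notebook, the call would presumably look like `noun_lemmas(abstract, gensim.utils.simple_preprocess, nltk.pos_tag, lambda w: lem.lemmatize(w, pos='n'))`.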


# Build tagged corpus

In [90]:
for doc in abs_final_tokens:
    final_tokens.append(doc)

In [127]:
tagged_data = [TaggedDocument(words=_d, tags=[str(i)]) for i, _d in enumerate(final_tokens[:4])]

In [128]:
abs_tagger = [TaggedDocument(words=_d, tags=['abs'+str(i)]) for i, _d in enumerate(final_tokens[4:])]
for doc in abs_tagger:
    tagged_data.append(doc)

In [131]:
len(tagged_data) # 4 lab + 202 abstracts

206
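
The tagging scheme used here (plain indices for labs, an `abs` prefix for papers) can be illustrated without gensim. In this sketch `Tagged` is a stand-in namedtuple with the same two fields as `TaggedDocument`, and the token lists are toy data. Checking that the tags are unique is an added sanity check, not something the original cells do.

```python
from collections import namedtuple

# Stand-in with the same two fields as gensim's TaggedDocument
Tagged = namedtuple('Tagged', ['words', 'tags'])


def build_tagged(lab_docs, abs_docs):
    """Tag lab documents '0', '1', ... and paper abstracts 'abs0', 'abs1', ..."""
    tagged = [Tagged(words, [str(i)]) for i, words in enumerate(lab_docs)]
    tagged += [Tagged(words, ['abs' + str(i)])
               for i, words in enumerate(abs_docs)]
    return tagged


toy = build_tagged([['stock'], ['supplier']],
                   [['wafer'], ['kernel'], ['inventory']])
all_tags = [doc.tags[0] for doc in toy]
# Every document needs its own tag, otherwise two documents share one vector
tags_unique = len(all_tags) == len(set(all_tags))
```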

In [130]:
max_epochs = 100
vec_size = 100
alpha = 0.025

# `vector_size` replaces the deprecated `size` argument
model = Doc2Vec(vector_size=vec_size,
                alpha=alpha,
                min_alpha=0.00025,
                min_count=1,
                dm=1)  # dm=1: distributed memory (PV-DM)

model.build_vocab(tagged_data)

for epoch in range(max_epochs):
    if epoch % 10 == 9:
        print('iteration {0}'.format(epoch + 1))
    # Each call runs `model.epochs` passes (formerly `model.iter`)
    model.train(tagged_data,
                total_examples=model.corpus_count,
                epochs=model.epochs)
    # decrease the learning rate
    model.alpha -= 0.0002
    # pin min_alpha to alpha so the rate stays constant within each train() call
    model.min_alpha = model.alpha



iteration 10
iteration 20
iteration 30
iteration 40
iteration 50
iteration 60
iteration 70
iteration 80
iteration 90
iteration 100
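
The manual learning-rate schedule in the training loop can be checked with plain arithmetic. The sketch below replays only the `alpha` updates, without gensim. `epochs_per_call = 5` is an assumed value for illustration; the real number is whatever the model's epoch setting was at run time.

```python
alpha = 0.025
step = 0.0002
max_epochs = 100
epochs_per_call = 5  # assumption: stands in for the model's epoch setting

schedule = []  # learning rate used by each outer iteration
for epoch in range(max_epochs):
    schedule.append(alpha)
    alpha -= step

final_alpha = alpha  # value left in `alpha` after the loop
total_passes = max_epochs * epochs_per_call  # passes over the corpus in total
```

The rate falls linearly from 0.025 to about 0.0052 in the last iteration, and because every outer iteration calls `train` with several epochs, the corpus is seen `total_passes` times rather than 100.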


### Word vectors most similar to a given lab

In [133]:
doc1 = model.docvecs['1']
model.wv.similar_by_vector(doc1)

[('optimization', 0.4897495210170746),
 ('population', 0.48179152607917786),
 ('payment', 0.45475950837135315),
 ('elssp', 0.4447600841522217),
 ('lnai', 0.4429042935371399),
 ('logistics', 0.4420936107635498),
 ('programming', 0.43970799446105957),
 ('fcd', 0.43673911690711975),
 ('setup', 0.42975914478302),
 ('supplier', 0.4212070107460022)]

In [134]:
doc2 = model.docvecs['2']
model.wv.similar_by_vector(doc2)

[('rank', 0.7109458446502686),
 ('max', 0.6564980149269104),
 ('train', 0.6179179549217224),
 ('atm', 0.6132557392120361),
 ('handelman', 0.6125264763832092),
 ('configuration', 0.6064706444740295),
 ('cut', 0.5906915068626404),
 ('approximation', 0.5770852565765381),
 ('axis', 0.5629941821098328),
 ('multicast', 0.560961902141571)]

In [135]:
doc3 = model.docvecs['3']
model.wv.similar_by_vector(doc3)

[('heterogenity', 0.5326517224311829),
 ('precision', 0.5017560720443726),
 ('leone', 0.4205802083015442),
 ('grapheme', 0.41912612318992615),
 ('bb', 0.41466397047042847),
 ('execute', 0.412802129983902),
 ('directive', 0.41278401017189026),
 ('factory', 0.412083238363266),
 ('momentum', 0.4105476140975952),
 ('future', 0.4076741337776184)]

In [136]:
doc0 = model.docvecs['0']
model.wv.similar_by_vector(doc0)

[('misstatement', 0.4721144735813141),
 ('verification', 0.4721129834651947),
 ('detection', 0.42952078580856323),
 ('clad', 0.4235228896141052),
 ('self', 0.41090112924575806),
 ('cortex', 0.40609434247016907),
 ('secure', 0.40481600165367126),
 ('learning', 0.40409040451049805),
 ('laminate', 0.3999432623386383),
 ('observation', 0.3956422209739685)]

In [155]:
word_vec = model.wv['support']
model.docvecs.most_similar([word_vec])

[('abs111', 0.2729227542877197),
 ('abs16', 0.27022090554237366),
 ('abs106', 0.2666151225566864),
 ('abs7', 0.2567080855369568),
 ('abs23', 0.2405010312795639),
 ('abs173', 0.23998074233531952),
 ('abs112', 0.23302768170833588),
 ('abs123', 0.230605348944664),
 ('abs91', 0.225291907787323),
 ('abs68', 0.22120696306228638)]
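
The scores in the lists above are cosine similarities between the query vector and each candidate vector. A minimal pure-Python sketch of that ranking follows, on toy 2-dimensional vectors instead of the trained 100-dimensional ones. The function names are hypothetical and this is not gensim's implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_cosine(query, vectors, topn=10):
    """Return up to `topn` (name, score) pairs, highest similarity first."""
    scored = [(name, cosine(query, vec)) for name, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


toy_vectors = {'a': [1.0, 0.0], 'b': [0.0, 1.0], 'c': [1.0, 1.0]}
ranked = rank_by_cosine([1.0, 0.0], toy_vectors)
```

Scores around 0.2 to 0.3, as in the `support` query above, therefore indicate only a weak alignment between the word vector and the document vectors.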

### Dict for abstracts

In [143]:
import pickle
with open('corpus.pkl','rb') as file:
    corpus = pickle.load(file)

In [144]:
corpus = [item for sublist in corpus for item in sublist]

In [145]:
corpus

['The Effect of Naive Reinforcement Learning in the Stock Market\n',
 'Product failure prediction with missing data\n',
 'Stock Price Prediction Through Sentiment Analysis of Corporate Disclosures Using Distributed Representation\n',
 'Applying convolution filter to matrix of word-clustering based document representation\n',
 'Knowledge Extraction and Visualization of Digital Design Process\n',
 'De-noising documents with a novelty detection method utilizing class vectors in customer-voice classification\n',
 'Smartphone user segmentation based on app usage sequence with deep neural networks\n',
 'Building Industry Network Based on Business Text: Corporate Disclosures and News\n',
 'Machine learning-based anomaly detection via integration of manufacturing, inspection and after-sales service data\n',
 'Bag-of-Concepts: Comprehending Document Representation through Clustering Words in Distributed Representation\n',
 'Reliable prediction of anti-diabetic drug failure with a reject option\

In [151]:
# Map title index -> paper title
abs_dct = dict(enumerate(corpus))

In [163]:
abs_dct.get(35)

'Probabilistic local reconstruction for k-NN regression and its application to virtual metrology in semiconductor manufacturing\n'

### Papers most similar to Lab. 0 ('DM')

In [190]:
dct

{0: 'dm', 1: 'scm', 2: 'polytope', 3: 'imlab'}

In [159]:
similar_doc = model.docvecs.most_similar('0')
print(similar_doc)

[('abs44', 0.3526928424835205), ('abs47', 0.3400379717350006), ('abs46', 0.3288889527320862), ('abs35', 0.3200221061706543), ('abs87', 0.315833181142807), ('abs16', 0.31565016508102417), ('abs104', 0.3156258463859558), ('abs8', 0.30993974208831787), ('abs81', 0.30904173851013184), ('abs86', 0.30123087763786316)]


In [166]:
# What are their titles?
for i in similar_doc:
    key_ = i[0][3:]  # strip the 'abs' prefix from the tag
    print(key_)
    print(abs_dct.get(int(key_)))

44
Support Vector Class Description (SVCD): Classification in Kernel Space

47
Virtual metrology for run-to-run control in semiconductor manufacturing

46
Machine learning-based novelty detection for faulty wafer detection in semiconductor manufacturing

35
Probabilistic local reconstruction for k-NN regression and its application to virtual metrology in semiconductor manufacturing

87
An Up-trend Detection using an Auto-Associative Neural Network: KOSPI200 Futures

16
Detecting financial misstatements with fraud intention using multi-class cost-sensitive learning

104
Effects of varying parameters on properties of self-organizing maps

8
Machine learning-based anomaly detection via integration of manufacturing, inspection and after-sales service data

81
Combining Gaussian Mixture Models

86
A Study on Rainfall-Runoff Models for Improving Ensemble Streamflow Prediction: 1. Rainfall-runoff Models Using Artificial Neural Networks
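
The `i[0][3:]` slicing assumes that every returned tag starts with `abs`. A lab tag such as `'1'` would leave an empty string and make `int()` fail. A slightly more defensive lookup is sketched below; `resolve_tag` and the toy dictionaries are hypothetical, with one title copied from the output above.

```python
def resolve_tag(tag, lab_names, titles):
    """Turn a doc2vec tag into ('paper', title) or ('lab', name)."""
    if tag.startswith('abs'):
        title = titles.get(int(tag[3:]))
        # Titles in the corpus carry a trailing newline; drop it for display
        return 'paper', title.strip() if title is not None else None
    return 'lab', lab_names.get(int(tag))


toy_labs = {0: 'dm', 1: 'scm'}
toy_titles = {
    44: 'Support Vector Class Description (SVCD): Classification in Kernel Space\n',
}

paper = resolve_tag('abs44', toy_labs, toy_titles)
lab = resolve_tag('1', toy_labs, toy_titles)
missing = resolve_tag('abs7', toy_labs, toy_titles)
```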



### Papers most similar to Lab. 3 ('Imlab')

In [189]:
# What are their titles?
similar_doc = model.docvecs.most_similar('3')
for i in similar_doc:
    key_ = i[0][3:]  # strip the 'abs' prefix from the tag
    print(key_)
    print(abs_dct.get(int(key_)))

201
Rationing Policies for Some Inventory Systems

24
A novel multi-class classification algorithm based on one-class support vector machine

196
The Effects of Inflation and Time-Value of Money on an Economic Order Quantity Model with a Random Product Life Cycle

32
KR-WordRank: A Korean word extraction method based on WordRank and unsupervised learning

189
Economic Lot Scheduling Problem with Imperfect Production Processes and Setup Times

92
Left-shoulder Detection in Korea Composite Stock Price Index using an Auto-Associative Neural Network

179
Hybrid Genetic Algorithm for Group Technology Economic Lot Scheduling Problem

44
Support Vector Class Description (SVCD): Classification in Kernel Space

56
Bootstrap based pattern selection for support vector regression

159
Vehicle Routing Problem with Time Windows considering Overtime and Outsourcing Vehicles

