## Doc2Vec

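- An algorithm that adapts Word2Vec so that it produces embeddings for whole documents
- Embeds each document directly
- The embedding takes context into account
- (Most deep-learning approaches to natural language processing embed individual words)
- BoW count vectors and TF-IDF vectors are also document embeddings
  - However, they ignore the contextual information carried by word order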
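
To make the last point concrete, here is a minimal sketch using only the standard library. The helper `bag_of_words` and the two example sentences are made up for illustration. Both sentences get the same count vector even though they mean different things.

```python
from collections import Counter

def bag_of_words(text):
    """Return word counts for a whitespace-tokenized, lower-cased text."""
    return Counter(text.lower().split())

doc_a = "the dog bit the man"
doc_b = "the man bit the dog"

bow_a = bag_of_words(doc_a)
bow_b = bag_of_words(doc_b)

# Different sentences, identical count vectors: the word order is lost
print(doc_a == doc_b)
print(bow_a == bow_b)
```

A TF-IDF vector reweights these counts but, like the raw counts, it still carries no information about word order.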

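#### How Doc2Vec works
- Doc2Vec treats each document's ID as if it were a word and includes it in the training process
- While the ordinary words are trained on their context, the document ID also learns the context of the words that appear in that document
- Once training on the documents in the corpus is complete, documents can be compared with each other directly
- As with Word2Vec, you can find documents similar to a given document and perform various vector operations
- The embedding vectors can also be used for purposes such as document classification
- Feeding a document's embedding vector into a classifier enables tasks such as sentiment analysis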
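
The idea that the document ID is trained like a word can be sketched as follows. This assumes the distributed-memory (PV-DM) variant of Doc2Vec, in which the document ID joins the context words to predict a target word. The function `pv_dm_examples` and the `DOC_0` tag are hypothetical names for illustration, not part of gensim.

```python
def pv_dm_examples(doc_id, words, window=2):
    """Yield (inputs, target) pairs in which the document ID is one of the inputs."""
    for i, target in enumerate(words):
        left = words[max(0, i - window):i]
        right = words[i + 1:i + 1 + window]
        # The document ID is fed in alongside the ordinary context words
        yield [doc_id] + left + right, target

words = "power is shifting from west to east".split()
examples = list(pv_dm_examples("DOC_0", words, window=1))

for inputs, target in examples[:3]:
    print(inputs, "->", target)
```

Because `DOC_0` appears in the input of every training example drawn from the document, its vector is updated by all of that document's words.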

In [2]:
# The document ID ends up learning the general characteristics of all the words in the document

### Doc2Vec example: a recommender system

- Given a book title, recommend books in descending order of plot (description) similarity
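
Ranking by similarity can be sketched with plain cosine similarity over document vectors. The vectors below are made-up three-dimensional stand-ins for learned embeddings, and `cosine_similarity` and `rank_by_similarity` are hypothetical helper names.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_by_similarity(query_title, vectors):
    """Return (title, similarity) pairs for every other book, most similar first."""
    query = vectors[query_title]
    scores = [
        (title, cosine_similarity(query, vec))
        for title, vec in vectors.items()
        if title != query_title
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Made-up vectors standing in for learned plot embeddings
toy_vectors = {
    "Book A": [1.0, 0.0, 0.0],
    "Book B": [0.9, 0.1, 0.0],
    "Book C": [0.0, 1.0, 0.0],
}

ranking = rank_by_similarity("Book A", toy_vectors)
print(ranking)
```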

In [1]:
# Display multiple results from a single cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity="all"

In [87]:
import pandas as pd
df = pd.read_csv("./data/data.csv")
df.head()

Unnamed: 0.2,Unnamed: 0.1,Desc,Unnamed: 0,author,genre,image_link,rating,title
0,0,We know that power is shifting: From West to E...,0.0,Moisés Naím,Business,https://i.gr-assets.com/images/S/compressed.ph...,3.63,The End of Power: From Boardrooms to Battlefie...
1,1,Following the success of The Accidental Billio...,1.0,Blake J. Harris,Business,https://i.gr-assets.com/images/S/compressed.ph...,3.94,"Console Wars: Sega, Nintendo, and the Battle t..."
2,2,How to tap the power of social software and ne...,2.0,Chris Brogan,Business,https://i.gr-assets.com/images/S/compressed.ph...,3.78,Trust Agents: Using the Web to Build Influence...
3,3,William J. Bernstein is an American financial ...,3.0,William J. Bernstein,Business,https://i.gr-assets.com/images/S/compressed.ph...,4.2,The Four Pillars of Investing
4,4,Amazing book. And I joined Steve Jobs and many...,4.0,Akio Morita,Business,https://i.gr-assets.com/images/S/compressed.ph...,4.05,Made in Japan: Akio Morita and Sony


In [88]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2382 entries, 0 to 2381
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Unnamed: 0.1  2382 non-null   int64  
 1   Desc          2382 non-null   object 
 2   Unnamed: 0    1185 non-null   float64
 3   author        2382 non-null   object 
 4   genre         2382 non-null   object 
 5   image_link    2382 non-null   object 
 6   rating        2382 non-null   float64
 7   title         2382 non-null   object 
dtypes: float64(2), int64(1), object(5)
memory usage: 149.0+ KB


In [89]:
df.title[0]

"The End of Power: From Boardrooms to Battlefields and Churches to States, Why Being In Charge Isn't What It Used to Be"

In [90]:
df.Desc[0]

"We know that power is shifting: From West to East and North to South, from presidential palaces to public squares, from once formidable corporate behemoths to nimble startups and, slowly but surely, from men to women. But power is not merely shifting and dispersing. It is also decaying. Those in power today are more constrained in what they can do with it and more at risk of losing it than ever before. In The End of Power, award-winning columnist and former Foreign Policy editor Moisés Naím illuminates the struggle between once-dominant megaplayers and the new micropowers challenging them in every field of human endeavor. Drawing on provocative, original research, Naím shows how the antiestablishment drive of micropowers can topple tyrants, dislodge monopolies, and open remarkable new opportunities, but it can also lead to chaos and paralysis. Naím deftly covers the seismic changes underway in business, religion, education, within families, and in all matters of war and peace. Example

In [None]:
# Preprocess the Desc column
# Remove non-ASCII characters
# Convert to lowercase
# Remove stopwords
# Remove HTML tags
# Remove punctuation
# Keep English letters only

In [None]:
import nltk
nltk.download('stopwords')

In [None]:
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
import re

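# Remove non-ASCII characters
def _removeNonAscii(s):
    return "".join(ch for ch in s if ord(ch) < 128)

# Convert to lowercase
def make_lower_case(text):
    return text.lower()

# Remove English stopwords (whitespace tokenization)
def remove_stop_words(text):
    stops = set(stopwords.words("english"))
    text = " ".join(w for w in text.split() if w not in stops)
    return text

# Remove HTML tags
def remove_html(text):
    html_pattern = re.compile(r'<.*?>')
    return html_pattern.sub(r'', text)

# Remove punctuation by keeping only runs of English letters
def remove_punctuation(text):
    tokenizer = RegexpTokenizer(r'[a-zA-Z]+')
    text = tokenizer.tokenize(text)
    text = " ".join(text)
    return text

df['cleaned'] = df['Desc'].apply(_removeNonAscii)
df['cleaned'] = df.cleaned.apply(make_lower_case)
df['cleaned'] = df.cleaned.apply(remove_stop_words)
df['cleaned'] = df.cleaned.apply(remove_punctuation)
# Note: '<' and '>' are already gone after remove_punctuation, so this step no longer matches any tags
df['cleaned'] = df.cleaned.apply(remove_html)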

In [93]:
df.head()

Unnamed: 0.2,Unnamed: 0.1,Desc,Unnamed: 0,author,genre,image_link,rating,title,cleaned
0,0,We know that power is shifting: From West to E...,0.0,Moisés Naím,Business,https://i.gr-assets.com/images/S/compressed.ph...,3.63,The End of Power: From Boardrooms to Battlefie...,know power shifting west east north south pres...
1,1,Following the success of The Accidental Billio...,1.0,Blake J. Harris,Business,https://i.gr-assets.com/images/S/compressed.ph...,3.94,"Console Wars: Sega, Nintendo, and the Battle t...",following success accidental billionaires mone...
2,2,How to tap the power of social software and ne...,2.0,Chris Brogan,Business,https://i.gr-assets.com/images/S/compressed.ph...,3.78,Trust Agents: Using the Web to Build Influence...,tap power social software networks build busin...
3,3,William J. Bernstein is an American financial ...,3.0,William J. Bernstein,Business,https://i.gr-assets.com/images/S/compressed.ph...,4.2,The Four Pillars of Investing,william j bernstein american financial theoris...
4,4,Amazing book. And I joined Steve Jobs and many...,4.0,Akio Morita,Business,https://i.gr-assets.com/images/S/compressed.ph...,4.05,Made in Japan: Akio Morita and Sony,amazing book joined steve jobs many akio morit...


In [None]:
df['Desc'][2][:50]

In [None]:
df['cleaned'][2][:50]

'tap power social software networks build business trust agents two social media veterans show tap power social networks build brand s influence reputation and course profits today s online influencers web natives trade trust reputation relationships using social media accrue influence builds brings businesses online the book shows people use online social tools build networks influence use networks positively impact business trust key building online reputations traffic trust agents key people business needs side delivers actionable steps case studies show social media positively impact business written authors ten years online media experience shows build wield influence online benefit brand combines high level theory practical step by step guidance want business succeed sit sidelines instead use web build trust consumers using trust agents'

In [95]:
from gensim.models.doc2vec import TaggedDocument
from konlpy.tag import Okt
okt = Okt()

- gensim's TaggedDocument
    - Attaches to a document a string or integer carrying meta information such as its category name or keywords

In [None]:
%%time
from tqdm import tqdm

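tagged_corpus_list = []

for index, row in tqdm(df.iterrows(), total=len(df)):
  text = row['cleaned']
  tag = row['title']
  # Tag each description with its book title; the cleaned English text is split on whitespace
  tagged_corpus_list.append(TaggedDocument(tags=[tag], words=text.split()))

print('Number of documents:', len(tagged_corpus_list))

100%|█████████████████████████████████████████████████████████████████████████████| 2382/2382 [00:07<00:00, 323.71it/s]

Number of documents: 2382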
CPU times: total: 3.45 s
Wall time: 7.38 s





In [97]:
tagged_corpus_list[0]

TaggedDocument(words=['know', 'power', 'shifting', 'west', 'east', 'north', 'south', 'presidential', 'palaces', 'public', 'squares', 'formidable', 'corporate', 'behemoths', 'nimble', 'startups', 'and', 'slowly', 'surely', 'men', 'women', 'power', 'merely', 'shifting', 'dispersing', 'also', 'decaying', 'power', 'today', 'constrained', 'risk', 'losing', 'ever', 'before', 'end', 'power', 'award', 'winning', 'columnist', 'former', 'foreign', 'policy', 'editor', 'moiss', 'nam', 'illuminates', 'struggle', 'once', 'dominant', 'megaplayers', 'new', 'micropowers', 'challenging', 'every', 'field', 'human', 'endeavor', 'drawing', 'provocative', 'original', 'research', 'nam', 'shows', 'antiestablishment', 'drive', 'micropowers', 'topple', 'tyrants', 'dislodge', 'monopolies', 'open', 'remarkable', 'new', 'opportunities', 'also', 'lead', 'chaos', 'paralysis', 'nam', 'deftly', 'covers', 'seismic', 'changes', 'underway', 'business', 'religion', 'education', 'within', 'families', 'matters', 'war', 'pea

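- window (int, optional): maximum distance between the current word and the predicted word within a sentence
- workers (int, optional): number of worker threads used to train the model (faster training on multicore machines)
- total_examples (int, optional): number of documents
- corpus_count: attribute that shows how many documents the input data contains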

In [98]:
from gensim.models import Doc2Vec

In [None]:
# Recommender system
# Given a book title, recommend books whose descriptions are similar to that book's

In [None]:
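# Written as a function; assumes `model` is a trained Doc2Vec model (gensim 4.x: `dv`, 3.x: `docvecs`)
def book_recommend(title):
    # Rank the other books by cosine similarity of their document vectors
    similar = model.dv.most_similar(title, topn=5)
    return pd.DataFrame(similar, columns=['title', 'similarity'])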

In [None]:
############################################################################################################

In [None]:
# In the recommender system, match the entered book title regardless of letter case
# and output the original title with its capitalization - Deep Learning 34
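
One possible way to do this, sketched with the standard library only, is to map case-folded titles to their original spelling. The helper names `build_title_index` and `resolve_title` are hypothetical; the two titles are taken from the data shown earlier.

```python
def build_title_index(titles):
    """Map each case-folded title to its original spelling."""
    return {title.casefold(): title for title in titles}

def resolve_title(query, title_index):
    """Return the stored title matching `query` regardless of case, or None if absent."""
    return title_index.get(query.strip().casefold())

titles = [
    "The Four Pillars of Investing",
    "Made in Japan: Akio Morita and Sony",
]
title_index = build_title_index(titles)

print(resolve_title("the four pillars of investing", title_index))
print(resolve_title("MADE IN JAPAN: AKIO MORITA AND SONY", title_index))
```

The resolved title can then be passed to the recommendation function, since the document tags were stored with their original capitalization.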