
# [Soynlp](https://github.com/lovit/soynlp)

## Preparing the environment

In [1]:
%%capture
%pip install soynlp

In [2]:
import os

WORKSPACE_DIR = "../../workspace" 
print(f'WORKSPACE_DIR = {WORKSPACE_DIR}')
data_dir = os.path.join(WORKSPACE_DIR, "data")
os.makedirs(data_dir, exist_ok=True)

WORKSPACE_DIR = ../../workspace


In [6]:
import gdown

# Google Drive file ID for bok_minutes.csv (named file_id to avoid shadowing the built-in id()).
file_id = "1uDxxBNZ-qZscPjdEam0QEjTdKcFUBMuy"

data_file = "bok_minutes.csv"
gdown.download(id=file_id, output=data_file, quiet=False, fuzzy=True)

Downloading...
From: https://drive.google.com/uc?id=1uDxxBNZ-qZscPjdEam0QEjTdKcFUBMuy
To: /content/bok_minutes.csv
100%|██████████| 10.1M/10.1M [00:00<00:00, 103MB/s]


'bok_minutes.csv'

In [7]:
import pandas as pd
df = pd.read_csv(data_file)
df.head()

| Unnamed: 0 | id | filename | mdate | rdate | section | text |
|---|---|---|---|---|---|---|
| 0 | BOK_20181130_20181218_S1 | BOK_20181130_20181218 | 2018-11-30 10:00:00 | 2018-12-18 16:00:00 | Economic Situation | 일부 위원은 관련부서에서 지난 3/4분기 중 유로지역 경제성장 부진을 자동차 관련 ... |
| 1 | BOK_20181130_20181218_S2 | BOK_20181130_20181218 | 2018-11-30 10:00:00 | 2018-12-18 16:00:00 | Foreign Currency | 일부 위원은 그동안 글로벌펀드와 패시브펀드의 규모가 크게 확대되어 우리나라 자본유출... |
| 2 | BOK_20181130_20181218_S3 | BOK_20181130_20181218 | 2018-11-30 10:00:00 | 2018-12-18 16:00:00 | Financial Markets | 일부 위원은 현재 대기업들이 전반적으로는 문제가 없지만, 건설 조선업 등에 속하는 ... |
| 3 | BOK_20181130_20181218_S4 | BOK_20181130_20181218 | 2018-11-30 10:00:00 | 2018-12-18 16:00:00 | Monetary Policy | 일부 위원은 최근 경기상황과 금융불균형 등을 고려할 때 확장적 재정정책의 필요성에는... |
| 4 | BOK_20181130_20181218_S5 | BOK_20181130_20181218 | 2018-11-30 10:00:00 | 2018-12-18 16:00:00 | Participants’ Views | 일부 위원은 최근 실물경제 성장경로의 하방위험이 다소 커진 것으로 보이고 물가도 상... |


In [9]:
docs = df['text']

sentences = []
for doc in docs:
  sentences += str(doc).split('\n')
print(len(sentences))

32179


## Word extraction

In [10]:
%%time
from soynlp.word import WordExtractor

word_extractor = WordExtractor()
word_extractor.train(sentences)

training was done. used memory 0.420 Gb
CPU times: user 11.7 s, sys: 352 ms, total: 12.1 s
Wall time: 12.2 s


The `extract()` method computes statistics such as cohesion, branching entropy, and accessor variety for each candidate word.

In [11]:
word_score = word_extractor.extract()

all cohesion probabilities was computed. # words = 46651
all branching entropies was computed # words = 72584
all accessor variety was computed # words = 72584
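
The per-word statistics are often combined into a single ranking score. The sketch below multiplies forward cohesion by the exponential of the right branching entropy, a combination the soynlp README suggests as one option. `Scores` is a hypothetical stand-in for the objects returned by `extract()`, and the numbers are rounded from outputs shown elsewhere in this notebook, so treat it as an illustration rather than the library's own scoring.

```python
import math
from collections import namedtuple

# Hypothetical stand-in for the score objects returned by extract();
# only the two attributes used below are modeled.
Scores = namedtuple("Scores", ["cohesion_forward", "right_branching_entropy"])

# Values rounded from the cell outputs in this notebook.
word_score_sample = {
    "기준": Scores(0.2152, 0.7689),
    "기준금": Scores(0.4312, 0.0),
    "기준금리": Scores(0.5707, 1.8392),
    "기준금리는": Scores(0.3758, 1.1525),
}

def combined_score(score):
    """Cohesion weighted by how freely the word's right boundary branches."""
    return score.cohesion_forward * math.exp(score.right_branching_entropy)

# Sort candidates from most to least word-like under this heuristic.
ranked = sorted(
    word_score_sample.items(),
    key=lambda item: combined_score(item[1]),
    reverse=True,
)
for word, score in ranked:
    print(f"{word}\t{combined_score(score):.3f}")
```

Under this heuristic `기준금리` ranks first, because it is both cohesive and followed by many different characters.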


## Cohesion

In [14]:
word_score["금융"].cohesion_forward

0.5499382581535556

In [16]:
word_score["금융통화"].cohesion_forward

0.12035310794024298

In [17]:
word_score["금융통화위"].cohesion_forward

0.2021725230109059

In [18]:
word_score["금융통화위원"].cohesion_forward

0.27834133424371005

In [19]:
word_score["금융통화위원회"].cohesion_forward

0.3444673684406629

In [24]:
word_score["기준"].cohesion_forward

0.21523503077783995

In [25]:
word_score["기준금"].cohesion_forward

0.4311915272638221

In [26]:
word_score["기준금리"].cohesion_forward

0.5706795870422376

In [28]:
word_score["기준금리는"].cohesion_forward

0.37575743286182
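
To make the numbers above easier to read, here is a minimal sketch of how a forward cohesion score can be computed from prefix counts. It assumes the definition cohesion(c1..cn) = (count(c1..cn) / count(c1))^(1/(n-1)), which is how soynlp describes the score; the helper names and the toy token list are hypothetical, and soynlp's own counting (frequency thresholds, maximum lengths) may differ.

```python
from collections import Counter

def count_prefixes(tokens, max_len=10):
    """Count every prefix (up to max_len characters) of every token."""
    counts = Counter()
    for token in tokens:
        for end in range(1, min(len(token), max_len) + 1):
            counts[token[:end]] += 1
    return counts

def cohesion_forward(word, prefix_counts):
    """Geometric mean of the character-to-character transition probabilities."""
    n = len(word)
    if n <= 1 or prefix_counts[word] == 0:
        return 0.0
    return (prefix_counts[word] / prefix_counts[word[0]]) ** (1 / (n - 1))

# Hypothetical toy corpus of whitespace-separated tokens.
tokens = ["기준금리", "기준금리는", "기준금리를", "기준", "기업", "기대"]
counts = count_prefixes(tokens)

print(cohesion_forward("기준", counts))      # (4/6) ** (1/1)
print(cohesion_forward("기준금리", counts))  # (3/6) ** (1/3)
```

Because the exponent normalizes by length, a longer string such as `기준금리` can score higher than its own prefix `기준`, which is the pattern seen in the outputs above.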

## Branching Entropy

In [29]:
word_score["기준"].right_branching_entropy

0.7689286841948785

In [31]:
# '기준금' is always followed by '리', so the entropy is zero.
word_score["기준금"].right_branching_entropy

-0.0

In [32]:
word_score["기준금리"].right_branching_entropy

1.8391611000017114

In [33]:
word_score["기준금리는"].right_branching_entropy

1.1524847314161935
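
The idea behind right branching entropy can be sketched as the Shannon entropy of the character that follows a string. The version below is a simplified illustration on a hypothetical toy corpus; it assumes the natural logarithm and only looks at characters inside the same token, so its values will not match soynlp's exactly.

```python
import math
from collections import Counter

def right_neighbors(word, tokens):
    """Count the character that directly follows `word` when it starts a token."""
    neighbors = Counter()
    for token in tokens:
        if token.startswith(word) and len(token) > len(word):
            neighbors[token[len(word)]] += 1
    return neighbors

def right_branching_entropy(word, tokens):
    """Entropy (natural log) of the next-character distribution after `word`."""
    neighbors = right_neighbors(word, tokens)
    total = sum(neighbors.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in neighbors.values())

# Hypothetical toy corpus.
tokens = ["기준금리", "기준금리는", "기준금리를", "기준으로", "기준금리가"]

print(right_branching_entropy("기준금", tokens))    # only '리' follows, so zero
print(right_branching_entropy("기준금리", tokens))  # '는', '를', '가' follow
```

A string that ends in the middle of a word has a predictable next character and therefore low entropy, while a complete word is followed by many different particles and has high entropy.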

## Accessor Variety

In [34]:
word_score["기준"].right_accessor_variety

39

In [35]:
# '기준금' is always followed by '리', so there is only one right accessor.
word_score["기준금"].right_accessor_variety

1

In [36]:
word_score["기준금리"].right_accessor_variety

30

In [37]:
word_score["기준금리는"].right_accessor_variety

12
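
Accessor variety is the simpler of the boundary statistics: it counts how many distinct characters appear next to a string. The sketch below shows the right-side version on a hypothetical toy corpus. The function name is an assumption for illustration, and soynlp's counting rules may differ in detail.

```python
def right_accessor_variety(word, tokens):
    """Number of distinct characters that directly follow `word` inside a token."""
    followers = {
        token[len(word)]
        for token in tokens
        if token.startswith(word) and len(token) > len(word)
    }
    return len(followers)

# Hypothetical toy corpus.
tokens = ["기준금리", "기준금리는", "기준금리를", "기준으로", "기준금리가"]

print(right_accessor_variety("기준", tokens))      # '금' and '으'
print(right_accessor_variety("기준금", tokens))    # only '리'
print(right_accessor_variety("기준금리", tokens))  # '는', '를', '가'
```

Unlike entropy, this count ignores how often each follower occurs, so a single rare follower raises it as much as a frequent one.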

## L-tokenization

In [39]:
text = '한편 IMF가 추정한 우리나라의 GDP갭률은 금년에도 소폭의 마이너스(-)를 지속하고 있는데, 잠재성장률 추정의 불확실성을 감안하더라도 최근의 고용상황, 제조업가동률, 물가상승률 등에 비추어 볼 때 동 추정치가 어느 정도 타당성이 있어 보인다고 언급하면서 관련부서의 견해를 물었음.'

In [41]:
from soynlp.tokenizer import LTokenizer

scores = {word:score.cohesion_forward for word, score in word_score.items()}
l_tokenizer = LTokenizer(scores=scores)

l_tokenizer.tokenize(text, flatten=False)

[('한편', ''),
 ('IMF', '가'),
 ('추정', '한'),
 ('우리나라', '의'),
 ('GDP', '갭률은'),
 ('금년', '에도'),
 ('소폭', '의'),
 ('마이너스', '(-)를'),
 ('지속', '하고'),
 ('있는', '데,'),
 ('잠재', '성장률'),
 ('추정', '의'),
 ('불확실성', '을'),
 ('감안', '하더라도'),
 ('최근', '의'),
 ('고용', '상황,'),
 ('제조업', '가동률,'),
 ('물가', '상승률'),
 ('등에', ''),
 ('비추어', ''),
 ('볼', ''),
 ('때', ''),
 ('동', ''),
 ('추정', '치가'),
 ('어느', ''),
 ('정도', ''),
 ('타당', '성이'),
 ('있어', ''),
 ('보인다고', ''),
 ('언급', '하면서'),
 ('관련부서', '의'),
 ('견해를', ''),
 ('물었음.', '')]
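
The output above splits each whitespace-separated token into a left part (L) and a right part (R). A minimal sketch of that idea: for every token, try each cut position and keep the one whose left part has the highest score. The function name, the tie-breaking rule (prefer the longer left part), and the toy scores are assumptions for illustration; `LTokenizer` has additional options that are not modeled here.

```python
def l_tokenize(sentence, scores, default_score=0.0):
    """Split each whitespace token into (L, R), choosing the best-scoring L."""
    pairs = []
    for eojeol in sentence.split():
        # Evaluate every cut position; ties go to the longer left part, so a
        # token with no known prefix is returned whole with an empty R.
        best_cut = max(
            range(1, len(eojeol) + 1),
            key=lambda cut: (scores.get(eojeol[:cut], default_score), cut),
        )
        pairs.append((eojeol[:best_cut], eojeol[best_cut:]))
    return pairs

# Hypothetical toy scores standing in for cohesion values.
toy_scores = {"기준": 0.22, "기준금리": 0.57, "인상": 0.40}

print(l_tokenize("기준금리는 인상했다 그리고", toy_scores))
```

Tokens with no scored prefix, such as `그리고` here, come back unsplit, which mirrors entries like `('등에', '')` in the output above.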

## Max-score tokenization

In [42]:
from soynlp.tokenizer import MaxScoreTokenizer

maxscore_tokenizer = MaxScoreTokenizer(scores=scores)
maxscore_tokenizer.tokenize(text)

['한편',
 'IMF',
 '가',
 '추정',
 '한',
 '우리나라',
 '의',
 'GDP',
 '갭률은',
 '금년',
 '에도',
 '소',
 '폭의',
 '마이너스',
 '(-)를',
 '지속',
 '하고',
 '있는',
 '데,',
 '잠재',
 '성장',
 '률',
 '추정',
 '의',
 '불확실성',
 '을',
 '감안',
 '하더라도',
 '최근',
 '의',
 '고용',
 '상황',
 ',',
 '제조업',
 '가동률,',
 '물가',
 '상승',
 '률',
 '등에',
 '비추어',
 '볼',
 '때',
 '동',
 '추정',
 '치가',
 '어느',
 '정도',
 '타당',
 '성이',
 '있어',
 '보인다고',
 '언급',
 '하면서',
 '관련부서',
 '의',
 '견해를',
 '물었음.']
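
Unlike the L/R split, max-score tokenization can cut a token into more than two pieces. The sketch below is a simplified recursive version of the idea: find the highest-scoring known substring, emit it, and repeat on whatever remains to its left and right. The function name, the tie-breaking (longer, then leftmost), and the toy scores are assumptions for illustration; `MaxScoreTokenizer` itself may resolve ties and length limits differently.

```python
def max_score_tokenize(text, scores):
    """Split a whitespace-free string around its best-scoring known substring."""
    best = None  # (score, length, -begin, end); larger tuples win
    for begin in range(len(text)):
        for end in range(begin + 1, len(text) + 1):
            piece = text[begin:end]
            if piece in scores:
                candidate = (scores[piece], end - begin, -begin, end)
                if best is None or candidate > best:
                    best = candidate
    if best is None:
        # No known substring: keep the remainder as a single token.
        return [text] if text else []
    _, _, neg_begin, end = best
    begin = -neg_begin
    return (
        max_score_tokenize(text[:begin], scores)
        + [text[begin:end]]
        + max_score_tokenize(text[end:], scores)
    )

# Hypothetical toy scores standing in for cohesion values.
toy_scores = {"기준": 0.22, "기준금리": 0.57, "금리": 0.30, "인상": 0.40}

sentence = "기준금리인상은 불가피"
tokens = [t for eojeol in sentence.split() for t in max_score_tokenize(eojeol, toy_scores)]
print(tokens)
```

Because the best-scoring piece is chosen first, a high-scoring substring in the middle of a token can strand short fragments around it, similar to the `'소', '폭의'` split in the output above.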