# "Analyzing KEPCO Deficit Articles with LDA"
> "Topic Modeling"

- toc: true
- branch: master
- badges: true
- comments: true
- author: Ryu Han
- categories: [climate change, kepco]


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import pandas as pd

In [None]:
title = '/content/kepco_deficit_6months.csv'

In [None]:
df = pd.read_csv(title)

In [None]:
df.head(3)

Unnamed: 0.1,Unnamed: 0,agency,naver_url,title,content
0,0,주간조선언론사 선정,https://n.news.naver.com/mnews/article/053/000...,한전 적자에 휘청…정치가 밀어붙인 한전공대의 운명,[\n\n\n\n\n지난 3월 2일 전남 나주시 빛가람동에서 열린 한국에너지공과대학...
1,1,매경이코노미언론사 선정,https://n.news.naver.com/mnews/article/024/000...,김종갑 前 한전 사장이 밝히는 ‘한전 적자’ 해법,[\n전기요금 원가 연계 없으면 ‘정상화’ 불가능김종갑 한양대 특훈교수는 2018년...
2,2,조선비즈언론사 선정,https://n.news.naver.com/mnews/article/366/000...,"전력 도매가격, 상한제 도입 앞두고 급등 조짐",[\n민간 발전사는 상한제 도입 반대 시위 한국전력(015760)이 발전사들에게 전...


## 1. Data Preprocessing

In [None]:
# Install the konlpy library
!apt-get update
!apt-get install g++ openjdk-8-jdk 
!pip install konlpy JPype1-py3
!bash <(curl -s https://raw.githubusercontent.com/konlpy/konlpy/master/scripts/mecab.sh)
# Source: https://biology-statistics-programming.tistory.com/32 [히비스서커스의 블로그:티스토리]

In [None]:
from konlpy.tag import Mecab
tokenizer = Mecab()

In [None]:
from konlpy.tag import Mecab
from tqdm import tqdm
import re
import pickle
import csv
import pandas as pd
from pandas import DataFrame 
import numpy as np

In [None]:
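def clean_text(content):
    # Drop periods, turn the middle dot into a space, and remove newlines.
    content = content.replace(".", "").strip()
    content = content.replace("·", " ").strip()
    content = content.replace("\n", "").strip()
    # Keep only spaces, Hangul (jamo and syllables), and digits.
    pattern = '[^ ㄱ-ㅣ가-힣0-9]+'
    content = re.sub(pattern=pattern, repl='', string=content)
    return content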

In [None]:
df['content'] = df.content.apply(clean_text)

In [None]:
df.head(3)

Unnamed: 0.1,Unnamed: 0,agency,naver_url,title,content
0,0,주간조선언론사 선정,https://n.news.naver.com/mnews/article/053/000...,한전 적자에 휘청…정치가 밀어붙인 한전공대의 운명,지난 3월 2일 전남 나주시 빛가람동에서 열린 한국에너지공과대학 입학식 및 비전선포...
1,1,매경이코노미언론사 선정,https://n.news.naver.com/mnews/article/024/000...,김종갑 前 한전 사장이 밝히는 ‘한전 적자’ 해법,전기요금 원가 연계 없으면 정상화 불가능김종갑 한양대 특훈교수는 2018년부터 20...
2,2,조선비즈언론사 선정,https://n.news.naver.com/mnews/article/366/000...,"전력 도매가격, 상한제 도입 앞두고 급등 조짐",민간 발전사는 상한제 도입 반대 시위 한국전력015760이 발전사들에게 전기를 사올...


In [None]:
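def get_nouns(tokenizer, sentence):
    # Keep common nouns (NNG), proper nouns (NNP), adjectives (VA), and roots (XR)
    # that are longer than one character.
    tagged = tokenizer.pos(sentence)
    nouns = [s for s, t in tagged if t in ['NNG', 'NNP', 'VA', 'XR'] and len(s) > 1]
    return nouns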

def tokenize(df):
    tokenizer = Mecab(dicpath='/usr/local/lib/mecab/dic/mecab-ko-dic')
    processed_data = []
    for sent in tqdm(df['content']):
        sentence = clean_text(str(sent).replace("\n", "").strip())
        processed_data.append(get_nouns(tokenizer, sentence))
    return processed_data
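
To make the tag filter in `get_nouns` concrete, here is a minimal sketch that applies the same rule to a hand-written list of `(token, tag)` pairs. The pairs are assumed for illustration and are not actual Mecab output, and `KEEP_TAGS` and `filter_by_tag` are hypothetical names. Because `VA` and `XR` are kept alongside the noun tags, the result is not strictly a list of nouns.

```python
# Tags kept by get_nouns: common nouns, proper nouns, adjectives, roots.
KEEP_TAGS = {"NNG", "NNP", "VA", "XR"}

def filter_by_tag(tagged, keep_tags=KEEP_TAGS, min_len=2):
    """Keep tokens whose POS tag is in keep_tags and whose length is at least min_len."""
    return [tok for tok, tag in tagged if tag in keep_tags and len(tok) >= min_len]

# Hand-written (token, tag) pairs; illustrative only, not real Mecab output.
sample_tagged = [
    ("한국전력", "NNP"),
    ("이", "JKS"),
    ("전기", "NNG"),
    ("요금", "NNG"),
    ("을", "JKO"),
    ("인상", "NNG"),
    ("했", "XSV+EP"),
    ("다", "EF"),
    ("값", "NNG"),  # dropped: a noun, but only one character long
]

kept = filter_by_tag(sample_tagged)
print(kept)
```

The one-character cutoff removes particles that slip through, but it also drops short nouns such as the last entry above.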

In [None]:
len(df)

2836

In [None]:
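# dropna returns a new DataFrame, so the result has to be assigned back.
# Only rows with a missing article body matter for tokenization.
df = df.dropna(subset=['content'])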
len(df)

2836

In [None]:
# Run the tokenizer defined above before storing its output.
processed_data = tokenize(df)
df['processed_content'] = processed_data

In [None]:
df.head(3)

Unnamed: 0.1,Unnamed: 0,agency,naver_url,title,content,processed_content
0,0,주간조선언론사 선정,https://n.news.naver.com/mnews/article/053/000...,한전 적자에 휘청…정치가 밀어붙인 한전공대의 운명,지난 3월 2일 전남 나주시 빛가람동에서 열린 한국에너지공과대학 입학식 및 비전선포...,"[전남, 나주시, 가람동, 한국, 에너지, 공과, 대학, 입학식, 전선, 포식, 뉴..."
1,1,매경이코노미언론사 선정,https://n.news.naver.com/mnews/article/024/000...,김종갑 前 한전 사장이 밝히는 ‘한전 적자’ 해법,전기요금 원가 연계 없으면 정상화 불가능김종갑 한양대 특훈교수는 2018년부터 20...,"[전기, 요금, 원가, 연계, 정상, 가능, 김종갑, 한양대, 특훈, 교수, 한국전..."
2,2,조선비즈언론사 선정,https://n.news.naver.com/mnews/article/366/000...,"전력 도매가격, 상한제 도입 앞두고 급등 조짐",민간 발전사는 상한제 도입 반대 시위 한국전력015760이 발전사들에게 전기를 사올...,"[민간, 발전사, 상한제, 도입, 반대, 시위, 한국전력, 발전사, 전기, 기준, ..."


## LDA Topic Modeling

In [None]:
from gensim.models.ldamodel import LdaModel
from gensim.models.callbacks import CoherenceMetric
from gensim import corpora
from gensim.models.callbacks import PerplexityMetric

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [None]:
dictionary = corpora.Dictionary(processed_data)

2022-07-09 21:25:52,135 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2022-07-09 21:25:52,838 : INFO : built Dictionary(13125 unique tokens: ['가결', '가능', '가람동', '가림막', '가운데']...) from 2836 documents (total 785221 corpus positions)


In [None]:
dictionary.filter_extremes(no_below=2, no_above=0.5)

2022-07-09 21:25:53,367 : INFO : discarding 4472 tokens: [('가림막', 1), ('개교일', 1), ('교과', 1), ('교지', 1), ('규모', 1444), ('넓이', 1), ('뙤약볕', 1), ('레볼루션', 1), ('배차', 1), ('부담', 1487)]...
2022-07-09 21:25:53,370 : INFO : keeping 8653 tokens which were in no less than 2 and no more than 1418 (=50.0%) documents
2022-07-09 21:25:53,389 : INFO : resulting dictionary: Dictionary(8653 unique tokens: ['가결', '가능', '가람동', '가운데', '간격']...)
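
The log above shows that tokens found in fewer than 2 documents, or in more than 1418 (50%) of the 2836 documents, were discarded. A rough pure-Python sketch of that rule follows. It is not gensim's implementation: `filter_extremes_sketch` and the toy documents are hypothetical, and the `keep_n` cap that `filter_extremes` also applies is left out.

```python
from collections import Counter

def filter_extremes_sketch(docs, no_below=2, no_above=0.5):
    """Return the tokens that appear in at least no_below documents
    and in no more than no_above (a fraction) of all documents."""
    # Document frequency: count each token once per document.
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    max_docs = int(no_above * len(docs))
    return {tok for tok, freq in doc_freq.items() if no_below <= freq <= max_docs}

# Toy corpus, made up for illustration.
toy_docs = [
    ["전기", "요금", "인상"],
    ["전기", "요금", "동결"],
    ["전기", "원전", "정책"],
    ["전기", "원전", "가스"],
]

kept = filter_extremes_sketch(toy_docs)
print(sorted(kept))
```

In the toy corpus, "전기" appears in every document and is dropped as too common, while words that appear only once are dropped as too rare.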


In [None]:
corpus = [dictionary.doc2bow(text) for text in processed_data]
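
Each document in `corpus` is now a list of `(token_id, count)` pairs, and tokens removed from the dictionary are ignored. The sketch below mimics that conversion in plain Python. `doc2bow_sketch` and `toy_token2id` are hypothetical stand-ins, not part of gensim.

```python
from collections import Counter

def doc2bow_sketch(token2id, doc):
    """Convert a token list into sorted (token_id, count) pairs,
    ignoring tokens that are not in the vocabulary."""
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())

# Toy vocabulary, made up for illustration.
toy_token2id = {"요금": 0, "원전": 1}

bow = doc2bow_sketch(toy_token2id, ["전기", "요금", "인상", "요금"])
print(bow)
```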

### Modeling

In [None]:
num_topics = 5
chunksize = 2000
passes = 20
iterations = 400
eval_every = None

# Accessing an item forces the dictionary to build its id2token mapping.
temp = dictionary[0]
id2word = dictionary.id2token

model = LdaModel(
    corpus=corpus,
    id2word=id2word,
    chunksize=chunksize,
    alpha='auto',
    eta='auto',
    iterations=iterations,
    num_topics=num_topics,
    passes=passes,
    eval_every=eval_every
)


2022-07-09 21:25:58,513 : INFO : using autotuned alpha, starting with [0.2, 0.2, 0.2, 0.2, 0.2]
2022-07-09 21:25:58,520 : INFO : using serial LDA version on this node
2022-07-09 21:25:58,534 : INFO : running online (multi-pass) LDA training, 5 topics, 20 passes over the supplied corpus of 2836 documents, updating model once every 2000 documents, evaluating perplexity every 0 documents, iterating 400x with a convergence threshold of 0.001000
2022-07-09 21:25:58,538 : INFO : PROGRESS: pass 0, at document #2000/2836
2022-07-09 21:26:07,586 : INFO : optimized alpha [0.12613794, 0.023167312, 0.10367903, 0.09287643, 0.038539857]
2022-07-09 21:26:07,588 : INFO : merging changes from 2000 documents into a model of 2836 documents
2022-07-09 21:26:07,603 : INFO : topic #0 (0.126): 0.011*"단가" + 0.010*"물가" + 0.010*"원전" + 0.008*"가스" + 0.007*"정책" + 0.007*"영업" + 0.007*"발전" + 0.007*"시장" + 0.006*"산업" + 0.006*"결정"
2022-07-09 21:26:07,608 : INFO : topic #1 (0.023): 0.015*"기관" + 0.014*"공공" + 0.009*"경영" + 

In [None]:
top_topics = model.top_topics(corpus)  # 20 words per topic by default

# Average topic coherence is the sum of topic coherences of all topics, divided by the number of topics.
avg_topic_coherence = sum([t[1] for t in top_topics]) / num_topics
print('Average topic coherence: %.4f.' % avg_topic_coherence)

from pprint import pprint
pprint(top_topics)

2022-07-09 21:27:07,340 : INFO : CorpusAccumulator accumulated stats from 1000 documents
2022-07-09 21:27:07,382 : INFO : CorpusAccumulator accumulated stats from 2000 documents


Average topic coherence: -1.3052.
[([(0.01524945, '후보'),
   (0.014800063, '단가'),
   (0.01322821, '물가'),
   (0.01281813, '국민'),
   (0.011858739, '가스'),
   (0.010960233, '산업'),
   (0.010288112, '결정'),
   (0.009155416, '정책'),
   (0.00831113, '발표'),
   (0.008002558, '반영'),
   (0.007134421, '원전'),
   (0.0070603937, '공약'),
   (0.007021735, '동결'),
   (0.006810347, '국제'),
   (0.0066747516, '대선'),
   (0.006522878, '기자'),
   (0.0064972537, '서울'),
   (0.0063531045, '연동'),
   (0.006103425, '예정'),
   (0.00599609, '최대')],
  -0.9115895166694079),
 ([(0.013056193, '물가'),
   (0.012969216, '경제'),
   (0.008364702, '기업'),
   (0.007984413, '정책'),
   (0.0060336622, '대통령'),
   (0.005244408, '시장'),
   (0.00455205, '장관'),
   (0.0045388252, '국민'),
   (0.0045238542, '안정'),
   (0.004387625, '세계'),
   (0.004323399, '공급'),
   (0.0039611566, '부총리'),
   (0.0038171748, '기자'),
   (0.003753166, '미국'),
   (0.0036840646, '문제'),
   (0.0036569836, '관련'),
   (0.0036342787, '국제'),
   (0.0036246954, '한국'),
   (0.0034647188, '재
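
As the printed output shows, `top_topics` returns one entry per topic, each a pair of a `(probability, word)` list and a coherence score. The sketch below works on a toy list of the same shape to show how the average and a per-topic summary can be read off. `summarize_top_topics` is a hypothetical helper, and the numbers in `toy_top_topics` are made up rather than taken from the trained model.

```python
def summarize_top_topics(top_topics, n_words=3):
    """Return (average coherence, [(coherence, top words), ...]) for a
    list shaped like the return value of LdaModel.top_topics."""
    avg = sum(coherence for _, coherence in top_topics) / len(top_topics)
    rows = [
        (coherence, [word for _, word in word_probs[:n_words]])
        for word_probs, coherence in top_topics
    ]
    return avg, rows

# Toy input in the same shape; the numbers are made up for illustration.
toy_top_topics = [
    ([(0.015, "후보"), (0.014, "단가"), (0.013, "물가")], -0.9),
    ([(0.013, "물가"), (0.012, "경제"), (0.008, "기업")], -1.5),
]

avg, rows = summarize_top_topics(toy_top_topics)
print(f"Average topic coherence: {avg:.4f}")
for coherence, words in rows:
    print(f"{coherence:.2f}: {', '.join(words)}")
```

Dividing by `len(top_topics)` rather than a separate `num_topics` variable keeps the average correct if the two ever get out of sync.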

## Visualization

In [None]:
!pip install pyLDAvis

import pickle
import pyLDAvis.gensim_models as gensimvis
import pyLDAvis
from gensim.models.coherencemodel import CoherenceModel
import matplotlib.pyplot as plt

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
lda_visualization = gensimvis.prepare(model, corpus, dictionary, sort_topics=False)
pyLDAvis.save_html(lda_visualization, 'file_name2.html')



In [None]:
pyLDAvis.display(lda_visualization)