
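### Topic Modeling
* Topic modeling is a technique for discovering the topics in a collection of documents.
* It rests on the intuition that 'documents about a particular topic will frequently contain particular words.'
* For example, documents about dogs contain words for dog breeds and dog traits more often than other documents do.
* The two most commonly used topic modeling methods are Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA).
  * LDA can be seen as a probabilistic development of LSA.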

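### Latent Semantic Analysis (LSA)
* Latent Semantic Analysis (LSA) is mainly used for semantic search over document indexes.
* It is also known as Latent Semantic Indexing (LSI).
* The goal of LSA is to discover the latent topics that underlie documents and words.
* These latent topics are assumed to drive the distribution of words in the documents.
* How LSA works
  * The document-term matrix built from the document collection is decomposed (by singular value decomposition) into a term-topic matrix, a topic-importance matrix, and a topic-document matrix.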
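
The decomposition described above can be sketched with a plain truncated SVD. This is only an illustration under assumptions: the four terms and all counts are made up, the matrix is oriented with terms as rows so the factor names match the bullet above, and `k = 2` is an arbitrary choice. It is not how gensim's `LsiModel` is implemented internally.

```python
import numpy as np

# Hypothetical toy data: 4 terms (rows) x 4 documents (columns).
terms = ["dog", "breed", "file", "program"]
term_doc = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
], dtype=float)

k = 2  # number of latent topics to keep (assumed for this sketch)

# Full SVD, then keep only the k largest singular values.
U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
term_topic = U[:, :k]              # term-topic matrix
topic_importance = np.diag(s[:k])  # topic-importance matrix (singular values)
topic_doc = Vt[:k, :]              # topic-document matrix

# Multiplying the three factors gives a rank-k approximation of the input.
approx = term_topic @ topic_importance @ topic_doc
print(np.round(approx, 2))
```

In the rank-2 approximation, "dog" and "breed" receive the same weight in the first two documents, which is the sense in which LSA groups words that occur in similar contexts.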


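### Latent Dirichlet Allocation (LDA)
* Latent Dirichlet Allocation (LDA) is one of the most widely used topic modeling algorithms.
* How LDA works
  1. The user chooses the number of topics and passes it to the algorithm.
  2. Every word is assigned to one of the topics (initially, typically at random).
  3. For every word w in every document, the topic is reassigned according to $p(t|d)$ and $p(w|t)$, and this is repeated. Each reassignment assumes that only w is assigned to the wrong topic and that every other word is assigned correctly.
* $p(t|d)$: the proportion of words in document d that are currently assigned to topic t
  * A topic held by many other words in the same document is likely to be this word's topic as well.
* $p(w|t)$: across all documents, the proportion of topic t's assignments that belong to word w
  * A topic that is often assigned to word w in other documents is likely to be this word's topic as well.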
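
The reassignment in step 3 can be sketched as a single scoring function. This is a minimal illustration, not any library's implementation: the function name `reassignment_probs`, the smoothing constants `alpha` and `beta`, and the counts are all assumptions made for the example. The counts are taken to exclude the word currently being reassigned.

```python
def reassignment_probs(doc_topic_counts, topic_word_counts, word_id,
                       alpha=0.1, beta=0.01):
    """Return normalized p(t|d) * p(w|t) for one word (hypothetical helper).

    doc_topic_counts:  per-topic word counts in the current document
    topic_word_counts: per-topic, per-word counts over all documents
    alpha, beta:       assumed smoothing constants so no score is zero
    """
    n_topics = len(doc_topic_counts)
    vocab_size = len(topic_word_counts[0])
    doc_total = sum(doc_topic_counts)

    scores = []
    for t in range(n_topics):
        # Share of this document's words that belong to topic t.
        p_t_given_d = (doc_topic_counts[t] + alpha) / (doc_total + n_topics * alpha)
        # Share of topic t's assignments that belong to this word.
        p_w_given_t = (topic_word_counts[t][word_id] + beta) / (
            sum(topic_word_counts[t]) + vocab_size * beta)
        scores.append(p_t_given_d * p_w_given_t)

    total = sum(scores)
    return [score / total for score in scores]


# Made-up counts: the document mostly uses topic 0, and word 0 is mostly topic 0.
probs = reassignment_probs([5, 1], [[3, 1], [1, 3]], word_id=0)
print(probs)
```

A sampler would draw the word's new topic from `probs` and then update the counts before moving on to the next word.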

### Data Preparation

In [19]:
from sklearn.datasets import fetch_20newsgroups

# The goal is topic modeling, so headers, footers, and quoted text are removed.
dataset = fetch_20newsgroups(shuffle=True, random_state = 1,
                             remove = ('headers', 'footers', 'quotes')) 
documents = dataset.data

print(len(documents))
documents[0]

11314


"Well i'm not sure about the story nad it did seem biased. What\nI disagree with is your statement that the U.S. Media is out to\nruin Israels reputation. That is rediculous. The U.S. media is\nthe most pro-israeli media in the world. Having lived in Europe\nI realize that incidences such as the one described in the\nletter have occured. The U.S. media as a whole seem to try to\nignore them. The U.S. is subsidizing Israels existance and the\nEuropeans are not (at least not to the same degree). So I think\nthat might be a reason they report more clearly on the\natrocities.\n\tWhat is a shame is that in Austria, daily reports of\nthe inhuman acts commited by Israeli soldiers and the blessing\nreceived from the Government makes some of the Holocaust guilt\ngo away. After all, look how the Jews are treating other races\nwhen they got power. It is unfortunate.\n"

In [5]:
documents[3]

'Notwithstanding all the legitimate fuss about this proposal, how much\nof a change is it?  ATT\'s last product in this area (a) was priced over\n$1000, as I suspect \'clipper\' phones will be; (b) came to the customer \nwith the key automatically preregistered with government authorities. Thus,\naside from attempting to further legitimize and solidify the fed\'s posture,\nClipper seems to be "more of the same", rather than a new direction.\n   Yes, technology will eventually drive the cost down and thereby promote\nmore widespread use- but at present, the man on the street is not going\nto purchase a $1000 crypto telephone, especially when the guy on the other\nend probably doesn\'t have one anyway.  Am I missing something?\n   The real question is what the gov will do in a year or two when air-\ntight voice privacy on a phone line is as close as your nearest pc.  That\nhas got to a problematic scenario for them, even if the extent of usage\nnever surpasses the \'underground\' stature

* The text needs preprocessing.

In [41]:
import re
import nltk
from nltk.corpus import stopwords
from gensim.parsing.preprocessing import preprocess_string

nltk.download('stopwords')

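def clean_text(d):
  # Keep only ASCII letters and whitespace; digits and punctuation are removed.
  pattern = r'[^a-zA-Z\s]'
  text = re.sub(pattern, '', d)
  return text

def clean_stopword(d):
  # Remove stopwords, drop words of 3 characters or fewer, and lowercase the rest.
  stop_words = set(stopwords.words('english'))  # a set makes membership tests fast
  return ' '.join([w.lower() for w in d.split() if w.lower() not in stop_words and len(w) > 3])

def preprocessing(d):
  # gensim's default filter chain: tokenizes the text and stems each token.
  return preprocess_string(d)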


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
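
To see what the first two helpers do without downloading any NLTK data, here is a self-contained sketch. It assumes a small hypothetical stopword set (`TOY_STOP_WORDS`) in place of NLTK's English list, and the sample sentence is adapted from the first document above; the function names are illustrative copies, not the notebook's own objects.

```python
import re

# Hypothetical stand-in for stopwords.words('english').
TOY_STOP_WORDS = {"about", "the", "not", "it", "did", "in"}

def clean_text_demo(text):
    # Keep only ASCII letters and whitespace.
    return re.sub(r"[^a-zA-Z\s]", "", text)

def clean_stopword_demo(text, stop_words=TOY_STOP_WORDS):
    # Lowercase, drop stopwords, and drop words of 3 characters or fewer.
    return " ".join(
        w.lower() for w in text.split()
        if w.lower() not in stop_words and len(w) > 3
    )

sample = "Well i'm not sure about the story, it did seem biased in 1993!"
cleaned = clean_stopword_demo(clean_text_demo(sample))
print(cleaned)
```

Note that short tokens such as "im" disappear because of the length filter, not because they are stopwords.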


In [42]:
import pandas as pd
# Check the total number of documents before preprocessing
new_df = pd.DataFrame({'article':documents})
len(new_df)


11314

In [43]:
new_df.replace("", float("NaN"), inplace=True)  # treat empty documents as missing
new_df.isnull().values.any()  # True means missing values exist

True

In [44]:
new_df.dropna(inplace=True)  # drop rows with missing values
print(len(new_df))

11096


In [45]:
new_df['article'] = new_df['article'].apply(clean_text)
new_df

Unnamed: 0,article
0,Well im not sure about the story nad it did se...
1,\n\n\n\n\n\n\nYeah do you expect people to rea...
2,Although I realize that principle is not one o...
3,Notwithstanding all the legitimate fuss about ...
4,Well I will have to change the scoring on my p...
...,...
11309,Danny Rubenstein an Israeli journalist will be...
11310,\n
11311,\nI agree Home runs off Clemens are always me...
11312,I used HP DeskJet with Orange Micros Grappler ...


In [46]:
new_df['article'] = new_df['article'].apply(clean_stopword)
new_df['article']

0        well sure story seem biased disagree statement...
1        yeah expect people read actually accept hard a...
2        although realize principle strongest points wo...
3        notwithstanding legitimate fuss proposal much ...
4        well change scoring playoff pool unfortunately...
                               ...                        
11309    danny rubenstein israeli journalist speaking t...
11310                                                     
11311    agree home runs clemens always memorable kinda...
11312    used deskjet orange micros grappler system upd...
11313    argument murphy scared hell came last year han...
Name: article, Length: 11096, dtype: object

In [47]:
tokenized_news = new_df['article'].apply(preprocessing)
tokenized_news = tokenized_news.to_list()

[['yeah',
  'expect',
  'peopl',
  'read',
  'actual',
  'accept',
  'hard',
  'atheism',
  'need',
  'littl',
  'leap',
  'faith',
  'jimmi',
  'logic',
  'run',
  'steam',
  'sorri',
  'piti',
  'sorri',
  'feel',
  'denial',
  'faith',
  'need',
  'pretend',
  'happili',
  'mayb',
  'start',
  'newsgroup',
  'altatheisthard',
  'wont',
  'bummin',
  'byeby',
  'dont',
  'forget',
  'flintston',
  'chewabl',
  'bake',
  'timmon'],
 ['realiz',
  'principl',
  'strongest',
  'point',
  'like',
  'know',
  'question',
  'sort',
  'arab',
  'countri',
  'want',
  'continu',
  'think',
  'tank',
  'charad',
  'fixat',
  'israel',
  'stop',
  'start',
  'ask',
  'sort',
  'question',
  'arab',
  'countri',
  'realiz',
  'work',
  'arab',
  'countri',
  'treatment',
  'jew',
  'decad',
  'fixat',
  'israel',
  'begin',
  'look',
  'like',
  'bias',
  'attack',
  'group',
  'recogn',
  'stupid',
  'center',
  'polici',
  'research',
  'fanci',
  'bigot',
  'hate',
  'israel'],
 ['notwithstan

In [50]:
# Keep only documents that have more than one token.
# A list comprehension avoids building a ragged NumPy array, which np.delete
# does implicitly on a list of lists and which newer NumPy versions may reject.
news_texts = [tokens for tokens in tokenized_news if len(tokens) > 1]
print(len(news_texts))

10926




### Topic Modeling with Gensim

In [51]:
from gensim import corpora

dictionary = corpora.Dictionary(news_texts)
corpus = [dictionary.doc2bow(text) for text in news_texts]  # convert each document to bag-of-words
print(corpus[1])

[(50, 1), (51, 1), (52, 1), (53, 1), (54, 1), (55, 1), (56, 1), (57, 1), (58, 1), (59, 1), (60, 1), (61, 2), (62, 1), (63, 1), (64, 1), (65, 1), (66, 1), (67, 1), (68, 1), (69, 1), (70, 1), (71, 1), (72, 2), (73, 1), (74, 1), (75, 1), (76, 1), (77, 1), (78, 1), (79, 2), (80, 1), (81, 1), (82, 1), (83, 1), (84, 1)]


* Topic modeling with gensim requires both the dictionary and the corpus, which is why they are prepared here.
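
The `(id, count)` pairs printed above can be reproduced in spirit with the standard library. This is a rough sketch of the idea only: `build_vocab` and `to_bow` are hypothetical helpers, and the ids they assign will not match the ones gensim's `Dictionary` produces.

```python
from collections import Counter

def build_vocab(docs):
    # Assign each distinct token an integer id in order of first appearance.
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(doc, vocab):
    # Count how often each known token id occurs; unknown tokens are skipped.
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())

# Made-up tokenized documents.
toy_docs = [["arab", "countri", "arab"], ["file", "countri"]]
toy_vocab = build_vocab(toy_docs)
toy_corpus = [to_bow(doc, toy_vocab) for doc in toy_docs]
print(toy_corpus)
```

Each document becomes a sparse list of pairs, so words that do not occur in a document take no space in its representation.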

### LsiModel for Latent Semantic Analysis


In [53]:
from gensim.models import LsiModel
# LsiModel has the advantage of being faster than LDA.
lsi_model = LsiModel(corpus, num_topics=20, id2word=dictionary)  # 20 topics, matching the 20 newsgroups in the dataset
topics = lsi_model.print_topics()
topics

[(0,
  '1.000*"maxaxaxaxaxaxaxaxaxaxaxaxaxaxax" + 0.008*"mgvgvgvgvgvgvgvgvgvgvgvgvgvgvgv" + 0.005*"maxaxaxaxaxaxaxaxaxaxaxaxaxax" + 0.003*"maxaxaxaxaxaxaxaxaxaxaxaxaxaxaxq" + 0.002*"maxaxaxaxaxaxaxaxaxaxaxaxaxaxf" + 0.002*"mqaxaxaxaxaxaxaxaxaxaxaxaxaxax" + 0.001*"maxaxaxaxaxaxaxaxaxaxaxaxaxasqq" + 0.001*"maxaxaxaxaxaxaxaxaxaxaxaxaxaxqq" + 0.001*"maxaxaxaxaxaxaxaxaxaxaxaxaxaxasq" + 0.001*"maxaxaxaxaxaxaxaxaxaxaxaxaxaxqqf"'),
 (1,
  '0.393*"file" + 0.191*"program" + 0.158*"imag" + 0.126*"peopl" + 0.125*"avail" + 0.119*"inform" + 0.116*"includ" + 0.116*"entri" + 0.114*"work" + 0.112*"dont"'),
 (2,
  '0.456*"file" + -0.215*"peopl" + -0.210*"know" + -0.192*"said" + -0.176*"dont" + 0.158*"entri" + -0.158*"think" + -0.153*"stephanopoulo" + 0.139*"imag" + -0.129*"go"'),
 (3,
  '0.409*"file" + 0.286*"entri" + -0.241*"imag" + -0.168*"avail" + -0.141*"wire" + -0.136*"data" + -0.122*"version" + 0.116*"onam" + -0.109*"window" + 0.104*"said"'),
 (4,
  '-0.618*"wire" + -0.250*"ground" + -0.188*"circu

* In practice, it is hard to judge how many topics there really are.

Progress note (21.10.5): completed up to the 22-minute mark.