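# LDA Practice Code with Scikit-learn
* Source: [딥 러닝을 이용한 자연어 처리 입문 (Introduction to Natural Language Processing Using Deep Learning) / 19-03 Latent Dirichlet Allocation (LDA) Practice with Scikit-learn](https://wikidocs.net/40710)
* The code is organized in the following order:
    1. Load the data
    2. Preprocess the text
    3. Detokenize, then vectorize with TF-IDF
    4. LDA topic modeling (scikit-learn)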

## 01. Data Load

In [1]:
import pandas as pd
import urllib.request
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

data = pd.read_csv('./Dataset/abcnews-date-text.csv')
print('Number of news headlines:', len(data))

Number of news headlines: 1244184


In [2]:
# Print the first 5 samples
print(data.head(5))

   publish_date                                      headline_text
0      20030219  aba decides against community broadcasting lic...
1      20030219     act fire witnesses must be aware of defamation
2      20030219     a g calls for infrastructure protection summit
3      20030219           air nz staff in aust strike for pay rise
4      20030219      air nz strike to affect australian travellers


In [3]:
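# Keep only the column we need (headline_text, the news headline).
# .copy() makes text an independent DataFrame, so later column assignments
# do not trigger pandas' SettingWithCopyWarning.
text = data[['headline_text']].copy()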
text.head(5)

                                       headline_text
0  aba decides against community broadcasting lic...
1     act fire witnesses must be aware of defamation
2     a g calls for infrastructure protection summit
3           air nz staff in aust strike for pay rise
4      air nz strike to affect australian travellers


## 02. Text preprocessing
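
The cells in this section call `nltk.word_tokenize`, `stopwords.words('english')`, and `WordNetLemmatizer`, all of which need NLTK data packages installed locally. The original notebook does not show that setup step, so the sketch below is an assumption about how it might be done: the helper names `missing_resources` and `ensure_nltk_resources` are hypothetical, and the resource list is a guess based on the calls used here (newer NLTK releases may ask for `punkt_tab` in addition to `punkt`).

```python
import nltk

# Assumed mapping of NLTK package name -> path checked by nltk.data.find.
# This list is inferred from the calls used in this notebook, not taken from it.
REQUIRED = {
    "punkt": "tokenizers/punkt",      # used by nltk.word_tokenize
    "stopwords": "corpora/stopwords", # used by stopwords.words('english')
    "wordnet": "corpora/wordnet",     # used by WordNetLemmatizer
}

def missing_resources(required=REQUIRED):
    """Return the names of the required NLTK packages that are not installed."""
    missing = []
    for name, path in required.items():
        try:
            nltk.data.find(path)
        except LookupError:
            missing.append(name)
    return missing

def ensure_nltk_resources(required=REQUIRED):
    """Download only the packages that are missing (needs network access)."""
    for name in missing_resources(required):
        nltk.download(name, quiet=True)
```

Calling `ensure_nltk_resources()` once before the tokenization cell should be enough; on later runs it finds the packages already installed and downloads nothing.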

In [4]:
# Tokenize each headline and check the result
text['headline_text'] = text.apply(lambda row: nltk.word_tokenize(row['headline_text']), axis=1)
print(text.head(5))

                                       headline_text
0  [aba, decides, against, community, broadcastin...
1  [act, fire, witnesses, must, be, aware, of, de...
2  [a, g, calls, for, infrastructure, protection,...
3  [air, nz, staff, in, aust, strike, for, pay, r...
4  [air, nz, strike, to, affect, australian, trav...
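
Assigning into a column of a DataFrame that was itself selected from another DataFrame (as with `data[['headline_text']]`) is the pattern pandas flags with `SettingWithCopyWarning`. A minimal sketch of one way to avoid it is to take an explicit copy first. The two-row DataFrame below is made-up sample data, and `str.split` stands in for `nltk.word_tokenize` so the example runs without NLTK resources.

```python
import pandas as pd

# Made-up sample in the same layout as abcnews-date-text.csv
data = pd.DataFrame({
    "publish_date": [20030219, 20030219],
    "headline_text": [
        "air nz strike to affect australian travellers",
        "act fire witnesses must be aware of defamation",
    ],
})

# .copy() yields an independent DataFrame, so the assignment below
# modifies text only and leaves data untouched.
text = data[["headline_text"]].copy()

# str.split is a stand-in for nltk.word_tokenize (assumption for this sketch)
text["headline_text"] = text["headline_text"].apply(str.split)

print(text)
print(data["headline_text"].iloc[0])  # still the original string
```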




In [5]:
# Remove stop words (against, be, of, a, ...)
stop_words = stopwords.words('english')
text['headline_text'] = text['headline_text'].apply(lambda x: [word for word in x if word not in (stop_words)])

print(text.head(5))

                                       headline_text
0   [aba, decides, community, broadcasting, licence]
1    [act, fire, witnesses, must, aware, defamation]
2     [g, calls, infrastructure, protection, summit]
3          [air, nz, staff, aust, strike, pay, rise]
4  [air, nz, strike, affect, australian, travellers]




In [6]:
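# Lemmatize each token as a verb to reduce it to its base form
# (e.g. decides -> decide, broadcasting -> broadcast)
lemmatizer = WordNetLemmatizer()  # create once instead of once per word
text['headline_text'] = text['headline_text'].apply(lambda x: [lemmatizer.lemmatize(word, pos='v') for word in x])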
print(text.head(5))

                                       headline_text
0       [aba, decide, community, broadcast, licence]
1      [act, fire, witness, must, aware, defamation]
2      [g, call, infrastructure, protection, summit]
3          [air, nz, staff, aust, strike, pay, rise]
4  [air, nz, strike, affect, australian, travellers]




In [7]:
# Remove words that are 3 characters long or shorter
tokenized_doc = text['headline_text'].apply(lambda x: [word for word in x if len(word) > 3])
print(tokenized_doc[:5])

0       [decide, community, broadcast, licence]
1      [fire, witness, must, aware, defamation]
2    [call, infrastructure, protection, summit]
3                   [staff, aust, strike, rise]
4      [strike, affect, australian, travellers]
Name: headline_text, dtype: object
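
The stop-word filter and the length filter are both plain list comprehensions, so their combined effect can be checked on a single headline without pandas or NLTK. The sketch below is illustrative only: `STOP_WORDS` is a tiny hypothetical subset of the NLTK English list, `filter_tokens` is a made-up helper name, and the lemmatization step is skipped.

```python
# Hypothetical miniature stop-word list; the notebook uses
# stopwords.words('english') from NLTK instead.
STOP_WORDS = {"against", "be", "of", "a", "for", "in", "to"}

def filter_tokens(tokens, stop_words=STOP_WORDS, min_length=4):
    """Drop stop words and tokens shorter than min_length characters."""
    return [t for t in tokens if t not in stop_words and len(t) >= min_length]

headline = "air nz staff in aust strike for pay rise".split()
print(filter_tokens(headline))
# ['staff', 'aust', 'strike', 'rise']
```

Using a `set` for the stop words makes each membership test constant time, which may matter when the filter runs over more than a million headlines.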


## 03. Vectorization (TF-IDF)

In [8]:
# Detokenize: undo the tokenization by joining each preprocessed
# list of tokens back into a single space-separated string
detokenized_doc = []
for i in range(len(text)):
    t = ' '.join(tokenized_doc[i])
    detokenized_doc.append(t)

# Store the result back in text['headline_text']
text['headline_text'] = detokenized_doc

text['headline_text'][:5]


0       decide community broadcast licence
1       fire witness must aware defamation
2    call infrastructure protection summit
3                   staff aust strike rise
4      strike affect australian travellers
Name: headline_text, dtype: object
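
The detokenization loop looks up `tokenized_doc[i]` by label, so it relies on the Series having the default 0..n-1 index. As a sketch of an alternative that does not depend on the index, the same join can be applied element-wise; the three-row Series below is made-up sample data.

```python
import pandas as pd

# Made-up sample of already tokenized headlines, including an empty one
tokenized_doc = pd.Series([
    ["decide", "community", "broadcast", "licence"],
    ["staff", "aust", "strike", "rise"],
    [],
])

# " ".join is applied to each list; an empty list becomes an empty string
detokenized = tokenized_doc.apply(" ".join)

print(detokenized.tolist())
# ['decide community broadcast licence', 'staff aust strike rise', '']
```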

In [9]:
# Keep only the 1,000 most frequent terms (optional)
vectorizer = TfidfVectorizer(stop_words='english', max_features= 1000)
X = vectorizer.fit_transform(text['headline_text'])

# Check the shape of the TF-IDF matrix
print('Shape of the TF-IDF matrix:', X.shape)

Shape of the TF-IDF matrix: (1244184, 1000)


## 04. Topic Modeling

In [10]:
lda_model = LatentDirichletAllocation(n_components=10,learning_method='online',random_state=777,max_iter=1)
lda_top = lda_model.fit_transform(X)

print(lda_model.components_)
print(lda_model.components_.shape) 

[[1.00005784e-01 1.00000401e-01 1.00000837e-01 ... 1.00008966e-01
  1.00003001e-01 1.00003989e-01]
 [1.00001271e-01 1.00000170e-01 1.00000702e-01 ... 1.00009027e-01
  1.00004822e-01 6.11677308e+02]
 [1.00001893e-01 1.00000747e-01 5.29631843e+02 ... 1.00004665e-01
  1.00003828e-01 1.00003730e-01]
 ...
 [1.00006258e-01 1.00000570e-01 1.00000388e-01 ... 1.00006276e-01
  1.00001666e-01 1.00008566e-01]
 [1.00000280e-01 1.00000135e-01 1.00001609e-01 ... 1.00003391e-01
  1.00001003e-01 1.00006488e-01]
 [1.03272231e+02 1.00000349e-01 1.00001881e-01 ... 1.00005116e-01
  1.00004106e-01 1.00006425e-01]]
(10, 1000)
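
The rows of `components_` are unnormalized topic-word weights, not probabilities; the many values near 0.1 are consistent with the default topic-word prior of `1 / n_components`. Dividing each row by its sum gives a word distribution per topic, while `fit_transform` returns a topic distribution per document. The sketch below illustrates both on a small random matrix (made-up data, with hypothetical variable names) rather than on the news headlines.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Made-up non-negative document-term matrix: 20 documents, 8 terms
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 5, size=(20, 8)).astype(float)

lda = LatentDirichletAllocation(
    n_components=3, learning_method="online", random_state=777, max_iter=5
)
doc_topic = lda.fit_transform(X_toy)  # shape (20, 3)

# Normalize each topic's weights so they sum to 1 over the vocabulary
topic_word = lda.components_ / lda.components_.sum(axis=1)[:, np.newaxis]

# Index of the highest-weighted topic for each document
dominant = doc_topic.argmax(axis=1)

print(topic_word.shape)
print(np.allclose(topic_word.sum(axis=1), 1.0))
print(np.allclose(doc_topic.sum(axis=1), 1.0))
print(dominant[:5])
```

In the notebook, the same normalization could be applied to `lda_model.components_`, and `lda_top.argmax(axis=1)` would give the dominant topic for each headline.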


In [12]:
# Vocabulary: the 1,000 terms kept by the vectorizer
terms = vectorizer.get_feature_names_out()

def get_topics(components, feature_names, n=5):
    for idx, topic in enumerate(components):
        print("Topic %d:" % (idx+1), [(feature_names[i], topic[i].round(2)) for i in topic.argsort()[:-n - 1:-1]])

get_topics(lda_model.components_,terms)

Topic 1: [('australia', 20556.0), ('sydney', 11219.29), ('melbourne', 8765.73), ('kill', 6646.06), ('court', 6004.12)]
Topic 2: [('coronavirus', 41719.62), ('covid', 28960.68), ('government', 9793.89), ('change', 7576.98), ('home', 7457.74)]
Topic 3: [('south', 7102.98), ('death', 6825.39), ('speak', 5402.31), ('care', 4521.48), ('interview', 4058.71)]
Topic 4: [('donald', 8536.49), ('restrictions', 6456.4), ('world', 6320.61), ('state', 6087.5), ('water', 4219.67)]
Topic 5: [('vaccine', 8040.87), ('open', 6915.17), ('coast', 5990.55), ('warn', 5472.84), ('morrison', 5247.93)]
Topic 6: [('trump', 14878.13), ('charge', 7717.96), ('health', 6836.95), ('murder', 6663.45), ('house', 6624.13)]
Topic 7: [('australian', 13885.88), ('queensland', 13373.53), ('record', 9037.88), ('test', 7713.9), ('help', 5922.05)]
Topic 8: [('case', 13146.83), ('police', 11143.1), ('live', 7528.03), ('border', 6855.54), ('tasmania', 5664.64)]
Topic 9: [('victoria', 11777.5), ('school', 6009.78), ('attack', 550