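# Topic Modeling with BERTopic
* Source: [딥 러닝을 이용한 자연어 처리 입문 (Introduction to Natural Language Processing Using Deep Learning) / 19-08 BERTopic](https://wikidocs.net/161310)
* BERTopic extracts topic keywords by combining BERT-based sentence embeddings (SBERT) with TF-IDF. It also supports dynamic topic modeling, but keywords can be extracted without any temporal information, as in this notebook.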
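* This notebook proceeds in the following order:
    1. Load the 20 Newsgroups data and fit a BERTopic model
    2. Inspect and visualize the topics
    3. Reduce the number of topics
    4. Predict the topic of an arbitrary document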
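
To make the TF-IDF half of the method concrete, here is a minimal sketch of a class-based TF-IDF (c-TF-IDF), the variant BERTopic's documentation describes for scoring topic words. The weighting `tf * log(1 + A / f)` is stated here as an assumption, with `A` the average number of words per topic and `f` the word's total frequency. The function name `class_tfidf` and the toy corpus are hypothetical, and the clustering step that BERTopic performs on the embeddings is taken as already done.

```python
import math
from collections import Counter

def class_tfidf(docs_by_topic):
    """Score words per topic with a class-based TF-IDF.

    docs_by_topic maps a topic id to the tokenized documents
    (lists of words) assigned to that topic.
    """
    # Treat all documents of one topic as a single long "class document".
    tf = {topic: Counter(word for doc in docs for word in doc)
          for topic, docs in docs_by_topic.items()}

    # Total frequency of each word across all topics.
    total_freq = Counter()
    for counts in tf.values():
        total_freq.update(counts)

    # Average number of words per topic.
    avg_words = sum(total_freq.values()) / len(tf)

    # Assumed weighting: tf * log(1 + avg_words / total frequency of the word).
    return {topic: {word: count * math.log(1 + avg_words / total_freq[word])
                    for word, count in counts.items()}
            for topic, counts in tf.items()}

# Hypothetical toy corpus, already tokenized and already clustered.
toy = {
    0: [["the", "team", "won", "the", "game"], ["the", "game", "was", "close"]],
    1: [["the", "key", "encrypts", "the", "message"], ["a", "chip", "stores", "the", "key"]],
}
scores = class_tfidf(toy)
top_words = {topic: max(weights, key=weights.get) for topic, weights in scores.items()}
print(top_words)
```

Although "the" is the most frequent word in both toy topics, it occurs in every topic, so its weight is pushed below that of the topic-specific words.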

In [1]:
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups


In [2]:
docs = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))['data']
docs[:5]

["\n\nI am sure some bashers of Pens fans are pretty confused about the lack\nof any kind of posts about the recent Pens massacre of the Devils. Actually,\nI am  bit puzzled too and a bit relieved. However, I am going to put an end\nto non-PIttsburghers' relief with a bit of praise for the Pens. Man, they\nare killing those Devils worse than I thought. Jagr just showed you why\nhe is much better than his regular season stats. He is also a lot\nfo fun to watch in the playoffs. Bowman should let JAgr have a lot of\nfun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final\nregular season game.          PENS RULE!!!\n\n",
 'My brother is in the market for a high-performance video card that supports\nVESA local bus with 1-2MB RAM.  Does anyone have suggestions/ideas on:\n\n  - Diamond Stealth Pro Local Bus\n\n  - Orchid Farenheit 1280\n\n  - ATI Graphics Ultra Pro\n\n  - Any other high-per

In [3]:
print('Total number of documents:', len(docs))

Total number of documents: 18846


In [4]:
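# Create the model object
model = BERTopic()
# Fit the topic model and assign a topic to every document
topics, probabilities = model.fit_transform(docs)

print('Number of per-document topic assignments:', len(topics))
print('Topic number of the first document:', topics[0])

Number of per-document topic assignments: 18846
Topic number of the first document: 0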


In [5]:
# get_topic_info(): shows the topics, their sizes, and the words assigned to each topic
# Topic -1 collects outlier documents that were not assigned to any topic
model.get_topic_info()

| | Topic | Count | Name | Representation | Representative_Docs |
|---|---|---|---|---|---|
| 0 | -1 | 6832 | `-1_to_the_of_and` | `[to, the, of, and, is, in, for, you, it, that]` | `[\nThat was my point. If I play poker with Mon...` |
| 1 | 0 | 1832 | `0_game_team_games_he` | `[game, team, games, he, players, season, hocke...` | `[\nWales Conference, Adams Division, Semifinal...` |
| 2 | 1 | 576 | `1_key_clipper_chip_encryption` | `[key, clipper, chip, encryption, keys, escrow,...` | `[The following document summarizes the Clipper...` |
| 3 | 2 | 449 | `2_ites_15_each_` | `[ites, 15, each, , , , , , , ]` | `[each, \n                                      ...` |
| 4 | 3 | 354 | `3_israel_israeli_arab_arabs` | `[israel, israeli, arab, arabs, jews, palestini...` | `[\nThis a "tried and true" method utilized by ...` |
| ... | ... | ... | ... | ... | ... |
| 216 | 215 | 10 | `215_plutonium_nuclear_clancy_reactors` | `[plutonium, nuclear, clancy, reactors, bomb, r...` | `[  Doug Holland claims Tom Clancy has provide...` |
| 217 | 216 | 10 | `216_hotel_voucher_package_airfare` | `[hotel, voucher, package, airfare, vacation, m...` | `[Items for sale.....\n\nThis package was bough...` |
| 218 | 217 | 10 | `217_clinton_bush_administration_government` | `[clinton, bush, administration, government, ch...` | `[\nEven if what Brad says turns out to be accu...` |
| 219 | 218 | 10 | `218_icon_icons_click_box` | `[icon, icons, click, box, change, manager, bro...` | `[\nDo you mean the icons _of_ the program grou...` |


In [6]:
# Summing the Count column gives the total number of documents
model.get_topic_info()['Count'].sum()

18846
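
The sum matches because every document, outliers included, receives exactly one topic id. The sketch below rebuilds a Count-style tally from a list of topic assignments using only the standard library. The eight assignments are hypothetical; in the notebook, the list would be the `topics` returned by `fit_transform`.

```python
from collections import Counter

# Hypothetical topic assignments for eight documents; -1 marks outliers.
topics = [0, -1, 1, 0, 0, -1, 2, 1]

# Count documents per topic, largest first, like the Count column.
topic_sizes = Counter(topics)
for topic_id, size in topic_sizes.most_common():
    print(topic_id, size)

# One topic id per document, so the sizes add up to the number of documents.
total = sum(topic_sizes.values())
print(total == len(topics))
```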

In [7]:
# Topic IDs run from 0 to 218 (plus -1 for outliers); print the words of topic 5 as an arbitrary example
model.get_topic(5)

[('car', 0.020532907492082055),
 ('ford', 0.014310311194111455),
 ('cars', 0.014239067893161298),
 ('mustang', 0.013813683439244866),
 ('engine', 0.012155170240176559),
 ('v8', 0.010309828244736507),
 ('convertible', 0.008175574481483087),
 ('toyota', 0.007724302165736773),
 ('v6', 0.00766773719564878),
 ('wagon', 0.006115950910789453)]
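
`get_topic()` returns `(word, score)` pairs sorted by score. In the table above, the Name column appears to be the topic id followed by the first four words, joined by underscores. The sketch below reproduces that pattern from the pairs printed for topic 5. The helper `make_label` is hypothetical and is not part of BERTopic.

```python
# Top words and scores for topic 5, copied from the output above.
topic_words = [
    ('car', 0.020532907492082055),
    ('ford', 0.014310311194111455),
    ('cars', 0.014239067893161298),
    ('mustang', 0.013813683439244866),
    ('engine', 0.012155170240176559),
]

def make_label(topic_id, words, n_words=4):
    """Join the topic id and the first n_words keywords with underscores."""
    return "_".join([str(topic_id)] + [word for word, _ in words[:n_words]])

label = make_label(5, topic_words)
print(label)  # 5_car_ford_cars_mustang
```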

In [8]:
# Visualize the topics (intertopic distance map)
model.visualize_topics()

In [12]:
# Topic visualization 2: bar charts of the top words per topic
model.visualize_barchart()

In [13]:
# Visualize topic similarity as a heatmap
model.visualize_heatmap()
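
The heatmap shows a matrix of pairwise topic similarities. As a rough illustration of how such a matrix is built, the sketch below computes cosine similarity between topic vectors. Both the choice of cosine similarity and the three-dimensional vectors are assumptions made for this example; real topic vectors have many more dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical topic vectors, for illustration only.
topic_vectors = {
    "hockey": [0.9, 0.1, 0.0],
    "baseball": [0.8, 0.2, 0.1],
    "encryption": [0.0, 0.1, 0.9],
}

names = list(topic_vectors)
matrix = [[cosine_similarity(topic_vectors[a], topic_vectors[b]) for b in names]
          for a in names]

# Print the similarity matrix row by row; the diagonal is 1.
for name, row in zip(names, matrix):
    print(f"{name:>10}", " ".join(f"{value:.2f}" for value in row))
```

The two sports topics end up far more similar to each other than either is to the encryption topic, which is the kind of block structure to look for in the heatmap.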


## Reducing the Number of Topics

In [15]:
# Reduce to an arbitrarily chosen number of topics (20)
model = BERTopic(nr_topics=20)
topics, probabilities = model.fit_transform(docs)

model.visualize_topics()

In [16]:
# Let the model reduce the number of topics automatically
model = BERTopic(nr_topics="auto")
topics, probabilities = model.fit_transform(docs)

model.visualize_topics()

In [17]:
model.get_topic_info()

| | Topic | Count | Name | Representation | Representative_Docs |
|---|---|---|---|---|---|
| 0 | -1 | 6526 | `-1_the_to_of_and` | `[the, to, of, and, is, in, for, that, it, you]` | `[\nIn the abstract, what you're saying is true...` |
| 1 | 0 | 3112 | `0_for_and_with_the` | `[for, and, with, the, is, to, it, on, image, or]` | `[FOR SALE:\n\nAT&T Dataport Internal 14.4K Fax...` |
| 2 | 1 | 1830 | `1_game_team_he_games` | `[game, team, he, games, the, players, season, ...` | `[I thought I'd post my predicted standings sin...` |
| 3 | 2 | 635 | `2_key_clipper_chip_encryption` | `[key, clipper, chip, encryption, keys, governm...` | `[[An article from comp.org.eff.news, EFFector ...` |
| 4 | 3 | 624 | `3_were_armenian_of_the` | `[were, armenian, of, the, they, armenians, and...` | `[Accounts of Anti-Armenian Human Right Violati...` |
| ... | ... | ... | ... | ... | ... |
| 110 | 109 | 11 | `109_alarm_sensor_alarms_car` | `[alarm, sensor, alarms, car, viper, shock, alp...` | `[\n\nAlthough, others have in the past and wil...` |
| 111 | 110 | 11 | `110_media_publications_spiking_digging` | `[media, publications, spiking, digging, person...` | `[\n\n\nThe answer to your second question lies...` |
| 112 | 111 | 10 | `111_hernia_hernias_ring_procedure` | `[hernia, hernias, ring, procedure, pvcs, repai...` | `[                                             ...` |
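
To compare the default run with the `nr_topics="auto"` run, the sketch below derives the number of topics and the outlier share from figures copied out of the two `get_topic_info()` outputs in this notebook. The row counts (220 and 114) are inferred from the last row index shown in each output. Each run refits the model, so the exact numbers will likely differ from one execution to the next.

```python
# Figures copied from the two get_topic_info() outputs in this notebook.
total_docs = 18846
runs = {
    "default": {"outliers": 6832, "rows": 220},
    'nr_topics="auto"': {"outliers": 6526, "rows": 114},
}

summary = {}
for name, run in runs.items():
    n_topics = run["rows"] - 1                # exclude the -1 outlier row
    share = run["outliers"] / total_docs      # fraction of unassigned documents
    summary[name] = (n_topics, share)
    print(f"{name}: {n_topics} topics, {share:.1%} outliers")
```

In this particular pair of runs, automatic reduction roughly halves the number of topics while the outlier share changes only slightly.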


## Predicting the Topic of an Arbitrary Document

In [18]:
new_doc = docs[0]
print(new_doc)



I am sure some bashers of Pens fans are pretty confused about the lack
of any kind of posts about the recent Pens massacre of the Devils. Actually,
I am  bit puzzled too and a bit relieved. However, I am going to put an end
to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they
are killing those Devils worse than I thought. Jagr just showed you why
he is much better than his regular season stats. He is also a lot
fo fun to watch in the playoffs. Bowman should let JAgr have a lot of
fun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final
regular season game.          PENS RULE!!!




In [19]:
topics, probs = model.transform([new_doc])
print('Predicted topic number:', topics)

Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.

Predicted topic number: [1]
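
A bare topic number is hard to read on its own, so it helps to map it back to the topic's name. The sketch below does this with a plain dictionary built from the first rows of the reduced model's `get_topic_info()` output. The helper `describe` is hypothetical, and the `int()` conversion assumes the predicted ids may arrive as NumPy integers.

```python
# (Topic, Name) pairs copied from the first rows of the reduced model's table.
topic_names = {
    -1: "-1_the_to_of_and",
    0: "0_for_and_with_the",
    1: "1_game_team_he_games",
    2: "2_key_clipper_chip_encryption",
    3: "3_were_armenian_of_the",
}

def describe(predicted_topics, names):
    """Pair each predicted topic id with its name; unknown ids map to None."""
    return [(int(topic), names.get(int(topic))) for topic in predicted_topics]

predicted = [1]  # the prediction printed above
described = describe(predicted, topic_names)
print(described)  # [(1, '1_game_team_he_games')]
```

The hockey post is matched to the game/team topic, which agrees with its content.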
