# BERTopic

* [Official BERTopic documentation](https://maartengr.github.io/BERTopic/index.html)
* [Official BERTopic notebook](https://colab.research.google.com/drive/18arPPe50szvcCp_Y6xS56H2tY0m-RLqv?usp=sharing#scrollTo=W2AaTdNhCkGO)
* A topic model that combines Transformers embeddings (`all-MiniLM-L6-v2` by default) with c-TF-IDF to build easily interpretable clusters while keeping the words that matter for describing each topic
* Related article on Qiita: https://qiita.com/takky_0330/items/9cf8d642be3b216dd70d
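
To make the c-TF-IDF idea more concrete, here is a minimal sketch in plain Python. It assumes the class-based formula given in the BERTopic documentation, score = tf(word, class) × log(1 + A / f(word)), where A is the average number of words per class and f(word) is the word's count over all classes. The function name `c_tf_idf` and the toy clusters are made up for illustration; the library itself works on a bag-of-words matrix and may differ in normalization details.

```python
import math
from collections import Counter


def c_tf_idf(docs_per_class):
    """Score words per class with a simplified class-based TF-IDF.

    docs_per_class maps a class (cluster) id to the list of documents in it.
    """
    # Treat every cluster as one long document and count its words.
    tf = {
        c: Counter(" ".join(docs).lower().split())
        for c, docs in docs_per_class.items()
    }
    # Word counts over all classes, and the average number of words per class.
    total = Counter()
    for counts in tf.values():
        total.update(counts)
    avg_words = sum(total.values()) / len(tf)
    return {
        c: {w: n * math.log(1 + avg_words / total[w]) for w, n in counts.items()}
        for c, counts in tf.items()
    }


# Toy clusters (illustrative data, not taken from 20 Newsgroups).
clusters = {
    0: ["the game was great", "the team won the game"],
    1: ["the rocket reached orbit", "the space probe left orbit"],
}
scores = c_tf_idf(clusters)
top_words = {c: max(s, key=s.get) for c, s in scores.items()}
print(top_words)  # {0: 'game', 1: 'orbit'}
```

Even though "the" is the most frequent word in both toy clusters, it appears everywhere, so the logarithmic factor pushes it below the cluster-specific words.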


<a href="https://colab.research.google.com/github/fuyu-quant/Data_Science/blob/main/Natural_Language_processing/Topic_Model(english)/BERTopic.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [3]:
%%capture
!pip install bertopic

In [4]:
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

## Preparing the data

In [6]:
# Load every 20 Newsgroups post as raw text, with headers, footers and quoted replies stripped
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
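
Once headers, footers and quotes are stripped, some posts may end up empty or contain only whitespace, and such documents carry no signal for the embedding step. The following is a minimal sketch of an optional filter; the helper name `drop_short_docs` and the `min_chars` threshold are hypothetical and not part of the original notebook.

```python
def drop_short_docs(docs, min_chars=1):
    """Keep only documents that have at least min_chars characters after trimming whitespace."""
    return [doc for doc in docs if len(doc.strip()) >= min_chars]


# Tiny illustrative sample instead of the real corpus.
sample = ["A post about hockey.", "", "   \n", "ok"]
cleaned = drop_short_docs(sample)
print(len(sample), "->", len(cleaned))  # 4 -> 2
```

Applied to the real data this would be `docs = drop_short_docs(docs)`. Note that it changes the number of documents, so batch counts and topic sizes would differ from the outputs shown in this notebook.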

## BERTopic

In [17]:
#embedding_model="all-MiniLM-L6-v2"
embedding_model="xlm-r-bert-base-nli-stsb-mean-tokens"

# Use a SentenceTransformer object for the embeddings
#from sentence_transformers import SentenceTransformer
#embedding_model = SentenceTransformer("xlm-r-bert-base-nli-stsb-mean-tokens", device="cuda")

# Use one of the many publicly available transformer models for the embeddings
#from flair.embeddings import TransformerDocumentEmbeddings
#embedding_model=TransformerDocumentEmbeddings('roberta-base')

# Use the Universal Sentence Encoder
#import tensorflow_hub
#embedding_model = tensorflow_hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
#topic_model = BERTopic(verbose=True, embedding_model=embedding_model)

# Use different embeddings for documents and for words
#from bertopic.backend import WordDocEmbedder
#import gensim.downloader as api
#from sentence_transformers import SentenceTransformer

# Word embeddings
#ft = api.load('fasttext-wiki-news-subwords-300')

# Document embeddings
#distilbert = SentenceTransformer("distilbert-base-nli-stsb-mean-tokens")
#word_doc_embedder = WordDocEmbedder(embedding_model=distilbert, word_embedding_model=ft)
#topic_model = BERTopic(verbose=True, embedding_model=word_doc_embedder)

In [18]:
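topic_model = BERTopic(#language="japanese",   # for Japanese documents (not used when embedding_model is passed)
                       embedding_model=embedding_model,
                       calculate_probabilities=True,  # compute the probability of every topic for each document (useful for reassigning outliers afterwards)
                       verbose=True,
                       # pass an int (or "auto"); the string "20" used for the logged run below is not read as a number
                       nr_topics=20  # target number of topics after reduction
                       )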
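
`nr_topics` is documented as taking an int, the string `"auto"`, or `None`, so a numeric string coming from a config file or a form field should be converted before it reaches `BERTopic`. The following is a minimal sketch of such a conversion; `normalize_nr_topics` is a hypothetical helper, not part of the BERTopic API.

```python
def normalize_nr_topics(value):
    """Return an int, "auto" or None from a loosely typed setting (hypothetical helper)."""
    if value is None or value == "auto":
        return value
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool):
        raise TypeError("nr_topics must be an int, 'auto' or None")
    if isinstance(value, int):
        number = value
    elif isinstance(value, str) and value.strip().isdigit():
        number = int(value.strip())
    else:
        raise ValueError(f"cannot interpret nr_topics={value!r}")
    if number < 1:
        raise ValueError("nr_topics must be at least 1")
    return number


print(normalize_nr_topics("20"))    # 20
print(normalize_nr_topics("auto"))  # auto
```

The result can then be passed on as `BERTopic(nr_topics=normalize_nr_topics(setting), ...)`.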

## Training BERTopic

In [19]:
topic_model.fit(docs)

Batches:   0%|          | 0/589 [00:00<?, ?it/s]

2022-12-28 14:23:51,057 - BERTopic - Transformed documents to Embeddings
2022-12-28 14:24:13,908 - BERTopic - Reduced dimensionality
2022-12-28 14:24:25,149 - BERTopic - Clustered reduced embeddings
2022-12-28 14:24:35,278 - BERTopic - Reduced number of topics from 95 to 21


<bertopic._bertopic.BERTopic at 0x7f6b1f0eb9d0>

In [20]:
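# fit_transform trains the model again and returns, for each document, its topic and its topic probabilities
topics, probs = topic_model.fit_transform(docs)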

Batches:   0%|          | 0/589 [00:00<?, ?it/s]

2022-12-28 14:26:46,046 - BERTopic - Transformed documents to Embeddings
2022-12-28 14:27:09,213 - BERTopic - Reduced dimensionality
2022-12-28 14:27:16,969 - BERTopic - Clustered reduced embeddings
2022-12-28 14:27:27,264 - BERTopic - Reduced number of topics from 85 to 73


In [21]:
probs

array([[0.09790828, 0.24484116, 0.01032329, ..., 0.0058334 , 0.00838453,
        0.01118882],
       [0.04578481, 0.00286609, 0.00740172, ..., 0.01259724, 0.00792609,
        0.00571211],
       [0.08635302, 0.16948329, 0.00901567, ..., 0.004999  , 0.00730724,
        0.009767  ],
       ...,
       [0.06447445, 0.00370416, 0.01707378, ..., 0.0062696 , 0.0603413 ,
        0.00794144],
       [0.02263598, 0.00160543, 0.00217715, ..., 0.02038177, 0.00268623,
        0.00264215],
       [0.05080583, 0.00487645, 0.47876717, ..., 0.00475717, 0.0132666 ,
        0.00647266]])
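
One way to use this matrix is to move outlier documents (topic `-1`) to their most probable topic. The sketch below assumes that row `i` of `probs` belongs to document `i` and that column `j` corresponds to topic `j`, which is worth verifying on the installed BERTopic version. The helper name `reassign_outliers`, the threshold value, and the toy numbers are illustrative only.

```python
def reassign_outliers(topics, probs, threshold=0.1):
    """Move outlier documents (-1) to their most probable topic when it is likely enough."""
    new_topics = []
    for topic, row in zip(topics, probs):
        if topic == -1:
            # Index of the largest probability in this document's row.
            best = max(range(len(row)), key=lambda i: row[i])
            if row[best] >= threshold:
                topic = best
        new_topics.append(topic)
    return new_topics


# Toy data: three documents, three topics.
toy_topics = [-1, 0, -1]
toy_probs = [
    [0.10, 0.24, 0.01],  # outlier, topic 1 is likely enough
    [0.70, 0.05, 0.02],  # already assigned to topic 0
    [0.02, 0.03, 0.04],  # outlier, no topic reaches the threshold
]
print(reassign_outliers(toy_topics, toy_probs))  # [1, 0, -1]
```

Because the function only iterates over rows and indexes into them, it should also accept the NumPy array returned by `fit_transform`.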

## Displaying each topic

In [22]:
topic_model.get_topic_info()

| | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 7890 | `-1_the_to_and_of` |
| 1 | 0 | 2791 | `0_of_the_that_and` |
| 2 | 1 | 1544 | `1_game_he_team_the` |
| 3 | 2 | 1203 | `2_car_the_bike_it` |
| 4 | 3 | 764 | `3_space_the_of_and` |
| ... | ... | ... | ... |
| 68 | 67 | 10 | `67_sweden_usa_april_switzerland` |
| 69 | 68 | 10 | `68_subscribe_please_subscrive_me` |
| 70 | 69 | 10 | `69_3ds_file_texture_prj` |
| 71 | 70 | 10 | `70_silver_solder_crystal_cpu` |
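
In the output above, each `Name` appears to be the topic id followed by the topic's top four words joined with underscores, and several topics are dominated by function words such as "the" and "of". The sketch below parses such names and measures that share. The helper names and the tiny `STOPWORDS` set are hypothetical, and a word that itself contains an underscore would be split incorrectly.

```python
# Small illustrative stop-word set (an assumption, not a complete list).
STOPWORDS = {"the", "to", "and", "of", "that", "he", "it"}


def parse_topic_name(name):
    """Split a name such as '1_game_he_team_the' into (1, ['game', 'he', 'team', 'the'])."""
    topic_id, *words = name.split("_")
    return int(topic_id), words


def stopword_share(name):
    """Fraction of a topic's top words that are in STOPWORDS."""
    _, words = parse_topic_name(name)
    return sum(word in STOPWORDS for word in words) / len(words)


names = ["-1_the_to_and_of", "1_game_he_team_the", "3_space_the_of_and"]
for name in names:
    print(parse_topic_name(name)[0], stopword_share(name))
# -1 1.0
# 1 0.5
# 3 0.75
```

If many topics score high here, one option is to pass `vectorizer_model=CountVectorizer(stop_words="english")` (from scikit-learn) to `BERTopic`, so that stop words are excluded from the topic representations.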


## Visualizing the topic model

In [23]:
topic_model.visualize_topics()

## Hierarchical clustering of the topics

In [24]:
topic_model.visualize_hierarchy()