<a href="https://colab.research.google.com/github/fuyu-quant/Data_Science/blob/main/Natural_Language_processing/BERTopic.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# BERTopic
* [Official BERTopic documentation](https://maartengr.github.io/BERTopic/index.html)
* [Official BERTopic notebook](https://colab.research.google.com/drive/18arPPe50szvcCp_Y6xS56H2tY0m-RLqv?usp=sharing#scrollTo=W2AaTdNhCkGO)
* A topic model that uses Transformer embeddings (default: 'all-MiniLM-L6-v2') and c-TF-IDF to build easily interpretable clusters while keeping the words that matter for describing each topic
* https://qiita.com/takky_0330/items/9cf8d642be3b216dd70d
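
To make the c-TF-IDF step mentioned above more concrete, here is a simplified pure-Python sketch. It is not BERTopic's own implementation: it assumes the commonly described form `tf(word, class) * log(1 + A / f(word))`, where `A` is the average number of words per class and `f(word)` is the word's frequency across all classes. The whitespace tokenization, the normalization by class length, and the function names `c_tf_idf` and `top_words` are assumptions made for illustration.

```python
import math
from collections import Counter


def c_tf_idf(docs_per_topic):
    """Score words per topic; docs_per_topic maps topic id -> list of documents."""
    # Treat all documents of a topic as one long "class document" and count its words
    # (naive lowercase + whitespace tokenization, an assumption of this sketch).
    class_counts = {
        topic: Counter(" ".join(docs).lower().split())
        for topic, docs in docs_per_topic.items()
    }

    # f(word): frequency of each word summed over all classes.
    total_counts = Counter()
    for counts in class_counts.values():
        total_counts.update(counts)

    # A: average number of words per class.
    avg_words = sum(total_counts.values()) / len(class_counts)

    scores = {}
    for topic, counts in class_counts.items():
        n_words = sum(counts.values())
        scores[topic] = {
            word: (count / n_words) * math.log(1 + avg_words / total_counts[word])
            for word, count in counts.items()
        }
    return scores


def top_words(scores, topic, n=3):
    """Return the n highest-scoring words of a topic (ties broken alphabetically)."""
    ranked = sorted(scores[topic].items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:n]]


# Tiny made-up corpus, already grouped by topic (illustrative data only).
toy = {
    0: ["the team won the game", "the game was close"],
    1: ["nasa launched the rocket", "the space mission"],
}
toy_scores = c_tf_idf(toy)
print(top_words(toy_scores, 0, n=1))
```

In this toy corpus the word "the" appears in both classes, so its score is damped relative to words that are concentrated in a single class, which is the effect c-TF-IDF is meant to produce.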

## Install

In [None]:
!pip install bertopic

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting bertopic
  Downloading bertopic-0.10.0-py2.py3-none-any.whl (58 kB)
Collecting hdbscan>=0.8.28
  Downloading hdbscan-0.8.28.tar.gz (5.2 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Collecting sentence-transformers>=0.4.1
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
Collecting umap-learn>=0.5.0
  Downloading umap-learn-0.5.3.tar.gz (88 kB)
Collecting transformers<5.0.0,>=4.6.0
  Downloading transformers-4.20.1-py3-none-any.whl (4.4 MB)
...

## Model

In [None]:
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))['data']


# Various Transformer models can be selected as the embedding model
#topic_model = BERTopic(embedding_model="all-MiniLM-L6-v2").fit(docs)
topic_model = BERTopic(embedding_model="xlm-r-bert-base-nli-stsb-mean-tokens").fit(docs)


# Using a SentenceTransformer object for the embeddings
'''
from sentence_transformers import SentenceTransformer

sentence_model = SentenceTransformer("xlm-r-bert-base-nli-stsb-mean-tokens", device="cuda")
topic_model = BERTopic(embedding_model=sentence_model).fit(docs)
'''

# Using other publicly available models for the embeddings (via Flair)
'''
from flair.embeddings import TransformerDocumentEmbeddings

roberta = TransformerDocumentEmbeddings('roberta-base')
topic_model = BERTopic(embedding_model=roberta).fit(docs)
'''

# Using the Universal Sentence Encoder
'''
import tensorflow_hub
embedding_model = tensorflow_hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
topic_model = BERTopic(verbose=True, embedding_model=embedding_model).fit(docs)
'''


# Using different models for document embeddings and word embeddings
'''
from bertopic.backend import WordDocEmbedder
import gensim.downloader as api
from sentence_transformers import SentenceTransformer

# Word embeddings
ft = api.load('fasttext-wiki-news-subwords-300')

# Document embeddings
distilbert = SentenceTransformer("distilbert-base-nli-stsb-mean-tokens")

# Build the model
word_doc_embedder = WordDocEmbedder(embedding_model=distilbert, word_embedding_model=ft)
topic_model = BERTopic(verbose=True, embedding_model=word_doc_embedder).fit(docs)
'''

In [None]:
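# Note: fit_transform() trains the model again, so the .fit(docs) call in the previous cell is redundant
topics, probs = topic_model.fit_transform(docs)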

In [None]:
probs

array([1.        , 0.        , 0.32331681, ..., 0.84741065, 1.        ,
       1.        ])
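
With the default settings, `fit_transform` returns one topic id per document in `topics` and one score per document in `probs`, which can be read as the confidence of that document's assigned topic. Documents labelled `-1` are treated as outliers. The sketch below shows one way these two outputs might be summarized. It uses short made-up lists in place of the real outputs, and the `0.5` threshold is an arbitrary assumption.

```python
from collections import Counter

# Made-up stand-ins for the outputs of fit_transform (illustrative values only).
topics = [0, -1, 2, 0, 1, -1, 0]
probs = [1.0, 0.0, 0.32, 0.85, 1.0, 0.0, 0.47]

# Number of documents assigned to each topic id.
topic_sizes = Counter(topics)

# Share of documents that were not assigned to any topic (topic id -1).
outlier_share = topic_sizes[-1] / len(topics)

# Indices of documents assigned to a real topic with a score of at least 0.5.
threshold = 0.5  # arbitrary cut-off chosen for this example
confident_docs = [
    i for i, (topic, prob) in enumerate(zip(topics, probs))
    if topic != -1 and prob >= threshold
]

print(topic_sizes.most_common())
print(f"outlier share: {outlier_share:.2f}")
print(confident_docs)
```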

In [None]:
topic_model.get_topic_info()

|  | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 8808 | -1_the_to_and_for |
| 1 | 0 | 1601 | 0_game_he_team_the |
| 2 | 1 | 1258 | 1_god_that_of_is |
| 3 | 2 | 748 | 2_space_nasa_the_of |
| 4 | 3 | 706 | 3_of_patients_is_and |
| ... | ... | ... | ... |
| 101 | 100 | 11 | 100_power_plug_thewho_phone |
| 102 | 101 | 11 | 101_uv_flashlight_bulb_light |
| 103 | 102 | 10 | 102_sweden_usa_switzerland_april |
| 104 | 103 | 10 | 103_output_oname_printf_file |
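
Judging from the output above, each `Name` appears to be the topic id followed by the topic's four top-ranked words, joined with underscores, and topic `-1` collects the outlier documents. The following sketch reproduces that naming pattern and parses it back. The helper names `make_topic_name` and `split_topic_name` are hypothetical and are not part of BERTopic.

```python
def make_topic_name(topic_id, ranked_words, n_words=4):
    """Join a topic id and its first n_words ranked words with underscores."""
    return "_".join([str(topic_id)] + list(ranked_words[:n_words]))


def split_topic_name(name):
    """Split a name such as '2_space_nasa_the_of' into (topic id, list of words)."""
    # Split on the first underscore only, so that the id "-1" is kept intact.
    topic_id, _, words = name.partition("_")
    return int(topic_id), words.split("_")


print(make_topic_name(2, ["space", "nasa", "the", "of", "launch"]))
print(split_topic_name("-1_the_to_and_for"))
```

Many of the names above are dominated by stop words such as "the" and "of". One option worth trying is to pass a `CountVectorizer(stop_words="english")` to BERTopic through its `vectorizer_model` parameter, so that such words are dropped when the topic representations are computed.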
