<a href="https://colab.research.google.com/github/cbadenes/notebooks/blob/main/probabilistic_topic_models/semi_BERTOPIC.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Notebook based on code retrieved from [MaartenGr/BERTopic](https://github.com/MaartenGr/BERTopic)

Remember to enable the GPU via `Runtime > Change runtime type > Hardware accelerator (GPU)`.

In [None]:
!pip install bertopic

We use the popular [20 Newsgroups](http://qwone.com/~jason/20Newsgroups/) dataset, which contains roughly 18,000 newsgroup posts, each assigned to one of 20 categories. Using this dataset, we can extract a topic model while taking the underlying categories into account.

In [None]:
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

data = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))
docs = data["data"]
categories = data["target"]
category_names = data["target_names"]
classes = [data["target_names"][i] for i in data["target"]]
for idx, val in enumerate(category_names):
    print(idx, val)

Each document belongs to one of the categories listed above:

In [None]:
print("Document:",docs[0])
print("Category:",categories[0] )
print("Label:",category_names[categories[0]])



For this example, imagine we only use the labels of categories that are related to computers and we want to create a topic model using semi-supervised modeling:

In [None]:
labels_to_add = ['comp.graphics', 'comp.os.ms-windows.misc',
              'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware',
              'comp.windows.x',]
indices = [category_names.index(label) for label in labels_to_add]
new_categories = [label if label in indices else -1 for label in categories]
print(new_categories[:10],"..")

`new_categories` contains many `-1` values: one for every document whose category we treat as unknown.
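
The masking step can be illustrated on a toy example. This is a minimal sketch with made-up names and targets (`toy_names`, `toy_targets`, and `known` are hypothetical stand-ins, not part of the dataset); it only shows how known labels are kept and every other label becomes `-1`:

```python
# Toy stand-ins for data["target_names"] and data["target"] (hypothetical values).
toy_names = ["comp.graphics", "rec.autos", "sci.space"]
toy_targets = [0, 1, 2, 0, 1]

# Pretend that only the 'comp.graphics' labels are known.
known = {toy_names.index("comp.graphics")}

# Keep known labels; replace everything else with -1 ("no label").
masked = [t if t in known else -1 for t in toy_targets]
print(masked)  # [0, -1, -1, 0, -1]
```

Using a set for the known indices keeps the membership test cheap when the label list is long.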

In [None]:
print("Document:",docs[0])
print("Category:",new_categories[0] )

Next, we use those newly constructed labels to create a semi-supervised topic model:

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer_model = CountVectorizer(stop_words="english", 
                                   min_df=10)

topic_model = BERTopic(calculate_probabilities=True, vectorizer_model=vectorizer_model, low_memory=True, verbose=True)
# Keep the probabilities as well; they are needed later by visualize_distribution
topics, probs = topic_model.fit_transform(docs, y=new_categories)

## Extracting Topics

After fitting our model, we can start by looking at the results. Typically, we look at the most frequent topics first as they best represent the collection of documents.

In [None]:
topic_model.get_topic_info().head(10)

Topic -1 collects all outliers and should typically be ignored.
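
As a rough illustration of what "most frequent topics, ignoring outliers" means, here is a minimal sketch on made-up topic assignments (`toy_topics` is hypothetical and not produced by the model above). It counts documents per topic and sets the `-1` bucket aside:

```python
from collections import Counter

# Hypothetical per-document topic assignments; -1 marks outliers.
toy_topics = [0, 1, -1, 0, 2, -1, 0, 1, -1, -1]

counts = Counter(toy_topics)
n_outliers = counts.pop(-1, 0)  # set the outlier bucket aside

# Remaining topics, most frequent first.
ranked = counts.most_common()
print(ranked)      # [(0, 3), (1, 2), (2, 1)]
print(n_outliers)  # 4
```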

The topics that were created mostly make sense. There are some clearly defined topics, but also some that seem mostly derived from other topics. We can visualize this by extracting the topic representations per class and checking how closely the topics found by our model match the original classes.

NOTE: You can hover over the bars to see the representation per class.

In [None]:
topics_per_class = topic_model.topics_per_class(docs, topics, classes=classes)
fig_unsupervised = topic_model.visualize_topics_per_class(topics_per_class, top_n_topics=10)
fig_unsupervised

NOTE: BERTopic is stochastic, which means that the topics might differ across runs. This is mostly due to the stochastic nature of UMAP.



## Visualization

### Visualize Topics

After training our BERTopic model, we could go through perhaps a hundred topics one by one to get a good understanding of what was extracted. However, that takes quite some time and lacks a global view. Instead, we can visualize the generated topics in a way very similar to LDAvis:

In [None]:
topic_model.visualize_topics()

### Visualize Topic Probabilities

The probabilities returned by `transform()` or `fit_transform()` can be used to understand how confident BERTopic is that certain topics can be found in a document.
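
To get a feel for what such a probability vector looks like before plotting it, the following minimal sketch ranks the topics of a single document and drops the unlikely ones. The values in `toy_probs` are made up, and the sketch assumes that `min_probability` acts as a simple lower cutoff:

```python
# Hypothetical topic-probability vector for one document (index = topic id).
toy_probs = [0.002, 0.410, 0.010, 0.250, 0.030]
min_probability = 0.015

# Keep topics at or above the threshold, most probable first.
kept = sorted(
    ((topic, p) for topic, p in enumerate(toy_probs) if p >= min_probability),
    key=lambda pair: pair[1],
    reverse=True,
)
print(kept)  # [(1, 0.41), (3, 0.25), (4, 0.03)]
```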

To visualize the distributions, we simply call:

In [None]:
topic_model.visualize_distribution(probs[200], min_probability=0.015)

### Visualize Topic Hierarchy

The topics that were created can be hierarchically reduced. To understand the potential hierarchical structure of the topics, we can use `scipy.cluster.hierarchy` to create clusters and visualize how they relate to one another. This can help in selecting an appropriate `nr_topics` when reducing the number of topics that you have created.

In [None]:
topic_model.visualize_hierarchy(top_n_topics=50)

### Visualize Terms

We can visualize the selected terms for a few topics by creating bar charts out of the c-TF-IDF scores for each topic representation. Insights can be gained from the relative c-TF-IDF scores between and within topics. Moreover, you can easily compare topic representations to each other.
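
For intuition about where these scores come from, here is a minimal pure-Python sketch of a class-based TF-IDF. It assumes the formulation tf × log(1 + A / f), where tf is the normalized frequency of a term within a topic, A is the average number of words per topic, and f is the frequency of the term across all topics. The toy corpus and the helper `ctfidf` are hypothetical, and the sketch is not BERTopic's actual implementation:

```python
import math
from collections import Counter

# Hypothetical topics, each represented by the pooled words of its documents.
topic_words = {
    0: "gpu driver card gpu monitor".split(),
    1: "orbit launch nasa orbit card".split(),
}

tf = {topic: Counter(words) for topic, words in topic_words.items()}

total = Counter()
for counts in tf.values():
    total.update(counts)  # f: frequency of each term over all topics

avg_words = sum(total.values()) / len(tf)  # A: average number of words per topic


def ctfidf(topic, term):
    """Normalized term frequency in the topic times log(1 + A / f)."""
    term_freq = tf[topic][term] / sum(tf[topic].values())
    return term_freq * math.log(1 + avg_words / total[term])


for topic in tf:
    scores = {term: ctfidf(topic, term) for term in tf[topic]}
    best = max(scores, key=scores.get)
    print(topic, best, round(scores[best], 3))
```

In this toy corpus, a term shared by both topics ("card") scores lower than a term of equal frequency that occurs in only one topic ("driver"), which is the behavior the bar charts rely on.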

In [None]:
topic_model.visualize_barchart(top_n_topics=5)