# Analyzing topics with BERTopic

[*NB: This notebook is adapted directly from the documentation for the package ```BERTopic``` which we're working with today. You can find more at the website [here](https://maartengr.github.io/BERTopic/index.html) and at the [Github repo](https://github.com/MaartenGr/BERTopic)*]

In this tutorial we will explore how to use BERTopic to create topics from text data scraped from the AU domain and from Twitter. We cover the most common use cases and methods, along with the parameters that matter most.


## BERTopic
BERTopic is a topic modelling technique that creates dense clusters allowing for easily interpretable topics whilst keeping important words in the topic descriptions. 

We'll start by installing all of the packages we need, and then load the dataset that we're interested in.

This next step might take a little bit of time when you first run it.

In [None]:
!bash setup.sh

In [None]:
import os
import pandas as pd

# select a different file here
filename = "/work/795173/deduped/au.dk_deduped.ndjson"
# load the newline-delimited JSON file (one record per line) into a DataFrame
docs = pd.read_json(filename, lines=True)

In [None]:
docs.head()
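
The input file is newline-delimited JSON (NDJSON), meaning each line holds one complete JSON object. As a rough illustration of what the loader has to do, the sketch below parses two made-up records with only the standard library. The field names `title` and `text` and the sample values are assumptions for illustration; check `docs.columns` to see which fields your file really has.

```python
import io
import json

# Two made-up NDJSON records; the "title" and "text" fields are assumptions.
sample = io.StringIO(
    '{"title": "Forside", "text": "Velkommen til universitetet"}\n'
    '{"title": "Forskning", "text": "Nyt om forskning"}\n'
)

# Parse one JSON object per non-empty line.
records = [json.loads(line) for line in sample if line.strip()]

# Keep only the text of each record, which is what a topic model consumes.
texts = [record["text"] for record in records]
print(texts)
```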

# **Topic Modeling**

In this example, we will go through the main components of BERTopic and the steps necessary to create a strong topic model. 




## Training

We start by instantiating BERTopic. We set language to `multilingual` since our documents are primarily Danish. If you would like to use an English language model, please use `language="english"` instead. 

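We will also calculate the topic probabilities, that is, the probability of every topic for every document. This can slow BERTopic down significantly on large collections (>100_000 documents), so you might want to turn it off to speed up the model. With it turned off, `fit_transform()` returns only the probability of each document's assigned topic, so the per-document distribution plot further down will not be available.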


In [None]:
from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer

# get rid of most frequent words
ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)

# initialize the model
topic_model = BERTopic(language="multilingual", 
                       calculate_probabilities=True, 
                       verbose=True,
                       ctfidf_model=ctfidf_model)

# train the topic model on our docs (passed as a plain list of strings)
topics, probs = topic_model.fit_transform(docs["text"].tolist())

## Extracting Topics
After fitting our model, we can start by looking at the results. Typically, we look at the most frequent topics first as they best represent the collection of documents. 

In [None]:
freq = topic_model.get_topic_info()
freq.head(5)

Topic -1 collects all outliers and should typically be ignored. Next, let's take a look at one of the frequent topics that was generated:

In [None]:
topic_model.get_topic(0)  # Select the most frequent topic
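
Because topic `-1` collects outliers, it can help to drop it before ranking topics by size. The sketch below does this on a short made-up list of topic ids; in the notebook you would pass the `topics` list returned by `fit_transform()` instead. The helper name `largest_topics` is hypothetical.

```python
from collections import Counter


def largest_topics(topic_ids, n=5):
    """Return the n most common topic ids with their counts, skipping outliers (-1)."""
    counts = Counter(t for t in topic_ids if t != -1)
    return counts.most_common(n)


# Made-up assignments for ten documents; -1 marks outliers.
example_topics = [0, 0, -1, 1, 0, 2, -1, 1, -1, -1]
print(largest_topics(example_topics, n=2))
```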

# **Visualization**
There are several visualization options available in BERTopic, namely the visualization of topics, probabilities and topics over time. Topic modeling is, to a certain extent, quite subjective. Visualizations help understand the topics that were created. 

## Visualize Topics
After having trained our `BERTopic` model, we could go through perhaps a hundred topics one by one to get a good 
understanding of the topics that were extracted. However, that takes quite some time and lacks a global representation. 
Instead, we can visualize the topics that were generated in a way very similar to 
[LDAvis](https://github.com/cpsievert/LDAvis):

In [None]:
topic_model.visualize_topics()

## Visualize Topic Probabilities

The variable `probs` that is returned from `fit_transform()` (or from `transform()`) can 
be used to understand how confident BERTopic is that certain topics can be found in a document. 

In [None]:
# document index to inspect
doc_id = 49
print(docs.iloc[doc_id])
print(f"Primary topic id: {topic_model.topics_[doc_id]}")

And we can get the "keywords" for a specific topic like this:

In [None]:
topic_model.get_topic(0)

To visualize the distribution of all topics for this specific document, we simply call:

In [None]:
topic_model.visualize_distribution(probs[doc_id], 
                                   min_probability=0.015)
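
The same thresholding can be done by hand when you want numbers rather than a plot. The sketch below assumes that position `i` in a document's probability row corresponds to topic `i` and that the outlier topic has no column of its own; verify this against your BERTopic version. The helper `top_topics_for_doc` and the example row are made up for illustration.

```python
def top_topics_for_doc(prob_row, min_probability=0.015, k=3):
    """Return up to k (topic_id, probability) pairs at or above the threshold."""
    # Assumption: index i of prob_row holds the probability of topic i.
    candidates = [
        (topic_id, float(p))
        for topic_id, p in enumerate(prob_row)
        if p >= min_probability
    ]
    # Highest probability first.
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return candidates[:k]


# Made-up probability row for one document over four topics.
example_row = [0.01, 0.62, 0.05, 0.20]
print(top_topics_for_doc(example_row))
```

In the notebook, this would be called as `top_topics_for_doc(probs[doc_id])`.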

## Visualize Topic Hierarchy

The topics that were created can be hierarchically reduced. To understand the potential hierarchical structure of the topics, we can use `scipy.cluster.hierarchy` to create clusters and visualize how they relate to one another. This can help when selecting an appropriate `nr_topics` for reducing the number of topics that you have created.

In [None]:
topic_model.visualize_hierarchy()

## Visualize Terms

We can visualize the selected terms for a few topics by creating bar charts out of the c-TF-IDF scores for each topic representation. Insights can be gained from the relative c-TF-IDF scores between and within topics. Moreover, you can easily compare topic representations to each other.

In [None]:
topic_model.visualize_barchart(top_n_topics=8)
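
To make the scores behind these bars more concrete, here is a simplified sketch of c-TF-IDF as described in the BERTopic documentation: the frequency of a term within a topic is weighted by `log(1 + A / f)`, where `A` is the average number of words per topic and `f` is the term's frequency across all topics. The sketch omits normalization and options such as `reduce_frequent_words`, so its numbers will not match what the library reports. The function name and the toy word lists are hypothetical.

```python
import math
from collections import Counter


def ctfidf(topic_words):
    """Score each term per topic as tf * log(1 + A / f)."""
    # tf: term frequency within each topic.
    tf = {topic: Counter(words) for topic, words in topic_words.items()}
    # f: frequency of each term across all topics.
    total = Counter()
    for counts in tf.values():
        total.update(counts)
    # A: average number of words per topic.
    avg_words = sum(total.values()) / len(tf)
    return {
        topic: {
            term: count * math.log(1 + avg_words / total[term])
            for term, count in counts.items()
        }
        for topic, counts in tf.items()
    }


# Toy topics: all documents of a topic joined into one list of words.
toy = {
    0: ["forskning", "projekt", "forskning", "og"],
    1: ["studie", "eksamen", "studie", "og"],
}
scores = ctfidf(toy)
# Rank the terms of topic 0 from most to least distinctive.
print(sorted(scores[0], key=scores[0].get, reverse=True))
```

The word "og" appears in both toy topics, so it ends up with the lowest score in each of them.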

## Visualize Topic Similarity
Once topic embeddings have been generated, through both c-TF-IDF and embeddings, we can create a similarity matrix by applying cosine similarity to those topic embeddings. The result is a matrix indicating how similar certain topics are to each other.

In [None]:
topic_model.visualize_heatmap(width=1000, 
                              height=1000)
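
As a rough sketch of what sits behind the heatmap, the snippet below builds a cosine similarity matrix for three made-up topic vectors using only the standard library. The vectors and the function names are hypothetical; BERTopic computes the matrix from its own topic embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_matrix(vectors):
    """Pairwise cosine similarities as a list of rows."""
    return [[cosine(u, v) for v in vectors] for u in vectors]


# Made-up 3-dimensional "topic embeddings".
topic_vectors = [
    [1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
matrix = similarity_matrix(topic_vectors)
for row in matrix:
    print([round(value, 2) for value in row])
```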

# **Search Topics**
After having trained our model, we can use `find_topics` to search for topics that are similar to an input search term. Here, we search for topics that closely relate to the search term "forskning" (Danish for "research"). Then, we extract the most similar topics and check the results:

In [None]:
similar_topics, similarity = topic_model.find_topics("forskning", top_n=5)
print(similar_topics)
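
The returned ids are easier to judge next to their keywords. The sketch below pairs each match with its top words via `get_topic()`, which returns `(word, score)` tuples. The helper name `describe_matches` is hypothetical, and it assumes `find_topics` returned two parallel lists, as in the cell above.

```python
def describe_matches(model, topic_ids, scores, n_words=5):
    """Return one summary line per matched topic."""
    lines = []
    for topic_id, score in zip(topic_ids, scores):
        # get_topic() yields (word, weight) pairs for the given topic id.
        words = [word for word, _ in model.get_topic(topic_id)[:n_words]]
        lines.append(
            f"topic {topic_id} (similarity {score:.2f}): {', '.join(words)}"
        )
    return lines
```

In the notebook, this could be used as `print("\n".join(describe_matches(topic_model, similar_topics, similarity)))`.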