# BERTopic

BERTopic is a topic modeling technique that leverages transformer embeddings and c-TF-IDF to create dense clusters, allowing for easily interpretable topics whilst keeping important words in the topic descriptions.

![model-layers](./model-layers.png)

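The figure above shows the default pipeline, whose stages run in sequence: sentence-transformers (document embeddings), UMAP (dimensionality reduction), HDBSCAN (clustering), and c-TF-IDF (topic representation).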
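
As an illustration of the last stage, the following is a minimal sketch of the c-TF-IDF weighting as described in the BERTopic documentation: each term's frequency within a topic is multiplied by `log(1 + A / f)`, where `A` is the average number of words per topic and `f` is the term's frequency across all topics. The whitespace tokenizer, the function name `c_tf_idf`, and the toy clusters are assumptions made for this example; this is not the library's implementation.

```python
import math
from collections import Counter


def c_tf_idf(docs_per_topic):
    """Score terms per topic as tf * log(1 + A / f).

    tf = term frequency inside the topic (L1-normalized),
    f  = term frequency summed over all topics,
    A  = average number of words per topic.
    """
    # Treat all documents of a topic as one bag of words (naive tokenizer).
    counts = {
        topic: Counter(" ".join(docs).lower().split())
        for topic, docs in docs_per_topic.items()
    }
    total = Counter()
    for bag in counts.values():
        total.update(bag)
    avg_words = sum(total.values()) / len(counts)

    scores = {}
    for topic, bag in counts.items():
        n_words = sum(bag.values())
        scores[topic] = {
            term: (tf / n_words) * math.log(1 + avg_words / total[term])
            for term, tf in bag.items()
        }
    return scores


# Hypothetical toy clusters standing in for the HDBSCAN output.
toy_clusters = {
    0: ["stocks fall as markets slide", "markets rally on earnings"],
    1: ["team wins the cup final", "coach praises team after final"],
}
scores = c_tf_idf(toy_clusters)
top_terms = {
    topic: sorted(term_scores, key=term_scores.get, reverse=True)[:2]
    for topic, term_scores in scores.items()
}
print(top_terms)
```

Terms that are frequent inside one topic but rare elsewhere receive the highest scores, which is why they end up in the topic descriptions.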

In [1]:
# ref : https://maartengr.github.io/BERTopic/getting_started/visualization/visualization.html#visualize-terms

import os
import sys
import torch
import numpy as np
import pandas as pd
import torch.nn as nn
import plotly.express as px
import plotly.graph_objects as go
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler
from sentence_transformers import SentenceTransformer, models
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer
from bertopic import BERTopic
from umap import UMAP



In [2]:
sys.path.append(os.path.join("..", "util"))
from NewsProcessor import NewsProcessor

In [3]:
BASELINE_MODEL_NAME = 'sentence-transformers/multi-qa-mpnet-base-dot-v1' # name of the Sentence-Transformer model for baseline

In [4]:
PATH_NEWS_CODE = os.path.join('..', 'data', 'news_code.csv') # path to the news code file
PATH_NEWS_TEST = os.path.join('..', 'data', 'news_test') # path to the news file
PATH_MODEL = os.path.join('..', 'model') # path to the local model
N_SUBJECTS = -1 # use -1 to get all subjects
DATA_SIZE = 3000 # size of the dataset to use

In [5]:
SEED = 1234

torch.manual_seed(SEED)
np.random.seed(SEED)
torch.backends.cudnn.deterministic = True
torch.cuda.manual_seed_all(SEED)

# Load data

In [6]:
# initialize the NewsProcessor
processor = NewsProcessor()

# process news data
processor.process(PATH_NEWS_CODE, PATH_NEWS_TEST) 

# get only the top subjects per news (if N_SUBJECTS != -1)
if N_SUBJECTS != -1:
    processor.select_top_subjects_per_news(n=N_SUBJECTS)

# get news data
news = processor.get_news()

# add encoded subjects to news
news['subjects_encoded'] = processor.encode_subjects().tolist()

# get random sample of news for evaluation
news = news.sample(n=DATA_SIZE, random_state=SEED)

Preprocessing news data...
Size before filtering: (359406, 18).
Size after filtering: (184284, 5).
Preprocessing complete.


In [7]:
# get only headline and rename to data
docs_news = news[['headline']].rename(columns={'headline': 'data'})
# convert to list
docs_news = docs_news['data'].tolist()

print(f"Number of news: {len(docs_news)}")

Number of news: 3000


# BERTopic

## Topic visualization

Visualize the topics and get insight into their relationships.

In [8]:
# Prepare embeddings
sentence_model = SentenceTransformer(PATH_MODEL) # use fine-tuned model
embeddings = sentence_model.encode(docs_news, show_progress_bar=False)

# Train BERTopic
topic_model = BERTopic().fit(docs_news, embeddings)

# Reduce the embeddings to 2D once up front. This step is optional, but it avoids re-running UMAP on every visualization call:
reduced_embeddings = UMAP(n_neighbors=10, n_components=2, min_dist=0.3, metric='cosine', random_state=42).fit_transform(embeddings)

# Visualize topics
topic_model.visualize_documents(docs_news, reduced_embeddings=reduced_embeddings)

## Hierarchical topic reduction

Extend the previous method by calculating the topic representation at different levels of the hierarchy (by merging similar topics).

In [None]:
hierarchical_topics = topic_model.hierarchical_topics(docs_news)

# Visualize hierarchical topics
topic_model.visualize_hierarchical_documents(docs_news, hierarchical_topics, reduced_embeddings=reduced_embeddings)
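
To make "merging similar topics" concrete, here is a minimal sketch of a single greedy merge step on a topic-term count matrix. It is only an illustration under stated assumptions: the function name `merge_closest_topics` and the toy counts are hypothetical, and the library builds a full hierarchy rather than one merge. After a merge, the topic representation would be recomputed from the summed counts.

```python
import numpy as np


def merge_closest_topics(term_counts):
    """Merge the two most similar topics (rows) by cosine similarity.

    Returns the merged pair of row indices and a new matrix in which the
    two rows are replaced by one row holding their summed term counts.
    """
    unit = term_counts / np.linalg.norm(term_counts, axis=1, keepdims=True)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)  # a topic must not be merged with itself
    i, j = np.unravel_index(np.argmax(sim), sim.shape)
    merged_row = term_counts[i] + term_counts[j]
    remaining = np.delete(term_counts, [i, j], axis=0)
    return (int(i), int(j)), np.vstack([remaining, merged_row])


# Hypothetical counts: 3 topics (rows) over a 4-term vocabulary (columns).
toy_counts = np.array([
    [3, 1, 0, 0],
    [2, 2, 0, 0],
    [0, 0, 4, 1],
])
pair, merged_counts = merge_closest_topics(toy_counts)
print(pair)
print(merged_counts)
```

Repeating this step until one topic remains yields a hierarchy of progressively coarser topics.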

In [10]:

# topic_model.visualize_hierarchical_documents(docs_news, hierarchical_topics, embeddings=embeddings)

# Baseline comparison

Compare the performance of the trained S-BERT model with the baseline model. 

The baseline model is the same pre-trained S-BERT model, but without fine-tuning on the news dataset.

In [11]:
# use baseline model for comparison purposes
# max_seq_length: maximum number of tokens per input; longer inputs are truncated
word_embedding_model = models.Transformer(BASELINE_MODEL_NAME, max_seq_length=256)
# pooling model for sentence embedding
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
dense_model = models.Dense(
    # dimension of the input vector
    in_features=pooling_model.get_sentence_embedding_dimension(),
    # dimension of the output vector
    out_features=256,
    # activation function (tanh, relu, ...)
    activation_function=nn.Tanh())

sentence_model_baseline = SentenceTransformer(modules=[word_embedding_model, pooling_model, dense_model])
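
The pooling module in the cell above is created with default arguments, which in sentence-transformers means mean pooling: token embeddings are averaged into one sentence vector while padded positions are ignored. The sketch below reproduces that idea in NumPy; the function name `mean_pool` and the toy tensors are assumptions for illustration only.

```python
import numpy as np


def mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sentence, skipping padded tokens.

    token_embeddings: array of shape (batch, tokens, dim)
    attention_mask:   array of shape (batch, tokens), 1 = real token, 0 = padding
    """
    mask = attention_mask[..., None].astype(float)    # (batch, tokens, 1)
    summed = (token_embeddings * mask).sum(axis=1)    # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)    # guard against empty rows
    return summed / counts


# Hypothetical batch: 2 sentences, 3 token slots, 2-dimensional embeddings.
tokens = np.array([
    [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]],  # last slot is padding
    [[2.0, 0.0], [4.0, 2.0], [6.0, 4.0]],
])
mask = np.array([[1, 1, 0], [1, 1, 1]])
sentence_vectors = mean_pool(tokens, mask)
print(sentence_vectors)
```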

In [None]:
# Prepare embeddings
embeddings_baseline = sentence_model_baseline.encode(docs_news, show_progress_bar=False)

# Train BERTopic
topic_model_baseline = BERTopic().fit(docs_news, embeddings_baseline)

# Reduce the embeddings to 2D once up front. This step is optional, but it avoids re-running UMAP on every visualization call:
reduced_embeddings_baseline = UMAP(n_neighbors=10, n_components=2, min_dist=0.3, metric='cosine', random_state=42).fit_transform(embeddings_baseline)

# Visualize topics
topic_model_baseline.visualize_documents(docs_news, reduced_embeddings=reduced_embeddings_baseline)

In [None]:
hierarchical_topics_baseline = topic_model_baseline.hierarchical_topics(docs_news)

# Visualize hierarchical topics
topic_model_baseline.visualize_hierarchical_documents(docs_news, hierarchical_topics_baseline, reduced_embeddings=reduced_embeddings_baseline)

In [14]:
# topic_model_baseline.visualize_hierarchical_documents(docs_news, hierarchical_topics_baseline, embeddings=embeddings_baseline)
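
The comparison above is visual. As a possible complement, the sketch below summarizes a list of topic assignments numerically. The chosen metrics and the helper name `summarize_topics` are assumptions, not part of this notebook's evaluation; the only convention relied on is that BERTopic assigns outlier documents to topic `-1`.

```python
from collections import Counter


def summarize_topics(topics):
    """Return topic count, outlier share, and largest topic size.

    Assumes topic id -1 collects outlier documents.
    """
    sizes = Counter(topics)
    n_outliers = sizes.pop(-1, 0)
    return {
        "n_topics": len(sizes),
        "outlier_share": n_outliers / len(topics) if topics else 0.0,
        "largest_topic_size": max(sizes.values(), default=0),
    }


# Hypothetical assignments; in this notebook they would come from
# topic_model.topics_ and topic_model_baseline.topics_.
fine_tuned_topics = [0, 0, 1, 1, 2, -1, 0, 1]
baseline_topics = [0, -1, -1, 1, 0, -1, 0, -1]

summary_fine_tuned = summarize_topics(fine_tuned_topics)
summary_baseline = summarize_topics(baseline_topics)
print(summary_fine_tuned)
print(summary_baseline)
```

A lower outlier share for one model only indicates that more documents were assigned to a cluster; it does not by itself show that the topics are better.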

# Other visualizations

## Topic Hierarchy

Visualize the hierarchy of topics and how they have been reduced.

In [15]:
hierarchical_topics = topic_model.hierarchical_topics(docs_news)
topic_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics)

100%|██████████| 76/76 [00:00<00:00, 442.93it/s]


## Intertopic Distance 

Visualize the distance between topics and how they relate to each other.

In [16]:
topic_model.visualize_topics()

## Visualize Terms

Visualize the top-scoring terms for a few topics (terms are ranked by their c-TF-IDF scores).

In [17]:
topic_model.visualize_barchart(top_n_topics=10)

## Topic Similarity

Visualize a matrix indicating how similar topics are to each other, computed as the cosine similarity between topic embeddings.

In [None]:
topic_model.visualize_heatmap(n_clusters=20, width=1000, height=1000)
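
The similarity matrix behind such a heatmap can be sketched in a few lines of NumPy: normalize each topic embedding to unit length, then take all pairwise dot products. The function name `cosine_similarity_matrix` and the toy embeddings are assumptions for illustration.

```python
import numpy as np


def cosine_similarity_matrix(topic_embeddings):
    """Pairwise cosine similarity between the rows of topic_embeddings."""
    norms = np.linalg.norm(topic_embeddings, axis=1, keepdims=True)
    unit = topic_embeddings / norms
    return unit @ unit.T


# Hypothetical embeddings: topics 0 and 1 point in the same direction,
# topic 2 is orthogonal to both.
toy_embeddings = np.array([
    [1.0, 0.0],
    [2.0, 0.0],
    [0.0, 3.0],
])
similarity = cosine_similarity_matrix(toy_embeddings)
print(similarity)
```

Because the vectors are normalized first, the result depends only on direction, so topics 0 and 1 are maximally similar despite their different magnitudes.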