<a href="https://colab.research.google.com/github/drob-xx/Tune_BERTopic_HDBSCAN/blob/main/BERTune_UMAP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Introduction**

> What is the relationship between [HDBSCAN](https://hdbscan.readthedocs.io/en/latest/index.html) settings and topic cluster creation in [BERTopic](https://maartengr.github.io/BERTopic/index.html)? This is the basic question addressed below. In modeling a straightforward, homogeneous corpus of 30,000 English-language news articles, this notebook explains and demonstrates dramatic improvements in the resulting topic model. While the specific settings for this corpus will necessarily differ from those for another corpus, the hope is that the discrete, reproducible steps demonstrated here can be used to solve a wide range of similar issues with many different corpora.

## **The Corpus**

> The [documents used here](https://www.kaggle.com/datasets/danrobinson707/newsdf) are from a [larger publicly available dataset](https://www.kaggle.com/datasets/harishcscode/all-news-articles-from-home-page-media-house) on Kaggle. The articles are a collection of news stories from a handful of major English-language news publications. The predominant sources seem to be from the U.S., England and Australia. There is a mix of human interest, politics, science, medical, sports, entertainment and other typical general-audience subjects. The vast majority of articles are 500-1500 words in length. There is a long tail of article lengths, but only 0.84 percent (252) are more than 3000 words long. A small handful of the articles are in Welsh, not English.


## **This Notebook**

> This notebook is divided into four parts. 

* **Setup** installs and imports the required packages, defines some utility functions, switches into the working directory, and reads two CSV files into DataFrames. These hold the documents used throughout.

* **BERT_ALL** creates a single BERTopic model from all 30,000 articles.

* **BERT_1** and **BERT_2** model two splits of the corpus used for **BERT_ALL**. The splits were created because experimenting with the parameters led the author to believe that a single parameterized HDBSCAN was insufficient for modeling this particular set of data. See below for details.

## **The Scatterplots**

> For each dataset there is a scatterplot showing a 2D TSNE reduction of the default 5D UMAP reduction that BERTopic applies to the underlying BERT embeddings. The scatterplots are a critical tool in evaluating the results of a particular configuration. Each datapoint represents one document and the hover text shows the first 400 characters of the modeled article. TSNE was used because it provides a somewhat more visually interpretable and coherent view of the data than UMAP provides in this case. The topic assignments are projected onto the scatterplot so that the user can easily see the spatial relationship between documents and categorizations.

> These visualizations invite close inspection, as they provide a clearer and possibly unique view of how the clustering affects document categorization.
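
> As a minimal sketch of how such hover text can be prepared, the snippet below truncates an article to its first 400 characters and breaks it into 60-character lines joined with `<br>`, which Plotly renders as line breaks in hover labels. It is a simplified variant of the `PrepBERTopicTblForPlotly` helper defined in Setup; the function name `wrap_hover_text` and the sample text are illustrative assumptions, not part of the original notebook.

```python
def wrap_hover_text(text, max_chars=400, line_width=60):
    """Truncate text to max_chars and split it into fixed-width lines joined by <br>."""
    snippet = text[:max_chars]
    lines = [snippet[i:i + line_width] for i in range(0, len(snippet), line_width)]
    return "<br>".join(lines)


# Hypothetical stand-in for one article: 500 characters of filler text.
article = "word " * 100
hover = wrap_hover_text(article)

# Number of lines the hover box would contain.
print(len(hover.split("<br>")))
```

> Fixed-width slicing can cut a word in half; that is acceptable for a preview whose only job is to hint at what a document is about.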

## **A Note On Randomness**

> Many of the underlying algorithms are stochastic in nature. During the preparation of this notebook, all of the techniques - whether amalgamated within BERTopic or run as independent components (UMAP, HDBSCAN, PacMAP, TSNE) - were run many times. While small differences were observed from run to run, they were minor and were not deemed relevant to the discussion at hand. The reader is encouraged to investigate on their own if they are curious about these issues.


## Setup

### Download Data
This notebook requires two csv files which can be downloaded at:

[News0.csv](https://www.kaggle.com/datasets/danrobinson707/news0csv)

[News1.csv](https://www.kaggle.com/datasets/danrobinson707/news1csv)

In [None]:
!pip install bertopic

In [None]:
from bertopic import BERTopic
# BERTopic installs both HDBSCAN and UMAP
from hdbscan import HDBSCAN

from sklearn.feature_extraction import text 
from sklearn.manifold import TSNE
from sklearn.feature_extraction.text import CountVectorizer

import pandas as pd
import pickle
import plotly.express as px

In [None]:
# Utility functions

def load(filepath):
      with open(filepath, 'rb') as fp:
          return pickle.load(fp)

def save(var, filepath):
      with open(filepath, 'wb') as fp:
          return pickle.dump(var, fp)

def PrepBERTopicTblForPlotly(targetTable, Text, BertModel, includeTopicText=True) :

  targetTable['topic'] = [str(top) for top in BertModel._map_predictions(BertModel.hdbscan_model.labels_)]
  
  # There is a bug in BERTopic when the -1 category is not the largest category
  sortedDF = BertModel.get_topic_info().sort_values(by=['Topic'])

  topicDict = {key : '{} ({})'.format(val, num) for  key, val, num in zip(sortedDF['Topic'].values, sortedDF['Name'].values, sortedDF['Count'].values)}  
  topicTexts = [topicDict[int(doctopic)] for doctopic in targetTable['topic'].values]
  targetTable['topic_text'] = topicTexts
  brtexts = []
  topicnums = targetTable['topic']
  for topicnum, texts in zip(topicnums, [txt[:400] for txt in Text]) :
    if includeTopicText :
      astr = '<br><br>' + topicDict[int(topicnum)] + '<br><br>'
    else :
      astr = '<br><br>'
    for idx in range(0, 400, 60) :
      astr += texts[idx:idx+60]
      astr += '<br>'
    brtexts.append(astr)         
  targetTable['text'] = brtexts

def getOrderedTopicTextFromBERT(BERTModel) :
  # Get around a BERTopic Bug
  sortedDF = BERTModel.get_topic_info().sort_values(by=['Topic'])

  return ['{} ({})'.format(txt, cnt) for txt, cnt in zip(sortedDF['Name'].values, sortedDF['Count'].values)]


In [None]:
# Change this to wherever you have downloaded the corpora
#     and/or want to store your intermediate models

cd /content/drive/MyDrive/Projects/BERTune_UMAP

/content/drive/MyDrive/Projects/BERTune_UMAP


In [None]:
News1DF = pd.read_csv('./News0.csv')
News2DF = pd.read_csv('./News1.csv')

## BERT_ALL

In [None]:
## Create a single corpus from the two - will become clearer below.

ALLDF = pd.concat([News1DF, News2DF])
ALLDF.reset_index(drop=True, inplace=True)

In [None]:
# This cell runs in about 8 mins. on a Colab+ with GPU and addt'l memory. 
#   BERTopic is memory intensive, so watch out for crashes.


# Adding these stop-words is cosmetic.
stop_words = text.ENGLISH_STOP_WORDS.union(['said', 'say', 'says', 'year', 'years', 'new', 'mr'])
vectorizer_model = CountVectorizer(ngram_range=(1, 3), stop_words=stop_words)

# Chose a min_topic_size of 150 to force a reasonable clustering. The whole point
#   of this exercise is to show that setting these values arbitrarily (or just
#   relying on defaults) will have a *huge* effect on the model. In this case 150
#   forces a reasonably 'natural' segmentation. What is meant here
#   by 'natural' and how this number was arrived at (through multiple iterations
#   of HDBSCAN params) is beyond the scope of this notebook.

BERT_ALL = BERTopic(
                  vectorizer_model=vectorizer_model,
                  calculate_probabilities=False,
                  verbose=True,
                  low_memory=True,
                  min_topic_size=150                  
                  )

# Set UMAPs random state so that UMAP output will be consistent across runs

BERT_ALL.umap_model.random_state=42

# Fit the model

BERT_ALL_Topics, _ = BERT_ALL.fit_transform(ALLDF['text'])

# Save if you wish - takes about 2 mins.
# BERT_ALL.save('./BERT_ALL')

In [None]:
# If you saved...

BERT_ALL = BERTopic.load('./BERT_ALL')

In [None]:
# We need to create a 2D representation for the scatterplot. Could have used
#    UMAP to do this - but partly because it would be a 2D reduction of a 
#    5D reduction of the original embeddings, it gets pretty visually sketchy.
#    The reader is encouraged to play with different algorithms to 
#    apprehend the differences. There is no 'right' way to do this. TSNE in this 
#    case was convenient and salutary for this purpose. Check out PacMAP as well.

BERT_ALL_TSNE = TSNE(n_components=2, learning_rate='auto',
                  init='random').fit_transform(BERT_ALL.umap_model.embedding_)

In [None]:
BERT_ALL_DF = pd.DataFrame()
BERT_ALL_DF['x'] = BERT_ALL_TSNE[:,0]
BERT_ALL_DF['y'] = BERT_ALL_TSNE[:,1]

PrepBERTopicTblForPlotly(BERT_ALL_DF, ALLDF['text'], BERT_ALL)
ordered_list = getOrderedTopicTextFromBERT(BERT_ALL)

## **Initial Results**

> In the scatter plot created below we see that the corpus has been broken up into 6 identified topics and one small set of outliers. Note that the relative positioning of the documents within these clusters is very cohesive. 

> The most important feature of this particular configuration is that topics 1-5 are sports-related. Within those, the largest set seems to be predominantly soccer with a fair number of rugby articles. The other groupings are somewhat mixed but overall are about American Football, Golf, Tennis, Car Racing and Boxing. While these sports groupings are by no means homogeneous, they are remarkably weighted towards a particular sport.

> The other, very large cluster can be thought of as "News". For the most part there are few sports stories in this grouping. Where they do intrude, it seems to be because there is overlap with other kinds of news, so you get Sports/Crime, Sports/Politics, Sports/Culture, etc. Far more interesting is that within this cluster it is easy to visually identify many sub-clusters that HDBSCAN, with the settings used, was presumably unable to identify as unique. Yet by zooming around the cluster and examining specific large and small visually clustered groupings, it is apparent that there is internal structure: articles about earthquakes, crime, sexual abuse, movies, internet companies, etc. are grouped around one another. While there *seem* to be some outliers, there are really very few. Furthermore, closer inspection of the entire article may reveal clues as to why these articles were positioned as they were.

> After concerted but failed attempts to find parameters that would result in a better sub-categorization of the "News" cluster, a guess about the limits of HDBSCAN was made: namely, that the geometry of this particular dataset means that no single set of HDBSCAN parameters can both preserve the basic Sports/News dichotomy and allow for a "rationalized" categorization of the News articles that corresponds to their seemingly coherent spatial positioning. Interestingly, other visualizations, namely 2D UMAP and PacMAP reductions of the embeddings, more clearly show a noticeable gap between the Sports and News categorizations.
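
> The notebook does not show how the corpus was divided, so the following is only a sketch of one way a DataFrame could be split once every document has a topic id. The toy documents, the topic ids, and the choice of topic `0` as the large "News" cluster are all assumptions made for illustration; where the `-1` outliers should go is likewise a judgment call.

```python
import pandas as pd

# Hypothetical stand-ins: a tiny corpus and the topic id assigned to each row.
docs = pd.DataFrame({'text': ['election result', 'cup final', 'earthquake',
                              'grand prix', 'film review']})
topics = [0, 1, 0, 3, 0]

# Assumption for illustration: topic 0 is the large 'News' cluster, and every
# other topic id (including -1 outliers, if any) goes to the other split.
NEWS_TOPIC = 0
is_news = pd.Series(topics, index=docs.index) == NEWS_TOPIC

news_df = docs[is_news].reset_index(drop=True)
sports_df = docs[~is_news].reset_index(drop=True)

print(len(news_df), len(sports_df))
```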

> Note that you can click and double click on the legend to hide / show individual or groups of categories.

In [None]:
fig = px.scatter(BERT_ALL_DF, x='x', y='y', 
                color='topic_text', 
                width=1200, 
                height=850,
                hover_data= {'x' : False,
                            'y' : False,
                            'topic' : False,
                            'text' : True },
                category_orders={'topic_text' : ordered_list})
fig.update_layout(
    hoverlabel=dict(
        bgcolor="white",
))

fig.update_traces(marker=dict(size=6),
                  selector=dict(mode='markers')
)

fig.show()

## BERT_1 - 'NEWS'

The corpus was divided into two parts. Using settings similar to the above, the set of 'News' topics was separated from the 'Sports'. A BERTopic model was created and then a series of HDBSCAN parameters was cycled through. Based on this output, a min_cluster_size of 330 and a min_samples of 165 seemed like very good candidates. The process for selecting the optimum values is relatively complicated and won't be explained here. The author encourages questions regarding why these parameters were chosen. These parameters are not being presented as fully optimized, but they suffice for the overall demonstration.

In [None]:
stop_words = text.ENGLISH_STOP_WORDS.union(['said', 'say', 'says', 'year', 'years', 'new', 'mr'])
vectorizer_model = CountVectorizer(ngram_range=(1, 3), stop_words=stop_words)
hdbscan_model = HDBSCAN(min_samples=165,  
                        min_cluster_size=330)

BERT_1 = BERTopic(
                  vectorizer_model=vectorizer_model,
                  calculate_probabilities=False,
                  verbose=True,
                  low_memory=True,                  
                  hdbscan_model=hdbscan_model,
                  )
# Set UMAP's random state so that UMAP output will be consistent across runs.
#   This must be set on BERT_1 (the model fitted below), not on BERT_ALL.

BERT_1.umap_model.random_state=42

# Fit the model

BERT_1_topics, _ = BERT_1.fit_transform(News1DF['text'])

In [None]:
BERT_1.save('./BERT_1')
save(BERT_1_topics, './BERT_1_topics')

In [None]:
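BERT_1 = BERTopic.load('./BERT_1')
# Reload the topic assignments saved above; they are needed by update_topics
BERT_1_topics = load('./BERT_1_topics')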

In [None]:
stop_words = text.ENGLISH_STOP_WORDS.union(['said', 'say', 'says', 'year', 'years', 'new', 'mr'])
vectorizer_model = CountVectorizer(ngram_range=(1, 3), stop_words=stop_words)
# Pass topics by keyword: its positional slot differs between BERTopic versions
BERT_1.update_topics(News1DF['text'], topics=BERT_1_topics, vectorizer_model=vectorizer_model)

In [None]:
BERT_1_TSNE = TSNE(n_components=2, learning_rate='auto',
                  init='random').fit_transform(BERT_1.umap_model.embedding_)

In [None]:
BERT_1_DF = pd.DataFrame()
BERT_1_DF['x'] = BERT_1_TSNE[:,0]
BERT_1_DF['y'] = BERT_1_TSNE[:,1]

PrepBERTopicTblForPlotly(BERT_1_DF, News1DF['text'], BERT_1)
ordered_list = getOrderedTopicTextFromBERT(BERT_1)

We now have a nice segmentation of the texts. The number of uncategorized documents has shot up, but they are more or less evenly distributed. There are some areas where it seems like it would be 'nice' to have a better identified clustering, but overall this is a much better representation than the above. It may very well be that further experimentation with the clustering would fine-tune these results.
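
To put a number on how much the uncategorized share has changed between two models, the topic assignments can be tallied directly. The sketch below counts documents per topic and reports the fraction labelled `-1`. The helper name `summarise_topics` and the two lists of topic ids are made-up illustrations, not results from this notebook; in practice the list returned by `fit_transform` (for example `BERT_1_topics`) would be passed in.

```python
from collections import Counter


def summarise_topics(topics):
    """Return (documents per topic, share of documents labelled -1)."""
    sizes = Counter(topics)
    outlier_share = sizes.get(-1, 0) / len(topics)
    return sizes, outlier_share


# Hypothetical topic assignments for a coarse and a finer-grained model.
coarse = [0] * 90 + [1] * 8 + [-1] * 2
fine = [0] * 30 + [1] * 25 + [2] * 20 + [-1] * 25

for name, topics in [('coarse', coarse), ('fine', fine)]:
    sizes, share = summarise_topics(topics)
    n_topics = len(sizes) - (1 if -1 in sizes else 0)
    print(name, n_topics, f'{share:.0%}')
```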

In [None]:
fig = px.scatter(BERT_1_DF, x='x', y='y', 
                color='topic_text', 
                width=850, 
                height=600,
                hover_data= {'x' : False,
                            'y' : False,
                            'topic' : False,
                            'text' : True },
                category_orders={'topic_text' : ordered_list})
fig.update_layout(
    hoverlabel=dict(
        bgcolor="white",
))

fig.update_traces(marker=dict(size=6),
                  selector=dict(mode='markers')
)

fig.show()

## BERT_2

BERT_2 models the other split of the corpus (News2DF), using a min_samples of 40 and a min_cluster_size of 80 (arrived at through the same process as the BERT_1 parameters).

In [None]:
stop_words = text.ENGLISH_STOP_WORDS.union(['said', 'say', 'says', 'year', 'years', 'new', 'mr'])
vectorizer_model = CountVectorizer(ngram_range=(1, 3), stop_words=stop_words)
hdbscan_model = HDBSCAN(min_samples=40,  
                        min_cluster_size=80)

BERT_2 = BERTopic(
                  vectorizer_model=vectorizer_model,
                  calculate_probabilities=False,
                  verbose=True,
                  low_memory=True,                  
                  hdbscan_model=hdbscan_model,
                  )

# Set UMAP's random state so that UMAP output will be consistent across runs.
#   This must be set on BERT_2 (the model fitted below), not on BERT_ALL.

BERT_2.umap_model.random_state=42

# Fit the model

BERT_2_Topics, _ = BERT_2.fit_transform(News2DF['text'])

BERT_2.save('./BERT_2')
# Save the topic assignments to their own file so the saved model is not overwritten
save(BERT_2_Topics, './BERT_2_topics')


In [None]:
BERT_2 = BERTopic.load('./BERT_2')

In [None]:
BERT_2_TSNE = TSNE(n_components=2, learning_rate='auto',
                  init='random').fit_transform(BERT_2.umap_model.embedding_)

In [None]:
BERT_2_DF = pd.DataFrame()
BERT_2_DF['x'] = BERT_2_TSNE[:,0]
BERT_2_DF['y'] = BERT_2_TSNE[:,1]

PrepBERTopicTblForPlotly(BERT_2_DF, News2DF['text'], BERT_2)
ordered_list = getOrderedTopicTextFromBERT(BERT_2)

In [None]:
fig = px.scatter(BERT_2_DF, x='x', y='y', 
                color='topic_text', 
                width=850, 
                height=600,
                hover_data= {'x' : False,
                            'y' : False,
                            'topic' : False,
                            'text' : True },
                category_orders={'topic_text' : ordered_list})
fig.update_layout(
    hoverlabel=dict(
        bgcolor="white",
))

fig.update_traces(marker=dict(size=6),
                  selector=dict(mode='markers')
)

fig.show()