<a href="https://colab.research.google.com/github/cristianmejia00/clustering/blob/main/Topic_Models_using_BERTopic_TOPIC_MODEL_20241101.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Topic Modeling with BERTopic

🔴 copied from the [Kubota Colab](https://colab.research.google.com/drive/1YsDp5_qGXGJKsEXsS8DO8CA_lqZc6EpA).  

`Topic Models` are methods to automatically organize a corpus of text into topics.

Topic Model process:
1. Data preparation
2. Transform text to numeric vectors
3. Dimensionality reduction
4. Clustering
5. Topic analysis
6. Cluster assignment


This notebook uses the `BERTopic` library, a one-stop solution for topic modeling that includes handy functions for plotting and analysis. However, BERTopic does not have a function to extract the X and Y coordinates from UMAP. If you need the coordinates, use the `Topic_Models_using_Transformers` notebook instead. Otherwise, when a quick analysis is needed, this notebook may be the better choice.

This notebook is also the one needed for the heatmap codes included in this folder.

`BERTopic` is a Python library that handles steps 2 to 6.
BERT topic models use the transformer architecture to generate the embeddings (i.e., the numeric vector representation of the text), which is widely regarded as a state-of-the-art method for vectorization.

This notebook shows how to use it.

---
Reading:
- [Topic Modeling with Deep Learning Using Python BERTopic](https://medium.com/grabngoinfo/topic-modeling-with-deep-learning-using-python-bertopic-cf91f5676504)
- [Advanced Topic Modeling with BERTopic](https://www.pinecone.io/learn/bertopic/)


# Requirements

## Packages installation and initialization

In [None]:
#!pip install bertopic[visualization]

In [1]:
import pandas as pd
import numpy as np
import time
import math
import uuid
import re
import os
import json
import pickle
from datetime import date
from itertools import compress
from bertopic import BERTopic
from umap import UMAP
from gensim.parsing.preprocessing import remove_stopwords
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer



## Connect your Google Drive

In [2]:
# from google.colab import drive
# drive.mount('/content/drive')

In [2]:
def find_e_keys(dictionary):
    # List comprehension to find keys starting with 'e'
    e_keys = [key for key in dictionary if str(key).lower().startswith('e')]
    return e_keys

# 🔴 Input files and options

Go to your Google Drive and create a folder in the root directory. All related data will be saved in that folder.
Upload the dataset to that folder.
- The dataset should be a `.csv` file.
- Every row in the dataset is a document.
- It can have any kind of columns, but at least one column must contain the text we want to analyze. For example, a dataset of academic articles may contain a "Title" and/or "Abstract" column.
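
A quick check along these lines can confirm that a CSV meets the requirements above before running the rest of the notebook. This is only a minimal sketch: the helper `missing_text_columns` and the one-row sample are hypothetical, and the column names `TI` and `AB` are taken from the dataset loaded later in this notebook.

```python
import csv
import io

def missing_text_columns(csv_text, text_columns):
    """Return the text columns that are absent from the header or empty in every row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    rows = list(reader)
    missing = []
    for column in text_columns:
        if column not in header:
            missing.append(column)
        elif not any((row[column] or "").strip() for row in rows):
            missing.append(column)
    return missing

# Toy one-row sample (hypothetical), one document per row.
sample = "UT,TI,AB\n173,Ethanol? Whey not!,Cheese whey is classified as a special waste\n"
print(missing_text_columns(sample, ["TI", "AB"]))        # both columns are present
print(missing_text_columns(sample, ["TI", "Abstract"]))  # "Abstract" is not in the header
```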

In [3]:
# The bibliometrics folder
# Colab
ROOT_FOLDER_PATH = "drive/MyDrive/Bibliometrics_Drive"

# Mac
ROOT_FOLDER_PATH = "/Users/cristian/Library/CloudStorage/GoogleDrive-cristianmejia00@gmail.com/My Drive/Bibliometrics_Drive"

# Change to the name of the folder where the dataset is uploaded inside the above folder
project_folder = 'Q339_igem'

analysis_id = 'a01_tm__f01_e01__hdbs'

# Filtered label
settings_directive = "settings_analysis_directive_2025-08-04-16-40.json"

In [4]:
# Read settings
with open(f'{ROOT_FOLDER_PATH}/{project_folder}/{analysis_id}/{settings_directive}', 'r') as file:
    settings = json.load(file)
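
The settings file is not shown here, so the following is a hedged sketch of the structure this notebook appears to expect. Only the key names come from the cells in this notebook; the helper `missing_settings` and most example values (such as `"f01"` and `"e01"`) are assumptions for illustration.

```python
import json

# Keys that the cells in this notebook read from the settings file.
REQUIRED_KEYS = {
    "metadata": ["project_folder", "filtered_folder", "analysis_id"],
    "tmo": ["embeds_folder", "year_column", "min_topic_size"],
}

def missing_settings(settings):
    """Return 'section.key' strings for every required entry that is absent."""
    missing = []
    for section, keys in REQUIRED_KEYS.items():
        for key in keys:
            if key not in settings.get(section, {}):
                missing.append(f"{section}.{key}")
    return missing

# Hypothetical example values; a real settings file may contain more sections.
example = json.loads("""
{
  "metadata": {"project_folder": "Q339_igem", "filtered_folder": "f01",
               "analysis_id": "a01_tm__f01_e01__hdbs"},
  "tmo": {"embeds_folder": "e01", "year_column": "PY", "min_topic_size": 5}
}
""")
print(missing_settings(example))  # an empty list means nothing is missing
```

Running such a check right after loading the file surfaces a missing key immediately, rather than as a `KeyError` several cells later.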

In [5]:
# Input dataset
dataset_file_path = f"{ROOT_FOLDER_PATH}/{settings['metadata']['project_folder']}/{settings['metadata']['filtered_folder']}/dataset_raw_cleaned.csv"

In [6]:
# Function to save files
def save_as_csv(df, save_name_without_extension, with_index=False):
    """Usage: `save_as_csv(dataframe, 'filename')` writes `filename.csv` under ROOT_FOLDER_PATH."""
    save_path = f"{ROOT_FOLDER_PATH}/{save_name_without_extension}.csv"
    df.to_csv(save_path, index=with_index)
    print("===\nSaved: ", save_path)

In [7]:
# Save an object to a pickle file
def save_object_as_pickle(obj, filename):
  """
  Saves an object as a pickle file.

  Args:
      obj: The object to be saved.
      filename: The filename of the pickle file.
  """
  with open(filename, "wb") as f:
    pickle.dump(obj, f)


In [8]:
# Load a pickled object from a given path
def load_pickle(path):
    with open(path, 'rb') as f:
        return pickle.load(f)


In [9]:
# Open the data file
df = pd.read_csv(f"{dataset_file_path}", encoding='latin-1')
print(df.shape)
df.head()

(4199, 14)


Unnamed: 0,X_N,uuid,UT,PY,AU,TI,AB,Countries,DI,ID,WC,Institutions,CR,C1
0,1,f399a9de-8ec5-4ace-b6f3-b9137f3107eb,173,2009,UNIPV-Pavia,Ethanol? Whey not!,Cheese whey is classified as a special waste f...,ITA,https://2009.igem.org/Team:UNIPV-Pavia,Food/Energy,,UniversitÃ degli Studi di Pavia\r\r\nDipartim...,,
1,2,9a1131bd-91da-4871-9bad-d809761bc77e,174,2009,Newcastle,Bac-man: sequestering cadmium into Bacillus sp...,Cadmium contamination can be a serious problem...,GBR,https://2009.igem.org/Team:Newcastle,Environment,,Newcastle University\r\r\nNewcastle upon Tyne\...,,
2,3,f866ee62-25b0-4cde-b895-45009fa2b6fe,175,2009,TUDelft,Bacterial Relay Race,"In our project, we aim at creating a cell-to-c...",NLD,https://2009.igem.org/Team:TUDelft,Information Processing,,"Delft University of Technology\r\r\nDelft, The...",,
3,4,32a634be-6e1d-4d11-bace-8a31f9a3a88a,176,2009,USTC,E. coli Automatic Directed Evolution Machine: ...,Evolution is powerful enough to create everyth...,CHN,https://2009.igem.org/Team:USTC,Foundational Advance,,University of Science and Technology of China\...,,
4,5,63d73d3a-b239-4624-a2bb-2fb6d068294e,177,2009,Warsaw,BacInVader Ð a new system for cancer genetic t...,The main aim of our project is to design a mod...,POL,https://2009.igem.org/Team:Warsaw,Health/Medicine,,University of Warsaw\r\r\nKrakowskie Przedmies...,,




---



## PART 2: Topic Model

In [10]:
# bibliometrics_folder
# project_folder
# project_name_suffix
# ROOT_FOLDER_PATH = f"drive/MyDrive/{bibliometrics_folder}"

#############################################################
# Embeddings folder
embeddings_folder_name = settings['tmo']['embeds_folder']

# Which column has the year of the documents?
my_year = settings['tmo']['year_column']

# Number of topics. Select the number of topics to extract.
# Choose 0 for automatic detection.
n_topics = 71  # settings['tmo']['n_topics']

# Minimum number of documents per topic
min_topic_size = settings['tmo']['min_topic_size']

# Threshold for others
# Topics are sorted from largest to smallest; those within this cumulative share of documents are kept.
# The remaining small topics are grouped into an "others" topic.
# For example, if set to 0.9, the largest topics that together account for up to 90% of the documents are kept, and the rest are grouped into "others".
others_threshold = 0.99  # settings['tmo']['others_threshold']
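
To make the threshold concrete, here is a small worked sketch in plain Python. The helper `select_main_clusters` and the toy counts are hypothetical; the rule (keep a cluster while the cumulative share is less than or equal to the threshold) mirrors the `level0` assignment later in this notebook.

```python
def select_main_clusters(cluster_counts, threshold):
    """Keep the largest clusters while their cumulative share of documents is <= threshold."""
    total = sum(cluster_counts.values())
    ordered = sorted(cluster_counts.items(), key=lambda item: item[1], reverse=True)
    main, running = [], 0
    for cluster, count in ordered:
        running += count
        if running / total <= threshold:
            main.append(cluster)
    return main

# Toy data: cluster id -> number of documents (cumulative shares 0.50, 0.80, 0.95, 1.00).
toy_counts = {1: 50, 2: 30, 3: 15, 4: 5}
print(select_main_clusters(toy_counts, 0.9))   # [1, 2]
print(select_main_clusters(toy_counts, 0.99))  # [1, 2, 3]
```

Note that the cluster that crosses the threshold is itself sent to "others": with 0.9, cluster 3 is dropped and the kept clusters cover only 80% of the documents.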

In [11]:
# Get the embeddings back.
embeddings = load_pickle(f"{ROOT_FOLDER_PATH}/{project_folder}/{settings['metadata']['filtered_folder']}/{embeddings_folder_name}/embeddings.pck")
corpus = pd.read_csv(f"{ROOT_FOLDER_PATH}/{project_folder}/{settings['metadata']['filtered_folder']}/{embeddings_folder_name}/corpus.csv").reset_index(drop=True)

In [12]:
# Get the document texts as a list
documents = corpus.text.to_list()

In [62]:
# corpus['uuid'] = [uuid.uuid4() for _ in range(len(corpus.index))]
# corpus['X_N'] = [i for i in range(1, len(corpus.index)+1)]

In [13]:
len(documents)

4199

In [14]:
#len(embeddings) == len(documents)
len(embeddings['embeddings']) == len(documents)

True

In [15]:
len(embeddings['embeddings'][0])

384
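
The three checks above (document count, matching lengths, vector size) can be folded into one guard. This is a sketch under assumptions: `check_embeddings` is a hypothetical helper, and the toy vectors stand in for `embeddings['embeddings']`.

```python
def check_embeddings(vectors, documents):
    """Return the shared vector size, or raise ValueError if vectors and documents do not line up."""
    if len(vectors) != len(documents):
        raise ValueError(f"{len(vectors)} vectors for {len(documents)} documents")
    dimensions = {len(vector) for vector in vectors}
    if len(dimensions) > 1:
        raise ValueError(f"inconsistent vector sizes: {sorted(dimensions)}")
    return dimensions.pop() if dimensions else 0

# Toy data: three documents with 4-dimensional vectors.
toy_documents = ["doc a", "doc b", "doc c"]
toy_vectors = [[0.1, 0.2, 0.3, 0.4], [0.0, 0.1, 0.0, 0.1], [0.5, 0.5, 0.5, 0.5]]
print(check_embeddings(toy_vectors, toy_documents))  # 4
```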

In [16]:
from hdbscan import HDBSCAN
# Execute the topic model.
# I suggest changing the values marked with #<---
# The others are the default values and they'll work fine in most cases.
# This will take several minutes to finish.

# Initiate UMAP
umap_model = UMAP(n_neighbors=15,
                  n_components=5,
                  min_dist=0.0,
                  metric='cosine',
                  random_state=100)

if n_topics == 0:
  # Initiate topic model with HDBScan (Automatic topic selection)
  topic_model_params = HDBSCAN(min_cluster_size=min_topic_size,
                               metric='euclidean',
                               cluster_selection_method='eom',
                               prediction_data=True)
else:
  # Initiate topic model with K-means (Manual topic selection)
  topic_model_params = KMeans(n_clusters = n_topics)

# Initiate BERTopic
topic_model = BERTopic(umap_model = umap_model,
                       hdbscan_model = topic_model_params,
                       min_topic_size=min_topic_size,
                       #nr_topics=15,          #<--- Footnote 1
                       n_gram_range=(1,3),
                       language='english',
                       calculate_probabilities=True,
                       verbose=True)



# Footnote 1: `nr_topics` controls the number of topics we want AFTER clustering.
# Keep the line commented out (leading `#`) to use the number of topics returned by the clustering step.
# With HDBSCAN, `nr_topics` is applied after orphan removal, and there is no guarantee that HDBSCAN finds more than `nr_topics` topics.
# Thus, with HDBSCAN, `nr_topics` means N topics OR FEWER.
# With KMeans, `nr_topics` can be used to further reduce the number of topics.
# We use the topics as returned by the topic model, so we do not need to activate it here.

In [66]:
# Compute topic model
#topics, probabilities = topic_model.fit_transform(documents, embeddings)
topics, probabilities = topic_model.fit_transform(documents, embeddings['embeddings'])

2025-08-04 16:45:38,492 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-08-04 16:45:46,733 - BERTopic - Dimensionality - Completed ✓
2025-08-04 16:45:46,743 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-08-04 16:45:46,819 - BERTopic - Cluster - Completed ✓
2025-08-04 16:45:46,824 - BERTopic - Representation - Extracting topics from clusters using representation models.
2025-08-04 16:45:50,603 - BERTopic - Representation - Completed ✓


In [67]:
# Get the list of topics
# Topic = the topic number, ordered from the largest topic.
#         "-1" is the generic (outlier) topic: documents that fit no cluster are aggregated here (HDBSCAN only; KMeans assigns every document).
# Count = Number of documents assigned to this topic
# Name = Topic number followed by the top 4 words of the topic, ranked by c-TF-IDF score
# Representation = The list of words representing this topic
# Representative_Docs = Representative documents of this topic
tm_summary = topic_model.get_topic_info()
tm_summary

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,0,177,0_heavy_metal_heavy metal_the,"[heavy, metal, heavy metal, the, metals, and, ...",[Nanocrystal Ecoli Flocculation Union Heavy me...
1,1,127,1_the_of_to_in,"[the, of, to, in, and, we, system, is, as, by]",[Conversensations Developing a TwoWay QuorumSe...
2,2,120,2_of_the_to_and,"[of, the, to, and, we, circuits, in, synthetic...",[Back to the Basics Synthetic biology has stri...
3,3,117,3_insulin_to_the_diabetes,"[insulin, to, the, diabetes, of, glucose, in, ...",[MINILOSS MIcrofluidic orgaN chIp for bLOod gl...
4,4,101,4_the_of_and_in,"[the, of, and, in, production, to, is, for, sy...",[Oppossum Plants and Pichia You Down With OPP ...
...,...,...,...,...,...
66,66,19,66_olfactory_vocs_volatile_volatile organic,"[olfactory, vocs, volatile, volatile organic, ...",[Vigilantly Optimizing Cancer Detection Using ...
67,67,18,67_histamine_allergy_allergic_allergies,"[histamine, allergy, allergic, allergies, is, ...",[Allergy test master the histamine receptor ba...
68,68,15,68_retardant_fire_flame_flame retardant,"[retardant, fire, flame, flame retardant, surf...",[Synbiofoam a synthetic alternative to fluoros...
69,69,13,69_pfas_pfoa_per and_of pfas,"[pfas, pfoa, per and, of pfas, substances, and...",[Detection and Degradation of Perfluoroalkyl S...


In [68]:
# Save the topic model assets
tm_folder_path = f'{ROOT_FOLDER_PATH}/{project_folder}/{settings["metadata"]["analysis_id"]}'

# os.makedirs also works when the path contains spaces (e.g., "My Drive")
os.makedirs(tm_folder_path, exist_ok=True)

tm_summary.to_csv(f'{tm_folder_path}/topic_model_info.csv', index=False)

In [69]:
# Number of topics found
found_topics = max(tm_summary.Topic) + 1
found_topics

71

In [70]:
# Confirm all documents are assigned
sum(tm_summary.Count) == len(corpus)

True

In [71]:
# Get top 10 terms for a topic
topic_model.get_topic(0)

[('heavy', 0.015450560745119704),
 ('metal', 0.015082348247861244),
 ('heavy metal', 0.0106227120823726),
 ('the', 0.010473297896099042),
 ('metals', 0.010056216944294582),
 ('and', 0.01001487564575892),
 ('of', 0.00907147239699955),
 ('in', 0.00887287961654058),
 ('to', 0.008634367233939611),
 ('ions', 0.008075810917577921)]
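
In the output above, stop words such as "the" and "of" take up half of the top terms. The sketch below shows the idea of filtering them out after the fact; the stop-word list is a deliberately tiny, hypothetical one, and `drop_stop_words` is not a BERTopic function. A more common route is probably to pass a `vectorizer_model` such as scikit-learn's `CountVectorizer(stop_words="english", ngram_range=(1, 3))` when creating the model, which is not what this notebook currently does.

```python
# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"the", "and", "of", "in", "to", "we", "is", "as", "by", "for"}

def drop_stop_words(terms, stop_words=STOP_WORDS):
    """Remove (term, score) pairs whose term consists only of stop words."""
    return [
        (term, score)
        for term, score in terms
        if not all(word in stop_words for word in term.split())
    ]

# Terms of topic 0, with scores rounded from the output above.
topic_0 = [("heavy", 0.0155), ("metal", 0.0151), ("heavy metal", 0.0106),
           ("the", 0.0105), ("metals", 0.0101), ("and", 0.0100),
           ("of", 0.0091), ("in", 0.0089), ("to", 0.0086), ("ions", 0.0081)]
print([term for term, _ in drop_stop_words(topic_0)])
# ['heavy', 'metal', 'heavy metal', 'metals', 'ions']
```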

In [72]:
# Get the representative documents for a topic
topic_model.get_representative_docs(0)

['Nanocrystal Ecoli Flocculation Union Heavy metals especially zinc and cadmium inevitably have detrimental effects on the environment and our human race To be more specific water sources can be contaminated by heavy metals leaching from industrial waste acid rain can exacerbate this process by releasing heavy metals trapped in soils Considering the properties of heavy metals unable to decay and thus a different kind of challenge for medication we have constructed a heavy metal detection and recycle system Briefly our system is based on a special strain of Ecoli containing smtlocus CDS and an unfamiliar flocculation gene The system can detect the existence of metal zinc and cadmium through visualized pigment gene driven by the smtlocus recycle these heavy metal ions by forming nanocrystals resulted from CDS and flocculate the nanocrystals Eventually we can realize our doublewin goal safeguarding our environment by removing heavy metal ions and yielding a great amount of nanocrystals',


In [73]:
# Others

# # Get the number of documents per topic (same as in the table above)
# topic_model.get_topic_freq(0)

# # Get the main keywords per topic
# topic_model.get_topics()

In [74]:
# Print the parameters used. (For reporting)
topic_model.get_params()

{'calculate_probabilities': True,
 'ctfidf_model': ClassTfidfTransformer(),
 'embedding_model': None,
 'hdbscan_model': KMeans(n_clusters=71),
 'language': 'english',
 'low_memory': False,
 'min_topic_size': 5,
 'n_gram_range': (1, 3),
 'nr_topics': None,
 'representation_model': None,
 'seed_topic_list': None,
 'top_n_words': 10,
 'umap_model': UMAP(angular_rp_forest=True, metric='cosine', min_dist=0.0, n_components=5, n_jobs=1, random_state=100, tqdm_kwds={'bar_format': '{desc}: {percentage:3.0f}%| {bar} {n_fmt}/{total_fmt} [{elapsed}]', 'desc': 'Epochs completed', 'disable': True}),
 'vectorizer_model': CountVectorizer(ngram_range=(1, 3)),
 'verbose': True,
 'zeroshot_min_similarity': 0.7,
 'zeroshot_topic_list': None}

In [75]:
tm_params = dict(topic_model.get_params())
for key, value in tm_params.items():
    tm_params[key]=  str(value)
with open(f'{tm_folder_path}/topic_model_params.json', 'w') as f:
    json.dump(tm_params, f, ensure_ascii=False, indent=4)
    print('Done')

Done
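
The parameters are converted to strings before saving because several values (the UMAP, KMeans, and vectorizer objects) are not JSON-serializable. The sketch below illustrates this with a stand-in class; `FakeModel` and the `params` dictionary are hypothetical and only mimic the shape of the output of `get_params()`.

```python
import json

class FakeModel:
    """Stand-in for a non-serializable object such as a fitted model."""
    def __repr__(self):
        return "FakeModel(n_clusters=3)"

params = {"top_n_words": 10, "n_gram_range": (1, 3), "hdbscan_model": FakeModel()}

# json.dumps(params) would raise TypeError because of FakeModel,
# so every value is converted to its text form first.
as_text = {key: str(value) for key, value in params.items()}
restored = json.loads(json.dumps(as_text, indent=4))
print(restored["hdbscan_model"])  # FakeModel(n_clusters=3)
print(restored["top_n_words"])    # the string '10', not the integer 10
```

The trade-off is that the saved file is a human-readable report, not something the model can be rebuilt from: numbers and tuples come back as strings.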


In [94]:
# # Get the topic score for each paper and its assigned topic
# topic_distr, _ = topic_model.approximate_distribution(documents, batch_size=1000)
# distributions = [distr[topic] if topic != -1 else 0 for topic, distr in zip(topics, topic_distr)]
# topic_distr[0]

In [98]:
# Get the topic score for each paper and its assigned topic
import numpy as np
from collections import defaultdict
from sklearn.metrics.pairwise import cosine_distances

def calculate_centroid_scores(embeddings, topics, distance_metric='euclidean'):
    """
    Calculate scores for documents based on distance from topic centroids.
    
    Args:
        embeddings: List/array of document embeddings (shape: [n_docs, embedding_dim])
        topics: List of topic assignments for each document (length: n_docs)
        distance_metric: 'euclidean' or 'cosine'
    
    Returns:
        scores: List of scores (lower = closer to centroid)
    """
    embeddings = np.array(embeddings)
    topics = np.array(topics)
    
    # Group documents by topic
    topic_groups = defaultdict(list)
    for idx, topic in enumerate(topics):
        topic_groups[topic].append(idx)
    
    # Calculate centroids for each topic
    centroids = {}
    for topic, doc_indices in topic_groups.items():
        topic_embeddings = embeddings[doc_indices]
        centroids[topic] = np.mean(topic_embeddings, axis=0)
    
    # Calculate scores (distances from centroids)
    scores = np.zeros(len(embeddings))
    
    for topic, doc_indices in topic_groups.items():
        centroid = centroids[topic]
        topic_embeddings = embeddings[doc_indices]
        
        if distance_metric == 'euclidean':
            distances = np.linalg.norm(topic_embeddings - centroid, axis=1)
        elif distance_metric == 'cosine':
            distances = cosine_distances([centroid], topic_embeddings)[0]
        else:
            raise ValueError("distance_metric must be 'euclidean' or 'cosine'")
        
        scores[doc_indices] = distances
    
    return scores.tolist()


# Normalize scores to the 0-1 range per topic (inverted: 1 = closest to the centroid)
def normalize_scores_by_topic(scores, topics):
    """Normalize distances within each topic and invert them: 1 = closest to the centroid, 0 = farthest."""
    scores = np.array(scores)
    topics = np.array(topics)
    normalized_scores = scores.copy()
    
    for topic in np.unique(topics):
        topic_mask = topics == topic
        topic_scores = scores[topic_mask]
        
        if len(topic_scores) > 1:  # Only normalize if more than 1 document
            min_score = topic_scores.min()
            max_score = topic_scores.max()
            if max_score > min_score:
                normalized_scores[topic_mask] = 1 - ((topic_scores - min_score) / (max_score - min_score))
            else:
                normalized_scores[topic_mask] = 0  # All documents identical
    
    return normalized_scores.tolist()



In [99]:
# Usage with normalization:
scores = calculate_centroid_scores(embeddings['embeddings'], topics)
normalized_scores = normalize_scores_by_topic(scores, topics)
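
A tiny worked example may help when reading the resulting scores. This is a plain-Python sketch for a single topic with Euclidean distance; `centroid_scores` is a hypothetical helper that combines the two steps above, and the 2-D points are toy data standing in for real embeddings.

```python
import math

def centroid_scores(points):
    """Return inverted min-max scores for one topic: 1 = closest to the centroid, 0 = farthest."""
    dims = len(points[0])
    centroid = [sum(p[d] for p in points) / len(points) for d in range(dims)]
    distances = [math.dist(p, centroid) for p in points]
    low, high = min(distances), max(distances)
    if high == low:
        return [0.0 for _ in points]  # mirrors the notebook: no spread, so no ranking
    return [1 - (d - low) / (high - low) for d in distances]

# Toy topic with three 2-D "embeddings"; the middle point coincides with the centroid.
print(centroid_scores([(0, 0), (2, 0), (4, 0)]))  # [0.0, 1.0, 0.0]
```

Because the normalization is done per topic, a score of 1.0 only means "most central document in its own topic"; scores are not comparable across topics.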

In [100]:
# Document information, including the topic assignment
dataset_clustering_results = topic_model.get_document_info(documents, df = corpus, metadata={"Score": normalized_scores})

# Check for orphans (Topic == -1), save and remove them
if -1 in dataset_clustering_results['Topic'].values:
    orphans = dataset_clustering_results[dataset_clustering_results['Topic'] == -1]
    orphans.to_csv(f'{tm_folder_path}/orphans.csv', index=False)
    dataset_clustering_results = dataset_clustering_results[dataset_clustering_results['Topic'] != -1]

In [101]:
# Standard format for report analysis
dataset_clustering_results = dataset_clustering_results.reset_index(drop=True)
dataset_clustering_results = dataset_clustering_results.drop(columns=['text'])
dataset_clustering_results['X_E'] = dataset_clustering_results['Score']
dataset_clustering_results['X_C'] = dataset_clustering_results['Topic'] + 1



# Assign 'level0' based on cluster coverage (`others_threshold`); clusters outside the coverage get the code 99999 ("others")
cluster_counts = dataset_clustering_results['X_C'].value_counts().sort_values(ascending=False)
total_rows = len(dataset_clustering_results)
cumulative = cluster_counts.cumsum() / total_rows
main_clusters = cluster_counts.index[cumulative <= others_threshold].tolist()
dataset_clustering_results['level0'] = dataset_clustering_results['X_C'].apply(lambda x: x if x in main_clusters else 99999)
dataset_clustering_results.head()


Unnamed: 0,UT,uuid,Document,Topic,Name,Representation,Representative_Docs,Top_n_words,Representative_document,Score,X_E,X_C,level0
0,173,791aefbd-0fde-417e-8e1b-2f7210647032,Ethanol Whey not Cheese whey is classified as ...,22,22_vitamin_the_of_and,"[vitamin, the, of, and, to, in, production, la...",[Producing FL by Saccharomyces cerevisiae Fuc...,vitamin - the - of - and - to - in - productio...,False,0.825086,0.825086,23,23
1,174,84d56c08-f62f-4bfd-986c-71a44c1af2ed,Bacman sequestering cadmium into Bacillus spor...,0,0_heavy_metal_heavy metal_the,"[heavy, metal, heavy metal, the, metals, and, ...",[Nanocrystal Ecoli Flocculation Union Heavy me...,heavy - metal - heavy metal - the - metals - a...,False,0.484021,0.484021,1,1
2,175,8a69e820-ef66-4438-8ed2-08943ab2c6b1,Bacterial Relay Race In our project we aim at ...,1,1_the_of_to_in,"[the, of, to, in, and, we, system, is, as, by]",[Conversensations Developing a TwoWay QuorumSe...,the - of - to - in - and - we - system - is - ...,True,0.961273,0.961273,2,2
3,176,572f8dd4-bb30-4d59-a319-2a5e650cce90,E coli Automatic Directed Evolution Machine a ...,9,9_dna_the_of_to,"[dna, the, of, to, and, in, evolution, for, we...",[SciPhi Enabling orthogonal replication and p...,dna - the - of - to - and - in - evolution - f...,False,0.680135,0.680135,10,10
4,177,be21c9a1-0aac-4e87-86f5-f6023d374461,BacInVader a new system for cancer genetic th...,12,12_cancer_tumor_cells_the,"[cancer, tumor, cells, the, to, and, of, thera...",[BLAST Bifidobacterium Longum induced Apoptos...,cancer - tumor - cells - the - to - and - of -...,False,0.727408,0.727408,13,13


In [106]:
# Standard format for report analysis: flag documents grouped into "others" (level0 == 99999)
dataset_clustering_results['cl99'] = dataset_clustering_results['level0'] == 99999
dataset_clustering_results['cl-99'] = dataset_clustering_results['level0'] == 99999
dataset_clustering_results.head()

Unnamed: 0,UT,uuid,Document,Topic,Name,Representation,Representative_Docs,Top_n_words,Representative_document,Score,X_E,X_C,level0,cl99,cl-99
0,173,791aefbd-0fde-417e-8e1b-2f7210647032,Ethanol Whey not Cheese whey is classified as ...,22,22_vitamin_the_of_and,"[vitamin, the, of, and, to, in, production, la...",[Producing FL by Saccharomyces cerevisiae Fuc...,vitamin - the - of - and - to - in - productio...,False,0.825086,0.825086,23,23,False,False
1,174,84d56c08-f62f-4bfd-986c-71a44c1af2ed,Bacman sequestering cadmium into Bacillus spor...,0,0_heavy_metal_heavy metal_the,"[heavy, metal, heavy metal, the, metals, and, ...",[Nanocrystal Ecoli Flocculation Union Heavy me...,heavy - metal - heavy metal - the - metals - a...,False,0.484021,0.484021,1,1,False,False
2,175,8a69e820-ef66-4438-8ed2-08943ab2c6b1,Bacterial Relay Race In our project we aim at ...,1,1_the_of_to_in,"[the, of, to, in, and, we, system, is, as, by]",[Conversensations Developing a TwoWay QuorumSe...,the - of - to - in - and - we - system - is - ...,True,0.961273,0.961273,2,2,False,False
3,176,572f8dd4-bb30-4d59-a319-2a5e650cce90,E coli Automatic Directed Evolution Machine a ...,9,9_dna_the_of_to,"[dna, the, of, to, and, in, evolution, for, we...",[SciPhi Enabling orthogonal replication and p...,dna - the - of - to - and - in - evolution - f...,False,0.680135,0.680135,10,10,False,False
4,177,be21c9a1-0aac-4e87-86f5-f6023d374461,BacInVader a new system for cancer genetic th...,12,12_cancer_tumor_cells_the,"[cancer, tumor, cells, the, to, and, of, thera...",[BLAST Bifidobacterium Longum induced Apoptos...,cancer - tumor - cells - the - to - and - of -...,False,0.727408,0.727408,13,13,False,False


In [103]:
dataset_clustering_results.level0.value_counts()

level0
1     177
2     127
3     120
4     117
5     101
     ... 
63     23
64     22
65     22
66     20
67     19
Name: count, Length: 68, dtype: int64

In [104]:
# Save the dataframe
dataset_clustering_results.to_csv(f'{tm_folder_path}/dataset_minimal.csv', index=False)

In [105]:
# Save the topic model
topic_model.save(f'{tm_folder_path}/topic_model_object.pck')





---

