<a href="https://colab.research.google.com/github/cristianmejia00/clustering/blob/main/Topic_Models_using_BERTopic_TOPIC_MODEL_20241101.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Topic Modeling with BERTopic

🔴 Copied from the [Kubota Colab](https://colab.research.google.com/drive/1YsDp5_qGXGJKsEXsS8DO8CA_lqZc6EpA).

`Topic Models` are methods to automatically organize a corpus of text into topics.

Topic Model process:
1. Data preparation
2. Transform text to numeric vectors
3. Dimensionality reduction
4. Clustering
5. Topic analysis
6. Cluster assignment


This notebook uses the `BERTopic` library, a one-stop solution for topic modeling that includes handy functions for plotting and analysis. However, BERTopic does not provide a function to extract the X and Y coordinates from UMAP. If you need those coordinates, use the `Topic_Models_using_Transformers` notebook instead. Otherwise, when a quick analysis is needed, this notebook may be the better choice.

This notebook is also the one needed for the heatmap codes included in this folder.

`BERTopic` is a Python library that handles steps 2 to 6.
BERTopic models use the transformer architecture to generate embeddings (i.e., numeric vector representations of text) and are widely regarded as a state-of-the-art approach to text vectorization.

This notebook shows how to use it.

---
Reading:
- [Topic Modeling with Deep Learning Using Python BERTopic](https://medium.com/grabngoinfo/topic-modeling-with-deep-learning-using-python-bertopic-cf91f5676504)
- [Advanced Topic Modeling with BERTopic](https://www.pinecone.io/learn/bertopic/)


# Requirements

## Packages installation and initialization

In [1]:
import pandas as pd
import os
import json
import pickle
from datetime import date
from itertools import compress
from bertopic import BERTopic
from umap import UMAP
from sklearn.cluster import KMeans



# 🔴 Input files and options

Go to your Google Drive and create a folder in the root directory. We are going to save all related data in that directory.
Upload the dataset into that folder.
- The dataset should be a `.csv` file.
- Every row in the dataset is a document.
- It can have any number of columns, but at least one column must contain the text we want to analyze. For example, a dataset of academic articles may contain a "Title" and/or "Abstract" column.
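
As a rough illustration of these requirements, the sketch below checks that the expected columns exist and joins the title and abstract into a single text field. The column names `uuid`, `PY`, `TI`, and `AB` are the ones loaded later in this notebook; the two rows and the `text` column built here are hypothetical examples, not the actual preprocessing used to create the corpus.

```python
import io

import pandas as pd

# A tiny in-memory CSV standing in for the uploaded dataset (made-up rows)
csv_text = io.StringIO(
    "uuid,PY,TI,AB\n"
    "id-1,2009,First title,First abstract\n"
    "id-2,2010,Second title,\n"
)
df_example = pd.read_csv(csv_text)

# Check that every column we rely on is present
required_columns = ["uuid", "PY", "TI", "AB"]
missing_columns = [c for c in required_columns if c not in df_example.columns]

# Join title and abstract, treating missing values as empty strings
df_example["text"] = (
    df_example["TI"].fillna("") + " " + df_example["AB"].fillna("")
).str.strip()

print(missing_columns)
print(df_example["text"].tolist())
```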

In [2]:
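# The bibliometrics folder
# Keep only the assignment that matches your environment.
# The second assignment overrides the first, so comment out the Mac line when running on Colab.
# Colab
ROOT_FOLDER_PATH = "drive/MyDrive/Bibliometrics_Drive"

# Mac
ROOT_FOLDER_PATH = "/Users/cristian/Library/CloudStorage/GoogleDrive-cristianmejia00@gmail.com/My Drive/Bibliometrics_Drive"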

# Change to the name of the folder where the dataset is uploaded inside the above folder
project_folder = 'Q339_igem'

# Analysis ID
analysis_id = 'a01_tm__f01_e01__hdbs'

# Settings file (analysis directive)
settings_directive = "settings_analysis_directive_2025-08-04-16-40.json"

In [4]:
# Read settings
with open(f'{ROOT_FOLDER_PATH}/{project_folder}/{analysis_id}/{settings_directive}', 'r') as file:
    settings = json.load(file)
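
The settings file is expected to contain the keys this notebook reads or references. A minimal sketch of such a check follows; the key names are taken from the cells in this notebook, while the example values (such as `"f01"` and `"e01"`) and the helper `find_missing_keys` are hypothetical.

```python
import json

# Keys this notebook reads or references; grouped by top-level section
REQUIRED_KEYS = {
    "metadata": ["project_folder", "filtered_folder", "analysis_id"],
    "tmo": ["embeds_folder", "year_column", "n_topics",
            "min_topic_size", "others_threshold"],
}


def find_missing_keys(settings):
    """Return 'section.key' strings for every required key that is absent."""
    missing = []
    for section, keys in REQUIRED_KEYS.items():
        block = settings.get(section, {})
        missing.extend(f"{section}.{key}" for key in keys if key not in block)
    return missing


# Hypothetical settings content, for illustration only
example_settings = json.loads("""
{
  "metadata": {"project_folder": "Q339_igem",
               "filtered_folder": "f01",
               "analysis_id": "a01_tm__f01_e01__hdbs"},
  "tmo": {"embeds_folder": "e01", "year_column": "PY", "n_topics": 0,
          "min_topic_size": 5, "others_threshold": 0.9}
}
""")

missing = find_missing_keys(example_settings)
print(missing)
```

Running a check like this right after loading the file surfaces a missing key immediately instead of as a `KeyError` several cells later.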

In [5]:
# Input dataset
dataset_file_path = f"{ROOT_FOLDER_PATH}/{settings['metadata']['project_folder']}/{settings['metadata']['filtered_folder']}/dataset_raw_cleaned.csv"

In [6]:
# Function to save files
def save_as_csv(df, save_name_without_extension, with_index=False):
    "usage: `save_as_csv(dataframe, 'filename')`"
    save_path = f"{ROOT_FOLDER_PATH}/{save_name_without_extension}.csv"
    df.to_csv(save_path, index=with_index)
    print("===\nSaved: ", save_path)

In [7]:
# Function to save an object to a pickle file
def save_object_as_pickle(obj, filename):
  """
  Saves an object as a pickle file.

  Args:
      obj: The object to be saved.
      filename: The filename of the pickle file.
  """
  with open(filename, "wb") as f:
    pickle.dump(obj, f)


In [8]:
# Function to load a pickled object from a path
def load_pickle(path):
    with open(path, 'rb') as f:
        return pickle.load(f)


In [10]:
# Open the data file
df = pd.read_csv(f"{dataset_file_path}", encoding='latin-1', usecols=['X_N', 'uuid', 'UT', 'PY', 'TI', 'AB'])
print(df.shape)
df.head()

(4199, 6)


Unnamed: 0,X_N,uuid,UT,PY,TI,AB
0,1,f399a9de-8ec5-4ace-b6f3-b9137f3107eb,173,2009,Ethanol? Whey not!,Cheese whey is classified as a special waste f...
1,2,9a1131bd-91da-4871-9bad-d809761bc77e,174,2009,Bac-man: sequestering cadmium into Bacillus sp...,Cadmium contamination can be a serious problem...
2,3,f866ee62-25b0-4cde-b895-45009fa2b6fe,175,2009,Bacterial Relay Race,"In our project, we aim at creating a cell-to-c..."
3,4,32a634be-6e1d-4d11-bace-8a31f9a3a88a,176,2009,E. coli Automatic Directed Evolution Machine: ...,Evolution is powerful enough to create everyth...
4,5,63d73d3a-b239-4624-a2bb-2fb6d068294e,177,2009,BacInVader – a new system for cancer genetic t...,The main aim of our project is to design a mod...




---



## PART 2: Topic Model

In [11]:
# bibliometrics_folder
# project_folder
# project_name_suffix
# ROOT_FOLDER_PATH = f"drive/MyDrive/{bibliometrics_folder}"

#############################################################
# Embeddings folder
embeddings_folder_name = settings['tmo']['embeds_folder']

# Which column has the year of the documents?
my_year = settings['tmo']['year_column']

# Number of topics to extract.
# Choose 0 for automatic detection (HDBSCAN); any other value uses K-means with that many clusters.
n_topics = 71  # hard-coded for this run; the settings value is settings['tmo']['n_topics']

# Minimum number of documents per topic
min_topic_size = settings['tmo']['min_topic_size']

# Threshold for "others"
# Topics are sorted from largest to smallest and their cumulative share of documents is computed.
# Topics whose cumulative share is at or below this threshold are kept; the remaining (smallest) topics are grouped into an "others" topic.
# For example, if set to 0.9, the largest topics that together account for up to 90% of the documents are kept, and the rest are grouped into "others".
others_threshold = 0.99  # hard-coded for this run; the settings value is settings['tmo']['others_threshold']
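
To make the threshold concrete, here is a small worked example that mirrors the cumulative-coverage logic applied later in this notebook. The cluster sizes are made up, and `99999` is the code this notebook uses for the "others" group.

```python
import pandas as pd

# 100 made-up documents in four clusters of sizes 50, 30, 15 and 5
labels = pd.Series([1] * 50 + [2] * 30 + [3] * 15 + [4] * 5)
threshold = 0.9

cluster_counts = labels.value_counts()                 # largest cluster first
cumulative = cluster_counts.cumsum() / len(labels)     # 0.50, 0.80, 0.95, 1.00
main_clusters = cluster_counts.index[cumulative <= threshold].tolist()

# Everything outside the main clusters is recoded as 99999 ("others")
level0 = labels.where(labels.isin(main_clusters), 99999)
kept_share = float((level0 != 99999).mean())

print(main_clusters, kept_share)
```

Note that the comparison is `<=`, so the kept clusters cover at most the threshold. In this example cluster 3 would push coverage to 95%, so it goes to "others" and the kept clusters cover 80% of the documents.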

In [14]:
# Get the embeddings back.
embeddings = load_pickle(f"{ROOT_FOLDER_PATH}/{project_folder}/{settings['metadata']['filtered_folder']}/{embeddings_folder_name}/embeddings.pck")
corpus =     pd.read_csv(f"{ROOT_FOLDER_PATH}/{project_folder}/{settings['metadata']['filtered_folder']}/{embeddings_folder_name}/corpus.csv").reset_index(drop=True)
documents = corpus.text.to_list()
print(f"Embeddings shape: {embeddings['embeddings'].shape}")

Embeddings shape: (4199, 384)


In [28]:
# verify that the number of embeddings matches the number of documents
assert(len(embeddings['embeddings']) == len(documents))

# verify that embedding and documents are in the same order as the original dataframe
assert(corpus['uuid'].tolist() == embeddings['embeddings_ids'])
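
If the second assertion fails because the embeddings were stored in a different order, one possible remedy is to reorder the embedding rows by ID before modeling. The helper below is a hypothetical sketch, not part of this notebook; it assumes every corpus ID has exactly one embedding.

```python
import numpy as np


def align_embeddings(vectors, vector_ids, corpus_ids):
    """Reorder `vectors` so that row i corresponds to corpus_ids[i]."""
    position = {doc_id: i for i, doc_id in enumerate(vector_ids)}
    missing = [doc_id for doc_id in corpus_ids if doc_id not in position]
    if missing:
        raise KeyError(f"{len(missing)} corpus ids have no embedding, e.g. {missing[0]}")
    order = [position[doc_id] for doc_id in corpus_ids]
    return np.asarray(vectors)[order]


# Toy data: embeddings stored in the order b, c, a
vectors = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
vector_ids = ["b", "c", "a"]
corpus_ids = ["a", "b", "c"]

aligned = align_embeddings(vectors, vector_ids, corpus_ids)
print(aligned)
```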

---

# Topic Modeling with BERTopic

In [29]:
from hdbscan import HDBSCAN
# Execute the topic model.
# I suggest changing the values marked with #<---
# The others are the default values and they'll work fine in most cases.
# This will take several minutes to finish.

# Initiate UMAP
umap_model = UMAP(n_neighbors=15,
                  n_components=5,
                  min_dist=0.0,
                  metric='cosine',
                  random_state=100)

if n_topics == 0:
  # Initiate topic model with HDBScan (Automatic topic selection)
  topic_model_params = HDBSCAN(min_cluster_size=min_topic_size,
                               metric='euclidean',
                               cluster_selection_method='eom',
                               prediction_data=True)
else:
  # Initiate topic model with K-means (Manual topic selection)
  topic_model_params = KMeans(n_clusters = n_topics)

# Initiate BERTopic
topic_model = BERTopic(umap_model = umap_model,
                       hdbscan_model = topic_model_params,
                       min_topic_size=min_topic_size,
                       #nr_topics=15,          #<--- Footnote 1
                       n_gram_range=(1,3),
                       language='english',
                       calculate_probabilities=True,
                       verbose=True)



# Footnote 1: `nr_topics` controls the number of topics we want AFTER clustering.
# Leave the line commented out (with the leading hashtag) to use the number of topics returned by the clustering model.
# When using HDBSCAN, `nr_topics` is applied after orphan removal, and there is no guarantee that `nr_topics` is smaller than the number of topics HDBSCAN finds.
# Thus, with HDBSCAN, `nr_topics` means N topics OR FEWER.
# When using KMeans, `nr_topics` can be used to further reduce the number of topics.
# We use the topics as returned by the clustering model, so we do not need to activate it here.
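
The K-means branch above is created without a seed, so cluster labels may differ between runs even though UMAP is seeded. If reproducible topic numbers matter, one option is to pass `random_state` to `KMeans`. The sketch below only demonstrates that effect on random data standing in for the reduced embeddings; the seed value and array shape are arbitrary assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Random points standing in for the 5-dimensional UMAP output
rng = np.random.default_rng(0)
reduced = rng.normal(size=(200, 5))

# Two independent fits with the same seed give identical labels
labels_a = KMeans(n_clusters=4, n_init=10, random_state=100).fit_predict(reduced)
labels_b = KMeans(n_clusters=4, n_init=10, random_state=100).fit_predict(reduced)
same = bool((labels_a == labels_b).all())
print(same)
```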

In [30]:
# Compute topic model
topics, probabilities = topic_model.fit_transform(documents, embeddings['embeddings'])

2026-01-19 15:16:47,276 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2026-01-19 15:16:59,503 - BERTopic - Dimensionality - Completed ✓
2026-01-19 15:16:59,504 - BERTopic - Cluster - Start clustering the reduced embeddings
2026-01-19 15:16:59,607 - BERTopic - Cluster - Completed ✓
2026-01-19 15:16:59,612 - BERTopic - Representation - Fine-tuning topics using representation models.
2026-01-19 15:17:01,646 - BERTopic - Representation - Completed ✓


In [31]:
# Get the list of topics
# Topic = the topic number, ordered from the largest topic.
#         "-1" is the generic (outlier) topic produced by HDBSCAN; documents that fit no topic are aggregated here.
# Count = Number of documents assigned to this topic
# Name = Topic number followed by the top 4 words of the topic, ranked by c-TF-IDF score
# Representation = The list of words representing this topic
# Representative_Docs = A sample of the documents most representative of this topic
tm_summary = topic_model.get_topic_info()
tm_summary

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,0,150,0_the_of_and_production,"[the, of, and, production, to, in, yeast, for,...",[Fruit wine brewing Plus The high content of h...
1,1,122,1_cancer_cells_the_to,"[cancer, cells, the, to, of, and, in, tumor, t...",[Project CARKinos a better way to treat Lung ...
2,2,121,2_the_to_nitrogen_and,"[the, to, nitrogen, and, of, water, in, is, al...",[Smart Algicidal Bacteria to Control Water Blo...
3,3,120,3_light_the_of_and,"[light, the, of, and, to, in, we, cell, is, co...",[Lights Camera Flip Engineering a LightActivat...
4,4,115,4_metal_heavy_heavy metal_metals,"[metal, heavy, heavy metal, metals, the, and, ...",[Nanocrystal Ecoli Flocculation Union Heavy me...
...,...,...,...,...,...
66,66,17,66_clock_melatonin_circadian_the,"[clock, melatonin, circadian, the, insomnia, o...",[Transplanting KaiABC Reduce your jet lag with...
67,67,14,67_crispr_to_gene_drive,"[crispr, to, gene, drive, in, system, we, of, ...",[Bioengineering a mechanism to override plasmi...
68,68,13,68_pfas_pfoa_per and_of pfas,"[pfas, pfoa, per and, of pfas, substances, and...",[Detection and Degradation of Perfluoroalkyl S...
69,69,11,69_chromium_hexavalent chromium_hexavalent_oxy...,"[chromium, hexavalent chromium, hexavalent, ox...",[OxygenMAX Previous work have shown that the e...


In [32]:
# Save the topic model assets
tm_folder_path = f'{ROOT_FOLDER_PATH}/{project_folder}/{settings["metadata"]["analysis_id"]}'

# os.makedirs handles paths with spaces and nested folders, unlike the unquoted shell `mkdir`
os.makedirs(tm_folder_path, exist_ok=True)

tm_summary.to_csv(f'{tm_folder_path}/topic_model_info.csv', index=False)

In [33]:
# Number of topics found
found_topics = max(tm_summary.Topic) + 1
found_topics

71

In [34]:
# Confirm all documents are assigned
sum(tm_summary.Count) == len(corpus)

True

In [36]:
# Get the representative documents for a topic
topic_model.get_representative_docs(0)

['Fruit wine brewing Plus The high content of higher alcohol in beverages can easily lead to symptoms such as headache which is the main reason for the slow drunkenness This project finely regulates the metabolic pathways of higher alcohols in Saccharomyces cerevisiae through genetic engineering to reduce the production of higher alcohols First the intracellular homologous recombination technology was used to knock out the branchedchain amino acid aminotransferase encoded by the BAT gene and at the same time the alcohol acetyltransferase ATF gene was used to replace the position of the BAT gene Secondly the ATF gene is placed on the expression vector of Saccharomyces cerevisiae and then transferred to Saccharomyces cerevisiae to achieve the purpose of reducing the level of higher alcohol in Saccharomyces cerevisiae This project makes the brewed beverage have less high alcohol content so that the drinker is not uncomfortable in the head',
 'Construction of a Saccharomyces cerevisiae cel

In [37]:
# Print the parameters used. (For reporting)
topic_model.get_params()

{'calculate_probabilities': True,
 'ctfidf_model': ClassTfidfTransformer(),
 'embedding_model': None,
 'hdbscan_model': KMeans(n_clusters=71),
 'language': 'english',
 'low_memory': False,
 'min_topic_size': 5,
 'n_gram_range': (1, 3),
 'nr_topics': None,
 'representation_model': None,
 'seed_topic_list': None,
 'top_n_words': 10,
 'umap_model': UMAP(angular_rp_forest=True, metric='cosine', min_dist=0.0, n_components=5, n_jobs=1, random_state=100, tqdm_kwds={'bar_format': '{desc}: {percentage:3.0f}%| {bar} {n_fmt}/{total_fmt} [{elapsed}]', 'desc': 'Epochs completed', 'disable': True}),
 'vectorizer_model': CountVectorizer(ngram_range=(1, 3)),
 'verbose': True,
 'zeroshot_min_similarity': 0.7,
 'zeroshot_topic_list': None}

In [38]:
tm_params = dict(topic_model.get_params())
for key, value in tm_params.items():
    tm_params[key]=  str(value)
with open(f'{tm_folder_path}/topic_model_params.json', 'w') as f:
    json.dump(tm_params, f, ensure_ascii=False, indent=4)
    print('Done')

Done


---
# Centroids

In [43]:
import numpy as np
from sklearn.metrics.pairwise import cosine_distances

def calculate_centroid_scores(embeddings, topics, distance_metric='euclidean'):
    """
    Vectorized calculation of distance from topic centroids.
    Returns normalized scores (0-1) where 0 is the centroid and 1 is far.
    """
    embeddings = np.array(embeddings)
    topics = np.array(topics)
    unique_topics = np.unique(topics)
    
    scores = np.zeros(len(embeddings))
    
    # Calculate centroids and distances
    for topic in unique_topics:
        if topic == -1: continue # Skip outliers for centroid calculation
        
        mask = topics == topic
        topic_embeds = embeddings[mask]
        centroid = np.mean(topic_embeds, axis=0)
        
        if distance_metric == 'euclidean':
            dists = np.linalg.norm(topic_embeds - centroid, axis=1)
        else:
            dists = cosine_distances([centroid], topic_embeds)[0]
            
        # Normalize 0-1 within topic
        if len(dists) > 1 and dists.max() > dists.min():
            scores[mask] = (dists - dists.min()) / (dists.max() - dists.min())
        else:
            scores[mask] = 0 # Single document or identical distances
            
    return scores

In [42]:
# Usage with normalization:
scores = calculate_centroid_scores(embeddings['embeddings'], topics)
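
A toy example may help in reading these scores. The sketch below reproduces the Euclidean branch of the normalization for a single made-up topic; the helper name and the four points are hypothetical.

```python
import numpy as np


def normalized_centroid_distance(points):
    """Min-max normalized Euclidean distance of each point to the group mean."""
    points = np.asarray(points, dtype=float)
    centroid = points.mean(axis=0)
    dists = np.linalg.norm(points - centroid, axis=1)
    spread = dists.max() - dists.min()
    if spread == 0:
        return np.zeros(len(points))
    return (dists - dists.min()) / spread


# Four documents of one topic on a line; the centroid is at x = 2
toy_topic = [[0.0, 0.0], [2.0, 0.0], [5.0, 0.0], [1.0, 0.0]]
toy_scores = normalized_centroid_distance(toy_topic)
print(toy_scores)
```

The raw distances are 2, 0, 3, and 1, so the normalized scores are 2/3, 0, 1, and 1/3. Because normalization happens within each topic, the closest document always scores 0 and the farthest scores 1, which means scores are not directly comparable across topics.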

---
# Coordinates

In [45]:
dataset_clustering_results = topic_model.get_document_info(documents, df = corpus)
dataset_clustering_results.head()

Unnamed: 0,text,UT,uuid,Document,Topic,Name,Representation,Representative_Docs,Top_n_words,Representative_document
0,Ethanol Whey not Cheese whey is classified as ...,173,791aefbd-0fde-417e-8e1b-2f7210647032,Ethanol Whey not Cheese whey is classified as ...,27,27_of_the_and_to,"[of, the, and, to, production, in, lactose, mi...",[The Reform of Two Strains in Yogurt Since lot...,of - the - and - to - production - in - lactos...,False
1,Bacman sequestering cadmium into Bacillus spor...,174,84d56c08-f62f-4bfd-986c-71a44c1af2ed,Bacman sequestering cadmium into Bacillus spor...,4,4_metal_heavy_heavy metal_metals,"[metal, heavy, heavy metal, metals, the, and, ...",[Nanocrystal Ecoli Flocculation Union Heavy me...,metal - heavy - heavy metal - metals - the - a...,False
2,Bacterial Relay Race In our project we aim at ...,175,8a69e820-ef66-4438-8ed2-08943ab2c6b1,Bacterial Relay Race In our project we aim at ...,25,25_the_of_to_in,"[the, of, to, in, and, quorum, system, we, quo...",[A multilayer signalprocessing system based on...,the - of - to - in - and - quorum - system - w...,False
3,E coli Automatic Directed Evolution Machine a ...,176,572f8dd4-bb30-4d59-a319-2a5e650cce90,E coli Automatic Directed Evolution Machine a ...,6,6_of_the_and_to,"[of, the, and, to, for, in, evolution, synthet...",[rEvolver Vivo La Evolution Naturallyoccurrin...,of - the - and - to - for - in - evolution - s...,False
4,BacInVader a new system for cancer genetic th...,177,be21c9a1-0aac-4e87-86f5-f6023d374461,BacInVader a new system for cancer genetic th...,10,10_cancer_tumor_cells_the,"[cancer, tumor, cells, the, to, and, of, thera...",[BLAST Bifidobacterium Longum induced Apoptos...,cancer - tumor - cells - the - to - and - of -...,False


In [47]:
from sklearn.manifold import TSNE

# 6. VISUALIZATION (Switching to t-SNE for the "Map" look)
print("Generating 2D coordinates for visualization...")

# Filter out outliers (-1) to keep the visualization clean
valid_indices = dataset_clustering_results['Topic'] != -1
valid_embeddings = np.array(embeddings['embeddings'])[valid_indices]
df_clean = dataset_clustering_results[valid_indices].reset_index(drop=True)

# --- KEY CHANGE: Use t-SNE instead of UMAP for the layout ---
# metric='cosine' is better for embeddings
# init='pca' is CRITICAL: it gives you the "ball" shape instead of random clusters
# perplexity=30-50: balances local vs global structure
tsne_model = TSNE(
    n_components=2, 
    perplexity=50,        # Higher values = more global structure
    learning_rate='auto',
    init='pca',           # This creates the global "continent" shape
    #n_iter=1000, 
    metric='cosine', 
    random_state=42, 
    n_jobs=-1             # Use all CPU cores
)

coords_2d = tsne_model.fit_transform(valid_embeddings)

# Scale coordinates to fill the frame nicely (optional but helps visualization)
# This creates that "filled frame" look
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(-100, 100))
coords_2d = scaler.fit_transform(coords_2d)


Generating 2D coordinates for visualization...
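
The rescaling step can be checked on a few made-up points. This sketch uses the same `MinMaxScaler(feature_range=(-100, 100))` call as the cell above; only the input coordinates are invented.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Three toy points; the x range is 10 wide and the y range is 20 wide
toy_coords = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
scaled = MinMaxScaler(feature_range=(-100, 100)).fit_transform(toy_coords)
print(scaled)
```

Each axis is scaled independently, so both end up spanning -100 to 100 even though the original ranges differed. That fills the frame, but it also changes the aspect ratio of the t-SNE layout.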


In [49]:
# Add coordinates as columns to df_clean
df_clean['x_coords_tsne'] = coords_2d[:, 0]
df_clean['y_coords_tsne'] = coords_2d[:, 1]

In [48]:
coords_2d

array([[-38.240646 , -27.92819  ],
       [-54.59228  ,  76.619385 ],
       [ 15.818377 , -33.132805 ],
       ...,
       [ 33.482437 ,  24.06134  ],
       [ 62.346188 ,   2.6459503],
       [ 63.095627 , -54.317375 ]], shape=(4199, 2), dtype=float32)

---
# Save

In [58]:
# Save coords. Coords cover only the non-orphan documents.
df_clean[['uuid', 'UT', 'x_coords_tsne', 'y_coords_tsne']].to_csv(f'{tm_folder_path}/document_coords_tsne.csv', index=False)

In [52]:
# Document information. Including the topic assignation
dataset_clustering_results = topic_model.get_document_info(documents, df = corpus, metadata={"Score": scores})

# Check for orphans (Topic == -1), save and remove them
if -1 in dataset_clustering_results['Topic'].values:
    orphans = dataset_clustering_results[dataset_clustering_results['Topic'] == -1]
    orphans.to_csv(f'{tm_folder_path}/orphans.csv', index=False)
    dataset_clustering_results = dataset_clustering_results[dataset_clustering_results['Topic'] != -1]

In [53]:
# Standard format for report analysis
dataset_clustering_results = dataset_clustering_results.reset_index(drop=True)
dataset_clustering_results = dataset_clustering_results.drop(columns=['text'])
dataset_clustering_results['X_E'] = dataset_clustering_results['Score']
dataset_clustering_results['X_C'] = dataset_clustering_results['Topic'] + 1

# Assign 'level0' based on cumulative cluster coverage (`others_threshold`)
cluster_counts = dataset_clustering_results['X_C'].value_counts().sort_values(ascending=False)
total_rows = len(dataset_clustering_results)
cumulative = cluster_counts.cumsum() / total_rows
main_clusters = cluster_counts.index[cumulative <= others_threshold].tolist()
dataset_clustering_results['level0'] = dataset_clustering_results['X_C'].apply(lambda x: x if x in main_clusters else 99999)
dataset_clustering_results.head()


Unnamed: 0,UT,uuid,Document,Topic,Name,Representation,Representative_Docs,Top_n_words,Representative_document,Score,X_E,X_C,level0
0,173,791aefbd-0fde-417e-8e1b-2f7210647032,Ethanol Whey not Cheese whey is classified as ...,27,27_of_the_and_to,"[of, the, and, to, production, in, lactose, mi...",[The Reform of Two Strains in Yogurt Since lot...,of - the - and - to - production - in - lactos...,False,0.203861,0.203861,28,28
1,174,84d56c08-f62f-4bfd-986c-71a44c1af2ed,Bacman sequestering cadmium into Bacillus spor...,4,4_metal_heavy_heavy metal_metals,"[metal, heavy, heavy metal, metals, the, and, ...",[Nanocrystal Ecoli Flocculation Union Heavy me...,metal - heavy - heavy metal - metals - the - a...,False,0.4909,0.4909,5,5
2,175,8a69e820-ef66-4438-8ed2-08943ab2c6b1,Bacterial Relay Race In our project we aim at ...,25,25_the_of_to_in,"[the, of, to, in, and, quorum, system, we, quo...",[A multilayer signalprocessing system based on...,the - of - to - in - and - quorum - system - w...,False,0.079969,0.079969,26,26
3,176,572f8dd4-bb30-4d59-a319-2a5e650cce90,E coli Automatic Directed Evolution Machine a ...,6,6_of_the_and_to,"[of, the, and, to, for, in, evolution, synthet...",[rEvolver Vivo La Evolution Naturallyoccurrin...,of - the - and - to - for - in - evolution - s...,False,0.313584,0.313584,7,7
4,177,be21c9a1-0aac-4e87-86f5-f6023d374461,BacInVader a new system for cancer genetic th...,10,10_cancer_tumor_cells_the,"[cancer, tumor, cells, the, to, and, of, thera...",[BLAST Bifidobacterium Longum induced Apoptos...,cancer - tumor - cells - the - to - and - of -...,False,0.292083,0.292083,11,11


In [54]:
# Standard format for report analysis
# Both flags mark documents that fall in the "others" group (level0 == 99999)
dataset_clustering_results['cl99'] = dataset_clustering_results['level0'] == 99999
dataset_clustering_results['cl-99'] = dataset_clustering_results['level0'] == 99999
dataset_clustering_results.head()

Unnamed: 0,UT,uuid,Document,Topic,Name,Representation,Representative_Docs,Top_n_words,Representative_document,Score,X_E,X_C,level0,cl99,cl-99
0,173,791aefbd-0fde-417e-8e1b-2f7210647032,Ethanol Whey not Cheese whey is classified as ...,27,27_of_the_and_to,"[of, the, and, to, production, in, lactose, mi...",[The Reform of Two Strains in Yogurt Since lot...,of - the - and - to - production - in - lactos...,False,0.203861,0.203861,28,28,False,False
1,174,84d56c08-f62f-4bfd-986c-71a44c1af2ed,Bacman sequestering cadmium into Bacillus spor...,4,4_metal_heavy_heavy metal_metals,"[metal, heavy, heavy metal, metals, the, and, ...",[Nanocrystal Ecoli Flocculation Union Heavy me...,metal - heavy - heavy metal - metals - the - a...,False,0.4909,0.4909,5,5,False,False
2,175,8a69e820-ef66-4438-8ed2-08943ab2c6b1,Bacterial Relay Race In our project we aim at ...,25,25_the_of_to_in,"[the, of, to, in, and, quorum, system, we, quo...",[A multilayer signalprocessing system based on...,the - of - to - in - and - quorum - system - w...,False,0.079969,0.079969,26,26,False,False
3,176,572f8dd4-bb30-4d59-a319-2a5e650cce90,E coli Automatic Directed Evolution Machine a ...,6,6_of_the_and_to,"[of, the, and, to, for, in, evolution, synthet...",[rEvolver Vivo La Evolution Naturallyoccurrin...,of - the - and - to - for - in - evolution - s...,False,0.313584,0.313584,7,7,False,False
4,177,be21c9a1-0aac-4e87-86f5-f6023d374461,BacInVader a new system for cancer genetic th...,10,10_cancer_tumor_cells_the,"[cancer, tumor, cells, the, to, and, of, thera...",[BLAST Bifidobacterium Longum induced Apoptos...,cancer - tumor - cells - the - to - and - of -...,False,0.292083,0.292083,11,11,False,False


In [55]:
dataset_clustering_results.level0.value_counts()

level0
1     150
2     122
3     121
4     120
5     115
     ... 
63     20
64     19
65     18
66     18
67     17
Name: count, Length: 68, dtype: int64

In [56]:
# Save the dataframe
dataset_clustering_results.to_csv(f'{tm_folder_path}/dataset_minimal.csv', index=False)

In [57]:
# Save the topic model
topic_model.save(f'{tm_folder_path}/topic_model_object.pck')



In [59]:
df_clean['uuid'].isin(dataset_clustering_results['uuid'])

0       True
1       True
2       True
3       True
4       True
        ... 
4194    True
4195    True
4196    True
4197    True
4198    True
Name: uuid, Length: 4199, dtype: bool
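
The element-wise output above is hard to scan for a single `False`. A compact alternative, sketched here on two hypothetical frames, reduces the check to one boolean and lists any IDs that have coordinates but no clustering result.

```python
import pandas as pd

# Toy stand-ins for df_clean and dataset_clustering_results
coords = pd.DataFrame({"uuid": ["a", "b", "c"]})
results = pd.DataFrame({"uuid": ["a", "c"]})

matched = coords["uuid"].isin(results["uuid"])
all_matched = bool(matched.all())
unmatched_ids = coords.loc[~matched, "uuid"].tolist()
print(all_matched, unmatched_ids)
```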