# **Unlabeled dataset**

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sys
# sys.path entries must be directories, so point at the folder that contains utils.py
path_to_module = '/Users/cesaraugustoseminarioyrigoyen/Documents/CORSI/DATA_SCIENCE_POLI/3_year/Applied Data science/P8-project'
sys.path.append(path_to_module)
import utils

#P8 polito github - project import
#!git clone https://github.com/adsp-polito/2024-P8-PPS.git
unlabeled_dataset=pd.read_excel('interventions_not_labeled.xlsx').drop(columns=['Unnamed: 0'])

In [2]:
print(f"size of dataset: {unlabeled_dataset.shape}")
unlabeled_dataset.head(3)

size of dataset: (1006, 8)


Unnamed: 0,Title,Abstract,telemedicine,imaging,surgery,drug,screening,device
0,design feature participant characteristic infl...,research participate tended limited single ind...,0,0,0,0,0,0
1,value meaning rural primary practice implicati...,understand unique perspective value motivate c...,0,0,0,0,0,0
2,cultural influence asian american metasynthesis,summarize asian american negotiate involvement...,0,0,0,0,0,0


In [3]:
stats=utils.show_stats_of_titles_abstracts(unlabeled_dataset)

Average Title Length: 6.86 words
Average Abstract Length: 115.66 words
Number of rows with Anomalies in the Title: 14
Number of rows with Anomalies in the Abstract: 48


# **Topic modeling BERTopic**

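We start by installing BERTopic from PyPI (the `pip install` commands are left commented out in the import cell below) and importing the libraries we need: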

In [None]:
from sentence_transformers import SentenceTransformer
from transformers import pipeline
s_pipe=pipeline("sentiment-analysis",model="distilbert/distilbert-base-uncased-finetuned-sst-2-english")

In [4]:
# %%capture
# !pip install bertopic
# !pip install datasets
# !pip install openai
import tqdm as notebook_tqdm
from tqdm.autonotebook import tqdm, trange
import pandas as pd
import numpy as np
import time as Time

# **Importing Data**

*   Data is imported from the github repository [P8 polito repository](https://github.com/adsp-polito/2024-P8-PPS).

*   Embeddings are already calculated using [Pubmedbert embedding model](https://huggingface.co/NeuML/pubmedbert-base-embeddings).




In [15]:
papers=(unlabeled_dataset['Title'] + " " + unlabeled_dataset['Abstract']).to_list()
print(f"{len(papers)} papers loaded")

1006 papers loaded


# ***BERT Pipeline***

BERTopic can be viewed as a sequence of steps to create its topic representations. There are five steps to this process:

![https://maartengr.github.io/BERTopic/algorithm/default.svg](https://maartengr.github.io/BERTopic/algorithm/default.svg)

The pipeline above implies significant modularity of BERTopic:

 ![https://maartengr.github.io/BERTopic/algorithm/modularity.svg](https://maartengr.github.io/BERTopic/algorithm/modularity.svg)

## **Embeddings**
The embeddings were calculated in the binary classifier step. We load them into the ***embeddings*** variable.

In [5]:
embedding_model = 'neuml/pubmedbert-base-embeddings'
papers,embeddings=utils.embedd(unlabeled_dataset,embedding_model)

Count NaN values:  62
Count empty values (NaN o empty strings): 62


Batches:   0%|          | 0/32 [00:00<?, ?it/s]

## **Dimensionality reduction**
*   Reduce the dimensionality of the embeddings before clustering, since distance-based clustering suffers from the [curse of dimensionality](https://en.wikipedia.org/wiki/Curse_of_dimensionality).

*   We use BERTopic's default reducer, [UMAP](https://github.com/lmcinnes/umap), with `random_state=42` for repeatability.



In [6]:
from umap import UMAP

umap_model = UMAP(n_neighbors=30, n_components=5, min_dist=0.0, metric='cosine', random_state=42)

## **Clustering**
- `nr_topics` directly controls the number of topics, but only **after** they have been created.

- `min_cluster_size` indirectly controls the number of topics that will be created (this is the recommended parameter to tune).


In [7]:
from hdbscan import HDBSCAN

hdbscan_model = HDBSCAN(min_cluster_size=20, metric='euclidean', cluster_selection_method='eom', prediction_data=True)

## **Tokenizer**
*   Default representation of topics is calculated through [c-TF-IDF](https://maartengr.github.io/BERTopic/algorithm/algorithm.html#5-topic-representation).
*   c-TF-IDF is powered by the [CountVectorizer](https://maartengr.github.io/BERTopic/getting_started/vectorizers/vectorizers.html), which converts text into tokens. With the CountVectorizer we can remove stop words and ignore infrequent words, which improves the default representation.
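
To make the c-TF-IDF weighting concrete, here is a minimal pure-Python sketch of the formula given in the BERTopic documentation, `W(x, c) = tf(x, c) * log(1 + A / f(x))`, where `tf(x, c)` is the L1-normalized frequency of term `x` in class `c`, `f(x)` is the frequency of `x` across all classes, and `A` is the average number of words per class. The function name `c_tf_idf` and the toy token lists are illustrative assumptions; BERTopic itself computes this on the sparse matrix produced by the CountVectorizer.

```python
import math
from collections import Counter

def c_tf_idf(class_tokens):
    """class_tokens maps a class (topic) id to the tokens of all its documents joined together."""
    tf = {c: Counter(tokens) for c, tokens in class_tokens.items()}

    # f(x): frequency of each term across all classes
    total = Counter()
    for counts in tf.values():
        total.update(counts)

    # A: average number of tokens per class
    avg_words = sum(len(tokens) for tokens in class_tokens.values()) / len(class_tokens)

    scores = {}
    for c, counts in tf.items():
        n_tokens = sum(counts.values())
        scores[c] = {
            term: (count / n_tokens) * math.log(1 + avg_words / total[term])
            for term, count in counts.items()
        }
    return scores

# Toy classes (hypothetical, not taken from the dataset)
toy = {
    0: "cancer breast cancer oncology patient".split(),
    1: "pregnancy woman contraceptive patient".split(),
}
scores = c_tf_idf(toy)
top_term_class_0 = max(scores[0], key=scores[0].get)
print(top_term_class_0)
```

Terms that occur in several classes, such as `patient` in the toy data, receive a lower weight than terms concentrated in one class, which is the effect that makes the topic keywords distinctive.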



In [8]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer_model = CountVectorizer(stop_words="english", min_df=2, ngram_range=(1, 2))

## **Representation tuning**
- BERTopic offers [other topic representations](https://maartengr.github.io/BERTopic/getting_started/representation/representation.html), such as [KeyBERTInspired](https://maartengr.github.io/BERTopic/getting_started/representation/representation.html#keybertinspired) and [PartOfSpeech](https://maartengr.github.io/BERTopic/getting_started/representation/representation.html#partofspeech), as well as [OpenAI's ChatGPT](https://maartengr.github.io/BERTopic/getting_started/representation/llm.html#chatgpt) and [open-source](https://maartengr.github.io/BERTopic/getting_started/representation/llm.html#langchain) alternatives.

- In BERTopic, you can model many different topic representations simultaneously. This is called [multi-aspect](https://maartengr.github.io/BERTopic/getting_started/multiaspect/multiaspect.html) topic modeling.


### LLM representations

#### *Zephyr*

In [None]:
#!pip install ctransformers[cuda]
#!pip install --upgrade git+https://github.com/huggingface/transformers
#load a quantized model which is a compressed version of the original model
from ctransformers import AutoModelForCausalLM
from transformers import AutoTokenizer, pipeline

In [None]:
# Set gpu_layers to the number of layers to offload to GPU. Set to 0 if no GPU acceleration is available on your system.
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/zephyr-7B-alpha-GGUF",
    model_file="zephyr-7b-alpha.Q4_K_M.gguf",
    model_type="mistral",
    gpu_layers=50,
    hf=True
)
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-alpha")

# Pipeline
generator = pipeline(
    model=model, tokenizer=tokenizer,
    task='text-generation',
    max_new_tokens=50,
    repetition_penalty=1.1
)


In [None]:
prompt = """<|system|>You are a helpful, respectful and honest assistant for labeling topics.</s>
<|user|>
I have a topic that contains the following documents:
[DOCUMENTS]

The topic is described by the following keywords: '[KEYWORDS]'.

Based on the information about the topic above, please create a short label of this topic. Make sure you only return the label and nothing more.</s>
<|assistant|>"""

In [None]:
from bertopic.representation import TextGeneration

# Text generation with Zephyr
zephyr_model = TextGeneration(generator, prompt=prompt)


#### *Llama*

In [10]:
#need to login with huggingface token
from huggingface_hub import notebook_login
notebook_login()
from torch import cuda

model_id = 'meta-llama/Llama-2-7b-chat-hf'
device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

print(device)

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

cpu


In [None]:
#!pip install accelerate bitsandbytes xformers adjustText
#!pip install -U bitsandbytes
from torch import bfloat16
import transformers

# set quantization configuration to load large model with less GPU memory
# this requires the `bitsandbytes` library
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,  # 4-bit quantization
    bnb_4bit_quant_type='nf4',  # Normalized float 4
    bnb_4bit_use_double_quant=True,  # Second quantization after the first
    bnb_4bit_compute_dtype=bfloat16  # Computation type
)

# Llama 2 Tokenizer
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)

# Llama 2 Model
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    quantization_config=bnb_config,
    device_map='auto',
)
model.eval()

# Our text generator
generator = transformers.pipeline(
    model=model, tokenizer=tokenizer,
    task='text-generation',
    temperature=0.1,
    max_new_tokens=500,
    repetition_penalty=1.1
)


In [None]:
# System prompt describes information given to all conversations
system_prompt = """
<s>[INST] <<SYS>>
You are a helpful, respectful and honest assistant for labeling topics.
<</SYS>>
"""

# Example prompt demonstrating the output we are looking for
example_prompt = """
I have a topic that contains the following documents:
- Traditional diets in most cultures were primarily plant-based with a little meat on top, but with the rise of industrial style meat production and factory farming, meat has become a staple food.
- Meat, but especially beef, is the worst food in terms of emissions.
- Eating meat doesn't make you a bad person, not eating meat doesn't make you a good one.

The topic is described by the following keywords: 'meat, beef, eat, eating, emissions, steak, food, health, processed, chicken'.

Based on the information about the topic above, please create a short label of this topic. Make sure you only return the label and nothing more.

[/INST] Environmental impacts of eating meat
"""

# Our main prompt with documents ([DOCUMENTS]) and keywords ([KEYWORDS]) tags
main_prompt = """
[INST]
I have a topic that contains the following documents:
[DOCUMENTS]

The topic is described by the following keywords: '[KEYWORDS]'.

Based on the information about the topic above, please create a short label of this topic. Make sure you only return the label and nothing more.
[/INST]
"""

prompt = system_prompt + example_prompt + main_prompt


In [None]:
from bertopic.representation import TextGeneration
#from bertopic import BERTopic

# Text generation with Llama 2
llama2_model = TextGeneration(generator, prompt=prompt)


In [None]:
#llama_model = LlamaCPP("/content/zephyr-7b-alpha.Q4_K_M.gguf")


#### *OpenAI*

In [None]:
import openai  # needed for openai.OpenAI(...) in the next cell
from bertopic.representation import OpenAI

In [None]:
# GPT-3.5
prompt = """
I have a topic that contains the following documents:
[DOCUMENTS]
The topic is described by the following keywords: [KEYWORDS]

Based on the information above, extract a short but highly descriptive topic label of at most 5 words. Make sure it is in the following format:
topic: <topic label>
"""
client = openai.OpenAI(api_key="sk-...")
openai_model = OpenAI(client, model="gpt-3.5-turbo",exponential_backoff=True, chat=True, prompt=prompt)

### Other representations

In [9]:
#!pip install typing-extensions --upgrade

import openai
from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance, PartOfSpeech

# KeyBERT
keybert_model = KeyBERTInspired()

# Part-of-Speech
pos_model = PartOfSpeech("en_core_web_sm")

# MMR
mmr_model = MaximalMarginalRelevance(diversity=0.3)



# All representation models
representation_model = {
    "KeyBERT": keybert_model,
    #"OpenAI": openai_model,  # Uncomment if you will use OpenAI
    "MMR": mmr_model,
    "POS": pos_model,
    #"LLama2": llama2_model,
    #"Zephyr":zephyr_model
}
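
As a rough illustration of what `MaximalMarginalRelevance(diversity=0.3)` is trading off, the sketch below picks keywords greedily by `(1 - diversity) * relevance - diversity * redundancy`, using cosine similarities. It is a simplified version of the idea, not BERTopic's internal code; the helper name `mmr_select` and the 2-dimensional toy vectors are assumptions made for the example.

```python
import numpy as np

def mmr_select(topic_vec, word_vecs, words, top_n=5, diversity=0.3):
    """Greedily pick words that are relevant to the topic but not redundant with each other."""
    topic = np.asarray(topic_vec, dtype=float)
    vecs = np.asarray(word_vecs, dtype=float)
    topic = topic / np.linalg.norm(topic)
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

    relevance = vecs @ topic   # cosine similarity between each word and the topic
    pairwise = vecs @ vecs.T   # cosine similarity between every pair of words

    selected = [int(np.argmax(relevance))]  # start with the most relevant word
    candidates = [i for i in range(len(words)) if i not in selected]
    while candidates and len(selected) < top_n:
        # redundancy: highest similarity to any word that is already selected
        redundancy = pairwise[candidates][:, selected].max(axis=1)
        scores = (1 - diversity) * relevance[candidates] - diversity * redundancy
        best = candidates[int(np.argmax(scores))]
        selected.append(best)
        candidates.remove(best)
    return [words[i] for i in selected]

# Hypothetical toy embeddings: "breast cancer" is almost identical to "cancer"
words = ["cancer", "breast cancer", "survivor"]
word_vecs = [[1.0, 0.0], [0.99, 0.14], [0.6, 0.8]]
topic_vec = [1.0, 0.0]

no_diversity = mmr_select(topic_vec, word_vecs, words, top_n=2, diversity=0.0)
high_diversity = mmr_select(topic_vec, word_vecs, words, top_n=2, diversity=0.9)
print(no_diversity, high_diversity)
```

With `diversity=0.0` the selection is purely by relevance, while a high diversity value prefers the less similar word as the second keyword.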

## **Training**
If you want to iterate over the topic model, it is advised to use the pre-calculated embeddings, as that significantly speeds up training.

In [10]:
from bertopic import BERTopic

topic_model = BERTopic(

  # Pipeline models
  embedding_model=embedding_model,
  umap_model=umap_model,
  hdbscan_model=hdbscan_model,
  vectorizer_model=vectorizer_model,
  representation_model=representation_model,

  # Hyperparameters
  top_n_words=10,
  verbose=True
)

topics, probs = topic_model.fit_transform(papers, embeddings)

2025-01-06 16:25:02,181 - BERTopic - Reduced dimensionality
2025-01-06 16:25:02,194 - BERTopic - Clustered reduced embeddings


In [11]:
topic_model.get_topic_info()

Unnamed: 0,Topic,Count,Name,Representation,KeyBERT,MMR,POS,Representative_Docs
0,-1,290,-1_service_medical_used_data,"[service, medical, used, data, time, research,...","[respondent, utility, measure, attribute, cost...","[service, medical, used, data, time, research,...","[service, medical, data, time, research, attri...",[latent class model heterogeneity latent class...
1,0,148,0_family_endoflife_home_caregiver,"[family, endoflife, home, caregiver, advance, ...","[nursing home, end life, advance planning, eld...","[family, endoflife, home, caregiver, advance, ...","[family, endoflife, home, caregiver, advance, ...",[unpacking impact adult home death family care...
2,1,123,1_cancer_breast_information_breast cancer,"[cancer, breast, information, breast cancer, o...","[breast cancer, cancer survivor, lung cancer, ...","[cancer, breast, information, breast cancer, o...","[cancer, breast, information, oncology, role, ...",[understanding value regarding early stage lun...
3,2,85,2_attribute_method_dce_healthcare,"[attribute, method, dce, healthcare, data, exp...","[experiment dces, dces, technology assessment,...","[attribute, method, dce, healthcare, data, exp...","[attribute, method, healthcare, data, experime...",[novel design process selection attribute incl...
4,3,63,3_colleague_lesson_routine practice_say,"[colleague, lesson, routine practice, say, nh,...","[, , , , , , , , , ]","[colleague, lesson, routine practice, say, nh,...","[lesson, routine practice, right, shift, progr...","[ , , implementing nh lesson magic programme ..."
5,4,51,4_sdm_clinician_practice_physician,"[sdm, clinician, practice, physician, option, ...","[sdm, practice, clinician, physician, provider...","[sdm, clinician, practice, physician, option, ...","[sdm, clinician, practice, physician, option, ...",[assessing option gridxae practicability feasi...
6,5,47,5_woman_pregnancy_contraceptive_attribute,"[woman, pregnancy, contraceptive, attribute, m...","[pregnancy, woman, mother, fertility, pregnant...","[woman, pregnancy, contraceptive, attribute, m...","[woman, pregnancy, contraceptive, attribute, m...",[woman attribute firsttrimester miscarriage ma...
7,6,45,6_tto_state_utility_value,"[tto, state, utility, value, time, tradeoff, v...","[state valuation, utility value, valuation, st...","[tto, state, utility, value, time, tradeoff, v...","[tto, state, utility, value, time, tradeoff, v...",[correcting value influence importance correct...
8,7,43,7_mental_depression_service_sdm,"[mental, depression, service, sdm, consumer, s...","[mental, schizophrenia, depression, sdm, psych...","[mental, depression, service, sdm, consumer, s...","[mental, depression, service, sdm, consumer, u...",[family involvement consumer serious mental il...
9,8,34,8_prostate_prostate cancer_cancer_men,"[prostate, prostate cancer, cancer, men, decis...","[prostate cancer, prostate, localized prostate...","[prostate, prostate cancer, cancer, men, decis...","[prostate, cancer, men, decisional, da, decisi...",[voice methodology novel mixedmethods approach...


In [12]:
#Get all representations for a single topic

topic_model.get_topic(1, full=True)

{'Main': [('cancer', 0.07616497273631591),
  ('breast', 0.022842168597903233),
  ('information', 0.021877162910885994),
  ('breast cancer', 0.020457208742065674),
  ('oncology', 0.019214844439836235),
  ('role', 0.017607622295247),
  ('sdm', 0.01755119865155897),
  ('lung', 0.017230621023265575),
  ('need', 0.01709343257243929),
  ('lung cancer', 0.016797490220192786)],
 'KeyBERT': [('breast cancer', 0.5204103),
  ('cancer survivor', 0.51313436),
  ('lung cancer', 0.47941023),
  ('cancer', 0.46110094),
  ('oncology', 0.45552957),
  ('breast', 0.45364767),
  ('oncologist', 0.4379478),
  ('palliative', 0.3608046),
  ('sdm', 0.36003447),
  ('survivor', 0.334724)],
 'MMR': [('cancer', 0.07616497273631591),
  ('breast', 0.022842168597903233),
  ('information', 0.021877162910885994),
  ('breast cancer', 0.020457208742065674),
  ('oncology', 0.019214844439836235),
  ('role', 0.017607622295247),
  ('sdm', 0.01755119865155897),
  ('lung', 0.017230621023265575),
  ('need', 0.01709343257243929),


**NOTE**: The labels generated by OpenAI's **ChatGPT** are especially useful as human-readable topic names throughout your model. They can be set as custom labels with `.set_topic_labels` once the `"OpenAI"` entry in `representation_model` is uncommented.
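
A minimal sketch of how one representation could be turned into custom labels. The helper `build_custom_labels` is hypothetical; it expects a mapping shaped like `topic_model.topic_aspects_[name]` (topic id to a list of `(word, score)` pairs). The toy input reuses the KeyBERT words of topic 1 shown above, with rounded scores.

```python
def build_custom_labels(aspect, n_words=3, outlier_label="Outlier Topic"):
    """Join the first `n_words` non-empty words of every topic into a label."""
    labels = {}
    for topic, pairs in aspect.items():
        words = [word for word, _ in pairs if word][:n_words]
        labels[topic] = " | ".join(words)
    labels[-1] = outlier_label  # give the outlier topic a readable name
    return labels

# Toy aspect with a single topic (scores rounded from the output above)
keybert_aspect = {
    1: [("breast cancer", 0.52), ("cancer survivor", 0.51),
        ("lung cancer", 0.48), ("cancer", 0.46)],
}
labels = build_custom_labels(keybert_aspect)
print(labels)

# With the fitted model (not run here):
# topic_model.set_topic_labels(build_custom_labels(topic_model.topic_aspects_["KeyBERT"]))
```

Once labels are set this way, the `custom_labels=True` argument of the visualization methods picks them up.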

**🔥 Tip - Parameters 🔥**
***
If you would like to return the topic-document probability matrix, then it is advised to use `calculate_probabilities=True`. Do note that this can significantly slow down training. To speed it up, use [cuML's HDBSCAN](https://maartengr.github.io/BERTopic/getting_started/clustering/clustering.html#cuml-hdbscan) instead. You could also approximate the topic-document probability matrix with `.approximate_distribution` which will be discussed later.
***

## **Topic-Document Distribution**
If using `calculate_probabilities=True` is not possible, then you can [approximate the topic-document distributions](https://maartengr.github.io/BERTopic/getting_started/distribution/distribution.html) using `.approximate_distribution`. It is a fast and flexible method for creating different topic-document distributions.

In [13]:
# `topic_distr` contains the distribution of topics in each document
topic_distr, _ = topic_model.approximate_distribution(papers, window=8, stride=4)

100%|██████████| 2/2 [00:00<00:00, 10.27it/s]
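
The `window` and `stride` arguments control how each document is cut into overlapping token sets before every set is compared with the topics. The sketch below shows only that slicing step, in simplified form; `token_windows` is a hypothetical helper using whitespace tokenization, not BERTopic's internal implementation.

```python
def token_windows(text, window=8, stride=4):
    """Split a document into overlapping token sets of length `window`, moving `stride` tokens at a time."""
    tokens = text.split()
    if len(tokens) <= window:
        return [tokens]
    return [tokens[i:i + window] for i in range(0, len(tokens) - window + 1, stride)]

# First 12 tokens of the abstract used in the next cells
doc = ("consent ethic blood management goal blood management pbm "
       "optimize outcome individual managing")
sets = token_windows(doc, window=8, stride=4)
for token_set in sets:
    print(token_set)
```

Each token set receives its own topic similarities, and the per-document distribution is built by combining them, so a smaller `stride` gives a finer-grained but slower approximation.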


Next, let's take a look at a specific abstract and see how the topic distribution was extracted:

In [14]:
abstract_id = 10
print(papers[abstract_id])

consent ethic blood management goal blood management pbm optimize outcome individual managing blood precious unique resource safeguarded managed judiciously corollary successful pbm minimization avoidance blood transfusion stewardship donated blood first achieved multidisciplinary approach personalized management plan decided substitute follows physicianpatient relationship integral component medical practice fundamental link doctor based trust honest communication central pbm accurate timely diagnosis based sound physiology pathophysiology bedrock scientifically based medicine founded pbm context start question status blood specific abnormality blood managed allogeneic blood transfusion considered reasonable alternative compelling scientific reason implement nontransfusion default position uncertainty questionable evidence efficacy allogeneic blood transfusion due known potential hazard must informed diagnosis nature severity prognosis option along risk benefit involved regarding mana

In [None]:
# Visualize the topic-document distribution for a single document
topic_model.visualize_distribution(topic_distr[abstract_id])

It seems to have extracted a number of topics that are relevant and shows the distributions of these topics across the abstract. We can go one step further and visualize them on a token-level:

In [16]:
# Calculate the topic distributions on a token-level
topic_distr, topic_token_distr = topic_model.approximate_distribution(papers[abstract_id], calculate_tokens=True)

# Visualize the token-level distributions
df = topic_model.visualize_approximate_distribution(papers[abstract_id], topic_token_distr[0])
df

100%|██████████| 1/1 [00:00<00:00, 271.30it/s]


Unnamed: 0,consent,ethic,blood,management,goal,blood.1,management.1,pbm,optimize,outcome,individual,managing,blood.2,precious,unique,resource,safeguarded,managed,judiciously,corollary,successful,pbm.1,minimization,avoidance,blood.3,transfusion,stewardship,donated,blood.4,first,achieved,multidisciplinary,approach,personalized,management.2,plan,decided,substitute,follows,physicianpatient,relationship,integral,component,medical,practice,fundamental,link,doctor,based,trust,honest,communication,central,pbm.2,accurate,timely,diagnosis,based.1,sound,physiology,pathophysiology,bedrock,scientifically,based.2,medicine,founded,pbm.3,context,start,question,status,blood.5,specific,abnormality,blood.6,managed.1,allogeneic,blood.7,transfusion.1,considered,reasonable,alternative,compelling,scientific,reason,implement,nontransfusion,default,position,uncertainty,questionable,evidence,efficacy,allogeneic.1,blood.8,transfusion.2,due,known,potential,hazard,must,informed,diagnosis.1,nature,severity,prognosis,option,along,risk,benefit,involved,regarding,management.3,however,part,process,multifaceted,medical.1,legal,ethical,economic,issue,encompassing,informed.1,consent.1,furthermore,variability,circumstance,complexity,medical.2,science,working,system,consent.2,take,place,bewildering,also,clinician,obtaining,consent.3,adding,concept,blood.9,management.4,differentiates,donor,blood.10,management.5,avoid,confusion,perception,pbm.4,specific.1,medical.3,intervention,personalized.1,pbm.5,tailoring,pbm.6,specific.2,characteristic,approach.1,difficulty,addressing,informed.2,consent.4,ethical.1,aspect,pbm.7,usually,reassured,nothing,order,blood.11,case,focus,pbm.8,keep,way,circumstance.1,hematologist,involved.1,blood.12,advocate,abnormality.1,require,expert,involvement,primary,managed.2
0_family_endoflife_home_caregiver,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.107,0.107,0.107,0.107,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4_sdm_clinician_practice_physician,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.122,0.122,0.122,0.122,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5_woman_pregnancy_contraceptive_attribute,0.0,0.0,0.0,0.103,0.103,0.103,0.103,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.132,0.132,0.132,0.132,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


**🔥 Tip - `use_embedding_model` 🔥**
***
As a default, we compare the c-TF-IDF calculations between the token sets and all topics. Due to its bag-of-word representation, this is quite fast. However, you might want to use the selected embedding_model instead to do this comparison. Do note that due to the many token sets, it is often computationally quite a bit slower:

```python
topic_distr, _ = topic_model.approximate_distribution(papers, use_embedding_model=True)
```
***




## **Outlier Reduction**
By default, HDBSCAN generates outliers which is a helpful mechanic in creating accurate topic representations. However, you might want to assign every single document to a topic. We can use `.reduce_outliers` to map some or all outliers to a topic:

In [None]:
# Option 1: reduce outliers with the default strategy
new_topics = topic_model.reduce_outliers(papers, topics)

# Option 2: reduce outliers with the pre-calculated embeddings instead
# (this overwrites the result of option 1)
new_topics = topic_model.reduce_outliers(papers, topics, strategy="embeddings", embeddings=embeddings)

100%|██████████| 1/1 [00:03<00:00,  3.74s/it]
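
Conceptually, the `"embeddings"` strategy compares each outlier document with the topics by cosine similarity and assigns it to the closest one. The sketch below illustrates that idea on toy 2-dimensional vectors. It is a simplification under stated assumptions: `reassign_outliers` is a hypothetical helper, and topic embeddings are approximated here by the normalized mean of each topic's member embeddings.

```python
import numpy as np

def reassign_outliers(embeddings, topics, threshold=0.0):
    """Move each outlier (-1) to the most similar topic if the similarity reaches `threshold`."""
    embeddings = np.asarray(embeddings, dtype=float)
    topics = np.asarray(topics)
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

    topic_ids = sorted(t for t in set(topics.tolist()) if t != -1)
    # Assumption: a topic embedding is the normalized mean of its members
    centroids = np.stack([unit[topics == t].mean(axis=0) for t in topic_ids])
    centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)

    new_topics = topics.copy()
    for i in np.where(topics == -1)[0]:
        sims = centroids @ unit[i]
        best = int(np.argmax(sims))
        if sims[best] >= threshold:
            new_topics[i] = topic_ids[best]
    return new_topics.tolist()

# Toy data: the last document is an outlier that lies close to topic 0
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.8, 0.3]]
toy_topics = [0, 0, 1, 1, -1]

reassigned = reassign_outliers(toy_embeddings, toy_topics)
strict = reassign_outliers(toy_embeddings, toy_topics, threshold=0.999)
print(reassigned, strict)
```

Raising the threshold keeps weakly matching documents as outliers, which is the trade-off between coverage and topic purity.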


**💡  NOTE - Update Topics with Outlier Reduction 💡**
***
After having generated updated topic assignments, we can pass them to BERTopic in order to update the topic representations:

```python
topic_model.update_topics(papers, topics=new_topics)
```

It is important to realize that updating the topics this way may lead to errors if topic reduction or topic merging techniques are used afterwards. The reason is that once one -1 document has been assigned to topic 1 and another -1 document to topic 2, it is unclear how the -1 topic should be mapped: to topic 1 or to topic 2?
***

## **Visualize Topics**

With visualizations, we are closing in on the realm of subjective "best practices". These are things that I generally do because I like the representations, but your experience might differ.

Having said that, there are two visualizations that are my go-to when visualizing the topics themselves:

* `topic_model.visualize_topics()`
* `topic_model.visualize_hierarchy()`

In [17]:
topic_model.visualize_topics(custom_labels=True)

In [18]:
topic_model.visualize_hierarchy(custom_labels=True)

## **Visualize Documents**

When visualizing documents, it helps to have embedded the documents beforehand to speed up computation. Fortunately, we have already done that as a "best practice".

Visualizing documents in 2-dimensional space helps in understanding the underlying structure of the documents and topics.

In [19]:
# Reduce dimensionality of embeddings, this step is optional but much faster to perform iteratively:
reduced_embeddings = UMAP(n_neighbors=10, n_components=2, min_dist=0.0, metric='cosine').fit_transform(embeddings)

The following plot is **interactive** which means that you can zoom in, double click on a label to only see that one and generally interact with the plot:

In [20]:
# Visualize the documents in 2-dimensional space; hovering over a point shows the document text (title + abstract)
# NOTE: You can hide the hover with `hide_document_hover=True` which is especially helpful if you have a large dataset
topic_model.visualize_documents(papers, reduced_embeddings=reduced_embeddings, custom_labels=True)

In [21]:
# We can also hide the annotations to get a clearer overview of the topics
topic_model.visualize_documents(papers, reduced_embeddings=reduced_embeddings, custom_labels=True, hide_annotations=True)

**💡  NOTE - 2-dimensional space 💡**
***
Although visualizing the documents in 2-dimensional space gives an idea of their underlying structure, there is a risk involved.

Visualizing the documents in 2-dimensional space means that we have lost significant information, since the original embeddings have 768 dimensions (the output size of the PubMedBERT-based model used here). Condensing all that information into 2 dimensions is simply not possible. In other words, it is merely an **approximation**, albeit quite an accurate one.
***

## **Serialization**

There are several ways to save a BERTopic model: you can serialize it with `pickle`, `pytorch`, or `safetensors`.

Personally, I would advise going with `safetensors` whenever possible. The reason for this is that the format allows for a very small topic model to be saved and shared.

When saving a model with `safetensors`, it skips over saving the dimensionality reduction and clustering models. The `.transform` function will still work without these models but instead assign topics based on the similarity between document embeddings and the topic embeddings.

As a result, the `.transform` step might give different results but it is generally worth it considering the smaller and significantly faster model.

In [None]:
embedding_model = "neuml/pubmedbert-base-embeddings"  # same model that produced the embeddings used for training
topic_model.save("my_model_dir", serialization="safetensors", save_ctfidf=True, save_embedding_model=embedding_model)

**💡  NOTE - Embedding Model 💡**
***
Using `safetensors`, we are not saving the underlying embedding model but merely a pointer to the model. For example, in the above example we are saving the string `"neuml/pubmedbert-base-embeddings"` so that we can load in the embedding model alongside the topic model.

This currently only works if you are using a sentence transformer model. If you are using a different model, you can load it in when loading the topic model like this:

```python
from sentence_transformers import SentenceTransformer

# Define embedding model
embedding_model = SentenceTransformer("neuml/pubmedbert-base-embeddings")

# Load model and add embedding model
loaded_model = BERTopic.load("path/to/my/model_dir", embedding_model=embedding_model)
```
***

As mentioned above, loading can be done as follows:

In [None]:
from sentence_transformers import SentenceTransformer

# Define embedding model
embedding_model = SentenceTransformer("neuml/pubmedbert-base-embeddings")

# Load model and add embedding model
loaded_model = BERTopic.load("my_model_dir", embedding_model=embedding_model)

## **Inference**

To speed up inference, we can leverage a "best practice" that we used before, namely serialization. When you save a model as `safetensors` and then load it in, the dimensionality reduction and clustering steps are removed from the pipeline.

Instead, the assignment of topics is done through cosine similarity of document embeddings and topic embeddings. This speeds up inference significantly.
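
The similarity-based assignment can be sketched in a few lines of NumPy. This is an illustration of the principle only: `assign_topics` and the toy vectors are assumptions, and row `i` of the topic matrix is simply treated as topic `i`.

```python
import numpy as np

def assign_topics(doc_embeddings, topic_embeddings):
    """Return, for every document, the index of the most similar topic and that cosine similarity."""
    docs = np.asarray(doc_embeddings, dtype=float)
    topics = np.asarray(topic_embeddings, dtype=float)
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    topics = topics / np.linalg.norm(topics, axis=1, keepdims=True)
    sims = docs @ topics.T  # shape: (n_documents, n_topics)
    return sims.argmax(axis=1).tolist(), sims.max(axis=1).tolist()

# Hypothetical toy embeddings
toy_topic_embeddings = [[1.0, 0.0], [0.0, 1.0]]
toy_doc_embeddings = [[0.9, 0.1], [0.2, 0.8]]

assigned, similarity = assign_topics(toy_doc_embeddings, toy_topic_embeddings)
print(assigned)
```

A single matrix product replaces the UMAP transform and the HDBSCAN prediction, which is where the speed-up comes from.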

To show its effect, let's start by disabling the logger:

In [None]:
from bertopic._utils import MyLogger
logger = MyLogger("ERROR")
loaded_model.verbose = False
topic_model.verbose = False

Then, we run inference on both the loaded model and the non-loaded model:

In [None]:
%timeit loaded_model.transform(papers[:100])

In [None]:
%timeit topic_model.transform(papers[:100])

**1000 documents**

In [None]:
%timeit loaded_model.transform(papers[:1000])

In [None]:
%timeit topic_model.transform(papers[:1000])

**10_000 documents**

In [None]:
# NOTE: this dataset holds 1006 papers, so the slice below returns all of them
%timeit loaded_model.transform(papers[:10000])

In [None]:
%timeit topic_model.transform(papers[:10000])

The `loaded_model` is expected to be quite a bit faster for inference than the original `topic_model`, because it skips the dimensionality reduction and clustering steps.