<h1>Topic Modeling Pipeline</h1>

![com_pipeline.png](attachment:441a225c-756c-43a3-bd90-d55e882f0b7c.png)

<h1>Clustering</h1>

![bert_topic_pipeline.png](attachment:a6994c43-039a-4fac-b1a4-31696915c164.png)

<h1>Embedding Documents</h1>

![embed_doc.png](attachment:01deafae-7a91-4887-9849-c78a600cf9ab.png)

<ul>
    <li>Convert the textual data to embeddings.</li>
    <li>Embeddings are numerical representations of text that capture its meaning.</li>
</ul>

In [None]:
from sentence_transformers import SentenceTransformer

# Create an embedding for each abstract
# (`abstracts` is expected to be a list of strings, one per document)
embedding_model = SentenceTransformer("thenlper/gte-small")
embeddings = embedding_model.encode(abstracts, show_progress_bar=True)
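To make "capturing meaning" concrete: documents with similar meaning should end up with vectors that point in similar directions, which is commonly measured with cosine similarity. The sketch below uses only the standard library and hypothetical 3-dimensional vectors (`toy_embeddings` is made up for illustration; the real `embeddings` array has 384 dimensions per document).

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors standing in for rows of `embeddings`
toy_embeddings = {
    "doc_a": [1.0, 0.0, 1.0],
    "doc_b": [2.0, 0.0, 2.0],  # same direction as doc_a
    "doc_c": [0.0, 1.0, 0.0],  # orthogonal to doc_a
}

sim_ab = cosine_similarity(toy_embeddings["doc_a"], toy_embeddings["doc_b"])
sim_ac = cosine_similarity(toy_embeddings["doc_a"], toy_embeddings["doc_c"])
print(f"doc_a vs doc_b: {sim_ab:.2f}")
print(f"doc_a vs doc_c: {sim_ac:.2f}")
```

Vectors pointing the same way score close to 1 regardless of their length, while unrelated (orthogonal) vectors score 0.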

<h1>Reducing Dimensionality of Embeddings</h1>

![dim_reduction.png](attachment:5f71ad5d-6d2e-40cf-b25f-61d0adb3bde8.png)

![compressed_dim.png](attachment:5d5582d1-7f03-4a48-8cdb-a476bae62b65.png)

In [None]:
from umap import UMAP

# We reduce the input embeddings from 384 dimensions to 5 dimensions
umap_model = UMAP(
    n_components=5, min_dist=0.0, metric='cosine', random_state=42
)
reduced_embeddings = umap_model.fit_transform(embeddings)

<h1>Cluster the Reduced Embeddings</h1>

![cluster_reduced_dim.png](attachment:a27bb113-5a9e-4386-bcf2-d4d7914cc5ed.png)

In [None]:
from hdbscan import HDBSCAN

# We fit the model and extract the clusters
hdbscan_model = HDBSCAN(
    min_cluster_size=50, metric="euclidean", cluster_selection_method="eom"
).fit(reduced_embeddings)
clusters = hdbscan_model.labels_
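HDBSCAN assigns the label `-1` to points it treats as outliers (noise), so it is worth checking how many clusters were found and how many documents were left unassigned. The sketch below shows one way to do that; `toy_labels` is a hypothetical stand-in for `hdbscan_model.labels_`, not output from the model above.

```python
from collections import Counter

# Hypothetical labels standing in for `hdbscan_model.labels_`
toy_labels = [0, 0, 1, -1, 1, 0, -1, 2, 2, 2, 2]

# Count documents per cluster, ignoring the outlier label -1
cluster_sizes = Counter(label for label in toy_labels if label != -1)
n_clusters = len(cluster_sizes)
n_outliers = sum(1 for label in toy_labels if label == -1)

print(f"clusters: {n_clusters}, outliers: {n_outliers}")
for label, size in sorted(cluster_sizes.items()):
    print(f"cluster {label}: {size} documents")
```

With the real model, the same counting can be applied directly to `clusters`.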

<h1>Topic Representation</h1>

![topic_repr.png](attachment:0295e5a5-9432-4332-8068-0efb848e613f.png)

<h2>c-TF-IDF</h2>

![bog_count_vector.png](attachment:065c2893-ba2b-46da-bf09-3c4f37f337f9.png)

![c_tf.png](attachment:be597550-b8f0-4531-9d48-fdca4f8d9ea5.png)

![idf_form.png](attachment:dbcdc547-4c54-449f-82e0-0bf911e95c50.png)
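The figures above can be sketched in code, assuming they follow the c-TF-IDF weighting documented for BERTopic: the frequency of word x in class c, multiplied by log(1 + A / f_x), where A is the average number of words per class and f_x is the frequency of x across all classes. The function name `c_tf_idf` and the data in `toy_clusters` are hypothetical, and raw counts are used for the term frequency to keep the example short.

```python
import math
from collections import Counter

def c_tf_idf(class_docs):
    """class_docs: dict mapping a cluster label to a list of tokenized documents."""
    # Bag of words per class: all documents in a cluster are merged into one
    class_counts = {
        label: Counter(token for doc in docs for token in doc)
        for label, docs in class_docs.items()
    }
    # f_x: frequency of each word across all classes
    total_counts = Counter()
    for counts in class_counts.values():
        total_counts.update(counts)
    # A: average number of words per class
    avg_words = sum(total_counts.values()) / len(class_counts)

    # Weight = class term frequency * log(1 + A / f_x)
    return {
        label: {
            word: tf * math.log(1 + avg_words / total_counts[word])
            for word, tf in counts.items()
        }
        for label, counts in class_counts.items()
    }

# Hypothetical tokenized documents grouped by cluster label
toy_clusters = {
    0: [["neural", "network", "training"], ["neural", "model"]],
    1: [["topic", "model", "clustering"], ["topic", "words"]],
}
weights = c_tf_idf(toy_clusters)
top_word_0 = max(weights[0], key=weights[0].get)
top_word_1 = max(weights[1], key=weights[1].get)
print(top_word_0, top_word_1)
```

Words that are frequent within one cluster but rare elsewhere receive the highest weights, while a word shared by both clusters (here "model") is weighted lower than a word of equal count that appears in only one cluster.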