# LLM-enhanced topic modelling

This notebook demonstrates the impact of LLM-augmented topic modeling compared to a baseline (non-augmented) approach. The topic modeling pipeline follows a standard workflow: embedding text followed by clustering.

In the LLM-augmented pipeline, we first summarize each message using a large language model (LLM). The generated summaries—intended to retain only the core meaning—are then passed through the embedding and clustering steps.

In the baseline (non-augmented) pipeline, the original message text is passed directly into the topic modeling pipeline without any summarization.

The key idea is that LLM-based summarization helps distill each message to its essential content, which in turn can produce more human-interpretable and coherent topic clusters.

This notebook demonstrates the pipeline using a small sample dataset.

Note: The summarization step is implemented in the notebook `1.llm_workflow.ipynb`. Refer to that notebook for details on how the summaries are generated.

In [1]:
from sentence_transformers import SentenceTransformer
import umap
from sklearn.cluster import HDBSCAN
import pandas as pd
from plotly.subplots import make_subplots
import plotly.graph_objects as go



## Load text and LLM-generated summaries

This section loads the raw text data and their corresponding LLM-generated summaries from text files. The raw texts represent the original input, while the summaries are distilled versions created using a large language model. These datasets will be used for embedding, clustering, and visualization in subsequent steps.

In [2]:
with open('sample.txt', 'r', encoding='utf-8') as file:
    texts = [line.strip() for line in file]

with open('summaries.txt', 'r', encoding='utf-8') as file:
    summaries = [line.strip() for line in file]
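
The rest of the notebook pairs `texts[i]` with `summaries[i]` by position, so the two files need the same number of lines. A minimal sanity check is sketched below; the helper name `alignment_problems` is hypothetical and not part of the original notebook.

```python
def alignment_problems(texts, summaries):
    """Return a list of human-readable problems; an empty list means the lists pair up."""
    problems = []
    # Each text is matched to the summary at the same index.
    if len(texts) != len(summaries):
        problems.append(f"{len(texts)} texts but {len(summaries)} summaries")
    # line.strip() turns blank lines into empty strings, which would still be embedded.
    for i, summary in enumerate(summaries):
        if not summary:
            problems.append(f"summary {i} is empty")
    return problems


# Illustrative inputs only, not the notebook's data
print(alignment_problems(["first message", "second message"], ["topic a", ""]))
```

Calling `alignment_problems(texts, summaries)` right after loading and checking that it returns an empty list would catch a mismatch before the embedding step.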

In [11]:
texts

['My heart burns with an eternal flame for you!',
 'You are the sun, the moon, and every star that ever shone!',
 'I am hopelessly, wildly, gloriously in love with you!',
 'Your name is etched upon the walls of my soul!',
 'You have bewitched me, body and soul!',
 'I would cross oceans of fire just to see you smile!',
 'Every breath I take is a love letter to you!',
 'You are my destiny, my downfall, my everything!',
 'Without you, even the stars seem dim and lifeless!',
 'My love for you defies time, space, and all reason!']

In [10]:
summaries

['Love/Obsession',
 'Glory/Adoration',
 'Love',
 'Poetry/Romance',
 'Love',
 'Love',
 'Love',
 'Destiny',
 'Adoration or appreciation for a person or relationship',
 'Love']

## Topic Modelling Workflow

This section initializes the key components of the topic modeling workflow: the embedding model and the clustering algorithm.
These components will be used to process both the raw texts and LLM-generated summaries in subsequent steps.

### Initialise objects

In [3]:
model = SentenceTransformer('all-MiniLM-L6-v2')  # embed text
clusterer = HDBSCAN(min_cluster_size=2, metric='euclidean') # cluster embeddings

### Implement on raw text

In [4]:
# Encode the raw texts into embeddings using the pre-trained SentenceTransformer model
embeddings_text = model.encode(texts, show_progress_bar=True)

# Perform clustering on the embeddings using HDBSCAN
labels_text = clusterer.fit_predict(embeddings_text)


Batches: 100%|██████████| 1/1 [00:01<00:00,  1.27s/it]


### Implement on LLM-generated summaries

In [5]:
embeddings_summaries = model.encode(summaries, show_progress_bar=True)
labels_summaries = clusterer.fit_predict(embeddings_summaries)

Batches: 100%|██████████| 1/1 [00:00<00:00,  2.84it/s]
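
Before plotting, it can help to look at the label arrays directly. scikit-learn's `HDBSCAN` assigns the label `-1` to points it treats as noise and non-negative integers to clusters. The sketch below counts clusters and noise points for any label sequence; the helper name `describe_labels` and the example labels are illustrative assumptions, not output from this notebook.

```python
from collections import Counter


def describe_labels(labels):
    """Summarise cluster labels: number of clusters, noise count, and cluster sizes."""
    counts = Counter(int(label) for label in labels)
    # HDBSCAN marks noise points with -1; they do not belong to any cluster.
    n_noise = counts.get(-1, 0)
    # Only non-negative labels are real clusters.
    cluster_sizes = {label: size for label, size in sorted(counts.items()) if label >= 0}
    return {
        "n_clusters": len(cluster_sizes),
        "n_noise": n_noise,
        "cluster_sizes": cluster_sizes,
    }


# Made-up labels for illustration
print(describe_labels([0, 0, 1, -1, 1, 1]))
```

Applying it to `labels_text` and `labels_summaries` gives a quick numeric comparison of the two pipelines alongside the plots.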


## Visualise Results
This section visualizes the clusters formed by the topic modeling pipeline. Two scatter plots are displayed side by side: one for the raw text and another for the LLM-generated summaries. Each point represents a text or summary, and its color indicates the cluster it belongs to.

In [6]:
reducer = umap.UMAP(n_neighbors=2, n_components=2, metric='cosine', random_state=42)  # project embeddings into 2d space for visualisation

### Create visualisation for topic modelling with raw text

In [None]:
# Reduce the dimensionality of the text embeddings to 2D for visualization
embedding_text_2d = reducer.fit_transform(embeddings_text)

# Create a DataFrame to store the 2D embeddings, original text, and cluster labels
df_text = pd.DataFrame({
    'x': embedding_text_2d[:, 0],  # x-coordinate of the 2D embedding
    'y': embedding_text_2d[:, 1],  # y-coordinate of the 2D embedding
    'text': texts,                 # Original text
    'cluster': labels_text         # Cluster labels assigned by HDBSCAN
})

# Create a scatter plot for the text embeddings
scatter_text = go.Scatter(
    x=df_text['x'],                # x-coordinates for the scatter plot
    y=df_text['y'],                # y-coordinates for the scatter plot
    mode='markers',                # Marker mode for the scatter plot
    marker=dict(
        size=6,                    # Marker size
        color=df_text['cluster'],  # Color based on cluster labels
        colorscale='YlGnBu',       # Color scale for the clusters
    ),
    text=df_text['text'],          # Text to display on hover
    hoverinfo='text'               # Display text on hover
)




### Create visualisation for topic modelling with summaries

In [8]:
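# Reduce the dimensionality of the summary embeddings to 2D for visualization
embedding_summaries_2d = reducer.fit_transform(embeddings_summaries)

# Create a DataFrame to store the 2D embeddings, hover text, and cluster labels
df_sum = pd.DataFrame({
    'x': embedding_summaries_2d[:, 0],  # x-coordinate of the 2D embedding
    'y': embedding_summaries_2d[:, 1],  # y-coordinate of the 2D embedding
    'text': texts,                      # Original text on hover; use `summaries` to show the summary instead
    'cluster': labels_summaries         # Cluster labels assigned by HDBSCAN
})

# Create a scatter plot for the summary embeddings
scatter_sum = go.Scatter(
    x=df_sum['x'],
    y=df_sum['y'],
    mode='markers',
    marker=dict(
        size=6,
        color=df_sum['cluster'],        # Color based on cluster labels
        colorscale='YlGnBu',
    ),
    text=df_sum['text'],                # Text to display on hover
    hoverinfo='text'
)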


In [12]:
# make_subplots and go are already imported in the first cell
fig = make_subplots(rows=1, cols=2, subplot_titles=('raw text', 'summaries'))

fig.add_trace(scatter_text, row=1, col=1)  # left panel: raw text clusters
fig.add_trace(scatter_sum, row=1, col=2)   # right panel: summary clusters

fig.show()
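
The side-by-side plot compares the two pipelines visually. Because both label arrays follow the order of `texts`, they can also be compared message by message. The sketch below counts how often each pair of labels occurs together; the helper name `label_pairs` and the example labels are hypothetical and not results from this notebook.

```python
from collections import Counter


def label_pairs(labels_a, labels_b):
    """Count co-occurrences of (label in a, label in b) for position-aligned labelings."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same messages")
    # Cast to int so NumPy integer labels and plain ints compare equal as keys.
    return Counter((int(a), int(b)) for a, b in zip(labels_a, labels_b))


# Made-up labels for illustration: -1 is HDBSCAN's noise label
pairs = label_pairs([0, 0, 1, -1], [0, 0, 0, 1])
for (raw_label, summary_label), count in sorted(pairs.items()):
    print(f"raw={raw_label} summary={summary_label}: {count}")
```

With the notebook's variables, `label_pairs(labels_text, labels_summaries)` shows which raw-text clusters are merged or split once the summaries are used.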