# LLM-enhanced topic modelling

This notebook demonstrates the impact of LLM-augmented topic modeling compared to a baseline (non-augmented) approach. The topic modeling pipeline follows a standard workflow: embedding text followed by clustering.

In the LLM-augmented pipeline, we first summarize each message using a large language model (LLM). The generated summaries—intended to retain only the core meaning—are then passed through the embedding and clustering steps.

In the baseline (non-augmented) pipeline, the original message text is passed directly into the topic modeling pipeline without any summarization.

The key idea is that LLM-based summarization helps distill each message to its essential content, which in turn can produce more human-interpretable and coherent topic clusters.

This notebook demonstrates the pipeline using a small sample dataset.

Note: The summarization step is implemented in the notebook `1.llm_workflow.ipynb`. Please refer to that file for details on how the summaries are generated.

In [2]:
from sentence_transformers import SentenceTransformer
import umap
from sklearn.cluster import HDBSCAN
import pandas as pd
from plotly.subplots import make_subplots
import plotly.graph_objects as go



## Load text and LLM-generated summaries

This section loads the raw text data and their corresponding LLM-generated summaries from text files. The raw texts represent the original input, while the summaries are distilled versions created using a large language model. These datasets will be used for embedding, clustering, and visualization in subsequent steps.

In [3]:
with open('sample.txt', 'r', encoding='utf-8') as file:
    texts = [line.strip() for line in file]

with open('summaries.txt', 'r', encoding='utf-8') as file:
    summaries = [line.strip() for line in file]
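
The later comparison only makes sense if line i of `summaries.txt` summarises line i of `sample.txt`. The sketch below is one way to check that, assuming the one-message-per-line layout used above. `check_alignment` is a hypothetical helper (not part of the original notebook), and the toy lists stand in for the file contents.

```python
def check_alignment(texts, summaries):
    """Return indices of blank entries; raise if the two lists differ in length.

    Hypothetical helper: assumes line i of the summaries file
    summarises line i of the raw-text file.
    """
    if len(texts) != len(summaries):
        raise ValueError(
            f"{len(texts)} texts but {len(summaries)} summaries; files are misaligned"
        )
    # Blank lines survive `line.strip()` as empty strings and would still be embedded.
    return [i for i, (t, s) in enumerate(zip(texts, summaries)) if not t or not s]


# Toy data standing in for the contents of the two files.
demo_texts = ["hello there, my order is late", "", "please reset my password"]
demo_summaries = ["late order", "n/a", "password reset"]
blank_rows = check_alignment(demo_texts, demo_summaries)
print(blank_rows)  # [1]
```

In the notebook this would be called as `check_alignment(texts, summaries)` right after loading; any returned indices point at rows worth dropping from both lists before embedding.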

## Topic Modelling Workflow

This section initializes the key components of the topic modeling workflow: the embedding model and the clustering algorithm. The same components are used to process both the raw texts and the LLM-generated summaries in the steps that follow.

### Initialise objects

In [4]:
model = SentenceTransformer('all-MiniLM-L6-v2')  # embed text
clusterer = HDBSCAN(min_cluster_size=2, metric='euclidean') # cluster embeddings
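
The clusterer above uses `metric='euclidean'`. For unit-length vectors, Euclidean and cosine distance order pairs of points identically, because ||a - b||² = 2(1 - cos(a, b)). The sketch below checks that identity on random vectors using only the standard library. Whether a given embedding model returns unit-length vectors is an assumption to verify on the real embeddings. The vectors and the width of 384 are illustrative stand-ins, not model output.

```python
import math
import random

random.seed(0)


def random_unit_vector(dim):
    """Draw a random vector and scale it to length 1 (stand-in for an embedding)."""
    values = [random.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(v * v for v in values))
    return [v / norm for v in values]


a = random_unit_vector(384)
b = random_unit_vector(384)

squared_euclidean = math.dist(a, b) ** 2
cosine_distance = 1.0 - sum(x * y for x, y in zip(a, b))

# For unit vectors: ||a - b||^2 == 2 * (1 - cos(a, b))
identity_holds = math.isclose(squared_euclidean, 2.0 * cosine_distance, rel_tol=1e-9)
print(identity_holds)
```

If the real embeddings turn out not to be normalised, the two metrics can disagree, and normalising before clustering would be one way to keep them consistent.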

### Implement on raw text

In [5]:
# Encode the raw texts into embeddings using the pre-trained SentenceTransformer model
embeddings_text = model.encode(texts, show_progress_bar=True)

# Perform clustering on the embeddings using HDBSCAN
labels_text = clusterer.fit_predict(embeddings_text)


Batches: 100%|██████████| 1/1 [00:06<00:00,  6.45s/it]
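
HDBSCAN assigns the label `-1` to points it treats as noise, so the number of distinct labels overstates the number of topics. The sketch below summarises a label sequence with that in mind. `summarise_labels` is a hypothetical helper and the labels are made up for illustration.

```python
from collections import Counter


def summarise_labels(labels):
    """Count clusters and noise points in an HDBSCAN-style label sequence."""
    counts = Counter(int(label) for label in labels)
    n_noise = counts.pop(-1, 0)  # -1 marks points left unclustered
    return {"n_clusters": len(counts), "n_noise": n_noise, "sizes": dict(counts)}


demo_labels = [0, 0, 1, -1, 1, 2, 2, -1]  # illustrative, not real output
summary = summarise_labels(demo_labels)
print(summary)
# {'n_clusters': 3, 'n_noise': 2, 'sizes': {0: 2, 1: 2, 2: 2}}
```

Calling `summarise_labels(labels_text)` in the notebook would give the same breakdown for the raw-text pipeline.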


### Implement on LLM-generated summaries

In [6]:
embeddings_summaries = model.encode(summaries, show_progress_bar=True)
labels_summaries = clusterer.fit_predict(embeddings_summaries)

Batches: 100%|██████████| 1/1 [00:02<00:00,  2.37s/it]
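
To judge whether one pipeline yields more interpretable topics, it helps to read the members of each cluster side by side. The sketch below groups documents by label, assuming the labels and documents are in the same order. `group_by_cluster` is a hypothetical helper and the data is illustrative.

```python
from collections import defaultdict


def group_by_cluster(documents, labels):
    """Map each cluster label to the documents assigned to it."""
    groups = defaultdict(list)
    for document, label in zip(documents, labels):
        groups[int(label)].append(document)
    return dict(sorted(groups.items()))  # noise (-1) sorts first


example_docs = ["refund request", "late delivery", "refund status", "greeting"]
example_labels = [0, 1, 0, -1]
grouped = group_by_cluster(example_docs, example_labels)
for label, members in grouped.items():
    print(label, members)
# -1 ['greeting']
# 0 ['refund request', 'refund status']
# 1 ['late delivery']
```

In the notebook, `group_by_cluster(texts, labels_text)` and `group_by_cluster(texts, labels_summaries)` would list the original messages under each pipeline's clustering, which makes the two groupings directly comparable.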


## Visualise Results
This section visualizes the clusters formed by the topic modeling pipeline. Two scatter plots are displayed side by side: one for the raw text and another for the LLM-generated summaries. Each point represents a text or summary, and its color indicates the cluster it belongs to.
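
Cluster labels are categorical, so a continuous colour scale can make neighbouring cluster ids hard to tell apart. One option, sketched below, is to map each label to a discrete colour and reserve grey for HDBSCAN noise (`-1`). The palette and the `label_colours` helper are illustrative assumptions, not part of the original notebook. Plotly accepts a list of CSS colour strings for `marker.color`.

```python
PALETTE = ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd"]
NOISE_COLOUR = "#bbbbbb"


def label_colours(labels):
    """Return one colour string per label; noise (-1) is grey."""
    return [
        NOISE_COLOUR if int(label) == -1 else PALETTE[int(label) % len(PALETTE)]
        for label in labels
    ]


colours = label_colours([0, 1, -1, 0, 5])
print(colours)
# ['#1f77b4', '#ff7f0e', '#bbbbbb', '#1f77b4', '#1f77b4']
```

The palette cycles once there are more clusters than colours, so a longer palette is needed for larger datasets.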

In [8]:
reducer = umap.UMAP(n_neighbors=2, n_components=2, metric='cosine', random_state=42)  # project embeddings into 2d space for visualisation

### Create visualisation for topic modelling with raw text

In [10]:
embedding_text_2d = reducer.fit_transform(embeddings_text)

df_text = pd.DataFrame({
    'x': embedding_text_2d[:, 0],  
    'y': embedding_text_2d[:, 1],  
    'text': texts,            
    'cluster': labels_text 
})

scatter_text = go.Scatter(
        x=df_text['x'],      
        y=df_text['y'],        
        mode='markers',        
        marker=dict(
            size=6,            
            color=df_text['cluster'],  
            colorscale='YlGnBu',       
        ),
        text=df_text['text'],  
        hoverinfo='text'       
    )

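# A go.Scatter trace has no `update_layout` method (that belongs to go.Figure),
# so the panel title is supplied via `subplot_titles` when the figure is assembled.
scatter_text.name = 'Raw text'  # trace name, used in the legend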

### Create visualisation for topic modelling with summaries

In [11]:
embedding_summaries_2d = reducer.fit_transform(embeddings_summaries)
df_sum = pd.DataFrame({
    'x': embedding_summaries_2d[:, 0],
    'y': embedding_summaries_2d[:, 1],
    'text': texts,
    'cluster': labels_summaries
})
scatter_sum = go.Scatter(
    x=df_sum['x'],
    y=df_sum['y'],
    mode='markers',
    marker=dict(
        size=6,
        color=df_sum['cluster'],  
        colorscale='YlGnBu',       
    ),
    text=df_sum['text'],             
    hoverinfo='text'                  
    )

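# A go.Scatter trace has no `update_layout` method (that belongs to go.Figure),
# so the panel title is supplied via `subplot_titles` when the figure is assembled.
scatter_sum.name = 'LLM-generated summaries'  # trace name, used in the legend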

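In [None]:
# Combine both traces into one figure with a panel per pipeline.
# `make_subplots` and `go` are already imported at the top of the notebook.
fig = make_subplots(rows=1, cols=2, subplot_titles=('Raw text', 'LLM-generated summaries'))

fig.add_trace(scatter_text, row=1, col=1)
fig.add_trace(scatter_sum, row=1, col=2)

fig.update_layout(title='Topic clusters: raw text vs. LLM-generated summaries')

# Rendering inside Jupyter requires nbformat>=4.2.0 (e.g. `pip install nbformat`);
# without it, plotly raises "ValueError: Mime type rendering requires nbformat>=4.2.0".
fig.show()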