# An LLM-Approach to Semantic Clustering and Topic Modeling of Academic Literature

[Clustering](https://en.wikipedia.org/wiki/Cluster_analysis) is a fundamental task in unsupervised learning, where the goal is to group unlabeled data into related categories, whereas [Topic Modeling](https://en.wikipedia.org/wiki/Topic_model) focuses on identifying thematic structures within a collection of documents. These techniques find applications across various domains, supporting tasks such as information retrieval, anomaly detection, trend analysis, and biomedical research.

This notebook provides an end-to-end guide to building an LLM-based pipeline that automatically categorizes research articles into latent topics using open-source tools. Our playground is a [dataset of 25,000 arXiv research publications](https://huggingface.co/datasets/dcarpintero/arxiv.cs.CL.embedv3.clustering.medium) from Computational Linguistics (Natural Language Processing) published before May 2024.

At its core, the clustering problem relies on finding similar examples. This is a natural task for embeddings: they capture the semantic relationships in a corpus and can be provided as input features to a clustering algorithm to establish similarity links among the examples. We begin by transforming the `title:abstract` pairs of our dataset into embeddings using [Jina-Embeddings-v2](https://arxiv.org/abs/2310.19923), a BERT-based model with ALiBi attention that supports sequences of up to 8192 tokens, and then apply HDBSCAN [2] in a reduced dimensional space. Topic modeling is then performed at the cluster level using a random subset of `titles` within each cluster. This latter step combines [LangChain](https://www.langchain.com/) and [Pydantic](https://docs.pydantic.dev/) with [Mistral-7B-Instruct](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) to define a topic pipeline that generates structured `JSON` output.

To inspect the quality of the clustering and topic modeling, we visualize the outcomes after applying [UMAP](https://en.wikipedia.org/wiki/Uniform_Manifold_Approximation_and_Projection) [1] dimensionality reduction a second time.

<figure>
  <img style="margin: 0 auto; display: block;" src="https://cdn-uploads.huggingface.co/production/uploads/64a13b68b14ab77f9e3eb061/iE3e4VJSY84JyyTR9krmf.png">
  <figcaption style="text-align: center;">LLM-based Pipeline for Semantic Clustering and Topic Modeling of Academic Literature </figcaption>
</figure>

In [2]:
%pip install --upgrade altair datasets hdbscan scikit-learn umap-learn --quiet


## 1. Embeddings Transformation

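We start by loading a dataset of arXiv publications in which each `title:abstract` pair already comes with its precomputed embedding, stored in the `embeddings` column.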

In [146]:
from datasets import load_dataset
import tqdm as notebook_tqdm

ds = load_dataset("dcarpintero/arxiv.cs.CL.10k.embeddings.mpnet", split="train")


In [147]:
ds

Dataset({
    features: ['id', 'doc_url', 'title', 'publication_date', 'update_date', 'authors', 'category_primary', 'category_all', 'abstract', 'embeddings'],
    num_rows: 10000
})

## 2. Projecting Embeddings for Dimensionality Reduction

We then project the embeddings of our `title:abstract` pairs from a high-dimensional space (768 dimensions) to a lower-dimensional one (5 dimensions) using [dimensionality reduction](https://en.wikipedia.org/wiki/Dimensionality_reduction). This reduces the computational complexity and memory usage during clustering.
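
To make the saving concrete, the following back-of-the-envelope sketch compares the size of the embedding matrix before and after the projection. It assumes 10,000 rows (the `num_rows` of the loaded dataset) stored as 32-bit floats; the storage type is an assumption and may differ in practice.

```python
# Rough memory footprint of the embedding matrix before and after projection.
# Assumption: values are stored as 32-bit floats (4 bytes each).
NUM_DOCS = 10_000
BYTES_PER_FLOAT32 = 4

def matrix_bytes(num_rows: int, num_dims: int) -> int:
    """Size in bytes of a dense num_rows x num_dims float32 matrix."""
    return num_rows * num_dims * BYTES_PER_FLOAT32

original_bytes = matrix_bytes(NUM_DOCS, 768)
reduced_bytes = matrix_bytes(NUM_DOCS, 5)

print(f"768 dims: {original_bytes / 1e6:.1f} MB")
print(f"  5 dims: {reduced_bytes / 1e6:.1f} MB")
print(f"reduction factor: {original_bytes / reduced_bytes:.1f}x")
```

The distance computations performed during clustering shrink by a similar factor, since each one touches 5 values per point instead of 768.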

To implement this step, we use [UMAP](https://en.wikipedia.org/wiki/Nonlinear_dimensionality_reduction#Uniform_manifold_approximation_and_projection) [1], a popular technique known for its effectiveness in preserving both the local and global data structures. In practice, this makes it a preferred choice for handling complex datasets with high-dimensional embeddings.

In [148]:
import umap

umap_reducer = umap.UMAP(n_neighbors=100,
                         n_components=5,
                         min_dist=0.1,
                         metric='cosine')
umap_embedding = umap_reducer.fit_transform(ds['embeddings'])

In our implementation, we configure UMAP with:
- `n_neighbors=100` to consider 100 nearest neighbors for each point (arXiv publication);
- `n_components=5` to reduce the embeddings from 768 to 5 dimensions;
- `min_dist=0.1` to control how tightly points may be packed together in the projection, which balances local detail against the broader structure; and,
- `metric='cosine'` to measure the distance between points using the cosine distance (one minus the cosine similarity).
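
For the `metric='cosine'` setting, the distance between two embeddings is one minus their cosine similarity, so only the direction of the vectors matters and not their length. The sketch below shows this in plain Python on made-up toy vectors; it is an illustration of the metric, not the code UMAP runs internally.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

u = [1.0, 2.0, 3.0]
same_direction = [2.0, 4.0, 6.0]   # u scaled by 2: distance close to 0
orthogonal = [3.0, 0.0, -1.0]      # dot product with u is 0: distance 1

print(cosine_distance(u, same_direction))
print(cosine_distance(u, orthogonal))
```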

Note that when we apply HDBSCAN clustering in the next step, the clusters found will be influenced by how UMAP preserved the local structures. A smaller `n_neighbors` value makes UMAP focus more on local structure, whereas a larger value captures more of the global structure, which can help reveal overall patterns in the data.

## 3. Semantic Clustering

This section shows how to use the reduced embeddings of the `title:abstract` pairs as input features of a clustering algorithm. This allows related categories to be identified based on the distance between the embeddings.

We have opted for [HDBSCAN](https://en.wikipedia.org/wiki/HDBSCAN) (Hierarchical Density-Based Spatial Clustering of Applications with Noise) [2], an advanced clustering algorithm that extends DBSCAN by adapting to clusters of varying density. Unlike K-Means, which requires pre-specifying the number of clusters, HDBSCAN has only one important hyperparameter, `min_cluster_size`, which establishes the minimum number of examples to include in a cluster. As a density-based method, it can also detect outliers in the data.

HDBSCAN works by first transforming the data space according to the density of the data points, making denser regions (areas where many data points lie close together) more attractive for cluster formation. The algorithm then builds a hierarchy of clusters based on the minimum cluster size established by `min_cluster_size`. This allows it to distinguish between noise (sparse areas) and dense regions (potential clusters). Finally, HDBSCAN condenses this hierarchy to derive the most persistent clusters, identifying clusters of different densities and shapes.
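
The density transformation in the first step is usually described in terms of the mutual reachability distance: each point gets a core distance (the distance to its k-th nearest neighbor), and the distance between two points is raised to at least the larger of their two core distances. The sketch below works through this on toy 1-D points. The data and `k=2` are made up for illustration, the point itself is not counted as a neighbor here (implementations differ on this), and the `hdbscan` library computes the same quantity far more efficiently.

```python
def core_distance(points, i, k):
    """Distance from points[i] to its k-th nearest neighbor (excluding itself)."""
    dists = sorted(abs(points[i] - p) for j, p in enumerate(points) if j != i)
    return dists[k - 1]

def mutual_reachability(points, i, j, k):
    """max(core_k(i), core_k(j), d(i, j)) for 1-D points."""
    return max(core_distance(points, i, k),
               core_distance(points, j, k),
               abs(points[i] - points[j]))

points = [0.0, 1.0, 2.0, 10.0]   # three close points and one isolated point
k = 2

for i, j in [(0, 1), (1, 2), (2, 3)]:
    raw = abs(points[i] - points[j])
    mreach = mutual_reachability(points, i, j, k)
    print(f"points {i} and {j}: raw distance {raw}, mutual reachability {mreach}")
```

Points in sparse regions have large core distances, so they are pushed further away from everything else, which makes them easier to separate as noise.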

Note that while we set the minimum cluster size to the same value as the number of neighbors in UMAP (100), the two do not need to be equal.

In [149]:
import hdbscan

hdbscan_model = hdbscan.HDBSCAN(min_cluster_size=100,
                                metric='euclidean',
                                cluster_selection_method='eom')
clusters = hdbscan_model.fit_predict(umap_embedding)

We prepare the dataset for visualization by projecting the original embeddings a second time, in this case down to 2 dimensions.

In [150]:
import pandas as pd

reduced_embeddings = umap.UMAP(n_neighbors=100, n_components=2, min_dist=0.1, metric='cosine').fit_transform(ds['embeddings'])
df = pd.DataFrame(reduced_embeddings, columns=['x', 'y'])
df['cluster'] = clusters
df['title'] = ds['title']

df = df[df['cluster'] != -1].copy()  # remove outliers (HDBSCAN labels noise as -1); copy so new columns can be added safely
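
HDBSCAN marks points it cannot assign to any cluster with the label `-1`, which is what the filter above relies on. Before discarding them, it can be useful to check how many points were treated as noise. A minimal sketch, using a made-up label list in place of the `clusters` array:

```python
from collections import Counter

# Hypothetical labels standing in for the `clusters` array; -1 marks noise.
labels = [0, 1, -1, 1, 0, -1, 2, 1, -1, 0]

counts = Counter(labels)
n_noise = counts.pop(-1, 0)          # remove the noise entry, default to 0
noise_ratio = n_noise / len(labels)

print(f"clusters: {len(counts)}, noise points: {n_noise} ({noise_ratio:.0%})")
print("cluster sizes:", dict(sorted(counts.items())))
```

A very high noise ratio can be a hint that `min_cluster_size` or the UMAP settings deserve another look.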

In [134]:
df.head(10)

| | x | y | cluster | title |
|---:|---:|---:|---:|:---|
| 0 | 6.39021 | 1.447678 | 8 | Planetarium: A Rigorous Benchmark for Translat... |
| 1 | 8.26941 | 7.373854 | 4 | InternLM-XComposer-2.5: A Versatile Large Visi... |
| 2 | 8.124364 | 7.802285 | 4 | BACON: Supercharge Your VLM with Bag-of-Concep... |
| 3 | 6.167572 | 1.872279 | 8 | A Review of the Applications of Deep Learning-... |
| 5 | 9.568141 | 4.940374 | 16 | Evaluating Automatic Metrics with Incremental ... |
| 6 | 3.335279 | 3.650203 | 11 | How Similar Are Elected Politicians and Their ... |
| 9 | 3.523068 | 0.792918 | 1 | Self-Evaluation as a Defense Against Adversari... |
| 10 | 3.603736 | 0.815879 | 1 | Single Character Perturbations Break LLM Align... |
| 12 | 6.899806 | 2.962453 | 15 | How Does Quantization Affect Multilingual LLMs? |
| 13 | 5.016772 | 5.89614 | 3 | CiteAssist: A System for Automated Preprint Ci... |


In [135]:
df['cluster'].unique()


array([ 8,  4, 16, 11,  1, 15,  3, 14,  5,  9,  2,  7, 10, 17,  6, 13, 12,
        0])

## 4. Topic Modeling with LLMs

Having performed the clustering step, we now illustrate how to identify the topic of each cluster by combining an LLM such as [Mistral-7B-Instruct](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3) with [Pydantic](https://docs.pydantic.dev/) and [LangChain](https://www.langchain.com/) to create a topic modeling pipeline.

In [136]:
%pip install huggingface_hub langchain langchain_huggingface --upgrade --quiet

### 4.1 Pydantic Model

[Pydantic Models](https://docs.pydantic.dev/latest/concepts/models/) are classes that derive from `pydantic.BaseModel`, defining fields as type-annotated attributes. They bear a strong resemblance to `Python` dataclasses. However, they have been designed with subtle but significant differences that optimize various operations such as validation, serialization, and `JSON` schema generation. Our `Topic` class defines a single field named `label`. This lets us obtain the model output in a structured format rather than as a free-form text block, which makes the topic modeling results easier to process and analyze.

In [137]:
from pydantic import BaseModel, Field

class Topic(BaseModel):
    """
    Pydantic Model to generate a structured Topic Model
    """
    label: str = Field(..., description="Identified topic")
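
To see what the model buys us, the sketch below validates JSON strings of the shape an LLM might return. The raw strings are made up for the example, and building the model from `json.loads` output is just one way to do this that should work across Pydantic versions.

```python
import json
from pydantic import BaseModel, Field, ValidationError

class Topic(BaseModel):
    label: str = Field(..., description="Identified topic")

good = '{"label": "Text Summarization"}'
bad = '{"category": "Text Summarization"}'   # wrong key: `label` is missing

topic = Topic(**json.loads(good))
print(topic.label)

rejected = False
try:
    Topic(**json.loads(bad))
except ValidationError as err:
    # The required field `label` is absent, so validation fails.
    rejected = True
    print("rejected:", len(err.errors()), "validation error(s)")
```

A malformed answer therefore surfaces as an explicit error instead of silently flowing into the results.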

### 4.2 LangChain Prompt Template

[LangChain Prompt Templates](https://python.langchain.com/v0.2/docs/concepts/#prompt-templates) are pre-defined recipes for generating prompts for language models.

In [138]:
from langchain_core.prompts import PromptTemplate

topic_prompt = """
    You are a helpful Research Engineer. Your task is to analyze a set of research paper titles related to Natural Language Processing and
    determine the overarching topic of the cluster. Based on the titles provided, you should identify and label the most relevant topic.
    The response should be concise, clearly stating the single identified topic. Format your response in JSON as indicated in the 'EXPECTED OUTPUT' section below.
    No additional information or follow-up questions are needed.

    EXPECTED OUTPUT:
    {{"label": "Topic Name"}}

    TITLES:
    {titles}
    """
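
Note the doubled braces around the expected output. `PromptTemplate.from_template` uses f-string-style formatting by default, so `{titles}` is treated as a variable while `{{` and `}}` are emitted as literal braces. Python's built-in `str.format` follows the same rule, which the sketch below uses to show the effect without calling LangChain. The shortened template and the titles are made up for the example, and joining the titles into one string is a choice made here.

```python
# Shortened stand-in for the prompt above; the titles are made-up examples.
template = 'EXPECTED OUTPUT:\n{{"label": "Topic Name"}}\n\nTITLES:\n{titles}'

titles = ["A Survey of Text Summarization", "Abstractive Summarization with LLMs"]

# Doubled braces become single literal braces; {titles} is substituted.
rendered = template.format(titles="\n".join(titles))
print(rendered)
```

With single braces around the JSON example, the formatter would try to interpret `"label"` as a variable name and fail.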

### 4.3 Inference of Topic Identification

This section illustrates how to compose a topic pipeline using the [LangChain Expression Language (LCEL)](https://python.langchain.com/v0.2/docs/concepts/#langchain-expression-language-lcel).

In [139]:
import os
from google.colab import userdata
os.environ["HUGGINGFACEHUB_API_TOKEN"] = userdata.get('HUGGINGFACEHUB_API_TOKEN')

In [140]:
import os
from typing import List

from langchain_core.output_parsers import PydanticOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_huggingface import HuggingFaceEndpoint

def TopicModeling(titles: List[str]) -> Topic:
    """
    Infer the common topic of the given titles w/ LangChain, Pydantic, Mistral-7B-Instruct
    """
    repo_id = "mistralai/Mistral-7B-Instruct-v0.3"
    llm = HuggingFaceEndpoint(
        repo_id=repo_id,
        temperature=0.2,
        huggingfacehub_api_token=os.environ["HUGGINGFACEHUB_API_TOKEN"]
    )
    prompt = PromptTemplate.from_template(topic_prompt)
    # The parser validates the model's JSON answer against the Topic schema
    parser = PydanticOutputParser(pydantic_object=Topic)

    # LCEL pipeline: render the prompt, call the LLM, parse the output
    topic_chain = prompt | llm | parser
    return topic_chain.invoke({"titles": titles})
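
The `|` operator chains the steps so that the output of one becomes the input of the next: the prompt is rendered, sent to the model, and the raw text is parsed. The toy sketch below mimics that flow in plain Python with a fake model, so it runs without an API token. The `Step` class and `fake_llm` are hypothetical stand-ins for illustration and not LangChain's implementation.

```python
import json

class Step:
    """Wraps a function so steps can be chained with `|` (toy version)."""
    def __init__(self, fn):
        self.fn = fn

    def __or__(self, other):
        # Compose: run this step first, then feed its result to `other`.
        return Step(lambda value: other.fn(self.fn(value)))

    def invoke(self, value):
        return self.fn(value)

def fake_llm(prompt_text):
    # Hypothetical model that always answers with the same JSON string.
    return '{"label": "Text Summarization"}'

prompt = Step(lambda inputs: "TITLES:\n" + "\n".join(inputs["titles"]))
llm = Step(fake_llm)
parser = Step(lambda text: json.loads(text)["label"])

chain = prompt | llm | parser
result = chain.invoke({"titles": ["Paper A", "Paper B"]})
print(result)
```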

To enable the model to infer the topic of each cluster, we provide a random subset of 25 paper titles from each cluster as input.

In [156]:
%%capture
topics = []
for i, cluster in df.groupby('cluster'):
    titles = cluster['title'].sample(25).tolist()
    topic = TopicModeling(titles)
    topics.append(topic.label)
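
`sample(25)` raises an error if a cluster holds fewer than 25 titles, and it draws a different subset on every run. With `min_cluster_size=100` the first case does not arise here, but for other settings a size guard and a fixed seed can help. A minimal sketch using the standard library, with a hypothetical helper `sample_titles` and made-up titles:

```python
import random

def sample_titles(titles, k=25, seed=42):
    """Return up to k titles, drawn reproducibly for a given seed."""
    rng = random.Random(seed)
    return rng.sample(titles, min(k, len(titles)))

small_cluster = [f"Paper {i}" for i in range(10)]
large_cluster = [f"Paper {i}" for i in range(200)]

print(len(sample_titles(small_cluster)))    # capped at the cluster size
print(len(sample_titles(large_cluster)))    # capped at k
print(sample_titles(large_cluster) == sample_titles(large_cluster))  # same seed, same draw
```

In pandas, the equivalent would be along the lines of `cluster['title'].sample(n=min(25, len(cluster)), random_state=42)`.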

Let's assign each arXiv publication the topic of its cluster and look at the 15 most frequent topics.

In [157]:
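# groupby() iterates clusters in sorted order, so the sorted ids line up with `topics`
cluster_ids = sorted(df['cluster'].unique())
n_clusters = len(cluster_ids)

topic_map = dict(zip(cluster_ids, topics))
df['topic'] = df['cluster'].map(topic_map)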

In [143]:
df['topic'].value_counts().head(15)

topic
Large Language Models Efficiency and Compression       691
Multimodal Language Models                             600
Bias in Language Models                                591
Jailbreak Attacks and Defense                          566
Natural Language Processing in Healthcare              552
Multilingual Machine Translation                       370
Speech Recognition and Translation                     354
Chain-of-Thought Reasoning in Large Language Models    339
Retrieval-Augmented Generation                         314
Autonomous Agents and Planning with Language Models    238
Named Entity Recognition                               217
Large Language Models in Education                     203
Large Language Model Alignment                         197
Text Summarization                                     194
Natural Language Processing for Dialogue Systems       166
Name: count, dtype: int64

## 5. Visualization

In [144]:
%pip install "vegafusion[embed]>=1.5.0" --quiet

import altair as alt
alt.data_transformers.enable("vegafusion")

DataTransformerRegistry.enable('vegafusion')

In [145]:
chart = alt.Chart(df).mark_circle(size=5).encode(
    x='x',
    y='y',
    color='topic:N',
    tooltip=['title', 'topic']
).interactive().properties(
    title='10K arXiv Abstracts in NLP | Embeddings | UMAP | HDBSCAN | Mistral-7B',
    width=600,
    height=400,
)
chart.display()

----