## The Impact of Intermediate Dimensionality on the Clustering Coherence of News Embeddings using HDBSCAN
---
This notebook evaluates clustering pipelines for news article embeddings by systematically varying intermediate dimensionality reduction and measuring cluster coherence, coverage, and stability. We show that clustering quality is highly sensitive to the reduced dimension and that stable, interpretable trends emerge only within a narrow range of pipeline configurations.

In [1]:
import pandas as pd

### Introduction
---
Recent progress in transformer-based language models has made it straightforward to represent large collections of text as high-dimensional semantic embeddings. Because these embeddings are often massive (over 3,000 dimensions), we rarely cluster them directly. Instead, we use "intermediate" steps to shrink the data down before grouping it.

Why do we do this? Direct clustering in high dimensions is hindered by _distance concentration_: as dimensionality increases, the contrast between the distances to a point's nearest and farthest neighbors diminishes. Density-based algorithms like [HDBSCAN](https://hdbscan.readthedocs.io/en/latest/how_hdbscan_works.html) identify clusters from the distances and connectivity in a minimum spanning tree built over mutual reachability distances. When distances are this 'flat', the algorithm struggles to distinguish dense topical cores from sparse background noise. Intermediate dimensionality reduction serves to __restore the local density contrast__ necessary for HDBSCAN’s reachability metrics to function.
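To make distance concentration concrete, here is a minimal sketch on synthetic Gaussian points (not the news embeddings). The helper `relative_contrast` is a hypothetical name introduced for illustration; it reports how much farther the farthest neighbor is than the nearest one, relative to the nearest distance.

```python
import numpy as np


def relative_contrast(n_points: int, dim: int, seed: int = 0) -> float:
    """Return (d_max - d_min) / d_min for distances from one query point to the rest.

    Points are drawn from a standard Gaussian; this is an illustrative
    assumption, not a model of real text embeddings.
    """
    rng = np.random.default_rng(seed)
    points = rng.standard_normal((n_points, dim))
    query, rest = points[0], points[1:]
    dists = np.linalg.norm(rest - query, axis=1)
    return float((dists.max() - dists.min()) / dists.min())


# The contrast shrinks sharply as the dimension grows.
for dim in (2, 10, 100, 3072):
    print(dim, round(relative_contrast(1000, dim), 3))
```

Real embeddings usually have lower intrinsic dimension than random Gaussian data, so the effect on actual articles may be milder than this toy case suggests.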

In most data pipelines, developers use Principal Component Analysis (PCA) or Uniform Manifold Approximation and Projection (UMAP) to handle this reduction. While PCA is a classic, linear approach that focuses on broad patterns, UMAP is a modern, non-linear method designed to keep related items close together. However, when I searched for a 'best' intermediate dimension, most sources offered only heuristics.

This choice is especially important for Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN), a clustering algorithm that groups data based on how "dense" it is. Because the very concept of "density" changes depending on how many dimensions you have, the intermediate step isn't just a technicality: it might completely change which news stories get grouped together and which are thrown out as "noise."

### The Goal
---
I'm particularly interested in using HDBSCAN to automate trend detection in news for a digest reading app called SightRead. Through this project I hope to learn more about the tradeoffs in choosing a clustering pipeline and to document the findings.

In this work, I explore how changing the number of intermediate dimensions for both PCA and UMAP affects the final quality of news clusters. By testing a range of dimensions, I aim to measure:

__Cluster Coherence__: How semantically similar are the articles within a single group?

__Noise Levels__: How much of our data is discarded as "un-clusterable" in different dimensions?

__Method Comparison__: Does UMAP’s non-linear approach actually produce better clusters for news than the simpler PCA method?

By analyzing embeddings from RSS news feeds, I hope to provide a practical guide for which dimensionality reduction settings actually work best for organizing the daily news.
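As a rough sketch of how the first two measures could be computed, the code below takes coherence to be the mean pairwise cosine similarity inside each cluster and noise level to be the share of points labeled `-1`. Both definitions, and the function names `cluster_coherence` and `noise_fraction`, are assumptions made for illustration; the notebook may define these metrics differently.

```python
import numpy as np


def noise_fraction(labels: np.ndarray) -> float:
    """Share of points that the clusterer labeled as noise (-1)."""
    return float(np.mean(labels == -1))


def cluster_coherence(X_unit: np.ndarray, labels: np.ndarray) -> dict[int, float]:
    """Mean pairwise cosine similarity within each cluster.

    Assumes the rows of X_unit are already L2-normalized, so a dot
    product between two rows equals their cosine similarity.
    """
    scores: dict[int, float] = {}
    for label in np.unique(labels):
        if label == -1:
            continue  # noise is not a cluster
        members = X_unit[labels == label]
        n = len(members)
        if n < 2:
            continue  # no pairs to compare
        sims = members @ members.T
        # Drop the diagonal (self-similarity) before averaging.
        scores[int(label)] = float((sims.sum() - np.trace(sims)) / (n * (n - 1)))
    return scores


# Toy example: two perfectly tight clusters and one noise point.
X_toy = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [-1.0, 0.0]])
labels_toy = np.array([0, 0, 1, 1, -1])
print(cluster_coherence(X_toy, labels_toy), noise_fraction(labels_toy))
```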

### An Overview of Data Collection
---
I'll give a quick overview of where and how the data was collected; more information on the dataset can be found [here](https://github.com/evansun06/trend-analysis-sr/blob/main/wrangling.ipynb). Articles were polled over a 1-week window, from January 5th to January 12th, from a select subset of RSS feeds from _The Guardian_, _The Wall Street Journal_, _United Nations_, _Fox News_, and _Daily Mail_.

<p align="center">
	<img src="assets/rss_aggregation.png" alt="Data aggregation diagram" width="720" />
</p>

_figure: SightRead RSS aggregation workflow_



### Methodology
---

__Data and Embeddings__

Let $N$ be the number of news articles contained in the dataset. Each article $i$ is represented by an embedding vector

$$
x_i \in \mathbb{R}^D
$$

where $D$ is the embedding dimension (3072 in our case).

We stack the embeddings row-wise into a matrix

$$
X = \begin{bmatrix} x_1^{\top} \\ x_2^{\top} \\ \vdots \\ x_N^{\top}\end{bmatrix} \in \mathbb{R}^{N \times D}
$$

For each intermediate dimension $d \in \{2, 5, 10, 15, 25, 50, 75, 100, 200, 500\}$, we run the following pipelines.

__A. Principal Component Analysis__

1. L2 Normalization (scikit-learn)
$$
\hat{x}_i = \frac{x_i}{\lVert x_i \rVert_2}, \quad i = 1, \dots, N
$$
2. Reduce to intermediate dimension via PCA
$$
    \mathrm{PCA}_{(d)}: \mathbb{R}^D \rightarrow \mathbb{R}^d
$$

3. Cluster using HDBSCAN with fixed hyperparameters (control)
    - `min_cluster_size` = 15
    - `min_samples` = 10 (semi-conservative)
    - Outputs: $l \in \{-1, 0, 1, ...\}$ where -1 is noise

__B. Uniform Manifold Approximation and Projection__
- Repeat the steps above with UMAP in place of PCA as the reduction step
$$
\mathrm{UMAP}_{(d)}: \mathbb{R}^D \rightarrow \mathbb{R}^d
$$

### Dataset Overview
---

__Fields__ (post-wrangling):
1. article_id: UUID for a unique article.
2. ingestion_id: UUID for the ingested article (articles can be ingested in different ways; in this dataset all used the `default` version).
3. normalized_text: Headline concatenated with description, normalized.
4. polled_at: Timestamp at which the article was polled (each RSS feed was polled hourly).
5. published_at: Publication date of the article, kept for possible time series analysis.
6. embedding: The embedding vector (OpenAI text-embedding-3-large).

_See `wrangling.ipynb` for wrangling procedures._

In [2]:
df = pd.read_csv("data/wrangled_articles.csv")
df.head()

| Unnamed: 0 | article_id | ingestion_id | normalized_text | polled_at | published_at | embedding |
|---|---|---|---|---|---|---|
| 0 | d9509dc9-a405-4866-9201-19e3d063d137 | 000d8dea-8f60-470e-b539-eb51780b55db | Outrage as 'miserable' shovel-wielding neighbo... | 2026-01-07 13:33:24.714064+00 | 2026-01-07 13:29:24+00 | [-0.01143522 -0.03746951 -0.00753336 ... -0.00... |
| 1 | 9a156ce2-11d6-4eff-b243-e2d2bb208faa | 0072ab60-b1ed-42a9-9624-88c18292e588 | Morality, military might and a sense of mischi... | 2026-01-09 05:17:56.535446+00 | 2026-01-09 03:15:08+00 | [ 0.00615122 -0.03799464 -0.01626345 ... 0.00... |
| 2 | df3b2fc9-4d0f-4199-bca1-ea6d954e0c31 | 007f72ee-5039-47bb-8c61-c981f2810d3a | Republican senator vows to block all Fed nomin... | 2026-01-12 15:17:02.967673+00 | 2026-01-12 14:55:53+00 | [-0.00727808 -0.02314259 -0.02138059 ... 0.01... |
| 3 | d0f6077e-15be-497e-9a2b-1c023486ce9a | 00879af4-8278-40a8-8338-dba8a0b45810 | Starmer prepares to rip up Brexit: PM ready to... | 2026-01-05 04:42:39.19293+00 | 2026-01-04 15:49:23+00 | [ 0.01808163 0.02099675 -0.01054433 ... 0.00... |
| 4 | 847a6cef-0c95-4a0f-9388-7345ba01bf26 | 00937788-6ba8-4db8-8107-645754dd7362 | ANOTHER poll shows Labour in third behind Refo... | 2026-01-07 11:12:56.5161+00 | 2026-01-07 10:57:35+00 | [ 0.03266984 -0.00281101 -0.01202355 ... 0.01... |
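Because the CSV stores each embedding as text, the column has to be parsed before it can be stacked into $X$. The sketch below assumes each cell holds the full vector as a bracketed, whitespace-separated string; the actual storage format is not confirmed here, and `parse_embedding` and `stack_embeddings` are hypothetical helper names. If a cell contains a literal `...` (NumPy's summarized print form for long arrays), the full vector cannot be recovered from the text, so the parser raises instead of guessing.

```python
import numpy as np


def parse_embedding(cell: str) -> np.ndarray:
    """Parse a string such as '[ 0.1 -0.2 0.3]' into a 1-D float array."""
    body = cell.strip().strip("[]")
    if "..." in body:
        # A summarized string has lost its middle values.
        raise ValueError("embedding text is summarized; full vector not recoverable")
    return np.array(body.split(), dtype=float)


def stack_embeddings(cells) -> np.ndarray:
    """Stack parsed embeddings row-wise into an (N, D) matrix."""
    return np.vstack([parse_embedding(c) for c in cells])


# Toy cells standing in for df["embedding"].
cells = ["[ 0.1 -0.2  0.3]", "[0.4 0.5 -0.6]"]
X_small = stack_embeddings(cells)
print(X_small.shape)  # (2, 3)
```

On the real data the call would look like `X = stack_embeddings(df["embedding"])`, after which `X.shape` should be `(N, 3072)` if the vectors were stored in full.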


### References