developerzohaib786/json-data-clustering

Joplin AI Note Clustering Benchmark Plugin

This repository contains a Joplin plugin prototype that benchmarks local text embeddings and performs unsupervised note clustering inside the Joplin plugin sandbox.

Goal

The goal of this plugin is to validate that a fully local AI pipeline can run inside Joplin Desktop and provide useful automatic grouping of notes.

Specifically, it demonstrates:

  • Embedding generation for note text with Transformers.js (running in a worker)
  • Optional dimensionality reduction with UMAP
  • Automatic K selection using silhouette score
  • Final K-Means clustering and sidebar visualization
  • End-to-end timing metrics for performance discussion

Tech Stack and Packages

Core Runtime

  • @huggingface/transformers for embedding inference
  • onnxruntime-web backend assets copied to plugin dist (dist/onnx-dist) for WASM runtime
  • Web Worker for non-blocking inference

Clustering and Math

  • @saehrimnir/druidjs for UMAP dimensionality reduction
  • In-project K-Means implementation
  • In-project cosine-distance silhouette scoring

Build Tooling

  • TypeScript
  • Webpack
  • copy-webpack-plugin to package static assets (data.json and ONNX runtime files)

Note: The build no longer depends on a tools/ directory. Static assets are copied by webpack during npm run dist.
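
For orientation, the asset-copy step might look roughly like the fragment below. It is a sketch, not this repository's actual webpack configuration: the file name webpack.config.ts and the onnxruntime-web source path are assumptions, and only the destinations (data.json and onnx-dist under dist/) come from the description above.

```typescript
// webpack.config.ts (illustrative fragment; paths marked "assumed" are not confirmed by this repository)
import * as path from "path";
import CopyPlugin from "copy-webpack-plugin";
import type { Configuration } from "webpack";

const config: Configuration = {
  // ...entry, module rules, and output settings omitted
  plugins: [
    new CopyPlugin({
      patterns: [
        // Sample notes, loaded at runtime from the plugin installation directory.
        { from: path.resolve(__dirname, "src/data.json"), to: "data.json" },
        // ONNX Runtime Web assets for the WASM backend (source path assumed).
        {
          from: path.resolve(__dirname, "node_modules/onnxruntime-web/dist"),
          to: "onnx-dist",
        },
      ],
    }),
  ],
};

export default config;
```

The `to` paths are resolved relative to webpack's output directory, so with an output path of dist the ONNX files end up in dist/onnx-dist.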

Model Configuration Used

Current defaults:

  • Embedding model: Xenova/bge-small-en-v1.5
  • Display name: BGE-small-en-v1.5
  • DType: q8
  • Pooling: mean
  • Normalization: enabled

These settings are defined in src/modelConfig.ts.
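
As a rough illustration, the defaults above could be expressed as a typed object like the one below. The interface, the constant name, and the two helper functions are hypothetical; the real src/modelConfig.ts may be organized differently.

```typescript
// Hypothetical shape for the defaults listed above; src/modelConfig.ts may differ.
interface ModelConfig {
  modelId: string;
  displayName: string;
  dtype: "fp32" | "fp16" | "q8" | "q4";
  pooling: "mean" | "cls" | "none";
  normalize: boolean;
}

const MODEL_CONFIG: ModelConfig = {
  modelId: "Xenova/bge-small-en-v1.5",
  displayName: "BGE-small-en-v1.5",
  dtype: "q8",
  pooling: "mean",
  normalize: true,
};

// Options that apply once, when the model is loaded.
function loadOptions(cfg: ModelConfig): { dtype: ModelConfig["dtype"] } {
  return { dtype: cfg.dtype };
}

// Options that apply to every embedding call.
function inferenceOptions(
  cfg: ModelConfig
): { pooling: ModelConfig["pooling"]; normalize: boolean } {
  return { pooling: cfg.pooling, normalize: cfg.normalize };
}

// Inside the worker, these would typically be handed to Transformers.js along these lines:
//   const extractor = await pipeline("feature-extraction", MODEL_CONFIG.modelId, loadOptions(MODEL_CONFIG));
//   const output = await extractor(noteText, inferenceOptions(MODEL_CONFIG));
```

Separating load-time options from per-call options keeps the quantization choice (q8) apart from the pooling and normalization settings, which only affect how token outputs are turned into a single vector.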

Data Format

Input file: src/data.json

Important: this is line-delimited JSON (JSONL-style), even though the file extension is .json.

Each line is one note-like record, for example:

{"title":"Linear Algebra","body":"Vectors, matrices, eigenvalues..."}
{"text":"Raw text-only note format is also supported"}

Accepted fields:

  • text
  • or title + body (combined during parsing)
  • optional label

For fast local testing, the plugin currently limits processing to the first 100 records.
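
The parsing rules above can be sketched as follows. This is a minimal illustration, not the plugin's actual parser: the names `NoteRecord`, `MAX_RECORDS`, and `parseNotes` are hypothetical, and two behaviors are assumptions (malformed lines are skipped, and title and body are joined with a blank line).

```typescript
interface NoteRecord {
  text: string;
  label?: string;
}

const MAX_RECORDS = 100; // mirrors the current testing limit described above

function parseNotes(raw: string, limit: number = MAX_RECORDS): NoteRecord[] {
  const notes: NoteRecord[] = [];
  for (const line of raw.split(/\r?\n/)) {
    if (notes.length >= limit) break;
    const trimmed = line.trim();
    if (trimmed === "") continue; // ignore blank lines

    let parsed: unknown;
    try {
      parsed = JSON.parse(trimmed);
    } catch {
      continue; // assumption: malformed lines are skipped, not fatal
    }
    if (typeof parsed !== "object" || parsed === null) continue;
    const rec = parsed as Record<string, unknown>;

    // Prefer `text`; otherwise combine `title` and `body`.
    let text = "";
    if (typeof rec.text === "string") {
      text = rec.text;
    } else {
      const title = typeof rec.title === "string" ? rec.title : "";
      const body = typeof rec.body === "string" ? rec.body : "";
      text = [title, body].filter((part) => part !== "").join("\n\n");
    }
    if (text.trim() === "") continue; // nothing to embed

    const note: NoteRecord = { text };
    if (typeof rec.label === "string") note.label = rec.label;
    notes.push(note);
  }
  return notes;
}
```

In this sketch the limit counts usable records rather than raw lines, so blank or malformed lines do not reduce the number of notes that reach the embedding step.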

How the Pipeline Works (Step by Step)

  1. Plugin starts and opens a sidebar panel.
  2. data.json is loaded from the plugin installation directory.
  3. Data lines are parsed into note text payloads.
  4. Worker loads the embedding model and performs warmup inference.
  5. Main thread sends notes to worker one-by-one for embedding.
  6. Worker returns embedding vectors and per-note inference time.
  7. Plugin optionally runs UMAP to reduce vector dimensions.
  8. Plugin tries multiple K values (K=2 to an adaptive max).
  9. For each K, K-Means is executed and silhouette score is computed.
  10. Best K is selected by highest silhouette score.
  11. Final K-Means is run with best K.
  12. Cluster groups + benchmark metrics are rendered in the sidebar.
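
Steps 8 to 10 amount to a small search loop, sketched below. The clustering and scoring functions are passed in as parameters so the loop stays independent of any particular implementation. The `adaptiveMaxK` rule shown here (roughly the square root of the note count, capped at 10) is purely an assumption; the README does not say how the plugin derives its adaptive maximum.

```typescript
type Vector = number[];
type ClusterFn = (points: Vector[], k: number) => number[]; // returns one label per point
type ScoreFn = (points: Vector[], labels: number[]) => number; // higher is better

interface KSelection {
  bestK: number;
  bestScore: number;
  scores: Map<number, number>; // score for every K that was tried
}

// Hypothetical cap: about sqrt(n), at most 10, never above n - 1, never below 2.
function adaptiveMaxK(n: number): number {
  return Math.max(2, Math.min(10, Math.floor(Math.sqrt(n)), n - 1));
}

function selectBestK(points: Vector[], cluster: ClusterFn, score: ScoreFn): KSelection {
  if (points.length < 3) {
    throw new Error("need at least 3 points to compare K values");
  }
  const scores = new Map<number, number>();
  let bestK = 2;
  let bestScore = -Infinity;
  const maxK = adaptiveMaxK(points.length);
  for (let k = 2; k <= maxK; k++) {
    const labels = cluster(points, k);
    const s = score(points, labels);
    scores.set(k, s);
    if (s > bestScore) {
      bestScore = s;
      bestK = k;
    }
  }
  return { bestK, bestScore, scores };
}
```

Step 11 then corresponds to one more call of the clustering function with `bestK`, and the `scores` map holds the per-K values that the sidebar can display.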

Clustering Method Used

This plugin uses a classic unsupervised clustering stack:

  • Feature space: transformer embeddings (semantic vectors)
  • Distance basis: cosine similarity / cosine distance
  • Optional projection: UMAP (for better separability and lower compute)
  • Clustering algorithm: K-Means
  • Model selection metric: silhouette score
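
To make the stack concrete, here is a compact K-Means sketch that assigns points by cosine distance. It is not the in-project implementation, which may differ in distance handling, seeding, and stopping rules. In particular, the evenly spaced seeding below is a simplification chosen to keep the example deterministic.

```typescript
type Vec = number[];

function cosineDistance(a: Vec, b: Vec): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 1; // treat zero vectors as unrelated
  return 1 - dot / Math.sqrt(normA * normB);
}

function kMeans(points: Vec[], k: number, maxIter: number = 50): number[] {
  if (k < 1 || k > points.length) {
    throw new Error("k must be between 1 and the number of points");
  }
  const dim = points[0].length;
  // Simplified deterministic seeding: copy k evenly spaced points.
  let centroids: Vec[] = Array.from({ length: k }, (_v, c) =>
    points[Math.floor((c * points.length) / k)].slice()
  );
  let labels: number[] = new Array<number>(points.length).fill(-1);

  for (let iter = 0; iter < maxIter; iter++) {
    // Assignment step: nearest centroid by cosine distance.
    const next = points.map((p) => {
      let best = 0;
      let bestDist = Infinity;
      for (let j = 0; j < k; j++) {
        const d = cosineDistance(p, centroids[j]);
        if (d < bestDist) {
          bestDist = d;
          best = j;
        }
      }
      return best;
    });
    const changed = next.some((label, i) => label !== labels[i]);
    labels = next;
    if (!changed) break; // converged

    // Update step: each centroid becomes the mean of its members.
    const sums: Vec[] = Array.from({ length: k }, () => new Array<number>(dim).fill(0));
    const counts: number[] = new Array<number>(k).fill(0);
    points.forEach((p, i) => {
      counts[labels[i]] += 1;
      for (let d = 0; d < dim; d++) sums[labels[i]][d] += p[d];
    });
    centroids = sums.map((s, j) =>
      counts[j] > 0 ? s.map((v) => v / counts[j]) : centroids[j] // keep empty clusters in place
    );
  }
  return labels;
}
```

Because the embeddings are normalized, ranking neighbors by cosine distance and by Euclidean distance gives the same order, so a Euclidean K-Means would behave similarly on the raw vectors. After a UMAP projection that equivalence no longer holds, and the choice of distance becomes a real design decision.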

Why this combination:

  • Embeddings capture semantic meaning of notes.
  • K-Means is simple, explainable, and fast enough for a proof of concept.
  • Silhouette score gives an objective way to pick K.
  • UMAP can improve cluster geometry and speed for larger sets.
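
For reference, the silhouette value of one point is (b - a) / max(a, b), where a is its mean distance to the other members of its own cluster and b is the lowest mean distance to any other cluster. The score for a clustering is the average over all points. The sketch below computes it with cosine distance. It is an illustration rather than the in-project code, and returning 0 for a single cluster or a singleton member is a convention assumed here.

```typescript
type Vec = number[];

function cosDist(a: Vec, b: Vec): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 1;
  return 1 - dot / Math.sqrt(normA * normB);
}

function silhouetteScore(points: Vec[], labels: number[]): number {
  const n = points.length;
  const clusterIds = Array.from(new Set(labels));
  if (n === 0 || clusterIds.length < 2) return 0; // not defined for one cluster (assumed convention)

  let total = 0;
  for (let i = 0; i < n; i++) {
    // Sum of distances from point i to every cluster, and member counts.
    const sums = new Map<number, number>();
    const counts = new Map<number, number>();
    for (let j = 0; j < n; j++) {
      if (i === j) continue;
      const d = cosDist(points[i], points[j]);
      sums.set(labels[j], (sums.get(labels[j]) ?? 0) + d);
      counts.set(labels[j], (counts.get(labels[j]) ?? 0) + 1);
    }
    const ownCount = counts.get(labels[i]) ?? 0;
    if (ownCount === 0) continue; // singleton cluster contributes 0

    const a = (sums.get(labels[i]) ?? 0) / ownCount;
    let b = Infinity;
    for (const c of clusterIds) {
      if (c === labels[i]) continue;
      const cnt = counts.get(c) ?? 0;
      if (cnt > 0) b = Math.min(b, (sums.get(c) ?? 0) / cnt);
    }
    const denom = Math.max(a, b);
    total += denom === 0 ? 0 : (b - a) / denom;
  }
  return total / n;
}
```

Values near 1 indicate tight, well-separated clusters, values near 0 indicate overlapping clusters, and negative values suggest that points sit closer to a neighboring cluster than to their own. The computation is quadratic in the number of notes, which is acceptable for a 100-note sample but worth keeping in mind for larger corpora.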

Performance Reporting

The sidebar reports:

  • Model load time
  • Warmup time
  • Per-note embedding latency
  • Average latency (excluding warmup)
  • Total embedding time
  • Silhouette score for each tested K
  • Selected best K and final cluster sizes
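
The timing figures in this list can be derived from three raw measurements, as in the sketch below. The type and function names are hypothetical, and treating "total embedding time" as the sum of per-note latencies (with warmup reported separately) is an assumption about how the plugin defines it.

```typescript
interface BenchmarkTimings {
  modelLoadMs: number;
  warmupMs: number;
  perNoteMs: number[]; // one entry per embedded note; the warmup run is not included
}

interface BenchmarkSummary {
  modelLoadMs: number;
  warmupMs: number;
  noteCount: number;
  totalEmbeddingMs: number; // assumed: sum of per-note latencies only
  averageMs: number; // mean per-note latency, excluding warmup
}

function summarizeTimings(t: BenchmarkTimings): BenchmarkSummary {
  const totalEmbeddingMs = t.perNoteMs.reduce((sum, ms) => sum + ms, 0);
  const noteCount = t.perNoteMs.length;
  return {
    modelLoadMs: t.modelLoadMs,
    warmupMs: t.warmupMs,
    noteCount,
    totalEmbeddingMs,
    averageMs: noteCount > 0 ? totalEmbeddingMs / noteCount : 0,
  };
}
```

Keeping warmup out of the average matters because the first inference usually includes one-time setup work and would otherwise inflate the per-note figure.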

Observed behavior in this repository

  • Pipeline successfully scales from small samples to larger corpora in prior runs (see screenshot section below).
  • Worker-based inference keeps UI responsive during embedding.
  • Main bottleneck is embedding inference time, which scales roughly with note count.

Build and Run

npm install
npm run dist

This creates a plugin archive in publish/.

Install in Joplin:

  1. Open Tools -> Options -> Plugins
  2. Choose Install from file
  3. Select the generated .jpl
  4. Restart Joplin

Pipeline Demo

https://drive.google.com/file/d/1VPv44PIQ71v0Q-gJQ-1Qtr8ZV9MiWVbK/view?usp=sharing
