This repository contains a Joplin plugin prototype that benchmarks local text embeddings and performs unsupervised note clustering inside the Joplin plugin sandbox.
The goal of this plugin is to validate that a fully local AI pipeline can run inside Joplin Desktop and provide useful automatic grouping of notes.
Specifically, it demonstrates:
- Embedding generation for note text with Transformers.js (running in a worker)
- Optional dimensionality reduction with UMAP
- Automatic K selection using silhouette score
- Final K-Means clustering and sidebar visualization
- End-to-end timing metrics for performance discussion
- `@huggingface/transformers` for embedding inference
- `onnxruntime-web` backend assets copied to the plugin dist (`dist/onnx-dist`) for the WASM runtime
- Web Worker for non-blocking inference
- `@saehrimnir/druidjs` for UMAP dimensionality reduction
- In-project K-Means implementation
- In-project cosine-distance silhouette scoring
- TypeScript
- Webpack
- `copy-webpack-plugin` to package static assets (`data.json` and ONNX runtime files)
Note: The build no longer depends on a `tools/` directory. Static assets are copied by webpack during `npm run dist`.
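The asset-copying step can be sketched as a `copy-webpack-plugin` patterns array; the exact paths below are assumptions, not the project's actual webpack config:

```typescript
// Hypothetical copy-webpack-plugin patterns for packaging static assets
// (paths are assumed for illustration).
const copyPatterns = [
  // Line-delimited note data shipped with the plugin
  { from: "src/data.json", to: "data.json" },
  // onnxruntime-web WASM runtime files expected at dist/onnx-dist
  { from: "node_modules/onnxruntime-web/dist", to: "onnx-dist" },
];
```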
Current defaults:
- Embedding model: `Xenova/bge-small-en-v1.5`
- Display name: `BGE-small-en-v1.5`
- DType: `q8`
- Pooling: `mean`
- Normalization: enabled
These settings are defined in `src/modelConfig.ts`.
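A plausible shape for that config, matching the defaults listed above (the field names are assumptions, not the file's actual exports):

```typescript
// Sketch of what src/modelConfig.ts might export (field names assumed).
export const modelConfig = {
  modelId: "Xenova/bge-small-en-v1.5",
  displayName: "BGE-small-en-v1.5",
  dtype: "q8",      // quantized weights for faster local inference
  pooling: "mean",  // mean-pool token embeddings into a single vector
  normalize: true,  // L2-normalize so cosine similarity reduces to a dot product
} as const;
```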
Input file: `src/data.json`
Important: this is line-delimited JSON (JSONL-style), even though the file extension is `.json`.
Each line is one note-like record, for example:

```json
{"title":"Linear Algebra","body":"Vectors, matrices, eigenvalues..."}
{"text":"Raw text-only note format is also supported"}
```

Accepted fields:

- `text`, or
- `title` + `body` (combined during parsing)
- optional `label`
For fast local testing, the plugin currently limits processing to the first 100 records.
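The parsing step described above can be sketched as follows; the function name and the exact way `title` and `body` are joined are assumptions:

```typescript
// Minimal JSONL parsing sketch: each non-empty line becomes one note,
// using `text` or `title` + `body`, capped at the first `limit` records.
interface NoteRecord {
  text: string;
  label?: string;
}

function parseNotes(raw: string, limit = 100): NoteRecord[] {
  return raw
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .slice(0, limit)
    .map((line) => {
      const obj = JSON.parse(line);
      // Prefer `text`; otherwise combine `title` and `body`.
      const text =
        obj.text ?? [obj.title, obj.body].filter(Boolean).join("\n");
      return { text, label: obj.label };
    });
}
```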
- Plugin starts and opens a sidebar panel.
- `data.json` is loaded from the plugin installation directory.
- Data lines are parsed into note text payloads.
- Worker loads the embedding model and performs warmup inference.
- Main thread sends notes to worker one-by-one for embedding.
- Worker returns embedding vectors and per-note inference time.
- Plugin optionally runs UMAP to reduce vector dimensions.
- Plugin tries multiple K values (`K=2` to an adaptive max).
- For each K, K-Means is executed and a silhouette score is computed.
- Best K is selected by highest silhouette score.
- Final K-Means is run with best K.
- Cluster groups + benchmark metrics are rendered in the sidebar.
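The main-thread/worker exchange in the steps above could use message shapes like these; the type and field names are illustrative, not the plugin's actual protocol:

```typescript
// Hypothetical worker message protocol for the embedding round-trip.
type ToWorker =
  | { kind: "load" } // load model and run warmup inference
  | { kind: "embed"; id: number; text: string };

type FromWorker =
  | { kind: "ready"; loadMs: number; warmupMs: number }
  | { kind: "embedding"; id: number; vector: number[]; inferMs: number };

// Main-thread handler: collect vectors and per-note timings as they arrive.
function onWorkerMessage(
  msg: FromWorker,
  vectors: Map<number, number[]>,
  timings: number[],
): void {
  if (msg.kind === "embedding") {
    vectors.set(msg.id, msg.vector);
    timings.push(msg.inferMs);
  }
}
```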
This plugin uses a classic unsupervised clustering stack:
- Feature space: transformer embeddings (semantic vectors)
- Distance basis: cosine similarity / cosine distance
- Optional projection: UMAP (for better separability and lower compute)
- Clustering algorithm: K-Means
- Model selection metric: silhouette score
Why this combination:
- Embeddings capture semantic meaning of notes.
- K-Means is simple, explainable, and fast for PoC.
- Silhouette score gives an objective way to pick K.
- UMAP can improve cluster geometry and speed for larger sets.
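The cosine-distance silhouette used for K selection can be sketched as a small reference implementation (not the project's exact code):

```typescript
// Cosine distance: 1 - cosine similarity.
function cosineDist(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Mean silhouette over all points: s(i) = (b - a) / max(a, b), where
// a = mean distance to own cluster, b = mean distance to nearest other cluster.
function silhouette(points: number[][], labels: number[]): number {
  const n = points.length;
  const mean = (xs: number[]) => xs.reduce((s, x) => s + x, 0) / xs.length;
  let total = 0;
  for (let i = 0; i < n; i++) {
    // Group distances from point i to all other points by cluster label.
    const byCluster = new Map<number, number[]>();
    for (let j = 0; j < n; j++) {
      if (j === i) continue;
      const d = cosineDist(points[i], points[j]);
      const arr = byCluster.get(labels[j]) ?? [];
      arr.push(d);
      byCluster.set(labels[j], arr);
    }
    const own = byCluster.get(labels[i]);
    if (!own || own.length === 0) continue; // singleton cluster: s(i) = 0
    const a = mean(own);
    let b = Infinity;
    for (const [cluster, ds] of byCluster) {
      if (cluster !== labels[i]) b = Math.min(b, mean(ds));
    }
    total += (b - a) / Math.max(a, b);
  }
  return total / n;
}
```

Scores range from -1 to 1; the K whose clustering maximizes this score is selected.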
The sidebar reports:
- Model load time
- Warmup time
- Per-note embedding latency
- Average latency (excluding warmup)
- Total embedding time
- Silhouette score for each tested K
- Selected best K and final cluster sizes
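These metrics can be derived from raw timings roughly like this (a sketch; the structure and names are assumptions):

```typescript
// Aggregate the sidebar timing metrics from raw measurements.
interface EmbedTimings {
  loadMs: number;
  warmupMs: number;
  perNoteMs: number[]; // steady-state inferences only, warmup excluded
}

function summarize(t: EmbedTimings) {
  const totalMs = t.perNoteMs.reduce((s, x) => s + x, 0);
  return {
    loadMs: t.loadMs,
    warmupMs: t.warmupMs,
    avgMs: totalMs / t.perNoteMs.length, // average latency excluding warmup
    totalMs,                             // total embedding time
  };
}
```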
- In prior runs, the pipeline scaled from small samples to larger corpora (see the screenshot link below).
- Worker-based inference keeps UI responsive during embedding.
- The main bottleneck is embedding inference time, which grows roughly linearly with note count since notes are embedded one at a time.
```
npm install
npm run dist
```

This creates a plugin archive in `publish/`.
Install in Joplin:
- Open `Tools -> Options -> Plugins`
- Choose `Install from file`
- Select the generated `.jpl`
- Restart Joplin
Screenshot: https://drive.google.com/file/d/1VPv44PIQ71v0Q-gJQ-1Qtr8ZV9MiWVbK/view?usp=sharing