Fast Clustering #481

KennethEnevoldsen · 2024-04-22T09:39:53Z

This adds an alternative, faster clustering task for MTEB. The speedup comes from downsampling the dataset and running the embed step only once. To keep the robustness of evaluating on multiple clusterings, it then repeatedly samples from the cached embeddings and fits clusters on each sample.
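For readers skimming the thread, here is a rough, stdlib-only sketch of the idea described above (downsample, embed once, then bootstrap clusterings from the cached embeddings). It is not the code in `AbsTaskClusteringFast.py`: `toy_embed`, `kmeans_labels`, `purity`, `fast_clustering_scores`, and all parameter names and defaults are made up for illustration, and purity is swapped in as a simple stand-in metric.

```python
import random
from collections import Counter
from statistics import mean


def toy_embed(texts):
    # Hypothetical stand-in for the model's encode step: (length, vowel count) per text.
    return [(float(len(t)), float(sum(c in "aeiou" for c in t))) for t in texts]


def kmeans_labels(points, k, rng, n_iter=10):
    # Plain Lloyd's algorithm; returns one cluster index per point.
    centroids = rng.sample(points, k)
    assignment = [0] * len(points)
    for _ in range(n_iter):
        assignment = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the previous centroid if a cluster went empty
                centroids[c] = tuple(mean(dim) for dim in zip(*members))
    return assignment


def purity(labels, assignment):
    # Fraction of points that carry the majority gold label of their cluster.
    by_cluster = {}
    for label, cluster in zip(labels, assignment):
        by_cluster.setdefault(cluster, Counter())[label] += 1
    return sum(max(c.values()) for c in by_cluster.values()) / len(labels)


def fast_clustering_scores(texts, labels, n_samples=5, sample_size=8, max_docs=16, seed=42):
    rng = random.Random(seed)
    # 1) Downsample the corpus once.
    keep = rng.sample(range(len(texts)), min(max_docs, len(texts)))
    texts = [texts[i] for i in keep]
    labels = [labels[i] for i in keep]
    # 2) Embed once; this is the expensive step in a real run.
    embeddings = toy_embed(texts)
    k = len(set(labels))
    # 3) Bootstrap: sample from the cached embeddings and cluster each sample.
    scores = []
    for _ in range(n_samples):
        idx = rng.sample(range(len(embeddings)), min(sample_size, len(embeddings)))
        assignment = kmeans_labels([embeddings[i] for i in idx], k, rng)
        scores.append(purity([labels[i] for i in idx], assignment))
    return scores
```

The point of the structure is that step 2 runs once regardless of `n_samples`, so the per-sample cost is only the clustering fit and scoring.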

mteb/abstasks/AbsTaskClusteringFast.py

mteb/tasks/Clustering/eng/ArxivClusteringP2PFast.py

mteb/tasks/Clustering/swe/swedn_clustering.py

tmp/results/ArxivClusteringP2PFast1.json

isaac-chung

Noticed that this PR includes some efforts for a consolidated multilingual MTEBScores class. Could that be a separate PR?

mteb/abstasks/AbsTaskClusteringFast.py

isaac-chung · 2024-04-22T17:19:46Z

tmp/results/SwednClustering.json

@@ -0,0 +1,17 @@
+{
+  "all": {
+    "evaluation_time": 9.65,


Looking at the file names, does this mean that normal was 4x faster than the "fast" class? 🤔

Yep, in the case of Swedn it is about the same (the same number of documents is embedded); the extra cost comes from the bootstrapping and fitting the clusters.

mteb/abstasks/AbsTaskClusteringFast.py

KennethEnevoldsen · 2024-04-23T08:23:03Z

> Noticed that this PR includes some efforts for a consolidated multilingual MTEBScores class. Could that be a separate PR?

Yeah, just had to consider it when designing the task, but I think you are right that it is indeed a separate PR.

KennethEnevoldsen · 2024-04-23T09:36:28Z

Thanks for the review @isaac-chung and @Muennighoff.

I have now:

- Removed the MTEBScore class (left for another PR)
- Added the fast clustering task, while keeping the old implementation
- Added two new datasets using the old implementation, and added a "superseeded_by" attribute to the old datasets, which raises a warning
- Split SwednClustering into P2P and S2S
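As a rough illustration of the "superseeded_by" behaviour listed above, here is a minimal sketch. The class names (`Task`, `OldClustering`, `NewClusteringFast`), the warning category, and the message are hypothetical and not taken from the MTEB codebase; only the attribute name is kept as spelled in this PR.

```python
import warnings


class Task:
    # Hypothetical minimal base class, not MTEB's actual task class.
    name = "Task"
    superseeded_by = None  # attribute name spelled as in this PR

    def __init__(self):
        # Warn on construction if a newer task replaces this one.
        if self.superseeded_by is not None:
            warnings.warn(
                f"{self.name} is superseded by {self.superseeded_by}; "
                "consider running that task instead.",
                DeprecationWarning,
                stacklevel=2,
            )


class NewClusteringFast(Task):
    name = "NewClusteringFast"


class OldClustering(Task):
    name = "OldClustering"
    superseeded_by = "NewClusteringFast"
```

Old tasks stay runnable this way, so existing results remain reproducible, while users are nudged toward the replacement.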

I believe this is close to ready to merge; let me know what you think.

edit: @imenelydiaker adding you here so that you see the "superseeded_by" implementation as well

isaac-chung

Most of my comments were on the MTEBScore class. I did not review the cluster datasets. The rest looks good to me 💪

KennethEnevoldsen · 2024-04-23T12:54:20Z

I will enable auto-merge on this PR

Muennighoff · 2024-05-06T18:16:28Z

Amazing work! Just noting that new models are now using this, e.g. in https://huggingface.co/spaces/mteb/leaderboard?task=clustering&language=french the voyage-law-2 model was run with fast clustering while other models were not. Is this okay?

KennethEnevoldsen added 2 commits April 21, 2024 14:26

added outline for fast clustering

1210c6f

added examples of clustering transformations

ce6b71d

KennethEnevoldsen added the WIP Work In Progress label Apr 22, 2024

KennethEnevoldsen requested a review from MartinBernstorff April 22, 2024 09:39

KennethEnevoldsen commented Apr 22, 2024

KennethEnevoldsen requested a review from Muennighoff April 22, 2024 09:51

KennethEnevoldsen mentioned this pull request Apr 22, 2024

duplicate examples in TwentyNewsgroupsClustering #407

Open

KennethEnevoldsen requested review from isaac-chung and removed request for MartinBernstorff April 22, 2024 12:02

isaac-chung reviewed Apr 22, 2024

KennethEnevoldsen added 4 commits April 23, 2024 11:03

docs: added points

61cea5c

added superseeded by logic

f567818

format

1272487

Added fast clustering task

91b2ae6

KennethEnevoldsen mentioned this pull request Apr 23, 2024

Transitioning to a faster MTEB #482

Closed

Added results for new datasets

d2d2489

KennethEnevoldsen removed the WIP Work In Progress label Apr 23, 2024

Typo

8fe2a53

isaac-chung approved these changes Apr 23, 2024

KennethEnevoldsen added 4 commits April 23, 2024 12:39

Merge branch 'main' into clustering_fast

4c79e9d

added results

cf89b73

Merge branch 'clustering_fast' of https://github.com/embeddings-bench…

d9145be

…mark/mteb into clustering_fast

Merge branch 'main' into clustering_fast

f8388a4

KennethEnevoldsen enabled auto-merge (squash) April 23, 2024 12:53

KennethEnevoldsen merged commit 8d454bd into main Apr 23, 2024
7 checks passed

This was referenced Apr 23, 2024

Add TedTalks dataset #386

Closed

WIP: Multilabel classification #440

Merged

KennethEnevoldsen mentioned this pull request Apr 24, 2024

Faster Clustering: Doe Downsampling work? #424

Closed

KennethEnevoldsen mentioned this pull request May 1, 2024

Add new Polish clustering tasks (PL-MTEB) #607

Merged

10 tasks

This was referenced May 8, 2024

fix: Convert Multilingual/Crosslingual to fast-loading format #635

Merged

Sampling MTEB #647

Closed

mmteb | Arabic | Retrieval Task #669

Closed

Muennighoff mentioned this pull request May 14, 2024

Added ArXiv Hierarchical clustering (S2S and P2P) #699

Merged

10 tasks

KennethEnevoldsen mentioned this pull request Jun 11, 2024

Thoughts on accelerating the evaluation speed: supporting DDP/FSDP model inference + multi-task evaluation in parallel #883

Closed

KennethEnevoldsen deleted the clustering_fast branch July 19, 2024 08:38
