Fast Clustering #481

Merged
merged 12 commits into from
Apr 23, 2024

KennethEnevoldsen
Contributor

This adds an alternative, fast clustering task for MTEB. It is notably faster than the existing implementation because it downsamples the data and performs the embedding step only once. To regain the robustness that comes from creating clusters multiple times, it repeatedly samples from the embeddings and clusters each sample.
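
The idea described above (downsample, embed once, then resample the embeddings and cluster each sample) can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the function names, the default values for `max_docs` and `n_bootstraps`, sampling with replacement, the tiny pure-Python k-means, and the purity score are all assumptions chosen to keep the example self-contained.

```python
import random
from collections import Counter


def kmeans_labels(points, k, rng, n_iter=10):
    """Tiny k-means stand-in (assumption); returns one cluster id per point."""
    centers = rng.sample(points, k)
    assign = [0] * len(points)
    for _ in range(n_iter):
        # Assign every point to its nearest center (squared Euclidean distance).
        assign = [
            min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            for p in points
        ]
        # Move each center to the mean of its members; empty clusters keep their center.
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:
                centers[j] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assign


def purity(true_labels, cluster_ids):
    """Fraction of points that carry the majority label of their cluster."""
    by_cluster = {}
    for label, cid in zip(true_labels, cluster_ids):
        by_cluster.setdefault(cid, Counter())[label] += 1
    return sum(c.most_common(1)[0][1] for c in by_cluster.values()) / len(true_labels)


def fast_clustering_scores(texts, labels, embed_fn, k, max_docs=2048,
                           n_bootstraps=10, sample_size=None, seed=42):
    """Hypothetical helper: one embedding pass, several clustering passes."""
    rng = random.Random(seed)
    # 1) Downsample the documents before the expensive embedding step.
    keep = list(range(len(texts)))
    if len(keep) > max_docs:
        keep = rng.sample(keep, max_docs)
    # 2) Embed exactly once.
    embeddings = [tuple(vec) for vec in embed_fn([texts[i] for i in keep])]
    kept_labels = [labels[i] for i in keep]
    # 3) Resample the embeddings (with replacement, an assumption) and cluster each sample.
    sample_size = sample_size or len(keep)
    scores = []
    for _ in range(n_bootstraps):
        pick = [rng.randrange(len(keep)) for _ in range(sample_size)]
        sample = [embeddings[i] for i in pick]
        cluster_ids = kmeans_labels(sample, k, rng)
        scores.append(purity([kept_labels[i] for i in pick], cluster_ids))
    return scores
```

With a real model, `embed_fn` would wrap the model's encoding call, and the per-sample scores could be averaged to get a single, more stable number. The clustering and scoring functions here are placeholders for whatever the task actually uses.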

@KennethEnevoldsen KennethEnevoldsen added the WIP Work In Progress label Apr 22, 2024
Collaborator

@isaac-chung isaac-chung left a comment

Noticed that this PR includes some efforts for a consolidated multilingual MTEBScores class. Could that be a separate PR?

@@ -0,0 +1,17 @@
{
"all": {
"evaluation_time": 9.65,
Collaborator

Looking at the file names, does this mean that normal was 4x faster than the "fast" class? 🤔

Contributor Author

Yep. In the case of Swedn the embedding cost is about the same (the same number of documents is embedded); the extra cost comes from bootstrapping and fitting the clusters.

@KennethEnevoldsen
Contributor Author

Noticed that this PR includes some efforts for a consolidated multilingual MTEBScores class. Could that be a separate PR?

Yeah, just had to consider it when designing the task, but I think you are right that it is indeed a separate PR.

@KennethEnevoldsen
Contributor Author

KennethEnevoldsen commented Apr 23, 2024

Thanks for the review @isaac-chung and @Muennighoff.

I have now:

  • removed the MTEBScore class (for another PR)
  • added the fast clustering task while keeping the old implementation
  • added two new datasets using the old implementation, and added a "superseeded_by" attribute to the old datasets, which raises a warning
  • split swednClustering into P2P and S2S
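
As a rough illustration of the "superseeded_by" idea in the list above, a task descriptor could carry an optional pointer to its replacement and emit a warning when an old task is loaded. This is only a sketch under assumptions: `TaskInfo`, `load_task`, and the task names are hypothetical and are not taken from the MTEB code base; only the attribute name (including its spelling) follows the thread.

```python
import warnings
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskInfo:
    """Hypothetical task descriptor; not the actual MTEB metadata class."""
    name: str
    # Spelling follows the attribute name quoted in the thread.
    superseeded_by: Optional[str] = None


def load_task(task: TaskInfo) -> TaskInfo:
    """Hypothetical loader that warns when a newer replacement task exists."""
    if task.superseeded_by is not None:
        warnings.warn(
            f"Task '{task.name}' is superseded by '{task.superseeded_by}'; "
            "consider using the newer task instead.",
            stacklevel=2,
        )
    return task


# Loading the old task emits a warning; loading the new one does not.
old_task = load_task(TaskInfo("SomeClustering", superseeded_by="SomeClusteringFast"))
new_task = load_task(TaskInfo("SomeClusteringFast"))
```

A warning, rather than an error, keeps the old datasets usable for reproducing earlier results while nudging users toward the replacement.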

I believe this is close to ready to merge; let me know what you think.

edit: @imenelydiaker adding you here so that you see the "superseeded_by" implementation as well

@KennethEnevoldsen KennethEnevoldsen removed the WIP Work In Progress label Apr 23, 2024
Collaborator

@isaac-chung isaac-chung left a comment

Most of my comments were on the MTEBScore class. I did not review the cluster datasets. The rest looks good to me 💪

@KennethEnevoldsen KennethEnevoldsen enabled auto-merge (squash) April 23, 2024 12:53
@KennethEnevoldsen
Contributor Author

I will enable auto-merge on this PR

@KennethEnevoldsen KennethEnevoldsen merged commit 8d454bd into main Apr 23, 2024
7 checks passed
This was referenced Apr 23, 2024
@Muennighoff
Contributor

Muennighoff commented May 6, 2024

Amazing work! Just noting that new models are now using this, e.g. in https://huggingface.co/spaces/mteb/leaderboard?task=clustering&language=french the voyage-law-2 model was run with fast clustering while other models were not. Is this okay?
