Fast Clustering #481
Noticed that this PR includes some efforts for a consolidated multilingual MTEBScores class. Could that be a separate PR?
tmp/results/SwednClustering.json
@@ -0,0 +1,17 @@
{
"all": {
"evaluation_time": 9.65,
Looking at the file names, does this mean that normal was 4x faster than the "fast" class? 🤔
Yep. In the case of Swedn the embedding cost is about the same (the same number of documents is being embedded); the extra cost comes from the bootstrapping and fitting clusters.
Yeah, I just had to consider it when designing the task, but I think you are right that it is indeed a separate PR.
Thanks for the review @isaac-chung and @Muennighoff. I have now:
I believe this is close to ready to merge; let me know what you think.
Edit: @imenelydiaker, adding you here so that you see the "superseeded_by" implementation as well.
Most of my comments were on the MTEBScore class. I did not review the cluster datasets. The rest looks good to me 💪
I will enable auto-merge on this PR.
Amazing work! Just noting that new models are now using this, e.g. in https://huggingface.co/spaces/mteb/leaderboard?task=clustering&language=french the voyage-law-2 model was run with fast clustering while other models were not. Is this okay?
This PR creates an alternative fast clustering for MTEB. It is notably faster, gaining its speed by downsampling and performing the embed step only once. To regain the robustness of creating clusters multiple times, it repeatedly samples from the already computed embeddings and fits clusters on each sample.
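The idea described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the function names (`fast_clustering_scores`, `toy_embed`), the parameters (`max_docs`, `n_rounds`, `sample_size`), the tiny k-means, the purity score, and sampling without replacement are all assumptions made for the example, using only the standard library.

```python
import random
from collections import Counter


def kmeans(points, k, rng, n_iter=10):
    """Tiny k-means (Lloyd's algorithm); returns one cluster id per point."""
    centroids = rng.sample(points, k)
    assignment = [0] * len(points)
    for _ in range(n_iter):
        # Assign every point to its nearest centroid (squared Euclidean).
        assignment = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in points
        ]
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignment


def purity(labels, assignment):
    """Share of points whose cluster's majority label equals their own label."""
    by_cluster = {}
    for label, cluster in zip(labels, assignment):
        by_cluster.setdefault(cluster, Counter())[label] += 1
    return sum(max(c.values()) for c in by_cluster.values()) / len(labels)


def fast_clustering_scores(texts, labels, embed_fn, max_docs=200,
                           n_rounds=5, sample_size=50, seed=0):
    """Hypothetical sketch: downsample, embed once, then cluster resamples."""
    rng = random.Random(seed)
    # 1) Downsample the documents.
    keep = rng.sample(range(len(texts)), min(max_docs, len(texts)))
    kept_labels = [labels[i] for i in keep]
    # 2) Embed only once; this is the expensive step.
    embeddings = embed_fn([texts[i] for i in keep])
    k = len(set(kept_labels))
    scores = []
    for _ in range(n_rounds):
        # 3) Sample from the stored embeddings and cluster the sample.
        idx = rng.sample(range(len(embeddings)), min(sample_size, len(embeddings)))
        sample_points = [embeddings[i] for i in idx]
        sample_labels = [kept_labels[i] for i in idx]
        scores.append(purity(sample_labels, kmeans(sample_points, k, rng)))
    return scores


# Toy usage: a fake embedder that records how often it is called.
calls = []

def toy_embed(batch):
    calls.append(len(batch))
    return [[float(len(t)), float(t.count("x"))] for t in batch]

texts = ["x" * (i % 5 + 1) for i in range(100)] + ["y" * (i % 5 + 20) for i in range(100)]
labels = ["short"] * 100 + ["long"] * 100
scores = fast_clustering_scores(texts, labels, toy_embed)
print(len(calls), scores)
```

In this sketch the embedder is called a single time regardless of `n_rounds`, so the extra cost of each additional round is only the resampling and cluster fitting, which matches the cost breakdown mentioned in the review thread.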