Paper segment: Comparing scores of Clustering to ClusteringFast #835
Comments
I'm happy to collaborate and help out here. I'd imagine the evaluation criteria are quite similar to those of the retrieval speedups?
Yep, quite similar; feel free to update the first comment. I believe a good case for this is the MTEB(eng) clustering tasks, e.g. arxiv clustering (though it might be too large). Not sure exactly what the most convincing argument is here, but something like:
might be simple, but hard to scale to many tasks (ideally we want all ranks to stay the same). Significant rank is probably a better metric.
@KennethEnevoldsen I picked e5-small, e5-base, and e5-large. Any preferences on the "close-relative" model?
Looks like we mixed up how much to embed and how much to sample for bootstrapping. Added more analyses in the PR above. We should now see even faster gains and a better Spearman score.
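For reference, rank agreement between the two evaluation protocols can be measured with a Spearman correlation over the models' scores. Below is a minimal stdlib-only sketch (the scores are illustrative placeholders, not real MTEB results; in practice you would use `scipy.stats.spearmanr`):

```python
def spearman(a, b):
    """Spearman rank correlation (no-ties formula)."""
    def ranks(xs):
        order = sorted(range(len(xs)), key=lambda i: xs[i])
        r = [0.0] * len(xs)
        for rank, i in enumerate(order, start=1):
            r[i] = float(rank)
        return r

    ra, rb = ranks(a), ranks(b)
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical V-measure scores for e5-small/base/large under each protocol.
scores_full = [0.41, 0.45, 0.48]  # original Clustering (full data)
scores_fast = [0.39, 0.44, 0.47]  # ClusteringFast (bootstrapped subsets)

# 1.0 means both protocols order the models identically.
print(spearman(scores_full, scores_fast))  # 1.0
```

A Spearman of 1.0 across all task/model pairs would be the ideal outcome: the cheaper protocol preserves the full-data model ranking.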
A few good pairs for this would be: If we can meaningfully differentiate these models, I would be quite happy.
Adding a comment here for visibility of the progress made in #892.
With the PR merged and the bulk of the content moved to the paper draft, I'm closing this issue. It can be reopened if needed.
These segments compare the new bootstrapping method (plus the multi-level variant) for clustering evaluation to the older one. The goal is to establish that it produces similar results (model rankings) at a lower cost.
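The core idea above can be sketched as follows: instead of clustering the full corpus once, repeatedly draw bootstrapped subsets of the pre-computed embeddings, score each subset, and aggregate. This is a simplified stdlib-only sketch, not the actual MTEB implementation; `score_fn` stands in for the real clustering-plus-V-measure step, and all names and sizes here are illustrative:

```python
import random
import statistics

def bootstrap_evaluate(embeddings, labels, score_fn,
                       n_samples=10, subset_size=4, seed=0):
    """Score clustering quality on bootstrapped subsets of the corpus.

    Embedding happens once up front; only the (cheap) clustering step is
    repeated per subset, which is where the speedup comes from.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_samples):
        sample = rng.sample(range(len(embeddings)), subset_size)
        scores.append(score_fn([embeddings[i] for i in sample],
                               [labels[i] for i in sample]))
    return statistics.mean(scores), statistics.stdev(scores)

# Toy data with a constant scorer, just to demonstrate the control flow.
emb = [[0.0], [0.1], [0.9], [1.0], [0.5], [0.4]]
lab = [0, 0, 1, 1, 0, 0]
mean, std = bootstrap_evaluate(emb, lab, lambda e, l: 1.0)
print(mean, std)  # 1.0 0.0
```

Reporting the mean with a bootstrap standard deviation also gives an uncertainty estimate that the single full-corpus run does not provide.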
Details
Multi-level will be covered in #849