
Paper segment: Comparing scores of Clustering to ClusteringFast #835

Closed
KennethEnevoldsen opened this issue May 28, 2024 · 7 comments

@KennethEnevoldsen
Contributor

KennethEnevoldsen commented May 28, 2024

These segments compare the new bootstrapping method (plus multi-level) for clustering tasks to the older one. The goal is to establish that it produces similar results (model ranks) at a lower cost.

Details

  1. Using e.g. the English benchmark's clustering tasks, obtain v_measures for 3 models.
  2. On the same tasks' v2 (fast) versions and with the same models, obtain v_measures.
  3. The two methods should then yield a Spearman rank correlation of ~1. If we just rank them, I believe it is equivalent to computing the Spearman rank correlation directly on the v_measures (a sketch of this check follows below).
  4. Additionally, run a check for duplicate sentences as in Speed up Reranking tasks #793 (e.g. the duplicate examples in TwentyNewsgroupsClustering #407).

Multi-level will be covered in #849
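
A minimal sketch of the check in step 3, assuming the v_measures for both task variants have already been collected (the model names and scores below are placeholders, not real results):

```python
# Hypothetical sketch: rank agreement between classic and fast clustering tasks.
# Scores are placeholders, not actual benchmark results.
from scipy.stats import spearmanr

v_measures_classic = {"m1": 0.61, "m2": 0.55, "m3": 0.58}
v_measures_fast = {"m1": 0.63, "m2": 0.52, "m3": 0.57}

models = sorted(v_measures_classic)
classic = [v_measures_classic[m] for m in models]
fast = [v_measures_fast[m] for m in models]

# Spearman's rho is computed on ranks internally, so passing the raw
# v_measures is equivalent to ranking the models first.
rho, p_value = spearmanr(classic, fast)
print(f"Spearman's rho: {rho:.3f} (p={p_value:.3f})")
```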

@isaac-chung
Collaborator

I'm happy to collaborate and help out here. I'd imagine the evaluation criteria are quite similar to those of the retrieval speedups?

@KennethEnevoldsen
Contributor Author

Yep, quite similar; feel free to update the first comment. I believe a good case for this is the MTEB(eng) clustering tasks, e.g. arxiv clustering (though it might be too large).

Not sure exactly what the most convincing argument is here, but something like:

| model | old rank | new rank |
|-------|----------|----------|
| m1    | 1        | 1 (-)    |
| m2    | 2        | 3 (↓)    |
| m3    | 3        | 2 (↑)    |

might be simple, but hard to scale to many tasks (ideally we want all ranks to stay the same). Significant rank is probably a better metric.
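
For illustration, such a table could be derived from per-task scores along these lines (a hedged sketch using pandas; the scores are placeholders, not from any PR):

```python
# Illustrative sketch: derive old/new ranks and direction arrows from scores.
import pandas as pd

scores = pd.DataFrame(
    {"old_score": [0.61, 0.58, 0.55], "new_score": [0.63, 0.52, 0.57]},
    index=["m1", "m2", "m3"],
)
# Rank 1 = best, so rank by descending score.
scores["old_rank"] = scores["old_score"].rank(ascending=False).astype(int)
scores["new_rank"] = scores["new_score"].rank(ascending=False).astype(int)
# A larger rank number means the model moved down the leaderboard.
scores["change"] = [
    "-" if o == n else ("↓" if n > o else "↑")
    for o, n in zip(scores["old_rank"], scores["new_rank"])
]
print(scores[["old_rank", "new_rank", "change"]])
```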

@isaac-chung
Collaborator

I would probably go with a small, a medium, and a large model, and then a "close relative" of the large one. We want to be able to differentiate the models: if we compute a Spearman correlation matrix, we should see perfect agreement for small, medium, and large, and close-to-perfect agreement between the large model and its relative.

@KennethEnevoldsen I picked e5-small, e5-base, and e5-large. Any preferences on the "close-relative" model?

@isaac-chung
Collaborator

isaac-chung commented Jun 8, 2024

Looks like we mixed up how much to embed versus how much to sample for bootstrapping. I've added more analyses in the PR above. We should now see even larger speedups and better Spearman scores.
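
For context, the intended split is to embed a (downsampled) corpus once and bootstrap only the cheap clustering-and-scoring step. A minimal sketch of that idea, assuming precomputed embeddings and labels; this is illustrative, not the actual mteb implementation:

```python
# Illustrative sketch, not the actual mteb code: embed once, then bootstrap
# only the (cheap) clustering + scoring step.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import v_measure_score

def bootstrap_v_measures(embeddings, labels, n_bootstraps=10, sample_size=2048, seed=42):
    """Return one v_measure per bootstrap sample of precomputed embeddings (numpy array)."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    n_clusters = len(set(labels.tolist()))
    scores = []
    for _ in range(n_bootstraps):
        # Resample indices only; the expensive embedding step is never repeated.
        idx = rng.choice(len(embeddings), size=min(sample_size, len(embeddings)), replace=False)
        preds = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embeddings[idx])
        scores.append(v_measure_score(labels[idx], preds))
    return scores
```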

@KennethEnevoldsen
Contributor Author

A few good pairs would be:

If we can meaningfully differentiate these models, I would be quite happy.

@isaac-chung
Collaborator

isaac-chung commented Jun 14, 2024

Adding a comment here for visibility of the progress made in #892:

  1. So far we've found a way forward to compare ranks and significant ranks of models on the Clustering and ClusteringFast tasks in the English benchmark. On average we observe a 15x speedup.
  2. Instead of downsampling all datasets with a single value, we downsample each dataset to 4% of its original size. The only exceptions are RedditClustering and StackExchangeClustering (they use 32768 samples) due to their high category count and short documents; a sketch of this rule follows the list. Under this method, when comparing the classic and fast versions, all tasks exhibit moderate to perfect agreement (see the table in the PR description).
  3. A few more models are to be run, probably on the higher end of the English benchmark, to get a wider spread:
    a. e5-large-v2 for large
    b. paraphrase-multilingual-mpnet-base-v2 for medium
  4. The table in the PR description can be added to the paper, specifically in the section where the speedup is described. Detailed plots are to be added to Appendix B under the ClusteringFast dataset construction section.
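
A minimal sketch of the downsampling rule in point 2; the function name, signature, and constants here are illustrative rather than the actual mteb implementation:

```python
# Illustrative sketch of the downsampling rule, not the actual mteb code.
# Most datasets keep 4% of their documents; two high-category, short-document
# tasks instead use a fixed 32768-sample cap.
FIXED_SAMPLE_TASKS = {"RedditClustering": 32768, "StackExchangeClustering": 32768}

def downsampled_size(task_name: str, original_size: int, fraction: float = 0.04) -> int:
    """Number of documents to keep when building the fast task variant."""
    if task_name in FIXED_SAMPLE_TASKS:
        return min(FIXED_SAMPLE_TASKS[task_name], original_size)
    return max(1, round(original_size * fraction))
```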

@isaac-chung
Collaborator

With the PR merged and the bulk of the content moved to the paper draft, I'm closing the issue. Can be reopened if needed.
