Faster Clustering: Does Downsampling Work? #424
feel free to ignore this
@@ -66,4 +66,4 @@ def _evaluate_monolingual(self, model, dataset, split="test", **kwargs):

         v_mean = np.mean(v_measures)
         v_std = np.std(v_measures)
-        return {"v_measure": v_mean, "v_measure_std": v_std}
+        return {"v_measure": v_mean, "v_measure_std": v_std, "v_measures": v_measures}
to allow users to calculate confidence intervals, etc.
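For reference, a minimal sketch of how the returned per-run scores could feed a confidence interval; only the `v_measures` key comes from the diff above, while the bootstrap helper and the example numbers are hypothetical:

```python
import numpy as np

def bootstrap_ci(scores, n_boot=10_000, alpha=0.05, seed=42):
    """Percentile-bootstrap CI for the mean of a list of scores."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores)
    # Resample with replacement and record the mean of each resample
    means = rng.choice(scores, size=(n_boot, len(scores)), replace=True).mean(axis=1)
    return tuple(np.quantile(means, [alpha / 2, 1 - alpha / 2]))

# Illustrative output shape of _evaluate_monolingual after the change above
result = {"v_measure": 0.548, "v_measure_std": 0.023,
          "v_measures": [0.52, 0.57, 0.54, 0.58, 0.53]}
print(bootstrap_ci(result["v_measures"]))  # (lower, upper) 95% CI
```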
feel free to ignore
I think it's okay to have better results on a sub-sampled test set; the moment all models are evaluated on this same set, evaluations become comparable. One thing about the subsampling: I noticed that for classification tasks the code used by contributors shuffles the dataset and selects N_SAMPLES (all randomness fixed by a seed). We can do better 🤔
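One direction that might be better than plain shuffle-and-slice is a label-stratified subsample, so the class distribution of the small split matches the full set. A rough sketch, assuming a list of texts with parallel labels; the helper name and `N_SAMPLES` value are illustrative, not mteb's actual code:

```python
from sklearn.model_selection import train_test_split

N_SAMPLES = 2048  # illustrative cap, not the real constant

def stratified_subsample(texts, labels, n_samples=N_SAMPLES, seed=42):
    """Label-stratified subsample of at most n_samples items.

    Assumes every label occurs at least twice (a train_test_split
    requirement when stratifying).
    """
    if len(texts) <= n_samples:
        return texts, labels
    sub_texts, _, sub_labels, _ = train_test_split(
        texts, labels,
        train_size=n_samples,
        stratify=labels,    # keep label proportions intact
        random_state=seed,  # deterministic, like the seeded shuffle
    )
    return sub_texts, sub_labels
```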
Agreed on all counts; we've talked a bit about this @KennethEnevoldsen. As long as we don't update the benchmark before all datasets are run on the new subsampling mechanism, this looks good to me 👍
Agree with what has been said: we should make sure we're not losing precision in the task, and update all scores if they change 👍
superseded by #481
So I did a few experiments with downsampling on some of the larger datasets. Performance seems quite consistent, but scores are consistently slightly higher for smaller samples, though the std remains similar.
This seems to indicate that we can't simply keep the existing scores; we would have to re-estimate them at the lower sample size.
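For anyone who wants to reproduce this kind of check, a rough sketch of the experiment; synthetic blobs stand in for real sentence embeddings, and the sample sizes, cluster count, and repeat count are all illustrative rather than what was actually run:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import v_measure_score

# Synthetic stand-in for embedded documents with gold cluster labels
X, y = make_blobs(n_samples=50_000, centers=20, n_features=64, random_state=0)

rng = np.random.default_rng(0)
for n in (50_000, 10_000, 2_048):  # full set vs. two downsampled sizes
    scores = []
    for _ in range(5):  # repeat to get a mean/std per sample size
        idx = rng.choice(len(X), size=n, replace=False)
        preds = MiniBatchKMeans(n_clusters=20, random_state=0).fit_predict(X[idx])
        scores.append(v_measure_score(y[idx], preds))
    print(f"n={n}: mean={np.mean(scores):.3f} std={np.std(scores):.3f}")
```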