Faster Clustering: Does Downsampling work? #424

Closed · wants to merge 5 commits from faster_clustering

Conversation

@KennethEnevoldsen (Contributor) commented Apr 18, 2024

So I did a few experiments with downsampling on some of the larger datasets. Performance seems quite consistent, but scores are consistently slightly higher for smaller samples, though the std seems to remain similar.

This seems to indicate that we can't just keep the existing scores, but instead have to re-estimate them using the lower sample size.
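
As a rough illustration of the experiment described above (a minimal sketch, not the PR's actual code; it assumes scikit-learn's MiniBatchKMeans and v_measure_score, and the helper name is hypothetical):

    import numpy as np
    from sklearn.cluster import MiniBatchKMeans
    from sklearn.metrics import v_measure_score

    def downsampled_v_measures(embeddings, labels, n_samples=2048, n_runs=10, seed=42):
        # Cluster random subsamples and collect one v-measure per run,
        # mirroring the mean/std computed in the diff below.
        rng = np.random.default_rng(seed)
        labels = np.asarray(labels)
        n_clusters = len(np.unique(labels))
        v_measures = []
        for _ in range(n_runs):
            idx = rng.choice(len(labels), size=min(n_samples, len(labels)), replace=False)
            preds = MiniBatchKMeans(n_clusters=n_clusters).fit_predict(embeddings[idx])
            v_measures.append(v_measure_score(labels[idx], preds))
        return float(np.mean(v_measures)), float(np.std(v_measures))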

@KennethEnevoldsen KennethEnevoldsen added the WIP Work In Progress label Apr 18, 2024
@KennethEnevoldsen (Contributor, Author) commented:

feel free to ignore this

@@ -66,4 +66,4 @@ def _evaluate_monolingual(self, model, dataset, split="test", **kwargs):

     v_mean = np.mean(v_measures)
     v_std = np.std(v_measures)
-    return {"v_measure": v_mean, "v_measure_std": v_std}
+    return {"v_measure": v_mean, "v_measure_std": v_std, "v_measures": v_measures}
@KennethEnevoldsen (Contributor, Author) commented:

to allow the user to calculate confidence intervals, etc.
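
For instance, a user could turn the returned v_measures into a confidence interval (a minimal sketch using a normal approximation; the helper name is hypothetical):

    import numpy as np

    def v_measure_ci(v_measures, z=1.96):
        # 95% normal-approximation confidence interval for the mean v-measure.
        v = np.asarray(v_measures)
        half_width = z * v.std(ddof=1) / np.sqrt(len(v))
        return v.mean() - half_width, v.mean() + half_width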

@KennethEnevoldsen (Contributor, Author) commented:

feel free to ignore

@imenelydiaker (Contributor) commented Apr 18, 2024

I think it's okay to have better results on a sub-sampled test set: the moment all models are evaluated on the same set, evaluations become comparable.

Something about the subsampling: I noticed that for classification tasks the code used by contributors shuffles the dataset and selects N_SAMPLES (all randomness fixed by a seed). We can do better 🤔
I think the sub-sampling should be stratified by labels (classes), so that the sub-sampled test set keeps the same proportions of classes as the original one. If labels are present in a dataset, we can apply the same behavior to clustering tasks.
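
A minimal sketch of the stratified subsampling suggested here (it assumes scikit-learn's train_test_split, which accepts a stratify argument; the function name is illustrative):

    from sklearn.model_selection import train_test_split

    def stratified_subsample(texts, labels, n_samples, seed=42):
        # Keep the original class proportions in the subsampled test set.
        if len(texts) <= n_samples:
            return texts, labels
        texts_sub, _, labels_sub, _ = train_test_split(
            texts, labels, train_size=n_samples, stratify=labels, random_state=seed
        )
        return texts_sub, labels_sub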

@MartinBernstorff (Contributor) commented:

Agreed on all counts; we've talked a bit about this, @KennethEnevoldsen. As long as we don't update the benchmark before all datasets are run with the new subsampling mechanism, this looks good to me 👍

@KennethEnevoldsen KennethEnevoldsen self-assigned this Apr 19, 2024
@Muennighoff (Contributor) commented:

Agree with what has been said: we should make sure we're not losing precision in the task & update all scores if they change 👍

@KennethEnevoldsen (Contributor, Author) commented:

superseded by #481

@KennethEnevoldsen KennethEnevoldsen deleted the faster_clustering branch July 19, 2024 08:37