Faster Clustering: Does Downsampling Work? #424
feel free to ignore this
@@ -66,4 +66,4 @@ def _evaluate_monolingual(self, model, dataset, split="test", **kwargs):

         v_mean = np.mean(v_measures)
         v_std = np.std(v_measures)
-        return {"v_measure": v_mean, "v_measure_std": v_std}
+        return {"v_measure": v_mean, "v_measure_std": v_std, "v_measures": v_measures}
to allow users to calculate confidence intervals, etc.
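For reference, a minimal sketch of how the returned per-run scores could feed a confidence interval; only the `v_measures` key comes from the diff above, while the bootstrap helper and the example numbers are hypothetical:

```python
import numpy as np

def bootstrap_ci(scores, n_boot=10_000, alpha=0.05, seed=42):
    """Percentile-bootstrap CI for the mean of a list of scores."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores)
    # Resample with replacement and record the mean of each resample
    means = rng.choice(scores, size=(n_boot, len(scores)), replace=True).mean(axis=1)
    return tuple(np.quantile(means, [alpha / 2, 1 - alpha / 2]))

# Illustrative output shape of _evaluate_monolingual after the change above
result = {"v_measure": 0.548, "v_measure_std": 0.023,
          "v_measures": [0.52, 0.57, 0.54, 0.58, 0.53]}
print(bootstrap_ci(result["v_measures"]))  # (lower, upper) 95% CI
```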
feel free to ignore
I think it's okay to have better results on a sub-sampled test set; the moment all models are evaluated on this same set, evaluations become comparable. One thing about the subsampling: I noticed that for classification tasks the code used by contributors shuffles the dataset and selects N_SAMPLES (all randomness fixed by a seed). We can do better 🤔
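One direction that might be better than plain shuffle-and-slice is a label-stratified subsample, so the class distribution of the small split matches the full set. A rough sketch, assuming a list of texts with parallel labels; the helper name and `N_SAMPLES` value are illustrative, not mteb's actual code:

```python
from sklearn.model_selection import train_test_split

N_SAMPLES = 2048  # illustrative cap, not the real constant

def stratified_subsample(texts, labels, n_samples=N_SAMPLES, seed=42):
    """Label-stratified subsample of at most n_samples items.

    Assumes every label occurs at least twice (a train_test_split
    requirement when stratifying).
    """
    if len(texts) <= n_samples:
        return texts, labels
    sub_texts, _, sub_labels, _ = train_test_split(
        texts, labels,
        train_size=n_samples,
        stratify=labels,    # keep label proportions intact
        random_state=seed,  # deterministic, like the seeded shuffle
    )
    return sub_texts, sub_labels
```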
Agreed on all counts; we've talked a bit about this @KennethEnevoldsen. As long as we don't update the benchmark before all datasets are run on the new subsampling mechanism, this looks good to me 👍
Agree with what has been said: we should make sure we're not losing precision in the task, and update all scores if they change 👍
superseded by #481
So I did a few experiments with downsampling on some of the larger datasets. Performance seems quite consistent, but scores are consistently slightly higher for smaller samples, though the std remains similar.
This seems to indicate that we can't simply keep the existing scores; we would have to re-estimate them at the lower sample size.
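For anyone who wants to reproduce this kind of check, a rough sketch of the experiment; synthetic blobs stand in for real sentence embeddings, and the sample sizes, cluster count, and repeat count are all illustrative rather than what was actually run:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import v_measure_score

# Synthetic stand-in for embedded documents with gold cluster labels
X, y = make_blobs(n_samples=50_000, centers=20, n_features=64, random_state=0)

rng = np.random.default_rng(0)
for n in (50_000, 10_000, 2_048):  # full set vs. two downsampled sizes
    scores = []
    for _ in range(5):  # repeat to get a mean/std per sample size
        idx = rng.choice(len(X), size=n, replace=False)
        preds = MiniBatchKMeans(n_clusters=20, random_state=0).fit_predict(X[idx])
        scores.append(v_measure_score(y[idx], preds))
    print(f"n={n}: mean={np.mean(scores):.3f} std={np.std(scores):.3f}")
```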