Converted VG to hierarchical #694
I can't run it for E5 as Ucloud is down today and it would take hours on my computer :')
Also added stratified subsampling code to
Looks good. Let us get the results in from the other models as well - otherwise good
@x-tabdeveloping let us get this one merged in as well
:')
Okay, I got a machine, hallelujah
Since the stratified subsampling doesn't exactly work as expected with multilabel data, I think I will just go with a random sample of 2048 entries.
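A minimal sketch of what "a random sample of 2048 entries" could look like for multilabel data. This is not the code from the PR: the function name `random_subsample`, the seed, and the toy data are all assumptions made for illustration. The only point it shows is that texts and their label lists are drawn with the same indices, so they stay aligned.

```python
import random


def random_subsample(texts, labels, n_samples=2048, seed=42):
    """Draw a uniform random subsample, keeping texts and labels aligned.

    Hypothetical helper for illustration; not part of mteb.
    """
    if len(texts) != len(labels):
        raise ValueError("texts and labels must have the same length")
    # Nothing to do if the dataset is already small enough.
    if len(texts) <= n_samples:
        return list(texts), list(labels)
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    idx = sorted(rng.sample(range(len(texts)), n_samples))
    return [texts[i] for i in idx], [labels[i] for i in idx]


# Toy multilabel data (made up): every entry carries a list of labels.
texts = [f"article {i}" for i in range(5000)]
labels = [["news", f"section_{i % 7}"] for i in range(5000)]

sub_texts, sub_labels = random_subsample(texts, labels)
print(len(sub_texts), len(sub_labels))
```

Unlike stratification, this makes no attempt to preserve label proportions, which is the trade-off accepted in the comment above.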
@KennethEnevoldsen green light?
green light 🟢
Tests are failing because of missing datasets unrelated to my PR, what to do?
pull from main should fix it
* Added hierarchical VG clustering tasks
* Added stratified subsampling for multilabel tasks to AbsTask
* Added stratified subsampling to VG clustering
* Fixed stratified subsampling for multilabel tasks
* fix: Converted VG to AbsTaskClusteringFast
* Added results for paraphrase model
* Removed debugging print statements
* Added 'not specified' license to VGHierarchical
* Added proper license from Norsk Aviskorpus
* Ran linting
* Replaced stratification with just regular subsampling
* fix: fixed subsampling
* Added results for VG
* Added points
* fix: Fixed JSON in 694.jsonl

---------

Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>
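The last fix in the list concerns malformed JSON in a `.jsonl` file. As a rough sketch only, a check like the following would catch that kind of error before pushing: it parses each non-empty line on its own and reports the ones that fail. The helper name `find_invalid_jsonl_lines` and the record fields (`name`, `points`) are hypothetical and are not taken from the repository.

```python
import json


def find_invalid_jsonl_lines(text):
    """Return (line_number, error_message) for each non-empty line that is not valid JSON."""
    problems = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # blank lines are skipped rather than reported
        try:
            json.loads(line)
        except json.JSONDecodeError as err:
            problems.append((lineno, err.msg))
    return problems


# Hypothetical records, for illustration only.
good = '{"name": "user-a", "points": 2}\n{"name": "user-b", "points": 1}\n'
bad = '{"name": "user-a", "points": 2,}\n'  # trailing comma is not valid JSON

print(find_invalid_jsonl_lines(good))
print(find_invalid_jsonl_lines(bad))
```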
Checklist for adding MMTEB dataset
Reason for dataset addition:
VG Clustering was not hierarchical nor ClusteringFast before #656
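* Runs with the mteb package.
* Models run on the task using the mteb run -m {model_name} -t {task_name} command:
  * sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
  * intfloat/multilingual-e5-small
* Subsampling: self.stratified_subsampling() under dataset_transform()
* Tests: make test
* Formatting: make lint
* Points added in a file named after the PR number (e.g. 438.jsonl).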