Added multilabel stratification to AbsTaskMultilabelClassification #760

dokato · 2024-05-17T18:17:50Z

This is a continuation of the discussion from #698.

x-tabdeveloping

See my comments. Let's continue the discussion on this; I think we are going in the right direction, but this still needs some thought and work.

mteb/abstasks/AbsTaskMultilabelClassification.py

pyproject.toml

KennethEnevoldsen · 2024-05-23T09:33:54Z

Hi @dokato and @x-tabdeveloping: I would probably just do the split when creating the dataset on HF. This avoids the extra dependency.

x-tabdeveloping · 2024-05-23T10:01:44Z

Yeah, but then we need to reupload a lot of datasets, and I'm wondering if it's too much of a hassle. I am really in favour of not introducing new dependencies if possible.
I'm wondering if it would make sense to just copy the file over from scikit-multilearn, simplify it a bit, and call it a day.
It seems pretty self-contained to me at least. What do you think, @dokato @KennethEnevoldsen?

KennethEnevoldsen · 2024-05-23T10:40:41Z

@x-tabdeveloping I believe that is the best solution. We just need to add a reference.

dokato · 2024-05-24T16:29:13Z

Thanks for the advice, @x-tabdeveloping @KennethEnevoldsen. That's what I did: I moved their function over (with acknowledgments) and slightly modified it so that it returns indices (_iterative_train_test_split).
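For readers following along, here is a minimal sketch of what an index-returning iterative split can look like. It is not the code merged in this PR, nor scikit-multilearn's implementation: the function name `iterative_split_indices`, its parameters, and the simplified greedy rule (process the rarest remaining label first and send each sample to the fold that still wants that label most) are assumptions made for illustration.

```python
import numpy as np


def iterative_split_indices(y, test_size, seed=0):
    """Greedy multilabel stratification returning (train_idx, test_idx).

    y is a binary indicator matrix of shape (n_samples, n_labels).
    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=bool)
    n_samples = y.shape[0]
    ratios = np.array([1.0 - test_size, test_size])  # fold 0 = train, 1 = test

    # How many samples each fold still "wants", in total and per label.
    want_total = ratios * n_samples
    want_label = np.outer(ratios, y.sum(axis=0))

    remaining = np.ones(n_samples, dtype=bool)
    folds = ([], [])

    while remaining.any():
        counts = y[remaining].sum(axis=0).astype(float)
        if counts.max() == 0:
            # Only samples without any label are left: fill by total demand.
            for i in np.flatnonzero(remaining):
                f = int(np.argmax(want_total))
                folds[f].append(int(i))
                want_total[f] -= 1
            break

        # Handle the rarest label that still has unassigned samples.
        counts[counts == 0] = np.inf
        label = int(np.argmin(counts))

        for i in np.flatnonzero(remaining & y[:, label]):
            demand = want_label[:, label]
            best = np.flatnonzero(demand == demand.max())
            if len(best) > 1:  # tie: prefer the fold wanting more samples overall
                total = want_total[best]
                best = best[total == total.max()]
            f = int(rng.choice(best))  # remaining ties are broken at random
            folds[f].append(int(i))
            remaining[i] = False
            want_label[f, y[i]] -= 1  # update demand for every label of sample i
            want_total[f] -= 1

    train_idx = np.array(sorted(folds[0]), dtype=int)
    test_idx = np.array(sorted(folds[1]), dtype=int)
    return train_idx, test_idx
```

Returning indices rather than the split arrays lets the caller select rows from whatever container holds the texts and labels, which is presumably why the copied function was modified that way.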

x-tabdeveloping · 2024-05-30T08:30:37Z

Sorry for the delay @dokato I'm just mid exam-season, I'll have a look at it right now :D

x-tabdeveloping

Beautiful! Looks absolutely good to me. Can I ask you to implement it for one or more tasks and see if it works? Thanks for your awesome work and patience

dokato · 2024-05-31T10:35:00Z

dokato · 2024-05-31T10:38:22Z

I've changed the "Stratified subsampling of test set to 2000 examples." bit in AbsTaskMultilabelClassification.py to 2048, which is more commonly used in classification. But now, given that we have proper stratification, I wonder if we need that bit at all, or should we just assume that the sampling is done in the dataset_transform method, as in other tasks? Maybe just a warning if the split is bigger than 2048 would suffice?
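The "just warn" option could be sketched roughly as below. This is only an illustration of the idea raised in the comment, not code from the PR: `check_split_size` and `N_MAX_SAMPLES` are hypothetical names, and the 2048 limit is taken from the comment above.

```python
import warnings

N_MAX_SAMPLES = 2048  # limit mentioned in the comment; the name is hypothetical


def check_split_size(split_name, n_examples, limit=N_MAX_SAMPLES):
    """Return True if the split fits within `limit`; otherwise warn and return False.

    Hypothetical helper: it leaves subsampling to dataset_transform
    instead of silently truncating the split during evaluation.
    """
    if n_examples > limit:
        warnings.warn(
            f"Split '{split_name}' has {n_examples} examples (> {limit}); "
            "consider a stratified subsample in dataset_transform.",
            UserWarning,
            stacklevel=2,
        )
        return False
    return True
```

With this approach the task author stays responsible for subsampling, and the evaluation code only flags oversized splits.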

dokato · 2024-06-05T07:58:02Z

@x-tabdeveloping I'm struggling to understand what might be going on here. I tried to rebase, but it had too many merge conflicts, so I force-pushed it. But honestly, I have no idea why those tests are failing, especially since they relate to some other datasets...

x-tabdeveloping · 2024-06-05T12:46:56Z

Well, since this PR is making changes to quite fundamental things in the library, I would prefer if everything was passing and all conflicts were resolved before we went on. If nothing fixes it, the best thing to do is probably to reapply your changes to the current main in a new PR or a new branch; it seems like the most painless option to me. Unfortunately, I don't have the slightest clue either what might have gone wrong here.

KennethEnevoldsen · 2024-06-05T18:48:18Z

@dokato pulling from main should resolve a lot of the dataset issues.

dokato · 2024-06-05T21:07:01Z

Good shout, guys. I rebased again. I guess it's ready?

x-tabdeveloping · 2024-06-06T10:56:47Z

I think it looks alright! Thanks for the work and patience @dokato :D

dokato · 2024-06-06T12:09:18Z

No worries, my pleasure! Hope your exams went fine ;)

dokato added the WIP Work In Progress label May 17, 2024

dokato requested a review from x-tabdeveloping May 17, 2024 18:17

dokato mentioned this pull request May 17, 2024

MalteseNewsClassification added #546

Merged

10 tasks

x-tabdeveloping reviewed May 19, 2024

dokato mentioned this pull request May 20, 2024

Multilabel Brazilian Toxic Tweets Classification #773

Merged

10 tasks

imenelydiaker assigned x-tabdeveloping May 24, 2024

dokato force-pushed the mulistart branch 3 times, most recently from a8104b8 to e89f14b Compare May 24, 2024 16:27

dokato changed the title ~~[WIP] Added multilabel stratification to AbsTaskMultilabelClassification~~ Added multilabel stratification to AbsTaskMultilabelClassification May 28, 2024

x-tabdeveloping self-requested a review May 30, 2024 08:30

x-tabdeveloping reviewed May 30, 2024

x-tabdeveloping mentioned this pull request May 30, 2024

Use stratification in multilabel tasks. #850

Closed

9 tasks

dokato force-pushed the mulistart branch 2 times, most recently from a8104b8 to 053cbaa Compare May 31, 2024 10:31

dokato removed the WIP Work In Progress label May 31, 2024

dokato force-pushed the mulistart branch 3 times, most recently from b55919f to 6988b89 Compare June 4, 2024 12:24

merge conflicts fixed for stratification

d9c3710

dokato force-pushed the mulistart branch from 6988b89 to d9c3710 Compare June 5, 2024 20:59

x-tabdeveloping merged commit d7dc9a8 into embeddings-benchmark:main Jun 6, 2024
7 checks passed
