Multilabel Brazilian Toxic Tweets Classification #773

dokato · 2024-05-20T20:17:48Z

Checklist for adding MMTEB dataset

Reason for dataset addition: there is a shortage of datasets in Brazilian Portuguese, and we currently don't have a big enough variety of multilabel datasets. This one is well structured and described: https://paperswithcode.com/dataset/told-br

Furthermore, I corrected some minor issues with Maltese News Classification.

dokato · 2024-05-20T20:25:31Z

@x-tabdeveloping sorry to bother you again, but I need advice on how to handle this dataset. On HF it consists of a single train split with 21k examples.
Here again I can't use the built-in stratified train_test_split from datasets, as it complains about the column type, so I just take random samples. As a consequence, we can't guarantee that the labels in the training split will match the ones in the test split. WDYT? Should we sort out #760 or #694 first?
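
For illustration only, the label-mismatch concern above can be sketched in plain Python. This is not the code from this PR: the function name `split_with_label_coverage`, its parameters, and the toy label names are all hypothetical. The sketch takes a random split and retries until every label that occurs in the test part also occurs in the training part, which is a weaker guarantee than true stratification.

```python
import random
from typing import List, Sequence, Set, Tuple


def split_with_label_coverage(
    labels: Sequence[Sequence[str]],
    test_size: float = 0.2,
    seed: int = 42,
    max_tries: int = 100,
) -> Tuple[List[int], List[int]]:
    """Randomly split example indices, retrying until every label that
    occurs in the test part also occurs in the training part.

    Hypothetical helper for illustration; not part of mteb or datasets.
    """
    n_test = max(1, int(round(len(labels) * test_size)))
    rng = random.Random(seed)
    indices = list(range(len(labels)))
    for _ in range(max_tries):
        rng.shuffle(indices)
        test_idx, train_idx = indices[:n_test], indices[n_test:]
        # Collect the set of labels present on each side of the split.
        train_labels: Set[str] = {l for i in train_idx for l in labels[i]}
        test_labels: Set[str] = {l for i in test_idx for l in labels[i]}
        if test_labels <= train_labels:
            return sorted(train_idx), sorted(test_idx)
    raise ValueError("no split with full label coverage found")


# Toy multilabel data (made-up labels); one label occurs only once.
toy = [
    ["insult"], ["obscene"], ["insult", "obscene"], [], ["insult"],
    ["obscene"], ["rare"], ["insult"], [], ["obscene"],
]
train_idx, test_idx = split_with_label_coverage(toy, test_size=0.2, seed=0)
```

A label that occurs only once, like the "rare" one here, can then only end up in the training part. The retry loop is a cheap check and does not balance label frequencies between the two parts.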

Ruqyai

LGTM

x-tabdeveloping · 2024-05-21T09:27:46Z

As far as I know we stopped dataset submissions on the 15th.
I think we should focus our efforts on speeding up the benchmark and finalising everything before running the models.

x-tabdeveloping · 2024-05-21T09:31:30Z

I see you have some changes in Maltese news. Can you elaborate on what you did, and why? If it is relevant to the benchmark we should consider putting it in another PR.

x-tabdeveloping · 2024-05-21T09:34:21Z

@Ruqyai The checklist is incomplete and the name of the PR is set to "work in progress". I believe it would be a tad irresponsible to merge this, no? Seems a bit too early and undercooked to just LGTM it to me.

dokato · 2024-05-21T12:59:53Z

Your PR for Multilabel Classification was merged only last week, which left just a couple of days to submit before the 15th. While I appreciate that we are focusing on model submissions, and I'm eager and hands-on with that, I don't think we should give up on an interesting dataset because of an arbitrary deadline, especially given that the Brazilian dialect of Portuguese is underrepresented. While working on it I spotted some minor mistakes in Maltese News Classification: a) wrong task type, b) a missing import.

x-tabdeveloping · 2024-05-21T13:14:57Z

I see your point! Can I ask you to move the changes related to the Maltese News to another PR so we can discuss them separately? (seems quite reasonable otherwise)
@KennethEnevoldsen what is your take on this? Should we still consider adding this, or stick to the deadline we set for ourselves?

KennethEnevoldsen

This looks fine to merge for me. @dokato, will you fill out the checklist?

dokato added 4 commits May 20, 2024 19:38

BrazilianToxic Tweets multilabel classification

13f4ba4

minor maltese news clf fixes

effbc37

BrazilianToxicTweetsClassification improvements

5bbbf66

BrazilianToxicTweetsClassification cleanup

d617dc8

dokato added the WIP Work In Progress label May 20, 2024

dokato requested a review from x-tabdeveloping May 20, 2024 20:17

Merge branch 'main' into multi-pr

13d5a14

Ruqyai approved these changes May 20, 2024

isaac-chung assigned x-tabdeveloping May 22, 2024

dokato mentioned this pull request May 23, 2024

Finalizing MMTEB #784

Open

4 tasks

KennethEnevoldsen approved these changes May 23, 2024

mteb/tasks/MultiLabelClassification/por/BrazilianToxicTweetsClassification.py (outdated, resolved)

dokato and others added 2 commits May 24, 2024 14:35

Update mteb/tasks/MultiLabelClassification/por/BrazilianToxicTweetsCl…

a56e2c3

…assification.py Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>

points added

0b058aa

dokato changed the title ~~Multilabel Brazilian Toxic Tweets Classification [WIP]~~ Multilabel Brazilian Toxic Tweets Classification May 24, 2024

dokato removed the WIP Work In Progress label May 24, 2024

dokato merged commit 5f0cd32 into embeddings-benchmark:main May 24, 2024
7 checks passed
