
fix: add Multilingual Hate Speech detection task #439

Merged · 11 commits · Apr 24, 2024

Conversation

@rbroc (Contributor) commented Apr 19, 2024

Checklist for adding MMTEB dataset

closes #395

  • I have tested that the dataset runs with the `mteb` package.
  • I have run the following models on the task (adding the results to the PR). These can be run using the `mteb run -m {model_name} -t {task_name}` command; see the example after this checklist.
    • sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
    • intfloat/multilingual-e5-small
  • I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
  • I have considered the size of the dataset and reduced it if it is too big (2048 examples is typically large enough for most tasks).
  • I have run the tests locally to make sure nothing is broken, using `make test`.
  • I have run the formatter to format the code, using `make lint`.
  • I have added points for my submission to the POINTS.md file.
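For reference, the two model runs above can also be reproduced from Python. A minimal sketch, assuming the task is registered under the name `MultiHateClassification` (the actual registered name may differ):

```python
# Minimal sketch: run one of the checklist models on the new task.
# "MultiHateClassification" is an assumed task name; substitute whatever
# name the task is actually registered under in mteb.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-small")
evaluation = MTEB(tasks=["MultiHateClassification"])
evaluation.run(model, output_folder="results")
```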

@rbroc (Contributor, Author) commented Apr 19, 2024

First stab at this, but before I go ahead I would love input on the following:

  • The dataset includes 10 languages. The first release, as part of a 2021 ACL paper, was a monolingual English dataset. The remaining 9 languages were then released, expanding the original dataset; there is a separate 2022 workshop paper for these additional datasets. I am using the latter for reference and bibtex_citation, but any input on this? I hope we are good license-wise if we do that.
  • All datasets are released separately on HF. I am passing a dictionary with language-specific revision tags for each dataset to revision (see the sketch after this list). It feels a bit weird though; any better suggestion?
  • The name of the dataset passed to metadata is technically the name of only one of the datasets; is that an issue?
  • n_samples and avg_character_length are computed across all languages. n_samples is 18250 in total. This seems high, but we are benchmarking separately on each monolingual dataset -- so the size should be okay? Or do I need to downsample?
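For concreteness, a rough sketch of the per-language revision mapping described in the second bullet. The language codes and revision values are placeholders (revision normally takes a single string, which is exactly why this feels awkward):

```python
# Hypothetical illustration only: one revision tag per language-specific
# HF dataset. Keys and values below are placeholders, not real revisions.
revision = {
    "en": "<commit-sha-of-english-dataset>",
    "it": "<commit-sha-of-italian-dataset>",
    # ... one entry for each of the remaining eight languages
}
```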

@KennethEnevoldsen (Contributor) commented:

> The dataset includes 10 languages. The first release, as part of a 2021 ACL paper, was a monolingual English dataset. The remaining 9 languages were then released, expanding the original dataset; there is a separate 2022 workshop paper for these additional datasets. I am using the latter for reference and bibtex_citation, but any input on this?

Let us use both.

> All datasets are released separately on HF. I am passing a dictionary with language-specific revision tags for each dataset to revision. It feels a bit weird though; any better suggestion?
> The name of the dataset passed to metadata is technically the name of only one of the datasets; is that an issue?

Given the license, I would rehost it. Feel free to host it under the mteb HF organization (you can request to join and I will add you); a rough sketch of the rehosting is below.
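A hedged sketch of what that rehosting could look like: pull each language's original dataset and push it as a config of a single dataset under the mteb organization. The source path pattern is a made-up placeholder; the target repo matches the one used later in this PR, and `push_to_hub(..., config_name=...)` requires a recent version of the `datasets` library.

```python
# Sketch of rehosting the per-language datasets as one multi-config
# dataset on the Hub. The source path is a placeholder.
from datasets import load_dataset

for lang in ["en", "it", "de"]:  # ... plus the remaining languages
    ds = load_dataset(f"original-org/hatecheck-{lang}")  # placeholder path
    ds.push_to_hub("mteb/multi-hatecheck", config_name=lang)
```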

@KennethEnevoldsen (Contributor) left a comment:


Generally, this looks really promising, @rbroc. I think with the rehost and downsampling, it will be great.

Also note that the point system changed (to avoid merge conflicts) - see #438

@rbroc (Contributor, Author) commented Apr 19, 2024

Thanks @KennethEnevoldsen!
Double-checking real quick:

  • Making this a single dataset on HF, hosted on mteb -- not multiple datasets, correct?
  • When you say crediting both papers, how would I do that? The reference and bibtex_citation metadata fields can only take strings -- I could of course reference both papers in the README of the new dataset, but we're still left with what these two metadata fields should look like in the new Task.

@KennethEnevoldsen (Contributor) commented:

> Making this a single dataset on HF, hosted on mteb -- not multiple datasets, correct?

Yep!

> When you say crediting both papers, how would I do that? The reference and bibtex_citation metadata fields can only take strings -- I could of course reference both papers in the README of the new dataset, but we're still left with what these two metadata fields should look like in the new Task.

Let reference be the newest paper; the BibTeX citation can simply be two BibTeX citations in one string (see the sketch below).
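A minimal sketch of what that could look like: the single bibtex_citation string simply holds both entries back to back. The citation keys and fields below are placeholders, not the actual references:

```python
# Placeholder entries only: substitute the real BibTeX for the 2021 ACL
# paper and the 2022 workshop paper.
bibtex_citation = """@inproceedings{placeholder2021,
  title = {Original monolingual English dataset (2021 ACL paper)},
}

@inproceedings{placeholder2022,
  title = {Multilingual extension (2022 workshop paper)},
}"""
```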

@rbroc (Contributor, Author) commented Apr 19, 2024

Awesome, thanks! I'll most probably have to focus on other things for the rest of today, but I'm hoping to wrap this up by early next week at the latest.

@KennethEnevoldsen self-assigned this Apr 19, 2024
@rbroc (Contributor, Author) commented Apr 19, 2024

@KennethEnevoldsen I have implemented the requested changes; the new dataset is hosted here: https://huggingface.co/datasets/mteb/multi-hatecheck

I have kept all data from the original dataset in the HF dataset, as well as additional interesting columns one could focus on in the future (e.g., type of hate speech). I do the subsampling and splitting in the task code (sketched below), which makes it easier to increase the number of samples later if we want to.
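A hedged sketch of the kind of per-language subsampling mentioned above, using the Hugging Face datasets API; the sample count and seed are assumptions, not the PR's actual values:

```python
# Assumed illustration: shuffle each language's data deterministically
# and keep at most n examples. n and seed are placeholders.
import datasets

def subsample(ds: datasets.Dataset, n: int = 1000, seed: int = 42) -> datasets.Dataset:
    """Shuffle deterministically and keep at most n examples."""
    return ds.shuffle(seed=seed).select(range(min(n, len(ds))))
```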

Not sure what the point count for this should be: is this 1 or 9 datasets? Also, it feels like repeated review should grant more than 2x points for you.

@rbroc mentioned this pull request Apr 22, 2024
@KennethEnevoldsen (Contributor) left a comment:


@rbroc you might be interested in #440 for the multilabel case. Otherwise, everything looks good!

Will you add in the points?

Review thread on docs/mmteb/points/439.jsonl (outdated, resolved)
@rbroc changed the title from "Multilingual Hate Speech detection task" to "fix: add Multilingual Hate Speech detection task" Apr 23, 2024
@rbroc (Contributor, Author) commented Apr 23, 2024

@KennethEnevoldsen should be ready to merge

@KennethEnevoldsen (Contributor) commented:

Ah, @rbroc, sorry, I missed the question related to points:

You should get 2 points for the dataset and 4 bonus points per language that does not already have a classification task.

@rbroc (Contributor, Author) commented Apr 24, 2024

No worries at all! It seems like all languages are already covered by multilingual datasets, so the 2 points I added are correct. :)

@KennethEnevoldsen merged commit eee7175 into embeddings-benchmark:main on Apr 24, 2024. 7 checks passed.
Closes: scale Italian HateSpeech to multilingual (#395)