Add NollySenti Bitext Mining (MINERS) #915

gentaiscool · 2024-06-13T20:29:03Z

NollySenti is a dataset of Nollywood movie reviews in five languages widely spoken in Nigeria (English, Hausa, Igbo, Nigerian-Pidgin, and Yoruba).

Checklist

Run tests locally to make sure nothing is broken using make test.
Run the formatter to format the code using make lint.

Adding datasets checklist

Reason for dataset addition: ...

I have run the following models on the task (adding the results to the PR). These can be run using the mteb -m {model_name} -t {task_name} command.
- sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
- intfloat/multilingual-e5-small
I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
If the dataset is too big (e.g. >2048 examples), consider using self.stratified_subsampling() under dataset_transform().
I have filled out the metadata object in the dataset file (find documentation on it here).
Run tests locally to make sure nothing is broken using make test.
Run the formatter to format the code using make lint.

gentaiscool · 2024-06-13T20:29:47Z

@KennethEnevoldsen I created a new PR for NollySenti.

gentaiscool · 2024-06-14T02:34:07Z

I have completed the checklist @KennethEnevoldsen.

KennethEnevoldsen

Will you add points and push the model results as well?

KennethEnevoldsen · 2024-06-15T12:14:15Z

mteb/tasks/BitextMining/multilingual/NollySentiBitextMining.py

+        }
+        """,
+        n_samples={"train": 1640},
+        avg_character_length={"train": 4.46},


a bit short?

I updated the average character length to be per sample (previously I had entered the average number of characters per word).
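
For reference, the difference between the two averages can be sketched as below. This is only an illustration: the sample sentences are made up and are not taken from NollySenti, and the helper names avg_chars_per_sample and avg_chars_per_word are hypothetical, not part of mteb.

```python
# Illustrative only: the sentences below are invented, not NollySenti data,
# and both helper functions are hypothetical (they are not mteb APIs).

def avg_chars_per_sample(samples):
    """Mean number of characters per full text sample (spaces included)."""
    return sum(len(s) for s in samples) / len(samples)


def avg_chars_per_word(samples):
    """Mean number of characters per whitespace-separated word."""
    words = [w for s in samples for w in s.split()]
    return sum(len(w) for w in words) / len(words)


samples = ["a good movie", "not bad"]

per_sample = avg_chars_per_sample(samples)
per_word = avg_chars_per_word(samples)

# The per-word figure is much smaller than the per-sample figure,
# which is why a value near 4 looked too short for a review.
print(f"per sample: {per_sample}, per word: {per_word}")
```

The avg_character_length field is meant to describe whole samples, so the first function is the one that matches the metadata.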

gentaiscool · 2024-06-15T20:08:02Z

Thank you for the review. I have addressed the comments: I updated the average number of characters per sample and added the results and the points. The new languages are hau-Latn, ibo-Latn, pcm-Latn, and yor-Latn, so the points are 2 + 4 x 4 = 18. I also added the point for the reviewer. @KennethEnevoldsen
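
The points arithmetic above can be checked with a small sketch. The rule of 2 points for the dataset plus 4 points per new language is inferred from the "2 + 4 x 4 = 18" calculation in this comment and should be treated as an assumption; the constant and function names are hypothetical.

```python
# Assumed rule, inferred from the comment above: 2 points for the dataset
# plus 4 points for each language that is new to the benchmark.
BASE_POINTS = 2
POINTS_PER_NEW_LANGUAGE = 4


def dataset_points(new_languages):
    """Points for one dataset contribution under the assumed rule."""
    return BASE_POINTS + POINTS_PER_NEW_LANGUAGE * len(new_languages)


# Language codes as listed in the comment above.
new_languages = ["hau-Latn", "ibo-Latn", "pcm-Latn", "yor-Latn"]

total = dataset_points(new_languages)
print(total)  # 2 + 4 * 4 = 18
```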

gentaiscool added 2 commits June 13, 2024 16:25

add NollySentiBitextMining

664dc98

update reference

75f0130

gentaiscool mentioned this pull request Jun 13, 2024

Integrate MINERS #907

Closed

Merge branch 'embeddings-benchmark:main' into bitext_nollysenti

4db80f0

KennethEnevoldsen reviewed Jun 15, 2024

gentaiscool added 5 commits June 15, 2024 13:50

Merge branch 'embeddings-benchmark:main' into bitext_nollysenti

168d40a

update avg char

b846ff5

Merge branch 'bitext_nollysenti' of github.com:gentaiscool/mteb into …

e7e56d8

…bitext_nollysenti

add points

5ac09ce

add results

bbe9fb0

gentaiscool requested a review from KennethEnevoldsen June 15, 2024 20:08

KennethEnevoldsen merged commit df68f8c into embeddings-benchmark:main Jun 15, 2024
7 checks passed
