Add Vaani Multilingual langauge detection and gender classification task by anime-sh · Pull Request #2367 · embeddings-benchmark/mteb

anime-sh · 2025-03-14T23:44:23Z

Code Quality

Code Formatted: Format the code using make lint to maintain consistent style.

Documentation

Updated Documentation: Add or update documentation to reflect the changes introduced in this PR.

Testing

New Tests Added: Write tests to cover new functionality. Validate with make test-with-coverage.
Tests Passed: Run tests locally using make test or make test-with-coverage to ensure no existing functionality is broken.

Adding datasets checklist

Reason for dataset addition: Add multilingual speech task

I have run the following models on the task (adding the results to the PR). These can be run using the mteb -m {model_name} -t {task_name} command.
- sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
- intfloat/multilingual-e5-small
- facebook/wav2vec2-xls-r-300m
I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
If the dataset is too big (e.g. >2048 examples), consider using self.stratified_subsampling() under dataset_transform()
I have filled out the metadata object in the dataset file (find documentation on it here).
Run tests locally to make sure nothing is broken using make test.
Run the formatter to format the code using make lint.
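
To illustrate the subsampling item in the checklist above, here is a minimal standalone sketch of stratified subsampling. It does not call `self.stratified_subsampling()` and makes no claim about how that method is implemented; the function name `stratified_subsample`, its parameters, and the toy labels are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_subsample(labels, n_samples, seed=42):
    """Return sorted indices of a subsample that roughly keeps label proportions."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    total = len(labels)
    if n_samples >= total:
        return list(range(total))

    chosen = []
    for _, indices in sorted(by_label.items()):
        # Per-class quota proportional to class frequency (at least one example).
        # Because of rounding, the final size can differ slightly from n_samples.
        quota = max(1, round(n_samples * len(indices) / total))
        rng.shuffle(indices)
        chosen.extend(indices[:quota])
    return sorted(chosen)


# Toy data (hypothetical): an imbalanced two-class label column.
labels = ["female"] * 3000 + ["male"] * 1000
subset = stratified_subsample(labels, n_samples=2048)
print(len(subset))
```

The 3:1 class ratio of the toy labels is preserved in the subsample, which is the property that matters when a large split is cut down to a few thousand examples.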

{
  "dataset_revision": "b133cc2e158905798a723f29685e483517e61275",
  "task_name": "IISc-Vaani-Language-Detection",
  "mteb_version": "1.36.20",
  "scores": {
    "train": [
      {
        "accuracy": 0.88094,
        "f1": 0.11324,
        "lrap": 0.973541,
        "scores_per_experiment": [
          {
            "accuracy": 0.88094,
            "f1": 0.11324,
            "lrap": 0.973541
          },
          {
            "accuracy": 0.88094,
            "f1": 0.11324,
            "lrap": 0.973541
          }
        ],
        "main_score": 0.88094,
        "hf_subset": "WestBengal_Purulia",
        "languages": [
          "ben-Beng",
          "hin-Deva",
          "rjs-Beng",
          "sat-Olck"
        ]
      }
    ]
  },
  "evaluation_time": 285.6733286380768,
  "kg_co2_emissions": null
}

{
  "dataset_revision": "b133cc2e158905798a723f29685e483517e61275",
  "task_name": "IISc-Vaani-Gender-Classification",
  "mteb_version": "1.36.20",
  "scores": {
    "train": [
      {
        "accuracy": 0.620996,
        "f1": 0.618563,
        "f1_weighted": 0.618696,
        "ap": 0.572232,
        "ap_weighted": 0.572232,
        "scores_per_experiment": [
          {
            "accuracy": 0.639648,
            "f1": 0.638888,
            "f1_weighted": 0.63905,
            "ap": 0.586072,
            "ap_weighted": 0.586072
          },
          {
            "accuracy": 0.627441,
            "f1": 0.627416,
            "f1_weighted": 0.627446,
            "ap": 0.575327,
            "ap_weighted": 0.575327
          },
          {
            "accuracy": 0.602051,
            "f1": 0.601364,
            "f1_weighted": 0.601203,
            "ap": 0.556079,
            "ap_weighted": 0.556079
          },
          {
            "accuracy": 0.631836,
            "f1": 0.621147,
            "f1_weighted": 0.621769,
            "ap": 0.585662,
            "ap_weighted": 0.585662
          },
          {
            "accuracy": 0.604004,
            "f1": 0.603999,
            "f1_weighted": 0.604012,
            "ap": 0.558021,
            "ap_weighted": 0.558021
          }
        ],
        "main_score": 0.620996,
        "hf_subset": "WestBengal_Purulia",
        "languages": [
          "ben-Beng",
          "hin-Deva",
          "rjs-Beng",
          "sat-Olck"
        ]
      }
    ]
  },
  "evaluation_time": 824.7135016918182,
  "kg_co2_emissions": null
}
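
As a quick sanity check on a result file like the one above, the per-experiment scores can be aggregated and compared with the reported `main_score`. This is a sketch only: the helper `summarize` is hypothetical, and the embedded JSON is a trimmed copy of the gender classification accuracies shown above, not the full result file.

```python
import json
from statistics import mean, pstdev

# Trimmed copy of the gender classification result above (accuracy only).
result_json = """
{
  "task_name": "IISc-Vaani-Gender-Classification",
  "scores": {"train": [{
    "main_score": 0.620996,
    "scores_per_experiment": [
      {"accuracy": 0.639648},
      {"accuracy": 0.627441},
      {"accuracy": 0.602051},
      {"accuracy": 0.631836},
      {"accuracy": 0.604004}
    ]
  }]}
}
"""


def summarize(result, split="train", metric="accuracy"):
    """Aggregate one metric over the experiments of the first subset in a split."""
    entry = result["scores"][split][0]
    values = [run[metric] for run in entry["scores_per_experiment"]]
    return {
        "n": len(values),
        "mean": round(mean(values), 6),
        "std": round(pstdev(values), 6),
        "main_score": entry["main_score"],
    }


summary = summarize(json.loads(result_json))
print(summary)
```

For this file the mean of the five accuracies equals the reported `main_score`, so the main score here is the average over experiments.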

Adding a model checklist

I have filled out the ModelMeta object to the extent possible
I have ensured that my model can be loaded using
- mteb.get_model(model_name, revision) and
- mteb.get_model_meta(model_name, revision)
I have tested the implementation works on a representative set of tasks.

Samoed · 2025-03-15T14:24:01Z

mteb/tasks/Audio/AudioMultilabelClassification/eng/VaaniLanguageDetection.py

+            )
+            self.dataset[subset] = self.dataset[subset].map(
+                lambda x: {
+                    self.label_column_name: literal_eval(x[self.label_column_name])


Suggested change

- self.label_column_name: literal_eval(x[self.label_column_name])
+ self.label_column_name: json.loads(x[self.label_column_name])
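
Whether `json.loads` can replace `literal_eval` depends on how the label column is serialized, which the diff does not show. The sketch below uses two hypothetical label strings to show the difference; `parse_labels` is an illustrative helper, not part of the PR.

```python
import json
from ast import literal_eval

# Hypothetical serialized label values.
json_style = '["ben", "hin"]'    # valid JSON and a valid Python literal
python_style = "['ben', 'hin']"  # Python repr of a list; not valid JSON


def parse_labels(raw):
    """Parse a serialized label list, trying JSON first, then a Python literal."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Single-quoted strings (Python repr) are rejected by the JSON parser.
        return literal_eval(raw)


print(parse_labels(json_style))
print(parse_labels(python_style))
```

If the column was written with `str(list_of_labels)`, it will look like `python_style` and `json.loads` alone will raise `json.JSONDecodeError`; if it was written with `json.dumps`, either parser works.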

Samoed · 2025-03-15T14:28:18Z

mteb/tasks/Audio/AudioMultilabelClassification/eng/VaaniLanguageDetection.py

+        },
+        type="AudioMultilabelClassification",
+        category="a2t",
+        eval_splits=["train"],


Since this dataset has only a train split, I think this should implement cross_validation for multilabel datasets, similar to what is done for classification.

anime-sh · 2025-03-15

Each subset has a train set of about 100k samples, so cross-validation with the usual values of K (3, 5, 7) would be quite expensive.
One idea I had was to generate a test split from the train split in the dataset transform and then proceed as with a normal task.
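
The idea of carving a test split out of train can be sketched as a seeded hold-out split over example indices. This is an assumption-laden illustration, not the PR's code: `holdout_split`, the 2048-example test size, and the seed are all hypothetical choices.

```python
import random


def holdout_split(n_examples, test_size=2048, seed=42):
    """Split range(n_examples) into disjoint train/test index lists, reproducibly."""
    rng = random.Random(seed)
    indices = list(range(n_examples))
    rng.shuffle(indices)
    # Never let the held-out part exceed half of the data.
    test_size = min(test_size, n_examples // 2)
    test_idx = sorted(indices[:test_size])
    train_idx = sorted(indices[test_size:])
    return train_idx, test_idx


# Roughly the size mentioned in the comment above (about 100k samples per subset).
train_idx, test_idx = holdout_split(100_000)
print(len(train_idx), len(test_idx))
```

Compared with K-fold cross-validation, which fits K classifiers, a single hold-out split needs one fit per experiment; the trade-off is that the score depends on one fixed partition, so the seed should be recorded alongside the results.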

KennethEnevoldsen · 2025-06-15T18:59:32Z

@anime-sh will close this PR as it seems to have gotten stale

anime-sh added 3 commits March 14, 2025 16:40

Add Vaani Multilingual langauge detection task

4a1e389

fix revision vaani

934a233

make lint

0ee9dc6

anime-sh added the audio Audio extension label Mar 14, 2025

RahulSChand assigned anime-sh Mar 15, 2025

anime-sh and others added 7 commits March 14, 2025 17:23

fix subset handling in dataset transformer

638dabf

fix stratified subsampling args

90b28a3

lint

75295a5

aaaaaaaaaaa

5a1ae83

Handle handle multi label column properly

da584e1

make lint

f5453b8

set reasonable samples per label

5d53283

anime-sh requested review from RahulSChand, Samoed and isaac-chung March 15, 2025 04:24

anime-sh added 2 commits March 14, 2025 22:11

subsample Vaani properly

89dfa7f

add gender classification task

2801d0c

anime-sh changed the title ~~Add Vaani Multilingual langauge detection task~~ Add Vaani Multilingual langauge detection and gender classification task Mar 15, 2025

anime-sh added 4 commits March 14, 2025 23:33

make lint

328871a

add all subsets gender class

15a3677

fix wav2vec2

555a9b0

add vaani gender import

6b16850

Samoed reviewed Mar 15, 2025

silky1708 mentioned this pull request May 7, 2025

MAEB Overview Issue #2072

Closed

84 tasks

silky1708 linked an issue May 7, 2025 that may be closed by this pull request

Add Vaani dataset #2669

Open

silky1708 mentioned this pull request May 8, 2025

Add Vaani dataset #2669

Open

KennethEnevoldsen closed this Jun 15, 2025

anime-sh wants to merge 16 commits into embeddings-benchmark:maeb from anime-sh:IISCVaani-maeb

3 participants