mmteb | Arabic | Retrieval Task #669

bakrianoo
Contributor

Checklist for adding MMTEB dataset

Reason for dataset addition:

  • This is a dataset for mmteb initiative.

  • The Dataset is for Arabic Retrieval tasks

  • The Dataset is for Keyword-Based searching tasks (The retrieval part in the RAG pipeline)

  • Although embeddings show promising capabilities for semantic search, we still see challenges when the query is very short and written in keyword style.

  • I have tested that the dataset runs with the mteb package.

  • I have run the following models on the task (adding the results to the PR). These can be run using the mteb run -m {model_name} -t {task_name} command.

    • sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
    • intfloat/multilingual-e5-small
  • I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).

  • If the dataset is too big (e.g. >2048 examples), consider using self.stratified_subsampling() under dataset_transform()

  • I have filled out the metadata object in the dataset file (find documentation on it here).

  • Run tests locally to make sure nothing is broken using make test.

  • Run the formatter to format the code using make lint.

  • [] I have added points for my submission to the points folder using the PR number as the filename (e.g. 438.jsonl).
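
The two model runs listed in the checklist can be scripted. This is a minimal sketch, assuming the `mteb run -m {model_name} -t {task_name}` form quoted above and the task name SadeemKeywordRetrieval from this PR. `build_commands` is a hypothetical helper, and the commands are only printed here, not executed:

```python
import shlex

# Models named in the checklist above.
MODELS = [
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
    "intfloat/multilingual-e5-small",
]
# Task name as it appears in this PR.
TASK_NAME = "SadeemKeywordRetrieval"


def build_commands(models, task_name):
    """Return one argument list per model, following the checklist's command form."""
    return [["mteb", "run", "-m", model, "-t", task_name] for model in models]


for cmd in build_commands(MODELS, TASK_NAME):
    # shlex.join quotes arguments so the line can be pasted into a shell.
    print(shlex.join(cmd))
```

To actually launch the runs, each argument list could be passed to `subprocess.run(cmd, check=True)` in an environment where the mteb package is installed.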

Contributor

@KennethEnevoldsen KennethEnevoldsen left a comment

Only a few minor comments. The size in particular concerns me a bit.

Feel free to add points as well.
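
On the size point: the checklist suggests subsampling above roughly 2048 examples, and the task metadata in this PR lists n_samples of 7179. The sketch below is not mteb's `self.stratified_subsampling()`. It is a plain seeded random sample over queries, shown only to illustrate the idea of capping the evaluation set while keeping the relevance judgments consistent. The function name, the dictionary layout, and the toy data are all assumptions:

```python
import random


def subsample_queries(queries, relevant_docs, n_samples=2048, seed=42):
    """Keep at most n_samples queries and only their relevance judgments.

    queries:        {query_id: query_text}
    relevant_docs:  {query_id: {doc_id: relevance_score}}
    """
    if len(queries) <= n_samples:
        return dict(queries), dict(relevant_docs)

    rng = random.Random(seed)
    # Sort the ids first so the sample is reproducible for a given seed.
    kept_ids = set(rng.sample(sorted(queries), n_samples))

    sub_queries = {qid: text for qid, text in queries.items() if qid in kept_ids}
    sub_qrels = {qid: docs for qid, docs in relevant_docs.items() if qid in kept_ids}
    return sub_queries, sub_qrels


# Toy data only; sizes and contents are made up for the illustration.
queries = {f"q{i}": f"keyword {i}" for i in range(5000)}
relevant_docs = {f"q{i}": {f"d{i}": 1} for i in range(5000)}

sub_queries, sub_qrels = subsample_queries(queries, relevant_docs)
print(len(sub_queries), len(sub_qrels))
```

The corpus is left untouched here, so the retrieval difficulty stays the same and only the number of evaluated queries shrinks.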

mteb/tasks/Retrieval/ara/SadeemKeywordRetrieval.py (outdated review comments)
@KennethEnevoldsen
Contributor

@bakrianoo it looks like the tests fail. Will you have a look at this?

@bakrianoo
Contributor Author

@KennethEnevoldsen

I have tried updating the dataset's metadata values many times, but I cannot work out which field the test rejects. Can you help?

_________________________ test_all_metadata_is_filled __________________________
[gw0] linux -- Python 3.8.18 /opt/hostedtoolcache/Python/3.8.18/x64/bin/python

    def test_all_metadata_is_filled():
        all_tasks = get_tasks()
    
        unfilled_metadata = []
        for task in all_tasks:
            if task.metadata.name not in _HISTORIC_DATASETS:
                if not task.metadata.is_filled():
                    unfilled_metadata.append(task.metadata.name)
        if unfilled_metadata:
>           raise ValueError(
                f"The metadata of the following datasets is not filled: {unfilled_metadata}"
            )
E           ValueError: The metadata of the following datasets is not filled: ['SadeemKeywordRetrieval']

tests/test_TaskMetadata.py:436: ValueError
=============================== warnings summary ===============================
tests/test_mteb.py::test_mteb_task[average_word_embeddings_levy_dependency-task0]
  /home/runner/.cache/huggingface/modules/datasets_modules/datasets/strombergnlp--bornholmsk_parallel/a93ddacca6042553271bf4c1c0e035df3fccf848c6820417766752d676567815/bornholmsk_parallel.py:27: DeprecationWarning: invalid escape sequence \o
    _CITATION = """\

tests/test_mteb_rerank.py::test_mteb_rerank
tests/test_mteb_rerank.py::test_mteb_rerank
tests/test_mteb_rerank.py::test_reranker_same_ndcg1
  /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
    warnings.warn(

https://github.com/embeddings-benchmark/mteb/actions/runs/9128236014/job/25100164821?pr=669

@Ruqyai
Contributor

Ruqyai commented May 17, 2024

Hi @bakrianoo
I faced a similar error. These are the steps I took to fix it:

  • go to the file tests/test_TaskMetadata.py

  • add 'SadeemKeywordRetrieval' to the _HISTORIC_DATASETS list manually.

  • save the file and run make test

@KennethEnevoldsen
Contributor

KennethEnevoldsen commented May 17, 2024

Please do not do this. We specifically have exceptions for _HISTORIC_DATASETS, but the test is intended to fail for new datasets. @Ruqyai, if you have done this for a previous dataset, please make a PR with the fix.

Comment on lines +27 to +38
date=None,
form=["written"],
domains=["Blog"],
task_subtypes=None,
license=None,
socioeconomic_status=None,
annotations_creators=None,
dialect=None,
text_creation=None,
bibtex_citation=None,
n_samples={_EVAL_SPLIT: 7179},
avg_character_length={_EVAL_SPLIT: 500.0},

The test fails because the metadata is not filled in, and it should be.

date is the period in which the text was written (e.g. scraped from Twitter, 2001-2020).
task_subtypes: I would put Keyword Retrieval and add it to the list of allowed subtypes.
license is required.
socioeconomic_status is the social status of the text's writers (e.g. high for lawyers).
dialect should be an empty list if there are no dialects.

You can read more about these on the TaskMetadata object
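
To see which fields the test is likely tripping over, the values from the commented lines can be scanned for None. This is a minimal sketch, assuming the fill check treats any None field as unfilled, which matches the explanation above but is not taken from the mteb source. `unfilled_fields` is a hypothetical helper, and the value of `_EVAL_SPLIT` is a placeholder:

```python
# Placeholder value; the actual split name is not shown in this thread.
_EVAL_SPLIT = "test"

# Field values copied from the commented lines of the PR.
metadata_fields = {
    "date": None,
    "form": ["written"],
    "domains": ["Blog"],
    "task_subtypes": None,
    "license": None,
    "socioeconomic_status": None,
    "annotations_creators": None,
    "dialect": None,
    "text_creation": None,
    "bibtex_citation": None,
    "n_samples": {_EVAL_SPLIT: 7179},
    "avg_character_length": {_EVAL_SPLIT: 500.0},
}


def unfilled_fields(fields):
    """Return the names of fields whose value is None, in definition order."""
    return [name for name, value in fields.items() if value is None]


for name in unfilled_fields(metadata_fields):
    print(f"not filled: {name}")
```

Under this assumption an empty list counts as filled, which is consistent with the advice to set dialect to an empty list rather than None.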

@Ruqyai
Contributor

Ruqyai commented May 18, 2024

Thanks @KennethEnevoldsen. I am addressing this in PR #763.
Please check whether my PR can be merged without commenting out the test_all_metadata_is_filled function.

@KennethEnevoldsen
Contributor

@bakrianoo would love to have this PR merged in. I will close it for now, but if you have the time please do re-open it and address the metadata issues. I will make sure it gets a quick review and that we finish up the metadata.
