mmteb | Arabic | Retrieval Task #669
Conversation
Only a few minor comments. Especially the size concerns me a bit.
Feel free to add points as well.
@bakrianoo looks like the tests fail - will you have a look at this?
I have tried to update the metadata values of the dataset many times, but I cannot figure out which field is not accepted by the testing process. Can you help?
https://github.com/embeddings-benchmark/mteb/actions/runs/9128236014/job/25100164821?pr=669
Hi @bakrianoo
results/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2/SadeemKeywordRetrieval.json
Please do not do this. We specifically have exceptions for _HISTORIC_DATASETS, but the test is intended to fail for new datasets. @Ruqyai, if you have done this for a previous dataset, please make a PR with the fix.
date=None,
form=["written"],
domains=["Blog"],
task_subtypes=None,
license=None,
socioeconomic_status=None,
annotations_creators=None,
dialect=None,
text_creation=None,
bibtex_citation=None,
n_samples={_EVAL_SPLIT: 7179},
avg_character_length={_EVAL_SPLIT: 500.0},
The reason the test fails is that the metadata is not filled out, which it should be.
date is the time span in which the texts were written (e.g. scraped from Twitter from 2001-2020).
For task_subtypes I would put "Keyword Retrieval" and add it to the list of allowed subtypes.
license is required.
socioeconomic_status is the social status of the text writers (e.g. high for lawyers).
dialect should be an empty list if there are no dialects.
You can read more about these on the TaskMetadata object.
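For illustration, a minimal sketch of how the fields above might be filled in. The concrete values (the date span, license, subtype name, and the creator/text-creation labels) are placeholders for this example, not the dataset's actual metadata:

# Sketch only: placeholder values, not the dataset's actual metadata.
date=("2023-01-01", "2024-01-01"),  # placeholder span in which the texts were written
form=["written"],
domains=["Blog"],
task_subtypes=["Keyword Retrieval"],  # assumes this subtype is added to the allowed list
license="cc-by-4.0",  # placeholder; use the dataset's actual license
socioeconomic_status="mixed",  # social status of the text writers
annotations_creators="derived",  # placeholder
dialect=[],  # empty list, since there are no dialects
text_creation="found",  # placeholder
bibtex_citation="",  # fill in if a citation exists
n_samples={_EVAL_SPLIT: 7179},
avg_character_length={_EVAL_SPLIT: 500.0},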
Thanks @KennethEnevoldsen. I am addressing this in PR #763.
@bakrianoo would love to have this PR merged in. I will close it for now, but if you have the time please do re-open it and address the metadata issues. I will make sure it gets a quick review and that we finish up the metadata.
Checklist for adding MMTEB dataset
Reason for dataset addition:
This is a dataset for the mmteb initiative. The dataset is for Arabic retrieval tasks, specifically keyword-based search (the retrieval part of the RAG pipeline). Despite the promising capabilities of embeddings for semantic search, we still notice challenges when the query becomes very short and keyword-style.
- I have tested that the dataset runs with the mteb package.
- I have run the following models on the task (adding the results to the PR). These can be run using the mteb run -m {model_name} -t {task_name} command (see the usage example after this checklist).
  - sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
  - intfloat/multilingual-e5-small
- I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
- If the dataset is too big (e.g. >2048 examples), consider using self.stratified_subsampling() under dataset_transform() (see the sketch after this checklist).
- I have filled out the metadata object in the dataset file (find documentation on it here).
- Run tests locally to make sure nothing is broken using make test.
- Run the formatter to format the code using make lint.
- [ ] I have added points for my submission to the points folder using the PR number as the filename (e.g. 438.jsonl).
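For reference, a minimal sketch of how one of the listed models can be run on this task from Python instead of the CLI. It assumes mteb and sentence-transformers are installed, and takes the task name SadeemKeywordRetrieval from the results file added in this PR:

# Sketch: run one of the listed models on the new task with the mteb package.
# Assumes the task name matches the results file added in this PR.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
evaluation = MTEB(tasks=["SadeemKeywordRetrieval"])
evaluation.run(model, output_folder="results")

The equivalent CLI call is the mteb run -m {model_name} -t {task_name} form quoted in the checklist.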
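And a hedged sketch of the subsampling step mentioned in the checklist. The argument names here are assumptions and may differ between mteb versions, so check the AbsTask implementation before relying on them:

# Sketch only: argument names are assumptions, not the verified mteb signature.
def dataset_transform(self):
    self.dataset = self.stratified_subsampling(
        self.dataset,
        seed=self.seed,
        splits=["test"],  # or the task's evaluation split
        n_samples=2048,   # cap the evaluation split at 2048 examples
    )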