Feat: allow decreasing size of datasets loaded from BEIR#3392
julian-risch merged 4 commits into deepset-ai:main
Conversation
Hi @ugm2 thanks for creating this PR and for tagging me. It's on my list. I'll get back to you with a review later today or tomorrow. 🙂
julian-risch left a comment
Hi @ugm2 the code changes look very good to me already! I left a few small comments to further improve the code. Just let me know if you have any questions. Happy to discuss the requested changes too if you disagree with any of them. Looking forward to the revised PR. 🙂
haystack/pipelines/base.py (outdated)
    qrels_new = {}
    for query_id, document_rel_dict in qrels.items():
        document_rel_ids_intersection = list(corpus_ids & set(list(document_rel_dict.keys())))
        # If there are no remaining documents related to the query, delete the query
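The qrels-filtering logic in the diff above can be sketched as a standalone function. Variable names follow the diff, but `filter_qrels` itself is a hypothetical helper, not the actual Haystack code:

```python
def filter_qrels(qrels, corpus_ids):
    """Keep only relevance judgments whose documents survived the corpus crop.

    Queries left with no judged documents are dropped entirely, as the
    diff's comment describes. This is an illustrative sketch, not the
    exact implementation in haystack/pipelines/base.py.
    """
    corpus_ids = set(corpus_ids)
    qrels_new = {}
    for query_id, document_rel_dict in qrels.items():
        # Intersect judged document ids with the ids still in the corpus
        document_rel_ids_intersection = corpus_ids & set(document_rel_dict.keys())
        # If there are no remaining documents related to the query, delete the query
        if document_rel_ids_intersection:
            qrels_new[query_id] = {
                doc_id: document_rel_dict[doc_id]
                for doc_id in document_rel_ids_intersection
            }
    return qrels_new


qrels = {"q1": {"d1": 1, "d2": 1}, "q2": {"d9": 1}}
print(filter_qrels(qrels, {"d1", "d3"}))  # q2 disappears; d2 is dropped from q1
```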
haystack/pipelines/base.py (outdated)
    :param query_params: The params to use during querying (see pipeline.run's params).
    :param dataset: The BEIR dataset to use.
    :param dataset_dir: The directory to store the dataset to.
    :param dataset_size: Maximum number of documents to load from given dataset.
Let's add the following: If set to None (default) or to a value larger than the number of documents in the dataset, the full dataset is loaded.
haystack/pipelines/base.py (outdated)
    query_params: dict = {},
    dataset: str = "scifact",
    dataset_dir: Path = Path("."),
    dataset_size: Optional[int] = None,
I wonder whether we would like to call this num_documents instead of dataset_size. It gives a bit more information to the user.
I'd say let's rename it to num_documents please.
Yeah, makes sense. Done!
haystack/pipelines/base.py (outdated)
    corpus, queries, qrels = GenericDataLoader(data_path).load(split="test")  # or split = "train" or "dev"

    # crop dataset if `dataset_size` is provided
    if dataset_size is not None:
We could add some error handling for dataset_size < 0 and dataset_size > len(corpus).
However, the code won't fail if dataset_size > len(corpus) and it will load the full dataset in that case, which is good.
Let's skip the newly added code block if dataset_size > len(corpus) or dataset_size < 1, and log a message that the dataset size remains unchanged.
Okay, see if what I've done suits you!
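The guard discussed above can be sketched as follows. The names `corpus` and `dataset_size` come from the diff; `crop_corpus` and the exact log message are illustrative assumptions, not the actual Haystack code:

```python
import logging

logger = logging.getLogger(__name__)


def crop_corpus(corpus, dataset_size):
    """Return the first `dataset_size` documents of the corpus.

    When the requested size is out of range (None, < 1, or larger than the
    corpus), skip the crop and log that the dataset size is unchanged, as
    agreed in the review. Hypothetical helper for illustration only.
    """
    if dataset_size is None or dataset_size < 1 or dataset_size > len(corpus):
        logger.info("Dataset size remains unchanged (%d documents).", len(corpus))
        return corpus
    # Corpora loaded by BEIR's GenericDataLoader are dicts keyed by doc id
    return dict(list(corpus.items())[:dataset_size])


corpus = {f"d{i}": {"text": f"doc {i}"} for i in range(5)}
assert len(crop_corpus(corpus, 3)) == 3
assert len(crop_corpus(corpus, 10)) == 5  # out of range: full corpus kept
assert len(crop_corpus(corpus, None)) == 5
```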
@julian-risch linting seems to be failing for some reason, I don't know why
@ugm2 The error message is the following: That's this line.
@julian-risch yeah, makes sense. Now it's working
julian-risch left a comment
These changes look good to me! 👍 @ugm2 Thank you Unai for integrating all the feedback/requested changes and thank you for contributing to Haystack in general! Looking forward to reviewing another PR of yours some time. 😉
Related Issues
eval_beir() #3387

Proposed Changes:
When evaluating pipelines, users would normally use the built-in eval_beir() method. This method offers no way to crop the chosen dataset, which would be a big plus for quick tests or reduced benchmarks when the dataset is huge.
This PR introduces a cropping feature (a new parameter) to allow this.
How did you test it?
Manually, since there are no tests for eval_beir(). It can be manually tested with the following script (with Elasticsearch enabled): You will see in the logging that it is indexing 300 documents instead of all of them (more than 3,000).
Checklist