
Feat: allow decreasing size of datasets loaded from BEIR#3392

Merged
julian-risch merged 4 commits into deepset-ai:main from ugm2:feature/benchmark_crop_dataset on Oct 21, 2022

Conversation

@ugm2 (Contributor) commented Oct 14, 2022

Related Issues

Proposed Changes:

When evaluating pipelines, users normally call the built-in eval_beir() method. This method offers no way to crop the chosen dataset, which would be a big plus for quick tests or reduced benchmarks when the dataset is huge.

This PR introduces a cropping feature (a new parameter) to allow this.

How did you test it?

Manually since there are no tests for eval_beir().

It can be tested manually with the following script (with Elasticsearch running):

from haystack.document_stores.elasticsearch import ElasticsearchDocumentStore
from haystack import Pipeline
from haystack.nodes.retriever import BM25Retriever
from haystack.nodes.file_converter import TextConverter

es_endpoint = "127.0.0.1"
es_port = 9200
index = "documents_test"
dataset = "nfcorpus"
dataset_size = 300

document_store = ElasticsearchDocumentStore(host=es_endpoint, port=es_port, similarity="dot_product", index=index)
es_retriever = BM25Retriever(document_store=document_store)
text_converter = TextConverter()

# INDEXING PIPELINE
index_pipeline = Pipeline()
index_pipeline.add_node(text_converter, name="TextConverter", inputs=["File"])
index_pipeline.add_node(document_store, name="DocumentStore", inputs=["TextConverter"])

# SEARCH PIPELINE
search_pipeline = Pipeline()
search_pipeline.add_node(es_retriever, name="ESRetriever", inputs=["Query"])

Pipeline.eval_beir(
    index_pipeline=index_pipeline, query_pipeline=search_pipeline, dataset=dataset, dataset_size=dataset_size
)

You will see in the logs that 300 documents are being indexed instead of all of them (more than 3,000).

Checklist

@ugm2 ugm2 marked this pull request as ready for review October 18, 2022 09:27
@ugm2 ugm2 requested a review from a team as a code owner October 18, 2022 09:27
@ugm2 ugm2 requested review from mayankjobanputra and removed request for a team October 18, 2022 09:27
@julian-risch
Copy link
Member

Hi @ugm2 thanks for creating this PR and for tagging me. It's on my list. I'll get back to you with a review later today or tomorrow. 🙂

@julian-risch (Member) left a comment:

Hi @ugm2 the code changes look very good to me already! I left a few small comments to further improve the code. Just let me know if you have any questions. Happy to discuss the requested changes too if you disagree with any of them. Looking forward to the revised PR. 🙂

qrels_new = {}
for query_id, document_rel_dict in qrels.items():
document_rel_ids_intersection = list(corpus_ids & set(list(document_rel_dict.keys())))
# If there are no remaining documents related to the query, delete de query
Member:

small typo de -> the

Contributor Author:

Done
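For context, the cropping logic this thread discusses can be sketched as follows. This is a minimal, self-contained sketch assuming the BEIR data structures (corpus as a dict of document id to document, qrels as a dict of query id to relevance dict); variable names follow the quoted snippet, but the exact implementation lives in haystack/pipelines/base.py and may differ:

```python
from typing import Dict, Tuple


def crop_beir_dataset(
    corpus: Dict[str, dict], qrels: Dict[str, Dict[str, int]], dataset_size: int
) -> Tuple[Dict[str, dict], Dict[str, Dict[str, int]]]:
    """Keep only the first `dataset_size` documents and drop queries whose
    relevant documents were all cropped away (illustrative sketch)."""
    # Keep the first `dataset_size` documents of the corpus
    corpus = dict(list(corpus.items())[:dataset_size])
    corpus_ids = set(corpus.keys())

    qrels_new = {}
    for query_id, document_rel_dict in qrels.items():
        document_rel_ids_intersection = list(corpus_ids & set(document_rel_dict.keys()))
        # If there are no remaining documents related to the query, delete the query
        if len(document_rel_ids_intersection) == 0:
            continue
        # Otherwise keep only the relevance judgments for surviving documents
        qrels_new[query_id] = {
            doc_id: rel for doc_id, rel in document_rel_dict.items() if doc_id in corpus_ids
        }
    return corpus, qrels_new
```

Filtering qrels alongside the corpus matters: evaluation metrics would otherwise count queries whose only relevant documents were cropped out as unanswerable.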

:param query_params: The params to use during querying (see pipeline.run's params).
:param dataset: The BEIR dataset to use.
:param dataset_dir: The directory to store the dataset to.
:param dataset_size: Maximum number of documents to load from given dataset.
Member:

Let's add the following: If set to None (default) or to a value larger than the number of documents in the dataset, the full dataset is loaded.

Contributor Author:

Done

query_params: dict = {},
dataset: str = "scifact",
dataset_dir: Path = Path("."),
dataset_size: Optional[int] = None,
Member:

I wonder whether we would like to call this num_documents instead of dataset_size. It gives a bit more information to the user.

Member:

I'd say let's rename it to num_documents please.

Contributor Author:

Yeah, makes sense. Done!

corpus, queries, qrels = GenericDataLoader(data_path).load(split="test") # or split = "train" or "dev"

# crop dataset if `dataset_size` is provided
if dataset_size is not None:
Member:

We could add some error handling for dataset_size < 0 and dataset_size > len(corpus).

Member:

However, the code won't fail if dataset_size > len(corpus) and it will load the full dataset in that case, which is good.

Member:

Let's skip the newly added code block if dataset_size > len(corpus) or dataset_size < 1, and log a message that the dataset size remains unchanged.

Contributor Author:

Okay, see if what I've done suits you!
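The guard requested above can be sketched like this. It is a minimal illustration, not the code merged in the PR: the helper name `should_crop` and the log message are made up for this sketch, and the merged parameter was later renamed to num_documents:

```python
import logging

logger = logging.getLogger(__name__)


def should_crop(num_documents, corpus_size: int) -> bool:
    """Return True only when cropping to `num_documents` actually makes sense."""
    if num_documents is None:
        # Default: load the full dataset
        return False
    if num_documents < 1 or num_documents > corpus_size:
        # Out-of-range values are ignored instead of raising an error
        logger.info("num_documents is out of range, the dataset size remains unchanged.")
        return False
    return True
```

This matches the reviewer's suggestion that an oversized value should silently fall back to loading the full dataset rather than failing.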

@julian-risch julian-risch changed the title Feat: Feat: allow decreasing size of datasets loaded from BEIR Oct 18, 2022
@ugm2 ugm2 requested review from julian-risch and removed request for mayankjobanputra October 19, 2022 16:19
@ugm2 (Contributor Author) commented Oct 20, 2022

@julian-risch linting seems to be failing for some reason, I don't know why

@julian-risch (Member) commented Oct 20, 2022

@ugm2 The error message is the following:

Run pylint -ry -j 0 haystack/ rest_api/rest_api ui/
************* Module haystack.pipelines.base
haystack/pipelines/base.py:773:11: R1716: Simplify chained comparison between the operands (chained-comparison)

That's this line.

`if num_documents is not None and 0 < num_documents < len(corpus)` would be a first improvement.
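To illustrate why pylint's R1716 fix is safe: the chained form is semantically identical to the two separate comparisons it replaces. A toy sketch (function names are illustrative only):

```python
def chained(n, corpus_len: int) -> bool:
    # pylint-friendly form: one chained comparison resolves R1716
    return n is not None and 0 < n < corpus_len


def unchained(n, corpus_len: int) -> bool:
    # the form pylint flags with chained-comparison (R1716)
    return n is not None and n > 0 and n < corpus_len
```

Both short-circuit on `n is not None`, so passing None is safe in either version; the only difference is readability.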

@ugm2 (Contributor Author) commented Oct 20, 2022

@julian-risch yeah, makes sense. Now it's working

@julian-risch (Member) left a comment:

These changes look good to me! 👍 @ugm2 Thank you Unai for integrating all the feedback/requested changes and thank you for contributing to Haystack in general! Looking forward to reviewing another PR of yours some time. 😉

@julian-risch julian-risch merged commit e41cb24 into deepset-ai:main Oct 21, 2022

Development

Successfully merging this pull request may close these issues.

Feature: allow decreasing dataset size in eval_beir()

2 participants