
Feat: allow decreasing size of datasets loaded from BEIR#3392

Merged
julian-risch merged 4 commits into deepset-ai:main from ugm2:feature/benchmark_crop_dataset on Oct 21, 2022

Conversation

@ugm2 (Contributor) commented Oct 14, 2022

Related Issues

Proposed Changes:

When evaluating pipelines, users normally call the built-in eval_beir() method. This method offers no way to crop the chosen dataset, which would be a big plus for quick tests or reduced benchmarks when the dataset is huge.

This PR introduces a cropping feature (a new parameter) to allow this.

How did you test it?

Manually since there are no tests for eval_beir().

It can be tested manually with the following script (with Elasticsearch running):

from haystack.document_stores.elasticsearch import ElasticsearchDocumentStore
from haystack import Pipeline
from haystack.nodes.retriever import BM25Retriever
from haystack.nodes.file_converter import TextConverter

es_endpoint = "127.0.0.1"
es_port = 9200
index = "documents_test"
dataset = "nfcorpus"
dataset_size = 300

document_store = ElasticsearchDocumentStore(host=es_endpoint, port=es_port, similarity="dot_product", index=index)
es_retriever = BM25Retriever(document_store=document_store)
text_converter = TextConverter()

# INDEXING PIPELINE
index_pipeline = Pipeline()
index_pipeline.add_node(text_converter, name="TextConverter", inputs=["File"])
index_pipeline.add_node(document_store, name="DocumentStore", inputs=["TextConverter"])

# SEARCH PIPELINE
search_pipeline = Pipeline()
search_pipeline.add_node(es_retriever, name="ESRetriever", inputs=["Query"])

Pipeline.eval_beir(
    index_pipeline=index_pipeline, query_pipeline=search_pipeline, dataset=dataset, dataset_size=dataset_size
)

You will see in the logs that 300 documents are being indexed instead of all of them (more than 3,000).

Checklist

@ugm2 ugm2 marked this pull request as ready for review October 18, 2022 09:27
@ugm2 ugm2 requested a review from a team as a code owner October 18, 2022 09:27
@ugm2 ugm2 requested review from mayankjobanputra and removed request for a team October 18, 2022 09:27
@julian-risch
Copy link
Member

Hi @ugm2 thanks for creating this PR and for tagging me. It's on my list. I'll get back to you with a review later today or tomorrow. 🙂

@julian-risch (Member) left a comment:

Hi @ugm2 the code changes look very good to me already! I left a few small comments to further improve the code. Just let me know if you have any questions. Happy to discuss the requested changes too if you disagree with any of them. Looking forward to the revised PR. 🙂

qrels_new = {}
for query_id, document_rel_dict in qrels.items():
document_rel_ids_intersection = list(corpus_ids & set(list(document_rel_dict.keys())))
# If there are no remaining documents related to the query, delete de query
Member:

small typo de -> the

Contributor Author:

Done
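For context, the cropping logic this thread discusses can be sketched as follows. This is a minimal, self-contained sketch assuming the BEIR data structures (corpus as a dict of document id to document, qrels as a dict of query id to relevance dict); variable names follow the quoted snippet, but the exact implementation lives in haystack/pipelines/base.py and may differ:

```python
from typing import Dict, Tuple


def crop_beir_dataset(
    corpus: Dict[str, dict], qrels: Dict[str, Dict[str, int]], dataset_size: int
) -> Tuple[Dict[str, dict], Dict[str, Dict[str, int]]]:
    """Keep only the first `dataset_size` documents and drop queries whose
    relevant documents were all cropped away (illustrative sketch)."""
    # Keep the first `dataset_size` documents of the corpus
    corpus = dict(list(corpus.items())[:dataset_size])
    corpus_ids = set(corpus.keys())

    qrels_new = {}
    for query_id, document_rel_dict in qrels.items():
        document_rel_ids_intersection = list(corpus_ids & set(document_rel_dict.keys()))
        # If there are no remaining documents related to the query, delete the query
        if len(document_rel_ids_intersection) == 0:
            continue
        # Otherwise keep only the relevance judgments for surviving documents
        qrels_new[query_id] = {
            doc_id: rel for doc_id, rel in document_rel_dict.items() if doc_id in corpus_ids
        }
    return corpus, qrels_new
```

Filtering qrels alongside the corpus matters: evaluation metrics would otherwise count queries whose only relevant documents were cropped out as unanswerable.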

:param query_params: The params to use during querying (see pipeline.run's params).
:param dataset: The BEIR dataset to use.
:param dataset_dir: The directory to store the dataset to.
:param dataset_size: Maximum number of documents to load from given dataset.
Member:

Let's add the following: If set to None (default) or to a value larger than the number of documents in the dataset, the full dataset is loaded.

Contributor Author:

Done

query_params: dict = {},
dataset: str = "scifact",
dataset_dir: Path = Path("."),
dataset_size: Optional[int] = None,
Member:

I wonder whether we would like to call this num_documents instead of dataset_size. It gives a bit more information to the user.

Member:

I'd say let's rename it to num_documents please.

Contributor Author:

Yeah, makes sense. Done!

corpus, queries, qrels = GenericDataLoader(data_path).load(split="test") # or split = "train" or "dev"

# crop dataset if `dataset_size` is provided
if dataset_size is not None:
Member:

We could add some error handling for dataset_size < 0 and dataset_size > len(corpus).

Member:

However, the code won't fail if dataset_size > len(corpus) and it will load the full dataset in that case, which is good.

Member:

Let's skip the newly added code block if dataset_size > len(corpus) or dataset_size < 1, and log a message that the dataset size remains unchanged.

Contributor Author:

Okay, see if what I've done suits you!
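The guard requested above can be sketched like this. It is a minimal illustration, not the code merged in the PR: the helper name `should_crop` and the log message are made up for this sketch, and the merged parameter was later renamed to num_documents:

```python
import logging

logger = logging.getLogger(__name__)


def should_crop(num_documents, corpus_size: int) -> bool:
    """Return True only when cropping to `num_documents` actually makes sense."""
    if num_documents is None:
        # Default: load the full dataset
        return False
    if num_documents < 1 or num_documents > corpus_size:
        # Out-of-range values are ignored instead of raising an error
        logger.info("num_documents is out of range, the dataset size remains unchanged.")
        return False
    return True
```

This matches the reviewer's suggestion that an oversized value should silently fall back to loading the full dataset rather than failing.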

@julian-risch julian-risch changed the title Feat: Feat: allow decreasing size of datasets loaded from BEIR Oct 18, 2022
@ugm2 ugm2 requested review from julian-risch and removed request for mayankjobanputra October 19, 2022 16:19
@ugm2 (Contributor Author) commented Oct 20, 2022

@julian-risch linting seems to be failing for some reason, I don't know why

@julian-risch (Member) commented Oct 20, 2022

@ugm2 The error message is the following:

Run pylint -ry -j 0 haystack/ rest_api/rest_api ui/
************* Module haystack.pipelines.base
haystack/pipelines/base.py:773:11: R1716: Simplify chained comparison between the operands (chained-comparison)

That's this line.

`if num_documents is not None and 0 < num_documents < len(corpus)` would be a first improvement.
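To illustrate why pylint's R1716 fix is safe: the chained form is semantically identical to the two separate comparisons it replaces. A toy sketch (function names are illustrative only):

```python
def chained(n, corpus_len: int) -> bool:
    # pylint-friendly form: one chained comparison resolves R1716
    return n is not None and 0 < n < corpus_len


def unchained(n, corpus_len: int) -> bool:
    # the form pylint flags with chained-comparison (R1716)
    return n is not None and n > 0 and n < corpus_len
```

Both short-circuit on `n is not None`, so passing None is safe in either version; the only difference is readability.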

@ugm2 (Contributor Author) commented Oct 20, 2022

@julian-risch yeah, makes sense. Now it's working

@julian-risch (Member) left a comment:

These changes look good to me! 👍 @ugm2 Thank you Unai for integrating all the feedback/requested changes and thank you for contributing to Haystack in general! Looking forward to reviewing another PR of yours some time. 😉

@julian-risch julian-risch merged commit e41cb24 into deepset-ai:main Oct 21, 2022

Development

Successfully merging this pull request may close these issues.

Feature: allow decreasing dataset size in eval_beir()

2 participants