Make batchwise adding of evaluation data possible #717

Merged
merged 8 commits into from Jan 12, 2021

bogdankostic
Contributor

When I tried to add evaluation data containing millions of documents, my computer ran out of memory.

To solve this problem, this PR adds a method that adds eval docs batch-wise, so that not all eval docs have to be loaded into memory at once. Because add_eval_data_batchwise requires its input in jsonl format, this PR also introduces the method squad_json_to_jsonl to convert SQuAD JSON files into that format.
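The idea can be illustrated with a minimal sketch. This is not the code from this PR: the function names squad_json_to_jsonl_sketch and iter_batches are hypothetical, and the sketch assumes a SQuAD-style file whose top-level "data" key holds a list of articles. Each article is written as one JSON line, so a reader can later pull a fixed number of lines at a time instead of parsing the whole file.

```python
import json
from itertools import islice
from typing import Dict, Iterator, List


def squad_json_to_jsonl_sketch(squad_file: str, output_file: str) -> int:
    """Write each entry of the SQuAD "data" list as one JSON line.

    Hypothetical helper for illustration. Note that the conversion itself
    still parses the whole SQuAD file once; the memory saving comes when
    the resulting jsonl file is read back in batches.
    """
    with open(squad_file, encoding="utf-8") as f:
        squad = json.load(f)
    count = 0
    with open(output_file, "w", encoding="utf-8") as out:
        for article in squad["data"]:
            out.write(json.dumps(article) + "\n")
            count += 1
    return count


def iter_batches(jsonl_file: str, batch_size: int = 1000) -> Iterator[List[Dict]]:
    """Yield lists of at most batch_size parsed lines from a jsonl file."""
    with open(jsonl_file, encoding="utf-8") as f:
        while True:
            # islice reads at most batch_size lines without loading the rest
            lines = list(islice(f, batch_size))
            if not lines:
                break
            yield [json.loads(line) for line in lines if line.strip()]
```

Each batch yielded by iter_batches could then be handed to the document store's write calls, so that only batch_size articles are held in memory at any time.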

@bogdankostic
Contributor Author

All tests pass on my local machine. I'm not sure why test_faiss_index_save_and_load fails in the CI. The error message is sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) disk I/O error, so I suspect the problem lies with the CI environment rather than with this change.

Member

@tholor tholor left a comment

I like it. Would propose a different design though.

Review comments on haystack/document_store/elasticsearch.py (two threads, outdated and resolved)
Member

@tholor tholor left a comment

LGTM!

@bogdankostic bogdankostic merged commit 7709b6c into master Jan 12, 2021
@bogdankostic bogdankostic deleted the batchwise_add_eval_data branch January 12, 2021 16:54
@Rob192
Contributor

Rob192 commented Jan 15, 2021

Hello,
This PR is not compatible with Windows; it raises the following error: AttributeError: 'WindowsPath' object has no attribute 'endswith'
Would it be possible to use the pathlib library, for example, to prevent this?
Thanks !
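
The error suggests that a str method was called on a Path object. A minimal sketch of one possible fix, assuming the failing call was a file-extension check on an argument that may be either a str or a pathlib.Path (the helper name has_suffix is hypothetical and not part of haystack):

```python
from pathlib import Path
from typing import Union


def has_suffix(filename: Union[str, Path], suffix: str) -> bool:
    """Check a file extension for both str and Path inputs.

    Wrapping the argument in Path() avoids calling str-only methods
    such as .endswith() on a WindowsPath or PosixPath object.
    """
    return Path(filename).suffix.lower() == suffix.lower()


# Both call styles work, regardless of platform.
print(has_suffix("eval/dev-v2.0.json", ".json"))
print(has_suffix(Path("eval") / "dev-v2.0.jsonl", ".jsonl"))
```

Comparing Path.suffix rather than using endswith also avoids accepting a ".jsonl" file where ".json" is expected only by accident of string matching.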

@Timoeller
Contributor

Hey @Rob192, sure, we already use pathlib in some parts of haystack.

Would you like to create a PR with the proposed changes? I guess the change will be minor. Otherwise we can also take over.

@Rob192
Contributor

Rob192 commented Jan 15, 2021

OK @Timoeller ! will do !

@lalitpagaria
Contributor

lalitpagaria commented Jan 15, 2021

@Timoeller I raised ticket #714 to catch platform-dependent issues like this early. Windows users currently account for 15%-20% of overall haystack downloads.


5 participants