Improved delete_by_query to not run the wait_task if not appropriate #209
Conversation
👍
Hey, in my case I was deleting a large number of objects (my records are paragraphs from large PDF files, and I don't really have an ID per paragraph stored anywhere), so I easily get more than 1000 objects. The delete by query literally took 15+ minutes. Is there any other way we can speed it up for that scenario?
@redox Do you have any ideas or suggestions on how to handle a scenario like that?
lib/algolia/index.rb
- res = delete_objects(res['hits'].map { |h| h['objectID'] })
+ ids = res['hits'].map { |h| h['objectID'] }
+ res = delete_objects(ids)
+ break if ids.size != 1000 # no need to wait_task if we had less than 1000 objects matching; we won't get more after anyway.
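For context, here is a minimal sketch of the loop this diff changes. The names and call signatures are illustrative stand-ins, not the actual client code: `search` returns at most 1000 hits per call, so once a page comes back with fewer than 1000 hits there are no further matches and the final `wait_task` can be skipped.

```ruby
# Illustrative sketch of the delete_by_query loop (not the actual
# Algolia client implementation). The `index` object is assumed to
# respond to `search`, `delete_objects`, and `wait_task`.
PAGE_SIZE = 1000

def delete_by_query_sketch(index, query)
  loop do
    res = index.search(query, hitsPerPage: PAGE_SIZE)
    ids = res['hits'].map { |h| h['objectID'] }
    res = index.delete_objects(ids)
    # No need to wait_task if we matched fewer than 1000 objects;
    # the next search won't return more anyway.
    break if ids.size != PAGE_SIZE
    index.wait_task(res['taskID'])
  end
end
```

The wait is only needed between full pages: deleting a full page and searching again immediately would return the same (not-yet-deleted) objects.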
Shouldn't this condition be continue if ids.size == 1000? If I get 1000 results, it most probably means that I'll get more results from another search, and there is no need to wait for the delete task. Am I missing something?
So in the end, the more results I get, the longer I wait.
I think the problem here is that if you don't wait, the next search() calls might return the same object IDs, thus flooding the API with duplicate delete calls.
One way to mitigate that would be to first scan the entire result set, and then call delete on all of them.
Yes, I didn't think of that. Changing the implementation to use browse would break compatibility, as the API key currently needs only the search and delete records ACLs; a browse-based implementation would also need the browse ACL.
@raphi By scanning the entire result set, do you mean treating it as a paginated search, acquiring all the IDs of the objects to delete, and then doing asynchronous deletes? I was about to try something like that.
@obahareth yes exactly. @redox what do you think?
As suggested by @raphi in algolia#209, add a new version of delete_by_query that scans the entire index for all IDs to delete without awaiting delete results like in the old `delete_by_query` (which is now `delete_by_query!`) and then delete all matching objects asynchronously in one call to delete_objects. Comment: algolia#209 (comment)
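The approach the commit message describes can be sketched as follows. This is a minimal illustration, not the actual implementation: `browse` and `delete_objects` are assumed to behave like the Algolia Ruby client's methods (browse iterates over every record without pagination races; delete_objects issues one asynchronous batch delete).

```ruby
# Illustrative sketch of the new delete_by_query: collect every
# matching objectID in a single browse pass, then delete them all
# asynchronously with one delete_objects call, without waiting on
# the resulting task (unlike the old delete_by_query, now
# delete_by_query!).
def delete_by_query_async(index, query)
  ids = []
  index.browse(query) { |hit| ids << hit['objectID'] }
  index.delete_objects(ids) # async: returns without wait_task
end
```

Because the IDs are gathered before any delete is issued, later lookups can never re-return objects that are already queued for deletion, which avoids the duplicate-delete flood discussed above.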
🙌
@obahareth I merged your PR here in order to trigger Travis's build.
Thanks @raphi, I'm sorry there were still issues here and there.
@raphi I didn't get counted as a contributor 😟
Fix #204