Expand TieredMergePolicy deletePctAllowed limits #11761
Comments
Historically this was not configurable and Lucene would allow up to 50% deleted documents. When we introduced an option, we made sure to put a lower bound on the value, because a value of zero would essentially require Lucene to rewrite every segment that has a deletion after every update operation, which is certainly undesirable. Allowing users to go from 50% down to 20% already felt like a significant improvement. We could discuss lowering the limit if we believe it would still lead to acceptable merging patterns. E.g. I used …
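For anyone following along, this is roughly how the knob in question gets configured. A minimal sketch; the in-memory directory and StandardAnalyzer are chosen purely for illustration:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;
import org.apache.lucene.store.ByteBuffersDirectory;

public class DeletesPctAllowedExample {
  public static void main(String[] args) throws Exception {
    TieredMergePolicy tmp = new TieredMergePolicy();
    // 20.0 is the floor at the time of this discussion; values below it
    // (and above 50.0) are rejected with an IllegalArgumentException.
    tmp.setDeletesPctAllowed(20.0);

    IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
    iwc.setMergePolicy(tmp);
    try (IndexWriter writer = new IndexWriter(new ByteBuffersDirectory(), iwc)) {
      // Index and update documents here; merging now targets at most
      // ~20% deleted documents across the index.
    }
  }
}
```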
Hi, thanks for the response! Your explanation of why 0% is not allowed makes complete sense. For some context, though, we have been using our own forked version of … Maybe if we want to maintain those limits, we could create a …
I got some numbers for write amplification for the case tested in …
Assuming these numbers are representative, maybe we could allow users to configure 5% as the allowed percentage of deletes that their indexes may have, which translates to ~2x more write amplification compared to the default of 33%, according to the above numbers. For reference, the algorithm that …
Thanks for taking the time to look into this! I think 5% would be a good start; it is near the threshold we want to test (we were thinking 2%, but looking at your initial write amplification numbers, that may not be a great idea). I'm also planning on opening an issue to create a …
Here is the issue I created with a PR attached, in case you were interested: #11795
I have also run into this on our patent search system. In our index the problem is exaggerated by the larger documents tending to be reindexed more frequently, so the 20% deleted documents can translate to 40% of the overall index size! I can definitely imagine that for a system where indexing is light and infrequent, 2% may make sense to ensure optimal performance/disk usage without requiring the explicit use of expungeDeletes. Having said that, 5% is definitely low enough for my use case.
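For reference, the expungeDeletes operation mentioned above corresponds to IndexWriter#forceMergeDeletes in current Lucene. A minimal sketch of that explicit route, assuming a writer and its merge policy already exist:

```java
import java.io.IOException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.TieredMergePolicy;

class ReclaimDeletes {
  // Explicitly reclaims deleted documents, independent of deletesPctAllowed.
  static void reclaim(IndexWriter writer, TieredMergePolicy tmp) throws IOException {
    // Only segments above this percentage of deletes are rewritten;
    // 10% is the TieredMergePolicy default for this knob.
    tmp.setForceMergeDeletesPctAllowed(10.0);
    writer.forceMergeDeletes(); // blocks until the requested merges complete
  }
}
```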
I was also consulting for a huge Elasticsearch user, and they had the same problem of wanting to keep deletes as low as possible; the 20% limit was way too high for them. 20% looks like an arbitrary limitation.
If someone opens a PR to decrease the limit from 20% to 5%, I'll happily approve the change given the results I shared above.
Here is a PR: #11831
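Since the diff itself isn't inlined here, the following is only an illustrative sketch of the kind of argument check such a PR relaxes, not the actual patch (see #11831 for the real change):

```java
// Illustrative sketch only -- not the actual diff from #11831.
public TieredMergePolicy setDeletesPctAllowed(double v) {
  if (v < 5 || v > 50) { // previously rejected anything below 20
    throw new IllegalArgumentException(
        "indexPctDeletedTarget must be >= 5.0 and <= 50.0 (got " + v + ")");
  }
  deletesPctAllowed = v;
  return this;
}
```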
@jpountz based on these numbers, wouldn't it also make sense to consider changing the default from 33% to 20%?
I can include that in the above PR as well if you all agree that it's a good idea.
I think it's probably a good idea. Faster search performance (due to the indexes having fewer deleted documents) and a reduced risk of underestimating space requirements and running out of disk space seem like a reasonable tradeoff for a small increase in indexing time.
+1 to update the default from 33 to 20.
I updated the PR to change the default as well.
Closing: #11831 has been merged.
Description
I'm an engineer at Amazon Search, and we have been experimenting with more aggressively getting rid of deleted documents. We use TieredMergePolicy and would like to set TieredMergePolicy#deletesPctAllowed to be lower than the current limit of 20%. I was wondering why this limit was put in place; I'm sure I could be missing some context here. Maybe we could keep the limits in place but allow users to explicitly remove the checks? Any information would be much appreciated, thanks!
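To make the limit concrete, this is the behavior the question refers to, as of the Lucene version current when this issue was opened (a minimal repro, not a recommendation):

```java
import org.apache.lucene.index.TieredMergePolicy;

public class BelowTheFloor {
  public static void main(String[] args) {
    TieredMergePolicy tmp = new TieredMergePolicy();
    // Before #11831 was merged, any value below 20.0 was rejected:
    tmp.setDeletesPctAllowed(5.0); // throws IllegalArgumentException
  }
}
```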