Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expand TieredMergePolicy deletePctAllowed limits #11761

Closed
mdmarshmallow opened this issue Sep 9, 2022 · 16 comments
Closed

Expand TieredMergePolicy deletePctAllowed limits #11761

mdmarshmallow opened this issue Sep 9, 2022 · 16 comments

Comments

@mdmarshmallow
Copy link
Contributor

Description

I'm an engineer at Amazon Search and we have been experimenting with more aggressively getting rid of deleted documents. We use TieredMergePolicy and we would like to set TieredMergePolicy#deletesPctAllowed to be lower than the current limit of 20%.

I was wondering why this limit was set in place. I'm sure I could be missing some context here. Maybe we could keep the limits in place but allow users to explicitly remove the checks? Any information would be much appreciated, thanks!

@jpountz
Copy link
Contributor

jpountz commented Sep 15, 2022

Historically this was not configurable and Lucene would allow up to 50% deleted documents. When we introduced an option, we made sure to introduce a lower bound on the value because a value of zero would essentially require Lucene to rewrite every segment that has a deletion after every update operation, which is certainly undesirable. Allowing users to go from 50% to 20% felt like a significant improvement already.

We could discuss lowering the limit if we feel like this could lead to merging patterns that are still acceptable. E.g. I used BaseMergePolicyTestCase#doTestSimulateUpdates in the past to get a sense of how this option would influence write amplification.

@mdmarshmallow
Copy link
Contributor Author

Hi, thanks for the response! Your explanation of 0% not being allowed makes complete sense. For some context though, using our own forked version of TieredMergePolicy, we have tested with down to 2% allowable deletion and still see that behavior is desirable for us (more specifically much lower index sizes than at 20% deletes).

Maybe if we want to maintain those limits, we could create a setDeletesPctAllowedUnsafe() or something like that which has no limits on it (or at least drops the lower bound to being > 0 instead of > 20)?

@jpountz
Copy link
Contributor

jpountz commented Sep 19, 2022

I got some numbers for write amplification for the case tested in TestTieredMergePolicy#testSimulateUpdates:

Allowed percentage of deletes Write amplification
50 (max) 4.34
33 (default) 4.34
20 (min) 4.68
10 6.13
5 8.76
4 10.31
3 12.97
2 18.76
1 37.89
0 10779.78

Assuming these numbers are representative, maybe we could allow users to configure 5% as the allowed percentage of deletes that their indexes may have, which translates to ~2x more write amplification compared to the default of 33% according to the above numbers.

For reference, the algorithm that TieredMergePolicy uses to keep the number of deletes under the threshold consists of running the most balanced merge (with a small bias towards merges that reclaim more deletes) until the number of deletes of the index is under the threshold.

@mdmarshmallow
Copy link
Contributor Author

Thanks for taking the time to look into this! I think 5% would be a good start, it would be near the threshold we want to test (we were thinking 2% but looking at your initial write amplification numbers, this may not be a great idea). I'm also planning on making an issue to create a FilterDirectory that tracks the write amplification factor. This should let us (Amazon Search) get some write amplification numbers that I can hopefully share here as well.

@mdmarshmallow
Copy link
Contributor Author

Here is the issue I created with a PR attached in case you were interested: #11795

@dan2097
Copy link

dan2097 commented Sep 22, 2022

I have also ran into this on our patent search system. In our index the problem is exagerrated by the larger documents tending to be more frequently reindexed so the 20% deleted documents can translate to 40% of the overall index size!
For my use case 5% would be a massive improvement.

I ccan definitely imagine that for a system where indexing is light and infrequent 2% may make sense to ensure optimal perfomance/disk usage, without requiring the explicit use expungeDeletes. Having said that 5% is definitely low enough for my use case.

@uschindler
Copy link
Contributor

I was also doing consulting for an huge Elasticsearch user and they also had this problem of keeping deletes as low as possible and the 20% limit was way too high. 20% looks like an arbitrary limitation.

@jpountz
Copy link
Contributor

jpountz commented Sep 29, 2022

If someone opens a PR to decrease the limit from 20% to 5% I'll happily approve the change given the results I shared above.

@mdmarshmallow
Copy link
Contributor Author

Here is a PR: #11831

@rmuir
Copy link
Member

rmuir commented Sep 29, 2022

I got some numbers for write amplification for the case tested in TestTieredMergePolicy#testSimulateUpdates:

@jpountz based on these numbers wouldn't it also make sense to consider changing the default from 33% to 20% as well? Just an idea.

@mdmarshmallow
Copy link
Contributor Author

I can include that in the above PR as well if you all agree that it's a good idea.

@dan2097
Copy link

dan2097 commented Oct 4, 2022

I can include that in the above PR as well if you all agree that it's a good idea.

I think it's probably a good idea. Faster search performance (due to the indexes having fewer deleted documents) and a reduced risk of underestimating space requirements and running out of disk space seem a reasonable tradeoff for a small increase in indexing time.

@jpountz
Copy link
Contributor

jpountz commented Oct 4, 2022

+1 to update the default from 33 to 20.

1 similar comment
@uschindler
Copy link
Contributor

+1 to update the default from 33 to 20.

@mdmarshmallow
Copy link
Contributor Author

I updated the PR to change the default as well.

@jpountz
Copy link
Contributor

jpountz commented Oct 13, 2022

Closing: #11831 has been merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants