-
Notifications
You must be signed in to change notification settings - Fork 962
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce Bloom Filter as non-experimental/core postings format #12986
Comments
supporting back compat is a one-way door and a big deal. Back compat has a heavy price and is responsible for lots of bugs (e.g. Lucene 9.9.1 release). It can't be done on a whim and IMO we already support way "too much" here by far. Too much commercial influence on the project... I also don't think this bloom format is ready. You argue it should have lucene's backwards compatibility support at the same time as pointing to thousands of lines unmerged PR with recommended changes to it! That's not stability. |
@rmuir : We understand supporting backward compatibility is one way door, but bloom filter has been in experimental mode for years now. So, we would like to understand from Lucene community - has it been abandoned?
The intent here is to help make bloom ready by contributing to lucene if the community is aligned and aim to release it. It can be controlled under feature flag/ experiment mode until it is production ready. @mikemccand would like to hear your thoughts as well. |
I agree with @rmuir -- promising backwards compatibility (API or index format) is a huge burden on Lucene developers, and it's hard enough with the default Codec today. Given that the bloom postings format is still very much in flux, let's wait on removing the experimental tag. E.g. we are also pursuing another experimental codec (inspired by Tantivy) that also seems to speed up the primary-key lookup use case. Note that OpenSearch devs could also choose to offer this backwards compatibility to its users. The promise need not be implemented only in Lucene. Thank you for sharing those benchmark results. That is indeed quite a sizable impact on indexing throughput / long-pole latencies, especially as you greatly increase the bloom filter size to lower the false-positive rate. It looks like that test was 100% +1 to pursue the linked improvements (off-heap option) -- the linked PR looks interesting -- maybe open a PR here for that? Or is that change somehow OpenSearch specific? |
Thanks @mikemccand for the feedback. We can pursue the route to offer the backward compatibility in OpenSearch directly if there are no other takers among Lucene users. 💯 for contributing the off heap implementation to Lucene to prevent memory overhead due to large bloom filter size.
It would require some changes in the current PR to be contributed to Lucene as it depends on some of the OpenSearch wrapper data structures like BigArrays. |
Thanks @mikemccand @rmuir for your feedback.
That is correct. We used Lucene updateDocument calls for the entier test, with 25% of the updates being updates not appends.
Thanks, I can work on a PR to contribute the changes to enable the support with Lucene constructs. I think we should have a way to achieve this in Lucene using a wrapper on existing bit-set implementation and/or the underlying array within the bitset. |
I'd like to stick to the current policy that only the default codec has backward compatibility support. One idea would be to fold bloom filters into the default postings format like we did for the That said, +1 to improve the bloom filter postings format to load the bloom filter off-heap at read time, this would be an interesting improvement. |
Description
Today, BloomFilteringPostingsFormat in Lucene is marked experimental.
Based on our analysis of the the data structure in OpenSearch for the Primary Key field using the
nyc_taxis
workload [See Issue], we have found that it proves very useful for indexing performance, and also certain term queries/Get Document calls on the PK.We want to introduce this as an opt-in feature in OpenSearch for customers so that they can take advantage of the performance improvements, and wanted inputs from the community on the following:
We've done few changes in OpenSearch to support an off-heap implementation and introduce certain knobs which may prove useful for Lucene customers and would like to see if they can be contributed to Lucene as well [opensearch-project/OpenSearch/pull/11027].
The text was updated successfully, but these errors were encountered: