Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce Bloom Filter as non-experimental/core postings format #12986

Open
mgodwan opened this issue Dec 29, 2023 · 6 comments
Open

Introduce Bloom Filter as non-experimental/core postings format #12986

mgodwan opened this issue Dec 29, 2023 · 6 comments

Comments

@mgodwan
Copy link

mgodwan commented Dec 29, 2023

Description

Today, BloomFilteringPostingsFormat in Lucene is marked experimental.

Based on our analysis of the the data structure in OpenSearch for the Primary Key field using the nyc_taxis workload [See Issue], we have found that it proves very useful for indexing performance, and also certain term queries/Get Document calls on the PK.

We want to introduce this as an opt-in feature in OpenSearch for customers so that they can take advantage of the performance improvements, and wanted inputs from the community on the following:

  1. Why is the BloomFilteringPostingsFormat in experimental status?
  2. Is it possible to contribute and mark it as a core feature with support for backward compatibility in Lucene?

We've done few changes in OpenSearch to support an off-heap implementation and introduce certain knobs which may prove useful for Lucene customers and would like to see if they can be contributed to Lucene as well [opensearch-project/OpenSearch/pull/11027].

@rmuir
Copy link
Member

rmuir commented Dec 29, 2023

supporting back compat is a one-way door and a big deal. Back compat has a heavy price and is responsible for lots of bugs (e.g. Lucene 9.9.1 release). It can't be done on a whim and IMO we already support way "too much" here by far. Too much commercial influence on the project...

I also don't think this bloom format is ready. You argue it should have lucene's backwards compatibility support at the same time as pointing to thousands of lines unmerged PR with recommended changes to it! That's not stability.

@shwetathareja
Copy link

shwetathareja commented Jan 2, 2024

@rmuir : We understand supporting backward compatibility is one way door, but bloom filter has been in experimental mode for years now. So, we would like to understand from Lucene community - has it been abandoned?

I also don't think this bloom format is ready.

The intent here is to help make bloom ready by contributing to lucene if the community is aligned and aim to release it. It can be controlled under feature flag/ experiment mode until it is production ready.

@mikemccand would like to hear your thoughts as well.

@mikemccand
Copy link
Member

I agree with @rmuir -- promising backwards compatibility (API or index format) is a huge burden on Lucene developers, and it's hard enough with the default Codec today.

Given that the bloom postings format is still very much in flux, let's wait on removing the experimental tag. E.g. we are also pursuing another experimental codec (inspired by Tantivy) that also seems to speed up the primary-key lookup use case.

Note that OpenSearch devs could also choose to offer this backwards compatibility to its users. The promise need not be implemented only in Lucene.

Thank you for sharing those benchmark results. That is indeed quite a sizable impact on indexing throughput / long-pole latencies, especially as you greatly increase the bloom filter size to lower the false-positive rate. It looks like that test was 100% updateDocument calls with 25% of the updates being updates not appends?

+1 to pursue the linked improvements (off-heap option) -- the linked PR looks interesting -- maybe open a PR here for that? Or is that change somehow OpenSearch specific?

@shwetathareja
Copy link

shwetathareja commented Jan 2, 2024

Thanks @mikemccand for the feedback. We can pursue the route to offer the backward compatibility in OpenSearch directly if there are no other takers among Lucene users.

💯 for contributing the off heap implementation to Lucene to prevent memory overhead due to large bloom filter size.

Or is that change somehow OpenSearch specific?

It would require some changes in the current PR to be contributed to Lucene as it depends on some of the OpenSearch wrapper data structures like BigArrays.

@mgodwan
Copy link
Author

mgodwan commented Jan 3, 2024

Thanks @mikemccand @rmuir for your feedback.

It looks like that test was 100% updateDocument calls with 25% of the updates being updates not appends?

That is correct. We used Lucene updateDocument calls for the entier test, with 25% of the updates being updates not appends.

+1 to pursue the linked improvements (off-heap option) -- the opensearch-project/OpenSearch#11027 looks interesting -- maybe open a PR here for that? Or is that change somehow OpenSearch specific?

Thanks, I can work on a PR to contribute the changes to enable the support with Lucene constructs. I think we should have a way to achieve this in Lucene using a wrapper on existing bit-set implementation and/or the underlying array within the bitset.

@jpountz
Copy link
Contributor

jpountz commented Jan 8, 2024

I'd like to stick to the current policy that only the default codec has backward compatibility support. One idea would be to fold bloom filters into the default postings format like we did for the Pulsing codec (inlining postings in the terms dictionary when docFreq=1). One reason why I'm a bit on the fence about it is that this format still needs to materialize the bloom filter in heap at write time, while I'd like to keep heap requirements of the default codec as low as possible.

That said, +1 to improve the bloom filter postings format to load the bloom filter off-heap at read time, this would be an interesting improvement.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants