Introduce Bloom Filter as non-experimental/core postings format #12986

mgodwan · 2023-12-29T09:02:34Z

Description

Today, BloomFilteringPostingsFormat in Lucene is marked experimental.

Based on our analysis of the the data structure in OpenSearch for the Primary Key field using the nyc_taxis workload [See Issue], we have found that it proves very useful for indexing performance, and also certain term queries/Get Document calls on the PK.

We want to introduce this as an opt-in feature in OpenSearch for customers so that they can take advantage of the performance improvements, and wanted inputs from the community on the following:

Why is the BloomFilteringPostingsFormat in experimental status?
Is it possible to contribute and mark it as a core feature with support for backward compatibility in Lucene?

We've done few changes in OpenSearch to support an off-heap implementation and introduce certain knobs which may prove useful for Lucene customers and would like to see if they can be contributed to Lucene as well [opensearch-project/OpenSearch/pull/11027].

The text was updated successfully, but these errors were encountered:

rmuir · 2023-12-29T12:34:25Z

supporting back compat is a one-way door and a big deal. Back compat has a heavy price and is responsible for lots of bugs (e.g. Lucene 9.9.1 release). It can't be done on a whim and IMO we already support way "too much" here by far. Too much commercial influence on the project...

I also don't think this bloom format is ready. You argue it should have lucene's backwards compatibility support at the same time as pointing to thousands of lines unmerged PR with recommended changes to it! That's not stability.

shwetathareja · 2024-01-02T05:02:12Z

@rmuir : We understand supporting backward compatibility is one way door, but bloom filter has been in experimental mode for years now. So, we would like to understand from Lucene community - has it been abandoned?

I also don't think this bloom format is ready.

The intent here is to help make bloom ready by contributing to lucene if the community is aligned and aim to release it. It can be controlled under feature flag/ experiment mode until it is production ready.

@mikemccand would like to hear your thoughts as well.

mikemccand · 2024-01-02T11:52:52Z

I agree with @rmuir -- promising backwards compatibility (API or index format) is a huge burden on Lucene developers, and it's hard enough with the default Codec today.

Given that the bloom postings format is still very much in flux, let's wait on removing the experimental tag. E.g. we are also pursuing another experimental codec (inspired by Tantivy) that also seems to speed up the primary-key lookup use case.

Note that OpenSearch devs could also choose to offer this backwards compatibility to its users. The promise need not be implemented only in Lucene.

Thank you for sharing those benchmark results. That is indeed quite a sizable impact on indexing throughput / long-pole latencies, especially as you greatly increase the bloom filter size to lower the false-positive rate. It looks like that test was 100% updateDocument calls with 25% of the updates being updates not appends?

+1 to pursue the linked improvements (off-heap option) -- the linked PR looks interesting -- maybe open a PR here for that? Or is that change somehow OpenSearch specific?

shwetathareja · 2024-01-02T12:50:39Z

Thanks @mikemccand for the feedback. We can pursue the route to offer the backward compatibility in OpenSearch directly if there are no other takers among Lucene users.

💯 for contributing the off heap implementation to Lucene to prevent memory overhead due to large bloom filter size.

Or is that change somehow OpenSearch specific?

It would require some changes in the current PR to be contributed to Lucene as it depends on some of the OpenSearch wrapper data structures like BigArrays.

mgodwan · 2024-01-03T08:23:20Z

Thanks @mikemccand @rmuir for your feedback.

It looks like that test was 100% updateDocument calls with 25% of the updates being updates not appends?

That is correct. We used Lucene updateDocument calls for the entier test, with 25% of the updates being updates not appends.

+1 to pursue the linked improvements (off-heap option) -- the opensearch-project/OpenSearch#11027 looks interesting -- maybe open a PR here for that? Or is that change somehow OpenSearch specific?

Thanks, I can work on a PR to contribute the changes to enable the support with Lucene constructs. I think we should have a way to achieve this in Lucene using a wrapper on existing bit-set implementation and/or the underlying array within the bitset.

jpountz · 2024-01-08T10:38:03Z

I'd like to stick to the current policy that only the default codec has backward compatibility support. One idea would be to fold bloom filters into the default postings format like we did for the Pulsing codec (inlining postings in the terms dictionary when docFreq=1). One reason why I'm a bit on the fence about it is that this format still needs to materialize the bloom filter in heap at write time, while I'd like to keep heap requirements of the default codec as low as possible.

That said, +1 to improve the bloom filter postings format to load the bloom filter off-heap at read time, this would be an interesting improvement.

mgodwan added the type:enhancement label Dec 29, 2023

mgodwan mentioned this issue Jan 11, 2024

Enable Fuzzy codec for doc id fields using a bloom filter opensearch-project/OpenSearch#11027

Merged

8 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Introduce Bloom Filter as non-experimental/core postings format #12986

Introduce Bloom Filter as non-experimental/core postings format #12986

mgodwan commented Dec 29, 2023

rmuir commented Dec 29, 2023

shwetathareja commented Jan 2, 2024 •

edited

mikemccand commented Jan 2, 2024

shwetathareja commented Jan 2, 2024 •

edited

mgodwan commented Jan 3, 2024 •

edited

jpountz commented Jan 8, 2024

Introduce Bloom Filter as non-experimental/core postings format #12986

Introduce Bloom Filter as non-experimental/core postings format #12986

Comments

mgodwan commented Dec 29, 2023

Description

rmuir commented Dec 29, 2023

shwetathareja commented Jan 2, 2024 • edited

mikemccand commented Jan 2, 2024

shwetathareja commented Jan 2, 2024 • edited

mgodwan commented Jan 3, 2024 • edited

jpountz commented Jan 8, 2024

shwetathareja commented Jan 2, 2024 •

edited

shwetathareja commented Jan 2, 2024 •

edited

mgodwan commented Jan 3, 2024 •

edited