
Accuracy of bloom2 impacted by compilers #240

Open
matthewvon opened this issue Feb 22, 2022 · 1 comment

@matthewvon (Contributor):

I have ported util/bloom2.cc to a different environment. The bloom filter, with 16 bits per key, is supposed to have a 0.0004 false-positive rate. Instead, the filter showed rates of 0.0117 and 0.0519 in two different compiler environments (Debian 10 and Debian 11, respectively, each using its default gcc packages). The bloom filter is important because of Riak's read-before-write characteristics.

The problem appears to be related to these two lines:

https://github.com/basho/leveldb/blob/develop/util/bloom2.cc#L70

https://github.com/basho/leveldb/blob/develop/util/bloom2.cc#L97

The existing code uses a mix of variable declarations: size_t, unsigned, and uint32_t. This appears to cause portions of the arithmetic to be performed in 64 bits and portions in 32 bits, so the computed bit positions differ between build environments. Forcing all variables around these two lines to a consistent uint32_t declaration was the cure.

However, just patching the current code does not address how to "fix" bad bloom filters in existing data sets. The corrected bloom filter code has the potential to make data written with the old code seem to disappear. Even as things stand, data might also disappear when new Riak releases are distributed that were built with a different compiler than before. I have not proven this, just guessing.

The solution might be to create a corrected bloom3.cc. The existing leveldb code could then automatically pick the bloom filter implementation that matches each existing data file, using the corrected bloom3.cc filters for new data and for data newly passing through a normal compaction.

@martinsumner (Contributor) commented Dec 7, 2022:

Sorry @matthewvon, I didn't see this issue when you raised it. Thank you for highlighting it here.

Among the contributors who still have rights to basho/leveldb and basho/eleveldb, none has the C++ skills necessary to maintain this, or the test setup to prove non-functional changes. So we are making only the minimal changes we have to, and most of the main customers are migrating to leveled or bitcask to protect themselves from future issues.

There are other (i.e. non-Riak) users of leveldb/eleveldb out there, though, and it seems a shame that this should wither away for lack of skills in our community. Some of the other generally useful tools in the basho box have been freed from the confines of the basho repo (e.g. lager, webmachine). Perhaps if a volunteer were to come forward to maintain them, we should do the same with basho leveldb/eleveldb.

Sorry for the late and unsatisfactory answer. I will leave the issue open, as it is clearly an important problem, even if we have no plans to resolve it.
