Simplify bloom filters #258
Conversation
Moved to dependency on BitMapProducer, IndexProducer and BitCountProducer to retrieve internal representations of the data.
@aherbert, @dota17, @garydgregory, @kinow: you all reviewed an earlier Bloom filter submission. I think this change solves many of the issues noted in the earlier review with respect to the complexity and scale of the submission. If you have time, please take a look at what is here as a replacement for what is currently in the 4.5 snapshot.
Just some general comments: For now, only make public or protected what must be; we can open up the API in the future. When we release an API, we lock in binary support. Add Javadoc @since tags to new types. Close HTML tags in Javadoc; for example, match <p> with </p>. Hopefully someone can get into the nitty-gritty.
@Claudenw I couldn't find time for a proper review (using an IDE, debugging, reading links/papers/refs about Bloom filters), so I did a simple review with the GitHub UI instead. I added a few minor comments, but hopefully that will help a little.
I did not check backward compatibility (not sure if the code changed has been released, sorry), nor whether the coverage is good enough, as it is a WIP 👍
Great work on simplifying, and also on the comments and updating the tests. Thanks!!! (It also needs a JIRA ticket, I almost forgot.)
Review comments were left on the following files (threads now resolved):
- src/main/java/org/apache/commons/collections4/bloomfilter/SparseBloomFilter.java
- src/test/java/org/apache/commons/collections4/bloomfilter/ShapeFactoryTest.java
- src/test/java/org/apache/commons/collections4/bloomfilter/ShapeTest.java
- src/test/java/org/apache/commons/collections4/bloomfilter/BitMapTest.java
- src/main/java/org/apache/commons/collections4/bloomfilter/BloomFilter.java
- src/main/java/org/apache/commons/collections4/bloomfilter/SimpleBloomFilter.java
- src/main/java/org/apache/commons/collections4/bloomfilter/CountingBloomFilter.java
- src/main/java/org/apache/commons/collections4/bloomfilter/Shape.java
Codecov Report
@@             Coverage Diff              @@
##              master     #258     +/-   ##
============================================
+ Coverage      86.06%   86.07%   +0.01%
+ Complexity      4672     4666       -6
============================================
  Files            292      286       -6
  Lines          13467    13508      +41
  Branches        1954     1984      +30
============================================
+ Hits           11590    11627      +37
- Misses          1320     1321       +1
- Partials         557      560       +3
Continue to review full report at Codecov.
Hi @Claudenw
There are a few things to fix here:
- The merge operations check that bad bits have not been set. If they have, an exception is raised, but the bad bits are still present and so the filter is now corrupt. The filters should clear the bad bits, and thus act as if it were impossible to put bad bits into the filter.
- Add tests using `or`, `and` and `xor` with different length filters. The fix is present but there are no tests to show it works.
- Update the SimpleHasher to use integer math only, by subtracting the increment instead of adding.
- Move some of the public utility classes out of the interfaces. They are public by default but not all methods are public so cannot be fully used outside the package, and lack documented behaviour on these methods. The IndexFilter implementations are public but are implementation details likely to change with further optimisation.
- Remove some of the public API. The methods/constructors appear only to be used in unit tests. There are a lot of public constructors for BloomFilters to fill them with bits but none for the ArrayCountingBloomFilter (inconsistent).
Possibles:
- Drop mergeInPlace. It is confusing with merge being present as well; a user can always copy a filter and merge into the new copy. Currently merge creates a copy and calls mergeInPlace, ignoring its boolean return, so it may return a copy that has not been correctly merged. This is not possible with the current implementations except the ArrayCountingBloomFilter, which can return false if the state is invalid after merge.
- Provide support for a non-fail-fast forEach, to be used for fast merge and bit operations on filters (e.g. all set operations). The fail-fast is only really needed for the `contains` methods, where you can stop checking as soon as you know the filter does not contain the argument. Many operations never require the result of the passed-in predicate in the forEach loop. This could wait for some performance tests to see whether a forEach without the checking is much faster.
- Support a `merge` without checking, and then provide a method `add` that checks for a change in the filter. This will duplicate a lot of (simple) code, but it provides a nice method to check whether a filter contains an object and do something if it did not, rather than calling contains before doing a merge, or doing a merge and checking the cardinality change afterwards.
For a bitmap bloom filter this can be done branchless by accumulating all the zero bits that are set to 1 using:
long[] flag = {0};
int[] idx = new int[1];
bitMapProducer.forEachBitMap(value -> {
    int i = idx[0]++;
    flag[0] |= (~bitMap[i] & value); // accumulate bits newly set by this merge
    bitMap[i] |= value;
    return true;
});
return flag[0] != 0; // true if the merge changed the filter
For a filter with an explicit cardinality you just check the cardinality before and after the merge.
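The cardinality-based check described above can be sketched with a plain java.util.BitSet standing in for a cardinality-tracking filter (BitSet is only an illustration here, not the PR's API):

```java
import java.util.BitSet;

public class MergeChange {
    /**
     * Merges 'other' into 'filter' in place and reports whether any new bit
     * was set, by comparing cardinality before and after the merge.
     */
    static boolean mergeAndCheck(BitSet filter, BitSet other) {
        int before = filter.cardinality();
        filter.or(other); // in-place merge of the two bit sets
        return filter.cardinality() != before;
    }

    public static void main(String[] args) {
        BitSet filter = new BitSet();
        filter.set(5);
        filter.set(63);
        BitSet other = new BitSet();
        other.set(5); // already present: merge changes nothing
        System.out.println(mergeAndCheck(filter, other)); // false
        other.set(9); // a new bit
        System.out.println(mergeAndCheck(filter, other)); // true
    }
}
```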
Further review comments were left on (threads now resolved):
- src/main/java/org/apache/commons/collections4/bloomfilter/ArrayCountingBloomFilter.java
- src/test/java/org/apache/commons/collections4/bloomfilter/ShapeTest.java
- src/test/java/org/apache/commons/collections4/bloomfilter/SetOperationsTest.java
import org.apache.commons.collections4.bloomfilter.hasher.Hasher;
import org.apache.commons.collections4.bloomfilter.hasher.Shape;
import org.apache.commons.collections4.bloomfilter.hasher.StaticHasher;

import org.junit.jupiter.api.Test;

/**
 * Test {@link SetOperations}.
 */
public class SetOperationsTest {
You added the fix to compute And/Or/Xor correctly with different length filters. This class has no tests comparing filters with a different shape (length). Thus you cannot avoid regressions if future implementations change the method back.
filter1 = new SparseBloomFilter(shape, IndexProducer.fromIndexArray(new int[] { 5, 63 }));
filter2 = new SparseBloomFilter(shape, IndexProducer.fromIndexArray(new int[] { 5, 64, 69 }));
assertEquals(4, SetOperations.orCardinality(filter1, filter2));
assertEquals(4, SetOperations.orCardinality(filter2, filter1));
This was the previous test using different length filters:
Shape shape2 = Shape.fromKM(3, 192);
filter1 = new SparseBloomFilter(shape2, IndexProducer.fromIntArray(new int[] { 1, 63, 185}));
filter2 = new SparseBloomFilter(shape, IndexProducer.fromIntArray(new int[] { 5, 64, 69 }));
assertEquals(6, SetOperations.orCardinality(filter1, filter2));
assertEquals(6, SetOperations.orCardinality(filter2, filter1));
Similar tests should be added for `and` and `xor`.
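The symmetric behaviour these tests target can be checked against a plain long[] model of the bitmaps, where the missing words of the shorter filter are treated as zero. This is a sketch: orCardinality here is a stand-in for SetOperations.orCardinality, using the index values from the quoted test.

```java
public class OrCardinalityDemo {
    /** Population count of the union of two bitmaps that may differ in length. */
    static int orCardinality(long[] a, long[] b) {
        long[] longer = a.length >= b.length ? a : b;
        long[] shorter = longer == a ? b : a;
        int count = 0;
        for (int i = 0; i < longer.length; i++) {
            // Words absent from the shorter array contribute nothing to the union.
            long word = longer[i] | (i < shorter.length ? shorter[i] : 0L);
            count += Long.bitCount(word);
        }
        return count;
    }

    public static void main(String[] args) {
        // Indices {1, 63, 185} in a 192-bit shape (3 words of 64 bits).
        long[] filter1 = {(1L << 1) | (1L << 63), 0L, 1L << (185 - 128)};
        // Indices {5, 64, 69} in a 72-bit shape (2 words).
        long[] filter2 = {1L << 5, (1L << (64 - 64)) | (1L << (69 - 64))};
        System.out.println(orCardinality(filter1, filter2)); // 6
        System.out.println(orCardinality(filter2, filter1)); // 6: symmetric
    }
}
```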
Created https://issues.apache.org/jira/browse/COLLECTIONS-827 to track this.
 */
long[] getBits();
boolean isSparse();
As with IndexProducer this could be replaced with a characteristics flag:
int characteristics();
Currently the only characteristic is sparse. Are there any other characteristics to report that may be of use? This would allow them to be added without adding more methods to the interface.
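The flag-based approach, modelled on java.util.Spliterator.characteristics(), might look like this; the names and flag values are illustrative only, not the PR's API:

```java
public class CharacteristicsDemo {
    /** Hypothetical flag: indicates the producer's data is sparsely populated. */
    public static final int SPARSE = 0x1;

    /** Sketch of a producer interface reporting characteristics as bit flags. */
    public interface Producer {
        int characteristics();

        /** Convenience accessor built on the flags, replacing isSparse(). */
        default boolean isSparse() {
            return (characteristics() & SPARSE) != 0;
        }
    }

    public static void main(String[] args) {
        Producer sparse = () -> SPARSE;
        Producer dense = () -> 0;
        System.out.println(sparse.isSparse()); // true
        System.out.println(dense.isSparse());  // false
    }
}
```

New characteristics could then be added as further bit flags without growing the interface.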
Created https://issues.apache.org/jira/browse/COLLECTIONS-818 to cover this.
 */
int andCardinality(BloomFilter other);
default int estimateN() {
Should this be a double?
Since this is a pass through method that is converting a precise value from the shape into an imprecise one I would recommend removing it. A paragraph can be added to the class javadoc on how to estimate N and also the size of the union and intersection with another filter. Those methods also suffer from the same inaccuracy. If this functionality was moved to class javadoc then it is clear to the user how to compute it and what the computation requires.
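The standard cardinality-based estimate such javadoc would describe, for a filter of m bits and k hash functions with observed cardinality c, is n ≈ -(m/k) ln(1 - c/m). A sketch returning a double, avoiding the int truncation noted above (the method name is illustrative):

```java
public class EstimateN {
    /**
     * Estimates the number of items merged into a Bloom filter from its
     * observed cardinality: n ~= -(m / k) * ln(1 - c / m).
     */
    static double estimateN(int m, int k, int cardinality) {
        return -((double) m / k) * Math.log(1.0 - (double) cardinality / m);
    }

    public static void main(String[] args) {
        // With few bits set the estimate is only slightly above the cardinality.
        System.out.println(estimateN(1000, 1, 10)); // ~10.05
    }
}
```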
Opened https://issues.apache.org/jira/browse/COLLECTIONS-817 to cover.
 */
int orCardinality(BloomFilter other);
default int estimateUnion(BloomFilter other) {
Converting a double result to int.
Another issue is that the argument filter's shape is not checked. The merge should join the smaller filter into the bigger one and then the estimateN called on the larger filter. Otherwise this method is not symmetric. One way would be fine but the other would result in a merge error.
If this is only intended for filters with the same number of bits then the expected result of a mismatch is not documented.
 */
int xorCardinality(BloomFilter other);
default int estimateIntersection(BloomFilter other) {
Converting a double result to int.
// idx[0] will be limit+1 so decrement it
idx[0]--;
int idxLimit = BitMap.getLongIndex(shape.getNumberOfBits());
if (idxLimit < idx[0]) {
There is no coverage of this branch. It is not possible: if it were true, the bitMap[idx[0]] access in the main forEach loop would have thrown an ArrayIndexOutOfBoundsException. I think this can be removed. The final check that the last bitmap word does not set incorrect bits is valid.
Note: Even though the check is made that the upper bits have been set correctly, if they have not the exception is raised and the filter now has incorrect bits for the rest of its lifetime, i.e. no recovery is made and no flag is present in the filter to indicate the bits are bad. So you could then merge this into another filter and get the same exception, or record the bits to file and have bad bits in the recording, etc.
The same 'bug' is present in the SparseBloomFilter. It will raise an exception after merge if the TreeSet first or last position are outside the shape but not correct it.
I think these filters should attempt to clear any invalid bits before throwing exceptions. This makes them act as if they were a plain int[] of size nbits containing 0s and 1s. It should not be possible to put an index outside the range of the shape into the filter.
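Clearing the invalid upper bits amounts to masking the last word of the bitmap against the shape's bit count. A sketch, assuming the long[] is sized exactly for the shape (the method name is illustrative):

```java
public class ClearBits {
    /**
     * Zeroes any bits at positions >= numberOfBits in the final bitmap word,
     * restoring the invariant that only indices inside the shape are set.
     */
    static void clearInvalidBits(long[] bitMap, int numberOfBits) {
        int used = numberOfBits & 63; // valid bits in the last 64-bit word
        if (used != 0) {
            bitMap[bitMap.length - 1] &= (1L << used) - 1; // keep low 'used' bits
        }
        // If numberOfBits is a multiple of 64, every bit of the last word is valid.
    }

    public static void main(String[] args) {
        // 70-bit shape: indices 64..69 occupy bits 0..5 of the second word.
        long[] bitMap = {0L, (1L << 5) | (1L << 10)}; // bit 10 = index 74, invalid
        clearInvalidBits(bitMap, 70);
        System.out.println(bitMap[1] == (1L << 5)); // true: invalid bit cleared
    }
}
```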
Opened https://issues.apache.org/jira/browse/COLLECTIONS-816 to address this.
Hi All. Ping. Where are we on this PR?
I was planning on getting back to this this week; however, I have taken ill. I think that once the merge conflicts are resolved it should be merged as is. It is not perfect and there are several proposed changes; however, it is MUCH better than what is in 4.5-SNAPSHOT now. Let's merge and open the proposed changes as tickets.
A lot of progress was made in cleaning up the package. The work stalled without reaching a consensus. There are several feedback comments I left that are as yet unresolved; I think these should be addressed before a merge. The final extent of the public API can be addressed after merge. Currently there are some helper classes that could be reduced to package level and optionally exposed via factory helper methods and interfaces to allow backing implementations to be changed.
While I agree that we did not reach consensus, I think we can all agree that the current pull request is much better than the current package in 4.5-SNAPSHOT. The size of this change is very large; I would like to see the change sets reduced as we resolve the remaining issues.
Note: the build is failing.
Yes, I would like to merge this request and pick up any issues as new issues to work through as smaller changes.
That sounds good to me.
The drawback of merging now is losing all the comments that are as yet unresolved on this PR.
Hi @aherbert, I see that JIRA issues are now open to track tasks: https://issues.apache.org/jira/browse/COLLECTIONS-816 through https://issues.apache.org/jira/browse/COLLECTIONS-819. Is that good enough that we can merge?
Some of the issues are simple to fix here, but if all comments have been captured then a merge will at least bring in the large improvements in this update, and finer details can be worked on afterwards. Also note that when reviewing I stopped commenting on trivial javadoc issues to reduce noise, so there will be some work to clean up this aspect too.
I think the open tickets capture the issues/discussions remaining and that the branch can now be merged. |
Merged! TY @Claudenw |
I think I need to open a few more but that could be done after the merge
…On Mon, Jun 13, 2022, 19:49, Gary Gregory wrote:
> The drawback of merging now is losing all the comments that are as yet unresolved on this PR.
> Hi @aherbert, I see that JIRA issues are now open to track tasks: https://issues.apache.org/jira/browse/COLLECTIONS-816 through https://issues.apache.org/jira/browse/COLLECTIONS-819. Is that good enough that we can merge?
Squashed commits (duplicated entries removed):
* Fixed some unit tests
* First set with complete test cases
* Cleaned up hasher collection processing
* Cleaned up code
* Added license headers
* Refactored and cleaned up: moved to dependency on BitMapProducer, IndexProducer and BitCountProducer to retrieve internal representations of the data
* Updated documentation
* Fixed bug and added tests
* Added "@since 4.5" where necessary
* Added BitMapProducer constructor to SimpleBloomFilter
* Added BitMapProducer.fromLongArray() and Hasher.isEmpty()
* Changes to speed up Simple filter processing
* Null hasher used when a hasher is required but no values are available
* Added Hasher.Filter and Hasher.FilteredIntConsumer
* Updated documentation and formatting
* Fixed checkstyle issues
* Fixed javadoc issues
* Fixed test issue
* Reduced the acceptable delta for p tests
* Updated docs and test cases
* Fixed issue with Shape javadoc
* Added more test coverage
* Fixed formatting issues
* Updated tests to use assertThrows
* Fixed indents
* Added constructor with IndexProducer
* Fixed issue with compare and different length bitMap arrays
* Efficiency changes: cleaned up asIndexArray and the BitMapProducer to IndexProducer conversion
* Changed XProviders to use XPredicates
* Removed NoMatchException
* Removed unneeded BitMap funcs; moved isSparse() to Shape
* Fixed javadoc errors
* Simplified parameter in BitMapProducer.fromIndexProducer
* Fixed tests
* Added BitMapping verification
* Added more tests
* Fixed typos
* Changes requested by aherbert
* Fixed "bit map" in documentation
* Renamed tests
* Removed blank lines
* Changed new X<foo> to new X<>
* Added BloomFilter.copy()
* Changed ArrayCountingBloomFilter to use copy() method
* Cleaned up numberOfBitMaps()
* Added asBitMapArray() and makePredicate() to BitMapProducer
* Moved asIndexArray() to IndexProducer
* Harmonized Simple and Sparse Bloom filter constructors
* Implemented AbstractCountingBloomFilter.asIndexArray()
* Fixed up NullHasher
* Implemented hasher filter
* Fixed style issues
* Added default SimpleHasher increment
* Added modulus calculation to SimpleHasher
* Fixed hashing issues
* Moved hasher/filter/* to /hasher
* Moved bloomfilter/hasher to bloomfilter
* Made Filter -> IndexFilter with factory
* Moved IndexFilter into Hasher
* Updated hashing tests and fixed checkstyle
* Removed SingleItemHasherCollection and associated methods
* Added checks for BitMapProducer limits and index limits
* Updated tests
* Added tests
* Fixed formatting and test coverage
* Put back checkstyle.xml
* Switched to forEachBitMapPair
* Updated BitMap and Index array production
* Fixed merge with BitMapProducer
* Cleaned up formatting
* Fixed coding issues
* Simplified test
* Removed unwanted merge files
* Removed duplicate entry
* Put back test that was incorrectly removed
* Fixed asIndexArray error
* Changes for last review
Addresses COLLECTIONS-801
While this looks like a massive change, most of the change is removing the hashing management complexity.
The stripped down version no longer tracks the hashing protocol and assumes that this is the responsibility of the projects that use this library.
Hashing classes have been reduced to 2 implementations and one interface. The hasher implementation expects a hash to be generated and passed to it to use. Other implementations that use hash functions directly are possible but not implemented here.
The complexity of the BloomFilter has been reduced. The internal representation of the filter is no longer exposed directly. Instead, BitMapProducer and IndexProducer interfaces are implemented to allow IntConsumers and LongConsumers to receive representations of the internal structures.
The CountingBloomFilter uses BitCountProducer to retrieve the index and count values.
BloomFilter.merge() now produces a new Bloom filter, while mergeInPlace() modifies the current Bloom filter.
Shape has been simplified to only track the numberOfBits and numberOfHashFunctions. A Shape.Factory has been added to create shapes from standard requests (e.g. numberOfItems and probabilityOfCollision).
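The standard sizing formulas such a factory applies are m = ceil(-n ln p / (ln 2)^2) bits and k = round((m/n) ln 2) hash functions. A sketch of the arithmetic; the method names here are illustrative, not the factory's actual API:

```java
public class ShapeMath {
    /** Optimal number of bits for n items at false-positive probability p. */
    static int numberOfBits(int n, double p) {
        double ln2sq = Math.log(2) * Math.log(2);
        return (int) Math.ceil(-n * Math.log(p) / ln2sq);
    }

    /** Optimal number of hash functions for n items in m bits. */
    static int numberOfHashFunctions(int n, int m) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    public static void main(String[] args) {
        // 1000 items at a 1% false-positive rate: about 9.6 bits per item.
        int m = numberOfBits(1000, 0.01);
        System.out.println(m);                              // 9586
        System.out.println(numberOfHashFunctions(1000, m)); // 7
    }
}
```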