Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simplify bloom filters #258

Merged
merged 171 commits into from
Jun 15, 2022
Merged

Conversation

Claudenw
Copy link
Contributor

@Claudenw Claudenw commented Oct 8, 2021

Addresses COLLECTIONS-801

While this looks like a massive change, most of the change is removing the hashing management complexity.

The stripped down version no longer tracks the hashing protocol and assumes that this is the responsibility of the projects that use this library.

Hashing classes have been reduced to 2 implementations and one interface. The hasher implementation expects a hash to be generated and passed to it to use. Other implementations that use hash functions directly are possible but not implemented here.

The complexity of the BloomFilter has been reduced. The internal representation of the filter is no longer exposed directly. Instead, BitMapProducer and IndexProducer interfaces are implemented to allow IntConsumers and LongConsumers to receive representations of the internal structures.

The CountingBloomFilter uses BitCountProducer for to retrieve the index and count values.

Bloom filters merge() now produces a new Bloom filter while mergeInPlace() modifies the current Bloom filter.

Shape has been simplified to only track the numberOfBits and numberOfHashFunction. A Shape.Factory has been added to create shapes from standard requests (e.g. numberOfItems and probabilityOfCollision)

@Claudenw
Copy link
Contributor Author

Claudenw commented Oct 9, 2021

@aherbert , @dota17 , @garydgregory , @kinow

You all reviewed an earlier Bloom filter submission. I think this change solves many of the issues noted in the earlier release wrt the complexity and scale of the submission. If you have time please take a look at what is here as a replacement for what is currently in the 4.5 snapshot

@garydgregory
Copy link
Member

garydgregory commented Oct 9, 2021

Just some general comments:

For now only make public or protected what must be. We can open up the API in the future. When we release an API, we lock in binary support.

Add Javadoc since tags to new types.

Close HTML tags in Javadoc. For example, match <p> with </p>

Hopefully someone can get into the nitty gritty.

Copy link
Member

@kinow kinow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Claudenw couldn't find time for a proper review, using IDE, debugging, reading links/papers/refs about filters. So instead did just a simple review with GitHub UI. Added a few minor comments, but hopefully that will help a little.

I did not check backward compatibility (not sure if the code changed has been released, sorry), nor if the coverage is good enough, as it is a WIP 👍

Great work on simplifying, and also on the comments, and updating the tests. Thanks!!! (it also needs a JIRA ticket, almost forgot)

@coveralls
Copy link

coveralls commented Nov 23, 2021

Coverage Status

Coverage decreased (-0.1%) to 90.003% when pulling f2bc698 on Claudenw:simplify_bloom_filters into 23f2d36 on apache:master.

@codecov-commenter
Copy link

codecov-commenter commented Feb 19, 2022

Codecov Report

Merging #258 (ef2b7d5) into master (06f7b6f) will increase coverage by 0.01%.
The diff coverage is 98.09%.

@@             Coverage Diff              @@
##             master     #258      +/-   ##
============================================
+ Coverage     86.06%   86.07%   +0.01%     
+ Complexity     4672     4666       -6     
============================================
  Files           292      286       -6     
  Lines         13467    13508      +41     
  Branches       1954     1984      +30     
============================================
+ Hits          11590    11627      +37     
- Misses         1320     1321       +1     
- Partials        557      560       +3     
Impacted Files Coverage Δ
...mmons/collections4/bloomfilter/BitMapProducer.java 91.66% <91.66%> (ø)
...ns/collections4/bloomfilter/SimpleBloomFilter.java 94.25% <94.25%> (ø)
...ons/collections4/bloomfilter/HasherCollection.java 96.07% <96.07%> (ø)
...ections4/bloomfilter/ArrayCountingBloomFilter.java 100.00% <100.00%> (+2.35%) ⬆️
...ons/collections4/bloomfilter/BitCountProducer.java 100.00% <100.00%> (ø)
...pache/commons/collections4/bloomfilter/BitMap.java 100.00% <100.00%> (ø)
.../commons/collections4/bloomfilter/BloomFilter.java 100.00% <100.00%> (ø)
...pache/commons/collections4/bloomfilter/Hasher.java 100.00% <100.00%> (ø)
...ommons/collections4/bloomfilter/IndexProducer.java 100.00% <100.00%> (ø)
...ommons/collections4/bloomfilter/SetOperations.java 100.00% <100.00%> (ø)
... and 6 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 06f7b6f...ef2b7d5. Read the comment docs.

Copy link
Contributor

@aherbert aherbert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @Claudenw

There are a few things to fix here:

  • The merge operations check that bad bits have not been set. If they have then an exception is raised but the bad bits are still present and so the filter is now corrupt. The filters should clear the bad bits and thus act as if it is impossible to put bad bits into the filter.
  • Add tests using or, and and xor with different length filters. The fix is present but there are no tests to show it works.
  • Update the SimpleHasher to use integer math only by subtracting the increment instead of adding.
  • Move some of the public utility classes out of the interfaces. They are public by default but not all methods are public so cannot be fully used outside the package, and lack documented behaviour on these methods. The IndexFilter implementations are public but are implementation details likely to change with further optimisation.
  • Remove some of the public API. The methods/constructors appear only to be used in unit tests. There are a lot of public constructors for BloomFilters to fill them with bits but none for the ArrayCountingBloomFilter (inconsistent).

Possibles:

  • Drop mergeInPlace. It is confusing with merge being present as well. A user can always copy a filter and do a merge to a new copy. Currently the merge to a copy has a return statement of the mergeInPlace that is being ignored. So you may return a copy that has not been correctly merged. This is not possible with the current implementations except the ArrayCountingBloomFilter which can return false if the state is invalid after merge.
  • Provide support for non-fast-fail forEach. This to be used for fast merge and bit operations on filters (e.g. all set operations). The fail-fast is only really needed for the contains methods when you can stop checking as soon as you know it does not contain the argument. Many operations never require the result of the passed in predicate in the forEach loop. This could wait for some performance tests to see if a forEach without the checking is much faster.
  • Support a merge without checking and then provide a method add with checking for a change in the filter. This will duplicate a lot of (simple) code. But it provides a nice method to check if a filter contains an object and do something if it did not, rather than calling contains before then doing a merge, or doing a merge and checking cardinality change afterwards.

For a bitmap bloom filter this can be done branchless by accumulating all the zero bits that are set to 1 using:

long[] flag = {0};
int[] idx = new int[1];
bitMapProducer.forEachBitMap(value -> {
    int i = idx[0]++;
    flag[0] |= (~bitMap[i] & value);
    bitMap[i] |= value;
    return true;
});
return flag[0] != 0;

For a filter with an explicit cardinality you just check the cardinality before and after the merge.

import org.apache.commons.collections4.bloomfilter.hasher.Hasher;
import org.apache.commons.collections4.bloomfilter.hasher.Shape;
import org.apache.commons.collections4.bloomfilter.hasher.StaticHasher;

import org.junit.jupiter.api.Test;

/**
* Test {@link SetOperations}.
*/
public class SetOperationsTest {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You added the fix to compute And/Or/Xor correctly with different length filters. This class has no tests comparing filters with a different shape (length). Thus you cannot avoid regressions if future implementations change the method back.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

filter1 = new SparseBloomFilter(shape, IndexProducer.fromIndexArray(new int[] { 5, 63 }));
filter2 = new SparseBloomFilter(shape, IndexProducer.fromIndexArray(new int[] { 5, 64, 69 }));
assertEquals(4, SetOperations.orCardinality(filter1, filter2));
assertEquals(4, SetOperations.orCardinality(filter2, filter1));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was the previous test using different length filters:

        Shape shape2 = Shape.fromKM(3, 192);
        filter1 = new SparseBloomFilter(shape2, IndexProducer.fromIntArray(new int[] { 1, 63, 185}));
        filter2 = new SparseBloomFilter(shape, IndexProducer.fromIntArray(new int[] { 5, 64, 69 }));
        assertEquals(6, SetOperations.orCardinality(filter1, filter2));
        assertEquals(6, SetOperations.orCardinality(filter2, filter1));

Others should be added for and and xor.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

*/
long[] getBits();
boolean isSparse();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As with IndexProducer this could be replaced with a characteristics flag:

int characteristics();

Currently the only characteristic is sparse. Are there any other characteristics to report that may be of use? This would allow them to be added without adding more methods to the interface.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

*/
int andCardinality(BloomFilter other);
default int estimateN() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be a double?

Since this is a pass through method that is converting a precise value from the shape into an imprecise one I would recommend removing it. A paragraph can be added to the class javadoc on how to estimate N and also the size of the union and intersection with another filter. Those methods also suffer from the same inaccuracy. If this functionality was moved to class javadoc then it is clear to the user how to compute it and what the computation requires.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

*/
int orCardinality(BloomFilter other);
default int estimateUnion(BloomFilter other) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Converting a double result to int.

Another issue is that the argument filter's shape is not checked. The merge should join the smaller filter into the bigger one and then the estimateN called on the larger filter. Otherwise this method is not symmetric. One way would be fine but the other would result in a merge error.

If this is only intended for filters with the same number of bits then the expected result of a mismatch is not documented.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

*/
int xorCardinality(BloomFilter other);
default int estimateIntersection(BloomFilter other) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Converting a double result to int.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// idx[0] will be limit+1 so decrement it
idx[0]--;
int idxLimit = BitMap.getLongIndex(shape.getNumberOfBits());
if (idxLimit < idx[0]) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no coverage of this branch. It is not possible. If this were true the bitMap[idx[0]] in the main forEach loop would have thrown an array index out of bound exception. I think this can be removed. The final check to determine if the final upper bitmap does not set incorrect bits is valid.

Note: Even though the check is made that the upper bits have been set correctly, if they have not the exception is raised and the filter now has incorrect bits for the rest of its lifetime, i.e. no recovery is made and no flag is present in the filter to indicate the bits are bad. So you could then merge this into another filter and get the same exception, or record the bits to file and have bad bits in the recording, etc.

The same 'bug' is present in the SparseBloomFilter. It will raise an exception after merge if the TreeSet first or last position are outside the shape but not correct it.

I think these filters should attempt to clear any invalid bits before throwing exceptions. This makes them act as if they are a plain int[] of size nbits containing 0 or 1s. It should not be possible to put an index outside the range of the shape in the filter.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@garydgregory
Copy link
Member

Hi All. Ping. Where are we on this PR?

@Claudenw
Copy link
Contributor Author

Claudenw commented Jun 7, 2022

I was planning on getting back to this this week, however I have taken ill. I think that once the merge conflicts are resolved I think that it should be merged as is. it is not perfect and there are several proposed changes, however, it is MUCH better than what is in 4.5-SNAPSHOT now. Let's merge and open the proposed changes as tickets.

@aherbert
Copy link
Contributor

aherbert commented Jun 7, 2022

A lot of progress was made in cleaning up the package. The work stalled but did not reach a consensus. There are several feedback comments I left that are as yet unresolved. I think these should be addressed before a merge.

The final extent of the public API can be addressed after merge. Currently there are some helper classes that could be reduced to a package-level and optionally exposed via factory helper methods and interfaces to allow backing implementations to be changed.

@Claudenw
Copy link
Contributor Author

Claudenw commented Jun 7, 2022

While I agree that we did not reach consensus, I think we can all agree that the current pull request is much better than the current package in 4.5-SNAPSHOT. The size of this change is very large, I would like to see the change sets reduced as we resolve the remaining issues.

@garydgregory
Copy link
Member

Note: The build is failing.

@garydgregory
Copy link
Member

I am OK to merge this PR and continue with a new PR.

@Claudenw
"I would like to see the change sets reduced as we resolve the remaining issues."
Does that mean you want to update this PR or continue with a new one?

@aherbert
WDYT?

@Claudenw
Copy link
Contributor Author

Claudenw commented Jun 9, 2022

Yes I would like to merge this request and pick up the any issues as new issues to work through as smaller changes.

@garydgregory
Copy link
Member

Yes I would like to merge this request and pick up the any issues as new issues to work through as smaller changes.

That sounds good to me.

@aherbert
Copy link
Contributor

aherbert commented Jun 9, 2022

The drawback of merging now is losing all the comments that are as yet unresolved on this PR.

@garydgregory
Copy link
Member

The drawback of merging now is losing all the comments that are as yet unresolved on this PR.

Hi @aherbert
I see that JIRA issues are now open to track tasks:
https://issues.apache.org/jira/browse/COLLECTIONS-816 through https://issues.apache.org/jira/browse/COLLECTIONS-819
Is that good enough that we can merge?

@aherbert
Copy link
Contributor

Some of the issues are simple to fix here. But if all comments have been captured then a merge will at least bring in the large improvements in this update and finer details can be worked on after.

Also note that when reviewing I stopped commenting on trivial javadoc issues to reduce noise so there will be some work to clean up this aspect too.

@Claudenw
Copy link
Contributor Author

I think the open tickets capture the issues/discussions remaining and that the branch can now be merged.

@garydgregory garydgregory merged commit 87647d0 into apache:master Jun 15, 2022
@garydgregory
Copy link
Member

Merged! TY @Claudenw

@Claudenw
Copy link
Contributor Author

Claudenw commented Oct 11, 2022 via email

@Claudenw Claudenw deleted the simplify_bloom_filters branch November 6, 2022 09:27
anantdamle pushed a commit to anantdamle/commons-collections that referenced this pull request Aug 14, 2023
* Fixed some unit tests

* First set with complete test cases.

* Cleaned up hasher collecton processing

* cleaned up code

* added license headers

* Refactored and cleaned up

Moved to dependency on BitMapProducer, IndexProducer and
BitCountProducer to retrieve internal representations of the data.

* Added license header.

* Updated documentation

* Fixed bug and added tests

* Added "@SInCE 4.5" where necessary

* Added BitMapProducer constructor to SimpleBloomFilter

* added BitMapProducer.fromLongArray() and Hasher.isEmpty()

* Changes to speed up Simple filter processing

* Null hasher used when a hasher is required but no values are available.

* Added Hasher.Filter and Hasher.FilteredIntConsumer

* Updated documentation + formatted.

* Added license

* fixed checkstyle issues

* fixed javadoc issues

* fixed test issue

* fixed javadoc issues

* Reduced the acceptable delta for p tests

* Updated docs and test cases

* Updated docs and test cases

* fixed issue with Shape javadoc

* Added more test coverage.

* fixed formatting issues

* Updated tests to use assertThrows.

* fixed indents

* Added constructor with IndexProducer

* Fixed issue with compare and different length bitMap arrays

* fixed formatting issues

* Efficiency changes

cleaned up asIndexArray

BitMapProducer to IndexProducer conversion

* changed XProviers to use XPredicates

* Removed NoMatchException

* Removed unneeded BitMap funcs

Moves isSparse() to Shape.

* fixed checkstyle issues

* Fixed javadoc errors

* simplified parameter in BitMapProducer.fromIndexProducer

* fixed tests

* added BitMapping verification

* Added more tests

* Added more tests

* Fixed typos

* Changes requested  by aherbert

* fixed "bit map" in documentation

* Renamed tests

* Removed blank lines

* changed new X<foo> to new X<>

* updated documentation

* Added BloomFilter.copy()

* changed ArrayCountingBloomFilter to use copy() method

* cleaned up numberOfBitsMaps()

* added asBitMapArray() and makePredicate() to BitMapProducer

* Moved asIndexArray() to IndexProducer

* harmonized Simple and Sparse Bloom filter constructors

* Implemented AbstractCountingBloomFilter.asindexArray()

* updated documentation

* fixed up NullHasher

* implemented hasher filter

* Fixed style issues

* added default SimpleHasher increment.

* Added modulus calculation to SimpleHasher

* fixed Hashing issues

* moved hasher/filter/* to /hasher

* moved bloomfilter/hasher to bloomfilter

* fixed up checkstyle issues

* Made Filter -> IndexFilter -w- factory

* moved IndexFilter into Hasher

* updated hashing tests & fixed checksyle

* removed SingleItemhasherCollection as associated methods

* Fixed some unit tests

* First set with complete test cases.

* Cleaned up hasher collecton processing

* cleaned up code

* added license headers

* Refactored and cleaned up

Moved to dependency on BitMapProducer, IndexProducer and
BitCountProducer to retrieve internal representations of the data.

* Added license header.

* Updated documentation

* Fixed bug and added tests

* Added "@SInCE 4.5" where necessary

* Added BitMapProducer constructor to SimpleBloomFilter

* added BitMapProducer.fromLongArray() and Hasher.isEmpty()

* Changes to speed up Simple filter processing

* Null hasher used when a hasher is required but no values are available.

* Added Hasher.Filter and Hasher.FilteredIntConsumer

* Updated documentation + formatted.

* Added license

* fixed checkstyle issues

* fixed javadoc issues

* fixed test issue

* fixed javadoc issues

* Reduced the acceptable delta for p tests

* Updated docs and test cases

* Updated docs and test cases

* fixed issue with Shape javadoc

* Added more test coverage.

* fixed formatting issues

* Updated tests to use assertThrows.

* fixed indents

* Added constructor with IndexProducer

* Fixed issue with compare and different length bitMap arrays

* fixed formatting issues

* Efficiency changes

cleaned up asIndexArray

BitMapProducer to IndexProducer conversion

* changed XProviers to use XPredicates

* Removed NoMatchException

* Removed unneeded BitMap funcs

Moves isSparse() to Shape.

* fixed checkstyle issues

* Fixed javadoc errors

* simplified parameter in BitMapProducer.fromIndexProducer

* fixed tests

* added BitMapping verification

* Added more tests

* Added more tests

* Fixed typos

* Changes requested  by aherbert

* fixed "bit map" in documentation

* Renamed tests

* Removed blank lines

* changed new X<foo> to new X<>

* updated documentation

* Added BloomFilter.copy()

* changed ArrayCountingBloomFilter to use copy() method

* cleaned up numberOfBitsMaps()

* added asBitMapArray() and makePredicate() to BitMapProducer

* Moved asIndexArray() to IndexProducer

* harmonized Simple and Sparse Bloom filter constructors

* Implemented AbstractCountingBloomFilter.asindexArray()

* updated documentation

* fixed up NullHasher

* implemented hasher filter

* Fixed style issues

* added default SimpleHasher increment.

* Added modulus calculation to SimpleHasher

* fixed Hashing issues

* moved hasher/filter/* to /hasher

* moved bloomfilter/hasher to bloomfilter

* fixed up checkstyle issues

* Made Filter -> IndexFilter -w- factory

* moved IndexFilter into Hasher

* updated hashing tests & fixed checksyle

* removed SingleItemhasherCollection as associated methods

* fixed javadoc issues

* fixed javadoc issues

* added checks for BitMapProducer limits and index limits

* updated tests

* added tests

* fixed checkstyle issues

* fixed formatting and test coverage

* fixed javadoc issue

* put back checkstyle.xml

* switched to forEachBitMapPair

* updated BitMap and Index array production

* fixed merge with BitMapProducer

* Cleaned up formatting

* fixed checkstyle issues

* fixed coding issues

* updated documentation

* simplified test

* removed unwanted merge files

* removed duplicate entry

* put back test that incorrectly removed

* fixed asIndexArray error

* fixed checkstyle errors

* Changes for last review
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants