Skip to content

Conversation

@Claudenw
Copy link
Contributor

fix for COLLECTIONS-384

Document bit count producer operation in the BitCountProducer.java file.

Add unit tests for all BitCountProducers.

@codecov-commenter
Copy link

codecov-commenter commented Sep 10, 2022

Codecov Report

Merging #335 (68bfbb8) into master (dca05e5) will decrease coverage by 4.71%.
The diff coverage is 100.00%.

@@             Coverage Diff              @@
##             master     #335      +/-   ##
============================================
- Coverage     85.98%   81.27%   -4.72%     
+ Complexity     4671     4601      -70     
============================================
  Files           289      289              
  Lines         13447    13445       -2     
  Branches       1977     1973       -4     
============================================
- Hits          11563    10927     -636     
- Misses         1323     1920     +597     
- Partials        561      598      +37     
Impacted Files Coverage Δ
...pache/commons/collections4/bloomfilter/Hasher.java 100.00% <ø> (ø)
.../commons/collections4/bloomfilter/IndexFilter.java 100.00% <ø> (ø)
...ons/collections4/bloomfilter/BitCountProducer.java 100.00% <100.00%> (ø)
...ons/collections4/bloomfilter/HasherCollection.java 100.00% <100.00%> (+3.92%) ⬆️
...ommons/collections4/bloomfilter/IndexProducer.java 100.00% <100.00%> (ø)
...commons/collections4/map/UnmodifiableEntrySet.java 39.47% <0.00%> (-52.64%) ⬇️
.../commons/collections4/map/AbstractIterableMap.java 50.00% <0.00%> (-50.00%) ⬇️
.../commons/collections4/set/PredicatedSortedSet.java 53.84% <0.00%> (-46.16%) ⬇️
...org/apache/commons/collections4/map/LinkedMap.java 43.47% <0.00%> (-45.66%) ⬇️
...ns/collections4/multimap/AbstractSetValuedMap.java 55.00% <0.00%> (-45.00%) ⬇️
... and 40 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@Claudenw
Copy link
Contributor Author

@aherbert Described the BitCountProducer contract in the BitCountProducer javadoc, and added tests to show that they work as expected. I hope I have not made too much of a hash of it.

@aherbert aherbert changed the title Collections 384 bit count producer operation is not clearly defined Collections-384: bit count producer operation is not clearly defined Sep 16, 2022
Copy link
Contributor

@aherbert aherbert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @Claudenw,

Thanks for all the new test data. I had failed to spot previously that the expected indices from an IndexProducer are not actually checked. I think these changes address the issue with the disabled test in BitCountProducerFromIndexProducerTest and strengthen the test suite.

You test the expected indices using an array that mandates the order. However we already test the order behaviour separately. I suggested a change to only use the expected indices to detect correct/incorrect actual indices. The remaining behaviour tests will detect ordering and duplication problems.

I provided an implementation for AbstractBitCountProducerTest that matches the AbstractIndexProducerTest. This checks consistency between the indices output from the two interface forEach methods. Currently it includes new behaviour flags for forEachCount. However the test suite passes on the assumption the forEachIndex and forEachCount have the same behaviour. So maybe we can drop them. They could be added in the future if an implementation is created with different behaviour for the two methods.

I would change some of the tests from using IncrementingHasher(0, 1) to something else. This avoids a sequence from 0 which is simplistic test data. The data is easy to create with a stream:

    // For: IncrementingHasher(3, 2) and Shape.fromKM(17, 72)
    IntStream.iterate(3, i -> i + 2).limit(17).toArray();

For multiple hashers you can collate them:

    // For: IncrementingHasher(3, 2), IncrementingHasher(0, 1), and Shape.fromKM(17, 72)
    IntStream.concat(
        IntStream.iterate(3, i -> i + 2).limit(17),
        IntStream.iterate(0, i -> i + 1).limit(17)
    ).distinct().toArray();

@Claudenw Claudenw force-pushed the Collections-384_BitCountProducer_operation_is_not_clearly_defined branch from 348b6a9 to c52ebfb Compare October 17, 2022 15:44
@Claudenw Claudenw requested a review from aherbert October 17, 2022 16:17
Copy link
Contributor

@aherbert aherbert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggested some changes to the AbstractBitCountProducerTest to assert on the behaviour flags separately from the bit count totals. Can you add these? Please see this note.


@Test
public final void testAsIndexArrayValues() {
List<Integer> lst = new ArrayList<>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is asserting the distinctness of indices using lst.contains (which is inefficient). If you only wish to test distinctness then use a BitSet not a List. Use of a list (over a set) would be to test the ordering.

Given we have a test for distinctness and ordering then I think this test should be:

@Test
public final void testAsIndexArrayValues() {
    BitSet bs = new BitSet();
    Arrays.stream(createProducer().asIndexArray()).forEach(bs::set);
    for (int i : getExpectedIndices()) {
        assertTrue(bs.get(i), () -> "Missing " + i);
    }
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have not removed the redundant part of this test. The distinctness is tested in testBehaviourAsIndexArray.

@garydgregory garydgregory changed the title Collections-384: bit count producer operation is not clearly defined [Collections-384][bloom filters] bit count producer operation is not clearly defined Oct 18, 2022
@garydgregory garydgregory mentioned this pull request Oct 18, 2022
@Claudenw Claudenw requested a review from aherbert October 18, 2022 21:45
@Claudenw Claudenw changed the title [Collections-384][bloom filters] bit count producer operation is not clearly defined [Collections-834][bloom filters] bit count producer operation is not clearly defined Oct 19, 2022
Copy link
Contributor

@aherbert aherbert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @Claudenw

I've not looked at everything here. I found an issue where the update to the documentation of BitCountProducer is not correct (given the current implementations). See my test that is checking the count of indices output from the IndexProducer. It fails for ArrayCountingBloomFilter.

I don't think we want to change the implementations. We should change the documentation to be less strict. You can then remove some of the updates to the tests as we do not need int[] getExpectedForEach. It is only overridden in BitCountProducerFromHasherTest anyway.

Since you have not changed any of src/main besides javadoc I can only assume this was to pass the test that I think is not required if we are less strict with the interface definition.

@Test
public final void testBehaviourForEachCount() {
int flags = getBehaviour();
IntList list = new IntList();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could do with:
assumeTrue((flags & (FOR_EACH_COUNT_ORDERED | FOR_EACH_COUNT_DISTINCT)) != 0);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As before, assumeTrue should be in the line after you get the flags.

@Claudenw Claudenw requested a review from aherbert October 22, 2022 08:26
Copy link
Contributor

@aherbert aherbert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @Claudenw

This is looking good. I suggested a bit more text for the BitCountProducer interface. There is also a simplification we can make in the AbstractBitCountProducerTest to lower maintenance by hiding the new flag constants (for now).

@Test
public final void testBehaviourForEachCount() {
int flags = getBehaviour();
IntList list = new IntList();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As before, assumeTrue should be in the line after you get the flags.

/**
* Creates a BitCountProducer from an IndexProducer. The resulting
* producer will count each enabled bit once.
* producer will return every index from the IndexProducer with a count of 1.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was considering why do we need to do this: convert an IndexProducer to a BitCountProducer. It would be to add it to a CountingBloomFilter. In this case the conversion does not remove duplicate indices. This means that the BitCountProducer can increment the count of an index more than 1 when adding a single IndexProducer.

This has the following consequences:

  • The count at each index of a CountingBloomFilter does not represent the number of individual hashes added to the filter that had that index enabled. The count is bounded by the number of hashes as a lower limit, i.e. it may be higher.
  • Removal of an IndexProducer from a counting bloom filter must use the same method to construct the BitCountProducer as the addition in order to use the exact same index->count mapping.

I am fine with this behaviour. Filtering an IndexProducer to unique indices is extra work. The IndexFilter class helps with this but requires a Shape. So it cannot be done in the BitCountProducer interface which has no Shape; other mechanisms to filter the IndexProducer without a bound on the index would be inefficient.

So perhaps we require a bit of extra documentation here to serve as a user-beware warning:

 * <p>Note that the BitCountProducer does not remove duplicates. Any use of the
 * BitCountProducer to create an aggregate mapping of index to counts, such as a
 * CountingBloomFilter, should use the same BitCountProducer in both add and
 * subtract operations to maintain consistency.

I believe this is consistent with the new wording in the javadoc header for the BitCountProducer interface.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added

public abstract class AbstractBitCountProducerTest extends AbstractIndexProducerTest {

/** Flag to indicate the {@link BitCountProducer#forEachCount(BitCountConsumer)} is ordered. */
protected static final int FOR_EACH_COUNT_ORDERED = 0x10;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the interest of simplicity, can these be dropped?

I previously stated that we should only have them if the behaviour of a BitCountProducer will be different from an IndexProducer. From what I can see in the tests these are used in 3 tests, and only when FOR_EACH_ORDERED and FOR_EACH_DISTINCT from the AbstractIndexProducerTest are also used. Dropping these constants will reduce maintenance and ensure the flags are correctly set for BitCount and Index producers.

You can add a note in the test to state that the same flags are reused.

I think the simplest way to do this is to update the flags to be private:

/** Flag to indicate the {@link BitCountProducer#forEachCount(BitCountConsumer)} is ordered.
 * This flag currently reuses the value from IndexProducer. At present no implementations
 * exhibit different behaviour for the two interfaces. This may change in the future and
 * an explicit flag value will be required. */
private static final int FOR_EACH_COUNT_ORDERED = FOR_EACH_ORDERED;
/** Flag to indicate the {@link BitCountProducer#forEachCount(BitCountConsumer)} is distinct.
 * This flag currently reuses the value from IndexProducer. At present no implementations
 * exhibit different behaviour for the two interfaces. This may change in the future and
 * an explicit flag value will be required. */
private static final int FOR_EACH_COUNT_DISTINCT = FOR_EACH_DISTINCT;

An alternative would be to have a flag for unspecified behaviour. This must be explicitly set by implementing tests:

FOR_EACH_COUNT_UNSPECIFIED = 0x10;
FOR_EACH_COUNT_ORDERED = 0x20;
FOR_EACH_COUNT_DISTINCT = 0x40;

The abstract test can then check if the behaviour is unspecified. If true then the other flags must be unset. If false then at least one of the other flags must be set. I think this is more work and not required, especially since we are testing the behaviour that is not a part of the contract given in the javadoc of the main implementations. This is more of a behaviour sense check of the implementations.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I like this, but I can't think of a good counter argument. I could do the hand waving and say IndexProducer and BitCountProducer produce different orders but I can't see a reasonable example in any possible implementation I can come up with. So, changes made.

assertTrue(createEmptyProducer().forEachCount(FALSE_CONSUMER), "empty should be true");
assertTrue(createEmptyProducer().forEachCount(TRUE_CONSUMER), "empty should be true");
@Test
public final void testBehaviourForEachCount() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a javadoc note:

/**
 * Test the behaviour of {@link BitCountProducer#forEachCount(BitCountConsumer)} with respect
 * to ordered and distinct indices. Currently the behaviour is assumed to be the same as
 * {@link IndexProducer#forEachIndex(java.util.function.IntPredicate)}.
 */

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added

@Claudenw
Copy link
Contributor Author

I discovered issues with the HasherCollection and Hasher generated IndexProducer and BitCountProducers created from those IndexProducers. Additional tests and corrections included in this latest push.

@Claudenw Claudenw requested a review from aherbert October 24, 2022 09:49
@aherbert
Copy link
Contributor

Last commit was on Oct 21st. Is this correct or do you need to push again?

@Claudenw
Copy link
Contributor Author

Claudenw commented Oct 24, 2022 via email

…er_operation_is_not_clearly_defined' into Collections-384_BitCountProducer_operation_is_not_clearly_defined
@Claudenw
Copy link
Contributor Author

Last commit was on Oct 21st. Is this correct or do you need to push again?

Found the issue on my system. Should be updated now.

}

@Override
public int[] asIndexArray() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I presume this change is required otherwise the test suite fails.

Copy link
Contributor

@aherbert aherbert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good. However there is a bug in the new HasherCollection.absoluteUniqueIndices method due to how the IndexFilter is created. I've suggested two quick fixes to get around this.

public IndexProducer absoluteUniqueIndices(final Shape shape) {
return consumer -> {
Objects.requireNonNull(consumer, "consumer");
return uniqueIndices(shape).forEachIndex(IndexFilter.create(shape, consumer));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC this can exceed the capacity of the IndexFilter. The uniqueIndices(shape) will output up to k indices for each hasher in the collection (eliminating duplicates). Thus if we have two hashers with shape(k=5, n=1000) we could output 10 indices from the combined unique indices. Then we create an IndexFilter for shape which could be optimised only to check up to k=5 indices.

This throws an index out of bounds exception from IndexFilter.ArrayTracker.

@Test
public void testAbsoluteUniqueIndices() {
    int[] actual = new HasherCollection(
        new IncrementingHasher(1, 1),
        new IncrementingHasher(10, 1)
    ).absoluteUniqueIndices(Shape.fromKM(5, 1000)).asIndexArray();
    int[] expected = IntStream.concat(
            IntStream.range(1, 1 + 5),
            IntStream.range(10, 10 + 5)
        ).toArray();
    Assertions.assertArrayEquals(expected, actual);
}

When I update HasherCollection absoluteUniqueIndices to use a fake shape that forces the IndexTracker.BitMapTracker implementation then the above tests passes:

public IndexProducer absoluteUniqueIndices(final Shape shape) {
    return consumer -> {
        Objects.requireNonNull(consumer, "consumer");
        return uniqueIndices(shape).forEachIndex(IndexFilter.create(
                Shape.fromKM(shape.getNumberOfBits(), shape.getNumberOfBits()), consumer));
    };
}

I think that this can be easily solved by having a new package-private factory constructor in IndexFilter to choose the implementation that can handle exceeding the number of hash functions in the shape:

static IntPredicate create(int numberOfBits, IntPredicate consumer) {
    // Choose the BitMapTracker implementation to handle any number of hash functions

Or you could create a new Shape as:

public IndexProducer absoluteUniqueIndices(final Shape shape) {
    return consumer -> {
        Objects.requireNonNull(consumer, "consumer");
        return uniqueIndices(shape).forEachIndex(IndexFilter.create(
                Shape.fromKM(shape.getNumberOfHashFunctions() * hashers.size(),
                             shape.getNumberOfBits()), consumer));
    };
}

Perhaps the later is simpler.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

implemented later version with a check for hashers.size() == 0

@Claudenw
Copy link
Contributor Author

In all cases where AS_ARRAY_ORDERED is set FOR_EACH_ORDERED is also set, and the FOR_EACH_COUNT_ORDERED is therefore set as well.

In all cases where AS_ARRAY_DISTINCT is set FOR_EACH_DISTINCT is also set, and the FOR_EACH_COUNT_DISTINCT is therefore set as well.

I propose collapsing this into 2 statics:
INDICES_ORDERED : replaces AS_ARRAY_ORDERED, FOR_EACH_ORDERED, and FOR_EACH_COUNT_ORDERED.
INDICES_DISTINCT: replaces AS_ARRAY_DISTINCT, FOR_EACH_DISTINCT and FOR_EACH_COUNT_DISTINCT.

By "replaces" I mean remove all instances of the replaced constants (not make them equivalent).
Add documentation that states the testing assumes the ORDERED and DISTINCT properties carry across asArray(), forEachIndex() and forEachCount().

Also, since the AbstractBitCountProducerTest extends AbstractIndexProducerTest we can remove all IndexProducerFrom_X_Test implementations where there is a BitCountProducerFrom_X_Test

@aherbert unless you have an objection, I'll proceed with these changes.

@Claudenw Claudenw requested a review from aherbert October 29, 2022 16:03
@aherbert
Copy link
Contributor

Hi @Claudenw, consolidating all the behaviour flags is a good idea. It will avoid future tests failing to set them correctly. I'll wait until you have done that for a review.

Note: I have reservations about the HasherCollection and the requirement for an absoluteUniqueIndices method. In summary:

  1. I think it should be resolved after this PR
  2. Since the HasherCollection violates the IndexProducer contract by outputting more than k indices, we may need to revise the contract and javadoc as being able to output up to a multiple of k indices for a Shape. This would be helped by a size-type method in the IndexProducer interface to state what count of k will be output.
  3. The method absoluteUniqueIndices may be better served by a HasherSet, or some other naming variant, that ensures all indices output by the IndexProducer are unique.

@Claudenw
Copy link
Contributor Author

Claudenw commented Nov 2, 2022

@aherbert The changes are in. Have a look please

Copy link
Contributor

@aherbert aherbert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @Claudenw

I think the consolidation of the ordered and distinct flags has reduced the flexibility of the test suite. It also led to you duplicating the flags in all test classes that override getBehaviour().

However the previous flexibility can be reinstated with my suggestion to have a method to get the behaviour for each of: asIndexArray; forEachIndex; forEachCount. By default this would return the same behaviour for all of them. But it can be configured to change and allow testing different behaviour for the methods such as that done in the default implementation of IndexProducer.


@Override
protected int getBehaviour() {
// The default method streams a BitSet so is distinct and ordered.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: This test has removed some testing of the default methods in IndexProducer.

It should be named IndexProducerFromIndexArrayTest. It is specifically targeting the default implementation of the IndexProducer when created from an index array. In that case the asIndexArray() method returns a clone of the original indices used to create it.

A DefaultIndexProducerTest would test the default implementation of asIndexArray() and look like this.

public class DefaultIndexProducerTest extends AbstractIndexProducerTest {

    /** Make forEachIndex unordered and contain duplicates. */
    private int[] values = {10, 1, 10, 1};

    @Override
    protected int[] getExpectedIndices() {
        return values;
    }

    @Override
    protected IndexProducer createProducer() {
        return new IndexProducer() {
            @Override
            public boolean forEachIndex(IntPredicate predicate) {
                Objects.requireNonNull(predicate);
                for (int i : values) {
                    if (!predicate.test(i)) {
                        return false;
                    }
                }
                return true;
            }
        };
    }

    @Override
    protected IndexProducer createEmptyProducer() {
        return new IndexProducer() {
            @Override
            public boolean forEachIndex(IntPredicate predicate) {
                Objects.requireNonNull(predicate);
                return true;
            }
        };
    }

    @Override
    protected int getBehaviour() {
        // The default method streams a BitSet so is distinct and ordered.
        // However the forEachIndex may not be distinct and ordered and
        // the test cannot distinguish the two cases.
        return 0;
    }
}

Note that this highlights a problem with consolidating all the behaviour flags. Currently the test suite cannot be used to test the asIndexArray behaviour is different from the forEachIndex behaviour.

I think we should revert the consolidation of the flags. However to avoid the duplication of flag setting through the various test cases we can change the AbstractIndexProducerTest test to do this:

public abstract class AbstractIndexProducerTest {

    protected static final int ORDERED = 0x1;
    protected static final int DISTINCT = 0x2;

    protected abstract int getAsIndexArrayBehaviour();

    protected int getForEachIndexBehaviour() {
        return getAsIndexArrayBehaviour();
    }

}

And AbtractBitCountProducerTest:

public abstract class AbstractBitCountProducerTest extends AbstractIndexProducerTest {

    protected int getForEachCountBehaviour() {
        return getAsIndexArrayBehaviour();
    }

}

Then get the correct flag in the relevant test. It provides the ability to override each test behaviour individually but will default to the same behaviour if not explicitly changed.

Doing this change (DefaultIndexProducerTest and IndexProducerFromIndexArrayTest) should remove the requirement for the additional test testFromIndexArray. The method testFromBitMapProducer is already covered in IndexProducerFromBitMapProducerTest.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Corrected. I also added DefaultBitCountProducerTest and removed duplicate test UniqueIndexProducerFromHashCollectionTest which was effectively a duplicate of BitCountProducerFromUniqueHasherCollectionTest

Added DefaultBitCountProducerTest
removed duplicate test UniqueIndexProducerFromHashCollectionTest was
effectivly a duplicate of BitCountProducerFromUniqueHasherCollectionTest
Copy link
Contributor

@aherbert aherbert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @Claudenw

Thanks for the changes. They look good. I requested a few comment changes to indicate what the tests are intentionally trying to test.

@Claudenw Claudenw requested a review from aherbert November 5, 2022 16:08
@aherbert aherbert merged commit a251c18 into apache:master Nov 5, 2022
anantdamle pushed a commit to anantdamle/commons-collections that referenced this pull request Aug 14, 2023
…che#335)

Document the expected behaviour of the BitCountProducer's mapping of indices to counts.

Updated IndexProducer and BitCountProducer tests to verify the expected indices and counts; and optionally verify the encounter order is sorted and indices are distinct.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants