
[BEAM-3705] ApproximateUnique resets its accumulator with each firing. #4688

Merged: 4 commits into apache:master, Feb 23, 2018

Conversation

@rangadi commented Feb 14, 2018

extractOutput() ended up resetting the underlying aggregation because it used extractOrderedList(), which removes all the elements from the heap. extractOrderedList() is costly and is not required either. extractOutput() no longer mutates the accumulator and is cheaper as well.

Merging was not tested, as the direct runner does not seem to exercise the combiner's merge path. Added a test that merges and extracts output the way a window with multiple firings would.
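For context, a minimal standalone sketch of the failure mode described above (class and method names are simplified stand-ins, not the actual Beam code):

```java
import java.util.PriorityQueue;

// Simplified model of the accumulator: a min-heap holding the sampleSize
// largest unique hashes. The estimate is derived from its minimum element
// (the k-th largest hash seen overall).
class TopHashes {
  private final PriorityQueue<Long> heap = new PriorityQueue<>();
  private final int sampleSize;

  TopHashes(int sampleSize) {
    this.sampleSize = sampleSize;
  }

  void add(long hash) {
    if (!heap.contains(hash)) {
      heap.add(hash);
      if (heap.size() > sampleSize) {
        heap.poll(); // evict the smallest, keeping the top sampleSize hashes
      }
    }
  }

  // Pre-fix shape: building an ordered list drains the heap, so the
  // accumulator is empty for the next firing of the same window.
  long kthLargestDestructive() {
    long min = heap.isEmpty() ? 0 : heap.poll(); // destructive read of the minimum
    heap.clear();                                // ...and the rest is gone too
    return min;
  }

  // Post-fix shape: peek at the minimum without removing anything.
  long kthLargestNonMutating() {
    return heap.isEmpty() ? 0 : heap.element();
  }
}
```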

@rangadi commented Feb 14, 2018

+R: @kennknowles, @reuvenlax

```java
    }
    return heap;
  }

  @Override
  public Long extractOutput(LargestUnique heap) {
    List<Long> largestHashes = heap.extractOrderedList();
```
rangadi (author):

Moved the implementation of extractOutput to LargestUnique. There is no reason to assume sampleSize is the same for both; this way the output is based entirely on state within LargestUnique.

@jkff commented Feb 14, 2018

Quick comment: can you use CombineFnTester?

@rangadi commented Feb 14, 2018

> can you use CombineFnTester?

That is great. Updated the test.

I reverted my 'improvement' to add(); it was buggy. It is fine as it is.
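For reference, the suggested pattern looks roughly like this (a sketch assuming CombineFnTester's `testCombineFn(fn, input, expected)` overload; a stock Sum combiner stands in here for ApproximateUniqueCombineFn):

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.LongStream;
import org.apache.beam.sdk.testing.CombineFnTester;
import org.apache.beam.sdk.transforms.Sum;

// CombineFnTester feeds the input through many shardings and orderings,
// so mergeAccumulators() is exercised even though the direct runner's
// simple evaluation path may not hit it.
public class CombineFnTesterSketch {
  public static void main(String[] args) {
    List<Long> input =
        LongStream.rangeClosed(1, 100).boxed().collect(Collectors.toList());
    CombineFnTester.testCombineFn(Sum.ofLongs(), input, 5050L);
  }
}
```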

@reuvenlax commented Feb 14, 2018 via email

@rangadi commented Feb 14, 2018

I don't see any Approx.* combiners in Python. Python Beam can simply use existing reducers like 'sum'.

@rangadi commented Feb 15, 2018

R: @jkff

Commit: `extractOutput()` ended up resetting the underlying aggregation due
to use of `extractOrderedList()`, which removes all the elements from
the heap. `extractOrderedList()` is costly and is not required either.
`extractOutput()` does not mutate now and is cheaper too.

Updated LargestUnique.add() to avoid the 'heap.contains()' call in the
common case with large input.

Merging was not tested as the direct runner does not seem to use the
combiner. Added a test using CombineFnTester.

Commit: Put back the add() improvement; contains() is an O(n) operation.
Avoid it in the common case. I think extractOrderedList() existed mainly
to avoid this.
@rangadi commented Feb 15, 2018

Squashed commits.

@jkff left a review:

Thanks!

```
@@ -102,6 +102,7 @@
      } else {
        accumulator = fn.mergeAccumulators(Arrays.asList(accumulator, inputAccum));
      }
      fn.extractOutput(accumulator); // Extract output to simulate multiple firings.
```
jkff:

Hmm, I don't understand this comment. Is there a way to more directly simulate multiple firings, and assert that the results of each firing are correct?

rangadi (author):

I wanted to invoke fn.extractOutput() multiple times, since that is what happens on runners like Dataflow when there are multiple firings. I am not sure there is an easy way to add validation for each firing here. Since the final extractOutput() is verified, the intermediate values get some indirect validation. Without this, the test added in ApproximateUniqueTest passes even without the fix.
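In isolation, the simulation looks roughly like this (a sketch against Beam's CombineFn API; `simulateFirings` and the printout are illustrative, not the test code):

```java
import java.util.Arrays;
import java.util.List;
import org.apache.beam.sdk.transforms.Combine.CombineFn;

// Each iteration adds one shard of input, merges it into the running
// accumulator, and extracts output the way a trigger firing would.
// Because extractOutput() must not mutate the accumulator, later
// firings still see all earlier input.
public class FiringSimulation {
  static <InputT, AccumT, OutputT> void simulateFirings(
      CombineFn<InputT, AccumT, OutputT> fn, List<List<InputT>> shards) {
    AccumT accum = fn.createAccumulator();
    for (List<InputT> shard : shards) {
      AccumT shardAccum = fn.createAccumulator();
      for (InputT value : shard) {
        shardAccum = fn.addInput(shardAccum, value);
      }
      accum = fn.mergeAccumulators(Arrays.asList(accum, shardAccum));
      OutputT fired = fn.extractOutput(accum); // one "firing"
      System.out.println("firing -> " + fired);
    }
  }
}
```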

rangadi (author):

Do you have any suggestions for improving this?

```java
      if (heap.size() >= sampleSize && value < heap.element()) {
        return false; // Common case as input size increases.
      }
      if (!heap.contains(value)) {
```
jkff:

What's the complexity of contains? Is it O(log n) or O(n)?

rangadi (author):

O(n) in java.util.PriorityQueue.

jkff:

That seems quite bad - O(n) for every add(). Is this check actually needed?

```java
          break; // The remainder of this list is all smaller.
        }
      }
      iterator.next().heap.forEach(h -> heap.add(h));
```
jkff:

It's rather confusing that "heap" refers both to PriorityQueue objects and LargestUnique objects, which have different .add() behavior (one is bounded and deduplicates, the other isn't). On this line there are two things called "heap", referring to both. Maybe rename the variable "heap" to "accum"? (Applies throughout the PR wherever a LargestUnique variable, field, or parameter is called "heap".)

rangadi (author):

Agreed, I felt it too. Will update.

rangadi (author):

Done.

```java
    if (uniqueCount <= sampleSize) {
      return is(uniqueCount);
    } else {
      long maxError = (long) Math.ceil(2.0 * uniqueCount / Math.sqrt(sampleSize));
```
jkff:

Just checking: is this a probabilistic guarantee (holds 99% of the time), or a hard guarantee (holds 100% of the time)? Wouldn't want the test to be flaky.

rangadi (author):

Good point, I didn't check. In the case of these tests the input does not change from run to run, so once it passes it should always pass. At the least it won't be flaky.

rangadi (author):

I could not find where the error bound comes from or what the cutoff probability is. The estimate is roughly k * hash_key_space / (max_hash - k_th_largest_hash). ApproximateUnique is likely to be replaced by ApproximateDistinct, which estimates using HyperLogLog. The tests are not flaky since the input is deterministic.
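For a concrete feel for the bound in the matcher above (illustrative numbers, not from the test): with `sampleSize = 100` and `uniqueCount = 10000`, `maxError = ceil(2.0 * 10000 / sqrt(100)) = 2000`, so any estimate in `[8000, 12000]` is accepted.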

```java
   * Test ApproximateUniqueCombineFn. TestPipeline does not use combiners.
   */
  @RunWith(Parameterized.class)
  public static class ApproximateUniqueCombineFnTest {
```
jkff:

There are only 3 parameters. I'd prefer a private utility method and 3 test methods calling it with each of the 3 parameters.

rangadi (author):

Done.

```
@@ -276,6 +301,50 @@ public void testApproximateUniqueWithDifferentSampleSizes() {
    }
  }

  /**
   * Test ApproximateUniqueCombineFn. TestPipeline does not use combiners.
```
jkff:

It does (otherwise how would it evaluate the Combine transform), just not as extensively as CombineFnTester.

rangadi (author):

I should clarify: it does not seem to call merge() at all. I had changed merge() to return null, and it had no effect. In that sense it does not combine. I will update the comment.

rangadi (author):

Updated the comment.

@jkff left a review:

Thanks, mostly looks good; one concern.

```java
      if (heap.size() >= sampleSize && value < heap.element()) {
        return false; // Common case as input size increases.
      }
      if (!heap.contains(value)) {
```
jkff:

That seems quite bad - O(n) for every add(). Is this check actually needed?

@rangadi commented Feb 22, 2018


sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ApproximateUnique.java, line 342 at r1 (raw file):

Previously, jkff (Eugene Kirpichov) wrote…

> That seems quite bad - O(n) for every add(). Is this check actually needed?

Yes. The algorithm requires keeping track of the top-k unique hashes, and PriorityQueue has no insert-if-absent operation, so the contains() check is needed. That's why reordering the conditions matters: the check is skipped more often as the number of elements increases. If we need more performance here, I would consider a data structure based on primitive longs from fastutil, maybe a sorted set. For now it's OK. I haven't looked into why we have both ApproximateDistinct (using HLL+) and this.



@jkff commented Feb 22, 2018

Can we just use a TreeSet instead of the PriorityQueue? It seems strictly better, and very easy to switch to.

@rangadi commented Feb 22, 2018

Updated to use TreeSet. We maintain minHash so that there is no O(log n) lookup on the fast path (i.e. when the number of elements exceeds sampleSize, which is the common case during merges). We can't be certain this is better in practice without a micro-benchmark: PriorityQueue uses an array, and since we always remove the min, it could cause more rebalancing than a typical TreeSet. Either way, it can't be much worse, so this is fine. We could use a primitive-long tree if we were really concerned about performance.
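A minimal sketch of that design (hypothetical names, not the committed code):

```java
import java.util.TreeSet;

// TreeSet gives O(log n) dedup plus eviction, and caching the current
// minimum keeps the common reject path at O(1).
class TopKHashes {
  private final TreeSet<Long> hashes = new TreeSet<>();
  private final int sampleSize;
  private long minHash = Long.MIN_VALUE; // meaningful once the set is full

  TopKHashes(int sampleSize) {
    this.sampleSize = sampleSize;
  }

  boolean add(long hash) {
    if (hashes.size() >= sampleSize && hash <= minHash) {
      return false; // fast path: too small (or equal to the cached min), no TreeSet lookup
    }
    if (!hashes.add(hash)) {
      return false; // duplicate
    }
    if (hashes.size() > sampleSize) {
      hashes.pollFirst(); // evict the smallest of the top-k
    }
    minHash = hashes.first();
    return true;
  }
}
```

Caching `minHash` means the common reject path costs one comparison; the TreeSet is only touched when a hash might actually enter the top-k.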



@jkff commented Feb 22, 2018

I don't see new changes; forgot to push?

@rangadi commented Feb 22, 2018

Oops. Just pushed my changes.

@jkff commented Feb 22, 2018

LGTM, please ping me when tests are green and I'll merge.

@jkff commented Feb 23, 2018

Seems CombineFnTesterTest is failing now.

Commit: …keeping track of number of merges in accumulator.
@rangadi commented Feb 23, 2018

I was just pushing a fix. I updated the failing test to keep track of the number of merges in the accumulator itself, rather than expecting extractOutput() to be called exactly once per test.

@rangadi commented Feb 23, 2018

@jkff Thanks for the review. Looks like the tests have passed.

@jkff merged commit 0be4c54 into apache:master on Feb 23, 2018