[BEAM-7013] Use a 0-length byte array to represent empty sketch in HllCount #9519

robinyqiu · 2019-09-09T07:17:15Z

The current implementation of HllCount.Init and HllCount.MergePartial returns an empty PCollection when its input is empty. This is not ideal because, for example, if they are followed by a HllCount.Extract then the output PCollection will also be empty, but the result we really want is a PCollection of a single value 0.

Therefore, this PR is changing the behavior of empty aggregation such that it returns an "empty sketch" (represented by an empty byte[] of length 0), and also HllCount.Extract to handle it.

The reason why we don't use null to represent "empty sketch" is, in short:

Space-efficiency: We will have to override the default ByteArrayCoder, and use a NullableCoder of ByteArrayCoder to encode the output, which is less efficient in terms of space because it needs to encode the information of whether the element is null
Safety: null values may cause problem in downstream transforms

r: @zfraa @boyuanzz

Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

Choose reviewer(s) and mention them in a comment (R: @username).
Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
If this contribution is large, please file an Apache Individual Contributor License Agreement.

Post-Commit Tests Status (on master branch)

Lang	SDK	Apex	Dataflow	Gearpump	Samza	Spark
Go		---	---	---	---
Java
Python		---		---	---
XLang	---	---	---	---	---	---

Pre-Commit Tests Status (on master branch)

---	Java	Python	Go	Website
Non-portable
Portable	---		---	---

See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.

zfraa

Looks good overall, thank you for the fast PR!
I didn't have time to do a full pass over the tests. Could you make sure that there's good coverage of empty aggregations in the unit tests -- e.g., merge partial with only empty sketches, with some empty sketches,

.../extensions/zetasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCount.java

...sions/zetasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCountInitFn.java

...tasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCountMergePartialFn.java

zfraa · 2019-09-09T14:11:47Z

...tasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCountMergePartialFn.java

@@ -54,10 +54,15 @@ private HllCountMergePartialFn() {}
    return null;
  }

+  @Nullable
  @Override
  public HyperLogLogPlusPlus<HllT> addInput(
      @Nullable HyperLogLogPlusPlus<HllT> accumulator, byte[] input) {


tl;dr: why not handle nulls instead of throwing?

I didn't find any sources on the exact implied semantics of @nullable, but I would tend to assume if a parameter is annotated with @nullable, the method handles it benignly if it is actually null, vs. throwing an exception.
I would do one of the following two things:

Either remove the @nullable annotation and keep throwing below; (again, not feeling strongly about this)

Or -- I think we can safely assume that if a null is passed, it's supposed to be an empty sketch: maybe a BQ sketch that made it through importing without conversion, ... . We have the means to support this smoothly by just treating nulls like byte[0] -- why not do this and save the users some exceptions?
(would need to be consistently across methods)

The @Nullable annotation is on the accumulator parameter, which can be null and we are handling that properly without throwing an exception.

tl;dr: why not handle nulls instead of throwing?

There are pros and cons of supporting nulls:

The pro is that we can save the users from exceptions, as you mentioned

The cons are 1) then we will have two different representations for ''empty sketch"; and 2) I feel like if we accept nulls as input then we are encouraging users to produce nullable output (and use NullableCoder) from its upstream transform, which is more costly in terms of encoding/decoding and more error prone.

Currently I slightly prefer not accepting nulls as "empty sketches". What is your opinion? A good thing if we keep the implementation as it is: we can always change it later to support nulls and it will be backwards compatible (but we cannot go the other way).

Let's leave things as they are in this PR, so that this can be merged!

Whether we should do any follow-ups: I would lean towards avoiding user errors, since every error avoided is something that users don't need to revise their pipeline over, and is an issue that is not escalated to us. Also, if users need to filter their input and replace nulls with byte[0], is that streamed (resp. folded into another pass over the data) or does it result in an extra-pass over the data?
But I don't feel strongly about it and I'll let you make the call, and as you say, we can always change it later if there's a need for it.

I would lean towards avoiding user errors, since every error avoided is something that users don't need to revise their pipeline over, and is an issue that is not escalated to us.

Agreed. Actually I figured out that we can accept nulls and leave a log warning to suggest replacement with byte[0]. Made that change. PTAL.

Also, if users need to filter their input and replace nulls with byte[0], is that streamed (resp. folded into another pass over the data) or does it result in an extra-pass over the data?

That depends on their pipeline implementation. If users do that in the PTransform where null is created (e.g. here in BigQueryIO), then it will not result in an extra-pass.

LGTM! Maybe Boyuan can also have a quick look at the logging? Not familiar with what's usual in Beam.

@boyuanzz PTAL at the added logging

...tasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCountMergePartialFn.java

...ensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/HllCountTest.java

...tasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCountMergePartialFn.java

robinyqiu · 2019-09-09T21:30:46Z

Hi @zfraa , I think the test cases you mentioned above are already covered!

boyuanzz

Thanks Robin! Left some comments.
@zfraa Could you please approve this PR if changes look good to you?

...tasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCountMergePartialFn.java

...ensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/HllCountTest.java

zfraa

[requesting changes, but also approving]
Looks great, thank you for the fast PR, and test coverage also LG! Super-excited for this to make it into Beam 2.16! :D

...sions/zetasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCountInitFn.java

...tasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCountMergePartialFn.java

zfraa · 2019-09-10T10:04:31Z

...tasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCountMergePartialFn.java

@@ -54,10 +54,15 @@ private HllCountMergePartialFn() {}
    return null;
  }

+  @Nullable
  @Override
  public HyperLogLogPlusPlus<HllT> addInput(
      @Nullable HyperLogLogPlusPlus<HllT> accumulator, byte[] input) {


Let's leave things as they are in this PR, so that this can be merged!

Whether we should do any follow-ups: I would lean towards avoiding user errors, since every error avoided is something that users don't need to revise their pipeline over, and is an issue that is not escalated to us. Also, if users need to filter their input and replace nulls with byte[0], is that streamed (resp. folded into another pass over the data) or does it result in an extra-pass over the data?
But I don't feel strongly about it and I'll let you make the call, and as you say, we can always change it later if there's a need for it.

.../extensions/zetasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCount.java

...ensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/HllCountTest.java

zfraa · 2019-09-10T20:30:39Z

Still LGTM! On the test naming with underscores: Yep, zetasketch does this -- we follow the principle outlined e.g. here: https://osherove.com/blog/2005/4/3/naming-standards-for-unit-tests.html, which is also consistent with internal Java Style. But let's follow Beam style here!

zfraa suggested changes Sep 9, 2019

View reviewed changes

boyuanzz reviewed Sep 10, 2019

View reviewed changes

...tasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCountMergePartialFn.java Outdated Show resolved Hide resolved

...ensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/HllCountTest.java Outdated Show resolved Hide resolved

zfraa approved these changes Sep 10, 2019

View reviewed changes

Use a 0-length byte array to represent empty sketch in HllCount

03bc7e7

robinyqiu force-pushed the hll branch from 8df84b3 to 03bc7e7 Compare September 11, 2019 17:35

boyuanzz merged commit 17490af into apache:master Sep 11, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BEAM-7013] Use a 0-length byte array to represent empty sketch in HllCount #9519

[BEAM-7013] Use a 0-length byte array to represent empty sketch in HllCount #9519

robinyqiu commented Sep 9, 2019 •

edited

Loading

zfraa left a comment

zfraa Sep 9, 2019

robinyqiu Sep 9, 2019

zfraa Sep 10, 2019

robinyqiu Sep 10, 2019

zfraa Sep 10, 2019

robinyqiu Sep 10, 2019

robinyqiu commented Sep 9, 2019

boyuanzz left a comment

zfraa left a comment

zfraa Sep 10, 2019

zfraa commented Sep 10, 2019

[BEAM-7013] Use a 0-length byte array to represent empty sketch in HllCount #9519

[BEAM-7013] Use a 0-length byte array to represent empty sketch in HllCount #9519

Conversation

robinyqiu commented Sep 9, 2019 • edited Loading

Post-Commit Tests Status (on master branch)

Pre-Commit Tests Status (on master branch)

zfraa left a comment

Choose a reason for hiding this comment

zfraa Sep 9, 2019

Choose a reason for hiding this comment

robinyqiu Sep 9, 2019

Choose a reason for hiding this comment

zfraa Sep 10, 2019

Choose a reason for hiding this comment

robinyqiu Sep 10, 2019

Choose a reason for hiding this comment

zfraa Sep 10, 2019

Choose a reason for hiding this comment

robinyqiu Sep 10, 2019

Choose a reason for hiding this comment

robinyqiu commented Sep 9, 2019

boyuanzz left a comment

Choose a reason for hiding this comment

zfraa left a comment

Choose a reason for hiding this comment

zfraa Sep 10, 2019

Choose a reason for hiding this comment

zfraa commented Sep 10, 2019

robinyqiu commented Sep 9, 2019 •

edited

Loading