-
Notifications
You must be signed in to change notification settings - Fork 4.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BEAM-7013] Use a 0-length byte array to represent empty sketch in HllCount #9519
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good overall, thank you for the fast PR!
I didn't have time to do a full pass over the tests. Could you make sure that there's good coverage of empty aggregations in the unit tests -- e.g., merge partial with only empty sketches, with some empty sketches,
.../extensions/zetasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCount.java
Show resolved
Hide resolved
.../extensions/zetasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCount.java
Outdated
Show resolved
Hide resolved
...sions/zetasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCountInitFn.java
Show resolved
Hide resolved
...tasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCountMergePartialFn.java
Show resolved
Hide resolved
@@ -54,10 +54,15 @@ private HllCountMergePartialFn() {} | |||
return null; | |||
} | |||
|
|||
@Nullable | |||
@Override | |||
public HyperLogLogPlusPlus<HllT> addInput( | |||
@Nullable HyperLogLogPlusPlus<HllT> accumulator, byte[] input) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tl;dr: why not handle nulls instead of throwing?
I didn't find any sources on the exact implied semantics of @nullable, but I would tend to assume if a parameter is annotated with @nullable, the method handles it benignly if it is actually null, vs. throwing an exception.
I would do one of the following two things:
- Either remove the @nullable annotation and keep throwing below; (again, not feeling strongly about this)
- Or -- I think we can safely assume that if a null is passed, it's supposed to be an empty sketch: maybe a BQ sketch that made it through importing without conversion, ... . We have the means to support this smoothly by just treating nulls like byte[0] -- why not do this and save the users some exceptions?
(would need to be consistently across methods)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The @Nullable
annotation is on the accumulator
parameter, which can be null and we are handling that properly without throwing an exception.
tl;dr: why not handle nulls instead of throwing?
There are pros and cons of supporting nulls:
- The pro is that we can save the users from exceptions, as you mentioned
- The cons are 1) then we will have two different representations for ''empty sketch"; and 2) I feel like if we accept nulls as input then we are encouraging users to produce nullable output (and use
NullableCoder
) from its upstream transform, which is more costly in terms of encoding/decoding and more error prone.
Currently I slightly prefer not accepting nulls as "empty sketches". What is your opinion? A good thing if we keep the implementation as it is: we can always change it later to support nulls and it will be backwards compatible (but we cannot go the other way).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's leave things as they are in this PR, so that this can be merged!
Whether we should do any follow-ups: I would lean towards avoiding user errors, since every error avoided is something that users don't need to revise their pipeline over, and is an issue that is not escalated to us. Also, if users need to filter their input and replace nulls with byte[0], is that streamed (resp. folded into another pass over the data) or does it result in an extra-pass over the data?
But I don't feel strongly about it and I'll let you make the call, and as you say, we can always change it later if there's a need for it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would lean towards avoiding user errors, since every error avoided is something that users don't need to revise their pipeline over, and is an issue that is not escalated to us.
Agreed. Actually I figured out that we can accept nulls and leave a log warning to suggest replacement with byte[0]. Made that change. PTAL.
Also, if users need to filter their input and replace nulls with byte[0], is that streamed (resp. folded into another pass over the data) or does it result in an extra-pass over the data?
That depends on their pipeline implementation. If users do that in the PTransform
where null is created (e.g. here in BigQueryIO), then it will not result in an extra-pass.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Maybe Boyuan can also have a quick look at the logging? Not familiar with what's usual in Beam.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@boyuanzz PTAL at the added logging
...tasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCountMergePartialFn.java
Show resolved
Hide resolved
...ensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/HllCountTest.java
Show resolved
Hide resolved
...tasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCountMergePartialFn.java
Show resolved
Hide resolved
Hi @zfraa , I think the test cases you mentioned above are already covered! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Robin! Left some comments.
@zfraa Could you please approve this PR if changes look good to you?
...tasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCountMergePartialFn.java
Outdated
Show resolved
Hide resolved
...ensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/HllCountTest.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[requesting changes, but also approving]
Looks great, thank you for the fast PR, and test coverage also LG! Super-excited for this to make it into Beam 2.16! :D
...sions/zetasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCountInitFn.java
Show resolved
Hide resolved
...tasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCountMergePartialFn.java
Show resolved
Hide resolved
@@ -54,10 +54,15 @@ private HllCountMergePartialFn() {} | |||
return null; | |||
} | |||
|
|||
@Nullable | |||
@Override | |||
public HyperLogLogPlusPlus<HllT> addInput( | |||
@Nullable HyperLogLogPlusPlus<HllT> accumulator, byte[] input) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's leave things as they are in this PR, so that this can be merged!
Whether we should do any follow-ups: I would lean towards avoiding user errors, since every error avoided is something that users don't need to revise their pipeline over, and is an issue that is not escalated to us. Also, if users need to filter their input and replace nulls with byte[0], is that streamed (resp. folded into another pass over the data) or does it result in an extra-pass over the data?
But I don't feel strongly about it and I'll let you make the call, and as you say, we can always change it later if there's a need for it.
.../extensions/zetasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCount.java
Outdated
Show resolved
Hide resolved
.../extensions/zetasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCount.java
Show resolved
Hide resolved
...ensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/HllCountTest.java
Show resolved
Hide resolved
...ensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/HllCountTest.java
Outdated
Show resolved
Hide resolved
Still LGTM! On the test naming with underscores: Yep, zetasketch does this -- we follow the principle outlined e.g. here: https://osherove.com/blog/2005/4/3/naming-standards-for-unit-tests.html, which is also consistent with internal Java Style. But let's follow Beam style here! |
The current implementation of
HllCount.Init
andHllCount.MergePartial
returns an emptyPCollection
when its input is empty. This is not ideal because, for example, if they are followed by aHllCount.Extract
then the outputPCollection
will also be empty, but the result we really want is aPCollection
of a single value0
.Therefore, this PR is changing the behavior of empty aggregation such that it returns an "empty sketch" (represented by an empty
byte[]
of length 0), and alsoHllCount.Extract
to handle it.The reason why we don't use
null
to represent "empty sketch" is, in short:ByteArrayCoder
, and use aNullableCoder
ofByteArrayCoder
to encode the output, which is less efficient in terms of space because it needs to encode the information of whether the element isnull
null
values may cause problem in downstream transformsr: @zfraa @boyuanzz
Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
R: @username
).[BEAM-XXX] Fixes bug in ApproximateQuantiles
, where you replaceBEAM-XXX
with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.Post-Commit Tests Status (on master branch)
Pre-Commit Tests Status (on master branch)
See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.