Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BEAM-7013] Use a 0-length byte array to represent empty sketch in HllCount #9519

Merged
merged 1 commit into from
Sep 11, 2019

Conversation

robinyqiu
Copy link
Contributor

@robinyqiu robinyqiu commented Sep 9, 2019

The current implementation of HllCount.Init and HllCount.MergePartial returns an empty PCollection when its input is empty. This is not ideal because, for example, if they are followed by a HllCount.Extract then the output PCollection will also be empty, but the result we really want is a PCollection of a single value 0.

Therefore, this PR is changing the behavior of empty aggregation such that it returns an "empty sketch" (represented by an empty byte[] of length 0), and also HllCount.Extract to handle it.

The reason why we don't use null to represent "empty sketch" is, in short:

  1. Space-efficiency: We will have to override the default ByteArrayCoder, and use a NullableCoder of ByteArrayCoder to encode the output, which is less efficient in terms of space because it needs to encode the information of whether the element is null
  2. Safety: null values may cause problem in downstream transforms

r: @zfraa @boyuanzz


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Choose reviewer(s) and mention them in a comment (R: @username).
  • Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

Post-Commit Tests Status (on master branch)

Lang SDK Apex Dataflow Flink Gearpump Samza Spark
Go Build Status --- --- Build Status --- --- Build Status
Java Build Status Build Status Build Status Build Status
Build Status
Build Status
Build Status Build Status Build Status
Build Status
Python Build Status
Build Status
Build Status
Build Status
--- Build Status
Build Status
Build Status --- --- Build Status
XLang --- --- --- Build Status --- --- ---

Pre-Commit Tests Status (on master branch)

--- Java Python Go Website
Non-portable Build Status Build Status Build Status Build Status
Portable --- Build Status --- ---

See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.

Copy link

@zfraa zfraa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall, thank you for the fast PR!
I didn't have time to do a full pass over the tests. Could you make sure that there's good coverage of empty aggregations in the unit tests -- e.g., merge partial with only empty sketches, with some empty sketches,

@@ -54,10 +54,15 @@ private HllCountMergePartialFn() {}
return null;
}

@Nullable
@Override
public HyperLogLogPlusPlus<HllT> addInput(
@Nullable HyperLogLogPlusPlus<HllT> accumulator, byte[] input) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tl;dr: why not handle nulls instead of throwing?

I didn't find any sources on the exact implied semantics of @nullable, but I would tend to assume if a parameter is annotated with @nullable, the method handles it benignly if it is actually null, vs. throwing an exception.
I would do one of the following two things:

  • Either remove the @nullable annotation and keep throwing below; (again, not feeling strongly about this)
  • Or -- I think we can safely assume that if a null is passed, it's supposed to be an empty sketch: maybe a BQ sketch that made it through importing without conversion, ... . We have the means to support this smoothly by just treating nulls like byte[0] -- why not do this and save the users some exceptions?
    (would need to be consistently across methods)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The @Nullable annotation is on the accumulator parameter, which can be null and we are handling that properly without throwing an exception.

tl;dr: why not handle nulls instead of throwing?

There are pros and cons of supporting nulls:

  • The pro is that we can save the users from exceptions, as you mentioned
  • The cons are 1) then we will have two different representations for ''empty sketch"; and 2) I feel like if we accept nulls as input then we are encouraging users to produce nullable output (and use NullableCoder) from its upstream transform, which is more costly in terms of encoding/decoding and more error prone.

Currently I slightly prefer not accepting nulls as "empty sketches". What is your opinion? A good thing if we keep the implementation as it is: we can always change it later to support nulls and it will be backwards compatible (but we cannot go the other way).

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's leave things as they are in this PR, so that this can be merged!

Whether we should do any follow-ups: I would lean towards avoiding user errors, since every error avoided is something that users don't need to revise their pipeline over, and is an issue that is not escalated to us. Also, if users need to filter their input and replace nulls with byte[0], is that streamed (resp. folded into another pass over the data) or does it result in an extra-pass over the data?
But I don't feel strongly about it and I'll let you make the call, and as you say, we can always change it later if there's a need for it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would lean towards avoiding user errors, since every error avoided is something that users don't need to revise their pipeline over, and is an issue that is not escalated to us.

Agreed. Actually I figured out that we can accept nulls and leave a log warning to suggest replacement with byte[0]. Made that change. PTAL.

Also, if users need to filter their input and replace nulls with byte[0], is that streamed (resp. folded into another pass over the data) or does it result in an extra-pass over the data?

That depends on their pipeline implementation. If users do that in the PTransform where null is created (e.g. here in BigQueryIO), then it will not result in an extra-pass.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Maybe Boyuan can also have a quick look at the logging? Not familiar with what's usual in Beam.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@boyuanzz PTAL at the added logging

@robinyqiu
Copy link
Contributor Author

Hi @zfraa , I think the test cases you mentioned above are already covered!

Copy link
Contributor

@boyuanzz boyuanzz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Robin! Left some comments.
@zfraa Could you please approve this PR if changes look good to you?

Copy link

@zfraa zfraa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[requesting changes, but also approving]
Looks great, thank you for the fast PR, and test coverage also LG! Super-excited for this to make it into Beam 2.16! :D

@@ -54,10 +54,15 @@ private HllCountMergePartialFn() {}
return null;
}

@Nullable
@Override
public HyperLogLogPlusPlus<HllT> addInput(
@Nullable HyperLogLogPlusPlus<HllT> accumulator, byte[] input) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's leave things as they are in this PR, so that this can be merged!

Whether we should do any follow-ups: I would lean towards avoiding user errors, since every error avoided is something that users don't need to revise their pipeline over, and is an issue that is not escalated to us. Also, if users need to filter their input and replace nulls with byte[0], is that streamed (resp. folded into another pass over the data) or does it result in an extra-pass over the data?
But I don't feel strongly about it and I'll let you make the call, and as you say, we can always change it later if there's a need for it.

@zfraa
Copy link

zfraa commented Sep 10, 2019

Still LGTM! On the test naming with underscores: Yep, zetasketch does this -- we follow the principle outlined e.g. here: https://osherove.com/blog/2005/4/3/naming-standards-for-unit-tests.html, which is also consistent with internal Java Style. But let's follow Beam style here!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants