New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BEAM-8292] Portable Reshuffle for Go SDK #11197
Conversation
Run Go Postcommit |
R: @youngoli |
I'm definitely not merging this until both the PostCommit runs, and someone more familiar with windowing/trigger semantics looks over the configuration I copied over from python: |
Retest this please |
Run Go Postcommit |
Post commits run and pass which is a good sign! |
return err | ||
} | ||
*fv = FullValue{Elm: val} | ||
return err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know this is just preserving the existing behavior, but it seems weird to return err
here instead of return nil
, even if it is guaranteed to be nil
at this point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. Thanks!
|
||
// FinishBundle propagates finish bundle to downstream nodes. | ||
func (n *ReshuffleOutput) FinishBundle(ctx context.Context) error { | ||
n.b = bytes.Buffer{} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this also be clearing n.ret
like ReshuffleInput
does?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it should. Thanks!
func (n *ReshuffleOutput) ProcessElement(ctx context.Context, value *FullValue, values ...ReStream) error { | ||
// Marshal the pieces into a temporary buffer since they must be transmitted on FnAPI as a single | ||
// unit. | ||
vs, err := values[0].Open() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's strange to me that values[]
can have multiple elements, but this method ends up actually reading all the values from the first element of it. Could you explain why that happens? Are there sometimes multiple ReStreams representing different things?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the Go SDK future proofing itself, against CoGBK supporting multiple datastreams from the runner. Functionally, only datasource.go would need to change in that case. You can see a comment to that effect in datasource.go, and then nearly everything else deals with the values streams properly under that assumption.
In this case, we know that if this code is being used, it's coming from a single GBK, which means there's only a single stream of values, and then since we're framework side, we just handle the stream directly. In that way, it's similar to how we're handling CoGBKs presently, with synthetic inject and expand steps to get to the right number of joined streams, even though the Runner is only providing us with a single data stream for the grouped data.
gbk := &pb.PTransform{ | ||
UniqueName: gbkID, | ||
Spec: &pb.FunctionSpec{Urn: URNGBK}, | ||
Inputs: map[string]string{"i0": postReify}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this input supposed to be postReify
? I would've expected inputID
from the previous step.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup. postReify is the PCollection that has the random keys and the full windowed value as serialized bytes.
The input from the previous step is used on line 577.
in represents strictly inbound data (and specifically the main input), and From indicates the "Node" in the graph. In this model of the pipeline PCollections are Nodes, and Transforms are Edges.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh ok, I see it now. I was completely misinterpreting the expansion here. So if I understand correctly, it should look like this, right? (With pcollections/nodes in square brackets and transforms/edges as arrows)
[in.From] ---input---> [postReify] ---gbk---> [gbkOut] ---output---> [out.To]
Where input
and output
are the newly added Reshuffle transforms. That looks right to me, and rereading the code it looks consistent with that.
R: @reuvenlax |
same code, but shorter Co-Authored-By: Daniel Oliveira <younghoono@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PTAL
gbk := &pb.PTransform{ | ||
UniqueName: gbkID, | ||
Spec: &pb.FunctionSpec{Urn: URNGBK}, | ||
Inputs: map[string]string{"i0": postReify}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup. postReify is the PCollection that has the random keys and the full windowed value as serialized bytes.
The input from the previous step is used on line 577.
in represents strictly inbound data (and specifically the main input), and From indicates the "Node" in the graph. In this model of the pipeline PCollections are Nodes, and Transforms are Edges.
return err | ||
} | ||
*fv = FullValue{Elm: val} | ||
return err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. Thanks!
func (n *ReshuffleOutput) ProcessElement(ctx context.Context, value *FullValue, values ...ReStream) error { | ||
// Marshal the pieces into a temporary buffer since they must be transmitted on FnAPI as a single | ||
// unit. | ||
vs, err := values[0].Open() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the Go SDK future proofing itself, against CoGBK supporting multiple datastreams from the runner. Functionally, only datasource.go would need to change in that case. You can see a comment to that effect in datasource.go, and then nearly everything else deals with the values streams properly under that assumption.
In this case, we know that if this code is being used, it's coming from a single GBK, which means there's only a single stream of values, and then since we're framework side, we just handle the stream directly. In that way, it's similar to how we're handling CoGBKs presently, with synthetic inject and expand steps to get to the right number of joined streams, even though the Runner is only providing us with a single data stream for the grouped data.
|
||
// FinishBundle propagates finish bundle to downstream nodes. | ||
func (n *ReshuffleOutput) FinishBundle(ctx context.Context) error { | ||
n.b = bytes.Buffer{} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it should. Thanks!
Run Go Postcommit |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me.
WindowFn: makeWindowFn(wfn), | ||
// ...output after every element is received... | ||
Trigger: &pb.Trigger{ | ||
// Should this be an Always trigger instead? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In answer to this comment, yes it should be Always
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
Retest this please |
Run Go Postcommit |
This adds a Reshuffle transform to the Go SDK. In particular, it configures windowing & trigggers for a GBK to allow for fusion breaks, where parallelism needs to increase, or decrease due to data bundling properties. Previous element window and timestamps are preserved. The SDK operations are wrapped with a higher level reshuffle URN so runners can optimize this step better. The Go Direct Runner is aware of the Reshuffle and correctly ignores it, as it's a single bundle runner. Note: While this should work for streaming cases, it hasn't been tested with them yet, due to the current state of streaming the Go SDK. Co-authored-by: lostluck <13907733+lostluck@users.noreply.github.com> Co-authored-by: Daniel Oliveira <younghoono@gmail.com>
This adds a Reshuffle transform to the Go SDK.
Note: While this should work for streaming cases, it hasn't been tested with them yet, due to the current state of streaming the Go SDK.
It further adds one small optimization for the internal Decoding interface, called the DecodeTo method to avoid extra allocations to the heap incurred by returning a *FullValue. Can't avoid extra allocations for KV types at present, but value PCollections should have lower overhead. A subsequent PR will use the DecodeTo method at a other applicable places in the decode stack.
Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
R: @username
).[BEAM-XXX] Fixes bug in ApproximateQuantiles
, where you replaceBEAM-XXX
with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.CHANGES.md
with noteworthy changes.See the Contributor Guide for more tips on how to make review process smoother.
Post-Commit Tests Status (on master branch)
Pre-Commit Tests Status (on master branch)
See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.