-
Notifications
You must be signed in to change notification settings - Fork 4.5k
Go SDK: Allow overriding the default iterable creator. #22057
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Can one of the admins verify this patch? |
4 similar comments
|
Can one of the admins verify this patch? |
|
Can one of the admins verify this patch? |
|
Can one of the admins verify this patch? |
|
Can one of the admins verify this patch? |
…and CoGBK. This change defines a new exec.ReStreamFactory type for constructing a re-iterable stream from a byte stream and coder.Coder. The user may override the default behavior by calling exec.SetReStreamFactory. This customizable factory function allows overriding the default behavior of exec.ReStream construction. The default implementation buffers all elements being iterated over by a GBK or CoGBK function into an in-memory buffer, which may cause OOM errors when a single iterable is large and the runner does not supply a state-backed iterable. See apache#21817. The user may specify an alternative function that spills elements to disk if a memory threshold is exceeded, for instance.
31a40d7 to
8445f3a
Compare
Codecov Report
@@ Coverage Diff @@
## master #22057 +/- ##
==========================================
- Coverage 74.00% 74.00% -0.01%
==========================================
Files 703 703
Lines 92936 92987 +51
==========================================
+ Hits 68776 68811 +35
- Misses 22894 22906 +12
- Partials 1266 1270 +4
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
|
Assigning reviewers. If you would like to opt out of this review, comment R: @lostluck for label go. Available commands:
The PR bot will only process comments in the main thread (not review comments). |
lostluck
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll need to take another look at this when I'm fresher.
The main concern I have is this adds a user facing actuation point inside the internal exec package, and we do not want to add those if we can help it.
The package isn't intended for direct end user configuration and use, and we don't treat it as such. If left as is, the experimental nature of the API would need to be made very explicit in the documentation.
Now, if there were a harness configuration that was triggered by options that would switch between one or more pre-written, and tested implementations, that would probably be OK, as it reduces likelyhood of user misuse.
|
Really, what I'd want to tell you instead is to contribute to the Go Direct runner to resolve this for local single machine runs, but I'm working on a portability based replacement (in Go). Unfortunately it's not yet ready for migrating to the beam repo, and is very test focused. Read: currently everything is in memory, no parallelism, and doesn't do DoFn Graph fusion optimizations, each DoFn is executed one at a time. My intent is to allow it to be much more configurable though, for robustness, and be able to actuate all the various Beam FnAPI features. In this case, have it be able to spill out to disk. https://github.com/lostluck/experimental/tree/master/local/internal |
|
Reminder, please take a look at this pr: @lostluck |
|
Assigning new set of reviewers because Pr has gone too long without review. If you would like to opt out of this review, comment R: @jrmccluskey for label go. Available commands:
|
|
TIL the bot takes the "review" state pretty seriously, instead of assuming something from the comment. (That's probably the correct approach). |
|
What if this change was pulled in, but the function to override the default
stream factory is unexported (lower cased) for now. This will make it
easier for me to patch the package in my fork without a complicated diff
that requires maintenance.
…On Thu, Jun 30, 2022 at 9:49 AM Robert Burke ***@***.***> wrote:
Really, what I'd want to tell you instead is to contribute to the Go
Direct runner to resolve this for local single machine runs, but I'm
working on a portability based replacement (in Go).
Unfortunately it's not yet ready for migrating to the beam repo, and is
very test focused.
Read: currently everything is in memory, no parallelism, and doesn't do
DoFn Graph fusion optimizations, each DoFn is executed one at a time. My
intent is to allow it to be much more configurable though, for robustness,
and be able to actuate all the various Beam FnAPI features. In this case,
have it be able to spill out to disk.
—
Reply to this email directly, view it on GitHub
<#22057 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAAUO54PO7ST64SLJAM4W5TVRXFZVANCNFSM5Z2TT4DQ>
.
You are receiving this because you authored the thread.Message ID:
***@***.***>
|
|
That feels reasonable, in combination with the marking of the method as experimental, and probably directing it to the linked issue in question and similar. In practice, it's unlikely that we'll simply remove the entry point code, even if we mark it as experimental. This will allow us to better support this case in the future through alternate means. I'll try to take another look this afternoon. (I'm a bit busy preparing for next week's Beam Summit, and it's consuming much of my time.) |
lostluck
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few changes, then this should be alright to merge.
Any unit tests you could add for the newly separated functions that don't have great coverage would be strongly appreciated.
| }() | ||
| // createChunkReStreams appends to chunkStreams and | ||
| // chunkStreamsCloser.children | ||
| createChunkReStreams := func() error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indirecting the error through an anon closure doesn't help with code readability here. The closure is too long for a meaningful readability benefit.
I'd strongly prefer that we move this out to a named method and pass parameters in and out, instead of making hard to follow code harder to follow.
And looking at this again, is this just to avoid adding a nopCloser{} return parameter on error cases? That would be clearer than the indirection.
| // for DoFns that iterate over values (GBK and CoGBK). | ||
| // | ||
| // The default implementation of this function is DefaultReadStreamToBuffer. | ||
| func SetReStreamFactory(fn ReStreamFactory) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per discussion, please label this as Experimental, and link to the issue we're attempting to solve with this.
|
|
||
| // DefaultReadStreamToReStream reads numElements from the byteStream using the | ||
| // element decoder dec and returns an in-memory ReStream. | ||
| func DefaultReadStreamToReStream(_ context.Context, encodedStream io.Reader, numElements int64, coder *coder.Coder) (ReStream, func() error, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per discussion, please unexport the default implementation.
| if len(rest) == 0 { | ||
| return &concatReStream{first: first, next: nil} | ||
| } | ||
| return &concatReStream{first: first, next: newConcatReStream(rest...)} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not a huge fan of the recursive linked list construction here, but I think in practice this will end up being minimally used, if ever. But this is probably less fiddly than alternative approaches.
|
waiting on author |
|
FYI: I'm away for a while but plan to make all the requested changes.
…On Wed, Jul 13, 2022, 1:10 PM Robert Burke ***@***.***> wrote:
waiting on author
—
Reply to this email directly, view it on GitHub
<#22057 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAAUO562LAPVUDIRH3CWLSDVT3Z75ANCNFSM5Z2TT4DQ>
.
You are receiving this because you authored the thread.Message ID:
***@***.***>
|
|
That was for the bot. Sorry for the confusion. |
|
This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@beam.apache.org list. Thank you for your contributions. |
|
I forgot about this but can make the requested changes when I get a chance. |
|
Reminder, please take a look at this pr: @jrmccluskey |
|
Assigning new set of reviewers because Pr has gone too long without review. If you would like to opt out of this review, comment R: @damccorm for label go. Available commands:
|
|
Good news @gonzojive ! The Go SDK has a local portable runner, implemented in Go, that would very much welcome support for offloading intermediate datasets to files, compared to the current direct runner, or flink implementations. The v0 is coming with the next release (2.46, being cut tomorrow). See the following for the current vision. That said, there's other work to be done that may block it from your use case, since Prism is currently in V0, and is presently intentionally Test skewed. But it is moving in the right direction, and is intended to ultimately be configurable for local production execution. The current missing features that may affect your usage are the following, since you're trying to do real work with it.
Other than the offloading to the file system, most of these will come along in due time. As mentioned before, the existing SDK side implementation for what you've got isn't as desirable as a runner based solution. But, as long as any particular feature is configurable between in-memory and more test focused (closer to 1 at a time), I'm very amenable to unblocking things. |
|
This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@beam.apache.org list. Thank you for your contributions. |
|
This pull request has been closed due to lack of activity. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time. |
This change defines a new exec.ReStreamFactory type for constructing a
re-iterable stream from a byte stream and coder.Coder. The user may override the
default behavior by calling exec.SetReStreamFactory.
This customizable factory function allows overriding the default behavior of
exec.ReStream construction. The default implementation buffers all elements bein
iterated over by a GBK or CoGBK function into an in-memory buffer, which may
cause OOM errors when a single iterable is large and the runner does not supply
a state-backed iterable. See #21817. The
user may specify an alternative function that spills elements to disk if a
memory threshold is exceeded, for instance.
Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
R: @username).addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, commentfixes #<ISSUE NUMBER>instead.CHANGES.mdwith noteworthy changes. I did not update this because this change doesn't seem significant enough, but I can update it if requested.See the Contributor Guide for more tips on how to make review process smoother.
To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md
GitHub Actions Tests Status (on master branch)
See CI.md for more information about GitHub Actions CI.