-
Notifications
You must be signed in to change notification settings - Fork 4.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove PubsubFileInjector and IntraBundleParallelization #957
Conversation
This is for https://issues.apache.org/jira/browse/BEAM-414 |
LGTM |
IntraBundleParallelization is gone!! 🌟 ❗ 👍 🎉 |
@swegner I'm sorry, why is this a good thing? This was the easiest way we had to make parallel HTTP requests without having to tinker with threads ourselves. For us it's a 10x speed difference if this options is there or not, for whatever reason the default DoFn just doesn't understand how to "parallelize network requests properly" and IntraBundleParallelization was the only fix. |
My comment was related to the fact that I believe also Alternatives for parallelizing network requests would be a good discussion; would you mind posting your use-case to the user list (user@beam.apache.org)? |
Fair enough :-) Where did the depency on the OldDoFn come from? As far as I can tell it should be fairly easy to migrate, just looks like a bit of boilerplate Threading code. I'll send in my use-case. It's actually a solution that we came up with together with our assigned Google Engineer based on some other projects that they'd done, so I'll have him send in details of those other projects as well. |
But I can see in the previous implementation various assumptions that are outside of the Beam model. In particular, scheduling work using the GCS executor service, and relying on an in-memory Sempaphore to track work state. Removing these assumptions gives runners more freedom to apply their own optimizations. |
The problem was wrapping an arbitrary dofn. Dofns in the model have a lot
of capabilities, not all of which can be supported in parallel.
IntraBundleParallelization gave a false sense of "just pass a dofn" so it
was removed.
I think there is definitely use for the ability to run element processing
in parallel, and we had some ideas there. The key difference would be the
parallel code isn't a dofn and so the concerns about supporting general
dofn features aren't present.
Maybe we should move to a Jira or the dev list to discuss?
On Wed, Jan 18, 2017, 9:14 AM Rik <notifications@github.com> wrote:
Fair enough :-) Where did the depency on the OldDoFn come from? As far as I
can tell it should be fairly easy to migrate, just looks like a bit of
boilerplate Threading code.
I'll send in my use-case. It's actually a solution that we came up with
together with our assigned Google Engineer based on some other projects
that they'd done, so I'll have him send in details of those other projects
as well.
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#957 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AACMeMMPaWxkvbbBTHse1J_rdLCTczZIks5rTkhngaJpZM4J8Y-C>
.
|
The issue is that IntraBundleParallelization was fundamentally broken
because of how ProcessContext was being supplied and re-used.
The ProcessContext was being re-used across processElement calls for
performance reasons. This re-use meant that if you were processing elements
A and B in parallel, you wouldn't know which ProcessContext you had and it
could provide inaccurate information (e.g. the window information and side
inputs could be for the wrong element).
ProcessContext would need to provide guarantees that it can live outside
the scope of the processElement call (e.g. another thread).
Now with the new DoFn and its annotation based approach, it is possible
that we won't need IntraBundleParallelization at all and it could be done
automatically. It could detect that side inputs or windows aren't accessed
and parallelize execution when it detects a slow element call.
…On Wed, Jan 18, 2017 at 9:50 AM, Scott Wegner ***@***.***> wrote:
OldDoFn is the previous implementation of DoFn from Dataflow 1.x. I don't
know all the details on why it wasn't migrated, @lukecwik
<https://github.com/lukecwik> or @kennknowles
<https://github.com/kennknowles> might know more.
But I can see in the previous implementation
<https://github.com/apache/beam/pull/957/files#diff-d7e31beda9ea8e06bed04e483415066aL69>
various assumptions that are outside of the Beam model. In particular, scheduling
work using the GCS executor service
<https://github.com/apache/beam/pull/957/files#diff-d7e31beda9ea8e06bed04e483415066aL211>,
and relying on an in-memory Sempaphore to track work state
<https://github.com/apache/beam/pull/957/files#diff-d7e31beda9ea8e06bed04e483415066aL212>.
Removing these assumptions gives runners more freedom to apply their own
optimizations.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#957 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AJnK7LMK1CN_sxST9j3gItJkHPwGeDI9ks5rTlDNgaJpZM4J8Y-C>
.
|
Ah yeah I see now. Obviously I appreciate an automatic approach as much as the next lazy coder so there must be some way we can make that happen :-) |
Hey, can we open a new discussion about how to "thread" things properly now that IntraBundleParallelization is far gone. ATM, we're manually keying bundles and then executing a manual threaded DoFn over all the items in the bundle...but it's not ideal @lukecwik |
Yes, this is a good topic to bring up on user@beam.apache.org or dev@beam.apache.org. |
Honestly don't understand why discussions aren't more accessible...I really need to email a general address? Isn't the point of open source to collaborate and share the best ideas |
is there any update on this discussion? this is quite an important use case for our team. |
If you haven't yet, I recommend posting your question and use-case to dev@beam.apache.org or user@beam.apache.org mailing lists. I'm not aware of alternatives provided in the current version of Beam, but other users on the mailing list may have ideas. And contributions are of course welcome. |
No description provided.