DataflowRunner: Experiment added to disable unbounded PCollection checks turning batch into streaming #16773
Conversation
…ecks turning batch into streaming
R: @lukecwik
@fpeter8 fyi
(At the moment I'm not sure if I should open a JIRA ticket for this, or whether the PR in itself is enough.)
R: @kennknowles
One problem here is that there is no batch implementation for it. This is an example where finite != bounded: it is up to the IO to produce a bounded PCollection, and the runner to figure out how to execute it. But batch Dataflow will never be able to evaluate an unbounded PCollection.
Disabling the autodetection would have to include a scan for unbounded PCollections, and then fail the job before submitting it.
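The pre-submission scan described above can be sketched as follows. This is a stdlib-only toy model, not Beam's actual API: in Beam the runner would walk the graph with `Pipeline.traverseTopologically` and inspect each `PCollection`'s `IsBounded` state, while here the pipeline is modeled as a plain list of named collections.

```java
import java.util.List;

// Toy sketch (not Beam's API): fail fast before job submission if any
// unbounded PCollection is found in a batch pipeline.
public class BoundednessScan {
  enum IsBounded { BOUNDED, UNBOUNDED }

  record PCollectionInfo(String name, IsBounded boundedness) {}

  /** Throws before submission if the batch pipeline contains an unbounded PCollection. */
  static void failIfUnbounded(List<PCollectionInfo> pcollections) {
    for (PCollectionInfo pc : pcollections) {
      if (pc.boundedness() == IsBounded.UNBOUNDED) {
        throw new IllegalStateException(
            "Batch job rejected: PCollection '" + pc.name() + "' is unbounded");
      }
    }
  }

  public static void main(String[] args) {
    // A fully bounded pipeline passes the scan.
    failIfUnbounded(List.of(new PCollectionInfo("words", IsBounded.BOUNDED)));
    // An unbounded PCollection is rejected before submission.
    try {
      failIfUnbounded(List.of(new PCollectionInfo("kafka", IsBounded.UNBOUNDED)));
    } catch (IllegalStateException expected) {
      System.out.println(expected.getMessage());
    }
  }
}
```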
Specifically: #15951 (comment)
Well, this might be a hopefully pleasant surprise for you then, because with the experiment enabled it actually works. Here is actual data from the GCP console from two jobs launched with it:
The price difference seems significant, but I haven't tested with a large amount of data, so it might become more linear later... but the biggest pain for us is the lack of batch processing with simple grouping, etc., and the need to handle the data as streaming with windows, triggers, etc., when business-wise it would absolutely not be necessary.
So based on the actual behaviour I respectfully disagree with your statement, because it actually works. I introduced it as an experiment, so if someone knows what they are doing, they can turn the check off, but it's still there for everybody else, "just to be sure".
I agree that this should be the "proper" solution for this specific issue, but this workaround handles not only that, but every similar issue as well.
If the pipeline does succeed in batch, then we actually have a bug. The intent is that anything that batch can run, it does run.
Ah, I see now that it is in fact bounded data, but not "already sitting there" but streamed in. This is an interesting case to consider. |
Based on my reading of the code, use of
I think that is one thing we all agree on. This is strictly about giving developers the ability to circumvent cases where that implementation does not exist yet. I really can't see how having this option, disabled by default, could hurt. As far as I can tell, if the experiment hasn't been enabled, it behaves exactly the same.
What is the next step on this PR?
I need to convince myself that an experiment to explicitly enable unsupported behavior is safe, short term and long term. I see why it works for this case. I don't really know the overall failure mode. If this experiment makes it into a StackOverflow answer, I worry we'll have lots of people using it and having unpredictable problems. |
Any ETA?
I looked into this, and I think the right fix is to have a separate DoFn for the bounded read. I am less familiar with Kafka's details, but would it also be possible that the event timestamp / watermark on the topic does not advance? In that case we would want the above change (which is safe) along with this experiment, which should be renamed to something like `unsafely_attempt_to_process_unbounded_data_in_batch_mode`.
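The "separate bounded read" idea can be sketched as a conditional expansion. Everything below uses hypothetical naming, not KafkaIO's real code: the point is only that a configured stop read time makes the source finite, so the read can expand to a bounded code path instead of the unbounded one.

```java
import java.time.Instant;
import java.util.Optional;

// Toy sketch (hypothetical names, not KafkaIO's real expansion): choose a
// bounded or unbounded read path depending on whether a stop read time was set.
public class KafkaReadExpansion {
  enum ReadMode { BOUNDED_BATCH, UNBOUNDED_STREAMING }

  static ReadMode chooseReadMode(Optional<Instant> stopReadTime) {
    // A stop time makes the source finite, so it can expand to a bounded read.
    return stopReadTime.isPresent() ? ReadMode.BOUNDED_BATCH : ReadMode.UNBOUNDED_STREAMING;
  }

  public static void main(String[] args) {
    System.out.println(chooseReadMode(Optional.of(Instant.now()))); // BOUNDED_BATCH
    System.out.println(chooseReadMode(Optional.empty()));           // UNBOUNDED_STREAMING
  }
}
```

With such a split, a pipeline using `withStopReadTime` would never hit the streaming autodetection in the first place, which is the "safe" half of the fix the reviewer describes.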
As far as I understand:
I don't get this. If the bounded/unbounded separation is implemented, why would having this experiment make any sense for it?
Thanks for all the details. Just to make sure the premises are clear:
So, two separate code paths: either separate entry points, or a conditional based on whether the stop read time is set. I think if you start at the read's DoFn, the split is straightforward.
I agree. Not sure what I was thinking about. I thought there was still a case where you would end up using this to force something, but ideally you would not need it.
Ideally there would have been no need for this PR at all. The conceptual gap it tries to work around will still exist. Separating the Kafka-related DoFns only fixes one possible occurrence.
I don't understand this comment. Once you have a bounded read code path that you can choose, what are the additional occurrences?
I have no idea. That's the point. KafkaIO.Read isn't the only functionality in Beam. Anything might have something similar.
Ah, OK. Then I think it is fine. Does it make sense? Would you be up for forking the DoFn and building a code path that does a proper bounded read?
Due to the expected time requirements of that task, that is not up to me anymore.
Hopefully this is enough.
ping? :)
```diff
@@ -197,6 +197,10 @@
 })
 public class DataflowRunner extends PipelineRunner<DataflowPipelineJob> {

+  /** Experiment to "disable unbounded pcollection checks turning batch into streaming". */
+  public static final String DISABLE_UNBOUNDED_PCOLLECTION_CHECKS_TURNING_BATCH_INTO_STREAMING =
+      "disable_unbounded_pcollection_checks_turning_batch_into_streaming";
```
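A sketch of how a runner could consult this experiment before upgrading a batch job to streaming. This is a stdlib-only toy, not the actual DataflowRunner logic; in Beam, experiments are plain strings carried on the pipeline options (see `ExperimentalOptions`), modeled here as a simple list.

```java
import java.util.List;

// Toy sketch (not the real DataflowRunner): gate the batch-to-streaming
// upgrade on the presence of the experiment string.
public class StreamingUpgradeGate {
  // Flag name as introduced in this PR (it was later renamed during review).
  static final String DISABLE_CHECKS =
      "disable_unbounded_pcollection_checks_turning_batch_into_streaming";

  /** Returns true if the batch job should be silently upgraded to streaming mode. */
  static boolean shouldUpgradeToStreaming(
      boolean hasUnboundedPCollections, List<String> experiments) {
    if (experiments.contains(DISABLE_CHECKS)) {
      // Developer opted out: keep batch mode even with unbounded inputs.
      return false;
    }
    return hasUnboundedPCollections;
  }

  public static void main(String[] args) {
    // Default behaviour: an unbounded input forces streaming.
    System.out.println(shouldUpgradeToStreaming(true, List.of()));               // true
    // With the experiment: batch mode is kept, at the developer's own risk.
    System.out.println(shouldUpgradeToStreaming(true, List.of(DISABLE_CHECKS))); // false
  }
}
```

When the experiment is absent the behaviour is unchanged, which matches the claim in the discussion that the flag is strictly opt-in.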
"unsafely_attempt_to_process_unbounded_data_in_batch_mode"
Done.
(Btw my bad, totally missed this in the previous comment due to the Kafka separation.)
Of course I couldn't do it from the GitHub UI. NOW done. (80a19be)
ping?
@kennknowles - could someone else help with this review?
ping?
Thanks for working through these changes. LGTM. There is a git conflict but I assume it will be immaterial. I also want to ping a few people so they know about this since it could come up in the future. @reuvenlax @chamikaramj @lukecwik. I am not 100% certain of the lifespan of this experiment, or if others on the team will have other opinions. It is a small change in code but a big UX thing.
…ng_batch_into_streaming
# Conflicts:
#   sdks/java/io/kafka/src/test/java/org/apache/beam/sdk/io/kafka/ReadFromKafkaDoFnTest.java
The conflict was only about separate test cases that were added at the same location. Nothing relevant.
…ng_batch_into_streaming
org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImplTest.testInsertWithinRowCountLimits
Run Java PreCommit
Flaky again, not sure if I can trigger it or not, so let's see
Run Java PreCommit
…ng_batch_into_streaming
Finally all green :D
…ction checks, allowing batch execution over unbounded PCollections (apache#16773)" This reverts commit e95ef97.
…ction checks, allowing batch execution over unbounded PCollections (apache#16773)" This reverts commit e95ef97. (cherry picked from commit 25a3e56)
There are IOs where the previous logic detected them as unbounded and therefore enforced streaming behaviour, even when the pipeline itself should be considered batch. For example, KafkaIO with `withStopReadTime` always launches as streaming, even though it actually reads a limited amount of data. Until proper detection has been implemented, this is a workaround that lets developers who know when a pipeline should actually be batch override the built-in streaming enforcement.
See #15951 for further context and the original "trigger" for this contribution.