-
Notifications
You must be signed in to change notification settings - Fork 4.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BEAM-7195] Fix BQ BatchLoads not creating new tables issue #12968
Conversation
@@ -74,7 +74,7 @@ | |||
* The result of the {@link WriteBundlesToFiles} transform. Corresponds to a single output file, | |||
* and encapsulates the table it is destined to as well as the file byte size. | |||
*/ | |||
static final class Result<DestinationT> implements Serializable { | |||
public static final class Result<DestinationT> implements Serializable { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Open to use getter or use @AutoValue
here
Question: will the PTransform renaming cause compatibility issues when users update their GCP Dataflow pipelines? |
db11b7c
to
ca23d68
Compare
so are you saying that the Reshuffle in writeTempTables is inheriting the pane index from this singleton GBK? I though that Reshuffle should be generating brand-new pane indices starting from 0. |
Ping. |
FYI - I think that this will fix the bug. Can you add a unit test that exposes the prior bug and is fixed now? |
Thanks for the review @reuvenlax Yes, I can add a unit test for this. Could you also take a look at this question? |
Yes, this will break update. At this point I'm not sure if we have another option here. The old code was based on an incorrect assumption that Reshuffle generates pane indices, when actually Reshuffle simply forwards the pane indices from the previous GBK. |
This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@beam.apache.org list. Thank you for your contributions. |
Do we want to merge this? Should this be closed? Or do we not want to do this because it will break updates? |
Retest this please |
Run Dataflow ValidatesRunner |
Run Java PostCommit |
Run Java PreCommit |
Run Java PostCommit |
1 similar comment
Run Java PostCommit |
Been looking at this, and I don't think that this PR works in the the temp-table code path. Looking to see if there's an easy way to fix this. |
I tried to fix these issues in pr/14238 |
In WriteTables and WriteRename, the
createDisposition
is set toCREATE_NEVER
when the pane index is great than 0.beam/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/WriteTables.java
Lines 205 to 210 in ecedd3e
beam/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/WriteRename.java
Lines 145 to 148 in ecedd3e
Before actually triggering BQ load, BatchLoads groups all temporary files by a singleton key.
beam/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BatchLoads.java
Lines 308 to 313 in ecedd3e
This can be problematic when using DynamicDestinations in a streaming pipeline with triggers. Sometimes users would like to create tables in the middle of the pipeline based on the data or time. Because of the
GroupOntoSingleton
, different BQ destinations are grouped under the same key (alwaysVoid
) and everything is under a global window, therefore the pane index is always incremental and new tables won't be created even though users have specifiedCreateDisposition = CREATE_IF_NEEDED
Proposed Solution: Instead of grouping everything into a singleton, grouping by Bigquery destinations so that new destinations can have a pane with index == 0.
Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
R: @username
).[BEAM-XXX] Fixes bug in ApproximateQuantiles
, where you replaceBEAM-XXX
with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.CHANGES.md
with noteworthy changes.See the Contributor Guide for more tips on how to make review process smoother.
Post-Commit Tests Status (on master branch)
Pre-Commit Tests Status (on master branch)
See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.
GitHub Actions Tests Status (on master branch)
See CI.md for more information about GitHub Actions CI.