[BEAM-7195] Fix BQ BatchLoads not creating new tables issue #12968

Closed


@guangstick guangstick commented Sep 29, 2020

In WriteTables and WriteRename, the createDisposition is set to CREATE_NEVER when the pane index is greater than 0.

if (c.pane().getIndex() > 0 && !tempTable) {
  // If writing directly to the destination, then the table is created on the first write
  // and we should change the disposition for subsequent writes.
  writeDisposition = WriteDisposition.WRITE_APPEND;
  createDisposition = CreateDisposition.CREATE_NEVER;
} else if (tempTable) {

WriteDisposition writeDisposition =
    (c.pane().getIndex() == 0) ? firstPaneWriteDisposition : WriteDisposition.WRITE_APPEND;
CreateDisposition createDisposition =
    (c.pane().getIndex() == 0) ? firstPaneCreateDisposition : CreateDisposition.CREATE_NEVER;

Before actually triggering BQ load, BatchLoads groups all temporary files by a singleton key.
results
    .apply("AttachSingletonKey", WithKeys.of((Void) null))
    .setCoder(
        KvCoder.of(VoidCoder.of(), WriteBundlesToFiles.ResultCoder.of(destinationCoder)))
    .apply("GroupOntoSingleton", GroupByKey.create())
    .apply("ExtractResultValues", Values.create())

This is a problem when using DynamicDestinations in a streaming pipeline with triggers, where users sometimes want new tables to be created partway through the pipeline based on the data or the time. Because of GroupOntoSingleton, all BQ destinations are grouped under the same key (always null, of type Void) and everything is in the global window. The pane index therefore keeps increasing across trigger firings, so a destination that first appears after the first pane gets CREATE_NEVER and its table is not created, even though the user specified CreateDisposition = CREATE_IF_NEEDED.

Proposed solution: instead of grouping everything onto a singleton key, group by BigQuery destination so that each new destination gets a pane with index == 0.
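To make the difference concrete, here is a minimal plain-Java sketch of the reasoning above. It is a toy model, not Beam code: it assumes one destination per trigger firing and one pane counter per grouping key, and the class name, method names, and table names are all made up for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Toy model (not Beam code): every grouping key keeps its own pane counter. */
final class PaneIndexSketch {

  /**
   * Replays a sequence of trigger firings (assumed: one destination per firing) and
   * returns, for each destination, the pane index at which it was first seen.
   */
  static Map<String, Long> firstPaneIndexPerDestination(
      List<String> firings, Function<String, String> keyFn) {
    Map<String, Long> firingsPerKey = new HashMap<>(); // per-key pane counter
    Map<String, Long> firstSeen = new LinkedHashMap<>();
    for (String destination : firings) {
      String key = keyFn.apply(destination);
      // merge() returns the updated count, so the zero-based pane index is count - 1.
      long paneIndex = firingsPerKey.merge(key, 1L, Long::sum) - 1;
      firstSeen.putIfAbsent(destination, paneIndex);
    }
    return firstSeen;
  }

  /** Mirrors the ternary quoted above: only pane 0 keeps the user's create disposition. */
  static String effectiveCreateDisposition(long paneIndex, String userDisposition) {
    return paneIndex == 0 ? userDisposition : "CREATE_NEVER";
  }

  public static void main(String[] args) {
    // Hypothetical daily tables: the second table only appears on the third firing.
    List<String> firings =
        Arrays.asList("events_20200928", "events_20200928", "events_20200929");

    Map<String, Long> singleton = firstPaneIndexPerDestination(firings, d -> "singleton");
    Map<String, Long> perDestination = firstPaneIndexPerDestination(firings, d -> d);

    for (String table : singleton.keySet()) {
      System.out.println(
          table
              + " singleton key: "
              + effectiveCreateDisposition(singleton.get(table), "CREATE_IF_NEEDED")
              + ", destination key: "
              + effectiveCreateDisposition(perDestination.get(table), "CREATE_IF_NEEDED"));
    }
  }
}
```

In this model the second table first shows up at pane index 2 under the singleton key, so it is written with CREATE_NEVER, while keying by destination gives it pane index 0 and keeps CREATE_IF_NEEDED.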


@@ -74,7 +74,7 @@
* The result of the {@link WriteBundlesToFiles} transform. Corresponds to a single output file,
* and encapsulates the table it is destined to as well as the file byte size.
*/
- static final class Result<DestinationT> implements Serializable {
+ public static final class Result<DestinationT> implements Serializable {
Author

I'm open to using getters or @AutoValue here.
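For reference, the getter option could look roughly like the sketch below. This is only an illustration under assumptions: the class name ResultSketch is hypothetical, and the three fields are inferred from the Javadoc in the hunk (an output file, its byte size, and its destination table) rather than copied from the actual class.

```java
import java.io.Serializable;
import java.util.Objects;

/** Hypothetical getter-based value class; field names are assumed, not taken from Beam. */
final class ResultSketch<DestinationT> implements Serializable {
  private final String filename;
  private final long fileByteSize;
  private final DestinationT destination;

  ResultSketch(String filename, long fileByteSize, DestinationT destination) {
    this.filename = Objects.requireNonNull(filename, "filename");
    this.fileByteSize = fileByteSize;
    this.destination = Objects.requireNonNull(destination, "destination");
  }

  public String getFilename() {
    return filename;
  }

  public long getFileByteSize() {
    return fileByteSize;
  }

  public DestinationT getDestination() {
    return destination;
  }

  // Value semantics: two results are equal when all three fields match.
  @Override
  public boolean equals(Object o) {
    if (this == o) {
      return true;
    }
    if (!(o instanceof ResultSketch)) {
      return false;
    }
    ResultSketch<?> other = (ResultSketch<?>) o;
    return fileByteSize == other.fileByteSize
        && filename.equals(other.filename)
        && destination.equals(other.destination);
  }

  @Override
  public int hashCode() {
    return Objects.hash(filename, fileByteSize, destination);
  }
}
```

Private fields with getters keep the class free to change its representation later; @AutoValue would generate equivalent accessors, equals, and hashCode from an abstract class.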

@guangstick
Author

guangstick commented Sep 29, 2020

R: @reuvenlax @chamikaramj

Question: will the PTransform renaming cause compatibility issues when users update their GCP Dataflow pipelines?

@guangstick guangstick changed the title [BEAM-7195] Fix BQ BatchLoad not creating new tables issue [BEAM-7195] Fix BQ BatchLoads not creating new tables issue Sep 29, 2020
@reuvenlax
Contributor

So are you saying that the Reshuffle in writeTempTables is inheriting the pane index from this singleton GBK? I thought that Reshuffle should be generating brand-new pane indices starting from 0.

@reuvenlax
Contributor

Ping.

@reuvenlax
Contributor

FYI - I think that this will fix the bug. Can you add a unit test that exposes the prior bug and is fixed now?

@guangstick
Author

> R: @reuvenlax @chamikaramj
>
> Question: will the PTransform renaming cause compatibility issues when users update their GCP Dataflow pipelines?

Thanks for the review, @reuvenlax. Yes, I can add a unit test for this. Could you also take a look at this question?

@reuvenlax
Contributor

Yes, this will break update. At this point I'm not sure if we have another option here. The old code was based on an incorrect assumption that Reshuffle generates pane indices, when actually Reshuffle simply forwards the pane indices from the previous GBK.

@stale

stale bot commented Dec 19, 2020

This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@beam.apache.org list. Thank you for your contributions.

@stale stale bot added the stale label Dec 19, 2020
@aaltay
Member

aaltay commented Jan 14, 2021

Do we want to merge this? Should this be closed?

Or do we not want to do this because it will break updates?

@stale stale bot removed the stale label Jan 14, 2021
@chamikaramj
Contributor

Retest this please

@chamikaramj
Contributor

Run Dataflow ValidatesRunner

@chamikaramj
Contributor

Run Java PostCommit

@chamikaramj
Contributor

Run Java PreCommit

@chamikaramj
Contributor

Run Java PostCommit

@chamikaramj
Contributor

Run Java PostCommit

@reuvenlax
Contributor

I've been looking at this, and I don't think that this PR works in the temp-table code path. Looking to see if there's an easy way to fix this.

@reuvenlax
Contributor

I tried to fix these issues in pr/14238

@lukecwik lukecwik closed this Sep 29, 2021