Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BEAM-876] Support schemaUpdateOption in BigQueryIO #9524

Merged
merged 5 commits into from
Nov 25, 2019

Conversation

ziel
Copy link
Contributor

@ziel ziel commented Sep 9, 2019

[BEAM-876] Support schemaUpdateOption in BigQueryIO

This adds schemaUpdateOptions to BigQueryIO.Write so that one can specify these for BiqQuery when writing in batch mode.

Usage example:

Write<TableRow> writer = BigQueryIO.writeTableRows()
  .to("name-goes-here")
  .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
  .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
  .withSchemaUpdateOptions(EnumSet.of(BigQueryIO.Write.SchemaUpdate.Option.ALLOW_FIELD_ADDITION));

Implementation Notes

Hi, hello:
Hi all. I haven't contributed to this code base before, and am not super familiar with it. Style advice and such is super welcome.

Load vs Query Jobs in BigQuery:
BigQuery supports schema update side effects for load and query jobs. This implements support for load jobs only. In the context of Apache Beam, it doesn't seem like there's much utility to supporting queries which write to a table directly (as opposed to reading via a query and writing later in the pipeline). (Open to discussion/correction of course).

Avoiding Changing WriteTables Constructor:
I've added this to WriteTables with a withSchemaUpdateOptions method so as not to perturb the public constructor with an optional parameter. There's some awkwardness there -- but it seemed preferable to breaking the api or adding a secondary constructor.


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Choose reviewer(s) and mention them in a comment (R: @username).
  • Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

Post-Commit Tests Status (on master branch)

Lang SDK Apex Dataflow Flink Gearpump Samza Spark
Go Build Status --- --- Build Status --- --- Build Status
Java Build Status Build Status Build Status Build Status
Build Status
Build Status
Build Status Build Status Build Status
Build Status
Python Build Status
Build Status
Build Status
Build Status
--- Build Status
Build Status
Build Status --- --- Build Status
XLang --- --- --- Build Status --- --- ---

Pre-Commit Tests Status (on master branch)

--- Java Python Go Website
Non-portable Build Status Build Status Build Status Build Status
Portable --- Build Status --- ---

See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.

@ziel
Copy link
Contributor Author

ziel commented Sep 9, 2019

R: @lukecwik @chamikaramj

@chamikaramj
Copy link
Contributor

Thanks for the contribution.

R: @pabloem will you be able to review ?

@pabloem
Copy link
Member

pabloem commented Oct 4, 2019

Yes, I'll be glad to review.

Copy link
Member

@pabloem pabloem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution! BigQuery adds features often, and it's great to support them.
I think this looks fine to me, except for the one comment. Also, have you tried running this transform?

It may be good to have an integration test, just to be sure that it'll work fine.... do you know how to add one? I wouldn't say it's a must, but it would be amazing to have one.

@pabloem
Copy link
Member

pabloem commented Oct 5, 2019

Also, sorry about the delay. Don't hesitate to ping on the PR / email me.

@pabloem
Copy link
Member

pabloem commented Oct 23, 2019

Should I review once more?

@ziel
Copy link
Contributor Author

ziel commented Oct 23, 2019

I haven't had a chance to write the integration test yet... but was hoping to take a shot at it in the next week or so.

@pabloem
Copy link
Member

pabloem commented Oct 23, 2019

thanks!

@ziel
Copy link
Contributor Author

ziel commented Nov 7, 2019

@pabloem I found some time to write up an integration test.

I'm seeing a failing check (org.apache.beam.sdk.io.jms.JmsIOTest.testCheckpointMarkSafety) which I think may be unrelated? I may be missing the connection though :-S There are a number of lines like this in the output like: java.lang.SecurityException: User name [test_user] or password is invalid. ...so I suspect this may be some sort of CI setup thing..

@pabloem
Copy link
Member

pabloem commented Nov 11, 2019

Run Java PreCommit

@ziel
Copy link
Contributor Author

ziel commented Nov 12, 2019

Run Java PreCommit

@pabloem
Copy link
Member

pabloem commented Nov 13, 2019

Thanks for writing the test! That's going above and beyond : ) - the changes LGTM. I'll try to run the test to check.

@pabloem
Copy link
Member

pabloem commented Nov 13, 2019

r: @chamikaramj would you like to do an extra pass?

@pabloem
Copy link
Member

pabloem commented Nov 25, 2019

Okay, bringing in now. Thanks @ziel

@pabloem pabloem merged commit 5fd93af into apache:master Nov 25, 2019
11moon11 pushed a commit to 11moon11/beam that referenced this pull request Dec 12, 2019
* [BEAM-876] Support schemaUpdateOption in BigQueryIO

* Move schemaUpdateOptions into WriteTables constructor

* Add integration test for SchemaUpdateOption in BigQuery

* Fix for funkiness from rebase to master

* Spotless apply
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants