[BEAM-2879] Support writing data to BigQuery via avro #9665
This change enhances BigQueryIO.Write to support writing Avro files rather than JSON when using FILE_LOADS (STREAMING_INSERTS is unchanged).
Preliminary results look good: the more CPU-constrained a job is, the larger the gain from Avro.
My test dataset is a typical workload of ours: around 2 billion records (~130 GB serialized) representing the result of a combine. My tests read these records from GCS and wrote them to BigQuery. The jobs ran on Dataflow with 150 x n1-standard-2 workers.
Post-Commit Tests Status (on master branch)
Pre-Commit Tests Status (on master branch)
See .test-infra/jenkins/README for the trigger phrases, statuses, and links of all Jenkins jobs.
pabloem left a comment
Thanks Steve! I've mostly gone through the changes to internal functionality, and they make sense to me. FWIW, I think you're on the right track.
Now I'm just trying to think about the user-facing API change. I'm also wondering how to make it easier to support with Beam schemas.
This is just a brain dump of what I'm thinking...
I wonder whether we need the
As for supporting Beam schemas + avro files, one could have a
Another option is to have
Overall, I like using
Thanks for the thoughts! My comments inline
I really went back and forth on this a few times. We could use
I do hate the name, though; if you can think of anything better, I'd love to rename this!
Yeah, I struggled with this as well. The only thing stopping us from having a version that supports Beam schemas is the interface.
I'd be up for adding that in a follow-up PR. I also have some ideas around it.
chamikaramj left a comment
Thanks. Looks great to me.
To make sure I understood correctly: this will not be enabled for existing users by default, and to enable it users have to specify withAvroFormatFunction(), correct?
Also, can we add a version of BigQueryIOIT so that we can continue to monitor both the Avro- and JSON-based BQ write transforms?
Correct. With schemas I think we could enable this transparently, but for now it's opt-in only.
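To illustrate the opt-in path being discussed, here is a minimal sketch of what a caller might write. This is an assumption-laden example, not code from this PR: `WeatherRecord` and its fields are hypothetical, and the `AvroWriteRequest` accessor names (`getElement()`, `getSchema()`) are inferred from the discussion.

```java
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.io.gcp.bigquery.AvroWriteRequest;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;

// Hypothetical sketch: opting in to Avro-based file loads.
// Without withAvroFormatFunction, the existing JSON path is used.
records.apply(
    BigQueryIO.<WeatherRecord>write()
        .to("my-project:my_dataset.my_table")
        .withSchema(tableSchema)
        .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
        // Opt in to Avro: convert each element to a GenericRecord using
        // the Avro schema the sink derives from the table schema.
        .withAvroFormatFunction(
            (AvroWriteRequest<WeatherRecord> request) -> {
              GenericRecord record = new GenericData.Record(request.getSchema());
              record.put("station", request.getElement().station);
              record.put("temperature", request.getElement().temperature);
              return record;
            }));
```

Since the format function is only consulted for FILE_LOADS, STREAMING_INSERTS jobs are unaffected, which matches the PR description above.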
Yeah, I can add that in there.
coolio, fwiw, the contribution guide is ambiguous wrt who should do the squashing.
Unfortunately, it looks like BigQueryIOIT is a recently added test that is currently not captured by any of the test suites.
@steveniemitz Will you be able to add a test that writes to BQ using Avro to the BigQueryTornadoesIT that is captured by the Beam Java PostCommit test suite ?
@mwalenia can you comment on the status of BigQueryIOIT ?