BigQueryIO write is slow/fail with a bounded source #18421

@kennknowles

Description

The BigQueryIO writer is slow or fails if the input source is bounded.

EDIT: Input BQ: 294 GB, 741,896,827 events

If the input source is bounded (GCS, a BQ select, ...), the BigQueryIO writer uses Method.FILE_LOADS instead of streaming inserts.

Large amounts of input data result in a java.lang.OutOfMemoryError (Java heap space) at around 500 million rows.

(Attachment: PrepareWrite.BatchLoads.png — thumbnail of the PrepareWrite/BatchLoads pipeline step.)

We cannot use "Method.STREAMING_INSERTS" or control the batch sizes, since
withMaxFilesPerBundle is private :(
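
For reference, the method selection described above can be overridden explicitly in newer Beam releases where withMethod(...) is public. This is a hedged sketch, not part of the original report; the project ID, dataset, and table name are hypothetical placeholders:

```java
// Sketch: forcing streaming inserts on a bounded source, assuming a Beam
// version where BigQueryIO.Write.withMethod(...) is publicly available.
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.values.PCollection;

public class ForceStreamingWrite {
  static void write(PCollection<TableRow> rows) {
    rows.apply(
        BigQueryIO.writeTableRows()
            .to("my-project:my_dataset.my_table") // hypothetical table spec
            // Override the default: bounded inputs otherwise fall back to
            // Method.FILE_LOADS, which batches everything through GCS files.
            .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS)
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
  }
}
```

Note that streaming inserts have their own quota and cost trade-offs, so this is a workaround rather than a general fix for the FILE_LOADS memory issue.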

Someone reported a similar problem with GCS -> BQ on Stackoverflow:
Why is writing to BigQuery from a Dataflow/Beam pipeline slow?

Imported from Jira BEAM-2840. Original Jira may contain additional context.
Reported by: vspiewak.
