
[BEAM-6392] Add support for the BigQuery read API to BigQueryIO. #7441

Merged: 1 commit into apache:master on Feb 14, 2019

Conversation

@kmjung (Contributor) commented Jan 8, 2019

This change adds support for the new BigQuery high-throughput read API to BigQueryIO. The initial commit supports only reading from existing tables.
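
For readers, a rough usage sketch (assuming the new read path is exposed as Method.DIRECT_READ on the read transform; the table reference below is hypothetical):

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.TypedRead.Method;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;

public class StorageApiReadSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Read an existing table through the BigQuery Storage (read) API instead of
    // the export-based path; DIRECT_READ selects the new code path.
    PCollection<TableRow> rows =
        p.apply(
            BigQueryIO.readTableRows()
                .from("my-project:my_dataset.my_table") // hypothetical table
                .withMethod(Method.DIRECT_READ));

    p.run().waitUntilFinish();
  }
}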


Follow this checklist to help us incorporate your contribution quickly and easily:

  • Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

It will help us expedite review of your Pull Request if you tag someone (e.g. @username) to look at it.

Post-Commit Tests Status (on master branch): build-status badge matrix for the Go, Java, and Python SDKs across the Apex, Dataflow, Flink, Gearpump, Samza, and Spark runners (badges not reproduced here).

@kmjung (Contributor, Author) commented Jan 8, 2019

cc: @chamikaramj

@kmjung force-pushed the bq_storage_read branch 3 times, most recently from 058a5c4 to 2ccd882 on January 23, 2019 at 22:40
@chamikaramj (Contributor)

cc: @reuvenlax @pedapudi

@kmjung force-pushed the bq_storage_read branch 2 times, most recently from 83baf95 to 4405cd3 on January 25, 2019 at 20:33
@chamikaramj (Contributor) left a comment

Thanks. Sorry about the delay in reviewing.

* This method formats a BigQuery TIME value into a String matching the format used by JSON
* export. Time records are stored in "microseconds since midnight" format.
*/
private static String formatTime(long timeMicros) {
chamikaramj (Contributor):

Just to clarify, why don't we always use the most precise version (ISO_LOCAL_TIME_FORMATTER_MICROS) here?

kmjung (Contributor, Author):

This is necessary in order to exactly match the string format used by BigQuery export, which always uses the least-precise format that can represent a value without loss of precision. The approach you propose likely wouldn't cause correctness issues for pipelines, but the integration test I've added wouldn't pass. :-)
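
For illustration, a minimal sketch of the rule described above (the formatter names here are hypothetical, not the ones in the PR): a TIME value stored as microseconds since midnight is rendered at seconds, milliseconds, or microseconds precision, whichever is the coarsest unit that represents it exactly.

import java.time.LocalTime;
import java.time.format.DateTimeFormatter;

class TimeFormatSketch {
  private static final DateTimeFormatter SECONDS = DateTimeFormatter.ofPattern("HH:mm:ss");
  private static final DateTimeFormatter MILLIS = DateTimeFormatter.ofPattern("HH:mm:ss.SSS");
  private static final DateTimeFormatter MICROS = DateTimeFormatter.ofPattern("HH:mm:ss.SSSSSS");

  static String formatTime(long timeMicros) {
    LocalTime time = LocalTime.ofNanoOfDay(timeMicros * 1000L);
    if (timeMicros % 1_000_000L == 0) {
      return SECONDS.format(time);  // e.g. 45296000000 -> "12:34:56"
    } else if (timeMicros % 1_000L == 0) {
      return MILLIS.format(time);   // e.g. 45296789000 -> "12:34:56.789"
    } else {
      return MICROS.format(time);   // full microsecond precision
    }
  }
}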

case "GEOGRAPHY":
// Avro will use a CharSequence to represent String objects, but it may not always use
// java.lang.String; for example, it may prefer org.apache.avro.util.Utf8.
verify(v instanceof CharSequence, "Expected CharSequence (String), got %s", v.getClass());
return v.toString();
case "DATE":
chamikaramj (Contributor):

Do we have any backwards-incompatible data type conversions for people who migrate pipelines from the export-based read transform to the read-API-based read transform? If so, we should try to minimize them, and any incompatibilities that are unavoidable should be clearly documented.

kmjung (Contributor, Author):

Depending on the caller's project context, export today may or may not produce Avro records identical to those from the read API. Going forward, we will standardize on the format used by the read API.

case "GEOGRAPHY":
// Avro will use a CharSequence to represent String objects, but it may not always use
// java.lang.String; for example, it may prefer org.apache.avro.util.Utf8.
verify(v instanceof CharSequence, "Expected CharSequence (String), got %s", v.getClass());
return v.toString();
case "DATE":
if (avroType == Type.INT) {
verify(v instanceof Integer, "Expected Integer, got %s", v.getClass());
chamikaramj (Contributor):

Is this path only hit for the read API? If so, please add a comment so that we don't lose the mapping.

kmjung (Contributor, Author):

No -- this code path (an Avro logical 'date' type for a BigQuery DATE record) can be reached in both the export and read API cases.
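
For context, a sketch of what that shared path amounts to (illustrative only, not the exact code in the PR): with an Avro 'date' logical type, the value arrives as an Integer counting days since the Unix epoch in both cases and is rendered as an ISO date string.

import java.time.LocalDate;

class DateConversionSketch {
  static String convertDate(Object v) {
    // Mirrors the verify(v instanceof Integer, ...) check in the quoted snippet.
    int epochDays = (Integer) v;
    return LocalDate.ofEpochDay(epochDays).toString(); // e.g. 17897 -> "2019-01-01"
  }
}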

chamikaramj (Contributor):

So are these changes (here and below) backwards-incompatible for the export-based read path? That would break our existing users.

chamikaramj (Contributor):

For the record (based on an offline chat), this change just generalizes the current behavior and does not break backwards compatibility.

case "TIME":
if (avroType == Type.LONG) {
verify(v instanceof Long, "Expected Long, got %s", v.getClass());
verifyNotNull(avroLogicalType, "Expected TimeMicros logical type");
chamikaramj (Contributor):

Ditto.

kmjung (Contributor, Author):

No -- this code path (an Avro logical 'time-micros' type for a BigQuery TIME record) can be reached in both the export and read API cases.

*/
EXPORT,

/** Read the contents of a table directly using the BigQuery storage API. */
chamikaramj (Contributor):

Storage API or read API? Can you also add a link, since this is pretty new?

kmjung (Contributor, Author):

The official name is the BigQuery Storage API. I'll add a link here when the docs are published. :-)

@Override
public long getEstimatedSizeBytes(PipelineOptions options) {
// The size of stream source can't be estimated due to server-side liquid sharding.
return 0L;
chamikaramj (Contributor):

Probably add a TODO to better support this in the future?

kmjung (Contributor, Author):

Done.
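
For reference, the requested change amounts to roughly the following sketch (not the exact diff):

@Override
public long getEstimatedSizeBytes(PipelineOptions options) {
  // TODO: provide a real estimate, e.g. from table statistics or from the
  // read session response. Server-side liquid sharding means an individual
  // stream source cannot report a meaningful size today.
  return 0L;
}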

@chamikaramj (Contributor) commented Feb 7, 2019

Please try to upgrade $generated_grpc_beta_version to 0.39.0 and use it for google_cloud_bigquery_storage_proto instead of using a separate version there.

Also, seems like there's a conflict now.

LGTM for the rest.

@kmjung force-pushed the bq_storage_read branch 2 times, most recently from d577308 to 85f6508 on February 11, 2019 at 22:17
@kmjung (Contributor, Author) commented Feb 12, 2019

@chamikaramj the change to update the GCP connector versions (#7783) has been merged, and I've resolved the conflict here.

@kmjung force-pushed the bq_storage_read branch 2 times, most recently from e645393 to 1d8c589 on February 12, 2019 at 21:18
@kmjung (Contributor, Author) commented Feb 12, 2019

Run Java PreCommit

Commit message:

This change adds new Source objects which support reading tables from
BigQuery using the new high-throughput read API.

It also modifies the Avro-to-JSON conversion code in the BigQuery
connector to support the Avro records generated by both the existing
export process and the new read API, and adds an integration test to
verify that the TableRows constructed using each method are equivalent.
@chamikaramj (Contributor)

LGTM. Will merge after tests pass.

Thanks.

@kmjung (Contributor, Author) commented Feb 13, 2019

@chamikaramj we should be good to go here.

@chamikaramj (Contributor)

Run Java PostCommit

@chamikaramj merged commit f6fdeaa into apache:master on Feb 14, 2019