
[BEAM-4601][SQL] Support BigQuery read from SQL. #5830

Merged: kennknowles merged 1 commit into apache:master from amaliujia:rui_wang-read_bigquery on Jul 2, 2018

Conversation

@amaliujia (Contributor) commented Jun 29, 2018:

Adds BigQuery read support to Beam SQL. Beam SQL can now read all SQL types that it can write.
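For context, with a BigQuery table provider registered, a read like this would be expressed from the SQL shell roughly as follows. This is an illustrative sketch, not from the PR: the exact DDL keywords vary across Beam versions, and the project, dataset, and table names here are made up.

```sql
-- Hypothetical table registration and read; names are illustrative.
CREATE EXTERNAL TABLE users (id BIGINT, name VARCHAR)
TYPE bigquery
LOCATION 'my-project:my_dataset.users';

SELECT id, name FROM users;
```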


Follow this checklist to help us incorporate your contribution quickly and easily:

  • Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

It will help us expedite review of your Pull Request if you tag someone (e.g. @username) to look at it.

Post-Commit Tests Status (on master branch)

Lang SDK Apex Dataflow Flink Gearpump Spark
Go Build Status --- --- --- --- ---
Java Build Status Build Status Build Status Build Status Build Status Build Status
Python Build Status --- Build Status
Build Status
--- --- ---

@amaliujia force-pushed the rui_wang-read_bigquery branch from 4e9683f to b4879e5 on June 29, 2018 08:14
@amaliujia (Contributor, Author):

R: @kennknowles
CC: @apilloud @akedin

@amaliujia force-pushed the rui_wang-read_bigquery branch from b4879e5 to fcd58d3 on June 29, 2018 17:34
@apilloud (Member) left a comment:

Haven't done a full review yet (it mostly looks good). Here are two early comments on code location.

* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.beam.sdk.extensions.sql.meta.provider.avro;
@apilloud (Member):

This is not the right location for this package. I'm not sure what the best location would be, but @reuvenlax says he knows.

@amaliujia (Contributor, Author) commented Jun 29, 2018:

I am putting this class here as a first step toward supporting Avro-format tables.

I am not sure whether utils that convert Avro to Beam Row should live somewhere else. I would also like to hear @reuvenlax's input.

return begin
.apply(
BigQueryIO.read(
new SerializableFunction<SchemaAndRecord, Row>() {
@apilloud (Member) commented Jun 29, 2018:

This should definitely live in sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryUtils.java

@amaliujia (Contributor, Author):

If I wrap it into BigQueryUtils.java, then AvroUtils.java should also move into the IO component, so that I don't add a dependency from IO on SQL (IO would need to call AvroUtils otherwise).

@apilloud (Member):

I believe AvroUtils.java is going to end up somewhere like org.apache.beam.sdk.schemas.utils, but until you hear from Reuven where it should go, the BigQuery IO is a good location.
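For reference, the diff hunk above shows the parse-function pattern: BigQueryIO.read(...) takes a serializable callback that converts each SchemaAndRecord into a Row. A minimal self-contained sketch of that pattern, with SerializableFn and FakeRecord as hypothetical stand-ins for Beam's SerializableFunction and SchemaAndRecord:

```java
import java.io.Serializable;

// Simplified sketch of the parse-function pattern used by BigQueryIO.read(...):
// the read takes a serializable callback that converts each source record into
// the pipeline's element type. SerializableFn is a hypothetical stand-in for
// Beam's org.apache.beam.sdk.transforms.SerializableFunction.
interface SerializableFn<InputT, OutputT> extends Serializable {
  OutputT apply(InputT input);
}

public class ParseFnSketch {
  // A toy "record": pretend it carries one string column.
  static final class FakeRecord {
    final String value;
    FakeRecord(String value) { this.value = value; }
  }

  // The parse function: record in, converted value out.
  static final SerializableFn<FakeRecord, Integer> PARSE_FN =
      record -> record.value.length();

  public static void main(String[] args) {
    System.out.println(PARSE_FN.apply(new FakeRecord("bigquery")));
  }
}
```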

@amaliujia force-pushed the rui_wang-read_bigquery branch from fcd58d3 to a11997e on June 29, 2018 19:15
@amaliujia (Contributor, Author) commented Jun 29, 2018:

I refactored the code a little to move it into the BigQuery IO. If there is a better place, I will be happy to refactor again.


/** Utils to help convert Apache Avro types to Beam types. */
public class AvroUtils {
  public static Object convertAvroFormat(Field beamField, Object value) throws RuntimeException {
Reviewer (Contributor):

so we don't support recursive rows or array types?

@amaliujia (Contributor, Author):

Yes, because I didn't find a nested structure on Schema.Field. To convert data in Avro format, the Beam type (INT32, etc.) has to be known, but Schema.Field does not provide a recursive structure to get that type information.

Reviewer (Contributor):

I'm not sure I understand - Schema.Field can be of type ROW, in which case the field has a nested schema.

@amaliujia (Contributor, Author):

Can we get at the STRING inside ROW&lt;ROW&lt;ROW&lt;STRING&gt;&gt;&gt; from Schema.Field?

@amaliujia (Contributor, Author):

Oh, actually you are right. I missed this:

    // For ROW types, returns the schema for the row.
    @Nullable
    public abstract Schema getRowSchema();

I can indeed get the nested schema/fields.

@kennknowles (Member):

It is better to get something working and then revise. So I think it is OK to start with non-nested data, though getting the nested stuff done is mandatory too.
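The recursive descent discussed above (using getRowSchema() to reach types nested inside ROW fields) could be sketched roughly as follows. FieldSketch here is a hypothetical, simplified stand-in for Beam's Schema.Field, not the real API:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified model of a schema field: either a primitive type
// name (e.g. "STRING", "INT64") or a ROW carrying a nested list of sub-fields,
// mirroring how Beam's Schema.Field exposes getRowSchema() for ROW fields.
public class NestedSchemaSketch {
  static final class FieldSketch {
    final String typeName;             // "STRING", "INT64", or "ROW"
    final List<FieldSketch> rowSchema; // non-null only when typeName is "ROW"

    FieldSketch(String typeName, List<FieldSketch> rowSchema) {
      this.typeName = typeName;
      this.rowSchema = rowSchema;
    }
  }

  // Recursively collect the primitive leaf types of a (possibly nested) field,
  // e.g. ROW<ROW<STRING, INT64>> yields [STRING, INT64].
  static List<String> leafTypes(FieldSketch field) {
    List<String> result = new ArrayList<>();
    if ("ROW".equals(field.typeName)) {
      for (FieldSketch sub : field.rowSchema) {
        result.addAll(leafTypes(sub)); // descend into the nested schema
      }
    } else {
      result.add(field.typeName);      // primitive leaf: record its type
    }
    return result;
  }

  public static void main(String[] args) {
    FieldSketch nested =
        new FieldSketch("ROW", Arrays.asList(
            new FieldSketch("ROW", Arrays.asList(
                new FieldSketch("STRING", null),
                new FieldSketch("INT64", null)))));
    System.out.println(leafTypes(nested)); // [STRING, INT64]
  }
}
```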

@amaliujia (Contributor, Author):

run java postcommit

@amaliujia (Contributor, Author):

@kennknowles can you take a look?

@kennknowles (Member):

Looking now.

@kennknowles (Member):

LGTM. Is there a JIRA about adding complex types?

@kennknowles kennknowles merged commit a58f1ff into apache:master Jul 2, 2018
@amaliujia amaliujia deleted the rui_wang-read_bigquery branch July 2, 2018 18:05
@amaliujia (Contributor, Author):

Yep, a JIRA was created to track it: https://issues.apache.org/jira/browse/BEAM-4710

@lukecwik (Member) commented Jul 6, 2018.

@apilloud (Member) commented Jul 6, 2018:

I think that is what #5892 is trying to address?
