
[BEAM-4794] Move SQL and SQL Nexmark to the schema framework#5956

Merged
reuvenlax merged 10 commits into apache:master from reuvenlax:sql_uses_schemas
Jul 31, 2018

Conversation

@reuvenlax (Contributor) commented Jul 16, 2018

This PR moves SQL and Nexmark to the new schema framework.

Advantages:

  • SQL now works seamlessly on a PCollection as long as its user type has a schema (which is simple to provide for POJOs and Java Beans).
  • Row conversion for POJOs and Java Beans should be more efficient than the old version.
  • We now fully support recursive and array fields in POJOs and Java Beans; these did not work previously in SQL.
  • A large amount of code is deleted (InferredRowCoder, all the Nexmark ModelAdaptor code).
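The schema-inference idea behind the first bullet can be sketched with a toy reflection example. This is not Beam's actual implementation; the class and method names here (SchemaSketch, inferFieldNames, Bid) are invented for illustration only.

```java
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.List;

// Toy sketch (NOT Beam's implementation) of why POJO schemas are "simple":
// field names can be discovered by reflection, so a user type carries enough
// information to build a Row schema automatically.
public class SchemaSketch {

  // Stand-in for a user POJO such as a Nexmark model class.
  public static class Bid {
    public long auction;
    public long bidder;
    public long price;
  }

  // Collect the declared field names of a POJO class.
  static List<String> inferFieldNames(Class<?> clazz) {
    List<String> names = new ArrayList<>();
    for (Field f : clazz.getDeclaredFields()) {
      names.add(f.getName());
    }
    return names;
  }

  public static void main(String[] args) {
    // Beam's real inference is much richer: it also handles nested
    // (recursive) and array fields, per the third bullet above.
    System.out.println(inferFieldNames(Bid.class));
  }
}
```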

Note: this cannot be merged until schemas are enabled for all major runners (BEAM-4793). A PR is currently out for review that makes schemas work for all runners except the Gearpump runner.

R: @apilloud

@echauchot (Contributor)

This is cool to see the first (AFAIK) real-pipeline use of the schema PCollection framework!

@apilloud (Member) left a comment:


About half of the change is just reformatting so I might have missed something there, but LGTM.

      }
    }))
-   .setCoder(schema.getRowCoder());
+   .setSchema(schema, SerializableFunctions.identity(), SerializableFunctions.identity());

This pattern all over the place is kind of annoying. How about adding a setSchema(schema) function that uses the new of(Schema schema) so we don't have to repeat this pattern all over the place?
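A hedged sketch of what such a one-argument setSchema overload might look like. All names below (Schema, PCollectionSketch) are simplified stand-ins invented for this sketch, not Beam's actual API.

```java
import java.util.function.Function;

// Sketch of the shorthand suggested above: a setSchema(schema) overload that
// defaults both to/from Row conversions to identity, so call sites on an
// already-Row PCollection need not repeat SerializableFunctions.identity().
public class SetSchemaOverload {

  static class Schema {
    final String name;
    Schema(String name) { this.name = name; }
  }

  static class PCollectionSketch<T> {
    Schema schema;
    Function<T, T> toRow;
    Function<T, T> fromRow;

    // Existing, verbose form: the caller supplies both conversions.
    PCollectionSketch<T> setSchema(Schema s, Function<T, T> to, Function<T, T> from) {
      this.schema = s;
      this.toRow = to;
      this.fromRow = from;
      return this;
    }

    // Proposed shorthand: default both conversions to identity.
    PCollectionSketch<T> setSchema(Schema s) {
      return setSchema(s, Function.identity(), Function.identity());
    }
  }

  public static void main(String[] args) {
    PCollectionSketch<String> rows =
        new PCollectionSketch<String>().setSchema(new Schema("bid"));
    System.out.println(rows.toRow.apply("row"));  // identity: prints "row"
  }
}
```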

- public class BeamPCollectionTable extends BaseBeamTable {
-   private transient PCollection<Row> upstream;
+ public class BeamPCollectionTable<InputT> extends BaseBeamTable {
+   private transient PCollection<InputT> upstream;

Not your bug, but this should be final instead of transient I believe.

  LONG_CODER.encode(value.reserve, outStream);
- LONG_CODER.encode(value.dateTime, outStream);
+ INSTANT_CODER.encode(value.dateTime.toInstant(), outStream);
  LONG_CODER.encode(value.expires, outStream);

Thanks for cleaning this up!

@apilloud (Member)

Also, a good chunk of tests are failing. It would be good to run the Nexmark postcommits before this goes in.

@reuvenlax (Contributor, Author)

retest this please

      }
    }))
-   .setCoder(schema.getRowCoder());
+   .setSchema(schema, SerializableFunctions.identity(), SerializableFunctions.identity());
A reviewer (Contributor) commented:

+1, some shorthand version would be helpful. Maybe default to identity() and have a couple of overloads of setSchema() to allow customization? Or wire it up to schema registry and default to identity() there?

And is setSchema the right place to specify these transforms? Isn't it just conflating .apply(ToRow.withSchema())... .apply(ParDo.of(fromRow()))? My thought is that if the transforms are non-trivial, then they are probably better specified as ParDos.

@reuvenlax (Contributor, Author) replied:

These functions are not being applied here. All schemas in Beam must be registered with a conversion function to/from Row. These conversions are applied by Beam at the appropriate places (e.g. in the DoFn state machine). They can't really be specified as part of ParDo, because they are needed to implement ParDo behind the scenes!

In this case, it's clearly not needed as the type is already Row. Will add a simpler function for PCollection
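The point above, that the to/from Row conversions live in the framework rather than in user ParDos, can be sketched as follows. All names here (Row, Registration, frameworkConvert) are invented for illustration; this is not Beam's internal code.

```java
import java.util.function.Function;

// Illustrative sketch (invented names, not Beam internals): a schema
// registration bundles to/from Row conversion functions, and the *framework*
// applies them when an element crosses the typed/Row boundary, so user
// ParDos never have to.
public class RowConversionSketch {

  static class Row {
    final Object[] values;
    Row(Object... values) { this.values = values; }
  }

  static class Registration<T> {
    final Function<T, Row> toRow;
    final Function<Row, T> fromRow;
    Registration(Function<T, Row> toRow, Function<Row, T> fromRow) {
      this.toRow = toRow;
      this.fromRow = fromRow;
    }
  }

  // The framework converting behind the scenes, before user code sees the element.
  static <T> Row frameworkConvert(T element, Registration<T> reg) {
    return reg.toRow.apply(element);
  }

  public static void main(String[] args) {
    Registration<Long> reg = new Registration<>(
        price -> new Row(price),        // registered toRow conversion
        row -> (Long) row.values[0]);   // registered fromRow conversion
    Row row = frameworkConvert(42L, reg);
    System.out.println(reg.fromRow.apply(row));  // round-trips: prints 42
  }
}
```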

@reuvenlax (Contributor, Author)

run Dataflow ValidatesRunner

@reuvenlax (Contributor, Author)

run Flink ValidatesRunner

@reuvenlax (Contributor, Author)

run Spark ValidatesRunner

@reuvenlax (Contributor, Author)

run Java PostCommit

@reuvenlax (Contributor, Author)

UnboundedEventSourceTest.resumeFromCheckpoint failed in Post Commit, but appears to simply be flaky. That same test ran successfully in PreCommit, and also passed locally.

@reuvenlax reuvenlax merged commit 06128f2 into apache:master Jul 31, 2018
@reuvenlax reuvenlax deleted the sql_uses_schemas branch December 9, 2018 23:07