[BEAM-2676] move BeamSqlRow and BeamSqlRowType to sdk/java/core #3675

mingmxu · 2017-08-02T08:23:42Z

Created a new PR to avoid the large rebase that would be needed after #3666.

Following the discussion in BEAM-2676, the changes are outlined as:

BeamRecord and BeamRecordTypeProvider are marked as @Experimental.
BeamRecord is moved to sdk/core, with a default BeamRecordTypeProvider that defines type information; a BeamRecordCoder is also provided as its Coder.
In extensions/sql, BeamSqlRecord extends BeamRecord and BeamSqlRecordTypeProvider extends BeamRecordTypeProvider to support SQL types; a new Coder, BeamSqlRecordCoder, is provided to match.

mingmxu · 2017-08-02T08:23:59Z

R: @robertwb @takidau

coveralls · 2017-08-02T16:43:06Z

Changes Unknown when pulling 5b42e63 on XuMingmin:BEAM-2676_2 into apache:DSL_SQL.

robertwb

Thanks for doing this refactoring!

Some high level comments about the end state. Also, for ease of reviewing, could you separate out the bulk renaming from the rest of the (more interesting) changes? (If you want, you can rebase on robertwb@ab0ca82 ).

robertwb · 2017-08-03T03:56:38Z

sdks/java/core/src/main/java/org/apache/beam/sdk/coders/BeamRecordCoder.java

+      throws CoderException, IOException {
+    nullListCoder.encode(value.getNullFields(), outStream);
+    for (int idx = 0; idx < value.size(); ++idx) {
+      if (value.getNullFields().contains(idx)) {


Seems like there should be an isNull(idx) method.

Also, I wonder if this should be a BitSet instead.
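A minimal sketch of the two suggestions above, assuming a simplified stand-in for BeamRecord. The class `NullTrackingRecord` and its `setField` method are hypothetical names used only for illustration; only `isNull(idx)` and the BitSet come from the review comment.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Hypothetical stand-in for BeamRecord; only null tracking is modeled. */
class NullTrackingRecord {
  private final List<Object> values;
  private final BitSet nullFields;

  NullTrackingRecord(int size) {
    values = new ArrayList<>(size);
    nullFields = new BitSet(size);
    for (int i = 0; i < size; i++) {
      values.add(null);
    }
    // Every field starts out null until a value is set.
    nullFields.set(0, size);
  }

  /** Stores a value and keeps the null bit in sync with it. */
  void setField(int idx, Object value) {
    values.set(idx, value);
    nullFields.set(idx, value == null);
  }

  /** Replaces calls of the form getNullFields().contains(idx). */
  boolean isNull(int idx) {
    return nullFields.get(idx);
  }

  int size() {
    return values.size();
  }
}
```

With a method like this, the encode loop can ask `value.isNull(idx)` without exposing the underlying collection, and a BitSet lookup avoids the linear scan of `List.contains`.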

robertwb · 2017-08-03T03:59:32Z

...tensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/schema/BeamSqlRecordCoder.java

 */
-public class BeamSqlRowCoder extends CustomCoder<BeamSqlRow> {
-  private BeamSqlRowType tableSchema;
+public class BeamSqlRecordCoder extends CustomCoder<BeamSqlRecord> {


We don't need a separate BeamSqlRecordCoder, just have the BeamSqlRecordTypeProvider's Coder create a vanilla BeamRecordCoder with the right list of coders.

A BeamSqlRecordCoder is provided here because I couldn't find an existing Coder for short/float/Date/Boolean, so this Coder does some conversion. SerializableCoder doesn't fit because it's not deterministic.

It would be preferable (and more useful) to add such coders to the SDK rather than hide the conversions here in BeamSqlRecordCoder. If you wanted to avoid exposing these coders you can make them inner classes of BeamSqlRecordTypeProvider (with a TODO to consider making them top-level).

Makes sense to me; I don't want to bring the Coder tasks in here.

BTW, BeamRecordType would take this list of coders in its constructor so that BeamRecordCoder could be created given a BeamRecordType.

Yes, given ShortCoder/FloatCoder/..., BeamSqlRecordCoder can be created with BeamSqlRecordType:

BeamSqlRecordCoder of(BeamSqlRecordType type)

If we get rid of BeamSqlRecord (below) we can get rid of BeamSqlRecordCoder as BeamRecordCoder would just take any BeamRecordType as its parameter.

robertwb · 2017-08-03T04:03:19Z

...va/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/schema/BeamSqlRecord.java

-public abstract class BeamSqlRowType implements Serializable {
-  public abstract List<String> getFieldsName();
-  public abstract List<Integer> getFieldsType();
+public class BeamSqlRecord extends BeamRecord {


One of the main points of the proposal was to not have a separate Record type--just use a BeamRecord that happens to have a BeamSqlRecordTypeProvider dataType.

BeamSqlRecord is a helper class for SQL that avoids writing (BeamSqlRowType) BeamRecord.getDataType() everywhere. The Beam SQL code relies heavily on BeamSqlRecordTypeProvider, which has more functions than BeamRecordTypeProvider.

To avoid a new BeamSqlRecord, maybe I can just add the cast line as a utility function. Any thoughts?

If BeamSqlRecord is just a helper class, it shouldn't be in the public API. If that means a couple of casts to BeamSqlRecordTypeProvider then that's OK. However, the only methods I see on this class are validateValueType (rarely used) and getFieldsType.

Of course getFieldsType (should this be getFieldTypes?) is the interesting one. However, nearly every use of this method is to create a new BeamSqlRecordTypeProvider. It would actually be better to add methods on BeamRecordType that slice/concatenate/etc. to create new BeamRecordTypes (with overloads in BeamSqlRecordType to create BeamSqlRecordTypes).

Done right, simple SQL statements such as selections, joins, and projections should work on generic BeamRecords. Aggregations and comparisons would still (likely) require knowing the specific types (as SQL Types) and hence an actual BeamSqlRecordType, but that's OK as we're already in the domain of throwing errors if it's not the right type (and a generic BeamRecordType schema would always throw a "wrong type" error.) We don't, of course, have to get this working now.
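The slice/concatenate idea can be sketched roughly as follows. This is an illustration under assumptions, not the API that was merged: `RecordTypeSketch`, `slice`, and `concat` are hypothetical names, and field types are modeled as plain integer codes, matching the `List<Integer>` returned by getFieldsType in the diff.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical stand-in for BeamRecordType: parallel lists of field names and type codes. */
class RecordTypeSketch {
  private final List<String> fieldNames;
  private final List<Integer> fieldTypes;

  RecordTypeSketch(List<String> fieldNames, List<Integer> fieldTypes) {
    if (fieldNames.size() != fieldTypes.size()) {
      throw new IllegalArgumentException("names and types must have the same length");
    }
    this.fieldNames = Collections.unmodifiableList(new ArrayList<>(fieldNames));
    this.fieldTypes = Collections.unmodifiableList(new ArrayList<>(fieldTypes));
  }

  List<String> getFieldNames() {
    return fieldNames;
  }

  List<Integer> getFieldTypes() {
    return fieldTypes;
  }

  /** New type made of fields [from, to), e.g. for a projection. */
  RecordTypeSketch slice(int from, int to) {
    return new RecordTypeSketch(fieldNames.subList(from, to), fieldTypes.subList(from, to));
  }

  /** New type made of this type's fields followed by the other's, e.g. for a join. */
  RecordTypeSketch concat(RecordTypeSketch other) {
    List<String> names = new ArrayList<>(fieldNames);
    names.addAll(other.fieldNames);
    List<Integer> types = new ArrayList<>(fieldTypes);
    types.addAll(other.fieldTypes);
    return new RecordTypeSketch(names, types);
  }
}
```

With helpers of this shape, callers that currently read the type list only to build a new type could derive it directly, which is what would remove most of the casts.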

The naming still seems to be in flux. I'd drop the Provider suffix and get rid of Row everywhere. The suite of types would then be:

BeamRecord

BeamRecordCoder

BeamSqlRecordType extends BeamRecordType

BTW, in the short term, I'm OK with just adding lots of casts with a TODO to do this properly with polymorphism to unblock getting this in.

Let's merge the discussion of BeamSqlRecord and BeamSqlRecordCoder.

A BeamSqlRecordHelper class would be introduced, with two methods:

public static BeamSqlRecordType getSqlRecordType(BeamRecord);
public static BeamRecordCoder getSqlRecordCoder(BeamSqlRecordType);

I don't like casts scattered through the code; since we know a cast is required, let's centralize it so it is better documented.

As mentioned above, I'll prepare some inner Coders to cover the SQL types, and create a separate task to expose them in sdk/core.

Thanks for the naming suggestion. I'll apply it after we have cleared all the questions, since even a simple rename touches hundreds of lines.

coveralls · 2017-08-03T08:12:45Z

Changes Unknown when pulling 129ae96 on XuMingmin:BEAM-2676_2 into apache:DSL_SQL.

coveralls · 2017-08-03T08:18:24Z

Changes Unknown when pulling 129ae96 on XuMingmin:BEAM-2676_2 into apache:DSL_SQL.

coveralls · 2017-08-03T09:09:56Z

Changes Unknown when pulling 129ae96 on XuMingmin:BEAM-2676_2 into apache:DSL_SQL.

mingmxu · 2017-08-03T19:18:08Z

As discussed, I removed BeamSqlRecord, and now there are only:

BeamRecord
BeamRecordCoder
BeamSqlRecordType extends BeamRecordType

and BeamSqlRecordHelper to handle the type cast and the coder in SQL.

Sorry for the large diff; in practice it's not feasible to separate out the renaming step.

Let's finish this first, then I'll move on to the windowInfo fields.

robertwb

Phew...looking good! Agree that the window info stuff should be a separate PR; filed BEAM-2722.

The main remaining point is that I think BeamRecordType should hold a list of Coders, and as such can provide a BeamRecordCoder, which will simplify things. Other than that just minor comments.

Also, having read the whole thing, could you try and build on top of these commits now :).

robertwb · 2017-08-03T19:33:25Z

sdks/java/core/src/main/java/org/apache/beam/sdk/coders/BeamRecordCoder.java

+  @Override
+  public void verifyDeterministic()
+      throws org.apache.beam.sdk.coders.Coder.NonDeterministicException {
+  }


Recursively call on all members of coderArray.

+1
I happened to notice that DoubleCoder is not deterministic.
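The recursive check can be sketched as below. To keep the example self-contained it does not use the Beam classes: `SketchCoder`, `SketchRecordCoder`, and `NonDeterministicSketchException` are hypothetical stand-ins for Coder, BeamRecordCoder, and Coder.NonDeterministicException, and the real signatures may differ.

```java
import java.util.List;

/** Hypothetical stand-in for Coder.NonDeterministicException. */
class NonDeterministicSketchException extends Exception {
  NonDeterministicSketchException(String message) {
    super(message);
  }
}

/** Hypothetical minimal coder interface; only the determinism check is modeled. */
interface SketchCoder {
  void verifyDeterministic() throws NonDeterministicSketchException;
}

/** A record coder is deterministic only if every one of its field coders is. */
class SketchRecordCoder implements SketchCoder {
  private final List<SketchCoder> fieldCoders;

  SketchRecordCoder(List<SketchCoder> fieldCoders) {
    this.fieldCoders = fieldCoders;
  }

  @Override
  public void verifyDeterministic() throws NonDeterministicSketchException {
    // Delegate to each member; the first non-deterministic coder aborts the check.
    for (SketchCoder coder : fieldCoders) {
      coder.verifyDeterministic();
    }
  }
}
```

Under this scheme a record type containing a field whose coder is not deterministic (such as the DoubleCoder case mentioned above) fails the check instead of silently passing through an empty method body.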

robertwb · 2017-08-03T19:37:15Z

sdks/java/core/src/main/java/org/apache/beam/sdk/values/BeamRecord.java

@@ -174,7 +137,7 @@ public boolean getBoolean(String fieldName) {
  }

  public Object getFieldValue(int fieldIdx) {
-    if (nullFields.contains(fieldIdx)) {
+    if (nullFields.get(fieldIdx)) {
      return null;


Curious, is dataValues.get(fieldIdx) already null here?

+1, this is needless.

robertwb · 2017-08-03T19:38:27Z

sdks/java/core/src/main/java/org/apache/beam/sdk/values/BeamRecord.java

    return dataType;
  }

-  public void setDataType(BeamSqlRowType dataType) {
+  public void setDataType(BeamRecordType dataType) {


Why is this ever needed? Seems it should be set at creation, never changed.

robertwb · 2017-08-03T19:38:57Z

sdks/java/core/src/main/java/org/apache/beam/sdk/values/BeamRecord.java

  }

-  public List<Integer> getNullFields() {
-    return nullFields;
+  public void setNullFields(BitSet nullFields) {


Similarly, couldn't this always be inferred based on what fields were set?

Actually, is there any reason to explicitly store this set, rather than have getNullFields() implicitly compute it based on what is actually null at the time? (Happy with deferring to a future JIRA.)

I would keep a BitSet to indicate null fields. It's useful when encoding/decoding.

But other than during encoding/decoding there's no reason to keep it around, right?

I think so; the information is required for decoding because we skip null values when encoding.
Calculating it on the fly is also possible; it's a trade-off between a loop scan and storing a BitSet.

Yes, we must store it in the encoded representation. But between the choice of setting it to all-true in the constructor and updating it as fields are mutated (we're not doing this correctly btw, see setDataValues and possibly elsewhere) vs. computing it on decode, I prefer the locality of the latter as it's just as cheap (and really cheap compared to the actual serialization), but your call.
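A rough sketch of the "compute it at encode/decode time" option, assuming a record is simplified to a list of nullable strings. `NullMaskCodecSketch` and its wire layout (field count, mask length, mask bytes, then non-null values) are illustrative assumptions, not the format BeamRecordCoder uses.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Hypothetical codec: the null mask lives only in the encoded form, not in the record. */
class NullMaskCodecSketch {

  static byte[] encode(List<String> fields) throws IOException {
    // Derive the mask from the values that are actually null right now.
    BitSet nulls = new BitSet(fields.size());
    for (int i = 0; i < fields.size(); i++) {
      if (fields.get(i) == null) {
        nulls.set(i);
      }
    }
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(bytes);
    byte[] mask = nulls.toByteArray();
    out.writeInt(fields.size());
    out.writeInt(mask.length);
    out.write(mask);
    // Null fields are skipped; the mask tells the decoder which ones.
    for (int i = 0; i < fields.size(); i++) {
      if (!nulls.get(i)) {
        out.writeUTF(fields.get(i));
      }
    }
    out.flush();
    return bytes.toByteArray();
  }

  static List<String> decode(byte[] data) throws IOException {
    DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
    int size = in.readInt();
    byte[] mask = new byte[in.readInt()];
    in.readFully(mask);
    BitSet nulls = BitSet.valueOf(mask);
    List<String> fields = new ArrayList<>(size);
    for (int i = 0; i < size; i++) {
      // A set bit means no value was written for this field.
      fields.add(nulls.get(i) ? null : in.readUTF());
    }
    return fields;
  }
}
```

Because the mask is rebuilt from the values on every encode, there is no stored BitSet that setters have to keep in sync, which is the locality argument made above; the cost is one extra pass over the fields per encode.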

robertwb · 2017-08-03T19:39:27Z

sdks/java/core/src/main/java/org/apache/beam/sdk/coders/BeamRecordCoder.java

+    BitSet nullFields = nullListCoder.decode(inStream);
+
+    BeamRecord record = new BeamRecord(recordType);
+    record.setNullFields(nullFields);


(See other comment.) Isn't this inferred by the setting (or not) below?

robertwb · 2017-08-03T20:12:37Z

sdks/java/core/src/main/java/org/apache/beam/sdk/coders/BeamRecordCoder.java

+      throws CoderException, IOException {
+    nullListCoder.encode(value.getNullFields(), outStream);
+    for (int idx = 0; idx < value.size(); ++idx) {
+      if (value.getNullFields().get(idx)) {


use isNull here

robertwb · 2017-08-03T20:18:08Z

...va/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rel/BeamSortRel.java

      for (int i = 0; i < fieldsIndices.size(); i++) {
        int fieldIndex = fieldsIndices.get(i);
        int fieldRet = 0;
-        SqlTypeName fieldType = CalciteUtils.getFieldType(row1.getDataType(), fieldIndex);
+        SqlTypeName fieldType = CalciteUtils.getFieldType(


Based on the implementation below, couldn't you just call row1.get(fieldIndex).compareTo(row2.get(fieldIndex)) iff they're instances of Comparable, and raise UnsupportedOperationException otherwise?

Right, the existing types are all comparable.

Any need to query fieldIndex at all, vs instanceof Comparable?
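The suggestion can be sketched as a comparator that never looks up the SQL type. This is an illustration only: rows are modeled as `List<Object>`, `FieldComparatorSketch` is a hypothetical name, and null ordering and ascending/descending direction are deliberately left out.

```java
import java.util.Comparator;
import java.util.List;

/** Hypothetical comparator over rows modeled as List<Object>; no SQL type lookup. */
class FieldComparatorSketch implements Comparator<List<Object>> {
  private final List<Integer> fieldIndices;

  FieldComparatorSketch(List<Integer> fieldIndices) {
    this.fieldIndices = fieldIndices;
  }

  @Override
  public int compare(List<Object> row1, List<Object> row2) {
    // Compare on each sort field in order; the first difference decides.
    for (int fieldIndex : fieldIndices) {
      int result = compareField(row1.get(fieldIndex), row2.get(fieldIndex));
      if (result != 0) {
        return result;
      }
    }
    return 0;
  }

  @SuppressWarnings({"unchecked", "rawtypes"})
  static int compareField(Object a, Object b) {
    if (a instanceof Comparable && b instanceof Comparable) {
      // May still throw ClassCastException if the two values have unrelated types.
      return ((Comparable) a).compareTo(b);
    }
    throw new UnsupportedOperationException("Field values are not Comparable");
  }
}
```

Since `instanceof` is false for null, a null field falls through to the UnsupportedOperationException here; a real implementation would need to decide where nulls sort before reaching this check.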

robertwb · 2017-08-03T20:20:33Z

...c/main/java/org/apache/beam/sdk/extensions/sql/impl/transform/BeamAggregationTransforms.java

@@ -119,21 +120,21 @@ public BeamSqlRow apply(BeamSqlRow input) {
      return keyOfRecord;
    }

-    private BeamSqlRowType exTypeOfKeyRecord(BeamSqlRowType dataType) {
+    private BeamSqlRecordType exTypeOfKeyRecord(BeamSqlRecordType dataType) {


Reference BEAM-2721 here.

robertwb · 2017-08-03T20:21:26Z

.../sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/transform/BeamJoinTransforms.java

      // build the type
      // the name of the join field is not important
      List<String> names = new ArrayList<>(joinColumns.size());
      List<Integer> types = new ArrayList<>(joinColumns.size());
      for (int i = 0; i < joinColumns.size(); i++) {
        names.add("c" + i);
        types.add(isLeft
-            ? input.getDataType().getFieldsType().get(joinColumns.get(i).getKey()) :
-            input.getDataType().getFieldsType().get(joinColumns.get(i).getValue()));
+            ? BeamSqlRecordHelper.getSqlRecordType(input).getFieldsType()


Reference BEAM-2721 here.

Ping on these two.

robertwb · 2017-08-03T20:22:23Z

.../sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/transform/BeamJoinTransforms.java

-    types.addAll(leftRow.getDataType().getFieldsType());
-    types.addAll(rightRow.getDataType().getFieldsType());
-    BeamSqlRowType type = BeamSqlRowType.create(names, types);
+    types.addAll(BeamSqlRecordHelper.getSqlRecordType(leftRow).getFieldsType());


Reference BEAM-2721 here

mingmxu · 2017-08-03T22:34:35Z

Created a separate PR, BEAM-2723, for the windowInfo fields; I'll do it after this.

robertwb · 2017-08-03T23:53:39Z

...va/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rel/BeamSortRel.java

@@ -243,7 +230,7 @@ public BeamSqlRowComparator(List<Integer> fieldsIndices,
    }
  }

-  public static <T extends Number & Comparable> int numberCompare(T a, T b) {
+  public static <T extends Comparable> int compare(T a, T b) {


Is this even needed?

robertwb · 2017-08-04T00:14:09Z

sdks/java/core/src/main/java/org/apache/beam/sdk/values/BeamRecord.java

  }

-  public List<Integer> getNullFields() {
-    return nullFields;
+  public void setNullFields(BitSet nullFields) {


Yes, we must store it in the encoded representation. But between the choice of setting it to all-true in the constructor and updating it as fields are mutated (we're not doing this correctly btw, see setDataValues and possibly elsewhere) vs. computing it on decode, I prefer the locality of the latter as it's just as cheap (and really cheap compared to the actual serialization), but your call.

robertwb

Thanks for bearing with me--LGTM! The remainder of the comments are optional.

mingmxu · 2017-08-04T00:30:28Z

Really appreciate your review, @robertwb.

I'll address the two in another PR, with a clean base. It seems nullFields is not handled properly in setDataValues.

coveralls · 2017-08-04T00:46:27Z

Changes Unknown when pulling 706781d on XuMingmin:BEAM-2676_2 into apache:DSL_SQL.

mingmxu · 2017-08-04T05:49:32Z

@robertwb could you merge this PR, so I can start on the remaining changes?

mingmxu · 2017-08-04T17:16:29Z

close #3675 Thanks @robertwb !

mingmxu force-pushed the BEAM-2676_2 branch from 09a7148 to 5b42e63 Compare August 2, 2017 15:16

robertwb reviewed Aug 3, 2017

mingmxu force-pushed the BEAM-2676_2 branch from 5b42e63 to 015532d Compare August 3, 2017 06:44

move BeamRecord to sdk/core

52933a6

mingmxu force-pushed the BEAM-2676_2 branch from 015532d to 52933a6 Compare August 3, 2017 06:52

use BitSet for nullFields

129ae96

refactor BeamRecord, BeamRecordType, BeamSqlRecordType, BeamRecordCoder

3e9f63b

robertwb reviewed Aug 3, 2017

fix up as comments

89faf63

fix JavaDoc error

706781d

robertwb reviewed Aug 4, 2017

robertwb approved these changes Aug 4, 2017

asfgit pushed a commit that referenced this pull request Aug 4, 2017

Closes #3675

8f922f7

mingmxu closed this Aug 4, 2017
