
Add Row Json Deserializer #5089

Merged
kennknowles merged 2 commits into apache:master from akedin:row-json-deserializer on Apr 13, 2018

Conversation

@akedin (Contributor) commented Apr 10, 2018

Add a basic Deserializer to allow converting JSON objects to Rows. Doesn't support nullables at the moment; isn't wired up to SQL yet.
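In spirit, the deserializer walks the schema's fields and pulls the matching values out of a parsed JSON object. A stdlib-only sketch of that idea follows; all names here (`toRowValues`, `schemaFields`) are hypothetical stand-ins, not the PR's actual Jackson-based API, and the returned `List` stands in for a Beam `Row`:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch only: "schemaFields" stands in for a Beam Schema and the
// returned List for a Row; the real PR builds on Jackson and Row.toRow.
public class RowFromJsonSketch {

  // Walks the schema's fields in order and pulls each value out of the parsed
  // JSON object; a field missing from the JSON object is an error.
  static List<Object> toRowValues(List<String> schemaFields, Map<String, Object> jsonObject) {
    List<Object> values = new ArrayList<>();
    for (String fieldName : schemaFields) {
      if (!jsonObject.containsKey(fieldName)) {
        throw new IllegalArgumentException(
            "Field " + fieldName + " is not present in the JSON object");
      }
      values.add(jsonObject.get(fieldName));
    }
    return values;
  }

  public static void main(String[] args) {
    Map<String, Object> json = new LinkedHashMap<>();
    json.put("id", 3L);
    json.put("name", "row");
    System.out.println(toRowValues(List.of("id", "name"), json)); // [3, row]
  }
}
```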


Follow this checklist to help us incorporate your contribution quickly and easily:

  • Make sure there is a JIRA issue filed for the change (usually before you start working on it). Trivial changes like typos do not require a JIRA issue. Your pull request should address just this issue, without pulling in other changes.
  • Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue.
  • Write a pull request description that is detailed enough to understand:
    • What the pull request does
    • Why it does it
    • How it does it
    • Why this approach
  • Each commit in the pull request should have a meaningful subject line and body.
  • Run mvn clean verify to make sure basic checks pass. A more thorough check will be performed on your pull request automatically.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

@akedin (Contributor, Author) commented Apr 10, 2018

R: @apilloud @kennknowles

@apilloud (Member) left a comment:

I learned lots of fancy Java reviewing this. LGTM.

*/
@AutoValue
public abstract static class FieldType implements Serializable {
// Returns the type of this field.
@apilloud (Member):

All these comment changes seem unrelated to a JSON Deserializer. Separate commit?

@akedin (Contributor, Author):

Sure, removed it from this PR. The missing javadoc was confusing, so I added it while I was at it.

.collect(Row.toRow(targetRowSchema));
}

private static Object getFieldValue(
@apilloud (Member):

Nit: following this is kind of confusing. It seems like this function should really be part of your mapToObj call above.

@akedin (Contributor, Author):

I agree. I did this before schemas exposed the fields list. Updated.

"Field " + currentFieldName + " is not present in the JSON object");
}

if (currentValueNode.isNull()) {
@apilloud (Member):

I didn't see a test that covered null values. Might be worth testing.

@akedin (Contributor, Author):

I agree. Added.

@akedin force-pushed the row-json-deserializer branch 2 times, most recently from 00b79ef to ae5e86d (April 11, 2018 19:26)
@kennknowles self-requested a review (April 11, 2018 20:14)
@kennknowles self-assigned this (Apr 11, 2018)
@kennknowles (Member) left a comment:
This is very nice. I have two high-level comments:

  1. It might be preferable to have the conversion driven by the expected type in the schema instead of by the JSON type. It doesn't seem to be a problem, though.
  2. Can you have tests for out-of-bounds numerical constants? I may have just missed them. The reason I ask is to deliberately focus on the places where the JSON is valid and almost fits the schema, but there is a minor mismatch in the data models.
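One way to surface the "valid JSON, minor data-model mismatch" cases is to reject numbers that parse fine but don't fit the schema's narrower type. A stdlib-only sketch of that check (hypothetical names, not the PR's code), using `Math.toIntExact` to reject longs outside the int range instead of silently wrapping:

```java
// Hypothetical sketch: strict handling of a JSON integer that overflows the
// schema's INT32 type, rejecting it rather than wrapping.
public class StrictNarrowing {

  static int toInt32OrThrow(long jsonValue) {
    try {
      // Math.toIntExact throws ArithmeticException if the long doesn't fit in an int.
      return Math.toIntExact(jsonValue);
    } catch (ArithmeticException e) {
      throw new IllegalArgumentException(
          "Value " + jsonValue + " is out of range for an INT32 field", e);
    }
  }

  public static void main(String[] args) {
    System.out.println(toInt32OrThrow(42L)); // 42
    // toInt32OrThrow(Integer.MAX_VALUE + 1L) throws IllegalArgumentException.
  }
}
```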

@akedin (Contributor, Author) commented Apr 11, 2018

@kennknowles:

  1. It might be preferable to have the conversion driven by the expected type in the schema instead of by the JSON type. It doesn't seem to be a problem, though.

That's what happens: these checks look at the schema type, and primitive value lookup also happens by schema type name.

@kennknowles

Can you have tests for out-of-bounds numerical constants? I may have just missed them. The reason I ask is to deliberately focus on the places where the JSON is valid and almost fits the schema, but there is a minor mismatch in the data models.

Working on it

Add basic Deserializer to allow converting Json objects to Rows.
@akedin force-pushed the row-json-deserializer branch from ae5e86d to 9386a44 (April 11, 2018 22:54)
@akedin (Contributor, Author) commented Apr 11, 2018

@kennknowles added test for overflowing integer numbers

@kennknowles (Member):

Ah, sorry, I misread that. Then the choice of when to reject, when to wrap, and when to lose precision is a bit complicated. I think we might need a design doc and to discuss a little on dev@ as a next gen of the schema design, but we can move forward with almost anything for now.

Extract and validate json primitive types for stricter compatibility with java types
@akedin (Contributor, Author) commented Apr 12, 2018

@kennknowles added stricter validation, does this approach look better?

A few quirks:

  • not sure I can quickly come up with a prettier solution for the validation; JSON + Jackson + Java makes a weird combination;
  • Jackson allows a few type-widening configuration options, e.g. it can automatically convert all ints to BigIntegers; I haven't tested that bit yet. It's an ObjectMapper configuration option, so it's not directly exposed to the deserializer. And it doesn't feel right to automatically widen everything;
  • Jackson always parses 1.02e5 as a float node but 1.02 as a double node, which leads to things like 1.02d != (double)(float)(1.02d). An easy workaround is to define the Schema field as Long, but that feels super subtle and unclear if you don't have a debugger at hand. I don't believe there's a configuration flag for this;
  • additionally, rejecting a 1.00 -> int conversion or number -> string feels restrictive.
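The float-vs-double mismatch above is easy to reproduce in plain Java: a float keeps only 24 mantissa bits, so round-tripping a double through float does not give back the same value:

```java
// Demonstrates the float-vs-double quirk: widening (float) 1.02 back to
// double does not reproduce 1.02d, because the float rounding loses bits.
public class FloatDoubleQuirk {

  public static void main(String[] args) {
    double direct = 1.02d;                    // value of a double node
    double viaFloat = (double) (float) 1.02d; // value after a float-node round trip
    System.out.println(direct == viaFloat);   // false
  }
}
```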

At this point I think the more restrictive approach is better, even while we don't have a robust type system in Schemas:

  • it's probably easier to make it less restrictive later, when we encounter use cases that require less strict conversions;
  • configuration options to control strictness can be wired up more easily once there's something to wire them up to.
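That layering can be sketched as a strict extractor with a leniency flag that nothing sets yet; the names and the flag here are hypothetical, just to illustrate where a future strictness option could plug in:

```java
// Hypothetical sketch: strict extraction with a leniency flag. Strict mode
// rejects any JSON double for an INT32 field; lenient mode would accept
// lossless conversions like 1.00 -> 1.
public class StrictExtraction {

  static int extractInt(double jsonNumber, boolean lenient) {
    boolean integral =
        jsonNumber == Math.rint(jsonNumber)         // no fractional part
            && jsonNumber >= Integer.MIN_VALUE
            && jsonNumber <= Integer.MAX_VALUE;      // fits in an int
    if (lenient && integral) {
      return (int) jsonNumber; // a relaxed mode would accept 1.00 -> 1
    }
    throw new IllegalArgumentException(
        "Strict mode: JSON number " + jsonNumber + " is not accepted for an INT32 field");
  }

  public static void main(String[] args) {
    System.out.println(extractInt(1.00, true)); // 1
    // extractInt(1.00, false) and extractInt(1.5, true) both throw.
  }
}
```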

@akedin (Contributor, Author) commented Apr 12, 2018

run java precommit

@akedin akedin mentioned this pull request Apr 12, 2018
@kennknowles (Member):

Yes, I like this. If we do decide to make the extractors less strict, the structure of things can stay the same.

@kennknowles kennknowles merged commit 8f58285 into apache:master Apr 13, 2018
mingmxu pushed a commit to mingmxu/beam that referenced this pull request Apr 13, 2018