
Add Row Json Deserializer #5089

Merged
kennknowles merged 2 commits into apache:master from akedin:row-json-deserializer on Apr 13, 2018

Conversation

@akedin (Contributor) commented Apr 10, 2018

Add a basic Deserializer to allow converting JSON objects to Rows. Doesn't support nullables at the moment; isn't wired up to SQL yet.
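In spirit, the deserializer walks the schema's fields and pulls the matching values out of a parsed JSON object. A stdlib-only sketch of that idea follows; all names here (`toRowValues`, `schemaFields`) are hypothetical stand-ins, not the PR's actual Jackson-based API, and the returned `List` stands in for a Beam `Row`:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch only: "schemaFields" stands in for a Beam Schema and the
// returned List for a Row; the real PR builds on Jackson and Row.toRow.
public class RowFromJsonSketch {

  // Walks the schema's fields in order and pulls each value out of the parsed
  // JSON object; a field missing from the JSON object is an error.
  static List<Object> toRowValues(List<String> schemaFields, Map<String, Object> jsonObject) {
    List<Object> values = new ArrayList<>();
    for (String fieldName : schemaFields) {
      if (!jsonObject.containsKey(fieldName)) {
        throw new IllegalArgumentException(
            "Field " + fieldName + " is not present in the JSON object");
      }
      values.add(jsonObject.get(fieldName));
    }
    return values;
  }

  public static void main(String[] args) {
    Map<String, Object> json = new LinkedHashMap<>();
    json.put("id", 3L);
    json.put("name", "row");
    System.out.println(toRowValues(List.of("id", "name"), json)); // [3, row]
  }
}
```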


Follow this checklist to help us incorporate your contribution quickly and easily:

  • Make sure there is a JIRA issue filed for the change (usually before you start working on it). Trivial changes like typos do not require a JIRA issue. Your pull request should address just this issue, without pulling in other changes.
  • Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue.
  • Write a pull request description that is detailed enough to understand:
    • What the pull request does
    • Why it does it
    • How it does it
    • Why this approach
  • Each commit in the pull request should have a meaningful subject line and body.
  • Run mvn clean verify to make sure basic checks pass. A more thorough check will be performed on your pull request automatically.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

@akedin (Contributor, Author) commented Apr 10, 2018

R: @apilloud @kennknowles

@apilloud (Member) left a comment:

I learned lots of fancy Java reviewing this. LGTM.

*/
@AutoValue
public abstract static class FieldType implements Serializable {
// Returns the type of this field.
@apilloud (Member):

All these comment changes seem unrelated to a JSON Deserializer. Separate commit?

@akedin (Contributor, Author):

Sure, removed it from this PR. The missing javadoc was confusing, so I added it while I was at it.

.collect(Row.toRow(targetRowSchema));
}

private static Object getFieldValue(
@apilloud (Member):

Nit: following this is kind of confusing. It seems like this function should really be part of your mapToObj call above.

@akedin (Contributor, Author):

I agree. I did this before schemas exposed the fields list. Updated.

"Field " + currentFieldName + " is not present in the JSON object");
}

if (currentValueNode.isNull()) {
@apilloud (Member):

I didn't see a test that covered null values. Might be worth testing.

@akedin (Contributor, Author):

I agree. Added.

@akedin force-pushed the row-json-deserializer branch 2 times, most recently from 00b79ef to ae5e86d (April 11, 2018 19:26)
@kennknowles self-requested a review (April 11, 2018 20:14)
@kennknowles self-assigned this (Apr 11, 2018)
@kennknowles (Member) left a comment:
This is very nice. I have two high-level comments:

  1. It might be preferable to have the conversion driven by the expected type in the schema instead of by the JSON type. It doesn't seem to be a problem, though.
  2. Can you have tests for out-of-bounds numerical constants? I may have just missed them. The reason I ask is to deliberately focus on the places where the JSON is valid and almost fits the schema, but there is a minor mismatch in the data models.
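One way to surface the "valid JSON, minor data-model mismatch" cases is to reject numbers that parse fine but don't fit the schema's narrower type. A stdlib-only sketch of that check (hypothetical names, not the PR's code), using `Math.toIntExact` to reject longs outside the int range instead of silently wrapping:

```java
// Hypothetical sketch: strict handling of a JSON integer that overflows the
// schema's INT32 type, rejecting it rather than wrapping.
public class StrictNarrowing {

  static int toInt32OrThrow(long jsonValue) {
    try {
      // Math.toIntExact throws ArithmeticException if the long doesn't fit in an int.
      return Math.toIntExact(jsonValue);
    } catch (ArithmeticException e) {
      throw new IllegalArgumentException(
          "Value " + jsonValue + " is out of range for an INT32 field", e);
    }
  }

  public static void main(String[] args) {
    System.out.println(toInt32OrThrow(42L)); // 42
    // toInt32OrThrow(Integer.MAX_VALUE + 1L) throws IllegalArgumentException.
  }
}
```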

@akedin (Contributor, Author) commented Apr 11, 2018

@kennknowles:

  1. It might be preferable to have the conversion driven by the expected type in the schema instead of by the JSON type. It doesn't seem to be a problem, though.

That's what happens: these checks look at the schema type, and primitive value lookup also happens by schema type name.

@kennknowles

Can you have tests for out-of-bounds numerical constants? I may have just missed them. The reason I ask is to deliberately focus on the places where the JSON is valid and almost fits the schema, but there is a minor mismatch in the data models.

Working on it

Add basic Deserializer to allow converting Json objects to Rows.
@akedin force-pushed the row-json-deserializer branch from ae5e86d to 9386a44 (April 11, 2018 22:54)
@akedin (Contributor, Author) commented Apr 11, 2018

@kennknowles added test for overflowing integer numbers

@kennknowles (Member):

Ah, sorry, I misread that. Then the choice of when to reject, when to wrap, and when to lose precision is a bit complicated. I think we might need a design doc and to discuss a little on dev@ as a next gen of the schema design, but we can move forward with almost anything for now.

Extract and validate json primitive types for stricter compatibility with java types
@akedin (Contributor, Author) commented Apr 12, 2018

@kennknowles added stricter validation, does this approach look better?

A few quirks:

  • not sure I can quickly come up with a prettier solution for the validation; JSON + Jackson + Java makes a weird combination;
  • Jackson allows a few type-widening configuration options, e.g. it can automatically convert all ints to BigIntegers; I haven't tested that bit yet. It's an ObjectMapper configuration option, so it's not directly exposed to the deserializer. And it doesn't feel right to automatically widen everything;
  • Jackson always parses 1.02e5 as a float node but 1.02 as a double node, which leads to things like 1.02d != (double)(float)(1.02d). An easy workaround is to define the Schema field as Long, but that feels super subtle and unclear if you don't have a debugger at hand. I don't believe there's a configuration flag for this;
  • additionally, rejecting a 1.00 -> int conversion or number -> string feels restrictive.
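The float-vs-double mismatch above is easy to reproduce in plain Java: a float keeps only 24 mantissa bits, so round-tripping a double through float does not give back the same value:

```java
// Demonstrates the float-vs-double quirk: widening (float) 1.02 back to
// double does not reproduce 1.02d, because the float rounding loses bits.
public class FloatDoubleQuirk {

  public static void main(String[] args) {
    double direct = 1.02d;                    // value of a double node
    double viaFloat = (double) (float) 1.02d; // value after a float-node round trip
    System.out.println(direct == viaFloat);   // false
  }
}
```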

At this point I think the more restrictive approach is better, even while we don't have a robust type system in Schemas:

  • it's probably easier to make it less restrictive later, when we encounter use cases that require less strict conversions;
  • configuration options to control strictness can be wired up more easily once there's something to wire them up to.
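That layering can be sketched as a strict extractor with a leniency flag that nothing sets yet; the names and the flag here are hypothetical, just to illustrate where a future strictness option could plug in:

```java
// Hypothetical sketch: strict extraction with a leniency flag. Strict mode
// rejects any JSON double for an INT32 field; lenient mode would accept
// lossless conversions like 1.00 -> 1.
public class StrictExtraction {

  static int extractInt(double jsonNumber, boolean lenient) {
    boolean integral =
        jsonNumber == Math.rint(jsonNumber)         // no fractional part
            && jsonNumber >= Integer.MIN_VALUE
            && jsonNumber <= Integer.MAX_VALUE;      // fits in an int
    if (lenient && integral) {
      return (int) jsonNumber; // a relaxed mode would accept 1.00 -> 1
    }
    throw new IllegalArgumentException(
        "Strict mode: JSON number " + jsonNumber + " is not accepted for an INT32 field");
  }

  public static void main(String[] args) {
    System.out.println(extractInt(1.00, true)); // 1
    // extractInt(1.00, false) and extractInt(1.5, true) both throw.
  }
}
```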

@akedin (Contributor, Author) commented Apr 12, 2018

run java precommit

@akedin akedin mentioned this pull request Apr 12, 2018
@kennknowles (Member):

Yes, I like this. If we do decide to make the extractors less strict, the structure of things can stay the same.

@kennknowles kennknowles merged commit 8f58285 into apache:master Apr 13, 2018
mingmxu pushed a commit to mingmxu/beam that referenced this pull request Apr 13, 2018