Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DRILL-4824: JSON with complex nested data produces incorrect output w… #580

Closed
wants to merge 1 commit into from

Conversation

KulykRoman
Copy link
Contributor

…ith missing fields

…ith missing fields

- Added changes to skip empty Lists or Maps.
- Changed TestJsonReader.testSplitAndTransferFailure() and TestJsonReader.testFieldSelectionBug() according to new output logic.
|| (v.getAccessor().getObject(index) instanceof List
&& ((List) v.getAccessor().getObject(index)).size() == 0)) {
continue;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this handle the difference beteen a missing field, and a field that exists, but contains an empty map or list? Examples:

{ "a" : { "b" : 10 }, "c" : [ 1, 2, 3 ] }  // Baseline case
{ "a" : { }, "c" : [ 4 ] }                 // Keep "a"?
{ "b" : [ 5 ] }                            // Remove a: OK
{ "a" : { "b": 20 }, "c" : [ ] }           // Keep "c"?
{ "a" : { "b": 20 } }                      // Remove c: OK

That is, should we diffentiate between a empty map/list and a non-existent one?

The code seems to discard all empty maps & lists. Should the check actually be done in the parser to not pass along empty items? Is this possible (I'm on thin ice here in my detailed knowledge of value vectors...)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@paul-rogers, map fields have data mode required and they are the part of the schema, that's why there are no difference between missing field in some record, and the field that exists but empty.
This fix for your example will return result
{"a":{"b":10},"c":[1,2,3]}
{"c":[4]}
{"b":[5]}
{"a":{"b":20}}
{"a":{"b":20}}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the JIRA entry for more comments. The key problem is that Drill does not support standard JSON rules. So, we are simply moving the problem around.

Consider this case:

{ }
{ "a": { } }
{ "a": { "b": {} } }

It appears that the code here would emit:

{ }
{ }
{ "a": { } }

If we are going to apply the empty-is-not-present rule we should do so recursively.

mapOf(
"inner_1", listOf(),
"inner_3", mapOf()))
mapOf())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test file used here is not sufficient to fully test the code change. Perhaps create an additional test, using a new test file, that tests all the variations suggested in the comment below and in the JIRA entry.

Copy link
Contributor

@paul-rogers paul-rogers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, there is no good solution for this fix. So, we have to figure out what is good enough. I think that would be:

  1. Handle recursively empty structures (see comments.)
  2. Include tests for those cases.

The more correct JSON solution must wait for a larger project.

@ilooner
Copy link
Contributor

ilooner commented Jun 8, 2018

@KulykRoman @paul-rogers Is there a path to take the PR to completion? Or was it decided that things would be handled differently in another PR?

@paul-rogers
Copy link
Contributor

@ilooner, there are quite a few open issues around JSON. There are a number of JIRA tickets that explain the issues, as in a whole section in the "Batch Handling" document on my Wiki. This is not a simple bug fix.

@ilooner
Copy link
Contributor

ilooner commented Jun 8, 2018

Thanks @paul-rogers then it sounds like this PR should be closed. @KulykRoman if you'd like to look into the larger issues around JSON handling please look into the Jiras that Paul has mentioned. Or if you feel this PR has been closed in error, please feel free to reopen and address Paul's comments.

@ilooner ilooner closed this Jun 8, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants