Skip to content

fix issues with nested data conversion#13407

Merged
clintropolis merged 1 commit intoapache:masterfrom
clintropolis:fix-orc-null-value-stuff
Nov 28, 2022
Merged

fix issues with nested data conversion#13407
clintropolis merged 1 commit intoapache:masterfrom
clintropolis:fix-orc-null-value-stuff

Conversation

@clintropolis
Copy link
Member

Description

This PR fixes a regression caused by #13375 where null ORC inputs would be processed into {} instead of null as expected.

The cause of the regression was allowing the nested types to be returned during conversion to support nested ingestion, which exposed another underlying oddity of why the values were ending up as empty maps instead of null.

The ORC json provider isMap method looks like this

  @Override
  public boolean isMap(final Object o)
  {
    return o == null || o instanceof Map || o instanceof OrcStruct;
  }

which is a bit strange, however is consistent with the other implementations of other nested formats. This means toPlainJavaObject will treat null as a map for most types, resulting in the empty map when converting to java objects. I haven't quite discovered why these are implemented like this (if it was me, i cannot remember 😅), but to avoid changing the behavior here, toMap now checks for null response from toPlainJavaObject and returns an empty map if so, so that toPlainJavaObject will not translate null into an empty map.

While writing a test for this I noticed that toPlainJavaObject could still leak format specific types since the fall through value was not 'finalized' like the values inside of maps and lists are, so the json NullNode for example by processing a null input row would cause the sampler to explode. I'm unsure how common this example is, but it seems safer to finalize the values the fall through just to be safe.


This PR has:

  • been self-reviewed.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • been tested in a test Druid cluster.

Copy link
Member

@rohangarg rohangarg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM
Although due to the old behavior, would the sampler show empty map for null values in top level nested structures? But they'll be ingested as nulls.

@clintropolis clintropolis merged commit 37b8d48 into apache:master Nov 28, 2022
@clintropolis clintropolis deleted the fix-orc-null-value-stuff branch November 28, 2022 20:29
clintropolis added a commit to clintropolis/druid that referenced this pull request Nov 28, 2022
@clintropolis
Copy link
Member Author

Although due to the old behavior, would the sampler show empty map for null values in top level nested structures? But they'll be ingested as nulls.

I think there was actually an issue with truly null valued rows, https://github.com/apache/druid/pull/13407/files#diff-5dbb6b823e8eebdd2f12a3d2da6e56d74c9d8cca55a168bc6eac6095ace46809L290 would leak a NullNode at least for JSON inputs, which isn't an empty map at all so it would be a class cast exception. But yeah, the output of toMap for the sampler on one of these would effectively just be a row with all null values if it would have worked.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants