Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix issue with parquet list conversion of nullable lists with complex nullable elements #13294

Merged
merged 4 commits into from
Nov 4, 2022

Conversation

clintropolis
Copy link
Member

@clintropolis clintropolis commented Nov 2, 2022

Description

This PR fixes a flaw in Parquet list conversion for nullable lists of nullable complex elements (groups with many fields, etc), which is amusingly similar to #5433, which I fixed for avro and thought I had addressed for direct group conversion in #6360, but was missing this case. We did cover the other common list form conversions, we just weren't unwrapping stuff in this form


optional group a2 (LIST) {
  repeated group list {
    optional group element {
      optional int64 b1;
      optional int64 b2;
    }
  }
}

which has been added in this PR.

I've also cleaned some stuff up around conversion, and made some changes in preparation of adding support for Parquet to nested column indexers and expression transforms, via some of the same flattener/conversion mechanisms we use for flattenSpec.

Release note

Fixes an issue with Parquet list conversion, where lists of complex objects could unexpectedly be wrapped in an extra object, appearing as [{"element":<actual_list_element>},{"element":<another_one>}...] instead of the direct list.

Importantly, this changes the behavior of the parquet reader for lists of structured objects to be consistent with other parquet logical list conversions. Previously, this data would have to have been retrieved via this element as part of a jsonpath or jq flattenSpec, but now will be fetched directly, more closely matching its expected structure.

For example, prior to this patch due to the unhandled logical list conversion, data of the form {"list":[{"x":1, "y":2},{"x":3, "y": 4}]}, to extract x you would have to specify $.list[0].element.x. After this patch, the data would be extracted using $.list[0].x.

If you have any existing flattenSpec which are dependent on the current unintuitive behavior, they should be adjusted to drop the 'element' from the path.


This PR has:

  • been self-reviewed.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • been tested in a test Druid cluster.

Copy link
Member

@rohangarg rohangarg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 LGTM % comment

listItem.getType().getFieldCount() == 1 &&
listItem.getType().getName().equalsIgnoreCase("list") &&
listItem.getType().getFieldName(0).equalsIgnoreCase("element") &&
listItem.getGroup(0, 0).getType().isRepetition(Type.Repetition.OPTIONAL)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some doubts :

  1. should listItem.getType().getFieldCount() == 1 be checked just after listItem.getType().isRepetition(Type.Repetition.REPEATED)?
  2. could we use listItem.getType().getLogicalTypeAnnotation().equals(LogicalTypeAnnotation.listType()) for list check and assume the fieldName for 0 index to be element 'type'? was checking https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules and it seemed like the name of the field in the list block could be anything and not always 'element'. totally fine if you have thought about this and want to tackle a particular case
  3. should we do convertListElement(listItem.getGroup(0, 0) in the return?

Copy link
Member Author

@clintropolis clintropolis Nov 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should listItem.getType().getFieldCount() == 1 be checked just after listItem.getType().isRepetition(Type.Repetition.REPEATED)

yeah, i think that makes more sense, will fix

could we use listItem.getType().getLogicalTypeAnnotation().equals(LogicalTypeAnnotation.listType()) for list check and assume the fieldName for 0 index to be element 'type'? was checking https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules and it seemed like the name of the field in the list block could be anything and not always 'element'. totally fine if you have thought about this and want to tackle a particular case

it 100% does seem better to use LogicalTypeAnnotation stuff. Poking at this fix was the first time I've seen these things because it didn't actually exist when I originally wrote this code in 2018 using parquet-mr 1.10.0, but this functionality was added in 1.11.0 😅. I'll look into how difficult it is to translate this (and probably logical map conversion) to use this instead of OriginalType and checking stuff by hand.

should we do convertListElement(listItem.getGroup(0, 0) in the return?

I think it isn't necessary the way things work right now; at the ParquetGroupConverter layer, this conversion is lazy in the sense that it only converts the current field being asked for and isn't recursive. We do want to remove the extra wrapper, but thats all we want to do here. The interaction between jsonpath/jq paths in the json provider digs through things to find stuff, and the flattenermaker performs any final cleanup for things leaving conversion before the make it into druid, so the recursive conversion is done there (well before this patch these things could actually leak, that is a change in this PR in preparation for supporting nested column stuffs).

return g.getInteger(fieldIndex, index);
case UINT_32:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch! :)

@clintropolis clintropolis reopened this Nov 3, 2022
@clintropolis clintropolis merged commit e60e305 into apache:master Nov 4, 2022
@clintropolis clintropolis deleted the parquet-nullable-list-fix branch November 4, 2022 12:25
@kfaraz kfaraz added this to the 25.0 milestone Nov 22, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants