fix issue with parquet list conversion of nullable lists with complex nullable elements #13294

clintropolis · 2022-11-02T00:42:45Z

Description

This PR fixes a flaw in Parquet list conversion for nullable lists of nullable complex elements (groups with many fields, etc), which is amusingly similar to #5433, which I fixed for avro and thought I had addressed for direct group conversion in #6360, but was missing this case. We did cover the other common list form conversions, we just weren't unwrapping stuff in this form


optional group a2 (LIST) {
  repeated group list {
    optional group element {
      optional int64 b1;
      optional int64 b2;
    }
  }
}

which has been added in this PR.

I've also cleaned some stuff up around conversion, and made some changes in preparation of adding support for Parquet to nested column indexers and expression transforms, via some of the same flattener/conversion mechanisms we use for flattenSpec.

Release note

Fixes an issue with Parquet list conversion, where lists of complex objects could unexpectedly be wrapped in an extra object, appearing as [{"element":<actual_list_element>},{"element":<another_one>}...] instead of the direct list.

Importantly, this changes the behavior of the parquet reader for lists of structured objects to be consistent with other parquet logical list conversions. Previously, this data would have to have been retrieved via this element as part of a jsonpath or jq flattenSpec, but now will be fetched directly, more closely matching its expected structure.

For example, prior to this patch due to the unhandled logical list conversion, data of the form {"list":[{"x":1, "y":2},{"x":3, "y": 4}]}, to extract x you would have to specify $.list[0].element.x. After this patch, the data would be extracted using $.list[0].x.

If you have any existing flattenSpec which are dependent on the current unintuitive behavior, they should be adjusted to drop the 'element' from the path.

This PR has:

been self-reviewed.
a release note entry in the PR description.
added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
been tested in a test Druid cluster.

… nullable elements

rohangarg

👍 LGTM % comment

rohangarg · 2022-11-02T15:58:02Z

...tensions/src/main/java/org/apache/druid/data/input/parquet/simple/ParquetGroupConverter.java

+        listItem.getType().getFieldCount() == 1 &&
+        listItem.getType().getName().equalsIgnoreCase("list") &&
+        listItem.getType().getFieldName(0).equalsIgnoreCase("element") &&
+        listItem.getGroup(0, 0).getType().isRepetition(Type.Repetition.OPTIONAL)


some doubts :

should listItem.getType().getFieldCount() == 1 be checked just after listItem.getType().isRepetition(Type.Repetition.REPEATED)?

could we use listItem.getType().getLogicalTypeAnnotation().equals(LogicalTypeAnnotation.listType()) for list check and assume the fieldName for 0 index to be element 'type'? was checking https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules and it seemed like the name of the field in the list block could be anything and not always 'element'. totally fine if you have thought about this and want to tackle a particular case

should we do convertListElement(listItem.getGroup(0, 0) in the return?

should listItem.getType().getFieldCount() == 1 be checked just after listItem.getType().isRepetition(Type.Repetition.REPEATED)

yeah, i think that makes more sense, will fix

could we use listItem.getType().getLogicalTypeAnnotation().equals(LogicalTypeAnnotation.listType()) for list check and assume the fieldName for 0 index to be element 'type'? was checking https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules and it seemed like the name of the field in the list block could be anything and not always 'element'. totally fine if you have thought about this and want to tackle a particular case

it 100% does seem better to use LogicalTypeAnnotation stuff. Poking at this fix was the first time I've seen these things because it didn't actually exist when I originally wrote this code in 2018 using parquet-mr 1.10.0, but this functionality was added in 1.11.0 😅. I'll look into how difficult it is to translate this (and probably logical map conversion) to use this instead of OriginalType and checking stuff by hand.

should we do convertListElement(listItem.getGroup(0, 0) in the return?

I think it isn't necessary the way things work right now; at the ParquetGroupConverter layer, this conversion is lazy in the sense that it only converts the current field being asked for and isn't recursive. We do want to remove the extra wrapper, but thats all we want to do here. The interaction between jsonpath/jq paths in the json provider digs through things to find stuff, and the flattenermaker performs any final cleanup for things leaving conversion before the make it into druid, so the recursive conversion is done there (well before this patch these things could actually leak, that is a change in this PR in preparation for supporting nested column stuffs).

rohangarg · 2022-11-02T15:58:50Z

...tensions/src/main/java/org/apache/druid/data/input/parquet/simple/ParquetGroupConverter.java

            return g.getInteger(fieldIndex, index);
+          case UINT_32:


good catch! :)

fix issue with parquet list conversion of nullable lists with complex…

34b5557

… nullable elements

clintropolis added Bug Area - Batch Ingestion labels Nov 2, 2022

pom stuff

766a946

imply-cheddar approved these changes Nov 2, 2022

View reviewed changes

clintropolis added Incompatible Release Notes labels Nov 2, 2022

fix style

2139e1f

gianm approved these changes Nov 2, 2022

View reviewed changes

rohangarg reviewed Nov 2, 2022

View reviewed changes

adjustments

0d62e9a

clintropolis closed this Nov 3, 2022

clintropolis reopened this Nov 3, 2022

rohangarg approved these changes Nov 4, 2022

View reviewed changes

clintropolis mentioned this pull request Nov 4, 2022

modernize parquet logical type conversion #13312

Open

clintropolis merged commit e60e305 into apache:master Nov 4, 2022

clintropolis deleted the parquet-nullable-list-fix branch November 4, 2022 12:25

kfaraz added this to the 25.0 milestone Nov 22, 2022

This was referenced Dec 18, 2022

[Draft] 25.0.0 Release Notes #13592

Closed

Add SegmentAllocationQueue to batch allocation actions #13369

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix issue with parquet list conversion of nullable lists with complex nullable elements #13294

fix issue with parquet list conversion of nullable lists with complex nullable elements #13294

clintropolis commented Nov 2, 2022 •

edited

Loading

rohangarg left a comment

rohangarg Nov 2, 2022

clintropolis Nov 2, 2022 •

edited

Loading

rohangarg Nov 2, 2022

fix issue with parquet list conversion of nullable lists with complex nullable elements #13294

fix issue with parquet list conversion of nullable lists with complex nullable elements #13294

Conversation

clintropolis commented Nov 2, 2022 • edited Loading

Description

Release note

rohangarg left a comment

Choose a reason for hiding this comment

rohangarg Nov 2, 2022

Choose a reason for hiding this comment

clintropolis Nov 2, 2022 • edited Loading

Choose a reason for hiding this comment

rohangarg Nov 2, 2022

Choose a reason for hiding this comment

clintropolis commented Nov 2, 2022 •

edited

Loading

clintropolis Nov 2, 2022 •

edited

Loading