-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix issue with parquet list conversion of nullable lists with complex nullable elements #13294
fix issue with parquet list conversion of nullable lists with complex nullable elements #13294
Conversation
… nullable elements
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 LGTM % comment
listItem.getType().getFieldCount() == 1 && | ||
listItem.getType().getName().equalsIgnoreCase("list") && | ||
listItem.getType().getFieldName(0).equalsIgnoreCase("element") && | ||
listItem.getGroup(0, 0).getType().isRepetition(Type.Repetition.OPTIONAL) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some doubts :
- should
listItem.getType().getFieldCount() == 1
be checked just afterlistItem.getType().isRepetition(Type.Repetition.REPEATED)
? - could we use
listItem.getType().getLogicalTypeAnnotation().equals(LogicalTypeAnnotation.listType())
for list check and assume thefieldName
for0
index to beelement
'type'? was checking https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules and it seemed like the name of the field in thelist
block could be anything and not always 'element'. totally fine if you have thought about this and want to tackle a particular case - should we do
convertListElement(listItem.getGroup(0, 0)
in the return?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should listItem.getType().getFieldCount() == 1 be checked just after listItem.getType().isRepetition(Type.Repetition.REPEATED)
yeah, i think that makes more sense, will fix
could we use listItem.getType().getLogicalTypeAnnotation().equals(LogicalTypeAnnotation.listType()) for list check and assume the fieldName for 0 index to be element 'type'? was checking https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules and it seemed like the name of the field in the list block could be anything and not always 'element'. totally fine if you have thought about this and want to tackle a particular case
it 100% does seem better to use LogicalTypeAnnotation
stuff. Poking at this fix was the first time I've seen these things because it didn't actually exist when I originally wrote this code in 2018 using parquet-mr 1.10.0, but this functionality was added in 1.11.0 😅. I'll look into how difficult it is to translate this (and probably logical map conversion) to use this instead of OriginalType
and checking stuff by hand.
should we do convertListElement(listItem.getGroup(0, 0) in the return?
I think it isn't necessary the way things work right now; at the ParquetGroupConverter
layer, this conversion is lazy in the sense that it only converts the current field being asked for and isn't recursive. We do want to remove the extra wrapper, but thats all we want to do here. The interaction between jsonpath/jq paths in the json provider digs through things to find stuff, and the flattenermaker performs any final cleanup for things leaving conversion before the make it into druid, so the recursive conversion is done there (well before this patch these things could actually leak, that is a change in this PR in preparation for supporting nested column stuffs).
return g.getInteger(fieldIndex, index); | ||
case UINT_32: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch! :)
Description
This PR fixes a flaw in Parquet list conversion for nullable lists of nullable complex elements (groups with many fields, etc), which is amusingly similar to #5433, which I fixed for avro and thought I had addressed for direct group conversion in #6360, but was missing this case. We did cover the other common list form conversions, we just weren't unwrapping stuff in this form
which has been added in this PR.
I've also cleaned some stuff up around conversion, and made some changes in preparation of adding support for Parquet to nested column indexers and expression transforms, via some of the same flattener/conversion mechanisms we use for flattenSpec.
Release note
Fixes an issue with Parquet list conversion, where lists of complex objects could unexpectedly be wrapped in an extra object, appearing as
[{"element":<actual_list_element>},{"element":<another_one>}...]
instead of the direct list.Importantly, this changes the behavior of the parquet reader for lists of structured objects to be consistent with other parquet logical list conversions. Previously, this data would have to have been retrieved via this
element
as part of a jsonpath or jq flattenSpec, but now will be fetched directly, more closely matching its expected structure.For example, prior to this patch due to the unhandled logical list conversion, data of the form
{"list":[{"x":1, "y":2},{"x":3, "y": 4}]}
, to extract x you would have to specify$.list[0].element.x
. After this patch, the data would be extracted using$.list[0].x
.If you have any existing flattenSpec which are dependent on the current unintuitive behavior, they should be adjusted to drop the 'element' from the path.
This PR has: