fix issue with nested column processing/storage of empty fields#19072
fix issue with nested column processing/storage of empty fields#19072capistrant merged 7 commits intoapache:masterfrom
Conversation
changes: * fix bug in `NestedPathFinder.toNormalizedJsonPath` creating incorrect paths when given an empty field name * fix `NestedPathFinder.parseJsonPath` to correctly detect illegal empty paths consisting of consecutive . characters * added `NestedPathFinder.parseBadJsonPath` to read nested column fields dictionaries and detect and fixup illegal path expressions, using a newly added `FieldsFixupIndexed` to swap the bad values with good values * `NestedDataColumnSupplier` on column read attempts to detect bad paths written by the bugged version of `NestedPathFinder.toNormalizedJsonPath` * added 'pathParserVersion' field to nested column part serde so that newly written nested columns after the bug fix can skip checking for the bug
28f89c0 to
fb1546c
Compare
| @JsonProperty("bitmapSerdeFactory") BitmapSerdeFactory bitmapSerdeFactory, | ||
| @JsonProperty("columnFormatSpec") @Nullable FormatSpec columnFormatSpec | ||
| @JsonProperty("columnFormatSpec") @Nullable FormatSpec columnFormatSpec, | ||
| @JsonProperty("pathParserVersion") @Nullable Byte pathParserVersion |
There was a problem hiding this comment.
I don't see a @JsonProperty getter for this. Is it going to make it into the serialized form?
There was a problem hiding this comment.
added a serde test too
|
|
||
|
|
||
| @VisibleForTesting | ||
| static Supplier<? extends Indexed<ByteBuffer>> getAndFixFieldsSupplier( |
There was a problem hiding this comment.
Should have some javadoc explaining what "fix" means. Having a short description and then linking to a GitHub issue or PR for more details is a useful way to do it.
| } | ||
|
|
||
| @VisibleForTesting | ||
| public static class FieldsFixupIndexed implements Indexed<ByteBuffer> |
There was a problem hiding this comment.
Similar to the static creator method, this should have some javadoc explaining what "fix" means. Having a short description and then linking to a GitHub issue or PR for more details is a useful way to do it.
| @Override | ||
| public boolean isSorted() | ||
| { | ||
| return delegate.isSorted(); |
There was a problem hiding this comment.
Is this always going to be true (will the fixed-up keys sort the same way as the bad keys)? Also, does it matter (does anything care if this indexed is sorted)?
There was a problem hiding this comment.
nope, good catch, removed this since we can't guarantee it and added note in javadocs about it not mattering
| return entry.getIntKey(); | ||
| } | ||
| } | ||
| return delegate.indexOf(value); |
There was a problem hiding this comment.
Should return this only if the key is not present in fixup. Otherwise it will return a nonnegative index for the bad keys.
There was a problem hiding this comment.
Do you mean like if a caller is searching for the bad keys? I don't think that should be possible for most callers since when calling this method we generally run it through NestedPathFinder.toNormalizedJsonPath before searching because usually at the point we call this the path expressions have been converted to List<NestedPathPart> during SQL planning.
Also, i believe it should be harmless for indexOf to report the positions of the bad paths if using the bad path expressions.
Added explanation to javadocs
There was a problem hiding this comment.
Yes, I meant if someone is searching for the bad keys. In that case the behavior here doesn't adhere to the Indexed interface. If we're going to deviate from it, that can be fine given it's only used in this specific place, but any intentional deviation should be javadoc'd.
| } | ||
|
|
||
| /** | ||
| * split a JSONPath path into a series of extractors to find things in stuff. This method is mostly a duplicate of |
There was a problem hiding this comment.
Surely something can make more sense then "find things in stuff".
There was a problem hiding this comment.
updated javadocs to be a bit clearer
| /** | ||
| * split a JSONPath path into a series of extractors to find things in stuff. This method is mostly a duplicate of | ||
| * {@link #parseJsonPath(String)} fixing up any bugs encountered from bad paths like '$..a' or '$.[0].a' from | ||
| * previously bugged versions of {@link #toNormalizedJsonPath(List)} and should be kept in sync. |
There was a problem hiding this comment.
Can these be made to share a common parse function that is parameterized with some boolean like allowBadPaths? If not, at least add a comment in parseJsonPath too, so someone modifying that can remember to update this one too.
There was a problem hiding this comment.
done, called parameter allowFixEmptyFieldsBadPaths since it is pretty specific on the bad paths it allows
Description
changes:
NestedPathFinder.toNormalizedJsonPathcreating incorrect paths when given an empty field nameNestedPathFinder.parseJsonPathto correctly detect illegal empty paths consisting of consecutive . charactersNestedPathFinder.parseBadJsonPathto read nested column fields dictionaries and detect and fixup illegal path expressions, using a newly addedFieldsFixupIndexedto swap the bad values with good valuesNestedDataColumnSupplieron column read attempts to detect bad paths written by the bugged version ofNestedPathFinder.toNormalizedJsonPath