optimize constant auto columns #14412
changes:
* auto columns can now specialize the case when the column contains only constants, avoiding writing any real columns and instead producing constant selectors and indexes. Since this is backwards incompatible, it is gated behind a flag on IndexSpec, optimizeJsonConstantColumns, which should be removed in Druid 28
* auto columns no longer participate in generic 'null column' handling; trying to support this was a mistake that caused ingestion failures due to mismatched ColumnFormat, and it is replaced by the new constant column functionality
* fix bugs with auto columns which contain empty objects, empty arrays, or primitive types mixed with either of these empty constructs
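The first change above can be pictured with a small sketch. It is not the PR's implementation: `ConstantColumnSketch` is a hypothetical class, and `java.util.BitSet` stands in for Druid's bitmap types. It only illustrates that a constant column needs no per-row storage, because a selector returns the same value for every row and an index lookup matches either all rows or none.

```java
import java.util.BitSet;
import java.util.Objects;

// Hypothetical illustration only; not a class from the PR.
final class ConstantColumnSketch
{
  private final Object constant;
  private final int numRows;

  ConstantColumnSketch(Object constant, int numRows)
  {
    this.constant = constant;
    this.numRows = numRows;
  }

  /** "Selector": every valid row yields the same constant value. */
  Object getObject(int row)
  {
    if (row < 0 || row >= numRows) {
      throw new IndexOutOfBoundsException("row " + row);
    }
    return constant;
  }

  /** "Index": a value matches every row if it equals the constant, otherwise no rows. */
  BitSet rowsMatching(Object value)
  {
    BitSet rows = new BitSet(numRows);
    if (Objects.equals(constant, value)) {
      rows.set(0, numRows);
    }
    return rows;
  }
}
```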
processing/src/main/java/org/apache/druid/query/metadata/SegmentAnalyzer.java
processing/src/main/java/org/apache/druid/segment/nested/ConstantColumnAndIndexSupplier.java
    if (row != null) {
      if (row instanceof List) {
        Assert.assertArrayEquals(((List) row).toArray(), (Object[]) valueSelector.getObject());
        if (expectedType.getSingleType() != null) {
Check warning (Code scanning / CodeQL): Dereferenced variable may be null
    )
        throws IOException
    {
      SegmentWriteOutMediumFactory writeOutMediumFactory = TmpFileSegmentWriteOutMediumFactory.instance();
Check notice (Code scanning / CodeQL): Unread local variable
    NestedCommonFormatColumnPartSerde partSerde = NestedCommonFormatColumnPartSerde.serializerBuilder()
                                                                                   .isConstant(false)
                                                                                   .isVariantType(false)
                                                                                   .withByteOrder(ByteOrder.nativeOrder())
                                                                                   .withHasNulls(true)
                                                                                   .withLogicalType(ColumnType.LONG)
                                                                                   .build();
Check notice (Code scanning / CodeQL): Unread local variable
I think that CodeQL has a point here. Is this test doing something?
    DefaultBitmapResultFactory resultFactory = new DefaultBitmapResultFactory(bitmapSerdeFactory.getBitmapFactory());

    public ConstantColumnSupplierTest(
        @SuppressWarnings("unused") String name,
Check notice (Code scanning / CodeQL): Useless parameter
    } else {
      nulls = null;
    }
    switch (logicalType.getType()) {
Check warning (Code scanning / CodeQL): Missing enum case in switch
Can you add a default case that just falls through and have a comment that explains why it's good to fall through?
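The request above could look roughly like the following sketch. `ValueKind` is a hypothetical stand-in enum, not Druid's actual type enum, and the labels are invented for illustration. The point is only the shape: an explicit `default` branch that does nothing, with a comment saying why that is intended.

```java
// Hypothetical stand-in for the column type enum; not Druid's actual enum.
enum ValueKind { STRING, LONG, DOUBLE, ARRAY, COMPLEX }

final class SwitchDefaultSketch
{
  /** Returns a specialized label for primitive kinds; other kinds keep the fallback. */
  static String describe(ValueKind kind)
  {
    String label = "generic";
    switch (kind) {
      case STRING:
        label = "string";
        break;
      case LONG:
        label = "long";
        break;
      case DOUBLE:
        label = "double";
        break;
      default:
        // Intentionally empty: the remaining kinds need no special handling,
        // so the fallback assigned above is kept. Spelling this out tells
        // readers and static analysis that the omission is deliberate.
        break;
    }
    return label;
  }
}
```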
imply-cheddar left a comment
I think we can do better than using a Jackson-serialized form of the object as our constant identifier and would hope that we would do that before persisting any constant columns. So, I think that for this PR we either separate out the "store constant columns" into a different PR or update this PR to store the constants as a type and relevant dictionaryId.
    if (capabilities != null) {
      bob.hasMultipleValues(capabilities.hasMultipleValues().isTrue())
         .hasNulls(capabilities.hasNulls().isMaybeTrue());
    }
Does this need to be inside of the try? I don't think it does, but I'm wondering if I'm missing something.
    FieldIndexer rootField = fieldIndexers.get(NestedPathFinder.JSON_PATH_ROOT);
    ColumnType singleType = rootField.getTypes().getSingleType();
    return singleType == null ? ColumnType.NESTED_DATA : singleType;
    if (!hasNestedData) {
Nit: make this positive instead of negative.
    if (fieldIndexers.isEmpty()) {
      // we didn't see anything, so we can be anything, so why not a string?
      return ColumnType.STRING;
    }
I would've expected this case to be isConstant = true and constantValue = null. Why not check for those states instead?
    if (column instanceof ConstantColumn) {
      return new NestedColumnMergable(
          new SortedValueDictionary(
              column.getStringDictionary(),
              column.getLongDictionary(),
              column.getDoubleDictionary(),
              column.getArrayDictionary(),
              column
          ),
          column.getFieldTypeInfo(),
          ColumnType.NESTED_DATA.equals(column.getLogicalType()),
          true,
          ((ConstantColumn) column).getConstantValue()
      );
    }
Why not move this check to be before the if (col instanceof NestedCommonFormatColumn) check? It doesn't seem to be meaningful that this is nested.
        throw new RE(ex, "Failed to deserialize V%s column [%s].", version, columnName);
      }
    } else {
      throw new RE("Unknown version " + version);
interpolate with []?
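A sketch of the suggested message style, mirroring the bracketed `[%s]` placeholder already used in the "Failed to deserialize" message in the snippet above. To stay self-contained it formats with plain `String.format` rather than Druid's `RE`; `VersionCheckSketch` is a hypothetical name. In the PR itself this would presumably become `new RE("Unknown version [%s]", version)`.

```java
import java.util.Locale;

// Hypothetical helper for illustration; the PR code throws Druid's RE instead.
final class VersionCheckSketch
{
  /** Builds the error message with the interpolated value wrapped in brackets. */
  static String unknownVersionMessage(int version)
  {
    return String.format(Locale.ROOT, "Unknown version [%s]", version);
  }
}
```

The brackets make the boundaries of the interpolated value visible, which helps when the value is empty or contains whitespace.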
    final Object constantValue = NestedDataComplexTypeSerde.OBJECT_MAPPER.readValue(
        IndexMerger.SERIALIZER_UTILS.readBytes(bb, valueLength),
        Object.class
    );
We can do better than this. We know the type, let's persist the type and persist it as the lookup into whatever the relevant dictionary is. Given the type and the global dictionary id, we should be able to quite simply deserialize the thing...
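One possible reading of this suggestion, as a minimal sketch rather than the format the PR ended up with: store a type code plus a dictionary id instead of a JSON blob, and resolve the value against the matching dictionary on read. Everything here is an assumption for illustration: the `ConstantRefSketch` class, the type codes, the five-byte layout, and the plain arrays standing in for Druid's value dictionaries.

```java
import java.nio.ByteBuffer;

// Hypothetical layout: 1 byte type code followed by a 4 byte dictionary id.
final class ConstantRefSketch
{
  static final byte TYPE_STRING = 0;
  static final byte TYPE_LONG = 1;
  static final byte TYPE_DOUBLE = 2;

  final byte typeCode;
  final int dictionaryId;

  ConstantRefSketch(byte typeCode, int dictionaryId)
  {
    this.typeCode = typeCode;
    this.dictionaryId = dictionaryId;
  }

  /** Serializes the reference into a fixed-size buffer, ready for reading. */
  ByteBuffer write()
  {
    ByteBuffer bb = ByteBuffer.allocate(Byte.BYTES + Integer.BYTES);
    bb.put(typeCode).putInt(dictionaryId);
    bb.flip();
    return bb;
  }

  /** Reads a reference back from the buffer's current position. */
  static ConstantRefSketch read(ByteBuffer bb)
  {
    byte type = bb.get();
    int id = bb.getInt();
    return new ConstantRefSketch(type, id);
  }

  /** Looks the constant up in the dictionary selected by its type code. */
  static Object resolve(ConstantRefSketch ref, String[] strings, long[] longs, double[] doubles)
  {
    switch (ref.typeCode) {
      case TYPE_STRING:
        return strings[ref.dictionaryId];
      case TYPE_LONG:
        return longs[ref.dictionaryId];
      case TYPE_DOUBLE:
        return doubles[ref.dictionaryId];
      default:
        throw new IllegalArgumentException("Unknown type code [" + ref.typeCode + "]");
    }
  }
}
```

Compared with a serialized JSON value, the reference is fixed-size and needs no object mapper on read, at the cost of requiring the dictionaries to be available.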
    this.matchBitmap = bitmapSerdeFactory.getBitmapFactory().complement(
        bitmapSerdeFactory.getBitmapFactory().makeEmptyImmutableBitmap(),
        numRows
    );
Remind me of the lifecycle of the ConstantColumnAndIndexSupplier. I'm worried that it exists on-heap on segment load, where I don't think it's good to pay the overhead of this on-heap object. It would be better to just build it on demand at query time (it's not really expensive to build).
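The on-demand idea could be sketched as below, under stated assumptions: `java.util.BitSet` stands in for Druid's immutable bitmap, and `OnDemandMatchAllSketch` is a hypothetical class. The supplier keeps only the row count at load time and builds the "all rows match" bitmap each time it is asked for.

```java
import java.util.BitSet;
import java.util.function.Supplier;

// Hypothetical illustration; BitSet stands in for Druid's bitmap types.
final class OnDemandMatchAllSketch implements Supplier<BitSet>
{
  // The only state held for the lifetime of the loaded segment.
  private final int numRows;

  OnDemandMatchAllSketch(int numRows)
  {
    this.numRows = numRows;
  }

  /** Builds a fresh bitmap with rows [0, numRows) set on every call. */
  @Override
  public BitSet get()
  {
    BitSet all = new BitSet(numRows);
    all.set(0, numRows);
    return all;
  }
}
```

The trade-off is a small amount of repeated work per query in exchange for holding no bitmap on heap between queries.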
    FieldTypeInfo.MutableTypeSet rootOnlyType = new FieldTypeInfo.MutableTypeSet().add(getLogicalType());
    SortedMap<String, FieldTypeInfo.MutableTypeSet> fields = new TreeMap<>();
    fields.put(NestedPathFinder.JSON_PATH_ROOT, rootOnlyType);
    if (!getLogicalType().equals(ColumnType.NESTED_DATA)) {
nit: invert it so that there's 0 chance of an NPE. I.e. ColumnType.NESTED_DATA.equals(getLogicalType()) cannot NPE on this line.
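The nit can be illustrated with a small self-contained sketch. A `String` constant stands in for `ColumnType.NESTED_DATA`, and `NullSafeEqualsSketch` is a hypothetical name. Calling `equals` on the known non-null constant means a null argument simply compares unequal instead of throwing.

```java
// Hypothetical illustration; a String constant stands in for ColumnType.NESTED_DATA.
final class NullSafeEqualsSketch
{
  static final String NESTED = "COMPLEX<json>";

  /** Null-safe: the receiver is a constant, so a null argument returns false. */
  static boolean isNested(String logicalType)
  {
    return NESTED.equals(logicalType);
  }

  /** Throws NullPointerException when logicalType is null. */
  static boolean isNestedUnsafe(String logicalType)
  {
    return logicalType.equals(NESTED);
  }
}
```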
Fixes #14339
Description
changes:
Release note
TBD
This PR has: