Skip to content

optimize constant auto columns#14412

Closed
clintropolis wants to merge 3 commits intoapache:masterfrom
clintropolis:auto-constant-columns-and-bug-fixes
Closed

optimize constant auto columns#14412
clintropolis wants to merge 3 commits intoapache:masterfrom
clintropolis:auto-constant-columns-and-bug-fixes

Conversation

@clintropolis
Copy link
Member

Fixes #14339

Description

changes:

  • auto columns can now specialize the case when the column contains only constants, avoiding writing any real columns and just producing constant selectors and indexes. since this is backwards incompatible it is gated behind a flag on IndexSpec, optimizeJsonConstantColumns, which should be removed in Druid 28
  • auto columns no longer participate in generic 'null column' handling, this was a mistake to try to support and caused ingestion failures due to mismatched ColumnFormat, and is replaced by the new constant column functionality
  • fix bugs with auto columns which contain empty objects, empty arrays, or primitive types mixed with either of these empty constructs

Release note

TBD


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

changes:
* auto columns can now specialize the case when the column contains only constants, avoiding writing any real columns and just producing constant selectors and indexes. since this is backwards incompatible it is gated behind a flag on IndexSpec, optimizeJsonConstantColumns, which should be removed in Druid 28
* auto columns no longer participate in generic 'null column' handling, this was a mistake to try to support and caused ingestion failures due to mismatched ColumnFormat, and is replaced by the new constant column functionality
* fix bugs with auto columns which contain empty objects, empty arrays, or primitive types mixed with either of these empty constructs
@clintropolis clintropolis force-pushed the auto-constant-columns-and-bug-fixes branch from 7060c71 to 8f3d8df Compare June 13, 2023 03:02
if (row != null) {
if (row instanceof List) {
Assert.assertArrayEquals(((List) row).toArray(), (Object[]) valueSelector.getObject());
if (expectedType.getSingleType() != null) {

Check warning

Code scanning / CodeQL

Dereferenced variable may be null

Variable [expectedType](1) may be null at this access as suggested by [this](2) null guard. Variable [expectedType](1) may be null at this access as suggested by [this](3) null guard.
)
throws IOException
{
SegmentWriteOutMediumFactory writeOutMediumFactory = TmpFileSegmentWriteOutMediumFactory.instance();

Check notice

Code scanning / CodeQL

Unread local variable

Variable 'SegmentWriteOutMediumFactory writeOutMediumFactory' is never read.
Comment on lines +36 to +42
NestedCommonFormatColumnPartSerde partSerde = NestedCommonFormatColumnPartSerde.serializerBuilder()
.isConstant(false)
.isVariantType(false)
.withByteOrder(ByteOrder.nativeOrder())
.withHasNulls(true)
.withLogicalType(ColumnType.LONG)
.build();

Check notice

Code scanning / CodeQL

Unread local variable

Variable 'NestedCommonFormatColumnPartSerde partSerde' is never read.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that CodeQL has a point here, is this test doing something?

DefaultBitmapResultFactory resultFactory = new DefaultBitmapResultFactory(bitmapSerdeFactory.getBitmapFactory());

public ConstantColumnSupplierTest(
@SuppressWarnings("unused") String name,

Check notice

Code scanning / CodeQL

Useless parameter

The parameter 'name' is never used.
} else {
nulls = null;
}
switch (logicalType.getType()) {

Check warning

Code scanning / CodeQL

Missing enum case in switch

Switch statement does not have a case for [ARRAY](1). Switch statement does not have a case for [COMPLEX](2). Switch statement does not have a case for [STRING](3).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a default case that just falls through and have a comment that explains why it's good to fall through?

Copy link
Contributor

@imply-cheddar imply-cheddar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can do better than using a Jackson-serialized form of the object as our constant identifier and would hope that we would do that before persisting any constant columns. So, I think that for this PR we either separate out the "store constant columns" into a different PR or update this PR to store the constants as a type and relevant dictionaryId.

Comment on lines +350 to +353
if (capabilities != null) {
bob.hasMultipleValues(capabilities.hasMultipleValues().isTrue())
.hasNulls(capabilities.hasNulls().isMaybeTrue());
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to be inside of the try? I don't think it does, but I'm wondering if I'm missing something.

FieldIndexer rootField = fieldIndexers.get(NestedPathFinder.JSON_PATH_ROOT);
ColumnType singleType = rootField.getTypes().getSingleType();
return singleType == null ? ColumnType.NESTED_DATA : singleType;
if (!hasNestedData) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: make this positive instead of negative.

Comment on lines +320 to +323
if (fieldIndexers.isEmpty()) {
// we didn't see anything, so we can be anything, so why not a string?
return ColumnType.STRING;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would've expected this case to be isConstant = true and constantValue = null, why not check for those states instead?

Comment on lines +189 to +203
if (column instanceof ConstantColumn) {
return new NestedColumnMergable(
new SortedValueDictionary(
column.getStringDictionary(),
column.getLongDictionary(),
column.getDoubleDictionary(),
column.getArrayDictionary(),
column
),
column.getFieldTypeInfo(),
ColumnType.NESTED_DATA.equals(column.getLogicalType()),
true,
((ConstantColumn) column).getConstantValue()
);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not move this check to be before the if( col instanceof NestedCommonFormatColumn) check. It doesn't seem to be meaningful that this is nested?

} else {
nulls = null;
}
switch (logicalType.getType()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a default case that just falls through and have a comment that explains why it's good to fall through?

throw new RE(ex, "Failed to deserialize V%s column [%s].", version, columnName);
}
} else {
throw new RE("Unknown version " + version);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

interpolate with []?

Comment on lines +73 to +76
final Object constantValue = NestedDataComplexTypeSerde.OBJECT_MAPPER.readValue(
IndexMerger.SERIALIZER_UTILS.readBytes(bb, valueLength),
Object.class
);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can do better than this. We know the type, let's persist the type and persist it as the lookup into whatever the relevant dictionary is. Given the type and the global dictionary id, we should be able to quite simply deserialize the thing...

Comment on lines +107 to +110
this.matchBitmap = bitmapSerdeFactory.getBitmapFactory().complement(
bitmapSerdeFactory.getBitmapFactory().makeEmptyImmutableBitmap(),
numRows
);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remind me of the lifecycle of the ConstantColumnAndIndexSupplier, I'm worried that it exists on-heap on segment-load, where I don't think it's good to pay the overhead of this on-heap object. It would be better to just build it on-demand at query time (it's not really expensive to build).

FieldTypeInfo.MutableTypeSet rootOnlyType = new FieldTypeInfo.MutableTypeSet().add(getLogicalType());
SortedMap<String, FieldTypeInfo.MutableTypeSet> fields = new TreeMap<>();
fields.put(NestedPathFinder.JSON_PATH_ROOT, rootOnlyType);
if (!getLogicalType().equals(ColumnType.NESTED_DATA)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: invert it so that there's 0 chance of an NPE. I.e. ColumnType.NESTED_DATA.equals(getLogicalType()) cannot NPE on this line

Comment on lines +36 to +42
NestedCommonFormatColumnPartSerde partSerde = NestedCommonFormatColumnPartSerde.serializerBuilder()
.isConstant(false)
.isVariantType(false)
.withByteOrder(ByteOrder.nativeOrder())
.withHasNulls(true)
.withLogicalType(ColumnType.LONG)
.build();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that CodeQL has a point here, is this test doing something?

@clintropolis clintropolis mentioned this pull request Jun 14, 2023
10 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

'json' and 'auto' columns improperly handle case where all rows are empty objects

2 participants