Skip to content

improve exception message for mismatched types when merging CapabilitiesBasedFormat#17760

Merged
gianm merged 2 commits intoapache:masterfrom
clintropolis:better-log-column-format-incompatible
Feb 26, 2025
Merged

improve exception message for mismatched types when merging CapabilitiesBasedFormat#17760
gianm merged 2 commits intoapache:masterfrom
clintropolis:better-log-column-format-incompatible

Conversation

@clintropolis
Copy link
Member

Improves exception message when merging CapabilitiesBasedFormat in IndexMergerV9 when faced with mismatched complex types. Prior to this change merging a complex type with a string type for example would create an exception message with a message like Cannot merge columns of type[someComplex] and [null] since it was using the complex type name of both types, which will be null for non-complex types. After this change this exception message for complex types will be something like Cannot merge columns of type[COMPLEX<someComplex>] and [STRING] since it will now use the type strings.

I also modified a merge behavior of ColumnCapabilities.areDictionaryValuesSorted and ColumnCapabilities.areDictionaryValuesUnique in CapabilitiesBasedFormat.merge that probably doesn't really matter since these are computed for query time stuff and they do not actually affect segment persistence but seemed odd behavior considering everything else in the merge, so now it is consistent and uses 'or' instead of 'and'.

Copy link
Contributor

@gianm gianm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

error message change LGTM, but I'm not sure about the logic change.

merged.setHasMultipleValues(merged.hasMultipleValues().or(otherSnapshot.hasMultipleValues()).isTrue());
merged.setDictionaryValuesSorted(
merged.areDictionaryValuesSorted().and(otherSnapshot.areDictionaryValuesSorted()).isTrue()
merged.areDictionaryValuesSorted().or(otherSnapshot.areDictionaryValuesSorted()).isTrue()
Copy link
Contributor

@gianm gianm Feb 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't the correct choice depend on how the actual segment merging is expected to happen? Like:

  • if the merging preserves sortedness as long as both inputs are sorted, it should be and
  • if the merging sorts if either input is sorted, it should be or
  • if the merging always sorts regardless of input sortedness, it should just be hard coded to TRUE
  • if the merging requires that both inputs have the same sortedness, it should be an error if the two inputs have different sortedness

this change should be in a separate PR anyway, since it's a behavioral change beyond improving the exception message.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think what I really need to do is update the javadoc on CapabilitiesBasedFormat to indicate that it's purpose is to capture the 'legacy' column formats that previously relied on merging ColumnCapabilities to work correctly, which in this case the change in this PR doesn't matter because it doesn't affect anything because these 2 properties are not used at all for decisions persisting stuff since they are just computed when the segment is loaded.

if the merging always sorts regardless of input sortedness, it should just be hard coded to TRUE

This is probably the most correct interpretation of the existing behavior for the things that rely on this ColumnFormat though it would be kind of odd to hard code to true because these settings are only applicable to STRING typed columns, which is why they are all using 'OR' logic instead of 'AND'.

Everything going forward should just implement ColumnFormat directly to capture what makes sense for the columns physical storage details, which would be the appropriate place to decide how to merge differing physical formats instead of using this capabilities based fallback thingy, so I should add javadoc to make this clear that CapabilitiesBasedFormat is not a "good" implementation of ColumnFormat.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To add, your comment highlights why ColumnFormat exists in the first place, which is because simply merging ColumnCapabilities in a generic/universal manner isn't really powerful enough to model these things in a satisfying way.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TY for the commentary, that makes sense.

@gianm gianm merged commit 86383e0 into apache:master Feb 26, 2025
75 checks passed
@clintropolis clintropolis deleted the better-log-column-format-incompatible branch February 26, 2025 23:40
@kgyrtkirk kgyrtkirk added this to the 33.0.0 milestone Apr 14, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants