Skip to content

Comments

relax native expression multi-value string usage validation for conditional and null coalescing functions#13885

Closed
clintropolis wants to merge 1 commit intoapache:masterfrom
clintropolis:relax-expr-array-validation
Closed

relax native expression multi-value string usage validation for conditional and null coalescing functions#13885
clintropolis wants to merge 1 commit intoapache:masterfrom
clintropolis:relax-expr-array-validation

Conversation

@clintropolis
Copy link
Member

@clintropolis clintropolis commented Mar 6, 2023

Description

This PR relaxes multi-value string usage consistency validation to allow expressions like nvl, isnull, notnull, case and if statements, etc, to no longer automatically assume that their inputs must be scalar arguments.

This validation was added way back in #7588 as a means to ensure that a multi-value string column was being used consistently within an expression so that it cannot be considered both as an array and as a scalar value (which must be mapped across the multiple values). This system basically divided expressions up into expressions that definitely take array inputs and output arrays, and everything else. This worked well enough for lots of expressions, but a small handful don't fit this model, which is what this PR aims to help fix.

This system predates the native expression layer having full type inference, and I think it could be done a lot better, so I am also going to be trying to merge this much older analysis system into the type inference system in a follow-up PR, since all of the analysis of contextual usage and validation can be done at the same places where we doing the newer stuff (expression planner, etc).

While here I also consolidated some calcite -> native druid type stuff for the cast operator, which allowed some additional expressions to plan correctly.


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@gianm
Copy link
Contributor

gianm commented Mar 7, 2023

Could you describe the motivation and risk a little more?

As to motivation— are there some queries that can plan after this change, that couldn't plan before?

As to risk— are there some cases where the validation was important, and now wouldn't be happening anymore?

@clintropolis
Copy link
Member Author

As to motivation— are there some queries that can plan after this change, that couldn't plan before?

Right now there are certain queries that are basically impossible to write, anything involving null value coercion or conditionals combined with explicit multi-value string functions, such as all of the where clauses in the query added in CalciteMultiValueStringQueryTest, because the first set of functions assume a multi-value string is being used as a scalar and mixing with any of the MV_ triggers the validation failure.

This same information is currently used to determine if an expression should be automatically 'applied' because the string is being used as a scalar value but is in fact multi-valued, so the risk of these changes is that if something was relying on NVL to be mapped across each element of the multi-value string instead of replace the value with some array then that behavior will no longer happen in this PR.

This is why I am working on a follow-up that is what i consider the real fix, which is to switch to using the type inference system so that way an expression could look at its inputs inferred output types to determine if a multi-value string is being used as a scalar or as an array, and enforce consistency (and determine if translation is required) that way, but I'm not completely sure how complicated this fix is so put this up as a interim fix because i feel like it improves the behavior, though its certainly worth a discussion.

The more I think about it though, i'm feeling a bit less good about this interim fix so I've been continuing on the more correct fix of consolidating the stuff on type inference.

@gianm
Copy link
Contributor

gianm commented Mar 7, 2023

The more I think about it though, i'm feeling a bit less good about this interim fix so I've been continuing on the more correct fix of consolidating the stuff on type inference.

Hmm does that mean you'd like to close the PR? Or would you be updating it?

@clintropolis
Copy link
Member Author

clintropolis commented Mar 7, 2023

Hmm does that mean you'd like to close the PR? Or would you be updating it?

I'm not entirely certain yet so will leave this open for now; I'm considering splitting out the changes to CastOperatorConversion.java and Calcites.java into a separate PR since it allows some queries which don't currently plan to be able to plan without the potential for modifying the behavior of existing queries, and then saving the rest for the follow-up of the more correct fix to switch to using type inference for stuff.

@clintropolis
Copy link
Member Author

I opened #13890 which has just the cast operator changes as a partial alternative to this PR, where the other part would be covered by an in progress change to consolidate expression analysis stuff into a single pass based on the expression type inference stuff.

@clintropolis
Copy link
Member Author

closing in favor of #13890

@clintropolis clintropolis deleted the relax-expr-array-validation branch March 7, 2023 21:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants