fix ExpressionVirtualColumn equivalence key to use Expr.stringify instead of raw input string to stabilize comparison #18334

Merged
clintropolis merged 6 commits into apache:master from clintropolis:fix-expression-virtual-column-equivalence
Aug 2, 2025

@clintropolis
Member

Description

Fixes a bug with ExpressionVirtualColumn equivalence checking (VirtualColumn.getEquivalenceKey()) caused by using the raw input expression string instead of parsing the expression and using stringify to stabilize it to a homogeneous value. This method is used to match virtual columns in projections with query-time virtual columns, so unless the projection was created with exactly the same expression string as the query-time expression, it would be unable to match correctly (really counter to the purpose of all the equivalence key stuff in the first place, which was added for this purpose 😅)

After the change, expressionString is barely needed, but I have retained it so that JSON serialization and such are not required to parse the expression and restringify just to serialize, and so that the behavior of equals/hashCode on the parent ExpressionVirtualColumn type does not change (the EquivalenceKey is the internal Expression type).
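
To make the split between the two notions of equality concrete, here is a minimal, self-contained sketch. It is not the Druid code: `ToyVirtualColumn` and `normalize` are hypothetical names, and `normalize` only strips whitespace as a crude stand-in for parsing an expression and calling stringify. The point is just that equals/hashCode can keep using the raw string while the equivalence key uses a normalized form.

```java
import java.util.Objects;

public class EquivalenceKeySketch {
    // Hypothetical stand-in for "parse, then stringify". It only strips whitespace,
    // which is far less than a real expression parser does (and would be wrong
    // inside string literals); it exists purely to illustrate the idea.
    static String normalize(String rawExpression) {
        return rawExpression.replaceAll("\\s+", "");
    }

    // Toy virtual column: equals/hashCode use the raw input string,
    // while the equivalence key uses the normalized form.
    static final class ToyVirtualColumn {
        final String name;
        final String expressionString;

        ToyVirtualColumn(String name, String expressionString) {
            this.name = Objects.requireNonNull(name, "name");
            this.expressionString = Objects.requireNonNull(expressionString, "expression");
        }

        // Used for matching, so formatting differences should not matter.
        String equivalenceKey() {
            return normalize(expressionString);
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) {
                return true;
            }
            if (!(o instanceof ToyVirtualColumn)) {
                return false;
            }
            ToyVirtualColumn that = (ToyVirtualColumn) o;
            return name.equals(that.name) && expressionString.equals(that.expressionString);
        }

        @Override
        public int hashCode() {
            return Objects.hash(name, expressionString);
        }
    }

    public static void main(String[] args) {
        ToyVirtualColumn projection = new ToyVirtualColumn("v0", "a + b");
        ToyVirtualColumn query = new ToyVirtualColumn("v0", "a+b");

        // Raw-string equality distinguishes the two spellings.
        System.out.println(projection.equals(query)); // false
        // The normalized keys agree, so the two columns can be matched.
        System.out.println(projection.equivalenceKey().equals(query.equivalenceKey())); // true
    }
}
```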

…tead of raw input string to stabilize comparison
Contributor

@capistrant capistrant left a comment

Looks good!

this.name = Preconditions.checkNotNull(name, "name");
this.expression = new Expression(Preconditions.checkNotNull(expression, "expression"), outputType);
this.parsedExpression = parsedExpression;
this.expression = new Expression(
Contributor

since parsedExpression.get() is already called in line 132, why does Expression need a Supplier<Expr> instead of just Expr? Maybe Parser.lazyParse is not needed?

Member Author

I'm not going to remove Parser.lazyParse. This is behind a supplier so we don't spend time parsing expressions unless we actually need to do something with the Expr, since parsing the expression can add up quite a bit in terms of CPU cost.

Contributor

i see, so maybe we should do:

this.expressionAnalysis = Suppliers.memoize(() -> parsedExpression.get().analyzeInputs());

Member Author

oops, yea that is a mistake, good catch
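
The exchange above is about keeping parsing lazy: a value derived from the parsed expression should sit behind its own memoized supplier, so that constructing the object does not trigger a parse. A rough JDK-only sketch of that idea follows. The `memoize` helper here is a hand-rolled stand-in for Guava's Suppliers.memoize, and `parse` with its counter is hypothetical, used only to show when parsing actually happens.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class LazyParseSketch {
    // Minimal memoizing supplier built on the JDK only; a stand-in for
    // Guava's Suppliers.memoize, not a copy of it.
    static <T> Supplier<T> memoize(Supplier<T> delegate) {
        return new Supplier<T>() {
            private T value;
            private boolean computed;

            @Override
            public synchronized T get() {
                if (!computed) {
                    value = delegate.get();
                    computed = true;
                }
                return value;
            }
        };
    }

    // Counts how many times the (hypothetical) parse step runs.
    static final AtomicInteger PARSE_COUNT = new AtomicInteger();

    // Hypothetical "parse" that just trims the input and bumps the counter.
    static String parse(String raw) {
        PARSE_COUNT.incrementAndGet();
        return raw.trim();
    }

    public static void main(String[] args) {
        Supplier<String> parsed = memoize(() -> parse(" a + b "));
        // Wrapping the derived value in another memoized supplier means
        // nothing is parsed at construction time.
        Supplier<Integer> analysis = memoize(() -> parsed.get().length());

        System.out.println(PARSE_COUNT.get()); // 0: nothing parsed yet
        analysis.get();
        analysis.get();
        parsed.get();
        System.out.println(PARSE_COUNT.get()); // 1: parsed once, then cached
    }
}
```

Calling `parsed.get()` directly while building the derived value would instead force the parse immediately, which is the mistake acknowledged in the thread.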

}
Expression that = (Expression) o;
- return Objects.equals(expressionString, that.expressionString) && Objects.equals(outputType, that.outputType);
+ return Objects.equals(parsed.get().stringify(), that.parsed.get().stringify())
Contributor

I spot checked a few Expr and they seem to have equals implemented, is there any reason to call stringify specifically?

Member Author

it's probably ok... for whatever reason I am hesitant to trust equals: even tho Expr is marked with SubclassesMustOverrideEqualsAndHashCode, it has a lot less test coverage than stringify, which has a lot of coverage to ensure that expr.stringify() makes a string that can be parsed back into the same expr. I guess the solution to my feelings is to just add more coverage using equals and hashcode instead of not using them.

Maybe I'll try to add some coverage and switch to equals.

Member Author

adding some equals/hashcode checks to FunctionTest has hit a handful of failures, so my hunch was maybe correct. It is certainly wack that equals/hashcode are not equal, but I think the verdict is that I'll leave this as using stringify until I can fix it

Member Author

an update: the failures I've seen so far seem related to typed null constants (like NullLongExpr not being equal to NullDoubleExpr or a StringExpr with a null value), but these all have the same stringified form. Not entirely sure how to best fix it, since they are technically different Expr, but they are equivalent... we might need a method like equals but less strict, since a given string expression might have multiple possible Expr forms depending on, for example, whether asSingleThreaded is called or not, etc
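
The typed-null situation described above can be pictured with a toy model. Everything in this sketch is hypothetical (`ToyExpr`, `NullLong`, `NullDouble`, `sameStringifiedForm`); it is not Druid's Expr hierarchy, and it assumes both null constants stringify to the text `null`. It only shows how two nodes can fail a strict equals check yet agree on their stringified form.

```java
import java.util.Objects;

public class TypedNullSketch {
    // Hypothetical miniature of an expression node; not Druid's Expr interface.
    interface ToyExpr {
        String stringify();
    }

    // A null constant typed as long. Strict equals only matches the same class.
    static final class NullLong implements ToyExpr {
        @Override
        public String stringify() {
            return "null"; // assumed stringified form for this sketch
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof NullLong;
        }

        @Override
        public int hashCode() {
            return NullLong.class.hashCode();
        }
    }

    // A null constant typed as double, with the same stringified form.
    static final class NullDouble implements ToyExpr {
        @Override
        public String stringify() {
            return "null"; // assumed stringified form for this sketch
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof NullDouble;
        }

        @Override
        public int hashCode() {
            return NullDouble.class.hashCode();
        }
    }

    // Looser comparison in the spirit of the thread: compare stringified forms.
    static boolean sameStringifiedForm(ToyExpr a, ToyExpr b) {
        return Objects.equals(a.stringify(), b.stringify());
    }

    public static void main(String[] args) {
        ToyExpr nullLong = new NullLong();
        ToyExpr nullDouble = new NullDouble();

        System.out.println(nullLong.equals(nullDouble)); // false: different node types
        System.out.println(sameStringifiedForm(nullLong, nullDouble)); // true: same text
    }
}
```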

Member Author

i suppose more technically we need an actual equivalence method for Expr to power all of this stuff, then it could handle more general equivalence like a + b and b + a being interchangeable, but that's too much for this PR i think

public int hashCode()
{
- return Objects.hash(name, expression);
+ return Objects.hash(name, expression.expressionString, expression.outputType);
Contributor

what's the difference between expressionString and expression.parsed.get().stringify()?

Member Author

@clintropolis clintropolis Jul 31, 2025

expressionString is the raw expression as input by the user from the JSON; stringifying the parsed expression homogenizes this, so a+b and a + b and "a" + "b" become the same expr after being parsed. (This is the reason the equivalence check was broken: using the raw input strings is not correct because the user could write the same expression in a lot of different ways.)

@clintropolis clintropolis merged commit eb44f1e into apache:master Aug 2, 2025
133 of 134 checks passed
@clintropolis clintropolis deleted the fix-expression-virtual-column-equivalence branch August 2, 2025 03:50
@cecemei cecemei added this to the 35.0.0 milestone Oct 21, 2025