New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: Pull query table scans support LIKE and BETWEEN operators #8299
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @hli21! first contribution 🎉 - I left a few comments from a quick scan
@@ -275,7 +278,7 @@ private void extractKeysAndSystemCols() { | |||
@Override | |||
public Void process(final Expression node, final Object context) { | |||
if (!(node instanceof LogicalBinaryExpression) | |||
&& !(node instanceof ComparisonExpression)) { | |||
&& !(node instanceof ComparisonExpression) && !(node instanceof LikePredicate) && !(node instanceof BetweenPredicate)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just wondering, with table scans implemented should there ever be a condition that we don't support?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We validate the WHERE clause for pull queries when building logic plan.
"Like condition on non-existent column " + columnName, isWindowed)); | ||
|
||
if (SqlBaseType.STRING != col.type().baseType()) { | ||
throw invalidWhereClauseException("The column type for Like condition must be VARCHAR", isWindowed); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: would be good to include the the type that it was in the error message just to make it extra clear
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
public Void visitLikePredicate(final LikePredicate node, final Object context) { | ||
final UnqualifiedColumnReferenceExp column = (UnqualifiedColumnReferenceExp) node.getValue(); | ||
if (column == null) { | ||
throw invalidWhereClauseException("Like condition must directly reference a key column", isWindowed); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this true? won't a LIKE
expression need to do a table scan anyway? (perhaps we should be calling setTableScanOrElseThrow
here - in fact it may very well be a bug not to, right?)
also let's add tests for that (LIKE
expressions on a non-key field)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LIKE does table scan. Change it to setTableScanOrElseThrow. We test key and non-key fields for LIKE.
public Void visitBetweenPredicate(final BetweenPredicate node, final Object context) { | ||
final UnqualifiedColumnReferenceExp column = (UnqualifiedColumnReferenceExp) node.getValue(); | ||
if (column == null) { | ||
throw invalidWhereClauseException("Between condition must directly reference a key column", isWindowed); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comment as above - I think at the moment (until we fix the multiple-range queries merging issue) wouldn't all BETWEEN
calls end up in a table scan anyway?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BETWEEN does table scan. Change it to setTableScanOrElseThrow.
], | ||
"tests": [ | ||
{ | ||
"name": "like operator must succeed with table scan enabled ", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should add a test for it failing without table scan enabled
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a test of LIKE on non-varchar column.
I just wanted to let you know: I think some of the comments are addressed there. If it makes sense to take over anything, please feel free to do so: Thanks
|
.orElseThrow(() -> invalidWhereClauseException( | ||
"Like condition on non-existent column " + columnName, isWindowed)); | ||
|
||
if (SqlBaseType.STRING != col.type().baseType()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not a big deal, but we could also allow a string literal as well, or even something that evaluates to a string. If you notice in the comparison expression visitor, we actually do this sort of check but only throw by calling setTableScanOrElseThrow
, which effectively doesn't throw when table scans are enabled, which they are now in all of production.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically, we already have a type checker in use invoked here: https://github.com/confluentinc/ksql/blob/master/ksqldb-execution/src/main/java/io/confluent/ksql/execution/interpreter/InterpretedExpressionFactory.java#L53
This should ensure that all of the types you get are compatible, but I see that we are able to hit the invalidWhereClauseException
case here in the tests. Looking at ExpressionTypeManager, I can see that we actually skipped the a lot of checks there and just say that any LIKE evaluates to a bool, without type checking its children. To ensure that the children of the LIKE expression evaluated to string we would have to traverse the children too, so maybe my suggestion above to allow other types of expressions can be avoided until we do that. That doesn't have to be in this PR.
We should also ensure that the pattern is a string. Not sure if any checks are currently in place for this. Might be good to write a test for that case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Allow string literal. Add check to pattern and more tests.
@@ -358,6 +363,41 @@ public Void visitComparisonExpression( | |||
} | |||
} | |||
|
|||
@Override | |||
public Void visitLikePredicate(final LikePredicate node, final Object context) { | |||
final UnqualifiedColumnReferenceExp column = (UnqualifiedColumnReferenceExp) node.getValue(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe there's no guarantee that it will be a column reference. You might want to do an instanceof check and throw if not. I think that's what the null check is meant to do below. I'm not sure if it will actually ever be null, but it can be a literal, for example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
|
||
@Override | ||
public Void visitBetweenPredicate(final BetweenPredicate node, final Object context) { | ||
final UnqualifiedColumnReferenceExp column = (UnqualifiedColumnReferenceExp) node.getValue(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment here. This isn't guaranteed to be a column reference. You have to do instanceof checks and potentially throw if not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rewrite the expression to comparison.
.orElseThrow(() -> invalidWhereClauseException( | ||
"Between condition on non-existent column " + columnName, isWindowed)); | ||
|
||
return null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to like, I don't think we do type checking of the bounds in ExpressionTypeManager
. The complicating thing is that between can take multiple types. I think we need to either extend ExpressionTypeManager
or rewrite the expression similar to the in predicate and make this equivalent to (COL <= A AND COL >= B). In fact, if we did that, we would get the type checking automatically since it's done correct for comparisons.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rewrite the expression to comparison.
"statements": [ | ||
"CREATE TABLE INPUT (ID INTEGER PRIMARY KEY, V0 STRING, V1 STRING) WITH (kafka_topic='test_topic', value_format='JSON');", | ||
"CREATE TABLE MATVIEW AS SELECT ID, V0, V1 FROM INPUT;", | ||
"SELECT ID, V0, V1 FROM MATVIEW WHERE ID LIKE '%12';" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might also want a test where the pattern isn't a string
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More tests are added.
"statements": [ | ||
"CREATE STREAM INPUT (ID1 STRING KEY, ID2 INT, V0 STRING) WITH (kafka_topic='test_topic', format='JSON');", | ||
"CREATE TABLE AGGREGATE AS SELECT ID1, ID2, V0, COUNT(1) AS COUNT FROM INPUT GROUP BY ID1, ID2, V0;", | ||
"SELECT * FROM AGGREGATE WHERE ID1 BETWEEN '10' and '11';", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might want some error cases where ID1 isn't a column and its type disagrees with the literals.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More tests are added.
&& !(node instanceof ComparisonExpression)) { | ||
&& !(node instanceof ComparisonExpression) | ||
&& !(node instanceof LikePredicate) | ||
&& !(node instanceof BetweenPredicate)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can remove BetweenPredicate
from here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
"SELECT ID, V0, V1 FROM MATVIEW WHERE '12' LIKE '12';" | ||
], | ||
"properties": { | ||
"ksql.query.pull.table.scan.enabled": true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might want to introduce at least one test without table scans enabled and show that we throw the error when not used with a column.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added test for table scan disable case.
@@ -36,10 +37,16 @@ private PullQueryRewriter() { | |||
public static Expression rewrite(final Expression expression) { | |||
final Expression pseudoTimestamp = new StatementRewriteForMagicPseudoTimestamp() | |||
.rewrite(expression); | |||
final Expression inPredicatesRemoved = rewriteInPredicates(pseudoTimestamp); | |||
final Expression betweenPredicatesRemoved = rewriteBetweenPredicates(pseudoTimestamp); | |||
final Expression inPredicatesRemoved = rewriteInPredicates(betweenPredicatesRemoved); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have a test class PullQueryRewriterTest
. Would be good to add some good test cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added tests for between predicate.
isWindowed); | ||
} | ||
} else { | ||
setTableScanOrElseThrow(() -> invalidWhereClauseException("Like condition must directly " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It occurs to me, even if they reference a key column, we're still going to do a table scan for this, so might want to update the messaging.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Message changed.
if (node.getValue() instanceof UnqualifiedColumnReferenceExp) { | ||
final UnqualifiedColumnReferenceExp column = (UnqualifiedColumnReferenceExp) node.getValue(); | ||
final ColumnName columnName = column.getColumnName(); | ||
final Column col = schema.findColumn(columnName) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we're effectively doing the type checking here since it's not done in ExpressionTypeManager
, we should also verify that the column is a string type too, unless we actually automatically coerce the type, but I don't think we do. Might want to test that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added string type column verification.
@@ -47,6 +49,7 @@ | |||
import io.confluent.ksql.schema.ksql.DefaultSqlValueCoercer; | |||
import io.confluent.ksql.schema.ksql.LogicalSchema; | |||
import io.confluent.ksql.schema.ksql.SystemColumns; | |||
import io.confluent.ksql.schema.ksql.types.SqlBaseType; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The RQTT tests are definitely good, though it might make sense to add a test or two to the unit tests for QueryFilterNodeTest
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some unit tests.
Hi! After deploy the master branch, I tried to use this feature but I run in some problems: When I try to run:
I got the error: However, running this 2 queries:
The results are the excepted (and the same). What I expected: |
hi @sandro-waf - thanks for reporting, I created a separate issue to look at this as often comments on closed PRs are missed: #8359 |
Description
Add the support of LIKE and BETWEEN operators for pull query table scans.
This problem was reported in #7793
Testing done
Add new tests to Integration Test.
Reviewer checklist