Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Pull query table scans support LIKE and BETWEEN operators #8299

Merged
merged 7 commits into from Nov 5, 2021

Conversation

hli21
Copy link
Contributor

@hli21 hli21 commented Oct 28, 2021

Description

Add the support of LIKE and BETWEEN operators for pull query table scans.
This problem was reported in #7793

Testing done

Add new tests to Integration Test.

Reviewer checklist

  • Ensure docs are updated if necessary. (eg. if a user visible feature is being added or changed).
  • Ensure relevant issues are linked (description should include text like "Fixes #")

@hli21 hli21 requested a review from a team as a code owner October 28, 2021 00:02
Copy link
Contributor

@agavra agavra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @hli21! first contribution 🎉 - I left a few comments from a quick scan

@@ -275,7 +278,7 @@ private void extractKeysAndSystemCols() {
@Override
public Void process(final Expression node, final Object context) {
if (!(node instanceof LogicalBinaryExpression)
&& !(node instanceof ComparisonExpression)) {
&& !(node instanceof ComparisonExpression) && !(node instanceof LikePredicate) && !(node instanceof BetweenPredicate)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just wondering, with table scans implemented should there ever be a condition that we don't support?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We validate the WHERE clause for pull queries when building logic plan.

"Like condition on non-existent column " + columnName, isWindowed));

if (SqlBaseType.STRING != col.type().baseType()) {
throw invalidWhereClauseException("The column type for Like condition must be VARCHAR", isWindowed);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: would be good to include the the type that it was in the error message just to make it extra clear

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

public Void visitLikePredicate(final LikePredicate node, final Object context) {
final UnqualifiedColumnReferenceExp column = (UnqualifiedColumnReferenceExp) node.getValue();
if (column == null) {
throw invalidWhereClauseException("Like condition must directly reference a key column", isWindowed);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this true? won't a LIKE expression need to do a table scan anyway? (perhaps we should be calling setTableScanOrElseThrow here - in fact it may very well be a bug not to, right?)

also let's add tests for that (LIKE expressions on a non-key field)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LIKE does table scan. Change it to setTableScanOrElseThrow. We test key and non-key fields for LIKE.

public Void visitBetweenPredicate(final BetweenPredicate node, final Object context) {
final UnqualifiedColumnReferenceExp column = (UnqualifiedColumnReferenceExp) node.getValue();
if (column == null) {
throw invalidWhereClauseException("Between condition must directly reference a key column", isWindowed);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment as above - I think at the moment (until we fix the multiple-range queries merging issue) wouldn't all BETWEEN calls end up in a table scan anyway?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BETWEEN does table scan. Change it to setTableScanOrElseThrow.

],
"tests": [
{
"name": "like operator must succeed with table scan enabled ",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should add a test for it failing without table scan enabled

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a test of LIKE on non-varchar column.

@agavra agavra requested a review from a team October 28, 2021 17:44
@andre0x01
Copy link

andre0x01 commented Oct 28, 2021

Hi @hli21 @agavra

I just wanted to let you know:
Coincidentally, I am working on the LIKE part of the same issue :-)
Though I haven't created a PR yet.

I think some of the comments are addressed there.
(e.g. table scanning, unit tests in QueryFilterNodeTest)
There is also a small docs update for pull queries.
The limitations regarding the WHERE clause are currently listed there.

If it makes sense to take over anything, please feel free to do so:
https://github.com/confluentinc/ksql/compare/master...andy0x01:7793-support-like-in-pull-queries?expand=1

Thanks

  • Andre

.orElseThrow(() -> invalidWhereClauseException(
"Like condition on non-existent column " + columnName, isWindowed));

if (SqlBaseType.STRING != col.type().baseType()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not a big deal, but we could also allow a string literal as well, or even something that evaluates to a string. If you notice in the comparison expression visitor, we actually do this sort of check but only throw by calling setTableScanOrElseThrow, which effectively doesn't throw when table scans are enabled, which they are now in all of production.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically, we already have a type checker in use invoked here: https://github.com/confluentinc/ksql/blob/master/ksqldb-execution/src/main/java/io/confluent/ksql/execution/interpreter/InterpretedExpressionFactory.java#L53

This should ensure that all of the types you get are compatible, but I see that we are able to hit the invalidWhereClauseException case here in the tests. Looking at ExpressionTypeManager, I can see that we actually skipped the a lot of checks there and just say that any LIKE evaluates to a bool, without type checking its children. To ensure that the children of the LIKE expression evaluated to string we would have to traverse the children too, so maybe my suggestion above to allow other types of expressions can be avoided until we do that. That doesn't have to be in this PR.

We should also ensure that the pattern is a string. Not sure if any checks are currently in place for this. Might be good to write a test for that case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Allow string literal. Add check to pattern and more tests.

@@ -358,6 +363,41 @@ public Void visitComparisonExpression(
}
}

@Override
public Void visitLikePredicate(final LikePredicate node, final Object context) {
final UnqualifiedColumnReferenceExp column = (UnqualifiedColumnReferenceExp) node.getValue();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe there's no guarantee that it will be a column reference. You might want to do an instanceof check and throw if not. I think that's what the null check is meant to do below. I'm not sure if it will actually ever be null, but it can be a literal, for example.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


@Override
public Void visitBetweenPredicate(final BetweenPredicate node, final Object context) {
final UnqualifiedColumnReferenceExp column = (UnqualifiedColumnReferenceExp) node.getValue();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment here. This isn't guaranteed to be a column reference. You have to do instanceof checks and potentially throw if not.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rewrite the expression to comparison.

.orElseThrow(() -> invalidWhereClauseException(
"Between condition on non-existent column " + columnName, isWindowed));

return null;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to like, I don't think we do type checking of the bounds in ExpressionTypeManager. The complicating thing is that between can take multiple types. I think we need to either extend ExpressionTypeManager or rewrite the expression similar to the in predicate and make this equivalent to (COL <= A AND COL >= B). In fact, if we did that, we would get the type checking automatically since it's done correct for comparisons.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rewrite the expression to comparison.

"statements": [
"CREATE TABLE INPUT (ID INTEGER PRIMARY KEY, V0 STRING, V1 STRING) WITH (kafka_topic='test_topic', value_format='JSON');",
"CREATE TABLE MATVIEW AS SELECT ID, V0, V1 FROM INPUT;",
"SELECT ID, V0, V1 FROM MATVIEW WHERE ID LIKE '%12';"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might also want a test where the pattern isn't a string

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More tests are added.

"statements": [
"CREATE STREAM INPUT (ID1 STRING KEY, ID2 INT, V0 STRING) WITH (kafka_topic='test_topic', format='JSON');",
"CREATE TABLE AGGREGATE AS SELECT ID1, ID2, V0, COUNT(1) AS COUNT FROM INPUT GROUP BY ID1, ID2, V0;",
"SELECT * FROM AGGREGATE WHERE ID1 BETWEEN '10' and '11';",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might want some error cases where ID1 isn't a column and its type disagrees with the literals.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More tests are added.

&& !(node instanceof ComparisonExpression)) {
&& !(node instanceof ComparisonExpression)
&& !(node instanceof LikePredicate)
&& !(node instanceof BetweenPredicate)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can remove BetweenPredicate from here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

"SELECT ID, V0, V1 FROM MATVIEW WHERE '12' LIKE '12';"
],
"properties": {
"ksql.query.pull.table.scan.enabled": true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might want to introduce at least one test without table scans enabled and show that we throw the error when not used with a column.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added test for table scan disable case.

@@ -36,10 +37,16 @@ private PullQueryRewriter() {
public static Expression rewrite(final Expression expression) {
final Expression pseudoTimestamp = new StatementRewriteForMagicPseudoTimestamp()
.rewrite(expression);
final Expression inPredicatesRemoved = rewriteInPredicates(pseudoTimestamp);
final Expression betweenPredicatesRemoved = rewriteBetweenPredicates(pseudoTimestamp);
final Expression inPredicatesRemoved = rewriteInPredicates(betweenPredicatesRemoved);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have a test class PullQueryRewriterTest. Would be good to add some good test cases.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added tests for between predicate.

isWindowed);
}
} else {
setTableScanOrElseThrow(() -> invalidWhereClauseException("Like condition must directly "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It occurs to me, even if they reference a key column, we're still going to do a table scan for this, so might want to update the messaging.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Message changed.

if (node.getValue() instanceof UnqualifiedColumnReferenceExp) {
final UnqualifiedColumnReferenceExp column = (UnqualifiedColumnReferenceExp) node.getValue();
final ColumnName columnName = column.getColumnName();
final Column col = schema.findColumn(columnName)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we're effectively doing the type checking here since it's not done in ExpressionTypeManager, we should also verify that the column is a string type too, unless we actually automatically coerce the type, but I don't think we do. Might want to test that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added string type column verification.

@@ -47,6 +49,7 @@
import io.confluent.ksql.schema.ksql.DefaultSqlValueCoercer;
import io.confluent.ksql.schema.ksql.LogicalSchema;
import io.confluent.ksql.schema.ksql.SystemColumns;
import io.confluent.ksql.schema.ksql.types.SqlBaseType;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The RQTT tests are definitely good, though it might make sense to add a test or two to the unit tests for QueryFilterNodeTest.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some unit tests.

@hli21 hli21 merged commit bc3ea64 into confluentinc:master Nov 5, 2021
@sandro-waf
Copy link

sandro-waf commented Nov 11, 2021

Hi!

After deploy the master branch, I tried to use this feature but I run in some problems:

When I try to run:

kslq> SELECT domain FROM QUERYABLE_TABLE WHERE domain='www.example.com.' AND domain LIKE 'www.example.com.' ;

I got the error:
Invalid expression: Logical binary expects two boolean values. Actual BOOLEAN and STRING.

However, running this 2 queries:

ksql > SELECT domain FROM QUERYABLE_TABLE WHERE domain = 'www.example.com.' ;

ksql > SELECT domain FROM QUERYABLE_TABLE WHERE domain LIKE 'www.example.com.' ;

The results are the excepted (and the same).

What I expected:
The predicate LIKE would work with boolean conditions (OR / AND) . It seems that LIKE predicate is treated as STRING when used with other boolean conditions.

@agavra
Copy link
Contributor

agavra commented Nov 15, 2021

hi @sandro-waf - thanks for reporting, I created a separate issue to look at this as often comments on closed PRs are missed: #8359

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants