Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support optimize where clause with sorting key expression move to prewhere for query with final #38950

Merged
merged 15 commits into from
Feb 15, 2023

Conversation

hexiaoting
Copy link
Contributor

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

close issue: #38893

@robot-ch-test-poll1 robot-ch-test-poll1 added the pr-improvement Pull request with some product improvements label Jul 7, 2022
/// Do not take into consideration the conditions consisting only of the first primary key column
&& !hasPrimaryKeyAtoms(node)
/// Do not take into consideration the conditions consisting of column that not belong to the primary key columns
&& isColumnAllPrimaryKey(node)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, this change is unclear to me.
Previously, we did not allow to move condition to PREWHERE in case this condition was an expression over first PK column. Like, if we had PK (a, b, c) then we could not move condition where a = 42, but could move where b = 42 and where another_column = 42. This logic is understandable : most of the data would be filtered by PK if condition contains the first PK column (and this is mostly not true for other PK columns). Actually, I thought that maybe we can remove this check - because even for the condition over first PK column we will likely read 1-2 granules per part.

Now, we allow to move condition only if it is expression over PK key. Maybe we should check for FINAL as well? Something like && (!is_final || isColumnAllPrimaryKey(node))

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw, why here we use PK, not sorting key? Final requires sorting key to be processed, not just PK (which is a prefix of sorting key). PK is a set of columns for which we write an index.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

bool hasPrimaryKeyAtoms(const ASTPtr & ast) const;

bool isPrimaryKeyAtom(const ASTPtr & ast) const;
bool isColumnAllPrimaryKey(const ASTPtr & ast) const;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a very good name. Maybe isExpressionOverSortingKey

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is better

|| (first_primary_key_column == first_arg_name && functionIsInOrGlobalInOperator(func->name)))
return true;
for (const auto & arg : args)
if (!isColumnAllPrimaryKey(arg))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PK itself can contain an expression.
I think you should check pk_names.contains(arg->getColumnName()) first, otherwise the case

order by (toHour(event_time));
select ... where toHour(event_time) = toHour(now());

will not work

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

@KochetovNicolai KochetovNicolai self-assigned this Jul 7, 2022
@hexiaoting hexiaoting changed the title Dev prewhere Support optimize where clause with sorting key expression move to prewhere for query with final Jul 8, 2022
@KochetovNicolai
Copy link
Member

Now it looks fine.
Tests are broken mostly because more conditions were moved to prewhere

@hexiaoting
Copy link
Contributor Author

@KochetovNicolai
The error message in CI:

2022-07-08 13:51:03 --- /ClickHouse/tests/queries/0_stateless/01947_mv_subquery.reference	2022-07-08 13:30:17.834988543 +0300
2022-07-08 13:51:03 +++ /ClickHouse/tests/queries/0_stateless/01947_mv_subquery.stdout	2022-07-08 13:51:03.066948009 +0300
2022-07-08 13:51:03 @@ -1,6 +1,6 @@
2022-07-08 13:51:03  {"test":"1947 #1 CHECK - TRUE","sleep_calls":"0","sleep_microseconds":"0"}
2022-07-08 13:51:03  {"test":"1947 #2 CHECK - TRUE","sleep_calls":"2","sleep_microseconds":"2000"}
2022-07-08 13:51:03 -{"test":"1947 #3 CHECK - TRUE","sleep_calls":"0","sleep_microseconds":"0"}
2022-07-08 13:51:03 -{"test":"1947 #1 CHECK - FALSE","sleep_calls":"0","sleep_microseconds":"0"}
2022-07-08 13:51:03 -{"test":"1947 #2 CHECK - FALSE","sleep_calls":"2","sleep_microseconds":"2000"}
2022-07-08 13:51:03 -{"test":"1947 #3 CHECK - FALSE","sleep_calls":"0","sleep_microseconds":"0"}
2022-07-08 13:51:03 +{"test":"1947 #3 CHECK - TRUE","sleep_calls":"4","sleep_microseconds":"4000"}
2022-07-08 13:51:03 +{"test":"1947 #1 CHECK - FALSE","sleep_calls":"6","sleep_microseconds":"6000"}
2022-07-08 13:51:03 +{"test":"1947 #2 CHECK - FALSE","sleep_calls":"6","sleep_microseconds":"6000"}
2022-07-08 13:51:03 +{"test":"1947 #3 CHECK - FALSE","sleep_calls":"4","sleep_microseconds":"4000"}
2022-07-08 13:51:03 
2022-07-08 13:51:03 
2022-07-08 13:51:03 Database: test_po0t2c

I'm not sure the new result of query SELECT '1947 #3 CHECK - FALSE' as test, ProfileEvents['SleepFunctionCalls'] as sleep_calls, ProfileEvents['SleepFunctionMicroseconds'] as sleep_microseconds FROM system.query_log .... in test 01947_mv_subquery is right. Can you help me ?

@alexey-milovidov
Copy link
Member

@KochetovNicolai latest review was about one month ago.

@@ -193,8 +193,8 @@ void MergeTreeWhereOptimizer::analyzeImpl(Conditions & res, const ASTPtr & node,
/// Condition depend on some column. Constant expressions are not moved.
!cond.identifiers.empty()
&& !cannotBeMoved(node, is_final)
/// Do not take into consideration the conditions consisting only of the first primary key column
&& !hasPrimaryKeyAtoms(node)
/// when use final, do not take into consideration the conditions consisting of column that not belong to the sorting key columns
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't understand this: why we will not consider non-sorting columns when using final? Any correctness issues? @KochetovNicolai

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is a correctness issue.
For queries with FINAL we may want to merge some rows with the same sorting key values. So, we cannot filter (except if the condition is fully on top of sorting key) before FINAL happens (=cannot move the condition to prewhere)

@hanfei1991
Copy link
Member

test failure unrelated

@hanfei1991 hanfei1991 merged commit b152419 into ClickHouse:master Feb 15, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pr-improvement Pull request with some product improvements
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants