SAMZA-2549 - Samza-sql: Add query optimizations for remote table joins#1384
SAMZA-2549 - Samza-sql: Add query optimizations for remote table joins#1384atoomula merged 14 commits intoapache:masterfrom
Conversation
|
Can you please update the PR description according to - https://cwiki.apache.org/confluence/display/SAMZA/SEP-25%3A+PR+Title+And+Description+Guidelines |
|
|
||
| private RelRoot optimize(RelRoot relRoot) { | ||
| RelTraitSet relTraitSet = RelTraitSet.createEmpty(); | ||
| relTraitSet = relTraitSet.plus(EnumerableConvention.INSTANCE); |
There was a problem hiding this comment.
This is basically tricking the planner to use Enum Convention plus Volcano planner to generate a Plan with Convention of EnumerableConvention.INSTANCE. This will lead to complex branches down the line when doing the conversion as a follow up stage. You don't think the best way to do this is by using Calcite Conventions where Rules and Translation is happening at the same time to avoid complex code and very cryptic comments on what to expect when doing the translation ?
There was a problem hiding this comment.
This is vestige from my earlier attempt to enable volcano planner. I removed it now.
| LOG.info("query plan:\n" + RelOptUtil.toString(relRoot.rel, SqlExplainLevel.ALL_ATTRIBUTES)); | ||
| return relRoot; | ||
| LOG.info("query plan without optimization:\n" | ||
| + RelOptUtil.toString(relRoot.rel, SqlExplainLevel.EXPPLAN_ATTRIBUTES)); |
There was a problem hiding this comment.
When would you turn off the optimization ? if it is just optimization it should be always turned on ?
There was a problem hiding this comment.
I'm considering this as experimental right now considering this is the first version. Once we vet it thru' the real use-cases and gain confidence, we can enable it by default.
There was a problem hiding this comment.
If the main concern is that planning will fail with the new rules my suggestion is to have it on by default and catch the exception and re-plan without optimization. In this way we can learn the logs. It is up to you if you think this can be too much work
There was a problem hiding this comment.
I'm not strongly against turning it on by default. I turned it on by default now.
|
|
||
| //~ Methods ---------------------------------------------------------------- | ||
|
|
||
| protected void perform(RelOptRuleCall call, Filter filter, |
There was a problem hiding this comment.
As far I can tell it is only testing if filter is on the good side of the join, seems to me most of the work can be done at the onMatch, we can just extend and override onMatch ?
There was a problem hiding this comment.
I plan to add JoinConditionPushRule as well in next iteration (with no filter) which will reuse most of the code here and onMatch will be different.
There was a problem hiding this comment.
In my opinion if we can minimize the copy and past that will be ideal and have the rest in a follow up, but it is up to you this is not a blocking point.
| } | ||
|
|
||
| public RelRoot plan(String query) { | ||
| private Planner getPlanner() { |
There was a problem hiding this comment.
The planner is a closable resource I think it would be better to use it within a try block or make sure to close it when done.
There was a problem hiding this comment.
Sure. Added code to close it now.
| JoinInputNode.getInputType(join.getRight(), systemStreamConfigBySource); | ||
|
|
||
| // Disable this optimnization for queries using local table. | ||
| if (inputTypeOnLeft == InputType.LOCAL_TABLE || inputTypeOnRight == InputType.LOCAL_TABLE) { |
There was a problem hiding this comment.
Why are we turning off for local tables things work fine, not sure I am getting this ?
There was a problem hiding this comment.
There could be more optimizations done for local table as local tables do not have the limitations that remote tables have. We could directly enable Calcite's FilterJoinRule for local tables.
b-slim
left a comment
There was a problem hiding this comment.
Overall looks good to me, got some minor comments:
- Closing Planner.
- If possible Plan with optimization on and re-plan on exception if the concern is planning issues.
- If possible I highly recommend Avoid copying complex code if we can handle this within the onMatch call that will lead to better first cut.
shanthoosh
left a comment
There was a problem hiding this comment.
Thanks for the changes.
| try { | ||
| RelRoot optimizedRelRoot = | ||
| RelRoot.of(getPlanner().transform(0, relTraitSet, relRoot.project()), SqlKind.SELECT); | ||
| LOG.info("query plan with optimization:\n" |
There was a problem hiding this comment.
Nit:
- By default, Logger adds
\nto EOL. Unnecessary to explicitly add it. - Also, it's better to capitalize all the log-messages.
There was a problem hiding this comment.
- \n is not for EOL. It is for separating the log.
- You mean all log messages in Samza or just this ? And what is the point of capitalizing ? Is it to catch the eye while going thru logs ? If yes, we don't need to as their format makes them stand out in the logs,
2020-07-19 10:11:16.814 [main] [] QueryPlanner [INFO] query plan without optimization:
LogicalProject(key=[$9], pageKey=[$9], companyName=['N/A'], profileName=[$2], profileAddress=[$4])
LogicalFilter(condition=[AND(=($2, 'Mike'), =($10, 1))])
LogicalProject(key=[$0], id=[$1], name=[$2], companyId=[$3], address=[$4], selfEmployed=[$5], phoneNumbers=[$6], mapValues=[$7], __key__0=[$8], pageKey=[$9], profileId=[$10])
LogicalJoin(condition=[=($0, $11)], joinType=[inner])
LogicalTableScan(table=[[testRemoteStore, Profile, $table]])
LogicalProject(key=[$0], pageKey=[$1], profileId=[$2], $f3=[BuildOutputRecord('id', $2)])
LogicalTableScan(table=[[testavro, PAGEVIEW]])
2020-07-19 10:11:16.816 [main] [] QueryPlanner [INFO] query plan with optimization:
LogicalProject(key=[$9], pageKey=[$9], companyName=['N/A'], profileName=[$2], profileAddress=[$4])
LogicalFilter(condition=[=($2, 'Mike')])
LogicalJoin(condition=[=($0, $11)], joinType=[inner])
LogicalTableScan(table=[[testRemoteStore, Profile, $table]])
LogicalFilter(condition=[=($2, 1)])
LogicalProject(key=[$0], pageKey=[$1], profileId=[$2], $f3=[BuildOutputRecord('id', $2)])
LogicalTableScan(table=[[testavro, PAGEVIEW]])
| QueryPlanner planner = getQueryPlanner(getSqlConfig(sqlStmts, config)); | ||
| List<RelRoot> relRoots = new LinkedList<>(); | ||
| for (String sql: sqlStmts) { | ||
| QueryPlanner planner = getQueryPlanner(getSqlConfig(Collections.singletonList(sql), config)); |
There was a problem hiding this comment.
What is the rationale for recreating the planner for every sql-statement in the app?
There was a problem hiding this comment.
Good question. Calcite Planner as it stands today does not seem to be supported for reuse. Although the intent is there as they have exposed reset API. But it does not work. But it is such low cost to create new planner for each sql.
| sqlOperatorTables.add(new SamzaSqlOperatorTable()); | ||
| sqlOperatorTables.add(new SamzaSqlUdfOperatorTable(samzaSqlFunctions)); | ||
|
|
||
| // TODO: Introduce a pluggable rule factory. |
There was a problem hiding this comment.
Would be better to create a follow-up ticket for this action-item.
| JoinInputNode.InputType inputTypeOnRight = | ||
| JoinInputNode.getInputType(join.getRight(), systemStreamConfigBySource); | ||
|
|
||
| // Disable this optimnization for queries using local table. |
There was a problem hiding this comment.
nit: %s/optimnization/optimization
| filter != null | ||
| ? RelOptUtil.conjunctions(filter.getCondition()) | ||
| : new ArrayList<>(); | ||
| final com.google.common.collect.ImmutableList<RexNode> origAboveFilters = |
There was a problem hiding this comment.
Please import the ImmutableList class and don't hardcode the package paths. There're multiple occurrences in this file and else-where.
| !joinType.generatesNullsOnLeft() && !donotOptimizeLeft, | ||
| !joinType.generatesNullsOnRight() && !donotOptimizeRight, | ||
| joinFilters, | ||
| leftFilters, |
There was a problem hiding this comment.
Leftfilters and rightFilters are initialized and are not modified. It's very hard to find where they're populated. Please add a comment here that leftFilters and rightFilters will be populated by this classifyFilters method.
| LOG.info("query plan:\n" + RelOptUtil.toString(relRoot.rel, SqlExplainLevel.ALL_ATTRIBUTES)); | ||
| return relRoot; | ||
| LOG.info( | ||
| "query plan without optimization:\n" + RelOptUtil.toString(relRoot.rel, SqlExplainLevel.EXPPLAN_ATTRIBUTES)); |
There was a problem hiding this comment.
Please capitalize log messages and remove \n at the end(which gets added by default).
There was a problem hiding this comment.
Responded in earlier comment.
| String errorMsg = | ||
| "Error while optimizing query plan:\n" + RelOptUtil.toString(relRoot.rel, SqlExplainLevel.EXPPLAN_ATTRIBUTES); | ||
| LOG.error(errorMsg, e); | ||
| planner.close(); |
There was a problem hiding this comment.
Calling close() here seems unnecessary, It's already closed at the caller already with try-with-resources-closeable.
There was a problem hiding this comment.
oh yeah.. good point!
| sqlOperatorTables.add(new SamzaSqlUdfOperatorTable(samzaSqlFunctions)); | ||
|
|
||
| // TODO: Introduce a pluggable rule factory. | ||
| List<RelOptRule> rules = ImmutableList.of( |
There was a problem hiding this comment.
Is there a way to determine if these rel-rules are applied on a rel-plan and emit a metric(or log it before/after the optimization) for debugging purposes.
There was a problem hiding this comment.
AFAIK, only Hex/VolcanoPlanners make such decisions. Not sure if we can determine that in the rule itself.
| * This class is customized form of Calcite's {@link org.apache.calcite.rel.rules.FilterJoinRule} for | ||
| * remote table joins. | ||
| */ | ||
| public abstract class SamzaSqlFilterRemoteJoinRule extends RelOptRule { |
There was a problem hiding this comment.
Just curious. There seems to be considerable duplication with FilterJoinRule calcite native-class. Post CALCITE-3170, calcite supports anti-join on conditions push-down natively. If we upgrade to 1.21.0 rel-planner, then wouldn't overriding match suffice here?
There was a problem hiding this comment.
Are you talking about just anti-joins or in general about this rule ? If latter, Slim has a comment on this as well and I responded to him. I will have add another rule for join condition which reuses the same logic. Let me see at that time if I can inherit from Calcite rule and just override match.
apache#1384) * SAMZA-2549 - Samza-sql: Add query optimizations for remote table joins * SAMZA-2549 - Samza-sql: Add query optimizations for remote table joins * SAMZA-2549 - Samza-sql: Add query optimizations for remote table joins * SAMZA-2549 - Samza-sql: Add query optimizations for remote table joins * SAMZA-2549 - Samza-sql: Add query optimizations for remote table joins * SAMZA-2549 - Samza-sql: Add query optimizations for remote table joins * SAMZA-2549 - Samza-sql: Add query optimizations for remote table joins * SAMZA-2549 - Samza-sql: Add query optimizations for remote table joins * SAMZA-2549 - Samza-sql: Add query optimizations for remote table joins * SAMZA-2549 - Samza-sql: Add query optimizations for remote table joins * Fix checkstyle errors * Fix checkstyle errors * Fix checkstyle errors * SAMZA-2549 - Samza-sql: Add query optimizations for remote table joins
Feature:
Samza Sql currently does not have a query optimizer. As a result, we rely on users to write their queries in such a way that the sql execution is optimized which is often clumsy and non-intuitive. Adding Query optimization will make sql query more intuitive at the same time making the plan optimized.
Changes
This change adds rule based optimizer (HepPlanner) to Samza Sql query planner. As part of this commit, FilterJoinRule is added for remote table joins.
Added remote table join tests to ensure the plan is optimized.