sql/distsql: push down NOT NULL constraints on equality columns during planning #20100
Hello, I am a newbie here. I would like to know whether you plan to code this issue yourself (I noticed you fixed #20090). If so, I will look for another issue that looks simpler; if not, I would like to have a try.
@sydnever hello! Thanks for your interest in this issue. We haven't looked into this yet, so the issue is still "free". Do you already know what the solution looks like?
@knz I have some ideas in mind and I am trying to find where and how to code them. Once I have a clear plan, I will share it here.
Thanks.
@knz Should I make the final plan like this:

root@localhost:26257/test> EXPLAIN(EXPRS,METADATA)
select * from
  (select * from t1 where t1.id is not null) as r
inner join
  (select * from t1 where t1.id is not null) as j
on r.id = j.id;
+-------+--------+----------+----------------+----------------------------------+-----------------------------------+
| Level | Type | Field | Description | Columns | Ordering |
+-------+--------+----------+----------------+----------------------------------+-----------------------------------+
| 0 | join | | | (id, num, id, num) | |
| 0 | | type | inner | | |
| 0 | | equality | (id) = (id) | | |
| 1 | render | | | (id, num) | id!=NULL |
| 1 | | render 0 | id | | |
| 1 | | render 1 | num | | |
| 2 | scan | | | (id, num, rowid[hidden,omitted]) | id!=NULL; rowid!=NULL; key(rowid) |
| 2 | | table | t1@primary | | |
| 2 | | spans | ALL | | |
| 2 | | filter | id IS NOT NULL | | |
| 1 | render | | | (id, num) | id!=NULL |
| 1 | | render 0 | id | | |
| 1 | | render 1 | num | | |
| 2 | scan | | | (id, num, rowid[hidden,omitted]) | id!=NULL; rowid!=NULL; key(rowid) |
| 2 | | table | t1@primary | | |
| 2 | | spans | ALL | | |
| 2 | | filter | id IS NOT NULL | | |
+-------+--------+----------+----------------+----------------------------------+-----------------------------------+
(17 rows)
Time: 4.084ms
Yes, something like that. But that is not an implementation plan! Let us know how you plan to proceed.
In lego's example:

SELECT * FROM t1 AS l INNER JOIN t1 AS r ON l.id = r.id;

In opt_filter.go#L541, I may change the return to result = mergeConj(result, N). Or, in opt_filter.go#L681, I may reuse 'addJoinFilter'. There is another problem, which I think should be resolved in another issue: #20237.
@sydnever, I am also looking into this issue.
For an inner join, both the left and right columns should fetch non-null values.
@knz, do you think so?
I think a more comprehensive fix (which would add these constraints in more situations, and wouldn't require join-specific code) would be to make ...
@RaduBerinde Sorry, could you give me some demo queries to help me understand? In my opinion, what you said may not actually perform an optimization.
@sydnever the query I mentioned in the first issue comment will have this optimization. Going into more detail with the example, given the schema and values:

CREATE TABLE t1 (
  k int,
  v int
);
INSERT INTO t1 VALUES (1, NULL), (2, NULL), (3, 1), (4, 1);

If we do the query

SELECT *
FROM
  t1 AS L
INNER JOIN
  t1 AS R
ON L.k = R.k

the results will be

L.k | L.v | R.k | R.v
  3 |   1 |   3 |   1
  4 |   1 |   4 |   1

If we do the split operation that is proposed, the query will become:

SELECT *
FROM
  t1 AS L
INNER JOIN
  t1 AS R
ON L.k = R.k AND L.k IS NOT NULL AND R.k IS NOT NULL

Then, during query optimization (which already happens), this query transforms into

SELECT *
FROM
  (SELECT * FROM t1 WHERE k IS NOT NULL) AS L
INNER JOIN
  (SELECT * FROM t1 WHERE k IS NOT NULL) AS R
ON L.k = R.k

This optimizes the join, because we filter out rows before they reach the join (and joins are expensive!). I hope that helps :). Don't hesitate to ask any other questions!
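To make the NULL semantics above concrete, here is a small standalone Go sketch (my own illustration, not code from the CockroachDB tree) of an inner hash join on k. Rows whose key is NULL are skipped in both the build and probe phases, which is exactly why a `k IS NOT NULL` filter can be pushed below the join without changing the result. The sample data uses NULL keys for illustration.

```go
package main

import "fmt"

// Row models a t1 row; K == nil represents SQL NULL.
type Row struct {
	K *int
	V int
}

func intp(i int) *int { return &i }

// innerJoinOnK joins l and r on K with SQL semantics:
// NULL = <any> is never true, so NULL keys never match.
func innerJoinOnK(l, r []Row) [][2]Row {
	// Build phase: index right rows by key, skipping NULLs
	// (they can never satisfy the equality).
	byKey := map[int][]Row{}
	for _, rr := range r {
		if rr.K != nil {
			byKey[*rr.K] = append(byKey[*rr.K], rr)
		}
	}
	// Probe phase: NULL keys on the left are skipped too.
	var out [][2]Row
	for _, lr := range l {
		if lr.K == nil {
			continue
		}
		for _, rr := range byKey[*lr.K] {
			out = append(out, [2]Row{lr, rr})
		}
	}
	return out
}

func main() {
	// Two rows with NULL keys, two with values (for illustration).
	t1 := []Row{{nil, 1}, {nil, 2}, {intp(3), 1}, {intp(4), 1}}
	for _, p := range innerJoinOnK(t1, t1) {
		fmt.Printf("%d | %d | %d | %d\n", *p[0].K, p[0].V, *p[1].K, p[1].V)
	}
	// Prints:
	// 3 | 1 | 3 | 1
	// 4 | 1 | 4 | 1
}
```

Because the NULL-keyed rows can never appear in the output, pre-filtering them on either input leaves the result unchanged while shrinking the join's input.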
@LEGO Yeah, I got what you said. But @RaduBerinde mentioned a more comprehensive fix. Thank you all the same! :)
@LEGO We can consider another situation:

SELECT *
FROM
  t1 AS L
INNER JOIN
  t1 AS R
USING (k)

If we code it the same way, the query will become:

SELECT *
FROM
  (SELECT * FROM t1 WHERE k IS NOT NULL) AS L
INNER JOIN
  (SELECT * FROM t1 WHERE k IS NOT NULL) AS R
ON L.k = R.k
@adamyaoqinglin true, we may need a bit of special code for that case.

@sydnever one situation I can think of is the index join (admittedly, it's still a join, but it's different code). We use splitFilter to evaluate whatever we can from the filter on the (non-covering) index. Here's an example:

create table t (k INT PRIMARY KEY, a INT, b INT, INDEX (b));
root@:26257/test> EXPLAIN SELECT * FROM t WHERE b < a ORDER BY b LIMIT 10;
+-------+------------+-------+-------------+
| Level | Type | Field | Description |
+-------+------------+-------+-------------+
| 0 | limit | | |
| 1 | index-join | | |
| 2 | scan | | |
| 2 | | table | t@t_b_idx |
| 2 | | spans | ALL |
| 2 | scan | | |
| 2 | | table | t@primary |
+-------+------------+-------+-------------+
(7 rows)

Note the spans ALL; the spans would skip NULLs if we propagated the constraint. Another situation is when we have something like ...
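A toy illustration of the point about spans (my own sketch, not CockroachDB's span code): in an index, NULL keys sort before all values, so once an `IS NOT NULL` constraint is propagated, the scan can start after the NULL block instead of using span ALL.

```go
package main

import (
	"fmt"
	"sort"
)

func intp(i int) *int { return &i }

// firstNonNull returns the offset where a scan may start once a
// NOT NULL constraint is known, given index entries in index order
// (NULL keys, modeled as nil, sort before all values).
func firstNonNull(keys []*int) int {
	return sort.Search(len(keys), func(i int) bool { return keys[i] != nil })
}

func main() {
	// Toy secondary index on b: two NULL entries, then values.
	idx := []*int{nil, nil, intp(1), intp(3), intp(7)}
	start := firstNonNull(idx)
	fmt.Printf("span ALL scans %d entries; with b IS NOT NULL the span scans %d\n",
		len(idx), len(idx)-start)
	// Prints: span ALL scans 5 entries; with b IS NOT NULL the span scans 3
}
```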
@RaduBerinde thank you, I got it. |
@RaduBerinde @LEGO I have run some tests with extra logging in splitFilter:

func splitFilter(expr tree.TypedExpr, conv varConvertFunc) (restricted, remainder tree.TypedExpr) {
if expr == nil {
syd.Sydlog("splitFilter nil\n")
return nil, nil
}
syd.Sydlog("splitFilter expr: " + expr.String())
restricted, remainder = splitBoolExpr(expr, conv, true)
syd.Sydlog("splitFilter restricted: " + restricted.String())
syd.Sydlog("splitFilter remainder : " + remainder.String())
if restricted == tree.DBoolTrue {
restricted = nil
}
if remainder == tree.DBoolTrue {
remainder = nil
}
fmt.Println("")
return restricted, remainder
}

root@:26257/test> explain(exprs,metadata) select * from t1 as a join t1 as b on a.id = b.id;
+-------+--------+----------+-------------+------------------------------------------------------------------+-------------------------+
| Level | Type | Field | Description | Columns | Ordering |
+-------+--------+----------+-------------+------------------------------------------------------------------+-------------------------+
| 0 | render | | | (id, num, id, num) | |
| 0 | | render 0 | id | | |
| 0 | | render 1 | num | | |
| 0 | | render 2 | id | | |
| 0 | | render 3 | num | | |
| 1 | join | | | (id, num, rowid[hidden,omitted], id, num, rowid[hidden,omitted]) | |
| 1 | | type | inner | | |
| 1 | | equality | (id) = (id) | | |
| 2 | scan | | | (id, num, rowid[hidden,omitted]) | rowid!=NULL; key(rowid) |
| 2 | | table | t1@primary | | |
| 2 | | spans | ALL | | |
| 2 | scan | | | (id, num, rowid[hidden,omitted]) | rowid!=NULL; key(rowid) |
| 2 | | table | t1@primary | | |
| 2 | | spans | ALL | | |
+-------+--------+----------+-------------+------------------------------------------------------------------+-------------------------+
(14 rows)
Time: 9.51ms
root@:26257/test> explain(exprs,metadata)select * from t1 as a join t1 as b on a.id > b.id;
+-------+--------+----------+-------------+------------------------------------------------------------------+-------------------------+
| Level | Type | Field | Description | Columns | Ordering |
+-------+--------+----------+-------------+------------------------------------------------------------------+-------------------------+
| 0 | render | | | (id, num, id, num) | |
| 0 | | render 0 | id | | |
| 0 | | render 1 | num | | |
| 0 | | render 2 | id | | |
| 0 | | render 3 | num | | |
| 1 | join | | | (id, num, rowid[hidden,omitted], id, num, rowid[hidden,omitted]) | |
| 1 | | type | inner | | |
| 1 | | pred | a.id > b.id | | |
| 2 | scan | | | (id, num, rowid[hidden,omitted]) | rowid!=NULL; key(rowid) |
| 2 | | table | t1@primary | | |
| 2 | | spans | ALL | | |
| 2 | scan | | | (id, num, rowid[hidden,omitted]) | rowid!=NULL; key(rowid) |
| 2 | | table | t1@primary | | |
| 2 | | spans | ALL | | |
+-------+--------+----------+-------------+------------------------------------------------------------------+-------------------------+
(14 rows)
Time: 3.925ms
Then we can see:

SELECT *
FROM
  t1 AS L
INNER JOIN
  t1 AS R
ON L.k = R.k

In this query, the filter is turned into an equality column (equality (id) = (id) in the EXPLAIN above).

SELECT *
FROM
  t1 AS L
INNER JOIN
  t1 AS R
ON L.k > R.k

In this query, the filter stays as a join predicate (pred a.id > b.id).

And something interesting: ...

So I don't think ...
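For reference, the split operation being tested above can be modeled abstractly. The sketch below is a toy version of the idea behind splitFilter (the names Pred and splitConj are mine, not CockroachDB's): partition a conjunction into the conjuncts that only mention columns available below the join (pushable into a scan) and the remainder that must stay at the join.

```go
package main

import (
	"fmt"
	"strings"
)

// Pred is a toy conjunct; Cols lists the columns it references.
type Pred struct {
	Expr string
	Cols []string
}

// splitConj partitions a conjunction into the conjuncts that only
// reference columns in avail (the "restricted" part that can be
// pushed below, e.g. into a scan) and the remainder.
func splitConj(conj []Pred, avail map[string]bool) (restricted, remainder []Pred) {
	for _, p := range conj {
		ok := true
		for _, c := range p.Cols {
			if !avail[c] {
				ok = false
				break
			}
		}
		if ok {
			restricted = append(restricted, p)
		} else {
			remainder = append(remainder, p)
		}
	}
	return restricted, remainder
}

func render(ps []Pred) string {
	parts := make([]string, len(ps))
	for i, p := range ps {
		parts[i] = p.Expr
	}
	return strings.Join(parts, " AND ")
}

func main() {
	conj := []Pred{
		{"l.k IS NOT NULL", []string{"l.k"}},
		{"r.k IS NOT NULL", []string{"r.k"}},
		{"l.k = r.k", []string{"l.k", "r.k"}},
	}
	left := map[string]bool{"l.k": true, "l.v": true}
	restricted, remainder := splitConj(conj, left)
	fmt.Println("pushed to left scan:", render(restricted))
	fmt.Println("kept at join:", render(remainder))
}
```

In this model, `l.k IS NOT NULL` would be pushed into the left scan, while the cross-side conjuncts stay at the join.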
@RaduBerinde I notice a case like this:

SELECT *
FROM
  t1 AS L
INNER JOIN
  t1 AS R
ON L.k IS NULL

Following the previous discussion, the optimized query would be

SELECT *
FROM
  t1 AS L
INNER JOIN
  t1 AS R
ON L.k IS NULL AND L.k IS NOT NULL
@sydnever We probably remove the equality constraints that get converted to equality columns, so we would need to push down the not-null constraints separately (as was pointed out above).

@adamyaoqinglin The Walk is called on every AST node. Regarding your example: first of all, ...
@RaduBerinde Could you please explain it again in more detail? The example could be ...
The change I'm proposing is a special case in ...

There is one difficulty due to things like ...
Thanks, @RaduBerinde. I think I've got it.
Hi, @RaduBerinde.
Nice! Just a tip: I expect some of the EXPLAIN outputs in the logictests to change; the easiest way to update those is to pass ...
@RaduBerinde Oh! Thank you for your tip! I had just spent my time changing the logictest files manually. However, I still need to make sure that every test result is expected. My head is a query can (like a tuna can) now.
Hi @RaduBerinde, I'd like to address this issue. It has been a long time since the last discussion, and the code in the optimizer has changed a lot. My sketch of a solution is as follows:
InnerJoin: let's consider InnerJoin first. In this situation, we only need to focus on the [Eq] filters of the InnerJoin, which imply that all their columns reject nulls in the inputs. We can use them to populate the [FiltersExpr] of this join operator by constructing [FilterItem]s.

LeftJoin and RightJoin: in this situation, we can't derive null-rejecting columns directly from the [Eq] filters, since unmatched outer rows are still emitted (the match condition does not filter the outer side).

We need to add some constraint on the rule to prevent it from being called recursively by "PushFilterIntoJoinLeftAndRight", which would cause infinite recursion. We just need to check whether the columns we want to use to construct filters are a subset of "NotNullCols"; if so, we skip the operation.
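A minimal sketch of the plan above (all names here are illustrative, not the actual optimizer API): collect the columns referenced by the inner join's equality filters as null-rejected, and guard the rule so it does not fire once those columns are already in NotNullCols, which avoids the infinite recursion mentioned.

```go
package main

import "fmt"

// ColSet is a toy stand-in for the optimizer's column set.
type ColSet map[int]bool

func (s ColSet) Add(c int)           { s[c] = true }
func (s ColSet) Contains(c int) bool { return s[c] }
func (s ColSet) SubsetOf(o ColSet) bool {
	for c := range s {
		if !o[c] {
			return false
		}
	}
	return true
}

// EqFilter models an equality condition between two columns.
type EqFilter struct{ Left, Right int }

// nullRejectedCols collects every column appearing in an equality
// filter of an inner join: all of them must be non-NULL for a row
// to survive the join.
func nullRejectedCols(filters []EqFilter) ColSet {
	s := ColSet{}
	for _, f := range filters {
		s.Add(f.Left)
		s.Add(f.Right)
	}
	return s
}

// shouldFire mimics the proposed guard: skip the rewrite when the
// not-null information is already known, so the rule does not
// re-fire forever after the pushed-down filters take effect.
func shouldFire(filters []EqFilter, notNullCols ColSet) bool {
	return !nullRejectedCols(filters).SubsetOf(notNullCols)
}

func main() {
	filters := []EqFilter{{1, 3}} // e.g. l.k = r.k
	known := ColSet{}
	fmt.Println(shouldFire(filters, known)) // fires once: true
	for c := range nullRejectedCols(filters) {
		known.Add(c) // after pushdown, NotNullCols reflects the filters
	}
	fmt.Println(shouldFire(filters, known)) // then stops: false
}
```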
Thanks @yongyanglai, your plan is exactly what I would do. The one issue is that the rule could in principle loop: if we push the filter down deep, somewhere in that subtree we may lose the information that the column is not null (the NotNullCols property is best-effort). This won't happen in most cases, but there may be corner cases. We have a similar problem with pushing down limits, but there it's easier to check that all operators correctly reflect limits in their cardinality. CC @andy-kimball @rytaft for any ideas.
We could get around that if we had good tests verifying that any time we wrap an expression in a not-null filter, the column shows up in NotNullCols. This could perhaps be a flag that we only use in a nightly test, which (with some random probability) checks this every time we construct something.
For the reasons Radu explained, pushing down NULL filters is surprisingly complex. It's easy to cause an infinite pushdown loop, and it's easy to create NOT NULL filters that end up "getting in the way" of other rules. That said, we already have substantial infrastructure to do NOT NULL pushdowns; see the header comment for ... There may be additional cases where we want to push down NOT NULL filters beyond those we've already implemented, but I'm actually not aware of any remaining important cases. Do you know of any? If not, we can probably close this issue as "Fixed".
Except for the ones mentioned in this issue, which don't seem to have been implemented, I don't have any other cases.
It seems like this could be useful if the underlying tables have a lot of NULL values in the join column. If we can push the NOT NULL filter into the scan, that would be valuable. For example: ...

If we could convert those full scans into constrained scans, that would be an improvement. I like the ideas that you presented above, @yongyanglai, and I think Radu's idea of using a nightly test to catch edge cases sounds promising.
I think we could use the "reject nulls" infrastructure for that, though it's not trivial. We would have to make ...
My concern is that rules which check properties of the expression being rewritten (e.g. checking against "RejectNullCols") risk getting stuck in infinite recursion. But you are right, we should make use of the existing "reject-null" infrastructure and adapt it to our needs. Short of formal methods, the test you mentioned is probably the best way to verify the code.
We have marked this issue as stale because it has been inactive for |
Not yet addressed. |
We have ...
We can push down a NOT NULL filter depending on the join type. This can reduce the data that is streamed to the join. For example, if we have a join ON l.id = r.id: if either l.id or r.id is NULL, the row will never be in the join result, because NULL = <any> is false-y. This applies to joins that are composed of only equalities; in particular, for a LEFT OUTER JOIN we can push the filter down the right side of the join.

Mentioned from #20090 and #17135 (comment).
cc: @RaduBerinde
Jira issue: CRDB-5942
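To illustrate why the join type matters for the pushdown, here is a toy Go sketch (Row and the join function are illustrative only, not optimizer code): in a LEFT OUTER JOIN, a left row with a NULL key is still emitted (padded with NULLs on the right), so the NOT NULL filter is only safe to push into the right input.

```go
package main

import "fmt"

// Row models a table row; K == nil represents SQL NULL.
type Row struct {
	K *int
	V int
}

func intp(i int) *int { return &i }

// leftOuterJoinOnK returns the number of output rows of a left
// outer join on K. Every left row is kept; unmatched rows pair
// with a NULL right side. A left row with K == NULL can never
// match, but it still appears in the output, so a NOT NULL filter
// may only be pushed into the right input.
func leftOuterJoinOnK(l, r []Row) int {
	rows := 0
	for _, lr := range l {
		found := false
		if lr.K != nil {
			for _, rr := range r {
				if rr.K != nil && *rr.K == *lr.K {
					rows++
					found = true
				}
			}
		}
		if !found {
			rows++ // emitted with NULL right columns
		}
	}
	return rows
}

func main() {
	t1 := []Row{{nil, 1}, {intp(3), 1}}
	// Filtering NULL keys from the RIGHT input is safe:
	rightFiltered := []Row{{intp(3), 1}}
	fmt.Println(leftOuterJoinOnK(t1, t1) == leftOuterJoinOnK(t1, rightFiltered)) // true
	// Filtering the LEFT input would drop the NULL-keyed row:
	fmt.Println(leftOuterJoinOnK(rightFiltered, t1)) // 1 row instead of 2
}
```

The same reasoning, mirrored, explains why a RIGHT OUTER JOIN only allows the pushdown on the left input, and why an inner join allows it on both.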