Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql/distsql: push down NOT NULL constraints on equality columns during planning #20100

Open
lgo opened this issue Nov 16, 2017 · 35 comments
Open
Labels
A-sql-optimizer SQL logical planning and optimizations. C-performance Perf of queries or internals. Solution not expected to change functional behavior. T-sql-queries SQL Queries Team

Comments

@lgo
Copy link
Contributor

lgo commented Nov 16, 2017

We can push down a NOT NULL filter depending on the join type. This can reduce the data that is streamed to the join. For example, if we have:

SELECT * FROM t1 AS l INNER JOIN t1 AS r ON l.id = r.id

If either l.id or r.id are NULL they will never be in the join result due to NULL = <any> is false-y. This applies to joins that are composed of only equalities, in particular:

  • All inner joins
  • When doing a left or right outer join, we can push the constraint down to the other side of the join (e.g. when doing a LEFT OUTER JOIN, we can push it down the right side of the join)

Mentioned from #20090 and #17135 (comment).

cc: @RaduBerinde

Jira issue: CRDB-5942

@lgo lgo added the C-performance Perf of queries or internals. Solution not expected to change functional behavior. label Nov 16, 2017
@lgo lgo changed the title distsql: push down NOT NULL constraints on equality columns during planning sql/distsql: push down NOT NULL constraints on equality columns during planning Nov 16, 2017
@sydnever
Copy link
Contributor

Hello, I am a new bee here. I would like to know whether you will code this issue yourself (I noted you fix #20090 ). If you will, I will look for another issue which maybe looks simple; if not, I want to have a try.

@knz
Copy link
Contributor

knz commented Nov 21, 2017

@sydnever hello! Thanks for your interest in this issue. We haven't looked into this yet, so the issue is still "free". Do you know already what the solution looks like?

@sydnever
Copy link
Contributor

@knz I have some ideas in my mind and I am trying to find where and how I will code. When I make it clear, I will share my plan here

@knz
Copy link
Contributor

knz commented Nov 21, 2017

Thanks.

@sydnever
Copy link
Contributor

@knz Should I make the final plan like this:

root@localhost:26257/test> EXPLAIN(EXPRS,METADATA) 
select * from 
(select * from t1 where t1.id is not null) as r 
inner join 
(select * from t1 where t1.id is not null) as j 
on r.id = j.id;
+-------+--------+----------+----------------+----------------------------------+-----------------------------------+
| Level |  Type  |  Field   |  Description   |             Columns              |             Ordering              |
+-------+--------+----------+----------------+----------------------------------+-----------------------------------+
|     0 | join   |          |                | (id, num, id, num)               |                                   |
|     0 |        | type     | inner          |                                  |                                   |
|     0 |        | equality | (id) = (id)    |                                  |                                   |
|     1 | render |          |                | (id, num)                        | id!=NULL                          |
|     1 |        | render 0 | id             |                                  |                                   |
|     1 |        | render 1 | num            |                                  |                                   |
|     2 | scan   |          |                | (id, num, rowid[hidden,omitted]) | id!=NULL; rowid!=NULL; key(rowid) |
|     2 |        | table    | t1@primary     |                                  |                                   |
|     2 |        | spans    | ALL            |                                  |                                   |
|     2 |        | filter   | id IS NOT NULL |                                  |                                   |
|     1 | render |          |                | (id, num)                        | id!=NULL                          |
|     1 |        | render 0 | id             |                                  |                                   |
|     1 |        | render 1 | num            |                                  |                                   |
|     2 | scan   |          |                | (id, num, rowid[hidden,omitted]) | id!=NULL; rowid!=NULL; key(rowid) |
|     2 |        | table    | t1@primary     |                                  |                                   |
|     2 |        | spans    | ALL            |                                  |                                   |
|     2 |        | filter   | id IS NOT NULL |                                  |                                   |
+-------+--------+----------+----------------+----------------------------------+-----------------------------------+
(17 rows)

Time: 4.084ms

@knz
Copy link
Contributor

knz commented Nov 26, 2017

yes something like that. But that is not an implementation plan! Let us know how you plan to proceed.

@sydnever
Copy link
Contributor

sydnever commented Nov 27, 2017

@knz

In lego's example

SELECT * FROM t1 AS l INNER JOIN t1 AS r ON l.id = r.id;
  1. I may build a typedExpr called N as a filter which looks like
N.String() = "(id IS NOT NULL) AND (id IS NOT NULL)"
  1. There are two way to do:

In opt_filter.go#L541, I may change the return result of the expandOnCond by

result = mergeConj(result, N)

Or

In opt_filter.go#L681, I may reuse 'addJoinFilter' to add N in the JoinFilter.


There is another problem that there should not be a N filter when id is defined as NOT NULL.

I think this problem should be resolved in another issue #20237 .

@adamyaoqinglin
Copy link
Contributor

@sydnever , I am also looking into this issue.
In my opinion, changing the result of the expandOnCond may be better.
The implementation can be:

  1. loops all the left/rightEqualityIndices to construct sqls(IsNot ComparisonExpression) which indicates fetching none-null values(columns)

  2. There are three situations: Inner Join, Left Outer Join, Right Outer Join.

For Inner Join, both left and right column should fetch none-null values.
For Left Join, right column should fetch none-null values.
For Right Join, left column should fetch none-null values.

  1. Be careful the rightEqualityIndice when using it retrieve IndexedVar, the Index should be rightEqualityIndice + Len(joinNode.left.info.sourceColumns)

@knz , Do you think so?

@RaduBerinde
Copy link
Member

I think a more comprehensive fix (which would add these constraints in more situations, and wouldn't require join-specific code) would be to make splitFilter convert something like a = b into a IS NOT NULL when we restrict the filter on a (currently, we just ignore the entire constraint). I believe that any comparison op that has a variable as an argument implies that it is not-null.

@sydnever
Copy link
Contributor

@RaduBerinde Sorry, could you give me some queries as demos to make me get it? In my opinion, what you said may not do optimzation actually.

@lgo
Copy link
Contributor Author

lgo commented Nov 28, 2017

@sydnever the query I mentioned in the first issue comment will have this optimization. Going into more details with the example, given the schema and values:

CREATE TABLE t1 (
   k int,
   v int
);

INSERT INTO t1 VALUES (1, NULL), (2, NULL), (3, 1), (4, 1);

If we do the query

SELECT *
FROM
  t1 AS L
INNER JOIN
  t1 AS R
  ON L.k = R.k

The results will be

L.k   | L.v   | R.k   | R.v
3     | 1     | 3     | 1
4     | 1     | 4     | 1

By doing what @RaduBerinde mentioned, splitFilter will take the filter L.k = R.k and also generate the filter conditions L IS NOT NULL and R IS NOT NULL. That is because L.k = <any> will always be false when L.k is NULL, so that can be added as an explicit filter for both sides of the equality (=).

The reason we can do this in splitFilter is that L.k = <any> will be semantically equivalent to L.k = <and> AND L.k IS NOT NULL, but this is important when it comes to the actual optimization happening. That happens in a later step where filters are pushed through joins to reduce the number of rows that end up getting joined.

If we do that split operation that is proposed, the query will become:

SELECT *
FROM
  t1 AS L
INNER JOIN
  t1 AS R
  ON L.k = R.k AND L.k IS NOT NULL AND R.k IS NOT NULL

Then, during query optimization (which already happens), this query transforms into

SELECT *
FROM
  (SELECT * FROM t1 WHERE k IS NOT NULL) AS L
INNER JOIN
  (SELECT * FROM t1 WHERE k IS NOT NULL) AS R
  ON L.k = R.k AND L.k

Which will optimize the join, because we are filtering out rows before they begin joining (as joins are expensive!).

I hope that helps :). Don't hesitate to ask any other questions!

@sydnever
Copy link
Contributor

@LEGO Yeah,I got what you say. But @RaduBerinde mentioned more situations (not only in join) which made me confused. What I really need are queries in more situations.

Thank you all the same! :)

@adamyaoqinglin
Copy link
Contributor

adamyaoqinglin commented Nov 28, 2017

@LEGO We can consider another situation.

SELECT SELECT *
FROM
  t1 AS L
INNER JOIN
  t1 AS R
USING k

If we code in splitFilter , we can not, during query optimization, transforms the query into

SELECT *
FROM
  (SELECT * FROM t1 WHERE k IS NOT NULL) AS L
INNER JOIN
  (SELECT * FROM t1 WHERE k IS NOT NULL) AS R
  ON L.k = R.k AND L.k

@RaduBerinde
Copy link
Member

@adamyaoqinglin true, we may need a bit of special code for USING and NATURAL joins.

@sydnever one situation I can think of is index join (admittedly, it's still a join but it's different code). We use splitFilter to evaluate whatever we can from the filter on the (non-covering) index. Here's an example:

create table t (k INT PRIMARY KEY, a INT, b INT, INDEX (b));
root@:26257/test> EXPLAIN SELECT * FROM t WHERE b < a ORDER BY b LIMIT 10;
+-------+------------+-------+-------------+
| Level |    Type    | Field | Description |
+-------+------------+-------+-------------+
|     0 | limit      |       |             |
|     1 | index-join |       |             |
|     2 | scan       |       |             |
|     2 |            | table | t@t_b_idx   |
|     2 |            | spans | ALL         |
|     2 | scan       |       |             |
|     2 |            | table | t@primary   |
+-------+------------+-------+-------------+
(7 rows)

Note the spans ALL; the spans would skip NULLs if we propagated b IS NOT NULL to the index scan.

Another situation is when we have something like ON left.a > left.b. We can still skip the NULLs even though we don't have any equality columns.

@sydnever
Copy link
Contributor

@RaduBerinde thank you, I got it.

@sydnever
Copy link
Contributor

@RaduBerinde @LEGO I have made some tests.

func splitFilter(expr tree.TypedExpr, conv varConvertFunc) (restricted, remainder tree.TypedExpr) {
	if expr == nil {
		syd.Sydlog("splitFilter nil\n")
		return nil, nil
	}
	syd.Sydlog("splitFilter expr: " + expr.String())
	restricted, remainder = splitBoolExpr(expr, conv, true)
	syd.Sydlog("splitFilter restricted: " + restricted.String())
	syd.Sydlog("splitFilter remainder : " + remainder.String())
	if restricted == tree.DBoolTrue {
		restricted = nil
	}
	if remainder == tree.DBoolTrue {
		remainder = nil
	}
	fmt.Println("")
	return restricted, remainder
}
root@:26257/test> explain(exprs,metadata)select * from t1 as a join t1 as b on a.id = b.id;
+-------+--------+----------+-------------+------------------------------------------------------------------+-------------------------+
| Level |  Type  |  Field   | Description |                             Columns                              |        Ordering         |
+-------+--------+----------+-------------+------------------------------------------------------------------+-------------------------+
|     0 | render |          |             | (id, num, id, num)                                               |                         |
|     0 |        | render 0 | id          |                                                                  |                         |
|     0 |        | render 1 | num         |                                                                  |                         |
|     0 |        | render 2 | id          |                                                                  |                         |
|     0 |        | render 3 | num         |                                                                  |                         |
|     1 | join   |          |             | (id, num, rowid[hidden,omitted], id, num, rowid[hidden,omitted]) |                         |
|     1 |        | type     | inner       |                                                                  |                         |
|     1 |        | equality | (id) = (id) |                                                                  |                         |
|     2 | scan   |          |             | (id, num, rowid[hidden,omitted])                                 | rowid!=NULL; key(rowid) |
|     2 |        | table    | t1@primary  |                                                                  |                         |
|     2 |        | spans    | ALL         |                                                                  |                         |
|     2 | scan   |          |             | (id, num, rowid[hidden,omitted])                                 | rowid!=NULL; key(rowid) |
|     2 |        | table    | t1@primary  |                                                                  |                         |
|     2 |        | spans    | ALL         |                                                                  |                         |
+-------+--------+----------+-------------+------------------------------------------------------------------+-------------------------+
(14 rows)

Time: 9.51ms

root@:26257/test> explain(exprs,metadata)select * from t1 as a join t1 as b on a.id > b.id;
+-------+--------+----------+-------------+------------------------------------------------------------------+-------------------------+
| Level |  Type  |  Field   | Description |                             Columns                              |        Ordering         |
+-------+--------+----------+-------------+------------------------------------------------------------------+-------------------------+
|     0 | render |          |             | (id, num, id, num)                                               |                         |
|     0 |        | render 0 | id          |                                                                  |                         |
|     0 |        | render 1 | num         |                                                                  |                         |
|     0 |        | render 2 | id          |                                                                  |                         |
|     0 |        | render 3 | num         |                                                                  |                         |
|     1 | join   |          |             | (id, num, rowid[hidden,omitted], id, num, rowid[hidden,omitted]) |                         |
|     1 |        | type     | inner       |                                                                  |                         |
|     1 |        | pred     | a.id > b.id |                                                                  |                         |
|     2 | scan   |          |             | (id, num, rowid[hidden,omitted])                                 | rowid!=NULL; key(rowid) |
|     2 |        | table    | t1@primary  |                                                                  |                         |
|     2 |        | spans    | ALL         |                                                                  |                         |
|     2 | scan   |          |             | (id, num, rowid[hidden,omitted])                                 | rowid!=NULL; key(rowid) |
|     2 |        | table    | t1@primary  |                                                                  |                         |
|     2 |        | spans    | ALL         |                                                                  |                         |
+-------+--------+----------+-------------+------------------------------------------------------------------+-------------------------+
(14 rows)

Time: 3.925ms
 [DEBUG]  2017-11-28 14:34:42.657385 +0800 CST m=+36.097869682           splitFilter nil

 [DEBUG]  2017-11-28 14:34:42.657422 +0800 CST m=+36.097906734           splitFilter nil

 [DEBUG]  2017-11-28 14:34:49.550919 +0800 CST m=+42.991169103           splitFilter expr: id > id
 [DEBUG]  2017-11-28 14:34:49.550982 +0800 CST m=+42.991232246           splitFilter restricted: true
 [DEBUG]  2017-11-28 14:34:49.551023 +0800 CST m=+42.991273074           splitFilter remainder : id > id

 [DEBUG]  2017-11-28 14:34:49.551062 +0800 CST m=+42.991311632           splitFilter expr: id > id
 [DEBUG]  2017-11-28 14:34:49.551098 +0800 CST m=+42.991348421           splitFilter restricted: true
 [DEBUG]  2017-11-28 14:34:49.551134 +0800 CST m=+42.991384117           splitFilter remainder : id > id

Then we can see:

SELECT *
FROM
  t1 AS L
INNER JOIN
  t1 AS R
  ON L.k = R.k

In this query, the filter L.k = R.k will not be taken by splitFilter.

SELECT *
FROM
  t1 AS L
INNER JOIN
  t1 AS R
  ON L.k > R.k

In this query, the filter L.k > R.k will.

And someting interesting:

on 1=1 will not.
on 1>1 will.

So I don't think splitFilter is a good point to make something we want.

@adamyaoqinglin
Copy link
Contributor

adamyaoqinglin commented Nov 28, 2017

@RaduBerinde
The question :

I notice that the Walk only execute functions on Variables(maybe it is the leaf?).
If added not null constraint using splitFilter , the following situation is difficult to implement.

SELECT *
FROM
  t1 AS L
INNER JOIN
  t1 AS R
  ON  L.k IS NULL

As previous discussion, the optimized query would be

SELECT *
FROM
  t1 AS L
INNER JOIN
  t1 AS R
  ON  L.k IS NULL AND L.k IS NOT NULL

@RaduBerinde
Copy link
Member

@sydnever We probably remove the equality constraints that get converted to equality columns; so we would need to push down not-null constraints (as was pointed out above that we should for USING and NATURAL JOIN).

@adamyaoqinglin The Walk is called on every AST node. Regarding your example - first of all, L.k IS NULL would be passed through splitFilter to the left side unmodified because it only contains left-side variables. Second, of course that x IS NULL does not imply x IS NOT NULL; we would only do this for operators that imply not-null (most importantly comparison operators).

@adamyaoqinglin
Copy link
Contributor

@RaduBerinde Could you please explain it again with more details. The example can be (left.x IS NULL) = (right.y IS NULL)

@RaduBerinde
Copy link
Member

The change I'm proposing is a special case in splitBoolExpr which checks if we have a ComparisonExpr (with a restricted set of operators, e.g. EQ, LT, GT, LE, GE, NE) for which one side is just a variable that can be "converted". From something like x CMP <something> we can return x IS NOT NULL

(left.x IS NULL) = (right.y IS NULL) does not match the special case, we wouldn't propagate anything.

There is one difficulty due to things like NOT (a = <something>) which does not mean NOT (a is not NULL). It actually still implies a IS NOT NULL. We could avoid this by not applying the special case if we're under a NOT.

@adamyaoqinglin
Copy link
Contributor

Thanks, @RaduBerinde . I think I have got it.

@sydnever
Copy link
Contributor

Hi, @RaduBerinde.
@adamyaoqinglin has completed the work and I am doing tests on it. It may take a lot time to make sure of the correctness of results.

@RaduBerinde
Copy link
Member

Nice! Just a tip: I expect some of the EXPLAIN outputs in logictests to change, the easiest to update those is to pass TESTFLAGS=--rewrite-results-in-testfiles. This will actually modify the testfiles with the new results and you can check the differences with git diff.

@sydnever
Copy link
Contributor

@RaduBerinde Oh! Thank you for your tip! I just spent my time on changing logictest files manually. However, I still need make sure that every test result is expected.

My head is a query can (like tuna can) now.

@andy-kimball andy-kimball added this to Higher Priority Backlog in BACKLOG, NO NEW ISSUES: SQL Optimizer Aug 25, 2018
@RaduBerinde RaduBerinde moved this from Higher Priority Backlog to Plan enhancements (higher priority) in BACKLOG, NO NEW ISSUES: SQL Optimizer Apr 18, 2020
@yongyanglai
Copy link
Contributor

yongyanglai commented Jun 9, 2020

Hi @RaduBerinde , I'd like to address this issue. It has been long time ago since last discuss and codes in optimizer changed a lot. My sketch of solution is as follows:
Every expression has a property "NotNullCols", it is derived from inputs and filters of expression. In join operator, we can use "NotNullCols" from left input and right input together with FD of join expression to calculate filters to reject nulls in inputs. This solution works in these three join operator:

  • InnerJoin
  • LeftJoin
  • RightJoin

InnerJoin: Now, let's consider InnerJoin first. In this situation, we only need to focus on [Eq] filters of InnerJoin, which implies all column to reject nulls in inputs. We can use them to populate [FiltersExpr]of this join operator by construct [FilterItem] like x IS NOT NULL , then the optimizer push down these filters to left input and right input using rule "PushFilterIntoJoinLeftAndRight".At the same time, we don't want to push down duplicated filters to input, so we need to filter out "NotNullCols" before construct filters.

LeftJoin and RightJoin: In this situation, we can't derive columns to reject null directly by [Eq] filters, since t1.x=t2.x doesn't imply both t1.x and t2.x are not null. Let me take LeftJoin for example, the RightJoin is handled in the same way with all operations are symmetric. In LeftJoin all null rejecting filters comes from left input, so we can take "NotNullCols" from left input, and calculate its equivalence closure with FD and filter out "NotNullCols" from left input out of equivalence closure, then we obtain columns in right side that should be not null.

match condition: We need to add some constraint on rule to prevent the rule being called recursively by "PushFilterIntoJoinLeftAndRight" which will cause an infinite recursion. We just need to check if columns we want to use to construct filters are subset of "NotNullCols", if so we just skip proceeding operation.

@RaduBerinde
Copy link
Member

Thanks @yongyanglai, your plan is exactly what I would do. The one issue is that the rule could in principle loop if we push the filter down deep and somewhere in that subtree we lose the information that the column is not null (the NotNullCols property is best-effort). This won't happen in most cases but there may be cornercases. We have a similar problem with pushing down limits, but there it's easier to check that all operators correctly reflect limits in their cardinality.

CC @andy-kimball @rytaft for any ideas.

@RaduBerinde
Copy link
Member

We could get around that if we had good tests that verified that any time we wrap an expression in a not-null filter, the column shows up in NotNullCols. This could perhaps be a flag that we only use in a nightly test which (with a random probability) checks this every time we construct something.

@andy-kimball
Copy link
Contributor

For the reasons Radu explained, pushing down NULL filters is surprisingly complex. It's easy to cause an infinite pushdown loop, and it's easy to create NOT NULL filters that end up "getting in the way" of other rules.

That said, we already have substantial infrastructure to do NOT NULL pushdowns. See the header comment for reject_nulls.opt for an overview, and code in reject_null_funcs.go as well. The literature calls NOT NULL pushdown "null rejection", and is mostly used in the context of simplifying outer joins.

There may be additional cases where we want to push down NOT NULL filters, beyond those we've already implemented, but I'm actually not aware of any remaining important cases. Do you know of any? If not, we can probably close this issue as "Fixed".

@yongyanglai
Copy link
Contributor

Except for ones mentioned in this issued which seems haven't been implemented, I don't have any other cases.
I did find reject-null related codes, but they seems have a drawback that some functions don't check equivalence dependency thoroughly(I may be wrong), thus some of them are not able to deliver all null rejection columns.

@rytaft
Copy link
Collaborator

rytaft commented Jun 9, 2020

It seems like this could be useful if the underlying tables have a lot of null values on the join column. If we can push the NOT NULL filter into the scan, that would be valuable.

For example:

> create table t (x int, index (x));
CREATE TABLE

Time: 2.787ms

> create table u (x int, index (x));
CREATE TABLE

Time: 2.611ms

> explain select * from t, u where t.x=u.x;
     tree    |     field      | description
-------------+----------------+--------------
             | distributed    | true
             | vectorized     | false
  merge-join |                |
   │         | type           | inner
   │         | equality       | (x) = (x)
   │         | mergeJoinOrder | +"(x=x)"
   ├── scan  |                |
   │         | table          | t@t_x_idx
   │         | spans          | FULL SCAN
   └── scan  |                |
             | table          | u@u_x_idx
             | spans          | FULL SCAN
(12 rows)

If we could convert those full scans to constrained scans, that would be an improvement.

I like the ideas that you presented above, @yongyanglai, and I think Radu's idea of using a nightly test to catch edge cases sounds promising.

@RaduBerinde
Copy link
Member

I think we could use the "reject nulls" infrastructure for that, though it's not trivial. We would have to make Select pass through all null-rejection columns, except those which are already constrained to be not-null; and make Scan advertise all nullable columns as null-rejection. There is still a small chance of something going wrong if some normalization rule determines that an is not null filter is redundant with some other more complicated filter but the Select's NotNullCols don't reflect that (though this is a very specific, fixable case that I'd argue is a bug).

@yongyanglai
Copy link
Contributor

I have concern that rules checking against properties of expression being rewritten (check against "RejectNullCols") gain possibility of being struck in infinite recursion. But you are right, we should make use of existing "reject-null" infrastructure and make some changes to our need. Maybe the best way to verify codes is the test you mentioned except for formal methods.
I can make an attempt to implement it if there is no problem.

@github-actions
Copy link

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
10 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Oct 9, 2023
BACKLOG, NO NEW ISSUES: SQL Optimizer automation moved this from Plan enhancements (higher priority) to Done Oct 9, 2023
@andy-kimball
Copy link
Contributor

Not yet addressed.

@DrewKimball
Copy link
Collaborator

We have RejectNullsUnderJoinLeft and RejectNullsUnderJoinRight that infer IS NOT NULL filters from join conditions. The remaining item would be to have some similar logic for constraining an index scan. This could be done as an exploration rule. Note that we already do this for partial-index scans.

@DrewKimball DrewKimball removed their assignment Oct 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-sql-optimizer SQL logical planning and optimizations. C-performance Perf of queries or internals. Solution not expected to change functional behavior. T-sql-queries SQL Queries Team
Projects
Status: Backlog
10 participants