Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

opt: fix memo cycle caused by join ordering #83875

Merged
merged 1 commit into from Jul 7, 2022

Conversation

DrewKimball
Copy link
Collaborator

In some rare cases when null-rejection rules fail to fire, a redundant filter
can be inferred in an InnerJoin - LeftJoin complex. This could previously
result in the JoinOrderBuilder attempting to add a Select to the same memo
group as its input, which would create a memo cycle. This patch prevents the
JoinOrderBuilder from adding the Select to the memo in such cases.

What follows is a more detailed explanation of the conditions that could
previously cause a memo cycle.

InnerJoin operators have two properties that make them more 'reorderable'
than other types of joins: (1) their conjuncts can be reordered separately
and (2) new conjuncts can be inferred from equalities. As a special case of
(1), an InnerJoin can be pushed into the left side of a LeftJoin, leaving
behind any Select conjuncts that reference the right side of the LeftJoin.

This allows the JoinOrderBuilder to make the following transformation:

(InnerJoin
   A
   (InnerJoin
      B
      (LeftJoin
         C
         D
         c=d
      )
      b=c
   )
   a=b, a=d
)
=>
(InnerJoin
   A
   (Select
      (LeftJoin
         (InnerJoin
            B
            C
            b=c
         )
         D
         c=d
      )
      b=d // Inferred filter!
   )
   a=b, a=d
)

Note the new b=d filter that was inferred and subsequently left on
a Select operator after the reordering. The crucial point is that
this filter is redundant - the input to the Select is already a
valid reordering of the BCD join complex.

The JoinOrderBuilder avoids adding redundant filters to InnerJoin
operators, but does not do the same for the Select case because it
was assumed that the filters left on the Select would never be redundant.
This is because the a=d filter should rejects nulls, so the LeftJoin
should have been simplified. However, in rare cases null-rejection does not
take place. Because the input to the Select is already a valid reordering,
the JoinOrderBuilder ends up trying to add the Select to the same group
as its input - namely, the BCD join group. This causes a cycle in the memo.

Fixes #80901

Release note (bug fix): Fixed a bug that could cause an optimizer
panic in rare cases when a query had a left join in the input of
an inner join.

@DrewKimball DrewKimball requested a review from a team as a code owner July 6, 2022 01:19
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@DrewKimball
Copy link
Collaborator Author

@mgartner I added the comment we talked about in #76334 while I was at it.

@mgartner mgartner self-requested a review July 6, 2022 22:31
Copy link
Collaborator

@mgartner mgartner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great find and explanation! Thanks for getting to the bottom of this one. :lgtm:

This is because the a=d filter should rejects nulls, so the LeftJoin
should have been simplified.

A couple follow-up questions for my curiosity:

  1. If null rejection was working for this filter, what would the LeftJoin be simplified to?
  2. Why isn't null-rejection working in this case? Should we fix that at some point too?

Reviewed 4 of 4 files at r1, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @DrewKimball)


pkg/sql/opt/testutils/opttester/reorder_joins.go line 85 at r1 (raw file):

			var selString string
			if len(selRefs) > 0 {
				selString = fmt.Sprintf(" [select, refs=%s]", jof.formatVertexSet(selRefs))

Should we add a join reordering test that uses this new formatting?

CREATE TABLE table80901_3 (col3_2 OID, col3_4 FLOAT8, col3_9 STRING);
----

memo memo-cycles
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: If there is a cycle, the optimizer should panic here and opttester should print out the error in the test output. So maybe the memo-cycles option is unnecessary?

In some rare cases when null-rejection rules fail to fire, a redundant filter
can be inferred in an `InnerJoin` - `LeftJoin` complex. This could previously
result in the `JoinOrderBuilder` attempting to add a `Select` to the same memo
group as its input, which would create a memo cycle. This patch prevents the
`JoinOrderBuilder` from adding the `Select` to the memo in such cases.

What follows is a more detailed explanation of the conditions that could
previously cause a memo cycle.

`InnerJoin` operators have two properties that make them more 'reorderable'
than other types of joins: (1) their conjuncts can be reordered separately
and (2) new conjuncts can be inferred from equalities. As a special case of
(1), an `InnerJoin` can be pushed into the left side of a `LeftJoin`, leaving
behind any `Select` conjuncts that reference the right side of the `LeftJoin`.

This allows the `JoinOrderBuilder` to make the following transformation:
```
(InnerJoin
   A
   (InnerJoin
      B
      (LeftJoin
         C
         D
         c=d
      )
      b=c
   )
   a=b, a=d
)
=>
(InnerJoin
   A
   (Select
      (LeftJoin
         (InnerJoin
            B
            C
            b=c
         )
         D
         c=d
      )
      b=d // Inferred filter!
   )
   a=b, a=d
)
```
Note the new `b=d` filter that was inferred and subsequently left on
a `Select` operator after the reordering. The crucial point is that
this filter is redundant - the input to the `Select` is already a
valid reordering of the `BCD` join complex.

The `JoinOrderBuilder` avoids adding redundant filters to `InnerJoin`
operators, but does not do the same for the `Select` case because it
was assumed that the filters left on the `Select` would never be redundant.
This is because the `a=d` filter *should* rejects nulls, so the `LeftJoin`
should have been simplified. However, in rare cases null-rejection does not
take place. Because the input to the `Select` is already a valid reordering,
the `JoinOrderBuilder` ends up trying to add the `Select` to the same group
as its input - namely, the `BCD` join group. This causes a cycle in the memo.

Fixes cockroachdb#80901

Release note (bug fix): Fixed a bug that could cause an optimizer
panic in rare cases when a query had a left join in the input of
an inner join.
Copy link
Collaborator Author

@DrewKimball DrewKimball left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If null rejection was working for this filter, what would the LeftJoin be simplified to?

The LeftJoin should become an InnerJoin because there is an equality filter (null-rejecting) that references its right input.

Why isn't null-rejection working in this case? Should we fix that at some point too?

It's because the query has outer columns. We have to be careful about when we push down IS NOT NULL filters to prevent interactions with decorrelation rules that attempt to pull those filters back up - see #35171. I do have an idea of how to fix it; instead of physically adding an IS NOT NULL filter, we can traverse the operator tree wherever it would be correct to add an IS NOT NULL filter and perform outer-join simplification as we go.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @mgartner)


pkg/sql/opt/testutils/opttester/reorder_joins.go line 85 at r1 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

Should we add a join reordering test that uses this new formatting?

Good point, I'm surprised there wasn't already one. Done.


pkg/sql/opt/xform/testdata/rules/join_order line 2622 at r1 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

nit: If there is a cycle, the optimizer should panic here and opttester should print out the error in the test output. So maybe the memo-cycles option is unnecessary?

It doesn't seem to set off a cycle in the optimization pattern; only in the structure of the memo. So, you can run without the memo-cycles command and it will happily print the memo without any errors.

@mgartner
Copy link
Collaborator

mgartner commented Jul 6, 2022

pkg/sql/opt/xform/testdata/rules/join_order line 2622 at r1 (raw file):

Previously, DrewKimball (Drew Kimball) wrote…

It doesn't seem to set off a cycle in the optimization pattern; only in the structure of the memo. So, you can run without the memo-cycles command and it will happily print the memo without any errors.

Do you mean without the fix or with the fix? The panic should occur during optimization like it did in #80901, right?

Copy link
Collaborator

@mgartner mgartner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The LeftJoin should become an InnerJoin because there is an equality filter (null-rejecting) that references its right input.

Oh right. I originally thought that but somehow convinced myself that would not be equivalent.

It's because the query has outer columns. We have to be careful about when we push down IS NOT NULL filters to prevent interactions with decorrelation rules that attempt to pull those filters back up - see #35171. I do have an idea of how to fix it; instead of physically adding an IS NOT NULL filter, we can traverse the operator tree wherever it would be correct to add an IS NOT NULL filter and perform outer-join simplification as we go.

I see. No need to prioritize this, but feel free to document your idea in an issue so we have a record of it.

Reviewed 1 of 1 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @DrewKimball)

Copy link
Collaborator Author

@DrewKimball DrewKimball left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @mgartner)


pkg/sql/opt/xform/testdata/rules/join_order line 2622 at r1 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

Do you mean without the fix or with the fix? The panic should occur during optimization like it did in #80901, right?

Hm... I mean without the fix. When running through the opttester the memo prints as usual with no error or detected cycle, but with the cycle visible within the memo. I assumed this was an artifact of the opttester, but maybe not. Similarly, there is no problem running the actual query on a cluster, but I get the error when I try to run explain or explain (opt) from the command line. I'm also using an M1 mac...

Some printf debugging in the case that doesn't throw the error showed that the number of passes never exceeded 1, and we never even reached that maxGroupPasses check more than 20 times.

Copy link
Collaborator

@mgartner mgartner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @DrewKimball)


pkg/sql/opt/xform/testdata/rules/join_order line 2622 at r1 (raw file):

Previously, DrewKimball (Drew Kimball) wrote…

Hm... I mean without the fix. When running through the opttester the memo prints as usual with no error or detected cycle, but with the cycle visible within the memo. I assumed this was an artifact of the opttester, but maybe not. Similarly, there is no problem running the actual query on a cluster, but I get the error when I try to run explain or explain (opt) from the command line. I'm also using an M1 mac...

Some printf debugging in the case that doesn't throw the error showed that the number of passes never exceeded 1, and we never even reached that maxGroupPasses check more than 20 times.

Interesting. I am getting similar results to you. Expect I do see the cycle error when running these in cockroach demo. Maybe it is a problem with opttester or with the test catalog.

Let's leave this as-is for now. I've created #83996 to track the broken cycle detection.

@mgartner
Copy link
Collaborator

mgartner commented Jul 7, 2022

pkg/sql/opt/xform/testdata/rules/join_order line 2622 at r1 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

Interesting. I am getting similar results to you. Expect I do see the cycle error when running these in cockroach demo. Maybe it is a problem with opttester or with the test catalog.

Let's leave this as-is for now. I've created #83996 to track the broken cycle detection.

Expect => Except***

@DrewKimball
Copy link
Collaborator Author

TFTR!

@DrewKimball
Copy link
Collaborator Author

bors r+

@craig
Copy link
Contributor

craig bot commented Jul 7, 2022

Build succeeded:

@craig craig bot merged commit 6b7082a into cockroachdb:master Jul 7, 2022
@DrewKimball
Copy link
Collaborator Author

blathers backport 22.2 22.1

@blathers-crl
Copy link

blathers-crl bot commented Nov 10, 2022

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from ac77889 to blathers/backport-release-22.2-83875: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.2 failed. See errors above.


error creating merge commit from ac77889 to blathers/backport-release-22.1-83875: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.1 failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@DrewKimball
Copy link
Collaborator Author

No need to backport this, since #88779 already fixed the bug.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

opt: internal error: memo group optimization passes surpassed limit of 100000
3 participants