Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: incorrect joins to inverted expression indexes on json equality #111963

Closed
michae2 opened this issue Oct 6, 2023 · 6 comments · Fixed by #112654
Closed

sql: incorrect joins to inverted expression indexes on json equality #111963

michae2 opened this issue Oct 6, 2023 · 6 comments · Fixed by #112654
Assignees
Labels
A-sql-json JSON handling in SQL. A-sql-optimizer SQL logical planning and optimizations. branch-master Failures on the master branch. branch-release-23.2 Used to mark GA and release blockers and technical advisories for 23.2 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. S-1 High impact: many users impacted, serious risk of high unavailability or data loss T-sql-queries SQL Queries Team

Comments

@michae2
Copy link
Collaborator

michae2 commented Oct 6, 2023

Inverted indexes on JSON can be used both for (a) matching within the value and (b) matching the entire value. For some reason, though, it appears that inverted indexes on JSON expressions can only be used for (a) matching within the value and not (b) matching the entire value. And more alarming, sometimes we're constructing the inverted join to the inverted expression index incorrectly. Here's a demonstration using v23.2.0-alpha.2-dev:

CREATE TABLE t (i INT PRIMARY KEY, j JSONB);
INSERT INTO t SELECT i, to_json(i % 10) FROM generate_series(0, 9999) AS s(i);
INSERT INTO t SELECT i, json_build_object('a', i % 10) FROM generate_series(10000, 19999) AS s(i);
INSERT INTO t SELECT i, json_build_object('a', json_build_object('b', i % 10)) FROM generate_series(20000, 29999) AS s(i);

-- First, test an inverted index on the column.
CREATE INVERTED INDEX ON t (j);
ANALYZE t;

-- These all use the inverted index.
EXPLAIN SELECT i FROM t WHERE j = '5';
EXPLAIN SELECT i FROM t WHERE j = '{"a": 5}';
EXPLAIN SELECT i FROM t WHERE j = '{"a": {"b": 5}}';
EXPLAIN SELECT i FROM t WHERE j->'a' = '5';
EXPLAIN SELECT i FROM t WHERE j->'a' = '{"b": 5}';
EXPLAIN SELECT i FROM t WHERE j->'a'->'b' = '5';
EXPLAIN SELECT i FROM t WHERE j @> '5';
EXPLAIN SELECT i FROM t WHERE j @> '{"a": 5}';
EXPLAIN SELECT i FROM t WHERE j @> '{"a": {"b": 5}}';
EXPLAIN SELECT i FROM t WHERE j->'a' @> '5';
EXPLAIN SELECT i FROM t WHERE j->'a' @> '{"b": 5}';
EXPLAIN SELECT i FROM t WHERE j->'a'->'b' @> '5';

-- Then, test an inverted index on (j->'a').
DROP INDEX t_j_idx;
CREATE INVERTED INDEX ON t ((j->'a'));
ANALYZE t;

-- These should all be able to use the inverted expression index, but only the last one can.
EXPLAIN SELECT i FROM t WHERE j->'a' = '5';
EXPLAIN SELECT i FROM t WHERE j->'a' = '{"b": 5}';
EXPLAIN SELECT i FROM t WHERE j->'a'->'b' = '5';

-- More concerning, these should not use the inverted expression index, but they do, and
-- get incorrect results!
EXPLAIN SELECT i FROM t WHERE j = '5';
EXPLAIN SELECT i FROM t WHERE j = '{"a": 5}';
EXPLAIN SELECT i FROM t WHERE j = '{"a": {"b": 5}}';

-- Interestingly, the cases using @> seem to behave correctly.

Jira issue: CRDB-32156

@michae2 michae2 added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-sql-json JSON handling in SQL. A-sql-optimizer SQL logical planning and optimizations. S-3-erroneous-edge-case Database produces or stores erroneous data without visible error/warning, in rare edge cases. T-sql-queries SQL Queries Team release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Oct 6, 2023
@blathers-crl
Copy link

blathers-crl bot commented Oct 6, 2023

Hi @michae2, please add branch-* labels to identify which branch(es) this release-blocker affects.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@michae2 michae2 added branch-master Failures on the master branch. branch-release-23.2 Used to mark GA and release blockers and technical advisories for 23.2 labels Oct 6, 2023
@michae2
Copy link
Collaborator Author

michae2 commented Oct 6, 2023

Looks like this wasn't present in any releases prior to 23.2, adding release-blocker.

@michae2 michae2 added the O-qa label Oct 6, 2023
@mgartner
Copy link
Collaborator

One approach to diagnosing this is to git bisect to figure out what commit introduced this.

@mgartner mgartner self-assigned this Oct 18, 2023
@mgartner
Copy link
Collaborator

Great catch @michae2. How did you discover this one? I'm surprised that randomized tests haven't caught this—maybe there is a hole in our inverted expression index testing.

@mgartner
Copy link
Collaborator

I've bisected this to 85400d3.

@mgartner
Copy link
Collaborator

This one is pretty bad. We'll scan an inverted index on a column that's not even filtered in the query:

defaultdb> CREATE TABLE t (
        ->   k INT PRIMARY KEY,
        ->   j1 JSON,
        ->   j2 JSON,
        ->   INVERTED INDEX (j1)
        -> );
CREATE TABLE

defaultdb> EXPLAIN SELECT * FROM t WHERE j2 = '5';
             info
-------------------------------
  distribution: local
  vectorized: true

  • filter
  │ filter: j2 = '5'
  │
  └── • index join
      │ table: t@t_pkey
      │
      └── • scan
            missing stats
            table: t@t_j1_idx
            spans: 2 spans
(13 rows)

As I work on fixing this, I'll try to figure out why our randomized testing did not catch this.

mgartner added a commit to mgartner/cockroach that referenced this issue Oct 18, 2023
This commit fixes a bug introduced in cockroachdb#101178 that allows the optimizer
to generated inverted index scans on columns that are not filtered by
the query. For example, an inverted index over the column `j1` could be
scanned for a filter involving a different column, like `j2 = '5'`. The
bug is caused by a simple omission of code that must check that the
column in the filter is an indexed column.

Fixes cockroachdb#111963

There is no release note because this bug is not present in any
releases.

Release note: None
mgartner added a commit to mgartner/cockroach that referenced this issue Oct 18, 2023
This commit makes `randgen` more likely to generate single-column
indexes. It is motivated by the bug cockroachdb#111963, which surprisingly lived on
the master branch for sixth months without being detected. It's not
entirely clear why TLP or other randomized tests did not catch the bug,
which has such a simple reproduction.

One theory is that indexes tend to be multi-column and constrained scans
on multi-column inverted indexes are not commonly planned for randomly
generated queries because the set of requirements to generate the scan
are very specific: the query must hold each prefix column constant, e.g.
`a=1 AND b=2 AND j='5'::JSON`. The likelihood of randomly generating
such an expression may be so low that the bug was not caught.

By making 10% of indexes single-column, this bug may have been more
likely to be caught because only the inverted index column needs to be
constrained by an equality filter.

Release note: None
@mgartner mgartner added S-1 High impact: many users impacted, serious risk of high unavailability or data loss and removed S-3-erroneous-edge-case Database produces or stores erroneous data without visible error/warning, in rare edge cases. labels Oct 18, 2023
mgartner added a commit to mgartner/cockroach that referenced this issue Oct 19, 2023
This commit makes `randgen` more likely to generate single-column
indexes. It is motivated by the bug cockroachdb#111963, which surprisingly lived on
the master branch for sixth months without being detected. It's not
entirely clear why TLP or other randomized tests did not catch the bug,
which has such a simple reproduction.

One theory is that indexes tend to be multi-column and constrained scans
on multi-column inverted indexes are not commonly planned for randomly
generated queries because the set of requirements to generate the scan
are very specific: the query must hold each prefix column constant, e.g.
`a=1 AND b=2 AND j='5'::JSON`. The likelihood of randomly generating
such an expression may be so low that the bug was not caught.

By making 50% of indexes single-column, this bug may have been more
likely to be caught because only the inverted index column needs to be
constrained by an equality filter.

Release note: None
mgartner added a commit to mgartner/cockroach that referenced this issue Oct 20, 2023
This commit makes `randgen` more likely to generate single-column
indexes. It is motivated by the bug cockroachdb#111963, which surprisingly lived on
the master branch for sixth months without being detected. It's not
entirely clear why TLP or other randomized tests did not catch the bug,
which has such a simple reproduction.

One theory is that indexes tend to be multi-column and constrained scans
on multi-column inverted indexes are not commonly planned for randomly
generated queries because the set of requirements to generate the scan
are very specific: the query must hold each prefix column constant, e.g.
`a=1 AND b=2 AND j='5'::JSON`. The likelihood of randomly generating
such an expression may be so low that the bug was not caught.

By making 50% of indexes single-column, this bug may have been more
likely to be caught because only the inverted index column needs to be
constrained by an equality filter.

Release note: None
craig bot pushed a commit that referenced this issue Oct 20, 2023
111713: sql: fix nil-pointer error in local retry r=DrewKimball a=DrewKimball

#### tree: return correct parse error for pg_lsn

This patch changes the error returned upon failing to parse a PG_LSN
value to match postgres. Previously, the error was an internal error.

Informs #111327

Release note: None

#### sql: fix nil-pointer error in local retry

In #105451, we added logic to locally retry a distributed query
after an error. However, the retry logic unconditionally updated a
field of `DistSQLReceiver` that may be nil, which could cause a
nil-pointer error in some code paths (e.g. apply-join). This patch
adds a check that the field is non-nil, as is done for other places
where it is updated.

There is no release note because the change has not yet made it into
a release.

Fixes #111327

Release note: None

112654: opt: fix inverted index constrained scans for equality filters r=mgartner a=mgartner

#### opt: fix inverted index constrained scans for equality filters

This commit fixes a bug introduced in #101178 that allows the optimizer
to generated inverted index scans on columns that are not filtered by
the query. For example, an inverted index over the column `j1` could be
scanned for a filter involving a different column, like `j2 = '5'`. The
bug is caused by a simple omission of code that must check that the
column in the filter is an indexed column.

Fixes #111963

There is no release note because this bug is not present in any
releases.

Release note: None

#### randgen: generate single-column indexes more often

This commit makes `randgen` more likely to generate single-column
indexes. It is motivated by the bug #111963, which surprisingly lived on
the master branch for sixth months without being detected. It's not
entirely clear why TLP or other randomized tests did not catch the bug,
which has such a simple reproduction.

One theory is that indexes tend to be multi-column and constrained scans
on multi-column inverted indexes are not commonly planned for randomly
generated queries because the set of requirements to generate the scan
are very specific: the query must hold each prefix column constant, e.g.
`a=1 AND b=2 AND j='5'::JSON`. The likelihood of randomly generating
such an expression may be so low that the bug was not caught.

By making 10% of indexes single-column, this bug may have been more
likely to be caught because only the inverted index column needs to be
constrained by an equality filter.

Release note: None


112690: sql: disallow invocation of procedures outside of CALL r=mgartner a=mgartner

#### sql: disallow invocation of procedures outside of CALL

This commit adds some missing checks to ensure that procedures cannot be
invoked in any context besides as the root expression in `CALL`
statements.

Epic: CRDB-25388

Release note: None

#### sql: add tests with function invocation in procedure argument

This commit adds a couple of tests that show that functions can be used
in procedure argument expressions.

Release note: None


112698: sql: clarify comments/naming of descriptorChanged flag r=rafiss a=rafiss

fixes #110727
Release note: None

112701: sql/logictest: fix flakes in select_for_update_read_committed r=mgartner a=mgartner

The `select_for_update_read_committed` tests were flaking because not
all statements were being run under READ COMMITTED isolation. The logic
test infrastructure does not allow fine-grained control of sessions, and
setting the isolation level in one statement would only apply to a
single session. Subsequent statements are not guaranteed to run in the
same session because they could run in any session in the connection
pool. This commit wraps each statement in an explicitly transaction with
an explicit isolation level to ensure READ COMMITTED is used.

In the future, we should investigate allowing fine-grained and explicit
control of sessions in logic tests.

Fixes #112677

Release note: None


112726: sql: make tests error if a leaf txn is not created when expected r=rharding6373 a=rharding6373

This adds a test-only error if a leaf transaction is expected to be used by a plan but a root transaction is used instead.

Epic: none
Informs: #111097

Release note: None

112767: log: fix stacktrace test goroutine counts r=rickystewart a=dhartunian

Previously, we would use the count of the string `goroutine ` as a proxy for the number of goroutines in the stacktrace. This stopped working in go 1.21 due to this change:
golang/go@51225f6

We should consider using a stacktrace parser in the future.

Supports #112088

Epic: None
Release note: None

Co-authored-by: Drew Kimball <drewk@cockroachlabs.com>
Co-authored-by: Marcus Gartner <marcus@cockroachlabs.com>
Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>
Co-authored-by: rharding6373 <rharding6373@users.noreply.github.com>
Co-authored-by: David Hartunian <davidh@cockroachlabs.com>
@craig craig bot closed this as completed in fc1ca9e Oct 20, 2023
blathers-crl bot pushed a commit that referenced this issue Oct 20, 2023
This commit fixes a bug introduced in #101178 that allows the optimizer
to generated inverted index scans on columns that are not filtered by
the query. For example, an inverted index over the column `j1` could be
scanned for a filter involving a different column, like `j2 = '5'`. The
bug is caused by a simple omission of code that must check that the
column in the filter is an indexed column.

Fixes #111963

There is no release note because this bug is not present in any
releases.

Release note: None
blathers-crl bot pushed a commit that referenced this issue Oct 20, 2023
This commit makes `randgen` more likely to generate single-column
indexes. It is motivated by the bug #111963, which surprisingly lived on
the master branch for sixth months without being detected. It's not
entirely clear why TLP or other randomized tests did not catch the bug,
which has such a simple reproduction.

One theory is that indexes tend to be multi-column and constrained scans
on multi-column inverted indexes are not commonly planned for randomly
generated queries because the set of requirements to generate the scan
are very specific: the query must hold each prefix column constant, e.g.
`a=1 AND b=2 AND j='5'::JSON`. The likelihood of randomly generating
such an expression may be so low that the bug was not caught.

By making 50% of indexes single-column, this bug may have been more
likely to be caught because only the inverted index column needs to be
constrained by an equality filter.

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-sql-json JSON handling in SQL. A-sql-optimizer SQL logical planning and optimizations. branch-master Failures on the master branch. branch-release-23.2 Used to mark GA and release blockers and technical advisories for 23.2 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. S-1 High impact: many users impacted, serious risk of high unavailability or data loss T-sql-queries SQL Queries Team
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

2 participants