Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql,util,opt: add support for inverted filter with JSON and Array columns #56732

Merged
merged 2 commits into from
Nov 25, 2020

Conversation

rytaft
Copy link
Collaborator

@rytaft rytaft commented Nov 16, 2020

sql,util: return unique boolean from EncodeContainingInvertedIndexSpans

This commit adds a result parameter for EncodeContainingInvertedIndexSpans
called unique. The function returns unique=true if the spans are
guaranteed not to produce duplicate primary keys. Otherwise, returns
unique=false. The logic is based on the fact that spans for a given JSON
or Array datum will not produce duplicate primary keys if there is exactly one
JSON or Array path ending in a scalar (i.e., not an empty object or array).

sql,opt: add support for inverted filter with JSON and Array columns

This commit adds support for planning and executing an inverted index
scan + invertedFilter when there is a contains (@>) predicate on a JSON or
array column that has an inverted index. Similar to spatial columns, the
optimizer may be able to plan an inverted index scan + invertedFilter
for JSON and Array columns even if the filter condition is complex, with
multiple predicates combined with AND and OR expressions.

Informs #47340

Release note (performance improvement): The optimizer now supports using
an inverted index on JSON or Array columns for a wider variety of filter
predicates. Previously, inverted index usage was only supported for simple
predicates (e.g., WHERE a @> '{"foo": "bar"}'), but now more complicated
predicates are supported by combining simple contains (@>) expressions with
AND and OR (e.g., WHERE a @> '{"foo": "bar"}' OR a @> '{"foo": "baz"}').
An inverted index will be used if it is available on the filtered column and
the optimizer expects it to be more efficient than any alternative plan.
This may result in performance improvements for queries involving JSON and
Array columns.

@rytaft rytaft requested a review from a team as a code owner November 16, 2020 14:05
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@rytaft
Copy link
Collaborator Author

rytaft commented Nov 16, 2020

Please hold off reviewing this, since there are some plan regressions. Trying to figure out how to fix.

@rytaft
Copy link
Collaborator Author

rytaft commented Nov 19, 2020

I think this is finally ready for review. There are still a few minor plan regressions, but I've created issues for them and included TODOs in the test files (#56870, #56868 and #56799). PTAL - thanks!

Copy link
Collaborator

@mgartner mgartner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: Nice! Looks like this was a big lift.

Reviewed 1 of 5 files at r1, 4 of 19 files at r3, 14 of 25 files at r4, 9 of 10 files at r5, 3 of 3 files at r6.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @RaduBerinde, @rytaft, and @sumeerbhola)


pkg/sql/logictest/testdata/logic_test/inverted_filter_json_array, line 1 at r6 (raw file):

# Test cases for using invertedFilterer on an inverted json or array index.

Some of these tests are for zigzag joins. Should all these tests just live in inverted_index? I'm not really sure...

Also, there's nothing here that ensures that these tests will continue to use the intended operators (zigzag vs inverted filter) in the future. This has come up before I think in a previous PR. I'm just a little uneasy about it. I don't know of a good solution though.


pkg/sql/logictest/testdata/logic_test/inverted_filter_json_array, line 81 at r6 (raw file):


query I
SELECT a FROM json_tab WHERE b @> '1' ORDER BY a

Is this case the same as above but without an index join?


pkg/sql/opt/invertedidx/inverted_index_expr.go, line 265 at r6 (raw file):

// - pre-filterer state that can be used by the invertedFilterer operator to
//   reduce the number of false positives returned by the span expression.
func TryFilterInvertedIndex(

[nit] move this function up directly above TryJoinInvertedIndex since this is kind of an "entry point" into this file.


pkg/sql/opt/invertedidx/inverted_index_expr.go, line 438 at r6 (raw file):

// - remaining filters that must be applied if the inverted expression is not
//   tight, and
// - pre-filterer state that can be used to reduce false positives.

It looks like you only return a pre-filterer state if filterCond is a leaf expression. Can a pf state not be generated from conjunctions and disjunctions? If not, maybe add a comment here explaining this.


pkg/sql/opt/invertedidx/json_array.go, line 278 at r6 (raw file):

) {
	switch t := expr.(type) {
	case *memo.ContainsExpr:

[nit] add TODO to support JSON fetch val operator ->


pkg/sql/opt/invertedidx/json_array.go, line 313 at r6 (raw file):

	}
	if variable.Typ.Family() == types.ArrayFamily &&
		j.index.Version() < descpb.EmptyArraysInInvertedIndexesVersion {

What's the upgrade story for users to new inverted index versions?

If we took out the current Scan.Constraint-based inverted index scanning now, when a user upgrades to 21.1 some of their queries could go from using inverted indexes to full table scans until they upgraded their inverted index versions.


pkg/sql/opt/invertedidx/json_array_test.go, line 327 at r6 (raw file):

		{
			// We cannot guarantee that the span expression is tight when there is a
			// nested array.

Hmmm.. Why not?


pkg/sql/opt/xform/select_funcs.go, line 866 at r6 (raw file):

		// We scan the PK columns, and the inverted key column if there is an
		// inverted filter.

[nit] => "We scan the PK columns if needed". You could also move the initialization of needInvertedFilter here and combine the comments.


pkg/sql/opt/xform/testdata/rules/select, line 1821 at r6 (raw file):

      └── key: (1)

# Disjunction.

🎉


pkg/sql/rowenc/index_encoding.go, line 939 at r6 (raw file):

		spans[i] = roachpb.Spans{{Key: key, EndKey: endKey}}
	}
	unique = len(keys) == 1

[nit] add a comment that reminds the reader that more than 1 key means that there could be duplicate primary keys in an inverted index.


pkg/sql/rowenc/index_encoding_test.go, line 436 at r6 (raw file):

		{`{NULL}`, `{}`, true, false},
		{`{NULL}`, `{NULL}`, false, true},
		{`{2, NULL}`, `{2, NULL}`, false, true},

It's not obvious to me why the spans for {2, NULL} would guarantee non-duplicate primary keys (unique=true). For an array with the value [2, NULL], wouldn't there be 2 entries in the inverted index for that row?

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 5 files at r1, 4 of 19 files at r3, 1 of 25 files at r4.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @RaduBerinde and @rytaft)


pkg/sql/opt/xform/select_funcs.go, line 857 at r6 (raw file):

		// INTERSECTION.
		needInvertedFilter := spanExpr != nil &&
			(!spanExpr.Unique || spanExpr.Operator != invertedexpr.None)

Based on my earlier comment on encodeContainingInvertedIndexSpansFromLeaf I wonder if there is really any case where there is more than one span in a roachpb.Spans and unique is true.

I was expecting some unification of the code used in GenerateInvertedIndexZigzagJoins and the inverted-filterer, for example using the same encodeContainingInvertedIndexSpans. If I remember correctly, these recent PRs have not been changing indexConstraintCtx.makeInvertedIndexSpansForExpr. Is that code still correct, in light of the better understanding of NULL semantics? And it seems we wouldn't want to maintain both as we add support for more operators (beyond ContainsOp).

We would have something like:
unique (which implies sorted) for each roachpb.Spans which would be and-ed as is done now, but not say anything about the length of []roachpb.Spans. These would be used in building a SpanExpression which would set unique to false when Or-ing (but not for and-ing).
If unique was true and the len of this slice was >= 2, a zig-zag join would use the two tightest spans (based on stats -- there is a TODO in GenerateInvertedIndexZigzagJoins to use stats). Ideally, we could use the SpanExpression for getting the spans rather than returning to these slices, but it aggressively factors the spans, and we don't want that for zig-zag joins. Hmm, we could tweak SpanExpression to operate in a mode where it doesn't factor and ignores SpanExpression that are being and-ed that have unique=false -- i.e., it would be behaving in a way that would allow us to easily build a zig-zag join.


pkg/sql/rowenc/index_encoding.go, line 849 at r3 (raw file):

//
// Returns unique=true if the spans are guaranteed not to produce duplicate
// primary keys. Otherwise, returns unique=false.

I am having trouble interpreting this statement. If the different roachpb.Spans involved in the intersection produce different sets of keys, the intersection will be empty.
Then I thought this meant that the the different spans in each roachpb.Spans would be non-overlapping wrt primary keys (which typically means there will be only one span in each roachpb.Spans). And maybe the purpose here was to see when zig-zag joins were viable. But that would also require saying something more than unique -- that each roachpb.Spans produces keys in sorted order. I suppose uniqueness and sorted are equivalent given our encoding.
I think this needs a longer comment since []roachpb.Spans represents an expression and the existing statement is hard to relate to that.


pkg/util/json/json.go, line 1069 at r3 (raw file):

		// EncodeArrayAscending generates the prefix that is used for all non-empty
		// arrays.
		keys = append(keys, encoding.EncodeArrayAscending(prefix))

Can you add a comment like

// The span that will be generated for this key can have duplicate PKs, so unique=false.


pkg/util/json/json.go, line 1095 at r3 (raw file):

			// This end key is equal to jsonInvertedIndex + 1.
			EndKey: roachpb.Key(prefix).PrefixEnd(),
		})

same comment would be useful here.


pkg/util/json/json.go, line 1115 at r3 (raw file):

			keys = append(keys, arrKeys...)

			// Even though we now have two spans, we can guarantee no duplicates

... no PK duplicates ...

The span for the array will contain duplicate PKs inside it. Worth stating that in this comment, even though we may not care.

@rytaft rytaft force-pushed the filter-json-arr branch 2 times, most recently from dce5e0a to 714bd5a Compare November 23, 2020 17:18
Copy link
Collaborator Author

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTRs! Updated.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @mgartner, @RaduBerinde, and @sumeerbhola)


pkg/sql/logictest/testdata/logic_test/inverted_filter_json_array, line 1 at r6 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

Some of these tests are for zigzag joins. Should all these tests just live in inverted_index? I'm not really sure...

Also, there's nothing here that ensures that these tests will continue to use the intended operators (zigzag vs inverted filter) in the future. This has come up before I think in a previous PR. I'm just a little uneasy about it. I don't know of a good solution though.

I moved the zigzag join tests to inverted_index, and the corresponding EXPLAIN tests to the appropriate files. I removed the mention of which plan to choose here, since as you pointed out there's no way to enforce it. But I left the comment about the preferred plan in the two EXPLAIN varieties, so hopefully we'll catch any plan changes there.

I'd like to leave the rest of the tests in this file, since they correspond to the plans in the two EXPLAIN files, and it's easier to test this operator methodically. It also mirrors the geospatial tests. Maybe at some point we can combine them all into one file with multiple subtests, but I think this is easier and less error-prone for now.


pkg/sql/logictest/testdata/logic_test/inverted_filter_json_array, line 81 at r6 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

Is this case the same as above but without an index join?

Yep, added a comment.


pkg/sql/opt/invertedidx/inverted_index_expr.go, line 265 at r6 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

[nit] move this function up directly above TryJoinInvertedIndex since this is kind of an "entry point" into this file.

Done.


pkg/sql/opt/invertedidx/inverted_index_expr.go, line 438 at r6 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

It looks like you only return a pre-filterer state if filterCond is a leaf expression. Can a pf state not be generated from conjunctions and disjunctions? If not, maybe add a comment here explaining this.

Done.


pkg/sql/opt/invertedidx/json_array.go, line 278 at r6 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

[nit] add TODO to support JSON fetch val operator ->

Done.


pkg/sql/opt/invertedidx/json_array.go, line 313 at r6 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

What's the upgrade story for users to new inverted index versions?

If we took out the current Scan.Constraint-based inverted index scanning now, when a user upgrades to 21.1 some of their queries could go from using inverted indexes to full table scans until they upgraded their inverted index versions.

I had a TODO for this below, but it was an easy fix so I've just done it now. Added a few tests too.


pkg/sql/opt/invertedidx/json_array_test.go, line 327 at r6 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

Hmmm.. Why not?

Added a comment.


pkg/sql/opt/xform/select_funcs.go, line 857 at r6 (raw file):
The one case where you can have two spans and unique=true is when the right hand side of the contains expression is a scalar. There is a pretty extensive comment inside encodeContainingInvertedIndexSpansFromLeaf that explains that case, but let me know if you think I should also add an example here or elsewhere.

But I think you make a good point that we may be able to make use of the unique boolean in cases where we will need to intersect the results anyway, which will allow us to use this code for zig-zag joins as suggested below. I think this is going to require more thought, though, so I'd rather save that for a future PR if that's ok with you.

I was expecting some unification of the code used in GenerateInvertedIndexZigzagJoins...

I plan to do that in a follow-on PR.

... these recent PRs have not been changing indexConstraintCtx.makeInvertedIndexSpansForExpr. Is that code still correct ...

indexConstraintCtx.makeInvertedIndexSpansForExpr seems to be correct because it explicitly avoids using the index if an array is empty or has nulls. But I definitely plan to get rid of that code as soon as possible once GenerateInvertedIndexZigzagJoins uses the new code.

We would have something like: ... Hmm, we could tweak SpanExpression to operate in a mode ...

This is a good idea, but I think it can wait till the follow-on PR.


pkg/sql/opt/xform/select_funcs.go, line 866 at r6 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

[nit] => "We scan the PK columns if needed". You could also move the initialization of needInvertedFilter here and combine the comments.

Done.


pkg/sql/rowenc/index_encoding.go, line 849 at r3 (raw file):

Previously, sumeerbhola wrote…

I am having trouble interpreting this statement. If the different roachpb.Spans involved in the intersection produce different sets of keys, the intersection will be empty.
Then I thought this meant that the the different spans in each roachpb.Spans would be non-overlapping wrt primary keys (which typically means there will be only one span in each roachpb.Spans). And maybe the purpose here was to see when zig-zag joins were viable. But that would also require saying something more than unique -- that each roachpb.Spans produces keys in sorted order. I suppose uniqueness and sorted are equivalent given our encoding.
I think this needs a longer comment since []roachpb.Spans represents an expression and the existing statement is hard to relate to that.

I added a comment -- let me know if this clarifies it. The spans aren't guaranteed to be sorted (it's possible that they are in fact sorted, but it doesn't really matter, since we'll sort them anyway as part of constructing the SpanExpression).


pkg/sql/rowenc/index_encoding.go, line 939 at r6 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

[nit] add a comment that reminds the reader that more than 1 key means that there could be duplicate primary keys in an inverted index.

Done.


pkg/sql/rowenc/index_encoding_test.go, line 436 at r6 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

It's not obvious to me why the spans for {2, NULL} would guarantee non-duplicate primary keys (unique=true). For an array with the value [2, NULL], wouldn't there be 2 entries in the inverted index for that row?

Any array that contains a null must return empty spans, which are trivially unique. Added a comment.


pkg/util/json/json.go, line 1069 at r3 (raw file):

Previously, sumeerbhola wrote…

Can you add a comment like

// The span that will be generated for this key can have duplicate PKs, so unique=false.

Done.


pkg/util/json/json.go, line 1095 at r3 (raw file):

Previously, sumeerbhola wrote…

same comment would be useful here.

Done.


pkg/util/json/json.go, line 1115 at r3 (raw file):

... no PK duplicates ...

Done.

The span for the array will contain duplicate PKs inside it.

Not sure I understand... can you give an example?

This commit adds a result parameter for EncodeContainingInvertedIndexSpans
called unique. The function returns unique=true if the spans are
guaranteed not to produce duplicate primary keys. Otherwise, returns
unique=false. The logic is based on the fact that spans for a given JSON
or Array datum will not produce duplicate primary keys if there is exactly one
JSON or Array path ending in a scalar (i.e., not an empty object or array).

Release note: None
Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good! some minor comments.

I remembered that there are no microbenchmarks for invertedJoiner and invertedFilterer -- there are TODOs in the test files.
I did not see any obvious costly paths when looking at whole node profiles for geospatial queries, but that may be because the number of spans in those queries were causing other costs to be high. Now that we are going to exercise that code with fewer spans, I think it is worthwhile for someone to add microbenchmarks specifically for array and json and see if there are any optimizations possible there. I doubt I will have the time for it in the next month, so perhaps you or someone from sql execution would like to pick that up.

:lgtm:

Reviewed 4 of 30 files at r7, 25 of 29 files at r8.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @mgartner, @RaduBerinde, @rytaft, and @sumeerbhola)


pkg/sql/logictest/testdata/logic_test/inverted_filter_json_array_dist, line 63 at r8 (raw file):


statement ok
ALTER TABLE json_tab EXPERIMENTAL_RELOCATE VALUES (ARRAY[1], 1), (ARRAY[2], 10), (ARRAY[3], 20)

I was expecting split and relocate on the inverted index. What is the purpose of distributing the table?

Though I think we can't do the split and relocate on the index for json since these are not datums. And I think the similar thing in the geospatial tests is now broken for the same reason (since it is no longer using an int datum). I wonder if we should try to extend the EXPERIMENTAL_RELOCATE to handle encoded values, though it seems hard to maintain. Another possibility would be to rewrite these tests with multi-column inverted indexes and use the fact that the first column there will be a datum.


pkg/sql/logictest/testdata/logic_test/inverted_filter_json_array_dist, line 130 at r8 (raw file):

ORDER BY a]
----
https://cockroachdb.github.io/distsqlplan/decode.html#eJyUklFr2zAUhd_3Ky73pQnTiOUwBnpKu7gsI4szO7CN2BQluu08PMmT5BII-e_DdkPqQbL2xVj33O_oHNAe3Z8SBabRPPq4gtqWcJvEX2AdfV_Or2cLGExn6Sr9Oh_C04rsFn45o--83MC3T1ESwWCwgUlWB8GY4GrN8yshPqfx4mYIcQI9LTxpQ7heTPvo-CyY4UbaDJ_BcTKNErj5ATJHhtooWsjf5FCskWPOsLJmS84Z24z27cJM7VAEDAtd1b4Z5wy3xhKKPfrCl4QCV3JTUkJSkR0FyFCRl0XZ2h5LT-6NuSv0IzJMK6mdgNFTvHfHn9HS0n2xi7QCqRW8B-N_knWYHxia2p-ud14-EAp-YC-PONOPZD2p26L0ZMmOeD_nUY92lQWjYcIFuCYoOC-tFxl-yLIgCJoPz7IgDJvMpwkCafWfrTBD6FdjGNdewISfLRm-pmRqrCc7CvvVJvztWfvxa-wTcpXRjnr255yDQ86Q1AN1T8mZ2m5pac22vaY7xi3XDhQ536m8O8x0JzUBn8P8IhxehsOL8PgfOD-8-RsAAP__85I68g==

Nice that many of these don't need a JoinReader.


pkg/sql/logictest/testdata/logic_test/inverted_filter_json_array_dist, line 191 at r8 (raw file):

query T
SELECT url FROM [EXPLAIN (DISTSQL)
SELECT a FROM array_tab WHERE b @> '{1}' OR a = 1 ORDER BY a]

will this unioning of (a) scan on the primary index and (b) inverted index+join-reader, also happen if (b) needed an inverted filterer?


pkg/sql/logictest/testdata/logic_test/inverted_join_json_array_dist, line 93 at r8 (raw file):

ORDER BY j1.a, j2.a]
----
https://cockroachdb.github.io/distsqlplan/decode.html#eJzUk01v2kwUhffvr7i67yIQJthjQz5mNUlxJUfUTm1UtUpQZPA0MiUed2xHqRD_vbIhH47CAItK7ZLr-8w554qzwPznHBmGztD5MIJSzeFj4H-Ca-fr1fDc9aA1cMNR-HnYhvXKjHYPCcys7uFqc5bL9LaIJnAewswC1_OcAFzvixOMnAFc-q7XXKHge9BqzWh3AvymNE1bVK9N2nDuDaA5P1jcYHSDDBbL5UH7acPqRlAvTMEy2-AHAyeAi2-Vs6h2Fo2RYCpj4UX3Ikd2jRQJWjgmmCk5FXkuVTVe1Etu_IjMJJikWVlU4zHBqVQC2QKLpJgLZDiKJnMRiCgWyjCRYCyKKJnXTz-F45lK7iP1CwmGWZTmDI4MauJ4SVCWxcvTeRHdCWR0SXaXd9MHoQoRX8okFcqwNjj4LuVtkj4geQacx0xBi_eeL8qt9RVfzRpXZoxdhr530UaCflkw4JRwi3B7YxJrnyRVgvUde1vvOJTyR5nBTCYpyJQBtytTHrR4_51A_T0DEd7fmMneJ1MoVSGUcdzMw-0O4bTzrGoT3iO19kbV3kbVFzGpYqFE_EaLdgi3OzhevmPPk0cyM04bxCYH_YYDunsp6K6lMKh5ZNAz4_8dm7HFw5tm2H9xM7YkedWM_j_TjC2Z1s04-YPNeEc1EHkm01zs9I83q8qI-E6sapbLUk3FlZLTWmb106-5ehCLvFh9pasfbrr6VBl8DVMtbOlhSwvbetjWwr0GTN_CPS18plfua-FjPXyshU_08IkWPtXDp3sdbLz873cAAAD__7iL4aw=

what is causing this to change?

and more generally, there are many changes to the inverted_join testdata files, which I am unclear about -- I thought this PR only touched inverted filter.


pkg/sql/opt/exec/execbuilder/testdata/distsql_inverted_index, line 126 at r8 (raw file):

query T
SELECT url FROM [EXPLAIN (DISTSQL)
SELECT a FROM array_tab@foo_inv WHERE b @> '{1, 2}' ORDER BY a]

Same question about doing the relocate above on the table instead of the index. The plan here is not distributed.


pkg/sql/opt/xform/select_funcs.go, line 857 at r6 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

The one case where you can have two spans and unique=true is when the right hand side of the contains expression is a scalar. There is a pretty extensive comment inside encodeContainingInvertedIndexSpansFromLeaf that explains that case, but let me know if you think I should also add an example here or elsewhere.

But I think you make a good point that we may be able to make use of the unique boolean in cases where we will need to intersect the results anyway, which will allow us to use this code for zig-zag joins as suggested below. I think this is going to require more thought, though, so I'd rather save that for a future PR if that's ok with you.

I was expecting some unification of the code used in GenerateInvertedIndexZigzagJoins...

I plan to do that in a follow-on PR.

... these recent PRs have not been changing indexConstraintCtx.makeInvertedIndexSpansForExpr. Is that code still correct ...

indexConstraintCtx.makeInvertedIndexSpansForExpr seems to be correct because it explicitly avoids using the index if an array is empty or has nulls. But I definitely plan to get rid of that code as soon as possible once GenerateInvertedIndexZigzagJoins uses the new code.

We would have something like: ... Hmm, we could tweak SpanExpression to operate in a mode ...

This is a good idea, but I think it can wait till the follow-on PR.

Follow-on PR sounds good.


pkg/sql/rowenc/index_encoding.go, line 855 at r8 (raw file):

// allows the optimizer to remove the UNION altogether (implemented
// with the invertedFilterer), and simply return the results of the constrained
// scan.

Can you add that unique is also set to false when length of spans > 1.


pkg/util/json/json.go, line 1115 at r3 (raw file):

Not sure I understand... can you give an example?

My statement was incorrect. The code is fine.

This commit adds support for planning and executing an inverted index
scan + invertedFilter when there is a contains (@>) predicate on a JSON or
array column that has an inverted index. Similar to spatial columns, the
optimizer may be able to plan an inverted index scan + invertedFilter
for JSON and Array columns even if the filter condition is complex, with
multiple predicates combined with AND and OR expressions.

Informs cockroachdb#47340

Release note (performance improvement): The optimizer now supports using
an inverted index on JSON or Array columns for a wider variety of filter
predicates. Previously, inverted index usage was only supported for simple
predicates (e.g., `WHERE a @> '{"foo": "bar"}'), but now more complicated
predicates are supported by combining simple contains (@>) expressions with
AND and OR (e.g., `WHERE a @> '{"foo": "bar"}' OR a @> '{"foo": "baz"}'`).
An inverted index will be used if it is available on the filtered column and
the optimizer expects it to be more efficient than any alternative plan.
This may result in performance improvements for queries involving JSON and
Array columns.
Copy link
Collaborator Author

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR!

... I think it is worthwhile for someone to add microbenchmarks ...

Opened #57101

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @mgartner, @RaduBerinde, and @sumeerbhola)


pkg/sql/logictest/testdata/logic_test/inverted_filter_json_array_dist, line 63 at r8 (raw file):

Previously, sumeerbhola wrote…

I was expecting split and relocate on the inverted index. What is the purpose of distributing the table?

Though I think we can't do the split and relocate on the index for json since these are not datums. And I think the similar thing in the geospatial tests is now broken for the same reason (since it is no longer using an int datum). I wonder if we should try to extend the EXPERIMENTAL_RELOCATE to handle encoded values, though it seems hard to maintain. Another possibility would be to rewrite these tests with multi-column inverted indexes and use the fact that the first column there will be a datum.

I rewrote the tests using a multi-column inverted index.


pkg/sql/logictest/testdata/logic_test/inverted_filter_json_array_dist, line 130 at r8 (raw file):

Previously, sumeerbhola wrote…

Nice that many of these don't need a JoinReader.

Agreed!


pkg/sql/logictest/testdata/logic_test/inverted_filter_json_array_dist, line 191 at r8 (raw file):

Previously, sumeerbhola wrote…

will this unioning of (a) scan on the primary index and (b) inverted index+join-reader, also happen if (b) needed an inverted filterer?

Yes it will -- Added another test case for that.


pkg/sql/logictest/testdata/logic_test/inverted_join_json_array_dist, line 93 at r8 (raw file):

Previously, sumeerbhola wrote…

what is causing this to change?

and more generally, there are many changes to the inverted_join testdata files, which I am unclear about -- I thought this PR only touched inverted filter.

Now that we support inverted filtering for JSON, the optimizer plans an inverted filter for this query with a non-inverted join. In order to force this query to use an inverted join I added the INVERTED JOIN hint, but that also changed the order of the tables.

The other two queries in this file also changed because the optimizer began planning an inverted filter instead of an inverted join. We can't yet force semi and anti joins to use a particular join algorithm, so instead I changed the query so that the inverted join was chosen.


pkg/sql/opt/exec/execbuilder/testdata/distsql_inverted_index, line 126 at r8 (raw file):

Previously, sumeerbhola wrote…

Same question about doing the relocate above on the table instead of the index. The plan here is not distributed.

Zigzag joins aren't supported yet with multicolumn indexes. I left a TODO in this file to fix it once they are supported.


pkg/sql/opt/xform/select_funcs.go, line 857 at r6 (raw file):

Previously, sumeerbhola wrote…

Follow-on PR sounds good.

Ack.


pkg/sql/rowenc/index_encoding.go, line 855 at r8 (raw file):

Previously, sumeerbhola wrote…

Can you add that unique is also set to false when length of spans > 1.

Done.

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 4 of 29 files at r8, 4 of 4 files at r9.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @mgartner, @RaduBerinde, @rytaft, and @sumeerbhola)


pkg/sql/logictest/testdata/logic_test/inverted_filter_json_array_dist, line 191 at r8 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

Yes it will -- Added another test case for that.

nice!


pkg/sql/logictest/testdata/logic_test/inverted_join_json_array_dist, line 93 at r8 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

Now that we support inverted filtering for JSON, the optimizer plans an inverted filter for this query with a non-inverted join. In order to force this query to use an inverted join I added the INVERTED JOIN hint, but that also changed the order of the tables.

The other two queries in this file also changed because the optimizer began planning an inverted filter instead of an inverted join. We can't yet force semi and anti joins to use a particular join algorithm, so instead I changed the query so that the inverted join was chosen.

Is this because the heuristic cost we assign to an inverted join using j1.b @> j2.b is greater than the estimated cost for doing j1.b @> '{"a": {}}' and j2.a < 20 on each side and then doing a hash join?
What is in the plan for 21.1 regarding cost estimation for inverted joins?

Copy link
Collaborator Author

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

bors r+

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @mgartner, @RaduBerinde, @rytaft, and @sumeerbhola)


pkg/sql/logictest/testdata/logic_test/inverted_join_json_array_dist, line 93 at r8 (raw file):

Previously, sumeerbhola wrote…

Is this because the heuristic cost we assign to an inverted join using j1.b @> j2.b is greater than the estimated cost for doing j1.b @> '{"a": {}}' and j2.a < 20 on each side and then doing a hash join?
What is in the plan for 21.1 regarding cost estimation for inverted joins?

Yea, that's exactly right -- without the hint, the optimizer chooses a hash join with the two sides constrained.

The stats/cost estimation for inverted indexes need a lot of improvement in general. I think improving stats for inverted index scan, inverted filter and inverted join all need to be a priority for this release. Using histograms to estimate the cardinality of joins will be part of the solution for improving inverted joins (right now we just use distinct counts).

@craig
Copy link
Contributor

craig bot commented Nov 25, 2020

Build succeeded:

@craig craig bot merged commit 8ef0622 into cockroachdb:master Nov 25, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants