-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sql,util,opt: add support for inverted filter with JSON and Array columns #56732
Conversation
Please hold off reviewing this, since there are some plan regressions. Trying to figure out how to fix. |
410f6c4
to
809cdbf
Compare
809cdbf
to
9a747ce
Compare
c19805b
to
977d7e5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! Looks like this was a big lift.
Reviewed 1 of 5 files at r1, 4 of 19 files at r3, 14 of 25 files at r4, 9 of 10 files at r5, 3 of 3 files at r6.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @RaduBerinde, @rytaft, and @sumeerbhola)
pkg/sql/logictest/testdata/logic_test/inverted_filter_json_array, line 1 at r6 (raw file):
# Test cases for using invertedFilterer on an inverted json or array index.
Some of these tests are for zigzag joins. Should all these tests just live in inverted_index
? I'm not really sure...
Also, there's nothing here that ensures that these tests will continue to use the intended operators (zigzag vs inverted filter) in the future. This has come up before I think in a previous PR. I'm just a little uneasy about it. I don't know of a good solution though.
pkg/sql/logictest/testdata/logic_test/inverted_filter_json_array, line 81 at r6 (raw file):
query I SELECT a FROM json_tab WHERE b @> '1' ORDER BY a
Is this case the same as above but without an index join?
pkg/sql/opt/invertedidx/inverted_index_expr.go, line 265 at r6 (raw file):
// - pre-filterer state that can be used by the invertedFilterer operator to // reduce the number of false positives returned by the span expression. func TryFilterInvertedIndex(
[nit] move this function up directly above TryJoinInvertedIndex
since this is kind of an "entry point" into this file.
pkg/sql/opt/invertedidx/inverted_index_expr.go, line 438 at r6 (raw file):
// - remaining filters that must be applied if the inverted expression is not // tight, and // - pre-filterer state that can be used to reduce false positives.
It looks like you only return a pre-filterer state if filterCond
is a leaf expression. Can a pf state not be generated from conjunctions and disjunctions? If not, maybe add a comment here explaining this.
pkg/sql/opt/invertedidx/json_array.go, line 278 at r6 (raw file):
) { switch t := expr.(type) { case *memo.ContainsExpr:
[nit] add TODO to support JSON fetch val operator ->
pkg/sql/opt/invertedidx/json_array.go, line 313 at r6 (raw file):
} if variable.Typ.Family() == types.ArrayFamily && j.index.Version() < descpb.EmptyArraysInInvertedIndexesVersion {
What's the upgrade story for users to new inverted index versions?
If we took out the current Scan.Constraint
-based inverted index scanning now, when a user upgrades to 21.1 some of their queries could go from using inverted indexes to full table scans until they upgraded their inverted index versions.
pkg/sql/opt/invertedidx/json_array_test.go, line 327 at r6 (raw file):
{ // We cannot guarantee that the span expression is tight when there is a // nested array.
Hmmm.. Why not?
pkg/sql/opt/xform/select_funcs.go, line 866 at r6 (raw file):
// We scan the PK columns, and the inverted key column if there is an // inverted filter.
[nit] => "We scan the PK columns if needed". You could also move the initialization of needInvertedFilter
here and combine the comments.
pkg/sql/opt/xform/testdata/rules/select, line 1821 at r6 (raw file):
└── key: (1) # Disjunction.
🎉
pkg/sql/rowenc/index_encoding.go, line 939 at r6 (raw file):
spans[i] = roachpb.Spans{{Key: key, EndKey: endKey}} } unique = len(keys) == 1
[nit] add a comment that reminds the reader that more than 1 key means that there could be duplicate primary keys in an inverted index.
pkg/sql/rowenc/index_encoding_test.go, line 436 at r6 (raw file):
{`{NULL}`, `{}`, true, false}, {`{NULL}`, `{NULL}`, false, true}, {`{2, NULL}`, `{2, NULL}`, false, true},
It's not obvious to me why the spans for {2, NULL}
would guarantee non-duplicate primary keys (unique=true). For an array with the value [2, NULL]
, wouldn't there be 2 entries in the inverted index for that row?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 5 files at r1, 4 of 19 files at r3, 1 of 25 files at r4.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @RaduBerinde and @rytaft)
pkg/sql/opt/xform/select_funcs.go, line 857 at r6 (raw file):
// INTERSECTION. needInvertedFilter := spanExpr != nil && (!spanExpr.Unique || spanExpr.Operator != invertedexpr.None)
Based on my earlier comment on encodeContainingInvertedIndexSpansFromLeaf
I wonder if there is really any case where there is more than one span in a roachpb.Spans
and unique is true.
I was expecting some unification of the code used in GenerateInvertedIndexZigzagJoins
and the inverted-filterer, for example using the same encodeContainingInvertedIndexSpans
. If I remember correctly, these recent PRs have not been changing indexConstraintCtx.makeInvertedIndexSpansForExpr
. Is that code still correct, in light of the better understanding of NULL semantics? And it seems we wouldn't want to maintain both as we add support for more operators (beyond ContainsOp
).
We would have something like:
unique (which implies sorted) for each roachpb.Spans
which would be and-ed as is done now, but not say anything about the length of []roachpb.Spans
. These would be used in building a SpanExpression
which would set unique to false when Or-ing (but not for and-ing).
If unique was true and the len of this slice was >= 2, a zig-zag join would use the two tightest spans (based on stats -- there is a TODO in GenerateInvertedIndexZigzagJoins to use stats). Ideally, we could use the SpanExpression
for getting the spans rather than returning to these slices, but it aggressively factors the spans, and we don't want that for zig-zag joins. Hmm, we could tweak SpanExpression
to operate in a mode where it doesn't factor and ignores SpanExpression
that are being and-ed that have unique=false -- i.e., it would be behaving in a way that would allow us to easily build a zig-zag join.
pkg/sql/rowenc/index_encoding.go, line 849 at r3 (raw file):
// // Returns unique=true if the spans are guaranteed not to produce duplicate // primary keys. Otherwise, returns unique=false.
I am having trouble interpreting this statement. If the different roachpb.Spans
involved in the intersection produce different sets of keys, the intersection will be empty.
Then I thought this meant that the the different spans in each roachpb.Spans
would be non-overlapping wrt primary keys (which typically means there will be only one span in each roachpb.Spans
). And maybe the purpose here was to see when zig-zag joins were viable. But that would also require saying something more than unique -- that each roachpb.Spans
produces keys in sorted order. I suppose uniqueness and sorted are equivalent given our encoding.
I think this needs a longer comment since []roachpb.Spans
represents an expression and the existing statement is hard to relate to that.
pkg/util/json/json.go, line 1069 at r3 (raw file):
// EncodeArrayAscending generates the prefix that is used for all non-empty // arrays. keys = append(keys, encoding.EncodeArrayAscending(prefix))
Can you add a comment like
// The span that will be generated for this key can have duplicate PKs, so unique=false.
pkg/util/json/json.go, line 1095 at r3 (raw file):
// This end key is equal to jsonInvertedIndex + 1. EndKey: roachpb.Key(prefix).PrefixEnd(), })
same comment would be useful here.
pkg/util/json/json.go, line 1115 at r3 (raw file):
keys = append(keys, arrKeys...) // Even though we now have two spans, we can guarantee no duplicates
... no PK duplicates ...
The span for the array will contain duplicate PKs inside it. Worth stating that in this comment, even though we may not care.
dce5e0a
to
714bd5a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TFTRs! Updated.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @mgartner, @RaduBerinde, and @sumeerbhola)
pkg/sql/logictest/testdata/logic_test/inverted_filter_json_array, line 1 at r6 (raw file):
Previously, mgartner (Marcus Gartner) wrote…
Some of these tests are for zigzag joins. Should all these tests just live in
inverted_index
? I'm not really sure...Also, there's nothing here that ensures that these tests will continue to use the intended operators (zigzag vs inverted filter) in the future. This has come up before I think in a previous PR. I'm just a little uneasy about it. I don't know of a good solution though.
I moved the zigzag join tests to inverted_index
, and the corresponding EXPLAIN
tests to the appropriate files. I removed the mention of which plan to choose here, since as you pointed out there's no way to enforce it. But I left the comment about the preferred plan in the two EXPLAIN varieties, so hopefully we'll catch any plan changes there.
I'd like to leave the rest of the tests in this file, since they correspond to the plans in the two EXPLAIN files, and it's easier to test this operator methodically. It also mirrors the geospatial tests. Maybe at some point we can combine them all into one file with multiple subtests, but I think this is easier and less error-prone for now.
pkg/sql/logictest/testdata/logic_test/inverted_filter_json_array, line 81 at r6 (raw file):
Previously, mgartner (Marcus Gartner) wrote…
Is this case the same as above but without an index join?
Yep, added a comment.
pkg/sql/opt/invertedidx/inverted_index_expr.go, line 265 at r6 (raw file):
Previously, mgartner (Marcus Gartner) wrote…
[nit] move this function up directly above
TryJoinInvertedIndex
since this is kind of an "entry point" into this file.
Done.
pkg/sql/opt/invertedidx/inverted_index_expr.go, line 438 at r6 (raw file):
Previously, mgartner (Marcus Gartner) wrote…
It looks like you only return a pre-filterer state if
filterCond
is a leaf expression. Can a pf state not be generated from conjunctions and disjunctions? If not, maybe add a comment here explaining this.
Done.
pkg/sql/opt/invertedidx/json_array.go, line 278 at r6 (raw file):
Previously, mgartner (Marcus Gartner) wrote…
[nit] add TODO to support JSON fetch val operator
->
Done.
pkg/sql/opt/invertedidx/json_array.go, line 313 at r6 (raw file):
Previously, mgartner (Marcus Gartner) wrote…
What's the upgrade story for users to new inverted index versions?
If we took out the current
Scan.Constraint
-based inverted index scanning now, when a user upgrades to 21.1 some of their queries could go from using inverted indexes to full table scans until they upgraded their inverted index versions.
I had a TODO for this below, but it was an easy fix so I've just done it now. Added a few tests too.
pkg/sql/opt/invertedidx/json_array_test.go, line 327 at r6 (raw file):
Previously, mgartner (Marcus Gartner) wrote…
Hmmm.. Why not?
Added a comment.
pkg/sql/opt/xform/select_funcs.go, line 857 at r6 (raw file):
The one case where you can have two spans and unique=true
is when the right hand side of the contains expression is a scalar. There is a pretty extensive comment inside encodeContainingInvertedIndexSpansFromLeaf
that explains that case, but let me know if you think I should also add an example here or elsewhere.
But I think you make a good point that we may be able to make use of the unique boolean in cases where we will need to intersect the results anyway, which will allow us to use this code for zig-zag joins as suggested below. I think this is going to require more thought, though, so I'd rather save that for a future PR if that's ok with you.
I was expecting some unification of the code used in GenerateInvertedIndexZigzagJoins...
I plan to do that in a follow-on PR.
... these recent PRs have not been changing indexConstraintCtx.makeInvertedIndexSpansForExpr. Is that code still correct ...
indexConstraintCtx.makeInvertedIndexSpansForExpr
seems to be correct because it explicitly avoids using the index if an array is empty or has nulls. But I definitely plan to get rid of that code as soon as possible once GenerateInvertedIndexZigzagJoins
uses the new code.
We would have something like: ... Hmm, we could tweak SpanExpression to operate in a mode ...
This is a good idea, but I think it can wait till the follow-on PR.
pkg/sql/opt/xform/select_funcs.go, line 866 at r6 (raw file):
Previously, mgartner (Marcus Gartner) wrote…
[nit] => "We scan the PK columns if needed". You could also move the initialization of
needInvertedFilter
here and combine the comments.
Done.
pkg/sql/rowenc/index_encoding.go, line 849 at r3 (raw file):
Previously, sumeerbhola wrote…
I am having trouble interpreting this statement. If the different
roachpb.Spans
involved in the intersection produce different sets of keys, the intersection will be empty.
Then I thought this meant that the the different spans in eachroachpb.Spans
would be non-overlapping wrt primary keys (which typically means there will be only one span in eachroachpb.Spans
). And maybe the purpose here was to see when zig-zag joins were viable. But that would also require saying something more than unique -- that eachroachpb.Spans
produces keys in sorted order. I suppose uniqueness and sorted are equivalent given our encoding.
I think this needs a longer comment since[]roachpb.Spans
represents an expression and the existing statement is hard to relate to that.
I added a comment -- let me know if this clarifies it. The spans aren't guaranteed to be sorted (it's possible that they are in fact sorted, but it doesn't really matter, since we'll sort them anyway as part of constructing the SpanExpression
).
pkg/sql/rowenc/index_encoding.go, line 939 at r6 (raw file):
Previously, mgartner (Marcus Gartner) wrote…
[nit] add a comment that reminds the reader that more than 1 key means that there could be duplicate primary keys in an inverted index.
Done.
pkg/sql/rowenc/index_encoding_test.go, line 436 at r6 (raw file):
Previously, mgartner (Marcus Gartner) wrote…
It's not obvious to me why the spans for
{2, NULL}
would guarantee non-duplicate primary keys (unique=true). For an array with the value[2, NULL]
, wouldn't there be 2 entries in the inverted index for that row?
Any array that contains a null must return empty spans, which are trivially unique. Added a comment.
pkg/util/json/json.go, line 1069 at r3 (raw file):
Previously, sumeerbhola wrote…
Can you add a comment like
// The span that will be generated for this key can have duplicate PKs, so unique=false.
Done.
pkg/util/json/json.go, line 1095 at r3 (raw file):
Previously, sumeerbhola wrote…
same comment would be useful here.
Done.
pkg/util/json/json.go, line 1115 at r3 (raw file):
... no PK duplicates ...
Done.
The span for the array will contain duplicate PKs inside it.
Not sure I understand... can you give an example?
This commit adds a result parameter for EncodeContainingInvertedIndexSpans called unique. The function returns unique=true if the spans are guaranteed not to produce duplicate primary keys. Otherwise, returns unique=false. The logic is based on the fact that spans for a given JSON or Array datum will not produce duplicate primary keys if there is exactly one JSON or Array path ending in a scalar (i.e., not an empty object or array). Release note: None
714bd5a
to
4b95dcf
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good! some minor comments.
I remembered that there are no microbenchmarks for invertedJoiner
and invertedFilterer
-- there are TODOs in the test files.
I did not see any obvious costly paths when looking at whole node profiles for geospatial queries, but that may be because the number of spans in those queries were causing other costs to be high. Now that we are going to exercise that code with fewer spans, I think it is worthwhile for someone to add microbenchmarks specifically for array and json and see if there are any optimizations possible there. I doubt I will have the time for it in the next month, so perhaps you or someone from sql execution would like to pick that up.
Reviewed 4 of 30 files at r7, 25 of 29 files at r8.
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @mgartner, @RaduBerinde, @rytaft, and @sumeerbhola)
pkg/sql/logictest/testdata/logic_test/inverted_filter_json_array_dist, line 63 at r8 (raw file):
statement ok ALTER TABLE json_tab EXPERIMENTAL_RELOCATE VALUES (ARRAY[1], 1), (ARRAY[2], 10), (ARRAY[3], 20)
I was expecting split and relocate on the inverted index. What is the purpose of distributing the table?
Though I think we can't do the split and relocate on the index for json since these are not datums. And I think the similar thing in the geospatial tests is now broken for the same reason (since it is no longer using an int datum). I wonder if we should try to extend the EXPERIMENTAL_RELOCATE
to handle encoded values, though it seems hard to maintain. Another possibility would be to rewrite these tests with multi-column inverted indexes and use the fact that the first column there will be a datum.
pkg/sql/logictest/testdata/logic_test/inverted_filter_json_array_dist, line 130 at r8 (raw file):
ORDER BY a] ---- https://cockroachdb.github.io/distsqlplan/decode.html#eJyUklFr2zAUhd_3Ky73pQnTiOUwBnpKu7gsI4szO7CN2BQluu08PMmT5BII-e_DdkPqQbL2xVj33O_oHNAe3Z8SBabRPPq4gtqWcJvEX2AdfV_Or2cLGExn6Sr9Oh_C04rsFn45o--83MC3T1ESwWCwgUlWB8GY4GrN8yshPqfx4mYIcQI9LTxpQ7heTPvo-CyY4UbaDJ_BcTKNErj5ATJHhtooWsjf5FCskWPOsLJmS84Z24z27cJM7VAEDAtd1b4Z5wy3xhKKPfrCl4QCV3JTUkJSkR0FyFCRl0XZ2h5LT-6NuSv0IzJMK6mdgNFTvHfHn9HS0n2xi7QCqRW8B-N_knWYHxia2p-ud14-EAp-YC-PONOPZD2p26L0ZMmOeD_nUY92lQWjYcIFuCYoOC-tFxl-yLIgCJoPz7IgDJvMpwkCafWfrTBD6FdjGNdewISfLRm-pmRqrCc7CvvVJvztWfvxa-wTcpXRjnr255yDQ86Q1AN1T8mZ2m5pac22vaY7xi3XDhQ536m8O8x0JzUBn8P8IhxehsOL8PgfOD-8-RsAAP__85I68g==
Nice that many of these don't need a JoinReader.
pkg/sql/logictest/testdata/logic_test/inverted_filter_json_array_dist, line 191 at r8 (raw file):
query T SELECT url FROM [EXPLAIN (DISTSQL) SELECT a FROM array_tab WHERE b @> '{1}' OR a = 1 ORDER BY a]
will this unioning of (a) scan on the primary index and (b) inverted index+join-reader, also happen if (b) needed an inverted filterer?
pkg/sql/logictest/testdata/logic_test/inverted_join_json_array_dist, line 93 at r8 (raw file):
ORDER BY j1.a, j2.a] ---- https://cockroachdb.github.io/distsqlplan/decode.html#eJzUk01v2kwUhffvr7i67yIQJthjQz5mNUlxJUfUTm1UtUpQZPA0MiUed2xHqRD_vbIhH47CAItK7ZLr-8w554qzwPznHBmGztD5MIJSzeFj4H-Ca-fr1fDc9aA1cMNR-HnYhvXKjHYPCcys7uFqc5bL9LaIJnAewswC1_OcAFzvixOMnAFc-q7XXKHge9BqzWh3AvymNE1bVK9N2nDuDaA5P1jcYHSDDBbL5UH7acPqRlAvTMEy2-AHAyeAi2-Vs6h2Fo2RYCpj4UX3Ikd2jRQJWjgmmCk5FXkuVTVe1Etu_IjMJJikWVlU4zHBqVQC2QKLpJgLZDiKJnMRiCgWyjCRYCyKKJnXTz-F45lK7iP1CwmGWZTmDI4MauJ4SVCWxcvTeRHdCWR0SXaXd9MHoQoRX8okFcqwNjj4LuVtkj4geQacx0xBi_eeL8qt9RVfzRpXZoxdhr530UaCflkw4JRwi3B7YxJrnyRVgvUde1vvOJTyR5nBTCYpyJQBtytTHrR4_51A_T0DEd7fmMneJ1MoVSGUcdzMw-0O4bTzrGoT3iO19kbV3kbVFzGpYqFE_EaLdgi3OzhevmPPk0cyM04bxCYH_YYDunsp6K6lMKh5ZNAz4_8dm7HFw5tm2H9xM7YkedWM_j_TjC2Z1s04-YPNeEc1EHkm01zs9I83q8qI-E6sapbLUk3FlZLTWmb106-5ehCLvFh9pasfbrr6VBl8DVMtbOlhSwvbetjWwr0GTN_CPS18plfua-FjPXyshU_08IkWPtXDp3sdbLz873cAAAD__7iL4aw=
what is causing this to change?
and more generally, there are many changes to the inverted_join
testdata files, which I am unclear about -- I thought this PR only touched inverted filter.
pkg/sql/opt/exec/execbuilder/testdata/distsql_inverted_index, line 126 at r8 (raw file):
query T SELECT url FROM [EXPLAIN (DISTSQL) SELECT a FROM array_tab@foo_inv WHERE b @> '{1, 2}' ORDER BY a]
Same question about doing the relocate above on the table instead of the index. The plan here is not distributed.
pkg/sql/opt/xform/select_funcs.go, line 857 at r6 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
The one case where you can have two spans and
unique=true
is when the right hand side of the contains expression is a scalar. There is a pretty extensive comment insideencodeContainingInvertedIndexSpansFromLeaf
that explains that case, but let me know if you think I should also add an example here or elsewhere.But I think you make a good point that we may be able to make use of the unique boolean in cases where we will need to intersect the results anyway, which will allow us to use this code for zig-zag joins as suggested below. I think this is going to require more thought, though, so I'd rather save that for a future PR if that's ok with you.
I was expecting some unification of the code used in GenerateInvertedIndexZigzagJoins...
I plan to do that in a follow-on PR.
... these recent PRs have not been changing indexConstraintCtx.makeInvertedIndexSpansForExpr. Is that code still correct ...
indexConstraintCtx.makeInvertedIndexSpansForExpr
seems to be correct because it explicitly avoids using the index if an array is empty or has nulls. But I definitely plan to get rid of that code as soon as possible onceGenerateInvertedIndexZigzagJoins
uses the new code.We would have something like: ... Hmm, we could tweak SpanExpression to operate in a mode ...
This is a good idea, but I think it can wait till the follow-on PR.
Follow-on PR sounds good.
pkg/sql/rowenc/index_encoding.go, line 855 at r8 (raw file):
// allows the optimizer to remove the UNION altogether (implemented // with the invertedFilterer), and simply return the results of the constrained // scan.
Can you add that unique is also set to false when length of spans > 1.
pkg/util/json/json.go, line 1115 at r3 (raw file):
Not sure I understand... can you give an example?
My statement was incorrect. The code is fine.
This commit adds support for planning and executing an inverted index scan + invertedFilter when there is a contains (@>) predicate on a JSON or array column that has an inverted index. Similar to spatial columns, the optimizer may be able to plan an inverted index scan + invertedFilter for JSON and Array columns even if the filter condition is complex, with multiple predicates combined with AND and OR expressions. Informs cockroachdb#47340 Release note (performance improvement): The optimizer now supports using an inverted index on JSON or Array columns for a wider variety of filter predicates. Previously, inverted index usage was only supported for simple predicates (e.g., `WHERE a @> '{"foo": "bar"}'), but now more complicated predicates are supported by combining simple contains (@>) expressions with AND and OR (e.g., `WHERE a @> '{"foo": "bar"}' OR a @> '{"foo": "baz"}'`). An inverted index will be used if it is available on the filtered column and the optimizer expects it to be more efficient than any alternative plan. This may result in performance improvements for queries involving JSON and Array columns.
4b95dcf
to
3d1e994
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TFTR!
... I think it is worthwhile for someone to add microbenchmarks ...
Opened #57101
Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @mgartner, @RaduBerinde, and @sumeerbhola)
pkg/sql/logictest/testdata/logic_test/inverted_filter_json_array_dist, line 63 at r8 (raw file):
Previously, sumeerbhola wrote…
I was expecting split and relocate on the inverted index. What is the purpose of distributing the table?
Though I think we can't do the split and relocate on the index for json since these are not datums. And I think the similar thing in the geospatial tests is now broken for the same reason (since it is no longer using an int datum). I wonder if we should try to extend the
EXPERIMENTAL_RELOCATE
to handle encoded values, though it seems hard to maintain. Another possibility would be to rewrite these tests with multi-column inverted indexes and use the fact that the first column there will be a datum.
I rewrote the tests using a multi-column inverted index.
pkg/sql/logictest/testdata/logic_test/inverted_filter_json_array_dist, line 130 at r8 (raw file):
Previously, sumeerbhola wrote…
Nice that many of these don't need a JoinReader.
Agreed!
pkg/sql/logictest/testdata/logic_test/inverted_filter_json_array_dist, line 191 at r8 (raw file):
Previously, sumeerbhola wrote…
will this unioning of (a) scan on the primary index and (b) inverted index+join-reader, also happen if (b) needed an inverted filterer?
Yes it will -- Added another test case for that.
pkg/sql/logictest/testdata/logic_test/inverted_join_json_array_dist, line 93 at r8 (raw file):
Previously, sumeerbhola wrote…
what is causing this to change?
and more generally, there are many changes to the
inverted_join
testdata files, which I am unclear about -- I thought this PR only touched inverted filter.
Now that we support inverted filtering for JSON, the optimizer plans an inverted filter for this query with a non-inverted join. In order to force this query to use an inverted join I added the INVERTED JOIN
hint, but that also changed the order of the tables.
The other two queries in this file also changed because the optimizer began planning an inverted filter instead of an inverted join. We can't yet force semi and anti joins to use a particular join algorithm, so instead I changed the query so that the inverted join was chosen.
pkg/sql/opt/exec/execbuilder/testdata/distsql_inverted_index, line 126 at r8 (raw file):
Previously, sumeerbhola wrote…
Same question about doing the relocate above on the table instead of the index. The plan here is not distributed.
Zigzag joins aren't supported yet with multicolumn indexes. I left a TODO in this file to fix it once they are supported.
pkg/sql/opt/xform/select_funcs.go, line 857 at r6 (raw file):
Previously, sumeerbhola wrote…
Follow-on PR sounds good.
Ack.
pkg/sql/rowenc/index_encoding.go, line 855 at r8 (raw file):
Previously, sumeerbhola wrote…
Can you add that unique is also set to false when length of spans > 1.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 4 of 29 files at r8, 4 of 4 files at r9.
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @mgartner, @RaduBerinde, @rytaft, and @sumeerbhola)
pkg/sql/logictest/testdata/logic_test/inverted_filter_json_array_dist, line 191 at r8 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
Yes it will -- Added another test case for that.
nice!
pkg/sql/logictest/testdata/logic_test/inverted_join_json_array_dist, line 93 at r8 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
Now that we support inverted filtering for JSON, the optimizer plans an inverted filter for this query with a non-inverted join. In order to force this query to use an inverted join I added the
INVERTED JOIN
hint, but that also changed the order of the tables.The other two queries in this file also changed because the optimizer began planning an inverted filter instead of an inverted join. We can't yet force semi and anti joins to use a particular join algorithm, so instead I changed the query so that the inverted join was chosen.
Is this because the heuristic cost we assign to an inverted join using j1.b @> j2.b
is greater than the estimated cost for doing j1.b @> '{"a": {}}'
and j2.a < 20
on each side and then doing a hash join?
What is in the plan for 21.1 regarding cost estimation for inverted joins?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
bors r+
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @mgartner, @RaduBerinde, @rytaft, and @sumeerbhola)
pkg/sql/logictest/testdata/logic_test/inverted_join_json_array_dist, line 93 at r8 (raw file):
Previously, sumeerbhola wrote…
Is this because the heuristic cost we assign to an inverted join using
j1.b @> j2.b
is greater than the estimated cost for doingj1.b @> '{"a": {}}'
andj2.a < 20
on each side and then doing a hash join?
What is in the plan for 21.1 regarding cost estimation for inverted joins?
Yea, that's exactly right -- without the hint, the optimizer chooses a hash join with the two sides constrained.
The stats/cost estimation for inverted indexes need a lot of improvement in general. I think improving stats for inverted index scan, inverted filter and inverted join all need to be a priority for this release. Using histograms to estimate the cardinality of joins will be part of the solution for improving inverted joins (right now we just use distinct counts).
Build succeeded: |
sql,util: return
unique
boolean fromEncodeContainingInvertedIndexSpans
This commit adds a result parameter for
EncodeContainingInvertedIndexSpans
called
unique
. The function returnsunique=true
if the spans areguaranteed not to produce duplicate primary keys. Otherwise, returns
unique=false
. The logic is based on the fact that spans for a given JSONor Array datum will not produce duplicate primary keys if there is exactly one
JSON or Array path ending in a scalar (i.e., not an empty object or array).
sql,opt: add support for inverted filter with JSON and Array columns
This commit adds support for planning and executing an inverted index
scan + invertedFilter when there is a contains (
@>
) predicate on a JSON orarray column that has an inverted index. Similar to spatial columns, the
optimizer may be able to plan an inverted index scan + invertedFilter
for JSON and Array columns even if the filter condition is complex, with
multiple predicates combined with AND and OR expressions.
Informs #47340
Release note (performance improvement): The optimizer now supports using
an inverted index on JSON or Array columns for a wider variety of filter
predicates. Previously, inverted index usage was only supported for simple
predicates (e.g.,
WHERE a @> '{"foo": "bar"}'
), but now more complicatedpredicates are supported by combining simple contains (@>) expressions with
AND and OR (e.g.,
WHERE a @> '{"foo": "bar"}' OR a @> '{"foo": "baz"}'
).An inverted index will be used if it is available on the filtered column and
the optimizer expects it to be more efficient than any alternative plan.
This may result in performance improvements for queries involving JSON and
Array columns.