New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
stats: use value encoding for upper bounds #117714
stats: use value encoding for upper bounds #117714
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice job fixing this!
Is there a unit test that would have caught this? Maybe a test that creates a histogram with a single bucket of a random type, and then checks that we can round-trip the histogram?
Reviewed 19 of 19 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @rytaft and @yuzefovich)
-- commits
line 22 at r1:
Actually, we do collect histograms on JSON and arrays. For example:
demo@127.0.0.1:26257/demoapp/defaultdb> CREATE TABLE ijk (i INT PRIMARY KEY, j JSON, k STRING[], INDEX (j), INDEX (k));
CREATE TABLE
Time: 3ms total (execution 3ms / network 0ms)
demo@127.0.0.1:26257/demoapp/defaultdb> INSERT INTO ijk VALUES (1, '{"a": 3.25, "b": 3.50}', ARRAY['x', 'y', 'z']), (2, '{"a": 4.75, "b": 5.50}', ARRAY['m', 'n', 'o']);
INSERT 0 2
Time: 2ms total (execution 1ms / network 0ms)
demo@127.0.0.1:26257/demoapp/defaultdb> ANALYZE ijk;
ANALYZE
Time: 40ms total (execution 39ms / network 0ms)
demo@127.0.0.1:26257/demoapp/defaultdb> SHOW STATISTICS FOR TABLE ijk;
statistics_name | column_names | created | row_count | distinct_count | null_count | avg_size | partial_predicate | histogram_id | full_histogram_id
------------------+--------------+-------------------------------+-----------+----------------+------------+----------+-------------------+--------------------+--------------------
__auto__ | {i} | 2024-01-16 23:31:13.296228+00 | 0 | 0 | 0 | 0 | NULL | 935124904503312385 | 0
__auto__ | {j} | 2024-01-16 23:31:13.296228+00 | 0 | 0 | 0 | 0 | NULL | 935124904510324737 | 0
__auto__ | {k} | 2024-01-16 23:31:13.296228+00 | 0 | 0 | 0 | 0 | NULL | 935124904517369857 | 0
NULL | {i} | 2024-01-16 23:32:10.872031+00 | 2 | 2 | 0 | 1 | NULL | 935125093168578561 | 0
NULL | {j} | 2024-01-16 23:32:10.872031+00 | 2 | 2 | 0 | 35 | NULL | 935125093175918593 | 0
NULL | {k} | 2024-01-16 23:32:10.872031+00 | 2 | 2 | 0 | 11 | NULL | 935125093182996481 | 0
(6 rows)
Time: 4ms total (execution 3ms / network 0ms)
demo@127.0.0.1:26257/demoapp/defaultdb> SHOW HISTOGRAM 935125093175918593;
upper_bound | range_rows | distinct_range_rows | equal_rows
--------------------------+------------+---------------------+-------------
'{"a": 3.25, "b": 3.5}' | 0 | 0 | 1
'{"a": 4.75, "b": 5.5}' | 0 | 0 | 1
(2 rows)
Time: 2ms total (execution 2ms / network 0ms)
demo@127.0.0.1:26257/demoapp/defaultdb> SHOW HISTOGRAM 935125093182996481;
upper_bound | range_rows | distinct_range_rows | equal_rows
---------------------+------------+---------------------+-------------
ARRAY['m','n','o'] | 0 | 0 | 1
ARRAY['x','y','z'] | 0 | 0 | 1
(2 rows)
Time: 3ms total (execution 2ms / network 0ms)
Does this mean we need to use value encodings for other types?
Would it be safer to use value encodings for all types, in case we add another composite type one day?
pkg/sql/show_histogram.go
line 86 at r1 (raw file):
if err != nil { v.Close(ctx) return nil, err
Instead of returning the error, could we print it to the string in a similar way to EncDatum.stringWithAlloc
?
pkg/sql/stats/histogram.go
line 684 at r1 (raw file):
// cannot yet use the newest histogram version to preserve // backwards-compatibility. version = HistVersion - 1
suggestion: This works, but it requires us to remember to change this -1 if we add another histogram version during the time when we could still upgrade from 23.2.
It might be a little safer to ignore HistVersion
(or get rid of HistVersion
) and instead structure this as "positive" checks instead of "negative" checks like:
// 22.2 - 23.2
version := HistogramVersion(2)
if st.Version.IsActive(ctx, clusterversion.V24_1) {
version = collatedStringsValueEncodedVersion
}
// and then future versions can add if statements like:
// if st.Version.IsActive(ctx, clusterversion.V24_2) {
// version = HistogramVersion(4)
// }
This is a little more robust if we add another histogram version during 24.1 development (or 24.2 development and need to support upgrades from 23.2 to 24.2).
And it might even be worth getting rid of HistVersion
and instead creating fixed constants for HistogramVersion(0)
, 1, and 2, similar to collatedStringsValueEncodedVersion
that aren't expected to change like HistVersion
did. This would better match what we do in clusterversion
.
pkg/sql/opt/exec/execbuilder/testdata/collated_strings
line 34 at r1 (raw file):
│ │ <--- 1 --- 2 --- 3 --- 4 --- 5 --- 6 --- 7 --- 8 --- 9 --- 10 --- 11 --- 12 --- 13 │ │ histogram(2)= 0 1 10 2 │ │ <--- 'foo' COLLATE en_US_u_ks_level2 ---- 'world' COLLATE en_US_u_ks_level2
Nice! 🙂
4d77dd6
to
1102e76
Compare
5bfca24
to
3cb7cee
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the careful review!
Is there a unit test that would have caught this? Maybe a test that creates a histogram with a single bucket of a random type, and then checks that we can round-trip the histogram?
That's a good idea, done. The newly-added test would have caught the problem for collated strings.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @michae2 and @rytaft)
-- commits
line 22 at r1:
Hm, that's a good point - I saw a comment along these lines in deepCopyDatum
(but it must be stale) and didn't check - thanks for checking! (I think that function is redundant at this point and actually can be removed. Did that in a separate commit.)
Would it be safer to use value encodings for all types, in case we add another composite type one day?
This sounds like a good idea.
Possible disadvantage of doing this, though, is the fact that we collect histograms mostly on the indexed columns, meaning that if we were to scan the index to get datums, they would come with key-encoding present, which means that storing the value encoding would require additional step to go from one encoding to the other.
However, at least right now (and probably for the foreseeable future) we perform the scan of the primary index for stats collection purposes, so what we get is key-encoding for columns in the primary key and value-encoding columns for everything else. This observation actually makes me think that we do have value-encoding for majority of columns, so switching to always use value-encoding would be an improvement to stats collection overall (in addition to being bullet-proof from compositeness). Done.
pkg/sql/show_histogram.go
line 86 at r1 (raw file):
Previously, michae2 (Michael Erickson) wrote…
Instead of returning the error, could we print it to the string in a similar way to
EncDatum.stringWithAlloc
?
Good point, done.
pkg/sql/stats/histogram.go
line 684 at r1 (raw file):
Previously, michae2 (Michael Erickson) wrote…
suggestion: This works, but it requires us to remember to change this -1 if we add another histogram version during the time when we could still upgrade from 23.2.
It might be a little safer to ignore
HistVersion
(or get rid ofHistVersion
) and instead structure this as "positive" checks instead of "negative" checks like:// 22.2 - 23.2 version := HistogramVersion(2) if st.Version.IsActive(ctx, clusterversion.V24_1) { version = collatedStringsValueEncodedVersion } // and then future versions can add if statements like: // if st.Version.IsActive(ctx, clusterversion.V24_2) { // version = HistogramVersion(4) // }
This is a little more robust if we add another histogram version during 24.1 development (or 24.2 development and need to support upgrades from 23.2 to 24.2).
And it might even be worth getting rid of
HistVersion
and instead creating fixed constants forHistogramVersion(0)
, 1, and 2, similar tocollatedStringsValueEncodedVersion
that aren't expected to change likeHistVersion
did. This would better match what we do inclusterversion
.
That's a good point. I like the idea of introducing named constants for each histogram version that we need to gate the code on, done. I decided to not remove HistVersion
still so that places outside of histogram creation code could still easily create the histogram of the latest version (currently there is only one caller though randHistogram
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the test! This is great!
Hmm, looks like some other tests are unhappy with that value encoding for all types... hopefully nothing serious.
Reviewed 10 of 10 files at r2, 1 of 1 files at r3, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @rytaft and @yuzefovich)
pkg/sql/stats/row_sampling.go
line 268 at r3 (raw file):
afterSize = dst[i].Size() } else { dst[i].Datum = deepCopyDatum(dst[i].Datum)
Nice!!
3cb7cee
to
0fbb2b0
Compare
Currently, whenever we're encoding the upper bounds of the histogram buckets we always use the datum's key-encoding. However, some datums are composite, and we need to operate on their value encoding. This strategy happens to be ok for floats and decimals because their value part only stores some auxiliary information (like the number of trailing zeroes and the sign of zero), so even if we lose that auxiliary information, when decoding the upper bound we end up with a datum that has the same comparison properties. Unfortunately, this is not the case for collated strings where key-encoding and value-encoding can be vastly different. So the fact that we always use the key-encoding for collated strings makes it so that when we key-decode upper bounds, we end up with garbage datums. As a result, our histograms are meaningless for collated strings. This problem is now fixed by always using the value encoding. As mentioned above, this is required for correctness for some types (in addition to collated strings probably also arrays of collated strings), but it also seems beneficial overall. We're getting the upper bound datums when scanning the primary index, and there we get key-encoded datums from indexed and value-encoded datums non-indexed columns, so probably we should be getting more value-encoded datums already. As a result, we should need to do the change in encoding on less amount of data now. In order to make this change backwards-compatible we gate it based on the new histogram version - if the histogram has the previous version, then it must have used key-encoding for upper bounds, so we will keep on using key-decoding as well. Only once the cluster upgrades to 24.1 version do we start creating the histograms with the new version which will use value-encoding for upper bound datums. Release note (bug fix): CockroachDB now correctly uses the histograms on columns of COLLATED STRING type. The bug has been present since pre-22.1 release.
This commit removes the deep copy of strings and collated strings that was introduced in 2d55ca3 in order to allow for the backing KV batch to be GCed. This was necessary at the time because when key-decoding these datums we would hold onto the encoded byte slice. However, since then (in 4b2f115) we changed that to no longer just use the memory from KV batch directly and always perform the deep copy during decoding. Doing the deep copy in the stats sampling code is, thus, no longer needed but also is at the wrong layer - in case we need to perform the deep copy for some datum (and at the moment I think we do all necessary copies already), that should be done during the decoding phase. Release note: None
0fbb2b0
to
f86b532
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The failures were minor - I forgot to update a few places where we still used key-encoding to generate histograms.
Also, as it turns out, we were pretty close to noticing this bug earlier, but I chose to ignore itnot dig into it by skipping collated strings for histogram generation in 1902d3d :/
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @michae2 and @rytaft)
Build is green, TFTR! bors r+ |
Build succeeded: |
Currently, whenever we're encoding the upper bounds of the histogram buckets we always use the datum's key-encoding. However, some datums are composite, and we need to operate on their value encoding. This strategy happens to be ok for floats and decimals because their value part only stores some auxiliary information (like the number of trailing zeroes and the sign of zero), so even if we lose that auxiliary information, when decoding the upper bound we end up with a datum that has the same comparison properties.
Unfortunately, this is not the case for collated strings where key-encoding and value-encoding can be vastly different. So the fact that we always use the key-encoding for collated strings makes it so that when we key-decode upper bounds, we end up with garbage datums. As a result, our histograms are meaningless for collated strings.
This problem is now fixed by always using the value encoding. As mentioned above, this is required for correctness for some types (in addition to collated strings probably also arrays of collated strings), but it also seems beneficial overall. We're getting the upper bound datums when scanning the primary index, and there we get key-encoded datums from indexed and value-encoded datums non-indexed columns, so probably we should be getting more value-encoded datums already. As a result, we should need to do the change in encoding on less amount of data now.
In order to make this change backwards-compatible we gate it based on the new histogram version - if the histogram has the previous version, then it must have used key-encoding for upper bounds, so we will keep on using key-decoding as well. Only once the cluster upgrades to 24.1 version do we start creating the histograms with the new version which will use value-encoding for upper bound datums.
Fixes: #98400.
Release note (bug fix): CockroachDB now correctly uses the histograms on columns of COLLATED STRING type. The bug has been present since pre-22.1 release.