-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sql: add unordered unique row id as builtin and serial_normalization option #70338
sql: add unordered unique row id as builtin and serial_normalization option #70338
Conversation
78eb4a6
to
fae6d9d
Compare
Fixed CI fail due to go test broken in bazel; reduced number of row insertions in TestSerialNormalizationWithUniqueUnorderedID to 1000 to withstand stress tests. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the late review! This is close.
Reviewed 1 of 1 files at r1, 4 of 8 files at r2, 2 of 2 files at r3, 1 of 1 files at r4.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @fqazi and @jameswsj10)
pkg/sql/sem/builtins/builtins.go, line 7652 at r4 (raw file):
Quoted 4 lines of code…
// ts is [16 bits of 0][48 bits of ts] ts := (orig & ((uint64(math.MaxUint64) >> 16) << 15)) >> 15 // v is [0][48 bits of reversed ts][15 bits of node id] v := (bits.Reverse64(ts) >> 1) | (orig & (1<<15 - 1))
this might feel silly but it will make me feel a lot better, can you pull just the middle two lines (and the comment on the top line) into a little helper and then, in a unit test, in a for loop, generate random uint64 values, do this little bit manipulation dance, and then confirm that bits.OnesCount64
is the same for the input and output. It'll make me feel better that we aren't clobbering any bits.
pkg/sql/sem/builtins/builtins_test.go, line 109 at r4 (raw file):
fmt.Println("Key counts in each split range") for i, keyCount := range keyCounts { fmt.Printf("range %d: %d\n", i, keyCount)
this isn't actually checking anything, it's just printing things. Can we update this test to make some assertion?
7acc850
to
e1432af
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @fqazi, and @jameswsj10)
pkg/sql/sem/builtins/builtins_test.go, line 73 at r5 (raw file):
} // TestMapToUniqueUnorderedID verifies that the mapping preserves the ones count
nit: period.
pkg/sql/sem/builtins/builtins_test.go, line 84 at r5 (raw file):
outputOnesCount := bits.OnesCount64(output) if inputOnesCount != outputOnesCount { t.Errorf("input: %b, output %b\nExpected: %d, got: %d", randInput, output, inputOnesCount, outputOnesCount)
nit: wrap
nit: require.Equalf
pkg/sql/sem/builtins/builtins_test.go, line 111 at r5 (raw file):
// testing configurations const numberOfRows = 1000
for the number of rows, if you wanted, you could do something like:
numberOfRows := 10000
if util.RaceEnabled {
numberOfRows = 100
}
Please comment briefly that you want to use a relatively small number of rows because inserting them under race is slow.
pkg/sql/sem/builtins/builtins_test.go, line 115 at r5 (raw file):
// enforce range splits to collect range statistics after row insertions // 3 bits worth of splits in the high order
nit: comment should be sentences
pkg/sql/sem/builtins/builtins_test.go, line 128 at r5 (raw file):
"='t'::regclass;").Scan(&keyCounts) // Calculate mean and variance using NumericStat
let's get a little bit mathier here. What we want to know is whether this distribution over the ranges is not uniform. A nice tool for that is chi-squared. Fortunately for this sort of thing it's really easy to compute. The null hypothesis distribution says that each range should have the same probability. We want to see if we can reject that at a pretty high p-value. Maybe .00001 or something like that*. So, here's what I think this looks like. We need to pick our failing statistic value based on the p value we want. If we look at https://www.fourmilab.ch/rpkp/experiments/analysis/chiCalc.html, a very conservative choice seems like 19.5114
which gives us a p value of .00001
that things really are uniform and we saw what we saw.
From then, we just need to compute the stat. Consider:
func discreteUniformChiSquared(counts []int64) float64 {
var n int64
for _, c := range counts {
n += c
}
p := float64(1) / float64(len(counts))
var stat float64
for _, c := range counts {
oSubE := float64(c)/float64(n) - p
stat += (oSubE * oSubE) / p
}
stat *= float64(n)
return stat
}
For the failing example of []int64{127, 136, 107, 118, 108, 142, 127, 135}
we get the value 9.44
which is below our threshold. If we were to, say, move the counts from the last bucket to the first one, we'd get []int64{262, 136, 107, 118, 108, 142, 127, 0}
which gives us 283.76
as our stat, which seems good because that's not very uniform.
pkg/sql/sem/builtins/builtins_test.go, line 137 at r5 (raw file):
mean := stats.Mean variance := stats.GetVariance(count) fmt.Printf("Mean: %f Variance: %f\n", mean, variance)
if you want to log something from the test, use t.Logf
e1432af
to
130d3e3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Name the builtin and serial_normalization
value in the release note for the docs team.
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @fqazi, and @jameswsj10)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, wait, actually, can you rework the git history to be a bit more sane here? Namely, remove the [dnm]
and put the release note about the new builtin in the first commit and then remove the new builtin from the second release note?
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @fqazi, and @jameswsj10)
130d3e3
to
7b971bb
Compare
Resolved.
Resolved.
Done.
Resolved. Double-checked comments to adhere to engineering guidelines.
I've replaced the existing method of checking variance not exceeding a hard-coded threshold to the chi-squared goodness of fit test against the uniform distribution null hypothesis. Let me know if the comments are descriptive enough!
Removed all fmt.Prints and logged only the range distributions. Let me know if you'd like me to remove this log as well.
Modified commit descriptions and release notes for both commits. Thanks for the feedback! 👍 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 8 files at r2, 2 of 3 files at r5, 1 of 1 files at r6, all commit messages.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on @ajwerner and @fqazi)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r1, 9 of 9 files at r7.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on @fqazi and @jameswsj10)
-- commits, line 5 at r7:
probably should say hot spot was a side effect of the distribution somehow
-- commits, line 7 at r7:
May be a bit clearer and say that a new builtin without ordering avoids that problem.
Previously, people were complaining about not wanting to rekey databases that assume a 64-bit key. Our default for that use-case incurred a write hot-spot which scared off customers. To address this, this solution has no ordering to avoid hot-spots while guaranteeing the same uniqueness properties. Previously, people were complaining about not wanting to rekey databases that assume a 64-bit key. Our default rowids were generated so that they would be globally unique and always increasing (Note: not exactly serial). This distribution incurred hot spots which scared off customers. To address this, the new builtin guarantees unordered ids with a nearly uniform distribution while maintaining the global uniqueness. This helps avoid hot spots mentioned above. Release note (sql change) : Added a new sql builtin 'unordered_unique_rowid' which generates a globally unique 64-bit integer that does not have ordering.
See cockroachdb#54675. Previously, there were only ordered unique row ids generated which incurs write hot-spots. The introduction of unordered unique row ids enable distributed row ids to avoid hotspots while maintaining the 64-bit integer data type. Basic testing proving distribution of row insertions with unique unordered row id as PK provided. Release note (sql change) : Added a new serial_normalization case 'unordered_rowid' which generates a globally unique 64-bit integer that does not have ordering.
7b971bb
to
73aa411
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @fqazi)
Previously, fqazi (Faizan Qazi) wrote…
probably should say hot spot was a side effect of the distribution somehow
Done.
Previously, fqazi (Faizan Qazi) wrote…
May be a bit clearer and say that a new builtin without ordering avoids that problem.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 9 of 9 files at r9.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on @fqazi)
bors r+ |
Build succeeded: |
73004: sql: minor assignment cast fixes r=mgartner a=mgartner #### tree: add missing casts to and from UUIDs Automatic IO conversion casts are possible from the UUID type to string types, and from string types to the UUID type. This commit adds these missing casts to `castMap` which lists all valid casts. Release note: None #### sql: allow cast from a type T to T in all contexts This commit updates `tree.ValidCast` so that it returns true when the given types are identical for any cast context. Casts from a type without modifiers (like width modifiers) to the same type have been removed from the cast map because they are no longer necessary. Release note: None #### sql: remove UNKNOWN from castMap The UNKNOWN type, which is given to expressions that statically evaluate to NULL, can be cast to any type. Codifying all possible casts from with an UNKNOWN source type in the `castMap` is cumbersome and it is easy to accidentally omit some target types. This commit removes the UNKNOWN casts from the `castMap` and adds logic in `ValidCast` that allows UNKNOWN types to be cast to any other type in all contexts. Release note: None 73101: roachpb: add Spans.String r=yuzefovich a=yuzefovich Release note: None 73264: sql: reorder serial_normalization enum r=ajwerner a=rafiss The commit that added this changed the order of this enum, which is not backwards compatible if an upgraded cluster is trying to read the cluster setting. (refs #70338) There's no release note since this fixes an issue in a non-released version. Release note: None Co-authored-by: Marcus Gartner <marcus@cockroachlabs.com> Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com> Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>
See #54675. Previously, there were only ordered unique
row ids generated which incurs write hot-spots.
The introduction of unordered unique row ids enable distributed
row ids to avoid hotspots while maintaining the 64-bit integer
data type. Basic testing proving distribution of row insertions provided.
Release note (sql change): Added a new sql builtin 'unordered_unique_rowid'
and serial_normalization case 'unordered_rowid' which generates a globally
unique 64-bit integer that does not have ordering.