
sql: add unordered unique row id as builtin and serial_normalization option #70338

Merged: 2 commits on Sep 24, 2021

Conversation


@jameswsj10 jameswsj10 commented Sep 16, 2021

See #54675. Previously, row ids were only generated in an ordered
fashion, which incurs write hot spots.

The introduction of unordered unique row ids enables distributed
row insertions to avoid hot spots while maintaining the 64-bit integer
data type. Basic testing demonstrating the distribution of row insertions
is provided.

Release note (sql change): Added a new SQL builtin 'unordered_unique_rowid'
and serial_normalization case 'unordered_rowid', which generate a globally
unique 64-bit integer that has no ordering.


@jameswsj10 jameswsj10 force-pushed the include-unordered-unique-rowid branch 3 times, most recently from 78eb4a6 to fae6d9d Compare September 21, 2021 17:03
@jameswsj10
Author

Fixed a CI failure due to a go test broken under Bazel; reduced the number of row insertions in TestSerialNormalizationWithUniqueUnorderedID to 1000 so the test withstands stress runs.

Contributor

@ajwerner ajwerner left a comment


Sorry for the late review! This is close.

Reviewed 1 of 1 files at r1, 4 of 8 files at r2, 2 of 2 files at r3, 1 of 1 files at r4.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @fqazi and @jameswsj10)


pkg/sql/sem/builtins/builtins.go, line 7652 at r4 (raw file):

Quoted 4 lines of code…
	// ts is [16 bits of 0][48 bits of ts]
	ts := (orig & ((uint64(math.MaxUint64) >> 16) << 15)) >> 15
	// v is [0][48 bits of reversed ts][15 bits of node id]
	v := (bits.Reverse64(ts) >> 1) | (orig & (1<<15 - 1))

This might feel silly, but it will make me feel a lot better: can you pull just the middle two lines (and the comment on the top line) into a little helper, and then, in a unit test, generate random uint64 values in a loop, do this little bit-manipulation dance, and confirm that bits.OnesCount64 is the same for the input and output? It'll make me feel better that we aren't clobbering any bits.


pkg/sql/sem/builtins/builtins_test.go, line 109 at r4 (raw file):

	fmt.Println("Key counts in each split range")
	for i, keyCount := range keyCounts {
		fmt.Printf("range %d: %d\n", i, keyCount)

this isn't actually checking anything, it's just printing things. Can we update this test to make some assertion?

@jameswsj10 jameswsj10 force-pushed the include-unordered-unique-rowid branch 5 times, most recently from 7acc850 to e1432af Compare September 22, 2021 23:59
Contributor

@ajwerner ajwerner left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @fqazi, and @jameswsj10)


pkg/sql/sem/builtins/builtins_test.go, line 73 at r5 (raw file):

}

// TestMapToUniqueUnorderedID verifies that the mapping preserves the ones count

nit: period.


pkg/sql/sem/builtins/builtins_test.go, line 84 at r5 (raw file):

		outputOnesCount := bits.OnesCount64(output)
		if inputOnesCount != outputOnesCount {
			t.Errorf("input: %b, output %b\nExpected: %d, got: %d", randInput, output, inputOnesCount, outputOnesCount)

nit: wrap
nit: require.Equalf


pkg/sql/sem/builtins/builtins_test.go, line 111 at r5 (raw file):

	// testing configurations
	const numberOfRows = 1000

for the number of rows, if you wanted, you could do something like:

numberOfRows := 10000
if util.RaceEnabled {
    numberOfRows = 100
}

Please comment briefly that you want to use a relatively small number of rows because inserting them under race is slow.


pkg/sql/sem/builtins/builtins_test.go, line 115 at r5 (raw file):

	// enforce range splits to collect range statistics after row insertions
	// 3 bits worth of splits in the high order

nit: comment should be sentences


pkg/sql/sem/builtins/builtins_test.go, line 128 at r5 (raw file):

		"='t'::regclass;").Scan(&keyCounts)

	// Calculate mean and variance using NumericStat

let's get a little bit mathier here. What we want to know is whether this distribution over the ranges is non-uniform. A nice tool for that is a chi-squared test, and fortunately, for this sort of thing it's really easy to compute. The null hypothesis says that each range has the same probability, and we want to see if we can reject it at a pretty low p-value, maybe .00001 or so. So we need to pick our failing statistic value based on the p-value we want. Looking at https://www.fourmilab.ch/rpkp/experiments/analysis/chiCalc.html, a very conservative choice seems like 19.5114, which corresponds to a p-value of .00001 that the distribution really is uniform and we still saw what we saw.

From there, we just need to compute the stat. Consider:

func discreteUniformChiSquared(counts []int64) float64 {
	var n int64
	for _, c := range counts {
		n += c
	}
	p := float64(1) / float64(len(counts))
	var stat float64
	for _, c := range counts {
		oSubE := float64(c)/float64(n) - p
		stat += (oSubE * oSubE) / p
	}
	stat *= float64(n)
	return stat
}

For the failing example of []int64{127, 136, 107, 118, 108, 142, 127, 135} we get the value 9.44 which is below our threshold. If we were to, say, move the counts from the last bucket to the first one, we'd get []int64{262, 136, 107, 118, 108, 142, 127, 0} which gives us 283.76 as our stat, which seems good because that's not very uniform.


pkg/sql/sem/builtins/builtins_test.go, line 137 at r5 (raw file):

	mean := stats.Mean
	variance := stats.GetVariance(count)
	fmt.Printf("Mean: %f Variance: %f\n", mean, variance)

if you want to log something from the test, use t.Logf

Contributor

@ajwerner ajwerner left a comment


Name the builtin and serial_normalization value in the release note for the docs team.

Other than that, :lgtm_strong:.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @fqazi, and @jameswsj10)

Contributor

@ajwerner ajwerner left a comment


Oh, wait, actually, can you rework the git history to be a bit more sane here? Namely, remove the [dnm] and put the release note about the new builtin in the first commit and then remove the new builtin from the second release note?

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @fqazi, and @jameswsj10)

@jameswsj10
Author

nit: period.

Resolved.

nit: wrap
nit: require.Equalf

Resolved.

Please comment briefly that you want to use a relatively small number of rows because inserting them under race is slow.

Done.

nit: comment should be sentences

Resolved. Double-checked comments to adhere to engineering guidelines.

let's get a little bit mathier here. What we want to know is whether this distribution over the ranges is not uniform. A nice tool for that is chi-squared. […]

I've replaced the existing method of checking variance not exceeding a hard-coded threshold to the chi-squared goodness of fit test against the uniform distribution null hypothesis. Let me know if the comments are descriptive enough!

if you want to log something from the test, use t.Logf

Removed all fmt.Prints and logged only the range distributions. Let me know if you'd like me to remove this log as well.

Oh, wait, actually, can you rework the git history to be a bit more sane here? Namely, remove the [dnm] and put the release note about the new builtin in the first commit and then remove the new builtin from the second release note?

Modified commit descriptions and release notes for both commits.

Thanks for the feedback! 👍

Contributor

@ajwerner ajwerner left a comment


:lgtm:

Reviewed 2 of 8 files at r2, 2 of 3 files at r5, 1 of 1 files at r6, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner and @fqazi)

Collaborator

@fqazi fqazi left a comment


Reviewed 1 of 1 files at r1, 9 of 9 files at r7.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @fqazi and @jameswsj10)


-- commits, line 5 at r7:
probably should say hot spot was a side effect of the distribution somehow


-- commits, line 7 at r7:
May be a bit clearer and say that a new builtin without ordering avoids that problem.

ajwerner and others added 2 commits September 23, 2021 11:43
Previously, people complained about not wanting to rekey databases
that assume a 64-bit key, and our default for that use case incurred a
write hot spot, which scared off customers. To address this, the new
approach imposes no ordering, avoiding hot spots while guaranteeing the
same uniqueness properties.

Previously, people complained about not wanting to rekey databases
that assume a 64-bit key. Our default rowids were generated so that
they would be globally unique and always increasing (note: not exactly
serial). This distribution incurred hot spots, which scared off
customers.

To address this, the new builtin guarantees unordered ids with a nearly
uniform distribution while maintaining global uniqueness. This helps
avoid the hot spots mentioned above.

Release note (sql change): Added a new SQL builtin 'unordered_unique_rowid'
which generates a globally unique 64-bit integer that has no ordering.

See cockroachdb#54675. Previously, row ids were only generated in an
ordered fashion, which incurs write hot spots.

The introduction of unordered unique row ids enables distributed
row insertions to avoid hot spots while maintaining the 64-bit integer
data type. Basic testing demonstrating the distribution of row insertions
with the unordered unique row id as PK is provided.

Release note (sql change): Added a new serial_normalization case
'unordered_rowid' which generates a globally unique 64-bit integer
that has no ordering.
Author

@jameswsj10 jameswsj10 left a comment


Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @fqazi)


-- commits, line 5 at r7:

Previously, fqazi (Faizan Qazi) wrote…

probably should say hot spot was a side effect of the distribution somehow

Done.


-- commits, line 7 at r7:

Previously, fqazi (Faizan Qazi) wrote…

May be a bit clearer and say that a new builtin without ordering avoids that problem.

Done.

Collaborator

@fqazi fqazi left a comment


Reviewed 9 of 9 files at r9.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @fqazi)

@jameswsj10
Author

bors r+

@craig
Contributor

craig bot commented Sep 24, 2021

Build succeeded:

@craig craig bot merged commit ad275ab into cockroachdb:master Sep 24, 2021
craig bot pushed a commit that referenced this pull request Nov 29, 2021
73004: sql: minor assignment cast fixes r=mgartner a=mgartner

#### tree: add missing casts to and from UUIDs

Automatic IO conversion casts are possible from the UUID type to string
types, and from string types to the UUID type. This commit adds these
missing casts to `castMap` which lists all valid casts.

Release note: None

#### sql: allow cast from a type T to T in all contexts

This commit updates `tree.ValidCast` so that it returns true when the
given types are identical for any cast context. Casts from a type
without modifiers (like width modifiers) to the same type have been
removed from the cast map because they are no longer necessary.

Release note: None

#### sql: remove UNKNOWN from castMap

The UNKNOWN type, which is given to expressions that statically evaluate
to NULL, can be cast to any type. Codifying all possible casts with
an UNKNOWN source type in the `castMap` is cumbersome, and it is easy to
accidentally omit some target types. This commit removes the UNKNOWN
casts from the `castMap` and adds logic in `ValidCast` that allows
UNKNOWN types to be cast to any other type in all contexts.

Release note: None


73101: roachpb: add Spans.String r=yuzefovich a=yuzefovich

Release note: None

73264: sql: reorder serial_normalization enum r=ajwerner a=rafiss

The commit that added this changed the order of this enum, which is not
backwards compatible if an upgraded cluster is trying to read the
cluster setting. (refs #70338)

There's no release note since this fixes an issue in a non-released
version.

Release note: None

Co-authored-by: Marcus Gartner <marcus@cockroachlabs.com>
Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>