crsync: add sharded counters and general support for sharding #24
Conversation
Neat. Just a few small comments.
Reviewable status: 0 of 7 files reviewed, 4 unresolved discussions (waiting on @jbowens)
crsync/counters_test.go
line 1 at r1 (raw file):
// Copyright 2025 The Cockroach Authors.
Is it worth having very simple correctness tests of Counters.Add() and Get()?
crsync/counters_test.go
line 39 at r1 (raw file):
```go
// c=1/p=4-10    169ns ±10%    53ns ± 4%    21ns ±39%    13ns ±26%
// c=1/p=10-10   752ns ± 3%   125ns ± 0%   51ns ±20%    51ns ± 7% *
// c=1/p=40-10  3.04µs ± 2%  0.67µs ±13%  0.26µs ±31%  0.29µs ±16%
```
I wonder why crsync-cr is slower than crsync here. Doesn't really matter given how much faster this is than simple and randshards.
crsync/counters.go
line 111 at r1 (raw file):
```go
func makeCounters(numShards, numCounters int) Counters {
	// Round up to the nearest cacheline size, to avoid false sharing.
	shardSize := (numCounters + countersPerCacheLine - 1) &^ (countersPerCacheLine - 1)
```
So each shard contains all of the counters and is at least 8 counters in size (the typical 64-byte cache line size)? It took me a minute to understand this. I think the comments here or above Counters could make this clearer. Is the reason for this structure to avoid wasting excessive space? That makes sense. Again, the commentary could make this a bit clearer.
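For reference, the rounding itself is standard power-of-two bit-twiddling; a minimal self-contained sketch, assuming the typical 64-byte line holding eight 8-byte counters:

```go
// Round numCounters up to the next multiple of countersPerCacheLine so
// that each shard occupies whole cache lines. &^ is Go's AND NOT, which
// clears the low bits after the +7 bump.
const countersPerCacheLine = 64 / 8 // 64-byte line / 8-byte int64 (assumed)

func roundUpShardSize(numCounters int) int {
	return (numCounters + countersPerCacheLine - 1) &^ (countersPerCacheLine - 1)
}

// roundUpShardSize(1) == 8, roundUpShardSize(8) == 8, roundUpShardSize(9) == 16.
```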
crsync/counters.go
line 155 at r1 (raw file):
```go
//   - Complexity is O(NumShards() * numCounters).
//   - Snapshot semantics: no ordering guarantees w.r.t. concurrent updates.
func (c *Counters) All() []int64 {
```
When do you envision All() being used? I'm wondering why a caller would prefer All() to iterating over the counters and calling Get(counter) for each, given that the latter does not require a memory allocation.
Or perhaps this should return an iter.Seq[int64] using the Go iterator facility. I believe that should avoid allocations while providing more convenient iteration.
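For context, a minimal sketch of the iter.Seq shape being suggested here (Go 1.23; the body simply wraps Get and is illustrative, not the PR's implementation, and the Counter type for the ordinal is assumed):

```go
import "iter"

// All returns a range-over-func iterator over the counter totals, which
// callers can range over without allocating a slice.
func (c *Counters) All() iter.Seq[int64] {
	return func(yield func(int64) bool) {
		for i := 0; i < c.numCounters; i++ {
			if !yield(c.Get(Counter(i))) { // Counter type assumed
				return
			}
		}
	}
}

// Usage: for v := range c.All() { ... }
```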
Force-pushed from 6cc6491 to 88490a6
TFTR!
Reviewable status: 0 of 7 files reviewed, 4 unresolved discussions (waiting on @jbowens and @petermattis)
crsync/counters.go
line 111 at r1 (raw file):
Previously, petermattis (Peter Mattis) wrote…
So each shard contains all of the counters and is at least 8 counters in size (the typical 64-byte cache line size)? It took me a minute to understand this. I think the comments here or above Counters could make this clearer. Is the reason for this structure to avoid wasting excessive space? That makes sense. Again, the commentary could make this a bit clearer.
shardSize is a multiple of 8 to avoid false sharing. Updating the comments and making the code less cryptic.
crsync/counters.go
line 155 at r1 (raw file):
Previously, petermattis (Peter Mattis) wrote…
When do you envision All() being used? I'm wondering why a caller would prefer All() to iterating over the counters and calling Get(counter) for each given the latter will not require a memory allocation.
Or perhaps this should return an iter.Seq[int64] using the Go iterator facility. I believe that should avoid allocations while providing more convenient iteration.
I wanted to minimize disruption of ongoing Add()s by looking at each cache line only once. Added a comment. I kept this property while presenting the iterator interface.
crsync/counters_test.go
line 1 at r1 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Is it worth having very simple correctness tests of Counters.Add() and Get()?
Oops, I meant to add some but jumped the gun. Done.
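For anyone reading along, such a test might look roughly like this (a sketch only; the constructor and method signatures are assumed from the snippets quoted in this thread, and the exported API may differ):

```go
import "testing"

func TestCountersAddGet(t *testing.T) {
	// makeCounters(numShards, numCounters) is assumed from the snippet
	// quoted above; Add(counter, delta) and Get(counter) are assumed too.
	c := makeCounters(4 /* numShards */, 3 /* numCounters */)
	for i := 0; i < 3; i++ {
		c.Add(Counter(i), int64(10*i+1))
	}
	for i := 0; i < 3; i++ {
		if got, want := c.Get(Counter(i)), int64(10*i+1); got != want {
			t.Errorf("counter %d: got %d, want %d", i, got, want)
		}
	}
}
```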
crsync/counters_test.go
line 39 at r1 (raw file):
Previously, petermattis (Peter Mattis) wrote…
I wonder why crsync-cr is slower than crsync here. Doesn't really matter given how much faster this is than simple and randshards.
In this case we are almost surely preempted while updating the counter. This is not a concern in a real use case, where a goroutine does other work in between counter updates.
Given the large stddev, I'm not sure there is a real difference. If there is, my speculation is that in the crsync-cr case, when we get preempted, we use the "wrong" shard once and the cache line bounces to the new CPU and back; in the crsync case, if we get preempted between the Get and the Put, we move the shard from one CPU to another (it doesn't bounce back).
Another suggestion about improving the commentary, but feel free to ignore.
Reviewable status: 0 of 7 files reviewed, 3 unresolved discussions (waiting on @jbowens and @RaduBerinde)
crsync/counters.go
line 111 at r1 (raw file):
Previously, RaduBerinde wrote…
shardSize is a multiple of 8 to avoid false sharing. Updating the comments and making the code less cryptic.
You're only avoiding false sharing between the shards for an individual counter? Adjacent counters can still experience false sharing because multiple counters (up to 8) are packed into a cache line. I suspect this is a good design trade-off (reduces space overhead from these sharded counters), but the comments could still make this setup a bit clearer.
crsync/counters.go
line 98 at r2 (raw file):
```go
	shardSize uint32

	// counters contains numShards * shardSize counters; shardSize is a multiple
	// of countersPerCacheLine to avoid false sharing (at shard boundaries).
```
Adjacent counter ordinals are packed tightly within a shard to reduce space overhead, while the shards for a given counter ordinal are on distinct cache lines to avoid false sharing.
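To make that layout concrete, a hypothetical index helper (the name and exact field types are assumed, not taken from the PR):

```go
// Counter k of shard s lives at counters[s*shardSize+k]. Because shardSize
// is a multiple of countersPerCacheLine, shard boundaries fall on cache-line
// boundaries: two shards never share a line, while counters within one
// shard may, which is fine because one CPU tends to stick to one shard.
func (c *Counters) index(shard, counter int) int {
	return shard*int(c.shardSize) + counter
}
```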
crsync/counters.go
line 171 at r2 (raw file):
```go
	for i := 0; i < c.numCounters; i += countersPerCacheLine {
		vals = [countersPerCacheLine]int64{}
		numCounters := min(c.numCounters-i, countersPerCacheLine)
```
Confusing to have both numCounters and c.numCounters so close to each other in the code. Perhaps name this variable numShardCounters.
Force-pushed from 88490a6 to 3fc6724
Thanks! Updated. Also bumped Go to 1.23 for the iter stuff.
Reviewable status: 0 of 8 files reviewed, 3 unresolved discussions (waiting on @jbowens and @petermattis)
crsync/counters.go
line 111 at r1 (raw file):
Previously, petermattis (Peter Mattis) wrote…
You're only avoiding false sharing between the shards for an individual counter? Adjacent counters can still experience false sharing because multiple counters (up to 8) are packed into a cache line. I suspect this is a good design trade-off (reduces space overhead from these sharded counters), but the comments could still make this setup a bit clearer.
There is a strong correlation between the current CPU and the shard that we are using, so the same CPU would be updating any of the counters in the shard (no false sharing).
crsync/counters.go
line 98 at r2 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Adjacent counter ordinals are packed tightly within a shard to reduce space overhead, while the shards for a given counter ordinal are on distinct cache lines to avoid false sharing.
Changed to this:
```go
// counters contains numShards * shardSize counters; shardSize is a multiple
// of countersPerCacheLine to avoid false sharing at shard boundaries. Note
// that there is a high correlation between the current CPU and the chosen
// shard, so different counters inside a shard can share cache lines.
```
crsync/counters.go
line 171 at r2 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Confusing to have both numCounters and c.numCounters so close to each other in the code. Perhaps name this variable numShardCounters.
Done.
Force-pushed from 0fc25a8 to 5f590ce
Force-pushed from 5f590ce to d5d6643
Add support for sharding with CPU locality and add sharded counters. See counters_test.go for benchmark results. Sharding has two implementations, one to be used with the CRDB Go runtime (specifically cockroachdb/go#6) and the `cockroach_go` build tag. We also add a script that can be used to easily run tests against the CRDB Go runtime (along with a CI job).
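As a sketch of how such a build-tag split typically looks in Go (file contents illustrative; the actual runtime hook from cockroachdb/go#6 is not shown here):

```go
//go:build cockroach_go

// This file is compiled only when the build or test run passes the tag:
//
//	go test -tags cockroach_go ./crsync
//
// A sibling file guarded by //go:build !cockroach_go supplies the
// fallback sharding implementation for the stock Go runtime.
package crsync
```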
Force-pushed from d5d6643 to 456ff8b
@sumeerbhola reviewed 1 of 5 files at r3.
Reviewable status: 1 of 10 files reviewed, 7 unresolved discussions
crsync/counters.go
line 169 at r4 (raw file):
```go
	return func(yield func(int64) bool) {
		// To access each cache line only once, we calculate countersPerCacheLine
		// counters at a time.
```
This was very neat.
I found it hard to understand what was going on in this method and took some time to finally get it. Some comment suggestions below to make it easier.
crsync/counters.go
line 172 at r4 (raw file):
```go
	var vals [countersPerCacheLine]int64
	for i := 0; i < c.numCounters; i += countersPerCacheLine {
		vals = [countersPerCacheLine]int64{}
```
```go
// Reminder: We have a total of c.numCounters logical counters, but we have
// allocated space for a multiple of countersPerCacheLine counters. So the
// last iteration of this loop will have fewer than countersPerCacheLine
// logical counters to read. n is the number of logical counters to read in
// this iteration.
```
crsync/counters.go
line 174 at r4 (raw file):
```go
		vals = [countersPerCacheLine]int64{}
		n := min(c.numCounters-i, countersPerCacheLine)
		for s := range c.numShards {
```
```go
// Iterate over all the shards and, for this logical cache line, read the
// physical cache line from each shard and aggregate into vals.
```
crsync/counters.go
line 182 at r4 (raw file):
```go
				vals[j] += counters[j].Load()
			}
		}
```
```go
// Yield the values for the next set of n logical counters. Across the
// iterations of the outer loop we will in total yield c.numCounters values.
```
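Putting the quoted fragments and suggested comments together, the method reads roughly as follows (reconstructed from the snippets above, so details are approximate; the per-shard slice accessor is assumed):

```go
func (c *Counters) All() iter.Seq[int64] {
	return func(yield func(int64) bool) {
		// To access each cache line only once, we calculate
		// countersPerCacheLine counters at a time.
		var vals [countersPerCacheLine]int64
		for i := 0; i < c.numCounters; i += countersPerCacheLine {
			vals = [countersPerCacheLine]int64{}
			// We have c.numCounters logical counters but space for a
			// multiple of countersPerCacheLine, so the last iteration
			// reads fewer; n is the number read in this iteration.
			n := min(c.numCounters-i, countersPerCacheLine)
			// For this logical cache line, read the physical cache line
			// from each shard and aggregate into vals.
			for s := range c.numShards {
				counters := c.shardSlice(s, i) // accessor name assumed
				for j := 0; j < n; j++ {
					vals[j] += counters[j].Load()
				}
			}
			// Yield the next n logical counters; across the iterations of
			// the outer loop we yield c.numCounters values in total.
			for j := 0; j < n; j++ {
				if !yield(vals[j]) {
					return
				}
			}
		}
	}
}
```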
Reviewable status: 1 of 10 files reviewed, 7 unresolved discussions
crsync/counters.go
line 169 at r4 (raw file):
Previously, sumeerbhola wrote…
This was very neat.
I found it hard to understand what was going on in this method and took some time to finally get it. Some comment suggestions below to make it easier.
TFTR! Opened #25 to apply these comments
crsync/counters.go
line 172 at r4 (raw file):
Previously, sumeerbhola wrote…
// Reminder: We have a total of c.numCounters logical counters, but we have allocated space for a multiple of countersPerCacheLine counters. So the last iteration of this loop will have fewer than countersPerCacheLine logical counters to read. n is the number of logical counters to read in this iteration.
Done.
crsync/counters.go
line 174 at r4 (raw file):
Previously, sumeerbhola wrote…
// Iterate over all the shards and for this logical cache line, read the physical cache line from each shard and aggregate into vals.
Done.