Conversation

@RaduBerinde RaduBerinde commented Sep 29, 2025

Add support for sharding with CPU locality and add sharded counters.
See counters_test.go for benchmark results.

Sharding has two implementations; one is used with the CRDB Go
runtime (specifically cockroachdb/go#6) and
the `cockroach_go` build tag.

We also add a script that can be used to easily run tests against the
CRDB Go runtime (along with a CI job).


This change is Reviewable

@petermattis petermattis left a comment

Neat. Just a few small comments.

Reviewable status: 0 of 7 files reviewed, 4 unresolved discussions (waiting on @jbowens)


crsync/counters_test.go line 1 at r1 (raw file):

// Copyright 2025 The Cockroach Authors.

Is it worth having very simple correctness tests of Counters.Add() and Get()?


crsync/counters_test.go line 39 at r1 (raw file):

// c=1/p=4-10      169ns ±10%    53ns ± 4%     21ns ±39%    13ns ±26%
// c=1/p=10-10     752ns ± 3%   125ns ± 0%     51ns ±20%    51ns ± 7% *
// c=1/p=40-10    3.04µs ± 2%  0.67µs ±13%   0.26µs ±31%  0.29µs ±16%

I wonder why crsync-cr is slower than crsync here. Doesn't really matter given how much faster this is than simple and randshards.


crsync/counters.go line 111 at r1 (raw file):

func makeCounters(numShards, numCounters int) Counters {
	// Round up to the nearest cacheline size, to avoid false sharing.
	shardSize := (numCounters + countersPerCacheLine - 1) &^ (countersPerCacheLine - 1)

So each shard contains all of the counters and is at least 8 counters in size (the typical 64-byte cache line size)? It took me a minute to understand this. I think the comments here or above Counters could make this clearer. Is the reason for this structure to avoid wasting excessive space? That makes sense. Again, the commentary could make this a bit clearer.


crsync/counters.go line 155 at r1 (raw file):

//   - Complexity is O(NumShards() * numCounters).
//   - Snapshot semantics: no ordering guarantees w.r.t. concurrent updates.
func (c *Counters) All() []int64 {

When do you envision All() being used? I'm wondering why a caller would prefer All() to iterating over the counters and calling Get(counter) for each given the latter will not require a memory allocation.

Or perhaps this should return an iter.Seq[int64] using the Go iterator facility. I believe that should avoid allocations while providing more convenient iteration.
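
For context, a minimal sketch of what caller-side iteration would look like with an iter.Seq[int64] result (All and c are the names from the snippet above; this is illustrative, not the package's code):

// Requires Go 1.23+ range-over-func; assumes All() returns iter.Seq[int64].
var total int64
for v := range c.All() {
	total += v
}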

@RaduBerinde RaduBerinde left a comment

TFTR!

Reviewable status: 0 of 7 files reviewed, 4 unresolved discussions (waiting on @jbowens and @petermattis)


crsync/counters.go line 111 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

So each shard contains all of the counters and is at least 8 counters in size (the typical 64-byte cache line size)? It took me a minute to understand this. I think the comments here or above Counters could make this clearer. Is the reason for this structure to avoid wasting excessive space? That makes sense. Again, the commentary could make this a bit clearer.

shardSize is a multiple of 8 to avoid false sharing. Updating the comments and making the code less cryptic.
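
As a concrete illustration of the rounding being discussed, a minimal sketch (the constant value is the typical 64-byte cache line implied above, not taken from the file):

// Illustrative only: a 64-byte cache line holds 8 int64 counters, so
// shardSize is rounded up to the next multiple of 8.
const countersPerCacheLine = 8

func roundUpToCacheLine(numCounters int) int {
	return (numCounters + countersPerCacheLine - 1) &^ (countersPerCacheLine - 1)
}

// For example, roundUpToCacheLine(10) == 16: a shard of 10 counters occupies
// two full cache lines, and the next shard starts on a fresh line.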


crsync/counters.go line 155 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

When do you envision All() being used? I'm wondering why a caller would prefer All() to iterating over the counters and calling Get(counter) for each given the latter will not require a memory allocation.

Or perhaps this should return an iter.Seq[int64] using the Go iterator facility. I believe that should avoid allocations while providing more convenient iteration.

I wanted to minimize disruption of ongoing Add()s by looking at each cache line only once. Added a comment. I kept this property when switching to the iterator interface.


crsync/counters_test.go line 1 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Is it worth having very simple correctness tests of Counters.Add() and Get()?

Oops, I meant to add some but jumped the gun. Done.
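
For illustration, a basic test along these lines might look like the following hypothetical sketch (it would live in counters_test.go and import "testing"; the constructor name and the Add/Get signatures are assumptions, not the actual API):

// Hypothetical sketch; NewCounters and the Add/Get signatures are assumed.
func TestCountersBasic(t *testing.T) {
	const numCounters = 4
	c := NewCounters(numCounters)
	for i := 0; i < numCounters; i++ {
		c.Add(i, int64(i)*10)
	}
	for i := 0; i < numCounters; i++ {
		if got, want := c.Get(i), int64(i)*10; got != want {
			t.Errorf("counter %d: got %d, want %d", i, got, want)
		}
	}
}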


crsync/counters_test.go line 39 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I wonder why crsync-cr is slower than crsync here. Doesn't really matter given how much faster this is than simple and randshards.

In this case we are almost surely preempted while updating the counter. This is not a concern in a real use case where a goroutine is doing other work in-between counter updates.

Given the large stddev, not sure if there is a real difference. If it is, my speculation is that in the crsync-cr case, when we get preempted, we will use the "wrong" shard once and the cache line bounces to the new CPU then back; in the crsync case, if we get preempted in-between the Get and the Put, we will move the shard from one CPU to another (it doesn't bounce back).

@petermattis petermattis left a comment

:lgtm: Another suggestion about improving the commentary, but feel free to ignore.

Reviewable status: 0 of 7 files reviewed, 3 unresolved discussions (waiting on @jbowens and @RaduBerinde)


crsync/counters.go line 111 at r1 (raw file):

Previously, RaduBerinde wrote…

shardSize is a multiple of 8 to avoid false sharing. Updating the comments and making the code less cryptic.

You're only avoiding false sharing between the shards for an individual counter? Adjacent counters can still experience false sharing because multiple counters (up to 8) are packed into a cache line. I suspect this is a good design trade-off (reduces space overhead from these sharded counters), but the comments could still make this setup a bit clearer.


crsync/counters.go line 98 at r2 (raw file):

	shardSize uint32
	// counters contains numShards * shardSize counters; shardSize is a multiple
	// of countersPerCacheLine to avoid false sharing (at shard boundaries).

Adjacent counter ordinals are packed tightly within a shard to reduce space overhead, while the shards for a given counter ordinal are on distinct cache lines to avoid false sharing.


crsync/counters.go line 171 at r2 (raw file):

		for i := 0; i < c.numCounters; i += countersPerCacheLine {
			vals = [countersPerCacheLine]int64{}
			numCounters := min(c.numCounters-i, countersPerCacheLine)

Confusing to have both numCounters and c.numCounters so close to each other in the code. Perhaps name this variable numShardCounters.

@RaduBerinde RaduBerinde left a comment

Thanks! Updated. Also bumped go to 1.23 for the iter stuff.

Reviewable status: 0 of 8 files reviewed, 3 unresolved discussions (waiting on @jbowens and @petermattis)


crsync/counters.go line 111 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

You're only avoiding false sharing between the shards for an individual counter? Adjacent counters can still experience false sharing because multiple counters (up to 8) are packed into a cache line. I suspect this is a good design trade-off (reduces space overhead from these sharded counters), but the comments could still make this setup a bit clearer.

There is a strong correlation between the current CPU and the shard that we are using, so the same CPU would be updating any of the counters in the shard (no false sharing).


crsync/counters.go line 98 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Adjacent counter ordinals are packed tightly within a shard to reduce space overhead, while the shards for a given counter ordinal are on distinct cache lines to avoid false sharing.

Changed to this:

	// counters contains numShards * shardSize counters; shardSize is a multiple
	// of countersPerCacheLine to avoid false sharing at shard boundaries. Note
	// that there is a high correlation between the current CPU and the chosen
	// shard, so different counters inside a shard can share cache lines.
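
To make the layout concrete, a hypothetical picture for numCounters = 10 (so shardSize is rounded up to 16), assuming counter i of shard s lives at counters[s*shardSize + i]:

// Hypothetical layout, numCounters = 10, shardSize = 16:
//
//   shard 0: |c0 c1 ... c7|c8 c9 pad ... pad|   <- two cache lines
//   shard 1: |c0 c1 ... c7|c8 c9 pad ... pad|
//   ...
//
// Shards never share a cache line; counters within one shard do, which is
// fine because a given shard is (almost always) updated from a single CPU.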

crsync/counters.go line 171 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Confusing to have both numCounters and c.numCounters so close to each other in the code. Perhaps name this variable numShardCounters.

Done.

@RaduBerinde RaduBerinde force-pushed the sharding branch 2 times, most recently from 0fc25a8 to 5f590ce Compare September 30, 2025 17:40
@RaduBerinde RaduBerinde merged commit ab3080f into cockroachdb:main Sep 30, 2025
8 of 9 checks passed
@RaduBerinde RaduBerinde deleted the sharding branch September 30, 2025 22:15
@sumeerbhola sumeerbhola left a comment

@sumeerbhola reviewed 1 of 5 files at r3.
Reviewable status: 1 of 10 files reviewed, 7 unresolved discussions


crsync/counters.go line 169 at r4 (raw file):

	return func(yield func(int64) bool) {
		// To access each cache line only once, we calculate countersPerCacheLine
		// counters at a time.

This was very neat.
I found it hard to understand what was going on in this method and took some time to finally get it. Some comment suggestions below to make it easier.


crsync/counters.go line 172 at r4 (raw file):

		var vals [countersPerCacheLine]int64
		for i := 0; i < c.numCounters; i += countersPerCacheLine {
			vals = [countersPerCacheLine]int64{}

// Reminder: We have a total of c.numCounters logical counters, but we have allocated space for a multiple of countersPerCacheLine counters. So the last iteration of this loop will have fewer than countersPerCacheLine logical counters to read. n is the number of logical counters to read in this iteration.


crsync/counters.go line 174 at r4 (raw file):

			vals = [countersPerCacheLine]int64{}
			n := min(c.numCounters-i, countersPerCacheLine)
			for s := range c.numShards {

// Iterate over all the shards and for this logical cache line, read the physical cache line from each shard and aggregate into vals.


crsync/counters.go line 182 at r4 (raw file):

					vals[j] += counters[j].Load()
				}
			}

// Yield the values for the next set of n logical counters. Across the iterations of the outer loop we will in total yield c.numCounters values.

@RaduBerinde RaduBerinde left a comment

Reviewable status: 1 of 10 files reviewed, 7 unresolved discussions


crsync/counters.go line 169 at r4 (raw file):

Previously, sumeerbhola wrote…

This was very neat.
I found it hard to understand what was going on in this method and took some time to finally get it. Some comment suggestions below to make it easier.

TFTR! Opened #25 to apply these comments


crsync/counters.go line 172 at r4 (raw file):

Previously, sumeerbhola wrote…

// Reminder: We have a total of c.numCounters logical counters, but we have allocated space for a multiple of countersPerCacheLine counters. So the last iteration of this loop will have fewer than countersPerCacheLine logical counters to read. n is the number of logical counters to read in this iteration.

Done.


crsync/counters.go line 174 at r4 (raw file):

Previously, sumeerbhola wrote…

// Iterate over all the shards and for this logical cache line, read the physical cache line from each shard and aggregate into vals.

Done.
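
Pulling the quoted fragments together, a self-contained sketch of the iterator with these comments folded in might look roughly like this (field names and indexing follow the snippets above; the real struct and constructor details may differ):

package crsync

import (
	"iter"
	"sync/atomic"
)

const countersPerCacheLine = 8 // int64 counters per 64-byte cache line

// Counters is a sketch of the sharded counters discussed above; field names
// follow the quoted fragments, but the real definitions may differ.
type Counters struct {
	numShards   int
	numCounters int
	shardSize   uint32         // multiple of countersPerCacheLine
	counters    []atomic.Int64 // numShards * shardSize counters
}

// All returns an iterator over the aggregated counter values, touching each
// cache line only once.
func (c *Counters) All() iter.Seq[int64] {
	return func(yield func(int64) bool) {
		var vals [countersPerCacheLine]int64
		for i := 0; i < c.numCounters; i += countersPerCacheLine {
			vals = [countersPerCacheLine]int64{}
			// The last iteration may cover fewer than countersPerCacheLine
			// logical counters.
			n := min(c.numCounters-i, countersPerCacheLine)
			// Read this logical cache line from every shard and aggregate
			// into vals.
			for s := 0; s < c.numShards; s++ {
				base := s*int(c.shardSize) + i
				for j := 0; j < n; j++ {
					vals[j] += c.counters[base+j].Load()
				}
			}
			// Yield the next n aggregated values; across all iterations we
			// yield exactly c.numCounters values.
			for j := 0; j < n; j++ {
				if !yield(vals[j]) {
					return
				}
			}
		}
	}
}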
