Conversation

@RaduBerinde RaduBerinde commented Sep 29, 2025

Add support for sharding with CPU locality and add sharded counters.
See counters_test.go for benchmark results.

Sharding has two implementations; one is used with the CRDB Go
runtime (specifically cockroachdb/go#6) and
the `cockroach_go` build tag.

We also add a script that can be used to easily run tests against the
CRDB Go runtime (along with a CI job).


This change is Reviewable

@petermattis petermattis left a comment

Neat. Just a few small comments.

Reviewable status: 0 of 7 files reviewed, 4 unresolved discussions (waiting on @jbowens)


crsync/counters_test.go line 1 at r1 (raw file):

// Copyright 2025 The Cockroach Authors.

Is it worth having very simple correctness tests of Counters.Add() and Get()?


crsync/counters_test.go line 39 at r1 (raw file):

// c=1/p=4-10      169ns ±10%    53ns ± 4%     21ns ±39%    13ns ±26%
// c=1/p=10-10     752ns ± 3%   125ns ± 0%     51ns ±20%    51ns ± 7% *
// c=1/p=40-10    3.04µs ± 2%  0.67µs ±13%   0.26µs ±31%  0.29µs ±16%

I wonder why crsync-cr is slower than crsync here. Doesn't really matter given how much faster this is than simple and randshards.


crsync/counters.go line 111 at r1 (raw file):

func makeCounters(numShards, numCounters int) Counters {
	// Round up to the nearest cacheline size, to avoid false sharing.
	shardSize := (numCounters + countersPerCacheLine - 1) &^ (countersPerCacheLine - 1)

So each shard contains all of the counters and is at least 8 counters in size (the typical 64-byte cache line size)? It took me a minute to understand this. I think the comments here or above Counters could make this clearer. Is the reason for this structure to avoid wasting excessive space? That makes sense. Again, the commentary could make this a bit clearer.


crsync/counters.go line 155 at r1 (raw file):

//   - Complexity is O(NumShards() * numCounters).
//   - Snapshot semantics: no ordering guarantees w.r.t. concurrent updates.
func (c *Counters) All() []int64 {

When do you envision All() being used? I'm wondering why a caller would prefer All() to iterating over the counters and calling Get(counter) for each given the latter will not require a memory allocation.

Or perhaps this should return an iter.Seq[int64] using the Go iterator facility. I believe that should avoid allocations while providing more convenient iteration.
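
For context, a minimal sketch of what caller-side iteration would look like with an iter.Seq[int64] result (All and c are the names from the snippet above; this is illustrative, not the package's code):

// Requires Go 1.23+ range-over-func; assumes All() returns iter.Seq[int64].
var total int64
for v := range c.All() {
	total += v
}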

@RaduBerinde RaduBerinde left a comment

TFTR!

Reviewable status: 0 of 7 files reviewed, 4 unresolved discussions (waiting on @jbowens and @petermattis)


crsync/counters.go line 111 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

So each shard contains all of the counters and is at least 8 counters in size (the typical 64-byte cache line size)? It took me a minute to understand this. I think the comments here or above Counters could make this clearer. Is the reason for this structure to avoid wasting excessive space? That makes sense. Again, the commentary could make this a bit clearer.

shardSize is a multiple of 8 to avoid false sharing. Updating the comments and making the code less cryptic.
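
As a concrete illustration of the rounding being discussed, a minimal sketch (the constant value is the typical 64-byte cache line implied above, not taken from the file):

// Illustrative only: a 64-byte cache line holds 8 int64 counters, so
// shardSize is rounded up to the next multiple of 8.
const countersPerCacheLine = 8

func roundUpToCacheLine(numCounters int) int {
	return (numCounters + countersPerCacheLine - 1) &^ (countersPerCacheLine - 1)
}

// For example, roundUpToCacheLine(10) == 16: a shard of 10 counters occupies
// two full cache lines, and the next shard starts on a fresh line.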


crsync/counters.go line 155 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

When do you envision All() being used? I'm wondering why a caller would prefer All() to iterating over the counters and calling Get(counter) for each given the latter will not require a memory allocation.

Or perhaps this should return an iter.Seq[int64] using the Go iterator facility. I believe that should avoid allocations while providing more convenient iteration.

I wanted to minimize disruption of ongoing Add()s by looking at each cache line only once. Added a comment. I kept this property when switching to the iterator interface.


crsync/counters_test.go line 1 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Is it worth having very simple correctness tests of Counters.Add() and Get()?

Oops, I meant to add some but jumped the gun. Done.
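
For illustration, a basic test along these lines might look like the following hypothetical sketch (it would live in counters_test.go and import "testing"; the constructor name and the Add/Get signatures are assumptions, not the actual API):

// Hypothetical sketch; NewCounters and the Add/Get signatures are assumed.
func TestCountersBasic(t *testing.T) {
	const numCounters = 4
	c := NewCounters(numCounters)
	for i := 0; i < numCounters; i++ {
		c.Add(i, int64(i)*10)
	}
	for i := 0; i < numCounters; i++ {
		if got, want := c.Get(i), int64(i)*10; got != want {
			t.Errorf("counter %d: got %d, want %d", i, got, want)
		}
	}
}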


crsync/counters_test.go line 39 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I wonder why crsync-cr is slower than crsync here. Doesn't really matter given how much faster this is than simple and randshards.

In this case we are almost surely preempted while updating the counter. This is not a concern in a real use case where a goroutine is doing other work in-between counter updates.

Given the large stddev, not sure if there is a real difference. If it is, my speculation is that in the crsync-cr case, when we get preempted, we will use the "wrong" shard once and the cache line bounces to the new CPU then back; in the crsync case, if we get preempted in-between the Get and the Put, we will move the shard from one CPU to another (it doesn't bounce back).

@petermattis petermattis left a comment

:lgtm: Another suggestion about improving the commentary, but feel free to ignore.

Reviewable status: 0 of 7 files reviewed, 3 unresolved discussions (waiting on @jbowens and @RaduBerinde)


crsync/counters.go line 111 at r1 (raw file):

Previously, RaduBerinde wrote…

shardSize is a multiple of 8 to avoid false sharing. Updating the comments and making the code less cryptic.

You're only avoiding false sharing between the shards for an individual counter? Adjacent counters can still experience false sharing because multiple counters (up to 8) are packed into a cache line. I suspect this is a good design trade-off (reduces space overhead from these sharded counters), but the comments could still make this setup a bit clearer.


crsync/counters.go line 98 at r2 (raw file):

	shardSize uint32
	// counters contains numShards * shardSize counters; shardSize is a multiple
	// of countersPerCacheLine to avoid false sharing (at shard boundaries).

Adjacent counter ordinals are packed tightly within a shard to reduce space overhead, while the shards for a given counter ordinal are on distinct cache lines to avoid false sharing.


crsync/counters.go line 171 at r2 (raw file):

		for i := 0; i < c.numCounters; i += countersPerCacheLine {
			vals = [countersPerCacheLine]int64{}
			numCounters := min(c.numCounters-i, countersPerCacheLine)

Confusing to have both numCounters and c.numCounters so close to each other in the code. Perhaps name this variable numShardCounters.

@RaduBerinde RaduBerinde left a comment

Thanks! Updated. Also bumped go to 1.23 for the iter stuff.

Reviewable status: 0 of 8 files reviewed, 3 unresolved discussions (waiting on @jbowens and @petermattis)


crsync/counters.go line 111 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

You're only avoiding false sharing between the shards for an individual counter? Adjacent counters can still experience false sharing because multiple counters (up to 8) are packed into a cache line. I suspect this is a good design trade-off (reduces space overhead from these sharded counters), but the comments could still make this setup a bit clearer.

There is a strong correlation between the current CPU and the shard that we are using, so the same CPU would be updating any of the counters in the shard (no false sharing).


crsync/counters.go line 98 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Adjacent counter ordinals are packed tightly within a shard to reduce space overhead, while the shards for a given counter ordinal are on distinct cache lines to avoid false sharing.

Changed to this:

	// counters contains numShards * shardSize counters; shardSize is a multiple
	// of countersPerCacheLine to avoid false sharing at shard boundaries. Note
	// that there is a high correlation between the current CPU and the chosen
	// shard, so different counters inside a shard can share cache lines.
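
To make the layout concrete, a hypothetical picture for numCounters = 10 (so shardSize is rounded up to 16), assuming counter i of shard s lives at counters[s*shardSize + i]:

// Hypothetical layout, numCounters = 10, shardSize = 16:
//
//   shard 0: |c0 c1 ... c7|c8 c9 pad ... pad|   <- two cache lines
//   shard 1: |c0 c1 ... c7|c8 c9 pad ... pad|
//   ...
//
// Shards never share a cache line; counters within one shard do, which is
// fine because a given shard is (almost always) updated from a single CPU.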

crsync/counters.go line 171 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Confusing to have both numCounters and c.numCounters so close to each other in the code. Perhaps name this variable numShardCounters.

Done.

@RaduBerinde RaduBerinde force-pushed the sharding branch 2 times, most recently from 0fc25a8 to 5f590ce Compare September 30, 2025 17:40
@RaduBerinde RaduBerinde merged commit ab3080f into cockroachdb:main Sep 30, 2025
8 of 9 checks passed
@RaduBerinde RaduBerinde deleted the sharding branch September 30, 2025 22:15
@sumeerbhola sumeerbhola left a comment

@sumeerbhola reviewed 1 of 5 files at r3.
Reviewable status: 1 of 10 files reviewed, 7 unresolved discussions


crsync/counters.go line 169 at r4 (raw file):

	return func(yield func(int64) bool) {
		// To access each cache line only once, we calculate countersPerCacheLine
		// counters at a time.

This was very neat.
I found it hard to understand what was going on in this method and took some time to finally get it. Some comment suggestions below to make it easier.


crsync/counters.go line 172 at r4 (raw file):

		var vals [countersPerCacheLine]int64
		for i := 0; i < c.numCounters; i += countersPerCacheLine {
			vals = [countersPerCacheLine]int64{}

// Reminder: We have a total of c.numCounters logical counters, but we have allocated space for a multiple of countersPerCacheLine counters. So the last iteration of this loop will have fewer than countersPerCacheLine logical counters to read. n is the number of logical counters to read in this iteration.


crsync/counters.go line 174 at r4 (raw file):

			vals = [countersPerCacheLine]int64{}
			n := min(c.numCounters-i, countersPerCacheLine)
			for s := range c.numShards {

// Iterate over all the shards and for this logical cache line, read the physical cache line from each shard and aggregate into vals.


crsync/counters.go line 182 at r4 (raw file):

					vals[j] += counters[j].Load()
				}
			}

// Yield the values for the next set of n logical counters. Across the iterations of the outer loop we will in total yield c.numCounters values.

@RaduBerinde RaduBerinde left a comment

Reviewable status: 1 of 10 files reviewed, 7 unresolved discussions


crsync/counters.go line 169 at r4 (raw file):

Previously, sumeerbhola wrote…

This was very neat.
I found it hard to understand what was going on in this method and took some time to finally get it. Some comment suggestions below to make it easier.

TFTR! Opened #25 to apply these comments


crsync/counters.go line 172 at r4 (raw file):

Previously, sumeerbhola wrote…

// Reminder: We have a total of c.numCounters logical counters, but we have allocated space for a multiple of countersPerCacheLine counters. So the last iteration of this loop will have fewer than countersPerCacheLine logical counters to read. n is the number of logical counters to read in this iteration.

Done.


crsync/counters.go line 174 at r4 (raw file):

Previously, sumeerbhola wrote…

// Iterate over all the shards and for this logical cache line, read the physical cache line from each shard and aggregate into vals.

Done.
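
Pulling the quoted fragments together, a self-contained sketch of the iterator with these comments folded in might look roughly like this (field names and indexing follow the snippets above; the real struct and constructor details may differ):

package crsync

import (
	"iter"
	"sync/atomic"
)

const countersPerCacheLine = 8 // int64 counters per 64-byte cache line

// Counters is a sketch of the sharded counters discussed above; field names
// follow the quoted fragments, but the real definitions may differ.
type Counters struct {
	numShards   int
	numCounters int
	shardSize   uint32         // multiple of countersPerCacheLine
	counters    []atomic.Int64 // numShards * shardSize counters
}

// All returns an iterator over the aggregated counter values, touching each
// cache line only once.
func (c *Counters) All() iter.Seq[int64] {
	return func(yield func(int64) bool) {
		var vals [countersPerCacheLine]int64
		for i := 0; i < c.numCounters; i += countersPerCacheLine {
			vals = [countersPerCacheLine]int64{}
			// The last iteration may cover fewer than countersPerCacheLine
			// logical counters.
			n := min(c.numCounters-i, countersPerCacheLine)
			// Read this logical cache line from every shard and aggregate
			// into vals.
			for s := 0; s < c.numShards; s++ {
				base := s*int(c.shardSize) + i
				for j := 0; j < n; j++ {
					vals[j] += c.counters[base+j].Load()
				}
			}
			// Yield the next n aggregated values; across all iterations we
			// yield exactly c.numCounters values.
			for j := 0; j < n; j++ {
				if !yield(vals[j]) {
					return
				}
			}
		}
	}
}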
