gossip: abnormal CPU usage growth with increasing node count #51838

Open
Drahflow opened this issue Jul 23, 2020 · 33 comments
Labels
A-kv-gossip C-investigation Further steps needed to qualify. C-label will change. C-performance Perf of queries or internals. Solution not expected to change functional behavior. O-community Originated from the community T-kv KV Team
Projects

Comments

@Drahflow

Drahflow commented Jul 23, 2020

Is your feature request related to a problem? Please describe.

I am trying to estimate maximum reachable OLTP performance for a client of mine. To my frustration I was not able to scale a CockroachDB cluster to significantly more than 256 nodes, due to high CPU load when adding more nodes (most of which is taken up by gossip-protocol related functions according to profiling). My measurements suggest that the work done for gossiping on each node scales quadratically in the number of nodes, which puts an upper limit on the maximum cluster size at about 1000 nodes.

Describe the solution you'd like

The gossip protocol should only perform linear work in the number of nodes.

Describe alternatives you've considered

The gossip protocol intervals could be configurable so larger clusters could be achievable by trading away DDL and node failure detection speed. However, this would only add a small factor to the maximum size until the quadratic growth would have pushed failure detection times too high.

Jira issue: CRDB-4006

@blathers-crl

blathers-crl bot commented Jul 23, 2020

Hello, I am Blathers. I am here to help you get the issue triaged.

I have CC'd a few people who may be able to assist you:

  • @tim-o (member of the technical support engineering team)

If we have not gotten back to your issue within a few business days, you can try the following:

  • Join our community slack channel and ask on #cockroachdb.
  • Try to find someone from here if you know they worked closely on the area and CC them.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@blathers-crl blathers-crl bot added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) O-community Originated from the community X-blathers-oncall labels Jul 23, 2020
@knz knz added this to Incoming in KV via automation Jul 23, 2020
@knz
Contributor

knz commented Jul 23, 2020

Thanks @Drahflow for filing this. Could you provide a bit more data from your experiments? What analysis did you carry out and how did you compute your conclusions?

Looking at your experiments would inform (and accelerate) our own investigation. Thanks

@irfansharif
Contributor

CRDB does not maintain n^2 gossip connections; it should be logarithmic. The gossip information disseminated cluster-wide should be linear in the number of nodes in the system, so it's a bit surprising to observe what you're observing. I'd be curious to see this data/profile as well.

@Drahflow
Author

Drahflow commented Jul 23, 2020

Cockroach version was v20.1.3-linux-amd64 downloaded from binaries.cockroachdb.com.

My experiments were as follows, always starting a fresh cluster; only init ever ran against it. I observed total CPU usage on the servers (which ran nothing else) for a few minutes after the last node joined.

Setup 1: 16 instances of CockroachDB per m5.16xlarge:
4 servers / 64 instances: 3.5% CPU usage (keep in mind this is 16 instances, so only 0.21% CPU per instance)
8 servers / 128 instances: 6.5% CPU usage
16 servers / 256 instances: 36.7% CPU usage (2.2% CPU per instance)
The cluster did not stabilize for 24 servers / 384 instances, so what if I use fewer instances per server?

Setup 2: 8 instances of CockroachDB per m5.16xlarge:
4 servers / 32 instances: 1.7% CPU usage
16 servers / 128 instances: 5.0% CPU usage
32 servers / 256 instances: 22.0% CPU usage (2.7% CPU per instance)
The cluster did not stabilize for 48 servers / 384 instances (even though the available CPU per instance is now of course twice that of setup 1).

There was plenty of free memory during all runs (as reported by top). The only traffic to the cluster was my browser on the built-in UI "Overview" page.

I clicked the "Profile (with labels)" link in the built-in UI on both occasions when the cluster became unstable (which is to say some ranges were continuously unavailable and many were underreplicated). In both cases the profile was >60% functions with "Gossip" in their name; the rest looked like it might be related to I/O or network scheduling, but I didn't look very closely.

In both cases the jump from 128 to 256 nodes increased CPU consumption about 4-fold, which strongly suggests something is O(nodes^2).

In any case, it's easy enough to replicate: Try to bring up a 512 node cluster of 20.1.3 and marvel at the CPU consumption. If you only use one node per physical machine it could potentially even work, but even then the CPU profile should be perfectly clear.

@irfansharif irfansharif added C-investigation Further steps needed to qualify. C-label will change. C-performance Perf of queries or internals. Solution not expected to change functional behavior. and removed X-blathers-oncall labels Jul 23, 2020
@Drahflow
Author

Drahflow commented Jul 23, 2020

Re-reading the above, one correction:

always starting a fresh cluster,

is not correct. I set up a new cluster (incl. new machines) between experiment setups 1 and 2. However, within each setup I added more nodes to an existing cluster.

@Drahflow
Author

Drahflow commented Aug 3, 2020

In preparation for a larger-scale load test run, today I confirmed the same problem on 256 t3.medium instances hosting one cockroach instance each (so it's not some artifact of shared IPs). I also checked the CPU profile more closely; the non-gossip-related functions are Go garbage collection.

@knz
Contributor

knz commented Aug 4, 2020

I have looked at the code and I have found something "interesting":

// maxPeers returns the maximum number of peers each gossip node
// may connect to. This is based on maxHops, which is a preset
// maximum for number of hops allowed before the gossip network
// will seek to "tighten" by creating new connections to distant
// nodes.
func maxPeers(nodeCount int) int {
  // ...
  maxPeers := int(math.Ceil(math.Exp(math.Log(float64(nodeCount)) / float64(maxHops-2))))
  if maxPeers < minPeers {
    return minPeers
  }
  return maxPeers
}

So on each node we will have F(n) = ceil(exp(log(n)/3)) connections to other nodes.
The division by 3 does not really alter the complexity of the formula.
This means that we have O(n) connections to other nodes, per node.

So if we look at n nodes together, we have O(n * n) inter-node connections.

So the behavior is indeed quadratic (in the worst case).

@knz
Contributor

knz commented Aug 4, 2020

Sorry, I was wrong: this "/3" does matter; it turns the complexity into a cube root.

So we have O(n^(1/3)) conns per node, and thus O(n^(4/3)) connections for the cluster as a whole. That's much better.
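
To make the growth concrete, here is a small standalone sketch (not the actual CockroachDB code; minPeers = 3 and maxHops = 5 are assumed constants, chosen so that maxHops-2 matches the "/3" above) that tabulates the formula for a few cluster sizes:

package main

import (
  "fmt"
  "math"
)

const (
  minPeers = 3 // assumed lower bound on peers per node
  maxHops  = 5 // assumed hop limit; maxHops-2 == 3 gives the cube root
)

// maxPeers re-implements the formula quoted above.
func maxPeers(nodeCount int) int {
  p := int(math.Ceil(math.Exp(math.Log(float64(nodeCount)) / float64(maxHops-2))))
  if p < minPeers {
    return minPeers
  }
  return p
}

func main() {
  // Peers per node grow like n^(1/3); total connections like n^(4/3).
  for _, n := range []int{64, 128, 256, 512, 1024} {
    fmt.Printf("n=%4d  peers/node=%2d  total conns ~%5d\n", n, maxPeers(n), n*maxPeers(n))
  }
}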

@knz
Contributor

knz commented Aug 6, 2020

Hi @Drahflow thanks again for your investigation.
We've queued an analysis on our internal roadmap to check what is going on here and then take action.

Our current standpoint is that while the behavior you found is concerning (it seems to violate our design objectives), the CPU overhead you're measuring on 200 nodes is not yet high enough for us to consider this issue "urgent". We feel that users will be OK with 1-5% CPU overhead per CockroachDB node just for gossip, up to 500 nodes.

Of course that does not mean we want to ignore the problem and that is why we queued an investigation. We plan to look at this in the v21.1 timeframe. Again thank you for your analysis.

@knz knz removed the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Aug 6, 2020
@knz knz changed the title Scaling of gossip protocol gossip: abnormal CPU usage growth with increasing node count Aug 6, 2020
@Drahflow
Author

For the record: I saw a (superficially similar) instability today on a 365 node cluster (mix of c5.9xlarge / m5.8xlarge) which was not CPU (or RAM) starved. Is the gossip handling happening in only one or two goroutines (and thus limited to one or two cores)?

The cluster initially appeared healthy, I started actual application load whereupon most nodes became "suspect". I then stopped the application load but >350 nodes remained in "suspect" status over the next 15 minutes (until I had to stop the experiment). Nodes occasionally moved to "healthy" for a short time, until again becoming "suspect". CPU utilization as reported in top was ~200% (i.e. two full cores) after the application load was stopped. Thus the low-number-of-goroutines hypothesis above.

@ajwerner
Contributor

Which version were you running? How are you deploying the database (e.g. via k8s, systemd, etc.)?

Can you look at the Log Commit Latency on the Storage dashboard and the disk utilization metrics on the Hardware dashboard? (Dashboard screenshots are appreciated if you're willing.)

@Drahflow
Author

Version: https://binaries.cockroachdb.com/cockroach-v20.1.3.linux-amd64.tgz
Deployment: Terraform onto new EC2 instances (Ubuntu ami-0062c497b55437b01) and then started via

while sleep 1; do
  /usr/local/bin/cockroach start \
    --insecure \
    --advertise-addr="$LOCAL_IP":5432 \
    --listen-addr=0.0.0.0:5432 \
    --join=10.10.224.8:5432,10.10.224.9:5432,10.10.224.10:5432,10.10.224.11:5432,10.10.224.12:5432 \
    --cache=.25 \
    --max-sql-memory=.25 \
    --store=path=/mnt/cockroach \
    --max-offset=500ms \
    --background;
done

(with the understanding that further invocations will immediately terminate upon finding the running instance had locked the storage).

The exact same setup (but on "only" 265 instances) worked 100% fine under more than 50 times the load.

@Drahflow
Author

Drahflow commented Nov 4, 2020

I looked into this further today (via /_status/gossip/local):

  • The gossip key set is growing linearly with the number of servers (distsql-draining:<node>, gossip-clients:<node>, store:<node>, liveness:<node>, node:<node>).
  • There are constant updates (at least on origStamp) on gossip-clients:<node>, liveness:<node> and store:<node>.

(But maybe this is as expected.)

@dankinder

I just started up a fresh 256-node cluster on v21.1.0-alpha.3 and found the CPU overhead of a totally idle cluster surprisingly high. Each node is consuming 200-300% CPU (about 23% of available CPU) and I haven't done anything with the cluster yet.

(Screenshot attached: Screen Shot 2021-04-30 at 11 13 33 AM)

Graphviz version:
cockroach-cpu.html.zip

@knz
Contributor

knz commented Apr 30, 2021

Thanks for checking. Is that more or less than in the previous release? (We're expecting about the same)

@dankinder

I didn't compare to the previous release; I've been using this one for the past month or two since I need some of the changes in it.

@irfansharif
Contributor

@dankinder, do you mind grabbing a profile for this running cluster, if still running? You could do it using the following (using the real address of course):

 go tool pprof 'http://localhost:8080/debug/pprof/heap?seconds=5&labels=true'

@dankinder

This includes a 5-second CPU profile, heap, and heap delta: Archive.zip

@mwang1026 mwang1026 moved this from Incoming to Unprioritized in KV May 28, 2021
@jlinder jlinder added the T-kv KV Team label Jun 16, 2021
@a-robinson
Contributor

a-robinson commented Sep 22, 2022

When I tried testing a large cluster (95 nodes) with a bit of load on it I already started seeing some bad behavior from gossip. Specifically I'd see a small number of nodes occasionally reporting a large fraction of other nodes in the cluster as unlive (going by their liveness_livenodes metric) and all of their replicas as unavailable (going by their ranges_unavailable metric). At this time no nodes were actually failing liveness heartbeats or having liveness epochs incremented, so it appeared relatively clear that a delay in processing gossip entries was at fault.

After seeing this I decided to see what was going on.


The first thing that stood out is how ridiculously often tightenNetwork gets called. When I used the logspy debug endpoint with vmodule="gossip=2" I'd generally hit my limit of 100 log messages within 350ms due to 100 separate calls to tightenNetwork() all logging the distantHops: ... VEventf. That suggests it's being called ~285 times per second, and it's very much worth noting that tightenNetwork() takes the Gossip struct's mutex lock each time (with a Lock(), not an RLock())...

This is almost certainly worth fixing. We currently call maybeTightenLocked every time we're sent data on a server connection or receive a response back on a client connection. Making things worse in terms of mutex contention, maybeTightenLocked already requires the mutex to be held but doesn't just do the check directly -- it sends a signal over a channel of size 1, and the recipient of the channel send has to retake the mutex to do its check. I see no reason why we'd ever need to check tightenNetwork hundreds of times a second.

While we could probably go further, a simple time-based check that ensures we only call it once a second (with an exception allowing it to run again sooner if the previous call actually did some work to tighten the network rather than deciding it wasn't needed) would go a long way. This change would be trivial and I'm not aware of any real downside.
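
A rough sketch of what such a gate could look like (illustrative names only, not the real pkg/gossip API):

package gossipsketch

import (
  "sync"
  "time"
)

// tightenGate throttles how often the tighten check may run. It is a
// sketch of the idea above, not CockroachDB code.
type tightenGate struct {
  mu          sync.Mutex
  lastAttempt time.Time
}

// maybeTighten would be called on every incoming gossip message, but only
// invokes the expensive check at most once per second -- unless the
// previous attempt actually tightened the network, in which case the next
// call may run immediately.
func (g *tightenGate) maybeTighten(tighten func() (didWork bool)) {
  g.mu.Lock()
  defer g.mu.Unlock()
  if time.Since(g.lastAttempt) < time.Second {
    return // checked recently and nothing needed tightening
  }
  if tighten() {
    g.lastAttempt = time.Time{} // allow an immediate re-check next time
    return
  }
  g.lastAttempt = time.Now()
}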


The other more frustrating thing I found when poking around was just how much of the gossip network's capacity is eaten up by "gossip-clients:" infos. They get gossiped more than twice as frequently as liveness information:

$ grep -o "gossip-clients:" logspy-gossip-client.txt | wc -l
    1443
$ grep -o "liveness:" logspy-gossip-client.txt | wc -l
     601

The data looks roughly the same if I check the gossip server logs:

$ grep -o "gossip-clients:" logspy-gossip-server.txt | wc -l
    1407
$ grep -o "liveness:" logspy-gossip-server.txt | wc -l
     689

The above logspy-gossip-client.txt came from collecting 2000 log lines from /debug/logspy?count=2000&duration=30s&grep=gossip&flatten=1&vmodule=client=2 and logspy-gossip-server.txt came from the same but with vmodule=server=2.

There's an old comment in gossip.go that mentions the gossip clients key has to be gossiped very frequently because "we want to detect gossip partitions faster than the node liveness timeout (9s)", but that looks like it was added by @petermattis in 18ce86b (#28963), and the main goal of that PR was to make it easier to debug gossip. It appears he also threw in the side goal of attempting to get rid of the sentinel key in the future, but it looks like that never actually happened -- the sentinel key still appears to be what we use to detect partitions/stalls. The "gossip-clients" keys look like they're only used for logging the gossip graph, which is definitely not worth sending one info per node every two seconds... And I think it makes sense to keep it that way given how much cheaper it is to frequently propagate a single sentinel key than one "gossip-clients" key per node.

It seems to me that we should be sending the gossip-clients keys much less frequently given their very limited use.


There are also some other minor inefficiencies around the key names for each gossip info being longer and more verbose than is really needed -- the infos themselves are protocol buffers, but the keys aren't, so we pay for each key name on every send. And even pickier than that, we re-gossip all the fields in the store descriptor every minute (or less) even though only some of them can actually change.

But these are significantly more minor concerns -- it looks like tightenNetwork and the "gossip-clients" keys are the big opportunities for easy improvements.

a-robinson added a commit to a-robinson/cockroach that referenced this issue Sep 22, 2022
Touches cockroachdb#51838. See the thread there for more detail, but the gist is
that tightenNetwork gets called hundreds of times per second (taking a
full mutex lock each time) on large clusters, even when nothing needs
tightening.

Release note (performance improvement): Avoid wasteful contention on the
gossip mutex caused by checking if the network needs tightening hundreds
of times per second.
@a-robinson
Contributor

Dumb example fix for the tightenNetwork performance issue: #88472

@a-robinson
Contributor

a-robinson commented Sep 22, 2022

Looking at it a bit more, visitInfos is also burning a bunch of CPU. There are a couple issues there:

  • We store all infos in a single map regardless of their type, but a number of our visitInfos() calls only care about a particular type of info. If we split the infoStore's infoMap up into a separate map for each type of info and let callers of visitInfos specify the key prefix they want, we could avoid iterating over a bunch of stuff the caller doesn't care about.
  • We keep a few types of infos around for decommissioned nodes forever, it seems. So as more nodes in a cluster are decommissioned and replaced over time, the gossip network wastes more and more cpu iterating over these mostly pointless entries (and hits bottlenecks sooner as a result). In particular, I see values for each of the following key prefixes for all node IDs that have ever existed in a cluster even if they were decommissioned long ago -- distsql-draining:<nodeid>, distsql-version:<nodeid>, and liveness:<nodeid>. Ideally none of them would be needed, but FWIW it seems to me that the liveness key is much more reasonable than the distsql ones.

Splitting up the infos map would probably be a medium-sized change, but I wouldn't expect it to have any big issues.
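
For the first bullet, a rough sketch of what a prefix-partitioned store could look like (illustrative types and names, not the real infoStore):

package gossipsketch

import "strings"

type info struct {
  key   string
  value []byte
}

// infoStore buckets infos by key prefix ("liveness", "store", "node", ...)
// so visitors that only care about one kind of info iterate over just that
// bucket instead of the whole map.
type infoStore struct {
  byPrefix map[string]map[string]info
}

func prefixOf(key string) string {
  if i := strings.IndexByte(key, ':'); i >= 0 {
    return key[:i]
  }
  return key
}

func (is *infoStore) addInfo(i info) {
  p := prefixOf(i.key)
  if is.byPrefix == nil {
    is.byPrefix = map[string]map[string]info{}
  }
  if is.byPrefix[p] == nil {
    is.byPrefix[p] = map[string]info{}
  }
  is.byPrefix[p][i.key] = i
}

// visitPrefix visits only infos under one prefix; an empty prefix visits
// everything, matching today's behavior.
func (is *infoStore) visitPrefix(prefix string, visit func(info)) {
  if prefix == "" {
    for _, bucket := range is.byPrefix {
      for _, inf := range bucket {
        visit(inf)
      }
    }
    return
  }
  for _, inf := range is.byPrefix[prefix] {
    visit(inf)
  }
}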

I don't know, however, whether there's some reason those old keys are needed for decommissioned node IDs for the lifetime of the cluster. It does seem relevant that while the liveness keys for decommissioned nodes are actively re-gossiped regularly, the distsql keys look like they haven't been gossiped since the corresponding node was shut down, and they're only still around because they were given a ttlStamp of MaxInt64.

@a-robinson
Contributor

I suppose that given that they're never re-gossiped once the nodes are removed, something like the new crdb_internal.unsafe_clear_gossip_info("gossip-key") function from #85013 could be used to manually clear out the old distsql keys from decommissioned nodes. That's certainly less ideal than having them automatically dealt with (e.g. re-sending those gossip infos with a sooner TTL when decommissioning the given node ID) but it's at least a solvable problem for someone who cares.
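
For instance, a cleanup along those lines could be scripted; here is a rough sketch (the connection string and node IDs are placeholders, and only the builtin's name comes from the PRs above):

package main

import (
  "database/sql"
  "fmt"
  "log"

  _ "github.com/lib/pq" // any Postgres-wire driver works against CockroachDB
)

func main() {
  db, err := sql.Open("postgres",
    "postgresql://root@localhost:26257/defaultdb?sslmode=disable") // placeholder
  if err != nil {
    log.Fatal(err)
  }
  defer db.Close()

  decommissioned := []int{7, 12, 23} // placeholder node IDs
  for _, id := range decommissioned {
    for _, prefix := range []string{"distsql-draining", "distsql-version"} {
      key := fmt.Sprintf("%s:%d", prefix, id)
      // Clears the stale info from gossip; see #85013 / #85505.
      if _, err := db.Exec("SELECT crdb_internal.unsafe_clear_gossip_info($1)", key); err != nil {
        log.Printf("clearing %s: %v", key, err)
      }
    }
  }
}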

craig bot pushed a commit that referenced this issue Sep 22, 2022
88472: gossip: Don't check if the network should be tightened so often r=kvoli,nvanbenschoten a=a-robinson

Touches #51838. See the thread there for more detail, but the gist is that tightenNetwork gets called hundreds of times per second (taking a full mutex lock each time) on large clusters, even when nothing needs tightening.

Release note (performance improvement): Avoid wasteful contention on the gossip mutex caused by checking if the network needs tightening hundreds of times per second.

88482: kv: remove ConditionalPutRequest.DeprecatedExpValue r=AlexTalks a=nvanbenschoten

This commit removes the ConditionalPutRequest.DeprecatedExpValue field, which was replaced by the ExpBytes introduced in 9e14f24. This has been possible since v21.1.

Release justification: None, wait.

Release note: None.

88486: github: route admission prs to targeted github team r=irfansharif a=irfansharif

Release note: None

Co-authored-by: Alex Robinson <arobinson@cloudflare.com>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com>
@a-robinson
Contributor

@kvoli @nvanbenschoten I'm also up for helping with a change that meaningfully changes the frequency of gossiping "gossip-clients" keys given that all they're used for is the periodic logging of the connectivity graph (and manual debugging of gossip issues I guess, although I'd be surprised if that comes up often). But I don't want to write up a change that you guys aren't interested in accepting/maintaining since that may be a bit more contentious than #88472.

For example I personally think a cluster setting to control whether we gossip the gossip-clients keys at all would be reasonable (instead of just gossiping them less frequently), but I don't know what you guys think.

Or if one of the other ideas is more enticing as something you'd want, let me know. But the gossip-clients keys appear to be the most egregious waste of cpu / mutex time.

@kvoli
Collaborator

kvoli commented Sep 27, 2022

@kvoli @nvanbenschoten I'm also up for helping with a change that meaningfully changes the frequency of gossiping "gossip-clients" keys given that all they're used for is the periodic logging of the connectivity graph (and manual debugging of gossip issues I guess, although I'd be surprised if that comes up often).

Changing the frequency of gossip-clients appears strictly beneficial. A cluster setting sounds good.

crdb_internal.unsafe_clear_gossip_info("gossip-key") also sounds enticing but a bigger item.

@a-robinson
Contributor

a-robinson commented Sep 27, 2022

Changing the frequency of gossip-clients appears strictly beneficial. A cluster setting sounds good.

Which sounds good - a cluster setting to change the frequency or to stop gossiping it entirely? To be clear, if we stop gossiping it entirely then the only real consequences are:

  1. We can no longer include the connectivity graph in the periodic logging of (*Gossip).LogStatus.
  2. We can no longer use it to determine whether each gossiped node is alive in crdb_internal.gossip_nodes. But I don't see why that code couldn't just use the gossiped node liveness keys instead.
  3. The crdb_internal.gossip_network virtual table wouldn't work anymore (other than to display the local node's client connections, but that's obviously not the same thing as what the table currently does).

crdb_internal.unsafe_clear_gossip_info("gossip-key") also sounds enticing but a bigger item.

I'm not sure whether I'm misunderstanding your comment or you missed the link to #85013, but that function already exists as of #85505. I tried it out and it worked well for clearing out old decommissioned nodes' "distsql-version" and "distsql-draining" keys.

@a-robinson
Contributor

It's fine if you aren't interested in a patch at the moment, but I thought I'd share some data from having completely ripped out the "gossip-clients" keys just to give a bit more evidence of how wasteful they are, because it's even bigger than I expected.

I took v22.1.6, then applied two patches: #88472 and a-robinson@952c501

Testing on that 95 node cluster that had been fairly bogged down by gossip, deploying the change made a huge difference in the gossip_bytes_sent/received and gossip_infos_sent/received metrics as well as in CPU usage:

Before (aggregated across all nodes in the cluster):
Sent: 650 MB/s
Received: 310 MB/s
Sent: 23.9k infos/s
Received: 21.3k infos/s

After (aggregated across all nodes in the cluster):
Sent: 31.3 MB/s
Received: 16.2 MB/s
Sent: 6.5k infos/s
Received: 6.4k infos/s

CPU usage per node dropped by nearly a full core (by about 90 CPU seconds/second on the 95 node cluster).


The savings are also pretty dramatic at somewhat smaller sizes, with a 75 node cluster saving 40-50 cores and having a before/after of:

Before:
Sent: 340 MB/s
Received: 165 MB/s
Sent: 14.9k infos/s
Received: 13.8k infos/s

After:
Sent: 10 MB/s
Received: 6 MB/s
Sent: 3.8k infos/s
Received: 3.5k infos/s


The numbers were a bit less dramatic on a completely fresh 75 node cluster that had no data in it, load on it, or history of decommissioned nodes, but it still cut out about two-thirds of the gossiped bytes/infos and saved about 10 cores.

@kvoli
Collaborator

kvoli commented Oct 5, 2022

Testing on that 95 node cluster that had been fairly bogged down by gossip, deploying the change made a huge difference in the gossip_bytes_sent/received and gossip_infos_sent/received metrics as well as in CPU usage:

Before (aggregated across all nodes in the cluster): Sent: 650 MB/s Received: 310 MB/s Sent: 23.9k infos/s Received: 21.3k infos/s

After (aggregated across all nodes in the cluster): Sent: 31.3 MB/s Received: 16.2 MB/s Sent: 6.5k infos/s Received: 6.4k infos/s

That's a great improvement. I think it would be great to patch in. Thanks for taking a look into this further @a-robinson.

@a-robinson
Contributor

I think it would be great to patch in.

Sorry for not being totally sure what you mean, but by "it" do you mean a complete rip out like in a-robinson@952c501 and accepting the minor changes called out in #51838 (comment)? Or something else like adding a cluster setting that disables it?

Personally I'm still in the camp of just ripping it out since I don't think I ever benefited from it even back when I worked on cockroach full time, but I don't have any visibility into how often it comes in handy for you guys these days.

@irfansharif
Contributor

yo @a-robinson, a-robinson@952c501 LGTM. These internal vtables are not that useful, definitely not worth the cost you're seeing. Some of them are being used in roachtests that have either already aged or could look at alternative sources instead. The connectivity graph has come in handy in some recent incidents, but again, I don't think it's worth the cost. If you want to polish up a-robinson@952c501 for a PR that SGTM. If there's roachtest fallout from it, we're happy to squash them ourselves. I don't think these changes are going to be backportable -- even if we tack on release notes, it seems unwise to rip out vtables in already published releases. So if you care about backporting it, I'd do so by introducing a cluster setting to stop gossiping the gossip-clients key but leave it disabled by default. On master, and perhaps 22.2 (which will be out in a few weeks), we can enable the cluster setting. Does that sound all right to you? Also happy to take over backport duty to 22.2 since it's not out yet.

@irfansharif
Contributor

irfansharif commented Oct 5, 2022

These aged roachtests could also just explicitly enable this cluster setting if we want to keep them (== roachtests) around, but that's all just unnecessary hedging -- this feels fine to rip out altogether on master.

@a-robinson
Contributor

a-robinson commented Oct 5, 2022

Hey @irfansharif, thanks for clarifying. I'm not particularly interested in doing the work to add a cluster setting and backport it, but given that it sounds like you're cool with it I can submit a slightly cleaned up version of a-robinson@952c501 to master sometime within the next week.

Waiting until v23.1 for it to be in a mainline release is fine -- I can carry it as a patch in the meantime. Although if large cluster performance/overhead/stability becomes a priority for you at all before then, I would think the cluster setting is something you may want sooner.

The connectivity graph has come in handy in some recent incidents

FWIW if you really are interested in keeping it, it still seems totally plausible to construct it when needed, either by:

  1. Continuing to have each node log who it's connected to, then processing the logs to reconstruct the graph.
  2. Adding a debug RPC endpoint that lets you get the gossip connections on demand.

@irfansharif
Contributor

Ack, can cluster-settings-ify, etc. as needed. Thanks for all the digging, this area continues to be as unloved as it was way back when, so this is all much appreciated.

craig bot pushed a commit that referenced this issue Dec 21, 2022
89613: gossip: remove frequent gossiping of gossip client connections r=irfansharif a=a-robinson

These gossip-clients keys make up two thirds or more of the gossip traffic in various large clusters I've inspected, consuming almost an entire CPU core in the worst case I've seen. They don't provide enough value to justify that sort of ongoing cost, so this commit removes them entirely as well as the periodic logging of the gossip network and the crdb_internal.gossip_network table, both of which relied on them.

Release note (backward-incompatible change): We've stopped
supporting/populating the crdb_internal.gossip_network table. It was an
internal table with no API guarantees (so perhaps not meriting a release
note?).

Release note (performance improvement): Significantly reduced CPU usage
of the underlying gossip network in large clusters.

---

Informs #51838 (largely fixes it for practical purposes, although there's likely still more that could be done)

This is clearly going to break [the gossip roachtest](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/tests/gossip.go#L50-L58), but between `@irfansharif` kindly volunteering to fix that up separately and his existing TODO in that file I've left that out of this change.

I don't know if completely removing the gossip_network table is really the best idea or if it should just be left in and only populated with the clients from the local node. For example, when run in a mixed-version cluster, does `debug zip` run all of its SQL commands against the local node or does it run some against remote nodes? If an old node ever tries to query the `gossip_network` table on a different node it could have a bad time.

`@irfansharif` `@kvoli` 

93985: ui: degrade gracefully when regions aren't known r=matthewtodd a=matthewtodd

Part of #89949

Previously, when a tenant SQL instance had spun down (leaving us no way to remember which region it had been in), the SQL Activity pages would claim that statements and transactions had occurred in an "undefined" region.

This change moves from saying "undefined" to saying nothing at all, a slightly nicer user experience.

This broader problem of losing the region mapping has been described in #93268; we'll begin addressing it shortly.

Release note: None

93989: leaktest: exclude long running logging goroutines r=srosenberg a=nicktrav

The `leaktest` package detects potential goroutine leaks by snapshotting the set of goroutines running when `leaktest.AfterTest(t)` is called, returning a closure, and comparing the set of goroutines when the closure is called (typically `defer`'d).

A race condition was uncovered in #93849 whereby logging-related goroutines that are scheduled by an `init` function in `pkg/util/logging` can sometimes be spawned _after_ the `AfterTest` function is run. When the test completes and the closure is run, the test fails due to a difference in the before / after goroutine snapshots.

This mode of failure is deemed to be a false-positive. The intention of the logging goroutines is that they live for the duration of the process. However, exactly _when_ the goroutines scheduled in the `init` functions actually start running, and hence show up in the goroutine snapshots, is non-deterministic.

Exclude the logging goroutines from the `leaktest` checks to reduce the flakiness of tests.

Closes #93849.

Release note: None.

Epic: CRDB-20293

Co-authored-by: Alex Robinson <arobinson@cloudflare.com>
Co-authored-by: Matthew Todd <todd@cockroachlabs.com>
Co-authored-by: Nick Travers <travers@cockroachlabs.com>
@kvoli kvoli removed their assignment Aug 3, 2023
@andrewbaptist
Collaborator

One minor point about the gossip network and growth relative to cluster size is the "amount of gossiped data".
Taking into account all the factors, the sources of growth are:

  1. Number of gossip connections - O(n^(4/3)) as @knz notes above.
  2. Infos that scale linearly with cluster size. This includes KeyStoreDescPrefix, KeyNodeDescPrefix, and KeyNodeLivenessPrefix.
  3. Infos that scale super-linearly with cluster size. Previously this would have included KeyGossipClientsPrefix, although I'm not sure if there are any more cases of this.

However, note that from a node's perspective, item 2 means that as the cluster grows, the gossip burden on each individual node also grows linearly, and because of item 1 it is in fact slightly faster than linear.

craig bot pushed a commit that referenced this issue Jan 9, 2024
117011: gossip: Include highWaterStamps map in gossip debug info r=shralex a=a-robinson

This can be useful information when debugging gossip-related problems (e.g. to determine how much of the data being sent around the gossip network is directly accounted for by the highWaterStamps) and it's very easy to include.

Epic: none

Release note (ops change): The gossip status advanced debug page now includes information about the server's high water stamps for every other node it knows about in the gossip cluster.

---

In my particular case, it was valuable when looking into why the amount of data being gossiped was so large (as measured via the `gossip_(bytes|infos)_(sent|received)` prometheus metrics).

Given a large cluster where many nodes are decommissioned over time, you can end up with an ever-accumulating amount of gossip from old nodes -- "distsql-draining:<node-id>" and "distsql-version:<node-id>". These keys can stick around forever (as I've previously called out on #51838), and if you don't manually clear them (using `crdb_internal.unsafe_clear_gossip_info()`) then they can have a larger effect on gossip than I'd have expected, because they cause the old decommissioned node IDs to be kept around in every node's highWaterStamps map, and that highWaterStamps map gets copied into [*every* gossip Request and Response](https://github.com/cockroachdb/cockroach/blob/master/pkg/gossip/gossip.proto#L29-L66), which can add up to a ton of extra data getting shipped around if you have a lot of decommissioned nodes.
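
As a rough back-of-the-envelope illustration of how that adds up (the per-entry byte cost and per-node message rate below are assumptions, not measurements):

package main

import "fmt"

func main() {
  const (
    bytesPerEntry  = 12   // assumed: varint node ID + nanos timestamp + proto framing
    deadNodeIDs    = 1000 // decommissioned IDs still present in highWaterStamps
    liveNodes      = 100  // current cluster size
    msgsPerNodeSec = 50   // assumed gossip requests+responses per node per second
  )
  entries := deadNodeIDs + liveNodes
  perMsg := entries * bytesPerEntry
  fmt.Printf("~%d bytes of high water stamps per message, ~%.1f KB/s per node\n",
    perMsg, float64(perMsg*msgsPerNodeSec)/1024)
}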

I have a bunch more thoughts on inefficiencies in the gossip network and how it scales at larger cluster sizes and/or when there are a lot of decommissioned nodes, but don't know how much interest there is in them. If y'all are interested, let me know and I'm happy to share some notes and ideas.

Co-authored-by: Alex Robinson <arobinson@cloudflare.com>