Merge #117011
117011: gossip: Include highWaterStamps map in gossip debug info r=shralex a=a-robinson

This can be useful information when debugging gossip-related problems (e.g. to determine how much of the data being sent around the gossip network is directly accounted for by the highWaterStamps), and it's very easy to include.

Epic: none

Release note (ops change): The gossip status advanced debug page now includes information about the server's high water stamps for every other node it knows about in the gossip cluster.

---

In my particular case, it was valuable when looking into why the amount of data being gossiped was so large (as measured via the `gossip_(bytes|infos)_(sent|received)` prometheus metrics).

Given a large cluster where many nodes are decommissioned over time, you can end up with an ever-accumulating amount of gossip from old nodes: "distsql-draining:<node-id>" and "distsql-version:<node-id>". These keys can stick around forever (as I've previously called out on #51838). If you don't manually clear them (using `crdb_internal.unsafe_clear_gossip_info()`), they can have a larger effect on gossip than I'd have expected, because they cause the old decommissioned node IDs to be kept around in every node's highWaterStamps map. That highWaterStamps map gets copied into [*every* gossip Request and Response](https://github.com/cockroachdb/cockroach/blob/master/pkg/gossip/gossip.proto#L29-L66), which can add up to a ton of extra data getting shipped around if you have a lot of decommissioned nodes.
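To put a rough number on that overhead, here is a minimal sketch (not code from this commit) that estimates how many bytes the stale entries add to each message. It assumes the map is encoded as a standard protobuf `map<int32, int64>` whose field number is below 16, and that stamps are nanosecond-scale timestamps; the helper names (`varintLen`, `mapEntryLen`, `staleOverhead`) and the sample node IDs and stamps are hypothetical.

```go
package main

import "fmt"

// varintLen returns the number of bytes protobuf needs to encode v as a varint.
func varintLen(v uint64) int {
	n := 1
	for v >= 0x80 {
		v >>= 7
		n++
	}
	return n
}

// mapEntryLen estimates the wire size of one map<int32, int64> entry.
// Assumption: the map's field number is below 16, so the outer tag is one byte.
// Each entry is a nested message: key tag + key varint + value tag + value varint.
func mapEntryLen(nodeID int32, stamp int64) int {
	payload := 1 + varintLen(uint64(nodeID)) + 1 + varintLen(uint64(stamp))
	return 1 + varintLen(uint64(payload)) + payload
}

// staleOverhead counts the node IDs present in stamps but absent from live,
// and sums the estimated bytes those entries add to every message.
func staleOverhead(stamps map[int32]int64, live map[int32]bool) (ids int, bytes int) {
	for id, ts := range stamps {
		if !live[id] {
			ids++
			bytes += mapEntryLen(id, ts)
		}
	}
	return ids, bytes
}

func main() {
	// Hypothetical data: nodes 7 and 9 were decommissioned long ago.
	stamps := map[int32]int64{
		1: 1704758400000000000,
		2: 1704758400000000000,
		7: 1600000000000000000,
		9: 1600000000000000000,
	}
	live := map[int32]bool{1: true, 2: true}

	ids, bytes := staleOverhead(stamps, live)
	fmt.Printf("%d stale node IDs add about %d bytes to each Request/Response\n", ids, bytes)
}
```

Under these assumptions each stale entry costs about 14 bytes, so the per-message cost grows linearly with the number of decommissioned nodes and is paid again on every Request and Response.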

I have a bunch more thoughts on inefficiencies in the gossip network and how it scales at larger cluster sizes and/or when there are a lot of decommissioned nodes, but don't know how much interest there is in them. If y'all are interested, let me know and I'm happy to share some notes and ideas.

Co-authored-by: Alex Robinson <arobinson@cloudflare.com>
craig[bot] and a-robinson committed Jan 9, 2024
2 parents 9c35302 + a4a3033 commit c80a161
Showing 2 changed files with 5 additions and 3 deletions.
7 changes: 4 additions & 3 deletions pkg/gossip/gossip.go
@@ -997,9 +997,10 @@ func (g *Gossip) GetInfoStatus() InfoStatus {
 	g.mu.RLock()
 	defer g.mu.RUnlock()
 	is := InfoStatus{
-		Infos:  make(map[string]Info),
-		Client: clientStatus,
-		Server: serverStatus,
+		Infos:           make(map[string]Info),
+		Client:          clientStatus,
+		Server:          serverStatus,
+		HighWaterStamps: g.mu.is.getHighWaterStamps(),
 	}
 	for k, v := range g.mu.is.Infos {
 		is.Infos[k] = *protoutil.Clone(v).(*Info)
1 change: 1 addition & 0 deletions pkg/gossip/gossip.proto
@@ -131,6 +131,7 @@ message InfoStatus {
   ClientStatus client = 2 [(gogoproto.nullable) = false];
   ServerStatus server = 3 [(gogoproto.nullable) = false];
   Connectivity connectivity = 4 [(gogoproto.nullable) = false];
+  map<int32, int64> high_water_stamps = 5 [(gogoproto.castkey) = "github.com/cockroachdb/cockroach/pkg/roachpb.NodeID", (gogoproto.nullable) = false];
 }
 
 // Info is the basic unit of information traded over the
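As an illustration of what the `castkey` option on the new field means for Go callers, here is a small self-contained sketch. It assumes the generated field ends up as a map keyed by the node ID type rather than a raw `int32`; the `NodeID` and `InfoStatus` types below are local stand-ins, not the generated code, and `sortedNodeIDs` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"sort"
)

// NodeID is a local stand-in for roachpb.NodeID (an int32-based type),
// declared here only so the sketch compiles on its own.
type NodeID int32

// InfoStatus mirrors just the new field. With castkey, the map key is
// assumed to be the NodeID type instead of a bare int32.
type InfoStatus struct {
	HighWaterStamps map[NodeID]int64
}

// sortedNodeIDs returns the map's keys in ascending order, since Go map
// iteration order is unspecified and debug output reads better when stable.
func sortedNodeIDs(is InfoStatus) []NodeID {
	ids := make([]NodeID, 0, len(is.HighWaterStamps))
	for id := range is.HighWaterStamps {
		ids = append(ids, id)
	}
	sort.Slice(ids, func(i, j int) bool { return ids[i] < ids[j] })
	return ids
}

func main() {
	// Hypothetical stamps, for illustration only.
	is := InfoStatus{HighWaterStamps: map[NodeID]int64{3: 30, 1: 10, 2: 20}}
	for _, id := range sortedNodeIDs(is) {
		fmt.Printf("n%d: %d\n", id, is.HighWaterStamps[id])
	}
}
```

Keying the map by the node ID type lets callers index it with values they already hold, without converting to and from `int32` at each use.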