New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
gossip: Include highWaterStamps map in gossip debug info #117011
gossip: Include highWaterStamps map in gossip debug info #117011
Conversation
This can be useful information when debugging gossip-related problems (e.g. to determine how much of the data being set around the gossip network is directly accounted for by the highWaterStamps) and it's very easy to include. Epic: none Release note (ops change): The gossip status advanced debug page now includes information about the server's high water stamps for every other node it knows about in the gossip cluster.
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? Thank you for contributing to CockroachDB. Please ensure you have followed the guidelines for creating a PR. My owl senses detect your PR is good for review. Please keep an eye out for any test failures in CI. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the PR, Alex! I think we are approaching a point where more investment in Gossip is warranted, and will certainly reach out, it would be great to collaborate.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @a-robinson)
pkg/gossip/gossip.proto
line 134 at r1 (raw file):
ServerStatus server = 3 [(gogoproto.nullable) = false]; Connectivity connectivity = 4 [(gogoproto.nullable) = false]; map<int32, int64> high_water_stamps = 5 [(gogoproto.castkey) = "github.com/cockroachdb/cockroach/pkg/roachpb.NodeID", (gogoproto.nullable) = false];
Is there any concern about backward compatibility when this field is marked non-nullable ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @shralex)
pkg/gossip/gossip.proto
line 134 at r1 (raw file):
Previously, shralex wrote…
Is there any concern about backward compatibility when this field is marked non-nullable ?
I can't say I've gone back to read the gogoproto spec to confirm what's expected, but I did just test it out in both directions and it's worked ok:
- From node containing this patch to node not containing this patch -> the result included an empty object for the
highWaterStamps
field, i.e."highWaterStamps": {}
- From node not containing this patch to node containing this patch -> the result omitted the
highWaterStamps
field entirely.
But if you'd rather I do something else here (nullable = true, I guess?), I'm happy to change it. Just let me know.
bors r+ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @a-robinson)
pkg/gossip/gossip.proto
line 134 at r1 (raw file):
Previously, a-robinson (Alex Robinson) wrote…
I can't say I've gone back to read the gogoproto spec to confirm what's expected, but I did just test it out in both directions and it's worked ok:
- From node containing this patch to node not containing this patch -> the result included an empty object for the
highWaterStamps
field, i.e."highWaterStamps": {}
- From node not containing this patch to node containing this patch -> the result omitted the
highWaterStamps
field entirely.But if you'd rather I do something else here (nullable = true, I guess?), I'm happy to change it. Just let me know.
Thank you for checking!
Build succeeded: |
This can be useful information when debugging gossip-related problems (e.g. to determine how much of the data being set around the gossip network is directly accounted for by the highWaterStamps) and it's very easy to include.
Epic: none
Release note (ops change): The gossip status advanced debug page now includes information about the server's high water stamps for every other node it knows about in the gossip cluster.
In my particular case, it was valuable when looking into why the amount of data being gossiped was so large (as measured via the
gossip_(bytes|infos)_(sent|received)
prometheus metrics).Given a large cluster where nodes many nodes are decommissioned over time, you can end up with an ever-accumulating amount of gossip from old nodes -- "distsql-draining:" and "distsql-version:". These keys can stick around forever (as I've previously called out on #51838), and if you don't manually clear them (using
crdb_internal.unsafe_clear_gossip_info()
) then they can have a larger effect on gossip than I'd have expected because they cause the old decommmissioned node IDs to be kept around in every node's highWaterStamps map, and that highWaterStamp map gets copied into every gossip Request and Response, which can add up to a ton of extra data getting shipped around if you have a lot of decommissioned nodes.I have a bunch more thoughts on inefficiencies in the gossip network and how it scales at larger cluster sizes and/or when there are a lot of decommissioned nodes, but don't know how much interest there is in them. If y'all are interested, let me know and I'm happy to share some notes and ideas.