Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gossip: Include highWaterStamps map in gossip debug info #117011

Merged
merged 1 commit into from Jan 9, 2024

Conversation

a-robinson
Copy link
Contributor

This can be useful information when debugging gossip-related problems (e.g. to determine how much of the data being set around the gossip network is directly accounted for by the highWaterStamps) and it's very easy to include.

Epic: none

Release note (ops change): The gossip status advanced debug page now includes information about the server's high water stamps for every other node it knows about in the gossip cluster.


In my particular case, it was valuable when looking into why the amount of data being gossiped was so large (as measured via the gossip_(bytes|infos)_(sent|received) prometheus metrics).

Given a large cluster where nodes many nodes are decommissioned over time, you can end up with an ever-accumulating amount of gossip from old nodes -- "distsql-draining:" and "distsql-version:". These keys can stick around forever (as I've previously called out on #51838), and if you don't manually clear them (using crdb_internal.unsafe_clear_gossip_info()) then they can have a larger effect on gossip than I'd have expected because they cause the old decommmissioned node IDs to be kept around in every node's highWaterStamps map, and that highWaterStamp map gets copied into every gossip Request and Response, which can add up to a ton of extra data getting shipped around if you have a lot of decommissioned nodes.

I have a bunch more thoughts on inefficiencies in the gossip network and how it scales at larger cluster sizes and/or when there are a lot of decommissioned nodes, but don't know how much interest there is in them. If y'all are interested, let me know and I'm happy to share some notes and ideas.

This can be useful information when debugging gossip-related problems
(e.g. to determine how much of the data being set around the gossip
network is directly accounted for by the highWaterStamps) and it's very
easy to include.

Epic: none

Release note (ops change): The gossip status advanced debug page now
includes information about the server's high water stamps for every
other node it knows about in the gossip cluster.
@a-robinson a-robinson requested a review from a team as a code owner December 22, 2023 15:32
Copy link

blathers-crl bot commented Dec 22, 2023

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

Thank you for contributing to CockroachDB. Please ensure you have followed the guidelines for creating a PR.

My owl senses detect your PR is good for review. Please keep an eye out for any test failures in CI.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@blathers-crl blathers-crl bot added the O-community Originated from the community label Dec 22, 2023
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Contributor

@shralex shralex left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the PR, Alex! I think we are approaching a point where more investment in Gossip is warranted, and will certainly reach out, it would be great to collaborate.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @a-robinson)


pkg/gossip/gossip.proto line 134 at r1 (raw file):

  ServerStatus server = 3 [(gogoproto.nullable) = false];
  Connectivity connectivity = 4 [(gogoproto.nullable) = false];
  map<int32, int64> high_water_stamps = 5 [(gogoproto.castkey) = "github.com/cockroachdb/cockroach/pkg/roachpb.NodeID", (gogoproto.nullable) = false];

Is there any concern about backward compatibility when this field is marked non-nullable ?

Copy link
Contributor Author

@a-robinson a-robinson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @shralex)


pkg/gossip/gossip.proto line 134 at r1 (raw file):

Previously, shralex wrote…

Is there any concern about backward compatibility when this field is marked non-nullable ?

I can't say I've gone back to read the gogoproto spec to confirm what's expected, but I did just test it out in both directions and it's worked ok:

  • From node containing this patch to node not containing this patch -> the result included an empty object for the highWaterStamps field, i.e. "highWaterStamps": {}
  • From node not containing this patch to node containing this patch -> the result omitted the highWaterStamps field entirely.

But if you'd rather I do something else here (nullable = true, I guess?), I'm happy to change it. Just let me know.

@shralex
Copy link
Contributor

shralex commented Jan 8, 2024

bors r+

Copy link
Contributor

@shralex shralex left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @a-robinson)


pkg/gossip/gossip.proto line 134 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

I can't say I've gone back to read the gogoproto spec to confirm what's expected, but I did just test it out in both directions and it's worked ok:

  • From node containing this patch to node not containing this patch -> the result included an empty object for the highWaterStamps field, i.e. "highWaterStamps": {}
  • From node not containing this patch to node containing this patch -> the result omitted the highWaterStamps field entirely.

But if you'd rather I do something else here (nullable = true, I guess?), I'm happy to change it. Just let me know.

Thank you for checking!

@craig
Copy link
Contributor

craig bot commented Jan 9, 2024

Build succeeded:

@craig craig bot merged commit c80a161 into cockroachdb:master Jan 9, 2024
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
O-community Originated from the community
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants