statusccl: temporarily serve /_status/nodes to tenants #93268

Merged
merged 1 commit into cockroachdb:master from tenant-nodes on Dec 9, 2022

Conversation

matthewtodd
Contributor

Fixes #92065
Part of #89949

In our current UI architecture, many of the SQL Activity pages map node or SQL instance IDs to regions at page load time, using the response from this /_status/nodes endpoint to get the information they need.

This approach is problematic for ephemeral serverless tenants, and fixing it will require a broader reimagining: we will probably end up storing locality information directly in many of the sqlstats tables, which will in turn probably require something like sending instance localities along in the distsql tracing. An initial exploration in this direction happened in #92259.

In the meantime, as a stop-gap measure, we implement a reduced version of this /_status/nodes endpoint for tenants, keeping the SQL Activity pages at least partially working for multiregion tenants.

Release note: None
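
To make the shape of that stop-gap concrete, here is a minimal sketch, assuming illustrative stand-in types rather than the actual cockroach protos: the tenant answers `/_status/nodes` by listing its live SQL instances and reporting each instance's locality, which is all the SQL Activity pages need for their ID-to-region mapping.

```go
package server

import "context"

// localityTier is one key/value pair of a locality, e.g. {"region", "us-east1"}.
type localityTier struct {
	Key, Value string
}

// nodeStatus is the reduced per-"node" payload: just an ID and a locality.
type nodeStatus struct {
	NodeID   int32
	Locality []localityTier
}

type nodesResponse struct {
	Nodes []nodeStatus
}

// sqlInstance and instanceReader model the tenant's view of its own SQL
// pods; the real code would consult the sqlinstance subsystem instead.
type sqlInstance struct {
	InstanceID int32
	Locality   []localityTier
}

type instanceReader interface {
	GetAllInstances(ctx context.Context) ([]sqlInstance, error)
}

// tenantNodes builds the reduced response: each live SQL instance is
// presented as a "node" carrying only its ID and locality tiers.
func tenantNodes(ctx context.Context, reader instanceReader) (*nodesResponse, error) {
	instances, err := reader.GetAllInstances(ctx)
	if err != nil {
		return nil, err
	}
	resp := &nodesResponse{Nodes: make([]nodeStatus, 0, len(instances))}
	for _, inst := range instances {
		resp.Nodes = append(resp.Nodes, nodeStatus{
			NodeID:   inst.InstanceID,
			Locality: inst.Locality,
		})
	}
	return resp, nil
}
```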

@cockroach-teamcity
Member

This change is Reviewable

@matthewtodd matthewtodd marked this pull request as ready for review December 8, 2022 18:34
@matthewtodd matthewtodd requested review from a team as code owners December 8, 2022 18:34
@knz knz requested a review from dhartunian December 8, 2022 18:38
@knz knz added the T-multitenant (Issues owned by the multi-tenant virtual team) and A-multitenancy (Related to multi-tenancy) labels and removed the T-multitenant label on Dec 8, 2022
Collaborator

@dhartunian dhartunian left a comment

:lgtm: Only one idea to consider: just inlining the code.

Reviewed all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @matthewtodd)


pkg/server/fanout_clients.go line 106 at r1 (raw file):

		})
	}
	return &resp, err

I don't feel too strongly about it, but since there's no matching impl for the kvFanoutClient, I'd just inline this in the Nodes handler.


pkg/server/status.go line 1516 at r1 (raw file):

	}

	return s.serverIterator.nodes(ctx)

FYI: you probably checked already, but this could create a bad failure state in DB Console. It's accustomed to getting an unimplemented 5xx error from this endpoint for tenants, and now it will get a 200, which could trigger additional runtime exceptions elsewhere due to missing data.

Contributor Author

@matthewtodd matthewtodd left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @dhartunian)


pkg/server/status.go line 1516 at r1 (raw file):

Previously, dhartunian (David Hartunian) wrote…

FYI: you probably checked already, but this could create a bad failure state in DB Console. It's accustomed to getting an unimplemented 5xx error from this endpoint for tenants, and now it will get a 200, which could trigger additional runtime exceptions elsewhere due to missing data.

Thanks for calling this out. Looking through the selectors that depend on this response, I think we're good, but I'll double-check with fresh eyes in the morning before merging.


pkg/server/fanout_clients.go line 106 at r1 (raw file):

Previously, dhartunian (David Hartunian) wrote…

I don't feel too strongly about it, but since there's no matching impl for the kvFanoutClient, I'd just inline this in the Nodes handler.

Good idea, done. (Should make your getAllNodes / nodesList work easier to reason about, too!)

Contributor Author

@matthewtodd matthewtodd left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @dhartunian)


pkg/server/status.go line 1516 at r1 (raw file):

Previously, matthewtodd (Matthew Todd) wrote…

Thanks for calling this out. Looking through the selectors that depend on this response, I think we're good, but I'll double-check with fresh eyes in the morning before merging.

Found one, tested and fixed. Will merge on green. Thanks again, David!

@matthewtodd
Contributor Author

bors r+

@craig
Contributor

craig bot commented Dec 9, 2022

Build succeeded.

@craig craig bot merged commit 4a641ac into cockroachdb:master Dec 9, 2022
@matthewtodd matthewtodd deleted the tenant-nodes branch December 9, 2022 19:16
craig bot pushed a commit that referenced this pull request Dec 21, 2022
89613: gossip: remove frequent gossiping of gossip client connections r=irfansharif a=a-robinson

These gossip-clients keys make up two thirds or more of the gossip traffic in various large clusters I've inspected, consuming almost an entire CPU core in the worst case I've seen. They don't provide enough value to justify that sort of ongoing cost, so this commit removes them entirely as well as the periodic logging of the gossip network and the crdb_internal.gossip_network table, both of which relied on them.

Release note (backward-incompatible change): We've stopped supporting/populating the crdb_internal.gossip_network table. It was an internal table with no API guarantees (so perhaps not meriting a release note?).

Release note (performance improvement): Significantly reduced CPU usage
of the underlying gossip network in large clusters.

---

Informs #51838 (largely fixes it for practical purposes, although there's likely still more that could be done)

This is clearly going to break [the gossip roachtest](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/tests/gossip.go#L50-L58), but between `@irfansharif` kindly volunteering to fix that up separately and his existing TODO in that file I've left that out of this change.

I don't know if completely removing the gossip_network table is really the best idea, or if it should just be left in and only populated with the clients from the local node. For example, when run in a mixed-version cluster, does `debug zip` run all of its SQL commands against the local node, or does it run some against remote nodes? If an old node ever tries to query the `gossip_network` table on a different node, it could have a bad time.

`@irfansharif` `@kvoli` 

93985: ui: degrade gracefully when regions aren't known r=matthewtodd a=matthewtodd

Part of #89949

Previously, when a tenant SQL instance had spun down (leaving us no way to remember which region it had been in), the SQL Activity pages would claim that statements and transactions had occurred in an "undefined" region.

This change moves from saying "undefined" to saying nothing at all, a slightly nicer user experience.

This broader problem of losing the region mapping has been described in #93268; we'll begin addressing it shortly.

Release note: None

93989: leaktest: exclude long running logging goroutines r=srosenberg a=nicktrav

The `leaktest` package detects potential goroutine leaks by snapshotting the set of goroutines running when `leaktest.AfterTest(t)` is called, returning a closure, and comparing the set of goroutines when the closure is called (typically `defer`'d).
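
As a concrete illustration of that pattern, a minimal usage sketch (the test itself is made up, but `defer leaktest.AfterTest(t)()` is the standard idiom with CockroachDB's `pkg/util/leaktest`):

```go
package example_test

import (
	"testing"

	"github.com/cockroachdb/cockroach/pkg/util/leaktest"
)

// AfterTest snapshots the currently running goroutines and returns a
// closure; the deferred call re-snapshots at test exit and fails the
// test if goroutines absent from the first snapshot are still alive.
func TestNoGoroutineLeak(t *testing.T) {
	defer leaktest.AfterTest(t)()

	done := make(chan struct{})
	go func() {
		// ... work spawned by the test ...
		close(done)
	}()
	<-done // the goroutine must exit before the deferred check runs
}
```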

A race condition was uncovered in #93849 whereby logging-related goroutines that are scheduled by an `init` function in `pkg/util/logging` can sometimes be spawned _after_ the `AfterTest` function is run. When the test completes and the closure is run, the test fails due to a difference in the before / after goroutine snapshots.

This mode of failure is deemed a false positive. The intention is that the logging goroutines live for the duration of the process. However, exactly _when_ the goroutines scheduled in the `init` functions actually start running, and hence show up in the goroutine snapshots, is non-deterministic.

Exclude the logging goroutines from the `leaktest` checks to reduce the flakiness of tests.
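
A hedged sketch of what such an exclusion can look like; the function name and stack substrings here are illustrative assumptions, not the actual leaktest code:

```go
package leaktest

import "strings"

// interestingGoroutine reports whether a goroutine's stack should count
// toward the leak check. Goroutines spawned by the logging package's
// init functions are excluded: they intentionally live for the life of
// the process and start at a non-deterministic time.
func interestingGoroutine(stack string) bool {
	for _, ignored := range []string{
		"pkg/util/log.init",        // assumed marker for log's init-spawned goroutines
		"pkg/util/log.flushDaemon", // assumed name of a background flusher
	} {
		if strings.Contains(stack, ignored) {
			return false
		}
	}
	return true
}
```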

Closes #93849.

Release note: None.

Epic: CRDB-20293

Co-authored-by: Alex Robinson <arobinson@cloudflare.com>
Co-authored-by: Matthew Todd <todd@cockroachlabs.com>
Co-authored-by: Nick Travers <travers@cockroachlabs.com>
adityamaru pushed a commit to adityamaru/cockroach that referenced this pull request Dec 22, 2022
craig bot pushed a commit that referenced this pull request Mar 6, 2023
95449: sqlstats: include regions in statement_statistics r=matthewtodd a=matthewtodd

Part of #89949.

This completes the work abandoned in #92259, properly storing the regions in which a statement was executed, rather than waiting until UI render time to try to map (dangerously ephemeral) node IDs to their respective regions.

With this work in place, we will be able to remove both the UI mapping code (usages of this [selector][]) and the stop-gap work of #93268.

[selector]: https://github.com/cockroachdb/cockroach/blob/0f6333cbd6c78264a59dc435324c0c33d75ea148/pkg/ui/workspaces/cluster-ui/src/store/nodes/nodes.selectors.ts#L52-L64

Release note (sql change): A regions field was added to the statistics column of crdb_internal.statement_statistics, reporting the regions of the nodes on which the statement was executed.

Co-authored-by: Matthew Todd <todd@cockroachlabs.com>
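
For readers who want to inspect the new field, a hedged sketch using Go's database/sql; the exact JSON path is an assumption based on the release note above, so verify it against the actual schema:

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // Postgres-wire driver; CockroachDB speaks pgwire
)

func main() {
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Pull each statement fingerprint alongside the regions it ran in.
	rows, err := db.Query(`
		SELECT metadata->>'query',
		       statistics->'statistics'->'regions'
		FROM crdb_internal.statement_statistics
		LIMIT 10`)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var query string
		var regions []byte // raw JSON array of region names; may be NULL
		if err := rows.Scan(&query, &regions); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("%s -> %s\n", query, regions)
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}
```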
craig bot pushed a commit that referenced this pull request Mar 14, 2023
96991: roachtest: update mixed-version backup to use new framework r=srosenberg a=renatolabs


This updates the `backup/mixed-version` roachtest to use the recently
introduced mixed-version roachtest framework (`mixedversion` package).

The main behavior exercised remains the same: backups are taken in
mixed-binary state, and those backups are restored and verified at the
end of the test. However, this commit also improves the coverage of
mixed-version backup testing in a few ways:

* **Randomization**. By virtue of using the new framework, most runs
will be different from one another since the order of actions taken by
the test will be different. Previously, backups would always be taken
with 2 nodes in the old version and 2 nodes in the new version. Now,
backups can be taken when an arbitrary number of nodes is running the
new version. As a consequence, it's also possible that some executions
will attempt backups when all nodes are running a new binary version,
but the cluster version itself has not been updated. Other points of
new randomization include the choice of the node's external dir where
backups are stored, which node to connect to when running certain
statements, and how long to wait between backups.

* **Backup Options**. Backups will randomly be created with
`revision_history` enabled, or with an `encryption_passphrase`.

* **Downgrades**. The cluster is also downgraded in mixed-version
tests. No downgrades happened in that test before this commit.

* **Workload**. Instead of using a fixed call to `generate_series` to
generate data between backups, the test now runs the `bank` workload
continuously during the test. A random wait between backups allows the
workload to make changes to the underlying table during the test and
for the backups to be taken while writes are taking place.

* **Finalization**: the test _may_ attempt to create a backup as the
upgrade is finalizing (i.e., migrations are running and cluster
version is advancing).

In addition, this test will also see improved coverage as we make more
improvements to test plans generated by the `mixedversion` package.
These changes will create more backup scenarios in the future without
requiring any code changes to this test.

This test has already helped us uncover one backup bug (#97953).

Epic: CRDB-19321

Release note: None

98398: statusccl: stop serving /_status/nodes to tenants r=matthewtodd a=matthewtodd

Fixes #98057.

This reverts the work of #93268, which is no longer necessary now that we are eagerly capturing region information at execution time in #95449.

Release note: None

Co-authored-by: Renato Costa <renato@cockroachlabs.com>
Co-authored-by: Matthew Todd <todd@cockroachlabs.com>
Labels
A-multitenancy (Related to multi-tenancy)
Development

Successfully merging this pull request may close these issues.

kv,server: tenant-scope nodes endpoint
4 participants