release-21.1: sql: add virtual tables to expose remote DistSQL flows #66332
Conversation
#66333 needs to be backported here too.
Force-pushed from 168afe3 to ceb7561.
This needs to be merged after #66331, and I also want to bake this PR for a month, targeting the 21.1.4 release.
Force-pushed from ceb7561 to eb06269.
Andrew W pointed out that our newly proposed backport policy suggests not backporting new virtual tables, and that if we do, they should be hidden behind a feature flag. Given that recommendation, I'm a bit on the fence about whether we want to proceed with this backport. I wanted to backport it to help ourselves (CRDB engineers) debug customer incidents, since we recently had several cases where a lot of leftover remote flows contributed to (or caused) the outage. I'm still leaning toward backporting this and will need to add a feature flag. Thoughts? cc @cockroachdb/sql-queries
I'd be in favor of backporting this behind a feature flag since it seems like it will be very useful for debugging issues.
This commit renames some things related to `ListContentionEvents` RPC to something like `ListActivity*` in order to be reused by the follow-up commit that will introduce `ListDistSQLFlows` RPC. Also, a couple of redundant method implementations are removed. Release note: None
Force-pushed from eb06269 to 6569425.
I rebased on top of the latest 21.1 without introducing the cluster setting for these new virtual tables.
This commit introduces two virtual tables `crdb_internal.(node|cluster)_distsql_flows` that expose information about the DistSQL flows that have gone through the flow scheduler. Notably, this information will not include any "local" flows (those running on the gateway node for the query) because they don't go through the flow scheduler. The goal of these virtual tables is to provide more observability into the state of distributed query execution, especially when things are shut down in an ungraceful manner. Previously, the only way to see the queued-up or running flows would be via metrics. This information is exposed via a couple of new methods added onto the status server, and it is required that the user is either a superuser or has the VIEWACTIVITY permission. Release note (sql change): a couple of virtual tables `crdb_internal.(node|cluster)_distsql_flows` are added that expose information about the flows of DistSQL execution scheduled on remote nodes. These tables don't include any information about non-distributed queries or about local flows (from the perspective of the gateway node of the query).
In 38dbeae we broke up the creation of the flow scheduler in order to give the status server access to it. That required separating out the initialization of the flow metrics into a separate step performed after the flow scheduler is created. However, as it turns out, it is possible that the DistSQL server receives some RPC calls before it is `Start`ed (the place where we put the metrics initialization). This can happen because the DistSQL server is registered as a gRPC service once it is created, and if an RPC comes in before the server is started, an NPE would currently occur. This commit fixes the issue by moving the initialization of the flow scheduler into the constructor of the DistSQL server (early enough not to race against incoming RPCs). Release note: None (no stable release with this bug)
Force-pushed from 6569425 to 9a56def.
Fixed the linter, PTAL.
Reviewed 51 of 51 files at r6, 39 of 51 files at r7, 2 of 2 files at r9.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @jordanlewis, @rytaft, and @tbg)
TFTR!
Backport 2/2 commits from #65727.
Backport 1/1 commits from #66333.
/cc @cockroachdb/release
server: minor cleanup around listing of contention events
This commit renames some things related to the `ListContentionEvents` RPC to something like `ListActivity*` in order to be reused by the follow-up commit that will introduce the `ListDistSQLFlows` RPC. Also, a couple of redundant method implementations are removed.

Release note: None
sql: add virtual tables to expose remote DistSQL flows
This commit introduces two virtual tables `crdb_internal.(node|cluster)_distsql_flows` that expose the information about the DistSQL flows that have gone through the flow scheduler. Notably, this information will not include any "local" flows (those running on the gateway node for the query) because they don't go through the flow scheduler.

The goal of these virtual tables is to provide more observability into the state of distributed query execution, especially when things are shut down in an ungraceful manner. Previously, the only way to see the queued-up or running flows would be via metrics.

This information is exposed via a couple of new methods added onto the status server, and it is required that the user is either a superuser or has the VIEWACTIVITY permission.

Release note (sql change): a couple of virtual tables `crdb_internal.(node|cluster)_distsql_flows` are added that expose the information about the flows of DistSQL execution scheduled on remote nodes. These tables don't include any information about non-distributed queries or about local flows (from the perspective of the gateway node of the query).
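As a sketch of how an operator might use these tables when debugging leftover remote flows (the column names below are assumptions for illustration, not taken from this PR's schema):

```sql
-- Hypothetical usage sketch; column names (node_id, flow_id, status, since)
-- are assumed and may differ from the actual table definition.
SELECT node_id, flow_id, status, since
FROM crdb_internal.cluster_distsql_flows
ORDER BY since ASC;
```

A query like this would surface remote flows that have been sitting queued or running for a long time across the cluster, which is exactly the observability gap the commit message describes.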
distsql: initialize the flow scheduler earlier
In 38dbeae we broke up the creation of the flow scheduler in order to give the status server access to it. That required separating out the initialization of the flow metrics into a separate step that is performed after the flow scheduler is created. However, as it turns out, it is possible that the DistSQL server receives some RPC calls before it is `Start`ed (the place where we put the metrics initialization). This can happen because the DistSQL server is registered as a gRPC service once it is created, and if an RPC comes in before the server is started, an NPE would currently occur.

This commit fixes the issue by moving the initialization of the flow scheduler into the constructor of the DistSQL server (early enough to not race against incoming RPCs).

Release note: None (no stable release with this bug)
Release justification: low-risk addition that can improve the debugging of
thorny issues with DistSQL flows not being properly canceled.