Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

release-21.1: sql: add virtual tables to expose remote DistSQL flows #66332

Merged
merged 3 commits into from
Sep 15, 2021

Conversation

yuzefovich
Copy link
Member

@yuzefovich yuzefovich commented Jun 10, 2021

Backport 2/2 commits from #65727.
Backport 1/1 commits from #66333.

/cc @cockroachdb/release


server: minor cleanup around listing of contention events

This commit renames some things related to ListContentionEvents RPC to
something like ListActivity* in order to be reused by the follow-up
commit that will introduce ListDistSQLFlows RPC. Also, a couple of
redundant method implementations are removed.

Release note: None

sql: add virtual tables to expose remote DistSQL flows

This commit introduces two virtual tables
crdb_internal.(node|cluster)_distsql_flows that expose the information
about the DistSQL flows that have gone through the flow scheduler.
Notably, this information will not include any "local" flows (those
running on the gateway node for the query) because they don't go through
the flow scheduler.

The goal of these virtual tables is to provide more observability into
the state of the distributed query execution, especially when things are
shutdown in an ungraceful manner. Previously, the only way to see the
queued up or running flows would be via metrics.

This information is exposed via another couple of methods that are added
onto the status server, and it is required that the user is either
a superuser or has VIEWACTIVITY permissions.

Release note (sql change): a couple of virtual tables
crdb_internal.(node|cluster)_distsql_flows are added that expose the
information about the flows of the DistSQL execution scheduled on remote
nodes. These tables don't include any information about the
non-distributed queries nor about local flows (from the perspective of
the gateway node of the query).

distsql: initialize the flow scheduler earlier

In 38dbeae we broke up the creation of
the flow scheduler in order to give access to it to the status server.
That required separating out initialization of the flow metrics into
a separate step that is performed after the flow scheduler is created.

However, as it turns out, it is possible that the DistSQL server
receives some RPC calls before it is Started (the place where we put
the metrics initialization). This can happen because the DistSQL server
is registered as a gRPC service once it is created, and if an RPC comes
in before the server is started, a NPE would currently occur.

This commit fixes the issue by moving the initialization of the flow
scheduler into the constructor of the DistSQL server (early enough to
not race against incoming RPCs).

Release note: None (no stable release with this bug)

Release justification: low-risk addition that can improve the debugging of
thorny issues with DistSQL flows not being properly canceled.

@yuzefovich yuzefovich added the do-not-merge bors won't merge a PR with this label. label Jun 10, 2021
@yuzefovich yuzefovich requested review from a team as code owners June 10, 2021 20:08
@yuzefovich yuzefovich removed request for a team June 10, 2021 20:08
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@yuzefovich
Copy link
Member Author

#66333 needs to be backported here too.

@yuzefovich
Copy link
Member Author

This needs to be merged after #66331, and I want to bake this PR also for a month, targeting 21.1.4 release.

@yuzefovich
Copy link
Member Author

Andrew W pointed out that our newly proposed backport policy suggests not backporting new virtual tables, but if we do, they should be hidden behind a feature flag. Given that recommendation I'm a bit on the fence whether we want to proceed with this backport or not - I wanted to backport it to help ourselves (CRDB engineers) in the debugging of customer incidents (since we recently had several cases when there was a lot of leftover remote flows contributing to / causing the outage).

I'm still leaning on the side of backporting this and will need to add a feature flag, thoughts? cc @cockroachdb/sql-queries

@rytaft
Copy link
Collaborator

rytaft commented Jun 15, 2021

I'd be in favor of backporting this behind a feature flag since it seems like it will be very useful for debugging issues.

This commit renames some things related to `ListContentionEvents` RPC to
something like `ListActivity*` in order to be reused by the follow-up
commit that will introduce `ListDistSQLFlows` RPC. Also, a couple of
redundant method implementations are removed.

Release note: None
@yuzefovich yuzefovich requested a review from a team as a code owner September 14, 2021 19:19
@yuzefovich yuzefovich requested review from tbg and removed request for a team September 14, 2021 19:19
@yuzefovich yuzefovich removed the do-not-merge bors won't merge a PR with this label. label Sep 14, 2021
@yuzefovich yuzefovich requested review from rytaft and a team September 14, 2021 19:20
@yuzefovich
Copy link
Member Author

I rebased on top of latest 21.1 without introduction the cluster setting for these new virtual tables.

This commit introduces two virtual tables
`crdb_internal.(node|cluster)_distsql_flows` that expose the information
about the DistSQL flows that have gone through the flow scheduler.
Notably, this information will not include any "local" flows (those
running on the gateway node for the query) because they don't go through
the flow scheduler.

The goal of these virtual tables is to provide more observability into
the state of the distributed query execution, especially when things are
shutdown in an ungraceful manner. Previously, the only way to see the
queued up or running flows would be via metrics.

This information is exposed via another couple of methods that are added
onto the status server, and it is required that the user is either
a superuser or has VIEWACTIVITY permissions.

Release note (sql change): a couple of virtual tables
`crdb_internal.(node|cluster)_distsql_flows` are added that expose the
information about the flows of the DistSQL execution scheduled on remote
nodes. These tables don't include any information about the
non-distributed queries nor about local flows (from the perspective of
the gateway node of the query).
In 38dbeae we broke up the creation of
the flow scheduler in order to give access to it to the status server.
That required separating out initialization of the flow metrics into
a separate step that is performed after the flow scheduler is created.

However, as it turns out, it is possible that the DistSQL server
receives some RPC calls before it is `Start`ed (the place where we put
the metrics initialization). This can happen because the DistSQL server
is registered as a gRPC service once it is created, and if an RPC comes
in before the server is started, a NPE would currently occur.

This commit fixes the issue by moving the initialization of the flow
scheduler into the constructor of the DistSQL server (early enough to
not race against incoming RPCs).

Release note: None (no stable release with this bug)
@yuzefovich
Copy link
Member Author

Fixed the linter, PTAL.

Copy link
Collaborator

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 51 of 51 files at r6, 39 of 51 files at r7, 2 of 2 files at r9.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @jordanlewis, @rytaft, and @tbg)

@yuzefovich
Copy link
Member Author

TFTR!

@yuzefovich yuzefovich merged commit cbd3c9f into cockroachdb:release-21.1 Sep 15, 2021
@yuzefovich yuzefovich deleted the backport21.1-65727 branch September 15, 2021 01:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants