Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cli: administrative commands meant to work over unavailable clusters are broken #43974

Open
knz opened this issue Jan 14, 2020 · 9 comments
Open
Labels
A-authentication Pertains to authn subsystems A-cc-enablement Pertains to current CC production issues or short-term projects A-cli-admin CLI commands that pertain to controlling and configuring nodes C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-3-ux-surprise Issue leaves users wondering whether CRDB is behaving properly. Likely to hurt reputation/adoption. T-server-and-security DB Server & Security

Comments

@knz
Copy link
Contributor

knz commented Jan 14, 2020

Found while working on #43893

  • cockroach node status is meant to work even if the node is unhealthy (eg. a system range is unavailable), as per a comment in roachtests and my own (knz') personal recollections of past discussions
  • however at the same time its is implemented using a SQL query
  • SQL is not meant to work when system.users is unavailable, because every connection should authenticate the user logging in

Therefore currently cockroach node status happens to work for the root user because of some bypass in the code when the username is "root" exactly. However, it is broken if used with any other client cert than root's.

There are two perspectives about this:

A. either we consider that SQL should always authenticate (which creates consistency and regularity), in which case cockroach node status should be made to use the status RPC entirely and bypass SQL.

For this to be viable we must start to commit to (and better understand) a separation between "control" (RPC) and "data" (SQL) planes in the APIs.

B. or, we should consider that the "main" administration API for crdb is SQL and that this admin API should always work even when system ranges are unavailable, i.e. we must implement and document a way for clients to establish SQL connections when the system ranges including system.users are unavailable.

For this to be viable we need a special "operator" account that's always special in this way, but without removing the ability for users to use a non-special account that's also admin in the UI (i.e. properly address #43847 / #43870)

(needless say my(knz) heart goes towards direction A)

Jira issue: CRDB-6332

@knz knz changed the title cli: cockroach node status is broken cli: administrative commands meant to work over unavailable clusters are broken Jan 14, 2020
@knz
Copy link
Contributor Author

knz commented Jan 14, 2020

@tbg can you check the above? I'd like to point out that node was using the RPC originally and that you specifically requested SQL in #20713.

Interestingly, #20713 was listing cockroach user, zone and node. Both user and zone have been removed since because they were redundant with SQL.

It seems to me that node fundamentally should not be done via SQL, and that essentially the thrust behind #20713 should have been narrowed to user and zone, and thus the current behavior reverted for node. Do you agree?

@knz knz added this to To do in DB Server & Security via automation Jan 14, 2020
@knz knz added A-cli C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. labels Jan 14, 2020
@knz knz moved this from To do to In progress in DB Server & Security Jan 14, 2020
@ajwerner
Copy link
Contributor

I can see some value in B. though I haven't decided if I'd go so far as to say I like it. In particular, shoving more administrative functionality underneath SQL gives us a smaller API surface w.r.t. API uniformity and concepts for users. I'm generally in favor of exposing non-performance critical functionality to users via a SQL (crdb_internal) interface rather than a new gRPC service. I anticipate that it can ease client adoption if that client has gone through the work of setting up a connection. I also suspect that there are use cases where it is convenient to be able to access data available from other APIs via SQL.

As noted, we'd need to tell a new story about authn/authz for such a sql user. Perhaps we want to provide a mechanism to create tokens with an expiration and well scoped access to "control" data. Furthermore, pushing such data underneath SQL will muddle the interface.

Perhaps my preference is to do both. Define a clear "control" portion which is implemented on top of SQL and can bypass authz explicitly. Then, with less urgency, provide a mechanism to authorize SQL connections to access the same data.

@knz knz moved this from In progress to 20.1 in DB Server & Security Jan 14, 2020
@knz
Copy link
Contributor Author

knz commented Jan 14, 2020

Perhaps we want to provide a mechanism to create tokens with an expiration and well scoped access to "control" data

Yeah and that info will be stored ... where? In a KV range that needs quorum to access? That sounds like a snake biting its tail.

I'm cooking a high-level description of the problem that underlines how we've been designing with conflicting goals. There will be fallout.

@ajwerner
Copy link
Contributor

Yeah and that info will be stored ... where? In a KV range that needs quorum to access? That sounds like a snake biting its tail.

In my head the proposal was that the token be irrevocable or best effort revocable. The idea was that it could be a signed token which stores its expiration so that verification can occur without access to the database.

@knz
Copy link
Contributor Author

knz commented Jan 14, 2020

idea was that it could be a signed token which stores its expiration so that verification can occur without access to the database.

We have that already with TLS client certs (signed + embedded expiration)

Except that it's not server-side revocable. And we can't control/revoke privileges.

@tbg
Copy link
Member

tbg commented Jan 15, 2020

None of the options discussed seems very appealing. Take a look at the SQL used for node status:

cockroach/pkg/cli/node.go

Lines 148 to 189 in 0917dcc

baseQuery := maybeAddActiveNodesFilter(
`SELECT node_id AS id,
address,
sql_address,
build_tag AS build,
started_at,
updated_at,
locality,
CASE WHEN split_part(expiration,',',1)::decimal > now()::decimal
THEN true
ELSE false
END AS is_available,
ifnull(is_live, false)
FROM crdb_internal.gossip_liveness LEFT JOIN crdb_internal.gossip_nodes USING (node_id)`,
)
const rangesQuery = `
SELECT node_id AS id,
sum((metrics->>'replicas.leaders')::DECIMAL)::INT AS replicas_leaders,
sum((metrics->>'replicas.leaseholders')::DECIMAL)::INT AS replicas_leaseholders,
sum((metrics->>'replicas')::DECIMAL)::INT AS ranges,
sum((metrics->>'ranges.unavailable')::DECIMAL)::INT AS ranges_unavailable,
sum((metrics->>'ranges.underreplicated')::DECIMAL)::INT AS ranges_underreplicated
FROM crdb_internal.kv_store_status
GROUP BY node_id`
const statsQuery = `
SELECT node_id AS id,
sum((metrics->>'livebytes')::DECIMAL)::INT AS live_bytes,
sum((metrics->>'keybytes')::DECIMAL)::INT AS key_bytes,
sum((metrics->>'valbytes')::DECIMAL)::INT AS value_bytes,
sum((metrics->>'intentbytes')::DECIMAL)::INT AS intent_bytes,
sum((metrics->>'sysbytes')::DECIMAL)::INT AS system_bytes
FROM crdb_internal.kv_store_status
GROUP BY node_id`
const decommissionQuery = `
SELECT node_id AS id,
ranges AS gossiped_replicas,
decommissioning AS is_decommissioning,
draining AS is_draining
FROM crdb_internal.gossip_liveness LEFT JOIN crdb_internal.gossip_nodes USING (node_id)`

Cobbling this together without SQL is unsavory. In fact, the previous code doing so was... bad, we can't revert to it.

The two options I find palatable are:

  1. don't do anything (which I suppose means that root can't have a password) - I bring this up because I wasn't able to find a clear reason why we wanted to make that change in the first place. We may thus have the option to fight this fight another day.
  2. push the code for the administrative commands out of the cli into the server. That is,
    func runStatusNodeInner(showDecommissioned bool, args []string) ([]string, [][]string, error) {
    would do nothing but open a gRPC connection and send an RPC containing the args, whose output is then rendered without any additional logic. The server side could create an "internal" authenticated session that can run SQL, something I believe we already do in some places.

@knz
Copy link
Contributor Author

knz commented Jan 15, 2020

I like 2 as it's well aligned with another item on the CLI roadmap. I'll mull it over.

@knz knz added the S-3-ux-surprise Issue leaves users wondering whether CRDB is behaving properly. Likely to hurt reputation/adoption. label Jan 17, 2020
@knz knz moved this from 20.1 to To do in DB Server & Security Jan 17, 2020
craig bot pushed a commit that referenced this issue Jan 17, 2020
44022: server,sql,pgwire: let client conns timeout on unavailable clusters r=ajwerner a=knz

Fixes #44012.
Informs #43974 and #30887.

Release note (general change): CockroachDB will now report a timeout
error when a client attempts to connect via SQL or the web UI and some
system ranges are unavailable. The previous behavior was to wait
indefinitely. The timeout is configurable via the cluster setting
`server.user_login.timeout` and is set to 10 seconds by default.  The
value 0 means "indefinitely" can can be used to restore the pre-20.1
behavior.

Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
@knz
Copy link
Contributor Author

knz commented Jan 20, 2020

@knz knz moved this from To do to In progress in DB Server & Security Jan 20, 2020
@knz knz moved this from In progress to To do in DB Server & Security Apr 17, 2020
@knz knz moved this from To do to Legacy debt in DB Server & Security Apr 17, 2020
@knz knz added A-cli-admin CLI commands that pertain to controlling and configuring nodes and removed A-cli labels Mar 20, 2021
@jlinder jlinder added the T-server-and-security DB Server & Security label Jun 16, 2021
@knz knz added A-authentication Pertains to authn subsystems A-cc-enablement Pertains to current CC production issues or short-term projects labels Jul 29, 2021
@github-actions
Copy link

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
10 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-authentication Pertains to authn subsystems A-cc-enablement Pertains to current CC production issues or short-term projects A-cli-admin CLI commands that pertain to controlling and configuring nodes C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-3-ux-surprise Issue leaves users wondering whether CRDB is behaving properly. Likely to hurt reputation/adoption. T-server-and-security DB Server & Security
Projects
DB Server & Security
  
Tech debt & Grouping attempts
Development

No branches or pull requests

4 participants