cli: administrative commands meant to work over unavailable clusters are broken #43974

knz · 2020-01-14T18:34:06Z

Found while working on #43893

cockroach node status is meant to work even if the node is unhealthy (eg. a system range is unavailable), as per a comment in roachtests and my own (knz') personal recollections of past discussions
however at the same time its is implemented using a SQL query
SQL is not meant to work when system.users is unavailable, because every connection should authenticate the user logging in

Therefore currently cockroach node status happens to work for the root user because of some bypass in the code when the username is "root" exactly. However, it is broken if used with any other client cert than root's.

There are two perspectives about this:

A. either we consider that SQL should always authenticate (which creates consistency and regularity), in which case cockroach node status should be made to use the status RPC entirely and bypass SQL.

For this to be viable we must start to commit to (and better understand) a separation between "control" (RPC) and "data" (SQL) planes in the APIs.

B. or, we should consider that the "main" administration API for crdb is SQL and that this admin API should always work even when system ranges are unavailable, i.e. we must implement and document a way for clients to establish SQL connections when the system ranges including system.users are unavailable.

For this to be viable we need a special "operator" account that's always special in this way, but without removing the ability for users to use a non-special account that's also admin in the UI (i.e. properly address #43847 / #43870)

(needless say my(knz) heart goes towards direction A)

Jira issue: CRDB-6332

The text was updated successfully, but these errors were encountered:

knz · 2020-01-14T18:38:57Z

@tbg can you check the above? I'd like to point out that node was using the RPC originally and that you specifically requested SQL in #20713.

Interestingly, #20713 was listing cockroach user, zone and node. Both user and zone have been removed since because they were redundant with SQL.

It seems to me that node fundamentally should not be done via SQL, and that essentially the thrust behind #20713 should have been narrowed to user and zone, and thus the current behavior reverted for node. Do you agree?

ajwerner · 2020-01-14T19:03:58Z

I can see some value in B. though I haven't decided if I'd go so far as to say I like it. In particular, shoving more administrative functionality underneath SQL gives us a smaller API surface w.r.t. API uniformity and concepts for users. I'm generally in favor of exposing non-performance critical functionality to users via a SQL (crdb_internal) interface rather than a new gRPC service. I anticipate that it can ease client adoption if that client has gone through the work of setting up a connection. I also suspect that there are use cases where it is convenient to be able to access data available from other APIs via SQL.

As noted, we'd need to tell a new story about authn/authz for such a sql user. Perhaps we want to provide a mechanism to create tokens with an expiration and well scoped access to "control" data. Furthermore, pushing such data underneath SQL will muddle the interface.

Perhaps my preference is to do both. Define a clear "control" portion which is implemented on top of SQL and can bypass authz explicitly. Then, with less urgency, provide a mechanism to authorize SQL connections to access the same data.

knz · 2020-01-14T21:56:01Z

Perhaps we want to provide a mechanism to create tokens with an expiration and well scoped access to "control" data

Yeah and that info will be stored ... where? In a KV range that needs quorum to access? That sounds like a snake biting its tail.

I'm cooking a high-level description of the problem that underlines how we've been designing with conflicting goals. There will be fallout.

ajwerner · 2020-01-14T22:11:51Z

Yeah and that info will be stored ... where? In a KV range that needs quorum to access? That sounds like a snake biting its tail.

In my head the proposal was that the token be irrevocable or best effort revocable. The idea was that it could be a signed token which stores its expiration so that verification can occur without access to the database.

knz · 2020-01-14T22:37:07Z

idea was that it could be a signed token which stores its expiration so that verification can occur without access to the database.

We have that already with TLS client certs (signed + embedded expiration)

Except that it's not server-side revocable. And we can't control/revoke privileges.

tbg · 2020-01-15T13:28:13Z

None of the options discussed seems very appealing. Take a look at the SQL used for node status:

cockroach/pkg/cli/node.go

Lines 148 to 189 in 0917dcc

    
           	baseQuery := maybeAddActiveNodesFilter( 
        
           		`SELECT node_id AS id, 
        
                       address, 
        
                       sql_address, 
        
                       build_tag AS build, 
        
                       started_at, 
        
           			updated_at, 
        
           			locality, 
        
                       CASE WHEN split_part(expiration,',',1)::decimal > now()::decimal 
        
                            THEN true 
        
                            ELSE false 
        
                            END AS is_available, 
        
           			ifnull(is_live, false) 
        
                FROM crdb_internal.gossip_liveness LEFT JOIN crdb_internal.gossip_nodes USING (node_id)`, 
        
           	) 
        
           	const rangesQuery = ` 
        
           SELECT node_id AS id, 
        
                  sum((metrics->>'replicas.leaders')::DECIMAL)::INT AS replicas_leaders, 
        
                  sum((metrics->>'replicas.leaseholders')::DECIMAL)::INT AS replicas_leaseholders, 
        
                  sum((metrics->>'replicas')::DECIMAL)::INT AS ranges, 
        
                  sum((metrics->>'ranges.unavailable')::DECIMAL)::INT AS ranges_unavailable, 
        
                  sum((metrics->>'ranges.underreplicated')::DECIMAL)::INT AS ranges_underreplicated 
        
           FROM crdb_internal.kv_store_status 
        
           GROUP BY node_id` 
        
           	const statsQuery = ` 
        
           SELECT node_id AS id, 
        
                  sum((metrics->>'livebytes')::DECIMAL)::INT AS live_bytes, 
        
                  sum((metrics->>'keybytes')::DECIMAL)::INT AS key_bytes, 
        
                  sum((metrics->>'valbytes')::DECIMAL)::INT AS value_bytes, 
        
                  sum((metrics->>'intentbytes')::DECIMAL)::INT AS intent_bytes, 
        
                  sum((metrics->>'sysbytes')::DECIMAL)::INT AS system_bytes 
        
           FROM crdb_internal.kv_store_status 
        
           GROUP BY node_id` 
        
           	const decommissionQuery = ` 
        
           SELECT node_id AS id, 
        
                  ranges AS gossiped_replicas, 
        
                  decommissioning AS is_decommissioning, 
        
                  draining AS is_draining 
        
           FROM crdb_internal.gossip_liveness LEFT JOIN crdb_internal.gossip_nodes USING (node_id)`

Cobbling this together without SQL is unsavory. In fact, the previous code doing so was... bad, we can't revert to it.

The two options I find palatable are:

don't do anything (which I suppose means that root can't have a password) - I bring this up because I wasn't able to find a clear reason why we wanted to make that change in the first place. We may thus have the option to fight this fight another day.
push the code for the administrative commands out of the cli into the server. That is,

cockroach/pkg/cli/node.go

Line 129 in 0917dcc

func runStatusNodeInner(showDecommissioned bool, args []string) ([]string, [][]string, error) {

would do nothing but open a gRPC connection and send an RPC containing the args, whose output is then rendered without any additional logic. The server side could create an "internal" authenticated session that can run SQL, something I believe we already do in some places.

knz · 2020-01-15T14:15:07Z

I like 2 as it's well aligned with another item on the CLI roadmap. I'll mull it over.

44022: server,sql,pgwire: let client conns timeout on unavailable clusters r=ajwerner a=knz Fixes #44012. Informs #43974 and #30887. Release note (general change): CockroachDB will now report a timeout error when a client attempts to connect via SQL or the web UI and some system ranges are unavailable. The previous behavior was to wait indefinitely. The timeout is configurable via the cluster setting `server.user_login.timeout` and is set to 10 seconds by default. The value 0 means "indefinitely" can can be used to restore the pre-20.1 behavior. Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>

knz · 2020-01-20T10:38:45Z

RFC for a solution: https://github.com/knz/cockroach/blob/20200119-users-rfc/docs/RFCS/20200119_always_available_users.md

github-actions · 2023-09-19T11:09:01Z

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
10 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!

knz changed the title ~~cli: cockroach node status is broken~~ cli: administrative commands meant to work over unavailable clusters are broken Jan 14, 2020

knz added this to To do in DB Server & Security via automation Jan 14, 2020

knz added A-cli C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. labels Jan 14, 2020

knz mentioned this issue Jan 14, 2020

server,cli,sql: enable setting a password for the root user #43893

Merged

knz moved this from To do to In progress in DB Server & Security Jan 14, 2020

knz moved this from In progress to 20.1 in DB Server & Security Jan 14, 2020

knz mentioned this issue Jan 15, 2020

server,sql,pgwire: let client conns timeout on unavailable clusters #44022

Merged

knz added the S-3-ux-surprise Issue leaves users wondering whether CRDB is behaving properly. Likely to hurt reputation/adoption. label Jan 17, 2020

knz moved this from 20.1 to To do in DB Server & Security Jan 17, 2020

knz moved this from To do to In progress in DB Server & Security Jan 20, 2020

knz moved this from In progress to To do in DB Server & Security Apr 17, 2020

knz moved this from To do to Legacy debt in DB Server & Security Apr 17, 2020

knz added A-cli-admin CLI commands that pertain to controlling and configuring nodes and removed A-cli labels Mar 20, 2021

jlinder added the T-server-and-security DB Server & Security label Jun 16, 2021

knz added A-authentication Pertains to authn subsystems A-cc-enablement Pertains to current CC production issues or short-term projects labels Jul 29, 2021

github-actions bot added the no-issue-activity label Sep 19, 2023

knz removed the no-issue-activity label Sep 19, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cli: administrative commands meant to work over unavailable clusters are broken #43974

cli: administrative commands meant to work over unavailable clusters are broken #43974

knz commented Jan 14, 2020 •

edited by cockroach-jira-scripts

knz commented Jan 14, 2020 •

edited

ajwerner commented Jan 14, 2020

knz commented Jan 14, 2020

ajwerner commented Jan 14, 2020

knz commented Jan 14, 2020

tbg commented Jan 15, 2020

knz commented Jan 15, 2020

knz commented Jan 20, 2020

github-actions bot commented Sep 19, 2023

cli: administrative commands meant to work over unavailable clusters are broken #43974

cli: administrative commands meant to work over unavailable clusters are broken #43974

Comments

knz commented Jan 14, 2020 • edited by cockroach-jira-scripts

knz commented Jan 14, 2020 • edited

ajwerner commented Jan 14, 2020

knz commented Jan 14, 2020

ajwerner commented Jan 14, 2020

knz commented Jan 14, 2020

tbg commented Jan 15, 2020

knz commented Jan 15, 2020

knz commented Jan 20, 2020

github-actions bot commented Sep 19, 2023

knz commented Jan 14, 2020 •

edited by cockroach-jira-scripts

knz commented Jan 14, 2020 •

edited