
server,cli/haproxy: use node liveness to filter the generated conf #43908

Merged

Conversation

knz
Contributor

@knz knz commented Jan 13, 2020

Fixes #19863.

Prior to this patch, `cockroach gen haproxy` would include
decommissioned nodes in the generated configuration.
This patch removes them.

This is achieved by extending `/_status/nodes` to also
report the liveness information as currently known
via gossip.

Release note (cli change): `cockroach gen haproxy` now excludes
decommissioned nodes.
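
For illustration, a minimal sketch of the kind of filtering this describes, assuming a hypothetical liveness map keyed by node ID (toy names, not the actual CockroachDB code):

```go
package main

import "fmt"

type livenessStatus int

const (
	statusLive livenessStatus = iota
	statusDead
	statusDecommissioning
	statusDecommissioned
)

type nodeStatus struct {
	id   int32
	addr string
}

// filterNodes keeps every node except decommission{ing,ed} ones, so a
// temporarily dead node (e.g. down for an upgrade) stays in the config.
func filterNodes(nodes []nodeStatus, liveness map[int32]livenessStatus) []nodeStatus {
	var out []nodeStatus
	for _, n := range nodes {
		switch liveness[n.id] {
		case statusDecommissioning, statusDecommissioned:
			fmt.Printf("excluding node %d from haproxy configuration\n", n.id)
		default:
			out = append(out, n)
		}
	}
	return out
}

func main() {
	nodes := []nodeStatus{{1, "n1:26257"}, {2, "n2:26257"}, {3, "n3:26257"}}
	liveness := map[int32]livenessStatus{1: statusLive, 2: statusDecommissioned, 3: statusDead}
	for _, n := range filterNodes(nodes, liveness) {
		fmt.Println("backend:", n.addr)
	}
}
```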

@knz knz requested a review from tbg January 13, 2020 14:13
@knz knz requested a review from a team as a code owner January 13, 2020 14:13
@cockroach-teamcity
Member

This change is Reviewable

@knz knz added this to To do in DB Server & Security via automation Jan 13, 2020
@knz knz moved this from To do to In progress in DB Server & Security Jan 13, 2020
@tbg
Member

tbg commented Jan 13, 2020

Isn't the correct thing to include all nodes that aren't explicitly decommissioned? If I generate a haproxy conf while some node is down for expected (short enough) maintenance (like a version upgrade), I want it in my haproxy conf, right?

Also, is there a rationale for making a new endpoint vs augmenting the data returned by the old endpoint?

@knz knz force-pushed the 20200113-haproxy-only-include-live-nodes branch from 848994a to bf0a641 on January 13, 2020 15:02
@knz
Contributor Author

knz commented Jan 13, 2020

Isn't the correct thing to include all nodes that aren't explicitly decommissioned? If I generate a haproxy conf while some node is down for expected (short enough) maintenance (like a version upgrade), I want it in my haproxy conf, right?

I wasn't sure; but that works for me. Changed.

Also, is there a rationale for making a new endpoint vs augmenting the data returned by the old endpoint?

I wanted to share some code with the internal API. But it's not too important. Done.

Member

@tbg tbg left a comment


:lgtm: mod my comments.

Also, can you update the cli help comment for this command to mention that decommission{ed,ing} nodes are skipped?

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @knz and @tbg)


pkg/cli/haproxy.go, line 99 at r1 (raw file):

		case storagepb.NodeLivenessStatus_DECOMMISSIONING,
			storagepb.NodeLivenessStatus_DECOMMISSIONED:
			fmt.Fprintf(stderr, "warning: node %d status is %s, excluding from haproxy configuration\n",

I think this warning should print only on DECOMMISSIONING. A DECOMMISSIONED node is understood to intentionally have been removed from the cluster a while ago, so nobody would expect it to show up in a haproxy config.
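
Continuing the toy types from the earlier sketch (again illustrative, not the real `storagepb` code), tbg's suggestion amounts to something like:

```go
// Warn only for DECOMMISSIONING; skip DECOMMISSIONED silently, since
// such a node was removed intentionally and is not expected in the config.
switch liveness[n.id] {
case statusDecommissioning:
	fmt.Fprintf(os.Stderr, "warning: node %d status is DECOMMISSIONING, excluding from haproxy configuration\n", n.id)
case statusDecommissioned:
	// No warning: an intentionally, long-removed node.
default:
	out = append(out, n)
}
```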


pkg/cli/haproxy.go, line 116 at r1 (raw file):

		// possible flags.
		//
		// TODO(knz): this logic is horrendously broken and

Wow, no kidding, this is wild.


pkg/cli/haproxy_test.go, line 114 at r1 (raw file):

			},
		},
		// Check that non-live nodes are not considered for generating the configuration.

s/non-live/decommission{ing,ed}/


pkg/server/serverpb/status.proto, line 115 at r1 (raw file):

message NodesResponse {
  repeated server.status.statuspb.NodeStatus nodes = 1 [ (gogoproto.nullable) = false ];
  map<int32, cockroach.storage.NodeLivenessStatus> liveness_by_node_id = 2 [

Do we use proto maps elsewhere? I'm asking because I faintly remember issues with them. Perhaps they weren't supported in grpc-gateway, or the iteration order caused snags. If there's a precedent, I have no concerns, but if this is the first map in these public protos, consider using a repeated message instead.
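
For comparison, the repeated-message alternative mentioned here could look roughly like this (a sketch only; field and message names are hypothetical):

```proto
message NodeLivenessEntry {
  int32 node_id = 1;
  cockroach.storage.NodeLivenessStatus liveness = 2;
}

message NodesResponse {
  repeated server.status.statuspb.NodeStatus nodes = 1 [ (gogoproto.nullable) = false ];
  repeated NodeLivenessEntry liveness = 2 [ (gogoproto.nullable) = false ];
}
```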

@tbg tbg self-requested a review January 15, 2020 10:27
Member

@tbg tbg left a comment


Reviewed 6 of 6 files at r1.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @knz)

Prior to this patch, `cockroach gen haproxy` would include
decommissioned nodes in the generated configuration.
This patch removes them.

This is achieved by extending `/_status/nodes` to also
report the liveness information as currently known
via gossip.

Release note (cli change): `cockroach gen haproxy` now excludes
decommissioned nodes.
@knz knz force-pushed the 20200113-haproxy-only-include-live-nodes branch from bf0a641 to 301196d on January 15, 2020 11:03
Contributor Author

@knz knz left a comment


Also, can you update the cli help comment for this command to mention that decommission{ed,ing} nodes are skipped?

Done

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @knz and @tbg)


pkg/cli/haproxy.go, line 99 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

I think this warning should print only on DECOMMISSIONING. A DECOMMISSIONED node is understood to intentionally have been removed from the cluster a while ago, so nobody would expect it to show up in a haproxy config.

Done.


pkg/cli/haproxy_test.go, line 114 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

s/non-live/decommission{ing,ed}/

Done.


pkg/server/serverpb/status.proto, line 115 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Do we use proto maps elsewhere? I'm asking because I faintly remember issues with them. Perhaps they weren't supported in grpc-gateway, or the iteration order caused snags. If there's a precedent, I have no concerns, but if this is the first map in these public protos, consider using a repeated message instead.

There's some precedent in RaftState, RaftDebugResponse, ProblemRangesResponse, HotRangesResponse, and RangeResponse below in the same file.

@knz
Contributor Author

knz commented Jan 15, 2020

the iteration order caused snags

Yes, that is a concern btw; that's why the haproxy code also sorts the node IDs.
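
The pattern being referred to, as a minimal standalone Go sketch (toy data, not the actual haproxy code):

```go
package main

import (
	"fmt"
	"sort"
)

func main() {
	livenessByNodeID := map[int32]string{3: "LIVE", 1: "LIVE", 2: "DECOMMISSIONED"}

	// Go randomizes map iteration order, so collect and sort the keys
	// to keep the generated configuration deterministic.
	ids := make([]int32, 0, len(livenessByNodeID))
	for id := range livenessByNodeID {
		ids = append(ids, id)
	}
	sort.Slice(ids, func(i, j int) bool { return ids[i] < ids[j] })

	for _, id := range ids {
		fmt.Printf("node %d: %s\n", id, livenessByNodeID[id])
	}
}
```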

@knz
Contributor Author

knz commented Jan 15, 2020

TFYR!

bors r=tbg

craig bot pushed a commit that referenced this pull request Jan 15, 2020
43908: server,cli/haproxy: use node liveness to filter the generated conf r=tbg a=knz

Fixes #19863.

Prior to this patch, `cockroach gen haproxy` would include
decommissioned nodes in the generated configuration.
This patch removes them.

This is achieved by extending `/_status/nodes` to also
report the liveness information as currently known
via gossip.

Release note (cli change): `cockroach gen haproxy` now excludes
decommissioned nodes.

Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
@craig
Contributor

craig bot commented Jan 15, 2020

Build succeeded

@craig craig bot merged commit 301196d into cockroachdb:master Jan 15, 2020
DB Server & Security automation moved this from In progress to Done 20.1 Jan 15, 2020
@knz knz deleted the 20200113-haproxy-only-include-live-nodes branch January 15, 2020 11:39
craig bot pushed a commit that referenced this pull request Jan 15, 2020
43915: cli: warn if trying to [rd]ecommission when already [rd]ecommissioned r=tbg a=knz

Fixes #36624.
First commit from #43908

Release note (cli change): The CLI commands `cockroach node
decommission` and `cockroach node recommission` now produce a warning
on the standard error if one of the node(s) specified is
already (d/r)ecommissioned.

Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
craig bot pushed a commit that referenced this pull request Jan 15, 2020
42969: storage: rationalize server-side refreshes and fix bugs r=andreimatei a=andreimatei

Before this patch, we had several issues due to the server erroneously
considering that it's OK to commit a transaction at a bumped timestamp.

One of the issues was a lost update: a CPut could erroneously succeed
even though there's been a more recent write. This was caused by faulty
code in evaluateBatch() that was thinking that, just because an EndTxn
claimed to have been able to commit a transaction, that means that any
WriteTooOldError encountered previously by the batch was safe to
discard. An EndTxn might consider that it can commit even if there had
been previous write too old conditions if the NoRefreshSpans flag is
set. The problems is that a CPut that had returned a WriteTooOldError
also evaluated at the wrong read timestamp, and so its evaluation can't
be relied on.

Another issue is that, when the EndTxn code mentioned above considers
that it's safe to commit at a bumped timestamp, it doesn't take into
consideration that the EndTxn's batch might have performed reads (other
than CPuts) that were evaluated at a lower timestamp. This can happen,
for example, in the following scenario:
- a txn sends a Put which gets bumped by the ts cache
- the txn then sends a Scan + EndTxn; the Scan gets evaluated at the
  original timestamp, but the commit happens at a bumped one because
  the NoRefreshSpans flag is set.

The patch fixes the bugs by reworking how evaluation takes advantage of
the fact that some requests have flexible timestamps. EndTxn no longer
is in the business of committing at bumped timestamps, and its code is
thus simplified. Instead, the replica's "local retries" loop takes over.
The replica already had code handling non-transactional batches that
evaluated them repeatedly in case of WriteTooOldErrors. This patch
rationalizes and expands this code to deal with transactional batches
too, and with pushes besides WriteTooOldErrors. This reevaluation loop
now handles the cases in which the EndTxn used to bump the commit
timestamp.

The patch also fixes a third bug: the logic in evaluateBatch() for
resetting the WriteTooOld state after a successful EndTransaction was
ignoring the STAGING state, meaning that the server would return a
WriteTooOldError even though the transaction was committed. I'm not sure
if this had dramatic consequences or was benign...

Fixes #42849

Release note (bug fix): A bug causing lost update transaction anomalies
was fixed.
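
To make the lost-update anomaly concrete, here is a toy MVCC sketch (not CockroachDB code; all names are hypothetical) showing why a CPut validated at the old read timestamp cannot justify committing at a bumped one:

```go
package main

import "fmt"

type version struct {
	ts  int
	val string
}

// latestBefore returns the newest version of the key at or below ts.
// The history slice is assumed to be sorted by timestamp.
func latestBefore(history []version, ts int) string {
	val := ""
	for _, v := range history {
		if v.ts <= ts {
			val = v.val
		}
	}
	return val
}

func main() {
	// Key history: written at ts=1, then overwritten at ts=5 by another txn.
	history := []version{{ts: 1, val: "a"}, {ts: 5, val: "b"}}

	readTS := 2   // the txn's original read timestamp
	commitTS := 7 // the bumped commit timestamp

	// CPut(expected="a") evaluated at readTS succeeds...
	fmt.Println("CPut check at readTS:", latestBefore(history, readTS) == "a") // true
	// ...but at commitTS the expectation no longer holds: committing there
	// would silently clobber the write at ts=5 (a lost update).
	fmt.Println("CPut check at commitTS:", latestBefore(history, commitTS) == "a") // false
}
```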

43915: cli: warn if trying to [rd]ecommission when already [rd]ecommissioned r=tbg a=knz

Fixes #36624.
First commit from #43908

Release note (cli change): The CLI commands `cockroach node
decommission` and `cockroach node recommission` now produce a warning
on the standard error if one of the node(s) specified is
already (d/r)ecommissioned.

43989: colexec: fix an issue with builtin operators and a minor cleanup r=yuzefovich a=yuzefovich

Flat bytes relies on the coldata.Batch.SetLength call to maintain its
invariant. We assume that it is always called before returning a batch
in which the Bytes vector might have been modified. This was not the
case for the default builtin and substring operators, so the calls were
added. Additionally, to be safe, similar calls have been added in
projection operators.

In a few places where we were setting the length of an internal batch
to 0 and then returning it, we now return coldata.ZeroBatch instead.
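
A rough illustration of the invariant being described, using hypothetical stand-ins rather than the real coldata types:

```go
package main

import "fmt"

// toyBatch is a stand-in for a columnar batch with a flat Bytes vector.
type toyBatch struct {
	length int
	bytes  [][]byte
}

// In the real flat-bytes layout, SetLength is where offset bookkeeping
// is maintained; the invariant is that it runs before the batch escapes.
func (b *toyBatch) SetLength(n int) { b.length = n }

var zeroBatch = &toyBatch{} // analogous to coldata.ZeroBatch

func next(b *toyBatch, done bool) *toyBatch {
	if done {
		// Return the shared zero batch instead of zeroing an internal one.
		return zeroBatch
	}
	b.bytes = append(b.bytes, []byte("val")) // the Bytes vector was modified...
	b.SetLength(len(b.bytes))                // ...so SetLength must run before returning.
	return b
}

func main() {
	b := &toyBatch{}
	fmt.Println(next(b, false).length)      // 1
	fmt.Println(next(b, true) == zeroBatch) // true
}
```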

Fixes: #43656.

Release note: None

44004: githooks: accept release note category 'security update' r=knz a=knz

Forgot this in #43869 

44008: re-enable: roachprod: Make multiple set [provider]-zones always geo-distribute nodes r=jlinder a=jlinder

This re-enables commit d24e40e, which was reverted in commit 63279f9.
It was reverted because roachtest automatically passed in a list of
zones while only wanting the first zone to be used (#43898), which was
fixed in f68c6d5.

Before: if multiple zones were set for a provider and --geo wasn't set, all
hosts would be started in just one zone in one region.

Why change? Because if multiple zones are set, the intention is that they be
used.

Now, --geo and --[provider]-zones work as follows for gcloud, aws and azure:

1. when geo and zones are not set, nodes are all placed in one of the
   default zones
2. when geo is set but zones aren't, nodes are spread evenly across the
   default zones
3. when zones are set, nodes are spread evenly across the specified zones

Fixes #38542.

Release note: None

44016: builtins: miscellaneous fixes for the to_hex builtin r=mjibson a=otan

Resolves #41707.

Release note (sql change, bug fix):
* Added to_hex(string) -> string functionality.
* Previously, `to_hex(-1)` would return `-1` instead of the
two's-complement hex representation (`FFFFFFFFFFFFFFFF`). This has been
rectified in this PR.

44023: roachtest: bump minimum version of the sqlsmith roachtest r=yuzefovich a=rohany

Fixes #43995.

This PR bumps the minimum version of the SQLSmith roachtest
to be v20.1.0.

Release note: None

Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
Co-authored-by: James H. Linder <jamesl@cockroachlabs.com>
Co-authored-by: Oliver Tan <otan@cockroachlabs.com>
Co-authored-by: Rohan Yadav <rohany@alumni.cmu.edu>