Etcd tnc #3079

enxebre · 2018-03-07T12:48:22Z

Bootstrap etcd nodes from TNC. Fix INST-965
Currently the TNC etcd ign config is missing:

etcd-memeber sevice (currently the ign config generated by the installer is different for each machine. We'll need to either unify them or have different TNC roles).
locksmith service

if TNC can serve the relevant services with the relevant domains for initial_advertise_peer_urls, advertise_client_urls, initial cluster, etc. via ign/v1/role/etcd/N (e.g ign/v1/role/etcd/1, ign/v1/role/etcd/2) that should work

coreosbot · 2018-03-07T12:48:24Z

Can one of the admins verify this patch?

enxebre · 2018-03-07T12:48:41Z

cc @thorfour @alexsomesan @squat

squat · 2018-03-08T09:24:42Z

installer/pkg/workflow/install.go

@@ -61,11 +61,13 @@ func installBootstrapStep(m *metadata) error {
 		return err
 	}

-	if err := waitForTNC(m); err != nil {
+	destroyCNAME(m.clusterDir)
+


Ignition does not continuously retry to download forever, so I think this introduces a race. What if one of the etcd machines gets provisioned and tries to load it’s ignition before the TNC is up?

I think this should go after the waitForTNC step.

yes there's a race which is currently only relying on retries. The problem here is that there's no obvious way for etcd to waitForTNC (when this is running as static pod) as the kubeclient won't be able to get answers until the server gets the state

I second @squat's concern about the race condition. If we know there could be one, we should try to find a solution.
I don't have enough insight into the matter right now to suggest one here, but let me know if you need my help looking into it.

@alexsomesan @squat this relies on ignition retries https://github.com/coreos/ignition/blob/master/internal/resource/http.go#L200 which will keep retrying until it times out, and if I'm not wrong by the look of httpTotal here https://coreos.com/ignition/docs/latest/configuration-v2_1.html it will keep retrying for ever by default https://github.com/coreos/ignition/blob/master/internal/resource/http.go#L69

@alexsomesan @squat it's verified that Ignition will never continue if it isn't completely successful.
waitForTNC(m) where it is now, will ensure that the step holds until etcd comes up (so the cluster get state, so the TNC daemonset gets deployed and so the api server actually gives a response, so waitForTNC can finish)
This is not strictly necessary as the every node will be able to get its config from the TNC pod, so the question is: do we still want to waitForTNC(m)? wdyt?

squat · 2018-03-08T09:48:24Z

Where are the retries? The retry logic would have to be *in* ignition for this to work IMO

…

On Thu, Mar 8, 2018 at 10:33 Alberto ***@***.***> wrote: ***@***.**** commented on this pull request. ------------------------------ In installer/pkg/workflow/install.go <#3079 (comment)> : > @@ -61,11 +61,13 @@ func installBootstrapStep(m *metadata) error { return err } - if err := waitForTNC(m); err != nil { + destroyCNAME(m.clusterDir) + yes there's a race which is currently only relying on retries. The problem here is that there's no obvious way for etcd to waitForTNC (when this is running as static pod) as the kubeclient won't be able to get answers until the server gets the state — You are receiving this because your review was requested. Reply to this email directly, view it on GitHub <#3079 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ATiQP_XQsICJE1jt8_d8x5gkj5RicyU0ks5tcPr6gaJpZM4SgY3n> .

enxebre · 2018-03-08T10:18:13Z

yeah I was referring to the ignition retries https://github.com/coreos/ignition/blob/master/internal/resource/http.go#L200
Ideally we'd need a waitForStaticPod mechanism though

enxebre · 2018-03-08T12:52:30Z

installer/pkg/workflow/install.go

 		return err
 	}

-	return destroyCNAME(m.clusterDir)
+	return waitForTNC(m)


By having it here we ensure that the step holds until etcd comes up (so the api server actually gives a response and waitForTNC finish when the TNC daemonset gets deployed)

trawler

lgtm

thorfour · 2018-03-10T00:00:07Z

@enxebre I was told to hold off on the etcd changes until the installer supports query strings. Is there a current issue that tracks that required work?

enxebre · 2018-03-12T08:41:23Z

hey @thorfour thanks for the update! I can't see how cnames vs any other approach can block us from getting the TNC to serve the expected services regardless how they are requested ( e.g etcd, https://github.com/coreos-inc/tectonic-operators/pull/286#issuecomment-370883510, #3090 (comment), is there a place where we track all these expected features for the tnc?). Also, although we need agree on an specific implementation (see INST-944) there's nothing stopping this PR from requesting via query strings so we can ensure etcd comes up correctly

thorfour · 2018-03-12T22:51:07Z

@enxebre Sounds good. I've implemented query strings for ignition using a separate endpoint to unblock us. To use query strings we'll use the path /ignition?role=etcd&id=N sound good? I'll push that PR tomorrow and then add the etcd templates to see if we can get this working.

enxebre · 2018-03-13T08:29:41Z

Cool thanks, sounds good. Just make sure you tag the docker image to a different id so we keep builds used by master hermetic

enxebre · 2018-03-14T13:39:53Z

@thorfour any update on this, is there an image serving etcd services I can point to?

thorfour · 2018-03-14T20:07:56Z

@enxebre PR with the etcd template is here

I've pushed an image with that template here: quay.io/coreos/tectonic-node-controller-dev:16fb00200ea86e6f31ce78e56225dec01f459921

yifan-gu · 2018-03-19T18:44:14Z

Any updates on this PR?

enxebre · 2018-03-19T18:56:23Z

@yifan-gu I dropped few comments here https://github.com/coreos-inc/tectonic-operators/pull/307, should get clarified and merging soon

enxebre · 2018-03-21T10:26:54Z

ok to test

yifan-gu · 2018-03-21T23:28:32Z

config.tf

@@ -71,7 +71,7 @@ variable "tectonic_container_images" {
    awscli                               = "quay.io/coreos/awscli:025a357f05242fdad6a81e8a6b520098aa65a600"
    gcloudsdk                            = "google/cloud-sdk:178.0.0-alpine"
    bootkube                             = "quay.io/coreos/bootkube:v0.10.0"
-    tnc_bootstrap                        = "quay.io/coreos/tectonic-node-controller-dev:76a584680b7f39aa7b3c40cd742c736b30b5a89a"
+    tnc_bootstrap                        = "quay.io/alberto_lamela/tectonic-node-controller-dev:7092e1772378470e14d00b11198767edf28f9698-dirty"


quay.io/coreos/tectonic-node-controller-bootstrap-dev:f6d5e710a97a8cd6f4cd2963f4426131f854a869

enxebre · 2018-03-22T15:28:46Z

retest this please

squat

Looks good overall. I have a few small notes I’d like your eyes on

squat · 2018-03-27T08:07:40Z

modules/aws/etcd/ignition.tf

  count = "${length(var.external_endpoints) == 0 ? var.instance_count : 0}"

-  replace {
+  append {
+    source = "${format("http://${var.cluster_name}-tnc.${var.base_domain}/ignition?role=etcd&etcd_index=%d", count.index)}"


:p it’s a little funny doing both interpolation and formatting but not a blocker

squat · 2018-03-27T08:12:40Z

steps/etcd/inputs.tf

+locals {
+  container_linux_version = "${data.terraform_remote_state.bootstrap.container_linux_version}"
+  instance_count          = "${data.terraform_remote_state.bootstrap.etcd_instance_count}"
+  ignition_etcd           = "${data.terraform_remote_state.assets.ignition_etcd}"


Just like instance_count, this could be just ignition since we are in the etcd step and know that everything is etcd-related

Also for sanity we should standardize the output names to either be prefix_value or value_suffix but not both. That can be a separate cleanup pr

enxebre · 2018-03-27T08:25:08Z

@squat thanks! I agree on your three points, I'll address that on a follow up as the append block is being moved into the cli any way after we merge this and #3120

506ca08 (*: move etcd members to the master nodes, 2018-08-16, openshift#162) converted tectonic_ignition_master to tectonic_ignition_masters (among many other things). But one of the purposes of the bootstrap node was to allow us to avoid special-casing individual master nodes. This commit moves us back towards identical masters. I'm not clear on the purpose of the etcd_index query, which dates back to 5ce1215 (add tf support for running etcd from tnc, 2018-03-27, coreos/tectonic-installer#3079). We'll see how the CI tests do without it.

enxebre added the do-not-merge label Mar 7, 2018

enxebre force-pushed the etcd-tnc branch 2 times, most recently from c2461e6 to 45b06e1 Compare March 7, 2018 15:31

thorfour requested review from trawler, thorfour and squat March 7, 2018 17:49

enxebre mentioned this pull request Mar 8, 2018

run TNC as a static pod before cluster having state #3077

Merged

squat reviewed Mar 8, 2018

View reviewed changes

enxebre commented Mar 8, 2018

View reviewed changes

trawler previously approved these changes Mar 8, 2018

View reviewed changes

enxebre dismissed trawler’s stale review via 7d742bd March 13, 2018 14:30

enxebre force-pushed the etcd-tnc branch from 45b06e1 to 7d742bd Compare March 13, 2018 14:30

enxebre mentioned this pull request Mar 16, 2018

cli: add ignition support #3120

Merged

enxebre force-pushed the etcd-tnc branch from 7d742bd to b60ce29 Compare March 19, 2018 14:46

enxebre force-pushed the etcd-tnc branch from 4d4c266 to b5aa9fa Compare March 21, 2018 10:26

enxebre added run-smoke-tests platform/aws labels Mar 21, 2018

enxebre force-pushed the etcd-tnc branch 2 times, most recently from fb63266 to 3676c10 Compare March 21, 2018 13:57

yifan-gu reviewed Mar 21, 2018

View reviewed changes

enxebre removed the do-not-merge label Mar 22, 2018

enxebre force-pushed the etcd-tnc branch from ea86804 to 8d0fc53 Compare March 27, 2018 07:15

squat reviewed Mar 27, 2018

View reviewed changes

enxebre added 2 commits March 27, 2018 11:14

add tf support for running etcd from tnc

5ce1215

add cli support for running etcd from tnc

55d8479

enxebre force-pushed the etcd-tnc branch from 8d0fc53 to 55d8479 Compare March 27, 2018 09:14

squat approved these changes Mar 28, 2018

View reviewed changes

cpanato approved these changes Mar 28, 2018

View reviewed changes

squat merged commit 0a7cdb7 into coreos:master Mar 28, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Etcd tnc #3079

Etcd tnc #3079

enxebre commented Mar 7, 2018 •

edited

Loading

coreosbot commented Mar 7, 2018

enxebre commented Mar 7, 2018

squat Mar 8, 2018

enxebre Mar 8, 2018

alexsomesan Mar 20, 2018

enxebre Mar 20, 2018 •

edited

Loading

enxebre Mar 20, 2018 •

edited

Loading

squat commented Mar 8, 2018 via email

enxebre commented Mar 8, 2018 •

edited

Loading

enxebre Mar 8, 2018

trawler left a comment

thorfour commented Mar 10, 2018

enxebre commented Mar 12, 2018 •

edited

Loading

thorfour commented Mar 12, 2018

enxebre commented Mar 13, 2018

enxebre commented Mar 14, 2018

thorfour commented Mar 14, 2018

yifan-gu commented Mar 19, 2018

enxebre commented Mar 19, 2018

enxebre commented Mar 21, 2018

yifan-gu Mar 21, 2018

enxebre commented Mar 22, 2018

squat left a comment

squat Mar 27, 2018

squat Mar 27, 2018

squat Mar 27, 2018

enxebre commented Mar 27, 2018

Etcd tnc #3079

Etcd tnc #3079

Conversation

enxebre commented Mar 7, 2018 • edited Loading

coreosbot commented Mar 7, 2018

enxebre commented Mar 7, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

enxebre Mar 20, 2018 • edited Loading

Choose a reason for hiding this comment

enxebre Mar 20, 2018 • edited Loading

Choose a reason for hiding this comment

squat commented Mar 8, 2018 via email

enxebre commented Mar 8, 2018 • edited Loading

Choose a reason for hiding this comment

trawler left a comment

Choose a reason for hiding this comment

thorfour commented Mar 10, 2018

enxebre commented Mar 12, 2018 • edited Loading

thorfour commented Mar 12, 2018

enxebre commented Mar 13, 2018

enxebre commented Mar 14, 2018

thorfour commented Mar 14, 2018

yifan-gu commented Mar 19, 2018

enxebre commented Mar 19, 2018

enxebre commented Mar 21, 2018

Choose a reason for hiding this comment

enxebre commented Mar 22, 2018

squat left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

enxebre commented Mar 27, 2018

enxebre commented Mar 7, 2018 •

edited

Loading

enxebre Mar 20, 2018 •

edited

Loading

enxebre Mar 20, 2018 •

edited

Loading

enxebre commented Mar 8, 2018 •

edited

Loading

enxebre commented Mar 12, 2018 •

edited

Loading