Skip to content
This repository has been archived by the owner on Feb 5, 2020. It is now read-only.

Etcd tnc #3079

Merged
merged 2 commits into from
Mar 28, 2018
Merged

Etcd tnc #3079

merged 2 commits into from
Mar 28, 2018

Conversation

enxebre
Copy link
Member

@enxebre enxebre commented Mar 7, 2018

Bootstrap etcd nodes from TNC. Fix INST-965
Currently the TNC etcd ign config is missing:

  • etcd-memeber sevice (currently the ign config generated by the installer is different for each machine. We'll need to either unify them or have different TNC roles).
  • locksmith service

if TNC can serve the relevant services with the relevant domains for initial_advertise_peer_urls, advertise_client_urls, initial cluster, etc. via ign/v1/role/etcd/N (e.g ign/v1/role/etcd/1, ign/v1/role/etcd/2) that should work

@coreosbot
Copy link

Can one of the admins verify this patch?

@enxebre
Copy link
Member Author

enxebre commented Mar 7, 2018

cc @thorfour @alexsomesan @squat

@@ -61,11 +61,13 @@ func installBootstrapStep(m *metadata) error {
return err
}

if err := waitForTNC(m); err != nil {
destroyCNAME(m.clusterDir)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ignition does not continuously retry to download forever, so I think this introduces a race. What if one of the etcd machines gets provisioned and tries to load it’s ignition before the TNC is up?

I think this should go after the waitForTNC step.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes there's a race which is currently only relying on retries. The problem here is that there's no obvious way for etcd to waitForTNC (when this is running as static pod) as the kubeclient won't be able to get answers until the server gets the state

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I second @squat's concern about the race condition. If we know there could be one, we should try to find a solution.
I don't have enough insight into the matter right now to suggest one here, but let me know if you need my help looking into it.

Copy link
Member Author

@enxebre enxebre Mar 20, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alexsomesan @squat this relies on ignition retries https://github.com/coreos/ignition/blob/master/internal/resource/http.go#L200 which will keep retrying until it times out, and if I'm not wrong by the look of httpTotal here https://coreos.com/ignition/docs/latest/configuration-v2_1.html it will keep retrying for ever by default https://github.com/coreos/ignition/blob/master/internal/resource/http.go#L69

Copy link
Member Author

@enxebre enxebre Mar 20, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alexsomesan @squat it's verified that Ignition will never continue if it isn't completely successful.
waitForTNC(m) where it is now, will ensure that the step holds until etcd comes up (so the cluster get state, so the TNC daemonset gets deployed and so the api server actually gives a response, so waitForTNC can finish)
This is not strictly necessary as the every node will be able to get its config from the TNC pod, so the question is: do we still want to waitForTNC(m)? wdyt?

@squat
Copy link
Contributor

squat commented Mar 8, 2018 via email

@enxebre
Copy link
Member Author

enxebre commented Mar 8, 2018

yeah I was referring to the ignition retries https://github.com/coreos/ignition/blob/master/internal/resource/http.go#L200
Ideally we'd need a waitForStaticPod mechanism though

return err
}

return destroyCNAME(m.clusterDir)
return waitForTNC(m)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By having it here we ensure that the step holds until etcd comes up (so the api server actually gives a response and waitForTNC finish when the TNC daemonset gets deployed)

trawler
trawler previously approved these changes Mar 8, 2018
Copy link
Contributor

@trawler trawler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@thorfour
Copy link
Contributor

@enxebre I was told to hold off on the etcd changes until the installer supports query strings. Is there a current issue that tracks that required work?

@enxebre
Copy link
Member Author

enxebre commented Mar 12, 2018

hey @thorfour thanks for the update! I can't see how cnames vs any other approach can block us from getting the TNC to serve the expected services regardless how they are requested ( e.g etcd, https://github.com/coreos-inc/tectonic-operators/pull/286#issuecomment-370883510, #3090 (comment), is there a place where we track all these expected features for the tnc?). Also, although we need agree on an specific implementation (see INST-944) there's nothing stopping this PR from requesting via query strings so we can ensure etcd comes up correctly

@thorfour
Copy link
Contributor

@enxebre Sounds good. I've implemented query strings for ignition using a separate endpoint to unblock us. To use query strings we'll use the path /ignition?role=etcd&id=N sound good? I'll push that PR tomorrow and then add the etcd templates to see if we can get this working.

@enxebre
Copy link
Member Author

enxebre commented Mar 13, 2018

Cool thanks, sounds good. Just make sure you tag the docker image to a different id so we keep builds used by master hermetic

@enxebre
Copy link
Member Author

enxebre commented Mar 14, 2018

@thorfour any update on this, is there an image serving etcd services I can point to?

@thorfour
Copy link
Contributor

@enxebre PR with the etcd template is here

I've pushed an image with that template here: quay.io/coreos/tectonic-node-controller-dev:16fb00200ea86e6f31ce78e56225dec01f459921

@yifan-gu
Copy link
Contributor

Any updates on this PR?

@enxebre
Copy link
Member Author

enxebre commented Mar 19, 2018

@yifan-gu I dropped few comments here https://github.com/coreos-inc/tectonic-operators/pull/307, should get clarified and merging soon

@enxebre
Copy link
Member Author

enxebre commented Mar 21, 2018

ok to test

@enxebre enxebre force-pushed the etcd-tnc branch 2 times, most recently from fb63266 to 3676c10 Compare March 21, 2018 13:57
config.tf Outdated
@@ -71,7 +71,7 @@ variable "tectonic_container_images" {
awscli = "quay.io/coreos/awscli:025a357f05242fdad6a81e8a6b520098aa65a600"
gcloudsdk = "google/cloud-sdk:178.0.0-alpine"
bootkube = "quay.io/coreos/bootkube:v0.10.0"
tnc_bootstrap = "quay.io/coreos/tectonic-node-controller-dev:76a584680b7f39aa7b3c40cd742c736b30b5a89a"
tnc_bootstrap = "quay.io/alberto_lamela/tectonic-node-controller-dev:7092e1772378470e14d00b11198767edf28f9698-dirty"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

quay.io/coreos/tectonic-node-controller-bootstrap-dev:f6d5e710a97a8cd6f4cd2963f4426131f854a869

@enxebre
Copy link
Member Author

enxebre commented Mar 22, 2018

retest this please

Copy link
Contributor

@squat squat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall. I have a few small notes I’d like your eyes on

count = "${length(var.external_endpoints) == 0 ? var.instance_count : 0}"

replace {
append {
source = "${format("http://${var.cluster_name}-tnc.${var.base_domain}/ignition?role=etcd&etcd_index=%d", count.index)}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:p it’s a little funny doing both interpolation and formatting but not a blocker

locals {
container_linux_version = "${data.terraform_remote_state.bootstrap.container_linux_version}"
instance_count = "${data.terraform_remote_state.bootstrap.etcd_instance_count}"
ignition_etcd = "${data.terraform_remote_state.assets.ignition_etcd}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just like instance_count, this could be just ignition since we are in the etcd step and know that everything is etcd-related

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also for sanity we should standardize the output names to either be prefix_value or value_suffix but not both. That can be a separate cleanup pr

@enxebre
Copy link
Member Author

enxebre commented Mar 27, 2018

@squat thanks! I agree on your three points, I'll address that on a follow up as the append block is being moved into the cli any way after we merge this and #3120

@squat squat merged commit 0a7cdb7 into coreos:master Mar 28, 2018
wking added a commit to wking/openshift-installer that referenced this pull request Aug 27, 2018
506ca08 (*: move etcd members to the master nodes, 2018-08-16, openshift#162)
converted tectonic_ignition_master to tectonic_ignition_masters (among
many other things).  But one of the purposes of the bootstrap node was
to allow us to avoid special-casing individual master nodes.  This
commit moves us back towards identical masters.

I'm not clear on the purpose of the etcd_index query, which dates back
to 5ce1215 (add tf support for running etcd from tnc, 2018-03-27,
coreos/tectonic-installer#3079).  We'll see how the CI tests do
without it.
wking added a commit to wking/openshift-installer that referenced this pull request Aug 27, 2018
506ca08 (*: move etcd members to the master nodes, 2018-08-16, openshift#162)
converted tectonic_ignition_master to tectonic_ignition_masters (among
many other things).  But one of the purposes of the bootstrap node was
to allow us to avoid special-casing individual master nodes.  This
commit moves us back towards identical masters.

I'm not clear on the purpose of the etcd_index query, which dates back
to 5ce1215 (add tf support for running etcd from tnc, 2018-03-27,
coreos/tectonic-installer#3079).  We'll see how the CI tests do
without it.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants