Can one of the admins verify this patch?
Force-pushed c2461e6 to 45b06e1
@@ -61,11 +61,13 @@ func installBootstrapStep(m *metadata) error {
 		return err
 	}
 
-	if err := waitForTNC(m); err != nil {
+	destroyCNAME(m.clusterDir)
+
Ignition does not continuously retry the download forever, so I think this introduces a race. What if one of the etcd machines gets provisioned and tries to load its Ignition config before the TNC is up?
I think this should go after the waitForTNC step.
Yes, there's a race that currently relies only on retries. The problem here is that there's no obvious way for etcd to waitForTNC (when it's running as a static pod), since the kubeclient won't get answers until the server has the state.
I second @squat's concern about the race condition. If we know there could be one, we should try to find a solution.
I don't have enough insight into the matter right now to suggest one here, but let me know if you need my help looking into it.
@alexsomesan @squat this relies on the Ignition retries (https://github.com/coreos/ignition/blob/master/internal/resource/http.go#L200), which keep retrying until they time out. And if I'm not wrong, judging by httpTotal
here (https://coreos.com/ignition/docs/latest/configuration-v2_1.html), it will keep retrying forever by default (https://github.com/coreos/ignition/blob/master/internal/resource/http.go#L69).
@alexsomesan @squat it's verified that Ignition will never continue if it isn't completely successful.
waitForTNC(m), where it is now, will ensure that the step holds until etcd comes up (so the cluster gets state, the TNC daemonset gets deployed, and the API server actually gives a response, so waitForTNC can finish).
This is not strictly necessary, as every node will be able to get its config from the TNC pod, so the question is: do we still want to waitForTNC(m)? wdyt?
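The ordering being debated can be sketched as follows. The `metadata`, `waitForTNC`, and `destroyCNAME` names are taken from the diff, but the function signatures here are hypothetical stand-ins (the real code lives in installer/pkg/workflow): keeping waitForTNC in the step means the bootstrap step blocks until the TNC answers, and the CNAME is only torn down afterwards.

```go
package main

import "fmt"

// Hypothetical stand-in for the installer's cluster metadata.
type metadata struct{ clusterDir string }

// installBootstrapStep sketches the ordering under discussion: block on the
// TNC coming up first, then tear down the bootstrap CNAME. The dependencies
// are injected here only so the sketch is self-contained and testable.
func installBootstrapStep(m *metadata, waitForTNC func(*metadata) error, destroyCNAME func(string) error) error {
	if err := waitForTNC(m); err != nil {
		// Best-effort cleanup if the TNC never comes up.
		destroyCNAME(m.clusterDir)
		return err
	}
	return destroyCNAME(m.clusterDir)
}

func main() {
	var order []string
	err := installBootstrapStep(&metadata{clusterDir: "/tmp/cluster"},
		func(*metadata) error { order = append(order, "waitForTNC"); return nil },
		func(string) error { order = append(order, "destroyCNAME"); return nil },
	)
	fmt.Println(err, order) // <nil> [waitForTNC destroyCNAME]
}
```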
Where are the retries? The retry logic would have to be *in* ignition for
this to work IMO
yeah, I was referring to the Ignition retries: https://github.com/coreos/ignition/blob/master/internal/resource/http.go#L200
 		return err
 	}
 
-	return destroyCNAME(m.clusterDir)
+	return waitForTNC(m)
By having it here, we ensure that the step holds until etcd comes up (so the API server actually gives a response and waitForTNC finishes once the TNC daemonset gets deployed).
lgtm
@enxebre I was told to hold off on the etcd changes until the installer supports query strings. Is there a current issue that tracks that required work?
hey @thorfour, thanks for the update! I can't see how CNAMEs vs. any other approach can block us from getting the TNC to serve the expected services regardless of how they are requested (e.g. etcd, https://github.com/coreos-inc/tectonic-operators/pull/286#issuecomment-370883510, #3090 (comment); is there a place where we track all these expected features for the TNC?). Also, although we need to agree on a specific implementation (see INST-944), there's nothing stopping this PR from requesting via query strings so we can ensure etcd comes up correctly.
@enxebre Sounds good. I've implemented query strings for Ignition using a separate endpoint to unblock us. To use query strings we'll use the path
Cool, thanks, sounds good. Just make sure you tag the Docker image with a different ID so we keep builds used by master hermetic.
@thorfour any update on this? Is there an image serving the etcd services I can point to?
Any updates on this PR?
@yifan-gu I dropped a few comments here: https://github.com/coreos-inc/tectonic-operators/pull/307. They should get clarified, and this should merge soon.
ok to test
Force-pushed fb63266 to 3676c10
config.tf (Outdated)
@@ -71,7 +71,7 @@ variable "tectonic_container_images" {
   awscli        = "quay.io/coreos/awscli:025a357f05242fdad6a81e8a6b520098aa65a600"
   gcloudsdk     = "google/cloud-sdk:178.0.0-alpine"
   bootkube      = "quay.io/coreos/bootkube:v0.10.0"
-  tnc_bootstrap = "quay.io/coreos/tectonic-node-controller-dev:76a584680b7f39aa7b3c40cd742c736b30b5a89a"
+  tnc_bootstrap = "quay.io/alberto_lamela/tectonic-node-controller-dev:7092e1772378470e14d00b11198767edf28f9698-dirty"
quay.io/coreos/tectonic-node-controller-bootstrap-dev:f6d5e710a97a8cd6f4cd2963f4426131f854a869
retest this please
Looks good overall. I have a few small notes I'd like your eyes on.
 count = "${length(var.external_endpoints) == 0 ? var.instance_count : 0}"
 
-replace {
+append {
   source = "${format("http://${var.cluster_name}-tnc.${var.base_domain}/ignition?role=etcd&etcd_index=%d", count.index)}"
:p it’s a little funny doing both interpolation and formatting but not a blocker
locals {
  container_linux_version = "${data.terraform_remote_state.bootstrap.container_linux_version}"
  instance_count          = "${data.terraform_remote_state.bootstrap.etcd_instance_count}"
  ignition_etcd           = "${data.terraform_remote_state.assets.ignition_etcd}"
Just like instance_count, this could be just ignition, since we are in the etcd step and know that everything is etcd-related.
Also, for sanity we should standardize the output names to either be prefix_value or value_suffix, but not both. That can be a separate cleanup PR.
506ca08 (*: move etcd members to the master nodes, 2018-08-16, openshift#162) converted tectonic_ignition_master to tectonic_ignition_masters (among many other things). But one of the purposes of the bootstrap node was to allow us to avoid special-casing individual master nodes. This commit moves us back towards identical masters. I'm not clear on the purpose of the etcd_index query, which dates back to 5ce1215 (add tf support for running etcd from tnc, 2018-03-27, coreos/tectonic-installer#3079). We'll see how the CI tests do without it.
Bootstrap etcd nodes from TNC. Fix INST-965
Currently the TNC etcd ign config is missing:
if the TNC can serve the relevant services with the relevant domains for `initial_advertise_peer_urls`, `advertise_client_urls`, `initial cluster`, etc. via `ign/v1/role/etcd/N` (e.g. `ign/v1/role/etcd/1`, `ign/v1/role/etcd/2`), that should work