proxy: cluster file overrides environment settings #4015

Closed
MartinPodval opened this Issue Dec 17, 2015 · 6 comments

MartinPodval commented Dec 17, 2015

We run a Kubernetes cluster on CoreOS machines and use cloud config as much as possible. As it's a testing cluster, we have one static master. Our network admin changed the domain name this week, so we had to update the etcd configuration in the cloud config, namely the master's IP/host name.

I'm surprised that settings stored in files are not overridden by environment settings.

In more detail:

Current cloud config:

coreos:
  etcd2:
    listen-client-urls: http://0.0.0.0:2379,http://0.0.0.0:4001
    initial-cluster: master=http://mpavmcore01:2380
    proxy: on 

Note that machine mpavmcore01 is the new master.
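
For context on how these keys reach etcd2, here is a minimal sketch of the naming convention. It assumes each etcd2 key in the cloud config is upper-cased, has its hyphens replaced by underscores, and gets an ETCD_ prefix, which matches the variable names etcd2 reports in its startup log. The helper name etcd_env_name is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: derive the environment variable name for an etcd2
# cloud-config key, e.g. initial-cluster -> ETCD_INITIAL_CLUSTER.
# Assumption: the mapping is a plain upper-case / hyphen-to-underscore rename.
etcd_env_name() {
    printf 'ETCD_%s\n' "$(printf '%s' "$1" | tr 'a-z-' 'A-Z_')"
}

# Print the variable names for the three keys used in the cloud config above.
for key in listen-client-urls initial-cluster proxy; do
    etcd_env_name "$key"
done
```

The point of the sketch is that the cloud config does reach etcd2 as environment variables; the question in this issue is why one of them appears to have no effect.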

When I use this cloud config with the new setting, etcd fails, here is a log:

Dec 17 16:04:15 mpavmcore03 systemd[1]: Starting etcd2...
Dec 17 16:04:19 mpavmcore03 etcd2[769]: recognized and used environment variable ETCD_ADVERTISE_CLIENT_URLS=http://0.0.0.0:2379,http://0.0.0.0:4001
Dec 17 16:04:19 mpavmcore03 etcd2[769]: recognized and used environment variable ETCD_DATA_DIR=/var/lib/etcd2
Dec 17 16:04:19 mpavmcore03 etcd2[769]: recognized and used environment variable ETCD_INITIAL_CLUSTER=master=http://mpavmcore01:2380
Dec 17 16:04:19 mpavmcore03 etcd2[769]: recognized and used environment variable ETCD_LISTEN_CLIENT_URLS=http://0.0.0.0:2379,http://0.0.0.0:4001
Dec 17 16:04:19 mpavmcore03 etcd2[769]: recognized and used environment variable ETCD_NAME=5ee478f3e9fa4cd8b616cce9ef0479ba
Dec 17 16:04:19 mpavmcore03 etcd2[769]: recognized and used environment variable ETCD_PROXY=on
Dec 17 16:04:19 mpavmcore03 etcd2[769]: etcd Version: 2.2.2
Dec 17 16:04:19 mpavmcore03 etcd2[769]: Git SHA: b4bddf6
Dec 17 16:04:19 mpavmcore03 etcd2[769]: Go Version: go1.4.3
Dec 17 16:04:19 mpavmcore03 etcd2[769]: Go OS/Arch: linux/amd64
Dec 17 16:04:19 mpavmcore03 etcd2[769]: setting maximum number of CPUs to 1, total number of available CPUs is 4
Dec 17 16:04:19 mpavmcore03 etcd2[769]: the server is already initialized as proxy before, starting as etcd proxy...
Dec 17 16:04:19 mpavmcore03 etcd2[769]: proxy: using peer urls [http://16.55.176.195:2380] from cluster file "/var/lib/etcd2/proxy/cluster"
Dec 17 16:04:20 mpavmcore03 etcd2[769]: could not get cluster response from http://16.55.176.195:2380: Get http://16.55.176.195:2380/members: dial tcp 16.55.176.195:2380: i/o timeout

Note that 16.55.176.195 is the old master. When I looked at /var/lib/etcd2/proxy/cluster, it already contained a peer URL record pointing to the old master.

sudo cat /var/lib/etcd2/proxy/cluster
{"PeerURLs":["http://16.55.176.195:2380"]}
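
The mismatch can also be checked mechanically. This is a rough sketch, assuming the cluster file keeps the single-line JSON shape shown above and that the configured value has the name=url,name=url form of initial-cluster. The function names are hypothetical and the demo at the end runs against a temporary file, not the real data directory.

```shell
#!/bin/sh
# Print the peer URLs cached in a proxy cluster file, one per line.
# Assumes the JSON shape shown above: {"PeerURLs":["http://..."]}
cached_peer_urls() {
    grep -o 'https\{0,1\}://[^"]*' "$1"
}

# Print the peer URLs from an initial-cluster value (name=url,name=url,...).
configured_peer_urls() {
    printf '%s\n' "$1" | tr ',' '\n' | sed 's/^[^=]*=//'
}

# Exit status 0 if the cached peer set differs from the configured one.
proxy_cache_is_stale() {
    cached=$(cached_peer_urls "$1" | sort)
    configured=$(configured_peer_urls "$2" | sort)
    [ "$cached" != "$configured" ]
}

# Demo with the values from this report, using a temporary copy of the file.
demo_file=$(mktemp)
printf '%s\n' '{"PeerURLs":["http://16.55.176.195:2380"]}' > "$demo_file"
if proxy_cache_is_stale "$demo_file" 'master=http://mpavmcore01:2380'; then
    echo "cached peers differ from initial-cluster"
fi
rm -f "$demo_file"
```

On a real node the call would be proxy_cache_is_stale /var/lib/etcd2/proxy/cluster "$ETCD_INITIAL_CLUSTER". A difference is not necessarily an error, since the cached set may have been updated after a runtime membership change.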

I read a lot of documentation regarding runtime cluster configuration and there is a recommendation to remove old proxy files.

What, then, is the point of a one-file cloud config if you have to change internal config files whenever the master node changes? I'd like to deploy the new cloud config to all cluster machines, reboot them one by one, and end up with a healthy cluster.

I've also found older documentation that contains an important statement:

Individual node configuration options can be set in three places:

Command line flags
Environment variables
Configuration file

However, the new version for etcd2 does not contain the last line.

jonboulle changed the title from "File setting overrides environment properties" to "proxy: cluster file overrides environment settings" on Dec 19, 2015

Contributor

jonboulle commented Dec 19, 2015

I read a lot of documentation regarding runtime cluster configuration and there is a recommendation to remove old proxy files.

Hm, to which documentation are you referring? The only mention I see in ours is when talking about promoting a proxy to a member which I don't think is what you're after.

The proxy cluster file is not really a configuration file but rather an internal implementation detail of etcd; it needs to periodically re-sync and cache the member set so that it can successfully reconnect to the cluster after the proxy process restarts. This is desired behaviour in the majority of use cases (with multi-member clusters) - I'll talk more about your particular one below.

However, the new version for etcd2 does not contain the last line.

Right, this is because currently there is no configuration file per se for etcd2. (As I mentioned the proxy cluster file doesn't really fit this definition as it's never intended to be manipulated by users). Side note, we are considering changing this.

But getting back to solving your actual use case:

What, then, is the point of a one-file cloud config if you have to change internal config files whenever the master node changes? I'd like to deploy the new cloud config to all cluster machines, reboot them one by one, and end up with a healthy cluster.

The problem is kind of the result of coreos-cloudinit's strange nature (halfway between initial bootstrapping tool and configuration management); it's not intended for dynamic configuration changes, but really just to do the bare minimum to get everything up and running. If your cluster membership changes, we don't expect you to have to re-apply a configuration file: etcd itself tracks the cluster reconfigurations. Your case is special because it's only ever a single-member cluster, but etcd can't really know this.

However, if you can guarantee that the cloud-config retrieved at boot time always has up-to-date configuration, then for your purposes it should be safe to add something to your cloud config that removes the proxy file before etcd2 starts.
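
A minimal sketch of what such a removal step could look like, assuming the cached member list lives at proxy/cluster under the data directory (as in the log in the original report) and that etcd2 falls back to initial-cluster when that file is absent. That fallback is an assumption to verify on your etcd version, and the function name flush_proxy_cache is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: remove only the cached member list of an etcd2 proxy,
# leaving the rest of the data directory untouched.
# Assumption: with the file gone, etcd2 rebuilds it from initial-cluster.
flush_proxy_cache() {
    data_dir=${1:-/var/lib/etcd2}
    # -f keeps this idempotent: a missing file is not an error.
    rm -f -- "$data_dir/proxy/cluster"
}

# Intended use, before etcd2 starts (not executed here):
#   flush_proxy_cache /var/lib/etcd2
```

This only makes sense for the single-static-master case described here. With a multi-member cluster, deleting the cache discards membership changes the proxy has tracked since the cloud config was written.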

The alternative -- to make etcd itself more aware of this kind of use case, e.g. with a --proxy-don't-cache-cluster-members or --proxy-ignore-cluster-members-cache kind of flag -- is nice because it hides the file completely from the user, but troubling because it removes the safety guarantee that the current behaviour provides (i.e. then if you have a live configuration change which the proxy CAN track -- e.g. adding new members, removing existing ones, while the cluster retains quorum -- and the proxy process restarts, you would HAVE to re-apply your cloud-config to have it successfully find the cluster again). Maybe a compromise would be to expose that some other way through tooling -- e.g. etcdctl admin flush-proxy-cache or similar -- but I'm not sure how often we'd expect this to be useful, /cc @xiang90.

Does that make sense?

MartinPodval commented Dec 29, 2015

Thank you for the response.

I've thought about it, and you are right that we have a special (unrealistic) setup with one static master. If the etcdctl command-line tool is what's meant to manage the etcd cluster, then the cloud config really is for first-time configuration only.

So I'm going to add a new service to the cloud config that clears the mentioned directory. It's probably also time for us to start digging into how to correctly restart the whole cluster in a production environment.

Contributor

xiang90 commented Jan 14, 2016

@MartinPodval Have you got the issue resolved? The etcd2 proxy now preserves the cluster configuration in a best-effort manner, so you need to clean the data dir to remove the previous cluster configuration. I feel an rm -rf is easy enough. Adding an etcdctl command seems like overkill, since it would just be a thin wrapper around rm -rf and would still require the data dir path.

MartinPodval commented Jan 14, 2016

@xiang90 I've finally ended up with a service like this:

    - name: remove-etcd-configuration.service
      command: start
      content: |
        [Unit]
        Description=Remove existing etcd2 configuration
        Documentation=https://github.com/coreos/etcd/issues/4015
        Before=etcd2.service

        [Service]
        ExecStartPre=/usr/bin/rm -rf /var/lib/etcd2
        ExecStart=/usr/bin/mkdir -p /var/lib/etcd2
        ExecStartPost=/usr/bin/chown etcd /var/lib/etcd2/
        RemainAfterExit=no
        Type=oneshot

So, I was a little bit surprised by the purpose of cloud config, as I originally thought it was intended to be used in a different way, but I understand it now :-)

Contributor

xiang90 commented Jan 14, 2016

@MartinPodval I think that service should work. Thanks! I am closing this issue now. Let us know if you have any other questions!

xiang90 closed this on Jan 14, 2016

MartinPodval commented Jan 14, 2016

@xiang90 Yeah, it works perfectly :-)
