Make sure upgrades are sequential #48
Conversation
Testing right now, noticed an issue with …

Edit: Solved by forcing facts gathering before running …
I've tested running the playbook on an existing cluster without upgrading, and it looks like it does what it's supposed to: one task on one host at a time. I'll probably test this again when … Would be nice to get the …
@anton-johansson I'm not sure if this is the right approach: running all tasks in serial. The `serial` property is not supported in Ansible on a per-task basis, which means that all tasks under a play would run in serial. In large clusters that would increase the execution time significantly. I don't believe it's necessary to install masters with a rolling strategy. Maybe that even applies to the kubelet and kube-proxy as well, and only containerd should be restarted one host at a time. It all depends on the type of downtime: is it workload or APIs? What is your experience of upgrading a production cluster with KTRW?
Hmm, okay. I did lose connectivity to my services during a period when I upgraded from 1.13 to 1.14. I'm not sure what caused it, but my guess was … I understand that masters don't need to be in serial, but I would still feel safer if I could reach my API server during the upgrade, which I can't if I restart them all at the same time.
This is pretty much what I wanted to accomplish. :) Do you know if this setting is controlled with a variable somehow? I guess it would be nice if it defaulted to 100%, but you could set it using some parameter. I'd set it to … Or maybe a constant …
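One possible shape for that, as a sketch (the `serial_nodes` variable name and its default are assumptions for illustration, not the project's actual API):

```yaml
# Sketch: batch size driven by a variable, defaulting to all hosts
# at once (a plain parallel run). Override on the command line, e.g.
#   ansible-playbook install.yml -e serial_nodes=1
- hosts: nodes
  serial: "{{ serial_nodes | default('100%') }}"
  roles:
    - kubelet
```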
@anton-johansson Yes:

```yaml
- name: Start and enable kubelet
  systemd:
    name: kubelet
    state: restarted
    enabled: True
  with_items: "{{ groups['nodes'] }}"
  run_once: True
  delegate_to: "{{ item }}"
```

Another solution would be to restart everything at the very end, putting a section in …:

```yaml
- hosts: localhost
  connection: local
  roles:
    - certificates

- hosts: etcd
  roles:
    - { role: etcd, vars: { restart: false }}

- hosts: masters
  roles:
    - { role: kube-apiserver, vars: { restart: false }}
    - { role: kube-controller-manager, vars: { restart: false }}
    - { role: kube-scheduler, vars: { restart: false }}

- hosts: nodes
  roles:
    - cni
    - { role: containerd, vars: { restart: false }}
    - runc
    - { role: kube-proxy, vars: { restart: false }}
    - { role: kubelet, vars: { restart: false }}

# Restart components in serial
- hosts: nodes
  serial: 1
  roles:
    - { role: containerd, vars: { restart: true }}
    - { role: kube-proxy, vars: { restart: true }}
    - { role: kubelet, vars: { restart: true }}

- hosts: etcd
  serial: 1
  roles:
    - { role: etcd, vars: { restart: true }}

- hosts: masters
  serial: 1
  roles:
    - { role: containerd, vars: { restart: true }}
    - { role: kube-proxy, vars: { restart: true }}
    - { role: kubelet, vars: { restart: true }}
```
I like the last suggestion, where we only restart the services in serial. But wouldn't we need two variables? One that indicates that we should prepare everything (download binaries, update configuration files, etc.) and one that indicates that the service should be restarted. Both could default to …

EDIT: Good to know that stopping …
Yeah you are right, we need two variables. |
I can see if I can get something like that up and running. |
@anton-johansson So I've been doing some experimenting with my initial suggestion, where I proposed putting all restarts at the end of the playbook. I have come to the conclusion that this is an anti-pattern in terms of Ansible best practices, because everything revolves around plays, not playbooks. The "best" possible implementation (Ansible-wise) is to use handlers. But handlers alone don't solve the sequential restart problem. However, by combining handlers with rolling updates (`serial`) we can achieve what we want in a good way. So I propose that we make sure that users may set the … I've opened a PR that adds restart handlers to all appropriate roles. Have a look and let me know what you think.
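A rough sketch of the handlers-plus-serial combination (the file layout and names below are illustrative, not necessarily what the PR does):

```yaml
# roles/kubelet/tasks/main.yml (illustrative)
- name: Install kubelet unit file
  template:
    src: kubelet.service.j2
    dest: /etc/systemd/system/kubelet.service
  notify: restart kubelet

# roles/kubelet/handlers/main.yml (illustrative)
- name: restart kubelet
  systemd:
    name: kubelet
    state: restarted
    daemon_reload: true

# Playbook: handlers run at the end of each serial batch, so with
# serial: 1 only one node's kubelet is restarted at a time, and only
# when its configuration actually changed.
- hosts: nodes
  serial: 1
  roles:
    - kubelet
```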
Yeah, that sounds good. What would be a good default for this value though? I mentioned earlier in this PR that if we use … Or would you want it to be 100% by default, indicating pure parallelism?
We can move the discussion over to #52. Closing this! |
My mistake, this PR is not really related to #52. Re-opening and see if I can make the appropriate changes. |
Force-pushed from 75785de to 7d90381.
I've made some changes, @amimof, but I have yet to test this thoroughly. I'll remove …
It appears that host variables cannot be used in the scope of the playbook, only in tasks within the playbook. How about using …? For example: …
Force-pushed from b554eca to 953e0c6.
I've made some changes to utilize the command line argument … I think this needs to be done for … as well.

Also, this means that we'll run the facts gathering even if we don't need to (i.e. if we don't use …):

```yaml
when: hostvars[item]['ansible_hostname'] is not defined
```
Nice job! I also encountered that …

`kubernetes-the-right-way/roles/kube-apiserver/templates/kube-apiserver.service.j2` (line 21 in 1499f86)
Also, I don't mind gathering facts even if it isn't needed. It should be possible with a simple …
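For reference, a guarded facts-gathering task along those lines might look like this (a sketch; the group name is an assumption):

```yaml
# Gather facts for any host whose facts are missing (e.g. because it
# fell outside the current serial batch or a --limit), so hostvars
# lookups in templates keep working.
- name: Gather facts from all hosts
  setup:
  delegate_to: "{{ item }}"
  delegate_facts: true
  run_once: true
  with_items: "{{ groups['all'] }}"
  when: hostvars[item]['ansible_hostname'] is not defined
```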
Fixed. Also squashed together some commits to avoid redundant messages. Kept some commits separate because they're actually informational.
This is useful for high availability cluster upgrades or modifications.
If we use `serial_all` or `serial_etcd`, not all hosts will have their facts gathered and therefore the templates won't be generated properly.
…server` These facts usually already exist because you run the `etcd` play before `kube-apiserver`. However, to make the plays independent of each other, we must make sure that the `kube-apiserver` play also gathers facts about `etcd` hosts, since it needs them in its systemd unit file.
Force-pushed from 7f7b480 to 6502bf5.
@anton-johansson It might be a good idea to also make use of …

`kubernetes-the-right-way/.travis.yml` (line 22 in 1499f86)
Other than that, LGTM and it should be good to merge.
This way, we'll catch errors caused by missing facts due to limits from serial executions
Good idea! Done, let's see what Travis says. :)
This way, we won't stop, for example, the API server or kube-proxy on all masters or nodes at the same time, causing downtime.
To do:

- `containerd` to start (see Metrics for containerd #47)
- `1.14.3` or `1.15.0-rc.1`

Discussions: