New dev #1193

itswl · 2022-11-16T12:53:00Z

Also update etcd restore to use {{ ETCD_DATA_DIR }} as the path for the restore files, instead of creating the /etcd_backup path.

itswl · 2022-11-16T13:01:36Z

Backed up and restored several times so far; no problems found.
Node information
ansible : 10.0.0.14

etcd : 10.0.0.41, 10.0.0.56, 10.0.0.219

Details

# dk ezctl  backup k8s-test
ansible-playbook -i clusters/k8s-test/hosts -e @clusters/k8s-test/config.yml playbooks/94.backup.yml
2022-11-16 12:47:52 INFO cluster:k8s-test backup begins in 5s, press any key to abort:


PLAY [localhost] ******************************************************************************************************************************************************************************************************************************************************************************

TASK [Gathering Facts] ************************************************************************************************************************************************************************************************************************************************************************
ok: [localhost]

TASK [set NODE_IPS of the etcd cluster] *******************************************************************************************************************************************************************************************************************************************************
ok: [localhost]

TASK [get etcd cluster status] ****************************************************************************************************************************************************************************************************************************************************************
changed: [localhost]

TASK [debug] **********************************************************************************************************************************************************************************************************************************************************************************
ok: [localhost] => {
    "ETCD_CLUSTER_STATUS": {
        "changed": true,
        "cmd": "for ip in 10.0.0.56 10.0.0.41 10.0.0.219 ;do ETCDCTL_API=3 /etc/kubeasz/bin/etcdctl --endpoints=https://\"$ip\":2379 --cacert=/etc/kubeasz/clusters/k8s-test/ssl/ca.pem --cert=/etc/kubeasz/clusters/k8s-test/ssl/etcd.pem --key=/etc/kubeasz/clusters/k8s-test/ssl/etcd-key.pem endpoint health; done",
        "delta": "0:00:00.120765",
        "end": "2022-11-16 12:48:01.223679",
        "failed": false,
        "msg": "",
        "rc": 0,
        "start": "2022-11-16 12:48:01.102914",
        "stderr": "",
        "stderr_lines": [],
        "stdout": "https://10.0.0.56:2379 is healthy: successfully committed proposal: took = 19.226564ms\nhttps://10.0.0.41:2379 is healthy: successfully committed proposal: took = 17.817878ms\nhttps://10.0.0.219:2379 is healthy: successfully committed proposal: took = 17.321196ms",
        "stdout_lines": [
            "https://10.0.0.56:2379 is healthy: successfully committed proposal: took = 19.226564ms",
            "https://10.0.0.41:2379 is healthy: successfully committed proposal: took = 17.817878ms",
            "https://10.0.0.219:2379 is healthy: successfully committed proposal: took = 17.321196ms"
        ]
    }
}

TASK [get a running ectd node] ****************************************************************************************************************************************************************************************************************************************************************
changed: [localhost]

TASK [debug] **********************************************************************************************************************************************************************************************************************************************************************************
ok: [localhost] => {
    "RUNNING_NODE.stdout": "10.0.0.56"
}

TASK [get current time] ***********************************************************************************************************************************************************************************************************************************************************************
changed: [localhost]

TASK [make a backup on the etcd node] *********************************************************************************************************************************************************************************************************************************************************
changed: [localhost -> 10.0.0.56]

TASK [fetch the backup data] ******************************************************************************************************************************************************************************************************************************************************************
changed: [localhost -> 10.0.0.56]

TASK [update the latest backup] ***************************************************************************************************************************************************************************************************************************************************************
changed: [localhost]

PLAY RECAP ************************************************************************************************************************************************************************************************************************************************************************************
localhost                  : ok=10   changed=6    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0

# dk ezctl restore k8s-test   # excerpt
ansible-playbook -i clusters/k8s-test/hosts -e @clusters/k8s-test/config.yml playbooks/95.restore.yml
2022-11-16 12:48:11 INFO cluster:k8s-test restore begins in 5s, press any key to abort:


PLAY [etcd] ***********************************************************************************************************************************************************************************************************************************************************************************

TASK [Gathering Facts] ************************************************************************************************************************************************************************************************************************************************************************
ok: [10.0.0.56]
ok: [10.0.0.219]
ok: [10.0.0.41]

TASK [cluster-restore : 停止ectd 服务] ************************************************************************************************************************************************************************************************************************************************************
changed: [10.0.0.56]
changed: [10.0.0.41]
changed: [10.0.0.219]

TASK [cluster-restore : 清除etcd 数据目录] **********************************************************************************************************************************************************************************************************************************************************
changed: [10.0.0.219]
changed: [10.0.0.56]
changed: [10.0.0.41]

TASK [cluster-restore : 准备指定的备份etcd 数据] *******************************************************************************************************************************************************************************************************************************************************
changed: [10.0.0.41]
changed: [10.0.0.56]
changed: [10.0.0.219]

TASK [cluster-restore : 清理上次备份恢复数据] ***********************************************************************************************************************************************************************************************************************************************************
changed: [10.0.0.56]
changed: [10.0.0.41]
changed: [10.0.0.219]

TASK [cluster-restore : etcd 数据恢复] ************************************************************************************************************************************************************************************************************************************************************
changed: [10.0.0.219]
changed: [10.0.0.56]
changed: [10.0.0.41]

TASK [cluster-restore : 恢复数据至etcd 数据目录] *******************************************************************************************************************************************************************************************************************************************************
changed: [10.0.0.56]
changed: [10.0.0.41]
changed: [10.0.0.219]

TASK [cluster-restore : 重启etcd 服务] ************************************************************************************************************************************************************************************************************************************************************
changed: [10.0.0.219]
changed: [10.0.0.56]
changed: [10.0.0.41]

TASK [cluster-restore : 以轮询的方式等待服务同步完成] *******************************************************************************************************************************************************************************************************************************************************
changed: [10.0.0.56]
changed: [10.0.0.219]
changed: [10.0.0.41]

PLAY RECAP ************************************************************************************************************************************************************************************************************************************************************************************
10.0.0.219                 : ok=9    changed=8    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0
10.0.0.41                  : ok=9    changed=8    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0
10.0.0.56                  : ok=9    changed=8    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0

After several runs, on any etcd node:

# tree
.
|-- etcd-10.0.0.56.etcd
|   `-- member
|       |-- snap
|       |   |-- 0000000000000001-0000000000000003.snap
|       |   `-- db
|       `-- wal
|           `-- 0000000000000000-0000000000000000.wal
|-- member
|   |-- snap
|   |   |-- 0000000000000001-0000000000000003.snap
|   |   `-- db
|   `-- wal
|       |-- 0000000000000000-0000000000000000.wal
|       `-- 0.tmp
`-- snapshot.db

7 directories, 8 files

itswl · 2022-11-16T17:38:54Z

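Reworked the logic again: the restore files are now generated on the ansible control node and then distributed to each etcd node, so no extra directories or files are produced on the etcd nodes.

Tested, no problems.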

# dk ezctl restore k8s-test
ansible-playbook -i clusters/k8s-test/hosts -e @clusters/k8s-test/config.yml playbooks/95.restore.yml
2022-11-16 17:27:18 INFO cluster:k8s-test restore begins in 5s, press any key to abort:


PLAY [etcd] **************************************************************************************************************************************************************************************************************************************************************************************************

TASK [Gathering Facts] ***************************************************************************************************************************************************************************************************************************************************************************************
ok: [10.0.0.56]
ok: [10.0.0.219]
ok: [10.0.0.41]

TASK [cluster-restore : 停止ectd 服务] ***************************************************************************************************************************************************************************************************************************************************************************
ok: [10.0.0.219]
ok: [10.0.0.41]
ok: [10.0.0.56]

TASK [cluster-restore : 清除etcd 数据目录] *************************************************************************************************************************************************************************************************************************************************************************
ok: [10.0.0.41]
ok: [10.0.0.56]
ok: [10.0.0.219]

TASK [cluster-restore : 清除 etcd 备份目录] ************************************************************************************************************************************************************************************************************************************************************************
changed: [10.0.0.56 -> 127.0.0.1]

TASK [cluster-restore : etcd 数据恢复] ***************************************************************************************************************************************************************************************************************************************************************************
changed: [10.0.0.56 -> 127.0.0.1]

TASK [cluster-restore : 分发备份文件到 etcd 各个节点] *******************************************************************************************************************************************************************************************************************************************************************
changed: [10.0.0.219]
changed: [10.0.0.56]
changed: [10.0.0.41]

TASK [cluster-restore : 重启etcd 服务] ***************************************************************************************************************************************************************************************************************************************************************************
changed: [10.0.0.219]
changed: [10.0.0.41]
changed: [10.0.0.56]

TASK [cluster-restore : 以轮询的方式等待服务同步完成] **********************************************************************************************************************************************************************************************************************************************************************
changed: [10.0.0.56]
changed: [10.0.0.219]
changed: [10.0.0.41]

PLAY RECAP ***************************************************************************************************************************************************************************************************************************************************************************************************
10.0.0.219                 : ok=6    changed=3    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
10.0.0.41                  : ok=6    changed=3    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
10.0.0.56                  : ok=8    changed=5    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0

Optimize the etcd restore logic

gjmzj · 2023-04-16T00:18:56Z

The restore script has a problem: restoring a 3-node etcd cluster with it ends up with 3 leaders.

for ip in ${NODE_IPS}; do   ETCDCTL_API=3 etcdctl   --endpoints=https://${ip}:2379    --cacert=/etc/kubernetes/ssl/ca.pem   --cert=/etc/kubernetes/ssl/etcd.pem   --key=/etc/kubernetes/ssl/etcd-key.pem   --write-out=table endpoint status; done
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|         ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://192.168.0.96:2379 | 8e9e05c52164694d |   3.5.6 |  3.6 MB |      true |      false |         2 |       5261 |               5261 |        |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|         ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://192.168.0.97:2379 | 8e9e05c52164694d |   3.5.6 |  3.6 MB |      true |      false |         2 |       5323 |               5323 |        |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|         ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://192.168.0.98:2379 | 8e9e05c52164694d |   3.5.6 |  3.6 MB |      true |      false |         2 |       5270 |               5270 |        |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

Reverting to the original script makes it work normally again.
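
The symptom in the table above can be checked mechanically: every endpoint reports IS LEADER = true and the same member ID (8e9e05c52164694d), which suggests that each node started from the same restored member directory and formed its own single-member cluster. A minimal sketch of such a check, assuming the `endpoint status --write-out=table` output is piped in on stdin; `summarize_status` is a hypothetical helper name, not part of kubeasz or etcdctl:

```shell
#!/bin/sh
# Summarize `etcdctl endpoint status --write-out=table` output read from stdin.
# Prints "<leader count> <distinct member ID count> <endpoint count>".
# summarize_status is a hypothetical helper, not part of kubeasz or etcdctl.
summarize_status() {
  awk -F'|' '
    # Data rows are the only rows whose ENDPOINT column holds a URL;
    # header and +---+ separator rows are skipped.
    $2 ~ /https?:\/\// {
      id = $3; leader = $6
      gsub(/ /, "", id); gsub(/ /, "", leader)
      endpoints++
      if (leader == "true") leaders++
      if (!(id in seen)) { seen[id] = 1; ids++ }
    }
    END { printf "%d %d %d\n", leaders, ids, endpoints }
  '
}
```

Piping the per-node loop shown above into `summarize_status` would print `3 1 3` for this output (three leaders, one distinct member ID, three endpoints). A correctly restored 3-node cluster should print `1 3 3`.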

itswl · 2023-04-16T02:34:36Z

Sorry, it looks like it can't be done that way.

# Stop the etcd service
- name: 停止ectd 服务
  service: name=etcd state=stopped

# Remove the etcd data directory
- name: 清除etcd 数据目录
  file: name={{ ETCD_DATA_DIR }}/member state=absent

# Remove the old backup file
- name: 清除etcd 备份文件
  file: name={{ ETCD_DATA_DIR }}/snapshot.db state=absent

# Remove files left over from earlier restores
- name: 清除历史恢复文件
  file: name={{ ETCD_DATA_DIR }}/etcd-{{ inventory_hostname }}.etcd state=absent

# Copy the backup file to every node
- name: 拷贝备份文件到各节点
  copy:
    src: "{{ cluster_dir }}/backup/snapshot.db"
    dest: "{{ ETCD_DATA_DIR }}/snapshot.db"

# Restore the etcd data on each node, with that node's own name and peer URL
- name: etcd 数据恢复
  shell: "cd {{ ETCD_DATA_DIR }} && \
    ETCDCTL_API=3 {{ bin_dir }}/etcdctl snapshot restore snapshot.db \
    --name etcd-{{ inventory_hostname }} \
    --initial-cluster {{ ETCD_NODES }} \
    --initial-cluster-token etcd-cluster-0 \
    --initial-advertise-peer-urls https://{{ inventory_hostname }}:2380"

# Copy the restored data into the etcd data directory
- name: 恢复数据至etcd 数据目录
  shell: "cp -rf {{ ETCD_DATA_DIR }}/etcd-{{ inventory_hostname }}.etcd/member {{ ETCD_DATA_DIR }}/"

# Restart the etcd service
- name: 重启etcd 服务
  service: name=etcd state=restarted

# Poll until the service reports active
- name: 以轮询的方式等待服务同步完成
  shell: "systemctl is-active etcd.service"
  register: etcd_status
  until: '"active" in etcd_status.stdout'
  retries: 8
  delay: 8

Reverted it as shown above, with a small change to the directories.

ETCDCTL_API=3 etcdctl   -w table  --cacert=/etc/kubernetes/ssl/ca.pem --cert=/etc/kubernetes/ssl/etcd.pem --key=/etc/kubernetes/ssl/etcd-key.pem --endpoints=https://172.20.19.17:2379,https://172.20.19.14:2379,https://172.20.19.9:2379 endpoint status
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|         ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://172.20.19.17:2379 | 3d0481e94aabd34d |   3.5.5 |   56 MB |     false |      false |         2 |       6369 |               6369 |        |
| https://172.20.19.14:2379 | e7b3523af07db303 |   3.5.5 |   56 MB |     false |      false |         2 |       6369 |               6369 |        |
|  https://172.20.19.9:2379 | 232989330c375192 |   3.5.5 |   56 MB |      true |      false |         2 |       6369 |               6369 |        |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
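
The restore task above runs `snapshot restore` on every node with that node's own `--name` and `--initial-advertise-peer-urls` but the same `--initial-cluster` list, which is why the three members end up with distinct IDs in one cluster. The value of `{{ ETCD_NODES }}` is not shown in this thread; the sketch below assumes it is a comma-separated list of `etcd-<ip>=https://<ip>:2380` pairs, matching the name and peer URL used in the task. `build_initial_cluster` is a hypothetical helper, not part of kubeasz:

```shell
#!/bin/sh
# Build an --initial-cluster value from a list of etcd node IPs.
# Assumption: member names are "etcd-<ip>" and peer URLs are "https://<ip>:2380",
# mirroring --name and --initial-advertise-peer-urls in the restore task.
# build_initial_cluster is a hypothetical helper, not part of kubeasz.
# The body runs in a subshell so "out" and "ip" do not leak into the caller.
build_initial_cluster() (
  out=""
  for ip in "$@"; do
    # Prefix a comma only when the list already holds an entry.
    out="${out:+$out,}etcd-${ip}=https://${ip}:2380"
  done
  printf '%s\n' "$out"
)
```

For example, `build_initial_cluster 10.0.0.41 10.0.0.56` prints `etcd-10.0.0.41=https://10.0.0.41:2380,etcd-10.0.0.56=https://10.0.0.56:2380`. Every node must receive the same string, and each node's own `--name` has to appear in it.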

itswl added 3 commits November 15, 2022 14:16

Change the etcd backup command and backup path so that both are on the ansible node

192645d

Change /etcd_backup to {{ ETCD_DATA_DIR }}

2b69f00

Merge branch 'master' into new_dev

298a074

Change the etcd restore logic: the ansible control node generates the restore files and then distributes them to each node

56b2576

gjmzj merged commit 13439cb into easzlab:master Nov 24, 2022

kubeasz pushed a commit that referenced this pull request Jan 7, 2023

New dev (#1193)

807d712

Optimize the etcd restore logic

kubeasz pushed a commit that referenced this pull request Apr 16, 2023

fix: leader election problem after etcd cluster restore (introduced by #1193)

6693340

kubeasz pushed a commit that referenced this pull request Apr 16, 2023

fix: leader election problem after etcd cluster restore (introduced by #1193)

13b0cc0

kubeasz pushed a commit that referenced this pull request Apr 16, 2023

fix: leader election problem after etcd cluster restore (introduced by #1193)

aa50039
