CKS: retry if unable to drain node or unable to upgrade k8s node #8402

weizhouapache · 2023-12-22T13:15:45Z

Description

This PR tries to fix the upgrade of HA clusters by retrying when either of the following steps fails:

- draining a Kubernetes node
- upgrading a Kubernetes node
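
The retry idea described above can be sketched roughly as follows. This is only an illustration in Python, not the code of this PR (the actual change is in `KubernetesClusterUpgradeWorker.java`); `run_with_retries`, `upgrade_node`, `max_attempts`, and `delay_seconds` are hypothetical names, and the drain and upgrade steps are stand-in callables.

```python
import time
from typing import Callable


def run_with_retries(step: Callable[[], bool], max_attempts: int = 3,
                     delay_seconds: float = 5.0) -> bool:
    """Call `step` until it returns True or `max_attempts` is reached.

    All names and defaults here are hypothetical, chosen for illustration.
    """
    for attempt in range(1, max_attempts + 1):
        if step():
            return True
        # Wait before the next attempt, but not after the last one.
        if attempt < max_attempts:
            time.sleep(delay_seconds)
    return False


def upgrade_node(drain: Callable[[], bool], upgrade: Callable[[], bool],
                 max_attempts: int = 3, delay_seconds: float = 5.0) -> bool:
    """Drain a node, then upgrade it, retrying each step on its own."""
    if not run_with_retries(drain, max_attempts, delay_seconds):
        return False
    return run_with_retries(upgrade, max_attempts, delay_seconds)
```

Retrying each step separately means a transient drain failure does not force the upgrade step to be repeated as well.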

Types of changes

Breaking change (fix or feature that would cause existing functionality to change)
New feature (non-breaking change which adds functionality)
Bug fix (non-breaking change which fixes an issue)
Enhancement (improves an existing feature and functionality)
Cleanup (Code refactoring and cleanup, that may add test cases)
build/CI

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

Major
Minor

Bug Severity

Screenshots (if appropriate):

How Has This Been Tested?

How did you try to break this feature and the system with this change?

I tried CKS upgrade 16 times, 11 of 16 upgrades succeeded.

2 of 16 upgrades failed due to

```
error: unable to drain node "testcluster-of7974-node-18c8c33c2c3" due to error:[error when evicting pods/"cloud-controller-manager-5b8fc87665-5nwlh" -n "kube-system": Post "https://10.0.66.18:6443/api/v1/namespaces/kube-system/pods/cloud-controller-manager-5b8fc87665-5nwlh/eviction": unexpected EOF, error when evicting pods/"coredns-5d78c9869d-h5nkz" -n "kube-system": Post "https://10.0.66.18:6443/api/v1/namespaces/kube-system/pods/coredns-5d78c9869d-h5nkz/eviction": unexpected EOF], continuing command...
```

3 of 16 upgrades failed due to

```
Error from server: error when retrieving current configuration of: Resource: "rbac.authorization.k8s.io/v1, Resource=roles", GroupVersionKind: "rbac.authorization.k8s.io/v1, Kind=Role" Name: "kubernetes-dashboard", Namespace: "kubernetes-dashboard" from server for: "/mnt/k8sdisk//dashboard.yaml": etcdserver: leader changed
```
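
As a rough back-of-the-envelope check of why retrying might help, the snippet below treats the observed 5 failures in 16 upgrades as a per-attempt failure rate and assumes attempts fail independently. Both are simplifying assumptions that the PR itself does not make, and `failure_after_retries` is a hypothetical helper.

```python
from fractions import Fraction


def failure_after_retries(per_attempt_failure: Fraction, attempts: int) -> Fraction:
    """Probability that every one of `attempts` independent tries fails."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    return per_attempt_failure ** attempts


# 2 drain failures + 3 "leader changed" failures out of 16 upgrades.
observed_failure_rate = Fraction(2 + 3, 16)

for attempts in (1, 2, 3):
    p = failure_after_retries(observed_failure_rate, attempts)
    print(f"{attempts} attempt(s): failure probability {p} (~{float(p):.3f})")
```

Real failures are unlikely to be fully independent (for example, an etcd leader change can affect consecutive attempts), so this is an optimistic estimate.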

weizhouapache · 2023-12-22T13:16:05Z

@blueorangutan package

blueorangutan · 2023-12-22T13:24:03Z

@weizhouapache a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

codecov · 2023-12-22T14:19:16Z

Codecov Report

Attention: 28 lines in your changes are missing coverage. Please review.

Comparison is base (08749d8) 13.12% compared to head (1d20dfe) 13.16%.
Report is 3 commits behind head on 4.18.

Files	Patch %	Lines
.../actionworkers/KubernetesClusterUpgradeWorker.java	0.00%	28 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##               4.18    #8402      +/-   ##
============================================
+ Coverage     13.12%   13.16%   +0.03%     
- Complexity     9141     9203      +62     
============================================
  Files          2720     2724       +4     
  Lines        257726   258091     +365     
  Branches      40177    40229      +52     
============================================
+ Hits          33838    33988     +150     
- Misses       219598   219796     +198     
- Partials       4290     4307      +17

blueorangutan · 2023-12-22T14:32:13Z

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 8132

shwstppr · 2023-12-22T14:39:01Z

@weizhouapache very few new tests 😜
Looks good. Need to wait for test results

weizhouapache · 2023-12-22T15:35:27Z

@blueorangutan test matrix

blueorangutan · 2023-12-22T15:36:03Z

@weizhouapache a [SL] Trillian-Jenkins matrix job (centos7 mgmt + xenserver71, rocky8 mgmt + vmware67u3, centos7 mgmt + kvmcentos7) has been kicked to run smoke tests

blueorangutan · 2023-12-23T08:10:11Z

[SF] Trillian test result (tid-8667)
Environment: xenserver-71 (x2), Advanced Networking with Mgmt server 7
Total time taken: 58210 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8402-t8667-xenserver-71.zip
Smoke tests completed. 107 look OK, 2 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test	Result	Time (s)	Test File
test_01_deploy_vm_on_specific_host	`Error`	1.32	test_vm_deployment_planner.py
test_02_deploy_vm_on_specific_cluster	`Error`	2.36	test_vm_deployment_planner.py
test_03_deploy_vm_on_specific_pod	`Error`	2.43	test_vm_deployment_planner.py
test_04_deploy_vm_on_host_override_pod_and_cluster	`Error`	1.34	test_vm_deployment_planner.py
test_05_deploy_vm_on_cluster_override_pod	`Error`	2.33	test_vm_deployment_planner.py
test_02_cancel_host_maintenace_with_migration_jobs	`Error`	1786.37	test_host_maintenance.py
test_02_cancel_host_maintenace_with_migration_jobs	`Error`	1786.41	test_host_maintenance.py

weizhouapache · 2023-12-23T08:38:38Z

@shwstppr
looks good 😺

Test to check for failure while tying to upgrade a Kubernetes cluster to a lower version ... === TestName: test_01_invalid_upgrade_kubernetes_cluster | Status : SUCCESS ===
ok
Test to deploy a new Kubernetes cluster and upgrade it to newer version ... === TestName: test_02_upgrade_kubernetes_cluster | Status : SUCCESS ===
ok
Test to deploy a new Kubernetes cluster and check for failure while tying to scale it ... === TestName: test_03_deploy_and_scale_kubernetes_cluster | Status : SUCCESS ===
ok
Test to enable autoscaling a Kubernetes cluster ... === TestName: test_04_autoscale_kubernetes_cluster | Status : SUCCESS ===
ok
Test to deploy a new Kubernetes cluster ... === TestName: test_05_basic_lifecycle_kubernetes_cluster | Status : SUCCESS ===
ok
Test to delete an existing Kubernetes cluster ... === TestName: test_06_delete_kubernetes_cluster | Status : SUCCESS ===
ok
Test to upgrade a HA Kubernetes cluster to newer version ... === TestName: test_08_10_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
ok
Test to upgrade a HA Kubernetes cluster to newer version ... === TestName: test_08_11_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
ok
Test to upgrade a HA Kubernetes cluster to newer version ... === TestName: test_08_12_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
ok
Test to upgrade a HA Kubernetes cluster to newer version ... === TestName: test_08_13_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
ok
Test to upgrade a HA Kubernetes cluster to newer version ... === TestName: test_08_14_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
ok
Test to upgrade a HA Kubernetes cluster to newer version ... === TestName: test_08_15_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
ok
Test to upgrade a HA Kubernetes cluster to newer version ... === TestName: test_08_16_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
ok
Test to upgrade a HA Kubernetes cluster to newer version ... === TestName: test_08_17_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
ok
Test to upgrade a HA Kubernetes cluster to newer version ... === TestName: test_08_18_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
ok
Test to upgrade a HA Kubernetes cluster to newer version ... === TestName: test_08_19_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
ok
Test to upgrade a HA Kubernetes cluster to newer version ... === TestName: test_08_2_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
ok
Test to upgrade a HA Kubernetes cluster to newer version ... === TestName: test_08_3_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
ok
Test to upgrade a HA Kubernetes cluster to newer version ... === TestName: test_08_4_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
ok
Test to upgrade a HA Kubernetes cluster to newer version ... === TestName: test_08_5_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
ok
Test to upgrade a HA Kubernetes cluster to newer version ... === TestName: test_08_6_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
ok
Test to upgrade a HA Kubernetes cluster to newer version ... === TestName: test_08_7_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
ok
Test to upgrade a HA Kubernetes cluster to newer version ... === TestName: test_08_8_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
ok
Test to upgrade a HA Kubernetes cluster to newer version ... === TestName: test_08_9_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
ok
Test to upgrade a HA Kubernetes cluster to newer version ... === TestName: test_08_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
ok
Test to deploy a Kubernetes cluster on VPC ... === TestName: test_10_vpc_tier_kubernetes_cluster | Status : SUCCESS ===
ok

----------------------------------------------------------------------
Ran 26 tests in 19677.224s

OK

blueorangutan · 2023-12-23T11:47:21Z

[SF] Trillian test result (tid-8669)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 71121 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8402-t8669-kvm-centos7.zip
Smoke tests completed. 106 look OK, 3 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test	Result	Time (s)	Test File
test_08_9_upgrade_kubernetes_ha_cluster	`Failure`	881.05	test_kubernetes_clusters.py
test_08_migrate_vm	`Error`	0.05	test_vm_life_cycle.py
test_hostha_kvm_host_degraded	`Error`	699.05	test_hostha_kvm.py
test_hostha_kvm_host_fencing	`Error`	683.27	test_hostha_kvm.py

weizhouapache · 2023-12-23T12:48:48Z

@blueorangutan test matrix

blueorangutan · 2023-12-23T12:50:04Z

@weizhouapache a [SL] Trillian-Jenkins matrix job (centos7 mgmt + xenserver71, rocky8 mgmt + vmware67u3, centos7 mgmt + kvmcentos7) has been kicked to run smoke tests

blueorangutan · 2023-12-24T05:11:56Z

[SF] Trillian test result (tid-8671)
Environment: xenserver-71 (x2), Advanced Networking with Mgmt server 7
Total time taken: 57476 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8402-t8671-xenserver-71.zip
Smoke tests completed. 108 look OK, 1 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test	Result	Time (s)	Test File
test_01_deploy_vm_on_specific_host	`Error`	2.26	test_vm_deployment_planner.py
test_02_deploy_vm_on_specific_cluster	`Error`	2.24	test_vm_deployment_planner.py
test_03_deploy_vm_on_specific_pod	`Error`	1.25	test_vm_deployment_planner.py
test_04_deploy_vm_on_host_override_pod_and_cluster	`Error`	2.30	test_vm_deployment_planner.py
test_05_deploy_vm_on_cluster_override_pod	`Error`	2.26	test_vm_deployment_planner.py

shwstppr · 2023-12-24T05:23:03Z

@weizhouapache do we need to move some of the test cases to a component test? Taking more than 5.5h now

shwstppr@shwstppr-zbook:~/Downloads|⇒  grep "kubernetes" pr7345-t8662-xenserver-71/MarvinLogs/tests-time.txt 
test_kubernetes_clusters.py: 2738 seconds
test_kubernetes_supported_versions.py: 65 seconds
shwstppr@shwstppr-zbook:~/Downloads|⇒  grep "kubernetes" pr8402-t8671-xenserver-71/MarvinLogs/tests-time.txt 
test_kubernetes_clusters.py: 20429 seconds
test_kubernetes_supported_versions.py: 65 seconds
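
The comparison above can be reproduced with a small script. This is a sketch that assumes `tests-time.txt` contains lines of the form `<file>: <N> seconds`, as in the `grep` output shown; `parse_test_times` is a hypothetical helper, not part of Marvin or Trillian.

```python
from typing import Dict, Iterable


def parse_test_times(lines: Iterable[str]) -> Dict[str, int]:
    """Parse lines like 'test_x.py: 123 seconds' into {file: seconds}."""
    times: Dict[str, int] = {}
    for line in lines:
        name, _, rest = line.partition(":")
        parts = rest.split()
        # Skip anything that does not match '<digits> seconds'.
        if len(parts) == 2 and parts[0].isdigit() and parts[1] == "seconds":
            times[name.strip()] = int(parts[0])
    return times


# Values copied from the two grep outputs above.
before = parse_test_times([
    "test_kubernetes_clusters.py: 2738 seconds",
    "test_kubernetes_supported_versions.py: 65 seconds",
])
after = parse_test_times([
    "test_kubernetes_clusters.py: 20429 seconds",
    "test_kubernetes_supported_versions.py: 65 seconds",
])

name = "test_kubernetes_clusters.py"
hours = after[name] / 3600
ratio = after[name] / before[name]
print(f"{name}: {hours:.2f} h, {ratio:.2f}x the earlier run")
```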

blueorangutan · 2023-12-24T07:20:44Z

[SF] Trillian test result (tid-8673)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 65144 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8402-t8673-kvm-centos7.zip
Smoke tests completed. 107 look OK, 2 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test	Result	Time (s)	Test File
test_08_migrate_vm	`Error`	0.06	test_vm_life_cycle.py
test_hostha_kvm_host_degraded	`Error`	702.24	test_hostha_kvm.py
test_hostha_kvm_host_fencing	`Error`	684.15	test_hostha_kvm.py

weizhouapache · 2023-12-24T08:07:01Z

@shwstppr
The last commit (af93915) is only used for testing.
I will revert it.

weizhouapache · 2023-12-24T08:55:18Z

@blueorangutan test matrix

blueorangutan · 2023-12-24T08:56:03Z

@weizhouapache a [SL] Trillian-Jenkins matrix job (centos7 mgmt + xenserver71, rocky8 mgmt + vmware67u3, centos7 mgmt + kvmcentos7) has been kicked to run smoke tests

blueorangutan · 2024-01-25T08:36:03Z

@weizhouapache a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

DaanHoogland

clgtm

weizhouapache · 2024-01-25T11:40:16Z

@blueorangutan test rocky8 kvm-rocky8

blueorangutan · 2024-01-25T11:52:03Z

@weizhouapache a [SL] Trillian-Jenkins test job (rocky8 mgmt + kvm-rocky8) has been kicked to run smoke tests

blueorangutan · 2024-01-26T08:27:46Z

[SF] Trillian test result (tid-8943)
Environment: kvm-rocky8 (x2), Advanced Networking with Mgmt server r8
Total time taken: 72166 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8402-t8943-kvm-rocky8.zip
Smoke tests completed. 109 look OK, 1 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test	Result	Time (s)	Test File
test_01_migrate_VM_and_root_volume	`Error`	90.53	test_vm_life_cycle.py
test_02_migrate_VM_with_two_data_disks	`Error`	55.88	test_vm_life_cycle.py
test_08_migrate_vm	`Error`	50.25	test_vm_life_cycle.py

weizhouapache · 2024-01-26T12:28:31Z

@blueorangutan package

blueorangutan · 2024-01-26T12:30:03Z

@weizhouapache a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

blueorangutan · 2024-01-26T14:24:19Z

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 8446

weizhouapache · 2024-01-26T14:32:27Z

@blueorangutan test rocky8 kvm-rocky8

blueorangutan · 2024-01-26T14:34:03Z

@weizhouapache a [SL] Trillian-Jenkins test job (rocky8 mgmt + kvm-rocky8) has been kicked to run smoke tests

blueorangutan · 2024-01-27T14:09:03Z

[SF] Trillian test result (tid-8952)
Environment: kvm-rocky8 (x2), Advanced Networking with Mgmt server r8
Total time taken: 82863 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8402-t8952-kvm-rocky8.zip
Smoke tests completed. 110 look OK, 0 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test	Result	Time (s)	Test File

blueorangutan · 2024-01-28T10:23:48Z

[SF] Trillian test result (tid-8958)
Environment: kvm-rocky8 (x3), Advanced Networking with Mgmt server r8
Total time taken: 62988 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8402-t8958-kvm-rocky8.zip
Smoke tests completed. 109 look OK, 1 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test	Result	Time (s)	Test File
test_01_migrate_VM_and_root_volume	`Error`	83.32	test_vm_life_cycle.py
test_02_migrate_VM_with_two_data_disks	`Error`	52.78	test_vm_life_cycle.py
test_01_secure_vm_migration	`Error`	243.32	test_vm_life_cycle.py
test_02_unsecure_vm_migration	`Error`	243.25	test_vm_life_cycle.py
test_03_secured_to_nonsecured_vm_migration	`Error`	243.26	test_vm_life_cycle.py
test_04_nonsecured_to_secured_vm_migration	`Error`	243.29	test_vm_life_cycle.py
test_08_migrate_vm	`Error`	47.04	test_vm_life_cycle.py

weizhouapache · 2024-01-29T08:58:24Z

Ran 2 Trillian tests over the weekend.

20/20 are good on pr8402-t8958-kvm-rocky8

01:14:20 === TestName: test_08_10_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
01:14:20 
01:14:20 === TestName: test_08_11_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
01:14:20 
01:14:20 === TestName: test_08_12_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
01:14:20 
01:14:20 === TestName: test_08_13_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
01:14:20 
01:14:20 === TestName: test_08_14_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
01:14:20 
01:14:20 === TestName: test_08_15_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
01:14:20 
01:14:20 === TestName: test_08_16_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
01:14:20 
01:14:20 === TestName: test_08_17_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
01:14:20 
01:14:20 === TestName: test_08_18_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
01:14:20 
01:14:20 === TestName: test_08_19_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
01:14:20 
01:14:20 === TestName: test_08_2_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
01:14:20 
01:14:20 === TestName: test_08_3_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
01:14:20 
01:14:20 === TestName: test_08_4_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
01:14:20 
01:14:20 === TestName: test_08_5_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
01:14:20 
01:14:20 === TestName: test_08_6_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
01:14:20 
01:14:20 === TestName: test_08_7_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
01:14:20 
01:14:20 === TestName: test_08_8_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
01:14:20 
01:14:20 === TestName: test_08_9_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
01:14:20 
01:14:20 === TestName: test_08_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===

38/40 are good on pr8402-t8952-kvm-rocky8

00:58:04 === TestName: test_08_10_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
00:58:04 
00:58:04 === TestName: test_08_11_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
00:58:04 
00:58:04 === TestName: test_08_12_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
00:58:04 
00:58:04 === TestName: test_08_13_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
00:58:04 
00:58:04 === TestName: test_08_14_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
00:58:04 
00:58:04 === TestName: test_08_15_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
00:58:04 
00:58:04 === TestName: test_08_16_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
00:58:04 
00:58:04 === TestName: test_08_17_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
00:58:04 
00:58:04 === TestName: test_08_18_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
00:58:04 
00:58:04 === TestName: test_08_19_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
00:58:04 
00:58:04 === TestName: test_08_2_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
00:58:04 
00:58:04 === TestName: test_08_3_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
00:58:04 
00:58:04 === TestName: test_08_4_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
00:58:04 
00:58:04 === TestName: test_08_5_upgrade_kubernetes_ha_cluster | Status : FAILED ===
00:58:04 
00:58:04 === TestName: test_08_6_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
00:58:04 
00:58:04 === TestName: test_08_7_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
00:58:04 
00:58:04 === TestName: test_08_8_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
00:58:04 
00:58:04 === TestName: test_08_9_upgrade_kubernetes_ha_cluster | Status : FAILED ===
00:58:04 
00:58:04 === TestName: test_08_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===

06:26:03 === TestName: test_08_10_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
06:26:03 
06:26:03 === TestName: test_08_11_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
06:26:03 
06:26:03 === TestName: test_08_12_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
06:26:03 
06:26:03 === TestName: test_08_13_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
06:26:03 
06:26:03 === TestName: test_08_14_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
06:26:03 
06:26:03 === TestName: test_08_15_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
06:26:03 
06:26:03 === TestName: test_08_16_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
06:26:03 
06:26:03 === TestName: test_08_17_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
06:26:03 
06:26:03 === TestName: test_08_18_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
06:26:03 
06:26:03 === TestName: test_08_19_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
06:26:03 
06:26:03 === TestName: test_08_2_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
06:26:03 
06:26:03 === TestName: test_08_3_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
06:26:03 
06:26:03 === TestName: test_08_4_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
06:26:03 
06:26:03 === TestName: test_08_5_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
06:26:03 
06:26:03 === TestName: test_08_6_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
06:26:03 
06:26:03 === TestName: test_08_7_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
06:26:03 
06:26:03 === TestName: test_08_8_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
06:26:03 
06:26:03 === TestName: test_08_9_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
06:26:03 
06:26:03 === TestName: test_08_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===

Not perfect but looks better than before
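
Tallies like the ones above can be counted from the log text instead of by eye. The following is a sketch that assumes status lines of the form `=== TestName: <name> | Status : <STATUS> ===`, as seen in the output above; `tally_statuses` is a hypothetical helper and the sample is a three-line excerpt.

```python
import re
from collections import Counter

# Matches the status lines printed by the test runner in the logs above.
STATUS_LINE = re.compile(r"=== TestName: (\S+) \| Status : (\w+) ===")


def tally_statuses(log_text: str) -> Counter:
    """Count how many tests ended in each status (SUCCESS, FAILED, ...)."""
    return Counter(status for _name, status in STATUS_LINE.findall(log_text))


sample = """\
00:58:04 === TestName: test_08_4_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
00:58:04 === TestName: test_08_5_upgrade_kubernetes_ha_cluster | Status : FAILED ===
00:58:04 === TestName: test_08_6_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
"""

counts = tally_statuses(sample)
print(dict(counts))
```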

rohityadavcloud · 2024-02-05T08:39:01Z

@blueorangutan package

blueorangutan · 2024-02-05T08:40:03Z

@rohityadavcloud a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

blueorangutan · 2024-02-05T10:36:35Z

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 8520

weizhouapache · 2024-02-05T10:46:22Z

@blueorangutan test rocky8 kvm-rocky8

blueorangutan · 2024-02-05T10:48:04Z

@weizhouapache a [SL] Trillian-Jenkins test job (rocky8 mgmt + kvm-rocky8) has been kicked to run smoke tests

blueorangutan · 2024-02-06T05:16:37Z

[SF] Trillian test result (tid-9078)
Environment: kvm-rocky8 (x2), Advanced Networking with Mgmt server r8
Total time taken: 64477 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8402-t9078-kvm-rocky8.zip
Smoke tests completed. 109 look OK, 1 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test	Result	Time (s)	Test File
test_01_template_usage	`Error`	1.18	test_usage.py
test_01_volume_usage	`Error`	129.89	test_usage.py

weizhouapache · 2024-02-06T10:12:01Z

Merging based on 2 approvals and a perfect smoke test result.

19:54:06 === TestName: test_08_10_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
19:54:06 
19:54:06 === TestName: test_08_11_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
19:54:06 
19:54:06 === TestName: test_08_12_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
19:54:06 
19:54:06 === TestName: test_08_13_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
19:54:06 
19:54:06 === TestName: test_08_14_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
19:54:06 
19:54:06 === TestName: test_08_15_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
19:54:06 
19:54:06 === TestName: test_08_16_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
19:54:06 
19:54:06 === TestName: test_08_17_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
19:54:06 
19:54:06 === TestName: test_08_18_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
19:54:06 
19:54:06 === TestName: test_08_19_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
19:54:06 
19:54:06 === TestName: test_08_2_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
19:54:06 
19:54:06 === TestName: test_08_3_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
19:54:06 
19:54:06 === TestName: test_08_4_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
19:54:06 
19:54:06 === TestName: test_08_5_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
19:54:06 
19:54:06 === TestName: test_08_6_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
19:54:06 
19:54:06 === TestName: test_08_7_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
19:54:06 
19:54:06 === TestName: test_08_8_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
19:54:06 
19:54:06 === TestName: test_08_9_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
19:54:06 
19:54:06 === TestName: test_08_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===

…che#8402)

* CKS: retry if unable to drain node or unable to upgrade k8s node

  I tried CKS upgrade 16 times, 11 of 16 upgrades succeeded.

  2 of 16 upgrades failed due to

  ```
  error: unable to drain node "testcluster-of7974-node-18c8c33c2c3" due to error:[error when evicting pods/"cloud-controller-manager-5b8fc87665-5nwlh" -n "kube-system": Post "https://10.0.66.18:6443/api/v1/namespaces/kube-system/pods/cloud-controller-manager-5b8fc87665-5nwlh/eviction": unexpected EOF, error when evicting pods/"coredns-5d78c9869d-h5nkz" -n "kube-system": Post "https://10.0.66.18:6443/api/v1/namespaces/kube-system/pods/coredns-5d78c9869d-h5nkz/eviction": unexpected EOF], continuing command...
  ```

  3 of 16 upgrades failed due to

  ```
  Error from server: error when retrieving current configuration of: Resource: "rbac.authorization.k8s.io/v1, Resource=roles", GroupVersionKind: "rbac.authorization.k8s.io/v1, Kind=Role" Name: "kubernetes-dashboard", Namespace: "kubernetes-dashboard" from server for: "/mnt/k8sdisk//dashboard.yaml": etcdserver: leader changed
  ```

* CKS: remove tests of creating/deleting HA clusters as they are covered by the upgrade test
* Update PR 8402 as suggested
* test: remove CKS cluster if fail to create or verify

weizhouapache added 3 commits December 22, 2023 14:13

CKS: remove tests of creating/deleting HA clusters as they are covered by the upgrade test

62d6733

CKS: run HA cluster upgrade test 20 times ...will revert later

af93915

it will take 15m * 20 = 5 hours

boring-cyborg bot added component:integration-test component:kubernetes Python Warning... Python code Ahead! labels Dec 22, 2023

weizhouapache added this to the 4.18.2.0 milestone Dec 22, 2023

weizhouapache marked this pull request as ready for review December 22, 2023 15:35

apache locked and limited conversation to collaborators Dec 22, 2023

apache unlocked this conversation Dec 22, 2023

weizhouapache marked this pull request as draft December 23, 2023 08:34

Revert "CKS: run HA cluster upgrade test 20 times ...will revert later"

cc3e8c0

This reverts commit af93915.

DaanHoogland approved these changes Jan 25, 2024

test: remove CKS cluster if fail to create or verify

1d20dfe

rohityadavcloud requested a review from JoaoJandre February 5, 2024 08:38

Revert "CKS: run HA cluster upgrade test 20 times ...will revert later"

232d52c

This reverts commit 34e227a.

weizhouapache merged commit 69e8ebc into apache:4.18 Feb 6, 2024
25 checks passed

DaanHoogland deleted the 4.18-cks-ha-cluster-upgrade-retry branch February 6, 2024 10:16

This was referenced Feb 8, 2024

Fix smoke tests for kubernetes cluster #7604

Closed

intermittent smoke test failure for kubernetes clusters #7778

Closed
