CKS: retry if unable to drain node or unable to upgrade k8s node #8402

Merged

weizhouapache
Member

Description

This PR tries to fix upgrades of HA clusters by retrying when either of the following fails:

  • draining a Kubernetes node
  • upgrading a Kubernetes node
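
The retry idea described above can be sketched as follows. This is only a minimal Python illustration, not the code from `KubernetesClusterUpgradeWorker.java`; the helper name `run_with_retries`, the attempt count, and the delay are assumptions made for the example.

```python
import time
from typing import Callable


def run_with_retries(step: Callable[[], bool], max_attempts: int = 3,
                     delay_seconds: float = 0.0) -> bool:
    """Run `step` until it returns True or the attempts are exhausted.

    `step` is a hypothetical stand-in for "drain the node" or
    "upgrade the node"; it reports success as a boolean.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        if step():
            return True
        # Wait before the next attempt, but not after the last one.
        if attempt < max_attempts:
            time.sleep(delay_seconds)
    return False


# Usage: a simulated drain that fails twice and then succeeds.
attempts = {"drain": 0}


def flaky_drain() -> bool:
    attempts["drain"] += 1
    return attempts["drain"] >= 3


drained = run_with_retries(flaky_drain, max_attempts=5, delay_seconds=0)
```

A bounded loop like this gives a transient API-server or etcd hiccup a chance to clear, while still failing the upgrade if the step never succeeds.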

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)
  • build/CI

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

Screenshots (if appropriate):

How Has This Been Tested?

How did you try to break this feature and the system with this change?

I ran the CKS upgrade 16 times; 11 of the 16 upgrades succeeded.

2 of the 16 upgrades failed while draining a node:
```
error: unable to drain node "testcluster-of7974-node-18c8c33c2c3" due to error:[error when evicting pods/"cloud-controller-manager-5b8fc87665-5nwlh" -n "kube-system": Post "https://10.0.66.18:6443/api/v1/namespaces/kube-system/pods/cloud-controller-manager-5b8fc87665-5nwlh/eviction": unexpected EOF, error when evicting pods/"coredns-5d78c9869d-h5nkz" -n "kube-system": Post "https://10.0.66.18:6443/api/v1/namespaces/kube-system/pods/coredns-5d78c9869d-h5nkz/eviction": unexpected EOF], continuing command...
```

3 of the 16 upgrades failed with an etcd leader change:
```
Error from server: error when retrieving current configuration of:
Resource: "rbac.authorization.k8s.io/v1, Resource=roles", GroupVersionKind: "rbac.authorization.k8s.io/v1, Kind=Role"
Name: "kubernetes-dashboard", Namespace: "kubernetes-dashboard"
from server for: "/mnt/k8sdisk//dashboard.yaml": etcdserver: leader changed
```
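
Both failures quoted above look transient (a dropped connection and an etcd leader change). As a hedged illustration only, the sketch below shows one way such output could be recognized as worth retrying; the PR does not say whether it inspects error text at all, and the names `TRANSIENT_MARKERS` and `looks_transient` are hypothetical.

```python
# Substrings copied from the two error messages quoted above.
TRANSIENT_MARKERS = ("unexpected EOF", "etcdserver: leader changed")


def looks_transient(command_output: str) -> bool:
    """Return True if the output contains one of the known transient markers."""
    return any(marker in command_output for marker in TRANSIENT_MARKERS)


# Usage with short excerpts of the two logged errors.
drain_error = ('Post "https://10.0.66.18:6443/api/v1/namespaces/kube-system/'
               'pods/coredns-5d78c9869d-h5nkz/eviction": unexpected EOF')
upgrade_error = ('from server for: "/mnt/k8sdisk//dashboard.yaml": '
                 'etcdserver: leader changed')

drain_is_transient = looks_transient(drain_error)
upgrade_is_transient = looks_transient(upgrade_error)
```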
@weizhouapache weizhouapache added this to the 4.18.2.0 milestone Dec 22, 2023
@weizhouapache
Member Author

@blueorangutan package

@blueorangutan

@weizhouapache a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

codecov bot commented Dec 22, 2023

Codecov Report

Attention: 28 lines in your changes are missing coverage. Please review.

Comparison is base (08749d8) 13.12% compared to head (1d20dfe) 13.16%.
Report is 3 commits behind head on 4.18.

| Files | Patch % | Lines |
| --- | --- | --- |
| .../actionworkers/KubernetesClusterUpgradeWorker.java | 0.00% | 28 Missing ⚠️ |

Additional details and impacted files
@@             Coverage Diff              @@
##               4.18    #8402      +/-   ##
============================================
+ Coverage     13.12%   13.16%   +0.03%     
- Complexity     9141     9203      +62     
============================================
  Files          2720     2724       +4     
  Lines        257726   258091     +365     
  Branches      40177    40229      +52     
============================================
+ Hits          33838    33988     +150     
- Misses       219598   219796     +198     
- Partials       4290     4307      +17     

@blueorangutan

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 8132

@shwstppr
Contributor

@weizhouapache very few new tests 😜
Looks good. Need to wait for test results

@weizhouapache weizhouapache marked this pull request as ready for review December 22, 2023 15:35
@weizhouapache
Member Author

@blueorangutan test matrix

@blueorangutan

@weizhouapache a [SL] Trillian-Jenkins matrix job (centos7 mgmt + xenserver71, rocky8 mgmt + vmware67u3, centos7 mgmt + kvmcentos7) has been kicked to run smoke tests

@apache apache locked and limited conversation to collaborators Dec 22, 2023
@apache apache unlocked this conversation Dec 22, 2023
@blueorangutan

[SF] Trillian test result (tid-8667)
Environment: xenserver-71 (x2), Advanced Networking with Mgmt server 7
Total time taken: 58210 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8402-t8667-xenserver-71.zip
Smoke tests completed. 107 look OK, 2 have errors, 0 did not run
Only failed and skipped tests results shown below:

| Test | Result | Time (s) | Test File |
| --- | --- | --- | --- |
| test_01_deploy_vm_on_specific_host | Error | 1.32 | test_vm_deployment_planner.py |
| test_02_deploy_vm_on_specific_cluster | Error | 2.36 | test_vm_deployment_planner.py |
| test_03_deploy_vm_on_specific_pod | Error | 2.43 | test_vm_deployment_planner.py |
| test_04_deploy_vm_on_host_override_pod_and_cluster | Error | 1.34 | test_vm_deployment_planner.py |
| test_05_deploy_vm_on_cluster_override_pod | Error | 2.33 | test_vm_deployment_planner.py |
| test_02_cancel_host_maintenace_with_migration_jobs | Error | 1786.37 | test_host_maintenance.py |
| test_02_cancel_host_maintenace_with_migration_jobs | Error | 1786.41 | test_host_maintenance.py |

@weizhouapache weizhouapache marked this pull request as draft December 23, 2023 08:34
@weizhouapache
Member Author

> [SF] Trillian test result (tid-8667)
> Environment: xenserver-71 (x2), Advanced Networking with Mgmt server 7
> Total time taken: 58210 seconds
> Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8402-t8667-xenserver-71.zip
> Smoke tests completed. 107 look OK, 2 have errors, 0 did not run
> Only failed and skipped tests results shown below:
>
> | Test | Result | Time (s) | Test File |
> | --- | --- | --- | --- |
> | test_01_deploy_vm_on_specific_host | Error | 1.32 | test_vm_deployment_planner.py |
> | test_02_deploy_vm_on_specific_cluster | Error | 2.36 | test_vm_deployment_planner.py |
> | test_03_deploy_vm_on_specific_pod | Error | 2.43 | test_vm_deployment_planner.py |
> | test_04_deploy_vm_on_host_override_pod_and_cluster | Error | 1.34 | test_vm_deployment_planner.py |
> | test_05_deploy_vm_on_cluster_override_pod | Error | 2.33 | test_vm_deployment_planner.py |
> | test_02_cancel_host_maintenace_with_migration_jobs | Error | 1786.37 | test_host_maintenance.py |
> | test_02_cancel_host_maintenace_with_migration_jobs | Error | 1786.41 | test_host_maintenance.py |

@shwstppr
looks good 😺

Test to check for failure while tying to upgrade a Kubernetes cluster to a lower version ... === TestName: test_01_invalid_upgrade_kubernetes_cluster | Status : SUCCESS ===
ok
Test to deploy a new Kubernetes cluster and upgrade it to newer version ... === TestName: test_02_upgrade_kubernetes_cluster | Status : SUCCESS ===
ok
Test to deploy a new Kubernetes cluster and check for failure while tying to scale it ... === TestName: test_03_deploy_and_scale_kubernetes_cluster | Status : SUCCESS ===
ok
Test to enable autoscaling a Kubernetes cluster ... === TestName: test_04_autoscale_kubernetes_cluster | Status : SUCCESS ===
ok
Test to deploy a new Kubernetes cluster ... === TestName: test_05_basic_lifecycle_kubernetes_cluster | Status : SUCCESS ===
ok
Test to delete an existing Kubernetes cluster ... === TestName: test_06_delete_kubernetes_cluster | Status : SUCCESS ===
ok
Test to upgrade a HA Kubernetes cluster to newer version ... === TestName: test_08_10_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
ok
Test to upgrade a HA Kubernetes cluster to newer version ... === TestName: test_08_11_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
ok
Test to upgrade a HA Kubernetes cluster to newer version ... === TestName: test_08_12_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
ok
Test to upgrade a HA Kubernetes cluster to newer version ... === TestName: test_08_13_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
ok
Test to upgrade a HA Kubernetes cluster to newer version ... === TestName: test_08_14_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
ok
Test to upgrade a HA Kubernetes cluster to newer version ... === TestName: test_08_15_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
ok
Test to upgrade a HA Kubernetes cluster to newer version ... === TestName: test_08_16_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
ok
Test to upgrade a HA Kubernetes cluster to newer version ... === TestName: test_08_17_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
ok
Test to upgrade a HA Kubernetes cluster to newer version ... === TestName: test_08_18_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
ok
Test to upgrade a HA Kubernetes cluster to newer version ... === TestName: test_08_19_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
ok
Test to upgrade a HA Kubernetes cluster to newer version ... === TestName: test_08_2_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
ok
Test to upgrade a HA Kubernetes cluster to newer version ... === TestName: test_08_3_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
ok
Test to upgrade a HA Kubernetes cluster to newer version ... === TestName: test_08_4_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
ok
Test to upgrade a HA Kubernetes cluster to newer version ... === TestName: test_08_5_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
ok
Test to upgrade a HA Kubernetes cluster to newer version ... === TestName: test_08_6_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
ok
Test to upgrade a HA Kubernetes cluster to newer version ... === TestName: test_08_7_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
ok
Test to upgrade a HA Kubernetes cluster to newer version ... === TestName: test_08_8_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
ok
Test to upgrade a HA Kubernetes cluster to newer version ... === TestName: test_08_9_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
ok
Test to upgrade a HA Kubernetes cluster to newer version ... === TestName: test_08_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
ok
Test to deploy a Kubernetes cluster on VPC ... === TestName: test_10_vpc_tier_kubernetes_cluster | Status : SUCCESS ===
ok

----------------------------------------------------------------------
Ran 26 tests in 19677.224s

OK

@blueorangutan

[SF] Trillian test result (tid-8669)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 71121 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8402-t8669-kvm-centos7.zip
Smoke tests completed. 106 look OK, 3 have errors, 0 did not run
Only failed and skipped tests results shown below:

| Test | Result | Time (s) | Test File |
| --- | --- | --- | --- |
| test_08_9_upgrade_kubernetes_ha_cluster | Failure | 881.05 | test_kubernetes_clusters.py |
| test_08_migrate_vm | Error | 0.05 | test_vm_life_cycle.py |
| test_hostha_kvm_host_degraded | Error | 699.05 | test_hostha_kvm.py |
| test_hostha_kvm_host_fencing | Error | 683.27 | test_hostha_kvm.py |

@weizhouapache
Member Author

@blueorangutan test matrix

@blueorangutan

@weizhouapache a [SL] Trillian-Jenkins matrix job (centos7 mgmt + xenserver71, rocky8 mgmt + vmware67u3, centos7 mgmt + kvmcentos7) has been kicked to run smoke tests

@blueorangutan

[SF] Trillian test result (tid-8671)
Environment: xenserver-71 (x2), Advanced Networking with Mgmt server 7
Total time taken: 57476 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8402-t8671-xenserver-71.zip
Smoke tests completed. 108 look OK, 1 have errors, 0 did not run
Only failed and skipped tests results shown below:

| Test | Result | Time (s) | Test File |
| --- | --- | --- | --- |
| test_01_deploy_vm_on_specific_host | Error | 2.26 | test_vm_deployment_planner.py |
| test_02_deploy_vm_on_specific_cluster | Error | 2.24 | test_vm_deployment_planner.py |
| test_03_deploy_vm_on_specific_pod | Error | 1.25 | test_vm_deployment_planner.py |
| test_04_deploy_vm_on_host_override_pod_and_cluster | Error | 2.30 | test_vm_deployment_planner.py |
| test_05_deploy_vm_on_cluster_override_pod | Error | 2.26 | test_vm_deployment_planner.py |

@shwstppr
Contributor

@weizhouapache do we need to move some of the test cases to a component test? Taking more than 5.5h now

shwstppr@shwstppr-zbook:~/Downloads|⇒  grep "kubernetes" pr7345-t8662-xenserver-71/MarvinLogs/tests-time.txt 
test_kubernetes_clusters.py: 2738 seconds
test_kubernetes_supported_versions.py: 65 seconds
shwstppr@shwstppr-zbook:~/Downloads|⇒  grep "kubernetes" pr8402-t8671-xenserver-71/MarvinLogs/tests-time.txt 
test_kubernetes_clusters.py: 20429 seconds
test_kubernetes_supported_versions.py: 65 seconds
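
The comparison above can be checked with a few lines of Python. This is a small sketch that assumes the `name: N seconds` line format shown in the grep output; the helper name `parse_test_times` is made up for the example.

```python
from typing import Dict


def parse_test_times(text: str) -> Dict[str, int]:
    """Parse lines of the form '<test file>: <N> seconds' into a dict."""
    times: Dict[str, int] = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        # Skip anything that is not exactly '<digits> seconds'.
        if sep and len(parts) == 2 and parts[0].isdigit() and parts[1] == "seconds":
            times[name.strip()] = int(parts[0])
    return times


before = parse_test_times(
    "test_kubernetes_clusters.py: 2738 seconds\n"
    "test_kubernetes_supported_versions.py: 65 seconds"
)
after = parse_test_times(
    "test_kubernetes_clusters.py: 20429 seconds\n"
    "test_kubernetes_supported_versions.py: 65 seconds"
)

name = "test_kubernetes_clusters.py"
hours_after = after[name] / 3600        # 20429 s is about 5.67 h
slowdown = after[name] / before[name]   # about 7.5 times the earlier run
print(f"{name}: {hours_after:.2f} h, {slowdown:.1f}x the earlier run")
```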

@blueorangutan

[SF] Trillian test result (tid-8673)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 65144 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8402-t8673-kvm-centos7.zip
Smoke tests completed. 107 look OK, 2 have errors, 0 did not run
Only failed and skipped tests results shown below:

| Test | Result | Time (s) | Test File |
| --- | --- | --- | --- |
| test_08_migrate_vm | Error | 0.06 | test_vm_life_cycle.py |
| test_hostha_kvm_host_degraded | Error | 702.24 | test_hostha_kvm.py |
| test_hostha_kvm_host_fencing | Error | 684.15 | test_hostha_kvm.py |

@weizhouapache
Member Author

> @weizhouapache do we need to move some of the test cases to a component test? Taking more than 5.5h now
>
> shwstppr@shwstppr-zbook:~/Downloads|⇒  grep "kubernetes" pr7345-t8662-xenserver-71/MarvinLogs/tests-time.txt
> test_kubernetes_clusters.py: 2738 seconds
> test_kubernetes_supported_versions.py: 65 seconds
> shwstppr@shwstppr-zbook:~/Downloads|⇒  grep "kubernetes" pr8402-t8671-xenserver-71/MarvinLogs/tests-time.txt
> test_kubernetes_clusters.py: 20429 seconds
> test_kubernetes_supported_versions.py: 65 seconds

@shwstppr
The last commit is only used for testing. I will revert it (af93915).

@weizhouapache
Member Author

@blueorangutan test matrix

@blueorangutan

@weizhouapache a [SL] Trillian-Jenkins matrix job (centos7 mgmt + xenserver71, rocky8 mgmt + vmware67u3, centos7 mgmt + kvmcentos7) has been kicked to run smoke tests

@blueorangutan

@weizhouapache a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

Contributor

@DaanHoogland DaanHoogland left a comment

clgtm

@weizhouapache
Member Author

@blueorangutan test rocky8 kvm-rocky8

@blueorangutan

@weizhouapache a [SL] Trillian-Jenkins test job (rocky8 mgmt + kvm-rocky8) has been kicked to run smoke tests

@blueorangutan

[SF] Trillian test result (tid-8943)
Environment: kvm-rocky8 (x2), Advanced Networking with Mgmt server r8
Total time taken: 72166 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8402-t8943-kvm-rocky8.zip
Smoke tests completed. 109 look OK, 1 have errors, 0 did not run
Only failed and skipped tests results shown below:

| Test | Result | Time (s) | Test File |
| --- | --- | --- | --- |
| test_01_migrate_VM_and_root_volume | Error | 90.53 | test_vm_life_cycle.py |
| test_02_migrate_VM_with_two_data_disks | Error | 55.88 | test_vm_life_cycle.py |
| test_08_migrate_vm | Error | 50.25 | test_vm_life_cycle.py |

@weizhouapache
Member Author

@blueorangutan package

@blueorangutan

@weizhouapache a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 8446

@weizhouapache
Member Author

@blueorangutan test rocky8 kvm-rocky8

@blueorangutan

@weizhouapache a [SL] Trillian-Jenkins test job (rocky8 mgmt + kvm-rocky8) has been kicked to run smoke tests

@blueorangutan

[SF] Trillian test result (tid-8952)
Environment: kvm-rocky8 (x2), Advanced Networking with Mgmt server r8
Total time taken: 82863 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8402-t8952-kvm-rocky8.zip
Smoke tests completed. 110 look OK, 0 have errors, 0 did not run
Only failed and skipped tests results shown below:

| Test | Result | Time (s) | Test File |
| --- | --- | --- | --- |

@blueorangutan

[SF] Trillian test result (tid-8958)
Environment: kvm-rocky8 (x3), Advanced Networking with Mgmt server r8
Total time taken: 62988 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8402-t8958-kvm-rocky8.zip
Smoke tests completed. 109 look OK, 1 have errors, 0 did not run
Only failed and skipped tests results shown below:

| Test | Result | Time (s) | Test File |
| --- | --- | --- | --- |
| test_01_migrate_VM_and_root_volume | Error | 83.32 | test_vm_life_cycle.py |
| test_02_migrate_VM_with_two_data_disks | Error | 52.78 | test_vm_life_cycle.py |
| test_01_secure_vm_migration | Error | 243.32 | test_vm_life_cycle.py |
| test_02_unsecure_vm_migration | Error | 243.25 | test_vm_life_cycle.py |
| test_03_secured_to_nonsecured_vm_migration | Error | 243.26 | test_vm_life_cycle.py |
| test_04_nonsecured_to_secured_vm_migration | Error | 243.29 | test_vm_life_cycle.py |
| test_08_migrate_vm | Error | 47.04 | test_vm_life_cycle.py |

@weizhouapache
Member Author

I ran 2 Trillian tests over the weekend:

  • 20/20 are good on pr8402-t8958-kvm-rocky8
01:14:20 === TestName: test_08_10_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
01:14:20 
01:14:20 === TestName: test_08_11_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
01:14:20 
01:14:20 === TestName: test_08_12_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
01:14:20 
01:14:20 === TestName: test_08_13_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
01:14:20 
01:14:20 === TestName: test_08_14_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
01:14:20 
01:14:20 === TestName: test_08_15_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
01:14:20 
01:14:20 === TestName: test_08_16_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
01:14:20 
01:14:20 === TestName: test_08_17_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
01:14:20 
01:14:20 === TestName: test_08_18_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
01:14:20 
01:14:20 === TestName: test_08_19_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
01:14:20 
01:14:20 === TestName: test_08_2_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
01:14:20 
01:14:20 === TestName: test_08_3_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
01:14:20 
01:14:20 === TestName: test_08_4_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
01:14:20 
01:14:20 === TestName: test_08_5_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
01:14:20 
01:14:20 === TestName: test_08_6_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
01:14:20 
01:14:20 === TestName: test_08_7_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
01:14:20 
01:14:20 === TestName: test_08_8_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
01:14:20 
01:14:20 === TestName: test_08_9_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
01:14:20 
01:14:20 === TestName: test_08_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
  • 38/40 are good on pr8402-t8952-kvm-rocky8
00:58:04 === TestName: test_08_10_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
00:58:04 
00:58:04 === TestName: test_08_11_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
00:58:04 
00:58:04 === TestName: test_08_12_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
00:58:04 
00:58:04 === TestName: test_08_13_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
00:58:04 
00:58:04 === TestName: test_08_14_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
00:58:04 
00:58:04 === TestName: test_08_15_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
00:58:04 
00:58:04 === TestName: test_08_16_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
00:58:04 
00:58:04 === TestName: test_08_17_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
00:58:04 
00:58:04 === TestName: test_08_18_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
00:58:04 
00:58:04 === TestName: test_08_19_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
00:58:04 
00:58:04 === TestName: test_08_2_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
00:58:04 
00:58:04 === TestName: test_08_3_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
00:58:04 
00:58:04 === TestName: test_08_4_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
00:58:04 
00:58:04 === TestName: test_08_5_upgrade_kubernetes_ha_cluster | Status : FAILED ===
00:58:04 
00:58:04 === TestName: test_08_6_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
00:58:04 
00:58:04 === TestName: test_08_7_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
00:58:04 
00:58:04 === TestName: test_08_8_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
00:58:04 
00:58:04 === TestName: test_08_9_upgrade_kubernetes_ha_cluster | Status : FAILED ===
00:58:04 
00:58:04 === TestName: test_08_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===

06:26:03 === TestName: test_08_10_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
06:26:03 
06:26:03 === TestName: test_08_11_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
06:26:03 
06:26:03 === TestName: test_08_12_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
06:26:03 
06:26:03 === TestName: test_08_13_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
06:26:03 
06:26:03 === TestName: test_08_14_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
06:26:03 
06:26:03 === TestName: test_08_15_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
06:26:03 
06:26:03 === TestName: test_08_16_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
06:26:03 
06:26:03 === TestName: test_08_17_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
06:26:03 
06:26:03 === TestName: test_08_18_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
06:26:03 
06:26:03 === TestName: test_08_19_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
06:26:03 
06:26:03 === TestName: test_08_2_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
06:26:03 
06:26:03 === TestName: test_08_3_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
06:26:03 
06:26:03 === TestName: test_08_4_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
06:26:03 
06:26:03 === TestName: test_08_5_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
06:26:03 
06:26:03 === TestName: test_08_6_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
06:26:03 
06:26:03 === TestName: test_08_7_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
06:26:03 
06:26:03 === TestName: test_08_8_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
06:26:03 
06:26:03 === TestName: test_08_9_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
06:26:03 
06:26:03 === TestName: test_08_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===

Not perfect but looks better than before

@rohityadavcloud
Member

@blueorangutan package

@blueorangutan

@rohityadavcloud a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 8520

@weizhouapache
Member Author

@blueorangutan test rocky8 kvm-rocky8

@blueorangutan

@weizhouapache a [SL] Trillian-Jenkins test job (rocky8 mgmt + kvm-rocky8) has been kicked to run smoke tests

@blueorangutan

[SF] Trillian test result (tid-9078)
Environment: kvm-rocky8 (x2), Advanced Networking with Mgmt server r8
Total time taken: 64477 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8402-t9078-kvm-rocky8.zip
Smoke tests completed. 109 look OK, 1 have errors, 0 did not run
Only failed and skipped tests results shown below:

| Test | Result | Time (s) | Test File |
| --- | --- | --- | --- |
| test_01_template_usage | Error | 1.18 | test_usage.py |
| test_01_volume_usage | Error | 129.89 | test_usage.py |

@weizhouapache
Member Author

Merging based on 2 approvals and perfect smoke test results.

19:54:06 === TestName: test_08_10_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
19:54:06 
19:54:06 === TestName: test_08_11_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
19:54:06 
19:54:06 === TestName: test_08_12_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
19:54:06 
19:54:06 === TestName: test_08_13_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
19:54:06 
19:54:06 === TestName: test_08_14_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
19:54:06 
19:54:06 === TestName: test_08_15_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
19:54:06 
19:54:06 === TestName: test_08_16_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
19:54:06 
19:54:06 === TestName: test_08_17_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
19:54:06 
19:54:06 === TestName: test_08_18_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
19:54:06 
19:54:06 === TestName: test_08_19_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
19:54:06 
19:54:06 === TestName: test_08_2_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
19:54:06 
19:54:06 === TestName: test_08_3_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
19:54:06 
19:54:06 === TestName: test_08_4_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
19:54:06 
19:54:06 === TestName: test_08_5_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
19:54:06 
19:54:06 === TestName: test_08_6_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
19:54:06 
19:54:06 === TestName: test_08_7_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
19:54:06 
19:54:06 === TestName: test_08_8_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
19:54:06 
19:54:06 === TestName: test_08_9_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===
19:54:06 
19:54:06 === TestName: test_08_upgrade_kubernetes_ha_cluster | Status : SUCCESS ===

@weizhouapache weizhouapache merged commit 69e8ebc into apache:4.18 Feb 6, 2024
25 checks passed
@DaanHoogland DaanHoogland deleted the 4.18-cks-ha-cluster-upgrade-retry branch February 6, 2024 10:16
sureshanaparti pushed a commit to shapeblue/cloudstack that referenced this pull request Feb 8, 2024
…che#8402)

* CKS: retry if unable to drain node or unable to upgrade k8s node

I tried CKS upgrade 16 times, 11 of 16 upgrades succeeded.

2 of 16 upgrades failed due to
```
error: unable to drain node "testcluster-of7974-node-18c8c33c2c3" due to error:[error when evicting pods/"cloud-controller-manager-5b8fc87665-5nwlh" -n "kube-system": Post "https://10.0.66.18:6443/api/v1/namespaces/kube-system/pods/cloud-controller-manager-5b8fc87665-5nwlh/eviction": unexpected EOF, error when evicting pods/"coredns-5d78c9869d-h5nkz" -n "kube-system": Post "https://10.0.66.18:6443/api/v1/namespaces/kube-system/pods/coredns-5d78c9869d-h5nkz/eviction": unexpected EOF], continuing command...
```

3 of 16 upgrades failed due to
```
Error from server: error when retrieving current configuration of:
Resource: "rbac.authorization.k8s.io/v1, Resource=roles", GroupVersionKind: "rbac.authorization.k8s.io/v1, Kind=Role"
Name: "kubernetes-dashboard", Namespace: "kubernetes-dashboard"
from server for: "/mnt/k8sdisk//dashboard.yaml": etcdserver: leader changed
```

* CKS: remove tests of creating/deleting HA clusters as they are covered by the upgrade test

* Update PR 8402 as suggested

* test: remove CKS cluster if fail to create or verify
dhslove pushed a commit to ablecloud-team/ablestack-cloud that referenced this pull request Feb 20, 2024
…che#8402)
