CLOUDSTACK-10006: Internal DRS-like load balancing implementation for Vmware #2189

Closed
wants to merge 1 commit into from

Conversation

nvazquez
Contributor

JIRA TICKET: https://issues.apache.org/jira/browse/CLOUDSTACK-10006

Introduction

One of the most useful features provided by VMware is DRS (Distributed Resource Scheduler), whose main job is to balance workload within clusters when needed (given a migration threshold). However, this feature is only available with Enterprise Plus licenses. We would like to provide a similar feature inside CloudStack for VMware licenses that do not include DRS.

Usage

This feature is disabled by default. It can be activated by setting the global setting vmware.drs.internal.enabled to true (a management server restart is required).
When the feature is active, it starts one thread per cluster, which executes every vmware.drs.interval seconds:

  • If cluster setting: vmware.drs.internal.enabled = false -> no action performed on cluster
  • If cluster setting: vmware.drs.internal.enabled = true:
    • Get cluster hosts statistics and calculate cluster imbalance
    • While cluster imbalance > vmware.drs.threshold:
      • If no good migration is found -> stop
      • If good migration is found -> it is added to the list of recommendations
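
The per-cluster loop above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the description does not define the imbalance metric, so the population standard deviation of per-host load is assumed here, and `plan_recommendations`, `find_good_migration`, and the `max_moves` safety cap are hypothetical names introduced only for the example.

```python
from statistics import pstdev
from typing import Callable, Dict, List, Optional, Tuple

# host name -> {vm name -> load}; "load" is any normalized utilization (assumption)
Cluster = Dict[str, Dict[str, float]]
Move = Tuple[str, str, str]  # (vm, source host, destination host)


def imbalance(cluster: Cluster) -> float:
    """Assumed metric: population standard deviation of total load per host."""
    return pstdev([sum(vms.values()) for vms in cluster.values()])


def plan_recommendations(
    cluster: Cluster,
    threshold: float,
    find_good_migration: Callable[[Cluster], Optional[Move]],
    max_moves: int = 100,  # hypothetical cap so the loop always terminates
) -> List[Move]:
    """Collect migration recommendations until the cluster is balanced enough."""
    # Work on a copy so the caller's view of the cluster is left untouched.
    state = {host: dict(vms) for host, vms in cluster.items()}
    recommendations: List[Move] = []
    while imbalance(state) > threshold and len(recommendations) < max_moves:
        move = find_good_migration(state)
        if move is None:
            break  # no good migration found -> stop
        vm, src, dst = move
        state[dst][vm] = state[src].pop(vm)  # simulate the migration
        recommendations.append(move)
    return recommendations
```

Passing the search in as a callback keeps the loop independent of how a "good" migration is chosen; the threshold plays the role of vmware.drs.threshold.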

Good migration search works like this:

  • For each VM v:
    • For each host h that is not Source Host:
      • Simulate moving v to h
      • Measure the resulting cluster-wide load imbalance g
  • Return the move (v, h) that gives the lowest cluster-wide imbalance g.
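
The search above could look roughly like the sketch below. It is an illustration under stated assumptions rather than the PR's implementation: the imbalance metric is again assumed to be the population standard deviation of per-host load, a move is treated as "good" only if it lowers the current imbalance (the description does not say what makes a migration good), and the function names are hypothetical.

```python
from statistics import pstdev
from typing import Dict, Optional, Tuple

# host name -> {vm name -> load}; assumes the cluster has at least one host
Cluster = Dict[str, Dict[str, float]]
Move = Tuple[str, str, str]  # (vm, source host, destination host)


def imbalance_after(cluster: Cluster, vm: str, src: str, dst: str) -> float:
    """Imbalance g if `vm` were moved from `src` to `dst` (nothing is mutated)."""
    loads = {host: sum(vms.values()) for host, vms in cluster.items()}
    load = cluster[src][vm]
    loads[src] -= load
    loads[dst] += load
    return pstdev(list(loads.values()))


def find_good_migration(cluster: Cluster) -> Optional[Move]:
    """Try every (VM, destination host) pair and keep the lowest-imbalance move."""
    # Assumption: only moves that improve on the current imbalance qualify.
    best_g = pstdev([sum(vms.values()) for vms in cluster.values()])
    best_move: Optional[Move] = None
    for src, vms in cluster.items():
        for vm in vms:
            for dst in cluster:
                if dst == src:
                    continue  # skip the source host
                g = imbalance_after(cluster, vm, src, dst)
                if g < best_g:
                    best_g, best_move = g, (vm, src, dst)
    return best_move
```

Each call simulates every VM on every other host, so the cost grows with the number of VMs times the number of hosts per recommendation.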

Contributor

@borisstoyanov borisstoyanov left a comment

Hi @nvazquez, thanks for this great feature. Have you planned on adding Marvin tests?

@borisstoyanov
Contributor

@blueorangutan package

@blueorangutan

@borisstoyanov a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan

Packaging result: ✔centos6 ✔centos7 ✔debian. JID-808

@borisstoyanov
Contributor

@blueorangutan test

@blueorangutan

@borisstoyanov a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan

Trillian test result (tid-1212)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 44601 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr2189-t1212-kvm-centos7.zip
Intermittent failure detected: /marvin/tests/smoke/test_affinity_groups_projects.py
Intermittent failure detected: /marvin/tests/smoke/test_deploy_virtio_scsi_vm.py
Intermittent failure detected: /marvin/tests/smoke/test_network.py
Intermittent failure detected: /marvin/tests/smoke/test_privategw_acl.py
Intermittent failure detected: /marvin/tests/smoke/test_routers_network_ops.py
Intermittent failure detected: /marvin/tests/smoke/test_vpc_vpn.py
Test completed. 50 look ok, 2 have error(s)

Test Result Time (s) Test File
test_04_rvpc_privategw_static_routes Failure 424.82 test_privategw_acl.py
test_01_vpc_site2site_vpn Success 174.17 test_vpc_vpn.py
test_01_vpc_remote_access_vpn Success 80.86 test_vpc_vpn.py
test_01_redundant_vpc_site2site_vpn Success 249.49 test_vpc_vpn.py
test_02_VPC_default_routes Success 261.64 test_vpc_router_nics.py
test_01_VPC_nics_after_destroy Success 515.23 test_vpc_router_nics.py
test_05_rvpc_multi_tiers Success 508.11 test_vpc_redundant.py
test_04_rvpc_network_garbage_collector_nics Success 1391.06 test_vpc_redundant.py
test_03_create_redundant_VPC_1tier_2VMs_2IPs_2PF_ACL_reboot_routers Success 545.37 test_vpc_redundant.py
test_02_redundant_VPC_default_routes Success 751.12 test_vpc_redundant.py
test_01_create_redundant_VPC_2tiers_4VMs_4IPs_4PF_ACL Success 1271.09 test_vpc_redundant.py
test_09_delete_detached_volume Success 156.13 test_volumes.py
test_08_resize_volume Success 156.08 test_volumes.py
test_07_resize_fail Success 156.16 test_volumes.py
test_06_download_detached_volume Success 156.01 test_volumes.py
test_05_detach_volume Success 145.58 test_volumes.py
test_04_delete_attached_volume Success 145.96 test_volumes.py
test_03_download_attached_volume Success 156.05 test_volumes.py
test_02_attach_volume Success 84.10 test_volumes.py
test_01_create_volume Success 616.28 test_volumes.py
test_03_delete_vm_snapshots Success 275.18 test_vm_snapshots.py
test_02_revert_vm_snapshots Success 100.79 test_vm_snapshots.py
test_01_create_vm_snapshots Success 133.74 test_vm_snapshots.py
test_deploy_vm_multiple Success 287.28 test_vm_life_cycle.py
test_deploy_vm Success 0.02 test_vm_life_cycle.py
test_advZoneVirtualRouter Success 0.02 test_vm_life_cycle.py
test_10_attachAndDetach_iso Success 26.55 test_vm_life_cycle.py
test_09_expunge_vm Success 125.21 test_vm_life_cycle.py
test_08_migrate_vm Success 60.86 test_vm_life_cycle.py
test_07_restore_vm Success 0.09 test_vm_life_cycle.py
test_06_destroy_vm Success 125.67 test_vm_life_cycle.py
test_03_reboot_vm Success 125.66 test_vm_life_cycle.py
test_02_start_vm Success 10.12 test_vm_life_cycle.py
test_01_stop_vm_forced Success 5.10 test_vm_life_cycle.py
test_01_stop_vm Success 40.26 test_vm_life_cycle.py
test_CreateTemplateWithDuplicateName Success 130.81 test_templates.py
test_08_list_system_templates Success 0.03 test_templates.py
test_07_list_public_templates Success 0.03 test_templates.py
test_05_template_permissions Success 0.05 test_templates.py
test_04_extract_template Success 5.13 test_templates.py
test_03_delete_template Success 5.08 test_templates.py
test_02_edit_template Success 90.11 test_templates.py
test_01_create_template Success 70.51 test_templates.py
test_10_destroy_cpvm Success 161.21 test_ssvm.py
test_09_destroy_ssvm Success 163.08 test_ssvm.py
test_08_reboot_cpvm Success 101.22 test_ssvm.py
test_07_reboot_ssvm Success 133.88 test_ssvm.py
test_06_stop_cpvm Success 101.36 test_ssvm.py
test_05_stop_ssvm Success 163.67 test_ssvm.py
test_04_cpvm_internals Success 0.96 test_ssvm.py
test_03_ssvm_internals Success 4.40 test_ssvm.py
test_02_list_cpvm_vm Success 0.09 test_ssvm.py
test_01_list_sec_storage_vm Success 0.09 test_ssvm.py
test_02_list_snapshots_with_removed_data_store Success 87.61 test_snapshots.py
test_01_snapshot_root_disk Success 11.02 test_snapshots.py
test_04_change_offering_small Success 242.59 test_service_offerings.py
test_03_delete_service_offering Success 0.03 test_service_offerings.py
test_02_edit_service_offering Success 0.06 test_service_offerings.py
test_01_create_service_offering Success 0.08 test_service_offerings.py
test_02_sys_template_ready Success 0.09 test_secondary_storage.py
test_01_sys_vm_start Success 0.12 test_secondary_storage.py
test_09_reboot_router Success 35.24 test_routers.py
test_08_start_router Success 25.19 test_routers.py
test_07_stop_router Success 10.12 test_routers.py
test_06_router_advanced Success 0.04 test_routers.py
test_05_router_basic Success 0.03 test_routers.py
test_04_restart_network_wo_cleanup Success 5.51 test_routers.py
test_03_restart_network_cleanup Success 55.39 test_routers.py
test_02_router_internal_adv Success 1.11 test_routers.py
test_01_router_internal_basic Success 0.54 test_routers.py
test_router_dns_guestipquery Success 73.88 test_router_dns.py
test_router_dns_externalipquery Success 0.07 test_router_dns.py
test_router_dhcphosts Success 236.05 test_router_dhcphosts.py
test_router_dhcp_opts Success 21.63 test_router_dhcphosts.py
test_01_updatevolumedetail Success 0.08 test_resource_detail.py
test_01_reset_vm_on_reboot Success 130.72 test_reset_vm_on_reboot.py
test_createRegion Success 0.03 test_regions.py
test_create_pvlan_network Success 5.15 test_pvlan.py
test_dedicatePublicIpRange Success 0.31 test_public_ip_range.py
test_03_vpc_privategw_restart_vpc_cleanup Success 568.11 test_privategw_acl.py
test_02_vpc_privategw_static_routes Success 449.16 test_privategw_acl.py
test_01_vpc_privategw_acl Success 111.67 test_privategw_acl.py
test_03_migration_options_storage_tags Success 161.39 test_primary_storage.py
test_02_edit_primary_storage_tags Success 0.07 test_primary_storage.py
test_01_primary_storage_nfs Success 35.80 test_primary_storage.py
test_01_deploy_vms_storage_tags Success 30.78 test_primary_storage.py
test_createPortablePublicIPRange Success 15.16 test_portable_publicip.py
test_createPortablePublicIPAcquire Success 15.32 test_portable_publicip.py
test_isolate_network_password_server Success 56.85 test_password_server.py
test_UpdateStorageOverProvisioningFactor Success 0.10 test_over_provisioning.py
test_oobm_zchange_password Success 25.48 test_outofbandmanagement.py
test_oobm_multiple_mgmt_server_ownership Success 16.28 test_outofbandmanagement.py
test_oobm_issue_power_status Success 10.18 test_outofbandmanagement.py
test_oobm_issue_power_soft Success 15.26 test_outofbandmanagement.py
test_oobm_issue_power_reset Success 15.26 test_outofbandmanagement.py
test_oobm_issue_power_on Success 15.27 test_outofbandmanagement.py
test_oobm_issue_power_off Success 15.26 test_outofbandmanagement.py
test_oobm_issue_power_cycle Success 15.26 test_outofbandmanagement.py
test_oobm_enabledisable_across_clusterzones Success 72.19 test_outofbandmanagement.py
test_oobm_enable_feature_valid Success 5.13 test_outofbandmanagement.py
test_oobm_enable_feature_invalid Success 0.08 test_outofbandmanagement.py
test_oobm_disable_feature_valid Success 5.14 test_outofbandmanagement.py
test_oobm_disable_feature_invalid Success 0.07 test_outofbandmanagement.py
test_oobm_configure_invalid_driver Success 0.06 test_outofbandmanagement.py
test_oobm_configure_default_driver Success 0.06 test_outofbandmanagement.py
test_oobm_background_powerstate_sync Success 18.31 test_outofbandmanagement.py
test_extendPhysicalNetworkVlan Success 15.24 test_non_contigiousvlan.py
test_01_nic Success 488.46 test_nic.py
test_releaseIP Success 237.08 test_network.py
test_reboot_router Success 428.02 test_network.py
test_public_ip_user_account Success 10.19 test_network.py
test_public_ip_admin_account Success 40.21 test_network.py
test_network_rules_acquired_public_ip_3_Load_Balancer_Rule Success 66.80 test_network.py
test_network_rules_acquired_public_ip_2_nat_rule Success 61.77 test_network.py
test_network_rules_acquired_public_ip_1_static_nat_rule Success 124.10 test_network.py
test_delete_account Success 322.34 test_network.py
test_02_port_fwd_on_non_src_nat Success 55.51 test_network.py
test_01_port_fwd_on_src_nat Success 108.77 test_network.py
test_nic_secondaryip_add_remove Success 196.93 test_multipleips_per_nic.py
test_list_zones_metrics Success 0.17 test_metrics_api.py
test_list_volumes_metrics Success 5.33 test_metrics_api.py
test_list_vms_metrics Success 176.23 test_metrics_api.py
test_list_pstorage_metrics Success 0.23 test_metrics_api.py
test_list_infrastructure_metrics Success 0.27 test_metrics_api.py
test_list_hosts_metrics Success 0.24 test_metrics_api.py
test_list_clusters_metrics Success 0.23 test_metrics_api.py
login_test_saml_user Success 17.80 test_login.py
test_assign_and_removal_lb Success 133.98 test_loadbalance.py
test_02_create_lb_rule_non_nat Success 187.63 test_loadbalance.py
test_01_create_lb_rule_src_nat Success 188.04 test_loadbalance.py
test_03_list_snapshots Success 0.06 test_list_ids_parameter.py
test_02_list_templates Success 0.03 test_list_ids_parameter.py
test_01_list_volumes Success 0.03 test_list_ids_parameter.py
test_07_list_default_iso Success 0.04 test_iso.py
test_05_iso_permissions Success 0.04 test_iso.py
test_04_extract_Iso Success 5.26 test_iso.py
test_03_delete_iso Success 95.17 test_iso.py
test_02_edit_iso Success 0.06 test_iso.py
test_01_create_iso Success 20.69 test_iso.py
test_04_rvpc_internallb_haproxy_stats_on_all_interfaces Success 177.12 test_internal_lb.py
test_03_vpc_internallb_haproxy_stats_on_all_interfaces Success 131.86 test_internal_lb.py
test_02_internallb_roundrobin_1RVPC_3VM_HTTP_port80 Success 533.39 test_internal_lb.py
test_01_internallb_roundrobin_1VPC_3VM_HTTP_port80 Success 539.99 test_internal_lb.py
test_dedicateGuestVlanRange Success 10.21 test_guest_vlan_range.py
test_UpdateConfigParamWithScope Success 0.12 test_global_settings.py
test_rolepermission_lifecycle_update Success 5.79 test_dynamicroles.py
test_rolepermission_lifecycle_list Success 5.70 test_dynamicroles.py
test_rolepermission_lifecycle_delete Success 5.59 test_dynamicroles.py
test_rolepermission_lifecycle_create Success 5.60 test_dynamicroles.py
test_rolepermission_lifecycle_concurrent_updates Success 5.67 test_dynamicroles.py
test_role_lifecycle_update_role_inuse Success 5.60 test_dynamicroles.py
test_role_lifecycle_update Success 10.67 test_dynamicroles.py
test_role_lifecycle_list Success 5.61 test_dynamicroles.py
test_role_lifecycle_delete Success 10.64 test_dynamicroles.py
test_role_lifecycle_create Success 5.62 test_dynamicroles.py
test_role_inuse_deletion Success 5.58 test_dynamicroles.py
test_role_account_acls_multiple_mgmt_servers Success 7.12 test_dynamicroles.py
test_role_account_acls Success 7.12 test_dynamicroles.py
test_default_role_deletion Success 5.73 test_dynamicroles.py
test_04_create_fat_type_disk_offering Success 0.07 test_disk_offerings.py
test_03_delete_disk_offering Success 0.03 test_disk_offerings.py
test_02_edit_disk_offering Success 0.06 test_disk_offerings.py
test_02_create_sparse_type_disk_offering Success 0.07 test_disk_offerings.py
test_01_create_disk_offering Success 0.09 test_disk_offerings.py
test_deployvm_userdispersing Success 30.52 test_deploy_vms_with_varied_deploymentplanners.py
test_deployvm_userconcentrated Success 25.46 test_deploy_vms_with_varied_deploymentplanners.py
test_deployvm_firstfit Success 130.85 test_deploy_vms_with_varied_deploymentplanners.py
test_deployvm_userdata_post Success 10.29 test_deploy_vm_with_userdata.py
test_deployvm_userdata Success 45.51 test_deploy_vm_with_userdata.py
test_02_deploy_vm_root_resize Success 0.04 test_deploy_vm_root_resize.py
test_01_deploy_vm_root_resize Success 0.05 test_deploy_vm_root_resize.py
test_00_deploy_vm_root_resize Success 276.58 test_deploy_vm_root_resize.py
test_deploy_vm_from_iso Success 231.87 test_deploy_vm_iso.py
test_06_verify_guest_lspci_again Success 7.37 test_deploy_virtio_scsi_vm.py
test_05_change_vm_ostype_restart Success 15.73 test_deploy_virtio_scsi_vm.py
test_04_verify_guest_lspci Success 40.47 test_deploy_virtio_scsi_vm.py
test_03_verify_libvirt_attach_disk Success 10.58 test_deploy_virtio_scsi_vm.py
test_02_verify_libvirt_after_restart Success 131.13 test_deploy_virtio_scsi_vm.py
test_01_verify_libvirt Success 0.37 test_deploy_virtio_scsi_vm.py
test_DeployVmAntiAffinityGroup Success 55.63 test_affinity_groups.py
test_change_service_offering_for_vm_with_snapshots Skipped 0.00 test_vm_snapshots.py
test_09_copy_delete_template Skipped 0.01 test_templates.py
test_06_copy_template Skipped 0.00 test_templates.py
test_static_role_account_acls Skipped 0.01 test_staticroles.py
test_11_ss_nfs_version_on_ssvm Skipped 0.02 test_ssvm.py
test_01_scale_vm Skipped 0.00 test_scale_vm.py
test_01_primary_storage_iscsi Skipped 0.03 test_primary_storage.py
test_nested_virtualization_vmware Skipped 0.00 test_nested_virtualization.py
test_06_copy_iso Skipped 0.00 test_iso.py
test_deploy_vgpu_enabled_vm Skipped 0.02 test_deploy_vgpu_enabled_vm.py
test_3d_gpu_support Skipped 0.03 test_deploy_vgpu_enabled_vm.py

@nvazquez
Contributor Author

Thanks @borisstoyanov, sorry for the delayed response. I'll work on a way to test this feature.

@cloudmonger

ACS CI BVT Run

Summary:
Build Number 1026
Hypervisor xenserver
NetworkType Advanced
Passed=99
Failed=16
Skipped=12

Link to logs Folder (search by build_no): https://www.dropbox.com/sh/r2si930m8xxzavs/AAAzNrnoF1fC3auFrvsKo_8-a?dl=0

Failed tests:

  • test_routers.py
    • test_03_restart_network_cleanup Failing since 6 runs
    • test_04_restart_network_wo_cleanup Failing since 6 runs
    • test_05_router_basic Failing since 6 runs
    • test_06_router_advanced Failing since 6 runs
    • test_08_start_router Failing since 6 runs
    • test_09_reboot_router Failing since 6 runs
  • test_network.py
    • test_reboot_router Failing since 6 runs
  • test_deploy_vm_iso.py
    • test_deploy_vm_from_iso Failing since 81 runs
  • test_volumes.py
    • test_06_download_detached_volume Failing since 3 runs
  • test_nic.py
    • test_01_nic Failing since 6 runs
  • test_vm_life_cycle.py
    • test_10_attachAndDetach_iso Failing since 81 runs
  • test_routers_network_ops.py
    • test_01_isolate_network_FW_PF_default_routes_egress_true Failing since 114 runs
    • test_02_isolate_network_FW_PF_default_routes_egress_false Failing since 114 runs
    • test_01_RVR_Network_FW_PF_SSH_default_routes_egress_true Failing since 112 runs
    • test_02_RVR_Network_FW_PF_SSH_default_routes_egress_false Failing since 112 runs
    • ContextSuite context=TestRedundantIsolateNetworks>:teardown Failing since 12 runs

Skipped tests:
test_vm_nic_adapter_vmxnet3
test_01_verify_libvirt
test_02_verify_libvirt_after_restart
test_03_verify_libvirt_attach_disk
test_04_verify_guest_lspci
test_05_change_vm_ostype_restart
test_06_verify_guest_lspci_again
test_static_role_account_acls
test_11_ss_nfs_version_on_ssvm
test_nested_virtualization_vmware
test_3d_gpu_support
test_deploy_vgpu_enabled_vm

Passed test suites:
test_deploy_vm_with_userdata.py
test_affinity_groups_projects.py
test_portable_publicip.py
test_vm_snapshots.py
test_over_provisioning.py
test_global_settings.py
test_router_dnsservice.py
test_scale_vm.py
test_service_offerings.py
test_routers_iptables_default_policy.py
test_loadbalance.py
test_reset_vm_on_reboot.py
test_deploy_vms_with_varied_deploymentplanners.py
test_router_dns.py
test_non_contigiousvlan.py
test_login.py
test_list_ids_parameter.py
test_public_ip_range.py
test_multipleips_per_nic.py
test_metrics_api.py
test_regions.py
test_affinity_groups.py
test_network_acl.py
test_pvlan.py
test_deploy_vm_root_resize.py
test_resource_detail.py
test_secondary_storage.py
test_disk_offerings.py

@rafaelweingartner
Member

rafaelweingartner commented Oct 10, 2017

@nvazquez some pieces of this code are a little familiar to me :)

Your initiative is great, and from what I saw in the code of this PR, the “management model” used here is similar to the one we have in our beta version of Autonomiccs [1], which was developed as a plugin for ACS.

As I told you when you tested that solution for Autodesk, it does not scale well. We have a dataset [2] that we use to compare our management models, and we have compared this simple solution with a more comprehensive model that we developed a while ago. It turns out that this simple management approach can cause considerable problems in dynamic production environments. Under some conditions it is worse than using no management at all. The dataset and the simulation tool used to produce the results are public [2], so anyone can check them ;)

The blue bar is one of our management models, which was presented at the IEEE SERVICES conference at the beginning of this year; the red line is the simple management model we made available at [1], which is similar to the one being introduced here; the yellow line is the imbalance of the cloud environment when no management is used at all (relying only on the allocation algorithm; first fit was used). This figure presents the RAM imbalance over time in the dataset:
[image: untitled2]

This figure presents the CPU imbalance over time in the dataset:
[image: untitled]

The blue bar, which is one of our management models, performs much better because it is a multi-dimensional management model.

The message I want to send here is the following: use this solution on your cloud production environment with caution.

[1] https://github.com/Autonomiccs/autonomiccs-platform
[2] https://github.com/Autonomiccs/cloud-traces

@nvazquez
Contributor Author

@rafaelweingartner somehow I missed your last comment, sorry for the late response :). I'll close this PR temporarily.

@nvazquez nvazquez closed this Dec 27, 2018
@nvazquez nvazquez deleted the drs-implementation branch April 6, 2020 14:53