KVM: DPDK live migrations #3365

Merged
merged 3 commits into from Jun 25, 2019

@nvazquez
Contributor

commented May 30, 2019

Description

This feature allows administrators to perform live migrations on DPDK enabled VMs between DPDK enabled hosts in the same cluster.

A DPDK enabled host must be configured to enable the CloudStack DPDK support. The following properties in agent.properties must be set:

  • openvswitch.dpdk.enabled=true
  • openvswitch.dpdk.ovs.path=<OVS_PATH>
  • network.bridge.type=openvswitch
  • libvirt.vif.driver=com.cloud.hypervisor.kvm.resource.OvsVifDriver
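
As a rough illustration of the configuration above, the following sketch checks an agent.properties excerpt for the four properties. It is not part of this PR: the helper names and the sample OVS path are hypothetical, and the parser handles only plain `key=value` lines, not the full Java properties syntax.

```python
# Hypothetical checker for the DPDK-related entries of agent.properties.
REQUIRED = {
    "openvswitch.dpdk.enabled": "true",
    "network.bridge.type": "openvswitch",
    "libvirt.vif.driver": "com.cloud.hypervisor.kvm.resource.OvsVifDriver",
}


def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def dpdk_config_problems(props):
    """Return a list of human-readable problems; an empty list means the config looks complete."""
    problems = []
    for key, expected in REQUIRED.items():
        if props.get(key) != expected:
            problems.append(f"{key} should be {expected!r}, found {props.get(key)!r}")
    # The OVS path is site specific, so only its presence is checked.
    if not props.get("openvswitch.dpdk.ovs.path"):
        problems.append("openvswitch.dpdk.ovs.path is not set")
    return problems


# Hypothetical excerpt; the OVS path is a placeholder, not a recommended value.
SAMPLE = """
openvswitch.dpdk.enabled=true
openvswitch.dpdk.ovs.path=/hypothetical/ovs/path
network.bridge.type=openvswitch
libvirt.vif.driver=com.cloud.hypervisor.kvm.resource.OvsVifDriver
"""

print(dpdk_config_problems(parse_properties(SAMPLE)))
```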

Each time a CloudStack agent on a KVM host connects to its management server, it sends a comma-separated list of the capabilities available on the host, such as 'hvm' and 'snapshot'. The management server persists this list in the database ('capabilities' column of the 'hosts' table).

  • When the property 'openvswitch.dpdk.enabled' = true, the 'dpdk' capability is included in the list of capabilities sent to the management server.

  • When the property 'openvswitch.dpdk.enabled' = false, the list of capabilities sent to the management server does not include the 'dpdk' capability.
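
The capability handling described above could be sketched as follows. This is a simplified model, not the agent code: the function names are hypothetical, and treating a missing property as false is an assumption.

```python
def host_capabilities(base_capabilities, props):
    """Build the comma-separated capability list an agent would report.

    base_capabilities: capabilities detected independently of DPDK, e.g. ['hvm', 'snapshot'].
    props: parsed agent.properties as a dict (hypothetical representation).
    """
    caps = list(base_capabilities)
    # Assumption: a missing property is treated the same as 'false'.
    if props.get("openvswitch.dpdk.enabled", "false").lower() == "true":
        caps.append("dpdk")
    return ",".join(caps)


def has_capability(capabilities_column, name):
    """Check a persisted comma-separated capabilities value for one capability."""
    return name in [c.strip() for c in (capabilities_column or "").split(",")]


reported = host_capabilities(["hvm", "snapshot"], {"openvswitch.dpdk.enabled": "true"})
print(reported, has_capability(reported, "dpdk"))
```

Splitting on commas before comparing avoids false positives from substring matches on the stored value.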

Note that the agent must be restarted after any property in the agent.properties file is changed. Hosts with the 'dpdk' capability can still host VMs without the DPDK configuration. In that case, openvswitch ports without DPDK support are created, just as for system VMs.

The management server is enhanced to determine whether a VM is a DPDK enabled VM. A VM is considered DPDK enabled when:

  • The VM type is 'User'
  • The VM details or service offering details contain additional configurations with the name:
    • 'dpdk-hugepages'
    • 'dpdk-numa'
  • The KVM host on which the VM is running contains the 'dpdk' capability.
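
The three conditions above can be read as a single predicate. The sketch below is one interpretation, not the management server code: the function and parameter names are hypothetical, details are modelled as plain dicts keyed by configuration name, and requiring both 'dpdk-hugepages' and 'dpdk-numa' (rather than either one) is an assumption.

```python
DPDK_CONFIG_NAMES = ("dpdk-hugepages", "dpdk-numa")


def is_dpdk_vm(vm_type, vm_details, offering_details, host_capabilities):
    """Return True when all three conditions from the description hold.

    vm_details / offering_details: dicts of additional configuration name -> value.
    host_capabilities: comma-separated capabilities of the host running the VM.
    """
    if vm_type != "User":
        return False
    if "dpdk" not in [c.strip() for c in host_capabilities.split(",")]:
        return False
    # Assumption: a configuration may come from either the VM or its service offering,
    # and both named configurations must be present.
    configured = set(vm_details) | set(offering_details)
    return all(name in configured for name in DPDK_CONFIG_NAMES)


print(is_dpdk_vm("User", {"dpdk-numa": "..."}, {"dpdk-hugepages": "..."}, "hvm,snapshot,dpdk"))
```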

API changes

The 'findSuitableHostsForMigration' API is enhanced to support DPDK enabled VMs.

  • When the VM is DPDK enabled, only DPDK enabled hosts are suitable for migration, that is, hosts whose capabilities include 'dpdk'.
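
A minimal sketch of the filtering rule above, assuming candidate hosts are represented as dicts with 'name' and 'capabilities' keys (a hypothetical shape, not the API response format):

```python
def suitable_hosts(candidates, vm_is_dpdk):
    """Filter migration candidates: DPDK enabled VMs may only go to 'dpdk' capable hosts."""
    if not vm_is_dpdk:
        # For other VMs the candidate list is left as it was.
        return list(candidates)
    return [
        host for host in candidates
        if "dpdk" in [c.strip() for c in host["capabilities"].split(",")]
    ]


hosts = [
    {"name": "kvm1", "capabilities": "hvm,snapshot,dpdk"},
    {"name": "kvm2", "capabilities": "hvm,snapshot"},
]
print([h["name"] for h in suitable_hosts(hosts, vm_is_dpdk=True)])
```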

The 'migrateVirtualMachine' API method is extended to support DPDK enabled VMs live migrations:

  • When the VM to live migrate is not a DPDK enabled VM, the logic remains unchanged.
  • When the VM to live migrate is a DPDK enabled VM, it can be live migrated between two hosts when:
    • The KVM source host is a DPDK enabled host (the 'dpdk' capability is part of its capabilities)
    • The KVM destination host is a DPDK enabled host on the same cluster as the source host
    • If the destination host is not a DPDK enabled host, then the migration must fail with a descriptive message.
    • Before attempting the live migration, the destination host must be prepared for the migration:
      • A new DPDK port must be created for each interface of the VM
    • Before attempting the live migration, the source host must:
      • Obtain the VM XML domain as a copy of the current XML domain
      • Replace the XML parts, referencing the new DPDK ports on the destination host for each interface
    • After the migration is successful, the VM interface DPDK ports on the source host must be removed.
    • In case of migration failure, the DPDK ports created on the destination host are cleaned up.
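
The migration steps above can be sketched as follows. This is an illustration under stated assumptions, not the code in this PR: `FakeHost`, `migrate` and the port paths are hypothetical, DPDK interfaces are assumed to appear in the domain XML as libvirt `vhostuser` interfaces whose `<source path=...>` names the port socket, and the actual migration call is passed in as a callback.

```python
import xml.etree.ElementTree as ET


def rewrite_domain_xml(domain_xml, port_path_by_mac):
    """Return a copy of the domain XML whose vhostuser interfaces use the new port paths."""
    root = ET.fromstring(domain_xml)  # parsing builds a new tree; the input string is untouched
    for iface in root.iter("interface"):
        if iface.get("type") != "vhostuser":
            continue
        mac, source = iface.find("mac"), iface.find("source")
        if mac is None or source is None:
            continue
        new_path = port_path_by_mac.get(mac.get("address"))
        if new_path is not None:
            source.set("path", new_path)
    return ET.tostring(root, encoding="unicode")


class FakeHost:
    """Hypothetical stand-in for a KVM host; it only records DPDK port paths."""

    def __init__(self, name):
        self.name = name
        self.ports = set()

    def create_port(self, mac):
        path = f"/hypothetical/{self.name}/port-{mac.replace(':', '')}"
        self.ports.add(path)
        return path

    def remove_port(self, path):
        self.ports.discard(path)


def migrate(domain_xml, source_ports, source, destination, do_migrate):
    """source_ports maps each interface MAC to its DPDK port path on the source host."""
    created = {}
    try:
        # Prepare the destination: one new DPDK port per VM interface.
        for mac in source_ports:
            created[mac] = destination.create_port(mac)
        # Migrate using a copy of the domain XML that references the new ports.
        do_migrate(rewrite_domain_xml(domain_xml, created))
    except Exception:
        # On failure, clean up whatever was created on the destination.
        for path in created.values():
            destination.remove_port(path)
        raise
    # On success, remove the now unused ports on the source host.
    for path in source_ports.values():
        source.remove_port(path)
```

Keeping the cleanup in the `except` branch means a failure while creating the second or third port also removes the ports created before it.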

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)

Screenshots (if appropriate):

How Has This Been Tested?

@borisstoyanov

Contributor

commented May 30, 2019

@blueorangutan package

@blueorangutan

commented May 30, 2019

@borisstoyanov a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan

commented May 30, 2019

Packaging result: ✔centos6 ✔centos7 ✔debian. JID-2816

@borisstoyanov

Contributor

commented May 30, 2019

@blueorangutan package

@blueorangutan

commented May 30, 2019

@borisstoyanov a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan

commented May 30, 2019

Packaging result: ✔centos6 ✔centos7 ✔debian. JID-2817

@borisstoyanov

Contributor

commented May 30, 2019

@blueorangutan

commented May 30, 2019

@borisstoyanov a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan

commented May 30, 2019

Trillian test result (tid-3613)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 32361 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr3365-t3613-kvm-centos7.zip
Intermittent failure detected: /marvin/tests/smoke/test_vpc_redundant.py
Smoke tests completed. 70 look OK, 0 have error(s)
Only failed tests results shown below:

Test | Result | Time (s) | Test File

@borisstoyanov

Contributor

commented May 31, 2019

@apache deleted a comment from @blueorangutan May 31, 2019

@apache deleted a comment from @blueorangutan May 31, 2019

@blueorangutan

commented May 31, 2019

@borisstoyanov a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@borisstoyanov
Contributor

left a comment

LGTM, based on Marvin tests, code review and manual verification on DPDK enabled hosts, covering both normal and DPDK VMs. It passed migration tests (positive and negative) on DPDK enabled and non-DPDK VMs. Cleanup is done properly, as is host capability recognition.
test results: test-results.xlsx

@blueorangutan

commented May 31, 2019

Trillian test result (tid-3624)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 33753 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr3365-t3624-kvm-centos7.zip
Intermittent failure detected: /marvin/tests/smoke/test_vpc_redundant.py
Intermittent failure detected: /marvin/tests/smoke/test_hostha_kvm.py
Smoke tests completed. 69 look OK, 1 have error(s)
Only failed tests results shown below:

Test | Result | Time (s) | Test File
test_hostha_enable_ha_when_host_in_maintenance | Error | 301.71 | test_hostha_kvm.py

@nvazquez referenced this pull request Jun 3, 2019

@nvazquez changed the title from "DPDK live migrations" to "KVM: DPDK live migrations" Jun 3, 2019

@nvazquez closed this Jun 3, 2019

@nvazquez reopened this Jun 3, 2019

@borisstoyanov

Contributor

commented Jun 5, 2019

@nvazquez can you address the merge conflicts pls.

@blueorangutan

commented Jun 10, 2019

@nvazquez a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan

commented Jun 10, 2019

Trillian test result (tid-3679)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 25576 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr3365-t3679-kvm-centos7.zip
Smoke tests completed. 71 look OK, 0 have error(s)
Only failed tests results shown below:

Test | Result | Time (s) | Test File

@DaanHoogland
Contributor

left a comment

Looks good overall, too large to judge in detail. The proof of the code is in the testing.

api/src/main/java/com/cloud/agent/api/to/DPDKTO.java (outdated, resolved)

@nvazquez

Contributor Author

commented Jun 17, 2019

Thanks @DaanHoogland, renamed classes after your comment
@blueorangutan package

@blueorangutan

commented Jun 17, 2019

@nvazquez a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan

commented Jun 17, 2019

Packaging result: ✔centos6 ✔centos7 ✔debian. JID-2893

@nvazquez

Contributor Author

commented Jun 17, 2019

@blueorangutan

commented Jun 17, 2019

@nvazquez a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan

commented Jun 17, 2019

Packaging result: ✔centos6 ✔centos7 ✔debian. JID-2894

@blueorangutan

commented Jun 17, 2019

Packaging result: ✔centos6 ✔centos7 ✔debian. JID-2895

@blueorangutan

commented Jun 17, 2019

Packaging result: ✔centos6 ✔centos7 ✖debian. JID-2896

@blueorangutan

commented Jun 18, 2019

Trillian test result (tid-3696)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 36164 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr3365-t3696-kvm-centos7.zip
Intermittent failure detected: /marvin/tests/smoke/test_vpc_redundant.py
Smoke tests completed. 71 look OK, 0 have error(s)
Only failed tests results shown below:

Test | Result | Time (s) | Test File

@nvazquez

Contributor Author

commented Jun 20, 2019

@blueorangutan package

@blueorangutan

commented Jun 20, 2019

@nvazquez a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan

commented Jun 20, 2019

Packaging result: ✔centos6 ✔centos7 ✔debian. JID-2912

@nvazquez

Contributor Author

commented Jun 20, 2019

@blueorangutan

commented Jun 20, 2019

@nvazquez a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan

commented Jun 20, 2019

Trillian test result (tid-3712)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 35287 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr3365-t3712-kvm-centos7.zip
Intermittent failure detected: /marvin/tests/smoke/test_vpc_redundant.py
Intermittent failure detected: /marvin/tests/smoke/test_vpc_vpn.py
Smoke tests completed. 69 look OK, 2 have error(s)
Only failed tests results shown below:

Test | Result | Time (s) | Test File
test_05_rvpc_multi_tiers | Failure | 404.25 | test_vpc_redundant.py
test_05_rvpc_multi_tiers | Error | 430.53 | test_vpc_redundant.py
test_01_redundant_vpc_site2site_vpn | Failure | 268.41 | test_vpc_vpn.py

@nvazquez

Contributor Author

commented Jun 25, 2019

@blueorangutan package

@blueorangutan

commented Jun 25, 2019

@nvazquez a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan

commented Jun 25, 2019

Packaging result: ✔centos6 ✔centos7 ✔debian. JID-11

@nvazquez

Contributor Author

commented Jun 25, 2019

@blueorangutan

commented Jun 25, 2019

@nvazquez a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@rhtyd approved these changes Jun 25, 2019

@rhtyd

Member

commented Jun 25, 2019

@anuragaw @shwstppr @DaanHoogland please review, thanks.

@blueorangutan

commented Jun 25, 2019

Trillian test result (tid-9)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 35967 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr3365-t9-kvm-centos7.zip
Intermittent failure detected: /marvin/tests/smoke/test_deploy_virtio_scsi_vm.py
Intermittent failure detected: /marvin/tests/smoke/test_vpc_redundant.py
Smoke tests completed. 71 look OK, 0 have error(s)
Only failed tests results shown below:

Test | Result | Time (s) | Test File

@nvazquez

Contributor Author

commented Jun 25, 2019

Merging based on approvals and test results

@nvazquez merged commit a75444a into apache:master Jun 25, 2019

1 of 2 checks passed

continuous-integration/travis-ci/pr: The Travis CI build failed
Jenkins: This pull request looks good