Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FEATURE-3823: kvm agent hooks #3839

Merged
merged 4 commits into from Mar 14, 2020
Merged

FEATURE-3823: kvm agent hooks #3839

merged 4 commits into from Mar 14, 2020

Conversation

bwsw
Copy link
Contributor

@bwsw bwsw commented Jan 27, 2020

Description

The PR introduces the new KVM agent extension interface in the form of hooks. Every hook is implemented in Groovy like

package groovy

class BaseTransform {
	String transform(Object logger, String xml) {
		File file = new File("/tmp/hooks.txt")
		file.append("------------------------------------------------------------------------\n")
		file.append(xml)
		return xml
	}
}

new BaseTransform()

There are 3 hooks are implemented now:

  1. XML transformer which is called right before VM is launched and it allows modifying XML somehow (example above)

  2. onStart and onStop hooks which are called right after VM state changes to started/stopped.

package groovy

class StateHandler {
	Object onStart(Object logger, String vmName) {
		File file = new File("/tmp/hooks-start.txt")
		file.append(vmName)
		return null
	}

        Object onStop(Object logger, String vmName) {
		File file = new File("/tmp/hooks-stop.txt")
		file.append(vmName)
                return null
        }
}

new StateHandler()

Every hook is run inside try {} catch (Exception e) {}, so it can not cause agent misbehavior. If hooks are not defined, they are skipped.

Also, the PR adds initial support for GitLab CICD for those, who does WIPs with GitLab, not in GitHub.

Initial RFC/Proposal: #3823

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)

Screenshots (if appropriate):

How Has This Been Tested?

Tested in unit tests and manually for master.

@svenvogel
Copy link
Contributor

@bwsw can you add a documentation too, how to use?

@bwsw
Copy link
Contributor Author

bwsw commented Jan 28, 2020

@bwsw can you add a documentation too, how to use?

@svenvogel I'll do for sure, just would like to do it after there are no more design concerns about PR and it is approved for merge.

As far as I know, there is no specific place for documentation like that. Where we would like to have it?

@DaanHoogland
Copy link
Contributor

@nvazquez this resembles something you did. Can you have look?

@blueorangutan
Copy link

Packaging result: ✖centos6 ✔centos7 ✔debian. JID-674

@poussa
Copy link

poussa commented Feb 6, 2020

I tried this feature and it provided a solution my use case (map SRIOV pci device to KVM host).

Is this going to be merged to the master anytime soon?

@bwsw
Copy link
Contributor Author

bwsw commented Feb 7, 2020

@poussa Wow! Great you have found this useful. We hoped that some other interested users exist. Hope it will be merged as well.

@nvazquez
Copy link
Contributor

Hi @bwsw, I find the XML use case pretty similar to what's been introduced by this PR: #3510. Start and stop VM hooks looks good. In PR #3610 I'm working on adding hooks scripts for rolling maintenance and similarly to this PR, an admin can define its own scripts and has to specify a directory where to find these scripts on the agent.properties file, perhaps this property can be shared between both PRs

@bwsw
Copy link
Contributor Author

bwsw commented Feb 13, 2020

@DaanHoogland @rhtyd
@nvazquez Hi, the XML functionality is similar but differs very much.

Read that discussion:
#3823
It may look similar in the way it acts, but completely different from the perspective of the ISP operation, especially for public cloud. This PR allows extending CS at a scale beyond the CS functionality implemented in the system. Your PR allows users adding the options to XML.

E.g.

Case1. In your PR there is no way to select any of the available NVME drives installed in the platform during VM deployment and attach it to the VM using some advanced logic.

Case2. when VM is deployed, for every CS account, I create VXLAN unmanaged by CS. I create a bridge and would like to add a NIC into configuration dynamically. It's impossible to implement with your PR.

Case 3. I would like to find any of available Quadro RTX 4000 GPU and pass it into VM.

Case 4. Read Sakari Poussa's request from mailing list:

Thanks for the information. It was useful.

Let me elaborate by use cases a bit more. I have PCI host devices with
sriov capabilities. I may have 48 virtual functions (VF) on a host. I want
to assign the VFs to some VM but not all. So I need some control which VMs
gets the VF and others don't. Also, I need to keep track which VFs are
already assigned and which are free. Lastly, I want to expose the VFs to
containers running on VMs created by the upcoming Cloudstack Kubernetes
Service (AKS, pull #3680).

Looking at the first feature you mentioned, I don't think I can use that.
It has no control on which VMs to add the extraconfig. It is all or nothing.

The second feature, which you started to work on seems to have more
potential. Do you see it can support my use case?

Thanks, Sakari

None of those cases are possible with PR you mention. To sum up:

  • My PR is for public cloud operators, who percept CS as a constructor;
  • Your PR is for private cloud operators who percept CS as a final system.

The PR is ready to be merged from my point of view. I suppose It's better to merge it to avoid diff accumulation.

@DaanHoogland
Copy link
Contributor

@bwsw I'm happy to put some effort into getting this merged. Yur code looks neat, even though I don't undertand it fully yet. A few things putting me off in your latest reply. Please have patience with me?
Firstly:

My PR is for public cloud operators, who percept CS as a constructor;

I imagine you mean this PR, right?

Your PR is for private cloud operators who percept CS as a final system.

What PR is that? Not "who's!", can you please refer to bits of code and not people? I can see some frustration in this comment and I have been involved in this conflict between private clouds an public clouds a lot. My question is genuine. On a side note, It's all apache's PRs ;)

Case1. In your PR there is no way to select any of the available NVME drives installed in the platform during VM deployment and attach it to the VM using some advanced logic.

(what PR?) can you add/explain how you are adding that info? I would expect some kind of dialog going back to the MS to provide the user with a list of available resources but can not see that in the code.

Case2. when VM is deployed, for every CS account, I create VXLAN unmanaged by CS. I create a bridge and would like to add a NIC into configuration dynamically.

So how do you guarantee the same vxlan's aren't also used by ACS? (And also, parallel to case 1 above, how is the vxlan id fed back into ACS)

Case 3. I would like to find any of available Quadro RTX 4000 GPU and pass it into VM.

In this case you do not want the user to add it, right? but have it automagically added?

Case 4. Read Sakari Poussa's request from mailing list:

Can you add a reference to the mail thread in lists.apache.org please? The snippet contains references to context.

@svenvogel is asking for documentation above. Are you already using this in production? Can you add a PR to https://github.com/apache/cloudstack-documentation, please?

@weizhouapache , @GabrielBrascher , @kiwiflyer thoughts/reviews?

@apache apache deleted a comment from blueorangutan Feb 19, 2020
@apache apache deleted a comment from blueorangutan Feb 19, 2020
@andrijapanicsb
Copy link
Contributor

@bwsw can I kindly join the effort on asking for some documentation - at the bare minimum I would like to see an example of how to use this feature (some specific, very-simple scenario) here on the PR description - so we can test it - i.e. I can't test it by going through the code as I'm not a developer myself, but would like to help testing it. i.e. expand on "How Has This Been Tested" section please.

thx

@bwsw
Copy link
Contributor Author

bwsw commented Feb 20, 2020

@andrijapanicsb
@svenvogel

// these hooks allow adding a temporary high-speed device for local VM cache device.
// simple error-prone implementation.

package groovy

import com.sun.deploy.xml.XMLNode
import groovy.util.XmlParser
import groovy.util.Node
import groovy.xml.XmlUtil

class CacheDeviceAdder {
    def highSpeedDir = "/mnt/nvme0/"
    final def ramGBDivider = 1024

    Node diskXmlSpec(String swapFile) {
        return new XmlParser().parseText(
                "    <disk type='file' device='disk'>\n" +
                "      <driver name='qemu' type='qcow2' cache='none'/>\n" +
                "      <source file='$swapFile'/>\n" +
                "      <target dev='vdb' bus='ide'/>\n" +
                "      <alias name='ide0-0-1'/>\n" +
                "      <address type='drive' controller='0' bus='0' target='0' unit='1'/>\n" +
                "    </disk>")
    }

    String transform(Object logger, String xml) {
        def vmDef = new XmlParser().parseText(xml)
        // get VM ram amount
        def memory = Integer.parseInt(vmDef.memory.text()) / ramGBDivider
        // get VM ram name
        def name = vmDef.name.text()
        def swapFile = highSpeedDir + name + ".qcow2"
        // remove swap device if exists
        "rm -f ${swapFile}".execute()
        def volCmd = "qemu-img create -f qcow2 ${swapFile} ${memory}G"
        // create new swap device
        volCmd.execute()
        def diskSpec = diskXmlSpec(swapFile)
        // update XML definition
        vmDef.devices[0].append(diskSpec)
        // return new XML definition
        return groovy.xml.XmlUtil.serialize(vmDef)
    }

    Object stop(Object logger, String vmName) {
        def swapFile = highSpeedDir + vmName.toString() + ".qcow2"
        // remove unused swap device
        "rm -f ${swapFile}".execute()
        return null
    }

    Object start(Object logger, String vmName) {
        return null
    }
}

new CacheDeviceAdder()

This is a very simple hook which utilizes stop and transform cases:

  • When transform is used, it adds a highspeed volume which size is eq to VM RAM size, which can be used as a swap.
  • When stop, it removes an unused swap device from FS.
  • Start is not used, no idea how to use it in the current case.

It's very primitive and avoids many checks, but allows getting a general idea.

Again, I'll add the documentation, when somebody wants to approve the idea... Because, right now there is very simple documentation:

  • Configure hook paths and methods in the agent.properties
  • Implement hooks in the form:
package groovy

class AnyNameNoMatter {
       String method(Object logger, String xml) {
            // your code
            return null // for onStart, onStop
            return xml // for xml transformation
      }
}

new AnyNameNoMatter()

@DaanHoogland
Copy link
Contributor

@blueorangutan package

@bwsw
Copy link
Contributor Author

bwsw commented Feb 25, 2020

@DaanHoogland

So if I understand correctly, they are not activated in the agent.properties by cloudstack, nor deactivated?

Sure, they must be enabled/disabled by the person who manages the agent after it was connected to the CloudStack, the same way, say, you change RNG and activate watchdog. ACS doesn't do that, you do it after the agent is connected and initial configuration is produced by the agent deployment/integration script.

Copy link
Contributor

@DaanHoogland DaanHoogland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code looks good, some style remarks, no blockers.

@DaanHoogland
Copy link
Contributor

@weizhouapache did you test and can you approve? If so i think we can merge...

@bwsw
Copy link
Contributor Author

bwsw commented Mar 9, 2020

@DaanHoogland @weizhouapache please, give me one-two days to complete the refactoring as Daan proposed.

@DaanHoogland
Copy link
Contributor

@bwsw freeze is planned on friday end of day.

.gitlab-ci.yml Outdated Show resolved Hide resolved
Copy link
Member

@weizhouapache weizhouapache left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tested. start/stop vm works fine.
Did not test the agent hook scripts.

…ature/3823-kvm-agent-hooks

� Conflicts:
�	plugins/hypervisors/kvm/src/main/java/com/cloud/hypervisor/kvm/resource/wrapper/LibvirtStartCommandWrapper.java
@DaanHoogland DaanHoogland self-requested a review March 13, 2020 10:12
Copy link
Contributor

@DaanHoogland DaanHoogland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code lgtm

@DaanHoogland
Copy link
Contributor

@blueorangutan package

@blueorangutan
Copy link

@DaanHoogland a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan
Copy link

Packaging result: ✖centos6 ✔centos7 ✔debian. JID-1048

@DaanHoogland
Copy link
Contributor

@blueorangutan test

@blueorangutan
Copy link

@DaanHoogland a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@andrijapanicsb
Copy link
Contributor

Waiting for tests results, and if all good, will merge regardless of the freeze date (as approvals and everything else is good atm)

@andrijapanicsb
Copy link
Contributor

@blueorangutan package

@blueorangutan
Copy link

@andrijapanicsb a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan
Copy link

Packaging result: ✖centos6 ✔centos7 ✔debian. JID-1051

@andrijapanicsb
Copy link
Contributor

@blueorangutan test

@blueorangutan
Copy link

@andrijapanicsb a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan
Copy link

Trillian test result (tid-1242)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 44040 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr3839-t1242-kvm-centos7.zip
Smoke tests completed. 83 look OK, 0 have error(s)
Only failed tests results shown below:

Test Result Time (s) Test File

@DaanHoogland DaanHoogland merged commit 750abf3 into apache:master Mar 14, 2020
@blueorangutan
Copy link

Trillian test result (tid-1244)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 33321 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr3839-t1244-kvm-centos7.zip
Smoke tests completed. 83 look OK, 0 have error(s)
Only failed tests results shown below:

Test Result Time (s) Test File

@blueorangutan
Copy link

Packaging result: ✖centos6 ✔centos7 ✔debian. JID-1052

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet