Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CLOUDSTACK-8956: NSX/Nicira Plugin does not support NSX v4.2.1 #935

Merged
merged 4 commits into from
Nov 18, 2015

Conversation

nvazquez
Copy link
Contributor

JIRA Ticket: https://issues.apache.org/jira/browse/CLOUDSTACK-8956

Description of the problem:

Prior to version 4.2. Nicira/VmWare NSX used a variation of Open vSwitch as means of integrating SDN into hypervisor layer. Cloudstack NiciraNVP plugin was written to support OVS as a bridge to NSX.
In version 4.2 VMware introduced NSX vSwitch as a replacement for OVS in ESX hypervisors. It is a fork of distributed vSwitch leveraging one of the recent features of ESX called opaque networks. Because of that change the current version of NiciraNVP plugin doesn’t support versions of NSX-MH above 4.2 specifically in Vsphere environment. Proposed fix will analyze a version of NVP/NSX API and use proper support for ESX hypervisors.

vSphere hypervisor mode operations when NV is deployed onto NSX managed network changes:

  • Current mode. A portgroup = UUID of CS VM NIC is created on a local standard switch of the Hypervisor where VM is starting. VM nic is attached to that port group.
  • New mode. No additional port group is created on a HW. No port group cleanup is needed after VM/NIC is destroyed. VM is attached to 1st port group having the following attributes:
    ** opaqueNetworkId string "br-int”
    ** opaqueNetworkType string "nsx.network"

If portgroup with such attributes is not found a deployment should fail with exception.

VMware vSphere API version from 5.1 to 5.5:

Since vSphere API version 5.5, OpaqueNetworks are introduced.
Its description says:

This interface defines an opaque network, in the sense that the detail and configuration of the network is unknown to vShpere and is managed by a management plane outside of vSphere. However, the identifier and name of these networks is made available to vSphere so that host and virtual machine virtual ethernet device can connect to them.

In order to connect a vm's virtual ethernet device to the proper opaque network when deploying a vm into a NSX managed network, we first need to look for a particular opaque network on hosts. This opaque network's id has to be "br-int" and its type "nsx.network".

Since vSphere API version 5.5 HostNetworkInfo introduces a list of available opaque networks for each host.
If NSX API version >= 4.2 we look for a OpaqueNetworkInfo which satisfies:

  • opaqueNetworkId = "br-int"
  • opaqueNetworkType = "nsx.netork"

If that opaque network is found, then we need to attach vm's NIC to a virtual ethernet device which support this, so we use VirtualEthernetCardOpaqueNetworkBackingInfo setting:

  • opaqueNetworkId = "br-int"
  • opaqueNetworkType = "nsx.netork"

@@ -180,6 +182,17 @@ public Type getType() {
public PingCommand getCurrentStatus(final long id) {
try {
ControlClusterStatus ccs = niciraNvpApi.getControlClusterStatus();
//------------------------------------------------GET API_PROVIDER MAJORITY VERSION
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if this block is wrapped by a comment, wouldn't it be better to refactor it into a new method? (giving the method a meaningful name that would render the comment redundant, say getApiProviderMajorityVersion(...))

@miguelaferreira
Copy link
Contributor

Hi @nvazquez

Thanks for your contribution. I'm not quite sure I understand the problem you are trying to solve, but that likely my own lack of NXS knowledge.
The last time I changed this plugin (NiciraNvp), I've added an integration tests for checking is redirects within a cluster were being followed. You change probably doesn't impact that, but it would be great to make sure. Can you please run the test (test/integration/smoke/test_nicira_controller.py, let me know if you need help in getting the test running) in your setup and report on the results?
In addition, it would be great of you document the problem you are trying to fix by adding a new integration tests (that fails before your change is applied and succeeds afterwards).

A final remark about the code and the commits, please see my comments on certain lines, and consider rearranging your commits in such a way that what you did and/or the intention behind each commit become clear.

Thanks again, and do let me know if I can hep you further.

@remibergsma
Copy link
Contributor

Hi @nvazquez Please reorganise your commits, as we need them to be atomic. It cannot go in like this.

I'm also with @miguelaferreira, please let us know what problem you fix here and what testing you have done (also to make sure it still works with current setups).

@nvazquez
Copy link
Contributor Author

Hi @miguelaferreira @remibergsma

Thanks a lot for your help. I will try to make a better description of the problem and work on commits and code as you suggested. Sorry for unorganized commits, this is my first contribution to CS, I will try to make it better.

@miguelaferreira I don't have that test file (test/integration/smoke/test_nicira_controller.py) on my branch. I've created this branch from tag 4.5.1, maybe it wasn't available yet. How can I do for testing it?

Thanks,
Nicolas

@nvazquez nvazquez force-pushed the from4.5.1 branch 2 times, most recently from 48659bf to 894a04d Compare October 16, 2015 20:26
@remibergsma
Copy link
Contributor

@nvazquez Thanks for your reply. If you made this from 4.5.1, be aware that you submitted it against current master so this will for sure cause problems. That may be why Jenkins failed and you have all these conflicts.

Please make your patch against current master, as it is the only way to get it in. Otherwise, it would be in 4.5.x and not in any newer versions causing regression when upgrading from 4.5 to 4.6 etc.

Master is currently frozen and almost stable as we're heading to a 4.6 release so developing against it works great.

You will also find the test in the master branch.

@miguelaferreira
Copy link
Contributor

@nvazquez Thanks for your work on the commits, they are much better now. I would say it can easily be improved even further, but given the current practice in the project, it is already good!

As @remibergsma pointed out, it would be best if you apply your changes to master. However if you want to use the functionality you added in a new minor version of an older release, then make a PR for the branch of that release, and we will help you port your changes all the way up to master. We have worked on a workflow that helps with that (https://cwiki.apache.org/confluence/display/CLOUDSTACK/Release+principles+for+Apache+CloudStack+4.6+and+up#ReleaseprinciplesforApacheCloudStack4.6andup-Whymergingforwardovercherry-pickingfrommaster?)

EDIT: and btw I think you missed one commented out line in #935 (diff)

@nvazquez nvazquez force-pushed the from4.5.1 branch 7 times, most recently from afe75f8 to 35e3508 Compare October 20, 2015 13:58
@nvazquez
Copy link
Contributor Author

@miguelaferreira @remibergsma
Thanks again for your help! As you suggested I rebased master branch. I also added a more detailed description of the patch on this PR description and JIRA ticket.

Thanks,
Nicolas

@nvazquez
Copy link
Contributor Author

VMware vSphere API version from 5.1 to 5.5:

Since vSphere API version 5.5, OpaqueNetworks are introduced.
Its description says:

This interface defines an opaque network, in the sense that the detail and configuration of the network is unknown to vShpere and is managed by a management plane outside of vSphere. However, the identifier and name of these networks is made available to vSphere so that host and virtual machine virtual ethernet device can connect to them.

In order to connect a vm's virtual ethernet device to the proper opaque network when deploying a vm into a NSX managed network, we first need to look for a particular opaque network on hosts. This opaque network's id has to be "br-int" and its type "nsx.network".

Since vSphere API version 5.5 HostNetworkInfo introduces a list of available opaque networks for each host.
If NSX API version >= 4.2 we look for a OpaqueNetworkInfo which satisfies:

  • opaqueNetworkId = "br-int"
  • opaqueNetworkType = "nsx.netork"

If that opaque network is found, then we need to attach vm's NIC to a virtual ethernet device which support this, so we use VirtualEthernetCardOpaqueNetworkBackingInfo setting:

  • opaqueNetworkId = "br-int"
  • opaqueNetworkType = "nsx.netork"

@nvazquez nvazquez changed the title From4.5.1: NSX/Nicira Plugin does not support NSX v4.2.1 CLOUDSTACK-8956: NSX/Nicira Plugin does not support NSX v4.2.1 Oct 21, 2015
@miguelaferreira
Copy link
Contributor

@nvazquez PR looks awesome now. Great work!

I will pull in your PR and run the marvin test we already have against it to see if that is still good. I will post the results in here.
However, I do not have an environment with vSphere to test one. Could you please run the tests we have already on you environment and post the results here?

In addition, to make this PR super awesome, we would also need at least one new marvin test that exercises the new logic you have introduced. That is, a test where the Nicira plugin is configured for vSphere, and we can at least create te a VM on a Nicira backed network.

break;

default:
assert (false);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this "assert (false);" here? Asserting as false might cause problems.

Each assertion contains a boolean expression that you believe will be true when the assertion executes. If it is not true, the system will throw an error.

Source: http://docs.oracle.com/javase/7/docs/technotes/guides/language/assert.html

The user os assertion in this way is irrelevant and error-prone. Could you please remove it and push a new commit?

Cheers,
Wilder

@nvazquez
Copy link
Contributor Author

@miguelaferreira thanks a lot for your help and advices! As suggested I post test_nicira_controller.py results:

$ cat /tmp/MarvinLogs/test_nicira_controller_77FOV9/results.txt
test_01_nicira_controller (integration.smoke.test_nicira_controller.TestNiciraContoller) ... === TestName: test_01_nicira_controller | Status : SUCCESS ===
ok
Nicira clusters will redirect clients (in this case ACS) to the master node. ... === TestName: test_02_nicira_controller_redirect | Status : SUCCESS ===
ok


Ran 2 tests in 398.132s

OK

@wilderrodrigues
Copy link
Contributor

Awesome to see the changes, @nvazquez ! Thanks for being openminded and also for your contributions!

Cheers,
Wilder

@miguelaferreira
Copy link
Contributor

btw, I'm testing your PR in our setup

@miguelaferreira
Copy link
Contributor

@nvazquez I havent' forgotten your PR. I'm having some trouble with the automation to re-create a NSX cluster with a KVM host in our test environment. As soon as it is fixed, I will test.
In any case, this PR can only be merged after 4.6 because of feature freeze.

I'll keep you posted.

@miguelaferreira
Copy link
Contributor

@nvazquez I've built and tested your PR in our environment and the functionality we have tests for remains stable. 👍

What I did:

  1. build and unit-test with developer and systemvm profiles
  2. package centos7 KVM agent RPMs
  3. deploy infra with
    • 1 mangement server (war installed from step 1)
    • 2 NSX controllers (cluster)
    • 1 NSX manger node
    • 1 NSX service node
    • 1 KVM host (agent installed from step 2)
  4. deploy data center using this config
  5. run marvin tests from test/integration/smoke/test_nicira_controller.py with flags tags=advanced,required_hardware=true

Result

[root@cs1 cloudstack]# nosetests --with-marvin --marvin-config=/data/shared/marvin/mct-zone1-kvm1-NVP.cfg -s -a tags=advanced,required_hardware=true test/integration/smoke/test_nicira_controller.py

==== Marvin Init Started ====

=== Marvin Parse Config Successful ===

=== Marvin Setting TestData Successful===

==== Log Folder Path: /tmp//MarvinLogs//Nov_10_2015_10_27_55_46U497. All logs will be available here ====

=== Marvin Init Logging Successful===

==== Marvin Init Successful ====
/usr/lib/python2.7/site-packages/requests/packages/urllib3/util/ssl_.py:100: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning.
  InsecurePlatformWarning
(previous warning repeated a couple times)
=== TestName: test_01_nicira_controller | Status : SUCCESS ===

=== TestName: test_02_nicira_controller_redirect | Status : SUCCESS ===

===final results are now copied to: /tmp//MarvinLogs/test_nicira_controller_RV74VA===

LGTM

@nvazquez
Copy link
Contributor Author

@wilderrodrigues Thanks a lot Wilder!

@miguelaferreira Thanks for following this PR and testing it!

@miguelaferreira
Copy link
Contributor

@nvazquez last Friday 4.6.0 was released so current master is now 4.7.0. Could you please rebase your branch on latest master?

Then @wilderrodrigues will do one final review of the code, and if all is well (as it looks now) then he will give another LGTM and this feature can be merged.

@nvazquez
Copy link
Contributor Author

@miguelaferreira Great, I rebased my branch on master and pushed again

@remibergsma
Copy link
Contributor

Would be great to include this in 4.7, please ping me when you're ready!

@@ -2076,6 +2087,9 @@ private static void postNvpConfigBeforeStart(VirtualMachineMO vmMo, VirtualMachi
} else if (backing instanceof VirtualEthernetCardNetworkBackingInfo) {
// This NIC is connected to a Virtual Switch
// Nothing to do
} else if (backing instanceof VirtualEthernetCardOpaqueNetworkBackingInfo) {
//if NSX API VERSION >= 4.2, connect to br-int (nsx.network), do not create portgroup else previous behaviour
//OK, connected to OpaqueNetwork
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those 2 else/if blocks should contain some logging in order to inform sys admins what is now comments in the code.

You can keep it for now, because it doesn't affect the feature that @miguelaferreira has already tested. Once this PR is merged, we can fix it.

@wilderrodrigues
Copy link
Contributor

Couple of minor things that can go through now and be fixed after this is merged. @miguelaferreira has already tested the PR and showed that it works just fine. So, it has my LGTM as well 👍

Thanks a lot for your contribution, @nvazquez !

Ping @remibergsma you can proceed with the merge.

Cheers,
Wilder

@asfgit asfgit merged commit b20c5be into apache:master Nov 18, 2015
asfgit pushed a commit that referenced this pull request Nov 18, 2015
CLOUDSTACK-8956: NSX/Nicira Plugin does not support NSX v4.2.1JIRA Ticket: https://issues.apache.org/jira/browse/CLOUDSTACK-8956

### Description of the problem:
Prior to version 4.2. Nicira/VmWare NSX used a variation of Open vSwitch as means of integrating SDN into hypervisor layer. Cloudstack NiciraNVP plugin was written to support OVS as a bridge to NSX.
In version 4.2 VMware introduced NSX vSwitch as a replacement for OVS in ESX hypervisors. It is a fork of distributed vSwitch leveraging one of the recent features of ESX called opaque networks. Because of that change the current version of NiciraNVP plugin doesnt support versions of NSX-MH above 4.2 specifically in Vsphere environment. Proposed fix will analyze a version of NVP/NSX API and use proper support for ESX hypervisors.

vSphere hypervisor mode operations when NV is deployed onto NSX managed network changes:
* Current mode. A portgroup = UUID of CS VM NIC is created on a local standard switch of the Hypervisor where VM is starting. VM nic is attached to that port group.
* New mode. No additional port group is created on a HW. No port group cleanup is needed after VM/NIC is destroyed. VM is attached to 1st port group having the following attributes:
** opaqueNetworkId string "br-int
** opaqueNetworkType string "nsx.network"

If portgroup with such attributes is not found a deployment should fail with exception.

### VMware vSphere API version from 5.1 to 5.5:
Since vSphere API version 5.5, [OpaqueNetworks](https://www.vmware.com/support/developer/converter-sdk/conv55_apireference/vim.OpaqueNetwork.html) are introduced.
Its description says:
> This interface defines an opaque network, in the sense that the detail and configuration of the network is unknown to vShpere and is managed by a management plane outside of vSphere. However, the identifier and name of these networks is made available to vSphere so that host and virtual machine virtual ethernet device can connect to them.

In order to connect a vm's virtual ethernet device to the proper opaque network when deploying a vm into a NSX managed network, we first need to look for a particular opaque network on hosts. This opaque network's id has to be **"br-int"** and its type **"nsx.network"**.

Since vSphere API version 5.5 [HostNetworkInfo](https://www.vmware.com/support/developer/converter-sdk/conv55_apireference/vim.host.NetworkInfo.html#opaqueNetwork) introduces a list of available opaque networks for each host.
If NSX API version >= 4.2 we look for a [OpaqueNetworkInfo](https://www.vmware.com/support/developer/converter-sdk/conv55_apireference/vim.host.OpaqueNetworkInfo.html) which satisfies:
* opaqueNetworkId = "br-int"
* opaqueNetworkType = "nsx.netork"

If that opaque network is found, then we need to attach vm's NIC to a virtual ethernet device which support this, so we use [VirtualEthernetCardOpaqueNetworkBackingInfo](https://www.vmware.com/support/developer/converter-sdk/conv55_apireference/vim.vm.device.VirtualEthernetCard.OpaqueNetworkBackingInfo.html) setting:
* opaqueNetworkId = "br-int"
* opaqueNetworkType = "nsx.netork"

* pr/935:
  CLOUDSTACK-8956: Remove assert(false) on opaque network and ping method on NiciraNvpApiVersion
  CLOUDSTACK-8956: Deploy VM on NSX managed network changes if NSX Api Version >= 4.2: has to connect to "br-int" of "nsx.network" type
  CLOUDSTACK-8956: Log NSX Api Version
  CLOUDSTACK-8956: Add VMware Api v5.5 and change pom.xml to use VMware Api v5.5

Signed-off-by: Remi Bergsma <github@remi.nl>
@remibergsma
Copy link
Contributor

@nvazquez Can you look at this build failure: http://jenkins.buildacloud.org/job/build-master-noredist/4746/console

Does the slave need the 5.5 jar or is something else wrong?

@remibergsma
Copy link
Contributor

According to Jenkins the build was fine:

Maven:  -B -f /home/jenkins/jenkins-slave/workspace/cloudstack-pull-analysis/pom.xml -Dmaven.repo.local=/home/jenkins/jenkins-slave/maven-repositories/0 -Penablefindbugs clean install cobertura:cobertura

although it doesn't do the -Pnoredist profile.

@nvazquez
Copy link
Contributor Author

@remibergsma It looks like it is not finding 5.5 jar. These are the steps I followed:

@DaanHoogland
Copy link
Contributor

@nvazquez Do you mean to say that we can just rename the old jar to contain the new version in the name and the sun shines again?

@nvazquez
Copy link
Contributor Author

@DaanHoogland Hi! No, just renaming the old jar (version 5.1) will fail because it doesn't contain opaque network classes. Those ones are introduced in the 5.5 version jar.
I added a line in deps/install-non-oss.sh which includes 5.5 jar to maven repo and also replaced cs.vmware.api.version property in pom.xml file from 5.1 to 5.5

@DaanHoogland
Copy link
Contributor

@nvazquez I tried to mv and copy the jar in the job but it doesn't work. now trying to see if your link yields a newer version then our job has

@DaanHoogland
Copy link
Contributor

@nvazquez is the older one still needed?

@nvazquez
Copy link
Contributor Author

@DaanHoogland No, it's not.

@DaanHoogland
Copy link
Contributor

Ok, shouldn't it be removed from the install script and then?

@nvazquez
Copy link
Contributor Author

@DaanHoogland Ok, sure

@DaanHoogland
Copy link
Contributor

@nvazquez can you please look at http://jenkins.buildacloud.org/job/build-master-slowbuild/2660/findbugsResult/new/
I think these are trivial, you just have to decide on which encoding to use and call the right util method (getPrefferedCharset() or getDefaultCharset()) from StringUtils.

@nvazquez
Copy link
Contributor Author

@DaanHoogland those warnings are in files that don't belong to my pull request.
I could see in http://jenkins.buildacloud.org/job/build-master-slowbuild/2660/changes#detail0 that it compiled 5 commits, but the first one is commited by other contributor.

@DaanHoogland
Copy link
Contributor

sorry, @nlivens can you have a look?

@nvazquez nvazquez deleted the from4.5.1 branch November 4, 2016 13:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants