CLOUDSTACK-9362: Skip VXLANs when rewriting the bridge name for migrations (4.8-2) #1513

Merged
merged 1 commit into apache:4.8 on May 18, 2016


@insom
Contributor
insom commented Apr 22, 2016

From the JIRA issue:

bb8f7c6

The above commit introduces rewriting of bridge device names when migrating a virtual machine from one host to another. However, it also matches bridges called "brvx-1234" and rewrites them to (in my case) "brem1-1234" - this doesn't match the bridge name on the destination and causes the migration to fail with the error:

error : virNetDevGetMTU:397 : Cannot get interface MTU on 'brem1-1234': No such device

I have flagged this as major because it's not possible to migrate VMs using VXLANs for maintenance, which seems important (it's certainly important to me!).

This is a version of #1508 based against 4.8 (sorry!)
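To illustrate the idea (a rough, hypothetical sketch only -- not the actual hook code; the function name and regex below are made up for this example): when the hook rewrites bridge names in the migrating domain's XML to match the destination host, bridges whose names start with "brvx-" should be left untouched, because VXLAN bridge names are already identical on both hosts.

import re

# Hypothetical sketch of the skip-VXLAN behaviour; not the real hook script.
def rewrite_bridges(domain_xml, dest_prefix):
    def _replace(match):
        bridge = match.group(1)
        if bridge.startswith("brvx-"):
            # VXLAN bridges keep the same name on every host; rewriting
            # brvx-1234 to e.g. brem1-1234 would reference a bridge that
            # does not exist on the destination, and the migration fails.
            return match.group(0)
        suffix = bridge.split("-")[-1]  # e.g. the VLAN id "1234"
        return "bridge='%s-%s'" % (dest_prefix, suffix)
    return re.sub(r"bridge='([^']+)'", _replace, domain_xml)

# A VLAN bridge is rewritten for the destination host, a VXLAN bridge is not:
xml = "<source bridge='breth0-1234'/><source bridge='brvx-5678'/>"
print(rewrite_bridges(xml, "brem1"))
# <source bridge='brem1-1234'/><source bridge='brvx-5678'/>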

@asfbot
asfbot commented Apr 22, 2016

"Richard Klein (RSI)" on dev@cloudstack.apache.org replies:
First time poster to the development mailing list, so greetings everyone. We ran across this issue in our environment as well (CentOS 7 with KVM) with CS 4.7.0. The problem appeared to us as a behavior of libvirt, because we could reproduce the error independent of CS by manually creating a VM and migrating it outside of CS but on the CS host. This led us to believe it was a libvirt issue. However, based on the code change in the Git repository, I was unaware there were hooks into libvirt.

We worked around the issue by just changing the name of the bridge from brvx-1234 to vxbr-1234, and that resolved the issue for us. Not that this is a better solution, but just wondering if we are talking about the same issue.

Richard Klein  rklein@rsitex.com
RSI
5426 Guadalupe, Suite 100
Austin TX 78751

@kiwiflyer
Contributor

We'll pull this in for testing.

@kiwiflyer
Contributor

Tested this in a hardware 4.8 lab: I was able to migrate VXLAN-enabled VMs between hosts as expected.

LGTM

@dmabry
Contributor
dmabry commented May 12, 2016

Also tested in a 4.8 lab; I was able to migrate VMs between hosts when using a VXLAN isolated guest network.

LGTM

@swill
Contributor
swill commented May 13, 2016

CI RESULTS

Tests Run: 85
  Skipped: 0
   Failed: 1
   Errors: 0
 Duration: 9h 01m 07s

Summary of the problem(s):

FAIL: Test Site 2 Site VPN Across redundant VPCs
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/data/git/cs1/cloudstack/test/integration/smoke/test_vpc_vpn.py", line 984, in test_01_redundant_vpc_site2site_vpn
    self.assert_(vpc2 is not None, "VPC2 creation failed")
AssertionError: VPC2 creation failed
----------------------------------------------------------------------
Additional details in: /tmp/MarvinLogs/test_network_B8TEJ7/results.txt

Associated Uploads

/tmp/MarvinLogs/DeployDataCenter__May_12_2016_22_26_38_6CZAVF:

/tmp/MarvinLogs/test_network_B8TEJ7:

/tmp/MarvinLogs/test_vpc_routers_XOZ934:

Uploads will be available until 2016-07-13 02:00:00 +0200 CEST

Comment created by upr comment.

@swill
Contributor
swill commented May 13, 2016

@insom, @dmabry and @kiwiflyer, the CI failure seems like it could be related to this change, but I can't be sure. Can you guys provide some guidance, since all of you have tested this? Also, @insom, can you do a force push, or close and re-open the PR, so we can get this green and mergeable once we agree on its status? Thanks...

@insom insom closed this May 13, 2016
@insom insom reopened this May 13, 2016
@insom
Contributor
insom commented May 13, 2016 edited

@swill I've closed and re-opened. This change is to a shell script only and the failures are within Marvin, so I'm pretty confident that they are unrelated to my proposed change.

edit: That is, they are errors in the simulated DC (which I don't believe touches these scripts), not in Marvin itself.

@swill
Contributor
swill commented May 13, 2016

@insom yes, I tend to agree with you. The main reason I highlighted it was that the error was network-related and this change touches interface details, so I figured I would check. Once everything comes back green, I will merge this. Thank you for the work and the quick reply. 👍

@insom
Contributor
insom commented May 13, 2016

@swill Hmm. Now it's a different Cobertura error

@swill
Contributor
swill commented May 13, 2016

Holy errors batman!!!

It looks like it started like this:

[INFO] --- cobertura-maven-plugin:2.6:instrument (default-cli) @ cloud-server ---
[INFO] Cobertura 2.0.3 - GNU GPL License (NO WARRANTY) - See COPYRIGHT file
[cobertura] WARN  [main] net.sourceforge.cobertura.instrument.CoberturaInstrumenter - Unable to instrument file /home/jenkins/jenkins-slave/workspace/cloudstack-pr-analysis/server/target/generated-classes/cobertura/com/cloud/ha/AbstractInvestigatorImpl.class
java.lang.NullPointerException
    at net.sourceforge.cobertura.instrument.tp.ClassMap.applyOnProjectData(ClassMap.java:364)
    at net.sourceforge.cobertura.instrument.CoberturaInstrumenter.instrumentClass(CoberturaInstrumenter.java:187)
    at net.sourceforge.cobertura.instrument.CoberturaInstrumenter.instrumentClass(CoberturaInstrumenter.java:121)
    at net.sourceforge.cobertura.instrument.CoberturaInstrumenter.addInstrumentationToSingleClass(CoberturaInstrumenter.java:234)
    at net.sourceforge.cobertura.instrument.Main.addInstrumentationToSingleClass(Main.java:298)
    at net.sourceforge.cobertura.instrument.Main.addInstrumentation(Main.java:307)
    at net.sourceforge.cobertura.instrument.Main.parseArguments(Main.java:399)
    at net.sourceforge.cobertura.instrument.Main.main(Main.java:421)
[cobertura] WARN  [main] net.sourceforge.cobertura.instrument.CoberturaInstrumenter - Unable to instrument file /home/jenkins/jenkins-slave/workspace/cloudstack-pr-analysis/server/target/generated-classes/cobertura/com/cloud/ha/HighAvailabilityManagerImpl$WorkerThread.class
java.lang.NullPointerException
    at net.sourceforge.cobertura.instrument.tp.ClassMap.applyOnProjectData(ClassMap.java:364)
    at net.sourceforge.cobertura.instrument.CoberturaInstrumenter.instrumentClass(CoberturaInstrumenter.java:187)
    at net.sourceforge.cobertura.instrument.CoberturaInstrumenter.instrumentClass(CoberturaInstrumenter.java:121)
    at net.sourceforge.cobertura.instrument.CoberturaInstrumenter.addInstrumentationToSingleClass(CoberturaInstrumenter.java:234)
    at net.sourceforge.cobertura.instrument.Main.addInstrumentationToSingleClass(Main.java:298)
    at net.sourceforge.cobertura.instrument.Main.addInstrumentation(Main.java:307)
    at net.sourceforge.cobertura.instrument.Main.parseArguments(Main.java:399)
    at net.sourceforge.cobertura.instrument.Main.main(Main.java:421)
[cobertura] WARN  [main] net.sourceforge.cobertura.instrument.CoberturaInstrumenter - Unable to instrument file /home/jenkins/jenkins-slave/workspace/cloudstack-pr-analysis/server/target/generated-classes/cobertura/com/cloud/ha/RecreatableFencer.class
java.lang.NullPointerException
    at net.sourceforge.cobertura.instrument.tp.ClassMap.applyOnProjectData(ClassMap.java:364)
    at net.sourceforge.cobertura.instrument.CoberturaInstrumenter.instrumentClass(CoberturaInstrumenter.java:187)
    at net.sourceforge.cobertura.instrument.CoberturaInstrumenter.instrumentClass(CoberturaInstrumenter.java:121)
    at net.sourceforge.cobertura.instrument.CoberturaInstrumenter.addInstrumentationToSingleClass(CoberturaInstrumenter.java:234)
    at net.sourceforge.cobertura.instrument.Main.addInstrumentationToSingleClass(Main.java:298)
    at net.sourceforge.cobertura.instrument.Main.addInstrumentation(Main.java:307)
    at net.sourceforge.cobertura.instrument.Main.parseArguments(Main.java:399)
    at net.sourceforge.cobertura.instrument.Main.main(Main.java:421)
[cobertura] WARN  [main] net.sourceforge.cobertura.instrument.CoberturaInstrumenter - Unable to instrument file /home/jenkins/jenkins-slave/workspace/cloudstack-pr-analysis/server/target/generated-classes/cobertura/com/cloud/ha/dao/HighAvailabilityDao.class
java.lang.NullPointerException
    at net.sourceforge.cobertura.instrument.tp.ClassMap.applyOnProjectData(ClassMap.java:364)
    at net.sourceforge.cobertura.instrument.CoberturaInstrumenter.instrumentClass(CoberturaInstrumenter.java:187)
    at net.sourceforge.cobertura.instrument.CoberturaInstrumenter.instrumentClass(CoberturaInstrumenter.java:121)
    at net.sourceforge.cobertura.instrument.CoberturaInstrumenter.addInstrumentationToSingleClass(CoberturaInstrumenter.java:234)
    at net.sourceforge.cobertura.instrument.Main.addInstrumentationToSingleClass(Main.java:298)
    at net.sourceforge.cobertura.instrument.Main.addInstrumentation(Main.java:307)
    at net.sourceforge.cobertura.instrument.Main.parseArguments(Main.java:399)
    at net.sourceforge.cobertura.instrument.Main.main(Main.java:421)
[cobertura] WARN  [main] net.sourceforge.cobertura.instrument.CoberturaInstrumenter - Unable to instrument file /home/jenkins/jenkins-slave/workspace/cloudstack-pr-analysis/server/target/generated-classes/cobertura/com/cloud/ha/dao/HighAvailabilityDaoImpl.class
java.lang.NullPointerException
    at net.sourceforge.cobertura.instrument.tp.ClassMap.applyOnProjectData(ClassMap.java:364)
    at net.sourceforge.cobertura.instrument.CoberturaInstrumenter.instrumentClass(CoberturaInstrumenter.java:187)
    at net.sourceforge.cobertura.instrument.CoberturaInstrumenter.instrumentClass(CoberturaInstrumenter.java:121)
    at net.sourceforge.cobertura.instrument.CoberturaInstrumenter.addInstrumentationToSingleClass(CoberturaInstrumenter.java:234)
    at net.sourceforge.cobertura.instrument.Main.addInstrumentationToSingleClass(Main.java:298)
    at net.sourceforge.cobertura.instrument.Main.addInstrumentation(Main.java:307)
    at net.sourceforge.cobertura.instrument.Main.parseArguments(Main.java:399)
    at net.sourceforge.cobertura.instrument.Main.main(Main.java:421)

I am pretty sure we are having workspace clean problems on Jenkins, so I wonder if this is related to something left over from the build of a different branch. Hmmm, I need to get better with Jenkins. Does anyone have expertise here who could help us understand what is happening?

@swill
Contributor
swill commented May 16, 2016

@insom we are down to the wire here. I would like to see if we can get this green and merge it before the freeze. Can you rebase and push again to kick off the automation? I will likely make judgement calls and merge PRs that have a passing Travis build even if Jenkins is failing. We will see. I would rather not be in that situation...

Aaron Brady Skip VXLANs when rewriting the bridge name for migrations
e9bf751
@insom
Contributor
insom commented May 17, 2016

@swill rebased and pushed. It does have two LGTMs and tests by real people (plus it's in production here at iWeb), if that makes it a little easier to overrule Jenkins.

@swill
Contributor
swill commented May 17, 2016

We are green now. 👍 I am going to rerun CI to make sure the failure we got does not pop up again. Otherwise we are pretty good. Thanks...

@swill
Contributor
swill commented May 17, 2016

CI RESULTS

Tests Run: 44
  Skipped: 0
   Failed: 3
   Errors: 0
 Duration: 3h 12m 14s

Summary of the problem(s):

FAIL: test_02_vpc_privategw_static_routes (integration.smoke.test_privategw_acl.TestPrivateGwACL)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/data/git/cs1/cloudstack/test/integration/smoke/test_privategw_acl.py", line 262, in test_02_vpc_privategw_static_routes
    self.performVPCTests(vpc_off)
  File "/data/git/cs1/cloudstack/test/integration/smoke/test_privategw_acl.py", line 325, in performVPCTests
    privateGw_1 = self.createPvtGw(vpc_1, "10.0.3.100", "10.0.3.101", acl1.id, vlan_1)
  File "/data/git/cs1/cloudstack/test/integration/smoke/test_privategw_acl.py", line 595, in createPvtGw
    self.fail("Failed to create Private Gateway ==> %s" % e)
AssertionError: Failed to create Private Gateway ==> Execute cmd: createprivategateway failed, due to: errorCode: 431, errorText:Network with vlan vlan://100 already exists in zone 1
----------------------------------------------------------------------
Additional details in: /tmp/MarvinLogs/test_network_Y9TPF9/results.txt
FAIL: test_03_vpc_privategw_restart_vpc_cleanup (integration.smoke.test_privategw_acl.TestPrivateGwACL)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/data/git/cs1/cloudstack/test/integration/smoke/test_privategw_acl.py", line 274, in test_03_vpc_privategw_restart_vpc_cleanup
    self.performVPCTests(vpc_off, restart_with_cleanup = True)
  File "/data/git/cs1/cloudstack/test/integration/smoke/test_privategw_acl.py", line 325, in performVPCTests
    privateGw_1 = self.createPvtGw(vpc_1, "10.0.3.100", "10.0.3.101", acl1.id, vlan_1)
  File "/data/git/cs1/cloudstack/test/integration/smoke/test_privategw_acl.py", line 595, in createPvtGw
    self.fail("Failed to create Private Gateway ==> %s" % e)
AssertionError: Failed to create Private Gateway ==> Execute cmd: createprivategateway failed, due to: errorCode: 431, errorText:Network with vlan vlan://100 already exists in zone 1
----------------------------------------------------------------------
Additional details in: /tmp/MarvinLogs/test_network_Y9TPF9/results.txt
FAIL: test_04_rvpc_privategw_static_routes (integration.smoke.test_privategw_acl.TestPrivateGwACL)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/data/git/cs1/cloudstack/test/integration/smoke/test_privategw_acl.py", line 286, in test_04_rvpc_privategw_static_routes
    self.performVPCTests(vpc_off)
  File "/data/git/cs1/cloudstack/test/integration/smoke/test_privategw_acl.py", line 345, in performVPCTests
    self.check_pvt_gw_connectivity(vm1, public_ip_1, [vm2.nic[0].ipaddress, vm1.nic[0].ipaddress])
  File "/data/git/cs1/cloudstack/test/integration/smoke/test_privategw_acl.py", line 704, in check_pvt_gw_connectivity
    "Ping to VM on Network Tier N from VM in Network Tier A should be successful at least for 2 out of 3 VMs"
AssertionError: Ping to VM on Network Tier N from VM in Network Tier A should be successful at least for 2 out of 3 VMs
----------------------------------------------------------------------
Additional details in: /tmp/MarvinLogs/test_network_Y9TPF9/results.txt

Associated Uploads

/tmp/MarvinLogs/DeployDataCenter__May_17_2016_17_19_03_TAJ94R:

/tmp/MarvinLogs/test_network_Y9TPF9:

Uploads will be available until 2016-07-17 02:00:00 +0200 CEST

Comment created by upr comment.

@swill
Contributor
swill commented May 17, 2016

Not sure what to think of that. I have never seen those errors before in my environment...

@kiwiflyer
Contributor

@swill Are you sure you haven't got some artifacts left behind from other testing? This change is only to a libvirt hook, so I can't imagine how the test failures above could be related to this PR.

@swill
Contributor
swill commented May 17, 2016 edited

I will rerun tests now...
Edit: Well, I will run it later tonight when I have a CI env open again.

@swill
Contributor
swill commented May 18, 2016

CI RESULTS

Tests Run: 44
  Skipped: 0
   Failed: 0
   Errors: 0
 Duration: 3h 20m 40s

Associated Uploads

/tmp/MarvinLogs/DeployDataCenter__May_18_2016_08_05_00_O6IVBL:

/tmp/MarvinLogs/test_network_CK2TZV:

Uploads will be available until 2016-07-18 02:00:00 +0200 CEST

Comment created by upr comment.

@swill
Contributor
swill commented May 18, 2016

The tests that were failing are not failing anymore; I think this is ready...

@asfgit asfgit merged commit e9bf751 into apache:4.8 May 18, 2016

2 checks passed

Jenkins This pull request looks good
continuous-integration/travis-ci/pr The Travis CI build passed