Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Have persistent DHCP leases file on VRs and cleanup /etc/hosts on VM deletion #3351

Merged
merged 10 commits into from Jun 3, 2019

Conversation

nvazquez
Copy link
Contributor

@nvazquez nvazquez commented May 24, 2019

Description

Since the CloudStack virtual router was redesigned on version 4.6 it has been observed that the DHCP leases file is not persistent across network operations. This causes conflicts on guest VMs static IPs, causing these static IPs to not be renewed by the DHCP server running on isolated and VPC networks' virtual routers (dnsmasq). On stopping or destroying a VM, its dhcp/dns records are not removed from the virtual router causing ghost effects.

Fixes #3272
Fixes #3354

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)

Screenshots (if appropriate):

How Has This Been Tested?

Tested on local environment, 1x KVM host, Isolated network

  • Create VMs on isolated network
  • Restart network

@rohityadavcloud
Copy link
Member

@blueorangutan package

@blueorangutan
Copy link

@rhtyd a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan
Copy link

Packaging result: ✖centos6 ✔centos7 ✔debian. JID-2782

@DaanHoogland
Copy link
Contributor

@blueorangutan package

@blueorangutan
Copy link

@DaanHoogland a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@nvazquez nvazquez changed the title Fix not persistent DHCP leases file on VRs [WIP: DO NOT MERGE] Fix not persistent DHCP leases file on VRs May 24, 2019
@blueorangutan
Copy link

Packaging result: ✔centos6 ✔centos7 ✔debian. JID-2787

@DaanHoogland
Copy link
Contributor

@blueorangutan test

@blueorangutan
Copy link

@DaanHoogland a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan
Copy link

Trillian test result (tid-3584)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 30405 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr3351-t3584-kvm-centos7.zip
Smoke tests completed. 68 look OK, 0 have error(s)
Only failed tests results shown below:

Test Result Time (s) Test File

@rohityadavcloud
Copy link
Member

@blueorangutan test matrix

@blueorangutan
Copy link

@rhtyd a Trillian-Jenkins matrix job (centos6 mgmt + xs71, centos7 mgmt + vmware65, centos7 mgmt + kvmcentos7) has been kicked to run smoke tests

@rohityadavcloud rohityadavcloud added this to the 4.11.3.0 milestone May 27, 2019
@blueorangutan
Copy link

Trillian test result (tid-3587)
Environment: xenserver-71 (x2), Advanced Networking with Mgmt server 7
Total time taken: 34342 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr3351-t3587-xenserver-71.zip
Intermittent failure detected: /marvin/tests/smoke/test_scale_vm.py
Intermittent failure detected: /marvin/tests/smoke/test_volumes.py
Intermittent failure detected: /marvin/tests/smoke/test_vpc_redundant.py
Intermittent failure detected: /marvin/tests/smoke/test_vpc_router_nics.py
Smoke tests completed. 66 look OK, 2 have error(s)
Only failed tests results shown below:

Test Result Time (s) Test File
test_01_scale_vm Error 17.43 test_scale_vm.py
test_01_create_redundant_VPC_2tiers_4VMs_4IPs_4PF_ACL Failure 562.10 test_vpc_redundant.py
test_05_rvpc_multi_tiers Failure 528.05 test_vpc_redundant.py
test_05_rvpc_multi_tiers Error 567.24 test_vpc_redundant.py

@blueorangutan
Copy link

Trillian test result (tid-3589)
Environment: vmware-65u2 (x2), Advanced Networking with Mgmt server 7
Total time taken: 40124 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr3351-t3589-vmware-65u2.zip
Intermittent failure detected: /marvin/tests/smoke/test_public_ip_range.py
Intermittent failure detected: /marvin/tests/smoke/test_templates.py
Intermittent failure detected: /marvin/tests/smoke/test_usage.py
Intermittent failure detected: /marvin/tests/smoke/test_volumes.py
Smoke tests completed. 65 look OK, 3 have error(s)
Only failed tests results shown below:

Test Result Time (s) Test File
test_04_extract_template Failure 136.54 test_templates.py
ContextSuite context=TestISOUsage>:setup Error 0.00 test_usage.py
test_06_download_detached_volume Failure 167.45 test_volumes.py

@onitake
Copy link
Contributor

onitake commented May 28, 2019

Is the fix implemented by this PR ready to be tested?
The WIP tag is slightly confusing...
I'd like to test it on one of our affected networks, if possible.

@nvazquez
Copy link
Contributor Author

Hi @onitake the WIP tag stands for 'Work in progress' which means this is still not ready. Having said that, you can test it now but you may expect new changes

mac = lease[1]
ip = lease[2]
if mac not in macs_dhcphosts:
cmd = "dhcp_release %s %s" % (ip, mac)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nvazquez the dhcp_release <interface> parameter is missing here

See https://manpages.debian.org/stretch/dnsmasq-utils/dhcp_release.1.en.html

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks I'll fix this

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rohityadavcloud
Copy link
Member

@nvazquez can you address review comments and also advise if this PR is ready for review and testing?

@rohityadavcloud
Copy link
Member

@nvazquez while you're on it, can you also look at another similar issue related to VRs - #3273?

@rohityadavcloud
Copy link
Member

@nvazquez can you also fix pylint error:

************* Module cleanuphosts
E: 75,78: Instance of 'Exception' has no 'strerror' member (no-member)
-----------------------------------
Your code has been rated at 9.99/10
+'[' 2 -gt 0 ']'
+echo 'Pylint failed, please check your code'
Pylint failed, please check your code

@rohityadavcloud rohityadavcloud changed the title [WIP: DO NOT MERGE] Fix not persistent DHCP leases file on VRs [WIP: DO NOT MERGE] Have persistent DHCP leases file on VRs and cleanup /etc/hosts on VM deletion May 30, 2019
Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>
Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>
Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>
@rohityadavcloud
Copy link
Member

rohityadavcloud commented May 31, 2019

@nvazquez fixes for #3272 and #3354 LGTM, but I found a weird issue:

  • Deploy two small VMs
  • Destroy one of the small VMs, verify that their entries are not seen in /etc/hosts and /etc/dhcphosts.txt
  • Deploy another new small VM
  • Verify that the new small VM's entry exists, but not the delete VM's entry suddenly show up as well (possibly something to do with CsDhcp.py and databags, we probably need to either fix in /etc/cloudstack/dhcpentry.json in the VR via the cleanup script, or see DhcpEntryCommand, command setup helper:: createDhcpEntryCommand)

Please continue with the investigation and fix for the above edge-case and also fix for #3273

@nvazquez
Copy link
Contributor Author

nvazquez commented May 31, 2019

@rhtyd changed the implementation, no need for a new script. Marking entries as 'remove=True' on dhcpentry.json solves the issue. This solves #3272 and #3354.
Fix for #3273 is the only remaining

@rohityadavcloud
Copy link
Member

@blueorangutan package

@blueorangutan
Copy link

@rhtyd a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan
Copy link

Packaging result: ✔centos6 ✔centos7 ✔debian. JID-2826

@rohityadavcloud
Copy link
Member

@blueorangutan test

@blueorangutan
Copy link

@rhtyd a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan
Copy link

Trillian test result (tid-3631)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 32583 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr3351-t3631-kvm-centos7.zip
Intermittent failure detected: /marvin/tests/smoke/test_vpc_redundant.py
Smoke tests completed. 67 look OK, 1 have error(s)
Only failed tests results shown below:

Test Result Time (s) Test File
test_05_rvpc_multi_tiers Failure 407.69 test_vpc_redundant.py
test_05_rvpc_multi_tiers Error 433.89 test_vpc_redundant.py

Copy link
Member

@rohityadavcloud rohityadavcloud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested Nicolas's latest changes, LGTM for both the bugs. The third one is not started yet.

@rohityadavcloud rohityadavcloud changed the title [WIP: DO NOT MERGE] Have persistent DHCP leases file on VRs and cleanup /etc/hosts on VM deletion Have persistent DHCP leases file on VRs and cleanup /etc/hosts on VM deletion Jun 3, 2019
@rohityadavcloud
Copy link
Member

Let's do #3273 in a separate PR.

Copy link
Contributor

@anuragaw anuragaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - Based on code read and test results

@server24
Copy link

@nvazquez @rhtyid this breaks Basic Networking (#3497) whenever a VM is stopped or expunged.

@rohityadavcloud
Copy link
Member

Can you check and possibly send a PR @nvazquez ? Thanks.

@nvazquez
Copy link
Contributor Author

Sure I'll take a look tonight

@GabrielBrascher
Copy link
Member

@nvazquez has this PR been tested on a shared guest network within a Basic Network Zone? I am facing a null pointer when trying to remove the DHCP entry rule from a VM that is being stopped.

ACS version: 4.11.3

java.lang.NullPointerException
        at org.apache.cloudstack.network.topology.BasicNetworkVisitor.visit(BasicNetworkVisitor.java:201)
        at com.cloud.network.rules.DhcpEntryRules.accept(DhcpEntryRules.java:64)
        at org.apache.cloudstack.network.topology.BasicNetworkTopology.applyRules(BasicNetworkTopology.java:390)
        at org.apache.cloudstack.network.topology.BasicNetworkTopology.removeDhcpEntry(BasicNetworkTopology.java:464)
        at com.cloud.network.element.VirtualRouterElement.removeDhcpEntry(VirtualRouterElement.java:972)
        at org.apache.cloudstack.engine.orchestration.NetworkOrchestrator.cleanupNicDhcpDnsEntry(NetworkOrchestrator.java:2933)

@GabrielBrascher
Copy link
Member

Please ignore the message above, just saw the PR #3501 fixing it.

Thanks for the PR @nvazquez.

@nvazquez nvazquez deleted the dhcp411 branch April 6, 2020 14:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet