server: fix enable/disable static nat if userdata is not supported #5839

Merged

Conversation

weizhouapache
Member

Description

This PR fixes #5824

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

Screenshots (if appropriate):

How Has This Been Tested?

@weizhouapache weizhouapache added this to the 4.16.1.0 milestone Jan 7, 2022
Contributor

@sureshanaparti sureshanaparti left a comment

changes LGTM

@sureshanaparti
Contributor

@blueorangutan package

@blueorangutan

@sureshanaparti a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

Contributor

@DaanHoogland DaanHoogland left a comment

clgtm, do we have a reason for this regression?

Contributor

@shwstppr shwstppr left a comment

code LGTM

@weizhouapache
Member Author

@DaanHoogland

the exception is thrown at
https://github.com/apache/cloudstack/blob/main/engine/schema/src/main/java/com/cloud/network/dao/NetworkServiceMapDaoImpl.java#L125-L127

        if (ntwkSvc == null) {
            throw new UnsupportedServiceException("Service " + service.getName() + " is not supported in the network id=" + networkId);
        }

Removing those lines might also fix the issue, but that would have a large impact.
This PR has a much smaller impact.
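
To illustrate the smaller-impact option, here is a minimal sketch of a caller-side guard. It is not taken from this PR's diff: `ServiceMap`, `isSupported`, `getProvider` and `applyUserDataIfSupported` are hypothetical stand-ins, and only the exception name and message come from the snippet quoted above. The lookup keeps throwing for every other caller; only the userdata path checks for support first.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class UserDataGuardSketch {
    /** Stand-in for the exception named in the quoted DAO snippet. */
    static class UnsupportedServiceException extends RuntimeException {
        UnsupportedServiceException(String message) {
            super(message);
        }
    }

    /** Toy network-to-provider map; hypothetical, not the CloudStack DAO. */
    static class ServiceMap {
        private final Map<Long, Map<String, String>> providers = new HashMap<>();

        void add(long networkId, String service, String provider) {
            providers.computeIfAbsent(networkId, id -> new HashMap<>()).put(service, provider);
        }

        boolean isSupported(long networkId, String service) {
            return providers.getOrDefault(networkId, Collections.emptyMap()).containsKey(service);
        }

        // Like the quoted lines: a missing mapping raises an exception instead of returning null.
        String getProvider(long networkId, String service) {
            String provider = providers.getOrDefault(networkId, Collections.emptyMap()).get(service);
            if (provider == null) {
                throw new UnsupportedServiceException(
                        "Service " + service + " is not supported in the network id=" + networkId);
            }
            return provider;
        }
    }

    /** Caller-side guard: returns true only if userdata was actually applied. */
    static boolean applyUserDataIfSupported(ServiceMap map, long networkId) {
        if (!map.isSupported(networkId, "UserData")) {
            return false; // nothing to update, so the static NAT operation can carry on
        }
        String provider = map.getProvider(networkId, "UserData");
        System.out.println("Applying userdata via " + provider);
        return true;
    }

    public static void main(String[] args) {
        ServiceMap map = new ServiceMap();
        map.add(1L, "UserData", "VirtualRouter");
        System.out.println(applyUserDataIfSupported(map, 1L)); // network with UserData
        System.out.println(applyUserDataIfSupported(map, 2L)); // network without UserData
    }
}
```

The guard leaves the throwing lookup untouched, which is why its blast radius is limited to the one call site.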

@sureshanaparti sureshanaparti linked an issue Jan 7, 2022 that may be closed by this pull request
@DaanHoogland
Contributor

I agree, @weizhouapache. I approve of this change, and intuitively it is the best solution, but I still have some questions:

  • why is applyUserData called at all if it is not supported? and,
  • is this a regression that might have other unforeseen consequences?

It looks like it was introduced by 9c6b02f.

@blueorangutan

Packaging result: ✔️ el7 ✔️ el8 ✔️ debian ✔️ suse15. SL-JID 2128

@weizhouapache
Member Author

@DaanHoogland
When static NAT is enabled or disabled, the public IP of the VM also needs to be updated in the userdata server (= VR).
By default, the VM's public IP is the source NAT IP. If static NAT is enabled, the VM's public IP is the static NAT IP.
That's why we apply the userdata of the VM/NIC.
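
The rule described above can be sketched as a small selection function. This is only an illustration: `publicIpFor` is a hypothetical helper, the convention that `null` means "static NAT disabled" is an assumption, and the addresses are documentation-range placeholders rather than data from this PR.

```java
import java.util.Optional;

public class PublicIpSketch {
    /**
     * Returns the address to publish as the VM's public IP.
     * staticNatIp is null when static NAT is disabled (assumed convention).
     */
    static String publicIpFor(String sourceNatIp, String staticNatIp) {
        // Static NAT IP wins when present; otherwise fall back to the source NAT IP.
        return Optional.ofNullable(staticNatIp).orElse(sourceNatIp);
    }

    public static void main(String[] args) {
        String sourceNat = "203.0.113.1";  // placeholder address
        String staticNat = "203.0.113.25"; // placeholder address
        System.out.println(publicIpFor(sourceNat, null));      // static NAT disabled
        System.out.println(publicIpFor(sourceNat, staticNat)); // static NAT enabled
    }
}
```

Because the result changes whenever static NAT is toggled, the value held by the userdata server has to be refreshed on both enable and disable.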

You might also have noticed that there is already a check further down in the code:

        if (element == null) {
            s_logger.error("Can't find network element for " + Service.UserData.getName() + " provider needed for UserData update");

If the element cannot be found, userdata is not applied. However, the exception is thrown in the DAO, which I think is not a good idea.

I don't know whether the DAO code I pasted above causes other issues.
Enabling and disabling static NAT should both work whether or not userdata is supported.
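
A small sketch of why the existing null check does not help, assuming the provider lookup runs before that check (the thread implies this ordering but does not show it). `findUserDataProvider` and `applyUserData` here are hypothetical simplifications, not the CloudStack methods.

```java
public class UnreachableNullCheckSketch {
    /** Stand-in for the exception named in the quoted DAO snippet. */
    static class UnsupportedServiceException extends RuntimeException {
        UnsupportedServiceException(String message) {
            super(message);
        }
    }

    // Hypothetical lookup that throws rather than returning null for an unsupported service.
    static String findUserDataProvider(boolean userDataSupported) {
        if (!userDataSupported) {
            throw new UnsupportedServiceException("Service UserData is not supported in the network");
        }
        return "VirtualRouter";
    }

    // Unguarded variant: the null check exists, but the lookup throws before it is evaluated.
    static String applyUserData(boolean userDataSupported) {
        String element = findUserDataProvider(userDataSupported);
        if (element == null) {
            return "skipped"; // not reached when the service is unsupported
        }
        return "applied via " + element;
    }

    public static void main(String[] args) {
        System.out.println(applyUserData(true));
        try {
            applyUserData(false);
        } catch (UnsupportedServiceException e) {
            System.out.println("static NAT operation would abort: " + e.getMessage());
        }
    }
}
```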

@DaanHoogland
Contributor

I get it, @weizhouapache. As discussed offline, applyUserData should be renamed to applyUserDataIfNeeded. I thought it should not have been invoked in the first place if it is harmful. Your solution is good, thanks.

@weizhouapache
Member Author

@DaanHoogland thanks.

Renamed the method as per your comment.

@weizhouapache
Member Author

@blueorangutan package

@blueorangutan

@weizhouapache a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan

Packaging result: ✔️ el7 ✔️ el8 ✔️ debian ✔️ suse15. SL-JID 2130

@sureshanaparti
Contributor

@blueorangutan test

@blueorangutan

@sureshanaparti a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

Contributor

@GutoVeronezi GutoVeronezi left a comment

CLGTM

…java

Co-authored-by: Daniel Augusto Veronezi Salvador <38945620+GutoVeronezi@users.noreply.github.com>
@blueorangutan

Trillian test result (tid-2823)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 46814 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr5839-t2823-kvm-centos7.zip
Intermittent failure detected: /marvin/tests/smoke/test_loadbalance.py
Intermittent failure detected: /marvin/tests/smoke/test_volumes.py
Intermittent failure detected: /marvin/tests/smoke/test_deploy_virtio_scsi_vm.py
Intermittent failure detected: /marvin/tests/smoke/test_password_server.py
Intermittent failure detected: /marvin/tests/smoke/test_deploy_vm_iso.py
Intermittent failure detected: /marvin/tests/smoke/test_deploy_vm_with_userdata.py
Intermittent failure detected: /marvin/tests/smoke/test_vpc_vpn.py
Intermittent failure detected: /marvin/tests/smoke/test_deploy_vms_with_varied_deploymentplanners.py
Intermittent failure detected: /marvin/tests/smoke/test_templates.py
Intermittent failure detected: /marvin/tests/smoke/test_diagnostics.py
Intermittent failure detected: /marvin/tests/smoke/test_reset_vm_on_reboot.py
Intermittent failure detected: /marvin/tests/smoke/test_resource_accounting.py

@rohityadavcloud
Member

@blueorangutan package

@blueorangutan

@rohityadavcloud a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan

Packaging result: ✖️ el7 ✔️ el8 ✖️ debian ✖️ suse15. SL-JID 2137

@sureshanaparti
Contributor

@blueorangutan package

1 similar comment
@weizhouapache
Member Author

@blueorangutan package

@blueorangutan

@weizhouapache a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan

Packaging result: ✔️ el7 ✔️ el8 ✔️ debian ✔️ suse15. SL-JID 2159

@sureshanaparti
Contributor

@blueorangutan test

@blueorangutan

@sureshanaparti a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan

Trillian test result (tid-2844)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 30953 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr5839-t2844-kvm-centos7.zip
Smoke tests completed. 92 look OK, 0 have errors
Only failed tests results shown below:

Test Result Time (s) Test File

@sureshanaparti sureshanaparti merged commit 9293f5b into apache:4.16 Jan 11, 2022
@weizhouapache weizhouapache deleted the 4.16-fix-static-nat-no-userdata branch December 9, 2022 08:42
Development

Successfully merging this pull request may close these issues.

"Disable Static NAT" in VPC results in stuck IP Address
7 participants