Fix for race when automatically assigning IP to Vms #9240

Merged
merged 2 commits into apache:4.19 on Jun 28, 2024

Conversation

abh1sar
Collaborator

@abh1sar abh1sar commented Jun 13, 2024

Description

Fixes: #7907

This PR fixes an issue where two VMs can be assigned the same IP if they are created at the same time.
NetworkOrchestrator.allocateNic() calls guru.allocate(), which returns a free IP address.
NetworkOrchestrator.allocateNic() then calls _nicDao.persist().

However, guru.allocate() can return the same IP to two VMs if the first VM has not yet persisted its NIC to the DB.
Doing the whole allocation in a single transaction might be costly.

So the fix is to check, just before persisting the NicVO in a transaction, whether the IP returned by guru.allocate() is already assigned.
The check is done only for cases where IPv4 allocation might race.
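To make the idea concrete, here is a minimal sketch; all type and method names below (NicStore, NicRecord, NicAllocator, IpConflictException) are hypothetical stand-ins for illustration, not the actual classes touched by this PR.

```java
// Hedged sketch of the re-check-before-persist idea described above.
// These are hypothetical stand-ins, not CloudStack classes; the real change
// lives in NetworkOrchestrator/GuestNetworkGuru.
import java.util.Optional;

interface NicStore {
    /** Returns the NIC already holding this IPv4 address in the network, if any. */
    Optional<NicRecord> findByIpInNetwork(long networkId, String ipv4);
    NicRecord persist(NicRecord nic);
}

record NicRecord(long networkId, long vmId, String ipv4) {}

class IpConflictException extends RuntimeException {
    IpConflictException(String msg) { super(msg); }
}

class NicAllocator {
    private final NicStore nicStore;

    NicAllocator(NicStore nicStore) { this.nicStore = nicStore; }

    /**
     * Persist the NIC only if the IP chosen by the guru is still unused.
     * In the real fix this check and the persist happen inside one DB
     * transaction, so a concurrent allocation cannot slip in between.
     */
    NicRecord persistIfIpStillFree(NicRecord candidate) {
        // Re-check just before persisting: another VM may have taken the
        // IP after guru.allocate() returned it to us.
        if (nicStore.findByIpInNetwork(candidate.networkId(), candidate.ipv4()).isPresent()) {
            throw new IpConflictException("IP " + candidate.ipv4()
                    + " was assigned concurrently; caller should retry allocation");
        }
        return nicStore.persist(candidate);
    }
}
```

In the actual change the lookup and the persist run inside a single database transaction, so a concurrent allocation cannot commit the same IP between the check and the insert.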

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)
  • build/CI

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

Screenshots (if appropriate):

How Has This Been Tested?

It wasn't possible to reproduce the actual race, so I tested by manually setting values with a debugger and verifying that the code does what it is supposed to do.
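Since the race is hard to hit by hand, here is a small, purely illustrative demo (plain Java, no CloudStack types) of the property the fix relies on: the availability check and the persist must be atomic, otherwise two concurrent allocations can both see the IP as free.

```java
// Illustrative only: a tiny concurrency demo (not CloudStack code) showing why
// the "is this IP already taken?" check and the persist must happen atomically.
// ConcurrentHashMap.putIfAbsent plays the role of the transactional check.
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;

public class IpRaceDemo {
    // key = "networkId/ip", value = id of the VM that owns it
    static final ConcurrentHashMap<String, Long> persistedNics = new ConcurrentHashMap<>();

    /** Atomically claims the IP; returns true only for the first caller. */
    static boolean persistIfFree(long networkId, String ip, long vmId) {
        return persistedNics.putIfAbsent(networkId + "/" + ip, vmId) == null;
    }

    static Runnable deploy(long vmId, CountDownLatch start) {
        return () -> {
            try {
                start.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            // Both "VMs" were handed the same candidate IP by the allocator.
            boolean ok = persistIfFree(200L, "10.1.1.5", vmId);
            System.out.println("vm " + vmId + (ok ? " got the IP" : " must retry allocation"));
        };
    }

    public static void main(String[] args) throws InterruptedException {
        CountDownLatch start = new CountDownLatch(1);
        Thread a = new Thread(deploy(1L, start));
        Thread b = new Thread(deploy(2L, start));
        a.start();
        b.start();
        start.countDown();   // release both threads at once to maximise overlap
        a.join();
        b.join();
        // Exactly one thread claims the IP; the other is told to retry,
        // which mirrors the behaviour the fix aims for.
    }
}
```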

How did you try to break this feature and the system with this change?


codecov bot commented Jun 13, 2024

Codecov Report

Attention: Patch coverage is 0% with 48 lines in your changes missing coverage. Please review.

Project coverage is 14.98%. Comparing base (b2ef53b) to head (1fb61ae).
Report is 70 commits behind head on 4.19.

Files Patch % Lines
...tack/engine/orchestration/NetworkOrchestrator.java 0.00% 41 Missing ⚠️
api/src/main/java/com/cloud/vm/NicProfile.java 0.00% 6 Missing ⚠️
.../java/com/cloud/network/guru/GuestNetworkGuru.java 0.00% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               4.19    #9240      +/-   ##
============================================
+ Coverage     14.96%   14.98%   +0.01%     
- Complexity    11013    11048      +35     
============================================
  Files          5377     5389      +12     
  Lines        469567   470615    +1048     
  Branches      60162    57503    -2659     
============================================
+ Hits          70285    70517     +232     
- Misses       391498   392263     +765     
- Partials       7784     7835      +51     
Flag Coverage Δ
uitests 4.28% <ø> (-0.02%) ⬇️
unittests 15.69% <0.00%> (+0.01%) ⬆️


@DaanHoogland
Contributor

Check will be done only for cases where Ipv4 allocation might race.

Right now, this means when allocating for the GuestNetworkGuru. Can you look at @hsato03's implementation (together with him) to see if it can be unified?

@rohityadavcloud
Member

@abh1sar can you review and address the outstanding comments? Also, please take care to run packaging and smoke tests for your own PRs once they are ready for review.
@blueorangutan package

@blueorangutan

@rohityadavcloud a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✖️ el7 ✔️ el8 ✔️ el9 ✖️ debian ✔️ suse15. SL-JID 10097

@sureshanaparti
Contributor

@blueorangutan package

@blueorangutan

@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 10101

@abh1sar
Collaborator Author

abh1sar commented Jun 26, 2024

@blueorangutan package

@blueorangutan

@abh1sar a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 10135

@abh1sar
Collaborator Author

abh1sar commented Jun 26, 2024

@blueorangutan test

@blueorangutan

@abh1sar a [SL] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan

[SF] Trillian test result (tid-10631)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 41624 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr9240-t10631-kvm-centos7.zip
Smoke tests completed. 131 look OK, 0 have errors, 0 did not run
Only failed and skipped test results are shown below:

Test Result Time (s) Test File

@weizhouapache
Member

I tried to reproduce the issue on an env without this PR:

  • Created a shared network with 5 IPs (1 for VR, 3 for test VMs) and deployed 6 VMs in parallel: 3 worked and 3 failed.

  • Created an isolated network with netmask 255.255.255.248 (6 IPs in total, 1 for VR, 2 for existing VMs, 2 for test VMs) and deployed 6 VMs in parallel: 2 VMs failed, 4 VMs succeeded (3 VMs ended up with the same IP).

With this PR (on another env):

  • Created an isolated network with netmask 255.255.255.248 (6 IPs in total, 1 for VR, 3 for existing VMs, 2 for test VMs) and deployed 6 VMs in parallel: 2 succeeded and 4 failed. Looks nice.

A rough sketch of this kind of parallel-deploy reproduction is shown below.
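Purely as an illustration of that reproduction approach, here is a small sketch. CloudStackClient and deployVm are hypothetical stand-ins for whatever API or CLI calls were actually used; firing the requests from a thread pool is just one way to make the deployments overlap.

```java
// Rough sketch of a parallel-deploy reproduction, assuming a hypothetical
// CloudStackClient with a deployVm(networkId) call. Concurrent requests
// maximise the chance that two allocations read the same "free" IP.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

interface CloudStackClient {
    /** Deploys one VM on the given network and returns the IP it received. */
    String deployVm(long networkId) throws Exception;
}

class ParallelDeployRepro {
    static void run(CloudStackClient client, long networkId, int vmCount) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(vmCount);
        List<Future<String>> results = new ArrayList<>();
        for (int i = 0; i < vmCount; i++) {
            results.add(pool.submit(() -> client.deployVm(networkId)));
        }
        for (Future<String> f : results) {
            try {
                System.out.println("deployed with IP " + f.get(15, TimeUnit.MINUTES));
            } catch (ExecutionException e) {
                System.out.println("deployment failed: " + e.getCause().getMessage());
            }
        }
        pool.shutdown();
        // Without the fix, two successful deployments may print the same IP;
        // with the fix, the surplus deployments fail instead of duplicating IPs.
    }
}
```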

Member

@weizhouapache weizhouapache left a comment


lgtm

cc @sureshanaparti

Member

@vishesh92 vishesh92 left a comment


clgtm

@sureshanaparti sureshanaparti merged commit 646c894 into apache:4.19 Jun 28, 2024
24 of 26 checks passed
dhslove pushed a commit to ablecloud-team/ablestack-cloud that referenced this pull request Jul 2, 2024
* Fix for race when automatically assigning IP to Vms

* code refactor
weizhouapache added a commit to weizhouapache/cloudstack that referenced this pull request Jul 23, 2024
This code snippet has been removed in the merge forward of PR apache#9240
in commit 90fe1d5
nvazquez pushed a commit that referenced this pull request Aug 14, 2024
This code snippet has been removed in the merge forward of PR #9240
in commit 90fe1d5
dhslove pushed a commit to ablecloud-team/ablestack-cloud that referenced this pull request Aug 22, 2024
This code snippet has been removed in the merge forward of PR apache#9240
in commit 90fe1d5
Development

Successfully merging this pull request may close these issues.

Race condition when automatically assigning IPs to VMs
7 participants