Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix azure ipam flake caused by instance resync race condition. #31580

Merged
merged 2 commits into from
Apr 9, 2024

Conversation

tommyp1ckles
Copy link
Contributor

@tommyp1ckles tommyp1ckles commented Mar 25, 2024

TestIpamManyNodes fails occasionally with unexpected available IP counts.
This appears to be an underlying bug in the azure/ipam implementation. This results from two racing resyncInstance calls racing to update the cached instances list.

This moves the azure API implementation lock to encompass the entire operation Resync operation - such that getting instances/subnets assigning them is atomic.

This shouldn't have any risk of deadlocks as azure api doesn't have any references to node or node manager.

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Mar 25, 2024
@tommyp1ckles tommyp1ckles force-pushed the pr/tp/fix-azure-ipam-flake-v2 branch 2 times, most recently from 4008aa2 to 3523bcd Compare March 25, 2024 05:08
@tommyp1ckles
Copy link
Contributor Author

/test

@tommyp1ckles
Copy link
Contributor Author

Looking for a review for @cilium/azure. @hemanthmalla or @doniacld could either of you take a look 🙏

@tommyp1ckles tommyp1ckles added release-note/bug This PR fixes an issue in a previous release of Cilium. release-note/ci This PR makes changes to the CI. labels Mar 26, 2024
@maintainer-s-little-helper maintainer-s-little-helper bot removed dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Mar 26, 2024
Copy link
Contributor

@doniacld doniacld left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm 👍

Copy link
Member

@hemanthmalla hemanthmalla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologize for the delay. Thanks.

TestIpamManyNodes fails occasionally with unexpected available IP counts.
This appears to be an underlying bug in the azure/ipam implementation.
This results from two racing resyncInstance calls racing to update the
cached instances list.

The chain of triggers for Upsert is as follows:

*note*: Node has a reference to nodemanager, and nodemanager maintains
	references to all nodes, thus each node has a circular ref w/
	the node manager.  This complicates locking.

nodemanager.Upsert ->
	1. (NodeManager) Sets up node triggers.
	2. (Node) Does node IP IPAM accounting.
	3. (Node) Triggers poolMaintenance.
poolMaintenance ->
	1. (Node) Attach any necessary private IPs to instance based on current
		accounting (i.e. node stats).
	2. (Node) Performs node device/address accounting and triggers instance
		sync (i.e. in this case: syncing Cloud API instance/device/addr with
		local cache).
instanceSync ->
	1. (NodeManager/InstanceManager) API is called to get
		subnets and instances.
	2. (NodeManager/InstanceManager) Instances are updated in
	   	critical section.

Finally, based on the outcome of the instances sync, following addr
accounting, another maintainIPPool may be triggered causing a series of
confusing allocation/accounting errors.

In this last part, because getting instance data from the API is not
locked, and because the instanceSync operation is an operation on the
NodeManager. Two such syncs can occur at the same time (i.e. from
different vms), in such a case, two instanceSync runs may occur out of
order and the most recent result will be overwritten.

This moves the azure API implementation lock to encompass the entire
operation Resync operation - such that getting instances/subnets and
assigning them is atomic.

This shouldn't have any risk of deadlocks as azure api doesn't have any
references to node or node manager.

Fixes: #31466

Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
The code being test has been subject to several bugs. These tend
to be more likely to reproduce when running on a smaller test case
(presumably due to a lower chance of another trigger correcting
previous errors).

This adds another test case to TestIpamManyNodes that should make
surfacing of regressions in the azure ipam implementation more likely.

Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
@tommyp1ckles tommyp1ckles force-pushed the pr/tp/fix-azure-ipam-flake-v2 branch from 3523bcd to c2b9fa5 Compare April 9, 2024 04:12
@tommyp1ckles
Copy link
Contributor Author

/test

@tommyp1ckles tommyp1ckles added area/azure Impacts Azure based IPAM. area/ipam Impacts IP address management functionality. labels Apr 9, 2024
@tommyp1ckles tommyp1ckles added this pull request to the merge queue Apr 9, 2024
Merged via the queue into main with commit d9924db Apr 9, 2024
231 of 232 checks passed
@tommyp1ckles tommyp1ckles deleted the pr/tp/fix-azure-ipam-flake-v2 branch April 9, 2024 17:41
@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Apr 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/azure Impacts Azure based IPAM. area/ipam Impacts IP address management functionality. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/bug This PR fixes an issue in a previous release of Cilium. release-note/ci This PR makes changes to the CI.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants