ipam: Fix race in NodeManager.Resync #26963
Conversation
@tommyp1ckles @christarazi Please take a look. This flake clearly happens more often lately, but I'm not quite sure why recent changes would introduce this.
/test
@jaffcheng Interesting, I've been looking into this as well. I'm pretty sure it's related to racy code appending to the AzureInterface Addresses array in the mock API implementation. I have a local reproduction that can cause the failure to occur fairly quickly; let me validate whether this fixes it.
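[Editor's note: a minimal sketch of the kind of racy append being described, assuming a mock API that appends to a shared addresses slice from multiple goroutines; the type and method names are placeholders, not the actual cilium Azure mock.]

package mock

import "sync"

type mockAPI struct {
    mu        sync.Mutex
    addresses []string
}

// Racy: concurrent appends mutate the slice header and backing array
// without synchronization, which `go test -race` flags.
func (m *mockAPI) appendAddressRacy(addr string) {
    m.addresses = append(m.addresses, addr)
}

// Fixed: serialize writers (and any readers of the slice) behind the mutex.
func (m *mockAPI) appendAddress(addr string) {
    m.mu.Lock()
    defer m.mu.Unlock()
    m.addresses = append(m.addresses, addr)
}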
Even with these changes, I still run into the failure.
Thanks for trying. Could you share your reproduction method and test output? My local test runs the pkg/ipam unit tests many times, with and without -race and lockdebug, but I haven't tried pkg/azure yet.
@jaffcheng Sure, my branch is quite messy; I'll push that up later today. The main part is here in the Upsert function in node_manager.go. Basically, just removing the exponential backoff between triggers and running the poolMaintainer trigger in a loop causes this to occur, usually within a couple of seconds. I'm running the test with:
poolMaintainer, err := trigger.NewTrigger(trigger.Parameters{
Name: fmt.Sprintf("ipam-pool-maintainer-%s", resource.Name),
MinInterval: 10 * time.Millisecond,
MetricsObserver: n.metricsAPI.PoolMaintainerTrigger(),
TriggerFunc: func(reasons []string) {
if err := node.MaintainIPPool(ctx); err != nil {
node.logger().WithError(err).Warning("Unable to maintain ip pool of node")
//backoff.Wait(ctx) Removed exponential backoff wait to increase the chance of the race occurring.
}
},
ShutdownFunc: cancel,
})
// Trigger poolMaintainer continuously!
go func() {
for {
<-time.After(time.Millisecond)
poolMaintainer.Trigger()
}
}()
-race reveals a couple of data races in this; I was hoping fixing those would solve the problem, yet it still persists 😕
Although I think this patch should significantly lower the likelihood of flakes, -race still finds some races in pkg/ipam. More digging needed...
There appear to be many races there 😄 I have a few race fixes as well that I'll put up PRs for. Considering how hard I'm finding it to narrow this down to a single cause, it's possible the issue is caused by multiple races. We can start fixing these and see if the flakes improve.
force-pushed from 24c0779 to d0fc72b
/test
When I was looking at this, another place I thought it might be racing was here: https://github.com/cilium/cilium/blob/main/pkg/ipam/node.go#L430 i.e. the interfaces are checked in parallel and then on the next line they will race to get the Node lock, so in theory a stale set of values could overwrite the latest ones. But... fixing this didn't fix the flake in practice.
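[Editor's note: a hedged sketch of that check-then-write pattern, not the actual pkg/ipam/node.go code. A revision check is one possible way to keep a stale result from overwriting a fresher one, though as noted above this did not cure the flake in practice.]

package node

import "sync"

type nodeState struct {
    mu       sync.Mutex
    revision int // bumped on every accepted write
    addrs    []string
}

// snapshot copies the state and its revision so work can happen off-lock.
func (n *nodeState) snapshot() (int, []string) {
    n.mu.Lock()
    defer n.mu.Unlock()
    return n.revision, append([]string(nil), n.addrs...)
}

// commit rejects results computed against a stale revision, so a slow
// goroutine cannot overwrite values written by a fresher one.
func (n *nodeState) commit(seenRev int, addrs []string) bool {
    n.mu.Lock()
    defer n.mu.Unlock()
    if seenRev != n.revision {
        return false // stale; re-snapshot and retry
    }
    n.revision++
    n.addrs = addrs
    return true
}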
I think these changes make sense. Lock changes in this code scare me a bit, so I'll do another review shortly to double-check that we aren't creating a deadlock somewhere.
force-pushed from d0fc72b to dfbdc33
/test |
@jaffcheng The integration tests are failing, I believe due to CI using a newer version of Go. Can you rebase your branch? That should fix it.
TestNodeManagerManyNodes had been flaky before it was ported from aws/eni, and was eventually disabled in the aws tests: cilium#11560

One source of this flake is the races on the values of metricsapi, e.g. `metricsapi.Nodes("total")` and `metricsapi.AllocatedIPs("available")`, which are not protected from concurrent writes in NodeManager.Resync. Allowing multiple goroutines to execute Resync simultaneously doesn't really make sense, since `Node.resyncNode` is already executed in parallel and controlled by a semaphore. This patch serializes NodeManager.Resync to avoid data races on metricsAPI.

Some excerpts of failed tests:

--- FAIL: Test (17.29s)
    --- FAIL: Test/IPAMSuite (17.29s)
        --- FAIL: Test/IPAMSuite/TestIPAMMetadata (0.01s)
            testing.go:1446: race detected during execution of test
        --- FAIL: Test/IPAMSuite/TestNodeManagerManyNodes (3.88s)
            node_manager_test.go:610:
                ... obtained int = 850
                ... expected int = 1000

--- FAIL: Test (17.74s)
    --- FAIL: Test/IPAMSuite (17.74s)
        --- FAIL: Test/IPAMSuite/TestNodeManagerManyNodes (4.36s)
            node_manager_test.go:606:
                ... obtained int = 87
                ... expected int = 100

Related: cilium#26617

Signed-off-by: Jaff Cheng <jaff.cheng.sh@gmail.com>
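[Editor's note: a minimal sketch of what serializing Resync can look like, assuming a mutex is added to NodeManager; the field name and surrounding wiring are illustrative, and the PR diff is authoritative.]

package ipam

import (
    "context"
    "sync"
    "time"
)

type NodeManager struct {
    resyncLock sync.Mutex
    // ... other fields elided ...
}

func (n *NodeManager) Resync(ctx context.Context, syncTime time.Time) {
    // Serialize manager-level resyncs: per-node work in resyncNode is
    // already parallelized behind a semaphore, so concurrent Resync
    // callers only add unsynchronized writes to the shared metrics API.
    n.resyncLock.Lock()
    defer n.resyncLock.Unlock()
    // ... existing resync logic runs here unchanged ...
}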
Create a separate metricsapi for every test case to prevent the cases from interfering with each other.

Related: cilium#26617

Signed-off-by: Jaff Cheng <jaff.cheng.sh@gmail.com>
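[Editor's note: a sketch of the test-isolation pattern this commit describes, using hypothetical stand-in names; the real mock lives in pkg/ipam's metrics mock package.]

package ipam_test

import "testing"

// mockMetrics stands in for the ipam metrics mock; the name and field are
// illustrative, not the actual cilium type.
type mockMetrics struct{ allocatedIPs int }

func newMockMetrics() *mockMetrics { return &mockMetrics{} }

// Each case builds its own mock instead of sharing a package-level one,
// so goroutines left running by an earlier case cannot race on counters
// that a later case is asserting on.
func TestNodeManagerIsolated(t *testing.T) {
    metricsapi := newMockMetrics() // fresh per test case
    metricsapi.allocatedIPs++
    if metricsapi.allocatedIPs != 1 {
        t.Fatalf("expected 1 allocated IP, got %d", metricsapi.allocatedIPs)
    }
}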
force-pushed from dfbdc33 to e469fa3
/test
Edit: ConformanceGatewayAPI hit #27647 (error log)
What's the consequence of this bugfix? Should we backport it to v1.14 per the backport criteria? Does the race condition affect other branches too? |
Reviews are in, CI is passing. LGTM to merge. |
@joestringer The changes that cause this don't appear to be in the v1.14 branch.
Please see the commit message.
Related: #26617