ipam: Fix race in NodeManager.Resync #26963
Conversation
@tommyp1ckles @christarazi Please take a look. This flake clearly happens more often lately, but I'm not quite sure why recent changes would introduce this.
/test
@jaffcheng Interesting, I've been looking into this as well. I'm pretty sure it's related to racy code appending to the AzureInterface Addresses array in the mock API implementation. I have a local reproduction that can cause the failure to occur fairly quickly; let me validate whether this fixes it.
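[Editor's note: a minimal sketch of the kind of racy append being described, assuming a mock API that appends to a shared addresses slice from multiple goroutines; the type and method names are placeholders, not the actual cilium Azure mock.]

package mock

import "sync"

type mockAPI struct {
    mu        sync.Mutex
    addresses []string
}

// Racy: concurrent appends mutate the slice header and backing array
// without synchronization, which `go test -race` flags.
func (m *mockAPI) appendAddressRacy(addr string) {
    m.addresses = append(m.addresses, addr)
}

// Fixed: serialize writers (and any readers of the slice) behind the mutex.
func (m *mockAPI) appendAddress(addr string) {
    m.mu.Lock()
    defer m.mu.Unlock()
    m.addresses = append(m.addresses, addr)
}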
Even with these changes, I still run into the failure.
Thanks for trying. Could you share your reproduction method and test output? My local test runs the pkg/ipam unit tests many times, with and without -race and lockdebug, but I haven't tried pkg/azure yet.
@jaffcheng Sure, my branch is quite messy; I'll push that up later today. The main part is here in the Upsert function in node_manager.go. Basically, just removing the exponential backoff between triggers and running the poolMaintainer trigger in a loop causes this to occur, usually within a couple of seconds. I'm running the test with:
poolMaintainer, err := trigger.NewTrigger(trigger.Parameters{
Name: fmt.Sprintf("ipam-pool-maintainer-%s", resource.Name),
MinInterval: 10 * time.Millisecond,
MetricsObserver: n.metricsAPI.PoolMaintainerTrigger(),
TriggerFunc: func(reasons []string) {
if err := node.MaintainIPPool(ctx); err != nil {
node.logger().WithError(err).Warning("Unable to maintain ip pool of node")
//backoff.Wait(ctx) Removed exponential backoff wait to increase the chance of the race occurring.
}
},
ShutdownFunc: cancel,
})
// Trigger poolMaintainer continuously!
go func() {
for {
<-time.After(time.Millisecond)
poolMaintainer.Trigger()
}
}()
-race reveals a couple of data races in this; I was hoping fixing those would solve the problem, yet it still persists 😕
Although I think this patch should significantly lower the likelihood of flakes, -race still finds some races in pkg/ipam. More digging needed...
There appear to be many races there 😄 I have a few race fixes as well that I'll put up PRs for. Considering how hard I'm finding it to narrow this down to a single cause, it's possible the issue is caused by multiple races. We can start fixing these and see if the flakes improve.
force-pushed from 24c0779 to d0fc72b
/test
When I was looking at this, another place I thought it might be racing was here: https://github.com/cilium/cilium/blob/main/pkg/ipam/node.go#L430 i.e. the interfaces are checked in parallel and then on the next line they will race to get the Node lock, so in theory a stale set of values could overwrite the latest ones. But... fixing this didn't fix the flake in practice.
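[Editor's note: a hedged sketch of that check-then-write pattern, not the actual pkg/ipam/node.go code. A revision check is one possible way to keep a stale result from overwriting a fresher one, though as noted above this did not cure the flake in practice.]

package node

import "sync"

type nodeState struct {
    mu       sync.Mutex
    revision int // bumped on every accepted write
    addrs    []string
}

// snapshot copies the state and its revision so work can happen off-lock.
func (n *nodeState) snapshot() (int, []string) {
    n.mu.Lock()
    defer n.mu.Unlock()
    return n.revision, append([]string(nil), n.addrs...)
}

// commit rejects results computed against a stale revision, so a slow
// goroutine cannot overwrite values written by a fresher one.
func (n *nodeState) commit(seenRev int, addrs []string) bool {
    n.mu.Lock()
    defer n.mu.Unlock()
    if seenRev != n.revision {
        return false // stale; re-snapshot and retry
    }
    n.revision++
    n.addrs = addrs
    return true
}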
I think these changes make sense. Lock changes in this code scare me a bit, so I'll do another review shortly to double-check that we aren't creating a deadlock somewhere.
force-pushed from d0fc72b to dfbdc33
/test |
@jaffcheng The integration tests are failing, I believe due to CI using a newer version of Go. Can you rebase your branch? That should fix it.
TestNodeManagerManyNodes had been flaky before it was ported from aws/eni, and was eventually disabled in the aws tests: cilium#11560

One source of this flake is the races on the values of metricsapi, e.g. `metricsapi.Nodes("total")` and `metricsapi.AllocatedIPs("available")`, which are not protected from concurrent writes in NodeManager.Resync. Allowing multiple goroutines to execute Resync simultaneously doesn't really make sense, since `Node.resyncNode` is already executed in parallel and controlled by a semaphore. This patch serializes NodeManager.Resync to avoid data races on metricsAPI.

Some excerpts of failed tests:

--- FAIL: Test (17.29s)
    --- FAIL: Test/IPAMSuite (17.29s)
        --- FAIL: Test/IPAMSuite/TestIPAMMetadata (0.01s)
            testing.go:1446: race detected during execution of test
        --- FAIL: Test/IPAMSuite/TestNodeManagerManyNodes (3.88s)
            node_manager_test.go:610:
                ... obtained int = 850
                ... expected int = 1000

--- FAIL: Test (17.74s)
    --- FAIL: Test/IPAMSuite (17.74s)
        --- FAIL: Test/IPAMSuite/TestNodeManagerManyNodes (4.36s)
            node_manager_test.go:606:
                ... obtained int = 87
                ... expected int = 100

Related: cilium#26617

Signed-off-by: Jaff Cheng <jaff.cheng.sh@gmail.com>
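[Editor's note: a minimal sketch of what serializing Resync can look like, assuming a mutex is added to NodeManager; the field name and surrounding wiring are illustrative, and the PR diff is authoritative.]

package ipam

import (
    "context"
    "sync"
    "time"
)

type NodeManager struct {
    resyncLock sync.Mutex
    // ... other fields elided ...
}

func (n *NodeManager) Resync(ctx context.Context, syncTime time.Time) {
    // Serialize manager-level resyncs: per-node work in resyncNode is
    // already parallelized behind a semaphore, so concurrent Resync
    // callers only add unsynchronized writes to the shared metrics API.
    n.resyncLock.Lock()
    defer n.resyncLock.Unlock()
    // ... existing resync logic runs here unchanged ...
}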
Create a separate metricsapi for every test case to prevent the cases from interfering with each other.

Related: cilium#26617

Signed-off-by: Jaff Cheng <jaff.cheng.sh@gmail.com>
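[Editor's note: a sketch of the test-isolation pattern this commit describes, using hypothetical stand-in names; the real mock lives in pkg/ipam's metrics mock package.]

package ipam_test

import "testing"

// mockMetrics stands in for the ipam metrics mock; the name and field are
// illustrative, not the actual cilium type.
type mockMetrics struct{ allocatedIPs int }

func newMockMetrics() *mockMetrics { return &mockMetrics{} }

// Each case builds its own mock instead of sharing a package-level one,
// so goroutines left running by an earlier case cannot race on counters
// that a later case is asserting on.
func TestNodeManagerIsolated(t *testing.T) {
    metricsapi := newMockMetrics() // fresh per test case
    metricsapi.allocatedIPs++
    if metricsapi.allocatedIPs != 1 {
        t.Fatalf("expected 1 allocated IP, got %d", metricsapi.allocatedIPs)
    }
}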
force-pushed from dfbdc33 to e469fa3
/test
Edit: ConformanceGatewayAPI hit #27647 (error log)
What's the consequence of this bugfix? Should we backport it to v1.14 per the backport criteria? Does the race condition affect other branches too? |
Reviews are in, CI is passing. LGTM to merge. |
@joestringer The changes that cause this don't appear to be in the v1.14 branch.
Please see the commit message.
Related: #26617