
ipam: Fix race in NodeManager.Resync #26963

Merged
Merged 2 commits from the fix-ipam-stats-sporadic-race branch into cilium:main on Aug 24, 2023

Conversation

jaffcheng
Contributor

@jaffcheng jaffcheng commented Jul 20, 2023

Please see the commit message.

Related: #26617

@jaffcheng jaffcheng requested a review from a team as a code owner July 20, 2023 15:37
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jul 20, 2023
@jaffcheng
Contributor Author

@tommyp1ckles @christarazi Please take a look. This flake has clearly been happening more often lately, but I'm not quite sure why recent changes would have introduced it.

@jaffcheng
Contributor Author

/test

@tommyp1ckles
Contributor

@jaffcheng Interesting, I've been looking into this as well. I'm pretty sure it's related to racy code that appends to the AzureInterface Addresses array in the mock API implementation.

I have a local reproduction that can trigger the failure fairly quickly; let me validate whether this fixes it.
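(For illustration, the general shape of the race described above is a mock API appending to a shared slice from concurrent callers. The sketch below is a toy stand-in, not the actual Azure mock; the type and field names are assumptions. With the mutex removed, go test -race flags the concurrent appends.)

package main

import (
    "fmt"
    "sync"
)

// mockAPI is a toy stand-in for the kind of mock cloud API described above.
type mockAPI struct {
    mu        sync.Mutex
    addresses []string
}

// AddAddress serializes appends with a mutex. Without the lock, concurrent
// appends race on the slice header (pointer/len/cap), which is exactly the
// kind of bug the race detector reports.
func (m *mockAPI) AddAddress(addr string) {
    m.mu.Lock()
    defer m.mu.Unlock()
    m.addresses = append(m.addresses, addr)
}

func main() {
    api := &mockAPI{}
    var wg sync.WaitGroup
    for i := 0; i < 100; i++ {
        wg.Add(1)
        go func(i int) {
            defer wg.Done()
            api.AddAddress(fmt.Sprintf("10.0.0.%d", i))
        }(i)
    }
    wg.Wait()
    fmt.Println("addresses stored:", len(api.addresses))
}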

@tommyp1ckles
Contributor

Even with these changes, I still run into the failure.

@jaffcheng
Contributor Author

Even with these changes, I still run into the failure.

Thanks for trying. Could you share your reproduction method and test output? My local test is running the pkg/ipam unit tests many times with/without -race and lockdebug, but I haven't tried pkg/azure yet.
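(For reference, a run along these lines matches the setup described above; the exact command, repeat count, and use of the lockdebug build tag are assumptions about the local workflow, not taken from the thread.)

go test -count=500 -race -tags=lockdebug ./pkg/ipam/...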

@tommyp1ckles
Contributor

@jaffcheng Sure, my branch is quite messy; I'll push it up later today. The main part is here, in the Upsert function in node_manager.go. Basically, just removing the exponential backoff between triggers and running the poolMaintainer trigger in a loop causes this to occur, usually within a couple of seconds. I'm running the test with

go test -count=100 -v -failfast -race
poolMaintainer, err := trigger.NewTrigger(trigger.Parameters{
    Name:            fmt.Sprintf("ipam-pool-maintainer-%s", resource.Name),
    MinInterval:     10 * time.Millisecond,
    MetricsObserver: n.metricsAPI.PoolMaintainerTrigger(),
    TriggerFunc: func(reasons []string) {
        if err := node.MaintainIPPool(ctx); err != nil {
            node.logger().WithError(err).Warning("Unable to maintain ip pool of node")
            // backoff.Wait(ctx) removed: dropping the exponential backoff
            // wait increases the chance of the race occurring.
        }
    },
    ShutdownFunc: cancel,
})

// Trigger poolMaintainer continuously.
go func() {
    for {
        <-time.After(time.Millisecond)
        poolMaintainer.Trigger()
    }
}()

@tommyp1ckles
Contributor

Running with -race reveals a couple of data races in this; I was hoping that fixing those would solve the problem, but it still persists 😕

@jaffcheng
Contributor Author

Although I think this patch should significantly lower the likelihood of flakes, there are still some races found in pkg/ipam with -race on. Needs more digging...

@tommyp1ckles
Contributor

Although I think this patch should significantly lower the likelihood of flakes, there are still some races found in pkg/ipam with -race on. Needs more digging...

There appear to be many races there 😄 I have a few race fixes as well that I'll put up PRs for. Considering how hard I'm finding it to pin down a single cause, it's possible the issue is caused by multiple races. We can start fixing these and see if the flakes improve.

@jaffcheng jaffcheng force-pushed the fix-ipam-stats-sporadic-race branch 2 times, most recently from 24c0779 to d0fc72b on August 8, 2023 04:34
@jaffcheng
Contributor Author

Changed Fixes: to Related: since this does not fix all the races.

@jaffcheng
Contributor Author

/test

@tommyp1ckles
Contributor

tommyp1ckles commented Aug 16, 2023

When I was looking at this, another place I thought it might be racing was here: https://github.com/cilium/cilium/blob/main/pkg/ipam/node.go#L430

i.e. the interfaces are checked in parallel and then on the next line they will race to get the Node lock, so in theory a stale set of values could overwrite the latest ones.

But... fixing this didn't fix the flake in practice.
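(For illustration, the pattern described above looks roughly like the following minimal, self-contained sketch; the type and field names are assumptions, not the actual node.go code. Each goroutine computes its result outside the lock and then takes the lock to store it, so the last goroutine to grab the lock wins even if its data is older; one common guard is to compare a revision before overwriting.)

package main

import (
    "sync"
    "time"
)

// nodeState is a hypothetical stand-in for the per-node data being refreshed.
type nodeState struct {
    mu        sync.Mutex
    revision  int64 // when the stored snapshot was taken
    addresses []string
}

// storeIfNewer only overwrites the stored snapshot if it was taken after the
// one currently stored, so a goroutine holding stale results cannot clobber
// fresher data just because it acquired the lock last.
func (n *nodeState) storeIfNewer(rev int64, addrs []string) {
    n.mu.Lock()
    defer n.mu.Unlock()
    if rev <= n.revision {
        return // stale snapshot; keep the newer data
    }
    n.revision = rev
    n.addresses = addrs
}

func main() {
    n := &nodeState{}
    var wg sync.WaitGroup
    for i := 0; i < 4; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            rev := time.Now().UnixNano()   // snapshot taken outside the lock
            addrs := []string{"192.0.2.1"} // placeholder data
            n.storeIfNewer(rev, addrs)
        }()
    }
    wg.Wait()
}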

@tommyp1ckles
Contributor

I think these changes make sense. Lock changes in this code scare me a bit, so I'll do another review shortly to double-check that we aren't creating a deadlock somewhere.

@tommyp1ckles tommyp1ckles added the release-note/ci This PR makes changes to the CI. label Aug 16, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot added ready-to-merge This PR has passed all tests and received consensus from code owners to merge. and removed dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Aug 16, 2023
pkg/ipam/node_manager.go (review comment resolved)
@maintainer-s-little-helper maintainer-s-little-helper bot removed the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Aug 16, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Aug 16, 2023
@jaffcheng
Contributor Author

Changed to GetNodesByIPWatermarkLocked and slightly reworded the commit message, given that this only fixes one source of the flake.

@jaffcheng jaffcheng changed the title from "ipam: Fix flaky test TestNodeManagerManyNodes" to "ipam: Fix race in NodeManager.Resync" Aug 17, 2023
@ti-mo
Contributor

ti-mo commented Aug 17, 2023

/test

@ti-mo ti-mo removed the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Aug 17, 2023
@christarazi christarazi added the sig/ipam (IP address management, including cloud IPAM), area/ipam (Impacts IP address management functionality), and kind/bug (This is a bug in the Cilium logic) labels Aug 17, 2023
@tommyp1ckles
Contributor

@jaffcheng The integration tests are failing, I believe due to CI using a newer version of Go. Can you rebase your branch? That should fix it.

TestNodeManagerManyNodes had been flaky before it was ported from aws/eni,
and was eventually disabled in the aws tests:
cilium#11560

One of the sources of this flake is the races on the values of metricsapi,
e.g. `metricsapi.Nodes("total")` and `metricsapi.AllocatedIPs("available")`,
which are not protected from concurrent writes in NodeManager.Resync.

Allowing multiple goroutines to execute Resync simultaneously doesn't
really make sense, since `Node.resyncNode` is already executed in parallel
and controlled by a semaphore.
This patch serializes NodeManager.Resync to avoid data races on metricsAPI.

Some excerpts of failed tests:
    --- FAIL: Test (17.29s)
        --- FAIL: Test/IPAMSuite (17.29s)
            --- FAIL: Test/IPAMSuite/TestIPAMMetadata (0.01s)
                testing.go:1446: race detected during execution of test
            --- FAIL: Test/IPAMSuite/TestNodeManagerManyNodes (3.88s)
                node_manager_test.go:610:
                    ... obtained int = 850
                    ... expected int = 1000

    --- FAIL: Test (17.74s)
        --- FAIL: Test/IPAMSuite (17.74s)
            --- FAIL: Test/IPAMSuite/TestNodeManagerManyNodes (4.36s)
                node_manager_test.go:606:
                    ... obtained int = 87
                    ... expected int = 100

Related: cilium#26617
Signed-off-by: Jaff Cheng <jaff.cheng.sh@gmail.com>
Create a separate metricsapi for every test case to prevent
the test cases from interfering with each other.

Related: cilium#26617
Signed-off-by: Jaff Cheng <jaff.cheng.sh@gmail.com>
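(The serialization described in the first commit message above has roughly the following shape. This is a minimal, self-contained sketch with toy names, not the actual NodeManager code: Resync is guarded by a single mutex, while the per-node work inside still fans out concurrently.)

package main

import (
    "sync"
    "sync/atomic"
)

// manager is a toy stand-in for NodeManager; the field names are assumptions.
type manager struct {
    resyncLock   sync.Mutex
    totalNodes   int // stands in for the metricsAPI counters being raced on
    availableIPs int64
}

// Resync is serialized by resyncLock: per-node work still fans out
// concurrently inside, but only one Resync at a time updates the shared
// counters, so concurrent callers no longer race on them.
func (m *manager) Resync(nodes []string) {
    m.resyncLock.Lock()
    defer m.resyncLock.Unlock()

    var available int64
    var wg sync.WaitGroup
    for range nodes {
        wg.Add(1)
        go func() { // per-node resync, still concurrent
            defer wg.Done()
            atomic.AddInt64(&available, 8) // pretend each node has 8 free IPs
        }()
    }
    wg.Wait()

    // Metrics-style writes, now race-free across concurrent Resync callers.
    m.totalNodes = len(nodes)
    m.availableIPs = available
}

func main() {
    m := &manager{}
    var wg sync.WaitGroup
    for i := 0; i < 8; i++ { // concurrent Resync calls, as in the flaky test
        wg.Add(1)
        go func() {
            defer wg.Done()
            m.Resync([]string{"node-1", "node-2", "node-3"})
        }()
    }
    wg.Wait()
}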
@jaffcheng
Contributor Author

jaffcheng commented Aug 19, 2023

/test

Edit: ConformanceGatewayAPI hit #27647

Error log:
=== CONT  TestConformance/HTTPRouteRequestMirror/0_request_to_'/mirror'_should_go_to_infra-backend-v1
  httproute-request-mirror.go:69: Making GET request to http://172.18.255.200/mirror
  http.go:222: Response expectation failed for request: {URL: {Scheme:http Opaque: User: Host:172.18.255.200 Path:/mirror RawPath: OmitHost:false ForceQuery:false RawQuery: Fragment: RawFragment:}, Host: , Protocol: HTTP, Method: GET, Headers: map[X-Echo-Set-Header:[]], UnfollowRedirect: false, Server: , CertPem: <truncated>, KeyPem: <truncated>}  not ready yet: expected status code to be 200, got 404 (after 2.1µs)
  http.go:228: Request passed
  mirror.go:41: Searching for the mirrored request log
  mirror.go:42: Reading "gateway-conformance-infra/infra-backend-v2" logs
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0xc0 pc=0x2088cd9]

goroutine 1493 [running]:
k8s.io/client-go/kubernetes.(*Clientset).CoreV1(...)
  /home/runner/work/cilium/cilium/vendor/k8s.io/client-go/kubernetes/clientset.go:309
sigs.k8s.io/gateway-api/conformance/utils/kubernetes.DumpEchoLogs({0x287200f, 0x19}, {0x285c32e, 0x10}, {0x2c54560, 0xc0006f45a0}, 0x0)
  /home/runner/work/cilium/cilium/vendor/sigs.k8s.io/gateway-api/conformance/utils/kubernetes/logs.go:50 +0x399
sigs.k8s.io/gateway-api/conformance/utils/http.ExpectMirroredRequest.func1()
  /home/runner/work/cilium/cilium/vendor/sigs.k8s.io/gateway-api/conformance/utils/http/mirror.go:43 +0x1df
github.com/stretchr/testify/assert.Eventually.func1()
  /home/runner/work/cilium/cilium/vendor/github.com/stretchr/testify/assert/assertions.go:1852 +0x22
created by github.com/stretchr/testify/assert.Eventually in goroutine 1483
  /home/runner/work/cilium/cilium/vendor/github.com/stretchr/testify/assert/assertions.go:1852 +0x21e
FAIL	github.com/cilium/cilium/operator/pkg/gateway-api	53.044s
FAIL
Error: Process completed with exit code 1.

@joestringer
Member

What's the consequence of this bugfix? Should we backport it to v1.14 per the backport criteria? Does the race condition affect other branches too?

@joestringer
Member

Reviews are in, CI is passing. LGTM to merge.

@joestringer joestringer merged commit 69c3a5d into cilium:main Aug 24, 2023
59 of 60 checks passed
@tommyp1ckles
Contributor

@joestringer The changes that cause this don't appear to be in the v1.14 branch.

@jaffcheng jaffcheng deleted the fix-ipam-stats-sporadic-race branch August 24, 2023 08:45