Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

v1.7 Backport ENI data race fixes #11766

Merged
merged 5 commits into from Jun 5, 2020

Conversation

christarazi
Copy link
Member

@christarazi christarazi commented May 29, 2020

The fixes in this PR do not attempt to fix all data races within the code, as
many of them are not "crucial". These data races come from test code. For
example, the test code may access an internal field within the Node struct,
but an equivalent access from within the implementation has a mutex protecting
it.

The fixes landing in this PR are around "crucial" data structures like the
Node.enis, a map containing all references to ENIs in a node. These fixes
attempt to mitigate any potential upgrade regression from older versions. Note,
the 1.6 tree contains may of the same data races as the code difference between
this tree (1.7) and 1.6 is not significant. The code difference likely did not
include new data races. This means that these data races have been around
since 1.6, at least.

$ git diff origin/v1.6..origin/v1.7 -- $(find pkg/aws/eni -type f -not -path '*/mock/*' -not -path '*test.go')

This PR is a partial backport of the two PRs below:

#11685
#10587

Commits from #11685:

  • 699a8a5 | Backported; this fixes issues around the main ENI map
  • ef77a98 | Dropped as it doesn't change anything relevant in v1.7 tree
  • 20055e4 | Dropped as it doesn't change anything relevant in v1.7 tree
  • ab1b7e2 | Dropped as it doesn't change anything relevant in v1.7 tree
  • 7427e22 | Dropped as it doesn't change anything relevant in v1.7 tree
  • 9ded76e | Dropped as it doesn't change anything relevant in v1.7 tree
  • 9168268 | Backported; included because it could fix flaky v1.7 Travis; amended to remove all other references of metricsapi

Commits from #10587:

  • cac8d0d | Backported; included to provide protection around ENI limits map

This PR also contains two new commits which are fixes for (potential) data races that only
exists in this 1.7 tree:

$ for pr in 11685 10587; do contrib/backporting/set-labels.py $pr done 1.7; done

[ upstream commit 699a8a5 ]

This fixes data races found when running the unit-tests with `-race`.
The mutex is to protect access to `n.enis`.

```
WARNING: DATA RACE
Write at 0x00c0007b40e0 by goroutine 135:
  github.com/cilium/cilium/pkg/aws/eni.(*Node).ResyncInterfacesAndIPs()
      /home/chris/code/cilium/cilium/pkg/aws/eni/node.go:430 +0xfa
  github.com/cilium/cilium/pkg/ipam.(*Node).recalculate()
      /home/chris/code/cilium/cilium/pkg/ipam/node.go:352 +0xfb
  github.com/cilium/cilium/pkg/ipam.(*NodeManager).resyncNode()
      /home/chris/code/cilium/cilium/pkg/ipam/node_manager.go:342 +0x7a
  github.com/cilium/cilium/pkg/ipam.(*NodeManager).Resync.func1()
      /home/chris/code/cilium/cilium/pkg/ipam/node_manager.go:389 +0x88

Previous read at 0x00c0007b40e0 by goroutine 46:
  github.com/cilium/cilium/pkg/aws/eni.(*Node).findNextIndex()
      /home/chris/code/cilium/cilium/pkg/aws/eni/node.go:292 +0x86
  github.com/cilium/cilium/pkg/aws/eni.(*Node).CreateInterface()
      /home/chris/code/cilium/cilium/pkg/aws/eni/node.go:339 +0x584
  github.com/cilium/cilium/pkg/ipam.(*Node).createInterface()
      /home/chris/code/cilium/cilium/pkg/ipam/node.go:435 +0x290
  github.com/cilium/cilium/pkg/ipam.(*Node).maintainIPPool()
      /home/chris/code/cilium/cilium/pkg/ipam/node.go:628 +0x85d
  github.com/cilium/cilium/pkg/ipam.(*Node).MaintainIPPool()
      /home/chris/code/cilium/cilium/pkg/ipam/node.go:663 +0x82
  github.com/cilium/cilium/pkg/ipam.(*NodeManager).Update.func1()
      /home/chris/code/cilium/cilium/pkg/ipam/node_manager.go:241 +0x8b
  github.com/cilium/cilium/pkg/trigger.(*Trigger).waiter()
      /home/chris/code/cilium/cilium/pkg/trigger/trigger.go:206 +0x4b9

Goroutine 135 (running) created at:
  github.com/cilium/cilium/pkg/ipam.(*NodeManager).Resync()
      /home/chris/code/cilium/cilium/pkg/ipam/node_manager.go:388 +0x2c5
  github.com/cilium/cilium/pkg/ipam.NewNodeManager.func1()
      /home/chris/code/cilium/cilium/pkg/ipam/node_manager.go:168 +0x101
  github.com/cilium/cilium/pkg/trigger.(*Trigger).waiter()
      /home/chris/code/cilium/cilium/pkg/trigger/trigger.go:206 +0x4b9

Goroutine 46 (running) created at:
  github.com/cilium/cilium/pkg/trigger.NewTrigger()
      /home/chris/code/cilium/cilium/pkg/trigger/trigger.go:129 +0x23d
  github.com/cilium/cilium/pkg/ipam.(*NodeManager).Update()
      /home/chris/code/cilium/cilium/pkg/ipam/node_manager.go:236 +0x523
  github.com/cilium/cilium/pkg/aws/eni.(*ENISuite).TestNodeManagerManyNodes()
      /home/chris/code/cilium/cilium/pkg/aws/eni/node_manager_test.go:593 +0x80a
  runtime.call32()
      /usr/lib/go/src/runtime/asm_amd64.s:539 +0x3a
  reflect.Value.Call()
      /usr/lib/go/src/reflect/value.go:321 +0xd3
  gopkg.in/check%2ev1.(*suiteRunner).forkTest.func1()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:781 +0xa0a
  gopkg.in/check%2ev1.(*suiteRunner).forkCall.func1()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:675 +0xd9
```

```
WARNING: DATA RACE
Write at 0x00c000110060 by goroutine 94:
  github.com/cilium/cilium/pkg/aws/eni.(*Node).ResyncInterfacesAndIPs()
      /home/chris/code/cilium/cilium/pkg/aws/eni/node.go:439 +0xfa
  github.com/cilium/cilium/pkg/ipam.(*Node).recalculate()
      /home/chris/code/cilium/cilium/pkg/ipam/node.go:352 +0xfb
  github.com/cilium/cilium/pkg/ipam.(*NodeManager).resyncNode()
      /home/chris/code/cilium/cilium/pkg/ipam/node_manager.go:342 +0x7a
  github.com/cilium/cilium/pkg/ipam.(*NodeManager).Resync.func1()
      /home/chris/code/cilium/cilium/pkg/ipam/node_manager.go:389 +0x88

Previous read at 0x00c000110060 by goroutine 92:
  github.com/cilium/cilium/pkg/aws/eni.(*Node).PrepareIPAllocation()
      /home/chris/code/cilium/cilium/pkg/aws/eni/node.go:211 +0xefc
  github.com/cilium/cilium/pkg/ipam.(*Node).determineMaintenanceAction()
      /home/chris/code/cilium/cilium/pkg/ipam/node.go:542 +0x184
  github.com/cilium/cilium/pkg/ipam.(*Node).maintainIPPool()
      /home/chris/code/cilium/cilium/pkg/ipam/node.go:578 +0x53
  github.com/cilium/cilium/pkg/ipam.(*Node).MaintainIPPool()
      /home/chris/code/cilium/cilium/pkg/ipam/node.go:663 +0x82
  github.com/cilium/cilium/pkg/ipam.(*NodeManager).Update.func1()
      /home/chris/code/cilium/cilium/pkg/ipam/node_manager.go:241 +0x8b
  github.com/cilium/cilium/pkg/trigger.(*Trigger).waiter()
      /home/chris/code/cilium/cilium/pkg/trigger/trigger.go:206 +0x4b9

Goroutine 94 (running) created at:
  github.com/cilium/cilium/pkg/ipam.(*NodeManager).Resync()
      /home/chris/code/cilium/cilium/pkg/ipam/node_manager.go:388 +0x2c5
  github.com/cilium/cilium/pkg/ipam.NewNodeManager.func1()
      /home/chris/code/cilium/cilium/pkg/ipam/node_manager.go:168 +0x101
  github.com/cilium/cilium/pkg/trigger.(*Trigger).waiter()
      /home/chris/code/cilium/cilium/pkg/trigger/trigger.go:206 +0x4b9

Goroutine 92 (running) created at:
  github.com/cilium/cilium/pkg/trigger.NewTrigger()
      /home/chris/code/cilium/cilium/pkg/trigger/trigger.go:129 +0x23d
  github.com/cilium/cilium/pkg/ipam.(*NodeManager).Update()
      /home/chris/code/cilium/cilium/pkg/ipam/node_manager.go:236 +0x523
  github.com/cilium/cilium/pkg/aws/eni.(*ENISuite).TestNodeManagerManyNodes()
      /home/chris/code/cilium/cilium/pkg/aws/eni/node_manager_test.go:593 +0x80a
  runtime.call32()
      /usr/lib/go/src/runtime/asm_amd64.s:539 +0x3a
  reflect.Value.Call()
      /usr/lib/go/src/reflect/value.go:321 +0xd3
  gopkg.in/check%2ev1.(*suiteRunner).forkTest.func1()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:781 +0xa0a
  gopkg.in/check%2ev1.(*suiteRunner).forkCall.func1()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:675 +0xd9
```

Signed-off-by: Chris Tarazi <chris@isovalent.com>
[ upstream commit 9168268 ]

In this test suite, the `metricsapi` is global variable which is shared
among all the `Test*` functions. It is possible that it becomes polluted
over time during test execution.

This is an attempt to resolve the following:

```
FAIL: node_manager_test.go:563: ENISuite.TestNodeManagerManyNodes
node_manager_test.go:602:

    c.Errorf("Node %s allocation mismatch. expected: %d allocated: %d", s.name, minAllocate, node.Stats().AvailableIPs)

... Error: Node node53 allocation mismatch. expected: 10 allocated: 18

node_manager_test.go:602:

    c.Errorf("Node %s allocation mismatch. expected: %d allocated: %d", s.name, minAllocate, node.Stats().AvailableIPs)

... Error: Node node59 allocation mismatch. expected: 10 allocated: 18

node_manager_test.go:617:
    c.Assert(metricsapi.AllocatedIPs("available"), check.Equals, numNodes*minAllocate)
... obtained int = 1016
... expected int = 1000

...
OOPS: 17 passed, 1 FAILED

Signed-off-by: Chris Tarazi <chris@isovalent.com>
@maintainer-s-little-helper maintainer-s-little-helper bot added backport/1.7 kind/backports This PR provides functionality previously merged into master. labels May 29, 2020
@christarazi christarazi changed the title This PR is intended to fix data races in the ENI code. This PR is inspired from https://github.com/cilium/cilium/pull/11685. [v1.7] Backport ENI data race fixes May 29, 2020
@christarazi christarazi changed the title [v1.7] Backport ENI data race fixes v1.7 Backport ENI data race fixes May 29, 2020
@christarazi
Copy link
Member Author

never-tell-me-the-odds

@christarazi christarazi marked this pull request as ready for review May 29, 2020 00:51
@christarazi christarazi requested a review from a team as a code owner May 29, 2020 00:51
@christarazi
Copy link
Member Author

never-tell-me-the-odds

@christarazi christarazi force-pushed the pr/christarazi/1.7-backport-eni-data-races-fixes branch from 4d6c50b to a7ceb7e Compare May 29, 2020 06:03
@christarazi
Copy link
Member Author

never-tell-me-the-odds

christarazi and others added 3 commits May 28, 2020 23:46
Fixes:

```
WARNING: DATA RACE
Write at 0x00c0003e4450 by goroutine 81:
  github.com/cilium/cilium/pkg/aws/eni.(*Node).recalculateLocked()
      /home/chris/code/cilium/cilium-backports/pkg/aws/eni/node.go:235 +0x75
  github.com/cilium/cilium/pkg/aws/eni.(*NodeManager).resyncNode()
      /home/chris/code/cilium/cilium-backports/pkg/aws/eni/node_manager.go:261 +0xeb
  github.com/cilium/cilium/pkg/aws/eni.(*NodeManager).Resync.func1()
      /home/chris/code/cilium/cilium-backports/pkg/aws/eni/node_manager.go:309 +0x88

Previous read at 0x00c0003e4450 by goroutine 123:
  github.com/cilium/cilium/pkg/aws/eni.(*Node).maintainIpPool()
      /home/chris/code/cilium/cilium-backports/pkg/aws/eni/node.go:736 +0xa74
  github.com/cilium/cilium/pkg/aws/eni.(*Node).MaintainIpPool()
      /home/chris/code/cilium/cilium-backports/pkg/aws/eni/node.go:776 +0x90
  github.com/cilium/cilium/pkg/aws/eni.(*NodeManager).Update.func1()
      /home/chris/code/cilium/cilium-backports/pkg/aws/eni/node_manager.go:154 +0x8b
  github.com/cilium/cilium/pkg/trigger.(*Trigger).waiter()
      /home/chris/code/cilium/cilium-backports/pkg/trigger/trigger.go:210 +0x4b9

Goroutine 81 (running) created at:
  github.com/cilium/cilium/pkg/aws/eni.(*NodeManager).Resync()
      /home/chris/code/cilium/cilium-backports/pkg/aws/eni/node_manager.go:308 +0x27c
  github.com/cilium/cilium/pkg/aws/eni.NewNodeManager.func1()
      /home/chris/code/cilium/cilium-backports/pkg/aws/eni/node_manager.go:111 +0x101
  github.com/cilium/cilium/pkg/trigger.(*Trigger).waiter()
      /home/chris/code/cilium/cilium-backports/pkg/trigger/trigger.go:210 +0x4b9

Goroutine 123 (running) created at:
  github.com/cilium/cilium/pkg/trigger.NewTrigger()
      /home/chris/code/cilium/cilium-backports/pkg/trigger/trigger.go:133 +0x23d
  github.com/cilium/cilium/pkg/aws/eni.(*NodeManager).Update()
      /home/chris/code/cilium/cilium-backports/pkg/aws/eni/node_manager.go:149 +0x482
  github.com/cilium/cilium/pkg/aws/eni.(*ENISuite).TestNodeManagerManyNodes()
      /home/chris/code/cilium/cilium-backports/pkg/aws/eni/node_manager_test.go:579 +0x9c8
  runtime.call32()
      /usr/lib/go/src/runtime/asm_amd64.s:539 +0x3a
  reflect.Value.Call()
      /usr/lib/go/src/reflect/value.go:321 +0xd3
  gopkg.in/check%2ev1.(*suiteRunner).forkTest.func1()
      /home/chris/go/pkg/mod/gopkg.in/check.v1@v1.0.0-20180628173108-788fd7840127/check.go:781 +0xa0a
  gopkg.in/check%2ev1.(*suiteRunner).forkCall.func1()
      /home/chris/go/pkg/mod/gopkg.in/check.v1@v1.0.0-20180628173108-788fd7840127/check.go:675 +0xd9
```

Signed-off-by: Chris Tarazi <chris@isovalent.com>
[ upstream commit cac8d0d ]

Fixes:
```
WARNING: DATA RACE
Write at 0x00c0005b0750 by goroutine 308:
  runtime.mapassign_faststr()
      /usr/local/go/src/runtime/map_faststr.go:202 +0x0
  github.com/cilium/cilium/pkg/aws/eni.UpdateLimitsFromUserDefinedMappings()
      /home/vagrant/go/src/github.com/cilium/cilium/pkg/aws/eni/limits.go:269 +0xdf
  github.com/cilium/cilium/pkg/aws/eni.(*ENISuite).TestUpdateLimitsFromUserDefinedMappings()
      /home/vagrant/go/src/github.com/cilium/cilium/pkg/aws/eni/limits_test.go:47 +0x11d
  runtime.call32()
      /usr/local/go/src/runtime/asm_amd64.s:539 +0x3a
  reflect.Value.Call()
      /usr/local/go/src/reflect/value.go:321 +0xd3
  gopkg.in/check%2ev1.(*suiteRunner).forkTest.func1()
      /home/vagrant/go/src/github.com/cilium/cilium/vendor/gopkg.in/check.v1/check.go:781 +0xa0a
  gopkg.in/check%2ev1.(*suiteRunner).forkCall.func1()
      /home/vagrant/go/src/github.com/cilium/cilium/vendor/gopkg.in/check.v1/check.go:675 +0xd9
```

Signed-off-by: Thomas Graf <thomas@cilium.io>
Signed-off-by: Chris Tarazi <chris@isovalent.com>
This fixes a potential data race as `n.resource` is a live pointer to an
object.

Fixes: 06bce43 ("aws/eni: Fix race condition leading to overaggressive ENI allocation")

Signed-off-by: Chris Tarazi <chris@isovalent.com>
@christarazi christarazi force-pushed the pr/christarazi/1.7-backport-eni-data-races-fixes branch from a7ceb7e to 43253d0 Compare May 29, 2020 06:48
@christarazi
Copy link
Member Author

never-tell-me-the-odds

@christarazi
Copy link
Member Author

christarazi commented May 29, 2020

test-missed-k8s

Edit: hit known flake #10442

@christarazi
Copy link
Member Author

christarazi commented May 29, 2020

restart-ginkgo

@christarazi
Copy link
Member Author

restart-ginkgo

@christarazi
Copy link
Member Author

christarazi commented May 29, 2020

test-missed-k8s

Edit: hit known flake #10442

@christarazi
Copy link
Member Author

test-missed-k8s

@christarazi
Copy link
Member Author

test-focus K8sPolicyTest Basic Test Redirects traffic to proxy when no policy is applied with proxy-visibility annotation Tests DNS proxy visibility without policy

@christarazi
Copy link
Member Author

The focused test passed, which was the same test that failed in Cilium-Ginkgo-Test-k8s

@joestringer
Copy link
Member

Changes look good. Given these changes are in the ENI area, CI (other than unit testing) is not likely to provide us a lot of signal. For reference, did you perform any manual validation on an EKS cluster?

I'm thinking this is probably good to merge assuming it passes basic validation in a real environment.

@christarazi
Copy link
Member Author

christarazi commented Jun 4, 2020

Just deployed a quick sanity check cluster and it seems to be fine. Deployed the connectivity check and it was all good. Weirdly enough, pod-to-b-intra-node-nodeport and pod-to-b-multi-node-nodeport won't come up. I then deployed with the latest released version of 1.7 and that has the same behavior. Not sure if there's been a recent regression that's been under our noses.

As far as this PR is considered, I think it's good to go.

@joestringer joestringer merged commit 36a0729 into v1.7 Jun 5, 2020
@joestringer joestringer deleted the pr/christarazi/1.7-backport-eni-data-races-fixes branch June 5, 2020 00:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/backports This PR provides functionality previously merged into master.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants