Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix various issues (data races, flakes) related to DaemonSuite and Endpoint related test code #12195

Merged

Conversation

christarazi
Copy link
Member

@christarazi christarazi commented Jun 19, 2020

See commit msgs.

Fixes: #11955 (Specifically commit 355f037 )

@christarazi christarazi added area/CI Continuous Integration testing issue or flake area/daemon Impacts operation of the Cilium daemon. kind/cleanup This includes no functional changes. ci/flake This is a known failure that occurs in the tree. Please investigate me! release-note/ci This PR makes changes to the CI. labels Jun 19, 2020
@maintainer-s-little-helper
Copy link

Please set the appropriate release note label.

@christarazi
Copy link
Member Author

test-me-please

@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.6.10 Jun 19, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.8.0 Jun 19, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.7.6 Jun 19, 2020
@christarazi christarazi added the release-priority/best-effort The project for target version is not a hard requirement. label Jun 19, 2020
@coveralls
Copy link

coveralls commented Jun 19, 2020

Coverage Status

Coverage decreased (-0.008%) to 37.178% when pulling a8191a5 on christarazi:pr/christarazi/fix-metrics-assertion-flake into 21b86d5 on cilium:master.

@christarazi
Copy link
Member Author

All green CI on first try 🎉

@christarazi christarazi force-pushed the pr/christarazi/fix-metrics-assertion-flake branch from 01f059a to f646e4b Compare June 21, 2020 03:49
@christarazi
Copy link
Member Author

test-me-please

@christarazi christarazi force-pushed the pr/christarazi/fix-metrics-assertion-flake branch from f646e4b to 2896094 Compare June 21, 2020 04:00
@christarazi
Copy link
Member Author

christarazi commented Jun 21, 2020

test-me-please

Edit: net-next provisioning failure

@christarazi
Copy link
Member Author

retest-net-next

@christarazi christarazi marked this pull request as ready for review June 21, 2020 05:53
@christarazi christarazi requested a review from a team June 21, 2020 05:53
@christarazi christarazi requested review from a team as code owners June 21, 2020 05:53
@christarazi christarazi requested a review from a team June 21, 2020 05:53
@christarazi christarazi changed the title Fix various data races related to DaemonSuite Fix various issues (data races, flakes) related to DaemonSuite and Endpoint related test code Jun 21, 2020
@christarazi christarazi requested review from nebril and aanm June 23, 2020 17:05
Comment on lines 30 to 31
mutex = lock.RWMutex{} // Protects access to `pool` below.
pool = idpool.NewIDPool(minID, maxID)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will slow down the access of the pool which already has a mutex builtin. I know the mutex is protecting the variable pool not the contents of pool. This entire mutex is only being added because of ReallocatePool which is only used in unit tests. Since this is the case we should avoid using a mutex entirely and re-write the ReallocatePool function to do something like:

// ReallocatePool starts over with a new pool.
// This function is only used for tests and its implementation is not
// optimized for production.
func ReallocatePool() {
	for i := minID; i < maxID; i++ {
		pool.Release(i)
	}
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, thanks for extra insight. I agree that the mutex would slow things down. Good catch on noticing that ReallocatePool isn't used in implementation code.

Will add this and remote the mutex.

@@ -32,7 +32,7 @@ import (
"github.com/cilium/cilium/pkg/metrics"
"github.com/cilium/cilium/pkg/option"
"github.com/cilium/cilium/pkg/policy/api"
"github.com/cilium/proxy/go/cilium/api"
cilium "github.com/cilium/proxy/go/cilium/api"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

leftover?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, will remove

@@ -282,7 +282,7 @@ func (m *endpointCreationManager) DebugStatus() (output string) {

// createEndpoint attempts to create the endpoint corresponding to the change
// request that was specified.
func (d *Daemon) createEndpoint(ctx context.Context, epTemplate *models.EndpointChangeRequest) (*endpoint.Endpoint, int, error) {
func (d *Daemon) createEndpoint(ctx context.Context, owner regeneration.Owner, epTemplate *models.EndpointChangeRequest) (*endpoint.Endpoint, int, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this change, why can't we keep the old way and still pass d to the functions on line 316, 412 and 497?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change allows us to pass DaemonSuite as the owner, because it implements the regeneration.Owner interface. This is useful because we can mock out certain calls that happen during the creation of the endpoint. Specifically, to fix the metrics assertion flake (as mentioned in the commit msg and in the PR body), we want to mock out the call to (*Daemon).QueueEndpointBuild because that causes the endpoint to transition states extremely quickly. By mocking this out, we allow the test to become more deterministic, thus less flaky. See this comment for more info on this.

Does that make sense? Happy to clarify further.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh I miss that we were passing ds in the tests and not ds.d

@aanm
Copy link
Member

aanm commented Jun 24, 2020

Also I've removed all needs-backport labels for older branches to reduce the risk of introducing regressions

@aanm aanm removed area/CI Continuous Integration testing issue or flake needs-backport/1.6 labels Jun 24, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot removed this from Needs backport from master in 1.6.10 Jun 24, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot removed this from Needs backport from master in 1.7.6 Jun 24, 2020
For testing purposes, it's useful to pass the owner of the endpoint to
`(*Daemon).createEndpoint`. Since an owner must implement the
`regeneration.Owner` interface, we can utilize this fact to mock out the
enqueuing of endpoint regeneration. In other words, we avoid the state
transitions caused by endpoint regeneration because the enqueuing has
been "blocked" by mocking it out. This allows us to assert on specific
state transitions without worrying about endpoint regeneration kicking
in and transitioning endpoint state, unintentionally.

We also are utilizing the fact that this test suite (DaemonSuite)
already implemented the `regeneration.Owner` interface for mocking
purposes, but wasn't used in such a fashion.

This commit also fixes two incorrect assertions in the test:

1) Remove premature assertion on metrics because the
   `TestEndpointAddNoLabels` test is about creating an endpoint without
   any initial labels set. The endpoint's initial label is assigned
   (`reserved:init`) within the call of `(*Daemon).createEndpoint`, so
   it is very unlikely that we catch the endpoint in any other state
   besides "ready". Therefore, we remove this premature assertion as it
   doesn't test anything useful.
2) The final state transition in `TestEndpointAddNoLabels` should be
   "regenerating" -> "ready" because now
   `(*Daemon).QueueEndpointBuild()` is using the implementation (not
   mocked out, like in the rest of the suite).

Fixes: 4a4a59b ("daemon: Add assertions for endpoint state metrics")
Fixes: 19030a7 ("daemon: Fix waiting on metrics in endpoint_test.go")

Signed-off-by: Chris Tarazi <chris@isovalent.com>
The data race stemmed from the `ReallocatePool()` which is a function
that's only used in testing code, and not implementation code.
Therefore, we don't need a mutex to protect access to the variable
itself. The data race is fixed by eliminating the overwrite of `pool`,
and releasing the IDs instead, thus reusing the same instance of `pool`.
Operations by `pool` are already protected by a mutex.

Along with fixing the following data race, this commit also changes the
usage of idpool to be a value rather than a pointer. idpool doesn't need
to be a pointer as the only consumer of it is the endpoint manager,
which is essentially used as a singleton during the lifetime of Cilium.
Using values rather than pointers helps reduce the load on the Golang
garbage collector. In this case, the impact is small, but useful
nonetheless.

```
WARNING: DATA RACE
Write at 0x0000050336b0 by goroutine 243:
  github.com/cilium/cilium/pkg/endpointmanager/idallocator.ReallocatePool()
      /home/chris/code/cilium/cilium/pkg/endpointmanager/idallocator/allocator.go:34 +0x14e
  github.com/cilium/cilium/pkg/endpointmanager.(*EndpointManager).RemoveAll()
      /home/chris/code/cilium/cilium/pkg/endpointmanager/manager.go:301 +0x80
  github.com/cilium/cilium/daemon/cmd.(*DaemonSuite).TearDownTest()
      /home/chris/code/cilium/cilium/daemon/cmd/daemon_test.go:167 +0x8f
  github.com/cilium/cilium/daemon/cmd.(*DaemonEtcdSuite).TearDownTest()
      /home/chris/code/cilium/cilium/daemon/cmd/daemon_test.go:209 +0x44
  runtime.call32()
      /usr/lib/go/src/runtime/asm_amd64.s:539 +0x3a
  reflect.Value.Call()
      /usr/lib/go/src/reflect/value.go:321 +0xd3
  gopkg.in/check%2ev1.(*suiteRunner).runFixture.func1()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:730 +0x1c3
  gopkg.in/check%2ev1.(*suiteRunner).forkCall.func1()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:675 +0xd9

Previous read at 0x0000050336b0 by goroutine 196:
  github.com/cilium/cilium/pkg/endpointmanager/idallocator.Release()
      /home/chris/code/cilium/cilium/pkg/endpointmanager/idallocator/allocator.go:72 +0x3e
  github.com/cilium/cilium/pkg/endpointmanager.(*EndpointManager).ReleaseID()
      /home/chris/code/cilium/cilium/pkg/endpointmanager/manager.go:287 +0x50
  github.com/cilium/cilium/pkg/endpoint.(*Endpoint).Unexpose.func1()
      /home/chris/code/cilium/cilium/pkg/endpoint/manager.go:143 +0xc4

Goroutine 243 (running) created at:
  gopkg.in/check%2ev1.(*suiteRunner).forkCall()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:672 +0x44d
  gopkg.in/check%2ev1.(*suiteRunner).runFunc()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:682 +0x83
  gopkg.in/check%2ev1.(*suiteRunner).runFixture()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:726 +0x3d
  gopkg.in/check%2ev1.(*suiteRunner).runFixtureWithPanic()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:744 +0x82
  gopkg.in/check%2ev1.(*suiteRunner).forkTest.func1()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:782 +0xa0b
  gopkg.in/check%2ev1.(*suiteRunner).forkCall.func1()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:675 +0xd9

Goroutine 196 (finished) created at:
  github.com/cilium/cilium/pkg/endpoint.(*Endpoint).Unexpose()
      /home/chris/code/cilium/cilium/pkg/endpoint/manager.go:132 +0xd1
  github.com/cilium/cilium/pkg/endpoint.(*Endpoint).Delete()
      /home/chris/code/cilium/cilium/pkg/endpoint/endpoint.go:2193 +0x174
  github.com/cilium/cilium/daemon/cmd.(*Daemon).deleteEndpointQuiet()
      /home/chris/code/cilium/cilium/daemon/cmd/endpoint.go:633 +0xe6
  github.com/cilium/cilium/daemon/cmd.(*Daemon).errorDuringCreation()
      /home/chris/code/cilium/cilium/daemon/cmd/endpoint.go:188 +0x4d
  github.com/cilium/cilium/daemon/cmd.(*Daemon).createEndpoint()
      /home/chris/code/cilium/cilium/daemon/cmd/endpoint.go:443 +0x177f
  github.com/cilium/cilium/daemon/cmd.(*DaemonSuite).TestUpdateLabelsFailed()
      /home/chris/code/cilium/cilium/daemon/cmd/endpoint_test.go:149 +0x124
  github.com/cilium/cilium/daemon/cmd.(*DaemonEtcdSuite).TestUpdateLabelsFailed()
      <autogenerated>:1 +0x4d
  runtime.call32()
      /usr/lib/go/src/runtime/asm_amd64.s:539 +0x3a
  reflect.Value.Call()
      /usr/lib/go/src/reflect/value.go:321 +0xd3
  gopkg.in/check%2ev1.(*suiteRunner).forkTest.func1()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:781 +0xa0a
  gopkg.in/check%2ev1.(*suiteRunner).forkCall.func1()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:675 +0xd9
```

Signed-off-by: Chris Tarazi <chris@isovalent.com>
This commit adds optimizations that cut the execution time of
IDPoolTestSuite by roughly half, from ~1.6 seconds to ~0.8 seconds. This
optimization involves eliminating superfluous memory allocations.

(While shaving off 0.8 seconds [from a test suite that takes on the
order of minutes] does not sound like much, this commit can serve as a
template for future optimizations around tests which require a large
number of memory allocations.)

These allocations come from the `Assert` function in the Golang checker
package which we use for assertions, etcetera. This function is costly
when many Goroutines are running in parallel performing assertions. Each
assertion involves a rather large memory allocation when processing the
arguments to the function, before actually getting to the assertion
itself. This is quite wasteful in this specific test suite where we're
simply comparing integer and boolean types.

This commit implements the optimization by replacing the call to
`Assert` with a simple equality check.

See the cpu and mem profiles below.

Before optimization; cpu profile:
```
$ go test -c ./pkg/idpool && go test ./pkg/idpool -cpuprofile cpu.out -count 1
ok      github.com/cilium/cilium/pkg/idpool     1.612s

$ go tool pprof ./idpool.test cpu.out
File: idpool.test
Type: cpu
Time: Jun 20, 2020 at 4:39pm (PDT)
Duration: 1.60s, Total samples = 1.39s (86.77%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top10
Showing nodes accounting for 770ms, 55.40% of 1390ms total
Showing top 10 nodes out of 129
      flat  flat%   sum%        cum   cum%
     320ms 23.02% 23.02%      320ms 23.02%  runtime.procyield
      70ms  5.04% 28.06%      200ms 14.39%  runtime.mallocgc    <--- Nearly 15% of CPU time is spent in memory allocations
      70ms  5.04% 33.09%       70ms  5.04%  runtime.unlock
      60ms  4.32% 37.41%      310ms 22.30%  runtime.findrunnable
      50ms  3.60% 41.01%      310ms 22.30%  runtime.lock
      50ms  3.60% 44.60%       50ms  3.60%  runtime.runqget
      40ms  2.88% 47.48%      270ms 19.42%  gopkg.in/check%2ev1.(*C).internalCheck
      40ms  2.88% 50.36%       60ms  4.32%  runtime.heapBitsSetType
      40ms  2.88% 53.24%      150ms 10.79%  sync.(*Mutex).Unlock
      30ms  2.16% 55.40%       30ms  2.16%  runtime.casgstatus
```

Before optimization; memory profle:
```
$ go test -c ./pkg/idpool && go test ./pkg/idpool -memprofile mem.out -count 1
ok      github.com/cilium/cilium/pkg/idpool     1.303s

$ go tool pprof ./idpool.test mem.out
File: idpool.test
Type: alloc_space
Time: Jun 20, 2020 at 4:41pm (PDT)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top10
Showing nodes accounting for 341.51MB, 100% of 341.51MB total
      flat  flat%   sum%        cum   cum%
  225.01MB 65.89% 65.89%   315.51MB 92.39%  gopkg.in/check%2ev1.(*C).internalCheck
   90.50MB 26.50% 92.39%    90.50MB 26.50%  gopkg.in/check%2ev1.(*notChecker).Info
      18MB  5.27% 97.66%       18MB  5.27%  gopkg.in/check%2ev1.Not (inline)
       8MB  2.34%   100%   225.01MB 65.89%  github.com/cilium/cilium/pkg/idpool.(*IDPoolTestSuite).TestAllocateID.func1
         0     0%   100%   116.50MB 34.11%  github.com/cilium/cilium/pkg/idpool.(*IDPoolTestSuite).TestAllocateID
         0     0%   100%   315.51MB 92.39%  gopkg.in/check%2ev1.(*C).Assert
         0     0%   100%   116.50MB 34.11%  gopkg.in/check%2ev1.(*suiteRunner).forkCall.func1
         0     0%   100%   116.50MB 34.11%  gopkg.in/check%2ev1.(*suiteRunner).forkTest.func1
         0     0%   100%   116.50MB 34.11%  reflect.Value.Call
         0     0%   100%   116.50MB 34.11%  reflect.Value.call
(pprof) list internalCheck
Total: 341.51MB
ROUTINE ======================== gopkg.in/check%2ev1.(*C).internalCheck in /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/helpers.go
  225.01MB   315.51MB (flat, cum) 92.39% of Total
         .          .    192:                   comment = c
         .          .    193:                   args = args[:len(args)-1]
         .          .    194:           }
         .          .    195:   }
         .          .    196:
     135MB      135MB    197:   params := append([]interface{}{obtained}, args...)
         .    90.50MB    198:   info := checker.Info()
         .          .    199:
         .          .    200:   if len(params) != len(info.Params) {
         .          .    201:           names := append([]string{info.Params[0], info.Name}, info.Params[1:]...)
         .          .    202:           c.logCaller(2)
         .          .    203:           c.logString(fmt.Sprintf("%s(%s):", funcName, strings.Join(names, ", ")))
         .          .    204:           c.logString(fmt.Sprintf("Wrong number of parameters for %s: want %d, got %d", info.Name, len(names), len(params)+1))
         .          .    205:           c.logNewLine()
         .          .    206:           c.Fail()
         .          .    207:           return false
         .          .    208:   }
         .          .    209:
         .          .    210:   // Copy since it may be mutated by Check.
      90MB       90MB    211:   names := append([]string{}, info.Params...)
         .          .    212:
         .          .    213:   // Do the actual check.
         .          .    214:   result, error := checker.Check(params, names)
         .          .    215:   if !result || error != "" {
         .          .    216:           c.logCaller(2)
(pprof) list TestAllocateID
Total: 341.51MB
ROUTINE ======================== github.com/cilium/cilium/pkg/idpool.(*IDPoolTestSuite).TestAllocateID in /home/chris/code/cilium/cilium/pkg/idpool/idpool_test.go
         0   116.50MB (flat, cum) 34.11% of Total
         .          .    342:           allocators.Wait()
         .          .    343:           close(allocated)
         .          .    344:   }()
         .          .    345:
         .          .    346:   for id := range allocated {
         .   116.50MB    347:           c.Assert(p.Insert(id), Equals, true)
         .          .    348:   }
         .          .    349:}
ROUTINE ======================== github.com/cilium/cilium/pkg/idpool.(*IDPoolTestSuite).TestAllocateID.func1 in /home/chris/code/cilium/cilium/pkg/idpool/idpool_test.go
       8MB   225.01MB (flat, cum) 65.89% of Total
         .          .    329:   for i := 0; i < 256; i++ {
         .          .    330:           allocators.Add(1)
         .          .    331:           go func() {
         .          .    332:                   for i := 1; i <= maxID; i++ {
         .          .    333:                           id := p.AllocateID()
       8MB   225.01MB    334:                           c.Assert(id, Not(Equals), NoID)
         .          .    335:                           allocated <- id
         .          .    336:                   }
         .          .    337:                   allocators.Done()
         .          .    338:           }()
         .          .    339:   }
```

After optimization; cpu profle:
```
$ go test -c ./pkg/idpool && go test ./pkg/idpool -cpuprofile cpu.out -count 1
ok      github.com/cilium/cilium/pkg/idpool     0.807s   <--- Notice the time compared to the previous test run

$ go tool pprof ./idpool.test cpu.out
File: idpool.test
Type: cpu
Time: Jun 20, 2020 at 4:53pm (PDT)
Duration: 800.83ms, Total samples = 660ms (82.41%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top10
Showing nodes accounting for 430ms, 65.15% of 660ms total
Showing top 10 nodes out of 63
      flat  flat%   sum%        cum   cum%
     160ms 24.24% 24.24%      160ms 24.24%  runtime.procyield
      50ms  7.58% 31.82%      160ms 24.24%  sync.(*Mutex).lockSlow
      40ms  6.06% 37.88%       40ms  6.06%  runtime.casgstatus
      40ms  6.06% 43.94%       40ms  6.06%  runtime.unlock
      40ms  6.06% 50.00%      150ms 22.73%  sync.(*Mutex).Unlock
      20ms  3.03% 53.03%      300ms 45.45%  github.com/cilium/cilium/pkg/idpool.(*IDPool).AllocateID
      20ms  3.03% 56.06%      130ms 19.70%  github.com/cilium/cilium/pkg/idpool.(*IDPool).Insert
      20ms  3.03% 59.09%       20ms  3.03%  runtime.cansemacquire
      20ms  3.03% 62.12%       40ms  6.06%  runtime.findrunnable
      20ms  3.03% 65.15%       20ms  3.03%  runtime.futex
```

By looking at the test run times, you can see it was nearly cut in half.
We also no longer see any measurable time spent in `runtime.mallocgc`.

Signed-off-by: Chris Tarazi <chris@isovalent.com>
This allows us to fix the following data race. The data race occurred
because the creation of EndpointManager was racing with the controllers
running in the background when the `Daemon` object is created.

```
$ go test ./daemon/cmd -race -v -count 1 -check.f DaemonEtcdSuite.TestEndpointAddNoLabels

=== RUN   Test
==================
WARNING: DATA RACE
Write at 0x00c0007612f8 by goroutine 103:
  github.com/cilium/cilium/daemon/cmd.(*DaemonSuite).SetUpTest()
      /home/chris/code/cilium/cilium/daemon/cmd/daemon_test.go:145 +0x724
  github.com/cilium/cilium/daemon/cmd.(*DaemonEtcdSuite).SetUpTest()
      /home/chris/code/cilium/cilium/daemon/cmd/daemon_test.go:195 +0x44
  runtime.call32()
      /usr/lib/go/src/runtime/asm_amd64.s:539 +0x3a
  reflect.Value.Call()
      /usr/lib/go/src/reflect/value.go:321 +0xd3
  gopkg.in/check%2ev1.(*suiteRunner).runFixture.func1()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:730 +0x1c3
  gopkg.in/check%2ev1.(*suiteRunner).forkCall.func1()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:675 +0xd9

Previous read at 0x00c0007612f8 by goroutine 122:
  github.com/cilium/cilium/daemon/cmd.(*Daemon).bootstrapFQDN.func1()
      /home/chris/code/cilium/cilium/daemon/cmd/fqdn.go:191 +0x3ea
  github.com/cilium/cilium/pkg/controller.(*Controller).runController()
      /home/chris/code/cilium/cilium/pkg/controller/controller.go:205 +0xc6f

Goroutine 103 (running) created at:
  gopkg.in/check%2ev1.(*suiteRunner).forkCall()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:672 +0x44d
  gopkg.in/check%2ev1.(*suiteRunner).runFunc()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:682 +0x83
  gopkg.in/check%2ev1.(*suiteRunner).runFixture()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:726 +0x3d
  gopkg.in/check%2ev1.(*suiteRunner).runFixtureWithPanic()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:744 +0x82
  gopkg.in/check%2ev1.(*suiteRunner).forkTest.func1()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:769 +0x28c
  gopkg.in/check%2ev1.(*suiteRunner).forkCall.func1()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:675 +0xd9

Goroutine 122 (running) created at:
  github.com/cilium/cilium/pkg/controller.(*Manager).updateController()
      /home/chris/code/cilium/cilium/pkg/controller/manager.go:120 +0xfb5
  github.com/cilium/cilium/pkg/controller.(*Manager).UpdateController()
      /home/chris/code/cilium/cilium/pkg/controller/manager.go:59 +0xc7d
  github.com/cilium/cilium/daemon/cmd.(*Daemon).bootstrapFQDN()
      /home/chris/code/cilium/cilium/daemon/cmd/fqdn.go:172 +0xa42
  github.com/cilium/cilium/daemon/cmd.NewDaemon()
      /home/chris/code/cilium/cilium/daemon/cmd/daemon.go:556 +0x33b6
  github.com/cilium/cilium/daemon/cmd.(*DaemonSuite).SetUpTest()
      /home/chris/code/cilium/cilium/daemon/cmd/daemon_test.go:134 +0x2de
  github.com/cilium/cilium/daemon/cmd.(*DaemonEtcdSuite).SetUpTest()
      /home/chris/code/cilium/cilium/daemon/cmd/daemon_test.go:195 +0x44
  runtime.call32()
      /usr/lib/go/src/runtime/asm_amd64.s:539 +0x3a
  reflect.Value.Call()
      /usr/lib/go/src/reflect/value.go:321 +0xd3
  gopkg.in/check%2ev1.(*suiteRunner).runFixture.func1()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:730 +0x1c3
  gopkg.in/check%2ev1.(*suiteRunner).forkCall.func1()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:675 +0xd9
```

Signed-off-by: Chris Tarazi <chris@isovalent.com>
@christarazi christarazi force-pushed the pr/christarazi/fix-metrics-assertion-flake branch 2 times, most recently from fbea8ec to 453ed1e Compare June 25, 2020 18:12
@@ -150,7 +150,7 @@ type Allocator struct {
initialListDone waitChan

// idPool maintains a pool of available ids for allocation.
idPool *idpool.IDPool
idPool idpool.IDPool
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this changes anything WRT go's GC because the variable is only initialized once when the agent starts.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True; in this specific case, the impact in minimal. The general rule I try to follow is to only make a variable a pointer if it must be one. I noticed while looking at this code, it didn't need to be a pointer, and changing it so was quite trivial. I think moving forward we should also consider reducing our usage of pointers. Over time, we may find that we improve the performance of Cilium because it'll reduce the load on the GC.

This fixes a data race with `(*Daemon).policy`; the variable is overwritten in
each test run. Each policy repository has event queues which are running in the
background which accesses the repository. When the variable is overwritten with
a new instance of a policy repository, the event queues accesses race with the
write to `(*Daemon).policy`.

The fix is to first shutdown and drain the event queues before proceeding onto
the next test run.

```
WARNING: DATA RACE
Write at 0x00c0011cef48 by goroutine 329:
  github.com/cilium/cilium/daemon/cmd.(*DaemonSuite).Test_addCiliumNetworkPolicyV2()
      /home/chris/code/cilium/cilium/daemon/cmd/policy_test.go:1236 +0x5e5
  github.com/cilium/cilium/daemon/cmd.(*DaemonConsulSuite).Test_addCiliumNetworkPolicyV2()
      <autogenerated>:1 +0x4d
  runtime.call32()
      /usr/lib/go/src/runtime/asm_amd64.s:539 +0x3a
  reflect.Value.Call()
      /usr/lib/go/src/reflect/value.go:321 +0xd3
  gopkg.in/check%2ev1.(*suiteRunner).forkTest.func1()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:781 +0xa0a
  gopkg.in/check%2ev1.(*suiteRunner).forkCall.func1()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:675 +0xd9

Previous read at 0x00c0011cef48 by goroutine 210:
  github.com/cilium/cilium/daemon/cmd.(*Daemon).policyAdd()
      /home/chris/code/cilium/cilium/daemon/cmd/policy.go:430 +0x1330
  github.com/cilium/cilium/daemon/cmd.(*PolicyAddEvent).Handle()
      /home/chris/code/cilium/cilium/daemon/cmd/policy.go:217 +0xc9
  github.com/cilium/cilium/pkg/eventqueue.(*EventQueue).Run.func1()
      /home/chris/code/cilium/cilium/pkg/eventqueue/eventqueue.go:260 +0x25d
  sync.(*Once).doSlow()
      /usr/lib/go/src/sync/once.go:66 +0x103
  sync.(*Once).Do()
      /usr/lib/go/src/sync/once.go:57 +0x68

Goroutine 329 (running) created at:
  gopkg.in/check%2ev1.(*suiteRunner).forkCall()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:672 +0x44d
  gopkg.in/check%2ev1.(*suiteRunner).forkTest()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:763 +0x1b9
  gopkg.in/check%2ev1.(*suiteRunner).runTest()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:818 +0x228
  gopkg.in/check%2ev1.(*suiteRunner).run()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:624 +0x1c7
  gopkg.in/check%2ev1.Run()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/run.go:92 +0x5a
  gopkg.in/check%2ev1.RunAll()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/run.go:84 +0x127
  gopkg.in/check%2ev1.TestingT()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/run.go:72 +0x5f6
  github.com/cilium/cilium/daemon/cmd.Test()
      /home/chris/code/cilium/cilium/daemon/cmd/daemon_test.go:50 +0x38
  testing.tRunner()
      /usr/lib/go/src/testing/testing.go:991 +0x1eb

Goroutine 210 (running) created at:
  github.com/cilium/cilium/pkg/eventqueue.(*EventQueue).Run()
      /home/chris/code/cilium/cilium/pkg/eventqueue/eventqueue.go:248 +0xe6
  github.com/cilium/cilium/pkg/policy.NewPolicyRepository()
      /home/chris/code/cilium/cilium/pkg/policy/repository.go:144 +0x11d
  github.com/cilium/cilium/daemon/cmd.(*DaemonSuite).Test_addCiliumNetworkPolicyV2.func2()
      /home/chris/code/cilium/cilium/daemon/cmd/policy_test.go:1006 +0xd4
  github.com/cilium/cilium/daemon/cmd.(*DaemonSuite).Test_addCiliumNetworkPolicyV2()
      /home/chris/code/cilium/cilium/daemon/cmd/policy_test.go:1234 +0x574
  github.com/cilium/cilium/daemon/cmd.(*DaemonConsulSuite).Test_addCiliumNetworkPolicyV2()
      <autogenerated>:1 +0x4d
  runtime.call32()
      /usr/lib/go/src/runtime/asm_amd64.s:539 +0x3a
  reflect.Value.Call()
      /usr/lib/go/src/reflect/value.go:321 +0xd3
  gopkg.in/check%2ev1.(*suiteRunner).forkTest.func1()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:781 +0xa0a
  gopkg.in/check%2ev1.(*suiteRunner).forkCall.func1()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:675 +0xd9

```

Signed-off-by: Chris Tarazi <chris@isovalent.com>
@christarazi christarazi force-pushed the pr/christarazi/fix-metrics-assertion-flake branch from 453ed1e to a8191a5 Compare June 25, 2020 19:09
@christarazi
Copy link
Member Author

test-me-please

@joestringer joestringer removed this from In progress in 1.8.0 Jun 26, 2020
@joestringer joestringer removed this from Needs backport from master in 1.8.0 Jun 26, 2020
@aanm aanm merged commit 978248b into cilium:master Jun 26, 2020
@christarazi christarazi deleted the pr/christarazi/fix-metrics-assertion-flake branch June 26, 2020 18:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/daemon Impacts operation of the Cilium daemon. ci/flake This is a known failure that occurs in the tree. Please investigate me! kind/cleanup This includes no functional changes. release-note/ci This PR makes changes to the CI. release-priority/best-effort The project for target version is not a hard requirement.
Projects
No open projects
1.8.1
Awaiting triage
Development

Successfully merging this pull request may close these issues.

CI: TestEndpointAddNoLabels: metrics assertion failed
6 participants