
v1.8 backports 2020-06-29 #12332

Merged
merged 29 commits into v1.8 from pr/v1.8-backport-2020-06-29 on Jun 30, 2020

Conversation

@christarazi christarazi (Member) commented Jun 30, 2020

Once this PR is merged, you can update the PR labels via:

$ for pr in 12290 12282 12292 12299 12177 12302 12195 12307 12199 12305 12310 12252 12318 12248 12321 12049 12325; do contrib/backporting/set-labels.py $pr done 1.8; done

gandro and others added 21 commits June 29, 2020 20:50
[ upstream commit ac6a11e ]

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
Signed-off-by: Chris Tarazi <chris@isovalent.com>
[ upstream commit d075b7a ]

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
Signed-off-by: Chris Tarazi <chris@isovalent.com>
[ upstream commit 802fa78 ]

If Cilium pods are in crash loop backoff, `kub.DumpCiliumCommandOutput`
blocks for the whole context duration, which causes the subsequent
`kub.GatherLogs` call to fail immediately.

This change splits these two calls into separate goroutines.
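
For illustration only, a minimal sketch of that pattern (the dump/gather callbacks stand in for the helpers named above; everything else is hypothetical):

```go
package helpers

import (
	"context"
	"sync"
)

// gatherDiagnostics runs both collection steps in parallel so that a slow
// command-output dump (e.g. pods in CrashLoopBackOff) cannot consume the
// whole context deadline before log gathering even starts.
func gatherDiagnostics(ctx context.Context, dump, gather func(context.Context)) {
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); dump(ctx) }()   // e.g. kub.DumpCiliumCommandOutput
	go func() { defer wg.Done(); gather(ctx) }() // e.g. kub.GatherLogs
	wg.Wait()
}
```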

Signed-off-by: Maciej Kwiek <maciej@isovalent.com>
Signed-off-by: Chris Tarazi <chris@isovalent.com>
[ upstream commit b99c7b8 ]

When renewSession() attempts to re-establish a session with an etcd
peer, it may block for some period of time, currently 10s. At the
moment, this timeout is added to the LockPath() timeout because the
LockPath() context timeout is created *after* acquiring the lock.
Move the context creation before the lock acquisition so that these timers
overlap, providing a more accurate bound on the overall timeout.
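
A minimal sketch of the reordering described above (hypothetical names, not the exact Cilium code):

```go
package kvstore

import (
	"context"
	"sync"
	"time"
)

// lockPath creates the timeout context *before* acquiring the lock, so any
// time spent re-establishing the session while waiting for the lock counts
// against the same deadline instead of being added on top of it.
func lockPath(parent context.Context, mu *sync.Mutex, doLockedWork func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(parent, 1*time.Minute) // deadline starts ticking here
	defer cancel()

	mu.Lock() // may wait behind a slow renewSession()
	defer mu.Unlock()

	return doLockedWork(ctx)
}
```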

Signed-off-by: Joe Stringer <joe@cilium.io>
Signed-off-by: Chris Tarazi <chris@isovalent.com>
[ upstream commit 8524fca ]

A user reported an issue where Cilium would become unable to provision
new endpoints, with the following messages recurring in the agent logs:

[T=0s] ... level=info msg="New endpoint"
[T=0s] ... level=info msg="Resolving identity labels (blocking)
[T=43s]... level=error msg="Forcefully unlocking local kvstore lock"
[T=89s]... level=info msg="Removed endpoint"

Upon further investigation of "gops stack" output from a sysdump in the
environment, the following stack trace was observed on Cilium 1.6.8:

goroutine 589 [select, 14 minutes]:
github.com/cilium/cilium/vendor/google.golang.org/grpc/internal/transport.(*Stream).waitOnHeader(0xc00214d500, 0x8, 0xc000740d10)
        /go/src/github.com/cilium/cilium/vendor/google.golang.org/grpc/internal/transport/transport.go:245 +0xcc
github.com/cilium/cilium/vendor/google.golang.org/grpc/internal/transport.(*Stream).RecvCompress(...)
        /go/src/github.com/cilium/cilium/vendor/google.golang.org/grpc/internal/transport/transport.go:256
github.com/cilium/cilium/vendor/google.golang.org/grpc.(*csAttempt).recvMsg(0xc000740d10, 0x32ec340, 0xc001caf890, 0x0, 0xc001e52650, 0x0)
        /go/src/github.com/cilium/cilium/vendor/google.golang.org/grpc/stream.go:850 +0x70a
github.com/cilium/cilium/vendor/google.golang.org/grpc.(*clientStream).RecvMsg.func1(0xc000740d10, 0x0, 0x0)
        /go/src/github.com/cilium/cilium/vendor/google.golang.org/grpc/stream.go:715 +0x46
...
github.com/cilium/cilium/vendor/go.etcd.io/etcd/etcdserver/etcdserverpb.(*leaseClient).LeaseGrant(0xc00042a238, 0x38e6ae0, 0xc00005cb40, 0xc00150ad20, 0xc001fbaea0, 0x4
        /go/src/github.com/cilium/cilium/vendor/go.etcd.io/etcd/etcdserver/etcdserverpb/rpc.pb.go:3792 +0xd3
github.com/cilium/cilium/vendor/go.etcd.io/etcd/clientv3.(*retryLeaseClient).LeaseGrant(0xc000ec0510, 0x38e6ae0, 0xc00005cb40, 0xc00150ad20, 0x53bcba0, 0x3, 0x3, 0x1319
        /go/src/github.com/cilium/cilium/vendor/go.etcd.io/etcd/clientv3/retry.go:144 +0xeb
github.com/cilium/cilium/vendor/go.etcd.io/etcd/clientv3.(*lessor).Grant(0xc0006a8640, 0x38e6ae0, 0xc00005cb40, 0x19, 0x0, 0xc000abd680, 0xc000abd708)
        /go/src/github.com/cilium/cilium/vendor/go.etcd.io/etcd/clientv3/lease.go:216 +0x98
github.com/cilium/cilium/vendor/go.etcd.io/etcd/clientv3/concurrency.NewSession(0xc000c15860, 0xc001e52f50, 0x1, 0x1, 0x0, 0x0, 0x0)
        /go/src/github.com/cilium/cilium/vendor/go.etcd.io/etcd/clientv3/concurrency/session.go:46 +0x308
github.com/cilium/cilium/pkg/kvstore.(*etcdClient).renewLockSession(0xc000b00790, 0xc0227a56a6, 0x14dffe3633def)
        /go/src/github.com/cilium/cilium/pkg/kvstore/etcd.go:467 +0xde
github.com/cilium/cilium/pkg/kvstore.connectEtcdClient.func4(0x38e6ae0, 0xc00005d640, 0x53bd3e0, 0x2)
        /go/src/github.com/cilium/cilium/pkg/kvstore/etcd.go:583 +0x2a
github.com/cilium/cilium/pkg/controller.(*Controller).runController(0xc000c06ff0)
        /go/src/github.com/cilium/cilium/pkg/controller/controller.go:203 +0x9e1
created by github.com/cilium/cilium/pkg/controller.(*Manager).UpdateController
        /go/src/github.com/cilium/cilium/pkg/controller/manager.go:112 +0xae0

In particular, it's useful to note:
* The goroutine has been blocking for 14 minutes attempting to make progress
  on re-establishing the etcd session.
* This path is hit by a controller responsible for handling etcd session
  renewal in the event of etcd connectivity issues. It would not be triggered
  by initial etcd connectivity failures, only when etcd connectivity was
  successfully established then subsequently lost.
* NewSession() attempts to pick one of the configured etcd peers to connect to,
  and will block until connectivity to that peer is restored. It will not by
  itself back off and re-attempt connectivity to another node in the etcd
  cluster.
* This path is in the critical section for writelock of `etcdClient.RWMutex`
  so will block out all etcd client reads.

The above is consistent with the agent logs: effectively, a new endpoint is
scheduled and Cilium attempts to allocate an identity for it. This process
requires the kvstore and blocks on the lock. Given that it makes no progress,
we first see the lock liveness check kick in to forcefully unlock the path
lock (`Forcefully unlocking local kvstore lock`), and later we see
cilium-cni / kubelet give up on creating the endpoint and remove it again.

This patch fixes the issue by introducing a timeout when attempting to renew
the session. Each time `NewSession()` is called, the etcd client library will
pick an etcd peer to connect to. The intent behind the timeout is to provide
a way for Cilium to detect when it is attempting to connect to an unhealthy
peer, and try to reconnect to one of the other peers in the etcd cluster in
the hopes that that other peer will be healthy.

In the event of an etcd connectivity outage where one of three etcd peers
is unhealthy, we expect that the remaining two can retain quorum and
continue to operate despite the outage of the third peer. In this case, if
Cilium was attempting to connect to the third (unhealthy) peer, Cilium
would previously block indefinitely. With this patch, Cilium will time out
after statusCheckTimeout (10s) and re-establish a session to the remaining
etcd peers, thereby unblocking subsequent endpoint provisioning.
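
As a rough sketch of bounding the renewal (assuming the etcd clientv3 `concurrency.WithContext` session option; `statusCheckTimeout` comes from the description above, the rest is illustrative):

```go
package kvstore

import (
	"context"
	"time"

	"go.etcd.io/etcd/clientv3"
	"go.etcd.io/etcd/clientv3/concurrency"
)

const statusCheckTimeout = 10 * time.Second

// renewSession bounds NewSession() with a timeout so that a connection
// attempt to an unhealthy peer fails fast and another peer can be tried,
// instead of blocking the kvstore write lock indefinitely.
func renewSession(ctx context.Context, client *clientv3.Client) (*concurrency.Session, error) {
	timedCtx, cancel := context.WithTimeout(ctx, statusCheckTimeout)
	defer cancel()
	return concurrency.NewSession(client, concurrency.WithContext(timedCtx))
}
```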

Signed-off-by: Joe Stringer <joe@cilium.io>
Co-authored-by: André Martins <andre@cilium.io>
Signed-off-by: André Martins <andre@cilium.io>
Signed-off-by: Chris Tarazi <chris@isovalent.com>
[ upstream commit f62bd84 ]

This option exists in the 'config' chart, so we should set it with
`config.enableIdentityMark` and not with `global.enableIdentityMark`.

Fixes: 1cc79c1 ("cilium: chaining mode skb->mark can be mangled by iptables allow opt-out")
Signed-off-by: André Martins <andre@cilium.io>
Signed-off-by: Chris Tarazi <chris@isovalent.com>
[ upstream commit 76df134 ]

Signed-off-by: Michal Rostecki <mrostecki@opensuse.org>
Signed-off-by: Chris Tarazi <chris@isovalent.com>
[ upstream commit 629b989 ]

Signed-off-by: Michal Rostecki <mrostecki@opensuse.org>
Signed-off-by: Chris Tarazi <chris@isovalent.com>
[ upstream commit 5dbe413 ]

This change allows overriding the default 5m timeout with the
--crd-wait-timeout option.

Signed-off-by: Michal Rostecki <mrostecki@opensuse.org>
Signed-off-by: Chris Tarazi <chris@isovalent.com>
[ upstream commit 0c856f7 ]

The value of bpfMasquerade should come from the config chart only.
However, a few places are still referring to global.

Relates 8ec15e2

Signed-off-by: Tam Mach <sayboras@yahoo.com>
Signed-off-by: Chris Tarazi <chris@isovalent.com>
[ upstream commit dc62fbd ]

For testing purposes, it's useful to pass the owner of the endpoint to
`(*Daemon).createEndpoint`. Since an owner must implement the
`regeneration.Owner` interface, we can utilize this fact to mock out the
enqueuing of endpoint regeneration. In other words, we avoid the state
transitions caused by endpoint regeneration because the enqueuing has
been "blocked" by mocking it out. This allows us to assert on specific
state transitions without worrying about endpoint regeneration kicking
in and transitioning endpoint state, unintentionally.

We also take advantage of the fact that this test suite (DaemonSuite)
already implements the `regeneration.Owner` interface for mocking purposes,
even though it was not previously used in that fashion.
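
Conceptually, the mock amounts to something like the following sketch (the interface and signature are simplified and hypothetical; it only shows how a no-op QueueEndpointBuild keeps regeneration from transitioning endpoint state during the test):

```go
package cmd

import "context"

// regenOwner is a simplified stand-in for the regeneration.Owner interface
// mentioned above (the real signature may differ).
type regenOwner interface {
	QueueEndpointBuild(ctx context.Context, epID uint64) (func(), error)
}

// mockOwner satisfies the interface but never enqueues a build, so endpoint
// state is not changed behind the test's back by background regeneration.
type mockOwner struct{}

func (mockOwner) QueueEndpointBuild(ctx context.Context, epID uint64) (func(), error) {
	return nil, nil // regeneration is effectively "blocked"
}

var _ regenOwner = mockOwner{} // compile-time check that the mock fits
```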

This commit also fixes two incorrect assertions in the test:

1) Remove premature assertion on metrics because the
   `TestEndpointAddNoLabels` test is about creating an endpoint without
   any initial labels set. The endpoint's initial label is assigned
   (`reserved:init`) within the call of `(*Daemon).createEndpoint`, so
   it is very unlikely that we catch the endpoint in any other state
   besides "ready". Therefore, we remove this premature assertion as it
   doesn't test anything useful.
2) The final state transition in `TestEndpointAddNoLabels` should be
   "regenerating" -> "ready" because now
   `(*Daemon).QueueEndpointBuild()` is using the implementation (not
   mocked out, like in the rest of the suite).

Fixes: 4a4a59b ("daemon: Add assertions for endpoint state metrics")
Fixes: 19030a7 ("daemon: Fix waiting on metrics in endpoint_test.go")

Signed-off-by: Chris Tarazi <chris@isovalent.com>
[ upstream commit 26c01a8 ]

The data race stemmed from `ReallocatePool()`, a function that's only used
in testing code, not in implementation code. Therefore, we don't need a
mutex to protect access to the variable itself. The data race is fixed by
eliminating the overwrite of `pool` and releasing the IDs instead, thus
reusing the same instance of `pool`. Operations on `pool` are already
protected by a mutex.

Along with fixing the following data race, this commit also changes the
usage of idpool to be a value rather than a pointer. idpool doesn't need
to be a pointer as the only consumer of it is the endpoint manager,
which is essentially used as a singleton during the lifetime of Cilium.
Using values rather than pointers helps reduce the load on the Golang
garbage collector. In this case, the impact is small, but useful
nonetheless.
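
A simplified, self-contained sketch of the idea (the pool type and helpers below are stand-ins, not the actual idallocator code):

```go
package idallocator

import "sync"

// idPool is a minimal stand-in for pkg/idpool: its internal mutex protects
// the set of free IDs, so callers never need to guard the package variable
// below.
type idPool struct {
	mu   sync.Mutex
	free map[uint16]struct{}
}

func (p *idPool) Release(id uint16) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.free[id] = struct{}{}
}

// pool is created once and never reassigned; the old code's reassignment in
// ReallocatePool() is what raced with concurrent Release() calls.
var pool = &idPool{free: make(map[uint16]struct{})}

// reallocatePool returns leased IDs to the existing pool instead of
// replacing the pool instance.
func reallocatePool(leased []uint16) {
	for _, id := range leased {
		pool.Release(id)
	}
}
```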

```
WARNING: DATA RACE
Write at 0x0000050336b0 by goroutine 243:
  github.com/cilium/cilium/pkg/endpointmanager/idallocator.ReallocatePool()
      /home/chris/code/cilium/cilium/pkg/endpointmanager/idallocator/allocator.go:34 +0x14e
  github.com/cilium/cilium/pkg/endpointmanager.(*EndpointManager).RemoveAll()
      /home/chris/code/cilium/cilium/pkg/endpointmanager/manager.go:301 +0x80
  github.com/cilium/cilium/daemon/cmd.(*DaemonSuite).TearDownTest()
      /home/chris/code/cilium/cilium/daemon/cmd/daemon_test.go:167 +0x8f
  github.com/cilium/cilium/daemon/cmd.(*DaemonEtcdSuite).TearDownTest()
      /home/chris/code/cilium/cilium/daemon/cmd/daemon_test.go:209 +0x44
  runtime.call32()
      /usr/lib/go/src/runtime/asm_amd64.s:539 +0x3a
  reflect.Value.Call()
      /usr/lib/go/src/reflect/value.go:321 +0xd3
  gopkg.in/check%2ev1.(*suiteRunner).runFixture.func1()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:730 +0x1c3
  gopkg.in/check%2ev1.(*suiteRunner).forkCall.func1()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:675 +0xd9

Previous read at 0x0000050336b0 by goroutine 196:
  github.com/cilium/cilium/pkg/endpointmanager/idallocator.Release()
      /home/chris/code/cilium/cilium/pkg/endpointmanager/idallocator/allocator.go:72 +0x3e
  github.com/cilium/cilium/pkg/endpointmanager.(*EndpointManager).ReleaseID()
      /home/chris/code/cilium/cilium/pkg/endpointmanager/manager.go:287 +0x50
  github.com/cilium/cilium/pkg/endpoint.(*Endpoint).Unexpose.func1()
      /home/chris/code/cilium/cilium/pkg/endpoint/manager.go:143 +0xc4

Goroutine 243 (running) created at:
  gopkg.in/check%2ev1.(*suiteRunner).forkCall()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:672 +0x44d
  gopkg.in/check%2ev1.(*suiteRunner).runFunc()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:682 +0x83
  gopkg.in/check%2ev1.(*suiteRunner).runFixture()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:726 +0x3d
  gopkg.in/check%2ev1.(*suiteRunner).runFixtureWithPanic()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:744 +0x82
  gopkg.in/check%2ev1.(*suiteRunner).forkTest.func1()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:782 +0xa0b
  gopkg.in/check%2ev1.(*suiteRunner).forkCall.func1()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:675 +0xd9

Goroutine 196 (finished) created at:
  github.com/cilium/cilium/pkg/endpoint.(*Endpoint).Unexpose()
      /home/chris/code/cilium/cilium/pkg/endpoint/manager.go:132 +0xd1
  github.com/cilium/cilium/pkg/endpoint.(*Endpoint).Delete()
      /home/chris/code/cilium/cilium/pkg/endpoint/endpoint.go:2193 +0x174
  github.com/cilium/cilium/daemon/cmd.(*Daemon).deleteEndpointQuiet()
      /home/chris/code/cilium/cilium/daemon/cmd/endpoint.go:633 +0xe6
  github.com/cilium/cilium/daemon/cmd.(*Daemon).errorDuringCreation()
      /home/chris/code/cilium/cilium/daemon/cmd/endpoint.go:188 +0x4d
  github.com/cilium/cilium/daemon/cmd.(*Daemon).createEndpoint()
      /home/chris/code/cilium/cilium/daemon/cmd/endpoint.go:443 +0x177f
  github.com/cilium/cilium/daemon/cmd.(*DaemonSuite).TestUpdateLabelsFailed()
      /home/chris/code/cilium/cilium/daemon/cmd/endpoint_test.go:149 +0x124
  github.com/cilium/cilium/daemon/cmd.(*DaemonEtcdSuite).TestUpdateLabelsFailed()
      <autogenerated>:1 +0x4d
  runtime.call32()
      /usr/lib/go/src/runtime/asm_amd64.s:539 +0x3a
  reflect.Value.Call()
      /usr/lib/go/src/reflect/value.go:321 +0xd3
  gopkg.in/check%2ev1.(*suiteRunner).forkTest.func1()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:781 +0xa0a
  gopkg.in/check%2ev1.(*suiteRunner).forkCall.func1()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:675 +0xd9
```

Signed-off-by: Chris Tarazi <chris@isovalent.com>
[ upstream commit 7be4ec2 ]

This commit adds optimizations that cut the execution time of
IDPoolTestSuite by roughly half, from ~1.6 seconds to ~0.8 seconds. This
optimization involves eliminating superfluous memory allocations.

(While shaving off 0.8 seconds [from a test suite that takes on the
order of minutes] does not sound like much, this commit can serve as a
template for future optimizations around tests which require a large
number of memory allocations.)

These allocations come from the `Assert` function in the Go check package
which we use for assertions. This function is costly when many goroutines
are running in parallel performing assertions: each assertion involves a
rather large memory allocation when processing the arguments to the
function, before actually getting to the assertion itself. This is quite
wasteful in this specific test suite where we're simply comparing integer
and boolean types.

This commit implements the optimization by replacing the call to
`Assert` with a simple equality check.
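
Concretely, the change in the hot loop is roughly the following fragment (the same loop appears in the profile listing further down; the error-reporting call here is illustrative):

```go
for i := 1; i <= maxID; i++ {
	id := p.AllocateID()
	// Was: c.Assert(id, Not(Equals), NoID) -- allocates on every call.
	if id == NoID {
		c.Errorf("AllocateID returned NoID at iteration %d", i)
	}
	allocated <- id
}
```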

See the cpu and mem profiles below.

Before optimization; cpu profile:
```
$ go test -c ./pkg/idpool && go test ./pkg/idpool -cpuprofile cpu.out -count 1
ok      github.com/cilium/cilium/pkg/idpool     1.612s

$ go tool pprof ./idpool.test cpu.out
File: idpool.test
Type: cpu
Time: Jun 20, 2020 at 4:39pm (PDT)
Duration: 1.60s, Total samples = 1.39s (86.77%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top10
Showing nodes accounting for 770ms, 55.40% of 1390ms total
Showing top 10 nodes out of 129
      flat  flat%   sum%        cum   cum%
     320ms 23.02% 23.02%      320ms 23.02%  runtime.procyield
      70ms  5.04% 28.06%      200ms 14.39%  runtime.mallocgc    <--- Nearly 15% of CPU time is spent in memory allocations
      70ms  5.04% 33.09%       70ms  5.04%  runtime.unlock
      60ms  4.32% 37.41%      310ms 22.30%  runtime.findrunnable
      50ms  3.60% 41.01%      310ms 22.30%  runtime.lock
      50ms  3.60% 44.60%       50ms  3.60%  runtime.runqget
      40ms  2.88% 47.48%      270ms 19.42%  gopkg.in/check%2ev1.(*C).internalCheck
      40ms  2.88% 50.36%       60ms  4.32%  runtime.heapBitsSetType
      40ms  2.88% 53.24%      150ms 10.79%  sync.(*Mutex).Unlock
      30ms  2.16% 55.40%       30ms  2.16%  runtime.casgstatus
```

Before optimization; memory profile:
```
$ go test -c ./pkg/idpool && go test ./pkg/idpool -memprofile mem.out -count 1
ok      github.com/cilium/cilium/pkg/idpool     1.303s

$ go tool pprof ./idpool.test mem.out
File: idpool.test
Type: alloc_space
Time: Jun 20, 2020 at 4:41pm (PDT)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top10
Showing nodes accounting for 341.51MB, 100% of 341.51MB total
      flat  flat%   sum%        cum   cum%
  225.01MB 65.89% 65.89%   315.51MB 92.39%  gopkg.in/check%2ev1.(*C).internalCheck
   90.50MB 26.50% 92.39%    90.50MB 26.50%  gopkg.in/check%2ev1.(*notChecker).Info
      18MB  5.27% 97.66%       18MB  5.27%  gopkg.in/check%2ev1.Not (inline)
       8MB  2.34%   100%   225.01MB 65.89%  github.com/cilium/cilium/pkg/idpool.(*IDPoolTestSuite).TestAllocateID.func1
         0     0%   100%   116.50MB 34.11%  github.com/cilium/cilium/pkg/idpool.(*IDPoolTestSuite).TestAllocateID
         0     0%   100%   315.51MB 92.39%  gopkg.in/check%2ev1.(*C).Assert
         0     0%   100%   116.50MB 34.11%  gopkg.in/check%2ev1.(*suiteRunner).forkCall.func1
         0     0%   100%   116.50MB 34.11%  gopkg.in/check%2ev1.(*suiteRunner).forkTest.func1
         0     0%   100%   116.50MB 34.11%  reflect.Value.Call
         0     0%   100%   116.50MB 34.11%  reflect.Value.call
(pprof) list internalCheck
Total: 341.51MB
ROUTINE ======================== gopkg.in/check%2ev1.(*C).internalCheck in /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/helpers.go
  225.01MB   315.51MB (flat, cum) 92.39% of Total
         .          .    192:                   comment = c
         .          .    193:                   args = args[:len(args)-1]
         .          .    194:           }
         .          .    195:   }
         .          .    196:
     135MB      135MB    197:   params := append([]interface{}{obtained}, args...)
         .    90.50MB    198:   info := checker.Info()
         .          .    199:
         .          .    200:   if len(params) != len(info.Params) {
         .          .    201:           names := append([]string{info.Params[0], info.Name}, info.Params[1:]...)
         .          .    202:           c.logCaller(2)
         .          .    203:           c.logString(fmt.Sprintf("%s(%s):", funcName, strings.Join(names, ", ")))
         .          .    204:           c.logString(fmt.Sprintf("Wrong number of parameters for %s: want %d, got %d", info.Name, len(names), len(params)+1))
         .          .    205:           c.logNewLine()
         .          .    206:           c.Fail()
         .          .    207:           return false
         .          .    208:   }
         .          .    209:
         .          .    210:   // Copy since it may be mutated by Check.
      90MB       90MB    211:   names := append([]string{}, info.Params...)
         .          .    212:
         .          .    213:   // Do the actual check.
         .          .    214:   result, error := checker.Check(params, names)
         .          .    215:   if !result || error != "" {
         .          .    216:           c.logCaller(2)
(pprof) list TestAllocateID
Total: 341.51MB
ROUTINE ======================== github.com/cilium/cilium/pkg/idpool.(*IDPoolTestSuite).TestAllocateID in /home/chris/code/cilium/cilium/pkg/idpool/idpool_test.go
         0   116.50MB (flat, cum) 34.11% of Total
         .          .    342:           allocators.Wait()
         .          .    343:           close(allocated)
         .          .    344:   }()
         .          .    345:
         .          .    346:   for id := range allocated {
         .   116.50MB    347:           c.Assert(p.Insert(id), Equals, true)
         .          .    348:   }
         .          .    349:}
ROUTINE ======================== github.com/cilium/cilium/pkg/idpool.(*IDPoolTestSuite).TestAllocateID.func1 in /home/chris/code/cilium/cilium/pkg/idpool/idpool_test.go
       8MB   225.01MB (flat, cum) 65.89% of Total
         .          .    329:   for i := 0; i < 256; i++ {
         .          .    330:           allocators.Add(1)
         .          .    331:           go func() {
         .          .    332:                   for i := 1; i <= maxID; i++ {
         .          .    333:                           id := p.AllocateID()
       8MB   225.01MB    334:                           c.Assert(id, Not(Equals), NoID)
         .          .    335:                           allocated <- id
         .          .    336:                   }
         .          .    337:                   allocators.Done()
         .          .    338:           }()
         .          .    339:   }
```

After optimization; cpu profile:
```
$ go test -c ./pkg/idpool && go test ./pkg/idpool -cpuprofile cpu.out -count 1
ok      github.com/cilium/cilium/pkg/idpool     0.807s   <--- Notice the time compared to the previous test run

$ go tool pprof ./idpool.test cpu.out
File: idpool.test
Type: cpu
Time: Jun 20, 2020 at 4:53pm (PDT)
Duration: 800.83ms, Total samples = 660ms (82.41%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top10
Showing nodes accounting for 430ms, 65.15% of 660ms total
Showing top 10 nodes out of 63
      flat  flat%   sum%        cum   cum%
     160ms 24.24% 24.24%      160ms 24.24%  runtime.procyield
      50ms  7.58% 31.82%      160ms 24.24%  sync.(*Mutex).lockSlow
      40ms  6.06% 37.88%       40ms  6.06%  runtime.casgstatus
      40ms  6.06% 43.94%       40ms  6.06%  runtime.unlock
      40ms  6.06% 50.00%      150ms 22.73%  sync.(*Mutex).Unlock
      20ms  3.03% 53.03%      300ms 45.45%  github.com/cilium/cilium/pkg/idpool.(*IDPool).AllocateID
      20ms  3.03% 56.06%      130ms 19.70%  github.com/cilium/cilium/pkg/idpool.(*IDPool).Insert
      20ms  3.03% 59.09%       20ms  3.03%  runtime.cansemacquire
      20ms  3.03% 62.12%       40ms  6.06%  runtime.findrunnable
      20ms  3.03% 65.15%       20ms  3.03%  runtime.futex
```

By looking at the test run times, you can see it was nearly cut in half.
We also no longer see any measurable time spent in `runtime.mallocgc`.

Signed-off-by: Chris Tarazi <chris@isovalent.com>
[ upstream commit 18e5c60 ]

This allows us to fix the following data race. The data race occurred
because the creation of the EndpointManager was racing with the controllers
running in the background once the `Daemon` object was created.

```
$ go test ./daemon/cmd -race -v -count 1 -check.f DaemonEtcdSuite.TestEndpointAddNoLabels

=== RUN   Test
==================
WARNING: DATA RACE
Write at 0x00c0007612f8 by goroutine 103:
  github.com/cilium/cilium/daemon/cmd.(*DaemonSuite).SetUpTest()
      /home/chris/code/cilium/cilium/daemon/cmd/daemon_test.go:145 +0x724
  github.com/cilium/cilium/daemon/cmd.(*DaemonEtcdSuite).SetUpTest()
      /home/chris/code/cilium/cilium/daemon/cmd/daemon_test.go:195 +0x44
  runtime.call32()
      /usr/lib/go/src/runtime/asm_amd64.s:539 +0x3a
  reflect.Value.Call()
      /usr/lib/go/src/reflect/value.go:321 +0xd3
  gopkg.in/check%2ev1.(*suiteRunner).runFixture.func1()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:730 +0x1c3
  gopkg.in/check%2ev1.(*suiteRunner).forkCall.func1()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:675 +0xd9

Previous read at 0x00c0007612f8 by goroutine 122:
  github.com/cilium/cilium/daemon/cmd.(*Daemon).bootstrapFQDN.func1()
      /home/chris/code/cilium/cilium/daemon/cmd/fqdn.go:191 +0x3ea
  github.com/cilium/cilium/pkg/controller.(*Controller).runController()
      /home/chris/code/cilium/cilium/pkg/controller/controller.go:205 +0xc6f

Goroutine 103 (running) created at:
  gopkg.in/check%2ev1.(*suiteRunner).forkCall()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:672 +0x44d
  gopkg.in/check%2ev1.(*suiteRunner).runFunc()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:682 +0x83
  gopkg.in/check%2ev1.(*suiteRunner).runFixture()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:726 +0x3d
  gopkg.in/check%2ev1.(*suiteRunner).runFixtureWithPanic()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:744 +0x82
  gopkg.in/check%2ev1.(*suiteRunner).forkTest.func1()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:769 +0x28c
  gopkg.in/check%2ev1.(*suiteRunner).forkCall.func1()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:675 +0xd9

Goroutine 122 (running) created at:
  github.com/cilium/cilium/pkg/controller.(*Manager).updateController()
      /home/chris/code/cilium/cilium/pkg/controller/manager.go:120 +0xfb5
  github.com/cilium/cilium/pkg/controller.(*Manager).UpdateController()
      /home/chris/code/cilium/cilium/pkg/controller/manager.go:59 +0xc7d
  github.com/cilium/cilium/daemon/cmd.(*Daemon).bootstrapFQDN()
      /home/chris/code/cilium/cilium/daemon/cmd/fqdn.go:172 +0xa42
  github.com/cilium/cilium/daemon/cmd.NewDaemon()
      /home/chris/code/cilium/cilium/daemon/cmd/daemon.go:556 +0x33b6
  github.com/cilium/cilium/daemon/cmd.(*DaemonSuite).SetUpTest()
      /home/chris/code/cilium/cilium/daemon/cmd/daemon_test.go:134 +0x2de
  github.com/cilium/cilium/daemon/cmd.(*DaemonEtcdSuite).SetUpTest()
      /home/chris/code/cilium/cilium/daemon/cmd/daemon_test.go:195 +0x44
  runtime.call32()
      /usr/lib/go/src/runtime/asm_amd64.s:539 +0x3a
  reflect.Value.Call()
      /usr/lib/go/src/reflect/value.go:321 +0xd3
  gopkg.in/check%2ev1.(*suiteRunner).runFixture.func1()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:730 +0x1c3
  gopkg.in/check%2ev1.(*suiteRunner).forkCall.func1()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:675 +0xd9
```

Signed-off-by: Chris Tarazi <chris@isovalent.com>
[ upstream commit 978248b ]

This fixes a data race with `(*Daemon).policy`; the variable is overwritten
in each test run. Each policy repository has event queues running in the
background that access the repository. When the variable is overwritten with
a new instance of a policy repository, the event queues' accesses race with
the write to `(*Daemon).policy`.

The fix is to first shut down and drain the event queues before proceeding
to the next test run.
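
In test teardown, that looks roughly like the following sketch (the queue interface and method names are assumptions for illustration, not the exact pkg/eventqueue API):

```go
package cmd

// eventQueue captures the two operations the teardown needs.
type eventQueue interface {
	Stop()
	WaitToBeDrained()
}

// drainPolicyQueues stops the repository's event queues and waits for
// in-flight events to finish before the next test overwrites
// (*Daemon).policy with a fresh repository.
func drainPolicyQueues(queues ...eventQueue) {
	for _, q := range queues {
		q.Stop()            // stop accepting new events
		q.WaitToBeDrained() // let queued events finish using the old repo
	}
}
```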

```
WARNING: DATA RACE
Write at 0x00c0011cef48 by goroutine 329:
  github.com/cilium/cilium/daemon/cmd.(*DaemonSuite).Test_addCiliumNetworkPolicyV2()
      /home/chris/code/cilium/cilium/daemon/cmd/policy_test.go:1236 +0x5e5
  github.com/cilium/cilium/daemon/cmd.(*DaemonConsulSuite).Test_addCiliumNetworkPolicyV2()
      <autogenerated>:1 +0x4d
  runtime.call32()
      /usr/lib/go/src/runtime/asm_amd64.s:539 +0x3a
  reflect.Value.Call()
      /usr/lib/go/src/reflect/value.go:321 +0xd3
  gopkg.in/check%2ev1.(*suiteRunner).forkTest.func1()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:781 +0xa0a
  gopkg.in/check%2ev1.(*suiteRunner).forkCall.func1()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:675 +0xd9

Previous read at 0x00c0011cef48 by goroutine 210:
  github.com/cilium/cilium/daemon/cmd.(*Daemon).policyAdd()
      /home/chris/code/cilium/cilium/daemon/cmd/policy.go:430 +0x1330
  github.com/cilium/cilium/daemon/cmd.(*PolicyAddEvent).Handle()
      /home/chris/code/cilium/cilium/daemon/cmd/policy.go:217 +0xc9
  github.com/cilium/cilium/pkg/eventqueue.(*EventQueue).Run.func1()
      /home/chris/code/cilium/cilium/pkg/eventqueue/eventqueue.go:260 +0x25d
  sync.(*Once).doSlow()
      /usr/lib/go/src/sync/once.go:66 +0x103
  sync.(*Once).Do()
      /usr/lib/go/src/sync/once.go:57 +0x68

Goroutine 329 (running) created at:
  gopkg.in/check%2ev1.(*suiteRunner).forkCall()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:672 +0x44d
  gopkg.in/check%2ev1.(*suiteRunner).forkTest()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:763 +0x1b9
  gopkg.in/check%2ev1.(*suiteRunner).runTest()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:818 +0x228
  gopkg.in/check%2ev1.(*suiteRunner).run()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:624 +0x1c7
  gopkg.in/check%2ev1.Run()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/run.go:92 +0x5a
  gopkg.in/check%2ev1.RunAll()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/run.go:84 +0x127
  gopkg.in/check%2ev1.TestingT()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/run.go:72 +0x5f6
  github.com/cilium/cilium/daemon/cmd.Test()
      /home/chris/code/cilium/cilium/daemon/cmd/daemon_test.go:50 +0x38
  testing.tRunner()
      /usr/lib/go/src/testing/testing.go:991 +0x1eb

Goroutine 210 (running) created at:
  github.com/cilium/cilium/pkg/eventqueue.(*EventQueue).Run()
      /home/chris/code/cilium/cilium/pkg/eventqueue/eventqueue.go:248 +0xe6
  github.com/cilium/cilium/pkg/policy.NewPolicyRepository()
      /home/chris/code/cilium/cilium/pkg/policy/repository.go:144 +0x11d
  github.com/cilium/cilium/daemon/cmd.(*DaemonSuite).Test_addCiliumNetworkPolicyV2.func2()
      /home/chris/code/cilium/cilium/daemon/cmd/policy_test.go:1006 +0xd4
  github.com/cilium/cilium/daemon/cmd.(*DaemonSuite).Test_addCiliumNetworkPolicyV2()
      /home/chris/code/cilium/cilium/daemon/cmd/policy_test.go:1234 +0x574
  github.com/cilium/cilium/daemon/cmd.(*DaemonConsulSuite).Test_addCiliumNetworkPolicyV2()
      <autogenerated>:1 +0x4d
  runtime.call32()
      /usr/lib/go/src/runtime/asm_amd64.s:539 +0x3a
  reflect.Value.Call()
      /usr/lib/go/src/reflect/value.go:321 +0xd3
  gopkg.in/check%2ev1.(*suiteRunner).forkTest.func1()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:781 +0xa0a
  gopkg.in/check%2ev1.(*suiteRunner).forkCall.func1()
      /home/chris/code/cilium/cilium/vendor/gopkg.in/check.v1/check.go:675 +0xd9

```

Signed-off-by: Chris Tarazi <chris@isovalent.com>
[ upstream commit 8e1a7fa ]

If the KVStore connectivity is not reliable during the endpoint restore
process, Cilium can end up with an endpoint stuck in the 'restoring' state
if the endpoint's security identity resolution fails.
Adding a controller makes sure Cilium retries fetching an identity for that
endpoint until the endpoint is removed or connectivity with the allocator
is restored.
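
As a sketch, such a retry follows the usual pkg/controller pattern (the controller name, interval and callback below are assumptions, not the backported code):

```go
package cmd

import (
	"context"
	"time"

	"github.com/cilium/cilium/pkg/controller"
)

// restoreIdentity keeps retrying identity resolution for a restored endpoint
// until it succeeds or the endpoint goes away; the controller manager
// handles the backoff between attempts.
func restoreIdentity(mgr *controller.Manager, resolve func(context.Context) error) {
	mgr.UpdateController("restore-endpoint-identity",
		controller.ControllerParams{
			DoFunc:      resolve, // returns an error until the allocator is reachable
			RunInterval: 30 * time.Second,
		})
}
```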

Signed-off-by: André Martins <andre@cilium.io>
Signed-off-by: Chris Tarazi <chris@isovalent.com>
[ upstream commit 54d2175 ]

If the KVStore connectivity is not reliable during the endpoint restore
process, Cilium can end up with an endpoint stuck in the 'restoring' state
if the global security identities sync fails or times out.
Adding a controller makes sure Cilium waits until the global security
identities are synced, or until the endpoint is removed, before restoring
the endpoint.

Signed-off-by: André Martins <andre@cilium.io>
Signed-off-by: Chris Tarazi <chris@isovalent.com>
[ upstream commit 8390305 ]

With the introduction of CiliumClusterwideNetworkPolicies, a policy has the
ability to select the 'reserved:health' endpoints. If a user applies a
policy such as [0], it will block all connectivity for the health endpoints.
As this is not the expected behavior, and Cilium will report that
connectivity across Cilium pods is not working, the user would need to
deploy a CCNP that allows connectivity between health endpoints. This can
be done with [1].

This commit introduces 'reserved:health' as an entity that can be
selected by CNP and / or CCNP.

[0]

Signed-off-by: Chris Tarazi <chris@isovalent.com>
[ upstream commit c93459e ]

Under heavy load, the round-trip-time (RTT) for DNS requests between a
TCP client and a DNS proxy may exceed the 100ms timeout specified when
creating the client in the dnsproxy tests.

This was observed on the test PR #12298, with an RTT value going up to
296ms (under exceptional memory strain).

This might be the cause of the rare flakes reported in #12042. Let's
increase this timeout. The timeout is only used a couple of times in the
tests, so increasing it by a few hundred milliseconds has no visible impact.
And because we expect all requests from the TCP client to succeed at L4
anyway (i.e. it should never time out in our tests), this should not prolong
the execution of the tests at all in the normal case.

Let's also retrieve and print the RTT value for that request in case of
error, to get more info if this change were not enough to fix the flake.
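
With the miekg/dns client used in such tests, the relevant knobs look roughly like this (the timeout value and server address are placeholders, not the values chosen in this commit):

```go
package dnsproxy

import (
	"fmt"
	"time"

	"github.com/miekg/dns"
)

// queryTCP shows the two pieces discussed above: a more generous client
// timeout, and reporting the measured RTT when an exchange fails.
func queryTCP(server, name string) error {
	client := &dns.Client{Net: "tcp", Timeout: 500 * time.Millisecond}
	msg := new(dns.Msg)
	msg.SetQuestion(dns.Fqdn(name), dns.TypeA)

	_, rtt, err := client.Exchange(msg, server)
	if err != nil {
		return fmt.Errorf("exchange failed (rtt=%v): %w", rtt, err)
	}
	return nil
}
```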

Hopefully fixes: #12042
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Chris Tarazi <chris@isovalent.com>
[ upstream commit 1bd4624 ]

On the Go side we get the number of CPUs from
/sys/devices/system/cpu/possible. On the BPF side, we use $(nproc --all).
nproc calls num_processors() from gnulib. That function, however, may not
always return the value from the /sys file above.

Instead, we should use the exact same source as the Go side to ensure both
sides have the same value and avoid issues later on. See #12070 for details.

The __NR_CPUS__ values in test/bpf/Makefile and bpf/Makefile.bpf do not need
to be in sync with the Go values because these files are only used for unit
tests, sparse, and compile testing.
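
For reference, a small sketch of deriving the CPU count from the same sysfs file the Go side uses (parsing is simplified to the common "0-N" form; not the actual Cilium helper):

```go
package probes

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// possibleCPUs parses /sys/devices/system/cpu/possible, e.g. "0-7\n" -> 8.
// Only the simple single-range form is handled in this sketch.
func possibleCPUs() (int, error) {
	data, err := os.ReadFile("/sys/devices/system/cpu/possible")
	if err != nil {
		return 0, err
	}
	s := strings.TrimSpace(string(data))
	parts := strings.Split(s, "-")
	last, err := strconv.Atoi(parts[len(parts)-1])
	if err != nil {
		return 0, fmt.Errorf("unexpected format %q: %w", s, err)
	}
	return last + 1, nil
}
```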

Fixes: 8191b16 ("bpf: Use `nproc --all` for __NR_CPUS__")
Signed-off-by: Paul Chaignon <paul@cilium.io>
Signed-off-by: Chris Tarazi <chris@isovalent.com>
[ upstream commit 8cf1e20 ]

This commit fixes a typo that prevents LOCKDEBUG from working for the
`docker-plugin-image` make target.

Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
Signed-off-by: Chris Tarazi <chris@isovalent.com>
@christarazi christarazi requested review from a team as code owners June 30, 2020 04:04
@christarazi christarazi added the backport/1.8 and kind/backports (This PR provides functionality previously merged into master.) labels Jun 30, 2020
rolinh and others added 4 commits June 30, 2020 00:22
[ upstream commit e427a09 ]

The current implementation of `CmdRes` allows performing a number of
operations on the standard output stream of a command. However, these
operations are not available for the standard error stream.

This commit adds a new type, `CmdStreamBuffer`, which is a superset of
`Buffer`. It has basic operations to process the buffered stream, namely
`ByLines()`, `KVOutput()`, `Filter()`, `FilterLineJSONPath()` and
`FilterLines()`. These methods, which were part of `CmdRes`, are still
available on `CmdRes`, but they have been updated to delegate to the
corresponding methods of the standard output stream.

Two new methods have been added to `CmdRes`: `GetStdOut()`, which returns
a `CmdStreamBuffer` for the standard output stream, and `GetStdErr()`,
which returns a `CmdStreamBuffer` for the standard error stream.
The new methods `Stdout()` and `Stderr()` return the respective buffer
contents as strings. The method `Output()`, which returned the `CmdRes`
stdout, has been removed to avoid any confusion.

All Go code in tests has been adapted to these changes.
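
Typical consumption of the split streams then looks roughly like this sketch (the interface below is a stand-in for `*CmdRes`, using only the method names from the description above; the check itself is illustrative):

```go
package helpers

import (
	"fmt"
	"strings"
)

// cmdResult is a stand-in for the framework's *CmdRes.
type cmdResult interface {
	Stdout() string
	Stderr() string
	GetStdErr() interface{ ByLines() []string }
}

// checkNoErrors inspects stdout as a whole and stderr line by line.
func checkNoErrors(res cmdResult) error {
	if !strings.Contains(res.Stdout(), "ready") { // full stdout (previously Output())
		return fmt.Errorf("unexpected stdout: %s", res.Stdout())
	}
	for _, line := range res.GetStdErr().ByLines() {
		if strings.Contains(line, "level=error") {
			return fmt.Errorf("unexpected error on stderr: %s", line)
		}
	}
	return nil
}
```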

Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
Signed-off-by: Chris Tarazi <chris@isovalent.com>
[ upstream commit 4e280dd ]

Handling both `\n` and `\r\n` was already part of `CmdRes`, but this
treatment had never been propagated to `FilterBuffer`.

Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
Signed-off-by: Chris Tarazi <chris@isovalent.com>
[ upstream commit 30f4e75 ]

There is no need to skip the L4 services test if it runs without
kube-proxy (i.e., with NodePort BPF), as this test does not deploy any
L7 redirects.

Signed-off-by: Jarno Rajahalme <jarno@covalent.io>
Signed-off-by: Chris Tarazi <chris@isovalent.com>
[ upstream commit 069ff48 ]

Add a test case where a TFTP client in k8s1 uses the DNS proxy port of
k8s2 as its ephemeral local (source) port number. This exposes a problem
with the iptables rules used in proxy redirection in k8s2, as the response
TFTP packets get intercepted by a socket match rule.

As of now, this test case fails, but the fix in a following commit
fixes the underlying problem.

Related: #12241
Signed-off-by: Jarno Rajahalme <jarno@covalent.io>
Signed-off-by: Chris Tarazi <chris@isovalent.com>
jrajahalme and others added 4 commits June 30, 2020 00:22
[ upstream commit ca767ee ]

'--no-wildcard' allows the socket match to find zero-bound (listening)
sockets, which we do not want, as this may intercept (reply) traffic
intended for other nodes when an ephemeral source port number
allocated in one node happens to be the same as the allocated proxy
port number in 'this' node (the node doing the iptables socket match
changed here).

Fixes: #12241
Related: #8864
Signed-off-by: Jarno Rajahalme <jarno@covalent.io>
Signed-off-by: Chris Tarazi <chris@isovalent.com>
[ upstream commit 089060b ]

We need NodePort and direct routing devices to have a MAC address. If
they don't, init.sh fails with the following error:

    level=warning msg="+ for NATIVE_DEV in ${NATIVE_DEVS//;/ }" subsys=datapath-loader
    level=warning msg="++ cat /sys/class/net/lo/ifindex" subsys=datapath-loader
    level=warning msg="+ IDX=1" subsys=datapath-loader
    level=warning msg="++ ip link show lo" subsys=datapath-loader
    level=warning msg="++ grep ether" subsys=datapath-loader
    level=warning msg="++ awk '{print $2}'" subsys=datapath-loader
    level=warning msg="+ MAC=" subsys=datapath-loader
    level=error msg="Error while initializing daemon" error="exit status 1" subsys=daemon
    level=fatal msg="Error while creating daemon" error="exit status 1" subsys=daemon

Thus, we need to skip auto-detected devices that don't have a MAC address.
This commit implements that; it was tested by injecting a loopback
interface with an IP address in the code, in the dev VM:

    loAddr, err := netlink.ParseAddr("192.168.33.11/32")
    if err == nil {
        loAddr.LinkIndex = 1
        addrs = append(addrs, *loAddr)
    }
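
The skip itself boils down to a check like the following sketch (using the netlink package; the surrounding device loop is simplified and hypothetical):

```go
package cmd

import (
	"github.com/vishvananda/netlink"
)

// filterDevicesWithMAC drops links that have no hardware address (e.g. lo),
// since init.sh cannot derive a MAC for them and fails as shown above.
func filterDevicesWithMAC(links []netlink.Link) []netlink.Link {
	var devs []netlink.Link
	for _, link := range links {
		if len(link.Attrs().HardwareAddr) == 0 {
			continue // skip MAC-less devices from NodePort auto-detection
		}
		devs = append(devs, link)
	}
	return devs
}
```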

Fixes: #12228
Fixes: #12304
Fixes: 6730d0f ("daemon: Extend BPF NodePort device auto-detection")
Signed-off-by: Paul Chaignon <paul@cilium.io>
Signed-off-by: Chris Tarazi <chris@isovalent.com>
[ upstream commit 5244b68 ]

In IPv6 mode, when the host firewall is enabled and rules are enforced,
we start dropping ICMPv6 packets that are required to route packets. In
particular, we can notice the following drops in cilium monitor:

  xx drop (Policy denied) flow 0x1fc0ef0 to endpoint 0, identity 1->0: fd00::c -> f00d::a0f:0:0:dfa1 DestinationUnreachable(NoRouteToDst)
  xx drop (Policy denied) flow 0x0 to endpoint 0, identity 0->0: fd01::c -> fd01::b NeighborAdvertisement

The nodes need to be able to exchange ICMPv6 NS and NA messages to
establish routes. We already handle the response to NS messages on
ingress, but when the egress policies are enforced, we start dropping
outgoing NS and NA messages.

This commit fixes that by allowing and rejecting ICMPv6 messages according
to RFC4890 Section 4.4. No verification other than of the message type is
performed on the messages' correctness or their source IP addresses. Such
messages from the pods are already handled on their egress, so we're not at
risk of spoofing from pods here.
Handling of echo request and reply messages does not conform to RFC4890, as
they can be filtered by the host firewall. That is to stay consistent with
our handling of ICMPv4 messages.

With this commit, we also stop answering to NS and echo request messages
from the BPF program on ingress to the host. This behavior had been
broken by a695f53 ("Endpoint for host"), but we will now explicitly stop
replying to those messages and pass them up the stack instead.

Fixes: a695f53 ("Endpoint for host")
Fixes: 489dbef ("bpf: Enforce host policies for IPv6")
Signed-off-by: Paul Chaignon <paul@cilium.io>
Signed-off-by: Chris Tarazi <chris@isovalent.com>
[ upstream commit 655c636 ]

After discussion at the Cilium community meeting, we determined that the
beta designation is more appropriate for the current state of the host
firewall feature.

Signed-off-by: Joe Stringer <joe@cilium.io>
Signed-off-by: Chris Tarazi <chris@isovalent.com>
@christarazi christarazi force-pushed the pr/v1.8-backport-2020-06-29 branch 2 times, most recently from 439e98c to 1b7c80e on June 30, 2020 07:24
@christarazi christarazi (Member, Author) commented:

test-backport-1.8

@qmonnet qmonnet (Member) left a comment

#12305 -- fqdn/dnsproxy/proxy_test: increase timeout for DNS TCP client exchanges (@qmonnet)

Looks good, thanks!

@aanm aanm (Member) commented Jun 30, 2020

test-missed-k8s

@aanm aanm (Member) commented Jun 30, 2020

previously hit #12336

@joestringer joestringer merged commit 63113e8 into v1.8 Jun 30, 2020
@joestringer joestringer deleted the pr/v1.8-backport-2020-06-29 branch June 30, 2020 18:24
@pchaigno pchaigno (Member) left a comment

Bit late to the party, but my changes look good! Thanks!
