v1.7 backports 2020-06-08 #11971
Conversation
[ upstream commit 90217c1 ] In case a toServices rule selected a service that contained more than one endpoint, the generated rules could never be deleted. This can easily be reproduced by adding one more endpoint to the unit test of the deleteToCidrFromEndpoint function. Fixes: bae09f7 ("Move ToCIDR gen logic to k8s package") Signed-off-by: André Martins <andre@cilium.io> Signed-off-by: Joe Stringer <joe@cilium.io>
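To make the fixed behavior concrete, here is a minimal sketch: the deletion must drop the generated entry for every endpoint IP, not just the first one. The type `CIDRRule` and helper `deleteGeneratedCIDRs` are simplified stand-ins for illustration, not Cilium's actual policy API.

```go
package main

import (
	"fmt"
	"net"
)

// CIDRRule loosely mirrors a generated egress CIDR entry.
type CIDRRule struct {
	Cidr      string
	Generated bool
}

// deleteGeneratedCIDRs removes the generated /32 entry for *every*
// endpoint IP; stopping after the first match would leave rules for
// multi-endpoint services behind forever (the bug described above).
func deleteGeneratedCIDRs(toCIDRSet []CIDRRule, endpointIPs []net.IP) []CIDRRule {
	ips := make(map[string]struct{}, len(endpointIPs))
	for _, ip := range endpointIPs {
		ips[ip.String()+"/32"] = struct{}{}
	}
	out := toCIDRSet[:0]
	for _, r := range toCIDRSet {
		if _, hit := ips[r.Cidr]; hit && r.Generated {
			continue // drop the rule generated for this endpoint
		}
		out = append(out, r)
	}
	return out
}

func main() {
	set := []CIDRRule{{"10.1.1.1/32", true}, {"10.1.1.2/32", true}}
	eps := []net.IP{net.ParseIP("10.1.1.1"), net.ParseIP("10.1.1.2")}
	fmt.Println(deleteGeneratedCIDRs(set, eps)) // []
}
```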
[ upstream commit c54a76f ] [ Backporter's notes: Minor conflict in imports ] The rule_translate_test.go was checking the output of an unsorted slice, which could occasionally fail with:

```
FAIL: rule_translate_test.go:197: K8sSuite.TestGenerateToCIDRFromEndpoint
rule_translate_test.go:228:
    c.Assert(string(rule.ToCIDRSet[0].Cidr), Equals, epIP1+"/32")
... obtained string = "10.1.1.2/32"
... expected string = "10.1.1.1/32"
```

Fix it by implementing some basic sorting functions in the policy api package and using these in the offending tests. Signed-off-by: Joe Stringer <joe@cilium.io>
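A hedged sketch of the kind of sort helper involved — the names `CIDR`, `CIDRRule`, and `sortCIDRSet` are illustrative, not the exact functions added to the policy API package:

```go
package main

import (
	"fmt"
	"sort"
)

// CIDR stands in for the policy API's CIDR string type.
type CIDR string

type CIDRRule struct{ Cidr CIDR }

// sortCIDRSet orders generated rules deterministically so a test can
// assert on a fixed index regardless of map iteration order.
func sortCIDRSet(set []CIDRRule) {
	sort.Slice(set, func(i, j int) bool { return set[i].Cidr < set[j].Cidr })
}

func main() {
	set := []CIDRRule{{"10.1.1.2/32"}, {"10.1.1.1/32"}}
	sortCIDRSet(set)
	fmt.Println(set[0].Cidr) // 10.1.1.1/32 — stable across runs
}
```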
[ upstream commit 02b43fc ] [ Backporter's notes: Added extra "! -s $internal_ip" match to resolve conflict on older v1.7 branch. ] Proxy return traffic accessed via a k8s NodePort will not be routed back via the Cilium bpf datapath, so such traffic needs to have possible reverse NAT applied. Setting NOTRACK prevented this. Fix this by setting NOTRACK only on packets heading back to the Cilium datapath (-o lxc+ and -o cilium_host). Signed-off-by: Jarno Rajahalme <jarno@covalent.io> Signed-off-by: Joe Stringer <joe@cilium.io>
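As a rough sketch of the idea (not Cilium's actual iptables management code — the chain name and mark value below are placeholders), the NOTRACK rule is constrained to the outgoing datapath interfaces so that NodePort reply traffic leaving on any other interface still traverses conntrack and gets reverse-NATed:

```go
package main

import (
	"log"
	"os/exec"
)

// installNotrack appends a NOTRACK rule for a single output interface.
// Only packets heading back into the Cilium datapath (lxc+ / cilium_host)
// skip conntrack; everything else remains eligible for reverse NAT.
func installNotrack(iface string) error {
	args := []string{
		"-t", "raw", "-A", "CILIUM_OUTPUT_raw", // placeholder chain name
		"-o", iface, // "+" is the iptables interface wildcard
		"-m", "mark", "--mark", "0xA00/0xF00", // placeholder proxy mark
		"-j", "NOTRACK",
	}
	return exec.Command("iptables", args...).Run()
}

func main() {
	for _, iface := range []string{"lxc+", "cilium_host"} {
		if err := installNotrack(iface); err != nil {
			log.Fatalf("installing NOTRACK for %s: %v", iface, err)
		}
	}
}
```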
[ upstream commit 31e2e11 ] Add an info log message for all create, patch and delete requests. The volume should be low enough. Signed-off-by: Thomas Graf <thomas@cilium.io> Signed-off-by: Joe Stringer <joe@cilium.io>
[ upstream commit 27b0a88 ] So far, the endpoint create request has relied on timing out eventually and on checking liveness of the endpoint in various spots to then abort the creation. This has the disadvantage that if a blocking operation such as an etcd interaction is delayed for a long time, kubelet may schedule a new pod creation attempt while the old endpoint is still being created, leading to parallel endpoint create events for the same pod. Establish a new map which keeps track of all endpoint create requests (a sketch of the pattern follows below) and:

* Cancel any endpoint create request if an endpoint delete request for the same endpoint is received.
* Cancel any endpoint create request if a new endpoint create request for the same pod is received. The new request will continue.

In order to assist in troubleshooting of create-endpoint-related issues, the list of ongoing endpoint creations is printed in the debuginfo output. Signed-off-by: Thomas Graf <thomas@cilium.io> Signed-off-by: Joe Stringer <joe@cilium.io>
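A minimal sketch of the tracking-map pattern described above, assuming hypothetical names (`endpointCreationManager` and its methods are illustrative; Cilium's real implementation differs in detail):

```go
package main

import (
	"context"
	"sync"
)

// endpointCreationManager tracks in-flight endpoint create requests.
type endpointCreationManager struct {
	mu       sync.Mutex
	creating map[string]context.CancelFunc // pod -> cancel of in-flight create
}

func newEndpointCreationManager() *endpointCreationManager {
	return &endpointCreationManager{creating: map[string]context.CancelFunc{}}
}

// NewCreateContext cancels any in-flight create for the same pod and
// registers the new attempt; the newest request wins.
func (m *endpointCreationManager) NewCreateContext(pod string) context.Context {
	m.mu.Lock()
	defer m.mu.Unlock()
	if cancel, ok := m.creating[pod]; ok {
		cancel() // abort the stale creation
	}
	ctx, cancel := context.WithCancel(context.Background())
	m.creating[pod] = cancel
	return ctx
}

// CancelCreate aborts an in-flight create when a delete request arrives.
func (m *endpointCreationManager) CancelCreate(pod string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if cancel, ok := m.creating[pod]; ok {
		cancel()
		delete(m.creating, pod)
	}
}

func main() {
	m := newEndpointCreationManager()
	ctx1 := m.NewCreateContext("pod-a")
	_ = m.NewCreateContext("pod-a") // second attempt cancels the first
	<-ctx1.Done()                   // first attempt observes cancellation
}
```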
[ upstream commit 53a80b0 ] It has been observed that 3 minutes is not sufficient. Endpoint creations are currently dependent on kubelet cancelling the request or scheduling a new one. The IPAM expiration may never occur before such an event happens, as otherwise the IP is returned prematurely. Signed-off-by: Thomas Graf <thomas@cilium.io> Signed-off-by: Joe Stringer <joe@cilium.io>
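For context, a sketch of the expiration mechanism being tuned here: an allocated IP is released back to the pool if endpoint creation does not complete before the timeout. The function name and the 10-minute value are assumptions for illustration, not the actual values in this change.

```go
package main

import (
	"fmt"
	"time"
)

// startExpirationTimer releases the IP if creation never completes in
// time; the returned stop function is called once the endpoint is fully
// created, so the IP is not returned prematurely.
func startExpirationTimer(ip string, release func(string)) (stop func()) {
	t := time.AfterFunc(10*time.Minute, func() {
		release(ip) // creation never finished: return the IP to the pool
	})
	return func() { t.Stop() }
}

func main() {
	stop := startExpirationTimer("10.0.0.42", func(ip string) {
		fmt.Println("released", ip)
	})
	stop() // endpoint creation succeeded in time: cancel the expiration
}
```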
Force-pushed from f2ab555 to f8ccebf.
test-backport-1.7
Force-pushed from f8ccebf to 0c4909a.
Conflict resolution for NOTRACK looks good.
[ upstream commit 10f4e2c ] [ Backporter's notes: Minor conflicts scattered across five files. ] Calls to GetIdentity() have been assuming both that the endpoint is locked and that it is not locked. Auditing the call sites (a sketch of the resulting accessor split follows below):

* daemon/cmd/fqdn.go / notifyOnDNSMsg(): LookupEndpointIDByIP(). No locking. --> Vulnerable
* pkg/datapath/linux/config/config.go / writeStaticData(): e.owner.Datapath().WriteEndpointConfig -> WriteEndpointConfig() -> writeStaticData(). The endpoint is locked. Not using epCacheInfo. Must use a non-locking variation or a deadlock may occur. --> Not vulnerable
* pkg/datapath/loader/template.go: e.realizeBPFState() -> CompileOrLoad -> ELFSubstitutions() -> elfVariableSubstitutions(); CompileAndLoad() -> compileAndLoad() -> realizeBPFState(); ReloadDatapath() -> reloadHostDatapath() -> patchHostNetdevDatapath() -> ELFSubstitutions() -> elfVariableSubstitutions(). Uses epInfoCache. --> Not vulnerable
* pkg/envoy/server.go: e.updateNetworkPolicy() -> UpdateNetworkPolicy(). The endpoint is locked. Must use a non-locking variation. --> Not vulnerable
* pkg/hubble/parser/threefour/parser.go: Decode() -> resolveEndpoint(). --> Vulnerable

Fixes: #11932 Signed-off-by: Thomas Graf <thomas@cilium.io> Signed-off-by: Joe Stringer <joe@cilium.io>
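The locked/unlocked accessor split this implies looks roughly like the following minimal sketch (the `Endpoint` type here is a stand-in, not Cilium's actual one; the method names match those mentioned in the review below):

```go
package main

import "sync"

type Identity uint32

// Endpoint demonstrates the accessor split: one public method per
// locking assumption, sharing a lock-agnostic implementation.
type Endpoint struct {
	mu       sync.RWMutex
	identity Identity
}

// GetIdentity is safe for callers that do NOT hold the endpoint lock.
func (e *Endpoint) GetIdentity() Identity {
	e.mu.RLock()
	defer e.mu.RUnlock()
	return e.getIdentity()
}

// GetIdentityLocked must only be called with the endpoint lock held
// (e.g. writeStaticData() or updateNetworkPolicy() style paths), where
// re-acquiring the lock would deadlock.
func (e *Endpoint) GetIdentityLocked() Identity {
	return e.getIdentity()
}

// getIdentity is the common implementation with no locking of its own.
func (e *Endpoint) getIdentity() Identity {
	return e.identity
}

func main() {
	e := &Endpoint{identity: 1234}
	_ = e.GetIdentity() // unlocked caller path
}
```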
[ upstream commit 3285cbc ] The following deadlock can occur when the ipcache GC relies on map renames and datapath reloads to delete entries, in combination with endpoint regenerations triggered by the FQDN proxy which perform ipcache upserts as part of regenerations:

Routine 1: The following goroutine holds the ipcache mutex while garbage collecting:

```
goroutine 779 [semacquire, 48 minutes]:
sync.runtime_SemacquireMutex(0xc0008f4f68, 0xc007663600, 0x0)
	/usr/local/go/src/runtime/sema.go:71 +0x47
sync.(*RWMutex).Lock(0xc0008f4f60)
	/usr/local/go/src/sync/rwmutex.go:103 +0x88
github.com/cilium/cilium/pkg/datapath/loader.(*Loader).Reinitialize(0xc0003d70a0, 0x2759d20, 0xc000468bc0, 0x2771e60, 0xc000515680, 0x2329, 0x7f87768eac80, 0xc00022ff10, 0x26faf60, 0xc0004a18b0, ...)
	/go/src/github.com/cilium/cilium/pkg/datapath/loader/base.go:124 +0xcd
main.(*Daemon).TriggerReloadWithoutCompile(0xc000515680, 0x2332316, 0x10, 0x12, 0xc008f75060, 0x16)
	/go/src/github.com/cilium/cilium/daemon/daemon.go:588 +0x202
github.com/cilium/cilium/pkg/datapath/ipcache.(*BPFListener).garbageCollect(0xc0006f8960, 0x2759d20, 0xc003748000, 0x0, 0x0, 0x0)
	/go/src/github.com/cilium/cilium/pkg/datapath/ipcache/listener.go:306 +0xadc
github.com/cilium/cilium/pkg/datapath/ipcache.(*BPFListener).OnIPIdentityCacheGC.func1(0x2759d20, 0xc003748000, 0x3a5bf00, 0x2)
	/go/src/github.com/cilium/cilium/pkg/datapath/ipcache/listener.go:340 +0x3e
github.com/cilium/cilium/pkg/controller.(*Controller).runController(0xc001e1e000)
	/go/src/github.com/cilium/cilium/pkg/controller/controller.go:205 +0xa2a
created by github.com/cilium/cilium/pkg/controller.(*Manager).updateController
	/go/src/github.com/cilium/cilium/pkg/controller/manager.go:120 +0xb09
```

As part of this, Reinitialize() is called, which requires the compilation mutex to be acquired.

Routine 2: The following ongoing endpoint regeneration is holding the compilation lock and thus blocks Routine 1 from completing. It is itself blocked on the FQDN NameManager mutex:

```
goroutine 7227352 [semacquire, 48 minutes]:
sync.runtime_SemacquireMutex(0xc0005c3744, 0xc009bf4c00, 0x1)
	/usr/local/go/src/runtime/sema.go:71 +0x47
sync.(*Mutex).lockSlow(0xc0005c3740)
	/usr/local/go/src/sync/mutex.go:138 +0xfc
sync.(*Mutex).Lock(...)
	/usr/local/go/src/sync/mutex.go:81
github.com/cilium/cilium/pkg/fqdn.(*NameManager).Lock(0xc0005c3740)
	/go/src/github.com/cilium/cilium/pkg/fqdn/name_manager.go:93 +0x47
github.com/cilium/cilium/pkg/policy.(*SelectorCache).AddFQDNSelector(0xc0004691c0, 0x26faf00, 0xc00acd6e70, 0x0, 0x0, 0xc004106590, 0xb, 0x0, 0x0, 0x40ca00)
	/go/src/github.com/cilium/cilium/pkg/policy/selectorcache.go:580 +0x13f
github.com/cilium/cilium/pkg/policy.(*L4Filter).cacheFQDNSelector(...)
	/go/src/github.com/cilium/cilium/pkg/policy/l4.go:392
github.com/cilium/cilium/pkg/policy.(*L4Filter).cacheFQDNSelectors(0xc00acd6e70, 0xc008c7d100, 0x38, 0x38, 0xc0004691c0)
	/go/src/github.com/cilium/cilium/pkg/policy/l4.go:387 +0xad
github.com/cilium/cilium/pkg/policy.createL4Filter(0x2739660, 0xc001c49680, 0xc006beab60, 0x1, 0x1, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cilium/cilium/pkg/policy/l4.go:489 +0x7ce
github.com/cilium/cilium/pkg/policy.createL4EgressFilter(...)
	/go/src/github.com/cilium/cilium/pkg/policy/l4.go:599
github.com/cilium/cilium/pkg/policy.mergeEgressPortProto(0x2739660, 0xc001c49680, 0xc0052087d0, 0xc006beab60, 0x1, 0x1, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cilium/cilium/pkg/policy/rule.go:637 +0x187
github.com/cilium/cilium/pkg/policy.mergeEgress(0x2739660, 0xc001c49680, 0xc0052087d0, 0xc006beab60, 0x1, 0x1, 0x0, 0x0, 0x0, 0xc00b91f2c0, ...)
	/go/src/github.com/cilium/cilium/pkg/policy/rule.go:572 +0xdbd
github.com/cilium/cilium/pkg/policy.(*rule).resolveEgressPolicy(0xc0079a0400, 0x2739660, 0xc001c49680, 0xc0052087d0, 0xc005208418, 0xc0090738f0, 0x0, 0x0, 0x0, 0xc0090738f0, ...)
	/go/src/github.com/cilium/cilium/pkg/policy/rule.go:675 +0x265
github.com/cilium/cilium/pkg/policy.ruleSlice.resolveL4EgressPolicy(0xc008d0a9c0, 0x5, 0x8, 0x2739660, 0xc001c49680, 0xc0052087d0, 0xc008d6c300, 0x0, 0x0)
	/go/src/github.com/cilium/cilium/pkg/policy/rules.go:106 +0x70b
github.com/cilium/cilium/pkg/policy.(*Repository).resolvePolicyLocked(0xc000246460, 0xc0059fe820, 0xc0000066c3, 0xc0030b0960, 0xc006176201)
	/go/src/github.com/cilium/cilium/pkg/policy/repository.go:665 +0x3f9
github.com/cilium/cilium/pkg/policy.(*PolicyCache).updateSelectorPolicy(0xc000342b20, 0xc0059fe820, 0x3a5bf00, 0x0, 0x0)
	/go/src/github.com/cilium/cilium/pkg/policy/distillery.go:135 +0x148
github.com/cilium/cilium/pkg/policy.(*PolicyCache).UpdatePolicy(...)
	/go/src/github.com/cilium/cilium/pkg/policy/distillery.go:170
github.com/cilium/cilium/pkg/endpoint.(*Endpoint).regeneratePolicy(0xc001427080, 0x0, 0x0)
	/go/src/github.com/cilium/cilium/pkg/endpoint/policy.go:175 +0x255
github.com/cilium/cilium/pkg/endpoint.(*Endpoint).runPreCompilationSteps(0xc001427080, 0xc005d3a500, 0xc008d0a800, 0x0, 0x0)
	/go/src/github.com/cilium/cilium/pkg/endpoint/bpf.go:736 +0x96a
github.com/cilium/cilium/pkg/endpoint.(*Endpoint).regenerateBPF(0xc001427080, 0xc005d3a500, 0x0, 0x0, 0x0, 0x0)
	/go/src/github.com/cilium/cilium/pkg/endpoint/bpf.go:508 +0x1e0
github.com/cilium/cilium/pkg/endpoint.(*Endpoint).regenerate(0xc001427080, 0xc005d3a500, 0x0, 0x0)
	/go/src/github.com/cilium/cilium/pkg/endpoint/policy.go:349 +0x841
github.com/cilium/cilium/pkg/endpoint.(*EndpointRegenerationEvent).Handle(0xc003530210, 0xc0068115c0)
	/go/src/github.com/cilium/cilium/pkg/endpoint/events.go:58 +0x459
github.com/cilium/cilium/pkg/eventqueue.(*EventQueue).Run.func1()
	/go/src/github.com/cilium/cilium/pkg/eventqueue/eventqueue.go:260 +0x187
sync.(*Once).doSlow(0xc005f60e58, 0xc009bbb330)
	/usr/local/go/src/sync/once.go:66 +0xe3
sync.(*Once).Do(0xc005f60e58, 0xc009bbb330)
	/usr/local/go/src/sync/once.go:57 +0x45
created by github.com/cilium/cilium/pkg/eventqueue.(*EventQueue).Run
	/go/src/github.com/cilium/cilium/pkg/eventqueue/eventqueue.go:248 +0xa9
```

Routine 3: This third routine is holding the FQDN NameManager mutex while itself waiting on the ipcache mutex, which is held by Routine 1:

```
goroutine 7229843 [select, 48 minutes]:
golang.org/x/sync/semaphore.(*Weighted).Acquire(0xc000464230, 0x2759d60, 0xc000058018, 0x40000000, 0xc0008f9740, 0x0)
	/go/src/github.com/cilium/cilium/vendor/golang.org/x/sync/semaphore/semaphore.go:60 +0x28e
github.com/cilium/cilium/pkg/lock.(*SemaphoredMutex).Lock(...)
	/go/src/github.com/cilium/cilium/pkg/lock/semaphored_mutex.go:41
github.com/cilium/cilium/pkg/ipcache.(*IPCache).Upsert(0xc000464280, 0xc008858dd0, 0xd, 0x0, 0x0, 0x0, 0x0, 0x0, 0x1000596, 0x231acf1, ...)
	/go/src/github.com/cilium/cilium/pkg/ipcache/ipcache.go:184 +0x93
github.com/cilium/cilium/pkg/ipcache.allocateCIDRs(0xc004bb2a00, 0x98, 0x98, 0x0, 0x0, 0x0, 0x0, 0x0)
	/go/src/github.com/cilium/cilium/pkg/ipcache/cidr.go:97 +0x7aa
github.com/cilium/cilium/pkg/ipcache.AllocateCIDRsForIPs(0xc0046a2000, 0x98, 0x200, 0x1, 0x1, 0x0, 0x1f4a440, 0xc00a768900)
	/go/src/github.com/cilium/cilium/pkg/ipcache/cidr.go:52 +0x65
main.identitiesForFQDNSelectorIPs(0xc00a7687e0, 0xc0008f9740, 0x0, 0x0, 0x0)
	/go/src/github.com/cilium/cilium/daemon/fqdn.go:87 +0x38e
main.(*Daemon).updateSelectors(0xc000515680, 0x2759da0, 0xc000e09a40, 0xc00a7687e0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
	/go/src/github.com/cilium/cilium/daemon/fqdn.go:320 +0x40
github.com/cilium/cilium/pkg/fqdn.(*NameManager).UpdateGenerateDNS(0xc0005c3740, 0x2759da0, 0xc000e09a40, 0xbfae6eb0118dc6cd, 0x131b26c90d12, 0x3a5bf00, 0xc004cf9910, 0x0, 0x0, 0x0)
	/go/src/github.com/cilium/cilium/pkg/fqdn/name_manager.go:225 +0x48f
main.(*Daemon).notifyOnDNSMsg(0xc000515680, 0xbfae6eb0118dc6cd, 0x131b26c90d12, 0x3a5bf00, 0xc001427080, 0xc00a2bdec0, 0x11, 0xc00ae52340, 0x10, 0xc003564f30, ...)
	/go/src/github.com/cilium/cilium/daemon/fqdn.go:584 +0x12cd
github.com/cilium/cilium/pkg/fqdn/dnsproxy.(*DNSProxy).ServeDNS(0xc0008c8e80, 0x27818e0, 0xc00302e1e0, 0xc0072a4f30)
	/go/src/github.com/cilium/cilium/pkg/fqdn/dnsproxy/proxy.go:475 +0x238c
github.com/miekg/dns.(*Server).serveDNS(0xc000563a40, 0xc001f8be00, 0x6b, 0x200, 0xc00302e1e0)
	/go/src/github.com/cilium/cilium/vendor/github.com/miekg/dns/server.go:597 +0x20b
github.com/miekg/dns.(*Server).serveUDPPacket(0xc000563a40, 0xc00047d060, 0xc001f8be00, 0x6b, 0x200, 0xc000df0380, 0xc0092cf2b0)
	/go/src/github.com/cilium/cilium/vendor/github.com/miekg/dns/server.go:552 +0xb2
github.com/miekg/dns.(*Server).serveUDP.func2(0xc000563a40, 0xc00047d060, 0xc001f8be00, 0x6b, 0x200, 0xc000df0380, 0xc0092cf2b0)
	/go/src/github.com/cilium/cilium/vendor/github.com/miekg/dns/server.go:478 +0x67
created by github.com/miekg/dns.(*Server).serveUDP
	/go/src/github.com/cilium/cilium/vendor/github.com/miekg/dns/server.go:477 +0x272
```

Fixes: #11946 Signed-off-by: Thomas Graf <thomas@cilium.io> Signed-off-by: Joe Stringer <joe@cilium.io>
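Stripped of all Cilium specifics, the cycle is a classic three-lock circular wait. The following deliberately reduced sketch (not Cilium code; lock names mirror the routines above) shows the structure: ipcache -> compilation -> NameManager -> ipcache.

```go
package main

import "sync"

var (
	ipcacheMu     sync.Mutex // held by routine 1 (ipcache GC)
	compilationMu sync.Mutex // held by routine 2 (endpoint regeneration)
	nameManagerMu sync.Mutex // held by routine 3 (FQDN DNS handler)
)

func main() {
	var wg sync.WaitGroup
	wg.Add(3)
	go func() { // routine 1: holds ipcache, needs compilation lock
		defer wg.Done()
		ipcacheMu.Lock()
		defer ipcacheMu.Unlock()
		compilationMu.Lock() // blocks on routine 2
		compilationMu.Unlock()
	}()
	go func() { // routine 2: holds compilation, needs NameManager
		defer wg.Done()
		compilationMu.Lock()
		defer compilationMu.Unlock()
		nameManagerMu.Lock() // blocks on routine 3
		nameManagerMu.Unlock()
	}()
	go func() { // routine 3: holds NameManager, needs ipcache
		defer wg.Done()
		nameManagerMu.Lock()
		defer nameManagerMu.Unlock()
		ipcacheMu.Lock() // blocks on routine 1 — the cycle closes
		ipcacheMu.Unlock()
	}()
	wg.Wait() // with unlucky scheduling this never returns: deadlock
}
```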
[ upstream commit 19030a7 ] This commit fixes a flake where the metrics do not update as quickly as we expect. It is fixed by waiting for up to 10 seconds to retrieve the metric value we expect. Fixes: 4a4a59b ("daemon: Add assertions for endpoint state metrics") Signed-off-by: Chris Tarazi <chris@isovalent.com> Signed-off-by: Joe Stringer <joe@cilium.io>
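The fix follows the usual poll-with-deadline pattern for eventually consistent test state; a hedged sketch (`waitForMetric` and the polling interval are illustrative, not the test suite's actual helper):

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// waitForMetric polls until the metric reaches the expected value or a
// 10-second deadline passes, instead of asserting immediately.
func waitForMetric(getMetric func() float64, want float64) error {
	deadline := time.Now().Add(10 * time.Second)
	for time.Now().Before(deadline) {
		if getMetric() == want {
			return nil // metric caught up with the expected state
		}
		time.Sleep(200 * time.Millisecond)
	}
	return errors.New("timed out waiting for metric to update")
}

func main() {
	start := time.Now()
	// Simulated metric that only reaches 1 after two seconds.
	metric := func() float64 {
		if time.Since(start) > 2*time.Second {
			return 1
		}
		return 0
	}
	fmt.Println(waitForMetric(metric, 1)) // <nil>
}
```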
Force-pushed from 0c4909a to 8c8ea25.
test-backport-1.7
Hit #11213 and timeouts hitting external service 1.1.1.1, suspected flakes. Will re-kick main ginkgo run. https://jenkins.cilium.io/job/Cilium-PR-Ginkgo-Tests-Validated/19731/ All of the other CI targets were green.
retest-runtime EDIT: ...that's not the way to kick "Cilium-Ginkgo-Tests". Ignore the failure there, that target is not supported on this branch.
test-me-please EDIT: Welp, that's not it either, all of those will fail too.
test-backport-1.7
@errordeveloper thanks for merging :) In future, please remember to run the command in the PR description to update the labels for the corresponding backported PRs. I'll handle it this time :-)
Backported here:
I folded the replacement of GetIdentity() with getIdentity() and GetIdentityLocked() into this commit.

NOT backported:
Once this PR is merged, you can update the PR labels via: