v1.11 backports 2022-05-09 #19752
Conversation
@YutaroHayakawa BPF check failure is due to |
Force-pushed from a8c5231 to 9fdc4e2.
@kaworu (cc: @qmonnet) Hi, thanks for the heads up! I think just replacing |
@YutaroHayakawa replacing by |
My commits look good, thanks!
My change looks good, but I got the error below while trying to verify it again locally:
|
TL;DR: Yes, go with The supposition above is correct. So
In the case of Yutaro's PR, the context of the new traces will be either |
[ upstream commit 90c0d62 ] ... while clustermesh-apiserver is starting and a new CEW with identity 0 already exists in etcd.

In the original implementation, clustermesh-apiserver and etcd run in separate containers within one pod, so a new, empty etcd is created while clustermesh-apiserver is starting. In our use case we use an existing etcd which is shared with cilium-agent. The error below occurs when a new CEW resource was already applied to Kubernetes and cilium-agent was already started on the related external workload machine, while clustermesh-apiserver was not yet deployed.

```
$ ./clustermesh-apiserver --cluster-id=12 --cluster-name=uacl-test --k8s-kubeconfig-path ../../uacl/uacl-test.kubeconfig --kvstore-opt etcd.config=../../uacl/etcd.config
level=info msg="Started gops server" address="127.0.0.1:9892" subsys=clustermesh-apiserver
level=info msg="Starting clustermesh-apiserver..." cluster-id=12 cluster-name=uacl-test subsys=clustermesh-apiserver
level=info msg="Establishing connection to apiserver" host="https://uacl-test-api.test:31243" subsys=k8s
level=info msg="Connected to apiserver" subsys=k8s
level=info msg="Waiting until all Cilium CRDs are available" subsys=k8s
level=info msg="All Cilium CRDs have been found and are available" subsys=k8s
level=info msg="Initializing identity allocator" subsys=identity-cache
level=info msg="Creating etcd client" ConfigPath=../../uacl/etcd.config KeepAliveHeartbeat=15s KeepAliveTimeout=25s RateLimit=20 subsys=kvstore
level=info msg="Started health API" subsys=clustermesh-apiserver
level=info msg="Connecting to etcd server..." config=../../uacl/etcd.config endpoints="[https://uacl-test-api.test:30108]" subsys=kvstore
level=info msg="Got lease ID 320f7d7b1f23bc32" subsys=kvstore
level=info msg="Got lock lease ID 320f7d7b1f23bc34" subsys=kvstore
level=info msg="Initial etcd session established" config=../../uacl/etcd.config endpoints="[https://uacl-test-api.test:30108]" subsys=kvstore
level=info msg="Successfully verified version of etcd endpoint" config=../../uacl/etcd.config endpoints="[https://uacl-test-api.test:30108]" etcdEndpoint="https://uacl-test-api.test:30108" subsys=kvstore version=3.4.16
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1919aa3]

goroutine 213 [running]:
github.com/cilium/cilium/pkg/kvstore/store.(*SharedStore).keyPath(0x0, {0x7f349c122c38, 0xc000a7e240})
	/home/abocim/go/src/github.com/cilium/cilium/pkg/kvstore/store/store.go:276 +0x43
github.com/cilium/cilium/pkg/kvstore/store.(*SharedStore).syncLocalKey(0x0, {0x2204f70, 0xc000050138}, {0x22051d8, 0xc000a7e240})
	/home/abocim/go/src/github.com/cilium/cilium/pkg/kvstore/store/store.go:288 +0x87
github.com/cilium/cilium/pkg/kvstore/store.(*SharedStore).UpdateKeySync(...)
	/home/abocim/go/src/github.com/cilium/cilium/pkg/kvstore/store/store.go:375
main.(*VMManager).OnUpdate(0xc0005f6080, {0x21e9e10, 0xc000a7e0c0})
	/home/abocim/go/src/github.com/cilium/cilium/clustermesh-apiserver/vmmanager.go:183 +0x3f4
github.com/cilium/cilium/pkg/kvstore/store.(*SharedStore).onUpdate(...)
	/home/abocim/go/src/github.com/cilium/cilium/pkg/kvstore/store/store.go:233
github.com/cilium/cilium/pkg/kvstore/store.(*SharedStore).updateKey(0xc000128780, {0xc00025e85d, 0x15}, {0xc0000f6a00, 0xf3, 0x100})
	/home/abocim/go/src/github.com/cilium/cilium/pkg/kvstore/store/store.go:414 +0x102
github.com/cilium/cilium/pkg/kvstore/store.(*SharedStore).watcher(0xc000128780, 0xc000270a20)
	/home/abocim/go/src/github.com/cilium/cilium/pkg/kvstore/store/store.go:482 +0x73c
created by github.com/cilium/cilium/pkg/kvstore/store.(*SharedStore).listAndStartWatcher
	/home/abocim/go/src/github.com/cilium/cilium/pkg/kvstore/store/store.go:447 +0x89
```

Signed-off-by: Adam Bocim <adam.bocim@seznam.cz>
Signed-off-by: Alexandre Perrin <alex@kaworu.ch>
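The stack trace shows `SharedStore` methods being invoked on a nil store (receiver `0x0`) from `VMManager.OnUpdate`. Purely as a hedged illustration of that failure mode, and not the upstream fix, the sketch below guards the update path against an uninitialized store; the stand-in types and field names are hypothetical, not the real Cilium API.

```go
package main

import (
	"context"
	"log"
)

// sharedStore is a hypothetical stand-in for pkg/kvstore/store.SharedStore.
type sharedStore struct{}

func (s *sharedStore) UpdateKeySync(ctx context.Context, key string) error {
	log.Printf("syncing key %q", key)
	return nil
}

// vmManager loosely mirrors the clustermesh-apiserver VMManager from the trace;
// store stays nil until the shared store has been set up against etcd.
type vmManager struct {
	store *sharedStore
}

// OnUpdate skips the sync while the shared store is not ready, which is the
// dereference that panics in the trace above.
func (m *vmManager) OnUpdate(key string) {
	if m.store == nil {
		log.Printf("shared store not ready, skipping update for %q", key)
		return
	}
	if err := m.store.UpdateKeySync(context.TODO(), key); err != nil {
		log.Printf("failed to sync %q: %v", key, err)
	}
}

func main() {
	m := &vmManager{} // store not initialized yet, as in the crash scenario
	m.OnUpdate("cilium/state/noderegister/v1/external-workload-1")
	m.store = &sharedStore{}
	m.OnUpdate("cilium/state/noderegister/v1/external-workload-1")
}
```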
[ upstream commit bb43425 ] Since the robots.txt file is now used in 'html_extra_path', we move the sitemap-index.xml to the static directory. That way robots can access the sitemap-index.xml file directly. Signed-off-by: André Martins <andre@cilium.io> Signed-off-by: Alexandre Perrin <alex@kaworu.ch>
[ upstream commit 21e6e6a ] Currently Hubble-Relay builds its client list by querying the Peer Service over the local Hubble Unix domain socket. This goes against best security practices (sharing files across pods) and is not allowed on platforms that strictly enforce SELinux policies (e.g. OpenShift). This PR enables, by default, the creation of a Kubernetes Service that proxies the Hubble Peer Service so that Hubble-Relay can use it to build its client list, eliminating the need for a shared Unix domain socket completely. Helm values and configurations have been added to enable the service in a cilium deployment. Signed-off-by: Nate Sweet <nathanjsweet@pm.me> Signed-off-by: Alexandre Perrin <alex@kaworu.ch>
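To sketch what this change means for Relay, the snippet below contrasts dialing a shared Unix socket with dialing an in-cluster Service; the service DNS name, port, and plaintext credentials are illustrative assumptions, not the Helm chart's exact defaults.

```go
package main

import (
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// Before: peers were discovered over a Unix domain socket shared with the
	// local Hubble server, e.g. "unix:///var/run/cilium/hubble.sock".
	// After: Relay dials a Kubernetes Service that proxies the Hubble Peer
	// service, so no file needs to be shared across pods. Hypothetical target:
	target := "hubble-peer.kube-system.svc.cluster.local:4244"

	conn, err := grpc.Dial(target, grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("failed to dial peer service: %v", err)
	}
	defer conn.Close()
	log.Printf("connected to %s, ready to list Hubble peers", target)
}
```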
[ upstream commit 88d31cd ] The upgrade tests are using the official helm charts with unreleased Cilium images. This might cause the upgrade tests to fail in case the changes done in the unreleased Cilium versions require a new helm chart release. To fix this problem the upgrade tests will now use the unreleased helm charts as well. Signed-off-by: André Martins <andre@cilium.io> Signed-off-by: Alexandre Perrin <alex@kaworu.ch>
[ upstream commit 67f74ff ] Signed-off-by: André Martins <andre@cilium.io> Signed-off-by: Alexandre Perrin <alex@kaworu.ch>
[ upstream commit e80f60d ] This commit lifts and shifts the current initialization of the k8s node watcher into a separate function with a single scope of work, so that it can be re-used later. There is no change in logic. Signed-off-by: Tam Mach <tam.mach@isovalent.com> Signed-off-by: Alexandre Perrin <alex@kaworu.ch>
[ upstream commit 64b37e1 ] This makes sure that the k8s node watcher is only set up at most once. Also, a synced channel is added so that consumers of this store know when the syncing process is done. Signed-off-by: Tam Mach <tam.mach@isovalent.com> Signed-off-by: Alexandre Perrin <alex@kaworu.ch>
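As a hedged illustration of the synced-channel pattern this commit describes (all names here are hypothetical, not the actual Cilium API), the store can close a channel exactly once after the initial list completes, so any number of consumers can wait on it:

```go
package main

import (
	"fmt"
	"time"
)

// nodeStore is a hypothetical stand-in for the k8s node watcher store; synced
// is closed exactly once, after the initial sync has completed.
type nodeStore struct {
	synced chan struct{}
}

func newNodeStore() *nodeStore {
	s := &nodeStore{synced: make(chan struct{})}
	go func() {
		// Stand-in for listing and populating the store from the API server.
		time.Sleep(100 * time.Millisecond)
		close(s.synced) // signal every consumer that the initial sync is done
	}()
	return s
}

// WaitForSync blocks until the initial sync has finished.
func (s *nodeStore) WaitForSync() {
	<-s.synced
}

func main() {
	store := newNodeStore()
	store.WaitForSync()
	fmt.Println("node store synced, safe to read")
}
```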
[ upstream commit edc1a0a ] In the normal scenario, CiliumNode is created by the agent with owner references attached at all times since the PR below [0]. However, there can be cases where CiliumNode is created by the IPAM module [1], which does not set any ownerReferences. In that case, if the corresponding node got terminated and never came back with the same name, the CiliumNode resource is left dangling and needs to be garbage collected. This commit adds a garbage collector for CiliumNode with the following logic (a sketch of it follows below):
- The garbage collector runs at a predefined interval (e.g. specified by the flag --nodes-gc-interval).
- Each run checks whether the CiliumNode has a counterpart k8s Node resource, and removes the node from the GC candidate list if required.
  - If yes, the CiliumNode is considered valid, happy day.
  - If no, check whether ownerReferences are set.
    - If yes, let k8s perform garbage collection.
    - If no, mark the node as a GC candidate. If the node is still a GC candidate in the next run, remove it.

References:
[0]: cilium#17329
[1]: https://github.com/cilium/cilium/blob/master/pkg/ipam/allocator/podcidr/podcidr.go#L258

Signed-off-by: Tam Mach <tam.mach@isovalent.com>
Signed-off-by: Alexandre Perrin <alex@kaworu.ch>
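The following is a minimal Go sketch of the two-pass candidate logic described above, under the assumption that it captures only the decision flow; the types and function names are hypothetical, not the operator's actual code.

```go
package main

import "fmt"

// ciliumNode is a minimal hypothetical stand-in for a CiliumNode resource.
type ciliumNode struct {
	name         string
	hasOwnerRefs bool
}

// gcCandidates remembers nodes marked in the previous run; a node that is
// still invalid on the next run gets deleted (two-pass garbage collection).
var gcCandidates = map[string]bool{}

// runGC sketches one garbage-collection pass, executed every GC interval;
// k8sNodes is the set of existing k8s Node names.
func runGC(ciliumNodes []ciliumNode, k8sNodes map[string]bool) {
	for _, cn := range ciliumNodes {
		switch {
		case k8sNodes[cn.name]:
			// Counterpart k8s Node exists: the CiliumNode is valid.
			delete(gcCandidates, cn.name)
		case cn.hasOwnerRefs:
			// ownerReferences are set: let Kubernetes garbage-collect it.
		case gcCandidates[cn.name]:
			// Second consecutive run without a k8s Node: remove it.
			fmt.Println("deleting stale CiliumNode", cn.name)
			delete(gcCandidates, cn.name)
		default:
			// First run without a k8s Node: mark as a GC candidate.
			gcCandidates[cn.name] = true
		}
	}
}

func main() {
	nodes := []ciliumNode{{name: "minikube"}, {name: "dummy-one"}}
	k8s := map[string]bool{"minikube": true}
	runGC(nodes, k8s) // dummy-one becomes a GC candidate
	runGC(nodes, k8s) // dummy-one is deleted
}
```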
[ upstream commit 167fa2b ] This commit maps the operator CLI node GC interval flag to a Helm value and the config map. Signed-off-by: Tam Mach <tam.mach@isovalent.com> Signed-off-by: Alexandre Perrin <alex@kaworu.ch>
[ upstream commit d378a91 ] The src_id variable in to_netdev in bpf_host.c is always zero, so it is safe to delete. Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com> Signed-off-by: Alexandre Perrin <alex@kaworu.ch>
Force-pushed from 9fdc4e2 to 8139b39.
Thanks, I still have the below error :(
|
Force-pushed from 8139b39 to c1b00cd.
[ upstream commit 322510d ] In to-netdev in bpf_host.c, the packet tracing is missing when handle_nat_fwd uses a tailcall. This PR adds new packet tracing inside handle_nat_fwd, but there were two complicated issues here.

1. handle_nat_fwd is called from to-netdev as well as to-overlay. Since there are two possible callers for this function, we needed to use the IS_BPF_OVERLAY macro to determine where it is called from.

2. handle_nat_fwd does not always tailcall. Internally, it uses a conditional tailcall. When it returns without tailcalling, we hit the packet tracing hook at the bottom of to_netdev and produce a duplicated packet trace. We need to track whether the packet was already traced and skip the hook if so.

Below is an example of a simple ICMP Echo to outside of the Pod network.

Before the fix (with tailcall)
```
<- endpoint ... 10.0.1.185 -> 1.1.1.1 EchoRequest
-> stack ... 10.0.1.185 -> 1.1.1.1 EchoRequest
["-> network" is missing here]
<- network ... 1.1.1.1 -> 172.18.0.3 EchoReply
<- host ... 1.1.1.1 -> 10.0.1.185 EchoReply
-> endpoint ... 1.1.1.1 -> 10.0.1.185 EchoReply
```

Before the fix (without tailcall)
```
<- endpoint ... 10.0.1.185 -> 1.1.1.1 EchoRequest
-> stack ... 10.0.1.185 -> 1.1.1.1 EchoRequest
-> network ... 172.18.0.3 -> 1.1.1.1 EchoRequest
<- network ... 1.1.1.1 -> 172.18.0.3 EchoReply
<- host ... 1.1.1.1 -> 10.0.1.185 EchoReply
-> endpoint ... 1.1.1.1 -> 10.0.1.185 EchoReply
[no problem]
```

After the fix (with tailcall)
```
<- endpoint ... 10.0.1.185 -> 1.1.1.1 EchoRequest
-> stack ... 10.0.1.185 -> 1.1.1.1 EchoRequest
-> network ... 172.18.0.3 -> 1.1.1.1 EchoRequest
<- network ... 1.1.1.1 -> 172.18.0.3 EchoReply
<- host ... 1.1.1.1 -> 10.0.1.185 EchoReply
-> endpoint ... 1.1.1.1 -> 10.0.1.185 EchoReply
[no problem]
```

After the fix (without tailcall)
```
<- endpoint ... 10.0.1.185 -> 1.1.1.1 EchoRequest
-> stack ... 10.0.1.185 -> 1.1.1.1 EchoRequest
-> network ... 172.18.0.3 -> 1.1.1.1 EchoRequest
<- network ... 1.1.1.1 -> 172.18.0.3 EchoReply
<- host ... 1.1.1.1 -> 10.0.1.185 EchoReply
-> endpoint ... 1.1.1.1 -> 10.0.1.185 EchoReply
[no duplicated trace, no problem]
```

Fixes: cilium#18909
Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>
Signed-off-by: Alexandre Perrin <alex@kaworu.ch>
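The actual datapath code is C in bpf_host.c; purely to illustrate the "trace at most once" control flow described above, here is a hedged Go sketch where the caller skips its own trace hook whenever the helper already traced the packet. All names are hypothetical.

```go
package main

import "fmt"

// handleNATFwd is a hypothetical Go stand-in for the C handle_nat_fwd helper:
// it reports whether it already emitted the "-> network" trace (the tailcall
// path) or returned without tracing (the non-tailcall path).
func handleNATFwd(pkt string, tailcall bool) (traced bool) {
	if tailcall {
		fmt.Println("-> network", pkt) // trace emitted inside the helper
		return true
	}
	return false
}

// toNetdev mirrors the to_netdev caller: its trace hook at the bottom only
// fires when the helper did not already trace the packet, so each packet is
// traced exactly once at this hook.
func toNetdev(pkt string, tailcall bool) {
	if traced := handleNATFwd(pkt, tailcall); !traced {
		fmt.Println("-> network", pkt) // fallback trace hook
	}
}

func main() {
	toNetdev("10.0.1.185 -> 1.1.1.1 EchoRequest", true)  // traced inside the helper
	toNetdev("10.0.1.185 -> 1.1.1.1 EchoRequest", false) // traced by the caller
}
```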
@YutaroHayakawa I've replaced usage of |
Thanks, looks good for my commit :). Verified locally to make sure GC was done correctly.
```
$ ksyslo deployment/cilium-operator --timestamps | grep "subsys=watchers"
2022-05-09T13:52:42.607874474Z level=info msg="Starting to garbage collect stale CiliumNode custom resources" subsys=watchers
...
2022-05-09T13:57:42.608310224Z level=info msg="Add CiliumNode to garbage collector candidates" nodeName=dummy-one subsys=watchers
2022-05-09T13:57:42.608347044Z level=debug msg="CiliumNode is valid, no gargage collection required" nodeName=minikube subsys=watchers
...
2022-05-09T14:02:42.609286306Z level=debug msg="CiliumNode is valid, no gargage collection required" nodeName=minikube subsys=watchers
2022-05-09T14:02:42.609323065Z level=info msg="Perform GC for invalid CiliumNode" nodeName=dummy-one subsys=watchers
2022-05-09T14:02:42.615194899Z level=info msg="CiliumNode is garbage collected successfully" nodeName=dummy-one subsys=watchers
```
To not break CI, we need to kind of backport the following PRs: |
/test-backport-1.11

Job 'Cilium-PR-K8s-1.23-kernel-4.9' failed:
Test Name
Failure Output
If it is a flake and a GitHub issue doesn't already exist to track it, comment |
/test-1.23-4.9 |
Once this PR is merged, you can update the PR labels via:
or with