Cilium upstream master branch breaks nodeport connection from external client #17192
As noted in #17106 (comment), this might break the https://jenkins.cilium.io/job/Cilium-PR-K8s-1.16-net-next/1289/ test |
I did some debugging, and the problem is that the pod is not synced to the cilium_ipcache map on the other node, so no encapsulation happens and the FIB lookup fails. This issue does not happen with the v1.10.3 release. The cilium datapath log:
|
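A quick way to confirm the missing entry from the failing node (a sketch; the cilium pod name and pod IP below are placeholders, not values from this cluster):

```sh
# Placeholders: the cilium agent pod on the node that fails to reach the
# nginx pod, and the remote pod's IP.
CILIUM_POD=cilium-xxxxx
POD_IP=10.0.1.123
# A missing ipcache entry for the pod IP means no tunnel metadata, hence
# no encapsulation and a FIB lookup failure on the raw pod IP:
kubectl -n kube-system exec "$CILIUM_POD" -- cilium bpf ipcache get "$POD_IP"
# The tunnel map should still map the remote node's pod CIDR to its node IP:
kubectl -n kube-system exec "$CILIUM_POD" -- cilium bpf tunnel list
```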
I'm trying to learn how the ipcache map sync works; here is what I've found so far from the cilium agent debug log:
I do not see the above log with the master branch image build. |
git bisect found this:
|
CC @Weil0ng any thoughts on why that commit might have broken ipcache synchronization? Maybe an unintended change in the CEP watcher? |
Hmm, the commit itself should not affect how the ipcache is synced, but it DOES affect how CEPs are synced with the apiserver. I wonder if this is similar to #16984 (comment) |
@Weil0ng did you have a chance to look (or test) further into it? Every node should have ipcache entries for all the endpoints in the cluster, given we need to know whether to reach remote endpoints via tunnel or not. (If there is no entry, connectivity would break, since routing won't happen via tunnel — this info, as @vincentmli pointed out, is stored in the ipcache metadata.) |
@borkmann on a side note, just curious: for a nodeport service in tunnel mode, is it possible to use the tunnel map lookup for encapsulation? The tunnel map should always include the node's pod CIDR, so there would be no need for a per-endpoint lookup in the ipcache map in case the entry is missing for some reason. |
From the Cilium meeting, maybe one next step is to look into how the tunnel IP is associated with the CEP resources? There seemed to be a question of whether it could be included in the "subresources" field, in which case maybe the update doesn't propagate. |
@vincentmli if it's possible, can you please provide a sysdump? Running |
@Weil0ng I have done git bisect reset, but I may still have the image in Docker Hub; let me see if I can re-create the issue. |
@Weil0ng here is the sysdump; the cilium image I ran is from a master branch build |
Thank you! However I don't see |
|
Oops, my bad, I was looking at the wrong sysdump path... |
A few observations:
Then a moment later, when it did get the identity, it went through the endpoint synchronization without any error:
So at this point, it should've created a CEP in the apiserver with status populated...
|
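A quick way to check that last step from outside the agent (a sketch; "nginx" below stands in for the actual CEP name, which matches the pod name):

```sh
# "cep" is the shortname for ciliumendpoints.cilium.io; a healthy CEP
# carries the identity and addressing in its status:
kubectl get cep --all-namespaces
kubectl get cep nginx -o jsonpath='{.status.identity.id}{"\n"}{.status.networking.addressing}{"\n"}'
```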
To add more confusion :) out of curiosity, I manually collected the operator log
and found this log entry:
Note that the timestamp seems incorrect, but the endpoint identity 2912 matches the nginx endpoint identity. I also noticed the above log output does not have a timestamp for each line, unlike the operator log collected by sysdump — and the operator log collected by sysdump does not contain the above log entry. |
Hmm... that could be the manifestation of not having a CEP reference the identity (not sure...). Is it possible to get the apiserver audit log from your env? I wonder if the code ever hits https://github.com/cilium/cilium/blob/master/pkg/k8s/watchers/endpointsynchronizer.go#L168, which it should. So the questions are:
|
How do I enable the apiserver audit log? And after I enable it, should I re-create the nginx pod? |
Hmm, I'm actually not sure how to enable it; it depends on your environment. Is that a self-managed cluster? If so, there should be a flag to the apiserver. Yes, after you enable the log, recreate the nginx pod to retrigger this code path. |
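For a self-managed kubeadm-style cluster, enabling it looks roughly like the sketch below; the file paths are assumptions, not values from this environment:

```sh
# Sketch for a kubeadm-style control plane: add audit flags to the
# kube-apiserver static pod manifest
# (/etc/kubernetes/manifests/kube-apiserver.yaml):
#   - --audit-policy-file=/etc/kubernetes/audit-policy.yaml
#   - --audit-log-path=/var/log/kubernetes/apiserver-audit.log
# and hostPath-mount both paths into the container. The kubelet restarts
# the apiserver automatically when its manifest changes:
kubectl -n kube-system get pod -l component=kube-apiserver
```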
@Weil0ng |
@Weil0ng Here is another sysdump and the apiserver audit log; the nginx pod is
|
Thanks a lot @vincentmli for going through with the effort! I do see the following in the audit log, which confirms the code path that is actually running:
So the above corresponds to that call. Right after it, I can see the apiserver receives the following create event:
which is when we create the CEP at https://github.com/cilium/cilium/blob/master/pkg/k8s/watchers/endpointsynchronizer.go#L168. So far so good. But then we got the following:
I think the next step would be to set the audit log level to |
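For reference, one way to pull the CiliumEndpoint events out of such an audit log (the log path is an assumption; the filter relies only on standard audit.k8s.io event fields):

```sh
# Each audit entry is a single JSON event; keep only ciliumendpoints
# traffic and print the verb, URI, and response code to spot the
# create/PATCH pair:
jq -c 'select(.objectRef.resource == "ciliumendpoints")
       | {verb: .verb, uri: .requestURI, code: .responseStatus.code}' \
  /var/log/kubernetes/apiserver-audit.log
```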
Below is the audit policy setting I used for the apiserver; it has
|
Yea, you are right, it's already at "Request"... hmm, then I wonder if the object being created is without the status fields... Can you help take a look at #17192 (comment), @aanm? Specifically, where does that |
Friendly ping :) @aanm, can you help take a look at #17192 (comment)? Specifically, where does that PATCH come from? Is that expected? |
@vincentmli If you don't mind giving this another go, can you please edit the last part in your audit policy to record non-core requests at "Request" level too? I suspect the reason we did not see the request body is that the last, catch-all rule records those requests at a lower level. In short, can you change the last item in your audit policy to the following and reproduce the issue?
|
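For illustration, the kind of edit being asked for might look like the sketch below, assuming a kubeadm-style policy file at /etc/kubernetes/audit-policy.yaml:

```sh
# Intended change (sketch): audit policy rules match first-to-last, so an
# earlier "Request"-level rule scoped to core resources does not cover
# cilium.io group requests. The trailing catch-all item should read:
#
#   - level: Request   # was: Metadata; Request records the request body
#
# Verify which levels are currently in effect:
grep -n 'level:' /etc/kubernetes/audit-policy.yaml
```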
@Weil0ng here you go: cilium-sysdump-20210907-202207.zip
|
Thank you! Now we can clearly see that the
There's also a
In summary, here's what we've confirmed so far:
Questions that still need answers:
|
Let me know if that helps. |
Yes, this is the old model with
|
Coming back to this: @vincentmli, I wonder if you still experience this issue? I cannot repro this on GKE at least (using a master cilium image). If you can still repro this, can you please provide the output from |
Yes, the problem is still here. kubectl get crd ciliumendpoints.cilium.io -o yaml:
|
Thank you! Ah okay, I see it now: this CRD still has status as a
I checked the schema resource version at the commit in your initial message; it should be
But somehow your cluster has the one from cilium/pkg/k8s/apis/cilium.io/v2/types.go, line 41 at commit 8d8a7f8.
I wonder if you only upgraded your agent from an old version but not the operator? It seems like the CRD was not updated correctly, and that's why we are losing the |
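One way to confirm that from the cluster side (a sketch using apiextensions/v1 field paths and Cilium's schema-version annotation):

```sh
# If this prints a "status" entry for the stored version, the CRD still
# follows the old model, and status written via a plain update can be lost:
kubectl get crd ciliumendpoints.cilium.io \
  -o jsonpath='{range .spec.versions[*]}{.name}: {.subresources}{"\n"}{end}'
# Cilium stamps its CRDs with the schema version it registered them at:
kubectl get crd ciliumendpoints.cilium.io \
  -o jsonpath='{.metadata.annotations.io\.cilium\.k8s\.crd\.schema\.version}{"\n"}'
```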
yes, I always run |
I think so. The CRDs are registered by the operator; if it does not restart, I suspect the CRDs were never updated. So eventually you have an outdated CRD (which has |
As a mitigation, can you please try restarting/upgrading the operator and see if it fixes the problem? |
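Concretely, something like the sketch below; the deployment name assumes a standard install:

```sh
# Restart the operator so it re-registers the CRDs with the current
# schema, then re-check the subresources field and retry the nodeport
# connection:
kubectl -n kube-system rollout restart deployment/cilium-operator
kubectl -n kube-system rollout status deployment/cilium-operator
```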
@Weil0ng yes, |
Cilium is expected to be deployed with one consistent version across all components, so that makes sense. Thanks for digging to the bottom of this! Closing out, as it seems the real solution is to upgrade the operator to match the Cilium version. |
Bug report
Cilium upstream master branch breaks nodeport connection from an external client.
General Information
Cilium version (run `cilium version`):
Client: 1.10.90 c44ff1b 2021-08-03T00:35:29+05:30 go version go1.16.5 linux/amd64
Daemon: 1.10.90 c44ff1b 2021-08-03T00:35:29+05:30 go version go1.16.5 linux/amd64
Kernel version (run `uname -a`):
5.8.1-050801-generic #202008111432 SMP Tue Aug 11 14:34:42 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Orchestration system version in use (e.g. `kubectl version`, ...):
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.5", GitCommit:"6b1d87acf3c8253c123756b9e61dac642678305f", GitTreeState:"clean", BuildDate:"2021-03-18T01:10:43Z", GoVersion:"go1.15.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.8", GitCommit:"5575935422cc1cf5169dfc8847cb587aa47bac5a", GitTreeState:"clean", BuildDate:"2021-06-16T12:53:07Z", GoVersion:"go1.15.13", Compiler:"gc", Platform:"linux/amd64"}
can upload if needed
How to reproduce the issue
1. Build a docker image from the upstream master branch.
2. Deploy Cilium with the attached cilium yaml file.
3. Deploy the above nodeport service (a hypothetical stand-in is sketched below).
5. Access the nodeport service from an external client.
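For readers wanting to reproduce, a hypothetical stand-in for the nodeport service described above; the names, image, and port are illustrative, not the original manifest:

```sh
# Hypothetical nginx deployment plus nodeport service of the kind used
# in this report:
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.21
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: nginx
spec:
  type: NodePort
  selector:
    app: nginx
  ports:
  - port: 80
    nodePort: 30080
EOF
# Then, from a client outside the cluster:
curl http://<node-ip>:30080/
```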
cilium-upstream.yaml.txt