Enable CiliumEndpointSlice feature #17658
Conversation
Force-pushed from d3bf21f to ae7e40b (Compare)
test-me-please |
Force-pushed from ae7e40b to 750e536 (Compare)
test-me-please |
ConformanceEKS (ci-eks): the failure looks flaky. I looked at the cilium-config from the sysdump and the CES feature is not enabled at all. https://github.com/cilium/cilium/actions/runs/1366449190
gke-stable (test-gke) failures: pods aren't ready.
Force-pushed from 750e536 to 43aa2e6 (Compare)
k8s-1.20-kernel-4.19 (test-1.20-4.19) failure: the root cause of the issue is that one of the deleted pods' IPv6 addresses is reallocated to cilium-health before that pod's entry is deleted from the IPCache.
The sequence of events:
k8s-1.16-kernel-netnext failures: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.16-net-next/1733/. Four tests failed with the net-next kernel. Test Result (4 failures / +4)
test-1.20-4.19 |
Found out the reason for the net-next failures (#17658 (comment)):
The egress policy updater relies on […]. We discussed this issue with @MasterZ40 offline; the fix is to use […]. This failure is tracked in #17669. @aanm can we treat this as a known issue and unblock merging? All other tests are expected to pass.
test-1.19-5.4 |
test-1.21-4.9 |
test-gke |
@Weil0ng as long as it's documented
One small nit I have is that there are still lots of files named as ciliumendpointbatch and not ciliumendpointslice.
All other e2e tests are green. gke-stable: https://jenkins.cilium.io/job/Cilium-PR-K8s-GKE/6714/
Force-pushed from 03736b8 to 5ad2554 (Compare)
Should this be marked ready for review? Can you provide context for reviewers in the PR description/commit message?
Force-pushed from 5ad2554 to 315700b (Compare)
We're aware of an issue affecting the Jenkins-based infrastructure, so no action is necessary from you on that side. We can look out for the results of the GHA-based infrastructure ([…]).
Provisioning issues have been fixed.
/test-runtime |
test-1.16-netnext run looks like it needs some additional attention. Unless there is a very recent regression on master and this PR has been rebased to include it, it seems likely that the failures are somehow related to this PR as one of the recent PR runs for this job has succeeded about 5 hours ago (note, this is the PR listing so can include failures related to other PRs). |
This is very odd...we know that the egress gateway tests WILL fail w/ CES (see #17669), but per my understanding, the CI here does not enable CES... Edit: actually on a closer look, these are failing for a different reason...the known issue is that the traffic won't be SNATed to egress IP correctly, but this is packet loss... |
Sometimes, failure to NAT or reverse-NAT correctly can exhibit as packet loss because the packets end up at the wrong destination or replies arrive back with the wrong addresses, hence the stack doesn't hand the response back to the application socket. |
Makes sense, but this is pinging from a pod to a node within the cluster... I don't see how this PR would affect this path. Plus, CES is not enabled at all... maybe the test is flaky?
Given that all 4/4 egress gateway tests failed with a consistent error and a lack of similar past failures, the most likely explanation is that something in this PR is triggering the failure. |
I did a quick comparison between […].
Just for the record, currently we are seeing failures only in the net-next based tests; the failing tests are related to […]. I validated these tests on a dev machine and they all passed.
Enable CiliumEndpointSlice feature

1) A CiliumEndpointSlice object packs a group of slim versions of CiliumEndpoints, and these objects are broadcast to all cilium-agents running on the cluster.
2) If the CiliumEndpointSlice feature is enabled, cilium-agents no longer watch for CiliumEndpoint updates; instead they watch for CiliumEndpointSlices. The CES watcher calls the endpointUpdated/endpointDeleted functions for every CEP present in a CES.
3) Only the cilium-operator watches for CEPs; it Creates/Updates/Deletes CiliumEndpointSlice objects based on CiliumEndpoint updates.
4) By default, CiliumEndpoints are grouped based on Security Identity ID. If pods have the same Security Identity ID, they are put together in a single CiliumEndpointSlice.
5) By default, a maximum of 100 CiliumEndpoints can be grouped in a single CiliumEndpointSlice.

This entire feature is split across multiple PRs; each PR was reviewed separately and merged into the cep-scalability branch.

Signed-off-by: Gobinath Krishnamoorthy <gobinathk@google.com>
Force-pushed from 86cf35a to a8efa34 (Compare)
/test-1.16-netnext |
Thank you @joestringer for the re-run of the net-next CI test; I see it has passed now.
Everything should be green in CI now, all other tests were already green and the net-next run was also green this time. I'll run once more just to check that there weren't any other consistent failures and then I think this should be good to merge. |
/test |
The ci-aks job failed during Cilium install due to warnings; it seems the cilium-agent backends couldn't be reached to fetch agent status. 🤔
/ci-aks |
The gke-stable (test-gke) failure is related to a cluster access issue; all of a sudden we lost connection to the cluster. [2021-11-10T19:47:23.923Z] error when deleting "cilium-16b645389d12cc54.yaml": Delete "https://34.127.123.192/apis/apps/v1/namespaces/kube-system/deployments/cilium-operator": dial tcp 34.127.123.192:443: connect: connection refused
test-gke: Job 'Cilium-PR-K8s-GKE' failed and has not been observed before, so it may be related to your PR.
Test Name: […]
Failure Output: […]
If it is a flake, comment […]
Again. @joestringer @Weil0ng any thoughts here?
The gke failure looks like infra instability to me... one of the two test pods is ready, and the other fails its health check...
Created #17857; retriggering here.
test-gke |
Just to cross-check, the CES feature is also validated on a K8s 1.22 based CI test.
Similarly, the CES feature is validated on a K8s 1.21 based CI test.
Enable CiliumEndpointSlice feature; see the design in the CFP.
By default, CiliumEndpoints with the same Security Identity ID are put together in a single CiliumEndpointSlice. This entire feature is split across multiple PRs; each PR was reviewed separately and merged into the cep-scalability branch.
List of PRs reviewed and merged in the cep-scalability branch: […]
List of pending work in the CiliumEndpointSlice feature: a few items are tracked here: […]
Signed-off-by: Gobinath Krishnamoorthy gobinathk@google.com
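For readers trying the feature, it is toggled through the agent/operator configuration. The sketch below shows the shape of a cilium-config ConfigMap entry; the flag name `enable-cilium-endpoint-slice` is an assumption based on the feature name and should be verified against the merged PR and the Cilium documentation.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cilium-config
  namespace: kube-system
data:
  # Assumed flag name for enabling CES; verify against the Cilium docs.
  enable-cilium-endpoint-slice: "true"
```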