k8s: fix incorrect EndpointSlice API version #27277

Merged
merged 1 commit into cilium:v1.13 on Aug 9, 2023

Conversation

ysksuzuki
Member

@ysksuzuki ysksuzuki commented Aug 4, 2023

This PR fixes a bug where the k8s watcher incorrectly chooses the discovery/v1beta1 EndpointSlice API in environments where discovery/v1 EndpointSlice is available. Because of this bug, the agent doesn't wait for the discovery/v1 EndpointSlice resources to be received on restart and ends up removing live services.
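
For context, picking the right API version amounts to a capability check against kube-apiserver: prefer discovery.k8s.io/v1 and fall back to discovery.k8s.io/v1beta1 only when v1 is not served. The following is a minimal, self-contained sketch of such a check using client-go's discovery client; `preferV1EndpointSlice` and the kubeconfig wiring are illustrative and are not Cilium's actual watcher code.

```go
package main

import (
	"fmt"

	"k8s.io/client-go/discovery"
	"k8s.io/client-go/tools/clientcmd"
)

// preferV1EndpointSlice reports whether the cluster serves EndpointSlice under
// the stable discovery.k8s.io/v1 group/version. A watcher should prefer v1 and
// only fall back to discovery.k8s.io/v1beta1 when v1 is unavailable.
func preferV1EndpointSlice(dc discovery.DiscoveryInterface) bool {
	resources, err := dc.ServerResourcesForGroupVersion("discovery.k8s.io/v1")
	if err != nil {
		// The group/version is not served (or discovery failed); fall back.
		return false
	}
	for _, r := range resources.APIResources {
		if r.Name == "endpointslices" {
			return true
		}
	}
	return false
}

func main() {
	// Illustrative wiring only: load the local kubeconfig and query discovery.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	dc, err := discovery.NewDiscoveryClientForConfig(config)
	if err != nil {
		panic(err)
	}
	fmt.Println("use discovery/v1 EndpointSlice:", preferV1EndpointSlice(dc))
}
```

In Cilium itself the selection happens inside the k8s watcher setup; the sketch only shows the general shape of the decision the fix corrects.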

The issue happens in the following scenario:

  1. The agent initializes and runs the informers for Service and EndpointSlice.
  2. The reflectors list the resources from kube-apiserver and replace DeltaFIFO.
  3. The informers pop items from the DeltaFIFO and call the onAdd handler one by one.
  4. If the Service handler is called first, ServiceCache.UpdateService doesn't emit an UpdateService event, because the service is not yet ready (it has no backends).
  5. The EndpointSlice handler is called next, and ServiceCache.updateEndpoints emits an UpdateService event.
  6. The agent doesn't wait until the event triggered at step 5 is processed, and calls SyncWithK8sFinished (a minimal sketch of this missing wait follows the list).
  7. The agent removes live services because it hasn't heard about them. (It is supposed to learn about those services while processing the event triggered at step 5.)
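
To make the ordering concrete, here is a minimal Go sketch of the synchronization the scenario above is missing: events produced during the initial sync are tracked and drained before the stale-service cleanup runs. The `event` type, the `emit` helper, and the final cleanup step stand in for Cilium's UpdateService events and SyncWithK8sFinished; they are illustrative, not the real API.

```go
package main

import (
	"fmt"
	"sync"
)

// event stands in for a ServiceCache UpdateService event; done is called once
// the event has actually been processed.
type event struct {
	service string
	done    func()
}

func main() {
	var inFlight sync.WaitGroup
	events := make(chan event, 16)
	learned := map[string]bool{}

	// Handler side (steps 3-5): the EndpointSlice onAdd handler emits an
	// UpdateService-style event and registers it as in-flight.
	emit := func(svc string) {
		inFlight.Add(1)
		events <- event{service: svc, done: inFlight.Done}
	}
	emit("default/my-service")
	close(events)

	// Consumer side: processing the event is what teaches the agent that the
	// service is alive.
	go func() {
		for ev := range events {
			learned[ev.service] = true
			ev.done()
		}
	}()

	// Cleanup side (steps 6-7): waiting for in-flight events before the
	// SyncWithK8sFinished-style cleanup is what prevents live services from
	// being garbage-collected.
	inFlight.Wait()
	fmt.Println("services learned during initial sync:", learned)
}
```

The WaitGroup matters for the happens-before it provides: the cleanup step only observes state after every in-flight event handler has finished, which is exactly the guarantee the buggy path skipped.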

This issue is not present in v1.14 and later because those versions use K8sAPIGroupEndpointSliceOrEndpoint. The affected versions are v1.12 and v1.13.

Fixes: #27215

@maintainer-s-little-helper maintainer-s-little-helper bot added the backport/1.13 (This PR represents a backport for Cilium 1.13.x of a PR that was merged to main.) and kind/backports (This PR provides functionality previously merged into master.) labels on Aug 4, 2023
This commit fixes a bug where the k8s watcher incorrectly chooses the
discovery/v1beta1 EndpointSlice in environments where
discovery/v1 EndpointSlice is available.
The agent doesn't wait for the discovery/v1 EndpointSlice to be received
on restart and ends up removing live services because of this bug.

The issue happens in the following scenario:

1. The agent initializes and runs the informers for Service and EndpointSlice.
2. The reflectors list the resources from kube-apiserver and replace DeltaFIFO.
3. The informers pop items from the DeltaFIFO and call the onAdd handler one by one.
4. If the Service handler is called first, ServiceCache.UpdateService doesn't
emit an UpdateService event, because the service is not yet ready (it has no backends).
5. The EndpointSlice handler is called next, and ServiceCache.updateEndpoints emits
an UpdateService event.
6. The agent doesn't wait until the event triggered at step 5 is processed, and
calls SyncWithK8sFinished.
7. The agent removes live services because it hasn't heard about them.
(It is supposed to learn about those services while processing the event triggered at step 5.)

Fixes: cilium#27215

Signed-off-by: Yusuke Suzuki <yusuke-suzuki@cybozu.co.jp>
@ysksuzuki ysksuzuki marked this pull request as ready for review August 4, 2023 13:50
@ysksuzuki ysksuzuki requested a review from a team as a code owner August 4, 2023 13:50
@christarazi
Member

christarazi commented Aug 4, 2023

/test-backport-1.13

Job 'Cilium-PR-K8s-1.25-kernel-4.19' failed:

Test Name

K8sDatapathServicesTest Checks E/W loadbalancing (ClusterIP, NodePort from inside cluster, etc) Tests NodePort inside cluster (kube-proxy) 

Failure Output

FAIL: Request from testclient-rlbxb pod to service tftp://[fd04::11]:31994/hello failed

Jenkins URL: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.25-kernel-4.19/818/

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.25-kernel-4.19 so I can create one.

Then please upload the Jenkins artifacts to that issue.

@christarazi christarazi added the release-blocker/1.13 label (This issue will prevent the release of the next version of Cilium.) on Aug 4, 2023
@christarazi
Member

Good catch!

@ysksuzuki
Member Author

/test-1.25-4.19

@nebril
Member

nebril commented Aug 9, 2023

I think we can merge this one, but we also need this change in 1.12, right?

@nebril nebril added the ready-to-merge label (This PR has passed all tests and received consensus from code owners to merge.) on Aug 9, 2023
@nebril nebril merged commit 4f28282 into cilium:v1.13 Aug 9, 2023
60 checks passed
@maintainer-s-little-helper maintainer-s-little-helper bot removed the ready-to-merge label on Aug 9, 2023
Development

Successfully merging this pull request may close these issues.

Agent restart causes temporary connection loss by incorrectly removing still alive services