Fix bug with toServices policy where service backend churn left stale CIDR identities #25687

Merged
merged 4 commits into from Jun 12, 2023

Conversation

christarazi
Member

@christarazi christarazi commented May 25, 2023

  • k8s: Plumb old Endpoints object through handlers
  • k8s: Fix toServices rule translation cleanup
  • k8s: Add test for RuleTranslator.Translate()
  • k8s: Remove unnecessary AllocatePrefixes from RuleTranslator

Fixes: #20477


The main commit, "k8s: Fix toServices rule translation cleanup", is pasted below for ease of review:

Previously, toServices-based rules did not properly clean up CIDR
identities. When service backends were removed or changed, the deletion
logic acted on the new object rather than on the old object, so
the entries that were supposed to be deleted were simply added back in
generateToCidrFromEndpoint().

Fix this by passing the deletion logic the old endpoint object state
and performing a diff between the old and new states.
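
For readers following along, here is a minimal Go sketch of the old-versus-new diff described above. The helper names `backendPrefixes` and `prefixesToRelease` and the use of plain IP strings are assumptions made for illustration; they are not taken from `pkg/k8s/rule_translate.go`, and the actual change may be structured differently.

```go
package main

import (
	"fmt"
	"net/netip"
	"sort"
)

// backendPrefixes converts backend IPs into single-address prefixes
// (/32 for IPv4, /128 for IPv6). Unparseable addresses are skipped.
// Hypothetical helper, for illustration only.
func backendPrefixes(ips []string) map[netip.Prefix]struct{} {
	out := make(map[netip.Prefix]struct{}, len(ips))
	for _, ip := range ips {
		addr, err := netip.ParseAddr(ip)
		if err != nil {
			continue
		}
		out[netip.PrefixFrom(addr, addr.BitLen())] = struct{}{}
	}
	return out
}

// prefixesToRelease returns the prefixes present in the old backend set but
// absent from the new one, i.e. the candidates for cleanup.
// Hypothetical helper, for illustration only.
func prefixesToRelease(oldIPs, newIPs []string) []netip.Prefix {
	oldSet, newSet := backendPrefixes(oldIPs), backendPrefixes(newIPs)
	var stale []netip.Prefix
	for p := range oldSet {
		if _, ok := newSet[p]; !ok {
			stale = append(stale, p)
		}
	}
	// Sort for deterministic output; map iteration order is random.
	sort.Slice(stale, func(i, j int) bool {
		return stale[i].String() < stale[j].String()
	})
	return stale
}

func main() {
	oldIPs := []string{"192.0.2.10", "192.0.2.11"}
	newIPs := []string{"192.0.2.11", "192.0.2.12"}

	// Diffing old against new finds the backend that went away.
	fmt.Println(prefixesToRelease(oldIPs, newIPs))

	// Diffing the new state against itself finds nothing to release,
	// which mirrors the symptom of acting only on the new object.
	fmt.Println(prefixesToRelease(newIPs, newIPs))
}
```

The second call in `main` shows why the old state is needed: with only the new object available, there is nothing to compare against, so no prefix is ever identified as stale.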

@christarazi christarazi added kind/bug This is a bug in the Cilium logic. sig/k8s Impacts the kubernetes API, or kubernetes -> cilium internals translation layers. release-note/bug This PR fixes an issue in a previous release of Cilium. sig/policy Impacts whether traffic is allowed or denied based on user-defined policies. labels May 25, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot added dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. and removed dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels May 25, 2023
@christarazi christarazi changed the title from "pr/christarazi/to services fixups" to "k8s: Fix toServices rule translation cleanup" May 25, 2023
@christarazi christarazi changed the title from "k8s: Fix toServices rule translation cleanup" to "Fix bug with toServices policy where backend churn left stale CIDR identities in the ipcache" May 25, 2023
@christarazi christarazi changed the title from "Fix bug with toServices policy where backend churn left stale CIDR identities in the ipcache" to "Fix bug with toServices policy where service backend churn left stale CIDR identities in the ipcache" May 25, 2023
@christarazi christarazi force-pushed the pr/christarazi/to-services-fixups branch 2 times, most recently from 6da6251 to f9adfdf Compare May 25, 2023 18:11
@christarazi christarazi changed the title from "Fix bug with toServices policy where service backend churn left stale CIDR identities in the ipcache" to "Fix bug with toServices policy where service backend churn left stale CIDR identities" May 25, 2023
@christarazi christarazi force-pushed the pr/christarazi/to-services-fixups branch from f9adfdf to f301050 Compare May 25, 2023 18:18
@christarazi
Member Author

/test

@christarazi christarazi marked this pull request as ready for review May 25, 2023 20:46
@christarazi christarazi requested review from a team as code owners May 25, 2023 20:46
@christarazi
Member Author

cc @squeed @joestringer for opening #20477

@jrajahalme
Member

#25559 has merged, so this needs a rebase?

operator/watchers/k8s_service_sync.go (outdated, resolved)
@christarazi christarazi force-pushed the pr/christarazi/to-services-fixups branch 2 times, most recently from 77d3561 to 647f901 Compare June 7, 2023 01:34
@christarazi
Member Author

christarazi commented Jun 7, 2023

/test

Job 'Cilium-PR-K8s-1.16-kernel-4.19' failed:

Test Name

K8sDatapathConfig Transparent encryption DirectRouting Check connectivity with transparent encryption and direct routing with bpf_host

Failure Output

FAIL: Connectivity test between nodes failed

Jenkins URL: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.16-kernel-4.19/482/

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.16-kernel-4.19 so I can create one.

Then please upload the Jenkins artifacts to that issue.

@christarazi
Member Author

christarazi commented Jun 7, 2023

/mlh new-flake Cilium-PR-K8s-1.16-kernel-4.19

👍 created #25964

pkg/k8s/service_cache.go (outdated, resolved)
@christarazi christarazi force-pushed the pr/christarazi/to-services-fixups branch from 647f901 to 5d37c69 Compare June 7, 2023 21:26
@christarazi
Member Author

christarazi commented Jun 7, 2023

/test

Job 'Cilium-PR-K8s-1.16-kernel-4.19' failed:

Test Name

K8sAgentPolicyTest Multi-node policy test with L7 policy using connectivity-check to check datapath

Failure Output

FAIL: connectivity-check pods are not ready after timeout

Jenkins URL: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.16-kernel-4.19/529/

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.16-kernel-4.19 so I can create one.

Then please upload the Jenkins artifacts to that issue.

Member

@giorio94 giorio94 left a comment

A couple of comments and questions about the OldEndpoints propagation; the rest of the changes look good to me.

pkg/k8s/watchers/watcher.go (outdated, resolved)
ID: esID.ServiceID,
Service: svc,
Endpoints: endpoints,
OldEndpoints: oldEPs,
Member

Should OldEndpoints also be set in the other cases in which an UpdateService event is emitted? IMO it is mainly necessary in the DeleteEndpoints method below.

Member Author

Given #25687 (comment), can you elaborate on why it's necessary?

Member

I'm referring in particular to the event generated here. Essentially, that happens when one endpointslice gets deleted, triggering an update event with the remaining endpoints for that service. Since OldEndpoints is not propagated, the corresponding PrefixesToRelease rules will never be generated by the K8sTranslator.
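
To illustrate the point above, here is a simplified Go sketch. It assumes a hypothetical `serviceEvent` type that models only the `Endpoints` and `OldEndpoints` fields as plain IP strings; the real event type and the `K8sTranslator` logic are not reproduced here. When an endpointslice is deleted and the event carries no `OldEndpoints`, a diff has nothing to compare against and yields nothing to release.

```go
package main

import "fmt"

// serviceEvent is a hypothetical, simplified stand-in for the event shown in
// the reviewed snippet; only the two endpoint fields are modeled.
type serviceEvent struct {
	Endpoints    []string // backends merged from all remaining endpointslices
	OldEndpoints []string // backends of the affected slice before the event
}

// staleBackends returns the backends listed in OldEndpoints that no longer
// appear in Endpoints. Hypothetical helper, for illustration only.
func staleBackends(ev serviceEvent) []string {
	current := make(map[string]bool, len(ev.Endpoints))
	for _, ip := range ev.Endpoints {
		current[ip] = true
	}
	var stale []string
	for _, ip := range ev.OldEndpoints {
		if !current[ip] {
			stale = append(stale, ip)
		}
	}
	return stale
}

func main() {
	// Slice A held 192.0.2.10 and was deleted; slice B still holds 192.0.2.20.
	withOld := serviceEvent{
		Endpoints:    []string{"192.0.2.20"},
		OldEndpoints: []string{"192.0.2.10"},
	}
	// Same update, but emitted without the old state.
	withoutOld := serviceEvent{
		Endpoints: []string{"192.0.2.20"},
	}

	fmt.Println(staleBackends(withOld))
	fmt.Println(staleBackends(withoutOld))
}
```

Note that in this sketch `OldEndpoints` comes from a single slice while `Endpoints` is the merged view; a backend still present in another slice is not reported as stale.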

Member

There's also another aspect that I didn't realize earlier: OldEndpoints will contain the previous list of endpoints from a given endpointslice, while Endpoints will contain the ones merged from all endpointslices. It seems to me that this should not create problems, though.

Member Author

In the end, I've set the OldEndpoints for all UpdateService events. Please take another look. Thanks for double checking!

Member

🤔 I see that OldEndpoints and Endpoints are always the same for updates from remote service backends. Shouldn't they be different? (I'm not 100% sure about how this feature should play with global services)

Member Author

toServices (currently) only supports backends that are outside of the cluster (world), and not pod-based backends.

Member

The same logic is also used in the external workloads case, and IIUC in that case the backends processed through that function might be outside of the cluster. Still, that is most likely an extremely rare condition.

Member Author

@giorio94 Could you file a follow-up issue so that we don't forget this?

Member

@christarazi christarazi force-pushed the pr/christarazi/to-services-fixups branch from 5d37c69 to ae516b4 Compare June 8, 2023 18:07
@christarazi
Member Author

/test

This exposes the old endpoints object so that subsequent commits can
make use of it to perform proper diff logic for rule translation
(`pkg/k8s/rule_translate.go`).

This commit should have no functional impact as the code merely
references the new endpoint object and does not touch the newly exposed
old object.

Signed-off-by: Chris Tarazi <chris@isovalent.com>
Previously, toServices-based rules did not properly clean up CIDR
identities. When service backends were removed or changed, the deletion
logic acted on the new object rather than on the old object, so
the entries that were supposed to be deleted were simply added back in
generateToCidrFromEndpoint().

Fix this by passing the deletion logic the old endpoint object state
and performing a diff between the old and new states.

Signed-off-by: Chris Tarazi <chris@isovalent.com>
The previous commit fixed toServices diffing logic. This commit adds a
test that was used to validate the fix.

Signed-off-by: Chris Tarazi <chris@isovalent.com>
This variable is no longer necessary because it doesn't actually prevent
ipcache interaction as of commit 4b87ccc ("pkg/k8s/watcher: fix
deadlock with service event handler & CES watcher."). Remove it, as
doing so has no functional impact.

Signed-off-by: Chris Tarazi <chris@isovalent.com>
@christarazi
Member Author

/test

@christarazi christarazi requested a review from giorio94 June 9, 2023 22:40
Member

@giorio94 giorio94 left a comment

/lgtm

@christarazi
Member Author

Thanks for the very thorough review @giorio94!

Labels
kind/bug This is a bug in the Cilium logic. release-note/bug This PR fixes an issue in a previous release of Cilium. sig/k8s Impacts the kubernetes API, or kubernetes -> cilium internals translation layers. sig/policy Impacts whether traffic is allowed or denied based on user-defined policies.
Development

Successfully merging this pull request may close these issues.

Deleting service selected by ToServices policy rules can cause policy denies until agent restart
5 participants