
Respond with ICMP reply for traffic to services without backends #28157

Merged: 10 commits into main from pr/dylan/svc-no-backend-response, Nov 23, 2023

Conversation

@dylandreimerink (Member) commented Sep 14, 2023:

So far we have been dropping packets meant for services which do not have endpoints/backends. This causes clients to needlessly wait for replies and to retry sending traffic. This PR adds the ability to send back an ICMP or ICMPv6 reply with Destination unreachable (type 3) + Port unreachable (code 3) whenever this happens.

This behavior is controllable via a new --service-no-backend-response flag, which defaults to reject, matching the behavior clients expect. It can also be set to drop to preserve the existing behavior if desired.

This new behavior works for both North/South traffic entering a node and East/West traffic responding to a request from a pod within the cluster.

Fixes: #10002
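For readers who want the wire-level picture: the reply described above is a standard RFC 792 Destination Unreachable message. The sketch below is editorial, not the PR's datapath code; the struct and helper names are illustrative, while the type/code values and layout come from the PR description and the RFC.

```c
/* Illustrative sketch (not the PR's BPF code): the RFC 792 layout of the
 * ICMPv4 reply described above. Struct and function names are made up for
 * illustration.
 */
#include <linux/icmp.h>
#include <linux/in.h>
#include <linux/ip.h>
#include <string.h>

struct icmp_no_backend_reply {
	struct iphdr ip;       /* outer header, service -> client direction */
	struct icmphdr icmp;   /* type 3 (Destination Unreachable), code 3 (Port Unreachable) */
	struct iphdr orig_ip;  /* echo of the original IP header ... */
	__u8 orig_l4[8];       /* ... plus the first 8 bytes of its payload */
};

static void build_reply(struct icmp_no_backend_reply *r,
			const struct iphdr *orig, const void *orig_l4)
{
	memset(r, 0, sizeof(*r));
	/* Swap addresses so the reply goes straight back to the client. */
	r->ip.saddr = orig->daddr;
	r->ip.daddr = orig->saddr;
	r->ip.protocol = IPPROTO_ICMP;
	r->icmp.type = ICMP_DEST_UNREACH;  /* type 3 */
	r->icmp.code = ICMP_PORT_UNREACH;  /* code 3 */
	/* The embedded original packet lets the client match the error
	 * to the socket that sent the request. */
	r->orig_ip = *orig;
	memcpy(r->orig_l4, orig_l4, sizeof(r->orig_l4));
	/* IP and ICMP checksums still need to be computed over the result. */
}
```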


@maintainer-s-little-helper (bot) added the dont-merge/needs-release-note-label label (The author needs to describe the release impact of these changes.) Sep 14, 2023
@dylandreimerink added the kind/enhancement (This would improve or streamline existing functionality.), sig/datapath (Impacts bpf/ or low-level forwarding details, including map management and monitor messages.), and release-note/misc (This PR makes changes that have no direct user impact.) labels Sep 14, 2023
@maintainer-s-little-helper (bot) removed the dont-merge/needs-release-note-label label Sep 14, 2023
@dylandreimerink (Member Author):
/test

@dylandreimerink (Member Author):
/test

@dylandreimerink (Member Author):
/test

@dylandreimerink (Member Author):
/test

@dylandreimerink (Member Author):
/ci-l4lb

@dylandreimerink (Member Author):
/ci-l4lb

@dylandreimerink force-pushed the pr/dylan/svc-no-backend-response branch 2 times, most recently from 18834c0 to a9a4a6c on September 15, 2023
@dylandreimerink (Member Author):
/test

@dylandreimerink marked this pull request as ready for review on September 25, 2023
@ti-mo self-requested a review on November 23, 2023
The --service-no-backend-response=reject feature requires the use of
`bpf_skb_adjust_room` with the `BPF_ADJ_ROOM_MAC` mode to make room for
the outer IP + ICMP header. However, this mode is only available from
v5.2 onwards, so this commit adds a probe to check for its availability
and falls back to --service-no-backend-response=drop on kernels that do
not support it.

Signed-off-by: Dylan Reimerink <dylan.reimerink@isovalent.com>
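As a rough illustration of what the probe guards (an editorial sketch, not the probe code itself; the function name and sizes are assumptions), the datapath needs a call of roughly this shape, which older kernels reject:

```c
/* Sketch of the guarded operation: grow the packet right behind the MAC
 * header so the outer IP + ICMP header fits. BPF_ADJ_ROOM_MAC is rejected
 * on kernels older than v5.2, which is what the probe detects before
 * enabling the reject behavior.
 */
#include <linux/bpf.h>
#include <linux/icmp.h>
#include <linux/ip.h>
#include <bpf/bpf_helpers.h>

static __always_inline long make_room_for_icmp_headers(struct __sk_buff *ctx)
{
	__s32 len_diff = (__s32)(sizeof(struct iphdr) + sizeof(struct icmphdr));

	return bpf_skb_adjust_room(ctx, len_diff, BPF_ADJ_ROOM_MAC, 0);
}
```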
The current ingress conformance test adds an ingress and then curls it
to confirm it works. In the past, Cilium would have dropped the request
packets until the datapath was set up; this silent dropping causes curl
to retry for 15 seconds before giving up.

With the ICMP reply, however, curl gets an immediate response and gives
up immediately, causing the test to fail. So this commit adds manual
retry logic and delays to the test script.

Signed-off-by: Dylan Reimerink <dylan.reimerink@isovalent.com>
This commit adds token-bucket rate limiting to the datapath, implemented
purely in BPF. A new map is added to keep track of the buckets. A bucket
can be keyed on anything, though since ICMPv6 is currently the only
user, it is keyed on ifindex. The value holds the current number of
tokens in the bucket and the last time tokens were added to the bucket.

For every event we check whether there is at least one token left in the
bucket: if so, we decrement the token count and continue; if not, we
execute the rate-limiting action. Typically a timer would add new tokens
to the bucket; in our case we keep track of the last time we added
tokens and calculate how many tokens should have been added since then
before we do the token check.

This implements a burstable rate-limiting mechanism. The burst size and
token refill rate are configurable. For ICMPv6 it is currently set to
100 replies per second with a burst size of 1000.

Signed-off-by: Dylan Reimerink <dylan.reimerink@isovalent.com>
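Pieced together from the snippets quoted in the review threads below, the core check looks roughly like this. Editorial sketch: field and map-value names follow those snippets; the bucket_size cap and the function shape are my assumptions.

```c
/* Editorial sketch of the token-bucket check, assembled from the review
 * snippets below. ktime_get_ns() is the datapath's wrapper around
 * bpf_ktime_get_ns(); bucket_size (the burst cap) is assumed.
 */
struct ratelimit_value {
	__u64 last_topup; /* last time tokens were added */
	__u64 tokens;     /* tokens currently in the bucket */
};

struct ratelimit_settings {
	__u64 topup_interval_ns; /* how often tokens are added */
	__u64 tokens_per_topup;  /* tokens added per interval */
	__u64 bucket_size;       /* burst size, e.g. 1000 for ICMPv6 */
};

static bool ratelimit_take(struct ratelimit_value *value,
			   const struct ratelimit_settings *settings)
{
	__u64 since_last_topup = ktime_get_ns() - value->last_topup;

	if (since_last_topup > settings->topup_interval_ns) {
		/* Add tokens for every missed interval ... */
		value->tokens += (since_last_topup / settings->topup_interval_ns) *
				 settings->tokens_per_topup;
		value->last_topup = ktime_get_ns();
		/* ... but never beyond the burst size. */
		if (value->tokens > settings->bucket_size)
			value->tokens = settings->bucket_size;
	}
	if (value->tokens > 0) {
		value->tokens--;
		return true;   /* under the limit, continue */
	}
	return false;          /* rate-limit action (drop the reply) */
}
```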
@dylandreimerink (Member Author):
/test

@dylandreimerink added this pull request to the merge queue Nov 23, 2023
Merged via the queue into main with commit 93f1619 on Nov 23, 2023; 208 checks passed
@dylandreimerink deleted the pr/dylan/svc-no-backend-response branch November 23, 2023
@aojea (Contributor) commented Nov 23, 2023:

lovely

@aojea (Contributor) commented Nov 23, 2023:

This differs slightly from the kube-proxy behavior, which would send back a type 3/code 3 port unreachable; given it's the whole IP and not a single port, the address unreachable code made more sense.

A funny thing is that it is technically possible that, in the same Service, one port has backends and another does not; this is an artifact of being able to use named ports for Services. But as Tim says in kubernetes/kubernetes#24875 (comment), that is a pretty esoteric configuration, so I think this is correct.

@sayboras added the affects/v1.13 (This issue affects v1.13 branch) and affects/v1.14 (This issue affects v1.14 branch) labels Nov 24, 2023
@dylandreimerink (Member Author):

> This differs slightly from the kube-proxy behavior, which would send back a type 3/code 3 port unreachable; given it's the whole IP and not a single port, the address unreachable code made more sense.

> A funny thing is that it is technically possible that, in the same Service, one port has backends and another does not; this is an artifact of being able to use named ports for Services. But as Tim says in kubernetes/kubernetes#24875 (comment), that is a pretty esoteric configuration, so I think this is correct.

Ah, the description of the PR is out of date; we changed this after review to match exactly what kube-proxy does, to avoid clients having to deal with different ICMP codes across implementations. I will correct the description so it doesn't cause future confusion.

@gentoo-root (Contributor) left a comment:

I'm too late for the party, but I've got some comments on the TBF implementation 😅

```c
	since_last_topup = ktime_get_ns() - value->last_topup;
	if (since_last_topup > settings->topup_interval_ns) {
		/* Add tokens of every missed interval */
		value->tokens += (since_last_topup / settings->topup_interval_ns) *
				 settings->tokens_per_topup;
```
@gentoo-root (Contributor):
Rounding here could skip intervals. For example, if this function is called at 0 s, 1.5 s and 3 s, it will add 1000 tokens at 0 s, another 1000 tokens at 1.5 s, and yet another 1000 tokens at 3 s = in total 3000 tokens. If, however, this function was called at 0 s, 1 s, 2 s and 3 s, it would add 1000 tokens at each call = in total 4000 tokens over the same time period.

@dylandreimerink (Member Author):
Right. So a better way to keep track would be:

```c
long cur_time = ktime_get_ns();
[...]
long intervals = since_last_topup / settings->topup_interval_ns;
long remainder = since_last_topup % settings->topup_interval_ns;
value->last_topup = cur_time - remainder;
```

So at 1.5 s, we set last_topup to 1 s instead of 1.5 s, and at the 3 s mark we would add 2000 tokens instead of 1000.

Is that correct?

@gentoo-root (Contributor):
Looks correct to me.
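Putting the agreed fix together as an editorial sketch of the two snippets above (it also reuses the fetched ktime value, anticipating the comment further down):

```c
__u64 cur_time = ktime_get_ns();
__u64 since_last_topup = cur_time - value->last_topup;

if (since_last_topup > settings->topup_interval_ns) {
	__u64 intervals = since_last_topup / settings->topup_interval_ns;
	__u64 remainder = since_last_topup % settings->topup_interval_ns;

	value->tokens += intervals * settings->tokens_per_topup;
	/* Backdate last_topup by the sub-interval remainder: at 1.5 s it
	 * becomes 1 s, so a call at 3 s credits two intervals, not one. */
	value->last_topup = cur_time - remainder;
}
```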

```c
	if (!value) {
		new_value.last_topup = ktime_get_ns();
		new_value.tokens = settings->tokens_per_topup - 1;
		ret = map_update_elem(&RATELIMIT_MAP, key, &new_value, BPF_ANY);
```
@gentoo-root (Contributor):
This lookup-and-update is racy if called from two CPUs. Do we care?

@dylandreimerink (Member Author):
I considered this. It would mean that the rate limit isn't 100% accurate, letting more traffic through than the limit. To fix that we would need to use atomics, which are slow; I thought performance was more important in this situation. Perhaps we should add this to the comments in case others wonder.
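For reference, the atomic variant that was weighed and rejected might look like this. Editorial sketch: it assumes a signed tokens field and a kernel with BPF fetch-style atomics (v5.12+).

```c
/* Sketch of the rejected alternative: an atomic take, so two CPUs can
 * never both spend the last token. Assumes value->tokens is __s64 and
 * fetch-and-add atomics are available in BPF (kernel v5.12+).
 */
__s64 old = __sync_fetch_and_add(&value->tokens, -1);
if (old <= 0) {
	/* Bucket was empty; undo the decrement and rate-limit. */
	__sync_fetch_and_add(&value->tokens, 1);
	return false;
}
return true;
```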

```c
	/* Add tokens of every missed interval */
	value->tokens += (since_last_topup / settings->topup_interval_ns) *
			 settings->tokens_per_topup;
	value->last_topup = ktime_get_ns();
```
@gentoo-root (Contributor):
We should reuse the ktime_get_ns() value fetched above, otherwise it's another source of inexactness, although only a tiny one.

@dylandreimerink (Member Author):

> I'm too late for the party, but I've got some comments on the TBF implementation

I will make a follow-up PR for that, thanks for the feedback!

@joestringer added the release-note/minor (This PR changes functionality that users may find relevant to operating Cilium.) label and removed the release-note/misc label Nov 27, 2023
@julianwiedmann (Member) left a comment:

Two comments below - better late than never :).

Comment on lines +1862 to +1864:

```c
			   ctx_get_ifindex(ctx));
	return ctx_redirect(ctx, ctx_get_ifindex(ctx), 0);
}
```
@julianwiedmann (Member):
I believe this needs an edt_set_aggregate(ctx, 0), to prevent false-positives in to-netdev's Bandwidth-Manager code.

@dylandreimerink (Member Author):
That is a good point; I have not been able to check the interaction with the bandwidth manager, but I suspect you are right. Will have to do a follow-up.

```c
	cilium_dbg_capture(ctx, DBG_CAPTURE_DELIVERY,
			   ctx_get_ifindex(ctx));
	return ctx_redirect(ctx, ctx_get_ifindex(ctx), 0);
}
```
@julianwiedmann (Member):
When this ICMP packet hits to-netdev, did you check how it interacts with the SNAT engine?

I would expect that it gets dropped, whenever the addressed service IP equals IPV4_MASQUERADE.

@dylandreimerink (Member Author):
> did you check how it interacts with the SNAT engine?

No, I did not.

> I would expect that it gets dropped, whenever the addressed service IP equals IPV4_MASQUERADE.

Right, which would be the case for a NodePort service?

@julianwiedmann (Member):

> which would be the case for a NodePort service?

Correct. That's a scenario we want to support, right?

@dylandreimerink (Member Author):
Yes, I think so. How does host traffic normally deal with this? Additionally, I see we have a marker to skip SNAT, ctx_snat_done_set(ctx); would calling it before doing the redirect help?

It has been a while since I looked in depth at the SNAT path.

@julianwiedmann (Member):

> Yes, I think so. How does host traffic normally deal with this?

It gets dropped :). We only support a limited set of ICMP types. I hope we can extend this as needed.

> Additionally, I see we have a marker to skip SNAT, ctx_snat_done_set(ctx); would calling it before doing the redirect help?

Yep, I had the same thought. I believe that would fit as a work-around (and would even allow you to skip the HostFW in to-netdev ... that's another aspect we didn't consider yet in this PR). Long-term it would be best to teach the SNAT engine about this ICMP type.
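Sketching what that work-around could look like at the redirect site quoted above (editorial; both markers are taken from this review thread, and their exact placement is illustrative, not the PR's code):

```c
/* Mark the generated ICMP reply before the hairpin redirect so that
 * to-netdev skips SNAT for it, and clear the EDT aggregate to avoid
 * Bandwidth-Manager false positives (both suggested in this thread).
 */
ctx_snat_done_set(ctx);
edt_set_aggregate(ctx, 0);
cilium_dbg_capture(ctx, DBG_CAPTURE_DELIVERY, ctx_get_ifindex(ctx));
return ctx_redirect(ctx, ctx_get_ifindex(ctx), 0);
```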

Comment on lines +111 to +118:

```c
#ifdef SERVICE_NO_BACKEND_RESPONSE
	if (ret == DROP_NO_SERVICE) {
		ep_tail_call(ctx, CILIUM_CALL_IPV4_NO_SERVICE);
		return DROP_MISSED_TAIL_CALL;
	}
#endif
```

@julianwiedmann (Member):
With ep-routes enabled, the ICMP packet should now pass through to-container on the way back into the pod. Would we thus require an ingress network policy change to allow this traffic?

This feels very similar to the topic of avoiding policy for service loopback replies ...

@dylandreimerink (Member Author):
A CT entry should be created for the outgoing connection to the service, so when doing policy checking the ICMP reply should be flagged as return traffic and thus not subject to any ingress policy.

@julianwiedmann (Member):
Unfortunately lb4_local() currently doesn't create the RELATED entry (note the NULL for map_related). So I don't think there's any CT entry in place that would allow such ICMP traffic to pass through network policy enforcement.

@dylandreimerink (Member Author):
Right, but isn't the CT entry created here before we get to the LB stage?

@dylandreimerink (Member Author):
I see that this depends on whether ENABLE_PER_PACKET_LB is enabled.

@julianwiedmann (Member):
> Right, but isn't the CT entry created here before we get to the LB stage?

Nope, that part is only reached after selecting the backend (this CT entry tracks the client -> backend connection).

```diff
@@ -597,7 +599,7 @@ enum {
#define DROP_INVALID_EXTHDR	-156
#define DROP_FRAG_NOSUPPORT	-157
#define DROP_NO_SERVICE		-158
#define DROP_UNUSED8		-159 /* unused */
```
A contributor commented:
@dylandreimerink This drop reason wasn't added in flow.proto and drop.go.

Under normal circumstances, we shouldn't reuse any of these, since renaming a proto field/type causes a backwards-incompatible change. We're lucky that in this case, drop reason 159 is actually missing from the proto as well as from drop.go. 😅

```proto
    SERVICE_BACKEND_NOT_FOUND = 158;
    NO_TUNNEL_OR_ENCAPSULATION_ENDPOINT = 160;
```

In any case, I'm marking all unused ones as deprecated in #29482.

cc @rolinh
