New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Extend the CT expiration timeout for egress gateway packets #24905
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @gentoo-root! The fix for the outbound direction looks great, I think that on its own should be sufficient to resolve #23289 already.
The reply direction is a bit trickier, see below.
bpf/lib/nat.h
Outdated
if (egress_gw_rev_snat_ct_needed(ip4)) | ||
target->egress_gateway = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this won't work as-is. The reply hasn't been RevSNATed yet (that only happens in snat_v4_rewrite_ingress()
), so ip4->daddr
is still the EgressIP and the packet won't match any EgressGW policy entry. We also don't capture the from_local_endpoint
condition, so EgressGW replies to local endpoints would get double-accounted.
I wonder if we can simplify the whole logic quite a bit ... when initially creating the SNAT & RevSNAT entries on Egress, just store a needs_ct
flag into each entry (like the host_local
flag today - or maybe we can even re-use that flag? It originated from the same purpose). And then just use that flag to decide whether a ct_lookup()
is needed for subsequent packets.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this won't work as-is. The reply hasn't been RevSNATed yet (that only happens in snat_v4_rewrite_ingress()), so ip4->daddr is still the EgressIP and the packet won't match any EgressGW policy entry.
How does it work for SNAT then? snat_v4_prepare_state also happens before SNAT is performed, and this is the point where target->egress_gateway = true
is set.
We also don't capture the from_local_endpoint condition, so EgressGW replies to local endpoints would get double-accounted.
Hmm, for some reason, after going through the code, I thought this was not relevant for the other direction. I'll add the check.
when initially creating the SNAT & RevSNAT entries on Egress, just store a needs_ct flag into each entry
Great idea, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this won't work as-is. The reply hasn't been RevSNATed yet (that only happens in snat_v4_rewrite_ingress()), so ip4->daddr is still the EgressIP and the packet won't match any EgressGW policy entry.
How does it work for SNAT then? snat_v4_prepare_state also happens before SNAT is performed, and this is the point where
target->egress_gateway = true
is set.
On egress SNAT, ip4->saddr
is the PodIP and ip4->daddr
is the externalIP. So it matches the policy (and we set target->egress_gateway
and target->addr
accordingly). The packet then gets masqueraded with the policy's EgressIP (based on target->addr
).
Before ingress revSNAT, ip4->saddr
is the externalIP and ip4->daddr
is the EgressIP. The revSNAT switches ip4->daddr
to the PodIP - so now the packet matches the policy entry again.
We also don't capture the from_local_endpoint condition, so EgressGW replies to local endpoints would get double-accounted.
Hmm, for some reason, after going through the code, I thought this was not relevant for the other direction. I'll add the check.
It shouldn't break things completely. But we would update the CT entry here, and then a second time in to-container
. So every reply packet gets accounted twice.
when initially creating the SNAT & RevSNAT entries on Egress, just store a needs_ct flag into each entry
Great idea, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, thank you!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I fixed that, please take a look.
Also I think I should add the relevant tests here, including yours.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wrote a test for reply packets, and it seems this implementation is still buggy:
--- FAIL: TestBPF (3.28s)
--- FAIL: TestBPF/tc_egressgw_snat.o (0.14s)
bpf_test.go:231: Conntrack lookup 1/2: src=1.0.168.192:111 dst=1.11.0.110:1234
bpf_test.go:231: Conntrack lookup 2/2: nexthdr=6 flags=1
bpf_test.go:231: CT verdict: New, revnat=0
bpf_test.go:231: Conntrack create: proxy-port=0 revnat=0 src-identity=0 lb=0.0.0.0
bpf_test.go:231: Failed to map addr=1.11.0.110 to identity
bpf_test.go:231: Conntrack lookup 1/2: src=1.11.0.110:1234 dst=4.3.2.1:34615
bpf_test.go:231: Conntrack lookup 2/2: nexthdr=6 flags=0
bpf_test.go:231: CT verdict: New, revnat=0
bpf_test.go:231: Conntrack create: proxy-port=0 revnat=0 src-identity=0 lb=0.0.0.0
--- FAIL: TestBPF/tc_egressgw_snat.o/tc_egressgw_snat2_reply (0.00s)
bpf_test.go:437: bad RX packet count (expected 1, actual 0)
bpf_test.go:231: Conntrack lookup 1/2: src=1.0.168.192:111 dst=1.11.0.110:1234
bpf_test.go:231: Conntrack lookup 2/2: nexthdr=6 flags=1
bpf_test.go:231: CT entry found lifetime=265870, revnat=0
bpf_test.go:231: CT verdict: Established, revnat=0
--- FAIL: TestBPF/tc_egressgw_snat.o/tc_egressgw_snat3 (0.00s)
bpf_test.go:437: bad RX packet count (expected 1, actual 0)
The reply packet doesn't match the CT entry, because revSNAT hasn't been done at that point yet =/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reply packet doesn't match the CT entry, because revSNAT hasn't been done at that point yet =/
🤯 oh joy. Yes, sure looks like it - on egress we're creating the CT_EGRESS entry with the pre-NAT tuple, but on ingress the CT lookup then uses a pre-revNAT tuple. So the CT lookup won't find a match at all, and we just create a new CT_INGRESS entry instead? wow. Are you actually seeing that additional CT entry in your testing?
So we'd basically need to revSNAT the tuple that's used for the lookup. And this is already affecting host-originating connections. Then maybe let's just go with your minimal EgressGW fix after all, and sort out the whole reply-path mess in a separate PR? 😞
Ugh. And it's even more complicated. We're jumping through all sort of hoops to extract the L4 ports in snat_v4_nat()
- but then ct_lookup4()
of course does its own extraction again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then maybe let's just go with your minimal EgressGW fix after all, and sort out the whole reply-path mess in a separate PR?
Yes, makes sense, let's deploy the minimal fix, I'll update this pull request and create a follow-up one.
Are you actually seeing that additional CT entry in your testing?
I'll check that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Follow-up pull request (not ready yet): #25112
Please review this one, so that I can progress with the hotfix.
Are you actually seeing that additional CT entry in your testing?
Yes, I checked that an additional entry is indeed created.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I checked that an additional entry is indeed created.
nice / ugh. At least that confirms our understanding of the code.
1131e12
to
69a0ab7
Compare
6cf56c5
to
c3296d3
Compare
When the SNAT entry already exists for a connection that goes through the egress gateway (i.e. it's not the first packet), ct_lookup4 is skipped. That means that ct_update_timeout is not called, and the CT entry will be garbage collected while the connection is still active. To fix this bug, do the CT lookups for the second and further packets of connections using egress gateway as well. Fixes: 4532996 ("bpf: nat: always track egress gateway connections") Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
c3296d3
to
c405731
Compare
Send another packet in the reply direction and test whether it passed through revSNAT on the egress gateway. Note that it's not checked that the reply packet passed through connection tracking, because of a bug: cilium#25110 Send a second packet, and test whether it passed through CT tracking on the gateway node. Common pktgen and check code is extracted to separate functions, and the existing tc_egressgw_skip_excluded_cidr_snat makes use of it. Signed-off-by: Julian Wiedmann <jwi@isovalent.com> Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
c405731
to
db4bcfb
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, looks good!
int ret = egressgw_snat_check(ctx, false, 2, 1); | ||
|
||
struct egress_gw_policy_key in_key = { | ||
.lpm_key = { EGRESS_PREFIX_LEN(24), {} }, | ||
.saddr = CLIENT_IP, | ||
.daddr = EXTERNAL_SVC_IP & 0Xffffff, | ||
}; | ||
|
||
/* Remove the policy to eliminate interference with other tests. */ | ||
map_delete_elem(&EGRESS_POLICY_MAP, &in_key); | ||
|
||
return ret; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For my understanding, is that a fix for the existing excluded-CIDR test below? We're using the same key, so it should just overwrite the existing policy entry - no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're using the same key, so it should just overwrite the existing policy entry - no?
I noticed that the excluded-CIDR test below runs first, not last. It's probably because of the alphabetic order.
bpf/lib/nat.h
Outdated
if (egress_gw_rev_snat_ct_needed(ip4)) | ||
target->egress_gateway = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I checked that an additional entry is indeed created.
nice / ugh. At least that confirms our understanding of the code.
/test |
/test-1.26-4.19 |
/test-1.26-4.19 |
/test-1.27-net-next |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks again for tacking this! Just a couple of non blocking comments for next time we touch this code
* already tracked). NAT_DIR_EGRESS is implied. Do ct_lookup4 also for | ||
* established connections to extend the expiration timeout. | ||
*/ | ||
} else if (target->egress_gateway && !target->from_local_endpoint) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I think we can get rid of all these brackes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Speaking of the next time we touch this code, I have a follow-up WIP pull request, and this snippet looks better there:
5e81d6d#diff-ef710180f228e8f6bbad249e2670dbc42a54f45da3bbc4c5aebbb1f3865979e9R309-R325
(Although still with brackets, but the comments naturally fit inside.)
When the SNAT entry already exists for a connection that goes through the egress gateway (i.e. it's not the first packet), ct_lookup4 is skipped. That means that ct_update_timeout is not called, and the CT entry will be garbage collected while the connection is still active.
To fix this bug, do the CT lookups for the second and further packets of connections using egress gateway as well.
Fixes: 453299601c8c ("bpf: nat: always track egress gateway connections")
Similarly, add connection tracking for egress
gateway reply packets as well, in order to extend the expiration timeout
of the conntrack entry when reply packets arrive.