Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extend the CT expiration timeout for egress gateway packets #24905

Merged
merged 3 commits into from Apr 28, 2023

Conversation

gentoo-root
Copy link
Contributor

@gentoo-root gentoo-root commented Apr 14, 2023

When the SNAT entry already exists for a connection that goes through the egress gateway (i.e. it's not the first packet), ct_lookup4 is skipped. That means that ct_update_timeout is not called, and the CT entry will be garbage collected while the connection is still active.

To fix this bug, do the CT lookups for the second and further packets of connections using egress gateway as well.

Fixes: 453299601c8c ("bpf: nat: always track egress gateway connections")

Similarly, add connection tracking for egress
gateway reply packets as well, in order to extend the expiration timeout
of the conntrack entry when reply packets arrive.

Fix the bug when long-living connections using egress gateway may be reset.

@gentoo-root gentoo-root added release-note/bug This PR fixes an issue in a previous release of Cilium. feature/egress-gateway Impacts the egress IP gateway feature. labels Apr 14, 2023
Copy link
Member

@julianwiedmann julianwiedmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @gentoo-root! The fix for the outbound direction looks great, I think that on its own should be sufficient to resolve #23289 already.

The reply direction is a bit trickier, see below.

bpf/lib/nat.h Outdated
Comment on lines 1043 to 1044
if (egress_gw_rev_snat_ct_needed(ip4))
target->egress_gateway = true;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this won't work as-is. The reply hasn't been RevSNATed yet (that only happens in snat_v4_rewrite_ingress()), so ip4->daddr is still the EgressIP and the packet won't match any EgressGW policy entry. We also don't capture the from_local_endpoint condition, so EgressGW replies to local endpoints would get double-accounted.

I wonder if we can simplify the whole logic quite a bit ... when initially creating the SNAT & RevSNAT entries on Egress, just store a needs_ct flag into each entry (like the host_local flag today - or maybe we can even re-use that flag? It originated from the same purpose). And then just use that flag to decide whether a ct_lookup() is needed for subsequent packets.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this won't work as-is. The reply hasn't been RevSNATed yet (that only happens in snat_v4_rewrite_ingress()), so ip4->daddr is still the EgressIP and the packet won't match any EgressGW policy entry.

How does it work for SNAT then? snat_v4_prepare_state also happens before SNAT is performed, and this is the point where target->egress_gateway = true is set.

We also don't capture the from_local_endpoint condition, so EgressGW replies to local endpoints would get double-accounted.

Hmm, for some reason, after going through the code, I thought this was not relevant for the other direction. I'll add the check.

when initially creating the SNAT & RevSNAT entries on Egress, just store a needs_ct flag into each entry

Great idea, thanks!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this won't work as-is. The reply hasn't been RevSNATed yet (that only happens in snat_v4_rewrite_ingress()), so ip4->daddr is still the EgressIP and the packet won't match any EgressGW policy entry.

How does it work for SNAT then? snat_v4_prepare_state also happens before SNAT is performed, and this is the point where target->egress_gateway = true is set.

On egress SNAT, ip4->saddr is the PodIP and ip4->daddr is the externalIP. So it matches the policy (and we set target->egress_gateway and target->addr accordingly). The packet then gets masqueraded with the policy's EgressIP (based on target->addr).

Before ingress revSNAT, ip4->saddr is the externalIP and ip4->daddr is the EgressIP. The revSNAT switches ip4->daddr to the PodIP - so now the packet matches the policy entry again.

We also don't capture the from_local_endpoint condition, so EgressGW replies to local endpoints would get double-accounted.

Hmm, for some reason, after going through the code, I thought this was not relevant for the other direction. I'll add the check.

It shouldn't break things completely. But we would update the CT entry here, and then a second time in to-container. So every reply packet gets accounted twice.

when initially creating the SNAT & RevSNAT entries on Egress, just store a needs_ct flag into each entry

Great idea, thanks!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, thank you!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fixed that, please take a look.

Also I think I should add the relevant tests here, including yours.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wrote a test for reply packets, and it seems this implementation is still buggy:

--- FAIL: TestBPF (3.28s)
    --- FAIL: TestBPF/tc_egressgw_snat.o (0.14s)
        bpf_test.go:231: Conntrack lookup 1/2: src=1.0.168.192:111 dst=1.11.0.110:1234
        bpf_test.go:231: Conntrack lookup 2/2: nexthdr=6 flags=1
        bpf_test.go:231: CT verdict: New, revnat=0
        bpf_test.go:231: Conntrack create: proxy-port=0 revnat=0 src-identity=0 lb=0.0.0.0
        bpf_test.go:231: Failed to map addr=1.11.0.110 to identity
        bpf_test.go:231: Conntrack lookup 1/2: src=1.11.0.110:1234 dst=4.3.2.1:34615
        bpf_test.go:231: Conntrack lookup 2/2: nexthdr=6 flags=0
        bpf_test.go:231: CT verdict: New, revnat=0
        bpf_test.go:231: Conntrack create: proxy-port=0 revnat=0 src-identity=0 lb=0.0.0.0
        --- FAIL: TestBPF/tc_egressgw_snat.o/tc_egressgw_snat2_reply (0.00s)
            bpf_test.go:437: bad RX packet count (expected 1, actual 0)
        bpf_test.go:231: Conntrack lookup 1/2: src=1.0.168.192:111 dst=1.11.0.110:1234
        bpf_test.go:231: Conntrack lookup 2/2: nexthdr=6 flags=1
        bpf_test.go:231: CT entry found lifetime=265870, revnat=0
        bpf_test.go:231: CT verdict: Established, revnat=0
        --- FAIL: TestBPF/tc_egressgw_snat.o/tc_egressgw_snat3 (0.00s)
            bpf_test.go:437: bad RX packet count (expected 1, actual 0)

The reply packet doesn't match the CT entry, because revSNAT hasn't been done at that point yet =/

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reply packet doesn't match the CT entry, because revSNAT hasn't been done at that point yet =/

🤯 oh joy. Yes, sure looks like it - on egress we're creating the CT_EGRESS entry with the pre-NAT tuple, but on ingress the CT lookup then uses a pre-revNAT tuple. So the CT lookup won't find a match at all, and we just create a new CT_INGRESS entry instead? wow. Are you actually seeing that additional CT entry in your testing?

So we'd basically need to revSNAT the tuple that's used for the lookup. And this is already affecting host-originating connections. Then maybe let's just go with your minimal EgressGW fix after all, and sort out the whole reply-path mess in a separate PR? 😞

Ugh. And it's even more complicated. We're jumping through all sort of hoops to extract the L4 ports in snat_v4_nat() - but then ct_lookup4() of course does its own extraction again.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then maybe let's just go with your minimal EgressGW fix after all, and sort out the whole reply-path mess in a separate PR?

Yes, makes sense, let's deploy the minimal fix, I'll update this pull request and create a follow-up one.

Are you actually seeing that additional CT entry in your testing?

I'll check that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow-up pull request (not ready yet): #25112

Please review this one, so that I can progress with the hotfix.

Are you actually seeing that additional CT entry in your testing?

Yes, I checked that an additional entry is indeed created.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I checked that an additional entry is indeed created.

nice / ugh. At least that confirms our understanding of the code.

@julianwiedmann julianwiedmann linked an issue Apr 17, 2023 that may be closed by this pull request
2 tasks
@gentoo-root gentoo-root force-pushed the egressgw-rst-fix branch 2 times, most recently from 6cf56c5 to c3296d3 Compare April 20, 2023 19:01
When the SNAT entry already exists for a connection that goes through
the egress gateway (i.e. it's not the first packet), ct_lookup4 is
skipped. That means that ct_update_timeout is not called, and the CT
entry will be garbage collected while the connection is still active.

To fix this bug, do the CT lookups for the second and further packets of
connections using egress gateway as well.

Fixes: 4532996 ("bpf: nat: always track egress gateway connections")
Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
julianwiedmann and others added 2 commits April 25, 2023 17:55
Send another packet in the reply direction and test whether it passed
through revSNAT on the egress gateway. Note that it's not checked that
the reply packet passed through connection tracking, because of a bug:

cilium#25110

Send a second packet, and test whether it passed through CT tracking on
the gateway node.

Common pktgen and check code is extracted to separate functions, and the
existing tc_egressgw_skip_excluded_cidr_snat makes use of it.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
Copy link
Member

@julianwiedmann julianwiedmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, looks good!

Comment on lines +311 to +322
int ret = egressgw_snat_check(ctx, false, 2, 1);

struct egress_gw_policy_key in_key = {
.lpm_key = { EGRESS_PREFIX_LEN(24), {} },
.saddr = CLIENT_IP,
.daddr = EXTERNAL_SVC_IP & 0Xffffff,
};

/* Remove the policy to eliminate interference with other tests. */
map_delete_elem(&EGRESS_POLICY_MAP, &in_key);

return ret;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For my understanding, is that a fix for the existing excluded-CIDR test below? We're using the same key, so it should just overwrite the existing policy entry - no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're using the same key, so it should just overwrite the existing policy entry - no?

I noticed that the excluded-CIDR test below runs first, not last. It's probably because of the alphabetic order.

bpf/lib/nat.h Outdated
Comment on lines 1043 to 1044
if (egress_gw_rev_snat_ct_needed(ip4))
target->egress_gateway = true;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I checked that an additional entry is indeed created.

nice / ugh. At least that confirms our understanding of the code.

@julianwiedmann
Copy link
Member

/test

@julianwiedmann julianwiedmann marked this pull request as ready for review April 26, 2023 06:20
@julianwiedmann julianwiedmann requested a review from a team as a code owner April 26, 2023 06:20
@julianwiedmann julianwiedmann added the needs-backport/1.13 This PR / issue needs backporting to the v1.13 branch label Apr 26, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from main in 1.13.3 Apr 26, 2023
@gentoo-root
Copy link
Contributor Author

Cannot initiate the connection to us.archive.ubuntu.com:80

/test-1.26-4.19
/test-1.27-net-next

@gentoo-root
Copy link
Contributor Author

/test-1.26-4.19

@gentoo-root
Copy link
Contributor Author

/test-1.27-net-next

Copy link
Member

@jibi jibi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks again for tacking this! Just a couple of non blocking comments for next time we touch this code

* already tracked). NAT_DIR_EGRESS is implied. Do ct_lookup4 also for
* established connections to extend the expiration timeout.
*/
} else if (target->egress_gateway && !target->from_local_endpoint) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I think we can get rid of all these brackes

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Speaking of the next time we touch this code, I have a follow-up WIP pull request, and this snippet looks better there:

5e81d6d#diff-ef710180f228e8f6bbad249e2670dbc42a54f45da3bbc4c5aebbb1f3865979e9R309-R325

(Although still with brackets, but the comments naturally fit inside.)

bpf/tests/tc_egressgw_snat.c Show resolved Hide resolved
@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Apr 26, 2023
@pchaigno pchaigno merged commit a99e0b6 into cilium:main Apr 28, 2023
57 checks passed
@joamaki joamaki mentioned this pull request May 2, 2023
8 tasks
@YutaroHayakawa YutaroHayakawa mentioned this pull request May 10, 2023
14 tasks
@YutaroHayakawa YutaroHayakawa added backport-pending/1.13 The backport for Cilium 1.13.x for this PR is in progress. and removed needs-backport/1.13 This PR / issue needs backporting to the v1.13 branch labels May 10, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from main to Backport pending to v1.13 in 1.13.3 May 10, 2023
@gentoo-root gentoo-root added backport-done/1.13 The backport for Cilium 1.13.x for this PR is done. and removed backport-pending/1.13 The backport for Cilium 1.13.x for this PR is in progress. labels May 24, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.13 to Backport done to v1.13 in 1.13.3 May 24, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from main in 1.12.10 May 24, 2023
@gentoo-root gentoo-root mentioned this pull request May 25, 2023
1 task
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from main to Backport pending to v1.12 in 1.12.10 May 25, 2023
@julianwiedmann julianwiedmann added backport-done/1.12 The backport for Cilium 1.12.x for this PR is done. and removed backport-pending/1.12 labels Jul 10, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.12 to Backport done to v1.12 in 1.12.10 Jul 10, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.12 to Backport done to v1.12 in 1.12.10 Jul 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-done/1.12 The backport for Cilium 1.12.x for this PR is done. backport-done/1.13 The backport for Cilium 1.13.x for this PR is done. feature/egress-gateway Impacts the egress IP gateway feature. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/bug This PR fixes an issue in a previous release of Cilium.
Projects
No open projects
1.12.10
Backport done to v1.12
1.13.3
Backport done to v1.13
Development

Successfully merging this pull request may close these issues.

Egress Gateway interrupts long running tcp connections
5 participants