Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ctmap: add support for GC of DSR orphaned entries #21626

Merged
merged 4 commits into from Oct 18, 2022

Conversation

jibi
Copy link
Member

@jibi jibi commented Oct 7, 2022

Let's start from some context: normally, when traffic on a node needs to
get SNAT-ed by the datapath, the following map entries get created:

CT table:

Dir Source Destination
OUT src IP/port dst IP/port

NAT table:

Dir Source Destination Xlate target
OUT src IP/port dst IP/port XLATE_SRC nat IP/port
IN dst IP/port nat IP/port XLATE_DST src IP/port

Given this, the current algorithm to clear orphan NAT entries (i.e.
entries that are not backed anymore by a related CT one) can be
described as follow:

for each ingress NAT entry:

  • derive the related egress CT key:

    source      = ingress NAT entry xlate target
    destination = ingress NAT entry source
    
  • if there's no entry for the egress CT key

    • derive the related egress NAT entry (aka original tuple/ingress
      reverse tuple):

      source       = ingress NAT entry xlate targe
      destination  = ingress NAT entry source
      xlate target = ingress NAT entry destination
      
    • delete both ingress and egress NAT entries

This commit is about adding support for collecting orphaned NAT entries
created by the DSR loadbalancer mode.

In DSR mode, when a client connects to a NodePort service, the traffic
flow will be something like:

  • client connects to a second node where the NodePort service is running
  • this second node selects a backend and (unless the backend is running
    locally) forwards the traffic to a third node where the backend is
    running, while:
    • keeping the original client source IP
    • encoding the original destination in an IP option
    • rewriting the destination to the backend IP
  • in the third node, the original destination is extracted from the IP
    option and the following bpf entries are created:

CT table:

Dir Source Destination Flags
IN src IP/port dst IP/port DSR

NAT:

Dir Source Destination Xlate target
OUT dst IP/port src IP/port XLATE_SRC nat IP

where:

  • src IP is the IP of the client
  • dst IP is the IP of the backend running on the node
  • nat IP is the IP of the nodeport node (second node), which was
    encoded by the node itself in an IP option before forwarding traffic
    to the third node.
    This is used to implement DSR as the reply traffic leaving the
    current node is forwarded back to the original client with this IP
    as source

Considering the entries that get created by the regular (i.e. non DSR)
NAT operations, to implement GC of the DSR SNAT entries we can extend
the algorithm as follows:

for each egress NAT entry:

  • derive the related ingress CT key:

    source      = ingress NAT entry destination
    destination = ingress NAT entry source
    
  • derive the related egress CT key:

    source      = ingress NAT entry source
    destination = ingress NAT entry destination
    
  • if both the ingress and egress CT keys don't have a corresponding
    entry:

    • delete the egress NAT entry

If both ingress and egress CT keys for a given NAT entry don't exist we
can assume the entry is orphan, as the lack of the ingress key will
cover the DSR case while the lack of the egress key will cover the
"regular" one.

There's one last catch: to simplify this writing, we discussed the
logical content of the CT keys, but in practice every key has the
source and destination addresses swapped because of the way the datapath
creates these entries.

Fixes: #21346

@jibi jibi added sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. release-note/minor This PR changes functionality that users may find relevant to operating Cilium. needs-backport/1.12 labels Oct 7, 2022
@jibi jibi requested a review from a team as a code owner October 7, 2022 14:59
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.12.3 Oct 7, 2022
@jibi jibi force-pushed the pr/jibi/implement-nat-gc-for-dsr-mode branch 2 times, most recently from b131bd6 to 8ede1c4 Compare October 7, 2022 15:48
Copy link
Member

@julianwiedmann julianwiedmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jibi! This all sounds sane, but some questions to make sure my understanding is correct ...

pkg/maps/ctmap/ctmap.go Outdated Show resolved Hide resolved
pkg/maps/ctmap/ctmap.go Outdated Show resolved Hide resolved
@qmonnet qmonnet added this to Needs backport from master in 1.12.4 Oct 12, 2022
@qmonnet qmonnet removed this from Needs backport from master in 1.12.3 Oct 12, 2022
@jibi jibi force-pushed the pr/jibi/implement-nat-gc-for-dsr-mode branch 2 times, most recently from c9af1f1 to dd7bcee Compare October 12, 2022 11:56
@jibi
Copy link
Member Author

jibi commented Oct 12, 2022

/test

Copy link
Member

@julianwiedmann julianwiedmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Good call on doing the refactor at the end 👍.

Just one minor thing that's worth addressing imho. But either way lgtm.

pkg/maps/ctmap/ctmap.go Show resolved Hide resolved
Let's start from some context: normally, when traffic on a node needs to
get SNAT-ed by the datapath, the following map entries get created:

CT table:

  | Dir | Source      | Destination |
  | --- | ----------- | ----------- |
  | OUT | src IP/port | dst IP/port |

NAT table:

  | Dir | Source      | Destination | Xlate target          |
  | --- | ----------- | ----------- | --------------------- |
  | OUT | src IP/port | dst IP/port | XLATE_SRC nat IP/port |
  | IN  | dst IP/port | nat IP/port | XLATE_DST src IP/port |

Given this, the current algorithm to clear orphan NAT entries (i.e.
entries that are not backed anymore by a related CT one) can be
described as follow:

for each ingress NAT entry:
  - derive the related egress CT key:

        source      = ingress NAT entry xlate target
        destination = ingress NAT entry source

  - if there's no entry for the egress CT key
    - derive the related egress NAT entry (aka original tuple/ingress
      reverse tuple):

          source       = ingress NAT entry xlate targe
          destination  = ingress NAT entry source
          xlate target = ingress NAT entry destination

    - delete both ingress and egress NAT entries

This commit is about adding support for collecting orphaned NAT entries
created by the DSR loadbalancer mode.

In DSR mode, when a client connects to a NodePort service, the traffic
flow will be something like:

- client connects to a second node where the NodePort service is running
- this second node selects a backend and (unless the backend is running
  locally) forwards the traffic to a third node where the backend is
  running, while:
  - keeping the original client source IP
  - encoding the original destination in an IP option
  - rewriting the destination to the backend IP
- in the third node, the original destination is extracted from the IP
  option and the following bpf entries are created:

CT table:

  | Dir | Source      | Destination | Flags |
  | --- | ----------- | ----------- | ----- |
  | IN  | src IP/port | dst IP/port | DSR   |

NAT:

  | Dir | Source      | Destination | Xlate target     |
  | --- | ----------- | ----------- | ---------------- |
  | OUT | dst IP/port | src IP/port | XLATE_SRC nat IP |

  where:

  - src IP is the IP of the client
  - dst IP is the IP of the backend running on the node
  - nat IP is the IP of the nodeport node (second node), which was
    encoded by the node itself in an IP option before forwarding traffic
    to the third node.
    This is used to implement DSR as the reply traffic leaving the
    current node is forwarded back to the original client with this IP
    as source

Considering the entries that get created by the regular (i.e. non DSR)
NAT operations, to implement GC of the DSR SNAT entries we can extend
the algorithm as follows:

for each egress NAT entry:
  - derive the related ingress CT key:

        source      = ingress NAT entry destination
        destination = ingress NAT entry source

  - derive the related egress CT key:

        source      = ingress NAT entry source
        destination = ingress NAT entry destination

  - if both the ingress and egress CT keys don't have a corresponding
    entry:
    - delete the egress NAT entry

If both ingress and egress CT keys for a given NAT entry don't exist we
can assume the entry is orphan, as the lack of the ingress key will
cover the DSR case while the lack of the egress key will cover the
"regular" one.

There's one last catch: to simplify this writing, we discussed the
_logical_ content of the CT keys, but in practice every key has the
source and destination addresses swapped because of the way the datapath
creates these entries.

Fixes: #21346
Signed-off-by: Gilberto Bertin <jibi@cilium.io>
Signed-off-by: Gilberto Bertin <jibi@cilium.io>
Signed-off-by: Gilberto Bertin <jibi@cilium.io>
After adding support to the GC for NAT entries created by the DSR logic,
we ended up handling twice the case for orphan egress entries as
currently when we find an orphan ingress one we derive the related
egress one and delete it as well.

Since this case is already handled by the logic that goes through
egress NAT entries, simply remove it.

Signed-off-by: Gilberto Bertin <jibi@cilium.io>
@jibi jibi force-pushed the pr/jibi/implement-nat-gc-for-dsr-mode branch from dd7bcee to 97b3698 Compare October 18, 2022 09:37
@jibi
Copy link
Member Author

jibi commented Oct 18, 2022

CI was green, just fixed an intermediate commit which got overwritten by the refactoring in the last one (net result is no changes to the tree https://github.com/cilium/cilium/compare/dd7bcee3f55265f61125e03278d772d932d17861..97b3698c549332973bcbf0cbe686c2ff3d5d0c9c)

Marking as ready to merge

@jibi jibi added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Oct 18, 2022
@qmonnet qmonnet merged commit 183726c into master Oct 18, 2022
@qmonnet qmonnet deleted the pr/jibi/implement-nat-gc-for-dsr-mode branch October 18, 2022 17:13
@qmonnet qmonnet mentioned this pull request Oct 19, 2022
6 tasks
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.12 in 1.12.4 Oct 19, 2022
@michi-covalent michi-covalent added this to Backport pending to v1.12 in 1.12.5 Nov 16, 2022
@michi-covalent michi-covalent removed this from Backport pending to v1.12 in 1.12.4 Nov 16, 2022
@joestringer joestringer added this to Backport done to v1.12 in 1.12.4 Nov 18, 2022
@joestringer joestringer added backport-done/1.12 The backport for Cilium 1.12.x for this PR is done. and removed backport-pending/1.12 labels Nov 18, 2022
@joestringer joestringer removed this from Backport pending to v1.12 in 1.12.5 Nov 18, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-done/1.12 The backport for Cilium 1.12.x for this PR is done. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/minor This PR changes functionality that users may find relevant to operating Cilium. sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages.
Projects
No open projects
1.12.4
Backport done to v1.12
Development

Successfully merging this pull request may close these issues.

Ephemeral ports of an egress ip may be exhausted in NAT mapping table
4 participants