ctmap: add support for GC of DSR orphaned entries #21626
b131bd6 to 8ede1c4
Thanks @jibi! This all sounds sane, but some questions to make sure my understanding is correct ...
c9af1f1 to dd7bcee
/test
Thanks! Good call on doing the refactor at the end 👍.
Just one minor thing that's worth addressing imho. But either way lgtm.
Let's start with some context: normally, when traffic on a node needs to be SNAT-ed by the datapath, the following map entries are created:

CT table:

| Dir | Source      | Destination |
| --- | ----------- | ----------- |
| OUT | src IP/port | dst IP/port |

NAT table:

| Dir | Source      | Destination | Xlate target          |
| --- | ----------- | ----------- | --------------------- |
| OUT | src IP/port | dst IP/port | XLATE_SRC nat IP/port |
| IN  | dst IP/port | nat IP/port | XLATE_DST src IP/port |

Given this, the current algorithm to clear orphan NAT entries (i.e. entries that are no longer backed by a related CT one) can be described as follows:

    for each ingress NAT entry:
        - derive the related egress CT key:
            source      = ingress NAT entry xlate target
            destination = ingress NAT entry source
        - if there's no entry for the egress CT key:
            - derive the related egress NAT entry (aka original
              tuple/ingress reverse tuple):
                source       = ingress NAT entry xlate target
                destination  = ingress NAT entry source
                xlate target = ingress NAT entry destination
            - delete both ingress and egress NAT entries

This commit adds support for collecting orphaned NAT entries created by the DSR loadbalancer mode.

In DSR mode, when a client connects to a NodePort service, the traffic flow will be something like:

- client connects to a second node where the NodePort service is running
- this second node selects a backend and (unless the backend is running locally) forwards the traffic to a third node where the backend is running, while:
    - keeping the original client source IP
    - encoding the original destination in an IP option
    - rewriting the destination to the backend IP
- in the third node, the original destination is extracted from the IP option and the following bpf entries are created:

CT table:

| Dir | Source      | Destination | Flags |
| --- | ----------- | ----------- | ----- |
| IN  | src IP/port | dst IP/port | DSR   |

NAT table:

| Dir | Source      | Destination | Xlate target     |
| --- | ----------- | ----------- | ---------------- |
| OUT | dst IP/port | src IP/port | XLATE_SRC nat IP |

where:

- src IP is the IP of the client
- dst IP is the IP of the backend running on the node
- nat IP is the IP of the nodeport node (second node), which was encoded by the node itself in an IP option before forwarding traffic to the third node. This is used to implement DSR, as the reply traffic leaving the current node is forwarded back to the original client with this IP as source

Considering the entries that get created by the regular (i.e. non DSR) NAT operations, to implement GC of the DSR SNAT entries we can extend the algorithm as follows:

    for each egress NAT entry:
        - derive the related ingress CT key:
            source      = egress NAT entry destination
            destination = egress NAT entry source
        - derive the related egress CT key:
            source      = egress NAT entry source
            destination = egress NAT entry destination
        - if both the ingress and egress CT keys don't have a
          corresponding entry:
            - delete the egress NAT entry

If neither the ingress nor the egress CT key for a given NAT entry exists, we can assume the entry is orphaned: the lack of the ingress key covers the DSR case, while the lack of the egress key covers the "regular" one.

There's one last catch: to simplify this writing, we discussed the _logical_ content of the CT keys, but in practice every key has the source and destination addresses swapped because of the way the datapath creates these entries.

Fixes: #21346
Signed-off-by: Gilberto Bertin <jibi@cilium.io>
After adding support to the GC for NAT entries created by the DSR logic, the case of orphan egress entries ends up being handled twice: currently, when we find an orphan ingress entry, we also derive the related egress one and delete it. Since this case is already handled by the logic that goes through egress NAT entries, simply remove it from the ingress handling.

Signed-off-by: Gilberto Bertin <jibi@cilium.io>
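The GC logic described in the commit messages above, in its post-refactor form, can be sketched roughly as follows. This is a minimal illustration and not the actual ctmap code: `Endpoint`, `Tuple`, `CTMap`, `NATMap` and `gcOrphanNAT` are hypothetical names, the BPF maps are modeled as plain Go maps, and the keys hold the _logical_ source and destination (the real CT keys have them swapped, as noted above).

```go
package main

import "fmt"

// Endpoint is an IP/port pair (hypothetical type, for illustration only).
type Endpoint struct {
	IP   string
	Port uint16
}

// Dir is the direction of a CT or NAT entry.
type Dir int

const (
	In Dir = iota
	Out
)

// Tuple is the logical key used for both tables in this sketch.
type Tuple struct {
	Dir Dir
	Src Endpoint
	Dst Endpoint
}

// CTMap and NATMap stand in for the BPF maps (assumption: plain Go maps).
type CTMap map[Tuple]struct{}
type NATMap map[Tuple]Endpoint // the value is the xlate target

func has(ct CTMap, k Tuple) bool {
	_, ok := ct[k]
	return ok
}

// gcOrphanNAT deletes NAT entries that are not backed by a CT entry and
// returns how many entries were removed.
func gcOrphanNAT(ct CTMap, nat NATMap) int {
	deleted := 0
	for k, xlate := range nat {
		switch k.Dir {
		case In:
			// Egress CT key: source = xlate target, destination = NAT source.
			egressCT := Tuple{Dir: Out, Src: xlate, Dst: k.Src}
			if !has(ct, egressCT) {
				delete(nat, k) // deleting while ranging over a map is safe in Go
				deleted++
			}
		case Out:
			// A missing ingress key covers the DSR case,
			// a missing egress key covers the regular SNAT case.
			ingressCT := Tuple{Dir: In, Src: k.Dst, Dst: k.Src}
			egressCT := Tuple{Dir: Out, Src: k.Src, Dst: k.Dst}
			if !has(ct, ingressCT) && !has(ct, egressCT) {
				delete(nat, k)
				deleted++
			}
		}
	}
	return deleted
}

func main() {
	client := Endpoint{IP: "192.0.2.10", Port: 40000}
	stale := Endpoint{IP: "192.0.2.99", Port: 41000}
	backend := Endpoint{IP: "10.0.1.5", Port: 8080}
	natIP := Endpoint{IP: "10.0.0.2"} // nodeport node IP from the IP option

	// Third node in DSR mode: one live connection, one orphaned NAT entry.
	ct := CTMap{{Dir: In, Src: client, Dst: backend}: {}}
	nat := NATMap{
		{Dir: Out, Src: backend, Dst: client}: natIP,
		{Dir: Out, Src: backend, Dst: stale}:  natIP,
	}

	n := gcOrphanNAT(ct, nat)
	fmt.Printf("deleted=%d remaining=%d\n", n, len(nat))
}
```

Because the egress loop checks both CT keys, the ingress branch only needs to remove the ingress entry itself, which is the simplification the last commit describes.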
dd7bcee to 97b3698
CI was green. I just fixed an intermediate commit which got overwritten by the refactoring in the last one (the net result is no changes to the tree: https://github.com/cilium/cilium/compare/dd7bcee3f55265f61125e03278d772d932d17861..97b3698c549332973bcbf0cbe686c2ff3d5d0c9c). Marking as ready to merge.