Add unreachable route for pod IP on deletion #18505
Conversation
Commit 0dc69103e8534582519daa6ced7177dab7ef82e7 does not contain "Signed-off-by". Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin
Force-pushed from 0dc6910 to 865a754
I tried the option I mentioned in the PR description to clean up unreachable routes on IP release
Thanks a lot for taking a stab at this! Would it make sense to introduce a controller which periodically GCs these routes instead?
Removing the route only matters in cases where an IP can be removed from the pool and assigned to another node (because the stale route would make that IP unreachable from the host). This is clearly possible with dynamic cloud-based IPAM, but I don't think it can happen with Kubernetes IPAM. I don't know enough about clusterpool
If it runs periodically, we may have race conditions if an IP/CIDR is removed from a node and added to another before the GC runs
Clusterpool is similar to Kubernetes IPAM in the sense that in both cases a node is assigned a pod CIDR from which pod IPs are carved out (those pod CIDRs can also be recycled when a node goes down). I'm fine with only enabling unreachable routes in cloud-based IPAMs for now, as I agree the problem is likely more prominent there than with kubernetes/clusterpool. In that case, however, the addition of the unreachable route in the …
@gandro I think we can still be impacted by the original issue in Kubernetes IPAM: on GCP, for instance, we run Cilium with Kubernetes IPAM and get podCIDRs from the CloudAllocator (an alias range CIDR is allocated by GCP on VM creation; the controller manager discovers it and adds it to podCIDR in the node spec, which Cilium then picks up). With that setup a stale IP would likely be black-holed too, and having an unreachable route would be better. We only need to remove unreachable routes if an IP address can be removed from a node's IPAM pool (I don't think this is currently possible with the kubernetes plugin), because once the node is deleted it no longer matters (the route is purely local to the node).
@lbernail Yep, agreed. I'm trying to think of other cases (besides cloud IPAM) where a pod IP could get reassigned to another node. I can currently only think of the following:

Edit: I think in any case it might be worth adding a new …
Yes, I can definitely add a flag that we could even default to false for now to avoid surprises!
@gandro I added the flag, let me know what you think
Awesome, thanks for adding the flag. I'm fine to merge this as is, as it seems pretty straightforward implementation-wise and allows us to experiment with it.
cc also @borkmann
The PR is currently 3 commits with 3 different features: add unreachable routes (always) / remove unreachable routes in CRD-based IPAM / gate the feature behind a flag.
@gandro I just rebased. We should be good 🤞
/test
@ldelossa What do you think of the PR?
```diff
@@ -209,6 +209,15 @@ func Delete(ip net.IP, compat bool) error {
 		scopedLog.WithField(logfields.Rule, egress).Debug("Deleted egress rule")
 	}

+	// Mark IP as unreachable to avoid triggering rp_filter after endpoint deletion for new packets to pod IP
```
Do the comments of this function now need updating?
I moved the comment after the test; it feels clearer. What do you think?
LGTM
LGTM, suggest updating comments on Delete to explain new route replace calls.
Force-pushed from 52e6217 to 6e67cdc
For the top-hat on duty: this remains … I don't think we need to re-run CI, as it was fully green before the last push (as is apparent from MLH adding the label).
Do you think we should add more details? Maybe along the lines of:
@lbernail That sounds good to me. The comments currently maintain that no routes are "deleted"; however, with your changes this is no longer strictly the case, correct? So just updating the prose there would be beneficial for later readers, IMO.
The route deletion is unrelated to endpoint deletion. We only delete the unreachable route when the IP is unassigned from the node, which can only happen in some specific IPAM configurations (ipam=aws in particular). Do you think it's worth mentioning that the route will be garbage collected in that scenario?
@lbernail Yes, I think it would be useful.
When a pod is deleted, we remove the route table entry for the pod IP. However, this IP might still be used by clients for a while (it takes time for endpoint changes and DNS to propagate). When this happens, SYN packets will be black-holed and clients will only detect the problem after tcp_syn_retries/timeout. In addition, the server will likely log a warning for a Martian packet (if rp_filter is 1 or 2). This commit explicitly adds an unreachable route for the old IP:
- clients trying to connect will immediately get an ICMP error
- the server will not log a Martian packet warning

Signed-off-by: Laurent Bernaille <laurent.bernaille@datadoghq.com>
When we add an unreachable route for the pod IP on pod deletion, this route is replaced when a new pod reuses the IP. However with CRD-based IPAM, an old unused IP can be removed from the ciliumNode pool and reused on another node (on AWS this can happen when aws-release-excess-ips is enabled). This commit makes sure we remove unreachable routes to IPs when they are released from the node pool. Signed-off-by: Laurent Bernaille <laurent.bernaille@datadoghq.com>
Adding unreachable routes to the old pod IP on pod deletion is a new behavior. This commit adds a flag to enable it. The flag is disabled by default to avoid surprises on updates. We could consider changing this default in the future once we have more feedback on the feature. Signed-off-by: Laurent Bernaille <laurent.bernaille@datadoghq.com>
Force-pushed from 6e67cdc to 01d3d3d
LGTM
We were interested in such a feature because of a UDP connected socket that is not closed on a source pod when the destination pod is deleted. From what I understood, this feature cannot work in cluster-pool IPAM mode but only in ENI IPAM mode, where pods are assigned a /32 IP address (using the ReplaceRoute function implies that a host route already exists, pointing to the destination pod). @lbernail would it be possible to confirm this?
Addresses #13451
I tested this on 1.8 and things worked as expected, as shown with route change monitoring while deleting and adding a few pods:
- When a pod is deleted, the route to its IP is replaced with an unreachable route
- When a pod is created, the route is replaced with a route to the pod veth (so if an unreachable route existed, it is replaced)

This is very simple and I'll let it run in a test cluster for a while.

What's missing is the deletion of the unreachable route when release-excess-ips is enabled. We could maybe do it in updateLocalNodeResource, but this code currently does not interact with host routes/rules at all.