Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add unreachable route for pod IP on deletion #18505

Merged
merged 3 commits into from Feb 9, 2022

Conversation

lbernail
Copy link
Contributor

Addresses #13451
I tested this on 1.8 and things worked as expected as shown with route change monitoring, while deleting and adding a few pods:

$ ip monitor route
unreachable 10.132.6.42
10.132.11.107 dev lxc26596370b4a9 scope link
unreachable 10.132.7.37
10.132.7.37 dev lxcac1fafd305bc scope link
unreachable 10.132.7.37
unreachable 10.132.11.107
10.132.7.37 dev lxc5ba6f1edbb6d scope link
10.132.11.107 dev lxc630f0d9a19c4 scope link

When a pod is deleted, the route to its IP is replaced with an unreachable route
When a pod is created, the route is replaced with a route to the pod veth (so if an unreachable existed, it's replaced)

This is very simple and I'll let this run in a test cluster for a while

What's missing is the deletion of the unreachable route when release-excess-ips is enabled. We could maybe do it in updateLocalNodeResource but this code currently does not interact with host routes/rules at all

@lbernail lbernail requested review from a team and joamaki January 17, 2022 16:24
@maintainer-s-little-helper
Copy link

Commit 0dc69103e8534582519daa6ced7177dab7ef82e7 does not contain "Signed-off-by".

Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Jan 17, 2022
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jan 17, 2022
@lbernail lbernail force-pushed the lbernail/add-unreachable-on-delete branch from 0dc6910 to 865a754 Compare January 17, 2022 16:29
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Jan 17, 2022
@lbernail lbernail requested review from a team and gandro January 18, 2022 09:38
@lbernail
Copy link
Contributor Author

I tried the option I mentioned in the PR description to clean up unreachable routes on IP release
(I kept it as a separate commit but if we agree on the approach I'll squash)

@lbernail lbernail changed the title Work in progress: Add unreachable route for pod IP on deletion Add unreachable route for pod IP on deletion Jan 18, 2022
@gandro
Copy link
Member

gandro commented Jan 18, 2022

Thanks a lot for taking a stab at this!

My main concern currently is that updateLocalNodeResource is only used in CRD-based IPAM modes (i.e. ENI, Azure, Alibabacloud). This means for all other IPAM modes (e.g. clusterpool or kubernetes) will never remove any of the "unreachable endpoint" routes.

Would it make sense to introduce a controller which periodically GCs these routes instead?

@lbernail
Copy link
Contributor Author

My main concern currently is that updateLocalNodeResource is only used in CRD-based IPAM modes (i.e. ENI, Azure, Alibabacloud). This means for all other IPAM modes (e.g. clusterpool or kubernetes) will never remove any of the "unreachable endpoint" routes.

Removing the route is only important for cases where an IP is removed from the pool and could be added to another node (because it would make this IP unreachable from the host with the route). This is clearly possible with dynamic cloud-based IPAM but it should not for kubernetes I think. I don't know enough about clusterpool

Would it make sense to introduce a controller which periodically GCs these routes instead?

If it runs periodically, we may have race conditions if an IP/Cidr is removed from a node and added to another before GC runs

@gandro
Copy link
Member

gandro commented Jan 18, 2022

Removing the route is only important for cases where an IP is removed from the pool and could be added to another node (because it would make this IP unreachable from the host with the route). This is clearly possible with dynamic cloud-based IPAM but it should not for kubernetes I think. I don't know enough about clusterpool

Clusterpool is similar to Kubernates IPAM in the sense that in both cases a node is assigned a pod CIDR from which pod IPs are carved out (those pod CIDRs can also be recycled when a node goes down).

I'm fine to only enable unreachable routes in cloud-based IPAMs for now, as I agree that there the problem is likely more prominent than with kubernetes/clusterpool. In that case however, the addition of the unreachable route in the routing.Delete function should only be added if we're running in a cloud-based IPAM mode.

@lbernail
Copy link
Contributor Author

@gandro I think we can still be impacted by the original issue in kubernetes IPAM: on GCP for instance we run Cilium with Kubernetes ipam and get podCIDRs from the CloudAllocator Kubernetes IPAM (an alias range CIDR is allocated by GCP on VM creation, the controller manager discovers it and adds it to podCIDR in the node spec, which Cilium discovers). With that setup a stale IP would likely be blackholed too and having an unreachable would be better.

We only need to remove the unreachable routes if an IP address can be removed from a node IPAM pool (I don't think this is currently possible with the kubernetes plugin) because when the node is deleted it does not matter anymore (as this route is purely local to the node)

@gandro
Copy link
Member

gandro commented Jan 18, 2022

@lbernail Yep, agreed.

I'm trying to think of other cases (besides cloud IPAM) where a pod IP could get reassigned to another node. I currently can only think of the following:

  • IPAM:CRD with manual assignment. A user might move an IP from one node to another. However, removing IPs from a node in that mode is not recommended, as there are other issues with that as well (i.e. a similar race we had on ENI which we solved via the released-ips handshake)
  • Dynamic Cluster Pool (work-in-progress in ipam: Support dynamic allocation of pod CIDRs in cluster pool #17656) - there pod CIDRs can get dynamically re-allocated to nodes without the node going down. In that case it will be important to clean up any unreachable routes too. Should be fairly straight forward however.

Edit: I think in any case, might be worth to add a new --enable-unreachable-routes flag to the agent, to disable the feature if necessary.

@lbernail
Copy link
Contributor Author

Yes I can definitely add a flag that we could even default to false for now to avoid surprises!

@gandro gandro added the release-note/minor This PR changes functionality that users may find relevant to operating Cilium. label Jan 25, 2022
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jan 25, 2022
@lbernail lbernail requested a review from a team January 31, 2022 11:28
@lbernail lbernail requested a review from a team as a code owner January 31, 2022 11:28
@lbernail
Copy link
Contributor Author

@gandro I added the flag, let me know what you think
I've backported the feature to 1.8 (the version we currently use) and everything looks fine on an experimental cluster (flag works, removing pods add unreachable, adding a new pod reuses the IP and overwrites the unreachable route, releasing an IP clears unreachable)
If we agree on the model, I can test it at much larger scale

Copy link
Member

@gandro gandro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, thanks for adding the flag. I'm fine to merge this as is, as it seems pretty straightforward implementation-wise and allows us to experiment with it.

cc also @borkmann

@lbernail
Copy link
Contributor Author

The PR is currently 3 commits with the 3 different features: add unreachable (always) / remove unreachable in CRD-based IPAM / gate feature behing a flag.
Of course I can squash them if we feel it's better

@lbernail
Copy link
Contributor Author

lbernail commented Feb 3, 2022

@gandro I just rebased. We should be good 🤞

@lbernail
Copy link
Contributor Author

lbernail commented Feb 3, 2022

/test

@lbernail
Copy link
Contributor Author

lbernail commented Feb 7, 2022

@ldelossa What do you think of the PR?

@@ -209,6 +209,15 @@ func Delete(ip net.IP, compat bool) error {
scopedLog.WithField(logfields.Rule, egress).Debug("Deleted egress rule")
}

// Mark IP as unreachable to avoid triggering rp_filter after endpoint deletion for new packets to pod IP
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do the comments of this function now need updating?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved the comment after the test, it feels clearer
What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Feb 7, 2022
Copy link
Contributor

@ldelossa ldelossa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, suggest updating comments on Delete to explain new route replace calls.

@gandro
Copy link
Member

gandro commented Feb 8, 2022

For the top-hat on duty: This remains ready-to-merge from a CI perspective.

I don't think we need to re-run CI again, as it was fully green before the last push (as is apparent by MLH adding the ready-to-merge label). The latest force-push hasn't changed any logic at all: https://github.com/cilium/cilium/compare/52e6217b43d821179568b8cdb6cb82611d846995..6e67cdc5a580cebfa9048c5b3bf82b8b4341a731

@lbernail
Copy link
Contributor Author

lbernail commented Feb 8, 2022

LGTM, suggest updating comments on Delete to explain new route replace calls.

Do you think we should add more details? Maybe along the lines of:

  • replace route for pod IP with an unreachable route on pod deletion
  • when the IP is reused the unreachable route will be replaced with a route through the appropriate veth

@ldelossa
Copy link
Contributor

ldelossa commented Feb 8, 2022

@lbernail that sounds good to me. The comments currently maintain that no routes are "deleted" however with your changes, this is now not strictly the case, correct? So just updating the prose there would be beneficial for later viewers IMO.

@lbernail
Copy link
Contributor Author

lbernail commented Feb 8, 2022

@lbernail that sounds good to me. The comments currently maintain that no routes are "deleted" however with your changes, this is now not strictly the case, correct? So just updating the prose there would be beneficial for later viewers IMO.

The route deletion is unrelated to endpoint deletion. We only delete the unreachable route when the IP is not unassigned from the node, which can only happen in some specific IPAM configurations (ipam=aws in particular). Do you think it's worth mentioning that the route will be garbage collected in that scenario?

@ldelossa
Copy link
Contributor

ldelossa commented Feb 8, 2022

@lbernail yes, I think it would be useful IMO.

When a pod is deleted, we remove the route table entry for the pod IP.
However, this IP might still be used by clients for a whiile (it takes
time for endpoint changes and DNS to propagate). When this happens,
SYN packets will be black-holed and clients will only detect the
problem after tcp_syn_retries/timeout. In addition, the server will
likely log a warning for a Martian packet (if rp_filter is 1 or 2).

This commit explicitly adds an unreachable route to the old IP:
- clients trying to connect will immediately get an ICMP error
- the server will not log a Martian packet warning

Signed-off-by: Laurent Bernaille <laurent.bernaille@datadoghq.com>
When we add an unreachable route for the pod IP on pod deletion, this
route is replaced when a new pod reuses the IP. However with
CRD-based IPAM, an old unused IP can be removed from the ciliumNode
pool and reused on another node (on AWS this can happen when
aws-release-excess-ips is enabled).

This commit makes sure we remove unreachable routes to IPs when they
are released from the node pool.

Signed-off-by: Laurent Bernaille <laurent.bernaille@datadoghq.com>
Adding unreachable routes to the old pod IP on pod deletion is a new
behavior. This commit adds a flag to enable it. The flag is disabled
by default to avoid surprises on updates. We could consider changing
this default in the future once we have more feedback on the feature.

Signed-off-by: Laurent Bernaille <laurent.bernaille@datadoghq.com>
@lbernail lbernail force-pushed the lbernail/add-unreachable-on-delete branch from 6e67cdc to 01d3d3d Compare February 9, 2022 09:22
@lbernail
Copy link
Contributor Author

lbernail commented Feb 9, 2022

@ldelossa that makes complete sense, I added a more detailed comment. Let me know if it's clear enough
@gandro I had to rebase because of very small conflict in Documentation/cmdref/cilium-agent.md but I don't think we need to run tests again

@ldelossa
Copy link
Contributor

ldelossa commented Feb 9, 2022

LGTM

@maximumG
Copy link

maximumG commented Mar 13, 2023

We were interested in such feature because of some UDP connected socket which is not deleted on a source Pod when the destination Pod is deleted.

From what I understood, this feature cannot work in cluster-pool IPAM mode but rather in ENI IPAM mode, where Pod are assigned /32 IP address (using the ReplaceRoute function implies that a host route already exists, pointing to the destination Pod). @lbernail would it be possible to confirm this ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/minor This PR changes functionality that users may find relevant to operating Cilium.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Consider generating TCP RST / ICMP destination unreachable in response to stale IP
6 participants