Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[v1.9] pkg/datapath/linux: Fix asymmetric IPsec logic on delete #18847

Merged

Commits on Feb 18, 2022

  1. pkg/datapath, pkg/node/manager: Clarify NodeValidateImplementation godoc

    [ upstream commit a6e8477 ]
    
    Document the intent of NodeValidateImplementation().
    
    Signed-off-by: Chris Tarazi <chris@isovalent.com>
    christarazi committed Feb 18, 2022
    Configuration menu
    Copy the full SHA
    c60736b View commit details
    Browse the repository at this point in the history
  2. pkg/datapath/linux: Remove unnecessary branch in IPsec route functions

    [ upstream commit 7e5022e ]
    
    These if-statements are unnecessary because upon code analysis, we can
    tell that it's not possible for the input to be nil. Remove these
    statements to simplify the flow of the function. In other words, now we
    know for a fact that calling these function will result in a route
    insert.
    
    Signed-off-by: Chris Tarazi <chris@isovalent.com>
    christarazi committed Feb 18, 2022
    Configuration menu
    Copy the full SHA
    3d6bec4 View commit details
    Browse the repository at this point in the history
  3. pkg/datapath/linux: Add CIDR logfield to IPsec route logs

    [ upstream commit e41aea0 ]
    
    This helps in scenarios where the user reports this log msg, but we are
    missing the actual CIDR from the route that failed to be deleted or
    created.
    
    Signed-off-by: Chris Tarazi <chris@isovalent.com>
    christarazi committed Feb 18, 2022
    Configuration menu
    Copy the full SHA
    7961177 View commit details
    Browse the repository at this point in the history
  4. pkg/datapath/linux: Fix asymmetric IPsec logic on delete

    [ upstream commit 0bd4e04 ]
    
    With ENI IPAM mode and IPsec enabled, users were reporting cases where
    connectivity to particular pods breaks, and correlated with those drops,
    the following error msg:
    
    ```
    Unable to delete the IPsec route OUT from the host routing table
    ```
    
    In addition, it was also reported that the connectivity outage would
    only last for a few minutes before resolving itself.
    
    The issue turned out to be that upon node deletion, the logic to handle
    the IPsec cleanup is asymmetric with the IPsec logic to handle a node
    create / update.
    
    Here's how:
    
      * With ENI mode and IPsec, subnet encryption mode is enabled
        implicitly.
      * Background: Users can explicitly enable subnet encryption mode by
        configuring `--ipv4-pod-subnets=[cidr1,cidr2,...]`.
        * Background: ENIs are part of subnet(s).
        * Cilium with ENI mode automatically appends the node's ENIs'
          subnets' CIDRs to this slice.
        * For example, node A has ENI E which is a part of subnet S with
          CIDR C. Therefore, `--ipv4-pod-subnets=[C]`.
      * This means that each node should have an IPsec OUT routes for each
        pod subnet, i.e. each ENI's subnet, as shown by
        (*linuxNodeHandler).nodeUpdate() which contains the IPsec logic on a
        node create / update.
      * Upon a node delete [(*linuxNodeHandler).nodeDelete()], we clean up
        the "old" node. When it gets to the IPsec logic, it removes the
        routes for the pod subnets as well, i.e. removes the route to the
        ENI's subnet from the local node. From the example above, it'd
        remove the route for CIDR C.
      * This is problematic because in ENI mode, different nodes can share
        the same ENI's subnet, meaning subnets are NOT exclusive to a node.
        For example, a node B can also have ENI E with a subnet C attached
        to it.
      * As for how the nodes were fixing themselves, it turns out that
        (*Manager).backgroundSync() runs on an interval which calls
        NodeValidateImplementation() which calls down to
        (*linuxNodeHandler).nodeUpdate() thereby running the IPsec logic of
        a node create / update which reinstates the missing routes.
    
    Therefore, we shouldn't be deleting these routes because pods might
    still be relying on them. By comparing the IPsec delete logic with [1],
    we see that they're asymmetric. This commit fixes this asymmetry.
    
    [1]: Given subnetEncryption=true, notice how we only call
         enableSubnetIPsec() if the node is local. That is not the case on
         node delete.
    
         ```
         func (n *linuxNodeHandler) nodeUpdate(oldNode, newNode *nodeTypes.Node, firstAddition bool) error {
         	...
    
         	if n.nodeConfig.EnableIPSec && !n.subnetEncryption() && !n.nodeConfig.EncryptNode {
         		n.enableIPsec(newNode)
         		newKey = newNode.EncryptionKey
         	}
    
         	...
    
         	if n.nodeConfig.EnableIPSec && !n.subnetEncryption() {
         		n.encryptNode(newNode)
         	}
    
         	if newNode.IsLocal() {
         		isLocalNode = true
         		...
         		if n.subnetEncryption() {
         			n.enableSubnetIPsec(n.nodeConfig.IPv4PodSubnets, n.nodeConfig.IPv6PodSubnets)
         		}
         		...
         		return nil
         	}
         ```
    
    Fixes: 645de9d ("cilium: remove encryption route and rules if crypto
    is disabled")
    
    Co-authored-by: John Fastabend <john@isovalent.com>
    Signed-off-by: Chris Tarazi <chris@isovalent.com>
    christarazi and John Fastabend committed Feb 18, 2022
    Configuration menu
    Copy the full SHA
    83c0b2a View commit details
    Browse the repository at this point in the history