Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

daemon, node: Remove old, discarded router IPs from cilium_host #17762

Merged

Conversation

christarazi
Copy link
Member

In the previous commit (referenced below), we forgot to remove the old
router IPs from the actual interface (cilium_host). This caused
connectivity issues in user environments where the discarded, stale IPs
were reassigned to pods, causing the ipcache entries for those IPs to
have remote-node identity.

To fix this, we remove all IPs from the cilium_host interface that
weren't restored during the router IP restoration process. This step
correctly finalizes the restoration process for router IPs.

Fixes: ff63b07 ("daemon, node: Fix faulty router IP restoration
logic")

Signed-off-by: Chris Tarazi chris@isovalent.com

@christarazi christarazi added kind/bug This is a bug in the Cilium logic. release-note/bug This PR fixes an issue in a previous release of Cilium. area/daemon Impacts operation of the Cilium daemon. needs-backport/1.10 labels Nov 2, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot added dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. and removed dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Nov 2, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.10.6 Nov 2, 2021
@christarazi christarazi force-pushed the pr/christarazi/cilium_host-stale-state branch from df6934c to 9bc9466 Compare November 2, 2021 19:46
@christarazi
Copy link
Member Author

/test

Copy link
Member

@joestringer joestringer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this will do the right thing in the common circumstances, although I would like to know that the log messages will include all possibly-relevant information in them (I found this a little hard to tell at face value).

Most of the discussion below is just looking over this code with fresh eyes and I found it a bit confusing for the case where the overall IP allocation CIDR range changes. Maybe we're not so worried about that case, but it looks like it could cause some hiccups if anyone did change the CIDR.

daemon/cmd/daemon.go Outdated Show resolved Hide resolved
daemon/cmd/daemon.go Show resolved Hide resolved
pkg/node/address.go Show resolved Hide resolved
daemon/cmd/daemon.go Outdated Show resolved Hide resolved
daemon/cmd/daemon.go Outdated Show resolved Hide resolved
@christarazi christarazi force-pushed the pr/christarazi/cilium_host-stale-state branch from db54bba to abd35cf Compare November 3, 2021 22:42
Copy link
Member

@joestringer joestringer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks.

@pchaigno
Copy link
Member

pchaigno commented Nov 9, 2021

/test

@christarazi christarazi force-pushed the pr/christarazi/cilium_host-stale-state branch 2 times, most recently from a9c54b3 to 587f4c1 Compare November 23, 2021 01:56
@christarazi
Copy link
Member Author

/test

@christarazi christarazi added the release-blocker/1.11 This issue will prevent the release of the next version of Cilium. label Nov 23, 2021
@qmonnet
Copy link
Member

qmonnet commented Nov 30, 2021

/test-gke

@qmonnet
Copy link
Member

qmonnet commented Nov 30, 2021

/test-1.16-netnext

@qmonnet
Copy link
Member

qmonnet commented Nov 30, 2021

/test-1.22-4.19

Job 'Cilium-PR-K8s-GKE' hit: #17353 (88.79% similarity)

Job 'Cilium-PR-K8s-GKE' failed and has not been observed before, so may be related to your PR:

Click to show.

Test Name

K8sDatapathConfig Host firewall With native routing

Failure Output

FAIL: Pods are not ready in time: timed out waiting for pods with filter  to be ready: 4m0s timeout expired

If it is a flake, comment /mlh new-flake Cilium-PR-K8s-GKE so I can create a new GitHub issue to track it.

@qmonnet
Copy link
Member

qmonnet commented Nov 30, 2021

Job 'Cilium-PR-K8s-GKE' hit: #17353 (88.79% similarity)

No, the stack trace is closer to the one in #17545.

/test-gke

@christarazi
Copy link
Member Author

GKE hit #17307

@christarazi christarazi added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Nov 30, 2021
@qmonnet qmonnet merged commit fcd0039 into cilium:master Nov 30, 2021
@qmonnet qmonnet mentioned this pull request Nov 30, 2021
18 tasks
@christarazi christarazi deleted the pr/christarazi/cilium_host-stale-state branch December 1, 2021 18:21
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.10 in 1.11.0 Dec 1, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.10 in 1.10.6 Dec 1, 2021
@nathanjsweet nathanjsweet added backport-done/1.11 The backport for Cilium 1.11.x for this PR is done. and removed backport-pending/1.11 labels Dec 2, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.10 to Backport done to v1.11 in 1.11.0 Dec 2, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.10 to Backport done to v1.10 in 1.10.6 Dec 10, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.10 to Backport done to v1.10 in 1.10.6 Dec 10, 2021
@houminz
Copy link

houminz commented Sep 1, 2022

@christarazi Hi chris, I found another problem when remove old, discard router IPs from cilium_host.

In my scenario #20879, restoredIP is nil, thus leading family to IPv6. However old router IP is IPv4, we could not list IPv4 address, thus old router IPs are not removed.

Maybe we should re-consider the AddrList logic?

func removeOldRouterState(restoredIP net.IP) error {
	l, err := netlink.LinkByName(defaults.HostDevice)
	if err != nil {
		return err
	}

	family := netlink.FAMILY_V6
	if restoredIP.To4() != nil {
		family = netlink.FAMILY_V4
	}
	addrs, err := netlink.AddrList(l, family)
	if err != nil {
		return err
	}
	if len(addrs) > 1 {
		log.Info("More than one router IP was found on the cilium_host device after restoration, cleaning up old router IPs.")
	}

	var errs []error
	for _, a := range addrs {
		if restoredIP != nil && restoredIP.Equal(a.IP) {
			continue
		}
		if err := netlink.AddrDel(l, &a); err != nil {
			errs = append(errs, fmt.Errorf("failed to remove IP %s: %w", a.IP, err))
		}
	}
	if len(errs) > 0 {
		return fmt.Errorf("failed to remove all old router IPs: %v", errs)
	}

	return nil
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/daemon Impacts operation of the Cilium daemon. backport-done/1.11 The backport for Cilium 1.11.x for this PR is done. kind/bug This is a bug in the Cilium logic. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-blocker/1.11 This issue will prevent the release of the next version of Cilium. release-note/bug This PR fixes an issue in a previous release of Cilium.
Projects
No open projects
1.10.6
Backport done to v1.10
1.11.0
Backport done to v1.11
Development

Successfully merging this pull request may close these issues.

None yet

8 participants