Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Downgrade L2 Neighbor Discovery failure log to Debug #31179

Merged

Conversation

YutaroHayakawa
Copy link
Member

@YutaroHayakawa YutaroHayakawa commented Mar 6, 2024

Suppress the "Unable to determine next hop address" logs. While it shows the L2 neighbor resolution failure, it does not always indicate a datapath connectivity issue.

For example, when "devices=eth+" is specified and the device naming/purposing is not consistent across the nodes in the cluster, in some nodes "eth1" is a device reachable to other nodes, but in some nodes, it is not. As a result, L2 Discovery generates an "Unable to determine next hop address".

Another example is ENI mode with automatic device detection. When secondary interfaces are added, they are used for L2 Neighbor Discovery as well. However, nodes can only be reached via the primary interface through the default route in the main routing table. Thus, L2 Neighbor Discovery generates "Unable to determine next hop address" for secondary interfaces.

In both cases, it does not always mean the datapath has an issue for KPR functionality. However, the logs appear repeatedly, are noisy, and the message is error-ish, causing confusion.

This log started to appear for some users who did not see it before from v1.14.3 (#28858) and v1.15.0 (in theory). For v1.14.3, it affects KPR + Tunnel users because of f2dcc86. Before the commit, we did not perform L2 Neighbor Discovery in tunnel mode, so even if users had an interface unreachable to other nodes, the log did not appear.

For v1.15.0, it affects to the users who used to have the unreachable interface. 2058ed6 made it visible. Before the commit, some kind of the errors like EHOSTUNREACH and ENETUNREACH were not caught because FIBMatch option didn't specified. After v1.15.0, users started to see the log.

Fixes: #28858

Downgrade L2 Neighbor Discovery failure log to Debug

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Mar 6, 2024
@YutaroHayakawa YutaroHayakawa added the release-note/misc This PR makes changes that have no direct user impact. label Mar 6, 2024
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Mar 6, 2024
@YutaroHayakawa YutaroHayakawa added needs-backport/1.14 This PR / issue needs backporting to the v1.14 branch needs-backport/1.15 This PR / issue needs backporting to the v1.15 branch labels Mar 6, 2024
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from main in 1.15.2 Mar 6, 2024
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from main in 1.14.8 Mar 6, 2024
@YutaroHayakawa YutaroHayakawa marked this pull request as ready for review March 6, 2024 05:50
@YutaroHayakawa YutaroHayakawa requested a review from a team as a code owner March 6, 2024 05:50
@YutaroHayakawa
Copy link
Member Author

/test

Suppress the "Unable to determine next hop address" logs. While it shows
the L2 neighbor resolution failure, it does not always indicate a
datapath connectivity issue.

For example, when "devices=eth+" is specified and the device
naming/purposing is not consistent across the nodes in the cluster, in
some nodes "eth1" is a device reachable to other nodes, but in some
nodes, it is not. As a result, L2 Discovery generates an "Unable to
determine next hop address".

Another example is ENI mode with automatic device detection. When
secondary interfaces are added, they are used for L2 Neighbor Discovery
as well. However, nodes can only be reached via the primary interface
through the default route in the main routing table. Thus, L2 Neighbor
Discovery generates "Unable to determine next hop address" for secondary
interfaces.

In both cases, it does not always mean the datapath has an issue for KPR
functionality. However, the logs appear repeatedly, are noisy, and the
message is error-ish, causing confusion.

This log started to appear for some users who did not see it before from
v1.14.3 (cilium#28858) and v1.15.0 (in theory). For v1.14.3, it affects KPR +
Tunnel users because of f2dcc86. Before the commit, we did not perform
L2 Neighbor Discovery in tunnel mode, so even if users had an interface
unreachable to other nodes, the log did not appear.

For v1.15.0, it affects to the users who used to have the unreachable
interface. 2058ed6 made it visible. Before the commit, some kind of the
errors like EHOSTUNREACH and ENETUNREACH were not caught because
FIBMatch option didn't specified. After v1.15.0, users started to see
the log.

Fixes: cilium#28858

Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>
@YutaroHayakawa
Copy link
Member Author

/test

@YutaroHayakawa
Copy link
Member Author

Conformance Ginkgo: #31040

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Mar 7, 2024
@YutaroHayakawa YutaroHayakawa added this pull request to the merge queue Mar 7, 2024
Merged via the queue into cilium:main with commit db7b3ef Mar 7, 2024
63 checks passed
@YutaroHayakawa YutaroHayakawa deleted the neighbor-discovery-debug branch March 7, 2024 00:53
@jibi jibi mentioned this pull request Mar 12, 2024
9 tasks
@jibi jibi added backport-pending/1.14 The backport for Cilium 1.14.x for this PR is in progress. and removed needs-backport/1.14 This PR / issue needs backporting to the v1.14 branch labels Mar 12, 2024
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from main to Backport pending to v1.14 in 1.14.8 Mar 12, 2024
@jibi jibi mentioned this pull request Mar 12, 2024
16 tasks
@jibi jibi added backport-pending/1.15 The backport for Cilium 1.15.x for this PR is in progress. and removed needs-backport/1.15 This PR / issue needs backporting to the v1.15 branch labels Mar 12, 2024
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from main to Backport pending to v1.15 in 1.15.2 Mar 12, 2024
@jibi jibi added needs-backport/1.15 This PR / issue needs backporting to the v1.15 branch and removed backport-pending/1.15 The backport for Cilium 1.15.x for this PR is in progress. labels Mar 13, 2024
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.15 to Needs backport from main in 1.15.2 Mar 13, 2024
@jrajahalme jrajahalme removed this from Needs backport from main in 1.15.2 Mar 13, 2024
@jibi jibi added needs-backport/1.14 This PR / issue needs backporting to the v1.14 branch and removed backport-pending/1.14 The backport for Cilium 1.14.x for this PR is in progress. labels Mar 13, 2024
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from main in 1.15.2 Mar 13, 2024
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.14 to Needs backport from main in 1.14.8 Mar 13, 2024
@jrajahalme jrajahalme added this to Needs backport from main in 1.15.3 Mar 13, 2024
@jrajahalme jrajahalme removed this from Needs backport from main in 1.15.2 Mar 13, 2024
@jibi jibi added backport-pending/1.14 The backport for Cilium 1.14.x for this PR is in progress. and removed needs-backport/1.14 This PR / issue needs backporting to the v1.14 branch labels Mar 13, 2024
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from main to Backport pending to v1.14 in 1.14.8 Mar 13, 2024
@jibi jibi added backport-pending/1.15 The backport for Cilium 1.15.x for this PR is in progress. and removed needs-backport/1.15 This PR / issue needs backporting to the v1.15 branch labels Mar 13, 2024
@thorn3r thorn3r added this to Backport pending to v1.14 in 1.14.9 Mar 13, 2024
@thorn3r thorn3r removed this from Backport pending to v1.14 in 1.14.8 Mar 13, 2024
@github-actions github-actions bot added backport-done/1.14 The backport for Cilium 1.14.x for this PR is done. backport-done/1.15 The backport for Cilium 1.15.x for this PR is done. and removed backport-pending/1.14 The backport for Cilium 1.14.x for this PR is in progress. backport-pending/1.15 The backport for Cilium 1.15.x for this PR is in progress. labels Mar 16, 2024
@jrajahalme jrajahalme moved this from Needs backport from main to Backport done to v1.15 in 1.15.3 Mar 26, 2024
@jrajahalme jrajahalme moved this from Backport pending to v1.14 to Backport done to v1.14 in 1.14.9 Mar 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-done/1.14 The backport for Cilium 1.14.x for this PR is done. backport-done/1.15 The backport for Cilium 1.15.x for this PR is done. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/misc This PR makes changes that have no direct user impact.
Projects
No open projects
1.14.9
Backport done to v1.14
1.15.3
Backport done to v1.15
Development

Successfully merging this pull request may close these issues.

Cilium v1.14.3 failed to retrieve route for remote node IP
4 participants