Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

node-neigh: Fix unit test flake #16072

Merged
merged 3 commits into from May 12, 2021
Merged

node-neigh: Fix unit test flake #16072

merged 3 commits into from May 12, 2021

Conversation

brb
Copy link
Member

@brb brb commented May 10, 2021

See commit msgs.

Fix #16040
Fix #16075

We don't return early if arping was skipped. This can happen when
insertNeighbor() is invoked by the non-refresh path and nexthop is not
new.

Make sure that lastPing is updated only if arping was sent and it was
successful (if hwAddr != nil condition).

Signed-off-by: Martynas Pumputis <m@lambda.lt>
@brb brb added area/daemon Impacts operation of the Cilium daemon. release-note/ci This PR makes changes to the CI. needs-backport/1.8 labels May 10, 2021
@brb brb requested review from a team and kkourt May 10, 2021 18:18
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.10.0-rc2 May 10, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.9.7 May 10, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.8.10 May 10, 2021
@brb
Copy link
Member Author

brb commented May 10, 2021

test-runtime

@brb
Copy link
Member Author

brb commented May 10, 2021

test-net-next

err = linuxNodeHandler.NodeAdd(nodev1)
c.Assert(err, check.IsNil)
// insertNeighbor is invoked async
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same comment appears a few lines below. Was this maybe forgotten from an earlier version?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed redundant comments.

@@ -1026,18 +1027,41 @@ func (s *linuxPrivilegedIPv4OnlyTestSuite) TestArpPingHandling(c *check.C) {
})
c.Assert(err, check.IsNil)

wait := func(now time.Time, nodeID nodeTypes.Identity, isDelete bool) {
err := testutils.WaitUntil(func() bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my reading of the code now is a lower bound for when the operation happened.

However, it is used only for insertion and not deletion (since for the latter, we do not have a deletion timestamp to check).
Given this, and the overall structure of the function, I wonder if it would be better to have two versions of this function one for deletes and one for inserts.

I think this would also improve readability on the caller side (at the cost of some code duplication).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my reading of the code now is a lower bound for when the operation happened.

Did s/now/before/. Hope it's more clear.

However, it is used only for insertion and not deletion

Made before as *time.Time, so that nil can be passed for the wait for the deletion.

Given this, and the overall structure of the function, I wonder if it would be better to have two versions of this function one for deletes and one for inserts.

I want this waiting routine to be concise to avoid polluting the test case. Also, to follow DRY principle.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, sounds good.

We can inspect the neighLastPingByNextHop map to check when
insertNeighbor() or deleteNeighbor() was called.

Signed-off-by: Martynas Pumputis <m@lambda.lt>
@brb brb force-pushed the pr/brb/fix-arping-ut-flake branch from 90dc1bf to 984ae0e Compare May 11, 2021 16:23
@@ -1026,18 +1027,41 @@ func (s *linuxPrivilegedIPv4OnlyTestSuite) TestArpPingHandling(c *check.C) {
})
c.Assert(err, check.IsNil)

wait := func(now time.Time, nodeID nodeTypes.Identity, isDelete bool) {
err := testutils.WaitUntil(func() bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, sounds good.

@brb
Copy link
Member Author

brb commented May 11, 2021

@kkourt Thanks for the review! I pushed one more commit to fix #16075 (concurrent code is difficult). PTAL.

It's possible that in the case of multiple concurrent insertNeighbor()
executions the oldest (or older) goroutine will overwrite the latest
arping result due to the fine-grained locking.

To fix this, avoid updating neigh entry if we detect that prev last ping
timestamp is after our arping timestamp.

Signed-off-by: Martynas Pumputis <m@lambda.lt>
@brb
Copy link
Member Author

brb commented May 11, 2021

test-net-next

@brb
Copy link
Member Author

brb commented May 11, 2021

test-runtime

Copy link
Contributor

@kkourt kkourt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Last patch looks good to me as well.

Out of curiosity, was this triggered because of the small refresh period:

option.Config.ARPPingRefreshPeriod = time.Duration(1 * time.Nanosecond)

@brb
Copy link
Member Author

brb commented May 12, 2021

Out of curiosity, was this triggered because of the small refresh period:

option.Config.ARPPingRefreshPeriod = time.Duration(1 * time.Nanosecond)

Noup, I've set such low refresh period to make sure that updates of the periodic goroutines (scheduled in the unit tests) don't return early due to the last ping check.

@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.10 in 1.10.0-rc2 May 14, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.10 to Needs backport from master in 1.10.0-rc2 May 14, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.10 in 1.10.0-rc2 May 14, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.10 to Backport done to v1.10 in 1.10.0-rc2 May 17, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.10 to Backport done to v1.10 in 1.10.0-rc2 May 17, 2021
@aanm aanm moved this from Needs backport from master to Backport done to v1.9 in 1.9.8 May 27, 2021
@joestringer joestringer moved this from Needs backport from master to Backport done to v1.8 in 1.8.11 Jun 30, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/daemon Impacts operation of the Cilium daemon. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/ci This PR makes changes to the CI.
Projects
No open projects
1.10.0-rc2
Backport done to v1.10
1.8.11
Backport done to v1.8
1.9.8
Backport done to v1.9
7 participants