
Ensure purging neighbor cache for stale deletes #1433

Merged 1 commit into moby:master on Sep 16, 2016
Conversation

@mrjana (Contributor) commented Sep 7, 2016

When stale delete notifications are received, we still need to make sure to purge the sandbox neighbor cache, because these stale deletes are most typically out-of-order delete notifications: if an add for the peermac was received before the delete of the old peermac,vtep pair, we process the add and replace the kernel state, but the old neighbor state in the sandbox cache remains. That needs to be purged when we finally get the out-of-order delete notification.

Signed-off-by: Jana Radhakrishnan <mrjana@docker.com>

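To make the failure mode concrete, here is a minimal, self-contained sketch of the sequence described above (the maps, names, and addresses are illustrative stand-ins, not libnetwork's actual data structures):

    package main

    import "fmt"

    func main() {
        // Toy model: kernel neighbor/FDB state keyed by peer MAC, and the
        // sandbox neighbor cache keyed by the (peer MAC, vtep) pair.
        kernel := map[string]string{}        // peerMac -> vtep
        sandboxCache := map[[2]string]bool{} // (peerMac, vtep) -> present

        add := func(mac, vtep string) {
            kernel[mac] = vtep // an add replaces any previous kernel entry
            sandboxCache[[2]string{mac, vtep}] = true
        }

        // Out of order: the add for the new (peermac, vtep) pair is
        // processed before the delete of the old pair.
        add("02:42:0a:00:00:02", "192.168.1.10") // original peer
        add("02:42:0a:00:00:02", "192.168.1.20") // same MAC, new vtep

        // The stale delete for the old pair finally arrives. The kernel
        // entry now belongs to the new vtep and must be left alone, but the
        // leftover sandbox cache entry for the old pair has to be purged.
        delete(sandboxCache, [2]string{"02:42:0a:00:00:02", "192.168.1.10"})

        fmt.Println("kernel:", kernel)
        fmt.Println("sandbox cache:", sandboxCache)
    }

The inline review thread below is anchored on the following hunk from the diff.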
    // leftover sandbox neighbor cache and not actually delete the
    // kernel state.
    if (eid == pEntry.eid && vtep.Equal(pEntry.vtep)) ||
        (eid != pEntry.eid && !vtep.Equal(pEntry.vtep)) {
Contributor:
As a curiosity, this could be compacted to if (eid == pEntry.eid) == vtep.Equal(pEntry.vtep) { }
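For reference, the compaction is sound: for booleans a and b, (a && b) || (!a && !b) is exactly a == b. A tiny standalone check, with a standing in for eid == pEntry.eid and b for vtep.Equal(pEntry.vtep):

    package main

    import "fmt"

    func main() {
        // Exhaustively compare the two-clause condition with the compact form.
        for _, a := range []bool{false, true} {
            for _, b := range []bool{false, true} {
                long := (a && b) || (!a && !b)
                short := a == b
                fmt.Printf("a=%v b=%v long=%v short=%v equal=%v\n",
                    a, b, long, short, long == short)
            }
        }
    }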

Contributor Author (mrjana):

That would be clever, but it would make the intent awfully difficult to decipher.

Reviewer:

Currently peerDbDelete returns false in the error cases (i.e., d.peerDb.mp[nid] not present or peerKey not present). Both are still valid error cases. I think we should retain the boolean return and add peerEntry as an additional return value. Then the existing !d.peerDbDelete check can be retained in peerDelete, and the if check added above is not necessary. It's hard to follow the logic with the current check. We just need eid == pEntry.eid && vtep.Equal(pEntry.vtep) in the DeleteNeighbor call. wdyt?
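A rough sketch of the shape being suggested here (the types, key layout, and function body are simplified stand-ins, not the driver's actual code in drivers/overlay/peerdb.go):

    package main

    import "fmt"

    // Simplified stand-ins for the overlay driver's types; the real
    // peerKey/peerEntry carry more fields (IP mask, vtep, etc.).
    type peerEntry struct {
        eid  string
        vtep string
    }

    type peerDb struct {
        mp map[string]map[string]peerEntry // nid -> peer key -> entry
    }

    // peerDbDelete in the suggested shape: keep the boolean so callers can
    // still bail out on the two error cases (missing network, missing peer
    // key), and also hand back the removed entry so the caller can do the
    // eid/vtep comparison before touching kernel state.
    func (db *peerDb) peerDbDelete(nid, key string) (bool, peerEntry) {
        pMap, ok := db.mp[nid]
        if !ok {
            return false, peerEntry{}
        }
        pEntry, ok := pMap[key]
        if !ok {
            return false, peerEntry{}
        }
        delete(pMap, key)
        return true, pEntry
    }

    func main() {
        db := &peerDb{mp: map[string]map[string]peerEntry{
            "nid1": {"02:42:0a:00:00:02/10.0.0.2": {eid: "ep1", vtep: "192.168.1.10"}},
        }}
        if ok, entry := db.peerDbDelete("nid1", "02:42:0a:00:00:02/10.0.0.2"); ok {
            fmt.Println("deleted entry for", entry.eid)
        }
    }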

Contributor Author (mrjana):

Even though the logic in the PR may be a little hard to follow, it is absolutely necessary: even if the endpoint IDs do not match, we need to remove state from the sandbox neighbor cache (see the eid != pEntry.eid && !vtep.Equal(pEntry.vtep) case), state which might have existed from a previous neighbor add that never got removed because we got another add before the delete.

Reviewer:

OK, yeah, I see why we can't bail out on a false return from peerDbDelete. But shouldn't the if check also include eid != pEntry.eid && vtep.Equal(pEntry.vtep)?

Contributor Author (mrjana):

No, it should not. If we did, we would delete the vtep for the current valid entry, which is exactly what the eid match is there to prevent: a stale (out-of-order) delete must not actually delete anything. About the only thing a stale delete should do is remove a stale sandbox neighbor cache entry.
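To spell out the case analysis behind the check (eidMatch, vtepMatch, and the classify wrapper are illustrative names, not code from the PR):

    package main

    import "fmt"

    // classify enumerates the cases behind the check in the diff, where
    // eidMatch stands for eid == pEntry.eid and vtepMatch for
    // vtep.Equal(pEntry.vtep).
    func classify(eidMatch, vtepMatch bool) string {
        switch {
        case eidMatch && vtepMatch:
            return "genuine delete: remove kernel state and sandbox cache"
        case !eidMatch && !vtepMatch:
            return "stale out-of-order delete: purge sandbox neighbor cache only"
        case !eidMatch && vtepMatch:
            return "skip: the vtep now belongs to the current valid entry"
        default: // eidMatch && !vtepMatch
            return "skip: condition in the diff is false, nothing to do"
        }
    }

    func main() {
        for _, c := range [][2]bool{{true, true}, {false, false}, {false, true}, {true, false}} {
            fmt.Printf("eid match=%-5v vtep match=%-5v -> %s\n", c[0], c[1], classify(c[0], c[1]))
        }
    }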

Reviewer:

Ok, I was thinking of the case where the task gets rescheduled and placed on the same node. But in that case the neighbor entry is actually correct and not stale.

LGTM. Good that you caught this in the testing. It would have been impossible to debug in live setups.

Contributor Author (mrjana):

Yes.

@aboch (Contributor) commented Sep 7, 2016

LGTM

@sanimej commented Sep 16, 2016

@mrjana The change in #1403 was to avoid sending stale delete events from networkdb. Shouldn't that prevent the overlay driver from getting stale deletes? Is this change specifically to handle the case of an out-of-order delete when the same IP/MAC has been used for a different endpoint after a task reschedule?

@mrjana (Contributor, Author) commented Sep 16, 2016

Yeah, this is specifically for the same IP/MAC being used under a different key, and that happens more often than I would like, thanks to the (correct) default IPAM behavior of giving out the last released IP.

@sanimej commented Sep 16, 2016

It's probably many seconds between a task going down, that event getting processed by the manager, rescheduling/dispatching happening, the task starting to run on the new node, and a new table event being generated for it. It's a bit surprising that an out-of-order UDP packet can be received after such a delay. I am wondering whether this symptom was seen because of gossip retransmission of the delete event and won't be an issue after the #1403 fix?

@mrjana (Contributor, Author) commented Sep 16, 2016

This problem happens when someone creates and removes services in a loop at a rapid clip. It doesn't happen normally, and yes, I saw this issue while doing the back-to-back service create/rm test on top of the #1403 fix.

    peerDbWg.Wait()

    d.peerDb.Lock()
    pMap, ok := d.peerDb.mp[nid]
    if !ok {
        d.peerDb.Unlock()
-       return false
+       return peerEntry{}
    }
Reviewer:

@mrjana The key for peerEntry in peerNetworkMap.mp is just the peer IP and MAC:

        pKey := peerKey{
                peerIP:  peerIP,
                peerMac: peerMac,
        }

In the case of an out-of-order delete we will end up deleting the entry for the newly created peer. I think we need eid as part of the key, because it will change in this case where we get an add before the delete of the old ep?

Contributor Author (mrjana):

We won't delete it incorrectly. We do check for an eid match below, and that check is needed in the caller as well.

Reviewer:

Ah, sorry... it's right below.

@sanimej merged commit d2141cd into moby:master on Sep 16, 2016