Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix overlay vxlan races #2146

Merged
merged 2 commits into from Jul 11, 2018

Conversation

Projects
None yet
5 participants
@ctelfer
Copy link
Contributor

ctelfer commented May 9, 2018

This function takes the patch offered in #1800 and attempts to address the issues found in #1765. It removes a race condition documented in the issue where "re-once"ing the creation of an overlay sandbox can race with incoming join requests to cause a leak/collision with the vxlan interface. This PR addresses said issue by removing the "re-once" pattern and replacing it with a traditional mutex+boolean initializer pattern. This also allows removing some restore error handling by having the sandbox join/leave perform proper cleanup on error. More information is available in the commit logs of the PR.

@selansen

This comment has been minimized.

Copy link

selansen commented May 9, 2018

CI Failed

@codecov-io

This comment has been minimized.

Copy link

codecov-io commented May 9, 2018

Codecov Report

❗️ No coverage uploaded for pull request base (master@eb6b2a5). Click here to learn what that means.
The diff coverage is 0%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master    #2146   +/-   ##
=========================================
  Coverage          ?   40.48%           
=========================================
  Files             ?      139           
  Lines             ?    22491           
  Branches          ?        0           
=========================================
  Hits              ?     9105           
  Misses            ?    12048           
  Partials          ?     1338
Impacted Files Coverage Δ
drivers/overlay/ov_network.go 2.78% <0%> (ø)
drivers/overlay/peerdb.go 9.88% <0%> (ø)
osl/interface_linux.go 60.15% <0%> (ø)
drivers/overlay/joinleave.go 0% <0%> (ø)
drivers/overlay/overlay.go 28.63% <0%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update eb6b2a5...26212fe. Read the comment docs.

n.joinCnt++
}
func (n *network) joinSandbox(s *subnet, restore bool, incJoinCnt bool) error {
networkOnce.Do(networkOnceInit)

This comment has been minimized.

@fcrisciani

fcrisciani May 10, 2018

Member

if 2 joinSanbox are happening, the first one will trigger this networkOnceInit, while the second one will continue forward acquiring the lock. Is it ok for the second one to go ahead also before the first networkOnceInit is completed?
If I look at the code I guess this will create a potential race with the n.initSandbox(restore) that actually if restore is false will try to delete the VNI
Am I missing something?

This comment has been minimized.

@ctelfer

ctelfer May 10, 2018

Author Contributor

I checked on this very early on... once.Do() holds a mutex while doing the initialization. If two goroutines execute the same once.Do() concurrently, the second one will block until the first one completes.

})
return s.initErr
subnetErr := s.initErr
if !s.sboxInit {

This comment has been minimized.

@fcrisciani

fcrisciani May 10, 2018

Member

isn't this dead code?
if at line 304 was false, will be set to true
if it was already true then won't enter here anyway

This comment has been minimized.

@ctelfer

ctelfer May 10, 2018

Author Contributor

This is s.sboxInit not n.sboxInit ... So no, this is different. :)

This comment has been minimized.

@fcrisciani
// failure of vxlan device creation if the vni is assigned to some other
// network.
if deleteErr := deleteInterface(vxlanName); deleteErr != nil {
logrus.Warnf("could not delete vxlan interface, %s, error %v, after config error, %v\n", vxlanName, deleteErr, err)

This comment has been minimized.

@fcrisciani

fcrisciani May 10, 2018

Member

you can remove the trailing \n

This comment has been minimized.

@ctelfer

ctelfer May 10, 2018

Author Contributor

ah, yes, will do.

s.sboxInit = true
}
}
if subnetErr != nil {

This comment has been minimized.

@selansen

selansen May 10, 2018

We declare subnetErr inside if !s.sboxInit { . and again using it for outside the if scope. Can we declare outside of if and make the scope to function level.

This comment has been minimized.

@ctelfer

ctelfer May 10, 2018

Author Contributor

Good catch. That's definitely a bug. It is not supposed to be redeclared within the 'if'.

@ctelfer ctelfer force-pushed the ctelfer:fix-overlay-vxlan-races branch from 0026dc2 to 5b8a15a May 10, 2018

@ctelfer

This comment has been minimized.

Copy link
Contributor Author

ctelfer commented May 10, 2018

Pushed an update with a much simplified locking scheme. There is now only a single lock for each network and all its subnets. I originally put in the second lock because of a deadlock (which I had hit during testing before the initial PR) described in the commit log. This new version breaks the deadlock by posting the notification to initialize the peerDB on a join using a fresh goroutine. This prevents the 'join' from deadlocking waiting on the channel of the peerDB goroutine while the peerdb goroutine is waiting for the 'join' to release the network lock.

This version also addresses the comments by @fcrisciani (newlines) and @selansen (redeclaration of subnetErr).

@@ -296,29 +294,39 @@ func (d *driver) RevokeExternalConnectivity(nid, eid string) error {
return nil
}

func (n *network) incEndpointCount() {
func (n *network) joinSandbox(s *subnet, restore bool, incJoinCount bool) error {
networkOnce.Do(networkOnceInit)

This comment has been minimized.

@fcrisciani

fcrisciani May 15, 2018

Member

can you keep this comment:

	// If there is a race between two go routines here only one will win
	// the other will wait.
s.initErr = n.initSubnetSandbox(s, restore)
})
return s.initErr
subnetErr := s.initErr

This comment has been minimized.

@fcrisciani

fcrisciani May 15, 2018

Member

we can simply avoid using this subnetErr, and just leverage the s.initErr

This comment has been minimized.

@fcrisciani

fcrisciani May 15, 2018

Member

also for line 321

This comment has been minimized.

@ctelfer

ctelfer May 16, 2018

Author Contributor

Actually, there is a very specific reason why it isn't. We want to return:

  • s.initErr iff it was previously set
  • otherwise we want to return the result of n.initSubnetSandbox(s, restore) regardless of whether that value gets stored in s.initErr. So, on line 321 we do not want to just return s.initErr. It may not have been set if there was a failure but we are not in a restore case.

This is the semantics from the #1800 and I've preserved them. The logic basically implies that if an error is recoverable, return it so it is flagged appropriately, but to make it persistent. But if the error is not recoverable (restore case), make it persistent.

This comment has been minimized.

@fcrisciani

fcrisciani May 17, 2018

Member

do we expect that the error failure will be different? on multiple retry?

@@ -470,7 +477,7 @@ func (n *network) generateVxlanName(s *subnet) string {
id = n.id[:5]
}

return "vx-" + fmt.Sprintf("%06x", n.vxlanID(s)) + "-" + id
return "vx-" + fmt.Sprintf("%06x", s.vni) + "-" + id

This comment has been minimized.

@fcrisciani

fcrisciani May 15, 2018

Member

considering that we are touching this line, this will be more efficient:
fmt.Sprintf("vx-%06x-%v", s.vni,id)

@@ -483,7 +490,7 @@ func (n *network) generateBridgeName(s *subnet) string {
}

func (n *network) getBridgeNamePrefix(s *subnet) string {
return "ov-" + fmt.Sprintf("%06x", n.vxlanID(s))
return "ov-" + fmt.Sprintf("%06x", s.vni)

This comment has been minimized.

@fcrisciani

fcrisciani May 15, 2018

Member

same here just with a single Sprintf


n.setVxlanID(s, 0)
for _, vni := range vnis {
n.driver.vxlanIdm.Release(uint64(vni))

This comment has been minimized.

@selansen

selansen May 15, 2018

why do we need extra for loop instead of using previous logic where its all done inside one place. What are we trying to achieve by doing this ?

This comment has been minimized.

@ctelfer

ctelfer May 16, 2018

Author Contributor

Fair question. The vxlanIdm.Release() operation can do locking and write to store while other overlay networks are coming and going as well. So, I was thinking that this prevents holding the network lock through all those other potentially blocking operations. (i.e. collect the work list using the lock and then go through the process of releasing them) But since this function only gets called while the driver is holding the driver lock, that may be a superfluous optimization.

}

return vnis, nil
}

func (n *network) obtainVxlanID(s *subnet) error {
//return if the subnet already has a vxlan id assigned
if s.vni != 0 {
if n.vxlanID(s) != 0 {

This comment has been minimized.

@selansen

selansen May 15, 2018

Other places you replaced n.vxlanID(s) with s.vni or n.vni. Here we are using n.vxlanID(s) instead of directly accessing with s.vni . Trying to understand why this change is required ?

This comment has been minimized.

@ctelfer

ctelfer May 16, 2018

Author Contributor

Also fair. The revamped locking holds the network lock through network.join() and network.leave() operations. As such, calling n.vxlanID(s) within those methods (or their sub-functions) would be a double lock. That's the reasons for the changes above.

However, the changes above also remove the subnet locks and place subnet locking all within the domain of the network's lock. As such, network methods need to acquire the network lock before accessing any of the subnet fields. This driver calls obtainVxlanID() outside of the context of network.join() in several places. Hence accessing the subnet members requires doing so through the network lock.

The lack of locking in obtainVxlanId() was, in principle, a race condition before that just wasn't addressed.

@@ -1059,7 +1067,7 @@ func (n *network) obtainVxlanID(s *subnet) error {
return fmt.Errorf("getting network %q from datastore failed %v", n.id, err)
}

if s.vni == 0 {
if n.vxlanID(s) == 0 {

This comment has been minimized.

@selansen

selansen May 15, 2018

same as above

return s.vni
}

func (n *network) setVxlanID(s *subnet, vni uint32) {
n.Lock()
defer n.Unlock()

This comment has been minimized.

@fcrisciani

fcrisciani May 15, 2018

Member

the defer is just slower, in this case for a 3 lines methods, I would keep it as before

@fcrisciani
Copy link
Member

fcrisciani left a comment

few minor comments, rest looks good

@selansen

This comment has been minimized.

Copy link

selansen commented May 16, 2018

LGTM

@ctelfer ctelfer force-pushed the ctelfer:fix-overlay-vxlan-races branch from 5b8a15a to 26212fe May 16, 2018

@ctelfer

This comment has been minimized.

Copy link
Contributor Author

ctelfer commented May 16, 2018

Addressed the comment, defer and sprintf comments above and have re-pushed.

@fcrisciani
Copy link
Member

fcrisciani left a comment

LGTM

@fcrisciani

This comment has been minimized.

Copy link
Member

fcrisciani commented Jul 6, 2018

@ctelfer are we good with merging this?

Santhosh Manohar and others added some commits Jun 10, 2017

Cleanup interfaces properly when vxlan plumbling fails
Signed-off-by: Santhosh Manohar <santhosh@docker.com>
Signed-off-by: Chris Telfer <ctelfer@docker.com>
Refactor locking for join/leave to avoid race
Instead of using "sync.Once" to determine whether to initialize a
network sandbox or subnet sandbox, we use a traditional mutex +
initialization boolean.  This is because the initialization state isn't
truly a once-and-done condition.  Rather, libnetwork destroys network
and subnet sandboxes when the last endpoint leaves them.  The use of
sync.Once in this kind of scenario requires, therefore, re-initializing
the Once which is impoissible.  So the approach that libnetwork
currently takes is to use a pointer to a Once and redirect that pointer
to a new Once on reset.  This leads to nasty race conditions.

In addition to refactoring the locking, this patch merges the functions
joinSandbox(), and joinSubnetSandbox(). This makes the code both cleaner
and it also holds the network and subnet locks through the series of
read-modify-writes avoiding further potential races.  This does reduce
the potential parallelism which could be applied should there be many
joins coming in on many different subnets in the same overlay network.
However, this should be an extremely minor performance hit for a very
obscure case.

One important pattern in this commit is that it is crucial to avoid
sending peerDB messages while holding a driver or network lock.  The
changes herein defer such (asynchronous) notifications until after
release of such locks.  This prevents deadlocks where the peerDB
blocks acquiring said locks while the network method blocks trying
to send to the peerDB's channel.

Signed-off-by: Chris Telfer <ctelfer@docker.com>

@ctelfer ctelfer force-pushed the ctelfer:fix-overlay-vxlan-races branch from 26212fe to 5d113d1 Jul 10, 2018

@ctelfer

This comment has been minimized.

Copy link
Contributor Author

ctelfer commented Jul 10, 2018

Rebased to head of master, updated a few comments and re-tested a few times to make sure this looks good. Everything has been clean.

I'll admit that I'm still a bit nervous after #2143 / #2180. But from a rational standpoint, what this PR does is remove the unsafe re-onceing from the overlay code, make joins and leaves to the overlay sandboxes atomic, and merge in #1800 error handling code. It should be safe to merge.

@fcrisciani fcrisciani merged commit 206ed7f into docker:master Jul 11, 2018

6 checks passed

ci/circleci: build Your tests passed on CircleCI!
Details
ci/circleci: builder Your tests passed on CircleCI!
Details
ci/circleci: cross Your tests passed on CircleCI!
Details
ci/circleci: lint Your tests passed on CircleCI!
Details
ci/circleci: unit-tests Your tests passed on CircleCI!
Details
dco-signed All commits are signed

@ctelfer ctelfer deleted the ctelfer:fix-overlay-vxlan-races branch Jul 11, 2018

@m4r10k

This comment has been minimized.

Copy link

m4r10k commented Aug 30, 2018

@fcrisciani do you know which Docker version will include this merge (vendors this libnetwork hash)? We are running 18.03 and are regularly experiencing the issue that the vxlan file is already existing.

28686: vx-00101b-nmh1s: <BROADCAST,MULTICAST> mtu 1450 qdisc noop state DOWN mode DEFAULT group default 
    vxlan id 4123 srcport 0 0 dstport 4789 proxy l2miss l3miss ageing 300 addrgenmode eui64 
29077: vx-001017-sxegy: <BROADCAST,MULTICAST> mtu 1450 qdisc noop state DOWN mode DEFAULT group default 
    vxlan id 4119 srcport 0 0 dstport 4789 proxy l2miss l3miss ageing 300 addrgenmode eui64 
23209: vx-001006-tlzfj: <BROADCAST,MULTICAST> mtu 1450 qdisc noop state DOWN mode DEFAULT group default 
    vxlan id 4102 srcport 0 0 dstport 4789 proxy l2miss l3miss ageing 300 addrgenmode eui64 
29632: vx-001011-ox0iz: <BROADCAST,MULTICAST> mtu 1450 qdisc noop state DOWN mode DEFAULT group default 
    vxlan id 4113 srcport 0 0 dstport 4789 proxy l2miss l3miss ageing 300 addrgenmode eui64 
29658: vx-001028-jjw67: <BROADCAST,MULTICAST> mtu 1450 qdisc noop state DOWN mode DEFAULT group default 
    vxlan id 4136 srcport 0 0 dstport 4789 proxy l2miss l3miss ageing 300 addrgenmode eui64 
28399: vx-00103a-i7t6y: <BROADCAST,MULTICAST> mtu 1450 qdisc noop state DOWN mode DEFAULT group default 
    vxlan id 4154 srcport 0 0 dstport 4789 proxy l2miss l3miss ageing 300 addrgenmode eui64 
28911: vx-001020-ups39: <BROADCAST,MULTICAST> mtu 1450 qdisc noop state DOWN mode DEFAULT group default 
    vxlan id 4128 srcport 0 0 dstport 4789 proxy l2miss l3miss ageing 300 addrgenmode eui64 
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.